A comparative study on embedding models for keyword extraction using keyBERT method

Bayan Issa and Muhammed Basheer Jasser and Hui Na Chua and Muzaffar Hamzah (2023) A comparative study on embedding models for keyword extraction using keyBERT method. In: 2023 IEEE 13th International Conference on System Engineering and Technology (ICSET), 02-03 October 2023, Shah Alam, Malaysia.


Abstract

KeyBERT is a method for keyword/keyphrase extraction that consists of three steps. The first step selects candidate keywords from a text using the scikit-learn library; the second step embeds the text and its candidate keywords, using BERT to obtain numerical representations that capture their meanings; the third step calculates the cosine similarity between each candidate keyword vector and the document vector. In this paper, we focus on the second step of KeyBERT (the embedding step). Although KeyBERT supports many models for the embedding operation, there are no extensive previous comparative studies analyzing the effect of using the different supported models in KeyBERT. We introduce a comparative study of two commonly used groups of models: the first group comprises sentence-transformers pretrained models, supported via the sentence-transformers library, and the second group includes the Longformer model, supported via the Hugging Face Transformers library. We conduct the comparative study on benchmark datasets containing English text documents from multiple domains with different text lengths. Based on the study, we found that the paraphrase-mpnet-base-v2 model provides the best results among all models for keyword extraction in terms of effectiveness (F1-score, recall, precision, MAP) on all datasets, with higher efficiency (time) on short text than on long text; accordingly, we recommend using it in that context. On the other hand, the Longformer model is the most efficient/fastest in keyword extraction among all models on all datasets, and this superiority is especially evident on long text; accordingly, we recommend using it in that context.
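The third step described in the abstract (ranking candidates by cosine similarity to the document vector) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the 3-dimensional vectors and the candidate phrases below are made-up stand-ins for the BERT embeddings an actual KeyBERT pipeline would produce.

```python
import numpy as np

def cosine_rank(doc_vec, cand_vecs, candidates, top_n=3):
    """Rank candidate keywords by cosine similarity to the document vector."""
    # Normalize so that the dot product equals cosine similarity
    doc = doc_vec / np.linalg.norm(doc_vec)
    cands = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    sims = cands @ doc
    order = np.argsort(-sims)[:top_n]  # indices of most similar candidates first
    return [(candidates[i], float(sims[i])) for i in order]

# Toy 3-d "embeddings" standing in for real BERT vectors (illustrative only)
doc_vec = np.array([1.0, 0.5, 0.0])
cand_vecs = np.array([
    [0.9, 0.4, 0.1],   # points roughly toward the document vector
    [0.0, 0.1, 1.0],   # nearly orthogonal to the document vector
    [1.0, 0.6, 0.0],   # almost parallel to the document vector
])
candidates = ["embedding models", "cooking recipes", "keyword extraction"]

ranked = cosine_rank(doc_vec, cand_vecs, candidates)
```

In an actual KeyBERT run the embeddings would come from one of the models compared in the paper (e.g. paraphrase-mpnet-base-v2 via sentence-transformers, or Longformer via Hugging Face Transformers); only the embedding step changes, while this cosine-ranking step stays the same.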

Item Type: Conference or Workshop Item (Other)
Keyword: Keyword extraction, NLP, Pretrained model, Embedding models
Subjects: Q Science > QA Mathematics > QA1-939 Mathematics > QA71-90 Instruments and machines > QA75.5-76.95 Electronic computers. Computer science > QA76.75-76.765 Computer software
T Technology > TA Engineering (General). Civil engineering (General) > TA1-2040 Engineering (General). Civil engineering (General) > TA1501-1820 Applied optics. Photonics
Department: FACULTY > Faculty of Engineering
Depositing User: JUNAINE JASNI -
Date Deposited: 08 Aug 2025 15:19
Last Modified: 08 Aug 2025 15:19
URI: https://eprints.ums.edu.my/id/eprint/44760
