Enrichment of BOW representation with syntactic and semantic background knowledge

Rayner Alfred and Patricia Anthony and Suraya Alias and Asni Tahir and Kim, On Chin and Lau, Hui Keng (2013) Enrichment of BOW representation with syntactic and semantic background knowledge. Communications in Computer and Information Science, 24. pp. 283-292. ISSN 1865-0929

Official URL: http://link.springer.com/chapter/10.1007%2F978-3-6...

Abstract

The basic Bag of Words (BOW) representation, which is generally used in text document clustering or categorization, loses important syntactic and semantic information contained in the documents. This may be particularly problematic when the text documents contain many stop words or when they are short. In this paper, we study the contribution of incorporating syntactic features and semantic knowledge into the representation when clustering a text corpus. We investigate the quality of the clusters produced when syntactic and semantic information is incorporated into the representation of text documents by analyzing the internal structure of the clusters using the Davies-Bouldin Index (DBI). This paper studies and compares the quality of the clusters produced when four different text representations are used to cluster a text corpus. These text representations are the standard BOW representation, the standard BOW representation integrated with syntactic features, the standard BOW representation integrated with semantic background knowledge, and finally the standard BOW representation integrated with both syntactic features and semantic background knowledge. The experimental results show that the quality of the clusters produced is improved by integrating semantic and syntactic information into the standard bag of words representation of a text corpus.
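
As an illustration of the evaluation idea described in the abstract, the following is a minimal sketch of how a plain BOW representation can be built and how a given clustering can be scored with the Davies-Bouldin Index, where lower values indicate more compact and better separated clusters. It is not the authors' implementation: the abstract does not state the clustering algorithm, term weighting, or the syntactic and semantic features used, so the whitespace tokenization, raw term counts, Euclidean distance, and the function names `bow_vectors` and `davies_bouldin` are all assumptions made for this example.

```python
import math
from collections import Counter


def bow_vectors(docs):
    """Build raw term-count vectors over a shared, sorted vocabulary.

    Assumption: lowercase + whitespace tokenization, no stop-word removal.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([float(counts[term]) for term in vocab])
    return vocab, vectors


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def davies_bouldin(vectors, labels):
    """Davies-Bouldin Index for a labelled set of vectors.

    For each cluster i, take the worst (largest) ratio
    (S_i + S_j) / d(c_i, c_j) over all other clusters j, where S is the
    mean distance of members to their centroid c. The index is the
    average of these worst-case ratios.
    """
    clusters = {}
    for vec, lab in zip(vectors, labels):
        clusters.setdefault(lab, []).append(vec)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    ids = sorted(clusters)
    cents = {i: centroid(clusters[i]) for i in ids}
    scatter = {
        i: sum(euclidean(p, cents[i]) for p in clusters[i]) / len(clusters[i])
        for i in ids
    }

    total = 0.0
    for i in ids:
        # Note: raises ZeroDivisionError if two centroids coincide.
        total += max(
            (scatter[i] + scatter[j]) / euclidean(cents[i], cents[j])
            for j in ids
            if j != i
        )
    return total / len(ids)


if __name__ == "__main__":
    docs = ["cats chase mice", "cats chase birds",
            "stocks fall sharply", "stocks rise sharply"]
    _, vecs = bow_vectors(docs)
    # Hypothetical cluster assignment; any clustering algorithm could supply it.
    print(davies_bouldin(vecs, [0, 0, 1, 1]))
```

Under these assumptions, comparing the four representations in the abstract would amount to swapping the output of `bow_vectors` for an enriched feature vector, clustering each variant, and comparing the resulting index values.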

Item Type: Article
Uncontrolled Keywords: clustering, bag of words, syntactic features, semantic background knowledge, automatic text categorization, knowledge management
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: FACULTY > Faculty of Computing and Informatics
ID Code: 12243
Deposited By: IR Admin
Deposited On: 13 Nov 2015 10:58
Last Modified: 13 Nov 2015 11:05
