An ensemble data summarization approach based on feature transformation to learning relational data

Chung, Seng Kheau (2015) An ensemble data summarization approach based on feature transformation to learning relational data. Doctoral thesis, Universiti Malaysia Sabah.

Abstract

DARA is a framework designed to summarize data stored in a multi-relational database in which non-target tables are associated with a target table. To summarize the data, the records stored in the multi-relational database must first be transformed into a Term Frequency - Inverse Document Frequency (TF-IDF) vector space. Because the size of the TF-IDF vector space depends directly on the number of unique terms found in the data stored in the target tables, increasing the number of unique terms also increases the clustering complexity and can produce less accurate clustering results. In this thesis, a Feature Selection algorithm is investigated and proposed to optimize the TF-IDF vector space by selecting only the relevant features from the initial TF-IDF vector space. In addition, a Feature Construction algorithm is investigated and proposed to optimize the TF-IDF vector space by merging two or more features that best represent the datasets. Information Gain, borrowed from Information Retrieval theory, and a Term-term Correlation algorithm are used to determine the relevance of the features to be selected or merged in order to form a new generation of the TF-IDF vector space. Consequently, the size of the TF-IDF vector space is reduced. This indirectly reduces the complexity of the TF-IDF vector space, making the clustering more efficient while aiming to maintain or improve the accuracy of its results. A genetic algorithm (GA) is also used to find the best centroids for all the generated clusters. An ensemble clustering is designed, used and evaluated to produce the final classification framework, which takes as input the results of the GA-based clustering with the Feature Selection and Feature Construction algorithms and performs the classification task for the relational datasets.
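As a rough illustration of the first two steps described above, the following sketch builds a TF-IDF vector space and keeps only the terms with the highest Information Gain. It is not the thesis implementation: the bag-of-terms input format, the use of class labels for scoring, and the function names `tf_idf`, `information_gain` and `select_features` are all assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf(bags):
    """Build a TF-IDF matrix from bags of terms (one bag per record).

    Assumption: each record has already been encoded as a list of terms.
    """
    n = len(bags)
    # Document frequency: number of bags in which each term appears.
    df = Counter(term for bag in bags for term in set(bag))
    vocab = sorted(df)
    vectors = []
    for bag in bags:
        tf = Counter(bag)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(bags, labels, term):
    """Reduction in label entropy from knowing whether `term` is present."""
    present = [y for bag, y in zip(bags, labels) if term in bag]
    absent = [y for bag, y in zip(bags, labels) if term not in bag]
    n = len(labels)
    conditional = sum(
        len(part) / n * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - conditional


def select_features(bags, labels, k):
    """Keep the k terms with the highest Information Gain (hypothetical helper)."""
    vocab, vectors = tf_idf(bags)
    ranked = sorted(
        range(len(vocab)),
        key=lambda i: information_gain(bags, labels, vocab[i]),
        reverse=True,
    )
    keep = sorted(ranked[:k])
    reduced = [[row[i] for i in keep] for row in vectors]
    return [vocab[i] for i in keep], reduced
```

For example, `select_features([["a", "x"], ["a", "y"], ["b", "x"], ["b", "y"]], [1, 1, 0, 0], k=2)` keeps the terms `a` and `b`, because only those two separate the classes. The reduced matrix has two columns instead of four, which is the kind of shrinkage the abstract refers to.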
Several experiments have been conducted to evaluate the predictive performance of a classification task (C4.5 classifier) when using these clustering results on several relational datasets from the mutagenesis, financial and hepatitis databases. The experimental results show some improvement in predictive accuracy when the clustering results are used. Further improvements are shown when a GA is applied to the whole framework of the classification task, using the WEKA C4.5 classifier and taking the predictive accuracy as the fitness function. The experimental results for the ensemble clustering indicate that the consensus function works correctly. This study shows that optimizing the TF-IDF vector space by reducing its number of features increases the efficiency of the clustering task and produces cluster results with better accuracy. A better cluster result can also be produced by combining the cluster results generated from the GA-based clustering with the Feature Selection and Feature Construction algorithms.
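The abstract does not say which consensus function the ensemble clustering uses. The sketch below assumes one common choice, a co-association matrix: two records end up in the same final cluster when they are linked by pairs that share a cluster in more than half of the input partitions. The function names and the `threshold` parameter are hypothetical and only illustrate how several clustering results can be combined.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of records shares a cluster."""
    n = len(partitions[0])
    m = len(partitions)
    return [
        [sum(p[i] == p[j] for p in partitions) / m for j in range(n)]
        for i in range(n)
    ]


def consensus(partitions, threshold=0.5):
    """Combine partitions by linking records whose co-association exceeds threshold.

    Assumption: final clusters are the connected components of the graph
    whose edges are the pairs with co-association above `threshold`.
    """
    co = co_association(partitions)
    n = len(co)
    labels = [-1] * n  # -1 marks a record not yet assigned to a cluster
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        # Depth-first walk over records strongly linked to the current cluster.
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels
```

With the three input partitions `[0, 0, 1, 1]`, `[1, 1, 0, 0]` and `[0, 0, 0, 1]`, `consensus` groups the first two records together and the last two together, even though the partitions use different label numbers and disagree about the third record.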

Item Type: Thesis (Doctoral)
Keyword: framework, Term Frequency, Inverse Document Frequency, Feature Selection algorithm
Subjects: QA76
Department: FACULTY > Faculty of Computing and Informatics
Depositing User: MUNIRA BINTI MARASAN
Date Deposited: 24 Oct 2017 11:39
Last Modified: 24 Oct 2017 11:39
URI: https://eprints.ums.edu.my/id/eprint/10223
