A multi-objectives genetic algorithm clustering ensembles based approach to summarize relational data

Gabriel, Jong Chiye (2015) A multi-objectives genetic algorithm clustering ensembles based approach to summarize relational data. Masters thesis, Universiti Malaysia Sabah.




The k-means algorithm is a well-known clustering algorithm that converges to a local optimum in a few iterations. However, traditional k-means is designed to cluster data held in a single target table, whereas much of the data collected in real-life applications is stored in relational databases. Traditional clustering and classification learning algorithms cannot be applied directly to multi-relational databases. Several approaches have been proposed for learning from relational data, including Inductive Logic Programming based approaches, Graph based approaches, Multi-View approaches and the Dynamic Aggregation of Relational Attributes approach. The Dynamic Aggregation of Relational Attributes approach is very effective in learning from relational datasets. It summarizes relational data by clustering the records that exist in non-target tables. However, the quality of the summarization depends heavily on the positions of the initial centroids selected, which may in turn affect the overall classification task. This project therefore proposes a Genetic Algorithm-based Clustering Ensemble for learning from relational datasets. It combines the results obtained from several k-means clustering runs with different numbers of clusters, in which the locations of the centroids are optimal for every set of clusters. The effects of using different similarity measures and of applying different fitness functions for the genetic algorithm on the predictive accuracy of the classifiers are also studied. Based on the results obtained, it can be concluded that using the consensus of several clustering results can increase the predictive accuracy of the classification task.
It can also be concluded that Euclidean distance performs better on the mutagenesis datasets and cosine similarity performs better on the hepatitis datasets when evaluated with the Weka C4.5 classifier, whereas the opposite holds when the Naïve Bayes classifier is used for evaluation.
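
The idea of combining several k-means runs with different numbers of clusters can be illustrated with a minimal sketch. It is not the thesis's method: the genetic algorithm and its fitness functions are omitted, and the consensus step uses a simple co-association (co-occurrence voting) matrix as a stand-in. All function names (`kmeans`, `co_association`, `consensus`), the `threshold` parameter and the use of arithmetic-mean centroids with cosine distance are assumptions made for illustration only.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def kmeans(points, k, dist, rng, iters=50):
    """Plain Lloyd-style k-means with randomly chosen initial centroids."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        # Assign every point to its nearest centroid under `dist`.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Move each non-empty cluster's centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels


def co_association(points, ks, dist, seed=0):
    """Fraction of runs (one per k in `ks`) in which each pair co-clusters."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for k in ks:
        labels = kmeans(points, k, dist, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(ks) for c in row] for row in counts]


def consensus(co, threshold=0.5):
    """Group points linked by co-association above `threshold`
    (connected components of the thresholded matrix)."""
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```

Calling `co_association(points, ks=[2, 3, 4], dist=euclidean)` and passing the result to `consensus` gives one consensus labelling per dataset. Swapping in `dist=cosine_distance` gives a rough way to compare the two similarity measures discussed above, although the outcome depends on the random initial centroids, which is the sensitivity the thesis sets out to address.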

Item Type: Thesis (Masters)
Uncontrolled Keywords: K-means, clustering algorithm, Inductive Logic Programming approach, Dynamic Aggregation of Relational Attributes approach, Multi-View approaches, Graph based approaches, Euclidean distance
Subjects: Q Science > QA Mathematics
Divisions: FACULTY > Faculty of Computing and Informatics
Depositing User: ADMIN ADMIN
Date Deposited: 30 Oct 2015 11:20
Last Modified: 07 Nov 2017 15:31
URI: http://eprints.ums.edu.my/id/eprint/12105
