Thursday, December 8, 2011

Interval ckMeans: An Algorithm for Clustering Symbolic Data (Domain: Knowledge and Data Engineering)


ABSTRACT:

Clustering is the process of organizing a collection of patterns into groups based on their similarities. Fuzzy clustering techniques aim to find groups to which every object in the database belongs with some membership degree. This paper presents a new algorithm for clustering symbolic data, based on the ckMeans algorithm. The new algorithm allows both the input data and the membership degrees to be intervals. To validate the proposal, it is compared with two other algorithms on the same database.
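
To make the idea of a membership degree concrete, the sketch below computes the classical fuzzy c-means membership of a single point in each cluster from its distances to the cluster centers. It is an illustration only: it uses ordinary point values, not the interval-valued data and memberships of the paper, and it does not reproduce the ckMeans update rule. The class name FuzzyMembership and the fuzzifier value m = 2 are assumptions chosen for the example.

```java
import java.util.Arrays;

final class FuzzyMembership {
    // Classical fuzzy c-means membership of one point in each of c clusters.
    // distances[i] is the distance from the point to center i; m > 1 is the fuzzifier.
    // Hypothetical helper for illustration, not the interval version from the paper.
    static double[] memberships(double[] distances, double m) {
        int c = distances.length;
        double[] u = new double[c];
        // A point that coincides with a center belongs fully to that cluster.
        for (int i = 0; i < c; i++) {
            if (distances[i] == 0.0) {
                u[i] = 1.0;
                return u;
            }
        }
        double exponent = 2.0 / (m - 1.0);
        for (int i = 0; i < c; i++) {
            double sum = 0.0;
            for (int k = 0; k < c; k++) {
                sum += Math.pow(distances[i] / distances[k], exponent);
            }
            u[i] = 1.0 / sum; // the memberships of one point sum to 1
        }
        return u;
    }

    public static void main(String[] args) {
        // A point at distance 1 from the first center and 3 from the second.
        System.out.println(Arrays.toString(memberships(new double[] {1.0, 3.0}, 2.0)));
    }
}
```

With distances 1 and 3 and m = 2, the point receives a membership of about 0.9 in the nearer cluster and about 0.1 in the farther one.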

EXISTING SYSTEM:

  • Dynamic clustering, when applied to a large collection such as a set of web pages, yields better clusters, but it requires additional computation, which increases the time complexity.

  • When dynamic document clustering is adopted in real-world applications, it may not always yield the desired output. In addition, a dynamic algorithm behaves like a static algorithm during the initial clustering.


PROPOSED SYSTEM:

Our objective is an approach to dynamic document clustering based on the structured MARDL technique. First, the documents are clustered statically using the bisecting K-means algorithm. Before bisecting K-means can be applied, all documents must be preprocessed. Preprocessing consists of stop word removal and stemming. Stop word removal discards words that have a negative influence on clustering, such as adverbs and conjunctions. Stemming finds the root of each word by removing its prefixes and suffixes.
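
The preprocessing stage can be sketched roughly as follows. The stop list here is a tiny hand-picked sample, and the stem method is a naive suffix stripper written for this example. It is not the Porter algorithm and, unlike the description above, it does not remove prefixes. The class name Preprocessor and all of its members are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class Preprocessor {
    // Tiny illustrative stop list; a real system would load a much larger one.
    static final Set<String> STOP_WORDS =
        Set.of("a", "an", "and", "are", "but", "is", "of", "or", "the", "very");

    // Naive suffix stripping for illustration only (not the Porter stemmer).
    static String stem(String word) {
        String[] suffixes = {"ing", "ed", "es", "s"};
        for (String suffix : suffixes) {
            // Keep at least three characters so short words are left alone.
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Lowercases the text, splits on non-letters, drops stop words, stems the rest.
    static List<String> preprocess(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            terms.add(stem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints [document, cluster, group]
        System.out.println(preprocess("The documents are clustered and grouped"));
    }
}
```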

After preprocessing, the documents are grouped into the desired number of clusters using the bisecting K-means method. In this method, each document is assigned weights by the term frequency and inverse document frequency (TF-IDF) method, and documents are compared using the cosine similarity measure. Once the weights are assigned, the documents are first separated into clusters using the K-means method. The largest cluster is then split into two sub-clusters, and this step is repeated until the resulting clusters have high internal similarity.
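
The weighting and comparison step can be illustrated with a minimal sketch. It assumes one common TF-IDF variant (raw term count multiplied by the natural logarithm of N/df), which may differ from the exact weighting used in the project. The class name TfIdf and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

final class TfIdf {
    // Builds one sparse TF-IDF vector (term -> weight) per document.
    // Assumed variant: weight = raw term count * ln(N / document frequency).
    static List<Map<String, Double>> vectors(List<List<String>> docs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) { // count each term once per document
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Double> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1.0, Double::sum);
            }
            Map<String, Double> vector = new HashMap<>();
            for (Map.Entry<String, Double> e : counts.entrySet()) {
                // A term that occurs in every document gets weight 0.
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                vector.put(e.getKey(), e.getValue() * idf);
            }
            result.add(vector);
        }
        return result;
    }

    // Cosine similarity of two sparse vectors; 0 if either vector has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("cluster", "data"),
            List.of("cluster", "data"),
            List.of("web", "page"));
        List<Map<String, Double>> v = vectors(docs);
        System.out.println(cosine(v.get(0), v.get(1))); // identical documents, close to 1
        System.out.println(cosine(v.get(0), v.get(2))); // no shared terms, 0
    }
}
```

Documents with the same terms score close to 1 and documents with no terms in common score 0, which is the property the clustering step relies on.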
The overall process is explained in the diagram below.
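
As a complement, the repeated splitting of the largest cluster can be sketched in code. This is the textbook form of bisecting K-means under several simplifying assumptions: it starts from a single cluster holding all documents, it stops once k clusters exist instead of testing cluster similarity, it uses squared Euclidean distance on dense vectors in place of cosine similarity, and it seeds each split deterministically with the first point and the point farthest from it. The class name BisectingKMeans and the iteration budget are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class BisectingKMeans {
    static final int ITERATIONS = 10; // fixed 2-means iteration budget (assumption)

    static double dist2(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    static double[] mean(List<double[]> points) {
        double[] m = new double[points.get(0).length];
        for (double[] p : points) {
            for (int i = 0; i < m.length; i++) m[i] += p[i];
        }
        for (int i = 0; i < m.length; i++) m[i] /= points.size();
        return m;
    }

    // Splits one cluster in two with plain 2-means.
    static List<List<double[]>> bisect(List<double[]> points) {
        double[] c0 = points.get(0);
        double[] c1 = points.get(0);
        for (double[] p : points) { // seed c1 with the point farthest from c0
            if (dist2(p, c0) > dist2(c1, c0)) c1 = p;
        }
        List<double[]> left = new ArrayList<>();
        List<double[]> right = new ArrayList<>();
        for (int it = 0; it < ITERATIONS; it++) {
            left = new ArrayList<>();
            right = new ArrayList<>();
            for (double[] p : points) {
                if (dist2(p, c0) <= dist2(p, c1)) left.add(p); else right.add(p);
            }
            if (left.isEmpty() || right.isEmpty()) break; // cannot split further
            c0 = mean(left);
            c1 = mean(right);
        }
        return List.of(left, right);
    }

    // Repeatedly splits the largest cluster until k clusters exist.
    static List<List<double[]>> cluster(List<double[]> points, int k) {
        List<List<double[]>> clusters = new ArrayList<>();
        clusters.add(new ArrayList<>(points));
        while (clusters.size() < k) {
            int largest = 0;
            for (int i = 1; i < clusters.size(); i++) {
                if (clusters.get(i).size() > clusters.get(largest).size()) largest = i;
            }
            if (clusters.get(largest).size() < 2) break;
            List<List<double[]>> halves = bisect(clusters.remove(largest));
            for (List<double[]> half : halves) {
                if (!half.isEmpty()) clusters.add(half);
            }
            // Stop if the split failed (for example, all points identical).
            if (halves.get(0).isEmpty() || halves.get(1).isEmpty()) break;
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<double[]> points = List.of(
            new double[] {0, 0}, new double[] {0, 1}, new double[] {1, 0},
            new double[] {10, 10}, new double[] {10, 11}, new double[] {11, 10});
        for (List<double[]> c : cluster(points, 2)) {
            System.out.println("cluster of size " + c.size());
        }
    }
}
```

On the six sample points, which form two well-separated groups, a request for two clusters yields two clusters of three points each.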

HARDWARE REQUIREMENTS:

     SYSTEM       : Pentium IV, 2.4 GHz
     HARD DISK    : 40 GB
     MONITOR      : 15" VGA colour
     MOUSE        : Logitech
     RAM          : 256 MB
     KEYBOARD     : 110 keys, enhanced

SOFTWARE REQUIREMENTS:

     OPERATING SYSTEM : Windows XP Professional
     FRONT END        : Java
     TOOL             : NetBeans IDE

REFERENCE:
Rogerio R. de Vargas, Benjamin R. C. Bedregal, “Interval ckMeans: An Algorithm for Clustering Symbolic Data”, IEEE Ref.: 978-1-61284-968-3/11. IEEE Conference 2011.
