Comparison of Different Distance Measure Methods in Text Document Clustering
Abstract
Clustering text documents is an unsupervised learning method for finding groups of related documents. It is a central problem in text mining when the training documents are unlabeled. Many features and methods have been proposed to address it. The framework for text document clustering consists of four stages: text document input, preprocessing, feature extraction, and clustering. Common clustering methods include the self-organizing map, k-means, and mixtures of Gaussians. How well the resulting clusters cohere depends on the distance measure selected. This paper presents different existing distance measures used with k-means for text document clustering. In the experiments, k-means clustering was performed on the Newsgroups dataset, and clustering entropy was measured to evaluate the different distance measures.
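The evaluation idea in the abstract (run k-means under several distance measures, then compare clustering entropy) can be sketched in a few lines of pure Python. This is only an illustrative sketch, not the paper's implementation: the toy term-count vectors, the function names, and the farthest-first initialization are assumptions made here. The entropy used is the common size-weighted average of per-cluster label entropy, which the abstract does not spell out. Averaging vectors as the centroid update under cosine distance is a heuristic rather than true spherical k-means.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def kmeans(vectors, k, distance, max_iters=50):
    """k-means with a pluggable distance function (hypothetical helper).

    Initialization is deterministic farthest-first (an assumption of this
    sketch): start from the first vector, then repeatedly add the vector
    farthest from the centroids chosen so far.
    """
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(distance(v, c) for c in centroids))
        centroids.append(list(far))

    assignment = None
    for _ in range(max_iters):
        # Assign each vector to its nearest centroid under the chosen distance.
        new_assignment = [
            min(range(k), key=lambda c: distance(v, centroids[c])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no vector changed cluster
        assignment = new_assignment
        # Recompute each centroid as the mean of its members (kept if empty).
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def clustering_entropy(assignment, labels):
    """Size-weighted average of the label entropy (in bits) of each cluster."""
    n = len(labels)
    total = 0.0
    for c in set(assignment):
        members = [lab for a, lab in zip(assignment, labels) if a == c]
        counts = Counter(members)
        e = -sum((m / len(members)) * math.log2(m / len(members))
                 for m in counts.values())
        total += len(members) / n * e
    return total


# Toy term-count vectors for six documents from two made-up topics.
docs = [[3, 2, 0, 0], [2, 3, 0, 0], [4, 1, 0, 0],
        [0, 0, 3, 2], [0, 0, 2, 3], [0, 0, 1, 4]]
labels = ["sci", "sci", "sci", "rec", "rec", "rec"]

for name, dist in [("euclidean", euclidean), ("cosine", cosine_distance)]:
    clusters = kmeans(docs, 2, dist)
    print(name, clusters, clustering_entropy(clusters, labels))
```

Lower entropy means purer clusters with respect to the known topic labels. On this toy data both distance measures separate the two topics cleanly, so the comparison only becomes informative on real, overlapping vocabularies such as those in a newsgroup corpus.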
This work is licensed under a Creative Commons Attribution 4.0 International License.