Comparison of Different Distance Measure Methods in Text Document Clustering

  • Yin Min Tun, Faculty of Computer Sciences, University of Computer Studies Mandalay, Myanmar

Abstract

Text document clustering is an unsupervised learning method for finding common groups of documents. It is a particular challenge in text mining when the training documents are unlabeled. Fortunately, many features and methods have been proposed to address this problem. The framework of text document clustering consists of input text documents, preprocessing, feature extraction, and clustering. Common clustering methods include the self-organizing map, k-means, and mixture of Gaussians. The correlation within the resulting clusters depends on the distance measure selected. The main focus of this paper is to present different existing distance measures used together with k-means for text document clustering. The experiment performs k-means clustering on the Newsgroups dataset and measures clustering entropy to evaluate the different distance measures.
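
As an illustration of the pipeline described in the abstract, the following is a minimal sketch, not the paper's implementation. It assumes documents have already been preprocessed into term-count vectors, and it assumes Euclidean and cosine distance as two example measures (the abstract does not name the measures compared). The function names, the toy data, and the deterministic choice of the first k vectors as initial centroids are all hypothetical. Clustering entropy is computed as the size-weighted average of the class entropy inside each cluster, which is a common definition and may differ from the one used in the paper.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def kmeans(vectors, k, distance, iterations=20):
    """Lloyd-style k-means with a pluggable distance function.

    Assumption: the first k vectors serve as initial centroids so the
    sketch is deterministic. Centroids are plain means even when the
    distance is not Euclidean, which is a simplification.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid under the chosen distance.
        labels = [
            min(range(k), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: mean of the members of each non-empty cluster.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def clustering_entropy(labels, classes):
    """Size-weighted average of per-cluster class entropy (lower is better)."""
    n = len(labels)
    total = 0.0
    for c in set(labels):
        member_classes = [cls for lab, cls in zip(labels, classes) if lab == c]
        size = len(member_classes)
        counts = Counter(member_classes)
        h = -sum((m / size) * math.log2(m / size) for m in counts.values())
        total += (size / n) * h
    return total


# Hypothetical toy term-count vectors with known class labels.
vectors = [[3, 0, 1], [0, 3, 1], [2, 0, 0], [0, 2, 0], [4, 1, 0], [1, 4, 0]]
classes = ["a", "b", "a", "b", "a", "b"]

for name, dist in [("euclidean", euclidean), ("cosine", cosine_distance)]:
    labels = kmeans(vectors, k=2, distance=dist)
    print(name, labels, clustering_entropy(labels, classes))
```

Passing the distance as a function keeps the clustering loop unchanged while measures are swapped, so the entropy values for different measures can be compared on the same data.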

Published
2018-08-07
How to Cite
TUN, Yin Min. Comparison of Different Distance Measure Methods in Text Document Clustering. International Journal of Research and Engineering, [S.l.], v. 5, n. 7, p. 445-449, aug. 2018. ISSN 2348-7860. Available at: <https://digital.ijre.org/index.php/int_j_res_eng/article/view/347>. Date accessed: 24 aug. 2019. doi: https://doi.org/10.21276/ijre.2018.5.7.2.