A Comparative Study on Feature Weighting in Thai Document Categorization
Text categorization is the process of automatically assigning predefined categories to free-text documents. Feature weighting, which calculates feature (term) values in documents, is one of the important preprocessing techniques in text categorization. This paper is a comparative study of feature weighting methods in statistical learning for a Thai document categorization framework. Six methods were evaluated: Boolean, tf, tf$\times $idf, tfc, ltc, and entropy weighting. We evaluated these methods on a Thai news article corpus with three supervised learning classifiers: Support Vector Machine (SVM), Decision Tree (DT), and Naïve Bayes (NB). We found that the ltc weighting method is the most effective in our experiments with the SVM and DT algorithms, while entropy and Boolean weighting are more effective with the NB algorithm. Using ltc weighting with an SVM classifier yielded a very high classification performance, with an F1 measure of 96\%.
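To make the compared schemes concrete, the following is a minimal sketch of the ltc weighting mentioned above, using a common textbook formulation (logarithmic term frequency times idf, cosine-normalized per document); the paper's exact formula and tokenization are not given in the abstract and may differ.

```python
import math
from collections import Counter

def ltc_weights(docs):
    """Compute ltc weights for each tokenized document.

    ltc (a common formulation): w = (1 + log tf) * log(N / df),
    then cosine-normalize each document vector.
    """
    N = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        w = {t: (1 + math.log(c)) * math.log(N / df[t]) for t, c in tf.items()}
        # Cosine normalization (guard against all-zero vectors).
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        weighted.append({t: v / norm for t, v in w.items()})
    return weighted

# Hypothetical toy corpus (pre-tokenized) for illustration only.
docs = [["economy", "bank", "bank"], ["economy", "sport"]]
weights = ltc_weights(docs)
```

A term appearing in every document gets idf 0 and thus weight 0, so only discriminative terms contribute to the normalized vectors fed to the SVM, DT, or NB classifier.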