Gesellschaft für Informatik e.V.

Lecture Notes in Informatics


10th International Conference on Innovative Internet Community Systems (I2CS) - Jubilee Edition 2010 - P-165, 257-266 (2010).

Gesellschaft für Informatik, Bonn
2010


Copyright © Gesellschaft für Informatik, Bonn


A comparative study on feature weight in Thai document categorization

Nivet Chirawichitchai, Parinya Sa-Nguansat and Phayung Meesad

Abstract


Text categorization is the process of automatically assigning predefined categories to free-text documents. Feature weighting, which calculates feature (term) values in documents, is one of the important preprocessing techniques in text categorization. This paper presents a comparative study of feature weighting methods for statistical learning in a Thai document categorization framework. Six methods were evaluated: Boolean, tf, tf×idf, tfc, ltc, and entropy weighting. We evaluated these methods on a Thai news article corpus with three supervised learning classifiers: Support Vector Machine (SVM), Decision Tree (DT), and Naïve Bayes (NB). We found that ltc weighting was the most effective in our experiments with the SVM and DT algorithms, while entropy and Boolean weighting were more effective than the other weighting methods with the NB algorithm. Using ltc weighting with an SVM classifier yielded very high classification performance, with an F1 measure of 96%.
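
For readers unfamiliar with the six weighting schemes named in the abstract, the following is a minimal sketch of how they are commonly defined in the text categorization literature. The abstract does not give the paper's exact formulas, so everything here is an assumption for illustration: the function name `weight_documents`, the scheme labels, the natural logarithm, and the use of total collection frequency in the entropy factor. Documents are assumed to be already tokenized into lists of terms.

```python
import math
from collections import Counter


def weight_documents(docs, scheme):
    """Return one {term: weight} dict per tokenized document.

    Hypothetical helper: the formulas are common textbook definitions,
    not necessarily the exact variants used in the paper.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]           # raw term frequencies
    doc_freq = Counter(t for c in counts for t in c)  # documents containing each term
    coll_freq = Counter()                             # total occurrences of each term
    for c in counts:
        coll_freq.update(c)

    def idf(term):
        # Inverse document frequency: log(N / n_i)
        return math.log(n_docs / doc_freq[term])

    def entropy_factor(term):
        # 1 + (1 / log N) * sum_j p_j * log(p_j), with p_j = f_j / collection frequency
        if n_docs < 2:
            return 1.0
        total = sum(
            (c[term] / coll_freq[term]) * math.log(c[term] / coll_freq[term])
            for c in counts
            if c[term] > 0
        )
        return 1.0 + total / math.log(n_docs)

    def l2_normalize(vec):
        # Scale the vector to unit Euclidean length (left as zeros if the norm is zero)
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: (w / norm if norm else 0.0) for t, w in vec.items()}

    weighted = []
    for c in counts:
        if scheme == "boolean":    # 1 if the term occurs in the document
            vec = {t: 1.0 for t in c}
        elif scheme == "tf":       # raw term frequency
            vec = {t: float(f) for t, f in c.items()}
        elif scheme == "tfidf":    # tf x idf
            vec = {t: f * idf(t) for t, f in c.items()}
        elif scheme == "tfc":      # tf x idf with length normalization
            vec = l2_normalize({t: f * idf(t) for t, f in c.items()})
        elif scheme == "ltc":      # log(tf + 1) x idf with length normalization
            vec = l2_normalize({t: math.log(f + 1.0) * idf(t) for t, f in c.items()})
        elif scheme == "entropy":  # log(tf + 1) scaled by the term's entropy factor
            vec = {t: math.log(f + 1.0) * entropy_factor(t) for t, f in c.items()}
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        weighted.append(vec)
    return weighted


# Toy usage with made-up tokens (not data from the paper)
toy_docs = [["a", "a", "b"], ["a", "c"], ["c", "c", "c"]]
ltc_vectors = weight_documents(toy_docs, "ltc")
```

In this sketch, tfc and ltc differ from plain tf×idf only by length normalization (and, for ltc, a logarithm that dampens high term frequencies), while the entropy factor approaches zero for a term spread evenly over all documents and equals one for a term confined to a single document.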



ISBN 978-3-88579-259-8

