A Comparative Study on Feature Weighting in Thai Document Categorization
Text categorization is the process of automatically assigning predefined categories to free-text documents. Feature weighting, which calculates feature (term) values in documents, is one of the important preprocessing techniques in text categorization. This paper is a comparative study of feature weighting methods in statistical learning for a Thai document categorization framework. Six methods were evaluated: Boolean, tf, tf$\times $idf, tfc, ltc, and entropy weighting. We evaluated these methods on a Thai news article corpus with three supervised learning classifiers: Support Vector Machine (SVM), Decision Tree (DT), and Naïve Bayes (NB). We found that the ltc weighting method is the most effective in our experiments with the SVM and DT algorithms, while entropy and Boolean weighting are more effective with the NB algorithm. Using ltc weighting with an SVM classifier yielded a very high classification performance, with an F1 measure of 96\%.
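To make the compared schemes concrete, the following is a minimal sketch of the ltc weighting mentioned above, using a common textbook formulation (logarithmic term frequency times idf, cosine-normalized per document); the paper's exact formula and tokenization are not given in the abstract and may differ.

```python
import math
from collections import Counter

def ltc_weights(docs):
    """Compute ltc weights for each tokenized document.

    ltc (a common formulation): w = (1 + log tf) * log(N / df),
    then cosine-normalize each document vector.
    """
    N = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        w = {t: (1 + math.log(c)) * math.log(N / df[t]) for t, c in tf.items()}
        # Cosine normalization (guard against all-zero vectors).
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        weighted.append({t: v / norm for t, v in w.items()})
    return weighted

# Hypothetical toy corpus (pre-tokenized) for illustration only.
docs = [["economy", "bank", "bank"], ["economy", "sport"]]
weights = ltc_weights(docs)
```

A term appearing in every document gets idf 0 and thus weight 0, so only discriminative terms contribute to the normalized vectors fed to the SVM, DT, or NB classifier.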