Optimization of Number of Training Documents in Text Categorization

Journal of the Korean Society for Information Management, (P)1013-0799; (E)2586-2073
2006, v.23 no.4, pp.277-294
https://doi.org/10.3743/KOSIM.2006.23.4.277

Abstract

This paper examines the level of categorization performance on a real-life collection of article abstracts in the fields of science and technology, and tests the optimal number of training documents per category using a kNN classifier. The corpus is built by first choosing categories that hold more than 2,556 documents and then randomly selecting 2,556 documents per category. The corpus is then divided into eight subsets with different numbers of training documents: each subset is randomly selected, with sizes ranging from 20 documents (Tr20) to 2,000 documents (Tr2000) per category. The categorization performances of the eight subsets are compared. The average performance of the eight subsets is 30% in F1 measure, which is relatively poor compared to the findings of previous studies. The experimental results suggest that, among the eight subsets, Tr100 is the optimal size for training a kNN classifier. In addition, the correctness of the subject categories assigned to the training sets is probed by manually reclassifying the training sets, in order to support the above conclusion by establishing a relation between that correctness and categorization performance.
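The sampling and scoring steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `sample_training_subsets` and `f1_score`, the category names, and the fixed random seed are all hypothetical. The abstract names only Tr20, Tr100, and Tr2000, so the sketch uses just those three sizes rather than guessing the other five.

```python
import random


def sample_training_subsets(docs_by_category, sizes, seed=0):
    """Draw an independent random training subset per category for each size.

    docs_by_category: dict mapping a category name to its list of documents.
    sizes: iterable of per-category training set sizes (e.g. 20 for "Tr20").
    Returns a dict {size: {category: [sampled documents]}}.
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    subsets = {}
    for size in sizes:
        subsets[size] = {}
        for category, docs in docs_by_category.items():
            if size > len(docs):
                raise ValueError(
                    f"category {category!r} has only {len(docs)} documents"
                )
            # Sampling without replacement within one category.
            subsets[size][category] = rng.sample(docs, size)
    return subsets


def f1_score(true_pos, false_pos, false_neg):
    """F1 measure: harmonic mean of precision and recall."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical corpus: two categories of 2,556 document ids each.
corpus = {
    "category_a": list(range(2556)),
    "category_b": list(range(2556, 5112)),
}
subsets = sample_training_subsets(corpus, sizes=[20, 100, 2000])
```

Each entry of `subsets` could then be used to train a kNN classifier, with `f1_score` applied to the per-category counts on held-out documents; how the paper averages F1 across categories is not stated in the abstract.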

keywords
text categorization, document categorization, kNN classifier, test collections, size of training documents

