Optimization of Number of Training Documents in Text Categorization

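This study used a kNN classifier to determine what level of performance text categorization achieves when applied to document classification in a real system environment, and what per-category training set size is best suited to training a classifier to adequate performance. For the test collection, eight categories holding 2,556 or more documents each were selected from a database of about 150,000 records that is in actual service. From these, sub-collections were built at eight training set sizes, increasing stepwise from 20 training documents per category (Tr20) to 2,000 (Tr2000), and categorization performance was compared across the sub-collections by training set size. The macro-averaged performance of the eight sub-collections was an F1 of 30%, below the typical kNN performance reported in previous studies. Among the eight sub-collections, Tr100, with 100 training documents per category, reached an F1 of 31% and was judged the optimal training set size for classifier training in terms of cost-effectiveness. In addition, the accuracy of the subject categories assigned to the test collection was checked by manual reclassification, and the relation between that accuracy and per-category categorization performance was used to strengthen the reliability of this conclusion.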

keywords: text categorization, kNN classifier, test collections, size of training documents

Abstract

This paper examines the level of categorization performance achievable on a real-life collection of article abstracts in the fields of science and technology, and tests the optimal number of training documents per category using a kNN classifier. The corpus is built by first choosing categories that hold more than 2,556 documents and then randomly selecting 2,556 documents from each category. It is further divided into eight subsets with different training set sizes, with training documents randomly selected to range from 20 (Tr20) to 2,000 (Tr2000) per category. The categorization performance of the eight subsets is compared. The average performance of the eight subsets is 30% in F1, which is relatively poor compared to the findings of previous studies. The experimental results suggest that, among the eight subsets, Tr100 is the optimal size for training a kNN classifier. In addition, the correctness of the subject categories assigned to the training sets is probed by manually reclassifying them, in order to support the above conclusion by establishing a relation between that correctness and categorization performance.
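The subset construction and the macro-averaged F1 measure described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the synthetic document ids, and the single-label assumption (one category per document) are all hypothetical, and only the three subset sizes named in the abstract (Tr20, Tr100, Tr2000) are used.

```python
import random


def sample_training_subset(docs_by_category, per_category, seed=0):
    """Randomly draw `per_category` training documents from each category."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return {
        category: rng.sample(docs, per_category)
        for category, docs in docs_by_category.items()
    }


def macro_f1(true_labels, predicted_labels):
    """Macro-averaged F1: F1 per category, then the unweighted mean.

    Assumes single-label classification (an assumption of this sketch).
    """
    pairs = list(zip(true_labels, predicted_labels))
    categories = sorted(set(true_labels) | set(predicted_labels))
    f1_scores = []
    for category in categories:
        tp = sum(1 for t, p in pairs if t == category and p == category)
        fp = sum(1 for t, p in pairs if t != category and p == category)
        fn = sum(1 for t, p in pairs if t == category and p != category)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Synthetic stand-in for the corpus: 8 categories, 2,556 document ids each.
corpus = {
    f"cat{c}": [f"cat{c}-doc{i}" for i in range(2556)] for c in range(8)
}

# Build the three subset sizes that the abstract names explicitly.
subsets = {f"Tr{n}": sample_training_subset(corpus, n) for n in (20, 100, 2000)}

# Toy predictions, only to show how the measure is computed.
score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Because macro-averaging weights every category equally, a single category with poorly assigned labels can pull the overall score down noticeably, which is one reason the per-category check by manual reclassification is informative.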



Vol.23 No.4

Journal of the Korean Society for Information Management (정보관리학회지)