The Effect of the Quality of Pre-Assigned Subject Categories on the Text Categorization Performance

Journal of the Korean Society for Information Management, (P)1013-0799; (E)2586-2073
2006, v.23 no.2, pp.265-285
https://doi.org/10.3743/KOSIM.2006.23.2.265


Abstract

Text categorization assumes that the labels assigned to training documents are correct to a certain degree, yet little is known about how correct the labels in real-world collections actually are. This study examines the quality of pre-assigned subject categories in a real-world collection and identifies the relationship between the quality of category assignment in the training set and text categorization performance. In particular, we ask to what extent performance can be improved by enhancing the quality (i.e., correctness) of category assignment in the training documents. A collection of 1,150 abstracts in computer science was re-classified by an expert group and divided into 907 training documents and 227 test documents (15 duplicates were removed). Performance before and after re-classification, measured with the Initial set and the Recat-1/Recat-2 sets respectively, is compared using a kNN classifier. The average correctness of subject categories in the Initial set is 16%, and categorization with the Initial set yields an F1 value of 17%. By contrast, the Recat-1 set yields an F1 value of 61%, about 3.6 times that of the Initial set.
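The comparison described in the abstract can be illustrated with a minimal sketch: the same kNN classifier is trained twice on the same documents, once with error-prone labels and once with corrected labels, and the F1 values are compared. Everything below is an assumption made for illustration, not the study's actual setup: the six toy documents, the labels `db` and `net`, the use of cosine similarity over raw term counts, macro-averaged F1, `k=3`, and all function names are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(texts, labels, query, k=3):
    """Majority label among the k training documents most similar to query."""
    q = vectorize(query)
    sims = [(cosine(vectorize(t), q), i) for i, t in enumerate(texts)]
    sims.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, stable on ties
    votes = Counter(labels[i] for _, i in sims[:k])
    return votes.most_common(1)[0][0]

def macro_f1(gold, pred):
    """Unweighted mean of per-category F1 values."""
    scores = []
    for c in sorted(set(gold) | set(pred)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)

def label_correctness(assigned, expert):
    """Share of training labels that agree with the expert labels."""
    return sum(a == e for a, e in zip(assigned, expert)) / len(expert)

# Toy data (hypothetical): the same training documents under two label sets.
TRAIN_TEXTS = [
    "sql query index database",
    "database transaction sql",
    "relational database index",
    "network router packet",
    "packet routing protocol network",
    "tcp protocol network",
]
EXPERT_LABELS = ["db", "db", "db", "net", "net", "net"]    # after re-classification
INITIAL_LABELS = ["net", "net", "db", "net", "db", "db"]   # mostly wrong, pre-assigned

TEST_TEXTS = ["sql database query", "network packet protocol"]
TEST_LABELS = ["db", "net"]

def evaluate(train_labels):
    """Macro F1 on the test documents for a given set of training labels."""
    preds = [knn_predict(TRAIN_TEXTS, train_labels, t) for t in TEST_TEXTS]
    return macro_f1(TEST_LABELS, preds)

if __name__ == "__main__":
    print("label correctness (initial):", label_correctness(INITIAL_LABELS, EXPERT_LABELS))
    print("F1 with initial labels:", evaluate(INITIAL_LABELS))
    print("F1 with expert labels: ", evaluate(EXPERT_LABELS))
```

The toy data is deliberately extreme so that the effect is visible with two test documents. It only demonstrates the direction of the relationship between label correctness and F1, not the magnitudes reported above.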

Keywords
Text categorization, document categorization, test collections, kNN classifier, training sets

