A Study on Information Resource Evaluation for Text Categorization

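This study treats the document components that indexers consult during subject indexing as information sources for text categorization, and examines how they affect categorization performance. Whereas existing text categorization research has concentrated on full text, this study evaluates each document component as an information source to determine whether it can be used efficiently for text categorization. With typical science and technology journal articles and conference proceedings papers as the data set, the information sources fall into two groups: body-text-centered and document-component-centered. The body-text-centered sources consist of the main body itself and the introduction and conclusion; the document-component-centered sources are the title, citations, source title, abstract, and keywords. The experimental results show that the citation, source title, and title sources do not differ significantly from the body-text source, whereas the keyword source does differ significantly. These results indicate that the document components indexers consult can be used efficiently in place of the body text for text categorization.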

Keywords: Text Categorization, Automatic Indexing, Information Resources, Subject Indexing Process

Abstract

The purpose of this study is to examine whether the information resources referenced by human indexers during the indexing process are effective for text categorization. More specifically, information resources drawn from bibliographic information as well as full-text information were explored in the context of a typical scientific journal article data set. The experimental results indicated that information resources such as citation, source title, and title were not significantly different from full text, whereas keyword was found to be significantly different from full text. These findings suggest that information resources referenced by human indexers are good candidates for text categorization for automatic subject term assignment.
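The abstract does not say which performance measure or significance test was used. As a minimal sketch, assuming per-category F1 as the measure and a paired t statistic as the comparison, the contrast between one information resource and full text could be computed as follows. The function names `f1_score` and `paired_t_statistic` are hypothetical and are not taken from the study.

```python
from math import sqrt
from statistics import mean, stdev


def f1_score(gold, predicted, category):
    """F1 for one category, given parallel lists of gold and predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == category and p == category)
    fp = sum(1 for g, p in pairs if g != category and p == category)
    fn = sum(1 for g, p in pairs if g == category and p != category)
    if tp == 0:
        return 0.0  # no correct assignments, so F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic over matched per-category scores (df = n - 1).

    Assumption: scores_a and scores_b hold one score per category, in the
    same category order, e.g. one list for a keyword-only classifier and
    one for a full-text classifier.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs)))
```

Under these assumptions, the absolute value of the statistic would be compared against the critical t value for n - 1 degrees of freedom; a value below it would correspond to "not significantly different from full text".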




Vol.24 No.4

문서범주화 효율성 제고를 위한 정보원 평가에 관한 연구


Journal of the Korean Society for Information Management (정보관리학회지)