A Study on Improving the Performance of Document Classification Using the Context of Terms

송성전; 정영미

doi:10.3743/KOSIM.2012.29.2.205

Journal of the Korean Society for Information Management, P-ISSN 1013-0799; E-ISSN 2586-2073

2012, v.29 no.2, pp.205-224

https://doi.org/10.3743/KOSIM.2012.29.2.205

Abstract

One limitation of the bag-of-words (BOW) method is that each term is recognized only by its surface form, so the representation fails to capture the term's meaning or thematic background. To overcome this limitation, this study defined a separate profile for each term in each thematic category, based on the term's contextual characteristics. A term was then used as a classification feature according to its meaning or thematic background, which was determined by comparing the contexts recorded in those profiles with the term's occurrences in an actual document. The experiment was conducted in three phases: term weighting, ensemble classifier implementation, and feature selection. Classification performance improved in all three phases, with the ensemble classifier achieving the highest score. The results also showed that the proposed method was effective in reducing the performance bias caused by the number of training documents.
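The idea of matching a term's occurrences against per-category context profiles can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how profiles are built or compared, so the fixed context window, the cosine similarity, the `term@category` feature naming, and all function names below are assumptions chosen for illustration only.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_windows(tokens, window=2):
    """Yield (term, Counter of neighbouring tokens) for every position.

    The fixed-size window is an assumption; the abstract does not define context.
    """
    for i, term in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield term, Counter(left + right)


def build_profiles(labeled_docs, window=2):
    """Aggregate context counts for each (term, category) pair."""
    profiles = defaultdict(Counter)
    for tokens, category in labeled_docs:
        for term, context in context_windows(tokens, window):
            profiles[(term, category)].update(context)
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def contextual_features(tokens, profiles, categories, window=2):
    """Turn each occurrence into a 'term@category' feature (hypothetical scheme).

    An occurrence is assigned to the category whose profile best matches its
    local context; with no match it falls back to the plain BOW term.
    """
    features = Counter()
    for term, context in context_windows(tokens, window):
        scores = {c: cosine(context, profiles.get((term, c), Counter()))
                  for c in categories}
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            features[f"{term}@{best}"] += scores[best]
        else:
            features[term] += 1.0
    return features


if __name__ == "__main__":
    # Toy training data, invented for this example.
    training = [
        (["bank", "loan", "interest", "rate"], "finance"),
        (["river", "bank", "water", "flow"], "nature"),
    ]
    profiles = build_profiles(training)
    print(contextual_features(["the", "bank", "loan", "rate"],
                              profiles, ["finance", "nature"]))
```

In this toy setup the ambiguous term "bank" becomes a finance-specific feature because its neighbours overlap only with the finance profile, which is the kind of distinction a form-only BOW representation cannot make.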

keywords: document classification, context profile, term weighting, ensemble classifier, feature selection
