A Study on Automatic Assignment of Descriptors Using Machine Learning

Journal of the Korean Society for Information Management, (P)1013-0799; (E)2586-2073
2006, v.23 no.1, pp.279-299
https://doi.org/10.3743/KOSIM.2006.23.1.279

Abstract

This study applies several machine learning approaches to the automatic assignment of descriptors to journal articles. Core journals in the field of information science were selected, a test collection was built from their articles of the past 11 years, and the effects of feature selection and training set size were examined. Regarding feature selection, Support Vector Machines (SVM) performed best when trained on a feature set reduced by the χ² statistic (CHI) and by criteria that prefer high-frequency features (COS, GSS, JAC). Regarding training set size, it significantly influenced the performance of SVM and Voted Perceptron (VTP), but it scarcely affected that of Naive Bayes (NB).
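As a rough illustration of the χ²-based feature reduction mentioned in the abstract, the sketch below scores each term for one descriptor from a 2x2 contingency table and keeps the top-scoring terms. It assumes the textbook χ² formula used in text categorization; the function names, the table layout, and the toy counts are hypothetical and are not taken from the study.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square score of a term for one descriptor (standard 2x2 form).

    a: documents with the descriptor that contain the term
    b: documents without the descriptor that contain the term
    c: documents with the descriptor that lack the term
    d: documents without the descriptor that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A term present in all or none of the documents carries no signal.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(tables: dict, k: int) -> list:
    """Return the k terms with the highest chi-square score."""
    ranked = sorted(tables.items(),
                    key=lambda item: chi_square(*item[1]),
                    reverse=True)
    return [term for term, _ in ranked[:k]]


# Hypothetical counts (a, b, c, d) for three terms over 200 documents.
tables = {
    "indexing": (40, 10, 10, 140),  # concentrated in the descriptor's documents
    "library": (30, 90, 20, 60),    # spread evenly, so independent of the descriptor
    "the": (50, 150, 0, 0),         # occurs in every document
}

print(select_features(tables, 1))
```

In this toy setting only "indexing" receives a non-zero score, so it is the single feature kept. A reduced feature set of this kind would then be passed to a classifier such as SVM, VTP, or NB.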

Keywords
descriptors, automatic indexing, machine learning, feature selection, classifier, text categorization

