Journal of the Korean Society for Information Management, Korean Society for Information Management

21

김판준 (Silla University) 2008, Vol.25, No.1, pp.211-233 https://doi.org/10.3743/KOSIM.2008.25.1.211

Abstract

To improve the performance of automatic classification based on the Rocchio algorithm, this study examined various weighting methods on two test collections (LISA and Reuters-21578). First, the factors used to compute weights were divided into three kinds: the document factor, the document set factor, and the category factor, and the classification performance of the single weighting schemes based on each factor was investigated. Second, the performance of combined weighting schemes that pair these factors was examined. Among the single schemes, category-factor-based schemes performed best, document-set-factor-based schemes second, and document-factor-based schemes worst. Among the combined schemes, the scheme that excludes the document factor and combines the document set factor with the category factor (idf*cat) outperformed both the commonly used combination of document factor and document set factor (tfidf or ltfidf) and the combinations of document factor with category factor (tf*cat or ltf*cat). However, when single and combined schemes were compared per collection, the category-factor-only scheme (cat only) performed best on LISA, whereas the combined scheme idf*cat performed best on Reuters-21578. Practical application of weighting methods therefore requires careful consideration of the characteristics of the categories in the collection to be classified.
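As a rough illustration of the document-factor and document-set-factor schemes named in the abstract (tf, ltf, idf, tfidf, ltfidf), the sketch below plugs them into a centroid-style Rocchio classifier. Several things here are assumptions rather than the paper's method: the function names (`idf_table`, `weigh`, `train`, `classify`), the formulas `ltf = 1 + log(tf)` and `idf = log(N / df)`, and the use of positive examples only when building each category centroid. The category factor (cat) is left out because the abstract does not define it.

```python
import math
from collections import Counter, defaultdict


def idf_table(docs):
    """Document set factor (assumed form): idf(t) = log(N / df(t))."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def weigh(doc, idf, scheme):
    """Turn a token list into a unit-length weighted vector under one scheme."""
    vec = {}
    for term, freq in Counter(doc).items():
        if term not in idf:  # term never seen in training: ignore it
            continue
        tf = float(freq)
        ltf = 1.0 + math.log(freq)
        weights = {
            "tf": tf,
            "ltf": ltf,
            "idf": idf[term],
            "tfidf": tf * idf[term],
            "ltfidf": ltf * idf[term],
        }
        vec[term] = weights[scheme]
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def train(docs, labels, scheme="tfidf"):
    """Build one centroid per category from its positive examples only."""
    idf = idf_table(docs)
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for term, w in weigh(doc, idf, scheme).items():
            sums[label][term] += w
    centroids = {
        label: {t: w / counts[label] for t, w in vec.items()}
        for label, vec in sums.items()
    }
    return idf, centroids, scheme


def classify(doc, model):
    """Assign the category whose centroid has the highest cosine similarity."""
    idf, centroids, scheme = model
    query = weigh(doc, idf, scheme)  # already unit length

    def cosine(centroid):
        dot = sum(w * centroid.get(t, 0.0) for t, w in query.items())
        norm = math.sqrt(sum(w * w for w in centroid.values()))
        return dot / norm if norm else 0.0

    return max(centroids, key=lambda label: cosine(centroids[label]))
```

Switching the `scheme` argument of `train` between `"tf"`, `"idf"`, and `"tfidf"` is one simple way to compare single and combined weights on the same data, in the spirit of the comparison the abstract describes.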

22

A Study on Feature Ranking Schemes for Text Classification

김판준 (Department of Library and Information Science, Silla University) 2023, Vol.40, No.1, pp.1-21 https://doi.org/10.3743/KOSIM.2023.40.1.001

Abstract

This study examined in detail the performance of feature ranking schemes as an efficient feature selection method for text classification. Feature ranking schemes to date have mostly been based on document frequency, and relatively few have used term frequency. The study therefore first examined single ranking schemes that apply term frequency and document frequency individually, and then combination ranking schemes that use both. Classification experiments were conducted on two data sets (Reuters-21578, 20NG) with five classifiers (SVM, NB, ROC, TRA, RNN), and 5-fold cross-validation and t-tests were applied to make the results reliable. Among the single ranking schemes, the document-frequency-based scheme (chi) performed well overall. In addition, no significant performance difference was found between the best single ranking scheme and the combination ranking schemes. Therefore, where sufficient training documents are available, using the document-frequency-based single ranking scheme (chi) for feature selection in text classification is the more efficient choice.
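To make the document-frequency-based ranking (chi) concrete, here is a minimal sketch that scores each term with the standard 2x2 chi-square statistic and sorts terms by score. The abstract does not give the exact formula or say how per-category scores are combined, so the helper names (`chi_square`, `rank_features`) and the choice to take the maximum score over categories are assumptions made for illustration.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square for a 2x2 term/category table of document counts.

    a: in category, contains term      b: not in category, contains term
    c: in category, lacks term         d: not in category, lacks term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def rank_features(docs, labels):
    """Return (term, score) pairs sorted from most to least discriminative."""
    n = len(docs)
    cat_size = Counter(labels)
    df = Counter()          # document frequency of each term
    df_in_cat = Counter()   # document frequency of each (term, category)
    for terms, label in zip(docs, labels):
        for term in set(terms):
            df[term] += 1
            df_in_cat[(term, label)] += 1

    scores = {}
    for term in df:
        best = 0.0
        for cat, size in cat_size.items():
            a = df_in_cat[(term, cat)]
            b = df[term] - a
            c = size - a
            d = n - size - b
            best = max(best, chi_square(a, b, c, d))  # assumed: max over categories
        scores[term] = best
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Feature selection then amounts to keeping the top-k entries of the returned list; a term that occurs in every document scores 0 and falls to the bottom.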

23

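An Analysis of the Operating Status and Needs for Revitalizing the KESLI Consortium

이용구 (Keimyung University) ; 박성재 (Hansung University) ; 김정환 (Korea Institute of Science and Technology Information) 2013, Vol.30, No.1, pp.221-236 https://doi.org/10.3743/KOSIM.2013.30.1.221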

Abstract

The purpose of this study is to identify problems in the operation of KESLI and ways to revitalize it by analyzing the status and needs of the institutions participating in the e-journal consortium. A questionnaire covering consortium selection, management, and evaluation was distributed to participating institutions, and the 179 returned questionnaires were analyzed. The findings indicate that participants have needs related to collection development policy, cataloging of e-journals, user education, and evaluation methods. KESLI should therefore provide (1) sample policies that institutions can consult when drafting a collection development policy, (2) a system that simplifies e-journal cataloging, (3) support for developing materials and programs that make user education at each institution more effective, and (4) training in methods and techniques for evaluating e-journal usage.

24

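A Study on Automatic Classification of Biomedical Literature Using Topic Modeling and Deep Learning

육지희 (Department of Library and Information Science, Graduate School, Yonsei University) ; 송민 (Yonsei University) 2018, Vol.35, No.2, pp.63-88 https://doi.org/10.3743/KOSIM.2018.35.2.063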

Abstract

This study selected features using the LDA topic model and Doc2Vec, a word-embedding-based technique that applies deep learning, and evaluated how classification performance varies with the size and type of the feature set and with the classification algorithm. To identify feature sets that yield high classification performance, it examined the appropriate feature set size and composed feature sets of different types according to the location in the document from which features were drawn. Finally, in the deep learning experiments, it compared classification performance by number of training iterations and by the presence or absence of context inference information. The test collection, Disease-35083, was built by collecting biomedical scholarly documents provided by PMC and categorizing them by a disease category scheme. The study identified the type and size of feature set that produced the highest performance and, by showing efficiency in training time, proposed feature sets that have potential for extension as features. It also compared deep learning with existing methods and suggested an appropriate method for each classification environment.

25

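A Study on Verification of Dataset Preservation Formats: Applying SIARD to Disaster and Safety Information Datasets

한희정 (Full-time Researcher, Institute for Culture Convergence Archiving, Jeonbuk National University) ; 윤성호 (Master's Student, Department of Records Management, Graduate School, Jeonbuk National University) ; 오효정 (Associate Professor, Department of Library and Information Science, Jeonbuk National University) ; 양동민 (Associate Professor, Department of Records Management, Graduate School, Jeonbuk National University) 2020, Vol.37, No.2, pp.251-284 https://doi.org/10.3743/KOSIM.2020.37.2.251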

Abstract

As the use of information has emerged as the core of national competitiveness, major developed countries, including Korea, have recognized the importance of data. They have accordingly pursued research on long-term preservation technology and the establishment of standards, and have made sustained efforts toward the systematic management and preservation of data. In Korea, however, although the law designates various types of data as subject to records management, no concrete method exists for collecting, managing, and preserving them, except for standard electronic documents. In particular, the management and preservation of the huge datasets produced by administrative information systems has been strongly demanded, yet adequate guidelines for datasets have not been provided. Because systems can be supplemented and built only once a framework for selecting preservation formats is in place, the criteria for selecting a preservation format that reflects dataset characteristics must first be made more concrete, and the conversion and restoration of datasets in the preservation format derived from those criteria must be empirically verified. This study therefore derives an evaluation framework for preservation format selection criteria that considers the characteristics of datasets and, through empirical verification of the preservation format, proposes a method for long-term preservation.

26

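An Analysis of the Current Status and Operation of Digital Information Rooms in Major University Libraries in the Daegu and Kyoungpook Area

오동근 (Keimyung University) ; 김숙찬 (Keimyung University) 2004, Vol.21, No.4, pp.89-107 https://doi.org/10.3743/KOSIM.2004.21.4.089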

Abstract

This study analyzes the present conditions and operations of digital information rooms, or comparable rooms, at five selected university libraries in the Daegu and Kyoungpook area. Drawing on a questionnaire survey of and interviews with the staff in charge, it identifies operational problems and seeks improvements. The analysis covers personnel, room size, facilities and equipment, collections, operating methods, tasks performed, user instruction and public relations, operational problems, and future operating and long-term development plans. Based on this analysis and on the researchers' own experience working in a digital information room, it offers suggestions for operating the rooms effectively and increasing their use. The digital information rooms in the region's university libraries vary widely, but they share two traits: the parties involved have not reached a common understanding of what a digital information room is, and the rooms are run on short-term plans and policies without comprehensive development goals or policies of their own.

27

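A Study on Prioritizing Resources for Building a Primary and Secondary Educational Resource Archive

김성훈 (Department of Library and Information Science, Sungkyunkwan University) ; 도슬기 (Department of Library and Information Science, Sungkyunkwan University) ; 오삼균 (Department of Library and Information Science, Sungkyunkwan University) 2019, Vol.36, No.2, pp.153-174 https://doi.org/10.3743/KOSIM.2019.36.2.153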

Abstract

This study was conducted as a preliminary investigation, given that no archive of primary and secondary educational resources exists in Korea. It proceeded in three steps: (1) identifying, through Web search and literature review, how primary and secondary educational resources in Korea are currently managed and what types of resources exist; (2) surveying users to determine archiving priorities according to the value of the resources; and (3) holding an expert workshop to discuss and gather opinions on the priorities and on the case for building an educational resource archive system. Primary and secondary educational resources in Korea are produced in large quantities by both the public and private sectors, but they are managed sporadically and are not open to the public, which makes them inconvenient to use. The extraction of resource types and the priority survey showed that the resources most worth preserving are case information from actual classrooms, curriculum-related resources, and textbooks. On this basis, the experts pointed to the value of educational resources and the justification for an archiving agency, and recommended a national-level archive system for educational resources. The study is meaningful as a starting point for raising awareness of the need for educational resource archives and for exploring the direction they should take.

28

Public Library Marketing Using Geographic Information Systems

이성신 (Kyungpook National University) 2011, Vol.28, No.3, pp.179-195 https://doi.org/10.3743/KOSIM.2011.28.3.179

Abstract

The purpose of this study is to explore, from a marketing perspective, the implications of geographic information systems (GIS) for collection selection and service development in public libraries. A GIS is a computer system capable of assembling, storing, manipulating, and displaying geographically referenced information. Using GIS, public libraries can collect geographical, transportation, political, legal, demographic, economic, social, cultural, educational, and recreational information about their community. Public libraries can therefore apply GIS to market research, the first stage of marketing, that is, to user analysis, and so select collections and develop services that match users' needs. This in turn helps them achieve the ultimate goal of marketing: building a lasting relationship with users.

29

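Types of User Uncertainty in the Selection of Web Search Terms: An Examination of the Information-Seeking Context of Natural Science Researchers

김양우 (Hansung University) 2006, Vol.23, No.2, pp.287-309 https://doi.org/10.3743/KOSIM.2006.23.2.287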

Abstract

While numerous studies have pointed out the significance of uncertainty in the information-seeking process, few have investigated user uncertainty during actual searches on a real information retrieval system. This study investigated users' perceptions of uncertainty while selecting Web search terms in the course of real information seeking. The subjects were limited to doctoral or post-doctoral researchers in the natural sciences in order to understand user perceptions in this field. The findings revealed various dimensions, types, and incidents of uncertainty. The typology of uncertainty facilitated an understanding of the subjects' information-seeking context by identifying the aspects of that context that constituted their uncertainty. Two principal origins of uncertainty, identified from the different types, suggest ways to improve information retrieval systems and services.

30

A Study on Automatic Classification of Domestic Journal Articles Using Random Forest

김판준 (Silla University) 2019, Vol.36, No.2, pp.57-77 https://doi.org/10.3743/KOSIM.2019.36.2.057

Abstract

Random Forest (RF), a representative ensemble technique, was applied to the automatic classification of journal articles in the field of library and information science. In particular, experiments were run from several angles on the main factors, such as the number of trees, feature selection, and training set size, in terms of the classification performance of automatically assigning subject categories to domestic journal articles. Through these experiments, the study sought ways to optimize RF performance on an imbalanced dataset from a real environment. The results indicate that, for the automatic classification of domestic journal articles, RF can be expected to perform best with a number of trees in the 100-1000 interval (C), a small feature set (10%) selected by the chi-square statistic (CHI), and most of the training set (9-10 years).
