Journal of the Korean Society for Information Management, Korean Society for Information Management

21

김판준 (Department of Library and Information Science, Silla University) 2023, Vol.40, No.1, pp.1-21 https://doi.org/10.3743/KOSIM.2023.40.1.001

Abstract

This study examined the performance of feature ranking schemes as an efficient feature selection method for text classification. Feature ranking schemes to date have mostly been based on document frequency, and relatively few have used term frequency. The study therefore first evaluated single ranking schemes that apply term frequency or document frequency individually, and then evaluated combined ranking schemes that use both. Classification experiments were run on two test collections (Reuters-21578, 20NG) with five classifiers (SVM, NB, ROC, TRA, RNN), and 5-fold cross-validation and t-tests were applied to ensure the reliability of the results. Among the single ranking schemes, the document frequency-based scheme (chi) performed well overall. In addition, no significant performance difference was found between the best single ranking scheme and the combined ranking schemes. Therefore, where sufficient training documents are available, using the document frequency-based single ranking scheme (chi) for feature selection in text classification is the more efficient choice.
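As an illustration of what a document frequency-based ranking scheme such as chi can look like, the following is a minimal sketch and not the paper's implementation. It assumes the standard chi-square statistic over a 2x2 term/class contingency table, and it assumes each term is scored by its maximum chi-square value across classes; the abstract does not say how per-class scores were aggregated. The function names `chi_square` and `rank_features` are hypothetical.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: docs in the class that contain the term
    b: docs outside the class that contain the term
    c: docs in the class that lack the term
    d: docs outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def rank_features(docs, labels, top_k):
    """Rank terms by chi-square computed from document frequencies.

    Assumption: a term's score is its maximum chi-square over all classes.
    Returns the top_k terms and the full score table.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    df = Counter()           # term -> number of docs containing it
    df_in_class = Counter()  # (term, label) -> docs of that class containing it
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # set(): count each term once per document
            df[term] += 1
            df_in_class[(term, label)] += 1

    scores = {}
    for term, total in df.items():
        best = 0.0
        for label, size in class_sizes.items():
            a = df_in_class[(term, label)]
            b = total - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k], scores


if __name__ == "__main__":
    toy_docs = [["oil", "price"], ["oil", "export"],
                ["game", "team"], ["game", "score"]]
    toy_labels = ["econ", "econ", "sport", "sport"]
    top, _ = rank_features(toy_docs, toy_labels, top_k=2)
    print(top)
```

In practice the selected terms would then be passed to a classifier and evaluated with cross-validation, as the abstract describes; that step is omitted here.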

22

A Study on Improving the Performance of Automatic Document Classification Using Term Context

송성전 (Yonsei University); 정영미 (Yonsei University) 2012, Vol.29, No.2, pp.205-224 https://doi.org/10.3743/KOSIM.2012.29.2.205

Abstract

Bag-of-words (BOW), the usual way of representing documents in automatic classification, treats each term independently and therefore cannot reflect the surrounding context. To address this limitation, this study defined a profile for each term that captures its contextual characteristics in each subject category, and compared these profiles with the term’s context in an actual document so that terms of identical form could be distinguished by their meaning or thematic background. The aim was to avoid misclassifications triggered by the mere occurrence of a particular term in a document on a different subject. The contextual information was applied in three ways: term weighting, classifier combination (ensemble), and feature selection. Classification performance improved over the baseline in all three cases, with classifier combination showing the largest gain. The proposed method was also effective in reducing the performance bias caused by differences in the number of training documents.

23

A Study on Developing an Information Literacy Curriculum for College Students

정재영 (Sogang University) 2015, Vol.32, No.3, pp.1-20 https://doi.org/10.3743/KOSIM.2015.32.3.001

Abstract

Information literacy education, which covers the ability to recognize the need for information and to find, evaluate, and use it effectively, is essential for effective research and learning at universities. This study proposes a lesson plan for information literacy education for college students, based on information literacy standards and models put forward by domestic and international organizations and scholars and on cases of information literacy education. Information literacy education should be run as a regular course to overcome the limits of one-off sessions. A leading or cooperative role for professional librarians is essential, both to connect the course with university library resources and use them effectively, and for strategic reasons concerning the role of the university library and its librarians. The analysis also shows that course content needs to change to reflect the tendencies and needs of the current generation, and that teaching methods that draw out students’ independent and voluntary participation need to be considered.

24

An Analysis of Deep Learning Research Trends Using Ego-Centered Topic Citation Analysis

이재윤 (Myongji University) 2017, Vol.34, No.4, pp.7-32 https://doi.org/10.3743/KOSIM.2017.34.4.007

Abstract

Deep learning has recently been spreading rapidly across many fields as an innovative machine learning technique. This study analyzed research trends in deep learning by applying a modified form of ego-centered topic citation analysis. A small number of seed documents were selected from the documents retrieved from Web of Science with the query ‘deep learning’, and the documents to be analyzed were obtained through citation relations. Recent papers citing the seed documents were defined as the ego document set, reflecting current research in deep learning. Earlier studies frequently cited by the ego documents were defined as the citation identity document set, representing the research themes of the field. For the ego document set, quantitative analyses including co-authorship network analysis were carried out to identify the major countries and research institutions. For the citation identity documents, co-citation analysis was conducted, and key documents and key research themes were identified by examining citation image keywords, the major keywords of the documents citing each resulting document cluster. Finally, the citation growth index (CGI), which reflects the growth trend of citation influence on a specific topic, was proposed and measured to show how the leading research themes in deep learning have changed.

25

Applying Predictive Coding Tools to Corporate E-Discovery Response

유준상 (Myongji University); 임진희 (Myongji University) 2016, Vol.33, No.4, pp.125-157 https://doi.org/10.3743/KOSIM.2016.33.4.125

Abstract

As lawsuits involving Korean companies operating overseas increase, so does the demand for these companies to respond to E-Discovery. E-Discovery, which originates in Anglo-American law, is a procedure in which electronic information relevant to a lawsuit is located among electronic information scattered across many places, reviewed as evidence, and produced within a limited time. Given that Korean companies generate large volumes of electronic records every day and often do not manage them well, selecting, reviewing, and producing evidence within a limited time is not easy. Reducing the number of items to be reviewed and carrying out the review efficiently is one of the most important tasks for winning a lawsuit. Predictive Coding is a computer-assisted review tool used in the E-Discovery review process that applies machine learning to help companies review the electronic information they hold. Predictive Coding is judged to be more efficient than existing search tools and to be strong at picking out electronic information potentially relevant to a lawsuit. Choosing an efficient search tool and maintaining continuous records management is expected to reduce the time and cost of review. To respond to E-Discovery, companies should therefore seek the most effective approach by adopting a professional Predictive Coding solution, with time and cost in mind, together with corporate records management.

26

Topic Modeling of Insomnia Social Data Using BERTopic and Construction of a Deep Learning Model for Automatic Classification of Insomnia-Tendency Documents

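고영수 (Master's student, Department of Library and Information Science, Yonsei University); 이수빈 (Doctoral student, Department of Library and Information Science, Yonsei University); 차민정 (Social Omics Research Center, Yonsei University); 김성덕 (Master's student, Department of Library and Information Science, Yonsei University); 이주희 (Master's student, Department of Library and Information Science, Yonsei University); 한지영 (Master's student, Department of Library and Information Science, Yonsei University); 송민 (Department of Library and Information Science, Yonsei University) 2022, Vol.39, No.2, pp.111-129 https://doi.org/10.3743/KOSIM.2022.39.2.111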

Abstract

Insomnia is a chronic condition of modern society, with the number of patients having risen by more than 20% over the last five years. Because the personal and social problems caused by lack of sleep are serious and the factors that trigger insomnia interact in complex ways, diagnosis and treatment are important. This study collected 5,699 items from ‘insomnia’, the insomnia community on ‘Reddit’, a social media platform where users express opinions freely, and built an insomnia corpus by tagging each item as an insomnia-tendency or non-insomnia-tendency document, following the International Classification of Sleep Disorders (ICSD-3) criteria and guidelines prepared with advice from a psychiatrist. Five deep learning language models (BERT, RoBERTa, ALBERT, ELECTRA, XLNet) were trained on the corpus. In the performance evaluation, RoBERTa scored highest on accuracy, precision, recall, and F1 score, with an accuracy of 81.33%. To analyze the insomnia social data in depth, topic modeling was performed with BERTopic, a recently introduced method that addresses weaknesses of the widely used LDA. Hierarchical clustering identified eight topic groups (‘Negative emotions’, ‘Advice, help, and gratitude’, ‘Insomnia-related diseases’, ‘Sleeping pills’, ‘Exercise and eating habits’, ‘Physical characteristics’, ‘Activity characteristics’, ‘Environmental characteristics’). Users of the community expressed negative emotions and sought help and advice. They also mentioned insomnia-related diseases, discussed the use of sleeping pills, and showed interest in exercise and eating habits. The insomnia-related characteristics found included physical characteristics such as breathing, pregnancy, and the heart; activity characteristics such as zombies, hypnic jerks, and grogginess; and environmental characteristics such as sunlight, blankets, temperature, and naps.
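For readers unfamiliar with the evaluation measures used to compare the language models (accuracy, precision, recall, and F1 score), the following minimal sketch shows how they are conventionally computed for a binary task such as insomnia-tendency versus non-insomnia-tendency tagging. It is not the authors' evaluation code. The function name `binary_metrics`, the label encoding (1 for insomnia tendency), and the toy labels are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels.

    Assumption: `positive` marks the insomnia-tendency class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels, invented for illustration only (not data from the study).
    gold = [1, 1, 1, 0, 0, 0, 1, 0]
    predicted = [1, 1, 0, 0, 0, 1, 1, 0]
    print(binary_metrics(gold, predicted))
```

Reporting all four measures rather than accuracy alone matters when the two classes are imbalanced, since a model can reach high accuracy while missing most positive documents.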
