Journal of the Korean Society for Information Management, Korean Society for Information Management

11

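An Experimental Study on Automatic Classification of Opinionated Documents Using Supervised Latent Semantic Indexing (LSI)

이지혜 (Yonsei University) ; 정영미 (Yonsei University) 2009, Vol.26, No.3, pp.451-462 https://doi.org/10.3743/KOSIM.2009.26.3.451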

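Abstract (translated from Korean)

This study carried out classification experiments using latent semantic indexing, a concept indexing technique, to improve the automatic classification of opinionated documents, that is, documents expressing opinions or sentiment. The 1,000 opinionated documents collected for the experiments comprise 500 positive and 500 negative documents. Morphological analysis of the document text produced a content word set made up of nouns and an opinion word set made up of predicates (verbs and adjectives), adverbs, and word roots. When the documents were classified with the different feature sets, the opinion word set performed best under term indexing, the combined content and opinion word set under latent semantic indexing, and the content word set under supervised latent semantic indexing. Overall, latent semantic indexing classified opinionated documents better than term indexing, and supervised latent semantic indexing gave the best classification performance.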

Abstract

The aim of this study is to apply latent semantic indexing (LSI) techniques to the efficient automatic classification of opinionated documents. For the experiments, we collected 1,000 opinionated documents such as reviews and news, with 500 labelled as positive and the remaining 500 as negative. Sets of content words and sentiment words were extracted using a POS tagger in order to identify the optimal feature set for opinion classification. The findings showed that LSI techniques were more effective than a term indexing method in sentiment classification. The best performance was achieved by a supervised LSI technique.
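The contrast between term indexing and latent semantic indexing can be illustrated with a small sketch. It is plain (unsupervised) LSI only, since the abstract does not specify how the supervised variant was built. The toy term counts, the function names, and the use of power iteration in place of a full SVD routine are all assumptions made for illustration, not the authors' setup.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def top_right_singular_vectors(a, k, iters=100, seed=0):
    """Approximate the top-k right singular vectors of a (documents x terms)
    by power iteration on A^T A, deflating against vectors already found."""
    rng = random.Random(seed)
    a_t = [list(col) for col in zip(*a)]
    basis = []
    for _ in range(k):
        v = [rng.random() for _ in range(len(a[0]))]
        for _ in range(iters):
            w = matvec(a_t, matvec(a, v))
            # Remove the components along directions found earlier.
            for b in basis:
                dot = sum(x * y for x, y in zip(w, b))
                w = [x - dot * y for x, y in zip(w, b)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        basis.append(v)
    return basis

def project(vec, basis):
    """Coordinates of a term vector in the latent space spanned by basis."""
    return [sum(x * y for x, y in zip(vec, b)) for b in basis]

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / (nu * nv) if nu and nv else 0.0

# Columns: good, great, bad, poor (toy term counts, not the study's data).
docs = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
basis = top_right_singular_vectors(docs, k=2)
latent = [project(d, basis) for d in docs]

raw_sim = cosine(docs[1], docs[2])         # "good" vs "great": no shared term
latent_sim = cosine(latent[1], latent[2])  # same pair in the 2-d latent space
cross_sim = cosine(latent[1], latent[4])   # "good" vs "bad" in the latent space
```

In this toy collection the two positive documents share no term, so term indexing sees them as unrelated, while the latent space places them together because "good" and "great" co-occur in another document. A classifier such as nearest centroid could then operate on the `latent` vectors.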

12

A Study on Automatic Classification of Domestic Journal Articles Using Random Forest

김판준 (Silla University) 2019, Vol.36, No.2, pp.57-77 https://doi.org/10.3743/KOSIM.2019.36.2.057

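Abstract (translated from Korean)

Random forest (RF), a representative ensemble technique, was applied to the automatic classification of journal articles in library and information science. In particular, varied experiments were carried out on key factors such as the number of trees, feature selection, and training set size, in terms of the classification performance of automatically assigning subject categories to domestic journal articles. Through these, the study sought ways to optimize RF performance on a real-world imbalanced dataset. The results indicate that, for automatic classification of domestic journal articles, RF can be expected to perform best with a number of trees in the interval 100-1000 (C), a small feature set (10%) selected by the chi-square statistic (CHI), and most of the training set (9-10 years).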

Abstract

Random forest (RF), a representative ensemble technique, was applied to the automatic classification of journal articles in the field of library and information science. In particular, I performed various experiments on key factors such as the number of trees, feature selection, and training set size, evaluating how well subject categories are automatically assigned to domestic journal articles. Through these experiments, I explored ways to optimize the performance of RF on imbalanced datasets in real environments. The results indicate that, for the automatic classification of domestic journal articles, RF can be expected to perform best with a number of trees in the interval 100-1000 (C), a small feature set (10%) selected by the chi-square statistic (CHI), and most of the training set (9-10 years).
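The chi-square feature selection step mentioned above can be sketched as follows. This covers only the selection of a top fraction of terms, not the random forest itself. The documents, labels, and function names are hypothetical, and the study's exact counting and ranking rules are not given in the abstract, so taking the maximum score over categories is an assumption.

```python
import math

def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category table.
    a: in category, has term      b: other categories, has term
    c: in category, lacks term    d: other categories, lacks term"""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom

def select_features(docs, labels, keep_ratio=0.1):
    """Score each term by its best chi-square over categories, keep the top share."""
    vocab = sorted({t for doc in docs for t in doc})
    cats = sorted(set(labels))
    n = len(docs)
    scores = {}
    for term in vocab:
        best = 0.0
        for cat in cats:
            a = sum(1 for doc, y in zip(docs, labels) if term in doc and y == cat)
            b = sum(1 for doc, y in zip(docs, labels) if term in doc and y != cat)
            c = sum(1 for doc, y in zip(docs, labels) if term not in doc and y == cat)
            d = n - a - b - c
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    k = max(1, math.ceil(keep_ratio * len(vocab)))
    ranked = sorted(vocab, key=lambda t: (-scores[t], t))
    return ranked[:k], scores

# Toy documents as term sets with hypothetical category labels.
docs = [
    {"library", "catalog"},
    {"library", "metadata"},
    {"retrieval", "query"},
    {"retrieval", "library"},
]
labels = ["org", "org", "ir", "ir"]
selected, scores = select_features(docs, labels, keep_ratio=0.2)
```

Here `retrieval` occurs in every "ir" document and in no "org" document, so it gets the highest score and is the single term kept at a 20% ratio of the five-term vocabulary.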

13

An Experimental Study on the Reclassification of Author Keywords for Automatic Assignment of Descriptors

김판준 (Silla University) ; 이재윤 (Kyonggi University) 2012, Vol.29, No.2, pp.225-246 https://doi.org/10.3743/KOSIM.2012.29.2.225

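Abstract (translated from Korean)

This study explored whether descriptors (controlled keywords) can be assigned automatically by reclassifying the author keywords (uncontrolled keywords) provided in the search services of major Korean scholarly databases. First, experiments comparing the characteristics of major machine learning classifiers were run to select the optimal classifier and parameters for reclassification. Next, the articles were reclassified according to what was learned from the author keywords assigned to Korean journal articles in the field of reading, so that keywords were additionally assigned. The study then examined whether, for the documents newly added by this reclassification, there is a vocabulary control effect of collocating articles on the same topic, as with descriptors, which are controlled keywords. The results confirmed that reclassifying author keywords can achieve the effect of automatically assigning descriptors.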

Abstract

This study aimed to investigate the possibility of automatic descriptor assignment through the reclassification of author keywords in domestic scholarly databases. In the first stage, we selected the optimal classifier and parameters for the reclassification by comparing the characteristics of machine learning classifiers. In the next stage, after learning the author keywords assigned to selected articles on reading, the author keywords were automatically added to another set of relevant articles. We examined whether the author keyword reclassification had a vocabulary control effect, just as descriptors collocate documents on the same topic. The results showed that author keyword reclassification can serve as a form of automatic descriptor assignment.

14

Constructing Preservation Metadata Packages for Digital Archiving

이승민 (Sookmyung Women's University) 2015, Vol.32, No.3, pp.21-47 https://doi.org/10.3743/KOSIM.2015.32.3.021

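Abstract (translated from Korean)

Building metadata for preservation activities is essential to preserving digital information. At present, however, digital archiving lacks a metadata structure optimized for preserving digital objects. This study therefore constructed metadata packages that support the description and preservation of digital objects, organized around the core processes of digital archiving. The proposed metadata packages consist of 4 top-level elements and 25 detailed elements, and provide the descriptive information needed to preserve digital objects in a manner optimized for each core stage of digital archiving. They are expected to be applicable as a more efficient and practical description method for preserving digital objects than existing information packages.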

Abstract

The construction of preservation metadata is a prerequisite for the preservation of digital information. In current approaches to digital archiving, however, there is no metadata structure optimized to describe preserved digital objects. This research proposed metadata packages that can support the description of digital objects from the perspective of the core processes of digital archiving. The proposed metadata packages consist of 4 wrapper elements and 25 sub-elements. They can provide the detailed descriptions required to preserve digital objects in accordance with the core processes of digital archiving. The proposed metadata packages can therefore be applied to digital archiving as a better approach to describing digital objects than existing information package approaches.

15

A Study on Semantic-Based Feature Expansion for Improving Text Categorization Performance

정은경 (Ewha Womans University) 2009, Vol.26, No.3, pp.261-278 https://doi.org/10.3743/KOSIM.2009.26.3.261

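Abstract (translated from Korean)

In machine learning based text categorization, constructing an optimal feature set is important for improving performance. This study ran feature expansion experiments on author-provided keywords and article titles, which are essential components of journal articles. Feature expansion generally starts from the selected features and draws on a semantic lexical tool such as WordNet. This study expanded features from keywords and article titles using WordNet synonym terms, and the experiments showed that text categorization performance improved markedly compared with results without feature expansion. The factors identified as contributing to this improvement were a refined feature base and synonym expansion based on classification terms. Performance improved both with and without word sense disambiguation. The study thus suggests that synonym feature expansion based on classification terms, using keywords and article titles, is a positive factor in improving text categorization performance.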

Abstract

Identifying optimal feature sets in text categorization (TC) is crucial for improving effectiveness. In this study, experiments on feature expansion were conducted using author-provided keyword sets and article titles from typical scientific journal articles. The tool used for expanding feature sets is WordNet, a lexical database for English words. Given a data set and a lexical tool, this study showed that feature expansion with synonym relationships was significantly effective in improving the results of TC. The experimental results indicated that when feature sets were expanded with synonyms based on classification terms, the effectiveness of TC improved considerably regardless of word sense disambiguation.
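Synonym-based feature expansion of the kind described above might look like the following minimal sketch. The synonym table is a hypothetical stand-in for WordNet synsets, the function name is an assumption, and no word sense disambiguation is applied.

```python
# Hypothetical synonym table standing in for WordNet lookups.
SYNONYMS = {
    "automobile": {"car"},
    "classification": {"categorization"},
}

def expand_features(tokens, synonyms):
    """Append synonyms of each token, keeping original order and avoiding duplicates."""
    expanded = list(tokens)
    seen = set(tokens)
    for token in tokens:
        for syn in sorted(synonyms.get(token, ())):
            if syn not in seen:
                expanded.append(syn)
                seen.add(syn)
    return expanded

features = expand_features(["automobile", "classification"], SYNONYMS)
```

The expanded list keeps the original keywords first, so a downstream classifier can still weight original and added features differently if needed.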

16

Deployment of Standard Subdivision Items in Directory Classification Schemes

김성원 (Chungnam National University) 2008, Vol.25, No.3, pp.357-375 https://doi.org/10.3743/KOSIM.2008.25.3.357

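Abstract (translated from Korean)

With the spread and growing use of the Internet, searching for and obtaining information over the Internet has become the primary mode of information seeking. As Internet searching becomes commonplace, the search services offered by Internet portals grow in importance, and making those services more efficient can translate directly into a more efficient Internet search environment. This paper therefore examines the classification schemes of directory services, which, among the services offered by Internet search portals, select and organize Internet resources. Specifically, it analyzes how directory classification schemes currently deploy the items of the standard subdivisions, which in traditional library classification gather the forms and approaches commonly applicable across subject areas. Based on this analysis, it proposes a way to deploy the items included in the standard subdivisions of traditional library classification in directory services.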

Abstract

With the rapid spread and active use of the Internet, information search and retrieval through the Internet has become a primary form of information access. This ubiquity of information access through the Internet increases the significance of the search performance offered by Internet portals, since optimizing portal search performance has strong implications for effective access to information through the Internet in general. In this context, this paper investigates the classification schemes used in the directory services of Internet portals, which provide selected and organized access to Internet information. First, the author analyzes how directory classification schemes deploy the standard subdivision topics used in traditional library classification systems, with emphasis on the table composed of forms and approaches that are applicable to diverse subject areas. Then, based on this analysis, the author proposes a method of applying certain subdivisions of the standard subdivisions to the directory services of Internet portals.

17

A Study on Automatic Classification of Domestic Journal Articles Based on Machine Learning

김판준 (Silla University) 2018, Vol.35, No.2, pp.37-62 https://doi.org/10.3743/KOSIM.2018.35.2.037

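Abstract (translated from Korean)

This study examined the factors that affect the performance of machine learning based automatic classification, using a collection of domestic journal articles in library and information science. In particular, varied experiments examined key factors, including term weighting schemes, training set size, classification algorithms, and category assignment methods, in terms of the performance of automatically assigning subject categories to articles in the Journal of the Korean Society for Information Management. The results show that it is effective to apply each factor appropriately to the classification environment and the characteristics of the collection, and that fairly good performance can be obtained with simpler models. In addition, for domestic journal articles, multi-label classification, which assigns one or more categories to an article, matches the real environment. With this in mind, the study proposed an optimal classification model that uses a simple, fast classification algorithm and a small training set.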

Abstract

This study examined the factors affecting the performance of machine learning based automatic classification for domestic journal articles in the field of LIS. In particular, in terms of the performance of automatically assigning class labels to articles in the Journal of the Korean Society for Information Management, I investigated the characteristics of the key factors (weighting schemes, training set size, classification algorithms, label assignment methods) through a range of experiments. The results show that it is effective to apply each factor appropriately according to the classification environment and the characteristics of the document set, and that fairly good performance can be obtained with a simpler model. In addition, the classification of domestic journal articles is best treated as multi-label classification, which assigns one or more categories to an article. Therefore, I proposed an optimal classification model that uses a simple and fast classification algorithm and a small training set, taking this environment into account.
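The combination of term weighting, a simple classifier, and multi-label assignment can be sketched in a few lines. The abstract does not name the weighting scheme, algorithm, or assignment rule that was finally chosen, so the TF-IDF weighting, the centroid classifier, the 0.8 score ratio, and the toy documents and labels below are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def fit_idf(docs):
    """Inverse document frequency per term (log(N/df) + 1)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log(n / df[t]) + 1.0 for t in df}

def weight(doc, idf):
    """TF-IDF weights for one tokenized document; unseen terms are ignored."""
    return {t: c * idf[t] for t, c in Counter(doc).items() if t in idf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train_centroids(vectors, label_sets):
    """Sum the vectors of every document carrying a label (one centroid per label)."""
    sums = defaultdict(lambda: defaultdict(float))
    for vec, labels in zip(vectors, label_sets):
        for label in labels:
            for t, w in vec.items():
                sums[label][t] += w
    return {label: dict(vec) for label, vec in sums.items()}

def assign_labels(vec, centroids, ratio=0.8):
    """Multi-label rule: keep every label scoring within `ratio` of the best one."""
    scores = {label: cosine(vec, c) for label, c in centroids.items()}
    best = max(scores.values())
    if best == 0.0:
        return []
    return sorted(label for label, s in scores.items() if s >= ratio * best)

# Toy training set; the third article carries two labels.
train_docs = [
    ["retrieval", "query", "ranking"],
    ["metadata", "catalog", "schema"],
    ["retrieval", "metadata"],
]
train_labels = [{"IR"}, {"ORG"}, {"IR", "ORG"}]

idf = fit_idf(train_docs)
centroids = train_centroids([weight(d, idf) for d in train_docs], train_labels)

single = assign_labels(weight(["retrieval", "query"], idf), centroids)
multi = assign_labels(weight(["retrieval", "query", "metadata", "catalog"], idf), centroids)
```

An article that matches one category clearly receives a single label, while one that draws equally on both vocabularies receives both, which is the behaviour a multi-label setting calls for.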

18

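Analysis of and Suggestions for Collection Development Policies of Metropolitan Public Libraries

윤희윤 (Department of Library and Information Science, Daegu University) ; 김종애 (Department of Library and Information Science, Kyonggi University) ; 오선경 (Gyeongsang National University) 2020, Vol.37, No.3, pp.51-75 https://doi.org/10.3743/KOSIM.2020.37.3.051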

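Abstract (translated from Korean)

Every public library is a collection-based knowledge and cultural service institution. The collection development policy is therefore both the essential policy to establish first and a strategic menu. In particular, regional central libraries, which must carry out statutory duties as comprehensive knowledge and information centers and shared preservation repositories at the city and provincial level, need to establish and apply an optimal collection development policy. This study analyzed the collection development policies of metropolitan public libraries in major developed countries, and the collection development guidelines (drafts) and regulations of regional central libraries in each region of Korea. The analysis found that policies in most developed countries are thorough in structure and content, whereas those in Korea amount to working guidelines without a formal policy document. All regional central libraries should therefore establish and document a collection development policy, premised on recognition of the importance of collection development, future-oriented thinking, and strategic judgment; the study proposed basic principles and a structure for doing so.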

Abstract

All public libraries are collection-based knowledge and cultural service institutions. To this end, a collection development policy is an essential policy and a strategic menu that every library should establish first. Regional central libraries should establish and apply optimal collection development policies to carry out their statutory duties as knowledge and information centers and cooperative preservation facilities of the cities and provinces. Thus, this study analyzed and compared in detail the collection development guidelines (draft) and regulations of regional central libraries in Korea and the collection development policies of metropolitan public libraries abroad. The results showed that the policies of domestic regional central libraries were simply practical guidelines, while those in most developed countries were substantial in format and content. All regional central libraries should establish and document collection development policies based on the importance of collection development, future-oriented thinking, and strategic judgment. The study also suggested basic principles and a format for this purpose.

19

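Performance Comparison of Automatic Classification by Word Embeddings of Book Titles

이용구 (Department of Library and Information Science, Kyungpook National University) 2023, Vol.40, No.4, pp.307-327 https://doi.org/10.3743/KOSIM.2023.40.4.307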

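Abstract (translated from Korean)

To analyze the effect of word embedding on titles, which are short texts, this study used the Word2vec, GloVe, and fastText models to generate embedding vectors from book titles and used them as classification features for automatic classification. The classifier was the k-nearest neighbors (kNN) algorithm, and the categories were the divisions of DDC class 300 that libraries had assigned to the books. In the experiments, the Skip-gram models of Word2vec and fastText outperformed TF-IDF features with the kNN classifier. In hyperparameter optimization experiments across the three models, the fastText Skip-gram model performed well overall. For this model in particular, performance improved with hierarchical softmax and with larger embedding dimensions. fastText uses n-grams to generate embeddings for substrings or subwords, which was found to raise recall. By contrast, the Word2vec Skip-gram model performed well mainly at a low dimension (size 300) and small negative sampling sizes (3 or 5).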

Abstract

To analyze the impact of word embedding on book titles, this study used word embedding models (Word2vec, GloVe, fastText) to generate embedding vectors from book titles. These vectors were then used as classification features for automatic classification. The classifier was the k-nearest neighbors (kNN) algorithm, and the categories for automatic classification were the divisions of Dewey Decimal Classification (DDC) main class 300 that libraries had assigned to the books. In the automatic classification experiment applying word embeddings to book titles, the Skip-gram architectures of Word2vec and fastText gave better kNN classification performance than TF-IDF features. In the hyperparameter optimization experiments across the three models, the Skip-gram architecture of the fastText model performed well overall. Specifically, this model performed better with hierarchical softmax and larger embedding dimensions. In terms of performance, fastText can generate embeddings for substrings or subwords using the n-gram method, which was shown to increase recall. The Skip-gram architecture of the Word2vec model generally performed well at low dimensions (size 300) and with small negative sampling sizes (3 or 5).
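The pipeline of title embedding followed by kNN classification can be sketched as below. The word vectors are hand-made toy values standing in for a trained Word2vec, GloVe, or fastText model, averaging word vectors into a title vector is an assumption about how titles were represented, and the titles and DDC labels are hypothetical.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional word vectors standing in for a trained embedding model.
EMB = {
    "market": [0.9, 0.1, 0.0],
    "trade": [0.8, 0.2, 0.1],
    "economy": [1.0, 0.0, 0.1],
    "school": [0.1, 0.9, 0.0],
    "teaching": [0.0, 1.0, 0.1],
    "curriculum": [0.1, 0.8, 0.2],
}

def title_vector(title, emb):
    """Average the vectors of known words; None if no word is in the vocabulary."""
    vecs = [emb[w] for w in title.lower().split() if w in emb]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / (nu * nv) if nu and nv else 0.0

def knn_predict(title, train, emb, k=3):
    """Majority vote among the k training titles most similar to the query title."""
    query = title_vector(title, emb)
    if query is None:
        return None
    scored = []
    for train_title, label in train:
        vec = title_vector(train_title, emb)
        if vec is not None:
            scored.append((cosine(query, vec), label))
    top = sorted(scored, key=lambda pair: -pair[0])[:k]
    votes = Counter(label for _, label in top)
    return votes.most_common(1)[0][0]

# Toy training titles with hypothetical DDC 300-class divisions.
train = [
    ("market economy", "330"),
    ("trade market", "330"),
    ("school curriculum", "370"),
    ("teaching school", "370"),
]

pred_econ = knn_predict("economy trade", train, EMB)
pred_edu = knn_predict("teaching curriculum", train, EMB)
pred_oov = knn_predict("quantum gravity", train, EMB)
```

A title made entirely of out-of-vocabulary words yields no vector in this sketch, which is the gap that subword embeddings such as those of fastText are meant to narrow.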

20

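Content Analysis of Local Governments' Reading Promotion Ordinances

홍은성 (Department of Library and Information Science, Chonnam National University) ; 장우권 (Chonnam National University) 2015, Vol.32, No.4, pp.107-135 https://doi.org/10.3743/KOSIM.2015.32.4.107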

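Abstract (translated from Korean)

This study investigates and analyzes the status and content of the enactment and enforcement of ordinances for promoting reading culture, which are local statutes of Korean local governments, and proposes effective improvements to the operation of ordinances and rules. To this end, it reviewed the literature and surveyed and analyzed the relevant ordinances. The findings are as follows. 1) The reading-related local statutes in operation across the 245 metropolitan and basic local governments nationwide comprise 77 ordinances and 7 rules. 2) The names of the ordinances and rules of local governments and local education authorities vary widely. 3) The components of the content vary with the name of the ordinance or rule, and even statutes bearing the same name have different components. 4) The reading-related local statutes repealed to date comprise 10 ordinances and 2 directives. Accordingly, the study proposes the following measures to invigorate reading culture promotion policy. 1) Awareness should be improved by publicizing reading promotion policy. 2) Ordinances should be given names that best fit each local government's reading promotion environment, and the content of ordinances and rules should be consistent. 3) Before an ordinance is repealed, the problems that arise after repeal should be examined closely, the opinions of residents and experts fully gathered, and a replacement local statute enacted.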

Abstract

The purpose of this study is to investigate and analyze the current state of the enactment and enforcement of ordinances for reading culture promotion, which are local statutes of local governments in Korea, and to suggest effective improvements to the operation of ordinances and regulations. To this end, a literature review and an analysis of the ordinances were conducted. The results are as follows. 1) Among the 245 metropolitan and basic local governments, the reading-related local statutes comprised 77 ordinances and 7 regulations. 2) The ordinances and regulations of local governments and local education authorities are named in various ways. 3) The composition of ordinances and regulations was not systematic: their contents differed by local government according to the name of the ordinance, and they generally overlapped with similar contents. 4) The reading-related local statutes abolished to date comprised 10 ordinances and 2 official orders. This study suggests the following measures to vitalize reading culture promotion policy. 1) Awareness needs to be improved by publicizing the reading promotion policy. 2) Local statutes and ordinances need to be given optimal names that reflect each local government's reading promotion environment, and the contents of reading-related ordinances and regulations need to be consistent. 3) Replacement local statutes need to be established after thoroughly examining the problems of abolishing an ordinance and gathering sufficient opinions from residents and specialists before abolition.
