Investigating an Automatic Method for Summarizing and Presenting a Video Speech Using Acoustic Features

정보관리학회지 / Journal of the Korean Society for Information Management, pISSN 1013-0799, eISSN 2586-2073
2012, v.29 no.4, pp.191-208
https://doi.org/10.3743/KOSIM.2012.29.4.191
Kim, Hyun Hee (Myongji University)

Abstract

Two important aspects of generating a speech summary are extracting the key content from the speech and presenting the extracted content effectively. For the automatic generation of speech summaries of lecture material, this study examined whether a summary can be generated from three acoustic features of speech that remain available even when no transcript is provided, namely speaking rate, pitch (how high or low the voice is), and intensity (how loud the voice is), and investigated which of these factors can be used most efficiently. The analysis showed that intensity, measured as the difference between the maximum and minimum dB values, was the most efficient factor. To examine the efficiency and characteristics of this intensity-based method, we compared it with a text keyword-based method in terms of summary quality and analyzed the relationship between the weights the two methods assign to each segment (sentence). We then analyzed, from the users' perspective, the characteristics of presenting the extracted key segments in audio or in text form, and on this basis proposed a way to efficiently extract and present speech summaries using acoustic features.
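To make the intensity-based scoring concrete, below is a minimal sketch of how a per-segment dB range (maximum dB minus minimum dB) could be computed and used to rank segments. This is an illustration under stated assumptions, not the study's procedure: the paper measured intensity with Praat, whereas this sketch uses plain NumPy over a 16-bit mono WAV file, and the frame length, segment boundaries, and top-k cutoff are hypothetical.

```python
import wave
import numpy as np

def read_mono_wav(path):
    # Assumption: the file is 16-bit PCM, mono. Returns float samples and sample rate.
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    return samples.astype(np.float64), rate

def db_range(samples, rate, frame_ms=50):
    # Intensity range of one segment: max frame-level dB minus min frame-level dB.
    frame_len = max(1, int(rate * frame_ms / 1000))
    n_frames = max(1, len(samples) // frame_len)
    frames = samples[: n_frames * frame_len].reshape(n_frames, -1)
    rms = np.sqrt(np.mean(frames ** 2, axis=1)) + 1e-10  # avoid log(0) on silent frames
    db = 20.0 * np.log10(rms)
    return float(db.max() - db.min())

def rank_segments(samples, rate, boundaries_sec, top_k=5):
    # Score each (start, end) segment, given in seconds, by its dB range; return the top_k.
    scored = []
    for idx, (start, end) in enumerate(boundaries_sec):
        seg = samples[int(start * rate): int(end * rate)]
        scored.append((idx, db_range(seg, rate)))
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]

# Hypothetical usage with made-up sentence boundaries:
# samples, rate = read_mono_wav("lecture.wav")
# print(rank_segments(samples, rate, [(0.0, 6.2), (6.2, 13.5), (13.5, 21.0)], top_k=2))
```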

Keywords
speech summarization, acoustic features, prosodic features, TED Talks, Praat, video, pitch, intensity, intrinsic evaluation, speaking rate

Abstract

Two fundamental aspects of speech summary generation are the extraction of key speech content and the style of presentation of the extracted speech synopses. We first investigated whether acoustic features (speaking rate, pitch pattern, and intensity) are equally important and, if not, which one can be effectively modeled to compute the significance of segments for lecture summarization. We found that intensity (the difference between the maximum and minimum dB) is the most effective factor for speech summarization. We then evaluated the intensity-based method by comparing it with a keyword-based method, both in terms of which method produces better speech summaries and in terms of how similar the weight values the two methods assign to each segment are. Finally, we investigated how to present speech summaries to viewers. On this basis, we suggest how to efficiently extract key segments from a speech video using acoustic features and how to present the extracted segments to viewers.
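The similarity between the two per-segment weightings can be quantified in several ways; a rank correlation is one plausible choice, sketched below with made-up placeholder weights (the study's actual data and the statistic it reports may differ).

```python
from scipy.stats import spearmanr

# Hypothetical per-segment weights (placeholders, not data from the study).
# intensity_w[i]: dB-range score of segment i (acoustic method)
# keyword_w[i]:   keyword-frequency score of segment i (text-based method)
intensity_w = [11.2, 7.4, 15.8, 9.1, 13.5, 6.0]
keyword_w = [0.42, 0.18, 0.55, 0.30, 0.47, 0.12]

# A Spearman's rho close to 1 would mean the two methods order the segments
# similarly, suggesting the acoustic score could stand in for the keyword
# score when no transcript is available.
rho, p_value = spearmanr(intensity_w, keyword_w)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```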


