
[1] LIU Jiao, CUI Rongyi, ZHAO Yahui, et al. An analysis method of cross-lingual literature similarity[J]. Journal of Yanbian University (Natural Science Edition), 2016, 42(02): 151-155.


An analysis method of cross-lingual literature similarity

Journal of Yanbian University (Natural Science Edition) [ISSN: 1004-4353 / CN: 22-1191/N] Volume: 42; Issue: 2016, No. 02; Pages: 151-155; Section: Applied Science Research; Publication date: 2016-06-20

Author(s):: LIU Jiao; CUI Rongyi; ZHAO Yahui; ZHANG Zhenguo*; Intelligent Information Processing Lab., Dept. of Computer Science & Technology, College of Engineering, Yanbian University, Yanji 133002, Jilin, China

Keywords:: multilingual topic correlation model; cross-lingual; semantic similarity

CLC number:: TP391

Document code:: A

Abstract:: Sentence-aligned documents in different languages are analysed, and a method for computing cross-lingual document similarity based on a multilingual topic model is proposed. First, a data model is built for the collected documents in three languages (Chinese, English and Korean): a term-document matrix is constructed through preprocessing that comprises word segmentation, correction and selection of the segmentation results, and term-weight calculation. Second, a multilingual topic semantic space is established, and the documents, each available in all three languages, are mapped into this space, in which every topic is composed of terms from all three languages. Finally, the similarity between documents in different languages is computed by comparing their corresponding topics in the semantic space. Experimental results show that the similarity between documents in different languages can be computed directly in the semantic space with an accuracy above 90%, which verifies the effectiveness of the proposed method for cross-lingual document similarity computation.
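The pipeline summarised in the abstract (term-document matrix, shared semantic space, similarity in that space) can be illustrated with a small sketch. The abstract does not specify the authors' topic model, so this example swaps in a simpler, well-known technique: cross-language latent semantic indexing, i.e. a truncated SVD of the per-language TF-IDF matrices stacked over aligned documents. The toy corpus, the smoothed IDF formula, the choice of `k = 2`, and the helper names `fit_tfidf`, `weight`, `build_space`, `project` and `cosine` are all assumptions made for illustration; word segmentation is assumed to have been done already.

```python
import numpy as np

def fit_tfidf(docs):
    """Return (vocab, idf, terms-by-docs TF-IDF matrix) for tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {t: i for i, t in enumerate(vocab)}
    tf = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for tok in doc:
            tf[index[tok], j] += 1.0
    df = np.count_nonzero(tf, axis=1)
    idf = np.log((1.0 + len(docs)) / (1.0 + df)) + 1.0  # smoothed IDF (assumed)
    return vocab, idf, tf * idf[:, None]

def weight(tokens, vocab, idf):
    """TF-IDF vector of one tokenized document; unknown tokens are ignored."""
    index = {t: i for i, t in enumerate(vocab)}
    vec = np.zeros(len(vocab))
    for tok in tokens:
        if tok in index:
            vec[index[tok]] += 1.0
    return vec * idf

def build_space(corpora, k):
    """Stack per-language matrices over aligned docs and keep k SVD dimensions."""
    models, offsets, blocks, start = {}, {}, [], 0
    for lang, docs in corpora.items():
        vocab, idf, mat = fit_tfidf(docs)
        models[lang] = (vocab, idf)
        offsets[lang] = (start, start + len(vocab))  # row range of this language
        start += len(vocab)
        blocks.append(mat)
    U, _, _ = np.linalg.svd(np.vstack(blocks), full_matrices=False)
    return models, offsets, U[:, :k]

def project(tokens, lang, space):
    """Map a monolingual document into the shared k-dimensional space."""
    models, offsets, U_k = space
    lo, hi = offsets[lang]
    vocab, idf = models[lang]
    return U_k[lo:hi].T @ weight(tokens, vocab, idf)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy aligned corpus: document j is the same text in all three languages.
corpora = {
    "zh": [["猫", "动物"], ["狗", "动物"], ["股票", "市场"], ["银行", "市场"]],
    "en": [["cat", "animal"], ["dog", "animal"], ["stock", "market"], ["bank", "market"]],
    "ko": [["고양이", "동물"], ["개", "동물"], ["주식", "시장"], ["은행", "시장"]],
}
space = build_space(corpora, k=2)

en_doc = project(["cat", "animal"], "en", space)
sim_same = cosine(en_doc, project(["猫", "动物"], "zh", space))   # same topic
sim_diff = cosine(en_doc, project(["股票", "市场"], "zh", space))  # different topic
print(sim_same, sim_diff)
```

Because every dimension of the shared space mixes terms from all three languages, a document in one language can be compared directly with a document in another without translating either. A real system would need a much larger aligned corpus and a tuned number of dimensions.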


Memo

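Received: 2016-01-04. *Corresponding author: ZHANG Zhenguo (born 1981), male, lecturer; research interests: pattern recognition and image processing. Funding: Jilin Province Science and Technology Development Plan Project (20130101179JC-18); supported by the Jilin Province Public Computing Platform; Yanbian University Science and Technology Development Plan Project (Yanda Kehezi [2014] No. 16).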

Last Update: 2016-03-20