
[1] ZHANG Lei, CUI Rongyi*. A word order sensitive similarity measure based on edit distance[J]. Journal of Yanbian University (Natural Science Edition), 2020, 46(02): 140-144.


A word order sensitive similarity measure based on edit distance

Journal of Yanbian University (Natural Science Edition) [ISSN: 1004-4353 / CN: 22-1191/N] Volume: 46; Issue: 02, 2020; Pages: 140-144; Column: Applied Science Research; Publication date: 2020-08-18

Article ID: 1004-4353(2020)02-0140-05

Author(s): ZHANG Lei; CUI Rongyi*; (College of Engineering, Yanbian University, Yanji 133002, Jilin, China)

Keywords: text similarity; bag-of-words model; edit distance; word order

CLC number: TP391.1

Document code: A

Abstract: Cosine similarity over a bag-of-words model cannot reflect differences in the order of terms. To address this shortcoming, this paper proposes a document similarity measure based on edit distance. First, the problems of the tf-idf bag-of-words model and of the cosine similarity calculation are analyzed. Second, the Jaccard coefficient and the edit distance are used to describe the order differences between the words in the common substrings of two strings, and a word order sensitive similarity calculation method is proposed. Finally, experimental data are used to verify the effectiveness of the algorithm. The results show that the F1 scores of the proposed method at Top1 and Top3 are 0.0825 and 0.1126 higher, respectively, than those of the original cosine similarity method, indicating that the method can effectively improve the performance of an information retrieval system and has good application value.
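
The abstract names the ingredients of the method (bag-of-words cosine similarity, the Jaccard coefficient, and edit distance) but does not state the formula that combines them. The following is a minimal sketch of the general idea only, not the authors' method: it uses raw term counts instead of tf-idf weights, and the function `order_sensitive_similarity` together with its combination rule (Jaccard overlap multiplied by a normalized token-level edit-distance agreement) is a hypothetical choice made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a_tokens, b_tokens):
    """Cosine similarity over raw term counts (no tf-idf weighting here)."""
    ca, cb = Counter(a_tokens), Counter(b_tokens)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a_tokens, b_tokens):
    """Jaccard coefficient of the two vocabularies."""
    sa, sb = set(a_tokens), set(b_tokens)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def edit_distance(a_tokens, b_tokens):
    """Levenshtein distance computed over tokens instead of characters."""
    prev = list(range(len(b_tokens) + 1))
    for i, x in enumerate(a_tokens, 1):
        cur = [i]
        for j, y in enumerate(b_tokens, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def order_sensitive_similarity(a_tokens, b_tokens):
    """Hypothetical combination (an assumption, not the paper's formula):
    Jaccard overlap scaled by how well the shared words agree in order."""
    shared = set(a_tokens) & set(b_tokens)
    a_common = [t for t in a_tokens if t in shared]
    b_common = [t for t in b_tokens if t in shared]
    longest = max(len(a_common), len(b_common))
    if longest == 0:
        return 0.0
    order_agreement = 1 - edit_distance(a_common, b_common) / longest
    return jaccard(a_tokens, b_tokens) * order_agreement


a = "dog bites man".split()
b = "man bites dog".split()
print(round(cosine_similarity(a, b), 3))           # 1.0: word order is invisible
print(round(order_sensitive_similarity(a, b), 3))  # 0.333: reordering is penalized
```

The two example sentences contain the same words, so cosine similarity treats them as identical, while the order-aware score drops because two token substitutions are needed to turn one sequence into the other.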

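Memo

Received: 2020-04-14
*Corresponding author: CUI Rongyi (born 1962), male, Ph.D., professor; research interests: natural language processing, pattern recognition, and intelligent computing.

Last Update: 2020-08-18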