LI Danyang, ZHAO Yahui*, LUO Mengjiang, et al. Query text proofreading method of professional courses based on trie tree language model[J]. Journal of Yanbian University, 2020, 46(03): 260-264.
Query text proofreading method of professional courses based on trie tree language model
- Original title (Chinese):
- 基于字典树语言模型的专业课查询文本校对方法
- Article ID:
- 1004-4353(2020)03-0260-05
- Keywords:
- trie tree; text proofreading; language model; automatic correction
- CLC number:
- TP391.41
- Document code:
- A
- Abstract:
- To address the low accuracy of Chinese text proofreading, this paper proposes a proofreading method for professional-course query text based on a trie tree model. First, erroneous keywords are fuzzy-matched by computing the edit distance between the erroneous text and the candidate matching text. Second, a trie tree language model is used to build the search tree, which improves query efficiency. Finally, the proofreading results under different text similarity thresholds are compared to select the best threshold. At the best threshold (0.5), the proposed model was compared with the traditional Pinyin model and the N-gram model on question proofreading. The precision (77.91%), recall (67%) and F value (72.04%) of the proposed method are 5.69%, 23.67% and 11.57% higher, respectively, than those of the traditional Pinyin model correction method, and 0.64%, 10.33% and 7.89% higher than those of the N-gram model correction method. The proposed method therefore has good application value for proofreading professional-course query text.
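The pipeline described in the abstract (edit-distance fuzzy matching, a trie as the search tree, and a similarity threshold) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define the similarity measure, so `1 - distance / max(len)` is an assumption, and the course names, `TrieNode`, `build_trie`, `fuzzy_search` and `correct` are all hypothetical names chosen for the example. Only the threshold value 0.5 comes from the abstract.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict[str, "TrieNode"] = field(default_factory=dict)
    word: str | None = None  # set only on nodes that end a stored keyword


def build_trie(keywords: list[str]) -> TrieNode:
    """Insert every keyword into a character-level trie."""
    root = TrieNode()
    for kw in keywords:
        node = root
        for ch in kw:
            node = node.children.setdefault(ch, TrieNode())
        node.word = kw
    return root


def fuzzy_search(root: TrieNode, query: str, max_dist: int) -> list[tuple[str, int]]:
    """Return (keyword, edit distance) pairs with distance <= max_dist.

    One Levenshtein row is computed per trie node, so keywords that share
    a prefix also share the work done for that prefix.
    """
    results: list[tuple[str, int]] = []

    def walk(node: TrieNode, ch: str, prev_row: list[int]) -> None:
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1,          # insertion
                           prev_row[j] + 1,         # deletion
                           prev_row[j - 1] + cost))  # match / substitution
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        if min(row) <= max_dist:  # prune branches that can no longer match
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    first_row = list(range(len(query) + 1))
    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return results


def correct(root: TrieNode, query: str, threshold: float = 0.5) -> tuple[str, float] | None:
    """Return the most similar keyword and its similarity, or None.

    Assumed similarity: 1 - distance / max(len(query), len(keyword)).
    Requires 0 < threshold <= 1.
    """
    # Any keyword reaching the threshold has distance <= len(query) * (1 - t) / t.
    max_dist = int(len(query) * (1 - threshold) / threshold)
    scored = [(w, 1 - d / max(len(query), len(w)))
              for w, d in fuzzy_search(root, query, max_dist)]
    scored = [(w, s) for w, s in scored if s >= threshold]
    return max(scored, key=lambda ws: ws[1]) if scored else None


# Hypothetical course-name vocabulary, for illustration only.
COURSES = ["数据结构", "操作系统", "计算机网络", "数据库原理"]
trie = build_trie(COURSES)
print(correct(trie, "数剧结构"))  # one wrong character out of four -> ('数据结构', 0.75)
print(correct(trie, "天气预报"))  # nothing reaches the 0.5 threshold -> None
```

Walking the trie while carrying the previous edit-distance row is one common way to combine the two ideas: the trie avoids recomputing the distance for shared prefixes, and the threshold decides whether the best candidate is accepted as a correction.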
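Memo
Received: 2020-06-22. *Corresponding author: ZHAO Yahui (born 1974), female, professor; research interest: natural language processing.
Funding: State Language Commission "13th Five-Year Plan" Scientific Research Program (YB135-76); Yanbian University Research Project for World-Class Discipline Construction in Foreign Languages and Literatures (18YLPY13, 18YLPY14)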