ZHANG Bolun, ZHAO Yahui, JIANG Kexin, et al. Text classification method based on knowledge enhancement[J]. Journal of Yanbian University, 2024, (02): 78-86.
Text classification method based on knowledge enhancement
- Original title (Chinese):
- 基于知识增强的文本分类方法
- Article ID:
- 1004-4353(2024)02-0078-09
- Keywords:
- deep learning; neural networks; text classification; knowledge enhancement; feature extraction
- CLC number:
- TP391.1
- Document code:
- A
- Abstract:
- To address inaccurate text classification caused by poor-quality data, data imbalance, and small datasets, a text classification algorithm based on knowledge enhancement is proposed. First, the algorithm augments the dataset with external knowledge. Second, the original text and the external knowledge are embedded using GloVe word vectors, and text features are extracted with CNN, LSTM, and BERT models. Third, the features extracted from the original text and from the external knowledge text are fused to obtain the final text features. Finally, the fused features are fed into a multilayer perceptron for classification, which yields the final classification result. Experiments on several datasets show that, relative to the baseline models, CNN(KB), LSTM(KB), and BERT(KB) improve classification accuracy by 5.01%, 7.92%, and 1.5%, respectively, on the SST-5 dataset; LSTM(KB) and BERT(KB) improve accuracy by 1.76% and 1.29%, respectively, on the SST-2 dataset; and CNN(KB), LSTM(KB), and BERT(KB) improve accuracy by 0.97%, 2.87%, and 0.76%, respectively, on the IMDB dataset. These results indicate that the algorithm can effectively improve text classification accuracy and may serve as a reference for text classification applications in different fields.
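The four steps in the abstract (knowledge augmentation, embedding and feature extraction, feature fusion, classification by a multilayer perceptron) can be illustrated with a small sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: `TOY_KNOWLEDGE` is a hypothetical stand-in for the external knowledge source, `embed` replaces the GloVe lookup with seeded random vectors, `encode` uses mean pooling in place of the CNN, LSTM, or BERT encoder, fusion is done by concatenation (the abstract does not specify the fusion operator), and the perceptron weights are random rather than trained.

```python
import math
import random

EMBED_DIM = 8     # toy size; an assumption, not a value from the paper
HIDDEN_DIM = 16   # hidden width of the toy perceptron (assumption)
NUM_CLASSES = 5   # e.g. a five-class task such as SST-5

# Hypothetical external knowledge: word -> related concepts.
TOY_KNOWLEDGE = {
    "film": ["movie", "cinema"],
    "great": ["excellent", "positive"],
}


def embed(word):
    """Deterministic toy vector standing in for a GloVe lookup."""
    rng = random.Random(word)  # the same word always gives the same vector
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def encode(tokens):
    """Mean-pool word vectors; a placeholder for a CNN, LSTM, or BERT encoder."""
    if not tokens:
        return [0.0] * EMBED_DIM
    vectors = [embed(t) for t in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def retrieve_knowledge(tokens):
    """Collect the knowledge terms linked to each token (step 1)."""
    related = []
    for token in tokens:
        related.extend(TOY_KNOWLEDGE.get(token, []))
    return related


def make_layer(n_in, n_out, rng):
    """Create an untrained dense layer as (weights, biases)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply a dense layer to the input vector x."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(text, layers):
    tokens = text.lower().split()
    text_feat = encode(tokens)                      # step 2: original text features
    know_feat = encode(retrieve_knowledge(tokens))  # step 2: knowledge features
    fused = text_feat + know_feat                   # step 3: fusion by concatenation
    hidden = [max(0.0, h) for h in linear(fused, layers[0])]  # ReLU hidden layer
    return softmax(linear(hidden, layers[1]))       # step 4: class probabilities


rng = random.Random(0)
layers = [
    make_layer(2 * EMBED_DIM, HIDDEN_DIM, rng),
    make_layer(HIDDEN_DIM, NUM_CLASSES, rng),
]
probs = classify("a great film", layers)
print(len(probs), sum(probs))
```

Because the weights are untrained, the printed probabilities carry no meaning; the sketch only shows how the original-text branch and the knowledge branch are encoded separately and joined before classification.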
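Memo
Submission date: 2023-12-04
Funding: National Language Commission "13th Five-Year Plan" research project (YB135-76); Yanbian University first-class discipline construction project in foreign languages and literature (18YLPY13)
First author: ZHANG Bolun (b. 2001), female, master's student; research interest: natural language processing.
Corresponding author: ZHAO Yahui (b. 1974), female, master's degree, professor; research interests: intelligent computing and natural language processing.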