LI Lujun, ZHAO Yun, CUI Rongyi*, et al. An approach to analysis of brief summary of university graduation thesis based on machine learning[J]. Journal of Yanbian University, 2021, 47(01): 80-87.
A machine-learning-based method for analyzing university graduation thesis topic information
- Title:
- An approach to analysis of brief summary of university graduation thesis based on machine learning
- Article ID:
- 1004-4353(2021)01-0080-08
- Keywords:
- graduation thesis analysis; text clustering; DBSCAN clustering algorithm; Rand index
- Classification number (CLC):
- TP391.1
- Document code:
- A
- Abstract:
- To help teachers understand in detail how graduation thesis topics are distributed, and to guide students toward reasonable topic choices, we propose a machine-learning-based method for analyzing university graduation thesis topic information. First, the collected topic-information texts are preprocessed by normalization, de-duplication, removal of irrelevant data, and word segmentation, and a dictionary of professional terms is built through manual screening. Second, feature words are determined from inverse document frequency and the professional terms; their weights are computed with the TF-IDF algorithm and a professional-term factor, and normalized document vectors are constructed. Finally, the DBSCAN algorithm is used for clustering and the Rand index for evaluating the clustering, and the Top-K high-frequency feature words are extracted as the keywords that describe each cluster. Experimental results show that the method can effectively analyze the distribution of thesis topic content and thus provide a sound basis for evaluating and designing graduation thesis topics.
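The pipeline outlined in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract does not give the exact form of the professional-term factor, the IDF variant, the distance measure, or the DBSCAN parameters, so the multiplicative `term_factor`, the smoothed IDF, the Euclidean distance, and the values of `eps` and `min_pts` below are all assumptions, as are the function names and the toy English tokens. The feature-word selection step is omitted, so every token is treated as a feature word.

```python
import math
from collections import Counter
from itertools import combinations

def tfidf_vectors(docs, terms, term_factor=2.0):
    """Return (vocab, vectors): L2-normalized TF-IDF vectors for tokenized docs.
    Weights of words in `terms` are multiplied by `term_factor` (an assumed form)."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    vectors = []
    for d in docs:
        tf = Counter(d)
        vec = []
        for w in vocab:
            idf = math.log((1 + n) / (1 + df[w])) + 1  # smoothed IDF (assumption)
            weight = (tf[w] / len(d)) * idf
            vec.append(weight * term_factor if w in terms else weight)
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm > 0 else vec)
    return vocab, vectors

def dbscan(vectors, eps, min_pts):
    """Plain DBSCAN with Euclidean distance; returns cluster ids, -1 for noise."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    n = len(vectors)
    neighbors = [[j for j in range(n) if dist(vectors[i], vectors[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # expand only through core points
                queue.extend(neighbors[j])
    return labels

def rand_index(a, b):
    """Fraction of item pairs on which two labelings agree (same vs. different)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def top_k_words(docs, labels, cluster, k):
    """The k most frequent tokens among the documents assigned to `cluster`."""
    counts = Counter(w for d, l in zip(docs, labels) if l == cluster for w in d)
    return [w for w, _ in counts.most_common(k)]

# Toy data (hypothetical): four already-segmented topic descriptions.
docs = [
    ["image", "recognition", "neural", "network"],
    ["neural", "network", "image", "classification"],
    ["database", "web", "system", "design"],
    ["web", "system", "database", "management"],
]
terms = {"neural", "network", "database"}  # stand-in for the term dictionary
vocab, vectors = tfidf_vectors(docs, terms)
labels = dbscan(vectors, eps=0.8, min_pts=2)
score = rand_index(labels, [0, 0, 1, 1])  # compare with a manual grouping
```

In this sketch `rand_index` treats the noise label `-1` like any other cluster, which is one possible convention; a real evaluation would need to decide how unclustered topics are scored.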
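Memo
Received: 2020-10-22
*Corresponding author: CUI Rongyi (born 1962), male, PhD, professor; research interests: natural language processing and pattern recognition.
Funding: Jilin Province Higher Education Society Project (JGJX2018D347); Yanbian University Education and Teaching Reform Research Project (Yanda Jiaofa [2020] No. 35)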