XU Baoxin, HUAI Libo*, CUI Rongyi. Naive Bayes algorithm application in the classification of news based on MapReduce[J]. Journal of Yanbian University, 2017, 43(01): 55-59.
- Title:
- Naive Bayes algorithm application in the classification of news based on MapReduce
- Keywords:
- Hadoop; naive Bayes; MapReduce; text classification; news text
- CLC number:
- TP391.3
- Document code:
- A
- Abstract:
- Traditional single-node serial classification algorithms are inefficient when the news data set is large and has many classification attributes. To address this, this paper studies a parallel implementation of the naive Bayes classification algorithm on MapReduce. The news texts are first preprocessed with word segmentation and format conversion; features are then extracted and the classification model is constructed; finally, classification tests are carried out. The test results show that, on large data volumes, the parallelized Bayes algorithm achieves better execution efficiency and higher scalability than the traditional Bayes algorithm.
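The pipeline described in the abstract (count in parallel, then build and apply a naive Bayes model) can be illustrated with a minimal sketch. This is not the authors' implementation: it simulates the map and reduce steps in a single Python process rather than on Hadoop, it assumes the texts are already segmented into tokens, it skips feature selection, and it assumes multinomial naive Bayes with add-one (Laplace) smoothing. All function names (`map_doc`, `reduce_counts`, `train`, `classify`) and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import chain


def map_doc(label, tokens):
    """Map step: emit one document record and one record per token."""
    yield (("doc", label), 1)
    for word in tokens:
        yield (("word", label, word), 1)


def reduce_counts(pairs):
    """Reduce step: sum the values that share a key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def train(docs):
    """Run map then reduce over (label, tokens) pairs and build the model."""
    pairs = chain.from_iterable(map_doc(label, tokens) for label, tokens in docs)
    totals = reduce_counts(pairs)
    doc_counts = {}
    word_counts = defaultdict(Counter)
    for key, n in totals.items():
        if key[0] == "doc":
            doc_counts[key[1]] = n
        else:
            word_counts[key[1]][key[2]] = n
    vocab = {w for counts in word_counts.values() for w in counts}
    return doc_counts, word_counts, vocab


def classify(tokens, model):
    """Pick the label with the highest log posterior (assumed add-one smoothing)."""
    doc_counts, word_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / n_docs)
        for word in tokens:
            if word in vocab:  # words never seen in training are ignored
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training set: (category, segmented tokens)
docs = [
    ("sports", ["match", "goal", "team"]),
    ("sports", ["team", "win"]),
    ("finance", ["stock", "market", "bank"]),
    ("finance", ["bank", "rate"]),
]
model = train(docs)
print(classify(["team", "goal"], model))
print(classify(["bank", "stock"], model))
```

Because the reduce step only sums counts per key, the map outputs of different documents can be produced independently and merged in any order, which is the property that makes this counting stage suitable for MapReduce.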
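Received: 2016-12-15
*Corresponding author: HUAI Libo (born 1973), female, associate professor; research interests: optimization theory and methods, data mining.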