Research on Massive Text Classification Algorithms Based on Hadoop

Abstract: In the era of big data, both Internet data and offline data are growing exponentially, and most of this data consists of structured or semi-structured text files. Quickly and effectively finding the data a user needs in such massive collections, and improving the accuracy of that search, has therefore become a major challenge. Retrieving text data first requires classifying it accurately and efficiently, so text classification is the main difficulty in text data processing. The purpose of this thesis is thus to study efficient classification algorithms for massive text on existing hardware.

This thesis studies the storage and classification of massive text on Hadoop. First, it designs and implements a distributed, highly reliable, and highly available data storage module to address the difficulty of storing massive text. Next, it proposes a distributed parallel Chinese word segmentation algorithm based on MapReduce that modifies how the MapReduce InputFormat reads data, addressing Hadoop's inefficiency in processing small files. Compared with Chinese word segmentation on default MapReduce, this yields a 52-fold improvement in segmentation efficiency and makes segmentation of massive text practical. Finally, the thesis studies a classification algorithm for massive web text on the MapReduce distributed computing framework, builds a Bayesian text classification model, and validates it experimentally: the algorithm reaches accuracy and recall as high as 97% on previously unseen text.

Keywords: HDFS; Hadoop; MapReduce; text classification; Chinese word segmentation
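
The abstract names a Bayesian text classification model trained on MapReduce but does not show the algorithm. As a rough illustration only, the sketch below simulates the map and reduce steps in a single Python process and uses multinomial Naive Bayes with Laplace smoothing. The choice of that variant, the function names (`map_doc`, `reduce_counts`, `train`, `classify`), and the toy documents are all assumptions made for this example, not the thesis's implementation. The documents are assumed to be already segmented into words.

```python
import math
from collections import Counter
from itertools import chain


def map_doc(label, tokens):
    """Map step: emit (key, 1) pairs for one already-segmented document."""
    yield ("doc", label), 1              # one document seen for this class
    for tok in tokens:
        yield ("word", label, tok), 1    # one occurrence of tok in this class


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def train(labeled_docs):
    """Run map then reduce, and arrange the totals into a model."""
    pairs = chain.from_iterable(map_doc(lbl, toks) for lbl, toks in labeled_docs)
    totals = reduce_counts(pairs)
    doc_counts, word_counts, vocab = Counter(), {}, set()
    for key, n in totals.items():
        if key[0] == "doc":
            doc_counts[key[1]] = n
        else:
            _, label, tok = key
            word_counts.setdefault(label, Counter())[tok] = n
            vocab.add(tok)
    return doc_counts, word_counts, vocab


def classify(model, tokens):
    """Return the class with the highest log-posterior for the tokens."""
    doc_counts, word_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in doc_counts.items():
        counts = word_counts.get(label, Counter())
        denom = sum(counts.values()) + len(vocab)   # Laplace smoothing
        score = math.log(n / n_docs)                # log prior
        for tok in tokens:
            if tok in vocab:                        # skip words unseen in training
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Hypothetical toy corpus, for illustration only.
    docs = [
        ("sports", ["match", "team", "goal"]),
        ("sports", ["team", "coach", "win"]),
        ("finance", ["stock", "market", "bank"]),
        ("finance", ["bank", "loan", "market"]),
    ]
    model = train(docs)
    print(classify(model, ["team", "goal"]))
```

Training reduces to summing counts per key, which is why it fits the MapReduce model: each mapper can process its own share of the documents independently, and the reducers only need to add up the emitted pairs. On a real cluster the same two steps would be written against the Hadoop MapReduce API rather than run in one process.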

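1 Introduction

1.1 Research Background

1.2 Research Status at Home and Abroad

1.2.1 Big Data Research Status at Home and Abroad

1.2.2 Text Classification Research Status

1.3 Main Work

1.4 Thesis Organization

2 Research on the Big Data Technology Hadoop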
