The Design and Implementation of a Distributed Crawler for Vertical Website Networks

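With the rapid advance of technology, the Internet is growing quickly and the resources available on it keep expanding. Given the resulting mass of information, extracting and using it effectively has become a challenge. It is increasingly difficult for centralized search engines to quickly retrieve the information users actually need, so search engines are moving toward distributed processing, scaling out the system to increase its capacity for handling information. Distributed search engines emerged in response.

The distributed crawler in this project is built on Nutch, an open-source Java search engine that provides all the tools needed to run a search engine, including a web crawler and full-text search. Because web crawling is a very large workload that demands substantial processing power and network bandwidth, the project deploys Nutch on Hadoop to crawl in a distributed fashion. The project also uses Solr, a search engine based on Apache Lucene, so that the crawled results are easy to query and retrieve, giving users a better experience.

Thesis keywords: Nutch, Hadoop, Solr, distributed crawling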

Graduation Project Report: Foreign-Language Abstract

Title The Design and Implementation of a Distributed Crawler for Vertical Website Networks

Abstract With the rapid development of science and technology, the Internet has grown quickly and network resources have been continuously enriched. Faced with the resulting mass of information, how to extract and use the huge quantity of information on the Internet effectively has become a challenging issue. It is increasingly difficult for centralized search engines to promptly retrieve the information users really need from this mass of data. As a result, search engines are developing toward distributed processing, enhancing their information-processing ability by expanding the scale of the system. Distributed search engines have emerged to meet these requirements.

This project designs and implements a distributed crawler based on Nutch, an open-source Java search engine that provides all the tools required to operate a search engine, including web crawling and full-text search. Because the heavy workload of web crawling requires powerful processing capacity and high network bandwidth, this project deploys Nutch on Hadoop to implement distributed crawling. In addition, this project uses the Solr search engine, which is based on Apache Lucene, to make searching and retrieving the crawled results convenient and to provide a better user experience.

Keywords Nutch, Hadoop, Solr, distributed crawling
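
The abstract describes spreading the crawl across a Hadoop cluster but does not show how the work is divided. As an illustration only, the following minimal Python sketch shows one common way to split a URL list across crawler nodes: hash each URL's host name so that all pages of one site land in the same partition. The function names `partition_for` and `partition_urls` and the example URLs are hypothetical; this is not the thesis's code and not Nutch's implementation.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def partition_for(url: str, num_partitions: int) -> int:
    """Map a URL to a partition index using a stable hash of its host name."""
    # urlsplit(...).hostname is already lower-cased; it is None if no host is present.
    host = urlsplit(url).hostname or ""
    # md5 is used only as a stable, process-independent hash (not for security);
    # Python's built-in hash() is randomized per process for strings.
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_urls(urls, num_partitions):
    """Group URLs into per-partition fetch lists keyed by partition index."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[partition_for(url, num_partitions)].append(url)
    return dict(buckets)


if __name__ == "__main__":
    # Hypothetical seed URLs for illustration.
    seeds = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/index.html",
        "http://example.net/news",
    ]
    for index, fetch_list in sorted(partition_urls(seeds, 3).items()):
        print(index, fetch_list)
```

Keeping every URL of a host in one partition lets a single node enforce per-site request delays, at the cost of uneven load when a few hosts dominate the URL list.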


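1 Introduction

1.1 Project Background

1.2 Background and History of Distributed Search Engines

1.3 Research Work of This Thesis

1.4 Organization of This Thesis

2 Technical Overview

2.1 Introduction to Nutch

2.1.1 Nutch File Structure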
