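With the rapid advance of technology, the Internet is growing quickly and the resources available on it keep expanding. Faced with the resulting mass of information, extracting and using it effectively has become a challenge. It is increasingly difficult for centralized search engines to quickly retrieve, from this mass of information, what users actually need. Search engines are therefore moving toward distributed processing, scaling out the system to increase its information-processing capacity, and distributed search engines have emerged in response.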
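The distributed crawler in this project is built on Nutch, an open-source Java search engine that provides all the tools needed to run a search engine, including a web crawler and full-text search. Because web crawling is a very large task that demands substantial processing power and network bandwidth, this project deploys Nutch on Hadoop to achieve distributed crawling. The project also uses Solr, a search engine based on Apache Lucene, so that the crawl results can be queried and retrieved conveniently, giving a better user experience.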
Keywords: Nutch, Hadoop, Solr, distributed crawling
Graduation Design Report: Foreign-Language Abstract
Title The Design and Implementation of a Distributed Crawler for Vertical Website Networks
Abstract With the rapid development of science and technology, the Internet has grown quickly and the resources available on it have been continuously enriched. Faced with the resulting mass of information, extracting and making effective use of it has become a challenge. It is increasingly difficult for centralized search engines to promptly retrieve the information users actually need from this mass of data. As a result, search engines are moving toward distributed processing, enhancing their information-processing capacity by expanding the scale of the system, and distributed search engines have emerged in response.
This project designs and implements a distributed crawler based on Nutch, an open-source Java search engine that provides all the tools required to run a search engine, including web crawling and full-text search. Because the heavy workload of web crawling requires substantial processing capacity and network bandwidth, this project deploys Nutch on Hadoop to implement distributed crawling. In addition, the project uses the Solr search engine, which is based on Apache Lucene, to make the crawl results convenient to search and retrieve and to provide a better user experience.
Keywords: Nutch, Hadoop, Solr, distributed crawling
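Contents
1 Introduction
1.1 Project Background
1.2 Background and History of Distributed Search Engines
1.3 Research Work of This Thesis
1.4 Organization of This Thesis
2 Technical Overview
2.1 Introduction to Nutch
2.1.1 Nutch File Structure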