Abstract: With the rapid development of the Internet, search engines play an increasingly important role in Internet search services. The web crawler is an indispensable component of a search engine system: an automated program that collects information from the Internet, gathering web pages that are then used to build the index supporting the search engine. Besides collecting web information for search engines, a crawler can also serve as a directional information collector, harvesting specific information published on certain websites, such as job postings or housing rental listings. This thesis implements a crawler program based on the breadth-first algorithm in JAVA. Starting from the applications of web crawlers, it discusses the role and position of the crawler in a search engine and puts forward the crawler's functional and design requirements. On the basis of an analysis of the crawler's system structure and working principles, it studies strategies and algorithms for page crawling and parsing, implements a crawler program in Java, and analyzes its running results. With this crawler program, the URLs of one site or of multiple sites can be collected. When connected to the external network, it can crawl most of China's major mainstream portal sites, such as Baidu, Sina, and NetEase.
Keywords: search engine; JAVA; breadth-first search
The Design and Implementation of a
Distributed Web Crawler
Abstract: With the rapid development of the Internet, search engines, as the main entrance to the Internet, play an increasingly important role. The web crawler is a very important part of a search engine: a program that automatically collects information from the Internet and is responsible for gathering web pages. These pages are used to build the index and provide support for the search engine. A crawler can collect data for search engines and can also act as a directional information collector, gathering specific information from certain websites, such as recruitment or housing rental information. In this paper, a breadth-first crawler is implemented in JAVA. Starting from the application of search engines, the paper discusses the importance and function of the web crawler in a search engine and puts forward its functional and design requirements. On the basis of an analysis of the crawler's system structure and working principles, the paper also studies the methods and strategies of multithreaded scheduling, web page crawling, and HTML parsing. A web page crawling program based on Java is then implemented and its results are analyzed. With this crawler, the URLs of one site or of multiple sites can be collected. When connected to the external network, it can crawl most of China's major large-scale portal sites, such as Baidu, Sina, and NetEase.
Keywords: Search Engine; JAVA; Breadth-First Search
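
To make the breadth-first strategy described in the abstract concrete, the following is a minimal sketch, not the thesis program itself: a hypothetical BfsSpider class that keeps a FIFO queue of URLs to visit and a set of URLs already seen, downloads each page with the standard HttpURLConnection API, and extracts absolute links with a simple regular expression instead of a full HTML parser.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; class and parameter names are assumptions, not the thesis code.
public class BfsSpider {
    // Matches absolute http links inside href attributes; a real crawler would use an HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href=[\"'](http[^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        String seed = args.length > 0 ? args[0] : "http://www.baidu.com";
        int maxPages = 50;                             // stop after this many pages

        Queue<String> frontier = new ArrayDeque<>();   // URLs waiting to be crawled (FIFO = breadth first)
        Set<String> visited = new HashSet<>();         // URLs already collected
        frontier.add(seed);
        visited.add(seed);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            String html = download(url);
            if (html == null) {
                continue;                              // skip pages that failed to download
            }
            System.out.println("Crawled: " + url);

            // Parse the page and enqueue every absolute link not seen before.
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                if (visited.add(link)) {               // add() returns false for already-seen URLs
                    frontier.add(link);
                }
            }
        }
    }

    // Downloads a page as a string, or returns null on any error.
    private static String download(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            conn.setRequestProperty("User-Agent", "BfsSpider/0.1");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                StringBuilder sb = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line).append('\n');
                }
                return sb.toString();
            }
        } catch (Exception e) {
            return null;
        }
    }
}

Because the frontier is a first-in, first-out queue, all links found on the seed page are crawled before any links found on those pages, which is exactly the breadth-first order the thesis adopts; the multithreaded scheduling mentioned in the abstract is omitted here for clarity.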