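Abstract: Over the past decade or so, the Internet in China has developed considerably. Since the early twenty-first century, when the network truly began to permeate everyday life, the number of Chinese Internet users has grown steadily; by one estimate, Internet penetration exceeded 50% as of December 2015, meaning that half of the Chinese population was online. The Internet has grown from a seedling into a deeply rooted, leafy tree, and every aspect of our lives falls within its "shade": information flows and spreads through it, and its role and influence in that process are beyond doubt. Because it is so accessible, the Internet has become an important carrier of corporate public information, and for the same reason it acts as a bridge between people and information and plays an increasingly important role. Enterprises have developed rapidly in recent years, which motivates crawling and analysing the public information they generate. In-depth study of corporate public information is of real significance for helping the state guide enterprise development correctly.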
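The main purpose of this project is to use a web crawler to crawl and analyse corporate public information on the Internet, and to store and manage that information. First, the thesis discusses how corporate public information can be collected. Data-acquisition software is plentiful on the market, for example 火车头采集器 and 八爪鱼采集器, but in order to crawl targeted information we built a simple crawler of our own and describe it in an intuitive way. The thesis also discusses HTML page parsing and explains how the retrieved page content is stored in a specified database. Page-parsing algorithms are varied; this design uses the jsoup parser to parse URLs and HTML text. The thesis further covers the database design, the page-information acquisition workflow, and the design of the corporate public information crawling system.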
Thesis keywords: corporate public information; web crawler; HTML page parsing; database; web crawler algorithm
Crawling and Analysis of Corporate Public Information on the Internet
Abstract: Over the past decade or so, the Internet in China has made considerable progress. Since the early twenty-first century, when the network truly began to penetrate people's lives, the number of Chinese Internet users has grown steadily; it is estimated that, as of December 2015, the Internet penetration rate exceeded 50%, so that half of the Chinese population had access to the Internet. The Internet has grown from a seedling into a deeply rooted tree, and every aspect of our lives is touched by its "shade": information flows and spreads through it, and its role and influence in that process are beyond doubt. Its accessibility has made the Internet a major carrier of corporate public information, and for the same reason it serves as a bridge between people and information and plays an increasingly important role. Enterprises have developed rapidly in recent years, which makes it worthwhile to crawl and analyse the public information they generate. In-depth study of corporate public information is important for helping the state guide the development of enterprises correctly.
The main purpose of this project is to use a web crawler to crawl and analyse corporate public information on the Internet, and to store and manage that information. First, the thesis discusses how corporate public information can be fetched. Data-acquisition software is abundant on the market, for example the 火车头 (Huochetou) and 八爪鱼 (Bazhuayu) collectors, but in order to crawl targeted information we built a simple crawler of our own and present it in an intuitive way. In addition, the thesis discusses HTML page parsing and explains how the retrieved page content is stored in a specified database. Page-parsing algorithms are diverse; this design uses the jsoup parser to parse URLs and HTML text. The thesis also covers the database design, the page-information acquisition process, and the design of the corporate public information crawling system.
Key Words: corporate public information; web crawler; HTML page analysis; database; information fusion; web crawler algorithm
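The pipeline summarised in the abstract (parse a fetched page, extract its links and text, store the result in a database) can be illustrated with a minimal sketch. It is not the thesis implementation: the thesis uses jsoup, whereas this sketch swaps in Python's standard-library `html.parser` and an in-memory SQLite database so that it runs offline without extra dependencies. The names `PageParser`, `parse_page` and `store_page`, the two-table schema, the sample HTML and the `example.com` URLs are all assumptions made for illustration.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects the <title> text and every link found on one page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(url, html):
    """Return (title, absolute_links) for one already-fetched page."""
    parser = PageParser(url)
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


def store_page(conn, url, title, links):
    """Persist one page and its outgoing links (assumed schema, not the thesis design)."""
    conn.execute("CREATE TABLE IF NOT EXISTS page (url TEXT PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS link (src TEXT, dst TEXT, UNIQUE (src, dst))")
    conn.execute("INSERT OR REPLACE INTO page VALUES (?, ?)", (url, title))
    conn.executemany(
        "INSERT OR IGNORE INTO link VALUES (?, ?)", [(url, dst) for dst in links]
    )
    conn.commit()


# Stand-in for a downloaded page; a real crawler would fetch this over HTTP.
SAMPLE_HTML = """<html><head><title>Example Co. - Announcements</title></head>
<body><a href="/news/1.html">Notice 1</a>
<a href="http://example.com/about">About</a></body></html>"""

conn = sqlite3.connect(":memory:")
title, links = parse_page("http://example.com/index.html", SAMPLE_HTML)
store_page(conn, "http://example.com/index.html", title, links)
```

In a fuller crawler the extracted links would be pushed onto a queue of pages still to visit, and the HTML would come from an HTTP fetch rather than a literal string.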
Contents
Abstract (Chinese)
Abstract (English)