Citation Information : International Journal of Advanced Network, Monitoring and Controls. Volume 4, Issue 3, Pages 25-31, DOI: https://doi.org/10.21307/ijanmc-2019-056
License : (CC-BY-NC-ND 4.0)
Published Online: 11-November-2019
As technology products have spread, the Internet has become inseparable from daily life, and search engines have become an everyday necessity. Users enter keywords into a search engine to find related information. But with the explosive growth of information on the network, it has become harder to find the desired information accurately.
To meet this growing need, this paper uses Scrapy, an open-source crawler framework based on Python, to crawl the “knowledge” site (Zhihu) and to study and analyze the principle and running process of a crawler. On this basis, a prototype web crawler is implemented, and data crawling and storage are completed.
First, this paper introduces the development of crawlers, their working principle, their classification, and their crawling strategies, and focuses on the widely used Cookie mechanism, its corresponding Session, and the Robots protocol.
Second, the Scrapy framework is introduced in detail, together with the process and implementation details of developing a crawler with it.
Finally, the crawler is tested and the results of crawling are shown.
In a word, a crawler is a script or program that fetches information and saves it. It first sends a request to the target web page or website and then receives the response from the server.
A universal crawler is an important part of a search engine. Its main function is to collect web pages on the Internet, save them, and process them.
A focused crawler is a crawler built for specific needs. When it crawls a web page, it first filters the content and grabs only the page information related to the requirements.
The crawler workflow is similar to the way ordinary users access web pages. When a user opens a web page, the browser sends a request to the server hosting the site, and the server responds by returning a Response to the browser. The browser then parses the Response to display the web page. The general crawler framework is shown in Fig. 2 below.
First, some sites on the Internet are selected as starting points. These starting points are put into the queue of URLs waiting to be crawled, and elements are then read from the queue by dequeuing them. The URL of the target site is resolved through DNS, which converts the domain name to the corresponding IP address. The downloader fetches the target page from the server. The URLs contained in the downloaded page are extracted and deduplicated against the crawl queue, and the new ones are added to it. This loop continues until the queue of waiting URLs is empty.
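The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `FAKE_WEB` dictionary and the `download` and `crawl` functions are hypothetical names, and an in-memory link table stands in for DNS resolution and real page downloads.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory "web": page URL -> links found on that page.
FAKE_WEB = {
    "http://example.com/": ["/a", "/b"],
    "http://example.com/a": ["/b", "/c"],
    "http://example.com/b": ["/"],
    "http://example.com/c": [],
}


def download(url):
    """Stand-in for DNS lookup plus the downloader: return the page's links."""
    return FAKE_WEB.get(url, [])


def crawl(seeds):
    queue = deque(seeds)   # URLs waiting to be crawled
    seen = set(seeds)      # every URL ever enqueued, used for deduplication
    order = []             # URLs in the order they were crawled
    while queue:           # loop until the waiting queue is empty
        url = queue.popleft()
        order.append(url)
        for link in download(url):
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:       # skip URLs already queued or crawled
                seen.add(absolute)
                queue.append(absolute)
    return order


print(crawl(["http://example.com/"]))
```

Keeping a separate `seen` set means a page that links back to an earlier page (as `/b` does here) cannot put the crawler into an endless loop.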
The data items are identified by inspecting the web pages of the Zhihu user interface. A Field in Python can accept almost any data type.
The follower_info_parse callback has two functions: first, it initiates requests for user information through the attention (following) list; second, it handles paging. By parsing the response, it obtains all users on the current page of the target's attention list and requests their detailed information. It then schedules a request for the next page of the list, and these further requests are processed recursively so that crawling continues in a loop. The followee_info_parse callback requests users' detailed information through the fan (follower) list and also handles paging. Its implementation logic is exactly the same as that of follower_info_parse, except that the object of the request is different: one requests details of the users the current user follows, the other requests details of the users who follow the current user.
Spiders.py is the core of the web crawling module and an important part of the whole project. It defines the core business logic.
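The paging logic shared by the two callbacks can be illustrated with a small sketch. It is not the paper's code. The JSON field names (`data`, `paging`, `is_end`, `next`, `url_token`) and the helper names `parse_follow_page` and `crawl_list` are assumptions made for illustration. In a real Scrapy spider the callback would yield `scrapy.Request` objects rather than tuples.

```python
def parse_follow_page(payload):
    """Yield ("user", token) per listed user, then ("page", url) if more pages remain."""
    for entry in payload.get("data", []):
        yield ("user", entry["url_token"])      # one detail request per listed user
    paging = payload.get("paging", {})
    if not paging.get("is_end", True) and paging.get("next"):
        yield ("page", paging["next"])          # schedule the next page of the list


def crawl_list(first_url, fetch):
    """Follow the list page by page, collecting every user token."""
    users = []
    url = first_url
    while url:
        next_url = None
        for kind, value in parse_follow_page(fetch(url)):
            if kind == "user":
                users.append(value)
            else:
                next_url = value
        url = next_url
    return users


# Hypothetical two-page response set standing in for the remote API.
PAGES = {
    "page1": {
        "data": [{"url_token": "alice"}, {"url_token": "bob"}],
        "paging": {"is_end": False, "next": "page2"},
    },
    "page2": {
        "data": [{"url_token": "carol"}],
        "paging": {"is_end": True, "next": "page3"},
    },
}

print(crawl_list("page1", PAGES.__getitem__))
```

Because both callbacks differ only in which list they request, a single parsing function like this can serve either list when given a different starting URL.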
In testing, the crawler running on a single host captured data for up to 400,000 users per day. The crawling speed can be controlled manually through settings in the code.
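As a rough check on that figure, the sketch below converts 400,000 users per day into an average rate, assuming one request per user and continuous operation. It also lists standard Scrapy settings that are commonly used to limit crawl speed. The setting names are real Scrapy settings, but the values are illustrative assumptions, not the ones used in the paper.

```python
# Average throughput implied by the reported figure.
USERS_PER_DAY = 400_000
SECONDS_PER_DAY = 24 * 60 * 60
average_rate = USERS_PER_DAY / SECONDS_PER_DAY
print(f"{average_rate:.2f} user pages per second")

# Scrapy settings (settings.py) that throttle a crawl; values are illustrative.
DOWNLOAD_DELAY = 0.2          # seconds to wait between requests to the same site
CONCURRENT_REQUESTS = 16      # upper bound on requests in flight at once
AUTOTHROTTLE_ENABLED = True   # let Scrapy adapt the delay to server latency
```

Sustaining 400,000 users per day therefore corresponds to roughly 4.6 user pages per second on average.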
After the program runs, a database named “zhihu” is created, and all the information is stored in its user table.
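A storage step of this kind is usually written as a Scrapy item pipeline. The sketch below is only an illustration under stated assumptions. The paper does not name the database engine, so Python's built-in `sqlite3` is used as a stand-in, and the class name `UserPipeline` and the columns `url_token` and `name` are hypothetical. The method names `open_spider`, `process_item`, and `close_spider` follow Scrapy's pipeline interface.

```python
import sqlite3


class UserPipeline:
    """Scrapy-style item pipeline; sqlite3 stands in for the unnamed database."""

    def __init__(self, path=":memory:"):
        self.path = path  # a real project might point this at a "zhihu" database

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS "user" '
            "(url_token TEXT PRIMARY KEY, name TEXT)"
        )

    def process_item(self, item, spider):
        # INSERT OR REPLACE keeps one row per user if the same user is crawled twice.
        self.conn.execute(
            'INSERT OR REPLACE INTO "user" (url_token, name) VALUES (?, ?)',
            (item["url_token"], item["name"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()


pipeline = UserPipeline()
pipeline.open_spider(None)
pipeline.process_item({"url_token": "alice", "name": "Alice"}, None)
pipeline.process_item({"url_token": "alice", "name": "Alice L."}, None)
rows = pipeline.conn.execute('SELECT url_token, name FROM "user"').fetchall()
print(rows)
pipeline.close_spider(None)
```

In a Scrapy project, a pipeline like this is enabled through the `ITEM_PIPELINES` setting. Using the user's token as the primary key prevents duplicate rows when the follower and followee lists lead to the same user.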
Once crawler development is complete, the crawler should be tested, and testing is a very important step. First, testing shows how the crawler performs, whether it has problems, and whether it can crawl the required data. Second, the anti-crawler strategy of the target website should be explored so that the crawler can be improved. Finally, the crawled data should be checked to see whether it achieves the expected goal of the project.
The crawler system can also be extended: many technologies have not yet been added to it, and adding them would be required for enterprise-level use. While writing this system, I consulted a lot of material about Scrapy. The Scrapy framework was new to me, with new APIs and libraries. Fortunately, I had done some crawler projects before, so it was not particularly difficult for me.