Crawler Technology Based on Scrapy Framework

International Journal of Advanced Network, Monitoring and Controls

Xi'an Technological University

Subject: Computer Science, Software Engineering

eISSN: 2470-8038


Hejing Wu *

Keywords: Crawler, Scrapy, framework, Python, Cookie

Citation Information : International Journal of Advanced Network, Monitoring and Controls. Volume 4, Issue 3, Pages 25-31, DOI: https://doi.org/10.21307/ijanmc-2019-056

License : (CC-BY-NC-ND 4.0)

Published Online: 11-November-2019


ABSTRACT

With the development of the times and the popularization of scientific and technological products, the Internet has become inseparable from our lives, and search engines have become a daily necessity for people. In view of these growing needs, this topic requires the design of a prototype crawler system based on the Scrapy framework. The specific requirements and contents are as follows: analyze the structure and rules of the target website and identify the data items that need to be crawled; implement a crawler prototype program on top of the Scrapy framework by customizing crawling rules; and select an appropriate database for data storage and analysis.


I. INTRODUCTION

With the development of the times and the popularization of scientific and technological products, the Internet has become inseparable from our lives, and search engines have become a daily necessity for people. Users can search for information by entering keywords into a search engine and retrieving results related to those keywords. However, with the explosive growth of online information, it has become increasingly difficult to find the desired information accurately.

In order to meet these growing needs, this paper chooses Scrapy, an open-source crawler framework based on Python, to crawl Zhihu (a Chinese question-and-answer site), and to study and analyze the principle and running process of the crawler. On this basis, a prototype web crawler program is implemented, and data crawling and storage are completed.

Firstly, this paper introduces the development history of crawlers, their working principle, classification and crawling strategies, and focuses on the widely used Cookie mechanism, its corresponding Session, and the Robots protocol.

Secondly, the use of the Scrapy framework is introduced in detail, along with the process and implementation details of developing a crawler with Scrapy.

Finally, the crawler is tested and the results of crawling are shown.

II. WORKING PRINCIPLE

In short, a crawler is a script or program that retrieves information and saves it. Its first step is to send a request to the target web page or website and then receive the response from the server.

A universal crawler is an important component of a search engine. Its main function is to collect web pages on the Internet, then save and process them.

A focused crawler is a crawler built for specific needs. When it crawls a web page, it filters the content and captures only the information relevant to those needs.

Figure 1. The Difference between Universal and Focused Crawlers

The crawler workflow is similar in principle to an ordinary user visiting a web page. When a user opens a web page, the browser sends a request to the server hosting the site, and the server responds by returning a Response to the browser. The browser then parses the Response and displays the web page [9]. The general crawler framework is shown in Fig. 2 below.
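
As a minimal sketch of this request/response cycle, the snippet below sends one request and inspects the server's response. It uses the third-party requests library and a placeholder URL, which are illustrative assumptions; the paper itself builds on Scrapy.

import requests

url = "https://www.example.com/"          # hypothetical target page
headers = {"User-Agent": "Mozilla/5.0"}   # present ourselves like an ordinary browser

response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)   # 200 if the server accepted the request
print(response.text[:200])    # the HTML a browser would normally render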

Figure 2. Universal crawler framework process

First of all, some sites on the Internet are selected as starting points. These starting points are put into the queue of URLs to be crawled; the queue is dequeued and its elements are read out. The URL of the target site is resolved through DNS, which converts the domain name into the corresponding IP address. The downloader then downloads the target page from the server, and the URLs contained in the downloaded page are extracted and checked against the queue of already-crawled URLs to remove duplicates. The remaining URLs are added to the waiting queue, and the loop continues until the waiting queue is empty.
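
The loop in Figure 2 can be sketched in a few lines. The library choices (requests, BeautifulSoup) and the seed URL below are assumptions for illustration, not the paper's implementation: take a URL from the waiting queue, download the page, extract new links, skip those already crawled, and continue until the queue is empty.

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

seeds = ["https://www.example.com/"]      # hypothetical starting points
waiting = deque(seeds)                    # queue of URLs to be crawled
crawled = set()                           # URLs already fetched (deduplication)

while waiting:
    url = waiting.popleft()
    if url in crawled:
        continue
    try:
        page = requests.get(url, timeout=10)
    except requests.RequestException:
        continue
    crawled.add(url)
    # Extract links from the downloaded page and enqueue unseen ones.
    soup = BeautifulSoup(page.text, "html.parser")
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if link not in crawled:
            waiting.append(link)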

Figure 3. Network crawler flow chart

III. DETAILS OF CRAWLER IMPLEMENTATION

Data items are obtained by inspecting the web pages of Zhihu's user interface. The Field type in Scrapy can accept almost any data type.
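
A minimal sketch of such item definitions is shown below (compare Figure 4). The concrete field names are assumptions based on a typical Zhihu user profile; the paper does not list them all.

import scrapy

class UserItem(scrapy.Item):
    # scrapy.Field() accepts values of almost any type, as noted above
    id = scrapy.Field()               # unique user token
    name = scrapy.Field()             # display name
    headline = scrapy.Field()         # one-line self description
    follower_count = scrapy.Field()   # number of followers
    following_count = scrapy.Field()  # number of users followed
    answer_count = scrapy.Field()     # answers posted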

Follower_info_parse serves two functions: first, it initiates requests for user information through the following list (the users the current user follows); second, it implements page turning. By parsing the response, it obtains all users in the current list, requests their detailed information, and schedules a request for the next page of the list, so further requests are issued recursively for continuous crawling. Followee_info_parse, which requests users' detailed information through the follower list (the users who follow the current user), also implements page turning. Its logic is exactly the same as that of follower_info_parse; only the object of the request differs.
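
A hedged sketch of how such a list-parsing callback could be structured is given below: request a detail page for every user in the current list, then follow the "next page" link recursively. The JSON layout and the user-detail URL template are assumptions modelled on Zhihu's public web API, not the paper's exact code.

import json
import scrapy

class FollowerListSketch(scrapy.Spider):
    name = "follower_list_sketch"
    user_url = "https://www.zhihu.com/api/v4/members/{token}"  # hypothetical template

    def follower_info_parse(self, response):
        data = json.loads(response.text)
        # 1) request detailed information for every user on this page
        for user in data.get("data", []):
            yield scrapy.Request(self.user_url.format(token=user["url_token"]),
                                 callback=self.user_info_parse)
        # 2) page turning: schedule the next page of the list
        paging = data.get("paging", {})
        if not paging.get("is_end"):
            yield scrapy.Request(paging["next"],
                                 callback=self.follower_info_parse)

    def user_info_parse(self, response):
        yield json.loads(response.text)   # becomes an item for the pipeline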

Figure 4. Field definitions in item.py file
Figure 5. Followee_info_parse method

Spiders.py is the core of the web crawling module and an important part of the whole project; it defines the core business logic, including the two list-parsing callbacks described above.
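
A minimal sketch of what a spiders.py entry point such as ZhihuSpider might look like follows: start from one seed user and hand the two user lists to the callbacks described above. The seed account, URL templates and callback wiring are assumptions for illustration, not the paper's exact values.

import scrapy

class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]

    start_user = "example-user"   # hypothetical seed account
    followers_url = "https://www.zhihu.com/api/v4/members/{token}/followers?offset=0&limit=20"
    followees_url = "https://www.zhihu.com/api/v4/members/{token}/followees?offset=0&limit=20"

    def start_requests(self):
        # seed the crawl with the two lists of one user
        yield scrapy.Request(self.followers_url.format(token=self.start_user),
                             callback=self.follower_info_parse)
        yield scrapy.Request(self.followees_url.format(token=self.start_user),
                             callback=self.followee_info_parse)

    def follower_info_parse(self, response):
        ...   # as sketched above: user-detail requests plus a next-page request

    def followee_info_parse(self, response):
        ...   # same logic, applied to the other list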

Figure 6. ZhihuSpider

IV. RUNNING STATUS AND TESTING

In testing, the crawler running on a single host can capture the data of up to 400,000 users per day. The crawling speed can be controlled deliberately through settings in the code.
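
Scrapy's built-in settings are the usual place to control crawl speed. A hedged example of settings.py values that slow the crawler down is shown below; the concrete numbers are illustrative, not the paper's configuration.

DOWNLOAD_DELAY = 0.5          # wait 0.5 s between requests to the same site
CONCURRENT_REQUESTS = 8       # cap the number of simultaneous requests
AUTOTHROTTLE_ENABLED = True   # let Scrapy adapt the delay to server load
ROBOTSTXT_OBEY = True         # respect the target site's Robots protocol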

Figure 7. Screenshots of crawling 1
Figure 8. Screenshots of crawling 2

After the program runs, it produces a database named "zhihu" and stores all the information in the user table.
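
The storage step can be implemented as a Scrapy item pipeline. The paper only shows a database named "zhihu" with a user table and does not state the database engine, so the sketch below assumes MongoDB (via pymongo) purely for illustration.

import pymongo

class ZhihuUserPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.db = self.client["zhihu"]

    def process_item(self, item, spider):
        # upsert so that re-crawled users overwrite their old record
        self.db["user"].update_one({"id": item.get("id")},
                                   {"$set": dict(item)}, upsert=True)
        return item

    def close_spider(self, spider):
        self.client.close()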

Figure 9. Database screenshots

V. CONCLUSION

When the crawler development is completed, it should be tested; testing is a very important step. First, testing reveals the performance of the crawler, whether it has problems, and whether it can crawl the required data. Second, the anti-crawler strategy of the target website should be explored so that the crawler can be improved. Finally, the crawled data should be checked to see whether it achieves the expected goal of the project. The crawler system can also be extended: many technologies have not yet been added to it, and adding them would bring it up to enterprise-level requirements. In the process of writing this system, I consulted a lot of material about Scrapy. The Scrapy framework, with its new APIs and libraries, was new to me; fortunately, I had done some crawler projects before, so the work was not particularly difficult.

ACKNOWLEDGMENT

This paper is part of the 2019 scientific research project of Heilongjiang Oriental University, "Implementation of Crawlers Based on the Python Scrapy Framework", project number HDFKY190109.

References


  1. Jing Wang, Yuchun Guo. Scrapy-Based Crawling and User-Behavior Characteristics Analysis on Taobao [C]. Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2012 International Conference on, 2012.
  2. McGuffee, James W. Non-profit geographically constrained locator [J]. ACM SIGCAS Computers and Society, 2015, 45(2).
  3. Yuhao Fan. Design and Implementation of Distributed Crawler System Based on Scrapy [J]. IOP Conference Series: Earth and Environmental Science, 2018, 108(4).
  4. Shen Jie, Li Yifan. Application of Web crawler system in cloud media [J]. China Cable Television, 2018(05): 595-597.
  5. Zhang Jin. Research on Web crawler technology based on Hadoop platform [D]. Nanjing University of Posts and Telecommunications, 2017.
  6. Zhao Fen, Lei Zhenzhen, Yang Xiaoyun, Su Pengju, Wang Shunye. Analysis of College Students' Network Public Opinion Based on Baidu Tieba [J]. Computer Knowledge and Technology, 2018, 14(28): 227-229.
  7. Ding Zhongxiang, Yang Yanhong, Du Yanming. Design and implementation of video information crawling based on Scrapy framework [J]. Journal of Beijing Printing Institute, 2018, 26(09): 92-97.
  8. Xie Zhu. Emotional Tendency Analysis for Chinese Short Texts [D]. Hunan University, 2018.
  9. Wei Chengcheng. Data Information Crawler Technology Based on Python [J]. Electronic World, 2018(11): 208-209.
  10. Ye Xiqiezhong. Research and Implementation of Tibetan Text Automatic Classification Based on Web [D]. Qinghai University for Nationalities, 2014.
  11. Zhong Jiajun. Research on Copyright Infringement Recognition of News Aggregation Platform [D]. Lanzhou University, 2018.
