Research on Data Collection and Analysis of Second Hand House in China Based on Python

International Journal of Advanced Network, Monitoring and Controls

Xi'an Technological University

Subject: Computer Science, Software Engineering

eISSN: 2470-8038

Hejing Wu * / Ran Cui

Keywords: Python crawler, Scrapy framework, Django framework, HOME LINK

Citation Information : International Journal of Advanced Network, Monitoring and Controls. Volume 6, Issue 2, Pages 37-47, DOI: https://doi.org/10.21307/ijanmc-2021-015

License : (CC-BY-NC-ND 4.0)

Published Online: 12-July-2021

ABSTRACT

With the rapid development of the Internet in China, more and more people conduct housing rental activities online. Existing rental websites are numerous, but their information is scattered and mixed with advertising. This system therefore targets the HOME LINK site: according to the location, rent, house type, and other precise requirements provided by the customer, it vertically searches the relevant data from the rental information website, stores the rental information in a prescribed structure, and provides this data for subsequent analysis, bringing users a faster experience. At the same time, reminders for houses of interest and for price reductions are added; for users who cannot watch their mobile phones all the time, timely reminders help them avoid missing many desirable houses. The system is committed to providing the detailed keywords needed for rental searches and to making renting easier for everyone who needs it. It mainly consists of data cleaning, data access, algorithm design and implementation, a Python implementation of the front and back ends, data formatting, and decision support.

I. INTRODUCTION

With the rapid development of information technology in today's society, people's demand for information has also greatly increased. Society has gradually become a collection of information, and this collection contains all kinds of data; data is one form of information. In most cases these data are hidden in the network, and they are complex and diverse. It is very difficult to extract such complex data from the network with traditional processing methods and to analyze and study them to obtain useful information.

II. HOME LINK PLATFORM STATISTICAL SURVEY

This project takes the HOME LINK real estate information platform as the crawler research object. By crawling the second-hand houses in each district, together with their prices and house types, and analyzing the collected data, useful information about second-hand houses across different areas, layouts, and prices is extracted. Through this practical process, some conclusions about crawlers and basic data analysis methods are drawn and summarized.

III. RELATED TECHNOLOGIES AND FRAMEWORKS

In the system design, we mainly use Python, Django, Scrapy, and WordCloud. Scrapy is an open-source web crawler framework written in Python. It was originally designed for web scraping, but it can also be used to extract data through APIs or as a general-purpose crawler. The Scrapy framework provides a series of efficient and powerful components with which developers can quickly build a crawler program; even complex applications can be built with various plug-ins or middleware. The basic structure of the Scrapy framework is shown in Figure 1.

Figure 1.

Basic principles of the Scrapy framework

10.21307_ijanmc-2021-015-f001.jpg

BeautifulSoup is a library for parsing HTML or XML text. It handles malformed markup by building a parse tree and provides an interface that lets developers easily navigate, search, and modify the tree. Compared with other HTML/XML parsing tools, BeautifulSoup has the advantages of simplicity, high error tolerance, and developer friendliness.
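As an illustration (the paper itself contains no BeautifulSoup listing), a minimal sketch of parsing listing titles out of an HTML fragment might look as follows; the markup and the class name "title" are invented for the example:

from bs4 import BeautifulSoup

html = """
<ul>
  <li class="title"><a href="/house/1">Two rooms, 89 m2, facing south</a></li>
  <li class="title"><a href="/house/2">Three rooms, 120 m2, facing north</a>
</ul>
"""  # note the unclosed <li>: BeautifulSoup tolerates malformed markup

# Build the parse tree with the built-in parser.
soup = BeautifulSoup(html, "html.parser")

# Navigate and search the tree: print every listing title and its link.
for li in soup.find_all("li", class_="title"):
    a = li.find("a")
    print(a.get_text(strip=True), a["href"])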

WordCloud is a third-party library that treats the word cloud as an object. It draws the word cloud using the frequency of words in the text as input, and the size, color, and shape of the cloud can all be configured.

IV. DESIGN PROCESS

The Scrapy framework is mainly made up of four parts: items, spiders, pipelines, and middlewares. Based on this structure, the rental-information crawler is divided into four modules: the data definition module, the crawling module, the configuration module, and the data processing module. Items defines the item entities to crawl; in this program a LianjiaItem is defined, including the price, orientation, location, name, floor, area, and layout of the house.
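The items code itself is reproduced in the paper only as an image; below is a hedged sketch of what a LianjiaItem with the fields named above could look like:

import scrapy

class LianjiaItem(scrapy.Item):
    # One Field per attribute crawled for each house (field names are our own).
    name = scrapy.Field()         # listing title
    price = scrapy.Field()        # listing price
    orientation = scrapy.Field()  # e.g. facing south
    location = scrapy.Field()     # district / community
    floor = scrapy.Field()        # floor information
    area = scrapy.Field()         # floor area in square metres
    layout = scrapy.Field()       # house type, e.g. "2 rooms, 1 hall"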

In Python, Django is a framework based on the MVC pattern. In Django, however, the framework itself handles the controller part that accepts user input, so Django focuses on models, templates, and views, known as the MTV pattern. Their respective responsibilities are listed in Chart 1.

Chart 1

Django responsibilities

10.21307_ijanmc-2021-015-tbl1.jpg
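As a concrete illustration of the model layer, one crawled listing can be described by a plain Django model; the schema below is our own sketch, not the paper's actual code:

from django.db import models

class House(models.Model):
    # Hypothetical schema mirroring the crawled item fields.
    name = models.CharField(max_length=200)
    district = models.CharField(max_length=50)
    orientation = models.CharField(max_length=20)
    layout = models.CharField(max_length=50)
    floor = models.CharField(max_length=50)
    area = models.FloatField()   # square metres
    price = models.FloatField()  # rent or sale price

    def __str__(self):
        return f"{self.name} ({self.district})"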

The spider, which defines the crawling logic and the parsing rules for web content, is responsible for parsing responses and generating results and new requests. The spider is the key point of this design: it defines how to extract the item entities, and both the initial dynamic page links and the static page information crawl are defined in this file. The key code is as follows:

10.21307_ijanmc-2021-015-unf001.jpg
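Since the key code appears only as an image, the following is a minimal sketch of what such a spider could look like; the start URL and the CSS selectors are assumptions, not taken from the paper:

import scrapy

class LianjiaSpider(scrapy.Spider):
    name = "lianjia"
    # Hypothetical listing page; the real start pages are shown only as an image.
    start_urls = ["https://bj.lianjia.com/ershoufang/"]

    def parse(self, response):
        # Yield one result per listing card (selectors are illustrative).
        for card in response.css("li.clear"):
            yield {
                "name": card.css(".title a::text").get(),
                "price": card.css(".totalPrice span::text").get(),
                "location": card.css(".positionInfo a::text").get(),
            }
        # Follow pagination to generate new requests.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)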

The HOME LINK platform publishes all kinds of second-hand housing information (price, house type, floor location, etc.) on its listing pages. When a listing page first loads, only 30 houses are shown. Selenium can be used to simulate a human scrolling down the page, but each scroll loads only 30 more entries. Therefore, the total number of listings is first read from the page, divided by 30, and rounded up; the number of scroll operations is set according to this total, and the browser is closed after the data has been read.
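A sketch of this scroll-until-loaded logic with Selenium is shown below; the URL and the selector used to read the total number of listings are hypothetical placeholders:

import math
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://bj.lianjia.com/ershoufang/")  # hypothetical listing URL

# Read the total number of listings from the page (selector is illustrative).
total = int(driver.find_element(By.CSS_SELECTOR, ".total span").text)

# Each drop-down loads 30 more entries, so scroll ceil(total / 30) times.
for _ in range(math.ceil(total / 30)):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # give the page time to load the next 30 entries

html = driver.page_source  # hand the fully loaded page to the parser
driver.quit()              # close the browser after reading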

The crawling process is shown in Figure 2:

Figure 2.

Crawling process

10.21307_ijanmc-2021-015-f002.jpg

A. Data analysis

1) General data analysis methods

Data analysis refers to the process of analyzing large amounts of collected data with appropriate statistical methods, extracting useful information, forming conclusions, and studying and summarizing the data in detail. Sometimes the resulting data needs further processing and extraction before it can be turned into useful information. Data analysis helps people make judgments so that they can take appropriate action. The mathematical basis of data analysis was well established as early as the 20th century, but it was not until the advent of computers that its practical operation became possible and widespread. Data analysis is therefore a combination of mathematics and computer science.

Data visualization is one of the more representative aspects of data analysis: it makes the trends in data visible to the human eye. Depending on the need, there are many methods of data visualization, ranging from training AI models to learn patterns in the data and make predictions, to applying basic functions in Excel sheets; all of these can be part of the data analysis process.

2) Key technologies and technical difficulties

a) The Engine processes the data flow of the whole system and triggers events; it is the core of the framework. The Scheduler accepts requests from the engine, queues them up, and delivers them back when the engine requests them again.

b) The Downloader downloads the web content and returns the downloaded content to the spider.

c) The Item Pipeline is responsible for processing the data extracted from the web pages by the spiders, mainly cleaning, verifying, and storing the data in the database; a minimal sketch is given after this list.

d) Downloader Middlewares form the processing block between Scrapy's requests and responses, sitting between the engine and the downloader.

e) Spider Middlewares are located between the engine and the spider; they mainly handle the responses fed into the spider and the items and new requests it outputs.

f) Front-end and back-end connection: since data needs to be stored in the database, the database and the front end must be connected. The database connection pool is responsible for allocating, managing, and releasing database connections, allowing the application to reuse existing connections. For data interaction with the back end, this paper mainly uses Ajax, a lightweight asynchronous JavaScript technique. The standard format of an Ajax call is as follows:

10.21307_ijanmc-2021-015-unf002.jpg
  • Using ECharts to display data: in order to make the data look orderly, this paper adopts ECharts to visualize the data and make it more concise and objective.

  • The data must be representative and the amount of data large enough; otherwise the results will not be convincing. Large data sets should be kept clean, and enough time and energy must be devoted to preprocessing, or problems may occur later.
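As mentioned in item c) above, a minimal sketch of an item pipeline that cleans, verifies, and stores each crawled house might look as follows; the SQLite table and field names are our assumptions:

import sqlite3
from scrapy.exceptions import DropItem

class LianjiaPipeline:
    def open_spider(self, spider):
        # Open the database once when the spider starts.
        self.conn = sqlite3.connect("houses.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS house (name TEXT, location TEXT, price REAL)"
        )

    def process_item(self, item, spider):
        # Cleaning and verification: drop listings without a price.
        if not item.get("price"):
            raise DropItem("missing price")
        self.conn.execute(
            "INSERT INTO house VALUES (?, ?, ?)",
            (item.get("name"), item.get("location"), float(item["price"])),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()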

3) Display of data visualization results

a) Display of the full data table; key code: def index(request):

10.21307_ijanmc-2021-015-unf003.jpg
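The paper shows def index(request) only as an image; a plausible minimal version that hands all stored houses to a template might look like this (the House model and the template name are our assumptions):

from django.shortcuts import render
from .models import House  # hypothetical model, as sketched earlier

def index(request):
    # Render the full table of crawled houses.
    houses = House.objects.all()
    return render(request, "index.html", {"houses": houses})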
Figure 3.

Full information display

10.21307_ijanmc-2021-015-f003.jpg

b) Pie chart display; key code: def Per_charts(request):

10.21307_ijanmc-2021-015-unf004.jpg
10.21307_ijanmc-2021-015-unf004a.jpg
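A hedged sketch of what Per_charts could compute: group listings by district and return per-district counts in the {name, value} shape that an ECharts pie series expects (model and field names are assumptions):

from django.db.models import Count
from django.http import JsonResponse
from .models import House  # hypothetical model

def Per_charts(request):
    # Count listings per district for the pie chart.
    rows = House.objects.values("district").annotate(n=Count("id"))
    data = [{"name": r["district"], "value": r["n"]} for r in rows]
    return JsonResponse({"series": data})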
Figure 4.

Pie chart of rent per district

10.21307_ijanmc-2021-015-f004.jpg
Figure 5.

Pie chart of rental housing by orientation

10.21307_ijanmc-2021-015-f005.jpg

c) Line chart display; key code: def line(request):

10.21307_ijanmc-2021-015-unf005.jpg
10.21307_ijanmc-2021-015-unf005a.jpg
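Similarly, the average rent per district behind the line chart can come from a single aggregate query; this sketch reuses the hypothetical House model:

from django.db.models import Avg
from django.http import JsonResponse
from .models import House  # hypothetical model

def line(request):
    # Average price per district, split into x/y arrays for the line chart.
    rows = House.objects.values("district").annotate(avg_price=Avg("price"))
    return JsonResponse({
        "x": [r["district"] for r in rows],
        "y": [round(r["avg_price"], 2) for r in rows],
    })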
Figure 6.

Line chart of the average rent price in Beijing

10.21307_ijanmc-2021-015-f006.jpg

d) Bar chart display; key code: def Dot_chart(request):

10.21307_ijanmc-2021-015-unf006.jpg
Figure 7.

Bar chart of renting in Beijing

10.21307_ijanmc-2021-015-f007.jpg

e) Word cloud display; key code: def start_worldcloud():

10.21307_ijanmc-2021-015-unf007.jpg
10.21307_ijanmc-2021-015-unf007a.jpg
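The word cloud code is likewise shown only as an image; a minimal equivalent with the wordcloud library, keeping the paper's function name start_worldcloud but with our own input and output paths, could be:

from wordcloud import WordCloud

def start_worldcloud():
    # Read the collected listing titles (hypothetical input file).
    with open("titles.txt", encoding="utf-8") as f:
        text = f.read()
    # Frequency counting, sizing, and coloring are handled by WordCloud itself.
    wc = WordCloud(
        font_path="simhei.ttf",   # a Chinese-capable font is needed for Chinese text
        width=800,
        height=600,
        background_color="white",
    ).generate(text)
    wc.to_file("wordcloud.png")

if __name__ == "__main__":
    start_worldcloud()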
Figure 8.

Generated word cloud presentation

10.21307_ijanmc-2021-015-f008.jpg

V. CONCLUSION

Through the research and analysis of HOME LINK's second-hand housing data in Beijing, this paper studies how to crawl rental information from the HOME LINK website with the Scrapy framework, how to build the Scrapy project structure, and how the orientation and location of a rental house affect prices in Beijing.

Acknowledgements

This paper is supported by the 2019 scientific research project of East University of Heilongjiang, "Implementation of Crawler Based on Python Scrapy Framework" (project number HDFKY190109).
