The price of commercial housing is related to the process of urbanization in China and to residents' living standards, so predicting commercial housing prices is very important. A major difficulty in regression prediction problems is how to handle different attribute types and fuse them. This paper proposes a house price prediction model based on multi-dimensional data fusion and a fully connected neural network. The model is built in four steps: first, normalize the data in each sample; then, interpolate the normalized data to increase the data density; subsequently, convert the normalized sample data into a pixel matrix; finally, establish a fully connected neural network model from the pixel matrix to the price of the commercial house. Once the neural network model has been established, the price of a house can be obtained by entering the attributes of the house into the model.

Urbanization

With the development of China's economy, people's living standards have gradually improved, and economic development has given people a higher pursuit of living places, as reflected in data from the National Bureau of Statistics.

The data in this paper comes from the Boston house price data set provided by Kaggle, and the amount of data selected is relatively small. The data set contains 404 training samples and 102 test samples, for a total of 506 samples. Six attributes affect house prices in this prediction problem: the transaction date X1, the house age X2, the distance from the subway station X3, the number of convenience stores within walking distance X4, the latitude of the house X5, and the longitude of the house X6. The dependent variable is the house price Y.

House price forecasting is a prediction problem, and prediction problems are studied with regression analysis. This section states the research methods of regression analysis. Regression analysis is a statistical method for estimating the relationship between a dependent variable and one or more independent variables.

Linear regression establishes a linear equation between the independent variables and the dependent variable. This is the most well-known regression model. In this type of model, the independent variables may be discrete or continuous; the dependent variable must be continuous, and the fitted relationship is linear. Logistic regression builds a logistic equation from the independent variables to the dependent variable. This is a regression model used to estimate the success or failure of an event. In this type of model, the independent variables may be discrete or continuous; the dependent variable must lie in the interval [0,1]. Polynomial regression establishes a polynomial equation between the independent variables and the dependent variable and is commonly used in the field of deep learning. Under this model, a low polynomial degree leads to underfitting, and a high polynomial degree leads to overfitting. When dealing with many independent variables, stepwise regression can be used to select the significant ones.
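The underfitting/overfitting trade-off described above can be illustrated with a short Python sketch (not part of the original study; the data and polynomial degrees are hypothetical): polynomials of increasing degree are fitted to noisy samples of a sine curve, and the training error shrinks with the degree.

```python
import numpy as np

# Illustrative sketch: fit polynomials of increasing degree to noisy data.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=x.shape)

def fit_error(degree):
    """Mean squared training error of a degree-`degree` polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return float(np.mean((y - pred) ** 2))

# A degree-1 (linear) model underfits the sine curve; a high-degree
# polynomial drives the training error down but starts to chase the noise.
print(fit_error(1), fit_error(3), fit_error(10))
```

A degree-1 fit leaves a large residual because a line cannot follow the sine curve; the training error alone therefore cannot be used to pick the degree, which is exactly why overfitting must be guarded against.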

Data fusion refers to combining data from different sources and of different types into a unified representation.

There are two main difficulties in data fusion design: the first is how to handle different attribute types, and the second is how to fuse the attributes. This paper details the handling of attribute types in the "Handling of attribute types" section and the data fusion method in the "Data Fusion" section.

The attribute type refers to the data type of an attribute. The attribute types are: Large_Attributes, Small_Attributes, Intermediate_Attributes, and Enumerated_Attributes.

For Large_Attributes, the larger the independent variable, the larger the dependent variable; that is, the independent variable has a positive benefit on the dependent variable. In other words, there is a positive correlation between the dependent variable and the independent variable. Large_Attributes are processed with min-max normalization: x' = (x - x_min) / (x_max - x_min).

where x_max is the maximum value of the attribute and x_min is the minimum value of the attribute.

Small_Attributes means: the larger the independent variable, the smaller the dependent variable; that is, the independent variable has a negative benefit on the dependent variable. In other words, there is a negative correlation between the independent variable and the dependent variable. Small_Attributes are processed with the reversed min-max normalization: x' = (x_max - x) / (x_max - x_min).

where x_max is the maximum value of the attribute and x_min is the minimum value of the attribute.

Intermediate_Attributes refer to the existence of a threshold: when the independent variable is smaller than the threshold, the attribute behaves as a Large_Attribute, and there is a positive correlation between the independent variable and the dependent variable; when the independent variable is larger than the threshold, it behaves as a Small_Attribute, and the correlation is negative. Intermediate_Attributes can be normalized, for example, as x' = 1 - |x - x_0| / max(x_max - x_0, x_0 - x_min).

where x_max is the maximum value of the attribute, x_min is the minimum value of the attribute, and x_0 is the threshold.
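The three normalizations above can be sketched in Python as follows (an illustrative sketch, not the paper's code; `x0` is the hypothetical threshold for Intermediate_Attributes, and the example house ages are made up):

```python
import numpy as np

def normalize_large(x):
    """Large_Attributes: larger raw value -> larger normalized value."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def normalize_small(x):
    """Small_Attributes: larger raw value -> smaller normalized value."""
    x = np.asarray(x, dtype=float)
    return (x.max() - x) / (x.max() - x.min())

def normalize_intermediate(x, x0):
    """Intermediate_Attributes: closer to threshold x0 -> larger value."""
    x = np.asarray(x, dtype=float)
    spread = max(x.max() - x0, x0 - x.min())
    return 1 - np.abs(x - x0) / spread

ages = [5, 10, 20, 40]          # e.g. house age as a Small_Attribute
print(normalize_small(ages))    # the newest house maps to 1.0
```

All three functions map the attribute into [0, 1], which is what allows attributes of different types to be fused later.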

Enumerated_Attributes means that the attribute values of the independent variable have no real measurement characteristics; the dependent variable is affected by the value of the independent variable, but this influence is difficult to express numerically. Enumerated_Attributes are processed as follows:

Suppose the attribute takes l distinct values a_1, a_2, ..., a_l;

each value is then converted into its One-Hot[7] form:

a_1 is the 1st value of the attribute; its One-Hot code can be expressed as (1, 0, ..., 0);

a_2 is the 2nd value of the attribute; its One-Hot code can be expressed as (0, 1, ..., 0);

......

a_l is the l-th value of the attribute; its One-Hot code can be expressed as (0, 0, ..., 1).

So far, all values of the attribute have been expressed in One-Hot form.

This section analyzes the problem of data fusion, that is, how to merge Large_Attributes, Small_Attributes, Intermediate_Attributes, and Enumerated_Attributes together. This paper proposes a pixel-based data fusion method: first establish a pixel matrix; then use a fully connected neural network model to process the pixel matrix.

This section aims to transform the attributes of each sample into a pixel arrangement. Specifically, suppose there are m samples, each containing m attribute values.

All values of the 1st sample: x_11, x_12, ..., x_1m;

All values of the 2nd sample: x_21, x_22, ..., x_2m;

......

All values of the m-th sample: x_m1, x_m2, ..., x_mm.

Then the 1st sample is arranged as the column vector (x_11 x_12 ... x_1m)^T;

the 2nd sample as the column vector (x_21 x_22 ... x_2m)^T;

......

and the m-th sample as the column vector (x_m1 x_m2 ... x_mm)^T.
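The arrangement above can be sketched with NumPy (an illustrative sketch with made-up normalized values; the paper's actual matrices come from the housing data):

```python
import numpy as np

# Each row holds one sample's m normalized attribute values.
samples = np.array([
    [0.1, 0.5, 0.9],   # sample 1: x_11, x_12, x_13
    [0.2, 0.6, 1.0],   # sample 2: x_21, x_22, x_23
])

# Arrange each sample as a column vector (x_i1 ... x_im)^T,
# giving one pixel column per sample.
pixel_columns = [row.reshape(-1, 1) for row in samples]
print(pixel_columns[0].shape)  # each sample -> an m x 1 column
```

One pixel arrangement is produced per sample, so the number of pixel matrices equals the number of samples.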

In "Create a pixel matrix", one pixel matrix was established per sample, so the number of pixel matrices equals the number of samples; next, a neural network is used to process these pixel matrices.

The choice of network structure: there are many neural network structures, such as fully connected neural networks, convolutional neural networks, long short-term memory networks, and residual networks. Because the application scenario in this paper is simple, a fully connected neural network model is the most appropriate choice.

Selection of activation function: the activation function runs on each neuron and is responsible for mapping the neuron's input to its output. Common activation functions are:

Sigmoid: sigmoid(x) = 1 / (1 + e^(-x));

Tanh: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x));

ReLU: ReLU(x) = max(0, x);

Leaky ReLU: LeakyReLU(x) = max(ax, x), where a is a small positive constant.
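The four activation functions listed above can be written as plain NumPy sketches (illustrative only; the `alpha` default for Leaky ReLU is a hypothetical choice):

```python
import numpy as np

def sigmoid(x):
    """Squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Squashes any real input into (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """Zero for negative inputs, identity for positive inputs."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but lets a small gradient through for negative inputs."""
    return np.where(x > 0, x, alpha * x)

print(sigmoid(0.0), relu(-2.0), leaky_relu(-2.0))
```

Tanh ("tansig" in Matlab's terminology) is the function actually used in the Appendix code.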

This part normalizes the attributes in the data set: first, the data type of each attribute is analyzed in "Attribute Analysis"; then the attributes are normalized in "Normalization".

As mentioned in "Data Sources", the data in this paper is derived from the Boston house price data provided by Kaggle, and the amount of data selected is relatively small. The data set contains 404 training samples and 102 test samples, for a total of 506 samples. In the house price prediction problem, there are 6 attributes that affect house prices: the transaction date X1; the house age X2; the distance from the subway station X3; the number of convenience stores within walking distance X4; the latitude of the house X5; and the longitude of the house X6. The dependent variable is the house price Y.

The transaction date X1 is a time variable; the house age X2 is a Small_Attribute; the distance from the subway station X3 is a Small_Attribute; the number of convenience stores within walking distance X4 is a Large_Attribute; the latitude of the house X5 and the longitude of the house X6 are Enumerated_Attributes.


In this part, the normalized data in “Normalization of attributes” needs to be fused: first, the pixel matrix is established by “Building a Pixel Matrix”; then the fully connected neural network model is established by “Building a Neural Network Model”.

A pixel matrix can be established by “Data Fusion”. As described in “Data sources”, the data in this paper is derived from Boston house price data provided by Kaggle. The amount of data selected is small. The data set contains 404 training samples and 102 test samples, for a total of 506 sample data. Then there are:

All values of the 1st sample: x_11, x_12, ..., x_17;

All values of the 2nd sample: x_21, x_22, ..., x_27;

......

All values of the 506th sample: x_506,1, x_506,2, ..., x_506,7.

The paper will eventually build a neural network model from house attributes to house prices, where the input attributes are: the transaction date X1; the house age X2; the distance from the subway station X3; the number of convenience stores within walking distance X4; the latitude of the house X5; and the longitude of the house X6. The output is the house price Y.

Through the analysis in "Data Fusion", this paper builds a fully connected neural network model. The network model structure is shown in the figure "Network structure". Training goal (target error): 10^-5; learning rate: 0.01.

Network structure
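As a structural sketch of this model (illustrative NumPy code, not the paper's implementation): five hidden layers of four tanh ("tansig") neurons and a single linear output, mirroring the newff call in the Appendix. The weights here are random placeholders; in the paper they are learned by gradient descent with learning rate 0.01.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [6, 4, 4, 4, 4, 4, 1]   # 6 house attributes in, price out
weights = [rng.normal(0, 0.1, (m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    """Forward pass: tanh on the hidden layers, linear output layer."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.tanh(x @ W + b)
    return x @ weights[-1] + biases[-1]

# A hypothetical normalized attribute vector for one house:
price = forward(np.array([0.2, 0.5, 0.1, 0.8, 0.4, 0.6]))
print(price.shape)  # one predicted (normalized) price
```

The fully connected structure means every neuron sees every value of the fused pixel column, which is what lets the network combine attributes of different types.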

There are many ways to train neural networks, such as TensorFlow, Caffe, MXNet, Torch, and Theano in Python, and nntool in Matlab. nntool is a network model training tool that is easy to deploy and has a simple environment. In this paper, the neural network model shown in the figure "Network structure" is trained with nntool.

Nntool

The complete training code is given in the Appendix.

During neural network training with Matlab, part of the training process is shown in the figures "Training process", "Performance", "Training State", and "Regression".

Training process

Performance

Training State

Regression

The results of the neural network model include two parts: a partial display of the predicted results, shown in the figure "Result", and the error graph, shown in the figure "Error_graph".

Result

Error_graph

This paper finally established a neural network model from house attributes to house prices, where the input attributes are: the transaction date X1; the house age X2; the distance from the subway station X3; the number of convenience stores within walking distance X4; the latitude of the house X5; and the longitude of the house X6. The output is the house price Y. After the neural network model has been established, entering the six attributes of a commercial house into the model yields the corresponding house price. The accuracy of the network model is 97.87%.

% Normalize the training inputs p and targets t to [-1, 1]
[pn,minp,maxp,tn,mint,maxt]=premnmx(p,t);

% Five hidden layers of four neurons each, one output neuron
NodeNum1 = 4;
NodeNum2 = 4;
NodeNum3 = 4;
NodeNum4 = 4;
NodeNum5 = 4;
TypeNum  = 1;

% tansig (hyperbolic tangent) transfer function for every layer
TF1 = 'tansig';
TF2 = 'tansig';
TF3 = 'tansig';
TF4 = 'tansig';
TF5 = 'tansig';
TF6 = 'tansig';

% Create the feed-forward network; traingdx is gradient descent with
% momentum and an adaptive learning rate (traingdm is an alternative)
net=newff(minmax(pn),[NodeNum1,NodeNum2,NodeNum3,NodeNum4,NodeNum5,TypeNum],{TF1 TF2 TF3 TF4 TF5 TF6},'traingdx');

net.trainParam.show=50;       % display progress every 50 epochs
net.trainParam.epochs=50000;  % maximum number of training epochs
net.trainParam.goal=1e-5;     % target mean squared error
net.trainParam.lr=0.01;       % learning rate

net=train(net,pn,tn);

% Normalize the test inputs with the training statistics, simulate the
% network, and map the outputs back to the original price scale
p2n=tramnmx(ptest,minp,maxp);
an=sim(net,p2n);
[a]=postmnmx(an,mint,maxt);

% Plot actual values (o) against predicted values (+)
plot(1:length(t),t,'o',1:length(t)+1,a,'+');
title('o: actual value --- +: predicted value')
grid on

% Pad the actual values with the last prediction so the lengths match,
% then compute the prediction error
m=length(a);
t1=[t,a(m)];
error=t1-a;
figure
plot(1:length(error),error,'-.')
title('error graph')
grid on