Data analysis and coding are part of a UX researcher's skill set, especially when working with quantitative data. There are many resources to learn from but, in my experience, they are not equally useful or practical. This is the first Jupyter notebook where I use Python to run a multiple regression analysis (several predictors, one outcome). The data is publicly available, so anyone can follow along.
Data: Taiwan real estate prices https://www.kaggle.com/datasets/smitisinghal/real-estate-dataset
import seaborn as sns
import pandas as pd
csv_file = "C:\\Users\\UXpix\\Downloads\\Real estate.csv"
taiwan_real_estate=pd.read_csv(csv_file)
#print(taiwan_real_estate)
list(taiwan_real_estate)  # list the variables in the dataset; some of them are not relevant to this analysis
['No', 'house age', 'distance to the nearest MRT station', 'number of convenience stores', 'latitude', 'longitude', 'house price of unit area']
taiwan_real_estate = taiwan_real_estate.drop(['No', 'latitude', 'longitude'], axis=1)
list(taiwan_real_estate)
['house age', 'distance to the nearest MRT station', 'number of convenience stores', 'house price of unit area']
# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt
# Draw the scatter plot
sns.scatterplot(x="distance to the nearest MRT station",
y="house price of unit area",
data=taiwan_real_estate)
# Show the plot
plt.show()
The scatterplot shows that the relationship between price and distance to the nearest MRT station is clearly non-linear, so fitting a straight line to the raw variables would be a poor choice. A common remedy is to transform the variables, for example with the square root function, which turned out to be effective in this case.
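To get a feel for what the square root does before applying it to the dataset, here is a minimal sketch with made-up distances (hypothetical values, not taken from the Kaggle file). It only illustrates that large values are pulled in far more than small ones, which is why the transformation can straighten a curved scatterplot.

```python
import math

# Hypothetical distances, chosen only for illustration (not from the dataset)
example_distances = [25, 100, 400, 1600, 6400]

# Square root of each value
sqrt_distances = [math.sqrt(d) for d in example_distances]

# Spread between the smallest and largest value, before and after
raw_range = max(example_distances) - min(example_distances)
sqrt_range = max(sqrt_distances) - min(sqrt_distances)

print(sqrt_distances)
print(raw_range, sqrt_range)
```

The raw values span several thousand units, while the transformed ones span fewer than a hundred, so the few very distant properties no longer dominate the plot.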
import numpy as np  # NumPy supplies the square-root function used for the transformation
taiwan_real_estate['distance_Sqrt'] = np.sqrt(taiwan_real_estate['distance to the nearest MRT station'])
taiwan_real_estate['price_Sqrt'] = np.sqrt(taiwan_real_estate["house price of unit area"])
sns.scatterplot(x="distance_Sqrt", y="price_Sqrt",
data=taiwan_real_estate)
# Show the plot
plt.show()
sns.regplot(x="distance_Sqrt",
y="price_Sqrt",
data=taiwan_real_estate,
ci=95,  # draw the 95% confidence interval around the fitted line
scatter_kws={"s": 2},  # s sets the marker size of the points
line_kws={"lw": 1, "linestyle": "--"})  # lw sets the width of the regression line
# Show the plot
plt.show()
from statsmodels.formula.api import ols
mdl_priceSq_vs_distanceSq= ols('price_Sqrt ~ distance_Sqrt', data = taiwan_real_estate)
mdl_priceSq_vs_distanceSq = mdl_priceSq_vs_distanceSq.fit()
print(mdl_priceSq_vs_distanceSq.params)
Intercept        7.557101
distance_Sqrt   -0.052459
dtype: float64
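Because both variables were square-rooted, the fitted line predicts the square root of the price, not the price itself. As a rough sketch, the coefficients printed above can be plugged in by hand and the result squared to get back to the original price scale. The helper name and the example distance of 400 are my own illustrative choices, and squaring a prediction made on the square-root scale only approximates the expected price.

```python
import math

# Coefficients copied from the printed model parameters
INTERCEPT = 7.557101
SLOPE = -0.052459

def predict_price(distance):
    """Hypothetical helper: predict price per unit area from the raw distance."""
    predicted_sqrt_price = INTERCEPT + SLOPE * math.sqrt(distance)
    # Undo the square-root transformation of the outcome
    return predicted_sqrt_price ** 2

example_price = predict_price(400)  # sqrt(400) = 20
print(round(example_price, 2))
```

The negative slope means the predicted price falls as the distance to the station grows, which matches the downward trend in the regression plot.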
taiwan_real_estate = taiwan_real_estate.rename(columns={'house age': 'house_age', 'number of convenience stores': 'stores'})
list(taiwan_real_estate)
['house_age', 'distance to the nearest MRT station', 'stores', 'house price of unit area', 'distance_Sqrt', 'price_Sqrt']
# import packages and libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
# fit linear regression model with main effects and two interaction terms
linear_model = ols('price_Sqrt ~ house_age + stores + distance_Sqrt + stores:distance_Sqrt + house_age:distance_Sqrt',
data=taiwan_real_estate).fit()
# display model summary
print(linear_model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:            price_Sqrt   R-squared:                        0.641
Model:                           OLS   Adj. R-squared:                   0.636
Method:                Least Squares   F-statistic:                      145.5
Date:               Thu, 26 Sep 2024   Prob (F-statistic):            2.35e-88
Time:                       15:17:03   Log-Likelihood:                 -422.86
No. Observations:                414   AIC:                              857.7
Df Residuals:                    408   BIC:                              881.9
Df Model:                          5
Covariance Type:           nonrobust
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                   7.4002      0.169     43.827      0.000       7.068       7.732
house_age                  -0.0227      0.006     -3.817      0.000      -0.034      -0.011
stores                      0.0931      0.024      3.816      0.000       0.045       0.141
distance_Sqrt              -0.0452      0.005     -8.721      0.000      -0.055      -0.035
stores:distance_Sqrt       -0.0014      0.001     -1.379      0.169      -0.003       0.001
house_age:distance_Sqrt     0.0002      0.000      0.785      0.433      -0.000       0.001
==============================================================================
Omnibus:                      96.960   Durbin-Watson:                    2.113
Prob(Omnibus):                 0.000   Jarque-Bera (JB):               908.951
Skew:                          0.690   Prob(JB):                     4.20e-198
Kurtosis:                     10.127   Cond. No.                      3.41e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.41e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
sns.histplot(linear_model.resid);
linear_model = ols('price_Sqrt ~ house_age + stores + distance_Sqrt',
data=taiwan_real_estate).fit()
# display model summary
print(linear_model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:            price_Sqrt   R-squared:                        0.638
Model:                           OLS   Adj. R-squared:                   0.636
Method:                Least Squares   F-statistic:                      241.1
Date:               Thu, 26 Sep 2024   Prob (F-statistic):            4.01e-90
Time:                       15:23:12   Log-Likelihood:                 -424.31
No. Observations:                414   AIC:                              856.6
Df Residuals:                    410   BIC:                              872.7
Df Model:                          3
Covariance Type:           nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         7.3787      0.138     53.643      0.000       7.108       7.649
house_age        -0.0187      0.003     -6.357      0.000      -0.024      -0.013
stores            0.0664      0.015      4.357      0.000       0.036       0.096
distance_Sqrt    -0.0441      0.003    -16.172      0.000      -0.049      -0.039
==============================================================================
Omnibus:                      99.008   Durbin-Watson:                    2.107
Prob(Omnibus):                 0.000   Jarque-Bera (JB):               906.752
Skew:                          0.721   Prob(JB):                     1.26e-197
Kurtosis:                     10.105   Cond. No.                          154.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
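One hedged way to check whether dropping the two interaction terms cost anything is a likelihood-ratio comparison of the two summaries. The sketch below uses only the log-likelihoods reported above (-422.86 with the interactions, -424.31 without). The choice of test is mine, not something the notebook ran, and it relies on the large-sample chi-square approximation. With two dropped terms, the chi-square upper-tail probability has the closed form exp(-x/2), so no extra library is needed.

```python
import math

# Log-likelihoods copied from the two model summaries
loglik_with_interactions = -422.86
loglik_without_interactions = -424.31

# Likelihood-ratio statistic: twice the gap in log-likelihood.
# Two terms were dropped (stores:distance_Sqrt and house_age:distance_Sqrt),
# so the reference distribution is chi-square with 2 degrees of freedom.
lr_statistic = 2 * (loglik_with_interactions - loglik_without_interactions)

# For a chi-square distribution with 2 degrees of freedom,
# the upper-tail probability is exactly exp(-x / 2)
p_value = math.exp(-lr_statistic / 2)

print(round(lr_statistic, 2), round(p_value, 3))
```

A p-value well above 0.05 is consistent with the non-significant interaction coefficients (p = 0.169 and p = 0.433) and with the slightly lower AIC of the simpler model (856.6 versus 857.7).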
taiwan_RE_core_varr = taiwan_real_estate[['house_age', 'stores', 'distance_Sqrt', 'price_Sqrt']]
list(taiwan_RE_core_varr)
['house_age', 'stores', 'distance_Sqrt', 'price_Sqrt']
# Correlation heatmap of the core variables (seaborn and matplotlib are already imported as sns and plt)
sns.heatmap(taiwan_RE_core_varr.corr(numeric_only=True), cmap='YlGnBu', annot=True)
plt.show()