Linear Regression
Notes
Accuracy
- Residual sum of squares (RSS) = \(\sum_{i=1}^n (y_i - \hat{y}_i)^2\).
- Residual standard error (RSE) = \(\sqrt{\frac{1}{n-2}\text{RSS}}\).
- \(R^2\) = 1 - \(\frac{\text{RSS}}{\text{TSS}}\).
- Total sum of squares (TSS) = \(\sum_{i=1}^n (y_i - \bar{y})^2\).
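These quantities are easy to compute directly from the residuals. A minimal sketch (my own, not from the notes), assuming y and y_hat are arrays from a simple one-predictor fit so that the RSE uses \(n - 2\) degrees of freedom:
import numpy as np

def accuracy_metrics(y, y_hat):
    """Return RSS, RSE and R^2 for a simple linear regression fit."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
    rse = np.sqrt(rss / (n - 2))          # residual standard error
    tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
    return rss, rse, 1 - rss / tss        # last value is R^2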
Outliers & High Leverage points
- Outlier is an observation for which the predicted response is far from the recorded response.
- High leverage point is an observation for which the recorded predictors are unusual.
- High leverage points usually have more influence on the model than outliers.
- The studentized (standardized) residuals are typically expected to be between -3 and 3. If they lie outside then that indicates that there might be outliers in the data.
- The average leverage is \((p + 1) / n\). If the leverage of an observation is significantly higher than the average leverage then it is a high leverage point. (Look into "Cook's distance", which combines the residual and the leverage of an observation into a single measure of influence.)
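With statsmodels these diagnostics come from the influence object of a fitted OLS result. A minimal sketch (the fitted result object model is assumed, and the 3x-average-leverage cutoff is just a common rule of thumb):
import numpy as np

influence = model.get_influence()
student_resid = influence.resid_studentized_internal  # standardized residuals
leverage = influence.hat_matrix_diag                  # leverage values h_ii
cooks_d = influence.cooks_distance[0]                 # Cook's distances

n, p = int(model.nobs), int(model.df_model)
avg_leverage = (p + 1) / n
possible_outliers = np.where(np.abs(student_resid) > 3)[0]
high_leverage = np.where(leverage > 3 * avg_leverage)[0]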
Exercises
Question 1
Since there are three p-values in Table 3.4, there are three null hypotheses:
- H0(1): sales does not depend on TV
- H0(2): sales does not depend on radio
- H0(3): sales does not depend on newspaper

The p-values corresponding to the first two null hypotheses are very small, less than \(10^{-4}\), whereas the p-value for the last one is close to 1. This means that the probability of seeing the observed sales numbers if H0(1) or H0(2) is true is almost zero, but the probability of seeing the observed sales numbers if H0(3) is true is almost 1. In other words sales depends on TV and radio but not on newspaper.
Question 2
A KNN classifier assigns an observation to the class that holds the majority among its K nearest neighbors. In the KNN regression method the response is estimated as the average of the training responses of the K nearest neighbors.
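The difference is easy to see side by side. Below is a minimal scikit-learn sketch (not part of the original answer; the toy data and K = 3 are chosen purely for illustration):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy one-dimensional training data (hypothetical).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_class = np.array([0, 0, 1, 1, 1])           # class labels
y_reg = np.array([1.1, 1.9, 3.2, 3.9, 5.1])   # continuous responses

knn_clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
knn_reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)

x0 = np.array([[2.5]])
print(knn_clf.predict(x0))  # majority vote among the 3 nearest neighbors
print(knn_reg.predict(x0))  # average of the 3 nearest training responses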
Question 3
- Assuming X4 = X1 X2 and X5 = X1 X3, the linear fit is: Y = 50 + 20 X1 + 0.07 X2 + 35 X3 + 0.01 X4 - 10 X5 = 50 + 20 X1 + 0.07 X2 + 35 X3 + 0.01 X1 X2 - 10 X1 X3. Since X3 (Gender) is 1 for females and 0 for males, this fit essentially gives two models based on gender: Y = 85 + 10 X1 + 0.07 X2 + 0.01 X1 X2 for females, and Y = 50 + 20 X1 + 0.07 X2 + 0.01 X1 X2 for males. If X1 (GPA) and X2 (IQ) are fixed, the starting salary for males will on average exceed that of females when 20 X1 > 35 + 10 X1, i.e. X1 > 35 / 10 = 3.5. In other words, provided the GPA is high enough, males on average have a higher starting salary than females, which is answer (iii).
- If IQ is 110 and GPA is 4.0 then a female is predicted to have a starting salary of $ 137,100.
- False. The scale of IQ is much larger than that of GPA or Gender, so the coefficients on any IQ-based predictor are expected to be small even when the effect matters; the size of a coefficient by itself says nothing about statistical significance, which is what the p-value (or t-statistic) of the interaction term measures. (Possibly this is why it is often recommended to scale the predictors, so they are all on an equal footing and it is easier to see which ones are important.)
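A quick numerical check of the salary prediction in the second bullet above (the helper function is hypothetical, written only to verify the arithmetic):
def predicted_salary(gpa, iq, female):
    """Starting salary (in thousands of dollars) from the fitted model in Question 3."""
    return (50 + 20 * gpa + 0.07 * iq + 35 * female
            + 0.01 * gpa * iq - 10 * gpa * female)

print(predicted_salary(gpa=4.0, iq=110, female=1))  # 137.1, i.e. $137,100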
Question 4
- Even though the true relationship is linear, we need to take into account the irreducible error \(ϵ\). The cubic model is more flexible than the linear model and therefore will be able to better fit to the irreducible error. I expect the cubic model to have lower training RSS than the linear model.
- The linear model will have a lower test RSS than the cubic model since it matches the true relationship. The cubic model would have overfitted to the irreducible error in the training data and therefore perform worse on the test data.
- Irrespective of the true relationship I would expect the cubic model to have lower training RSS than the linear model because it is more flexible than the linear model and thus fits the training data better.
- Since we do not know the true relationship between X and Y, it is difficult to say which of the two models will have the lower test RSS without more information.
Question 5
We just have to substitute equation 3.38 into the equation for the \(i\)th fitted value and rearrange: \(\hat{y}_i = \hat{\beta}_1 x_i = x_i \sum_{i'=1}^n x_{i'} y_{i'} / \sum_{k=1}^n x_k^2 = \sum_{i'=1}^n a_{i'} y_{i'}\), where
\begin{align} a_{i'} = \frac{x_i x_{i'}}{\sum_{k=1}^n x_k^2}. \end{align}
Question 6
The equation for the least squares line is \(y = \hat{\beta}_0 + \hat{\beta}_1 x\). From equation 3.4 we have \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). Then putting \(x = \bar{x}\) in the least squares line equation we see that \(y = \bar{y}\). Thus the least squares line always passes through \((\bar{x}, \bar{y})\).
Question 7
Assumption: \(\bar{x} = \bar{y} = 0\). From equation 3.4 we have \(\hat{y}_i = \hat{\beta}_1 x_i\), where
\begin{align} \hat{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}. \end{align}
And from equation 3.17 we have
\begin{align} R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n y_i^2}. \end{align}
Expanding the square in the numerator of the above fraction and substituting \(\hat{y}_i\) with \(\hat{\beta}_1 x_i\) we get
\begin{align} R^2 = \frac{\hat{\beta}_1 (2 \sum_{i=1}^n x_i y_i - \hat{\beta}_1 \sum_{i=1}^n x_i^2)}{\sum_{i=1}^n y_i^2}. \end{align}
Now if we substitute the above form of \(\hat{\beta}_1\) into this equation we see that \(R^2 = \mathrm{Cor}(X, Y)^2\), where \(\mathrm{Cor}(X, Y)\) is given by equation 3.18 (with \(\bar{x} = \bar{y} = 0\)).
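A quick numerical sanity check of this identity on zero-mean synthetic data (my own sketch, not part of the original solution; the data and seed are arbitrary):
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(size=200)
x, y = x - x.mean(), y - y.mean()   # enforce the zero-mean assumption

beta1 = np.sum(x * y) / np.sum(x ** 2)
rss = np.sum((y - beta1 * x) ** 2)
r2 = 1 - rss / np.sum(y ** 2)
cor2 = np.corrcoef(x, y)[0, 1] ** 2
print(np.isclose(r2, cor2))  # True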
Question 8
Simple linear regression on Auto data set
import pandas as pd
import numpy as np
from tabulate import tabulate

auto = pd.read_csv("data/Auto.csv")
print(tabulate(auto.head(), auto.columns, tablefmt="orgtbl"))
|    |   mpg |   cylinders |   displacement |   horsepower |   weight |   acceleration |   year |   origin | name                      |
|----+-------+-------------+----------------+--------------+----------+----------------+--------+----------+---------------------------|
|  0 |    18 |           8 |            307 |          130 |     3504 |           12   |     70 |        1 | chevrolet chevelle malibu |
|  1 |    15 |           8 |            350 |          165 |     3693 |           11.5 |     70 |        1 | buick skylark 320         |
|  2 |    18 |           8 |            318 |          150 |     3436 |           11   |     70 |        1 | plymouth satellite        |
|  3 |    16 |           8 |            304 |          150 |     3433 |           12   |     70 |        1 | amc rebel sst             |
|  4 |    17 |           8 |            302 |          140 |     3449 |           10.5 |     70 |        1 | ford torino               |
Recall from the last chapter that horsepower has some missing values and needs to be converted to a numeric form before we can use it for linear regression.
auto.drop(auto[auto.horsepower == "?"].index, inplace=True)
auto["horsepower"] = pd.to_numeric(auto["horsepower"])
For simple linear regression we can use the ordinary least squares model from statsmodels or the linear regression model from scikit-learn. In this question we are asked to print the summary of the fitted model. scikit-learn has no method for generating a summary, but statsmodels does.
import statsmodels.formula.api as smf

model = smf.ols(formula="mpg~horsepower", data=auto).fit()
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: mpg R-squared: 0.606 Model: OLS Adj. R-squared: 0.605 Method: Least Squares F-statistic: 599.7 Date: Mon, 25 May 2020 Prob (F-statistic): 7.03e-81 Time: 17:19:34 Log-Likelihood: -1178.7 No. Observations: 392 AIC: 2361. Df Residuals: 390 BIC: 2369. Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 39.9359 0.717 55.660 0.000 38.525 41.347 horsepower -0.1578 0.006 -24.489 0.000 -0.171 -0.145 ============================================================================== Omnibus: 16.432 Durbin-Watson: 0.920 Prob(Omnibus): 0.000 Jarque-Bera (JB): 17.305 Skew: 0.492 Prob(JB): 0.000175 Kurtosis: 3.299 Cond. No. 322. ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- The F-statistic is much larger than 1, and the associated p-value (Prob (F-statistic), equivalently P>|t| for the single coefficient) is essentially zero. This indicates that there is a relationship between mpg and horsepower.
- The \(R^2\) value of 0.606 indicates that horsepower explains around 61% of the variance in mpg.
- The coefficient corresponding to horsepower is negative. This indicates that the relationship between mpg and horsepower is negative.

pred = model.get_prediction(exog=dict(horsepower=98))
pred_summary = pred.summary_frame()
print(tabulate(pred_summary, pred_summary.columns, tablefmt="orgtbl"))
|    |    mean |   mean_se |   mean_ci_lower |   mean_ci_upper |   obs_ci_lower |   obs_ci_upper |
|----+---------+-----------+-----------------+-----------------+----------------+----------------|
|  0 | 24.4671 |  0.251262 |         23.9731 |         24.9611 |        14.8094 |        34.1248 |
The predicted mpg for horsepower = 98 is 24.4671. The 95% confidence interval is \([23.9731, 24.9611]\), and the 95% prediction interval is \([14.8094, 34.1248]\). As mentioned in the text, the prediction interval contains the confidence interval.
Least squares plot
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("ticks")

X = auto["horsepower"]
Y = auto["mpg"]
Ypred = model.predict(X)

plt.close("all")
fig, ax = plt.subplots()
ax.plot(X, Y, 'o', label="Data")
ax.plot(X, Ypred, 'r', label="Least Squares Regression")
ax.legend(loc="best")
sns.despine()
fig.savefig("img/3_auto_ls.png", dpi=90)
Diagnostic plots
The R command plot in this case gives four diagnostic plots:
- Residuals vs Fitted
- Normal Q-Q
- Scale-Location
- Residuals vs Leverage
The Residuals vs Fitted plot shows any non-linear pattern in the residuals, and by extension in the data. The Normal Q-Q plot shows whether the residuals are normally distributed. The Scale-Location plot shows whether there is heteroscedasticity. The Residuals vs Leverage plot shows whether there are high leverage points in the data.
We will produce these plots using statsmodels and seaborn.
- Residuals vs Fitted plot
fitted_vals = model.fittedvalues

plt.close("all")
fig, ax = plt.subplots()
residplot = sns.residplot(x=fitted_vals, y="mpg", data=auto, lowess=True,
                          line_kws={"color": "red"}, ax=ax)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
sns.despine()
fig.savefig("img/3_res_vs_fit.png", dpi=90)
This plot clearly shows that there is non-linearity in the data.
- Normal Q-Q plot
import statsmodels.api as sm

residuals = model.resid

plt.close("all")
fig, ax = plt.subplots()
qqplot = sm.qqplot(residuals, line='45', ax=ax, fit=True)
ax.set_ylabel("Standardized Residuals")
sns.despine()
fig.savefig("img/3_auto_qq.png", dpi=90)
Though there are some points that are far from the \(45^\circ\) fitted line, most of the points lie close to the line, indicating that the residuals are mostly normally distributed.
- Scale-Location plot
# Normalized (internally studentized) residuals and their square roots.
norm_residuals = model.get_influence().resid_studentized_internal
norm_residuals_abs_sqrt = np.sqrt(np.abs(norm_residuals))

plt.close("all")
fig, ax = plt.subplots()
slplot = sns.regplot(x=fitted_vals, y=norm_residuals_abs_sqrt, lowess=True,
                     line_kws={"color": "red"}, ax=ax)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Sqrt of |Standardized Residuals|")
sns.despine()
fig.savefig("img/3_auto_scale_loc.png", dpi=90)
This plot is similar to the first diagnostic plot, except that the quantity on the y-axis is now non-negative. It shows that the assumption of homoscedasticity does not hold, i.e. the variance of the residuals is not constant.
- Residuals vs Leverage plot
plt.close("all")
fig, ax = plt.subplots()
rlplot = sm.graphics.influence_plot(model, criterion="Cooks", ax=ax)
sns.despine()
ax.set_xlabel("Leverage")
ax.set_ylabel("Standardized Residuals")
ax.set_title(" ")
fig.savefig("img/3_auto_res_vs_lev.png", dpi=90)
We see that none of the points have a very high leverage.
Question 9
Scatter plot matrix of Auto data set
plt.close("all")
spm = sns.pairplot(auto, plot_kws={'s': 10})
spm.fig.set_size_inches(12, 12)
spm.savefig("img/3_auto_scatter.png", dpi=90)
Correlation matrix
I find heat maps better for visualizing correlation matrices than tables. Since the correlation matrix is symmetric we can ignore either the lower or the upper triangle. We can also ignore the diagonal since it is always 1.
corr_mat = auto[auto.columns[:-1]].corr()

plt.close("all")
fig, ax = plt.subplots()
# Custom diverging color map.
cmap = sns.diverging_palette(220, 10, sep=80, n=7)
# Mask for the upper triangle.
mask = np.triu(np.ones_like(corr_mat, dtype=bool))
with sns.axes_style("white"):
    sns.heatmap(corr_mat, mask=mask, cmap=cmap, annot=True, robust=True, ax=ax)
fig.savefig("img/3_auto_corr_heat.png")
We see that mpg has considerable negative correlations with cylinders, displacement, horsepower, and weight. This matches what we saw in the scatter plot matrix above. Similarly cylinders, displacement, horsepower and weight are all correlated with each other.
Multiple linear regression with Auto data set
We could do this with the statsmodels.formula API but that involves more typing, so we will use the statsmodels API.
import statsmodels.api as sm

Y = auto["mpg"]
X = auto[auto.columns[1:-1]]
X = sm.add_constant(X)  # For the intercept.
ml_model = sm.OLS(Y, X).fit()
print(ml_model.summary())
OLS Regression Results ============================================================================== Dep. Variable: mpg R-squared: 0.821 Model: OLS Adj. R-squared: 0.818 Method: Least Squares F-statistic: 252.4 Date: Mon, 25 May 2020 Prob (F-statistic): 2.04e-139 Time: 17:25:25 Log-Likelihood: -1023.5 No. Observations: 392 AIC: 2063. Df Residuals: 384 BIC: 2095. Df Model: 7 Covariance Type: nonrobust ================================================================================ coef std err t P>|t| [0.025 0.975] -------------------------------------------------------------------------------- const -17.2184 4.644 -3.707 0.000 -26.350 -8.087 cylinders -0.4934 0.323 -1.526 0.128 -1.129 0.142 displacement 0.0199 0.008 2.647 0.008 0.005 0.035 horsepower -0.0170 0.014 -1.230 0.220 -0.044 0.010 weight -0.0065 0.001 -9.929 0.000 -0.008 -0.005 acceleration 0.0806 0.099 0.815 0.415 -0.114 0.275 year 0.7508 0.051 14.729 0.000 0.651 0.851 origin 1.4261 0.278 5.127 0.000 0.879 1.973 ============================================================================== Omnibus: 31.906 Durbin-Watson: 1.309 Prob(Omnibus): 0.000 Jarque-Bera (JB): 53.100 Skew: 0.529 Prob(JB): 2.95e-12 Kurtosis: 4.460 Cond. No. 8.59e+04 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 8.59e+04. This might indicate that there are strong multicollinearity or other numerical problems.
- The large F-statistic indicates that we can reject the null hypothesis, which says that the response mpg does not depend on any of the predictors. The probability of seeing this data if the null hypothesis were true is essentially zero (2.04e-139).
- Looking at the p-values of the individual predictors we see that weight, year, and origin have the most statistically significant relationship with mpg. We can also argue that displacement has a somewhat significant relationship with mpg. On the other hand cylinders, horsepower, and acceleration do not have a statistically significant relationship. This is not necessarily surprising: given the correlation between mpg, displacement, cylinders and horsepower, one can argue that the information in cylinders and horsepower is largely redundant.
- The coefficient for the year variable suggests that mpg increases by 0.7508 per year (holding the other predictors fixed), i.e. cars become more fuel-efficient every year.
Diagnostic plots
We make the same diagnostic plots as the previous exercise.
- Residuals vs Fitted plot
fitted_vals = ml_model.fittedvalues

plt.close("all")
fig, ax = plt.subplots()
residplot = sns.residplot(x=fitted_vals, y="mpg", data=auto, lowess=True,
                          line_kws={"color": "red"}, ax=ax)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
sns.despine()
fig.savefig("img/3_ml_res_vs_fit.png", dpi=90)
This plot clearly shows that there is non-linearity in the data.
- Normal Q-Q plot
import statsmodels.api as sm

residuals = ml_model.resid

plt.close("all")
fig, ax = plt.subplots()
qqplot = sm.qqplot(residuals, line='45', ax=ax, fit=True)
ax.set_ylabel("Standardized Residuals")
sns.despine()
fig.savefig("img/3_ml_auto_qq.png", dpi=90)
Though there are some points that are far from the \(45^\circ\) fitted line, most of the points lie close to the line, indicating that the residuals are mostly normally distributed.
- Scale-Location plot
# Normalized (internally studentized) residuals and their square roots.
norm_residuals = ml_model.get_influence().resid_studentized_internal
norm_residuals_abs_sqrt = np.sqrt(np.abs(norm_residuals))

plt.close("all")
fig, ax = plt.subplots()
slplot = sns.regplot(x=fitted_vals, y=norm_residuals_abs_sqrt, lowess=True,
                     line_kws={"color": "red"}, ax=ax)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Sqrt of |Standardized Residuals|")
sns.despine()
fig.savefig("img/3_ml_auto_scale_loc.png", dpi=90)
The spread of the standardized residuals is smaller than in the simple regression plot, but there is still quite a bit of variation, which means the assumption of homoscedasticity still does not hold.
- Residuals vs Leverage plot
plt.close("all")
fig, ax = plt.subplots()
rlplot = sm.graphics.influence_plot(ml_model, criterion="Cooks", ax=ax)
sns.despine()
ax.set_xlabel("Leverage")
ax.set_ylabel("Standardized Residuals")
ax.set_title(" ")
fig.savefig("img/3_ml_auto_res_vs_lev.png", dpi=90)
Point 13 has a high leverage but not a very high residual.
Interaction effects
We go back to the statsmodels.formula API. Additionally I will drop cylinders, horsepower, and acceleration from the model. I will try the following interaction terms (the resulting design-matrix columns are illustrated in the sketch right after this list):
- year : origin: adds a new predictor which is the product of year and origin, but does not include year and origin separately,
- year * origin: adds the product of year and origin, and also includes year and origin separately,
- year * weight: same as the above, except with weight in place of origin.
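To see exactly which columns each formula generates, we can ask patsy (the formula engine behind statsmodels) for the design matrix directly; a small check, not part of the original exercise:
from patsy import dmatrix

# Columns produced by the ":" operator (interaction only).
print(dmatrix("year : origin", data=auto).design_info.column_names)
# expected: ['Intercept', 'year:origin']

# Columns produced by the "*" operator (main effects plus interaction).
print(dmatrix("year * origin", data=auto).design_info.column_names)
# expected: ['Intercept', 'year', 'origin', 'year:origin']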
ml_model_1 = smf.ols(formula="mpg ~ displacement + weight + year : origin", data=auto).fit()
ml_model_2 = smf.ols(formula="mpg ~ displacement + weight + year * origin", data=auto).fit()
ml_model_3 = smf.ols(formula="mpg ~ displacement + weight * year + origin", data=auto).fit()
- Summary of first interaction model
print(ml_model_1.summary())
OLS Regression Results ============================================================================== Dep. Variable: mpg R-squared: 0.714 Model: OLS Adj. R-squared: 0.712 Method: Least Squares F-statistic: 322.7 Date: Mon, 25 May 2020 Prob (F-statistic): 4.85e-105 Time: 18:13:04 Log-Likelihood: -1115.9 No. Observations: 392 AIC: 2240. Df Residuals: 388 BIC: 2256. Df Model: 3 Covariance Type: nonrobust ================================================================================ coef std err t P>|t| [0.025 0.975] -------------------------------------------------------------------------------- Intercept 39.8822 1.427 27.939 0.000 37.076 42.689 displacement -0.0100 0.006 -1.728 0.085 -0.021 0.001 weight -0.0056 0.001 -8.137 0.000 -0.007 -0.004 year:origin 0.0193 0.004 4.502 0.000 0.011 0.028 ============================================================================== Omnibus: 41.720 Durbin-Watson: 0.883 Prob(Omnibus): 0.000 Jarque-Bera (JB): 68.034 Skew: 0.674 Prob(JB): 1.69e-15 Kurtosis: 4.532 Cond. No. 2.09e+04 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 2.09e+04. This might indicate that there are strong multicollinearity or other numerical problems.
The large F-statistic lets us reject the null hypothesis. The individual p-values show that displacement is not really significant, but the product of year and origin is. Additionally the \(R^2\) value tells us that this model explains around 71% of the variance in mpg.
- Summary of second interaction model
print(ml_model_2.summary())
OLS Regression Results ============================================================================== Dep. Variable: mpg R-squared: 0.823 Model: OLS Adj. R-squared: 0.821 Method: Least Squares F-statistic: 359.5 Date: Mon, 25 May 2020 Prob (F-statistic): 8.65e-143 Time: 18:17:24 Log-Likelihood: -1021.6 No. Observations: 392 AIC: 2055. Df Residuals: 386 BIC: 2079. Df Model: 5 Covariance Type: nonrobust ================================================================================ coef std err t P>|t| [0.025 0.975] -------------------------------------------------------------------------------- Intercept 7.9270 8.873 0.893 0.372 -9.519 25.373 displacement 0.0016 0.005 0.319 0.750 -0.008 0.011 weight -0.0064 0.001 -11.571 0.000 -0.007 -0.005 year 0.4313 0.113 3.818 0.000 0.209 0.653 origin -14.4936 4.707 -3.079 0.002 -23.749 -5.239 year:origin 0.2023 0.060 3.345 0.001 0.083 0.321 ============================================================================== Omnibus: 38.636 Durbin-Watson: 1.322 Prob(Omnibus): 0.000 Jarque-Bera (JB): 71.804 Skew: 0.584 Prob(JB): 2.56e-16 Kurtosis: 4.741 Cond. No. 1.84e+05 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.84e+05. This might indicate that there are strong multicollinearity or other numerical problems.
The \(R^2\) value has increased; it is now almost the same as the \(R^2\) value for the model with all the quantitative predictors but no interaction. This model explains around 82% of the variance in mpg. Additionally we see that year is more significant than origin or the product of year and origin. Also, in addition to displacement, the intercept term now appears to be insignificant.
- Summary of third interaction model
print(ml_model_3.summary())
OLS Regression Results ============================================================================== Dep. Variable: mpg R-squared: 0.840 Model: OLS Adj. R-squared: 0.838 Method: Least Squares F-statistic: 404.4 Date: Mon, 25 May 2020 Prob (F-statistic): 5.53e-151 Time: 18:22:57 Log-Likelihood: -1002.4 No. Observations: 392 AIC: 2017. Df Residuals: 386 BIC: 2041. Df Model: 5 Covariance Type: nonrobust ================================================================================ coef std err t P>|t| [0.025 0.975] -------------------------------------------------------------------------------- Intercept -107.6004 12.904 -8.339 0.000 -132.971 -82.229 displacement -0.0004 0.005 -0.088 0.930 -0.009 0.009 weight 0.0260 0.005 5.722 0.000 0.017 0.035 year 1.9624 0.172 11.436 0.000 1.625 2.300 weight:year -0.0004 5.97e-05 -7.214 0.000 -0.001 -0.000 origin 0.9116 0.255 3.579 0.000 0.411 1.412 ============================================================================== Omnibus: 43.792 Durbin-Watson: 1.372 Prob(Omnibus): 0.000 Jarque-Bera (JB): 89.759 Skew: 0.619 Prob(JB): 3.23e-20 Kurtosis: 4.991 Cond. No. 1.90e+07 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.9e+07. This might indicate that there are strong multicollinearity or other numerical problems.
The \(R^2\) value increased a bit, and it appears that, other than displacement, all the other predictors are significant.
- Interaction model with two interactions
We will try an additional model with two interactions: displacement * weight and weight * year.
ml_model_4 = smf.ols("mpg ~ displacement * weight + weight * year + origin", data=auto).fit()
print(ml_model_4.summary())
OLS Regression Results ============================================================================== Dep. Variable: mpg R-squared: 0.859 Model: OLS Adj. R-squared: 0.856 Method: Least Squares F-statistic: 389.6 Date: Mon, 25 May 2020 Prob (F-statistic): 4.10e-160 Time: 18:33:39 Log-Likelihood: -977.80 No. Observations: 392 AIC: 1970. Df Residuals: 385 BIC: 1997. Df Model: 6 Covariance Type: nonrobust ======================================================================================= coef std err t P>|t| [0.025 0.975] --------------------------------------------------------------------------------------- Intercept -61.3985 13.741 -4.468 0.000 -88.416 -34.381 displacement -0.0604 0.009 -6.424 0.000 -0.079 -0.042 weight 0.0090 0.005 1.848 0.065 -0.001 0.019 displacement:weight 1.708e-05 2.38e-06 7.169 0.000 1.24e-05 2.18e-05 year 1.4982 0.174 8.616 0.000 1.156 1.840 weight:year -0.0002 6.16e-05 -4.037 0.000 -0.000 -0.000 origin 0.3388 0.253 1.342 0.181 -0.158 0.835 ============================================================================== Omnibus: 62.892 Durbin-Watson: 1.412 Prob(Omnibus): 0.000 Jarque-Bera (JB): 174.597 Skew: 0.754 Prob(JB): 1.22e-38 Kurtosis: 5.901 Cond. No. 8.00e+07 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 8e+07. This might indicate that there are strong multicollinearity or other numerical problems.
This is interesting. We see that the \(R^2\) value has increased further, but now displacement has become significant whereas weight and origin have become relatively insignificant. The interaction terms are still significant. My understanding of this model is that while weight does not directly affect mpg, it increases displacement, and that affects mpg.
Models with variable transformations
ml_model_trans = smf.ols(formula="mpg ~ np.log(weight) + np.power(weight, 2) + year",
                         data=auto).fit()
print(ml_model_trans.summary())
OLS Regression Results ============================================================================== Dep. Variable: mpg R-squared: 0.851 Model: OLS Adj. R-squared: 0.850 Method: Least Squares F-statistic: 749.0 Date: Mon, 25 May 2020 Prob (F-statistic): 4.19e-162 Time: 18:49:50 Log-Likelihood: -1001.5 No. Observations: 397 AIC: 2011. Df Residuals: 393 BIC: 2027. Df Model: 3 Covariance Type: nonrobust ======================================================================================= coef std err t P>|t| [0.025 0.975] --------------------------------------------------------------------------------------- Intercept 213.1803 16.063 13.271 0.000 181.599 244.761 np.log(weight) -32.5971 2.150 -15.159 0.000 -36.825 -28.369 np.power(weight, 2) 6.506e-07 1.12e-07 5.804 0.000 4.3e-07 8.71e-07 year 0.8355 0.044 19.023 0.000 0.749 0.922 ============================================================================== Omnibus: 69.088 Durbin-Watson: 1.361 Prob(Omnibus): 0.000 Jarque-Bera (JB): 168.823 Skew: 0.864 Prob(JB): 2.19e-37 Kurtosis: 5.687 Cond. No. 1.17e+09 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.17e+09. This might indicate that there are strong multicollinearity or other numerical problems.
The F-statistic and the p-values indicate that these transformations are statistically significant.
Question 10
Multiple regression with Carseats data set
So far I had been loading the data sets from local .csv files, but I recently found out that statsmodels makes them automatically available using the Rdatasets project. So going forward I will be using that whenever possible.
import statsmodels.api as sm

carseats = sm.datasets.get_rdataset("Carseats", package="ISLR")
print(carseats.__doc__)
+----------+-----------------+ | Carseats | R Documentation | +----------+-----------------+ Sales of Child Car Seats ------------------------ Description ~~~~~~~~~~~ A simulated data set containing sales of child car seats at 400 different stores. Usage ~~~~~ :: Carseats Format ~~~~~~ A data frame with 400 observations on the following 11 variables. ``Sales`` Unit sales (in thousands) at each location ``CompPrice`` Price charged by competitor at each location ``Income`` Community income level (in thousands of dollars) ``Advertising`` Local advertising budget for company at each location (in thousands of dollars) ``Population`` Population size in region (in thousands) ``Price`` Price company charges for car seats at each site ``ShelveLoc`` A factor with levels ``Bad``, ``Good`` and ``Medium`` indicating the quality of the shelving location for the car seats at each site ``Age`` Average age of the local population ``Education`` Education level at each location ``Urban`` A factor with levels ``No`` and ``Yes`` to indicate whether the store is in an urban or rural location ``US`` A factor with levels ``No`` and ``Yes`` to indicate whether the store is in the US or not Source ~~~~~~ Simulated data References ~~~~~~~~~~ James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) *An Introduction to Statistical Learning with applications in R*, `www.StatLearning.com <www.StatLearning.com>`__, Springer-Verlag, New York Examples ~~~~~~~~ :: summary(Carseats) lm.fit=lm(Sales~Advertising+Price,data=Carseats)
Multiple linear regression to predict Sales using Price, Urban, and US.
import statsmodels.formula.api as smf

model = smf.ols(formula="Sales ~ Price + Urban + US", data=carseats.data).fit()
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: Sales R-squared: 0.239 Model: OLS Adj. R-squared: 0.234 Method: Least Squares F-statistic: 41.52 Date: Thu, 28 May 2020 Prob (F-statistic): 2.39e-23 Time: 18:21:58 Log-Likelihood: -927.66 No. Observations: 400 AIC: 1863. Df Residuals: 396 BIC: 1879. Df Model: 3 Covariance Type: nonrobust ================================================================================ coef std err t P>|t| [0.025 0.975] -------------------------------------------------------------------------------- Intercept 13.0435 0.651 20.036 0.000 11.764 14.323 Urban[T.Yes] -0.0219 0.272 -0.081 0.936 -0.556 0.512 US[T.Yes] 1.2006 0.259 4.635 0.000 0.691 1.710 Price -0.0545 0.005 -10.389 0.000 -0.065 -0.044 ============================================================================== Omnibus: 0.676 Durbin-Watson: 1.912 Prob(Omnibus): 0.713 Jarque-Bera (JB): 0.758 Skew: 0.093 Prob(JB): 0.684 Kurtosis: 2.897 Cond. No. 628. ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The F-statistic is larger than 1, though much smaller than the F-statistics in the last problem, and its p-value (2.39e-23) is essentially zero. So we can reject the null hypothesis that none of the predictors are related to Sales.
From the individual p-values we can conclude that Urban is not a statistically significant predictor for Sales.
Interpretation of coefficient of predictors
Since Urban is not a statistically significant predictor we do not need to worry about its coefficient. The coefficient for US indicates that, all else equal, a store being in the US is associated with an increase in sales of about 1,200 units (Sales is measured in thousands of units). On the other hand the coefficient for Price says that an increase in price is associated with a decrease in sales, about 54 units for every extra dollar.
Linear model equation
The equation for this model is
\begin{align} Y = 13.04 - 0.02 X_1 + 1.20 X_2 - 0.05 X_3, \end{align}
where \(Y\), \(X_1\), \(X_2\), and \(X_3\) stand for Sales, Urban, US, and Price, respectively. \(X_1 = 1\) if the store is in an urban location, and \(0\) otherwise. Similarly \(X_2 = 1\) if the store is in the US, and \(0\) if it is not.
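As a worked example of reading this equation (a hypothetical store, not from the text), the sketch below plugs in an urban US store charging $120; the coefficients are the rounded values from the summary above:
def predicted_sales(urban, us, price):
    """Predicted unit sales (in thousands) from the fitted Carseats model."""
    return 13.04 - 0.02 * urban + 1.20 * us - 0.05 * price

print(predicted_sales(urban=1, us=1, price=120))  # about 8.2, i.e. roughly 8,200 units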
Null hypothesis rejection
We can reject the null hypothesis for US and Price, since the associated p-values are effectively 0.
Smaller multiple linear model for Carseats sales
small_model = smf.ols(formula="Sales ~ US + Price", data=carseats.data).fit()
print(small_model.summary())
OLS Regression Results ============================================================================== Dep. Variable: Sales R-squared: 0.239 Model: OLS Adj. R-squared: 0.235 Method: Least Squares F-statistic: 62.43 Date: Thu, 28 May 2020 Prob (F-statistic): 2.66e-24 Time: 18:45:15 Log-Likelihood: -927.66 No. Observations: 400 AIC: 1861. Df Residuals: 397 BIC: 1873. Df Model: 2 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 13.0308 0.631 20.652 0.000 11.790 14.271 US[T.Yes] 1.1996 0.258 4.641 0.000 0.692 1.708 Price -0.0545 0.005 -10.416 0.000 -0.065 -0.044 ============================================================================== Omnibus: 0.666 Durbin-Watson: 1.912 Prob(Omnibus): 0.717 Jarque-Bera (JB): 0.749 Skew: 0.092 Prob(JB): 0.688 Kurtosis: 2.895 Cond. No. 607. ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Based on the \(R^2\) values both the models fit the data similarly.
Confidence intervals of fitted parameters
print(small_model.conf_int())
                  0          1
Intercept  11.79032  14.271265
US[T.Yes]   0.69152   1.707766
Price      -0.06476  -0.044195
The confidence intervals are also printed in the summary, but this is probably more convenient.
Outliers and leverages
To see if there are any leverage points we need to first calculate the average leverage, \((p + 1) / n\), for the data.
npredictors = 2
nobservations = len(carseats.data)
avg_leverage = (npredictors + 1) / nobservations
print(f"Average leverage: {avg_leverage}")
Average leverage: 0.0075
The Residuals vs Leverage plot is the easiest way to check for outliers and high leverage observations.
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("ticks")

plt.close("all")
fig, ax = plt.subplots()
rlplot = sm.graphics.influence_plot(small_model, criterion="Cooks", ax=ax)
sns.despine()
ax.set_xlabel("Leverage")
ax.set_ylabel("Standardized Residuals")
ax.set_title(" ")
fig.savefig("img/3.10.h_res_vs_lev.png", dpi=90)
All the residuals are between -3 and 3, so there are no outliers. However there are a lot of points whose leverage greatly exceeds the average leverage. Thus there are high leverage observations.
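We can back up the visual impression with a quick count of how many observations cross common rule-of-thumb thresholds (a sketch reusing small_model and avg_leverage from above; the 3x-average cutoff is just a convention):
import numpy as np

influence = small_model.get_influence()
leverage = influence.hat_matrix_diag
student_resid = influence.resid_studentized_internal

print("Observations with |studentized residual| > 3:", np.sum(np.abs(student_resid) > 3))
print("Observations with leverage > 3 * average:", np.sum(leverage > 3 * avg_leverage))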
Question 11
Simple linear regression with synthetic data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("ticks")

rng = np.random.default_rng(seed=42)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)
data = pd.DataFrame({"X": x, "Y": y})

plt.close("all")
fig, ax = plt.subplots()
sp = sns.scatterplot(x="X", y="Y", data=data, ax=ax)
sns.despine()
fig.savefig("img/3.11.a_data.png", dpi=90)
Now we do a simple linear regression with this synthetic data. This model will not have an intercept: \(Y = βX\).
import statsmodels.formula.api as smf

model = smf.ols("Y ~ X - 1", data=data).fit()
print(model.summary())
OLS Regression Results ======================================================================================= Dep. Variable: Y R-squared (uncentered): 0.741 Model: OLS Adj. R-squared (uncentered): 0.738 Method: Least Squares F-statistic: 283.3 Date: Fri, 29 May 2020 Prob (F-statistic): 8.30e-31 Time: 07:28:26 Log-Likelihood: -138.87 No. Observations: 100 AIC: 279.7 Df Residuals: 99 BIC: 282.4 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ X 2.1196 0.126 16.833 0.000 1.870 2.369 ============================================================================== Omnibus: 2.995 Durbin-Watson: 1.681 Prob(Omnibus): 0.224 Jarque-Bera (JB): 2.970 Skew: 0.408 Prob(JB): 0.227 Kurtosis: 2.787 Cond. No. 1.00 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The coefficient estimate is \(\hat{β} = 2.1196\) with a standard error of \(0.126\). The t-statistic is 16.833 and the associated p-value is 0. This means we can reject the null hypothesis.
Inverse simple linear relation with synthetic data
We are going to use the same data, but now with X as the response and Y as the predictor.
model2 = smf.ols("X ~ Y - 1", data=data).fit()
print(model2.summary())
OLS Regression Results ======================================================================================= Dep. Variable: X R-squared (uncentered): 0.741 Model: OLS Adj. R-squared (uncentered): 0.738 Method: Least Squares F-statistic: 283.3 Date: Fri, 29 May 2020 Prob (F-statistic): 8.30e-31 Time: 07:36:14 Log-Likelihood: -48.770 No. Observations: 100 AIC: 99.54 Df Residuals: 99 BIC: 102.1 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Y 0.3496 0.021 16.833 0.000 0.308 0.391 ============================================================================== Omnibus: 1.369 Durbin-Watson: 1.557 Prob(Omnibus): 0.504 Jarque-Bera (JB): 1.145 Skew: -0.020 Prob(JB): 0.564 Kurtosis: 2.477 Cond. No. 1.00 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The coefficient estimate is \(\hat{β} = 0.3496\) with a standard error of \(0.021\). The t-statistic is \(16.833\) and the associated p-value is 0. This means we can reject the null hypothesis.
Relation between the two regressions
Given the underlying true model we might have expected the coefficients of the two regressions to be multiplicative inverses of each other, but they are not. The reason is that the two regressions minimize different residual sums of squares. For the two models the residual sums of squares are
\begin{align} \mathrm{RSS}^{(1)} &= \sum_{i=1}^n (y_i - \hat{\beta}^{(1)} x_i)^2, \\ \mathrm{RSS}^{(2)} &= \sum_{i=1}^n (x_i - \hat{\beta}^{(2)} y_i)^2, \end{align}
respectively. \(\mathrm{RSS}^{(1)}\) is minimized when \(\hat{\beta}^{(1)} = \sum_i x_i y_i / \sum_i x_i^2\), and \(\mathrm{RSS}^{(2)}\) is minimized when \(\hat{\beta}^{(2)} = \sum_i x_i y_i / \sum_i y_i^2\). If \(\hat{\beta}^{(1)} = 1 / \hat{\beta}^{(2)}\) then we would have
\begin{align} \left( \sum_{i=1}^n x_i y_i \right)^2 = \sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i^2. \end{align}
Since here \(X\) and \(Y\) are random variables with zero mean we can interpret the above equation as
\begin{align} \mathrm{Cov}(X, Y)^2 = \mathrm{Var}(X) \mathrm{Var}(Y), \end{align}
i.e. \(\mathrm{Cor}(X, Y) = \pm 1\). This is true only if the true relation is \(y_i = \beta x_i + \gamma\) for some constants \(\beta \neq 0\) and \(\gamma\), with no noise (see DeGroot and Schervish, Theorem 4.6.3, or ProofWiki for a proof of this statement). But the true relation in this case was \(y_i = \beta x_i + \epsilon\), where \(\epsilon\) is a Gaussian random variable with zero mean. Thus the above equality does not hold, and hence \(\hat{\beta}^{(1)} \neq 1 / \hat{\beta}^{(2)}\). For a more detailed discussion on this check Stats StackExchange.
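A related numerical check (my own aside): the product of the two slope estimates equals the uncentred \(R^2\) of either fit, since \(\hat{\beta}^{(1)} \hat{\beta}^{(2)} = (\sum_i x_i y_i)^2 / (\sum_i x_i^2 \sum_i y_i^2)\):
print(model.params["X"] * model2.params["Y"])  # ~0.741
print(model.rsquared)                          # uncentred R^2 of the first fit, also ~0.741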
t-statistic for first model
The t-statistic for a simple linear fit without intercept is \(\hat{\beta} / \mathrm{SE}(\hat{\beta})\), where \(\hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2\), and the standard error is
\begin{align} \mathrm{SE}(\hat{\beta}) = \sqrt{\frac{\sum_i (y_i - x_i \hat{\beta})^2}{(n-1) \sum_i x_i^2}}. \end{align}
Substituting the expression for \(\hat{\beta}\) into the expressions for the standard error and the t-statistic gives us the expected expression for the t-statistic. The trick is to realize that the summation indices are dummy variables, i.e. \(\sum_{i=1}^n x_i^2 = \sum_{j=1}^n x_j^2\). Numerically we can confirm this as follows:
n = len(data)
x, y = data["X"], data["Y"]
t = (np.sqrt(n - 1) * np.sum(x * y)
     / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2) - np.sum(x * y) ** 2))
print(f"t-statistic: {t:.3f}")
t-statistic: 16.833
t-statistic for second model
The expression for the t-statistic is symmetric in \(X\) and \(Y\), so irrespective of whether we are regressing \(Y\) onto \(X\) or \(X\) onto \(Y\), we will have the same t-statistic.
t-statistic for models with intercept
model3 = smf.ols(formula="Y ~ X", data=data).fit()
model4 = smf.ols(formula="X ~ Y", data=data).fit()
print(model3.summary())
OLS Regression Results ============================================================================== Dep. Variable: Y R-squared: 0.740 Model: OLS Adj. R-squared: 0.738 Method: Least Squares F-statistic: 279.2 Date: Sat, 30 May 2020 Prob (F-statistic): 1.94e-30 Time: 00:26:31 Log-Likelihood: -138.87 No. Observations: 100 AIC: 281.7 Df Residuals: 98 BIC: 287.0 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept -0.0046 0.098 -0.047 0.962 -0.200 0.190 X 2.1192 0.127 16.709 0.000 1.867 2.371 ============================================================================== Omnibus: 2.996 Durbin-Watson: 1.682 Prob(Omnibus): 0.224 Jarque-Bera (JB): 2.971 Skew: 0.409 Prob(JB): 0.226 Kurtosis: 2.787 Cond. No. 1.30 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
print(model4.summary())
OLS Regression Results ============================================================================== Dep. Variable: X R-squared: 0.740 Model: OLS Adj. R-squared: 0.738 Method: Least Squares F-statistic: 279.2 Date: Sat, 30 May 2020 Prob (F-statistic): 1.94e-30 Time: 00:26:47 Log-Likelihood: -48.728 No. Observations: 100 AIC: 101.5 Df Residuals: 98 BIC: 106.7 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept -0.0114 0.040 -0.287 0.775 -0.091 0.068 Y 0.3493 0.021 16.709 0.000 0.308 0.391 ============================================================================== Omnibus: 1.373 Durbin-Watson: 1.559 Prob(Omnibus): 0.503 Jarque-Bera (JB): 1.146 Skew: -0.018 Prob(JB): 0.564 Kurtosis: 2.477 Cond. No. 1.91 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We can see that the t-statistic for the predictor (16.709) is the same in both models.
Question 12
Equal regression coefficients
As this is regression without an intercept we can use the expressions derived in the previous question. The coefficient estimates will be the same when \(\sum_i x_i^2 = \sum_i y_i^2\). This is trivially true when \(X = Y\), but it also holds, for example, when the observations of \(Y\) are a permutation of those of \(X\).
Different coefficient estimates - numerical
Essentially reusing question 11.
rng = np.random.default_rng(seed=42)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)
data = pd.DataFrame({"X": x, "Y": y})

plt.close("all")
fig, ax = plt.subplots()
sp = sns.scatterplot(x="X", y="Y", data=data, ax=ax)
sns.despine()
fig.savefig("img/3.12.b_data.png", dpi=90)
model1 = smf.ols("Y ~ X - 1", data=data).fit()
model2 = smf.ols("X ~ Y - 1", data=data).fit()
print(model1.summary())
OLS Regression Results ======================================================================================= Dep. Variable: Y R-squared (uncentered): 0.741 Model: OLS Adj. R-squared (uncentered): 0.738 Method: Least Squares F-statistic: 283.3 Date: Sat, 30 May 2020 Prob (F-statistic): 8.30e-31 Time: 00:47:09 Log-Likelihood: -138.87 No. Observations: 100 AIC: 279.7 Df Residuals: 99 BIC: 282.4 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ X 2.1196 0.126 16.833 0.000 1.870 2.369 ============================================================================== Omnibus: 2.995 Durbin-Watson: 1.681 Prob(Omnibus): 0.224 Jarque-Bera (JB): 2.970 Skew: 0.408 Prob(JB): 0.227 Kurtosis: 2.787 Cond. No. 1.00 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
print(model2.summary())
OLS Regression Results ======================================================================================= Dep. Variable: X R-squared (uncentered): 0.741 Model: OLS Adj. R-squared (uncentered): 0.738 Method: Least Squares F-statistic: 283.3 Date: Sat, 30 May 2020 Prob (F-statistic): 8.30e-31 Time: 00:47:17 Log-Likelihood: -48.770 No. Observations: 100 AIC: 99.54 Df Residuals: 99 BIC: 102.1 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Y 0.3496 0.021 16.833 0.000 0.308 0.391 ============================================================================== Omnibus: 1.369 Durbin-Watson: 1.557 Prob(Omnibus): 0.504 Jarque-Bera (JB): 1.145 Skew: -0.020 Prob(JB): 0.564 Kurtosis: 2.477 Cond. No. 1.00 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Same coefficient estimates - numerical
We need \(\sum_i x_i^2 = \sum_i y_i^2\). Setting \(X = Y\) works, but letting \(Y\) be a permutation of \(X\) works too, and is more general.
rng = np.random.default_rng(seed=42)
x = rng.normal(size=100)
y = np.random.permutation(x)
data = pd.DataFrame({"X": x, "Y": y})

plt.close("all")
fig, ax = plt.subplots()
sp = sns.scatterplot(x="X", y="Y", data=data, ax=ax)
sns.despine()
fig.savefig("img/3.12.b_data.png", dpi=90)
model3 = smf.ols("Y ~ X - 1", data=data).fit()
model4 = smf.ols("X ~ Y - 1", data=data).fit()
print(model3.summary())
OLS Regression Results ======================================================================================= Dep. Variable: Y R-squared (uncentered): 0.002 Model: OLS Adj. R-squared (uncentered): -0.008 Method: Least Squares F-statistic: 0.2018 Date: Sat, 30 May 2020 Prob (F-statistic): 0.654 Time: 00:52:14 Log-Likelihood: -116.23 No. Observations: 100 AIC: 234.5 Df Residuals: 99 BIC: 237.1 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ X 0.0451 0.100 0.449 0.654 -0.154 0.244 ============================================================================== Omnibus: 0.651 Durbin-Watson: 1.772 Prob(Omnibus): 0.722 Jarque-Bera (JB): 0.787 Skew: -0.142 Prob(JB): 0.675 Kurtosis: 2.671 Cond. No. 1.00 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
print(model4.summary())
OLS Regression Results ======================================================================================= Dep. Variable: X R-squared (uncentered): 0.002 Model: OLS Adj. R-squared (uncentered): -0.008 Method: Least Squares F-statistic: 0.2018 Date: Sat, 30 May 2020 Prob (F-statistic): 0.654 Time: 00:52:20 Log-Likelihood: -116.23 No. Observations: 100 AIC: 234.5 Df Residuals: 99 BIC: 237.1 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Y 0.0451 0.100 0.449 0.654 -0.154 0.244 ============================================================================== Omnibus: 0.296 Durbin-Watson: 1.833 Prob(Omnibus): 0.862 Jarque-Bera (JB): 0.446 Skew: -0.105 Prob(JB): 0.800 Kurtosis: 2.749 Cond. No. 1.00 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Question 13
Feature vector
rng = np.random.default_rng(seed=42)
x = rng.normal(loc=0, scale=1, size=100)
Error vector
eps = rng.normal(loc=0, scale=0.25, size=100)
Response vector
y = -1 + 0.5 * x + eps
The length of y is 100, and \(β_0 = -1\) and \(β_1 = 0.5\).
Scatter plot
df = pd.DataFrame({"x": x, "y": y})

plt.close("all")
# jointplot also shows the marginal distributions of x and y in addition to the scatter plot.
sp = sns.jointplot(x="x", y="y", data=df)
sns.despine()
sp.savefig("img/3.13.d_scatter.png", dpi=90)
We can see a clear linear trend between x and y.
Least squares fit
model = smf.ols("y ~ x", data=df).fit()
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.740 Model: OLS Adj. R-squared: 0.738 Method: Least Squares F-statistic: 279.2 Date: Sat, 30 May 2020 Prob (F-statistic): 1.94e-30 Time: 03:07:58 Log-Likelihood: -0.24351 No. Observations: 100 AIC: 4.487 Df Residuals: 98 BIC: 9.697 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept -1.0012 0.025 -40.774 0.000 -1.050 -0.952 x 0.5298 0.032 16.709 0.000 0.467 0.593 ============================================================================== Omnibus: 2.996 Durbin-Watson: 1.682 Prob(Omnibus): 0.224 Jarque-Bera (JB): 2.971 Skew: 0.409 Prob(JB): 0.226 Kurtosis: 2.787 Cond. No. 1.30 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The estimates for \(β_0\) and \(β_1\) are almost equal to the true values. The true values fall within the 95% confidence interval of the estimated values.
ypred = model.predict(df["x"])

plt.close("all")
fig, ax = plt.subplots()
ax.plot(x, y, 'o', label="Data")
ax.plot(x, ypred, 'r', label="Least Squares Regression")
ax.legend(loc="best")
sns.despine()
fig.savefig("img/3.13.f_ols.png", dpi=90)
Polynomial regression
poly_model = smf.ols(formula="y ~ x + I(x**2)", data=df).fit()
print(poly_model.summary())
OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.746 Model: OLS Adj. R-squared: 0.741 Method: Least Squares F-statistic: 142.6 Date: Sat, 30 May 2020 Prob (F-statistic): 1.30e-29 Time: 03:32:56 Log-Likelihood: 0.93852 No. Observations: 100 AIC: 4.123 Df Residuals: 97 BIC: 11.94 Df Model: 2 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept -0.9732 0.031 -31.881 0.000 -1.034 -0.913 x 0.5200 0.032 16.177 0.000 0.456 0.584 I(x ** 2) -0.0474 0.031 -1.523 0.131 -0.109 0.014 ============================================================================== Omnibus: 2.591 Durbin-Watson: 1.731 Prob(Omnibus): 0.274 Jarque-Bera (JB): 2.542 Skew: 0.380 Prob(JB): 0.281 Kurtosis: 2.818 Cond. No. 2.08 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The \(R^2\) values of the two models are pretty much the same, and the p-value for the quadratic term (0.131) is too large to reject the null hypothesis that its coefficient is zero. The quadratic term does not improve the model fit.
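One way to make this comparison formal is an F-test between the two nested models (a short sketch; model and poly_model are the fits above, and statsmodels.api is assumed to be imported as sm):
# F-test for the nested comparison: does adding x^2 significantly reduce the RSS?
print(sm.stats.anova_lm(model, poly_model))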
Least squares fit with less noise
The new data is as follows. The spread of the noise is now 0.1 instead of 0.25.
eps = rng.normal(loc=0, scale=0.1, size=100)
y = -1 + 0.5 * x + eps
df2 = pd.DataFrame({"x": x, "y": y})

plt.close("all")
# jointplot also shows the marginal distributions of x and y in addition to the scatter plot.
sp = sns.jointplot(x="x", y="y", data=df2)
sns.despine()
sp.savefig("img/3.13.h_scatter.png", dpi=90)
Now the least squares fit to this data.
less_noisy_model = smf.ols("y ~ x", data=df2).fit()
print(less_noisy_model.summary())
OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.935 Model: OLS Adj. R-squared: 0.934 Method: Least Squares F-statistic: 1403. Date: Sat, 30 May 2020 Prob (F-statistic): 7.01e-60 Time: 03:39:50 Log-Likelihood: 86.452 No. Observations: 100 AIC: -168.9 Df Residuals: 98 BIC: -163.7 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept -1.0063 0.010 -97.523 0.000 -1.027 -0.986 x 0.4991 0.013 37.458 0.000 0.473 0.526 ============================================================================== Omnibus: 3.211 Durbin-Watson: 1.893 Prob(Omnibus): 0.201 Jarque-Bera (JB): 2.554 Skew: 0.345 Prob(JB): 0.279 Kurtosis: 3.371 Cond. No. 1.30 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The \(R^2\) value for this data set is much larger than for the original data set. The model is able to explain about 93% of the variance in the less noisy data, whereas it could explain only about 74% of the variance in the original data set.
ypred = less_noisy_model.predict(df2["x"])

plt.close("all")
fig, ax = plt.subplots()
ax.plot(x, y, 'o', label="Data")
ax.plot(x, ypred, 'r', label="Least Squares Regression")
ax.legend(loc="best")
sns.despine()
fig.savefig("img/3.13.h_ols.png", dpi=90)
Least squares fit with more noise
The new data is as follows. The spread of the noise is now 0.5 instead of 0.25.
eps = rng.normal(loc=0, scale=0.5, size=100)
y = -1 + 0.5 * x + eps
df3 = pd.DataFrame({"x": x, "y": y})

plt.close("all")
sp = sns.jointplot(x="x", y="y", data=df3)
sns.despine()
sp.savefig("img/3.13.i_scatter.png", dpi=90)
Now the least squares fit to this data.
more_noisy_model = smf.ols("y ~ x", data=df3).fit()
print(more_noisy_model.summary())
OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.430 Model: OLS Adj. R-squared: 0.424 Method: Least Squares F-statistic: 74.01 Date: Sat, 30 May 2020 Prob (F-statistic): 1.29e-13 Time: 03:45:50 Log-Likelihood: -75.586 No. Observations: 100 AIC: 155.2 Df Residuals: 98 BIC: 160.4 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept -1.0417 0.052 -19.971 0.000 -1.145 -0.938 x 0.5794 0.067 8.603 0.000 0.446 0.713 ============================================================================== Omnibus: 0.119 Durbin-Watson: 1.889 Prob(Omnibus): 0.942 Jarque-Bera (JB): 0.294 Skew: -0.029 Prob(JB): 0.863 Kurtosis: 2.741 Cond. No. 1.30 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The \(R^2\) value for this data set is much smaller than it was for the original data set. The model is able to explain only 43% of the variance in the noisier data, whereas it could explain about 74% of the variance in the original data.
ypred = more_noisy_model.predict(df3["x"])

plt.close("all")
fig, ax = plt.subplots()
ax.plot(x, y, 'o', label="Data")
ax.plot(x, ypred, 'r', label="Least Squares Regression")
ax.legend(loc="best")
sns.despine()
fig.savefig("img/3.13.i_ols.png", dpi=90)
From the graph it appears that there are possible outliers, which is not surprising given the spread of the error.
Confidence intervals of the three models
print("Confidence interval based on original data set:\n")
print(f"{model.conf_int()}\n")
print("Confidence interval based on less noisy data set:\n")
print(f"{less_noisy_model.conf_int()}\n")
print("Confidence interval based on more noisy data set:\n")
print(f"{more_noisy_model.conf_int()}\n")
Confidence interval based on original data set:

                  0         1
Intercept -1.049887 -0.952433
x          0.466873  0.592714

Confidence interval based on less noisy data set:

                  0         1
Intercept -1.026756 -0.985803
x          0.472653  0.525535

Confidence interval based on more noisy data set:

                  0         1
Intercept -1.145197 -0.938180
x          0.445752  0.713071
The confidence intervals for the less noisy data set are the tightest and the confidence intervals for the more noisy data set are the loosest.
Question 14
Multiple linear model with collinearity
from numpy.random import MT19937

rng = np.random.default_rng(MT19937(seed=5))
x1 = rng.uniform(size=100)
x2 = 0.5 * x1 + rng.normal(size=100) / 10
y = 2 + 2 * x1 + 0.3 * x2 + rng.normal(size=100)
df_coll = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
The form of the linear model is
\begin{align}
Y = 2 + 2 X_1 + 0.3 X_2 + ϵ.
\end{align}

Correlation scatter plot
corr = df_coll.corr()
print(corr)
         x1       x2        y
x1  1.00000  0.76818  0.55569
x2  0.76818  1.00000  0.45415
y   0.55569  0.45415  1.00000
The correlation between `x1` and `x2` is 0.768.
plt.close("all") fig, ax = plt.subplots() sp = sns.scatterplot(x="x1", y="x2", data=df_coll, ax=ax) sns.despine() fig.savefig("img/3.14.a_scatter.png", dpi=90)
Least squares fit with `x1` and `x2`
coll_model1 = smf.ols("y ~ x1 + x2", data=df_coll).fit()
print(coll_model1.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.311
Model:                            OLS   Adj. R-squared:                  0.296
Method:                 Least Squares   F-statistic:                     21.85
Date:                Sat, 30 May 2020   Prob (F-statistic):           1.46e-08
Time:                        04:51:30   Log-Likelihood:                -133.37
No. Observations:                 100   AIC:                             272.7
Df Residuals:                      97   BIC:                             280.6
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.8690      0.194      9.651      0.000       1.485       2.253
x1             2.1749      0.568      3.832      0.000       1.048       3.301
x2             0.4454      0.881      0.505      0.614      -1.304       2.194
==============================================================================
Omnibus:                        0.484   Durbin-Watson:                   1.964
Prob(Omnibus):                  0.785   Jarque-Bera (JB):                0.623
Skew:                          -0.140   Prob(JB):                        0.732
Kurtosis:                       2.734   Cond. No.                         12.2
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The estimated values for the coefficients are 1.869, 2.175, and 0.445, which are reasonably close to the true values 2, 2, and 0.3, albeit with large standard errors, particularly for \(\hat{β}_2\). Based on the p-values we can reject the null hypothesis for \(β_1\), but we cannot reject the null hypothesis for \(β_2\).
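The inflated standard errors can be quantified with variance inflation factors. A short sketch using `statsmodels`; the rule of thumb that a VIF above roughly 5–10 signals problematic collinearity is an assumption on my part, not something stated in the text:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Design matrix with an explicit intercept, matching the fitted model.
X = sm.add_constant(df_coll[["x1", "x2"]])
for i, name in enumerate(X.columns):
    if name != "const":
        print(f"VIF({name}) = {variance_inflation_factor(X.values, i):.2f}")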
Least squares fit with `x1` only
coll_model2 = smf.ols("y ~ x1", data=df_coll).fit()
print(coll_model2.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.309
Model:                            OLS   Adj. R-squared:                  0.302
Method:                 Least Squares   F-statistic:                     43.78
Date:                Sat, 30 May 2020   Prob (F-statistic):           1.96e-09
Time:                        04:51:44   Log-Likelihood:                -133.50
No. Observations:                 100   AIC:                             271.0
Df Residuals:                      98   BIC:                             276.2
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.8690      0.193      9.688      0.000       1.486       2.252
x1             2.3952      0.362      6.617      0.000       1.677       3.114
==============================================================================
Omnibus:                        0.538   Durbin-Watson:                   1.940
Prob(Omnibus):                  0.764   Jarque-Bera (JB):                0.650
Skew:                          -0.160   Prob(JB):                        0.723
Kurtosis:                       2.768   Cond. No.                         4.80
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The estimated coefficient for `x1` has increased, and the \(R^2\) value has decreased marginally. We can still reject the null hypothesis for \(β_1\) based on the p-value.
Least squares fit with `x2` only
coll_model3 = smf.ols("y ~ x2", data=df_coll).fit()
print(coll_model3.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.206
Model:                            OLS   Adj. R-squared:                  0.198
Method:                 Least Squares   F-statistic:                     25.46
Date:                Sat, 30 May 2020   Prob (F-statistic):           2.08e-06
Time:                        04:51:59   Log-Likelihood:                -140.42
No. Observations:                 100   AIC:                             284.8
Df Residuals:                      98   BIC:                             290.0
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.2853      0.171     13.355      0.000       1.946       2.625
x2             3.0393      0.602      5.046      0.000       1.844       4.235
==============================================================================
Omnibus:                        0.036   Durbin-Watson:                   2.117
Prob(Omnibus):                  0.982   Jarque-Bera (JB):                0.064
Skew:                          -0.038   Prob(JB):                        0.969
Kurtosis:                       2.902   Cond. No.                         6.38
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The estimated coefficient for `x2` is much larger now, but the \(R^2\) value is lower than for the other two fits. Unlike in the multiple regression, we can now reject the null hypothesis for \(β_2\) based on the p-value.
Contradiction of models
The three models do not contradict each other. Due to the high correlation between `x1` and `x2`, we can predict `x2` from `x1`. Thus in the first (multiple regression) model `x2` adds very little explanatory power and so we cannot reject the null hypothesis for \(β_2\).
For the second and third models, the increase in the coefficients can be explained by substitution (see the derivation below). In the second model `x2` is effectively expressed in terms of `x1`, so the coefficient of `x1` in the expression for `y` increases to about 2.15. In the third model `x1` is effectively expressed in terms of `x2`, so the coefficient of `x2` in the expression for `y` increases to about 4.3. The 95% confidence interval for `x1` in the second model includes this effective value. The 95% confidence interval for `x2` in the third model falls just short of 4.3, which is not surprising: `x2` is itself a noisy function of `x1`, so its estimated coefficient is attenuated, and the particular random draws also differ from those produced by R in the textbook.
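The effective coefficients quoted above come from substituting the (approximate, noise-free) relationship between the two predictors into the true model:

\begin{align}
X_2 \approx 0.5 X_1 &\implies Y \approx 2 + 2 X_1 + 0.3 (0.5 X_1) = 2 + 2.15 X_1, \\
X_1 \approx 2 X_2 &\implies Y \approx 2 + 2 (2 X_2) + 0.3 X_2 = 2 + 4.3 X_2.
\end{align}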
Additional data
# Note: DataFrame.append was removed in pandas 2.0; use pd.concat on newer versions.
df_coll = df_coll.append({"x1" : 0.1, "x2" : 0.8, "y" : 6}, ignore_index=True)
coll_model1 = smf.ols("y ~ x1 + x2", data=df_coll).fit()
coll_model2 = smf.ols("y ~ x1", data=df_coll).fit()
coll_model3 = smf.ols("y ~ x2", data=df_coll).fit()
print(coll_model1.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.295
Model:                            OLS   Adj. R-squared:                  0.281
Method:                 Least Squares   F-statistic:                     20.50
Date:                Sat, 30 May 2020   Prob (F-statistic):           3.66e-08
Time:                        04:56:57   Log-Likelihood:                -138.91
No. Observations:                 101   AIC:                             283.8
Df Residuals:                      98   BIC:                             291.7
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.9525      0.200      9.769      0.000       1.556       2.349
x1             1.2762      0.508      2.514      0.014       0.269       2.283
x2             2.0004      0.753      2.657      0.009       0.506       3.495
==============================================================================
Omnibus:                        0.020   Durbin-Watson:                   2.024
Prob(Omnibus):                  0.990   Jarque-Bera (JB):                0.144
Skew:                           0.018   Prob(JB):                        0.931
Kurtosis:                       2.819   Cond. No.                         9.92
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
print(coll_model2.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.244
Model:                            OLS   Adj. R-squared:                  0.237
Method:                 Least Squares   F-statistic:                     31.98
Date:                Sat, 30 May 2020   Prob (F-statistic):           1.51e-07
Time:                        04:57:04   Log-Likelihood:                -142.43
No. Observations:                 101   AIC:                             288.9
Df Residuals:                      99   BIC:                             294.1
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0051      0.205      9.786      0.000       1.599       2.412
x1             2.1847      0.386      5.655      0.000       1.418       2.951
==============================================================================
Omnibus:                        6.115   Durbin-Watson:                   1.839
Prob(Omnibus):                  0.047   Jarque-Bera (JB):                7.252
Skew:                           0.308   Prob(JB):                       0.0266
Kurtosis:                       4.160   Cond. No.                         4.76
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
print(coll_model3.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.249
Model:                            OLS   Adj. R-squared:                  0.242
Method:                 Least Squares   F-statistic:                     32.90
Date:                Sat, 30 May 2020   Prob (F-statistic):           1.06e-07
Time:                        04:57:12   Log-Likelihood:                -142.07
No. Observations:                 101   AIC:                             288.1
Df Residuals:                      99   BIC:                             293.4
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.2419      0.168     13.365      0.000       1.909       2.575
x2             3.2762      0.571      5.736      0.000       2.143       4.409
==============================================================================
Omnibus:                        0.044   Durbin-Watson:                   2.139
Prob(Omnibus):                  0.978   Jarque-Bera (JB):                0.040
Skew:                          -0.033   Prob(JB):                        0.980
Kurtosis:                       2.927   Cond. No.                         6.09
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The \(R^2\) values for models 1 and 2 decreased, while it increased slightly for model 3; the added observation reduces the predictive power of the first two models. Note also that in the first model the coefficient of `x2` is now statistically significant. The average leverage for the data set is
# Number of predictors in the full model; exclude the response column y.
p = len(df_coll.columns) - 1
n = len(df_coll)
lev = (p + 1) / n
print(f"{lev:.3f}")
0.030
plt.close("all") fig, ax = plt.subplots() rlplot = sm.graphics.influence_plot(coll_model1, criterion="Cooks", ax=ax) sns.despine() ax.set_xlabel("Leverage") ax.set_ylabel("Standardized Residuals") ax.set_title(" ") fig.savefig("img/3.14.g_coll1_res_vs_lev.png", dpi=90)
For the first model the added observation is both an outlier and a high leverage point.
plt.close("all") fig, ax = plt.subplots() rlplot = sm.graphics.influence_plot(coll_model2, criterion="Cooks", ax=ax) sns.despine() ax.set_xlabel("Leverage") ax.set_ylabel("Standardized Residuals") ax.set_title(" ") fig.savefig("img/3.14.g_coll2_res_vs_lev.png", dpi=90)
For the second model the added observation is an outlier but not a high leverage point.
plt.close("all") fig, ax = plt.subplots() rlplot = sm.graphics.influence_plot(coll_model3, criterion="Cooks", ax=ax) sns.despine() ax.set_xlabel("Leverage") ax.set_ylabel("Standardized Residuals") ax.set_title(" ") fig.savefig("img/3.14.g_coll3_res_vs_lev.png", dpi=90)
For the third model the added observation is a high leverage point but not an outlier.
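Instead of reading the influence plots by eye, the leverage and externally studentized residual of the added observation (the last row) can be pulled out directly. A minimal sketch, assuming the three refitted models from above are still in scope:

for name, m in [("y ~ x1 + x2", coll_model1),
                ("y ~ x1", coll_model2),
                ("y ~ x2", coll_model3)]:
    infl = m.get_influence()
    # hat_matrix_diag holds the leverages; resid_studentized_external the
    # externally studentized residuals.
    print(f"{name:>12}: leverage = {infl.hat_matrix_diag[-1]:.3f}, "
          f"studentized residual = {infl.resid_studentized_external[-1]:.2f}")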
Question 15
Predict per capita crime rate with the `Boston` data set
We load the `Boston` data set from `statsmodels`.
import statsmodels.api as sm

boston = sm.datasets.get_rdataset("Boston", "MASS")
print(boston.__doc__)
+--------+-----------------+
| Boston | R Documentation |
+--------+-----------------+

Housing Values in Suburbs of Boston
-----------------------------------

Description
~~~~~~~~~~~

The ``Boston`` data frame has 506 rows and 14 columns.

Usage
~~~~~

::

   Boston

Format
~~~~~~

This data frame contains the following columns:

``crim``
   per capita crime rate by town.

``zn``
   proportion of residential land zoned for lots over 25,000 sq.ft.

``indus``
   proportion of non-retail business acres per town.

``chas``
   Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

``nox``
   nitrogen oxides concentration (parts per 10 million).

``rm``
   average number of rooms per dwelling.

``age``
   proportion of owner-occupied units built prior to 1940.

``dis``
   weighted mean of distances to five Boston employment centres.

``rad``
   index of accessibility to radial highways.

``tax``
   full-value property-tax rate per $10,000.

``ptratio``
   pupil-teacher ratio by town.

``black``
   1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

``lstat``
   lower status of the population (percent).

``medv``
   median value of owner-occupied homes in $1000s.

Source
~~~~~~

Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for
clean air. *J. Environ. Economics and Management* **5**, 81–102.

Belsley D.A., Kuh, E. and Welsch, R.E. (1980) *Regression Diagnostics.
Identifying Influential Data and Sources of Collinearity.* New York: Wiley.
Earlier we regressed `medv` on the other predictors; now we will regress `crim` on the other predictors.
import statsmodels.formula.api as smf

df = boston.data
predictors = [c for c in df.columns if c != "crim"]
simple_models = {p : smf.ols(formula=f"crim ~ {p}", data=df).fit()
                 for p in predictors}
print("predictor coefficient p-value")
for p, model in simple_models.items():
    print(f"{p:^9} {model.params[p]:>9,.4f} {model.pvalues[p]:>9,.4f}")
predictor coefficient p-value
   zn        -0.0739    0.0000
  indus       0.5098    0.0000
   chas      -1.8928    0.2094
   nox       31.2485    0.0000
    rm       -2.6841    0.0000
   age        0.1078    0.0000
   dis       -1.5509    0.0000
   rad        0.6179    0.0000
   tax        0.0297    0.0000
 ptratio      1.1520    0.0000
  black      -0.0363    0.0000
  lstat       0.5488    0.0000
   medv      -0.3632    0.0000
Except for `chas`, every other predictor appears to be statistically significant in its own simple regression.
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil

sns.set_style("ticks")
ncols = 4
nrows = ceil(len(predictors) / ncols)
plt.close("all")
fig, axs = plt.subplots(nrows=nrows, ncols=ncols,
                        constrained_layout=True, figsize=(12, 10))
for i in range(nrows):
    for j in range(ncols):
        if i * ncols + j < len(predictors):
            sns.regplot(x=df[predictors[i * ncols + j]], y=df["crim"],
                        ax=axs[i, j], line_kws={"color" : "r"})
sns.despine()
fig.savefig("img/3.15.a_reg_mat.png", dpi=120)
Multiple regression with the `Boston` data set
Y = df['crim']
X = df[predictors]
X = sm.add_constant(X)
ml_model = sm.OLS(Y, X).fit()
print(ml_model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                   crim   R-squared:                       0.454
Model:                            OLS   Adj. R-squared:                  0.440
Method:                 Least Squares   F-statistic:                     31.47
Date:                Sat, 30 May 2020   Prob (F-statistic):           1.57e-56
Time:                        12:26:02   Log-Likelihood:                -1653.3
No. Observations:                 506   AIC:                             3335.
Df Residuals:                     492   BIC:                             3394.
Df Model:                          13
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         17.0332      7.235      2.354      0.019       2.818      31.248
zn             0.0449      0.019      2.394      0.017       0.008       0.082
indus         -0.0639      0.083     -0.766      0.444      -0.228       0.100
chas          -0.7491      1.180     -0.635      0.526      -3.068       1.570
nox          -10.3135      5.276     -1.955      0.051     -20.679       0.052
rm             0.4301      0.613      0.702      0.483      -0.774       1.634
age            0.0015      0.018      0.081      0.935      -0.034       0.037
dis           -0.9872      0.282     -3.503      0.001      -1.541      -0.433
rad            0.5882      0.088      6.680      0.000       0.415       0.761
tax           -0.0038      0.005     -0.733      0.464      -0.014       0.006
ptratio       -0.2711      0.186     -1.454      0.147      -0.637       0.095
black         -0.0075      0.004     -2.052      0.041      -0.015      -0.000
lstat          0.1262      0.076      1.667      0.096      -0.023       0.275
medv          -0.1989      0.061     -3.287      0.001      -0.318      -0.080
==============================================================================
Omnibus:                      666.613   Durbin-Watson:                   1.519
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            84887.625
Skew:                           6.617   Prob(JB):                         0.00
Kurtosis:                      65.058   Cond. No.                     1.58e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.58e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Based on the p-values we can reject the null hypothesis for `dis`, `rad`, and `medv`. At a less stringent significance level we can also reject the null hypothesis for `zn`, `nox`, `black`, and `lstat`.
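The same reading can be done programmatically by filtering the p-values (the 5% and 10% cut-offs below are conventional choices, not taken from the text):

pvals = ml_model.pvalues.drop("const")
print("Significant at 5%: ", list(pvals[pvals < 0.05].index))
print("Significant at 10%:", list(pvals[pvals < 0.10].index))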
Comparison plot
import pandas as pd

ml_coefs = ml_model.params
sl_coefs = pd.Series({p : simple_models[p].params.loc[p] for p in predictors})
coef_df = pd.concat([sl_coefs, ml_coefs], axis=1)
coef_df.reset_index(inplace=True)
coef_df.columns = ["Predictors", "Simple OLS Coefficient", "Multiple OLS Coefficient"]
coef_df.dropna(inplace=True)
print(coef_df)
   Predictors  Simple OLS Coefficient  Multiple OLS Coefficient
0          zn               -0.073935                  0.044855
1       indus                0.509776                 -0.063855
2        chas               -1.892777                 -0.749134
3         nox               31.248531                -10.313535
4          rm               -2.684051                  0.430131
5         age                0.107786                  0.001452
6         dis               -1.550902                 -0.987176
7         rad                0.617911                  0.588209
8         tax                0.029742                 -0.003780
9     ptratio                1.151983                 -0.271081
10      black               -0.036280                 -0.007538
11      lstat                0.548805                  0.126211
12       medv               -0.363160                 -0.198887
plt.close("all") fig, ax = plt.subplots() sns.scatterplot(x="Simple OLS Coefficient", y="Multiple OLS Coefficient", data=coef_df, ax=ax) sns.despine() fig.savefig("img/3.15.c_comp_plot.png")
The coefficients for `nox` are very different in the two models.
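The predictors whose coefficient changes sign between the simple and multiple regressions can also be listed directly from the `coef_df` built above; a small sketch:

sign_flips = coef_df[coef_df["Simple OLS Coefficient"]
                     * coef_df["Multiple OLS Coefficient"] < 0]
print(sign_flips["Predictors"].tolist())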
Evidence of non-linear associations
We will use `scikit-learn` to generate the non-linear features.
from sklearn.preprocessing import PolynomialFeatures

pd.options.display.float_format = "{:,.3f}".format
Y = df['crim']
poly_features = PolynomialFeatures(degree=3)
# Reshape each predictor into a 2-D column array before generating the
# polynomial terms; indexing a Series with [:, None] is deprecated.
poly_predictors = {p : poly_features.fit_transform(df[p].to_numpy()[:, None])
                   for p in predictors}
poly_models = {p : sm.OLS(Y, poly_predictors[p]).fit() for p in predictors}
for p in predictors:
    print(f"p-values for {p}:")
    print(f"{poly_models[p].pvalues}\n")
p-values for zn:
const   0.000
x1      0.003
x2      0.094
x3      0.230
dtype: float64

p-values for indus:
const   0.020
x1      0.000
x2      0.000
x3      0.000
dtype: float64

p-values for chas:
const   0.000
x1      0.209
x2      0.209
x3      0.209
dtype: float64

p-values for nox:
const   0.000
x1      0.000
x2      0.000
x3      0.000
dtype: float64

p-values for rm:
const   0.081
x1      0.212
x2      0.364
x3      0.509
dtype: float64

p-values for age:
const   0.358
x1      0.143
x2      0.047
x3      0.007
dtype: float64

p-values for dis:
const   0.000
x1      0.000
x2      0.000
x3      0.000
dtype: float64

p-values for rad:
const   0.768
x1      0.623
x2      0.613
x3      0.482
dtype: float64

p-values for tax:
const   0.105
x1      0.110
x2      0.137
x3      0.244
dtype: float64

p-values for ptratio:
const   0.002
x1      0.003
x2      0.004
x3      0.006
dtype: float64

p-values for black:
const   0.000
x1      0.139
x2      0.474
x3      0.544
dtype: float64

p-values for lstat:
const   0.554
x1      0.335
x2      0.065
x3      0.130
dtype: float64

p-values for medv:
const   0.000
x1      0.000
x2      0.000
x3      0.000
dtype: float64
From the p-values we see that there is evidence of a polynomial association between the response and the predictors `indus`, `nox`, `age`, `dis`, `ptratio`, and `medv`.
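As a cross-check on reading the raw p-values, a linear and a cubic fit for a single predictor can be compared with an F-test. A sketch for `nox` (any of the predictors listed above could be substituted):

lin_fit = smf.ols("crim ~ nox", data=df).fit()
cubic_fit = smf.ols("crim ~ nox + I(nox ** 2) + I(nox ** 3)", data=df).fit()
# anova_lm on two nested models performs the F-test for the added terms.
print(sm.stats.anova_lm(lin_fit, cubic_fit))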