Midterm Corrections

Importing Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import KFold

Defining DoKFold

def DoKFold(model, X, y, k, standardize=False, random_state=146):
    if standardize:
        from sklearn.preprocessing import StandardScaler as SS
        ss = SS()

    kf = KFold(n_splits=k, shuffle=True, random_state=random_state)

    train_scores = []
    test_scores = []

    train_mse = []
    test_mse = []

    for idxTrain, idxTest in kf.split(X):
        Xtrain = X[idxTrain, :]
        Xtest = X[idxTest, :]
        ytrain = y[idxTrain]
        ytest = y[idxTest]

        # Fit the scaler on the training fold only, then apply it to the test fold
        if standardize:
            Xtrain = ss.fit_transform(Xtrain)
            Xtest = ss.transform(Xtest)

        model.fit(Xtrain, ytrain)

        # R^2 on each fold
        train_scores.append(model.score(Xtrain, ytrain))
        test_scores.append(model.score(Xtest, ytest))

        # Mean squared error on each fold
        ytrain_pred = model.predict(Xtrain)
        ytest_pred = model.predict(Xtest)

        train_mse.append(np.mean((ytrain - ytrain_pred)**2))
        test_mse.append(np.mean((ytest - ytest_pred)**2))

    return train_scores, test_scores, train_mse, test_mse

Importing Data

data = fetch_california_housing(as_frame=True)
lin_reg = LinearRegression()

X = np.array(data.data)
y = np.array(data.target)
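
As a quick sanity check (my addition, not part of the original midterm), DoKFold can now be run with the plain linear model to confirm everything is wired up:

tr, te, tr_mse, te_mse = DoKFold(lin_reg, X, y, 5, standardize=True)
print(np.round(np.mean(tr), 3), np.round(np.mean(te), 3))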

Question 15

What I did:

X_names = data.feature_names
X_df = pd.DataFrame(data.data, columns=X_names)
y_names = data.target_names

sns.heatmap(X_df.corr(), vmin=-1, vmax=1, cmap='bwr')
plt.show()

Correct Answer:

Xy = X_df.copy()
Xy['y'] = y
Xy.corr()

Reflection

I used a heatmap of the correlation coefficients, but since I never added the target to the DataFrame, my heatmap only showed the correlations among the features themselves. A heatmap is a viable way to answer the problem, but I wasn’t able to get the correct answer because I didn’t know how to read it: I thought that the biggest cluster of red would mean the highest correlation, so I chose Average Rooms instead of Median Income. I can see why I thought that, but it was a dumb mistake on my part.
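
With the target appended, the same heatmap idea works; here is a sketch of how I could have done it (my addition, building on the Xy frame from the correct answer):

sns.heatmap(Xy.corr(), vmin=-1, vmax=1, cmap='bwr')
plt.show()

# Or skip the visual and read the target column directly;
# MedInc comes out with the largest correlation to the target
print(Xy.corr()['y'].drop('y').abs().sort_values(ascending=False))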

Question 17

What I did:

lin_reg = LinearRegression()
lin_reg.fit(X,y)
lin_reg.score(X,y)

Correct Answer:

np.round(np.corrcoef(X_df['MedInc'], y)[0][1]**2,2)

from sklearn.linear_model import LinearRegression as LR
lin_reg = LR()
lin_reg.fit(X_df['MedInc'].values.reshape(-1,1),y)
np.round(lin_reg.score(X_df['MedInc'].values.reshape(-1,1),y),2)

Reflection

I knew that I had to use linear regression and fit the target to the features. However, I used the entire set of features instead of subsetting only “MedInc,” so my score reflected the full model rather than the single feature. (The single column also has to be reshaped into a 2-D array before fitting, which is why the correct answer uses .values.reshape(-1,1).) I misread the question, so this was another dumb mistake on my part.
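
One extra check that would have caught my error (my addition, not from the answer key): for a single-feature regression with an intercept, R² is exactly the squared correlation coefficient, so the two snippets in the correct answer must agree:

r = np.corrcoef(X_df['MedInc'], y)[0][1]
single = LR().fit(X_df[['MedInc']], y)
print(np.round(r**2, 2), np.round(single.score(X_df[['MedInc']], y), 2))  # the two values match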

Question 21

What I did:

from sklearn.preprocessing import StandardScaler as SS
ss = SS()
Xs = ss.fit_transform(X)
lin_reg.fit(Xs,y)
lin_coefs = lin_reg.coef_

a_range = np.linspace(0,1000,100)

rid_coefs = []
for a in a_range:
    rid_reg = Ridge(alpha=a)
    rid_reg.fit(Xs,y)
    rid_coefs.append(rid_reg.coef_)


plt.figure(figsize=(12,6))
plt.plot(a_range, rid_coefs)
plt.scatter([0]*len(lin_coefs), lin_coefs)
plt.legend(X_names, bbox_to_anchor=[1, 0.5], loc='center left')
plt.xlabel('$\\alpha$')
plt.ylabel('Coeff. Estimates')
plt.show()

Correct Answer:

##21

print(X_names[5])
lin = LR(); rid = Ridge(alpha=25.8); las = Lasso(alpha=0.00186)
lin.fit(Xs,y); rid.fit(Xs,y); las.fit(Xs,y)
lin.coef_[5], rid.coef_[5], las.coef_[5]

Reflection

Here, I wasn’t sure how to approach the question. I decided to graph the coefficient estimates of all the features and look for the one whose absolute value was closest to zero. That approach is useful for quick decision-making, but not in a case like this where the values are extremely close together. To be completely honest, I was very lost on this section and didn’t realize that I could solve it in just a couple of lines of code with “lin.coef_[5]” and the like.
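
For what it’s worth (my addition): if the question is asking which of the three estimates is closest to zero, that comparison can also be done programmatically instead of by eye, using the models fit in the correct answer above:

coefs = np.array([lin.coef_[5], rid.coef_[5], las.coef_[5]])
models = ['linear', 'ridge', 'lasso']
print(models[np.argmin(np.abs(coefs))])  # model whose AveOccup coefficient is closest to zero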

Question 22

What I did:

(Same code as in Question 21 — I made the same coefficient plot, so I won’t repeat it here.)

Correct Answer:

##22
print(X_names[0])
lin.coef_[0], rid.coef_[0], las.coef_[0]

Reflection

This is essentially the same question as 21, just for a different feature (index 0 instead of index 5), so the same reflection applies.

Question 24

What I did:

### Lasso Regression ###

from sklearn.linear_model import Lasso

a_range = np.linspace(0.001, 0.003, 101)

k = 20

avg_tr_score = []
avg_te_score = []

for a in a_range:
    las_reg = Lasso(alpha=a)
    train_scores, test_scores, train_mse, test_mse = DoKFold(las_reg, X, y, k, standardize=True)
    avg_tr_score.append(np.mean(train_scores))
    avg_te_score.append(np.mean(test_scores))

idx = np.argmax(MSE)  # my mistake: 'MSE' was never defined, and argmax is the wrong direction for an error metric
print('Optimal alpha value: ' + format(a_range[idx], '.5f'))
print('Training score for this value: ' + format(avg_tr_score[idx],'.5f'))
print('Testing score for this value: ' + format(avg_te_score[idx], '.5f'))

Correct Answer:

idx = np.argmin(las_te_mse)
print(las_a_range[idx], las_tr[idx], las_te[idx], las_tr_mse[idx], las_te_mse[idx])

plt.plot(las_a_range, las_te_mse, 'or')
plt.xlabel('$\\alpha$')
plt.ylabel('Avg MSE')
plt.show()
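
The answer key references las_a_range, las_tr, las_te, las_tr_mse, and las_te_mse without defining them; here is a sketch of how they would presumably be produced with the DoKFold from earlier (the variable names and loop details are my assumption):

las_a_range = np.linspace(0.001, 0.003, 101)
las_tr, las_te, las_tr_mse, las_te_mse = [], [], [], []
for a in las_a_range:
    tr, te, tr_mse, te_mse = DoKFold(Lasso(alpha=a), X, y, 20, standardize=True)
    las_tr.append(np.mean(tr))
    las_te.append(np.mean(te))
    las_tr_mse.append(np.mean(tr_mse))
    las_te_mse.append(np.mean(te_mse))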

Reflection

To answer this question, I reused my earlier Lasso cross-validation loop (which was correct on its own) and swapped “MSE” into np.argmax() in place of avg_te_score, even though I had never actually computed any MSE values. On top of that, np.argmax() is the wrong direction for an error metric: the optimal alpha is the one that minimizes the average testing MSE, which is why the correct answer uses np.argmin() on las_te_mse. That is how I ended up reporting an optimal alpha of 0.001.