HomeWork #1 Machine Learning#
Stu. name: Seyed Mohammad Amin Dadgar
Stu. no: 4003624016
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
import scipy
import scipy.special
import scipy.stats
Q2#
## column names are taken from the ```pima-indians-diabetes.name``` file
cols = ['pregnancy_count', 'glucose_test', 'blood_pressure', 'triceps_thickness', '2h_insulin', 'mass', 'pedi', 'age', 'label']
df = pd.read_csv('hw1_data/pima/pima-indians-diabetes.data', index_col=False ,names=cols)
df.head()
pregnancy_count | glucose_test | blood_pressure | triceps_thickness | 2h_insulin | mass | pedi | age | label | |
---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
## save the dataset in CSV format with its column names for later use
df.to_csv('hw1_data/processed/pima-indians-diabetes.csv')
(a, b)#
df.describe()
pregnancy_count | glucose_test | blood_pressure | triceps_thickness | 2h_insulin | mass | pedi | age | label | |
---|---|---|---|---|---|---|---|---|---|
count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
## the variance of each attribute
pd.DataFrame(df.var(), columns=['variance'])
variance | |
---|---|
pregnancy_count | 11.354056 |
glucose_test | 1022.248314 |
blood_pressure | 374.647271 |
triceps_thickness | 254.473245 |
2h_insulin | 13281.180078 |
mass | 62.159984 |
pedi | 0.109779 |
age | 138.303046 |
label | 0.227483 |
(c)#
## Calculating the correlation between the 8 attributes (label is excluded)
attr_corr = df[cols[:-1]].corr()
attr_corr
pregnancy_count | glucose_test | blood_pressure | triceps_thickness | 2h_insulin | mass | pedi | age | |
---|---|---|---|---|---|---|---|---|
pregnancy_count | 1.000000 | 0.129459 | 0.141282 | -0.081672 | -0.073535 | 0.017683 | -0.033523 | 0.544341 |
glucose_test | 0.129459 | 1.000000 | 0.152590 | 0.057328 | 0.331357 | 0.221071 | 0.137337 | 0.263514 |
blood_pressure | 0.141282 | 0.152590 | 1.000000 | 0.207371 | 0.088933 | 0.281805 | 0.041265 | 0.239528 |
triceps_thickness | -0.081672 | 0.057328 | 0.207371 | 1.000000 | 0.436783 | 0.392573 | 0.183928 | -0.113970 |
2h_insulin | -0.073535 | 0.331357 | 0.088933 | 0.436783 | 1.000000 | 0.197859 | 0.185071 | -0.042163 |
mass | 0.017683 | 0.221071 | 0.281805 | 0.392573 | 0.197859 | 1.000000 | 0.140647 | 0.036242 |
pedi | -0.033523 | 0.137337 | 0.041265 | 0.183928 | 0.185071 | 0.140647 | 1.000000 | 0.033561 |
age | 0.544341 | 0.263514 | 0.239528 | -0.113970 | -0.042163 | 0.036242 | 0.033561 | 1.000000 |
## showing the heatmap for better visualization
sns.heatmap(attr_corr)
plt.show()

label_corr = df.corrwith(df.label).sort_values(ascending=False)
pd.DataFrame(label_corr, columns=['correlation'])
correlation | |
---|---|
label | 1.000000 |
glucose_test | 0.466581 |
mass | 0.292695 |
age | 0.238356 |
pregnancy_count | 0.221898 |
pedi | 0.173844 |
2h_insulin | 0.130548 |
triceps_thickness | 0.074752 |
blood_pressure | 0.065068 |
From the correlation of the features with the label we find that the most useful (most influential) feature for predicting the label is glucose_test. To see why, recall the correlation formula \begin{equation} correlation = \frac{covariance(x,y)}{\sigma_x \sigma_y} \end{equation} From this equation it is clear that a positive coefficient means that when x increases, y tends to increase as well. For this reason the feature with the largest correlation with the label can be interpreted as the most helpful attribute.
As for the correlation value of exactly one: it simply means that x is the same variable as y in eq. (1).
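As a quick numerical check of eq. (1), a minimal sketch using the dataframe loaded above:
## Pearson correlation of glucose_test with the label via eq. (1)
x, y = df['glucose_test'], df['label']
print(x.cov(y) / (x.std() * y.std()))  # should match the 0.466581 reported by corrwith above
print(x.corr(y))                       # pandas' built-in Pearson correlation agrees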
(d)#
If two attributes are perfectly correlated, using both in prediction adds only redundant information and can destabilize the model (see the sketch below). The better alternative is to use just one of the two attributes.
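To make this concrete, a small sketch on toy data (not from the assignment): a duplicated, perfectly correlated column makes the least-squares normal-equations matrix \(X^T X\) (used later in Q8) singular and therefore non-invertible.
## toy design matrix whose third column duplicates the second
X_toy = np.array([[1.0, 2.0, 2.0],
                  [2.0, 4.0, 4.0],
                  [3.0, 1.0, 1.0]])
A = X_toy.T @ X_toy
try:
    np.linalg.inv(A)
except np.linalg.LinAlgError as err:
    print('A is singular:', err)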
(f)#
fig, ax = plt.subplots(3, 3, figsize=(15,9))
fig.tight_layout(pad=5)
sns.distplot(df.glucose_test, bins=20, ax=ax[0,0])
ax[0,0].set_title('Plasma glucose concentration a 2 hours\n in an oral glucose tolerance test')
sns.distplot(df.mass, bins=20, ax=ax[0,1])
ax[0, 1].set_title('Body mass index \n(weight in kg/(height in m)^2)')
sns.distplot(df.age, bins=20, ax=ax[0,2])
ax[0, 2].set_title('Age')
sns.distplot(df.pregnancy_count, bins=20, ax=ax[1,0])
ax[1, 0].set_title('Number of times pregnant')
sns.distplot(df.pedi, bins=20, ax=ax[1, 1])
ax[1, 1].set_title('Diabetes pedigree function')
sns.distplot(df['2h_insulin'], bins=20, ax=ax[1, 2])
ax[1, 2].set_title('2-Hour serum insulin (mu U/ml)')
sns.distplot(df.triceps_thickness, bins=20, ax=ax[2, 0])
ax[2, 0].set_title('Triceps skin fold thickness (mm)')
sns.distplot(df.blood_pressure, bins=20, ax=ax[2, 1])
ax[2, 1].set_title('Diastolic blood pressure (mm Hg)')
sns.distplot(df.label, bins=20, ax=ax[2, 2])
ax[2, 2].set_title('Class variable (0 or 1)')
plt.show()

Looking closely, body mass index and the plasma glucose concentration are the closest to a normal distribution.
fig, ax = plt.subplots(3, 3, figsize=(20,15))
fig.tight_layout(pad=5)
sns.scatterplot(x=df.glucose_test, y=df.mass, ax=ax[0,0])
ax[0,0].legend(['glucose_test And mass'])
sns.scatterplot(x=df.mass, y=df.age, ax=ax[0,1])
ax[0, 1].legend(['mass And age'])
sns.scatterplot(x=df.age, y=df.pregnancy_count, ax=ax[0,2])
ax[0, 2].legend(['age And pregnancy_count'])
sns.scatterplot(x=df.pregnancy_count, y=df.pedi, ax=ax[1,0])
ax[1, 0].legend(['pregnancy_count And pedi'])
sns.scatterplot(x=df.pedi, y=df['2h_insulin'], ax=ax[1, 1])
ax[1, 1].legend(['pedi And 2h_insulin'])
sns.scatterplot(x=df['2h_insulin'], y=df.triceps_thickness, ax=ax[1, 2])
ax[1, 2].legend(['2h_insulin And triceps_thickness'])
sns.scatterplot(x=df.triceps_thickness, y=df.blood_pressure, ax=ax[2, 0])
ax[2, 0].legend(['triceps_thickness And blood_pressure'])
ax[2,1].set_axis_off()
ax[2,2].set_axis_off()
plt.show()

From the scatter plots, (pregnancy_count, pedi) and (age, pregnancy_count) show something close to a linear dependency.
Q3#
(a)#
Create a normalize function using the equation below \begin{equation} x_{norm} = \frac{x- \mu_x}{\sigma_x} \end{equation} normalize the third attribute of the Pima dataset, then report the values for the first 5 entries of the dataset.
## column names are taken from the ```pima-indians-diabetes.name``` file
cols = ['pregnancy_count', 'glucose_test', 'blood_pressure', 'triceps_thickness', '2h_insulin', 'mass', 'pedi', 'age', 'label']
df = pd.read_csv('hw1_data/pima/pima-indians-diabetes.data', index_col=False ,names=cols)
df.head()
pregnancy_count | glucose_test | blood_pressure | triceps_thickness | 2h_insulin | mass | pedi | age | label | |
---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
def normalize(attrib):
    """
    normalize an attribute using its mean and standard deviation
    INPUTS:
    --------
    attrib: a pandas series, the original unnormalized attribute
    OUTPUT:
    --------
    normalized: the normalized attribute
    """
    mu = attrib.mean()
    sigma = attrib.std()
    normalized = (attrib - mu) / sigma
    return normalized
## third attribute is blood pressure
blood_pressure_norm = normalize(df.blood_pressure)
pd.DataFrame(blood_pressure_norm)
blood_pressure | |
---|---|
0 | 0.149543 |
1 | -0.160441 |
2 | -0.263769 |
3 | -0.160441 |
4 | -1.503707 |
... | ... |
763 | 0.356200 |
764 | 0.046215 |
765 | 0.149543 |
766 | -0.470426 |
767 | 0.046215 |
768 rows × 1 columns
## normalize the first 5 attributes of the dataset
normalize(df[['pregnancy_count', 'glucose_test', 'blood_pressure', 'triceps_thickness', '2h_insulin']])
pregnancy_count | glucose_test | blood_pressure | triceps_thickness | 2h_insulin | |
---|---|---|---|---|---|
0 | 0.639530 | 0.847771 | 0.149543 | 0.906679 | -0.692439 |
1 | -0.844335 | -1.122665 | -0.160441 | 0.530556 | -0.692439 |
2 | 1.233077 | 1.942458 | -0.263769 | -1.287373 | -0.692439 |
3 | -0.844335 | -0.997558 | -0.160441 | 0.154433 | 0.123221 |
4 | -1.141108 | 0.503727 | -1.503707 | 0.906679 | 0.765337 |
... | ... | ... | ... | ... | ... |
763 | 1.826623 | -0.622237 | 0.356200 | 1.721613 | 0.869464 |
764 | -0.547562 | 0.034575 | 0.046215 | 0.405181 | -0.692439 |
765 | 0.342757 | 0.003299 | 0.149543 | 0.154433 | 0.279412 |
766 | -0.844335 | 0.159683 | -0.470426 | -1.287373 | -0.692439 |
767 | -0.844335 | -0.872451 | 0.046215 | 0.655930 | -0.692439 |
768 rows × 5 columns
new_df = df[['pregnancy_count', 'glucose_test', 'triceps_thickness', '2h_insulin']].copy()
new_df['normalized_blood_pressure'] = blood_pressure_norm
new_df.head()
pregnancy_count | glucose_test | triceps_thickness | 2h_insulin | normalized_blood_pressure | |
---|---|---|---|---|---|
0 | 6 | 148 | 35 | 0 | 0.149543 |
1 | 1 | 85 | 29 | 0 | -0.160441 |
2 | 8 | 183 | 0 | 0 | -0.263769 |
3 | 1 | 89 | 23 | 94 | -0.160441 |
4 | 0 | 137 | 35 | 168 | -1.503707 |
(b)#
def discretize_attribute(attribute, bins=10, verbose=False):
    """
    discretize a continuous attribute into equal-width intervals
    INPUTS:
    --------
    attribute: pandas dataframe of **one column**
    bins: the number of equal-width intervals, default is 10
    verbose: print each stage's output if True, default is False
    OUTPUT:
    --------
    pandas dataframe of the discrete values of the attribute
    """
    values = np.empty((1, len(attribute.columns)))
    ## each bin's length
    bin_len = attribute.max().max() / bins
    if verbose: print(f'bin length: {bin_len}')
    ## counter to iterate the intervals
    interval = 0
    while interval < attribute.max().max():
        ## find the values within the interval and drop NaNs (NaN marks values outside the interval)
        conditioned_data = attribute.where((interval < attribute) & (attribute < interval + bin_len)).dropna().values
        if verbose: print(conditioned_data)
        ## discretize by setting every value in this interval to the representative point (interval + bin_len) / 2
        conditioned_data[True] = (interval + bin_len) / 2
        values = np.append(values, conditioned_data, axis=0)
        ## move to the next interval
        interval += bin_len
    return pd.DataFrame(values, columns=attribute.columns, dtype='float32')
attrib = df[['blood_pressure']]
df_discrete_blood_pressure = discretize_attribute(attrib, 10)
df_discrete_blood_pressure
blood_pressure | |
---|---|
0 | 0.000000 |
1 | 12.200000 |
2 | 18.299999 |
3 | 18.299999 |
4 | 24.400000 |
... | ... |
728 | 61.000000 |
729 | 61.000000 |
730 | 61.000000 |
731 | 61.000000 |
732 | 61.000000 |
733 rows × 1 columns
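For comparison, pandas ships a built-in equal-width binning helper; a minimal sketch with pd.cut (its bins span [min, max] rather than [0, max], so the representative values will differ slightly from the function above):
## equal-width binning with pd.cut; each value is replaced by its interval's midpoint
binned = pd.cut(df['blood_pressure'], bins=10)
midpoints = binned.apply(lambda interval: interval.mid).astype('float64')
print(midpoints.head())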
sns.histplot(df_discrete_blood_pressure)
plt.show()

df_discrete_pregnancy = discretize_attribute(df[['pregnancy_count']], 10)
df_discrete_glucose_test = discretize_attribute(df[['glucose_test']], 10)
df_discrete_triceps_thickness = discretize_attribute(df[['triceps_thickness']], 10)
df_discrete_2h_insulin = discretize_attribute(df[['2h_insulin']], 10)
fig, axes = plt.subplots(1, 5, figsize=(30, 5))
sns.histplot(df_discrete_blood_pressure, ax=axes[0])
sns.histplot(df_discrete_2h_insulin, ax=axes[1])
sns.histplot(df_discrete_glucose_test, ax=axes[2])
sns.histplot(df_discrete_pregnancy, ax=axes[3])
sns.histplot(df_discrete_triceps_thickness, ax=axes[4])
plt.show()

Q4#
(a)#
## column names are taken from the ```pima-indians-diabetes.name``` file
cols = ['pregnancy_count', 'glucose_test', 'blood_pressure', 'triceps_thickness', '2h_insulin', 'mass', 'pedi', 'age', 'label']
df = pd.read_csv('hw1_data/pima/pima-indians-diabetes.data', index_col=False ,names=cols)
df.head()
pregnancy_count | glucose_test | blood_pressure | triceps_thickness | 2h_insulin | mass | pedi | age | label | |
---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
df_class_0 = df[df.label == 0].copy()
df_class_1 = df[df.label == 1].copy()
df_class_0.describe()
pregnancy_count | glucose_test | blood_pressure | triceps_thickness | 2h_insulin | mass | pedi | age | label | |
---|---|---|---|---|---|---|---|---|---|
count | 500.000000 | 500.0000 | 500.000000 | 500.000000 | 500.000000 | 500.000000 | 500.000000 | 500.000000 | 500.0 |
mean | 3.298000 | 109.9800 | 68.184000 | 19.664000 | 68.792000 | 30.304200 | 0.429734 | 31.190000 | 0.0 |
std | 3.017185 | 26.1412 | 18.063075 | 14.889947 | 98.865289 | 7.689855 | 0.299085 | 11.667655 | 0.0 |
min | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.0 |
25% | 1.000000 | 93.0000 | 62.000000 | 0.000000 | 0.000000 | 25.400000 | 0.229750 | 23.000000 | 0.0 |
50% | 2.000000 | 107.0000 | 70.000000 | 21.000000 | 39.000000 | 30.050000 | 0.336000 | 27.000000 | 0.0 |
75% | 5.000000 | 125.0000 | 78.000000 | 31.000000 | 105.000000 | 35.300000 | 0.561750 | 37.000000 | 0.0 |
max | 13.000000 | 197.0000 | 122.000000 | 60.000000 | 744.000000 | 57.300000 | 2.329000 | 81.000000 | 0.0 |
df_class_1.describe()
pregnancy_count | glucose_test | blood_pressure | triceps_thickness | 2h_insulin | mass | pedi | age | label | |
---|---|---|---|---|---|---|---|---|---|
count | 268.000000 | 268.000000 | 268.000000 | 268.000000 | 268.000000 | 268.000000 | 268.000000 | 268.000000 | 268.0 |
mean | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.550500 | 37.067164 | 1.0 |
std | 3.741239 | 31.939622 | 21.491812 | 17.679711 | 138.689125 | 7.262967 | 0.372354 | 10.968254 | 0.0 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.088000 | 21.000000 | 1.0 |
25% | 1.750000 | 119.000000 | 66.000000 | 0.000000 | 0.000000 | 30.800000 | 0.262500 | 28.000000 | 1.0 |
50% | 4.000000 | 140.000000 | 74.000000 | 27.000000 | 0.000000 | 34.250000 | 0.449000 | 36.000000 | 1.0 |
75% | 8.000000 | 167.000000 | 82.000000 | 36.000000 | 167.250000 | 38.775000 | 0.728000 | 44.000000 | 1.0 |
max | 17.000000 | 199.000000 | 114.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 70.000000 | 1.0 |
(b)#
## randomly assign each row to the training set with probability 66%
def divideset1(df, prob=0.66):
    """
    divide the dataset into train and test with a probability
    INPUTS:
    --------
    df: pandas dataframe, the dataset we want to split
    prob: the probability for a row to go to the training set, default is 0.66
    OUTPUTS:
    ---------
    train: pandas dataframe, the portion of the dataset for train
    test: pandas dataframe, the portion of the dataset for test
    """
    ## copy the dataframe to avoid mutating the original
    dataset = df.copy()
    trainset = pd.DataFrame(columns=dataset.columns)
    testset = pd.DataFrame(columns=dataset.columns)
    ## iterate over the dataset and assign each row to train or test
    for i in range(0, len(dataset)):
        ## get the i-th row
        row = dataset.iloc[i]
        probability = np.random.random()
        ## with probability `prob` the row goes to the training set
        if probability < prob:
            trainset = trainset.append(row)
        else:
            testset = testset.append(row)
    return trainset, testset
train_split, test_split = divideset1(df)
train_split
pregnancy_count | glucose_test | blood_pressure | triceps_thickness | 2h_insulin | mass | pedi | age | label | |
---|---|---|---|---|---|---|---|---|---|
0 | 6.0 | 148.0 | 72.0 | 35.0 | 0.0 | 33.6 | 0.627 | 50.0 | 1.0 |
0 | 6.0 | 148.0 | 72.0 | 35.0 | 0.0 | 33.6 | 0.627 | 50.0 | 1.0 |
0 | 6.0 | 148.0 | 72.0 | 35.0 | 0.0 | 33.6 | 0.627 | 50.0 | 1.0 |
0 | 6.0 | 148.0 | 72.0 | 35.0 | 0.0 | 33.6 | 0.627 | 50.0 | 1.0 |
0 | 6.0 | 148.0 | 72.0 | 35.0 | 0.0 | 33.6 | 0.627 | 50.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
0 | 6.0 | 148.0 | 72.0 | 35.0 | 0.0 | 33.6 | 0.627 | 50.0 | 1.0 |
0 | 6.0 | 148.0 | 72.0 | 35.0 | 0.0 | 33.6 | 0.627 | 50.0 | 1.0 |
0 | 6.0 | 148.0 | 72.0 | 35.0 | 0.0 | 33.6 | 0.627 | 50.0 | 1.0 |
0 | 6.0 | 148.0 | 72.0 | 35.0 | 0.0 | 33.6 | 0.627 | 50.0 | 1.0 |
0 | 6.0 | 148.0 | 72.0 | 35.0 | 0.0 | 33.6 | 0.627 | 50.0 | 1.0 |
249 rows × 9 columns
## check the average training split size over several runs
times = 20
## save all lengths
lengths = []
for i in range(times):
    training_len = len(divideset1(df)[0])
    print(f'Iteration {i} - training length: {training_len}')
    lengths.append(training_len)
print('------------------------------------------')
print(f'Average Lengths: {np.average(lengths)}')
Iteration 0 - training length: 255
Iteration 1 - training length: 256
Iteration 2 - training length: 273
Iteration 3 - training length: 250
Iteration 4 - training length: 244
Iteration 5 - training length: 237
Iteration 6 - training length: 250
Iteration 7 - training length: 247
Iteration 8 - training length: 257
Iteration 9 - training length: 259
Iteration 10 - training length: 240
Iteration 11 - training length: 238
Iteration 12 - training length: 270
Iteration 13 - training length: 267
Iteration 14 - training length: 248
Iteration 15 - training length: 248
Iteration 16 - training length: 273
Iteration 17 - training length: 257
Iteration 18 - training length: 259
Iteration 19 - training length: 247
------------------------------------------
Average Lengths: 253.75
(c)#
def divideset2(df, fraction=0.66):
    """
    Divide the dataset into train and test with a fixed size on every run
    INPUTS:
    ---------
    df: pandas dataframe, the dataset that is going to be split
    fraction: the fraction of the dataset used for training, default is 0.66
    OUTPUTS:
    ---------
    train: pandas dataframe, the portion of the dataset for train
    test: pandas dataframe, the portion of the dataset for test
    """
    ## sample the training fraction; the remaining rows form the test set
    train = df.sample(frac=fraction).copy()
    test = df.drop(train.index)
    return train, test
train2, test2 = divideset2(df)
print(train2.head())
print('-----------------------'* 10)
print(test2.head())
pregnancy_count glucose_test blood_pressure triceps_thickness \
630 7 114 64 0
136 0 100 70 26
468 8 120 0 0
728 2 175 88 0
382 1 109 60 8
2h_insulin mass pedi age label
630 0 27.4 0.732 34 1
136 50 30.8 0.597 21 0
468 0 30.0 0.183 38 1
728 0 22.9 0.326 22 0
382 182 25.4 0.947 21 0
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pregnancy_count glucose_test blood_pressure triceps_thickness \
0 6 148 72 35
2 8 183 64 0
3 1 89 66 23
6 3 78 50 32
11 10 168 74 0
2h_insulin mass pedi age label
0 0 33.6 0.627 50 1
2 0 23.3 0.672 32 1
3 94 28.1 0.167 21 0
6 88 31.0 0.248 26 1
11 0 38.0 0.537 34 1
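For reference, scikit-learn implements the same fixed-fraction split; a minimal sketch (train3 and test3 are illustrative names), assuming scikit-learn is installed:
from sklearn.model_selection import train_test_split
## fixed 66/34 split; random_state makes the split reproducible
train3, test3 = train_test_split(df, train_size=0.66, random_state=0)
print(len(train3), len(test3))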
Validation schemes#
(a) K-fold Cross Validation#
def kfold_crossvalidation(data, k, m):
    """
    K-fold cross validation
    Note: here the test data plays the role of the validation data in normal machine learning models (it is used to evaluate one model)
    INPUTS:
    --------
    data: pandas dataframe containing feature vectors as rows
    k: positive integer, the number of folds
    m: target output
    OUTPUTS:
    ---------
    training_data: list of dataframes, index i holds the training set of fold i
    test_data: list of dataframes, index i holds the test set of fold i
    """
    ## get the length of data to split it
    dataframe_size = len(data)
    ## empty arrays to save the folds into
    training_data = []
    test_data = []
    ## find the split size
    split = int(dataframe_size / k)
    ## split the data into k folds and add each fold into the arrays
    for i in range(k):
        start_idx = int(i * split)
        end_idx = int((i + 1) * split)
        test = data.iloc[start_idx:end_idx].copy()
        ## add the label column to the corresponding index
        test['label'] = m.iloc[start_idx:end_idx]
        ## choose the remaining part of the dataset as train
        train = pd.concat([data, test, test]).drop_duplicates(keep=False)
        train['label'] = m.iloc[train.index]
        training_data.append(train)
        test_data.append(test)
    return training_data, test_data
K = 5
training_folds, test_folds = kfold_crossvalidation(df, K, df.label )
## check the K=1 test data fold
test_folds[0]
pregnancy_count | glucose_test | blood_pressure | triceps_thickness | 2h_insulin | mass | pedi | age | label | |
---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
148 | 5 | 147 | 78 | 0 | 0 | 33.7 | 0.218 | 65 | 0 |
149 | 2 | 90 | 70 | 17 | 0 | 27.3 | 0.085 | 22 | 0 |
150 | 1 | 136 | 74 | 50 | 204 | 37.4 | 0.399 | 24 | 0 |
151 | 4 | 114 | 65 | 0 | 0 | 21.9 | 0.432 | 37 | 0 |
152 | 9 | 156 | 86 | 28 | 155 | 34.3 | 1.189 | 42 | 1 |
153 rows × 9 columns
## check K=1 fold training set
training_folds[0]
pregnancy_count | glucose_test | blood_pressure | triceps_thickness | 2h_insulin | mass | pedi | age | label | |
---|---|---|---|---|---|---|---|---|---|
153 | 1 | 153 | 82 | 42 | 485 | 40.6 | 0.687 | 23 | 0 |
154 | 8 | 188 | 78 | 0 | 0 | 47.9 | 0.137 | 43 | 1 |
155 | 7 | 152 | 88 | 44 | 0 | 50.0 | 0.337 | 36 | 1 |
156 | 2 | 99 | 52 | 15 | 94 | 24.6 | 0.637 | 21 | 0 |
157 | 1 | 109 | 56 | 21 | 135 | 25.2 | 0.833 | 23 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |
615 rows × 9 columns
## check that the train and test sizes sum to the whole dataframe
for k in range(K):
    ## does it match the whole dataset size?
    condition = (len(df) == (len(training_folds[k]) + len(test_folds[k])))
    print(f'Fold K={k+1}, the summation matches the whole set, {condition}')
Fold K=1, the summation matches the whole set, True
Fold K=2, the summation matches the whole set, True
Fold K=3, the summation matches the whole set, True
Fold K=4, the summation matches the whole set, True
Fold K=5, the summation matches the whole set, True
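As a cross-check, scikit-learn's KFold implements the same contiguous-fold scheme; a minimal sketch, assuming scikit-learn is available:
from sklearn.model_selection import KFold
kf = KFold(n_splits=K)
for k, (train_idx, test_idx) in enumerate(kf.split(df)):
    ## each fold's train and test sizes should sum to the full dataset size
    print(f'Fold K={k+1}: train={len(train_idx)}, test={len(test_idx)}')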
(b) Bootstrapping#
def bootstrap1(data):
    """
    Demonstrate one iteration of the bootstrapping method (sampling with replacement)
    INPUT:
    -------
    data: a pandas dataframe, containing our data
    OUTPUTS:
    --------
    train_data: pandas dataframe of the sampled data
    test_data: pandas dataframe of the data that is not included in train_data
    """
    ## find the length of our data (how many rows we have)
    data_length = len(data)
    ## the indexes to be chosen from the original data, sampled with replacement
    indexes = np.random.randint(data_length, size=data_length)
    ## create the training set
    train_data = data.iloc[indexes].copy()
    ## choose the test set (the data that is omitted from the training set)
    test_data = pd.concat([data, train_data, train_data]).drop_duplicates(keep=False)
    return train_data, test_data
bootstrap_train, bootstrap_test = bootstrap1(df)
## check the training set length with original data
len(bootstrap_train) == len(df)
True
print(len(bootstrap_train))
bootstrap_train.head()
768
pregnancy_count | glucose_test | blood_pressure | triceps_thickness | 2h_insulin | mass | pedi | age | label | |
---|---|---|---|---|---|---|---|---|---|
455 | 14 | 175 | 62 | 30 | 0 | 33.6 | 0.212 | 38 | 1 |
67 | 2 | 109 | 92 | 0 | 0 | 42.7 | 0.845 | 54 | 0 |
692 | 2 | 121 | 70 | 32 | 95 | 39.1 | 0.886 | 23 | 0 |
191 | 9 | 123 | 70 | 44 | 94 | 33.1 | 0.374 | 40 | 0 |
711 | 5 | 126 | 78 | 27 | 22 | 29.6 | 0.439 | 40 | 0 |
print(len(bootstrap_test))
bootstrap_test.head()
277
pregnancy_count | glucose_test | blood_pressure | triceps_thickness | 2h_insulin | mass | pedi | age | label | |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
12 | 10 | 139 | 80 | 0 | 0 | 27.1 | 1.441 | 57 | 0 |
16 | 0 | 118 | 84 | 47 | 230 | 45.8 | 0.551 | 31 | 1 |
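The same resampling can also be written with pandas directly. Note that the held-out ("out-of-bag") test set is expected to contain about \(e^{-1} \approx 36.8\%\) of the rows, which roughly matches the 277/768 ≈ 36% observed above. A minimal sketch:
## sample with replacement up to the original size; the out-of-bag rows form the test set
boot_train = df.sample(n=len(df), replace=True, random_state=0)
boot_test = df.loc[~df.index.isin(boot_train.index)]
print(len(boot_train), len(boot_test))  # 768 and roughly 0.368 * 768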
Q7#
(f)#
def poisson_distribution(X, lambda1):
    """
    poisson distribution function
    INPUT:
    --------
    X: integer or an array of integers, the input value
    lambda1: float, the rate parameter of the distribution
    OUTPUT:
    ---------
    probability: same type as the input, the probability of X
    """
    ## calculate the numerator and the denominator separately
    numerator = np.exp(-lambda1) * np.power(lambda1, X)
    denominator = scipy.special.factorial(X)
    return np.divide(numerator, denominator)
## parameter lambda = 2
X = np.linspace(0, 10, 50)
Y1 = poisson_distribution(X, lambda1=2)
plt.plot(X, Y1)
plt.show()

## parameter lambda = 6
X = np.linspace(0, 10, 50)
Y1 = poisson_distribution(X, lambda1=6)
plt.plot(X, Y1)
plt.show()

(g)#
The Maximum Likelihood estimate for the Poisson distribution is as below (calculated in Q7 part c) \begin{equation} \lambda = \frac{1}{n} \sum_{i=1}^{n} x_i \end{equation}
## Read the data from poisson.txt file
X_poisson = np.fromfile('hw1_data/poisson.txt', dtype=float, sep='\n')
## Maximum Likelihood estimation
lambda1 = np.sum(X_poisson) / len(X_poisson)
print('Maximum Likelihood Estimated Parameter for poisson.txt: ', lambda1)
Maximum Likelihood Estimated Parameter for poisson.txt: 5.24
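As a sanity check at integer points, the hand-written pmf should agree with scipy.stats.poisson; a minimal sketch:
## compare the hand-written pmf against scipy's implementation at k = 0..10
ks = np.arange(0, 11)
print(np.allclose(poisson_distribution(ks, lambda1), scipy.stats.poisson.pmf(ks, lambda1)))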
(h)#
X = np.linspace(0, 10, 50)
## pass b by keyword: the second positional argument of scipy.stats.gamma is loc, not scale
gamma_model = scipy.stats.gamma(1, scale=2)
plt.plot(X, gamma_model.pdf(X))
plt.legend(['a = 1, b = 2'])
plt.show()

X = np.linspace(0, 10, 50)
gamma_model = scipy.stats.gamma(3, scale=5)
plt.plot(X, gamma_model.pdf(X))
plt.legend(['a = 3, b = 5'])
plt.show()
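A note on the call signature used above: in scipy.stats.gamma the second positional argument is the location, not the scale, so b must be passed as the scale keyword. A short illustration via the distribution mean, which is \(a \cdot scale + loc\):
## gamma(3, 5) shifts the distribution by loc=5, while gamma(3, scale=5) stretches it
print(scipy.stats.gamma(3, 5).mean())        # 3 * 1 + 5 = 8
print(scipy.stats.gamma(3, scale=5).mean())  # 3 * 5 = 15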

(i)#
The posterior density is proportional to the likelihood times the prior \begin{equation} Posterior \propto Likelihood \times Prior \end{equation} The Poisson likelihood of an observation is \begin{equation} P(x) = \frac{e^{-\lambda} \lambda^x}{x!} \end{equation} and the gamma prior over \(\lambda\) is \begin{equation} p(\lambda | a,b) = \frac{1}{b^a \Gamma(a)} \lambda^{a-1} e^{-\frac{\lambda}{b}} \end{equation} Using the poisson.txt file we have already found the proper value for lambda above.
def posterior_poisson(X, a, b, lambda1):
    """
    posterior for the poisson distribution
    INPUTS:
    -------
    X: values to calculate the posterior over
    a, b: parameters of the gamma prior
    lambda1: the estimated poisson parameter
    OUTPUT:
    --------
    probability: floating number or an array
    """
    ## the function is split into 3 parts which are then multiplied
    gamma_distro = scipy.stats.gamma(a)
    values = gamma_distro.pdf(X)
    p1 = (b ** a) * values
    p1 = 1 / p1
    p2 = lambda1 ** (a - 1)
    p3 = np.exp(-lambda1 / b)
    probability = p1 * p2 * p3
    return probability
Y = posterior_poisson(X_poisson, 3, 5, lambda1)
plt.figure(figsize=(8,6))
plt.plot(X_poisson,Y)
plt.legend(['Posterior distribution with a=3, b=5'])
plt.show()

The zigzag in the plot appears because the posterior is evaluated at the unsorted data values in X_poisson, so between 4 and 8 several y values share the same x; plotted against the raw data the curve does not look like a function.
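Because the gamma distribution is the conjugate prior of the Poisson likelihood, the posterior also has a closed form: with a \(Gamma(a, scale=b)\) prior and \(n\) observations summing to \(\sum x_i\), the posterior is \(Gamma(a + \sum x_i, scale = \frac{b}{nb + 1})\). A minimal sketch (this standard conjugacy result is not part of the assignment code) that evaluates the posterior on a sorted grid of \(\lambda\) values, so it plots as a proper density:
## conjugate update: Gamma(a, b) prior + Poisson data -> Gamma(a + sum(x), b / (n*b + 1))
a, b = 3, 5
a_post = a + X_poisson.sum()
scale_post = b / (len(X_poisson) * b + 1)
lambda_grid = np.linspace(3, 8, 200)
plt.plot(lambda_grid, scipy.stats.gamma(a_post, scale=scale_post).pdf(lambda_grid))
plt.legend(['Posterior density over lambda'])
plt.show()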
Q8#
Part 1#
(a)#
## read the data from housing.txt
## columns are separated by one or more spaces, so we use a regex separator with the python engine
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD','TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
df_boston = pd.read_csv('hw1_data/housing/housing.txt', sep=' +', index_col=False, names=columns, engine='python')
df_boston.head()
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
df_boston.dtypes
CRIM float64
ZN float64
INDUS float64
CHAS int64
NOX float64
RM float64
AGE float64
DIS float64
RAD int64
TAX float64
PTRATIO float64
B float64
LSTAT float64
MEDV float64
dtype: object
As we can see, no attribute has the object dtype, so we will look at the column values more closely.
df_boston.CHAS.value_counts()
0 471
1 35
Name: CHAS, dtype: int64
It is clear here that CHAS is a binary feature.
df_boston.ZN.value_counts()
0.0 372
20.0 21
80.0 15
22.0 10
12.5 10
25.0 10
40.0 7
45.0 6
30.0 6
90.0 5
95.0 4
60.0 4
21.0 4
33.0 4
55.0 3
70.0 3
34.0 3
52.5 3
35.0 3
28.0 3
75.0 3
82.5 2
85.0 2
17.5 1
100.0 1
18.0 1
Name: ZN, dtype: int64
So after looking at the dataset and two of its features, we see that only one column holds binary values, and that is CHAS.
(b)#
correlation = df_boston[columns[:-1]].corrwith(df_boston.MEDV).sort_values(ascending=True)
pd.DataFrame(correlation, columns=['correlation with MEDV'])
correlation with MEDV | |
---|---|
LSTAT | -0.737663 |
PTRATIO | -0.507787 |
INDUS | -0.483725 |
TAX | -0.468536 |
NOX | -0.427321 |
CRIM | -0.388305 |
RAD | -0.381626 |
AGE | -0.376955 |
CHAS | 0.175260 |
DIS | 0.249929 |
B | 0.333461 |
ZN | 0.360445 |
RM | 0.695360 |
It is clear from above that:
Highest Positive Correlation: RM
Highest Negative Correlation: LSTAT
(c)#
fig, axes = plt.subplots(3, 5, figsize=(25, 10))
fig.tight_layout(pad=5)
sns.scatterplot(x=df_boston.LSTAT, y=df_boston.MEDV, ax=axes[0,0])
sns.scatterplot(x=df_boston.PTRATIO, y=df_boston.MEDV, ax=axes[0,1])
sns.scatterplot(x=df_boston.INDUS, y=df_boston.MEDV, ax=axes[0,2])
sns.scatterplot(x=df_boston.TAX, y=df_boston.MEDV, ax=axes[0,3])
sns.scatterplot(x=df_boston.NOX, y=df_boston.MEDV, ax=axes[0,4])
sns.scatterplot(x=df_boston.CRIM, y=df_boston.MEDV, ax=axes[1,0])
sns.scatterplot(x=df_boston.AGE, y=df_boston.MEDV, ax=axes[1,1])
sns.scatterplot(x=df_boston.CHAS, y=df_boston.MEDV, ax=axes[1,2])
sns.scatterplot(x=df_boston.DIS, y=df_boston.MEDV, ax=axes[1,3])
sns.scatterplot(x=df_boston.B, y=df_boston.MEDV, ax=axes[1,4])
sns.scatterplot(x=df_boston.ZN, y=df_boston.MEDV, ax=axes[2,0])
sns.scatterplot(x=df_boston.RM, y=df_boston.MEDV, ax=axes[2,1])
axes[2,2].set_axis_off()
axes[2,3].set_axis_off()
axes[2,4].set_axis_off()

The features most correlated with MEDV should show the most linear scatter plots, and indeed RM and LSTAT appear the most linear.
(d)#
df_boston.corr()
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CRIM | 1.000000 | -0.200469 | 0.406583 | -0.055892 | 0.420972 | -0.219247 | 0.352734 | -0.379670 | 0.625505 | 0.582764 | 0.289946 | -0.385064 | 0.455621 | -0.388305 |
ZN | -0.200469 | 1.000000 | -0.533828 | -0.042697 | -0.516604 | 0.311991 | -0.569537 | 0.664408 | -0.311948 | -0.314563 | -0.391679 | 0.175520 | -0.412995 | 0.360445 |
INDUS | 0.406583 | -0.533828 | 1.000000 | 0.062938 | 0.763651 | -0.391676 | 0.644779 | -0.708027 | 0.595129 | 0.720760 | 0.383248 | -0.356977 | 0.603800 | -0.483725 |
CHAS | -0.055892 | -0.042697 | 0.062938 | 1.000000 | 0.091203 | 0.091251 | 0.086518 | -0.099176 | -0.007368 | -0.035587 | -0.121515 | 0.048788 | -0.053929 | 0.175260 |
NOX | 0.420972 | -0.516604 | 0.763651 | 0.091203 | 1.000000 | -0.302188 | 0.731470 | -0.769230 | 0.611441 | 0.668023 | 0.188933 | -0.380051 | 0.590879 | -0.427321 |
RM | -0.219247 | 0.311991 | -0.391676 | 0.091251 | -0.302188 | 1.000000 | -0.240265 | 0.205246 | -0.209847 | -0.292048 | -0.355501 | 0.128069 | -0.613808 | 0.695360 |
AGE | 0.352734 | -0.569537 | 0.644779 | 0.086518 | 0.731470 | -0.240265 | 1.000000 | -0.747881 | 0.456022 | 0.506456 | 0.261515 | -0.273534 | 0.602339 | -0.376955 |
DIS | -0.379670 | 0.664408 | -0.708027 | -0.099176 | -0.769230 | 0.205246 | -0.747881 | 1.000000 | -0.494588 | -0.534432 | -0.232471 | 0.291512 | -0.496996 | 0.249929 |
RAD | 0.625505 | -0.311948 | 0.595129 | -0.007368 | 0.611441 | -0.209847 | 0.456022 | -0.494588 | 1.000000 | 0.910228 | 0.464741 | -0.444413 | 0.488676 | -0.381626 |
TAX | 0.582764 | -0.314563 | 0.720760 | -0.035587 | 0.668023 | -0.292048 | 0.506456 | -0.534432 | 0.910228 | 1.000000 | 0.460853 | -0.441808 | 0.543993 | -0.468536 |
PTRATIO | 0.289946 | -0.391679 | 0.383248 | -0.121515 | 0.188933 | -0.355501 | 0.261515 | -0.232471 | 0.464741 | 0.460853 | 1.000000 | -0.177383 | 0.374044 | -0.507787 |
B | -0.385064 | 0.175520 | -0.356977 | 0.048788 | -0.380051 | 0.128069 | -0.273534 | 0.291512 | -0.444413 | -0.441808 | -0.177383 | 1.000000 | -0.366087 | 0.333461 |
LSTAT | 0.455621 | -0.412995 | 0.603800 | -0.053929 | 0.590879 | -0.613808 | 0.602339 | -0.496996 | 0.488676 | 0.543993 | 0.374044 | -0.366087 | 1.000000 | -0.737663 |
MEDV | -0.388305 | 0.360445 | -0.483725 | 0.175260 | -0.427321 | 0.695360 | -0.376955 | 0.249929 | -0.381626 | -0.468536 | -0.507787 | 0.333461 | -0.737663 | 1.000000 |
The most correlated ones are TAX and RAD with the correlation value of 0.910228.
Part 2#
def df_housing_read(test=False):
    columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
    if test:
        df = pd.read_csv('hw1_data/housing/housing_test.txt', sep=' +', names=columns, index_col=False, engine='python')
    else:
        df = pd.read_csv('hw1_data/housing/housing_train.txt', sep=' +', names=columns, index_col=False, engine='python')
    return df
(a)#
df_train = df_housing_read()
df_train.head()
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
The target output is MEDV, which we can split off from the training dataset.
## Get feature vectors and target output
Y = df_train.MEDV.copy()
X = df_train.drop(['MEDV'], axis=1)
Linear regression weights can be found as \begin{equation} w = A^{-1} b \end{equation} where A and b are \begin{equation} A = \sum_{i=1}^{n} x_i x_i^{T} \end{equation} \begin{equation} b = \sum_{i=1}^{n} y_i x_i \end{equation}
def LR_solve(X, Y):
    """
    Linear regression solve function
    Using the closed-form solution w = inverse(A) * b
    Parameters:
    --------
    X : matrix_like
        The feature vectors matrix, one feature vector per column
    Y : array_like
        The vector of target outputs, one per feature vector
    Returns:
    --------
    w : array_like
        The vector of weights fitted on `X` features
    """
    A = X.dot(X.T)
    ## create b and preprocess it
    b = Y.multiply(X)
    b = np.sum(b, axis=1)
    w = np.linalg.inv(A).dot(b)
    return w
## pass the transpose of X, because feature vectors are stored as rows in our dataset
w = LR_solve(X.T, Y)
print(f'Weights: {w}', f'\nShape:{w.shape}')
Weights: [-9.79342380e-02 4.89586765e-02 -2.53928478e-02 3.45087927e+00
-3.55458931e-01 5.81653272e+00 -3.31447963e-03 -1.02050134e+00
2.26563208e-01 -1.22458785e-02 -3.88029879e-01 1.70214971e-02
-4.85012955e-01]
Shape:(13,)
(b)#
def LR_predict(X_test, w):
    """
    Predict with Linear Regression using fixed input weights
    Parameters:
    -----------
    X_test : matrix_like
        test data, an array of feature vectors as rows
    w : array_like
        array of weights, one entry per feature
    Returns:
    --------
    Y_pred : array_like
        the prediction for the test data, using the weights
    """
    Y_pred = X_test @ w
    return Y_pred
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD','TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
df_test = pd.read_csv('hw1_data/housing/housing_test.txt', sep=' +', names=columns, index_col=False, engine='python')
df_test.head()
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.84054 | 0.0 | 8.14 | 0 | 0.538 | 5.599 | 85.7 | 4.4546 | 4 | 307.0 | 21.0 | 303.42 | 16.51 | 13.9 |
1 | 0.67191 | 0.0 | 8.14 | 0 | 0.538 | 5.813 | 90.3 | 4.6820 | 4 | 307.0 | 21.0 | 376.88 | 14.81 | 16.6 |
2 | 0.95577 | 0.0 | 8.14 | 0 | 0.538 | 6.047 | 88.8 | 4.4534 | 4 | 307.0 | 21.0 | 306.38 | 17.28 | 14.8 |
3 | 0.77299 | 0.0 | 8.14 | 0 | 0.538 | 6.495 | 94.4 | 4.4547 | 4 | 307.0 | 21.0 | 387.94 | 12.80 | 18.4 |
4 | 1.00245 | 0.0 | 8.14 | 0 | 0.538 | 6.674 | 87.3 | 4.2390 | 4 | 307.0 | 21.0 | 380.23 | 11.98 | 21.0 |
X_test = df_test.drop(columns=['MEDV'])
Y_test_actual = df_test.MEDV
Y_test_pred = LR_predict(X_test, w)
Y_test_pred[:5]
0 13.411778
1 16.500643
2 15.674173
3 21.839123
4 23.367941
dtype: float64
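Before running the script, the test error can also be checked inline; a minimal sketch of the mean squared error, which should reproduce the test MSE printed by main3_2.py below (~24.29):
## mean squared error of the predictions against the true MEDV values
mse_test = np.mean((Y_test_actual - Y_test_pred) ** 2)
print(f'Test MSE: {mse_test}')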
(c, d)#
## the main3_2.py program
!python3 scripts/main3_2.py
Linear Regression on housing dataset program!
----------------------------------------------------------------------
Loading Training and Test sets
datasets loaded!
Train set head:
CRIM ZN INDUS CHAS NOX ... RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0 0.538 ... 1 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0 0.469 ... 2 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0 0.469 ... 2 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0 0.458 ... 3 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0 0.458 ... 3 222.0 18.7 396.90 5.33
[5 rows x 13 columns]
Test set head :
CRIM ZN INDUS CHAS NOX ... RAD TAX PTRATIO B LSTAT
0 0.84054 0.0 8.14 0 0.538 ... 4 307.0 21.0 303.42 16.51
1 0.67191 0.0 8.14 0 0.538 ... 4 307.0 21.0 376.88 14.81
2 0.95577 0.0 8.14 0 0.538 ... 4 307.0 21.0 306.38 17.28
3 0.77299 0.0 8.14 0 0.538 ... 4 307.0 21.0 387.94 12.80
4 1.00245 0.0 8.14 0 0.538 ... 4 307.0 21.0 380.23 11.98
[5 rows x 13 columns]
----------------------------------------------------------------------
Training:
Mean Squared Error: 24.475882784643673
Testing:
Mean Squared Error: 24.29223817565946
----------------------------------------------------------------------
Weights of the training
[-9.79342380e-02 4.89586765e-02 -2.53928478e-02 3.45087927e+00
-3.55458931e-01 5.81653272e+00 -3.31447963e-03 -1.02050134e+00
2.26563208e-01 -1.22458785e-02 -3.88029879e-01 1.70214971e-02
-4.85012955e-01]
Part 3#
(a) Online Linear Regression#
The equation for online linear regression is \begin{equation} W_i = W_{i-1} + \alpha_i (y_i - f(x_i, W_{i-1})) x_i \end{equation} Where \(\alpha\) is the learning rate, \(W_i\) is the weight vector at the \(i\)-th iteration and \(x_i\) is the \(i\)-th feature vector.
df_train = df_housing_read()
def normalize(df):
    """
    normalize a pandas dataframe column by column
    Parameters:
    ------------
    df : pandas dataframe
        the dataset that is going to be normalized
    Returns:
    ----------
    df_normalized : pandas dataframe
        the normalized dataframe
    """
    df_normalized = df.copy()
    cols = df_normalized.columns
    for col in cols:
        df_normalized[col] = (df_normalized[col] - df_normalized[col].mean()) / df_normalized[col].std()
    return df_normalized
df_train_normal = normalize(df_train)
df_train_normal.head()
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.406074 | 0.271666 | -1.269381 | -0.267612 | -0.128607 | 0.384651 | -0.095048 | 0.140689 | -0.969579 | -0.612355 | -1.449420 | 0.409077 | -1.046568 | 0.119273 |
1 | -0.403616 | -0.486839 | -0.554501 | -0.267612 | -0.726487 | 0.168444 | 0.397814 | 0.567315 | -0.854463 | -0.934791 | -0.274757 | 0.409077 | -0.459782 | -0.133425 |
2 | -0.403618 | -0.486839 | -0.554501 | -0.267612 | -0.726487 | 1.241055 | -0.242547 | 0.567315 | -0.854463 | -0.934791 | -0.274757 | 0.360950 | -1.180570 | 1.245885 |
3 | -0.403023 | -0.486839 | -1.288905 | -0.267612 | -0.821802 | 0.978518 | -0.792968 | 1.099976 | -0.739347 | -1.054212 | 0.148121 | 0.382235 | -1.334319 | 1.109007 |
4 | -0.398727 | -0.486839 | -1.288905 | -0.267612 | -0.821802 | 1.187706 | -0.490776 | 1.099976 | -0.739347 | -1.054212 | 0.148121 | 0.409077 | -0.997199 | 1.403821 |
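Since pandas broadcasts column-wise, the same z-score normalization can also be written in a single expression; a sketch equivalent to the loop above (df_train_normal_v2 is just an illustrative name):
## column-wise z-score in one line
df_train_normal_v2 = (df_train - df_train.mean()) / df_train.std()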
def LR_Incremental_solve(X, Y, W, iter=1000):
    """
    Incremental Learning for Linear Regression
    The method used is online gradient descent;
    the learning rate is 2/t, where t stands for the iteration number
    Parameters:
    ------------
    X : matrix_like
        the feature vectors for training
    Y : array_like
        the target output for each feature vector represented in `X`
    W : array_like
        the initial weights for online gradient descent
    iter : integer
        the number of iterations to learn
    Returns:
    ---------
    W : array_like
        the learned weights for linear regression
    """
    for i in range(iter):
        ## the data index cycles through the dataset as the iterations progress
        data_index = i % len(Y)
        ## the update term that is added to the old weights
        update_term = Y[data_index] - Function(X.iloc[data_index], W)
        ## the learning rate for iteration t = i + 1
        learning_rate = 2 / (i + 1)
        update_term = np.multiply(learning_rate * update_term, X.iloc[data_index])
        ## update the weights
        W = np.add(W, update_term)
    return W
def Function(X, W):
    """
    The function for calculating the predicted output for `X`
    Parameters:
    -----------
    X : array_like
        the features vector
    W : array_like
        the learned weights
    Returns:
    --------
    Y_pred : float
        The predicted value for the input feature vector and the weights
    """
    Y_pred = np.dot(W.T, X)
    return Y_pred
## initialize the data
X_train = df_train_normal.drop(columns=['MEDV'])
Y_train = df_train_normal.MEDV
W = np.zeros(13)
weights = LR_Incremental_solve(X = X_train,Y= Y_train,W= W)
weights
CRIM -0.020193
ZN 0.620314
INDUS 0.289735
CHAS -0.076357
NOX 3.653932
RM 0.948070
AGE -1.926829
DIS 1.026664
RAD -2.659829
TAX 0.703401
PTRATIO 1.108539
B 1.023090
LSTAT -0.511682
dtype: float64
(b)#
!python3 scripts/main3_3.py
Linear Regression on housing dataset program!
----------------------------------------------------------------------
----------------------------------------------------------------------
Loading Training and Test sets
datasets loaded!
Train set head:
CRIM ZN INDUS CHAS NOX ... RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0 0.538 ... 1 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0 0.469 ... 2 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0 0.469 ... 2 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0 0.458 ... 3 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0 0.458 ... 3 222.0 18.7 396.90 5.33
[5 rows x 13 columns]
Test set head :
CRIM ZN INDUS CHAS NOX ... RAD TAX PTRATIO B LSTAT
0 0.84054 0.0 8.14 0 0.538 ... 4 307.0 21.0 303.42 16.51
1 0.67191 0.0 8.14 0 0.538 ... 4 307.0 21.0 376.88 14.81
2 0.95577 0.0 8.14 0 0.538 ... 4 307.0 21.0 306.38 17.28
3 0.77299 0.0 8.14 0 0.538 ... 4 307.0 21.0 387.94 12.80
4 1.00245 0.0 8.14 0 0.538 ... 4 307.0 21.0 380.23 11.98
[5 rows x 13 columns]
----------------------------------------------------------------------
----------------------------------------------------------------------
Training:
Mean Squared Error: 24.475882784643673
Testing using the last trained weights:
Test
Mean Squared Error: 24.29223817565946
----------------------------------------------------------------------
Weights of the training
[-9.79342380e-02 4.89586765e-02 -2.53928478e-02 3.45087927e+00
-3.55458931e-01 5.81653272e+00 -3.31447963e-03 -1.02050134e+00
2.26563208e-01 -1.22458785e-02 -3.88029879e-01 1.70214971e-02
-4.85012955e-01]
----------------------------------------------------------------------
----------------------------------------------------------------------
Incremental Learning
Training
Mean Squared Error: 6.794950630755434
Test
Mean Squared Error: 269174.89239789307
----------------------------------------------------------------------
Final wights of Incremental Learning
[-9.79342380e-02 4.89586765e-02 -2.53928478e-02 3.45087927e+00
-3.55458931e-01 5.81653272e+00 -3.31447963e-03 -1.02050134e+00
2.26563208e-01 -1.22458785e-02 -3.88029879e-01 1.70214971e-02
-4.85012955e-01]
----------------------------------------------------------------------
Partial Wights
Plotting Training and Test Loss
Figure(1500x500)
(c)#
Using the unnormalized dataset, the updates overflow and the learned weights become NaN.
## we're using the unnormalized dataset
X = df_train.drop(columns=['MEDV'])
Y = df_train.MEDV
W = np.zeros(13)
weights = LR_Incremental_solve(X = X,Y= Y,W= W)
weights
CRIM NaN
ZN NaN
INDUS NaN
CHAS NaN
NOX NaN
RM NaN
AGE NaN
DIS NaN
RAD NaN
TAX NaN
PTRATIO NaN
B NaN
LSTAT NaN
dtype: float64
The NaN weights show that an overflow has happened.
(d)#
!python3 scripts/main3_3.py
Linear Regression on housing dataset program!
----------------------------------------------------------------------
----------------------------------------------------------------------
Loading Training and Test sets
datasets loaded!
Train set head:
CRIM ZN INDUS CHAS NOX ... RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0 0.538 ... 1 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0 0.469 ... 2 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0 0.469 ... 2 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0 0.458 ... 3 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0 0.458 ... 3 222.0 18.7 396.90 5.33
[5 rows x 13 columns]
Test set head :
CRIM ZN INDUS CHAS NOX ... RAD TAX PTRATIO B LSTAT
0 0.84054 0.0 8.14 0 0.538 ... 4 307.0 21.0 303.42 16.51
1 0.67191 0.0 8.14 0 0.538 ... 4 307.0 21.0 376.88 14.81
2 0.95577 0.0 8.14 0 0.538 ... 4 307.0 21.0 306.38 17.28
3 0.77299 0.0 8.14 0 0.538 ... 4 307.0 21.0 387.94 12.80
4 1.00245 0.0 8.14 0 0.538 ... 4 307.0 21.0 380.23 11.98
[5 rows x 13 columns]
----------------------------------------------------------------------
----------------------------------------------------------------------
Training:
Mean Squared Error: 24.475882784643673
Testing using the last trained weights:
Test
Mean Squared Error: 24.29223817565946
----------------------------------------------------------------------
Weights of the training
[-9.79342380e-02 4.89586765e-02 -2.53928478e-02 3.45087927e+00
-3.55458931e-01 5.81653272e+00 -3.31447963e-03 -1.02050134e+00
2.26563208e-01 -1.22458785e-02 -3.88029879e-01 1.70214971e-02
-4.85012955e-01]
----------------------------------------------------------------------
----------------------------------------------------------------------
Incremental Learning
Training
Mean Squared Error: 6.794950630755434
Test
Mean Squared Error: 269174.89239789307
----------------------------------------------------------------------
Final wights of Incremental Learning
[-9.79342380e-02 4.89586765e-02 -2.53928478e-02 3.45087927e+00
-3.55458931e-01 5.81653272e+00 -3.31447963e-03 -1.02050134e+00
2.26563208e-01 -1.22458785e-02 -3.88029879e-01 1.70214971e-02
-4.85012955e-01]
----------------------------------------------------------------------
Partial Wights
Plotting Training and Test Loss
Figure(1500x500)
(d.2)#
## using the learning rate 2 / sqrt(t)
!python3 scripts/main3_3.py learning_rate=1
Linear Regression on housing dataset program!
----------------------------------------------------------------------
----------------------------------------------------------------------
Loading Training and Test sets
datasets loaded!
Train set head:
CRIM ZN INDUS CHAS NOX ... RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0 0.538 ... 1 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0 0.469 ... 2 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0 0.469 ... 2 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0 0.458 ... 3 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0 0.458 ... 3 222.0 18.7 396.90 5.33
[5 rows x 13 columns]
Test set head :
CRIM ZN INDUS CHAS NOX ... RAD TAX PTRATIO B LSTAT
0 0.84054 0.0 8.14 0 0.538 ... 4 307.0 21.0 303.42 16.51
1 0.67191 0.0 8.14 0 0.538 ... 4 307.0 21.0 376.88 14.81
2 0.95577 0.0 8.14 0 0.538 ... 4 307.0 21.0 306.38 17.28
3 0.77299 0.0 8.14 0 0.538 ... 4 307.0 21.0 387.94 12.80
4 1.00245 0.0 8.14 0 0.538 ... 4 307.0 21.0 380.23 11.98
[5 rows x 13 columns]
----------------------------------------------------------------------
----------------------------------------------------------------------
Training:
Mean Squared Error: 24.475882784643673
Testing using the last trained weights:
Test
Mean Squared Error: 24.29223817565946
----------------------------------------------------------------------
Weights of the training
[-9.79342380e-02 4.89586765e-02 -2.53928478e-02 3.45087927e+00
-3.55458931e-01 5.81653272e+00 -3.31447963e-03 -1.02050134e+00
2.26563208e-01 -1.22458785e-02 -3.88029879e-01 1.70214971e-02
-4.85012955e-01]
----------------------------------------------------------------------
----------------------------------------------------------------------
Incremental Learning
Training
Mean Squared Error: 1.5252088672442436e+54
Test
Mean Squared Error: 4.577866801177761e+56
----------------------------------------------------------------------
Final wights of Incremental Learning
[-9.79342380e-02 4.89586765e-02 -2.53928478e-02 3.45087927e+00
-3.55458931e-01 5.81653272e+00 -3.31447963e-03 -1.02050134e+00
2.26563208e-01 -1.22458785e-02 -3.88029879e-01 1.70214971e-02
-4.85012955e-01]
----------------------------------------------------------------------
Partial Wights
Plotting Training and Test Loss
Figure(1500x500)
## using static learning rate 0.5
!python3 scripts/main3_3.py learning_rate=0.5
Linear Regression on housing dataset program!
----------------------------------------------------------------------
----------------------------------------------------------------------
Loading Training and Test sets
datasets loaded!
Train set head:
CRIM ZN INDUS CHAS NOX ... RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0 0.538 ... 1 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0 0.469 ... 2 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0 0.469 ... 2 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0 0.458 ... 3 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0 0.458 ... 3 222.0 18.7 396.90 5.33
[5 rows x 13 columns]
Test set head :
CRIM ZN INDUS CHAS NOX ... RAD TAX PTRATIO B LSTAT
0 0.84054 0.0 8.14 0 0.538 ... 4 307.0 21.0 303.42 16.51
1 0.67191 0.0 8.14 0 0.538 ... 4 307.0 21.0 376.88 14.81
2 0.95577 0.0 8.14 0 0.538 ... 4 307.0 21.0 306.38 17.28
3 0.77299 0.0 8.14 0 0.538 ... 4 307.0 21.0 387.94 12.80
4 1.00245 0.0 8.14 0 0.538 ... 4 307.0 21.0 380.23 11.98
[5 rows x 13 columns]
----------------------------------------------------------------------
----------------------------------------------------------------------
Training:
Mean Squared Error: 24.475882784643673
Testing using the last trained weights:
Test
Mean Squared Error: 24.29223817565946
----------------------------------------------------------------------
Weights of the training
[-9.79342380e-02 4.89586765e-02 -2.53928478e-02 3.45087927e+00
-3.55458931e-01 5.81653272e+00 -3.31447963e-03 -1.02050134e+00
2.26563208e-01 -1.22458785e-02 -3.88029879e-01 1.70214971e-02
-4.85012955e-01]
----------------------------------------------------------------------
----------------------------------------------------------------------
Incremental Learning
Training
Mean Squared Error: 0.0
Test
Mean Squared Error: 0.0
----------------------------------------------------------------------
Final weights of Incremental Learning
[-9.79342380e-02 4.89586765e-02 -2.53928478e-02 3.45087927e+00
-3.55458931e-01 5.81653272e+00 -3.31447963e-03 -1.02050134e+00
2.26563208e-01 -1.22458785e-02 -3.88029879e-01 1.70214971e-02
-4.85012955e-01]
----------------------------------------------------------------------
Partial Weights
Plotting Training and Test Loss
Figure(1500x500)
## using static learning rate 0.01
!python3 scripts/main3_3.py learning_rate=0.01
Linear Regression on housing dataset program!
----------------------------------------------------------------------
----------------------------------------------------------------------
Loading Training and Test sets
datasets loaded!
[train/test set heads identical to the first run above; omitted]
----------------------------------------------------------------------
----------------------------------------------------------------------
Training:
Mean Squared Error: 24.475882784643673
Testing using the last trained weights:
Test
Mean Squared Error: 24.29223817565946
----------------------------------------------------------------------
Weights of the training
[-9.79342380e-02 4.89586765e-02 -2.53928478e-02 3.45087927e+00
-3.55458931e-01 5.81653272e+00 -3.31447963e-03 -1.02050134e+00
2.26563208e-01 -1.22458785e-02 -3.88029879e-01 1.70214971e-02
-4.85012955e-01]
----------------------------------------------------------------------
----------------------------------------------------------------------
Incremental Learning
Training
Mean Squared Error: 0.29961777287140706
Test
Mean Squared Error: 1568.3310594686245
----------------------------------------------------------------------
Final weights of Incremental Learning
[-9.79342380e-02 4.89586765e-02 -2.53928478e-02 3.45087927e+00
-3.55458931e-01 5.81653272e+00 -3.31447963e-03 -1.02050134e+00
2.26563208e-01 -1.22458785e-02 -3.88029879e-01 1.70214971e-02
-4.85012955e-01]
----------------------------------------------------------------------
Partial Weights
Plotting Training and Test Loss
Figure(1500x500)
As we saw, the training and test losses of the online Linear Regression (LR) with learning rate 0.5 are both zero. We can conclude from this that online LR with learning rate 0.5 may be the most suitable model for our problem.
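For reference, the incremental update assumed throughout this part is the per-sample gradient step \(w \leftarrow w + \eta_t \, (y_i - w^T x_i)\, x_i\). Below is a minimal sketch of it; the name online_lr and its signature are my own illustration, not the actual implementation in scripts/main3_3.py (which is not shown here).
def online_lr(X, y, lr=0.5, decay=False, epochs=1):
    ## hypothetical sketch of incremental (online) linear regression;
    ## not the code used by scripts/main3_3.py
    w = np.zeros(X.shape[1])
    t = 1
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            eta = 2 / t if decay else lr  ## 2/t decay vs. a static rate
            ## one stochastic gradient step on the squared error of sample i
            w = w + eta * (y_i - w @ x_i) * x_i
            t += 1
    return w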
Part 4#
(a)#
def extendx(X):
    """
    Extend the dataset X and return the linear together with the degree-2 data.

    Parameters:
    -----------
    X : matrix_like
        our dataset

    Returns:
    --------
    extended_X : matrix_like
        the dataset containing both the X rows and the element-wise X^2 rows

    Note: extended_X has twice as many rows as X.
    """
    df = X.copy()
    ## element-wise square of every attribute
    df_2 = df.multiply(df)
    ## stack the squared samples below the original ones
    df = pd.concat([df, df_2], ignore_index=True)
    return df
df = df_housing_read()
extendx(df)
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.006320 | 18.0 | 2.3100 | 0 | 0.538000 | 6.575000 | 65.20 | 4.090000 | 1 | 296.0 | 15.3 | 396.9000 | 4.9800 | 24.00 |
1 | 0.027310 | 0.0 | 7.0700 | 0 | 0.469000 | 6.421000 | 78.90 | 4.967100 | 2 | 242.0 | 17.8 | 396.9000 | 9.1400 | 21.60 |
2 | 0.027290 | 0.0 | 7.0700 | 0 | 0.469000 | 7.185000 | 61.10 | 4.967100 | 2 | 242.0 | 17.8 | 392.8300 | 4.0300 | 34.70 |
3 | 0.032370 | 0.0 | 2.1800 | 0 | 0.458000 | 6.998000 | 45.80 | 6.062200 | 3 | 222.0 | 18.7 | 394.6300 | 2.9400 | 33.40 |
4 | 0.069050 | 0.0 | 2.1800 | 0 | 0.458000 | 7.147000 | 54.20 | 6.062200 | 3 | 222.0 | 18.7 | 396.9000 | 5.3300 | 36.20 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
861 | 0.003923 | 0.0 | 142.3249 | 0 | 0.328329 | 43.467649 | 4774.81 | 6.143458 | 1 | 74529.0 | 441.0 | 153656.1601 | 93.5089 | 501.76 |
862 | 0.002049 | 0.0 | 142.3249 | 0 | 0.328329 | 37.454400 | 5882.89 | 5.232656 | 1 | 74529.0 | 441.0 | 157529.6100 | 82.4464 | 424.36 |
863 | 0.003692 | 0.0 | 142.3249 | 0 | 0.328329 | 48.664576 | 8281.00 | 4.698056 | 1 | 74529.0 | 441.0 | 157529.6100 | 31.8096 | 571.21 |
864 | 0.012010 | 0.0 | 142.3249 | 0 | 0.328329 | 46.158436 | 7974.49 | 5.706843 | 1 | 74529.0 | 441.0 | 154802.9025 | 41.9904 | 484.00 |
865 | 0.002248 | 0.0 | 142.3249 | 0 | 0.328329 | 36.360900 | 6528.64 | 6.275025 | 1 | 74529.0 | 441.0 | 157529.6100 | 62.0944 | 141.61 |
866 rows × 14 columns
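Note that extendx stacks the squared samples as extra rows, which is why the result above has twice as many rows as the original frame. An alternative, shown only as an illustration (extendx_cols is not used by the scripts), would append the squared attributes as extra columns instead:
def extendx_cols(X):
    ## illustrative alternative: append each squared attribute as a new column
    df = X.copy()
    df_2 = df.multiply(df).add_suffix('^2')  ## e.g. 'CRIM' -> 'CRIM^2'
    ## same number of rows, twice the number of columns
    return pd.concat([df, df_2], axis=1)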
(b)#
For binary attributes, the squaring transformation acts as the logical \(and\) operator: since \(0 \cdot 0 = 0\) and \(1 \cdot 1 = 1\), squaring a binary attribute returns the attribute itself (\(x \wedge x = x\)). So the binary attributes in the linear part of the extended dataset are unchanged, and in the polynomial part they are the logical \(and\) of the attribute with itself, i.e. also unchanged.
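A quick check on CHAS (the only binary attribute of the housing data) confirms this, assuming df still holds the dataset loaded above:
## squaring a 0/1 attribute leaves it unchanged: x*x == (x and x) == x
chas = df['CHAS']
print((chas * chas == chas).all())  ## expected: True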
(c)#
!python3 scripts/main3_4.py
Linear Regression on housing dataset program!
----------------------------------------------------------------------
----------------------------------------------------------------------
Loading Training and Test sets
datasets loaded!
[train/test set heads identical to the first run above; omitted]
----------------------------------------------------------------------
----------------------------------------------------------------------
Training:
Mean Squared Error: 55343.71451079023
Testing using the last trained weights:
Test
Mean Squared Error: 35394.311865625714
----------------------------------------------------------------------
Weights of the training
[-4.68476601e-02 4.08283559e-02 8.30964818e-02 1.09347853e+02
-2.39244676e+02 3.05351274e+01 2.35384427e-03 -5.57037854e+00
8.67934356e-01 -1.28788975e-03 -1.50542358e+00 1.40912747e-03
-3.64281060e-01]
----------------------------------------------------------------------
----------------------------------------------------------------------
Incremental Learning
Training
Mean Squared Error: 0.7264756830925618
Test
Mean Squared Error: 35097170.23443318
----------------------------------------------------------------------
Final weights of Incremental Learning
[-4.68476601e-02 4.08283559e-02 8.30964818e-02 1.09347853e+02
-2.39244676e+02 3.05351274e+01 2.35384427e-03 -5.57037854e+00
8.67934356e-01 -1.28788975e-03 -1.50542358e+00 1.40912747e-03
-3.64281060e-01]
----------------------------------------------------------------------
Partial Weights
Plotting Training and Test Loss
Figure(1500x500)
!python3 scripts/main3_4.py learning_rate=1
Linear Regression on housing dataset program!
----------------------------------------------------------------------
----------------------------------------------------------------------
Loading Training and Test sets
datasets loaded!
[train/test set heads identical to the first run above; omitted]
----------------------------------------------------------------------
----------------------------------------------------------------------
Training:
Mean Squared Error: 55343.71451079023
Testing using the last trained weights:
Test
Mean Squared Error: 35394.311865625714
----------------------------------------------------------------------
Weights of the training
[-4.68476601e-02 4.08283559e-02 8.30964818e-02 1.09347853e+02
-2.39244676e+02 3.05351274e+01 2.35384427e-03 -5.57037854e+00
8.67934356e-01 -1.28788975e-03 -1.50542358e+00 1.40912747e-03
-3.64281060e-01]
----------------------------------------------------------------------
----------------------------------------------------------------------
Incremental Learning
Training
Mean Squared Error: 2.9753029489387566e+38
Test
Mean Squared Error: 1.1151099259756206e+46
----------------------------------------------------------------------
Final weights of Incremental Learning
[-4.68476601e-02 4.08283559e-02 8.30964818e-02 1.09347853e+02
-2.39244676e+02 3.05351274e+01 2.35384427e-03 -5.57037854e+00
8.67934356e-01 -1.28788975e-03 -1.50542358e+00 1.40912747e-03
-3.64281060e-01]
----------------------------------------------------------------------
Partial Weights
Plotting Training and Test Loss
Figure(1500x500)
!python3 scripts/main3_4.py learning_rate=0.5
Linear Regression on housing dataset program!
----------------------------------------------------------------------
----------------------------------------------------------------------
Loading Training and Test sets
datasets loaded!
[train/test set heads identical to the first run above; omitted]
----------------------------------------------------------------------
----------------------------------------------------------------------
Training:
Mean Squared Error: 55343.71451079023
Testing using the last trained weights:
Test
Mean Squared Error: 35394.311865625714
----------------------------------------------------------------------
Weights of the training
[-4.68476601e-02 4.08283559e-02 8.30964818e-02 1.09347853e+02
-2.39244676e+02 3.05351274e+01 2.35384427e-03 -5.57037854e+00
8.67934356e-01 -1.28788975e-03 -1.50542358e+00 1.40912747e-03
-3.64281060e-01]
----------------------------------------------------------------------
----------------------------------------------------------------------
Incremental Learning
Training
Mean Squared Error: 0.0
Test
Mean Squared Error: 0.0
----------------------------------------------------------------------
Final weights of Incremental Learning
[-4.68476601e-02 4.08283559e-02 8.30964818e-02 1.09347853e+02
-2.39244676e+02 3.05351274e+01 2.35384427e-03 -5.57037854e+00
8.67934356e-01 -1.28788975e-03 -1.50542358e+00 1.40912747e-03
-3.64281060e-01]
----------------------------------------------------------------------
Partial Weights
Plotting Training and Test Loss
Figure(1500x500)
!python3 scripts/main3_4.py learning_rate=0.01
Linear Regression on housing dataset program!
----------------------------------------------------------------------
----------------------------------------------------------------------
Loading Training and Test sets
datasets loaded!
[train/test set heads identical to the first run above; omitted]
----------------------------------------------------------------------
----------------------------------------------------------------------
Training:
Mean Squared Error: 55343.71451079023
Testing using the last trained weights:
Test
Mean Squared Error: 35394.311865625714
----------------------------------------------------------------------
Weights of the training
[-4.68476601e-02 4.08283559e-02 8.30964818e-02 1.09347853e+02
-2.39244676e+02 3.05351274e+01 2.35384427e-03 -5.57037854e+00
8.67934356e-01 -1.28788975e-03 -1.50542358e+00 1.40912747e-03
-3.64281060e-01]
----------------------------------------------------------------------
----------------------------------------------------------------------
Incremental Learning
Training
Mean Squared Error: 0.31370632734922127
Test
Mean Squared Error: 89882561.46377167
----------------------------------------------------------------------
Final weights of Incremental Learning
[-4.68476601e-02 4.08283559e-02 8.30964818e-02 1.09347853e+02
-2.39244676e+02 3.05351274e+01 2.35384427e-03 -5.57037854e+00
8.67934356e-01 -1.28788975e-03 -1.50542358e+00 1.40912747e-03
-3.64281060e-01]
----------------------------------------------------------------------
Partial Weights
Plotting Training and Test Loss
Figure(1500x500)
(d)#
Comparing the plotted losses shows no major differences between parts (c) and (d), only minor ones. For example, with the \(\frac{2}{t}\) learning rate the extended version of the dataset reaches its best results later than the linear version; conversely, the linear version achieves better results in the earlier steps.
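To see why the early steps matter, note how quickly the \(\frac{2}{t}\) schedule decays compared to the static rates tried above; a small illustrative plot:
t = np.arange(1, 101)
plt.figure(figsize=(6, 3))
plt.plot(t, 2 / t, label='2/t')
plt.axhline(0.5, color='gray', linestyle='--', label='static 0.5')
plt.axhline(0.01, color='gray', linestyle=':', label='static 0.01')
plt.xlabel('step t')
plt.ylabel('learning rate')
plt.legend()
plt.show()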