ECE 8550: Assignment 1

1 Linear regression

In this section, you will implement linear regression to predict the death rate from several

features, including annual precipitation, temperature, population, income, pollution, etc.

The data for this assignment is given in file data.txt, which contains the data description,

17 columns (features) and 60 rows (examples). In the data matrix, columns 2-16 are input

features, and column 17 is the output target. Note that Column 1 is the index and should

not be used in the regression. You will implement gradient descent for learning the linear

regression model.

We split the data to a training set and a testing set following the 80/20 rule. Thus, we have

48 training samples (n = 48) with 16 features (d = 16). The i-th sample is denoted by xi

where x

j

i is the value of the j-th feature this sample.

1.1 Feature normalization

By looking at the feature values in the data, you may notice that some features are about

1000 times the other features. When features differ by orders of magnitude, first performing

feature normalization can make gradient descent converge much more quickly. There are

different ways to do the feature normalization. For simplicity, we use the following method:

(1) subtract the minimal value of each feature; (2) divide the feature values by their ranges

of values. Equivalently, it is given by the following equation:

for each X, let Xnorm =

X −Xmin

Xmax −Xmin

Similarly normalize the target Y . Note that to make prediction for a new data point, you

also need to first normalize the features as what you did previously for the training dataset.

1

1.2 Feature appending

Classic regression is defined as

f(xi) = w · xi + b =

d∑

j=1

wj ×xji + b,

where bi is the intercept. For simplicity, we let b = w

d+1 ×xd+1i where x

d+1

i = 1 and w

d+1 is

unknown but learnable parameter. Thus, we can combine w and b via letting

w ←[w1, w2, w3, . . . ,wd,wd+1]T ;

xi ←[x1i , x

2

i , x

3

i , . . . , x

d

i , 1]

T .

1.3 Gradient descent with ridge regularization

To recap, the prediction function for linear regression is given by

f(xi) = w · xi = wT xi =

d+1∑

j=1

wj ×xji, (1)

where

• xi is the feature vecter of the i-th sample;

• xji is the value of the j-th feature for the i-th sample;

• wj is the j-th parameter of w;

The loss function for linear regression with ridge regularization is given by

J(w) =

1

2n

n∑

i=1

(f(xi) −yi)

2

+

λ

2(d + 1)

d+1∑

j=1

(wj)2. (2)

To minimize this function, we first obtain the gradient with respect to each parameter wj

as:

∂J(w)

∂wj

=

1

n

n∑

i=1

(f(xi) −yi) x

j

i +

λ

d + 1

wj. (3)

Then, the (full) gradient descent algorithm is given as:

where α is the learning rate. The convergence criteria for the above algorithm is ∆%cost < �,

where

∆%cost =

|Jk−1(w) −Jk(w)|× 100

Jk−1(w)

,

2

Algorithm 1: Batch Gradient Descent

k = 0;

while convergence criteria not reached do

for j = 1, . . . ,m do

Update wj ← wj −α∂J(w)

∂wj

Update k ← k + 1;

where Jk(w) is the value of Eq. (2) at k-th iteration, and ∆%cost is computed at the end of

each iteration of the while loop. Initialize w = 0 and compute J0(w) with these values.

Important: You must simultaneously update wj for all j.

Task: Load the dataset in data.txt, split it into 80% / 20% training/test. The dataset is

already shuffled so you can simply use the first 80% examples for training and the remaining

20% for test. Learn the linear regression model using the training data. Plot the value of

loss function Jk(w) vs. the number of iterations (k). For each iteration, compute the mean

squared loss (w/o regularization function) on the test data. Plot the mean squared loss over

the test data v.s. the number of iterations (k) in the same figure. The mean squared loss is

defined as

MSE =

1

2m

m∑

i=1

[yi −w · xi]

2

.

1.4 Gradient descent with lasso regularization

The loss function for linear regression with lasso regularization is given by

J(w) =

1

2n

n∑

i=1

(yi −f(xi))

2

+

λ

2(d + 1)

d+1∑

j=1

|wj|. (4)

To minimize the loss function, you will need to derive the gradient by yourself. The gradient

descent algorithm is the same as the above.

Hint: For simplicity, you can consider

∂|wj|

wj

= 1 when wj = 0.

Task: Load the dataset in data.txt, split it into 80% / 20% training/test,. Learn the

lasso regression model using the training data. Plot the value of loss function Jk(w) vs. the

number of iterations (k). Plot the mean squared loss over the test data v.s. the number of

iterations (k) in the same figure.

3

1.5 What to submit

In this assignment, use α = 0.01 and � = 0.001. Use λ = 2 for ridge regularization, and use

λ = 0.3 for lasso regularization.

1. (50 pts) The source code (.py) without running errors in Anaconda 2021.11 (Python

3.9).

2. (20 pts) Plot the value of loss function Jk(w) and MSE vs. the number of iterations

(k) for Section 1.2, report the squared loss on the test data for Section 1.3.

3. (10 pts) The number of elements in w whose absoluate value is smaller than 0.01 for

Section 1.3.

4. (10 pts) Equation for the gradient of Eq. (4) which should be similar to Eq. 3.

5. (10 pts) Plot the value of loss function Jk(w) and MSE vs. the number of iterations

(k) for Section 1.3, report the squared loss on the test data for Section 1.4.

6. (10 pts) The number of elements in w whose absoluate value is smaller than 0.01 for

Section 1.4.

1.6 Requirements

1. This assignment is implemented using Python in Anaconda 2021.11 (Python 3.9).

2. No libraries are allowed except Python built-ins, numpy, and matplotlib.

3. Figures, equations, and numbers should be in a PDF file.

4. Zip the Python source code (.py) and your PDF file before uploading.

4