ECE 8550: Assignment 1
1 Linear regression
In this section, you will implement linear regression to predict the death rate from several
features, including annual precipitation, temperature, population, income, pollution, etc.
The data for this assignment is given in file data.txt, which contains the data description,
17 columns (features) and 60 rows (examples). In the data matrix, columns 2-16 are input
features, and column 17 is the output target. Note that Column 1 is the index and should
not be used in the regression. You will implement gradient descent for learning the linear
We split the data to a training set and a testing set following the 80/20 rule. Thus, we have
48 training samples (n = 48) with 16 features (d = 16). The i-th sample is denoted by xi
i is the value of the j-th feature this sample.
1.1 Feature normalization
By looking at the feature values in the data, you may notice that some features are about
1000 times the other features. When features differ by orders of magnitude, first performing
feature normalization can make gradient descent converge much more quickly. There are
different ways to do the feature normalization. For simplicity, we use the following method:
(1) subtract the minimal value of each feature; (2) divide the feature values by their ranges
of values. Equivalently, it is given by the following equation:
for each X, let Xnorm =
Similarly normalize the target Y . Note that to make prediction for a new data point, you
also need to first normalize the features as what you did previously for the training dataset.
1.2 Feature appending
Classic regression is defined as
f(xi) = w · xi + b =
wj ×xji + b,
where bi is the intercept. For simplicity, we let b = w
d+1 ×xd+1i where x
i = 1 and w
unknown but learnable parameter. Thus, we can combine w and b via letting
w ←[w1, w2, w3, . . . ,wd,wd+1]T ;
xi ←[x1i , x
i , x
i , . . . , x
i , 1]
1.3 Gradient descent with ridge regularization
To recap, the prediction function for linear regression is given by
f(xi) = w · xi = wT xi =
wj ×xji, (1)
• xi is the feature vecter of the i-th sample;
• xji is the value of the j-th feature for the i-th sample;
• wj is the j-th parameter of w;
The loss function for linear regression with ridge regularization is given by
2(d + 1)
To minimize this function, we first obtain the gradient with respect to each parameter wj
(f(xi) −yi) x
d + 1
Then, the (full) gradient descent algorithm is given as:
where α is the learning rate. The convergence criteria for the above algorithm is ∆%cost < �,
|Jk−1(w) −Jk(w)|× 100
Algorithm 1: Batch Gradient Descent
k = 0;
while convergence criteria not reached do
for j = 1, . . . ,m do
Update wj ← wj −α∂J(w)
Update k ← k + 1;
where Jk(w) is the value of Eq. (2) at k-th iteration, and ∆%cost is computed at the end of
each iteration of the while loop. Initialize w = 0 and compute J0(w) with these values.
Important: You must simultaneously update wj for all j.
Task: Load the dataset in data.txt, split it into 80% / 20% training/test. The dataset is
already shuffled so you can simply use the first 80% examples for training and the remaining
20% for test. Learn the linear regression model using the training data. Plot the value of
loss function Jk(w) vs. the number of iterations (k). For each iteration, compute the mean
squared loss (w/o regularization function) on the test data. Plot the mean squared loss over
the test data v.s. the number of iterations (k) in the same figure. The mean squared loss is
[yi −w · xi]
1.4 Gradient descent with lasso regularization
The loss function for linear regression with lasso regularization is given by
2(d + 1)
To minimize the loss function, you will need to derive the gradient by yourself. The gradient
descent algorithm is the same as the above.
Hint: For simplicity, you can consider
= 1 when wj = 0.
Task: Load the dataset in data.txt, split it into 80% / 20% training/test,. Learn the
lasso regression model using the training data. Plot the value of loss function Jk(w) vs. the
number of iterations (k). Plot the mean squared loss over the test data v.s. the number of
iterations (k) in the same figure.
1.5 What to submit
In this assignment, use α = 0.01 and � = 0.001. Use λ = 2 for ridge regularization, and use
λ = 0.3 for lasso regularization.
1. (50 pts) The source code (.py) without running errors in Anaconda 2021.11 (Python
2. (20 pts) Plot the value of loss function Jk(w) and MSE vs. the number of iterations
(k) for Section 1.2, report the squared loss on the test data for Section 1.3.
3. (10 pts) The number of elements in w whose absoluate value is smaller than 0.01 for
4. (10 pts) Equation for the gradient of Eq. (4) which should be similar to Eq. 3.
5. (10 pts) Plot the value of loss function Jk(w) and MSE vs. the number of iterations
(k) for Section 1.3, report the squared loss on the test data for Section 1.4.
6. (10 pts) The number of elements in w whose absoluate value is smaller than 0.01 for
1. This assignment is implemented using Python in Anaconda 2021.11 (Python 3.9).
2. No libraries are allowed except Python built-ins, numpy, and matplotlib.
3. Figures, equations, and numbers should be in a PDF file.
4. Zip the Python source code (.py) and your PDF file before uploading.