Machine Learning — Foundations Homework 4

This is the final homework of the Machine Learning Foundations course, with 20 questions in total. 1. Deterministic noise depends on \(\mathcal{H}\), as some models approximate \(f\) better than others. Assume \(\mathcal{H}'\subset \mathcal{H}\) and that \(f\) is fixed. In general (but not necessarily in all cases), if we use \(\mathcal{H}'\) instead of \(\mathcal{H}\), how does deterministic noise behave?

  1. In general, deterministic noise will decrease.

  2. In general, deterministic noise will increase.

  3. In general, deterministic noise will be the same.

  4. If \(d_{\text{vc} }(\mathcal{H}') \le \frac{1}{2} d_{\text{vc} }(\mathcal{H})\), deterministic noise will increase, else it will decrease.

  5. If \(d_{\text{vc} }(\mathcal{H}') \le \frac{1}{2} d_{\text{vc} }(\mathcal{H})\), deterministic noise will decrease, else it will increase.

Deterministic noise arises because \(\mathcal{H}\) cannot perfectly approximate \(f\). Since \(\mathcal{H}'\) is a subset of \(\mathcal{H}\), its model complexity is lower and it generally approximates the target function even less well, so the deterministic noise should increase. The answer is b.

2. Consider the following hypothesis set for \(\mathbf{x} \in \mathbb{R}^d\) defined by the constraint: \[ \mathcal{H}(d, d_0) = \{ h ~|~ h(\mathbf{x}) = \mathbf{w}^\mathrm{T}\mathbf{x}; w_i = 0 \hspace{1 mm} \mbox{for} \hspace{1mm} i \geq d_0 \}, \] which of the following statements is correct?

  1. \(H(10,3)⊂H(10,4)\)

  2. \(H(10,3)∪H(10,4)=\{\}\)

  3. \(H(10,3)⊃H(10,4)\)

  4. \(H(10,3)∩H(10,4)=\{\}\)

  5. none of the other choices

The constraint here is straightforward: an extra parameter \(d_0\) (\(d_0 \leq d\)) fixes every \(w_i\) with \(i \geq d_0\) to 0. The choices only involve \(d_0 = 3\) and \(d_0 = 4\), and \(\mathcal{H}(10,3)\) is contained in \(\mathcal{H}(10,4)\): in the former \(w_3\) is also forced to 0, leaving 3 free parameters, while the latter has 4 free parameters because \(w_3\) may be 0 or any other value. The answer is a.

For Questions 3-4, consider the augmented error \(E_{\text{aug} }(\mathbf{w}) = E_{\text{in} }(\mathbf{w}) + \frac{\lambda}{N} \mathbf{w}^T \mathbf{w}\) with some \(\lambda > 0\).

3. If we want to minimize the augmented error \(E_{\text{aug} }(\mathbf{w})\) by gradient descent with \(\eta\) as learning rate, which of the following is a correct update rule?

  1. \(w(t+1)⟵w(t)+\eta\lambda\nabla E_{\text{in}}(w(t))\).

  2. \(w(t+1)⟵w(t)-\eta\lambda\nabla E_{\text{in}}(w(t))\).

  3. \(w(t+1)⟵(1- \frac {2\eta \lambda} {N})w(t)-\eta\nabla E_{\text{in}}(w(t))\).

  4. \(w(t+1)⟵(1+ \frac {2\eta \lambda} {N})w(t)-\eta\nabla E_{\text{in}}(w(t))\).

  5. none of the other choices

This one is fairly easy. Gradient descent moves against the gradient, so we only need the gradient of \(E_{\text{aug}}\). The gradient of \(\frac{\lambda}{N}\mathbf{w}^T\mathbf{w}\) is simply \(\frac{2\lambda}{N}\mathbf{w}\), and the remaining term is the usual \(\nabla E_{\text{in}}\), so the answer is c.
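
For concreteness, here is a minimal sketch of one such update step in Python (the function name and the `grad_Ein` callable are illustrative, not part of the homework code):

```python
import numpy as np

def augmented_gd_step(w, grad_Ein, eta, lam, N):
    # One gradient-descent step on E_aug(w) = E_in(w) + (lambda/N) * w^T w.
    # The regularizer adds (2*lambda/N) * w to the gradient, so the weights are
    # first shrunk by (1 - 2*eta*lambda/N) and then moved against grad E_in,
    # which is exactly the update in choice c.
    return (1 - 2 * eta * lam / N) * w - eta * grad_Ein(w)

# toy usage: E_in(w) = ||w - 1||^2, whose gradient is 2*(w - 1)
w = np.zeros(3)
for _ in range(100):
    w = augmented_gd_step(w, lambda v: 2 * (v - 1), eta=0.1, lam=1.0, N=10)
print(w)  # converges near the regularized optimum, slightly below 1
```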

4. Let \(\mathbf{w}_{\text{lin} }\) be the optimal solution for the plain-vanilla linear regression and \(\mathbf{w}_{\text{reg} }(\lambda)\) be the optimal solution for minimizing \(E_{\text{aug} }\) in Question 3, with \(E_{\text{in} }\) being the squared error for linear regression. Which of the following is correct?

  1. none of the other choices

  2. \(||W_{reg}(\lambda)|| \leq ||W_{lin}||\) for any \(\lambda > 0\)

  3. \(||W_{reg}(\lambda)|| \geq ||W_{lin}||\) for any \(\lambda > 0\)

  4. \(||W_{reg}(\lambda)||\) is always a non-decreasing function of \(\lambda\) for \(\lambda \ge 0\)

  5. \(||W_{reg}(\lambda)||\) is always a constant function of \(\lambda\) for \(\lambda \ge 0\)

To answer this, first recall that \(\|\mathbf{w}\|\) denotes the norm of \(\mathbf{w}\), whose square is \(\mathbf{w}^T\mathbf{w}\). The regularizer added in the previous question corresponds to the constraint \(\mathbf{w}^T\mathbf{w} \leq C\). Why? At the minimum of the augmented error, \(\nabla E_{\text{aug}} = 0\), i.e. \(\nabla E_{\text{in}} = -\frac{2\lambda}{N}\mathbf{w}\), so at the solution the lengths of \(\nabla E_{\text{in}}\) and \(\mathbf{w}\) have a fixed ratio, which pins down the length of \(\mathbf{w}\). Since we keep moving toward the constrained minimum, \(\nabla E_{\text{in}}\) there is already nearly flat (small), which means the resulting \(\mathbf{w}\) ends up small as well.

Looked at from another angle, the norm is always non-negative; to reach the minimum of the augmented error we naturally want the norm as small as possible. Regularization effectively penalizes each parameter and pushes it toward smaller values.

So the answer is b. Equality holds when the unconstrained optimum \(\mathbf{w}_{\text{lin}}\) already satisfies the constraint; otherwise \(\|\mathbf{w}_{\text{lin}}\| > \|\mathbf{w}_{\text{reg}}\|\).
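
As a small numerical illustration of this shrinking effect (synthetic data, made up only for this check):

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(50, 3)                       # synthetic inputs
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * np.random.randn(50)

w_lin = np.linalg.solve(X.T @ X, X.T @ y)        # plain least squares
for lam in [0.1, 1.0, 10.0, 100.0]:
    # minimizer of ||Xw - y||^2 + lam * ||w||^2
    w_reg = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(lam, np.linalg.norm(w_reg), "<=", np.linalg.norm(w_lin))
```

The printed norms never exceed \(\|\mathbf{w}_{\text{lin}}\|\) and shrink as \(\lambda\) grows.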

5. You are given the data points \((-1,0)\), \((\rho,1)\), \((1,0)\), and a choice between two models:

  • constant \(h_0(x)=b_0\) and
  • linear \(h_1(x)=a_1x+b_1\)

For which value of \(\rho\) would the two models be tied using leave-one-out cross-validation with the squared error measure?

  1. \(\sqrt{\sqrt {3}+4}\)

  2. \(\sqrt{\sqrt{3} - 1}\)

  3. \(\sqrt{9+4 \sqrt{6} }\)

  4. \(\sqrt{9 - \sqrt{6} }\)

  5. none of the other choice

We use leave-one-out cross-validation to obtain \(E_{val}\). In the first model, \(h(x)=b_0\) is a constant; in the second it is a line. First compute \(E_{val}\) for each model.

The constant model:

  • Hold out the first point: \(E_{in} = \frac 1 2 ((b_0-1)^2 + b_0^2)\), minimized at \(b_0 = 0.5\), so the validation error is \(0.5^2 = 0.25\).
  • Hold out the second point: \(E_{in} = b_0^2\), so \(b_0 = 0\) and the validation error is 1.
  • Hold out the third point: same as the first case, error \(= 0.25\).

Hence \(E_{val} = (1+0.25+0.25)/3 = 0.5\).

The linear model:

  • Hold out the first point: fitting the remaining two gives \(a_1 = \frac 1 {\rho-1}\), \(b_1 = \frac 1 {1-\rho}\), so the prediction at \(x=-1\) is \(\frac 2 {1-\rho}\) and the error is \(\frac 4 {(1-\rho)^2}\).
  • Hold out the second point: the fit is the constant line \(h(x) = 0\), so the error is 1.
  • Hold out the third point: \(a_1 = \frac 1 {\rho+1}\), \(b_1 = \frac 1 {1+\rho}\), so the prediction at \(x=1\) is \(\frac 2 {1+\rho}\) and the error is \(\frac 4 {(1+\rho)^2}\).

This gives \(E_{val} = \left(1 + \frac 4 {(1-\rho)^2} + \frac 4 {(1+\rho)^2}\right)/3\).

The question asks for the \(\rho\) at which the two models are tied, so the two expressions above must be equal. Solving gives \(\rho^2 = 9 \pm 4\sqrt{6}\), and since \(9 - 4\sqrt{6} < 0\), the valid solution is \(\rho^2 = 9 + 4\sqrt{6}\), i.e. \(\rho = \sqrt{9 + 4\sqrt{6}}\). The answer is c.
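
A quick numerical check of this algebra (illustrative only):

```python
import numpy as np

rho = np.sqrt(9 + 4 * np.sqrt(6))
e_const = (1 + 0.25 + 0.25) / 3                            # LOOCV error of the constant model
e_lin = (1 + 4 / (1 - rho) ** 2 + 4 / (1 + rho) ** 2) / 3  # LOOCV error of the linear model
print(e_const, e_lin)                                      # both print 0.5 (up to rounding)
```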

For Questions 6-7, suppose that for 5 weeks in a row, a letter arrives in the mail that predicts the outcome of the upcoming Monday night baseball game.

6. Assume there are no ties. You keenly watch each Monday and to your surprise, the prediction is correct each time. On the day after the fifth game, a letter arrives, stating that if you wish to see next week's prediction, a payment of NTD \(1000\) is required. Which of the following statements is true?

  1. There are 31 win-lose predictions for 5 games.

  2. If the sender wants to make sure that at least one person receives correct predictions on all 5 games from him, the sender should target to begin with at least 5 people.

  3. To make sure that at least one person receives correct predictions on all 5 games from the sender, after the first letter `predicts' the outcome of the first game, the sender should target at least 16 people with the second letter.

  4. To make sure that at least one person receives correct predictions on all 5 games from him, at least 64 letters should be sent before the fifth game.

  5. none of the other choice

This question describes a classic trick. There are 5 games, each with two possible outcomes, so there are \(2^5 = 32\) possible prediction sequences; a is wrong. To cover all 32 sequences the sender must start with 32 people, so b is wrong. The minimum number of letters sent before the fifth game is \(32+16+8+4+2 = 62\), so d is wrong. As for c: once the first result is known, half of the recipients have received a wrong prediction, so the second letter only needs to go to the 16 people who got a correct one. The answer is c.

7. If the cost of printing and mailing out each letter is NTD 10. If the sender sends the minimum number of letters out, how much money can be made for the above `fraud' to succeed once? That is, one of the recipients does send him NTD 1000 to receive the prediction of the 6-th game?

  1. NTD 340

  2. NTD 370

  3. NTD 400

  4. NTD 430

  5. NTD 460

From the previous question, at least 62 letters must be sent to guarantee that someone receives all five correct predictions, plus one more letter to that person asking for the money, for 63 letters in total. The profit is therefore \(1000 - 63 \times 10 = 370\) NTD, so the answer is b.
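
The bookkeeping for Questions 6 and 7 in a few lines (illustrative only):

```python
letters = sum(2 ** k for k in range(1, 6))  # 32 + 16 + 8 + 4 + 2 = 62 letters before game 5
total_letters = letters + 1                 # plus the letter asking for NTD 1000
profit = 1000 - 10 * total_letters          # income minus printing/mailing cost
print(total_letters, profit)                # 63 370
```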

For Questions 8-10, please read the following story first. In our credit card example, the bank starts with some vague idea of what constitutes a good credit risk. So, as customers \(\mathbf{x}_1, \mathbf{x}_2,...,\mathbf{x}_N\) arrive, the bank applies its vague idea to approve credit cards for some of these customers based on a formula \(a(\mathbf{x})\). Then, only those who get credit cards are monitored to see if they default or not.

8. For simplicity, suppose that the first \(N=10000\) customers were given credit cards by the credit approval function \(a(\mathbf{x})\). Now that the bank knows the behavior of these customers, it comes to you to improve their algorithm for approving credit. The bank gives you the data \((\mathbf{x}_1, y_1), ... , (\mathbf{x}_N, y_N)\). Before you look at the data, you do mathematical derivations and come up with a credit approval function. You now test it on the data and, to your delight, obtain perfect prediction.

What is \(M\), the size of your hypothesis set?

  1. \(1\)

  2. \(N\)

  3. \(2^N\)

  4. \(N^2\)

  5. We have no idea about it.

The credit approval function was derived purely by mathematical reasoning, before looking at the data, and happened to predict perfectly. Since the data was never used to choose it, the hypothesis set effectively contains only that single hypothesis, so \(M = 1\). The answer is a.

9. With such an \(M\), what does the Hoeffding bound say about the probability that the true average error rate of \(g\) is worse than \(1\%\) for \(N=10,000\)?

  1. \(\leq 0.171\)

  2. \(\leq 0.221\)

  3. \(\leq 0.271\)

  4. \(\leq 0.321\)

  5. none of the other choices

This is a direct application of the Hoeffding inequality: \(P[|\nu - \mu| > \epsilon] \leq 2 e^{-2\epsilon^2 N}\). With \(\epsilon = 0.01\) and \(N = 10000\), the bound is \(2e^{-2} \approx 0.2707\), so the answer is c.
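
The bound can be evaluated directly (illustrative only):

```python
import math

epsilon, N = 0.01, 10000
bound = 2 * math.exp(-2 * epsilon ** 2 * N)  # two-sided Hoeffding bound with M = 1
print(bound)                                 # about 0.2707
```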

10. You assure the bank that you have got a system \(g\) for approving credit cards for new customers, which is nearly error-free. Your confidence is given by your answer to the previous question. The bank is thrilled and uses your \(g\) to approve credit for new customers. To their dismay, more than half their credit cards are being defaulted on. Assume that the customers that were sent to the old credit approval function and the customers that were sent to your \(g\) are indeed i.i.d. from the same distribution, and the bank is lucky enough (so the "bad luck" that "the true error of \(g\) is worse than \(1\%\)" does not happen). Which of the following claims is true?

  1. By applying \(a(\mathbf{x}) \mbox{ NOR } g(\mathbf{x})\) to approve credit for new customers, the performance of the overall credit approval system can be improved with guarantee provided by the previous problem.

  2. By applying \(a(\mathbf{x}) \mbox{ NAND } g(\mathbf{x})\) to approve credit for new customers, the performance of the overall credit approval system can be improved with guarantee provided by the previous problem.

  3. By applying \(a(\mathbf{x}) \mbox{ OR } g(\mathbf{x})\) to approve credit for new customers, the performance of the overall credit approval system can be improved with guarantee provided by the previous problem.

  4. By applying \(a(\mathbf{x}) \mbox{ AND } g(\mathbf{x})\) to approve credit for new customers, the performance of the overall credit approval system can be improved with guarantee provided by the previous problem.

  5. none of the other choices

The question says that \(g\), which should have performed well, turned out badly. Why? The key issue is sampling bias. Although new customers come from the same distribution as the old ones, the data used to learn \(g\) is not drawn from that distribution: it only contains customers who had already passed the \(a(\mathbf{x})\) filter. The Hoeffding guarantee from the previous question only holds for that filtered distribution, so to keep the guarantee we should first filter with \(a(\mathbf{x})\) and only then apply \(g(\mathbf{x})\), i.e. use \(a(\mathbf{x}) \mbox{ AND } g(\mathbf{x})\). The answer is d.

For Questions 11-12, consider linear regression with virtual examples.

11. That is, we add \(K\) virtual examples \((\tilde{\mathbf{x} }_1, \tilde{y}_1),(\tilde{\mathbf{x} }_2, \tilde{y}_2),\dots, (\tilde{\mathbf{x} }_K, \tilde{y}_K)\) to the training data set, and solve \[ \min_{\mathbf{w}} \frac{1}{N+K} \left( \sum_{n=1}^{N} (y_n - \mathbf{w}^T\mathbf{x}_n)^2 + \sum_{k=1}^{K} (\tilde{y}_k - \mathbf{w}^T\tilde{\mathbf{x} }_k)^2 \right). \] We will show that using some "special" virtual examples, which were claimed to be a possible way to combat overfitting in Lecture 9, is related to regularization, another possible way to combat overfitting discussed in Lecture 10. Let \(\tilde{\mathbf{X} } = [\tilde{\mathbf{x} }_1 \tilde{\mathbf{x} }_2 \ldots \tilde{\mathbf{x} }_K]^T\), and \(\tilde{\mathbf{y} } = [\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_K]^T\). What is the optimal \(\mathbf{w}\) to the optimization problem above, assuming that all the inversions exist?

  1. \((\mathbf{X}^T\mathbf{X})^{-1}(\widetilde {\mathbf{X} }^T\widetilde{\mathbf{y} })\)

  2. \((\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{X}^T \mathbf{y}+\widetilde {\mathbf{X} }^T\widetilde{y})\)

  3. \((\mathbf{X}^T\mathbf{X} + \widetilde{\mathbf{X} }^T\widetilde{\mathbf{X} })^{-1}(\widetilde {\mathbf{X} }^T\widetilde{\mathbf{y} })\)

  4. \((\mathbf{X}^T\mathbf{X} + \widetilde{\mathbf{X} }^T\widetilde{\mathbf{X} })^{-1}(\mathbf{X}^T \mathbf{y} +\widetilde {\mathbf{X} }^T\widetilde{\mathbf{y} } )\)

  5. none of the other choice

This one is straightforward once the virtual data are folded into the least-squares problem: both the real and the virtual examples contribute to each factor of the normal equation, which rules out the other choices. The answer is d.
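
Setting the gradient of the (unnormalized) sum of squared errors to zero makes this explicit: \[ \nabla_{\mathbf{w}} \left( \|\mathbf{X}\mathbf{w}-\mathbf{y}\|^2 + \|\tilde{\mathbf{X} }\mathbf{w}-\tilde{\mathbf{y} }\|^2 \right) = 2\left( \mathbf{X}^T\mathbf{X} + \tilde{\mathbf{X} }^T\tilde{\mathbf{X} } \right)\mathbf{w} - 2\left( \mathbf{X}^T\mathbf{y} + \tilde{\mathbf{X} }^T\tilde{\mathbf{y} } \right) = 0, \] which gives \(\mathbf{w} = \left( \mathbf{X}^T\mathbf{X} + \tilde{\mathbf{X} }^T\tilde{\mathbf{X} } \right)^{-1}\left( \mathbf{X}^T\mathbf{y} + \tilde{\mathbf{X} }^T\tilde{\mathbf{y} } \right)\), i.e. choice d.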

12. For what \(\tilde{\mathbf{X} }\) and \(\tilde{\mathbf{y} }\) will the solution of the linear regression problem above equal to \[ \mathbf{w}_{\text{reg} } = \mathrm{argmin}_{\mathbf{w} } \frac{\lambda}{N} \|\mathbf{w}\|^2 + \frac{1}{N} \|\mathbf{X}\mathbf{w}-\mathbf{y}\|^2? \]

  1. \(\tilde{\mathbf{X} } = I, \tilde{\mathbf{y} } = 0\)

  2. \(\tilde{\mathbf{X} } = \sqrt {\lambda}I, \tilde{\mathbf{y} } = 0\)

  3. \(\tilde{\mathbf{X} } = \lambda I, \tilde{\mathbf{y} } = \mathbf{1}\)

  4. \(\tilde{\mathbf{X} } = \sqrt{\lambda} \mathbf{X}, \tilde{\mathbf{y} } = \mathbf{y}\)

  5. none of the other choice

At first glance this looks confusing: one side is a matrix-inversion formula, the other is the minimizer of a function (argmin was defined earlier). But it becomes easy once we go back to the objective being minimized in Question 11: \[ \mathbf{w}_{\text{vir}} = \mathrm{argmin}_{\mathbf{w}} \frac 1 {N+K} \left( \|\tilde{\mathbf{X} }\mathbf{w} - \tilde{\mathbf{y} }\|^2 + \|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2 \right). \] Multiplicative constants such as \(\frac 1 N\) or \(\frac 1 {N+K}\) do not change the minimizer, so the two problems share the same solution whenever their objectives agree after dropping the constants. That happens for \(\tilde{\mathbf{X} } = \sqrt {\lambda}I, \tilde{\mathbf{y} } = 0\), so the answer is b.
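
Concretely, substituting \(\tilde{\mathbf{X} } = \sqrt{\lambda} I\) and \(\tilde{\mathbf{y} } = 0\) into the objective above gives \[ \|\sqrt{\lambda} I \mathbf{w} - 0\|^2 + \|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2 = \lambda \|\mathbf{w}\|^2 + \|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2, \] which is \(N\) times the ridge objective \(\frac{\lambda}{N}\|\mathbf{w}\|^2 + \frac{1}{N}\|\mathbf{X}\mathbf{w}-\mathbf{y}\|^2\), so both problems have the same minimizer.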

13. Consider regularized linear regression (also called ridge regression) for classification \[ \mathbf{w}_{\text{reg} } = \mbox{argmin}_{\mathbf{w} } \left(\frac{\lambda}{N} \|\mathbf{w}\|^2 + \frac{1}{N} \|\mathbf{X}\mathbf{w}-\mathbf{y}\|^2\right) . \] Run the algorithm on the following data set as \(\mathcal{D}\): https://www.csie.ntu.edu.tw/~htlin/mooc/datasets/mlfound_algo/hw4_train.dat

and the following set for evaluating \(E_{out}\): https://www.csie.ntu.edu.tw/~htlin/mooc/datasets/mlfound_algo/hw4_test.dat

Because the data sets are for classification, please consider only the 0/1 error for all Questions below.

Let \(\lambda = 10\), which of the following is the corresponding \(E_{in}\) and \(E_{out}\)?

  1. \(E_{in} = 0.015,E_{out} = 0.020\)

  2. \(E_{in} = 0.030,E_{out} = 0.015\)

  3. \(E_{in} = 0.035,E_{out} = 0.020\)

  4. \(E_{in} = 0.050,E_{out} = 0.045\)

  5. \(E_{in} = 0.020,E_{out} = 0.010\)

This question is a small variant of linear regression, and only minor changes to the earlier code are needed. Using the conclusion of Question 12, we can again obtain the solution in a single step.

import math
import matplotlib.pyplot as plt
import numpy as np
import re

def sign(x):
    if x > 0:
        return +1
    else:
        return -1


def visualize(data, W=[]):
    # scatter the two classes and, if W is given, draw the decision boundary
    nx = []
    ny = []
    ox = []
    oy = []
    for i in range(len(data)):
        if data[i][-1] == -1:
            nx.append(data[i][0])
            ny.append(data[i][1])
        else:
            ox.append(data[i][0])
            oy.append(data[i][1])
    plt.scatter(nx, ny, marker="x", c="r")
    plt.scatter(ox, oy, marker="o", c="g")
    if len(W) != 0:
        print(W)
        x = np.linspace(0, 1, 50)
        y = -W[1] / W[2] * x - W[0] / W[2]
        plt.plot(x, y, color="black")
    plt.show()

def ridge_regression_one_step(data, _lambda):
    # build X (with a leading 1 for the bias term) and y from the raw data
    X_matrix = []
    Y_matrix = []
    for i in range(len(data)):
        temp = [1]
        for j in range(len(data[i]) - 1):
            temp.append(data[i][j])
        X_matrix.append(temp)
        Y_matrix.append([data[i][-1]])
    X = np.mat(X_matrix)
    # virtual examples from Question 12: sqrt(lambda) * I with zero targets
    hatX = math.sqrt(_lambda) * np.eye(len(data[0]))
    hatY = np.mat([0 for i in data[0]]).T
    Y = np.mat(Y_matrix)
    # one-step solution from Question 11: w = (X^T X + X~^T X~)^-1 (X^T y + X~^T y~)
    W = (X.T * X + hatX.T * hatX).I * (X.T * Y + hatX.T * hatY)
    return W.T.tolist()[0]

def Ein(data, W):
    # number of 0/1 errors made by the linear classifier sign(w^T x)
    err_num = 0
    for i in range(len(data)):
        res = W[0]
        for j in range(1, len(W)):
            res += W[j] * data[i][j - 1]
        if sign(res) != data[i][-1]:
            err_num += 1
    return err_num

def readDataFrom(path):
    separator = re.compile('\t|\b| |\n')
    result = []
    with open(path, "r") as f:
        s = f.readline()[:-1]
        while s:
            temp = separator.split(s)
            result.append([float(x) for x in temp])
            s = f.readline()[:-1]
    return result


if __name__ == "__main__":
    data = readDataFrom("./train.dat")
    data_test = readDataFrom("./test.dat")
    W = ridge_regression_one_step(data, 10)
    print("Ein", Ein(data, W) / len(data))
    print("Eout", Ein(data_test, W) / len(data_test))
    visualize(data, W)
Running this gives the following result:
Ein 0.05
Eout 0.045
So the answer is d.

14. Following the previous Question, among \(\log_{10} \lambda= \left\{2, 1, 0, -1, \ldots, -8, -9, -10 \right\}\). What is the \(\lambda\) with the minimum \(E_{in}\)? Compute \(\lambda\) and its corresponding \(E_{in}\) and \(E_{out}\) then select the closest answer. Break the tie by selecting the largest \(\lambda\).

  1. \(\log_{10}\lambda = -2, E_{in} = 0.030, E_{out} = 0.040\)

  2. \(\log_{10}\lambda = -4, E_{in} = 0.015, E_{out} = 0.020\)

  3. \(\log_{10}\lambda = -6, E_{in} = 0.030, E_{out} = 0.040\)

  4. \(\log_{10}\lambda = -8, E_{in} = 0.015, E_{out} = 0.020\)

  5. \(\log_{10}\lambda = -10, E_{in} = 0.030, E_{out} = 0.040\)

This question only requires a small change to the main routine from the previous question:

if __name__ == "__main__":
    minEin = 1
    minEout = 1
    minEinI = -1
    minEoutI = -1
    data = readDataFrom("./train.dat")
    data_test = readDataFrom("./test.dat")
    # sweep log10(lambda) from -10 to 2 and track the best Ein / Eout
    for i in range(-10, 3):
        _lambda = math.pow(10, i)
        W = ridge_regression_one_step(data, _lambda)
        ein = Ein(data, W) / len(data)
        eout = Ein(data_test, W) / len(data_test)
        print(i, "Ein:", ein, "Eout:", eout)
        if ein <= minEin:      # <= keeps the largest lambda in case of ties
            minEin = ein
            minEinI = i
        if eout <= minEout:
            minEout = eout
            minEoutI = i
    print("minEin:", minEinI, "minEout:", minEoutI)
The output:
-10 Ein: 0.015 Eout: 0.02
-9 Ein: 0.015 Eout: 0.02
-8 Ein: 0.015 Eout: 0.02
-7 Ein: 0.03 Eout: 0.015
-6 Ein: 0.035 Eout: 0.016
-5 Ein: 0.03 Eout: 0.016
-4 Ein: 0.03 Eout: 0.016
-3 Ein: 0.03 Eout: 0.016
-2 Ein: 0.03 Eout: 0.016
-1 Ein: 0.035 Eout: 0.016
0 Ein: 0.035 Eout: 0.02
1 Ein: 0.05 Eout: 0.045
2 Ein: 0.24 Eout: 0.261
minEin: -8 minEout: -7
The smallest \(E_{in}\) occurs at \(\lambda = 10^{-8}\) (taking the largest \(\lambda\) among ties), so the answer is d.

15. Following the previous Question, among \(\log_{10} \lambda= \left\{2, 1, 0, -1, \ldots, -8, -9, -10 \right\}\). What is the \(\lambda\) with the minimum \(E_{out}\)? Compute \(\lambda\) and its corresponding \(E_{in}\) and \(E_{out}\) then select the closest answer. Break the tie by selecting the largest \(\lambda\).

  1. \(\log_{10}\lambda = -1, E_{in} = 0.015, E_{out} = 0.015\)

  2. \(\log_{10}\lambda = -3, E_{in} = 0.015, E_{out} = 0.015\)

  3. \(\log_{10}\lambda = -5, E_{in} = 0.015, E_{out} = 0.030\)

  4. \(\log_{10}\lambda = -7, E_{in} = 0.030, E_{out} = 0.015\)

  5. \(\log_{10}\lambda = -9, E_{in} = 0.030, E_{out} = 0.030\)

The answer can already be read from the table above: the smallest \(E_{out}\) occurs at \(\lambda = 10^{-7}\), with \(E_{in} = 0.030\) and \(E_{out} = 0.015\). The answer is d.

16. Now split the given training examples in \(\mathcal{D}\) to the first 120 examples for \(\mathcal{D}_{\text{train} }\) and 80 for \(\mathcal{D}_{\text{val} }\). \(\textit{Ideally, you should randomly do the 120/80 split. Because the given examples are already randomly permuted, however, we would use a fixed split for the purpose of this problem.}\)

Run the algorithm on \(\mathcal{D}_{\text{train} }\) to get \(g^-_\lambda\), and validate \(g^-_\lambda\) with \(\mathcal{D}_{\text{val} }\). Among \(\log_{10} \lambda= \left\{2, 1, 0, -1, \ldots, -8, -9, -10 \right\}\). What is the \(\lambda\) with the minimum \(E_{train}(g^-_\lambda)\)? Compute \(\lambda\) and the corresponding \(E_{train}(g^-_\lambda)\), \(E_{val}(g^-_\lambda)\) and \(E_{out}(g^-_\lambda)\) then select the closest answer. Break the tie by selecting the largest \(\lambda\).

  1. \(\log_{10}\lambda = 0, E_{train}(g_{\lambda}^-) = 0.000, E_{val}(g_{\lambda}^-) = 0.050, E_{out}(g_{\lambda}^-) = 0.025\)

  2. \(\log_{10}\lambda = -2, E_{train}(g_{\lambda}^-) = 0.010, E_{val}(g_{\lambda}^-) = 0.050, E_{out}(g_{\lambda}^-) = 0.035\)

  3. \(\log_{10}\lambda = -4, E_{train}(g_{\lambda}^-) = 0.000, E_{val}(g_{\lambda}^-) = 0.010, E_{out}(g_{\lambda}^-) = 0.025\)

  4. \(\log_{10}\lambda = -6, E_{train}(g_{\lambda}^-) = 0.010, E_{val}(g_{\lambda}^-) = 0.010, E_{out}(g_{\lambda}^-) = 0.025\)

  5. \(\log_{10}\lambda = -8, E_{train}(g_{\lambda}^-) = 0.000, E_{val}(g_{\lambda}^-) = 0.050, E_{out}(g_{\lambda}^-) = 0.025\)

The previous program still works here; only the main routine needs to change, adding the fixed train/validation split and the validation-related code.

if __name__ == "__main__":
    minEtrain = 1
    minEval = 1
    minEout = 1
    minEvalI = -1
    minEtrainI = -1
    minEoutI = -1
    data = readDataFrom("./train.dat")
    data_train = data[0:120]   # fixed 120/80 split, as required by the problem
    data_val = data[120:200]
    data_test = readDataFrom("./test.dat")
    for i in range(-10, 3):
        _lambda = math.pow(10, i)
        W = ridge_regression_one_step(data_train, _lambda)
        etrain = Ein(data_train, W) / len(data_train)
        e_val = Ein(data_val, W) / len(data_val)
        eout = Ein(data_test, W) / len(data_test)
        print(i, "Etrain:", etrain, "Eval:", e_val, "Eout:", eout)
        if etrain <= minEtrain:
            minEtrain = etrain
            minEtrainI = i
        if e_val <= minEval:
            minEval = e_val
            minEvalI = i
        if eout <= minEout:
            minEout = eout
            minEoutI = i
    print("minEtrain:", minEtrainI, "minEval:", minEvalI, "minEout", minEoutI)
The final output:
-10 Etrain: 0.008333333333333333 Eval: 0.125 Eout: 0.04
-9 Etrain: 0.0 Eval: 0.1 Eout: 0.038
-8 Etrain: 0.0 Eval: 0.05 Eout: 0.025
-7 Etrain: 0.03333333333333333 Eval: 0.0375 Eout: 0.021
-6 Etrain: 0.03333333333333333 Eval: 0.0375 Eout: 0.021
-5 Etrain: 0.03333333333333333 Eval: 0.0375 Eout: 0.021
-4 Etrain: 0.03333333333333333 Eval: 0.0375 Eout: 0.021
-3 Etrain: 0.03333333333333333 Eval: 0.0375 Eout: 0.021
-2 Etrain: 0.03333333333333333 Eval: 0.0375 Eout: 0.021
-1 Etrain: 0.03333333333333333 Eval: 0.0375 Eout: 0.022
0 Etrain: 0.03333333333333333 Eval: 0.0375 Eout: 0.028
1 Etrain: 0.075 Eval: 0.125 Eout: 0.08
2 Etrain: 0.3416666666666667 Eval: 0.4125 Eout: 0.414
minEtrain: -8 minEval: 0 minEout -2
The smallest \(E_{train}\) occurs at \(\log_{10}\lambda = -8\), with \(E_{train} = 0.000\), \(E_{val} = 0.050\), and \(E_{out} = 0.025\), so the answer is e.

17. Following the previous Question, among \(\log_{10} \lambda= \left\{2, 1, 0, -1, \ldots, -8, -9, -10 \right\}\). What is the \(\lambda\) with the minimum \(E_{val}(g^-_\lambda)\)? Compute \(\lambda\) and the corresponding \(E_{train}(g^-_\lambda)\), \(E_{val}(g^-_\lambda)\) and \(E_{out}(g^-_\lambda)\) then select the closest answer. Break the tie by selecting the largest \(\lambda\).

  1. \(\log_{10}\lambda = 0, E_{train}(g_{\lambda}^-) = 0.033, E_{val}(g_{\lambda}^-) = 0.038, E_{out}(g_{\lambda}^-) = 0.028\)

  2. \(\log_{10}\lambda = -3, E_{train}(g_{\lambda}^-) = 0.000, E_{val}(g_{\lambda}^-) = 0.028, E_{out}(g_{\lambda}^-) = 0.038\)

  3. \(\log_{10}\lambda = -6, E_{train}(g_{\lambda}^-) = 0.066, E_{val}(g_{\lambda}^-) = 0.038, E_{out}(g_{\lambda}^-) = 0.038\)

  4. \(\log_{10}\lambda = -9, E_{train}(g_{\lambda}^-) = 0.033, E_{val}(g_{\lambda}^-) = 0.028, E_{out}(g_{\lambda}^-) = 0.028\)

  5. \(\log_{10}\lambda = -10, E_{train}(g_{\lambda}^-) = 0.066, E_{val}(g_{\lambda}^-) = 0.028, E_{out}(g_{\lambda}^-) = 0.028\)

The answer can be read from the output above: the smallest \(E_{val}\) occurs at \(\log_{10}\lambda = 0\), with \(E_{train} \approx 0.033\), \(E_{val} \approx 0.038\), and \(E_{out} = 0.028\). The answer is a.

18. Run the algorithm with the optimal \(\lambda\) of the previous Question on the whole \(\mathcal{D}\) to get \(g_\lambda\). Compute \(E_{in}(g_\lambda)\) and \(E_{out}(g_\lambda)\) then select the closest answer.

  1. \(E_{in}(g_{\lambda}) = 0.015,E_{out}(g_{\lambda}) = 0.020\)

  2. \(E_{in}(g_{\lambda}) = 0.025,E_{out}(g_{\lambda}) = 0.030\)

  3. \(E_{in}(g_{\lambda}) = 0.035,E_{out}(g_{\lambda}) = 0.020\)

  4. \(E_{in}(g_{\lambda}) = 0.045,E_{out}(g_{\lambda}) = 0.030\)

  5. \(E_{in}(g_{\lambda}) = 0.055,E_{out}(g_{\lambda}) = 0.020\)

Question 17 picked \(\log_{10}\lambda = 0\), i.e. \(\lambda = 1\), as the best. Retraining on the full data set with this \(\lambda\) gives the numbers already shown in the Question 14 table: \(E_{in} = 0.035\) and \(E_{out} = 0.020\). The answer is c.

Note that validation selected a \(\lambda\) whose \(E_{out}\) is close to the best achievable.

For Questions 19-20, split the given training examples in \(\mathcal{D}\) to five folds, the first \(40\) being fold 1, the next \(40\) being fold 2, and so on. Again, we take a fixed split because the given examples are already randomly permuted.

19. What is the \(λ\) with the minimum \(E_{cv}\), where \(E_{cv}\) comes from the five folds defined above? Compute \(\lambda\) and the corresponding \(E_{cv}\) then select the closest answer. Break the tie by selecting the largest \(\lambda\).

  1. \(\log_{10}\lambda = 0, E_{cv} = 0.030\)

  2. \(\log_{10}\lambda = -2, E_{cv} = 0.020\)

  3. \(\log_{10}\lambda = -4, E_{cv} = 0.030\)

  4. \(\log_{10}\lambda = -6, E_{cv} = 0.020\)

  5. \(\log_{10}\lambda = -8, E_{cv} = 0.030\)

This question requires cross-validation, so we add a new function cv and modify the main routine accordingly.

# cross-validation helper added for this question
def cv(data, fold_count, _lambda):
    # the examples are already randomly permuted, so a fixed split is fine
    ecv = 0
    each_c = len(data) / fold_count
    for i in range(fold_count):
        val = data[int(i * each_c):int((i + 1) * each_c)]
        train = data[0:int(i * each_c)]
        train.extend(data[int((i + 1) * each_c):])
        W = ridge_regression_one_step(train, _lambda)
        ecv += Ein(val, W) / len(val)
    return ecv / fold_count
The final results are as follows:
-10 Ecv: 0.05
-9 Ecv: 0.05
-8 Ecv: 0.03
-7 Ecv: 0.034999999999999996
-6 Ecv: 0.034999999999999996
-5 Ecv: 0.034999999999999996
-4 Ecv: 0.034999999999999996
-3 Ecv: 0.034999999999999996
-2 Ecv: 0.034999999999999996
-1 Ecv: 0.034999999999999996
0 Ecv: 0.034999999999999996
1 Ecv: 0.06
2 Ecv: 0.28500000000000003
minEcv: -8
So the answer is d.

20. Run the algorithm with the optimal \(\lambda\) of the previous problem on the whole \(\mathcal{D}\) to get \(g_\lambda\). Compute \(E_{in}(g_\lambda)\) and \(E_{out}(g_\lambda)\) then select the closest answer.

  1. \(E_{in}(g_\lambda) = 0.005,E_{out}(g_\lambda) = 0.010\)

  2. \(E_{in}(g_\lambda) = 0.015,E_{out}(g_\lambda) = 0.020\)

  3. \(E_{in}(g_\lambda) = 0.025,E_{out}(g_\lambda) = 0.020\)

  4. \(E_{in}(g_\lambda) = 0.035,E_{out}(g_\lambda) = 0.030\)

  5. \(E_{in}(g_\lambda) = 0.045,E_{out}(g_\lambda) = 0.020\)

The best \(\lambda\) from the previous question is \(10^{-8}\). Training on the full data set with it gives the \(E_{in}\) and \(E_{out}\) already shown in the Question 14 analysis (\(E_{in} = 0.015\), \(E_{out} = 0.020\)), so the answer is b.