Machine Learning (Foundations) Homework 2

There are 20 questions in total.

Questions 1-2 are about noisy targets.

1. Consider the bin model for a hypothesis \(h\) that makes an error with probability \(\mu\) in approximating a deterministic target function \(f\) (both \(h\) and \(f\) output values in \(\{-1, +1\}\)). If we use the same \(h\) to approximate a noisy version of \(f\) given by

\[ P({\bf{x} },y) = P({\bf{x} })P(y|{\bf{x} }) \]

\[ P(y|{\bf{x} }) = \left \{ \begin{matrix} \lambda & {y=f({\bf{x} })} \\ 1-\lambda & \text{otherwise} \end{matrix} \right. \]

What is the probability of error that \(h\) makes in approximating the noisy target \(y\)?

  1. \(1-\lambda\)

  2. \(\mu\)

  3. \(\lambda(1-\mu)+(1-\lambda)\mu\)

  4. \(\lambda\mu+(1-\lambda)(1-\mu)\)

  5. none of the other choices

The wording of this question took me a while to parse. The point is that the noise rate is \(1-\lambda\): with probability \(1-\lambda\) the observed label \(y\) differs from \(f(\mathbf{x})\). We want the overall error rate of \(h\) against \(y\). There are two ways for \(h\) to be wrong: \(h\) disagrees with \(f\) and the label is not flipped (probability \(\mu\lambda\)), or \(h\) agrees with \(f\) but the label is flipped (probability \((1-\mu)(1-\lambda)\)). Adding the two gives \(\mu \lambda + (1-\lambda)(1-\mu)\), so the answer is choice 4.
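Written out as one line, the case analysis above is just:

\[ P[h(\mathbf{x}) \neq y] = P[h \neq f]\,P[y = f] + P[h = f]\,P[y \neq f] = \mu\lambda + (1-\mu)(1-\lambda). \]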

2. Following Question 1, with what value of \(\lambda\) will the performance of \(h\) be independent of \(\mu\)?

  1. 0

  2. 0 or 1

  3. 1

  4. 0.5

  5. none of the other choices

This one is easy: for which value of \(\lambda\) is the performance of \(h\) independent of \(\mu\)? Expand the error probability from Question 1: \(\mu\lambda + (1-\lambda)(1-\mu) = \mu(2 \lambda - 1) + 1 - \lambda\). The \(\mu\) term vanishes exactly when \(\lambda = 0.5\), choice 4.

Questions 3-5 are about generalization error, and getting the feel of the bounds numerically.

3. Please use the simple upper bound \(N^{d_{\text{vc} } }\) on the growth function \(m_{\mathcal{H} }(N)\), assuming that \(N \geq 2\) and \(d_{vc} \geq 2\). For an \(\mathcal{H}\) with \(d_{\text{vc} } = 10\), if you want \(95\%\) confidence that your generalization error is at most 0.05, what is the closest numerical approximation of the sample size that the VC generalization bound predicts?

  1. 420,000

  2. 440,000

  3. 460,000

  4. 480,000

  5. 500,000

This question tests the VC bound. Recall the final result we derived earlier: \[ \epsilon = \sqrt {\frac 8 N \ln {(\frac {4(2N)^{d_{vc} } } {\delta })} } \]

Here \(\epsilon = 0.05\) (the generalization error) and \(\delta = 0.05\) (so that the confidence \(1-\delta\) is 95%). Substituting and checking each candidate \(N\):

\[ \epsilon^2 = \frac{8}{N}\ln\frac{4(2N)^{10}}{0.05}, \qquad \text{target: } \epsilon^2 = 0.05^2 = 0.0025 \]

```
N = 420,000   ε² ≈ 0.002682
N = 440,000   ε² ≈ 0.002568
N = 460,000   ε² ≈ 0.002464
N = 480,000   ε² ≈ 0.002369
N = 500,000   ε² ≈ 0.002281
```
The value closest to 0.0025 comes from N = 460,000, so that is the answer (choice 3).
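A quick way to reproduce these numbers (a small sketch; the function name is my own):

```python
import math

def vc_bound_eps_squared(N, d_vc=10, delta=0.05):
    # epsilon^2 from the VC bound, with m_H(2N) upper-bounded by (2N)^d_vc
    return (8.0 / N) * math.log(4 * (2 * N) ** d_vc / delta)

for N in [420000, 440000, 460000, 480000, 500000]:
    print(N, vc_bound_eps_squared(N))   # compare against 0.05 ** 2 = 0.0025
```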

4. There are a number of bounds on the generalization error \(\epsilon\), all holding with probability at least \(1 - \delta\). Fix \(d_{\text{vc} } = 50\) and \(\delta = 0.05\) and plot these bounds as a function of N. Which bound is the tightest (smallest) for very large N, say N=10,000? Note that Devroye and Parrondo & Van den Broek are implicit bounds in \(\epsilon\).

  1. Original VC bound: \(\epsilon \le \sqrt{\frac{8}{N}\ln\frac{4m_{\mathcal{H} }(2N)}{\delta} }\)

  2. Rademacher Penalty Bound: \(\epsilon \le \sqrt{\frac{2\ln(2Nm_{\mathcal{H} }(N))}{N} } + \sqrt{\frac{2}{N}\ln\frac{1}{\delta} } + \frac{1}{N}\)

  3. Parrondo and Van den Broek: \(\epsilon \le \sqrt{\frac{1}{N}\left(2\epsilon + \ln\frac{6m_{\mathcal{H} }(2N)}{\delta}\right)}\)

  4. Devroye: \(\epsilon \le \sqrt{\frac{1}{2N} (4\epsilon(1 + \epsilon) + \ln \frac{4m_{\mathcal{H} }(N^2)}{\delta})}\)

  5. Variant VC bound: \(\epsilon \le \sqrt{\frac{16}{N}\ln\frac{2m_{\mathcal{H} }(N)}{\sqrt{\delta} } }\)

It is just a matter of plugging into each formula, with \(\delta = 0.05\), \(d_{vc} = 50\), and \(m_{\mathcal{H} }(N)\) bounded by \(N^{50}\):

```
a. √( (8/10000) · ln(4·(2·10000)^50 / 0.05) )                                ≈ 0.632
b. √( 2·ln(2·10000·10000^50) / 10000 ) + √( (2/10000)·ln(1/0.05) ) + 1/10000 ≈ 0.331
c. ε = √( (1/10000) · (2ε + ln(6·(2·10000)^50 / 0.05)) )       (implicit)    ε ≈ 0.224
d. ε = √( (1/20000) · (4ε(1+ε) + ln(4·(10000^2)^50 / 0.05)) )  (implicit)    ε ≈ 0.215
e. √( (16/10000) · ln(2·10000^50 / √0.05) )                                  ≈ 0.860
```

For the implicit bounds c and d, \(\epsilon\) appears on both sides: equality holds at the values shown (solve the quadratic in \(\epsilon\)), and the inequality is only satisfied below them, so those values are the effective bounds.

The smallest is the Devroye bound, choice 4.

5. Continuing from Question 4, for small N, say N=5, which bound is the tightest (smallest)?

Same procedure as above, now with N = 5:

```
a. √( (8/5) · ln(4·(2·5)^50 / 0.05) )                              ≈ 13.83
b. √( 2·ln(2·5·5^50) / 5 ) + √( (2/5)·ln(1/0.05) ) + 1/5           ≈ 7.05
c. Parrondo & Van den Broek (implicit)                             ε ≈ 5.10
d. Devroye (implicit)                                              ε ≈ 5.59
e. √( (16/5) · ln(2·5^50 / √0.05) )                                ≈ 16.26
```

So for small N the tightest bound is Parrondo and Van den Broek, choice 3.
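A sketch of how these numbers can be reproduced (the helper names are my own; the two implicit bounds are solved by a simple fixed-point iteration, which converges here):

```python
import math

D_VC, DELTA = 50, 0.05

def ln_m(n):
    # ln of the polynomial bound n^d_vc on the growth function
    return D_VC * math.log(n)

def fixed_point(f, eps=1.0, iters=1000):
    for _ in range(iters):        # iterate eps <- f(eps) until it settles
        eps = f(eps)
    return eps

def bounds(N):
    a = math.sqrt(8.0 / N * (math.log(4 / DELTA) + ln_m(2 * N)))                                   # Original VC
    b = (math.sqrt(2.0 * (math.log(2 * N) + ln_m(N)) / N)
         + math.sqrt(2.0 / N * math.log(1 / DELTA)) + 1.0 / N)                                     # Rademacher
    c = fixed_point(lambda e: math.sqrt((2 * e + math.log(6 / DELTA) + ln_m(2 * N)) / N))          # Parrondo
    d = fixed_point(lambda e: math.sqrt((4 * e * (1 + e) + math.log(4 / DELTA) + ln_m(N * N)) / (2.0 * N)))  # Devroye
    e = math.sqrt(16.0 / N * (math.log(2 / math.sqrt(DELTA)) + ln_m(N)))                           # Variant VC
    return a, b, c, d, e

print(bounds(10000))   # Question 4: Devroye is the smallest
print(bounds(5))       # Question 5: Parrondo & Van den Broek is the smallest
```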

In Questions 6-11, you are asked to play with the growth function or VC-dimension of some hypothesis sets.

6. What is the growth function \(m_{\mathcal{H} }(N)\) of "positive-and-negative intervals on \(\mathbb{R}\)"? The hypothesis set \(\mathcal{H}\) of "positive-and-negative intervals" contains the functions which are \(+1\) within an interval \([\ell,r]\) and −1 elsewhere, as well as the functions which are −1 within an interval \([\ell,r]\) and +1 elsewhere. For instance, the hypothesis \(h_1(x)=sign(x(x−4))\) is a negative interval with -1 within \([0, 4]\) and +1 elsewhere, and hence belongs to \(\mathcal{H}\). The hypothesis \(h_2(x)=sign((x+1)(x)(x−1))\) contains two positive intervals in \([-1, 0]\) and \([1, \infty)\) and hence does not belong to \(\mathcal{H}\).

  1. \(N^2-N+2\)

  2. \(N^2\)

  3. \(N^2+1\)

  4. \(N^2+N+2\)

  5. none of the other choices.

The statement is long, but once parsed the question is not hard: it extends "positive intervals", where now the interval may be either positive inside (negative outside) or negative inside (positive outside). Recall positive intervals first: the N points cut the line into N+1 regions; choosing two different regions for the interval endpoints gives \(C_{N+1}^2\) dichotomies, plus 1 for the all-negative dichotomy (both endpoints in the same region), so \(C_{N+1}^2+1 = \frac{1}{2} N^2+ \frac{1}{2}N+1\). For this problem the intuitive move is to simply double that count, but doubling over-counts: whenever an endpoint falls in one of the two outermost regions, the hypothesis degenerates into a ray or a constant, and the positive and negative versions start to coincide, so such cases must only be counted once. Counting directly instead: both endpoints in interior gaps (there are \(N-1\) of them), with the middle either positive or negative, gives \(2C_{N-1}^2\); exactly one endpoint in an outermost region gives a positive or a negative ray at each of the \(N-1\) interior gaps, i.e. \(C_{N-1}^1 C_2^1 = 2(N-1)\); both endpoints in outermost regions (or in the same region) gives only the two constant dichotomies, all +1 or all -1. Altogether \(m_{\mathcal{H} }(N) = (N-1)(N-2) + 2(N-1) + 2 = N^2-N+2\), choice 1.

A quicker trick: at N = 3 the growth function can be at most \(2^3 = 8\), and every other choice exceeds 8 there, which is impossible.
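A brute-force sanity check (my own sketch): enumerate the dichotomies that positive-and-negative intervals realize on a few points and compare the count with \(N^2-N+2\).

```python
from itertools import combinations

def pn_interval_dichotomies(n):
    # dichotomies realizable by positive-and-negative intervals on n (sorted) points
    points = list(range(n))                    # only the ordering of the points matters
    cuts = [i - 0.5 for i in range(n + 1)]     # candidate interval boundaries around the points
    dichos = set()
    for l, r in combinations(cuts, 2):
        for s in (+1, -1):                     # s = +1: positive inside [l, r]; s = -1: negative inside
            dichos.add(tuple(s if l <= x <= r else -s for x in points))
    dichos.add(tuple([+1] * n))                # the two constant dichotomies (degenerate intervals)
    dichos.add(tuple([-1] * n))
    return dichos

for n in range(1, 7):
    print(n, len(pn_interval_dichotomies(n)), n * n - n + 2)
```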

7. Continuing from the previous problem, what is the VC-dimension of the hypothesis set of "positive-and-negative intervals on \(\mathbb{R}\)"?

With the growth function already in hand, the answer follows easily: the VC dimension is 3. Three points can still be shattered (\(m_{\mathcal{H} }(3) = 8 = 2^3\)), but for N = 4, \(N^2-N+2 = 14 < 16 = 2^4\), so four points cannot.

8. What is the growth function \(m_{\mathcal{H} }(N)\) of "positive donuts in \(\mathbb{R}^2\)"?

The hypothesis set \(\mathcal{H}\) of "positive donuts" contains hypotheses formed by two concentric circles centered at the origin. In particular, each hypothesis is +1 within a "donut" region of \(a^2 \leq x_1^2+x_2^2 \leq b^2\) and −1 elsewhere. Without loss of generality, we assume \(0 \lt a \lt b \lt \infty\).

  1. \(N+1\)

  2. \(C_{N+1}^2+1\)

  3. \(C_{N+1}^3+1\)

  4. none of the other choices.

  5. \(C_N^2+1\)

Here we draw two circles centered at the origin; points inside the ring are +1 and the rest are -1. It looks two-dimensional, but effectively it is one-dimensional: the only thing that matters is the distance from the origin, and points at the same distance always get the same label. Map each point to its distance from the origin on a line, and the problem immediately becomes the usual positive-intervals problem, with the same answer: \(C_{N+1}^2+1\), choice 2.
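In symbols, substituting the squared radius as a one-dimensional feature makes the reduction explicit:

\[ h(\mathbf{x}) = +1 \iff a^2 \leq x_1^2 + x_2^2 \leq b^2 \iff z \in [a^2, b^2], \qquad z := x_1^2 + x_2^2, \]

so on the transformed values \(z_1, \ldots, z_N\) the hypothesis set is exactly "positive intervals", giving \(m_{\mathcal{H} }(N) = C_{N+1}^2 + 1\).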

9. Consider the "polynomial discriminant" hypothesis set of degree \(D\) on \(\mathbb{R}\), which is given by

\[ \begin{eqnarray}\mathcal{H} = \left\{ h_{\bf{c} } \; \middle| \; h_{\bf{c} }(x) = {\rm{sign} }\left(\sum_{i=0}^D c_ix^i\right) \right\}\end{eqnarray} \]

What is the VC-dimension of such an \(\mathcal{H}\)?

Isn't this just a perceptron on the transformed input \((1, x, x^2, \ldots, x^D)\)? The answer is \(D+1\).
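Explicitly, the polynomial sign can be written as a linear classifier on a transformed input:

\[ h_{\mathbf{c} }(x) = \mathrm{sign}(\mathbf{c}^T\mathbf{z}), \qquad \mathbf{z} = (1, x, x^2, \ldots, x^D), \]

a perceptron in \(D+1\) dimensions, so \(d_{vc} \leq D+1\); and \(D+1\) distinct points can be shattered (the corresponding Vandermonde system is full rank), so \(d_{vc} = D+1\).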

10. Consider the "simplified decision trees" hypothesis set on \(\mathbb{R}^d\), which is given by

\[ \begin{eqnarray}\mathcal{H}= \{h_{\mathbf{t},\mathbf{S} } \; | & \; h_{\mathbf{t},\mathbf{S} }(\mathbf{x}) = 2 [[\mathbf{v}\in S]] - 1,\text{ where} \; v_i = [[x_i>t_i]], & \\& \mathbf{S} \text{ a collection of vectors in } \{0,1\}^d,\mathbf{t} \in \mathbb{R}^d &\}\end{eqnarray} \]

That is, each hypothesis makes a prediction by first using the \(d\) thresholds \(t_i\) to locate \(\mathbf{x}\) to be within one of the \(2^d\) hyper-rectangular regions, and looking up \(\mathbf{S}\) to decide whether the region should be +1 or −1.

What is the VC-dimension of the "simplified decision trees" hypothesis set?

  1. \(2^d\)

  2. \(2^{d+1}-3\)

  3. \(\infty\)

  4. none of the other choices.

  5. \(2^{d+1}\)

I don't fully understand this question yet...

11. Consider the "triangle waves'' hypothesis set on \(\mathbb{R}\), which is given by

\[ \begin{eqnarray}\mathcal{H} = \{h_{\alpha} \; | & \; h_{\alpha}(x) = \text{sign}(| (\alpha x) \mbox{ mod } 4 - 2| - 1), \alpha \in \mathbb{R} \}\end{eqnarray} \]

Here \((z \bmod 4)\) is a number \(z - 4k\) for some integer \(k\) such that \(z - 4k \in [0, 4)\). For instance, \((11.26 \bmod 4)\) is 3.26, and \((-11.26 \bmod 4)\) is 0.74. What is the VC-dimension of such an \(\mathcal{H}\)?

  1. 1

  2. 2

  3. none of the other choices

  4. 3

This looks complicated, so take it apart step by step. First, the points lie on the real line, so we need to work out which parts of the line are labeled +1 and which are labeled -1. The label is -1 when \(|(\alpha x) \bmod 4 - 2| < 1\), which gives \((\alpha x) \bmod 4 \in (1,3)\); similarly, the label is +1 when \((\alpha x) \bmod 4 \in [0,1) \cup (3,4)\). Using the definition of mod for negative numbers given in the problem, this can be summarized as:

\[ h_{\alpha}(x) = \left \{ \begin{matrix} +1& \alpha x \in (-1+4k,1+4k) \\ -1 & \alpha x \in (1+4k,3+4k) \end{matrix} \right. \]

For N = 1 and N = 2 it is easy to check that every dichotomy can be realized, so both can be shattered.

(An example for N = 2 with \(x_1 = 0.6\), \(x_2 = 0.7\): \(\alpha = 1\) gives \([+1,+1]\); \(\alpha = \frac{9}{6}\) gives \([+1,-1]\); \(\alpha = \frac{29}{6}\) gives \([-1,+1]\); \(\alpha = \frac{29}{7}\) gives \([-1,-1]\).)

For N = 3 the points can also be shattered.

In fact the mod operation has the property \((\alpha x) \bmod 4 = [\alpha (x \bmod 4)] \bmod 4\) (at least for integer \(\alpha\)), which means that samples \(x_n\) of any magnitude can effectively be rescaled into \([0,4)\) and handled there. The answer to this question is ∞, but I am not yet sure how to prove it rigorously.
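A small numerical exploration (not a proof; the point set and the grid of \(\alpha\) values are arbitrary choices of mine): sweep \(\alpha\) and record which dichotomies the triangle wave realizes on a given point set.

```python
def h(alpha, x):
    # triangle-wave hypothesis: sign(|(alpha * x) mod 4 - 2| - 1)
    z = (alpha * x) % 4.0
    return 1 if abs(z - 2) - 1 > 0 else -1

def realized_dichotomies(points, alpha_max=200.0, steps=1_000_000):
    seen = set()
    for i in range(1, steps + 1):
        alpha = alpha_max * i / steps
        seen.add(tuple(h(alpha, x) for x in points))
    return seen

pts = [0.6, 0.7, 0.35]   # an arbitrary 3-point set to probe
found = realized_dichotomies(pts)
print(len(found), "of", 2 ** len(pts), "dichotomies found")
```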

In Questions 12-15, you are asked to verify some properties or bounds on the growth function and VC-dimension.

12. Which of the following is an upper bound of the growth function \(m_\mathcal{H}(N)\) for \(N \ge d_{ {vc} } \ge 2\)?

  1. \(m_H(⌊N/2⌋)\)

  2. \(2^{d_{vc} }\)

  3. \(\min_{1 \le i \le N-1} 2^i m_{\mathcal{H} }(N-i)\)

  4. \(\sqrt {N^{d_{vc} } }\)

  5. none of the other choices.

This question asks about upper bounds on the growth function. For the growth function we already derived the following bound in an earlier post:

\[ B(N,k) \leq \sum _{i=0} ^{k-1} C_N^i \]

where \(k = d_{vc}+1\). With this, choices 1 and 2 are easily ruled out: the growth function can keep increasing with \(N\), so neither \(m_{\mathcal{H} }(\lfloor N/2 \rfloor)\) nor the constant \(2^{d_{vc} }\) can bound \(m_{\mathcal{H} }(N)\) in general. A concrete example also rules out choice 4: for positive intervals (\(d_{vc} = 2\)), \(m_{\mathcal{H} }(6) = 22 = B(6,3) > \sqrt{6^2} = 6\).

So the answer is choice 3. To justify it: any hypothesis set satisfies \(m_{\mathcal{H} }(N) \leq 2\,m_{\mathcal{H} }(N-1)\), since each dichotomy on \(N-1\) points extends to at most two dichotomies on \(N\) points; hence \(m_{\mathcal{H} }(N) \leq 2^i m_{\mathcal{H} }(N-i)\) for every \(1 \le i \le N-1\). The same pattern shows up in the bounding-function table, \(B(N,k) = B(N-1,k-1)+B(N-1,k) \leq 2B(N-1,k) \leq 4B(N-2,k) \leq \cdots \leq 2^iB(N-i,k)\). Since every \(2^i m_{\mathcal{H} }(N-i)\) is an upper bound, taking the minimum over \(i\) still gives an upper bound, and it is the tightest of them.

13. Which of the following is not a possible growth function \(m_{\mathcal{H} }(N)\) for some hypothesis set?

  1. \(2^N\)

  2. \(2^{⌊ \sqrt {N} ⌋}\)

  3. 1

  4. \(N^2 -N +2\)

  5. none of the other choices.

The answer is choice 2. We have already met choices 1 and 4 as growth functions (\(2^N\) for a set that shatters everything, \(N^2-N+2\) from Question 6), and choice 3 is also easy: a hypothesis set containing a single hypothesis, say the one labeling every point +1, has \(m_{\mathcal{H} }(N) = 1\). Why can choice 2 not be a growth function? At \(N=1\) it gives \(2^1 = 2\), but at \(N=2\) it gives only \(2 < 2^2 = 4\), so 2 would be a break point. By the bounding function, any growth function with break point 2 satisfies \(m_{\mathcal{H} }(N) \leq B(N,2) = N+1\) for every \(N\); yet \(2^{\lfloor \sqrt N \rfloor}\) eventually exceeds \(N+1\) (e.g. at \(N = 100\), \(2^{10} = 1024 > 101\)). This contradiction shows it cannot be a growth function.

14. For hypothesis sets \(\mathcal{H}_1, \mathcal{H}_2, ..., \mathcal{H}_K\) with finite, positive VC-dimensions \(d_{vc}(\mathcal{H}_k)\), some of the following bounds are correct and some are not.

Which among the correct ones is the tightest bound on \(d_{ {vc} }(\bigcap_{k=1}^{K}\!\mathcal{H}_k)\), the VC-dimension of the \(\bf{intersection}\) of the sets?

(The VC-dimension of an empty set or a singleton set is taken as zero.)

We have K hypothesis sets, each with its own VC dimension; the question asks for the tightest correct bound on the VC dimension of the set formed by their intersection.

  1. \(0 \le d_{vc}(\bigcap_{k=1}^{K}\mathcal{H}_k) \le \sum_{k=1}^{K} d_{vc}(\mathcal{H}_k)\)

  2. \(0 \le d_{vc}(\bigcap_{k=1}^{K}\mathcal{H}_k) \le \min\{d_{vc}(\mathcal{H}_k)\}_{k=1}^{K}\)

  3. \(0 \le d_{vc}(\bigcap_{k=1}^{K}\mathcal{H}_k) \le \max\{d_{vc}(\mathcal{H}_k)\}_{k=1}^{K}\)

  4. \(\min\{d_{vc}(\mathcal{H}_k)\}_{k=1}^{K} \le d_{vc}(\bigcap_{k=1}^{K}\mathcal{H}_k) \le \max\{d_{vc}(\mathcal{H}_k)\}_{k=1}^{K}\)

  5. \(\min\{d_{vc}(\mathcal{H}_k)\}_{k=1}^{K} \le d_{vc}(\bigcap_{k=1}^{K}\mathcal{H}_k) \le \sum_{k=1}^{K} d_{vc}(\mathcal{H}_k)\)

If the intersection is empty (or a singleton), its VC dimension is 0, which gives the lower bound. For the upper bound, the intersection is contained in every \(\mathcal{H}_k\), so it can never shatter more points than any individual \(\mathcal{H}_k\); its VC dimension is therefore at most the minimum of the \(d_{vc}(\mathcal{H}_k)\). The answer is choice 2.

15. For hypothesis sets \(\mathcal{H}_1, \mathcal{H}_2, ..., \mathcal{H}_K\) with finite, positive VC-dimensions \(d_{vc}(\mathcal{H}_k)\), some of the following bounds are correct and some are not.

Which among the correct ones is the tightest bound on \(d_{ {vc} }(\bigcup_{k=1}^{K}\!\mathcal{H}_k)\), the VC-dimension of the \(\bf{union}\) of the sets?

  1. \(0 \le d_{vc}(\bigcup_{k=1}^{K}\mathcal{H}_k) \le K-1+\sum_{k=1}^{K} d_{vc}(\mathcal{H}_k)\)

  2. \(\min\{d_{vc}(\mathcal{H}_k)\}_{k=1}^{K} \le d_{vc}(\bigcup_{k=1}^{K}\mathcal{H}_k) \le \sum_{k=1}^{K} d_{vc}(\mathcal{H}_k)\)

  3. \(\max\{d_{vc}(\mathcal{H}_k)\}_{k=1}^{K} \le d_{vc}(\bigcup_{k=1}^{K}\mathcal{H}_k) \le \sum_{k=1}^{K} d_{vc}(\mathcal{H}_k)\)

  4. \(\max\{d_{vc}(\mathcal{H}_k)\}_{k=1}^{K} \le d_{vc}(\bigcup_{k=1}^{K}\mathcal{H}_k) \le K-1+\sum_{k=1}^{K} d_{vc}(\mathcal{H}_k)\)

  5. \(0 \le d_{vc}(\bigcup_{k=1}^{K}\mathcal{H}_k) \le \sum_{k=1}^{K} d_{vc}(\mathcal{H}_k)\)

This is the mirror image of the previous question. The union contains every \(\mathcal{H}_k\), so its VC dimension is at least the largest individual one; that rules out choices 1, 2 and 5, whose lower bounds are looser, leaving a choice between 3 and 4. Now imagine \(\mathcal{H}_1\) containing only the hypothesis that labels every point +1 and \(\mathcal{H}_2\) only the hypothesis that labels every point -1: each has VC dimension 0, but their union has VC dimension 1, which already exceeds \(\sum_k d_{vc}(\mathcal{H}_k) = 0\). So the upper bound in choice 3 is simply wrong, and the answer is choice 4. Proving the general upper bound \(K-1+\sum_k d_{vc}(\mathcal{H}_k)\) requires a more careful mathematical argument than I give here.

For Questions 16-20, you will play with the decision stump algorithm.

Questions 16-20 are again programming problems.

16. In class, we taught about the learning model of "positive and negative rays" (which is simply one-dimensional perceptron) for one-dimensional data. The model contains hypotheses of the form:

\[ h_{s, \theta}(x) = s \cdot \mbox{sign}(x - \theta). \]

The model is frequently named the "decision stump'' model and is one of the simplest learning models. As shown in class, for one-dimensional data, the VC dimension of the decision stump model is 2.

In fact, the decision stump model is one of the few models for which we can easily minimize \(E_{in}\) by enumerating all possible thresholds. In particular, for \(N\) examples, there are at most \(2N\) dichotomies (see page 22 of lecture 5 slides), and thus at most \(2N\) different \(E_{in}\) values. We can then easily choose the dichotomy that leads to the lowest \(E_{in}\), where ties can be broken by randomly choosing among the lowest \(E_{in}\) ones. The chosen dichotomy stands for a combination of some "spot" (range of \(\theta\)) and \(s\), and commonly the median of the range is chosen as the \(\theta\) that realizes the dichotomy.

In this problem, you are asked to implement such an algorithm and run your program on an artificial data set. First of all, start by generating one-dimensional data by the procedure below:

  1. Generate \(x\) by a uniform distribution in \([-1, 1]\).

  2. Generate \(y\) by \(f(x) = \tilde{s}(x) + \text{noise}\), where \(\tilde{s}(x) = \mathrm{sign}(x)\) and the noise flips the result with \(20\%\) probability.

For any decision stump \(h_{s, \theta}\) with \(\theta \in [-1, 1]\), express \(E_{out}(h_{s, \theta})\) as a function of \(\theta\) and \(s\).

  1. \(0.3+0.5s(|\theta| - 1)\)

  2. \(0.3+0.5s(1 - |\theta|)\)

  3. \(0.5+0.3s(|\theta| - 1)\)

  4. \(0.5+0.3s(1 - |\theta|)\)

  5. none of the other choices.

Although this is in the programming set, this particular question needs no code; it is a theoretical analysis. The data are generated as \(\mathrm{sign}(x)\) plus noise, where the noise flips the label with probability 20%. Without noise, the stump \(h_{s,\theta}\) with \(s = +1\) disagrees with \(\mathrm{sign}(x)\) exactly on the interval between 0 and \(\theta\), so its error rate is \(\frac {|\theta|} 2\).

By the analysis from Question 1, \(E_{out} = \frac {|\theta|} 2 \times (1 - 0.2) + (1 - \frac {|\theta|} 2) \times 0.2 = 0.3 |\theta| + 0.2\). At first glance this does not match any choice, but that is because we have ignored the sign \(s\): when \(s = -1\) the noiseless error rate becomes \(1 - \frac{|\theta|}{2}\), giving \(0.8 - 0.3 |\theta|\). Both cases combine into \(0.5 + 0.3s(|\theta| - 1)\), so the answer is choice 3.
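Checking that the two signs collapse into a single formula:

\[ s=+1:\;\; 0.2 + 0.3|\theta| = 0.5 + 0.3(|\theta|-1), \qquad s=-1:\;\; 0.8 - 0.3|\theta| = 0.5 - 0.3(|\theta|-1), \]

so in both cases \(E_{out}(h_{s,\theta}) = 0.5 + 0.3s(|\theta|-1)\), which is choice 3.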

17. Generate a data set of size 20 by the procedure above and run the one-dimensional decision stump algorithm on the data set. Record \(E_{in}\) and compute \(E_{out}\) with the formula above. Repeat the experiment (including data generation, running the decision stump algorithm, and computing \(E_{in}\) and \(E_{out}\)) 5,000 times. What is the average \(E_{in}\)? Please choose the closest option.

  1. 0.05

  2. 0.15

  3. 0.25

  4. 0.35

  5. 0.45

This one does need code. First, generate the data and the noise: the code below generates 20 points with \(x\) uniform in \([-1, 1]\) and flips each label with probability 0.2.

```python
import random

def sign(x):
    if x <= 0:
        return -1
    else:
        return 1

def generateXY():
    # 20 points with x uniform in [-1, 1]; label is sign(x), flipped with probability 0.2
    x = []
    for i in range(0, 20):
        x.append([random.random() * 2 - 1])
    noise = 0
    for i in range(0, 20):
        ran = random.random()
        if ran <= 0.2:
            noise += 1
            x[i].append(-sign(x[i][0]))
        else:
            x[i].append(sign(x[i][0]))
    return x
```

Next, the algorithm itself. It is simple: we just enumerate all candidate thresholds. To keep things short I did not handle the \(s = -1\) case here, since with this target it would almost never be selected anyway.

Sort the random data first, then for each gap count the misclassified points on both sides. When gap \(i\) is chosen, the threshold region is \((d[i-1], d[i])\): points before \(d[i]\) are predicted -1, and points from \(d[i]\) onward are predicted +1, which simplifies the bookkeeping. Note that \(i\) runs up to \(len(d)\), since there are \(len(d)+1\) gaps.

```python
def decision_stump(dataset):
    sort_d = sorted(dataset)
    min_pos = []
    err = 0
    min_err = len(dataset)

    for i in range(0, len(dataset) + 1):
        # predict -1 before gap i and +1 from gap i onward (s = +1 only)
        for k in range(0, i):
            if sort_d[k][1] > 0:
                err += 1
        for k in range(i, len(dataset)):
            if sort_d[k][1] < 0:
                err += 1
        if err < min_err:
            min_pos = []
            min_pos.append(i)
            min_err = err
        elif err == min_err:
            min_pos.append(i)
        err = 0
    # choose among the lowest-Ein gaps randomly
    choosen = int(len(min_pos) * random.random())
    if min_pos[choosen] < len(sort_d):
        return [sort_d[min_pos[choosen]][0], min_err]
    else:
        return [(sort_d[min_pos[choosen] - 1][0] + 1) / 2, min_err]
```

Result:

```
average Ein: 0.1713600000000006
```

So the closest option is 0.15, choice 2.

18. Continuing from the previous question, what is the average \(E_{out}\)? Please choose the closest option.

  1. 0.05

  2. 0.15

  3. 0.25

  4. 0.35

  5. 0.45

For \(E_{out}\), simply plug the returned \(\theta\) (with \(s = +1\)) into the formula from Question 16. The result:

```
average Eout: 0.25962811866336116
```
So the answer is choice 3 (0.25).
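A sketch of the experiment loop for Questions 17-18, assuming the generateXY and decision_stump functions above (only the \(s = +1\) case is implemented, so the \(E_{out}\) formula is used with \(s = +1\)):

```python
def run_experiment(trials=5000):
    total_ein, total_eout = 0.0, 0.0
    for _ in range(trials):
        data = generateXY()
        theta, err = decision_stump(data)
        total_ein += err / 20.0                         # 20 examples per data set
        total_eout += 0.5 + 0.3 * (abs(theta) - 1)      # E_out formula from Question 16, s = +1
    print("average Ein: ", total_ein / trials)
    print("average Eout:", total_eout / trials)

run_experiment()
```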

19. Decision stumps can also work for multi-dimensional data. In particular, each decision stump now deals with a specific dimension \(i\), as shown below. \[ h_{s, i, \theta}(\mathbf{x}) = s \cdot \mbox{sign}(x_i - \theta). \] Implement the following decision stump algorithm for multi-dimensional data:

  1. for each dimension \(i = 1, 2, \cdots, d\), find the best decision stump \(h_{s, i, \theta}\) using the one-dimensional decision stump algorithm that you have just implemented.

  2. return the "best of best" decision stump in terms of \(E_{in}\). If there is a tie, please randomly choose among the lowest-\(E_{in}\) ones.

The training data \(\mathcal{D}_{train}\) is available at:

https://www.csie.ntu.edu.tw/~htlin/mooc/datasets/mlfound_math/hw2_train.dat

The testing data \(\mathcal{D}_{test}\) is available at:

https://www.csie.ntu.edu.tw/~htlin/mooc/datasets/mlfound_math/hw2_test.dat

Run the algorithm on \(\mathcal{D}_{train}\). Report the \(E_{\text{in} }\) of the optimal decision stump returned by your program. Choose the closest option.

Here we apply the earlier algorithm to multi-dimensional data, in two steps: first, run the one-dimensional algorithm on each dimension to get its best \(E_{in}\); second, pick the best dimension overall.

This matches a situation that can happen in practice: one feature does most of the work while the rest matter little.

The algorithm is essentially the same as before, but this time we know nothing about the true \(\theta\) and \(s\), so the \(s = -1\) case can no longer be ignored. The change is simple: the number of errors for \(s = -1\) is just the total number of samples minus the number of errors for \(s = +1\).

The revised algorithm:

```python
def decision_stump(dataset):
    sort_d = sorted(dataset)
    min_pos = []
    err = 0
    isNeg = False
    min_err = len(dataset)
    size = len(dataset)
    for i in range(0, len(dataset) + 1):
        # s = +1: predict -1 before gap i and +1 from gap i onward
        for k in range(0, i):
            if sort_d[k][1] > 0:
                err += 1
        for k in range(i, len(dataset)):
            if sort_d[k][1] < 0:
                err += 1
        isNeg = False
        if err < min_err:
            min_pos = []
            min_pos.append([i, isNeg])
            min_err = err
        elif err == min_err:
            min_pos.append([i, isNeg])
        # s = -1: simply the complement, size - err errors
        isNeg = True
        if (size - err) < min_err:
            min_pos = []
            min_pos.append([i, isNeg])
            min_err = size - err
        elif (size - err) == min_err:
            min_pos.append([i, isNeg])
        err = 0
    # choose among the lowest-Ein candidates randomly
    choosen = int(len(min_pos) * random.random())
    if min_pos[choosen][0] < len(sort_d):
        return [sort_d[min_pos[choosen][0]][0], min_err, min_pos[choosen][1]]
    else:
        return [(sort_d[min_pos[choosen][0] - 1][0] + 1) / 2, min_err, min_pos[choosen][1]]
```

We added an isNeg flag to record whether \(s = -1\).

The multi-dimensional algorithm then just runs this on every dimension and picks the dimension and \(\theta\) with the smallest error:

```python
def multiDDecision_stump(dataset):
    # dataset is organised per dimension: dataset[i] is a list of [x_i, y] pairs
    min_err_d = []
    min_err = 0x7fffffff
    err = 0
    for i in range(len(dataset)):
        temp = decision_stump(dataset[i])
        err = temp[1]
        if err < min_err:
            min_err = err
            min_err_d = []
            min_err_d.append([temp[0], i, min_err, temp[2]])
        elif err == min_err:
            min_err_d.append([temp[0], i, min_err, temp[2]])
    choosen = int(random.random() * len(min_err_d))
    return min_err_d[choosen]
```

The data for this question come from the course, so we also write a function to read them in:

```python
def readDataFrom(filename):
    # returns the data organised per dimension: result[i] is a list of [x_i, y] pairs
    result = []
    with open(filename) as f:
        line = f.readline()[1:-1]
        while line:
            temp = line.split(' ')
            if len(result) == 0:
                for x_i in range(len(temp) - 1):
                    result.append([[float(temp[x_i]), float(temp[-1])]])
            else:
                for x_i in range(len(temp) - 1):
                    result[x_i].append([float(temp[x_i]), float(temp[-1])])
            line = f.readline()[1:-1]
    return result
```
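Putting the pieces together, a possible driver (a sketch; it assumes hw2_train.dat has been downloaded next to the script, and the printout format is my own):

```python
train = readDataFrom('hw2_train.dat')
best = multiDDecision_stump(train)      # [theta, dimension index (0-based), min_err, isNeg]
n = len(train[0])                       # number of training examples
print("dimension:", best[1])
print("theta:", best[0])
print("Ein:", best[2] / n)
```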

The final result:

```
dimension: 3
theta: 1.774
Ein: 0.25
```

20. Use the returned decision stump to predict the label of each example within \(\mathcal{D}_{test}\). Report an estimate of \(E_{\text{out} }\) by \(E_{\text{test} }\). Please choose the closest option.

Use the provided test data to estimate \(E_{out}\) by \(E_{test}\); this needs a function that counts the errors:

```python
def checkout(min_err_d, dataset):
    # count the errors of the chosen stump (theta = min_err_d[0], dimension = min_err_d[1]) on dataset
    err = 0
    for i in dataset[min_err_d[1]]:
        if sign(i[0] - min_err_d[0]) != sign(i[1]):
            err += 1
    if min_err_d[3] == True:
        # s = -1: the predictions are flipped, so the error count is the complement
        err = len(dataset[0]) - err
    return err
```
Final result:
```
Eout: 0.36
```
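The corresponding test-side driver (again a sketch, assuming hw2_test.dat is available locally and best, train from the previous step):

```python
test = readDataFrom('hw2_test.dat')
print("Eout:", checkout(best, test) / len(test[0]))
```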
P.S. Questions 10, 11 and 15 are left with open doubts.