1
00:00:09,969 --> 00:00:15,889
Hello and welcome back to our discussion on
Support Vector Machines.
2
00:00:15,889 --> 00:00:22,009
So, we were looking at the optimization problem
corresponding to the optimal separating
3
00:00:22,009 --> 00:00:28,619
hyperplane. So, one of the techniques for
solving these kinds
4
00:00:28,619 --> 00:00:38,860
of constrained optimization problems is to
set up a Lagrangian, which essentially looks
5
00:00:38,860 --> 00:01:07,650
at the original objective function, which
is half beta squared, and a second component
6
00:01:07,650 --> 00:01:14,330
corresponding to the constraints that we have.
So, if you look at this quantity in the square
7
00:01:14,330 --> 00:01:19,560
brackets here, so you can see that this is
the term on the left hand side of the inequality
8
00:01:19,560 --> 00:01:24,780
and that is the term on the right hand side
of the inequality and we really want to make
9
00:01:24,780 --> 00:01:37,130
sure that this difference is not negative.
If this difference is negative, then that
10
00:01:37,130 --> 00:01:44,470
would mean that y i times x i transpose beta
plus beta naught is actually less than 1.
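In symbols, the constraint being discussed is the following (a reconstruction from the narration, since the board is not visible):

```latex
y_i \left( x_i^{\top} \beta + \beta_0 \right) \ge 1
\quad \Longleftrightarrow \quad
y_i \left( x_i^{\top} \beta + \beta_0 \right) - 1 \ge 0 ,
```

so the "difference" in the square brackets is the left-hand side minus the right-hand side, and keeping it non-negative is exactly the margin constraint.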
11
00:01:44,470 --> 00:01:50,130
So, we do not want this to be negative, so
what we essentially do is add this term
12
00:01:50,130 --> 00:01:56,080
here with a minus sign.
So, this essentially means that, when I minimize
13
00:01:56,080 --> 00:02:02,200
this whole expression, this term will become
as large, that is, as positive,
14
00:02:02,200 --> 00:02:07,990
as possible. So, that essentially means that
I will try to make this as much larger
15
00:02:07,990 --> 00:02:17,980
than 1 as possible. So, this term here, alpha
i, lets me control how much weight I want
16
00:02:17,980 --> 00:02:24,880
to give to satisfying the constraints versus
how much I really want to minimize the objective
17
00:02:24,880 --> 00:02:29,920
function. So, we really need to satisfy the
constraints as much as possible, since
18
00:02:29,920 --> 00:02:35,660
there are solutions that will satisfy the
constraints and give you a good optimum.
19
00:02:35,660 --> 00:02:42,959
So, we should essentially be trying to drive
this term to as large a value as possible.
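Putting the pieces described so far together, the primal Lagrangian (reconstructed in the lecture's notation, since the board is not visible) is:

```latex
L_P \;=\; \frac{1}{2} \lVert \beta \rVert^2
\;-\; \sum_{i=1}^{N} \alpha_i \left[ y_i \left( x_i^{\top} \beta + \beta_0 \right) - 1 \right],
\qquad \alpha_i \ge 0 .
```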
20
00:02:42,959 --> 00:02:49,390
So, this is called the primal of the problem
and your goal is to minimize the primal. So,
21
00:02:49,390 --> 00:02:58,491
I am going to do something fairly technical
right now. So, if you do not understand all
22
00:02:58,491 --> 00:03:03,550
of it on the first go, that is fine; you
might have to do a little bit more reading
23
00:03:03,550 --> 00:03:09,110
on this, but this will essentially give
you an idea of how people go about solving
24
00:03:09,110 --> 00:03:13,790
these kinds of problems.
So, we are going to try and create what is
25
00:03:13,790 --> 00:03:28,670
called the dual of this primal objective
function. So, the dual is a way to create
26
00:03:28,670 --> 00:03:36,040
an optimization problem
that is simpler to solve, in some sense, than
27
00:03:36,040 --> 00:03:46,260
the primal; and the dual at all points provides
you some kind of a lower bound on the kind
28
00:03:46,260 --> 00:03:51,920
of solutions that you can achieve with the
primal; and at the optimum of the dual, you
29
00:03:51,920 --> 00:03:56,519
would ideally like the optimum of the primal
also to be achieved.
30
00:03:56,519 --> 00:04:01,580
So, we are going to create a problem called
the dual, we are going to solve the dual and
31
00:04:01,580 --> 00:04:05,989
when we reach the optimum of the dual, you
would like the optimum of the primal to be
32
00:04:05,989 --> 00:04:09,800
also achieved. The same solution that gives
you the optimum in the dual problem should
33
00:04:09,800 --> 00:04:14,940
give you the optimum in the primal problem,
and there are technical conditions under which
34
00:04:14,940 --> 00:04:20,549
this is satisfied. We are not going to
go into the technical conditions; this is
35
00:04:20,549 --> 00:04:25,060
just going to give you a flavor of the kind of results
that we will be looking at.
36
00:04:25,060 --> 00:04:43,169
So, let us start by
37
00:04:43,169 --> 00:04:54,870
setting the derivatives of L P to 0, the derivatives
with respect to beta and beta naught. So,
38
00:04:54,870 --> 00:05:00,770
taking the derivative with respect to beta
and setting it to 0 and solving for beta gives
39
00:05:00,770 --> 00:05:16,470
me… So, you can figure that out with a little
bit of algebra here; and likewise, setting the
40
00:05:16,470 --> 00:05:54,910
derivative with respect to beta naught to
0 and solving gives me this. So, you can substitute
41
00:05:54,910 --> 00:06:07,210
these back into the primal problem and do
a lot of algebra, really a lot of algebra,
42
00:06:07,210 --> 00:06:12,729
and then I can simplify this and I will get
what is known as the dual; we write the dual
43
00:06:12,729 --> 00:06:24,440
here. So, this is just really obtained by
substituting your beta into the expressions
44
00:06:24,440 --> 00:06:30,660
here and then using the fact that the sum of
alpha i y i will be 0 at the optimum.
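The two stationarity conditions and the resulting dual objective, reconstructed from this description (the board itself is not visible), are:

```latex
\frac{\partial L_P}{\partial \beta} = 0
  \;\Rightarrow\; \beta = \sum_{i=1}^{N} \alpha_i y_i x_i ,
\qquad
\frac{\partial L_P}{\partial \beta_0} = 0
  \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0 ,
```

and substituting these back into the primal gives

```latex
L_D = \sum_{i=1}^{N} \alpha_i
      - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
        \alpha_i \alpha_j \, y_i y_j \, x_i^{\top} x_j .
```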
45
00:06:30,660 --> 00:06:50,930
So, that is the dual, but then it is subject
to constraints. So, note that, as I said,
46
00:06:50,930 --> 00:06:56,820
your dual is always going to give you a lower
bound on the solution of the primal problem.
47
00:06:56,820 --> 00:07:02,360
So, really, if you are minimizing the objective
in your primal, you should be maximizing the
48
00:07:02,360 --> 00:07:07,130
objective in the dual, so that the two of them
can coincide at some point. So, essentially
49
00:07:07,130 --> 00:07:15,949
you would be maximizing this subject to the
constraint that all your alpha i’s are
50
00:07:15,949 --> 00:07:16,949
greater than or equal to 0.
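As a sketch of what solving this dual looks like numerically, here is a hypothetical toy example using scipy's general-purpose SLSQP solver (not a specialized SVM tool); it maximizes the dual subject to alpha i greater than or equal to 0, together with the equality constraint, sum of alpha i y i equal to 0, obtained from the beta naught derivative earlier:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: two linearly separable classes in 2D.
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [0.5, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Gram matrix weighted by the labels: H[i, j] = y_i y_j x_i^T x_j.
H = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(alpha):
    # Negative of L_D = sum(alpha) - 0.5 * alpha^T H alpha (we minimize).
    return 0.5 * alpha @ H @ alpha - alpha.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ y}]  # sum alpha_i y_i = 0
bounds = [(0.0, None)] * n                              # alpha_i >= 0

res = minimize(neg_dual, np.zeros(n), bounds=bounds, constraints=constraints)
alpha = res.x

# Recover beta from the stationarity condition beta = sum alpha_i y_i x_i.
beta = (alpha * y) @ X

# Support vectors are the points with alpha_i > 0; beta naught is the
# average of y_i - x_i^T beta over those points, which follows from
# y_i (x_i^T beta + beta_0) = 1 together with y_i^2 = 1.
sv = alpha > 1e-6
beta0 = np.mean(y[sv] - X[sv] @ beta)

print("alpha:", alpha.round(3))
print("beta:", beta.round(3), "beta0:", round(beta0, 3))
print("margins:", (y * (X @ beta + beta0)).round(3))  # ~1 at support vectors
```

Only the margin points end up with nonzero alpha here; the other two points could be moved around (as long as they stay outside the margin) without changing the answer.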
51
00:07:16,949 --> 00:07:23,340
So, if you think about it, this is a much
easier constraint to wrap our heads around,
52
00:07:23,340 --> 00:07:30,440
because it just says that you will only be
working in the positive coordinates,
53
00:07:30,440 --> 00:07:36,350
while the primal had a more complex set of constraints.
So, you kind of reduce the constraints to
54
00:07:36,350 --> 00:07:42,479
something easier, and therefore the dual problem
is sometimes easier to solve. So, for the
55
00:07:42,479 --> 00:07:48,229
dual and the primal to be at their optima at the
same time, you really want them to satisfy
56
00:07:48,229 --> 00:07:56,110
a set of conditions, which essentially have to
do with the derivatives of the primal problem.
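In standard form, these conditions, the KKT conditions about to be written as 1, 2, 3 (a reconstruction, since the board is not visible, and which equation gets which number is an assumption here), are:

```latex
\text{(1)}\;\; \beta = \sum_{i=1}^{N} \alpha_i y_i x_i ,
\qquad
\text{(2)}\;\; \sum_{i=1}^{N} \alpha_i y_i = 0 ,
\qquad
\text{(3)}\;\; \alpha_i \left[ y_i \left( x_i^{\top} \beta + \beta_0 \right) - 1 \right] = 0
\;\; \forall i ,
```

together with feasibility: \(\alpha_i \ge 0\) and \(y_i ( x_i^{\top}\beta + \beta_0 ) \ge 1\) for all \(i\). Condition (3), complementary slackness, is the one used in the discussion that follows.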
57
00:07:56,110 --> 00:08:05,510
So, we require that this should hold, we
require that this should hold; we write them
58
00:08:05,510 --> 00:08:31,650
as 1, 2, 3. In addition, you require that
this condition
59
00:08:31,650 --> 00:08:57,509
should also be satisfied; these
are called the KKT, or Karush-Kuhn-Tucker,
60
00:08:57,509 --> 00:09:07,670
conditions. And so, for the two optimization problems
to have the same solution, we require that
61
00:09:07,670 --> 00:09:19,220
the KKT conditions should be satisfied.
So, once you have an optimal solution for
62
00:09:19,220 --> 00:09:24,350
the dual and the primal problem, because these
KKT conditions have to be satisfied, you can
63
00:09:24,350 --> 00:09:33,019
make certain observations, especially
working from condition 3 here. So, if alpha
64
00:09:33,019 --> 00:09:46,019
i is greater
than 0, so what does that mean? So, this has
65
00:09:46,019 --> 00:10:06,579
to be equal to 0; then the term in the square
bracket has to be equal to 0. That means,
66
00:10:06,579 --> 00:10:13,410
y i times x i transpose beta plus beta naught
should be 1. So, what does that mean?
67
00:10:13,410 --> 00:10:21,709
It means that the point is exactly on the edge of
the margin when it is equal to 1, because
68
00:10:21,709 --> 00:10:25,319
greater than or equal to 1 is what we needed
to satisfy; so when it is equal to 1, that
69
00:10:25,319 --> 00:11:03,410
means it is exactly on the margin. Likewise,
if the quantity in the square bracket
70
00:11:03,410 --> 00:11:16,269
is greater than 0, then
alpha i has to be 0. But what that essentially
71
00:11:16,269 --> 00:11:23,579
means is, if your data point is one
that is far away from the hyperplane, let
72
00:11:23,579 --> 00:11:30,199
us say more than the margin away from the
hyperplane, then the corresponding alpha
73
00:11:30,199 --> 00:11:39,300
i will become 0.
So, what does this mean for us? If you
74
00:11:39,300 --> 00:11:45,489
think about it, the solution that we get,
which is essentially beta,
75
00:11:45,489 --> 00:11:54,129
is formed by summing the
products of alpha i, y i and x i. So, saying
76
00:11:54,129 --> 00:12:02,269
that alpha i is going to be 0 essentially
means that the corresponding x i has no role
77
00:12:02,269 --> 00:12:14,329
to play in determining what my beta should
be. If alpha i is 0, that
78
00:12:14,329 --> 00:12:28,869
implies that x i has
79
00:12:28,869 --> 00:12:40,610
no role in computing beta.
So, which are the data points that will
80
00:12:40,610 --> 00:12:50,689
actually affect the solution beta? Exactly
those points for which y i times x i transpose
81
00:12:50,689 --> 00:12:59,279
beta plus beta naught is 1; that means these
are exactly the points which lie on the margin.
82
00:12:59,279 --> 00:13:16,579
So, only these points will influence how
the solution beta looks, and all the other
83
00:13:16,579 --> 00:13:22,360
data points that we have, which are farther
away from the separating hyperplane,
84
00:13:22,360 --> 00:13:37,470
these points do not matter in the solution.
So, these points are called
85
00:13:37,470 --> 00:13:44,790
support vectors.
So, you don’t really have to solve this
86
00:13:44,790 --> 00:13:49,989
optimization problem yourself; there are enough
tools that can actually do it for you. The
87
00:13:49,989 --> 00:13:54,609
whole goal of this lecture is to get you to
appreciate what it is that you are doing
88
00:13:54,609 --> 00:13:59,290
when you are using a support vector machine
for solving a problem. So, at the end of the
89
00:13:59,290 --> 00:14:05,779
day, all we are going to do is fire up a tool
that is going to tell you what the separating
90
00:14:05,779 --> 00:14:11,619
hyperplane is, given a bunch of data. But it
is good to have an appreciation of how the
91
00:14:11,619 --> 00:14:24,220
classifier is actually built.
So, once you figure out the beta, then you can
92
00:14:24,220 --> 00:14:31,499
substitute that into
the third KKT condition here and solve
93
00:14:31,499 --> 00:14:38,980
for beta naught. So, typically what you do
is that you use every x i that is a support
94
00:14:38,980 --> 00:14:44,100
vector and you substitute that here and then,
try to solve for beta naught and typically
95
00:14:44,100 --> 00:14:51,259
end up taking the average value of that. So,
a couple of things which I want to point
96
00:14:51,259 --> 00:14:57,239
out about support vector machines.
So, one thing we should be very clear about is that
97
00:14:57,239 --> 00:15:02,829
none of the training data
will fall within the margin; but that is
98
00:15:02,829 --> 00:15:07,369
not to say that the test data will not
fall within the margin. The test data might
99
00:15:07,369 --> 00:15:11,129
fall within the margin; it might actually fall
on the other side of the hyperplane. So,
100
00:15:11,129 --> 00:15:16,279
for all we know, there could
be errors on the test data; it is just that on the
101
00:15:16,279 --> 00:15:26,149
training data, it tries to find something that
is as far away as possible from the data points.
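As an illustration of "firing up a tool", here is a hypothetical sketch using scikit-learn (one such tool, not necessarily the one the lecture has in mind); a very large C approximates the hard-margin formulation discussed here:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable classes.
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [0.5, -1.0]])
y = np.array([1, 1, -1, -1])

# kernel="linear" fits a separating hyperplane; a huge C approximates the
# hard-margin SVM, where no training point is allowed inside the margin.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

beta = clf.coef_[0]        # the beta of the lecture's notation
beta0 = clf.intercept_[0]  # the beta naught
print("beta:", beta, "beta0:", beta0)

# Only the points on the margin survive as support vectors; the other
# points could be moved around (outside the margin) without changing them.
print("support vectors:\n", clf.support_vectors_)
```

The fitted `coef_` and `intercept_` play the roles of beta and beta naught in the lecture's formulation, and `support_vectors_` lists the points with nonzero alpha i.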
102
00:15:26,149 --> 00:15:34,480
So, the idea here is that if I give as
much gap between the classes as possible,
103
00:15:34,480 --> 00:15:40,829
then the classifier would be more robust to noise
on either side. So, this is assuming that
104
00:15:40,829 --> 00:15:44,579
the noise could be in this class or in this
class. If you know for sure that one class
105
00:15:44,579 --> 00:15:48,759
is noisier than the other or if one class
is more valuable than the other, then you might
106
00:15:48,759 --> 00:15:54,350
want to actually modify your objective, so
that the line does not go right in the middle,
107
00:15:54,350 --> 00:15:59,489
but goes to one side or the other.
So, having said that, if the assumptions
108
00:15:59,489 --> 00:16:05,160
of the support vector machine
hold good, then it is a very, very robust classifier.
109
00:16:05,160 --> 00:16:12,639
So, the reason is it pays attention only to
the points that are closest to the class boundary.
110
00:16:12,639 --> 00:16:16,680
So, you know, I can have as many data points
here as I want, and as many data points
111
00:16:16,680 --> 00:16:24,329
here as I want, of the corresponding class;
it does not affect my classification,
112
00:16:24,329 --> 00:16:29,729
because truly the ones that are close to the
boundary are the ones that need attention.
113
00:16:29,729 --> 00:16:40,329
So, that essentially makes support vector
machines more robust. On the other hand,
114
00:16:40,329 --> 00:16:47,079
you are going to have some kind of stochastic
process that is generating the data.
115
00:16:47,079 --> 00:16:53,919
So, if there are a few data points that,
by chance or noise, happen to be
116
00:16:53,919 --> 00:16:58,860
close to the hyperplane, that will affect
the support vector machine tremendously.
117
00:16:58,860 --> 00:17:05,890
And therefore, it will try to reduce the margin
by a large extent, while a classifier that looks
118
00:17:05,890 --> 00:17:11,230
at the entire data and tries to find the distribution
for the entire data might be a little bit
119
00:17:11,230 --> 00:17:20,519
more robust to these kinds of noise. So, this
is how you solve the basic optimization
120
00:17:20,519 --> 00:17:22,000
problem for support vector machines.