1
00:00:09,650 --> 00:00:12,599
Hello and welcome to this module on Support
Vector Machines.
2
00:00:12,599 --> 00:00:18,090
So, we have been looking at a variety of
classifiers so far and one of the things, let
3
00:00:18,090 --> 00:00:23,110
us look at the linear classifier. So, one
of the things is, if I have data points
4
00:00:23,110 --> 00:00:28,700
that are even perfectly separable, here is
a class and here is another class, you can
5
00:00:28,700 --> 00:00:34,710
see that they are very clearly separated.
But, when I train a linear classifier it is
6
00:00:34,710 --> 00:00:44,930
not entirely clear, which of these many possible
lines that could separate the data, would
7
00:00:44,930 --> 00:00:49,549
your classifier end up learning. There are
many, many different lines that could separate
8
00:00:49,549 --> 00:00:54,120
the data and we are not sure, what your classifier
would end up learning.
9
00:00:54,120 --> 00:01:03,960
So, support vector machines were initially
born out of a need to answer this question.
10
00:01:03,960 --> 00:01:11,680
Among all of these different lines or all
of these different decision surfaces that
11
00:01:11,680 --> 00:01:18,290
you could use for separating the data given
to you, which of those is the best decision
12
00:01:18,290 --> 00:01:32,910
surface? So, of all those alternatives, which
do you think should be the best decision surface?
13
00:01:32,910 --> 00:01:54,370
So, one answer to this question is to define
an optimal separating
14
00:01:54,370 --> 00:02:09,299
hyperplane as the surface, such that the
nearest data point to the surface is as far
15
00:02:09,299 --> 00:02:17,340
away as possible among all such surfaces.
So, here is a separating line and the nearest
16
00:02:17,340 --> 00:02:28,159
data point to that is that or that or that.
So, if you think about it, so the nearest
17
00:02:28,159 --> 00:02:38,989
data point cannot belong to just one class.
So, I could draw a line like this, but then
18
00:02:38,989 --> 00:02:46,349
that would mean that I am reducing the distance
of the data point to the separating surface
19
00:02:46,349 --> 00:02:52,919
or if I go this way, again I will be reducing
the distance of the data point to the surface
20
00:02:52,919 --> 00:02:58,730
in one class or the other. When I say that
you are maximizing the distance of the closest
21
00:02:58,730 --> 00:03:05,739
data point to the separating hyperplane,
that essentially means that the closest data
22
00:03:05,739 --> 00:03:11,139
point from either class is at the same distance
away from the hyperplane.
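This equidistance can be checked numerically. Below is a minimal sketch (the separating line and the data points are made-up illustrations, not from the lecture): the distance of a point x to the line beta naught plus x transpose beta equal to 0 is the absolute value of that expression divided by the norm of beta, and the margin is the smallest such distance.

```python
import numpy as np

# A hypothetical separating line beta0 + x.beta = 0 in 2-D.
beta0 = -3.0
beta = np.array([1.0, 1.0])

# Made-up data points, one class on each side of the line.
X = np.array([[0.0, 1.0], [1.0, 0.0],   # class -1
              [2.0, 3.0], [3.0, 2.0]])  # class +1

# Distance of each point to the hyperplane is
# |beta0 + x.beta| / ||beta||; the margin is the smallest
# distance over all points.  Here every point happens to sit
# exactly sqrt(2) away, so the closest points of both classes
# are equidistant from the line.
dist = np.abs(beta0 + X @ beta) / np.linalg.norm(beta)
margin = dist.min()
```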
23
00:03:11,139 --> 00:03:31,950
So, this distance being the same as this distance
would be the same as this distance and this.
24
00:03:31,950 --> 00:03:51,449
So, this distance of the closest data point
to the separating surface is known as the
25
00:03:51,449 --> 00:04:11,920
margin
of the classifier, which we will denote by
26
00:04:11,920 --> 00:04:19,750
m. So, the goal of finding an optimal
separating hyperplane is essentially to find
27
00:04:19,750 --> 00:04:27,919
the classifier, such that this margin m is
as large as possible. So, let us step back
28
00:04:27,919 --> 00:04:36,539
and think about what such a line means. You
know in all linear classifiers we have seen
29
00:04:36,539 --> 00:04:43,639
so far, so we know that we are going to say
something like.
30
00:04:43,639 --> 00:04:59,970
So, y is beta naught plus beta transpose x,
for convenience sake here I will write it
31
00:04:59,970 --> 00:05:05,909
as x transpose beta, since we are taking inner
products that is fine. So, a line like this
32
00:05:05,909 --> 00:05:16,460
could essentially be obtained by setting this
beta naught plus x transpose beta equal to
33
00:05:16,460 --> 00:05:25,060
0. So, all the data points on this line are
those data points for which beta naught plus
34
00:05:25,060 --> 00:05:42,250
x transpose beta evaluates to 0, so that is
the equation of the line here. So, if it is
35
00:05:42,250 --> 00:05:48,590
negative, that is, beta naught plus beta transpose x
is less than 0, we are going to say that
36
00:05:48,590 --> 00:06:01,770
x is of class minus 1 and if beta naught plus
beta transpose x is greater than 0, we will
37
00:06:01,770 --> 00:06:09,610
say that x is of class plus 1.
So, remember that we will be using some
38
00:06:09,610 --> 00:06:13,599
kind of encoding for the class. The class
could be does not buy a computer or buys a
39
00:06:13,599 --> 00:06:20,870
computer, or sick versus healthy. I mean
the classes could be many different things,
40
00:06:20,870 --> 00:06:25,990
but numerically we are going to be assigning
some encoding for the class and in this case,
41
00:06:25,990 --> 00:06:31,490
I chose to use minus 1 and plus 1 as the
encoding. There is a reason for that, as we
42
00:06:31,490 --> 00:06:40,740
will see shortly. So, if beta naught plus
beta transpose x is less than 0 and I say,
43
00:06:40,740 --> 00:06:45,170
it is class minus 1.
But, in this case, what I really want is: I do
44
00:06:45,170 --> 00:06:52,289
not want it to be just less than 0, but I
want it to be at least m away from the
45
00:06:52,289 --> 00:07:00,689
hyperplane. I want it to be m away from the line
beta naught plus beta transpose x equal to
46
00:07:00,689 --> 00:07:13,669
0. So, I might use x transpose beta and beta transpose
x interchangeably at points, but as you know
47
00:07:13,669 --> 00:07:15,860
they are inner products, so that is fine.
48
00:07:15,860 --> 00:07:28,690
So, what I really want is: if y i is plus
1, I want beta naught plus x i transpose beta
49
00:07:28,690 --> 00:07:34,789
to be greater than m. What happens if y i
is minus 1? I really want it to be at least
50
00:07:34,789 --> 00:07:43,550
m away in that case as well, but then we know
that beta naught plus x i transpose beta would
51
00:07:43,550 --> 00:07:50,560
be negative, when the class is minus 1. So,
what I do is I essentially just multiply
52
00:07:50,560 --> 00:08:05,419
by the actual class variable and I want this
whole product to be at least m. Because, if y i is plus
53
00:08:05,419 --> 00:08:11,249
1, I would like this also to be positive
and I want it to be at least m away from the
54
00:08:11,249 --> 00:08:15,389
hyperplane, and if y i is minus 1, this is going
to be negative. So, the product is going to
55
00:08:15,389 --> 00:08:21,840
be positive and I want that to be at least
m away from the hyperplane.
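This trick of multiplying by the label can be verified on a small example. The hyperplane, the points, and the value of m below are made-up illustrations:

```python
import numpy as np

# Hypothetical hyperplane and labelled points.
beta0 = -3.0
beta = np.array([1.0, 1.0])
X = np.array([[0.0, 1.0], [3.0, 2.0]])
y = np.array([-1, 1])

scores = beta0 + X @ beta        # beta0 + x_i.beta for each point
predicted = np.sign(scores)      # the decision rule: -1 or +1
# Multiplying the score by the true label makes it positive for
# correctly classified points, so both class cases collapse into
# the single constraint y_i * (beta0 + x_i.beta) >= m.
products = y * scores
m = 1.5                          # a margin these points happen to satisfy
satisfied = bool(np.all(products >= m))
```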
56
00:08:21,840 --> 00:08:31,969
So, this is the constraint that we want
to satisfy. And what is our goal? If we remember,
57
00:08:31,969 --> 00:08:41,229
our goal is to make sure that this m is as
large as possible. So, what we will do is, we
58
00:08:41,229 --> 00:09:21,190
will say maximize m over beta naught and beta, subject
to… So, I am going to maximize the margin
59
00:09:21,190 --> 00:09:29,470
m over beta naught and beta, subject to the
constraints that y i times x i transpose beta
60
00:09:29,470 --> 00:09:36,020
plus beta naught is greater than or equal
to m, for every data point in my training
61
00:09:36,020 --> 00:09:41,010
data.
So, this can be done assuming that all the
62
00:09:41,010 --> 00:09:48,230
data is nicely separated and I can actually
draw a linear surface that separates the data.
63
00:09:48,230 --> 00:09:53,200
So, if there is a linear surface that separates
the data, then I can come up with at least one
64
00:09:53,200 --> 00:10:02,270
surface that satisfies these constraints for
some value of m and essentially, I have to
65
00:10:02,270 --> 00:10:08,780
find the value of m that is maximum here. But,
one thing: if you look at this equation or
66
00:10:08,780 --> 00:10:14,010
the constraint that we have written, so I
can arbitrarily increase the value of beta
67
00:10:14,010 --> 00:10:20,250
and make this value as large as I want.
So, I need to have some constraint on beta
68
00:10:20,250 --> 00:10:27,720
as well. So, what we will do is, we will constrain
the norm of beta to be equal to 1. So, we
69
00:10:27,720 --> 00:10:36,310
will not look at all possible weights beta
naught and beta, we will only look at those
70
00:10:36,310 --> 00:10:42,520
weights where the size of beta is constrained
to be 1. For the norm of beta, you could
71
00:10:42,520 --> 00:10:54,750
take the Euclidean norm of beta, I am saying
that the norm of beta should be 1. So, I hope
72
00:10:54,750 --> 00:10:58,090
the formulation of the optimization problem
so far is clear.
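As a quick sanity check of this formulation: for any candidate beta naught and beta with the norm of beta equal to 1, the largest feasible m is simply the smallest value of y i times (beta naught plus x i transpose beta) over the training data. A minimal sketch, with made-up data and weights:

```python
import numpy as np

# With ||beta|| = 1, the largest m a candidate (beta0, beta) can
# achieve is the smallest y_i * (beta0 + x_i.beta) over the data.
# The data set and weights below are made-up for illustration.
X = np.array([[0.0, 0.0], [1.0, 0.0],   # class -1
              [3.0, 0.0], [4.0, 0.0]])  # class +1
y = np.array([-1.0, -1.0, 1.0, 1.0])

beta = np.array([1.0, 0.0])             # unit-norm weight vector
beta0 = -2.0
m = np.min(y * (beta0 + X @ beta))      # margin achieved by this choice
```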
73
00:10:58,090 --> 00:11:02,810
So, it is essentially saying that I want all
my data points to be at least a distance m
74
00:11:02,810 --> 00:11:11,350
away from the hyperplane and subject to that
constraint and subject to my beta being norm
75
00:11:11,350 --> 00:11:20,110
one, I want to maximize the margin. So, this
is a pretty awkward constraint, so we can
76
00:11:20,110 --> 00:11:31,370
try to get rid of it by changing the other
inequality constraints, by normalizing them
77
00:11:31,370 --> 00:11:52,070
with the norm of beta. So, this again allows me to
achieve the same effect of not getting a high
78
00:11:52,070 --> 00:11:56,610
value for m just by increasing the size of
beta, because I am dividing by the size of
79
00:11:56,610 --> 00:12:01,320
beta.
So, that achieves the same constraint and
80
00:12:01,320 --> 00:12:30,740
you can essentially write it like that. So,
one thing that we should note here is that,
81
00:12:30,740 --> 00:12:39,990
if a specific beta satisfies these constraints,
any positively scaled version of beta would
82
00:12:39,990 --> 00:12:44,800
also satisfy the constraints. I can just
multiply by some positive number, if it is
83
00:12:44,800 --> 00:12:51,450
originally, for all the x's, giving
me negative values smaller than minus
84
00:12:51,450 --> 00:12:57,930
m or positive values larger than m, just multiplying
it by a positive quantity will not change
85
00:12:57,930 --> 00:13:03,010
anything. It will still give me negative values
that are less than minus m or positive values
86
00:13:03,010 --> 00:13:14,800
that are greater than m. Therefore I can essentially
choose a specific value for beta, such that
87
00:13:14,800 --> 00:13:18,440
this evaluates to 1.
88
00:13:18,440 --> 00:13:47,190
So, I set the norm of beta equal to 1
by m, so that this constraint becomes y i
89
00:13:47,190 --> 00:13:57,620
times x i transpose beta plus beta naught greater than
or equal to 1, subject to the constraint that you are
90
00:13:57,620 --> 00:15:03,700
finding the smallest such beta. So, this optimization
91
00:15:03,700 --> 00:15:09,420
problem of maximizing the margin, now essentially
becomes the problem of finding the smallest
92
00:15:09,420 --> 00:15:23,730
beta, such that these conditions are satisfied.
So, this essentially means that my margin
93
00:15:23,730 --> 00:15:35,010
here is going to be 1 over the norm of beta.
So, to make it mathematically more convenient
94
00:15:35,010 --> 00:15:44,240
I am going to minimize the quadratic form
of that. So, essentially I will be minimizing
95
00:15:44,240 --> 00:15:48,620
the square of the norm of beta, since it is a norm
anyway. So, it would be positive to begin
96
00:15:48,620 --> 00:15:54,030
with, so I can minimize this square, that
is not a problem and so that is my final optimization
97
00:15:54,030 --> 00:16:01,500
problem. So, this is the final optimization
problem, where I am saying that, together
98
00:16:01,500 --> 00:16:11,170
with these constraints I kind of define a
slab around the separating hyperplane. I
99
00:16:11,170 --> 00:16:20,540
define a slab around this separating hyperplane
of width 1 over the norm of beta. So, making sure that
100
00:16:20,540 --> 00:16:26,180
there are no data points within this region,
so I am trying to now maximize the width of
101
00:16:26,180 --> 00:16:33,470
this region, so that there are no data points
in that region, that is essentially the idea
102
00:16:33,470 --> 00:16:46,420
behind this optimization problem.
So, this defines the basic optimization problem
103
00:16:46,420 --> 00:16:53,190
in the case of support vector machines. So,
in the next module we will look at how
you go about setting up a solution for this
00:16:53,190 --> 00:16:58,040
you go about setting up a solution for this
optimization problem.
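To close, here is a rough numerical sketch of the final optimization problem from this module: minimize one half the squared norm of beta subject to y i times (x i transpose beta plus beta naught) greater than or equal to 1. The toy data set is made up, and handing the problem to SciPy's generic constrained solver is only an illustration, not the dedicated solution method the next module develops:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up toy training set: two points per class in 2-D.
X = np.array([[0.0, 0.0], [1.0, 0.0],   # class -1
              [3.0, 0.0], [4.0, 0.0]])  # class +1
y = np.array([-1.0, -1.0, 1.0, 1.0])

# Variables packed as v = [beta0, beta_1, beta_2].  We minimize
# (1/2)||beta||^2 subject to y_i (x_i.beta + beta0) >= 1.
def objective(v):
    return 0.5 * np.dot(v[1:], v[1:])

constraints = [{"type": "ineq",
                "fun": lambda v, xi=xi, yi=yi: yi * (xi @ v[1:] + v[0]) - 1.0}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
beta0, beta = res.x[0], res.x[1:]
margin = 1.0 / np.linalg.norm(beta)   # the margin is 1 over ||beta||
```

On this data the separator sits midway between the closest points of the two classes, as the max-margin argument in the lecture predicts.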