Hello and welcome to this module on Support Vector Machines. So, we have been looking at a variety of classifiers so far; let us look at the linear classifier. One of the things is, if I have data points that are perfectly separable, here is one class and here is another class, you can see that they are very clearly separated. But when I train a linear classifier, it is not entirely clear which of the many possible lines that could separate the data your classifier would end up learning. There are many, many different lines that could separate the data, and we are not sure which one your classifier would end up learning.

So, support vector machines were initially born out of a need to answer this question: among all of these different lines, or all of these different decision surfaces, that you could use for separating the data given to you, which of those is the best decision surface?

So, one answer to this question is to define an optimal separating hyperplane as the surface such that the nearest data point to the surface is as far away as possible, among all such surfaces. So, here is a separating line, and the nearest data point to it is this one, or that one, or that one. And if you think about it, the nearest data point cannot belong to just one class.
So, I could draw a line like this, but then that would mean I am reducing the distance of the data point to the separating surface; or if I go this way, again I will be reducing the distance of the data point to the surface, in one class or the other. When I say that you are maximizing the distance of the closest data point to the separating hyperplane, that essentially means that the closest data point from either class is at the same distance from the hyperplane. So, this distance would be the same as this distance.

So, this distance of the closest data point to the separating surface is known as the margin of the classifier, which we will denote by m. So, the goal of finding an optimal separating hyperplane is essentially to find the classifier such that this margin m is as large as possible. So, let us step back and think about what such a line means. In all the linear classifiers we have seen so far, we know that we are going to say something like: y is beta naught plus beta transpose x. For convenience sake, here I will write it as x transpose beta; since we are taking inner products, that is fine. So, a line like this could essentially be obtained by setting beta naught plus x transpose beta equal to 0.
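To make this concrete, here is a minimal sketch of such a linear decision rule in code. The particular beta, beta naught, and the two test points are made-up numbers chosen only for illustration; they are not from the lecture.

```python
import numpy as np

# Hypothetical 2-D hyperplane: beta0 + x^T beta = 0
# (these numbers are made up for illustration).
beta = np.array([1.0, 1.0])
beta0 = -1.0

def decision(x):
    """Signed value beta0 + x^T beta; its sign gives the predicted class."""
    return beta0 + x @ beta

def distance(x):
    """Perpendicular distance of x from the line beta0 + x^T beta = 0."""
    return abs(decision(x)) / np.linalg.norm(beta)

x_pos = np.array([2.0, 2.0])   # decision(x_pos) > 0, so class +1
x_neg = np.array([0.0, 0.0])   # decision(x_neg) < 0, so class -1
```

The sign of the decision value gives the class, and dividing by the norm of beta turns the raw decision value into an actual geometric distance, which is what the margin will be measured in.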
So, all the data points on this line are those data points for which beta naught plus x transpose beta evaluates to 0; that is the equation of the line here. So, if beta naught plus beta transpose x is less than 0, we are going to say that x is of class minus 1, and if beta naught plus beta transpose x is greater than 0, we will say that x is of class plus 1.

So, remember that we will be using some kind of encoding for the class. The class could be "does not buy a computer" or "buys a computer", "is sick" or "is healthy"; the classes could be many different things, but numerically we are going to be assigning some encoding for the class, and in this case I choose to use minus 1 and plus 1 as the encoding. There is a reason for that, as we will see shortly. So, if beta naught plus beta transpose x is less than 0, I say it is class minus 1.

But in this case, what I really want is not just that it is less than 0; I want it to be at least m away from the hyperplane, that is, at least m away from the line beta naught plus beta transpose x equal to 0. So, I might use beta transpose x and x transpose beta interchangeably at points, but as you know they are inner products, so that is fine.

So, what I really want is: if y i is plus 1, I want beta naught plus x i transpose beta to be greater than m. What happens if y i is minus 1?
I really want it to be at least m away in that case as well, but then we know that beta naught plus x i transpose beta would be negative when the class is minus 1. So, what I do is I essentially just multiply by the actual class variable, and I want this whole quantity to be at least m. Because, if y i is plus 1, this would also be positive, and I want it to be at least m away from the hyperplane; and if y i is minus 1, this is going to be negative, so the product is going to be positive, and again I want that to be at least m away from the hyperplane.

So, this is the constraint that we want to satisfy, and what is our goal? If you remember, our goal is to make sure that this m is as large as possible. So, what we will do is, we will say: maximize m over beta naught and beta, subject to the constraints that y i times x i transpose beta plus beta naught is greater than or equal to m, for every data point in my training data.

So, this can be done assuming that all the data is nicely separated, so that I can actually draw a linear surface that separates the data. If there is a linear surface that separates the data, then I can come up with at least one surface that satisfies these constraints for some value of m, and essentially I have to find the value of m that is maximum here.
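This unified constraint can be checked numerically. The separating line and the four labelled points below are hypothetical, chosen only so we can see that y i times (beta naught plus x i transpose beta) comes out positive for every correctly classified point, whichever side of the line it lies on:

```python
import numpy as np

# A hypothetical separating line and four labelled points (made-up numbers).
beta = np.array([1.0, 1.0])
beta0 = -1.0
X = np.array([[2.0, 2.0], [1.5, 1.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1, 1, -1, -1])

# y_i * (beta0 + x_i^T beta): positive for both classes when the line
# separates them, since the label flips the sign for the -1 class.
margins = y * (beta0 + X @ beta)

# The smallest of these values plays the role of m in the constraint.
m = margins.min()
```

Multiplying by the label folds the two one-sided conditions into the single condition "margins >= m for every i", which is exactly the form used in the optimization problem.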
But, one thing, if you look at this equation, or the constraint that we have written: I can arbitrarily increase the value of beta and make this value as large as I want. So, I need to have some constraint on beta as well. So, what we will do is, we will constrain the norm of beta to be equal to 1. So, we will not look at all possible weights beta naught and beta; we will only look at those weights where the size of beta is constrained to be 1. For the norm of beta you could take the Euclidean norm; I am saying that the norm of beta should be 1. So, I hope the formulation of the optimization problem so far is clear.

It is essentially saying that I want all my data points to be at least a distance m away from the hyperplane, and subject to that constraint, and subject to my beta being norm one, I want to maximize the margin. Now, this norm constraint is a pretty awkward constraint, so we can try to get rid of it by changing the other inequality constraints, by normalizing them with the norm of beta. So, this again allows me to achieve the same effect of not getting a high value for m just by increasing the size of beta, because I am dividing by the size of beta.

So, that achieves the same constraint, and you can essentially write it like that. One thing that we should note here is that if a specific beta satisfies these constraints, any positively scaled version of beta would also satisfy the constraints.
I can just multiply by some positive number: if originally, for all the x i, it was giving me negative values smaller than minus m or positive values larger than m, just multiplying by a positive quantity will not change anything. It will still give me negative values that are less than minus m or positive values that are greater than m. Therefore, I can essentially choose a specific scale for beta such that this evaluates to 1.

So, I set the norm of beta equal to 1 by m, so that this constraint becomes y i times x i transpose beta plus beta naught is greater than or equal to 1, subject to the condition that you are finding the smallest such beta. So, this optimization problem of maximizing the margin now essentially becomes the problem of finding the smallest beta such that these conditions are satisfied. This essentially means that my margin here is going to be 1 over the norm of beta.

So, to make it mathematically more convenient, I am going to minimize the quadratic form of that; essentially I will be minimizing the square of the norm of beta. Since it is a norm anyway, it would be positive to begin with, so I can minimize the square, that is not a problem, and so that is my final optimization problem.
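This rescaling argument can be sketched numerically as well; the data and the initial beta below are made-up numbers. Rescaling beta and beta naught so that the smallest constraint value equals 1 leaves the geometric margin unchanged, and that margin then equals 1 over the norm of the rescaled beta:

```python
import numpy as np

# Hypothetical separable toy data and a separating direction (made up).
X = np.array([[2.0, 2.0], [1.5, 1.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1, 1, -1, -1])
beta = np.array([3.0, 3.0])
beta0 = -3.0

# Geometric margin of this hyperplane: distance of the closest point.
geom = (y * (beta0 + X @ beta)).min() / np.linalg.norm(beta)

# Rescale so that min_i y_i (beta0 + x_i^T beta) = 1 ...
c = (y * (beta0 + X @ beta)).min()
beta, beta0 = beta / c, beta0 / c

# ... after which the margin is simply 1 / ||beta||.
margin_after = 1.0 / np.linalg.norm(beta)
```

Since scaling (beta naught, beta) by a positive constant does not move the hyperplane, we are free to fix the scale this way, which is what turns "maximize m" into "minimize the norm of beta".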
So, this is the final optimization problem, where I am saying that these constraints together define a slab around the separating hyperplane, of width 1 by the norm of beta on either side, and we are making sure that there are no data points within this region. So, I am trying to maximize the width of this region such that there are no data points in that region; that is essentially the idea behind this optimization problem.

So, this defines the basic optimization problem in the case of support vector machines. In the next module we will look at how to go about setting up a solution for this optimization problem.
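As a rough sketch of what solving this problem looks like (the next module develops the proper solution), here is the final formulation, minimize one half the squared norm of beta subject to y i times (beta naught plus x i transpose beta) greater than or equal to 1, handed to a generic constrained optimizer. The tiny data set is made up, and scipy's SLSQP is used only as a stand-in; dedicated quadratic programming solvers are what one would use in practice:

```python
import numpy as np
from scipy.optimize import minimize

# A tiny linearly separable data set (made-up numbers).
X = np.array([[2.0, 2.0], [2.0, 0.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables packed as w = [beta0, beta1, beta2].
def objective(w):
    return 0.5 * np.dot(w[1:], w[1:])      # (1/2) ||beta||^2

# One inequality per training point: y_i (beta0 + x_i^T beta) - 1 >= 0.
constraints = [{'type': 'ineq',
                'fun': lambda w, xi=xi, yi=yi: yi * (w[0] + xi @ w[1:]) - 1.0}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), method='SLSQP',
               constraints=constraints)
beta0, beta = res.x[0], res.x[1:]
margin = 1.0 / np.linalg.norm(beta)        # width of the slab on each side
```

For this toy data the optimizer recovers a separating hyperplane with every point at least distance 1/||beta|| away, exactly the slab described above.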