Hi, so we are looking at the module on training the logistic regression classifier now. In the previous module we looked at the basic idea behind logistic regression, which is essentially to do linear regression with a logistic transformation: we took the log-odds function, log(p(x) / (1 - p(x))), and tried to fit a linear function to this transformed quantity. So how do we find the parameters beta_1 and beta_0?

We optimize the likelihood of the training data with respect to the parameters beta; that is essentially how we are going to train this classifier. This is slightly different from some of the earlier methods we have looked at for identifying parameters, mainly because here we are working with the probability of a classification, not just with getting classifications right or wrong, so it makes more sense to optimize the probability of seeing the training data with respect to the parameters beta.

So what is the likelihood? The likelihood is the probability of the training data D given a particular parameter setting beta. You should note here that it is a function of the parameter setting, because the training data D that is given to you is fixed. Here is an example of the likelihood for a classification task.
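As a quick sketch of the transformation just described (the value of z below is arbitrary), the log-odds function and its inverse, the logistic function, undo each other; the linear fit beta_0 + beta_1 * x lives on the log-odds scale:

```python
import math

def logistic(z):
    """Inverse of the log-odds: p = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """log(p / (1 - p)), the quantity we fit linearly as beta_0 + beta_1 * x."""
    return math.log(p / (1.0 - p))

# The two functions invert each other:
z = 0.3  # arbitrary stand-in for beta_0 + beta_1 * x
print(log_odds(logistic(z)))  # recovers z
```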
I am going to assume that the data is given to you in the form of (x_i, y_i) pairs, as we have done in the past. For each x_i there is an outcome y_i, which will be either 1 if the point belongs to class 1 or 0 if it belongs to class 2.

Let us look at one term in the product that I have written down there, p(x_i)^y_i (1 - p(x_i))^(1 - y_i). You can see that if the output corresponding to x_i is 1, then the first factor in the product is p(x_i) and the second factor is 1, because y_i is 1 and 1 - p(x_i) is raised to the power 0, which reduces to 1. Likewise, if the output corresponding to x_i is 0, then the first factor is 1 and the second factor remains 1 - p(x_i). This essentially means that, depending on what the output variable is, I take either the probability of the data point having the label 1 or the probability of it having the label 0. Recall that p(x_i) is the probability that y = 1 given x_i.

Now, this is for one data point; if I want the probability of the entire data set, I just take the product over all the data points, with the product running from 1 to n. This expression now gives me the probability of seeing the training data given a specific parameter setting.
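As a minimal sketch of this product (the toy data and the beta values below are made up purely for illustration), the likelihood can be computed term by term exactly as described:

```python
import math

def p(x, beta0, beta1):
    """Logistic model: P(y = 1 | x) = 1 / (1 + exp(-(beta0 + beta1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def likelihood(xs, ys, beta0, beta1):
    """Product over all points of p(x_i)^y_i * (1 - p(x_i))^(1 - y_i)."""
    L = 1.0
    for x, y in zip(xs, ys):
        pi = p(x, beta0, beta1)
        L *= pi ** y * (1.0 - pi) ** (1 - y)
    return L

# Hypothetical toy data: larger x tends to go with label 1.
xs = [-2.0, -1.0, 0.5, 1.5, 2.5]
ys = [0, 0, 0, 1, 1]
print(likelihood(xs, ys, beta0=-0.5, beta1=1.0))
```

Notice that each factor picks out either p(x_i) or 1 - p(x_i) depending on the label, which is exactly the case analysis above.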
So where do beta and beta_0 appear in the expression on the right-hand side? p(x_i) is specified in terms of beta and beta_0, so implicitly beta and beta_0 are appearing on the right-hand side of the equation. Like I said, the likelihood is a function of the parameters, and hence we denote it as L(beta). Our goal now is to optimize this likelihood so that we get a good estimate of the parameters: we have to find the beta for which this probability is maximized. Looking at the term, it seems a little hard to optimize, because there are a lot of products here, so we have to be a little careful.

The usual way we operate here is to take the logarithm of this likelihood, and you can see that lowercase l is used to denote the log of the likelihood function. Let us walk through this log-likelihood expression a little slowly. In the first step I have taken logarithms, and therefore the products that I had earlier have become summations, and the exponentiations that I had earlier have become products: each original exponent now multiplies a logarithm.
Now we can do a little bit of simplification here. In the second step I have expanded the product in the second term of the summation and gathered together the terms that have a coefficient of y_i; that gives me the second term of the summation, and the first term is just 1 times log(1 - p(x_i)). We know what log(p(x_i) / (1 - p(x_i))) is: it is exactly the function we have been trying to fit from the beginning, so we replace it with the linear regression fit that we did. Then we simplify further to arrive at the expression on the last line, essentially by writing out p(x_i) and evaluating 1 - p(x_i), which gives the negative logarithm term on the last line of the expression.

So now what do we do? We have the log-likelihood, and to maximize it we take its derivative with respect to beta and equate it to 0. The first line here is the derivative of the log-likelihood, and it simplifies to a very nice form: a sum of (y_i - p(x_i)) times x_i, for each individual component of x_i. We set this equal to 0, but we really cannot solve this that easily. Why?
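A minimal sketch of the simplified log-likelihood and its derivative, for a single feature plus an intercept (the toy data is hypothetical); the form sum_i [ y_i * (beta_0 + beta_1 * x_i) - log(1 + exp(beta_0 + beta_1 * x_i)) ] corresponds to the last line of the derivation, and the gradient to sum_i (y_i - p(x_i)) * x_i:

```python
import math

def log_likelihood(xs, ys, beta0, beta1):
    # l(beta) = sum_i [ y_i * (beta0 + beta1 * x_i) - log(1 + exp(beta0 + beta1 * x_i)) ]
    return sum(y * (beta0 + beta1 * x) - math.log1p(math.exp(beta0 + beta1 * x))
               for x, y in zip(xs, ys))

def gradient(xs, ys, beta0, beta1):
    # d l / d beta_j = sum_i (y_i - p(x_i)) * x_ij, with x_i0 = 1 for the intercept.
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        pi = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        g0 += (y - pi)       # component for beta0 (x_i0 = 1)
        g1 += (y - pi) * x   # component for beta1
    return g0, g1

xs = [-2.0, -1.0, 0.5, 1.5, 2.5]
ys = [0, 0, 0, 1, 1]
print(log_likelihood(xs, ys, -0.5, 1.0), gradient(xs, ys, -0.5, 1.0))
```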
Because p(x_i) is a transcendental function, it is not easy to find a closed-form solution for these kinds of expressions. So we have to look at numerical methods for solving these kinds of optimization problems, and we essentially look at a class of algorithms known as interior point methods. I am not really going to get into the math behind all of this, but I assume that many of you have come across a very simple optimization technique called the Newton-Raphson method.

Here is what the update expression for the Newton-Raphson method looks like. I start off with an initial guess for my solution beta, and I would typically like this initial guess to be close to the true solution. Once I have the guess, I keep updating the solution by taking the ratio of the first derivative of the likelihood to the second derivative of the likelihood and subtracting it from the current beta, which gives me my next estimate. This has fast convergence under certain regularity conditions; for one thing, the second derivative should be positive.
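A toy sketch of the update beta <- beta - l'(beta) / l''(beta), kept one-dimensional by fitting a single slope with no intercept (all data here is made up, and this is not the full multivariate method). Note the code maximizes the log-likelihood directly, whose second derivative is negative; this is equivalent to minimizing the negative log-likelihood, whose second derivative is positive:

```python
import math

def newton_raphson_1d(xs, ys, beta=0.0, iters=25):
    """Fit a single slope beta (no intercept) by Newton-Raphson:
    beta <- beta - l'(beta) / l''(beta)."""
    for _ in range(iters):
        d1 = d2 = 0.0
        for x, y in zip(xs, ys):
            pi = 1.0 / (1.0 + math.exp(-beta * x))
            d1 += (y - pi) * x             # first derivative of log-likelihood
            d2 -= pi * (1.0 - pi) * x * x  # second derivative (always <= 0)
        if abs(d2) < 1e-12:
            break
        beta -= d1 / d2
        if abs(d1) < 1e-10:
            break
    return beta

# Hypothetical, non-separable toy data (so the maximum-likelihood slope is finite).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]
print(newton_raphson_1d(xs, ys))
```

The data is deliberately not linearly separable; on perfectly separable data the maximum-likelihood slope diverges to infinity and the iteration would not settle.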
As you can see, when the first derivative is 0 you will not be changing the value of your guess, and the first derivative is 0 at one of the optima, whether it is a maximum or a minimum. Since you also require the second derivative to be positive, when you approach an optimum you can be sure it is a minimum. That is essentially the basic idea behind the Newton-Raphson method.

When the Newton-Raphson method is applied specifically to the logistic regression problem, you come up with an iterative technique called the iteratively reweighted least squares approach for finding the parameters of logistic regression. Most of the statistical packages that we have, R in particular, have a very simple function that allows you to fit a logistic regression to any data set that you have, and they will essentially be using Newton-Raphson by way of the iteratively reweighted least squares technique.
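A sketch of what iteratively reweighted least squares does for the one-feature-plus-intercept case (toy data is hypothetical; real packages solve the weighted least-squares step with general linear algebra, not a hand-rolled 2x2 solve):

```python
import math

def irls(xs, ys, iters=25):
    """Iteratively reweighted least squares for logistic regression with one
    feature plus an intercept. Each iteration solves the weighted normal
    equations X^T W X beta = X^T W z for the working response z."""
    b0 = b1 = 0.0
    for _ in range(iters):
        s00 = s01 = s11 = 0.0  # entries of X^T W X (symmetric 2x2)
        t0 = t1 = 0.0          # entries of X^T W z
        for x, y in zip(xs, ys):
            eta = b0 + b1 * x
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)                   # IRLS weight
            z = eta + (y - p) / max(w, 1e-12)   # working response
            s00 += w; s01 += w * x; s11 += w * x * x
            t0 += w * z; t1 += w * x * z
        det = s00 * s11 - s01 * s01
        if abs(det) < 1e-12:
            break
        b0 = (s11 * t0 - s01 * t1) / det
        b1 = (s00 * t1 - s01 * t0) / det
    return b0, b1

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]
print(irls(xs, ys))
```

At convergence the gradient of the log-likelihood is zero, which is exactly the condition the Newton-Raphson view arrives at.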
To summarize this couple of modules on logistic regression: logistic regression is a very powerful classifier built on the idea of doing linear regression on a logistically transformed output variable. Logistic regression is related to the exponential family of probability distributions, which arise in a variety of problems, and that is one of the reasons it is a very popular classifier. Apart from that, it works really well, and another reason logistic regression is the classifier of choice for many people, especially in medical domains, is that it allows you to perform what is known as sensitivity analysis: you can look at the dependence of the class labels on the features by looking at the regression coefficient of a specific feature in the fit that you obtain. Thank you.
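For instance (the coefficient value below is purely hypothetical), because the linear fit lives on the log-odds scale, a fitted slope beta_1 means that each unit increase in that feature multiplies the odds of class 1 by exp(beta_1):

```python
import math

beta1 = 0.7  # hypothetical fitted coefficient for one feature
odds_ratio = math.exp(beta1)
print(f"Each unit increase in the feature multiplies the odds of class 1 by {odds_ratio:.3f}")
```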