1
00:00:11,019 --> 00:00:16,219
Hi, so we are looking at the module on training
a logistic regression classifier now.
2
00:00:16,219 --> 00:00:23,320
So, in the previous module we looked at the
basic idea behind logistic regression, which
3
00:00:23,320 --> 00:00:28,540
is essentially to do linear regression with
a logistic transformation. So, you took the
4
00:00:28,540 --> 00:00:33,280
odds, which is p of x divided by 1 minus
p of x, took the logarithm of it, and tried to
5
00:00:33,280 --> 00:00:41,720
fit a linear function to this transformed quantity.
So, how do we find these parameters,
6
00:00:41,720 --> 00:00:46,650
beta 1 and beta naught?
So, we optimize the likelihood of the training
7
00:00:46,650 --> 00:00:50,900
data with respect to the parameters beta,
so that is essentially the way we are going
8
00:00:50,900 --> 00:00:56,430
to be training this. So, this is slightly
different from some of the earlier methods
9
00:00:56,430 --> 00:01:01,950
we have looked at for identifying the parameters,
mainly because here we are looking at the
10
00:01:01,950 --> 00:01:06,640
probability of classification, not just getting
the classifications right or wrong, but we
11
00:01:06,640 --> 00:01:10,670
are actually looking at the probability of
classification, so it makes more sense to try
12
00:01:10,670 --> 00:01:17,719
to optimize the probability of seeing the
training data with respect to the parameter
13
00:01:17,719 --> 00:01:18,719
beta.
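Before moving on, here is a minimal sketch of the model itself in Python (the lecture contains no code, so the function names and the example values below are illustrative assumptions, not part of the course material):

```python
import math

def p_of_x(x, beta0, beta1):
    # Logistic model: log(p(x) / (1 - p(x))) = beta0 + beta1 * x,
    # which inverts to the sigmoid below.
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def log_odds(x, beta0, beta1):
    # The log odds of the fitted probability is linear in x by construction.
    p = p_of_x(x, beta0, beta1)
    return math.log(p / (1.0 - p))

print(log_odds(2.0, beta0=0.5, beta1=1.5))  # equals 0.5 + 1.5 * 2.0 = 3.5
```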
14
00:01:18,719 --> 00:01:27,289
So, what is the likelihood? The likelihood
is the probability of the training data D given
15
00:01:27,289 --> 00:01:33,520
a particular parameter setting beta. So, you
should note here that it is a function
16
00:01:33,520 --> 00:01:39,460
of the parameter setting, because the training
data D that is given to you is usually fixed.
17
00:01:39,460 --> 00:01:48,759
So, here is an example of the likelihood for
a classification task. So, I am
18
00:01:48,759 --> 00:01:52,830
going to assume that the data is given to
you in the form of x i, y i pairs as we have
19
00:01:52,830 --> 00:01:58,779
done in the past.
So, for each x i there is going to be an outcome
20
00:01:58,779 --> 00:02:04,469
y i, which will be either 1 if it belongs
to class 1 or 0 if it belongs to class 2.
21
00:02:04,469 --> 00:02:11,620
So, let us look at one term in the product
that I have written down there. You can
22
00:02:11,620 --> 00:02:20,780
see that if the output corresponding to x i is
1, then the first term in the product will
23
00:02:20,780 --> 00:02:31,310
be p of x i and the second term in the product
will be 1, because y i is 1 and 1 minus p
24
00:02:31,310 --> 00:02:36,570
of x i is raised to the power of 0,
which essentially reduces to 1.
25
00:02:36,570 --> 00:02:44,050
Likewise, if the output corresponding to x
i is 0, then the first term in the product
26
00:02:44,050 --> 00:02:50,290
is going to be 1 and the second term in the
product will remain as 1 minus p of x i. So,
27
00:02:50,290 --> 00:02:55,200
this essentially means that depending on
what the output variable is, I am going to
28
00:02:55,200 --> 00:03:02,180
either take the probability of the data point having
29
00:03:02,180 --> 00:03:07,350
a label of 1 or the probability of the data
point having a label of 0. So, recall
30
00:03:07,350 --> 00:03:12,940
that p of x i is the probability that y equals
1 given x i.
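The one-term-at-a-time argument above can be sketched in a few lines of Python (an illustrative sketch; the tiny data set and function name are made up for this example, not from the lecture):

```python
import math

def likelihood(xs, ys, beta0, beta1):
    """Product over data points of p(x_i)^y_i * (1 - p(x_i))^(1 - y_i)."""
    L = 1.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        # If y_i = 1 the factor is p(x_i); if y_i = 0 it is 1 - p(x_i),
        # because the other factor is raised to the power 0.
        L *= p ** y * (1.0 - p) ** (1 - y)
    return L

# Tiny made-up data set: two points from class 1, one from class 0.
print(likelihood([1.0, 2.0, -1.0], [1, 1, 0], beta0=0.0, beta1=1.0))
```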
31
00:03:12,940 --> 00:03:17,820
So, this is for one data point, and
if I want to look at the probability of the
32
00:03:17,820 --> 00:03:22,350
entire data, I just take the product over
all the data points, so the product runs from
33
00:03:22,350 --> 00:03:30,400
1 to n. So, this expression now gives me the
probability of seeing the training data given
34
00:03:30,400 --> 00:03:34,930
a specific parameter setting. So, where do
beta and beta naught appear in the expression
35
00:03:34,930 --> 00:03:42,840
on the right hand side, so p of x i is specified
in terms of beta and beta naught.
36
00:03:42,840 --> 00:03:47,010
So, implicitly, beta and beta naught are
appearing on the right hand side of the equation,
37
00:03:47,010 --> 00:03:52,470
and like I said, the likelihood is a function
of the parameters and hence we denote it as
38
00:03:52,470 --> 00:03:58,720
L of beta. So, now our goal is to optimize
this likelihood so that we get a
39
00:03:58,720 --> 00:04:02,460
good estimate of the parameter, so we have
to find the right set of beta, so that this
40
00:04:02,460 --> 00:04:09,930
probability is maximized. So, now, we look at
the term; it looks a little hard to
41
00:04:09,930 --> 00:04:14,750
optimize, because there are a lot of products here,
so we have to be a little careful.
42
00:04:14,750 --> 00:04:19,930
So, the usual way that we operate here is
to take logarithms of this likelihood and
43
00:04:19,930 --> 00:04:25,130
you can see that lower case l here is used
to denote the log of the likelihood function
44
00:04:25,130 --> 00:04:31,500
and then, you just walk through this log likelihood
expression a little slowly. So, now, that
45
00:04:31,500 --> 00:04:36,460
I have taken the logarithms in the first step
46
00:04:36,460 --> 00:04:42,029
and therefore, the products that I had earlier
have become summations and the exponentiation
47
00:04:42,029 --> 00:04:47,060
that I had earlier have become products. So,
that corresponds to the original exponentiation
48
00:04:47,060 --> 00:04:50,900
I had in my expression and now they have become
products.
49
00:04:50,900 --> 00:04:57,120
Now, we can do a little bit of simplification
here and you can see what I have essentially
50
00:04:57,120 --> 00:05:04,319
done. In the second step, I
have expanded the product term in the second
51
00:05:04,319 --> 00:05:10,050
term in this summation and then, I have gathered
terms together which have a coefficient of
52
00:05:10,050 --> 00:05:16,460
y i. So, that gives me the second term in
the summation and the first term is just essentially
53
00:05:16,460 --> 00:05:24,819
one times log of 1 minus p of x i. So, we
know what log of p x i by 1 minus p x i is
54
00:05:24,819 --> 00:05:28,969
and that is essentially the function that
you are trying to fit from the beginning.
55
00:05:28,969 --> 00:05:35,559
So, we replace that with the linear fit that
we had, the linear regression fit that we
56
00:05:35,559 --> 00:05:41,139
did. And then, we do further simplification
in order to come up with the expression given
57
00:05:41,139 --> 00:05:46,749
on the last line; that is, writing
out p of x i and then evaluating 1 minus p of x
58
00:05:46,749 --> 00:05:52,160
i, and that gives me the negative logarithm
term on the last line of the expression.
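Written out, the derivation the lecture walks through (with p(x_i) the probability that y_i equals 1, and beta naught plus beta transpose x_i the linear fit from before) is, step by step:

```latex
\begin{align*}
\ell(\beta) &= \sum_{i=1}^{n}\Big[y_i\log p(x_i) + (1-y_i)\log\big(1-p(x_i)\big)\Big] \\
            &= \sum_{i=1}^{n}\Big[\log\big(1-p(x_i)\big) + y_i\log\frac{p(x_i)}{1-p(x_i)}\Big] \\
            &= \sum_{i=1}^{n}\Big[y_i\,(\beta_0 + \beta^{\top}x_i) - \log\big(1 + e^{\beta_0 + \beta^{\top}x_i}\big)\Big]
\end{align*}
```

The middle line gathers the terms with coefficient y_i, and the last line substitutes the linear fit for the log odds and evaluates 1 minus p of x_i, giving the negative logarithm term.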
59
00:05:52,160 --> 00:05:59,380
So, now, what do we do? We have the log likelihood;
so what do we do to maximize this log likelihood?
60
00:05:59,380 --> 00:06:05,550
We essentially take the derivatives of this
log likelihood with respect to beta and then,
61
00:06:05,550 --> 00:06:10,509
we should be equating this to 0. So, the first
line here is essentially taking the derivative
62
00:06:10,509 --> 00:06:16,470
of the log likelihood, and it simplifies to a
very nice form, which is y i minus probability
63
00:06:16,470 --> 00:06:25,659
of x i, times x i (each individual component
of x i), and now we set this equal to 0. But,
64
00:06:25,659 --> 00:06:35,409
we really cannot solve this that easily. Why?
Because p of x i is actually a transcendental
65
00:06:35,409 --> 00:06:41,319
function, so it is not very easy to find a
closed form solution for these kinds of expressions.
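The nice form of the derivative can be checked numerically. Below is a small sketch (the data and function names are made up for illustration) that compares the closed-form gradient, sum over i of (y_i minus p(x_i)) times x_i, against a finite-difference estimate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(xs, ys, beta0, beta1):
    # The last-line form: y_i * (b0 + b1 x_i) - log(1 + exp(b0 + b1 x_i)).
    return sum(y * (beta0 + beta1 * x) - math.log(1.0 + math.exp(beta0 + beta1 * x))
               for x, y in zip(xs, ys))

def gradient_beta1(xs, ys, beta0, beta1):
    # d l / d beta1 = sum_i (y_i - p(x_i)) * x_i, the nice form from the slide.
    return sum((y - sigmoid(beta0 + beta1 * x)) * x for x, y in zip(xs, ys))

xs, ys = [1.0, 2.0, -1.0], [1, 1, 0]
g = gradient_beta1(xs, ys, 0.0, 1.0)
# Central finite-difference estimate of the same derivative.
h = 1e-6
fd = (log_likelihood(xs, ys, 0.0, 1.0 + h) - log_likelihood(xs, ys, 0.0, 1.0 - h)) / (2 * h)
print(abs(g - fd) < 1e-5)  # the two agree
```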
66
00:06:41,319 --> 00:06:47,059
So, we have to actually look at numerical
methods for solving these kinds of optimization
67
00:06:47,059 --> 00:06:51,360
problems, and we essentially look at a class
of algorithms known as interior
68
00:06:51,360 --> 00:06:56,849
point methods. So, I am not really going to
get into the math behind all of this, but
69
00:06:56,849 --> 00:07:04,039
I assume that many of you have actually come
across a very simple optimization technique
70
00:07:04,039 --> 00:07:05,880
called the Newton-Raphson method.
71
00:07:05,880 --> 00:07:09,589
So, here is what the expression for the
Newton-Raphson method is going to look like. So,
72
00:07:09,589 --> 00:07:16,629
I start off with a guess for my initial solution
beta; call this initial guess beta naught,
73
00:07:16,629 --> 00:07:22,639
and I would typically like my beta naught
to be close to the true solution. And once
74
00:07:22,639 --> 00:07:30,259
I have the guess for beta naught, then I keep
updating the solution by essentially subtracting
75
00:07:30,259 --> 00:07:35,460
the first order derivative of the likelihood
divided by the second order derivative of
76
00:07:35,460 --> 00:07:42,990
the likelihood. Take the ratio and then, subtract
it from beta in order to give me my next estimate.
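The update just described can be sketched on a simple one-dimensional function (an illustrative toy example, not the logistic likelihood itself; the function below is an assumed stand-in):

```python
def newton_raphson(f_prime, f_double_prime, x0, n_iter=20):
    """Iterate x <- x - f'(x) / f''(x), the update described above."""
    x = x0
    for _ in range(n_iter):
        x = x - f_prime(x) / f_double_prime(x)
    return x

# Minimize f(x) = (x - 3)^2 + 1: f'(x) = 2(x - 3), f''(x) = 2 (positive).
x_star = newton_raphson(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0)
print(x_star)  # converges to 3.0 (in one step, since f is quadratic)
```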
77
00:07:42,990 --> 00:07:49,249
So, this has fast convergence under certain
regularity conditions; one condition is
78
00:07:49,249 --> 00:07:54,889
that the second derivative should exist and
should be positive. And as you can see, if the
79
00:07:54,889 --> 00:07:59,779
first derivative is 0, you are
not going to be changing the value of your
80
00:07:59,779 --> 00:08:05,249
guess. And when would the first derivative
be 0? It will be 0 at one of the optima, whether
81
00:08:05,249 --> 00:08:10,369
it is a maximum or a minimum. And since the second derivative
82
00:08:10,369 --> 00:08:17,629
is positive, you will approach a minimum.
So, when you approach an optimum,
83
00:08:17,629 --> 00:08:21,830
you can be sure it is going to be a minimum.
So, this is essentially the basic idea behind
84
00:08:21,830 --> 00:08:27,899
the Newton-Raphson method. And when the
Newton-Raphson method is applied specifically to
85
00:08:27,899 --> 00:08:33,510
the logistic regression problem, you come up
with an iterative technique, which is called
86
00:08:33,510 --> 00:08:38,220
the iteratively reweighted least squares approach,
for finding the parameters in
87
00:08:38,220 --> 00:08:45,390
logistic regression. And most of the
statistical packages that we have, especially
88
00:08:45,390 --> 00:08:52,330
R in particular, for instance, have a
very simple function that allows you to fit
89
00:08:52,330 --> 00:08:56,760
logistic regression to any data set that you
have and they will essentially be using Newton
90
00:08:56,760 --> 00:09:01,740
Raphson by way of iteratively reweighted least
squares techniques.
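The whole pipeline of these two modules can be sketched end to end. This is a hedged illustration in Python rather than R's built-in fitting function, and it is a plain Newton-Raphson/IRLS sketch on made-up one-feature data, not what any particular package does internally:

```python
import math

def fit_logistic_irls(xs, ys, n_iter=25):
    """Newton-Raphson / IRLS for a one-feature logistic regression."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Gradient and Hessian pieces of the log likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)          # the IRLS "weight" for this point
            g0 += y - p                # gradient w.r.t. the intercept
            g1 += (y - p) * x          # gradient w.r.t. the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 Newton system and take the update step.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up data: labels mostly switch from 0 to 1 as x increases.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic_irls(xs, ys)
print(b1 > 0)  # the fitted slope is positive for this increasing data
```

In R itself, the equivalent is the built-in glm function with a binomial family, which uses IRLS under the hood.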
91
00:09:01,740 --> 00:09:06,350
To summarize these couple of modules on logistic
regression: logistic regression is a very
92
00:09:06,350 --> 00:09:13,060
powerful classifier built on the idea of doing
linear regression on a logistically transformed
93
00:09:13,060 --> 00:09:19,460
output variable, and logistic regression
is related to the exponential family of probability
94
00:09:19,460 --> 00:09:25,070
distributions that arise in a variety of problems,
and that is one of the reasons that makes
95
00:09:25,070 --> 00:09:32,710
it a very, very popular classifier.
And apart from that, logistic regression works really well;
96
00:09:32,710 --> 00:09:37,820
I mean, so that is another reason that
logistic regression is the classifier of choice
97
00:09:37,820 --> 00:09:42,770
for many people, especially in medical
domains, because it allows you to perform
98
00:09:42,770 --> 00:09:48,780
what is known as sensitivity analysis. So, you
can look at the dependence of class labels on
99
00:09:48,780 --> 00:09:55,280
features by looking at the regression
coefficient of a specific feature in the fit
100
00:09:55,280 --> 00:09:55,880
that you obtain.
Thank you.