1
00:00:10,960 --> 00:00:13,720
Hello and welcome to this module on Logistic
Regression.
2
00:00:13,720 --> 00:00:18,170
So, we have looked at the problem of classification
earlier and here is an example from one of
3
00:00:18,170 --> 00:00:19,740
the earlier modules.
4
00:00:19,740 --> 00:00:24,390
So, the users marked in brown here are those
who bought a computer and those marked in
5
00:00:24,390 --> 00:00:26,030
red are people who did not buy a computer.
6
00:00:26,030 --> 00:00:32,380
And the goal of classification we said earlier
is to find a decision surface that would help
7
00:00:32,380 --> 00:00:36,150
us separate people who buy computers from
those who do not buy computers.
8
00:00:36,150 --> 00:00:41,930
There are different ways in which you could
have these decision surfaces and we looked
9
00:00:41,930 --> 00:00:42,930
at a few.
10
00:00:42,930 --> 00:00:49,780
Now, let us step back and ask the question,
what exactly does this decision surface mean?
11
00:00:49,780 --> 00:00:56,190
Specifically, let me ask the question: which
class do the data points that lie on the decision surface
12
00:00:56,190 --> 00:01:00,590
belong to, is it buys computers or does not
buy computers?
13
00:01:00,590 --> 00:01:07,840
So, one way of thinking about it is to say
that this decision surface denotes all the
14
00:01:07,840 --> 00:01:15,460
data points for which the probability of it
being red is equal to the probability of it
15
00:01:15,460 --> 00:01:17,880
being brown.
16
00:01:17,880 --> 00:01:22,730
This essentially means that for the points
on the decision boundary you
17
00:01:22,730 --> 00:01:27,930
are not able to make a decision as to whether
the person will buy a computer or not.
18
00:01:27,930 --> 00:01:32,380
So, what does it tell us about the points that
lie to one side of the boundary?
19
00:01:32,380 --> 00:01:37,899
So, the points that lie to one side of the
boundary are those, where the probability
20
00:01:37,899 --> 00:01:43,680
that the person will not buy a computer in
this case is higher than the probability that
21
00:01:43,680 --> 00:01:44,700
he will buy a computer.
22
00:01:44,700 --> 00:01:50,439
So, one way of thinking about the decision
boundary is that it models all the points,
23
00:01:50,439 --> 00:01:56,250
where both the classes are equally likely
or equally probable to occur.
24
00:01:56,250 --> 00:02:03,740
So, if you want to go beyond classification,
you might be interested in knowing what
25
00:02:03,740 --> 00:02:08,509
is the actual probability of a specific class
given a data point.
26
00:02:08,509 --> 00:02:13,690
Not just in finding the right classification,
you would really like to know what is the probability
27
00:02:13,690 --> 00:02:19,810
that the person buys a computer given the
age and income of the person versus the probability
28
00:02:19,810 --> 00:02:22,870
that the person will not buy a computer given
the age and income of the person.
29
00:02:22,870 --> 00:02:28,300
So, why would you want to know this kind of
probability rather than just the class label?
30
00:02:28,300 --> 00:02:32,280
So, one example you could think of is in the
medical domain.
31
00:02:32,280 --> 00:02:38,300
Suppose a patient walks into the hospital and
the doctor says that the patient has a specific
32
00:02:38,300 --> 00:02:47,500
disease, and you would like to know
33
00:02:47,500 --> 00:02:51,070
how confident the doctor is of that prediction.
34
00:02:51,070 --> 00:02:57,590
If the doctor says, I am 95 percent sure that
this patient has the disease, then you certainly
35
00:02:57,590 --> 00:02:59,240
would go in for the treatment.
36
00:02:59,240 --> 00:03:04,170
So, likewise, when you have a classifier that
is going to give you a class label you would
37
00:03:04,170 --> 00:03:08,090
like to know, how sure the classifier is of
the class label and that is one application,
38
00:03:08,090 --> 00:03:11,810
where you would like to see these kinds of
probabilities.
39
00:03:11,810 --> 00:03:17,150
So, one way to approach predicting probabilities
instead of just the class labels could be
40
00:03:17,150 --> 00:03:19,780
to treat it as a regression problem.
41
00:03:19,780 --> 00:03:26,380
So, let us stop and think about how you would
treat classification as a regression problem.
42
00:03:26,380 --> 00:03:32,190
So, normally in classification you have
labels: does not buy a computer, or buys
43
00:03:32,190 --> 00:03:33,190
a computer.
44
00:03:33,190 --> 00:03:40,680
So, instead of using these labels you could
use an indicator variable for the class.
45
00:03:40,680 --> 00:03:47,290
So, if the customer is going
to buy a computer, I would say the output is
46
00:03:47,290 --> 00:03:52,720
1; if the customer is not going to buy a computer,
I would say the output is 0.
47
00:03:52,720 --> 00:03:59,640
Now, your data gets transformed into a regression
problem instead of a classification problem,
48
00:03:59,640 --> 00:04:07,270
where you have 0's and 1's as your response
variables and the actual attributes of the
49
00:04:07,270 --> 00:04:11,130
data serve as the predictor variables for the regression
problem.
50
00:04:11,130 --> 00:04:15,150
And you could use linear regression here;
we all know about linear regression now, so
51
00:04:15,150 --> 00:04:16,450
let us use linear regression here.
52
00:04:16,450 --> 00:04:21,510
And finally, the fitted function f of x can
be interpreted as the probability
53
00:04:21,510 --> 00:04:28,919
that the output y will be 1 given the data
x, that seems like a reasonable way of doing
54
00:04:28,919 --> 00:04:29,919
classification.
55
00:04:29,919 --> 00:04:34,949
So, whenever the probability is greater than
0.5, you would say that x belongs to class
56
00:04:34,949 --> 00:04:43,150
1; when the probability is less than 0.5, you will
say x belongs to class 0. That is actually
57
00:04:43,150 --> 00:04:47,599
a valid way of doing classification using
linear regression, but there are some problems
58
00:04:47,599 --> 00:04:48,599
with that.
59
00:04:48,599 --> 00:04:49,650
So, what are the problems?
60
00:04:49,650 --> 00:04:56,919
So, linear regression is not really limited
in range: the output can go from minus infinity
61
00:04:56,919 --> 00:04:58,280
to plus infinity.
62
00:04:58,280 --> 00:05:03,670
So, typically this output cannot be interpreted
as a probability. What is more troublesome
63
00:05:03,670 --> 00:05:08,319
is the fact that the output can be negative,
and therefore this certainly cannot be interpreted
64
00:05:08,319 --> 00:05:14,229
as a probability even if you think of doing
some kind of normalization.
65
00:05:14,229 --> 00:05:19,270
Having said that, I should say it actually
works in practice if you do not really want
66
00:05:19,270 --> 00:05:23,460
to treat it as a probability but just as a
classifier: if the output is greater than
67
00:05:23,460 --> 00:05:28,370
0.5 take it as 1, and if it is less than 0.5
take it as 0. It works in practice,
68
00:05:28,370 --> 00:05:34,409
but not that well, and there is a way of doing
better than just using simple linear regression.
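The indicator-variable idea described above can be sketched in a few lines of Python. The data here is hypothetical, invented for illustration (x could be age, y = 1 means "buys a computer"); the point is that thresholding the fitted line at 0.5 classifies fine, yet the fitted values themselves escape [0, 1]:

```python
import numpy as np

# Hypothetical 1-D data: x is, say, age; y = 1 means "buys a computer", y = 0 means "does not".
x = np.array([22.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0])
y = np.array([ 0.0,  0.0,  0.0,  1.0,  1.0,  1.0,  1.0,  1.0])

# Ordinary least squares on the 0/1 indicator: f(x) = b0 + b1 * x.
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

f = b0 + b1 * x
pred = (f >= 0.5).astype(int)    # threshold the fitted value at 0.5

print(pred.tolist())             # the classifier gets the labels right here...
print(round(f.max(), 3))         # ...but f itself exceeds 1, so it is no probability
```

This mirrors the lecture's point: the thresholded classifier can work, but f(x) is not bounded to [0, 1] and so cannot be read as a probability.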
69
00:05:34,409 --> 00:05:41,919
So, I want to use linear regression still,
but I am going to do that on a transformed
70
00:05:41,919 --> 00:05:44,180
function.
71
00:05:44,180 --> 00:05:48,250
The transformation that we are going to
talk about here is called the logistic function
72
00:05:48,250 --> 00:05:52,999
or the logit function, so let us set up some
notation here.
73
00:05:52,999 --> 00:06:01,180
Let p of x denote the probability that the
output y is 1 given x, then the logit transformation
74
00:06:01,180 --> 00:06:07,770
is given by the logarithm of p of x divided
by 1 minus p of x.
75
00:06:07,770 --> 00:06:18,020
So, if you think about the binary problem,
p of x is the probability of the output
76
00:06:18,020 --> 00:06:22,590
being 1, and 1 minus p of x is the probability
of the output being 0.
77
00:06:22,590 --> 00:06:28,669
So, essentially you are taking this ratio
of the probability of success to the probability
78
00:06:28,669 --> 00:06:29,669
of failure.
79
00:06:29,669 --> 00:06:35,380
So, this ratio is known as the odds, and so the logit is
sometimes known as the log odds function.
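The logit and its inverse, as just defined, can be sketched directly; the probability values below are arbitrary examples:

```python
import math

# The logit (log-odds) transform from the lecture: logit(p) = log(p / (1 - p)).
def logit(p):
    # ratio of probability of success to probability of failure, then the logarithm
    return math.log(p / (1.0 - p))

def inv_logit(z):
    # the inverse mapping, taking any real number back into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

print(logit(0.5))               # 0.0: even odds
print(logit(0.9))               # log(9): odds of 9 to 1
print(inv_logit(logit(0.73)))   # round trip recovers 0.73
```

Note that the logit maps (0, 1) onto the whole real line, which is exactly why a linear model on the logit scale causes no range problems.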
80
00:06:35,380 --> 00:06:43,469
So, now, what we are going to do in logistic
regression is essentially try to fit a linear
81
00:06:43,469 --> 00:06:47,289
regression model with this logit function
as the output.
82
00:06:47,289 --> 00:06:53,990
So, essentially we end up saying that the
log of p of x by 1 minus p of x can be modeled
83
00:06:53,990 --> 00:06:59,520
as some linear function, which is beta naught
plus x times some beta 1.
84
00:06:59,520 --> 00:07:05,659
So, if you think about it you can solve for
p of x from this kind of an expression and
85
00:07:05,659 --> 00:07:09,850
then you end up having p of x looking like
a sigmoid function.
86
00:07:09,850 --> 00:07:16,099
So, e power beta naught plus x beta 1 divided
by 1 plus e power beta naught plus x beta 1, and you
87
00:07:16,099 --> 00:07:20,729
can simplify that and the functional form
that you are going to get is something like
88
00:07:20,729 --> 00:07:21,729
this.
89
00:07:21,729 --> 00:07:26,510
So, you can see that it behaves like a
probability function.
90
00:07:26,510 --> 00:07:32,559
So, it ranges only from 0 to 1, and by varying
the value of beta 1 what you are going to do
91
00:07:32,559 --> 00:07:38,479
is you are going to vary the slope, and by varying
the value of beta naught, you are going to
92
00:07:38,479 --> 00:07:41,740
vary where the function is going to rise.
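The sigmoid form just described can be sketched as follows; the beta values are illustrative, not fitted to any data:

```python
import math

# The sigmoid from the lecture: p(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)).
def p_of_x(x, beta0, beta1):
    z = beta0 + beta1 * x
    return math.exp(z) / (1.0 + math.exp(z))

# The output always stays strictly between 0 and 1:
print([round(p_of_x(x, -5.0, 1.0), 4) for x in (-10, 0, 5, 10, 20)])

# -beta0/beta1 is where the curve crosses 0.5; beta1 controls how steeply it rises.
print(p_of_x(5.0, -5.0, 1.0))    # 0.5: midpoint at x = 5
print(p_of_x(5.0, -50.0, 10.0))  # still 0.5: same midpoint, ten times steeper climb
```

Scaling both parameters by the same factor keeps the midpoint fixed while steepening the rise, which is the slope-versus-location behaviour the lecture describes.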
93
00:07:41,740 --> 00:07:47,879
So, this gives us a perfectly valid way of fitting
probabilities; there is no problem with interpreting
94
00:07:47,879 --> 00:07:51,129
p of x fitted in this fashion as a probability.
95
00:07:51,129 --> 00:07:56,400
So, earlier, trying to interpret f of x from
a linear regression model as a probability
96
00:07:56,400 --> 00:08:01,020
had problems; we could not do that because
it could be negative, as we saw earlier.
97
00:08:01,020 --> 00:08:06,979
But, in this case, since p of x is going to
be limited between 0 and 1, it might as well
98
00:08:06,979 --> 00:08:13,969
be interpreted as a probability. Is it the
right model for doing it? That is an open question;
99
00:08:13,969 --> 00:08:20,439
it depends on the domain that you are working
in. But it is fairly widely used, and it
100
00:08:20,439 --> 00:08:24,830
results in a very powerful
classifier which you can use in a variety
101
00:08:24,830 --> 00:08:30,650
of different settings; whether this assumption
is actually supported by the data or not, it
102
00:08:30,650 --> 00:08:32,289
seems to work well in practice.
103
00:08:32,289 --> 00:08:38,289
So, just as we did in the linear regression
case, we will predict the class as 1 if the
104
00:08:38,289 --> 00:08:43,630
probability p of x is greater than 0.5, and 0
otherwise; and you can show that
105
00:08:43,630 --> 00:08:47,980
this minimizes the misclassification rate
given the form of the predictor that we had
106
00:08:47,980 --> 00:08:50,970
on the previous slide. One thing to note:
107
00:08:50,970 --> 00:08:58,900
So, even though p of x is given by this exponential
function, the actual classification boundary...
108
00:08:58,900 --> 00:09:01,420
So, what is the decision boundary?
109
00:09:01,420 --> 00:09:05,660
The decision boundary is the set of points where
the probability of class 1 is equal to the probability
110
00:09:05,660 --> 00:09:10,290
of class 0, that is, where the two class
probabilities are equal.
111
00:09:10,290 --> 00:09:16,560
So, with a little bit of thought you
can see that the decision boundary is still
112
00:09:16,560 --> 00:09:22,630
given by a line, which is essentially beta naught
plus x beta 1 equal to 0.
113
00:09:22,630 --> 00:09:27,740
So, that gives you the decision boundary of
the logistic regression classifier as well,
114
00:09:27,740 --> 00:09:34,160
and hence this is also a linear classifier;
as I mentioned earlier, it is pretty powerful
115
00:09:34,160 --> 00:09:35,310
and works well in practice.
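The claim that the boundary is linear even though p(x) is not can be checked numerically; the beta values below are illustrative:

```python
import math

# Sketch: the boundary of the logistic classifier is linear even though p(x) is not.
def p_of_x(x, beta0, beta1):
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

beta0, beta1 = -3.0, 1.5
boundary = -beta0 / beta1        # solves beta0 + beta1 * x = 0, i.e. p(x) = 0.5

print(boundary)                                  # 2.0
print(p_of_x(boundary, beta0, beta1))            # 0.5: both classes equally likely here
print(p_of_x(boundary + 1, beta0, beta1) > 0.5)  # class-1 side of the line
print(p_of_x(boundary - 1, beta0, beta1) < 0.5)  # class-0 side of the line
```

In higher dimensions the same algebra gives a hyperplane: the exponential cancels exactly where the linear part beta naught plus x beta 1 is zero.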
116
00:09:35,310 --> 00:09:41,391
So, let us look at an example of what happens
when we fit data using logistic regression
117
00:09:41,391 --> 00:09:42,530
versus linear regression.
118
00:09:42,530 --> 00:09:51,740
So, here is a two-class problem; the data
points are either in blue or in red, and the
119
00:09:51,740 --> 00:09:57,080
shading of a region indicates the
class label that would be predicted by the
120
00:09:57,080 --> 00:09:59,640
classifier in that region.
121
00:09:59,640 --> 00:10:07,410
So, on the right-hand side you have
the prediction made by fitting
122
00:10:07,410 --> 00:10:14,120
a linear regression to the indicator variable,
and on the left-hand side you have the output
123
00:10:14,120 --> 00:10:15,630
given by logistic regression.
124
00:10:15,630 --> 00:10:22,530
So, you can see that linear regression actually
makes certain errors closer to the boundary.
125
00:10:22,530 --> 00:10:28,160
That is because linear regression is essentially
limited in the rate at which its curve can
126
00:10:28,160 --> 00:10:34,380
climb; when, closer to the boundary, there
are points that are bunched together from
127
00:10:34,380 --> 00:10:40,540
one class but a little further away from the
rest of that class, linear regression is not
128
00:10:40,540 --> 00:10:45,450
able to model those successfully, while logistic
regression, by virtue of the fact that it
129
00:10:45,450 --> 00:10:50,080
can have a steep climb from 0 to 1, is able
to capture those data points.
130
00:10:50,080 --> 00:10:53,810
So, this is essentially the difference between
linear and logistic regression.
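This difference can be reproduced on a small synthetic 1-D example (the data is invented for illustration, and the gradient-ascent fit is a sketch, not what real libraries use): class-1 points bunched near the boundary plus a few far away drag the linear fit down, while the logistic curve can climb steeply enough to catch them.

```python
import numpy as np

# Synthetic 1-D data: class-1 points bunched near the boundary (0.5 to 0.7)
# plus a few far away (20 to 30) that flatten a least-squares fit.
x = np.array([-4.0, -3.0, -2.0, -1.0, 0.5, 0.6, 0.7, 20.0, 25.0, 30.0])
y = np.array([ 0.0,  0.0,  0.0,  0.0, 1.0, 1.0, 1.0,  1.0,  1.0,  1.0])
X = np.column_stack([np.ones_like(x), x])

# Linear regression on the 0/1 indicator, thresholded at 0.5.
b_lin = np.linalg.lstsq(X, y, rcond=None)[0]
lin_pred = (X @ b_lin >= 0.5).astype(int)

# Logistic regression fitted by plain gradient ascent on the log-likelihood
# (a sketch only; real implementations use better optimizers).
b_log = np.zeros(2)
for _ in range(20000):
    p = 1.0 / (1.0 + np.exp(-(X @ b_log)))
    b_log += 0.01 * X.T @ (y - p) / len(y)
log_pred = (1.0 / (1.0 + np.exp(-(X @ b_log))) >= 0.5).astype(int)

print("linear errors:", int((lin_pred != y).sum()))    # misses the bunched class-1 points
print("logistic errors:", int((log_pred != y).sum()))
```

The linear fit misclassifies the bunched class-1 points because its slope is pinned down by the far-away points, while the logistic fit separates the two classes.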
131
00:10:53,810 --> 00:10:59,440
So far, I have been talking about binary classification
problems, because they are easier to illustrate
132
00:10:59,440 --> 00:11:04,790
and make it easier to understand the basics.
133
00:11:04,790 --> 00:11:11,000
But logistic regression can be extended
to multiple classes as well. Suppose there
134
00:11:11,000 --> 00:11:17,350
are k classes; then I would say that each class
gets a different set of parameters, beta
135
00:11:17,350 --> 00:11:22,530
naught and beta for that specific class.
136
00:11:22,530 --> 00:11:28,690
So, in that case, what happens is your probability
of...
137
00:11:28,690 --> 00:11:38,120
So, the probability that a particular class
c is the right class for a data point is given
138
00:11:38,120 --> 00:11:46,050
by e power beta naught of c plus x beta c,
where beta naught of c and beta c
139
00:11:46,050 --> 00:11:51,580
are essentially the parameters specific to
the class, divided by the normalizing
140
00:11:51,580 --> 00:11:56,560
factor, which is essentially the sum of such
numerators over all the classes.
141
00:11:56,560 --> 00:12:02,310
To make the problem somewhat easier, traditionally
the parameters of one of the classes, it could
142
00:12:02,310 --> 00:12:08,630
be either the first class or the last class,
which is class k, are set to 0. If you
143
00:12:08,630 --> 00:12:15,900
think about it, this really does not affect
the decision boundary that the classifier
144
00:12:15,900 --> 00:12:19,840
is going to learn; it will change the parameters
that you are learning, but the decision boundary
145
00:12:19,840 --> 00:12:24,090
that you learn will not be affected.
146
00:12:24,090 --> 00:12:29,690
So, in a sense, you will be left with fewer
parameters that you have to estimate. That
147
00:12:29,690 --> 00:12:34,590
is because you are talking about probability
distributions here, and we know that as soon
148
00:12:34,590 --> 00:12:41,670
as you fix the probabilities of n outcomes in
a discrete distribution over n plus 1 outcomes,
149
00:12:41,670 --> 00:12:47,660
the probability of the remaining outcome is
automatically fixed.
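The multiclass form, with one reference class pinned at zero, can be sketched as follows; the parameter values are invented for illustration:

```python
import math

# Sketch of the multiclass form: each class c has its own (beta0_c, beta_c), and
# the probability of class c is its exponentiated score over the sum of all scores.
def class_probs(x, params):
    scores = [math.exp(b0 + b * x) for (b0, b) in params]
    total = sum(scores)
    return [s / total for s in scores]

# Three classes; the last class is the reference, with its parameters pinned at 0.
params = [(1.0, 0.5), (-0.5, 1.2), (0.0, 0.0)]
probs = class_probs(2.0, params)
print([round(p, 3) for p in probs])   # a valid distribution: sums to 1

# Shifting every class's parameters by the same amounts leaves the probabilities
# unchanged, which is why one class can be pinned at zero without loss.
shifted = [(b0 + 3.0, b - 0.7) for (b0, b) in params]
print(all(abs(a - b) < 1e-9 for a, b in zip(class_probs(2.0, shifted), probs)))
```

The shift invariance is the lecture's point about fewer parameters: only differences between class parameters matter, so one class's parameters carry no extra information.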
150
00:12:47,660 --> 00:12:53,310
So far, we have been looking at the basic
model of logistic regression, and I will end
151
00:12:53,310 --> 00:12:57,420
this module here; in the next module we
will look at how to actually learn the parameters
152
00:12:57,420 --> 00:12:58,810
of this logistic regression classifier.