Hello. Today we will continue our lecture series on regression. In our last class we focused mainly on the ideas behind subset selection. Essentially, we were talking about this problem: you have a lot of input variables that you could use in your multiple regression model, so how do you go about picking which ones should stay in the model and which ones should go out? In discussing that, we touched on a broader topic, which is how we measure how good a model is, and how we conduct our regression practice so that we get good predictive accuracy and a model that is also useful for interpretation. In that regard, we will continue our lecture today and talk about two other simple measures, the R-squared and the adjusted R-squared.
The R-squared has at least been introduced to you; you have an idea of where to find it in your analysis output. We will talk a little bit about that, and we will treat both of them as further metrics for measuring how good our regression is. Then we will move on to a topic called regularization, which has to do with how you fine-tune the coefficients of your regression model.
When we were discussing subset selection, that is, how to choose which variables should stay in the model, we said we would make those choices based on some metric, not of the individual variables, but of the model as a whole. In that context, we primarily discussed the use of the p-value from the F-test that you get with your ANOVA. For instance, I have copied the same results from the example we worked through in our last class; the table here is just a copy from that previous Excel sheet. We were focusing on this p-value, the point being that the ANOVA is essentially a measure of how good the overall regression is, not of individual input variables, but we can use its p-value as a guide to decide which variables stay in and which stay out. That is the context in which we discussed best-subset regression, and backward, forward, and hybrid stepwise regression.
But today we are going to talk about two other measures that are also available to you, primarily the R-squared and the adjusted R-squared. The first thing to say is that the R-squared is not a very good measure of how good the overall regression model is, especially in a multiple regression context; I will explain in a moment what we mean by that.
Just as a reminder, simple regression means there is one input variable; multiple regression means there are multiple input variables. The R-squared is essentially a measure of how much of your variation, and here we are talking about variation in y, your output variable, can be explained by your model and how much cannot. Your output variable has some total amount of variation; if I take that overall variation of y and break it into two chunks, the amount that can be explained by my model, which is my regression equation, and the amount that cannot, then putting those two together should give me back the overall variation in y irrespective of x. So the R-squared is nothing but this ratio: how much variation is explained by the model, divided by the total variation.
Depending on the software and the textbook you use, people usually have a sum of squares total, which is the total variation, and a sum of squares model, which is the model variation. But I have also seen sum of squares regression used as a proxy for sum of squares model. If you say the total variation is the sum of squares total and the model variation is either the sum of squares model or the sum of squares regression, then the only other quantity left is the sum of squares error, or, as some people call it, the residual sum of squares; you might have seen it written as RSS or SSE.
Your R-squared value, therefore, is nothing mysterious: you can just look at your ANOVA output, even if the reported R-squared disappears for some reason. It is nothing but the ratio of the model sum of squares to the total sum of squares, which in this example looks like it is about 0.8 or 0.9. To give you a little more mathematical intuition, and the formulas used for calculating these three values: the sum of squares total is essentially the variability, the total squared distance of each data point from the grand mean, SST = Σ_i (y_i − ȳ)², where y_i is each actual data point and ȳ is the grand average; we look at the deviation of each actual data point from that grand average. With the sum of squares model (or sum of squares regression), you go not to each data point but to each predicted value ŷ_i, which is nothing but the value of y you predict for each x, and look at its deviation from the grand mean: SSM = Σ_i (ŷ_i − ȳ)². Finally, y_i − ŷ_i is how much each data point deviates from its predicted value, giving SSE = Σ_i (y_i − ŷ_i)². That should hopefully give you some intuition as to how these values are calculated.
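The decomposition just described can be checked numerically. Below is a small Python sketch on made-up data (the data, seed, and fitted line are illustrative assumptions, not the Excel example from class): it fits a least-squares line, computes SST, SSM, and SSE, and confirms that the explained and unexplained chunks add back up to the total, with R-squared as their ratio.

```python
import numpy as np

# Illustrative data: y is roughly linear in x, plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 3 * x + 1 + rng.normal(scale=0.8, size=30)

# Ordinary least-squares line and its predictions y_hat.
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

sst = np.sum((y - y.mean()) ** 2)      # total variation about the grand mean
ssm = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the model
sse = np.sum((y - y_hat) ** 2)         # residual (unexplained) variation

print(sst, ssm + sse)                  # the two chunks add back to the total
print(ssm / sst, 1 - sse / sst)        # two equivalent ways to compute R^2
```

The identity SST = SSM + SSE holds exactly for least-squares fits that include an intercept, which is why R-squared can be computed either as SSM/SST or as 1 − SSE/SST.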
So now, going back to the bigger topic: as a metric of how good a regression model is, especially in multiple regression, where there are many input variables, this R-squared is not great, because the R-squared by definition will only increase as you keep adding more and more variables. We will be discussing this in great detail in our lectures on the bias-variance dichotomy, which are coming up, but the core idea, just to give you some feel for what we are talking about, is that if you keep adding variables and keep adding complexity, at some point you should be able to explain away all the data that you have. In other words, take the example where you have 10 data points, and only 10 data points. You should technically be able to fit a line, or really a function, that goes through all 10 of those points if you simply say you are willing to fit a 9th-order polynomial. If the fitted model can get more and more complex, then for a finite set of data points you should be able to explain away everything. But that is not necessarily great, because when you go and try to predict with that model, you are not going to do too well; you have just overfitted to the data by constantly increasing the complexity of the model you are using.
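That 10-point, 9th-order-polynomial argument is easy to demonstrate. This Python sketch (the toy data are an assumption for illustration) fits a straight line and a 9th-order polynomial to the same 10 noisy points: the straight line leaves some variation unexplained, while the 9th-order polynomial passes through every point and pushes R-squared to essentially 1, even though it would generalize poorly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)               # exactly 10 data points
y = 2 * x + rng.normal(scale=0.3, size=10)  # a noisy linear relationship

def r_squared(y, y_hat):
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst

# Degree 1 captures the trend; degree 9 threads all 10 points exactly.
for degree in (1, 9):
    y_hat = np.polyval(np.polyfit(x, y, degree), x)
    print(degree, r_squared(y, y_hat))
```

Adding complexity can only push the in-sample R-squared up; the degree-9 fit is the extreme case where the fit to the training points is perfect and the predictive behavior is worst.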
Given that base, and given that we spoke loosely about this idea in previous lectures, when we said it is really important to figure out which subset of the variables you want in the model so you can throw the others away: this R-squared does not help, because R-squared will never decrease as you keep adding more input variables. Say I had a model with 5 input variables and I suddenly decide maybe I should have a 6th variable and add it; it is almost definite that the R-squared will increase. So the R-squared by itself is not a great measure of how good a model is.
Now, as we discussed, you could definitely use the p-value from the F-test of the ANOVA, but another metric that is also popular is called the adjusted R-squared, which is essentially nothing but a modified version of the R-squared, shown in the formula here: adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1). To compute it you can use your R-squared value directly, and the idea is that n is nothing but the number of data points, and k is the number of independent variables.
So, for instance, in the Excel example we were discussing yesterday, I believe we had 88 data points; I can also infer that from the total degrees of freedom. And k is 4, because we had four input variables; so n = 88 and k = 4. The idea here is that when you use the adjusted R-squared, it penalizes models where the number of variables you are using to explain the data is too high.
So you can compare the adjusted R-squared of one model against another where you used different numbers of input variables, and yes, the higher the adjusted R-squared, the better. Getting a very high R-squared is one way of getting a higher adjusted R-squared, but not at the cost of having too many variables: you want to keep k as small as possible and still have a very high R-squared. Notice that there is a minus in front of the k, but the whole expression also sits inside a 1-minus, so just be careful when you are trying to build an intuition for how increasing or decreasing these values is going to affect the adjusted R-squared. Again, this is mostly to introduce you to the concept: it is not the p-value of the F-statistic, as we discussed for best-subset selection and stepwise regression, but the adjusted R-squared is available as well, and it is just one of them; there are other metrics too that can be used to evaluate a regression model.
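As a quick numeric illustration of the penalty, here is a Python sketch using n = 88 and k = 4 from the class example; the R-squared values themselves are assumed for illustration, since the point is only how the adjustment behaves.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# n = 88 data points, k = 4 input variables, as in the lecture's example.
# The R^2 of 0.85 is an assumed, illustrative value.
print(adjusted_r_squared(0.85, n=88, k=4))

# A 5th variable always nudges R^2 up a little, but the (n - k - 1)
# penalty can still leave the adjusted R^2 lower than before.
print(adjusted_r_squared(0.851, n=88, k=5))
```

So a model with one more variable and a marginally higher R-squared can come out worse on adjusted R-squared, which is exactly the penalization described above.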
We now go on to the topic of regularization; you can also use the term coefficient shrinkage to express the same concept. What regularization does has a huge overlap with what you try to do with subset selection, the core idea being: how can I simplify my model in some sense? Because I care about predictive accuracy, and I care about interpretation; I do not care about just trying to get this function to go through all my data points.
So how can I simplify the model in some way? One way of simplifying the model is to use fewer variables; that is what we saw in subset selection. But once you fix the number of variables, once you say these are the variables that need to go in the model, is there some way to determine the coefficients more smoothly? We saw how the coefficients were determined through an optimization process; can I go directly into that optimization and put in my constraint, saying: I do want you to optimize something, but at the same time I want you to try to keep it as simple as possible, without overfitting the data.
So in some sense the problem of regularization is a problem of saying: I have an algorithm and I have a fixed set of input variables, so I am no longer bargaining over which variables to throw out and which to keep in. Is there some way, in my fitting methodology itself, of preventing this problem of overfitting, of preventing the functional form from being so hardwired to the data that it does not do a good job of predicting? Can I impose penalties upon myself to avoid overfitting?
One way in which overfitting happens is in problems with what we call multicollinearity. Multicollinearity is the idea that many of my input variables are correlated with each other, and when you have that problem, determining the betas through a regular regression process can be poor, meaning there can be a lot of variance from one sample to the next. Ideally, if I run a certain regression process on one sample of data and get a fit, then taking another sample of data should give me a similar fit. If the second sample gives me a completely different set of betas, that is not a very stable process, and that tends to happen with least-squares regression when you have multicollinearity, meaning lots of highly correlated input variables. Take the example we have on the slide: you get an equation which says y = 4A − 2B. Now imagine a world in which A and B were so highly correlated that they were practically the same variable. Then a fitted model that says y = 10A − 8B should also give you essentially the same results, in the sense that A and B are practically the same data points, correlated at 1; and assuming they are also equal in magnitude, in 4A − 2B you can substitute A for B and you net 2A, which is the same net you get from the second equation.
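The instability described above can be reproduced in a few lines. In this Python sketch (simulated data; the sample size, noise levels, and seed are all assumptions for illustration), A and B are nearly the same variable, and two independent samples from the same process can give very different individual betas even though the net coefficient stays stable near 2.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_ols(n=50):
    # A and B are nearly identical (correlation essentially 1).
    A = rng.normal(size=n)
    B = A + rng.normal(scale=1e-3, size=n)
    y = 4 * A - 2 * B + rng.normal(scale=0.5, size=n)  # net effect: 2A
    X = np.column_stack([A, B])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b1, b2 = fit_ols(), fit_ols()  # two independent samples, same process
print(b1, b2)                  # the individual betas can swing wildly
print(b1.sum(), b2.sum())      # but their sum, the net effect, stays near 2
```

Only the combination A + B is well identified by the data; least squares is free to trade one coefficient off against the other, which is exactly the instability regularization targets.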
So both equations are the same. Now, the idea with something like ridge regression is that in cases like this you want to have the simplest possible equation, and you want the lowest possible magnitudes for your coefficients. So you would be very happy with y = 2A, just keeping it as simple as that, or you can take it as y = 2A − 0B. In addition to choosing which variables you want to keep, you want to keep their magnitudes, the magnitudes of the coefficients, as low as possible.
So with offsetting coefficients, you do not want to end up with a regression equation that says something like y = 1002A − 1000B, with really large opposing terms: you are blowing things out of proportion, it will not generalize very well, and so on. So how can we achieve this? How can we achieve, in really simple terms, this goal of not allowing highly correlated variables to take up opposing sides with really large coefficients?
And the way we will do that is by taking a standard least-squares minimization problem, and that is essentially what I have written as the objective function here. The objective function here is no different for ridge regression: at least the way I have written it out, it is no different than for least-squares minimization, because all I am saying is, let us minimize, for each data point, the square of the actual y minus the predicted y, where the predicted y is nothing but β₀ + β₁x₁ + β₂x₂ + β₃x₃, and so on.
So essentially j goes from 1 to p, where p is the total number of independent variables, and n is the total number of data points, so you do this for each data point: you look at the deviation of the actual value from the fitted, that is, the predicted, value, and it is the square of this deviation, summed over the data, that is minimized in ordinary least-squares regression. But in addition to that, what we do with ridge regression is carry out this optimization with a constraint, and the constraint can be written like this.
The constraint says: take each beta, which is each coefficient, square it, and require the sum of those squares to be less than some value s, that is, Σ_j β_j² ≤ s. It is not that you have a particular number s in mind. What winds up happening is that you can rewrite this optimization without the constraint, by instead adding, out here in the objective being minimized, a penalty term λ Σ_j β_j².
So, this is kind of like, you might have come
177
00:18:18,410 --> 00:18:23,309
across this with like Lagrange multipliers,
when you have a single constraint on the coefficients.
178
00:18:23,309 --> 00:18:30,610
You can just rewrite the objective function
to have that constraint integrated in to the
179
00:18:30,610 --> 00:18:37,330
thing and essentially, it is like solving
just the minimization without the constraint.
180
00:18:37,330 --> 00:18:42,750
But, essentially the solution to that is what
will ensure that your data’s themselves
181
00:18:42,750 --> 00:18:48,360
are stable and you know, not taking really
large values.
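For the squared penalty, the penalized objective actually has a closed-form solution, β = (XᵀX + λI)⁻¹Xᵀy, obtained by setting its gradient to zero. This Python sketch (simulated, nearly collinear data; the numbers and the choice of λ are illustrative assumptions) compares plain least squares with ridge on the same data: ridge splits the net effect between the two near-duplicate variables and keeps both betas small and stable.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - X beta||^2 + lam * sum(beta_j^2).

    Setting the gradient to zero gives beta = (X'X + lam*I)^{-1} X'y;
    lam = 0 recovers ordinary least squares.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
A = rng.normal(size=200)
B = A + rng.normal(scale=1e-3, size=200)             # nearly collinear with A
y = 4 * A - 2 * B + rng.normal(scale=0.5, size=200)  # net effect: 2A
X = np.column_stack([A, B])

print(ridge_fit(X, y, lam=0.0))  # least squares: betas can offset wildly
print(ridge_fit(X, y, lam=1.0))  # ridge: small, balanced betas, summing near 2
```

Because the penalty charges for the squared magnitude of every coefficient, large offsetting pairs like (1002, −1000) become expensive, and the solution collapses toward the simple y ≈ 2A form the lecture prefers.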
Another way to achieve the same thing is lasso regression; again, this slide should not say ridge, it should say lasso. So with the lasso what you see is the same thing: you have the same objective function, but now it is subject to the constraint that the sum of the absolute values of the betas, Σ_j |β_j|, is less than some value s. Again, out here you can wind up rewriting your objective function, but the rewriting does not do you too much good, because you cannot do the same trick from standard calculus with Lagrange multipliers that worked when you had a β²; when you have a mod-beta, it just becomes computationally harder.
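Because the absolute-value term has no derivative at zero, the lasso is usually solved numerically. A standard approach, equivalent in spirit to (but not the same as) the generic spreadsheet optimizer the lecture suggests, is coordinate descent with soft-thresholding; in this Python sketch (simulated nearly collinear data, with λ and all constants chosen only for illustration) the lasso pushes one of the two redundant coefficients to essentially zero.

```python
import numpy as np

def lasso_fit(X, y, lam, n_iter=500):
    """Minimize ||y - X beta||^2 + lam * sum(|beta_j|) by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution removed.
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r
            z = X[:, j] @ X[:, j]
            # Soft-thresholding: the one-dimensional problem in beta_j has
            # this exact solution; small effects are set exactly to zero.
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / z
    return beta

rng = np.random.default_rng(3)
A = rng.normal(size=200)
B = A + rng.normal(scale=1e-3, size=200)             # nearly collinear pair
y = 4 * A - 2 * B + rng.normal(scale=0.5, size=200)  # net effect: 2A
X = np.column_stack([A, B])

print(lasso_fit(X, y, lam=40.0))  # one coefficient carries the net effect
```

This is the practical difference from ridge: the absolute-value penalty tends to zero out redundant coefficients entirely, doing variable selection and shrinkage at the same time.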
But, just as we saw in the simple regression case, if you have an Excel sheet or MATLAB, you can just run the optimization and put the constraint in directly, and as long as your optimization technique is good enough, you should be fine. I hope that gives you some idea of regularization techniques, and I look forward to seeing you in the next class. Thank you.