1
00:00:10,680 --> 00:00:18,300
Hello and welcome to this lecture on K-Nearest
Neighbors Techniques. So, in this lecture
2
00:00:18,300 --> 00:00:24,290
we will introduce a new supervised learning
approach called K Nearest Neighbors and a
3
00:00:24,290 --> 00:00:29,220
broader thing that we are trying to do with
this lecture is by introducing this technique
4
00:00:29,220 --> 00:00:33,190
and comparing it to something that you are
already familiar with, which is regression
5
00:00:33,190 --> 00:00:40,300
analysis. By doing that comparison, we are
hoping to kind of illustrate and help you
6
00:00:40,300 --> 00:00:49,890
appreciate two very different styles of performing
the supervised learning analysis,
7
00:00:49,890 --> 00:00:56,469
and therefore, the process of prediction itself.
Two very different approaches to prediction,
8
00:00:56,469 --> 00:01:01,360
and it is important to know that,
because up until now you have heard what supervised
9
00:01:01,360 --> 00:01:06,830
learning is in theory and the first technique
that we have gone into and explained is regression,
10
00:01:06,830 --> 00:01:13,000
which is one particular way of performing
a prediction. The hope is that by introducing k
11
00:01:13,000 --> 00:01:19,090
nearest neighbors, you see a very different
way of achieving the same goal and it is important,
12
00:01:19,090 --> 00:01:24,030
because you will realize that a lot of other
machine learning techniques take inspiration
13
00:01:24,030 --> 00:01:25,560
from this approach.
14
00:01:25,560 --> 00:01:32,280
So, this dichotomy, if you will, was something
that was first pointed out by Leo Breiman,
15
00:01:32,280 --> 00:01:37,439
a very famous statistician, where he said
there are two very prominent cultures of statistical
16
00:01:37,439 --> 00:01:44,400
modeling, and out here he is primarily talking
about supervised learning approaches and he
17
00:01:44,400 --> 00:01:49,430
highlights that there is some amount
of overlap between them. And I would not really

18
00:01:49,430 --> 00:01:55,150
say these two cultures are strictly distinct;
it is not so much a classification as

19
00:01:55,150 --> 00:01:59,080
two very different styles.
And the idea is that he says, look, there is
20
00:01:59,080 --> 00:02:05,799
the data modeling approach, which is what
you see, for instance, in standard regression,
21
00:02:05,799 --> 00:02:13,090
where you have some output variable
y and you kind of envision this output variable
22
00:02:13,090 --> 00:02:21,130
as some function of your input variable or
variables x. So, for instance you would say
23
00:02:21,130 --> 00:02:31,220
y is equal to some f of x, plus
there is, of course, some noise. In the case
24
00:02:31,220 --> 00:02:36,030
of linear regression, and here I have shown
you an example of multiple regression, this
25
00:02:36,030 --> 00:02:42,019
f of x becomes fairly straightforward: it is
some intercept plus beta 1 x 1, beta 2 x 2
26
00:02:42,019 --> 00:02:44,931
for however many input variables you have.
Suppose, you just have one input variable
27
00:02:44,931 --> 00:02:49,870
that would be beta naught plus beta 1 x 1,
it is fairly straightforward. So, the point
28
00:02:49,870 --> 00:02:54,030
he makes is a generalization, because
it applies to many other statistical methods,
29
00:02:54,030 --> 00:03:00,190
which he says: you have this whole breed
of approaches to supervised learning which
30
00:03:00,190 --> 00:03:07,120
capture a functional relationship between
y and x and that function is in some sense
31
00:03:07,120 --> 00:03:14,010
cast in stone, and the only job you are left
with is to go get the data and figure out
32
00:03:14,010 --> 00:03:18,810
what the parameters are of the function.
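In symbols, this data-modeling setup can be sketched as follows (the notation matches what the lecture describes: a presupposed functional form, where only the betas are estimated from data):

```latex
y = f(x) + \varepsilon,
\qquad
f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p,
\qquad
\varepsilon \sim \mathcal{N}(0, \sigma^2)
```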
These betas, the job is to take some data
33
00:03:18,810 --> 00:03:23,769
and figure out the betas. The functional
form itself, which in the case of

34
00:03:23,769 --> 00:03:28,940
linear regression is a linear
combination of the different inputs plus some

35
00:03:28,940 --> 00:03:34,959
Gaussian noise, is for instance cast in
stone. And he says that you have those approaches
36
00:03:34,959 --> 00:03:39,209
and then, you have an alternative breed of
approaches and he calls them the algorithmic
37
00:03:39,209 --> 00:03:42,889
modeling culture. So, the first one he calls
the data modeling culture,
38
00:03:42,889 --> 00:03:47,410
and the best example of the data modeling
culture would be multiple regression.
39
00:03:47,410 --> 00:03:53,840
So, we covered that in good detail in the
previous lectures. So, he says now look there
40
00:03:53,840 --> 00:03:58,980
is another approach, another approach is what
he calls the algorithmic modeling and there
41
00:03:58,980 --> 00:04:05,709
he says essentially the focus is not so much
on a rigid mathematical model that you have
42
00:04:05,709 --> 00:04:13,060
presupposed a priori and that
fixes the relationship between y and x,
43
00:04:13,060 --> 00:04:18,699
which is already cast in stone. But instead
you really have a set of algorithmic instructions
44
00:04:18,699 --> 00:04:24,699
or algorithmic ideas that relate the independent
and dependent variables.
45
00:04:24,699 --> 00:04:31,930
So, y is, in the loose sense of the word, a
function of x, but that is really captured
46
00:04:31,930 --> 00:04:41,530
by algorithms rather than rigid mathematical
models. Now, these two, there is a reason
47
00:04:41,530 --> 00:04:45,569
why at the start I said these are not like
two strict classifications, but more like two
48
00:04:45,569 --> 00:04:51,590
styles, only because you can describe any algorithm
in a mathematical form and perhaps you can
49
00:04:51,590 --> 00:04:55,919
describe the math in an algorithmic form.
But, what I am going to do is explain
50
00:04:55,919 --> 00:05:00,729
to you the k nearest neighbors approach and
we are going to talk about the k nearest neighbors
51
00:05:00,729 --> 00:05:05,840
approach as an example of the algorithmic
modeling and hopefully at that point, it becomes
52
00:05:05,840 --> 00:05:11,950
really clear as to what the stylistic difference
is between these two approaches.
53
00:05:11,950 --> 00:05:19,759
So, let us look at how a prediction task happens
with your linear regression approach. The
54
00:05:19,759 --> 00:05:27,270
way it happens is you have some data and right
now I am focusing on the graph on the left
55
00:05:27,270 --> 00:05:31,759
hand side of the screen, you have some data,
you do a regression analysis. And what does the
56
00:05:31,759 --> 00:05:36,990
regression analysis do? It fits a line through
the data. It fits a line, and if you are using
57
00:05:36,990 --> 00:05:41,180
ordinary least squares regression, it is
the line that minimizes the squared deviations
58
00:05:41,180 --> 00:05:46,920
between the data points and the line, and so
it chooses the line that achieves this target.
59
00:05:46,920 --> 00:05:51,069
So, you fit this line. Now, what do
you do when you need to make a prediction?
60
00:05:51,069 --> 00:05:55,590
And someone comes along and says, oh great,
so you have done a regression analysis. Could you
61
00:05:55,590 --> 00:06:03,009
tell me what y I should expect for a given
x? So, they come and give you an x, so they
62
00:06:03,009 --> 00:06:10,509
give you this x; let
us just call it x 1. So, they give you an
63
00:06:10,509 --> 00:06:16,350
x 1 and say, can you tell me what y I should
expect and what you do is you draw this straight
64
00:06:16,350 --> 00:06:24,480
line up, as it is already there, and you say,
let me see what value of y I would
65
00:06:24,480 --> 00:06:32,729
get based on my fitted model.
So, I basically take this x value, draw a
66
00:06:32,729 --> 00:06:40,250
line up to my regression fitted line and then
see what the height of that point is. So,
67
00:06:40,250 --> 00:06:48,480
this is my predicted y, so maybe
y hat 1. So, if somebody comes and gives
68
00:06:48,480 --> 00:06:55,310
me this x, that is all I need. I have
used all of this data to create a line, but
69
00:06:55,310 --> 00:07:00,870
then after that I can lose these data points,
these data points can be erased from my memory.
70
00:07:00,870 --> 00:07:07,210
All I need to do to make a prediction at this
point is I need to know this line and this
71
00:07:07,210 --> 00:07:12,169
line I have already created with the regression.
So, I have created this line, I have it from my
72
00:07:12,169 --> 00:07:17,979
regression, and someone comes and asks me a
question as to what y I would see for a particular
73
00:07:17,979 --> 00:07:31,669
x. As you know, this line has the form y is equal
to b naught plus b 1 x. So, someone comes
74
00:07:31,669 --> 00:07:37,789
and gives me a particular x. So, call it x
1. I will just substitute this x 1 out here,
75
00:07:37,789 --> 00:07:41,449
I know the values of b 1 and b naught from
the regression, that is how I was able to
76
00:07:41,449 --> 00:07:47,039
plot the line. I will then just get a y, because
I know all the three terms on the right hand
77
00:07:47,039 --> 00:07:49,830
side.
So, I will get a predicted y and that is my
78
00:07:49,830 --> 00:07:57,849
prediction of what y I will see for a given
x. So, that is how a prediction task takes
79
00:07:57,849 --> 00:08:02,279
place with the regression approach. Now, let
us take a look at how the k nearest neighbors
80
00:08:02,279 --> 00:08:09,300
approach works. The way the k nearest neighbors
approach works is that I basically do not
81
00:08:09,300 --> 00:08:14,870
fit a line. So, I do not have a line or a
mathematical form. The idea behind
82
00:08:14,870 --> 00:08:19,439
k nearest neighbors is that if you come to
me with a question, hey, can you predict
83
00:08:19,439 --> 00:08:27,280
for me what y I will see for a given x.
I will take the given x and I will ask myself
84
00:08:27,280 --> 00:08:35,050
what its nearest neighbors are. So, somebody
walks up to me and says, can you give me a prediction
85
00:08:35,050 --> 00:08:45,540
of y for this x. Let
us call it x star, rather than x 1,

86
00:08:45,540 --> 00:08:51,260
because x 1 tends to have the connotation
that it is the first data point. So, I am going

87
00:08:51,260 --> 00:08:59,339
to call it x star. So, for this x star, can
you tell me what my predicted output would
88
00:08:59,339 --> 00:09:03,940
be. Now, remember I do not have a fitted line,
this is a completely different approach to
89
00:09:03,940 --> 00:09:08,670
making predictions.
But, what I do is I go to this line and on
90
00:09:08,670 --> 00:09:16,779
my x axis, my input variable axis, I try
to find the nearest neighboring
91
00:09:16,779 --> 00:09:30,269
data points to the x in question. So, clearly
this data point is kind of close to this line
92
00:09:30,269 --> 00:09:35,660
and this data point is maybe close to this
line and so on. And the idea is that this
93
00:09:35,660 --> 00:09:41,300
K-NN approach has a parameter, which is k.
So, let us say I have chosen five as a parameter
94
00:09:41,300 --> 00:09:45,050
and we will understand what the five means,
it means I am looking for the five nearest
95
00:09:45,050 --> 00:09:52,020
neighbors defined by the distance from my
point in question, so this distance.
96
00:09:52,020 --> 00:09:57,290
So, I am going to see the five closest data
points and if I do that for instance, I see
97
00:09:57,290 --> 00:10:05,430
that these five data points are the ones that
are closest. So, what do I do once I have
98
00:10:05,430 --> 00:10:12,740
identified the data points? I take the average of
their y's to make a prediction. So, I take the
99
00:10:12,740 --> 00:10:25,529
average of the y's. So, let us call this one y
a, this one y b, and so on. So, I do the same
100
00:10:25,529 --> 00:10:31,880
thing for this data point and this data point,
and I take the average of all those five specific

101
00:10:31,880 --> 00:10:38,870
y's, and that is my predicted y; the predicted
y is the average. And there are many modifications
102
00:10:38,870 --> 00:10:42,320
to this, sometimes you do not take the average,
you might do something like a localized regression
103
00:10:42,320 --> 00:10:46,890
there; there are other ways where you do
not just take the arithmetic mean, you might
104
00:10:46,890 --> 00:10:51,940
take the median or do other things.
But, the core approach with k nearest neighbors
105
00:10:51,940 --> 00:10:59,459
is that you have not fit a line, you have not created
any mathematical abstraction of the data,
106
00:10:59,459 --> 00:11:02,930
you have not abstracted away from the data. For
instance, in the regression remember I told
107
00:11:02,930 --> 00:11:07,260
you if I need to do a prediction task, I do
not even need to remember the data points
108
00:11:07,260 --> 00:11:13,959
I can throw all my data points away. Once
I have used the data points to create this line,
109
00:11:13,959 --> 00:11:22,770
I now only need the line to make a prediction;
I now have a y is equal to f of x. So, I have
110
00:11:22,770 --> 00:11:30,959
this functional form and I just need to plug
in the values of x that I want to use for predictions,
111
00:11:30,959 --> 00:11:33,050
and now I will automatically get a predicted
y.
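As a small sketch of that plug-in step (the coefficient values below are made up purely for illustration, not from any real fit):

```python
# After fitting, the regression line y = b0 + b1 * x is all we keep;
# the training points themselves are no longer needed for prediction.
b0 = 2.0  # intercept from a hypothetical fitted model
b1 = 0.5  # slope from the same hypothetical fit

def predict(x):
    # Plug the given x into the fitted line to get the predicted y.
    return b0 + b1 * x

print(predict(4.0))  # plugging in x = 4 gives 2.0 + 0.5 * 4 = 4.0
```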
112
00:11:33,050 --> 00:11:38,639
Here we are not doing that abstraction, we
are retaining the entire set of data points;
113
00:11:38,639 --> 00:11:42,920
in the k nearest neighbors approach, we need
all our data points. Now, you come and ask
114
00:11:42,920 --> 00:11:48,000
me a question about a particular x, I am going
to go and look at the nearest neighbors of
115
00:11:48,000 --> 00:11:51,480
that x and if it is five nearest neighbors
I am going to look at the five nearest neighbors,
116
00:11:51,480 --> 00:11:54,520
if it is ten nearest neighbors I am going
to look at the ten nearest neighbors in all
117
00:11:54,520 --> 00:11:59,529
these cases I am going to look at nearest
neighbors and either take a vote
118
00:11:59,529 --> 00:12:03,290
if it is a classification problem or if it
is a regression problem I might
119
00:12:03,290 --> 00:12:10,880
choose to take something like an average.
So, that is fairly clear: the line that
120
00:12:10,880 --> 00:12:17,790
I show is approximately the
arithmetic mean, and we use that, and finally,
121
00:12:17,790 --> 00:12:21,410
that is the output that we are going to predict.
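A minimal sketch of that procedure for one input variable; the data points and the choice k = 5 are made up for illustration:

```python
# K-nearest-neighbors prediction for a single input variable:
# keep ALL the data, find the k points whose x is closest to x_star,
# and return the average of their y's.
def knn_predict(data, x_star, k=5):
    # Sort the (x, y) pairs by distance from x_star and keep the k nearest.
    neighbors = sorted(data, key=lambda p: abs(p[0] - x_star))[:k]
    # Average the neighbors' y values to form the prediction.
    return sum(y for _, y in neighbors) / k

data = [(1, 2.0), (2, 2.5), (3, 3.5), (4, 4.0), (5, 5.0), (6, 5.5), (9, 9.0)]
print(knn_predict(data, x_star=3.4))  # averages the y's of the 5 nearest x's
```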
122
00:12:21,410 --> 00:12:27,959
Now, that should kind of illustrate k nearest
neighbors for you; it is a fairly simple technique.
123
00:12:27,959 --> 00:12:33,920
A lot of the focus in using this approach
to solve problems is on the computation, because

124
00:12:33,920 --> 00:12:38,360
it is computationally very hard: you are
storing all the data and you need to sift

125
00:12:38,360 --> 00:12:42,899
through all the data to find the nearest neighbors.
A good amount of the focus is on the data mining

126
00:12:42,899 --> 00:12:48,199
aspect or the computational aspects of doing
something like that, but it is also very convenient
127
00:12:48,199 --> 00:12:51,820
when you have no idea as to what the functional
form is.
128
00:12:51,820 --> 00:12:55,880
So, the linear regression approach
can be very useful if you believe the relationship
129
00:12:55,880 --> 00:13:01,089
between y and x is a straight line. But
if you do not want to make that assumption
130
00:13:01,089 --> 00:13:06,450
at all, the k nearest neighbors approach is
flexible to any functional form that y and
131
00:13:06,450 --> 00:13:12,880
x could have. And there are two points to
note here in addition to what we have
132
00:13:12,880 --> 00:13:18,620
discussed. The first is that the k nearest neighbors
approach obviously works when you have multiple
133
00:13:18,620 --> 00:13:23,100
input variables. So, the examples that we
took in the previous case had one input
134
00:13:23,100 --> 00:13:27,850
variable and then there was this output variable.
135
00:13:27,850 --> 00:13:34,690
Now, what if you had multiple input variables?
The same idea applies. So, here we have one input
136
00:13:34,690 --> 00:13:38,569
variable on the x axis, one input variable
on the y axis. So, where is the y, where is
137
00:13:38,569 --> 00:13:43,560
your output variable? One way to think of
it is that it is coming out of the screen; essentially
we are not able, on a two dimensional screen, to
00:13:43,560 --> 00:13:48,540
we are not able to in a two dimensional screen
show you the three variables. But, if you
139
00:13:48,540 --> 00:13:54,060
assume that y is kind of coming out of the
screen, you could graphically represent it
140
00:13:54,060 --> 00:13:56,130
that way.
But, the important point that I wanted to
141
00:13:56,130 --> 00:14:01,470
make here is that if you needed to take, say,
a five nearest neighbors approach on two
142
00:14:01,470 --> 00:14:07,350
input variables, you can still do that. Let
us say you are interested in this particular
143
00:14:07,350 --> 00:14:13,610
point marked with the x. Then your nearest
neighbors to this x again get defined on the
144
00:14:13,610 --> 00:14:21,529
two dimensions. So, it could be this point,
this point, perhaps this point and what you
145
00:14:21,529 --> 00:14:27,300
can see is we are taking some form of distance,
maybe Euclidean distance, because the distance
146
00:14:27,300 --> 00:14:32,420
itself is not just on the x 1 axis, it is
not just on the x 2 axis, it is on both x
147
00:14:32,420 --> 00:14:37,709
1 and x 2 axis.
So, you can still use the concept of the input
148
00:14:37,709 --> 00:14:44,279
space, and once you have more than three, that is,
when you have four or more input variables,

149
00:14:44,279 --> 00:14:48,670
you are talking about a hyperspace. But
you could still use some simple measure of

150
00:14:48,670 --> 00:14:54,720
distance, such as Euclidean distance, or some
other distances; some famous ones are called

151
00:14:54,720 --> 00:15:02,010
the Manhattan distance, and some are called
the Mahalanobis distance

152
00:15:02,010 --> 00:15:06,819
as well.
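A small sketch of two of those distance measures for points with any number of input variables (Mahalanobis distance additionally rescales by the data's inverse covariance matrix, so it is omitted here):

```python
import math

# Euclidean distance: straight-line distance across all input axes.
def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Manhattan distance: sum of absolute differences along each axis.
def manhattan(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)  # two points in a 2-D input space
print(euclidean(p, q))  # → 5.0
print(manhattan(p, q))  # → 7.0
```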
Now, the good thing with this approach is
153
00:15:06,819 --> 00:15:11,920
that you can even use it for classification
problems. If you briefly remember, in the previous
154
00:15:11,920 --> 00:15:17,339
lecture we distinguished between supervised
learning tasks, which basically just means
155
00:15:17,339 --> 00:15:20,630
you have an output variable and you are trying
to predict that output variable from the input
156
00:15:20,630 --> 00:15:26,259
variables. But, this output variable can be
continuous and quantitative, which is where we
157
00:15:26,259 --> 00:15:31,220
primarily talked about regression and so on.
But, it could also be a categorical variable.
158
00:15:31,220 --> 00:15:38,910
So, I have taken an example out here on the right
hand side, where you have two input variables
159
00:15:38,910 --> 00:15:44,110
x 1 and x 2 are the two input variables, and your
output variable is a categorical variable
160
00:15:44,110 --> 00:15:53,120
with two classes. So, think of it as male
or female or buyer or non buyer in marketing
161
00:15:53,120 --> 00:15:57,879
contexts, or defective product versus non defective
product in a manufacturing context. So, let
162
00:15:57,879 --> 00:16:04,850
us say this output variable, which is a categorical
variable, is represented by either circles
163
00:16:04,850 --> 00:16:09,790
or squares.
So, the orange circles are one class of the
164
00:16:09,790 --> 00:16:16,689
output, the blue squares are another class
of the output and x 1 and x 2 are just your
165
00:16:16,689 --> 00:16:22,040
two input variables. Again, out here you might
be interested, for instance, in making a prediction
166
00:16:22,040 --> 00:16:31,509
out here and you might take the five nearest
neighbors; perhaps it will be this and may
167
00:16:31,509 --> 00:16:38,850
be this. So, these might be the five nearest
neighbors and because you need to predict,
168
00:16:38,850 --> 00:16:42,910
whether it will be class A or class B you
might want to take a voting approach or if
169
00:16:42,910 --> 00:16:48,019
your approach is to predict the probability
of it being circles or squares that is what
170
00:16:48,019 --> 00:16:52,189
I will call class A and class B, then you might
just take the ratio of the circles to squares
171
00:16:52,189 --> 00:16:56,829
in your nearest neighbors.
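Sketching that classification step, with hypothetical neighbor labels chosen for illustration:

```python
from collections import Counter

# Among the k nearest neighbors, count the class labels; the majority
# vote is the predicted class, and the ratios serve as probabilities.
def knn_classify(neighbor_labels):
    votes = Counter(neighbor_labels)
    majority = votes.most_common(1)[0][0]
    ratios = {c: n / len(neighbor_labels) for c, n in votes.items()}
    return majority, ratios

# Five hypothetical nearest neighbors: three squares, two circles.
labels = ["square", "circle", "square", "circle", "square"]
print(knn_classify(labels))  # majority "square"; ratios 0.6 and 0.4
```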
But, the idea is that this also works perfectly
172
00:16:56,829 --> 00:17:02,290
well, you do not need to just take an average,
you can take ratios or you can just
173
00:17:02,290 --> 00:17:07,630
make them all vote, where essentially the majority
wins. So, if you have three squares and two circles,
174
00:17:07,630 --> 00:17:11,850
so I am going to predict this is going to
be a square; you can use a voting approach
175
00:17:11,850 --> 00:17:17,310
to say it belongs to a particular class. So,
I hope that gave you an idea of k nearest
176
00:17:17,310 --> 00:17:22,030
neighbors, but also, more importantly, motivated
for you the idea that you have this regression
177
00:17:22,030 --> 00:17:26,120
style approaches, where you have this explicit
data model.
178
00:17:26,120 --> 00:17:31,450
So, these are your regression style approaches,
where you have a functional form and then you
179
00:17:31,450 --> 00:17:36,100
try and figure out the parameters. But, you
can also take a very different approach to
180
00:17:36,100 --> 00:17:42,220
the process of predictive analytics and
prediction, where
181
00:17:42,220 --> 00:17:47,230
the importance is not so much on a functional
form cast in stone, which is really
182
00:17:47,230 --> 00:17:51,970
an assumption you are making about the relationship
between your input and output variables. But,
183
00:17:51,970 --> 00:17:57,070
it really goes beyond that: you
do not want to make those assumptions, and
184
00:17:57,070 --> 00:18:04,360
you just want to take a more algorithmic approach
to this entire process.
185
00:18:04,360 --> 00:18:07,830
You will encounter a lot of the machine learning
techniques that we are going to be discussing
186
00:18:07,830 --> 00:18:13,910
later in this course really belonging to that
class. So, I hope that gave you an idea of both
187
00:18:13,910 --> 00:18:18,300
k nearest neighbors and this dichotomy that
you will keep seeing in machine learning.
188
00:18:18,300 --> 00:18:18,740
Thank you.