1
00:00:08,610 --> 00:00:15,969
Hi and welcome to this module on Classification
and Regression Trees. So, today we will look
2
00:00:15,969 --> 00:00:21,740
at a very simple, but powerful idea for building
both classifiers and regressors.
3
00:00:21,740 --> 00:00:37,530
The basic idea is that, you are going to partition
4
00:00:37,530 --> 00:00:58,560
the input space into rectangles. So, let us
imagine that you
5
00:00:58,560 --> 00:01:08,229
have a two dimensional input space x 1 and
x 2. So, you are going to try and partition
6
00:01:08,229 --> 00:01:21,380
this into rectangles by drawing axis-parallel
lines. So, why are we drawing axis-parallel
7
00:01:21,380 --> 00:01:28,840
lines here? Because these lines can be
specified very easily by just comparing against
8
00:01:28,840 --> 00:01:35,890
one of those dimensions of the input data.
So, for example, to draw this line all I need
9
00:01:35,890 --> 00:01:44,080
to specify is the intercept at the x 2 axis.
So, likewise to draw this line I need to specify
10
00:01:44,080 --> 00:01:52,189
the intercept of the x 1 axis. So, one way
of thinking about these kinds of partitioning
11
00:01:52,189 --> 00:02:01,350
of the input space using axis-parallel
lines is to think of making a series of decisions
12
00:02:01,350 --> 00:02:13,010
as to which side of a specific line your
data point lies on. So, you can think of this
13
00:02:13,010 --> 00:02:35,239
as follows: we will call this, say, t 1, this
point is t 2, this point is t 3, this is t
14
00:02:35,239 --> 00:02:40,790
4. These are the intercepts along the respective
x 1 and x 2 axes.
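The sequence of axis-parallel tests described above can be sketched as a small function. This is only an illustration: the threshold values t 1 to t 4 and the labels R 4 and R 5 for the last two regions are assumptions, since the actual values live on the board.

```python
def region(x1, x2, t1=0.3, t2=0.5, t3=0.7, t4=0.4):
    """Map a 2-D point to one of five rectangular regions using
    a series of axis-parallel threshold tests (binary decisions).
    The thresholds t1..t4 are made-up values for illustration."""
    if x1 <= t1:                              # is x1 <= t1?
        return "R1" if x2 <= t2 else "R2"     # then test x2 against t2
    if x1 <= t3:                              # t1 < x1 <= t3
        return "R3"
    return "R4" if x2 <= t4 else "R5"         # x1 > t3: test x2 against t4
```

Each call walks exactly one root-to-leaf path of the binary tree described next.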
15
00:02:40,790 --> 00:02:48,569
Then, one way of representing this is to think
of this as a series of tests or decisions
16
00:02:48,569 --> 00:02:55,510
that you are making. So, I can start off by
asking the question: is x 1 less than or equal
17
00:02:55,510 --> 00:03:03,890
to t 1. So, that is essentially saying here
is a line that represents x 1 equal to t 1
18
00:03:03,890 --> 00:03:08,940
that is, does your data point lie to the left of
the line or to the right of the line? So,
19
00:03:08,940 --> 00:03:15,700
if it lies to the left of the line, so this
will be an x, then I ask the second question,
20
00:03:15,700 --> 00:03:24,520
which is essentially: is x 2 less than or equal
to t 2.
21
00:03:24,520 --> 00:03:33,290
So, x 2 equal to t 2 is this line and I am
asking the question if the data point is above
22
00:03:33,290 --> 00:03:38,420
this line or if the data point is below this
line. So, if it is below the line, so then
23
00:03:38,420 --> 00:03:46,860
I get a yes for this question as well and
I will denote this by R 1, so this region
24
00:03:46,860 --> 00:03:54,050
is R 1. So, since this represents both x 1
being less than t 1 and x 2 being less than
25
00:03:54,050 --> 00:04:01,780
t 2, it is essentially bounded in this region.
So, likewise, if the point is actually
26
00:04:01,780 --> 00:04:08,330
greater than t 2, in this case this will
evaluate to no and I get to a region,
27
00:04:08,330 --> 00:04:15,599
which is called R 2.
So, what would this region be if you can think
28
00:04:15,599 --> 00:04:33,650
about it? This is essentially the region,
where x 1 is greater than t 1, but x 1 is
29
00:04:33,650 --> 00:04:44,490
lesser than t 3. So, I am going to call this
region R 3, so here is x 1 is greater than
30
00:04:44,490 --> 00:04:53,930
t 1, but lesser than t 3, so that would be
R 3 and if x 1 is greater than t 3, so in
31
00:04:53,930 --> 00:04:59,831
this case I am again splitting in to two regions.
So, essentially, so I am testing on x 2 and
32
00:04:59,831 --> 00:05:23,710
t 4. So, what is it that you notice about this
tree that we have drawn here?
33
00:05:23,710 --> 00:05:34,630
So, at every point I am asking a binary question,
is this yes or no? So, essentially it is a
34
00:05:34,630 --> 00:05:42,710
binary tree, at every point you divide into
two branches. And once I have divided into
35
00:05:42,710 --> 00:05:48,430
one of these branches, once I go down one
of these paths, from here on I am only
36
00:05:48,430 --> 00:05:54,759
concerned about data points that satisfy
the first question that I asked. So, from
37
00:05:54,759 --> 00:05:59,800
this point on in the tree I am only worried
about data points that are to the left side
38
00:05:59,800 --> 00:06:04,659
of the line here.
At this point in the tree I am worried about
39
00:06:04,659 --> 00:06:09,970
data points that are to the right of the line
here. So, that is a couple of distinguishing
40
00:06:09,970 --> 00:06:17,429
features of the decision tree that I am making
binary decisions and I am also looking at
41
00:06:17,429 --> 00:06:22,909
some kind of a divide-and-conquer approach.
Great. Now, what we have done is that we have
42
00:06:22,909 --> 00:06:28,679
a representation that allows us to split the
input space into different regions.
43
00:06:28,679 --> 00:06:33,259
So, depending upon the kind of problem that
we are solving whether it is a classification
44
00:06:33,259 --> 00:06:40,249
problem or whether it is a regression problem,
so we would like to fit a single value to
45
00:06:40,249 --> 00:06:54,949
each of these regions. So, if it is a regression
problem, we will be
46
00:06:54,949 --> 00:07:01,719
outputting a single value for the entire region.
So, if your data point falls in R 1 regardless
47
00:07:01,719 --> 00:07:06,280
of where in R 1 it is falling: the
data point could fall here, it could fall
48
00:07:06,280 --> 00:07:11,089
here, it could fall here, it could fall here
regardless of where in R 1 the data point
49
00:07:11,089 --> 00:07:32,039
lands up, I am going to predict the same output.
So, it would be the
50
00:07:32,039 --> 00:07:37,409
same real-valued output for each region. So,
in the case of classification, what would you expect
51
00:07:37,409 --> 00:07:53,339
it to be? It will be the same class label
for the region. So, regardless of where in
52
00:07:53,339 --> 00:07:58,610
R 1 the data point falls I will always output
the same class label for the classification
53
00:07:58,610 --> 00:08:06,469
problem. So, now we have two questions
that we have to answer in the case of decision
54
00:08:06,469 --> 00:08:15,509
trees. So, the first one is how do we form
the regions and the second question is, having
55
00:08:15,509 --> 00:08:20,979
formed the regions, how do I decide what is
the output that I am going to produce for
56
00:08:20,979 --> 00:08:24,069
that region.
So, we will look at each of these problems
57
00:08:24,069 --> 00:08:27,960
in turn, but the first thing I wanted to
mention before I go on to look at how we
58
00:08:27,960 --> 00:08:36,620
solve them is that decision trees are fantastic,
because they are the most interpretable of
59
00:08:36,620 --> 00:08:42,250
all of the classifiers that we are going to
look at, even more so than linear regression
60
00:08:42,250 --> 00:08:47,650
at some point. Because, if we think of the
way we constructed the decision tree, it seems
61
00:08:47,650 --> 00:08:52,990
like a very natural way to map it to how humans
think about making decisions.
62
00:08:52,990 --> 00:08:58,379
So, the
interpretability of decision trees is very
63
00:08:58,379 --> 00:09:06,149
high and in fact, that makes it one of the
classifiers or regressors of choice in a very
64
00:09:06,149 --> 00:09:11,430
wide variety of problems. And the second
advantage of decision trees is that they can
65
00:09:11,430 --> 00:09:20,060
work well with mixed-mode data. So, here the
example I gave you assumed that x 1 and x 2
66
00:09:20,060 --> 00:09:26,180
are actual numbers and you could pick arbitrary
comparison points. But x 1, x 2 need not be numbers;
67
00:09:26,180 --> 00:09:34,770
they could be a categorical variable like
color or it could be age, but represented
68
00:09:34,770 --> 00:09:40,850
as young, old, and middle-aged.
It need not necessarily be a number on which
69
00:09:40,850 --> 00:09:45,149
you have to run this kind of test; I could
compare whether the color is red or not red,
70
00:09:45,149 --> 00:09:51,709
or I could look at whether the person is young
or middle-aged versus the person is old. So,
71
00:09:51,709 --> 00:09:56,220
I could have any kind of binary test among
categorical attributes and then, I can still
72
00:09:56,220 --> 00:10:02,110
construct the decision tree. So, the first
advantage is one of interpretability; the second
73
00:10:02,110 --> 00:10:18,570
is that you can handle mixed-mode data.
So, now, let us step back and look
74
00:10:18,570 --> 00:10:36,430
at regression trees specifically. So, what
do we know about regression? Well, the
75
00:10:36,430 --> 00:10:41,529
goal of regression is essentially to minimize
the sum of squared errors. So, this is one of the
76
00:10:41,529 --> 00:10:45,660
goals of regression, to minimize the sum of squared
errors; we will stick with that. Of course, you can
77
00:10:45,660 --> 00:11:00,020
build regressors for whatever objective
you want to optimize. So, I want to fit a function
78
00:11:00,020 --> 00:11:27,000
f such that I minimize this sum of squared errors.
Let us suppose that I have a tree that has
79
00:11:27,000 --> 00:11:34,019
split my input space into M regions, which
I denote by R 1 to R M, so I have M regions
80
00:11:34,019 --> 00:11:40,339
in the input space. And then, for each of
these regions, so I am going to output a specific
81
00:11:40,339 --> 00:11:48,970
value, which I denote by C m. So, C m is the
value that I output for any data point
82
00:11:48,970 --> 00:11:58,629
that lies in region R m and I here is an indicator
function that denotes whether the data point
83
00:11:58,629 --> 00:12:04,569
lies in region R m or not.
So, essentially, what this summation tells
84
00:12:04,569 --> 00:12:11,630
us is that the input data point
x that comes to us is going to lie in one
85
00:12:11,630 --> 00:12:17,939
of these regions, 1 to capital M. I
do not know which region it is going to lie
86
00:12:17,939 --> 00:12:23,379
in, but this indicator function will tell
me which region it lies in. So, this summation
87
00:12:23,379 --> 00:12:30,220
will be non-zero for only one term, essentially
the region in which the data point lies.
88
00:12:30,220 --> 00:12:35,670
And therefore, the output will be the C value
corresponding to the region in which the
89
00:12:35,670 --> 00:12:39,329
data point lies.
Suppose x lies in R 2; then the output will
90
00:12:39,329 --> 00:12:46,810
be C 2. Suppose I am given this tree already,
that this region split has been decided for me;
91
00:12:46,810 --> 00:12:54,510
then we know what is the best value that we
have to output for C m, so what would be the
92
00:12:54,510 --> 00:13:15,209
best value you have to output for C m.
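In code, the best value for each region under the squared-error criterion is just the average of the targets that fall in it, as described next. A minimal sketch; the helper name `region_means` and the `region_of` callback are illustrative, not from the lecture:

```python
def region_means(X, y, region_of):
    """Given a fixed region split (region_of maps a point to its
    region label), the value to output in each region that
    minimizes the sum of squared errors is the average of the
    y values of the training points falling in that region."""
    sums, counts = {}, {}
    for x, target in zip(X, y):
        r = region_of(x)                      # which region does x land in?
        sums[r] = sums.get(r, 0.0) + target
        counts[r] = counts.get(r, 0) + 1
    return {r: sums[r] / counts[r] for r in sums}
```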
93
00:13:15,209 --> 00:13:21,769
Essentially, I go through my training data,
pick out all those x i's which lie in the
94
00:13:21,769 --> 00:13:27,370
m-th region, and look at the corresponding
95
00:13:27,370 --> 00:13:35,930
y and take the average of those and that will
be the value that I output if the data point
96
00:13:35,930 --> 00:13:43,160
lies in the region R m.
So, why is this a reasonable choice? Well,
97
00:13:43,160 --> 00:13:49,459
so one way to think about it is: when I am
trying to minimize the error in a specific
98
00:13:49,459 --> 00:13:55,569
region, let us say R 4, I do
99
00:13:55,569 --> 00:14:00,399
not have to worry about any of the training
points that lie outside of R 4, because the
100
00:14:00,399 --> 00:14:05,579
value I have to predict for them is completely
independent of all these other data
101
00:14:05,579 --> 00:14:09,810
points. So, I only have to worry about the
data points that lie within R 4 when I am trying
102
00:14:09,810 --> 00:14:16,399
to fit the value that I output
in R 4.
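The claim made next, that the average is the best prediction for squared error while the median is the right choice for absolute deviation, can be checked numerically on a small made-up sample:

```python
def sse(c, ys):
    """Sum of squared errors if we predict the constant c."""
    return sum((y - c) ** 2 for y in ys)

def sad(c, ys):
    """Sum of absolute deviations if we predict the constant c."""
    return sum(abs(y - c) for y in ys)

ys = [1.0, 2.0, 2.0, 3.0, 12.0]      # one outlier pulls the mean up
mean = sum(ys) / len(ys)             # 4.0
median = sorted(ys)[len(ys) // 2]    # 2.0

assert sse(mean, ys) < sse(median, ys)   # mean wins on squared error
assert sad(median, ys) < sad(mean, ys)   # median wins on absolute deviation
```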
103
00:14:16,399 --> 00:14:23,050
And among all the data points that lie in
R 4, the best prediction that I can make is,
104
00:14:23,050 --> 00:14:27,149
obviously, the average, in terms of minimizing
the sum of squared errors. So, if I have a different
105
00:14:27,149 --> 00:14:32,020
criterion, let us say I want to minimize the absolute
106
00:14:32,020 --> 00:14:37,499
deviation, then I probably have to predict
the median not the average. So, this part
107
00:14:37,499 --> 00:14:42,879
is fine; we have solved one of the two
problems. So, what were the two problems? One
108
00:14:42,879 --> 00:14:49,769
is, given the region split, what is the output
that I have to predict for each region.
109
00:14:49,769 --> 00:14:54,430
So, we know how to do that, at least in
the case of regression. Now comes the harder
110
00:14:54,430 --> 00:15:21,570
question: how are you going to find the regions,
how are you going to find the best R m's?
111
00:15:21,570 --> 00:15:29,279
In fact, finding the best R m's, finding
the best region split, is actually a combinatorial
112
00:15:29,279 --> 00:15:33,790
problem which is actually infeasible, and
it is going to take a very, very long time
113
00:15:33,790 --> 00:15:40,740
to find exactly the right set of regions.
So, quite often what people do is they adopt
114
00:15:40,740 --> 00:15:52,399
a greedy approach to finding these regions.
115
00:15:52,399 --> 00:16:12,610
So, what is that greedy approach? You
basically start off by considering
116
00:16:12,610 --> 00:16:16,829
a split variable. So, what is the split
variable? So,
117
00:16:16,829 --> 00:16:22,430
in the case of the example tree here, so the
split variable at this level is x 1 and the
118
00:16:22,430 --> 00:16:29,240
split variable here is x 2, and the split variable
here was again x 1. So, essentially, you consider
119
00:16:29,240 --> 00:16:51,399
some split variable and then
try to find the best split point.
120
00:16:51,399 --> 00:16:57,970
So, what is the split point? Again, in the
tree that you saw earlier, in the first
121
00:16:57,970 --> 00:17:03,790
level the split point was t 1. Likewise,
at each level we have to find out what
122
00:17:03,790 --> 00:17:11,449
the appropriate split point is. So, let us
take a simple example. I am going to
123
00:17:11,449 --> 00:17:49,029
define two sub-regions R 1
and R 2. So, R 1 is that part of the space
124
00:17:49,029 --> 00:17:55,120
where the j-th coordinate
of the variable x is less than or equal
125
00:17:55,120 --> 00:18:04,570
to some chosen value s. Likewise, R 2 is that
sub-region where the j-th coordinate of
126
00:18:04,570 --> 00:18:08,159
the variable x is greater than some chosen
value s.
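These two half-spaces can be written directly as filters on the data; `split` here is an assumed helper name, a sketch rather than anything from the lecture:

```python
def split(X, j, s):
    """Partition the data points into the two half-spaces
    R1(j, s) = {x : x[j] <= s} and R2(j, s) = {x : x[j] > s}."""
    R1 = [x for x in X if x[j] <= s]
    R2 = [x for x in X if x[j] > s]
    return R1, R2
```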
127
00:18:08,159 --> 00:18:17,500
So, now, what we are really trying to do is
find j and s such that we can solve
128
00:18:17,500 --> 00:18:27,179
for the best possible split. Just to give an
intuition here, in this case think of
129
00:18:27,179 --> 00:18:37,390
the original data, so I have chosen a split
point, which is t 1. So, the s in my choice
130
00:18:37,390 --> 00:18:46,350
is t 1, and this part is wherever
x 1 was less than t 1, and this part of
131
00:18:46,350 --> 00:18:50,929
the space was wherever x 1 was greater than
t 1. So, in our new terminology this will
132
00:18:50,929 --> 00:19:07,110
correspond to R 1 of (1, t 1) and this will correspond
to R 2 of (1, t 1); the 1 here is because we are looking
133
00:19:07,110 --> 00:19:12,950
at x 1, and the t 1 because that is the
split point you are considering.
134
00:19:12,950 --> 00:19:22,779
So, now we have to find both j and s such
that this expression is minimized. What is
135
00:19:22,779 --> 00:19:27,860
this expression? So, we know this: C 1
is the prediction I am going to make if the
136
00:19:27,860 --> 00:19:34,280
data point lies in the sub-region R 1; the
y i's here are for the data points that lie
137
00:19:34,280 --> 00:19:39,690
in the sub-region R 1, and the prediction I make
for them is C 1. So, this is essentially the
138
00:19:39,690 --> 00:19:45,440
squared error for all the data points that lie
in R 1, and this likewise is the squared
139
00:19:45,440 --> 00:19:47,990
error for all the data points that lie in R
2.
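Written out, the board expression being discussed here is the standard least-squares split criterion over the two sub-regions:

```latex
\min_{j,\;s}\left[\,\min_{c_1}\sum_{x_i \in R_1(j,s)}(y_i - c_1)^2
\;+\;\min_{c_2}\sum_{x_i \in R_2(j,s)}(y_i - c_2)^2\right]
```

where R 1(j, s) is the set of x with x j less than or equal to s, R 2(j, s) its complement, and each inner minimum is attained at the average of the y i in that sub-region, as discussed above.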
140
00:19:47,990 --> 00:19:54,850
So, we already saw that we can basically
find the C 1 that minimizes this error, and likewise
141
00:19:54,850 --> 00:20:02,279
find the C 2 that minimizes that error. Now,
my problem is to find j and s such that this
142
00:20:02,279 --> 00:20:08,610
entire expression is minimized. It sounds a little
daunting in the beginning, but if you think
143
00:20:08,610 --> 00:20:16,480
about it, it is not that hard. Why? Because we
are operating with a finite training set,
144
00:20:16,480 --> 00:20:20,820
the data that is given to us at the beginning,
from which we are going to build this tree,
145
00:20:20,820 --> 00:20:26,679
is going to be finite.
So, what does that tell us? For every x j that
146
00:20:26,679 --> 00:20:33,400
I can choose as my splitting variable there
are only finitely many points at which I
147
00:20:33,400 --> 00:20:39,230
have to consider a split. So, this essentially
tells me that for every j I choose there are only
148
00:20:39,230 --> 00:20:45,620
finitely many s that I have to try. Why
is that? Because there can only be finitely
149
00:20:45,620 --> 00:20:52,961
many different values that the variable x j
can take in your training data. So, essentially,
150
00:20:52,961 --> 00:21:02,000
what we do is that at every level in your
decision tree you basically look at this expression
151
00:21:02,000 --> 00:21:09,710
for all possible split points for every possible
splitting variable that you have in your data
152
00:21:09,710 --> 00:21:14,700
and then decide which is the best possible
splitting point.
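The finite search just described (for every splitting variable j, try every distinct observed value of x j as the split point s, and keep the pair with the smallest total squared error) can be sketched as follows. This is a brute-force illustration with assumed names, not an optimized implementation:

```python
def best_split(X, y):
    """Exhaustive greedy search: for every feature j, try every
    distinct observed value s of that feature as a split point,
    and keep the (j, s) whose two half-spaces give the smallest
    total sum of squared errors."""
    def sse(ys):
        c = sum(ys) / len(ys)          # best constant = region mean
        return sum((v - c) ** 2 for v in ys)

    best = None
    for j in range(len(X[0])):
        for s in sorted({x[j] for x in X}):
            left = [yv for x, yv in zip(X, y) if x[j] <= s]
            right = [yv for x, yv in zip(X, y) if x[j] > s]
            if not left or not right:
                continue               # all points on one side: skip
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, j, s)
    return best                        # (total SSE, j, s), or None
```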
153
00:21:14,700 --> 00:21:21,080
Once the best j and s are found, what you
do next is essentially go ahead and split
154
00:21:21,080 --> 00:21:27,240
the data into two parts, one corresponding
to the R 1 of the best j and s and the other corresponding
155
00:21:27,240 --> 00:21:33,770
to the R 2 of the best j and s. And now, you repeat
this process in both R 1 and R 2. So, now,
156
00:21:33,770 --> 00:21:37,430
if you think about it, the problem has become
much simpler, because the number of data
157
00:21:37,430 --> 00:21:43,230
points that you have to look at is much
smaller. And so,
158
00:21:43,230 --> 00:21:49,500
you just keep going until you come to a point
where you are happy to stop.
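Putting the pieces together, the repeated greedy splitting can be sketched as a short recursion. The stopping rule here (a minimum node size, or a node whose targets are all equal) is just one simple choice among the options alluded to; all names are illustrative.

```python
def grow(X, y, min_size=2):
    """Recursively apply the greedy split: find the best (j, s),
    partition the data into the two half-spaces, and recurse on
    each half. Leaves store the mean of the y values reaching them."""
    def sse(ys):
        c = sum(ys) / len(ys)                  # best constant = mean
        return sum((v - c) ** 2 for v in ys)

    # Stop if the node is too small or already pure.
    if len(y) < min_size or len(set(y)) == 1:
        return {"predict": sum(y) / len(y)}

    best = None
    for j in range(len(X[0])):                 # every split variable
        for s in sorted({x[j] for x in X}):    # every observed split point
            L = [i for i, x in enumerate(X) if x[j] <= s]
            R = [i for i, x in enumerate(X) if x[j] > s]
            if not L or not R:
                continue                       # degenerate split
            cost = sse([y[i] for i in L]) + sse([y[i] for i in R])
            if best is None or cost < best[0]:
                best = (cost, j, s, L, R)

    if best is None:                           # no usable split found
        return {"predict": sum(y) / len(y)}
    _, j, s, L, R = best
    return {"j": j, "s": s,
            "left": grow([X[i] for i in L], [y[i] for i in L], min_size),
            "right": grow([X[i] for i in R], [y[i] for i in R], min_size)}

def predict(node, x):
    """Follow the binary tests down to a leaf and return its value."""
    while "predict" not in node:
        node = node["left"] if x[node["j"]] <= node["s"] else node["right"]
    return node["predict"]
```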
159
00:21:49,500 --> 00:21:56,330
So, I will essentially stop here for
this module, and in the next module we will look at
160
00:21:56,330 --> 00:22:00,730
when to stop growing the tree and how we are
going to handle classification.