1
00:00:11,130 --> 00:00:15,560
Hi. So, in this module we will continue looking
at Classification and Regression Trees.
2
00:00:15,560 --> 00:00:20,700
So, in the previous module we looked at how
to build the tree, but we never looked at
3
00:00:20,700 --> 00:00:32,680
when to stop. So, if we continue growing a
very large tree, we might end up making
4
00:00:32,680 --> 00:00:40,700
a lot of decisions and essentially end up
specializing the tree to individual data
5
00:00:40,700 --> 00:00:44,280
points. So, it will lead to overfitting of
the tree to the data, and that is not a good
6
00:00:44,280 --> 00:00:50,980
place to be in. On the other hand, if you
stop growing your tree too soon, you might
7
00:00:50,980 --> 00:00:53,860
lose out on finding interesting patterns
in the data.
8
00:00:53,860 --> 00:01:06,310
So, for example, take the famous XOR case:
there is no single attribute on which you
9
00:01:06,310 --> 00:01:12,610
can split and get any kind of improvement
in your performance, because whether you split
10
00:01:12,610 --> 00:01:18,659
on x1 or whether you split on x2, half of
your data is going to be positive, the other
11
00:01:18,659 --> 00:01:24,859
half is going to be negative. So, you really
cannot hope to get any improvement by only
12
00:01:24,859 --> 00:01:28,430
splitting on one attribute. So, you just kind
of stop there, saying that, hey, there is no
13
00:01:28,430 --> 00:01:32,299
single attribute that gives me an
improvement in performance, and therefore let
14
00:01:32,299 --> 00:01:35,159
me stop.
So, you might lose out on interesting patterns
15
00:01:35,159 --> 00:01:41,979
if you are going to stop too early. So, you
cannot have a tree that is very, very specialized
16
00:01:41,979 --> 00:01:47,840
to individual data points, and you cannot stop
too early either. So, what do we do? You typically
17
00:01:47,840 --> 00:01:54,829
grow the tree until you come to a point where
there are only a few data points in each leaf. So,
18
00:01:54,829 --> 00:01:58,950
a leaf here would correspond to a region.
So, I keep growing the tree until I come to
19
00:01:58,950 --> 00:02:04,649
a point where each region has only a few data
points; typically you would like to have about
20
00:02:04,649 --> 00:02:10,200
four or five data points at least per region.
If you use some of the standard tools, say
21
00:02:10,200 --> 00:02:15,010
something like Weka: Weka actually stops
by default when the number of data points
22
00:02:15,010 --> 00:02:20,530
in a region goes down to two. So, it will stop
only at that point, and that is still a significant
23
00:02:20,530 --> 00:02:32,310
amount of overfitting. So, you grow a large tree,
where there are very few points per region,
24
00:02:32,310 --> 00:02:41,920
and then
you prune the tree: you basically start trying
25
00:02:41,920 --> 00:02:50,540
to collapse the internal nodes in the tree,
such that you come up with a better tree.
26
00:02:50,540 --> 00:02:56,260
So, we have looked at why you need a validation
set. So, you have training data and you
27
00:02:56,260 --> 00:02:57,380
have validation data.
28
00:02:57,380 --> 00:03:04,280
So, one way of looking at pruning is
to say that I will look at the performance
29
00:03:04,280 --> 00:03:14,510
of the tree on the validation data. So, I look
at the performance of the tree on the validation
30
00:03:14,510 --> 00:03:20,570
data, and then I will start collapsing some
of the internal nodes of the tree. So, what
31
00:03:20,570 --> 00:03:26,260
do I mean by that? Suppose I have a tree;
let us think of the tree that we had
32
00:03:26,260 --> 00:03:30,710
earlier. So, this is the tree that we had
earlier. So, one way of pruning something
33
00:03:30,710 --> 00:03:38,650
like this could be to say that, I am going
to collapse one of the internal nodes and
34
00:03:38,650 --> 00:03:48,230
replace it with a single region.
So, there I have pruned the tree, so this
35
00:03:48,230 --> 00:04:02,210
corresponds, in the picture, to doing
something like this: that is our R4
36
00:04:02,210 --> 00:04:10,180
prime. So, now, I can look at the performance
of this pruned tree on the validation data
37
00:04:10,180 --> 00:04:16,039
versus the performance of the original tree
on the validation data. So, now, if there
38
00:04:16,039 --> 00:04:22,500
is not a significant dip in the performance,
when I have removed one of the internal nodes,
39
00:04:22,500 --> 00:04:28,600
then I will keep this new tree.
So, that essentially tells me that whatever
40
00:04:28,600 --> 00:04:34,500
I did in terms of the previous test at t4
was something peculiar to the training
41
00:04:34,500 --> 00:04:38,380
data. So, now, I look at it in the validation
set, where it really does not look like I
42
00:04:38,380 --> 00:04:48,540
need to make that split. Or on the other hand,
there might be a slight decrease in performance,
43
00:04:48,540 --> 00:04:54,320
when I go from the original tree to the pruned
tree. Now, the question you have to ask yourself
44
00:04:54,320 --> 00:05:03,120
is, given the fact that I have reduced the
size of the tree, is it worth paying the penalty
45
00:05:03,120 --> 00:05:12,350
in terms of the slight reduction in performance.
So, there is a trade-off here, this is what we
46
00:05:12,350 --> 00:05:23,470
would call
47
00:05:23,470 --> 00:05:29,270
the cost-complexity trade-off. So, the cost is
what you incur in terms of the misclassification,

48
00:05:29,270 --> 00:05:34,500
or the misprediction in this case, that
you are going to make because of the

49
00:05:34,500 --> 00:05:38,880
reduction in the number of regions.
And the complexity is essentially the number
50
00:05:38,880 --> 00:05:42,870
of tests that you have to perform, or the size
of the decision tree that you are operating
51
00:05:42,870 --> 00:05:48,970
with. So, this is the cost-complexity trade-off,
and depending on which side of the

52
00:05:48,970 --> 00:05:53,770
trade-off you want to be on,
you can get more complex or less complex trees.
53
00:05:53,770 --> 00:06:00,810
So, there are many ways in which these kinds of pruning
techniques can be implemented, and one simple
54
00:06:00,810 --> 00:06:08,610
way is essentially to choose a validation
set and then have a trade-off between the prediction
55
00:06:08,610 --> 00:06:14,590
error of the pruned tree on the validation
set versus some measure of complexity of the
56
00:06:14,590 --> 00:06:19,730
tree. So, one measure of complexity of the
tree is the number of internal nodes that
57
00:06:19,730 --> 00:06:23,470
you have in the tree. So, you can basically
trade off the two and try to come up with
58
00:06:23,470 --> 00:06:34,760
a good smaller tree.
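The pruning procedure described above can be sketched in a few lines of plain Python. The dict-based tree representation and all names here are illustrative, not from any particular package: we collapse one internal node to a leaf and keep the smaller tree if validation accuracy does not dip.

```python
# A minimal sketch of validation-set pruning for a decision tree.
# Internal nodes are dicts with 'feature', 'threshold', 'left', 'right';
# leaves are dicts with 'prediction'. All names are illustrative.

def predict(node, x):
    """Route a data point down the tree to a leaf prediction."""
    while 'prediction' not in node:
        node = node['left'] if x[node['feature']] <= node['threshold'] else node['right']
    return node['prediction']

def accuracy(tree, X, y):
    return sum(predict(tree, x) == t for x, t in zip(X, y)) / len(y)

def pruned_copy(tree, target, leaf_prediction):
    """Return a copy of the tree with `target` collapsed into a leaf."""
    if tree is target:
        return {'prediction': leaf_prediction}
    if 'prediction' in tree:
        return dict(tree)
    return {'feature': tree['feature'], 'threshold': tree['threshold'],
            'left': pruned_copy(tree['left'], target, leaf_prediction),
            'right': pruned_copy(tree['right'], target, leaf_prediction)}

# Toy tree: split on feature 0 at 0.5; the right child splits again.
inner = {'feature': 1, 'threshold': 0.5,
         'left': {'prediction': 1}, 'right': {'prediction': 0}}
tree = {'feature': 0, 'threshold': 0.5,
        'left': {'prediction': 0}, 'right': inner}

# Validation data on which the inner split does not help: collapsing it
# into a leaf that predicts class 1 costs nothing here.
X_val = [(0.2, 0.1), (0.8, 0.2), (0.9, 0.3)]
y_val = [0, 1, 1]

full_acc = accuracy(tree, X_val, y_val)
small = pruned_copy(tree, inner, leaf_prediction=1)
small_acc = accuracy(small, X_val, y_val)
if small_acc >= full_acc:   # no significant dip -> keep the pruned tree
    tree = small
```

A full cost-complexity implementation would also charge each tree alpha times its number of leaves, so that a slight dip in accuracy can still be worth a much smaller tree.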
So far, we have been talking about regression
59
00:06:34,760 --> 00:06:44,990
trees, and we looked at how you would split
regions; once you have found the

60
00:06:44,990 --> 00:06:50,810
regions, what is the best prediction
that you have to make in each of those regions;

61
00:06:50,810 --> 00:06:57,110
and how we would find the regions: we adopted
a greedy approach, picking one variable

62
00:06:57,110 --> 00:07:04,860
at a time and then trying to find out the
best point at which to split on that variable.
63
00:07:04,860 --> 00:07:13,020
And we looked at the squared prediction error.
Essentially now, if you want to look at
64
00:07:13,020 --> 00:07:27,540
classification trees, the only thing we have
to ask ourselves is: what is the appropriate
65
00:07:27,540 --> 00:07:32,750
measure that we will have to look at in the
greedy search procedure? So, once we have decided
66
00:07:32,750 --> 00:07:37,070
on what the appropriate measure is, the
rest of the framework that we need to build
67
00:07:37,070 --> 00:07:42,430
classification trees is already in place.
So, we can do the greedy algorithm, and then
68
00:07:42,430 --> 00:07:46,169
once we know what appropriate measure
we are going to use, we can optimize that
69
00:07:46,169 --> 00:07:51,930
measure in the greedy algorithm and we can
use the same setup that we have for pruning.
70
00:07:51,930 --> 00:07:56,570
So, we have validation data and then we
have a cost-complexity trade-off, where the
71
00:07:56,570 --> 00:08:00,449
cost, instead of being measured by the squared
error, is going to be measured by whatever
72
00:08:00,449 --> 00:08:09,919
metric we choose for classification. So,
I am going to say that I will build the classification
73
00:08:09,919 --> 00:08:44,740
tree, where the prediction is what we will
call p hat. So, p hat here is the probability
74
00:08:44,740 --> 00:08:53,070
that a data point in region m belongs to class
k. So, take a data point in one of these
75
00:08:53,070 --> 00:08:57,930
regions, what is the probability that this
data point belongs to class k. So, I am going
76
00:08:57,930 --> 00:09:01,150
to look at all the data points that fall
in the region R m.
77
00:09:01,150 --> 00:09:07,550
So, I am going to look at all the data points
that fall in the region m and look at whether
78
00:09:07,550 --> 00:09:14,390
their class labels in the training data belong
to class k. So, if a point x i
79
00:09:14,390 --> 00:09:21,390
is in region m and its label is k, then this term will
be 1. So, the summation is essentially going
80
00:09:21,390 --> 00:09:28,410
to count the number of data points in region
m that belong to class one, or the number of data
81
00:09:28,410 --> 00:09:32,080
points in region m that belong to class
two, and so on and so forth. This expression
82
00:09:32,080 --> 00:09:37,060
is for class k and then I am going to divide
it by the total number of points that fall
83
00:09:37,060 --> 00:09:41,140
in region m.
So, N m is the total number of points in region
84
00:09:41,140 --> 00:09:46,241
m. So, this essentially gives me an empirical
estimate of the probability of a data point
85
00:09:46,241 --> 00:09:53,760
falling in region m belonging to class k and
if I am interested in actually returning that
86
00:09:53,760 --> 00:10:18,260
class label then... So, the class in region
m is going to be
87
00:10:18,260 --> 00:10:24,230
the class that I will assign to data points
in the region: essentially the k that gives
88
00:10:24,230 --> 00:10:29,680
me the maximum p hat value, which makes sense.
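As a quick sketch in plain Python (all names illustrative), the empirical estimate p hat for a region and the class assigned to that region look like this:

```python
from collections import Counter

def region_class_probs(labels):
    """p_hat_{mk}: the fraction of the N_m training points falling in
    region m whose label is k (an empirical probability estimate)."""
    n_m = len(labels)
    return {k: count / n_m for k, count in Counter(labels).items()}

def region_class(labels):
    """The class assigned to region m: the k with maximum p_hat_{mk}."""
    p_hat = region_class_probs(labels)
    return max(p_hat, key=p_hat.get)

# Labels of the training points that fall in one region R_m:
labels_in_region = ['a', 'a', 'b', 'a']
region_class_probs(labels_in_region)   # {'a': 0.75, 'b': 0.25}
region_class(labels_in_region)         # 'a'
```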
89
00:10:29,680 --> 00:10:41,540
So, let us look at the different error measures
that we can use. So, what is the most popular
90
00:10:41,540 --> 00:10:47,110
error measure in classification, or which is
the most appropriate error measure in classification?

91
00:10:47,110 --> 00:11:11,920
It is a simple thing called the misclassification
error. The misclassification error is given
92
00:11:11,920 --> 00:11:40,910
by... So, it is the number of times the actual label
does not match the prediction that we make
93
00:11:40,910 --> 00:11:47,040
with our classifier. So, class m is the prediction
made by our classifier in the m'th region.
94
00:11:47,040 --> 00:11:53,340
So, for all the data points in the m'th region,
whenever the true class label
95
00:11:53,340 --> 00:11:57,750
does not match the predictor class label.
So, I am going to sum it up divided by the
96
00:11:57,750 --> 00:12:03,930
total number of data points in that region.
So, this gives me the misclassification error
97
00:12:03,930 --> 00:12:07,250
for a specific region, and I can do this
over all regions; that gives me the total
98
00:12:07,250 --> 00:12:18,980
misclassification error. So, if you think
about it this is essentially... So, the number
99
00:12:18,980 --> 00:12:24,910
of times we should have been correct
is given by the p hat of the class that I am
100
00:12:24,910 --> 00:12:29,709
going to predict. So, this is the k that
gives me the maximum from the previous expression
101
00:12:29,709 --> 00:12:34,500
that we had. So, the misclassification
error will be 1 minus that maximum p hat. So, I can use
102
00:12:34,500 --> 00:12:40,149
this as my error measure.
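A minimal sketch of this misclassification error in plain Python; weighting each region by its share of the data when summing is a common convention, and is an assumption here rather than something spelled out in the lecture:

```python
def misclassification_error(labels):
    """1 - p_hat for the majority class: the fraction of points in one
    region that the region's constant prediction gets wrong."""
    majority_count = max(labels.count(k) for k in set(labels))
    return 1 - majority_count / len(labels)

def total_misclassification_error(regions):
    """Combine the per-region errors, weighting each region by the
    fraction of the data points it contains (a common convention)."""
    n = sum(len(r) for r in regions)
    return sum(len(r) / n * misclassification_error(r) for r in regions)

misclassification_error(['a', 'a', 'b', 'a'])            # 0.25
total_misclassification_error([['a', 'a'], ['b', 'b']])  # 0.0, pure regions
```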
So, once I estimate p hat, and once I estimate
103
00:12:40,149 --> 00:12:45,459
the arg max over k of p hat m k then 1 minus
that gives me the misclassification error
104
00:12:45,459 --> 00:12:51,790
for that region and I can sum this over all
the regions. But there are a couple of
105
00:12:51,790 --> 00:13:09,110
slightly more common error measures that are
used especially in the decision tree literature.
106
00:13:09,110 --> 00:13:23,420
So, one of these measures, called cross entropy
or deviance, leads us to something called
107
00:13:23,420 --> 00:13:27,200
the information gain measure for the
splitting criterion.
108
00:13:27,200 --> 00:13:54,560
So, the cross entropy is given by... So, this
p hat m k is the label distribution that
109
00:13:54,560 --> 00:14:02,019
we have inferred from the data that was given
110
00:14:02,019 --> 00:14:09,790
to us. But if we stop and think about it, given
sufficient training data,
111
00:14:09,790 --> 00:14:16,700
p hat m k is actually the true output label
distribution as well. So, this cross entropy
112
00:14:16,700 --> 00:14:27,110
term actually gives us in some sense the amount
of information that we need to encode the
113
00:14:27,110 --> 00:14:34,930
true label, given that you are in the region
m. So, this gives rise to the measure called
114
00:14:34,930 --> 00:14:42,240
information gain, which asks: if I had not
split into the following set of regions,

115
00:14:42,240 --> 00:14:47,240
how much information would I need to represent
the labels of the data, versus having split
116
00:14:47,240 --> 00:14:54,320
into all these regions? How much less information
do I need? If I do not know anything about
117
00:14:54,320 --> 00:15:00,209
where the data point lies on this entire plane
then I have some amount of uncertainty about
118
00:15:00,209 --> 00:15:06,910
what the label is. Once I know that the data
point lies in R1 or R2 versus the rest
119
00:15:06,910 --> 00:15:15,470
of the plane, I have less uncertainty about what
the label should be. So, given that
120
00:15:15,470 --> 00:15:21,740
when I have split the data into two regions
I have gained some information about what
121
00:15:21,740 --> 00:15:27,019
the label should be and that leads to this
notion of an information gain measure. And
122
00:15:27,019 --> 00:15:33,620
so it comes from this cross entropy term, and
we can use this also as a measure for splitting
123
00:15:33,620 --> 00:15:44,980
the tree in the greedy growing stage.
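The cross entropy of a region, and the information gain of a split derived from it, can be sketched as follows (plain Python; log base 2, so the numbers read in bits — a common but not mandatory choice):

```python
from math import log2

def cross_entropy(labels):
    """-sum_k p_hat_{mk} * log2(p_hat_{mk}) for one region: the amount
    of information needed to encode the labels in that region."""
    n = len(labels)
    probs = (labels.count(k) / n for k in set(labels))
    return -sum(p * log2(p) for p in probs)

def information_gain(parent, children):
    """Entropy of the unsplit region minus the size-weighted entropy of
    the child regions: how much less information we need once we know
    which region a data point falls in."""
    n = len(parent)
    child_term = sum(len(c) / n * cross_entropy(c) for c in children)
    return cross_entropy(parent) - child_term

# An XOR-style region: half positive, half negative.
parent = [0, 0, 1, 1]
cross_entropy(parent)                        # 1.0 bit of uncertainty
information_gain(parent, [[0, 0], [1, 1]])   # 1.0: the split is perfect
```

Note that in the XOR case a single-attribute split produces children that are still half-and-half, so the gain there is zero — exactly the early-stopping trap mentioned at the start.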
The last one is the Gini index, given by the expression
124
00:15:44,980 --> 00:15:55,120
p hat m k times 1 minus p hat m k, and it is
actually a measure of the disparity in the
125
00:15:55,120 --> 00:16:02,269
population distribution of a variable. So,
if you think about it, this is the probability
126
00:16:02,269 --> 00:16:11,520
that the label k appears in the distribution
in a particular region m, and this is the probability
127
00:16:11,520 --> 00:16:20,110
that the label k does not appear in region
m, and this expression will be maximum when
128
00:16:20,110 --> 00:16:24,730
they are equally likely, that is, when the probabilities
of label k appearing and label k not appearing
129
00:16:24,730 --> 00:16:29,730
in the region m are equal. So, that
is really a bad situation for us to be in.
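The Gini index is simpler still; a sketch in the same plain-Python style:

```python
def gini(labels):
    """sum_k p_hat_{mk} * (1 - p_hat_{mk}) for one region: highest when
    the classes are equally likely, zero when the region is pure."""
    n = len(labels)
    probs = (labels.count(k) / n for k in set(labels))
    return sum(p * (1 - p) for p in probs)

gini([0, 1])       # 0.5: the worst case for two classes
gini([0, 0, 0])    # 0.0: a pure region
```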
130
00:16:29,730 --> 00:16:36,660
So, in some sense, having a high value
here is essentially an indicator that this
131
00:16:36,660 --> 00:16:42,209
is not a great way of splitting the data into
different regions. So, you could use
132
00:16:42,209 --> 00:16:47,899
any one of these three measures for
growing your decision tree, and they have their
133
00:16:47,899 --> 00:16:57,589
own advantages and disadvantages. So, the
cross entropy and the Gini index are more sensitive
134
00:16:57,589 --> 00:17:06,769
to node probabilities than the misclassification
rate, but then
135
00:17:06,769 --> 00:17:12,750
quite often we find that these lead to much
better trees than using the misclassification
136
00:17:12,750 --> 00:17:16,620
rate directly.
So, that brings us to the end of our discussion
137
00:17:16,620 --> 00:17:20,500
on classification and regression trees, but
there are a few points that I would like to
138
00:17:20,500 --> 00:17:27,299
make about the use of tree-based classifiers
or regressors. So, the first thing is that I glossed
139
00:17:27,299 --> 00:17:33,400
over, a little bit, the problem of handling
discrete-valued attributes. I said you could choose
140
00:17:33,400 --> 00:17:39,660
red or not red; but if you think about a
discrete-valued attribute that has k possible
141
00:17:39,660 --> 00:17:45,750
values, you can see that there are combinatorially
many ways of dividing these values into
142
00:17:45,750 --> 00:17:51,960
two groups. So, you need to have a clever way
of handling this, and many of the decision
143
00:17:51,960 --> 00:17:58,650
tree packages do handle this in a meaningful
way.
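To see the combinatorial blow-up concretely: a discrete attribute with k values can be divided into two non-empty groups in 2^(k-1) - 1 ways, so exhaustively trying every grouping quickly becomes infeasible as k grows.

```python
def n_binary_groupings(k):
    """Number of ways to divide k attribute values into two non-empty
    groups: (2**k - 2) / 2 = 2**(k - 1) - 1, exponential in k."""
    return 2 ** (k - 1) - 1

n_binary_groupings(3)    # 3, e.g. {red} vs {green, blue}, and so on
n_binary_groupings(20)   # 524287: far too many to try one by one
```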
144
00:17:58,650 --> 00:18:03,860
The other thing we have to be very aware
of when you are dealing with trees is that
145
00:18:03,860 --> 00:18:10,020
trees are notoriously unstable. What I
mean by unstable is that if there is a very
146
00:18:10,020 --> 00:18:16,030
small change in the training data that you
give to the tree, the tree that you build
147
00:18:16,030 --> 00:18:20,050
out of the data could be very different. So,
you could delete a few data points and you
148
00:18:20,050 --> 00:18:23,960
might end up having a tree that looks very
different. So, this is in contrast to some
149
00:18:23,960 --> 00:18:29,770
of the earlier things that we have looked at,
like logistic regression or support vector
150
00:18:29,770 --> 00:18:34,860
machines, which are stable to the deletion
of one or two data points; but decision trees
151
00:18:34,860 --> 00:18:40,850
can be notoriously unstable.
So, one way that people get around this problem
152
00:18:40,850 --> 00:18:46,150
of instability in decision trees is to make
sure that you build many, many different decision
153
00:18:46,150 --> 00:18:52,660
trees with slightly different views of the
data and then combine the predictions made
154
00:18:52,660 --> 00:18:58,770
by all the decision trees into a single prediction.
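This "grow many trees on slightly different views of the data and combine their votes" idea is essentially bagging. A minimal sketch, with simple callables standing in for trees grown on different bootstrap samples (all names illustrative):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Resample the training data with replacement, giving each tree a
    slightly different view of the data."""
    return [rng.choice(data) for _ in data]

def bagged_predict(tree_predictors, x):
    """Majority vote over the individual trees' predictions."""
    votes = Counter(predict(x) for predict in tree_predictors)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(0.2, 0), (0.8, 1), (0.9, 1)]
sample = bootstrap_sample(data, rng)   # one bootstrapped training set

# Stand-ins for three trees: even if one unstable tree flips its answer
# on this point, the majority vote does not change.
trees = [lambda x: 1, lambda x: 1, lambda x: 0]
bagged_predict(trees, 0.5)   # 1
```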
So, that way we can actually make sure
155
00:18:58,770 --> 00:19:07,840
that the overall classifier that you build
is reasonably stable. Another point that
156
00:19:07,840 --> 00:19:12,120
we have to look at: suppose you are fitting
decision trees for regression. You can
157
00:19:12,120 --> 00:19:21,460
see that for each region we will be outputting
a specific constant value; at no point are we
158
00:19:21,460 --> 00:19:24,500
worried about how the value will change from
R1 to R2.
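The non-smoothness point can be seen directly: a regression tree's prediction is a step function, one constant per region. A toy sketch with two hypothetical fitted regions:

```python
def tree_predict(x):
    """A hypothetical fitted regression tree with two regions:
    R1 = {x <= 0.5} predicting 0.2, R2 = {x > 0.5} predicting 0.9."""
    return 0.2 if x <= 0.5 else 0.9

# Flat inside each region, with a jump at the boundary: nothing in the
# fitting procedure ties the value in R1 to the value in R2.
tree_predict(0.49), tree_predict(0.51)   # (0.2, 0.9)
```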
159
00:19:24,500 --> 00:19:29,980
So, we are only worried about what is happening
in R1; we are fitting a value in R1, and
160
00:19:29,980 --> 00:19:36,010
therefore, there is no notion of smoothness
in our fits. Therefore, the fits
161
00:19:36,010 --> 00:19:41,131
given by a decision tree can be pretty non-smooth.
If smoothness is a criterion, then we will
162
00:19:41,131 --> 00:19:46,190
have to either add some more regularizing
factors into decision trees, which makes them
163
00:19:46,190 --> 00:19:53,170
more complicated, or we will have to look at other
forms of regression. So, just keep in mind:
164
00:19:53,170 --> 00:19:58,840
decision trees are very powerful classifiers and
very powerful regressors; they are wonderful
165
00:19:58,840 --> 00:20:04,330
in terms of interpretability, but they also
come with their own caveats. See you later.