1
00:00:15,009 --> 00:00:16,009
Descriptive Statistics: Summary Statistics:
Measures of Dispersion
2
00:00:16,009 --> 00:00:19,390
Hello and welcome to our course Introduction
to Data Analytics.
3
00:00:19,390 --> 00:00:26,899
In this lecture, we continue our work in Descriptive
Statistics to give you a timeline of where
4
00:00:26,899 --> 00:00:28,649
we are.
5
00:00:28,649 --> 00:00:37,070
Within descriptive statistics, we have discussed
the various graphical and visualization techniques
6
00:00:37,070 --> 00:00:42,910
and the second half of descriptive statistics
deals with Summary Statistics: the use of
7
00:00:42,910 --> 00:00:47,569
numbers to describe and summarize data.
8
00:00:47,569 --> 00:00:51,760
Within summary statistics, in our previous
lecture we spoke about Measures of Central
9
00:00:51,760 --> 00:00:53,320
Tendency.
10
00:00:53,320 --> 00:01:00,379
In this lecture, we will be concluding the
use of Summary Statistics with discussion
11
00:01:00,379 --> 00:01:02,160
on Measures of Dispersion.
12
00:01:02,160 --> 00:01:10,750
So, we should be at this point fairly familiar
with the use of this data set, essentially
13
00:01:10,750 --> 00:01:19,780
the data set is just a sample, which talks
about different data points over a certain
14
00:01:19,780 --> 00:01:27,620
range and the histogram that you see to the
right hand side of this is a histogram that
15
00:01:27,620 --> 00:01:28,940
is generated from this data set.
16
00:01:28,940 --> 00:01:36,800
So, we covered histograms during our lecture
on graphical techniques and we used them extensively
17
00:01:36,800 --> 00:01:40,680
in our discussions of measures of central
tendency.
18
00:01:40,680 --> 00:01:46,740
Again, to be very clear: there are measures of central
tendency and measures of dispersion, and the
19
00:01:46,740 --> 00:01:50,040
specific metrics that we discuss under
each.
20
00:01:50,040 --> 00:01:56,080
So, for instance, in measures of central tendency
we spoke about the mean, median, and mode, and today
21
00:01:56,080 --> 00:02:00,360
we are going to be speaking about some metrics
associated with dispersion.
22
00:02:00,360 --> 00:02:05,380
None of these metrics in any way, shape,
or form need this histogram.
23
00:02:05,380 --> 00:02:14,120
They directly operate on this data set and
in some sense, you might say why even talk
24
00:02:14,120 --> 00:02:15,310
about the histogram.
25
00:02:15,310 --> 00:02:21,590
And the idea is that you are absolutely right,
we do not need this histogram, but it really
26
00:02:21,590 --> 00:02:29,959
helps to explain the concepts, and it is also a
healthy way to start thinking about these
27
00:02:29,959 --> 00:02:34,409
metrics and start thinking about these distributions.
28
00:02:34,409 --> 00:02:39,060
It will also help us in the long run, when
we cover concepts in probability distributions.
29
00:02:39,060 --> 00:02:46,470
So, now, having said that let us talk about
what measures of dispersion seek to capture.
30
00:02:46,470 --> 00:02:52,090
In the last lecture, we spoke about how measures
of central tendency try to capture in some
31
00:02:52,090 --> 00:03:01,969
sense a central value; some central value
within this range of values that this data
32
00:03:01,969 --> 00:03:03,560
set takes up.
33
00:03:03,560 --> 00:03:13,340
So, the range of values that this data set takes
up is shown here and in some sense, the histogram
34
00:03:13,340 --> 00:03:17,340
captures their likelihood on this axis.
35
00:03:17,340 --> 00:03:23,769
So, the range is here on the x axis, and the
y axis is in some sense, the likelihood of
36
00:03:23,769 --> 00:03:30,150
seeing that data, and we spoke about how measures
of central tendency try to capture what
37
00:03:30,150 --> 00:03:36,329
appears to be a central value, and we
spoke about different metrics that do that.
38
00:03:36,329 --> 00:03:44,939
Measures of dispersion really talk about
how the data is dispersed around this value,
39
00:03:44,939 --> 00:03:49,629
how does the data deviate from this value.
40
00:03:49,629 --> 00:03:56,329
For instance, if every single data point and
let us for now assume that 10 is our measure
41
00:03:56,329 --> 00:03:59,340
of central tendency.
42
00:03:59,340 --> 00:04:03,569
One measure of central tendency is the mean,
so for now let us just say we are using the
43
00:04:03,569 --> 00:04:04,569
mean.
44
00:04:04,569 --> 00:04:16,720
So, if 10 is the mean, then if every single
data point in this data set was equal to 10
45
00:04:16,720 --> 00:04:20,540
and it is not, in the data shown here, but suppose every
value is equal to 10.
46
00:04:20,540 --> 00:04:27,830
Then, there would be no deviation of data
from 10, you would just see a single tall
47
00:04:27,830 --> 00:04:36,590
line in this histogram; none of these other
bars would exist at all, but
48
00:04:36,590 --> 00:04:38,040
that is not the case.
49
00:04:38,040 --> 00:04:44,790
Typically, in most data sets, the values are
going to be different and you might have some
50
00:04:44,790 --> 00:04:49,200
measure of central tendency, but there is
going to be some amount of deviation of the
51
00:04:49,200 --> 00:04:53,970
data points on either side of the central
value.
52
00:04:53,970 --> 00:05:01,120
Now, measures of dispersion try to capture
that: do the values deviate a lot from the
53
00:05:01,120 --> 00:05:09,350
center, or do they deviate very little from
the center? That is what measures of dispersion
54
00:05:09,350 --> 00:05:11,350
seek to capture.
55
00:05:11,350 --> 00:05:17,290
To understand the different measures of dispersion,
let us go back to the data set that we were
56
00:05:17,290 --> 00:05:19,650
using, when we were speaking about measures
of central tendency.
57
00:05:19,650 --> 00:05:28,120
So, I have used the same data set out here
and the simplest measure of dispersion is
58
00:05:28,120 --> 00:05:34,500
range and the range is quite simply nothing
but the highest value minus the lowest value.
59
00:05:34,500 --> 00:05:40,480
So, in this particular case the highest value
is 9, the lowest value is 1, so quite simply
60
00:05:40,480 --> 00:05:45,150
9 minus 1 is 8, and that should be intuitive
for you.
61
00:05:45,150 --> 00:05:56,490
Given that the mean of this data set is
about 4.5 to 4.6, a measure of dispersion is
62
00:05:56,490 --> 00:05:58,890
just the max minus min.
63
00:05:58,890 --> 00:06:05,340
Now, if for instance there was very low
dispersion, then the highest value would be
64
00:06:05,340 --> 00:06:11,460
close to 4.6 and the lowest value would be
close to 4.6, and so the range, max
65
00:06:11,460 --> 00:06:13,020
minus min could have been smaller.
66
00:06:13,020 --> 00:06:20,140
At the same time, if this dispersion is very
high, on the high side and on the low side
67
00:06:20,140 --> 00:06:25,720
your max and min values are going to deviate
a lot from 4.6, and so you would have
68
00:06:25,720 --> 00:06:27,520
high dispersion.
69
00:06:27,520 --> 00:06:30,842
So, that is the simple one.
70
00:06:30,842 --> 00:06:36,770
The second one is the Inter Quartile Range
and the idea here is highly related to the
71
00:06:36,770 --> 00:06:41,190
concept of median, where you would arrange
the data points and you would kind of take
72
00:06:41,190 --> 00:06:43,000
a central data point.
73
00:06:43,000 --> 00:06:47,890
We discussed that procedure for the median
during measures of central
74
00:06:47,890 --> 00:06:51,440
tendency, but another way of thinking of it
is that you are taking the 50th percentile
75
00:06:51,440 --> 00:06:52,910
point.
76
00:06:52,910 --> 00:06:57,400
With the Inter Quartile Range, what you are
doing is you are taking the 75th percentile
77
00:06:57,400 --> 00:07:04,460
point, or the 3rd quartile, and subtracting
from it the 25th percentile point.
78
00:07:04,460 --> 00:07:11,020
The idea being that within this data set if
there is a high level of dispersion, then
79
00:07:11,020 --> 00:07:18,210
that range between the 75th percentile point
and the 25th percentile point would also be high
80
00:07:18,210 --> 00:07:23,900
and if the dispersion is low, then this range
would be low and it is really noteworthy that
81
00:07:23,900 --> 00:07:31,830
this is the concept that gets captured in
box plots, which we discussed under graphical
82
00:07:31,830 --> 00:07:33,460
techniques.
83
00:07:33,460 --> 00:07:38,620
We spoke about, how in the box plots the upper
line of the box plot and the lower line of
84
00:07:38,620 --> 00:07:43,750
the box plot, correspond usually to the third
quartile and the first quartile of your data
85
00:07:43,750 --> 00:07:45,150
set.
86
00:07:45,150 --> 00:07:48,720
And this is also known as the Inter Quartile
Range.
87
00:07:48,720 --> 00:07:54,450
It might be abbreviated to IQR in some
textbooks, but this is also a measure of
88
00:07:54,450 --> 00:07:56,030
dispersion.
89
00:07:56,030 --> 00:08:03,190
We then come to what is a fairly popular
measure of dispersion and the idea behind
90
00:08:03,190 --> 00:08:13,200
this is essentially to look at how much each
data point deviates from the mean that you
91
00:08:13,200 --> 00:08:14,840
just calculated.
92
00:08:14,840 --> 00:08:23,500
So, x_i represents each data point, where
i goes from the first value to the n-th value,
93
00:08:23,500 --> 00:08:26,910
and in our particular example n is 12.
94
00:08:26,910 --> 00:08:33,770
And so we want to take each data point
and see how much it deviates from the mean.
95
00:08:33,770 --> 00:08:39,269
In the particular case of this data set, the
mean is 4.58.
96
00:08:39,269 --> 00:08:46,459
So, we will take the first data point, which
is 3, and we subtract
97
00:08:46,459 --> 00:08:49,140
it from 4.58 and square that value.
98
00:08:49,140 --> 00:08:53,200
We would take the second data point, do the
same thing, and we would keep adding up these
99
00:08:53,200 --> 00:08:57,830
squares and once you add up these squares,
you take something that kind of looks like
100
00:08:57,830 --> 00:09:02,381
an average; it is not an exact average,
because you have this n minus 1, and we will
101
00:09:02,381 --> 00:09:03,710
talk about that in a minute.
102
00:09:03,710 --> 00:09:09,670
But, in concept you essentially are trying
to get an average of the squared deviation
103
00:09:09,670 --> 00:09:12,360
and you will ultimately take a square root
of this.
104
00:09:12,360 --> 00:09:16,490
Now, when you take the square root, what you
get is the standard deviation and when you
105
00:09:16,490 --> 00:09:21,550
do not take a square root, you get this measure
called variance and variance is also a measure
106
00:09:21,550 --> 00:09:23,960
of dispersion.
107
00:09:23,960 --> 00:09:28,400
In concept, the only difference between standard
deviation and variance is that one is the square
108
00:09:28,400 --> 00:09:29,850
root of the other.
109
00:09:29,850 --> 00:09:35,250
So, now you understand how a standard
deviation is calculated.
110
00:09:35,250 --> 00:09:41,380
Let us go through some questions that might
have come up when we discussed standard deviations.
111
00:09:41,380 --> 00:09:45,200
The other two measures that we have
discussed are fairly straightforward and
112
00:09:45,200 --> 00:09:46,200
clear.
113
00:09:46,200 --> 00:09:51,360
So, here are some questions that always come
up with standard deviation.
114
00:09:51,360 --> 00:09:57,260
Why do we use this square function on the
deviations, and what are its implications?
115
00:09:57,260 --> 00:10:02,850
So, what we are referring to here is the fact
that, we actually take the square of the deviation.
116
00:10:02,850 --> 00:10:05,610
So, why, what is the purpose?
117
00:10:05,610 --> 00:10:09,920
See, if you want to get some measure
of average deviation,
118
00:10:09,920 --> 00:10:14,130
why not just take the deviations and average
them? And the answer is fairly straight
119
00:10:14,130 --> 00:10:15,820
forward.
120
00:10:15,820 --> 00:10:20,780
The answer is that just by definition, because
you are looking at the deviation from the
121
00:10:20,780 --> 00:10:28,430
mean, there are going to be some points that
deviate from the mean on a positive side,
122
00:10:28,430 --> 00:10:31,750
there are going to be some points that deviate
from the mean on the negative side.
123
00:10:31,750 --> 00:10:41,440
So, 3 minus 4.58 would lead to a negative
number, whereas 9 minus 4.58 would have been
124
00:10:41,440 --> 00:10:46,710
a positive number and again by definition,
because of how you calculate a mean and the
125
00:10:46,710 --> 00:10:49,550
math for this is fairly straightforward.
126
00:10:49,550 --> 00:10:55,440
You will find that if you just took the deviations,
some positive numbers and some negative numbers
127
00:10:55,440 --> 00:11:00,580
and you added them up, you would always get
zero, and that is because of how the mean
128
00:11:00,580 --> 00:11:05,040
is calculated, because the mean is nothing
but, the sum of all the numbers divided by
129
00:11:05,040 --> 00:11:08,180
the total number of such numbers.
130
00:11:08,180 --> 00:11:13,330
So, by definition just taking the deviation
would result in some positive numbers and
131
00:11:13,330 --> 00:11:17,190
some negative numbers, which should cancel
each other out and give you zero.
132
00:11:17,190 --> 00:11:22,980
So, what you are really trying to capture is an
average deviation, but you do not want the
133
00:11:22,980 --> 00:11:23,980
signs.
134
00:11:23,980 --> 00:11:28,030
So, one great thing you can do is
to square it all.
135
00:11:28,030 --> 00:11:33,870
So, when you square a number, whether it is
negative or positive, you always get a positive
136
00:11:33,870 --> 00:11:41,260
number, and the other really interesting thing
is that the only thing that matters is the magnitude.
137
00:11:41,260 --> 00:11:47,900
So, minus 3 squared is 9, which is also the
same as plus 3 squared.
138
00:11:47,900 --> 00:11:55,390
So, the idea is that the square function is
symmetric on the plus minus side and always
139
00:11:55,390 --> 00:11:56,670
gives you a positive number.
140
00:11:56,670 --> 00:12:00,570
So, for that reason we use the square function.
141
00:12:00,570 --> 00:12:07,560
Now, are there some implications of that and
the answer is, yes there are some implications.
142
00:12:07,560 --> 00:12:10,770
The implication is the effect that squaring
has.
143
00:12:10,770 --> 00:12:22,830
So, let us say you had two separate deviations
of one unit each, so let us say you had two
144
00:12:22,830 --> 00:12:26,400
data points that each signified a deviation
of one unit.
145
00:12:26,400 --> 00:12:35,330
So, given the average is 4.58, let us say
you had a data point of 3.58.
146
00:12:35,330 --> 00:12:50,890
So, 3.58 and you had another data point, which
was 5.58, so both of these would have a deviation
147
00:12:50,890 --> 00:13:02,720
of minus 1 and plus 1, and when you square
and add these two
148
00:13:02,720 --> 00:13:06,399
numbers, the answer comes out to be 2.
149
00:13:06,399 --> 00:13:12,500
So, that is what happens when you do this
entire squaring process.
150
00:13:12,500 --> 00:13:18,270
Now, what happens in a second case?
151
00:13:18,270 --> 00:13:19,330
So we had two…
152
00:13:19,330 --> 00:13:21,750
We just focused on two data points.
153
00:13:21,750 --> 00:13:26,630
Now, what happens in one case when you just…
154
00:13:26,630 --> 00:13:29,710
In one case, you are right on the mean.
155
00:13:29,710 --> 00:13:38,779
So, you are right on the mean, in the other
case you are deviating by two points essentially
156
00:13:38,779 --> 00:13:46,110
or whatever unit you are using; the deviation
is two units.
157
00:13:46,110 --> 00:13:54,410
So, here, because you are comparing the deviation
to the mean, you are essentially just replacing
158
00:13:54,410 --> 00:13:59,830
this 3 with 4.58, and we are looking at what would
159
00:13:59,830 --> 00:14:00,830
happen.
160
00:14:00,830 --> 00:14:04,340
In this case, because you are doing that, your deviation
is zero.
161
00:14:04,340 --> 00:14:16,050
Here your deviation is 2 and because it gets
squared, that becomes 4 and so, your cumulative
162
00:14:16,050 --> 00:14:21,080
deviation in some sense is 4, whereas in the
previous case your cumulative deviation was
163
00:14:21,080 --> 00:14:23,920
only calculated as 2.
164
00:14:23,920 --> 00:14:30,550
In both cases, you deviated by two units from
your mean across the two data points.
165
00:14:30,550 --> 00:14:35,709
In one case, you deviated by one unit in the
first data point and one unit in the second
166
00:14:35,709 --> 00:14:41,089
data point, but the sum of the squares led
you to a number 2.
167
00:14:41,089 --> 00:14:46,860
In the second case, you deviated from the
mean by
168
00:14:46,860 --> 00:14:50,990
zero units in the first data point and again
two units in the second data point.
169
00:14:50,990 --> 00:14:56,450
So, if you just look at the
actual deviation from the mean, in both cases
170
00:14:56,450 --> 00:15:03,720
you have deviated by only two points, but
in the second instance you
171
00:15:03,720 --> 00:15:10,200
would be recording a squared deviation of four
units, which is twice as much as the squared
172
00:15:10,200 --> 00:15:13,300
deviation of the first case, which is two
units.
173
00:15:13,300 --> 00:15:18,890
Now, many people like that and there are many
contexts, where that makes a lot of sense.
174
00:15:18,890 --> 00:15:24,120
There are some contexts where this just does
not make sense, but that is one of the implications
175
00:15:24,120 --> 00:15:28,350
of squaring the deviations.
176
00:15:28,350 --> 00:15:34,089
So, the second question is: why do we work with
standard deviation and not the variance?
177
00:15:34,089 --> 00:15:39,640
So, the idea is, why do we take this square
root.
178
00:15:39,640 --> 00:15:46,769
Why not just report the variance, since they
both serve the same function,
179
00:15:46,769 --> 00:15:52,550
and the answer again is fairly straightforward.
180
00:15:52,550 --> 00:15:59,170
You have a data set in some units; that is,
3, 4, 3, 1 could have some units, and these
181
00:15:59,170 --> 00:16:07,519
units could be simple things like rupees
or kilometers per hour, whatever
182
00:16:07,519 --> 00:16:10,330
it is that you know.
183
00:16:10,330 --> 00:16:18,350
You might have been collecting data on it, and
when you report a deviation from the mean,
184
00:16:18,350 --> 00:16:21,030
the units would then be squared if you
are using variance.
185
00:16:21,030 --> 00:16:28,070
So, if you use variance, you will have to report
a value that is in squared units, and so what does
186
00:16:28,070 --> 00:16:29,950
it mean to say rupees squared?
187
00:16:29,950 --> 00:16:35,740
So, what does it mean to say a dispersion is
500 rupees squared? You know, rupees squared
188
00:16:35,740 --> 00:16:40,800
is not something that we can understand.
189
00:16:40,800 --> 00:16:47,430
It is far more intuitive and meaningful to
say a deviation is 23 rupees from the mean;
190
00:16:47,430 --> 00:16:51,100
you can make decisions based off of that and
you can gather some insights based off of
191
00:16:51,100 --> 00:16:52,100
that.
192
00:16:52,100 --> 00:16:57,000
So, the third question, and often a very interesting
question, is why we average by dividing
193
00:16:57,000 --> 00:16:59,130
by n minus 1 and not n.
194
00:16:59,130 --> 00:17:09,709
So, the idea here is that the sum of the deviations
is always zero and so the last deviation,
195
00:17:09,709 --> 00:17:14,929
because you are essentially doing a series
of deviations.
196
00:17:14,929 --> 00:17:22,949
Now, the last deviation can be found
once we know the other n minus 1 deviations.
197
00:17:22,949 --> 00:17:29,510
So, we are not really averaging n unrelated
numbers; we are really averaging only n minus
198
00:17:29,510 --> 00:17:31,920
1 squared deviations.
199
00:17:31,920 --> 00:17:37,299
In some sense, it is almost as if only the
n minus 1 squared deviations can vary freely
200
00:17:37,299 --> 00:17:43,559
and we average by dividing the total, essentially
by n minus 1.
201
00:17:43,559 --> 00:17:50,080
This is also the concept of degrees of freedom,
which is how many of the values can actually
202
00:17:50,080 --> 00:17:57,429
move freely and still
maintain the final statistic and in this case,
203
00:17:57,429 --> 00:18:03,160
the final statistic is the mean, because you are
subtracting each number from the sample mean.
204
00:18:03,160 --> 00:18:11,480
Now, the important thing is, this mean which
is 4.58 in our case is something that was
205
00:18:11,480 --> 00:18:13,650
calculated from this data.
206
00:18:13,650 --> 00:18:19,460
So, from the same data, which we are using
to calculate the standard deviation, you calculated
207
00:18:19,460 --> 00:18:23,870
the mean and that is the reason essentially
that you are using the n minus 1.
208
00:18:23,870 --> 00:18:31,140
If instead you are not using this mean, but
someone came and told you what the true mean
209
00:18:31,140 --> 00:18:33,180
of this data was.
210
00:18:33,180 --> 00:18:39,559
Someone said, here is the data set and by
the way, the mean of this data set is 5.
211
00:18:39,559 --> 00:18:45,330
So, they just told you the mean, or you
knew the mean from past experience, or
212
00:18:45,330 --> 00:18:50,730
were able to compute it; then you would
not have to do the n minus 1, and you would
213
00:18:50,730 --> 00:18:56,960
use n. But also, out here you would not
substitute 4.58; you would be substituting
214
00:18:56,960 --> 00:18:57,960
5.
215
00:18:57,960 --> 00:19:02,730
So, in each of these places you would be substituting
5, which is the true mean.
216
00:19:02,730 --> 00:19:07,950
We call that the true mean and we call 4.58
the sample mean, because 4.58 you calculated
217
00:19:07,950 --> 00:19:15,461
from this data, whereas 5 is something that
you knew in principle or were able to use
218
00:19:15,461 --> 00:19:19,910
some other source to find out what the true mean
was.
219
00:19:19,910 --> 00:19:30,120
Now, another way people like to describe
this is to say, for instance, that if
220
00:19:30,120 --> 00:19:36,920
this 3, 4, 3, 1 data is ultimately a
sample from some other population, then you
221
00:19:36,920 --> 00:19:43,580
need to essentially do what we just discussed,
which is to use this n minus 1 and take
222
00:19:43,580 --> 00:19:44,730
the sample mean.
223
00:19:44,730 --> 00:19:48,230
So, again we are talking about the case, where
nobody comes and tells you what the true mean
224
00:19:48,230 --> 00:19:49,230
is.
225
00:19:49,230 --> 00:19:54,540
So, your only hope is to calculate a mean
from the data and you calculated 4.58 and
226
00:19:54,540 --> 00:20:02,170
because this data set is a sample from something
else that is generating this data, the right
227
00:20:02,170 --> 00:20:07,370
way to do it is the way the standard deviation
formula right now is shown.
228
00:20:07,370 --> 00:20:16,790
But, if in some sense this data is the population,
it is not a sample from some universe, but
229
00:20:16,790 --> 00:20:19,210
it is the real deal.
230
00:20:19,210 --> 00:20:27,590
Then, again the idea would be to use n and
not n minus 1, because this is the true mean
231
00:20:27,590 --> 00:20:33,000
and again out here, you would be substituting
the 4.58, but then this should be called a
232
00:20:33,000 --> 00:20:34,930
population standard deviation.
233
00:20:34,930 --> 00:20:40,900
So, it is POP, population standard deviation.
234
00:20:40,900 --> 00:20:47,070
But, more often than not in terms of the more
realistic situation that you will encounter
235
00:20:47,070 --> 00:20:56,211
in life, I think it is fairly safe to say
that if you are taking a sample and you
236
00:20:56,211 --> 00:21:00,169
are calculating the mean, use n minus 1.
237
00:21:00,169 --> 00:21:08,150
If you are given a sample data set, but you already
know the mean, that is, you are not calculating
238
00:21:08,150 --> 00:21:11,170
it from this data set; you already know the
true mean.
239
00:21:11,170 --> 00:21:17,280
In that case, you can just go ahead and use
n instead of n minus 1 and that would be the
240
00:21:17,280 --> 00:21:18,580
right standard deviation.
241
00:21:18,580 --> 00:21:24,410
So, that is as far as standard deviation goes,
but before we conclude on measures of dispersion,
242
00:21:24,410 --> 00:21:29,120
it is worth mentioning that there are some
other measures of dispersion out there and
243
00:21:29,120 --> 00:21:34,100
one of these is called mean absolute deviation,
and there are many variants of it.
244
00:21:34,100 --> 00:21:41,460
But, the core idea is that, with mean absolute
deviations you replace what you use in standard
245
00:21:41,460 --> 00:21:47,470
deviation, which is the squared deviation of
each point from its mean, and you
246
00:21:47,470 --> 00:21:52,920
replace that with the absolute deviation.
247
00:21:52,920 --> 00:22:00,530
So, the deviation carries this sign, which is the
two vertical lines on either side, and what it
248
00:22:00,530 --> 00:22:02,871
essentially means is that the negative sign
just goes away.
249
00:22:02,871 --> 00:22:11,320
So, a 3.58 minus 4.58, which would
result in minus 1, would just be written down
250
00:22:11,320 --> 00:22:15,580
as a 1, and so would a 5.58 minus 4.58.
251
00:22:15,580 --> 00:22:21,350
So, negative signs are just taken off, and
then you do all the other operations the
252
00:22:21,350 --> 00:22:22,799
same.
253
00:22:22,799 --> 00:22:27,080
The good thing with mean absolute deviation
is that it has a lot of variants, so it is
254
00:22:27,080 --> 00:22:34,200
like what is the average deviation from the
mean, that is the typical case and that is
255
00:22:34,200 --> 00:22:40,070
what I have written down here, but you can
also replace this x bar with the median of
256
00:22:40,070 --> 00:22:41,070
the x’s.
257
00:22:41,070 --> 00:22:48,430
So, the mean absolute deviation from the median
is another case and you also have cases like,
258
00:22:48,430 --> 00:22:53,490
what is the median absolute deviation from
the mean, the median absolute deviation from
259
00:22:53,490 --> 00:22:54,490
the median.
260
00:22:54,490 --> 00:22:59,830
Obviously, our previous lecture on understanding
the pros and cons of means and medians would
261
00:22:59,830 --> 00:23:04,230
play an important role in making such a selection.
262
00:23:04,230 --> 00:23:08,090
So, that should conclude our lecture on measures
of dispersion.
263
00:23:08,090 --> 00:23:13,790
In the next lecture, we will continue with
descriptive statistics, but focusing more
264
00:23:13,790 --> 00:23:14,790
on distributions.
265
00:23:14,790 --> 00:23:15,520
Thank you.