1
00:00:15,179 --> 00:00:20,480
Hello and welcome to our next module in the
course, Introduction to Data Analytics.
2
00:00:20,480 --> 00:00:26,750
In this module, we continue our previous work
on Descriptive Statistics and we… In our
3
00:00:26,750 --> 00:00:33,030
last session, we spoke about descriptive statistics
through the use of various graphical and visualization
4
00:00:33,030 --> 00:00:34,820
techniques.
5
00:00:34,820 --> 00:00:42,050
In this module, we start with the use of summary
statistics or the idea that you can describe
6
00:00:42,050 --> 00:00:50,760
data with numbers, numbers that summarize
the data. Most specifically, we are going
7
00:00:50,760 --> 00:00:55,940
to be talking about measures of central tendency
in this lecture.
8
00:00:55,940 --> 00:01:04,760
So, just to jog your memory, we spoke about
the idea that there could be a data set and
9
00:01:04,760 --> 00:01:10,490
a data set could potentially be representing a particular
10
00:01:10,490 --> 00:01:18,189
variable and so, I provided for you a simple
example of a data set.
11
00:01:18,189 --> 00:01:25,930
And this data set is what has been captured
in this histogram.
12
00:01:25,930 --> 00:01:32,689
The histogram is a visualization tool that
we spoke about in our last lecture.
13
00:01:32,689 --> 00:01:42,700
Now, the histogram essentially is a very rich
representation, because it not only captures
14
00:01:42,700 --> 00:01:48,990
just some parameters associated with this
data set, but it captures various nuances
15
00:01:48,990 --> 00:01:50,670
associated with it.
16
00:01:50,670 --> 00:01:59,479
So, just to give you a quick reminder on how
this works, essentially the entire x axis
17
00:01:59,479 --> 00:02:03,340
breaks down the possible values that the data
set could take.
18
00:02:03,340 --> 00:02:12,069
So, for instance this bin is the series of
values between 10 and, from the looks of it,
19
00:02:12,069 --> 00:02:15,870
10.66 on this side.
20
00:02:15,870 --> 00:02:24,189
Now, depending on the number of data points
that you see here, that fall within that range
21
00:02:24,189 --> 00:02:30,249
of 10 and 10.66 that would get counted here
and out here it looks like that is about 15
22
00:02:30,249 --> 00:02:31,249
points.
23
00:02:31,249 --> 00:02:37,809
So, that is essentially how a histogram is
calculated and the idea is that, using a histogram
24
00:02:37,809 --> 00:02:44,029
you could then fit something called a distribution,
and the distribution is this red line that
25
00:02:44,029 --> 00:02:46,370
is shown on top of the histogram.
26
00:02:46,370 --> 00:02:54,209
And so, in some sense, the histogram and the
distribution that sometimes follows tell
27
00:02:54,209 --> 00:02:59,199
us the full story associated with the data.
28
00:02:59,199 --> 00:03:05,780
And usually this red line is represented through
some kind of a formula and we just call that,
29
00:03:05,780 --> 00:03:09,249
let us say f of x for now.
30
00:03:09,249 --> 00:03:17,079
But, the basic idea behind summary statistics
is that you do not even need to go this deep,
31
00:03:17,079 --> 00:03:25,670
I gave you this full picture just to tell
you what the richest or the most detailed story
32
00:03:25,670 --> 00:03:31,040
could be, and our next sessions and modules
are going to be about distributions.
33
00:03:31,040 --> 00:03:33,900
But, now let us take a step back.
34
00:03:33,900 --> 00:03:35,760
Is there something simpler that we could do?
35
00:03:35,760 --> 00:03:42,859
Is there something simpler without even fitting
this distribution or even creating this histogram
36
00:03:42,859 --> 00:03:45,719
that could tell us a part of the story?
37
00:03:45,719 --> 00:03:51,909
And the answer is yes and there are these
various summary statistics that do exactly
38
00:03:51,909 --> 00:03:56,319
that and I am just going to talk about a few
of them now.
39
00:03:56,319 --> 00:04:02,150
The first and most common ones are these
measures of central tendency and what they
40
00:04:02,150 --> 00:04:09,760
mean is that, you have this data set and for
now, let us just occupy ourselves with the
41
00:04:09,760 --> 00:04:13,319
histogram, not the distribution that is fit
on top of the histogram.
42
00:04:13,319 --> 00:04:20,579
But, essentially with this histogram, what
is a fairly central value?
43
00:04:20,579 --> 00:04:27,919
So, it is clear that the values of this distribution
go from here to about here, but what is something
44
00:04:27,919 --> 00:04:31,940
in the center, and how do you define that?
45
00:04:31,940 --> 00:04:39,009
So, one idea is often to say, well one measure
of central tendency is to see the minimum
46
00:04:39,009 --> 00:04:45,280
value going all the way to the maximum value
and take something that is in between, so
47
00:04:45,280 --> 00:04:46,830
that is one way.
48
00:04:46,830 --> 00:04:54,210
Another way could be to say, at what point
in this histogram am I really covering about
49
00:04:54,210 --> 00:04:55,770
50 percent of the area.
50
00:04:55,770 --> 00:05:02,370
So, this histogram is defined by these blue
bars and at what point am I covering about
51
00:05:02,370 --> 00:05:07,360
50 percent of that area, so that could be
one way of saying, what is central.
52
00:05:07,360 --> 00:05:14,710
There are other innovative ways, one of the
most common one is to say think of this as
53
00:05:14,710 --> 00:05:21,340
a balance, a seesaw essentially: this
x-axis line, with all these blue bars as weights
54
00:05:21,340 --> 00:05:22,340
on top of it.
55
00:05:22,340 --> 00:05:29,460
So, the idea is, where would you want to put
a fulcrum such that this whole thing balances?
56
00:05:29,460 --> 00:05:36,289
One side does not tip off; the seesaw
is essentially in balance.
57
00:05:36,289 --> 00:05:42,190
So, that is the core idea behind measures
of central tendency, what is a central value.
58
00:05:42,190 --> 00:05:47,860
Now, you then have measures of dispersion
and the idea behind measures of dispersion
59
00:05:47,860 --> 00:05:53,229
is that you might have some
central value.
60
00:05:53,229 --> 00:05:59,129
But, how are the data points actually dispersed around
the central value?
61
00:05:59,129 --> 00:06:09,860
Are they very far away from it, or are
they very close to it, and so on?
62
00:06:09,860 --> 00:06:16,270
And these are the two major forms of summary
statistics that we will cover in this course
63
00:06:16,270 --> 00:06:18,550
and that you are likely to encounter.
64
00:06:18,550 --> 00:06:23,240
But, you might have also heard of the concepts
of skew and kurtosis, so it is just worth briefly
65
00:06:23,240 --> 00:06:31,680
mentioning them. Skew concerns the shape
of the distribution itself, like
66
00:06:31,680 --> 00:06:36,811
the fact that sometimes distributions lean
to one side versus the other, and this is a
67
00:06:36,811 --> 00:06:38,870
fairly colloquial way of saying it.
68
00:06:38,870 --> 00:06:44,780
But, instead of this distribution, the red
line, being the way it is, could you have a distribution
69
00:06:44,780 --> 00:06:50,129
that looked like that, which means it is leaning,
the whole thing is leaning to one side.
70
00:06:50,129 --> 00:06:56,039
Again, I am just giving you a conceptual feel
for it; it is not a formal definition.
71
00:06:56,039 --> 00:07:03,669
And then, along the same lines, kurtosis is the
idea of how fat the tails of the distribution are,
72
00:07:03,669 --> 00:07:11,840
and what we mean by that is, you know having
a similar distribution that looks like this.
73
00:07:11,840 --> 00:07:19,110
So, the tails themselves are fatter than
the others, and that property gets captured
74
00:07:19,110 --> 00:07:22,490
with kurtosis, so great.
75
00:07:22,490 --> 00:07:28,529
So, this should have given you
a brief overview of the
76
00:07:28,529 --> 00:07:35,580
overarching idea of summary statistics:
describing data through
77
00:07:35,580 --> 00:07:42,690
summary statistics or numbers as a means of
describing distributions.
78
00:07:42,690 --> 00:07:48,520
We now go into the major subject of this
lecture which is measures of central tendency.
79
00:07:48,520 --> 00:07:56,930
So, the best way to do this is through a concrete
example, and that is what we will do with
80
00:07:56,930 --> 00:07:57,930
this step.
81
00:07:57,930 --> 00:08:03,300
So, there are three major measures of central
tendency and they are the mean, median and
82
00:08:03,300 --> 00:08:04,970
mode.
83
00:08:04,970 --> 00:08:10,180
With the mean, the core idea is one many of
you might have encountered, might have
84
00:08:10,180 --> 00:08:12,270
come across before.
85
00:08:12,270 --> 00:08:20,020
So, bear with me if you have already heard this:
the idea behind the mean is just that it is the
86
00:08:20,020 --> 00:08:22,490
concept of average.
87
00:08:22,490 --> 00:08:27,169
If you have a data set, and I am just giving
you a sample data set out here, so it
88
00:08:27,169 --> 00:08:32,390
is the numbers 3, 4, 3, 1, and I have kept
the data set small so that I can illustrate
89
00:08:32,390 --> 00:08:37,820
the concept; typically you might be dealing with
much larger data sets, but the idea remains
90
00:08:37,820 --> 00:08:38,820
the same.
91
00:08:38,820 --> 00:08:43,419
So, for this data set the mean is nothing
but the sum of all the numbers divided by
92
00:08:43,419 --> 00:08:45,820
the total number of numbers there are.
93
00:08:45,820 --> 00:08:52,839
So, we take each number, 3, 4, 3, and so on, and add
them all up, and the 12 that you get out here
94
00:08:52,839 --> 00:08:57,089
is the actual number of numbers that there
are in this list.
95
00:08:57,089 --> 00:09:05,510
So, once you divide, that is the concept
of the mean, which can also be represented mathematically
96
00:09:05,510 --> 00:09:11,320
in this form, and I have just shown that
so that when you see it, you are
97
00:09:11,320 --> 00:09:13,870
not surprised by it.
98
00:09:13,870 --> 00:09:19,930
So, great and incidentally the mean is the
concept that I was speaking about in this
99
00:09:19,930 --> 00:09:27,779
histogram of balancing the seesaw: where would
you place the fulcrum such that this seesaw
100
00:09:27,779 --> 00:09:32,610
gets balanced? That is the same concept as the
mean.
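The arithmetic can be sketched in a few lines of Python. The exact data set on the slide is not visible in the transcript, so the list below is a hypothetical stand-in: it keeps the 3, 4, 3, 1 that the lecture mentions, has 12 numbers, and is chosen so that its median is 4 and its mode is 3, matching the values the lecture arrives at later.

```python
# Hypothetical stand-in for the slide's data set: 12 numbers,
# including the 3, 4, 3, 1 mentioned in the lecture.
data = [3, 4, 3, 1, 2, 4, 5, 6, 7, 8, 9, 3]

# The mean is the sum of all the numbers divided by how many there are.
mean = sum(data) / len(data)
print(mean)  # about 4.58 for this stand-in data
```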
101
00:09:32,610 --> 00:09:40,030
We now move on to the next measure of central
tendency, which is called the median.
102
00:09:40,030 --> 00:09:45,680
The median is calculated by arranging all
the numbers in order.
103
00:09:45,680 --> 00:09:52,570
So, you had this data set and it was 3 comma
4 comma 3, and that is what you had out here,
104
00:09:52,570 --> 00:09:58,320
but when you come to the median, you basically
take the smallest number, put it first, and
105
00:09:58,320 --> 00:10:02,779
then, in some order, ascending or descending,
you arrange the numbers.
106
00:10:02,779 --> 00:10:08,980
Once you do that, you choose the central number
and that is your median.
107
00:10:08,980 --> 00:10:13,740
Now, choosing that central number is quite
easy when you have an odd number of numbers,
108
00:10:13,740 --> 00:10:20,550
so if you had 9 numbers, the 5th number would
be the central number, with 4 numbers
109
00:10:20,550 --> 00:10:26,220
before and 4 numbers afterwards,
and that is your central number, but when
110
00:10:26,220 --> 00:10:33,960
you have an even number of numbers… So, in
this particular case we have 12 numbers, the
111
00:10:33,960 --> 00:10:38,480
central number is really not 1 number, it is
2 numbers.
112
00:10:38,480 --> 00:10:41,940
So, out here it winds up being the 6th and
the 7th number.
113
00:10:41,940 --> 00:10:45,540
So, typically, you choose the 6th and
the 7th numbers and take their average.
114
00:10:45,540 --> 00:10:51,160
In this particular case that is not a problem,
because it happens to be the same number and
115
00:10:51,160 --> 00:10:55,740
quite easily we say that the answer is 4.
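As a sketch in Python, using the same hypothetical stand-in data set (only the values 3, 4, 3, 1 and the resulting median of 4 come from the lecture; the remaining values are assumptions):

```python
# Hypothetical stand-in for the slide's data set.
data = [3, 4, 3, 1, 2, 4, 5, 6, 7, 8, 9, 3]

# Arrange the numbers in order, then pick the central value.
ordered = sorted(data)
n = len(ordered)
if n % 2 == 1:
    median = ordered[n // 2]  # odd count: the single middle number
else:
    # even count: average the two central numbers (here the 6th and 7th)
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
print(median)  # both central values are 4, so the median is 4.0
```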
116
00:10:55,740 --> 00:11:02,449
The mode is the third measure of central
tendency, and there might be a few others,
117
00:11:02,449 --> 00:11:07,660
but these are the three most common ones that
you will encounter. The mode essentially says:
118
00:11:07,660 --> 00:11:10,389
what is the most common value?
119
00:11:10,389 --> 00:11:20,509
So, if you look at this data set, the number
3 appears 3 times, the number 4 appears twice
120
00:11:20,509 --> 00:11:23,800
and then, all the other numbers just appear
once.
121
00:11:23,800 --> 00:11:29,170
And, so out here it is fairly clear that the
number 3 is the most common one and hence
122
00:11:29,170 --> 00:11:32,490
3 is the answer if the question is,
what is the mode?
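A short Python sketch of the mode, again on the hypothetical stand-in data set (in it, 3 occurs three times, 4 twice, and every other value once, as described in the lecture):

```python
from collections import Counter

# Hypothetical stand-in for the slide's data set.
data = [3, 4, 3, 1, 2, 4, 5, 6, 7, 8, 9, 3]

counts = Counter(data)              # tally how often each value occurs
mode = counts.most_common(1)[0][0]  # the value with the highest count
print(mode)  # 3, which appears three times
```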
123
00:11:32,490 --> 00:11:42,620
It is the most common number, and that
kind of makes sense if you have a data set
124
00:11:42,620 --> 00:11:50,230
where there are only a few numbers that are
recurring, but the concept is again generalizable.
125
00:11:50,230 --> 00:11:55,110
So, even if you had a data set that looked
like this, where, you know, it does
126
00:11:55,110 --> 00:11:58,350
not make sense to ask the question, which
is the most common number; in fact, no number
127
00:11:58,350 --> 00:11:59,550
might repeat itself.
128
00:11:59,550 --> 00:12:05,290
But, out here the mode essentially is based
on the range itself.
129
00:12:05,290 --> 00:12:13,690
So, you would say this is the most common
range and so this is the mode and so the mode
130
00:12:13,690 --> 00:12:24,630
out here would have been something like
9.5 to 9.6, so great.
131
00:12:24,630 --> 00:12:30,870
So, we now understand how mean, median and
mode are calculated.
132
00:12:30,870 --> 00:12:37,600
Now, let us take a step back and see where do
we want to use which measure of central tendency.
133
00:12:37,600 --> 00:12:42,930
So, how do we choose between mean, median
and mode?
134
00:12:42,930 --> 00:12:46,779
The main concept really comes between mean
and median.
135
00:12:46,779 --> 00:12:52,360
So, let us talk about that for a minute; it
becomes kind of obvious where the mode is more
136
00:12:52,360 --> 00:13:00,589
useful, because it has a very different property
associated with this central tendency, so
137
00:13:00,589 --> 00:13:01,589
great.
138
00:13:01,589 --> 00:13:09,769
If you have to choose between mean and median,
much of the debate usually comes down to outliers.
139
00:13:09,769 --> 00:13:18,820
The idea of an outlier is that it is a
number or a value that is not really within
140
00:13:18,820 --> 00:13:23,050
the set of most of the other numbers that
you see.
141
00:13:23,050 --> 00:13:29,019
Now, that can be because of quite a few
reasons.
142
00:13:29,019 --> 00:13:35,110
When it comes about because of an error in
the data, so the data set itself could have
143
00:13:35,110 --> 00:13:41,980
an error, then it is easy to say that it is
erroneous, so this is a bad outlier.
144
00:13:41,980 --> 00:13:49,460
But, sometimes it can be that the outlier
is very much not an error and it tells an
145
00:13:49,460 --> 00:13:52,029
important part of the story.
146
00:13:52,029 --> 00:13:58,560
In really simple words, the median is not
influenced much by the outlier, whereas the
147
00:13:58,560 --> 00:14:01,920
mean is greatly influenced by the outlier.
148
00:14:01,920 --> 00:14:08,040
For that reason, the median is often
expressed as being a metric more robust
149
00:14:08,040 --> 00:14:09,040
to outliers.
150
00:14:09,040 --> 00:14:15,709
But, here we need to take a step back for just
a second: you know, outliers can
151
00:14:15,709 --> 00:14:22,949
either be good or they need not be good and
so, it really depends on what we think about
152
00:14:22,949 --> 00:14:26,129
the outliers as to whether we choose to
go with the mean or median.
153
00:14:26,129 --> 00:14:32,220
Obviously, when we think the outlier is a
bad thing, it should not be there or it is
154
00:14:32,220 --> 00:14:36,380
not contributing towards a story that we want
to tell.
155
00:14:36,380 --> 00:14:40,200
Then, we call it a bad outlier, and we prefer
the median in that case.
156
00:14:40,200 --> 00:14:46,089
So, to actually give you some insight as to
how the outlier affects the mean and not the
157
00:14:46,089 --> 00:14:47,089
median.
158
00:14:47,089 --> 00:14:49,209
Let us just go back to the previous example.
159
00:14:49,209 --> 00:14:56,980
Let us say that in our data set instead of
8 we had 800.
160
00:14:56,980 --> 00:15:01,800
So, that is clearly a mistake.
161
00:15:01,800 --> 00:15:06,500
Someone by mistake typed two 0's next to the 8,
and so it is 800, and let us assume it is
162
00:15:06,500 --> 00:15:09,149
a mistake, it is not obvious.
163
00:15:09,149 --> 00:15:17,129
Now, in the case of the mean, this would
have a huge impact: instead of the 8 being here
164
00:15:17,129 --> 00:15:26,170
you would put an 800 and that would greatly
change this number.
165
00:15:26,170 --> 00:15:32,050
So, your mean is largely affected; it probably
sends the number into the 100's.
166
00:15:32,050 --> 00:15:41,850
Now, it will have no impact on the median,
this 8 gets listed, it becomes 800, which
167
00:15:41,850 --> 00:15:47,940
means that it is not in its place, it comes
after the 9.
168
00:15:47,940 --> 00:15:58,750
So, make some space here, so the 800 comes
here, but the central two numbers still remain
169
00:15:58,750 --> 00:16:04,970
these two 4s, so your
answer really did not change.
170
00:16:04,970 --> 00:16:11,329
So, in many ways the outlier has this
huge impact on the mean, but it has no impact
171
00:16:11,329 --> 00:16:12,950
on the median.
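This contrast is easy to verify in Python on the hypothetical stand-in data set, replacing its 8 with the mistyped 800:

```python
import statistics

# Hypothetical stand-in data set, and a copy with the 8 mistyped as 800.
data = [3, 4, 3, 1, 2, 4, 5, 6, 7, 8, 9, 3]
bad = [800 if x == 8 else x for x in data]

# The single outlier drags the mean from about 4.6 up to about 70...
print(statistics.mean(data), statistics.mean(bad))
# ...while the median is untouched: the 800 just moves to the end of
# the sorted list, and the two central 4s stay where they were.
print(statistics.median(data), statistics.median(bad))
```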
172
00:16:12,950 --> 00:16:19,949
Now, clearly, if you are faced with this
kind of an error, you would like using the median,
173
00:16:19,949 --> 00:16:23,139
because the mean is susceptible to this problem.
174
00:16:23,139 --> 00:16:28,770
There might be other situations, where you
want to use the median, and again it pertains to
175
00:16:28,770 --> 00:16:35,079
outliers, but here we are not as much scared
about errors; you are scared that there
176
00:16:35,079 --> 00:16:40,170
is this one atypical case, which is just
skewing the story that you want to tell.
177
00:16:40,170 --> 00:16:47,870
So, a classic example of where medians are
used is, when looking at salaries of people,
178
00:16:47,870 --> 00:16:52,500
where the idea is that salaries are, in some
sense, exponentially distributed.
179
00:16:52,500 --> 00:16:57,009
Many people earn a consistent salary, and then
there are these few people who just earn these
180
00:16:57,009 --> 00:17:02,970
catastrophically high amounts, and so for something
like the mean, the idea here is that the
181
00:17:02,970 --> 00:17:06,750
catastrophically high amounts are these outliers.
182
00:17:06,750 --> 00:17:13,990
And here talking about a mean will not give
you the typical salary that a person earns,
183
00:17:13,990 --> 00:17:19,710
because of these one or two people who earn
very high salaries.
184
00:17:19,710 --> 00:17:26,590
So, it is not just errors; there might be other
situations, where you have outliers and you
185
00:17:26,590 --> 00:17:34,430
feel like the presence of these outliers is
moving you away from talking really about,
186
00:17:34,430 --> 00:17:41,870
what is a typical value. Now, having said that,
there are many situations again where you
187
00:17:41,870 --> 00:17:47,870
are dealing with outliers, but these outliers
are a very important part of the story.
188
00:17:47,870 --> 00:17:50,460
So, let me give you an example of this.
189
00:17:50,460 --> 00:17:58,250
So, let us say you have this data set, where
you were looking at a particular financial
190
00:17:58,250 --> 00:18:00,090
strategy.
191
00:18:00,090 --> 00:18:06,810
And in this financial strategy you are looking
at how much money you made on a daily basis
192
00:18:06,810 --> 00:18:14,401
and so you have taken some historic data
and you want to see what
193
00:18:14,401 --> 00:18:20,890
is a typical scenario of the strategy; you
want to evaluate the strategy based
194
00:18:20,890 --> 00:18:22,110
on this data set.
195
00:18:22,110 --> 00:18:33,380
So, let us say this financial strategy actually
made you lose 1 rupee every day on 99 percent
196
00:18:33,380 --> 00:18:43,560
of the days, but on 1 percent of the days
this strategy gave you 10 crores, a large
197
00:18:43,560 --> 00:18:44,770
enough number.
198
00:18:44,770 --> 00:18:49,910
So, this strategy made you lose money on 99
percent of the days and on 1 percent of the
199
00:18:49,910 --> 00:18:59,020
days gave you 10 crores. Is this a strategy you
would like to take? The very straightforward
200
00:18:59,020 --> 00:19:03,790
answer is: if you like making money, you really
like this strategy, because despite the fact
201
00:19:03,790 --> 00:19:08,190
that you lose just 1 rupee on 99 percent of
the days, as long as you can play this
202
00:19:08,190 --> 00:19:12,760
game, or trade on this strategy in
the stock market, for a long enough period of time,
203
00:19:12,760 --> 00:19:16,990
you are bound, in the long run, to make a good amount
of money.
204
00:19:16,990 --> 00:19:22,800
Because the 10 crores more than compensates for
the 99 percent of days during which you lost the
205
00:19:22,800 --> 00:19:23,800
1 rupee.
206
00:19:23,800 --> 00:19:30,000
Now, let us see how mean and median would
have represented this data, if you had
207
00:19:30,000 --> 00:19:34,460
a sufficiently large data set from having
actually played the strategy.
208
00:19:34,460 --> 00:19:40,240
So, let us say you go and you actually collect
your data set of the strategy over a 1000
209
00:19:40,240 --> 00:19:48,500
days or, let us say, 10,000 days. What would
be the median of this strategy? The answer
210
00:19:48,500 --> 00:19:56,510
is that the median would have been the minus
1, the 1 rupee that you lost.
211
00:19:56,510 --> 00:20:03,010
So, quite simply, how is that? The data
set for this would look something like this.
212
00:20:03,010 --> 00:20:10,440
It would look like this: minus 1 comma minus 1
comma minus 1 comma dot, dot, dot, with
213
00:20:10,440 --> 00:20:14,331
a lot of minus 1's; 99 percent of them are
minus 1's, and then the odd time you are
214
00:20:14,331 --> 00:20:19,570
going to find this really large number;
I am not even going to talk about how many
215
00:20:19,570 --> 00:20:23,760
0's, lots of 0's, dot, dot, dot for the 0's.
216
00:20:23,760 --> 00:20:28,150
So, you put this in ascending order and you
choose the middle value that is going to be
217
00:20:28,150 --> 00:20:31,280
a minus 1, so the median gives you a minus
1.
218
00:20:31,280 --> 00:20:38,230
However, for the mean, you add all
of these numbers up together, including your
219
00:20:38,230 --> 00:20:44,950
10 crores in there with lots of 0's, and then
divide by the total number of data points,
220
00:20:44,950 --> 00:20:50,250
which is, you know, 1000 or 10000 or whatever
the number of data points you have,
221
00:20:50,250 --> 00:20:59,280
and you are going to get a very large positive
number, positive and large.
222
00:20:59,280 --> 00:21:05,940
So, here is a story where, yes, you have
an outlier; you got this huge outlier,
223
00:21:05,940 --> 00:21:08,490
which is 10 crores.
224
00:21:08,490 --> 00:21:14,520
But, the story was in the outlier; that is
as much real money that you made or lost as
225
00:21:14,520 --> 00:21:16,230
the 1 rupee that you lost.
226
00:21:16,230 --> 00:21:22,750
So, here is the case where the mean is probably
a great measure to go by if you had to choose
227
00:21:22,750 --> 00:21:25,090
whether to play this strategy or not.
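The whole argument can be checked numerically. The record below is a hypothetical 1000-day history consistent with the lecture's setup: lose 1 rupee on 99 percent of the days, gain 10 crores (10 to the power 8 rupees) on the remaining 1 percent.

```python
import statistics

# Hypothetical 1000-day record of the strategy: lose 1 rupee on
# 990 days, gain 10 crores (10**8 rupees) on the other 10 days.
days = [-1] * 990 + [10**8] * 10

print(statistics.median(days))  # -1: the typical day loses a rupee
print(statistics.mean(days))    # roughly 10 lakh per day: the outlier days carry the story
```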
228
00:21:25,090 --> 00:21:29,740
So, I hope that gives you some idea between
mean and median. Now, let us talk a little
229
00:21:29,740 --> 00:21:36,410
bit about the mode. It is an interesting one, because
it just blankly says, I am going to take
230
00:21:36,410 --> 00:21:44,560
the value which is the most popular, and that
works fine for, you know, distributions which
231
00:21:44,560 --> 00:21:46,600
are fairly symmetric.
232
00:21:46,600 --> 00:21:56,820
But, in many cases people do not
find that too meaningful. The one big advantage,
233
00:21:56,820 --> 00:22:00,710
however, that the mode has is that you can
even use it on nominal variables.
234
00:22:00,710 --> 00:22:09,380
So, you might have a situation where you
are just coming
235
00:22:09,380 --> 00:22:16,381
up with the count associated with a categorical
variable; an example could be the number
236
00:22:16,381 --> 00:22:20,590
of reds, the number of greens, the number of
yellows, and all the mode is going to do is
237
00:22:20,590 --> 00:22:27,700
say, let's pick the one which has the most of
it. The mode can also be fairly useful with
238
00:22:27,700 --> 00:22:29,270
multimodal distributions.
239
00:22:29,270 --> 00:22:35,980
Let me give you an example of a multimodal
distribution; multimodal just means
240
00:22:35,980 --> 00:22:37,740
that there are many peaks to the distribution.
241
00:22:37,740 --> 00:22:43,930
So, if you go back to this slide and let me
just erase that for you.
242
00:22:43,930 --> 00:22:49,060
So, the red line is the distribution, and we
are going to talk a lot more about
243
00:22:49,060 --> 00:22:54,330
distributions. A multimodal distribution is
one that might look like this; there are
244
00:22:54,330 --> 00:22:58,730
like two peaks to this distribution.
245
00:22:58,730 --> 00:23:04,980
So, let me give you a real-life
example of where you could have a multimodal
246
00:23:04,980 --> 00:23:09,710
distribution and where you might want to
use the mode.
247
00:23:09,710 --> 00:23:18,400
So, let us say we looked at this street,
and this street was 100 meters long.
248
00:23:18,400 --> 00:23:23,060
So, one end of this street is 0 meters and
then, there are markers on this street.
249
00:23:23,060 --> 00:23:26,820
So, this is 1 meter, 2 meters, 3 meters, and the
street goes all the way to 100 meters.
250
00:23:26,820 --> 00:23:32,300
So, if someone said the 75 meter mark, you immediately
knew which point of the street or road we
251
00:23:32,300 --> 00:23:34,000
are talking about.
252
00:23:34,000 --> 00:23:42,870
So, all the residents of this street need
to make a decision on where to place a garbage
253
00:23:42,870 --> 00:23:51,530
can, a trash can, about which, let us say, for whatever
reason, people have strong opinions
254
00:23:51,530 --> 00:23:52,530
of that.
255
00:23:52,530 --> 00:23:57,740
So, we are going
to take a survey of all
256
00:23:57,740 --> 00:24:01,300
the residents, and each one is going to come
up with a number.
257
00:24:01,300 --> 00:24:08,510
So, person one says, I want the garbage can
at the 25 meter mark; another person says, I
258
00:24:08,510 --> 00:24:12,970
want it at the 50 meter mark, and so on and
so forth.
259
00:24:12,970 --> 00:24:21,290
Now, let us say we collected all this data
and we found that 40 percent of the residents
260
00:24:21,290 --> 00:24:29,630
said they want the garbage can in the 25th
meter mark.
261
00:24:29,630 --> 00:24:41,530
Let us say another 40 percent or let us say
45 percent said they wanted the trash can
262
00:24:41,530 --> 00:24:43,790
on the 75th meter mark.
263
00:24:43,790 --> 00:24:48,171
So, just to recap: 40 percent of the people
say they want the trash can on the 25th meter
264
00:24:48,171 --> 00:24:54,770
mark, 45 percent of the residents said
they wanted it on the 75th meter mark, and the
265
00:24:54,770 --> 00:25:00,370
remaining 15 percent are just all over,
like uniform, somewhere between
266
00:25:00,370 --> 00:25:02,830
0 and 100.
267
00:25:02,830 --> 00:25:11,340
Now, the problem is both mean and median might
wind up saying the average preference is to
268
00:25:11,340 --> 00:25:20,101
keep the trash can somewhere at the 51 or 52
meter mark, because that is bound to be a
269
00:25:20,101 --> 00:25:22,000
central value.
270
00:25:22,000 --> 00:25:29,210
Now, that might be something that nobody wanted,
whereas something like the mode would just
271
00:25:29,210 --> 00:25:34,880
categorically say, keep it at the 75th meter mark,
because that is the most popular preference.
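The survey can be simulated with hypothetical numbers chosen to match the percentages in the lecture: 100 residents, 40 votes at the 25 meter mark, 45 votes at the 75 meter mark, and 15 scattered along the street (the scattered values themselves are assumptions).

```python
import statistics
from collections import Counter

# Hypothetical survey of 100 residents of the 100 m street.
scattered = [5, 12, 20, 28, 33, 38, 42, 46, 48, 50, 52, 80, 85, 90, 95]
votes = [25] * 40 + [75] * 45 + scattered

print(statistics.mean(votes))    # about 51 m: a spot almost nobody asked for
print(statistics.median(votes))  # 51.0: also a compromise
print(Counter(votes).most_common(1)[0][0])  # 75: the most popular mark
```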
272
00:25:34,880 --> 00:25:41,820
So, in cases where kind of taking two extremes
and averaging them out, which in some sense the median
273
00:25:41,820 --> 00:25:48,100
also does as long as there are enough
data points, does not work, and in those cases
274
00:25:48,100 --> 00:25:51,340
the mode could be a fairly useful application.
275
00:25:51,340 --> 00:25:54,890
I hope this gives you an idea of the different
measures of central tendency.
276
00:25:54,890 --> 00:25:58,520
In the next lecture we will take up measures
of dispersion.