1
00:00:14,349 --> 00:00:18,939
hello and welcome to today's lecture so we
will continue from where we left off in the last
2
00:00:18,939 --> 00:00:23,460
lecture we were discussing the relative
advantages and disadvantages
3
00:00:23,460 --> 00:00:27,359
of computing metrics such as the mean
median and mode and then we went on to start
4
00:00:27,359 --> 00:00:32,809
discussing ways and means of
quantifying the extent
5
00:00:32,809 --> 00:00:37,520
of dispersion in the data so just a quick
recap between mean median and mode so as you
6
00:00:37,520 --> 00:00:42,780
know mean is much more sensitive to the presence
of outliers in the data the mode is of course
7
00:00:42,780 --> 00:00:48,100
not sensitive median is also not sensitive
mode is primarily used while describing large
8
00:00:48,100 --> 00:00:55,670
sets of data while mean and median can
be used for describing both small and large
9
00:00:55,670 --> 00:01:11,560
sets of data right but as we discussed
if you have a population like this
10
00:01:11,560 --> 00:01:32,460
you know this is x this is your frequency
so just computing one value as mean or median
11
00:01:32,460 --> 00:01:37,369
or mode is not sufficient there has to be
a way of capturing the dispersion in this
12
00:01:37,369 --> 00:01:43,640
data and there are various metrics for
computing this the simplest is the
13
00:01:43,640 --> 00:01:50,590
range and the range is essentially
defined as the maximum minus the minimum of
14
00:01:50,590 --> 00:01:52,259
this data ok
so as you can clearly see in this case the
15
00:01:52,259 --> 00:01:58,880
minimum is forty the maximum is hundred so
the range is sixty
16
00:01:58,880 --> 00:02:06,609
so the
range has to be thought of in the context
17
00:02:06,609 --> 00:02:17,920
of the minimum or the maximum or the average
ok so you can have a value of one it won't
18
00:02:17,920 --> 00:02:27,500
change your maximum but it will drastically
change the mean so you have to think of
19
00:02:27,500 --> 00:02:33,230
the range in the context of whatever
is the minimum and the maximum value we then
20
00:02:33,230 --> 00:02:38,380
discussed standard deviation so standard
deviation is built from the sum of the squares
21
00:02:38,380 --> 00:02:44,130
of the deviations of the x values from the average either the population
average or the sample average the notable
22
00:02:44,130 --> 00:02:51,250
point is that you square the deviations
and then add them up so it doesn't matter
23
00:02:51,250 --> 00:03:03,590
whether the value was less or greater than
the average and then you divide by
24
00:03:03,590 --> 00:03:09,140
the total number of observations in the population and take the square root
but in case of calculating the standard deviation
25
00:03:09,140 --> 00:03:15,690
of a sample you divide by n minus one
so this is the most important difference between
26
00:03:15,690 --> 00:03:39,960
these two cases we had then
derived the expression that
27
00:03:39,960 --> 00:03:44,550
when you sum x i minus x bar whole square
it is exactly
28
00:03:44,550 --> 00:03:51,300
the same as calculating this particular term
which is summation x square minus n times
29
00:03:51,300 --> 00:03:58,760
x bar square so let us just take this particular
example and see whether this holds good
30
00:03:58,760 --> 00:04:08,660
or not ok so let me write down the values
of x i x i minus x bar ok so x i is two three
31
00:04:08,660 --> 00:04:16,070
four five six seven eight nine ten ok my x
bar i can calculate by pairing so two plus ten is going to
32
00:04:16,070 --> 00:04:22,470
be twelve plus twelve plus twelve plus twelve
forty eight plus six is fifty four by one two three four five
33
00:04:22,470 --> 00:04:27,539
six seven eight nine so fifty four by nine is equal to six
ok my x bar is six so i will tabulate two terms
34
00:04:27,539 --> 00:04:30,350
x square and maybe x i minus x bar whole square
ok
35
00:04:30,350 --> 00:04:39,010
so for two x i minus x bar is minus four and the square
is sixteen for three it is minus three and the square is nine for four
36
00:04:39,010 --> 00:04:48,310
it is minus two and the square is four for five it is minus one and the square is
one six gives zero seven gives me one eight gives
37
00:04:48,310 --> 00:04:58,090
me four nine gives me nine ten gives me sixteen
and for x square two square is four three square is nine then sixteen
38
00:04:58,090 --> 00:05:11,440
twenty five thirty six forty nine sixty four
eighty one hundred ok so i have to add up
39
00:05:11,440 --> 00:05:17,060
all these x squares so let us see what
this gives us four and nine is thirteen and sixteen twenty
40
00:05:17,060 --> 00:05:21,860
nine twenty nine and twenty five fifty four
fifty four and thirty six is ninety
41
00:05:21,860 --> 00:05:27,200
so you have ninety
plus forty nine plus sixty four and eighty one which is one forty five
42
00:05:27,200 --> 00:05:32,510
plus hundred ok so ninety and forty nine is
one thirty nine plus one forty five is two eighty four plus hundred is three eighty
43
00:05:32,510 --> 00:05:42,010
four so summation
x square is three eighty
44
00:05:42,010 --> 00:05:46,440
four and summation x i minus x bar whole square
is sixteen plus nine twenty five plus
45
00:05:46,440 --> 00:05:49,720
four twenty nine thirty thirty one thirty
five forty four sixty ok
46
00:05:49,720 --> 00:05:53,780
so i know this value so let me again write
it down here my summation x square is three eighty four
47
00:05:53,780 --> 00:05:59,110
and my summation x i minus x bar whole square
is sixty ok so this is sixty now summation
48
00:05:59,110 --> 00:06:02,920
x square minus n times x bar square is
equal to three eighty four minus n which
49
00:06:02,920 --> 00:06:08,930
is nine into x bar square x bar is six so the square is thirty
six so thirty six into nine
50
00:06:08,930 --> 00:06:15,990
nine sixes are fifty four so it is three
twenty four so this is three eighty four minus
51
00:06:15,990 --> 00:06:24,170
three twenty four is equal to sixty so
this and this is the same you have confirmed
52
00:06:24,170 --> 00:06:35,270
these two values are the same now of course
when you are doing it by hand there is
53
00:06:35,270 --> 00:06:46,630
no real value in doing this in that case
the x i minus x bar route is easier
54
00:06:46,630 --> 00:07:02,780
but while you are doing these calculations
on a computer by writing a program this
55
00:07:02,780 --> 00:07:06,210
is much more advantageous ok
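
The identity checked by hand above can also be confirmed with a few lines of code. This is only a sketch in Python using the lecture's values two through ten; the variable names are illustrative and not from the lecture.

```python
# Lecture data: x = 2, 3, ..., 10 (n = 9, mean = 6)
xs = list(range(2, 11))
n = len(xs)
mean = sum(xs) / n

# Direct form: sum of squared deviations from the mean
direct = sum((x - mean) ** 2 for x in xs)

# Shortcut form: sum of squares minus n times the squared mean
shortcut = sum(x * x for x in xs) - n * mean ** 2

print(mean, direct, shortcut)
```

The shortcut form only needs the running totals of x and x squared, which is why it is convenient inside a program.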
so we then discussed transformations
56
00:07:06,210 --> 00:07:15,030
so if you have y is equal to
a x then s y will simply be a
57
00:07:15,030 --> 00:07:25,360
times s x if you have y is equal to c plus
x s y is simply equal to s x and in the general
58
00:07:25,360 --> 00:07:31,110
case when y is equal to c plus a x s y will
still be equal to a s x so bottom line is
59
00:07:31,110 --> 00:07:34,360
when you have a pre factor which multiplies
your variable x to generate your variable
60
00:07:34,360 --> 00:07:41,210
y then that also gets reflected in the standard
deviation calculation but when you have a
61
00:07:41,210 --> 00:07:52,130
constant being added it makes no contribution
to the variation so let us think
62
00:07:52,130 --> 00:08:06,440
of the practical significance
of what the standard deviation tells us ok
63
00:08:06,440 --> 00:08:24,210
so let us say
this is your data ok x and frequency
64
00:08:24,210 --> 00:08:35,620
ok and let's just
say that this is your mean and you have a
65
00:08:35,620 --> 00:08:39,600
standard deviation ok
so let us say this may be your standard deviation
66
00:08:39,600 --> 00:08:45,329
ok so we want to know what this standard deviation
conveys so there is one theorem i will turn to which is
67
00:08:45,329 --> 00:08:49,579
very important it is called chebyshev's theorem
so what it states given a number k greater
68
00:08:49,579 --> 00:08:58,689
than one and a set of n measurements at
least one minus one by k square of the measurements
69
00:08:58,689 --> 00:09:04,720
will lie within k standard deviations of their
mean so what it tells us is if you have mu
70
00:09:04,720 --> 00:09:13,680
or x bar as your mean
and s as your standard deviation
71
00:09:13,680 --> 00:09:21,110
then x bar plus minus two times the standard deviation
will contain how much of the population
72
00:09:21,110 --> 00:09:27,850
one minus one by two square is equal to three
fourths so at least seventy five percent of the population
73
00:09:27,850 --> 00:09:29,970
will lie between x bar minus two s and x
bar plus two s
74
00:09:29,970 --> 00:09:40,240
here the two multiplies
the standard deviation not the mean
75
00:09:40,240 --> 00:09:50,509
so this is the standard deviation
s ok
76
00:09:50,509 --> 00:09:59,720
so at least seventy five percent of the population
will lie between x bar minus two s and x
77
00:09:59,720 --> 00:10:14,240
bar plus two s ok similarly if you put k is
equal to three then x bar plus minus three
78
00:10:14,240 --> 00:10:32,649
s you have the value one minus one by three
square equal to eight by nine so at least
79
00:10:32,649 --> 00:10:43,990
roughly eighty nine percent of your data will lie within
three standard deviations of your mean
80
00:10:43,990 --> 00:10:58,499
so this is chebyshev's theorem let us test
it with the following example so you have
81
00:10:58,499 --> 00:11:03,350
these values ok twenty six point one twenty
six fourteen point five twenty nine point
82
00:11:03,350 --> 00:11:08,100
three ok but before we go to this
i think i skipped an example so let us come
83
00:11:08,100 --> 00:11:12,499
to this example so imagine you have twenty
six observations and the mean of the population
84
00:11:12,499 --> 00:11:21,420
is seventy five ok so you have n is equal
to twenty six x bar is equal to seventy five
85
00:11:21,420 --> 00:11:30,899
and variance equal to hundred so standard
deviation is equal to ten so the theorem tells us that
86
00:11:30,899 --> 00:11:39,990
if n is equal to twenty six then between
x bar plus minus two times standard deviation
87
00:11:39,990 --> 00:11:44,540
which would mean that seventy five plus minus
twenty
88
00:11:44,540 --> 00:11:50,029
so between the values fifty five and ninety
five you will have at least three fourths of the population
89
00:11:50,029 --> 00:11:54,949
that is at least three fourths into twenty
six observations will actually be within this
90
00:11:54,949 --> 00:12:00,570
range similarly x bar plus minus three standard
deviations would give me seventy five plus
91
00:12:00,570 --> 00:12:08,240
minus thirty so between the numbers forty five
and hundred and five you will have at least eight by
92
00:12:08,240 --> 00:12:12,389
nine into twenty six of the observations
lying within this range so this allows
93
00:12:12,389 --> 00:12:20,720
us to think of how much of the data
represents the bulk of the data what
94
00:12:20,720 --> 00:12:27,290
fraction of the population represents
the bulk of the data let us test this particular
95
00:12:27,290 --> 00:12:30,579
example as well so you have the following
population and you can calculate the mean
96
00:12:30,579 --> 00:12:37,529
of this population to be roughly twenty two
so the number of observations are one two
97
00:12:37,529 --> 00:12:41,389
three four five
so in this case n is equal to fifteen your
98
00:12:41,389 --> 00:12:48,589
x bar is equal to twenty two
so what chebyshev's theorem predicts is
99
00:12:48,589 --> 00:13:14,110
that at least three fourths of the values lie between twenty two plus minus two times the
standard deviation ok but we have not calculated
100
00:13:14,110 --> 00:13:18,889
the standard deviation so let us see ok
so you have twenty six point one twenty six
101
00:13:18,889 --> 00:13:23,010
fourteen point five ok so we have to calculate
the standard deviation for this particular
102
00:13:23,010 --> 00:13:29,519
problem i will leave it to you to
do it but you can test it
103
00:13:29,519 --> 00:13:33,119
so step one calculate the standard deviation step
two calculate x bar plus minus two times standard
104
00:13:33,119 --> 00:13:39,100
deviation and step
three check if this theorem is true
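
The three steps just described can be sketched in Python. The first part uses the lecture's numbers (n of twenty six, mean seventy five, standard deviation ten). The fifteen-value list in the second part is invented purely for illustration, since the full data set from the slide is not reproduced in this transcript; the function names are also assumptions.

```python
import statistics

def chebyshev_bound(k):
    """Minimum fraction of values within k standard deviations (k > 1)."""
    return 1 - 1 / k ** 2

def fraction_within(data, k):
    """Observed fraction of data within k sample standard deviations of the mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation, divides by n - 1
    inside = [x for x in data if abs(x - mean) <= k * s]
    return len(inside) / len(data)

# Lecture example: n = 26, mean 75, standard deviation 10
n, mean, s = 26, 75, 10
for k in (2, 3):
    low, high = mean - k * s, mean + k * s
    # interval, and the minimum number of observations expected inside it
    print(k, (low, high), chebyshev_bound(k) * n)

# Hypothetical values, made up for illustration only (not the lecture's data)
sample = [12, 15, 18, 20, 21, 22, 22, 23, 24, 25, 26, 27, 29, 31, 40]
print(fraction_within(sample, 2) >= chebyshev_bound(2))
```

Replacing `sample` with the actual values from the exercise gives the check that is left to the viewer.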
105
00:13:39,100 --> 00:13:48,130
so chebyshev's theorem holds for any distribution
but many times you obtain a mound
106
00:13:48,130 --> 00:13:59,759
like distribution where there is a clear
peak and there is a tendency for the data to accumulate
107
00:13:59,759 --> 00:14:03,759
towards the center of the distribution this
is a mound type of distribution and very often
108
00:14:03,759 --> 00:14:09,209
the term normal distribution is used so
for these kinds of distributions what is known
109
00:14:09,209 --> 00:14:13,269
is that within one standard deviation of the mean
about sixty eight percent of the data is there within
110
00:14:13,269 --> 00:14:15,910
two standard deviations of the mean about ninety
five percent of the data is there ok and
111
00:14:15,910 --> 00:14:21,620
within three standard deviations of the mean
about ninety nine point seven percent of the data
112
00:14:21,620 --> 00:14:23,490
is there
so you can see that compared to what chebyshev's
113
00:14:23,490 --> 00:14:27,550
theorem predicts which is that within two
standard deviations you have seventy five
114
00:14:27,550 --> 00:14:35,239
percent of the data so this clearly tells
you that chebyshev's theorem is a much more
115
00:14:35,239 --> 00:14:40,819
conservative estimate that means it under
estimates how much of the population is covered because it
116
00:14:40,819 --> 00:14:55,350
makes no assumption as to how the population
is distributed but for these particular normal
117
00:14:55,350 --> 00:14:59,360
distributions you see that instead of seventy
five percent predicted by chebyshev's theorem
118
00:14:59,360 --> 00:15:06,569
ninety five percent of the population falls
between plus minus two standard deviations
119
00:15:06,569 --> 00:15:09,860
so this brings us to a very good exercise
so for all practical purposes you can
120
00:15:09,860 --> 00:15:12,269
consider ninety five percent of the data to
be the entire data itself ok
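
To put the two sets of percentages side by side, here is a small Python sketch. It assumes an exactly normal distribution, for which the fraction within k standard deviations of the mean is erf(k / sqrt(2)); the function names are illustrative.

```python
import math

def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

def chebyshev_bound(k):
    """Chebyshev's lower bound for the same interval (uninformative at k = 1)."""
    return 1 - 1 / k ** 2

for k in (1, 2, 3):
    print(k, round(normal_coverage(k), 4), round(chebyshev_bound(k), 4))
```

For k equal to two this compares roughly ninety five percent under the normal assumption with the seventy five percent floor that Chebyshev's theorem guarantees for any distribution.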
121
00:15:12,269 --> 00:15:19,920
so when you calculate standard deviation right
you can check whether the value
122
00:15:19,920 --> 00:15:26,569
of standard deviation that you have calculated
is approximately right or not the exercise
123
00:15:26,569 --> 00:15:35,329
is very simple you calculate the range right
so in case of a normal distribution
124
00:15:35,329 --> 00:15:37,509
x bar plus
minus two standard deviations is ninety five
125
00:15:37,509 --> 00:15:40,369
percent of the data which we can
approximate as hundred percent so if this
126
00:15:40,369 --> 00:15:46,149
is hundred percent of the data that means
the width of x bar plus minus two standard
127
00:15:46,149 --> 00:15:51,500
deviations will simply be equal
to the range since the range gives me the
128
00:15:51,500 --> 00:15:53,389
two extremes so this would mean that roughly
four times my standard deviation is equal to the
129
00:15:53,389 --> 00:15:55,549
range ok so that is a very easy way
to test
130
00:15:55,549 --> 00:16:02,109
so let us take the following example let say
if you have five seven one three four right
131
00:16:02,109 --> 00:16:10,290
my range is equal to maximum minus minimum
seven minus one is equal to six ok what
132
00:16:10,290 --> 00:16:14,399
is my standard deviation so first
my x bar is five plus seven plus one plus
133
00:16:14,399 --> 00:16:18,689
three plus four divided by five twelve thirteen
sixteen twenty so x bar is four so my standard
134
00:16:18,689 --> 00:16:22,689
deviation is equal to square root of five
minus four one square plus seven minus four
135
00:16:22,689 --> 00:16:28,029
three square plus one minus four three square
plus three minus four one square plus four
136
00:16:28,029 --> 00:16:33,700
minus four zero square right divided by n minus one so n is one two three
four five and n minus one is four which is equal
137
00:16:33,700 --> 00:16:41,290
to square root of nine plus one ten
ten and ten twenty so twenty by four which is equal
138
00:16:41,290 --> 00:16:50,749
to square root of five which will give you
a value of two point two right so your range
139
00:16:50,749 --> 00:16:58,089
is six and your standard deviation
is two point two that means that two point
140
00:16:58,089 --> 00:17:02,399
two into four is about eight point eight so they are comparable
so you know that this value is
141
00:17:02,399 --> 00:17:08,600
probably right ok but if you had gotten a
value of standard deviation of ten that would
142
00:17:08,600 --> 00:17:24,760
have surely told you that there
is something wrong in the calculation so this
143
00:17:24,760 --> 00:17:34,840
is not exact but it gives
you a way of testing whether the value you have
144
00:17:34,840 --> 00:17:45,909
calculated is reasonable or not
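
The same sanity check can be written out in Python for the five values used above. This is a rough sketch with illustrative variable names; the comparison of the range with four standard deviations is only a rule of thumb, not an exact relation.

```python
import math

xs = [5, 7, 1, 3, 4]
n = len(xs)
mean = sum(xs) / n

# Sample standard deviation: divide the sum of squared deviations by n - 1
s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

data_range = max(xs) - min(xs)

# Rule of thumb: the range should be of the same order as four standard deviations
print(data_range, round(s, 2), round(4 * s, 2))
```

Here the range is six and four times the standard deviation is a little under nine, so the two are of the same order, which is all the check is meant to establish.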
now when we are handling biological
145
00:17:45,909 --> 00:17:52,639
data right let us say you do an experiment where
you measure how far cells move you have huge
146
00:17:52,639 --> 00:17:58,290
heterogeneity in the data so your error bar
is going to be humongous so one way of reducing
147
00:17:58,290 --> 00:18:06,659
this error bar is
not to plot the standard deviation
148
00:18:06,659 --> 00:18:10,780
but to plot the standard error of the
mean and the standard error of the mean is
149
00:18:10,780 --> 00:18:22,679
defined by sigma m is equal to
sigma by root of n so sigma is the
150
00:18:22,679 --> 00:18:28,640
standard deviation and you divide it by the square root
of n so this clearly tells you that when your n
151
00:18:28,640 --> 00:18:35,160
increases sigma m will decrease so
if n is four for example sigma m will be sigma
152
00:18:35,160 --> 00:18:42,090
by two half of sigma if n is hundred then
sigma m will be sigma by ten so your
153
00:18:42,090 --> 00:18:55,380
error bar will decrease but again
this is a measure that tells you how confident
154
00:18:55,380 --> 00:19:01,139
you are in your estimate of the mean ok
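
A minimal Python sketch of the standard error of the mean follows. The value of sigma is hypothetical and chosen only to make the arithmetic easy to read; the function name is an assumption.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: standard deviation divided by sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 8.0  # hypothetical standard deviation, for illustration only
for n in (4, 100):
    print(n, standard_error(sigma, n))
```

With n equal to four the error bar is half of sigma, and with n equal to hundred it is one tenth of sigma, matching the two cases mentioned above.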
so one more thing which is very important
155
00:19:01,139 --> 00:19:07,179
is the concept of z score or relative scoring
right imagine a class of students have taken
156
00:19:07,179 --> 00:19:19,940
an exam and you always want to know how you
are positioned with respect to the entire
157
00:19:19,940 --> 00:19:29,970
population or how you have performed relative
to how the class has performed and that is the z score
158
00:19:29,970 --> 00:19:40,660
ok so the z score is defined for a given
score x as x minus x bar by the standard deviation
159
00:19:40,660 --> 00:19:47,830
ok so the z score is defined by this particular
expression so let us take an example if my
160
00:19:47,830 --> 00:19:52,179
mean is twenty five and my standard deviation
is four then if x is equal to thirty i get
161
00:19:52,179 --> 00:19:55,460
a z score of thirty minus twenty five by
four which is roughly one ok thirty minus
162
00:19:55,460 --> 00:20:04,090
twenty five is five so it is actually one
point two five
163
00:20:04,090 --> 00:20:11,029
but for practical purposes it is roughly one
so the z score gives you a
164
00:20:11,029 --> 00:20:21,059
way of testing whether a particular value
is an outlier or not and this is empirical
165
00:20:21,059 --> 00:20:32,259
in nature but what is considered is if your
z score is greater than three then that
166
00:20:32,259 --> 00:20:35,820
particular value is an outlier let us take
the following example ok so you have these
167
00:20:35,820 --> 00:20:44,159
values so your data is one one zero fifteen
two three four zero one three right so for
168
00:20:44,159 --> 00:20:49,870
this particular value fifteen the z score
let us say we of course have to calculate
169
00:20:49,870 --> 00:20:55,029
the x bar ok x bar is two plus fifteen seventeen
nineteen twenty two twenty six twenty seven thirty by one two three
170
00:20:55,029 --> 00:21:00,490
four five six seven eight nine ten ok so x
bar is three now for the standard deviation
171
00:21:00,490 --> 00:21:04,159
most of the deviations will be small so your standard
deviation will be let us say of order
172
00:21:04,159 --> 00:21:09,250
three then my z score for fifteen will be
fifteen minus x bar which is three by three
173
00:21:09,250 --> 00:21:15,820
giving a value of four but i think the standard
deviation
174
00:21:15,820 --> 00:21:21,669
will probably be of order four or five
ok but what you can clearly see
175
00:21:21,669 --> 00:21:25,650
is that this may
not be an exact calculation but in this case
176
00:21:25,650 --> 00:21:29,759
just by looking at
the data when you see most of your points
177
00:21:29,759 --> 00:21:33,419
are within four or five and then this fifteen
value is sticking out then you know that
178
00:21:33,419 --> 00:21:38,830
this is really an outlier and one way to do
it is to calculate a z score and see whether
179
00:21:38,830 --> 00:21:51,840
this value is greater than three if it is
greater than three you can completely remove
180
00:21:51,840 --> 00:21:53,110
this value and then recalculate your statistics
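
The z score calculation can be carried out exactly with a short Python sketch, using the lecture's two examples. The helper name is illustrative, and the choice of the sample standard deviation (dividing by n minus one) is an assumption. With that choice the standard deviation of these ten values comes out near four point four, so the z score of fifteen is about two point seven, slightly under the cutoff of three even though the value visibly sticks out.

```python
import statistics

def z_score(x, mean, s):
    """Distance of x from the mean, in units of the standard deviation."""
    return (x - mean) / s

# First example from the lecture: mean 25, standard deviation 4, score 30
print(z_score(30, 25, 4))

# Second example: ten values with one large value
data = [1, 1, 0, 15, 2, 3, 4, 0, 1, 3]
mean = statistics.mean(data)
s = statistics.stdev(data)  # sample standard deviation (n - 1), an assumed choice
z15 = z_score(15, mean, s)
print(mean, round(s, 2), round(z15, 2))
```

Because the cutoff of three is empirical, a value this close to it is usually also judged by looking at the data, as done above.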
so this helps you to remove outliers which
181
00:21:53,110 --> 00:22:01,379
completely mask the information that is contained
in your data another important metric which
182
00:22:01,379 --> 00:22:08,549
is widely used is the concept of relative
standing or percentiles right so the
183
00:22:08,549 --> 00:22:12,750
definition of a percentile is as follows the
p th percentile is the value which is greater
184
00:22:12,750 --> 00:22:21,730
than p percent of the measurements ok so
based on that you can have the first
185
00:22:21,730 --> 00:22:29,020
quartile which is at twenty five
percent so the first quartile
186
00:22:29,020 --> 00:22:34,299
is the value which is positioned at position
point two five times n plus one accordingly
187
00:22:34,299 --> 00:22:38,170
the third quartile is defined at position
point seven five times n plus one
188
00:22:38,170 --> 00:22:41,899
and you can calculate something called the
interquartile range which is
189
00:22:41,899 --> 00:22:46,790
nothing but the difference between q three
and q one so again let us take the following
190
00:22:46,790 --> 00:22:54,309
example so you want to calculate the quartiles
so you have the following example four eighteen
191
00:22:54,309 --> 00:23:00,200
eleven thirteen twenty eight eleven nine so
first of all we of course sort them four eight
192
00:23:00,200 --> 00:23:05,940
eleven eleven thirteen ok sixteen eighteen
twenty twenty
193
00:23:05,940 --> 00:23:14,049
five so counting one two three four five six seven eight
nine you know so the sorted values are
194
00:23:14,049 --> 00:23:18,360
four eight eleven eleven
thirteen sixteen then you have eighteen then
195
00:23:18,360 --> 00:23:24,029
you have twenty and twenty five ok
so let us say these are your measurements
196
00:23:24,029 --> 00:23:36,010
one two three four five six seven eight nine
so you have a total of nine numbers so for calculating
197
00:23:36,010 --> 00:23:50,779
q one you do one fourth into nine plus one
ok which is two point five ok so it tells
198
00:23:50,779 --> 00:23:56,760
you that position two point five is midway between the second and third numbers eight
and eleven so for calculating q one
199
00:23:56,760 --> 00:24:05,700
you have to do eight plus point five times
eleven minus eight which is nine point five so this is how you will
200
00:24:05,700 --> 00:24:11,040
calculate q one ok for calculating q three
we need point seven five so three fourths times
201
00:24:11,040 --> 00:24:19,289
nine plus one is equal to seven point five
position right so for calculating q three
202
00:24:19,289 --> 00:24:29,720
so what is the seventh position one two three
four five six seven so seven point five is between eighteen and twenty
203
00:24:29,720 --> 00:24:38,169
so this is your q three so it is going to
be eighteen plus point five into twenty minus
204
00:24:38,169 --> 00:24:44,590
eighteen which is nineteen ok so this is how you will calculate
q one and q three
205
00:24:44,590 --> 00:24:50,370
and q two is nothing but your median ok q two
is nothing but your median
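
The position rule used above (position equal to p times n plus one, followed by interpolation between neighbouring values) can be sketched in Python as follows. The function names are illustrative, and other quartile conventions exist that give slightly different numbers.

```python
def value_at_position(sorted_values, position):
    """Interpolate the value at a 1-based, possibly fractional, position."""
    lower = int(position)        # whole part of the position
    frac = position - lower      # fractional part used for interpolation
    if lower < 1:
        return sorted_values[0]
    if lower >= len(sorted_values):
        return sorted_values[-1]
    a = sorted_values[lower - 1]
    b = sorted_values[lower]
    return a + frac * (b - a)

def quartiles(values):
    """Q1, Q2, Q3 using positions 0.25, 0.5 and 0.75 times (n + 1)."""
    xs = sorted(values)
    n = len(xs)
    return tuple(value_at_position(xs, p * (n + 1)) for p in (0.25, 0.5, 0.75))

# Sorted measurements from the lecture example
data = [4, 8, 11, 11, 13, 16, 18, 20, 25]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3, q3 - q1)
```

For these nine values the sketch gives nine point five, thirteen and nineteen, so the interquartile range is nine point five.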
206
00:24:50,370 --> 00:24:57,240
so this is another example where you can
calculate the same thing i am not going
207
00:24:57,240 --> 00:25:11,389
into the details but it is essentially the same thing
you order them you find out the total
208
00:25:11,389 --> 00:25:17,990
number of measurements you work out
which point is where and then
209
00:25:17,990 --> 00:25:21,769
since you know n you find out the position and you
have to interpolate so let us
210
00:25:21,769 --> 00:25:31,620
say hypothetically between these two numbers
you get a position of two point two
211
00:25:31,620 --> 00:25:38,110
five and this is
position two and this is the next position then you
212
00:25:38,110 --> 00:25:43,610
have to take the value at position two and add point
two five times the difference between these two points so
213
00:25:43,610 --> 00:25:50,029
once you
calculate the quartiles one of the ways of
214
00:25:50,029 --> 00:26:02,159
representing data which is widely
used is called a box plot so this is how a
215
00:26:02,159 --> 00:26:05,470
box plot looks ok
so you have this entire range within
216
00:26:05,470 --> 00:26:15,480
which this value is your median and the
square marker is your average
217
00:26:15,480 --> 00:26:21,120
so the square is your average this is your
median this is your
218
00:26:21,120 --> 00:26:25,990
third quartile and this is your
first quartile ok first quartile median third
219
00:26:25,990 --> 00:26:43,220
quartile this is the maximum and this is the
minimum but what you can see is these error
220
00:26:43,220 --> 00:26:46,580
bars don't always extend to the maximum
or the minimum because you calculate what
221
00:26:46,580 --> 00:26:52,769
is called the lower fence which is defined as
q one minus one point five times interquartile
222
00:26:52,769 --> 00:27:00,170
range and the
upper fence is defined as q three plus one
223
00:27:00,170 --> 00:27:05,679
point five times interquartile range so
when you plot for example in this particular
224
00:27:05,679 --> 00:27:09,230
data set what you see is that even after you have
done that there is this point which is a clear
225
00:27:09,230 --> 00:27:23,019
outlier ok but for this case the extreme point
and the point to which your error
226
00:27:23,019 --> 00:27:29,390
bars extend is the same point
so this would convey that this point is not
227
00:27:29,390 --> 00:27:36,470
an outlier ok so in this case also
you can see where the error bar ends
228
00:27:36,470 --> 00:27:45,649
here which conveys
the message that this point is an outlier
229
00:27:45,649 --> 00:27:56,200
the other thing i wanted you to look at here is
the square point which shows the mean position
230
00:27:56,200 --> 00:28:04,440
so what you see is in both these
data sets your mean and median are very close
231
00:28:04,440 --> 00:28:08,780
to each other ok while for this data for example
the mean is biased towards the upward side
232
00:28:08,780 --> 00:28:15,730
so this would mean that you have a smaller
number of values which populate this portion
233
00:28:15,730 --> 00:28:28,990
of the data and a greater number of values which
populate this portion of data because of which
234
00:28:28,990 --> 00:28:33,529
this is higher so
when none of your points lie outside the
235
00:28:33,529 --> 00:28:42,999
fences as in this case or this case your
minimum and the end of the error bar would coincide
236
00:28:42,999 --> 00:28:51,679
but in these cases in this case here or in
this case here these points are clear outliers
237
00:28:51,679 --> 00:28:56,700
which lie much outside the
bulk of the data
238
00:28:56,700 --> 00:29:03,940
so that is how you can detect outliers
you find out where is your lower fence where
239
00:29:03,940 --> 00:29:09,190
is your upper fence and then decide
how you want to plot
240
00:29:09,190 --> 00:29:15,070
your box plot so let us take a sample data
set let us say one two three eight ten
241
00:29:15,070 --> 00:29:22,549
twelve so this is your data set right so
your n is equal to one two three four five
242
00:29:22,549 --> 00:29:29,649
six so your q one is at position
point two five times six plus one which is
243
00:29:29,649 --> 00:29:39,340
equal to seven by four is equal to one point
seven five so my q one is at the one point seven
244
00:29:39,340 --> 00:29:46,789
five position so between the first and second values one and two it is
going to be one plus point seven five times
245
00:29:46,789 --> 00:29:57,330
two minus one is equal to one point seven
five ok the position was also one point seven five remember ok now
246
00:29:57,330 --> 00:30:04,460
for q three the position is point seven
five times six plus one is equal to three
247
00:30:04,460 --> 00:30:12,850
fourths into seven ok twenty one by four which
is five point two five so the fifth number is
248
00:30:12,850 --> 00:30:23,370
ten ok and you have the extra point two five so
q three is equal to ten plus point two five
249
00:30:23,370 --> 00:30:32,169
times twelve minus ten which is two so that is point five so
ten point five
250
00:30:32,169 --> 00:30:40,450
is q three ok so my i q r is equal to
q three minus q one which is ten point
251
00:30:40,450 --> 00:30:50,820
five minus one point seven five which is
eight point seven five
252
00:30:50,820 --> 00:30:58,720
ok i q r is eight point seven five
so as per this my lower fence has to be q
253
00:30:58,720 --> 00:31:11,379
one minus one point five times i q r right
so my lower fence is q one
254
00:31:11,379 --> 00:31:22,490
minus one point five into eight point seven
five and my upper fence is q three plus one point
255
00:31:22,490 --> 00:31:33,259
five into eight point seven five ok so
i think none of these points will be outside
256
00:31:33,259 --> 00:31:36,999
the fences ok so all of these points will
lie inside your plot and your box plot if
257
00:31:36,999 --> 00:31:41,540
you plot it will probably look something like
this and your
258
00:31:41,540 --> 00:31:45,830
median since each of these values
occurs only once will be
259
00:31:45,830 --> 00:31:51,179
in between the two middle values and your mean and median will be very
close to each other and
260
00:31:51,179 --> 00:31:53,710
your error bars will encompass all
the points because none of the points lie
261
00:31:53,710 --> 00:31:57,749
outside ok but it is possible to have plots
where let us say you have an error
262
00:31:57,749 --> 00:32:02,559
bar like this ok where
this point is way outside the maximum to which
263
00:32:02,559 --> 00:32:07,890
the bar goes which is your upper fence that
is q three plus one point five times
264
00:32:07,890 --> 00:32:08,890
i q r
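
The fence calculation for the six-value data set can be completed numerically with a short Python sketch. It uses the same position rule as the lecture (p times n plus one); the helper name is illustrative and not from the lecture.

```python
def value_at_position(sorted_values, position):
    """Interpolate the value at a 1-based, possibly fractional, position."""
    lower = int(position)
    frac = position - lower
    if lower < 1:
        return sorted_values[0]
    if lower >= len(sorted_values):
        return sorted_values[-1]
    a = sorted_values[lower - 1]
    b = sorted_values[lower]
    return a + frac * (b - a)

data = [1, 2, 3, 8, 10, 12]
xs = sorted(data)
n = len(xs)

q1 = value_at_position(xs, 0.25 * (n + 1))  # position 1.75
q3 = value_at_position(xs, 0.75 * (n + 1))  # position 5.25
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond either fence would be drawn separately as outliers
outliers = [x for x in xs if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

Both fences fall well outside the data here, so the list of outliers is empty and the whiskers reach the minimum and maximum, in line with the expectation stated above.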
265
00:32:08,890 --> 00:32:13,070
with that i conclude here so essentially we
have discussed how to calculate standard deviation
266
00:32:13,070 --> 00:32:18,639
and then based
on mean and standard deviation how to get
267
00:32:18,639 --> 00:32:22,869
an idea of how much of the population lies
in the bulk and which of these points are
268
00:32:22,869 --> 00:32:27,419
outliers so using the z score you
can get an idea of which point is an outlier
269
00:32:27,419 --> 00:32:32,110
and using a box plot you can represent the whole
data to show how it looks so it of course
270
00:32:32,110 --> 00:32:35,870
conveys the message of which of these points
are outliers and which are not with that i
271
00:32:35,870 --> 00:32:44,429
thank you for your attention we will meet again in
the next class
272
00:32:44,429 --> 00:32:48,829
thank you