1
00:00:11,969 --> 00:00:17,750
Welcome to the next class. Today we are going
to talk about another distribution, called the
2
00:00:17,750 --> 00:00:22,181
Poisson distribution. In the previous class we
talked about the binomial distribution. In
3
00:00:22,181 --> 00:00:29,200
the binomial distribution you have n trials and
k successes, and the probability of each
4
00:00:29,200 --> 00:00:33,540
success is p (for example, one half); you are
asked to find the total probability. Whereas
5
00:00:33,540 --> 00:00:42,149
in the Poisson distribution, n becomes very
large and the binomial tends to the
6
00:00:42,149 --> 00:00:44,800
Poisson distribution and that is what it is
all about actually.
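The limit described here can be checked numerically. The following is a minimal sketch, not part of the lecture: it compares the binomial probability of exactly 3 events with the Poisson probability while n grows and n times p is held at 5. The function names and the values 5 and 3 are illustrative assumptions.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return exp(-lam) * lam**k / factorial(k)


lam = 5.0  # expected count, held fixed while n grows
for n in (10, 100, 10_000):
    p = lam / n  # p shrinks as n grows, so n * p stays at lam
    print(n, binomial_pmf(3, n, p), poisson_pmf(3, lam))
```

As n increases, the binomial column should move toward the Poisson column, which is the sense in which the binomial tends to the Poisson distribution.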
7
00:00:44,800 --> 00:00:53,020
It gives the probability of a number of independent
events occurring in a fixed time, that is, the probability
8
00:00:53,020 --> 00:01:00,580
of a particular event occurring a given number of times in a fixed
time. For example, the number of car accidents
9
00:01:00,580 --> 00:01:07,259
happening in a metro city in India
in the past one month, or the number of infant deaths
10
00:01:07,259 --> 00:01:13,940
in India in the past one year and so on. So,
based on a large amount
11
00:01:13,940 --> 00:01:20,360
of data, we are trying to find the probability of such events.
The probability p will be very small
12
00:01:20,360 --> 00:01:26,550
whereas the number of observations will be
very large. It is sort of related to the previous
13
00:01:26,550 --> 00:01:33,760
one which we saw the binomial but here the
n is very, very large and p becomes very,
14
00:01:33,760 --> 00:01:39,360
very small actually. This is also very useful
in biology as I am going to talk about a few
15
00:01:39,360 --> 00:01:44,950
examples. The Poisson distribution equation
looks like this: e power minus lambda into lambda
16
00:01:44,950 --> 00:01:50,060
raised to the power x, divided by x factorial;
x could be 0, 1, 2 and so on.
17
00:01:50,060 --> 00:01:56,090
How do you know lambda? Lambda is given by
this relation n into p is equal to lambda,
18
00:01:56,090 --> 00:02:01,869
here n is very, very large actually. Some
events are rather rare, they do not happen
19
00:02:01,869 --> 00:02:09,030
often like car accidents or infant deaths
and so on actually. For example, number of
20
00:02:09,030 --> 00:02:14,890
mutations in a given stretch of DNA after
it is exposed to radiations, in such sort
21
00:02:14,890 --> 00:02:21,030
of situations we use Poisson distribution.
The equation is like this: e power minus lambda
22
00:02:21,030 --> 00:02:26,720
into lambda raised to the power x, divided by x factorial. Lambda
is given by n into p, where n is the size of the data
23
00:02:26,720 --> 00:02:33,710
set and p is the probability of occurrence.
So, x could be 0 if you are looking at 0 events,
24
00:02:33,710 --> 00:02:38,099
x could be 1 if you are looking at 1 event,
2, 3 and so on actually.
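Written out, the formula read aloud above is as follows. The symbol X for the number of events is an assumed name; lambda, n, p and x are as in the lecture.

```latex
\[
P(X = x) = \frac{e^{-\lambda}\,\lambda^{x}}{x!},
\qquad x = 0, 1, 2, \dots,
\qquad \lambda = n\,p
\]
```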
25
00:02:38,099 --> 00:02:44,640
Let us look at an example, suppose the average
number of fatalities due to car accidents
26
00:02:44,640 --> 00:02:51,220
in a city in India on any day is 5. This number
one might have collected over a very long
27
00:02:51,220 --> 00:02:57,160
period of time. So, it is almost like a huge
population data and then you get this data
28
00:02:57,160 --> 00:03:06,770
actually. It says that on any particular day there
are, on average, 5 accidents which lead to a fatal result.
29
00:03:06,770 --> 00:03:10,850
What is the probability that fewer than 4
such fatalities will occur on any particular
30
00:03:10,850 --> 00:03:17,440
day? If I say tomorrow, what will be the probability
that fewer than 4. That means it could be
31
00:03:17,440 --> 00:03:21,680
0 accidents, it could be 1 accident, it could
be 2 accidents, it could be 3 accidents, because
32
00:03:21,680 --> 00:03:26,620
we are saying fewer than 4.
What is the probability? Say if tomorrow in
33
00:03:26,620 --> 00:03:32,750
that particular city, you have fewer than
4 accidents. What do you do? You know this equation,
34
00:03:32,750 --> 00:03:41,320
e power minus lambda into lambda raised to the power x, by x factorial,
and you need to put x is equal to 0 then x
35
00:03:41,320 --> 00:03:47,379
is equal to 1, x is equal to 2, x is equal
to 3 then add up all of them. What will be
36
00:03:47,379 --> 00:03:55,170
the lambda here? Lambda is 5; because we are
talking about in any given day statistically
37
00:03:55,170 --> 00:04:01,700
it has been found that there are, on average, 5 accidents
per day. So, lambda will be equal to 5; then
38
00:04:01,700 --> 00:04:07,860
you put x as 0, then x as 1,
then x as 2, then x as 3, then add up all
39
00:04:07,860 --> 00:04:13,069
that and that will give you the entire probability.
So, probability that fewer than 4 accidents
40
00:04:13,069 --> 00:04:20,879
take place, that means x less than 4, that
means it can be x 0 or x 1, x 2, x 3 and lambda
41
00:04:20,879 --> 00:04:26,870
is 5. So what you do is e power minus lambda
into lambda raised to the power x, divided by x factorial. In this
42
00:04:26,870 --> 00:04:33,210
case, it is e power minus 5 into 5 raised to the power 0 divided by 0 factorial, plus
e power minus 5 into 5 raised to the power 1 divided
43
00:04:33,210 --> 00:04:38,720
by 1 factorial, plus e power minus 5 into 5 raised to
the power 2 divided by 2 factorial, plus e power
44
00:04:38,720 --> 00:04:44,010
minus 5 into 5 raised to the power 3 divided by 3 factorial.
So, 0 factorial is 1 and anything raised to
45
00:04:44,010 --> 00:04:52,030
the power 0 is also 1. When you do all these
adding up you get 0.265. So, the probability
46
00:04:52,030 --> 00:04:58,810
of having fewer than 4 accidents say tomorrow
or any particular day will be 26.5 percent;
47
00:04:58,810 --> 00:05:01,780
that is what it is.
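The hand calculation above can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the variable names are illustrative and not from the lecture.

```python
from math import exp, factorial

lam = 5  # average number of fatal accidents per day, as in the example

# One Poisson term for each of x = 0, 1, 2, 3 (fewer than 4)
terms = [exp(-lam) * lam**x / factorial(x) for x in range(4)]
p_fewer_than_4 = sum(terms)

for x, term in enumerate(terms):
    print(f"P(X = {x}) = {term:.4f}")
print(f"P(X < 4) = {p_fewer_than_4:.3f}")
```

The final line should agree with the 0.265 obtained by hand.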
48
00:05:01,780 --> 00:05:06,350
We can also check with the online software.
Yesterday I introduced this software called
49
00:05:06,350 --> 00:05:12,160
GraphPad online software and this is the
link for that software. We can do the same
50
00:05:12,160 --> 00:05:17,190
calculations with that software also.
51
00:05:17,190 --> 00:05:26,470
I can substitute and you get 0, 1, 2, 3, 4,
if you do the cumulative as you can see for
52
00:05:26,470 --> 00:05:32,240
0 then 1 cumulative will be this plus this
giving this, for 2 cumulative this plus, this
53
00:05:32,240 --> 00:05:37,810
plus, this, for 3 cumulative this plus, this
plus, this plus, this that comes to 26.5 percent
54
00:05:37,810 --> 00:05:42,490
which matches the 26.5 percent we calculated. Shall we use
GraphPad?
55
00:05:42,490 --> 00:05:51,130
So, let us use the GraphPad software. As
you can see, it tells you that we can use binomial,
56
00:05:51,130 --> 00:05:59,270
Poisson and so on. So, we use this then go
forward then this is the one we again use
57
00:05:59,270 --> 00:06:07,270
this and go forward. We have here it says
Poisson distribution, so average number of
58
00:06:07,270 --> 00:06:13,820
objects that means on an average they have
seen 5 fatalities on any particular day in
59
00:06:13,820 --> 00:06:17,990
that city. We put 5 and then we calculate
the probability.
60
00:06:17,990 --> 00:06:24,890
You see this, for 0 accidents on any particular
61
00:06:24,890 --> 00:06:32,520
day the probability will be 0.67 percent,
for only 1 accident the probability will be
62
00:06:32,520 --> 00:06:39,910
3.36 percent, but if it is 0 or 1 accident
then you need to add these 2 that is cumulative,
63
00:06:39,910 --> 00:06:46,660
so it comes to 4 percent. For 2 accidents,
it will be 8.42 percent, for 3 accidents it
64
00:06:46,660 --> 00:06:52,330
will be 14 percent, but if you are talking
about 0 or 1 or 2 or 3 accidents on any particular
65
00:06:52,330 --> 00:06:58,419
day, you need to add all these. So, that is
the cumulative we get 26.5; so, we got the
66
00:06:58,419 --> 00:07:05,060
same answer. We can use the GraphPad software
also to do the same calculations. So, that
67
00:07:05,060 --> 00:07:14,630
is what we got here: 26.5 percent for
fewer than 4 accidents.
68
00:07:14,630 --> 00:07:25,550
Even Excel has this option: there is a
function called POISSON(x, mean, cumulative).
69
00:07:25,550 --> 00:07:32,610
Here x is the number of events and
mean is the mean, that is,
70
00:07:32,610 --> 00:07:38,950
the expected numerical value; cumulative is true
or false. Like in binomial, if you put false
71
00:07:38,950 --> 00:07:48,389
you will get the exact value, whereas if it
is true you will get the cumulative value.
72
00:07:48,389 --> 00:07:57,960
We put the mean as 5 and we want to look at 0,
1, 2, 3, 4; we can use the same function here
73
00:07:57,960 --> 00:08:18,710
in Excel also. Let me do that; we go to
Excel.
74
00:08:18,710 --> 00:08:30,780
We have the statistical then we have the Poisson,
so you see the Poisson distribution. When
75
00:08:30,780 --> 00:08:38,880
we say Poisson distribution, x is the number
of events, mean is the expected numerical value
76
00:08:38,880 --> 00:08:45,170
and here we put false to get the exact value
or we can put true to get the cumulative.
77
00:08:45,170 --> 00:08:53,440
So, the mean is 5, for example, and for x
I want to look at, say, 3, and then I say true
78
00:08:53,440 --> 00:09:25,399
and then I get this, it gives you something
here which is not correct, we got 26 percent.
79
00:09:25,399 --> 00:09:33,070
In the Poisson distribution, we have x is
the number of events we are looking at, mean
80
00:09:33,070 --> 00:09:42,720
is the expected numerical value and cumulative
is true in this case. Here we are having 5
81
00:09:42,720 --> 00:09:49,690
fatalities expected on any particular day
but we are looking at fewer than
82
00:09:49,690 --> 00:09:59,069
4, so I have put 3 for x and true for cumulative,
and we get the answer as 26.5 percent, which
83
00:09:59,069 --> 00:10:08,980
matches with whatever we got here. We got
26.5 and 26.5 using different method.
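The three arguments described for the spreadsheet function can be mimicked in a short function. This is a sketch only: the name `poisson` and its argument order simply follow the POISSON(x, mean, cumulative) description in the lecture, and it is not Excel's own implementation.

```python
from math import exp, factorial


def poisson(x, mean, cumulative):
    """Hypothetical stand-in for the spreadsheet call described above.

    cumulative=False gives the probability of exactly x events;
    cumulative=True gives the probability of x or fewer events.
    """
    def pmf(k):
        return exp(-mean) * mean**k / factorial(k)

    if cumulative:
        return sum(pmf(k) for k in range(x + 1))
    return pmf(x)


print(poisson(3, 5, True))   # 3 or fewer fatalities, mean of 5
print(poisson(3, 5, False))  # exactly 3 fatalities, mean of 5
```

Putting 3 for x with cumulative set to true covers 0, 1, 2 and 3 events, which is why it answers the "fewer than 4" question.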
84
00:10:08,980 --> 00:10:16,139
So, we can use this formula or we can use
the GraphPad software or we can use the
85
00:10:16,139 --> 00:10:25,709
Poisson distribution function that is available
in Excel as well. Here we say x is
86
00:10:25,709 --> 00:10:34,269
the number of events we are looking at, it
could be 0, 1, 2, 3; then comes the mean,
87
00:10:34,269 --> 00:10:44,290
and then cumulative could be true or false.
Let us look at another problem where Poisson
88
00:10:44,290 --> 00:10:45,490
distribution is useful.
89
00:10:45,490 --> 00:10:51,170
Suppose the probability of a birth defect is
10 percent; this is data which may have been collected
90
00:10:51,170 --> 00:10:57,779
over a very very long period of time. So,
the probability of a birth defect is 10 percent.
91
00:10:57,779 --> 00:11:02,310
What is the probability that no one in a family
of 10 people has the birth defect? I have
92
00:11:02,310 --> 00:11:08,079
a family in a village with 10 people;
what is the probability that no one in that
93
00:11:08,079 --> 00:11:13,269
family will have this birth defect? But the
probability of birth defect is 10 percent.
94
00:11:13,269 --> 00:11:22,529
So how do you do this? Again we need to know
lambda here: n p is equal to lambda, where n
95
00:11:22,529 --> 00:11:29,379
is 10 people and the probability p is 0.1.
So lambda comes out to be 1. Here we want x
96
00:11:29,379 --> 00:11:35,459
to be 0 because we do not want to have any
birth defect. P(0; 1) will be e power
97
00:11:35,459 --> 00:11:44,489
minus 1 into 1 raised to the power 0, divided by 0 factorial; that comes to
e power minus 1, which is about 0.367. That means
98
00:11:44,489 --> 00:11:51,370
there is about a 37 percent probability that in a
family of 10 people nobody has the birth
99
00:11:51,370 --> 00:12:02,309
defect. We can also calculate exactly 1 person having
the birth defect, or at least 1 person
100
00:12:02,309 --> 00:12:09,420
having the birth defect then we can look at
putting different numbers here based on what
101
00:12:09,420 --> 00:12:14,839
is the x we are looking at actually. Now if
you want to say probability of at least 1
102
00:12:14,839 --> 00:12:19,790
person with the birth defect, that means
it could be all 10 having the birth
103
00:12:19,790 --> 00:12:31,379
defect, or 9 having the birth defect, or
8, 7, 6 or 5 having the
104
00:12:31,379 --> 00:12:39,230
birth defect, or 4, 3, 2, 1. So, it will be
1 minus nobody having the birth defect, so that
105
00:12:39,230 --> 00:12:45,519
is why we have 1 minus 0.367 that is 0.633
that means probability of at least one person
106
00:12:45,519 --> 00:12:52,889
having birth defect in that family of 10 is
63 percent. Here at least 1 person means 1
107
00:12:52,889 --> 00:13:00,269
could be having birth defect, 2 could be having,
3 or 4 or 5. So it is exactly 1 minus of nobody
108
00:13:00,269 --> 00:13:03,089
having the birth defect.
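The birth defect example can be sketched as below. The exact binomial value is added only as a comparison and is not part of the lecture; with n as small as 10 the Poisson figure is an approximation. Variable names are illustrative.

```python
from math import exp, factorial

n, p = 10, 0.1
lam = n * p  # expected number of affected people in the family, equal to 1

# Poisson probability that nobody has the defect (x = 0)
p_none = exp(-lam) * lam**0 / factorial(0)
# "At least one" is the complement of "nobody"
p_at_least_one = 1 - p_none

# Exact binomial probability of nobody, shown for comparison only
p_none_binomial = (1 - p) ** n

print(f"Poisson  P(X = 0)  = {p_none:.4f}")
print(f"Poisson  P(X >= 1) = {p_at_least_one:.4f}")
print(f"Binomial P(X = 0)  = {p_none_binomial:.4f}")
```

The complement step is what saves adding the ten separate terms for 1, 2, ..., 10 affected people.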
109
00:13:03,089 --> 00:13:10,189
Now let us look at another interesting problem.
This I took from a website. Suppose you were
110
00:13:10,189 --> 00:13:16,610
to scatter seeds over a large field from a plane.
Imagine that you have divided the field into
111
00:13:16,610 --> 00:13:22,449
blocks of equal size; you have not dropped
millions and trillions of seeds but only a small
112
00:13:22,449 --> 00:13:29,329
amount of seed. Assume that
the seeds are independent of each other. Of
113
00:13:29,329 --> 00:13:35,019
course, 1 seed settling down is not going to
affect the other seeds' landing. So what is
114
00:13:35,019 --> 00:13:41,829
the probability that may be at least 1 seed
you get per plot? Or what is the probability
115
00:13:41,829 --> 00:13:50,209
at least 2 seeds you get per plot? We can
again use the Poisson distribution. As you
116
00:13:50,209 --> 00:14:03,259
can see, we can say at least 1 seed we can
have about 36 percent, at least 2 seeds per
117
00:14:03,259 --> 00:14:11,109
plot you can have 40 percent that means 2
seeds 0 or 1 seed it could be. So like that
118
00:14:11,109 --> 00:14:16,910
we can calculate using the GraphPad software.
119
00:14:16,910 --> 00:14:24,259
Now let us look at another problem.
120
00:14:24,259 --> 00:14:33,160
This is something related to the genome. The base
composition of the Thermococcus celer genome
121
00:14:33,160 --> 00:14:44,069
is about 0.21 : 0.29 : 0.29 :
0.21; that is the mole ratio A : C : G : T. The probability
122
00:14:44,069 --> 00:14:52,589
of finding A or C or G or T will be in this
ratio. If the sequence were random,
123
00:14:52,589 --> 00:15:01,319
the probability that any given position in
the genome is a SpeI site, that is ACTAGT, would
124
00:15:01,319 --> 00:15:18,799
be the product of 0.21 for A, 0.29 for C, 0.21 for T,
then again 0.21 for A, 0.29 for G and
125
00:15:18,799 --> 00:15:27,939
0.21 for T. So the probability that you have
the SpeI site ACTAGT in the genome sequence will be
126
00:15:27,939 --> 00:15:35,079
all these multiplied together, which is equal to 0.000164,
that is, one site per 6100 base pairs. Now
127
00:15:35,079 --> 00:15:45,530
this genome is about 1890000 base pairs, so
the expected number of SpeI sites in a random
128
00:15:45,530 --> 00:15:55,230
sequence of this length and composition will
be 0.000164 multiplied by 1890000; we get
129
00:15:55,230 --> 00:16:05,420
lambda equal to 310. We can substitute
that into our equation for 5 or fewer sites.
130
00:16:05,420 --> 00:16:14,459
We want to look at 5 or fewer sites, that is x equal to
0, 1, 2, 3, 4, 5, so we are looking at x is
131
00:16:14,459 --> 00:16:22,709
equal to 0 lambda is equal to 310, x is equal
to 1 lambda is equal to 310, x is equal to
132
00:16:22,709 --> 00:16:30,609
2 lambda equal to 310, x is equal to 3 lambda
is equal to 310, x is equal to 4 lambda is
133
00:16:30,609 --> 00:16:35,809
equal to 310, x is equal to 5 lambda is equal
to 310. So if we substitute all these now
134
00:16:35,809 --> 00:16:44,549
we end up with a very very small number,
5.7 into 10 power minus 125. But what is observed?
135
00:16:44,549 --> 00:16:52,549
The observed number of sites is 5.13, which
is far below the expected 310. Obviously, it is not a random
136
00:16:52,549 --> 00:16:57,639
event, because if it were a random event the chance of
seeing so few sites would be this tiny probability, but actually
137
00:16:57,639 --> 00:17:06,809
you are observing 5.13, that is, 5
or fewer sites. Therefore it is reasonable
138
00:17:06,809 --> 00:17:12,079
to reject the model that the nucleotide sequence
of the Thermococcus celer genome is random
139
00:17:12,079 --> 00:17:16,150
with respect to the sequence. So it is not
happening randomly, because if it had to
140
00:17:16,150 --> 00:17:22,380
happen randomly the probability of that is
this, but actually you observe almost 5.13,
141
00:17:22,380 --> 00:17:34,039
or 5 or fewer sites, which is far fewer than expected,
so this sequence ACTAGT, which is a SpeI
142
00:17:34,039 --> 00:17:38,390
site, occurring is not a random event. It is
a very interesting problem; it was taken
143
00:17:38,390 --> 00:17:45,769
from this particular site and we can do similar
studies on genome sequences and when you see
144
00:17:45,769 --> 00:17:53,259
a particular sequence you can see whether
it is a random event, using the Poisson distribution,
145
00:17:53,259 --> 00:17:58,600
or it is not a random event.
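The genome calculation can be retraced step by step. This is a minimal sketch under the lecture's assumptions (independent bases with the stated mole fractions); the variable names are illustrative, and lambda is rounded to 310 exactly as in the lecture.

```python
from math import exp, factorial, prod

# Mole fractions of the four bases, as quoted in the lecture
base_freq = {"A": 0.21, "C": 0.29, "G": 0.29, "T": 0.21}
site = "ACTAGT"            # recognition sequence of SpeI
genome_length = 1_890_000  # base pairs

# Probability that a given position starts the site in a random sequence
p_site = prod(base_freq[base] for base in site)

# Expected number of sites, using the same rounding as the lecture
lam = round(round(p_site, 6) * genome_length)

# Probability of seeing 5 or fewer sites if the sequence were random
p_at_most_5 = sum(exp(-lam) * lam**x / factorial(x) for x in range(6))

print(f"p_site = {p_site:.6f}")
print(f"lambda = {lam}")
print(f"P(X <= 5) = {p_at_most_5:.2e}")
```

The last value is of the order of 10 power minus 125, which is why so few observed sites is taken as evidence against the random-sequence model.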
146
00:17:58,600 --> 00:18:05,990
And once again, to recap: the Poisson distribution
applies when n is very large. You have
147
00:18:05,990 --> 00:18:12,490
something called lambda here, which is the
governing term; lambda is given by n into p,
148
00:18:12,490 --> 00:18:17,039
where p is the probability and n is the total
number of events. The probability is given by e power minus lambda
149
00:18:17,039 --> 00:18:22,580
into lambda raised to the power x, divided by x factorial. If I want
to look at 0 events, then I put x equal
150
00:18:22,580 --> 00:18:30,419
to 0 here; if I want to look at 1 event
I put 1 here; if I want 2 I put 2 here;
151
00:18:30,419 --> 00:18:37,190
for 3, 3. I can look at either the exact value
or the cumulative; that means, for the
152
00:18:37,190 --> 00:18:42,559
probability of fewer than 4 events,
I need to add all these. Now we can use
153
00:18:42,559 --> 00:18:49,059
Poisson distribution also to find out confidence
interval on the count for example, if I am
154
00:18:49,059 --> 00:18:57,409
counting number of bacteria colonies in a
plate or if I am counting red blood corpuscles
155
00:18:57,409 --> 00:19:05,250
in the blood these are all individual events.
Obviously, if I take a different blood sample,
156
00:19:05,250 --> 00:19:09,910
I may get a different number, if I take a
different blood sample from the same patient
157
00:19:09,910 --> 00:19:16,230
I may get a different number. Obviously, there
will be a range; it cannot be a single absolute
158
00:19:16,230 --> 00:19:22,750
value. That confidence interval is given by
l plus or minus t into the square root
159
00:19:22,750 --> 00:19:30,929
of l. l is the average count but there is
a confidence interval associated with this
160
00:19:30,929 --> 00:19:38,610
l which is given by plus or minus t into square
root of l, where t is 1.96 for 95 percent
161
00:19:38,610 --> 00:19:44,610
confidence interval and 2.58 for 99 percent
confidence interval. This equation is very
162
00:19:44,610 --> 00:19:53,690
useful. Like I said if I am counting the number
of live bacterial cells in my plate I get
163
00:19:53,690 --> 00:20:00,820
a number say 10 power 20.
What is the range? What is the confidence
164
00:20:00,820 --> 00:20:06,210
interval? If I want to know if I am looking
at the red blood corpuscles of a volunteer,
165
00:20:06,210 --> 00:20:11,299
when I take a sample and count I may get some
number when I take another sample I may get
166
00:20:11,299 --> 00:20:17,090
another number, like that if I keep on doing
it I may get large set of numbers. Obviously
167
00:20:17,090 --> 00:20:23,830
there will be a confidence interval that is
given by this particular term t square root
168
00:20:23,830 --> 00:20:31,100
of l, l is the count and t is given by 1.96
for a 95 percent confidence and 2.58 for a
169
00:20:31,100 --> 00:20:37,010
99 percent confidence interval. Let us look
at some examples now that will give you.
170
00:20:37,010 --> 00:20:42,590
In a counting chamber, I have got 470 red
blood cells counted under a microscope in
171
00:20:42,590 --> 00:20:47,980
a volume of one microliter. So what is the
95 percent confidence interval for the patient's
172
00:20:47,980 --> 00:20:56,990
true red blood cell count? Lambda is 470;
the square root of 470 is 21.7, t is 1.96 for
173
00:20:56,990 --> 00:21:06,730
95 percent confidence, so 470 plus or minus
1.96 into 21.7. Although we measure 470 red blood
174
00:21:06,730 --> 00:21:13,659
cells, in reality if you want to state it
with 95 percent confidence it will vary between
175
00:21:13,659 --> 00:21:22,720
427 and 513. For a 95 percent I put this number
as 1.96, whereas if it is a 99 percent I put
176
00:21:22,720 --> 00:21:31,860
the number as 2.58. The true value will be
95 percent of the time between 427 and 513.
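The interval for the red blood cell count follows directly from the formula l plus or minus t into the square root of l. A minimal sketch, with illustrative variable names:

```python
from math import sqrt

count = 470  # red blood cells counted in one microliter
t_95 = 1.96  # multiplier for 95 percent confidence (2.58 for 99 percent)

half_width = t_95 * sqrt(count)
low, high = count - half_width, count + half_width

print(f"sqrt(count) = {sqrt(count):.2f}")
print(f"95% interval: {low:.1f} to {high:.1f}")
```

Replacing `t_95` with 2.58 would give the wider 99 percent interval mentioned above.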
177
00:21:31,860 --> 00:21:39,100
Now let us look at another problem.
100 agar plates containing antibiotics were
178
00:21:39,100 --> 00:21:44,129
streaked with 1 million bacteria each to determine
the incidence of antibiotic mutants after
179
00:21:44,129 --> 00:21:53,379
incubation. In all 58 mutant colonies were
found, there were 58 mutant colonies. Calculate
180
00:21:53,379 --> 00:22:00,730
the probability of finding 0 or 1 mutant colony
per plate. Since
181
00:22:00,730 --> 00:22:10,549
I found 58 colonies in 100 plates, my
lambda will be equal to 0.58; that is the incidence.
182
00:22:10,549 --> 00:22:16,830
If I want to see 0 then I put x as 0. So e power minus lambda into lambda
raised to the power 0, divided by 0 factorial; that is
183
00:22:16,830 --> 00:22:22,759
56 percent. If I want to see one mutant colony,
I will put e power minus lambda into lambda
184
00:22:22,759 --> 00:22:29,690
raised to the power 1, divided by 1 factorial; that gives
you 32.5 percent.
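The two plate probabilities can be checked as follows. This is a minimal sketch with illustrative variable names.

```python
from math import exp, factorial

colonies, plates = 58, 100
lam = colonies / plates  # average number of mutant colonies per plate

p_zero = exp(-lam) * lam**0 / factorial(0)  # no mutant colony on a plate
p_one = exp(-lam) * lam**1 / factorial(1)   # exactly one mutant colony

print(f"lambda = {lam}")
print(f"P(X = 0) = {p_zero:.3f}")
print(f"P(X = 1) = {p_one:.3f}")
```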
185
00:22:29,690 --> 00:22:39,440
You can see the Poisson distribution is very useful;
we can use it for calculating events based
186
00:22:39,440 --> 00:22:47,710
on a probability. These events are independent
of each other they are not related to each
187
00:22:47,710 --> 00:22:53,299
other. We can use Poisson distribution for
getting a confidence interval for a count
188
00:22:53,299 --> 00:22:58,169
like I showed you in example on red blood
corpuscle or if it is a bacterial colony I
189
00:22:58,169 --> 00:23:04,289
am counting. So the Poisson distribution is very
useful, and lambda is the only parameter which I need
190
00:23:04,289 --> 00:23:12,970
to know here; lambda is given by n p. Now
I want to slightly switch gears and talk about
191
00:23:12,970 --> 00:23:16,230
something called population and sample.
192
00:23:16,230 --> 00:23:23,980
Population is something which is very very
big for example, when I say there are 5 fatalities
193
00:23:23,980 --> 00:23:29,070
on the road in metropolitan city that means
this data must have been collected over a
194
00:23:29,070 --> 00:23:35,009
very very long time. So it is not that every
day exactly 5 will happen, but it is collected
195
00:23:35,009 --> 00:23:42,230
over a very long time and that is called a
population. Now if I am saying that the average
196
00:23:42,230 --> 00:23:51,789
height of Indians is 5 feet 5 inches so this
data is collected over a very very large data
197
00:23:51,789 --> 00:24:07,929
set called the population. If you say the number
of children born with defects will be 1 in 1 million
198
00:24:07,929 --> 00:24:13,960
in India, then this data is collected over a long
period of time with large data set and that
199
00:24:13,960 --> 00:24:21,100
we can call it the population, and its size is generally
denoted by a capital N, whereas I may take
200
00:24:21,100 --> 00:24:28,129
a small sample, which is a subset of
this population, because I cannot actually
201
00:24:28,129 --> 00:24:35,240
get all the population, but I can get a small
sample. Suppose I am running a bolt factory,
202
00:24:35,240 --> 00:24:42,700
I take a few bolts and check their diameter
and see whether it conforms with what I claim
203
00:24:42,700 --> 00:24:50,289
or if I take 10 people in Chennai and find
their height and I will try to see whether
204
00:24:50,289 --> 00:24:55,759
it matches with the Indian height average
of 5 feet 5 inches but that is a sample that
205
00:24:55,759 --> 00:25:02,500
is a very small number n. Even if I take 100
people even if I take 1000 volunteers in Chennai
206
00:25:02,500 --> 00:25:07,340
and measure their heights, still I would call
it a sample, it cannot be a population.
207
00:25:07,340 --> 00:25:14,559
There is always a population which is large,
which is like a universe you know like the
208
00:25:14,559 --> 00:25:24,890
height of people in India that is very big
whereas when I take a small sample that is
209
00:25:24,890 --> 00:25:33,710
denoted by small n, and in statistics, based
on the results of the sample, we try to predict
210
00:25:33,710 --> 00:25:39,570
what could be the population whether the sample
really falls into the population. I take 5
211
00:25:39,570 --> 00:25:46,679
bolts and measure their diameter; I may get
say 20.1 mm, 19.9 mm, 20 mm so I will take
212
00:25:46,679 --> 00:25:53,029
an average. Now I want to know whether this
average conforms with the 20 mm bolt size
213
00:25:53,029 --> 00:26:01,249
which I have mentioned in my catalog. Is it
really very close to 20 mm? Or is it very
214
00:26:01,249 --> 00:26:06,670
far away from 20 mm? When I take a sample,
sample is always very small and when I take
215
00:26:06,670 --> 00:26:14,540
an average of that sample it will not be exactly
20 mm the average may be 19.5 or 20.5. Now
216
00:26:14,540 --> 00:26:21,730
is this 20.5 very different from what I
claim, 20? Can I say they are the same with 95
217
00:26:21,730 --> 00:26:26,510
percent confidence? Or can I say that they
are the same with 99 percent confidence? That
218
00:26:26,510 --> 00:26:34,779
is what statistics is all about and so the
concept of population and the concept of sample
219
00:26:34,779 --> 00:26:39,149
play a very important role.
So sample is always very small whereas population
220
00:26:39,149 --> 00:26:46,080
is very very large. We can always collect
samples and based on the sample results we
221
00:26:46,080 --> 00:26:52,510
try to say whether it comes from the same
population or whether it is not coming from
222
00:26:52,510 --> 00:26:55,309
the same population.
223
00:26:55,309 --> 00:27:04,830
When we have things like in continuous data,
when I am measuring temperature 30.1, 30.2,
224
00:27:04,830 --> 00:27:12,269
30.3 and so on, I can calculate something
called mean. Mean is nothing but average.
225
00:27:12,269 --> 00:27:16,590
Everybody knows how to calculate mean, you
add up all of them divided by the number of
226
00:27:16,590 --> 00:27:22,669
samples and then you get the mean.
Normally for the population mean we represent
227
00:27:22,669 --> 00:27:30,070
it as mu, whereas for the sample mean we
may represent it as x bar. The population mean
228
00:27:30,070 --> 00:27:35,169
we always represent it as mu here, whereas
for sample mean, we generally represent it
229
00:27:35,169 --> 00:27:44,409
as x bar. Like N is the population size whereas
small n is the sample size. So sample
230
00:27:44,409 --> 00:27:50,950
means are always represented by x bar, whereas the
population mean is always
231
00:27:50,950 --> 00:27:59,110
represented by mu. Now this sample mean
is an estimate of the true population mean,
232
00:27:59,110 --> 00:28:09,110
like I said, I take 10 bolts and then measure
their diameter take an average that is x bar,
233
00:28:09,110 --> 00:28:16,970
now mu is what is the real population mean
which I say the bolts in my factory are 20
234
00:28:16,970 --> 00:28:26,139
mm of size. Now this x bar how close is it
with mu, can I say that x bar is a good representation
235
00:28:26,139 --> 00:28:33,340
of mu or is x bar very far away from mu and
so on and that is what statistical analysis
236
00:28:33,340 --> 00:28:39,730
is all about actually. Is x bar so close
to mu that I can say yes, x bar is a representation
237
00:28:39,730 --> 00:28:46,580
of the population, or is x bar not close to
mu, so that it is not a representation of this population?
238
00:28:46,580 --> 00:28:51,749
Now when you say close or not so close we
use certain statistical terminology like
239
00:28:51,749 --> 00:28:59,380
confidence limits, 95 percent confidence,
99 percent confidence and so on actually.
240
00:28:59,380 --> 00:29:05,059
So we will talk about all these much more
in detail as we go along. Now median what
241
00:29:05,059 --> 00:29:06,659
is median?
242
00:29:06,659 --> 00:29:14,080
Median is the middle point of the data set.
So, with the values arranged in order, if I have an odd number of data points the median will be
243
00:29:14,080 --> 00:29:19,799
exactly the middle point, whereas if I
have an even number, say 20
244
00:29:19,799 --> 00:29:26,789
numbers, then obviously the median will lie
between the two data points in the middle.
245
00:29:26,789 --> 00:29:37,669
If I have, say, 19, I
have 9 on each side and the middle
246
00:29:37,669 --> 00:29:43,600
point will be in the center whereas if I have
20 I have 10 and 10 so obviously, the median
247
00:29:43,600 --> 00:29:49,179
will be the average of the 10th and the 11th
point that is called median. So median is
248
00:29:49,179 --> 00:29:52,990
a middle point whereas mean is an average.
249
00:29:52,990 --> 00:29:57,990
Then there is something called mode. Mode
is the value that appears most often in data
250
00:29:57,990 --> 00:30:05,269
sets. If we have say data set here like this
23 is appearing many times so 23 is the mode
251
00:30:05,269 --> 00:30:12,899
of this data set. Now if the data set is like
this, you have 3 and 6 appearing most often, right? So
252
00:30:12,899 --> 00:30:18,039
you have 2 modes; this is called a bimodal
distribution, whereas the earlier one is unimodal.
253
00:30:18,039 --> 00:30:25,730
So we have 3 terms: the mean, median and mode.
Mean is the average, median is the central
254
00:30:25,730 --> 00:30:30,710
point, mode is the value that appears most
often in a set of a data. We will be using
255
00:30:30,710 --> 00:30:35,950
these in our statistical calculations and
as we go along actually.
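The three terms can be illustrated with Python's standard `statistics` module. The two data sets below are made up for illustration (the data on the lecture slides is not in the transcript); the first has a single mode of 23 and the second has the two modes 3 and 6.

```python
from statistics import mean, median, multimode

# Hypothetical data: odd number of points, 23 appears most often
odd_data = [21, 23, 23, 25, 30]
# Hypothetical data: even number of points, 3 and 6 both appear twice
even_data = [1, 3, 3, 6, 6, 8]

print(mean(odd_data))        # sum of the values divided by their number
print(median(odd_data))      # the middle value of the sorted data
print(median(even_data))     # average of the two middle values
print(multimode(odd_data))   # one mode: unimodal
print(multimode(even_data))  # two modes: bimodal
```

`multimode` returns every value tied for most frequent, so it distinguishes the unimodal and bimodal cases described above.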
256
00:30:35,950 --> 00:30:40,440
Thank you very much for your time.