1
00:00:00,909 --> 00:00:17,470
We will continue on the course on Biostatistics
and Design of Experiments. As I mentioned
2
00:00:17,470 --> 00:00:23,650
in this course, I am going to talk about biostatistics;
that is the part one of the whole thing and
3
00:00:23,650 --> 00:00:25,640
then comes the design of experiments.
4
00:00:25,640 --> 00:00:30,960
In Biostatistics, we are going to look at
large number of distributions like Binomial
5
00:00:30,960 --> 00:00:38,020
distribution, Poisson distribution, Weibull
distribution, T-distribution, Z-distribution,
6
00:00:38,020 --> 00:00:43,170
Normal distribution and so on. Then we are
going to look at something called Confidence
7
00:00:43,170 --> 00:00:49,520
interval, Test for normality, Tests of significance,
different types of tests, t-test, F test, ANOVA
8
00:00:49,520 --> 00:00:54,300
test. And under t test you have one sample
t- test, two sample t-test and then we are
9
00:00:54,300 --> 00:00:59,160
also going to look at Chi square test or Chi
square distribution. Then we are going to
10
00:00:59,160 --> 00:01:04,330
look at Non parametric tests, other type of
tests that are possible in biostatistics which
11
00:01:04,330 --> 00:01:08,260
do not need to have a Normal distribution.
12
00:01:08,260 --> 00:01:12,720
Then under design of experiments, we are going
to look at one factor at a time. How do I
13
00:01:12,720 --> 00:01:17,820
change one factor, like if I am changing temperature
alone, then after I finish temperature optimization,
14
00:01:17,820 --> 00:01:23,780
I go to pH alone, that is called one factor
at a time, and then go into a design, Full
15
00:01:23,780 --> 00:01:28,360
factorial design, where I am changing many
factors simultaneously, then there is something
16
00:01:28,360 --> 00:01:32,650
called Fractional factorial design, where
you are doing a fraction of the full factorial
17
00:01:32,650 --> 00:01:35,530
design that means you are cutting down on
the experiments.
18
00:01:35,530 --> 00:01:41,230
Then you are going to talk about what is this
Confounding and alias and how does confounding
19
00:01:41,230 --> 00:01:44,820
affect when you start doing the design of
experiments. Then there is something called
20
00:01:44,820 --> 00:01:50,410
Screening designs; that is initially you start
looking at a large number of parameters and
21
00:01:50,410 --> 00:01:55,150
carry out experiments, that is called Screening
designs. Then you come to Second order designs,
22
00:01:55,150 --> 00:02:02,250
that means non-linear type of designs. Then
once you collect the data from the design
23
00:02:02,250 --> 00:02:07,100
you do a Regression analysis, mathematical
modeling. Then finally you go into data reduction
24
00:02:07,100 --> 00:02:12,330
that is the second part of the course. The
first part is Biostatistics second part is
25
00:02:12,330 --> 00:02:13,390
Design of Experiments.
26
00:02:13,390 --> 00:02:18,100
Let us get into and before that these are
some of the books which may be very useful
27
00:02:18,100 --> 00:02:26,390
for you; Biostatistics An Introductory Text
by Goldstein, then we have Barlow, A Guide
28
00:02:26,390 --> 00:02:32,310
to the Use of Statistical Methods. Then Fisher
and Yates, this is very very good book which
29
00:02:32,310 --> 00:02:36,990
gives lot of statistical tables. Because as
you go along you will come across lot of tables
30
00:02:36,990 --> 00:02:46,080
t tables, f tables, random numbers, odds ratios,
confidence intervals, p values and so on,
31
00:02:46,080 --> 00:02:51,050
for all these you need to have some tables
and this one gives you. Of course, you can
32
00:02:51,050 --> 00:02:56,130
get the tables online also but they are all
based on this particular book called Fisher
33
00:02:56,130 --> 00:03:00,600
and Yates. They have developed statistical
tables for Biological, Agricultural and Medical
34
00:03:00,600 --> 00:03:01,600
Research.
35
00:03:01,600 --> 00:03:09,080
Let us look at Data Types. There are 2 types
of data, one is called the Attribute data
36
00:03:09,080 --> 00:03:17,000
other one is called the Variable; continuous
data. Attribute data could be like 0-1, pass-fail,
37
00:03:17,000 --> 00:03:25,470
live-dead, black-white. It is like whole
numbers, it can be counted, 10 defects
38
00:03:25,470 --> 00:03:32,740
in 10,000 samples, 10 failures in a class
of 100 students, you can count it, you can
39
00:03:32,740 --> 00:03:38,610
classify it. So, it is based on numerical
numbers, it is discrete. Whereas in Variable
40
00:03:38,610 --> 00:03:43,319
data; continuous data, you can have continuously
changing, for example, I can measure temperature
41
00:03:43,319 --> 00:03:51,610
of a fermenter continuously, I can call it
26.5 or 26.6, 26.7, 26.8 like that I can measure
42
00:03:51,610 --> 00:03:55,620
the temperature very continuously. Similarly,
I can measure the pH of the solution in a
43
00:03:55,620 --> 00:04:01,319
very continuous manner 3.1, 3.2, 3.3 and so
on, that is called the continuous data. So,
44
00:04:01,319 --> 00:04:07,020
we have the discrete data we have the continuous
data. So, any data type can be divided into
45
00:04:07,020 --> 00:04:08,360
these two forms.
46
00:04:08,360 --> 00:04:14,330
Now under this discrete, we can classify it
as defective or not. Especially, if you take
47
00:04:14,330 --> 00:04:19,310
a factory where they manufacture lot of product.
For example, they are manufacturing screws
48
00:04:19,310 --> 00:04:25,650
and they would like to have the screw of 10
mm diameter, so any screw that is not 10 mm;
49
00:04:25,650 --> 00:04:31,630
if it is 9 mm or if it is 11 mm it is called
a defect. We can say out of 1 million screws
50
00:04:31,630 --> 00:04:39,020
that are manufactured in each week there could
be so many screws, 10 screws which are defective,
51
00:04:39,020 --> 00:04:45,770
that means they do not have 10 mm as the diameter,
the diameter could be different, that is classifying,
52
00:04:45,770 --> 00:04:49,770
so you have something called the binomial
distribution coming in to picture. So there
53
00:04:49,770 --> 00:04:55,820
are 10 defective screws out of 1 million screws
that are manufactured in this particular week.
54
00:04:55,820 --> 00:05:01,630
Then we also have something called Poisson
distribution, this is again giving a count.
55
00:05:01,630 --> 00:05:11,210
There are 3 road accidents in the city of
Chennai in a month's time, there are 4 people
56
00:05:11,210 --> 00:05:19,340
suffering from HIV in this particular village
in South India, you are giving some numbers.
57
00:05:19,340 --> 00:05:26,810
And again, the numbers are collected based
on large number of samples. Again, it is count
58
00:05:26,810 --> 00:05:34,590
that is called Poisson distribution. We have
the Binomial distribution, out of 10,000 samples
59
00:05:34,590 --> 00:05:43,060
10 are defective or we have the Poisson distribution,
where I am saying there are 3 deaths per day
60
00:05:43,060 --> 00:05:47,440
in the city of Chennai.
Then under the continuous data we have the
61
00:05:47,440 --> 00:05:53,660
Normal distribution. You must all have heard of
the normal distribution which looks
62
00:05:53,660 --> 00:05:59,310
like a bell type of curve and also we have
the Weibull distribution which discusses the
63
00:05:59,310 --> 00:06:08,160
life of, say for example, a light bulb or
a fan or a refrigerator that is called the
64
00:06:08,160 --> 00:06:12,620
Weibull distribution. So, that is continuous
data, we can measure the data continuously
65
00:06:12,620 --> 00:06:18,550
that is called the Weibull distribution. So,
we have two types of data, the discrete data
66
00:06:18,550 --> 00:06:24,120
and the continuous data. Discrete data is
used for identifying how many defects are
67
00:06:24,120 --> 00:06:32,650
there in a sample and how many accidents are
happening in Chennai per month or per week
68
00:06:32,650 --> 00:06:40,199
and so on, whereas continuous data we are
measuring the data in a very continuous manner.
69
00:06:40,199 --> 00:06:45,130
In fact there are large numbers of distributions
much larger than what I talked about. As you
70
00:06:45,130 --> 00:06:50,210
can see, do not get scared we have Normal
distribution, we have Uniform distribution
71
00:06:50,210 --> 00:06:54,830
then we have t distribution we are going to
talk about this. We have F-distribution we
72
00:06:54,830 --> 00:06:59,040
are going to spend some time on this, then
Chi square distribution we are going to spend
73
00:06:59,040 --> 00:07:05,020
some time on this, Weibull distribution and
so on actually. As you can see these all are
74
00:07:05,020 --> 00:07:11,330
continuous distribution, fatigue life distribution,
gamma distribution, double exponential, power
75
00:07:11,330 --> 00:07:17,030
normal, power logarithm, beta distribution
so on. So, large numbers of distributions
76
00:07:17,030 --> 00:07:21,710
are there. They are used in different scenarios,
different requirements, different problems,
77
00:07:21,710 --> 00:07:26,970
but I will be spending time on normal, I will
be spending time on T-distribution, I will
78
00:07:26,970 --> 00:07:32,050
be talking about the Chi square distribution,
F distribution, Weibull distribution but all
79
00:07:32,050 --> 00:07:39,729
these distributions are very useful actually.
As you can see they have different shapes
80
00:07:39,729 --> 00:07:45,199
and that means the probability of certain
event happening will follow different type
81
00:07:45,199 --> 00:07:52,150
of relationship or mathematical formula. So,
these all are continuous distribution. In
82
00:07:52,150 --> 00:07:59,050
the discrete we have the binomial and we have
another type of Poisson distribution, so these
83
00:07:59,050 --> 00:08:03,890
are discrete distributions. I said we will
spend time only on few of these not all of
84
00:08:03,890 --> 00:08:04,890
them.
85
00:08:04,890 --> 00:08:11,270
Let us check Binomial Distribution. You must
have all read in your school talking about
86
00:08:11,270 --> 00:08:16,840
probability, tossing a coin, tossing a dice
and so on actually. Binomial Distribution
87
00:08:16,840 --> 00:08:25,690
is based on yes-no, 0-1, success-failure,
pass-fail, live-dead, black-white or suppose
88
00:08:25,690 --> 00:08:31,680
I have a dice which has 6 faces then I am
throwing a dice you may get a number 1 or
89
00:08:31,680 --> 00:08:40,839
2 or 3 or 4 or 5 or 6 at equal probability,
all these are based on Binomial Distribution.
90
00:08:40,839 --> 00:08:45,610
When the sampling is carried out without replacement,
that means you are not putting it back, the
91
00:08:45,610 --> 00:08:52,639
draws are not independent. Still, the binomial distribution
is a good approximation when the population is large. If I am tossing
92
00:08:52,639 --> 00:08:58,079
a coin probability of coin showing head could
be half, probability of coin showing tail
93
00:08:58,079 --> 00:09:05,040
it could be half. If I throw it 10 times same
coin, if I want to know what is the probability,
94
00:09:05,040 --> 00:09:12,220
4 of them out of this 10 is heads I can use
the binomial distribution or if I am going
95
00:09:12,220 --> 00:09:21,730
to say that the birth defect of children born
in India is 10 percent and I go to a village
96
00:09:21,730 --> 00:09:27,449
which has got 1000 children what is the probability
that 4 of these children will have that particular
97
00:09:27,449 --> 00:09:32,690
defect, birth defect, then I can use the Binomial
Distribution. So, that way Binomial Distribution
98
00:09:32,690 --> 00:09:35,439
becomes very useful for us to do.
99
00:09:35,439 --> 00:09:40,790
Let us look at a simple problem, Tossing of
a coin. I have a coin as you know I can get
100
00:09:40,790 --> 00:09:48,519
either heads or tails. That means equal probability,
50 percent probability for heads 50 percent
101
00:09:48,519 --> 00:09:54,759
probability for tails. So, I toss the coin
4 times, I can get 0 heads that means all
102
00:09:54,759 --> 00:09:59,770
of them become tail, I can get 1 head that
means I can get 1 head and 3 tails, I can
103
00:09:59,770 --> 00:10:08,850
get 2 that means 2 heads and 2 tails, I can
get 3 heads and 1 tail or 4 heads and no tail.
104
00:10:08,850 --> 00:10:14,339
All these are possible and the likelihood
of getting each one of them is given by this
105
00:10:14,339 --> 00:10:20,410
formula 1 by 16, 4 by 16, 6 by 16, 4 by 16,
1 by 16. How do we get this?
106
00:10:20,410 --> 00:10:26,300
In the next slide, I will show you the formula.
This is how the distribution will look like
107
00:10:26,300 --> 00:10:33,399
I toss the coin 4 times obviously getting
2 heads, 2 tails is most probable. How to
108
00:10:33,399 --> 00:10:38,660
get this number of 6 by 16, I will tell you
in the next slide. And then getting 1 head
109
00:10:38,660 --> 00:10:44,690
and 3 tails or getting 3 heads and 1 tail
are equally probable which comes second and
110
00:10:44,690 --> 00:10:52,089
then getting 0 heads or getting 4 heads again
is less probable in this but they are equal.
111
00:10:52,089 --> 00:10:56,050
This is how the binomial distribution will
look like. Now what is the formula for calculating
112
00:10:56,050 --> 00:11:00,179
this probability let us show it in the next
slide.
113
00:11:00,179 --> 00:11:05,959
This is how the probability equation looks
like. The probability function f k is given
114
00:11:05,959 --> 00:11:14,610
by n factorial divided by k factorial multiplied
by n minus k factorial p power k 1 minus p
115
00:11:14,610 --> 00:11:21,699
power n minus k. So, n trials, k successes,
p is the probability.
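Written out as an equation, the spoken formula above reads as follows. This is only a typeset restatement; the symbols n, k and p are the ones just named, and the range of k is added for completeness.

```latex
f(k) = \frac{n!}{k!\,(n-k)!}\; p^{k}\,(1-p)^{n-k}, \qquad k = 0, 1, \dots, n
```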
116
00:11:21,699 --> 00:11:26,940
So, n times you are doing something and k
is the successes you are talking about, p
117
00:11:26,940 --> 00:11:33,860
is the probability. In the previous problem
like I am tossing the coin 4 times, n will
118
00:11:33,860 --> 00:11:44,800
be equal to 4. If I want to know what is the
probability for 0 heads, then k will become
119
00:11:44,800 --> 00:11:54,189
0 and p is half because I can get either head
or tail. Probabilities p which is half n will
120
00:11:54,189 --> 00:11:59,830
be 4 and if I want to get a zero heads what
is the probability I want to calculate; I
121
00:11:59,830 --> 00:12:05,000
will put k as 0.
n factorial you all know must have studied,
122
00:12:05,000 --> 00:12:10,660
n factorial is nothing but n into n minus
1 into n minus 2 into n minus 3 and so on. When
123
00:12:10,660 --> 00:12:16,869
I put k equal to 0, I substitute here, I will
get 4 factorial and the denominator I put
124
00:12:16,869 --> 00:12:22,930
0 factorial then I put 4 minus 0 factorial,
half raise to the power 0, 1 minus half raise
125
00:12:22,930 --> 00:12:28,759
to the power 4 minus 0, that is what I have
written here. 0 factorial is 1, 4 factorial
126
00:12:28,759 --> 00:12:37,351
is 4 into 3 into 2 that is 24, half raise
to the power 0 is 1, 1 minus half is half,
127
00:12:37,351 --> 00:12:43,720
half raise to the power 4 is half of raise
to the power 4, this is 4 factorial at the
128
00:12:43,720 --> 00:12:49,290
denominator. So, these two will cancel, these
two will cancel. So, we have half raise to
129
00:12:49,290 --> 00:12:57,740
the power 4 that means 1 by 2 into 2 into
2 into 2, that is 1 by 16. If you want to
130
00:12:57,740 --> 00:13:06,649
see 0 heads when you toss a coin 4 times the
probability will be 1 by 16. You see that
131
00:13:06,649 --> 00:13:13,170
is what I had mentioned here right, 1 by 16.
This is how you get the data.
132
00:13:13,170 --> 00:13:19,699
Now if you want to know what is the probability
to get 2 heads when I toss the coin 4 times,
133
00:13:19,699 --> 00:13:28,779
so n equal to 4, k equal to 2 and again p
will be half, you put 4 factorial 2 factorial
134
00:13:28,779 --> 00:13:34,139
4 minus 2 factorial half raise to the power
2, 1 minus half raise to the power 4 minus
135
00:13:34,139 --> 00:13:43,519
2. So you do all these calculations, you end
up with 6 by 16, I mentioned here 6 by 16
136
00:13:43,519 --> 00:13:48,970
that is the maximum. When you toss the coin
4 times what is the probability of getting
137
00:13:48,970 --> 00:13:53,959
2 times head in that 4 is 6 by 16 that is
the maximum.
138
00:13:53,959 --> 00:14:01,449
Like that if you want to know with 4 times
tossing if k equal to 1 that means 1 head,
139
00:14:01,449 --> 00:14:05,779
what is the probability of getting 1 head
when I toss the coin the 4 times. I put n
140
00:14:05,779 --> 00:14:13,649
equal to 4 but I put k equal to 1, p will
be half in all these cases, 0 factorial you
141
00:14:13,649 --> 00:14:18,279
should remember is always 1. It is simple
to calculate.
142
00:14:18,279 --> 00:14:24,300
Now you can do the same calculation using
Excel as well. Excel has a function called
143
00:14:24,300 --> 00:14:43,760
BINOMDIST, there are 4 terms inside
this. Number s is the number of successes
144
00:14:43,760 --> 00:14:51,230
in the trials, trials is the total number
and probability s is the probability and cumulative
145
00:14:51,230 --> 00:15:00,079
you can say true or false. If it is false
it will give you the exact number whereas
146
00:15:00,079 --> 00:15:09,240
if you put true it gives you the cumulative
number. Trials is n, in the equation, number
147
00:15:09,240 --> 00:15:15,949
s is the success k, probability is your p
small p and here we put true or false, if
148
00:15:15,949 --> 00:15:21,310
we put false it gives you the exact answer.
For example, in the previous problem where
149
00:15:21,310 --> 00:15:30,670
we looked at 4 times I tossed the coin, I
want to know 2 successes with heads, what
150
00:15:30,670 --> 00:15:37,029
will I do, I will put 2 here, I will put 4
here, I will put half here and I can put false
151
00:15:37,029 --> 00:15:43,050
here and that will give you the Binomial Distribution
answer, I should get 6 by 16 as my answer.
152
00:15:43,050 --> 00:15:46,569
Let us look at it in the Excel as well.
154
00:15:47,569 --> 00:15:54,029
This is the function I said it is called Binom
Distribution, Number s is the number successes,
155
00:15:54,029 --> 00:16:01,109
I will put 2 here, number of times I do that
is 4 here, probability is half that means
156
00:16:01,109 --> 00:16:09,230
I put 0.5 here and if I put false it will
give you the exact answer whereas when I put
157
00:16:09,230 --> 00:16:16,249
true it will give you summation of all the
answer. I will put false, what did I get?
158
00:16:16,249 --> 00:16:29,819
I got 0.375, now is it same as 6 by 16? See!
That is 0.375. Using excel we can calculate
159
00:16:29,819 --> 00:16:38,779
the binomial distribution as you can see this
is the equation, for example, this is the
160
00:16:38,779 --> 00:16:46,919
number of successes, this is the number of
trials this is n, this is k, this is the p
161
00:16:46,919 --> 00:16:54,270
and here we put false to get the exact answer
here. When you put true it adds up it is a
162
00:16:54,270 --> 00:17:02,439
cumulative answer that means, if I put true
here it tells you what is the cumulative probability
163
00:17:02,439 --> 00:17:09,810
for getting at most 2 heads out of 4 trials
that means it will look at 0, it will look
164
00:17:09,810 --> 00:17:14,579
at 1, then it will look at 2. It will give you
the summation of all these three things. We
165
00:17:14,579 --> 00:17:20,110
can use excel also to do the same calculation
or we can actually calculate it out also.
166
00:17:20,110 --> 00:17:25,300
You understood. You have a excel function
called Binom Distribution and there are 4
167
00:17:25,300 --> 00:17:31,870
terms here, the trials this is equal to n,
this is equal to k, this is equal to p and
168
00:17:31,870 --> 00:17:39,120
here you put false to get the answer. Now
there are many softwares which can do this
169
00:17:39,120 --> 00:17:43,210
job also, some of them are commercial, there
could be something free also on the net
170
00:17:43,210 --> 00:17:44,210
and so on.
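For anyone without a spreadsheet, the four-term call described above can be imitated in Python. This is a minimal sketch: binom_dist is a hypothetical helper written for this note, with its arguments arranged in the same order as the spreadsheet terms (successes, trials, probability, cumulative flag).

```python
from math import comb


def binom_dist(number_s, trials, probability_s, cumulative):
    """Hypothetical stand-in for the four-term spreadsheet call."""

    def exact(k):
        # Probability of exactly k successes.
        return comb(trials, k) * probability_s**k * (1 - probability_s) ** (trials - k)

    if cumulative:
        # True: add up the probabilities for 0, 1, ..., number_s successes.
        return sum(exact(k) for k in range(number_s + 1))
    # False: probability of exactly number_s successes.
    return exact(number_s)


print(binom_dist(2, 4, 0.5, False))  # exactly 2 heads in 4 tosses
print(binom_dist(2, 4, 0.5, True))   # at most 2 heads in 4 tosses
```

With the flag set to false the result is the 6 by 16 worked out by hand; with true it is the sum for 0, 1 and 2 heads.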
171
00:17:44,210 --> 00:17:52,430
I also looked at software and there is this free
online statistical calculator and this is
172
00:17:52,430 --> 00:17:58,630
the link for that you know GraphPad, it is
called GraphPad software. It can do lot of
173
00:17:58,630 --> 00:18:03,570
nice calculations online, we put in some data
and it can do some calculations. I am going
174
00:18:03,570 --> 00:18:13,310
to use this and we are going to, we can do
some of the problems using this. This online
175
00:18:13,310 --> 00:18:18,660
software as we can see can do lot of calculations
it can look at Binomial Distribution, Poisson
176
00:18:18,660 --> 00:18:23,500
Distribution, Normal Distribution, it can
look at different types statistical test,
177
00:18:23,500 --> 00:18:34,200
t test, f test and so on, we are also going
to use this. This is the link to that,
178
00:18:34,200 --> 00:18:43,380
www.graphpad.com/quickcalcs.
Let me show you that here, when you do that
179
00:18:43,380 --> 00:18:50,280
as you can see here, this is the GraphPad
QuickCalcs. We have the Binomial Distribution
180
00:18:50,280 --> 00:18:58,280
coming into picture, we click on it and then
we go continue, when you continue as you can
181
00:18:58,280 --> 00:19:02,670
see here calculate different types of distribution.
Let us go into binomial, we will talk about
182
00:19:02,670 --> 00:19:06,500
different distribution later as I said I am
going to talk about Binomial, I am going to
183
00:19:06,500 --> 00:19:12,451
talk about Poisson, t distribution, normal
and so on. Here you have the Binomial, so
184
00:19:12,451 --> 00:19:20,480
we say continue. Here we have the Binomial
Distribution. How many trials? We are doing
185
00:19:20,480 --> 00:19:28,070
4 trials. What is the probability of success
in each trial 0.5, calculate probabilities?
186
00:19:28,070 --> 00:19:33,310
Here you can see it gives you everything.
So, number of successes 0 means it gives you
187
00:19:33,310 --> 00:19:45,520
0.0625, that is 0 heads out of 4 trials, the probability is
6.25 percent, and then if you are talking about 1 success
188
00:19:45,520 --> 00:19:51,650
out of 4, that is 1 head out of 4 trials, you get 25
percent, but here it gives you the cumulative,
189
00:19:51,650 --> 00:19:58,230
6 plus 25 is giving 31. Even in
Excel if we put true as the last term you
190
00:19:58,230 --> 00:20:05,390
will get the cumulative, whereas if you put
false you will get the exact. 2 successes you
191
00:20:05,390 --> 00:20:12,770
get 37 percent, you can see 0.375 and the
cumulative will be some 6 plus 25 plus 37,
192
00:20:12,770 --> 00:20:19,610
68 percent. So, 3 successes out of 4 it gives
you 25 percent, cumulative wise it is 93 percent.
193
00:20:19,610 --> 00:20:27,470
All 4 heads out of 4 trials 100 percent it
gives you. We can use this particular online
194
00:20:27,470 --> 00:20:32,540
software also there could be many online softwares
but I am looking at this particular online
195
00:20:32,540 --> 00:20:37,200
software because it looks good. There are
many commercial softwares also one can go
196
00:20:37,200 --> 00:20:42,590
about using them, it depends upon whether
you have the availability of these. There
197
00:20:42,590 --> 00:20:48,580
are softwares which may be even freely downloadable
but this is simple online software where you
198
00:20:48,580 --> 00:20:53,910
give the data and it gives you the results.
As you can see here in our problem we had
199
00:20:53,910 --> 00:21:00,070
tossing the coin 4 times and you can get heads
or tails with the probability of half now
200
00:21:00,070 --> 00:21:07,171
out of this 4 times, 0 heads 6 percent 6.25
percent probability. Out of 4 times 1 head
201
00:21:07,171 --> 00:21:13,930
25 percent probability. Out of 4 times 2 heads
37.5 percent probability. Out of 4 times 3
202
00:21:13,930 --> 00:21:20,530
heads 25 percent probability. Out of 4 trials,
4 heads 6.25.
203
00:21:20,530 --> 00:21:27,160
I showed you 3 different ways by which we
can calculate this. Right. One is using this
204
00:21:27,160 --> 00:21:36,851
equation if the data is very small we can
use this and do it that means if the k, n
205
00:21:36,851 --> 00:21:42,780
and all is small we can. Otherwise we can
use the excel function which is called Binom
206
00:21:42,780 --> 00:21:50,220
distribution where this is the number of successes
that is k, number of trials that is n and
207
00:21:50,220 --> 00:21:57,140
this is the probability that is in this case
half p then here we give false or we can use
208
00:21:57,140 --> 00:22:03,660
this free online statistical calculator which
I showed you. We click here and then we give
209
00:22:03,660 --> 00:22:11,210
number of trials as 4 and probability is half,
it gives you the entire table for 0 success
210
00:22:11,210 --> 00:22:16,391
out of 4. What is the probability for 1 success
out of 4? What is the probability for 2 success
211
00:22:16,391 --> 00:22:20,350
out of 4? What is the probability and for
3 success out of 4? What is the probability
212
00:22:20,350 --> 00:22:25,410
and that is what it is giving you here, right?
As you can see it gives you in the entire
213
00:22:25,410 --> 00:22:32,210
table. So, I showed you three different approaches
by which we can do the Binomial Distribution
214
00:22:32,210 --> 00:22:33,210
calculation.
215
00:22:33,210 --> 00:22:38,760
Let us go further, let us look at a biological
application. 1 percent of the population is
216
00:22:38,760 --> 00:22:45,210
infected with HIV, I am just giving an example. So,
may be in a country 1 percent of the population
217
00:22:45,210 --> 00:22:52,650
is infected, there are no obvious symptoms
that can be used to recognize the carriers.
218
00:22:52,650 --> 00:22:58,490
We assume that if I look at somebody I cannot
tell whether the person has HIV or not unless
219
00:22:58,490 --> 00:23:04,600
I do a detailed study. For example, I need
to select some people and do a detailed study
220
00:23:04,600 --> 00:23:11,560
if the sample size is too small then I might
not be able to find at all then if I take
221
00:23:11,560 --> 00:23:17,400
a very big sample then I need to do lot of
sample collection sample analysis. I need
222
00:23:17,400 --> 00:23:22,880
to spend lot of money that is also inefficient.
So, what do I do? Is it ok if I just take
223
00:23:22,880 --> 00:23:29,330
20 people? Is this sample adequate? Will I
be able to find at least 1 person in that?
224
00:23:29,330 --> 00:23:34,570
That is problem. How do I do using Binomial
Distribution?
225
00:23:34,570 --> 00:23:40,990
I can use n equal to 20, I can say k equal
to 0, that means in that 20, I am not
226
00:23:40,990 --> 00:23:47,970
finding anybody with that and p is equal to
0.01 because I said 1 percent of the population.
227
00:23:47,970 --> 00:23:56,560
So, p is equal to 0.01. When I put it in Binomial
Distribution, 20 factorial because k equal
228
00:23:56,560 --> 00:24:02,440
to 0, these two will get canceled out, p is
equal to 0.01, k is equal to 0, this also will
229
00:24:02,440 --> 00:24:14,770
get canceled out. So 1 minus p is 0.99 raise
to the power 20 gives me 0.82. What does that
230
00:24:14,770 --> 00:24:23,960
mean? There is 82 percent chance that if I
take 20 people, I will not even find 1 person
231
00:24:23,960 --> 00:24:32,860
with that disease. Did you notice that? It
is very very important finding, there is a
232
00:24:32,860 --> 00:24:39,200
1 percent population is infected with HIV
but if I take 20 people randomly, there is an
233
00:24:39,200 --> 00:24:46,770
82 percent chance that I will not find anybody
with that in that sample of 20. So, I may
234
00:24:46,770 --> 00:24:55,090
say nobody is infected, obviously what does
it mean? My sample size is too small or if
235
00:24:55,090 --> 00:25:00,360
I can say n equal to 20, k equal to 1 then
I can do the same study and see what is the
236
00:25:00,360 --> 00:25:07,670
probability of finding exactly 1; what it
means is when I randomly select 20 people,
237
00:25:07,670 --> 00:25:15,230
I will most likely not be able to find even
1 person with that particular disease. So
238
00:25:15,230 --> 00:25:20,620
I may say that nobody is infected with this
particular disease.
239
00:25:20,620 --> 00:25:26,330
Now we can also check with the online software
also.
240
00:25:26,330 --> 00:25:32,320
Using the same online software, for example
241
00:25:32,320 --> 00:25:39,360
same thing for getting 0, it gives you 81
percent or 82 percent. If I want at most
242
00:25:39,360 --> 00:25:50,030
1 person with that, the cumulative is 98 percent
actually. Same thing we can do it using
243
00:25:50,030 --> 00:25:51,490
this. So, what we do.
244
00:25:51,490 --> 00:26:07,460
We will go to the GraphPad and then I go back
and I will put 20 then I will put 0.01 then
245
00:26:07,460 --> 00:26:15,130
calculate probability. As you can see here
there is 81 percent probability or 82 percent
246
00:26:15,130 --> 00:26:21,180
probability that not a single number of successes
0, that means not a single person with that
247
00:26:21,180 --> 00:26:28,920
particular disease. Obviously my data is too
little my sample size is too little that I
248
00:26:28,920 --> 00:26:37,850
may miss out. So, you must be very careful
when you select sample, a very small sample
249
00:26:37,850 --> 00:26:45,560
can make you conclude wrongly. That sample
size is a very very important parameter and
250
00:26:45,560 --> 00:26:50,460
we are going to talk about that in other cases
also as we go along. So, with the very small
251
00:26:50,460 --> 00:26:59,090
sample size for example, here 20 people with
the 1 percent probability I may say that 82
252
00:26:59,090 --> 00:27:04,460
percent of the time there will not be even
a single person infected with that disease
253
00:27:04,460 --> 00:27:10,070
in this sample of 20. So you can see that
we can show it using this equation or we can
254
00:27:10,070 --> 00:27:17,280
go to that software GraphPad online and then
get the same answer. Even with the Excel also
255
00:27:17,280 --> 00:27:33,940
we can do the same thing, we go to the Excel.
We type BINOM distribution f x. We have BINOM
256
00:27:33,940 --> 00:27:41,760
distribution, we have number of successes,
we are talking about 0, trials is 20, probability
257
00:27:41,760 --> 00:27:50,530
is 0.01, then we can say false or true, it
does not matter here because k is 0; false, then again you
258
00:27:50,530 --> 00:27:58,200
can see the answer is 82 percent. So, 82 percent
of the time we will not be finding any infected
259
00:27:58,200 --> 00:28:03,250
person, if I take a sample of only 20.
260
00:28:03,250 --> 00:28:08,970
So you have to be very very careful on that,
82 percent of the time we will miss out we
261
00:28:08,970 --> 00:28:10,870
will come to a wrong conclusion.
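The sample-size point can be made concrete with a small Python sketch. The function names are made up for this illustration, and the 95 percent target is an assumption chosen only to show the idea; the lecture does not state a target.

```python
def prob_no_cases(n, p):
    """Chance that a random sample of n people contains no infected person."""
    return (1 - p) ** n


def smallest_sample(p, target):
    """Smallest n for which P(at least one case in the sample) reaches target."""
    n = 1
    while 1 - prob_no_cases(n, p) < target:
        n += 1
    return n


# 1 percent prevalence, sample of 20: the chance of seeing nobody infected.
print(round(prob_no_cases(20, 0.01), 4))

# Assumed target: 95 percent chance of finding at least one infected person.
print(smallest_sample(0.01, 0.95))
```

Under that assumed 95 percent target, the sample has to be close to 300 people rather than 20.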
262
00:28:10,870 --> 00:28:18,000
Let us look at another problem. A tranquilizing
drug caused anemia in 2 of the first 10 patients
263
00:28:18,000 --> 00:28:29,670
who were tested. I took 10 patients and then
I gave the drug, 2 patients had some
264
00:28:29,670 --> 00:28:35,450
toxicity problem but then a true toxicity
of this kind is tolerable only if it does
265
00:28:35,450 --> 00:28:41,050
not affect more than 10 percent of the treated
patients, but here 2 of the first 10 patients
266
00:28:41,050 --> 00:28:50,270
had the problem. Should that drug be withdrawn
or tested further. So only 10 percent of the
267
00:28:50,270 --> 00:28:57,200
patients can have this type of toxicity effects
but here with 10 patients, 2 of them are having
268
00:28:57,200 --> 00:29:10,660
problem. So, should the drug be taken out?
We are in big problem, so let us go for example.
269
00:29:10,660 --> 00:29:23,900
The GraphPad, then we will say 10 patients
and then we want to say 0.1, that is 10 percent, calculate
270
00:29:23,900 --> 00:29:34,790
probability, that is very very high. So, we
cannot conclude because it is showing almost
271
00:29:34,790 --> 00:29:42,540
very high probability almost 34 percent whereas
we want to have less than we want to have
272
00:29:42,540 --> 00:29:48,770
10 percent only. Whereas if I take a larger
population, for example, if I take n equal
273
00:29:48,770 --> 00:29:56,970
to 40, if I take a larger population for testing
and then I keep the same 10 percent, when
274
00:29:56,970 --> 00:30:13,370
I calculate the probability then as you can
see here, if I go to 2 successes here
275
00:30:13,370 --> 00:30:19,040
it is still giving 14 percent probability.
The cumulative, if you look at it, is coming
276
00:30:19,040 --> 00:30:23,920
to 22 percent, whereas if you want to
have less than 5 percent as a possible number
277
00:30:23,920 --> 00:30:33,210
then obviously, if I go to say n equal to
100, if I go to n equal to 100. For example,
278
00:30:33,210 --> 00:30:44,510
suppose I take a sample of 100 patients
and then do the study as we can see here,
279
00:30:44,510 --> 00:30:50,540
out of the 100 patients I can have up to 5
patients having toxicity, I will be within
280
00:30:50,540 --> 00:30:58,740
that 10 percent limit but if I go beyond that
I will have numbers going up.
281
00:30:58,740 --> 00:31:07,120
Obviously what it means is the number of samples
I have taken should be considerably large
282
00:31:07,120 --> 00:31:15,240
in order to prove that the toxicity is less
than 10 percent. Obviously in this particular
283
00:31:15,240 --> 00:31:21,900
case also we can see the sample size has
to be much larger.
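The effect of sample size in the toxicity problem can be sketched in Python. The helper names are hypothetical, the tolerable rate of 0.10 is taken from the problem statement, and which of these quantities matches each figure read off the GraphPad screen is an assumption.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k affected patients out of n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k: int, n: int, p: float) -> float:
    """Probability of at most k affected patients out of n."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


p_tol = 0.10  # true toxicity rate assumed to sit exactly at the tolerable limit

# n = 10: chance of seeing 2 or more affected patients even at the limit.
at_least_2_of_10 = 1 - binom_cdf(1, 10, p_tol)

# n = 40: exactly 2 affected, and at most 2 affected.
exactly_2_of_40 = binom_pmf(2, 40, p_tol)
at_most_2_of_40 = binom_cdf(2, 40, p_tol)

# n = 100: at most 5 affected.
at_most_5_of_100 = binom_cdf(5, 100, p_tol)

print(f"n=10,  P(X >= 2) = {at_least_2_of_10:.4f}")
print(f"n=40,  P(X  = 2) = {exactly_2_of_40:.4f}")
print(f"n=40,  P(X <= 2) = {at_most_2_of_40:.4f}")
print(f"n=100, P(X <= 5) = {at_most_5_of_100:.4f}")
```

With only 10 patients, 2 or more affected is quite likely even when the true rate is 10 percent, so the small trial cannot separate a tolerable drug from an intolerable one.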
284
00:31:21,900 --> 00:31:30,800
Let us look at another problem. Suppose 30
percent of the students wear glasses. If I
285
00:31:30,800 --> 00:31:34,540
take a random sample of 10 students, find
the probability that the number of students
286
00:31:34,540 --> 00:31:47,650
wearing glasses is at most 4. There are
different cases: you can have people wearing
287
00:31:47,650 --> 00:31:53,250
glasses, you may get no one, you may get 1
person wearing glasses, you may get 2 persons
288
00:31:53,250 --> 00:31:58,960
wearing glasses, you may get up to 4 persons
wearing glasses, right. We have a 30 percent
289
00:31:58,960 --> 00:32:07,440
of students, here p is equal to 0.3 and then
you have n is equal to 10 and then you want
290
00:32:07,440 --> 00:32:13,160
to look at the various cases of 0 persons, 1 person
wearing, 2 persons wearing, 3 persons wearing,
291
00:32:13,160 --> 00:32:39,730
4 persons wearing glasses; those will be the
k values. So, we can use this particular function.
292
00:32:39,730 --> 00:33:01,320
I take 10 students, the probability is 0.3.
So I calculate the probabilities, as you can
293
00:33:01,320 --> 00:33:07,960
see here 0 persons wearing glasses, 1 person
wearing glasses, 2 persons, 3 persons and so
294
00:33:07,960 --> 00:33:15,380
on. 0 person wearing glasses will be 2.82
percent but if you are talking about 1 person
295
00:33:15,380 --> 00:33:21,940
wearing glasses out of these 10, it is 12 percent.
4 persons wearing glasses is 20 percent, but
296
00:33:21,940 --> 00:33:29,130
if I add up all these, that means, if I take
10 students out of this lot, students wearing
297
00:33:29,130 --> 00:33:38,310
glasses, 0 or 1 or 2 or 3 or 4 of them, will
be so many percent, 84 percent.
298
00:33:38,310 --> 00:33:46,240
So, this is the cumulative and this is the
exact probability here. You can use this QuickCalcs
299
00:33:46,240 --> 00:33:55,880
of the GraphPad to identify the probability
distribution function for a Binomial Distribution.
300
00:33:55,880 --> 00:34:02,660
You can use this equation or we can use the
Excel function or we can use the GraphPad
301
00:34:02,660 --> 00:34:07,980
software also. So all these are possible to
get, as you can see here this is the cumulative,
302
00:34:07,980 --> 00:34:12,909
this is the exact probability for 0 person
wearing glasses, 1 person wearing glasses,
303
00:34:12,909 --> 00:34:20,480
2 persons, 3 persons, 4 persons; like that,
it goes up to n of 10.
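The exact and cumulative columns for the glasses problem can be reproduced with a short Python sketch. It assumes n = 10 and p = 0.3 as stated in the problem; the variable names are illustrative only.

```python
from math import comb

n, p = 10, 0.3  # 10 students sampled, 30 percent wear glasses

# Exact probability of k students wearing glasses, for k = 0..n.
exact = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

# Cumulative probability of at most k students wearing glasses.
cumulative = []
running = 0.0
for prob in exact:
    running += prob
    cumulative.append(running)

print(" k   exact   cumulative")
for k in range(n + 1):
    print(f"{k:2d}  {exact[k]:.4f}  {cumulative[k]:.4f}")

at_most_4 = cumulative[4]  # answer to "at most 4 wear glasses"
```

The row for k = 4 in the cumulative column is the answer to the question, since "at most 4" adds up the cases 0, 1, 2, 3 and 4.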
304
00:34:20,480 --> 00:34:28,540
Now, let us look at another problem. You know
there is a disease with known mortality 10
305
00:34:28,540 --> 00:34:33,049
percent, what is the minimum number of patients
required to demonstrate the efficacy of the
306
00:34:33,049 --> 00:34:40,620
completely curative drug? That means there
is a disease of mortality of 10 percent that
307
00:34:40,620 --> 00:34:51,409
means, 0.1. Survival, if you take it as pi, is 0.9;
1 minus pi, that is death, is 0.1. I want to show
308
00:34:51,409 --> 00:34:59,749
completely curative, that means, I do not
want to see any death. If I take n patients
309
00:34:59,749 --> 00:35:06,549
and the survival probability for each patient
is 0.9, it will become 0.9 into 0.9 into 0.9, that is 0.9
310
00:35:06,549 --> 00:35:14,460
raised to the power n. Now this should be less
than 0.05. Why? Because 0.05 is 5 percent; that
311
00:35:14,460 --> 00:35:21,180
means that gives you 95 percent confidence.
Do you understand? Thus mortality is 10 percent
312
00:35:21,180 --> 00:35:29,589
that is 0.1, survival is 0.9. If I call pi
the survival, 0.9, then 1 minus pi, death, is equal
313
00:35:29,589 --> 00:35:37,269
to 0.1.
Now, if I take n patients then survival for
314
00:35:37,269 --> 00:35:43,260
each one is 0.9. So, 0.9 into 0.9 into 0.9,
I do it n times, that is why I have 0.9 raised
315
00:35:43,260 --> 00:35:50,480
to the power of n. Now,
to get a confidence of 95 percent, this
316
00:35:50,480 --> 00:35:56,480
should be less than 0.05. So, if I calculate
n from this, I get that n should be greater
317
00:35:56,480 --> 00:36:04,180
than 28.4, that means, I should have at least
29 patients and show that none of
318
00:36:04,180 --> 00:36:12,960
them dies. If I do that, then I have a 95 percent
confidence that the drug has a completely curative
319
00:36:12,960 --> 00:36:19,579
effect.
This approach tells you how to select the
320
00:36:19,579 --> 00:36:29,079
number of subjects or number of samples in
our problem. We looked at many different
321
00:36:29,079 --> 00:36:34,380
cases where we used Binomial Distribution
and Binomial Distribution is based on successes
322
00:36:34,380 --> 00:36:40,510
when you take a sample of n. So k successes
in a sample of n and the probability of each
323
00:36:40,510 --> 00:36:46,519
one happening is p. It tells you what is the
probability of k successes in a sample of
324
00:36:46,519 --> 00:36:51,890
n, if the probability for each event is p
and that is what the Binomial Distribution
325
00:36:51,890 --> 00:36:58,869
is all about. We can use it like this: if
30 percent of the students wear glasses
326
00:36:58,869 --> 00:37:04,430
in a class and I take 10 students, what is
the probability that 4 of them will be having
327
00:37:04,430 --> 00:37:12,210
glasses? If I have a disease which occurs in
2 percent of people in India, and I take a family of
328
00:37:12,210 --> 00:37:18,970
20 people in a house, how many of them will
have this particular disease. So, for all
329
00:37:18,970 --> 00:37:24,760
these we use this Binomial Distribution very
effectively and it is very very useful.
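The sample-size argument for the completely curative drug can be checked numerically. This is a sketch under the lecture's assumptions: without the drug each patient survives with probability 0.9, so the chance that all n survive by luck alone is 0.9 raised to the power n, and we want that below 0.05. The variable names are illustrative.

```python
from math import floor, log

survival = 0.9  # survival probability per patient without any drug effect
alpha = 0.05    # 5 percent, giving 95 percent confidence

# Direct search: smallest n for which survival**n falls below alpha.
n = 1
while survival**n >= alpha:
    n += 1

# Closed form: survival**n < alpha  <=>  n > log(alpha) / log(survival),
# so the smallest integer n is floor(ratio) + 1.
ratio = log(alpha) / log(survival)
n_closed_form = floor(ratio) + 1

print(f"log(0.05) / log(0.9) = {ratio:.2f}")
print(f"minimum number of patients = {n}")
```

Both routes give the same minimum, and the ratio shows why the threshold sits just above 28 rather than exactly at 29.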
330
00:37:24,760 --> 00:37:30,099
I also taught you how to use the binomial
distribution using the formula n factorial
331
00:37:30,099 --> 00:37:36,160
divided by k factorial times n minus k factorial,
and then multiplied by p raised to the power k,
332
00:37:36,160 --> 00:37:41,750
then 1 minus p raised to the power n minus
k. We can do it numerically or we can use
333
00:37:41,750 --> 00:37:47,690
Excel; all of us have Excel, and there is a
function called Binom Distribution in the
334
00:37:47,690 --> 00:37:54,799
Excel where you can substitute it and calculate
or you can use online software called GraphPad,
335
00:37:54,799 --> 00:38:00,400
I showed you the link to that software; you
can substitute the data and get the values.
336
00:38:00,400 --> 00:38:05,509
So, all these approaches are possible and
you can see binomial distribution is very
337
00:38:05,509 --> 00:38:11,410
very useful in clinical trials and large data
analysis.
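Written out, the binomial formula recalled in this summary is as follows, with n the sample size, k the number of successes and p the probability of success for each event, as used throughout the lecture; the symbol X for the number of successes is introduced here only for notation.

```latex
P(X = k) = \frac{n!}{k!\,(n-k)!}\; p^{k}\,(1-p)^{\,n-k}
```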
338
00:38:11,410 --> 00:38:15,809
In the next class we will look at something called
the Poisson distribution; again, this is a Discrete
339
00:38:15,809 --> 00:38:18,319
Distribution which talks about events.
340
00:38:18,319 --> 00:38:22,499
Again, Poisson is an extension of Binomial
Distribution.
341
00:38:22,499 --> 00:38:23,420
Thank you very much.