1
00:00:12,630 --> 00:00:17,750
Welcome to the next class on Biostatistics
and Design of Experiments. We will talk more
2
00:00:17,750 --> 00:00:24,250
about t distribution, Z distribution and also
confidence interval. That means, I had mentioned
3
00:00:24,250 --> 00:00:33,820
before if I am collecting a sample and getting
a mean and I want to find out what is the
4
00:00:33,820 --> 00:00:39,860
confidence interval on the population mean,
I can use some equations to get that sort
5
00:00:39,860 --> 00:00:43,640
of information actually. So, that is what
we are going to talk about in this class.
6
00:00:43,640 --> 00:00:48,890
Let me again recall, suppose I have a population
7
00:00:48,890 --> 00:00:55,490
and I have sample, so that means I collect
a few pieces from the population and I get
8
00:00:55,490 --> 00:01:03,260
a mean. Now when I take another set of pieces,
I may get a different mean. If I keep on repeating
9
00:01:03,260 --> 00:01:11,110
I will get a large number of means. The means
are representative of the population mean.
10
00:01:11,110 --> 00:01:17,990
That means, the means of the samples are representative
of the population mean. But the standard deviation
11
00:01:17,990 --> 00:01:23,329
that is going to happen because the means
will be distributed in a normal way, the means
12
00:01:23,329 --> 00:01:26,511
of the samples are going to be distributed
in a normal way and there is going to be a
13
00:01:26,511 --> 00:01:31,259
standard deviation associated with that. So,
from that standard deviation we should be
14
00:01:31,259 --> 00:01:36,719
able to get the overall estimate of the standard
deviation.
15
00:01:36,719 --> 00:01:43,759
So generally, the sample mean is the usual estimator
of the population mean and if I take different
16
00:01:43,759 --> 00:01:48,689
samples I will get a large number of means
which will have its own mean. That means,
17
00:01:48,689 --> 00:01:54,249
mean of all the means and it also will have
a distribution, a normal distribution with
18
00:01:54,249 --> 00:02:00,429
its own variance. So, the standard error of
the mean, we call it standard error, is the
19
00:02:00,429 --> 00:02:04,929
standard deviation of these sample means, which estimate
the population mean. So how do you calculate the
20
00:02:04,929 --> 00:02:10,670
standard error? We get the standard deviation
of the sample, divided by square root of n.
21
00:02:10,670 --> 00:02:17,319
This is called the standard error or standard
error of the mean.
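As a small illustration of the formula just described, here is a minimal Python sketch of the standard error, s divided by square root of n. The function name `standard_error` and the list `heights` are made up for this example; they are not from the lecture.

```python
import math
import statistics

def standard_error(sample):
    """Standard error of the mean: s / sqrt(n), with s the sample standard deviation."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    return s / math.sqrt(n)

# Hypothetical measurements, invented purely for illustration.
heights = [12.0, 15.0, 14.0, 13.0, 16.0]
se = standard_error(heights)
print(round(se, 4))
```

The sketch uses `statistics.stdev`, which divides by n minus 1; that is the usual choice when s is estimated from a sample.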
22
00:02:17,319 --> 00:02:26,180
So, this s is the sample standard deviation,
n is the sample size. So, if I am taking a
23
00:02:26,180 --> 00:02:32,720
set of samples, I will get a mean. Now I put
it back then again take another set of samples,
24
00:02:32,720 --> 00:02:38,770
I will get another mean, if I keep on repeating
I will get large number of means. Now these
25
00:02:38,770 --> 00:02:44,600
means will be normally distributed, it will
have its own mean that is the mean of all
26
00:02:44,600 --> 00:02:50,280
these means and it will have its own standard
deviation. Because the whole data will be
27
00:02:50,280 --> 00:02:57,620
normally distributed. So, I can calculate
the standard error of this mean or it is also
28
00:02:57,620 --> 00:03:05,220
called standard error based on the sample
standard deviation. So, s is the sample standard
29
00:03:05,220 --> 00:03:12,590
deviation, n is the number of observations in the
sample, and you take its square root and divide.
30
00:03:12,590 --> 00:03:17,150
This standard error of the mean
is very important because
31
00:03:17,150 --> 00:03:22,870
we use it in calculating the confidence interval
on whatever data we collect. We will see that
32
00:03:22,870 --> 00:03:24,130
as we go along.
33
00:03:24,130 --> 00:03:29,270
Now let us look at a problem. The height of a certain
variety of pulse is normally distributed.
34
00:03:29,270 --> 00:03:35,960
We are given mu equal to 14 and sigma equal to 3. What
is the probability, for a sample of nine plants,
35
00:03:35,960 --> 00:03:45,070
that is, if I take nine plants at random, that
their mean height will lie between 12 and 16? What
36
00:03:45,070 --> 00:03:52,330
will be the statistic? So, you remember this,
Z is equal to x minus mu by sigma. Now this
37
00:03:52,330 --> 00:04:00,080
sigma is for a single plant, but here we are asking
about the mean of a sample of nine plants, right?
38
00:04:00,080 --> 00:04:04,210
So instead of sigma we need to substitute the standard
error, sigma divided by square root of n. If you remember
39
00:04:04,210 --> 00:04:13,600
the previous equation, s divided
by square root of n, it is the same idea: when we do not
40
00:04:13,600 --> 00:04:22,960
know the population standard deviation we use
the sample estimate s; here sigma is given. So, we use
41
00:04:22,960 --> 00:04:33,580
this here and sigma is given as 3, divided by square
root of 9. We have got 9 plants, so 3 by
42
00:04:33,580 --> 00:04:41,139
square root of 9, which will be one. So Z is equal
to 12 minus 14 by 1, because we want to find out from
43
00:04:41,139 --> 00:04:47,360
x bar of 12 to x bar of 16, and Z is equal to 16 minus 14 by 1;
we will get minus 2 and plus 2. So, I want to
44
00:04:47,360 --> 00:04:52,780
find out the area between minus 2 and plus
2, if you remember long time back we did that
45
00:04:52,780 --> 00:04:54,860
from the Z table.
46
00:04:54,860 --> 00:05:02,940
This is the Z table. For Z equal to 2 we get
0.0228. But the catch here is that it gives the area of
47
00:05:02,940 --> 00:05:11,380
the outer tail, so if I want to know the inner region
I will subtract: 0.5 minus 0.0228. That
48
00:05:11,380 --> 00:05:19,950
is the catch here if you remember, because
the Z table gives you only the outer one-tail.
49
00:05:19,950 --> 00:05:27,180
But the whole area of one half is 0.5, so if I want to
know this inside portion that will be 0.5
50
00:05:27,180 --> 00:05:34,120
minus 0.0228, and then it will become 2 times
because I am interested in minus 2 to plus
51
00:05:34,120 --> 00:05:47,560
2. So, I will get 0.9544, which means about 95 percent
of the time the mean height, if I take 9 plants,
52
00:05:47,560 --> 00:05:54,380
will lie between 12 and 16. Do you
understand how to solve this? So here we use
53
00:05:54,380 --> 00:06:03,139
the standard deviation of the sample mean, that is,
the standard error. That is why we use sigma divided
54
00:06:03,139 --> 00:06:05,290
by square root of 9.
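The pulse-height problem above can be checked with a short Python sketch. It uses `NormalDist` from the standard library in place of the printed Z table, which is a substitution made for this example; the numbers mu = 14, sigma = 3 and n = 9 are the ones given in the problem.

```python
import math
from statistics import NormalDist

mu, sigma, n = 14.0, 3.0, 9          # values given in the problem
se = sigma / math.sqrt(n)            # standard error of the mean: 3 / 3 = 1

z_low = (12.0 - mu) / se             # lower limit expressed as a Z value
z_high = (16.0 - mu) / se            # upper limit expressed as a Z value

std_normal = NormalDist()            # standard normal: mean 0, standard deviation 1
prob = std_normal.cdf(z_high) - std_normal.cdf(z_low)
print(z_low, z_high, round(prob, 3))
```

The result agrees with the table calculation of 2 times (0.5 minus 0.0228) to about three decimal places; the small difference in the fourth decimal comes from rounding in the printed table.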
55
00:06:05,290 --> 00:06:10,960
Now again going back to the t distribution and
the confidence interval of the mean: I take
56
00:06:10,960 --> 00:06:20,050
a set of samples and I get the mean, that is
x bar, and I get the standard deviation s of that
57
00:06:20,050 --> 00:06:25,260
sample. Then s divided by square root of
n, if you remember, is the standard
58
00:06:25,260 --> 00:06:33,730
error, and it is multiplied by t. t is the t value and
it will vary depending upon whether it is
59
00:06:33,730 --> 00:06:39,450
95 percent confidence interval or 99 percent
confidence interval. So generally, if n is
60
00:06:39,450 --> 00:06:48,900
large 95 percent confidence interval t value
will be 1.96, if n is large 99 percent confidence
61
00:06:48,900 --> 00:06:52,990
will become 2.58.
If you remember long time back I did mention
62
00:06:52,990 --> 00:07:00,310
how this 1.96 and 2.58 can logically come.
So, for very large n, if we take a normal
63
00:07:00,310 --> 00:07:07,330
distribution, plus or minus 2 sigma covers
approximately 95 percent of the area and plus
64
00:07:07,330 --> 00:07:15,090
or minus 3 sigma covers more than 99 percent of the area.
The exact multipliers for 95 and 99 percent are 1.96
65
00:07:15,090 --> 00:07:23,229
and 2.58, and the t distribution approaches these for large n.
As I said, it is impossible to measure the whole population,
66
00:07:23,229 --> 00:07:27,900
so we can only get a sample, and from the
sample we make an estimate of the population
67
00:07:27,900 --> 00:07:35,580
values. That is why we use the t value: for small n
it is larger than 1.96 and 2.58. If I have
68
00:07:35,580 --> 00:07:44,350
a small sample and I get a mean and I get
a standard deviation, then I can get a confidence
69
00:07:44,350 --> 00:07:51,550
interval for the population mean using these
data by using this formula. This s by square
70
00:07:51,550 --> 00:08:00,130
root of n is called the standard error, t
is the t value from the t table. I will show
71
00:08:00,130 --> 00:08:08,040
you the t table and for large N the t value
for 95 percent confidence will become 1.96
72
00:08:08,040 --> 00:08:14,520
and for 99 percent confidence it will become
2.58. And I also mentioned, what is the logic
73
00:08:14,520 --> 00:08:26,930
of this 1.96 and 2.58. As I said, in a normal
distribution approximately 95 percent of the area
74
00:08:26,930 --> 00:08:32,990
is covered inside plus or minus 2 sigma and
more than 99 percent of the area is covered by plus
75
00:08:32,990 --> 00:08:39,669
or minus 3 sigma. Instead of the rough 2 and
3, the exact values for 95 and 99 percent
76
00:08:39,669 --> 00:08:49,880
are 1.96 and 2.58, which is what
t becomes when the sample is large. And
77
00:08:49,880 --> 00:08:53,980
that is how you would use that formula.
So, you have mu is equal to x bar plus or
78
00:08:53,980 --> 00:09:01,170
minus t, for the given degrees of freedom, times s by square root of n. If n is small,
there is a table called the t table for the corresponding
79
00:09:01,170 --> 00:09:06,010
degrees of freedom, where degrees of freedom
is equal to n minus one. We can get the t
80
00:09:06,010 --> 00:09:12,731
value, multiply it with the standard
error, and add it to and subtract it from x bar, the mean
81
00:09:12,731 --> 00:09:18,210
of the sample, to get the confidence interval
for the population mean.
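The interval formula, x bar plus or minus t times s by square root of n, can be sketched in Python as below. The function name `confidence_interval` and the sample values are hypothetical, and the t value is passed in by hand as read from a t table, since the Python standard library has no t distribution.

```python
import math
import statistics

def confidence_interval(sample, t_value):
    """Return (lower, upper) for the population mean: x_bar +/- t * s / sqrt(n)."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean
    margin = t_value * se
    return x_bar - margin, x_bar + margin

# Hypothetical sample with n = 5, so degrees of freedom = 4.
# 2.776 is the two-tailed 95 percent t value for 4 degrees of freedom
# as found in a standard t table.
sample = [12.0, 15.0, 14.0, 13.0, 16.0]
lower, upper = confidence_interval(sample, t_value=2.776)
print(round(lower, 2), round(upper, 2))
```

For a large sample one would pass 1.96 or 2.58 instead, as discussed above.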
82
00:09:18,210 --> 00:09:24,640
In Excel there is a function called CONFIDENCE.
Its first argument is alpha, the significance level.
83
00:09:24,640 --> 00:09:34,100
So, it can be 0.05 or 0.01; 0.05 indicates
95 percent confidence and 0.01 indicates 99
84
00:09:34,100 --> 00:09:39,890
percent. Then you have the standard deviation,
which is the population standard deviation for the
85
00:09:39,890 --> 00:09:48,880
data, and size is the n. Basically,
it gives you the plus or minus part; if you know
86
00:09:48,880 --> 00:09:55,990
x bar, you add it and then you subtract it to get
the confidence interval. So, we can use the
87
00:09:55,990 --> 00:09:58,779
Excel function called CONFIDENCE also here.
88
00:09:58,779 --> 00:10:03,839
There is another term, called
the coefficient of variation, CV. Generally
89
00:10:03,839 --> 00:10:09,149
we call it the relative standard deviation.
It is a standardized measure of dispersion
90
00:10:09,149 --> 00:10:14,510
of a probability distribution. So, CV is given
by 100 into sigma, that is the standard deviation,
91
00:10:14,510 --> 00:10:20,990
divided by the mean. It tells you the relative
spread in the data, because you have the
92
00:10:20,990 --> 00:10:24,600
standard deviation in the numerator and you
have the mean in the denominator.
93
00:10:24,600 --> 00:10:30,089
If CV is very large, that means my spread
is very large, if CV is very small that means
94
00:10:30,089 --> 00:10:35,899
my spread is very small. So, CV is a quick
measure telling how bad or how good the
95
00:10:35,899 --> 00:10:42,010
spread of the data is. And if I have 2 sets
of data and I know the CVs, I can tell
96
00:10:42,010 --> 00:10:47,520
which data set is more spread out and which
data set is tighter and less spread out.
97
00:10:47,520 --> 00:10:56,180
So, that way CV is also a very useful property
which we can calculate and make use of in
98
00:10:56,180 --> 00:10:58,060
comparing different data sets.
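To make the comparison of two data sets by CV concrete, here is a minimal Python sketch. The function name `coefficient_of_variation` and both data sets are invented for illustration; the sketch assumes the sample standard deviation is used for sigma.

```python
import statistics

def coefficient_of_variation(data):
    """CV in percent: 100 * standard deviation / mean."""
    return 100.0 * statistics.stdev(data) / statistics.mean(data)

# Two hypothetical data sets with the same mean but different spread.
tight = [9.0, 10.0, 11.0]
wide = [5.0, 10.0, 15.0]

cv_tight = coefficient_of_variation(tight)
cv_wide = coefficient_of_variation(wide)
print(cv_tight, cv_wide)   # the larger CV marks the more spread-out data set
```

Because CV divides by the mean, it only makes sense for data whose mean is not zero or close to zero.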
99
00:10:58,060 --> 00:11:05,270
Let us look at a problem.
The heart beat of 113 students is given there.
100
00:11:05,270 --> 00:11:15,900
That is, 2 students have 57 beats per minute,
4 students have 62, 11 students
101
00:11:15,900 --> 00:11:24,510
have 67 and so on; if you add up all these counts
it comes to 113. So, it goes right up to 117,
102
00:11:24,510 --> 00:11:30,310
that means 4 students have a heart beat
of 117. Now we can get a mean for the entire
103
00:11:30,310 --> 00:11:41,170
data set, but that is not the population mean.
We can get a confidence region for the population
104
00:11:41,170 --> 00:11:48,630
mean based on the sample mean. And you are
supposed to calculate, what is that confidence
105
00:11:48,630 --> 00:11:57,470
interval for the population mean at 99 percent
confidence? And also calculate CV. It is quite simple.
106
00:11:57,470 --> 00:12:03,000
What we do is we can calculate x bar from
this data then we can calculate standard deviation
107
00:12:03,000 --> 00:12:10,300
that is s from this data and I know n and
then for 113 students the degree of freedom
108
00:12:10,300 --> 00:12:16,910
is 112. So, I go to the table. Since the
data set is large, for 99 percent the t value
109
00:12:16,910 --> 00:12:22,740
will be about 2.58, so 2.58 is multiplied
by the standard deviation of this data and
110
00:12:22,740 --> 00:12:31,649
divided by square root of n. That will be the
variation in my x bar; x bar is my average.
111
00:12:31,649 --> 00:12:35,330
So, we use this equation, understood.
112
00:12:35,330 --> 00:12:37,880
So, we calculate the x bar that is mean of
113
00:12:37,880 --> 00:12:43,140
the entire data, and then we calculate the
standard deviation of this data, entire data
114
00:12:43,140 --> 00:12:52,230
and n is 113. Now t is my t value, which
I can determine for 112 degrees of freedom
115
00:12:52,230 --> 00:12:58,860
because I have 113 data, so generally the
t value here will be 2.58 because the data
116
00:12:58,860 --> 00:13:04,290
set is large. Let us do the problem using
excel, it is not very difficult.
117
00:13:04,290 --> 00:13:10,230
First, the total number is 113.
Now I need to find out the mean of
118
00:13:10,230 --> 00:13:15,620
the entire data set. So what do I do? 57 into
2, 62 into 4 and so on, then I add up all
119
00:13:15,620 --> 00:13:23,600
of them and divide by 113, so I get an average
of 80.23. This is the average heart beat of
120
00:13:23,600 --> 00:13:31,690
all the students. But if I want to get a confidence
interval on this data because this is a sample,
121
00:13:31,690 --> 00:13:36,560
obviously I need to calculate S. If you remember
the previous equation I need to calculate
122
00:13:36,560 --> 00:13:45,230
S, because S is important, and t will be 2.58
in this case, with square root of 113 in the denominator; that is what
123
00:13:45,230 --> 00:13:51,660
is going to happen here.
So how do I calculate this? This is the mean.
124
00:13:51,660 --> 00:13:58,910
So, every time I will subtract the mean from the value
and then I will square it, right? You know
125
00:13:58,910 --> 00:14:04,980
how to calculate standard deviation; that is
not very difficult. So, you get it like this:
126
00:14:04,980 --> 00:14:12,680
the sample variance divided by 113,
that is 1.32, and then when I take the square root
127
00:14:12,680 --> 00:14:17,850
of that, it will be 1.15, because here I
need to take the square root, right. So, I am
128
00:14:17,850 --> 00:14:24,529
taking the variance divided by n and then taking
the overall square root; that is what I am
129
00:14:24,529 --> 00:14:39,570
doing, so I get 1.15. Now t, as I said, is 2.58,
so 2.58 into 1.15 will be the plus or minus
130
00:14:39,570 --> 00:14:49,860
on 80; 2.58 into 1.15 is around 3. That is
why the lower limit will be 80 minus approximately
131
00:14:49,860 --> 00:14:57,209
3, that is 77, and the upper limit will be 80 plus approximately
3, that is 83. So, the confidence interval
132
00:14:57,209 --> 00:15:05,320
gives you the estimate of where the population
mean will lie, based on the sample mean. The
133
00:15:05,320 --> 00:15:12,790
sample mean is 80.2 and the standard
error is 1.15, that is s by square root of
134
00:15:12,790 --> 00:15:28,440
n, which, as I mentioned, is called the standard
error of the mean.
135
00:15:28,440 --> 00:15:35,820
I multiply it by 2.58 because t is
2.58.
136
00:15:35,820 --> 00:15:43,380
And that is for 112 degrees of freedom, for
99 percent, so I use that. Now CV. CV is my
137
00:15:43,380 --> 00:15:55,400
coefficient of variation. The formula is 100 into
sigma divided by mu. So the mean is 80.2 and
138
00:15:55,400 --> 00:16:09,440
we have the standard deviation here, so we multiply
100 by the standard deviation divided by 80.2 and that will give
139
00:16:09,440 --> 00:16:25,880
you the coefficient of variation.
Quite a simple problem; we can calculate it.
140
00:16:25,880 --> 00:16:40,380
So, this is the upper and lower limits for
the heart beat with 99 percent confidence
141
00:16:40,380 --> 00:16:55,630
based on a sample of 113 students, understand?
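The arithmetic of the heart beat problem can be retraced with a few lines of Python. This sketch uses only the summary numbers quoted in the lecture (mean 80.23, variance divided by n equal to 1.32, t equal to 2.58, n equal to 113), because the full frequency table is not reproduced in the transcript. Reading 1.32 as s squared divided by n, and backing out s from it to get the CV, are assumptions of this sketch.

```python
import math

n = 113
x_bar = 80.23            # sample mean quoted in the lecture
var_over_n = 1.32        # s^2 / n as quoted in the lecture (assumed reading)
t_value = 2.58           # large-sample 99 percent value used in the lecture

se = math.sqrt(var_over_n)           # standard error of the mean, about 1.15
margin = t_value * se                # the plus or minus part of the interval
lower, upper = x_bar - margin, x_bar + margin

s = se * math.sqrt(n)                # sample standard deviation backed out from se
cv = 100.0 * s / x_bar               # coefficient of variation in percent
print(round(lower, 1), round(upper, 1), round(cv, 1))
```

With these inputs the interval comes out at roughly 77 to 83 beats per minute, matching the limits stated above.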
142
00:16:55,630 --> 00:17:09,110
So, there are many statistical tests that are
available which we need to use to compare
143
00:17:09,110 --> 00:17:14,669
2 sets of data and say whether one set of
data is different from another set of data
144
00:17:14,669 --> 00:17:21,280
or they are similar and so on actually, so
there are many, many tests here. Before you
145
00:17:21,280 --> 00:17:27,589
perform any test we need to create the hypothesis.
As I mentioned you have the null hypothesis
146
00:17:27,589 --> 00:17:33,510
and the alternate hypothesis, H naught and
H a or some people use H naught and H 1 and
147
00:17:33,510 --> 00:17:39,720
so on.
So, H naught is no difference, the status quo, and so
148
00:17:39,720 --> 00:17:46,520
on. H a is that there is a difference,
or drug a is better than drug b, or drug a
149
00:17:46,520 --> 00:17:53,160
is worse than drug b and so on; that is the
alternative. So that is the hypothesis. So,
150
00:17:53,160 --> 00:17:58,760
you create the hypothesis first and then after
that you have the data sets, you calculate
151
00:17:58,760 --> 00:18:07,299
the p value, then you decide on whether it
is a one-tail or two-tail, I did talk about
152
00:18:07,299 --> 00:18:13,320
these tails also. If I am comparing 2 drugs
and I am saying there is no difference between
153
00:18:13,320 --> 00:18:20,570
the drugs, or alternatively there is a difference
between the drugs, then we use a two-tail test.
154
00:18:20,570 --> 00:18:30,460
So if I am comparing the heights of students
in class 11 a with the heights of students in
155
00:18:30,460 --> 00:18:36,490
11 b and I am saying there is no statistical
difference in their heights, my null hypothesis
156
00:18:36,490 --> 00:18:48,880
will be H naught, the null hypothesis: there is
no difference in the heights, H a will be
157
00:18:48,880 --> 00:18:53,070
there is a difference in the heights. So in
such situations we are not bothered whether
158
00:18:53,070 --> 00:18:59,169
the height of students in class a is more than
the height of students in class b or the height
159
00:18:59,169 --> 00:19:03,710
of students in a class a is less than that
of the height of the students in the class
160
00:19:03,710 --> 00:19:09,090
b. We are not bothered about greater or less,
but we are just saying they are different.
161
00:19:09,090 --> 00:19:16,710
In such situations we use the two-tailed test,
whereas if we are talking about greater or
162
00:19:16,710 --> 00:19:20,809
less things like that then we use a single
tail test.
163
00:19:20,809 --> 00:19:28,690
So you decide on the tail, then you decide
on the significance level: am I looking at 95 percent confidence
164
00:19:28,690 --> 00:19:34,370
or am I looking at 99 percent
confidence? So for 95 percent, p must be less than
165
00:19:34,370 --> 00:19:45,450
0.05; for 99 percent, p must be less than 0.01.
I do that calculation and then ask: is p less
166
00:19:45,450 --> 00:19:56,070
than 0.05, or is it not, and so on.
Either I will reject the null hypothesis
167
00:19:56,070 --> 00:20:04,410
or I will not reject the null hypothesis.
So, is p less than
168
00:20:04,410 --> 00:20:13,520
0.05? If no, then I cannot reject the null hypothesis.
If p is less than 0.05, then yes, I need to reject
169
00:20:13,520 --> 00:20:17,770
the null hypothesis then only I will accept
the alternative hypothesis.
170
00:20:17,770 --> 00:20:23,820
Initially you start with the hypotheses H
naught and H a, then you decide whether your
171
00:20:23,820 --> 00:20:29,980
confidence level should be 95 or 99 percent, then you decide
on a tail, whether it should be single tail or double
172
00:20:29,980 --> 00:20:41,610
tail, then you calculate the p value and then
you check whether the p value is less than 0.01 or 0.05.
173
00:20:41,610 --> 00:20:47,480
If it is less, then obviously you will reject
the null hypothesis, so you will accept
174
00:20:47,480 --> 00:20:53,350
the alternative hypothesis. If the p value is not less
than 0.05 or 0.01, you do not reject the null
175
00:20:53,350 --> 00:20:59,039
hypothesis. So there is no reason for you
to reject the null hypothesis. This is how
176
00:20:59,039 --> 00:21:04,080
any statistical analysis is carried out and
this is called the hypothesis testing. So
177
00:21:04,080 --> 00:21:09,400
how do you calculate the p value, that depends
upon what you are comparing, am I comparing
178
00:21:09,400 --> 00:21:16,510
means between data, am I comparing mean of
a data with some population mean, am I comparing
179
00:21:16,510 --> 00:21:20,950
the variation of one data set with the variation of the
other data set? So you have different types
180
00:21:20,950 --> 00:21:26,150
of tests: t test, F test, chi square test,
many different tests, and that is what you use
181
00:21:26,150 --> 00:21:29,789
here to calculate the p value.
182
00:21:29,789 --> 00:21:36,370
There are many, many types of tests. As an example of
a two tailed and a single tailed test,
183
00:21:36,370 --> 00:21:41,450
suppose you are looking
at a change in sleeping pattern because of a drug. For the null hypothesis we
184
00:21:41,450 --> 00:21:47,320
can say there is no difference, whereas the alternative
could be that there is a change, at an alpha of 0.05,
185
00:21:47,320 --> 00:21:53,789
that means 95 percent confidence.
Then it becomes a two tailed test. There
186
00:21:53,789 --> 00:21:59,950
is a change; we are not saying greater change
or less change, but there is a change. We
187
00:21:59,950 --> 00:22:07,600
use two tails, both sides. There is a change
means the mu which you calculate is different
188
00:22:07,600 --> 00:22:11,380
from mu naught, so there is a difference.
So, it is a two tail test.
189
00:22:11,380 --> 00:22:20,200
Whereas if you say the new sleeping pattern mu
is greater than the old sleeping pattern,
190
00:22:20,200 --> 00:22:25,140
where mu naught could be 7 hours, so there is an
increase in the sleeping pattern, then it
191
00:22:25,140 --> 00:22:30,100
is an upper tail test. That is a single tail
test. Or mu is less than mu naught, that means
192
00:22:30,100 --> 00:22:34,760
the sleeping pattern is less than 7 hours, there
is a decrease; then it is a lower tail test, and again
193
00:22:34,760 --> 00:22:40,390
it is a single tailed test. So in these 2
types of situation your H naught will be mu
194
00:22:40,390 --> 00:22:46,000
equal to mu naught, H 1 will be mu
greater than mu naught or mu less
195
00:22:46,000 --> 00:22:51,350
than mu naught, depending upon whether I am
looking at the drug enhancing sleeping pattern
196
00:22:51,350 --> 00:22:56,320
or the drug reducing sleeping pattern; then it
is a single tail test. Whereas if you have
197
00:22:56,320 --> 00:23:02,470
H naught as mu is equal to mu naught and
H 1, that is the alternative hypothesis,
198
00:23:02,470 --> 00:23:07,010
as mu is not equal to mu naught,
then we use a two tail test. Do you understand?
199
00:23:07,010 --> 00:23:15,049
This is very, very important: how to zero in on the
tail, should I zero in on a single tail or should
200
00:23:15,049 --> 00:23:20,130
I zero in on a two tailed test. That is very
important when you do a statistical analysis
201
00:23:20,130 --> 00:23:25,190
as I showed here. We need to decide on the
tail because in most of the tables, whether it
202
00:23:25,190 --> 00:23:31,320
is the t table or the z table, the area
under the curve for a single tail is only
203
00:23:31,320 --> 00:23:35,809
1 side, whereas if it is a double tail we
need to consider both the sides of that area,
204
00:23:35,809 --> 00:23:43,020
if you remember that very clearly in our previous
lectures. So you formulate the hypothesis
205
00:23:43,020 --> 00:23:49,600
and then you calculate your p value based
on the type of equations we used, then you
206
00:23:49,600 --> 00:23:54,559
decide on whether it is a single tail or a
two tailed test, then you decide: am I
207
00:23:54,559 --> 00:23:59,870
going to look at it at 95 percent or 99 percent?
Then you ask whether the p you have calculated is
208
00:23:59,870 --> 00:24:12,770
less than 0.05 or greater than 0.05.
If it is less than 0.05, we reject the null hypothesis. If it is greater
209
00:24:12,770 --> 00:24:20,720
than 0.05, obviously, we cannot say the null hypothesis
can be rejected. So if p is a very small
210
00:24:20,720 --> 00:24:28,820
value we reject, and otherwise there is no reason for rejecting
the null hypothesis.
211
00:24:28,820 --> 00:24:34,490
So let me again spend some time on upper tailed
and lower tailed tests. You know the normal
212
00:24:34,490 --> 00:24:43,150
distribution and you know the area; here the
outside area is more important than the inside
213
00:24:43,150 --> 00:24:49,610
area. If you are talking about 95 percent
confidence the outside area will be 5 percent
214
00:24:49,610 --> 00:24:59,960
or 0.05. So one side will be 0.025 and the other side
will be 0.025. For a t value of 1.96, for a two
215
00:24:59,960 --> 00:25:07,150
tailed test, this area will be 0.025 and that area
will be 0.025; that is called a two tailed test.
216
00:25:07,150 --> 00:25:12,460
So for a two tailed test, for different alphas,
as we can see, this is 95 percent,
217
00:25:12,460 --> 00:25:21,110
this is 90 percent, this is the 80 percent
and this is 99 percent and so on actually.
218
00:25:21,110 --> 00:25:31,891
So for a 95 percent two tailed test z is equal
to 1.96, just like t is equal to 1.96. This
219
00:25:31,891 --> 00:25:39,730
area and this area are equal amount actually,
so it is 2.5 percent and 2.5 percent. Whereas
220
00:25:39,730 --> 00:25:45,330
if you are taking single tailed test whether
its upper tailed or lower tailed, you are
221
00:25:45,330 --> 00:25:51,130
considering only one side of the area. For
a upper tail you are considering this side
222
00:25:51,130 --> 00:25:56,210
of the area, for lower tail you are considering
only this side of the area, remember that.
223
00:25:56,210 --> 00:26:05,820
So for a 95 percent confidence upper
tail test you have 1.645. As
224
00:26:05,820 --> 00:26:14,930
you can see, for an upper tail of 0.025 you get
1.96; if you again go back here, for a two tailed 0.05 it is 1.96,
225
00:26:14,930 --> 00:26:26,720
because one side 0.025 and the other side 0.025 together
is 0.05, which gives 1.96. You see this, you
226
00:26:26,720 --> 00:26:27,900
see this.
For an upper tailed test your alternative
227
00:26:27,900 --> 00:26:34,520
hypothesis could be mu greater than mu naught
and the null hypothesis mu equal to mu naught.
228
00:26:34,520 --> 00:26:42,490
So for a lower tailed test again it is the
same thing here: z equal to minus 1.96
229
00:26:42,490 --> 00:26:51,350
here will give 0.025, and z equal to minus
1.645 will give you 0.05. It is the same,
230
00:26:51,350 --> 00:26:59,520
the only thing is that here the sign of z
has changed; the area is the same but the signs are
231
00:26:59,520 --> 00:27:06,640
changed. So for a lower tailed test your null
hypothesis could be mu equal to mu naught
232
00:27:06,640 --> 00:27:15,221
and the alternative is mu less than
mu naught. For alpha equal to 0.05,
233
00:27:15,221 --> 00:27:24,320
we reject H naught if z is less than
minus 1.645; for the upper tail z should be greater
234
00:27:24,320 --> 00:27:31,159
than 1.645, or t likewise. So z and t seem to be analogous
to each other as you can see in this particular
235
00:27:31,159 --> 00:27:37,179
statistical area.
You have 3 types of situations one is called
236
00:27:37,179 --> 00:27:44,970
the two tailed region that means, if it is
95 percent outside area is 5 percent. So you
237
00:27:44,970 --> 00:27:50,650
divide equally on both the sides so you get
2.5 percent and 2.5 percent. For a two tailed
238
00:27:50,650 --> 00:28:01,032
test if we look at that table for a 0.05 you
get this value as 1.96 or t as 1.96. Whereas
239
00:28:01,032 --> 00:28:05,460
in a single tailed test you are considering
only one side of the area, if it is an upper
240
00:28:05,460 --> 00:28:09,710
tail you have area on the upper side, if you
have the lower tail you have the area on the
241
00:28:09,710 --> 00:28:20,080
lower tail. So for a 95 percent upper tailed
test the z value or t value will be 1.645.
242
00:28:20,080 --> 00:28:27,700
For a 95 percent lower tail test again the
z value will be minus 1.645, and as I said if you
243
00:28:27,700 --> 00:28:36,720
look at an alpha of 0.025 you get 1.96, which
is similar to this because, when I take 0.05
244
00:28:36,720 --> 00:28:45,179
we are considering 0.025 plus 0.025 to get
0.05 and that is why this term here and this
245
00:28:45,179 --> 00:28:52,080
term here matches. Do you understand the entire
logic here. This is how you go about doing
246
00:28:52,080 --> 00:28:59,490
a test of significance for any data set.
So again, let me go back. So we create the
247
00:28:59,490 --> 00:29:05,809
null hypothesis. The null hypothesis is that there
is no difference, or status quo. The alternative
248
00:29:05,809 --> 00:29:11,799
hypothesis could be there is a statistical
difference between the new data set and the
249
00:29:11,799 --> 00:29:18,250
old data set, then we use a two-tailed test
because when we are saying there is a difference
250
00:29:18,250 --> 00:29:24,060
we do not say whether the difference is more
or less. Whereas if the alternative hypothesis is
251
00:29:24,060 --> 00:29:29,820
that the new data set is greater than the old
data set, which could be height, could be IQ, could
252
00:29:29,820 --> 00:29:35,940
be a drug, then we use a single tailed test. This
is called an upper tailed test because it is greater.
253
00:29:35,940 --> 00:29:42,390
When we say the new data set is lower than
the old data set, then we use a lower tailed
254
00:29:42,390 --> 00:29:47,590
test, and again that is a single tailed test.
So you decide on the type of test and then
255
00:29:47,590 --> 00:29:52,380
you decide on the significance level at
which I am going to study. Am I going to
256
00:29:52,380 --> 00:29:59,990
study at 95 percent confidence or 99 percent?
So for 95 percent, alpha is equal to 0.05, so
257
00:29:59,990 --> 00:30:07,130
from that 0.05 I can get a table t value or z
value. Then I calculate my t or z value
258
00:30:07,130 --> 00:30:15,279
based on the type of test I am doing, and then
I check whether the calculated p value is
259
00:30:15,279 --> 00:30:25,350
less than the chosen significance level.
If it is less, yes, I can reject the null hypothesis
260
00:30:25,350 --> 00:30:34,250
and accept the alternative hypothesis.
If the p value which I calculate is not less, then
261
00:30:34,250 --> 00:30:41,940
obviously, I cannot reject the null hypothesis.
Equivalently, if the t value which I calculate is larger
262
00:30:41,940 --> 00:30:48,640
than the table value, then I need to reject the
null hypothesis. If the t value I calculate
263
00:30:48,640 --> 00:30:54,409
is less than the table value, then I
cannot reject the null hypothesis, and that
264
00:30:54,409 --> 00:31:02,570
is how you go about doing statistical comparison
of data sets, either comparison of means or
265
00:31:02,570 --> 00:31:08,490
comparison of variance or comparison of ratios
and so on actually.
266
00:31:08,490 --> 00:31:12,700
Thank you very much.