In our previous lecture, we tried to motivate the need for inferential statistics through the context of hypothesis testing. We spoke about why we need it, where it applies, and so on. We concluded that lecture by coming up with a template, a rubric, essentially, of what it is that one needs to do in hypothesis testing. We started off with instructions like: you need a null hypothesis and an alternate hypothesis; and the whole thing was fairly general. Today's lecture continues our focus on hypothesis tests, and we are going to talk about something called single sample tests. In the last class you saw two sets of examples: single sample tests and two sample tests. Today we are going to focus on single sample tests, and we are going to illustrate the template you saw by working through one test. The test is going to be the single sample z-test; we will show you the mechanics behind it and give you the reasoning for some of the math that we do. Then we will talk about some of the other tests that are there as well.
The single sample z-test is a test used when you want to make inferences about the population mean. Note the clarity here: you are not interested in just reporting the sample mean, which is what you might have done with descriptive statistics; you are ultimately interested in making a statement about the population mean. This is a test where you need to know the variance of the population (you can think of it as variance or standard deviation); that is a requirement. It is not the same thing as computing the variance from the sample; that is called the sample variance, and there is a formula for that, computed from actual data. This test is more useful when there is historical data and you actually know the population variance. But ultimately, you are doing a test about the population.

Another case where this test finds application is when the sample size is fairly large. This is the one exception to the rule that you need to know the variance. "Large" here gets defined as approximately 30; that is the magic number people use. When the sample size is 30 or larger, you reason that you have a large enough sample, and in that one instance you do not need to know the population variance; you can calculate the sample variance from the data itself and use that for the test. So, to quickly summarize: with the single sample z-test, you have a single sample, and you are making inferences about the population mean based on that sample. Typically, in a single sample z-test, you need to know the variance, with the small exception that you can also use the test if you have a large enough sample size and you do not know the variance.
The example we are going to use to motivate this test is one you have already seen in the previous class: the problem of average phosphate levels in blood. I want to be very clear that we created this medical scenario; I am not saying it is medically accurate; we are interested in the statistics of it. So we imagine a doctor, or a public health system, which says that the average phosphate level in your blood should be less than 4.8 milligrams per deciliter. The idea here is that doctors, or whoever, understand that not every reading will satisfy this. You take a blood sample and take a reading of the phosphate level; not every time is it going to be 4.8 or less; sometimes it might be more, sometimes it might be less. The whole idea is that you are trying to see if the average is less. And when you say you are interested in the average, you are not interested in the average of some sample that you have taken. Here a sample would be, say, 5 blood tests; perhaps different machines give different results, perhaps different times of the day give different results, perhaps what you ate in the morning affects it. The whole point is that you only have that sample at hand, but you are interested in saying something about the population.
So let us step back and go through each of the bullet points one more time in the context of this example. We said: what are we testing? We are interested in the population mean. So you might ask yourself, in this particular example, what is the population? You can think of the population in a couple of different ways. One way is as a concept: your true average and true distribution. Yes, you can take a blood test, and when you do, it is like taking a sample; but there is the concept of what the phosphate in your blood is at all times. Doing a test just gives you one peek into that reality, but there exists this reality: some distribution, which we assume does not change over time, and that distribution has a certain mean, and we are very interested in this mean. This is the distribution of the phosphate levels in your blood. It is the reality we do not know; you can think of it as an oracle somewhere that knows what your true phosphate level in blood is at all times, with some distribution. At any point of time you go take a blood test, it is like you are getting a number from this random variable with that distribution. And you are very interested in the mean of the distribution. So that is the population: your true phosphate level in your blood at any point of time. It is a concept.
If you prefer to think of the population as actual data points, another way to think of it is as a very large data set: the data set you would see if you were to continuously take blood tests an infinite number of times, with an infinite number of machines, over multiple days. So you can think of constructing this population in your head as a very large data set. But the truth is, ultimately, you never have that data set; otherwise we would not be doing inferential statistics. What you do have is a sample. The sample can be of some size we have not yet discussed: 5, 10, 20, 30, 40 data points. And the idea here is that you know the variance: I have said that there is a known standard deviation of 0.4 milligrams per deciliter. Mind you, this is not the standard deviation of the sample set you got; it is the standard deviation of the population. So the situation is this: we are trying to answer some questions about the mean of this population distribution, but somehow you know what its standard deviation is; someone whispered it to you, or you know it from fundamental principles, or you might reason that historically it has been equal to this value and should not have changed. But in this particular example, you know the standard deviation. And we have already talked about how, if you have a large enough sample size, you do not actually need to know the standard deviation; you can compute it. So here is the data. I have told you that we know the standard deviation; the data is out here, and I have deliberately not given you the exact data points. Just to give you an idea, each of these data points comes from going and doing a blood test: you did a blood test and you got 4.1; you did it again and got 3.9; and so on, down the list. This is what we call a sample. So, what do you do with this data? How do you conduct the test?
Let us go back to the rubric we created, the template we discussed at the end of the last class for conducting any hypothesis test. The first bullet point: have a null and an alternate hypothesis. That is what we are going to do. The null hypothesis here says something about μ₀, which refers to the population mean. (I have used μ₀ here; you can use μ as well, and you might see textbooks do that.) The null hypothesis says that the true mean of the population is less than 4.8. I do not know the answer to this question; that is what I am hypothesizing. We are going to go through some mechanics, some process, and at the end of it we are going to see whether, colloquially speaking, the null hypothesis is true or not. As we said, the null and alternate hypotheses together need to be mutually exclusive and collectively exhaustive. The null hypothesis says that μ₀ is less than 4.8; so the alternate hypothesis should be that μ₀ is greater than 4.8.
The next step, we said, is to do some basic calculations, some arithmetic on the data, to create a single number called the test statistic. The math we are going to do here is fairly straightforward. We take x̄, which means the sample mean; the sample mean here just means that you take the data points you have collected and take their average. We do that, and we then calculate x̄ minus μ₀, where μ₀ is 4.8, the number you are hypothesizing. And you divide that by σ/√n. Here σ refers to the standard deviation that you already know; that is the 0.4 we spoke about, since we are already given that the standard deviation is 0.4. So, given this, we compute (x̄ − μ₀) / (σ/√n): x̄ is the average of the sample; μ₀ is the 4.8, the number you are hypothesizing; σ is the standard deviation you are given; and n is the number of data points in the sample. If your sample size is 10 or 15, you would substitute 10 or 15. That is how you calculate something called the z-statistic.
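As a concrete sketch of this arithmetic: the readings below are hypothetical, chosen only to go with the 4.8 mg/dL hypothesized mean and the σ = 0.4 from the example.

```python
import math

def z_statistic(sample, mu0, sigma):
    """Single sample z-statistic: (x_bar - mu0) / (sigma / sqrt(n))."""
    n = len(sample)
    x_bar = sum(sample) / n                      # sample mean
    return (x_bar - mu0) / (sigma / math.sqrt(n))

# Hypothetical phosphate readings (mg/dL); mu0 and sigma are from the lecture.
readings = [4.1, 3.9, 5.0, 4.9, 5.2]
z = z_statistic(readings, mu0=4.8, sigma=0.4)
print(round(z, 3))
```

Note that the statistic is just a rescaled distance between the sample mean and the hypothesized mean, so a z near 0 means the sample looks consistent with the hypothesis.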
Why are we calculating this value? We said in the next bullet point that if we assume the null hypothesis to be true (and make some distributional assumptions, which I will not spell out every time), then the test statistic should be no different from pulling a random number from a specific probability distribution. The whole idea is: if the null hypothesis is true, then the z-statistic you calculate, by plugging in numbers for x̄, μ₀, σ, and n, should be equivalent to pulling a random number out of a specific probability distribution.

What is this specific probability distribution? In the case of the single sample z-test, it is called the z-distribution, and the z-distribution is nothing but a normal distribution with a mean of 0 and a standard deviation of 1. I have written it as N(0, 1²), the standard nomenclature we have discussed: the first number is the mean, 0; and writing 1² is convenient because it makes very clear that the 1 refers to the standard deviation, and since 1² is also equal to 1, the standard deviation and the variance are both equal to 1. This is what is known as the z-distribution. So we are saying that if the null hypothesis is true, the z-statistic you compute from this data should be the same as, equivalent to, pulling a random number from a z-distribution. This is a very useful statement, and we will come back to what we do next because of it. But before we do that, I want to make sure you understand why, if the null hypothesis is true, computing a z-statistic from the data is equivalent to pulling a random number from a distribution that is normally distributed with mean 0 and standard deviation 1.
So let us go to the next slide to do that. At the start, you had this distribution: the population distribution. The population, if the null hypothesis is true, would have a mean equal to μ. (I am using μ and μ₀ a little interchangeably out here, but do not get confused by that.) The idea is that if your null hypothesis is true, then the mean of this distribution is equal to 4.8. Technically it is less than or equal to 4.8, but we are going to take the extreme case: we come to one end of it and say the mean is actually equal to 4.8. And we are already given the standard deviation: you have already been told that σ is 0.4.
So first, this is just the distribution of the population; you have built that. Now, when you go to take a sample of some size n from this population, what do you get? As we have discussed, if I take a sample of 5 or 6 data points and compute a mean from that sample, the arithmetic mean you compute from a sample need not always be exactly equal to μ. Sometimes it is μ, sometimes it is a little higher than μ, sometimes a little lower. But we have already discussed how that too has a distribution, called the sampling distribution, and it is also normally distributed, with the same mean μ but with a standard deviation of σ/√n, where n is the number of data points you took to compute that mean. If you technically take a sample of just one point and compute the mean from it, you get the same original distribution again: substitute n = 1 and nothing changes. That should be intuitive; when you are computing the average of one data point, it is literally like you are recreating the distribution. If n goes to infinity, then your variance goes to 0: as your sample size keeps on increasing, you are literally going to be sitting on top of this line, and you really would not have a spread at all. But for all finite sample sizes, what you have is a distribution for sample means that is normally distributed with mean μ and standard deviation σ/√n.
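A quick simulation illustrates this claim; this is purely illustrative, using the population parameters from the example and an arbitrary sample size of 5.

```python
import random
import statistics

random.seed(0)
mu, sigma, n = 4.8, 0.4, 5        # population parameters from the example
reps = 20000

# Draw many samples of size n and record each sample mean.
means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(reps)]

print(statistics.fmean(means))    # close to mu = 4.8
print(statistics.stdev(means))    # close to sigma / sqrt(n) ~ 0.179
```

The spread of the sample means comes out near σ/√n ≈ 0.179 rather than the population's 0.4, which is exactly the narrowing the lecture describes.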
Now, mind you, this is the distribution of sample means; in some sense, this is the distribution of x̄. What we are going to do now is subtract μ from this distribution. What effect does that have? If you just subtract a number from a normal distribution, it is literally like shifting the distribution and re-centering it. Because you are just subtracting a number, you are not affecting the standard deviation; that is unaffected. You can think of subtracting a number from a distribution as taking each data point and subtracting the same number from it, or as computing x̄ and then subtracting; in either case, it has the effect of shifting the distribution, of centering it at another location. And that is what happens when you take μ out of x̄: it moves this distribution to another location, now centered around a mean equal to 0.
Now, what happens if you then take this distribution and divide it by the standard deviation, that is, by σ/√n? The effect is to re-scale the standard deviation. The distribution is already centered around 0, which means it has some positive values and some negative values. When you divide something already centered around 0 by a number, the division does not change the central location of the distribution; it only widens or narrows it, depending on what you divide by. I have shown the distribution getting narrower, but that need not be the case; it could just as well have become wider. It depends on whether σ/√n is greater than or less than 1: if it is greater than 1, dividing by σ/√n makes the distribution narrower, but if σ/√n is smaller than 1, it has the effect of widening the distribution. Ultimately, when you divide a normal distribution that is already centered around 0, all you are doing is stretching the distribution or crunching it; you can think of it as scaling. And now you have a standard deviation of 1. So that is how you get the whole construction: in the first step, we took x̄ and subtracted μ from it, which re-centered the distribution; then we divided by σ/√n to get your normal distribution with mean 0 and standard deviation 1, which is what we said was the z-distribution.
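The two steps above, shift by μ and then scale by σ/√n, can be checked numerically; this sketch (illustrative parameters again) standardizes many simulated sample means and confirms the result behaves like N(0, 1²).

```python
import math
import random
import statistics

random.seed(1)
mu, sigma, n = 4.8, 0.4, 5
reps = 20000

# z = (x_bar - mu) / (sigma / sqrt(n)) for many simulated samples
zs = [(statistics.fmean(random.gauss(mu, sigma) for _ in range(n)) - mu)
      / (sigma / math.sqrt(n))
      for _ in range(reps)]

print(statistics.fmean(zs))   # close to 0
print(statistics.stdev(zs))   # close to 1
```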
So now let us go back to the rubric. Great: we have reasoned that you have to calculate the z-statistic, and we have already said that the z-statistic, if the null hypothesis is true, should be like pulling a random number from a standard normal distribution. What we are going to do now is test that. We are going to say: if my null hypothesis were true, I should have gotten a random number from this distribution; but let us actually look at the number I got. Does my z-statistic actually look like something I could have pulled out of N(0, 1²)? Because if it does not, then something I assumed was wrong. I said that this z-statistic should look like something pulled out of N(0, 1²) if the null hypothesis is true. So let me go take an actual look at the z-statistic number. If it looks like it could not have come from this distribution, if it looks unlikely to have come from this distribution, then I can reason that maybe the null hypothesis was not true, and I can reject the null hypothesis. That is the next step: take the z-statistic, place it on this N(0, 1²), and see how extreme the actual number is, given that we know the distribution and hence the potential values it can take. For instance, I have already told you it is a normal distribution with mean 0 and standard deviation 1. From this distribution, if I were to pull a random number, how likely is it that I would see a number like 55? That is far too high. The standard deviation is 1 and the mean is 0; it is almost impossible that you would pull a number like 55, or minus 20 for that matter. So, if it turns out that the z-statistic is too extreme a value to be coming from this normal distribution, then we can perhaps make a statement about the null hypothesis. That is what we are going to do as the next step. Let me just clean this up.
I have just restated the official definition. What we are going to calculate is something called a p-value: the probability of seeing a test statistic as extreme as the calculated value if the null hypothesis is true. The core idea is that if it looks too extreme, that is, if the p-value, the probability of seeing this test statistic, is really low, then perhaps the null hypothesis was not true to start with. For instance, suppose the z-statistic you computed was 1.2; I am just giving you a number to go with. Then the idea is that you would calculate a p-value based on the stated null hypothesis, which is that μ₀ is less than 4.8. You take the z-distribution, the N(0, 1²), place 1.2 on it, and compute a probability: the probability to the right side of 1.2, the area under the curve out here. The idea is that you are quantifying the probability of seeing something as extreme as 1.2 or greater, which is equal to the area under this curve. That might turn out to be any value; in this case it happens to be about 11.5 percent.
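For a standard normal, this right-tail area can be computed with just the complementary error function, so no statistics library is needed; the 1.2 here is the lecture's made-up z-statistic.

```python
import math

def right_tail_p(z):
    """P(Z >= z) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

print(right_tail_p(1.2))    # about 0.115: not small enough to reject
print(right_tail_p(55.0))   # essentially 0: would reject immediately
```

This is the same area a z-table at the back of a textbook gives you, just computed directly.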
But if this number were really low, say 0.001, you might say: this number is so low that I am going to reject the null hypothesis. And if you cannot find something that extreme, the standard thing you do is fail to reject the null hypothesis; you technically never accept the null hypothesis. So that is the core idea.
255
00:27:48,889 --> 00:27:54,620
And, you can also think of this in a couple
of other ways. For instance, if your null
256
00:27:54,620 --> 00:28:01,909
hypothesis were that mu is greater than 4
point 8; then, you would be looking at the
257
00:28:01,909 --> 00:28:07,720
area to the left of your curve. But, typically,
in a situation like that, you actually would
258
00:28:07,720 --> 00:28:13,810
not do the statistical test. So, it is not
common to see p values greater than 0 point
259
00:28:13,810 --> 00:28:22,100
5 because at that point, you start by saying
look I have computed a z-statistics that is
260
00:28:22,100 --> 00:28:30,150
already positive, that is, 1 point 2. And
so, I know even before I go put this line
261
00:28:30,150 --> 00:28:34,520
out here that, I am going to get a probability
greater than 0 point 5.
262
00:28:34,520 --> 00:28:41,529
So, for instance, your z-stat was computed
to be exactly equal to 0. You know that, if
263
00:28:41,529 --> 00:28:45,990
your null hypothesis was mu is greater than
4 point 8; that, it will be 0 point 5. So,
264
00:28:45,990 --> 00:28:50,059
any z-stat even greater than that is bound
to be greater than 0 point 5 when you do not
265
00:28:50,059 --> 00:28:58,410
need statistics for that. But, another interesting
situation, which a lot of people work with,
266
00:28:58,410 --> 00:29:09,769
is called the two-tailed case. These two are
called one-tailed. So, you also have
267
00:29:09,769 --> 00:29:17,470
the 2 tailed case, where your null hypothesis
is really that, mu is equal to 4 point 8.
268
00:29:17,470 --> 00:29:25,030
And, you are interested in rejecting this
null hypothesis whether mu is too large;
269
00:29:25,030 --> 00:29:31,679
meaning it is large enough that you can say
that it cannot be equal to 4.8; or, if it
270
00:29:31,679 --> 00:29:39,340
is small enough. So, you are happy to reject
if you see evidence that shows that mu cannot
271
00:29:39,340 --> 00:29:44,299
be 4.8 on either account; maybe because
the data suggests that it is too large,
272
00:29:44,299 --> 00:29:49,020
maybe because the data suggests that it is
too small. And, that is called the 2-tailed
273
00:29:49,020 --> 00:29:54,519
case.
Now, there is lots of different software,
274
00:29:54,519 --> 00:29:59,590
where many of the steps that we have done
are actually taken care of, and it does
275
00:29:59,590 --> 00:30:04,880
not take much. Even a simple Excel sheet, if
you just go down the data and say do a z-test;
276
00:30:04,880 --> 00:30:11,899
it will do it. But, somewhere understanding
the mechanics of this and getting it to the
277
00:30:11,899 --> 00:30:17,060
stage of the z-statistic, at least brings
about some sense of control and transparency
278
00:30:17,060 --> 00:30:24,789
in your understanding. But, once you get off
the stage, computing this area – whether
279
00:30:24,789 --> 00:30:31,529
it is on the left-hand side, right-hand side
or either side – is not trivial; it is not
280
00:30:31,529 --> 00:30:37,990
something that can easily be done by hand.
So, what textbooks do is – if you have
281
00:30:37,990 --> 00:30:44,990
taken one, most statistics textbooks will have
these pages towards the end. And, they are
282
00:30:44,990 --> 00:30:50,309
given in the form of tables. So, you can take
the z-statistic number that you computed and go
283
00:30:50,309 --> 00:30:54,200
plug that into this table. And, it will tell
you the probabilities.
284
00:30:54,200 --> 00:31:00,519
And, usually there will be a diagram on top
to hint whether they are giving you the probabilities
285
00:31:00,519 --> 00:31:05,559
to the left hand side or the right hand side
or both sides. And, you know – as long as
286
00:31:05,559 --> 00:31:12,059
you know what they are giving you, it is fairly
easy to figure out whatever it is that you
287
00:31:12,059 --> 00:31:17,019
need. If you want the right-hand side; but,
they are giving you only the left-hand side;
288
00:31:17,019 --> 00:31:21,940
then, you can just subtract the number they
are giving you from 1, because you know the
289
00:31:21,940 --> 00:31:26,620
total area under the curve is equal to 1.
But, what I am going to do is I am just going
290
00:31:26,620 --> 00:31:32,780
to give you some simple Excel functions that
do this for you. For instance, in Excel, you
291
00:31:32,780 --> 00:31:38,309
can just… The convention is to give you
the area to the left-hand side. So, for instance,
292
00:31:38,309 --> 00:31:45,400
what I do here is I subtract that from 1.
And, the NORM.S.DIST is what refers to the
293
00:31:45,400 --> 00:31:50,289
z-distribution. And, the TRUE refers to the
fact that I am not interested in just the height,
294
00:31:50,289 --> 00:31:55,690
I need the area under the curve. So, that
is what that… So, for the right-hand side
295
00:31:55,690 --> 00:31:59,940
tail, you can use this; for the left-hand
side tail, you can use this. A simple multiplication
296
00:31:59,940 --> 00:32:07,120
by 2 with the left-hand side case gives you
the 2-tail situation. So, with this, we have
297
00:32:07,120 --> 00:32:10,350
discussed in greater detail the single sample
z-test.
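The table lookups and Excel formulas described above can also be sketched in Python. This is a minimal sketch: the standard normal CDF is built from math.erf (the same quantity Excel's NORM.S.DIST with TRUE returns), and the z value of 1.2 is the lecture's running example:

```python
import math

def norm_cdf(z):
    # Area to the left of z under the standard normal curve
    # (what Excel's NORM.S.DIST(z, TRUE) gives you)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z_stat = 1.2  # the z-statistic from the lecture's running example

p_left = norm_cdf(z_stat)                    # left-hand tail
p_right = 1.0 - norm_cdf(z_stat)             # right-hand tail (the "1 minus" step)
p_two = 2.0 * (1.0 - norm_cdf(abs(z_stat)))  # two-tailed: twice the smaller tail

print(p_right)  # right-tail area for z = 1.2, about 0.115
```

For z = 1.2 the right-hand tail comes out to roughly 11.5 percent, in the same ballpark as the figure quoted loosely in the lecture.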
298
00:32:10,350 --> 00:32:16,610
So, now, let us look at a couple of other
single sample tests. What I provide you with
299
00:32:16,610 --> 00:32:22,409
here is the list of them and the formulas.
And, I will give you some idea of the context
300
00:32:22,409 --> 00:32:27,240
in which they are used. But, we will not
derive them or go through them in the same detail
301
00:32:27,240 --> 00:32:32,860
as the z-test. So, we have already discussed
the z-test. Let us now look at the next test,
302
00:32:32,860 --> 00:32:38,139
which is the t-test. So, we have finished
with the z-test. So, now, we are going to
303
00:32:38,139 --> 00:32:49,240
look at the t-test. So, with the t-test, it
is a very useful test and it also tries to
304
00:32:49,240 --> 00:32:53,640
test this… It essentially tries to do the
same job the z-test is doing; which is to
305
00:32:53,640 --> 00:32:59,630
make some statement about the population mean;
so, same problem statement in some sense.
306
00:32:59,630 --> 00:33:03,269
The one big difference is you are not given
the variance.
307
00:33:03,269 --> 00:33:10,669
And, in most situations in life, in statistics,
you would not know the variance; I mean just
308
00:33:10,669 --> 00:33:18,419
think about how fairly unrealistic it is that
you already know the variance of the population
309
00:33:18,419 --> 00:33:22,950
in a situation, where you are trying to make
a statement about the mean. I mean the only
310
00:33:22,950 --> 00:33:26,909
reason you are doing this test is because
you do not know what the population mean is.
311
00:33:26,909 --> 00:33:31,049
So, you are trying to… You are making a
hypothesis, you are taking a sample, and then
312
00:33:31,049 --> 00:33:35,169
you are testing that sample, you are working
with that sample to make a statement about
313
00:33:35,169 --> 00:33:39,059
the population mean. So, there is… I mean
think of it as there is some uncertainty about
314
00:33:39,059 --> 00:33:43,190
the population mean in the first place and
that is why you are doing this test. To assume
315
00:33:43,190 --> 00:33:48,701
that in such a situation, you already know
the population variance need
316
00:33:48,701 --> 00:33:54,220
not be very realistic. So, this test works
the same way.
317
00:33:54,220 --> 00:33:59,279
So, if you look at it, it has got the same
x bar; it has got the same mu; it has got
318
00:33:59,279 --> 00:34:05,289
the same square root of n. But, this s is
different from this sigma. And, the difference
319
00:34:05,289 --> 00:34:13,290
is here sigma was given in this z-test. So,
in the z-test, sigma is given. But, in this
320
00:34:13,290 --> 00:34:19,500
test, the s is computed; it is computed from
the data. So, you actually go back to the
321
00:34:19,500 --> 00:34:25,440
sample data and you calculate the variance
or standard deviation from the data using
322
00:34:25,440 --> 00:34:29,810
the same formula for dispersion that we would
have discussed when we spoke about standard
323
00:34:29,810 --> 00:34:38,450
deviation and descriptive statistics using
the n minus 1 idea. And, if you do not remember,
324
00:34:38,450 --> 00:34:42,650
you can go back and see that lecture. But,
the idea is that, you compute the standard
325
00:34:42,650 --> 00:34:46,369
deviation and you plug that value in to get
the t-distribution.
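The t-statistic computation just described can be sketched as below. The sample values are made up for illustration (echoing blood-test readings like the 4.1 and 3.5 mentioned later in the lecture), and the hypothesized mean of 4.8 is the lecture's running example; note that Python's statistics.stdev already uses the n minus 1 formula:

```python
import math
import statistics

# Hypothetical sample (made-up values for illustration); H0: population mean mu = 4.8
sample = [4.1, 3.5, 4.6, 5.0, 4.4, 4.9, 4.2, 4.7]
mu0 = 4.8

x_bar = statistics.mean(sample)
s = statistics.stdev(sample)  # sample standard deviation, the n-minus-1 formula
n = len(sample)

# Same construct as the z-statistic, but with s computed from the data
t_stat = (x_bar - mu0) / (s / math.sqrt(n))
dof = n - 1                   # degrees of freedom for the t-distribution

print(t_stat, dof)
```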
326
00:34:46,369 --> 00:34:55,119
Now, a couple of things that are worth noting
is that, we spoke about how if you know the
327
00:34:55,119 --> 00:34:58,750
variance, you can use the z-distribution;
if you do not know the variance, you can use
328
00:34:58,750 --> 00:35:05,869
t-distribution. But, there is this exception.
We said if your sample size is large enough;
329
00:35:05,869 --> 00:35:15,220
then, you can technically use the z-distribution
and just compute the variance and consider
330
00:35:15,220 --> 00:35:22,780
the variance to be the truth; consider the
variance to be the sigma and go ahead. I personally
331
00:35:22,780 --> 00:35:28,720
find that a little confusing; I think that
is fine; that is what is there in texts;
332
00:35:28,720 --> 00:35:36,020
that is what people do and it is a reasonable
approximation. But, the point is you cannot
333
00:35:36,020 --> 00:35:40,640
go wrong with using the t-test when you do
not have the variance. So, even if you have
334
00:35:40,640 --> 00:35:47,780
a large enough sample size, the idea is that
the t-distribution becomes… It approximates
335
00:35:47,780 --> 00:35:54,880
z-distribution quite well when your sample
size is greater than 30. Therefore, for all practical
336
00:35:54,880 --> 00:36:01,619
purposes, the t-distribution looks exactly
like the z-distribution. But, to keep things
337
00:36:01,619 --> 00:36:06,950
really simple, you can just follow this simple
rule. If you know the
338
00:36:06,950 --> 00:36:13,900
variance, just use the z; if you do not know
the variance, just use the t. And, that should
339
00:36:13,900 --> 00:36:18,510
keep you clear.
The other thing to mention out here is that,
340
00:36:18,510 --> 00:36:24,329
this DOF or degrees of freedom – we have
mentioned that, out here we briefly spoke
341
00:36:24,329 --> 00:36:30,609
about that concept when we were talking again
about the standard deviation. Without going
342
00:36:30,609 --> 00:36:35,650
into too much detail into degrees of freedom
again, the simple thing to keep in mind is
343
00:36:35,650 --> 00:36:43,960
that, the t-distribution is not one distribution.
I mean it is one distribution, but in the
344
00:36:43,960 --> 00:36:49,500
sense that, the t-distribution – you can
think of the degrees of freedom as a parameter
345
00:36:49,500 --> 00:36:53,890
that goes with the t-distribution. So, just
like the normal distribution, if you say the
346
00:36:53,890 --> 00:36:59,040
normal distribution, you need to mention the
mean and the variance for you to have a…
347
00:36:59,040 --> 00:37:03,940
to actually draw the exact distribution or
to do some computation on it. It is no point
348
00:37:03,940 --> 00:37:09,240
coming to someone and saying how likely is
it to see a 1.2 in a normal distribution?
349
00:37:09,240 --> 00:37:12,470
That question does not make sense. Normal
distribution with what mean and what variance?
350
00:37:12,470 --> 00:37:18,290
And then, I can answer your question.
You can think of degrees of freedom in a similar
351
00:37:18,290 --> 00:37:24,640
light; which is that, the t-distribution itself
is not completely defined until I mention
352
00:37:24,640 --> 00:37:30,660
to you what the degrees of freedom are. So,
t-distribution with three degrees of… – with
353
00:37:30,660 --> 00:37:37,900
degrees of freedom equal to 3 looks different
from a t-distribution with degrees of freedom
354
00:37:37,900 --> 00:37:43,810
4. And, the core idea that you need to know
is that, the t-distribution has a mean of
355
00:37:43,810 --> 00:37:51,640
0. And, it looks very similar to the normal
distribution of mean 0 and standard deviation
356
00:37:51,640 --> 00:37:57,540
1. But, the exception is that as the degrees
of freedom keep increasing; so, when you go
357
00:37:57,540 --> 00:38:03,940
to degrees of freedom of 30 and greater; and,
at some point, it is practically the normal distribution.
358
00:38:03,940 --> 00:38:08,980
So, the t-distribution with a large enough
degrees of freedom is practically the normal
359
00:38:08,980 --> 00:38:13,930
distribution with mean 0 and standard deviation
1. But, as the degrees of freedom keep decreasing
360
00:38:13,930 --> 00:38:18,790
and come all the way down – the lowest degrees
of freedom you can have is 1. When it comes
361
00:38:18,790 --> 00:38:25,010
all the way down to degrees of freedom equal
to 1, you will find that it still looks a
362
00:38:25,010 --> 00:38:29,640
little bit like the normal distribution with
mean 0 and standard deviation 1; but, it is
363
00:38:29,640 --> 00:38:37,170
a little shorter – shorter in the center
and has fatter tails on the sides. And so,
364
00:38:37,170 --> 00:38:43,650
that is how it deviates from the normal distribution.
But, all that you need to know is that, the
365
00:38:43,650 --> 00:38:48,319
degrees of freedom get defined by the concept
n minus 1. So, number of data points minus
366
00:38:48,319 --> 00:38:52,280
1 tells you the degrees of freedom. And, once
you know the degrees of freedom, you know
367
00:38:52,280 --> 00:38:58,270
which t-distribution to look up in the tables.
So, you know how to draw the curve and then
368
00:38:58,270 --> 00:39:04,380
calculate probabilities from it. Again Excel
uses – Excel has some slightly nicer functions
369
00:39:04,380 --> 00:39:08,320
for it. So, if you are just interested in
looking at the left-hand side of the distribution,
370
00:39:08,320 --> 00:39:12,940
you just use T.DIST. If you
are interested in the right-hand
371
00:39:12,940 --> 00:39:18,900
side, you do T.DIST.RT; or, on both
sides, you do T.DIST.2T. So, you do
372
00:39:18,900 --> 00:39:23,290
not need to actually do the 1 minus and so
on that we were talking with the z-distribution.
373
00:39:23,290 --> 00:39:27,650
Excel already has some inbuilt functions to
just point to which side of the distribution
374
00:39:27,650 --> 00:39:33,680
you are interested.
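The shape claims above – shorter in the center, fatter in the tails, and converging to the normal as the degrees of freedom grow – can be checked numerically. This is a sketch using the standard textbook density formula for the t-distribution, written with math.gamma:

```python
import math

def normal_pdf(x):
    # Density of the normal distribution with mean 0 and standard deviation 1
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def t_pdf(x, dof):
    # Density of the t-distribution with the given degrees of freedom
    # (standard textbook formula, using math.gamma)
    c = math.gamma((dof + 1) / 2.0) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2.0))
    return c * (1.0 + x * x / dof) ** (-(dof + 1) / 2.0)

# Shorter in the center: the df = 1 density sits below the normal at 0
print(t_pdf(0.0, 1), normal_pdf(0.0))
# Fatter tails: out at x = 3, the df = 1 density sits above the normal
print(t_pdf(3.0, 1), normal_pdf(3.0))
# By df = 30 the two curves are already very close at the center
print(abs(t_pdf(0.0, 30) - normal_pdf(0.0)))
```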
So, we go now to the next. We are finished
375
00:39:33,680 --> 00:39:37,910
with the t-distribution; we go now
to the next test, which is the chi square
376
00:39:37,910 --> 00:39:44,050
test. And, the chi square test has a couple
of different types of tests. But, the one
377
00:39:44,050 --> 00:39:49,450
that we are interested now is the chi square
test for variance. I am using the words variance,
378
00:39:49,450 --> 00:39:53,550
standard deviation a little interchangeably;
one is just the square of the other. The test
379
00:39:53,550 --> 00:39:59,970
is ultimately one for variance. And, if you
are testing variance, you are essentially
380
00:39:59,970 --> 00:40:04,760
testing standard deviation. So, if it is easy
for you to think standard deviation, you can
381
00:40:04,760 --> 00:40:13,900
keep that in mind. And, an example for instance
of the chi square test is you are really interested
382
00:40:13,900 --> 00:40:18,040
in looking at a sample, but you are not interested
in making a statement about the population
383
00:40:18,040 --> 00:40:22,869
mean. You are instead interested in making
a statement about the population variance.
384
00:40:22,869 --> 00:40:27,220
So, you are interested in saying is the population…
Just like in this z-test and the t-test, you
385
00:40:27,220 --> 00:40:33,790
are interested in saying something like – is
the population mean equal to 4.8? Or, is the
386
00:40:33,790 --> 00:40:38,410
population mean less than 4.8?
Similarly, here you would be interested in
387
00:40:38,410 --> 00:40:43,820
saying things like – is the population variance
equal to 0.5, 0.3 – whatever number you
388
00:40:43,820 --> 00:40:47,859
have in mind. The important thing is you have
a number in mind and you are trying to see
389
00:40:47,859 --> 00:40:51,900
if the sample that you are taking… With
the sample that you are taking, can you say
390
00:40:51,900 --> 00:40:56,089
something about the population variance being
equal to this magical number that you have
391
00:40:56,089 --> 00:41:03,660
in your head. And, the mechanics of the test
is fairly straightforward. And, it is here
392
00:41:03,660 --> 00:41:09,460
the sigma naught is the hypothesized variance.
So, this is the number that you want to compare
393
00:41:09,460 --> 00:41:17,630
it to. This is the equivalent of the 4.8 that
was there for means. The s square is the sample
394
00:41:17,630 --> 00:41:25,160
variance that you compute from the
data, from the sample. You take that data
395
00:41:25,160 --> 00:41:31,030
set of the samples and you compute standard
deviation, you compute a variance from that.
396
00:41:31,030 --> 00:41:35,791
And, that is s square. And, the way you do
that, again to remind you, is this: the 1 by
397
00:41:35,791 --> 00:41:39,760
n minus 1 in the formula for the calculation
of standard deviation; you would be using
398
00:41:39,760 --> 00:41:47,970
that. And, that is how you calculate the test
statistic, which then gets compared to a
399
00:41:47,970 --> 00:41:57,210
chi square distribution with n minus 1 degrees
of freedom; just like in the first case, it
400
00:41:57,210 --> 00:42:03,300
got compared to z-distribution and this got
compared to… The t-statistic got compared
401
00:42:03,300 --> 00:42:11,170
to t-distribution. In the same way, the
chi square statistic gets compared to a
402
00:42:11,170 --> 00:42:17,500
chi square distribution; great.
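The chi square statistic just described can be sketched as below. The sample values are made up for illustration, and the hypothesized variance of 0.5 is one of the example numbers mentioned in the lecture; Python's statistics.variance already uses the n minus 1 formula:

```python
import statistics

# Hypothetical sample (made-up values for illustration)
# H0: population variance = 0.5 (an example number from the lecture)
sample = [4.1, 3.5, 4.6, 5.0, 4.4, 4.9, 4.2, 4.7]
var0 = 0.5  # hypothesized population variance (the sigma naught squared)

n = len(sample)
s_squared = statistics.variance(sample)   # sample variance, using the n-minus-1 idea

chi_sq_stat = (n - 1) * s_squared / var0  # gets compared to a chi square distribution
dof = n - 1                               # with n - 1 degrees of freedom

print(chi_sq_stat, dof)
```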
So, a couple of things to note:
403
00:42:17,500 --> 00:42:21,430
again chi square also uses the concept of
degrees of freedom; so, think of the degree
404
00:42:21,430 --> 00:42:27,040
of freedom as something that defines the exact
distribution you are interested in, because
405
00:42:27,040 --> 00:42:31,440
a chi square distribution with 3 degrees of
freedom is a different distribution than a
406
00:42:31,440 --> 00:42:38,300
chi square. It is a different density function.
It looks different. It has different mathematical
407
00:42:38,300 --> 00:42:46,160
properties than a chi square distribution
with 4 degrees of freedom, 5 degrees of freedom.
408
00:42:46,160 --> 00:42:54,210
So, the degrees of freedom help you define
the exact distribution and its parameters
409
00:42:54,210 --> 00:43:00,700
and its mean variance and so on. But, essentially,
that is what you would use. You would need
410
00:43:00,700 --> 00:43:04,400
to use the degrees of freedom and that
is also the same as before; it is n minus
411
00:43:04,400 --> 00:43:11,359
1. So, number of data points minus 1. And,
chi square… Excel out here just uses
412
00:43:11,359 --> 00:43:17,599
CHISQ.DIST; this is the left side and
CHISQ.DIST.RT is the right side.
413
00:43:17,599 --> 00:43:25,030
I do not see them having something for 2-tailed,
but I might be wrong. But, as long as you
414
00:43:25,030 --> 00:43:29,900
have these two, you can quite easily just
draw that graph in your head and figure out
415
00:43:29,900 --> 00:43:35,950
which side; if you are interested in a 2-tail
distribution, how you would compute that;
416
00:43:35,950 --> 00:43:40,980
great.
So, we finally come to our last single sample
417
00:43:40,980 --> 00:43:48,990
test, which is called the proportion z-test.
And, the idea here is you are testing something
418
00:43:48,990 --> 00:43:56,660
that is a proportion. So, you are testing
something like… If you are given… So,
419
00:43:56,660 --> 00:44:08,570
you are testing a hypothesis like less than
30 percent of the shoppers, who come to my
420
00:44:08,570 --> 00:44:16,650
online store are women. So, you can say again;
we can go the 2-tailed way or you can go the
421
00:44:16,650 --> 00:44:23,080
1-tailed way; you can say less than 30 percent;
you can say 30 percent of my shoppers in my
422
00:44:23,080 --> 00:44:29,930
online store are women or you can say greater
than 30 percent of the shoppers are women.
423
00:44:29,930 --> 00:44:38,810
The key out here is that, whatever sample
you collect to actually test this hypothesis,
424
00:44:38,810 --> 00:44:43,960
the hypothesis… So, let us fix on the hypothesis.
Let us say the hypothesis is less than or
425
00:44:43,960 --> 00:44:49,930
equal to 30 percent of the shoppers, who come
to my online store are women.
426
00:44:49,930 --> 00:44:54,940
The idea is like all hypothesis testing, you
will now… – you have this hypothesis;
427
00:44:54,940 --> 00:44:58,820
you will now go and collect a sample. The
sample in this particular example could be
428
00:44:58,820 --> 00:45:04,480
something like you actually give a survey
at the end of the purchase or something and
429
00:45:04,480 --> 00:45:10,090
people actually tell you – male or female.
So, you collect a sample. How does this sample
430
00:45:10,090 --> 00:45:14,320
look? The answer is that the sample unlike
in the previous examples, where you would
431
00:45:14,320 --> 00:45:19,190
have seen an actual number. So, in the previous
example, in the phosphate examples, you saw
432
00:45:19,190 --> 00:45:23,380
numbers like 4.1, 3.5 – these were actually
readings from blood tests. Here you are going
433
00:45:23,380 --> 00:45:27,270
to get something that is binary. The person
is going to say that either they are
434
00:45:27,270 --> 00:45:32,910
female or not. So, it is a series of 1s and
0s – very similar to the idea behind Bernoulli
435
00:45:32,910 --> 00:45:42,751
trials. And, what you are doing is you are
now looking at the sample data of 1s and 0s,
436
00:45:42,751 --> 00:45:46,170
which… and then answering the question of
whether… and then saying something about
437
00:45:46,170 --> 00:45:51,770
the hypothesis, which is less than 30 percent
of the people, who come to my shop are women.
438
00:45:51,770 --> 00:45:58,420
And, this has the same intuition as all the
other forms of inferential statistics, which
439
00:45:58,420 --> 00:46:10,369
is if for instance, you take 100 samples and
you know all 100 of them point to the shoppers
440
00:46:10,369 --> 00:46:15,770
being women; then, you are likely to reject
the idea that less than 30 percent of
441
00:46:15,770 --> 00:46:19,430
the shoppers who come to my online store
are women.
442
00:46:19,430 --> 00:46:25,250
But, the idea is if you notice something,
this looks in every way shape and form like
443
00:46:25,250 --> 00:46:32,400
the Bernoulli distribution. And, the fact
that you are counting how many. So, the Bernoulli
444
00:46:32,400 --> 00:46:36,270
distribution had to do with probability of
heads or tails. So, that would be probability
445
00:46:36,270 --> 00:46:41,400
of it being male or female. But, if you are
interested in, out of a sample of n people who
446
00:46:41,400 --> 00:46:46,869
arrive, how many of them? Say, out of a hundred
people who arrive, is it less than 30
447
00:46:46,869 --> 00:46:55,130
who are women? That ties us to the binomial
distribution. And, yet you do not see the
448
00:46:55,130 --> 00:47:02,349
binomial distribution being used in the test,
you instead see the same idea of z. So, you
449
00:47:02,349 --> 00:47:07,950
are again calculating a z-statistic with this
test and you are comparing to the z-distribution,
450
00:47:07,950 --> 00:47:17,820
which is normal – which is normal 0 comma
1 squared. And, the idea here is fairly simple.
451
00:47:17,820 --> 00:47:22,990
If you remember, we spoke about the binomial
approx… – we are approximating this binomial
452
00:47:22,990 --> 00:47:28,359
distribution to a normal distribution. That
is the right way to put it. And, that is what
453
00:47:28,359 --> 00:47:34,270
you are doing out here. And, it should also
be intuitive in that the p hat out here is
454
00:47:34,270 --> 00:47:38,910
a calculated proportion. So, you will take
a data set. Let us say you took 30 people
455
00:47:38,910 --> 00:47:44,700
or 40 people or 50 people who came to your
store and you actually found that, exactly
456
00:47:44,700 --> 00:47:49,819
23 of those 50 were females. So, that proportion
– the proportion that you get from
457
00:47:49,819 --> 00:47:55,589
the sample – is your p hat. p naught
is the hypothesized proportion. So, p naught
458
00:47:55,589 --> 00:48:05,309
would be the 30 percent. So, this would be
the sample quantity. And, this is the equivalent of mu
459
00:48:05,309 --> 00:48:11,619
naught. So, this is the population – proportion
– the number that you are hypothesizing;
460
00:48:11,619 --> 00:48:17,930
and, of course, n is the sample size.
And, if you look at it, this also looks very
461
00:48:17,930 --> 00:48:25,440
much like that x bar minus mu concept. And,
in some way, this formula out here in the
462
00:48:25,440 --> 00:48:31,540
denominator should remind you of the formula
that you saw for standard deviation in the
463
00:48:31,540 --> 00:48:36,609
binomial distribution class. So, you are doing
something very similar to the x bar minus
464
00:48:36,609 --> 00:48:42,430
mu by sigma over square root of n; it is
identical in construction to that. But, you are
465
00:48:42,430 --> 00:48:48,369
using the binomial distribution for calculating
things like the variance. But, you are also
466
00:48:48,369 --> 00:48:52,390
saying – hey, this, I believe,
can be approximated by the normal distribution.
467
00:48:52,390 --> 00:48:58,140
So, I am just going to use the z-distribution
to calculate my p values or I am going to
468
00:48:58,140 --> 00:49:03,599
use the z-tables.
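The proportion z-test just described can be sketched using the lecture's own numbers – 23 women out of 50 surveyed shoppers, against a hypothesized proportion of 30 percent:

```python
import math

def norm_cdf(z):
    # Area to the left of z under the standard normal curve
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# The lecture's example numbers: 23 of 50 surveyed shoppers were women
n = 50
p_hat = 23 / 50   # sample proportion, computed from the series of 1s and 0s
p0 = 0.30         # hypothesized population proportion (the 30 percent)

# Binomial mean and variance under H0, approximated by the normal distribution
z_stat = (p_hat - p0) / math.sqrt(p0 * (1.0 - p0) / n)

# H0 is "p is less than or equal to 0.30", so the evidence against it
# sits in the right-hand tail
p_value = 1.0 - norm_cdf(z_stat)
print(z_stat, p_value)
```

With these numbers the sample proportion of 46 percent sits well above 30 percent, and the small p-value would be grounds for rejecting the null hypothesis.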
Just in quick conclusion, just going back
469
00:49:03,599 --> 00:49:11,190
to the rubric that we created, I will just
clean this up. The final idea is that, you
470
00:49:11,190 --> 00:49:18,520
use these z-tables; and,
by z-tables, I mean like you can use the back
471
00:49:18,520 --> 00:49:24,800
of a statistics textbook, you can use Excel
or Matlab or R. Just important that you know
472
00:49:24,800 --> 00:49:31,829
how to do it with at least one of these software packages.
And, the core idea is that, if you get a low
473
00:49:31,829 --> 00:49:38,319
enough p value; all of these help you calculate
p value or a probability. And, if you get
474
00:49:38,319 --> 00:49:42,130
a low enough p value, you can use that as
grounds for rejecting the null hypothesis.
475
00:49:42,130 --> 00:49:49,810
And, I am saying I reject the hypothesis that
mu is less than or equal to 4.8. And
476
00:49:49,810 --> 00:49:55,710
also, just keep in mind – on the flip side,
you never say I accept the null hypothesis
477
00:49:55,710 --> 00:50:04,530
and you can only say that, I fail to reject
the null hypothesis. I hope that clarified
478
00:50:04,530 --> 00:50:11,710
things – gave you one illustrative example
of the single sample test and the idea behind
479
00:50:11,710 --> 00:50:19,540
the mechanics of it; and, introduced you to
the other tests. And, in the next class, what
480
00:50:19,540 --> 00:50:25,200
we are going to do is we are going to talk
about 2-sample tests and we are going to go
481
00:50:25,200 --> 00:50:29,549
also beyond that. We are going to talk about
the idea of having multiple samples.
482
00:50:29,549 --> 00:50:30,450
Thank you.