1
00:00:11,010 --> 00:00:15,100
Hello and welcome to our lecture on Inferential
Statistics.
2
00:00:15,100 --> 00:00:20,810
In this particular lecture, we are going to
give you some more motivation
3
00:00:20,810 --> 00:00:25,670
for why we need inferential statistics, and
we will talk a little bit about what
4
00:00:25,670 --> 00:00:32,930
kinds of problems you can solve using these
techniques and, you know, where you would apply them
5
00:00:32,930 --> 00:00:34,570
and when you would apply them.
6
00:00:34,570 --> 00:00:39,270
In the subsequent lectures, we will talk a
little bit about how you would apply them, and
7
00:00:39,270 --> 00:00:45,590
also, why you do some of the math that you
do, why you do some of the computations.
8
00:00:45,590 --> 00:00:49,690
So, understanding it a little bit more at
the nuts-and-bolts level is something that
9
00:00:49,690 --> 00:00:50,690
will follow.
10
00:00:50,690 --> 00:00:56,530
But, in this lecture we are going to focus
on why you need inferential statistics in
11
00:00:56,530 --> 00:01:02,190
the first place, and we are going to do this
through the lens of something called hypothesis
12
00:01:02,190 --> 00:01:03,190
testing.
13
00:01:03,190 --> 00:01:10,960
So, hypothesis testing is a very widely used
and accepted tool for a lot of data analysis,
14
00:01:10,960 --> 00:01:18,180
and if you understand inferential statistics
through hypothesis testing, many other concepts
15
00:01:18,180 --> 00:01:20,060
in inferential statistics just fall into place.
16
00:01:20,060 --> 00:01:24,970
So, things like confidence intervals and so
on, which you might have heard of, become fairly
17
00:01:24,970 --> 00:01:28,270
easy to understand and process.
18
00:01:28,270 --> 00:01:33,030
So, having said that, let me jump into the
subject.
19
00:01:33,030 --> 00:01:39,200
The idea behind inferential statistics is
to make some inference about the population
20
00:01:39,200 --> 00:01:41,740
from the sample.
21
00:01:41,740 --> 00:01:45,310
Just to jog your memories, I think we have
spoken about population and sample a couple
22
00:01:45,310 --> 00:01:46,640
of times.
23
00:01:46,640 --> 00:01:55,810
But, this is different from what you would
have done with descriptive statistics.
24
00:01:55,810 --> 00:01:59,580
With descriptive statistics you do not care
about population or sample; at the end of
25
00:01:59,580 --> 00:02:01,090
the day you have some data set.
26
00:02:01,090 --> 00:02:06,780
And for simplicity's sake, assume that was a
sample that you got from a population. With this
27
00:02:06,780 --> 00:02:13,300
sample, in descriptive statistics you would be
very content with describing that data set, with
28
00:02:13,300 --> 00:02:17,269
describing that sample, in case you have a
sample.
29
00:02:17,269 --> 00:02:20,790
But, here the kinds of problems that you are
more interested in with inferential statistics
30
00:02:20,790 --> 00:02:26,650
are ones where you have a population and you
are given just a sample from it.
31
00:02:26,650 --> 00:02:30,920
But, from this sample I do not want to just
say something about the sample, I do not want
32
00:02:30,920 --> 00:02:34,080
to talk about the mean of the sample, I do
not want to talk about the variance of the
33
00:02:34,080 --> 00:02:38,349
sample and I do not want to talk about the
centrality or its dispersion.
34
00:02:38,349 --> 00:02:41,849
My goal is to say something about the population.
35
00:02:41,849 --> 00:02:45,069
I only have data, which is the sample.
36
00:02:45,069 --> 00:02:49,769
I do not have the data associated with the
population, but with this sample can I say
37
00:02:49,769 --> 00:02:51,049
something about the population?
38
00:02:51,049 --> 00:02:55,739
So, that is the core idea. And just to
jog some of your memories, the idea behind
39
00:02:55,739 --> 00:03:00,599
populations and samples is fairly simple; we
looked at it through a couple of examples.
40
00:03:00,599 --> 00:03:04,920
But, you can think of the population in one
of two ways.
41
00:03:04,920 --> 00:03:10,560
The first and the more obvious way is that
there exists this really large data set associated
42
00:03:10,560 --> 00:03:11,560
with the phenomenon.
43
00:03:11,560 --> 00:03:19,879
So, let us say the phenomenon was the height
of all boys who are in the 10th standard in public
44
00:03:19,879 --> 00:03:20,879
school.
45
00:03:20,879 --> 00:03:24,340
So, that could be a very large
data set, and let us say that was for India.
46
00:03:24,340 --> 00:03:29,340
So, it is a very large data set; there are a lot
of children who are in the 10th standard,
47
00:03:29,340 --> 00:03:32,310
who are in public schools across India.
48
00:03:32,310 --> 00:03:36,900
And you can think of that as a population
and you want to say something about that population
49
00:03:36,900 --> 00:03:38,049
and you might not have the data.
50
00:03:38,049 --> 00:03:44,680
So, you go and take a sample: you select
5 schools or 10 schools, or you select 200
51
00:03:44,680 --> 00:03:48,849
students randomly through some selection process
and take that as a sample.
52
00:03:48,849 --> 00:03:54,719
So, you take a sample of, you know, some subset
of students from the population. That is one
53
00:03:54,719 --> 00:03:58,879
way of thinking of population and sample; another
way could be that the population itself is
54
00:03:58,879 --> 00:04:02,599
more of a theoretical, abstract concept.
55
00:04:02,599 --> 00:04:10,389
So, you might not have an actual data set,
but it could be something like: I have created
56
00:04:10,389 --> 00:04:17,260
this new machine, and this new machine is going
to start making certain products, and let us
57
00:04:17,260 --> 00:04:18,930
say you are very interested in the product's
dimensions.
58
00:04:18,930 --> 00:04:21,730
So, let us say, the diameter of a product this
machine makes.
59
00:04:21,730 --> 00:04:26,750
So, you put raw material into this machine,
the machine spits out some finished product
60
00:04:26,750 --> 00:04:31,930
and this finished product should have a diameter,
I mean has some diameter.
61
00:04:31,930 --> 00:04:39,970
Here, because it is a new machine,
you might not actually have a population;
62
00:04:39,970 --> 00:04:43,250
that is, a really large data set that exists
somewhere.
63
00:04:43,250 --> 00:04:48,850
The population here is the concept that, if
this machine were to create infinitely many such
64
00:04:48,850 --> 00:04:54,570
products without any change over time or space,
the dimensions of these products would be
65
00:04:54,570 --> 00:04:58,590
the population and ultimately you might say,
hey let us just for the first time run this
66
00:04:58,590 --> 00:05:03,660
machine and make 10 products, and these 10
products that I physically have and have measured,
67
00:05:03,660 --> 00:05:05,400
and so on, is the sample.
68
00:05:05,400 --> 00:05:08,710
So, here is the case, where I still want to
use the sample to say something about the
69
00:05:08,710 --> 00:05:10,380
machine in general.
70
00:05:10,380 --> 00:05:14,250
Not just the 10 products the machine has turned
out. The concept of the population here
71
00:05:14,250 --> 00:05:19,510
is not an actual finite large data set in
my hands; it is more of an abstract concept.
72
00:05:19,510 --> 00:05:26,810
Now, having revisited population and sample,
let us again see the statement, which
73
00:05:26,810 --> 00:05:30,990
is that in inferential statistics, you
want to say something about the population
74
00:05:30,990 --> 00:05:33,190
from the sample.
75
00:05:33,190 --> 00:05:40,520
So, as I said the major aim of this lecture
is to motivate you to see why inferential
76
00:05:40,520 --> 00:05:42,140
statistics is important.
77
00:05:42,140 --> 00:05:49,290
So, I felt that the best way to introduce
that might be to give you some examples.
78
00:05:49,290 --> 00:05:53,780
So, that is what we are going to do and I
will start with some simple examples.
79
00:05:53,780 --> 00:05:59,940
I have broken them down into one-sample and
two-sample examples, and pretty soon it
80
00:05:59,940 --> 00:06:02,120
will become clear what that distinction is.
81
00:06:02,120 --> 00:06:07,490
So, let us start with the first example which
is a one-sample example on the upper left
82
00:06:07,490 --> 00:06:09,550
of your screen.
83
00:06:09,550 --> 00:06:17,110
So, the idea here, just so you know
where we are, is that,
84
00:06:17,110 --> 00:06:27,940
let us say we were interested in noting the
average phosphate levels in our blood and
85
00:06:27,940 --> 00:06:33,240
I do not have a medical background or anything,
so do not look at the medical aspect of this
86
00:06:33,240 --> 00:06:34,240
example.
87
00:06:34,240 --> 00:06:40,990
But, let us say that your doctor, or doctors
in general, or you know public health advocates,
88
00:06:40,990 --> 00:06:47,650
say that the average phosphate level in blood
should be less than 4.8 milligrams per, I do
89
00:06:47,650 --> 00:06:49,520
not know, deciliter.
90
00:06:49,520 --> 00:06:55,140
So, again, irrespective of the units, the whole
idea is that this number, which you can get
91
00:06:55,140 --> 00:07:01,550
if you go measure your blood, should be on
average less than 4.8.
92
00:07:01,550 --> 00:07:06,630
The key here is the understanding that it
should be less than 4.8 on average.
93
00:07:06,630 --> 00:07:11,940
So, that means the doctors or the public
health advocates understand that sometimes
94
00:07:11,940 --> 00:07:18,700
it could be greater than 4.8 and that perhaps
in this particular case is not a cause for
95
00:07:18,700 --> 00:07:20,100
alarm.
96
00:07:20,100 --> 00:07:23,460
Again do not focus on the medical aspect,
I do not know whether it is or not, but this is the
97
00:07:23,460 --> 00:07:26,020
situation I am creating.
98
00:07:26,020 --> 00:07:30,370
But the important thing is, the doctors have
told you that on average it should be less
99
00:07:30,370 --> 00:07:31,410
than 4.8.
100
00:07:31,410 --> 00:07:37,300
So, let us say you agree, and you know there
is, obviously, variation.
101
00:07:37,300 --> 00:07:41,150
So, it really depends on what you ate that
day, it depends on what time of the day you
102
00:07:41,150 --> 00:07:45,870
take the measurement, it depends on what instrument
you used to take the measurement, it depends
103
00:07:45,870 --> 00:07:46,980
on how much water you had.
104
00:07:46,980 --> 00:07:53,500
So, let us say there are a lot of factors that
you do not seek to control and that is the
105
00:07:53,500 --> 00:07:54,890
whole idea behind this.
106
00:07:54,890 --> 00:08:00,150
But, you want to take a set of measurements
107
00:08:00,150 --> 00:08:04,560
and answer the question as to what the
average phosphate level in your blood is.
108
00:08:04,560 --> 00:08:09,520
In general, not just in the sample that you
have taken; I mean, the sample could
109
00:08:09,520 --> 00:08:15,800
be anything, but what you care about is: in general,
is my average phosphate level in blood less
110
00:08:15,800 --> 00:08:18,080
than 4.8?
111
00:08:18,080 --> 00:08:20,440
So, why you…
112
00:08:20,440 --> 00:08:25,520
So, the question might arise, you know: why
are you distinguishing between what the sample
113
00:08:25,520 --> 00:08:30,220
says and what reality is? That is going
to become clear in a second.
114
00:08:30,220 --> 00:08:32,979
So, let us first go about trying to answer
this question.
115
00:08:32,979 --> 00:08:40,740
So, the first thing is, if you took a set
of measurements and let us say, you got consistently
116
00:08:40,740 --> 00:08:41,740
very low values.
117
00:08:41,740 --> 00:08:52,840
So, let us say you got 2.4, 2.5, 2.1,
2.7, 2.3, 2.9, and fill in four more numbers in
118
00:08:52,840 --> 00:08:54,440
the two-point range.
119
00:08:54,440 --> 00:09:00,279
Then, I guess you really do not need a statistician,
you can look at the sample that you got and
120
00:09:00,279 --> 00:09:05,340
you can say, look I am fairly certain that
even if I went on taking more and more samples
121
00:09:05,340 --> 00:09:10,720
or if I woke up another day, threw away all
this data and took another sample, or if I
122
00:09:10,720 --> 00:09:16,430
took infinitely many samples, in either case
it looks like my average is going to be less
123
00:09:16,430 --> 00:09:20,930
than 4.8, that is fine, you know intuitively
that seems obvious.
124
00:09:20,930 --> 00:09:25,680
Similarly, on the flip side, suppose you went
to take this data and you consistently got
125
00:09:25,680 --> 00:09:35,931
5.5, 6.1, 7.7, 6.9 and so on, where every single
data point is significantly greater than that
126
00:09:35,931 --> 00:09:42,530
4.8 mark and approximately in the same region,
meaning it is not wildly moving around.
127
00:09:42,530 --> 00:09:48,670
So, that is also an intuition that you might
have, that if one second it is 6.5 and the
128
00:09:48,670 --> 00:09:51,340
next second it is 2.1, you know it is wildly
moving around.
129
00:09:51,340 --> 00:09:56,360
But, here I am giving you examples where it is
6.1, 6.7, 7.1, 7.3, 7.4.
130
00:09:56,360 --> 00:10:00,340
So, it is consistently, significantly greater;
again, you do not need a statistician.
131
00:10:00,340 --> 00:10:04,630
Just using your intuition, you say: look,
I mean, based off the sample I am willing to
132
00:10:04,630 --> 00:10:11,090
bet that my average phosphate levels are greater
than 4.8 mg per dl.
133
00:10:11,090 --> 00:10:14,390
But, then it gets a little tricky.
134
00:10:14,390 --> 00:10:21,950
What happens if you had, you know, numbers,
some of them below and some of them above?
135
00:10:21,950 --> 00:10:30,400
So, some of them are less than 4.8, some of
them are greater than 4.8; in some sense,
136
00:10:30,400 --> 00:10:36,010
what do you do then? The instinct, the intuition
there, is sometimes just to say: well, let me take
137
00:10:36,010 --> 00:10:38,570
an average of the sample.
138
00:10:38,570 --> 00:10:46,560
If that average is greater than 4.8, then
perhaps I should conclude that my average
139
00:10:46,560 --> 00:10:53,840
is greater than 4.8; and there is a problem
with that kind of an approach.
140
00:10:53,840 --> 00:11:07,880
I mean, assume that you got readings
such as 5.1, 4.8, 4.9, 4.7, 4.7, 5.3, and
141
00:11:07,880 --> 00:11:12,380
so on with some variation, and you took
the average and that average was 4.85.
142
00:11:12,380 --> 00:11:19,460
So, you say: you know what, the sample showed
that it is greater, and you conclude, for instance,
143
00:11:19,460 --> 00:11:25,390
that the average phosphate level in my blood
in general is greater than 4.8, based off
144
00:11:25,390 --> 00:11:27,860
of the sample that I just saw.
145
00:11:27,860 --> 00:11:32,290
The problem with that could show up if
you just took two more data points.
146
00:11:32,290 --> 00:11:36,480
Let us say you took two more data points,
you increased your sample by two more and
147
00:11:36,480 --> 00:11:43,400
you got a 4.6 and a 4.7 and all of a sudden,
because of these new data points, your average
148
00:11:43,400 --> 00:11:47,080
you know, slides to just below 4.8.
149
00:11:47,080 --> 00:11:52,020
Something about this process, where we just
take the average of the sample and make a
150
00:11:52,020 --> 00:11:56,640
conclusion, does not seem correct for this
reason.
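The fragility of this naive rule is easy to sketch numerically. The readings below are illustrative, not the lecturer's exact data:

```python
# Illustrative sketch (made-up readings): comparing a small sample's
# average against the 4.8 threshold can flip with just two more points.
readings = [5.1, 4.9, 4.7, 4.8, 5.0, 4.9]
mean_6 = sum(readings) / len(readings)   # 4.90, so "above 4.8"

readings += [4.2, 4.3]                   # take two more measurements
mean_8 = sum(readings) / len(readings)   # 4.7375, so "below 4.8"

print(mean_6 > 4.8, mean_8 > 4.8)        # True False
```

Two extra data points reverse the conclusion, which is exactly why just eyeballing the sample mean is not enough.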
151
00:11:56,640 --> 00:12:03,060
And you know, another way of looking at it
is through this notion
152
00:12:03,060 --> 00:12:05,600
of sampling distributions.
153
00:12:05,600 --> 00:12:14,980
So, let us take a look at the
same graph that we looked at at the end of the
154
00:12:14,980 --> 00:12:15,980
last class.
155
00:12:15,980 --> 00:12:22,300
So, you had this thing, which we described
as the original distribution.
156
00:12:22,300 --> 00:12:29,680
And let us say for now, let us just say for
now, that truly your…
157
00:12:29,680 --> 00:12:37,440
This distribution by the way represents the
amount of phosphate in mg per dl that you
158
00:12:37,440 --> 00:12:41,080
will see in your blood if you do a test at
any given point of time.
159
00:12:41,080 --> 00:12:45,980
So, you get numbers from this distribution,
so sometimes you get a 5.1; that is what,
160
00:12:45,980 --> 00:12:51,720
that is the dot there, sometimes you get a
4.4; that is the other dot there.
161
00:12:51,720 --> 00:12:59,740
But, at the end of the day, it looks like on
average, given how I have drawn it.
162
00:12:59,740 --> 00:13:07,900
On average it is only 4.7 mg per dl, which
is less than the 4.8 mark that we were interested in.
163
00:13:07,900 --> 00:13:13,160
Now, you go and take a sample; this is the
same example as last time, so you took the
164
00:13:13,160 --> 00:13:15,550
sample and you got some data points.
165
00:13:15,550 --> 00:13:19,880
So, you took 6 data points and this is what
you got.
166
00:13:19,880 --> 00:13:23,850
Now, if I want to just take the sample average,
and I am just eyeballing it here, if I just
167
00:13:23,850 --> 00:13:31,870
took the average of these numbers I would
say that average falls somewhere here,
168
00:13:31,870 --> 00:13:33,010
let us see.
169
00:13:33,010 --> 00:13:41,150
So, this might be the average of these numbers
or maybe it will fall right on this dot, actually.
170
00:13:41,150 --> 00:13:46,630
So, let us say the average is somewhere out
here, and the truth is this sample
171
00:13:46,630 --> 00:13:52,420
average is greater than the 4.7; and if you
wanted to just go by the sample average, we
172
00:13:52,420 --> 00:14:02,010
might have concluded that the amount in
mg per dl is greater than 4.8 or whatever.
173
00:14:02,010 --> 00:14:11,900
But, here is the good part: you have now had a class
in statistics that told you that the average
174
00:14:11,900 --> 00:14:19,070
that you get from the sample is not always
going to fall on this 4.7.
175
00:14:19,070 --> 00:14:21,399
As your number of samples tends to infinity?
176
00:14:21,399 --> 00:14:24,290
Yes, it will converge to this point.
177
00:14:24,290 --> 00:14:32,930
But, if you have got a finite number of samples,
then it is not going to always
178
00:14:32,930 --> 00:14:34,399
be exactly on 4.7.
179
00:14:34,399 --> 00:14:36,399
So, what is it going to be?
180
00:14:36,399 --> 00:14:39,500
What it is going to be is another distribution.
181
00:14:39,500 --> 00:14:46,830
So, if you had N samples, this distribution
will change with more samples; if there are
182
00:14:46,830 --> 00:14:53,190
many, many, many, many samples, then literally,
as the number N tends to infinity, this
183
00:14:53,190 --> 00:14:56,380
distribution will pretty much collapse flat onto
this line.
184
00:14:56,380 --> 00:15:03,040
But, if not, you are still getting the sample
mean; the mean that you calculate from the
185
00:15:03,040 --> 00:15:08,550
sample is literally a random variable that you
are getting from this distribution.
186
00:15:08,550 --> 00:15:12,930
So, it is literally like you just pick random
points from this distribution.
187
00:15:12,930 --> 00:15:19,630
So, you get a point anywhere here; you probably
will not get a point here, because it does
188
00:15:19,630 --> 00:15:20,630
not…
189
00:15:20,630 --> 00:15:28,830
The probability of getting this point from
this distribution is very low, it is almost
190
00:15:28,830 --> 00:15:29,830
0.
191
00:15:29,830 --> 00:15:37,180
So, you will be getting points from this distribution,
and as a result, just because this number
192
00:15:37,180 --> 00:15:44,300
is greater than 4.7 should not make you conclude
anything about this mean, which is what you are trying
193
00:15:44,300 --> 00:15:45,300
to make a conclusion about.
194
00:15:45,300 --> 00:15:49,600
You are trying to say something about this
line; you are trying to say something about
195
00:15:49,600 --> 00:15:55,649
this line, which is the mean of the population.
And getting a sample mean, which is
196
00:15:55,649 --> 00:16:01,779
nothing but a number from a distribution,
should not make you conclude that therefore
197
00:16:01,779 --> 00:16:03,720
it is greater than 4.8.
198
00:16:03,720 --> 00:16:10,610
So, that is why you need to do something more,
you need to do something more complex than
199
00:16:10,610 --> 00:16:16,380
just blindly taking the sample average.
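The sampling-distribution point can be simulated. A minimal sketch, assuming a hypothetical population with true mean 4.7 mg/dl and a spread of 0.5 (these numbers are mine, not from the lecture):

```python
import random

random.seed(0)

TRUE_MEAN, SPREAD, N = 4.7, 0.5, 6   # assumed population; sample size 6

def sample_mean():
    # the mean of one sample of N measurements drawn from the population
    return sum(random.gauss(TRUE_MEAN, SPREAD) for _ in range(N)) / N

# draw many samples and look at where their means land
means = [sample_mean() for _ in range(10_000)]
above = sum(m > 4.8 for m in means) / len(means)

# Even though the true mean is below 4.8, a sizeable fraction of the
# sample means lands above 4.8, so one sample mean above the threshold
# proves nothing about the population mean.
print(round(above, 2))
```

The sample mean really is a random draw from its own distribution, which is why a single draw landing above 4.8 is weak evidence on its own.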
200
00:16:16,380 --> 00:16:24,529
So, again, we are going to talk later about
how you do it, and what and when you do it;
201
00:16:24,529 --> 00:16:30,580
but right now I am just trying to motivate
for you why you need something else.
202
00:16:30,580 --> 00:16:37,930
So, let us take another example, and in this
example we take a problem of proportions.
203
00:16:37,930 --> 00:16:46,130
So, let us say the health department or some
dental body says: only 5 percent
204
00:16:46,130 --> 00:16:50,830
of the toothpastes of any given brand can
be out of specification.
205
00:16:50,830 --> 00:16:56,899
So, out of specification might mean that
you have some limits on the amount of fluoride a
206
00:16:56,899 --> 00:16:58,070
toothpaste can have.
207
00:16:58,070 --> 00:17:04,640
So, let us say you are allowed only 1000 parts
per million of fluoride, and there are
208
00:17:04,640 --> 00:17:06,539
other chemical limits.
209
00:17:06,539 --> 00:17:14,730
But, the health department understands that
not every toothpaste can match exactly the
210
00:17:14,730 --> 00:17:15,730
ideal requirements.
211
00:17:15,730 --> 00:17:21,070
So, let us say they set out limits on the chemicals
and they say: look, if you are a toothpaste
212
00:17:21,070 --> 00:17:26,030
manufacturer, I am going to allow only
5 percent of your toothpastes to, you know,
213
00:17:26,030 --> 00:17:31,920
be out of specification; 95 percent of your
toothpastes that I see in the market need
214
00:17:31,920 --> 00:17:34,650
to fall within my guidelines.
215
00:17:34,650 --> 00:17:38,480
Same problem comes up again.
216
00:17:38,480 --> 00:17:49,679
You can take a sample of 10, 20 toothpastes
and it could very well be that truly this
217
00:17:49,679 --> 00:17:56,970
toothpaste brand is involved in a chemical
process or manufacturing process that creates
218
00:17:56,970 --> 00:17:59,520
on average only 4 percent.
219
00:17:59,520 --> 00:18:06,260
Only 4 percent of the toothpaste that this
company makes are actually out of specification.
220
00:18:06,260 --> 00:18:13,840
But, it is perfectly possible that you went
and took 10 toothpastes from the market and,
221
00:18:13,840 --> 00:18:26,280
as your luck would have it, 7 of those 10 are
out of specification; that is perfectly possible.
222
00:18:26,280 --> 00:18:29,260
It might not be the most probable thing,
but it is perfectly possible.
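How unlucky would that be? A quick binomial check, my own sketch using the example's 4 percent true rate, a sample of 5 tubes, and at least 3 found out of specification:

```python
from math import comb

p, n = 0.04, 5   # true out-of-spec rate and sample size from the example

# P(at least 3 of the 5 sampled tubes are out of specification),
# summing the binomial probabilities for k = 3, 4, 5
p_at_least_3 = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(3, n + 1))

# Rare (well under 0.1 percent) but not impossible, which is exactly
# why a single small sample cannot settle the question by itself.
print(p_at_least_3)
```

So the unlucky sample is improbable but has nonzero probability, and hypothesis testing is precisely the machinery for weighing how improbable is improbable enough.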
223
00:18:29,260 --> 00:18:34,860
It is possible that this toothpaste company
is involved in a manufacturing process and
224
00:18:34,860 --> 00:18:41,740
chemical process that creates toothpastes
and on average, 4 percent of all the toothpastes
225
00:18:41,740 --> 00:18:42,740
they make.
226
00:18:42,740 --> 00:18:47,080
So, when I say toothpaste, think of it as
a toothpaste tube; on average 4 percent of
227
00:18:47,080 --> 00:18:49,220
all the tubes they make
228
00:18:49,220 --> 00:18:55,660
have a chemical composition that is not acceptable,
which is fine, because the health department
229
00:18:55,660 --> 00:18:59,950
says you cannot go more than 5 percent, and
this toothpaste company is raising its hand
230
00:18:59,950 --> 00:19:03,820
and saying: hey, you know, we are at only 4 percent.
231
00:19:03,820 --> 00:19:12,520
But, I now go and randomly sample 5 toothpastes,
10 toothpastes and I find that out of the
232
00:19:12,520 --> 00:19:15,920
5 toothpastes that I randomly sampled, 3 of
them are defective.
233
00:19:15,920 --> 00:19:25,040
All of a sudden, I am saying 3 out of 5, that
is 60 percent; you say only 5 percent is allowed
234
00:19:25,040 --> 00:19:27,360
and I find 60 percent.
235
00:19:27,360 --> 00:19:32,900
And so, does that mean the company is not
producing toothpastes in conformance at a less
236
00:19:32,900 --> 00:19:37,910
than 5 percent out-of-specification rate?
Probably no.
237
00:19:37,910 --> 00:19:43,470
Again, you need a little bit more nuanced thinking,
and you need a little more statistics, to actually
238
00:19:43,470 --> 00:19:49,740
answer this question based off of the sample;
you cannot just take the sample average.
239
00:19:49,740 --> 00:19:55,730
Another example, the third example that we
have in the one-sample cases: imagine that
240
00:19:55,730 --> 00:20:01,200
you are in an insurance company and you find
that there is this particular mechanic shop,
241
00:20:01,200 --> 00:20:08,850
a new garage, which does repairs; and
most people are required to have insurance.
242
00:20:08,850 --> 00:20:16,950
Once the garage writes out an invoice,
people who file the insurance claim attach
243
00:20:16,950 --> 00:20:25,850
this mechanic's invoice and tell the company
to reimburse them for this repair that
244
00:20:25,850 --> 00:20:26,860
is claimed for the car.
245
00:20:26,860 --> 00:20:31,700
And let us say, this is a new garage and you
know the insurance company is suspecting that
246
00:20:31,700 --> 00:20:38,850
these guys are cheating, that they are set up
as a place to not do any real work, but just
247
00:20:38,850 --> 00:20:40,270
write really high invoices.
248
00:20:40,270 --> 00:20:44,350
So, they cheat the insurance company; they are
involved in some kind of a fraud.
249
00:20:44,350 --> 00:20:50,870
So, one thing that the insurance company can
do is say: I am going to look at the next
250
00:20:50,870 --> 00:20:53,320
10, 20 or whatever repairs.
251
00:20:53,320 --> 00:20:59,491
So, let us say up to this point this garage
has not made a single, you know, claim; it
252
00:20:59,491 --> 00:21:05,669
is just being set up, but the inside word
is that they are trying to cheat the system.
253
00:21:05,669 --> 00:21:09,040
So, once the garage gets set up, this insurance
company is ready.
254
00:21:09,040 --> 00:21:15,679
The first 10 claims or 20 claims or 30 claims
that this garage files, they take those claims
255
00:21:15,679 --> 00:21:23,070
and they see how that compares to the national
average in terms of average claims.
256
00:21:23,070 --> 00:21:28,701
Again, the problem is: just because this sample
average is greater than the national average, can
257
00:21:28,701 --> 00:21:33,370
we conclude, yes, these guys are cheaters? And
just because the sample average is less than
258
00:21:33,370 --> 00:21:37,200
the national average, can we conclude these
guys are not cheaters? The answer to both
259
00:21:37,200 --> 00:21:39,700
those questions is no.
260
00:21:39,700 --> 00:21:45,990
In some cases, like the cases I was just talking
to you about, it might be brutally obvious,
261
00:21:45,990 --> 00:21:50,770
where every single data point is so high or
so low that you are like: I do not need a
262
00:21:50,770 --> 00:21:54,320
statistician to tell me the answer to
this question.
263
00:21:54,320 --> 00:22:02,180
But, when that is not the case, you need a little
more; you need inferential statistics to answer
264
00:22:02,180 --> 00:22:03,760
that question.
265
00:22:03,760 --> 00:22:08,880
So, we have looked at the single-sample cases;
by that we essentially mean that there was
266
00:22:08,880 --> 00:22:13,840
one data set and you are essentially comparing
that data set to some benchmark number that
267
00:22:13,840 --> 00:22:15,850
you had in your head.
268
00:22:15,850 --> 00:22:24,390
Let us now move to two-sample situations. The
example here is: let us say that I am running
269
00:22:24,390 --> 00:22:32,760
a foundry, and some guy, a consultant, comes in
and says: you know, if you change the temperature
270
00:22:32,760 --> 00:22:37,640
a little bit, of the molten metal that you are
pouring in, whatever.
271
00:22:37,640 --> 00:22:44,510
Let us say: change the temperature down by
a degree or two, and I assure you that the average
272
00:22:44,510 --> 00:22:49,660
number of defects that you see in your casts,
in metal castings, will decrease.
273
00:22:49,660 --> 00:22:56,380
So, it is like a mechanical engineering application;
you think maybe the consultant knows what
274
00:22:56,380 --> 00:23:03,929
he is talking about, but how do I test it, how do I
test it? And the answer to that question is
275
00:23:03,929 --> 00:23:10,190
that you can do the following: before
you do that, before you go change things, you
276
00:23:10,190 --> 00:23:16,960
can measure the average number of defects
in your casts, and you do that for 10, 20 data
277
00:23:16,960 --> 00:23:18,440
points.
278
00:23:18,440 --> 00:23:24,049
And then, you do what the consultant said, which
is change the temperature, and then you collect
279
00:23:24,049 --> 00:23:25,800
another 10, 20 data points.
280
00:23:25,800 --> 00:23:32,820
Now, let us go back to the question of the
average of the first sample. So
281
00:23:32,820 --> 00:23:37,200
now we have sample A and sample B: sample A corresponds
to before the temperature was changed, sample
282
00:23:37,200 --> 00:23:39,740
B corresponds to after the temperature was changed.
283
00:23:39,740 --> 00:23:45,100
Now, obviously, in all likelihood
there is going to be some average for sample
284
00:23:45,100 --> 00:23:47,220
A and there will be some average for sample
B.
285
00:23:47,220 --> 00:23:50,880
In all likelihood these two are not going
to be the same; one is going to be higher or
286
00:23:50,880 --> 00:23:54,990
lower than the other, just like in the single
sample case. Now, suppose these samples are
287
00:23:54,990 --> 00:23:59,660
dramatically different; and when I say dramatically
different, I mean dramatically different
288
00:23:59,660 --> 00:24:04,280
with respect to the amount of variability
there is in the two samples as well.
289
00:24:04,280 --> 00:24:08,500
Then, you do not need a statistician to tell
you; it is, like, so obvious that changing the
290
00:24:08,500 --> 00:24:11,740
temperature dramatically reduced the number
of defects.
291
00:24:11,740 --> 00:24:19,280
But, in many cases you do not know, and in
those cases it is not obvious to say the average
292
00:24:19,280 --> 00:24:28,260
of sample A was different from the average
of sample B. I mean, let me go back (I am going
293
00:24:28,260 --> 00:24:32,049
to erase this, erase the red ink)
and go back to this example.
294
00:24:32,049 --> 00:24:40,410
Let us say that this is the
original distribution of the number of defects.
295
00:24:40,410 --> 00:24:46,809
So this is the original distribution, and let
us say this is 3 defects and this is like
296
00:24:46,809 --> 00:24:54,679
9 defects; no, that is too high, let us say
this is 5 defects and this is 1 defect.
297
00:24:54,679 --> 00:24:59,200
So, this original distribution is what I care
about; it goes from here to here, these are the
298
00:24:59,200 --> 00:25:02,100
numbers of defects you will see in a cast.
299
00:25:02,100 --> 00:25:12,179
Now, obviously, if I take a sample of 6 or
so, I get a random point; let us say this
300
00:25:12,179 --> 00:25:19,690
is the random point I get, and let me erase the
other one; done, erased.
301
00:25:19,690 --> 00:25:28,590
So, basically, I took the original distribution,
I took a sample of 6 casts, and when I take
302
00:25:28,590 --> 00:25:34,039
a sample of 6 casts, like we discussed, the
average is not going to always be exactly
303
00:25:34,039 --> 00:25:39,059
3; it is going to be some number that falls into
this distribution, the distribution of sample
304
00:25:39,059 --> 00:25:45,030
means, and I have drawn a random number that I got
out here from that distribution.
305
00:25:45,030 --> 00:25:50,940
So, good; now let us say this consultant, who
is telling you to go increase or decrease
306
00:25:50,940 --> 00:25:52,450
the temperature was completely wrong.
307
00:25:52,450 --> 00:25:58,870
Let us say he had no clue, what he will say
he was just lying, but fortunately, unfortunately
308
00:25:58,870 --> 00:26:02,789
whatever he said does not make a difference.
I mean, he lied in that, you know, he said it is
309
00:26:02,789 --> 00:26:06,570
going to improve the process and it did not improve
the process; luckily, it did not make the process
310
00:26:06,570 --> 00:26:07,980
worse.
311
00:26:07,980 --> 00:26:14,429
Now, because the process has not
changed, the original distribution has not changed;
312
00:26:14,429 --> 00:26:19,350
after the temperature is changed, the number
of defects you are going to receive is also exactly
313
00:26:19,350 --> 00:26:24,200
in conformance with this distribution; it is
in conformance with this outside distribution,
314
00:26:24,200 --> 00:26:25,630
the original distribution.
315
00:26:25,630 --> 00:26:34,070
So, that has not changed; essentially, this
mean has not changed and this mean has not changed.
316
00:26:34,070 --> 00:26:40,160
Now, you go take a sample of six; you get
another sample average, and this sample average
317
00:26:40,160 --> 00:26:47,790
again is going to belong to this distribution,
and let us say this sample average was this
318
00:26:47,790 --> 00:26:49,559
value.
319
00:26:49,559 --> 00:27:01,280
So, this is the new sample average; this
is new, that is new. Now, you cannot say,
320
00:27:01,280 --> 00:27:05,060
hey, this new sample average is higher than
the old sample average,
321
00:27:05,060 --> 00:27:11,169
therefore the population mean is different.
It is not; I mean, I just gave you an example
322
00:27:11,169 --> 00:27:15,650
where the population mean truly did not change,
but where you could have seen two different
323
00:27:15,650 --> 00:27:19,799
sample averages and concluded that one is greater
than the other.
324
00:27:19,799 --> 00:27:25,201
So, this is why, again, you need more than looking
at the sample average of A and the sample average of
325
00:27:25,201 --> 00:27:27,780
B and saying, hey, one is higher than the other.
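To make this concrete, here is a small illustrative sketch in Python. All the numbers are invented: a defect process with a true mean of 3 is simulated, nothing changes between the two samples, and yet the two sample averages of 6 come out different.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

# Hypothetical unchanged population: defect counts with true mean 3
# (modelled here as a normal with sd 1.5, purely for illustration).
def sample_mean(n, true_mean=3.0, sd=1.5):
    # Average of n draws from the same population.
    return statistics.mean(random.gauss(true_mean, sd) for _ in range(n))

mean_before = sample_mean(6)  # sample taken "before the consultant's change"
mean_after = sample_mean(6)   # sample taken "after"; population is identical

# The two sample averages differ even though nothing changed.
print(mean_before, mean_after)
```

Whichever of the two comes out higher does so purely by sampling luck, which is exactly why comparing the two raw averages proves nothing and a formal test is needed.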
326
00:27:27,780 --> 00:27:33,740
So, should we believe that what the consultant
said was correct? Or the other way around: you
327
00:27:33,740 --> 00:27:38,530
know, if you conclude that what the
consultant said definitely made things
328
00:27:38,530 --> 00:27:43,809
worse, that is also not correct; it is perfectly
possible that there was truly a change, but
329
00:27:43,809 --> 00:27:47,400
because of luck, you know, you saw things the other
way.
330
00:27:47,400 --> 00:27:52,410
Now, another example you can think of, and
this next example is one where I wanted
331
00:27:52,410 --> 00:27:57,160
to emphasize variation rather than just
the mean: let us say you have two different
332
00:27:57,160 --> 00:28:03,929
manufacturing processes and you want to compare
the variance of the finished product in
333
00:28:03,929 --> 00:28:05,000
each batch.
334
00:28:05,000 --> 00:28:11,809
So, you have manufacturing process A and manufacturing
process B, and they make batches of, you know,
335
00:28:11,809 --> 00:28:18,820
finished material, finished product, and within
each batch there is some amount of variance between
336
00:28:18,820 --> 00:28:22,919
the products, some amount of variance,
right? Variance is the inherent variation from
337
00:28:22,919 --> 00:28:23,919
one part to the next.
338
00:28:23,919 --> 00:28:28,010
And let us say I care about that variance;
let us say, in a particular batch,
339
00:28:28,010 --> 00:28:32,430
I do not want one product to look very different
from the next product; I do not care about
340
00:28:32,430 --> 00:28:33,430
the mean.
341
00:28:33,430 --> 00:28:39,270
But I want them all to be consistent. Again,
you would use the same concept: you would take
342
00:28:39,270 --> 00:28:44,080
the first batch, which is made by machine
A, and calculate the variance of it, the
343
00:28:44,080 --> 00:28:50,120
variance of that sample; take the second batch,
made by machine B, and calculate the variance
344
00:28:50,120 --> 00:28:51,370
of that sample.
345
00:28:51,370 --> 00:28:57,360
And again, the same overall concept applies: you cannot
just say the sample variance of machine
346
00:28:57,360 --> 00:29:02,340
A is lower than the sample variance of machine B
and therefore conclude that machine A is better
347
00:29:02,340 --> 00:29:07,120
than machine B; it is perfectly possible that
machine A is actually worse than machine B. But,
348
00:29:07,120 --> 00:29:14,730
because you ultimately only have a sample,
machine A may simply have gotten lucky. Ultimately, it goes
349
00:29:14,730 --> 00:29:17,420
back to this idea that you see in this distribution.
350
00:29:17,420 --> 00:29:23,440
But sometimes you get a number on this side
of the distribution, sometimes you get a number
351
00:29:23,440 --> 00:29:25,370
on this side of the distribution.
352
00:29:25,370 --> 00:29:31,040
The more data points you take, the more the overall
variance of this distribution reduces, which
353
00:29:31,040 --> 00:29:37,200
is why, if you had an infinite number of points,
none of these problems would exist, but
354
00:29:37,200 --> 00:29:38,200
you do not.
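The variance version of the same trap can be sketched in Python as well. The numbers here are invented: both hypothetical machines have the same true spread (sd of 2.0 around a target of 10.0), yet the two batch variances still come out unequal.

```python
import random
import statistics

random.seed(1)  # fixed seed for reproducibility

# Both hypothetical machines have the SAME true spread (sd = 2.0 around a
# target of 10.0), so any gap between batch variances is sampling luck.
true_sd = 2.0
batch_a = [random.gauss(10.0, true_sd) for _ in range(8)]  # batch from machine A
batch_b = [random.gauss(10.0, true_sd) for _ in range(8)]  # batch from machine B

var_a = statistics.variance(batch_a)  # sample variance of batch A
var_b = statistics.variance(batch_b)  # sample variance of batch B
print(var_a, var_b)
```

Whichever sample variance comes out smaller, declaring that machine better would be exactly the mistake described above; a formal test on the two sample variances (such as the F test mentioned later) is the inferential tool for this comparison.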
355
00:29:38,200 --> 00:29:44,549
And if you are dealing with that, then the thing
to do is to use inferential statistics,
356
00:29:44,549 --> 00:29:47,870
to look closely at the data.
357
00:29:47,870 --> 00:29:53,059
Another example that is often quoted is something
like: are tenth standard girls taller than
358
00:29:53,059 --> 00:29:59,200
tenth standard boys in India? For instance,
we all know that, in terms of average heights,
359
00:29:59,200 --> 00:30:04,480
men are taller on average than women,
but we have also heard that girls start growing
360
00:30:04,480 --> 00:30:05,770
taller earlier.
361
00:30:05,770 --> 00:30:11,510
So, I do not know if 10th standard is the breakeven
point; at least some statistics textbooks
362
00:30:11,510 --> 00:30:12,921
seem to think so.
363
00:30:12,921 --> 00:30:16,970
So, you could have a simple question like
this: the population here is 10th standard girls
364
00:30:16,970 --> 00:30:21,890
in India and tenth standard boys in India; you
are ultimately taking a sample, and based
365
00:30:21,890 --> 00:30:24,730
on the sample, what can you say about the
population?
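The kind of computation that eventually answers such a question can be sketched as follows. The height samples are invented for illustration, and the test statistic used here (a Welch-style two-sample t statistic) is one of the tests named later in this lecture:

```python
import math
import random
import statistics

random.seed(2)  # fixed seed for reproducibility

# Invented samples of 10th standard heights in cm, 30 students per group.
girls = [random.gauss(155.0, 6.0) for _ in range(30)]
boys = [random.gauss(154.0, 7.0) for _ in range(30)]

# Welch's two-sample t statistic: the difference of the sample averages,
# scaled by its estimated standard error, not the raw difference alone.
diff = statistics.mean(girls) - statistics.mean(boys)
se = math.sqrt(statistics.variance(girls) / len(girls)
               + statistics.variance(boys) / len(boys))
t_stat = diff / se
print(t_stat)
```

The point is that the sign of diff alone settles nothing; only when t_stat is large in magnitude relative to its reference distribution can you say something concrete about the populations.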
366
00:30:24,730 --> 00:30:31,920
And, you know, sometimes the story is obvious
from the sample itself; sometimes you need
367
00:30:31,920 --> 00:30:36,850
inferential statistics to come in and tell
you whether you can say something concrete
368
00:30:36,850 --> 00:30:41,250
or perhaps you cannot say anything concrete,
and that is also something inferential statistics
369
00:30:41,250 --> 00:30:42,590
will tell you.
370
00:30:42,590 --> 00:30:47,130
But, ultimately it is not as simple as just
saying the average of sample A is greater
371
00:30:47,130 --> 00:30:54,130
than the average of sample B, therefore I
am going to conclude one way or the other.
372
00:30:54,130 --> 00:31:02,289
So, let us go to what it is that we will be
trying to do with inferential statistics. I
373
00:31:02,289 --> 00:31:06,840
am just going to go through the overarching
principle; keep this in mind, as we will revisit
374
00:31:06,840 --> 00:31:08,679
this slide a couple of times.
375
00:31:08,679 --> 00:31:14,692
But I think the ultimate test, in some sense,
of your understanding of what
376
00:31:14,692 --> 00:31:24,159
it is, how it is done, and why it is done will really
become clear once we do the actual math with
377
00:31:24,159 --> 00:31:26,299
each test.
378
00:31:26,299 --> 00:31:29,940
And ultimately these will come in the form
of tests; I mean, some of you might have heard
379
00:31:29,940 --> 00:31:35,890
of these tests, like the t test, z test, chi square
test, F test, ANOVA and so on, and we will
380
00:31:35,890 --> 00:31:37,610
go through each of these tests.
381
00:31:37,610 --> 00:31:44,700
But here is the overarching principle: the idea
with respect to hypothesis testing is to
382
00:31:44,700 --> 00:31:47,880
have a null hypothesis and an alternate hypothesis.
383
00:31:47,880 --> 00:31:53,680
So, for instance, in the phosphate case the
null hypothesis could very well be, let me
384
00:31:53,680 --> 00:32:01,430
erase this, the null hypothesis could very
well be that you have less than 4.8 mg per
385
00:32:01,430 --> 00:32:02,430
dl.
386
00:32:02,430 --> 00:32:07,320
So, the null hypothesis is that the average
phosphate
387
00:32:07,320 --> 00:32:12,370
level in the blood for person x is less than
or equal to 4.8.
388
00:32:12,370 --> 00:32:19,059
And so the alternate hypothesis would be
that it is greater than 4.8. The important thing
389
00:32:19,059 --> 00:32:25,679
is that the null hypothesis and the alternate hypothesis
together should, in some sense, be mutually
390
00:32:25,679 --> 00:32:32,159
exclusive, which means that if it is greater
than 4.8, then it cannot be less than or equal
391
00:32:32,159 --> 00:32:37,159
to 4.8 and vice versa; and, you know, collectively
exhaustive: they should essentially cover
392
00:32:37,159 --> 00:32:40,990
the entire space that you are interested in
talking about.
393
00:32:40,990 --> 00:32:45,559
So, I mean, meaning that the average phosphate
level is either less than or equal to 4.8
394
00:32:45,559 --> 00:32:48,130
or greater than 4.8; it cannot be neither.
395
00:32:48,130 --> 00:32:55,820
So, it is collectively exhaustive. So then, what
we will be doing, and we have not talked about
396
00:32:55,820 --> 00:33:01,960
this yet, but you will be doing some basic
calculations or arithmetic on the data to
397
00:33:01,960 --> 00:33:04,130
create a single number called the test statistic.
398
00:33:04,130 --> 00:33:10,110
So, you do not know what that is yet, but
the reason I am
399
00:33:10,110 --> 00:33:13,580
explaining this to you is to give you an idea
that you are going to be working with the
400
00:33:13,580 --> 00:33:14,580
sample.
401
00:33:14,580 --> 00:33:17,460
So, it is not like magic; you are
not going to say something about the population
402
00:33:17,460 --> 00:33:19,190
without dealing with the sample.
403
00:33:19,190 --> 00:33:25,190
So, you will be doing some math, you know, and
some of that might involve taking things
404
00:33:25,190 --> 00:33:31,070
like the sample mean and sample standard deviation;
you will be doing some math on that, and
405
00:33:31,070 --> 00:33:36,140
when you finish with that math you will get
something called the test statistic. Some of
406
00:33:36,140 --> 00:33:40,140
you might have heard of these test statistics;
it may be called the z statistic or the t
407
00:33:40,140 --> 00:33:43,840
statistic and so on.
408
00:33:43,840 --> 00:33:50,070
The crux of the hypothesis test is that
we assume the
409
00:33:50,070 --> 00:33:52,740
null hypothesis to be true.
410
00:33:52,740 --> 00:33:56,480
And we make some assumptions about the distributions
of various variables, which we won't
411
00:33:56,480 --> 00:34:01,540
go into that much. But if we assume the null
hypothesis to be true,
412
00:34:01,540 --> 00:34:08,079
then, technically, the test statistic should
be no different from drawing a random number
413
00:34:08,079 --> 00:34:11,200
from a specific probability distribution.
414
00:34:11,200 --> 00:34:19,810
So, in some sense, what we are saying is: if the
null hypothesis is true, if the true
415
00:34:19,810 --> 00:34:30,520
mean is equal to 4.8, 4.8 this
time, because the null hypothesis is true;
416
00:34:30,520 --> 00:34:34,810
the null hypothesis was, is it less than or
equal to 4.8; so here the null hypothesis
417
00:34:34,810 --> 00:34:35,810
is true.
418
00:34:35,810 --> 00:34:39,540
So, let us take the extreme case: the true
mean is 4.8, and let us say there are some
419
00:34:39,540 --> 00:34:45,750
assumptions, like that this distribution is
normal, maybe that the samples
420
00:34:45,750 --> 00:34:49,119
that you are taking are independent of each
other, some set of assumptions that you have
421
00:34:49,119 --> 00:34:50,119
to make.
422
00:34:50,119 --> 00:34:56,140
If all of that is true, then we want to do
certain things such that you will get another
423
00:34:56,140 --> 00:35:01,589
new distribution; you will be doing some math
with these data points.
424
00:35:01,589 --> 00:35:04,630
See, these data points that you got from the
sample, you will be doing some math with those
425
00:35:04,630 --> 00:35:11,040
data points; you will do that math such
that the test statistic would be no different
426
00:35:11,040 --> 00:35:16,900
than a single random number that you draw
from a very specific probability distribution.
427
00:35:16,900 --> 00:35:24,030
If that is the case, then you test the probability
that the test statistic you calculated belongs
428
00:35:24,030 --> 00:35:34,050
to this theoretical distribution. You basically
say: hey, it looks to me like, if the null hypothesis
429
00:35:34,050 --> 00:35:40,360
is true, then whatever I have calculated over
here with the sample should be like drawing
430
00:35:40,360 --> 00:35:43,150
a random number from this distribution.
431
00:35:43,150 --> 00:35:49,680
So, let me calculate the actual test statistic
and see how likely it is that this number
432
00:35:49,680 --> 00:35:55,540
came from that probability distribution;
that is what we call the P value.
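For the phosphate example above, here is a minimal sketch of this whole procedure, with all the sample numbers invented. A z statistic is used, which assumes normality and the kind of assumptions just mentioned; the standard normal tail probability gives the one-sided P value.

```python
import math

# H0: true average phosphate level <= 4.8; H1: it is greater than 4.8.
mu0 = 4.8          # boundary value under the null hypothesis
xbar = 5.1         # hypothetical sample average
s = 0.6            # hypothetical sample standard deviation
n = 25             # hypothetical sample size

# Test statistic: under H0 (at the boundary mean of 4.8) this should
# behave like a single draw from the standard normal distribution.
z = (xbar - mu0) / (s / math.sqrt(n))

# One-sided P value: probability that a standard normal draw is at least z,
# computed from the normal CDF via the error function.
p_value = 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))
print(z, p_value)  # z = 2.5, P value about 0.006
```

With these invented numbers, z comes out to 2.5 and the P value is about 0.006: seeing such a sample average would be quite unlikely if the null hypothesis were true, which is exactly the ground for rejecting it described next.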
433
00:35:55,540 --> 00:36:04,800
Now, once you have done this process, you might
say: look, if the null hypothesis is true, then
434
00:36:04,800 --> 00:36:09,050
the test statistic that I calculated
should have come from this distribution.
435
00:36:09,050 --> 00:36:14,400
But look at the number that I got in my hand;
it is so unlikely that I could have gotten
436
00:36:14,400 --> 00:36:18,089
this test statistic from this distribution.
437
00:36:18,089 --> 00:36:23,710
Therefore, the null hypothesis perhaps was
not true and I am going to reject the null
438
00:36:23,710 --> 00:36:29,980
hypothesis. Or, in some cases, the test statistic
that you get looks like it does belong to
439
00:36:29,980 --> 00:36:34,770
this distribution, this specific distribution.
440
00:36:34,770 --> 00:36:39,570
And therefore you cannot really
say anything; it is like you just have no grounds
441
00:36:39,570 --> 00:36:44,910
for rejecting the null hypothesis; you can
just say: I fail to reject the null hypothesis.
442
00:36:44,910 --> 00:36:48,869
You might want to
look at this procedure a couple of times,
443
00:36:48,869 --> 00:36:55,470
but the important thing that you might want
to digest from this is that the P value itself
444
00:36:55,470 --> 00:37:02,520
is associated with the probability of seeing
this data if the null hypothesis were true
445
00:37:02,520 --> 00:37:08,190
and not really the probability
of this hypothesis being true given the data.
446
00:37:08,190 --> 00:37:14,010
So, it is probability of data given hypothesis
not probability of hypothesis given data.
447
00:37:14,010 --> 00:37:19,839
So, I hope that kind of clarifies inferential
statistics for you and in the next class we
448
00:37:19,839 --> 00:37:27,570
will look at some specific tests and go over
how you actually do these tests and even go
449
00:37:27,570 --> 00:37:32,540
deeper and talk about why we do some of the
mathematical operations that we do.
450
00:37:32,540 --> 00:37:33,230
Thank you.