1
00:00:12,410 --> 00:00:17,860
Hello welcome once again, we introduced the
concept of a Standardized Normal Distribution
2
00:00:17,860 --> 00:00:21,870
in the previous class, I will continue on
that again and then later on we will talk
3
00:00:21,870 --> 00:00:24,940
about something called t-distribution.
4
00:00:24,940 --> 00:00:26,790
So what is standardized normal distribution?
5
00:00:26,790 --> 00:00:30,000
Basically you all know what is a normal distribution.
6
00:00:30,000 --> 00:00:35,230
The normal distribution is also called a symmetric distribution,
a bell-shaped curve and so on.
7
00:00:35,230 --> 00:00:39,850
So, the left hand side here will be equal
to the right hand side exactly that is why
8
00:00:39,850 --> 00:00:41,769
it is called a bell-shaped curve.
9
00:00:41,769 --> 00:00:47,820
But when you transform this normal distribution,
you get what is called the standardized normal distribution.
10
00:00:47,820 --> 00:00:54,610
Where, the mean is centered around 0, the
mean is 0 and the area under the curve will
11
00:00:54,610 --> 00:00:59,390
become 1 and sigma that is the standard deviation
will become also 1.
12
00:00:59,390 --> 00:01:00,700
What is the advantage?
13
00:01:00,700 --> 00:01:05,430
When we convert a normal distribution into
standardized it becomes very easy for us to
14
00:01:05,430 --> 00:01:10,020
compare one set of samples with another set
of samples and so on.
15
00:01:10,020 --> 00:01:16,550
Otherwise what happens is any data for example,
if I am measuring heights of students the
16
00:01:16,550 --> 00:01:21,600
mean will be in terms of some inches and the
standard deviation again will be in terms
17
00:01:21,600 --> 00:01:22,600
of inches.
18
00:01:22,600 --> 00:01:28,220
If I am looking at height of plants I may
measure them in centimeters, so height will
19
00:01:28,220 --> 00:01:31,930
be in terms of centimeters the average height
could be in terms of centimeters.
20
00:01:31,930 --> 00:01:37,550
If I am measuring weight of the class, then
I will be measuring in terms of kilograms
21
00:01:37,550 --> 00:01:39,890
60 kilograms, 70 kilograms and so on.
22
00:01:39,890 --> 00:01:45,300
So we have a wide range of numbers where as
when you convert this normal distribution
23
00:01:45,300 --> 00:01:49,940
into standardized form, so that mean is always
equal to 0, area under the curve is 1.
24
00:01:49,940 --> 00:01:55,110
It becomes easy for us to compare and also
if you look at the tables that are available
25
00:01:55,110 --> 00:02:00,640
for determining the area under the curve,
they are all assuming that the total area
26
00:02:00,640 --> 00:02:02,060
is 1.
27
00:02:02,060 --> 00:02:04,780
That is what is called standardized normal
distribution.
28
00:02:04,780 --> 00:02:12,090
Basically, we convert the data x into Z,
which is x minus mu divided by sigma, where
29
00:02:12,090 --> 00:02:20,129
mu is your mean and sigma is the standard deviation.
When we do that, suppose x is 1 sigma above the mean,
30
00:02:20,129 --> 00:02:28,900
then Z will become 1; if x is 2 sigma above the mean,
then Z will become 2; when x is
31
00:02:28,900 --> 00:02:32,630
3 sigma above the mean, then Z will become 3, and that
is what is shown here.
32
00:02:32,630 --> 00:02:39,540
We have 1 sigma, 2 sigma, 3 sigma in terms
of Z it becomes 1, 2, 3 and similarly analogous
33
00:02:39,540 --> 00:02:44,980
minus 1 sigma will become minus 1, minus 2
sigma will become minus 2, minus 3 sigma will
34
00:02:44,980 --> 00:02:49,350
become minus 3, so it is very, very nice.
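The conversion just described can be sketched in a few lines of Python. This is only an illustration: the function name `z_score` and the mean and standard deviation used below are made-up values, not numbers from the lecture.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies from the mean mu."""
    return (x - mu) / sigma


# Hypothetical population: mean 170 and standard deviation 10 (assumed values).
mu, sigma = 170.0, 10.0

# A value k sigma away from the mean always maps to Z = k.
for k in (-3, -2, -1, 1, 2, 3):
    x = mu + k * sigma
    print(f"x = {x:6.1f}  ->  Z = {z_score(x, mu, sigma):+.1f}")
```

Whatever units the data are in, inches, centimeters or kilograms, the resulting Z has no units, which is why different data sets become comparable.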
35
00:02:49,350 --> 00:02:54,560
If you look at the area as I said the area
under the curve is 1 that is 100 percent.
36
00:02:54,560 --> 00:03:01,840
If you look at the area spanning minus 1 sigma to
plus 1 sigma it will be 68.3 percent, if we
37
00:03:01,840 --> 00:03:08,390
look at the area between plus 2 sigma and
minus 2 sigma it will be 95.4 and if you look
38
00:03:08,390 --> 00:03:14,430
at the area between spanning between plus
3 sigma and minus 3 sigma it will become 99.7
39
00:03:14,430 --> 00:03:15,430
and so on actually.
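These percentages can be checked with Python's standard library. The sketch below uses `statistics.NormalDist`; the helper name `area_within` is an assumption made for illustration.

```python
from statistics import NormalDist

# Standardized normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)


def area_within(k):
    """Area under the curve between -k sigma and +k sigma."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within +/- {k} sigma: {100 * area_within(k):.1f} percent")
```

The printed values should agree with the 68.3, 95.4 and 99.7 percent quoted above.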
40
00:03:15,430 --> 00:03:19,210
We can even go up to 6 sigma; you must have
heard about the term called Six Sigma.
41
00:03:19,210 --> 00:03:28,610
If you are going up to 6 sigma then the area
spanned will be more than 99.9999 percent.
42
00:03:28,610 --> 00:03:33,960
These numbers have very, very important
significance; as we go along we will find that
43
00:03:33,960 --> 00:03:37,410
this 95 and 99 play very, very important role.
44
00:03:37,410 --> 00:03:43,790
So, when we say this area is 95 percent that
is plus or minus 2 sigma, then the area outside
45
00:03:43,790 --> 00:03:50,100
will be obviously 5 percent that means, this
side that is 1 side of it or one tail of it
46
00:03:50,100 --> 00:03:54,230
will be 2.5 percent the other tail of it will
have 2.5 percent.
47
00:03:54,230 --> 00:04:00,760
Similarly, if I consider plus or minus 3 sigma
as 99 percent the outside area will be 1 percent
48
00:04:00,760 --> 00:04:05,670
that means, one side it will become 0.5 percent
the other side will become 0.5 percent.
49
00:04:05,670 --> 00:04:11,900
Now, these are very, very important as we
can see later on when we talk about statistical
50
00:04:11,900 --> 00:04:12,900
significance.
51
00:04:12,900 --> 00:04:18,970
When we say 95 percent significance, 99 percent
significance then we are talking in terms
52
00:04:18,970 --> 00:04:24,900
of 100 minus 95, that is 5 percent significance;
100 minus 99 will be 1 percent significance
53
00:04:24,900 --> 00:04:25,900
and so on.
54
00:04:25,900 --> 00:04:32,020
We will be more interested in the area that
is outside, rather than the area that is inside,
55
00:04:32,020 --> 00:04:40,650
so this 95 percent spans plus or minus 2 sigma
and 99 percent spans plus or minus 3 sigma.
56
00:04:40,650 --> 00:04:41,650
We will talk about it.
57
00:04:41,650 --> 00:04:43,919
How do you convert any data into z?
58
00:04:43,919 --> 00:04:45,310
It is very simple.
59
00:04:45,310 --> 00:04:51,430
I know the mean, I know the sigma I take a
data x then I use this formula and convert
60
00:04:51,430 --> 00:04:53,280
that into z.
61
00:04:53,280 --> 00:05:00,250
So, if it is 1 sigma, Z will become 1; if
it is 2 sigma, Z will become 2; if it is 3 sigma,
62
00:05:00,250 --> 00:05:05,840
Z will become 3 and so on.
63
00:05:05,840 --> 00:05:11,480
Excel also has the function that is called
NORMSDIST function.
64
00:05:11,480 --> 00:05:17,900
The NORMSDIST function calculates the whole area
to the left of Z; if I have a value of Z here, NORMSDIST
65
00:05:17,900 --> 00:05:20,900
will give me the area of the whole portion to its left.
66
00:05:20,900 --> 00:05:25,420
So if I want to know the area of this portion
that is outside then I do 1 minus NORMSDIST.
67
00:05:25,420 --> 00:05:33,540
When Z is equal to 0 that means here obviously
this area will be 0.5 because it is symmetric
68
00:05:33,540 --> 00:05:42,540
and equal and when z is equal to 1 here that
means this area will be 0.841 and if Z is
69
00:05:42,540 --> 00:05:49,020
equal to 2 then this area will be
0.977, so if I want to know the area outside
70
00:05:49,020 --> 00:05:59,330
this, then obviously it is 1 minus 0.977, which will
be equal to 0.023; if I multiply that by 2
71
00:05:59,330 --> 00:06:01,770
that will become 0.046.
72
00:06:01,770 --> 00:06:05,110
The area outside two sigma will be 0.046.
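For readers without Excel, the same numbers can be reproduced in Python. This is a sketch only: `norm_s_dist` is a hypothetical helper written to mimic the behaviour described for NORMSDIST (area to the left of Z), using the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist


def norm_s_dist(z):
    """Area to the left of z under the standardized normal curve."""
    return NormalDist().cdf(z)


for z in (0.0, 1.0, 2.0):
    print(f"Z = {z}: area to the left = {norm_s_dist(z):.3f}")

# Area outside Z = 2 on one side, and on both sides together.
one_tail = 1.0 - norm_s_dist(2.0)
two_tail = 2.0 * one_tail
print(f"one tail = {one_tail:.3f}, both tails = {two_tail:.3f}")
```

The left-hand area, the single-tail area and the two-tail area are three views of the same curve, so each can be computed from the others.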
73
00:06:05,110 --> 00:06:09,199
If you remember, in the previous case
that is what it is, right?
74
00:06:09,199 --> 00:06:16,770
The area inside 2 sigma is 95.4, so the remaining
is 4.6 so this portion is 2.3, this portion
75
00:06:16,770 --> 00:06:17,770
is 2.3.
76
00:06:17,770 --> 00:06:23,870
Actually we do not do it as 95.4 we call it
95 itself as a round number.
77
00:06:23,870 --> 00:06:29,490
This outside area will become 5, this portion
will be 2.5 and this portion will be 2.5.
78
00:06:29,490 --> 00:06:35,780
Generally that is what we do actually similarly
for 3 sigma we do not say 99.7 we say 99 percent
79
00:06:35,780 --> 00:06:41,370
so the outside portion will be 1 percent, so
one side of it will be 0.5 percent and the other
80
00:06:41,370 --> 00:06:44,360
side will be 0.5 percent.
81
00:06:44,360 --> 00:06:52,120
We can use this NORMSDIST that is available
in excel to calculate similarly we can use
82
00:06:52,120 --> 00:06:58,699
this GraphPad online software, and that also
gives you the area, but a different part of it:
83
00:06:58,699 --> 00:07:02,270
it gives this marked area.
84
00:07:02,270 --> 00:07:09,090
When Z is equal to 0 that means here the whole
area will be 1, when Z is equal to for
85
00:07:09,090 --> 00:07:17,380
example 2, 2 sigma that is 2 and here so obviously,
the outside should be around 5 percent that
86
00:07:17,380 --> 00:07:19,740
is what it is showing approximately 0.0455.
87
00:07:19,740 --> 00:07:29,610
If Z is equal to 3, when we say 99.7 the remaining
portion is 0.3 percent, that is what you are
88
00:07:29,610 --> 00:07:30,850
getting here.
89
00:07:30,850 --> 00:07:37,820
We can use either of these, or simply use
the Z table I am going to show you; this is
90
00:07:37,820 --> 00:07:40,070
a Z table.
91
00:07:40,070 --> 00:07:46,750
We can use the Z table, so for a given value of
Z it gives you the area. It gives you this
92
00:07:46,750 --> 00:07:51,440
area; it is called single tail because
it is giving only one portion of the area,
93
00:07:51,440 --> 00:07:58,480
where as your GraphPad gives you this plus
this so I need to double this to get the results
94
00:07:58,480 --> 00:07:59,650
from GraphPad.
95
00:07:59,650 --> 00:08:09,820
Each of these, the software, the Excel function
or the table, gives the area for Z in a different
96
00:08:09,820 --> 00:08:15,910
way, but they are all analogous to each other
and we can calculate one from the other you
97
00:08:15,910 --> 00:08:16,910
understand.
98
00:08:16,910 --> 00:08:23,210
So the GraphPad gives you this plus this area
where as the table gives you one set of the
99
00:08:23,210 --> 00:08:29,860
area so obviously, if I want to make it appear
like GraphPad I multiply by 2, so from Z equal
100
00:08:29,860 --> 00:08:38,870
to any value for example, when Z is equal
to 2 it gives you 0.0228.
101
00:08:38,870 --> 00:08:48,350
If you multiply it by 2 it will become
0.0228 multiplied by 2, which is 0.0456, and that
102
00:08:48,350 --> 00:08:56,810
is what GraphPad gives, 0.0455, whereas
Excel gives the whole area to the left; when I take
103
00:08:56,810 --> 00:09:03,459
1 minus that value I will get this area, which is the same
as the area the table gives you.
104
00:09:03,459 --> 00:09:12,019
So the table gives you Z values Z is equal
to 0.01 123 if you want second decimal this
105
00:09:12,019 --> 00:09:23,559
side we go up to 7.9, so from there we calculate
this area which is equal to this area so converting
106
00:09:23,559 --> 00:09:30,269
a normal distribution into a standardized
normal distribution has many advantages we
107
00:09:30,269 --> 00:09:34,230
can do many statistical calculations and that
is what we are going to do.
108
00:09:34,230 --> 00:09:37,209
Now let us look at a problem.
109
00:09:37,209 --> 00:09:43,329
Now, for the weight of tomatoes in a packing machine
the mean is given as 250 grams and sigma is
110
00:09:43,329 --> 00:09:48,320
given as 0.2 grams, so this is like a population, that
is why I put mu and sigma.
111
00:09:48,320 --> 00:09:53,240
It is collected over a very long period of
time so it gives you this and this.
112
00:09:53,240 --> 00:09:57,639
Now I have a tomato which is 250.5 grams.
113
00:09:57,639 --> 00:10:06,129
What percentage will weigh less than this?
So I have taken 1 sample, 250.5 grams.
114
00:10:06,129 --> 00:10:07,569
What will be the weight of tomatoes?
115
00:10:07,569 --> 00:10:12,389
What will be the percentage of tomatoes which
will weigh less than 250.5 grams?
116
00:10:12,389 --> 00:10:20,100
What I do is, Z is equal to x minus mu divided by sigma;
mu is 250, sigma is 0.2.
117
00:10:20,100 --> 00:10:22,619
So Z is equal to 2.5.
118
00:10:22,619 --> 00:10:30,220
I go to my table; for 2.5 I get 0.0062.
119
00:10:30,220 --> 00:10:32,740
That is 0.62 percent.
120
00:10:32,740 --> 00:10:40,749
So 0.62 percent of the samples will weigh more
than 250.5 grams, because the table gives you this
121
00:10:40,749 --> 00:10:43,929
side, right, so the table gives you this side.
122
00:10:43,929 --> 00:10:51,319
If you want to know this side, obviously
I subtract from 100, so 100 minus 0.62 is 99.38,
123
00:10:51,319 --> 00:10:58,829
so 99.38 percent of the tomato samples, if
you pick them up, will weigh less than 250.5 grams,
124
00:10:58,829 --> 00:11:00,149
do you understand?
125
00:11:00,149 --> 00:11:01,470
How to do?
126
00:11:01,470 --> 00:11:10,829
Now mu is 250, sigma is 0.2 and x is 250.5; substituting
here I get z as 2.5, so I go to the table for z equal to
127
00:11:10,829 --> 00:11:16,959
2.5, which gives 0.0062, and you should note
that it is giving the outside area.
128
00:11:16,959 --> 00:11:23,399
So I want to know tomatoes weighing less than
250.5, so obviously I need to get this area.
129
00:11:23,399 --> 00:11:30,230
I can subtract, 1 minus 0.0062 or 100 minus
0.62 percent.
130
00:11:30,230 --> 00:11:32,470
That is what I have done here.
131
00:11:32,470 --> 00:11:40,190
So 99.38 percent of the tomatoes will weigh
less than this particular number, very interesting.
132
00:11:40,190 --> 00:11:41,670
So it is a very useful set of data.
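The tomato calculation can be verified with a short Python sketch. It uses `statistics.NormalDist` in place of the printed Z table; the variable names are chosen here for illustration only.

```python
from statistics import NormalDist

mu, sigma = 250.0, 0.2   # population mean and standard deviation in grams
x = 250.5                # weight of the tomato picked up

z = (x - mu) / sigma
weights = NormalDist(mu=mu, sigma=sigma)

below = weights.cdf(x)   # fraction weighing less than x
above = 1.0 - below      # single-tail area, as a Z table would list it

print(f"Z = {z:.2f}")
print(f"more than {x} g: {100 * above:.2f} percent")
print(f"less than {x} g: {100 * below:.2f} percent")
```

The tail area for Z equal to 2.5 is about 0.0062, that is 0.62 percent, so roughly 99.38 percent of tomatoes weigh less than 250.5 grams.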
133
00:11:41,670 --> 00:11:50,720
Now, let us look at another problem so again
tomatoes the mean is 250 grams and sigma 0.2.
134
00:11:50,720 --> 00:11:57,639
What percentage of tomatoes is expected
to have weight between 249.7 and 250.4?
135
00:11:57,639 --> 00:12:01,920
I want to know this area, this area understand.
136
00:12:01,920 --> 00:12:03,639
I want to know this area.
137
00:12:03,639 --> 00:12:10,790
What do I use? The same formula, z is equal to x
minus mu divided by sigma. I will put 249.7 minus
138
00:12:10,790 --> 00:12:19,059
250 divided by 0.2 and get minus 1.5; then 250.4
minus 250 divided by 0.2, and that will give me 2.
139
00:12:19,059 --> 00:12:25,939
I need to get the area from minus 1.5 to 2.
140
00:12:25,939 --> 00:12:31,929
This area, that is between minus 1.5 and
2, is what I am interested in.
141
00:12:31,929 --> 00:12:43,339
Now z is equal to minus 1.5 the table will
give me this whole area so obviously, I need
142
00:12:43,339 --> 00:12:48,410
to subtract that from 0.5 because this area
is 0.5.
143
00:12:48,410 --> 00:12:52,779
Z is equal to minus 1.5 or z is equal to 1.5
it is does not matter.
144
00:12:52,779 --> 00:13:05,209
So I take z is equal to 1.5, that is 0.0668,
but this is the whole tail area, right.
145
00:13:05,209 --> 00:13:11,269
I need to subtract it from 0.5, this portion, then
I will know this area, so I will take
146
00:13:11,269 --> 00:13:14,730
0.5 minus 0.0668, that is 0.4332.
147
00:13:14,730 --> 00:13:21,980
That is this area, do you understand, this portion.
Now for z equal to 2 I go again to the table;
148
00:13:21,980 --> 00:13:26,239
for z equal to 2 it gives 0.0228.
149
00:13:26,239 --> 00:13:32,800
So it gives you this whole tail area, but again
I need to subtract it from the 0.5 portion.
150
00:13:32,800 --> 00:13:39,079
It will give me the whole tail area
from the table, but I need to subtract that
151
00:13:39,079 --> 00:13:42,470
from 0.5, because I need to
know only this area.
152
00:13:42,470 --> 00:13:47,019
So 0.5 minus 0.0228 is 0.4772.
153
00:13:47,019 --> 00:13:59,100
This area is 0.4772 and this area is 0.4332,
so I add them and get 0.91; that means 91 percent
154
00:13:59,100 --> 00:14:04,059
of the tomatoes will have weight between 250.4
and 249.7.
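The same answer can be reached both ways in Python: by mimicking the table lookup (0.5 minus each tail) and by subtracting two cumulative areas directly. This is a sketch using `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 250.0, 0.2
low, high = 249.7, 250.4

std_normal = NormalDist()
z_low = (low - mu) / sigma     # about -1.5
z_high = (high - mu) / sigma   # about  2.0

# Table-style calculation: single-tail areas subtracted from 0.5.
tail_low = std_normal.cdf(z_low)            # tail beyond Z = -1.5
tail_high = 1.0 - std_normal.cdf(z_high)    # tail beyond Z = 2
table_style = (0.5 - tail_low) + (0.5 - tail_high)

# Direct calculation: difference of the two left-hand areas.
between = std_normal.cdf(z_high) - std_normal.cdf(z_low)

print(f"table style: {table_style:.4f}, direct: {between:.4f}")
```

Both routes give about 0.91, so about 91 percent of tomatoes fall between 249.7 and 250.4 grams.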
155
00:14:04,059 --> 00:14:12,959
It is a very useful way of calculating things
if I know the mu and the sigma.
156
00:14:12,959 --> 00:14:15,019
As you can see here how powerful it is now.
157
00:14:15,019 --> 00:14:26,980
Of course, you can do the same thing using
Excel NORMDIST function or you can also use
158
00:14:26,980 --> 00:14:29,519
the GraphPad.
159
00:14:29,519 --> 00:14:39,069
But you have to remember that the GraphPad
result is the area of this and this, whereas
160
00:14:39,069 --> 00:14:45,119
the single tail Z table gives you this area,
so if I multiply this by 2 I will get the
161
00:14:45,119 --> 00:14:46,279
GraphPad results.
162
00:14:46,279 --> 00:14:52,299
Whereas the Excel function gives you the
area to the left, so I need to subtract it from 1
163
00:14:52,299 --> 00:14:54,360
to get the outside area.
164
00:14:54,360 --> 00:15:01,490
So that is the relationship between the 3
different approaches.
165
00:15:01,490 --> 00:15:07,389
Now you can as a homework take up and see
whether you get the same answer as this using
166
00:15:07,389 --> 00:15:10,269
the NORMSDIST function as well as the GraphPad
function.
167
00:15:10,269 --> 00:15:11,820
I do not want to go into that.
168
00:15:11,820 --> 00:15:18,889
Now, there is something called the confidence
interval I talked about it right 2 sigma it
169
00:15:18,889 --> 00:15:26,199
is approximately 95 percent actually 95.4
we take it as 95 percent, 3 sigma is 99.7
170
00:15:26,199 --> 00:15:29,879
but we will take as 99 percent.
171
00:15:29,879 --> 00:15:44,850
When we have a data set well I know the global
population mean and standard deviation and
172
00:15:44,850 --> 00:15:51,209
then I suppose I take 1 sample in it as I
mentioned long time back it is very difficult
173
00:15:51,209 --> 00:15:57,179
to take the entire population for sampling,
but normally we will take only 1 or 2 samples,
174
00:15:57,179 --> 00:16:05,790
so I take only 1 out of this and 95 percent
of the time 95 percent of the time it will
175
00:16:05,790 --> 00:16:08,300
fall within this region.
176
00:16:08,300 --> 00:16:14,009
That means, if I know the average mean 95
percent of the time it will be within this
177
00:16:14,009 --> 00:16:22,660
average plus or minus 1.96 sigma. How would
I get this 1.96? That is the 2 sigma; as you
178
00:16:22,660 --> 00:16:29,980
know, for Z equal to 2 the tail is 2.28 percent.
179
00:16:29,980 --> 00:16:37,279
So taking both sides, that will come to approximately
2 times 2.28, which is approximately
180
00:16:37,279 --> 00:16:38,910
4.56 percent; for exactly 5 percent, Z is 1.96.
181
00:16:38,910 --> 00:16:42,799
So that is 95 percent here.
182
00:16:42,799 --> 00:16:55,399
So 95 percent of the time the data will lie
between this region the plus or minus 2 sigma.
183
00:16:55,399 --> 00:17:03,750
So if I take a sample 95 percent of the time
it will lie between mu plus or minus 1.96
184
00:17:03,750 --> 00:17:04,750
sigma.
185
00:17:04,750 --> 00:17:09,200
So if you want to know 99 percent, obviously
I need to take this 3 sigma, right; there
186
00:17:09,200 --> 00:17:20,579
the multiplier in the formula is 2.58, because for 2 sigma,
as you know, the outside area on one side is 0.0228, and
187
00:17:20,579 --> 00:17:28,950
for 3 sigma the outside area is 0.0013, right.
188
00:17:28,950 --> 00:17:34,930
We get like this.
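The multipliers 1.96 and 2.58 can be recovered by running the table lookup in reverse. The sketch below uses `NormalDist.inv_cdf` from the standard library; the function name `z_multiplier` is an assumption made for this example.

```python
from statistics import NormalDist


def z_multiplier(confidence):
    """Z value that leaves (1 - confidence) / 2 in each tail."""
    tail = (1.0 - confidence) / 2.0
    return NormalDist().inv_cdf(1.0 - tail)


for confidence in (0.95, 0.99):
    print(f"{confidence:.0%} interval: mu +/- {z_multiplier(confidence):.2f} sigma")
```

So 2 sigma and 3 sigma are convenient round numbers, while 1.96 and 2.58 are the values that leave exactly 5 percent and 1 percent outside.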
189
00:17:34,930 --> 00:17:43,720
So this also very useful, if I am taking a
sample, if I take 10 samples from a population,
190
00:17:43,720 --> 00:17:51,010
so I will know the mean of those 10 samples
but now I want to know, what is an estimate
191
00:17:51,010 --> 00:17:57,500
of the confidence interval for the population
because the 10 samples which we take will
192
00:17:57,500 --> 00:18:03,330
give me some mean but that will not be the
exact population mean right because if I had
193
00:18:03,330 --> 00:18:08,410
taken another 10 samples I would have got
another mean, if I take another 10 samples
194
00:18:08,410 --> 00:18:12,020
I will get another mean and so on but they
are not real representation of my population
195
00:18:12,020 --> 00:18:13,020
mean.
196
00:18:13,020 --> 00:18:17,350
I can use this type of approach to tell
what will be the range in which my population mean
197
00:18:17,350 --> 00:18:22,430
will lie; that is the advantage of getting
something called the confidence interval.
198
00:18:22,430 --> 00:18:28,390
Now let us go to the next topic, which is called
Student's t-distribution.
199
00:18:28,390 --> 00:18:30,080
What is this Student's t-distribution?
200
00:18:30,080 --> 00:18:35,140
This was developed by somebody called
William Sealy Gosset.
201
00:18:35,140 --> 00:18:41,220
He wrote some very important papers in statistics
under the pseudonym "Student", and that is how
202
00:18:41,220 --> 00:18:50,050
the name stuck. He said the normal distribution
is a symmetric bell shape, but then n should
203
00:18:50,050 --> 00:18:52,850
be very, very large. What if I have a very, very
small n?
204
00:18:52,850 --> 00:19:02,540
Now if I have N equal to 1, 2, 3, 4, 5, 6,
30, 25 and so on that also will be symmetric
205
00:19:02,540 --> 00:19:11,200
but it will have heavier tails not like a
nice looking normal distribution this is a
206
00:19:11,200 --> 00:19:18,250
nice looking normal distribution, whereas
when I take n equal to, say, 1 or 2 or 3,
207
00:19:18,250 --> 00:19:24,671
it will have a very heavy tail. It will be symmetric just
like your normal distribution
208
00:19:24,671 --> 00:19:28,750
and all that, but it will never be
as nice a bell shape.
209
00:19:28,750 --> 00:19:34,250
A bell shaped should have been like this,
right, like this, as you can see when I take
210
00:19:34,250 --> 00:19:36,740
much smaller numbers it will be like this.
211
00:19:36,740 --> 00:19:44,180
When I take a very large number, as you see,
ideally if you have 30 or more, then you
212
00:19:44,180 --> 00:19:51,310
get a t-distribution almost looking
like a normal distribution; that is what it
213
00:19:51,310 --> 00:19:52,310
is actually.
214
00:19:52,310 --> 00:20:00,150
When the sample size is small compared with the population,
the distribution we get is flatter and wider.
215
00:20:00,150 --> 00:20:06,990
It is symmetric just like
a normal distribution, but it will have longer
216
00:20:06,990 --> 00:20:08,920
tails. In real life
217
00:20:08,920 --> 00:20:13,720
we are dealing only with samples, we are not
dealing with the population, so obviously
218
00:20:13,720 --> 00:20:19,210
student T-distribution becomes very important
because whatever I do I am collecting 10 students
219
00:20:19,210 --> 00:20:25,340
and measuring their height, so that is a sample,
I am testing drug on 20 rats that is a small,
220
00:20:25,340 --> 00:20:31,840
small sample, I am taking 10 tomato plants
and measuring their weight, that is a sample.
221
00:20:31,840 --> 00:20:38,850
So student T-distribution becomes very important
and we might not collect 50 samples, 60 samples,
222
00:20:38,850 --> 00:20:45,860
100 samples we may collect 8, 9, 10 and so
on, so that is how this T-distribution will
223
00:20:45,860 --> 00:20:52,900
become important and you will come across
this term T, T, T many times everywhere in
224
00:20:52,900 --> 00:20:55,060
bio statistics.
225
00:20:55,060 --> 00:20:57,750
What are the properties of the T-distribution?
226
00:20:57,750 --> 00:21:02,790
The mean of the distribution is equal to 0, right,
because it is also symmetric just like the normal
227
00:21:02,790 --> 00:21:03,790
distribution Z.
228
00:21:03,790 --> 00:21:10,910
The variance, that is the spread, is given by
v divided by v minus 2, where v is the degrees
229
00:21:10,910 --> 00:21:17,950
of freedom and v is greater than 2. The variance
is always greater than 1, and as the degrees of
230
00:21:17,950 --> 00:21:22,980
freedom approach infinity the t-distribution
approaches the standard normal distribution. Of
231
00:21:22,980 --> 00:21:26,250
course, that is at infinity, but even at 30 or 35
232
00:21:26,250 --> 00:21:28,640
I would say it looks almost like a Normal
Distribution.
233
00:21:28,640 --> 00:21:35,080
If you have 30 samples then I think it is
almost like a normal.
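A small Python sketch shows how quickly the variance formula v divided by v minus 2 approaches 1, the variance of the standardized normal distribution. The function name `t_variance` and the chosen degrees of freedom are illustrative assumptions.

```python
def t_variance(v):
    """Variance of Student's t-distribution with v degrees of freedom (v > 2)."""
    if v <= 2:
        raise ValueError("the formula v / (v - 2) needs v greater than 2")
    return v / (v - 2)


for v in (3, 5, 10, 30, 100, 1000):
    print(f"v = {v:4d}: variance = {t_variance(v):.3f}")
```

The variance is always above 1 and shrinks toward 1 as the degrees of freedom grow, which is the sense in which the t-distribution approaches the normal.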
234
00:21:35,080 --> 00:21:41,570
When do we use this? For any statistic having
a bell shape, that means approximately normal:
235
00:21:41,570 --> 00:21:45,780
the population distribution is normal, it
is symmetric and unimodal.
236
00:21:45,780 --> 00:21:48,860
You should not have a bimodal distribution; I taught you
what the mode is.
237
00:21:48,860 --> 00:21:52,350
That is a measure of central tendency. It should not
have outliers.
238
00:21:52,350 --> 00:21:58,150
If we have a sample of 30, it should not have
outliers, and it should be symmetric and unimodal.
239
00:21:58,150 --> 00:22:04,410
If the sample size is 40 it can be a little
bit skewed, but still it should be unimodal with
240
00:22:04,410 --> 00:22:05,410
no outliers.
241
00:22:05,410 --> 00:22:14,340
If it is greater than 40 then it should not
have outliers, but otherwise it is ok.
242
00:22:14,340 --> 00:22:20,690
The T-distribution should not be used with
small samples that are not approximately normal,
243
00:22:20,690 --> 00:22:26,680
that means if the samples are very small and
it is not normal then be careful do not use
244
00:22:26,680 --> 00:22:28,850
the T-distribution.
245
00:22:28,850 --> 00:22:34,200
So there is something called test for normality
we will talk about that later in the course
246
00:22:34,200 --> 00:22:38,990
of time to see whether your data is normal
and if it is reasonably normal then we can
247
00:22:38,990 --> 00:22:41,150
use the T -distribution.
248
00:22:41,150 --> 00:22:47,080
As I said why do we use t-distribution because
we always collect samples, we never collect
249
00:22:47,080 --> 00:22:55,730
entire population data set samples will be
having a degrees of freedom of 5, 6, 10, 20
250
00:22:55,730 --> 00:22:57,090
and so on.
251
00:22:57,090 --> 00:23:02,940
In most of the situations we use the concept
of T-distribution and perform the analysis.
252
00:23:02,940 --> 00:23:09,850
We also need to know the population and sample
I am coming back to the same thing again.
253
00:23:09,850 --> 00:23:15,710
If I draw a sample from a population and take
the mean, I will get some mean.
254
00:23:15,710 --> 00:23:21,270
If I take another set of samples, I will get
slightly different mean then if I take another
255
00:23:21,270 --> 00:23:23,600
set of samples, I will get slightly different
mean.
256
00:23:23,600 --> 00:23:29,990
So, these means will have their own mean and
variance.
257
00:23:29,990 --> 00:23:39,170
The mean of all these means is a good representation
of the population mean but the variance is
258
00:23:39,170 --> 00:23:44,970
not really exact representation of the standard
deviation of the population.
259
00:23:44,970 --> 00:23:51,480
There is something called the standard error of
the mean, that is, the standard deviation of
260
00:23:51,480 --> 00:23:58,260
the sample means. The estimate of the
standard error of all these means is
261
00:23:58,260 --> 00:24:05,520
given by s divided by square root of n; s
is your sample standard deviation, n is your
262
00:24:05,520 --> 00:24:07,320
number of data points.
263
00:24:07,320 --> 00:24:10,880
This is a standard error of the mean.
264
00:24:10,880 --> 00:24:19,760
In fact, this is a real representation of
how far a sample mean is likely to lie from the population mean, so
265
00:24:19,760 --> 00:24:21,140
you understand.
266
00:24:21,140 --> 00:24:27,820
I have a population, I take a few samples and
then calculate a mean, then I put them back,
267
00:24:27,820 --> 00:24:32,880
then I take another set of samples and get
a mean, I will definitely get something different
268
00:24:32,880 --> 00:24:36,490
then I will take another set of samples and
then get a mean.
269
00:24:36,490 --> 00:24:44,410
I will have a large set of means; these means
will have their own mean and some standard deviation.
270
00:24:44,410 --> 00:24:52,840
The mean of the means is a good representation
of the population mean, but the standard deviation
271
00:24:52,840 --> 00:25:00,710
of these means is not the population standard deviation, and it is given like
this: the standard error of the mean is the
272
00:25:00,710 --> 00:25:04,970
standard deviation of the sample mean taken as an estimate
of the population mean.
273
00:25:04,970 --> 00:25:12,580
So, s is your sample standard deviation and
n is the number of samples you take; you divide s by the square root
274
00:25:12,580 --> 00:25:13,580
of n.
275
00:25:13,580 --> 00:25:18,500
This is how it is and this standard error
is very, very important.
276
00:25:18,500 --> 00:25:22,030
Standard error is a standard deviation of
the sampling distribution of a set of means
277
00:25:22,030 --> 00:25:23,040
taken from a population.
278
00:25:23,040 --> 00:25:31,470
So, standard error is a standard deviation
of all these means which you have taken with
279
00:25:31,470 --> 00:25:36,840
different samples you have taken from the
same population do you understand.
280
00:25:36,840 --> 00:25:45,060
The mean of all these means is a good representation
of the population mean, but if you want to
281
00:25:45,060 --> 00:25:52,050
represent their spread, we use
something called the standard error, that is the
282
00:25:52,050 --> 00:25:57,040
standard deviation of the sampling distribution
of a set of means taken from a population, and
283
00:25:57,040 --> 00:26:01,350
that is a good representation of the uncertainty
in the estimated population mean, and that is given
284
00:26:01,350 --> 00:26:05,700
by
S divided by square root of n, where S is
285
00:26:05,700 --> 00:26:12,840
the sample standard deviation and n is your
number of data or samples you pick up and
286
00:26:12,840 --> 00:26:18,540
this is very, very important we use this in
many calculations as we go along.
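The standard error formula, s divided by the square root of n, is easy to compute with Python's standard library. The sketch below is illustrative only: the function name `standard_error` and the four height values are made up for the example.

```python
import math
import statistics


def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    s = statistics.stdev(sample)   # sample standard deviation (n - 1 divisor)
    return s / math.sqrt(len(sample))


# Hypothetical heights in centimeters (assumed data, not from the lecture).
heights = [150.0, 160.0, 170.0, 180.0]

print(f"mean           = {statistics.mean(heights):.1f}")
print(f"sample std dev = {statistics.stdev(heights):.3f}")
print(f"standard error = {standard_error(heights):.3f}")
```

Because of the square root of n in the denominator, quadrupling the number of data points only halves the standard error.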
287
00:26:18,540 --> 00:26:29,010
Again we come back to Student's t-distribution.
Student's t is almost like Z, you know; they play
288
00:26:29,010 --> 00:26:35,340
the same role. You have z in the standardized
normal distribution, whereas you have the t value
289
00:26:35,340 --> 00:26:40,140
for a normal distribution with a very small sample
size.
290
00:26:40,140 --> 00:26:47,760
If I have a sample size of 4, mean is 100,
sample standard deviation is 10, then when
291
00:26:47,760 --> 00:26:54,680
I do a t-distribution the mean is still 100, and the
standard deviation of the mean will be 10 divided by the square root of 4,
292
00:26:54,680 --> 00:26:57,130
that is 10 by 2 which is 5.
293
00:26:57,130 --> 00:27:06,920
So this is what it is and as I said, the t
here plays a same role as Z in the standardized
294
00:27:06,920 --> 00:27:15,590
normal distribution.
295
00:27:15,590 --> 00:27:24,010
If you take the mean as 100 here and a
value of 110, what will be
296
00:27:24,010 --> 00:27:31,400
the t value here? Obviously, if we take
the sample standard deviation as
297
00:27:31,400 --> 00:27:37,350
10 and n as 4, the t value will be 10 divided by 5, that is 2, and that
means it lies 2 standard errors from the
298
00:27:37,350 --> 00:27:38,350
mean.
299
00:27:38,350 --> 00:27:44,510
In fact, if you remember, with z
in the standardized normal distribution also we had
300
00:27:44,510 --> 00:27:46,400
the same numbers coming out.
301
00:27:46,400 --> 00:27:55,400
So, 2 sigma becomes 2 and similarly here 2
times standard deviation become t equal to
302
00:27:55,400 --> 00:27:58,700
2.
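The worked example above (sample size 4, mean 100, sample standard deviation 10, value 110) can be written as a short Python sketch. The function name `t_value` is an assumption made for illustration.

```python
import math


def t_value(x, mean, s, n):
    """Distance of x from the mean in units of the standard error s / sqrt(n)."""
    standard_error = s / math.sqrt(n)
    return (x - mean) / standard_error


# Numbers from the example: n = 4, mean = 100, s = 10, so the standard error is 5.
print(t_value(110.0, 100.0, 10.0, 4))
```

This mirrors the Z formula, with the standard error taking the place of sigma.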
303
00:27:58,700 --> 00:28:07,580
As I mentioned, if n is equal to infinity
data points, then 95 percent confidence will
304
00:28:07,580 --> 00:28:14,630
be instead of 2 sigma, if you remember here
95 percent confidence 2 sigma but then we
305
00:28:14,630 --> 00:28:23,430
are talking about t where the number of data
points are suppose to be less than the population.
306
00:28:23,430 --> 00:28:32,059
Instead of 2 sigma, it will
become plus or minus 1.96 sigma.
307
00:28:32,059 --> 00:28:37,760
So you understand this 1.96, that is how it
came here. We rounded this
308
00:28:37,760 --> 00:28:42,980
to 2 sigma earlier, but the exact value for
95 percent, when n is infinity in the t-distribution,
309
00:28:42,980 --> 00:28:49,720
is 1.96. If n is equal to 6, that
means the number of data points has reduced dramatically;
310
00:28:49,720 --> 00:28:54,990
the degrees of freedom are n minus 1, that is 5, and for 95 percent
confidence t is 2.57.
311
00:28:54,990 --> 00:28:56,580
How do you get this?
312
00:28:56,580 --> 00:28:58,470
There is a table which gives you that.
313
00:28:58,470 --> 00:29:00,520
I will show you the table.
314
00:29:00,520 --> 00:29:06,090
If n is equal to 3 the data is still less; as you
can see it has got a long tail. It is symmetric,
315
00:29:06,090 --> 00:29:15,540
but the tail has become longer and
longer. None of them look like my beautiful
316
00:29:15,540 --> 00:29:21,620
normal distribution, where we are talking
in terms of n equal to infinity.
317
00:29:21,620 --> 00:29:23,040
We are talking about t-distribution.
318
00:29:23,040 --> 00:29:30,870
Where n is smaller than infinity, obviously
it will not look like a beautiful bell-shaped
319
00:29:30,870 --> 00:29:39,390
curve; it will have a longer
tail. So when n is equal to infinity in Student's t-distribution,
320
00:29:39,390 --> 00:29:52,040
for 95 percent it gives you t of 1.96,
and for n equal to 6 it becomes 2.57,
321
00:29:52,040 --> 00:29:56,270
n is equal to 3 is 4.3, do you understand
this?
322
00:29:56,270 --> 00:30:04,080
As the number of data points goes down, the uncertainty
goes up, because from 1.96 it has gone to 2.57 and
323
00:30:04,080 --> 00:30:06,160
then to 4.3.
324
00:30:06,160 --> 00:30:09,860
So there is a higher degree of uncertainty
in the estimate of the mean.
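The t values quoted here normally come from a printed t table. As a rough cross-check, they can also be computed numerically with only the Python standard library. The sketch below is an assumption-laden illustration, not the method used in the lecture: it integrates the t density with Simpson's rule and finds the critical value by bisection, and the function names `t_pdf`, `t_cdf` and `t_critical` are made up for this example.

```python
import math


def t_pdf(t, v):
    """Density of Student's t-distribution with v degrees of freedom."""
    coef = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return coef * (1 + t * t / v) ** (-(v + 1) / 2)


def t_cdf(t, v, steps=2000):
    """Area to the left of t (t >= 0), by Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, v) + t_pdf(t, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    return 0.5 + total * h / 3   # 0.5 is the area left of zero, by symmetry


def t_critical(confidence, v):
    """Two-sided critical t value, found by bisection on the CDF."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, v) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# n = 6 gives 5 degrees of freedom, n = 3 gives 2 degrees of freedom.
for n in (6, 3):
    print(f"n = {n}: t for 95 percent = {t_critical(0.95, n - 1):.2f}")
```

With fewer data points the critical value grows, so the same confidence level needs a wider interval around the sample mean.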
325
00:30:09,860 --> 00:30:17,040
So if I take only 3 data points and take a
mean the uncertainty on the population mean
326
00:30:17,040 --> 00:30:21,410
is higher because this is much larger.
327
00:30:21,410 --> 00:30:27,620
Whereas, if I take a very large number of data points the
uncertainty is much less, because the graph
328
00:30:27,620 --> 00:30:32,500
also looks narrower, I mean sharper, so
the area in the tails also is much less.
329
00:30:32,500 --> 00:30:33,500
You understand.
330
00:30:33,500 --> 00:30:42,560
This 1.96, if we remember I introduced that
some time back right here you can see this
331
00:30:42,560 --> 00:30:45,360
1.96 same thing.
332
00:30:45,360 --> 00:30:50,180
So it is mean plus or minus it gives an idea
about the confidence interval mean plus or
333
00:30:50,180 --> 00:30:56,870
minus 1.96 sigma, so this 1.96 corresponds
to 95 percent confidence interval.
334
00:30:56,870 --> 00:31:07,490
It is not exactly 2 sigma; that was a rounded value. When you
use a large number of samples and try
335
00:31:07,490 --> 00:31:14,960
to get an estimate of the population, the exact value
is 1.96, and this 1.96 will come up quite often
336
00:31:14,960 --> 00:31:16,740
for 95 percent.
337
00:31:16,740 --> 00:31:24,880
Similarly, if I take 99 percent, instead of
getting 3 sigma I will get 2.58, and that
338
00:31:24,880 --> 00:31:31,210
is from the table I will show you a table
called t table and that is very useful the
339
00:31:31,210 --> 00:31:33,710
t table comes here.
340
00:31:33,710 --> 00:31:52,520
This is called the t table, as you can see
here, and you can see the 1.96
341
00:31:52,520 --> 00:31:58,110
for infinity data points for a 95 percent
confidence interval.
342
00:31:58,110 --> 00:32:05,790
If you want take a 99 percent confidence interval
it is about 2.58, these 2 numbers play very
343
00:32:05,790 --> 00:32:07,190
important role.
344
00:32:07,190 --> 00:32:13,210
If I want to know a 95 percent confidence
interval using the t-distribution with many data points, I will
345
00:32:13,210 --> 00:32:15,110
multiply by 1.96.
346
00:32:15,110 --> 00:32:21,260
If I want to know the confidence interval
99 I multiply by 2.58.
347
00:32:21,260 --> 00:32:25,720
These 2 terms are very, very important, do
you understand?
348
00:32:25,720 --> 00:32:34,590
We go back again in the t-distribution so
when the data points are less, it will have
349
00:32:34,590 --> 00:32:36,210
very long tail.
350
00:32:36,210 --> 00:32:42,620
As you can see as the data points increase
the tail goes down which means we get more
351
00:32:42,620 --> 00:32:48,220
confidence in estimating the population mean
when I have more data points.
352
00:32:48,220 --> 00:32:55,270
Now we will continue this student t-distribution
in the next class also.
353
00:32:55,270 --> 00:32:59,880
Thank you very much.