1
00:00:18,460 --> 00:00:26,070
Good afternoon. Last class we described
multivariate statistical modeling from
2
00:00:26,070 --> 00:00:35,550
the point of view of its purpose as well as the different modeling
techniques, and we ended that
3
00:00:35,550 --> 00:00:52,010
lecture with prerequisites for the course,
prerequisites for this course or subject.
4
00:00:52,010 --> 00:01:02,230
And what we described there is that basic
statistics is one of the prerequisites, and
5
00:01:02,230 --> 00:01:14,960
I told you that you also need to know matrix
algebra a bit. Now, under basic statistics,
6
00:01:14,960 --> 00:01:38,210
univariate statistics, that is, univariate descriptive statistics and univariate inferential statistics, are
7
00:01:38,210 --> 00:01:48,490
important.
So, again under univariate descriptive statistics
8
00:01:48,490 --> 00:02:15,900
usually central tendency and dispersion,
these two issues, are described under descriptive
9
00:02:15,900 --> 00:02:35,489
statistics. Under inferential statistics come estimation
and hypothesis testing; under estimation
10
00:02:35,489 --> 00:02:59,440
there will be point estimation and interval
estimation. Today, in this lecture,
11
00:02:59,440 --> 00:03:05,090
we will describe univariate descriptive statistics.
12
00:03:05,090 --> 00:03:17,250
You see the content of today’s lecture;
we will start with population and parameters
13
00:03:17,250 --> 00:03:23,639
then we will describe probability distributions,
particularly the normal probability distribution;
14
00:03:23,639 --> 00:03:30,660
then we will discuss sample and statistics, followed
by measures of central tendency, measures of
15
00:03:30,660 --> 00:03:41,620
dispersion and followed by references. Now,
do you have any idea about population?
16
00:03:41,620 --> 00:03:55,479
What do you mean by population? In the general sense,
we say the population of West Bengal,
17
00:03:55,479 --> 00:04:06,209
or the population of India, but in statistics
population has a much broader sense. For example,
18
00:04:06,209 --> 00:04:27,862
in the last class we described an example of a small company doing business in a city.
19
00:04:27,889 --> 00:04:49,120
So, the company has a production system. Can it be a population? If you define population from the statistics point
20
00:04:49,120 --> 00:05:04,430
of view, population is the entirety, the totality,
or the whole. The entirety or the
21
00:05:04,430 --> 00:05:16,599
totality, or we can say the whole: when we talk
about the population of West Bengal, that means
22
00:05:16,599 --> 00:05:25,229
each and every legal resident of
West Bengal is considered. From the production
23
00:05:25,229 --> 00:05:30,039
system point of view,
the word system also represents population
24
00:05:30,039 --> 00:05:40,539
in the way we understand the application of statistics,
so system is also a synonym for us. It
25
00:05:40,539 --> 00:05:51,819
is also a population, because a system can be
characterized by different variables. For example,
26
00:05:51,819 --> 00:06:00,780
for this company there are profit, sales volume,
and absenteeism. So many other variables we
27
00:06:00,780 --> 00:06:06,710
have discussed; these are basically what
characterize the population or the system.
28
00:06:06,710 --> 00:06:12,009
Another word for us could be process.
29
00:06:12,009 --> 00:06:25,810
A process we can also think of along this line.
From our purpose point
30
00:06:25,810 --> 00:06:37,749
of view, a process is something where
activities or transformation
31
00:06:37,749 --> 00:06:54,169
takes place. For example, you give inputs
such as raw materials, and the production
32
00:06:54,169 --> 00:07:12,259
process converts them into value-added output.
Now, if we consider the total life cycle of
33
00:07:12,259 --> 00:07:36,139
this process, then it will produce a large number of items.
34
00:07:36,139 --> 00:07:42,779
All items collectively are the entirety, the totality, or the whole; so in that sense, with respect
35
00:07:42,779 --> 00:07:56,539
to the items produced by this production process,
we can define a population. Similarly, if you go to
36
00:07:56,539 --> 00:08:03,479
the service sector, for example the health
care system or the banking system, there also
37
00:08:03,479 --> 00:08:11,490
you can define population. So, essentially,
if you want to define population, you need
38
00:08:11,490 --> 00:08:18,629
to keep two things in mind. Consider when
I talk about the population of West Bengal.
39
00:08:18,629 --> 00:08:30,120
Suppose, in this figure, this is the hilly
portion. Now, the population
40
00:08:30,120 --> 00:08:36,990
of the hilly region is different from
the population in the west of West Bengal
41
00:08:36,990 --> 00:08:45,980
or south of West Bengal. Now, for a particular
purpose you may be interested to understand
42
00:08:45,980 --> 00:08:52,300
what the educational status of the people
of the hilly region of West Bengal is.
43
00:08:52,300 --> 00:08:59,900
Then what is happening is you are making a
boundary, creating a boundary for the system,
44
00:08:59,900 --> 00:09:05,020
so this boundary is the hilly region. So,
in that case your population is this hilly
45
00:09:05,020 --> 00:09:13,120
region only. Now, if you think from the voting
point of view, suppose it is election time.
46
00:09:13,120 --> 00:09:21,290
So, all the legal voters
go to vote; in that case all the voters
47
00:09:21,290 --> 00:09:28,970
of the whole of West Bengal are the population.
So, in that sense, what this means is that
48
00:09:28,970 --> 00:09:43,000
if you really want to define population, the
boundary is very important. Are you getting me?
49
00:09:43,000 --> 00:09:51,180
Boundary in this sense: if you come
back again to the manufacturing scenario, in a
50
00:09:51,180 --> 00:09:56,700
manufacturing scenario you will find out that
the total production system may be composed
51
00:09:56,700 --> 00:10:05,500
of several subsystems. For example,
this may be machine 1, machine 2, and machine
52
00:10:05,500 --> 00:10:14,220
3, and they are doing different operations:
raw material comes here and is transferred to
53
00:10:14,220 --> 00:10:19,250
machine M 1, then goes to machine M 2; some
activity or other is going on. Now, if
54
00:10:19,250 --> 00:10:26,080
you are interested to infer something about
machine 1. Suppose you want to infer something
55
00:10:26,080 --> 00:10:36,350
about machine 1; then your population is this machine.
If you think that there are some common characteristics
56
00:10:36,350 --> 00:10:43,500
applicable to all the machines,
then you may be interested to see the totality
57
00:10:43,500 --> 00:10:53,430
including all the machines; then your population
will include all the machines. This is very
58
00:10:53,430 --> 00:11:04,270
important. Unless we understand population,
there is no use of statistics because statistics
59
00:11:04,270 --> 00:11:12,200
is used to infer about the population, inference
related to many things. During inferential
60
00:11:12,200 --> 00:11:20,540
statistics, we will be telling you what are
the different inferences possible, but for
61
00:11:20,540 --> 00:11:26,470
the time being, please understand that
when we talk about population, we talk about
62
00:11:26,470 --> 00:11:36,510
a system or a process. Why do we need to
study the process or the population? Because
63
00:11:36,510 --> 00:11:44,070
we want to understand the behavior of the
process or the system or in terms of the population
64
00:11:44,070 --> 00:11:49,160
you want to study the behavior.
65
00:11:49,160 --> 00:12:02,480
Now, what about the size of the population?
A population can be finite or can
66
00:12:02,480 --> 00:12:11,390
be infinite. Suppose I am talking about
the production of a process for 1 year, the number
67
00:12:11,390 --> 00:12:20,800
of items produced per year. If that is my
population then it is a finite population,
68
00:12:20,800 --> 00:12:30,510
so time is another aspect which is also
used to define the population.
69
00:12:30,510 --> 00:12:36,430
So, one is the boundary, another one is the
time; so these two are basically boundaries
70
00:12:36,430 --> 00:12:44,910
in the sense, space boundary and the time
boundary. So, if you go for the entire lifecycle
71
00:12:44,910 --> 00:12:51,170
of a process, then what will happen? Can you
count the number of outputs?
72
00:12:51,170 --> 00:12:59,160
It is very, very difficult. So, if we talk
about the entire lifecycle, the total time of
73
00:12:59,160 --> 00:13:03,480
the life of the process, what will happen is that the
number of items produced will be countably
74
00:13:03,480 --> 00:13:18,420
infinite. Whether countably infinite or infinite,
we basically define population in statistics in
75
00:13:18,420 --> 00:13:19,740
two senses.
76
00:13:19,740 --> 00:13:26,840
Your population will be a finite
population or an infinite population. A finite
77
00:13:26,840 --> 00:13:32,470
population means the size is known, that
is, N. For example, the number of items produced
78
00:13:32,470 --> 00:13:40,310
by a production shop in 2012. For an infinite population
the size is infinite: the number of items produced
79
00:13:40,310 --> 00:13:51,380
over the lifecycle of the process, which
is countably infinite. If you need further
80
00:13:51,380 --> 00:14:03,940
explanation: as I told you in the last class,
the random experiment is the issue; statistics
81
00:14:03,940 --> 00:14:10,330
deals with random variables.
Random variables come from random experiments;
82
00:14:10,330 --> 00:14:19,930
we generate a random variable based on the experiments
conducted. So, if we do one experiment like
83
00:14:19,930 --> 00:14:26,610
this, you see this figure: suppose
this is basically a container, and inside it
84
00:14:26,610 --> 00:14:35,800
there are red and white balls. Now, you pick
up one ball, then the next ball, like this, one after
85
00:14:35,800 --> 00:14:42,620
another, without replacement. What will happen?
After some time there will be no ball to pick
86
00:14:42,620 --> 00:14:53,930
up and the experiment will end; this is a finite population.
Now, in the other case, what we do in
87
00:14:53,930 --> 00:15:01,600
the second experiment is pick up and then replace,
so what will happen in that case?
88
00:15:01,600 --> 00:15:13,090
In that case, the balls will never be
exhausted. There are so many balls, red and white;
89
00:15:13,090 --> 00:15:19,950
what you are doing is picking one up and
finding out whether it is red or white.
90
00:15:19,950 --> 00:15:27,260
So, either red or white, so you are counting
that red then again you are replacing this,
91
00:15:27,260 --> 00:15:33,800
in a similar manner you continue this experiment.
What will happen to the size of the population?
92
00:15:33,800 --> 00:15:39,910
The number of balls will remain as it is, but from the
experimental point of view it will be different,
93
00:15:39,910 --> 00:15:49,660
so this is what an infinite population is.
In statistics, for most of the issues that will
94
00:15:49,660 --> 00:15:58,570
be discussed later on, we consider an infinite
population. In reality it may
95
00:15:58,570 --> 00:16:07,810
not be 100 percent true that all populations
are infinite. But countably infinite populations
96
00:16:07,810 --> 00:16:13,970
are many, and for our practical purposes, if
we consider an infinite population there
97
00:16:13,970 --> 00:16:33,810
is no problem. If you want to measure population
behavior, you need to know what
98
00:16:33,810 --> 00:16:41,570
variables govern the population,
in the sense that they characterize the population. So,
99
00:16:41,570 --> 00:16:58,350
population is characterized by different variables
applicable to the population.
100
00:16:58,350 --> 00:17:19,380
For example, if we consider the total students
of IIT Kharagpur, all students of IIT Kharagpur
101
00:17:19,380 --> 00:17:28,820
this is my population, all students of IIT
Kharagpur, and the Kharagpur students
102
00:17:28,820 --> 00:17:40,200
come from different demographics. Their demographics
differ, their family socio
103
00:17:40,200 --> 00:17:53,279
economic status differs, and their performance in
graduation, that means in IIT Kharagpur
104
00:17:53,279 --> 00:18:02,179
exams, also differs. So, for performance
you may be interested to see what the
105
00:18:02,179 --> 00:18:11,230
percentage of marks in tenth is, or the CGPA, your
cumulative grade point average. Or, something
106
00:18:11,230 --> 00:18:16,379
related to demography:
you may be interested to see what
107
00:18:16,379 --> 00:18:30,690
the age profile is; sometimes we may be interested
to know their height profile. You see age,
108
00:18:30,690 --> 00:18:40,850
sorry, height, age, percentage of marks, CGPA,
and under socio economic status, family income:
109
00:18:40,850 --> 00:18:50,100
these are all
variables which characterize the students
110
00:18:50,100 --> 00:19:01,100
of IIT Kharagpur. So, if you want to understand
population, not only the space and time boundary
111
00:19:01,100 --> 00:19:09,090
we also need to understand what the
variables are that govern the population; that
112
00:19:09,090 --> 00:19:18,419
is what we see basically.
If you consider any of the variables, say height,
113
00:19:18,419 --> 00:19:26,169
I am writing height of the students, and this
I am denoting as x, which is a random variable,
114
00:19:26,169 --> 00:19:35,690
let it be. Here, we are saying it is random
because if we just pick up one student you
115
00:19:35,690 --> 00:19:41,830
do not know what his height is; whatever you
measure, you find out the height, that is it.
116
00:19:41,830 --> 00:19:53,559
So, it is x. So, I want to characterize
the students in terms of their height, or you
117
00:19:53,559 --> 00:20:05,179
may be interested to characterize the students
in terms of their number of subjects completed
118
00:20:05,179 --> 00:20:15,840
in a year. We will find out that there are
many backlog cases; many students could
119
00:20:15,840 --> 00:20:16,789
not complete.
120
00:20:16,789 --> 00:20:20,950
So, in that sense it may so happen that if
we consider that there are 10 subjects to
121
00:20:20,950 --> 00:20:27,769
be completed, it may so happen that you will
find some students with 0 subjects completed, some
122
00:20:27,769 --> 00:20:34,830
students with 1, or maybe like this up to 10, although
it will be heavily biased towards 10. But,
123
00:20:34,830 --> 00:20:47,230
this is possible, and depending upon
what type of random variable you have considered,
124
00:20:47,230 --> 00:20:55,149
you need to use a certain
probability distribution.
125
00:20:55,149 --> 00:21:02,029
Last class I told you that if the variable
is discrete suppose x is a discrete variable,
126
00:21:02,029 --> 00:21:17,190
a discrete random variable, then you have to
use a discrete probability distribution, as we discussed
127
00:21:17,190 --> 00:21:21,659
last class. But we have not said what
those probability distributions are; later on we
128
00:21:21,659 --> 00:21:27,940
will see. But what you can see very easily
is this: suppose x is a discrete variable; it can
129
00:21:27,940 --> 00:21:37,549
take values 0, 1, 2, 3 like this.
Then if you make a tally chart, tally chart
130
00:21:37,549 --> 00:21:46,909
in the sense of frequency for 0, 1, 2, 3, 4, 5 like
this: suppose when you get a count of 0
131
00:21:46,909 --> 00:22:01,100
you put one mark like this. Then again
suppose a count of 0, then similarly, like this.
132
00:22:01,100 --> 00:22:04,509
This is the tally count. What happened?
The occurrence of 0 is 2 times, this
133
00:22:04,509 --> 00:22:18,450
one 6 times, this one 8 times, this one 4 times,
this one 2 times, this one 1 time. So, by categorization
134
00:22:18,450 --> 00:22:28,379
what do we mean? Here we mean that I have
my discrete random variable which can take
135
00:22:28,379 --> 00:22:47,379
different values, suppose 0, 1, 2, 3, 4, 5,
and each appears a different number of times; I think
136
00:22:47,379 --> 00:23:02,990
all of you know. This is nothing but the
frequency diagram and this frequency diagram
137
00:23:02,990 --> 00:23:11,249
if I know the total number and if you divide
each of the frequencies by their total then
138
00:23:11,249 --> 00:23:19,369
you will be getting the relative frequency. That
relative frequency will give you the empirical
139
00:23:19,369 --> 00:23:26,100
probability distribution and this distribution
is known as probability mass function.
140
00:23:26,100 --> 00:23:42,409
What we have called the p m f: here you see
that for discrete variables, when you get
141
00:23:42,409 --> 00:23:50,149
this type of plot, you are basically developing
a probability mass function. But I told you
142
00:23:50,149 --> 00:23:57,629
that we will be considering an infinite population,
and an infinite population means that the totality
143
00:23:57,629 --> 00:24:07,309
is not known. The second thing is that our variable
is random: what will happen the next minute, what
144
00:24:07,309 --> 00:24:13,100
value it will assume, we do not know.
So, in the population domain you
145
00:24:13,100 --> 00:24:19,730
cannot get that value immediately. When
you are in the population domain, yes, we
146
00:24:19,730 --> 00:24:25,779
will get the values when we go for the sampling.
But, at least before sampling we do not have
147
00:24:25,779 --> 00:24:33,840
all those values. So what you can do, as far as a particular variable is concerned, is you can
148
00:24:33,840 --> 00:24:41,509
expect something. What is this expectation?
Suppose we want to know what the average
149
00:24:41,509 --> 00:24:48,980
height of IIT students is; this is nothing but
the expected value of x, so that expected
150
00:24:48,980 --> 00:24:57,509
value of x, or of the variable of interest,
is known as the mean.
151
00:24:57,509 --> 00:25:14,539
That is the mean; mu stands for the mean, meaning the
expected value of x. When your variable x is
152
00:25:14,539 --> 00:25:27,230
a discrete variable, you will get it like this: the sum of
x f x over all i, sorry, all x.
153
00:25:27,230 --> 00:25:36,749
Whatever your number may be, for all x.
If you see this example here,
154
00:25:36,749 --> 00:25:43,850
we are saying that x can
take this value, this value, like this; there
155
00:25:43,850 --> 00:25:55,570
are 1, 2, 3, 4, 5, 6 values, so if I assume
that these values are nothing but 0, 1, 2,
156
00:25:55,570 --> 00:26:06,059
3, 4, 5, so these are all x values.
Suppose they are sitting here
157
00:26:06,059 --> 00:26:11,059
as discrete values, and
their probability values are like this. Suppose
158
00:26:11,059 --> 00:26:19,389
the first one is 0.15, the second one is
0.20, the third one is 0.25, the fourth one again
159
00:26:19,389 --> 00:26:32,700
you can write 0.20, the fifth one suppose 0.15.
Then what is left? The first five add up to 0.95,
160
00:26:32,700 --> 00:26:41,110
so 0.05. So then what is your expected value?
Here, x into f x you have to find out: 0 into
161
00:26:41,110 --> 00:26:53,369
0.15 is 0, 1 into 0.2 is 0.2, 2 into 0.25
is 0.50, 3 into 0.2 is 0.60, so like this
162
00:26:53,369 --> 00:27:05,190
again 0.60 and this will be 0.25. If you add,
what will you get? 0.20 plus 0.50 plus 0.60 plus 0.60 plus 0.25,
163
00:27:05,190 --> 00:27:14,799
that is 2.15. So that means that your expected
value is, if this is 0, this is 1, and this is
164
00:27:14,799 --> 00:27:26,409
2, somewhere here. So, if I draw it here, I can
say that suppose this is my 0 value, this
165
00:27:26,409 --> 00:27:39,149
is the value 1, this one is 2, and this one
is 3, this is your 4 and then 5. So,
166
00:27:39,149 --> 00:27:55,720
this is 0, 1, 2; somewhere here your value
is 2.15. This is what expectation is, but another
167
00:27:55,720 --> 00:27:59,359
measure is also there.
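The expected value worked out on the board can be checked with a few lines of Python. This is only a sketch: the values 0 to 5 and the probabilities 0.15, 0.20, 0.25, 0.20, 0.15, 0.05 are the ones assumed in the lecture example, and the names `values`, `probs`, and `mean` are chosen just for illustration.

```python
# Values of x and their probabilities f(x) from the lecture example.
values = [0, 1, 2, 3, 4, 5]
probs = [0.15, 0.20, 0.25, 0.20, 0.15, 0.05]

# The probabilities of a p m f must add up to 1.
assert abs(sum(probs) - 1.0) < 1e-9

# Expected value: the sum of x * f(x) over all x.
mean = sum(x * p for x, p in zip(values, probs))

print(round(mean, 2))  # 2.15
```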
168
00:27:59,359 --> 00:28:03,700
So another one is sigma square, so we have
169
00:28:03,700 --> 00:28:13,659
said here mu, that is the mean, which we have just
described; there is also another measure
170
00:28:13,659 --> 00:28:26,299
which is sigma square, that is the variance.
What is variance? It is the expected value of x minus
171
00:28:26,299 --> 00:28:45,499
mu whole square. So, for this discrete
case you will write the sum over all x of x minus mu whole square
172
00:28:45,499 --> 00:28:49,710
f x.
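The same example can be carried through to the variance. The following is a minimal sketch, assuming the same six values and probabilities as in the expectation example; the names `values`, `probs`, `mean`, `variance`, and `std_dev` are illustrative only.

```python
import math

# Values of x and their probabilities f(x) from the lecture example.
values = [0, 1, 2, 3, 4, 5]
probs = [0.15, 0.20, 0.25, 0.20, 0.15, 0.05]

# Mean: the sum of x * f(x) over all x.
mean = sum(x * p for x, p in zip(values, probs))

# Variance: the sum of (x - mu)^2 * f(x) over all x.
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))

# Standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(round(variance, 4))  # 2.0275
print(std_dev)
```

Each term is formed exactly as described: subtract mu from x, square it, multiply by f x, then add up.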
173
00:28:49,710 --> 00:28:56,480
You have computed two things here. What are
those things? You have x, then
174
00:28:56,480 --> 00:29:05,830
with x f x you know the mu value,
the mean value. Now, you can create x
175
00:29:05,830 --> 00:29:12,669
minus mu, that means 0 minus 2.15, that is
minus 2.15; like this you can calculate, and
176
00:29:12,669 --> 00:29:23,309
then you square it, multiply it by f x, then add up.
So, you will get sigma square, that
177
00:29:23,309 --> 00:29:37,480
is the variance part. What have we assumed here? We have assumed that x can take these
178
00:29:37,480 --> 00:29:44,710
six values only and this is the probability
mass function. So, what will be the sum total
179
00:29:44,710 --> 00:29:57,269
of these probability values? Then, what are mu
and sigma or sigma square?
180
00:29:57,269 --> 00:30:04,730
Mu is the long-run mean.
Correct.
181
00:30:04,730 --> 00:30:07,830
Sigma is...
The long-run standard deviation. Does that mean you
182
00:30:07,830 --> 00:30:13,419
are saying that the mean and standard
deviation will vary for a population for a
183
00:30:13,419 --> 00:30:24,389
particular characteristic?
It should not, for a large population.
184
00:30:24,389 --> 00:30:31,509
No, even for small population.
It should not vary.
185
00:30:31,509 --> 00:30:36,909
It should not vary; it is a constant when
we talk about a parameter.
186
00:30:36,909 --> 00:30:48,449
So, actually, mu and sigma square here are what in statistics we call the population mean and population
187
00:30:48,720 --> 00:31:09,880
variance. They are constant; another issue is that they are also not known. Here, we have assumed
188
00:31:10,210 --> 00:31:16,309
a finite population, a very small population,
and we have calculated something like this. If the
189
00:31:16,309 --> 00:31:22,330
population size is infinite, you will not get
all values of x; you will never get them.
190
00:31:22,330 --> 00:31:32,460
If the size is infinite, that means you cannot
measure this or say compute this, but probably
191
00:31:32,460 --> 00:31:39,970
what you can do is expect something; that
is why the term expectation is used
192
00:31:39,970 --> 00:31:49,999
here. So, if I say population
parameter, now you can understand that these
193
00:31:49,999 --> 00:31:56,200
two are population parameters. But by my saying these two are population parameters,
194
00:31:56,200 --> 00:32:03,759
please do not conclude that there are no other
population parameters; these two are some of
195
00:32:03,759 --> 00:32:08,029
the many population
parameters, these two being mean and standard deviation.
196
00:32:08,029 --> 00:32:18,039
Standard deviation or variance, they are population
parameters. Why do we go for population parameters?
197
00:32:18,039 --> 00:32:27,009
Today’s topic is
a very simple topic from the calculation point of view.
198
00:32:27,009 --> 00:32:36,239
From the understanding point of view, we must understand
why we require population parameters. We require
199
00:32:36,239 --> 00:32:41,049
population parameters because if you know these
two parameters and you also know that your
200
00:32:41,049 --> 00:32:46,460
x is a random variable that follows a certain
probability distribution. If you know that
201
00:32:46,460 --> 00:32:53,629
distribution and also if you know the parameters,
what happens is you do not need to go to
202
00:32:53,629 --> 00:33:00,600
that particular process or system for further
study, as far as this particular variable is concerned.
203
00:33:00,600 --> 00:33:10,440
If I say that the absenteeism on the shop
floor, for the production shop considered,
204
00:33:10,440 --> 00:33:18,220
follows something like a normal distribution
and we know that its mean is mu and sigma
205
00:33:18,220 --> 00:33:25,979
square is the variance component, that means
I am in a position to know this distribution. So, if I know
206
00:33:25,979 --> 00:33:32,399
the distribution, what is actually happening
here? That real production shop, from the worker
207
00:33:32,399 --> 00:33:40,210
performance point of view, the absenteeism
is converted to a mathematical equation, a
208
00:33:40,210 --> 00:33:47,359
statistical equation. That is the advantage.
That means if I truly know what
209
00:33:47,359 --> 00:33:53,359
the probability distribution is with respect
to a variable and what the population
210
00:33:53,359 --> 00:33:58,749
parameters are for that variable, I have the distribution in hand.
211
00:33:58,749 --> 00:34:06,600
So, I do not need to go further so long as
the process does not change. By process does
212
00:34:06,600 --> 00:34:16,030
not change, what I mean to say is: suppose
there is a machine; over time the machine condition
213
00:34:16,030 --> 00:34:23,450
deteriorates. That means today it is a new machine
with its performance; now, after 10 years, the
214
00:34:23,450 --> 00:34:28,770
machine will not perform at the same
level. That means the characteristics
215
00:34:28,770 --> 00:34:33,679
change. So long as the characteristics are not
changing, the distribution itself is enough
216
00:34:33,679 --> 00:34:47,309
for you. Now, suppose your variable is not discrete; your variable is continuous. You see what
217
00:34:47,309 --> 00:35:01,470
this one is on the left hand side: is it a p m f or
a p d f? A p d f; this is a probability density function.
218
00:35:01,470 --> 00:35:06,450
Now, why in the continuous case do we say probability
density function,
219
00:35:06,450 --> 00:35:20,549
whereas in the discrete case we say probability
mass function? You think about this one. So, here
220
00:35:20,549 --> 00:35:29,329
also, we know that this particular population
has mean and variance components. When it
221
00:35:29,329 --> 00:35:35,450
is in the continuous level you have to use
these two equations for expectation, so basically
222
00:35:35,450 --> 00:35:42,670
integration will come into the picture. Mu is the
integral from minus infinity to plus infinity,
223
00:35:42,670 --> 00:35:51,039
depending on the range for which the variable
is defined, of x f x d x, and your sigma square
224
00:35:51,039 --> 00:35:54,299
is nothing but the expected value of x minus mu whole square.
225
00:35:54,299 --> 00:36:11,260
This is the integral from minus infinity to plus infinity of x minus
mu whole square f x d x. So, these are the parameters
226
00:36:11,260 --> 00:36:24,280
mu and sigma square. I hope that you now understand what a population is, and that a population
227
00:36:24,280 --> 00:36:33,319
is characterized by a probability distribution
when the random variable has a probability distribution.
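For the continuous case, the two integrals for mu and sigma square can be approximated numerically. The sketch below is only an illustration under stated assumptions: it assumes a normal density with mean 80 and standard deviation 3 (so variance 9), uses a simple midpoint rule, and truncates the infinite range at 10 standard deviations on each side. The helper names `normal_pdf` and `integrate` are made up for this example.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density f(x) of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(g, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of g from lo to hi."""
    width = (hi - lo) / steps
    return sum(g(lo + (i + 0.5) * width) for i in range(steps)) * width

# Assumed parameters for illustration only.
mu_true, sigma_true = 80.0, 3.0

# Minus infinity to plus infinity is truncated to mu +/- 10 sigma,
# outside of which the density is negligible.
lo = mu_true - 10.0 * sigma_true
hi = mu_true + 10.0 * sigma_true

def f(x):
    return normal_pdf(x, mu_true, sigma_true)

# mu = integral of x f(x) dx
mean = integrate(lambda x: x * f(x), lo, hi)

# sigma^2 = integral of (x - mu)^2 f(x) dx
variance = integrate(lambda x: (x - mean) ** 2 * f(x), lo, hi)

print(round(mean, 3), round(variance, 3))
```

The point of the sketch is only that the summation in the discrete formulas is replaced by integration; the recovered values should sit very close to the assumed 80 and 9.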
228
00:36:33,319 --> 00:36:39,240
If you know, for that variable, the parameters
of the distribution, you have characterized
229
00:36:39,240 --> 00:36:49,589
the population; that is what is known as characterization
of population in terms of probability distribution.
230
00:36:49,589 --> 00:36:55,809
Now, there are many probability distributions.
231
00:36:55,809 --> 00:37:05,280
You see here, we can see them
under two heads: discrete distributions and
232
00:37:05,280 --> 00:37:10,730
continuous distributions. Under discrete distributions:
binomial distribution, Poisson distribution,
233
00:37:10,730 --> 00:37:16,760
negative binomial, geometric, hypergeometric,
and many more in the series. Similarly, continuous:
234
00:37:16,760 --> 00:37:22,369
normal, lognormal, exponential, Weibull, gamma,
so many distributions; they are probability
235
00:37:22,369 --> 00:37:28,170
distributions. We will not discuss all the distributions.
We will discuss only the normal distribution
236
00:37:28,170 --> 00:37:36,910
here, because in multivariate statistical modeling
the normality assumption
237
00:37:36,910 --> 00:37:46,530
is a very vital one. Many of the models
assume normality of the data; of course, at
238
00:37:46,530 --> 00:37:52,180
the multivariate level that will be multivariate
normality. So we will discuss only the normal
239
00:37:52,180 --> 00:37:54,650
distribution. For the other distributions,
240
00:37:54,650 --> 00:38:03,700
you can follow the Johnson book; you
can follow this. And I am sure all of you are
241
00:38:03,700 --> 00:38:13,849
familiar with this distribution; this is what
the normal distribution is. It looks like this,
242
00:38:13,849 --> 00:38:24,900
and this mu is the center point here, and
it is basically symmetric. So, the maximum
243
00:38:24,900 --> 00:38:31,710
number of observations you will find
around this level, and on both sides
244
00:38:31,710 --> 00:38:41,619
it will gradually reduce. Finally, after the 3
sigma level it will be almost negligible, close to the 0
245
00:38:41,619 --> 00:38:49,690
level, like this. Now, how do we read this normal
distribution? You see that within plus
246
00:38:49,690 --> 00:38:57,280
minus 1 sigma, this is very important, minus
1 sigma to plus 1 sigma, 68.27 percent of observations
247
00:38:57,280 --> 00:39:04,289
fall within this. Then within the plus minus 2 sigma level it will be
248
00:39:04,289 --> 00:39:14,000
a little more than 95 percent, but not 96: 95
point something. And if you consider the plus minus
249
00:39:14,000 --> 00:39:26,069
3 sigma level, then 99.73 percent of observations
will fall within this zone. Now,
250
00:39:26,069 --> 00:39:35,059
this is important because suppose you think
you are producing something and your variable
251
00:39:35,059 --> 00:39:43,030
of interest follows a normal distribution. Now,
from the data you will get that the distribution
252
00:39:43,030 --> 00:39:50,280
is like this. What will happen? The spread
of this particular variable's values is such that within
253
00:39:50,280 --> 00:39:56,880
the plus minus 3 sigma level 99.73 percent of
the items produced will fall.
254
00:39:56,880 --> 00:40:06,119
Now, what will happen if the customer
is not interested in this wide range?
255
00:40:06,119 --> 00:40:14,500
For example, suppose this is my quality characteristic
x and this is the lower specification limit;
256
00:40:14,500 --> 00:40:34,359
I am giving the physical interpretation.
Suppose this is your upper specification limit;
257
00:40:34,359 --> 00:40:40,059
then what will happen ultimately? Suppose
this is the mean; this follows a normal distribution
258
00:40:40,059 --> 00:40:55,480
and your distribution may be like this. So,
let these limits be minus 2 sigma and plus 2 sigma;
259
00:40:55,480 --> 00:41:04,069
so what will happen ultimately is that
almost 5 percent of your production is a rejected
260
00:41:04,069 --> 00:41:07,400
product, because the customer will
not accept it.
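The percentages for the 1, 2, and 3 sigma levels, and the rejected fraction when the specification limits sit at minus 2 sigma and plus 2 sigma, can be checked with the standard library error function. This is a sketch under the assumption that the variable is exactly normal; the function name `coverage` is chosen only for this illustration.

```python
import math

def coverage(k):
    """Fraction of a normal population between mu - k*sigma and mu + k*sigma."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"within plus minus {k} sigma: {100 * coverage(k):.2f} percent")

# With specification limits at minus 2 sigma and plus 2 sigma,
# everything outside the limits is rejected.
rejected = 1.0 - coverage(2)
print(f"rejected with limits at plus minus 2 sigma: {100 * rejected:.2f} percent")
```

The result does not depend on the particular mu and sigma, because the limits are expressed in units of sigma around the mean.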
261
00:41:07,400 --> 00:41:11,609
So, when I talk about characterization
of the process, that means with respect to
262
00:41:11,609 --> 00:41:20,940
this distribution: this is exactly your process,
this is the representation of the
263
00:41:20,940 --> 00:41:24,690
process; you are not going to the shop floor.
But this is the case: your customer region
264
00:41:24,690 --> 00:41:30,680
is here and you are producing at this level;
5 percent of your production is not accepted
265
00:41:30,680 --> 00:41:38,789
by the customer. You want to improve it, you
are getting me? You want to improve it; how
266
00:41:38,789 --> 00:41:40,289
will you do it?
267
00:41:40,289 --> 00:41:45,760
Can you explain this figure? This figure, you
see, is again basically, as I told you,
268
00:41:45,760 --> 00:41:51,109
characterization of a process through probability
distribution; this is another example. The
269
00:41:51,109 --> 00:41:57,049
quality of service provided measured on a
100 point scale at three service centers A,
270
00:41:57,049 --> 00:42:06,920
B and C is normally distributed as N 80, 9;
80 stands for the mean value, 9 stands for
271
00:42:06,920 --> 00:42:13,109
the variance, because our general notation
what we will be following is normally distribution.
272
00:42:13,109 --> 00:42:23,779
N stands for normally distributed; the first
is mu and the second is sigma square, so mu
273
00:42:23,779 --> 00:42:30,440
is this and sigma square is this. This is
the general notation we will be following
274
00:42:30,440 --> 00:42:40,770
all through. Then your second process, B,
is N(80, 16) and the third one, C, is N(90, 9), and I have
275
00:42:40,770 --> 00:42:47,480
plotted the probability distribution for all
the three processes A, B and C. Please
276
00:42:47,480 --> 00:42:53,079
remember that your variable of interest is
quality of service provided. If I ask you
277
00:42:53,079 --> 00:43:15,220
which process is better, you will say C, yes
or no? Yes. Why? The mean; yes, you are right, the mean
278
00:43:15,220 --> 00:43:19,000
is 90, and it is quality of service provided on a 100 point scale.
279
00:43:19,000 --> 00:43:25,539
The higher the value, the better the process; and in parallel, you see, the variance
280
00:43:25,539 --> 00:43:33,880
is 9. So, from the mean point of view it
is at a higher level, and from the variability point
281
00:43:33,880 --> 00:43:43,910
of view it is at the lowest level when you
compare the three processes. Now, I ask you
282
00:43:43,910 --> 00:43:53,890
to compare A and B: which one is better? A,
because the variability is lower with the mean at the
283
00:43:53,890 --> 00:44:03,020
same level. So what is the physical interpretation
of this? Physically, what happens is that
284
00:44:03,020 --> 00:44:14,170
variability is the most difficult parameter;
it is very difficult to control variability. I am
285
00:44:14,170 --> 00:44:17,390
giving you another good example here.
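The comparison of A, B and C can be made concrete by turning each variance into a standard deviation and looking at the plus or minus 3 sigma range of each center. This is a small sketch using only the three N(mean, variance) pairs stated in the lecture; the choice of the 3 sigma range as the yardstick is an assumption made for illustration.

```python
import math

# (mean, variance) for the three service centers, as given in the lecture
centers = {"A": (80, 9), "B": (80, 16), "C": (90, 9)}

spread = {}
for name, (mu, var) in centers.items():
    sigma = math.sqrt(var)  # standard deviation from the variance
    spread[name] = (mu - 3 * sigma, mu + 3 * sigma)
    print(f"{name}: mean={mu}, sigma={sigma:.0f}, 3-sigma range={spread[name]}")
```

C sits at a higher level than A with the same spread, while B has the same mean as A but a wider range, which is why A is preferred over B.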
286
00:44:17,390 --> 00:44:27,950
Suppose, you all know
archery; suppose this is the bull's eye.
287
00:44:27,950 --> 00:44:40,200
Our gold medal winner, what is his name, that bullet; anyhow, this is the target.
288
00:44:40,200 --> 00:45:00,460
Now, for someone all the shots are here, and someone
else's shooting is like this; this is the bull's eye.
289
00:45:00,460 --> 00:45:11,730
You are the trainer of two shooters, A and B;
with training, who will improve first? The first
290
00:45:11,730 --> 00:45:22,319
one, because for the first one the precision
level is higher than for the second one. What will
291
00:45:22,319 --> 00:45:29,549
happen? You can shift this mean value to this.
292
00:45:29,549 --> 00:45:35,789
But here, this one is variable, and
when something is variable it is difficult. Even from the students'
293
00:45:35,789 --> 00:45:41,480
point of view, some students are very erratic,
variable; that is very, very difficult. But some students
294
00:45:41,480 --> 00:45:47,520
may, for some reason, be very regular,
but basically they are always coming
295
00:45:47,520 --> 00:45:54,289
late by a few minutes. They are late,
but they are still coming, and you can motivate
296
00:45:54,289 --> 00:45:58,549
them. So this is the physical meaning of the distribution.
297
00:45:58,549 --> 00:46:04,680
Now, we will come to the next important concept,
which is called sample and statistics.
298
00:46:04,680 --> 00:46:16,190
So, we have seen so far population and parameters
and please remember the random variable is
299
00:46:16,190 --> 00:46:21,940
there everywhere. So, population and parameters:
when we talk about population we definitely
300
00:46:21,940 --> 00:46:27,339
talk about parameters, and in particular we talk about
the probability distribution also.
301
00:46:27,339 --> 00:46:41,089
The probability distribution of the variable
of interest, that is,
302
00:46:41,089 --> 00:46:53,710
x, the variable of interest. Now see, you are
planning to collect data. What have you thought?
303
00:46:53,710 --> 00:47:01,920
That I know my variable.
That variable is x, and I want to collect data.
304
00:47:01,920 --> 00:47:09,369
How many data points do you want to collect? You want
to collect N data points. What can you say
305
00:47:09,369 --> 00:47:21,039
about each of the observations? What can you
expect, are you getting me, if x is normally distributed?
306
00:47:21,039 --> 00:47:28,829
Suppose x is normally distributed with mu
and sigma square; that means X 1 is also normally
307
00:47:28,829 --> 00:47:32,720
distributed with mean mu and variance sigma square, and X
2 is also normally distributed with mu and sigma
308
00:47:32,720 --> 00:47:37,200
square, because they are coming from the same
population. There is no guarantee: when
309
00:47:37,200 --> 00:47:41,359
you go, you will observe some value
of x; somebody else will go and he will observe
310
00:47:41,359 --> 00:47:46,109
some different value of x, even though the
observation is the first observation for you
311
00:47:46,109 --> 00:47:53,269
and it is the first for him also.
Keep this in mind, this is very important:
312
00:47:53,269 --> 00:48:03,309
you have not collected data; that is, before
data collection you are planning that you
313
00:48:03,309 --> 00:48:08,940
will be collecting N data points. So, X 1 to
X n: first you will observe X 1, then X 2, up to X n.
314
00:48:08,940 --> 00:48:17,930
Now what is the issue here? All these
values are random; they are basically
315
00:48:17,930 --> 00:48:25,869
random and unknown. Now, suppose that
you have collected the data. After data collection
316
00:48:25,869 --> 00:48:36,069
what will happen? You have collected the data, and
I am denoting it in terms of small x. Let it
317
00:48:36,069 --> 00:48:51,220
be this small x, so x 1, x 2, x i, x n. What
are these values? Known values, realized; every
318
00:48:51,220 --> 00:49:02,039
value is realized, known and constant. This
is in the population domain.
319
00:49:02,039 --> 00:49:16,799
This is now a sample of size N. You see this
slide: here, before data collection, you have
320
00:49:16,799 --> 00:49:23,059
planned to collect N data points and these
are unknown and random. When you collect data,
321
00:49:23,059 --> 00:49:34,170
after collection they are already known, fixed
values; that is the big difference. Another
322
00:49:34,170 --> 00:49:40,819
important concept you should keep in mind is that
each of the observations will
323
00:49:40,819 --> 00:49:47,410
follow the same probability distribution,
meaning that if it is a normal distribution
324
00:49:47,410 --> 00:49:54,329
with mu and sigma square as mean and variance,
X 1 also has a normal distribution with mean
325
00:49:54,329 --> 00:50:02,890
mu and sigma square as variance. Once you
have collected the data, forget about the distribution;
326
00:50:02,890 --> 00:50:15,119
these are fixed values, with no randomness in the
data, as it is already collected.
327
00:50:15,119 --> 00:50:25,039
This is an example, the last example, that is, the
small company. We can give a name to
328
00:50:25,039 --> 00:50:31,819
the company; I have given city cam. Later
on I will use city cam; the company name is
329
00:50:31,819 --> 00:50:38,460
city cam. So these are the profit, sales volume,
absenteeism, all those things. What is this?
330
00:50:38,460 --> 00:50:44,609
Basically, we are characterizing the
city cam process in totality in terms of these
331
00:50:44,609 --> 00:50:54,930
variables, and your sample mean and sample
variance. These are the statistics corresponding
332
00:50:54,930 --> 00:51:00,510
to the population mean and population variance,
correct?
333
00:51:00,510 --> 00:51:15,390
Do you know what this is? This is a dot plot; a dot
plot is something
334
00:51:15,390 --> 00:51:17,799
like this.
335
00:51:17,799 --> 00:51:27,349
Your variable is x and you arrange it from the smallest
to the largest. Now suppose this is one value,
336
00:51:27,349 --> 00:51:30,690
this is the second value, and this is the third value;
like this, different values are there. Suppose at this
337
00:51:30,690 --> 00:51:38,809
value you have two observations;
there let it be three observations, and here let
338
00:51:38,809 --> 00:51:45,039
it be five observations. This is the type of plotting
you are doing here. So, again, suppose
339
00:51:45,039 --> 00:51:50,599
this again has two values, and this, let it be two
values. A plot like this is known as a dot plot. A dot
340
00:51:50,599 --> 00:51:57,380
plot is again similar to a histogram.
Now, here you are able to count the number
341
00:51:57,380 --> 00:52:05,460
of observations against each of the values
of x, so it will help you to find out
342
00:52:05,460 --> 00:52:16,519
the mode. Suppose, for this example,
profit in rupees million, you take the 9 million
343
00:52:16,519 --> 00:52:25,260
rupees case: there are two observations. For 10
it is 3, for 11 it is 4, for 12 it is 3; that
344
00:52:25,260 --> 00:52:34,750
means the mode of the data points for profit
is 11. The mean you have already seen. The median is
345
00:52:34,750 --> 00:52:41,299
the middle value. How do you compute the median?
For computation of the median you find out the
346
00:52:41,299 --> 00:52:50,490
position N plus 1 by 2, where N is the number
of observations.
347
00:52:50,490 --> 00:52:57,250
In this particular example, N is equal
to 12 because it is 12 months of data, so 12 plus 1
348
00:52:57,250 --> 00:53:08,430
by 2, that means 6.5. So, 6.5 means that when you
arrange your data from the smallest to the largest,
349
00:53:08,430 --> 00:53:13,549
you just find out the sixth and seventh
positions. Suppose this is the sixth position and
350
00:53:13,549 --> 00:53:18,750
this is your seventh position; you take the
average of these two values, the sixth position
351
00:53:18,750 --> 00:53:27,500
value and the seventh position value. And what
is the mode? It is the value of x for which there
352
00:53:27,500 --> 00:53:29,230
are the maximum occurrences.
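The dot plot counts and the median position rule can be reproduced in a few lines. This is a sketch that assumes the four counts read out in the lecture (9 twice, 10 three times, 11 four times, 12 three times) make up all 12 monthly profit values; the original data table is not shown in the transcript.

```python
from collections import Counter
import statistics

# Monthly profit (Rs. million), rebuilt from the counts read off the dot plot:
# 9 -> 2 months, 10 -> 3, 11 -> 4, 12 -> 3 (assumed to cover all 12 months).
profit = [9] * 2 + [10] * 3 + [11] * 4 + [12] * 3

# Text version of the dot plot: one dot per observation at each value.
for value, count in sorted(Counter(profit).items()):
    print(f"{value:>3} | {'.' * count}")

n = len(profit)
position = (n + 1) / 2                  # median position, here 6.5
ordered = sorted(profit)
median = (ordered[5] + ordered[6]) / 2  # average of the 6th and 7th values
mode = statistics.mode(profit)          # value with the most occurrences
mean = statistics.mean(profit)

print(n, position, median, mode, round(mean, 2))
```

With these assumed data the sixth and seventh ordered values are both 11, so the median and the mode coincide at 11.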
353
00:53:29,230 --> 00:53:36,619
This is the calculation for that data, and
by using an Excel sheet you can very easily
354
00:53:36,619 --> 00:53:42,880
calculate these things. Now, measure of dispersion;
the measure of dispersion is this.
355
00:53:42,880 --> 00:53:52,460
What we have seen is that if the data follows,
suppose, a normal distribution, then this one
356
00:53:52,460 --> 00:53:59,680
is mu, and how far it is going on this side
and this side, that is the dispersion. There are
357
00:53:59,680 --> 00:54:07,049
several ways to measure dispersion. One is the
range, that is, the maximum minus the minimum
358
00:54:07,049 --> 00:54:13,960
value. Another one is the inter quartile range,
which is the third quartile minus the first
359
00:54:13,960 --> 00:54:24,819
quartile. The position of the first
quartile, Q 1, is
360
00:54:24,819 --> 00:54:37,480
N plus 1 by 4; then the Q 2 position is the
median, which is N plus 1 by 2. The Q 3 position,
361
00:54:37,480 --> 00:54:49,490
that is, the third quartile, is 3 into
N plus 1 by 4. So all those position values
362
00:54:49,490 --> 00:54:54,890
you have to find out, and then appropriately
you have to manipulate the data.
363
00:54:54,890 --> 00:55:02,250
So, if the position is coming in the middle of
two values, then you take the average.
364
00:55:02,250 --> 00:55:08,319
If it is not coming in the middle, maybe to the right
of the middle, at a point 75 position,
365
00:55:08,319 --> 00:55:17,529
suppose the 3.75 position, then accordingly a
weightage of 0.75 is to be given for that data.
366
00:55:17,529 --> 00:55:24,839
These are all very simple things you will
be able to find out; these are equally
367
00:55:24,839 --> 00:55:33,500
important and you are required to know these
things also. You know the variance also, yes or
368
00:55:33,500 --> 00:55:43,150
no? How do you compute the variance? In the statistical sense,
s square is 1 by n minus 1 into the sum over i
369
00:55:43,150 --> 00:56:08,930
equal to 1 to n of x i minus x bar, squared. This is the variability measure. Why this n minus 1?
370
00:56:08,930 --> 00:56:23,559
Minimum variance unbiased estimator.
371
00:56:23,559 --> 00:56:33,680
Any other explanation? Now, later on we will
be discussing very frequently the
372
00:56:33,680 --> 00:56:46,490
degrees of freedom, are you getting me? For degrees of
freedom we will be using d o f or d f. Now
373
00:56:46,490 --> 00:56:52,569
see, in this case this n minus 1 is coming
because of degrees of freedom. You
374
00:56:52,569 --> 00:57:00,599
have n data points, x 1 to x n.
When you are computing this variance, what do you
375
00:57:00,599 --> 00:57:12,359
require? You require x bar
to be computed, and x bar is computed with
376
00:57:12,359 --> 00:57:21,269
this formulation. So, what has happened ultimately?
Here, when you find out x i minus
377
00:57:21,269 --> 00:57:28,690
x bar, that is, x 1 minus x bar, x 2 minus x
bar, like this, the last one you do not
378
00:57:28,690 --> 00:57:37,599
require to compute; it is automatically determined.
So, I will write here, suppose, x n minus x
379
00:57:37,599 --> 00:57:49,759
bar. What I mean is that, suppose I write
the sum of x i minus x bar, what will be the
380
00:57:49,759 --> 00:58:02,549
value? Now, what is the mean value?
So that means summation of x i minus summation
381
00:58:02,549 --> 00:58:11,099
of x bar; this is n x bar minus n x bar. So,
this is 0. So what will happen ultimately is that
382
00:58:11,099 --> 00:58:19,259
you are not getting n free x i minus x bar
values; one value is fixed. Very simple; in another way,
383
00:58:19,259 --> 00:58:20,650
if you say.
384
00:58:20,650 --> 00:58:27,650
Suppose I have given you one equation, x plus
y plus z equal to 5. What is the degree of
385
00:58:27,650 --> 00:58:32,500
freedom here? You see, if you change x and
y, z cannot be changed further; it is fixed,
386
00:58:32,500 --> 00:58:36,779
even though three values are there. You
have two degrees of freedom because the sum
387
00:58:36,779 --> 00:58:43,039
is fixed at 5, and in this case also
the same thing is happening, that is, the sum
388
00:58:43,039 --> 00:58:51,869
of all these deviations will be 0. So, how
many free data points do you have in the s square
389
00:58:51,869 --> 00:58:59,329
equation? n minus 1. That is another explanation
of why we divide by n minus 1 while
390
00:58:59,329 --> 00:59:01,460
computing this.
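The degrees of freedom argument can be checked numerically: the deviations from x bar always sum to zero, so the last deviation is fixed by the other n minus 1. This sketch again uses the assumed 12 profit values and exact fractions to avoid rounding noise.

```python
from fractions import Fraction

profit = [9] * 2 + [10] * 3 + [11] * 4 + [12] * 3  # assumed data
x = [Fraction(v) for v in profit]                  # exact arithmetic
n = len(x)

x_bar = sum(x) / n
deviations = [xi - x_bar for xi in x]

# The deviations always sum to zero ...
total = sum(deviations)
# ... so the last one is determined by the first n - 1.
last_from_others = -sum(deviations[:-1])

# Sample variance: sum of squared deviations divided by n - 1.
s_squared = sum(d * d for d in deviations) / (n - 1)

print(total, last_from_others == deviations[-1], s_squared, float(s_squared))
```

Only n minus 1 of the deviations are free to vary, which matches the divisor in the s square formula.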
391
00:59:01,460 --> 00:59:15,579
Now, I will finish this lecture. See, we
have told you that the normal distribution
392
00:59:15,579 --> 00:59:21,779
is a very important one, and later on for this
subject the multivariate normal distribution also.
393
00:59:21,779 --> 00:59:29,019
So, we must know who the father of the
normal distribution is, and you see, it is Abraham
394
00:59:29,019 --> 00:59:35,450
De Moivre, a French-born English mathematician.
He basically has given the general form of
395
00:59:35,450 --> 00:59:41,079
this normal distribution. What form? You will
see that it is one by root over 2 pi sigma square, into
396
00:59:41,079 --> 00:59:49,410
e to the power minus half into x minus mu by sigma,
whole squared.
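Written out, the spoken expression corresponds to the usual normal density. This is a transcription of the formula as read in the lecture, with f(x) used here as an assumed name for the density.

```latex
f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\,
       \exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right],
\qquad -\infty < x < \infty
```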
397
00:59:49,410 --> 01:00:00,650
Now, he is considered the father of the normal
distribution, but the equation alone is not
398
01:00:00,650 --> 01:00:10,029
sufficient. What happened later is that Gauss,
who is another famous mathematician and statistician,
399
01:00:10,029 --> 01:00:16,170
has given the properties. So,
all the statistical properties of the normal distribution
400
01:00:16,170 --> 01:00:27,069
were identified and tested by Carl Friedrich Gauss;
he is specifically a German mathematician,
401
01:00:27,069 --> 01:00:33,849
they are not mathematician.
If you see this quote, it is very interesting:
402
01:00:33,849 --> 01:00:45,089
it is not knowledge, but the act of learning,
not possession, but the act of getting there,
403
01:00:45,089 --> 01:00:52,029
which grants the greatest enjoyment. So, suppose
you are doing a PhD; so long as you are not getting
404
01:00:52,029 --> 01:00:58,480
the PhD you are thinking, once I get the PhD, I will
be very happy. But it is not true; once you
405
01:00:58,480 --> 01:01:04,579
get it, within 2 or 3 days you will find out that
you are the same person. But learning, going
406
01:01:04,579 --> 01:01:10,670
there, the act of going there, that is what
is very, very important. These famous people
407
01:01:10,670 --> 01:01:17,490
have said this, and we must follow
all their suggestions.
408
01:01:17,490 --> 01:01:22,559
Thank you very much. In the next class I will tell
you about sampling distributions.