1
00:00:18,520 --> 00:00:28,920
Good morning, welcome to the first lecture
of applied multivariate statistical modeling.
2
00:00:28,920 --> 00:00:33,310
Let me tell you the content of today’s
presentation.
3
00:00:33,310 --> 00:00:42,770
So, we will start with introduction, then
variables, data types, data sources, models
4
00:00:42,770 --> 00:00:48,830
and modeling followed by principles of modeling,
statistical approaches to model building,
5
00:00:48,830 --> 00:00:58,030
multivariate models, some illustrative examples,
three cases followed by references. The entire
6
00:00:58,030 --> 00:01:01,989
content will be covered in two hours.
7
00:01:01,989 --> 00:01:16,070
Today, I will try to finish up to principles
of modeling. Let us start by defining what
8
00:01:16,070 --> 00:01:39,140
applied multivariate statistical modeling is.
Let us define it word by word; the first is
9
00:01:39,140 --> 00:01:52,580
applied. Now, what do we mean by applied?
In science, there is pure science and applied
10
00:01:52,580 --> 00:02:07,610
science. Pure science, as we generally understand it,
is basic science, which basically
11
00:02:07,610 --> 00:02:21,920
talks about laws, theories and their development;
that development definitely links
12
00:02:21,920 --> 00:02:33,720
with the phenomena, which we usually observe
in different aspects of our life. Now, applied
13
00:02:33,720 --> 00:02:42,100
science uses the knowledge of
pure science and develops something for the
14
00:02:42,100 --> 00:02:51,750
benefit of mankind. So, as one example
of applied science, we can say that when you
15
00:02:51,750 --> 00:03:03,860
talk about engineering, it is basically applied.
Now, when I talk about applied statistics
16
00:03:03,860 --> 00:03:15,450
what do we mean? I am assuming that you have
knowledge of preliminary basic statistics,
17
00:03:15,450 --> 00:03:29,780
for example, the normal distribution. If you know
the normal distribution, then you also know the
18
00:03:29,780 --> 00:03:38,840
probability density function f x, which is
1 by root over 2 pi sigma square e to the
19
00:03:38,840 --> 00:03:50,390
power of minus half of x minus mu whole square by sigma square,
where x varies from minus infinity to plus
20
00:03:50,390 --> 00:04:03,960
infinity. This is the so-called bell-shaped
curve, which was developed by Carl Friedrich
21
00:04:03,960 --> 00:04:16,859
Gauss.
So, theoretical development, that means the
22
00:04:16,859 --> 00:04:26,780
development of this type of distribution,
comes under basic science. Now, suppose
23
00:04:26,780 --> 00:04:33,470
I want to apply this knowledge to a real life
situation, I can find out a situation like
24
00:04:33,470 --> 00:04:54,400
this. For example, let us say there are three processes;
processes A, B and C take certain inputs and convert them
25
00:04:54,400 --> 00:05:04,050
into value-added outputs in
all cases. Let us say there are basically three identical
26
00:05:04,050 --> 00:05:13,779
machines which are producing steel washers.
A steel washer will be shaped like this, where
27
00:05:13,779 --> 00:05:22,099
there is an inner diameter, ID, and an outer
diameter, OD. As usual, there will also be a certain
28
00:05:22,099 --> 00:05:34,319
thickness of this washer, which I can call T h.
Now, if you produce a large amount of steel
29
00:05:34,319 --> 00:05:46,389
washers, that means the number of items produced
is large, n is large, then take the quality characteristic
30
00:05:46,389 --> 00:05:50,889
of the steel washer
which is important to the customer, say
31
00:05:50,889 --> 00:05:59,159
ID. If you plot it you may get this type of distribution,
which is a normal distribution, and where you
32
00:05:59,159 --> 00:06:03,969
will get the mean here. There will definitely be
a standard deviation for ID.
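The picture just described, a bell-shaped distribution of ID with a mean and a standard deviation, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the target inner diameter of 10 mm, the spread of 0.05 mm and the 5000 simulated washers are made-up numbers, not data from the lecture.

```python
import math
import random
import statistics


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Normal density: exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Hypothetical process A: simulated inner diameters in mm (assumed values).
random.seed(0)
ASSUMED_MU, ASSUMED_SIGMA = 10.0, 0.05
ids = [random.gauss(ASSUMED_MU, ASSUMED_SIGMA) for _ in range(5000)]

# The production process summarised as a statistical process.
mean_id = statistics.mean(ids)
sd_id = statistics.stdev(ids)
print(f"mean ID = {mean_id:.3f} mm, standard deviation = {sd_id:.3f} mm")
print(f"density at the mean = {normal_pdf(mean_id, mean_id, sd_id):.2f}")
```

The same summary computed separately for each process would give the means and standard deviations that are compared across processes.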
33
00:06:03,969 --> 00:06:13,340
Similarly for OD, and similarly for thickness.
Now then, what are we doing, and what is applied
34
00:06:13,340 --> 00:06:21,389
here? The production process A for example,
in this case which is producing steel washers
35
00:06:21,389 --> 00:06:27,740
is converted into a statistical process,
in the sense of a distribution like the
36
00:06:27,740 --> 00:06:36,370
normal distribution, where we are saying that
the production process can be interpreted,
37
00:06:36,370 --> 00:06:43,379
the behavior of this process can be interpreted
like this. Now, in order to clarify it further,
38
00:06:43,379 --> 00:06:51,529
suppose we do like this: this one is for
production process A, and if I say this is
39
00:06:51,529 --> 00:07:02,969
for production process B and third one is
this one for production process C, then using
40
00:07:02,969 --> 00:07:11,270
these things you will be able to compare
A, B and C and see their performance in terms
41
00:07:11,270 --> 00:07:19,599
of mean and standard deviation. There is a possibility
also to see whether the mean ID produced
42
00:07:19,600 --> 00:07:28,129
by C is equal to that of B or A; this type
of comparison is possible. So, when
43
00:07:28,129 --> 00:07:35,389
we develop something which
will be useful to society, to mankind,
44
00:07:35,389 --> 00:07:49,669
then we say it is applied. Now, come to the
second word which is basically multivariate.
45
00:07:49,669 --> 00:07:58,659
Now, in order to understand multivariate we
have to understand what a variable is. I think
46
00:07:58,659 --> 00:08:12,360
it is known to you that a variable is something
which takes different values. In that sense, I
47
00:08:12,360 --> 00:08:24,069
can say it takes different values. For example,
say x is I D, the inner diameter. Then
48
00:08:24,069 --> 00:08:32,669
if I produce one item, where i stands for the item,
suppose the first item, the I D value may
49
00:08:32,669 --> 00:08:41,490
take the value x 1. When we go for the second washer
it may take x 2. So, in this way, if I go
50
00:08:41,490 --> 00:08:51,810
for n washers produced, x n will come
into consideration. So, these are the values,
51
00:08:51,810 --> 00:08:59,950
so I D takes different values as a result
I D is a variable here. Now, in statistics
52
00:08:59,950 --> 00:09:08,070
we basically talk about two types of variables,
one is fixed variable and the other one is
53
00:09:08,070 --> 00:09:17,580
deterministic, sorry random variable.
So, fixed we can otherwise call deterministic,
54
00:09:17,580 --> 00:09:29,250
and random we can call probabilistic. For example,
if I create another variable which is month,
55
00:09:29,250 --> 00:09:38,220
it varies, but we know all the
months. Suppose we ask what the next month will be:
56
00:09:38,220 --> 00:09:44,110
if this month is December, the next month
will be January; it is known with certainty,
57
00:09:44,110 --> 00:09:50,070
that is a deterministic variable. But in this
case, when you are going to produce a second
58
00:09:50,070 --> 00:09:59,120
lot, or even within one
lot, what the value of I D will be for the second
59
00:09:59,120 --> 00:10:06,670
item, the second washer, is not known with
certainty; it is governed by a probability
60
00:10:06,670 --> 00:10:12,610
distribution.
So, in that sense it is a random one; we do
61
00:10:12,610 --> 00:10:17,770
not know the value exactly, and this value
comes from a certain random experiment,
62
00:10:17,770 --> 00:10:32,080
in this case the process which is producing
this item. So, if I go on like this, then
63
00:10:32,080 --> 00:10:39,700
another variable here is O D. Similarly, another
one is thickness. Now, in order to accommodate
64
00:10:39,700 --> 00:10:51,180
more than one variable, we will write this:
X 1 is I D, X 2 is O D
65
00:10:51,180 --> 00:10:58,270
and X 3 is thickness. For I D, when the
first item is produced, the value will be
66
00:10:58,270 --> 00:11:08,620
x 1 1, for the second one x 2 1, and so on up to x n 1. Similarly,
for O D, x 1 2, x 2 2, up to x n 2, and if I
67
00:11:08,620 --> 00:11:14,540
go for the X 3 variable that is observed for
first observation, it is x 1 3, second one
68
00:11:14,540 --> 00:11:25,120
x 2 3. So, like this x n 3.
So, what we are trying to say here that we
69
00:11:25,120 --> 00:11:32,420
are considering three variables X 1, X 2,
X 3 which are nothing but the characteristics
70
00:11:32,420 --> 00:11:38,860
of the steel washers in this example which
has inner diameter, which has outer diameter,
71
00:11:38,860 --> 00:11:47,120
which has thickness. Now, if you produce n
number of washers then what will happen? Every
72
00:11:47,120 --> 00:12:00,100
washer will have different values for
I D, O D and thickness. So, this is my observation,
73
00:12:00,100 --> 00:12:05,810
first one is observation number 1, second
one observation number 2, like that there
74
00:12:05,810 --> 00:12:14,320
is observation number n and you see in observation
number 1 if I consider only I D that value
75
00:12:14,320 --> 00:12:20,680
is x 1 1, if I consider all three together,
observation 1 takes value x 1 1, x 1 2, x
76
00:12:20,680 --> 00:12:29,000
1 3.
So, similarly if you go on increasing the
77
00:12:29,000 --> 00:12:40,990
number of variables up to X p then here it
will be X 1 p, X 2 p like this X n p. Now,
78
00:12:40,990 --> 00:13:01,600
each of these as well as this, these are observations
on multiple variables. What do you want to
79
00:13:01,600 --> 00:13:13,030
define here? We want to define here multivariate.
So, in order to do so we know variable, deterministic
80
00:13:13,030 --> 00:13:20,870
variable, probabilistic, that is random variable
and this is one example where every observation
81
00:13:20,870 --> 00:13:29,200
is measured on several variables. When
multiple variables come into the picture,
82
00:13:29,200 --> 00:13:36,480
each observation is a variable vector. For example,
if I take the ith observation here, then x
83
00:13:36,480 --> 00:13:50,930
i will be x i 1, x i 2, like this up to x i p.
So, it is a variable vector, that is, the ith observation
84
00:13:50,930 --> 00:14:07,780
on p variables. So, when we deal with this
type of situation where our observations or
85
00:14:07,780 --> 00:14:18,700
each of the observations has multiple values,
in the sense of values on a number of
86
00:14:18,700 --> 00:14:38,270
variables more than 1, then the situation is
a multivariate situation. Now, we have defined variable,
87
00:14:38,270 --> 00:14:49,680
we have defined the multivariate situation; let us understand
what a variate is. Instead of saying
88
00:14:49,680 --> 00:14:59,920
that x i is like this, if I create something
different based on all those observations
89
00:14:59,920 --> 00:15:16,240
that is, a linear combination
of variables. For example, in our example
90
00:15:16,240 --> 00:15:23,270
there are three variables, X 1, X 2 and X 3. If I
create a linear combination, L
91
00:15:23,270 --> 00:15:36,270
C, which is beta 1 X 1 plus beta 2 X 2 plus
beta 3 X 3, this combination will give
92
00:15:36,270 --> 00:15:46,260
a quantity or a value, or in other words we can
also say a variable, which we are calling a linear
93
00:15:46,260 --> 00:15:56,170
combination of variables, and that is a variate.
This is a variate. Then what is the definition
94
00:15:56,170 --> 00:16:05,529
of a variate? A linear combination of variables
with empirically determined weights; that
95
00:16:05,529 --> 00:16:12,350
means beta 1, beta 2 and beta 3 will be determined
based on observations. There are n observations
96
00:16:12,350 --> 00:16:15,380
so we will be able to determine all those
weights.
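The data layout and the variate described above can be sketched as follows. The washer measurements and the weights beta 1, beta 2, beta 3 below are hypothetical numbers chosen only for illustration; in practice the weights would be estimated from the n observations, which this sketch does not attempt.

```python
# Each row is one washer (one observation); the columns are
# X1 = inner diameter, X2 = outer diameter, X3 = thickness, all in mm.
# Every number here is made up for illustration.
X = [
    [10.02, 20.05, 2.01],
    [9.98, 19.97, 1.99],
    [10.01, 20.02, 2.00],
    [9.97, 19.95, 2.02],
]
n, p = len(X), len(X[0])  # n observations on p variables

# Hypothetical weights; in a real analysis these are determined empirically.
beta = [0.5, 0.3, 0.2]


def variate(row, weights):
    """Linear combination beta1*x1 + beta2*x2 + ... + betap*xp for one row."""
    return sum(w * x for w, x in zip(weights, row))


x_2 = X[1]  # the second observation vector (x21, x22, x23)
scores = [variate(row, beta) for row in X]  # one variate value per washer
print(n, p, x_2, [round(s, 4) for s in scores])
```

Each washer contributes one row vector, and the variate collapses that vector into a single number.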
97
00:16:15,380 --> 00:16:21,410
So, a linear combination, or a weighted linear
combination, of the variables where the weights
98
00:16:21,410 --> 00:16:32,190
are determined empirically, that is a variate.
Now, in this case you can go for one variable,
99
00:16:32,190 --> 00:16:39,380
simply one variable. That means, if I say there
are 3 variables and we take only one variable, p equal
100
00:16:39,380 --> 00:16:47,080
to 1, then that will be univariate; when we
go for p greater than or equal to 2,
101
00:16:47,080 --> 00:17:00,210
that is multivariate. That is what multivariate is.
Usually in statistics books you will find
102
00:17:00,210 --> 00:17:08,510
univariate statistics. For example, for
the normal there is the univariate normal distribution,
103
00:17:08,510 --> 00:17:12,189
the bivariate normal distribution and the multivariate
normal distribution. So, the bivariate
104
00:17:12,189 --> 00:17:17,220
is a part of the multivariate. We basically say
univariate means p equal to 1,
105
00:17:17,220 --> 00:17:23,670
bivariate means p equal to 2, and multivariate
is p greater than or equal to 2.
106
00:17:23,670 --> 00:17:31,550
So, this is what multivariate is. By the word
multivariate we definitely mean something
107
00:17:31,550 --> 00:17:36,440
about a linear combination of variables, where
more than one variable is there, and there
108
00:17:36,440 --> 00:17:42,420
are multiple observations, not a single observation,
n number of observations, and weights that are determined
109
00:17:42,430 --> 00:17:49,809
empirically based on the n
observations that will be collected from the
110
00:17:49,809 --> 00:17:59,059
population for which we want to infer something.
All those inferences we will discuss later.
111
00:17:59,059 --> 00:18:10,520
So, third one, the third issue is statistical.
Now, what is statistical? By statistical we
112
00:18:10,520 --> 00:18:20,300
want to say that it is basically using statistics
for what we want to infer.
113
00:18:20,300 --> 00:18:28,320
So, whenever you are developing something
using statistical tools and techniques,
114
00:18:28,320 --> 00:18:35,610
then this development is a statistical development.
Now, what is statistics? I say statistics
115
00:18:35,610 --> 00:18:50,250
is nothing but collecting, organizing, analyzing,
then representing and interpreting. What I
116
00:18:50,250 --> 00:19:01,380
mean to say is collecting data, organizing data,
analyzing data, representing the results and
117
00:19:01,380 --> 00:19:08,110
interpreting the results for the population
for which the statistical model, or the statistics,
118
00:19:08,110 --> 00:19:14,000
is used, so that some purpose, some purposeful
work, will be served.
119
00:19:14,000 --> 00:19:30,140
So, when we talk about statistical, that means
we talk about the population, then a sample
120
00:19:30,140 --> 00:19:36,840
consisting of data from the population, and we
have some purpose in our mind, some objective in
121
00:19:36,840 --> 00:19:44,800
our mind. We want to infer something
about the population, so we collect data accordingly,
122
00:19:44,800 --> 00:19:50,700
we organize the data, we analyze the data,
then we find the result, and the result we
123
00:19:50,820 --> 00:19:56,140
summarize, and based on this summarization,
these findings, we infer about the population.
124
00:19:56,140 --> 00:20:00,670
So, that is why the word statistical is used.
125
00:20:00,670 --> 00:20:10,710
Now, the last but very important one is
modeling. If you want to understand it, then
126
00:20:10,710 --> 00:20:18,230
first you must understand what a model is. There
are many types of model; actually a very simple
127
00:20:18,230 --> 00:20:29,050
one is from our school days. I can remember
we talked about the spring balance, like this.
128
00:20:29,050 --> 00:20:35,590
So, what happens? This is a spring, an elastic
one; a load P is attached to it, and
129
00:20:35,590 --> 00:20:41,830
it behaves in a certain way: if you
increase the load, the elongation will be
130
00:20:41,830 --> 00:20:46,620
more. If you reduce it, it will be less.
So, when this is the behavior, this is the
131
00:20:46,620 --> 00:20:52,460
spring balance model. So, to show the behavior
of the spring, this type of physical model
132
00:20:52,460 --> 00:21:02,290
is developed. So, this is one model, which is basically a physical model.
133
00:21:02,290 --> 00:21:17,050
Now, for the same thing, when I came to my engineering studies, I found that there is one important concept,
134
00:21:17,050 --> 00:21:27,710
or development, or theory, called Hooke’s law, where sigma, the stress developed
135
00:21:27,710 --> 00:21:31,840
on the spring,
and epsilon, the strain developed in it,
136
00:21:31,840 --> 00:21:45,690
are modeled in such a manner that there
is a relationship like this. This is the range
137
00:21:45,690 --> 00:21:49,340
of elasticity; there is another concept called
elasticity. So, what I have seen there, or
138
00:21:49,340 --> 00:21:56,910
we all have seen there, is the sigma versus epsilon relationship.
So, sigma is E epsilon, where E is Young’s modulus
139
00:21:56,910 --> 00:22:07,420
or the modulus of elasticity. So, this is
the theory behind elasticity, the
140
00:22:07,420 --> 00:22:15,500
case of an elastic body when the load is such
that it will not go to the yield point or
141
00:22:15,500 --> 00:22:23,250
beyond the yield point; that is the elastic zone.
So, as long as the body is stressed within
142
00:22:23,250 --> 00:22:30,320
the elastic limit, it has
the property that if you remove the load
143
00:22:30,320 --> 00:22:39,679
then it will recover back to the original
position. So, this development is possible
144
00:22:39,679 --> 00:22:47,240
because the physics of this particular spring
was known, and I can say that if I know
145
00:22:47,240 --> 00:22:52,510
the modulus of elasticity, I will be able
to tell the relationship between sigma and
146
00:22:52,510 --> 00:22:58,840
epsilon. At that time, in the engineering mechanics
and strength of materials subjects, we learned
147
00:22:58,840 --> 00:23:09,809
these things, the basic mathematical models.
So, in reality you will get different types
148
00:23:09,809 --> 00:23:15,090
of mathematical model. So, what
I mean to say here is that there is a physical model and
149
00:23:15,090 --> 00:23:22,380
a mathematical model. Now, what do we mean by
a statistical model in this case? For example,
150
00:23:22,380 --> 00:23:31,860
take the case, I think, at the beginning
of this particular study: for example,
151
00:23:31,860 --> 00:23:38,250
how did we develop all these things?
So, about that experiment I have no idea, but suppose
152
00:23:38,250 --> 00:23:44,240
you do not know the modulus of elasticity,
but you know that it is, say, an elastic body, and you
153
00:23:44,240 --> 00:23:52,550
want to find the relationship. In that case
you can do an experiment with P, varying P from
154
00:23:52,550 --> 00:23:58,850
P 1 to P n. So, that means you will create
n different load conditions; then you will be
155
00:23:58,850 --> 00:24:06,910
getting 1 to n observations, and for the sigma and epsilon
values you will be getting sigma 1, sigma
156
00:24:06,910 --> 00:24:12,500
2, up to sigma n; and for epsilon, epsilon 1, epsilon
2 and so on up to epsilon n.
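One way to turn paired observations like sigma 1 to sigma n and epsilon 1 to epsilon n into an empirical relationship is an ordinary least-squares line, whose slope then serves as an estimate of E. The sketch below assumes that approach; the lecture does not name a fitting method, and the modulus of 200 GPa, the strain levels and the noise level are invented solely to simulate an experiment.

```python
import random

random.seed(1)
E_TRUE = 200.0  # assumed modulus in GPa, used only to simulate measurements

# Simulated experiment: n = 10 strain levels and noisy stress readings (GPa).
strain = [0.0005 * i for i in range(1, 11)]                       # epsilon_i
stress = [E_TRUE * e + random.gauss(0.0, 0.005) for e in strain]  # sigma_i


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


E_hat, intercept = fit_line(strain, stress)
print(f"estimated E = {E_hat:.1f} GPa, intercept = {intercept:.4f} GPa")
```

Hooke's law passes through the origin, so a fit without an intercept would also be reasonable; the intercept is kept here only to show that the data place it close to zero.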
157
00:24:12,500 --> 00:24:23,540
Now, if you plot this, what will happen? You
may get a plot like this; here it is sigma.
158
00:24:23,540 --> 00:24:30,570
Essentially, what is the difference between
this and this? Here, what I am saying is that
159
00:24:30,570 --> 00:24:36,090
straight away, when you have
shown me this spring balance, I
160
00:24:36,090 --> 00:24:42,240
immediately say that for an elastic body this
is the diagram, because Hooke’s
161
00:24:42,240 --> 00:24:47,040
law is known to me.
So, the mathematics is known to me, but in case
162
00:24:47,040 --> 00:24:53,740
it is not known, I have done several experiments
here. And based on this I will
163
00:24:53,740 --> 00:24:59,780
plot like this; it need not be a perfect straight
line that you get when you go for the empirical
164
00:24:59,780 --> 00:25:06,220
relationship. So, this is what the empirical
model is. So, when we
165
00:25:06,220 --> 00:25:13,190
talk about empirical models like this, experiment-based
or data-based models like this, these
166
00:25:13,190 --> 00:25:18,890
are basically statistical; these are all
statistical. So, for me, for all of
167
00:25:18,890 --> 00:25:34,670
us, this is our statistical model.
Now, what is modeling? Modeling is basically this:
168
00:25:34,670 --> 00:25:42,460
you want to get this type of result, but it is
not that you will immediately get all this;
169
00:25:42,460 --> 00:25:48,670
there is a process, with steps. I have to understand
what my purpose is. I have to understand, in order
170
00:25:48,670 --> 00:25:54,250
to fulfil the purpose, what the variables are
that are affecting it. I have to identify
171
00:25:54,250 --> 00:26:01,309
all the important variables; then I have to
see how the data on the variables will
172
00:26:01,309 --> 00:26:05,220
be collected.
For example, here I have shown you the experiment,
173
00:26:05,220 --> 00:26:12,030
but it may so happen that you cannot do the experiment. So, in that case, is there any
174
00:26:12,030 --> 00:26:18,460
other way of collecting data? For example,
observation: someone is interested to see the
175
00:26:18,460 --> 00:26:25,090
behavior of a particular animal. He perhaps cannot
do the experiment, but there are a large
176
00:26:25,090 --> 00:26:28,540
number of wild animals of that particular
species.
177
00:26:28,540 --> 00:26:41,630
So, we can observe them; we are just going
and observing in the field, so it is field-based observation.
178
00:26:41,630 --> 00:26:50,380
This one is our experiment; sometimes
we will go instead for some naturalistic
179
00:26:50,380 --> 00:26:57,090
observations, like the
wild animal case we talked about, field-based
180
00:26:57,090 --> 00:27:02,049
observation. In production, suppose in
the steel washer case, you go to the production
181
00:27:02,049 --> 00:27:07,040
shop, see what is happening there,
collect data and accordingly do some
182
00:27:07,040 --> 00:27:15,860
modeling; those are naturalistic observations.
So, all those types of data collection mechanism
183
00:27:15,860 --> 00:27:21,750
come under empirical modeling, and you have
to understand all these things. So, this is
184
00:27:21,750 --> 00:27:28,559
a process, the process of modeling. The process
of model building is called modeling;
185
00:27:28,559 --> 00:27:42,220
the process of model
building is modeling.
186
00:27:42,220 --> 00:27:56,250
So, let us see some of the slides. Now that
I have told you what multivariate is and
187
00:27:56,250 --> 00:28:03,190
what we discussed, why should I use it?
It is a basic question: why should
188
00:28:03,190 --> 00:28:09,660
I go for multivariate things? If I can do it
some other way, why multivariate? So, there
189
00:28:09,660 --> 00:28:15,240
are some key issues, which basically will be
known to you later on. When we talk about
190
00:28:15,240 --> 00:28:15,919
multivariate,
191
00:28:15,919 --> 00:28:21,669
we talk about multiple variables, that is p
cross 1: if p is the number of variables, then
192
00:28:21,669 --> 00:28:30,830
X 1, X 2, up to X p. Now, there is a possibility
that these variables are interrelated; there
193
00:28:30,830 --> 00:28:36,049
is correlation. One of the easiest ways to see this is
the correlation between the variables. So,
194
00:28:36,049 --> 00:28:42,530
that means you may get a correlation matrix,
or in other words it is basically the covariance
195
00:28:42,530 --> 00:28:52,350
between the variables. By covariance
what I mean to say is: if one variable varies,
196
00:28:52,350 --> 00:28:56,470
there is a possibility that, in a particular
way, some other variable also varies;
197
00:28:56,470 --> 00:29:03,140
then there will be covariance, and standardized
covariance is correlation. This will come in
198
00:29:03,140 --> 00:29:12,330
the subsequent lectures; the covariance
will be a p cross p matrix.
199
00:29:12,330 --> 00:29:20,059
So, similarly, there are the mean
values for all those variables, mu 1, mu 2
200
00:29:20,059 --> 00:29:28,220
up to mu p; these things will be there. Now,
my answer to your question of why
201
00:29:28,220 --> 00:29:42,390
I should use it is this: any physical process,
or for that matter any other system, which is
202
00:29:42,390 --> 00:29:52,690
characterized by multiple variables
should be analyzed, that is, its behavior
203
00:29:52,690 --> 00:30:00,480
should be analyzed, by taking into consideration
all the variables characterizing it,
204
00:30:00,480 --> 00:30:07,929
when these variables are considered very, very important
for the design, development or improvement
205
00:30:07,929 --> 00:30:14,330
of the system for which it is developed.
And, as is obvious, there will
206
00:30:14,330 --> 00:30:24,000
be covariance or correlation between the variables.
If we go for univariate analysis we will lose
207
00:30:24,000 --> 00:30:31,870
substantial information about the behavior,
because of non-inclusion of the covariance
208
00:30:31,870 --> 00:30:38,380
structure. So, we need to account for this covariance
209
00:30:38,380 --> 00:30:43,850
structure. In multivariate statistics covariance
is a very big issue, which will be found
210
00:30:43,850 --> 00:30:52,010
in multivariate distributions. We will be discussing
all these covariance things, so it is required.
211
00:30:52,010 --> 00:31:05,470
For example, in this case, our steel washer case,
212
00:31:05,470 --> 00:31:13,000
three variables are visibly controlling its
quality: inner diameter, outer diameter and
213
00:31:13,000 --> 00:31:18,510
thickness. There is a chance that inner and
outer diameter will be related, and also the thickness;
214
00:31:18,510 --> 00:31:25,240
in that case the customer will not be able
to apply it or fit it to his own situation
215
00:31:25,240 --> 00:31:30,240
if there is a huge mismatch.
Now, if we control inner diameter or outer
216
00:31:30,240 --> 00:31:34,429
diameter or thickness individually, then what will happen?
The correlation structure will not be considered,
217
00:31:34,429 --> 00:31:43,440
and ultimately we will not be able to satisfy
the customer. So, we will be using multivariate
218
00:31:43,440 --> 00:31:48,419
statistics or multivariate modeling when
our system is complex in terms of the number
219
00:31:48,419 --> 00:31:56,720
of variables, in conditions like this, where
the correlation structure is intact. In order
220
00:31:56,720 --> 00:32:02,929
to extract that correlation information,
to extract the pattern from this
221
00:32:02,929 --> 00:32:09,690
data, that is why you will be using it. So, how
do I do it? It is through multivariate models,
222
00:32:09,690 --> 00:32:34,630
so these models will be discussed a little
later. Now, what is next?
223
00:32:34,630 --> 00:32:42,130
Next, one example. Here we are saying that
a particular company is operating, maybe in a
224
00:32:42,130 --> 00:32:51,100
city market, and we want to see the organizational
health of this company with respect to profit
225
00:32:51,100 --> 00:32:58,510
in rupees million, sales volume
in rupees hundred, absenteeism, machine breakdown
226
00:32:58,510 --> 00:33:05,870
and M ratio. Actually, these are chosen intentionally.
First are profit and sales volume; these
227
00:33:05,870 --> 00:33:12,980
are organizational health issues: if
you sell more, your profit may be more. And
228
00:33:12,980 --> 00:33:20,570
if your profit is more, you are healthy
financially. Another issue is absenteeism:
229
00:33:20,570 --> 00:33:28,150
if you are paying substantially and if you
are taking care of the well-being of the employees,
230
00:33:28,150 --> 00:33:32,880
absenteeism will be less. If you are maintaining the health of the process,
231
00:33:32,880 --> 00:33:40,740
here we are saying the machine, your machine breakdown
will be less. And if you are able
232
00:33:40,740 --> 00:33:47,770
to coordinate with your customers as well as your
suppliers, then your M ratio, and by M ratio
233
00:33:47,770 --> 00:33:53,740
I mean marketing ratio, which relates
to the customer, will be high. So,
234
00:33:53,740 --> 00:34:00,530
if this is the case, then we are basically
observing from April, May, June, July, that is,
235
00:34:00,530 --> 00:34:10,940
12 months of data, measured in some units.
This is nothing but a case of a multivariate
236
00:34:10,940 --> 00:34:19,980
situation, where in each row, starting
from the first row, the values are
237
00:34:19,980 --> 00:34:23,450
multivariate observations, here for the month of
April.
238
00:34:23,450 --> 00:34:30,060
Similarly, for the second row these are multivariate
observations, so we have multivariate
239
00:34:30,060 --> 00:34:38,790
observations. Now, you may be interested
to know how profit varies over the months;
240
00:34:38,790 --> 00:34:44,000
then it will be a univariate one. If you want
to see how sales volume varies over the
241
00:34:44,000 --> 00:34:49,010
months, it will also be univariate. Now, if
you want to know how absenteeism varies over the
242
00:34:49,010 --> 00:34:52,190
year, month by month, that is also univariate.
243
00:34:52,190 --> 00:35:01,300
But if you are interested to see how
profit and sales volume covary, along with
244
00:35:01,300 --> 00:35:07,640
their own variation, then you will have to consider two variables. And that will
245
00:35:07,640 --> 00:35:13,740
be a multivariate situation. Sometimes you may
be interested to know how the sales volume
246
00:35:13,740 --> 00:35:18,590
depends on absenteeism, machine
breakdown and marketing ratio. Then there
247
00:35:18,590 --> 00:35:24,910
must be a dependence model, and that is the same
multivariate issue. So, this, in a nutshell, is
248
00:35:24,910 --> 00:35:29,640
what I am talking about: multivariate observations.
249
00:35:29,640 --> 00:35:38,099
So, now we have discussed some of the things,
some of the variables and we have seen that
250
00:35:38,099 --> 00:35:43,359
we have assigned them some values, but
where are those values coming from?
251
00:35:43,359 --> 00:35:50,960
For example, if I say steel washer, the thickness,
or the inner or the outer diameter,
252
00:35:50,960 --> 00:35:57,510
how is it known? You have used some
measurement scale to measure this. If I want
253
00:35:57,510 --> 00:36:04,770
to be specific, maybe you have used a Vernier
caliper to measure the outer diameter, and maybe
254
00:36:04,770 --> 00:36:10,859
used a Vernier caliper to measure the inner
diameter. So, you have used some instrument,
255
00:36:10,859 --> 00:36:16,480
and there is also a scale of
measurement. In this case the scale is basically
256
00:36:16,480 --> 00:36:25,140
length, which may be in terms of millimeters.
So, you have to use some scale of measurement,
257
00:36:25,140 --> 00:36:34,790
and based on the scale used, whatever data
you get will be of different types.
258
00:36:34,790 --> 00:36:43,530
So, you see this line here: on the left side
we are talking about random variables and
259
00:36:43,530 --> 00:36:48,490
on the right hand side we are talking about data
types. I have explained the random variable to you
260
00:36:48,490 --> 00:36:53,869
earlier, so I will not spend much time here,
but please understand one thing:
261
00:36:53,869 --> 00:36:57,400
among random variables there are discrete
and continuous random variables.
262
00:36:57,400 --> 00:37:04,830
By discrete random variable we mean
that it will take countable values
263
00:37:04,830 --> 00:37:15,540
like 0, 1, 2 or something like this, or January,
February, March, something like this. In the
264
00:37:15,540 --> 00:37:20,890
continuous case, like profit, absenteeism, breakdown
or M ratio here, any
265
00:37:20,890 --> 00:37:26,210
value is possible. So, please understand one
thing here: sales volume comes under
266
00:37:26,210 --> 00:37:33,430
discrete because it is a countable one, but such count values can
267
00:37:33,430 --> 00:37:42,730
also be considered as continuous in many situations.
But anyhow, there are two types.
268
00:37:42,730 --> 00:37:48,630
Now, your data types: I told you that it depends on what
measurement scale you are using. Based on
269
00:37:48,630 --> 00:37:55,250
this, the data types will be known, meaning
that the data will have certain properties,
270
00:37:55,250 --> 00:38:01,510
because data is nothing but information. How
much information is available with the data,
271
00:38:01,510 --> 00:38:07,210
are you getting me, all depends on what
scale you have used to measure this data.
272
00:38:07,210 --> 00:38:12,910
So, based on that there are four types of
data, one is nominal data, ordinal data, interval
273
00:38:12,910 --> 00:38:16,670
data and ratio data.
274
00:38:16,670 --> 00:38:24,050
Let us discuss something about nominal data.
My definition is this: it provides identity to
275
00:38:24,050 --> 00:38:33,140
some items or things. If I say the month: the
small company that I
276
00:38:33,140 --> 00:38:40,050
showed you wants to know, over the
different months, what the status is. So,
277
00:38:40,050 --> 00:38:47,450
the month is a variable, starting from January
to December, because it changes. So, then
278
00:38:47,450 --> 00:38:53,320
January, February and all those things are nothing
but the identity of the period of
279
00:38:53,320 --> 00:38:59,640
time, the identity of the particular series.
Suppose you are trying
280
00:38:59,640 --> 00:39:07,640
to know some performance or status of
the different departments of, for example,
281
00:39:07,650 --> 00:39:13,920
IIT. Then if I say the department of chemistry,
department of physics, department of mathematics,
282
00:39:13,920 --> 00:39:18,130
department of computer science, department
of industrial engineering and management,
283
00:39:18,130 --> 00:39:24,060
all those things are basically
providing identity, but we sometimes require
284
00:39:24,060 --> 00:39:30,140
to include this type of data in our
analysis. So, this is nothing but nominal
285
00:39:30,140 --> 00:39:36,510
data. Now, what is the problem with nominal
data? The problem with nominal data is that there
286
00:39:36,510 --> 00:39:41,270
are huge computational limitations, because
you cannot do any arithmetic operations:
287
00:39:41,270 --> 00:39:46,490
you cannot add department of chemistry plus
department of physics like this. We cannot
288
00:39:46,490 --> 00:39:52,910
say department of chemistry is 1 and department
of physics is 2 and accordingly we will add,
289
00:39:52,910 --> 00:40:05,250
we cannot subtract, we cannot multiply, we
cannot make division also this is the problem.
290
00:40:05,250 --> 00:40:09,960
Next data type is your ordinal data type.
What is ordinal data type? Suppose you
291
00:40:09,960 --> 00:40:15,530
have travelled by flight
several times may be, or train, or some other
292
00:40:15,530 --> 00:40:20,790
places, or you have gone to a restaurant, and
when you have taken food you might have
293
00:40:20,790 --> 00:40:25,990
seen that you are given a feedback form.
They are saying: please rate,
294
00:40:25,990 --> 00:40:32,369
in case of a hotel, food quality,
service quality, room quality, all those things,
295
00:40:32,369 --> 00:40:40,220
in terms of satisfaction,
from totally unsatisfied to extremely satisfied.
296
00:40:40,220 --> 00:40:47,650
This type of scale is used. For example,
for the food case, taste wise it is very
297
00:40:47,650 --> 00:40:53,920
good, good or something like this. So, when this
type of ordering is there,
298
00:40:53,920 --> 00:41:02,940
this is called ordinal data. So, what does it do? It
provides some order or rank to some items or
299
00:41:02,940 --> 00:41:10,260
things. Example: service quality is low,
medium or good. And the computational limitations:
300
00:41:10,260 --> 00:41:28,180
Here also we cannot do any arithmetic operations
like your addition, subtraction, multiplication
301
00:41:28,180 --> 00:41:36,020
and division. You cannot do them. Then in what way
is it better than nominal data? It is
302
00:41:36,020 --> 00:41:42,980
better than nominal data because here you
are getting an order, a rank.
303
00:41:42,980 --> 00:41:52,080
If I say that student performance
is low, average, very good or excellent, like
304
00:41:52,080 --> 00:41:57,310
this, the person who is getting excellent
is definitely better than the person or the
305
00:41:57,310 --> 00:42:08,410
student who got very good. So, I have a ranking
scale here, a ranking ability with this data.
306
00:42:08,410 --> 00:42:18,150
So, ordinal data is rich compared to nominal data.
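The ranking ability described above can be sketched in a few lines of Python. This is only a minimal illustration, not something shown in the lecture: the names LEVELS, RANK and is_better are assumptions made for the example, and the labels are borrowed from the student performance example.

```python
# Hypothetical ordinal scale, written from lowest to highest.
LEVELS = ["low", "average", "very good", "excellent"]

# The position in the list is only a rank, not a quantity to add or divide.
RANK = {label: position for position, label in enumerate(LEVELS)}


def is_better(a: str, b: str) -> bool:
    """Return True when rating a ranks strictly above rating b."""
    return RANK[a] > RANK[b]


# Made-up ratings for five students.
students = ["very good", "low", "excellent", "average", "very good"]

# Ordering and comparison are meaningful for ordinal data.
ordered = sorted(students, key=RANK.__getitem__)
print(ordered)
print(is_better("excellent", "very good"))
```

The ranks support comparison and sorting only; adding or averaging them would assume equal gaps between the labels, which an ordinal scale does not give.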
307
00:42:18,150 --> 00:42:26,690
The next data type, I said, is interval data.
What is interval data? It is well
308
00:42:26,690 --> 00:42:37,869
understood if we take this example: here temperature
we are measuring using two scales, one is
309
00:42:37,869 --> 00:42:45,960
Celsius, another is Fahrenheit. In developing
these two scales, the Fahrenheit scale as well
310
00:42:45,960 --> 00:42:58,080
as the Celsius scale, the reference point
is taken at two different points, meaning locations.
311
00:42:58,080 --> 00:43:06,430
It is not the same, are you getting me? And if
you see the horizontal lines here, you see
312
00:43:06,430 --> 00:43:12,220
0 degree centigrade, 20 degree centigrade
and 100 degree centigrade. Then the corresponding
313
00:43:12,220 --> 00:43:19,400
Fahrenheit will be 32, 68 and 212 Fahrenheit,
understanding?
314
00:43:19,400 --> 00:43:27,730
So, there is a range: if I say the difference
from 100 to 0 degree, you are getting this
315
00:43:27,730 --> 00:43:34,640
range; here also, from 212 to 32, the corresponding
range is this. So, whether we measure using
316
00:43:34,640 --> 00:43:41,080
the Celsius scale or the Fahrenheit scale, we will
be getting the equal range. Now, what will
317
00:43:41,080 --> 00:43:47,839
happen? Suppose I measured temperature today.
Today the day temperature is 20 degree centigrade,
318
00:43:47,839 --> 00:43:55,730
tomorrow 30 degree and may be day after tomorrow
21 degree. Then if I want to do the averaging,
319
00:43:55,730 --> 00:44:02,060
I can add them and then divide by 3, and that
3 days average I will get. If I do the same
320
00:44:02,060 --> 00:44:08,450
thing in Fahrenheit also, it is possible. I
can do that similar thing. But what
321
00:44:08,450 --> 00:44:13,670
will happen?
Suppose I want to say how
322
00:44:13,670 --> 00:44:21,380
many times the temperature of today is compared
to yesterday’s temperature.
323
00:44:21,380 --> 00:44:28,240
Then if I use the Celsius scale and I divide
22 by 20, and then in Fahrenheit, where it may be
324
00:44:28,240 --> 00:44:33,940
70 and other things, then we will find out
they are not matching. So, that means the interval
325
00:44:33,940 --> 00:44:42,089
scale is a scale where you will get interval
data, range data, and they have all
326
00:44:42,089 --> 00:44:51,859
types of continuous properties, except that you
can do 3 arithmetic operations very easily:
327
00:44:51,859 --> 00:44:57,920
addition, subtraction and multiplication.
But when you do division, you will find out
328
00:44:57,920 --> 00:45:06,619
that when you change the scale, the result changes.
Ultimately what will happen? You will find
329
00:45:06,619 --> 00:45:17,450
that, in interval data, you cannot
go for division.
330
00:45:17,450 --> 00:45:32,510
Interval data division is not possible, all
other arithmetic operations are possible.
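The point about averaging versus division can be checked with a short Python sketch. It is only an illustration under stated assumptions: the three Celsius readings are made-up values in the spirit of the example, and c_to_f is a helper name introduced here, using the standard conversion F = C * 9/5 + 32.

```python
def c_to_f(celsius: float) -> float:
    """Convert a Celsius reading to Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Assumed readings for three days, in degree Celsius.
temps_c = [20.0, 22.0, 21.0]
temps_f = [c_to_f(t) for t in temps_c]

# Averaging agrees across the two scales: converting the Celsius
# average gives the same value as averaging the Fahrenheit readings.
mean_c = mean(temps_c)
mean_f = mean(temps_f)
print(c_to_f(mean_c), mean_f)

# Division does not agree: the "how many times" answer depends on the scale.
ratio_c = temps_c[1] / temps_c[0]
ratio_f = temps_f[1] / temps_f[0]
print(ratio_c, ratio_f)
```

The two ratios differ because the zero points of the two scales sit at different locations, which is exactly why a ratio has no fixed meaning on an interval scale.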
331
00:45:32,510 --> 00:45:35,690
Let us go to the next slide.
332
00:45:35,690 --> 00:45:44,599
In this slide we are talking about ratio
data. Ratio data is something where there
333
00:45:44,599 --> 00:45:48,380
is absolute 0 in the scale of measurement.
334
00:45:48,380 --> 00:45:56,839
This is 0, if I move towards right suppose
x amount and towards left also x amount then
335
00:45:56,839 --> 00:46:02,060
the difference, this difference is same. If
I go for y also, this side also y also that
336
00:46:02,060 --> 00:46:05,430
is the same. So, that
337
00:46:05,430 --> 00:46:12,200
means if you go to the left it is negative,
this side it is positive, but there is absolute
338
00:46:12,200 --> 00:46:21,099
0 in between. So, this 0 is the reference
point, not like the Fahrenheit and centigrade
339
00:46:21,099 --> 00:46:31,980
scales, where there are two different definitions.
340
00:46:31,980 --> 00:46:47,349
So, ratio data contains absolute
0; it is the highest form of data. Examples: absenteeism,
341
00:46:47,349 --> 00:46:58,530
breakdown hours, as shown earlier. And computationally,
all arithmetic operations are possible here.
342
00:46:58,530 --> 00:47:08,170
Now, if I go by the order of information available,
then definitely the first one is
343
00:47:08,170 --> 00:47:18,890
nominal, followed by ordinal, then
interval, then ratio. Then definitely,
344
00:47:18,890 --> 00:47:30,970
in order of increasing information,
this is the case: my best data is ratio,
345
00:47:30,970 --> 00:47:41,339
next best is interval, next best is ordinal,
and nominal is the lowest information data.
346
00:47:41,339 --> 00:47:50,310
So, you know the different data types. Now,
as you will be applying multivariate
347
00:47:50,310 --> 00:47:56,890
statistical modeling, you will require full-fledged
data. So, you need to know the data source.
348
00:47:56,890 --> 00:48:02,260
Primary data is collected from the source where
it is generated. For example, in the
349
00:48:02,260 --> 00:48:08,900
steel washer example, if you collect data
from the production shop by just going there and
350
00:48:08,900 --> 00:48:17,140
collecting data, that is what is known as
primary data. Suppose you want to see the
351
00:48:17,140 --> 00:48:24,490
behavior of the animals in the jungle: go and
observe, and then accordingly note down, and
352
00:48:24,490 --> 00:48:33,210
that will be your primary data.
So, for the production example,
353
00:48:33,210 --> 00:48:38,619
the profit and sales volume case, that is also
primary data, so long as you are collecting
354
00:48:38,619 --> 00:48:46,380
from the source. What is secondary data? Secondary
data is stored in a repository or collected by
355
00:48:46,380 --> 00:48:54,339
someone else, you are getting me? So, you
are not collecting, it is already there. We
356
00:48:54,339 --> 00:49:02,460
have different sources; for example, you may
get the financial data from some sources.
357
00:49:02,460 --> 00:49:09,630
And suppose a company is maintaining records
of their production, their maintenance
358
00:49:09,630 --> 00:49:17,960
or the health of machines, and many things.
So, you have not collected it; the company has
359
00:49:17,960 --> 00:49:23,710
stored it and you have gone there and collected
these things. Or, better, in the literature,
360
00:49:23,710 --> 00:49:29,180
you are studying something in your own area and you
find that a paper is there where some data
361
00:49:29,180 --> 00:49:34,010
is given.
So, this type of data is secondary, but secondary
362
00:49:34,010 --> 00:49:39,829
data must be authentic, in the sense
that the reference of the data is available, author
363
00:49:39,829 --> 00:49:45,770
references are there: this is that author's
literature data. But as
364
00:49:45,770 --> 00:49:51,730
it is done by somebody else, it is definitely not primary;
there you have to rely on the authenticity
365
00:49:51,730 --> 00:49:57,550
of the data collected by somebody else. Then there is
tertiary data, which is basically a common
366
00:49:57,550 --> 00:50:01,770
knowledge type of thing.
Suppose, in terms of
367
00:50:01,770 --> 00:50:09,869
modeling,
when you start with a subject area
368
00:50:09,869 --> 00:50:15,680
and your knowledge
is not very clear, you will start with the
369
00:50:15,680 --> 00:50:20,530
tertiary sources. And then slowly you go to the secondary source. Finally, when you do
370
00:50:20,530 --> 00:50:30,320
actual work you may go for the primary data sources.
371
00:50:30,320 --> 00:50:40,880
I told you earlier about models; let me
repeat this again: a model mimics reality.
372
00:50:40,880 --> 00:50:48,430
When you develop a model without
considering the reality, the real thing, you
373
00:50:48,430 --> 00:50:58,260
are not doing justice. So, the model mimics reality,
so it should have real applications;
374
00:50:58,260 --> 00:51:07,200
that is the meaning. For example,
suppose you think of a car:
375
00:51:07,200 --> 00:51:22,060
they develop these things.
What I mean to say is, they develop
376
00:51:22,060 --> 00:51:30,660
a simulation model in the computer first,
before going for developing the car,
377
00:51:30,660 --> 00:51:36,290
manufacturing one after the other in
the manufacturing shop. There must be some
378
00:51:36,290 --> 00:51:41,740
simulation model, meaning how the car will work.
379
00:51:41,740 --> 00:51:48,339
So, for that type of thing,
in terms of the reality, the
380
00:51:48,339 --> 00:51:54,800
car is the real thing. So,
that is the simplest example. And there, the
381
00:51:54,800 --> 00:52:01,890
mathematics is related to the elastic behavior
of it; that is the reality. In the statistical
382
00:52:01,890 --> 00:52:08,800
sense, when we talk about how sales volume
is dependent on other things, that is,
383
00:52:08,800 --> 00:52:14,390
absenteeism, M ratio and all those things,
that also is going to talk about the reality.
384
00:52:14,390 --> 00:52:21,260
Actually, in the statistical sense,
a model explains the regularity
385
00:52:21,260 --> 00:52:24,180
of a phenomenon.
386
00:52:24,180 --> 00:52:30,099
In Hooke’s law the regularity is: so long as
it is within the elastic limit, when the load
387
00:52:30,099 --> 00:52:37,300
is released, it will come back to the original
shape. That is the regularity. In case of our
388
00:52:37,300 --> 00:52:46,070
statistical model building we talk about data,
and data is nothing but pattern
389
00:52:46,070 --> 00:53:02,290
plus error. This pattern is the regularity,
the pattern or systematic component. So, we must
390
00:53:02,290 --> 00:53:09,670
know what our problem is, and accordingly
collect data, and you want to
391
00:53:09,670 --> 00:53:16,609
extract the pattern from this data. In case of a
prediction model, suppose you want to predict
392
00:53:16,609 --> 00:53:23,770
some y value with respect to some x values.
Then you will find out there is some linear
393
00:53:23,770 --> 00:53:33,130
combination of variables, that is X beta, and then
plus the error epsilon will be there. This is my regularity
394
00:53:33,130 --> 00:53:38,670
in my data.
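The idea that data is pattern plus error can be illustrated with a small Python sketch. Everything numeric here is assumed for the illustration (the coefficients 2.0 and 0.5, the noise level, the seed), and fit_line is a hypothetical helper that uses ordinary least squares for a single x variable, the simplest case of the X beta form mentioned above.

```python
import random

random.seed(0)

# Assumed "true" pattern: y = beta0 + beta1 * x
beta0, beta1 = 2.0, 0.5
xs = [float(i) for i in range(50)]

# Data = pattern + error, with normally distributed error.
errors = [random.gauss(0.0, 1.0) for _ in xs]
ys = [beta0 + beta1 * x + e for x, e in zip(xs, errors)]


def fit_line(xs, ys):
    """Least squares estimates (intercept, slope) for one x variable."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


b0_hat, b1_hat = fit_line(xs, ys)

# Split each observation back into estimated pattern and leftover error.
fitted = [b0_hat + b1_hat * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]
print(b0_hat, b1_hat)
```

The fitted values play the role of the systematic component and the residuals the role of the error; with an intercept in the model, the least squares residuals sum to zero up to rounding.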
So, when you repeat that similar development
395
00:53:38,670 --> 00:53:46,980
under different situations, then what will
happen? If it performs well under the
396
00:53:46,980 --> 00:53:52,760
different situations for which it is developed,
then one day we may say it is a law or a theory,
397
00:53:52,760 --> 00:54:00,700
like Hooke’s law,
that elasticity
398
00:54:00,700 --> 00:54:10,500
thing. So, we all know Newton’s laws
of motion, and we all know Dalton’s
399
00:54:10,500 --> 00:54:16,430
atomic theory and many other things. It is
not that in one day everything is developed and
400
00:54:16,430 --> 00:54:24,369
people accept it. It is basically developed,
tested, verified and validated after several
401
00:54:24,369 --> 00:54:31,790
years, and then other scientists, that
is, the researchers, accepted the fact,
402
00:54:31,790 --> 00:54:39,930
and then it was applied to different situations
and found to be working. I told you about modeling
403
00:54:39,930 --> 00:54:46,150
also: the process of building a model, physical,
mathematical and statistical. This I have
404
00:54:46,150 --> 00:54:47,900
already explained to you.
405
00:54:47,900 --> 00:54:59,119
I hope that you got a glimpse of
the purpose of applied multivariate statistical
406
00:54:59,119 --> 00:55:08,440
modeling. Actually, we want to develop empirical
models. These empirical models are
407
00:55:08,440 --> 00:55:15,060
all data based, data based in the sense that
you have data and you are going
408
00:55:15,060 --> 00:55:22,250
for building models
to find out the regularity of the data,
409
00:55:22,250 --> 00:55:31,380
or the pattern of the data, so that
you will be able to describe the relationships
410
00:55:31,380 --> 00:55:41,000
of the population, or the behavior of the population
or system in consideration. You will be able
411
00:55:41,000 --> 00:55:48,200
to establish the strength of the relationship,
you will be able to predict something, you
412
00:55:48,200 --> 00:55:54,660
may be able to prescribe something also. But
when you talk about statistical modeling,
413
00:55:54,660 --> 00:56:01,280
usually it is description,
explanation and prediction;
414
00:56:01,280 --> 00:56:09,790
these three things come into consideration.
So, slowly you will be knowing different types
415
00:56:09,790 --> 00:56:18,819
of statistical models altogether, and you will be
tempted to develop different models also,
416
00:56:18,819 --> 00:56:26,940
based on whatever data is available to you.
But before going for modeling or applying
417
00:56:26,940 --> 00:56:35,190
any statistical techniques,
what we want to say is that you have to have
418
00:56:35,190 --> 00:56:40,980
some principles in your mind.
Here I have just jotted down some
419
00:56:40,980 --> 00:56:48,770
of the principles, which I have taken from
a text book on operations research. See what
420
00:56:48,770 --> 00:56:56,280
they said: do not build a complicated
model when a simple one will suffice. For
421
00:56:56,280 --> 00:57:06,240
example, suppose I know the mean value
for the different lots of steel washers, the mean value
422
00:57:06,240 --> 00:57:17,220
of a particular characteristic, for example
the inner diameter of the different lots
423
00:57:17,220 --> 00:57:29,160
produced. If that suffices for my purpose, go
for the mean, or at most you may require the standard
424
00:57:29,160 --> 00:57:35,630
deviation of the inner diameter produced by
the different processes A, B, C, as I told you.
425
00:57:35,630 --> 00:57:41,960
So, there you do not go for, may be,
the covariance structure and many other things.
426
00:57:41,960 --> 00:57:50,300
So, you do not go for it unless it is needed. Next:
beware of molding the problem to
427
00:57:50,300 --> 00:57:57,230
fit the technique. Many a time I have seen
in my case that there is one model, which we
428
00:57:57,230 --> 00:58:02,470
will be discussing later on, known as structural
equation modeling. People are using the structural
429
00:58:02,470 --> 00:58:07,940
equation model everywhere, even where a simple regression
model can do. But people are interested to
430
00:58:07,940 --> 00:58:13,160
fit the structural equation model.
So, please be a little bit cautious on those
431
00:58:13,160 --> 00:58:19,390
things: the model is for problem solving and the model
comes from the problem, not to fit a statistical
432
00:58:19,390 --> 00:58:24,609
technique. The design phase of modeling must be
conducted rigorously, and it will be discussed
433
00:58:24,609 --> 00:58:30,190
later, what we mean by design phase; it comes
under study design. A model should be verified
434
00:58:30,190 --> 00:58:36,380
prior to validation. Verification means, suppose
when you collect data you split the data
435
00:58:36,380 --> 00:58:40,849
into two halves, one for training, the other for test.
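The splitting step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular package: the data are simulated, and split_in_half, fit_line and mean_squared_error are hypothetical helper names. One half is used for model building and the other half for what the lecture calls verification.

```python
import random


def split_in_half(records, seed=0):
    """Shuffle a copy of the records and cut it into two halves."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


def fit_line(pairs):
    """Least squares (intercept, slope) for a list of (x, y) pairs."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def mean_squared_error(pairs, intercept, slope):
    """Average squared gap between observed y and the fitted line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in pairs) / len(pairs)


# Simulated data: an assumed linear pattern plus random error.
noise = random.Random(1)
data = [(float(x), 3.0 + 2.0 * x + noise.gauss(0.0, 1.0)) for x in range(40)]

# One half for model building, the other half for verification.
building_set, verification_set = split_in_half(data)
intercept, slope = fit_line(building_set)

print("error on building set:    ", mean_squared_error(building_set, intercept, slope))
print("error on verification set:", mean_squared_error(verification_set, intercept, slope))
```

If the error on the held-out half is close to the error on the building half, the model is doing more than memorizing the data it was built on.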
436
00:58:40,849 --> 00:58:48,480
In other words, one set is for model
building, the other set for verification. And validation
437
00:58:48,480 --> 00:58:52,890
basically talks about when you take some new
data again and you find it is working; that
438
00:58:52,890 --> 00:58:58,710
is validation. A model should never be taken
too literally. But many a time what I have
439
00:58:58,710 --> 00:59:05,670
found is that when there are many variables,
statistics is taken in a very loose way:
440
00:59:05,670 --> 00:59:11,000
there are many variables, so let us find whether
a relationship is there or not; this type
441
00:59:11,000 --> 00:59:15,950
of thing, whatever variable is there, let us find
that relationship, without considering the
442
00:59:15,950 --> 00:59:19,760
purpose.
443
00:59:19,760 --> 00:59:26,329
A model should neither be pressed to do nor
criticized for failing to do that for which
444
00:59:26,329 --> 00:59:32,430
it was never intended. For example, you are
interested to see the relationship between
445
00:59:32,430 --> 00:59:37,170
variables of a particular population. Now,
later on you say that you want to
446
00:59:37,170 --> 00:59:41,400
predict something. See, you developed the model
to see the pattern, the strength of relationship,
447
00:59:41,400 --> 00:59:49,900
not to predict. So, how can your model
predict, which it was not intended for? So,
448
00:59:49,900 --> 00:59:57,160
that is another issue. So, if it fails to
do prediction when it was just to understand
449
00:59:57,160 --> 01:00:03,599
the covariance structure, then we should not
criticize it for this, nor should we press
450
01:00:03,599 --> 01:00:09,980
the model to do it. Beware of overselling
a model many a times.
451
01:00:09,980 --> 01:00:16,910
We basically make, I can say, recommendations
based on the model, and many of the things are
452
01:00:16,910 --> 01:00:24,710
basically from common sense, so that type
of selling is prohibited. Some of the primary benefits
453
01:00:24,710 --> 01:00:31,710
of modeling are associated with the process
of developing the model. So, see, as all
454
01:00:31,710 --> 01:00:37,040
of you are busy in learning multivariate
statistics, multivariate modeling, do
455
01:00:37,040 --> 01:00:42,020
not think that you will always be doing something
great with this type of modeling you are
456
01:00:42,020 --> 01:00:46,560
learning. In the learning process, when you
develop something, you know the physics of
457
01:00:46,560 --> 01:00:51,829
the problem, may be you know the process through
which data is generated, you know how the
458
01:00:51,829 --> 01:00:57,020
data is to be captured, how the data is to be analyzed,
what techniques are applicable.
459
01:00:57,020 --> 01:01:02,760
So, this is an entire gamut, and this gamut
of process is very, very important. So,
460
01:01:02,760 --> 01:01:09,720
many benefits you acquire
out of it. A model cannot be any better than the information
461
01:01:09,720 --> 01:01:15,790
that goes into it. So, you cannot say that
you are using nominal data and you will be
462
01:01:15,790 --> 01:01:21,390
basically talking about a model of regression
where the y variable is nominal. So, you have
463
01:01:21,390 --> 01:01:26,369
to go for some other type of model for
that, may be another form of regression. So, the
464
01:01:26,369 --> 01:01:32,210
quality of information that
is fed into the model is more important,
465
01:01:32,210 --> 01:01:39,839
because if the input is not good then you should
not expect the output to be good either.
466
01:01:39,839 --> 01:01:49,099
So, a model cannot replace the decision maker, getting
me? You cannot think that your model
467
01:01:49,099 --> 01:01:55,270
is superior to you, the decision maker. The
analyst who has the system knowledge
468
01:01:55,270 --> 01:02:04,599
will act smart. So, they are the more important
people. So, whatever you develop, whatever
469
01:02:04,599 --> 01:02:11,150
you do, for what purpose you are developing,
all these things are in your brain,
470
01:02:11,150 --> 01:02:19,309
and what works there is better
than any model. So, in this case what I want
471
01:02:19,309 --> 01:02:28,839
to say is that you please consider all those issues
that I have discussed, the principles particularly,
472
01:02:28,839 --> 01:02:37,069
in this series, and accordingly develop the
model. Today it is up to this. Next class,
473
01:02:37,069 --> 01:02:41,450
we will be studying the statistical approaches to model building.
474
01:02:41,450 --> 01:02:45,210
Thank you for your patience.