1 00:00:18,520 --> 00:00:28,920 Good morning, and welcome to the first lecture of applied multivariate statistical modeling. 2 00:00:28,920 --> 00:00:33,310 Let me tell you the content of today’s presentation. 3 00:00:33,310 --> 00:00:42,770 We will start with an introduction, then variables, data types, data sources, models 4 00:00:42,770 --> 00:00:48,830 and modeling, followed by principles of modeling, statistical approaches to model building, 5 00:00:48,830 --> 00:00:58,030 multivariate models, some illustrative examples and three cases, followed by references. The entire 6 00:00:58,030 --> 00:01:01,989 content will be covered in two hours. 7 00:01:01,989 --> 00:01:16,070 Today, I will try to finish up to principles of modeling. Let us start by defining what 8 00:01:16,070 --> 00:01:39,140 applied multivariate statistical modeling is. Let us define it word by word; the first is 9 00:01:39,140 --> 00:01:52,580 applied. Now, what do we mean by applied? In science, there is pure science and applied 10 00:01:52,580 --> 00:02:07,610 science. Pure science we generally understand as basic science, which basically 11 00:02:07,610 --> 00:02:21,920 talks about laws, theories and their development, and it definitely links 12 00:02:21,920 --> 00:02:33,720 with the phenomena which we usually observe in different aspects of our life. Now, applied 13 00:02:33,720 --> 00:02:42,100 science uses the knowledge of pure science and develops something for the 14 00:02:42,100 --> 00:02:51,750 benefit of mankind. So, for example, when you 15 00:02:51,750 --> 00:03:03,860 talk about engineering, it is basically applied. Now, when I talk about applied statistics, 16 00:03:03,860 --> 00:03:15,450 what do we mean? I am assuming that you have knowledge of preliminary basic statistics, 17 00:03:15,450 --> 00:03:29,780 for example, the normal distribution.
If you know the normal distribution, then you also know the 18 00:03:29,780 --> 00:03:38,840 probability density function f(x), which is 1 by root over 2 pi sigma square, times e to the 19 00:03:38,840 --> 00:03:50,390 power of minus x minus mu whole square by 2 sigma square, where x varies from minus infinity to plus 20 00:03:50,390 --> 00:04:03,960 infinity. This is the so-called bell-shaped curve, which was developed by Carl Friedrich 21 00:04:03,960 --> 00:04:16,859 Gauss. So, this is theoretical development; that means the 22 00:04:16,859 --> 00:04:26,780 development of this type of distribution comes under basic science. Now, suppose 23 00:04:26,780 --> 00:04:33,470 I want to apply this knowledge to a real-life situation. I can find a situation like 24 00:04:33,470 --> 00:04:54,400 this: for example, let there be three processes, A, B and C, which take certain inputs and convert them 25 00:04:54,400 --> 00:05:04,050 into value-added outputs in all cases. Let these be three identical 26 00:05:04,050 --> 00:05:13,779 machines which are producing steel washers. A steel washer is shaped like this, where 27 00:05:13,779 --> 00:05:22,099 there is an inner diameter, ID, and an outer diameter, OD, and as usual there will be a certain 28 00:05:22,099 --> 00:05:34,319 thickness of the washer, which I can call Th. Now, if you produce a large number of steel 29 00:05:34,319 --> 00:05:46,389 washers, that means the number of items produced, n, is large, then take a quality characteristic 30 00:05:46,389 --> 00:05:50,889 of the steel washer which is important to the customer, say 31 00:05:50,889 --> 00:05:59,159 ID. If you plot it, you may get this type of distribution, which is normal, and where you 32 00:05:59,159 --> 00:06:03,969 will be getting the mean here. There will definitely be a standard deviation for ID as well.
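As a rough sketch of the washer example, the snippet below evaluates the normal density and summarizes a large batch of simulated inner diameters. The mean of 10.0 mm, the standard deviation of 0.05 mm and the batch size of 5000 are illustrative assumptions, not values from the lecture.

```python
import math
import random
import statistics

def normal_pdf(x, mu, sigma):
    """Normal density: 1/sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Hypothetical process parameters for the inner diameter (mm); not from the lecture.
MU_ID, SIGMA_ID = 10.0, 0.05

random.seed(0)  # fixed seed so the run is repeatable
ids = [random.gauss(MU_ID, SIGMA_ID) for _ in range(5000)]  # n is large

sample_mean = statistics.mean(ids)
sample_sd = statistics.stdev(ids)
print(f"sample mean ID = {sample_mean:.4f} mm, sample sd = {sample_sd:.4f} mm")
print(f"density at the mean = {normal_pdf(MU_ID, MU_ID, SIGMA_ID):.3f}")
```

With a batch this large, the sample mean and standard deviation should land close to the assumed process values, which is what lets the production process be summarized by a distribution.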
33 00:06:03,969 --> 00:06:13,340 Similarly for OD, and similarly for thickness. Now, what is applied 34 00:06:13,340 --> 00:06:21,389 here? The production process A, for example, which in this case is producing steel washers, 35 00:06:21,389 --> 00:06:27,740 is converted into a statistical process, in the sense that it is described in terms of a distribution like the 36 00:06:27,740 --> 00:06:36,370 normal distribution, where we are saying that 37 00:06:36,370 --> 00:06:43,379 the behavior of this process can be interpreted like this. Now, to clarify it further: 38 00:06:43,379 --> 00:06:51,529 suppose this one is for production process A, this is 39 00:06:51,529 --> 00:07:02,969 for production process B, and the third one is for production process C. Then, using 40 00:07:02,969 --> 00:07:11,270 these, you will be able to compare A, B and C and their performance in terms 41 00:07:11,270 --> 00:07:19,599 of mean and standard deviation. It is also possible to see whether the mean ID produced 42 00:07:19,600 --> 00:07:28,129 by C is equal to that of B or A; this type of comparison is possible. So, when 43 00:07:28,129 --> 00:07:35,389 we develop something which will be useful to society, to mankind, 44 00:07:35,389 --> 00:07:49,669 then we say it is applied. Now, come to the second word, which is multivariate. 45 00:07:49,669 --> 00:07:58,659 In order to understand multivariate, we have to understand what a variable is. I think 46 00:07:58,659 --> 00:08:12,360 it is known to you that a variable is something which takes different values. 47 00:08:12,360 --> 00:08:24,069 For example, let x be ID, the inner diameter.
Then, 48 00:08:24,069 --> 00:08:32,669 if I produce one item, where i stands for the item, suppose the first item, its ID may 49 00:08:32,669 --> 00:08:41,490 take the value x 1. When we go to the second washer, it may take x 2. In this way, if I go 50 00:08:41,490 --> 00:08:51,810 up to n washers produced, x n will come into consideration. So, these are the values; 51 00:08:51,810 --> 00:08:59,950 ID takes different values, and as a result ID is a variable here. Now, in statistics 52 00:08:59,950 --> 00:09:08,070 we basically talk about two types of variables: one is the fixed variable and the other one is 53 00:09:08,070 --> 00:09:17,580 the random variable. Fixed we can also call deterministic, 54 00:09:17,580 --> 00:09:29,250 and random we can call probabilistic. For example, if I create another variable, which is month, 55 00:09:29,250 --> 00:09:38,220 it varies, but we know all the months. Suppose you ask what the next month will be: 56 00:09:38,220 --> 00:09:44,110 if this month is December, next month will be January. It is known with certainty, 57 00:09:44,110 --> 00:09:50,070 so that is a deterministic variable. But in this case, when you are going to produce a second 58 00:09:50,070 --> 00:09:59,120 lot, or even within one lot, the value of ID for the second 59 00:09:59,120 --> 00:10:06,670 item, the second washer, is not known with certainty; it is governed by a probability 60 00:10:06,670 --> 00:10:12,610 distribution. So, in that sense it is a random one: we do 61 00:10:12,610 --> 00:10:17,770 not know the value exactly, and this value comes from a certain random experiment, 62 00:10:17,770 --> 00:10:32,080 in this case the process which is producing this item. If I go on like this, then 63 00:10:32,080 --> 00:10:39,700 another variable here is OD.
Similarly, another one is thickness. Now, in order to accommodate 64 00:10:39,700 --> 00:10:51,180 more than one variable, we will write X 1 for ID, X 2 for OD and X 3 for thickness. 65 00:10:51,180 --> 00:10:58,270 Take X 1 when the first item is produced. Then this will be 66 00:10:58,270 --> 00:11:08,620 x 1 1, the second one x 2 1, up to x n 1. Similarly, for OD, x 1 2, x 2 2, up to x n 2, and if I 67 00:11:08,620 --> 00:11:14,540 go to the X 3 variable, for the first observation it is x 1 3, for the second one 68 00:11:14,540 --> 00:11:25,120 x 2 3, and so on up to x n 3. So, what we are trying to say here is that we 69 00:11:25,120 --> 00:11:32,420 are considering three variables, X 1, X 2 and X 3, which are nothing but the characteristics 70 00:11:32,420 --> 00:11:38,860 of the steel washer in this example: inner diameter, outer diameter 71 00:11:38,860 --> 00:11:47,120 and thickness. Now, if you produce n washers, what will happen? Every 72 00:11:47,120 --> 00:12:00,100 washer will have different values for ID, OD and thickness. So, these are my observations: 73 00:12:00,100 --> 00:12:05,810 the first one is observation number 1, the second one observation number 2, and like that there 74 00:12:05,810 --> 00:12:14,320 is observation number n. You see, in observation number 1, if I consider only ID, that value 75 00:12:14,320 --> 00:12:20,680 is x 1 1; if I consider all three together, observation 1 takes the values x 1 1, x 1 2, x 76 00:12:20,680 --> 00:12:29,000 1 3. Similarly, if you go on increasing the 77 00:12:29,000 --> 00:12:40,990 number of variables up to X p, then here it will be x 1 p, x 2 p, up to x n p. Now, 78 00:12:40,990 --> 00:13:01,600 each of these rows is an observation on multiple variables. What do we want to 79 00:13:01,600 --> 00:13:13,030 define here? We want to define multivariate.
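Written out, the layout just described is an n by p table with one row per washer and one column per variable. The bold symbol for the whole table is notation assumed here for convenience; the lecture only names the individual entries.

```latex
\mathbf{X} =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
```

In the washer example p = 3, so the first row is (x 1 1, x 1 2, x 1 3): the ID, OD and thickness of the first washer.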
So, to do that: we know the variable, the deterministic 80 00:13:13,030 --> 00:13:20,870 variable, and the probabilistic, that is random, variable, and this is one example where every observation 81 00:13:20,870 --> 00:13:29,200 is measured on several variables. When multiple variables come into the picture, 82 00:13:29,200 --> 00:13:36,480 each observation is a variable vector. For example, if I take the ith observation here, then x 83 00:13:36,480 --> 00:13:50,930 i will be x i 1, x i 2, up to x i p. So, it is a variable vector, that is, the ith observation 84 00:13:50,930 --> 00:14:07,780 on p variables. So, when we deal with this type of situation, where 85 00:14:07,780 --> 00:14:18,700 each of the observations has multiple values, in the sense of values on multiple 86 00:14:18,700 --> 00:14:38,270 variables, more than 1, then the situation is a multivariate situation. Now, we have defined variable 87 00:14:38,270 --> 00:14:49,680 and we have defined the multivariate situation; let us understand what a variate is. Instead of saying 88 00:14:49,680 --> 00:14:59,920 that x i is like this, suppose I create something different based on all those observations, 89 00:14:59,920 --> 00:15:16,240 namely a linear combination of the variables. For example, in our example 90 00:15:16,240 --> 00:15:23,270 there are three variables, X 1, X 2 and X 3. Suppose I create a linear combination, LC, 91 00:15:23,270 --> 00:15:36,270 which is beta 1 X 1 plus beta 2 X 2 plus beta 3 X 3. This combination will give 92 00:15:36,270 --> 00:15:46,260 a quantity or a value, or in other words a variable, which we are calling a linear 93 00:15:46,260 --> 00:15:56,170 combination of variables. This is a variate. Then what is the definition 94 00:15:56,170 --> 00:16:05,529 of a variate?
A linear combination of variables with empirically determined weights; that 95 00:16:05,529 --> 00:16:12,350 means beta 1, beta 2 and beta 3 will be determined based on observations. There are n observations, 96 00:16:12,350 --> 00:16:15,380 so we will be able to determine all those weights. 97 00:16:15,380 --> 00:16:21,410 So, a weighted linear combination of the variables, where the weights 98 00:16:21,410 --> 00:16:32,190 are determined empirically, is a variate. Now, in this case you can go for one variable, 99 00:16:32,190 --> 00:16:39,380 a single variable. That means, even if there are 3 variables, if we take p equal 100 00:16:39,380 --> 00:16:47,080 to 1, then that will be univariate; when we go for p greater than or equal to 2, 101 00:16:47,080 --> 00:17:00,210 that is multivariate. Usually, in statistics books, you will find 102 00:17:00,210 --> 00:17:08,510 univariate statistics, for example the univariate normal distribution, 103 00:17:08,510 --> 00:17:12,189 the bivariate normal distribution and the multivariate normal distribution. Bivariate 104 00:17:12,189 --> 00:17:17,220 is a part of multivariate: univariate means p equal to 1, 105 00:17:17,220 --> 00:17:23,670 bivariate means p equal to 2, and multivariate is p greater than or equal to 2. 106 00:17:23,670 --> 00:17:31,550 So, this is what multivariate is. By the word multivariate, we are definitely talking 107 00:17:31,550 --> 00:17:36,440 about a linear combination of variables where more than one variable is there, and there 108 00:17:36,440 --> 00:17:42,420 are multiple observations, not a single observation but n observations, and weights that we determine 109 00:17:42,430 --> 00:17:49,809 empirically based on the n observations that will be collected from the 110 00:17:49,809 --> 00:17:59,059 population about which we want to infer something.
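To make the idea of a variate concrete, here is a minimal sketch that forms beta 1 X 1 plus beta 2 X 2 plus beta 3 X 3 for a few washers. Both the measurements and the weights are hypothetical placeholders; in practice the weights would be estimated from the n observations by whichever multivariate technique is being used, which this sketch does not attempt.

```python
# Hypothetical measurements (mm) for three washers: (ID, OD, thickness).
observations = [
    (10.02, 20.05, 2.01),
    (9.97, 19.96, 1.98),
    (10.05, 20.10, 2.03),
]

# Placeholder weights beta_1..beta_3 (assumed, not estimated from data here).
betas = (0.5, 0.3, 0.2)

def variate(x, weights):
    """Weighted linear combination beta_1*x_1 + ... + beta_p*x_p."""
    if len(x) != len(weights):
        raise ValueError("one weight is needed per variable")
    return sum(b * xj for b, xj in zip(weights, x))

# One variate value per observation vector.
variate_values = [variate(x, betas) for x in observations]
for i, v in enumerate(variate_values, start=1):
    print(f"observation {i}: variate value = {v:.3f}")
```

Each observation vector on p variables is reduced to a single number, which is what makes the variate itself behave like a new variable.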
All those inferences we will discuss later. 111 00:17:59,059 --> 00:18:10,520 So, the third word is statistical. Now, what is statistical? By statistical we 112 00:18:10,520 --> 00:18:20,300 want to say that we are basically using statistics for what we want to infer. 113 00:18:20,300 --> 00:18:28,320 So, whenever you are developing something using statistical tools, 114 00:18:28,320 --> 00:18:35,610 that development is statistical development. Now, what is statistics? Statistics 115 00:18:35,610 --> 00:18:50,250 is nothing but collecting, organizing, analyzing, then representing and interpreting. What I 116 00:18:50,250 --> 00:19:01,380 mean to say is collecting data, organizing data, analyzing data, representing the results and 117 00:19:01,380 --> 00:19:08,110 interpreting the results for the population for which the statistical model, or the statistics, 118 00:19:08,110 --> 00:19:14,000 is used, so that some purpose will be served. 119 00:19:14,000 --> 00:19:30,140 So, when we talk about statistical, that means we talk about the population, then a sample 120 00:19:30,140 --> 00:19:36,840 consisting of data from the population, and we have some purpose, some objective, in 121 00:19:36,840 --> 00:19:44,800 our mind. We want to infer something about the population, so we collect data accordingly, 122 00:19:44,800 --> 00:19:50,700 we organize the data, we analyze the data, then we find the results, and the results we 123 00:19:50,820 --> 00:19:56,140 summarize, and based on these summarized findings we infer about the population. 124 00:19:56,140 --> 00:20:00,670 So, that is why the word statistical is used. 125 00:20:00,670 --> 00:20:10,710 Now, the last but very important one is modeling. If you want to understand it, then 126 00:20:10,710 --> 00:20:18,230 first you must understand what a model is.
There are many types of model, actually. A very simple 127 00:20:18,230 --> 00:20:29,050 one is from our school days. I can remember we talked about the spring balance, like this. 128 00:20:29,050 --> 00:20:35,590 This is a spring, an elastic one; a load P is attached to it, and 129 00:20:35,590 --> 00:20:41,830 it behaves in a certain way: if you increase the load, the elongation will be 130 00:20:41,830 --> 00:20:46,620 more; if you reduce it, it will be less. So, this is the behavior, and this is the 131 00:20:46,620 --> 00:20:52,460 spring balance model. To show the behavior of the spring, this type of physical model 132 00:20:52,460 --> 00:21:02,290 is developed. So, this is one model, which is basically a physical model. 133 00:21:02,290 --> 00:21:17,050 Now, for the same thing, when I came to my engineering studies, I found that there is one important 134 00:21:17,050 --> 00:21:27,710 theory called Hooke’s law, where sigma, the stress developed 135 00:21:27,710 --> 00:21:31,840 on the spring, and epsilon, the strain developed on it, 136 00:21:31,840 --> 00:21:45,690 are modeled in such a manner that there is a relationship like this. This is the range 137 00:21:45,690 --> 00:21:49,340 of elasticity; there is another concept called elasticity. So, what 138 00:21:49,340 --> 00:21:56,910 we all have seen there is that sigma is proportional to epsilon. So, sigma is E epsilon, where E is Young’s modulus, 139 00:21:56,910 --> 00:22:07,420 or the modulus of elasticity. So, this is the theory behind elasticity, for the 140 00:22:07,420 --> 00:22:15,500 region of an elastic body where the load is such that it will not go to the yield point or 141 00:22:15,500 --> 00:22:23,250 beyond the yield point; that is the elastic zone.
So, as long as the body is stressed within 142 00:22:23,250 --> 00:22:30,320 the elastic limit, what is the property? If you remove the load, 143 00:22:30,320 --> 00:22:39,679 then it will recover back to its original position. So, this development is possible 144 00:22:39,679 --> 00:22:47,240 because the physics of this particular spring was known, and I can say that if I know 145 00:22:47,240 --> 00:22:52,510 the modulus of elasticity, I will be able to tell the relationship between sigma and 146 00:22:52,510 --> 00:22:58,840 epsilon. At that time, in the engineering mechanics and strength of materials subjects, we learned 147 00:22:58,840 --> 00:23:09,809 these things, the basic mathematical models. In reality you will get different types 148 00:23:09,809 --> 00:23:15,090 of mathematical model. So, what I mean to say here is that there is a physical model and 149 00:23:15,090 --> 00:23:22,380 a mathematical model. Now, what do we mean by a statistical model in this case? For example, 150 00:23:22,380 --> 00:23:31,860 take the case at the beginning of this particular study: 151 00:23:31,860 --> 00:23:38,250 how did we develop all these things? Perhaps through experiment; I have no idea. But suppose 152 00:23:38,250 --> 00:23:44,240 you do not know the modulus of elasticity, but you know that it is an elastic body and you 153 00:23:44,240 --> 00:23:52,550 want to find the relationship. In that case you can do an experiment with P, varying P from 154 00:23:52,550 --> 00:23:58,850 P 1 to P n. That means you will create n different load settings, and then you will be 155 00:23:58,850 --> 00:24:06,910 getting n observations of the sigma and epsilon values: you will be getting sigma 1, sigma 156 00:24:06,910 --> 00:24:12,500 2, up to sigma n, and likewise epsilon 1, epsilon 2, up to epsilon n.
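One possible way to turn such an experiment into an empirical relationship is sketched below: a least-squares line through the origin fitted to the (epsilon, sigma) pairs, whose slope serves as an estimate of E. The five data pairs are made up for illustration, and least squares is an assumed choice of fitting method, not one named in the lecture.

```python
# Hypothetical (strain, stress) pairs from n = 5 load settings; stress in MPa.
# The numbers are invented for illustration and include small measurement noise.
strain = [0.0005, 0.0010, 0.0015, 0.0020, 0.0025]
stress = [101.0, 198.5, 302.0, 399.0, 501.5]

def fit_through_origin(x, y):
    """Least-squares slope for the model y = slope * x (no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / sxx

E_hat = fit_through_origin(strain, stress)  # empirical estimate of E
residuals = [s - E_hat * e for e, s in zip(strain, stress)]

print(f"estimated modulus E = {E_hat:.0f} MPa")
print("residuals (MPa):", [round(r, 2) for r in residuals])
```

The residuals are not all zero, so the observed points do not sit exactly on the fitted line, even though the underlying law is linear.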
157 00:24:12,500 --> 00:24:23,540 Now, if you plot these, what will happen? You may get a plot like this, with sigma here. 158 00:24:23,540 --> 00:24:30,570 Essentially, what is the difference between this and this? Here, what I am saying is that 159 00:24:30,570 --> 00:24:36,090 when you have shown me this spring balance, I 160 00:24:36,090 --> 00:24:42,240 immediately say that for an elastic body this is the diagram, because Hooke’s 161 00:24:42,240 --> 00:24:47,040 law is known to me. So, the mathematics is known to me. But in case 162 00:24:47,040 --> 00:24:53,740 it is not known, I have done several experiments here, and based on these I 163 00:24:53,740 --> 00:24:59,780 plot like this. It need not be a perfect straight line that you get when you go for the empirical 164 00:24:59,780 --> 00:25:06,220 relationship. So, this is the empirical model. When we 165 00:25:06,220 --> 00:25:13,190 talk about empirical models like this, experiment-based or data-based models, these 166 00:25:13,190 --> 00:25:18,890 are all statistical. So, for all of 167 00:25:18,890 --> 00:25:34,670 us, this is our statistical model. Now, what is modeling? Modeling is basically this: 168 00:25:34,670 --> 00:25:42,460 you want to get this type of result, but it is not that you will get all this immediately; 169 00:25:42,460 --> 00:25:48,670 there is a process, with steps. I have to understand what my purpose is. I have to understand, 170 00:25:48,670 --> 00:25:54,250 to fulfil the purpose, what variables are involved. I have to identify 171 00:25:54,250 --> 00:26:01,309 all the important variables, and then I have to see how the data on the variables will 172 00:26:01,309 --> 00:26:05,220 be collected.
For example, here I have shown you the experiment, 173 00:26:05,220 --> 00:26:12,030 but it may so happen that you cannot do an experiment. In that case, is there any 174 00:26:12,030 --> 00:26:18,460 other way of collecting data? For example, observation: someone is interested in seeing the 175 00:26:18,460 --> 00:26:25,090 behavior of a particular animal. He may not be able to do an experiment, but there are a large 176 00:26:25,090 --> 00:26:28,540 number of wild animals of that particular species. 177 00:26:28,540 --> 00:26:41,630 So, he can observe them; he is just going and observing in the field. That is field-based observation, 178 00:26:41,630 --> 00:26:50,380 whereas this one is our experiment. Sometimes we will go for naturalistic 179 00:26:50,380 --> 00:26:57,090 observations, as in the wild animal case, that is, field-based 180 00:26:57,090 --> 00:27:02,049 observation. In production, suppose in the steel washer case, you go to the production 181 00:27:02,049 --> 00:27:07,040 shop, see what is happening there, collect data and accordingly do some 182 00:27:07,040 --> 00:27:15,860 modeling; that is also naturalistic observation. All those types of data collection mechanism 183 00:27:15,860 --> 00:27:21,750 come under empirical modeling, and you have to understand all these things. So, this is 184 00:27:21,750 --> 00:27:28,559 a process, the process of modeling. The 185 00:27:28,559 --> 00:27:42,220 process of model building is called modeling. 186 00:27:42,220 --> 00:27:56,250 So, let us see some of the slides. Now that I have told you what multivariate is, 187 00:27:56,250 --> 00:28:03,190 a basic question arises: why should I use it? Why should 188 00:28:03,190 --> 00:28:09,660 I go for multivariate things? If I can do it some other way, why multivariate?
So, there 189 00:28:09,660 --> 00:28:15,240 are some key issues, which will be known to you later on, when we talk about 190 00:28:15,240 --> 00:28:15,919 multivariate. 191 00:28:15,919 --> 00:28:21,669 We talk about multiple variables, that is, a p cross 1 vector: if p is the number of variables, then 192 00:28:21,669 --> 00:28:30,830 X 1, X 2, up to X p. Now, there is a possibility that these variables are interrelated, and 193 00:28:30,830 --> 00:28:36,049 one of the easiest ways to see this is the correlation between the variables. So, 194 00:28:36,049 --> 00:28:42,530 that means you may get a correlation matrix, or, put another way, the covariance 195 00:28:42,530 --> 00:28:52,350 between the variables. By covariance what I mean to say is that if one variable varies, 196 00:28:52,350 --> 00:28:56,470 there is a possibility that some other variable also varies in a particular way; 197 00:28:56,470 --> 00:29:03,140 then there will be covariance, and standardized covariance is correlation. In 198 00:29:03,140 --> 00:29:12,330 the subsequent lectures, the covariance, which will be a p cross p matrix, will come. 199 00:29:12,330 --> 00:29:20,059 Similarly, the mean values for all those variables, mu 1, mu 2, 200 00:29:20,059 --> 00:29:28,220 up to mu p, will be there. Now, my answer to your question of why 201 00:29:28,220 --> 00:29:42,390 I should use it is this: a physical process, or for that matter any other system, which is 202 00:29:42,390 --> 00:29:52,690 characterized by multiple variables should have its behavior 203 00:29:52,690 --> 00:30:00,480 analyzed by taking into consideration all the variables characterizing it, 204 00:30:00,480 --> 00:30:07,929 when these variables are considered very important for the design, development or improvement 205 00:30:07,929 --> 00:30:14,330 of the system.
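As an illustration of the mean vector, the p cross p covariance matrix and the correlation matrix mentioned above, the following sketch computes all three for a small set of washer measurements. The five rows of data are hypothetical, and the sample covariance with divisor n minus 1 is an assumed convention.

```python
import math

# Hypothetical washer measurements (mm): columns are ID, OD, thickness.
data = [
    (10.02, 20.05, 2.01),
    (9.97, 19.96, 1.98),
    (10.05, 20.10, 2.03),
    (9.99, 20.01, 2.00),
    (10.01, 20.02, 1.99),
]

n, p = len(data), len(data[0])

# Mean vector (mu_1, ..., mu_p), estimated column by column.
means = [sum(row[j] for row in data) / n for j in range(p)]

def cov(j, k):
    """Sample covariance between variables j and k (divisor n - 1)."""
    return sum((row[j] - means[j]) * (row[k] - means[k]) for row in data) / (n - 1)

# p x p covariance matrix.
cov_matrix = [[cov(j, k) for k in range(p)] for j in range(p)]

# Correlation is covariance standardized by the two standard deviations.
sd = [math.sqrt(cov_matrix[j][j]) for j in range(p)]
corr_matrix = [[cov_matrix[j][k] / (sd[j] * sd[k]) for k in range(p)] for j in range(p)]

print("means:", [round(m, 3) for m in means])
for row in corr_matrix:
    print([round(r, 3) for r in row])
```

The off-diagonal entries of the correlation matrix are the part that a variable-by-variable (univariate) summary would leave out.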
And, as is obvious, there will 206 00:30:14,330 --> 00:30:24,000 be covariance or correlation between the variables. If I go for univariate analysis, we will lose 207 00:30:24,000 --> 00:30:31,870 substantial information about the behavior because of the non-inclusion of the covariance 208 00:30:31,870 --> 00:30:38,380 structure. So, we need to account for this covariance 209 00:30:38,380 --> 00:30:43,850 structure. In multivariate statistics, covariance is a very big issue, and it will be found 210 00:30:43,850 --> 00:30:52,010 in multivariate distributions. We will be discussing all these covariance matters, so it is required. 211 00:30:52,010 --> 00:31:05,470 For example, in the case of our steel washer, 212 00:31:05,470 --> 00:31:13,000 three variables are visibly controlling its quality: inner diameter, outer diameter and 213 00:31:13,000 --> 00:31:18,510 thickness. There is a chance that inner and outer diameter will be related, and also the thickness. 214 00:31:18,510 --> 00:31:25,240 The customer will not be able to apply the washer or fit it to their own situation 215 00:31:25,240 --> 00:31:30,240 if there is a huge mismatch. Now, if we control inner diameter or outer 216 00:31:30,240 --> 00:31:34,429 diameter or thickness separately, what will happen? The correlation structure will not be considered, 217 00:31:34,429 --> 00:31:43,440 and ultimately we will not be able to satisfy the customer. So, we will be using multivariate 218 00:31:43,440 --> 00:31:48,419 statistics or multivariate modeling when the system is complex in terms of the number 219 00:31:48,419 --> 00:31:56,720 of variables. In conditions like this, you want to keep the correlation structure intact in order 220 00:31:56,720 --> 00:32:02,929 to extract that correlation information; you want to extract the pattern from this 221 00:32:02,929 --> 00:32:09,690 data, and that is why you will be using it. So, how do I do it?
It is through the models, 222 00:32:09,690 --> 00:32:34,630 and these models will be discussed a little later. Now, what is next? 223 00:32:34,630 --> 00:32:42,130 Next, one example. Here we are saying that a particular company is operating, maybe in a 224 00:32:42,130 --> 00:32:51,100 city market, and we want to see the organizational health of this company with respect to profit 225 00:32:51,100 --> 00:32:58,510 in rupees million, sales volume in rupees hundred, absenteeism, machine breakdown 226 00:32:58,510 --> 00:33:05,870 and M ratio. Actually, these are chosen intentionally. First are profit and sales volume; these 227 00:33:05,870 --> 00:33:12,980 are the organizational health issues: if you sell more, your profit may be more, and 228 00:33:12,980 --> 00:33:20,570 if your profit is more, you are healthy financially. Another issue is absenteeism: 229 00:33:20,570 --> 00:33:28,150 if you are paying substantially and taking care of the well-being of the employees, 230 00:33:28,150 --> 00:33:32,880 absenteeism will be less. If you are maintaining the health of the process, 231 00:33:32,880 --> 00:33:40,740 here meaning the machines, your machine breakdown will be less. And if you are able 232 00:33:40,740 --> 00:33:47,770 to coordinate with your customers as well as your suppliers, then your M ratio, 233 00:33:47,770 --> 00:33:53,740 by which I mean marketing ratio, which relates to the customer, will be high. So, 234 00:33:53,740 --> 00:34:00,530 if this is the case, then we are basically observing from April, May, June, July, that is, 235 00:34:00,530 --> 00:34:10,940 12 months of data, measured in some units. This is nothing but a case of a multivariate 236 00:34:10,940 --> 00:34:19,980 situation, where in each row, starting from the first row, the values are 237 00:34:19,980 --> 00:34:23,450 multivariate observations; the first row is for the month of April.
238 00:34:23,450 --> 00:34:30,060 Similarly, for the second row these are multivariate observations, so we have multivariate 239 00:34:30,060 --> 00:34:38,790 observations. Now, you may be interested to know how profit varies over the months; 240 00:34:38,790 --> 00:34:44,000 then it will be a univariate one. If you want to see how sales volume varies over the 241 00:34:44,000 --> 00:34:49,010 months, it will also be univariate. If you want to know how absenteeism varies over the 242 00:34:49,010 --> 00:34:52,190 months, that is also univariate. 243 00:34:52,190 --> 00:35:01,300 But if you are interested in seeing how profit and sales volume covary, along with their 244 00:35:01,300 --> 00:35:07,640 own variation, then you will have to consider two variables, and that will 245 00:35:07,640 --> 00:35:13,740 be a multivariate situation. Sometimes you may be interested to know how the sales volume 246 00:35:13,740 --> 00:35:18,590 depends on absenteeism, machine breakdown and marketing ratio. Then there 247 00:35:18,590 --> 00:35:24,910 must be a dependence model, and that is again a multivariate issue. So, this, in a nutshell, is 248 00:35:24,910 --> 00:35:29,640 what I am talking about as multivariate observations. 249 00:35:29,640 --> 00:35:38,099 So, now we have discussed some of the variables, and we have seen that 250 00:35:38,099 --> 00:35:43,359 we have assigned them some values, but where are those values coming from? 251 00:35:43,359 --> 00:35:50,960 For example, for the steel washer, the thickness, or the inner or the outer 252 00:35:50,960 --> 00:35:57,510 diameter, how is it known? You have used some measurement scale to measure it. 253 00:35:57,510 --> 00:36:04,770 It may be that you have used a Vernier caliper to measure the outer diameter, and maybe 254 00:36:04,770 --> 00:36:10,859 a Vernier caliper to measure the inner diameter.
So, you have used some instrument, 255 00:36:10,859 --> 00:36:16,480 and there is also a scale of measurement. In this case the scale is basically 256 00:36:16,480 --> 00:36:25,140 length, which may be in terms of millimeters. So, you have to use some scale of measurement, 257 00:36:25,140 --> 00:36:34,790 and based on the scale used, the data you get will be of different types. 258 00:36:34,790 --> 00:36:43,530 So, you see this slide here: on the left side we are talking about random variables, and on the 259 00:36:43,530 --> 00:36:48,490 right-hand side we are talking about data types. I have explained the random variable to you 260 00:36:48,490 --> 00:36:53,869 earlier, so I will not spend much time here, but please understand one thing: 261 00:36:53,869 --> 00:36:57,400 among random variables there are discrete and continuous random variables. 262 00:36:57,400 --> 00:37:04,830 By discrete random variable we mean that it will take some countable values 263 00:37:04,830 --> 00:37:15,540 like 0, 1, 2, or January, February, March, something like this. In the 264 00:37:15,540 --> 00:37:20,890 continuous case, such as profit, absenteeism, breakdown or the M ratio here, any 265 00:37:20,890 --> 00:37:26,210 value is possible. Please understand one thing here: sales volume comes under 266 00:37:26,210 --> 00:37:33,430 discrete because it is a countable one, but many such count values can 267 00:37:33,430 --> 00:37:42,730 also be considered continuous in many situations. Anyhow, there are two types. 268 00:37:42,730 --> 00:37:48,630 Now, for data types, I told you that it depends on what measurement scale you are using. Based on 269 00:37:48,630 --> 00:37:55,250 this the data type will be known, which means that the data will have certain properties, 270 00:37:55,250 --> 00:38:01,510 because data is nothing but information.
How much information is available with the data, 271 00:38:01,510 --> 00:38:07,210 getting me, it all depends on what scale you have used to measure this data. 272 00:38:07,210 --> 00:38:12,910 So, based on that there are four types of data: nominal data, ordinal data, interval 273 00:38:12,910 --> 00:38:16,670 data and ratio data. 274 00:38:16,670 --> 00:38:24,050 Let us discuss something about nominal data. My definition is this: it provides identity to 275 00:38:24,050 --> 00:38:33,140 some items or things. If I say the month, for the small company that I 276 00:38:33,140 --> 00:38:40,050 have shown you, they want to know over the different months what the status is. So, 277 00:38:40,050 --> 00:38:47,450 the month is a variable starting from January to December, because it changes. So, then 278 00:38:47,450 --> 00:38:53,320 January, February and all those things are nothing but the identity of the period of 279 00:38:53,320 --> 00:38:59,640 time, the identity of the particular series. Suppose you are trying 280 00:38:59,640 --> 00:39:07,640 to know some performance or status of the different departments of, for example, 281 00:39:07,650 --> 00:39:13,920 an IIT. So then if I say the department of chemistry, department of physics, department of mathematics, 282 00:39:13,920 --> 00:39:18,130 department of computer science, department of industrial engineering and management, 283 00:39:18,130 --> 00:39:24,060 all those things are basically providing identity, but we sometimes need 284 00:39:24,060 --> 00:39:30,140 to include this type of data in our analysis. So, this is nothing but nominal 285 00:39:30,140 --> 00:39:36,510 data. Now, what is the problem with nominal data?
The problem with nominal data is that there 286 00:39:36,510 --> 00:39:41,270 are huge computational limitations, because you cannot do any arithmetic operations; 287 00:39:41,270 --> 00:39:46,490 you cannot add department of chemistry plus department of physics like this. We cannot 288 00:39:46,490 --> 00:39:52,910 say department of chemistry is 1 and department of physics is 2 and accordingly add; 289 00:39:52,910 --> 00:40:05,250 we cannot subtract, we cannot multiply, we cannot divide either. This is the problem. 290 00:40:05,250 --> 00:40:09,960 The next data type is your ordinal data type. What is the ordinal data type? Suppose 291 00:40:09,960 --> 00:40:15,530 you have travelled by flight several times, may be, or by train, or to some other 292 00:40:15,530 --> 00:40:20,790 places, or you have gone to a restaurant, and when you have taken food you might have 293 00:40:20,790 --> 00:40:25,990 seen that you are given a feedback form. They are asking you to please 294 00:40:25,990 --> 00:40:32,369 rate, in the case of a hotel, food quality, service quality, room quality, all those things, 295 00:40:32,369 --> 00:40:40,220 in terms of satisfaction, from totally unsatisfied to extremely satisfied. 296 00:40:40,220 --> 00:40:47,650 This type of scale is used; for example, for the food case, taste wise it is very 297 00:40:47,650 --> 00:40:53,920 good, good or something like this. So, when this type of ordering is there, 298 00:40:53,920 --> 00:41:02,940 this is called ordinal data. So, what does it do? It provides some order or rank to some items or 299 00:41:02,940 --> 00:41:10,260 things, for example service quality: it is low, medium or good. And the computational limitations: 300 00:41:10,260 --> 00:41:28,180 here also we cannot do any arithmetic operations like your addition, subtraction, multiplication 301 00:41:28,180 --> 00:41:36,020 and division. You cannot do them; then in what way is it better than our nominal data?
It is 302 00:41:36,020 --> 00:41:42,980 better than nominal data because here you are getting an order, a rank you are getting. 303 00:41:42,980 --> 00:41:52,080 If I say that my student performance is low, average, very good, excellent, like 304 00:41:52,080 --> 00:41:57,310 this, the person who is getting excellent is definitely better than the person or the 305 00:41:57,310 --> 00:42:08,410 student who got very good. So, I have a ranking scale here, a ranking ability with this data. 306 00:42:08,410 --> 00:42:18,150 So, ordinal data is rich compared to nominal data. 307 00:42:18,150 --> 00:42:26,690 The next data type, I said, is interval data. What is interval data? It is basically well 308 00:42:26,690 --> 00:42:37,869 understood if we take this example: here we are measuring temperature using two scales, one is 309 00:42:37,869 --> 00:42:45,960 Celsius, the other is Fahrenheit. In developing these two scales, the Fahrenheit scale as well 310 00:42:45,960 --> 00:42:58,080 as your Celsius scale, the reference point is taken at two different points, meaning locations. 311 00:42:58,080 --> 00:43:06,430 It is not the same, you getting me? And if you see the horizontal lines here, you see 312 00:43:06,430 --> 00:43:12,220 0 degree centigrade, 20 degree centigrade and 100 degree centigrade. Then the corresponding 313 00:43:12,220 --> 00:43:19,400 Fahrenheit will be 32, 68 and 212 Fahrenheit, understanding? 314 00:43:19,400 --> 00:43:27,730 So, there is a range: if I say the difference from 100 to 0 degree, you are getting this 315 00:43:27,730 --> 00:43:34,640 range; here also, 212 to 32, the corresponding range is this. So, whether we measure using the 316 00:43:34,640 --> 00:43:41,080 Celsius scale or the Fahrenheit scale, we will be getting the equal range. Now, what will 317 00:43:41,080 --> 00:43:47,839 happen suppose, I measured temperature today?
Today the day temperature is 20 degree centigrade, 318 00:43:47,839 --> 00:43:55,730 tomorrow 22 degree and may be day after tomorrow 21 degree. Then if I want to do the averaging, 319 00:43:55,730 --> 00:44:02,060 I can add them and then divide by 3, and I will get the 3 days average. If I do the same 320 00:44:02,060 --> 00:44:08,450 thing in Fahrenheit, it is also possible; I can do that similar thing. But what 321 00:44:08,450 --> 00:44:13,670 will happen? Suppose I want to say how 322 00:44:13,670 --> 00:44:21,380 many times the temperature of tomorrow is compared to today’s temperature. 323 00:44:21,380 --> 00:44:28,240 Then if I use the Celsius scale I divide 22 by 20, and in Fahrenheit it may be 324 00:44:28,240 --> 00:44:33,940 70 and other things; then we will find out they are not matching. So, that means the interval 325 00:44:33,940 --> 00:44:42,089 scale is a scale where you will get interval data, range data, and they have all 326 00:44:42,089 --> 00:44:51,859 types of continuous properties, and they can do 3 arithmetic operations very easily: 327 00:44:51,859 --> 00:44:57,920 addition, subtraction and multiplication. But when you do division, you will find out 328 00:44:57,920 --> 00:45:06,619 that when you change the scale, the result changes. Ultimately what will happen? You will find 329 00:45:06,619 --> 00:45:17,450 that in interval data you cannot go for division. 330 00:45:17,450 --> 00:45:32,510 In interval data division is not possible; all other arithmetic operations are possible. 331 00:45:32,510 --> 00:45:35,690 Let us go to the next slide. 332 00:45:35,690 --> 00:45:44,599 On this slide we are talking about ratio data. Ratio data is something where there 333 00:45:44,599 --> 00:45:48,380 is an absolute 0 in the scale of measurement.
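The interval-scale point made above can be checked with a few lines of arithmetic. What follows is only a minimal sketch, not part of the lecture: the three daily readings (20, 22 and 21 degrees Celsius) and the breakdown hours are assumed illustrative values, and the helper name `c_to_f` is hypothetical. It shows that an average survives the change from Celsius to Fahrenheit, while a "how many times" ratio does not, and that a ratio does survive a change of unit on a scale with an absolute zero.

```python
def c_to_f(celsius):
    """Convert a Celsius reading to Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0

# Assumed illustrative daily temperatures in degrees Celsius.
celsius = [20.0, 22.0, 21.0]
fahrenheit = [c_to_f(c) for c in celsius]

# Averaging is meaningful on an interval scale: the mean computed in
# Celsius, once converted, equals the mean computed in Fahrenheit.
mean_c = sum(celsius) / len(celsius)
mean_f = sum(fahrenheit) / len(fahrenheit)
mean_consistent = abs(c_to_f(mean_c) - mean_f) < 1e-9

# Division is not meaningful: the "how many times" figure depends on
# which scale was used, because the two zeros are at different places.
ratio_c = celsius[1] / celsius[0]        # 22 / 20
ratio_f = fahrenheit[1] / fahrenheit[0]  # 71.6 / 68
ratios_agree = abs(ratio_c - ratio_f) < 1e-9

# On a ratio scale (absolute zero), a change of unit keeps the ratio.
breakdown_hours = [4.0, 2.0]             # assumed values
breakdown_minutes = [h * 60.0 for h in breakdown_hours]
ratio_hours = breakdown_hours[0] / breakdown_hours[1]
ratio_minutes = breakdown_minutes[0] / breakdown_minutes[1]

print("means consistent across scales:", mean_consistent)
print("temperature ratios agree:", ratios_agree)
print("breakdown ratios:", ratio_hours, ratio_minutes)
```

The design choice here is to compare the same quantity under two units rather than to argue abstractly: if a summary changes when only the unit changes, it was not a meaningful summary for that scale.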
334 00:45:48,380 --> 00:45:56,839 This is 0. If I move towards the right, suppose by x amount, and towards the left also by x amount, then 335 00:45:56,839 --> 00:46:02,060 the difference, this difference is the same. If I go for y also, on this side also y, that 336 00:46:02,060 --> 00:46:05,430 is the same. So, that 337 00:46:05,430 --> 00:46:12,200 means if you go to the left it is negative, on this side it is positive, but there is an absolute 338 00:46:12,200 --> 00:46:21,099 0 in between. So, this 0 is the reference point, not as in the Fahrenheit and centigrade 339 00:46:21,099 --> 00:46:31,980 scales, where there are two different definitions. 340 00:46:31,980 --> 00:46:47,349 So, ratio data contains an absolute 0 and is the highest form of data; examples are absenteeism and 341 00:46:47,349 --> 00:46:58,530 breakdown hours, as shown earlier. And computationally, all arithmetic operations are possible here. 342 00:46:58,530 --> 00:47:08,170 Now, if I go by the order of information available, then definitely your first one is 343 00:47:08,170 --> 00:47:18,890 nominal, followed by ordinal, then your interval, then your ratio. Then definitely, 344 00:47:18,890 --> 00:47:30,970 in order of increasing information, this is the case: my best data is this, 345 00:47:30,970 --> 00:47:41,339 next best is this, next best is this, and this is the data with the lowest information. 346 00:47:41,339 --> 00:47:50,310 So, you know the different data types. Now, you know that as you will be applying multivariate 347 00:47:50,310 --> 00:47:56,890 statistical modeling, you must require full-fledged data.
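The ordering of the four data types by information content can be written down compactly. The sketch below is an illustration only, under stated assumptions: the class name `Scale`, the dictionary `MEANINGFUL` and the helper `supports` are hypothetical names, and the comparison labels ("equality", "ordering", "difference", "ratio") are one common way of summarising what each scale supports, not wording from the lecture.

```python
from enum import IntEnum


class Scale(IntEnum):
    """Measurement scales in order of increasing information."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Comparisons that are meaningful on each scale; every scale keeps
# what the previous one offers and adds one more.
MEANINGFUL = {
    Scale.NOMINAL: {"equality"},
    Scale.ORDINAL: {"equality", "ordering"},
    Scale.INTERVAL: {"equality", "ordering", "difference"},
    Scale.RATIO: {"equality", "ordering", "difference", "ratio"},
}


def supports(scale, comparison):
    """Return True if the comparison is meaningful on the given scale."""
    return comparison in MEANINGFUL[scale]


# Example variables mentioned in the lecture, mapped to their scales.
variables = {
    "month": Scale.NOMINAL,
    "service quality": Scale.ORDINAL,
    "temperature in Celsius": Scale.INTERVAL,
    "breakdown hours": Scale.RATIO,
}

# The variable measured on the richest scale carries the most information.
richest = max(variables, key=variables.get)
print(richest, "is measured on the", variables[richest].name, "scale")
```

Using an `IntEnum` keeps the ranking explicit, so a check such as `Scale.NOMINAL < Scale.RATIO` mirrors the statement that nominal data carries the least information and ratio data the most.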
So, you need to know the data source. 348 00:47:56,890 --> 00:48:02,260 Primary data is collected from the source where it is generated. For example, in the case of 349 00:48:02,260 --> 00:48:08,900 the steel washer example, if you collect data from the production shop, just going there and 350 00:48:08,900 --> 00:48:17,140 collecting data, that is what is known as primary data. Suppose you want to see the 351 00:48:17,140 --> 00:48:24,490 behavior of the animals in the jungle: go and observe, and then accordingly note it down, and 352 00:48:24,490 --> 00:48:33,210 that will be your primary data. So, for the production example, 353 00:48:33,210 --> 00:48:38,619 the profit and sales volume case, that is also primary data, so long as you are collecting 354 00:48:38,619 --> 00:48:46,380 from the source. What is secondary data? Secondary data is stored in a repository or collected by 355 00:48:46,380 --> 00:48:54,339 someone else, you are getting me? So, you are not collecting; it is already there. We 356 00:48:54,339 --> 00:49:02,460 have different sources; for example, you may get the financial data from some sources. 357 00:49:02,460 --> 00:49:09,630 And suppose a company is maintaining records of their production, and suppose of their maintenance 358 00:49:09,630 --> 00:49:17,960 or the health of machines and many things. So, you have not collected it; the company has 359 00:49:17,960 --> 00:49:23,710 stored it and you have gone there and collected these things. Or, better, in the literature 360 00:49:23,710 --> 00:49:29,180 you are studying something in your own area, and you found that a paper is there where some data 361 00:49:29,180 --> 00:49:34,010 is given.
So, this type of data is secondary but secondary 362 00:49:34,010 --> 00:49:39,829 data must have must be authentic, in the sense that reference of the data is available, author 363 00:49:39,829 --> 00:49:45,770 references are there, this is that author literature data but this is definitely as 364 00:49:45,770 --> 00:49:51,730 it is done by somebody else. It is not primary, there you have to rely on the authenticity 365 00:49:51,730 --> 00:49:57,550 of the data collected by somebody else. The tertiary data which is basically a common 366 00:49:57,550 --> 00:50:01,770 knowledge type of things. Suppose, you know you will find many things 367 00:50:01,770 --> 00:50:09,869 are there actually when in terms of modeling, modeling when you start with a subject area 368 00:50:09,869 --> 00:50:15,680 you start with this that when your knowledge is not very clear, you will start with the 369 00:50:15,680 --> 00:50:20,530 tertiary sources. And then slowly you go to the secondary source. Finally, when you do 370 00:50:20,530 --> 00:50:30,320 actual work you may go for the primary data sources. 371 00:50:30,320 --> 00:50:40,880 I told you earlier for this model, let me repeat this again that model mimics reality 372 00:50:40,880 --> 00:50:48,430 that when you develop a model that without considering the reality, the real thing you 373 00:50:48,430 --> 00:50:58,260 are not doing the justice. So, the model reality so it should be a, it should have real applications 374 00:50:58,260 --> 00:51:07,200 that is what is the meaning. For example, suppose you think of a car which is got with 375 00:51:07,200 --> 00:51:22,060 by suppose any they develop, they develop these things. What I mean to say they develop 376 00:51:22,060 --> 00:51:30,660 a model simulation model in computer first before going for a developing the car, one 377 00:51:30,660 --> 00:51:36,290 after the other manufacturing. 
The car in the manufacturing shop, there must be some 378 00:51:36,290 --> 00:51:41,740 simulation model, meaning how the car will work. 379 00:51:41,740 --> 00:51:48,339 So, that type of thing means that, in terms of the reality, the 380 00:51:48,339 --> 00:51:54,800 car is the real thing. So, that is the simplest example of modeling, and the 381 00:51:54,800 --> 00:52:01,890 mathematics is related to the elastic behavior of it; that is the reality. In the statistical 382 00:52:01,890 --> 00:52:08,800 sense, when we talk about how sales volume is dependent on other things, that is your 383 00:52:08,800 --> 00:52:14,390 absenteeism, M ratio and all those things, that also is going to talk about the reality. 384 00:52:14,390 --> 00:52:21,260 Actually, in the statistical sense, a model explains the regularity 385 00:52:21,260 --> 00:52:24,180 of a phenomenon. 386 00:52:24,180 --> 00:52:30,099 In Hooke’s law the regularity is that, so long as it is within the elastic limit, when the load 387 00:52:30,099 --> 00:52:37,300 is released, it will come back to the original shape; that is the regularity. In the case of our 388 00:52:37,300 --> 00:52:46,070 statistical model building we talk about data, and data is nothing but pattern 389 00:52:46,070 --> 00:53:02,290 plus error; this pattern is the regularity, or systematic component. So, we must 390 00:53:02,290 --> 00:53:09,670 know what our problem is, and accordingly collect the data, and you want to 391 00:53:09,670 --> 00:53:16,609 extract the pattern from this data. In the case of a prediction model, suppose you want to predict 392 00:53:16,609 --> 00:53:23,770 some y value with respect to some x values. Then you will find out there is some linear 393 00:53:23,770 --> 00:53:33,130 combination of the variables, that is X beta, and then plus epsilon, the error, will be there. X beta is my regularity, 394 00:53:33,130 --> 00:53:38,670 and pattern plus error is my data.
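The decomposition of data into pattern plus error can be made concrete with a one-predictor least squares fit. This is a minimal sketch under stated assumptions, not the lecturer's own computation: the numbers are made up for illustration (absenteeism as the predictor, sales volume as the response), and the variable names `b0`, `b1`, `pattern` and `error` are hypothetical. It fits a straight line, which plays the role of X beta, and confirms that every observation equals its fitted value plus a residual.

```python
# Made-up illustrative data: absenteeism (%) and sales volume (units).
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [98.0, 91.0, 88.0, 79.0, 74.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Ordinary least squares estimates for y = b0 + b1 * x + error.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Pattern: the systematic component (the fitted line).
pattern = [b0 + b1 * xi for xi in x]
# Error: what is left over after the pattern is removed.
error = [yi - pi for yi, pi in zip(y, pattern)]

# Data = pattern + error, observation by observation.
reconstructed = [pi + ei for pi, ei in zip(pattern, error)]

print("intercept:", b0, "slope:", b1)
print("pattern:", pattern)
print("error:", error)
```

With an intercept in the model, the least squares residuals sum to zero, which is one simple check that the error part carries no leftover systematic shift.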
So, when you repeat similar that similar development 395 00:53:38,670 --> 00:53:46,980 under different situations then what will happen? Then if it performs well under the 396 00:53:46,980 --> 00:53:52,760 different situation for which it is developed, the one day we may say it is a law or a theory 397 00:53:52,760 --> 00:54:00,700 like Hooke’s law or Hooke’s or this Hooke’s law is this, I left there that elasticity 398 00:54:00,700 --> 00:54:10,500 thing. So, we all know that Newton’s laws of motion and we all know that Dalton’s 399 00:54:10,500 --> 00:54:16,430 atomic theory and many other things that these are not one day everything is developed and 400 00:54:16,430 --> 00:54:24,369 people accept it. It basically developed at test stage verified, validated after several 401 00:54:24,369 --> 00:54:31,790 years and then other scientist other that is the researcher, they accepted the fact 402 00:54:31,790 --> 00:54:39,930 and then it was applied to different situations and found that it is working. I told you modeling, 403 00:54:39,930 --> 00:54:46,150 also process of building a process, physical, mathematical and statistical, this is I have 404 00:54:46,150 --> 00:54:47,900 already explained to you. 405 00:54:47,900 --> 00:54:59,119 I hope that you got the glimpse of actually the purpose of applied multivariate statistical 406 00:54:59,119 --> 00:55:08,440 modeling. Actually, we want to develop empirical model, those empirical models is these are 407 00:55:08,440 --> 00:55:15,060 all data based, data based in the sense that they contain you have data. And you are going 408 00:55:15,060 --> 00:55:22,250 for building models and you are building models, and to find out the regularity of the data, 409 00:55:22,250 --> 00:55:31,380 or the pattern of the data. And show that you will be able to describe the relationships 410 00:55:31,380 --> 00:55:41,000 of the population or the behavior of the population or system in consideration. 
You will be able 411 00:55:41,000 --> 00:55:48,200 to establish the strength of the relationship, you will be able to predict something, you 412 00:55:48,200 --> 00:55:54,660 may be able to prescribe something also. But when you talk about statistical modeling, 413 00:55:54,660 --> 00:56:01,280 usually it is description, explanation and prediction; 414 00:56:01,280 --> 00:56:09,790 these three things come into consideration. So, slowly you will come to know different types 415 00:56:09,790 --> 00:56:18,819 of statistical models altogether, and you will be tempted to develop different models also, 416 00:56:18,819 --> 00:56:26,940 based on whatever data is available to you. But before going for modeling or applying 417 00:56:26,940 --> 00:56:35,190 any statistical technique, what we want to say is that you have to have 418 00:56:35,190 --> 00:56:40,980 some principles in your mind before going for this. Here I have just jotted down some 419 00:56:40,980 --> 00:56:48,770 of the principles, which I have taken from a textbook on operations research. See what 420 00:56:48,770 --> 00:56:56,280 they said: do not build a complicated model when a simple one will suffice. For 421 00:56:56,280 --> 00:57:06,240 example, suppose I know the mean value for the different lots of steel washers, the mean value 422 00:57:06,240 --> 00:57:17,220 of a particular characteristic, for example the inner diameter of the different lots 423 00:57:17,220 --> 00:57:29,160 produced. And if that suffices for my purpose, go for the mean, or at most you may require the standard 424 00:57:29,160 --> 00:57:35,630 deviation of the inner diameter produced by the different processes A, B, C, as I told you. 425 00:57:35,630 --> 00:57:41,960 So, there you do not go for, may be, the covariance structure and many other things.
426 00:57:41,960 --> 00:57:50,300 So, you do not go for it; only if it is needed you go for it. Beware of molding the problem to 427 00:57:50,300 --> 00:57:57,230 fit the technique. Many a time I have seen in my case that there is one model, which we 428 00:57:57,230 --> 00:58:02,470 will be discussing later on, known as structural equation modeling. People are using the structural 429 00:58:02,470 --> 00:58:07,940 model everywhere, even where a simple regression model can be used. But people are interested to 430 00:58:07,940 --> 00:58:13,160 fit the structural equation model. So, please be a little bit cautious on those 431 00:58:13,160 --> 00:58:19,390 things: a model is for problem solving, and the model comes from the problem, not to fit a statistical 432 00:58:19,390 --> 00:58:24,609 technique. The design phase of modeling must be conducted rigorously, and it will be discussed 433 00:58:24,609 --> 00:58:30,190 later. What do we mean by design phase? It is coming under study design. A model should be verified 434 00:58:30,190 --> 00:58:36,380 prior to validation. Verification means, suppose, when you collect data you split the data 435 00:58:36,380 --> 00:58:40,849 into two halves, one for your training, the other for test. 436 00:58:40,849 --> 00:58:48,480 In what other way can I say it? One set for model building, the other set for verification. And validation 437 00:58:48,480 --> 00:58:52,890 basically talks about when you take some new data again and you find it is working; that 438 00:58:52,890 --> 00:58:58,710 is validation. A model should never be taken too literally, but many a time what I have 439 00:58:58,710 --> 00:59:05,670 found is that when there are many variables, statistics is taken in a very loose sense. 440 00:59:05,670 --> 00:59:11,000 So, if there are many variables, let us find whether the relationship is there or not, this type 441 00:59:11,000 --> 00:59:15,950 of thing, or whatever variable is there, let us find that relationship without considering the 442 00:59:15,950 --> 00:59:19,760 purpose.
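The verification step described above, splitting the collected data into two halves with one half for model building and the other for checking, can be sketched in a few lines. This is an illustration only, under stated assumptions: the records are synthetic, the function names `split_in_halves`, `fit_line` and `mean_abs_error` are hypothetical, and a straight-line fit stands in for whatever model is actually being built. Validation in the lecture's sense would use fresh data collected later, which this sketch does not attempt.

```python
import random


def split_in_halves(records, seed=0):
    """Shuffle a copy of the records and split it into two halves."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def fit_line(pairs):
    """Least squares intercept and slope for (x, y) pairs."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    s_xx = sum((x - x_bar) ** 2 for x, _ in pairs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def mean_abs_error(pairs, intercept, slope):
    """Average absolute gap between observed and predicted y."""
    return sum(abs(y - (intercept + slope * x)) for x, y in pairs) / len(pairs)


# Synthetic records: a straight-line pattern plus a small alternating error.
records = [(float(x), 104.0 - 3.0 * x + (0.5 if x % 2 else -0.5))
           for x in range(1, 11)]

build_set, verify_set = split_in_halves(records, seed=0)
intercept, slope = fit_line(build_set)            # model building half
verify_mae = mean_abs_error(verify_set, intercept, slope)  # verification half

print("built on", len(build_set), "records, verified on", len(verify_set))
print("verification mean absolute error:", verify_mae)
```

Fixing the seed makes the split repeatable, so the same verification figure is obtained each time the sketch is run.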
443 00:59:19,760 --> 00:59:26,329 A model should neither be pressed to do, nor criticized for failing to do, that for which 444 00:59:26,329 --> 00:59:32,430 it was never intended. For example, you are interested to see the relationship between 445 00:59:32,430 --> 00:59:37,170 variables of a particular population. Now, later on you want to 446 00:59:37,170 --> 00:59:41,400 predict something. See, you developed a model to see the pattern and strength of relationship, 447 00:59:41,400 --> 00:59:49,900 not to predict. So, how can your model predict what it was not intended for? So, 448 00:59:49,900 --> 00:59:57,160 that is another issue. So, if it fails to do prediction when it was just meant to understand 449 00:59:57,160 --> 01:00:03,599 the covariance structure, then we should not criticize it for this, nor should we press 450 01:00:03,599 --> 01:00:09,980 the model to do it. Beware of overselling a model. Many a time 451 01:00:09,980 --> 01:00:16,910 we basically make, I can say, recommendations based on the model, and many of the things are 452 01:00:16,910 --> 01:00:24,710 basically from common sense, and so that type of selling is prohibited. Some of the primary benefits 453 01:00:24,710 --> 01:00:31,710 of modeling are associated with the process of developing the model. So, see, as all 454 01:00:31,710 --> 01:00:37,040 of you are busy in learning multivariate statistics, multivariate modeling, do 455 01:00:37,040 --> 01:00:42,020 not think that you will always be doing something great with this type of modeling you are 456 01:00:42,020 --> 01:00:46,560 learning. So, in the learning process, when you develop something, you know the physics of 457 01:00:46,560 --> 01:00:51,829 the problem, may be you know the process through which data is generated, you know how the 458 01:00:51,829 --> 01:00:57,020 data is to be captured, how the data is to be analyzed, what techniques are applicable.
459 01:00:57,020 --> 01:01:02,760 So, this is an entire gamut, and this gamut of process is very, very important. So, very 460 01:01:02,760 --> 01:01:09,720 many benefits you acquire out of it. A model cannot be any better than the information 461 01:01:09,720 --> 01:01:15,790 that goes into it. So, you cannot say that you are using nominal data and you will be 462 01:01:15,790 --> 01:01:21,390 basically talking about a usual regression model where the y variable is nominal. So, you have 463 01:01:21,390 --> 01:01:26,369 to go for some other type of model for that, may be another kind of regression. So, the 464 01:01:26,369 --> 01:01:32,210 quality of information that is fed into that model, that is more important, 465 01:01:32,210 --> 01:01:39,839 because if the input is not good then you should not expect the output to be good either. 466 01:01:39,839 --> 01:01:49,099 So, a model cannot replace the decision maker, getting me? You cannot think that your model 467 01:01:49,099 --> 01:01:55,270 is superior to you, the decision maker. The analysts who have the system knowledge, 468 01:01:55,270 --> 01:02:04,599 they will act smart. So, they are the more important people. So, whatever you develop, whatever 469 01:02:04,599 --> 01:02:11,150 you do, for what purpose you are developing, all these things are in your brain, 470 01:02:11,150 --> 01:02:19,309 and that is the root of the work there, better than any model. So, in this case what I want 471 01:02:19,309 --> 01:02:28,839 to say is that you please take all those issues which I have discussed, the principles particularly, 472 01:02:28,839 --> 01:02:37,069 in this series and accordingly develop the model. Today it is up to this. Next class, 473 01:02:37,069 --> 01:02:41,450 we will be studying the statistical approach to model building. 474 01:02:41,450 --> 01:02:45,210 Thank you for your patience.