﻿1 00:00:18,460 --> 00:00:26,070 Good afternoon. Last class we have described the multivariate statistical modeling from 2 00:00:26,070 --> 00:00:35,550 the purpose as well as different modeling techniques point of view, and we ended that 3 00:00:35,550 --> 00:00:52,010 lecture with prerequisites for the course, prerequisites for this course or subject. 4 00:00:52,010 --> 00:01:02,230 And what we have described there that basic statistics is one of the prerequisites and 5 00:01:02,230 --> 00:01:14,960 I told u that you also require to know matrix algebra a bit. Now, under basic statistics 6 00:01:14,960 --> 00:01:38,210 univariate statistics, the univariate descriptive statistics and univariate inferential statistics are 7 00:01:38,210 --> 00:01:48,490 important. So, again under univariate descriptive statistics 8 00:01:48,490 --> 00:02:15,900 usually the central tendency and dispersions, these two issues are described under descriptive 9 00:02:15,900 --> 00:02:35,489 statistics. Under inferential statistics estimation and your hypothesis testing, under estimation 10 00:02:35,489 --> 00:02:59,440 there will be point estimation and interval estimation. Today, we will in this lecture 11 00:02:59,440 --> 00:03:05,090 we will describe this one univariate descriptive statistics. 12 00:03:05,090 --> 00:03:17,250 You see the content of today’s this lecture, we will start with population and parameters 13 00:03:17,250 --> 00:03:23,639 then we will describe probability distribution. Particularly, the normal probability distributions 14 00:03:23,639 --> 00:03:30,660 then we discuss sample and statistics followed by measure of central tendency, measure of 15 00:03:30,660 --> 00:03:41,620 dispersion and followed by references. Now, do you have any idea about population? 
16 00:03:41,620 --> 00:03:55,479 What do you mean by population? In the general sense we speak of the population of West Bengal 17 00:03:55,479 --> 00:04:06,209 or the population of India, but in statistics population has a much broader sense. For example, 18 00:04:06,209 --> 00:04:27,862 in the last class we described one example, a small company doing business in a city. 19 00:04:27,889 --> 00:04:49,120 The company has a production system; can it be a population? If you define population from the statistics point 20 00:04:49,120 --> 00:05:04,430 of view, a population is the entirety, the totality, or the whole. The entirety or the 21 00:05:04,430 --> 00:05:16,599 totality, or we can say the whole: when we talk about the population of West Bengal, that means 22 00:05:16,599 --> 00:05:25,229 each and every legal resident of West Bengal is considered. Now think from the production 23 00:05:25,229 --> 00:05:30,039 system point of view. The word system also represents a population 24 00:05:30,039 --> 00:05:40,539 in the way we understand the application of statistics, so for us system is synonymous with population. It 25 00:05:40,539 --> 00:05:51,819 is also a population because a system can be characterized by different variables; for example, 26 00:05:51,819 --> 00:06:00,780 for this company there are profit, sales volume, and absenteeism. Many other variables we 27 00:06:00,780 --> 00:06:06,710 have discussed, so these are basically what characterize the population or the system. 28 00:06:06,710 --> 00:06:12,009 Another word for us could be process. 29 00:06:12,009 --> 00:06:25,810 A process can also be thought of along this line. From our purpose point 30 00:06:25,810 --> 00:06:37,749 of view, a process is something where activities or transformation 31 00:06:37,749 --> 00:06:54,169 take place. 
For example, you give inputs as a raw materials and the process production 32 00:06:54,169 --> 00:07:12,259 process it converts into value added output. Now, if we consider the total life cycle of 33 00:07:12,259 --> 00:07:36,139 this process, then it will produce a large number of items, so large number of items will be produced. 34 00:07:36,139 --> 00:07:42,779 All items collectively is the entirety the totality or the whole, so that things with respect 35 00:07:42,779 --> 00:07:56,539 to the items produced by this production process. We can define population from, if you go to 36 00:07:56,539 --> 00:08:03,479 the service sector for example, the health care system or the banking system there also 37 00:08:03,479 --> 00:08:11,490 you can define population. So, essentially if you want to define population and you required 38 00:08:11,490 --> 00:08:18,629 to keep in your mind two things that when I talk about the population of West Bengal. 39 00:08:18,629 --> 00:08:30,120 Suppose this type, this figure let it be the portion, now the hilly regions population, 40 00:08:30,120 --> 00:08:36,990 at the hilly region that is different than the population in the west of West Bengal 41 00:08:36,990 --> 00:08:45,980 or south of West Bengal. Now, for a particular purpose you may be interested to understand 42 00:08:45,980 --> 00:08:52,300 what the educational status of the people is, if the hilly people of West Bengal. 43 00:08:52,300 --> 00:08:59,900 Then what is happening is you are making a boundary, creating a boundary for the system, 44 00:08:59,900 --> 00:09:05,020 so this boundary is the hilly region. So, in that case your population is this hilly 45 00:09:05,020 --> 00:09:13,120 region only, now if you think from the voting point of view, suppose the election time. 46 00:09:13,120 --> 00:09:21,290 So, this and all the legal that voters they go for voting in that case all the voters 47 00:09:21,290 --> 00:09:28,970 of the total West Bengal, they are the population. 
So, in that sense, what is happening is that 48 00:09:28,970 --> 00:09:43,000 if you really want to define a population, the boundary is very important. Are you getting me? 49 00:09:43,000 --> 00:09:51,180 Boundary in this sense: if you come back again to the manufacturing scenario, in the 50 00:09:51,180 --> 00:09:56,700 manufacturing scenario you will find that the total production system may be composed 51 00:09:56,700 --> 00:10:05,500 of several subsystems. For example, this may be machine 1, machine 2, and machine 52 00:10:05,500 --> 00:10:14,220 3, and they are doing different operations; raw material comes here and is transferred to 53 00:10:14,220 --> 00:10:19,250 machine M 1, then goes to machine M 2, and some activity or other is going on. Now, if 54 00:10:19,250 --> 00:10:26,080 you are interested to infer something about machine 1, suppose you want to infer something 55 00:10:26,080 --> 00:10:36,350 about machine 1, then your population is this machine. If you think that there are some common characteristics 56 00:10:36,350 --> 00:10:43,500 applicable to all the machines, then you may be interested to see the totality 57 00:10:43,500 --> 00:10:53,430 including all the machines, and then your population will consider all the machines. This is very 58 00:10:53,430 --> 00:11:04,270 important. Unless we understand the population, there is no use of statistics, because statistics 59 00:11:04,270 --> 00:11:12,200 is used to infer about the population, and inference relates to many things. During inferential 60 00:11:12,200 --> 00:11:20,540 statistics, we will be telling you what the different possible inferences are, but for 61 00:11:20,540 --> 00:11:26,470 the time being please understand that when we talk about a population, we talk about 62 00:11:26,470 --> 00:11:36,510 a system or a process. 
Why we require to study the process or the population, because 63 00:11:36,510 --> 00:11:44,070 we want to understand the behavior of the process or the system or in terms of the population 64 00:11:44,070 --> 00:11:49,160 you want to study the behavior. 65 00:11:49,160 --> 00:12:02,480 Now, if you see the size of population what will happen? Population can be finite, can 66 00:12:02,480 --> 00:12:11,390 be infinite when I am talking about, suppose the production of a process for 1 year, number 67 00:12:11,390 --> 00:12:20,800 of items produced per year. If that is my population then it is a finite population, 68 00:12:20,800 --> 00:12:30,510 so time is another aspect which also defines, used to define the population. 69 00:12:30,510 --> 00:12:36,430 So, one is the boundary another one is the time, so in two this is basically boundary 70 00:12:36,430 --> 00:12:44,910 in the sense, space boundary and the time boundary. So, if you go for the entire lifecycle 71 00:12:44,910 --> 00:12:51,170 of a process then what will happen can you count that what are the number of outputs 72 00:12:51,170 --> 00:12:59,160 it is very, very difficult. So, if we talk about the entire lifecycle, total time of 73 00:12:59,160 --> 00:13:03,480 the life of the process what will happen the number of items produced will be countable 74 00:13:03,480 --> 00:13:18,420 infinite, whether countable infinite or infinite, we will basically define in statistics in 75 00:13:18,420 --> 00:13:19,740 two senses. 76 00:13:19,740 --> 00:13:26,840 One is that your population will be finite population or infinite population and finite 77 00:13:26,840 --> 00:13:32,470 population will means the size is known that is N. For example, number of items produced 78 00:13:32,470 --> 00:13:40,310 by a production shop in 2012, infinite population size is infinite number of items produced 79 00:13:40,310 --> 00:13:51,380 that is on the lifecycle of the process that is countable infinite. 
If you need further 80 00:13:51,380 --> 00:14:03,940 explanation as I told you in the last class, that random experiment is the issue in statistics 81 00:14:03,940 --> 00:14:10,330 deals with random variables. Random variables comes from random experiment, 82 00:14:10,330 --> 00:14:19,930 we generate random variable based on the experiments conducted. So, if we do one experiment like 83 00:14:19,930 --> 00:14:26,610 this, you see this figure inside this if I say this is basically all and inside this 84 00:14:26,610 --> 00:14:35,800 there are red and white balls. Now, you pick up one ball, next one ball like this one after 85 00:14:35,800 --> 00:14:42,620 another without replacement what will happen after sometime there will be no ball to pick 86 00:14:42,620 --> 00:14:53,930 up experiment will end, this is finite population. Now, in other cases see that what we do in 87 00:14:53,930 --> 00:15:01,600 the second experiment you pick up again replace, so what will happen in that case. 88 00:15:01,600 --> 00:15:13,090 In that case, the number of ball will never exhausted, there are so many balls red white 89 00:15:13,090 --> 00:15:19,950 what you are doing you are picking up and finding out whether it is a red or white. 90 00:15:19,950 --> 00:15:27,260 So, either red or white, so you are counting that red then again you are replacing this, 91 00:15:27,260 --> 00:15:33,800 similar manner you are continuing this experiment, the size of the population what will happen. 92 00:15:33,800 --> 00:15:39,910 The number of balls will remain as it is from experimental point of view it will be different, 93 00:15:39,910 --> 00:15:49,660 so this is what is infinite population? In statistics most of the issues what will 94 00:15:49,660 --> 00:15:58,570 be discussed later on we consider infinite population, so in reality they are it may 95 00:15:58,570 --> 00:16:07,810 not be 100 percent true that all populations are infinite. 
But, countable infinite populations 96 00:16:07,810 --> 00:16:13,970 are many and for our practical purposes, if we consider this infinite population there 97 00:16:13,970 --> 00:16:33,810 is no problem. Population behavior if you measure, you require to know that what are 98 00:16:33,810 --> 00:16:41,570 the variables, that is governing the population in sense characterize the population. So, 99 00:16:41,570 --> 00:16:58,350 population is characterized by different variables applicable to the population. 100 00:16:58,350 --> 00:17:19,380 For example, if we consider the total students of IIT Kharagpur, all students of IIT Kharagpur 101 00:17:19,380 --> 00:17:28,820 this is my population all students of IIT Kharagpur and the Kharagpur students they 102 00:17:28,820 --> 00:17:40,200 come from different demographic. Their demographics differ their socio economic family, socio 103 00:17:40,200 --> 00:17:53,279 economic status differ their performance in the graduation that mean in IIT Kharagpur 104 00:17:53,279 --> 00:18:02,179 exams that also differ. So, for performance you may be interested to see that what is 105 00:18:02,179 --> 00:18:11,230 percentage of marks of tenth or CGPA your cumulative grade point average or somewhere 106 00:18:11,230 --> 00:18:16,379 related to demography. You may be interested to see that what is 107 00:18:16,379 --> 00:18:30,690 the age, profile age sometimes we may be interested to know their height profile you see age, 108 00:18:30,690 --> 00:18:40,850 sorry height, age, percentage of marks CGPA. Under socio economic status, family income, 109 00:18:40,850 --> 00:18:50,100 all are basically coming under these are all variables which characterize the students 110 00:18:50,100 --> 00:19:01,100 of IIT Kharagpur. 
So, if you want to understand a population, not only the space and time boundary, 111 00:19:01,100 --> 00:19:09,090 we also require to understand what the variables are that govern the population; that 112 00:19:09,090 --> 00:19:18,419 is what we see, basically. Consider any one of the variables, let it be height. 113 00:19:18,419 --> 00:19:26,169 I am writing height of the students, and this I am denoting as x, which is a random variable, 114 00:19:26,169 --> 00:19:35,690 let it be. Here, we are saying it is random because if we just pick up one student, you 115 00:19:35,690 --> 00:19:41,830 do not know what his height is; you find out the height only when you measure it. 116 00:19:41,830 --> 00:19:53,559 So, that is x. I want to characterize the students in terms of their height, or you 117 00:19:53,559 --> 00:20:05,179 may be interested to characterize the students in terms of the number of subjects completed 118 00:20:05,179 --> 00:20:15,840 in a year. We will find that there are many backlog cases; many students could 119 00:20:15,840 --> 00:20:16,789 not complete. 120 00:20:16,789 --> 00:20:20,950 So, in that sense, if we consider that there are 10 subjects to 121 00:20:20,950 --> 00:20:27,769 be completed, it may so happen that you will find some students with 0 subjects completed, some 122 00:20:27,769 --> 00:20:34,830 students with 1, and so on up to 10, although it will be heavily biased towards 10. But 123 00:20:34,830 --> 00:20:47,230 this is possible, and depending upon what type of random variable you have considered, 124 00:20:47,230 --> 00:20:55,149 you require to use a certain probability distribution accordingly. 125 00:20:55,149 --> 00:21:02,029 Last class I told you that if the variable is discrete, suppose x is a discrete variable, 126 00:21:02,029 --> 00:21:17,190 a discrete random variable, then you have to use a discrete probability distribution. We discussed this 127 00:21:17,190 --> 00:21:21,659 last class. 
But we have not said what those probability distributions are; later on we 128 00:21:21,659 --> 00:21:27,940 will see. What you can see very easily is this: suppose x is a discrete variable and it can 129 00:21:27,940 --> 00:21:37,549 take values 0, 1, 2, 3 like this. Then you make a tally chart, a tally chart 130 00:21:37,549 --> 00:21:46,909 in the sense of frequency against 0, 1, 2, 3, 4, 5. Suppose when you get a 0 131 00:21:46,909 --> 00:22:01,100 you put one mark like this. Then again suppose a 0 comes, and similarly you continue like this. 132 00:22:01,100 --> 00:22:04,509 This is the tally count. What happened? The occurrence of 0 is 2 times, this 133 00:22:04,509 --> 00:22:18,450 one 6 times, this one 8 times, this one 4 times, this one 2 times, this one 1 time. So, by categorization 134 00:22:18,450 --> 00:22:28,379 what do we mean? Here we mean that I have my discrete random variable, which can take 135 00:22:28,379 --> 00:22:47,379 different values, suppose 0, 1, 2, 3, 4, 5, and each value appears a different number of times. I think 136 00:22:47,379 --> 00:23:02,990 all of you know this. This is nothing but the frequency diagram, and in this frequency diagram, 137 00:23:02,990 --> 00:23:11,249 if I know the total number and you divide each of the frequencies by the total, then 138 00:23:11,249 --> 00:23:19,369 you will be getting the relative frequency. That relative frequency will give you the empirical 139 00:23:19,369 --> 00:23:26,100 probability distribution, and this distribution is known as the probability mass function. 140 00:23:26,100 --> 00:23:42,409 That is what we have called the p m f. Here you see that for discrete variables, when you get 141 00:23:42,409 --> 00:23:50,149 this type of plot, you are basically developing the probability mass function. But I told you 142 00:23:50,149 --> 00:23:57,629 that we will be considering an infinite population, and infinite population means that the totality 143 00:23:57,629 --> 00:24:07,309 is not known. 
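The step from tally counts to relative frequencies described above can be sketched in a few lines of Python. This is only an illustration: the counts are the ones read out from the tally chart, and assigning them to the values 0 to 5 in the order spoken is an assumption, as are the names `tally` and `empirical_pmf`.

```python
from fractions import Fraction

# Tally counts read out in the lecture. Mapping them to x = 0..5 in the
# order they were spoken is an assumption made for this sketch.
tally = {0: 2, 1: 6, 2: 8, 3: 4, 4: 2, 5: 1}


def empirical_pmf(counts):
    """Divide each frequency by the total to get relative frequencies."""
    total = sum(counts.values())
    return {x: Fraction(n, total) for x, n in counts.items()}


pmf = empirical_pmf(tally)
for x, p in pmf.items():
    print(f"x = {x}: frequency = {tally[x]}, relative frequency = {float(p):.3f}")
```

Exact fractions are used so that the relative frequencies add up to exactly 1, which is the property that makes them usable as an empirical probability mass function.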
The second thing is that our variable is random; what value 144 00:24:07,309 --> 00:24:13,100 it will assume the next minute, we do not know. So, in the population domain you 145 00:24:13,100 --> 00:24:19,730 cannot get that value immediately. Yes, we 146 00:24:19,730 --> 00:24:25,779 will get the values when we go for sampling, but at least before sampling we do not have 147 00:24:25,779 --> 00:24:33,840 all those values. So, what you can do for the particular variable concerned is that you can 148 00:24:33,840 --> 00:24:41,509 expect something. What is this expectation? Suppose we want to know the average 149 00:24:41,509 --> 00:24:48,980 height of IIT students; this is nothing but the expected value of x. That expected 150 00:24:48,980 --> 00:24:57,509 value of x, the variable of interest, is known as the mean. 151 00:24:57,509 --> 00:25:14,539 That is mu; mu stands for the mean, the expected value of x. When your variable x is a 152 00:25:14,539 --> 00:25:27,230 discrete variable, you will get it like this: the sum of x f x over all i, sorry, over all x, 153 00:25:27,230 --> 00:25:36,749 whatever the number of values may be. If you see this 154 00:25:36,749 --> 00:25:43,850 example, we are saying here that x can take this value, this value, and so on; there 155 00:25:43,850 --> 00:25:55,570 are 1, 2, 3, 4, 5, 6 values. So, I assume that these values are nothing but 0, 1, 2, 156 00:25:55,570 --> 00:26:06,059 3, 4, 5, and these are all the x values. Sitting here against each 157 00:26:06,059 --> 00:26:11,059 discrete value is its probability value, and the probability values are like this. 
Suppose 158 00:26:11,059 --> 00:26:19,389 the first one is 0.15, the second one is 0.20, the third one is 0.25, the fourth one again 159 00:26:19,389 --> 00:26:32,700 you can write as 0.20, the fifth one suppose 0.15. Then what is left? These five add up to 0.95, 160 00:26:32,700 --> 00:26:41,110 so 0.05. Then what is your expected value? Here you have to find x into f x: 0 into 161 00:26:41,110 --> 00:26:53,369 0.15 is 0, 1 into 0.2 is 0.2, 2 into 0.25 is 0.50, 3 into 0.2 is 0.60, and like this 162 00:26:53,369 --> 00:27:05,190 4 into 0.15 is again 0.60, and 5 into 0.05 will be 0.25. If you add 0, 0.20, 0.50, 0.60, 0.60 and 0.25, 163 00:27:05,190 --> 00:27:14,799 you get 2.15. That means your expected value is, if this is 0, this is 1 and this is 164 00:27:14,799 --> 00:27:26,409 2, somewhere here. So, if I draw here, I can say that suppose this is my 0 value, this 165 00:27:26,409 --> 00:27:39,149 is the value 1, this one is 2, this one is 3, this is your 4 and then 5. So, 166 00:27:39,149 --> 00:27:55,720 this is 0, 1, 2; somewhere here your value is 2.15. This is what expectation is, but another 167 00:27:55,720 --> 00:27:59,359 measure is also there. 168 00:27:59,359 --> 00:28:03,700 The other one is sigma square. So, we have 169 00:28:03,700 --> 00:28:13,659 just described mu, that is, the mean; as well as that, there is another measure, 170 00:28:13,659 --> 00:28:26,299 which is sigma square, that is, the variance. What is variance? It is the expected value of x minus 171 00:28:26,299 --> 00:28:45,499 mu whole square. So, for this discrete case you will write the sum over all x of x minus mu whole square 172 00:28:45,499 --> 00:28:49,710 into f x. 173 00:28:49,710 --> 00:28:56,480 You have computed two things here. What are those things? You have x, and then 174 00:28:56,480 --> 00:29:05,830 with x f x you know the mu value; the mean value you know. 
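The expectation and variance formulas for the discrete case can be checked with a short Python sketch. The probability values are the ones from the example above; the function names are chosen only for this illustration, and the variance figure is computed here rather than stated in the lecture.

```python
# Probability mass function from the lecture example: x -> f(x).
pmf = {0: 0.15, 1: 0.20, 2: 0.25, 3: 0.20, 4: 0.15, 5: 0.05}


def mean(pmf):
    """mu = sum over all x of x * f(x)."""
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    """sigma^2 = sum over all x of (x - mu)^2 * f(x)."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


mu = mean(pmf)
sigma_sq = variance(pmf)
print(f"sum of probabilities = {sum(pmf.values()):.2f}")
print(f"mu = {mu:.2f}")
print(f"sigma^2 = {sigma_sq:.4f}")
```

The mean comes out as 2.15, matching the hand calculation, and the same table gives a variance of about 2.03.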
Now, you can create x 175 00:29:05,830 --> 00:29:12,669 minus mu, that means 0 minus 2.15, that is minus 2.15; like this you can calculate for each x, and 176 00:29:12,669 --> 00:29:23,309 then you square it, multiply it by f x, and add them up. So, you will be getting sigma square, that 177 00:29:23,309 --> 00:29:37,480 is, the variance. Then what have we assumed? Here we have assumed that x can take these 178 00:29:37,480 --> 00:29:44,710 six values only and this is the probability mass function. So, what will be the sum total 179 00:29:44,710 --> 00:29:57,269 of these probability values? And then what are mu and sigma, or sigma square? 180 00:29:57,269 --> 00:30:04,730 mu is the long run mean. Correct. 181 00:30:04,730 --> 00:30:07,830 Sigma is... The long run standard deviation. That means you 182 00:30:07,830 --> 00:30:13,419 are saying that the mean and standard deviation will vary for a population for a 183 00:30:13,419 --> 00:30:24,389 particular characteristic? It should not, for a large population. 184 00:30:24,389 --> 00:30:31,509 No, even for a small population. It should not vary. 185 00:30:31,509 --> 00:30:36,909 It should not vary; it is a constant when we talk about a parameter. 186 00:30:36,909 --> 00:30:48,449 So, actually mu and sigma square here are what in statistics we call the population mean and population 187 00:30:48,720 --> 00:31:09,880 variance. They are constant; then another issue is that they are also not known. Here, we have assumed 188 00:31:10,210 --> 00:31:16,309 a finite population, a very small population, and we have calculated something like this. With 189 00:31:16,309 --> 00:31:22,330 an infinite population size you will not get all values of x; you will never get them. 
190 00:31:22,330 --> 00:31:32,460 If the size is infinite, that means you cannot measure this or, say, compute this, but probably 191 00:31:32,460 --> 00:31:39,970 what you can do is expect something; that is why the term expectation is used here, 192 00:31:39,970 --> 00:31:49,999 expectation is used here. So, if I say population parameter, now you can understand that these 193 00:31:49,999 --> 00:31:56,200 two are population parameters. By saying these two are population parameters, 194 00:31:56,200 --> 00:32:03,759 please do not conclude that there is no other population parameter; these two are some of 195 00:32:03,759 --> 00:32:08,029 the many population parameters, these two, mean and standard deviation. 196 00:32:08,029 --> 00:32:18,039 Standard deviation or variance, they are population parameters. Why do we go for population parameters? 197 00:32:18,039 --> 00:32:27,009 Today’s topic is a very simple topic from the calculation point of view. 198 00:32:27,009 --> 00:32:36,239 From the understanding point of view, we must understand why we require population parameters. We require 199 00:32:36,239 --> 00:32:41,049 population parameters because if you know these two parameters and you also know that your 200 00:32:41,049 --> 00:32:46,460 x is a random variable that follows a certain probability distribution, if you know that 201 00:32:46,460 --> 00:32:53,629 distribution and also know the parameters, then you do not require to go back to 202 00:32:53,629 --> 00:33:00,600 that particular process or system for further study as far as this particular variable is concerned. 203 00:33:00,600 --> 00:33:10,440 Suppose I say that the absenteeism on the shop floor of the production shop considered 204 00:33:10,440 --> 00:33:18,220 follows something like a normal distribution, and we know that its mean is mu and sigma 205 00:33:18,220 --> 00:33:25,979 square is the variance. That means I am in a position to know this. So, if I know 206 00:33:25,979 --> 00:33:32,399 the distribution, what is actually happening? Here, that real production shop, from the worker 207 00:33:32,399 --> 00:33:40,210 performance point of view, the absenteeism, is converted to a mathematical equation, a 208 00:33:40,210 --> 00:33:47,359 statistical equation. That is the advantage. That means if I truly know what 209 00:33:47,359 --> 00:33:53,359 the probability distribution is with respect to a variable and what the population 210 00:33:53,359 --> 00:33:58,749 parameters are for that variable, I have the distribution at hand. 211 00:33:58,749 --> 00:34:06,600 So, I do not require to go further so long as the process does not change. By process does 212 00:34:06,600 --> 00:34:16,030 not change, what I mean to say is this: suppose it is a machine; over time the machine condition 213 00:34:16,030 --> 00:34:23,450 deteriorates. That means today a new machine has its performance; after 10 years the 214 00:34:23,450 --> 00:34:28,770 machine will not perform at the same level. That means the characteristics 215 00:34:28,770 --> 00:34:33,679 change. So long as the characteristics are not changing, the distribution itself is enough 216 00:34:33,679 --> 00:34:47,309 for you. Now, suppose your variable is not discrete, your variable is continuous. You see 217 00:34:47,309 --> 00:35:01,470 this one on the left hand side; is this a p m f or a p d f? It is a p d f, a probability density function. 218 00:35:01,470 --> 00:35:06,450 Now, why in the continuous case do we say probability density function, 219 00:35:06,450 --> 00:35:20,549 whereas in the discrete case we say probability mass function? You think about this one. So, here 220 00:35:20,549 --> 00:35:29,329 also, suppose we know that this particular population has a mean and a variance. 
When it 221 00:35:29,329 --> 00:35:35,450 is continuous, you have to use these two equations for expectation, so basically 222 00:35:35,450 --> 00:35:42,670 integration will come into the picture. The mean is the integral from minus infinity to plus infinity, 223 00:35:42,670 --> 00:35:51,039 depending on the range for which the variable is defined, of x f x d x, and your sigma square 224 00:35:51,039 --> 00:35:54,299 is nothing but the expected value, again, of x minus mu whole square. 225 00:35:54,299 --> 00:36:11,260 This is the integral from minus infinity to plus infinity of x minus mu whole square f x d x. So, I am describing the parameters 226 00:36:11,260 --> 00:36:24,280 mu and sigma square here; I hope that you understand. Now, what is a population? A population 227 00:36:24,280 --> 00:36:33,319 is characterized by a probability distribution, if the random variable has a probability distribution. 228 00:36:33,319 --> 00:36:39,240 If you know the parameters of the distribution for that variable, you have characterized 229 00:36:39,240 --> 00:36:49,589 the population. That is what is known as characterization of a population in terms of a probability distribution. 230 00:36:49,589 --> 00:36:55,809 Now, there are many probability distributions. 231 00:36:55,809 --> 00:37:05,280 You can see here that they come under two heads, discrete distributions and 232 00:37:05,280 --> 00:37:10,730 continuous distributions. Under discrete distributions: binomial distribution, Poisson distribution, 233 00:37:10,730 --> 00:37:16,760 negative binomial, geometric, hypergeometric, and many more in the series. Similarly, under continuous: 234 00:37:16,760 --> 00:37:22,369 normal, lognormal, exponential, Weibull, gamma; so many distributions are there. They are all probability 235 00:37:22,369 --> 00:37:28,170 distributions, and we will not discuss all of them. 
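For reference, the spoken definitions of the two parameters can be written out as follows, with f(x) denoting the probability mass function in the discrete case and the probability density function in the continuous case. This is a transcription of the formulas described in the lecture into standard notation, not an addition to them.

```latex
\begin{align*}
\text{Discrete:}\quad
  \mu &= E[x] = \sum_{x} x\, f(x), &
  \sigma^2 &= E\big[(x-\mu)^2\big] = \sum_{x} (x-\mu)^2 f(x) \\
\text{Continuous:}\quad
  \mu &= E[x] = \int_{-\infty}^{\infty} x\, f(x)\, dx, &
  \sigma^2 &= E\big[(x-\mu)^2\big] = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\, dx
\end{align*}
```

In the continuous case the limits of integration shrink to the range over which the variable is actually defined.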
We will discuss only the normal distribution 236 00:37:28,170 --> 00:37:36,910 here, because in multivariate statistical modeling the normality assumption 237 00:37:36,910 --> 00:37:46,530 is a very vital one. Many of the models assume normality of the data; at 238 00:37:46,530 --> 00:37:52,180 the multivariate level that will be multivariate normality. So, we will discuss only the normal 239 00:37:52,180 --> 00:37:54,650 distribution. For the other distributions, 240 00:37:54,650 --> 00:38:03,700 you can follow the Johnson book. I am sure all of you are 241 00:38:03,700 --> 00:38:13,849 familiar with this distribution; this is the normal distribution. It looks like this, 242 00:38:13,849 --> 00:38:24,900 and mu is the center point here, and it is basically symmetric. So, the maximum 243 00:38:24,900 --> 00:38:31,710 number of observations you will find around this level, and on both sides 244 00:38:31,710 --> 00:38:41,619 it will gradually reduce. Finally, beyond the 3 sigma level it will be almost negligible, close to the 0 245 00:38:41,619 --> 00:38:49,690 level, like this. Now, how do you read this normal distribution? You see that within plus 246 00:38:49,690 --> 00:38:57,280 minus 1 sigma, this is very important, minus 1 sigma to plus 1 sigma, about 68.27 percent of the observations 247 00:38:57,280 --> 00:39:04,289 fall. Then within the plus minus 2 sigma level it will be a 248 00:39:04,289 --> 00:39:14,000 little more than 95 percent, but not 96, 95 point something, and if you consider the plus minus 249 00:39:14,000 --> 00:39:26,069 3 sigma level then 99.73 percent of the observations will fall in this zone. 
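The coverage figures for the normal distribution can be reproduced with the error function from Python's standard library. This is a small sketch: `prob_within` is a name chosen for the illustration, and it relies on the standard identity that the probability of a normal variable lying within k standard deviations of its mean equals erf(k / sqrt(2)).

```python
import math


def prob_within(k):
    """P(mu - k*sigma < X < mu + k*sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within +/-{k} sigma: {100 * prob_within(k):.2f} percent")
```

The result does not depend on the particular values of mu and sigma, which is why the same three percentages can be quoted for any normally distributed variable.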
Now, 250 00:39:26,069 --> 00:39:35,059 this is important because suppose you think you are producing something and your variable 251 00:39:35,059 --> 00:39:43,030 of interest follows normal distribution, now from data’s you will get like this distribution 252 00:39:43,030 --> 00:39:50,280 is like this what will happen. That the spread of this particular variable values within 253 00:39:50,280 --> 00:39:56,880 plus minus 3 sigma level 99.73 percent of the items produced will fall under within 254 00:39:56,880 --> 00:40:06,119 this, now what will happen if the customer will not be interested at this wide range. 255 00:40:06,119 --> 00:40:14,500 For example, suppose this is my quality characteristics x and this is the lower specification limit 256 00:40:14,500 --> 00:40:34,359 this I am giving the physical interpretation. Suppose this is your upper specification limit 257 00:40:34,359 --> 00:40:40,059 then what will happen ultimately, suppose this is the mean one this follows normal distribution 258 00:40:40,059 --> 00:40:55,480 and your distribution may be like this. So, this may be let it be minus 2 sigma plus 2 sigma, 259 00:40:55,480 --> 00:41:04,069 so what will happen ultimately that 5 percent almost 5 percent of your production is rejected 260 00:41:04,069 --> 00:41:07,400 product because people that customer will not accept it. 261 00:41:07,400 --> 00:41:11,609 So, when I talk about or say that characterization of the process that means with respect to 262 00:41:11,609 --> 00:41:20,940 this distribution, this is your process exactly that this is the shop representation of the 263 00:41:20,940 --> 00:41:24,690 process you are not going to the shop floor. 
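The rejection figure in the specification limit example can be sketched as follows. The numbers for `mu` and `sigma` are hypothetical and chosen only for illustration; what matters is that the lower and upper specification limits sit at minus 2 sigma and plus 2 sigma, as in the example above.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative probability P(X <= x) for X ~ N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))


def fraction_rejected(lsl, usl, mu, sigma):
    """Share of production falling outside the specification limits."""
    inside = normal_cdf(usl, mu, sigma) - normal_cdf(lsl, mu, sigma)
    return 1.0 - inside


# Hypothetical process values; only the +/- 2 sigma placement of the
# limits comes from the lecture example.
mu, sigma = 50.0, 2.0
lsl, usl = mu - 2 * sigma, mu + 2 * sigma

rejected = fraction_rejected(lsl, usl, mu, sigma)
print(f"rejected: {100 * rejected:.2f} percent")
```

With the limits at plus minus 2 sigma the rejected share is about 4.55 percent, which is the "almost 5 percent" mentioned in the lecture. Moving the limits or changing sigma in the sketch shows how a narrower spread reduces rejections.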
But this is the case: your customer region 264 00:41:24,690 --> 00:41:30,680 is here and you are producing at this level, so 5 percent of your production is not accepted 265 00:41:30,680 --> 00:41:38,789 by the customer. You want to improve it, are you getting me? You want to improve it; how 266 00:41:38,789 --> 00:41:40,289 will you do it? 267 00:41:40,289 --> 00:41:45,760 Can you explain this figure? This figure, you see, is again basically, as I told you, 268 00:41:45,760 --> 00:41:51,109 characterization of a process through a probability distribution; this is another example. The 269 00:41:51,109 --> 00:41:57,049 quality of service provided, measured on a 100 point scale at three service centers A, 270 00:41:57,049 --> 00:42:06,920 B and C, is normally distributed. For A it is N 80, 9; 80 stands for the mean value, 9 stands for 271 00:42:06,920 --> 00:42:13,109 the variance, because this is the general notation we will be following for the normal distribution. 272 00:42:13,109 --> 00:42:23,779 N stands for normally distributed; the first is mu and the second is sigma square, basically. So, mu 273 00:42:23,779 --> 00:42:30,440 is this and sigma square is this; this is the general notation we will be following 274 00:42:30,440 --> 00:42:40,770 all through. Then your second process, B, is N 80, 16 and the third one, C, is N 90, 9, and I have 275 00:42:40,770 --> 00:42:47,480 plotted the probability distributions for all three processes A, B and C. Please 276 00:42:47,480 --> 00:42:53,079 remember that your variable of interest is the quality of service provided. If I ask you 277 00:42:53,079 --> 00:43:15,220 which process is better, you will say C, yes or no? Yes. Why? The mean, yes, you are right, the mean 278 00:43:15,220 --> 00:43:19,000 is 90, and it is the quality of service provided on a 100 point scale. 279 00:43:19,000 --> 00:43:25,539 What you are measuring is such that the higher the value, the better the process, but in parallel you see the variability 280 00:43:25,539 --> 00:43:33,880 is 9. 
So, both from the mean point of view it is at a higher level and from the variability point 281 00:43:33,880 --> 00:43:43,910 of view it is at the lowest level when you compare the three processes. Now, I ask you 282 00:43:43,910 --> 00:43:53,890 to compare A and B: which one is better? A, because the variability is lower with the mean at the 283 00:43:53,890 --> 00:44:03,020 same level. So, what is the physical interpretation of this? Physically, what matters is 284 00:44:03,020 --> 00:44:14,170 the variability, the most difficult parameter; it is very difficult to control variability. I am 285 00:44:14,170 --> 00:44:17,390 giving you another good example here. 286 00:44:17,390 --> 00:44:27,950 Suppose you think of archery, which you all know, and suppose this is the bull's eye; 287 00:44:27,950 --> 00:44:40,200 think of our gold medal winner, what is his name, anyhow, this is the target. 288 00:44:40,200 --> 00:45:00,460 Now, for one shooter all the shots are here, and for another the shots are like this; this is the bull's eye. 289 00:45:00,460 --> 00:45:11,730 You are the trainer of two shooters, A and B. By training, who will be improved first? The first 290 00:45:11,730 --> 00:45:22,319 one: the precision level of the first one is higher than that of the second one, so what will 291 00:45:22,319 --> 00:45:29,549 happen is that you can shift this mean value to this. 292 00:45:29,549 --> 00:45:35,789 But, here this is a variable one, and when something is variable it is difficult. From the students' 293 00:45:35,789 --> 00:45:41,480 point of view, some students are very erratic, variable; that is very, very difficult. But, some students 294 00:45:41,480 --> 00:45:47,520 may be, because of some reason, very regular, except that they are always coming 295 00:45:47,520 --> 00:45:54,289 late by a few minutes. It is always so, but they are still coming, and you can motivate 296 00:45:54,289 --> 00:45:58,549 them. So, this is the physical meaning of the distribution.
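The comparison of the three service centers can be made concrete with a small sketch. It uses the distributions stated in the lecture, A as N(80, 9), B as N(80, 16) and C as N(90, 9), where the second term is the variance. The cut-off score of 75 is a hypothetical threshold chosen only for illustration; it does not come from the lecture.

```python
from math import sqrt
from statistics import NormalDist

# (mean, variance) for each service center, as given in the lecture.
centers = {"A": (80.0, 9.0), "B": (80.0, 16.0), "C": (90.0, 9.0)}

# Hypothetical cut-off: a score below this is treated as poor service.
THRESHOLD = 75.0

# NormalDist takes the standard deviation, so take the square root
# of the variance before building each distribution.
p_below = {
    name: NormalDist(mu=mean, sigma=sqrt(variance)).cdf(THRESHOLD)
    for name, (mean, variance) in centers.items()
}

for name in sorted(p_below, key=p_below.get):
    print(f"Center {name}: P(score < {THRESHOLD:.0f}) = {p_below[name]:.6f}")
```

With this threshold, C has the smallest share of poor scores because of its higher mean, and A does better than B at the same mean because its variability is lower.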
297 00:45:58,549 --> 00:46:04,680 Now, we will come to the next important concept, which is called sample and statistics. 298 00:46:04,680 --> 00:46:16,190 So far we have seen population and parameters, and please remember that the random variable is 299 00:46:16,190 --> 00:46:21,940 there everywhere. When we talk about population, we definitely 300 00:46:21,940 --> 00:46:27,339 talk about parameters, and in particular we talk about the probability distribution also. 301 00:46:27,339 --> 00:46:41,089 That is the probability distribution of the variable of interest, that is, 302 00:46:41,089 --> 00:46:53,710 the variable of interest x. Now see, you are planning to collect data. What you have thought 303 00:46:53,710 --> 00:47:01,920 is this: I know my variable. That variable is x, and I want to collect data. 304 00:47:01,920 --> 00:47:09,369 How many data points do you want to collect? You want to collect N data points. What can you say 305 00:47:09,369 --> 00:47:21,039 about each of the observations, what can you expect, are you getting me, if x is normally distributed? 306 00:47:21,039 --> 00:47:28,829 Suppose x is normally distributed with mu and sigma square. That means x 1 is also normally 307 00:47:28,829 --> 00:47:32,720 distributed with mean mu and variance sigma square, and x 2 is also normally distributed with mu and sigma 308 00:47:32,720 --> 00:47:37,200 square, because they are coming from the same population. There is no guarantee of the same value: 309 00:47:37,200 --> 00:47:41,359 if you go, you will observe some value of x; if somebody else goes, he will observe 310 00:47:41,359 --> 00:47:46,109 some different value of x, even though the observation is the first observation for you 311 00:47:46,109 --> 00:47:53,269 and also the first for him.
First, keep in mind this one, this is very important: 312 00:47:53,269 --> 00:48:03,309 you have not collected data. That is, before data collection you are planning that you 313 00:48:03,309 --> 00:48:08,940 will be collecting N data points. So, first you observe x 1, then x 2, and so on up to x n. 314 00:48:08,940 --> 00:48:17,930 Now, what is the issue here? All these values are random; they are basically 315 00:48:17,930 --> 00:48:25,869 random and unknown. Now, suppose that you have collected data. After data collection 316 00:48:25,869 --> 00:48:36,069 what will happen? You will collect data, which I am denoting in terms of small x. Let it 317 00:48:36,069 --> 00:48:51,220 be this small x, so x 1, x 2, x i, x n. What are these values? Known, realized values; every 318 00:48:51,220 --> 00:49:02,039 value is realized, known and constant. This is in the population domain; 319 00:49:02,039 --> 00:49:16,799 this is now a sample of size N. You see this slide: here, before data collection, you have 320 00:49:16,799 --> 00:49:23,059 planned to collect N data points and these are unknown and random. When you collect data, 321 00:49:23,059 --> 00:49:34,170 after collection they are already known, fixed values. That is the big difference. Another 322 00:49:34,170 --> 00:49:40,819 important concept you should keep in mind is that each of the observations will 323 00:49:40,819 --> 00:49:47,410 follow the same probability distribution. That means, if it is a normal distribution 324 00:49:47,410 --> 00:49:54,329 with mu and sigma square as mean and variance, x 1 also has a normal distribution with mean 325 00:49:54,329 --> 00:50:02,890 mu and sigma square as variance. Once you have collected the data, forget about the distribution: 326 00:50:02,890 --> 00:50:15,119 these are fixed values, with no randomness in the data, because it is already collected.
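The difference between observations before and after data collection can be illustrated with a small simulation. This is only a sketch under assumed values: the population N(80, 9), the sample size of 5, and the two seeds standing in for two different observers are all hypothetical choices, not part of the lecture.

```python
import random
from math import sqrt

MU = 80.0            # assumed population mean
SIGMA = sqrt(9.0)    # assumed population standard deviation (variance 9)
N = 5                # assumed number of planned observations

def collect_sample(seed: int, n: int = N) -> list[float]:
    """Draw n observations from the same normal population.

    Each seed stands in for a different observer going to collect data.
    """
    rng = random.Random(seed)
    return [rng.gauss(MU, SIGMA) for _ in range(n)]

# Before collection, each of the n observations is a random variable with
# the same N(MU, SIGMA^2) distribution. Two observers therefore generally
# realize different values, even for their first observation.
sample_you = collect_sample(seed=1)
sample_other = collect_sample(seed=2)

# After collection, the values are fixed numbers: reading them again
# gives exactly the same list, with no randomness left.
frozen_copy = list(sample_you)

print("Your sample:     ", [round(x, 2) for x in sample_you])
print("Another observer:", [round(x, 2) for x in sample_other])
```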
327 00:50:15,119 --> 00:50:25,039 Take the last example, that is, the small company. We can give a name to 328 00:50:25,039 --> 00:50:31,819 the company; I have named it city cam. Later on I will use city cam; the company name is 329 00:50:31,819 --> 00:50:38,460 city cam. So, these are the profit, sales volume, absenteeism, all those things. What is this? 330 00:50:38,460 --> 00:50:44,609 Basically, we are characterizing the city cam process in totality in terms of these 331 00:50:44,609 --> 00:50:54,930 variables, and your sample mean and sample variance are the statistics corresponding 332 00:50:54,930 --> 00:51:00,510 to the population mean and population variance, correct? 333 00:51:00,510 --> 00:51:15,390 Do you know what this is? This is a dot plot. A dot plot is something 334 00:51:15,390 --> 00:51:17,799 like this. 335 00:51:17,799 --> 00:51:27,349 Your variable is x and you arrange it from the smallest to the largest. Now suppose this is one value, 336 00:51:27,349 --> 00:51:30,690 this is the second value, and this is the third value; different values are there. Suppose at this 337 00:51:30,690 --> 00:51:38,809 value you have two observations, there let it be three observations, and here let 338 00:51:38,809 --> 00:51:45,039 it be five observations; this is the type of plotting you are doing here. So, again suppose 339 00:51:45,039 --> 00:51:50,599 here there are two values, and here let it be two values. This is known as a dot plot; a dot 340 00:51:50,599 --> 00:51:57,380 plot is again similar to a histogram. Now, here you are able to count the number 341 00:51:57,380 --> 00:52:05,460 of observations against each of the values of x, so it will help you to find out 342 00:52:05,460 --> 00:52:16,519 the mode. For this example, profit in rupees million, take the 9 million 343 00:52:16,519 --> 00:52:25,260 rupees case.
It has two observations; for 10 it is 3, for 11 it is 4, for 12 it is 3. That 344 00:52:25,260 --> 00:52:34,750 means the mode of the data points for profit is 11. The mean you have already seen. The median is 345 00:52:34,750 --> 00:52:41,299 the middle value. How do you compute the median? For computation of the median you find out the 346 00:52:41,299 --> 00:52:50,490 position (N plus 1) by 2, where N is the number of observations. 347 00:52:50,490 --> 00:52:57,250 In this particular example, N is equal to 12 because it is 12 months of data, so (12 plus 1) 348 00:52:57,250 --> 00:53:08,430 by 2, that means 6.5. So, 6.5 means that when you arrange your data from the smallest to the largest, 349 00:53:08,430 --> 00:53:13,549 you find out the sixth and seventh positions. Suppose this is the sixth position and 350 00:53:13,549 --> 00:53:18,750 this is your seventh position; you take the average of these two values, the sixth position 351 00:53:18,750 --> 00:53:27,500 value and the seventh position value. And what is the mode? It is the value of x for which there 352 00:53:27,500 --> 00:53:29,230 are the maximum occurrences. 353 00:53:29,230 --> 00:53:36,619 This is the calculation for that data, and by using an Excel sheet you can very easily 354 00:53:36,619 --> 00:53:42,880 calculate this. Now, measure of dispersion; the measure of dispersion is this. 355 00:53:42,880 --> 00:53:52,460 What we have seen is that if the data follows, suppose, a normal distribution, then this one 356 00:53:52,460 --> 00:53:59,680 is mu, and how far it goes on this side and on this side is the dispersion. There are 357 00:53:59,680 --> 00:54:07,049 several ways to measure dispersion. One is the range, that is, the maximum minus the minimum 358 00:54:07,049 --> 00:54:13,960 value. Another
one is the interquartile range, which is the third quartile minus the first 359 00:54:13,960 --> 00:54:24,819 quartile. For the first quartile, Q 1, the position is 360 00:54:24,819 --> 00:54:37,480 (N plus 1) by 4; then the Q 2 position is the median, which is (N plus 1) by 2. The Q 3 position, 361 00:54:37,480 --> 00:54:49,490 that is, the third quartile, is 3 into (N plus 1) by 4. So, all those position values 362 00:54:49,490 --> 00:54:54,890 you have to find out and then appropriately you have to manipulate the data. 363 00:54:54,890 --> 00:55:02,250 So, if the position comes in the middle of two values, then you take the average. 364 00:55:02,250 --> 00:55:08,319 If it does not come in the middle, but more to the right of the middle, 365 00:55:08,319 --> 00:55:17,529 suppose the 3.75 position, then accordingly a weightage of 0.75 is to be given when you interpolate between the third and fourth values. 366 00:55:17,529 --> 00:55:24,839 These are all very simple things that you will be able to find out; these are equally 367 00:55:24,839 --> 00:55:33,500 important and you need to know these things also. You know the variance also, yes or 368 00:55:33,500 --> 00:55:43,150 no? How do you compute variance in the statistical sense? s square is 1 by (n minus 1) times the sum over i 369 00:55:43,150 --> 00:56:08,930 equal to 1 to n of (x i minus x bar) square, that is, $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. This is the variability measure. Why this n minus 1? 370 00:56:08,930 --> 00:56:23,559 Minimum variance unbiased estimator; 371 00:56:23,559 --> 00:56:33,680 any other explanation? Now, later on we will be discussing very frequently the 372 00:56:33,680 --> 00:56:46,490 degrees of freedom, are you getting me? For degrees of freedom we will be using d o f or d f. Now, 373 00:56:46,490 --> 00:56:52,569 see, in this case this n minus 1 is coming because of degrees of freedom, because you 374 00:56:52,569 --> 00:57:00,599 have n data points x 1 to x n.
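The position rules for the mode, median, quartiles, range and interquartile range can be sketched on the profit example. The data list is an assumption: it is rebuilt from the dot plot counts quoted in the lecture (two observations at 9, three at 10, four at 11, three at 12, in rupees million), which add up to the 12 monthly values. The helper `value_at_position` is an illustrative name, and linear interpolation is used for fractional positions.

```python
from collections import Counter

def value_at_position(sorted_data: list[float], pos: float) -> float:
    """Value at a 1-based, possibly fractional position.

    A position such as 3.75 means: start at the 3rd value and move 0.75
    of the way towards the 4th value (linear interpolation).
    """
    lower = int(pos)       # whole part of the position
    frac = pos - lower     # fractional part, used as the weight
    if frac == 0 or lower >= len(sorted_data):
        return float(sorted_data[lower - 1])
    a, b = sorted_data[lower - 1], sorted_data[lower]
    return a + frac * (b - a)

# Assumed data, rebuilt from the dot plot counts (profit in rupees million).
profit = sorted([9] * 2 + [10] * 3 + [11] * 4 + [12] * 3)
n = len(profit)  # 12 monthly observations

mode = Counter(profit).most_common(1)[0][0]           # most frequent value
median = value_at_position(profit, (n + 1) / 2)       # position 6.5
q1 = value_at_position(profit, (n + 1) / 4)           # position 3.25
q3 = value_at_position(profit, 3 * (n + 1) / 4)       # position 9.75
data_range = profit[-1] - profit[0]                   # maximum minus minimum
iqr = q3 - q1                                         # third minus first quartile

print(f"mode={mode}, median={median}, Q1={q1}, Q3={q3}")
print(f"range={data_range}, IQR={iqr}")
```

For this assumed data the sixth and seventh sorted values are both 11, so the median equals the mode.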
When you are computing this variance, what do you 375 00:57:00,599 --> 00:57:12,359 require? You require x bar to be computed, and x bar is computed with 376 00:57:12,359 --> 00:57:21,269 this formulation. So, what has happened ultimately? Here, when you find out x i minus 377 00:57:21,269 --> 00:57:28,690 x bar, that is, x 1 minus x bar, x 2 minus x bar and so on, for the last one you do not 378 00:57:28,690 --> 00:57:37,599 need to compute it; it is automatically determined. So, I will write here, suppose, x n minus x 379 00:57:37,599 --> 00:57:49,759 bar. What I mean is this: suppose I write the sum of x i minus x bar; what will be the 380 00:57:49,759 --> 00:58:02,549 value? Now, x bar is the mean value, so the sum is the summation of x i minus the summation 381 00:58:02,549 --> 00:58:11,099 of x bar, and this is n x bar minus n x bar. So, this is 0. What happens ultimately is that 382 00:58:11,099 --> 00:58:19,259 you are not getting n independent x i minus x bar values; one value is fixed by the others. It is very simple if you say it 383 00:58:19,259 --> 00:58:20,650 the other way. 384 00:58:20,650 --> 00:58:27,650 Suppose I have given you one equation, x plus y plus z equal to 5. What is the degree of 385 00:58:27,650 --> 00:58:32,500 freedom? Here, you see, if you change x and y, z cannot be changed further; it is fixed, 386 00:58:32,500 --> 00:58:36,779 even though three values are there. You have two degrees of freedom because the sum 387 00:58:36,779 --> 00:58:43,039 is fixed at 5, and in this case also the same thing is happening, that is, the sum 388 00:58:43,039 --> 00:58:51,869 of all these deviations will be 0. So, how many independent values do you have in the s square 389 00:58:51,869 --> 00:58:59,329 equation? n minus 1. That is another explanation of why we divide by n minus 1 while 390 00:58:59,329 --> 00:59:01,460 computing this.
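The degrees of freedom argument can be verified numerically. The sketch below assumes the same profit data rebuilt from the dot plot counts quoted earlier in the lecture (an assumption, in rupees million). It checks that the deviations from the mean sum to zero, that the last deviation is fixed by the other n minus 1, and that dividing by n minus 1 matches the standard library's sample variance.

```python
import statistics

# Assumed data, rebuilt from the dot plot counts (profit in rupees million).
profit = [9] * 2 + [10] * 3 + [11] * 4 + [12] * 3
n = len(profit)

x_bar = sum(profit) / n
deviations = [x - x_bar for x in profit]

# The deviations always sum to zero (up to floating point rounding),
# because sum(x_i) = n * x_bar.
sum_of_deviations = sum(deviations)

# So the last deviation is not free: it is minus the sum of the others.
last_from_others = -sum(deviations[:-1])

# Sample variance with n - 1 in the denominator (n - 1 degrees of freedom).
s_squared = sum(d * d for d in deviations) / (n - 1)

print(f"x_bar = {x_bar:.4f}")
print(f"sum of deviations = {sum_of_deviations:.2e}")
print(f"last deviation = {deviations[-1]:.4f}, from others = {last_from_others:.4f}")
print(f"s^2 = {s_squared:.4f}, statistics.variance = {statistics.variance(profit):.4f}")
```

The same idea applies to x plus y plus z equal to 5: once x and y are chosen, z is fixed, so only two values are free.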
391 00:59:01,460 --> 00:59:15,579 Now, I will finish this lecture. See, we have told you that the normal distribution 392 00:59:15,579 --> 00:59:21,779 is a very important one, and later on for this subject, the multivariate normal distribution. 393 00:59:21,779 --> 00:59:29,019 So, we must know who is the father of the normal distribution, and you see that it is Abraham 394 00:59:29,019 --> 00:59:35,450 De Moivre, a French born English mathematician. He basically has given the general form of 395 00:59:35,450 --> 00:59:41,079 this normal distribution. What form? You will see that it is one by root over 2 pi sigma square, times 396 00:59:41,079 --> 00:59:49,410 e to the power minus half of (x minus mu by sigma) whole square, that is, $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$. 397 00:59:49,410 --> 01:00:00,650 Now, he is considered the father of the normal distribution, but the equation alone is not 398 01:00:00,650 --> 01:00:10,029 sufficient. Later, what happened is that Gauss, who is another famous mathematician and statistician, 399 01:00:10,029 --> 01:00:16,170 gave the properties. So, all statistical properties of the normal distribution 400 01:00:16,170 --> 01:00:27,069 were identified and tested by Carl Friedrich Gauss; he is, specifically, a German mathematician, 401 01:00:27,069 --> 01:00:33,849 they are not mathematician. If you see this quotation, it is very interesting: 402 01:00:33,849 --> 01:00:45,089 it is not knowledge, but the act of learning, not possession, but the act of getting there, 403 01:00:45,089 --> 01:00:52,029 which grants the greatest enjoyment. So, suppose you are doing a PhD. So long as you are not getting 404 01:00:52,029 --> 01:00:58,480 the PhD, you are thinking that once I get the PhD I will be very happy, but it is not true; once you 405 01:00:58,480 --> 01:01:04,579 get it, within 2 or 3 days you will find out that you are the same person.
But, learning, going 406 01:01:04,579 --> 01:01:10,670 there, the act of going there, that is what is very, very important. These famous people 407 01:01:10,670 --> 01:01:17,490 have said so, and we must follow all their suggestions. 408 01:01:17,490 --> 01:01:22,559 Thank you very much. In the next class I will tell you about the sampling distribution.