Hello and welcome to today's lecture. I hope last week you got the opportunity to do some examples in R. In the last class we solved a few examples in R. I showed you how you can create a vector, how you can do basic scalar addition, subtraction and other operations, and how to do vector operations.

One of the things to remember is that when you define two vectors and do these operations, the operations work at an element-wise level. In other words, if I define a as a vector which is 1, 2, 3, 4 and b as a vector which is 4, 5, 6, 8, then a * b will give me a vector which is 4, 10, 18, 32. So basically you have an element-wise operation: a * b is computed at the element-wise level, and the products are given to you as a separate vector.

We had then shown how you use scan. You can write a = c(1, 2, ...); this is the syntax for entering numbers and concatenating them into a single vector. But of course this can be very laborious when you have big data, and repeatedly entering numbers next to each other can be a problem. One of the ways around it is to use the function scan. You can write data = scan(), with nothing inside the brackets, and when you press enter you get a prompt where you can enter the numbers, and these numbers get stored.

The most important thing to note is that when you call scan like this, by default the software assumes that these numbers are real. So if I write Monday or Tuesday, this will immediately give an error. The way around it is to write data = scan(what = "char"), with this additional argument what set to "char". The software then knows that you are entering characters. Then you can enter Monday, Tuesday, Wednesday and so on; you can write them within quotes.
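The element-wise product and the two uses of scan described above can be tried out as a small sketch. One assumption to flag: scan normally reads from the console, so the text argument is used here purely so that the example runs without anyone typing at the prompt; the variable names nums and days are illustrative and not from the lecture.

```r
# Element-wise arithmetic on two vectors of equal length
a <- c(1, 2, 3, 4)
b <- c(4, 5, 6, 8)
prod_ab <- a * b   # each element of a times the matching element of b
print(prod_ab)

# scan() reads numbers by default. The text= argument stands in for
# console input here (an assumption made so the sketch is non-interactive).
nums <- scan(text = "1 2 3 4", quiet = TRUE)

# Telling scan() to expect characters instead of numbers
days <- scan(text = "Monday Tuesday Wednesday", what = "character", quiet = TRUE)
print(days)
```

In an interactive session you would drop the text argument and type the values at the prompt, finishing with an empty line.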
OK, and then it will automatically take it. You can also easily add to a vector. If you have a vector a, you can write a = c(a, 5, 4, 5, 6) and so on, so you can add these numbers either before or after the vector.

Once you generate a vector, let's say a is 1, 2, 3, 4, 5, 6, you can use essential functions like min(a), max(a), mean(a), median(a), var(a) and sd(a). var gives you the variance, sigma squared; sd gives you sigma; median is just the median; mean gives you x bar; and then you have min and max. These functions let you calculate these numbers easily, particularly when the vectors are big.

Then we briefly discussed plotting. Once you have a vector, you can use, let's say, a bar plot or a box plot. If you write barplot(xx), or boxplot(xx, yy), you will generate these plots in the plotting window.
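As a quick illustration of the summary functions just listed, here is a minimal sketch on the same six-element vector. The names stats and a_ext are illustrative and not from the lecture.

```r
a <- c(1, 2, 3, 4, 5, 6)

# Collect the basic summaries in one named vector
stats <- c(
  min    = min(a),
  max    = max(a),
  mean   = mean(a),     # x bar
  median = median(a),
  var    = var(a),      # sample variance, divides by n - 1
  sd     = sd(a)        # square root of the sample variance
)
print(stats)

# Appending values after the existing vector
a_ext <- c(a, 5, 4, 5, 6)
print(a_ext)
```

Note that var and sd in R use the n - 1 denominator, which matches the definitions used later in this lecture.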
Let's say I could have written boxplot(yy, ylim = c(0, 10)). This would have set the y-axis range from 0 to 10; this is how I would have entered my y-axis range to be between 0 and 10, and so on. hist(xx) or hist(yy) will give you the histogram, that is, the frequency distribution, but it will also generate the plot. To get just the frequency distribution, you can write table(xx). So these are the basics.

Now let us come to the more general case, where you don't have just one set of values but values for more than one metric. Let's say that in a class I want to correlate two things, so you have chosen two vectors, x and y. Of these, x is, let's say, weight and y is agility, or capability to run, or fitness, whatever you choose. Now, if I were to plot x and y, with x as my weight axis and y as my fitness axis, and let's say I normalize fitness to a value between 0 and 10, logic would dictate that in general, as weight increases, fitness will drop. You can expect a curve like this, or a curve like this, or a curve like this, but it is highly unlikely that you will have a curve like this; that is highly unlikely from a physical point of view.

The object of this exercise is to correlate these two behaviors, and this is what the principle of correlation captures: how are they correlated? I can clearly see that in all of these, curve a, curve b and curve c, they are correlated. Curve a says that you see a strong correlation, such that an increase in weight gives rise to a decrease in fitness. In b, or for that matter c, this is much stronger: even for small changes in weight, initially there is a huge drop in the fitness of the person concerned, but beyond a certain weight you have a saturation. So clearly, depending on the nature of the data, the two quantities can be linearly correlated, or, as in these two curves, the relationship is nonlinear. Linear would mean that if your weight doubles, your fitness also reduces by half; for these curves that is not so.

This principle is very useful for studying correlation and regression, so let us see how it is done. How do you know whether something is positively correlated or negatively correlated? Let's say I plot my x and y and I have a scatter plot like this. I can see that, on average, if I were to draw a trend line through the middle, my trend line will look something like this. This is an example of positive correlation. On the other hand, if my data were to look something like this, this is negative correlation, as we saw in the case of weight and fitness.

In other cases, let's say for example we are correlating weight with the chance of rain today: the weight of a person on ten different days and the chance of rain, or the weight of ten different people and the chance of rain. We can clearly see that there is expected to be no correlation between these two. In that case, if I draw a line, the line will look almost horizontal, or in some other case it might look almost completely vertical. These are cases where there is no correlation between x and y.

Now, the mathematical basis for calculating correlation and regression. Let's say again you have the function y = x. We know it will be a 45-degree line passing through the origin. In this case you will come up with something called a correlation coefficient, which will come out to be 1. In other words, they are fully correlated: any increase in x gives you an equal increase in y. On the other hand, let's say you have the completely opposite slope, let's say y = -x. In this case your correlation coefficient is going to be minus 1. And when there is no correlation, when you have data like this, your correlation coefficient will give a value of 0.

How do you define the correlation coefficient mathematically? It is typically represented by rho, and it is defined as rho = s_xy / (s_x * s_y), where s_x is the standard deviation of x, s_y is the standard deviation of y, and s_xy is called the covariance of x and y. The covariance is defined as s_xy = sum of (x_i - x bar) * (y_i - y bar), the whole divided by n - 1.
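The definition of rho given above translates almost word for word into R. The following is a minimal sketch; the function name corr_coef and the small test vectors are assumptions made for illustration, not something shown in the lecture.

```r
# Correlation coefficient written out from the definition:
# rho = s_xy / (s_x * s_y), with s_xy using the n - 1 denominator
corr_coef <- function(x, y) {
  n <- length(x)
  s_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)  # covariance
  s_xy / (sd(x) * sd(y))
}

x <- c(1, 2, 3, 4)

rho_same     <- corr_coef(x, x)                 # y = x, fully correlated
rho_opposite <- corr_coef(x, -x)                # y = -x, opposite slope
rho_none     <- corr_coef(x, c(1, -1, -1, 1))   # hand-picked y with zero covariance

print(c(rho_same, rho_opposite, rho_none))
```

R also has a built-in cor function, which should agree with this hand-written version on the same inputs.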
OK, so I can expand this further. I can expand it to s_xy = sum of (x_i y_i - x_i y bar - x bar y_i + x bar y bar), divided by n - 1. So my s_xy is equal to sum of x_i y_i, minus y bar times sum of x_i (I can take y bar out), minus x bar times sum of y_i (I can take x bar out), plus x bar y bar times sum of 1 for i = 1 to n, all divided by n - 1. Now, sum of x_i is nothing but n times x bar, so the second term is n x bar y bar; similarly the third term is n x bar y bar, and the last term is n x bar y bar. So the numerator is sum of x_i y_i - n x bar y bar - n x bar y bar + n x bar y bar. This gives me the formula s_xy = (sum of x_i y_i - n x bar y bar) / (n - 1). So this is the definition of covariance.

Now let us generate two vectors and see what kind of covariance we get, what the values of the standard deviations are, and what the final correlation coefficient is for some distributions. Let us take one particular example where we think they are positively correlated. Let us assume that I have the following four values of x and y. I can calculate the value of x bar: x bar is equal to 2.5, and y bar is equal to 3.5. Let us open RStudio and let me enter xx = c(1, 2, 3, 4) and yy = c(2, 3, 4, 5).
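The shortcut formula derived above can be checked numerically on these two vectors. This is a small sketch that compares the definition, the shortcut and R's built-in cov; the variable names s_xy_def, s_xy_short and s_xy_builtin are illustrative.

```r
xx <- c(1, 2, 3, 4)
yy <- c(2, 3, 4, 5)
n  <- length(xx)

# Covariance from the definition: sum of products of deviations / (n - 1)
s_xy_def <- sum((xx - mean(xx)) * (yy - mean(yy))) / (n - 1)

# Covariance from the shortcut: (sum of x*y - n * xbar * ybar) / (n - 1)
s_xy_short <- (sum(xx * yy) - n * mean(xx) * mean(yy)) / (n - 1)

# R's built-in sample covariance, for comparison
s_xy_builtin <- cov(xx, yy)

print(c(s_xy_def, s_xy_short, s_xy_builtin))
```

All three values should agree, since the shortcut is only an algebraic rearrangement of the definition.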
OK, I can plot them with plot(xx, yy), and this is how my plot looks. You can clearly see that there is a very linear correlation between xx and yy. Now I want to find the value of s_xy. I know that s_xy = (sum of x_i y_i - n x bar y bar) / (n - 1), and I have n = 4 in this case.

Let us first calculate the value of s_x. I can write it down here itself: sd(xx), and the same thing for y, sd(yy). Then I can define zz = xx * yy, which holds the products x_i y_i. If I add them up with sum(zz), I get the complete value, which is 40. So sum of x y comes out to be 40. The standard deviation of x is 1.29; in fact both sd of x and sd of y are equal to 1.29.

Now let us calculate. I know n = 4, so sum(zz) - 4 * 2.5 * 3.5, that is, sum of x y minus n times x bar, which is 2.5, times y bar, which is 3.5, gives me a value of 5. So I can get the value of s_xy. I got sum of x y = 40, and s_xy = (sum of x y - n x bar y bar) / (n - 1). I have calculated x bar = 2.5, y bar = 3.5 and n = 4, so s_xy = (40 - 4 * 2.5 * 3.5) / (n - 1), where n - 1 is 3, which is (40 - 35) / 3 = 5/3.

And I know the standard deviations s_x and s_y. So my correlation coefficient is going to be (5/3) / (1.29 * 1.29), which comes out to 1 once you allow for the rounding of 1.29, as you would expect for points that lie exactly on a line. So you can accordingly use the formula to find out the correlation coefficient rho.

Now let us ask one thing. My correlation coefficient rho is defined as s_xy / (s_x * s_y). What is the minimum value of rho possible, and what is the maximum value of rho possible? We reasoned that this value should be between minus 1 and 1, so let us see if that is true. Let us assume y = c x, where, let's say, c is positive; so we assume a very strong correlation. In fact, we can also add a constant and take y = a + c x to make it more general.
So let us assume y = a + c x, where my s_xy is defined as sum of (x - x bar) * (y - y bar), divided by n - 1. I know my y bar has to be a + c x bar; we derived this in previous classes. So y - y bar is nothing but c * (x - x bar). If this is true, I can compute s_xy by taking c out: s_xy = c * sum of (x - x bar)^2 / (n - 1).

Now, s_x is the square root of [sum of (x - x bar)^2 / (n - 1)], and s_y is the square root of [sum of (y - y bar)^2 / (n - 1)]. So in s_x * s_y I can take 1 / (n - 1) out as a common factor: s_x * s_y = 1 / (n - 1) * square root of [sum of (x - x bar)^2 * sum of (y - y bar)^2]. Again, I know y = a + c x and y bar = a + c x bar, so y - y bar = c * (x - x bar), and I can put c outside: sum of (y - y bar)^2 = c^2 * sum of (x - x bar)^2. Taking it out of the square root, s_x * s_y = sum of (x - x bar)^2 / (n - 1) * square root of c^2, which I can write as |c| * sum of (x - x bar)^2 / (n - 1).

So my rho, defined as s_xy / (s_x * s_y), is [c * sum of (x - x bar)^2 / (n - 1)] divided by [|c| * sum of (x - x bar)^2 / (n - 1)], which is equal to c / |c|.

When c is positive, I can clearly see that rho is going to be 1: if, let's say, c is 5, then 5 / |5| is simply 1. When c is negative, let's say c is minus 5, rho is -5 / 5 = -1. These perfectly linear cases give the two extremes, which tells you that the correlation coefficient rho is bounded within the following limits: rho is bounded between minus 1 and plus 1.
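The result rho = c / |c| for y = a + c x can be checked numerically. This is a minimal sketch; the intercept value 2, the helper name rho_for_slope and the choice of x are assumptions made for illustration only.

```r
x <- c(1, 2, 3, 4)
a <- 2   # hypothetical intercept; any constant gives the same rho

# Correlation between x and a perfectly linear y = a + slope * x
rho_for_slope <- function(slope) {
  y <- a + slope * x
  cov(x, y) / (sd(x) * sd(y))
}

rho_pos <- rho_for_slope(5)    # c = 5, so c / |c| = 1
rho_neg <- rho_for_slope(-5)   # c = -5, so c / |c| = -1

print(c(rho_pos, rho_neg))
```

Keep in mind that this only exercises the two perfectly linear cases worked out above. It does not by itself show that every other data set gives a value strictly between them.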
OK, so this is the range of the correlation coefficient. The minimum is when they are anti-correlated, meaning x is increasing and y is decreasing, or the reverse way, x is decreasing and y is increasing. When there is complete anti-correlation you will get a value of rho which is minus 1. When they are perfectly in sync, that is, when x increases and y increases exactly in step with it, you get a value of rho equal to 1. When there is some association but it is not fully strong, you might get a positive correlation or a negative correlation, but the value would be something like 0.2 or -0.2, depending on the extent of correlation.

Now let us take another example and do this calculation ourselves. Let us take an example where x and y are not particularly correlated. Let us say I take these four points: x is 1, y is 1; x is 2, y is 5; x is 3, y is 1; x is 4, y is 5. My x bar is going to be 2.5 as before, and y bar is going to be 3.

Let us find the standard deviation s_x. With x equal to 1, 2, 3, 4, the deviations x - x bar are -1.5, -0.5, 0.5 and 1.5, so the squares are 2.25, 0.25, 0.25 and 2.25. So s_x is the root of (2.25 + 0.25 + 0.25 + 2.25) / (n - 1), where n - 1 is 3, which is root of 5/3, about 1.29. This is the same as in the previous example, because the x values are the same. I can do the same thing for y. I have y equal to 1, 5, 1, 5 and y bar equal to 3, so every (y - y bar)^2 is 2 squared. So s_y is root of (4 * 4) / 3, that is, root of 16/3. So we have s_x = root of 5/3 and s_y = root of 16/3.

Now we want to find s_xy, so let us take another page. We have x as 1, 2, 3, 4 and y as 1, 5, 1, 5. The products are (1 - 2.5) * (1 - 3), (2 - 2.5) * (5 - 3), (3 - 2.5) * (1 - 3) and (4 - 2.5) * (5 - 3). The y deviations are all either 2 or minus 2. The first product is -1.5 times -2; minus into minus is plus, so this is 3. The second is -0.5 times 2, which is -1. The third is 0.5 times -2, which is -1. The last is 1.5 times 2, which is 3.

So I can calculate the summation: it is 6 - 2 = 4, so s_xy is 4/3. My rho is nothing but (4/3) / (root of 5/3 * root of 16/3). The 3 cancels, so this is 4 / (4 * root 5) = 1 / root 5, which is roughly 0.45. So you see that this is still positively correlated as per this calculation: it is much less than 1, but it is still positive.

In this class we discussed the correlation coefficient, how you can make use of R to calculate the individual metrics, and how to calculate the correlation coefficient itself. In the next class we will take a few more examples of correlation and then go to the next step, which is how to do regression and fitting. With that I end here, and I look forward to meeting you in the next lecture. Thank you.
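The second worked example can be reproduced in R as a check on the hand calculation. This is a sketch using the four points from the lecture; the variable names are illustrative.

```r
x <- c(1, 2, 3, 4)
y <- c(1, 5, 1, 5)

s_x  <- sd(x)        # root of 5/3
s_y  <- sd(y)        # root of 16/3
s_xy <- cov(x, y)    # 4/3

# Correlation coefficient from the pieces, and from the built-in function
rho         <- s_xy / (s_x * s_y)
rho_builtin <- cor(x, y)

print(c(s_x = s_x, s_y = s_y, s_xy = s_xy, rho = rho, rho_builtin = rho_builtin))
```

Both routes should give the same positive value, well below 1, which matches the picture of points that trend upward only loosely.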