1
00:00:14,110 --> 00:00:18,750
hello and welcome to today's lecture i hope
last week you got the opportunity
2
00:00:18,750 --> 00:00:24,700
to do some examples in r in the last class we
had solved a few examples in r and showed
3
00:00:24,700 --> 00:00:29,250
you how you can create a vector how you can do
basic scalar addition subtraction and other
4
00:00:29,250 --> 00:00:38,310
operations and how to do vector operations
so one of the things to remember is when you
5
00:00:38,310 --> 00:00:42,660
define two vectors and you are doing these
operations these operations get carried out at
6
00:00:42,660 --> 00:00:46,220
an element wise level
in other words if i define a as a vector which
7
00:00:46,220 --> 00:01:07,820
is one two three four and b as a vector which
is four five six eight then a star b will
8
00:01:07,820 --> 00:01:17,320
give me a vector which is four ten eighteen
and thirty two so on and so forth so basically
9
00:01:17,320 --> 00:01:23,750
you have an element wise operation a star
b is computed at the element wise level and
10
00:01:23,750 --> 00:01:25,680
the products are given to you as a single vector
ok
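A minimal sketch of the element-wise behaviour described here, using the two vectors from the lecture. The names `prod_ab` and `sum_ab` are chosen only for illustration.

```r
# vectors from the lecture
a <- c(1, 2, 3, 4)
b <- c(4, 5, 6, 8)

# arithmetic on two equal-length vectors is applied element by element
prod_ab <- a * b   # 4 10 18 32
sum_ab  <- a + b   # 5 7 9 12

print(prod_ab)
print(sum_ab)
```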
11
00:01:25,680 --> 00:01:38,659
we had then shown how you use
scan so either you write a is equal to
12
00:01:38,659 --> 00:01:48,250
c of one two dot dot dot so this is the
syntax for entering numbers and concatenating
13
00:01:48,250 --> 00:01:55,230
them into a single vector but of course this
can be very laborious when you have big data
14
00:01:55,230 --> 00:02:02,619
and repeatedly entering numbers next to
each other can be a problem so one of
15
00:02:02,619 --> 00:02:10,300
the ways around it is to use the function
scan so you can write data is equal to scan
16
00:02:10,300 --> 00:02:14,640
with nothing inside the brackets and
when you press enter you have a prompt
17
00:02:14,640 --> 00:02:17,880
where you can enter the numbers and these numbers
get stored ok
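A small sketch of `scan` as described here. Interactive entry cannot be shown on the page, so the `text` argument is used as a stand-in for typed input (an assumption of this sketch, not what was typed in the lecture). The last call anticipates the character case that the lecture turns to next.

```r
# interactive form from the lecture: type numbers, finish with a blank line
# data <- scan()

# non-interactive stand-in: the typed input is passed through the text argument
data <- scan(text = "3 5 7 9", quiet = TRUE)
print(data)

# character input needs the what argument, otherwise scan expects numbers
days <- scan(text = "monday tuesday wednesday", what = "character", quiet = TRUE)
print(days)
```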
18
00:02:17,880 --> 00:02:22,030
most important thing to note is in this case
when you write the function scan by default
19
00:02:22,030 --> 00:02:33,860
the software assumes that these numbers
are real so if i write monday or tuesday
20
00:02:33,860 --> 00:02:41,270
then immediately this will give an error
ok so the way around it is to write data
21
00:02:41,270 --> 00:02:50,050
is equal to scan and you have this additional
argument what is equal to char so then
22
00:02:50,050 --> 00:02:53,620
the software knows that you are essentially
entering characters while you are entering
23
00:02:53,620 --> 00:02:59,290
these so then you can enter the days and when
you enter them you can write them within quotes
24
00:02:59,290 --> 00:03:06,520
monday tuesday wednesday so on and so forth
ok and then it'll automatically take
25
00:03:06,520 --> 00:03:19,150
it you can also easily add to a vector so if you have a vector
a you can add let's say you can write a is
26
00:03:19,150 --> 00:03:22,990
equal to c of a comma five four five six so
on and so forth so you can add these numbers
27
00:03:22,990 --> 00:03:26,670
either before or after the vector
so once you generate a vector let's say you
28
00:03:26,670 --> 00:03:47,440
have a vector a is equal to one two three
four five six ok you can use these essential
29
00:03:47,440 --> 00:03:55,980
functions like min of a max of a mean of a
median of a var of a and s d of a to
30
00:03:55,980 --> 00:04:05,890
get the variance sigma square the standard deviation sigma
the median the mean x bar and the min and
31
00:04:05,890 --> 00:04:09,490
max ok
so these functions will easily allow you to
32
00:04:09,490 --> 00:04:13,890
calculate these numbers particularly when the
vectors are big or the numbers are big ok then
33
00:04:13,890 --> 00:04:20,799
we briefly discussed plotting ok so
once you have a vector you can use let's say
34
00:04:20,799 --> 00:04:26,630
a box plot so if you use
bar plot of let's say x x or box plot of x
35
00:04:26,630 --> 00:04:30,169
x comma y y you will generate all these plots in
the plotting window so let's
36
00:04:30,169 --> 00:04:31,849
say i could have written box plot of y y and
then i could have written ylim is equal to
37
00:04:31,849 --> 00:04:46,270
c of zero comma ten so this would have set the
y axis range so this is the y axis range from
38
00:04:46,270 --> 00:04:59,360
zero to ten so this is how i would have entered
my y axis range to be between zero and ten
39
00:04:59,360 --> 00:05:03,300
so on and so forth ok
histogram of x x or y y will give you the
40
00:05:03,300 --> 00:05:14,669
histogram or the frequency distribution but
it'll also generate the plot so just to get
41
00:05:14,669 --> 00:05:39,499
the frequency distribution you can write table
of x x ok so these are the basics now let
42
00:05:39,499 --> 00:05:47,360
us come to the generic case where you don't
just have values of one quantity but you have values where
43
00:05:47,360 --> 00:05:53,599
more than one metric has been chosen
ok
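The summary functions and plots recapped earlier in this lecture can be sketched as follows. The vector `a_longer` is a hypothetical example built by appending to `a`, and the plotting calls use base R graphics.

```r
a <- c(1, 2, 3, 4, 5, 6)

# summary statistics named in the lecture; var and sd use the n - 1 divisor
stats <- c(min = min(a), max = max(a), mean = mean(a),
           median = median(a), var = var(a), sd = sd(a))
print(stats)

# appending values to an existing vector
a_longer <- c(a, 5, 4, 5, 6)

# frequency distribution without drawing a plot
freq <- table(a_longer)
print(freq)

# box plot with the y axis limited to 0..10, and a histogram
boxplot(a_longer, ylim = c(0, 10))
hist(a_longer)
```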
44
00:05:53,599 --> 00:06:11,439
so let's say in a class i want to correlate
so you have two vectors you've chosen x and
45
00:06:11,439 --> 00:06:29,680
y ok of these x is let's say weight and
y is agility ok or capability to
46
00:06:29,680 --> 00:06:39,729
run let's say how agile or how fit whatever
you choose now logic would dictate you would
47
00:06:39,729 --> 00:06:58,619
expect in general that if i were to plot x
and y if this is my x this is my y so x is
48
00:06:58,619 --> 00:07:08,499
my weight axis and y is my fitness axis and
let's say i normalize it to
49
00:07:08,499 --> 00:07:21,099
a value between zero and ten
ok so you would expect that as weight increases fitness will
50
00:07:21,099 --> 00:07:36,789
drop so you can expect a curve like
this you can expect a curve like this you
51
00:07:36,789 --> 00:07:52,379
can expect a curve like this but it is highly
unlikely that you will have a curve like this
52
00:07:52,379 --> 00:07:56,069
this is highly unlikely from a physical point
of view
53
00:07:56,069 --> 00:08:05,459
so the object of this exercise is to relate
these two particular behaviors and this is what
54
00:08:05,459 --> 00:08:25,479
is captured in the principle of correlation
ok how are they correlated so i can clearly
55
00:08:25,479 --> 00:08:33,409
see that in all of them let's say this curve a this
curve b and this curve c they are correlated
56
00:08:33,409 --> 00:08:39,169
so as per this curve a let's say it is saying
that you see a strong correlation such that
57
00:08:39,169 --> 00:08:43,990
increase in weight gives rise to decrease
in fitness ok in b or for that matter
58
00:08:43,990 --> 00:08:46,920
c this is much stronger so it says that even
for small changes in weight initially there
59
00:08:46,920 --> 00:08:49,740
is a huge drop in the fitness of the person
concerned
60
00:08:49,740 --> 00:08:54,800
but beyond a certain weight you have a saturation
ok so clearly you can see that depending on
61
00:08:54,800 --> 00:08:58,399
the nature of the data you might see that these
two quantities can be linearly correlated or
62
00:08:58,399 --> 00:09:03,410
for these other curves the relationship is
non linear that is with a linear increase if
63
00:09:03,410 --> 00:09:07,230
let's say your weight doubles will
your fitness also reduce by half ok that
64
00:09:07,230 --> 00:09:12,529
is not so
so this principle is very useful for studying
65
00:09:12,529 --> 00:09:14,640
correlation and regression and let us see
how it is done so how do you know whether
66
00:09:14,640 --> 00:09:19,230
something is positively correlated or something
is negatively correlated so let's
67
00:09:19,230 --> 00:09:31,029
say if i plot my x and y
and i have a scatter plot with some
68
00:09:31,029 --> 00:09:47,610
points like this ok so i can see that
on an average if i were to draw a trend line
69
00:09:47,610 --> 00:09:54,070
through the middle my trend line will look
something like this ok so this is an example
70
00:09:54,070 --> 00:10:00,779
of positive correlation
on the other hand if my data were to look
71
00:10:00,779 --> 00:10:06,399
something like this this is negatively correlated
this is negative correlation as we saw in
72
00:10:06,399 --> 00:10:17,720
the case of weight and fitness ok so in other
cases so let's say for example we are correlating
73
00:10:17,720 --> 00:10:27,529
weight with the chance of raining today ok
so weight of a person on ten different days
74
00:10:27,529 --> 00:10:35,790
and the chance of raining or weight of ten
different people and the chance of raining
75
00:10:35,790 --> 00:10:40,100
so we can clearly see that there is expected
to be no correlation between these two quantities
76
00:10:40,100 --> 00:10:45,129
so in that case if i draw a line you see that
the line will almost look like either horizontal
77
00:10:45,129 --> 00:10:49,480
or in some other case it might look almost
like this that is the line is completely vertical
78
00:10:49,480 --> 00:11:00,630
so these are cases where there is
no correlation between x and y ok so now the mathematical
79
00:11:00,630 --> 00:11:12,509
basis for calculating correlation and regression
so what do you have so let's
80
00:11:12,509 --> 00:11:16,220
just take the case let's say x equal
to y you have a function which is x equal
81
00:11:16,220 --> 00:11:19,829
to y we know it'll be a forty five degree
line passing through the origin in this
82
00:11:19,829 --> 00:11:23,199
case you will come up with something called
a correlation coefficient which will come
83
00:11:23,199 --> 00:11:27,470
out to be one ok
so in other words they are fully
84
00:11:27,470 --> 00:11:36,790
correlated any increase in x will give you
an equal increase in y on the other hand
85
00:11:36,790 --> 00:11:45,680
let's say you have a completely opposite slope
and this is the case where let's say y is
86
00:11:45,680 --> 00:11:51,889
equal to minus x so in this case your correlation
coefficient is going to be a value
87
00:11:51,889 --> 00:11:59,379
of minus one versus when there is no correlation
when you have data like this here your correlation
88
00:11:59,379 --> 00:12:03,209
coefficient will give a value of zero ok
now how do you define correlation coefficient
89
00:12:03,209 --> 00:12:05,589
mathematically the mathematical definition
of correlation coefficient is typically written
90
00:12:05,589 --> 00:12:10,131
as rho so rho the correlation coefficient
is defined as s x y by s x into
91
00:12:10,131 --> 00:12:15,350
s y ok where s x is the standard deviation
of x s y is the standard deviation of y and this s
92
00:12:15,350 --> 00:12:27,930
x y is called the covariance of x and y ok
covariance is defined by s x y is equal to
93
00:12:27,930 --> 00:12:39,529
summation of x i minus x bar into y i minus
y bar whole divided by n minus one ok
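Written in symbols, the spoken definitions here correspond to the following, where n is the number of paired observations and the sums run over i = 1 to n:

```latex
\rho = \frac{s_{xy}}{s_x \, s_y},
\qquad
s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
```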
94
00:12:39,529 --> 00:12:47,440
so i can expand this further i can
expand this to summation of x i y i minus
95
00:12:47,440 --> 00:13:01,290
x i y bar minus x bar y i plus x bar y bar
by n minus one ok so my s x
96
00:13:01,290 --> 00:13:10,499
y is equal to summation x i y i minus y bar which i
can take out into summation x i minus x bar which i can
97
00:13:10,499 --> 00:13:27,730
take out into summation y i plus x bar y bar summation
of one for i equal to one to n all by n minus one ok
98
00:13:27,730 --> 00:13:33,170
so i can rewrite this as summation x i y i
and summation x i is nothing but n times x
99
00:13:33,170 --> 00:13:37,389
bar so i can write this term as minus n x bar y bar
similarly here minus n x bar y bar plus n x
100
00:13:37,389 --> 00:13:41,199
bar y bar by n minus one this gives me
the formula that s x y is summation x i y
101
00:13:41,199 --> 00:13:48,470
i minus n x bar y bar by n minus one ok so
this is the definition of covariance
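The expansion spoken here can be written step by step as follows. It uses only the facts that the sum of the x values is n times x bar and the sum of the y values is n times y bar.

```latex
\begin{aligned}
s_{xy}
  &= \frac{\sum_{i=1}^{n} \left( x_i y_i - x_i \bar{y} - \bar{x} y_i + \bar{x}\bar{y} \right)}{n - 1} \\
  &= \frac{\sum_{i} x_i y_i \;-\; \bar{y} \sum_{i} x_i \;-\; \bar{x} \sum_{i} y_i \;+\; n \bar{x}\bar{y}}{n - 1} \\
  &= \frac{\sum_{i} x_i y_i \;-\; n \bar{x}\bar{y} \;-\; n \bar{x}\bar{y} \;+\; n \bar{x}\bar{y}}{n - 1} \\
  &= \frac{\sum_{i} x_i y_i \;-\; n \bar{x}\bar{y}}{n - 1}
\end{aligned}
```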
102
00:13:48,470 --> 00:13:55,660
now let us generate two vectors ok so let
us see what kind of covariance we get what
103
00:13:55,660 --> 00:14:00,189
is the value of standard deviation and what
is the final correlation coefficient for some
104
00:14:00,189 --> 00:14:02,660
distributions let us take one particular example
where we think that they are positively correlated
105
00:14:02,660 --> 00:14:11,290
ok let us assume that i have the following
four values of x and y so i can calculate what
106
00:14:11,290 --> 00:14:19,579
is the value of x bar x bar is equal to two
point five y bar equal to three point five
107
00:14:19,579 --> 00:14:33,259
ok i can find these out in r so let us
open r studio and
108
00:14:33,259 --> 00:14:49,930
let me enter x x is equal to c
of one two three four ok y y is equal
109
00:14:49,930 --> 00:15:06,089
to c of two three four five ok i can plot
x x comma y y and this is how my plot looks
110
00:15:06,089 --> 00:15:22,170
like you can clearly see that there is a very
linear correlation between x x and y y ok
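A short sketch of the commands entered in r studio at this point. The axis labels and the names `x_bar` and `y_bar` are additions for illustration.

```r
# the two vectors entered in the lecture
xx <- c(1, 2, 3, 4)
yy <- c(2, 3, 4, 5)

# sample means used later in the covariance formula
x_bar <- mean(xx)
y_bar <- mean(yy)
print(c(x_bar, y_bar))

# scatter plot of yy against xx; the four points fall on a straight line
plot(xx, yy, xlab = "xx", ylab = "yy")
```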
111
00:15:22,170 --> 00:15:30,250
so i want to find out what is the value of
s x y ok so i can find out so i know that
112
00:15:30,250 --> 00:15:40,360
s x y is equal to summation x i y i minus
n x bar y bar by n minus one so i have n is
113
00:15:40,360 --> 00:15:54,439
equal to four in this case
so let us calculate the value of s x so i
114
00:15:54,439 --> 00:16:02,230
can write down here itself i can write down
s d of x x s d of y y same thing so i can
115
00:16:02,230 --> 00:16:09,990
define z z i can define z z equal
to let's say x x star y y so
116
00:16:09,990 --> 00:16:15,410
i can find out what is the value of z z which
is nothing but the products x i y i so what i have
117
00:16:15,410 --> 00:16:28,720
calculated is each x i y i then i can add them
up if i do sum of z z i get the complete value
118
00:16:28,720 --> 00:16:35,149
which is forty ok so i can see that summation
x y is coming out to be a value of forty ok
119
00:16:35,149 --> 00:16:41,850
i know standard deviation of x is equal to
one
120
00:16:41,850 --> 00:16:52,860
point two nine and s d of both x and
y is equal to one point two nine ok
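The intermediate quantities from this step can be reproduced with a few lines. The names `sum_xy`, `sd_x` and `sd_y` are chosen only for this sketch.

```r
xx <- c(1, 2, 3, 4)
yy <- c(2, 3, 4, 5)

# element-wise products x_i * y_i
zz <- xx * yy
print(zz)

# summation of x_i * y_i
sum_xy <- sum(zz)
print(sum_xy)

# sample standard deviations (n - 1 divisor); both come out near 1.29
sd_x <- sd(xx)
sd_y <- sd(yy)
print(round(c(sd_x, sd_y), 2))
```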
121
00:16:52,860 --> 00:17:04,130
so now let us calculate so i know n is equal
to four ok so sum of z z minus
122
00:17:04,130 --> 00:17:10,950
n which is four star x bar which is
two point five star y bar which is three point
123
00:17:10,950 --> 00:17:20,329
five gives me a value of five ok so i can
get the value of s x y
124
00:17:20,329 --> 00:17:24,390
so i got summation x y is equal to forty
so s x y is equal to summation x y minus
125
00:17:24,390 --> 00:17:28,020
n x bar y bar by n minus one so i have calculated x bar is
equal to two point five y bar equal to three
126
00:17:28,020 --> 00:17:34,179
point five n is equal to four so i can calculate
the value of s x y is equal to forty minus
127
00:17:34,179 --> 00:17:45,020
four into two point five into three point
five by n minus
128
00:17:45,020 --> 00:18:10,909
one which is three is equal to forty minus
thirty five by three equal to five by three
129
00:18:10,909 --> 00:18:14,320
ok
so and i know what is the standard
130
00:18:14,320 --> 00:18:22,880
deviation s x and s y ok so if i divide
s x y by these ok so my correlation coefficient
131
00:18:22,880 --> 00:18:27,179
is going to be five by three by one point two nine
star one point two nine which is one ok so you can accordingly
132
00:18:27,179 --> 00:18:31,280
use this and find out what is the correlation coefficient
rho ok so take the formula and use it
133
00:18:31,280 --> 00:18:34,630
to find out the correlation coefficient of x and
y ok
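As a check on the hand calculation, here is a sketch that applies the shortcut covariance formula and then compares the results with R's built-in `cov` and `cor`. The variable names are illustrative.

```r
xx <- c(1, 2, 3, 4)
yy <- c(2, 3, 4, 5)
n  <- length(xx)

# shortcut formula: (sum of x_i y_i - n * x_bar * y_bar) / (n - 1)
s_xy <- (sum(xx * yy) - n * mean(xx) * mean(yy)) / (n - 1)

# correlation coefficient: covariance divided by the two standard deviations
rho <- s_xy / (sd(xx) * sd(yy))

# compare with the built-in functions
print(c(by_hand = s_xy, built_in = cov(xx, yy)))
print(c(by_hand = rho,  built_in = cor(xx, yy)))
```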
134
00:18:34,630 --> 00:18:53,450
now let us see one thing so my correlation
coefficient rho is defined by s x y by s
135
00:18:53,450 --> 00:19:22,440
x into s y ok so in this case
what is the minimum value of rho possible
136
00:19:22,440 --> 00:19:35,370
and what is the maximum value of rho possible
so we reasoned that
137
00:19:35,370 --> 00:19:43,190
this value should be between minus one and
one ok so let us see if that is true so
138
00:19:43,190 --> 00:20:06,649
let us assume y as let's say c x in this case
where c is positive ok so we
139
00:20:06,649 --> 00:20:22,510
assume a very strong correlation in fact we
can also write it as a plus c x to make it more
140
00:20:22,510 --> 00:20:31,400
general ok so we assume y is equal
to a plus c x where my s x y is defined as
141
00:20:31,400 --> 00:20:50,090
summation of x minus x bar into y minus y
bar by n minus one right so i know my y bar
142
00:20:50,090 --> 00:21:07,000
has to be a plus c x bar from previous classes
we had derived this equation so y minus y
143
00:21:07,000 --> 00:21:13,779
bar is nothing but c into x minus x bar so
this implies y minus y bar is c into x minus
144
00:21:13,779 --> 00:21:22,370
x bar ok so if this is true i can compute
the value of s x y as c into summation of x minus
145
00:21:22,370 --> 00:21:27,490
x bar whole square
where i have taken c out all by n minus
146
00:21:27,490 --> 00:21:35,559
one s x y is this now s x is root of summation
x minus x bar whole square by n minus one
147
00:21:35,559 --> 00:21:44,409
and s y is equal to root of summation
y minus y bar whole square by n minus one
148
00:21:44,409 --> 00:21:55,679
ok so s x into s y will give me so i can take
out one by n minus
149
00:21:55,679 --> 00:22:03,720
one common into root of summation x minus
x bar whole square into summation y minus
150
00:22:03,720 --> 00:22:09,029
y bar whole square ok if this is true then
i again know y is equal to a plus c x so
151
00:22:09,029 --> 00:22:14,690
my s x into s y will be one by n minus one
into root of summation x minus x bar whole
152
00:22:14,690 --> 00:22:22,580
square into summation so y is a plus c x so
again i know y bar is equal to a plus c x
153
00:22:22,580 --> 00:22:30,890
bar so y minus y bar is equal to c into x minus
x bar ok so i can put my c outside so c square
154
00:22:30,890 --> 00:22:35,720
into summation x minus x bar whole square ok
so i can take it out of the root this is equal to
155
00:22:35,720 --> 00:22:38,510
summation x minus x bar whole square by n
minus one into root of c square ok so this
156
00:22:38,510 --> 00:22:41,390
is what i have so i
can write it as summation of x minus x bar
157
00:22:41,390 --> 00:22:59,269
whole square by n minus one into mod of
c ok so if i do this then my
158
00:22:59,269 --> 00:23:15,899
rho is defined as s x y by s x into s y
which is going to be equal to c into
159
00:23:15,899 --> 00:23:21,680
summation of x minus x bar whole square by
n minus one divided by mod c into
160
00:23:21,680 --> 00:23:35,010
summation x minus x bar whole square by n
minus one is equal to c by mod c ok
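The derivation for the linear case y = a + c x can be set out in symbols as follows. It assumes c is not zero and that the x values are not all equal, so that neither standard deviation is zero.

```latex
\begin{aligned}
\bar{y} &= a + c\,\bar{x}
  \quad\Longrightarrow\quad y_i - \bar{y} = c\,(x_i - \bar{x}) \\
s_{xy} &= \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
        = \frac{c \sum_{i} (x_i - \bar{x})^2}{n - 1} \\
s_x\, s_y &= \frac{1}{n - 1}
        \sqrt{\sum_{i} (x_i - \bar{x})^2 \cdot c^2 \sum_{i} (x_i - \bar{x})^2}
        = \frac{|c| \sum_{i} (x_i - \bar{x})^2}{n - 1} \\
\rho &= \frac{s_{xy}}{s_x\, s_y} = \frac{c}{|c|}
\end{aligned}
```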
161
00:23:35,010 --> 00:23:39,019
so when your c is positive
so when c is positive i can clearly see
162
00:23:39,019 --> 00:23:59,990
rho is going to be one so if let's say c is
five then five by mod five is simply equal
163
00:23:59,990 --> 00:24:05,519
to one when c is negative rho is going to
be let's say for example c is minus five
164
00:24:05,519 --> 00:24:11,220
minus five by five equal to minus one this
tells you that your correlation
165
00:24:11,220 --> 00:24:14,809
coefficient rho is bounded within the following
limits ok rho is bounded between minus one
166
00:24:14,809 --> 00:24:17,289
and plus one ok
so this is the range of the correlation coefficient
167
00:24:17,289 --> 00:24:20,340
so the minimum correlation coefficient is when they
are anti correlated that means x is increasing
168
00:24:20,340 --> 00:24:24,779
y is decreasing or the reverse way x is decreasing
y is increasing when there is complete anti
169
00:24:24,779 --> 00:24:28,941
correlation you will get a value of rho which
is minus one when they are perfectly in
170
00:24:28,941 --> 00:24:34,010
sync that is x increases and y increases at a
constant rate you would get a value of rho
171
00:24:34,010 --> 00:24:39,059
is equal to one so when there is some association
but they need not be fully associated
172
00:24:39,059 --> 00:24:43,270
you might get a positive correlation or a
negative correlation but the value would be
173
00:24:43,270 --> 00:24:47,580
like let's say point two or minus point two
depending on the extent of correlation ok
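A quick numerical illustration of the two limiting cases, using R's built-in `cor`. The intercept `a` and the slopes of plus and minus five are hypothetical values chosen to match the spoken example.

```r
x <- c(1, 2, 3, 4)
a <- 2          # hypothetical intercept; its value does not affect rho

# exact linear relation with positive slope c = 5
rho_pos <- cor(x, a + 5 * x)

# exact linear relation with negative slope c = -5
rho_neg <- cor(x, a - 5 * x)

print(c(rho_pos, rho_neg))
```

The sign of the slope decides whether rho is plus one or minus one, while the size of the slope and the intercept drop out.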
174
00:24:47,580 --> 00:24:52,289
so now let us take another
example and do this calculation
175
00:24:52,289 --> 00:24:55,679
ourselves ok so let us take an example where
x and y are not particularly correlated let
176
00:24:55,679 --> 00:25:04,090
us say if i take this point this point this
point and this point ok so let's say x is one y is
177
00:25:04,090 --> 00:25:28,050
one x is two y is five x is three y is one
x is four y is five ok let us do the following
178
00:25:28,050 --> 00:25:44,950
exercise ok
so my x bar is going to be two point five
179
00:25:44,950 --> 00:25:54,980
as before y bar is going to be three ok so
let us find out the standard deviation s x
180
00:25:54,980 --> 00:26:08,330
so if i take the values of x minus x bar
whole square for one two three four so this is one point five
181
00:26:08,330 --> 00:26:12,760
square point five square point five square one point five square so
s x should give me a value of root of two point two five
182
00:26:12,760 --> 00:26:19,610
plus point two five plus point two five plus two point two five by n minus one
which is three is equal to root of
183
00:26:19,610 --> 00:26:29,830
five by three ok s x is root of five by
three ok which is the same as before i can
184
00:26:29,830 --> 00:26:37,899
do the same thing for y summation
y minus y bar whole square i have one
185
00:26:37,899 --> 00:26:43,210
five one five ok so y bar is three i know
y bar equal to three two square two square
186
00:26:43,210 --> 00:26:48,799
two square two square so s y should be root
of four into four by three so sixteen by three
187
00:26:48,799 --> 00:26:57,030
ok so we have s x is equal to root of five
by three s y is equal to root of sixteen by three
188
00:26:57,030 --> 00:27:03,840
now we want to find out s x y ok so let us
take another page ok so we have
189
00:27:03,840 --> 00:27:08,922
x we have y one two three four one five one
five so one minus two point five into
190
00:27:08,922 --> 00:27:30,710
one minus three ok two minus two point five
into five minus
191
00:27:30,710 --> 00:27:46,330
three three minus two point five into one
minus three four minus two point five into
192
00:27:46,330 --> 00:28:11,460
five minus three you can find out these values
so for y all of them are two or minus
193
00:28:11,460 --> 00:28:48,980
two this is minus one point five into minus two
which is three as minus into minus is plus this value
194
00:28:48,980 --> 00:29:04,740
is minus one this value is also minus one
and the last one is three ok
195
00:29:04,740 --> 00:29:14,780
so i can calculate the summation so s x y
will be summation of this which is six minus
196
00:29:14,780 --> 00:29:29,559
two is four by three ok so my rho is nothing
but four by three by
197
00:29:29,559 --> 00:29:55,220
root of five by three into root of sixteen
by three so my three will go and this is four
198
00:29:55,220 --> 00:30:34,480
by root of eighty which is equal to one by root
five ok so i'll get a value which is
199
00:30:34,480 --> 00:31:03,669
roughly
point four five so you see that
200
00:31:03,669 --> 00:31:12,710
this
is still positively correlated as per this
201
00:31:12,710 --> 00:31:40,750
calculation where it is much lesser than one
but it is still positive ok
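This second example can be checked in R by computing each piece from its definition and comparing the result with the built-in `cor`. The variable names are illustrative and not from the lecture.

```r
x <- c(1, 2, 3, 4)
y <- c(1, 5, 1, 5)
n <- length(x)

# deviations from the means
dx <- x - mean(x)
dy <- y - mean(y)

# standard deviations and covariance with the n - 1 divisor
s_x  <- sqrt(sum(dx^2) / (n - 1))
s_y  <- sqrt(sum(dy^2) / (n - 1))
s_xy <- sum(dx * dy) / (n - 1)

# correlation coefficient from the definition, next to the built-in result
rho <- s_xy / (s_x * s_y)
print(c(s_x = s_x, s_y = s_y, s_xy = s_xy))
print(c(by_hand = rho, built_in = cor(x, y)))
```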
202
00:31:40,750 --> 00:31:57,759
so in
this class we discussed the correlation
203
00:31:57,759 --> 00:32:04,900
coefficient and how you can make use of r
to calculate these individual metrics and
204
00:32:04,900 --> 00:32:11,210
even calculate the correlation coefficient
in the next class we will again take a few more
205
00:32:11,210 --> 00:32:21,460
examples of correlation and then go to the
next step of how to do regression and fitting
206
00:32:21,460 --> 00:32:33,420
with that i end here and i look forward to
meeting you in the next lecture
207
00:32:33,420 --> 00:32:35,280
thank you