1
00:00:20,109 --> 00:00:29,629
Good afternoon. Today we formally enter
the multivariate domain. Our topic is
2
00:00:29,629 --> 00:00:34,629
multivariate descriptive statistics.
3
00:00:34,629 --> 00:00:42,920
And the content of today’s presentation
includes multivariate observations, mean vector
4
00:00:42,920 --> 00:00:49,699
and covariance matrix, correlation matrix,
and sum of squares and cross products (SSCP) matrices. So,
5
00:00:49,699 --> 00:00:57,579
this will be purely about data; we will be dealing
with data through symbols like x.
6
00:00:57,579 --> 00:01:16,729
So, in univariate descriptive statistics you
have seen central tendency measures as well
7
00:01:16,729 --> 00:01:27,400
as dispersion measures; the univariate case you
have seen earlier in these lectures.
8
00:01:27,400 --> 00:01:38,280
Now, in the univariate case we measured central
tendency using the mean, the mode
9
00:01:38,280 --> 00:01:48,100
and the median; these are the measures we
adopted. And for dispersion, we used
10
00:01:48,100 --> 00:01:59,470
the range, the IQR (interquartile range),
as well as the standard deviation.
11
00:01:59,470 --> 00:02:13,920
Now, we will see some of their counterparts
in the multivariate domain. For example, the mean
12
00:02:13,920 --> 00:02:24,659
becomes the mean vector; when you move to the
multivariate statistics side, you will find
13
00:02:24,659 --> 00:02:33,239
that the mean becomes the mean vector. And the
standard deviation, the other component, which
14
00:02:33,239 --> 00:02:42,239
is the measure of dispersion, will be used
not as the standard deviation but as its square, that is
15
00:02:42,239 --> 00:02:57,739
the variance, and the variance becomes a covariance
matrix. In addition, as there will be more
16
00:02:57,739 --> 00:03:05,629
than one variable, there will be correlation
between variables, so we will also need the correlation
17
00:03:05,629 --> 00:03:19,680
matrix. Today’s discussion will
concentrate on the mean vector, the covariance
18
00:03:19,680 --> 00:03:25,860
matrix and correlation matrix.
19
00:03:25,860 --> 00:03:41,660
Now, think a little at an abstract level:
suppose that there is a multivariate population
20
00:03:41,660 --> 00:03:50,650
and that population is characterised by a
variable vector which is known as x.
21
00:03:50,650 --> 00:04:02,099
There are p variables characterizing
the population of interest, so it is a p cross
22
00:04:02,099 --> 00:04:10,659
1 vector. What does that mean? We are saying
we have created a vector X, which is a p cross
23
00:04:10,659 --> 00:04:19,159
1 vector, where p stands for the number of variables;
so variable 1 is X 1, variable
24
00:04:19,159 --> 00:04:30,810
2 is X 2, and so on up to variable p, which is X p. Now,
suppose that you need to collect data on
25
00:04:30,810 --> 00:04:39,969
these p variables. So I am writing here: my
first variable is X 1, my second variable is
26
00:04:39,969 --> 00:04:52,669
X 2, and my last variable is X p. You
are collecting data on these p variables,
27
00:04:52,669 --> 00:04:59,909
so you will be collecting observation 1, observation
2, and so on up to n observations.
28
00:04:59,909 --> 00:05:07,169
So, essentially you are not in the univariate
domain, you are in the multivariate domain, and
29
00:05:07,169 --> 00:05:15,009
you do not have just one x with
n values, that is, values 1 to n. Here
30
00:05:15,009 --> 00:05:22,919
i equals 1 to n and j equals 1 to p. So,
what does it mean? i equal to 1 to n indexes
31
00:05:22,919 --> 00:05:36,199
the observations, and j equal
to 1 to p indexes the variables.
32
00:05:36,199 --> 00:05:45,249
Now, think of a situation where you have
your population and you know that p
33
00:05:45,249 --> 00:05:49,339
variables are there; you have identified the
variables. Now, you are planning to collect
34
00:05:49,339 --> 00:05:54,830
data. You have not collected the data yet, you are
only planning to collect it. Then our
35
00:05:54,830 --> 00:06:00,539
nomenclature, the way we will write it here,
will be like this. The general
36
00:06:00,539 --> 00:06:22,619
observation will be x i j. What does it mean?
x i j is the i th observation on
37
00:06:22,619 --> 00:06:29,110
the j th variable; that is why i runs from
1 to n, the number of observations, and j
38
00:06:29,110 --> 00:06:35,999
runs from 1 to p, the number of variables. x i j
is the i th observation on the j th variable
39
00:06:35,999 --> 00:06:44,649
to be collected.
If this is the case, that means we first
40
00:06:44,649 --> 00:06:53,029
write the symbol x, then we write
the observation number, and then we
41
00:06:53,029 --> 00:07:01,449
write the variable number. So,
this is observation 1 on variable 1, that is
42
00:07:01,449 --> 00:07:13,449
x 1 1; observation 2 on variable 1 is x 2 1; and like
this you write up to observation n on variable
43
00:07:13,449 --> 00:07:26,509
1, x n 1. My observations on the second variable
will be x observation 1 variable 2, x observation
44
00:07:26,509 --> 00:07:32,879
2 variable 2, and so on up to x observation n
variable 2.
45
00:07:32,879 --> 00:07:42,319
So, in this manner you go on to x observation
number 1 on variable p, observation number
46
00:07:42,319 --> 00:07:52,369
2 on variable p, and so on until you get
observation number n on variable p. This is
47
00:07:52,369 --> 00:08:11,830
the data matrix. We will denote it
by capital bold X, and its order is
48
00:08:11,830 --> 00:08:27,580
n cross p. This is the data matrix that
you are planning to collect. If this is the
49
00:08:27,580 --> 00:08:37,320
case, you all know that X is a random vector
here, X 1 to X p; X is a random vector because
50
00:08:37,320 --> 00:08:46,380
all the variables are random variables here.
So, x i j, the i th observation on the j th
51
00:08:46,380 --> 00:08:54,420
variable, is also a random variable.
Please keep in mind that this is also a random
52
00:08:54,420 --> 00:09:04,329
variable; we are calling it a random observation.
As you have not collected data, you are
53
00:09:04,329 --> 00:09:12,019
just planning to collect data, any value of
x i j may occur, depending on the spread
54
00:09:12,019 --> 00:09:17,250
of that variable. You will get some
value, but you do not know what that
55
00:09:17,250 --> 00:09:28,209
value is. That is why we are saying it
is a random observation and x i j is
56
00:09:28,209 --> 00:09:46,929
a random variable.
So, you cannot know the value of x i j in advance, but you
57
00:09:47,000 --> 00:09:57,460
have some expectation of it. That is why you
saw in the univariate case that if x is a random
58
00:09:57,470 --> 00:10:11,000
variable, then the expected value of x is mu.
In the same way, every variable here
59
00:10:11,000 --> 00:10:21,040
also has some expected value, and that is
what we will talk about as the mean
60
00:10:21,040 --> 00:10:33,149
vector. Before coming to the mean vector,
two other important points need to be discussed
61
00:10:33,149 --> 00:10:42,649
here. Suppose we take the i th observation
and the variable x j. I have already given you
62
00:10:42,649 --> 00:10:51,110
the general entry here:
this is the general observation x i j.
63
00:10:51,110 --> 00:10:57,360
So, similarly, there will be a general
multivariate observation, and
64
00:10:57,360 --> 00:11:02,879
similarly, there will be a general variable.
Now, suppose this row is i; this is the
65
00:11:02,879 --> 00:11:15,750
i th multivariate observation. So, your values
will be x observation i on variable 1,
66
00:11:15,750 --> 00:11:26,050
observation i on variable 2, and going on like this,
observation i on variable j, then observation
67
00:11:26,050 --> 00:11:50,990
i on variable p. So, if I write that x i is
the i th multivariate observation, can I not write
68
00:11:50,990 --> 00:12:09,199
it like this: x i 1, x i 2, x i j, x i p?
This is a p cross 1 vector.
69
00:12:09,199 --> 00:12:17,939
Now, before I go variable-wise:
when I talk about the i th multivariate
70
00:12:17,939 --> 00:12:29,620
observation, please keep in mind that it is one
particular observation on all the variables.
71
00:12:29,620 --> 00:12:35,449
What happens in the univariate
case? You get only one
72
00:12:35,449 --> 00:12:41,449
value for that observation. As this is
multivariate, you are getting values on all
73
00:12:41,449 --> 00:12:48,709
p variables. The other important thing
is that all the variables are occurring
74
00:12:48,709 --> 00:12:54,259
simultaneously; this simultaneous occurrence
is why it is multivariate in nature.
75
00:12:54,259 --> 00:13:07,149
Now, if I go by our general variable, which
is x j, then the observations will be x observation
76
00:13:07,149 --> 00:13:16,209
1 on variable j, observation 2 on variable
j, observation i on variable j and, similarly,
77
00:13:16,209 --> 00:13:28,050
observation n on variable j. So, you will
get a general variable, that is, all the observations
78
00:13:28,050 --> 00:13:31,350
on one particular variable, which we will now write.
79
00:13:31,350 --> 00:13:48,160
So, if we write x j for the
n observations on the j th variable, then this is x
80
00:13:48,160 --> 00:14:03,930
1 j, x 2 j, and so on through x i j up to x n j; it
is an n cross 1 vector. So, essentially
81
00:14:03,930 --> 00:14:13,930
what is happening? On one side
you have the axis of observations,
82
00:14:13,930 --> 00:14:28,829
and on the other side you have the axis of
variables. So, when you go row-wise, you
83
00:14:28,829 --> 00:14:37,649
are talking about different multivariate
observations, and when you go column-wise
84
00:14:37,649 --> 00:14:49,999
you are talking about the n observations
on each variable, that is, n observations
85
00:14:49,999 --> 00:15:00,399
on variable 1 through variable p. So, what have we
assumed here? We have assumed that we have
86
00:15:00,399 --> 00:15:06,899
not collected the data, we are planning to collect it,
and as a result all the entries
87
00:15:06,899 --> 00:15:17,569
in this particular matrix are random in nature.
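As a rough sketch of the row and column layout just described (this is not shown in the lecture): the snippet uses Python with NumPy, which is an assumption of this note, and a small 3 cross 2 matrix of illustrative values. Rows are multivariate observations and columns are variables. Note that the code indexes from 0, while the lecture counts from 1.

```python
import numpy as np

# Illustrative values only: n = 3 observations (rows), p = 2 variables (columns).
X = np.array([[10.0, 100.0],
              [12.0, 110.0],
              [11.0, 105.0]])

n, p = X.shape      # n observations, p variables

# i th multivariate observation: one row, holding p values.
x_i = X[1, :]       # second observation (i = 2 in the lecture's counting)

# j th variable: one column, holding n values.
x_j = X[:, 0]       # first variable (j = 1 in the lecture's counting)

print(n, p)
print(x_i)
print(x_j)
```

Going row-wise gives the p cross 1 observation vector; going column-wise gives the n cross 1 vector of observations on one variable.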
88
00:15:17,569 --> 00:15:29,509
What will happen once you collect the data?
You will collect data on the same number
89
00:15:29,509 --> 00:15:40,079
of variables, x 1, x 2, up to x p, and your data
matrix will again be n cross p.
90
00:15:40,079 --> 00:15:51,779
Here also you will have i equal to 1
to n: x 1 1, x 2 1, x n 1; x 1 2, x 2 2,
91
00:15:51,779 --> 00:16:07,470
x n 2; then x 1 p, x 2 p, x n p. This is what you
get. Here also you have x i 1, then x i 2, somewhere
92
00:16:07,470 --> 00:16:17,670
x i j, then x i p; that general observation
will be there, which we can write
93
00:16:17,670 --> 00:16:40,709
as x i, which is x i 1, x i 2, x i j, then x i p; that is
p cross 1. You will also get
94
00:16:40,709 --> 00:16:51,009
one general variable, whose observations we write
as x j. If we write it out, you get
95
00:16:51,009 --> 00:17:14,230
x 1 j, x 2 j, through x i j, up to x n j. All of this
remains the same after data collection.
96
00:17:14,230 --> 00:17:20,780
Then my question is, what is the difference
between the first matrix and second matrix?
97
00:17:20,780 --> 00:17:28,039
This is our second matrix, which we said is
after data collection, and this is our first
98
00:17:28,039 --> 00:17:32,320
matrix, which we said is before data collection.
What is the difference?
99
00:17:32,320 --> 00:17:42,049
In the second matrix the values are known, so they are
100
00:17:42,049 --> 00:17:49,480
fixed values; from the data collection
point of view, they are fixed. In the first
101
00:17:49,480 --> 00:17:55,140
matrix you do not know what values you will
get. That is why, for the first matrix, when
102
00:17:55,140 --> 00:18:00,940
you do not know anything, you expect
some value for each variable, that is the mean;
103
00:18:00,940 --> 00:18:06,460
you also expect some deviation of the different
values from the mean, that is your variance;
104
00:18:06,460 --> 00:18:17,360
and you also expect that two variables will
co-vary, that is your covariance. That
105
00:18:17,360 --> 00:18:19,140
is from the population point of view.
106
00:18:19,140 --> 00:18:31,010
Now, let us look at this data set. Suppose I ask
you what the X matrix, the data matrix, is. So,
107
00:18:31,010 --> 00:18:38,610
what is the value of n? n is 12. We are
excluding month for the time being, although
108
00:18:38,610 --> 00:18:44,110
month could itself be treated as a variable. With
month excluded for the time being, count the variables:
109
00:18:44,110 --> 00:19:03,340
1, 2, 3, 4, 5. So your data matrix is 12 cross
5; all the variables are measured for 12 observations.
110
00:19:03,340 --> 00:19:18,940
So, if you look at this, this is the data
matrix, and it is 12 cross 5. So, the left
111
00:19:18,940 --> 00:19:31,640
hand side matrix, the data matrix, is X, 12
cross 5; 12 stands for n and 5 stands for
112
00:19:31,640 --> 00:19:37,919
p.
113
00:19:37,919 --> 00:19:50,940
Now, first we will look at the expectation of
the variable values, that is, for each variable what
114
00:19:50,940 --> 00:20:00,880
the expected value is; then we will see,
when we collect data, what that value is.
115
00:20:00,880 --> 00:20:13,250
Now, look at this slide: as there are
p variables, there are p means. These are
116
00:20:13,250 --> 00:20:23,700
population parameters. This mu is the vector
that holds the p means for the p variables;
117
00:20:23,700 --> 00:20:30,450
it is the mean vector, which is a parameter
vector for the population. And I have written
118
00:20:30,450 --> 00:20:35,490
here that it is the expected value of x 1, expected
value of x 2, expected value of x j, expected
119
00:20:35,490 --> 00:20:38,100
value of x p.
120
00:20:38,100 --> 00:20:47,740
You have already seen what the expected
value of x is. As we wrote earlier, if
121
00:20:47,740 --> 00:20:59,990
your variable is discrete, you take a summation
over all x of x times f of x. What happens here?
122
00:20:59,990 --> 00:21:05,769
We have several variables.
So we write the expected value of
123
00:21:05,769 --> 00:21:12,950
x j, where j stands for a particular
variable, the j th variable; then we can write
124
00:21:12,950 --> 00:21:30,240
the sum over all x j of x j times f of x j, when it is a discrete
variable. When it is continuous, what will
125
00:21:30,240 --> 00:21:32,630
happen? What do you write in the continuous case here?
126
00:21:32,630 --> 00:21:42,549
You have seen the continuous case: you write the integral from
minus infinity to plus infinity of x f of x d x. So, I am
127
00:21:42,549 --> 00:21:52,630
writing here, for the continuous case, the integral from
minus infinity to plus infinity of x j f of x j
128
00:21:52,630 --> 00:22:12,140
d x j. In both cases, for j equal to 1
to p, this is your expected value. So, when
129
00:22:12,140 --> 00:22:23,769
you write it down for mu, what is mu? mu is a
p cross 1 vector, which is mu 1, mu
130
00:22:23,769 --> 00:22:33,250
2, through mu j, up to mu p. This p cross
1 vector is nothing but,
131
00:22:33,250 --> 00:22:41,510
as we just discussed, the expected value of x 1,
expected value of x 2, then expected value
132
00:22:41,510 --> 00:22:51,419
of x j, then expected value of x p, and each expectation
we calculate like this.
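Written out in symbols, the spoken formulas above read roughly as follows. The subscript on f, marking it as the marginal probability function of the j th variable, is notation added here for clarity and is not spoken in the lecture.

```latex
% Expected value of the j-th variable, for j = 1, ..., p
\mu_j = E(X_j) = \sum_{\text{all } x_j} x_j \, f_j(x_j)
  \quad \text{(discrete case)}
\qquad
\mu_j = E(X_j) = \int_{-\infty}^{\infty} x_j \, f_j(x_j) \, dx_j
  \quad \text{(continuous case)}

% Population mean vector (p x 1)
\boldsymbol{\mu} =
\begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{bmatrix}
=
\begin{bmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_p) \end{bmatrix}
```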
133
00:22:51,419 --> 00:23:04,789
So, regarding data collection we have two
matrices. For one matrix we say that we have not
134
00:23:04,789 --> 00:23:16,990
collected data, that is, before data collection,
and with respect to that we developed
135
00:23:16,990 --> 00:23:32,269
this population quantity; our topic now
is the mean vector. When you collect data, like
136
00:23:32,269 --> 00:23:40,140
the one I have already given you, which is 12 by
5, then this is our data matrix after data
137
00:23:40,140 --> 00:23:49,140
collection. And we want to compute the average
values, as we did in the univariate
138
00:23:49,140 --> 00:23:52,640
case. In the univariate case, what have you seen?
139
00:23:52,640 --> 00:24:02,120
We have also seen that x bar is the estimate
of mu. So, in our case we write x bar,
140
00:24:02,120 --> 00:24:16,779
which is nothing but x 1 bar, x 2 bar, x j
bar, x p bar, that is, the average vector for the p variables.
141
00:24:16,779 --> 00:24:25,769
Now, what is your x bar in the univariate case?
You have written 1 by n times the sum over i equal
142
00:24:25,769 --> 00:24:37,399
to 1 to n of x i. So, I can write the first
entry in the same manner: it is
143
00:24:37,399 --> 00:24:51,710
1 by n times the sum over i equal
to 1 to n of x i 1, where 1 stands for
144
00:24:51,710 --> 00:24:56,519
the variable and i stands for the observation.
Similarly, for the second one you can write 1 by n times the sum
145
00:24:56,519 --> 00:25:07,600
over i equal to 1 to n of x i 2.
In the same manner, for the j th case
146
00:25:07,600 --> 00:25:18,510
you can write the sum over i equal to 1 to n of x i j, and for the
last one the sum over i equal to 1 to n of x
147
00:25:18,510 --> 00:25:35,190
i p, each divided by n. This is my average vector for the sample
collected on p variables. But we will not
148
00:25:35,190 --> 00:25:46,519
go for individual mean calculations, one average
at a time. Instead of doing this, we want
149
00:25:46,519 --> 00:25:52,220
to do a matrix calculation, using vectors and
matrices.
150
00:25:52,220 --> 00:26:03,830
What will we do here? In order
to calculate this x bar, we will
151
00:26:03,830 --> 00:26:11,769
take the data matrix from before. Our earlier
data matrix was
152
00:26:11,769 --> 00:26:27,039
X, n cross p, with entries x 1 1, x
2 1, down to x n 1; x 1 2, x 2 2, down to
153
00:26:27,039 --> 00:26:39,760
x n 2; and so on to x 1 p, x 2 p, down to x n
p. That was my sample data, and it is n cross
154
00:26:39,760 --> 00:26:41,799
p.
155
00:26:41,799 --> 00:26:57,490
You want to compute x bar from this sample
data and x bar is a p cross 1 vector, that
156
00:26:57,490 --> 00:27:11,870
is x 1 bar, x 2 bar, x j bar, x p bar, and
you know the general formula: x j
157
00:27:11,870 --> 00:27:24,179
bar is 1 by n times the sum over i equal to 1 to n of
x i j. Now, in matrix form, in
158
00:27:24,179 --> 00:27:29,330
one go, you want to calculate all these things.
What do you need to do? When you
159
00:27:29,330 --> 00:27:37,029
calculate in the matrix domain, please remember:
I have an n cross p matrix and I want a p cross 1
160
00:27:37,029 --> 00:27:42,990
vector. So, starting from n
cross p, you are going
161
00:27:42,990 --> 00:27:53,419
to p cross 1. So, what if I create one
vector, written as a bold
162
00:27:53,419 --> 00:28:09,940
1, which is all ones, 1, 1, and so on? There
will be n ones, so it is n cross 1. I am creating a
163
00:28:09,940 --> 00:28:19,750
unit vector, that is, a vector of ones: there are n elements
in the vector and each element is simply 1.
164
00:28:19,750 --> 00:28:28,409
Now, I want to use this unit
vector with the data matrix in such a manner
165
00:28:28,409 --> 00:28:36,490
that I can apply this computational
formula and get the p cross
166
00:28:36,490 --> 00:28:53,009
1 vector. So, suppose I set it up like
this: p cross n times n cross 1
167
00:28:53,009 --> 00:29:03,299
is p cross 1, from the row-by-column rule of matrix
multiplication. Here the number of columns of the first
168
00:29:03,299 --> 00:29:11,590
equals the number of rows of the second.
If I want to do this, I have to
169
00:29:11,590 --> 00:29:23,700
transpose the data matrix. I have a matrix called
X that is n cross p; if I take the transpose, X T, it
170
00:29:23,700 --> 00:29:31,990
will be p cross n, with rows and columns interchanged.
So, I take X transpose, which is p
171
00:29:31,990 --> 00:29:42,190
cross n, and multiply it by the vector of ones, which is
n cross 1. In the resultant matrix
172
00:29:42,190 --> 00:29:51,330
the inner dimension n cancels out, and
it will be p cross 1. So, what will happen
173
00:29:51,330 --> 00:29:53,960
with this, you see now.
174
00:29:53,960 --> 00:30:12,759
So, suppose your data is like this: I am taking
p equal to 2 and n equal to 3, just to
175
00:30:12,759 --> 00:30:21,799
reduce the complexity while repeating the same calculation.
Then my X matrix will be 3 cross 2, which
176
00:30:21,799 --> 00:30:38,509
will look like this: x 1 1, x 2 1, x 3 1, then
x 1 2, x 2 2, x 3 2, as there are 3 observations.
177
00:30:38,509 --> 00:30:47,039
So, let us create a unit vector, which is
1, 1 and 1; three ones are required because my n
178
00:30:47,039 --> 00:31:01,740
is 3. I want to multiply X transpose by 1.
So, what I do is write x 1 1, x 2 1, x 3
179
00:31:01,740 --> 00:31:08,950
1, that is the first row; the second row is x 1 2,
x 2 2 and x 3 2.
180
00:31:08,950 --> 00:31:18,090
This is my second row because I have
transposed the X matrix. Then you multiply
181
00:31:18,090 --> 00:31:28,389
this by 1, 1 and 1. We know that
X transpose is 2 cross 3 and 1 is 3
182
00:31:28,389 --> 00:31:38,190
cross 1, so we will get a resultant
matrix that is 2 cross 1. From the matrix
183
00:31:38,190 --> 00:31:45,549
multiplication point of view, it is x 1 1 into 1
plus x 2 1 into 1 plus x 3 1 into 1, that is, the sum over i equal
184
00:31:45,549 --> 00:32:01,450
to 1 to 3 of x i 1; the second one will be the sum over i equal
to 1 to 3 of x i 2. Now, we have seen earlier
185
00:32:01,450 --> 00:32:15,899
that x j bar is 1 by n times the sum over
i equal to 1 to n of x i j. So, if I divide this
186
00:32:15,899 --> 00:32:20,570
resultant vector by n, I will get the mean
vector.
187
00:32:20,570 --> 00:32:34,850
So, that means I can write x bar as 1 by n times
X transpose times the vector of ones. You will be able to
188
00:32:34,850 --> 00:32:41,090
do it very easily, and this formulation
is much better because in the multivariate
189
00:32:41,090 --> 00:32:51,639
domain you can forget about that kind of hand
calculation. You should use MATLAB to understand
190
00:32:51,639 --> 00:33:00,309
the computational part. Nowadays Excel is
also very powerful, so you can use Excel
191
00:33:00,309 --> 00:33:09,379
as well. Using Excel you can apply this formulation
very easily. I will be giving you a tutorial
192
00:33:09,379 --> 00:33:15,649
and you will have to do this, and if you
find a problem, come and talk to me in my chamber;
193
00:33:15,649 --> 00:33:24,809
that is no problem. So, will we be able
to compute the mean vector for given data?
194
00:33:24,809 --> 00:33:32,769
Now, look at this slide. Suppose you take the
first 3 values of the first variable and
195
00:33:32,769 --> 00:33:40,149
also the first 3 values of the second variable and
use that matrix multiplication; can we
196
00:33:40,149 --> 00:33:46,899
not do it?
197
00:33:46,899 --> 00:33:58,809
As I said, we are taking only the 2-variable
case with 3 observations: 10, 12, 11 and then 100,
198
00:33:58,809 --> 00:34:18,570
110 and 105. I want to get x bar, that is,
x 1 bar and x 2 bar. You can
199
00:34:18,570 --> 00:34:26,550
do it very easily like this: 10 plus 12 plus 11,
the total is 33; here the total will be
200
00:34:26,550 --> 00:34:37,810
315, from 100, 110 and 105. So we have 33 and 315. If you
divide by 3, the first
201
00:34:37,810 --> 00:34:49,910
variable gives 11 and the second one gives 105.
So, your x bar is 11 and 105. That
202
00:34:49,910 --> 00:34:55,180
you can do very easily by hand, but I am
saying that you should use this formulation,
203
00:34:55,180 --> 00:35:05,870
x bar equal to 1 by n X transpose 1.
If you do it like this, it is 1 by 3 times X transpose, which
204
00:35:05,870 --> 00:35:21,530
is 10 12 11 and 100 110 105, times 1 1 1. This
is nothing but 1 by 3 into 33 and 315, which is
205
00:35:21,530 --> 00:35:32,300
the same thing, 11 and 105. It seems that
between the two approaches there is not much difference
206
00:35:32,300 --> 00:35:37,040
in calculation. The reason is that the number
of observations is small and the number of variables
207
00:35:37,040 --> 00:35:42,780
is also small. But in practice you will have
a large number of variables, p, with a large
208
00:35:42,780 --> 00:35:47,090
number of observations. Then individual
calculation is not practical; straight away
209
00:35:47,090 --> 00:35:52,810
go for matrix multiplication.
We will be using simple matrix multiplication,
210
00:35:52,810 --> 00:36:02,450
matrix inverse and other operations throughout
the lectures. So, this is what the
211
00:36:02,450 --> 00:36:10,340
mean vector is, from the population point
of view and from the sample point of view. From the
212
00:36:10,340 --> 00:36:19,120
sample point of view, the sample average vector is the
estimate of the population mean vector.
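The matrix formulation x bar equal to 1 by n X transpose 1 can be sketched in a few lines. The lecture suggests MATLAB or Excel; the version below uses Python with NumPy instead, purely as an assumption of this note, and reuses the 3 observation, 2 variable example from the lecture. The variable names are illustrative.

```python
import numpy as np

# The lecture's small example: n = 3 observations, p = 2 variables.
X = np.array([[10.0, 100.0],
              [12.0, 110.0],
              [11.0, 105.0]])
n, p = X.shape

# The lecture's "unit vector": an n x 1 vector in which every element is 1.
ones = np.ones((n, 1))

# (p x n) times (n x 1) gives a p x 1 vector of column totals.
col_totals = X.T @ ones

# Dividing by n gives the p x 1 sample mean vector.
x_bar = col_totals / n

print(col_totals.ravel())
print(x_bar.ravel())
```

With these numbers the column totals are 33 and 315 and the mean vector is 11 and 105, matching the hand calculation. For a large n and p the same three lines apply unchanged, which is the point of preferring the matrix form.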
213
00:36:19,120 --> 00:36:29,770
Now, we come to the covariance matrix. You have
to understand it now. Although things are
214
00:36:29,770 --> 00:36:39,680
very simple, the physical meaning of each
of the items must be understood, because
215
00:36:39,680 --> 00:36:46,430
later on we will keep talking about the covariance
matrix. We will not come back to the physical
216
00:36:46,430 --> 00:36:50,510
meaning of the covariance matrix again; we will
simply say covariance matrix, and you
217
00:36:50,510 --> 00:36:54,700
must be able to grasp what the covariance matrix is
immediately; only then will you be able to
218
00:36:54,700 --> 00:37:04,850
follow the discussion at that time. So we are
interested in understanding the covariance matrix.
219
00:37:04,850 --> 00:37:15,610
So, let us discuss it from the population point of
view first, that is, the population covariance
220
00:37:15,610 --> 00:37:30,940
matrix. Suppose x is a random variable
in the univariate case, and I ask you what
221
00:37:30,940 --> 00:37:44,690
the variance of x is. You will say that
it is the expected value of x minus mu, squared.
222
00:37:44,690 --> 00:37:58,540
And you have also seen, for the discrete case,
the sum over all x of x minus mu, or x i minus mu, squared,
223
00:37:58,540 --> 00:38:06,290
times f of x. We are not using the index i now; let it be
the general case, which is why I have
224
00:38:06,290 --> 00:38:18,100
written all x. And for the continuous case
you write the integral from minus infinity to plus infinity of x minus mu, squared,
225
00:38:18,100 --> 00:38:34,390
f of x d x. This is the variance
component, which is sigma square.
226
00:38:34,390 --> 00:38:50,680
Now, I will make a simple alteration: instead of
x I will write x j, which means I want to know
227
00:38:50,680 --> 00:38:57,270
the variance of x j. Then
you will write the expected value of x j minus
228
00:38:57,270 --> 00:39:04,100
mu j, squared; only the subscript j is added everywhere.
Then what will you write? You will write the sum
229
00:39:04,100 --> 00:39:19,170
over all x j of x j minus mu j, squared, times f of
x j; this is for the discrete case. And you write
230
00:39:19,170 --> 00:39:34,190
the integral from minus infinity to plus
infinity of x j minus mu j, squared, f of x j d x j;
231
00:39:34,190 --> 00:39:51,740
this is your continuous case.
Now, if this is the case, what does it
232
00:39:51,740 --> 00:40:00,950
mean? Suppose you put j equal
to 1, 2, up to p in this formulation, whether
233
00:40:00,950 --> 00:40:07,940
it is discrete or continuous. When you put j equal
to 1, what do you get? You get sigma
234
00:40:07,940 --> 00:40:16,160
1 square. When you put j equal to 2 you get
sigma 2 square. Going on like this you get
235
00:40:16,160 --> 00:40:30,270
sigma p square. So the variance components
of all the variables come from this equation.
236
00:40:30,270 --> 00:40:37,920
But we have seen that we have p
variables, and we are also assuming that
237
00:40:37,920 --> 00:40:46,320
these p variables are not independent
of each other.
238
00:40:46,320 --> 00:40:56,910
If x 1 is dependent on x 2 or x 1 and x 2
are not independent, what will happen? For
239
00:40:56,910 --> 00:41:04,870
example, height versus weight of people.
240
00:41:04,870 --> 00:41:13,110
Nonzero, yes; that means they are correlated.
Put another way,
241
00:41:13,110 --> 00:41:19,650
if someone's height is greater than another person's,
then naturally their weight is usually greater too.
242
00:41:19,650 --> 00:41:27,760
There are other factors also that
govern weight, but generally this is
243
00:41:27,760 --> 00:41:35,280
the case. So, when they are not independent,
when they are dependent, what will happen?
244
00:41:35,280 --> 00:41:45,160
It means that when x 1 varies, x 2 also varies.
This simultaneous variability is what
245
00:41:45,160 --> 00:41:56,170
we mean when we say x 1 and x 2 co-vary: they
vary simultaneously.
246
00:41:56,170 --> 00:42:07,790
Covariance means there is an association between
the realised values of x 1 and x 2.
247
00:42:07,790 --> 00:42:18,370
So, now I will write again the variance of x j,
which we have written as the expected value of x j minus
248
00:42:18,370 --> 00:42:28,470
mu j, squared. Now, if I
just do some manipulation, I can
249
00:42:28,470 --> 00:42:36,420
write a formulation
called the covariance between x j and x k:
250
00:42:36,420 --> 00:42:48,740
the expected value of x j minus mu j into x k
minus mu k. You see the similarity between
251
00:42:48,740 --> 00:42:55,560
these two: when I am talking about the variance of
x j, I have x j minus mu j squared, which
252
00:42:55,560 --> 00:43:04,560
I can write out as x j minus mu j into x
j minus mu j; that is why the square is coming.
253
00:43:04,560 --> 00:43:11,470
That means it is the same variable; it is as if
the variable is repeated to create two variables
254
00:43:11,470 --> 00:43:21,540
that are the same, x 1 with x 1, in that sense.
So, the covariance is this:
255
00:43:21,540 --> 00:43:29,470
as I am saying, if x j varies, x k may also vary,
and that is why I am taking the expectation,
256
00:43:29,470 --> 00:43:40,420
to see what the association between the two is.
So, in that case we can again write down
257
00:43:40,420 --> 00:43:55,630
the same kind of formula: the sum over all x j, x k of x j minus
mu j into x k minus mu k. What will we write here
258
00:43:55,630 --> 00:44:08,370
for the probability function? Can we write
f of x j and f of x k separately, or will we
259
00:44:08,370 --> 00:44:18,600
write the joint probability of x j and x k?
You have to write the joint probability here,
260
00:44:18,600 --> 00:44:44,200
and in the continuous case you have to write it like
this, as a double integral. Here I have
261
00:44:44,200 --> 00:44:49,800
written all x j, x k together under one summation
symbol; otherwise you have to write two sums,
262
00:44:49,800 --> 00:45:07,420
over all x j and over all x k. What is the notation for
this? The notation we will use is
263
00:45:07,420 --> 00:45:22,710
sigma j k. We use sigma j for the standard
deviation, sigma j j for the variance and sigma
264
00:45:22,710 --> 00:45:32,190
j k for the covariance. So what I am writing here
is that this is sigma j k.
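In symbols, the variance and covariance definitions spoken above can be written roughly as below. The symbols for the marginal and joint probability functions are notation assumed here; the lecture only says "f" and "joint probability".

```latex
% Variance of the j-th variable
\sigma_{jj} = \sigma_j^2 = E\left[(X_j - \mu_j)^2\right]
  = \sum_{\text{all } x_j} (x_j - \mu_j)^2 \, f_j(x_j)
  \quad \text{or} \quad
  \int_{-\infty}^{\infty} (x_j - \mu_j)^2 \, f_j(x_j) \, dx_j

% Covariance between the j-th and k-th variables (uses the joint function f_{jk})
\sigma_{jk} = E\left[(X_j - \mu_j)(X_k - \mu_k)\right]
  = \sum_{\text{all } x_j} \sum_{\text{all } x_k}
      (x_j - \mu_j)(x_k - \mu_k) \, f_{jk}(x_j, x_k)
  \quad \text{or} \quad
  \int_{-\infty}^{\infty} \int_{-\infty}^{\infty}
      (x_j - \mu_j)(x_k - \mu_k) \, f_{jk}(x_j, x_k) \, dx_j \, dx_k
```

Setting k equal to j in the covariance expression recovers the variance, which is the similarity the lecture points out.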
265
00:45:32,190 --> 00:45:43,600
Now, be careful about the notation:
sigma j square is equal to sigma j j. Later on
266
00:45:43,600 --> 00:45:51,820
we will be using sigma j j, for example sigma 1 1; that
is the variance component, so
267
00:45:51,820 --> 00:46:01,680
if I write sigma 1 1 it is sigma 1 square.
Then we will be using sigma j k. So
268
00:46:01,680 --> 00:46:21,510
sigma j j is your variance of x j and sigma j k
is the covariance between x j and x k. So, you
269
00:46:21,510 --> 00:46:31,750
have sigma j square, you have sigma j k and
you have p variables. Can we not
270
00:46:31,750 --> 00:46:40,090
find the population covariance matrix
now? We are now in a position to write down
271
00:46:40,090 --> 00:46:43,040
the population covariance matrix.
272
00:46:43,040 --> 00:46:52,810
As there are p variables so covariance stands
for every two variables. So, how many elements
273
00:46:52,810 --> 00:47:06,500
will be there. So let us write down like this,
we will create a matrix p cross p, p stands
274
00:47:06,500 --> 00:47:17,800
for the number of variables, so that one side
is x 1, x 2, up to x p and the other side is again x 1, x 2, up to x
275
00:47:17,800 --> 00:47:26,880
p. So, take x 1 and x 1: when
x 1 is varying with x 1, the same variable,
276
00:47:26,880 --> 00:47:36,190
the variability is the variance. So, this one will be
sigma 1 1, for the second case x 2 x 2 this
277
00:47:36,190 --> 00:47:42,260
will be sigma 2 2.
So, like this for p variable case x p x p
278
00:47:42,260 --> 00:47:50,890
sigma p p. All the elements
on the diagonal are variance components;
279
00:47:50,890 --> 00:48:01,060
that is what I am saying, this is basically the variance
component, the variance part of the variable. As
280
00:48:01,060 --> 00:48:07,350
I told you, sigma 1 1 is equal to sigma 1 square,
sigma 2 2 is equal to sigma 2 square, like
281
00:48:07,350 --> 00:48:14,000
this variance. Then the off diagonal elements
will be covariance, so what will be this?
282
00:48:14,000 --> 00:48:25,640
Sigma 1 2, sigma 1 p. Again I am writing 1
2 instead of 2 1, because
283
00:48:25,640 --> 00:48:35,600
sigma j k is equal to sigma k j:
the j th variable and the k th variable are the same two variables,
284
00:48:35,600 --> 00:48:42,990
only the order is changing.
Then sigma 2 p like this you will get sigma
285
00:48:42,990 --> 00:48:52,640
1 p sigma 2 p this. So, off diagonal elements
are covariance part and diagonal elements
286
00:48:52,640 --> 00:49:00,950
are the variance part. This resultant matrix
we will denote in our class by capital
287
00:49:00,950 --> 00:49:09,340
sigma. So, keep in mind capital sigma whenever
we will be using, this is your population
288
00:49:09,340 --> 00:49:27,490
covariance matrix. So, population covariance
matrix looks like this, the same thing
289
00:49:27,490 --> 00:49:37,630
as before: sigma 1 1, sigma 1 2, sigma 1 j, sigma
1 p, sigma 1 2, sigma 2 2, sigma 2 j, sigma
290
00:49:37,630 --> 00:49:47,210
2 p, like this. If there are p variables, there
will be p cross p elements;
291
00:49:47,210 --> 00:49:59,770
p cross p is the size of the matrix.
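For reference, the population covariance matrix described above can be laid out as follows. This is a sketch of the p cross p layout using the lecture's own symbols, with the symmetry sigma j k equal to sigma k j written as an explicit remark.

```latex
\Sigma =
\begin{bmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{12} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots      & \vdots      & \ddots & \vdots      \\
\sigma_{1p} & \sigma_{2p} & \cdots & \sigma_{pp}
\end{bmatrix}_{p \times p},
\qquad \sigma_{jj} = \sigma_j^2, \quad \sigma_{jk} = \sigma_{kj}
```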
292
00:49:59,770 --> 00:50:22,310
Now, we need to know the sample covariance
matrix. The population covariance matrix and the
293
00:50:22,310 --> 00:50:30,930
sample covariance matrix are very vital components
of multivariate statistics: the
294
00:50:30,930 --> 00:50:35,720
covariance matrix for the population and for the sample.
295
00:50:35,720 --> 00:50:49,430
Now come to the sample part, so sample case
we will say sample covariance matrix is S.
296
00:50:49,430 --> 00:51:02,820
We will be denoting sample covariance matrix
as S, this will also be p cross p matrix.
297
00:51:02,820 --> 00:51:20,760
So, my matrix elements I can write like this
s 1 1, s 1 2 like s 1 p again s 1 2, s 2 2,
298
00:51:20,760 --> 00:51:35,040
s 2 p like this s 1 p, s 2 p, dot dot dot
s p p. So, these diagonal elements these are
299
00:51:35,040 --> 00:51:50,560
the variance part and the off diagonal elements will be the covariance part,
300
00:51:50,960 --> 00:52:06,000
the variance and covariance parts. How do we calculate s 1 1, s 1 2 and all the other elements of this matrix? So, the general
301
00:52:06,000 --> 00:52:20,670
one is here, it will be s j j and somewhere
here maybe your s k j will be there or s j
302
00:52:20,670 --> 00:52:39,890
k. So, you can go by the same manner the way
you developed in the univariate case, the
303
00:52:39,890 --> 00:52:50,770
variance computation. So, for s j j what will you
do? We have seen 1 by n minus 1 times the sum
304
00:52:50,770 --> 00:53:01,000
over i equal to 1 to n; you have written
x i and then you have written
305
00:53:01,000 --> 00:53:10,490
minus the mean, in that sense. Here
it is basically x bar that we use, and now that j is
306
00:53:10,490 --> 00:53:17,240
coming into consideration we will write
x j bar. We could use mu, but here mu is
307
00:53:17,240 --> 00:53:21,970
not available; when you go
for the sample case, we will always
308
00:53:21,970 --> 00:53:29,060
subtract the sample average, and that is why
we divide by n minus 1. If I used mu here
309
00:53:29,060 --> 00:53:44,080
it would be 1 by n. Then this is squared, so
I can write this one like this: 1 by
310
00:53:44,080 --> 00:53:54,000
n minus 1 times the sum over i equal to 1 to n of x
i j minus x j bar, times
311
00:53:54,000 --> 00:54:05,150
x i j minus x j bar again, the same thing.
So, using this I want to write s j k, s j
312
00:54:05,150 --> 00:54:15,190
k is 1 by n minus 1 times the sum over i equal to
1 to n; the first factor I will keep for the j th variable as
313
00:54:15,190 --> 00:54:41,410
it is, and in the second factor you introduce k.
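The formula for s j k can be tried out directly. The following is a minimal sketch in plain Python; the function name sample_cov and the three data values are made up for illustration and are not the data given in class.

```python
def sample_cov(xj, xk):
    """Sample covariance s_jk = 1/(n-1) * sum_i (x_ij - xbar_j)(x_ik - xbar_k).

    Passing the same variable twice gives the sample variance s_jj.
    """
    n = len(xj)
    if n != len(xk) or n < 2:
        raise ValueError("need two equally long samples with at least 2 observations")
    xj_bar = sum(xj) / n  # sample average of the j-th variable
    xk_bar = sum(xk) / n  # sample average of the k-th variable
    total = sum((a - xj_bar) * (b - xk_bar) for a, b in zip(xj, xk))
    return total / (n - 1)  # divide by n - 1 because the sample average is used


# Hypothetical observations on two variables (illustrative values only)
x1 = [2.0, 4.0, 6.0]
x2 = [4.0, 8.0, 9.0]

s11 = sample_cov(x1, x1)  # variance of x1: 4.0
s12 = sample_cov(x1, x2)  # covariance of x1 and x2: 5.0
print(s11, s12)
```

Swapping the arguments gives the same value, which is the symmetry s j k equal to s k j.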
314
00:54:41,410 --> 00:54:55,270
What is happening here? Actually if you see
in the covariance case or the variance case,
315
00:54:55,270 --> 00:55:10,000
the original data matrix is transformed; that
is the concept we will capture here.
316
00:55:10,000 --> 00:55:18,340
You see x i j minus x j bar that means for
the j th variable every element is subtracted
317
00:55:18,340 --> 00:55:26,280
by its average, for the k th variable also
every element is subtracted by its average.
318
00:55:26,280 --> 00:55:38,710
If this is the case, can I not write
down the data matrix in a format like this?
319
00:55:38,710 --> 00:55:56,100
That my original data matrix is X which is
x 1 1, x 2 1, x n 1, x 1 2, x 2 2, x n 2 then
320
00:55:56,100 --> 00:56:19,530
x 1 j, x 2 j, x i j then x n j, x 1 p, x 2
p then x n p. So, you have computed here this
321
00:56:19,530 --> 00:56:32,190
is x 1 bar, x 2 bar, x j bar, x p bar. Then
you are converting
322
00:56:32,190 --> 00:56:38,420
this; some conversion is taking place
here, that is, subtraction of the mean. Then what
323
00:56:38,420 --> 00:56:54,680
are you getting here? If I subtract the mean,
every observation has
324
00:56:54,680 --> 00:57:32,020
its corresponding mean value subtracted from it.
So, after this it is basically a
325
00:57:32,020 --> 00:57:44,720
subtraction of the corresponding mean.
So, instead of writing this minus this,
326
00:57:44,720 --> 00:58:06,520
suppose I write
x star like this: x star 1 1, x star 2 1, and so on up to x star
327
00:58:06,520 --> 00:58:16,530
n 1. In the same manner I am writing x star 1 p,
x star 2 p, up to x star n p; somewhere
328
00:58:16,530 --> 00:58:39,200
there will be x star i j, the general one, where
x star i j is x i j minus x j bar. That means
329
00:58:39,200 --> 00:58:50,210
this matrix and this matrix are the same. Now, if I use
this formula, what will happen? In this
330
00:58:50,210 --> 00:58:59,820
case x i j minus x j bar becomes x star i j,
and the other one will be x star i k.
331
00:58:59,820 --> 00:59:10,260
So, the elements of the resultant matrix will then be 1 by
n minus 1 times the sum over i equal to 1 to n of x star
332
00:59:10,260 --> 00:59:27,390
i j times x star i k. So, this type of conversion
will take place, and ultimately there is a little more
333
00:59:27,390 --> 00:59:39,770
mathematics, which we will see later. I think
up to this point you can calculate this using this
334
00:59:39,770 --> 00:59:48,670
formulation. Can you calculate it? You take the
first data points, the same data points,
335
00:59:48,670 --> 01:00:00,180
that is, the first 3 variable values I think I have
given you. Suppose these are my data points;
336
01:00:00,180 --> 01:00:05,850
you have already calculated the mean values. You
now have to calculate the variance and covariance
337
01:00:05,850 --> 01:00:13,650
parts; because there are only 2 variables, only 1 covariance
will be there. Then next class we will go
338
01:00:13,650 --> 01:00:20,750
for the matrix form: how, using the matrix multiplication
formula, we will be able to calculate the whole covariance
339
01:00:20,750 --> 01:00:24,280
matrix, then the correlation matrix,
and all those things.
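The conversion from X to x star and the element formula 1 by n minus 1 times the sum of x star i j times x star i k can be sketched as below. This is only an illustration in plain Python: the helper names centre_columns and covariance_from_centred and the small data matrix are assumptions, not the class exercise data, and the matrix multiplication form is left for the next class.

```python
def centre_columns(X):
    """Subtract each variable's sample mean from its column: x*_ij = x_ij - xbar_j."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    X_star = [[row[j] - means[j] for j in range(p)] for row in X]
    return means, X_star


def covariance_from_centred(X_star):
    """Build the p x p matrix with elements s_jk = 1/(n-1) * sum_i x*_ij * x*_ik."""
    n, p = len(X_star), len(X_star[0])
    return [
        [sum(row[j] * row[k] for row in X_star) / (n - 1) for k in range(p)]
        for j in range(p)
    ]


# Hypothetical data matrix: n = 3 observations (rows) on p = 2 variables (columns)
X = [
    [2.0, 4.0],
    [4.0, 8.0],
    [6.0, 9.0],
]

means, X_star = centre_columns(X)    # means: [4.0, 7.0]
S = covariance_from_centred(X_star)  # S: [[4.0, 5.0], [5.0, 7.0]]
print(means)
print(S)
```

The diagonal of S holds the two variances and the single off diagonal value, repeated by symmetry, is the one covariance.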
340
01:00:24,280 --> 01:00:26,120
Thank you.