1
00:00:18,660 --> 00:00:26,740
Welcome back to this session on Network Analysis.
We will continue with some of the ideas that
2
00:00:26,740 --> 00:00:33,820
we have discussed in the last lecture. To
start off, I would just briefly recap the
3
00:00:33,820 --> 00:00:37,800
concept of eigenvector centrality that we
developed in the last lecture.
4
00:00:38,190 --> 00:00:45,120
So if you look at the slides, I have written
exactly what we did last day on the slides.
5
00:00:45,120 --> 00:00:59,821
The basic thing that we saw was that, if x(t)
is the popularity value at some time t, then
6
00:00:59,821 --> 00:01:12,289
we can express x(t) as A^t times x(0);
this is what we have already seen. This
7
00:01:12,289 --> 00:01:23,030
we call equation 1, say for simplicity. Now
what we have further said is that let us express
8
00:01:23,030 --> 00:01:47,690
x 0 as a linear combination of some eigenvectors
of the matrix A, the adjacency matrix A. Then
9
00:01:47,690 --> 00:02:02,020
one can write x(0) as nothing but a
linear combination of the eigenvectors
10
00:02:02,020 --> 00:02:20,840
of the adjacency matrix, x(0) = sum_i c_i v_i, where v_i
is an eigenvector of A; this is what we wrote last
11
00:02:20,840 --> 00:02:29,170
day already. Further from the eigenvector
equation we all know that A v should be equal
12
00:02:29,170 --> 00:02:47,550
to lambda v, where lambda is the eigenvalue
corresponding to the eigenvector v.
13
00:02:48,420 --> 00:02:59,630
So, now if we multiply both sides of this
equation by A once again, we have A into A
14
00:02:59,630 --> 00:03:11,540
into v is equal to A into lambda into v, that
should be equal to lambda into A into v once
15
00:03:11,540 --> 00:03:19,300
again that is nothing but lambda into lambda
into v, that is, lambda^2 v. So, continuing
16
00:03:19,300 --> 00:03:30,450
like this t times, this would imply that A^t v
should be equal to lambda^t v. So this is
17
00:03:30,450 --> 00:03:43,270
what we call equation 3, and let us say
that we call this equation 2: the equation
18
00:03:43,270 --> 00:03:47,610
x(0) = sum_i c_i v_i is equation
2.
19
00:03:47,930 --> 00:04:10,660
So from 1, 2 and 3 we can write the following
expression: x(t) should be nothing but equal
20
00:04:10,660 --> 00:04:25,280
to sum_i lambda_i^t c_i v_i. Now we will do a small
trick out here: we will divide
21
00:04:25,280 --> 00:04:48,570
both sides by lambda_1^t, where lambda_1 is
the principal eigenvalue and the corresponding
22
00:04:48,570 --> 00:05:11,130
principal eigenvector is, say, v_1. Then you can
write x(t) / lambda_1^t as nothing but sum_i (lambda_i / lambda_1)^t
23
00:05:11,130 --> 00:05:17,710
c_i v_i. Now suppose we have gone through sufficiently
many steps.
24
00:05:17,710 --> 00:05:24,210
So, we are talking in terms of the thermodynamic
limit, that is, when t is very, very large.
25
00:05:24,210 --> 00:05:32,570
That is the point after which there is no significant
change in the popularity, so at that point
26
00:05:32,570 --> 00:05:51,920
we take the limit as t goes to
a very large number; we can put this limit
27
00:05:51,920 --> 00:06:08,110
on both sides. So, what does this tell you? Now,
for i equal to 1, the prefactor here
28
00:06:08,110 --> 00:06:18,040
becomes 1, so you have c_1 v_1 plus terms
which have a fractional prefactor raised to
29
00:06:18,040 --> 00:06:23,580
an infinite power. All those fractional
prefactors that are raised to an infinite
30
00:06:23,580 --> 00:06:40,690
power go to 0; only c_1 v_1 remains. So, this
tells you that the popularity converges to
31
00:06:40,690 --> 00:06:54,480
the principal eigenvector of the adjacency
matrix A.
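As a rough illustration of this convergence argument, here is a minimal Python sketch. It assumes NumPy is available and uses a small made-up undirected graph (a triangle with one extra node attached); neither the graph nor the function name `power_iteration` comes from the lecture. The vector is rescaled at every step, which plays the same role as dividing by lambda_1^t: it keeps the numbers bounded without changing the direction.

```python
import numpy as np

def power_iteration(A, steps=200):
    """Repeatedly apply x <- A x, rescaling each step so the values stay bounded."""
    x = np.ones(A.shape[0])          # x(0): equal popularity for every node
    for _ in range(steps):
        x = A @ x                    # one round of "borrowing" popularity from neighbours
        x = x / np.linalg.norm(x)    # rescaling changes the length, not the direction
    return x

# Hypothetical undirected graph: triangle 0-1-2, plus node 3 attached to node 2
A_undirected = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

x_limit = power_iteration(A_undirected)

# Reference answer: principal eigenvector from a symmetric eigen-solver
eigenvalues, eigenvectors = np.linalg.eigh(A_undirected)
v1 = eigenvectors[:, np.argmax(eigenvalues)]
v1 = v1 * np.sign(v1.sum())          # an eigenvector is defined up to sign; make it positive

print(np.round(x_limit, 4))
print(np.round(v1, 4))
```

The two printed vectors should agree, and node 2, which has the most neighbours in this toy graph, should come out with the largest centrality.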
32
00:06:54,480 --> 00:07:03,870
So, that more or less completes the idea of
eigenvector centrality. Basically, in a nutshell
33
00:07:03,870 --> 00:07:09,389
what you need to remember is that if you have
to compute the eigenvector centrality of a
34
00:07:09,389 --> 00:07:09,889
network.
35
00:07:10,389 --> 00:07:24,000
Then, for that network, you take the adjacency
matrix, say A, and you compute the principal
36
00:07:24,000 --> 00:07:46,889
eigenvector of A, say v_1, whose entries give
the eigenvector centrality of the nodes.
37
00:07:46,889 --> 00:08:00,880
So, all
this time we were discussing mostly in the
38
00:08:00,880 --> 00:08:08,840
context of undirected networks, where edges
are not directed. This entire exercise that
39
00:08:08,840 --> 00:08:16,100
we did for the computation of eigenvector centrality
is for an undirected graph. The adjacency
40
00:08:16,100 --> 00:08:25,850
matrix A is assumed to be symmetric.
So, all this explanation that I have given
41
00:08:25,850 --> 00:08:31,419
you so far is under the assumption that the
adjacency matrix that we are considering is
42
00:08:31,419 --> 00:08:35,240
symmetric, that the graph is an undirected
graph.
43
00:08:35,240 --> 00:08:43,039
Now, the immediate next question, if you look
at this slide, is how to adapt this definition
44
00:08:43,039 --> 00:08:45,459
to the context of a directed network.
45
00:08:46,750 --> 00:08:53,920
As soon as you try to do that there are certain
problems that crop up and one of the important
46
00:08:53,920 --> 00:09:00,350
problems is what I have shown in the figure
in the small network that I have described
47
00:09:00,350 --> 00:09:08,069
here. So what happens is, think of the node
A. Now this node does not have any in-degree;
48
00:09:08,069 --> 00:09:15,550
its in-degree is 0, so that means the centrality
value that this node will have is 0. Then
49
00:09:15,550 --> 00:09:20,249
consider the node B; this has an in-degree of only
one, from A.
50
00:09:20,800 --> 00:09:27,600
This node will also have a centrality value
of 0, because A has a centrality value of
51
00:09:27,600 --> 00:09:34,829
0, which is all that B can borrow, so B's
centrality is also 0, and in this way it continues
52
00:09:34,829 --> 00:09:43,519
and propagates over the network. So,
for the entire acyclic part of the network,
53
00:09:43,519 --> 00:09:51,790
all the nodes that belong to that acyclic
part have a 0 centrality. This
54
00:09:51,790 --> 00:09:58,800
is a very big problem, when you try to translate
the concept of eigenvector centralities for
55
00:09:58,800 --> 00:10:05,380
directed networks.
So, I repeat the problem is very simple. You
56
00:10:05,380 --> 00:10:11,499
have this node A here which does not have any
in-degree, so that means, since it does not
57
00:10:11,499 --> 00:10:15,699
have any in-degree, its centrality is going
to be 0. So if its centrality is going to
58
00:10:15,699 --> 00:10:21,170
be 0, then the centrality of B, which is
completely dependent on A, is also going to
59
00:10:21,170 --> 00:10:29,569
be 0. And in this way, for any node whose centrality
depends only on B, the
60
00:10:29,569 --> 00:10:34,040
centrality of that node is also going
to be 0, and this will continue until
61
00:10:34,040 --> 00:10:39,279
there is some cycle in the network.
This is a problem when you try to convert
62
00:10:39,279 --> 00:10:45,019
the definition of eigenvector centrality for
directed networks. So, what could be a solution?
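To see this problem concretely, here is a small Python sketch, assuming NumPy. The graph is hypothetical and is not the one on the slide: a chain A -> B -> C, with C and D pointing to each other so that there is one cycle. The matrix convention is also an assumption: entry [i, j] is 1 when there is an edge from j to i, so that multiplying by the matrix adds up the popularity of the nodes pointing to i.

```python
import numpy as np

# Hypothetical directed graph: A -> B -> C, plus a cycle C <-> D.
# Assumed convention: M[i, j] = 1 if there is an edge from node j to node i.
names = ["A", "B", "C", "D"]
M = np.zeros((4, 4))
M[1, 0] = 1  # A -> B
M[2, 1] = 1  # B -> C
M[3, 2] = 1  # C -> D
M[2, 3] = 1  # D -> C

x_dir = np.ones(4)        # start with equal popularity everywhere
for _ in range(10):
    x_dir = M @ x_dir     # each node collects the popularity of the nodes pointing to it

for name, value in zip(names, x_dir):
    print(name, value)
# A 0.0
# B 0.0
# C 2.0
# D 2.0
```

Node A has no in-degree, so it drops to 0 after one step; B, which depends only on A, follows one step later. Only C and D, which sit on the cycle, keep a non-zero value.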
63
00:10:45,019 --> 00:11:01,509
So, what we try to do is we try to sprinkle
some initial popularity to each individual
64
00:11:01,509 --> 00:11:15,509
node. Since this is a problem, we now sprinkle
a very small, equal centrality on all the nodes.
65
00:11:15,509 --> 00:11:21,989
We do not discriminate between nodes, so we
sprinkle a very, very small centrality
66
00:11:21,989 --> 00:11:28,929
value on all the nodes equally. Now we
start from that configuration. Then in that
67
00:11:28,929 --> 00:11:33,649
configuration you can immediately see that
A’s centrality will no longer be equal to
68
00:11:33,649 --> 00:11:42,589
0. Therefore, if you do such a thing then
you have to readjust the formula that we introduced
69
00:11:42,589 --> 00:11:53,411
last time.
So now again we can write x(t), or, say, for
70
00:11:53,411 --> 00:12:13,879
a particular node i, x_i = alpha sum_j A_ij x_j + beta; now we are
rebalancing things. So here in this formula
71
00:12:13,879 --> 00:12:19,190
what we have brought in it looks very similar
to the previous formula that we have introduced
72
00:12:19,190 --> 00:12:25,049
for the eigenvector centralities, but then
we have brought in two important changes if
73
00:12:25,049 --> 00:12:33,100
you look carefully; one is this parameter
beta here and the other is this parameter
74
00:12:33,100 --> 00:12:44,839
alpha here. So, both beta and alpha are constants;
this is the first premise. The second thing
75
00:12:44,839 --> 00:13:07,850
is beta is the small initial value of centrality
that we give to each node. So, beta is the
76
00:13:07,850 --> 00:13:14,480
initial small centrality that we
give to each node equally, and alpha
77
00:13:14,480 --> 00:13:30,660
is the readjustment constant
so that the entire formula remains balanced.
78
00:13:31,060 --> 00:13:36,799
Since you are introducing this beta component
into the formula we have to accordingly rescale
79
00:13:36,799 --> 00:13:44,339
the other part of the popularity, so one popularity
is inherent that is given to me that is sprinkled
80
00:13:44,339 --> 00:13:48,360
on all of us and there is another popularity
that is coming from the neighborhood. So,
81
00:13:48,360 --> 00:13:55,049
these two have to be rebalanced in order
to keep the formula equivalent to the previous
82
00:13:55,049 --> 00:14:08,449
case. So, now, given this, we can immediately
write the notation, instead of in terms of x_i's,
83
00:14:08,449 --> 00:14:10,589
in terms of vectors and matrices.
84
00:14:11,489 --> 00:14:18,929
So now, let us say x is the popularity vector,
so x_1 is the popularity
85
00:14:18,929 --> 00:14:36,309
for node 1, x 2 is the popularity for node
2 and so on and so forth. This is my vector
86
00:14:36,309 --> 00:14:44,699
of popularities. From here, given the
formula that we have written just now, we can
87
00:14:44,699 --> 00:14:50,649
immediately write the following expression,
as I have already written on the slides:
88
00:14:50,649 --> 00:15:09,869
x = alpha A x + beta 1, where 1,
the darkened 1 that I am writing, the bold
89
00:15:09,869 --> 00:15:25,980
1, is the vector of all ones. Therefore, we
express x as alpha times the vector
90
00:15:25,980 --> 00:15:36,149
A x, plus the beta term, which is
nothing but the initial popularity that you
91
00:15:36,149 --> 00:15:40,059
have sprinkled on each and every individual
node.
92
00:15:40,059 --> 00:15:49,699
Now, if we rearrange this and
express x as the subject of the formula, so expressing
93
00:15:49,699 --> 00:16:10,739
x as the subject of the formula, we can write
x = beta (I - alpha A)^{-1}
94
00:16:10,739 --> 00:16:18,379
1, where I is the identity matrix and 1 is the vector of all ones.
Let us denote the vectors by the vector sign;
95
00:16:18,379 --> 00:16:32,209
it would be easier for us to note. Now see,
from that original formula we have come up
96
00:16:32,209 --> 00:16:40,561
with an expression to find the exact value
of the popularity. Given the constants alpha and
97
00:16:40,561 --> 00:16:47,749
beta, that is, the initial popularity beta and the
readjustment constant alpha, given these
98
00:16:47,749 --> 00:16:55,189
two and the adjacency matrix of the network,
you can immediately compute the popularity
99
00:16:55,189 --> 00:17:01,579
value x.
But then, just a point to note here: computing
100
00:17:01,579 --> 00:17:21,470
the inverse of the matrix I minus alpha A,
where I is the identity matrix, is difficult. Thus it is better
101
00:17:21,470 --> 00:17:42,690
to compute using the iterative equation, which
in this case is again very simple; you can
102
00:17:42,690 --> 00:17:54,430
write x(t) = alpha A
x(t-1) + beta 1. And then there will
103
00:17:54,430 --> 00:18:00,300
be a point after which x t will change no
longer and you will have an expression for
104
00:18:00,300 --> 00:18:08,480
the final value of x(t). All of these ideas
are borrowed from the first definition of
105
00:18:08,480 --> 00:18:13,350
eigenvector centrality; we are gradually extending
that same idea, we are building upon that
106
00:18:13,350 --> 00:18:22,860
same idea and developing better and better
metrics for application in different network
107
00:18:22,860 --> 00:18:30,160
settings. For instance, this particular development
is in the setting of a directed network.
108
00:18:30,160 --> 00:18:46,570
And this particular centrality is named
after its inventor: Katz centrality.
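The two ways of computing Katz centrality described above, the closed form and the iterative equation, can be compared with a short Python sketch, assuming NumPy. The graph, the values alpha = 0.5 and beta = 1, and the function names are illustrative assumptions, not taken from the slides. One extra point that the lecture does not state: the iteration only settles down if alpha is smaller than 1 divided by the largest eigenvalue of A, which holds for this toy graph.

```python
import numpy as np

def katz_closed_form(A, alpha, beta):
    """x = beta (I - alpha A)^{-1} 1, computed by solving a linear system."""
    n = A.shape[0]
    # Solving (I - alpha A) x = beta 1 avoids forming the inverse explicitly
    return np.linalg.solve(np.eye(n) - alpha * A, beta * np.ones(n))

def katz_iterative(A, alpha, beta, tol=1e-12, max_steps=10_000):
    """Iterate x(t) = alpha A x(t-1) + beta 1 until x stops changing."""
    x = np.zeros(A.shape[0])
    for _ in range(max_steps):
        x_next = alpha * (A @ x) + beta      # adding the scalar beta adds it to every node
        if np.max(np.abs(x_next - x)) < tol:
            return x_next
        x = x_next
    return x

# Hypothetical directed graph: A -> B -> C, plus a cycle C <-> D.
# Assumed convention: A_katz[i, j] = 1 if there is an edge from node j to node i.
A_katz = np.zeros((4, 4))
A_katz[1, 0] = 1  # A -> B
A_katz[2, 1] = 1  # B -> C
A_katz[3, 2] = 1  # C -> D
A_katz[2, 3] = 1  # D -> C

x_closed = katz_closed_form(A_katz, alpha=0.5, beta=1.0)
x_iter = katz_iterative(A_katz, alpha=0.5, beta=1.0)
print(x_closed)   # approximately [1.0, 1.5, 3.0, 2.5]
```

Node A, which had centrality 0 before, now keeps the sprinkled value beta = 1, and B picks up half of that in addition to its own beta.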
109
00:18:46,570 --> 00:18:54,370
So now, since we are talking about eigenvector
centrality, Katz centrality, directed graphs,
110
00:18:54,370 --> 00:19:01,250
there is one thing that is indispensable
and needs to be discussed in this context,
111
00:19:01,250 --> 00:19:08,180
and that is the World Wide Web graph,
and, in the context of the World Wide Web graph,
112
00:19:08,180 --> 00:19:14,560
how the web pages are ranked. So, all these
initial settings that we have been developing
113
00:19:14,560 --> 00:19:21,200
so far in the last lecture and today’s lecture
are in the direction of building the idea of
114
00:19:21,200 --> 00:19:24,260
how to rank the web pages in a World Wide
Web graph.
115
00:19:24,810 --> 00:19:34,500
Basically, in a World Wide Web graph you will
have nodes which are pages, say page P 1,
116
00:19:34,500 --> 00:19:45,310
page P 2, page P 3, and so on. So in this
way you have a World Wide Web graph, and if
117
00:19:45,310 --> 00:19:51,400
there is a hyperlink from page P 1 you draw
a directed edge like this. There could be
118
00:19:51,400 --> 00:19:58,271
another hyperlink, say from page P 2 to P
1 you draw a directed edge like this. This
119
00:19:58,271 --> 00:20:02,280
is a typical representation of a World Wide
Web graph.
120
00:20:02,280 --> 00:20:09,530
Now, all of us have been
using the Google search engine almost every day,
121
00:20:09,530 --> 00:20:16,850
multiple times a day, and probably many
a time a question has come to mind:
122
00:20:16,850 --> 00:20:23,560
when you search using a query term, Google
returns a bunch of web pages. Now,
123
00:20:23,560 --> 00:20:30,560
these web pages are ranked in some
way, and this ranking is done in terms
124
00:20:30,560 --> 00:20:38,290
of the importance of a particular page. And
this importance is measured using some formula
125
00:20:38,290 --> 00:20:44,380
which is very similar to what we have just
now discussed, which is very similar to Katz
126
00:20:44,380 --> 00:20:49,860
centrality. And this formula was developed
in the framework of the very famous algorithm
127
00:20:49,860 --> 00:21:03,500
called PageRank.
PageRank basically tries to determine the
128
00:21:03,500 --> 00:21:32,570
rank of a web page being displayed in response
to a query
129
00:21:32,570 --> 00:21:47,140
on the search engine. This algorithm
tries to assign an importance to each individual
130
00:21:47,140 --> 00:21:55,600
page so that whenever there is a search query
the pages that are returned can be
131
00:21:55,600 --> 00:22:05,110
sorted in terms of their importance. The basic
idea is again the same as we have discussed for
132
00:22:05,110 --> 00:22:07,490
the case of the eigenvector centrality.
133
00:22:07,740 --> 00:22:14,550
If you look at the slide, I give a very
simple example of a small snapshot of the
134
00:22:14,550 --> 00:22:21,170
World Wide Web. So, what you see here are
a few nodes marked A, B, C, D, etcetera,
135
00:22:21,170 --> 00:22:31,840
and the size of a node expresses
the extent of popularity of that particular
136
00:22:31,840 --> 00:22:41,950
node. You see that the node C has an in-degree of
only 1, but then you see that the popularity
137
00:22:41,950 --> 00:22:51,740
of node C is quite high; this is because node
C is pointed to by node B, which itself
138
00:22:51,740 --> 00:22:58,100
is very, very popular. This is the basic idea
of the eigenvector centrality that we have
139
00:22:58,100 --> 00:23:06,190
already discussed. Since node C is being pointed to
by the already highly popular node B,
140
00:23:06,190 --> 00:23:12,950
node C’s popularity automatically increases.
Whereas, take for example the case of
141
00:23:12,950 --> 00:23:20,480
node E. The node E here has a high in-degree,
but the nodes that point to node E
142
00:23:20,480 --> 00:23:27,720
are not themselves very popular. So that is
why node E is not very popular. Even
143
00:23:27,720 --> 00:23:34,370
if it has a very high in-degree, it is not
very, very popular. Whereas, in contrast,
144
00:23:34,370 --> 00:23:41,230
node C, which has an in-degree of just 1, is much
more popular by virtue of having one very
145
00:23:41,230 --> 00:23:46,371
popular node pointing to it. This is
the idea that we have already discussed; I
146
00:23:46,371 --> 00:23:50,251
have just illustrated the same using this
example on the slide.
147
00:23:51,330 --> 00:23:59,721
So then, the question is how PageRank is
defined. It is again very, very simple. Even
148
00:23:59,721 --> 00:24:05,490
before we go to that definition, the quantitative
definition, one question that can immediately
149
00:24:05,490 --> 00:24:11,010
come to our mind is how to make our own web
page. Suppose you have (Refer Time: 24:11)
150
00:24:11,010 --> 00:24:19,150
a webpage; what could be a criterion to
make your webpage important or popular? This
151
00:24:19,150 --> 00:24:28,140
is a very important question, and various companies
have been working to help the promotion
152
00:24:28,140 --> 00:24:35,260
of certain web pages.
And the crux is the following: if very highly
153
00:24:35,260 --> 00:24:42,840
popular websites, say for instance Google,
Yahoo, MSN, etcetera, are pointing towards
154
00:24:42,840 --> 00:24:48,980
your website, if there is a hyperlink from
all these websites to your website (of course
155
00:24:48,980 --> 00:24:57,740
this is a dream for every person),
then immediately your page rank, your importance,
156
00:24:57,740 --> 00:25:06,400
rises. And whatever search term
is fed to the engine, there is a high probability
157
00:25:06,400 --> 00:25:11,910
that your page is retrieved and shown to
the user.
158
00:25:11,910 --> 00:25:14,010
So, this is the basic idea.
159
00:25:14,260 --> 00:25:25,100
So, now with this basic idea we will move on to finally
defining a quantitative measure for PageRank.
160
00:25:34,550 --> 00:25:46,890
Again, you have, say, the popularity of a particular
node i being expressed by x_i. Then you can
161
00:25:46,890 --> 00:25:55,930
write x_i is equal to alpha, the same
type of expression that we have already written,
162
00:25:55,930 --> 00:26:09,300
times sum_j A_ij x_j / k_j^out, plus beta; there is only one
small difference from the Katz centrality
163
00:26:09,300 --> 00:26:20,180
in that what you do is you now divide the
popularity x_j of the node j by the total out-degree
164
00:26:20,180 --> 00:26:27,170
of j.
Suppose there is this node j; it has an out-degree
165
00:26:27,170 --> 00:26:38,560
of k_j^out, and say the popularity
of the node j is x_j. Then, among the
166
00:26:38,560 --> 00:26:48,310
individual nodes that j is pointing to, one
of them is the node i with popularity
167
00:26:48,310 --> 00:27:00,730
x_i. So each of them is receiving a fraction
x_j / k_j^out from the node j. So, this fraction
168
00:27:00,730 --> 00:27:08,410
of popularity from node j is given
to node i. So you basically appropriately
169
00:27:08,410 --> 00:27:17,480
normalize the popularity x_j and distribute
it among all its neighbors. That is the only
170
00:27:17,480 --> 00:27:21,930
difference from the Katz centrality; Katz
centrality did not have this normalization.
171
00:27:21,930 --> 00:27:32,510
Now, one problem that immediately comes up
is that if k_j^out is 0, that is, the out-degree
172
00:27:32,510 --> 00:27:47,180
of j is 0, this would immediately
mean that j will not be able to contribute
173
00:27:47,180 --> 00:27:58,830
to the centrality of other nodes. This will
immediately say that node j will not be able
174
00:27:58,830 --> 00:28:06,690
to contribute to the centralities of any other
node. In such cases we assume that k_j^out
175
00:28:06,690 --> 00:28:15,030
is set to 1 in order to make our computations
easy, in all such cases where k_j^out is 0.
176
00:28:16,580 --> 00:28:25,780
So, x_j will be the popularity of j, and this
component will go to 0, because j
177
00:28:25,780 --> 00:28:34,580
will contribute 0 popularity to all other
nodes. We set k_j^out equal to 1 so
178
00:28:34,580 --> 00:28:42,940
that we avoid division by 0.
In that particular situation we can write
179
00:28:42,940 --> 00:28:55,320
again the vector form of the equation as
x = alpha A D^{-1} x + beta 1; since this k_j^out is here we
180
00:28:55,320 --> 00:29:06,390
have this matrix D inverse,
and x and 1 are vectors. So now we
181
00:29:06,390 --> 00:29:20,080
can again express the vector x as the subject
of the formula, and this will give us x = beta (I - alpha A D^{-1})^{-1} 1,
182
00:29:20,080 --> 00:29:42,200
where D is the diagonal matrix
containing all the out-degree values. So
183
00:29:42,200 --> 00:29:48,860
D is the diagonal matrix containing all the
out-degree values, and D_ii is nothing but
184
00:29:48,860 --> 00:29:54,780
the max of k_i^out and 1. If k_i^out is 0 then
it is set to 1.
185
00:29:56,160 --> 00:29:58,060
Thank you very much.