1
00:00:20,330 --> 00:00:27,950
Welcome back to this session on Network Analysis.
We will continue with the ideas of network
2
00:00:27,950 --> 00:00:35,300
analysis that we have been talking about for the last
few days. Today, I will introduce two important
3
00:00:35,300 --> 00:00:42,460
metrics that are very useful in the context
of directed networks, and especially in the
4
00:00:42,460 --> 00:00:56,870
context of citation networks. The first one
among these is called the Co-citation Index.
5
00:00:56,870 --> 00:01:03,260
The idea is very simple: when you look
at an author-author citation network, the type
6
00:01:03,260 --> 00:01:07,700
of network that we have already discussed
in the context of degree distribution studies.
7
00:01:08,360 --> 00:01:13,940
If you look at an author-author citation network,
what you observe is that there are times
8
00:01:13,940 --> 00:01:20,900
when two authors who are not connected by
a citation relation among themselves have
9
00:01:20,900 --> 00:01:28,940
some relationship in an indirect manner and
that is what we want to study here and that
10
00:01:28,940 --> 00:01:39,420
is what we quantify through this metric.
For instance, consider this small hypothetical
11
00:01:39,420 --> 00:02:11,870
network here. From this network you immediately
see that author 1 is cited by author 3. And
12
00:02:11,870 --> 00:02:28,900
similarly, author 2 is cited by author 3; so
mark this phrase, "cited by". Basically, there
13
00:02:28,900 --> 00:02:41,090
is this author 3 who actually co-cites
both author 1 and author 2. You get even
14
00:02:41,090 --> 00:02:49,010
stronger evidence of such co-citation behavior
when you look at author 4. We find that author
15
00:02:49,010 --> 00:03:11,350
1 is also cited by author 4, and author 2 is
also cited by author 4.
16
00:03:11,720 --> 00:03:20,140
This gives us additional evidence that although
author 1 and author 2 are not connected by
17
00:03:20,140 --> 00:03:28,910
any citation relation they perhaps have a
relationship between themselves. And that
18
00:03:28,910 --> 00:03:35,530
relationship could be many things; it could be
that both author 1 and author
19
00:03:35,530 --> 00:03:44,849
2 work in the same field. For instance,
you can imagine that probably both author
20
00:03:44,849 --> 00:03:55,680
1 and author 2 work in AI or machine learning.
Also it might be the case that author 1 and
21
00:03:55,680 --> 00:04:11,230
author 2 work on a similar topic or method.
For example, say SVM in the area of machine
22
00:04:11,230 --> 00:04:18,410
learning. So, such hidden relationships between
pairs of authors or a group of authors might
23
00:04:18,410 --> 00:04:28,590
exist in an author-author citation network.
And unless you design an appropriate construction,
24
00:04:28,590 --> 00:04:38,279
it is not possible to tease out these relationships.
And such constructions are not very difficult
25
00:04:38,279 --> 00:04:46,019
to formulate. For instance, let us consider
that this author-author citation network is named A.
26
00:04:46,740 --> 00:05:06,540
So let the matrix for this author-author
citation network be denoted by A.
27
00:05:07,200 --> 00:05:09,760
Then the co-citation network
28
00:05:14,059 --> 00:05:22,300
is nothing but simply the product of the two
matrices A transpose A. If you construct the
29
00:05:22,300 --> 00:05:30,719
A transpose A of this graph, then you get a
new graph which would look like this: there will
30
00:05:30,719 --> 00:05:47,119
be only two nodes 1 and 2, and there will
be an edge between them indicating
31
00:05:47,119 --> 00:05:55,419
a co-citation relationship.
Now imagine that there are many such intermediate
32
00:05:55,419 --> 00:06:03,930
nodes, like node 3 and node 4 in this example;
then the evidence that nodes 1 and 2 share
33
00:06:03,930 --> 00:06:11,270
some relationship becomes stronger and stronger.
The greater the number of intermediate nodes
34
00:06:11,270 --> 00:06:18,189
like node 3 and node 4 the stronger is your
evidence that there is some form of relationship
35
00:06:18,189 --> 00:06:20,829
that exists between node 1 and node 2.
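The co-citation construction just described can be sketched for the four-author example. This is only an illustrative sketch: it assumes the convention that entry A[i][j] is 1 when author i+1 cites author j+1 (the lecture does not state the convention explicitly), and the helper names transpose and matmul are made up for this example.

```python
# Assumed convention: A[i][j] = 1 when author i+1 cites author j+1.
A = [
    [0, 0, 0, 0],  # author 1 cites nobody
    [0, 0, 0, 0],  # author 2 cites nobody
    [1, 1, 0, 0],  # author 3 cites authors 1 and 2
    [1, 1, 0, 0],  # author 4 cites authors 1 and 2
]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    y_cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in y_cols]
            for row in x]

# Co-citation matrix: entry (i, j) counts the authors who cite both i and j.
cocitation = matmul(transpose(A), A)

# Authors 1 and 2 are co-cited by two intermediate authors (3 and 4).
print(cocitation[0][1])
```

Under this convention the off-diagonal entry for authors 1 and 2 grows by one for every additional intermediate author who cites both, which is the "stronger evidence" described above. The diagonal entries simply count how many authors cite each author.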
36
00:06:23,470 --> 00:06:31,890
Now, likewise there can be a symmetric or
mirror image concept which is called the Bibliographic
37
00:06:31,890 --> 00:07:00,289
Coupling. So this, as I say, is just the
mirror image of the co-citation index. To understand
38
00:07:00,289 --> 00:07:21,360
the concept, let us again draw the same hypothetical
citation network example. The only
39
00:07:21,360 --> 00:07:32,240
difference is in how you read the citation relationship.
In this case what you see is that author 3 cites
40
00:07:32,240 --> 00:07:40,180
author 1. The earlier relationship that we
were talking about was "cited by", and here we
41
00:07:40,180 --> 00:08:01,309
are concentrating more on the relationship
"cites". Similarly, author 4 cites author 1.
42
00:08:01,309 --> 00:08:08,699
Again you have second-level evidence, stronger
evidence that reinforces your observation,
43
00:08:08,699 --> 00:08:28,849
where you see that author 3 cites
author 2. Similarly, author 4 cites
44
00:08:28,849 --> 00:08:40,540
author 2. The earlier case considered
the "cited by" relationship; the current case
45
00:08:40,540 --> 00:08:50,279
considers the "cites" relationship. So,
author 3 cites author 1 and author 4 also
46
00:08:50,279 --> 00:08:58,960
cites author 1, similarly author 3 cites author
2 and author 4 also cites author 2. This indicates
47
00:08:58,960 --> 00:09:20,060
that there is a possibility
that authors 3 and 4 share some relationship.
48
00:09:23,420 --> 00:09:31,240
So, again we are extending the same idea as we talked
about in the co-citation index context.
49
00:09:32,430 --> 00:09:40,460
Basically here you are no longer considering
authors 1 and 2, but authors 3 and 4, who have
50
00:09:40,460 --> 00:09:47,931
similar citation behavior, who have similar
patterns in their reference lists. You can
51
00:09:47,931 --> 00:09:52,560
also call this similar referencing behavior.
The way author
52
00:09:52,560 --> 00:09:58,780
3 refers to papers, author 4 also refers to
papers in a similar way. Again the idea is
53
00:09:58,780 --> 00:10:04,450
that probably these two authors work in a
similar field or on a similar topic; that is
54
00:10:04,450 --> 00:10:10,300
why very often they have to cite the same
people. So, there could be a relationship
55
00:10:10,300 --> 00:10:17,080
in that way existing between authors 3 and
4. This kind of a relationship is called the
56
00:10:17,080 --> 00:10:21,450
Bibliographic Coupling.
Again, if we consider that the corresponding
57
00:10:21,450 --> 00:10:33,990
matrix for this author-author
citation network is A, then
58
00:10:33,990 --> 00:10:47,930
the bibliographic coupling
network can be obtained just by the product
59
00:10:47,930 --> 00:10:56,870
of the two matrices, but now written as A
dot A transpose. Whereas the co-citation
60
00:10:56,870 --> 00:11:03,360
index matrix could be obtained by A transpose
dot A, here the bibliographic coupling matrix
61
00:11:03,360 --> 00:11:12,200
can be obtained by A dot A transpose.
Actually these two networks are used in conjunction
62
00:11:12,200 --> 00:11:17,260
with the original author citation network
matrix to do various sorts of recommendation
63
00:11:17,260 --> 00:11:21,540
tasks: citation recommendation, authorship
recommendation, and various other recommendation
64
00:11:21,540 --> 00:11:28,540
tasks, where people heavily use more advanced techniques,
but at the back end some of the metrics
65
00:11:28,540 --> 00:11:34,410
that are used, some of the quantifications that
go on, are basically drawn from these three
66
00:11:34,410 --> 00:11:37,950
networks: A, A A transpose, and A transpose A.
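As a companion sketch, the bibliographic coupling matrix A A transpose can be computed for the same hypothetical four-author network. Again this assumes the convention that A[i][j] is 1 when author i+1 cites author j+1; the helper names are illustrative only.

```python
# Assumed convention: A[i][j] = 1 when author i+1 cites author j+1.
A = [
    [0, 0, 0, 0],  # author 1 cites nobody
    [0, 0, 0, 0],  # author 2 cites nobody
    [1, 1, 0, 0],  # author 3 cites authors 1 and 2
    [1, 1, 0, 0],  # author 4 cites authors 1 and 2
]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    y_cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in y_cols]
            for row in x]

# Bibliographic coupling: entry (i, j) counts the authors cited by both i and j.
coupling = matmul(A, transpose(A))

# Authors 3 and 4 share two references (authors 1 and 2).
print(coupling[2][3])
```

With this convention the coupled pair is authors 3 and 4, while authors 1 and 2 (who cite nobody) get a coupling of zero; that is the mirror image of the co-citation result for the same network.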
67
00:11:39,110 --> 00:11:52,730
So, continuing with other metrics, we will now
introduce another very interesting concept
68
00:11:52,730 --> 00:12:07,250
for directed networks and this is called Reciprocity.
The idea is very simple, suppose there is
69
00:12:07,250 --> 00:12:18,620
a node A and another node B in the network, and
there is a directed edge from A to B. Now,
70
00:12:18,620 --> 00:12:26,010
as the name suggests, reciprocal means that if there
is a link from A to B, there is a reciprocal
71
00:12:26,010 --> 00:12:36,430
link from B to A. In directed networks there
might be a link back from B to A; this is
72
00:12:36,430 --> 00:12:44,200
a very different link from the link from A
to B in the context of directed networks.
73
00:12:44,200 --> 00:12:50,600
So if there is a link from A to B, then if
there exists another link from B to A then
74
00:12:50,600 --> 00:12:57,720
we say that this is a reciprocation of the
original link A to B. The reciprocation of
75
00:12:57,720 --> 00:13:04,660
the original link A to B is the link B to
A. And the associated quantification is called
76
00:13:04,660 --> 00:13:10,250
Reciprocity. So the idea is very simple. This
is the quantification of reciprocity,
77
00:13:10,250 --> 00:13:20,760
which you denote by the metric r.
So if you have m edges in the network, m being the
78
00:13:20,760 --> 00:13:29,610
total number of directed edges in the network,
among these how many edges are actually reciprocated;
79
00:13:29,610 --> 00:13:38,290
that is the idea. If there are, say, n nodes
in the network, then some of the edges
80
00:13:38,290 --> 00:13:42,340
are reciprocated and
some of the edges are not reciprocated.
81
00:13:42,340 --> 00:13:49,690
So, you express the total reciprocation as
a fraction of the total number of edges in the
82
00:13:49,690 --> 00:13:59,290
network. So that is how reciprocity is defined.
Basically, in the directed network you sum the product A
83
00:13:59,290 --> 00:14:12,130
i j A j i over all pairs i, j, where there is one link from i to j and
there is another link from j to i. If A i j is
84
00:14:12,130 --> 00:14:19,050
1 and also A j i is 1, then this product will
become 1. Otherwise, in all other cases this
85
00:14:19,050 --> 00:14:25,060
product will be 0. If there is 1 and 0, or 0
and 1, in both cases this product will be
86
00:14:25,060 --> 00:14:31,160
0; only in the case where both of them are
1 will this product be 1, and this sum is
87
00:14:31,160 --> 00:14:40,779
expressed as a fraction of the total number
of edges in the network. So, m is the total
88
00:14:40,779 --> 00:14:52,219
number of edges in the network. So this is
how you define the reciprocity of a graph.
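The reciprocity computation described above can be sketched as follows. This is a minimal illustration, assuming a 0/1 adjacency matrix with no self-loops in which adj[i][j] is 1 when there is a directed edge from i to j; the function name and the example graph are hypothetical.

```python
def reciprocity(adj):
    """Fraction of directed edges whose reverse edge is also present.

    adj is a square 0/1 matrix with adj[i][j] = 1 for an edge i -> j.
    Assumes there are no self-loops.
    """
    n = len(adj)
    m = sum(sum(row) for row in adj)  # total number of directed edges
    if m == 0:
        return 0.0  # convention chosen here for an edgeless graph
    # adj[i][j] * adj[j][i] is 1 only when both i -> j and j -> i exist.
    reciprocated = sum(adj[i][j] * adj[j][i]
                       for i in range(n) for j in range(n))
    return reciprocated / m

# Edges: 0 -> 1, 1 -> 0, 1 -> 2, 2 -> 0. Only the pair (0, 1) is mutual.
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
r = reciprocity(adj)
print(r)
```

Here two of the four directed edges (0 to 1 and 1 to 0) are reciprocated, so the ratio is one half. Note that a mutual pair contributes twice to the sum, once for each direction, which matches counting reciprocated edges rather than reciprocated pairs.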
89
00:14:58,780 --> 00:15:05,880
So, the next concept that we will talk about
is called the Rich-club co-efficient.
90
00:15:06,650 --> 00:15:22,420
This is another very interesting phenomenon.
For the time being we will only consider undirected
91
00:15:22,420 --> 00:15:29,820
graphs, unweighted graphs for estimating the
rich-club co-efficient. The idea is very simple
92
00:15:29,820 --> 00:15:35,290
that given a social network, or say given a
collaboration network of scientists, do you
93
00:15:35,290 --> 00:15:41,810
find a group of people, or group of scientists,
or a group of nodes in the social network
94
00:15:41,810 --> 00:15:49,800
who are in some sense rich and actually form
a dense connectivity, actually establish a
95
00:15:49,800 --> 00:15:57,710
dense connectivity among themselves. So they
have to be rich and they have to be densely
96
00:15:57,710 --> 00:16:08,040
connected. These are the two constraints that
this set of nodes needs to satisfy.
97
00:16:08,040 --> 00:16:15,380
Basically, how do we quantify richness? Richness
in the context of the rich-club co-efficient can
98
00:16:15,380 --> 00:16:22,700
be quantified in different ways, but the most
basic one is to use the degree of the node.
99
00:16:22,700 --> 00:16:28,850
If the degree of the node is above a threshold,
say k, then you consider this node a rich node.
100
00:16:28,850 --> 00:16:44,750
So richness is in this case analogous
to the degree say k of the node being considered.
101
00:16:48,340 --> 00:16:59,980
Suppose there is a richness threshold, say the
richness threshold is set to some k prime;
102
00:16:59,980 --> 00:17:25,610
then you consider all nodes in the network
where the degree of the node k is greater
103
00:17:25,610 --> 00:17:34,700
than or equal to the richness threshold k
dash. So you consider the set of all nodes
104
00:17:34,700 --> 00:17:40,619
in the network where the degree of the node is
greater than or equal to the richness threshold k prime
105
00:17:40,619 --> 00:17:46,679
or k dash.
106
00:17:46,679 --> 00:18:07,981
Suppose the number of such nodes with k greater
than or equal to k dash is some n. Now you
107
00:18:07,981 --> 00:18:20,159
consider this set of n nodes and you
calculate the edge density,
108
00:18:20,159 --> 00:18:38,990
or the density of the edges
in between the n nodes. So, you calculate
109
00:18:38,990 --> 00:18:45,299
the density of the edges in between these n
nodes which are actually the high degree nodes
110
00:18:45,299 --> 00:18:51,739
or nodes crossing the richness threshold bar
of k prime.
111
00:18:52,990 --> 00:19:02,280
So, how will you calculate the density?
It is very simple. The density is nothing
112
00:19:02,280 --> 00:19:21,379
but the number of actual edges between the
n nodes divided by the maximum number of edges
113
00:19:21,379 --> 00:19:27,630
which is possible between these n nodes. The
maximum number of edges possible is n C 2, that is, n times n minus 1 over 2.
114
00:19:27,630 --> 00:19:36,749
So you express the actual number of
edges among these n nodes as a
115
00:19:36,749 --> 00:19:43,340
fraction of the maximum number of edges that
is possible between these n nodes. And this
116
00:19:43,340 --> 00:19:56,800
is actually the rich-club co-efficient of
the network being considered.
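The steps above (pick the nodes at or above the richness threshold, then measure the edge density among them) can be sketched like this. It is a sketch under stated assumptions: the graph is undirected and unweighted, it is given as a hypothetical list of edges, and the function name and the choice to return None when fewer than two rich nodes exist are decisions made only for this example.

```python
from itertools import combinations

def rich_club_coefficient(edges, k_threshold):
    """Edge density among nodes whose degree is >= k_threshold.

    edges is a list of undirected (u, v) pairs without duplicates.
    Returns None when fewer than two nodes qualify as rich.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Rich nodes: degree at or above the richness threshold.
    rich = [node for node, nbrs in neighbors.items()
            if len(nbrs) >= k_threshold]
    n = len(rich)
    if n < 2:
        return None  # density is undefined for fewer than two nodes

    # Actual edges among the rich nodes, over the maximum possible n C 2.
    actual = sum(1 for u, v in combinations(rich, 2) if v in neighbors[u])
    return actual / (n * (n - 1) / 2)

# A triangle of hubs (0, 1, 2), each with one extra low-degree neighbor.
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 4), (2, 5)]
phi_hubs = rich_club_coefficient(edges, 3)
phi_all = rich_club_coefficient(edges, 1)
print(phi_hubs, phi_all)
```

In this toy graph the three degree-3 nodes are fully connected among themselves, so the density at threshold 3 is the maximum possible, whereas at threshold 1 every node counts as rich and the density is just the overall edge density of the graph.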
117
00:19:57,619 --> 00:20:04,159
As soon as I talk about this you can very
well imagine that the rich-club co-efficient
118
00:20:04,159 --> 00:20:16,570
is nothing but trying to express whether the
nodes that are rich in terms of degree have
119
00:20:16,570 --> 00:20:24,580
a very high number of connections among themselves.
For instance, take the actor-actor
120
00:20:24,580 --> 00:20:30,220
network from the movie-actor example
that I cited earlier. In the actor-actor
121
00:20:30,220 --> 00:20:38,179
network, if you find a rich club, this
would mean that the most influential actors
122
00:20:38,179 --> 00:20:46,399
team up with each other, form
a group, and sign movies together
123
00:20:46,399 --> 00:20:52,119
so that the next movie release is really a
box office hit.
124
00:20:52,850 --> 00:20:59,110
Similarly, quite often we have found
that in the scientific domain there is a
125
00:20:59,110 --> 00:21:04,429
bunch of scientists who are very highly
cited and very popular; they
126
00:21:04,429 --> 00:21:10,020
come together and do something very impactful,
something very seminal. So, whenever such
127
00:21:10,020 --> 00:21:15,480
a seminal thing happens, there is a bunch
of well-known, well-established scientists
128
00:21:15,480 --> 00:21:23,700
who come together, form a rich club, and by
virtue of forming this rich club publish
129
00:21:23,700 --> 00:21:30,190
something very seminal.
For example, in the context of the last one
130
00:21:30,190 --> 00:21:36,889
or two years, one of the examples would
be the Large Hadron Collider and the idea of the God
131
00:21:36,889 --> 00:21:45,220
particle, which you have probably come across
in various news articles.
132
00:21:45,220 --> 00:22:10,690
Now, the next concept that we will talk about
is the Entropy of the Degree Distribution.
133
00:22:10,690 --> 00:22:26,529
So this is another interesting idea, very
simply put. If you recollect the degree distribution
134
00:22:26,529 --> 00:22:40,840
is encoded in the variable p k, which is nothing
but the
135
00:22:40,840 --> 00:23:04,440
probability that a node chosen uniformly at
random has a degree k. So, this was the very
136
00:23:04,440 --> 00:23:10,039
simple idea of degree distribution. Since,
this is a probability distribution one can
137
00:23:10,039 --> 00:23:19,509
always estimate the entropy of this distribution.
So, simply put, the entropy H is nothing but
138
00:23:19,509 --> 00:23:35,369
minus the sum over k of p k log p k; and as you can well imagine,
entropy actually encodes the randomness, the
139
00:23:35,369 --> 00:23:45,360
extent of randomness in a distribution.
If this quantity is high, then the distribution
140
00:23:45,360 --> 00:23:52,730
is relatively random and the degrees are spread
more uniformly over many values, whereas if
141
00:23:52,730 --> 00:24:05,399
the distribution is skewed, if the entropy is lower, then probably
this is not the case. Some interesting exercises
142
00:24:05,399 --> 00:24:25,810
could be like: what should be the entropy of
p k (let us call this H of p k, the entropy
143
00:24:25,810 --> 00:24:39,049
of p k) for a regular graph?
A regular graph is a graph in which all nodes
144
00:24:39,049 --> 00:24:44,539
have the same degree. So, a complete graph
is also a regular graph, where each node has
145
00:24:44,539 --> 00:24:48,789
degree n minus 1, where n is the total number
of nodes in the network.
146
00:24:48,789 --> 00:24:56,929
Similarly, if in a network each node has degree
k, then that network is called a k-regular
147
00:24:56,929 --> 00:25:05,440
graph. Basically, in such a system, in such
a network, what you have is that p k is 1 for that
148
00:25:05,440 --> 00:25:11,789
particular value of k for which you are constructing
the regular graph and p k is 0 for all other
149
00:25:11,789 --> 00:25:19,289
values of k. In such a case what should be
the entropy, it is very simple to find. So,
150
00:25:19,289 --> 00:25:28,299
you can easily compute the entropy. For that
particular value of k, p k is
151
00:25:28,299 --> 00:25:32,679
1, so the term is 1 log 1, which is 0, and the rest of the terms
contribute nothing.
152
00:25:33,259 --> 00:25:41,149
Such a network does not encode any diversity;
that is what the entropy is actually capturing.
153
00:25:41,149 --> 00:25:47,249
So in such a network you do not observe any
diversity: all nodes have the same degree.
154
00:25:47,249 --> 00:25:56,570
So that is why the entropy goes to 0. Whereas,
if there is a network where the degree varies across nodes, then
155
00:25:56,570 --> 00:26:09,730
the entropy takes non-zero values.
So this again gives you an idea of the topological
156
00:26:09,730 --> 00:26:15,600
structure of the underlying social network.
So, one is the degree distribution itself
157
00:26:15,600 --> 00:26:19,950
and on top of it you can also measure the
entropy of this distribution to understand
158
00:26:19,950 --> 00:26:26,461
to get a better feel. Suppose the
entropy of such a network is close to 0; then
159
00:26:26,461 --> 00:26:33,279
you immediately get to know that this is more
or less a regular structure; whereas, if this
160
00:26:33,279 --> 00:26:40,019
entropy is not close to 0 then it is a more
non-trivial structure that is there in the
161
00:26:40,019 --> 00:26:41,539
social network.
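The entropy of the degree distribution, including the regular-graph exercise, can be sketched as below. The natural logarithm is an assumption made here (the lecture does not fix a base), and the function name and the example degree lists are hypothetical.

```python
import math
from collections import Counter

def degree_entropy(degrees):
    """Entropy H = -sum_k p_k log p_k of an observed degree sequence.

    degrees is a list with one degree per node; p_k is the fraction of
    nodes having degree k. Degrees that never occur contribute nothing.
    Uses the natural logarithm (an assumption of this sketch).
    """
    n = len(degrees)
    counts = Counter(degrees)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# A 2-regular graph on four nodes (a 4-cycle): every node has degree 2.
h_regular = degree_entropy([2, 2, 2, 2])

# Half the nodes have degree 1 and half have degree 2.
h_mixed = degree_entropy([1, 1, 2, 2])
```

For the regular graph the only non-zero probability is 1, so the single term is 1 log 1 and the entropy vanishes. When the nodes split evenly between two degree values, the entropy equals log 2, the largest value possible for two outcomes.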
162
00:26:46,309 --> 00:26:54,221
There is one last concept that we will get
introduced to, which is called the Matching
163
00:26:54,221 --> 00:27:12,519
Index. Basically, what this matching index
tries to quantify is
164
00:27:12,519 --> 00:27:21,039
how similar are two nodes in terms of their
connectivity patterns. The matching index
165
00:27:21,039 --> 00:27:52,649
mu i j is actually expressed using the following
formula. As soon as I write the formula it
166
00:27:52,649 --> 00:27:58,960
becomes very clear to you. So what we are
trying to see, suppose there are two nodes
167
00:27:58,960 --> 00:28:18,989
i and j in the network and there is some other
node k sitting out here, and we assume that
168
00:28:18,989 --> 00:28:31,729
there is already an edge between i and j.
Basically, suppose there are a lot of connections
169
00:28:31,729 --> 00:28:42,769
like k: say there is some k prime, say there
is some k double prime, and all of these actually
170
00:28:42,769 --> 00:28:53,340
connect i and j. So these k, k prime, k double
prime are the nodes whose number is being
171
00:28:53,340 --> 00:29:03,190
counted by the numerator of this formula, the sum over k of
A i k A k j. So, basically you are trying to
172
00:29:03,190 --> 00:29:11,950
count how many such triangular shapes or triangular
completions exist between i and j, and that
173
00:29:11,950 --> 00:29:17,560
you express as a fraction of the sum of the
degrees of i and j.
174
00:29:17,560 --> 00:29:24,529
Basically, this gives you an idea of the balance.
If there are two nodes i and j and there is
175
00:29:24,529 --> 00:29:32,120
an edge between them, and if for all other connections,
all other intermediate nodes k between i and
176
00:29:32,120 --> 00:29:39,789
j, there is a connection from i to
k and from k to j, if the structure is such that
177
00:29:39,789 --> 00:29:44,589
there is no other neighbor than this set of k
values, then the matching index is maximum.
178
00:29:46,289 --> 00:29:55,690
Now suppose there are many other nodes.
Basically it might be heavy on i:
179
00:29:55,690 --> 00:30:01,299
i might have a very high degree, and similarly
j might have a very high degree, but the overlapping
180
00:30:01,299 --> 00:30:07,359
set of nodes is not so large; then the matching
index is low. So, basically you see how good
181
00:30:07,359 --> 00:30:12,590
is the match between the neighborhood of i
and the neighborhood of j, how balanced is
182
00:30:12,590 --> 00:30:16,650
this match. That is what you try to quantify
using the matching index.
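The matching index as described here (common neighbors of i and j, divided by the sum of the two degrees) can be sketched as follows. This follows the normalization stated in the lecture; other normalizations are in use, for example dividing by the number of distinct neighbors, so treat this as one possible reading. The function name, the edge list, and the zero returned for two isolated nodes are assumptions of this example.

```python
def matching_index(adj, i, j):
    """Common neighbors of i and j divided by degree(i) + degree(j).

    adj is a symmetric 0/1 adjacency matrix of an undirected graph
    without self-loops.
    """
    n = len(adj)
    # adj[i][k] * adj[k][j] is 1 exactly when k neighbors both i and j.
    common = sum(adj[i][k] * adj[k][j] for k in range(n))
    degree_sum = sum(adj[i]) + sum(adj[j])
    if degree_sum == 0:
        return 0.0  # convention chosen here for two isolated nodes
    return common / degree_sum

# Build a small undirected graph from a hypothetical edge list.
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (0, 4)]
size = 5
adj = [[0] * size for _ in range(size)]
for u, v in edges:
    adj[u][v] = 1
    adj[v][u] = 1

# Nodes 0 and 1 share neighbors 2 and 3; their degrees are 4 and 3.
mu = matching_index(adj, 0, 1)
print(mu)
```

In this graph node 0 has one neighbor (node 4) that node 1 does not share, which pulls the index down. Adding more unshared neighbors to either node lowers it further, the "heavy on i" situation described above.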
183
00:30:16,960 --> 00:30:18,159
So we stop here.