Steve Corman

Steven Corman steve.corman@asu.edu, Tue 06-Jul-21 6:00 PM

Hi Vlado. I hope things are going well there and you have survived the Apocalypse OK.

I have a two-mode graph with 19,315 vertices in one mode and 213 vertices in the other. I am trying to get the one-mode “row” projection (i.e. A x AT) of this. It is too much for Visone, even with a fast processor and lots of memory.

I remember that Pajek is designed for large networks. I was going to try it, but I can’t figure out how to accomplish this particular task. I tried the network > two-mode function with a small test dataset. It creates a partition, but I don’ t know how to turn that partition into a one-mode adjacency matrix or link list (preferably without the isolates and loops) that I can output.

Is what I want to do possible with Pajek, and do you think it will handle the calculation for a graph of that size? If so can you please give me some guidance on how to do it, or point me to a worked example?

Thanks, and best regards to you and Anushka….

Steve

Batagelj, Vladimir Wed 07-Jul-21 1:46 AM

Dear Steve,

Computing the row projection by definition takes 19315*213*19315 = 79463744925 = 79 G multiplications of numbers and requires additional space in the memory for storing 19315*19315 = 373069225 = 373 M numbers. Space shouldn't be a problem on a better laptop.

If your network is sparse (many zero entries) the computation can be done faster by dealing with only non-zero entries. Pajek supports this approach.

read network in Pajek
Network/2-Mode Network/Transpose
select original network as First (register)
select transposed network as Second
Networks/Multiply networks   (result is one-mode)
save network to file

The internal items can be replaced by a single command Network/2-Mode Network/2-Mode to 1-Mode/Rows and (probably first) Include loops if you are interested also in loop values.

For testing and an example of a two-mode description in Pajek I am attaching the Davis' Southern Women network. Some additional information you can find in https://github.com/bavla/NormNet/blob/main/docs/normnet.pdf

best regards, Vlado

Steven Corman steve.corman@asu.edu, Wed 07-Jul-21 2:42 AM

A problem I had is getting from my edgelist format to a .net file. I read it into Visone as a two-mode network and saved to .net format, but when I opened that and tried to transpose it, Pajek it said it isn’t a two-mode network.

So I wrote a script to convert based on the davis example you sent. I converted the node names to index numbers and was careful to make sure the indices were distinct for the two modes. When I opened that in Pajek, it said there are lines within modes. I don’t know how that could have happened.

Anyway I ignored that was able to get the projection of the second node in each edge by post-multiplying.

When I tried to get the projection of the first node in each edge by pre-multiplying, I got a memory error:

Perhaps that is due to the presence of the offending line-within-mode? All files involved are attached.

Batagelj, Vladimir, Wed 07-Jul-21 5:28 AM

Dear Steve,

I am attaching a ZIP. It contains

program in R for converting your lists into Pajek files
Pajek file for your data
Pathfinder skeleton of keywords from your data. I did only partial manual editing of the picture (in the picture I present sqrt(weight) - the range of original weights is very large). I used the column projection.

The problem is that your network is weighted, and I guess that the links with large weights are over-represented in the projection - you should consider some normalizations. For example, projecting the matrix sqrt(A) = [ sqrt(a[u,v]) ] .

best regards, Vlado

> wdir <- "D:/pajek/steve"
> setwd(wdir)
> T <- read.table("kw2mode-edgelist.txt",sep="\t") 
> colnames(T) <- c("source","key","count")
> head(T)
    source             key count
1 32405706 Milo_Djukanovic     1
2 32405706              DF     1
3 32405706            lame     1
4 32405706     institution     1
5 32405706           oppos     2
6 32405706          regime     1
> tail(T)
         source            key count
134010 17549459       unemploy     2
134011 18543963      Podgorica     1
134012 18543963 NATO_accession     1
134013 18543963        citizen     1
134014 18543963        corrupt     2
134015 18543963          oppos     4
> S <- factor(T$source); n <- length(levels(S))
> n
[1] 19315
> K <- factor(T$key); m <- length(levels(K))
> m
[1] 213
> net <- file("kw2-mode.net","w")
> cat("% List 2 Net:",date(),"\n",file=net) 
> cat("*vertices ",n+m,n,"\n",file=net)
> for(v in 1:n) cat(v,' "',levels(S)[v],'"\n',sep='',file=net)
> for(u in 1:m) cat(u+n,' "',levels(K)[u],'"\n',sep='',file=net)
> cat("*arcs\n",file=net)
> for(a in 1:nrow(T)) cat(S[a]," ",as.integer(K[a])+n," ",T$count[a],"\n",sep="",file=net)
> close(net)

Steven Corman steve.corman@asu.edu, Wed 07-Jul-21 8:07 PM

Thanks again, Vlado. That R script will come in handy. Why you don’t make Pajek able to read a csv file with an edgelist? That’s a pretty common format.

Anyway, it occurred to me it might interest you what I’m doing. The IDs in that file are for news articles in a database collected in Montenegro a year before and after the attempted coup in late 2016. Maybe you guessed this from looking at the keywords. I am trying to cluster those articles by doing Louvain modularity on the row-mode projection.

Best…

Steve

Batagelj, Vladimir, Fri 09-Jul-21 12:58 AM

Here is a list of commands to compute the network B = A*t(A)

read network into Pajek
Network/2-Mode Network/Transpose 2-Mode
Select original network as First (register)
Select transposed as Second
Networks/Multiply Networks [Yes]

next, to get the normalized cosine similarity

Network/2-Mode Network/2-Mode to 1-Mode/Normalize 1-Mode/Geo
Select Multiplying ... as First network
File/Network/Dispose [Yes]

with best wishes, Vlado

Batagelj, Vladimir, Thu 15-Jul-21 5:46 AM

Dear Steve,

I computed in Pajek the cosine similarity between rows of your network

read network into Pajek
Network/2-Mode Network/Transpose 2-Mode
Select original network as First (register)
Select transposed as Second
Networks/Multiply Networks [Yes]                                 50s
Network/Create New Network/Transform/Remove/Triangle/Lower        5min
Network/Create New Network/Transform/Arcs -> Edges/All           45min
Network/2-Mode Network/2-Mode to 1-Mode/Normalize 1-Mode/Geo      1min

I transformed it into a dissimilarity (d = 1 - s)

 
Network/Create New Network/Transform/Line Values/Multiply by [-1][No]   3s
Network/Create New Network/Transform/Line Values/Add Constant [1][No]   1s
Network/Create New Network/Transform/Line Values/Absolute [1][No]       1s
Network/Create New Network/Transform/Remove/Loops [No]                  8s
File/Network/Change Label [Distance]

I applied to it hierarchical clustering with relational constraint

Network/Create Hierarchy/Clustering with Relational Constraint/Run [Maximum,Tolerant]

Your projection network is very dense (Density2 [no loops allowed] = 0.7313, Average Degree = 14125.2 ). It took 2 days to get the results (maybe it would be faster on a computer with a larger memory). I saved the results (hierarchy stored as clustering MaxTol.clu and two vectors (height MaxTolh.vec and size MaxTols.vec)) for the extraction of clusters. I extracted the clusters at levels 0.9 (215 clusters) and 0.75 (521 clusters).

Network/Create Hierarchy/Clustering with Relational Constraint/Make Partition/Using Threshold  [0.75]

All the files are included in the attached ZIP.

To inspect clusters I turned to R.

> wdir <- "D:/pajek/steve"
> setwd(wdir)
> source("https://raw.githubusercontent.com/bavla/Rnet/master/R/Pajek.R")
> A <- net2matrix("kw2-mode.net")
> dim(A)
[1] 19315   213
> p <- clu2vector("MaxTol75.clu",skip=1)
> length(p)
[1] 19315
> head(p)
[1] 1 2 3 4 5 6
> f <- table(p)
> length(f)
[1] 521
> fo <- f[order(f,decreasing=TRUE)]
> head(fo)
p
  8   6  20  80  89  39
746 684 497 496 477 377
 
> C <- A[p==8,]
> g <- colSums(C)
> go <- g[order(g,decreasing=TRUE)]
> head(go,15)
        Milo_Djukanovic               Podgorica                   power                  Seria_
                   5386                     772                     582                     530
           independence                   oppos                 citizen                 corrupt
                    493                     449                     348                     227
                 identi             institution                authorit                  histor
                    187                     175                     167                     164
                 regime parliamentary_elections                communis
                    136                      99                      80
> C <- A[p==6,]
> g <- colSums(C)
> go <- g[order(g,decreasing=TRUE)]
> head(go,20)
      Podgorica           oppos          police         corrupt          Seria_ Milo_Djukanovic
           2670             415             259             258             223             161
        citizen          protes          rights        authorit    independence     institution
            143             132             130             109              94              84
          power         illegal          identi    human_rights   hunger_strike          ethnic
             73              68              67              64              62              60
         histor        legitima
             55              52
> profile <- function(c){
+    C <- A[p==c,]
+    g <- colSums(C)
+    return(g[order(g,decreasing=TRUE)])
+ }
> g8 <- profile(8)
> head(g8,20)
        Milo_Djukanovic               Podgorica                   power                  Seria_
                   5386                     772                     582                     530
           independence                   oppos                 citizen                 corrupt
                    493                     449                     348                     227
                 identi             institution                authorit                  histor
                    187                     175                     167                     164
                 regime parliamentary_elections                communis                 freedom
                    136                      99                      80                      77
                 Moscow                  ethnic                  rights                  rother
                     75                      71                      70                      70
> g20 <- profile(20)
> head(g20,20)
       Seria_       citizen  independence        rights        identi        ethnic        histor
         2953           303           172           158           137           136           128
        oppos     Podgorica    agreements       minorit   institution       corrupt      authorit
          123           123           120           104           103           102            80
       Moscow         power        impose    nationalis      language Sergey_Lavrov
           80            77            59            54            53            53
> g80 <- profile(80)
> head(g80,20)
                  oppos               Podgorica                 citizen                  identi
                   3172                     434                     426                     255
        Milo_Djukanovic                   power             institution                  protes
                    255                     246                     202                     177
             agreements                authorit                      DF parliamentary_elections
                    166                     148                     139                     128
                 Seria_                  police                 corrupt                  regime
                    117                     116                     106                      95
           independence                 minorit       opposition_leader                  native
                     93                      91                      84                      82

It is much easier to grasp the content of a cluster if we present it as a word cloud. We get nicer clouds (the range of original frequencies is very large) if we consider as the size of words a square root of their frequencies.

> library(wordcloud)
> N <- names(g80)
> ff <- as.vector(g80)
> head(ff)
[1] 3172  434  426  255  255  246
> wordcloud(words=N,freq=ff,# min.freq = 1,
+   max.words=100, random.order=FALSE, rot.per=0.35,
+   colors=brewer.pal(8, "Dark2"))
> fs <- sqrt(ff)
> wordcloud(words=N,freq=fs,# min.freq = 1,
+   max.words=100, random.order=FALSE, rot.per=0.35,
+   colors=brewer.pal(8, "Dark2"))
>
> wc <- function(c){
+    g <- profile(c)
+    N <- names(g); f <- sqrt(as.vector(g))
+    wordcloud(words=N,freq=f, min.freq = 1,
+       max.words=100, random.order=FALSE, rot.per=0.35,
+       colors=brewer.pal(8, "Dark2"))
+ }
> wc(8)
> wc(6)
> wc(20)
> wc(14)
> wc(100)
> wc(300)
> ...

I included some word clouds in the ZIP.

with best wishes, Vlado

Batagelj, Vladimir, Thu 15-Jul-21 6:48 PM

PS

There are problems with word clouds of some clusters

> wc(520)
Warning message:
In wordcloud(words = N, freq = f, min.freq = 1, max.words = 100,  :
  Special_State_Prosecutor could not be fit on page. It will not be plotted.
> wc(518)
Warning message:
In wordcloud(words = N, freq = f, min.freq = 1, max.words = 100,  :
  criminal_organization could not be fit on page. It will not be plotted.

A solution is to change the scale parameter in the wordcloud

> wc <- function(c,scale=c(3,.5)){
+    g <- profile(c)
+    N <- names(g); fs <- sqrt(as.vector(g))
+    wordcloud(words=N,freq=fs, scale=scale, min.freq = 1,
+       max.words=100, random.order=FALSE, rot.per=0.35,
+       colors=brewer.pal(8, "Dark2"))
+ }
> wc(520)
> wc(520,scale=c(2,.5))
> wc(518)
> wc(518,scale=c(2,.5))

Also, the wordcloud is not a deterministic procedure - new execution can produce a different arrangement of the words.

with best wishes, Vlado