Steven Corman steve.corman@asu.edu, Tue 06-Jul-21 6:00 PM
Hi Vlado. I hope things are going well there and you have survived the Apocalypse OK.
I have a two-mode graph with 19,315 vertices in one mode and 213 vertices in the other. I am trying to get the one-mode “row” projection (i.e., A × Aᵀ) of this. It is too much for Visone, even with a fast processor and lots of memory.
I remember that Pajek is designed for large networks. I was going to try it, but I can’t figure out how to accomplish this particular task. I tried the network > two-mode function with a small test dataset. It creates a partition, but I don’t know how to turn that partition into a one-mode adjacency matrix or link list (preferably without the isolates and loops) that I can output.
Is what I want to do possible with Pajek, and do you think it will handle the calculation for a graph of that size? If so can you please give me some guidance on how to do it, or point me to a worked example?
Thanks, and best regards to you and Anushka….
Steve
Batagelj, Vladimir Wed 07-Jul-21 1:46 AM
Dear Steve,
Computing the row projection by definition takes 19315*213*19315 = 79463744925 = 79 G multiplications of numbers and requires additional space in the memory for storing 19315*19315 = 373069225 = 373 M numbers. Space shouldn't be a problem on a better laptop.
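The figures above can be checked with a few lines of R. This is only a back-of-the-envelope sketch; the 8 bytes per number is an assumption (double-precision storage), not something stated in the message.

```r
n <- 19315   # vertices in the row mode
m <- 213     # vertices in the column mode

mults   <- n * m * n            # multiplications for a dense A %*% t(A)
entries <- n * n                # cells in the one-mode result
gib     <- entries * 8 / 2^30   # assumption: 8-byte doubles

cat(format(mults, big.mark = ","), "multiplications\n")
cat(format(entries, big.mark = ","), "entries, roughly", round(gib, 1), "GiB\n")
```

Under that assumption the dense result needs a little under 3 GiB, which is consistent with the remark that space should not be a problem on a better laptop.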
If your network is sparse (many zero entries) the computation can be done faster by dealing with only non-zero entries. Pajek supports this approach.
    read network in Pajek
    Network/2-Mode Network/Transpose
    select original network as First (register)
    select transposed network as Second
    Networks/Multiply networks (result is one-mode)
    save network to file
The intermediate steps (transpose, select, multiply) can be replaced by a single command
    Network/2-Mode Network/2-Mode to 1-Mode/Rows
and (probably first) Include loops, if you are also interested in the loop values.
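For comparison outside Pajek, the same sparse row projection can be sketched in R with the Matrix package (a recommended package that ships with standard R installations). The toy data below are made up, and this is an illustration of the idea of multiplying only non-zero entries, not a description of how Pajek implements it.

```r
library(Matrix)

# Made-up two-mode network: 4 row vertices, 3 column vertices.
A <- sparseMatrix(
  i = c(1, 1, 2, 3, 3, 4),
  j = c(1, 2, 2, 2, 3, 3),
  x = c(1, 2, 1, 1, 1, 3),
  dims = c(4, 3)
)

B <- A %*% t(A)        # row projection; only non-zero entries are multiplied
loops <- diag(B)       # loop values sit on the diagonal

B0 <- B
diag(B0) <- 0          # remove loops
B0 <- drop0(B0)        # discard the explicit zeros left behind

edges <- summary(B0)                    # triplets i, j, x
edges <- edges[edges$i < edges$j, ]     # keep each undirected link once
print(edges)
```

The `edges` table is a link list without loops; isolated vertices simply do not appear in it.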
For testing, and as an example of a two-mode description in Pajek, I am attaching Davis' Southern Women network. You can find some additional information at https://github.com/bavla/NormNet/blob/main/docs/normnet.pdf
best regards, Vlado
Steven Corman steve.corman@asu.edu, Wed 07-Jul-21 2:42 AM
A problem I had is getting from my edgelist format to a .net file. I read it into Visone as a two-mode network and saved it in .net format, but when I opened that file and tried to transpose it, Pajek said it isn’t a two-mode network.
So I wrote a script to do the conversion, based on the Davis example you sent. I converted the node names to index numbers and was careful to make sure the indices were distinct for the two modes. When I opened that in Pajek, it said there are lines within modes. I don’t know how that could have happened.
Anyway, I ignored that and was able to get the projection of the second node in each edge by post-multiplying.
When I tried to get the projection of the first node in each edge by pre-multiplying, I got a memory error:
Perhaps that is due to the presence of the offending line-within-mode? All files involved are attached.
Batagelj, Vladimir, Wed 07-Jul-21 5:28 AM
Dear Steve,
I am attaching a ZIP. It contains
The problem is that your network is weighted, and I guess that the links with large weights are over-represented in the projection - you should consider some normalization. For example, you could project the matrix sqrt(A) = [ sqrt(a[u,v]) ] instead of A.
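A small R sketch of the suggested square-root normalization, on a made-up 3 x 3 weighted matrix (the article and keyword labels are hypothetical). It only illustrates how projecting sqrt(A) dampens the influence of large weights.

```r
# Made-up weighted two-mode matrix: rows = articles, columns = keywords.
A <- matrix(c(9, 0, 1,
              1, 1, 0,
              4, 0, 4),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("a1", "a2", "a3"), c("k1", "k2", "k3")))

B_raw  <- A %*% t(A)               # projection of the raw weights
B_sqrt <- sqrt(A) %*% t(sqrt(A))   # projection of [ sqrt(a[u,v]) ]

print(B_raw)
print(B_sqrt)
```

In the raw projection the pair a1, a3 gets 40 against 9 for a1, a2; after the square root the values are 8 and 3, so the heavy k1 weights dominate less.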
best regards, Vlado
> wdir <- "D:/pajek/steve"
> setwd(wdir)
> T <- read.table("kw2mode-edgelist.txt",sep="\t")
> colnames(T) <- c("source","key","count")
> head(T)
    source             key count
1 32405706 Milo_Djukanovic     1
2 32405706              DF     1
3 32405706            lame     1
4 32405706     institution     1
5 32405706           oppos     2
6 32405706          regime     1
> tail(T)
         source            key count
134010 17549459       unemploy     2
134011 18543963      Podgorica     1
134012 18543963 NATO_accession     1
134013 18543963        citizen     1
134014 18543963        corrupt     2
134015 18543963          oppos     4
> S <- factor(T$source); n <- length(levels(S))
> n
[1] 19315
> K <- factor(T$key); m <- length(levels(K))
> m
[1] 213
> net <- file("kw2-mode.net","w")
> cat("% List 2 Net:",date(),"\n",file=net)
> cat("*vertices ",n+m,n,"\n",file=net)
> for(v in 1:n) cat(v,' "',levels(S)[v],'"\n',sep='',file=net)
> for(u in 1:m) cat(u+n,' "',levels(K)[u],'"\n',sep='',file=net)
> cat("*arcs\n",file=net)
> for(a in 1:nrow(T)) cat(S[a]," ",as.integer(K[a])+n," ",T$count[a],"\n",sep="",file=net)
> close(net)
Steven Corman steve.corman@asu.edu, Wed 07-Jul-21 8:07 PM
Thanks again, Vlado. That R script will come in handy. Why don’t you make Pajek able to read a csv file with an edgelist? That’s a pretty common format.
Anyway, it occurred to me it might interest you what I’m doing. The IDs in that file are for news articles in a database collected in Montenegro a year before and after the attempted coup in late 2016. Maybe you guessed this from looking at the keywords. I am trying to cluster those articles by doing Louvain modularity on the row-mode projection.
Best…
Steve
Batagelj, Vladimir, Fri 09-Jul-21 12:58 AM
Here is a list of commands to compute the network B = A*t(A)
    read network into Pajek
    Network/2-Mode Network/Transpose 2-Mode
    Select original network as First (register)
    Select transposed as Second
    Networks/Multiply Networks [Yes]
next, to get the normalized cosine similarity
    Network/2-Mode Network/2-Mode to 1-Mode/Normalize 1-Mode/Geo
    Select Multiplying ... as First network
    File/Network/Dispose [Yes]
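The effect of the two command lists above can be sketched in R on a toy matrix. The assumption here is that the Geo normalization divides each value b[u,v] by sqrt(b[u,u] * b[v,v]), which is what turns B = A*t(A) into the cosine similarity between rows; the data are made up.

```r
# Made-up two-mode matrix: 4 row vertices, 3 column vertices.
A <- matrix(c(1, 2, 0,
              0, 1, 0,
              0, 1, 1,
              0, 0, 3),
            nrow = 4, byrow = TRUE)

B <- A %*% t(A)         # B = A * t(A), loops (diagonal) included
d <- sqrt(diag(B))      # row lengths; needs the loop values
S <- B / outer(d, d)    # assumed Geo rule: b[u,v] / sqrt(b[u,u] * b[v,v])

print(round(S, 3))
```

Because the formula uses the diagonal of B, the loop values have to be present when the normalization is applied.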
with best wishes, Vlado
Batagelj, Vladimir, Thu 15-Jul-21 5:46 AM
Dear Steve,
I computed in Pajek the cosine similarity between rows of your network
    read network into Pajek
    Network/2-Mode Network/Transpose 2-Mode
    Select original network as First (register)
    Select transposed as Second
    Networks/Multiply Networks [Yes]                                  50s
    Network/Create New Network/Transform/Remove/Triangle/Lower        5min
    Network/Create New Network/Transform/Arcs -> Edges/All            45min
    Network/2-Mode Network/2-Mode to 1-Mode/Normalize 1-Mode/Geo      1min
I transformed it into a dissimilarity (d = 1 - s)
    Network/Create New Network/Transform/Line Values/Multiply by [-1][No]     3s
    Network/Create New Network/Transform/Line Values/Add Constant [1][No]     1s
    Network/Create New Network/Transform/Line Values/Absolute [1][No]         1s
    Network/Create New Network/Transform/Remove/Loops [No]                    8s
    File/Network/Change Label [Distance]
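The same four transformations can be mirrored step by step in R on a small made-up similarity matrix. One difference to keep in mind: as far as I understand, Pajek transforms the values of existing lines only, whereas in a dense matrix the absent pairs (s = 0) end up with d = 1.

```r
# Made-up cosine similarities between three vertices.
S <- matrix(c(1.0, 0.8, 0.0,
              0.8, 1.0, 0.5,
              0.0, 0.5, 1.0),
            nrow = 3, byrow = TRUE)

D <- S * -1       # Line Values / Multiply by [-1]
D <- D + 1        # Line Values / Add Constant [1]
D <- abs(D)       # Line Values / Absolute (presumably guards against rounding below 0)
diag(D) <- 0      # Remove / Loops

print(D)
```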
I applied to it hierarchical clustering with relational constraint
Network/Create Hierarchy/Clustering with Relational Constraint/Run [Maximum,Tolerant]
Your projection network is very dense (Density2 [no loops allowed] = 0.7313, Average Degree = 14125.2 ). It took 2 days to get the results (maybe it would be faster on a computer with a larger memory). I saved the results (hierarchy stored as clustering MaxTol.clu and two vectors (height MaxTolh.vec and size MaxTols.vec)) for the extraction of clusters. I extracted the clusters at levels 0.9 (215 clusters) and 0.75 (521 clusters).
Network/Create Hierarchy/Clustering with Relational Constraint/Make Partition/Using Threshold [0.75]
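The idea of cutting a hierarchy at a threshold can be tried in base R. Note that this sketch uses ordinary complete-linkage `hclust`, which is a stand-in and not Pajek's clustering with relational constraint; treating [Maximum] as complete linkage is an assumption, and the dissimilarities are made up.

```r
# Made-up dissimilarities between four articles.
labs <- paste0("a", 1:4)
D <- matrix(0.8, 4, 4, dimnames = list(labs, labs))
D["a1", "a2"] <- D["a2", "a1"] <- 0.2
D["a3", "a4"] <- D["a4", "a3"] <- 0.3
diag(D) <- 0

h <- hclust(as.dist(D), method = "complete")   # stand-in for [Maximum]

p75 <- cutree(h, h = 0.75)   # partition at level 0.75
p90 <- cutree(h, h = 0.90)   # partition at level 0.90

print(table(p75))   # cluster sizes at 0.75
print(table(p90))   # cluster sizes at 0.90
```

As in the Pajek run, a lower threshold gives more (and smaller) clusters than a higher one.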
All the files are included in the attached ZIP.
To inspect clusters I turned to R.
> wdir <- "D:/pajek/steve"
> setwd(wdir)
> source("https://raw.githubusercontent.com/bavla/Rnet/master/R/Pajek.R")
> A <- net2matrix("kw2-mode.net")
> dim(A)
[1] 19315   213
> p <- clu2vector("MaxTol75.clu",skip=1)
> length(p)
[1] 19315
> head(p)
[1] 1 2 3 4 5 6
> f <- table(p)
> length(f)
[1] 521
> fo <- f[order(f,decreasing=TRUE)]
> head(fo)
p
  8   6  20  80  89  39
746 684 497 496 477 377
> C <- A[p==8,]
> g <- colSums(C)
> go <- g[order(g,decreasing=TRUE)]
> head(go,15)
        Milo_Djukanovic               Podgorica                   power                  Seria_
                   5386                     772                     582                     530
           independence                   oppos                 citizen                 corrupt
                    493                     449                     348                     227
                 identi             institution                authorit                  histor
                    187                     175                     167                     164
                 regime parliamentary_elections                communis
                    136                      99                      80
> C <- A[p==6,]
> g <- colSums(C)
> go <- g[order(g,decreasing=TRUE)]
> head(go,20)
      Podgorica           oppos          police         corrupt          Seria_ Milo_Djukanovic
           2670             415             259             258             223             161
        citizen          protes          rights        authorit    independence     institution
            143             132             130             109              94              84
          power         illegal          identi    human_rights   hunger_strike          ethnic
             73              68              67              64              62              60
         histor        legitima
             55              52
> profile <- function(c){
+   C <- A[p==c,]
+   g <- colSums(C)
+   return(g[order(g,decreasing=TRUE)])
+ }
> g8 <- profile(8)
> head(g8,20)
        Milo_Djukanovic               Podgorica                   power                  Seria_
                   5386                     772                     582                     530
           independence                   oppos                 citizen                 corrupt
                    493                     449                     348                     227
                 identi             institution                authorit                  histor
                    187                     175                     167                     164
                 regime parliamentary_elections                communis                 freedom
                    136                      99                      80                      77
                 Moscow                  ethnic                  rights                  rother
                     75                      71                      70                      70
> g20 <- profile(20)
> head(g20,20)
       Seria_       citizen  independence        rights        identi        ethnic        histor
         2953           303           172           158           137           136           128
        oppos     Podgorica    agreements       minorit   institution       corrupt      authorit
          123           123           120           104           103           102            80
       Moscow         power        impose    nationalis      language Sergey_Lavrov
           80            77            59            54            53            53
> g80 <- profile(80)
> head(g80,20)
                  oppos               Podgorica                 citizen                  identi
                   3172                     434                     426                     255
        Milo_Djukanovic                   power             institution                  protes
                    255                     246                     202                     177
             agreements                authorit                      DF parliamentary_elections
                    166                     148                     139                     128
                 Seria_                  police                 corrupt                  regime
                    117                     116                     106                      95
           independence                 minorit       opposition_leader                  native
                     93                      91                      84                      82
It is much easier to grasp the content of a cluster if we present it as a word cloud. Because the range of the original frequencies is very large, we get nicer clouds if we use the square root of each word's frequency as its size.
> library(wordcloud)
> N <- names(g80)
> ff <- as.vector(g80)
> head(ff)
[1] 3172  434  426  255  255  246
> wordcloud(words=N,freq=ff,# min.freq = 1,
+   max.words=100, random.order=FALSE, rot.per=0.35,
+   colors=brewer.pal(8, "Dark2"))
> fs <- sqrt(ff)
> wordcloud(words=N,freq=fs,# min.freq = 1,
+   max.words=100, random.order=FALSE, rot.per=0.35,
+   colors=brewer.pal(8, "Dark2"))
>
> wc <- function(c){
+   g <- profile(c)
+   N <- names(g); f <- sqrt(as.vector(g))
+   wordcloud(words=N,freq=f, min.freq = 1,
+     max.words=100, random.order=FALSE, rot.per=0.35,
+     colors=brewer.pal(8, "Dark2"))
+ }
> wc(8)
> wc(6)
> wc(20)
> wc(14)
> wc(100)
> wc(300)
> ...
I included some word clouds in the ZIP.
with best wishes, Vlado
Batagelj, Vladimir, Thu 15-Jul-21 6:48 PM
PS
There are problems with word clouds of some clusters
> wc(520)
Warning message:
In wordcloud(words = N, freq = f, min.freq = 1, max.words = 100,  :
  Special_State_Prosecutor could not be fit on page. It will not be plotted.
> wc(518)
Warning message:
In wordcloud(words = N, freq = f, min.freq = 1, max.words = 100,  :
  criminal_organization could not be fit on page. It will not be plotted.
A solution is to change the scale parameter in the wordcloud call, so that the largest words are drawn smaller:
> wc <- function(c,scale=c(3,.5)){
+   g <- profile(c)
+   N <- names(g); fs <- sqrt(as.vector(g))
+   wordcloud(words=N,freq=fs, scale=scale, min.freq = 1,
+     max.words=100, random.order=FALSE, rot.per=0.35,
+     colors=brewer.pal(8, "Dark2"))
+ }
> wc(520)
> wc(520,scale=c(2,.5))
> wc(518)
> wc(518,scale=c(2,.5))
Also, wordcloud is not a deterministic procedure: a new run can produce a different arrangement of the words.
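If a repeatable layout is wanted, one common approach is to fix R's random seed just before drawing. The sketch below assumes that wordcloud takes its randomness from R's ordinary random number generator; `draw_cloud` and the seed value are hypothetical names for illustration, and the frequencies are the first six values of cluster 80 shown earlier.

```r
words <- c("oppos", "Podgorica", "citizen", "identi", "Milo_Djukanovic", "power")
freq  <- sqrt(c(3172, 434, 426, 255, 255, 246))

# Hypothetical helper: same seed, same stream of random numbers.
draw_cloud <- function(seed) {
  set.seed(seed)   # fix the random number stream before the layout starts
  if (requireNamespace("wordcloud", quietly = TRUE)) {
    wordcloud::wordcloud(words = words, freq = freq, min.freq = 1,
                         random.order = FALSE, rot.per = 0.35)
  }
  invisible(seed)
}

draw_cloud(2021)
draw_cloud(2021)   # should repeat the arrangement of the first call
```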
with best wishes, Vlado