We first collected hits for the query
"block model*" or "network cluster*" or "graph cluster*" or "community detect*" or "blockmodel*" or "block-model*" or "structural equival*" or "regular equival*"
from WoS on May 16, 2015.
On January 6, 2017 we made an update for the years 2014-2017, and another update for the years 2015-2017 on February 22, 2017. We also manually prepared descriptions, with the title and the complete list of authors, for the “terminal” nodes with large counts and added them to the WoS file.
We decided to make some improvements to the program WoS2Pajek:
>>> import sys; wdir = r'c:\users\Batagelj\work\Python\WoS'; sys.path.append(wdir)
>>> MLdir = r'c:\Python27\Lib\site-packages\MontyLingua-2.1\Python'
>>> import os; os.chdir(wdir); sys.path.append(MLdir)
>>> import MontyLingua
>>> def lemmatize(ML,ab,stopwords):
        sLto = [ML.tokenize(st) for st in ML.split_sentences(ab.lower())]
        sLta = [ML.tag_tokenized(t) for t in sLto]
        lem = [ML.lemmatise_tagged(t) for t in sLta]
        lemas = [s.split('/')[2] for s in string.join(lem).split(' ')]
        return list(set(dropList(lemas,stopwords)))
>>> resrc = os.path.join(wdir, "resources/")
>>> stopwords = open(resrc+'StopWords.dat', 'r').read().lower().split()
>>> stopwords = ['.',',',';','(',')','[',']','"','=','?','!',':','-','s','']+stopwords
>>> stopwords[:20]
['.', ',', ';', '(', ')', '[', ']', '"', '=', '?', '!', ':', '-', 's', '', 'a', 'about', 'above', 'across', 'after']
>>> bibDE = "Graph algorithms; Pattern matching; Clustering; Computer vision"
>>> bibID = "EDIT-DISTANCE; PATTERN-RECOGNITION; CLASSIFICATION; ALGORITHMS; REGULARIZATION; SEGMENTATION; ISOMORPHISM; SUBGRAPH; SEARCH; IMAGES"
>>> bibTI = ""; bibAB = ""
>>> ML = MontyLingua.MontyLingua()
****** MontyLingua v.2.1 ******
***** by hugo@media.mit.edu *****
Lemmatiser OK!
Custom Lexicon Found! Now Loading!
Fast Lexicon Found! Now Loading!
Lexicon OK!
LexicalRuleParser OK!
ContextualRuleParser OK!
Commonsense OK!
Semantic Interpreter OK!
Loading Morph Dictionary!
*********************************
>>> words = lemmatize(ML,(bibTI+'. '+bibID+';'+bibDE+'. '+bibAB).lower().replace("'"," "),stopwords)
>>> words
['segmentation', 'pattern-recognition', 'classification', 'algorithm', 'cluster', 'search', 'image', 'regularization', 'edit-distance', 'isomorphism', 'subgraph', 'vision', 'graph', 'pattern', 'match']
>>> words = lemmatize(ML,(bibTI+'. '+bibID+';'+bibDE+'. '+bibAB).lower().replace("'"," ").replace("-"," "),stopwords)
>>> words
['distance', 'regularization', 'isomorphism', 'classification', 'algorithm', 'edit', 'pattern', 'image', 'segmentation', 'vision', 'cluster', 'subgraph', 'match', 'graph', 'search', 'recognition']
>>> import re
>>> a = 'lkdfhisoe78347834 (())&/&745 '
>>> re.sub('[^0-9]','', a)
'78347834745'
>>>
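The session above is Python 2 and depends on the MontyLingua lexicon, which is no longer easy to obtain. As a rough illustration only, here is a minimal Python 3 sketch of the same keyword cleaning; plain tokenization stands in for MontyLingua's lemmatizer (so inflected forms such as "algorithms" are not reduced), and the tiny stopword set is a placeholder for StopWords.dat:

```python
import re

def clean_keywords(text, stopwords, strip_hyphens=False):
    """Lower-case, optionally split hyphenated terms, tokenize,
    and drop stop words. Plain tokenization stands in for the
    MontyLingua lemmatizer, so no inflection is reduced here."""
    text = text.lower().replace("'", " ")
    if strip_hyphens:
        text = text.replace("-", " ")
    tokens = re.findall(r"[a-z][a-z-]*", text)
    return sorted(set(tokens) - set(stopwords))

# Placeholder stop words; the real run reads them from StopWords.dat.
stopwords = {"a", "and", "of", "s", ""}
bibDE = "Graph algorithms; Pattern matching; Clustering; Computer vision"
bibID = "EDIT-DISTANCE; PATTERN-RECOGNITION; CLASSIFICATION"
print(clean_keywords(bibID + ";" + bibDE, stopwords))
print(clean_keywords(bibID + ";" + bibDE, stopwords, strip_hyphens=True))
```

As in the session above, running the cleaning a second time with hyphens replaced by spaces splits terms like edit-distance into separate words.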
We applied the new WoS2Pajek 1.5 to our data on February 23, 2017.
                       2015/05/16   2017/01/06   2017/02/22   2017/02/23
number of works             75249       112114       117047       117082
number of authors           44787        60419        62143        62143
number of journals           8993        12271        12651        12652
number of keywords          10095        12715        12913        10269
number of records            2944         5472         6953         6953
number of duplicates            1           62         1255         1255
Cite n = 117082                    m = 196406   loops = 12   multiple lines = 555   AvDegree = 3.35501614
WA   n = 179225 = 117082 + 62143   m = 137888   multiple lines = 4937
WJ   n = 129734 = 117082 + 12652   m = 118818   multiple lines = 1655
WK   n = 127351 = 117082 + 10269   m = 108747   multiple lines = 19777
Here is a recipe for removing the nodes whose labels contain
ANONYM
Network/Create Partition/Vertex Labels Matching Regular Expression ANONYM
Operations/Network+Partition/Extract Subnetwork [1]
If you have several such patterns, first determine a corresponding partition for each of them and then combine them using
Partitions/Min (First, Second)
into the partition used for extraction.
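The combination step can be sketched in code. The snippet below assumes a 0/1 convention (0 for labels matching a pattern, 1 otherwise; Pajek's actual class numbering may differ), so that Min over the pattern partitions marks a node for removal as soon as any pattern matches; the second pattern NO_TITLE is hypothetical:

```python
import re

# Hypothetical node labels; in the recipe the pattern is ANONYM.
labels = ["BLONDEL_V(2008):P10008", "[ANONYMO(2009):", "ZACHARY_W(1977):452"]
patterns = ["ANONYM", "NO_TITLE"]   # several patterns, as in the text

# One 0/1 partition per pattern: 0 = label matches (drop), 1 = keep.
parts = [[0 if re.search(p, lab) else 1 for lab in labels] for p in patterns]

# Partitions/Min (First, Second): a node survives only if it matches
# no pattern, i.e. its minimum over all pattern partitions is 1.
combined = [min(vals) for vals in zip(*parts)]
keep = [lab for lab, c in zip(labels, combined) if c == 1]
print(keep)   # the node set of the extracted subnetwork
```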
In Pajek we first determine the subset of nodes with DC = 0 and with indegree > 0 in the citation network. We list the nodes with the largest indegrees:
==============================================================================
2. Extracting from V1 vertices determined by C5 [1] (111128)
==============================================================================
Dimension: 111128
The lowest value:    1.0000
The highest value: 614.0000

Highest values:
Rank  Vertex     Value  Id
--------------------------------------
   1   33437  614.0000  BLONDEL_V(2008):P10008
   2   24339  310.0000  DANON_L(2005):P09008
   3   24608  286.0000  NEWMAN_M(2004)69:1
   4   33718  284.0000  LANCICHI_A(2009)11:033015
   5   19509  121.0000  NEWMAN_M(2001)64:026118
   6   41304  119.0000  GREGORY_S(2010)12:103018
   7   29909  115.0000  ARENAS_A(2008)10:053039
   8   20549   79.0000  BADER_G(2003)4:2
   9   22127   72.0000  DONETTI_L(2004):P10012
  10    3459   68.0000  MACQUEEN_J(1967):281
  11    3448   68.0000  ERDOS_P(1960)5:17
  12   32426   56.0000  NG_A(2002)14:849
  13   34911   56.0000  NICOSIA_V(2009):P03024
  14   30529   55.0000  ARENAS_A(2007)9:176
  15   21134   55.0000  NEWMAN_M(2004)69:
  16   38741   55.0000  [ANONYMO(2009):
  17   47877   52.0000  ZACHARY_W(1977):452
  18   49529   51.0000  [ANONYMO(2012):
  19   40077   51.0000  ZHOU_Y(2009)2:718
  20   65418   48.0000  [ANONYMO(2014):
  21   55282   46.0000  [ANONYMO(2013):
  22   27386   46.0000  BROHEE_S(2006)7:488
  23   48551   45.0000  DANON_L(2005)2005:P09008
  24   42179   45.0000  [ANONYMO(2011):
  25   20651   43.0000  WATTS_D(1999):
  26   48178   42.0000  YANG_J(2012):
  27   21561   41.0000  MEADE_B(2005)110:2004JB003209
  28   41111   41.0000  GREGORY_S(2007)4702:91
  29   31098   39.0000  [ANONYMO(2007):
  30     254   38.0000  MILGRAM_S(1967)1:61
  31   51449   38.0000  XIE_J(2011):344
  32   55721   36.0000  CSARDI_G(2006)1695:1695
  33   41519   36.0000  LEE_C(2010):
  34     371   35.0000  HARTIGAN_J(1975):
  35   36260   35.0000  BAGROW_J(2008):P05001
  36   20189   34.0000  WHITE_J(1986)314:1
  37   30492   33.0000  FORTUNAT_S(2009):
  38   39778   33.0000  TANTIPAT_C(2007):717
  39   38087   33.0000  LESKOVEC_J(2010):
  40   25278   32.0000  [ANONYMO(2006):
  41   38729   32.0000  [ANONYMO(2008):
  42   31746   31.0000  ALTAF-UL_M(2006)7:207
  43   28827   31.0000  KEMPE_D(2003):137
  44   25636   30.0000  MCCAFFRE_R(2005)110:2004JB003307
  45   48391   30.0000  COSCIA_M(2011)4:514
  46   46203   30.0000  ANDERSEN_R(2006):475
  47    4508   30.0000  ORGATTI_S(1992)22:1
  48   57543   29.0000  XIE_J(2012):25
  49   25517   29.0000  SCHWARZ_G(1978)6:461
  50   23563   29.0000  JEH_G(2002):538
  51     344   29.0000  BREIGER_R(1976)41:117
  52   21863   29.0000  MCCAFFRE_R(2002)30:101
  53   40703   29.0000  GREGORY_S(2008)5211:408
  54   19511   29.0000  PASTOR-S_R(2001)87:258701
  55   19039   29.0000  [ANONYMO(2003):
  56   29492   28.0000  HAN_J(2006):
  57   29444   28.0000  JORDAN_M(1999)37:183
  58   57391   28.0000  HOLME_P(2012)519:97
  59   25218   28.0000  DERENYI_I(2005)94:160202
  60   25091   28.0000  ZHOU_H(2003)67:041908
  61   48303   28.0000  BLONDEL_V(2008)2008:
  62   42182   28.0000  [ANONYMO(2010):
  63   20653   28.0000  BLATT_M(1996)76:3251
  64   19513   28.0000  SHEN-ORR_S(2002)31:64
  65   36257   28.0000  SCHUETZ_P(2008)77:046112
  66   35632   28.0000  RUAN_J(2008)77:016104
  67   34654   28.0000  DONGEN_S(2000):
  68   33774   28.0000  ZHANG_X(2009)87:38002
  69   54992   27.0000  WASSER_M(1994)8:
  70   54964   27.0000  MOSSEL_E(2012):
  71   49527   27.0000  LUXBURG_U(2007)17:395
  72   23395   27.0000  BANSAL_N(2004)56:89
  73   46085   27.0000  XIE_J(2011):
  74   44655   27.0000  WANG_F(2011)22:493
  75   22132   27.0000  MASSEN_C(2005)71:046101
  76    2678   27.0000  GUTENBER_B(1956)9:1
  77   38584   27.0000  HUANG_D(2009)4:44
  78   17690   27.0000  HARTIGAN_J(1979)28:100
  79   17624   27.0000  GAVIN_A(2002)415:141
  80    1063   27.0000  PRICE_D(1965)149:510
  81   28486   26.0000  DHILLON_I(2001):269
  82   25534   26.0000  [ANONYMO(2004):
  83   49698   26.0000  TANG_L(2010):
  84   24635   26.0000  WHITE_S(2005):
  85   24614   26.0000  BAUMES_J(2005)3495:27
  86   48065   26.0000  CHEN_J(2009):237
  87   47770   26.0000  JAIN_A(2010)31:651
  88    2910   26.0000  PACHECO_J(1992)355:71
  89   43045   26.0000  MARIADAS_M(2010)4:715
  90   41881   26.0000  MIN-SOO_K(2009):622
  91   41567   26.0000  ZHANG_Y(2009):997
  92   20550   26.0000  BU_D(2003)31:2443
  93   77382   26.0000  LESKOVEC_J(2014):
  94   19037   26.0000  BOLLOBAS_B(1998):
  95    1163   26.0000  COLEMAN_J(1966):
  96   35371   26.0000  TASGIN_M(2006):
  97   17626   26.0000  HO_Y(2002)415:180
  98    3317   25.0000  RUNDLE_J(1977)67:1363
  99   24348   25.0000  HEINZELM_W(2002)1:660
 100   23988   25.0000  KRAUSE_A(2003)426:282
Using the Pajek option Network/Info/Vertex Label → Vertex Number, we enter the initial part of a selected work's name, for example GREGORY_S(2010). By repeatedly pressing the Retry button we browse through all works whose names contain the given substring:
Input: GREGORY_S(2010)
Possible Matching: GREGORY_S(2010)12:103018 (43356)
Input: GREGORY_S(2010)
Possible Matching: GREGORY_S(2010): (48362)
Input: GREGORY_S(2010)
Possible Matching: GREGORY_S(2010)12: (53996)
Input: GREGORY_S(2010)
Possible Matching: GREGORY_S(2010)112:10301 (61292)
Input: GREGORY_S(2010)
Possible Matching: GREGORY_S(2010)12:1 (80659)
Input: GREGORY_S(2010)
Possible Matching: GREGORY_S(2010)12:2011 (99518)
Input: GREGORY_S(2010)
Possible Matching: GREGORY_S(2010)12:1088/1367-2630/12/10/103018 (108469)
Input: GREGORY_S(2010)
Possible Matching: GREGORY_S(2010)12:103018 (43356)
There are two possibilities for correcting the data:
We used the second option. For the works with high counts (>= 30) we prepared lists of possible equivalents and manually determined the equivalence classes:
117082
"BLONDEL_V(2008)" 32075 35083 85732 86630 115732 60511 69369 81200 83974 79999 65691 79303 105754
"BLONDEL_V(2008)" 37562 50714 51071
"DANON_L(2005)" 23555 25583 43537 50989 80658 23543 40201 51079 59574 104228
"NEWMAN_M(2004)" 25896 28830 108141 80880 73541 20520 72933 70805 48370 63293
"NEWMAN_M(2004)" 20504 115510 59630 84387 85607
"NEWMAN_M(2004)" 25895 35477
"LANCICHI_A(2009)" 108114 35383 41217 43566 61424 65188 82754 99520
"LANCICHI_A(2009)" 35386 69262 93922 97483
"LANCICHI_A(2009)" 36243 69337 36768 49668
"NEWMAN_M(2001)" 28968 26373
"NEWMAN_M(2001)" 21998 47666
...
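Preparing the lists of possible equivalents can be partly automated. A hypothetical sketch: group the work names by their AUTHOR(YEAR) prefix and inspect the groups with more than one member as candidate equivalence classes (the labels below are examples from the listing above; the real input would be the full set of node labels):

```python
import re
from collections import defaultdict

# Example labels; the real input is the full label set of the network.
labels = ["GREGORY_S(2010)12:103018", "GREGORY_S(2010):", "GREGORY_S(2010)12:",
          "BLONDEL_V(2008):P10008", "BLONDEL_V(2008)2008:"]

# Group works by the AUTHOR(YEAR) prefix; groups of size > 1 are
# candidate equivalence classes to be checked manually.
groups = defaultdict(list)
for lab in labels:
    m = re.match(r"\[?[^(]+\(\d{4}\)", lab)
    if m:
        groups[m.group(0)].append(lab)

for prefix, members in groups.items():
    if len(members) > 1:
        print(prefix, members)
```

The manual check (against, e.g., Google Scholar) remains necessary, since works by the same author in the same year need not be the same work.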
We found the service https://scholar.google.com/citations quite helpful.
For a class representative select a work with DC > 0.
With a simple program in Python
# Python 2: convert the manually prepared equivalence classes
# (worksEQl.txt) into a Pajek partition file (worksEQ.clu)
inp = open("C:/Users/batagelj/work/Python/WoS/BM/worksEQl.txt","r")
line = next(inp)
n = int(line)                    # first line: number of works
P = list(range(1,n+1))           # start with the identity partition
for line in inp:
    L = line.split()
    if len(L) > 2:               # a class with at least two members
        I = [int(e) for e in L[1:]]
        print L[0]               # class label
        j = I[0]                 # representative work
        for i in I[1:]: P[i-1] = j
inp.close()
clu = open("C:/Users/batagelj/work/Python/WoS/BM/worksEQ.clu","w")
clu.write("*vertices "+str(n)+"\n")
for i in P: clu.write(str(i)+"\n")
clu.close()
we produced the Pajek partition file worksEQ.clu used in Pajek for shrinking the set of works.
Using the partition p = worksEQ, p : V → C, we shrink the citation network Cite to CiteR. As a byproduct we also get a partition q : VC → V such that q(v) = u ⇒ p(u) = v.
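The shrinking step can be illustrated on a toy example (assumed data, not our networks): replace each arc's endpoints by their classes under p, discard the resulting loops, and record one representative node per class to obtain q:

```python
# p maps each node of V = {1..6} to its class in C = {1,2,3}.
p = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}          # p : V -> C
arcs = [(1, 3), (2, 5), (3, 6), (4, 5)]            # a toy citation network

# Shrunk network: endpoints replaced by their classes, loops dropped.
citeR = {(p[u], p[v]) for (u, v) in arcs if p[u] != p[v]}

# Byproduct q : VC -> V picks one representative node per class,
# so that q(c) = u implies p(u) = c.
q = {}
for u, c in p.items():
    q.setdefault(c, u)

print(sorted(citeR))   # [(1, 2), (1, 3), (2, 3)]
print(q)               # {1: 1, 2: 3, 3: 5}
```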
We also have to shrink the partitions year and DC, and the vector NP.
In general, given a mapping s : V → B, we seek a mapping r : VC → B such that q(v) = u implies s(u) = r(v). Therefore r(v) = s(u) = s(q(v)) = (q*s)(v), or equivalently r = q*s (in Pajek's First*Second composition, the First mapping q is applied first).
In Pajek, given a mapping q : VC → V, the mapping r can be determined as follows:
select partition q as First partition
select partition s as Second partition
Partitions/Functional Composition First*Second
or
select partition q as First partition
select vector s as First vector
Operations/Vector+Partition/Functional Composition Partition*Vector
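On toy data (assumed, not our networks) the composition r = q*s amounts to a single lookup chain:

```python
# q : VC -> V maps each class to its representative node;
# s : V -> B is, e.g., the publication year of each original node.
q = {1: 1, 2: 3, 3: 5}
s = {1: 2001, 2: 2001, 3: 2004,
     4: 2004, 5: 2008, 6: 2008}

# r(v) = s(q(v)): map the class to its representative, then apply s.
r = {c: s[u] for c, u in q.items()}
print(r)   # {1: 2001, 2: 2004, 3: 2008}
```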
Using the partition q = worksEQq we computed the partitions YearR and DCR, and the vector NPR.
WAr   n = 179049 = 116906 + 62143   m = 132776   AveDegree = 1.48312473
WKr   n = 127175 = 116906 + 10269   m =  88965   AveDegree = 1.39909573
WJr   n = 129558 = 116906 + 12652   m = 117044   AveDegree = 1.80682011
CiteR n = 116906                    m = 195784   AveDegree = 3.34942603
The network CiteR has 116906 nodes and 195784 arcs.
indegree    freq
       0    4070
       1   93246
       2   10694
       3    3352
       4    1610
Most of the nodes are terminal nodes (DCR = 0) cited only once (indegree = 1). Because of the boundary problem we decided to include in our networks the nodes with DCR > 0 or indegree > 2 (partition boundary). They determine a subnetwork CiteB with 13540 nodes and 82238 arcs.
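The boundary rule can be sketched on assumed toy data: keep the nodes with a description (DCR > 0) or with indegree > 2, and take the induced subnetwork:

```python
# Toy data: DCR values and indegrees of four nodes (assumed, for
# illustration; the real values come from the DCR partition and CiteR).
DCR   = {1: 1, 2: 0, 3: 0, 4: 0}
indeg = {1: 0, 2: 1, 3: 3, 4: 2}

# Partition boundary: DCR > 0 or indegree > 2.
boundary = {v for v in DCR if DCR[v] > 0 or indeg[v] > 2}
print(sorted(boundary))   # [1, 3]

# CiteB is the subnetwork induced by the boundary nodes.
arcs = [(1, 2), (1, 3), (3, 4), (3, 1)]
citeB = [(u, v) for (u, v) in arcs if u in boundary and v in boundary]
print(citeB)              # [(1, 3), (3, 1)]
```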
For the analysis of authors, keywords and journals we have complete data (besides the CR field) exactly for the works with DC > 0. In the two-mode networks we preserve in the second node set only the nodes with indegree > 0. The new dataset: CiteC, WAc, WKc, WJc, yearC, NPc.
To get WAc from WAr:
read partition DCr
select network WAr
Partition/Create Constant Partition [62143,0] = C1
select DCr as First partition
select C1 as Second partition
Partitions/Fuse Partitions
Operations/Network+Partition/Extract Subnetwork [1-*]
Partition/Create Constant Partition [5695,1] = C2
select C1 as Second partition
Partitions/Fuse Partitions
Partition/Binarize partition [1] = C3
Network/Create Partition/Degree/Input
select C3 as Second partition
Partitions/Max (First,Second)
Operations/Network+Partition/Extract Subnetwork [1-*]
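The effect of this recipe can be sketched on assumed toy data: first restrict the works (first mode) to those with DC > 0, then keep only the authors that remain linked to some kept work (indegree > 0 in the reduced network):

```python
# Toy two-mode network (assumed data): works w1..w3 linked to authors.
DC    = {"w1": 1, "w2": 0, "w3": 2}
edges = [("w1", "a1"), ("w2", "a2"), ("w3", "a1")]

# Keep only edges whose work has a description (DC > 0) ...
WAc = [(w, a) for (w, a) in edges if DC[w] > 0]
# ... and only the authors still linked to a kept work.
authors = {a for (_, a) in WAc}
print(WAc)       # [('w1', 'a1'), ('w3', 'a1')]
print(authors)   # {'a1'}
```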
In the same way we also get the networks WKc and WJc.
WAc   n =  19071 = 5695 + 13376   m = 21562   AveDegree =  2.26123433
WKc   n =  15964 = 5695 + 10269   m = 88953   AveDegree = 11.14419945
WJc   n =   7451 = 5695 +  1756   m =  5815   AveDegree =  1.56086431
CiteC n =   5695                  m = 38400   AveDegree = 13.48551361