Data collection and cleaning

Data collection

We first collected hits for the query

"block model*" or "network cluster*" or "graph cluster*" or "community detect*" or 
"blockmodel*" or "block-model*" or "structural equival*" or "regular equival*"

from WoS on May 16, 2015.

On January 6, 2017 we made an update for the years 2014-2017, and another update for the years 2015-2017 on February 22, 2017. We also manually prepared descriptions, with the title and the complete list of authors, for the “terminal” nodes with large counts and added them to the WoS file.

WoS2Pajek 1.5 - keywords

We decided to make some improvements to the program WoS2Pajek:

  • split keyword phrases;
  • use the last part of the DI field if BP is missing;
  • ensure that the PY info is numeric.
>>> import sys; wdir = r'c:\users\Batagelj\work\Python\WoS'; sys.path.append(wdir)
>>> MLdir = r'c:\Python27\Lib\site-packages\MontyLingua-2.1\Python'
>>> import os; os.chdir(wdir); sys.path.append(MLdir)
>>> import MontyLingua

>>> def lemmatize(ML,ab,stopwords):
  # split the text into sentences; tokenize and POS-tag each sentence
  sLto = [ML.tokenize(st) for st in ML.split_sentences(ab.lower())]
  sLta = [ML.tag_tokenized(t) for t in sLto]
  # lemmatize the tagged tokens; each token is returned as word/TAG/lemma
  lem = [ML.lemmatise_tagged(t) for t in sLta]
  lemas = [s.split('/')[2] for s in ' '.join(lem).split(' ')]
  # drop stopwords and remove duplicates
  return list(set(dropList(lemas,stopwords)))
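The helper dropList used in lemmatize is not defined in this snippet. A minimal version, assuming it simply filters the stopwords out of a token list, could look like:

```python
# Hypothetical helper assumed by lemmatize: keep only the tokens
# that do not appear in the stopword list, preserving order.
def dropList(tokens, stopwords):
    stop = set(stopwords)              # set membership test is O(1)
    return [t for t in tokens if t not in stop]

# Example: punctuation and common words are dropped.
print(dropList(['graph', 'the', 'cluster', ','], ['the', ',']))
# → ['graph', 'cluster']
```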

>>> resrc = os.path.join(wdir, "resources/")
>>> stopwords = open(resrc+'StopWords.dat', 'r').read().lower().split()
>>> stopwords = ['.',',',';','(',')','[',']','"','=','?','!',':','-','s','']+stopwords
>>> stopwords[:20]
['.', ',', ';', '(', ')', '[', ']', '"', '=', '?', '!', ':', '-', 's', '', 'a', 'about', 
    'above', 'across', 'after']
>>> bibDE = "Graph algorithms; Pattern matching; Clustering; Computer vision"
>>> bibID = "EDIT-DISTANCE; PATTERN-RECOGNITION; CLASSIFICATION; ALGORITHMS; REGULARIZATION; 
    SEGMENTATION; ISOMORPHISM; SUBGRAPH; SEARCH; IMAGES"
>>> bibTI = ""; bibAB = ""
>>> ML = MontyLingua.MontyLingua()

****** MontyLingua v.2.1 ******
***** by hugo@media.mit.edu *****
Lemmatiser OK!
Custom Lexicon Found! Now Loading!
Fast Lexicon Found! Now Loading!
Lexicon OK!
LexicalRuleParser OK!
ContextualRuleParser OK!
Commonsense OK!
Semantic Interpreter OK!
Loading Morph Dictionary!
*********************************

>>> words = lemmatize(ML,(bibTI+'. '+bibID+';'+bibDE+'. '+bibAB).lower().replace("'"," "),stopwords)
>>> words
['segmentation', 'pattern-recognition', 'classification', 'algorithm', 'cluster', 'search', 'image', 
 'regularization', 'edit-distance', 'isomorphism', 'subgraph', 'vision', 'graph', 'pattern', 'match']
>>> words = lemmatize(ML,
 (bibTI+'. '+bibID+';'+bibDE+'. '+bibAB).lower().replace("'"," ").replace("-"," "),stopwords)
>>> words
['distance', 'regularization', 'isomorphism', 'classification', 'algorithm', 'edit', 'pattern', 
 'image', 'segmentation', 'vision', 'cluster', 'subgraph', 'match', 'graph', 'search', 'recognition']
>>> 
>>> import re
>>> a = 'lkdfhisoe78347834 (())&/&745  '
>>> re.sub('[^0-9]','', a)
'78347834745'
>>> 
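The digit-extraction shown above serves the second improvement on the list: when the BP (beginning page) field is missing, the trailing part of the DI (DOI) field can be used instead. A minimal sketch of such a fallback — the function name and field values are illustrative, not the actual WoS2Pajek code:

```python
import re

def page_from_record(bp, di):
    # Prefer the BP (beginning page) field; if it is empty,
    # fall back to the digits in the last part of the DI (DOI) field.
    if bp.strip():
        return bp.strip()
    return re.sub('[^0-9]', '', di.split('/')[-1])

print(page_from_record('', '10.1088/1367-2630/12/10/103018'))  # → 103018
print(page_from_record('91', '10.1007/978-3-540-74970-7'))     # → 91
```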

We applied the new WoS2Pajek 1.5 to our data on February 23, 2017.

                        2015/05/16  2017/01/06  2017/02/22  2017/02/23
number of works      =       75249      112114      117047      117082
number of authors    =       44787       60419       62143       62143
number of journals   =        8993       12271       12651       12652
number of keywords   =       10095       12715       12913       10269
number of records    =        2944        5472        6953        6953
number of duplicates =           1          62        1255        1255
Cite  n = 117082                     m = 196406    loops = 12    multiple lines = 555    AvDegree = 3.35501614
WA    n = 179225 = 117082 + 62143    m = 137888    multiple lines = 4937
WJ    n = 129734 = 117082 + 12652    m = 118818    multiple lines = 1655
WK    n = 127351 = 117082 + 10269    m = 108747    multiple lines = 19777

Removing nodes with label patterns

Here is a recipe for removing nodes whose labels contain ANONYM:

Network/Create Partition/Vertex Labels Matching Regular Expression   ANONYM
Operations/Network+Partition/Extract Subnetwork [1]

If you have several such patterns, you first determine a corresponding partition for each of them and afterwards combine them using

Partitions/Min (First, Second)

into the partition used for extraction.
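Outside Pajek, the same combination can be sketched in Python (the labels and patterns below are illustrative): each pattern yields a 0/1 partition over the vertex labels (0 = matches, 1 = keep), and the element-wise minimum puts a node matching any pattern into class 0.

```python
import re

# Hypothetical vertex labels; in Pajek these are read from the network.
labels = ['BLONDEL_V(2008):P10008', '[ANONYMO(2009):', 'UNKNOWN_X(2011):1']
patterns = ['ANONYM', 'UNKNOWN']

# One 0/1 partition per pattern: 0 = label matches, 1 = keep.
parts = [[0 if re.search(p, lab) else 1 for lab in labels]
         for p in patterns]

# Combine with Min: a node matching any pattern ends up in class 0;
# extracting class 1 then removes all matched nodes at once.
combined = [min(vals) for vals in zip(*parts)]
print(combined)  # → [1, 0, 0]
```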

Equivalent works

In Pajek we first determine the subset of nodes with DC = 0 and indegree > 0 in the citation network. We list the nodes with the largest indegrees:

==============================================================================
2. Extracting from V1 vertices determined by C5 [1] (111128)
==============================================================================
Dimension: 111128
The lowest value:                         1.0000
The highest value:                      614.0000

Highest values:

  Rank    Vertex         Value   Id
--------------------------------------
     1     33437      614.0000   BLONDEL_V(2008):P10008
     2     24339      310.0000   DANON_L(2005):P09008
     3     24608      286.0000   NEWMAN_M(2004)69:1
     4     33718      284.0000   LANCICHI_A(2009)11:033015
     5     19509      121.0000   NEWMAN_M(2001)64:026118
     6     41304      119.0000   GREGORY_S(2010)12:103018
     7     29909      115.0000   ARENAS_A(2008)10:053039
     8     20549       79.0000   BADER_G(2003)4:2
     9     22127       72.0000   DONETTI_L(2004):P10012
    10      3459       68.0000   MACQUEEN_J(1967):281
    11      3448       68.0000   ERDOS_P(1960)5:17
    12     32426       56.0000   NG_A(2002)14:849
    13     34911       56.0000   NICOSIA_V(2009):P03024
    14     30529       55.0000   ARENAS_A(2007)9:176
    15     21134       55.0000   NEWMAN_M(2004)69:
    16     38741       55.0000   [ANONYMO(2009):
    17     47877       52.0000   ZACHARY_W(1977):452
    18     49529       51.0000   [ANONYMO(2012):
    19     40077       51.0000   ZHOU_Y(2009)2:718
    20     65418       48.0000   [ANONYMO(2014):
    21     55282       46.0000   [ANONYMO(2013):
    22     27386       46.0000   BROHEE_S(2006)7:488
    23     48551       45.0000   DANON_L(2005)2005:P09008
    24     42179       45.0000   [ANONYMO(2011):
    25     20651       43.0000   WATTS_D(1999):
    26     48178       42.0000   YANG_J(2012):
    27     21561       41.0000   MEADE_B(2005)110:2004JB003209
    28     41111       41.0000   GREGORY_S(2007)4702:91
    29     31098       39.0000   [ANONYMO(2007):
    30       254       38.0000   MILGRAM_S(1967)1:61
    31     51449       38.0000   XIE_J(2011):344
    32     55721       36.0000   CSARDI_G(2006)1695:1695
    33     41519       36.0000   LEE_C(2010):
    34       371       35.0000   HARTIGAN_J(1975):
    35     36260       35.0000   BAGROW_J(2008):P05001
    36     20189       34.0000   WHITE_J(1986)314:1
    37     30492       33.0000   FORTUNAT_S(2009):
    38     39778       33.0000   TANTIPAT_C(2007):717
    39     38087       33.0000   LESKOVEC_J(2010):
    40     25278       32.0000   [ANONYMO(2006):
    41     38729       32.0000   [ANONYMO(2008):
    42     31746       31.0000   ALTAF-UL_M(2006)7:207
    43     28827       31.0000   KEMPE_D(2003):137
    44     25636       30.0000   MCCAFFRE_R(2005)110:2004JB003307
    45     48391       30.0000   COSCIA_M(2011)4:514
    46     46203       30.0000   ANDERSEN_R(2006):475
    47      4508       30.0000   ORGATTI_S(1992)22:1
    48     57543       29.0000   XIE_J(2012):25
    49     25517       29.0000   SCHWARZ_G(1978)6:461
    50     23563       29.0000   JEH_G(2002):538
    51       344       29.0000   BREIGER_R(1976)41:117
    52     21863       29.0000   MCCAFFRE_R(2002)30:101
    53     40703       29.0000   GREGORY_S(2008)5211:408
    54     19511       29.0000   PASTOR-S_R(2001)87:258701
    55     19039       29.0000   [ANONYMO(2003):
    56     29492       28.0000   HAN_J(2006):
    57     29444       28.0000   JORDAN_M(1999)37:183
    58     57391       28.0000   HOLME_P(2012)519:97
    59     25218       28.0000   DERENYI_I(2005)94:160202
    60     25091       28.0000   ZHOU_H(2003)67:041908
    61     48303       28.0000   BLONDEL_V(2008)2008:
    62     42182       28.0000   [ANONYMO(2010):
    63     20653       28.0000   BLATT_M(1996)76:3251
    64     19513       28.0000   SHEN-ORR_S(2002)31:64
    65     36257       28.0000   SCHUETZ_P(2008)77:046112
    66     35632       28.0000   RUAN_J(2008)77:016104
    67     34654       28.0000   DONGEN_S(2000):
    68     33774       28.0000   ZHANG_X(2009)87:38002
    69     54992       27.0000   WASSER_M(1994)8:
    70     54964       27.0000   MOSSEL_E(2012):
    71     49527       27.0000   LUXBURG_U(2007)17:395
    72     23395       27.0000   BANSAL_N(2004)56:89
    73     46085       27.0000   XIE_J(2011):
    74     44655       27.0000   WANG_F(2011)22:493
    75     22132       27.0000   MASSEN_C(2005)71:046101
    76      2678       27.0000   GUTENBER_B(1956)9:1
    77     38584       27.0000   HUANG_D(2009)4:44
    78     17690       27.0000   HARTIGAN_J(1979)28:100
    79     17624       27.0000   GAVIN_A(2002)415:141
    80      1063       27.0000   PRICE_D(1965)149:510
    81     28486       26.0000   DHILLON_I(2001):269
    82     25534       26.0000   [ANONYMO(2004):
    83     49698       26.0000   TANG_L(2010):
    84     24635       26.0000   WHITE_S(2005):
    85     24614       26.0000   BAUMES_J(2005)3495:27
    86     48065       26.0000   CHEN_J(2009):237
    87     47770       26.0000   JAIN_A(2010)31:651
    88      2910       26.0000   PACHECO_J(1992)355:71
    89     43045       26.0000   MARIADAS_M(2010)4:715
    90     41881       26.0000   MIN-SOO_K(2009):622
    91     41567       26.0000   ZHANG_Y(2009):997
    92     20550       26.0000   BU_D(2003)31:2443
    93     77382       26.0000   LESKOVEC_J(2014):
    94     19037       26.0000   BOLLOBAS_B(1998):
    95      1163       26.0000   COLEMAN_J(1966):
    96     35371       26.0000   TASGIN_M(2006):
    97     17626       26.0000   HO_Y(2002)415:180
    98      3317       25.0000   RUNDLE_J(1977)67:1363
    99     24348       25.0000   HEINZELM_W(2002)1:660
   100     23988       25.0000   KRAUSE_A(2003)426:282

Using in Pajek the option Network/Info/Vertex Label → Vertex Number, we enter the initial part of a selected work's name, for example GREGORY_S(2010). Afterwards, by repeatedly pressing the Retry button, we browse through all works whose names contain the given substring:

Input: GREGORY_S(2010)
Possible Matching: GREGORY_S(2010)12:103018 (43356)

Input: GREGORY_S(2010)
Possible Matching: GREGORY_S(2010): (48362)

Input: GREGORY_S(2010)
Possible Matching: GREGORY_S(2010)12: (53996)

Input: GREGORY_S(2010)
Possible Matching: GREGORY_S(2010)112:10301 (61292)

Input: GREGORY_S(2010)
Possible Matching: GREGORY_S(2010)12:1 (80659)

Input: GREGORY_S(2010)
Possible Matching: GREGORY_S(2010)12:2011 (99518)

Input: GREGORY_S(2010)
Possible Matching: GREGORY_S(2010)12:1088/1367-2630/12/10/103018 (108469)

Input: GREGORY_S(2010)
Possible Matching: GREGORY_S(2010)12:103018 (43356)

There are two ways to correct the data:

  • make corrections in the local copy of original data (WoS file);
  • make the equivalence partition of nodes and shrink the set of works accordingly in all obtained networks.

We used the second option. For the works with high counts (≥ 30) we prepared lists of possible equivalents and manually determined the equivalence classes.

117082
"BLONDEL_V(2008)"  32075    35083 85732 86630 115732   60511 69369 81200 83974 79999 65691 79303 105754
"BLONDEL_V(2008)"  37562    50714 51071
"DANON_L(2005)"    23555    25583 43537 50989 80658    23543 40201 51079 59574 104228
"NEWMAN_M(2004)"   25896    28830 108141 80880 73541   20520 72933 70805 48370 63293
"NEWMAN_M(2004)"   20504    115510 59630 84387 85607
"NEWMAN_M(2004)"   25895    35477
"LANCICHI_A(2009)" 108114   35383 41217 43566 61424 65188 82754 99520
"LANCICHI_A(2009)" 35386    69262 93922 97483
"LANCICHI_A(2009)" 36243    69337 36768 49668
"NEWMAN_M(2001)"   28968    26373
"NEWMAN_M(2001)"   21998    47666
...

We found the service https://scholar.google.com/citations quite helpful.

As a class representative we select a work with DC > 0.

With a simple program in Python

# build the equivalence partition of works from the manually
# prepared file of equivalence classes (worksEQl.txt)
inp = open("C:/Users/batagelj/work/Python/WoS/BM/worksEQl.txt","r")
line = next(inp)
n = int(line)                        # number of works
P = list(range(1,n+1))               # initially each work is its own class
for line in inp:
   L = line.split()
   if len(L) > 2:
      I = [ int(e) for e in L[1:] ]
      print(L[0])                    # name of the processed class
      j = I[0]                       # class representative
      for i in I[1:]: P[i-1] = j     # map equivalents to the representative
inp.close()
# write the partition in Pajek's .clu format
clu = open("C:/Users/batagelj/work/Python/WoS/BM/worksEQ.clu","w")
clu.write("*vertices "+str(n)+"\n")
for i in P: clu.write(str(i)+"\n")
clu.close()

we produced the Pajek partition file worksEQ.clu, used in Pajek for shrinking the set of works.

Reduced networks

Using the partition p = worksEQ, p : V → C, we shrink the citation network Cite to CiteR. As a byproduct we also get a partition q : VC → V such that q(v) = u ⇒ p(u) = v.

We also have to shrink the partitions year and DC and the vector NP.

In general, given a mapping s : V → B, we seek a mapping r : VC → B such that q(v) = u implies s(u) = r(v). Therefore r(v) = s(u) = s(q(v)) = (q*s)(v), or equivalently r = q*s.
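A minimal Python sketch of this composition, with partitions stored Pajek-style as lists indexed by vertex number (the toy values are illustrative):

```python
# Functional composition of partitions: given q : VC -> V and
# s : V -> B, compute r = q*s with r(v) = s(q(v)).
# Lists are 0-based, vertex ids are 1-based, hence the u - 1.
def compose(q, s):
    return [s[u - 1] for u in q]   # u = q(v) is a 1-based vertex id

# Toy example: 3 shrunk nodes whose representatives are works 1, 2, 5;
# s could be, e.g., the publication years of the original works.
q = [1, 2, 5]
s = [2008, 2005, 2004, 2009, 2001]
print(compose(q, s))  # → [2008, 2005, 2001]
```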

In Pajek, given a mapping q : VC → V, the mapping r can be determined as follows:

select partition q as First partition
select partition s as Second partition
Partitions/Functional Composition First*Second

or

select partition q as First partition
select vector s as First vector
Operations/Vector+Partition/Functional Composition Partition*Vector

For the partition q = worksEQq we computed partitions YearR and DCR and the vector NPR.

WAr   n = 179049 = 116906+62143    m = 132776    AveDegree = 1.48312473
WKr   n = 127175 = 116906+10269    m =  88965    AveDegree = 1.39909573
WJr   n = 129558 = 116906+12652    m = 117044    AveDegree = 1.80682011
CiteR n = 116906   m = 195784                    AveDegree = 3.34942603

Citation network

The network CiteR has 116906 nodes and 195784 arcs.

indegree    freq
     0      4070   
     1     93246 
     2     10694  
     3      3352 
     4      1610 

Most nodes are terminal (DCR = 0) nodes, cited only once (indegree = 1). To deal with the boundary problem, we decided to include in our networks the nodes with DCR > 0 or indegree > 2 (the partition boundary). They determine a subnetwork CiteB with 13540 nodes and 82238 arcs.
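The boundary rule can be sketched in Python; DCR and indeg below are hypothetical per-node lists, not the real data:

```python
# Sketch of the boundary rule: keep works with a full WoS record
# (DCR > 0) or cited more than twice (indegree > 2).
DCR   = [1, 0, 0, 1, 0]
indeg = [0, 3, 1, 5, 2]

# collect the 1-based ids of the nodes satisfying the rule
boundary = [i + 1 for i, (d, k) in enumerate(zip(DCR, indeg))
            if d > 0 or k > 2]
print(boundary)  # → [1, 2, 4]
```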

Complete description networks

For the analysis of authors, keywords and journals we have complete data (besides the CR field) exactly for the works with DC > 0. In the two-mode networks we preserve in the second set only the nodes with indegree > 0. The new dataset: CiteC, WAc, WKc, WJc, yearC, NPc.

To get WAc from WAr:

read partition DCr
select network WAr
Partition/Create Constant Partition [62143,0] = C1
select DCr as First partition
select C1 as Second partition
Partitions/Fuse Partitions
Operations/Network+Partition/Extract Subnetwork [1-*]
Partition/Create Constant Partition [5695,1] = C2
select C1 as Second partition
Partitions/Fuse Partitions
Partition/Binarize partition [1] = C3
Network/Create Partition/Degree/Input
select C3 as Second partition
Partitions/Max (First,Second)
Operations/Network+Partition/Extract Subnetwork [1-*]

In the same way we also obtain the networks WKc and WJc.
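A rough Python equivalent of this restriction, applied to a two-mode edge list (all names and values are illustrative):

```python
# Restrict a two-mode network: keep only the edges whose work has
# DC > 0, then keep only the second-mode nodes (e.g. authors) that
# still have degree > 0 in the restricted network.
edges = [('w1', 'a1'), ('w1', 'a2'), ('w2', 'a2'), ('w3', 'a3')]
DC = {'w1': 2, 'w2': 0, 'w3': 1}

kept = [(w, a) for (w, a) in edges if DC[w] > 0]
second_mode = sorted({a for (_, a) in kept})
print(kept)         # → [('w1', 'a1'), ('w1', 'a2'), ('w3', 'a3')]
print(second_mode)  # → ['a1', 'a2', 'a3']
```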

WAc   n = 19071 = 5695+13376       m = 21562     AveDegree = 2.26123433
WKc   n = 15964 = 5695+10269       m = 88953     AveDegree = 11.14419945
WJc   n =  7451 = 5695+1756        m =  5815     AveDegree = 1.56086431
CiteC n =  5695    m = 38400                     AveDegree = 13.48551361

Back to Analysis

notes/bm2/data.txt · Last modified: 2017/04/16 11:59 by vlado
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported