Pajek

Pajek

Clustering

Read in Pajek the project file USC3110.paj and file with 75 standardized variables USC3110.vec .

To determine the clustering with relational constraint we have first to compute the dissimilarity between linked vertices based on selected variables. The vectors/variables to be used in the dissimilarity computation are determined by the current Pajek's cluster. In our case this cluster should contain indices from 4 to 78. We produce it as follows

Cluster/Create complete cluster [78]

Then click on the Edit Cluster Button. In the displayed list double click on vertex 1 and confirm its deletion. Repeat this also for vertices 2 and 3. We have the required cluster.

Now we can add to the links of US counties network the dissimilarities as weights

Operations/Dissimilarity*/Vector Based/Euclidean  [Yes]

and run clustering with relational constraint afterward

Net/Hierarchical Decomposition/Clustering with Relational Constraint/Options [Average,Tolerant]
Net/Hierarchical Decomposition/Clustering with Relational Constraint/Run

We get a vector of cluster heights, a vector of cluster sizes, a permutation of vertices with respect to the clustering, and a “double partition”. The “double partition” contains the description of the clustering tree in the form p[i] = j , meaning: the father of vertex i in the clustering tree is the vertex j . The leaves of the tree are the vertices of our network; the internal vertices of the tree are numbered as they are produced with indices from NumVertices+1 on.

Inspecting the distribution of heights values we selected the threshold 11. Now we can produce the corresponding partition:

[select the heights vector]
Net/Hierarchical Decomposition/Clustering with Relational Constraint/Make Partition/Using Threshold [11]

We obtain the following partition avgtol11.pdf that consists of a very large red cluster, several small clusters and a “cluster” of unclassified cyan vertices.

This partition is of little use for our analysis. See a discussion about what can be wrong and how we can try to bypass the problems.

Improving the results

We first try to extract from the dendrogram the clusters with number of units in some selected interval - for example [5,400]

> RC <- Pajek2R("./cluster/AvgTolNew.clu")
> rCount <- varCutree(RC,rep(1,3110),5,400)
> t <- table(rCount$part)
> t
     0      1      2      3      4      5      6      7      8      9     10     11     12     13     14 
   745     18    398      5      5     12      8     25      9      6      5     13     22      6     19 
    15     16     17     18     19     20     21     22     23     24     25     26     27     28     29 
    44     10      9      5      6     29      6     10    347     79      5    151      7      8      7 
    30     31     32     33     34     35     36     37     38     39     40     41     42     43     44 
   130      5      9     97     17      7     13      9      5      7      6      5     10    337     40 
    45     46     47     48     49     50     51     52     53     54     55     56     57     58     59 
     7      5     20     12      6      5     25     19     11      8      5     32    116     42      5 
    60     61     62     63     64     65 999999 
     5      5     10     18     16     14      8 
> out <- file("CRAveTot3.clu","w"); cat("*vertices 3110",rCount$part,sep="\n",file=out); close(out)

We read the “double partition” in R and using R procedures determine the partition, that is finally exported as Pajek's clustering to be displayed using Pajek. The obtained clustering has 745 unclassified units, but we can see some structure avgtol-3.pdf.

We can also determine rank-height of the clustering tree and cut it at selected level.

levCutree <- function(R,hT){
  mark <- function(t,c){
    if(t>0) vis[t] <<- TRUE
    if(t<0) part[-t] <<- c else {mark(R$merge[t,1],c); mark(R$merge[t,2],c)}
  }
  n <- nrow(R$merge)+1; part <- rep(999999,n)
  vis <- rep(FALSE,n-1); nclust <- 0
  for(i in (n-1):1) if (!vis[i]) {
    vis[i] <- TRUE; hi <- R$height[i]
    l <- R$merge[i,1]; if (l == 0) next
    r <- R$merge[i,2]; if (r == 0) next
    if (hi <= hT) {     
      nclust <- nclust+1; mark(l,nclust); mark(r,nclust)
    } else { if (l < 0) mark(l,0); if (r < 0) mark(r,0) }
  }
  return(part)
}

Procedure

> RC$height <- rep(1,3109)
> rRank <- rankTree(RC)
> RC$height <- rRank
> rankPart <- levCutree(RC,70)
> out <- file("AveTotRank70.clu","w"); cat("*vertices 3110",rankPart,sep="\n",file=out); close(out)

This gives another clustering avgtotrank70.pdf.

Using in Pajek the option Options/Mark Vertices Using/Partition Clusters we identify the cluster numbers of interesting clusters and look in R to their characteristics

> dat <- read.csv(file="varsN.csv",stringsAsFactors=FALSE)
> dim(dat)
> da <- dat[-2916,]
> dim(da)
[1] 3110   92
> south <- rankPart==87
> east <- rankPart==115
> west <- rankPart==27
> north <- rankPart==170
> da[south,"Areaname"]
> summary(da[south,])
> da[north,"Areaname"]
> summary(da[north,])
> da[east,"Areaname"]
> summary(da[east,])
> da[west,"Areaname"]
> summary(da[west,])

rankres.zip (rankPart.clu, south.txt, north.txt, east.txt, west.txt, rankDat.RData (rankPart, da, s))

> M <- cbind(apply(da[south,-c(1,2)],2,mean),apply(da[north,-c(1,2)],2,mean),
  apply(da[east,-c(1,2)],2,mean),apply(da[west,-c(1,2)],2,mean))
> M
                                              [,1]          [,2]          [,3]          [,4]
AGE050200D                            3.705131e+01  3.811625e+01  3.777956e+01  3.718879e+01
BZA115203D                            2.238562e-01 -4.332500e+00 -2.267291e+00  1.110345e+00
CLF040200D                            4.632680e+00  3.657500e+00  4.416184e+00  6.244828e+00
EDU635200D                            7.361373e+01  8.225594e+01  7.482572e+01  8.114224e+01
EDU685200D                            1.457680e+01  1.431656e+01  1.542521e+01  1.936810e+01
HSG045200D                            8.044444e+00  6.816562e+00  1.575571e+01  1.633362e+01
HSG495200D                            5.846732e+04  7.602125e+04  8.654140e+04  1.307491e+05
INC610199D                            3.725614e+04  4.543871e+04  4.188305e+04  4.402873e+04
INC910199D                            1.574440e+04  1.819005e+04  1.762827e+04  1.819972e+04
IPE010200D                            3.188551e+04  3.947356e+04  3.588853e+04  3.821586e+04
IPE120200D                            1.699837e+01  9.140313e+00  1.329523e+01  1.319052e+01
IPE220200D                            2.394739e+01  1.266219e+01  1.872641e+01  1.882414e+01
LFE305200D                            2.401340e+01  2.149219e+01  2.579881e+01  2.211379e+01
PIN020200D                            2.094700e+04  2.417972e+04  2.321806e+04  2.403807e+04
POP050200D                            8.283987e+00  4.062500e+00  1.024055e+01  1.616207e+01
POP060200D                            6.590621e+01  9.249344e+01  1.828421e+02  1.647716e+02
POP165200D                            5.081536e+01  5.074594e+01  5.101295e+01  4.982759e+01
POP225200D                            7.758333e+01  9.593219e+01  8.562027e+01  8.376983e+01
POP255200D                            1.336667e+01  1.692813e+00  1.145741e+01  1.559483e+00
POP285200D                            2.348366e+00  2.543750e-01  3.321976e-01  1.993966e+00
POP325200D                            4.490196e-01  4.378125e-01  6.032368e-01  2.432759e+00
POP405200D                            9.479739e+00  2.018750e+00  2.408007e+00  1.334052e+01
POP645200D                            3.341176e+00  1.557187e+00  2.626405e+00  8.694828e+00
VST020200D                            1.353856e+01  1.227937e+01  1.239046e+01  1.262672e+01
VST220200D                            1.151863e+01  1.079156e+01  1.044395e+01  9.142241e+00
VST420200D                            7.472222e+00  6.154687e+00  7.596252e+00  6.198276e+00
WAT130200D                            1.611786e+03  2.109659e+03  1.112694e+03  6.616555e+03
P.pop.under5                          6.580324e+00  6.232957e+00  6.118504e+00  6.348849e+00
P.pop.under18                         2.635719e+01  2.549562e+01  2.434618e+01  2.607041e+01
P.pop.over85                          2.103039e+00  2.360762e+00  1.679386e+00  1.672105e+00
P.ind.fam.farms                       9.076461e+01  8.849999e+01  9.043491e+01  8.426045e+01
P.land.farms                          6.290727e+01  7.931597e+01  3.275108e+01  3.458045e+01
P.irrigatedLand                                Inf  3.381284e+00           Inf  2.006879e+01
P.violent.crime                       3.207766e-01  8.674755e-02  2.946077e-01  3.348652e-01
P.murders                             4.446644e-03  9.991130e-04  3.495106e-03  2.931687e-03
P.rapes                               2.683564e-02  1.064653e-02  1.955211e-02  3.005992e-02
P.16to19.notHighSc                    1.056589e+01  7.645125e+00  1.062883e+01  9.078797e+00
P.Hisp.Latin                          9.407646e+00  1.978180e+00  2.341986e+00  1.327710e+01
R.Hisp.Lat.maleFemale                 1.229879e+00           Inf            NA  1.171553e+00
P.emply.ind.AGR.FOR.FISH.HUNT.MINING  8.323722e+00  5.990339e+00  3.431304e+00  7.956955e+00
P.emply.ind.CONSTRUCTION              7.450183e+00  6.311400e+00  7.353999e+00  6.644653e+00
P.emply.ind.MANUFACTORING             1.275459e+01  2.152131e+01  1.834340e+01  9.776601e+00
P.emply.ind.WHOLESALEtrade            2.766558e+00  3.053673e+00  2.866999e+00  2.833837e+00
P.emply.ind.RETAILtrade               1.084854e+01  1.095023e+01  1.131198e+01  1.064729e+01
P.emply.ind.TRANSPORT.WAREHOUSING     5.410541e+00  5.163809e+00  4.880563e+00  4.484460e+00
P.emply.ind.INFORMATION               1.441289e+00  1.651531e+00  1.763538e+00  1.860243e+00
P.emply.ind.FINANC.INSUR              4.144753e+00  4.189244e+00  4.194833e+00  4.113244e+00
P.emply.ind.PROFscientTECH            4.304369e+00  4.257937e+00  5.330816e+00  6.102132e+00
P.emply.ind.EDUC.HEALTH               1.988487e+01  1.870545e+01  1.899052e+01  1.862793e+01
P.emply.ind.ARTSaccomFOOD             5.721664e+00  5.774690e+00  6.271828e+00  8.140394e+00
P.emply.ind.PUBLICaddmin              5.223214e+00  3.521698e+00  4.666341e+00  5.762203e+00
P.25overLESS9thGRADE                  1.031324e+01  6.652107e+00  9.706479e+00  7.133818e+00
P.employ.FARMING                      1.298267e+01  9.958493e+00  4.789943e+00  8.400648e+00
P.employ.AGRIC.FOREST.FISH.HUNT       1.110473e+00  1.823078e-01  7.170553e-01  2.357110e+00
F.GOV.EXP.perCapita                   5.468467e+02  4.645189e+02  5.004825e+02  5.314188e+02
P.employ.GOV                          1.879330e+01  1.389505e+01  1.545659e+01  2.064506e+01
P.employ.GOV.stateLoc                 1.625734e+01  1.222596e+01  1.334276e+01  1.687476e+01
P.vacantHousingUnits                  1.482792e+01  8.548903e+00  1.285020e+01  1.328897e+01
P.occupiedHousingUnits                8.517208e+01  9.145110e+01  8.714980e+01  8.671103e+01
P.occupiedHousingUnitsBLACK           1.043022e+01  1.119232e+00  8.972250e+00  1.249519e+00
P.occupiedHousingUnitsHisLat          5.798692e+00  1.182373e+00  1.376569e+00  8.194458e+00
P.OWNERoccupiedHousingUnits           6.313439e+01  6.966030e+01  6.510236e+01  5.789313e+01
P.RENTERoccupiedHousingUnits          2.203769e+01  2.179080e+01  2.204743e+01  2.881790e+01
P.occupiedHousingUnitsLackingPlumb    7.908735e-01  4.948386e-01  9.446285e-01  7.176409e-01
P.URBANpopul                          4.077331e+01  4.130698e+01  4.030167e+01  6.026730e+01
P.RURALpopul                          5.922669e+01  5.869302e+01  5.969833e+01  3.973270e+01
P.CHANGErural90to00                   1.010434e+01 -3.028793e+00 -1.562482e+00 -3.422337e+00
P.CHANGEurban90to00                   8.787050e+00  1.106016e+01  2.283582e+01  3.857946e+01
P.LAND                                9.727151e+01  9.870546e+01  9.587915e+01  9.542434e+01
P.WATER                               2.728487e+00  1.294513e+00  4.120870e+00  4.575628e+00
P.BELOWpovertyLevel                   1.764538e+01  9.208135e+00  1.432859e+01  1.419735e+01
P.ABOVEpovertyLevel                   8.235462e+01  9.079186e+01  8.567141e+01  8.580265e+01
P.WHITE.BELOWpovertyLevel             1.314581e+01  8.742184e+00  1.188328e+01  1.230589e+01
P.aINDIAN.BELOWpovertyLevel                     NA            NA            NA  2.388984e+01
P.BLACK.BELOWpovertyLevel                       NA            NA            NA            NA
P.ASIAN.BELOWpovertyLevel                       NA            NA            NA            NA
P.HisLat.BELOWpovertyLevel            2.806823e+01  1.892611e+01            NA  2.553416e+01
CHANGEperCapitaIncome89to99           5.676222e+03  6.653450e+03  6.233325e+03  5.874371e+03
P.CHANGEemployIndustry90to00          1.878123e+01  1.559929e+01  1.633401e+01  2.284493e+01
P.IrrigationGROUNDwaterUse                      NA            NA            NA  2.796188e+01
GroundWaterUsePerCapita               1.018104e-03  1.067929e-03  1.556404e-04  1.129905e-03
P.NET.DOMESTIC.MIGRATIONS            -5.273263e-01 -4.457541e-01  1.212542e-01 -1.495736e-01
P.NativePopulationBornInStateOfRes    7.388104e+01  7.706141e+01  7.259531e+01  5.158747e+01
R.LABOR.FORCEmaleFemale               1.205602e+00  1.164036e+00  1.160966e+00  1.211234e+00
R.VOTING.DEMOCRATESoverREPUBLICANS    6.720723e+01  7.530776e+01  8.824383e+01  7.642053e+01
P.PUBLIC.SCHOOL.ENROLNEMT             7.128929e+01  6.774728e+01  6.893168e+01  6.875286e+01
TOTAL.DEPOSITSperCapita               1.117644e+01  1.447363e+01  1.121010e+01  8.799865e+00
CROPvaluePerFARM                      3.048392e+01  6.127429e+01  2.972905e+01  1.145691e+02
LIFESTOCKvaluePerFARM                 5.498766e+01  4.841435e+01  4.143259e+01  4.736536e+01
P.CHANGEpverty95to00                 -1.399711e+01 -8.650349e+00 -1.150927e+01 -7.111793e+00
>

Partition by state

The first two characters in the county id identify the state to which the county belongs.

> d <- readLines("usc3110.paj")[14:3123]
> n <- 3110; s <- character(n)
> for(i in 1:n) s[i] <- as.integer(substr(strsplit(dd[i],'\"')[[1]][2],1,2))
> out <- file("states3110.clu","w"); cat("*vertices 3110",s,sep="\n",file=out); close(out)

states.zip

This partition is already included in the USC3110.paj file:

------------------------------------------------------------------------------
Reading Partition   ---    "US states"
------------------------------------------------------------------------------
   To which state a county belongs?
   01 - Alabama             21 - Kentucky            38 - North Dakota
   02 - Alaska              22 - Louisiana           39 - Ohio
   04 - Arizona             23 - Maine               40 - Oklahoma
   05 - Arkansas            24 - Maryland            41 - Oregon
   06 - California          25 - Massachusetts       42 - Pennsylvania
   08 - Colorado            26 - Michigan            44 - Rhode Island
   09 - Connecticut         27 - Minnesota           45 - South Carolina
   10 - Delaware            28 - Mississippi         46 - South Dakota
   11 - Dist. of Columbia   29 - Missouri            47 - Tennessee
   12 - Florida             30 - Montana             48 - Texas
   13 - Georgia             31 - Nebraska            49 - Utah
   15 - Hawaii              32 - Nevada              50 - Vermont
   16 - Idaho               33 - New Hampshire       51 - Virginia
   17 - Illinois            34 - New Jersey          53 - Washington
   18 - Indiana             35 - New Mexico          54 - West Virginia
   19 - Iowa                36 - New York            55 - Wisconsin
   20 - Kansas              37 - North Carolina      56 - Wyoming

Partitions in 4 and 9 regions

File/Partition/Read  [states3110.clu]
File/Partition/Read  [regions4.clu]
File/Partition/Read  [regions9.clu]
[select states partition as partition 1]
[select regions9 partition as partition 2]
Operations/Functional Composition/Partition*Partition
File/Partition/Save [reg3310-9.clu]
[select states partition as partition 1]
[select regions4 partition as partition 2]
Operations/Functional Composition/Partition*Partition
File/Partition/Save [reg3310-4.clu]

Discriminant analysis

> getwd()
[1] "D:/Data/counties"
> load("rankDat.RData")
> C <- read.csv("./reg3110-9.clu",header=FALSE,skip=1)$V1
> table(C)
C
  1   2   3   4   5   6   7   8   9 
133 281 618 437  67 150 470 364 590 
> library(MASS)
> V <- da[,-c(1,2)]
> names(V)
[33] "P.irrigatedLand"                     
[39] "R.Hisp.Lat.maleFemale"               
[74] "P.aINDIAN.BELOWpovertyLevel"         
[75] "P.BLACK.BELOWpovertyLevel"           
[76] "P.ASIAN.BELOWpovertyLevel"           
[80] "P.IrrigationGROUNDwaterUse"          
> U <- V[,-c(33,39,74,75,76,80)]
> L <- lda(C~.,U)
Warning message:
In lda.default(x, grouping, ...) : variables are collinear
> names(L)
 [1] "prior"     "counts"    "means"     "scaling"   "lev"       "svd"      
 [7] "N"         "call"      "terms"     "xlevels"   "na.action"
> L$means
  AGE050200D BZA115203D CLF040200D EDU635200D EDU685200D HSG045200D HSG495200D
1   37.27970  2.4563910   6.190977   81.26241   20.69248  16.362406  152539.85
2   36.79319  2.1946237   4.670609   82.71935   20.14050  19.684588  106231.90
....
9                  68.83642                9.874268         41.91874
  LIFESTOCKvaluePerFARM P.CHANGEpverty95to00
1              50.48722            -3.836304
2              65.86691            -1.174104
3              74.07023           -13.616674
4              26.20308            -6.648226
5              29.61547            -9.345561
6              36.08322           -10.412122
7              60.42837           -12.549059
8              29.61485           -12.168219
9              48.31252            -9.428939
> L$counts
  1   2   3   4   5   6   7   8   9 
133 279 608 437  67 148 470 362 551 
> L$scaling
                                               LD1           LD2           LD3
AGE050200D                           -1.850965e-01  6.327903e-02 -1.865893e-02
BZA115203D                            1.561909e-03  5.040118e-04 -1.502661e-03
CLF040200D                           -2.216119e-01  1.937772e-01 -1.191955e-01
.... 
> L$N
[1] 3055
> L$lev
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
> L$prior
         1          2          3          4          5          6          7 
0.04353519 0.09132570 0.19901800 0.14304419 0.02193126 0.04844517 0.15384615 
         8          9 
0.11849427 0.18036007
> L$svd
[1] 35.981057 29.047318 23.855715 16.392248 14.756034 13.592597 10.619947
[8]  7.094417 
> plot(L)
>

Back to 7ISS Labs