====== Pajek ======

===== Clustering =====


Read in Pajek the project file ''USC3110.paj'' and file with 75 standardized variables ''USC3110.vec'' .

To determine the clustering with relational constraint we have first to compute the dissimilarity between linked vertices based on selected variables. The vectors/variables to be used in the dissimilarity computation are determined by the current Pajek's cluster. In our case this cluster should contain indices from 4 to 78. We produce it as follows 
<code>
Cluster/Create complete cluster [78]
</code>
Then click on the ''Edit Cluster Button''. In the displayed list double click on vertex 1 and confirm its deletion. Repeat this also for vertices 2 and 3. We have the required cluster.

Now we can add to the links of US counties network the dissimilarities as weights
<code>
Operations/Dissimilarity*/Vector Based/Euclidean  [Yes]
</code>
and run clustering with relational constraint afterward
<code>
Net/Hierarchical Decomposition/Clustering with Relational Constraint/Options [Average,Tolerant]
Net/Hierarchical Decomposition/Clustering with Relational Constraint/Run
</code> 
We get a vector of cluster heights, a vector of cluster sizes, a permutation of vertices with respect to the clustering, and a "double partition". The "double partition" contains the description of the clustering tree in the form p[i] = j , meaning: the father of vertex i in the clustering tree is the vertex j . The leaves of the tree are the vertices of our network; the internal vertices of the tree are numbered as they are produced with indices from NumVertices+1 on. 

Inspecting the distribution of heights values we selected the threshold 11. Now we can produce the corresponding partition:
<code>
[select the heights vector]
Net/Hierarchical Decomposition/Clustering with Relational Constraint/Make Partition/Using Threshold [11]
</code> 
We obtain the following partition {{notes:clu:pdf:avgtol11.pdf}} that consists of a very large red cluster, several small clusters and a "cluster" of unclassified cyan vertices.

This partition is of little use for our analysis. See a [[book:temp:private:data:cluster|discussion]] about what can be wrong and how we can try to bypass the problems.

===== Improving the results =====

We first try to extract from the dendrogram the clusters with number of units in some selected interval - for example [5,400]
<code>
> RC <- Pajek2R("./cluster/AvgTolNew.clu")
> rCount <- varCutree(RC,rep(1,3110),5,400)
> t <- table(rCount$part)
> t
     0      1      2      3      4      5      6      7      8      9     10     11     12     13     14 
   745     18    398      5      5     12      8     25      9      6      5     13     22      6     19 
    15     16     17     18     19     20     21     22     23     24     25     26     27     28     29 
    44     10      9      5      6     29      6     10    347     79      5    151      7      8      7 
    30     31     32     33     34     35     36     37     38     39     40     41     42     43     44 
   130      5      9     97     17      7     13      9      5      7      6      5     10    337     40 
    45     46     47     48     49     50     51     52     53     54     55     56     57     58     59 
     7      5     20     12      6      5     25     19     11      8      5     32    116     42      5 
    60     61     62     63     64     65 999999 
     5      5     10     18     16     14      8 
> out <- file("CRAveTot3.clu","w"); cat("*vertices 3110",rCount$part,sep="\n",file=out); close(out)
</code>
We read the "double partition" in R and using R procedures determine the partition, that is finally exported as Pajek's clustering to be displayed using Pajek.
The obtained clustering has 745 unclassified units, but we can see some structure
{{notes:clu:pdf:avgtol-3.pdf}}.

We can also determine rank-height of the clustering tree and cut it at selected level.
<code>
levCutree <- function(R,hT){
  mark <- function(t,c){
    if(t>0) vis[t] <<- TRUE
    if(t<0) part[-t] <<- c else {mark(R$merge[t,1],c); mark(R$merge[t,2],c)}
  }
  n <- nrow(R$merge)+1; part <- rep(999999,n)
  vis <- rep(FALSE,n-1); nclust <- 0
  for(i in (n-1):1) if (!vis[i]) {
    vis[i] <- TRUE; hi <- R$height[i]
    l <- R$merge[i,1]; if (l == 0) next
    r <- R$merge[i,2]; if (r == 0) next
    if (hi <= hT) {     
      nclust <- nclust+1; mark(l,nclust); mark(r,nclust)
    } else { if (l < 0) mark(l,0); if (r < 0) mark(r,0) }
  }
  return(part)
}
</code>
Procedure
<code>
> RC$height <- rep(1,3109)
> rRank <- rankTree(RC)
> RC$height <- rRank
> rankPart <- levCutree(RC,70)
> out <- file("AveTotRank70.clu","w"); cat("*vertices 3110",rankPart,sep="\n",file=out); close(out)
</code>
This gives another clustering {{notes:clu:pdf:avgtotrank70.pdf}}.

Using in Pajek the option ''Options/Mark Vertices Using/Partition Clusters'' we identify the cluster numbers of interesting clusters and look in R to their characteristics
<code>
> dat <- read.csv(file="varsN.csv",stringsAsFactors=FALSE)
> dim(dat)
> da <- dat[-2916,]
> dim(da)
[1] 3110   92
> south <- rankPart==87
> east <- rankPart==115
> west <- rankPart==27
> north <- rankPart==170
> da[south,"Areaname"]
> summary(da[south,])
> da[north,"Areaname"]
> summary(da[north,])
> da[east,"Areaname"]
> summary(da[east,])
> da[west,"Areaname"]
> summary(da[west,])
</code>
{{notes:clu:zip:rankres.zip}} (''rankPart.clu'', ''south.txt'', ''north.txt'', ''east.txt'', ''west.txt'', ''rankDat.RData'' (rankPart, da, s))

<code>
> M <- cbind(apply(da[south,-c(1,2)],2,mean),apply(da[north,-c(1,2)],2,mean),
  apply(da[east,-c(1,2)],2,mean),apply(da[west,-c(1,2)],2,mean))
> M
                                              [,1]          [,2]          [,3]          [,4]
AGE050200D                            3.705131e+01  3.811625e+01  3.777956e+01  3.718879e+01
BZA115203D                            2.238562e-01 -4.332500e+00 -2.267291e+00  1.110345e+00
CLF040200D                            4.632680e+00  3.657500e+00  4.416184e+00  6.244828e+00
EDU635200D                            7.361373e+01  8.225594e+01  7.482572e+01  8.114224e+01
EDU685200D                            1.457680e+01  1.431656e+01  1.542521e+01  1.936810e+01
HSG045200D                            8.044444e+00  6.816562e+00  1.575571e+01  1.633362e+01
HSG495200D                            5.846732e+04  7.602125e+04  8.654140e+04  1.307491e+05
INC610199D                            3.725614e+04  4.543871e+04  4.188305e+04  4.402873e+04
INC910199D                            1.574440e+04  1.819005e+04  1.762827e+04  1.819972e+04
IPE010200D                            3.188551e+04  3.947356e+04  3.588853e+04  3.821586e+04
IPE120200D                            1.699837e+01  9.140313e+00  1.329523e+01  1.319052e+01
IPE220200D                            2.394739e+01  1.266219e+01  1.872641e+01  1.882414e+01
LFE305200D                            2.401340e+01  2.149219e+01  2.579881e+01  2.211379e+01
PIN020200D                            2.094700e+04  2.417972e+04  2.321806e+04  2.403807e+04
POP050200D                            8.283987e+00  4.062500e+00  1.024055e+01  1.616207e+01
POP060200D                            6.590621e+01  9.249344e+01  1.828421e+02  1.647716e+02
POP165200D                            5.081536e+01  5.074594e+01  5.101295e+01  4.982759e+01
POP225200D                            7.758333e+01  9.593219e+01  8.562027e+01  8.376983e+01
POP255200D                            1.336667e+01  1.692813e+00  1.145741e+01  1.559483e+00
POP285200D                            2.348366e+00  2.543750e-01  3.321976e-01  1.993966e+00
POP325200D                            4.490196e-01  4.378125e-01  6.032368e-01  2.432759e+00
POP405200D                            9.479739e+00  2.018750e+00  2.408007e+00  1.334052e+01
POP645200D                            3.341176e+00  1.557187e+00  2.626405e+00  8.694828e+00
VST020200D                            1.353856e+01  1.227937e+01  1.239046e+01  1.262672e+01
VST220200D                            1.151863e+01  1.079156e+01  1.044395e+01  9.142241e+00
VST420200D                            7.472222e+00  6.154687e+00  7.596252e+00  6.198276e+00
WAT130200D                            1.611786e+03  2.109659e+03  1.112694e+03  6.616555e+03
P.pop.under5                          6.580324e+00  6.232957e+00  6.118504e+00  6.348849e+00
P.pop.under18                         2.635719e+01  2.549562e+01  2.434618e+01  2.607041e+01
P.pop.over85                          2.103039e+00  2.360762e+00  1.679386e+00  1.672105e+00
P.ind.fam.farms                       9.076461e+01  8.849999e+01  9.043491e+01  8.426045e+01
P.land.farms                          6.290727e+01  7.931597e+01  3.275108e+01  3.458045e+01
P.irrigatedLand                                Inf  3.381284e+00           Inf  2.006879e+01
P.violent.crime                       3.207766e-01  8.674755e-02  2.946077e-01  3.348652e-01
P.murders                             4.446644e-03  9.991130e-04  3.495106e-03  2.931687e-03
P.rapes                               2.683564e-02  1.064653e-02  1.955211e-02  3.005992e-02
P.16to19.notHighSc                    1.056589e+01  7.645125e+00  1.062883e+01  9.078797e+00
P.Hisp.Latin                          9.407646e+00  1.978180e+00  2.341986e+00  1.327710e+01
R.Hisp.Lat.maleFemale                 1.229879e+00           Inf            NA  1.171553e+00
P.emply.ind.AGR.FOR.FISH.HUNT.MINING  8.323722e+00  5.990339e+00  3.431304e+00  7.956955e+00
P.emply.ind.CONSTRUCTION              7.450183e+00  6.311400e+00  7.353999e+00  6.644653e+00
P.emply.ind.MANUFACTORING             1.275459e+01  2.152131e+01  1.834340e+01  9.776601e+00
P.emply.ind.WHOLESALEtrade            2.766558e+00  3.053673e+00  2.866999e+00  2.833837e+00
P.emply.ind.RETAILtrade               1.084854e+01  1.095023e+01  1.131198e+01  1.064729e+01
P.emply.ind.TRANSPORT.WAREHOUSING     5.410541e+00  5.163809e+00  4.880563e+00  4.484460e+00
P.emply.ind.INFORMATION               1.441289e+00  1.651531e+00  1.763538e+00  1.860243e+00
P.emply.ind.FINANC.INSUR              4.144753e+00  4.189244e+00  4.194833e+00  4.113244e+00
P.emply.ind.PROFscientTECH            4.304369e+00  4.257937e+00  5.330816e+00  6.102132e+00
P.emply.ind.EDUC.HEALTH               1.988487e+01  1.870545e+01  1.899052e+01  1.862793e+01
P.emply.ind.ARTSaccomFOOD             5.721664e+00  5.774690e+00  6.271828e+00  8.140394e+00
P.emply.ind.PUBLICaddmin              5.223214e+00  3.521698e+00  4.666341e+00  5.762203e+00
P.25overLESS9thGRADE                  1.031324e+01  6.652107e+00  9.706479e+00  7.133818e+00
P.employ.FARMING                      1.298267e+01  9.958493e+00  4.789943e+00  8.400648e+00
P.employ.AGRIC.FOREST.FISH.HUNT       1.110473e+00  1.823078e-01  7.170553e-01  2.357110e+00
F.GOV.EXP.perCapita                   5.468467e+02  4.645189e+02  5.004825e+02  5.314188e+02
P.employ.GOV                          1.879330e+01  1.389505e+01  1.545659e+01  2.064506e+01
P.employ.GOV.stateLoc                 1.625734e+01  1.222596e+01  1.334276e+01  1.687476e+01
P.vacantHousingUnits                  1.482792e+01  8.548903e+00  1.285020e+01  1.328897e+01
P.occupiedHousingUnits                8.517208e+01  9.145110e+01  8.714980e+01  8.671103e+01
P.occupiedHousingUnitsBLACK           1.043022e+01  1.119232e+00  8.972250e+00  1.249519e+00
P.occupiedHousingUnitsHisLat          5.798692e+00  1.182373e+00  1.376569e+00  8.194458e+00
P.OWNERoccupiedHousingUnits           6.313439e+01  6.966030e+01  6.510236e+01  5.789313e+01
P.RENTERoccupiedHousingUnits          2.203769e+01  2.179080e+01  2.204743e+01  2.881790e+01
P.occupiedHousingUnitsLackingPlumb    7.908735e-01  4.948386e-01  9.446285e-01  7.176409e-01
P.URBANpopul                          4.077331e+01  4.130698e+01  4.030167e+01  6.026730e+01
P.RURALpopul                          5.922669e+01  5.869302e+01  5.969833e+01  3.973270e+01
P.CHANGErural90to00                   1.010434e+01 -3.028793e+00 -1.562482e+00 -3.422337e+00
P.CHANGEurban90to00                   8.787050e+00  1.106016e+01  2.283582e+01  3.857946e+01
P.LAND                                9.727151e+01  9.870546e+01  9.587915e+01  9.542434e+01
P.WATER                               2.728487e+00  1.294513e+00  4.120870e+00  4.575628e+00
P.BELOWpovertyLevel                   1.764538e+01  9.208135e+00  1.432859e+01  1.419735e+01
P.ABOVEpovertyLevel                   8.235462e+01  9.079186e+01  8.567141e+01  8.580265e+01
P.WHITE.BELOWpovertyLevel             1.314581e+01  8.742184e+00  1.188328e+01  1.230589e+01
P.aINDIAN.BELOWpovertyLevel                     NA            NA            NA  2.388984e+01
P.BLACK.BELOWpovertyLevel                       NA            NA            NA            NA
P.ASIAN.BELOWpovertyLevel                       NA            NA            NA            NA
P.HisLat.BELOWpovertyLevel            2.806823e+01  1.892611e+01            NA  2.553416e+01
CHANGEperCapitaIncome89to99           5.676222e+03  6.653450e+03  6.233325e+03  5.874371e+03
P.CHANGEemployIndustry90to00          1.878123e+01  1.559929e+01  1.633401e+01  2.284493e+01
P.IrrigationGROUNDwaterUse                      NA            NA            NA  2.796188e+01
GroundWaterUsePerCapita               1.018104e-03  1.067929e-03  1.556404e-04  1.129905e-03
P.NET.DOMESTIC.MIGRATIONS            -5.273263e-01 -4.457541e-01  1.212542e-01 -1.495736e-01
P.NativePopulationBornInStateOfRes    7.388104e+01  7.706141e+01  7.259531e+01  5.158747e+01
R.LABOR.FORCEmaleFemale               1.205602e+00  1.164036e+00  1.160966e+00  1.211234e+00
R.VOTING.DEMOCRATESoverREPUBLICANS    6.720723e+01  7.530776e+01  8.824383e+01  7.642053e+01
P.PUBLIC.SCHOOL.ENROLNEMT             7.128929e+01  6.774728e+01  6.893168e+01  6.875286e+01
TOTAL.DEPOSITSperCapita               1.117644e+01  1.447363e+01  1.121010e+01  8.799865e+00
CROPvaluePerFARM                      3.048392e+01  6.127429e+01  2.972905e+01  1.145691e+02
LIFESTOCKvaluePerFARM                 5.498766e+01  4.841435e+01  4.143259e+01  4.736536e+01
P.CHANGEpverty95to00                 -1.399711e+01 -8.650349e+00 -1.150927e+01 -7.111793e+00
> 
</code>

===== Partition by state =====

The first two characters in the county id identify the state to which the county belongs.
<code>
> d <- readLines("usc3110.paj")[14:3123]
> n <- 3110; s <- character(n)
> for(i in 1:n) s[i] <- as.integer(substr(strsplit(dd[i],'\"')[[1]][2],1,2))
> out <- file("states3110.clu","w"); cat("*vertices 3110",s,sep="\n",file=out); close(out)
</code>
{{notes:clu:zip:states.zip}}

This partition is already included in the ''USC3110.paj'' file:
<code>
------------------------------------------------------------------------------
Reading Partition   ---    "US states"
------------------------------------------------------------------------------
   To which state a county belongs?
   01 - Alabama             21 - Kentucky            38 - North Dakota
   02 - Alaska              22 - Louisiana           39 - Ohio
   04 - Arizona             23 - Maine               40 - Oklahoma
   05 - Arkansas            24 - Maryland            41 - Oregon
   06 - California          25 - Massachusetts       42 - Pennsylvania
   08 - Colorado            26 - Michigan            44 - Rhode Island
   09 - Connecticut         27 - Minnesota           45 - South Carolina
   10 - Delaware            28 - Mississippi         46 - South Dakota
   11 - Dist. of Columbia   29 - Missouri            47 - Tennessee
   12 - Florida             30 - Montana             48 - Texas
   13 - Georgia             31 - Nebraska            49 - Utah
   15 - Hawaii              32 - Nevada              50 - Vermont
   16 - Idaho               33 - New Hampshire       51 - Virginia
   17 - Illinois            34 - New Jersey          53 - Washington
   18 - Indiana             35 - New Mexico          54 - West Virginia
   19 - Iowa                36 - New York            55 - Wisconsin
   20 - Kansas              37 - North Carolina      56 - Wyoming
</code>

===== Partitions in 4 and 9 regions =====

<code>
File/Partition/Read  [states3110.clu]
File/Partition/Read  [regions4.clu]
File/Partition/Read  [regions9.clu]
[select states partition as partition 1]
[select regions9 partition as partition 2]
Operations/Functional Composition/Partition*Partition
File/Partition/Save [reg3310-9.clu]
[select states partition as partition 1]
[select regions4 partition as partition 2]
Operations/Functional Composition/Partition*Partition
File/Partition/Save [reg3310-4.clu]
</code>

===== Discriminant analysis =====

<code>
> getwd()
[1] "D:/Data/counties"
> load("rankDat.RData")
> C <- read.csv("./reg3110-9.clu",header=FALSE,skip=1)$V1
> table(C)
C
  1   2   3   4   5   6   7   8   9 
133 281 618 437  67 150 470 364 590 
> library(MASS)
> V <- da[,-c(1,2)]
> names(V)
[33] "P.irrigatedLand"                     
[39] "R.Hisp.Lat.maleFemale"               
[74] "P.aINDIAN.BELOWpovertyLevel"         
[75] "P.BLACK.BELOWpovertyLevel"           
[76] "P.ASIAN.BELOWpovertyLevel"           
[80] "P.IrrigationGROUNDwaterUse"          
> U <- V[,-c(33,39,74,75,76,80)]
> L <- lda(C~.,U)
Warning message:
In lda.default(x, grouping, ...) : variables are collinear
> names(L)
 [1] "prior"     "counts"    "means"     "scaling"   "lev"       "svd"      
 [7] "N"         "call"      "terms"     "xlevels"   "na.action"
> L$means
  AGE050200D BZA115203D CLF040200D EDU635200D EDU685200D HSG045200D HSG495200D
1   37.27970  2.4563910   6.190977   81.26241   20.69248  16.362406  152539.85
2   36.79319  2.1946237   4.670609   82.71935   20.14050  19.684588  106231.90
....
9                  68.83642                9.874268         41.91874
  LIFESTOCKvaluePerFARM P.CHANGEpverty95to00
1              50.48722            -3.836304
2              65.86691            -1.174104
3              74.07023           -13.616674
4              26.20308            -6.648226
5              29.61547            -9.345561
6              36.08322           -10.412122
7              60.42837           -12.549059
8              29.61485           -12.168219
9              48.31252            -9.428939
> L$counts
  1   2   3   4   5   6   7   8   9 
133 279 608 437  67 148 470 362 551 
> L$scaling
                                               LD1           LD2           LD3
AGE050200D                           -1.850965e-01  6.327903e-02 -1.865893e-02
BZA115203D                            1.561909e-03  5.040118e-04 -1.502661e-03
CLF040200D                           -2.216119e-01  1.937772e-01 -1.191955e-01
.... 
> L$N
[1] 3055
> L$lev
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
> L$prior
         1          2          3          4          5          6          7 
0.04353519 0.09132570 0.19901800 0.14304419 0.02193126 0.04844517 0.15384615 
         8          9 
0.11849427 0.18036007
> L$svd
[1] 35.981057 29.047318 23.855715 16.392248 14.756034 13.592597 10.619947
[8]  7.094417 
> plot(L)
> 
</code>


 \\ \\
[[ru:7iss#labs|Back to 7ISS Labs]]