====== Pajek ======
===== Clustering =====
Read in Pajek the project file ''USC3110.paj'' and file with 75 standardized variables ''USC3110.vec'' .
To determine the clustering with relational constraint we have first to compute the dissimilarity between linked vertices based on selected variables. The vectors/variables to be used in the dissimilarity computation are determined by the current Pajek's cluster. In our case this cluster should contain indices from 4 to 78. We produce it as follows
Cluster/Create complete cluster [78]
Then click on the ''Edit Cluster Button''. In the displayed list double click on vertex 1 and confirm its deletion. Repeat this also for vertices 2 and 3. We have the required cluster.
Now we can add to the links of US counties network the dissimilarities as weights
Operations/Dissimilarity*/Vector Based/Euclidean [Yes]
and run clustering with relational constraint afterward
Net/Hierarchical Decomposition/Clustering with Relational Constraint/Options [Average,Tolerant]
Net/Hierarchical Decomposition/Clustering with Relational Constraint/Run
We get a vector of cluster heights, a vector of cluster sizes, a permutation of vertices with respect to the clustering, and a "double partition". The "double partition" contains the description of the clustering tree in the form p[i] = j , meaning: the father of vertex i in the clustering tree is the vertex j . The leaves of the tree are the vertices of our network; the internal vertices of the tree are numbered as they are produced with indices from NumVertices+1 on.
Inspecting the distribution of heights values we selected the threshold 11. Now we can produce the corresponding partition:
[select the heights vector]
Net/Hierarchical Decomposition/Clustering with Relational Constraint/Make Partition/Using Threshold [11]
We obtain the following partition {{notes:clu:pdf:avgtol11.pdf}} that consists of a very large red cluster, several small clusters and a "cluster" of unclassified cyan vertices.
This partition is of little use for our analysis. See a [[book:temp:private:data:cluster|discussion]] about what can be wrong and how we can try to bypass the problems.
===== Improving the results =====
We first try to extract from the dendrogram the clusters with number of units in some selected interval - for example [5,400]
> RC <- Pajek2R("./cluster/AvgTolNew.clu")
> rCount <- varCutree(RC,rep(1,3110),5,400)
> t <- table(rCount$part)
> t
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
745 18 398 5 5 12 8 25 9 6 5 13 22 6 19
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
44 10 9 5 6 29 6 10 347 79 5 151 7 8 7
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
130 5 9 97 17 7 13 9 5 7 6 5 10 337 40
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
7 5 20 12 6 5 25 19 11 8 5 32 116 42 5
60 61 62 63 64 65 999999
5 5 10 18 16 14 8
> out <- file("CRAveTot3.clu","w"); cat("*vertices 3110",rCount$part,sep="\n",file=out); close(out)
We read the "double partition" in R and using R procedures determine the partition, that is finally exported as Pajek's clustering to be displayed using Pajek.
The obtained clustering has 745 unclassified units, but we can see some structure
{{notes:clu:pdf:avgtol-3.pdf}}.
We can also determine rank-height of the clustering tree and cut it at selected level.
levCutree <- function(R,hT){
mark <- function(t,c){
if(t>0) vis[t] <<- TRUE
if(t<0) part[-t] <<- c else {mark(R$merge[t,1],c); mark(R$merge[t,2],c)}
}
n <- nrow(R$merge)+1; part <- rep(999999,n)
vis <- rep(FALSE,n-1); nclust <- 0
for(i in (n-1):1) if (!vis[i]) {
vis[i] <- TRUE; hi <- R$height[i]
l <- R$merge[i,1]; if (l == 0) next
r <- R$merge[i,2]; if (r == 0) next
if (hi <= hT) {
nclust <- nclust+1; mark(l,nclust); mark(r,nclust)
} else { if (l < 0) mark(l,0); if (r < 0) mark(r,0) }
}
return(part)
}
Procedure
> RC$height <- rep(1,3109)
> rRank <- rankTree(RC)
> RC$height <- rRank
> rankPart <- levCutree(RC,70)
> out <- file("AveTotRank70.clu","w"); cat("*vertices 3110",rankPart,sep="\n",file=out); close(out)
This gives another clustering {{notes:clu:pdf:avgtotrank70.pdf}}.
Using in Pajek the option ''Options/Mark Vertices Using/Partition Clusters'' we identify the cluster numbers of interesting clusters and look in R to their characteristics
> dat <- read.csv(file="varsN.csv",stringsAsFactors=FALSE)
> dim(dat)
> da <- dat[-2916,]
> dim(da)
[1] 3110 92
> south <- rankPart==87
> east <- rankPart==115
> west <- rankPart==27
> north <- rankPart==170
> da[south,"Areaname"]
> summary(da[south,])
> da[north,"Areaname"]
> summary(da[north,])
> da[east,"Areaname"]
> summary(da[east,])
> da[west,"Areaname"]
> summary(da[west,])
{{notes:clu:zip:rankres.zip}} (''rankPart.clu'', ''south.txt'', ''north.txt'', ''east.txt'', ''west.txt'', ''rankDat.RData'' (rankPart, da, s))
> M <- cbind(apply(da[south,-c(1,2)],2,mean),apply(da[north,-c(1,2)],2,mean),
apply(da[east,-c(1,2)],2,mean),apply(da[west,-c(1,2)],2,mean))
> M
[,1] [,2] [,3] [,4]
AGE050200D 3.705131e+01 3.811625e+01 3.777956e+01 3.718879e+01
BZA115203D 2.238562e-01 -4.332500e+00 -2.267291e+00 1.110345e+00
CLF040200D 4.632680e+00 3.657500e+00 4.416184e+00 6.244828e+00
EDU635200D 7.361373e+01 8.225594e+01 7.482572e+01 8.114224e+01
EDU685200D 1.457680e+01 1.431656e+01 1.542521e+01 1.936810e+01
HSG045200D 8.044444e+00 6.816562e+00 1.575571e+01 1.633362e+01
HSG495200D 5.846732e+04 7.602125e+04 8.654140e+04 1.307491e+05
INC610199D 3.725614e+04 4.543871e+04 4.188305e+04 4.402873e+04
INC910199D 1.574440e+04 1.819005e+04 1.762827e+04 1.819972e+04
IPE010200D 3.188551e+04 3.947356e+04 3.588853e+04 3.821586e+04
IPE120200D 1.699837e+01 9.140313e+00 1.329523e+01 1.319052e+01
IPE220200D 2.394739e+01 1.266219e+01 1.872641e+01 1.882414e+01
LFE305200D 2.401340e+01 2.149219e+01 2.579881e+01 2.211379e+01
PIN020200D 2.094700e+04 2.417972e+04 2.321806e+04 2.403807e+04
POP050200D 8.283987e+00 4.062500e+00 1.024055e+01 1.616207e+01
POP060200D 6.590621e+01 9.249344e+01 1.828421e+02 1.647716e+02
POP165200D 5.081536e+01 5.074594e+01 5.101295e+01 4.982759e+01
POP225200D 7.758333e+01 9.593219e+01 8.562027e+01 8.376983e+01
POP255200D 1.336667e+01 1.692813e+00 1.145741e+01 1.559483e+00
POP285200D 2.348366e+00 2.543750e-01 3.321976e-01 1.993966e+00
POP325200D 4.490196e-01 4.378125e-01 6.032368e-01 2.432759e+00
POP405200D 9.479739e+00 2.018750e+00 2.408007e+00 1.334052e+01
POP645200D 3.341176e+00 1.557187e+00 2.626405e+00 8.694828e+00
VST020200D 1.353856e+01 1.227937e+01 1.239046e+01 1.262672e+01
VST220200D 1.151863e+01 1.079156e+01 1.044395e+01 9.142241e+00
VST420200D 7.472222e+00 6.154687e+00 7.596252e+00 6.198276e+00
WAT130200D 1.611786e+03 2.109659e+03 1.112694e+03 6.616555e+03
P.pop.under5 6.580324e+00 6.232957e+00 6.118504e+00 6.348849e+00
P.pop.under18 2.635719e+01 2.549562e+01 2.434618e+01 2.607041e+01
P.pop.over85 2.103039e+00 2.360762e+00 1.679386e+00 1.672105e+00
P.ind.fam.farms 9.076461e+01 8.849999e+01 9.043491e+01 8.426045e+01
P.land.farms 6.290727e+01 7.931597e+01 3.275108e+01 3.458045e+01
P.irrigatedLand Inf 3.381284e+00 Inf 2.006879e+01
P.violent.crime 3.207766e-01 8.674755e-02 2.946077e-01 3.348652e-01
P.murders 4.446644e-03 9.991130e-04 3.495106e-03 2.931687e-03
P.rapes 2.683564e-02 1.064653e-02 1.955211e-02 3.005992e-02
P.16to19.notHighSc 1.056589e+01 7.645125e+00 1.062883e+01 9.078797e+00
P.Hisp.Latin 9.407646e+00 1.978180e+00 2.341986e+00 1.327710e+01
R.Hisp.Lat.maleFemale 1.229879e+00 Inf NA 1.171553e+00
P.emply.ind.AGR.FOR.FISH.HUNT.MINING 8.323722e+00 5.990339e+00 3.431304e+00 7.956955e+00
P.emply.ind.CONSTRUCTION 7.450183e+00 6.311400e+00 7.353999e+00 6.644653e+00
P.emply.ind.MANUFACTORING 1.275459e+01 2.152131e+01 1.834340e+01 9.776601e+00
P.emply.ind.WHOLESALEtrade 2.766558e+00 3.053673e+00 2.866999e+00 2.833837e+00
P.emply.ind.RETAILtrade 1.084854e+01 1.095023e+01 1.131198e+01 1.064729e+01
P.emply.ind.TRANSPORT.WAREHOUSING 5.410541e+00 5.163809e+00 4.880563e+00 4.484460e+00
P.emply.ind.INFORMATION 1.441289e+00 1.651531e+00 1.763538e+00 1.860243e+00
P.emply.ind.FINANC.INSUR 4.144753e+00 4.189244e+00 4.194833e+00 4.113244e+00
P.emply.ind.PROFscientTECH 4.304369e+00 4.257937e+00 5.330816e+00 6.102132e+00
P.emply.ind.EDUC.HEALTH 1.988487e+01 1.870545e+01 1.899052e+01 1.862793e+01
P.emply.ind.ARTSaccomFOOD 5.721664e+00 5.774690e+00 6.271828e+00 8.140394e+00
P.emply.ind.PUBLICaddmin 5.223214e+00 3.521698e+00 4.666341e+00 5.762203e+00
P.25overLESS9thGRADE 1.031324e+01 6.652107e+00 9.706479e+00 7.133818e+00
P.employ.FARMING 1.298267e+01 9.958493e+00 4.789943e+00 8.400648e+00
P.employ.AGRIC.FOREST.FISH.HUNT 1.110473e+00 1.823078e-01 7.170553e-01 2.357110e+00
F.GOV.EXP.perCapita 5.468467e+02 4.645189e+02 5.004825e+02 5.314188e+02
P.employ.GOV 1.879330e+01 1.389505e+01 1.545659e+01 2.064506e+01
P.employ.GOV.stateLoc 1.625734e+01 1.222596e+01 1.334276e+01 1.687476e+01
P.vacantHousingUnits 1.482792e+01 8.548903e+00 1.285020e+01 1.328897e+01
P.occupiedHousingUnits 8.517208e+01 9.145110e+01 8.714980e+01 8.671103e+01
P.occupiedHousingUnitsBLACK 1.043022e+01 1.119232e+00 8.972250e+00 1.249519e+00
P.occupiedHousingUnitsHisLat 5.798692e+00 1.182373e+00 1.376569e+00 8.194458e+00
P.OWNERoccupiedHousingUnits 6.313439e+01 6.966030e+01 6.510236e+01 5.789313e+01
P.RENTERoccupiedHousingUnits 2.203769e+01 2.179080e+01 2.204743e+01 2.881790e+01
P.occupiedHousingUnitsLackingPlumb 7.908735e-01 4.948386e-01 9.446285e-01 7.176409e-01
P.URBANpopul 4.077331e+01 4.130698e+01 4.030167e+01 6.026730e+01
P.RURALpopul 5.922669e+01 5.869302e+01 5.969833e+01 3.973270e+01
P.CHANGErural90to00 1.010434e+01 -3.028793e+00 -1.562482e+00 -3.422337e+00
P.CHANGEurban90to00 8.787050e+00 1.106016e+01 2.283582e+01 3.857946e+01
P.LAND 9.727151e+01 9.870546e+01 9.587915e+01 9.542434e+01
P.WATER 2.728487e+00 1.294513e+00 4.120870e+00 4.575628e+00
P.BELOWpovertyLevel 1.764538e+01 9.208135e+00 1.432859e+01 1.419735e+01
P.ABOVEpovertyLevel 8.235462e+01 9.079186e+01 8.567141e+01 8.580265e+01
P.WHITE.BELOWpovertyLevel 1.314581e+01 8.742184e+00 1.188328e+01 1.230589e+01
P.aINDIAN.BELOWpovertyLevel NA NA NA 2.388984e+01
P.BLACK.BELOWpovertyLevel NA NA NA NA
P.ASIAN.BELOWpovertyLevel NA NA NA NA
P.HisLat.BELOWpovertyLevel 2.806823e+01 1.892611e+01 NA 2.553416e+01
CHANGEperCapitaIncome89to99 5.676222e+03 6.653450e+03 6.233325e+03 5.874371e+03
P.CHANGEemployIndustry90to00 1.878123e+01 1.559929e+01 1.633401e+01 2.284493e+01
P.IrrigationGROUNDwaterUse NA NA NA 2.796188e+01
GroundWaterUsePerCapita 1.018104e-03 1.067929e-03 1.556404e-04 1.129905e-03
P.NET.DOMESTIC.MIGRATIONS -5.273263e-01 -4.457541e-01 1.212542e-01 -1.495736e-01
P.NativePopulationBornInStateOfRes 7.388104e+01 7.706141e+01 7.259531e+01 5.158747e+01
R.LABOR.FORCEmaleFemale 1.205602e+00 1.164036e+00 1.160966e+00 1.211234e+00
R.VOTING.DEMOCRATESoverREPUBLICANS 6.720723e+01 7.530776e+01 8.824383e+01 7.642053e+01
P.PUBLIC.SCHOOL.ENROLNEMT 7.128929e+01 6.774728e+01 6.893168e+01 6.875286e+01
TOTAL.DEPOSITSperCapita 1.117644e+01 1.447363e+01 1.121010e+01 8.799865e+00
CROPvaluePerFARM 3.048392e+01 6.127429e+01 2.972905e+01 1.145691e+02
LIFESTOCKvaluePerFARM 5.498766e+01 4.841435e+01 4.143259e+01 4.736536e+01
P.CHANGEpverty95to00 -1.399711e+01 -8.650349e+00 -1.150927e+01 -7.111793e+00
>
===== Partition by state =====
The first two characters in the county id identify the state to which the county belongs.
> d <- readLines("usc3110.paj")[14:3123]
> n <- 3110; s <- character(n)
> for(i in 1:n) s[i] <- as.integer(substr(strsplit(dd[i],'\"')[[1]][2],1,2))
> out <- file("states3110.clu","w"); cat("*vertices 3110",s,sep="\n",file=out); close(out)
{{notes:clu:zip:states.zip}}
This partition is already included in the ''USC3110.paj'' file:
------------------------------------------------------------------------------
Reading Partition --- "US states"
------------------------------------------------------------------------------
To which state a county belongs?
01 - Alabama 21 - Kentucky 38 - North Dakota
02 - Alaska 22 - Louisiana 39 - Ohio
04 - Arizona 23 - Maine 40 - Oklahoma
05 - Arkansas 24 - Maryland 41 - Oregon
06 - California 25 - Massachusetts 42 - Pennsylvania
08 - Colorado 26 - Michigan 44 - Rhode Island
09 - Connecticut 27 - Minnesota 45 - South Carolina
10 - Delaware 28 - Mississippi 46 - South Dakota
11 - Dist. of Columbia 29 - Missouri 47 - Tennessee
12 - Florida 30 - Montana 48 - Texas
13 - Georgia 31 - Nebraska 49 - Utah
15 - Hawaii 32 - Nevada 50 - Vermont
16 - Idaho 33 - New Hampshire 51 - Virginia
17 - Illinois 34 - New Jersey 53 - Washington
18 - Indiana 35 - New Mexico 54 - West Virginia
19 - Iowa 36 - New York 55 - Wisconsin
20 - Kansas 37 - North Carolina 56 - Wyoming
===== Partitions in 4 and 9 regions =====
File/Partition/Read [states3110.clu]
File/Partition/Read [regions4.clu]
File/Partition/Read [regions9.clu]
[select states partition as partition 1]
[select regions9 partition as partition 2]
Operations/Functional Composition/Partition*Partition
File/Partition/Save [reg3310-9.clu]
[select states partition as partition 1]
[select regions4 partition as partition 2]
Operations/Functional Composition/Partition*Partition
File/Partition/Save [reg3310-4.clu]
===== Discriminant analysis =====
> getwd()
[1] "D:/Data/counties"
> load("rankDat.RData")
> C <- read.csv("./reg3110-9.clu",header=FALSE,skip=1)$V1
> table(C)
C
1 2 3 4 5 6 7 8 9
133 281 618 437 67 150 470 364 590
> library(MASS)
> V <- da[,-c(1,2)]
> names(V)
[33] "P.irrigatedLand"
[39] "R.Hisp.Lat.maleFemale"
[74] "P.aINDIAN.BELOWpovertyLevel"
[75] "P.BLACK.BELOWpovertyLevel"
[76] "P.ASIAN.BELOWpovertyLevel"
[80] "P.IrrigationGROUNDwaterUse"
> U <- V[,-c(33,39,74,75,76,80)]
> L <- lda(C~.,U)
Warning message:
In lda.default(x, grouping, ...) : variables are collinear
> names(L)
[1] "prior" "counts" "means" "scaling" "lev" "svd"
[7] "N" "call" "terms" "xlevels" "na.action"
> L$means
AGE050200D BZA115203D CLF040200D EDU635200D EDU685200D HSG045200D HSG495200D
1 37.27970 2.4563910 6.190977 81.26241 20.69248 16.362406 152539.85
2 36.79319 2.1946237 4.670609 82.71935 20.14050 19.684588 106231.90
....
9 68.83642 9.874268 41.91874
LIFESTOCKvaluePerFARM P.CHANGEpverty95to00
1 50.48722 -3.836304
2 65.86691 -1.174104
3 74.07023 -13.616674
4 26.20308 -6.648226
5 29.61547 -9.345561
6 36.08322 -10.412122
7 60.42837 -12.549059
8 29.61485 -12.168219
9 48.31252 -9.428939
> L$counts
1 2 3 4 5 6 7 8 9
133 279 608 437 67 148 470 362 551
> L$scaling
LD1 LD2 LD3
AGE050200D -1.850965e-01 6.327903e-02 -1.865893e-02
BZA115203D 1.561909e-03 5.040118e-04 -1.502661e-03
CLF040200D -2.216119e-01 1.937772e-01 -1.191955e-01
....
> L$N
[1] 3055
> L$lev
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
> L$prior
1 2 3 4 5 6 7
0.04353519 0.09132570 0.19901800 0.14304419 0.02193126 0.04844517 0.15384615
8 9
0.11849427 0.18036007
> L$svd
[1] 35.981057 29.047318 23.855715 16.392248 14.756034 13.592597 10.619947
[8] 7.094417
> plot(L)
>
\\ \\
[[ru:7iss#labs|Back to 7ISS Labs]]