====== Pajek ====== ===== Clustering ===== Read in Pajek the project file ''USC3110.paj'' and file with 75 standardized variables ''USC3110.vec'' . To determine the clustering with relational constraint we have first to compute the dissimilarity between linked vertices based on selected variables. The vectors/variables to be used in the dissimilarity computation are determined by the current Pajek's cluster. In our case this cluster should contain indices from 4 to 78. We produce it as follows Cluster/Create complete cluster [78] Then click on the ''Edit Cluster Button''. In the displayed list double click on vertex 1 and confirm its deletion. Repeat this also for vertices 2 and 3. We have the required cluster. Now we can add to the links of US counties network the dissimilarities as weights Operations/Dissimilarity*/Vector Based/Euclidean [Yes] and run clustering with relational constraint afterward Net/Hierarchical Decomposition/Clustering with Relational Constraint/Options [Average,Tolerant] Net/Hierarchical Decomposition/Clustering with Relational Constraint/Run We get a vector of cluster heights, a vector of cluster sizes, a permutation of vertices with respect to the clustering, and a "double partition". The "double partition" contains the description of the clustering tree in the form p[i] = j , meaning: the father of vertex i in the clustering tree is the vertex j . The leaves of the tree are the vertices of our network; the internal vertices of the tree are numbered as they are produced with indices from NumVertices+1 on. Inspecting the distribution of heights values we selected the threshold 11. Now we can produce the corresponding partition: [select the heights vector] Net/Hierarchical Decomposition/Clustering with Relational Constraint/Make Partition/Using Threshold [11] We obtain the following partition {{notes:clu:pdf:avgtol11.pdf}} that consists of a very large red cluster, several small clusters and a "cluster" of unclassified cyan vertices. This partition is of little use for our analysis. See a [[book:temp:private:data:cluster|discussion]] about what can be wrong and how we can try to bypass the problems. ===== Improving the results ===== We first try to extract from the dendrogram the clusters with number of units in some selected interval - for example [5,400] > RC <- Pajek2R("./cluster/AvgTolNew.clu") > rCount <- varCutree(RC,rep(1,3110),5,400) > t <- table(rCount$part) > t 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 745 18 398 5 5 12 8 25 9 6 5 13 22 6 19 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 44 10 9 5 6 29 6 10 347 79 5 151 7 8 7 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 130 5 9 97 17 7 13 9 5 7 6 5 10 337 40 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 7 5 20 12 6 5 25 19 11 8 5 32 116 42 5 60 61 62 63 64 65 999999 5 5 10 18 16 14 8 > out <- file("CRAveTot3.clu","w"); cat("*vertices 3110",rCount$part,sep="\n",file=out); close(out) We read the "double partition" in R and using R procedures determine the partition, that is finally exported as Pajek's clustering to be displayed using Pajek. The obtained clustering has 745 unclassified units, but we can see some structure {{notes:clu:pdf:avgtol-3.pdf}}. We can also determine rank-height of the clustering tree and cut it at selected level. levCutree <- function(R,hT){ mark <- function(t,c){ if(t>0) vis[t] <<- TRUE if(t<0) part[-t] <<- c else {mark(R$merge[t,1],c); mark(R$merge[t,2],c)} } n <- nrow(R$merge)+1; part <- rep(999999,n) vis <- rep(FALSE,n-1); nclust <- 0 for(i in (n-1):1) if (!vis[i]) { vis[i] <- TRUE; hi <- R$height[i] l <- R$merge[i,1]; if (l == 0) next r <- R$merge[i,2]; if (r == 0) next if (hi <= hT) { nclust <- nclust+1; mark(l,nclust); mark(r,nclust) } else { if (l < 0) mark(l,0); if (r < 0) mark(r,0) } } return(part) } Procedure > RC$height <- rep(1,3109) > rRank <- rankTree(RC) > RC$height <- rRank > rankPart <- levCutree(RC,70) > out <- file("AveTotRank70.clu","w"); cat("*vertices 3110",rankPart,sep="\n",file=out); close(out) This gives another clustering {{notes:clu:pdf:avgtotrank70.pdf}}. Using in Pajek the option ''Options/Mark Vertices Using/Partition Clusters'' we identify the cluster numbers of interesting clusters and look in R to their characteristics > dat <- read.csv(file="varsN.csv",stringsAsFactors=FALSE) > dim(dat) > da <- dat[-2916,] > dim(da) [1] 3110 92 > south <- rankPart==87 > east <- rankPart==115 > west <- rankPart==27 > north <- rankPart==170 > da[south,"Areaname"] > summary(da[south,]) > da[north,"Areaname"] > summary(da[north,]) > da[east,"Areaname"] > summary(da[east,]) > da[west,"Areaname"] > summary(da[west,]) {{notes:clu:zip:rankres.zip}} (''rankPart.clu'', ''south.txt'', ''north.txt'', ''east.txt'', ''west.txt'', ''rankDat.RData'' (rankPart, da, s)) > M <- cbind(apply(da[south,-c(1,2)],2,mean),apply(da[north,-c(1,2)],2,mean), apply(da[east,-c(1,2)],2,mean),apply(da[west,-c(1,2)],2,mean)) > M [,1] [,2] [,3] [,4] AGE050200D 3.705131e+01 3.811625e+01 3.777956e+01 3.718879e+01 BZA115203D 2.238562e-01 -4.332500e+00 -2.267291e+00 1.110345e+00 CLF040200D 4.632680e+00 3.657500e+00 4.416184e+00 6.244828e+00 EDU635200D 7.361373e+01 8.225594e+01 7.482572e+01 8.114224e+01 EDU685200D 1.457680e+01 1.431656e+01 1.542521e+01 1.936810e+01 HSG045200D 8.044444e+00 6.816562e+00 1.575571e+01 1.633362e+01 HSG495200D 5.846732e+04 7.602125e+04 8.654140e+04 1.307491e+05 INC610199D 3.725614e+04 4.543871e+04 4.188305e+04 4.402873e+04 INC910199D 1.574440e+04 1.819005e+04 1.762827e+04 1.819972e+04 IPE010200D 3.188551e+04 3.947356e+04 3.588853e+04 3.821586e+04 IPE120200D 1.699837e+01 9.140313e+00 1.329523e+01 1.319052e+01 IPE220200D 2.394739e+01 1.266219e+01 1.872641e+01 1.882414e+01 LFE305200D 2.401340e+01 2.149219e+01 2.579881e+01 2.211379e+01 PIN020200D 2.094700e+04 2.417972e+04 2.321806e+04 2.403807e+04 POP050200D 8.283987e+00 4.062500e+00 1.024055e+01 1.616207e+01 POP060200D 6.590621e+01 9.249344e+01 1.828421e+02 1.647716e+02 POP165200D 5.081536e+01 5.074594e+01 5.101295e+01 4.982759e+01 POP225200D 7.758333e+01 9.593219e+01 8.562027e+01 8.376983e+01 POP255200D 1.336667e+01 1.692813e+00 1.145741e+01 1.559483e+00 POP285200D 2.348366e+00 2.543750e-01 3.321976e-01 1.993966e+00 POP325200D 4.490196e-01 4.378125e-01 6.032368e-01 2.432759e+00 POP405200D 9.479739e+00 2.018750e+00 2.408007e+00 1.334052e+01 POP645200D 3.341176e+00 1.557187e+00 2.626405e+00 8.694828e+00 VST020200D 1.353856e+01 1.227937e+01 1.239046e+01 1.262672e+01 VST220200D 1.151863e+01 1.079156e+01 1.044395e+01 9.142241e+00 VST420200D 7.472222e+00 6.154687e+00 7.596252e+00 6.198276e+00 WAT130200D 1.611786e+03 2.109659e+03 1.112694e+03 6.616555e+03 P.pop.under5 6.580324e+00 6.232957e+00 6.118504e+00 6.348849e+00 P.pop.under18 2.635719e+01 2.549562e+01 2.434618e+01 2.607041e+01 P.pop.over85 2.103039e+00 2.360762e+00 1.679386e+00 1.672105e+00 P.ind.fam.farms 9.076461e+01 8.849999e+01 9.043491e+01 8.426045e+01 P.land.farms 6.290727e+01 7.931597e+01 3.275108e+01 3.458045e+01 P.irrigatedLand Inf 3.381284e+00 Inf 2.006879e+01 P.violent.crime 3.207766e-01 8.674755e-02 2.946077e-01 3.348652e-01 P.murders 4.446644e-03 9.991130e-04 3.495106e-03 2.931687e-03 P.rapes 2.683564e-02 1.064653e-02 1.955211e-02 3.005992e-02 P.16to19.notHighSc 1.056589e+01 7.645125e+00 1.062883e+01 9.078797e+00 P.Hisp.Latin 9.407646e+00 1.978180e+00 2.341986e+00 1.327710e+01 R.Hisp.Lat.maleFemale 1.229879e+00 Inf NA 1.171553e+00 P.emply.ind.AGR.FOR.FISH.HUNT.MINING 8.323722e+00 5.990339e+00 3.431304e+00 7.956955e+00 P.emply.ind.CONSTRUCTION 7.450183e+00 6.311400e+00 7.353999e+00 6.644653e+00 P.emply.ind.MANUFACTORING 1.275459e+01 2.152131e+01 1.834340e+01 9.776601e+00 P.emply.ind.WHOLESALEtrade 2.766558e+00 3.053673e+00 2.866999e+00 2.833837e+00 P.emply.ind.RETAILtrade 1.084854e+01 1.095023e+01 1.131198e+01 1.064729e+01 P.emply.ind.TRANSPORT.WAREHOUSING 5.410541e+00 5.163809e+00 4.880563e+00 4.484460e+00 P.emply.ind.INFORMATION 1.441289e+00 1.651531e+00 1.763538e+00 1.860243e+00 P.emply.ind.FINANC.INSUR 4.144753e+00 4.189244e+00 4.194833e+00 4.113244e+00 P.emply.ind.PROFscientTECH 4.304369e+00 4.257937e+00 5.330816e+00 6.102132e+00 P.emply.ind.EDUC.HEALTH 1.988487e+01 1.870545e+01 1.899052e+01 1.862793e+01 P.emply.ind.ARTSaccomFOOD 5.721664e+00 5.774690e+00 6.271828e+00 8.140394e+00 P.emply.ind.PUBLICaddmin 5.223214e+00 3.521698e+00 4.666341e+00 5.762203e+00 P.25overLESS9thGRADE 1.031324e+01 6.652107e+00 9.706479e+00 7.133818e+00 P.employ.FARMING 1.298267e+01 9.958493e+00 4.789943e+00 8.400648e+00 P.employ.AGRIC.FOREST.FISH.HUNT 1.110473e+00 1.823078e-01 7.170553e-01 2.357110e+00 F.GOV.EXP.perCapita 5.468467e+02 4.645189e+02 5.004825e+02 5.314188e+02 P.employ.GOV 1.879330e+01 1.389505e+01 1.545659e+01 2.064506e+01 P.employ.GOV.stateLoc 1.625734e+01 1.222596e+01 1.334276e+01 1.687476e+01 P.vacantHousingUnits 1.482792e+01 8.548903e+00 1.285020e+01 1.328897e+01 P.occupiedHousingUnits 8.517208e+01 9.145110e+01 8.714980e+01 8.671103e+01 P.occupiedHousingUnitsBLACK 1.043022e+01 1.119232e+00 8.972250e+00 1.249519e+00 P.occupiedHousingUnitsHisLat 5.798692e+00 1.182373e+00 1.376569e+00 8.194458e+00 P.OWNERoccupiedHousingUnits 6.313439e+01 6.966030e+01 6.510236e+01 5.789313e+01 P.RENTERoccupiedHousingUnits 2.203769e+01 2.179080e+01 2.204743e+01 2.881790e+01 P.occupiedHousingUnitsLackingPlumb 7.908735e-01 4.948386e-01 9.446285e-01 7.176409e-01 P.URBANpopul 4.077331e+01 4.130698e+01 4.030167e+01 6.026730e+01 P.RURALpopul 5.922669e+01 5.869302e+01 5.969833e+01 3.973270e+01 P.CHANGErural90to00 1.010434e+01 -3.028793e+00 -1.562482e+00 -3.422337e+00 P.CHANGEurban90to00 8.787050e+00 1.106016e+01 2.283582e+01 3.857946e+01 P.LAND 9.727151e+01 9.870546e+01 9.587915e+01 9.542434e+01 P.WATER 2.728487e+00 1.294513e+00 4.120870e+00 4.575628e+00 P.BELOWpovertyLevel 1.764538e+01 9.208135e+00 1.432859e+01 1.419735e+01 P.ABOVEpovertyLevel 8.235462e+01 9.079186e+01 8.567141e+01 8.580265e+01 P.WHITE.BELOWpovertyLevel 1.314581e+01 8.742184e+00 1.188328e+01 1.230589e+01 P.aINDIAN.BELOWpovertyLevel NA NA NA 2.388984e+01 P.BLACK.BELOWpovertyLevel NA NA NA NA P.ASIAN.BELOWpovertyLevel NA NA NA NA P.HisLat.BELOWpovertyLevel 2.806823e+01 1.892611e+01 NA 2.553416e+01 CHANGEperCapitaIncome89to99 5.676222e+03 6.653450e+03 6.233325e+03 5.874371e+03 P.CHANGEemployIndustry90to00 1.878123e+01 1.559929e+01 1.633401e+01 2.284493e+01 P.IrrigationGROUNDwaterUse NA NA NA 2.796188e+01 GroundWaterUsePerCapita 1.018104e-03 1.067929e-03 1.556404e-04 1.129905e-03 P.NET.DOMESTIC.MIGRATIONS -5.273263e-01 -4.457541e-01 1.212542e-01 -1.495736e-01 P.NativePopulationBornInStateOfRes 7.388104e+01 7.706141e+01 7.259531e+01 5.158747e+01 R.LABOR.FORCEmaleFemale 1.205602e+00 1.164036e+00 1.160966e+00 1.211234e+00 R.VOTING.DEMOCRATESoverREPUBLICANS 6.720723e+01 7.530776e+01 8.824383e+01 7.642053e+01 P.PUBLIC.SCHOOL.ENROLNEMT 7.128929e+01 6.774728e+01 6.893168e+01 6.875286e+01 TOTAL.DEPOSITSperCapita 1.117644e+01 1.447363e+01 1.121010e+01 8.799865e+00 CROPvaluePerFARM 3.048392e+01 6.127429e+01 2.972905e+01 1.145691e+02 LIFESTOCKvaluePerFARM 5.498766e+01 4.841435e+01 4.143259e+01 4.736536e+01 P.CHANGEpverty95to00 -1.399711e+01 -8.650349e+00 -1.150927e+01 -7.111793e+00 > ===== Partition by state ===== The first two characters in the county id identify the state to which the county belongs. > d <- readLines("usc3110.paj")[14:3123] > n <- 3110; s <- character(n) > for(i in 1:n) s[i] <- as.integer(substr(strsplit(dd[i],'\"')[[1]][2],1,2)) > out <- file("states3110.clu","w"); cat("*vertices 3110",s,sep="\n",file=out); close(out) {{notes:clu:zip:states.zip}} This partition is already included in the ''USC3110.paj'' file: ------------------------------------------------------------------------------ Reading Partition --- "US states" ------------------------------------------------------------------------------ To which state a county belongs? 01 - Alabama 21 - Kentucky 38 - North Dakota 02 - Alaska 22 - Louisiana 39 - Ohio 04 - Arizona 23 - Maine 40 - Oklahoma 05 - Arkansas 24 - Maryland 41 - Oregon 06 - California 25 - Massachusetts 42 - Pennsylvania 08 - Colorado 26 - Michigan 44 - Rhode Island 09 - Connecticut 27 - Minnesota 45 - South Carolina 10 - Delaware 28 - Mississippi 46 - South Dakota 11 - Dist. of Columbia 29 - Missouri 47 - Tennessee 12 - Florida 30 - Montana 48 - Texas 13 - Georgia 31 - Nebraska 49 - Utah 15 - Hawaii 32 - Nevada 50 - Vermont 16 - Idaho 33 - New Hampshire 51 - Virginia 17 - Illinois 34 - New Jersey 53 - Washington 18 - Indiana 35 - New Mexico 54 - West Virginia 19 - Iowa 36 - New York 55 - Wisconsin 20 - Kansas 37 - North Carolina 56 - Wyoming ===== Partitions in 4 and 9 regions ===== File/Partition/Read [states3110.clu] File/Partition/Read [regions4.clu] File/Partition/Read [regions9.clu] [select states partition as partition 1] [select regions9 partition as partition 2] Operations/Functional Composition/Partition*Partition File/Partition/Save [reg3310-9.clu] [select states partition as partition 1] [select regions4 partition as partition 2] Operations/Functional Composition/Partition*Partition File/Partition/Save [reg3310-4.clu] ===== Discriminant analysis ===== > getwd() [1] "D:/Data/counties" > load("rankDat.RData") > C <- read.csv("./reg3110-9.clu",header=FALSE,skip=1)$V1 > table(C) C 1 2 3 4 5 6 7 8 9 133 281 618 437 67 150 470 364 590 > library(MASS) > V <- da[,-c(1,2)] > names(V) [33] "P.irrigatedLand" [39] "R.Hisp.Lat.maleFemale" [74] "P.aINDIAN.BELOWpovertyLevel" [75] "P.BLACK.BELOWpovertyLevel" [76] "P.ASIAN.BELOWpovertyLevel" [80] "P.IrrigationGROUNDwaterUse" > U <- V[,-c(33,39,74,75,76,80)] > L <- lda(C~.,U) Warning message: In lda.default(x, grouping, ...) : variables are collinear > names(L) [1] "prior" "counts" "means" "scaling" "lev" "svd" [7] "N" "call" "terms" "xlevels" "na.action" > L$means AGE050200D BZA115203D CLF040200D EDU635200D EDU685200D HSG045200D HSG495200D 1 37.27970 2.4563910 6.190977 81.26241 20.69248 16.362406 152539.85 2 36.79319 2.1946237 4.670609 82.71935 20.14050 19.684588 106231.90 .... 9 68.83642 9.874268 41.91874 LIFESTOCKvaluePerFARM P.CHANGEpverty95to00 1 50.48722 -3.836304 2 65.86691 -1.174104 3 74.07023 -13.616674 4 26.20308 -6.648226 5 29.61547 -9.345561 6 36.08322 -10.412122 7 60.42837 -12.549059 8 29.61485 -12.168219 9 48.31252 -9.428939 > L$counts 1 2 3 4 5 6 7 8 9 133 279 608 437 67 148 470 362 551 > L$scaling LD1 LD2 LD3 AGE050200D -1.850965e-01 6.327903e-02 -1.865893e-02 BZA115203D 1.561909e-03 5.040118e-04 -1.502661e-03 CLF040200D -2.216119e-01 1.937772e-01 -1.191955e-01 .... > L$N [1] 3055 > L$lev [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" > L$prior 1 2 3 4 5 6 7 0.04353519 0.09132570 0.19901800 0.14304419 0.02193126 0.04844517 0.15384615 8 9 0.11849427 0.18036007 > L$svd [1] 35.981057 29.047318 23.855715 16.392248 14.756034 13.592597 10.619947 [8] 7.094417 > plot(L) > \\ \\ [[ru:7iss#labs|Back to 7ISS Labs]]