====== Clamix - specificity of variable in cluster ======

[[https://r-forge.r-project.org/scm/viewvc.php/pkg/R/clamix4.R?view=markup&revision=2&root=clamix&pathrev=2|R-Forge / Clamix]]

===== Problem =====

In the program Clamix we still don't have a good answer to the question: which variables (and which of their values) are characteristic (specific) for a given cluster C?

This morning (October 31, 2012) I had the idea to define, for a selected variable V, its **specificity** s(V,C) for a cluster C as

s(V,C) = 1/2 ∫ |p<sub>U</sub>(t) - p<sub>C</sub>(t)| dt

or, in the discrete case,

s(V,C) = 1/2 ∑<sub>v ∈ V</sub> |p<sub>U</sub>(v) - p<sub>C</sub>(v)|

Geometrically, s(V,C) is half the area of the symmetric difference of the regions below the distribution of values of V on the set of units U and the distribution of values of V on the cluster C. See Figure 1.

The specificity s(V,C) has the following properties:
  - 0 ≤ s(V,C) ≤ 1
  - if p<sub>U</sub> = p<sub>C</sub> then s(V,C) = 0; the values of V on C behave like a random sample from the values of V on the set of units U.
  - if p<sub>U</sub> and p<sub>C</sub> are disjoint then s(V,C) = 1

Proof of 1.: s(V,C) ≥ 0 because the integrand is nonnegative. For the upper bound, since p<sub>U</sub> and p<sub>C</sub> are nonnegative,

s(V,C) = 1/2 ∫ |p<sub>U</sub>(t) - p<sub>C</sub>(t)| dt ≤ 1/2 ∫ (p<sub>U</sub>(t) + p<sub>C</sub>(t)) dt = 1/2 ( ∫ p<sub>U</sub>(t) dt + ∫ p<sub>C</sub>(t) dt) = (1+1)/2 = 1

<code>
> fU <- c(71,123,83,365,44,62)
> fC <- c(2,0,1,4,0,15)
> (pU <- fU/sum(fU))
[1] 0.09491979 0.16443850 0.11096257 0.48796791 0.05882353 0.08288770
> (pC <- fC/sum(fC))
[1] 0.09090909 0.00000000 0.04545455 0.18181818 0.00000000 0.68181818
> (r <- sum(abs(pU-pC))/2)
[1] 0.5989305
> plot(c(1,7),c(0,max(max(pU),max(pC))),type="n",main="Distributions",
+    xlab="values",ylab="p")
> lines(c(0,pU),type="S",col="blue")
> lines(c(7,7),c(pU[6],0),col="blue")
> lines((1:7)+0.02,c(0,pC),type="S",col="red")
> lines(c(7,7)+0.02,c(pC[6],0),col="red")
</code>

{{notes:pics:distri.png?500}}

To identify the most characteristic values I would try the index

max(p<sub>U</sub>(v), p<sub>C</sub>(v)) / min(p<sub>U</sub>(v), p<sub>C</sub>(v))

and select the values for which this index is (very) large.
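
The discrete definition and the ratio index can be wrapped into two small helper functions. This is only a sketch: the names ''specificity'' and ''ratioIndex'' are hypothetical and not part of Clamix, and both inputs are assumed to be frequency vectors over the same ordered set of values of V.

```r
# Hypothetical helpers, not part of Clamix.
# fU, fC: frequency vectors over the same values of V (assumption).
specificity <- function(fU, fC) {
  stopifnot(length(fU) == length(fC), sum(fU) > 0, sum(fC) > 0)
  pU <- fU / sum(fU)   # distribution on the set of units U
  pC <- fC / sum(fC)   # distribution on the cluster C
  sum(abs(pU - pC)) / 2
}

# Index max(pU(v), pC(v)) / min(pU(v), pC(v)) for every value v.
# Gives Inf where exactly one probability is 0 and NaN where both are 0.
ratioIndex <- function(fU, fC) {
  pU <- fU / sum(fU)
  pC <- fC / sum(fC)
  pmax(pU, pC) / pmin(pU, pC)
}

# Same frequencies as in the session above
fU <- c(71, 123, 83, 365, 44, 62)
fC <- c(2, 0, 1, 4, 0, 15)
s <- specificity(fU, fC)
idx <- ratioIndex(fU, fC)
print(s)
print(idx)
```

With the frequencies from the session, ''specificity(fU, fC)'' should reproduce the value of ''r'' printed there. Values with an infinite index are those that never occur in the cluster (or never occur in U), so they may deserve separate treatment when ranking.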
===== Example / Cars =====

I put the data and the code into {{:notes:zip:specific.zip}} - I hope it contains everything that is needed ;-)

I included the functions ''plotDistri'' and ''specific'' in Clamix2, thus producing Clamix3.

<code>
# Draw two distributions as step functions: pU in blue, pC in red.
# The last element of each vector is the (normalized) total and is not drawn.
plotDistri <- function(pU,pC){
  ln <- length(pU)
  plot(c(1,ln),c(0,max(max(pU),max(pC))),type="n",main="Distributions",
    xlab="values",ylab="p")
  lines(c(0,pU[-ln]),type="S",col="blue")
  lines(c(ln,ln),c(pU[ln-1],0),col="blue")
  lines((1:ln)+0.02,c(0,pC[-ln]),type="S",col="red")
  lines(c(ln,ln)+0.02,c(pC[ln-1],0),col="red")
}

# Print and plot the distribution of variable var on U and on the cluster
# with the given leader. Uses the global objects L, total and pU.
specific <- function(leader,var){
  Lq <- L[[leader]]; names(Lq) <- names(total)
  q <- Lq[[var]]; names(q) <- names(pU[[var]])
  ln <- length(q); q <- q/q[ln]     # normalize by the total (last element)
  cl <- q[-ln]; u <- pU[[var]][-ln] # drop the total
  print(u); print(cl)
  print(pmax(u,cl)/pmin(u,cl))      # ratio index for each value
  plotDistri(pU[[var]],q)
}
</code>

And here is the code for clustering the leaders from ''cars25.rez'':

<code>
setwd("C:/Users/Batagelj/work/clamix/clamix.R")
source("C:\\Users\\Batagelj\\work\\clamix\\clamix.R\\clamix3.R")
load("./cars2/cars25.rez")
load("./cars2/cars.so")
load("./cars2/cars.meta")
alpha <- rep(1/nVar,nVar)
hc <- hclustSO(rez$leaders)
plot(hc,hang=-1)
long[rez$clust==9]
L <- rez$leaders
total <- computeTotal(L)
objects()
</code>

{{notes:pics:leaders.png?500}}

and for producing the //**specificity**// table S.
<code>
# normalize the totals: the last element of each variable is its total
pU <- total
for(j in 1:nVar) pU[[j]] <- pU[[j]]/pU[[j]][[length(pU[[j]])]]
S <- matrix(0,nrow=length(L),ncol=length(total),
  dimnames=list(names(L),names(total)))
# S[i,j] = specificity of variable j for the cluster with leader i
for(i in 1:length(L)){
  pC <- L[[i]]
  for(j in 1:nVar) {
    ln <- length(pC[[j]])
    pC[[j]] <- pC[[j]]/pC[[j]][[ln]]
    S[i,j] <- sum(abs(pU[[j]][-ln]-pC[[j]][-ln]))/2
  }
}
# list the 7 most specific variables for each leader
for(i in 1:length(L)) {
  cat(i,names(L)[i],"\n"); print(sort(S[i,],decreasing=TRUE)[1:7]) }
</code>

Here is the list of the 7 most specific variables for each leader:

<code>
1 L1
 NumPassen       type rpm_maxTor     height   displace minFuelCon     weight
 0.9510749  0.8784285  0.8724981  0.8472943  0.8465530  0.8421053  0.8376575
2 L2
      type  NumPassen     height  wheelbase     weight    maxLoad      width
 0.9329496  0.9225715  0.8276864  0.7862469  0.7004026  0.6820593  0.5931772
3 L3
 fuelCapac  wheelbase      drive      width     length     weight    luggage
 0.8223112  0.8030377  0.7758308  0.7418734  0.6767976  0.6753150  0.6427614
4 L4
 maxTorque   maxPowKW   maxPowKM   displace     weight   maxSpeed      price
 0.7388487  0.6975537  0.6939879  0.6008026  0.5518519  0.5518519  0.5136627
5 L5
  maxPowKW   maxPowKM  maxTorque  accelTime  fuelCapac      price   maxSpeed
 0.7548909  0.7541496  0.6530764  0.6436974  0.6194766  0.5819286  0.5597061
6 L6
rpm_maxTor rpm_maxPow     weight   displace minFuelCon  maxTorque      price
 0.8302446  0.7863762  0.6962830  0.6839458  0.6641957  0.6538176  0.6515938
7 L7
      type  maxTorque     height   displace   maxSpeed   NumDoors   maxPowKW
 0.7636739  0.6730912  0.6249364  0.5647466  0.5631186  0.5604151  0.5352113
8 L8
      type      drive     height   maxSpeed    maxLoad  fuelCapac     weight
 0.9058710  0.8732543  0.8472943  0.8176575  0.7369311  0.6093996  0.6093847
9 L9
  displace  maxTorque   maxPowKM   maxPowKW      price  accelTime minFuelCon
 0.6499158  0.6258036  0.6206146  0.6206146  0.5664196  0.4966575  0.4959162
10 L10
  maxSpeed   maxPowKW   maxPowKM  enlarLugg       type  maxTorque rpm_maxTor
 0.6473594  0.6369477  0.6109127  0.5828785  0.5797785  0.5769931  0.5430173
11 L11
  maxPowKW      price   displace   maxSpeed     weight     length  wheelbase
 0.8419041  0.8317272  0.8036959  0.7721593  0.7662290  0.7617812  0.7450704
12 L12
      type     length  fuelCapac      drive    maxLoad  wheelbase    luggage
 0.8421053  0.8128016  0.7382339  0.7367513  0.6600505  0.6489766  0.5693106
13 L13
  maxPowKM  maxTorque   maxPowKW  accelTime  wheelbase   maxSpeed      width
 0.6809837  0.6790215  0.6664196  0.6316749  0.6103257  0.6055640  0.5570226
14 L14
  NumDoors       type   maxPowKW   maxSpeed     height   maxPowKM      price
 0.9111175  0.9014808  0.8561898  0.8435878  0.8354337  0.8157081  0.8073370
15 L15
  maxPowKM   maxPowKW  maxTorque     weight      price   displace   maxSpeed
 0.8078356  0.7796666  0.6671709  0.5999439  0.5945846  0.5803799  0.5201651
16 L16
   maxLoad   maxSpeed  wheelbase      width  fuelCapac  maxTorque   maxPowKW
 0.8398814  0.8376575  0.8369162  0.8361749  0.8257969  0.8228317  0.7983692
17 L17
 enlarLugg   NumDoors       type     length      price   displace  maxTorque
 0.6586360  0.6022027  0.5984962  0.5925765  0.5471460  0.4960606  0.4959229
18 L18
  maxPowKW  fuelCapac   maxPowKM     length      price      width     weight
 0.7983692  0.7978249  0.7976279  0.7894478  0.7177137  0.6683428  0.6638743
19 L19
    length   NumDoors       type     weight  enlarLugg  wheelbase    luggage
 0.7607460  0.6879170  0.6842105  0.6767547  0.6586360  0.6114137  0.6011431
20 L20
rpm_maxTor rpm_maxPow  fuelCapac   maxPowKW   maxPowKM minFuelCon     weight
 0.7847901  0.7766359  0.7529146  0.6956163  0.6901745  0.6822731  0.6553002
21 L21
    weight   maxSpeed   maxPowKW      price  accelTime   maxPowKM     length
 0.8097283  0.6901408  0.6530764  0.6370078  0.6283522  0.6241660  0.6095365
22 L22
      type  fuelCapac     height     weight      drive  maxTorque minFuelCon
 0.9258710  0.8695330  0.8472943  0.8376575  0.8242887  0.7812939  0.7731397
23 L23
 fuelCapac     length    luggage  wheelbase    maxLoad   NumDoors      width
 0.8065508  0.7991379  0.7546605  0.7502402  0.7206161  0.6879170  0.6864344
24 L24
    length  wheelbase  fuelCapac      width       type    luggage     weight
 0.8413640  0.7451674  0.7214461  0.7108970  0.6882591  0.6590067  0.6083709
25 L25
      type     height  wheelbase      drive   NumDoors     length    luggage
 0.8703155  0.8472943  0.8309859  0.7265876  0.7050490  0.6538176  0.6389918
</code>

Now we can select from the //**specificity**// table S the interesting variables for a selected cluster and, using the function ''specific'', try to provide its
characteristics.

==== 1 / NumPassen ====

specificity = 0.9510749

<code>
> specific(1,'NumPassen')
          2           3           4           5           6           7           8
0.018532246 0.000000000 0.059303188 0.864343958 0.001482580 0.048925130 0.007412898
         NA
0.000000000
 2  3  4  5  6  7  8 NA
 0  0  0  0  0  1  0  0
       2        3        4        5        6        7        8       NA
     Inf      NaN      Inf      Inf      Inf 20.43939      Inf      NaN
</code>

{{notes:pics:1-NumPassen.png?500}}

All cars in cluster 1 have the value NumPassen=7.

==== 10 / maxSpeed ====

specificity = 0.6473594

<code>
> specific(10,'maxSpeed')
[130,163] (163,174] (174,187] (187,200] (200,215] (215,400]        NA
0.1623425 0.1475167 0.1890289 0.1890289 0.1556709 0.1564122 0.0000000
 [130,163]  (163,174]  (174,187]  (187,200]  (200,215]  (215,400]         NA
0.00000000 0.00000000 0.03030303 0.07575758 0.80303030 0.09090909 0.00000000
[130,163] (163,174] (174,187] (187,200] (200,215] (215,400]        NA
      Inf       Inf  6.237954  2.495182  5.158514  1.720534       NaN
</code>

{{notes:pics:10-maxSpeed.png?500}}

Most of the cars in cluster 10 have maxSpeed in the interval (200,215].\\
No car in this cluster has maxSpeed in the interval [130,174].

==== 14 / NumDoors ====

specificity = 0.9111175

<code>
> specific(14,'NumDoors')
         2          3          4          5         NA
0.06449222 0.18383988 0.31208302 0.43958488 0.00000000
         2          3          4          5         NA
0.97560976 0.00000000 0.02439024 0.00000000 0.00000000
       2        3        4        5       NA
15.12756      Inf 12.79540      Inf      NaN
</code>

{{notes:pics:14-NumDoors.png?500}}

Most of the cars in cluster 14 have NumDoors=2.\\
A small fraction of them have NumDoors=4.
==== 19 / length ====

specificity = 0.7607460

<code>
> specific(19,'length')
[2600,4010] (4010,4245] (4245,4470] (4470,4555] (4555,4761] (4761,6000]          NA
  0.1616012   0.1845812   0.1645663   0.1586360   0.1638251   0.1667902   0.0000000
[2600,4010] (4010,4245] (4245,4470] (4470,4555] (4555,4761] (4761,6000]          NA
 0.00000000  0.00000000  0.00000000  0.00000000  0.07246377  0.92753623  0.00000000
[2600,4010] (4010,4245] (4245,4470] (4470,4555] (4555,4761] (4761,6000]          NA
        Inf         Inf         Inf         Inf    2.260786    5.561095         NaN
</code>

{{notes:pics:19-length.png?500}}

All cars from cluster 19 have length in the interval (4555,6000].\\
Most of them in the interval (4761,6000].

==== 19 / NumDoors ====

specificity = 0.6879170

<code>
> specific(19,'NumDoors')
         2          3          4          5         NA
0.06449222 0.18383988 0.31208302 0.43958488 0.00000000
 2  3  4  5 NA
 0  0  1  0  0
       2        3        4        5       NA
     Inf      Inf 3.204276      Inf      NaN
</code>

{{notes:pics:19-NumDoors.png?500}}

All cars from cluster 19 have NumDoors=4.

==== 22 / type ====

specificity = 0.9258710

<code>
> specific(22,'type')
         LI          KL          EN          KA          KB          RO          TE
0.315789474 0.330615271 0.047442550 0.157894737 0.008154188 0.014084507 0.074128984
         KU
0.051890289 0.000000000
 LI  KL  EN  KA  KB  RO  TE  KU
  0   0   0   0   0   0   1   0   0
   LI    KL    EN    KA    KB    RO    TE    KU
  Inf   Inf   Inf   Inf   Inf   Inf 13.49   Inf   NaN
</code>

{{notes:pics:22-type.png?500}}

All cars in the cluster 22 have type=TE.
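
The verbal descriptions in the sections above were read off the printed vectors by hand. A rough sketch of how this reading could be automated follows; ''describeValues'' and its threshold ''minRatio'' are hypothetical and not part of Clamix, and the inputs are assumed to be the two named probability vectors that ''specific'' prints (with the total already dropped).

```r
# Hypothetical helper, not part of Clamix.
# u, cl: named probability vectors of a variable on U and on the cluster,
# assumed to have the same names in the same order.
describeValues <- function(u, cl, minRatio = 3) {
  ratio <- pmax(u, cl) / pmin(u, cl)
  big <- !is.na(ratio) & ratio >= minRatio   # NaN (0/0) is ignored
  list(
    over   = names(u)[big & cl > u],           # over-represented in the cluster
    absent = names(u)[big & cl == 0 & u > 0]   # occur in U but not in the cluster
  )
}

# NumDoors in cluster 14, values copied from the output above (NA column omitted)
u  <- c("2" = 0.06449222, "3" = 0.18383988, "4" = 0.31208302, "5" = 0.43958488)
cl <- c("2" = 0.97560976, "3" = 0, "4" = 0.02439024, "5" = 0)
res <- describeValues(u, cl)
print(res)
```

Splitting the large-index values into "over-represented" and "absent" is a design choice of this sketch: the index itself is symmetric, so a large value alone does not say whether the cluster prefers or avoids a value. The threshold 3 is arbitrary.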