====== Generalized ANOVA ====== ===== Načrt ===== - Generalized Huyghens theorem GW p7 (41) is based on GW p3 (13) and GW p3 (16). It can be used for any dissimilarity d; - The terms are nonnegative for DASS p20 Th3 (22), GW p7 (42); - If a dissimilarity d is not a distance it can be transformed into a distance using the power transformation Joly and Le Calvé. For a given dissimilarity matrix we can compute its metric index r (solving nonlinear equations). If r ≤ 1 we can use D otherwise we use Dr. - Generalized ANOVA DASS sec5 p5-8; DACO p8- - TraMineR: DACO p17 : ''R> dissassoc(dm, group = mydata$sex, R = 1000)'' ; dm - dissimilarity matrix, mydata - data frame of covariates. See in R: ''help(dissvar)'', ''help(dissassoc)'' and ''help(dissmfac)''. - Standard Ward's hierarchical clustering procedure can be used for any dissimilarity d GW p4 (21,22), 6 (35); also the local optimization method can be adapted (Batagelj, 1988 [[http://vlado.fmf.uni-lj.si/pub/networks/doc/KN/Course2.pdf|Slides 5/2]]); what about the leaders method (without leaders)? ===== 31.5.2016 ===== GANOVA - zapiski iz 31.5.2016 (kaj naj bi se še naredilo) Kaj je ničelna domneva? * različnosti med našimi središči so enake 0? * vse različnosti so enake 0? Kaj je že narejeno (članki Anderson in ostali) - pregled! Dobiti primer iz biologije..., kjer imajo zanimivo definirane različnosti med enotami (Katja Rostohar?) Primer - različnost med besedili. Računanje metričnega indeksa iz matrike: * bi se dalo teoretično določiti? (članek Bren) * preveriti asociativnost v matriki * hitreje - pokaži za eksponent, da je metrika... Povezava: različnosti med enotami in različnostjo od enote do središča ===== Metric index ===== g <- function(x,a=3,b=4,c=5) a**x + b**x - c**x curve(g(x,a=2,b=3,c=6),from=0,to=1,n=101,main="f(r)=2**r+3**r-6**r",xlab="r",ylab="f") lines(c(0,1),c(0,0),col="red") curve(g(x,a=3,b=4,c=5),from=0,to=2.2,n=101,main="f(r)=3**r+4**r-5**r",xlab="r",ylab="f") lines(c(0,2.2),c(0,0),col="red") curve(g(x,a=0.1,b=0.2,c=100),from=0,to=0.2,n=101,main="f(r)=0.1**r+0.2**r-100**r",xlab="r",ylab="f") lines(c(0,0.2),c(0,0),col="red") curve(g(x,a=2,b=8,c=3),from=0,to=1,n=101,main="f(r)=2**r+8**r-3**r",xlab="r",ylab="f") lines(c(0,1),c(0,0),col="red") {{notes:da:pics:3_4-5.png?350}} {{notes:da:pics:2_3-6.png?350}} {{notes:da:pics:01_02-100.png?350}} {{notes:da:pics:2_8-3.png?350}} ==== Newton's method ==== f(r;x,y,z) = xr + yr - zr Newton's method: r' = r - f(r;x,y,z)/f'(r;x,y,z) f'(r;x,y,z) = xr ln(x) + yr ln(y) - zr ln(z) https://en.wikipedia.org/wiki/Triangle_inequality : 2 max(a, b, c) < a + b + c === Metric index of a, b, c === > metrIdx <- function(a,b,c,eps=1e-7){ + r <- 1; delta <- 1; s <- 0 + x <- min(a,b,c); y <- median(c(a,b,c)); z <- max(a,b,c) + if(x==z) return(Inf) + if(x==0) {if(x==z) return(Inf) else return(0)} + if(x+y>z){while(x**r + y**r > z**r) r <- 2*r + } else {while(x**r + y**r > z**r) r <- r/2; r <- 2*r} + while((delta > eps)&&(s<100)){s <- s+1 + delta <- (x**r + y**r - z**r)/(log(x)*x**r + log(y)*y**r - log(z)*z**r) + r <- r - delta; cat(s,r,delta,"\n") + } + return(r) + } > metrIdx(3,4,5) 1 2 0 [1] 2 > metrIdx(2,3,6) 1 1.556355 0.4436451 2 1.185778 0.3705771 3 0.9281817 0.257596 4 0.8102006 0.1179811 5 0.7885321 0.02166846 6 0.7878855 0.0006466786 7 0.7878849 5.594962e-07 8 0.7878849 4.182778e-13 [1] 0.7878849 > metrIdx(0.1,0.2,100) 1 1.782854 0.2171457 2 1.565713 0.2171413 3 1.34859 0.2171233 4 1.131539 0.2170508 5 0.9147816 0.2167574 6 0.6992131 0.2155684 7 0.488413 0.2108001 8 0.2956796 0.1927335 9 0.1578744 0.1378051 10 0.1095511 0.04832334 11 0.1057652 0.003785881 12 0.105746 1.923971e-05 13 0.105746 4.893126e-10 [1] 0.105746 > metrIdx(2,8,3) 1 1.576497 0.4235031 2 1.194857 0.3816404 3 0.8840544 0.3108021 4 0.6816694 0.2023851 5 0.6023689 0.07930052 6 0.5918757 0.01049317 7 0.5917098 0.000165908 8 0.5917097 4.081135e-08 [1] 0.5917097 > === Metric index of a matrix D - test === Matrix D from the next section [[#cars|Cars]]. > r <- Inf; diag(D) <- 0 > for(i in 1:3){ + for(k in (i+1):4){ + for(j in setdiff(1:4,c(i,k))){ + rr <- metrIdx(D[i,j],D[j,k],D[i,k]) + r <- min(r,rr) + cat(i,j,k,rr,r,"\n\n") + }}} 1 1.580902 0.4190978 2 1.197178 0.3837241 3 0.8759452 0.3212328 4 0.6565239 0.2194213 5 0.5627061 0.09381786 6 0.5480994 0.01460669 7 0.5477855 0.000313949 8 0.5477853 1.417759e-07 9 0.5477853 2.881858e-14 1 3 2 0.5477853 0.5477853 1 1.559775 0.4402253 2 1.16772 0.3920543 3 0.8587376 0.3089828 4 0.6748005 0.1839371 5 0.6168648 0.0579357 6 0.6119783 0.004886454 7 0.6119461 3.219644e-05 8 0.6119461 1.387763e-09 1 4 2 0.6119461 0.5477853 1 1.580902 0.4190978 2 1.197178 0.3837241 3 0.8759452 0.3212328 4 0.6565239 0.2194213 5 0.5627061 0.09381786 6 0.5480994 0.01460669 7 0.5477855 0.000313949 8 0.5477853 1.417759e-07 9 0.5477853 2.881858e-14 1 2 3 0.5477853 0.5477853 1 1.632998 0.3670016 2 1.350959 0.2820399 3 1.185265 0.1656936 4 1.132346 0.0529188 5 1.127643 0.004702768 6 1.127609 3.455153e-05 7 1.127609 1.851484e-09 1 4 3 1.127609 0.5477853 1 1.559775 0.4402253 2 1.16772 0.3920543 3 0.8587376 0.3089828 4 0.6748005 0.1839371 5 0.6168648 0.0579357 6 0.6119783 0.004886454 7 0.6119461 3.219644e-05 8 0.6119461 1.387763e-09 1 2 4 0.6119461 0.5477853 1 1.632998 0.3670016 2 1.350959 0.2820399 3 1.185265 0.1656936 4 1.132346 0.0529188 5 1.127643 0.004702768 6 1.127609 3.455153e-05 7 1.127609 1.851484e-09 1 3 4 1.127609 0.5477853 1 1.580902 0.4190978 2 1.197178 0.3837241 3 0.8759452 0.3212328 4 0.6565239 0.2194213 5 0.5627061 0.09381786 6 0.5480994 0.01460669 7 0.5477855 0.000313949 8 0.5477853 1.417759e-07 9 0.5477853 2.881858e-14 2 1 3 0.5477853 0.5477853 1 3.700217 0.2997831 2 3.461561 0.2386555 3 3.308258 0.1533036 4 3.24794 0.06031766 5 3.239723 0.008217567 6 3.239584 0.0001384606 7 3.239584 3.86783e-08 2 4 3 3.239584 0.5477853 1 1.559775 0.4402253 2 1.16772 0.3920543 3 0.8587376 0.3089828 4 0.6748005 0.1839371 5 0.6168648 0.0579357 6 0.6119783 0.004886454 7 0.6119461 3.219644e-05 8 0.6119461 1.387763e-09 2 1 4 0.6119461 0.5477853 1 3.700217 0.2997831 2 3.461561 0.2386555 3 3.308258 0.1533036 4 3.24794 0.06031766 5 3.239723 0.008217567 6 3.239584 0.0001384606 7 3.239584 3.86783e-08 2 3 4 3.239584 0.5477853 1 1.632998 0.3670016 2 1.350959 0.2820399 3 1.185265 0.1656936 4 1.132346 0.0529188 5 1.127643 0.004702768 6 1.127609 3.455153e-05 7 1.127609 1.851484e-09 3 1 4 1.127609 0.5477853 1 3.700217 0.2997831 2 3.461561 0.2386555 3 3.308258 0.1533036 4 3.24794 0.06031766 5 3.239723 0.008217567 6 3.239584 0.0001384606 7 3.239584 3.86783e-08 3 2 4 3.239584 0.5477853 === Metric index of a matrix D === > metrIdx <- function(a,b,c,eps=1e-7){ + r <- 1; delta <- 1; s <- 0 + x <- min(a,b,c); y <- median(c(a,b,c)); z <- max(a,b,c) + if(x==z) return(Inf) + if(x==0) {if(x==z) return(Inf) else return(0)} + if(x+y>z){while(x**r + y**r > z**r) r <- 2*r + } else {while(x**r + y**r > z**r) r <- r/2; r <- 2*r} + while((delta > eps)&&(s<100)){s <- s+1 + delta <- (x**r + y**r - z**r)/(log(x)*x**r + log(y)*y**r - log(z)*z**r) + r <- r - delta } + return(r) + } > metricIdx <- function(D){ + r <- Inf; numL <- nrow(D); numLm <- numL-1 + for(i in 1:numLm){ for(k in (i+1):numL){ for(j in setdiff(1:numL,c(i,k))){ + r <- min(r,metrIdx(D[i,j],D[j,k],D[i,k])) + }}} + return(r) + } > (rD <- metricIdx(D)) [1] 0.3645253 > S <- D**rD > (rS <- metricIdx(S)) [1] 1 === Can the matrix metric index be determined faster? === Can we get a lower bound for r using minimal/maximal element of the matrix? > i <- which.max(D) > arrayInd(i,dim(D)) [,1] [,2] [1,] 18 11 > D[i] [1] 26.29788 > D[18,11] [1] 26.29788 > diag(D) <- Inf > j <- which.min(D) > D[j] [1] 1.386916 > arrayInd(j,dim(D)) [,1] [,2] [1,] 2 1 > D[2,1] [1] 1.386916 ===== Cars ===== > library(TraMineR) > help(TraMineR) > help(dissvar) > help(dissassoc) > help(dissmfac) > setwd("C:/Users/Batagelj/work/clamix/clamix.R") > source("C:\\Users\\Batagelj\\work\\clamix\\clamix.R\\clamix3t.R") > load("./cars2/cars25.rez") > load("./cars2/cars.so") > load("./cars2/cars.meta") > alpha <-rep(1/nVar,nVar) > L <- rez$leaders > names(L) [1] "L1" "L2" "L3" "L4" "L5" "L6" "L7" "L8" "L9" "L10" "L11" "L12" [13] "L13" "L14" "L15" "L16" "L17" "L18" "L19" "L20" "L21" "L22" "L23" "L24" [25] "L25" > hc <- hclustSO(L) > plot(hc,hang=-1) > long[rez$clust==9] [1] "ALFA ROMEO 145 1.4 16V L" "CITROEN SAXO 1.6 VTL" [3] "CITROEN ZX 1.6i AVANTAGE" "CITROEN XANTIA 1.6i X" [5] "DAEWOO NEXIA HB GL STD" "DAEWOO NEXIA HB GL ABG" ... [107] "VOLKSWAGEN GOLF 1.6 CL" "VOLKSWAGEN GOLF 1.6 GL" [109] "VOLKSWAGEN GOLF 1.6 GT" "VOLKSWAGEN GOLF 1.6 PLUS" > total <- computeTotal(L) > numL <- length(L) > numL [1] 25 > numLm <- numL-1 > D <- matrix(nrow=numL,ncol=numL); diag(D) <- Inf > for(i in 1:numLm) for(j in (i+1):numL) { + D[i,j] <- distCl(L[[i]],L[[j]]); D[j,i] <- D[i,j] + } > A <- as.dist(D) > A 1 2 3 4 5 6 7 2 1.386916 3 3.582262 8.397003 4 3.356421 7.104338 6.415633 5 3.693970 7.019706 7.888564 5.153226 6 3.605076 5.832940 4.895284 4.740308 5.805828 7 3.306033 6.352564 6.961267 4.677096 3.937799 5.020761 8 2.592247 4.501111 8.509287 7.658725 6.987216 6.455513 6.909217 9 4.036341 10.416816 11.456966 8.328767 5.413662 7.683022 6.089895 10 3.472821 8.035387 4.845966 2.977475 7.679482 4.702332 7.082956 11 3.250934 10.824382 9.351577 11.642114 12.412319 8.768878 12.648273 12 2.629572 5.249485 5.534103 5.541161 7.545072 5.977826 5.773090 13 2.785301 7.668458 5.306050 5.157297 7.634207 5.499815 6.873963 14 3.483834 10.076100 7.854228 10.320341 10.945101 9.091888 11.255809 15 3.707527 8.460017 7.776793 4.606644 5.363650 5.952800 3.256083 16 3.953338 6.831914 7.246344 6.157174 3.846592 6.317949 5.727512 17 4.027002 9.534567 7.509222 6.748467 5.399018 5.383006 4.796475 18 4.687177 14.541875 16.168530 14.245039 6.358992 11.106008 12.709572 19 3.445288 8.812849 4.376746 6.803439 9.642363 5.529869 9.150236 20 4.101289 7.070411 8.026177 4.514790 4.136549 3.268842 5.145231 21 4.171015 9.960677 10.637548 7.921125 2.331700 7.456142 7.258457 22 1.675898 3.972340 6.545308 6.420424 6.560445 5.777787 6.218061 23 3.399929 6.259198 3.959837 3.880152 5.980594 2.066168 4.195328 24 3.336785 4.857392 4.148561 2.598911 3.742954 3.413352 2.673155 25 2.235043 3.549708 4.768962 4.232677 3.980621 4.786093 4.055086 8 9 10 11 12 13 14 8 9 10.328313 10 8.764973 10.577360 11 12.139533 19.148113 10.878854 12 6.077666 12.245622 6.695133 8.617629 13 9.031828 13.627706 5.781517 9.853327 5.662595 14 9.998934 13.343736 7.141143 5.742655 9.451189 9.091763 15 9.273916 7.198497 7.354566 14.691288 8.877081 8.472681 12.042983 16 6.219615 4.609104 6.787612 9.569811 7.361549 7.143369 8.314172 17 9.478498 4.951715 5.917122 13.245604 10.676097 10.275677 11.367578 18 13.380517 12.029638 17.468914 26.297885 16.120426 19.882129 17.662338 19 9.570105 14.655353 5.018135 7.046792 4.818606 6.655113 9.408573 20 6.969604 6.116134 7.219930 13.185988 7.940596 7.942214 11.656472 21 9.695003 5.568285 10.805160 16.645523 11.293219 11.701357 13.518996 22 2.925262 9.030335 7.249211 5.360439 4.705823 5.839052 5.866798 23 7.226293 7.681937 3.623840 7.641230 5.900978 4.073250 8.159579 24 5.485786 6.281107 4.045954 8.424491 4.070893 3.754298 8.135563 25 2.009623 3.245456 4.364927 6.032709 4.363828 4.661146 4.981074 15 16 17 18 19 20 21 16 6.089135 17 4.992753 5.402473 18 16.289304 2.511787 14.998501 19 10.340307 8.091243 8.253026 21.774856 20 5.375634 6.257397 6.349464 11.931383 9.433085 21 8.416492 3.799841 6.958137 7.741296 12.810770 6.891555 22 7.669240 6.555361 8.427444 11.075206 7.223033 7.216993 8.455048 23 4.748261 5.993337 4.869672 10.922880 4.817182 6.006513 7.598098 24 3.027707 5.615018 5.745588 9.214332 5.878896 3.286702 6.462960 25 4.637410 3.348819 4.557659 4.089213 5.338677 4.398977 4.362140 22 23 24 23 5.798148 24 5.115428 3.306279 25 2.611763 4.701425 4.070759 > print(dissvar(A)) [1] 3.450939 {{pub:zip:ganova-cars.zip}} ===== References ===== * Matthias Studer, Gilbert Ritschard, Alexis Gabadinho, and Nicolas S. Müller: Discrepancy Analysis of Complex Objects Using Dissimilarities. F. Guillet et al. (Eds.): Advances in Knowledge Discovery and Management, SCI 292. [[http://mephisto.unige.ch/pub/publications/gr/Studer_akdm_2010.pdf|Springer-Verlag Berlin Heidelberg 2010, pp. 3–19.]] * Matthias Studer, Gilbert Ritschard, Alexis Gabadinho and Nicolas S. Müller: Discrepancy Analysis of State Sequences. [[http://mephisto.unige.ch/pub/publications/gr/Studer-et-al_SMR11-preprint|Sociological Methods and Research. 2011. Vol. 40(3), pp. 471-510.]] * [[http://www.unige.ch/ses/metri/assistants/studer/|Matthias Studer]] * [[http://traminer.unige.ch/|TraMineR: a toolbox for exploring sequence data]] * Alexis Gabadinho, Gilbert Ritschard, Nicolas S Müller, Matthias Studer: Analyzing and Visualizing State Sequences in R with TraMineR. [[http://www.jstatsoft.org/article/view/v040i04|Journal of Statistical Software. Vol. 40(4), pp. 1-37.]] * Batagelj, V: Generalized Ward and Related Clustering Problems. Classification and Related Methods of Data Analysis. H.H. Bock (editor). {{pub:pdf:ward.pdf|North-Holland, Amsterdam, 1988. p. 67-74.}} * Batagelj, Vladimir. Local optimization method for the generalized Ward clustering problem. In: Bogosavljević, Srđan (ed.), Sinadinović, Nada (ed.). Zbornik radova [2]. Beograd: Savezni zavod za statistiku, 1988, str. 154-159. [COBISS.SI-ID 26997762] * Bren, M, Batagelj, V: The metric index. [[http://vlado.fmf.uni-lj.si/vlado/papers/MetrIdx.pdf|Croatica chemica acta 79 (3): 399-410 2006.]] * Batagelj, V, Bren, M: Comparing resemblance measures. [[http://vlado.fmf.uni-lj.si/vlado/papers/Compare.pdf|J CLASSIF 12 (1): 73-90 1995.]] * Vladimir Batagelj, Anže Vavpetič: Faster Pathfinder algorithm for sparse networks. {{pub:pdf:pathfinder30.pdf|Sunbelt XXX, Riva del Garda, Italy, June 29 - July 4, 2010.}} * Joly S., Le Calvé G.: Etude des puissances d'une distance. [[http://archive.numdam.org/ARCHIVE/SAD/SAD_1986__11_3/SAD_1986__11_3_30_0/SAD_1986__11_3_30_0.pdf|Statistique et Analyse de Données, 11 (1986) 3, 30-50.]] * Doss S.: Sur la moyenne d'un élément aléatoire dans un espace distancié. [[http://vlado.fmf.uni-lj.si/Pub/Doc/scan/Doss.pdf|Bulletin des Scinces mathematiques]], 2e serie, t. LXXIII 1949, 1-26. * Maurice Fréchet: Une nouvelle théorie : celle des éléments aléatoires abstraits. Revue Philosophique de la France et de l'Étranger, T. 147 (1957), pp. 145-158. https://www.jstor.org/stable/41088522?seq=1#page_scan_tab_contents * http://www.numdam.org/article/AIHP_1948__10_4_215_0.pdf * http://www.numdam.org/article/ASENS_1958_3_75_3_223_0.pdf * http://www.numdam.org/article/JSFS_1947__88__410_0.pdf * http://www.numdam.org/article/AIHP_1956__15_1_1_0.pdf * http://www.numdam.org/article/JSFS_2006__147_2_61_0.pdf * http://www.numdam.org/article/JSFS_2006__147_2_23_0.pdf * http://www.numdam.org/article/ASENS_1948_3_65__211_0.pdf ===== URLs ===== * http://documents.software.dell.com/Statistics/Textbook/ANOVA-MANOVA * http://www.statmethods.net/stats/anova.html * http://mathworld.wolfram.com/ANOVA.html * http://www.theanalysisfactor.com/why-anova-and-linear-regression-are-the-same-analysis/ * http://www.inside-r.org/r-doc/stats/anova * http://www.stat.yale.edu/Courses/1997-98/101/anovareg.htm * http://statsmodels.sourceforge.net/devel/examples/generated/example_interactions.html * https://onlinecourses.science.psu.edu/stat502/node/140 * https://projecteuclid.org/euclid.aos/1112967698 * http://rtutorialseries.blogspot.si/2011/01/r-tutorial-series-two-way-anova-with.html * http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_HypothesisTesting-ANOVA/BS704_HypothesisTesting-Anova_print.html * http://www3.stat.sinica.edu.tw/sda2014/