>"baptiste auguie" <baptiste.aug...@googlemail.com> wrote in message >news:de4e29f50912110200g7e43551kef5e8053fbf6e...@mail.gmail.com... >2009/12/10 Charles C. Berry <cbe...@tajo.ucsd.edu>: >[snipped] >> Many? >> >> <snip> Charles' elegant coding of the algorithm he summarises as: >> Any point that is in the convex hull of rbind(ps,xs) is either in or >> outside >> the convex hull of ps. Right? So, just recursively eliminate points that >> are >> in the convex hull of the larger set. >>
Baptiste commented: > >If I'm not mistaken this method is efficient only because the two >point distributions are very similar (drawn from rnorm, so they look >like two concentric balls). If one of the convex hulls is very >distorted along one axis, say, I believe the method will involve many >more iterations and in the limit will require computing a convex hull >for each test point as Duncan suggested. > >Such a pathological of test points example might be, > > xs <- matrix(0,ncol=4,nrow=100) > xs[,1] <- seq(1,100) > >Or did I completely miss something? (quite possible) > >---- Until now I thought the same as Baptiste and (mea culpe) had rejected that algorithm without testing it. Now I've tried it and it works! Here's the result on my real test data; sorry it's long, but it shows some important features: a) ps is quite "pathological"?; some substantial correlations b) xs <- expand.grid(lapply(ps, unique)); this is the reason I'm doing it in the first place. I want to expand.grid without 'extrapolating' beyound the (convex hull of) the original data c) xs has lost the correlation structure of ps d) 1170 'outside' points removed in 23 iterations e) xs[-(outside)] has regained (some of) the correlation structure of ps ### <begin example> > source(.trPaths[5], echo=TRUE, max.deparse.length=150) > describe(ps) ps 5 Variables 2637 Observations ----------------------------------------------------------------------------------------------------------------------------- t n missing unique Mean .05 .10 .25 .50 .75 .90 .95 2637 0 35 136.2 0 0 30 120 194 312 360 lowest : 0 2 3 4 24, highest: 312 336 360 384 504 ----------------------------------------------------------------------------------------------------------------------------- pH n missing unique Mean 2637 0 4 5.707 4.6 (727, 28%), 5.4 (729, 28%), 6.2 (624, 24%), 7 (557, 21%) ----------------------------------------------------------------------------------------------------------------------------- T n missing unique Mean 2637 0 5 10.75 2 5 8 15 22 Frequency 447 510 593 537 550 % 17 19 22 20 21 ----------------------------------------------------------------------------------------------------------------------------- S n missing unique Mean 2637 0 4 3.097 0 (631, 24%), 2 (648, 25%), 4 (638, 24%), 6 (720, 27%) ----------------------------------------------------------------------------------------------------------------------------- N n missing unique Mean 2637 0 4 118.6 0 (701, 27%), 80 (636, 24%), 160 (628, 24%), 240 (672, 25%) ----------------------------------------------------------------------------------------------------------------------------- > cor(ps) t pH T S N t 1.000000000 0.02255541 -0.425911455 0.05541686 0.004447023 pH 0.022555414 1.00000000 -0.029277466 0.05630345 -0.031032641 T -0.425911455 -0.02927747 1.000000000 0.05948337 0.003595186 S 0.055416859 0.05630345 0.059483366 1.00000000 0.014074045 N 0.004447023 -0.03103264 0.003595186 0.01407404 1.000000000 > xs <- expand.grid(lapply(ps, unique)) > describe(xs) xs 5 Variables 11200 Observations ----------------------------------------------------------------------------------------------------------------------------- t n missing unique Mean .05 .10 .25 .50 .75 .90 .95 11200 0 35 141.7 2 4 48 123 194 336 384 lowest : 0 2 3 4 24, highest: 312 336 360 384 504 ----------------------------------------------------------------------------------------------------------------------------- pH n missing unique Mean 11200 0 4 5.8 4.6 (2800, 25%), 5.4 (2800, 25%), 6.2 (2800, 25%), 7 (2800, 25%) ----------------------------------------------------------------------------------------------------------------------------- T n missing unique Mean 11200 0 5 10.4 2 5 8 15 22 Frequency 2240 2240 2240 2240 2240 % 20 20 20 20 20 ----------------------------------------------------------------------------------------------------------------------------- S n missing unique Mean 11200 0 4 3 0 (2800, 25%), 2 (2800, 25%), 4 (2800, 25%), 6 (2800, 25%) ----------------------------------------------------------------------------------------------------------------------------- N n missing unique Mean 11200 0 4 120 0 (2800, 25%), 80 (2800, 25%), 160 (2800, 25%), 240 (2800, 25%) ----------------------------------------------------------------------------------------------------------------------------- > cor(xs) t pH T S N t 1.000000e+00 4.276263e-23 2.346103e-20 0 0.000000e+00 pH 4.276263e-23 1.000000e+00 -1.716171e-19 0 0.000000e+00 T 2.346103e-20 -1.716171e-19 1.000000e+00 0 -3.925415e-21 S 0.000000e+00 0.000000e+00 0.000000e+00 1 0.000000e+00 N 0.000000e+00 0.000000e+00 -3.925415e-21 0 1.000000e+00 > phull <- convhulln(ps) > phull2 <- convhulln(rbind(ps,xs)) > nrp <- nrow(ps) > nrx <- nrow(xs) > outside <- unique(phull2[phull2>nrp])-nrp > done <- FALSE > print(length(outside)) [1] 20 > while(!done){ phull3 <- convhulln(rbind(ps,xs[-(outside),])) also.outside <- (1:nrx)[-outside][unique(phull3[phull3>nrp])-nrp] outside <- c(outside,also.outside) print(length(outside)) done <- length(also.outside)==0 } [1] 97 [1] 243 [1] 395 [1] 501 [1] 571 [1] 642 [1] 718 [1] 790 [1] 836 [1] 876 [1] 927 [1] 973 [1] 1012 [1] 1038 [1] 1060 [1] 1088 [1] 1120 [1] 1148 [1] 1162 [1] 1168 [1] 1170 [1] 1170 > describe(xs[-(outside),]) xs[-(outside), ] 5 Variables 10030 Observations ----------------------------------------------------------------------------------------------------------------------------- t n missing unique Mean .05 .10 .25 .50 .75 .90 .95 10030 0 35 120.5 2 4 30 120 172 240 336 lowest : 0 2 3 4 24, highest: 312 336 360 384 504 ----------------------------------------------------------------------------------------------------------------------------- pH n missing unique Mean 10030 0 4 5.781 4.6 (2548, 25%), 5.4 (2559, 26%), 6.2 (2520, 25%), 7 (2403, 24%) ----------------------------------------------------------------------------------------------------------------------------- T n missing unique Mean 10030 0 5 9.584 2 5 8 15 22 Frequency 2099 2232 2174 2024 1501 % 21 22 22 20 15 ----------------------------------------------------------------------------------------------------------------------------- S n missing unique Mean 10030 0 4 3.067 0 (2371, 24%), 2 (2512, 25%), 4 (2572, 26%), 6 (2575, 26%) ----------------------------------------------------------------------------------------------------------------------------- N n missing unique Mean 10030 0 4 120.0 0 (2502, 25%), 80 (2514, 25%), 160 (2515, 25%), 240 (2499, 25%) ----------------------------------------------------------------------------------------------------------------------------- > cor(xs[-(outside),]) t pH T S N t 1.0000000000 -0.0252190240 -0.1599673922 0.0260853161 -0.0001862074 pH -0.0252190240 1.0000000000 -0.0308956477 -0.0109761096 0.0006346428 T -0.1599673922 -0.0308956477 1.0000000000 0.0506125967 0.0005164211 S 0.0260853161 -0.0109761096 0.0506125967 1.0000000000 0.0006132354 N -0.0001862074 0.0006346428 0.0005164211 0.0006132354 1.0000000000 ### <end example> Baptiste said: > >Regarding the "inhull" Matlab code, I came to the opposite conclusion: >it should be easily ported to R. 1) it is a very short piece of code >(even more so if one disregards the various checks and handling of >special cases), with no Matlab-specific objects (only integers, >booleans, matrices and vectors). 2) The core of the program relies on >the qhull library, and the same applies to R I think. 3) Matlab and R >use very similar indexing for matrices and similar linear algebra in >general. > >That said, I'm a bit short on time to give it a go myself. I think the >open-source Octave could run this code too, so it might help in >checking the code step-by-step. > > >All the best, > >baptiste I agree the Matlab code didn't look very complicated. The main reason I thought it non-trival (for me) to code in R was my ignorance; lines like... nullsp = null(xyz(tess(i,2:end),:) - repmat(xyz(tess(i,1),:),p-1,1))'; ...seem important but I don't know what 'null' and 'repmat' do. I don't even understand the algebra being implemented and I'm wary of coding in those circumstances. If I had time, I'd like to understand and code the Matlab algorithm, but this is a small part of a much bigger task, on which I must progress. So I'll thank Baptiste, Duncan and Charles for their invaluable help and move on. Best regards, Keith Jewell ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.