Re: [R] partykit ctree: minbucket and case weights
Dear Amber,

your data contains missing values and you don't use surrogate splits to deal with them. So the observations are passed down the tree randomly (there is no "majority" argument to ctree_control!) and thus it might happen that too-small terminal nodes are created. Simply use surrogate splits and the tree will be deterministic, with correctly sized terminal nodes (maxsurrogate = 3, for example).

Best,
Torsten

On Mon, 9 Jun 2014, Amber Dawn Nolder wrote:
> I have attached the data set (cavl) and R code used when I got the results I posted about. I included the code I used at the top of the document. Below that is the version of R used and some of the results I obtained. Many thanks! Amber

On Wed, 4 Jun 2014 09:12:15 +0200 (CEST), Torsten Hothorn wrote:
> On Tue, 3 Jun 2014, Amber Dawn Nolder wrote:
>> I apologize for my lack of knowledge with R. I usually load my data as a csv file. May I send that to you? I was not sure if I could do so on the list.
>
> yes, and the R code you used. Thanks, Torsten

> Thank you!

On Fri, 30 May 2014 09:37:23 +0200 (CEST), Torsten Hothorn wrote:
> Amber, this looks like an error -- could you please send me a reproducible example so that I can track the problem down? Best, Torsten

--
Prof. Dr. Torsten Hothorn
Universitaet Zuerich, Institut fuer Epidemiologie, Biostatistik und Praevention, Abteilung Biostatistik
Hirschengraben 84, CH-8001 Zuerich, Schweiz
Telephon: +41 44 634 48 17, Fax: +41 44 634 43 86, Web: http://tiny.uzh.ch/6p

On Wed, 28 May 2014, Achim Zeileis wrote:
> Falls Du es nicht eh gesehen hast... [In case you haven't seen it already...] lg, Z

-- Forwarded message --
Date: Wed, 28 May 2014 17:16:12 -0400
From: Amber Dawn Nolder a.d.nol...@iup.edu
To: r-help@r-project.org
Subject: [R] partykit ctree: minbucket and case weights

Hello, I am an R novice, and I am using the partykit package to create regression trees.
I used the following to generate the trees:

ctree(y ~ x1 + x2 + x3 + x4, data = my_data,
      control = ctree_control(testtype = "Bonferroni", mincriterion = 0.90,
                              minsplit = 12, minbucket = 4, majority = TRUE))

I thought that minbucket set the minimum value for the sum of weights in each terminal node, and that each case weight is 1 unless otherwise specified, in which case the sum of case weights in a node should equal the number of cases (n) in that node. However, I sometimes obtain a tree with a terminal node that contains fewer than 4 cases. My data set has a total of 36 cases. The dependent and all independent variables are continuous. Variables x1 and x2 contain missing (NA) values. Could someone please explain why I am getting these results? Am I mistaken about the value of case weights or about the use of minbucket to restrict the size of a terminal node?

This is an example of the output:

Model formula:
y ~ x1 + x2 + x3 + x4

Fitted party:
[1] root
|   [2] x4 <= 30: 0.927 (n = 17, err = 1.1)
|   [3] x4 > 30
|   |   [4] x2 <= 43: 0.472 (n = 8, err = 0.4)
|   |   [5] x2 > 43
|   |   |   [6] x3 <= 0.4: 0.282 (n = 3, err = 0.0)
|   |   |   [7] x3 > 0.4: 0.020 (n = 8, err = 0.0)

Number of inner nodes:    3
Number of terminal nodes: 4

Many thanks!

Amber Nolder
Graduate Student, Indiana University of Pennsylvania

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
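Torsten's suggestion amounts to replacing the majority argument with maxsurrogate; a minimal sketch, reusing the formula and control settings from the question (my_data and the variable names are the poster's, not a real data set):

```r
library("partykit")

## With maxsurrogate > 0, observations with NAs in the primary split
## variable are sent down via surrogate splits instead of at random,
## so minbucket is respected deterministically.
ct <- ctree(y ~ x1 + x2 + x3 + x4, data = my_data,
            control = ctree_control(testtype = "Bonferroni",
                                    mincriterion = 0.90,
                                    minsplit = 12, minbucket = 4,
                                    maxsurrogate = 3))
ct
```
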
Re: [R] Fwd: cforest sampling methods
> Hi all, I've been using the randomForest package and I'm trying to make the switch over to party. My problem is that I have an extremely unbalanced outcome (only 1% of the data has a positive outcome), which makes resampling methods necessary. randomForest has a very useful argument, sampsize, which allows me to use a balanced subsample to build each tree in my forest. Let's say the number of positive cases is 100; my forest would look something like this:
>
> rf <- randomForest(y ~ ., data = train, ntree = 800, replace = TRUE,
>                    sampsize = c(100, 100))
>
> so I use 100 cases and 100 controls to build each individual tree. Can I do the same for cforest? I know I can always upsample, but I'd rather not. I've tried playing around with the weights argument, but I'm either not getting it right or it's just the wrong thing to use.

weights are your friend here: Suppose you have 100 obs of the first and 1000 obs of the second class. Using weights 1/100 for the class-one obs and 1/1000 for the class-two obs gives you a balanced sample:

y <- gl(2, 1)[c(rep(1, 100), rep(2, 1000))]
w <- 1 / table(y)[y]
tapply(rmultinom(n = 1, size = length(y), prob = w), y, sum)

Best,
Torsten

> Any advice on how to adapt cforests to datasets with imbalanced outcomes is greatly appreciated... Thanks!
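One reading of this advice, applied to the original problem, is to pass the inverse class-frequency weights to cforest() itself; a sketch, assuming a data frame train with a binary factor outcome y (names taken from the question, not a real data set), and assuming cforest's weights argument is the right place for these case weights:

```r
library("party")

## inverse class-frequency weights: in expectation, each class
## contributes equally to every subsample
w <- 1 / table(train$y)[train$y]

cf <- cforest(y ~ ., data = train, weights = as.numeric(w),
              controls = cforest_unbiased(ntree = 800))
```
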
Re: [R] Permutation Test on Interactions {coin}
Axel,

you need a model for this type of analysis, and coin is completely model-free.

Torsten

On Mon, 23 Sep 2013, Axel Urbiz wrote:
> Dear List, I'm interested in performing a permutation test on the interaction between a binary treatment indicator and a covariate (either continuous or categorical). I'm interested in the p-value of the interaction effect from a permutation test, and I'm using the coin package for that purpose. As I haven't seen any examples like this in the package documentation (or anywhere else), I'm not sure how to specify the test in this case. For example, should I interpret the p-value in the example below as the p-value of the interaction effect between group and the covariate x?
>
> set.seed(1)
> library(coin)
> data(rotarod, package = "coin")
> x <- rnorm(24)
> rotarod <- cbind(rotarod, x)
> pvalue(independence_test(time ~ group * x, data = rotarod))
>
> Your advice would be much appreciated. Regards, Axel.
Re: [R] Party package: varimp(..., conditional=TRUE) error: term 1 would require 9e+12 columns (fwd)
> I would like to build a forest of regression trees to see how well some covariates predict a response variable and to examine the importance of the covariates. I have a small number of covariates (8) and a large number of records (27368). The response and all of the covariates are continuous variables. A cursory examination of the covariates does not suggest they are correlated in a simple fashion (e.g. the variance inflation factors are all fairly low), but common sense suggests there should be some relationship: one of them is the day of the year and some of the others are environmental parameters such as water temperature. For this reason I would like to follow the advice of Strobl et al. (2008) and try the authors' conditional variable importance measure. This is implemented in the party package by calling varimp(..., conditional = TRUE). Unfortunately, when I call that on my forest I receive the error:
>
>   varimp(myforest, conditional = TRUE)
>   Error in model.matrix.default(as.formula(f), data = blocks) :
>     term 1 would require 9e+12 columns
>
> Does anyone know what is wrong?

Hi Jason,

the particular feature doesn't scale well in its current implementation. Anyway, thanks for looking up previous reports closely. I can offer to have a look at your data if you send them along with the code to reproduce the problem.

Best,
Torsten

> I noticed a post in June 2011 where a user reported this message, and the ultimate problem was that the importance measure was being conditioned on too many variables (47). I have only a small number of variables here, so I guessed that was not the problem. Another suggestion was that there could be a factor with too many levels. In my case, all of the variables are continuous. Term 1 (x1 below) is the day of the year, which does happen to be the integers 1 ... 366. But the variable is of class numeric, not integer, so I don't believe cforest would treat it as a factor, although I do not know how to tell whether cforest is treating something as continuous or as a factor. Thank you for any help you can provide. I am running R 2.13.1 with party 0.9-4. You can download the data from http://www.duke.edu/~jjr8/data.rdata (512 KB). Here is the complete code:
>
>   load("\\Temp\\data.rdata")
>   nrow(df)
>   [1] 27368
>   summary(df)
>   [summary output for y and x1 ... x8 omitted here; all nine variables are numeric, with x1 ranging from 1 to 366]
>   library(HH)
>   ## output deleted
>   vif(y ~ ., data = df)
>         x1       x2       x3       x4       x5       x6       x7       x8
>   1.374583 1.252250 1.021672 1.218801 1.015124 1.439868 1.075546 1.060580
>   library(party)
>   ## output deleted
>   mycontrols <- cforest_unbiased(ntree = 50, mtry = 3)
>   # Small forest but requires a few minutes
>   myforest <- cforest(y ~ ., data = df, controls = mycontrols)
>   varimp(myforest)
>          x1         x2        x3        x4       x5        x6       x7       x8
>   11.924498 103.180195 16.228864 30.658946 5.053500 12.820551 2.113394 6.911377
>   varimp(myforest, conditional = TRUE)
>   Error in model.matrix.default(as.formula(f), data = blocks) :
>     term 1 would require 9e+12 columns
Re: [R] Fwd: Re: Party extract BinaryTree from cforest?
-- Forwarded message --
Date: Wed, 5 Oct 2011 21:09:41 +0000
From: Chris christopher.a.h...@gmail.com
To: r-h...@stat.math.ethz.ch
Subject: Re: [R] Party extract BinaryTree from cforest?

> I found an internal workaround to this to support printing and plotting of type "simple":
>
> tt <- party:::prettytree(cf@ensemble[[1]], names(cf@data@get("input")))
> npt <- new("BinaryTree")
> npt@tree <- tt
> plot(npt)
> Error in terminal_panel(<S4 object of class "BinaryTree">) :
>   'ctreeobj' is not a regression tree
> plot(npt, type = "simple")
>
> Any additional help is appreciated.

library(party)
cf <- cforest(Species ~ ., data = iris)
pt <- party:::prettytree(cf@ensemble[[1]], names(cf@data@get("input")))
pt
nt <- new("BinaryTree")
nt@tree <- pt
nt@data <- cf@data
nt@responses <- cf@responses
nt
plot(nt)

will do.

Torsten
[R] [R-pkgs] `partykit': A Toolkit for Recursive Partytioning
New package `partykit': A Toolkit for Recursive Partytioning

The purpose of the package is to provide a toolkit with infrastructure for representing, summarizing, and visualizing tree-structured regression and classification models. Thus, the focus is not on _inferring_ such a tree structure from data but on _representing_ a given tree so that printing/plotting and computing predictions can be performed in a standardized way. In particular, this unified infrastructure can be used for reading/coercing tree models from different sources (packages `rpart', `RWeka', `PMML'), yielding objects that share functionality for `print()', `plot()', and `predict()' methods.

The impatient user will hopefully have fun with:

install.packages("partykit")
library("partykit")
library("rpart")
### from ?rpart
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
plot(as.party(fit))

Best,
Torsten & Achim

___
R-packages mailing list
r-packa...@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-packages
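The coerced tree supports the unified methods beyond plot(); a minimal sketch continuing the ?rpart example above (the predict() call on the first few rows is added here for illustration):

```r
library("partykit")
library("rpart")

### from ?rpart
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

## coerce to the common party representation
pfit <- as.party(fit)

print(pfit)                                # standardized print method
predict(pfit, newdata = kyphosis[1:5, ])   # predictions in a standardized way
```
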
Re: [R] glht (multcomp): NA's for confidence intervals using univariate_calpha (fwd)
fixed @ R-forge. The new version should appear on CRAN soon. Thanks for the report!

Torsten

-- Forwarded message --
Date: Sat, 3 Sep 2011 23:56:35 +0200
From: Ulrich Halekoh ulrich.hale...@agrsci.dk
To: r-help@r-project.org
Subject: [R] glht (multcomp): NA's for confidence intervals using univariate_calpha

Hej,

Calculation of confidence intervals for means, based on a model fitted with lmer, using the package multcomp
- yields results for calpha = adjusted_calpha
- yields NA's for calpha = univariate_calpha

Example:

library(lme4)
library(multcomp)

### Generate data
set.seed(8)
d <- expand.grid(treat = 1:2, block = 1:3)
e <- rnorm(3)
names(e) <- 1:3
d$y <- rnorm(nrow(d)) + e[d$block]
d <- transform(d, treat = factor(treat), block = factor(block))

# lmer fit
Mod <- lmer(y ~ treat + (1 | block), data = d)

### estimate treatment means
L <- cbind(c(1, 0), c(0, 1))
s <- glht(Mod, linfct = L)

## confidence intervals
confint(s, calpha = adjusted_calpha())
# produces NA's for the confidence limits
confint(s, calpha = univariate_calpha())

# for models fitted with lm the problem does not occur
G <- lm(y ~ treat + block, data = d)
L <- matrix(c(1, 0, 1/3, 1/3, 1, 1, 1/3, 1/3), 2, 4, byrow = TRUE)
s <- glht(G, linfct = L)
confint(s, calpha = adjusted_calpha())
confint(s, calpha = univariate_calpha())

multcomp version 1.2-7
R: platform i386-pc-mingw32, version.string R version 2.13.1 Patched (2011-08-19 r56767)

Regards,
Ulrich Halekoh
Department of Molecular Biology and Genetics, Aarhus University, Denmark
ulrich.hale...@agrsci.dk
Re: [R] cforest - keep.forest = false option? (fwd)
-- Forwarded message --
Date: Mon, 18 Jul 2011 10:17:00 -0700 (PDT)
From: KHOFF kuph...@gmail.com
To: r-help@r-project.org
Subject: [R] cforest - keep.forest = false option?

> Hi, I'm very new to R. I am most interested in the variable importance measures that result from randomForest, but many of my predictors are highly correlated. My first question is:
>
> 1. Do highly correlated variables render variable importance measures in randomForest invalid?

that depends on your idea of "valid". A number of papers published over the last years explore this question and you should read the relevant literature first.

> 2. I know that cforest is robust to highly correlated variables; however, I do not have enough space on my machine to run cforest. I used the keep.forest = FALSE option when using randomForest and that solved my space issue. Is there a similar option for cforest (besides savesplitstats = FALSE, which isn't helping)?

no. party was designed as a flexible research tool and is not optimized wrt speed or memory consumption.

Best,
Torsten

> Below is my code and error message. Thanks in advance!
>
> fit <- cforest(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10,
>                data = data,
>                control = cforest_unbiased(savesplitstats = FALSE, ntree = 50, mtry = 5))
>
> 1: In mf$data <- data :
>   Reached total allocation of 3955Mb: see help(memory.size)
> 2: In mf$data <- data :
>   Reached total allocation of 3955Mb: see help(memory.size)
Re: [R] Fwd: varimp_in_party_package
On Thu, 16 Jun 2011, Jinrui Xu wrote:
> Thanks for your feedback. I think the problem is not because of many levels. There is only one column, with two levels as class labels, in my input data. Below is my code. The command line
>
>   data.cforest.varimp <- varimp(data.cforest, conditional = TRUE)
>
> reports
>
>   Error in model.matrix.default(as.formula(f), data = blocks) :
>     term 1 would require 4e+17 columns
>
> I also attached my input file. Hope you can run it for me to check what the problem is. Thanks a lot!
>
> PS: It takes 10 mins to finish the code below with 1 cpu and up to 2.5 GB memory. You can reduce the columns in rawinput, which reduces the computing load and feeds back the same error.
>
>   library(randomForest)
>   library(party)
>   set.seed(71)
>   rawinput <- read.table("featureSelection_rec.vectors")
>   rawinput$V1 <- as.factor(as.numeric(rawinput$V1))
>   data.controls <- cforest_unbiased(ntree = 500, mtry = 3)
>   data.cforest <- cforest(V1 ~ ., data = rawinput, controls = data.controls)
>   data.cforest.varimp <- varimp(data.cforest, conditional = TRUE)

Hi Jinrui,

it turns out that for your data set there are (using the default parameters) 47 variables to condition on, and that's way too much. You can reduce the number of conditioning variables by increasing the `threshold' parameter to something like 0.8.

Best,
Torsten

[Earlier reply:]
> there is a factor with (too) many levels in your data frame `rawinput'. summary(rawinput) will tell you which one. Torsten

Quoting Torsten Hothorn torsten.hoth...@stat.uni-muenchen.de:
>> Hello everyone, I use the following command lines to get important variables from a training dataset.
>>
>>   data.controls <- cforest_unbiased(ntree = 500, mtry = 3)
>>   data.cforest <- cforest(V1 ~ ., data = rawinput, controls = data.controls)
>>   data.cforest.varimp <- varimp(data.cforest, conditional = TRUE)
>>
>> I got the error:
>>
>>   Error in model.matrix.default(as.formula(f), data = blocks) :
>>     term 1 would require 4e+17 columns
>>
>> I changed the data dimension to 150. The problem still exists. So I guess there are other problems. Please give me some help or hints. Thanks!
>> Jinrui
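The suggested fix is a one-line change to the varimp() call; a sketch, assuming the fitted forest data.cforest from the code above:

```r
library("party")

## condition only on covariates whose association with the variable of
## interest exceeds threshold = 0.8 (instead of the lower default),
## which reduces the number of conditioning variables
data.cforest.varimp <- varimp(data.cforest, conditional = TRUE,
                              threshold = 0.8)
```
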
Re: [R] Fwd: varimp_in_party_package
> Hello everyone, I use the following command lines to get important variables from a training dataset.
>
>   data.controls <- cforest_unbiased(ntree = 500, mtry = 3)
>   data.cforest <- cforest(V1 ~ ., data = rawinput, controls = data.controls)
>   data.cforest.varimp <- varimp(data.cforest, conditional = TRUE)
>
> I got the error:
>
>   Error in model.matrix.default(as.formula(f), data = blocks) :
>     term 1 would require 4e+17 columns

there is a factor with (too) many levels in your data frame `rawinput'. summary(rawinput) will tell you which one.

Torsten

> I changed the data dimension to 150. The problem still exists. So I guess there are other problems. Please give me some help or hints. Thanks!
>
> Jinrui
Re: [R] question regarding qmvnorm
On Wed, 20 Apr 2011, Ravi Varadhan wrote:
> If you had told us what the error message was, my job would have been easier. But at least you provided the code, so it was not hard for me to see where the problem was. There is a problem with the strategy used by `qmvnorm' to locate the initial interval in which the quantile is supposed to lie. In particular, the problem is with the approx_interval() function. In your case, the interval computed by this function does not contain the root. Hence, uniroot() fails. The solution is to provide an interval that contains the root. I am cc'ing Torsten so that he is aware of the problem.

thanks, Ravi, for the report -- indeed, `tail = "upper"' caused `approx_interval()' to act insane. Fixed in 0.9-99 on R-forge and soon on CRAN.

Best,
Torsten

> The following works:
>
> m <- 10
> rho <- 0.1
> k <- 2
> alpha <- 0.05
> cc_z <- rep(NA, length = m)
> var <- matrix(c(1, rho, rho, 1), nrow = 2, ncol = 2, byrow = TRUE)
> for (i in 1:m) {
>   if (i <= k) {
>     cc_z[i] <- qmvnorm((k * (k - 1)) / (m * (m - 1)) * alpha, tail = "upper",
>                        sigma = var, interval = c(0, 5))$quantile
>   } else {
>     cc_z[i] <- qmvnorm((k * (k - 1)) / ((m - i + k) * (m - i + k - 1)) * alpha,
>                        tail = "upper", sigma = var, interval = c(0, 5))$quantile
>   }
> }
> cc_z
> [1] 1.9438197 1.9438197 1.8910860 1.8303681 1.7590806 1.6730814 1.5652220
> [8] 1.4219558 1.2116459 0.8267921
>
> Ravi.

From: r-help-boun...@r-project.org [r-help-boun...@r-project.org] On Behalf Of li li [hannah@gmail.com]
Sent: Wednesday, April 20, 2011 5:59 PM
To: r-help
Subject: [R] question regarding qmvnorm

Dear all,

I wrote the following function previously. It worked fine with the old mvtnorm package. Somehow, with the updated package, I get an error message when trying to use the function. I need some help; it is sort of urgent. Can anyone please take a look? The function is the following. Thank you very much!

Hannah

library(mvtnorm)
cc_f <- function(m, rho, k, alpha) {
  m <- 10
  rho <- 0.1
  k <- 2
  alpha <- 0.05
  cc_z <- numeric(m)
  var <- matrix(c(1, rho, rho, 1), nrow = 2, ncol = 2, byrow = TRUE)
  for (i in 1:m) {
    if (i <= k) {
      cc_z[i] <- qmvnorm((k * (k - 1)) / (m * (m - 1)) * alpha, tail = "upper",
                         sigma = var)$quantile
    } else {
      cc_z[i] <- qmvnorm((k * (k - 1)) / ((m - i + k) * (m - i + k - 1)) * alpha,
                         tail = "upper", sigma = var)$quantile
    }
  }
  cc <- 1 - pnorm(cc_z)
  return(cc)
}
Re: [R] missing values in party::ctree
On Thu, 17 Feb 2011, Andrew Ziem wrote:
> After ctree builds a tree, how would I determine the direction missing values follow by examining the BinaryTree-class object? For instance, in the example below Bare.nuclei has 16 missing values and is used for the first split, but the missing values are not listed in either set of factors. (I have the same question for missing values among numeric [non-factor] variables, but I assume the answer is similar.)

Hi Andrew,

ctree() doesn't treat missings in factors as a category in its own right. Instead, it uses surrogate splits to determine the daughter node that observations with missings in the primary split variable are sent to (you need to specify `maxsurrogate' in ctree_control()). However, you can recode your factor and add NA to the levels. This will lead to the intended behaviour.

Best,
Torsten

> require(party)
> require(mlbench)
> data(BreastCancer)
> BreastCancer$Id <- NULL
> ct <- ctree(Class ~ ., data = BreastCancer,
>             controls = ctree_control(maxdepth = 1))
> ct
>
>   Conditional inference tree with 2 terminal nodes
>
> Response:  Class
> Inputs:  Cl.thickness, Cell.size, Cell.shape, Marg.adhesion, Epith.c.size, Bare.nuclei, Bl.cromatin, Normal.nucleoli, Mitoses
> Number of observations:  699
>
> 1) Bare.nuclei == {1, 2}; criterion = 1, statistic = 488.294
>   2)* weights = 448
> 1) Bare.nuclei == {3, 4, 5, 6, 7, 8, 9, 10}
>   3)* weights = 251
>
> sum(is.na(BreastCancer$Bare.nuclei))
> [1] 16
> nodes(ct, 1)[[1]]$psplit
> Bare.nuclei == {1, 2}
> nodes(ct, 1)[[1]]$ssplit
> list()
>
> Based on below, the answer is node 2, but I don't see it in the object.
>
> sum(BreastCancer$Bare.nuclei %in% c(1, 2, NA))
> [1] 448
> sum(BreastCancer$Bare.nuclei %in% c(1, 2))
> [1] 432
> sum(BreastCancer$Bare.nuclei %in% c(3:10))
> [1] 251
>
> Andrew
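The recoding Torsten describes ("add NA to the levels") can be done with base R's addNA(); a minimal sketch continuing the BreastCancer example above (addNA() is one way to do the recoding, not named in the original reply):

```r
library("party")
library("mlbench")
data("BreastCancer")
BreastCancer$Id <- NULL

## make NA an explicit factor level, so ctree() can place missings
## in a split set like any other category
BreastCancer$Bare.nuclei <- addNA(BreastCancer$Bare.nuclei)

ct <- ctree(Class ~ ., data = BreastCancer,
            controls = ctree_control(maxdepth = 1))
ct  ## the split on Bare.nuclei now assigns the NA level explicitly
```
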
Re: [R] Problems with pdf device using plot glht function on multcomp library.
Dear Ken,

it works for me on Linux. However, the print() commands around plot() are not necessary.

Torsten
Re: [R] error when using multcomp and lm
On Fri, 18 Dec 2009, Achim Zeileis wrote:
> On Fri, 18 Dec 2009, chrischizinski wrote:
>> Just in case anyone ever has similar issues, it appears that 'glht' does not work without an intercept.
>
> That is not precise enough to be useful. I've had a look again and it appears to be the case that mcp() (or rather the internal function mcp2matrix) does not work for models where there are multiple factors but no intercept. As the posting guide asks us to do, I have (1) provided a _reproducible_ example below, and (2) included the maintainer of the package in the posting. Torsten, it would be great if you could have a look.

Achim and Chris,

thanks for spotting this problem. The bug is now fixed in multcomp 1.1-3, available from R-forge.

Torsten

> Best, Z
>
> ## package
> library(multcomp)
>
> ## various models with and without intercept
> m1a <- lm(breaks ~ tension, data = warpbreaks)
> m1b <- lm(breaks ~ 0 + tension, data = warpbreaks)
> m2a <- lm(breaks ~ wool + tension, data = warpbreaks)
> m2b <- lm(breaks ~ 0 + wool + tension, data = warpbreaks)
>
> ## these two are equivalent: one factor with/without intercept
> summary(glht(m1a, linfct = mcp(tension = "Tukey")))
> summary(glht(m1b, linfct = mcp(tension = "Tukey")))
>
> ## these two should be equivalent: two factors with/without intercept
> ## but the latter fails
> summary(glht(m2a, linfct = mcp(tension = "Tukey")))
> summary(glht(m2b, linfct = mcp(tension = "Tukey")))
Re: [R] Computing multivariate normal probabilities. Was: Re: Problem with Numerical derivatives (numDeriv) and mvtnorm
On Sun, 22 Nov 2009, Ravi Varadhan wrote:
> Hi Torsten,

Hi Ravi,

> It would be useful to warn users that the multivariate normal probability calculated by pmvnorm using the GenzBretz algorithm is random, i.e. the result can vary between repeated executions of the function.

only if a different seed is used.

> This would prevent inappropriate use of pmvnorm, such as computing derivatives of it (see this email thread).

?pmvt has

    Randomized quasi-Monte Carlo methods are used for the computations.

and appropriate references. In addition, the new book by Alan Genz and Frank Bretz covers all technical details in depth, so the procedures are well documented. Anyway, I'll add a statement to ?pmvnorm.

Best wishes,
Torsten
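The point about seeds can be illustrated directly; a minimal sketch (the correlation matrix and integration limits here are made up for illustration):

```r
library("mvtnorm")

sigma <- diag(3)
sigma[sigma == 0] <- 0.5  # exchangeable correlation 0.5

## with the same seed, the randomized quasi-Monte Carlo result repeats
set.seed(29)
p1 <- pmvnorm(lower = rep(-1, 3), upper = rep(1, 3), sigma = sigma)
set.seed(29)
p2 <- pmvnorm(lower = rep(-1, 3), upper = rep(1, 3), sigma = sigma)
stopifnot(identical(p1, p2))

## without resetting the seed, a repeated call may differ slightly
## (within the reported "error" attribute)
p3 <- pmvnorm(lower = rep(-1, 3), upper = rep(1, 3), sigma = sigma)
```
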
Re: [R] (Parallel) Random number seed question...
On Mon, 16 Nov 2009, Blair Christian wrote:
> Hi All, I have k identical parallel pieces of code running, each using n.rand random numbers. I would like to use the same RNG (for now) and set the seeds so that I can guarantee that there are no overlaps in the random numbers sampled by the k pieces of code. Another side goal is reproducibility of my results. In the past I have used C with SPRNG for this task, but I'm hoping there is an easy way to do this in R with poor man's parallelization (e.g. running multiple Rs on multiple processors without the overhead of setting up any MPI or using snow(fall)). It is not clear from the documentation whether set.seed arguments are sequential or not for a given RNG, e.g. whether set.seed(1) on processor 1, set.seed(1 + n.rand) on processor 2, set.seed(1 + 2 * n.rand) on processor 3, etc. works for the default RNG, Mersenne-Twister. An easy approach would be to simply write a script that generates n.rand numbers, records .Random.seed, and proceeds in that manner -- inelegant, but effective. My question here is: is there a better way?
>
> (mvtnorm part directed to Torsten Hothorn) To further clarify, it seems there is a different RNG for normal (rnorm) than for everything else (e.g. RNGkind(..., normal.kind = "Inversion")); further, does anybody know if mvtnorm uses this generator?

mvtnorm is based on FORTRAN code which uses unif_rand() from the C API:

void F77_SUB(rndstart)(void) { GetRNGstate(); }
void F77_SUB(rndend)(void) { PutRNGstate(); }
double F77_SUB(unifrnd)(void) { return unif_rand(); }

Torsten

> Further, some RNGs seem to be based on the architecture (e.g. Knuth-TAOCP-2002, for example) -- is the period really related to 2^32, or is it dependent on the architecture, 2^64 for 64-bit R and 2^32 for 32-bit R? I noticed there are several packages related to RNG -- please direct me to a vignette/R News article/previous post if this has been covered ad nauseam. I have skimmed vignettes/docs for the rsprng package, the RNG doc in base, the setRNG package, and the mvtnorm package vignette. (Or am I setting myself up to write a current RNG doc?)
>
> (directed to Gregory Warnes) I found a presentation by Gregory Warnes from 1999 addressing these same questions (it uses a Collings generator in some C code): http://www.r-project.org/conferences/DSC-1999/slides/warnes.ps.gz Have you turned to the snowfall-related parallel implementations, did your Collings generator work well, or have you discovered another trick you might like to share?
>
> Thank you all for your time and excellent contributions to the open source community, Blair
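For guaranteed non-overlapping streams with reproducibility, R's "L'Ecuyer-CMRG" generator supports explicit stream advancement; a sketch of seeding k workers this way (this relies on the parallel package, which is not part of the original answer and postdates this thread):

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")
set.seed(1)

k <- 4
## derive k independent substreams from the current seed; nextRNGStream()
## jumps far ahead in the sequence, so streams cannot overlap
seeds <- vector("list", k)
s <- .Random.seed
for (i in seq_len(k)) {
  seeds[[i]] <- s
  s <- nextRNGStream(s)
}
## on worker i, restore its stream before drawing random numbers:
##   assign(".Random.seed", seeds[[i]], envir = .GlobalEnv)
```
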
Re: [R] OT: a weighted rank-based, non-paired test statistic ?
Date: Fri, 5 Jun 2009 16:09:42 -0700 (PDT) From: Thomas Lumley tlum...@u.washington.edu To: dylan.beaude...@gmail.com Cc: 'r-h...@stat.math.ethz.ch' r-h...@stat.math.ethz.ch Subject: Re: [R] OT: a weighted rank-based, non-paired test statistic ?

On Fri, 5 Jun 2009, Dylan Beaudette wrote:

Is anyone aware of a rank-based, non-paired test such as the Kruskal-Wallis that can accommodate weights?

You don't say what sort of weights, but basically, no. Whether you have precision weights or sampling weights, the test will no longer be distribution-free.

Alternatively, would it make sense to simulate a dataset by duplicating observations in proportion to their weight, and then using the Kruskal-Wallis test?

No.

Well, if you have case weights, i.e., w[i] == 5 means "there are five observations which look exactly like observation i", then there are several ways to do it:

library(coin)
set.seed(29)
x <- gl(3, 10)
y <- rnorm(length(x), mean = c(0, 0, 1)[x])
d <- data.frame(y = y, x = x)
w <- rep(2, nrow(d))  ### double each obs

### all the same
kruskal_test(y ~ x, data = rbind(d, d))

	Asymptotic Kruskal-Wallis Test

data:  y by x (1, 2, 3)
chi-squared = 12.1176, df = 2, p-value = 0.002337

kruskal_test(y ~ x, data = d[rep(1:nrow(d), w), ])

	Asymptotic Kruskal-Wallis Test

data:  y by x (1, 2, 3)
chi-squared = 12.1176, df = 2, p-value = 0.002337

kruskal_test(y ~ x, data = d, weights = ~ w)

	Asymptotic Kruskal-Wallis Test

data:  y by x (1, 2, 3)
chi-squared = 12.1176, df = 2, p-value = 0.002337

The first two work by duplicating data; the last one is more memory efficient since it computes weighted statistics (and their distribution). However, as Thomas pointed out, other forms of weights are more difficult to deal with.

Best wishes, Torsten

-thomas Thomas Lumley Assoc. Professor, Biostatistics tlum...@u.washington.edu University of Washington, Seattle
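For readers without the coin package, the case-weight-as-replication idea above can be checked with base R's kruskal.test(): expanding rows by integer weights via rep() gives exactly the same statistic as literally stacking duplicated copies of the data. (kruskal.test() itself has no weights argument, so this only covers the first two routes; the data below mirror the thread's simulated example.)

```r
## Base-R sketch: integer case weights == replicating rows,
## checked with stats::kruskal.test() (no extra packages needed).
set.seed(29)
x <- gl(3, 10)                               # three groups of 10
y <- rnorm(length(x), mean = c(0, 0, 1)[x])  # third group shifted by 1
d <- data.frame(y = y, x = x)
w <- rep(2, nrow(d))                         # case weight 2 for every row

## Route 1: stack two literal copies of the data
k1 <- kruskal.test(y ~ x, data = rbind(d, d))
## Route 2: expand rows according to the weight vector
k2 <- kruskal.test(y ~ x, data = d[rep(seq_len(nrow(d)), w), ])

## Identical chi-squared statistics, as expected
all.equal(unname(k1$statistic), unname(k2$statistic))  # TRUE
```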
Re: [R] error using lapply with oneway_test (coin package)
That's a scoping problem, I think a bug in oneway_test. Because the formula var ~ group is created within the anonymous function inside lapply, its environment should be the evaluation frame of that function call, and var should be visible. If I replace oneway_test() with lm() it works. I think a workaround is to construct the data argument explicitly, i.e.

lapply(y, function(var) oneway_test(var ~ group, data.frame(var = var, group = group)))

Yes, that would be the fix:

R> lapply(y, function(var) oneway_test(var ~ group, data = data.frame(var = var, group = group)))
$V1

	Asymptotic 2-Sample Permutation Test

data:  var by group (1, 2)
Z = -1.2054, p-value = 0.2280
alternative hypothesis: true mu is not equal to 0

$V2

	Asymptotic 2-Sample Permutation Test

data:  var by group (1, 2)
Z = 0.5672, p-value = 0.5706
alternative hypothesis: true mu is not equal to 0

Thanks, Duncan. Torsten

I've cc'd Torsten Hothorn, the maintainer of coin. Duncan Murdoch

Thank you, Matthieu

Matthieu Dubois Post-doctoral fellow Psychology and NeuroCognition Lab (CNRS UMR 5105) Université Pierre Mendès-France BP47 --- 38040 Grenoble Cedex 9 --- France Email: matthieu.dub...@upmf-grenoble.fr Gmail: matth...@gmail.com http://web.upmf-grenoble.fr/LPNC/membre_matthieu_dubois
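The workaround generalizes: building the data argument explicitly means the model function never has to look `var` up in the formula's environment. A minimal base-R sketch of the same lapply pattern, using lm() (which the thread notes is unaffected by the scoping issue) and made-up column names:

```r
## Sketch of the workaround: pass an explicit data.frame so the fitted
## model does not rely on the formula's enclosing environment.
set.seed(1)
group <- gl(2, 10)                                # two groups of 10
y <- data.frame(V1 = rnorm(20), V2 = rnorm(20))   # illustrative responses

## lapply over a data.frame iterates its columns; each column is fit
## against `group` with the data supplied explicitly.
fits <- lapply(y, function(var)
  lm(var ~ group, data = data.frame(var = var, group = group)))

names(fits)     # "V1" "V2"
class(fits$V1)  # "lm"
```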
Re: [R] spatial probit/logit for prediction
Hi Robb,

the gamboost() function in package `mboost' may help. The details are described in

@article{Kneib+Hothorn+Tutz:2009,
  author  = {Thomas Kneib and Torsten Hothorn and Gerhard Tutz},
  title   = {Variable Selection and Model Choice in Geoadditive Regression Models},
  journal = {Biometrics},
  year    = {2009},
  note    = {Accepted}
}

(preprint available from http://epub.ub.uni-muenchen.de/2063/)

Best wishes, Torsten
Re: [R] Random Forest (fwd)
Hello,

Does there exist a package for multivariate random forests, i.e. for multivariate response data? It seems to be impossible with the randomForest function and I did not find any information about this in the help pages ...

party::cforest can do it; here is an example:

y <- matrix(rnorm(100), nc = 2)
x <- matrix(runif(50 * 5), nc = 5)
dat <- data.frame(y = y, x = x)
cf <- cforest(y.1 + y.2 ~ ., data = dat)

predict(cf) gives

predict(cf)
              y.1          y.2
[1,] -0.300622618  0.318143745
[2,] -0.270218355  0.331261683
[3,] -0.330129113  0.308545082
...

Best wishes, Torsten

Thank you for your help Bertrand
Re: [R] Rpart and bagging - how is it done?
On Fri, 7 Mar 2008, Prof Brian Ripley wrote:

I believe that the procedure you describe at the end (resampling the cases) is the original interpretation of bagging, and that using weighting is equivalent when a procedure uses case weights. If you are getting different results when replicating cases and when using weights, then rpart is not using its weights strictly as case weights and it would be preferable to replicate cases. But I am getting identical predictions by the two routes:

ind <- sample(1:81, replace = TRUE)
rpart(Kyphosis ~ Age + Number + Start, data = kyphosis[ind, ], xval = 0)
rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, weights = tabulate(ind, nbins = 81), xval = 0)

My memory is that rpart uses unweighted numbers for its control params (unlike tree) and hence is not strictly using case weights. I believe you can avoid that by setting the control params to their minimum and relying on pruning. BTW, it is inaccurate to call these trees 'non-pruned' -- the default setting of cp is still (potentially) doing quite a lot of pruning. Torsten Hothorn can explain why he chose to do what he did. There's a small (but only small) computational advantage in using case weights, but the tricky issue for me is how precisely tree growth is stopped, and I don't think that rpart at its default settings is mimicking what Breiman was doing (he would have been growing much larger trees).

It's mainly used to avoid repeated formula parsing and other data preprocessing steps every time a tree is grown (which in my experience can be quite a substantial advantage with respect to both speed and memory consumption). As Brian said, rpart doesn't really interpret weights as case weights, and thus the example code from the book is not totally correct. However, party::ctree, for example, accepts case weights.

Best wishes, Torsten

On Thu, 6 Mar 2008, [EMAIL PROTECTED] wrote:

Hi there.
I was wondering if somebody knows how to perform a bagging procedure on a classification tree without running the classifier with weights. Let me first explain why I need this and then give some details of what I have found out so far.

I am thinking about implementing the bagging procedure in Matlab. Matlab has a simple classification tree function (in its Statistics toolbox) but it does not accept weights, and modifying the Matlab procedure to accommodate weights would be very complicated. The rpart function in R does accept weights, which seems to allow for a rather simple implementation of bagging. In fact, Everitt and Hothorn describe such a procedure in chapter 8 of A Handbook of Statistical Analyses Using R. The procedure generates several samples with replacement from the original data set, which has N rows. The implementation described in the book first fits a non-pruned tree to the original data set. Then it generates several (say, 25) multinomial samples of size N with probabilities 1/N. Each sample is then used in turn as the weight vector to update the original tree fit. Finally, all the updated trees are combined to produce consensus class predictions.

Now, a typical realization of a multinomial sample consists of small integers and several 0's. I thought that the way the weighting worked was this: observations with weight 0 are omitted, and observations with weight 1 or more are essentially replicated according to the weight. So I thought that instead of running the rpart procedure with weights, say, (1, 0, 2, 0, 1, ... etc.), I could simply generate a sample data set by retaining row 1, omitting row 2, replicating row 3 twice, omitting row 4, retaining row 5, etc. However, this does not seem to work as I expected.

Instead of getting identical trees (from running weighted rpart on the original data set and running rpart on the sample data set described above with no weighting), I get trees that are completely different (different threshold values and different order of variables entering the splits). Moreover, the predictions from these trees can differ, so the misclassification rates usually differ. This finally brings me to my question -- is there a way to mimic the workings of the weighting in rpart by, for example, modification of the data set or, perhaps, some other means?

Thanks in advance for your time, Andy

__ Andy Jaworski 518-1-01 Process Laboratory 3M Corporate Research Laboratory - E-mail: [EMAIL PROTECTED] Tel: (651) 733-6092 Fax: (651) 736-3122

-- Brian D. Ripley, [EMAIL PROTECTED] Professor
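The route the thread converges on (replicate resampled cases rather than pass weights) can be sketched as follows; a minimal sketch assuming the rpart package and its 81-row kyphosis data, with 25 bootstrap trees combined by majority vote. The variable names and B = 25 are illustrative, not from the book's code.

```r
## Bagging by resampling cases (not weights), sketched with rpart's
## kyphosis data; 25 bootstrap trees combined by majority vote.
library(rpart)

set.seed(290875)
B <- 25
n <- nrow(kyphosis)

## Grow one tree per bootstrap sample of the rows
trees <- lapply(seq_len(B), function(b) {
  ind <- sample(seq_len(n), replace = TRUE)
  rpart(Kyphosis ~ Age + Number + Start, data = kyphosis[ind, ], xval = 0)
})

## Each tree votes a class for every original observation ...
votes <- sapply(trees, function(tr)
  as.character(predict(tr, newdata = kyphosis, type = "class")))

## ... and the consensus prediction is the majority class per row
consensus <- apply(votes, 1, function(v) names(which.max(table(v))))
table(consensus, kyphosis$Kyphosis)
```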
[R] R News, volume 7, issue 2 is now available (fwd)
Dear useRs, The October 2007 issue of R News is now available on CRAN under the Documentation/Newsletter link. Torsten (on behalf of the R News Editorial Board)