Re: [R] partykit ctree: minbucket and case weights

2014-06-25 Thread Torsten Hothorn


Dear Amber,

your data contains missing values and you don't use surrogate splits to 
deal with them. So the observations are passed down the tree at random 
(there is no `majority' argument to ctree_control!), and thus terminal 
nodes that are too small may be created.


Simply use surrogate splits and the tree will be deterministic, with 
correctly sized terminal nodes (maxsurrogate = 3, for example).
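A concrete sketch of that advice (my illustration, not code from the thread; the airquality data set, which has NAs in Ozone and Solar.R, stands in for Amber's data):

```r
## Hedged sketch: same control settings as in the question, plus
## maxsurrogate = 3 so missing values are routed by surrogate splits
## instead of at random.
library(partykit)
ct <- ctree(Temp ~ Ozone + Solar.R + Wind, data = airquality,
            control = ctree_control(testtype = "Bonferroni",
                                    mincriterion = 0.90, minsplit = 12,
                                    minbucket = 4, maxsurrogate = 3))
table(predict(ct, type = "node"))   # terminal node sizes, now deterministic
```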


Best,

Torsten

On Mon, 9 Jun 2014, Amber Dawn Nolder wrote:

I have attached the data set (cavl) and R code used when I got the results I 
posted about. I included the code I used at the top of the document. Below 
that is the version of R used and some of the results I obtained.

Many thanks!
Amber 
On Wed, 4 Jun 2014 09:12:15 +0200 (CEST)
Torsten Hothorn torsten.hoth...@uzh.ch wrote:


On Tue, 3 Jun 2014, Amber Dawn Nolder wrote:

I apologize for my lack of knowledge with R. I usually load my data as a 
csv file. May I send that to you? I was not sure if I could do so on the 
list.


yes, and the R code you used. Thanks,

Torsten


Thank you!
On Fri, 30 May 2014 09:37:23 +0200 (CEST)
Torsten Hothorn torsten.hoth...@uzh.ch wrote:


Amber,

this looks like an error -- could you please send me a reproducible example 
so that I can track the problem down?


Best,

Torsten




Prof. Dr. Torsten Hothorn
Universitaet Zuerich
Institut fuer Epidemiologie, Biostatistik und
Praevention, Abteilung Biostatistik
Hirschengraben 84
CH-8001 Zuerich
Schweiz

Telephon: +41 44 634 48 17
Fax:      +41 44 634 43 86
Web:      http://tiny.uzh.ch/6p


On Wed, 28 May 2014, Achim Zeileis wrote:


In case you haven't seen it already...

Best,
Z

-- Forwarded message --
Date: Wed, 28 May 2014 17:16:12 -0400
From: Amber Dawn Nolder a.d.nol...@iup.edu
To: r-help@r-project.org
Subject: [R] partykit ctree: minbucket and case weights


Hello,

I am an R novice, and I am using the partykit package to create regression 
trees. I used the following to generate the trees:

ctree(y ~ x1 + x2 + x3 + x4, data = my_data,
      control = ctree_control(testtype = "Bonferroni", mincriterion = 0.90,
                              minsplit = 12, minbucket = 4, majority = TRUE))

I thought that minbucket set the minimum value for the sum of weights in 
each terminal node, and that each case weight is 1 unless otherwise 
specified, in which case the sum of case weights in a node should equal the 
number of cases (n) in that node. However, I sometimes obtain a tree with a 
terminal node that contains fewer than 4 cases.

My data set has a total of 36 cases. The dependent and all independent 
variables are continuous. Variables x1 and x2 contain missing (NA) values.

Could someone please explain why I am getting these results? Am I mistaken 
about the value of case weights or about the use of minbucket to restrict 
the size of a terminal node?
This is an example of the output:

Model formula:
y ~ x1 + x2 + x3 + x4

Fitted party:
[1] root
|   [2] x4 <= 30: 0.927 (n = 17, err = 1.1)
|   [3] x4 > 30
|   |   [4] x2 <= 43: 0.472 (n = 8, err = 0.4)
|   |   [5] x2 > 43
|   |   |   [6] x3 <= 0.4: 0.282 (n = 3, err = 0.0)
|   |   |   [7] x3 > 0.4: 0.020 (n = 8, err = 0.0)

Number of inner nodes:    3
Number of terminal nodes: 4
   Many thanks!
   Amber Nolder
   Graduate Student
   Indiana University of Pennsylvania
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.








Re: [R] Fwd: cforest sampling methods

2014-03-20 Thread Torsten Hothorn


Hi all,

I've been using the randomForest package and I'm trying to make the switch
over to party. My problem is that I have an extremely unbalanced outcome
(only 1% of the data has a positive outcome) which makes resampling methods
necessary.

randomForest has a very useful argument, sampsize, which allows me to use a
balanced subsample to build each tree in my forest. Let's say the number of
positive cases is 100; my forest would look something like this:

rf <- randomForest(y ~ ., data = train, ntree = 800, replace = TRUE,
                   sampsize = c(100, 100))

so I use 100 cases and 100 controls to build each individual tree. Can I do
the same for cforests? I know I can always upsample but I'd rather not.

I've tried playing around with the weights argument but I'm either not
getting it right or it's just the wrong thing to use.


weights are your friend here: suppose you have 100 obs of the first and 
1000 obs of the second class. Using weight 1/100 for the class-one obs 
and 1/1000 for the class-two obs gives you a balanced sample:

y <- gl(2, 1)[c(rep(1, 100), rep(2, 1000))]
w <- 1 / table(y)[y]
tapply(rmultinom(n = 1, size = length(y), prob = w), y, sum)
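For use inside cforest itself, such weights can be handed over directly; a hedged sketch with simulated data (not code from the thread; `weights` is party::cforest's per-observation case-weight argument):

```r
## Hedged sketch: pass inverse-frequency case weights to party::cforest
## so each tree's resample is roughly class-balanced. Data are simulated.
library(party)
dat <- data.frame(y = gl(2, 1)[c(rep(1, 100), rep(2, 1000))],
                  x1 = rnorm(1100), x2 = rnorm(1100))
w <- as.numeric(1 / table(dat$y)[dat$y])   # 1/100 and 1/1000
cf <- cforest(y ~ x1 + x2, data = dat, weights = w,
              controls = cforest_unbiased(ntree = 50, mtry = 1))
```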

Best,

Torsten




Any advice on how to adapt cforests to datasets with imbalanced outcomes is
greatly appreciated...



Thanks!


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.





Re: [R] Permutation Test on Interactions {coin}

2013-09-23 Thread Torsten Hothorn


Axel,

you need a model for such types of analyses, and coin is completely 
model-free.
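One model-based route (my illustration of that point, not code from the thread) is to permute under the reduced model and recompute the interaction statistic, in the style of Freedman and Lane; everything below is simulated:

```r
## Sketch of a model-based permutation p-value for an interaction:
## permute residuals of the reduced (no-interaction) model and refit.
set.seed(1)
n <- 100
group <- gl(2, n / 2)
x <- rnorm(n)
y <- rnorm(n) + as.numeric(group) + 0.5 * x * (group == "2")
tstat <- function(yy) coef(summary(lm(yy ~ group * x)))["group2:x", "t value"]
obs <- tstat(y)
red <- lm(y ~ group + x)              # reduced model without interaction
perm <- replicate(999, tstat(fitted(red) + sample(resid(red))))
mean(c(abs(perm), abs(obs)) >= abs(obs))   # permutation p-value
```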


Torsten

On Mon, 23 Sep 2013, Axel Urbiz wrote:


Dear List,
I'm interested in performing a permutation test on the interaction between a 
binary treatment indicator and a covariate (either continuous or 
categorical). I'm interested in the p-value of the interaction effect from a 
permutation test, and I'm using the coin package for that purpose.

As I haven't seen any examples like this in the package documentation (or 
anywhere else), I'm not sure how to specify the test in this case. For 
example, should I interpret the p-value in the example below as the p-value 
of the interaction effect between group and the covariate x?

set.seed(1)
library(coin)
data(rotarod, package = "coin")
x <- rnorm(24)
rotarod <- cbind(rotarod, x)
pvalue(independence_test(time ~ group * x, data = rotarod))


Your advice would be much appreciated. 

Regards,
Axel.   

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Party package: varimp(..., conditional=TRUE) error: term 1 would require 9e+12 columns (fwd)

2011-10-17 Thread Torsten Hothorn


I would like to build a forest of regression trees to see how well some
covariates predict a response variable and to examine the importance of the
covariates. I have a small number of covariates (8) and a large number of
records (27368). The response and all of the covariates are continuous
variables.

A cursory examination of the covariates does not suggest they are correlated
in a simple fashion (e.g. the variance inflation factors are all fairly low),
but common sense suggests there should be some relationship: one of them is
the day of the year and some of the others are environmental parameters such
as water temperature. For this reason I would like to follow the advice of
Strobl et al. (2008) and try the authors' conditional variable importance
measure. This is implemented in the party package by calling varimp(...,
conditional = TRUE). Unfortunately, when I call that on my forest I receive
the error:

varimp(myforest, conditional=TRUE)

Error in model.matrix.default(as.formula(f), data = blocks) :
 term 1 would require 9e+12 columns

Does anyone know what is wrong?



Hi Jason,

the particular feature doesn't scale well in its current implementation. 
Anyway, thanks for looking closely at previous reports. I can offer to 
have a look at your data if you send them along with the code to reproduce 
the problem.


Best,

Torsten


I noticed a post in June 2011 where a user reported this message and the
ultimate problem was that the importance measure was being conditioned on
too many variables (47). I have only a small number of variables here, so I
guessed that was not the problem.

Another suggestion was that there could be a factor with too many levels. In
my case, all of the variables are continuous. Term 1 (x1 below) is the day
of the year, which does happen to be integers 1 ... 366. But the variable is
class numeric, not integer, so I don't believe cforest would treat it as a
factor, although I do not know how to tell whether cforest is treating
something as continuous or as a factor.

Thank you for any help you can provide. I am running R 2.13.1 with party
0.9-4. You can download the data from
http://www.duke.edu/~jjr8/data.rdata (512 KB). Here is the complete code:



load("\\Temp\\data.rdata")
nrow(df)

[1] 27368

summary(df)   ## output reflowed for readability (variables as rows)

       Min.        1st Qu.    Median      Mean        3rd Qu.     Max.
y       0.000        0.000      1.282       5.651       5.353     195.238
x1      1.0        105.0      169.0       178.7       262.0       366.0
x2      0.0000       0.0000     0.2353      0.2555      0.4315      1.0000
x3      1.00        30.00      38.00       55.03       47.00      400.00
x4     52         1290       1857        1907        2594        3832
x5      0.008184     6.747035  11.310277   12.889021   18.427410   29.492380
x6     16.71        23.92      26.35       26.31       28.95       31.73
x7      0.0000000    0.0000000  0.0001569   0.0162043   0.0144660   0.3157486
x8      0.02727      0.11850    0.14625     0.20684     0.20095    11.76877

library(HH)
## output deleted
vif(y ~ ., data = df)

      x1       x2       x3       x4       x5       x6       x7       x8
1.374583 1.252250 1.021672 1.218801 1.015124 1.439868 1.075546 1.060580

library(party)
## output deleted
mycontrols <- cforest_unbiased(ntree = 50, mtry = 3)   # small forest, but
                                                       # requires a few minutes
myforest <- cforest(y ~ ., data = df, controls = mycontrols)
varimp(myforest)

        x1         x2         x3         x4         x5         x6         x7         x8
 11.924498 103.180195  16.228864  30.658946   5.053500  12.820551   2.113394   6.911377

varimp(myforest, conditional=TRUE)

Error in model.matrix.default(as.formula(f), data = blocks) :
 term 1 would require 9e+12 columns

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.





Re: [R] Fwd: Re: Party extract BinaryTree from cforest?

2011-10-06 Thread Torsten Hothorn



-- Forwarded message --
Date: Wed, 5 Oct 2011 21:09:41 +
From: Chris christopher.a.h...@gmail.com
To: r-h...@stat.math.ethz.ch
Subject: Re: [R] Party extract BinaryTree from cforest?

I found an internal workaround to this to support printing and plots of 
type "simple":

tt <- party:::prettytree(cf@ensemble[[1]], names(cf@data@get("input")))
npt <- new("BinaryTree")
npt@tree <- tt
plot(npt)

Error in terminal_panel(<S4 object of class "BinaryTree">) :
 'ctreeobj' is not a regression tree


library(party)
cf <- cforest(Species ~ ., data = iris)
pt <- party:::prettytree(cf@ensemble[[1]], names(cf@data@get("input")))
pt
nt <- new("BinaryTree")
nt@tree <- pt
nt@data <- cf@data
nt@responses <- cf@responses
nt
plot(nt)

will do.

Torsten


plot(npt, type = "simple")


Any additional help is appreciated.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] [R-pkgs] `partykit': A Toolkit for Recursive Partytioning

2011-10-04 Thread Torsten Hothorn


New package `partykit': A Toolkit for Recursive Partytioning

The purpose of the package is to provide a toolkit with infrastructure for
representing, summarizing, and visualizing tree-structured regression and
classification models. Thus, the focus is not on _inferring_ such a
tree structure from data but to _represent_ a given tree so that
printing/plotting and computing predictions can be performed in a
standardized  way. In particular, this unified infrastructure can be
used for reading/coercing tree models from different sources
(packages `rpart', `RWeka', `PMML') yielding objects that share
functionality for `print()', `plot()', and `predict()' methods.

The impatient users will hopefully have fun with

install.packages("partykit")
library(partykit)
library(rpart)
### from ?rpart
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
plot(as.party(fit))


Best,

Torsten  Achim

___
R-packages mailing list
r-packa...@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-packages

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] glht (multcomp): NA's for confidence intervals using univariate_calpha (fwd)

2011-09-05 Thread Torsten Hothorn


fixed @ R-forge. New version should appear on CRAN soon.

Thanks for the report!

Torsten



-- Forwarded message --
Date: Sat, 3 Sep 2011 23:56:35 +0200
From: Ulrich Halekoh ulrich.hale...@agrsci.dk
To: r-help@r-project.org r-help@r-project.org
Subject: [R] glht (multcomp): NA's for confidence intervals using
   univariate_calpha

Hej,

Calculation of confidence intervals for means based on a model fitted 
with lmer, using the package multcomp,

- yields results for calpha = adjusted_calpha
- NA's for calpha = univariate_calpha


Example:

library(lme4)
library(multcomp)
### generate data
set.seed(8)
d <- expand.grid(treat = 1:2, block = 1:3)
e <- rnorm(3)
names(e) <- 1:3
d$y <- rnorm(nrow(d)) + e[d$block]
d <- transform(d, treat = factor(treat), block = factor(block))
## lmer fit
Mod <- lmer(y ~ treat + (1 | block), data = d)
### estimate treatment means
L <- cbind(c(1, 0), c(0, 1))
s <- glht(Mod, linfct = L)
## confidence intervals
confint(s, calpha = adjusted_calpha())
## produces NA's for the confidence limits
confint(s, calpha = univariate_calpha())

## for models fitted with lm the problem does not occur
G <- lm(y ~ treat + block, data = d)
L <- matrix(c(1, 0, 1/3, 1/3, 1, 1, 1/3, 1/3), 2, 4, byrow = TRUE)
s <- glht(G, linfct = L)
confint(s, calpha = adjusted_calpha())
confint(s, calpha = univariate_calpha())



multcomp  version 1.2-7
R:platform   i386-pc-mingw32
version.string R version 2.13.1 Patched (2011-08-19 r56767)


Regards

Ulrich Halekoh
Department of Molecular Biology and Genetics,
Aarhus University, Denmark
ulrich.hale...@agrsci.dk

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.





Re: [R] cforest - keep.forest = false option? (fwd)

2011-07-20 Thread Torsten Hothorn



-- Forwarded message --
Date: Mon, 18 Jul 2011 10:17:00 -0700 (PDT)
From: KHOFF kuph...@gmail.com
To: r-help@r-project.org
Subject: [R] cforest - keep.forest = false option?

Hi,

I'm very new to R. I am most interested in the variable importance measures 
that result from randomForest, but many of my predictors are highly 
correlated. My first question is:

1. do highly correlated variables render variable importance measures in 
randomForest invalid?



that depends on your idea of "valid". A number of papers published over 
the last years explore this question, and you should read the relevant 
literature first.

and 2. I know that cforest is robust to highly correlated variables; 
however, I do not have enough space on my machine to run cforest. I used 
the keep.forest = FALSE option when using randomForest and that solved my 
space issue. Is there a similar option for cforest (besides 
savesplitstats = FALSE, which isn't helping)?


no. party was designed as a flexible research tool and is
not optimized wrt speed or memory consumption.

Best,

Torsten



below is my code and error message

Thanks in advance!


fit <- cforest(formula = y ~ x1 + x2 + x3 + x4 + x5 +
                 x6 + x7 + x8 + x9 + x10, data = data,
               controls = cforest_unbiased(savesplitstats = FALSE,
                                           ntree = 50, mtry = 5))

1: In mf$data <- data :
 Reached total allocation of 3955Mb: see help(memory.size)
2: In mf$data <- data :
 Reached total allocation of 3955Mb: see help(memory.size)



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.





Re: [R] Fwd: varimp_in_party_package

2011-06-21 Thread Torsten Hothorn


On Thu, 16 Jun 2011, Jinrui Xu wrote:


Thanks for your feedback. I think the problem is not because of many 
levels: there is only 1 column, with two levels as class labels, in my 
input data.

Below is my code. The command line data.cforest.varimp <- 
varimp(data.cforest, conditional = TRUE) reports: Error in 
model.matrix.default(as.formula(f), data = blocks): term 1 would require 
4e+17 columns

I also attached my input file. Hope you can run it for me to check what 
the problem is. Thanks a lot!

PS: It takes 10 mins to finish the code below with 1 CPU and up to 2.5 GB 
of memory. You can reduce the columns in the raw input, which reduces the 
computational load and gives back the same error.


library(randomForest)
library(party)

set.seed(71)

rawinput <- read.table("featureSelection_rec.vectors")
rawinput$V1 <- as.factor(as.numeric(rawinput$V1))

data.controls <- cforest_unbiased(ntree = 500, mtry = 3)
data.cforest <- cforest(V1 ~ ., data = rawinput, controls = data.controls)
data.cforest.varimp <- varimp(data.cforest, conditional = TRUE)



Hi Jinrui,

it turns out that for your data set there are (using the default 
parameters) 47 variables to condition on, and that's way too much. You can 
reduce the number of conditioning variables by increasing the `threshold'
parameter to something like .8
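A minimal sketch of the suggested call (reusing the data.cforest object from the code quoted above; `threshold` is an argument of party::varimp):

```r
## Hedged sketch: a larger 'threshold' means fewer variables are
## conditioned on, which keeps the model matrix tractable.
data.cforest.varimp <- varimp(data.cforest, conditional = TRUE,
                              threshold = 0.8)
```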

Best,

Torsten






there is a factor with (too) many levels in your data frame `rawinput'.

summary(rawinput)

will tell you which one.

Torsten




Quoting Torsten Hothorn torsten.hoth...@stat.uni-muenchen.de:



Hello everyone,

I use the following command lines to get important variables from a 
training dataset.

data.controls <- cforest_unbiased(ntree = 500, mtry = 3)
data.cforest <- cforest(V1 ~ ., data = rawinput, controls = data.controls)
data.cforest.varimp <- varimp(data.cforest, conditional = TRUE)

I got the error: Error in model.matrix.default(as.formula(f), data = 
blocks): term 1 would require 4e+17 columns

I changed the data dimension to 150. The problem still exists. So, I guess 
there are other problems. Please give me some help or hints. Thanks!

Jinrui

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.












Re: [R] Fwd: varimp_in_party_package

2011-06-16 Thread Torsten Hothorn


Hello everyone,

I use the following command lines to get important variables from a 
training dataset.

data.controls <- cforest_unbiased(ntree = 500, mtry = 3)
data.cforest <- cforest(V1 ~ ., data = rawinput, controls = data.controls)
data.cforest.varimp <- varimp(data.cforest, conditional = TRUE)

I got the error: Error in model.matrix.default(as.formula(f), data = 
blocks): term 1 would require 4e+17 columns




there is a factor with (too) many levels in your data frame `rawinput'.

summary(rawinput)

will tell you which one.

Torsten

I changed the data dimension to 150. The problem still exists. So, I guess 
there are other problems. Please give me some help or hints. Thanks!

Jinrui

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.





Re: [R] question regarding qmvnorm

2011-04-21 Thread Torsten Hothorn


On Wed, 20 Apr 2011, Ravi Varadhan wrote:


If you had told us what the error message was, my job would have been 
easier. But at least you provided the code, so it was not hard for me to 
see where the problem was. There is a problem with the strategy used by 
`qmvnorm' to locate the initial interval in which the quantile is supposed 
to lie. In particular, the problem is with the approx_interval() function.

In your case, the interval computed by this function does not contain the 
root. Hence, uniroot() fails. The solution is to provide the interval that 
contains the root. I am cc'ing Torsten so that he is aware of the problem.



thanks, Ravi, for the report -- indeed, tail = "upper" caused 
`approx_interval()' to act insane. Fixed in 0.9-99 on R-forge and soon on 
CRAN.


Best,

Torsten


The following works:

m <- 10
rho <- 0.1
k <- 2
alpha <- 0.05

cc_z <- rep(NA, length = m)
var <- matrix(c(1, rho, rho, 1), nrow = 2, ncol = 2, byrow = TRUE)
for (i in 1:m) {
  if (i <= k) {
    cc_z[i] <- qmvnorm((k * (k - 1)) / (m * (m - 1)) * alpha, tail = "upper",
                       sigma = var, interval = c(0, 5))$quantile
  } else {
    cc_z[i] <- qmvnorm((k * (k - 1)) / ((m - i + k) * (m - i + k - 1)) * alpha,
                       tail = "upper", sigma = var, interval = c(0, 5))$quantile
  }
}

cc_z

[1] 1.9438197 1.9438197 1.8910860 1.8303681 1.7590806 1.6730814 1.5652220
[8] 1.4219558 1.2116459 0.8267921




Ravi.



From: r-help-boun...@r-project.org [r-help-boun...@r-project.org] On Behalf Of 
li li [hannah@gmail.com]
Sent: Wednesday, April 20, 2011 5:59 PM
To: r-help
Subject: [R] question regarding qmvnorm

Dear all,
I wrote the following function previously. It worked fine with the old
mvtnorm package. Somehow, with the updated package, I get an error message
when trying to use the function. I need some help; it is sort of urgent.
Can anyone please take a look? The function is the following.
Thank you very much!
Hannah

library(mvtnorm)

cc_f <- function(m, rho, k, alpha) {

  m <- 10
  rho <- 0.1
  k <- 2
  alpha <- 0.05

  cc_z <- numeric(m)
  var <- matrix(c(1, rho, rho, 1), nrow = 2, ncol = 2, byrow = TRUE)

  for (i in 1:m) {
    if (i <= k) {
      cc_z[i] <- qmvnorm((k * (k - 1)) / (m * (m - 1)) * alpha,
                         tail = "upper", sigma = var)$quantile
    } else {
      cc_z[i] <- qmvnorm((k * (k - 1)) / ((m - i + k) * (m - i + k - 1)) * alpha,
                         tail = "upper", sigma = var)$quantile
    }
  }

  cc <- 1 - pnorm(cc_z)

  return(cc)

}


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.





Re: [R] missing values in party::ctree

2011-02-18 Thread Torsten Hothorn


On Thu, 17 Feb 2011, Andrew Ziem  wrote:


After ctree builds a tree, how would I determine the direction missing 
values follow by examining the BinaryTree-class object? For instance, in 
the example below, Bare.nuclei has 16 missing values and is used for the 
first split, but the missing values are not listed in either set of 
factors. (I have the same question for missing values among numeric 
[non-factor] variables, but I assume the answer is similar.)


Hi Andrew,

ctree() doesn't treat missings in factors as a category in its own right. 
Instead, it uses surrogate splits to determine the daughter node that 
observations with missings in the primary split variable are sent to (you 
need to specify `maxsurrogate' in ctree_control()).

However, you can recode your factor and add NA to the levels. This will
lead to the intended behaviour.
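A minimal sketch of that recoding (my illustration, reusing the BreastCancer data from the question; addNA() is base R):

```r
## Sketch: make NA an explicit factor level so ctree() can split on
## "missing" directly instead of relying on surrogate splits.
data(BreastCancer, package = "mlbench")
BreastCancer$Bare.nuclei <- addNA(BreastCancer$Bare.nuclei)
levels(BreastCancer$Bare.nuclei)   # the last level is now <NA>
```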

Best,

Torsten





require(party)
require(mlbench)
data(BreastCancer)
BreastCancer$Id <- NULL
ct <- ctree(Class ~ ., data = BreastCancer,
            controls = ctree_control(maxdepth = 1))
ct


Conditional inference tree with 2 terminal nodes

Response:  Class
Inputs:  Cl.thickness, Cell.size, Cell.shape, Marg.adhesion, Epith.c.size, 
Bare.nuclei, Bl.cromatin, Normal.nucleoli, Mitoses
Number of observations:  699

1) Bare.nuclei == {1, 2}; criterion = 1, statistic = 488.294
 2)*  weights = 448
1) Bare.nuclei == {3, 4, 5, 6, 7, 8, 9, 10}
 3)*  weights = 251

sum(is.na(BreastCancer$Bare.nuclei))

[1] 16

nodes(ct, 1)[[1]]$psplit

Bare.nuclei == {1, 2}

nodes(ct, 1)[[1]]$ssplit

list()



Based on below, the answer is node 2, but I don't see it in the object.


sum(BreastCancer$Bare.nuclei %in% c(1,2,NA))

[1] 448

sum(BreastCancer$Bare.nuclei %in% c(1,2))

[1] 432

sum(BreastCancer$Bare.nuclei %in% c(3:10))

[1] 251


Andrew




__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Problems with pdf device using plot glht function on multcomp library.

2010-09-15 Thread Torsten Hothorn


Dear Ken,

it works for me on Linux. However, the print() commands around plot() are 
not necessary.


Torsten

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] error when using multcomp and lm

2009-12-18 Thread Torsten Hothorn


On Fri, 18 Dec 2009, Achim Zeileis wrote:


On Fri, 18 Dec 2009, chrischizinski wrote:

Just in case anyone ever has similar issues, it appears that 'glht' does 
not work without an intercept.


That is not precise enough to be useful. I've had a look again, and it 
appears to be the case that mcp() (or rather the internal function 
mcp2matrix) does not work for models where there are multiple factors but 
no intercept.

As the posting guide asks us to do, I have (1) provided a _reproducible_ 
example below, and (2) included the maintainer of the package in the 
posting. Torsten, it would be great if you could have a look.


Achim and Chris,

thanks for spotting this problem. The bug is now fixed in multcomp 
1.1-3 available from R-forge.


Torsten



Best,
Z

## package
library(multcomp)

## various models with and without intercept
m1a <- lm(breaks ~ tension, data = warpbreaks)
m1b <- lm(breaks ~ 0 + tension, data = warpbreaks)
m2a <- lm(breaks ~ wool + tension, data = warpbreaks)
m2b <- lm(breaks ~ 0 + wool + tension, data = warpbreaks)

## these two are equivalent: one factor with/without intercept
summary(glht(m1a, linfct = mcp(tension = "Tukey")))
summary(glht(m1b, linfct = mcp(tension = "Tukey")))

## these two should be equivalent: two factors with/without intercept
## but the latter fails
summary(glht(m2a, linfct = mcp(tension = "Tukey")))
summary(glht(m2b, linfct = mcp(tension = "Tukey")))




__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Computing multivariate normal probabilities. Was: Re: Problem with Numerical derivatives (numDeriv) and mvtnorm

2009-11-23 Thread Torsten Hothorn


On Sun, 22 Nov 2009, Ravi Varadhan wrote:



Hi Torsten,



Hi Ravi,


It would be useful to warn the users that the multivariate normal probability
calculated by pmvnorm using the GenzBretz algorithm is random, i.e. 
the result can vary between repeated executions of the function.


only if a different seed is used.

This 
would prevent inappropriate use of pmvnorm such as computing derivatives 
of it (see this email thread).




?pmvt says "Randomized quasi-Monte Carlo methods are used for the
computations" and gives appropriate references. In addition, the new book by 
Alan Genz and Frank Bretz covers all technical details in depth, so
the procedures are well documented.

Anyway, I'll add a statement to ?pmvnorm.
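To illustrate the point about seeds (my sketch, not code from the thread): with the seed fixed, repeated pmvnorm() calls with the GenzBretz algorithm return identical values.

```r
## Sketch: pmvnorm() uses randomized quasi-Monte Carlo integration, but
## the result is reproducible once the RNG seed is fixed.
library(mvtnorm)
S <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
set.seed(29); p1 <- pmvnorm(upper = c(1, 1), sigma = S, algorithm = GenzBretz())
set.seed(29); p2 <- pmvnorm(upper = c(1, 1), sigma = S, algorithm = GenzBretz())
stopifnot(identical(as.numeric(p1), as.numeric(p2)))
```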

Best wishes,

Torsten

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] (Parallel) Random number seed question...

2009-11-16 Thread Torsten Hothorn


On Mon, 16 Nov 2009, Blair Christian wrote:


Hi All,

I have k identical parallel pieces of code running, each using n.rand
random numbers.  I would like to use the same RNG (for now), and set
the seeds so that I can guarantee that there are no overlaps in the
random numbers sampled by the k pieces of code.  Another side goal is
to have reproducibility of my results.  In the past I have used C with
SPRNG for this task, but I'm hoping that there is an easy way to do
this in R with poor man's parallelization (eg running multiple Rs on
multiple processors without the overhead of setting up any mpi or
using snow(fall)).  It is not clear from the documentation if set.seed
arguments are sequential or not for a given RNG, e.g. if set.seed(1) on
processor 1, set.seed(1 + n.rand) on processor 2, set.seed(1 + 2*n.rand)
on processor 3, etc., for the default RNG Mersenne-Twister. An easy
approach would be to simply write a script that generates n.rand numbers,
records the .Random.seed, and proceeds in that manner: inelegant, but
effective. My question here is: is there a better way?
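For what it's worth, a hedged sketch of one way to get guaranteed non-overlapping streams in R (via the parallel package's L'Ecuyer-CMRG generator; an aside, not part of the original exchange):

```r
## Sketch: independent, reproducible RNG streams without manual seed offsets.
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
s <- .Random.seed
streams <- vector("list", 3)   # one seed per worker
for (k in 1:3) {
  streams[[k]] <- s
  s <- nextRNGStream(s)        # jump ahead to a non-overlapping stream
}
## each worker k restores its stream before drawing:
## assign(".Random.seed", streams[[k]], envir = .GlobalEnv)
```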


(mvtnorm part directed to Torsten Hothorn)
To further clarify, it seems there is a different RNG for normal
(rnorm) than for everything else? (eg RNGKind( ..,
normal.kind=Inversion); further, does anybody know if mvtnorm uses
this generator?



mvtnorm is based on FORTRAN code which uses unif_rand() from the C API:

void F77_SUB(rndstart)(void) { GetRNGstate(); }
void F77_SUB(rndend)(void) { PutRNGstate(); }
double F77_SUB(unifrnd)(void) { return unif_rand(); }

Torsten
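[Editor's note: because the FORTRAN code pulls from R's own stream via
GetRNGstate()/PutRNGstate(), set.seed() controls mvtnorm's Monte Carlo
draws as well. A minimal check, assuming the mvtnorm package is installed:]

```r
library(mvtnorm)

## The same seed reproduces the same Monte Carlo estimate, because
## the FORTRAN code draws from R's unif_rand() stream.
set.seed(29)
p1 <- pmvnorm(lower = c(-1, -1), upper = c(1, 1), sigma = diag(2))
set.seed(29)
p2 <- pmvnorm(lower = c(-1, -1), upper = c(1, 1), sigma = diag(2))
stopifnot(identical(p1, p2))
```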



Further, some RNGs seem to be based on the architecture (e.g. the
Knuth-TAOCP-2002); is the period really related to 2^32, or is it
dependent on the architecture, 2^64 for 64-bit R and 2^32 for 32-bit R?


I noticed there are several packages related to RNG; please direct me
to a vignette/R News article/previous post if this has been covered ad
nauseam.  I have skimmed the vignettes/docs for the rsprng package, the
RNG doc in base, the setRNG package, and the mvtnorm package vignette.
(Or am I setting myself up to write a current RNG doc?)


(directed to Gregory Warnes)
I found a presentation by Gregory Warnes from 1999 addressing these
same questions (and uses a collings generator in some C code).
http://www.r-project.org/conferences/DSC-1999/slides/warnes.ps.gz
Have you turned to the snowfall related parallel implementations, did
your Collings generator work well, or have you discovered another
trick you might like to share?

Thank you all for your time and excellent contributions to the open
source community,
Blair

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] OT: a weighted rank-based, non-paired test statistic ?

2009-06-09 Thread Torsten Hothorn



Date: Fri, 5 Jun 2009 16:09:42 -0700 (PDT)
From: Thomas Lumley tlum...@u.washington.edu
To: dylan.beaude...@gmail.com
Cc: 'r-h...@stat.math.ethz.ch' r-h...@stat.math.ethz.ch
Subject: Re: [R] OT: a weighted rank-based, non-paired test statistic ?

On Fri, 5 Jun 2009, Dylan Beaudette wrote:
Is anyone aware of a rank-based, non-paired test, such as the
Kruskal-Wallis, that can accommodate weights?


You don't say what sort of weights, but basically, no.

Whether you have precision weights or sampling weights, the test will no 
longer be distribution-free.



Alternatively, would it make sense to simulate a dataset by duplicating
observations in proportion to their weight, and then using the
Kruskal-Wallis test?


No.



well, if you have case weights, i.e., w[i] == 5 means: there are five 
observations which look exactly like observation i, then there are several 
ways to do it:



library(coin)

set.seed(29)
x <- gl(3, 10)
y <- rnorm(length(x), mean = c(0, 0, 1)[x])
d <- data.frame(y = y, x = x)
w <- rep(2, nrow(d)) ### double each obs

### all the same
kruskal_test(y ~ x, data = rbind(d, d))


Asymptotic Kruskal-Wallis Test

data:  y by x (1, 2, 3)
chi-squared = 12.1176, df = 2, p-value = 0.002337



kruskal_test(y ~ x, data = d[rep(1:nrow(d), w),])


Asymptotic Kruskal-Wallis Test

data:  y by x (1, 2, 3)
chi-squared = 12.1176, df = 2, p-value = 0.002337



kruskal_test(y ~ x, data = d, weights = ~ w)


Asymptotic Kruskal-Wallis Test

data:  y by x (1, 2, 3)
chi-squared = 12.1176, df = 2, p-value = 0.002337

The first two work by duplicating data; the latter is more memory
efficient since it computes weighted statistics (and their distribution).


However, as Thomas pointed out, other forms of weights are more difficult 
to deal with.


Best wishes,

Torsten



-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
tlum...@u.washington.eduUniversity of Washington, Seattle







Re: [R] error using lapply with oneway_test (coin package)

2009-05-07 Thread Torsten Hothorn


That's a scoping problem; I think it's a bug in oneway_test.  Because the
formula var ~ group is created within the anonymous function in lapply, its
environment should be the evaluation frame of that function call and var
should be visible.  If I replace oneway_test() with lm(), it works.


I think a workaround is to construct the data argument explicitly, i.e.

lapply(y, function(var)
  oneway_test(var ~ group, data = data.frame(var = var, group = group)))




yes, that would be the fix:

R> lapply(y, function(var) oneway_test(var ~ group,
  data = data.frame(var = var, group = group)))
$V1

Asymptotic 2-Sample Permutation Test

data:  var by group (1, 2)
Z = -1.2054, p-value = 0.2280
alternative hypothesis: true mu is not equal to 0


$V2

Asymptotic 2-Sample Permutation Test

data:  var by group (1, 2)
Z = 0.5672, p-value = 0.5706
alternative hypothesis: true mu is not equal to 0

Thanks, Duncan.

Torsten



I've cc'd Torsten Hothorn, the maintainer of coin.

Duncan Murdoch



Thank you,

Matthieu

Matthieu Dubois
Post-doctoral fellow

Psychology and NeuroCognition Lab (CNRS UMR 5105)
Université Pierre Mendès-France
BP47 --- 38040 Grenoble Cedex 9 --- France

Email: matthieu.dub...@upmf-grenoble.fr
Gmail: matth...@gmail.com
http://web.upmf-grenoble.fr/LPNC/membre_matthieu_dubois














Re: [R] spatial probit/logit for prediction

2008-08-20 Thread Torsten Hothorn


Hi Robb,

the gamboost() function in package `mboost' may help. The details
are described in

@article{Kneib+Hothorn+Tutz:2009,
  author = {Thomas Kneib and Torsten Hothorn and Gerhard Tutz},
  title = {Variable Selection and Model Choice in Geoadditive Regression 
Models},
  journal = {Biometrics},
  year = {2009},
  note = {Accepted}
}

(preprint available from http://epub.ub.uni-muenchen.de/2063/)

Best wishes,

Torsten



Re: [R] Random Forest (fwd)

2008-06-07 Thread Torsten Hothorn



Hello

Does there exist a package for multivariate random forests, i.e., for
multivariate response data?  It seems to be impossible with the
randomForest function and I did not find any information about this
in the help pages ...


party:::cforest can do this; here is an example:

y <- matrix(rnorm(100), nc = 2)
x <- matrix(runif(50 * 5), nc = 5)
dat <- data.frame(y = y, x = x)
cf <- cforest(y.1 + y.2 ~ ., data = dat)
predict(cf)

gives


predict(cf)

   y.1  y.2
 [1,] -0.300622618  0.318143745
 [2,] -0.270218355  0.331261683
 [3,] -0.330129113  0.308545082
...

Best wishes,

Torsten



Thank you for your help

Bertrand







Re: [R] Rpart and bagging - how is it done?

2008-03-07 Thread Torsten Hothorn

On Fri, 7 Mar 2008, Prof Brian Ripley wrote:

 I believe that the procedure you describe at the end (resampling the cases) 
 is the original interpretation of bagging, and that using weighting is 
 equivalent when a procedure uses case weights.

 If you are getting different results when replicating cases and when using 
 weights then rpart is not using its weights strictly as case weights and it 
 would be preferable to replicate cases.  But I am getting identical 
 predictions by the two routes:

 ind <- sample(1:81, replace=TRUE)
 rpart(Kyphosis ~ Age + Number + Start, data=kyphosis[ind,], xval=0)
 rpart(Kyphosis ~ Age + Number + Start, data=kyphosis,
  weights=tabulate(ind, nbins=81), xval=0)

 My memory is that rpart uses unweighted numbers for its control params 
 (unlike tree) and hence is not strictly using case weights.  I believe you 
 can avoid that by setting the control params to their minimum and relying on 
 pruning.
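
[Editor's note: the grow-large-then-prune route mentioned above might be
sketched as follows; the minimal control settings are illustrative, and
the cross-validated cp selection is one common choice, not the only one.]

```r
library(rpart)

## Grow a near-maximal tree by disabling the (unweighted) stopping
## rules, then prune back using the cross-validated cp table.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             control = rpart.control(minsplit = 2, minbucket = 1,
                                     cp = 0, xval = 10))
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best.cp)
```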

 BTW, it is inaccurate to call these trees 'non-pruned' -- the default
 setting of cp is still (potentially) doing quite a lot of pruning.

 Torsten Hothorn can explain why he chose to do what he did.  There's a small 
 (but only small) computational advantage in using case weights, but the 
 tricky issue for me is how precisely tree growth is stopped, and I don't 
 think that rpart at its default settings is mimicking what Breiman was doing 
 (he would have been growing much larger trees).


It's mainly used to avoid repeated formula parsing and other data
preprocessing steps every time a tree is grown (which in my experience can
be quite a substantial advantage with respect to both speed and memory
consumption). As Brian said, rpart doesn't really interpret weights as
case weights and thus the example code from the book is not totally
correct. However, party::ctree, for example, does accept case weights.

Best wishes,

Torsten


 On Thu, 6 Mar 2008, [EMAIL PROTECTED] wrote:

 
 Hi there.
 
 I was wondering if somebody knows how to perform a bagging procedure on a
 classification tree without running the classifier with weights.
 
 Let me first explain why I need this and then give some details of what I
 have found out so far.
 
 I am thinking about implementing the bagging procedure in Matlab.  Matlab
 has a simple classification tree function (in their Statistics toolbox) but
 it does not accept weights.  A modification of the Matlab procedure to
 accommodate weights would be very complicated.
 
 The rpart function in R accepts weights.  This seems to allow for a rather
 simple implementation of bagging.  In fact Everitt and Hothorn in chapter 8
 of A Handbook of Statistical Analyses Using R describe such a procedure.
 The procedure consists in generating several samples with replacement from
 the original data set.  This data set has N rows.  The implementation
 described in the book first fits a non-pruned tree to the original data
 set.  Then it generates several (say, 25) multinomial samples of size N
 with probabilities 1/N.  Then, each sample is used in turn as the weight
 vector to update the original tree fit.  Finally, all the updated trees are
 combined to produce consensus class predictions.
 
 Now, a typical realization of a multinomial sample consists of small
 integers and several 0's.  I thought that the way that weighting worked was
 this:  the observations with weights equal to 0 are omitted and the
 observations with weights > 1 are essentially replicated according to the
 weight.  So I thought that instead of running the rpart procedure with
 weights, say, starting with (1, 0, 2, 0, 1, ... etc.)  I could simply
 generate a sample data set by retaining row 1, omitting row 2, replicating
 row 3 twice, omitting row 4, retaining row 5, etc.  However, this does not
 seem to work as I expected.  Instead of getting identical trees (from
 running weighted rpart on the original data set and running rpart on the
 sample data set described above with no weighting) I get trees that are
 completely different (different threshold values and different order of
 variables entering the splits).  Moreover,  the predictions from these
 trees can be different so the misclassification rates usually differ.
 
 This finally brings me to my question - is there a way to mimic the
 workings of the weighting in rpart by, for example, modification of the
 data set or, perhaps, some other means.
 
 Thanks in advance for your time,
 
 Andy
 
 __
 Andy Jaworski
 518-1-01
 Process Laboratory
 3M Corporate Research Laboratory
 -
 E-mail: [EMAIL PROTECTED]
 Tel:  (651) 733-6092
 Fax:  (651) 736-3122
 
 

 -- 
 Brian D. Ripley,  [EMAIL PROTECTED]
 Professor

[R] R News, volume 7, issue 2 is now available (fwd)

2007-11-02 Thread Torsten Hothorn

Dear useRs,

The October 2007 issue of R News is now available on CRAN under the
Documentation/Newsletter link.

Torsten
(on behalf of the R News Editorial Board)

___
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-announce
