Re: [R] Why is 'gbm' not giving me an error when I change the response from numeric to categorical?

2013-10-05 Thread Mary Kindall
Thanks Peter and Marc.
I am sorry, I was wrong in how I dichotomized the response. Thanks for pointing
out my mistake.

However, even a correct dichotomization is not helping.
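For reference, a minimal sketch of what a correct attempt might look like (an editorial example, not code from this thread; it assumes, per ?gbm, that distribution = "bernoulli" wants a numeric 0/1 response rather than a factor on the left-hand side, and it reuses Y and X1..X6 from the original post below):

library(gbm)
thr  <- mean(Y)                # compute the threshold once, before recoding
Y01  <- as.integer(Y >= thr)   # numeric 0/1 response, no factor()
data <- data.frame(Y01=Y01,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
gbm2 <- gbm(Y01 ~ X1+X2+X3+X4+X5+X6,
            data = data,
            distribution = "bernoulli",
            n.trees = 1000,
            shrinkage = 0.05,
            interaction.depth = 3,
            cv.folds = 3)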

Also, the link that you provided is very useful, and I am now thinking of not
dichotomizing my values.

Thanks again




On Fri, Oct 4, 2013 at 3:50 PM, Marc Schwartz marc_schwa...@me.com wrote:

 [...Marc's demonstration quoted in full; snipped here, see his message below...]




-- 
-
Mary Kindall
Yorktown Heights, NY
USA



[R] Why is 'gbm' not giving me an error when I change the response from numeric to categorical?

2013-10-04 Thread Mary Kindall
This reproducible example is from the help page of 'gbm' in R.

I ran the following code in R, and it works fine as long as the response is
numeric. The problem starts when I convert the response from numeric to
binary (0/1): it gives me an error.

My question is: can converting the response from numeric to binary have
this much effect?

Help page code:

library(gbm)

N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)

# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA

data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# fit initial model
gbm1 <-
  gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
      data=data,                   # dataset
      var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                   # +1: monotone increase,
                                   #  0: no monotone restrictions
      distribution="gaussian",     # see the help for other choices
      n.trees=1000,                # number of trees
      shrinkage=0.05,              # shrinkage or learning rate,
                                   # 0.001 to 0.1 usually work
      interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
      bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
      train.fraction = 0.5,        # fraction of data for training,
                                   # first train.fraction*N used for training
      n.minobsinnode = 10,         # minimum total weight needed in each node
      cv.folds = 3,                # do 3-fold cross-validation
      keep.data=TRUE,              # keep a copy of the dataset with the object
      verbose=FALSE)               # don't print out progress

gbm1
summary(gbm1)
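As an aside (standard gbm usage, though not part of the help-page excerpt above), the usual next step after fitting is to select the number of trees before interpreting the model:

# pick the iteration count by the 3-fold CV error requested above
best.iter <- gbm.perf(gbm1, method = "cv")
print(best.iter)
# relative influence of each predictor at that iteration
summary(gbm1, n.trees = best.iter)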


Now I slightly change the response variable to make it binary.

Y[Y < mean(Y)] = 0   #My edit
Y[Y >= mean(Y)] = 1  #My edit
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
fmla = as.formula(factor(Y)~X1+X2+X3+X4+X5+X6) #My edit

gbm2 <-
  gbm(fmla,                        # formula
      data=data,                   # dataset
      distribution="bernoulli",    # My edit
      n.trees=1000,                # number of trees
      shrinkage=0.05,              # shrinkage or learning rate,
                                   # 0.001 to 0.1 usually work
      interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
      bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
      train.fraction = 0.5,        # fraction of data for training,
                                   # first train.fraction*N used for training
      n.minobsinnode = 10,         # minimum total weight needed in each node
      cv.folds = 3,                # do 3-fold cross-validation
      keep.data=TRUE,              # keep a copy of the dataset with the object
      verbose=FALSE)               # don't print out progress

gbm2


> gbm2
gbm(formula = fmla, distribution = "bernoulli", data = data,
    n.trees = 1000, interaction.depth = 3, n.minobsinnode = 10,
    shrinkage = 0.05, bag.fraction = 0.5, train.fraction = 0.5,
    cv.folds = 3, keep.data = TRUE, verbose = FALSE)
A gradient boosted model with bernoulli loss function.
1000 iterations were performed.
The best cross-validation iteration was .
The best test-set iteration was .
Error in 1:n.trees : argument of length 0


My question is: can binarizing the response have so much effect that gbm
finds nothing useful in the predictors?
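For illustration (an editorial check; the replies below reach the same conclusion), a quick way to see that the two-step recoding above does not actually produce a binary vector:

length(unique(Y))      # far more than 2 distinct values after the recoding
any(!Y %in% c(0, 1))   # TRUE: values other than 0 and 1 remain in Y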

Thanks

-- 
-
Mary Kindall
Yorktown Heights, NY
USA



Re: [R] Why is 'gbm' not giving me an error when I change the response from numeric to categorical?

2013-10-04 Thread Bert Gunter
My question is: can binarizing the response have so much effect that gbm
finds nothing useful in the predictors?

Yes. Dichotomizing throws away most of the information in the data.
Which is why you shouldn't do it.
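A small illustration of the point on made-up simulated data (an editorial sketch, not from this thread): the squared correlation between a predictor and the response drops sharply once the response is dichotomized at its mean.

set.seed(42)
x  <- rnorm(1000)
y  <- x + rnorm(1000)            # continuous response
yb <- as.integer(y >= mean(y))   # dichotomized response
cor(x, y)^2                      # roughly 0.5
cor(x, yb)^2                     # noticeably smaller, roughly 0.3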

This is a statistics question, not an R question, so any follow-up should be
posted on a statistics list like stats.stackexchange.com, not here.

-- Bert

On Fri, Oct 4, 2013 at 12:16 PM, Mary Kindall mary.kind...@gmail.com wrote:

 [...original post quoted in full; snipped here, see the original message above...]



-- 

Bert Gunter
Genentech Nonclinical Biostatistics

(650) 467-7374



Re: [R] Why is 'gbm' not giving me an error when I change the response from numeric to categorical?

2013-10-04 Thread Marc Schwartz

On Oct 4, 2013, at 2:16 PM, Mary Kindall mary.kind...@gmail.com wrote:

 [...original post quoted in full; snipped here, see the original message above...]

 My question is: can binarizing the response have so much effect that gbm
 finds nothing useful in the predictors?


Sure, it's possible. See this page for a good overview of why you should not 
dichotomize continuous data:

  http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous

Regards,

Marc Schwartz



Re: [R] Why is 'gbm' not giving me an error when I change the response from numeric to categorical?

2013-10-04 Thread peter dalgaard

On Oct 4, 2013, at 21:16, Mary Kindall wrote:

 Y[Y < mean(Y)] = 0   #My edit
 Y[Y >= mean(Y)] = 1  #My edit

I have no clue about gbm, but I don't think the above does what I think you 
think it does. 

Y <- as.integer(Y >= mean(Y))

might be closer to the mark.
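An equivalent spelling, if an explicit recode is preferred (same idea, different idiom):

Y <- ifelse(Y >= mean(Y), 1L, 0L)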

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com



Re: [R] Why is 'gbm' not giving me an error when I change the response from numeric to categorical?

2013-10-04 Thread Marc Schwartz

On Oct 4, 2013, at 2:35 PM, peter dalgaard pda...@gmail.com wrote:

 
 On Oct 4, 2013, at 21:16, Mary Kindall wrote:
 
 Y[Y < mean(Y)] = 0   #My edit
 Y[Y >= mean(Y)] = 1  #My edit
 
 I have no clue about gbm, but I don't think the above does what I think you 
 think it does. 
 
 Y <- as.integer(Y >= mean(Y))
 
 might be closer to the mark.


Good catch Peter! I didn't pay attention to that initially.

Here is an example:

set.seed(1)
Y <- rnorm(10)

> Y
 [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078 -0.8204684
 [7]  0.4874291  0.7383247  0.5757814 -0.3053884

> mean(Y)
[1] 0.1322028

Before changing Y:

> Y[Y < mean(Y)]
[1] -0.6264538 -0.8356286 -0.8204684 -0.3053884

> Y[Y >= mean(Y)]
[1] 0.1836433 1.5952808 0.3295078 0.4874291 0.7383247 0.5757814


However, the incantation that Mary is using, which calculates mean(Y) 
separately in each call, results in:

Y[Y < mean(Y)] = 0

> Y
 [1] 0.0000000 0.1836433 0.0000000 1.5952808 0.3295078 0.0000000
 [7] 0.4874291 0.7383247 0.5757814 0.0000000


# mean(Y) is no longer the original value from above
> mean(Y)
[1] 0.3909967


Thus:

Y[Y >= mean(Y)] = 1

> Y
 [1] 0.0000000 0.1836433 0.0000000 1.0000000 0.3295078 0.0000000
 [7] 1.0000000 1.0000000 1.0000000 0.0000000


Some of the values in Y do not change, because the threshold for modifying the
values shifted when the mean was recalculated after the first assignment had
already altered Y. As Peter noted, you don't end up with a dichotomous vector.

Using Peter's method:

Y <- as.integer(Y >= mean(Y))
> Y
 [1] 0 1 0 1 1 0 1 1 1 0


That being said, the original viewpoint stands: do not do this, due to the
loss of information.

Regards,

Marc Schwartz
