Re: [R] Why 'gbm' is not giving me error when I change the response from numeric to categorical?
Thanks Peter and Marc. I am sorry, I was wrong in dichotomizing the response; thanks for pointing out my mistake. However, a correct dichotomization is not helping either. The link that you provided is also very useful, and I am now thinking of not dichotomizing my values. Thanks again.

On Fri, Oct 4, 2013 at 3:50 PM, Marc Schwartz marc_schwa...@me.com wrote:

> <snip: Marc's reply of Oct 4, quoted in full below in this thread>

--
Mary Kindall
Yorktown Heights, NY
USA
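Since even a correct dichotomization reportedly did not help, a quick sanity check before fitting is worth doing: confirm that the recoded response really is numeric 0/1 before it reaches gbm. A minimal sketch, purely illustrative:

## verify the recoded response before handing it to gbm
stopifnot(all(Y %in% c(0, 1)))  # fails loudly if any non-binary values remain
table(Y)                        # check the class balance at a glance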
[R] Why 'gbm' is not giving me error when I change the response from numeric to categorical?
This reproducible example is from the help page of 'gbm' in R. I ran the following code in R, and it works fine as long as the response is numeric. The problem starts when I convert the response from numeric to binary (0/1): it gives me an error. My question is: can converting the response from numeric to binary have this much effect?

Help page code:

N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

SNR <- 10                        # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)

# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA

data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# fit initial model
gbm1 <- gbm(Y~X1+X2+X3+X4+X5+X6,     # formula
    data=data,                       # dataset
    var.monotone=c(0,0,0,0,0,0),     # -1: monotone decrease,
                                     # +1: monotone increase,
                                     #  0: no monotone restrictions
    distribution="gaussian",         # see the help for other choices
    n.trees=1000,                    # number of trees
    shrinkage=0.05,                  # shrinkage or learning rate,
                                     # 0.001 to 0.1 usually work
    interaction.depth=3,             # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5,              # subsampling fraction, 0.5 is probably best
    train.fraction = 0.5,            # fraction of data for training,
                                     # first train.fraction*N used for training
    n.minobsinnode = 10,             # minimum total weight needed in each node
    cv.folds = 3,                    # do 3-fold cross-validation
    keep.data=TRUE,                  # keep a copy of the dataset with the object
    verbose=FALSE)                   # don't print out progress

gbm1
summary(gbm1)

Now I slightly change the response variable to make it binary.

Y[Y < mean(Y)] = 0    # My edit
Y[Y >= mean(Y)] = 1   # My edit
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
fmla = as.formula(factor(Y)~X1+X2+X3+X4+X5+X6)   # My edit

gbm2 <- gbm(fmla,                    # formula
    data=data,                       # dataset
    distribution="bernoulli",        # My edit
    n.trees=1000,                    # number of trees
    shrinkage=0.05,                  # shrinkage or learning rate,
                                     # 0.001 to 0.1 usually work
    interaction.depth=3,             # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5,              # subsampling fraction, 0.5 is probably best
    train.fraction = 0.5,            # fraction of data for training,
                                     # first train.fraction*N used for training
    n.minobsinnode = 10,             # minimum total weight needed in each node
    cv.folds = 3,                    # do 3-fold cross-validation
    keep.data=TRUE,                  # keep a copy of the dataset with the object
    verbose=FALSE)                   # don't print out progress

gbm2

> gbm2
gbm(formula = fmla, distribution = "bernoulli", data = data,
    n.trees = 1000, interaction.depth = 3, n.minobsinnode = 10,
    shrinkage = 0.05, bag.fraction = 0.5, train.fraction = 0.5,
    cv.folds = 3, keep.data = TRUE, verbose = FALSE)
A gradient boosted model with bernoulli loss function.
1000 iterations were performed.
The best cross-validation iteration was .
The best test-set iteration was .
Error in 1:n.trees : argument of length 0

My question is: will binarizing the response have so much effect that it does not find anything useful in the predictors?

Thanks

--
Mary Kindall
Yorktown Heights, NY
USA
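For context on the error above: the gbm help page describes distribution = "bernoulli" as logistic regression for 0-1 outcomes, so the expected input is a numeric 0/1 response passed directly in the formula rather than wrapped in factor(). A minimal sketch of a corrected call, assuming the setup above and abridging the unchanged arguments; whether this alone removes the error is not confirmed in the thread:

Y <- as.integer(Y >= mean(Y))   # binarize once; see the replies below on why the
                                # two-step recode above does not produce 0/1
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
gbm2 <- gbm(Y~X1+X2+X3+X4+X5+X6,     # plain numeric 0/1 response, not factor(Y)
    data=data,
    distribution="bernoulli",
    n.trees=1000, shrinkage=0.05, interaction.depth=3,
    bag.fraction=0.5, train.fraction=0.5,
    n.minobsinnode=10, cv.folds=3,
    keep.data=TRUE, verbose=FALSE)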
Re: [R] Why 'gbm' is not giving me error when I change the response from numeric to categorical?
> My question is: will binarizing the response have so much effect that
> it does not find anything useful in the predictors?

Yes. Dichotomizing throws away most of the information in the data, which is why you shouldn't do it.

This is a statistics question, not an R question, so any follow-up should be posted on a statistics list like stats.stackexchange.com, not here.

-- Bert

On Fri, Oct 4, 2013 at 12:16 PM, Mary Kindall mary.kind...@gmail.com wrote:

> <snip: Mary's original post, quoted in full above>

--
Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374
Re: [R] Why 'gbm' is not giving me error when I change the response from numeric to categorical?
On Oct 4, 2013, at 2:16 PM, Mary Kindall mary.kind...@gmail.com wrote:

> <snip: Mary's original post, quoted in full above>
>
> My question is: will binarizing the response have so much effect that
> it does not find anything useful in the predictors?
>
> Thanks

Sure, it's possible.

See this page for a good overview of why you should not dichotomize continuous data:

http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous

Regards,

Marc Schwartz
Re: [R] Why 'gbm' is not giving me error when I change the response from numeric to categorical?
On Oct 4, 2013, at 21:16 , Mary Kindall wrote:

> Y[Y < mean(Y)] = 0    # My edit
> Y[Y >= mean(Y)] = 1   # My edit

I have no clue about gbm, but I don't think the above does what I think you think it does.

Y <- as.integer(Y >= mean(Y))

might be closer to the mark.

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com
Re: [R] Why 'gbm' is not giving me error when I change the response from numeric to categorical?
On Oct 4, 2013, at 2:35 PM, peter dalgaard pda...@gmail.com wrote:

> On Oct 4, 2013, at 21:16 , Mary Kindall wrote:
>
>> Y[Y < mean(Y)] = 0    # My edit
>> Y[Y >= mean(Y)] = 1   # My edit
>
> I have no clue about gbm, but I don't think the above does what I
> think you think it does.
>
> Y <- as.integer(Y >= mean(Y))
>
> might be closer to the mark.

Good catch Peter! I didn't pay attention to that initially.

Here is an example:

set.seed(1)
Y <- rnorm(10)

> Y
 [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078 -0.8204684
 [7]  0.4874291  0.7383247  0.5757814 -0.3053884

> mean(Y)
[1] 0.1322028

Before changing Y:

> Y[Y < mean(Y)]
[1] -0.6264538 -0.8356286 -0.8204684 -0.3053884

> Y[Y >= mean(Y)]
[1] 0.1836433 1.5952808 0.3295078 0.4874291 0.7383247 0.5757814

However, the incantation that Mary is using, which calculates mean(Y) separately in each call, results in:

Y[Y < mean(Y)] = 0

> Y
 [1] 0.0000000 0.1836433 0.0000000 1.5952808 0.3295078 0.0000000
 [7] 0.4874291 0.7383247 0.5757814 0.0000000

# mean(Y) is no longer the original value from above
> mean(Y)
[1] 0.3909967

Thus:

Y[Y >= mean(Y)] = 1

> Y
 [1] 0.0000000 0.1836433 0.0000000 1.0000000 0.3295078 0.0000000
 [7] 1.0000000 1.0000000 1.0000000 0.0000000

Some of the values in Y do not change, because the threshold for modifying them changed as a result of the mean being recalculated after the first set of values in Y had already been altered. As Peter noted, you don't end up with a dichotomous vector.

Using Peter's method (starting again from the original Y):

Y <- as.integer(Y >= mean(Y))

> Y
 [1] 0 1 0 1 1 0 1 1 1 0

That being said, the original viewpoint stands, which is to not do this, due to the loss of information.

Regards,

Marc Schwartz
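If one prefers Mary's two-step assignment, the equivalent fix is to freeze the cutoff before Y changes, so both assignments use the same threshold. A minimal sketch along those lines (the name thr is illustrative):

thr <- mean(Y)     # compute the threshold once, before any values of Y change
Y[Y <  thr] <- 0
Y[Y >= thr] <- 1   # same cutoff as the first assignment, so the result is 0/1

Because thr no longer depends on Y, the second assignment cannot be affected by the first.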