Re: Spark Implementation of XGBoost

2015-11-16 Thread Joseph Bradley
One comment about
"""
1) I agree the sorting method you suggested is a very efficient way to
handle the unordered categorical variables in binary classification
and regression. I propose we have a Spark ML Transformer to do the
sorting and encoding, bringing the benefits to many tree based
methods. How about I open a jira for this?
"""

--> MLlib trees do this currently, so you could check out that code as an
example.
I'm not sure how this would work as a generic transformer, though; it seems
more like an internal part of space-partitioning algorithms.
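
For anyone who hasn't seen the trick, here is a rough, self-contained sketch of the
idea (a sketch only, not the actual MLlib code): for binary classification or
regression, order the categories of an unordered feature by their mean label, then
consider only the q - 1 split points along that ordering instead of all
2^(q-1) - 1 subsets.

// Sketch only; not MLlib's implementation. For one unordered categorical
// feature at one tree node, order the categories by mean label so they can
// be split as if they were ordered.
object CategorySortSketch {
  // data: (categoryId, label) pairs for the instances reaching this node.
  def orderedCategories(data: Seq[(Int, Double)]): Seq[Int] =
    data
      .groupBy(_._1)
      .map { case (cat, rows) => (cat, rows.map(_._2).sum / rows.size) } // mean label
      .toSeq
      .sortBy(_._2)   // ascending mean label
      .map(_._1)

  def main(args: Array[String]): Unit = {
    val data = Seq((0, 1.0), (0, 1.0), (1, 0.0), (1, 1.0), (2, 0.0))
    val ordered = orderedCategories(data)
    // Candidate splits are just the q - 1 prefixes of `ordered`:
    // {ordered(0)} vs the rest, {ordered(0), ordered(1)} vs the rest, ...
    println(ordered.mkString(" < "))   // prints: 2 < 1 < 0
  }
}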



On Tue, Oct 27, 2015 at 5:04 PM, Meihua Wu wrote:

> Hi DB Tsai,
>
> Thank you again for your insightful comments!
>
> 1) I agree the sorting method you suggested is a very efficient way to
> handle the unordered categorical variables in binary classification
> and regression. I propose we have a Spark ML Transformer to do the
> sorting and encoding, bringing the benefits to many tree based
> methods. How about I open a jira for this?
>
> 2) For L2/L1 regularization vs. learning rate (I use this name instead
> of shrinkage to avoid confusion), I have the following observations:
>
> Suppose G and H are the sums (over the data assigned to a leaf node) of
> the 1st and 2nd derivatives of the loss evaluated at f_m, respectively.
> Then for this leaf node,
>
> * With a learning rate eta, f_{m+1} = f_m - G/H*eta
>
> * With an L2 regularization coefficient lambda, f_{m+1} = f_m - G/(H+lambda)
>
> If H > 0 (convex loss), both approaches lead to "shrinkage":
>
> * For the learning rate approach, the percentage of shrinkage is
> uniform for any leaf node.
>
> * For L2 regularization, the percentage of shrinkage would adapt to
> the number of instances assigned to a leaf node: more instances =>
> larger G and H => less shrinkage. This behavior is intuitive to me. If
> the value estimated from this node is based on a large amount of data,
> the value should be reliable and less shrinkage is needed.
>
> I suppose we could have something similar for L1.
>
> I am not aware of theoretical results to conclude which method is
> better; it likely depends on the data at hand. Implementing the
> learning rate is on my radar for version 0.2. I should be able to add
> it in a week or so. I will send you a note once it is done.
>
> Thanks,
>
> Meihua
>

Re: Spark Implementation of XGBoost

2015-10-27 Thread DB Tsai
Hi Meihua,

For categorical features, the ordinal issue can be solved by trying
all of the 2^(q-1) - 1 possible partitions of the q values into two
groups. However, it's computationally expensive. In Hastie's book,
Section 9.2.4, the trees can be trained by sorting the categories by
their residuals and treating them as if they were ordered. It can be
proven that this gives the optimal solution. I have a proof that this
works for learning regression trees through variance reduction.
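
To see why the exhaustive approach blows up, the small sketch below (an
illustration only, not tied to any particular implementation) enumerates the
distinct two-group partitions of q category values; each non-empty proper subset
and its complement describe the same split, which is where the 2^(q-1) - 1 count
comes from.

// Sketch: enumerate and count the distinct binary partitions of q category values.
object PartitionCountSketch {
  def distinctSplits(q: Int): Seq[(Set[Int], Set[Int])] = {
    val values = (0 until q).toSet
    values.subsets()
      .filter(s => s.nonEmpty && s.size < q) // non-empty proper subsets
      .filter(_.contains(0))                 // keep one side of each complementary pair
      .map(s => (s, values -- s))
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    for (q <- 2 to 12) {
      val n = distinctSplits(q).size
      println(s"q = $q: $n candidate splits (2^(q-1) - 1 = ${(1 << (q - 1)) - 1})")
    }
  }
}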

I'm also interested in understanding how the L1 and L2 regularization
within the boosting works (and if it helps with overfitting more than
shrinkage).

Thanks.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Mon, Oct 26, 2015 at 8:37 PM, Meihua Wu  wrote:
> Hi DB Tsai,
>
> Thank you very much for your interest and comment.
>
> 1) Feature sub-sampling is per-node, like in random forests.
>
> 2) The current code heavily exploits the tree structure to speed up
> the learning (such as processing multiple learning nodes in one pass
> over the training data). So a generic GBM is likely to be a different
> codebase. Do you have any nice references on efficient GBM
> implementations? I am more than happy to look into them.
>
> 3) The algorithm accepts training data as a DataFrame with the
> featureCol indexed by VectorIndexer. You can specify which variables
> are categorical in the VectorIndexer. Please note that currently all
> categorical variables are treated as ordered. If you want some
> categorical variables treated as unordered, you can pass the data
> through OneHotEncoder before the VectorIndexer. I do have a plan to
> handle unordered categorical variables using the approach from RF in
> Spark ML (please see the roadmap in the README.md).
>
> Thanks,
>
> Meihua




Re: Spark Implementation of XGBoost

2015-10-27 Thread Meihua Wu
Hi DB Tsai,

Thank you again for your insightful comments!

1) I agree the sorting method you suggested is a very efficient way to
handle the unordered categorical variables in binary classification
and regression. I propose we have a Spark ML Transformer to do the
sorting and encoding, bringing the benefits to many tree based
methods. How about I open a jira for this?

2) For L2/L1 regularization vs. learning rate (I use this name instead
of shrinkage to avoid confusion), I have the following observations:

Suppose G and H are the sums (over the data assigned to a leaf node) of
the 1st and 2nd derivatives of the loss evaluated at f_m, respectively.
Then for this leaf node,

* With a learning rate eta, f_{m+1} = f_m - G/H*eta

* With an L2 regularization coefficient lambda, f_{m+1} = f_m - G/(H+lambda)

If H > 0 (convex loss), both approaches lead to "shrinkage":

* For the learning rate approach, the percentage of shrinkage is
uniform for any leaf node.

* For L2 regularization, the percentage of shrinkage would adapt to
the number of instances assigned to a leaf node: more instances =>
larger G and H => less shrinkage. This behavior is intuitive to me. If
the value estimated from this node is based on a large amount of data,
the value should be reliable and less shrinkage is needed.

I suppose we could have something similar for L1.
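
To make the two updates above concrete, here is a small numeric sketch (an
illustration using the G, H, eta, and lambda notation from this message, not code
from the package): the learning-rate update always keeps the same fraction eta of
the raw step -G/H, while the L2 update keeps the fraction H/(H + lambda), which
approaches 1 as the leaf accumulates more data.

// Sketch comparing the two leaf-value updates discussed above.
object ShrinkageSketch {
  def learningRateStep(g: Double, h: Double, eta: Double): Double = -(g / h) * eta
  def l2Step(g: Double, h: Double, lambda: Double): Double = -g / (h + lambda)

  def main(args: Array[String]): Unit = {
    val eta = 0.3
    val lambda = 10.0
    // A small leaf and a large leaf with the same raw step -G/H = -0.4;
    // G and H both grow with the number of instances in the leaf.
    for ((g, h) <- Seq((2.0, 5.0), (200.0, 500.0))) {
      val raw = -g / h
      val lr = learningRateStep(g, h, eta)
      val l2 = l2Step(g, h, lambda)
      println(f"H = $h%5.0f  raw = $raw%.3f  " +
        f"learning rate = $lr%.3f (${lr / raw * 100}%.0f%% of raw)  " +
        f"L2 = $l2%.3f (${l2 / raw * 100}%.0f%% of raw)")
    }
  }
}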

I am not aware of theoretical results to conclude which method is
better; it likely depends on the data at hand. Implementing the
learning rate is on my radar for version 0.2. I should be able to add
it in a week or so. I will send you a note once it is done.

Thanks,

Meihua

On Tue, Oct 27, 2015 at 1:02 AM, DB Tsai  wrote:
> Hi Meihua,
>
> For categorical features, the ordinal issue can be solved by trying
> all of the 2^(q-1) - 1 possible partitions of the q values into two
> groups. However, it's computationally expensive. In Hastie's book,
> Section 9.2.4, the trees can be trained by sorting the categories by
> their residuals and treating them as if they were ordered. It can be
> proven that this gives the optimal solution. I have a proof that this
> works for learning regression trees through variance reduction.
>
> I'm also interested in understanding how the L1 and L2 regularization
> within the boosting works (and if it helps with overfitting more than
> shrinkage).
>
> Thanks.
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D

Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Interesting. For feature sub-sampling, is it per-node or per-tree? Do
you think you can implement a generic GBM and have it merged as part of
the Spark codebase?

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu wrote:
> Hi Spark User/Dev,
>
> Inspired by the success of XGBoost, I have created a Spark package for
> gradient boosted trees with a 2nd order approximation of arbitrary
> user-defined loss functions.
>
> https://github.com/rotationsymmetry/SparkXGBoost
>
> Currently linear (normal) regression, binary classification, and Poisson
> regression are supported. You can extend it with other loss functions as
> well.
>
> L1/L2 regularization, bagging, and feature sub-sampling are also employed
> to avoid overfitting.
>
> Thank you for testing. I am looking forward to your comments and
> suggestions. Bugs or improvements can be reported through GitHub.
>
> Many thanks!
>
> Meihua
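
As an aside for readers, a second-order loss in this setting only needs to expose
a gradient and a hessian with respect to the current score. The sketch below is a
purely hypothetical interface, not SparkXGBoost's actual API, using Poisson
regression with a log link as the example.

// Hypothetical illustration only; this is NOT SparkXGBoost's actual API.
trait SecondOrderLoss {
  def loss(label: Double, score: Double): Double
  def gradient(label: Double, score: Double): Double // dL/dscore
  def hessian(label: Double, score: Double): Double  // d2L/dscore2
}

// Poisson deviance with a log link, dropping the term that depends only on the label.
object PoissonLoss extends SecondOrderLoss {
  def loss(label: Double, score: Double): Double = math.exp(score) - label * score
  def gradient(label: Double, score: Double): Double = math.exp(score) - label
  def hessian(label: Double, score: Double): Double = math.exp(score)
}

// With G and H the sums of gradients and hessians over a leaf, the second-order
// (Newton) leaf value is -G / (H + lambda), as discussed elsewhere in this thread.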




Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Also, does it support categorical features?

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai  wrote:
> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
> you think you can implement a generic GBM and have it merged as part of
> the Spark codebase?
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D




Re: Spark Implementation of XGBoost

2015-10-26 Thread YiZhi Liu
There's an xgboost exploration jira SPARK-8547. Can it be a good start?

2015-10-27 7:07 GMT+08:00 DB Tsai :
> Also, does it support categorical features?
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D



-- 
Yizhi Liu
Senior Software Engineer / Data Mining
www.mvad.com, Shanghai, China




Re: Spark Implementation of XGBoost

2015-10-26 Thread Meihua Wu
Hi YiZhi,

Thank you for mentioning the jira. I will add a note to the jira.

Meihua

On Mon, Oct 26, 2015 at 6:16 PM, YiZhi Liu  wrote:
> There's an xgboost exploration jira SPARK-8547. Can it be a good start?




Re: Spark Implementation of XGBoost

2015-10-26 Thread Meihua Wu
Hi DB Tsai,

Thank you very much for your interest and comment.

1) Feature sub-sampling is per-node, like in random forests.

2) The current code heavily exploits the tree structure to speed up
the learning (such as processing multiple learning nodes in one pass
over the training data). So a generic GBM is likely to be a different
codebase. Do you have any nice references on efficient GBM
implementations? I am more than happy to look into them.
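
Roughly, "processing multiple learning nodes in one pass" means routing each
instance to its current frontier node and accumulating that node's statistics as
you go; below is a minimal sketch of the idea, not the package's actual code.

// Sketch only: accumulate per-node gradient statistics for all frontier
// (still-growing) nodes in a single pass over the training data.
case class Instance(features: Array[Double], grad: Double, hess: Double)

object MultiNodePass {
  // assignNode walks the partially built tree and returns the id of the
  // frontier node that the instance falls into.
  def aggregate(data: Iterator[Instance],
                assignNode: Instance => Int): Map[Int, (Double, Double)] =
    data.foldLeft(Map.empty[Int, (Double, Double)]) { (acc, inst) =>
      val node = assignNode(inst)
      val (g, h) = acc.getOrElse(node, (0.0, 0.0))
      acc.updated(node, (g + inst.grad, h + inst.hess))
    }
}

In Spark, this per-node (G, H) aggregation would typically be done with a single
treeAggregate or reduce over the partitions rather than a local fold.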

3) The algorithm accepts training data as a DataFrame with the
featureCol indexed by VectorIndexer. You can specify which variables
are categorical in the VectorIndexer. Please note that currently all
categorical variables are treated as ordered. If you want some
categorical variables treated as unordered, you can pass the data
through OneHotEncoder before the VectorIndexer. I do have a plan to
handle unordered categorical variables using the approach from RF in
Spark ML (please see the roadmap in the README.md).
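
For concreteness, here is a minimal sketch of that preprocessing (the column
names and the maxCategories value are made up, and it is written against the
DataFrame-based spark.ml API of a later Spark release than the one current in
this thread): StringIndexer plus OneHotEncoder expand the column you want treated
as unordered, VectorAssembler builds the feature vector, and VectorIndexer then
marks the remaining low-arity columns as (ordered) categorical.

// Sketch of the preprocessing described above, e.g. pasted into spark-shell.
// Column names and the maxCategories setting are hypothetical.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler, VectorIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("prep-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Toy data: one numeric column, one ordered categorical, one unordered categorical.
val df = Seq(
  (1.0, 0.5, 2.0, "red"),
  (0.0, 1.3, 0.0, "blue"),
  (1.0, 2.1, 1.0, "green")
).toDF("label", "numeric1", "orderedCat", "unorderedCat")

val indexer = new StringIndexer()
  .setInputCol("unorderedCat").setOutputCol("unorderedCatIdx")

// One-hot encode the variable that should NOT be treated as ordered.
val encoder = new OneHotEncoder()
  .setInputCol("unorderedCatIdx").setOutputCol("unorderedCatVec")

val assembler = new VectorAssembler()
  .setInputCols(Array("numeric1", "orderedCat", "unorderedCatVec"))
  .setOutputCol("rawFeatures")

// VectorIndexer flags low-arity feature columns as categorical (treated as ordered).
val vectorIndexer = new VectorIndexer()
  .setInputCol("rawFeatures").setOutputCol("features").setMaxCategories(4)

val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, vectorIndexer))
val prepared = pipeline.fit(df).transform(df).select("label", "features")
prepared.show(truncate = false)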

Thanks,

Meihua



On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai  wrote:
> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
> you think you can implement a generic GBM and have it merged as part of
> the Spark codebase?
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
