Re: [scikit-learn] A necessary feature for Decision trees

2018-01-04 Thread Andreas Mueller
Your contribution would be very welcome; I think the current work has 
stalled.



On 01/04/2018 10:02 AM, Julio Antonio Soto de Vicente wrote:

Hi Yang Li,

I have to agree with you. Bitset and/or one-hot encoding are just 
hacks that should not be necessary for decision tree learners.


There is some WIP on an implementation for natural handling of 
categorical features in trees: please take a look at 
https://github.com/scikit-learn/scikit-learn/pull/4899


Cheers!

--
Julio



___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] A necessary feature for Decision trees

2018-01-04 Thread Julio Antonio Soto de Vicente
Hi Yang Li,

I have to agree with you. Bitset and/or one-hot encoding are just hacks that 
should not be necessary for decision tree learners.

There is some WIP on an implementation for natural handling of categorical 
features in trees: please take a look at 
https://github.com/scikit-learn/scikit-learn/pull/4899

Cheers!

--
Julio

> On 4 Jan 2018, at 9:06, 李扬 (Yang Li) wrote:
> 
> Dear J.B.,
> 
> Thanks for your advice!
> 
> Yes, I have considered using a bitstring or sequence-number encoding, but the 
> problem is the algorithm, not the representation of the categorical data.
> Take the regression tree as an example: the algorithm in sklearn finds a split 
> value for the feature and chooses the best split by minimizing the impurity of 
> the child nodes.
> However, a threshold split on a categorical feature is not very meaningful even 
> if you represent it as a continuous value, and the resulting split partly 
> depends on how you order the values of the categorical feature, which is not 
> very convincing.
> Instead, in the CART algorithm, you should separate each category in the 
> feature from the others, compute the impurity of the two resulting sets, and 
> then choose the separation strategy with the minimal impurity.
> Obviously, this separation cannot be done by the current algorithm, which 
> simply applies the threshold-split method designed for continuous values.
> 
> One more possible shortcoming is that categorical features cannot be properly 
> visualized: when drawing the tree graph, it is hard to read information from a 
> categorical feature node when it has simply been threshold-split.
> 
> Thank you for your time!
> Best wishes.
> 
> --
> 
> Yang Li  +86 188 1821 2371
> Shanghai Jiao Tong University
> School of Electronic, Information and Electrical Engineering F1203026
> 800 Dongchuan Road, Minhang District, Shanghai 200240
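
Below is a minimal sketch (plain Python, not scikit-learn internals) of the
one-vs-rest search described in the quoted message: for each category, send
that category to the left child and all the others to the right, and keep the
split with the lowest weighted Gini impurity. The data and the gini /
best_one_vs_rest_split helpers are made up for illustration.

from collections import Counter

def gini(labels):
    # Gini impurity of a collection of class labels.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_one_vs_rest_split(values, labels):
    # values: categorical feature values; labels: class labels.
    best = None
    for cat in set(values):
        left = [l for v, l in zip(values, labels) if v == cat]
        right = [l for v, l in zip(values, labels) if v != cat]
        if not left or not right:
            continue
        n = len(labels)
        score = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if best is None or score < best[0]:
            best = (score, cat)
    return best  # (weighted impurity, category sent to the left child)

values = ["google", "facebook", "twitter", "google", "facebook"]
labels = [0, 1, 1, 0, 1]
print(best_one_vs_rest_split(values, labels))  # (0.0, 'google')

Note that no ordering of the categories is ever assumed, which is exactly
what distinguishes this from a threshold split on an encoded value.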
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] A necessary feature for Decision trees

2018-01-03 Thread Brown J.B. via scikit-learn
Dear Yang Li,

> Neither the classification tree nor the regression tree supports
> categorical features. That means the decision tree models can only
> accept continuous features.

Consider either manually encoding your categories in bitstrings (e.g.,
"Facebook" = 001, "Twitter" = 010, "Google" = 100), or using OneHotEncoder
to do the same thing for you automatically.
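
For instance, a minimal sketch of that approach (assuming scikit-learn
>= 0.20, where OneHotEncoder accepts string inputs directly; the data
here is made up):

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

X_cat = np.array([["facebook"], ["twitter"], ["google"], ["facebook"]])
y = np.array([1, 0, 0, 1])

enc = OneHotEncoder()                   # one binary column per category
X = enc.fit_transform(X_cat).toarray()  # dense array for the tree
print(enc.categories_)                  # learned category order

clf = DecisionTreeClassifier().fit(X, y)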

Cheers,
J.B.
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] A necessary feature for Decision trees

2018-01-03 Thread 李扬
Hi, I'm a graduate student using sklearn for some data work. 
While handling data with the decision tree module, I found some 
inconveniences:
Neither the classification tree nor the regression tree supports categorical 
features. That means the decision tree models can only accept continuous 
features.
For example, categorical features such as app names (google, facebook) 
cannot be fed into the model, because they cannot be properly transformed 
into continuous values, and there is no corresponding algorithm in the 
decision tree module for splitting on discrete features.
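
To make the inconvenience concrete, here is a minimal sketch (the data and
the integer mapping are made up; the mapping is the kind of arbitrary-order
workaround discussed elsewhere in this thread):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

apps = ["google", "facebook", "twitter", "google"]
y = [0, 1, 1, 0]

# The tree cannot take the strings directly, so each app name is mapped
# to an arbitrary integer code; the tree then threshold-splits on these
# codes, and the resulting split depends on this arbitrary ordering.
codes = {name: i for i, name in enumerate(sorted(set(apps)))}
X = np.array([[codes[a]] for a in apps])

clf = DecisionTreeClassifier().fit(X, y)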
However, the CART algorithm itself already accounts for categorical 
features, so I made some modifications to the decision tree module based on 
CART and applied the new model in my own work. The results show that support 
for categorical features indeed improves performance, which I think is very 
necessary for decision trees.
I'm very willing to contribute this to the sklearn community, but I'm new 
here and not familiar with the procedure.
Could you give some suggestions or comments on this new feature? Or is 
anyone already working on it? Thank you so much.


Best wishes!

--


Yang Li  +86 188 1821 2371
Shanghai Jiao Tong University
School of Electronic, Information and Electrical Engineering F1203026
800 Dongchuan Road, Minhang District, Shanghai 200240




 ___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn