Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
+1. Just wanted to point out that the K-1 subset proof only holds for binary classification. Such heuristics do perform reasonably well for multiclass classification criteria, though.

On Monday, November 17, 2014, Alexander Hawk wrote:
> Perhaps you have become aware of this by now,
> but only K-1 subset tests are needed to find the best
> categorical split, not 2^(K-1)-1. This was a central
> result proved in Breiman's book.
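For the binary case, the quoted result reduces the search from 2^(K-1)-1 subsets to K-1 ordered splits: sort the categories by their mean response, and only contiguous splits along that ordering need to be scored. A toy sketch of the idea (illustrative only - the function name and the choice of Gini impurity are mine, and this is not scikit-learn's internal code):

import numpy as np

def best_categorical_split(x, y):
    """Breiman's trick for a binary target y and a categorical x."""
    cats = np.unique(x)
    # Order the categories by their fraction of positive labels.
    order = cats[np.argsort([y[x == c].mean() for c in cats])]

    def gini(t):
        p = t.mean()
        return 2.0 * p * (1.0 - p)

    best_score, best_subset = np.inf, None
    for i in range(1, len(order)):  # only K - 1 candidate splits
        left = np.isin(x, order[:i])
        score = (left.sum() * gini(y[left])
                 + (~left).sum() * gini(y[~left])) / len(y)
        if score < best_score:
            best_score, best_subset = score, set(order[:i])
    return best_subset, best_score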
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
Perhaps you have become aware of this by now, but only K-1 subset tests are needed to find the best categorical split, not 2^(K-1)-1. This was a central result proved in Breiman's book.
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
>> I believe more in my results than in my expertise - and so should you :-)
>
> +1! There are very, very few examples of theory trumping data in history... and a bajillion of the converse.

I guess I didn't express myself clearly: I didn't mean to say that I mistrust my results per se; I'm not that much into skepticism! What I meant, rather, is that when I'm experimenting with something new (to me) and observe something weird or not in line with what I expect, my a priori belief is that I most likely made a mistake, rather than discovered some previously unnoticed flaw.
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On Tue, Jun 4, 2013 at 8:16 PM, Peter Prettenhofer <peter.prettenho...@gmail.com> wrote:
> I believe more in my results than in my expertise - and so should you :-) **

+1! There are very, very few examples of theory trumping data in history... and a bajillion of the converse.

I also think Joel put it quite nicely with "all these trees can represent the same hypothesis space, it just might require a deeper tree to represent the same thing". Christian's results seem in no way contradictory to me, just pleasantly surprising.
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
Hi Christian,

> I believe more in my results than in my expertise - and so should you :-) **

I think you misunderstood me: I did not claim that one-hot encoded categorical features give better results than ordinally encoded ones. I just claimed that ordinal encoding works as well as one-hot encoding, given that you have deep enough trees. But I have to warn you: I cannot support my claim with (sufficient) data. So at the end of the day, it's always best to run an experiment and test on the problem at hand.

Anyway, I cannot really see your problem (or what you did "wrong"): according to your description, it seems that the specific encoding (one-hot vs. ordinal) has no influence on the effectiveness of the model (no significant difference)? This is in line with observations by others. Andy raised a very important point, though: if you optimized your hyperparameters (tree depth, min split size, ...) on the ordinal encoding and then tested those hyperparameters on a one-hot encoding, you gave an advantage to the ordinal encoding.

HTH,
Peter

** that being said, I'm still quite skeptical when it comes to my results

2013/6/4 Christian Jauvin:
> Many thanks to all for your help and detailed answers, I really appreciate it.
>
> So I wanted to test the discussion's takeaway, namely, what Peter
> suggested: one-hot encode the categorical features with small
> cardinality, and leave the others in their ordinal form.
>
> So from the same dataset I mentioned earlier, I picked another subset
> of 5 features, this time all with small cardinality (5, 5, 6, 11 and
> 12), and all purely categorical (i.e. clearly not ordered). The
> one-hot encoding should clearly help with such a configuration.
>
> But again, what I observe when I pit the fully one-hot encoded RF
> (21000 x 39) against the ordinal-encoded one (21000 x 5) is that
> they're behaving almost the same, in terms of accuracy and AUC, with
> 10-fold cross-validation. In fact, the ordinal version even seems to
> perform very slightly better, although I don't think it's significant.
>
> I really believe in your expertise more than in my results, so what
> could I be doing wrong?
>
> On 3 June 2013 04:56, Andreas Mueller wrote:
> > On 06/03/2013 09:15 AM, Peter Prettenhofer wrote:
> >> Our decision tree implementation only supports numerical splits, i.e.
> >> if-tests of the form val < threshold.
> >>
> >> Categorical features need to be encoded properly. I recommend one-hot
> >> encoding for features with small cardinality (e.g. < 50) and ordinal
> >> encoding (simply assign each category an integer value) for features
> >> with large cardinality.
> > This seems to be the opposite of what the Kaggle tutorial suggests,
> > right? They suggest ordinal encoding for small cardinality, but don't
> > suggest any other way.
> >
> > Your and Gilles' feedback makes me think we should tell the Kaggle
> > people to change their tutorial.

--
Peter Prettenhofer
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On 06/04/2013 05:55 AM, Christian Jauvin wrote:
> Many thanks to all for your help and detailed answers, I really appreciate it.
>
> So I wanted to test the discussion's takeaway, namely, what Peter
> suggested: one-hot encode the categorical features with small
> cardinality, and leave the others in their ordinal form.
>
> So from the same dataset I mentioned earlier, I picked another subset
> of 5 features, this time all with small cardinality (5, 5, 6, 11 and
> 12), and all purely categorical (i.e. clearly not ordered). The
> one-hot encoding should clearly help with such a configuration.
>
> But again, what I observe when I pit the fully one-hot encoded RF
> (21000 x 39) against the ordinal-encoded one (21000 x 5) is that
> they're behaving almost the same, in terms of accuracy and AUC, with
> 10-fold cross-validation. In fact, the ordinal version even seems to
> perform very slightly better, although I don't think it's significant.
>
> I really believe in your expertise more than in my results, so what
> could I be doing wrong?

Did you grid-search the parameters again?
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
Many thanks to all for your help and detailed answers, I really appreciate it.

So I wanted to test the discussion's takeaway, namely, what Peter suggested: one-hot encode the categorical features with small cardinality, and leave the others in their ordinal form.

So from the same dataset I mentioned earlier, I picked another subset of 5 features, this time all with small cardinality (5, 5, 6, 11 and 12), and all purely categorical (i.e. clearly not ordered). The one-hot encoding should clearly help with such a configuration.

But again, what I observe when I pit the fully one-hot encoded RF (21000 x 39) against the ordinal-encoded one (21000 x 5) is that they're behaving almost the same, in terms of accuracy and AUC, with 10-fold cross-validation. In fact, the ordinal version even seems to perform very slightly better, although I don't think it's significant.

I really believe in your expertise more than in my results, so what could I be doing wrong?

On 3 June 2013 04:56, Andreas Mueller wrote:
> On 06/03/2013 09:15 AM, Peter Prettenhofer wrote:
>> Our decision tree implementation only supports numerical splits, i.e.
>> if-tests of the form val < threshold.
>>
>> Categorical features need to be encoded properly. I recommend one-hot
>> encoding for features with small cardinality (e.g. < 50) and ordinal
>> encoding (simply assign each category an integer value) for features
>> with large cardinality.
> This seems to be the opposite of what the Kaggle tutorial suggests,
> right? They suggest ordinal encoding for small cardinality, but don't
> suggest any other way.
>
> Your and Gilles' feedback makes me think we should tell the Kaggle
> people to change their tutorial.
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On 06/03/2013 09:15 AM, Peter Prettenhofer wrote:
> Our decision tree implementation only supports numerical splits, i.e.
> if-tests of the form val < threshold.
>
> Categorical features need to be encoded properly. I recommend one-hot
> encoding for features with small cardinality (e.g. < 50) and ordinal
> encoding (simply assign each category an integer value) for features
> with large cardinality.

This seems to be the opposite of what the Kaggle tutorial suggests, right? They suggest ordinal encoding for small cardinality, but don't suggest any other way.

Your and Gilles' feedback makes me think we should tell the Kaggle people to change their tutorial.
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
Our decision tree implementation only supports numerical splits, i.e. if-tests of the form val < threshold.

Categorical features need to be encoded properly. I recommend one-hot encoding for features with small cardinality (e.g. < 50) and ordinal encoding (simply assign each category an integer value) for features with large cardinality. Sufficiently deep decision trees will handle ordinally encoded categorical features nicely; the same holds for boosting models with a sufficient number of trees (see [1]). Furthermore, ordinal encoding might be more efficient than one-hot encoding, since fewer features need to be searched. One-hot encoding, on the other hand, plays much more nicely with model interpretation.

Regarding split tests for categorical variables, there are two types of tests I'm aware of: the equality test (val == cat) and the subset test (val in {cat-subset}). While the latter sounds more powerful, it has to be considered harmful: subset tests give rise to 2^(K-1) - 1 potential splits per categorical feature, whereas a numerical feature has only N - 1 potential split points (where N is the number of examples and K is the cardinality of the categorical feature). A large number of potential split points can lead to severe overfitting (you will most likely find a subset that perfectly separates your data). AFAIK, R's random forest package only supports subset tests, so it might in fact be advantageous to use ordinal encoding there too when your features have large cardinality.

HTH,
Peter

[1] http://www.salford-systems.com/en/blog/dan-steinberg/item/15-modeling-tricks-with-treenet-treating-categorical-variables-as-continuous

PS: regarding the Kaggle tutorial - they most likely were not aware of that.

2013/6/3 Andreas Mueller:
> On 06/03/2013 04:41 AM, Christian Jauvin wrote:
> >> Sklearn does not implement any special treatment for categorical
> >> variables. You can feed it any float. The question is whether it
> >> would work / what it does.
> > I think I'm confused about a couple of aspects (that's what happens I
> > guess when you play with algorithms for which you don't have a
> > complete and firm understanding beforehand!). I assumed that
> > sklearn-RF's requirement for numerical inputs was just a data
> > representation/implementation aspect, and that once properly
> > transformed (i.e. using a LabelEncoder), it wouldn't matter, under the
> > hood, whether a predictor was categorical or numerical.
> >
> > Now if I understand you well, sklearn shouldn't be able to explicitly
> > handle the categorical case where no order exists (i.e. categorical,
> > as opposed to ordinal).
> Yes. At least the splitting criterion is not the one usually used.
> >
> > But you seem to also imply that sklearn can indirectly support it
> > using dummy variables..
> Yes.
> >
> > Bigger question: given that Decision Trees (in general) support pure
> > categorical variables.. shouldn't Random Forests do so as well?
> >
> As I said, trees in sklearn don't. But that is a purely implementation /
> API problem.
> >
> >> Not sure what this says about your dataset / features.
> >> If the variables don't have any ordering and the splits take arbitrary
> >> subsets, that would seem a bit weird to me.
> > In fact that's really what I observe: apart from the first of my 4
> > variables, which is a year, the remaining 3 are purely categorical,
> > with no implicit order. So that result is weird because it is not in
> > line with what you've been saying.
> Actually, I think all these classifiers can also be represented by
> treating the categorical features as ordinal ones; it is just that the
> tree needs to be deeper and the splits are a bit weird. Imagine you want
> to get category c out of a, b, c, d, e: you have to threshold between b
> and c and then between c and d, so you get three branches ('a', 'b'),
> ('c'), ('d', 'e'). If there is no ordering to the variables, that is
> really weird. If you have enough data, it might not make a difference,
> though - if your trees are not too deep (or too many), you can dump
> them using dot.
>
> I don't have time to look at the documentation now, but maybe we should
> clear it up a bit. Also, maybe we should tell the Kaggle folks to add a
> sentence to their tutorial.
>
> Cheers,
> Andy

--
Peter Prettenhofer
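A minimal sketch of the encoding recipe above, assuming pandas (the toy columns and the cutoff value are placeholders; in practice the "< 50" rule of thumb would replace CUTOFF):

import pandas as pd

# Hypothetical data: one low-cardinality and one high-cardinality column.
df = pd.DataFrame({"color": ["red", "green", "red", "green"],
                   "zip": ["10115", "80331", "50667", "20095"]})

CUTOFF = 3  # stand-in for the suggested cardinality threshold

parts = []
for col in df.columns:
    if df[col].nunique() < CUTOFF:
        # Small cardinality: one-hot encode.
        parts.append(pd.get_dummies(df[col], prefix=col))
    else:
        # Large cardinality: ordinal encode (one integer per category).
        parts.append(df[col].astype("category").cat.codes.rename(col))

X = pd.concat(parts, axis=1)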
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On 3 June 2013 08:43, Andreas Mueller wrote:
> On 06/03/2013 05:19 AM, Joel Nothman wrote:
>> However, in these last two cases, the number of possible splits at a
>> single node is linear in the number of categories. Selecting an
>> arbitrary partition allows exponentially many splits with respect to
>> the number of categories (though there may be approximations to avoid
>> evaluating all possible splits; I'm not familiar with the literature).
>>
> I think the standard split is asking whether a variable is equal to a
> value, i.e. selecting subsets of size 1. That is possible to do with two
> thresholds, but leads to a weird tree in a way.

Yes, CART builds binary decision trees. (The algorithm which splits a node into as many children as the number of values of the variable is ID3.)

As introduced by Breiman in his book, for a categorical variable X taking its value in {1, ..., L}, the strategy is to consider every subset S \subseteq {1, ..., L} of values of the variable and to pick the one leading to the largest reduction of impurity. As such, splits are defined as yes-no questions of the form "is x in S?".

In scikit-learn, we don't implement that. The main reason is that it blows up computing time: if L is the cardinality of X, then there are 2^(L-1) - 1 subsets to consider (up to complementation). The best you can do with our implementation is to one-hot encode your categorical variables, which amounts to selecting subsets of size 1, as Andy said.

If you don't one-hot encode your categorical variables, then you have to be aware that the construction procedure will implicitly assume that the categorical values are ordered (which may make no sense depending on your dataset).

Gilles
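To see the blow-up concretely, a quick enumeration in plain Python (since "is x in S?" and "is x in the complement of S?" induce the same partition, one value is pinned to the left side to avoid double counting):

from itertools import combinations

values = ["a", "b", "c", "d", "e"]  # L = 5 category values
L = len(values)

# Pin values[0] to the left side; any strict subset of the remaining
# values may join it, giving every distinct non-trivial binary split.
splits = [{values[0], *rest}
          for r in range(L - 1)
          for rest in combinations(values[1:], r)]

print(len(splits), 2 ** (L - 1) - 1)  # 15 15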
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On 06/03/2013 04:41 AM, Christian Jauvin wrote:
>> Sklearn does not implement any special treatment for categorical variables.
>> You can feed it any float. The question is whether it would work / what it does.
> I think I'm confused about a couple of aspects (that's what happens I
> guess when you play with algorithms for which you don't have a
> complete and firm understanding beforehand!). I assumed that
> sklearn-RF's requirement for numerical inputs was just a data
> representation/implementation aspect, and that once properly
> transformed (i.e. using a LabelEncoder), it wouldn't matter, under the
> hood, whether a predictor was categorical or numerical.
>
> Now if I understand you well, sklearn shouldn't be able to explicitly
> handle the categorical case where no order exists (i.e. categorical,
> as opposed to ordinal).

Yes. At least the splitting criterion is not the one usually used.

> But you seem to also imply that sklearn can indirectly support it
> using dummy variables..

Yes.

> Bigger question: given that Decision Trees (in general) support pure
> categorical variables.. shouldn't Random Forests do so as well?

As I said, trees in sklearn don't. But that is a purely implementation / API problem.

>> Not sure what this says about your dataset / features.
>> If the variables don't have any ordering and the splits take arbitrary
>> subsets, that would seem a bit weird to me.
> In fact that's really what I observe: apart from the first of my 4
> variables, which is a year, the remaining 3 are purely categorical,
> with no implicit order. So that result is weird because it is not in
> line with what you've been saying.

Actually, I think all these classifiers can also be represented by treating the categorical features as ordinal ones; it is just that the tree needs to be deeper and the splits are a bit weird. Imagine you want to get category c out of a, b, c, d, e: you have to threshold between b and c and then between c and d, so you get three branches ('a', 'b'), ('c'), ('d', 'e'). If there is no ordering to the variables, that is really weird. If you have enough data, it might not make a difference, though - if your trees are not too deep (or too many), you can dump them using dot.

I don't have time to look at the documentation now, but maybe we should clear it up a bit. Also, maybe we should tell the Kaggle folks to add a sentence to their tutorial.

Cheers,
Andy
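This is easy to reproduce with scikit-learn itself (toy data; export_text post-dates this thread, so recent versions only):

from sklearn.tree import DecisionTreeClassifier, export_text

# a..e ordinally encoded as 0..4; the target is "is it category c?".
X = [[0], [1], [2], [3], [4]]
y = [0, 0, 1, 0, 0]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree))
# The fitted tree needs two thresholds (between b and c, and between
# c and d) to carve out 'c', producing exactly the three groups
# ('a', 'b'), ('c'), ('d', 'e') described above.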
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On 06/03/2013 05:19 AM, Joel Nothman wrote:
> However, in these last two cases, the number of possible splits at a
> single node is linear in the number of categories. Selecting an
> arbitrary partition allows exponentially many splits with respect to
> the number of categories (though there may be approximations to avoid
> evaluating all possible splits; I'm not familiar with the literature).

I think the standard split is asking whether a variable is equal to a value, i.e. selecting subsets of size 1. That is possible to do with two thresholds, but leads to a weird tree in a way.
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On Mon, Jun 3, 2013 at 12:41 PM, Christian Jauvin wrote:
>> Sklearn does not implement any special treatment for categorical
>> variables. You can feed it any float. The question is whether it would
>> work / what it does.
>
> I think I'm confused about a couple of aspects (that's what happens I
> guess when you play with algorithms for which you don't have a
> complete and firm understanding beforehand!). I assumed that
> sklearn-RF's requirement for numerical inputs was just a data
> representation/implementation aspect, and that once properly
> transformed (i.e. using a LabelEncoder), it wouldn't matter, under the
> hood, whether a predictor was categorical or numerical.
>
> Now if I understand you well, sklearn shouldn't be able to explicitly
> handle the categorical case where no order exists (i.e. categorical,
> as opposed to ordinal).

It comes down to what sort of decision can be made at each node. scikit-learn always uses decisions of the form (x > t) for some feature value x and some threshold t.

Let's make this more concrete: say you have a feature with possible values {A, B, C, D}.

Ideal categorical treatment might partition the set of categories indicated by variable x, so that each side of the partition corresponds to a different child in the decision tree. So possible decisions would distinguish {A} from {B, C, D}; {B} from {A, C, D}; {C} from {A, B, D}; {D} from {A, B, C}; {A, B} from {C, D}; {A, C} from {B, D}; and {A, D} from {B, C}. Scikit-learn can't make these sorts of splits.

LabelEncoder will turn these values into [0, 1, 2, 3]. Then only splits respecting that ordering are possible, so a single split can distinguish {A} from {B, C, D}; {A, B} from {C, D}; and {A, B, C} from {D}.

LabelBinarizer will allow a single split to distinguish any one category from all others: {A} from {B, C, D}; {B} from {A, C, D}; {C} from {A, B, D}; {D} from {A, B, C}.

Note that all these trees can represent the same hypothesis space; it just might require a deeper tree to represent the same thing (and the learning process can't take advantage of similar categories). However, in these last two cases, the number of possible splits at a single node is linear in the number of categories, whereas selecting an arbitrary partition allows exponentially many splits with respect to the number of categories (though there may be approximations to avoid evaluating all possible splits; I'm not familiar with the literature). So it should be quite clear that binarized categories allow the most meaningful decisions with the least complexity.

Cheers,
- Joel
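Concretely, the two encodings look like this (both classes live in sklearn.preprocessing):

from sklearn.preprocessing import LabelBinarizer, LabelEncoder

x = ["A", "B", "C", "D", "B"]

print(LabelEncoder().fit_transform(x))
# [0 1 2 3 1] -- a single ordered column: a threshold split can only
# separate order-respecting groups such as {A} vs {B, C, D}.

print(LabelBinarizer().fit_transform(x))
# [[1 0 0 0]
#  [0 1 0 0]
#  [0 0 1 0]
#  [0 0 0 1]
#  [0 1 0 0]] -- four indicator columns: one split isolates one category.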
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
> Sklearn does not implement any special treatment for categorical variables.
> You can feed it any float. The question is whether it would work / what it does.

I think I'm confused about a couple of aspects (that's what happens I guess when you play with algorithms for which you don't have a complete and firm understanding beforehand!). I assumed that sklearn-RF's requirement for numerical inputs was just a data representation/implementation aspect, and that once properly transformed (i.e. using a LabelEncoder), it wouldn't matter, under the hood, whether a predictor was categorical or numerical.

Now if I understand you well, sklearn shouldn't be able to explicitly handle the categorical case where no order exists (i.e. categorical, as opposed to ordinal). But you seem to also imply that sklearn can indirectly support it using dummy variables..

Bigger question: given that Decision Trees (in general) support pure categorical variables, shouldn't Random Forests do so as well?

>> https://github.com/benhamner/Stack-Overflow-Competition/blob/master/features.py
> I don't see where categorical variables are used in this code. Could you
> please point it out?

You're right, my bad: those are not categorical predictors.

> Not sure what this says about your dataset / features.
> If the variables don't have any ordering and the splits take arbitrary
> subsets, that would seem a bit weird to me.

In fact that's really what I observe: apart from the first of my 4 variables, which is a year, the remaining 3 are purely categorical, with no implicit order. So that result is weird, because it is not in line with what you've been saying.

Anyway, thanks for your time and patience,

Christian
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On 06/02/2013 10:53 PM, Christian Jauvin wrote:
> Hi Andreas,
>
>> Btw, you do encode the categorical variables using one-hot, right?
>> The sklearn trees don't really support categorical variables.
> I'm rather perplexed by this.. I assumed that sklearn's RF only
> required its input to be numerical, so I only used a LabelEncoder up
> to now.

Hum. I have not considered that. Peter? Gilles? Lars? A little help?

Sklearn does not implement any special treatment for categorical variables. You can feed it any float. The question is whether it would work / what it does. I guess you (and Kaggle) observed that it does work somewhat; I'm not sure it does what you want. The splits will be as for numerical variables, i.e. > threshold. If the variables have an ordering (and LabelEncoder respects that ordering), that makes sense. If the variables don't have an ordering (which I would assume is the more common case for categorical variables), I don't think that makes much sense.

> My assumption was backed by two external sources of information:
> (1) The benchmark code provided by Kaggle in the SO contest (which was
> actually the first time I used RFs) didn't seem to perform such a
> transformation:
> https://github.com/benhamner/Stack-Overflow-Competition/blob/master/features.py

I don't see where categorical variables are used in this code. Could you please point it out?

> (2) It doesn't seem to be mentioned in this Kaggle tutorial about RFs:
> http://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests

I am not that experienced with categorical variables. The catch here seems to be "not too many values". Maybe it works for "few" values, but it is not what I would expect a random forest implementation to do on categorical variables. I think it is rather bad that the tutorial doesn't mention one-hot encoding if it is using sklearn.

It is somewhat trivial to perform the usual categorical tests. They are not implemented in sklearn, though, as there is no obvious way to declare a column a categorical variable (you need an auxiliary array, and no one has done this yet).

> Moreover, I just tested it with my own experiment, and I found that a
> RF trained on a (21080 x 4) input matrix (i.e. 4 categorical
> variables, not one-hot encoded) performs the same (to the third
> decimal in accuracy and AUC, with 10-fold CV) as with its equivalent,
> one-hot encoded (21080 x 1347) matrix.

Not sure what this says about your dataset / features. If the variables don't have any ordering and the splits take arbitrary subsets, that would seem a bit weird to me.

> Sorry if the confusion is on my side, but did I miss something?

Maybe I'm just not well-versed enough in the use of numerically encoded categorical variables in random forests.

Cheers,
Andy
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
Hi Andreas,

> Btw, you do encode the categorical variables using one-hot, right?
> The sklearn trees don't really support categorical variables.

I'm rather perplexed by this.. I assumed that sklearn's RF only required its input to be numerical, so I have only used a LabelEncoder up to now. My assumption was backed by two external sources of information:

(1) The benchmark code provided by Kaggle in the SO contest (which was actually the first time I used RFs) didn't seem to perform such a transformation:
https://github.com/benhamner/Stack-Overflow-Competition/blob/master/features.py

(2) It doesn't seem to be mentioned in this Kaggle tutorial about RFs:
http://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests

Moreover, I just tested it in my own experiment, and I found that a RF trained on a (21080 x 4) input matrix (i.e. 4 categorical variables, not one-hot encoded) performs the same (to the third decimal in accuracy and AUC, with 10-fold CV) as with its equivalent, one-hot encoded (21080 x 1347) matrix.

Sorry if the confusion is on my side, but did I miss something?

Christian
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
I got very good results on text century dating using random forests on very few (20-ish) bag-of-words tf-idf features selected by chi2. It depends on the problem.

Cheers,
Vlad

On Sat, Jun 1, 2013 at 9:01 PM, Andreas Mueller wrote:
> On 06/01/2013 08:30 PM, Christian Jauvin wrote:
>> Hi,
>>
>> I asked a (perhaps too vague?) question about the use of Random
>> Forests with a mix of categorical and lexical features on two ML
>> forums (stats.SE and MetaOp), but since it has received no attention,
>> I figured that it might work better on this list (I'm using sklearn's
>> RF of course):
>>
>> "I'm working on a binary classification problem for which the dataset
>> is mostly composed of categorical features, but also a few lexical
>> ones (i.e. article titles and abstracts). I'm experimenting with
>> Random Forests, and my current strategy is to build the training set
>> by appending the k best lexical features (chosen with univariate
>> feature selection, and weighted with tf-idf) to the full set of
>> categorical features. This works reasonably well, but as I cannot find
>> explicit references to such a strategy of using hybrid features for
>> RF, I have doubts about my approach: does it make sense? Am I
>> "diluting" the power of the RF by doing so, and should I rather try to
>> combine two classifiers specializing on both types of features?"
>>
> I think it is ok, though people rarely use RF on bag-of-words features.
> Btw, you do encode the categorical variables using one-hot, right?
> The sklearn trees don't really support categorical variables.
> An alternative approach would be to run a linear classifier on all
> tf-idf features and feed its output, together with the other variables,
> to the RF.
>
> HTH,
> Andy
>
> ps: try Stack Overflow with the scikit-learn tag next time.
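A sketch of that kind of setup (texts and y stand for the documents and century labels, which are not shown; the parameter values are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# texts: list of document strings; y: century labels (assumed given).

X_tfidf = TfidfVectorizer().fit_transform(texts)
X_small = SelectKBest(chi2, k=20).fit_transform(X_tfidf, y)  # ~20 best

forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_small.toarray(), y)  # densify: only 20 columns remain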
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On 06/01/2013 08:30 PM, Christian Jauvin wrote:
> Hi,
>
> I asked a (perhaps too vague?) question about the use of Random
> Forests with a mix of categorical and lexical features on two ML
> forums (stats.SE and MetaOp), but since it has received no attention,
> I figured that it might work better on this list (I'm using sklearn's
> RF of course):
>
> "I'm working on a binary classification problem for which the dataset
> is mostly composed of categorical features, but also a few lexical
> ones (i.e. article titles and abstracts). I'm experimenting with
> Random Forests, and my current strategy is to build the training set
> by appending the k best lexical features (chosen with univariate
> feature selection, and weighted with tf-idf) to the full set of
> categorical features. This works reasonably well, but as I cannot find
> explicit references to such a strategy of using hybrid features for
> RF, I have doubts about my approach: does it make sense? Am I
> "diluting" the power of the RF by doing so, and should I rather try to
> combine two classifiers specializing on both types of features?"

I think it is ok, though people rarely use RF on bag-of-words features.

Btw, you do encode the categorical variables using one-hot, right? The sklearn trees don't really support categorical variables.

An alternative approach would be to run a linear classifier on all tf-idf features and feed its output, together with the other variables, to the RF.

HTH,
Andy

ps: try Stack Overflow with the scikit-learn tag next time.
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
Hi Christian,

Some time ago I had similar problems, i.e. I wanted to use additional features alongside my lexical features, and simple concatenation didn't work that well for me, even though both feature sets performed pretty well on their own. You can follow the discussion about my problem here [1] if you scroll down - ignore the starting discussion.

The best solution I ended up with was the one suggested by Olivier: you basically train a linear classifier on your lexical features, and then use the predict_proba outcome together with your additional categorical features to train a second classifier - for example, random forests. It was also helpful to perform leave-one-out when training the probabilities (if you have few samples).

[1] http://sourceforge.net/mailarchive/forum.php?thread_name=CAFvE7K5F2BJ_ms51a-61HwmNrAyRTb1W0KK7ziBPzGAcdiBRqQ%40mail.gmail.com&forum_name=scikit-learn-general

If you find out anything else, let us know ;)

Regards,
Philipp

On 01.06.2013 20:30, Christian Jauvin wrote:
> Hi,
>
> I asked a (perhaps too vague?) question about the use of Random
> Forests with a mix of categorical and lexical features on two ML
> forums (stats.SE and MetaOp), but since it has received no attention,
> I figured that it might work better on this list (I'm using sklearn's
> RF of course):
>
> "I'm working on a binary classification problem for which the dataset
> is mostly composed of categorical features, but also a few lexical
> ones (i.e. article titles and abstracts). I'm experimenting with
> Random Forests, and my current strategy is to build the training set
> by appending the k best lexical features (chosen with univariate
> feature selection, and weighted with tf-idf) to the full set of
> categorical features. This works reasonably well, but as I cannot find
> explicit references to such a strategy of using hybrid features for
> RF, I have doubts about my approach: does it make sense? Am I
> "diluting" the power of the RF by doing so, and should I rather try to
> combine two classifiers specializing on both types of features?"
>
> http://stats.stackexchange.com/questions/60162/random-forest-with-a-mix-of-categorical-and-lexical-features
>
> Thanks,
>
> Christian
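A sketch of that two-stage setup using the modern cross_val_predict API (which post-dates this thread); out-of-fold probabilities are the k-fold analogue of the leave-one-out idea above. X_lex, X_cat and y are assumed given:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# X_lex: tf-idf matrix of lexical features; X_cat: encoded categorical
# features (dense array); y: binary labels.

# Out-of-fold probabilities, so the second stage never sees predictions
# made on a sample that was in the first stage's training fold.
p_lex = cross_val_predict(LogisticRegression(), X_lex, y,
                          cv=10, method="predict_proba")[:, 1]

X_stage2 = np.hstack([X_cat, p_lex[:, None]])
clf = RandomForestClassifier(n_estimators=100).fit(X_stage2, y)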