Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2014-11-17 Thread Alexander Hawk
Perhaps you have become aware of this by now,
but only K-1 subset tests are needed to find the best
categorical split, not 2^(K-1) - 1. This was a central
result proved in Breiman's book.
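
For binary classification, the trick is to sort the K category values by
the per-category mean of the target and scan only the K-1 cut points along
that ordering; a minimal NumPy sketch on toy data (variable names are mine):

    import numpy as np

    # toy data: one categorical feature with K = 4 levels, binary target
    x = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 3])
    y = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1, 1])

    # order the categories by the mean of y within each category
    cats = np.unique(x)
    means = np.array([y[x == c].mean() for c in cats])
    order = cats[np.argsort(means)]

    # each prefix of the ordering is a candidate left-child subset, so only
    # K - 1 splits need to be evaluated instead of 2^(K-1) - 1
    for i in range(1, len(order)):
        left = order[:i]
        mask = np.isin(x, left)
        print(set(left.tolist()), y[mask].mean(), y[~mask].mean())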





Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2014-11-17 Thread Manish Amde
+1

Just wanted to point out that the K-1 subset proof only holds for binary
classification. The same ordering heuristic still performs reasonably well
for multiclass criteria, though it is no longer guaranteed to find the
best split.

On Monday, November 17, 2014, Alexander Hawk tomahawkb...@gmail.com wrote:

 Perhaps you have become aware of this by now,
 but only K-1 subset tests are needed to find the best
 categorical split, not 2^(K-1) - 1. This was a central
 result proved in Breiman's book.






Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-06 Thread Juan Nunez-Iglesias
On Tue, Jun 4, 2013 at 8:16 PM, Peter Prettenhofer 
peter.prettenho...@gmail.com wrote:

 I believe more in my results than in my expertise - and so should you :-) **


+1! There are very, very few examples of theory trumping data in
history... and a bajillion of the converse.

I also think Joel put it quite nicely with "all these trees can represent
the same hypothesis space, it just might require a deeper tree to represent
the same thing". Christian's results seem in no way contradictory to me,
just pleasantly surprising.


Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-06 Thread Christian Jauvin
 I believe more in my results than in my expertise - and so should you :-)

 +1! There are very, very few examples of theory trumping data in
 history... and a bajillion of the converse.

I guess I didn't express myself clearly: I didn't mean to say that I
mistrust my results per se... I'm not that much into skepticism! What I
meant, rather, is that when I'm experimenting with something new (to
me) and observe something weird or not in line with what I expect, my
a priori belief is that I most likely made a mistake, rather than
discovered some previously unnoticed flaw.



Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-04 Thread Andreas Mueller
On 06/04/2013 05:55 AM, Christian Jauvin wrote:
 Many thanks to all for your help and detailed answers, I really appreciate it.

 So I wanted to test the discussion's takeaway, namely, what Peter
 suggested: one-hot encode the categorical features with small
 cardinality, and leave the others in their ordinal form.

 So from the same dataset I mentioned earlier, I picked another subset
 of 5 features, this time all with small cardinality (5, 5, 6, 11 and
 12), and all purely categorical (i.e. clearly not ordered). The
 one-hot encoding should clearly help with such a configuration.

 But again, what I observe when I pit the fully one-hot encoded RF
 (21000 x 39) against the ordinal-encoded one (21000 x 5) is that
 they're behaving almost the same, in terms of accuracy and AUC, with
 10-fold cross-validation. In fact, the ordinal version even seems to
 perform very slightly better, although I don't think it's significant.

 I really believe in your expertise more than in my results, so what
 could I be doing wrong?


Did you grid-search parameters again?
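
Something like this, re-run for each encoding so neither inherits the
other's settings - a minimal sketch on synthetic stand-in data, using
today's import paths:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import OneHotEncoder

    rng = np.random.RandomState(0)
    X_ordinal = rng.randint(0, 6, size=(1000, 5))  # 5 categorical columns
    y = rng.randint(0, 2, size=1000)
    X_onehot = OneHotEncoder().fit_transform(X_ordinal)  # sparse; trees accept it

    param_grid = {"max_depth": [3, 10, None],
                  "min_samples_split": [2, 10, 50]}

    # tune separately for each encoding
    for name, X in (("ordinal", X_ordinal), ("one-hot", X_onehot)):
        search = GridSearchCV(RandomForestClassifier(n_estimators=100),
                              param_grid, cv=10, scoring="roc_auc")
        search.fit(X, y)
        print(name, search.best_params_, round(search.best_score_, 3))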



Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-04 Thread Peter Prettenhofer
Hi Christian,

I believe more in my results than in my expertise - and so should you :-) **

I think you misunderstood me: I did not claim that one-hot encoded
categorical features give better results than ordinal encoded ones - I
just claimed that ordinal encoding works as well as one-hot encoding,
given that you have deep enough trees. But I have to warn you: I cannot
support my claim with (sufficient) data. So at the end of the day, it's
always best to run the experiment on your problem at hand.

Anyway, I cannot really see your problem (or what you did wrong):
according to your description, the specific encoding (one-hot vs.
ordinal) has no influence on the effectiveness of the model (no
significant difference). This is in line with observations by others.

Andy raised a very important point though: if you optimized your
hyperparameters (tree depth, min split size, ...) on the ordinal encoding
and then tested those hyperparameters on a one-hot encoding, you were
giving an advantage to the ordinal encoding.

HTH,
 Peter

** that being said, I'm still quite skeptical when it comes to my results


2013/6/4 Christian Jauvin cjau...@gmail.com

 Many thanks to all for your help and detailed answers, I really appreciate
 it.

 So I wanted to test the discussion's takeaway, namely, what Peter
 suggested: one-hot encode the categorical features with small
 cardinality, and leave the others in their ordinal form.

 So from the same dataset I mentioned earlier, I picked another subset
 of 5 features, this time all with small cardinality (5, 5, 6, 11 and
 12), and all purely categorical (i.e. clearly not ordered). The
 one-hot encoding should clearly help with such a configuration.

 But again, what I observe when I pit the fully one-hot encoded RF
 (21000 x 39) against the ordinal-encoded one (21000 x 5) is that
 they're behaving almost the same, in terms of accuracy and AUC, with
 10-fold cross-validation. In fact, the ordinal version even seems to
 perform very slightly better, although I don't think it's significant.

 I really believe in your expertise more than in my results, so what
 could I be doing wrong?



 On 3 June 2013 04:56, Andreas Mueller amuel...@ais.uni-bonn.de wrote:
  On 06/03/2013 09:15 AM, Peter Prettenhofer wrote:
   Our decision tree implementation only supports numerical splits; i.e.
   it tests val < threshold.
  
   Categorical features need to be encoded properly. I recommend one-hot
   encoding for features with small cardinality (e.g. < 50) and ordinal
   encoding (simply assign each category an integer value) for features
   with large cardinality.
  This seems to be the opposite of what the kaggle tutorial suggests,
  right? They suggest ordinal encoding for small cardinality, but don't
  suggest any other way.
 
  Your and Gilles' feedback makes me think we should tell the kaggle people
  to change their tutorial.
 
 






-- 
Peter Prettenhofer


Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Andreas Mueller
On 06/03/2013 05:19 AM, Joel Nothman wrote:

 However, in these last two cases, the number of possible splits at a 
 single node is linear in the number of categories. Selecting an 
 arbitrary partition allows exponentially many splits with respect to 
 the number of categories (though there may be approximations to avoid 
 evaluating all possible splits; I'm not familiar with the literature).

I think the standard split is asking whether a variable is equal to a
value, i.e. selecting subsets of size 1. That can be emulated with two
threshold splits, but it leads to a somewhat weird tree.




Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Andreas Mueller
On 06/03/2013 04:41 AM, Christian Jauvin wrote:
 Sklearn does not implement any special treatment for categorical variables.
 You can feed any float. The question is if it would work / what it does.
 I think I'm confused about a couple of aspects (that's what happens I
 guess when you play with algorithms for which you don't have a
 complete and firm understanding beforehand!). I assumed that
 sklearn-RF's requirement for numerical inputs was just a data
 representation/implementation aspect, and that once properly
 transformed (i.e. using a LabelEncoder), it wouldn't matter, under the
 hood, whether a predictor was categorical or numerical.

 Now if I understand you well, sklearn shouldn't be able to explicitly
 handle the categorical case where no order exists (i.e. categorical,
 as opposed to ordinal).
Yes. At least the splitting criterion is not the one usually used.

 But you seem to also imply that sklearn can indirectly support it
 using dummy variables..
Yes.

 Bigger question: given that Decision Trees (in general) support pure
 categorical variables.. shouldn't Random Forests also do?

As I said, trees in sklearn don't. But that is a purely implementation / 
API problem.


 Not sure what this says about your dataset / features.
 If the variables don't have any ordering and the splits take arbitrary
 subsets, that would seem a bit weird to me.
 In fact that's really what I observe: apart from the first of my 4
 variables, which is a year, the remaining 3 are purely categorical,
 with no implicit order. So that result is weird because it is not in
 line with what you've been saying.
Actually, I think all such classifiers can also be represented by treating
the categorical features as ordinal ones; it is just that the tree needs to
be deeper and the splits are a bit weird. Imagine you want to get category
'c' out of 'a', 'b', 'c', 'd', 'e': you have to threshold between 'b' and
'c' and then between 'c' and 'd', so you get three branches ('a', 'b'),
('c') and ('d', 'e'). If there is no ordering to the variables, that is
really weird.
If you have enough data, it might not make a difference, though - if your
trees are not too deep (and there are not too many of them), you can dump
them using dot.
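
For example - a minimal sketch on the iris data, rendering the output with
graphviz:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import export_graphviz

    X, y = load_iris(return_X_y=True)
    forest = RandomForestClassifier(n_estimators=3, max_depth=3).fit(X, y)

    # one .dot file per tree; render with e.g. `dot -Tpng tree_0.dot -o tree_0.png`
    for i, tree in enumerate(forest.estimators_):
        export_graphviz(tree, out_file="tree_%d.dot" % i)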

I don't have time to look at the documentation now, but maybe we should
clear it up a bit.
Also, maybe we should tell the kaggle folks to add a sentence to their
tutorial.

Cheers,
Andy



Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Gilles Louppe
On 3 June 2013 08:43, Andreas Mueller amuel...@ais.uni-bonn.de wrote:
 On 06/03/2013 05:19 AM, Joel Nothman wrote:

 However, in these last two cases, the number of possible splits at a
 single node is linear in the number of categories. Selecting an
 arbitrary partition allows exponentially many splits with respect to
 the number of categories (though there may be approximations to avoid
 evaluating all possible splits; I'm not familiar with the literature).

  I think the standard split is asking whether a variable is equal to a
  value, i.e. selecting subsets of size 1. That can be emulated with two
  threshold splits, but it leads to a somewhat weird tree.

Yes, CART builds binary decision trees. (The algorithm which splits a
node into as many children as the number of values of the variable is
ID3.)

As introduced by Breiman in his book, for a categorical variable X
taking its values in {1, ..., L}, the strategy is to consider every
subset S \subseteq {1, ..., L} of values of the variable and to pick
the one leading to the largest reduction of impurity. As such, splits
are defined as yes-no questions of the form "is x in S?".

In scikit-learn, we don't implement that. The main reason is that it
blows up computing times: if L is the cardinality of X, then there are
2^L - 1 subsets to consider. The best that you can do with our
implementation is to one-hot encode your categorical variables, which
amounts to selecting subsets of size 1, as Andy said. If you don't
one-hot encode your categorical variables, then you have to be aware
that the construction procedure will implicitly assume that the
categorical values are ordered (which may make no sense depending on
your dataset).
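
To make the blow-up concrete, a small sketch enumerating the distinct
subset splits for L = 4: there are 2^(L-1) - 1 = 7 of them once
mirror-image partitions are merged, and at L = 30 that is already more
than 500 million.

    from itertools import combinations

    values = ["a", "b", "c", "d"]  # L = 4 category levels

    # enumerate every "is x in S?" split; keeping values[0] on the left side
    # avoids counting each partition twice, leaving 2**(L - 1) - 1 candidates
    splits = [set((values[0],) + rest)
              for r in range(len(values))
              for rest in combinations(values[1:], r)]
    splits = splits[:-1]  # drop the trivial split with an empty right side

    print(len(splits))  # 7
    for s in splits:
        print(sorted(s), "vs", sorted(set(values) - s))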

Gilles



Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Andreas Mueller
On 06/03/2013 09:15 AM, Peter Prettenhofer wrote:
 Our decision tree implementation only supports numerical splits; i.e.
 it tests val < threshold.

 Categorical features need to be encoded properly. I recommend one-hot
 encoding for features with small cardinality (e.g. < 50) and ordinal
 encoding (simply assign each category an integer value) for features
 with large cardinality.
This seems to be the opposite of what the kaggle tutorial suggests,
right? They suggest ordinal encoding for small cardinality, but don't
suggest any other way.

Your and Gilles' feedback makes me think we should tell the kaggle people
to change their tutorial.



Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Christian Jauvin
Many thanks to all for your help and detailed answers, I really appreciate it.

So I wanted to test the discussion's takeaway, namely, what Peter
suggested: one-hot encode the categorical features with small
cardinality, and leave the others in their ordinal form.

So from the same dataset I mentioned earlier, I picked another subset
of 5 features, this time all with small cardinality (5, 5, 6, 11 and
12), and all purely categorical (i.e. clearly not ordered). The
one-hot encoding should clearly help with such a configuration.

But again, what I observe when I pit the fully one-hot encoded RF
(21000 x 39) against the ordinal-encoded one (21000 x 5) is that
they're behaving almost the same, in terms of accuracy and AUC, with
10-fold cross-validation. In fact, the ordinal version even seems to
perform very slightly better, although I don't think it's significant.
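
Roughly what I'm doing, sketched on synthetic stand-in data of the same
shape and cardinalities (not the real dataset, so the numbers won't match):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.preprocessing import OneHotEncoder

    rng = np.random.RandomState(0)
    # stand-in for the 21000 x 5 ordinal matrix (cardinalities 5, 5, 6, 11, 12)
    X_ord = np.column_stack([rng.randint(0, k, 21000)
                             for k in (5, 5, 6, 11, 12)])
    # synthetic target tied to category identity, not to the integer codes
    y = (np.isin(X_ord[:, 2], [0, 3])
         ^ np.isin(X_ord[:, 4], [1, 7, 10])).astype(int)
    X_hot = OneHotEncoder().fit_transform(X_ord)  # 21000 x 39, sparse

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    for name, X in (("ordinal", X_ord), ("one-hot", X_hot)):
        scores = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
        print(name, scores.mean())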

I really believe in your expertise more than in my results, so what
could I be doing wrong?



On 3 June 2013 04:56, Andreas Mueller amuel...@ais.uni-bonn.de wrote:
 On 06/03/2013 09:15 AM, Peter Prettenhofer wrote:
  Our decision tree implementation only supports numerical splits; i.e.
  it tests val < threshold.

  Categorical features need to be encoded properly. I recommend one-hot
  encoding for features with small cardinality (e.g. < 50) and ordinal
  encoding (simply assign each category an integer value) for features
  with large cardinality.
 This seems to be the opposite of what the kaggle tutorial suggests,
 right? They suggest ordinal encoding for small cardinality, but don't
 suggest any other way.

 Your and Gilles' feedback makes me think we should tell the kaggle people
 to change their tutorial.




Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Vlad Niculae
I got very good results dating texts by century using random forests on
very few (20-ish) bag-of-words tf-idf features selected by chi2. It
depends on the problem.
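
Roughly this recipe, sketched on public data (20 newsgroups standing in
for the century-dating corpus):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.pipeline import make_pipeline

    data = fetch_20newsgroups(subset="train",
                              categories=["sci.med", "sci.space"])
    pipe = make_pipeline(TfidfVectorizer(),
                         SelectKBest(chi2, k=20),  # keep the 20 best terms
                         RandomForestClassifier(n_estimators=100))
    pipe.fit(data.data, data.target)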

Cheers,
Vlad

On Sat, Jun 1, 2013 at 9:01 PM, Andreas Mueller
amuel...@ais.uni-bonn.de wrote:
 On 06/01/2013 08:30 PM, Christian Jauvin wrote:
 Hi,

 I asked a (perhaps too vague?) question about the use of Random
 Forests with a mix of categorical and lexical features on two ML
 forums (stats.SE and MetaOp), but since it has received no attention,
 I figured that it might work better on this list (I'm using sklearn's
 RF of course):

 I'm working on a binary classification problem for which the dataset
 is mostly composed of categorical features, but also a few lexical
 ones (i.e. article titles and abstracts). I'm experimenting with
 Random Forests, and my current strategy is to build the training set
 by appending the k best lexical features (chosen with univariate
 feature selection, and weighted with tf-idf) to the full set of
 categorical features. This works reasonably well, but as I cannot find
 explicit references to such a strategy of using hybrid features for
 RF, I have doubts about my approach: does it make sense? Am I
 diluting the power of the RF by doing so, and should I rather try to
 combine two classifiers specializing on both types of features?

 I think it is ok, though I think people rarely use RF on bag-of-words
 features.
 Btw, you do encode the categorical variables using one-hot, right?
 The sklearn trees don't really support categorical variables.
 An alternative approach would be to run a linear classifier on all the
 tf-idf features and feed its output, together with the other variables,
 to the RF.

 Hth,
 Andy

 ps: try stackoverflow with scikit-learn tag next time.




Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Christian Jauvin
Hi Andreas,

 Btw, you do encode the categorical variables using one-hot, right?
 The sklearn trees don't really support categorical variables.

I'm rather perplexed by this... I assumed that sklearn's RF only
required its input to be numerical, so I have only used a LabelEncoder
up to now.

My assumption was backed by two external sources of information:

(1) The benchmark code provided by Kaggle in the SO contest (which was
actually the first time I used RFs) didn't seem to perform such a
transformation:

https://github.com/benhamner/Stack-Overflow-Competition/blob/master/features.py

(2) It doesn't seem to be mentioned in this Kaggle tutorial about RFs:

http://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests

Moreover, I just tested it with my own experiment, and I found that a
RF trained on a (21080 x 4) input matrix (i.e. 4 categorical
variables, not one-hot encoded) performs the same (to the third
decimal in accuracy and AUC, with 10-fold CV) as its equivalent,
one-hot encoded (21080 x 1347) matrix.

Sorry if the confusion is on my side, but did I miss something?

Christian



Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Andreas Mueller
On 06/02/2013 10:53 PM, Christian Jauvin wrote:
 Hi Andreas,

 Btw, you do encode the categorical variables using one-hot, right?
 The sklearn trees don't really support categorical variables.
 I'm rather perplexed by this.. I assumed that sklearn's RF only
 required its input to be numerical, so I only used a LabelEncoder up
 to now.
Hum. I have not considered that. Peter? Gilles? Lars? Little help?

Sklearn does not implement any special treatment for categorical variables.
You can feed any float. The question is if it would work / what it does.

I guess you (and kaggle) observed that it does work somewhat; not sure
if it does what you want. The splits will be as for numerical
variables, i.e. val < threshold. If the variables have an ordering (and
LabelEncoder respects that ordering), that makes sense. If the variables
don't have an ordering (which I would assume is the more common case for
categorical variables), I don't think that makes much sense.


My assumption was backed by two external sources of information:

(1) The benchmark code provided by Kaggle in the SO contest (which was
actually the first time I used RFs) didn't seem to perform such a
transformation:

https://github.com/benhamner/Stack-Overflow-Competition/blob/master/features.py

I don't see where categorical variables are used in this code. Could you 
please point it out?


 (2) It doesn't seem to be mentioned in this Kaggle tutorial about RFs:

 http://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests
I am not that experienced with categorical variables. The catch here
seems to be "not too many values". Maybe it works for few values, but it
is not what I would expect a random forest implementation to do on
categorical variables.

I think it is rather bad that the tutorial doesn't mention one-hot
encoding if it is using sklearn.
It is somewhat trivial to perform the usual categorical tests. They are
not implemented in sklearn, though, as there is no obvious way to declare
a column a categorical variable (you would need an auxiliary array, and
no one has done this yet).

 Moreoever, I just tested it with my own experiment, and I found that a
 RF trained on a (21080 x 4) input matrix (i.e. 4 categorical
 variables, non-one-hot encoded) performs the same (to the third
 decimal in accuracy and AUC, with 10-fold CV) as with its equivalent,
 one-hot encoded (21080 x 1347) matrix.
Not sure what this says about your dataset / features.
If the variables don't have any ordering and the splits take arbitrary 
subsets, that would seem a bit weird to me.

 Sorry if the confusion is on my side, but did I miss something?
Maybe I'm just not well-versed enough in the use of numerically encoded 
categorical variables in random forests.

Cheers,
Andy



Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Christian Jauvin
 Sklearn does not implement any special treatment for categorical variables.
 You can feed any float. The question is if it would work / what it does.

I think I'm confused about a couple of aspects (that's what happens I
guess when you play with algorithms for which you don't have a
complete and firm understanding beforehand!). I assumed that
sklearn-RF's requirement for numerical inputs was just a data
representation/implementation aspect, and that once properly
transformed (i.e. using a LabelEncoder), it wouldn't matter, under the
hood, whether a predictor was categorical or numerical.

Now if I understand you well, sklearn shouldn't be able to explicitly
handle the categorical case where no order exists (i.e. categorical,
as opposed to ordinal).

But you seem to also imply that sklearn can indirectly support it
using dummy variables...

Bigger question: given that Decision Trees (in general) support pure
categorical variables... shouldn't Random Forests as well?

https://github.com/benhamner/Stack-Overflow-Competition/blob/master/features.py
 I don't see where categorical variables are used in this code. Could you
 please point it out?

You're right, my bad: those are not categorical predictors.

 Not sure what this says about your dataset / features.
 If the variables don't have any ordering and the splits take arbitrary
 subsets, that would seem a bit weird to me.

In fact that's really what I observe: apart from the first of my 4
variables, which is a year, the remaining 3 are purely categorical,
with no implicit order. So that result is weird because it is not in
line with what you've been saying.

Anyway, thanks for your time and patience,

Christian



Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Joel Nothman
On Mon, Jun 3, 2013 at 12:41 PM, Christian Jauvin cjau...@gmail.com wrote:

  Sklearn does not implement any special treatment for categorical
 variables.
  You can feed any float. The question is if it would work / what it does.

 I think I'm confused about a couple of aspects (that's what happens I
 guess when you play with algorithms for which you don't have a
 complete and firm understanding beforehand!). I assumed that
 sklearn-RF's requirement for numerical inputs was just a data
 representation/implementation aspect, and that once properly
 transformed (i.e. using a LabelEncoder), it wouldn't matter, under the
 hood, whether a predictor was categorical or numerical.

 Now if I understand you well, sklearn shouldn't be able to explicitly
 handle the categorical case where no order exists (i.e. categorical,
 as opposed to ordinal).


It comes down to what sort of decision can be made at each node.
scikit-learn always uses decisions of the form (x < t) for some feature
value x and some threshold t.

Let's make this more concrete: you have a feature with possible values {A,
B, C, D}.

Ideal categorical treatment might partition a set of categories indicated
by variable x so that each partition corresponds to a different child in
the decision tree. So possible decisions would distinguish {A} from {B, C,
D}; {B} from {A, C, D}; {C} from {A, B, D}; {D} from {A, B, C}; {A, B} from
{C, D}; {A, C} from {B, D}; {A, D} from {B, C}. Scikit-learn can't make
these sorts of splits...

LabelEncoder will turn these into [0, 1, 2, 3]. Then only splits respecting
the ordering are possible. So a single split can distinguish {A} from {B,
C, D}; {A, B} from {C, D}; and {A, B, C} from {D}.

LabelBinarizer will allow a single split to distinguish any one category
from all others: {A} from {B, C, D}; {B} from {A, C, D}; {C} from {A, B,
D}; {D} from {A, B, C}.
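
Concretely - LabelEncoder and LabelBinarizer were the usual tools at the
time; today's OrdinalEncoder and OneHotEncoder behave analogously for
feature columns:

    from sklearn.preprocessing import LabelBinarizer, LabelEncoder

    x = ["A", "B", "C", "D", "B", "A"]

    print(LabelEncoder().fit_transform(x))
    # [0 1 2 3 1 0] -> one column; threshold splits must respect this order

    print(LabelBinarizer().fit_transform(x))
    # one indicator column per category; a split on any single column
    # isolates exactly that category from all the others:
    # [[1 0 0 0]
    #  [0 1 0 0]
    #  [0 0 1 0]
    #  [0 0 0 1]
    #  [0 1 0 0]
    #  [1 0 0 0]]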

Note that all these trees can represent the same hypothesis space, it just
might require a deeper tree to represent the same thing (and the learning
process can't take advantage of similar categories).

However, in these last two cases, the number of possible splits at a single
node is linear in the number of categories. Selecting an arbitrary
partition allows exponentially many splits with respect to the number of
categories (though there may be approximations to avoid evaluating all
possible splits; I'm not familiar with the literature).

So it should be quite clear that binarized categories allow the most
meaningful decisions with the least complexity.

Cheers,

- Joel


Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-01 Thread Philipp Singer
Hi Christian,

Some time ago I had similar problems, i.e., I wanted to use additional
features alongside my lexical features, and simple concatenation didn't
work that well for me, even though both feature sets on their own
performed pretty well.

You can follow the discussion about my problem here [1] if you scroll
down - ignore the starting discussion. The best solution I ended up with
was the one suggested by Olivier: you basically train a linear classifier
on your lexical features and then use the predict_proba outcome, together
with your additional categorical features, for training a second
classifier - for example random forests. It was also helpful to perform
leave-one-out when training the probabilities (if you have few samples).

[1]
http://sourceforge.net/mailarchive/forum.php?thread_name=CAFvE7K5F2BJ_ms51a-61HwmNrAyRTb1W0KK7ziBPzGAcdiBRqQ%40mail.gmail.com&forum_name=scikit-learn-general
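
A minimal sketch of that recipe on made-up stand-in data, using today's
cross_val_predict to get the out-of-fold probabilities (back then you had
to write the loop yourself; pass cv=LeaveOneOut() for the leave-one-out
variant):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    # stand-in data: documents, encoded categorical features, binary target
    texts = ["good paper", "bad title", "great abstract", "poor study"] * 50
    X_cat = np.random.RandomState(0).randint(0, 5, size=(len(texts), 3))
    y = np.array([1, 0, 1, 0] * 50)

    X_text = TfidfVectorizer().fit_transform(texts)

    # out-of-fold probabilities, so the second model never sees predictions
    # made on its own training points
    proba = cross_val_predict(LogisticRegression(), X_text, y,
                              cv=10, method="predict_proba")[:, 1]

    X_stacked = np.column_stack([X_cat, proba])
    rf = RandomForestClassifier(n_estimators=100).fit(X_stacked, y)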

If you find out anything else, let us know ;)

Regards,
Philipp

On 01.06.2013 20:30, Christian Jauvin wrote:
 Hi,

 I asked a (perhaps too vague?) question about the use of Random
 Forests with a mix of categorical and lexical features on two ML
 forums (stats.SE and MetaOp), but since it has received no attention,
 I figured that it might work better on this list (I'm using sklearn's
 RF of course):

 I'm working on a binary classification problem for which the dataset
 is mostly composed of categorical features, but also a few lexical
 ones (i.e. article titles and abstracts). I'm experimenting with
 Random Forests, and my current strategy is to build the training set
 by appending the k best lexical features (chosen with univariate
 feature selection, and weighted with tf-idf) to the full set of
 categorical features. This works reasonably well, but as I cannot find
 explicit references to such a strategy of using hybrid features for
 RF, I have doubts about my approach: does it make sense? Am I
 diluting the power of the RF by doing so, and should I rather try to
 combine two classifiers specializing on both types of features?

 http://stats.stackexchange.com/questions/60162/random-forest-with-a-mix-of-categorical-and-lexical-features

 Thanks,

 Christian





Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-01 Thread Andreas Mueller
On 06/01/2013 08:30 PM, Christian Jauvin wrote:
 Hi,

 I asked a (perhaps too vague?) question about the use of Random
 Forests with a mix of categorical and lexical features on two ML
 forums (stats.SE and MetaOp), but since it has received no attention,
 I figured that it might work better on this list (I'm using sklearn's
 RF of course):

 I'm working on a binary classification problem for which the dataset
 is mostly composed of categorical features, but also a few lexical
 ones (i.e. article titles and abstracts). I'm experimenting with
 Random Forests, and my current strategy is to build the training set
 by appending the k best lexical features (chosen with univariate
 feature selection, and weighted with tf-idf) to the full set of
 categorical features. This works reasonably well, but as I cannot find
 explicit references to such a strategy of using hybrid features for
 RF, I have doubts about my approach: does it make sense? Am I
 diluting the power of the RF by doing so, and should I rather try to
 combine two classifiers specializing on both types of features?

I think it is ok, though I think people rarely use RF on bag-of-words
features.
Btw, you do encode the categorical variables using one-hot, right?
The sklearn trees don't really support categorical variables.
An alternative approach would be to run a linear classifier on all the
tf-idf features and feed its output, together with the other variables,
to the RF.
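
Your original hybrid setup (k best tf-idf terms appended to the
categoricals) would look roughly like this - a sketch on made-up stand-in
data, with hypothetical names throughout:

    import numpy as np
    from scipy.sparse import hstack
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.preprocessing import OneHotEncoder

    # stand-in data: documents, categorical columns, binary target
    texts = ["title one", "another abstract", "more text", "last item"] * 50
    X_cat = np.random.RandomState(0).randint(0, 4, size=(len(texts), 3))
    y = np.array([0, 1, 1, 0] * 50)

    X_tfidf = TfidfVectorizer().fit_transform(texts)
    X_best = SelectKBest(chi2, k=5).fit_transform(X_tfidf, y)  # k best terms
    X_hot = OneHotEncoder().fit_transform(X_cat)

    # append the selected lexical features to the one-hot categorical ones
    X = hstack([X_hot, X_best]).tocsr()
    rf = RandomForestClassifier(n_estimators=100).fit(X, y)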

Hth,
Andy

ps: try stackoverflow with scikit-learn tag next time.
