Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2014-11-17 Thread Alexander Hawk
Perhaps you have become aware of this by now, but only K-1 subset tests are needed to find the best categorical split, not 2^(K-1)-1. This was a central result proved in Breiman's book.

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2014-11-17 Thread Manish Amde
+1 Just wanted to point out that the K-1 subset proof is only true for binary classification. Such heuristics do perform reasonably for the multiclass classification criterion though. On Monday, November 17, 2014, Alexander Hawk tomahawkb...@gmail.com wrote: Perhaps you have become aware of
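
A minimal sketch of the K-1 claim for the binary case, assuming Gini impurity as the split criterion (the messages above do not name one). Categories are sorted by their fraction of positive samples, and only the K-1 prefix splits of that ordering are scored; the result is compared against a brute-force search over all 2^(K-1)-1 partitions. The `counts` data and all function names are hypothetical, made up for illustration.

```python
from itertools import combinations


def gini(pos, total):
    """Binary Gini impurity 2p(1-p) of a node with `pos` positives out of `total`."""
    if total == 0:
        return 0.0
    p = pos / total
    return 2.0 * p * (1.0 - p)


def weighted_impurity(counts, left):
    """Weighted Gini impurity after sending the categories in `left` to the left child."""
    n_total = sum(total for _, total in counts.values())
    n_pos = sum(pos for pos, _ in counts.values())
    left_pos = sum(counts[c][0] for c in left)
    left_total = sum(counts[c][1] for c in left)
    right_pos = n_pos - left_pos
    right_total = n_total - left_total
    return (left_total * gini(left_pos, left_total)
            + right_total * gini(right_pos, right_total)) / n_total


def best_exhaustive(counts):
    """Score every two-way partition of the categories: 2^(K-1)-1 candidates."""
    cats = sorted(counts)
    first, rest = cats[0], cats[1:]
    # Pin `first` to the left child so mirror-image partitions are not counted twice;
    # r < len(rest) keeps the right child non-empty.
    candidates = [(first,) + extra
                  for r in range(len(rest))
                  for extra in combinations(rest, r)]
    best = min(weighted_impurity(counts, left) for left in candidates)
    return best, len(candidates)


def best_ordered(counts):
    """Score only prefix splits of the categories sorted by positive rate: K-1 candidates."""
    ordered = sorted(counts, key=lambda c: counts[c][0] / counts[c][1])
    candidates = [tuple(ordered[:i]) for i in range(1, len(ordered))]
    best = min(weighted_impurity(counts, left) for left in candidates)
    return best, len(candidates)


# Hypothetical toy data: category -> (positive samples, total samples).
counts = {"a": (1, 10), "b": (8, 10), "c": (4, 10), "d": (9, 10), "e": (2, 10)}

exhaustive_score, n_exhaustive = best_exhaustive(counts)
ordered_score, n_ordered = best_ordered(counts)
print(n_exhaustive, n_ordered, exhaustive_score, ordered_score)
```

With K = 5 categories the brute-force search scores 15 partitions and the ordered search scores 4, and both arrive at the same best impurity. As noted in the reply above, this guarantee covers binary classification only; with more than two classes the ordering trick is a heuristic.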

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-06 Thread Juan Nunez-Iglesias
On Tue, Jun 4, 2013 at 8:16 PM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: I believe more in my results than in my expertise - and so should you :-) +1! There's very very few examples of theory trumping data in history... And a bajillion of the converse. I also think Joel put

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-06 Thread Christian Jauvin
I believe more in my results than in my expertise - and so should you :-) +1! There's very very few examples of theory trumping data in history... And a bajillion of the converse. I guess I didn't express myself clearly: I didn't mean to say that I mistrust my results per se.. I'm not that

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-04 Thread Andreas Mueller
On 06/04/2013 05:55 AM, Christian Jauvin wrote: Many thanks to all for your help and detailed answers, I really appreciate it. So I wanted to test the discussion's takeaway, namely, what Peter suggested: one-hot encode the categorical features with small cardinality, and leave the others in

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-04 Thread Peter Prettenhofer
Hi Christian, I believe more in my results than in my expertise - and so should you :-) I think you misunderstood me: I did not claim that one-hot encoded categorical features give better results than ordinal encoded ones - I just claimed that ordinal encoding works as well as one-hot encoded

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Andreas Mueller
On 06/03/2013 05:19 AM, Joel Nothman wrote: However, in these last two cases, the number of possible splits at a single node is linear in the number of categories. Selecting an arbitrary partition allows exponentially many splits with respect to the number of categories (though there may
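
A rough sketch of the counting argument in the quoted text, assuming the "last two cases" are ordinal (integer) encoding and one-hot encoding; the truncated snippet does not say so explicitly. Each function lists the sets of categories that a single node could send to its left child. All names here are hypothetical.

```python
from itertools import combinations


def ordinal_splits(categories):
    """Threshold tests on integer codes: the left child is always a prefix of the ordering."""
    return [frozenset(categories[:i]) for i in range(1, len(categories))]


def one_hot_splits(categories):
    """One indicator column per category: a single category versus all the others."""
    return [frozenset([c]) for c in categories]


def arbitrary_splits(categories):
    """Every two-way partition, with the first category pinned to the left child."""
    first, rest = categories[0], categories[1:]
    return [frozenset((first,) + extra)
            for r in range(len(rest))
            for extra in combinations(rest, r)]


cats = ("a", "b", "c", "d", "e")
n_ordinal = len(ordinal_splits(cats))      # K - 1
n_one_hot = len(one_hot_splits(cats))      # K
n_arbitrary = len(arbitrary_splits(cats))  # 2^(K-1) - 1
print(n_ordinal, n_one_hot, n_arbitrary)
```

For five categories this gives 4, 5 and 15 candidate splits respectively; the first two grow linearly with the number of categories and the third exponentially.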

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Andreas Mueller
On 06/03/2013 04:41 AM, Christian Jauvin wrote: Sklearn does not implement any special treatment for categorical variables. You can feed any float. The question is if it would work / what it does. I think I'm confused about a couple of aspects (that's what happens I guess when you play with

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Gilles Louppe
On 3 June 2013 08:43, Andreas Mueller amuel...@ais.uni-bonn.de wrote: On 06/03/2013 05:19 AM, Joel Nothman wrote: However, in these last two cases, the number of possible splits at a single node is linear in the number of categories. Selecting an arbitrary partition allows exponentially many

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Andreas Mueller
On 06/03/2013 09:15 AM, Peter Prettenhofer wrote: Our decision tree implementation only supports numerical splits; i.e. if tests val < threshold. Categorical features need to be encoded properly. I recommend one-hot encoding for features with small cardinality (e.g. < 50) and ordinal
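
The recommendation quoted above can be sketched in plain Python as follows: one-hot encode columns with few distinct values and integer-code the rest. This is an illustration under stated assumptions, not scikit-learn's own encoders. The function name `encode_mixed`, the parameter `max_onehot_cardinality`, the generated column names, and the choice of a strict "fewer than" cutoff are all hypothetical; the default of 50 is the figure mentioned in the message.

```python
def encode_mixed(rows, max_onehot_cardinality=50):
    """Encode categorical columns as floats, choosing the scheme by cardinality.

    rows: list of equal-length tuples of hashable, sortable category values.
    Returns (matrix, names), where matrix is a list of float rows.
    """
    n_cols = len(rows[0])
    encoded_columns = []
    names = []
    for j in range(n_cols):
        values = [row[j] for row in rows]
        categories = sorted(set(values))
        if len(categories) < max_onehot_cardinality:
            # Small cardinality: one 0/1 indicator column per category.
            for cat in categories:
                encoded_columns.append([1.0 if v == cat else 0.0 for v in values])
                names.append(f"col{j}={cat}")
        else:
            # Large cardinality: a single column of integer codes (arbitrary order).
            code = {cat: float(i) for i, cat in enumerate(categories)}
            encoded_columns.append([code[v] for v in values])
            names.append(f"col{j}")
    # Transpose the column lists back into rows.
    matrix = [list(row) for row in zip(*encoded_columns)]
    return matrix, names


# Hypothetical toy data with a deliberately tiny cutoff so both branches are exercised.
rows = [("red", "u1"), ("blue", "u2"), ("red", "u3")]
matrix, names = encode_mixed(rows, max_onehot_cardinality=3)
print(names)
print(matrix)
```

The resulting float matrix is the kind of input a tree that only performs numerical threshold tests can consume. The integer codes impose an arbitrary order on the high-cardinality column, which is the trade-off discussed elsewhere in the thread.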

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Christian Jauvin
Many thanks to all for your help and detailed answers, I really appreciate it. So I wanted to test the discussion's takeaway, namely, what Peter suggested: one-hot encode the categorical features with small cardinality, and leave the others in their ordinal form. So from the same dataset I

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Vlad Niculae
I got very good results on text century dating using random forests on very few (20-ish) bag-of-words tf-idf features selected by chi2. It depends on the problem. Cheers, Vlad On Sat, Jun 1, 2013 at 9:01 PM, Andreas Mueller amuel...@ais.uni-bonn.de wrote: On 06/01/2013 08:30 PM, Christian

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Christian Jauvin
Hi Andreas, Btw, you do encode the categorical variables using one-hot, right? The sklearn trees don't really support categorical variables. I'm rather perplexed by this.. I assumed that sklearn's RF only required its input to be numerical, so I only used a LabelEncoder up to now. My

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Andreas Mueller
On 06/02/2013 10:53 PM, Christian Jauvin wrote: Hi Andreas, Btw, you do encode the categorical variables using one-hot, right? The sklearn trees don't really support categorical variables. I'm rather perplexed by this.. I assumed that sklearn's RF only required its input to be numerical, so

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Christian Jauvin
Sklearn does not implement any special treatment for categorical variables. You can feed any float. The question is if it would work / what it does. I think I'm confused about a couple of aspects (that's what happens I guess when you play with algorithms for which you don't have a complete and

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Joel Nothman
On Mon, Jun 3, 2013 at 12:41 PM, Christian Jauvin cjau...@gmail.com wrote: Sklearn does not implement any special treatment for categorical variables. You can feed any float. The question is if it would work / what it does. I think I'm confused about a couple of aspects (that's what

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-01 Thread Philipp Singer
Hi Christian, Some time ago I had similar problems. I.e., I wanted to use additional features alongside my lexical features, and simple concatenation didn't work that well for me even though both feature sets on their own performed pretty well. You can follow the discussion about my problem here [1]

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-01 Thread Andreas Mueller
On 06/01/2013 08:30 PM, Christian Jauvin wrote: Hi, I asked a (perhaps too vague?) question about the use of Random Forests with a mix of categorical and lexical features on two ML forums (stats.SE and MetaOp), but since it has received no attention, I figured that it might work better on