Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
+1. Just wanted to point out that the K-1 subset proof only holds for binary classification. Such heuristics do perform reasonably well for multiclass classification criteria, though.

On Monday, November 17, 2014, Alexander Hawk wrote:
> Perhaps you have become aware of this by now,
> but only K-1 subset tests are needed to find the best
> categorical split, not 2^(K-1)-1. This was a central
> result proved in Breiman's book.
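For the binary case, the quoted result reduces the search from 2^(K-1)-1 subsets to K-1 ordered splits: sort the categories by their mean response, and only contiguous splits along that ordering need to be scored. A toy sketch of the idea (illustrative only - the function name and the choice of Gini impurity are mine, and this is not scikit-learn's internal code):

import numpy as np

def best_categorical_split(x, y):
    """Breiman's trick for a binary target y and a categorical x."""
    cats = np.unique(x)
    # Order the categories by their fraction of positive labels.
    order = cats[np.argsort([y[x == c].mean() for c in cats])]

    def gini(t):
        p = t.mean()
        return 2.0 * p * (1.0 - p)

    best_score, best_subset = np.inf, None
    for i in range(1, len(order)):  # only K - 1 candidate splits
        left = np.isin(x, order[:i])
        score = (left.sum() * gini(y[left])
                 + (~left).sum() * gini(y[~left])) / len(y)
        if score < best_score:
            best_score, best_subset = score, set(order[:i])
    return best_subset, best_score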
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
Perhaps you have become aware of this by now, but only K-1 subset tests are needed to find the best categorical split, not 2^(K-1)-1. This was a central result proved in Breiman's book.
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
>> I believe more in my results than in my expertise - and so should you :-)
>
> +1! There are very, very few examples of theory trumping data in history... and a bajillion of the converse.

I guess I didn't express myself clearly: I didn't mean to say that I mistrust my results per se; I'm not that much into skepticism! What I meant, rather, is that when I'm experimenting with something new (to me) and observe something weird or not in line with what I expect, my a priori belief is that I most likely made a mistake, rather than discovered some previously unnoticed flaw.
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On Tue, Jun 4, 2013 at 8:16 PM, Peter Prettenhofer <peter.prettenho...@gmail.com> wrote:
> I believe more in my results than in my expertise - and so should you :-) **

+1! There are very, very few examples of theory trumping data in history... and a bajillion of the converse.

I also think Joel put it quite nicely with "all these trees can represent the same hypothesis space, it just might require a deeper tree to represent the same thing". Christian's results seem in no way contradictory to me, just pleasantly surprising.
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
Hi Christian,

> I believe more in my results than in my expertise - and so should you :-) **

I think you misunderstood me: I did not claim that one-hot encoded categorical features give better results than ordinally encoded ones. I just claimed that ordinal encoding works as well as one-hot encoding, given that you have deep enough trees. But I have to warn you: I cannot support my claim with (sufficient) data. So at the end of the day, it's always best to run an experiment and test on the problem at hand.

Anyway, I cannot really see your problem (or what you did "wrong"): according to your description, it seems that the specific encoding (one-hot vs. ordinal) has no influence on the effectiveness of the model (no significant difference)? This is in line with observations by others. Andy raised a very important point, though: if you optimized your hyperparameters (tree depth, min split size, ...) on the ordinal encoding and then tested those hyperparameters on a one-hot encoding, you gave an advantage to the ordinal encoding.

HTH,
Peter

** that being said, I'm still quite skeptical when it comes to my results

2013/6/4 Christian Jauvin:
> Many thanks to all for your help and detailed answers, I really appreciate it.
>
> So I wanted to test the discussion's takeaway, namely, what Peter
> suggested: one-hot encode the categorical features with small
> cardinality, and leave the others in their ordinal form.
>
> So from the same dataset I mentioned earlier, I picked another subset
> of 5 features, this time all with small cardinality (5, 5, 6, 11 and
> 12), and all purely categorical (i.e. clearly not ordered). The
> one-hot encoding should clearly help with such a configuration.
>
> But again, what I observe when I pit the fully one-hot encoded RF
> (21000 x 39) against the ordinal-encoded one (21000 x 5) is that
> they're behaving almost the same, in terms of accuracy and AUC, with
> 10-fold cross-validation. In fact, the ordinal version even seems to
> perform very slightly better, although I don't think it's significant.
>
> I really believe in your expertise more than in my results, so what
> could I be doing wrong?
>
> On 3 June 2013 04:56, Andreas Mueller wrote:
> > On 06/03/2013 09:15 AM, Peter Prettenhofer wrote:
> >> Our decision tree implementation only supports numerical splits, i.e.
> >> if-tests of the form val < threshold.
> >>
> >> Categorical features need to be encoded properly. I recommend one-hot
> >> encoding for features with small cardinality (e.g. < 50) and ordinal
> >> encoding (simply assign each category an integer value) for features
> >> with large cardinality.
> > This seems to be the opposite of what the Kaggle tutorial suggests,
> > right? They suggest ordinal encoding for small cardinality, but don't
> > suggest any other way.
> >
> > Your and Gilles' feedback makes me think we should tell the Kaggle
> > people to change their tutorial.

--
Peter Prettenhofer
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On 06/04/2013 05:55 AM, Christian Jauvin wrote:
> Many thanks to all for your help and detailed answers, I really appreciate it.
>
> So I wanted to test the discussion's takeaway, namely, what Peter
> suggested: one-hot encode the categorical features with small
> cardinality, and leave the others in their ordinal form.
>
> So from the same dataset I mentioned earlier, I picked another subset
> of 5 features, this time all with small cardinality (5, 5, 6, 11 and
> 12), and all purely categorical (i.e. clearly not ordered). The
> one-hot encoding should clearly help with such a configuration.
>
> But again, what I observe when I pit the fully one-hot encoded RF
> (21000 x 39) against the ordinal-encoded one (21000 x 5) is that
> they're behaving almost the same, in terms of accuracy and AUC, with
> 10-fold cross-validation. In fact, the ordinal version even seems to
> perform very slightly better, although I don't think it's significant.
>
> I really believe in your expertise more than in my results, so what
> could I be doing wrong?

Did you grid-search the parameters again?
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
Many thanks to all for your help and detailed answers, I really appreciate it.

So I wanted to test the discussion's takeaway, namely, what Peter suggested: one-hot encode the categorical features with small cardinality, and leave the others in their ordinal form.

So from the same dataset I mentioned earlier, I picked another subset of 5 features, this time all with small cardinality (5, 5, 6, 11 and 12), and all purely categorical (i.e. clearly not ordered). The one-hot encoding should clearly help with such a configuration.

But again, what I observe when I pit the fully one-hot encoded RF (21000 x 39) against the ordinal-encoded one (21000 x 5) is that they're behaving almost the same, in terms of accuracy and AUC, with 10-fold cross-validation. In fact, the ordinal version even seems to perform very slightly better, although I don't think it's significant.

I really believe in your expertise more than in my results, so what could I be doing wrong?

On 3 June 2013 04:56, Andreas Mueller wrote:
> On 06/03/2013 09:15 AM, Peter Prettenhofer wrote:
>> Our decision tree implementation only supports numerical splits, i.e.
>> if-tests of the form val < threshold.
>>
>> Categorical features need to be encoded properly. I recommend one-hot
>> encoding for features with small cardinality (e.g. < 50) and ordinal
>> encoding (simply assign each category an integer value) for features
>> with large cardinality.
> This seems to be the opposite of what the Kaggle tutorial suggests,
> right? They suggest ordinal encoding for small cardinality, but don't
> suggest any other way.
>
> Your and Gilles' feedback makes me think we should tell the Kaggle
> people to change their tutorial.
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On 06/03/2013 09:15 AM, Peter Prettenhofer wrote:
> Our decision tree implementation only supports numerical splits, i.e.
> if-tests of the form val < threshold.
>
> Categorical features need to be encoded properly. I recommend one-hot
> encoding for features with small cardinality (e.g. < 50) and ordinal
> encoding (simply assign each category an integer value) for features
> with large cardinality.

This seems to be the opposite of what the Kaggle tutorial suggests, right? They suggest ordinal encoding for small cardinality, but don't suggest any other way.

Your and Gilles' feedback makes me think we should tell the Kaggle people to change their tutorial.
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
Our decision tree implementation only supports numerical splits, i.e. if-tests of the form val < threshold.

Categorical features need to be encoded properly. I recommend one-hot encoding for features with small cardinality (e.g. < 50) and ordinal encoding (simply assign each category an integer value) for features with large cardinality. Sufficiently deep decision trees will handle ordinally encoded categorical features nicely; the same holds for boosting models with a sufficient number of trees (see [1]). Furthermore, ordinal encoding might be more efficient than one-hot encoding, since fewer features need to be searched. One-hot encoding, on the other hand, plays much more nicely with model interpretation.

Regarding split tests for categorical variables, there are two types of tests I'm aware of: the equality test (val == cat) and the subset test (val in {cat-subset}). While the latter sounds more powerful, it has to be considered harmful: subset tests give rise to 2^(K-1) - 1 potential splits per categorical feature, whereas a numerical feature has only N - 1 potential split points (where N is the number of examples and K is the cardinality of the categorical feature). A large number of potential split points can lead to severe overfitting (you will most likely find a subset that perfectly separates your data). AFAIK, R's random forest package only supports subset tests, so it might in fact be advantageous to use ordinal encoding there too when your features have large cardinality.

HTH,
Peter

[1] http://www.salford-systems.com/en/blog/dan-steinberg/item/15-modeling-tricks-with-treenet-treating-categorical-variables-as-continuous

PS: regarding the Kaggle tutorial - they most likely were not aware of that.

2013/6/3 Andreas Mueller:
> On 06/03/2013 04:41 AM, Christian Jauvin wrote:
> >> Sklearn does not implement any special treatment for categorical
> >> variables. You can feed it any float. The question is whether it
> >> would work / what it does.
> > I think I'm confused about a couple of aspects (that's what happens I
> > guess when you play with algorithms for which you don't have a
> > complete and firm understanding beforehand!). I assumed that
> > sklearn-RF's requirement for numerical inputs was just a data
> > representation/implementation aspect, and that once properly
> > transformed (i.e. using a LabelEncoder), it wouldn't matter, under the
> > hood, whether a predictor was categorical or numerical.
> >
> > Now if I understand you well, sklearn shouldn't be able to explicitly
> > handle the categorical case where no order exists (i.e. categorical,
> > as opposed to ordinal).
> Yes. At least the splitting criterion is not the one usually used.
> >
> > But you seem to also imply that sklearn can indirectly support it
> > using dummy variables..
> Yes.
> >
> > Bigger question: given that Decision Trees (in general) support pure
> > categorical variables.. shouldn't Random Forests do so as well?
> >
> As I said, trees in sklearn don't. But that is a purely implementation /
> API problem.
> >
> >> Not sure what this says about your dataset / features.
> >> If the variables don't have any ordering and the splits take arbitrary
> >> subsets, that would seem a bit weird to me.
> > In fact that's really what I observe: apart from the first of my 4
> > variables, which is a year, the remaining 3 are purely categorical,
> > with no implicit order. So that result is weird because it is not in
> > line with what you've been saying.
> Actually, I think all these classifiers can also be represented by
> treating the categorical features as ordinal ones; it is just that the
> tree needs to be deeper and the splits are a bit weird. Imagine you want
> to get category c out of a, b, c, d, e: you have to threshold between b
> and c and then between c and d, so you get three branches ('a', 'b'),
> ('c'), ('d', 'e'). If there is no ordering to the variables, that is
> really weird. If you have enough data, it might not make a difference,
> though - if your trees are not too deep (or too many), you can dump
> them using dot.
>
> I don't have time to look at the documentation now, but maybe we should
> clear it up a bit. Also, maybe we should tell the Kaggle folks to add a
> sentence to their tutorial.
>
> Cheers,
> Andy

--
Peter Prettenhofer
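A minimal sketch of the encoding recipe above, assuming pandas (the toy columns and the cutoff value are placeholders; in practice the "< 50" rule of thumb would replace CUTOFF):

import pandas as pd

# Hypothetical data: one low-cardinality and one high-cardinality column.
df = pd.DataFrame({"color": ["red", "green", "red", "green"],
                   "zip": ["10115", "80331", "50667", "20095"]})

CUTOFF = 3  # stand-in for the suggested cardinality threshold

parts = []
for col in df.columns:
    if df[col].nunique() < CUTOFF:
        # Small cardinality: one-hot encode.
        parts.append(pd.get_dummies(df[col], prefix=col))
    else:
        # Large cardinality: ordinal encode (one integer per category).
        parts.append(df[col].astype("category").cat.codes.rename(col))

X = pd.concat(parts, axis=1)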
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On 3 June 2013 08:43, Andreas Mueller wrote:
> On 06/03/2013 05:19 AM, Joel Nothman wrote:
>> However, in these last two cases, the number of possible splits at a
>> single node is linear in the number of categories. Selecting an
>> arbitrary partition allows exponentially many splits with respect to
>> the number of categories (though there may be approximations to avoid
>> evaluating all possible splits; I'm not familiar with the literature).
>>
> I think the standard split is asking whether a variable is equal to a
> value, i.e. selecting subsets of size 1. That is possible to do with two
> thresholds, but leads to a weird tree in a way.

Yes, CART builds binary decision trees. (The algorithm which splits a node into as many children as the number of values of the variable is ID3.)

As introduced by Breiman in his book, for a categorical variable X taking its value in {1, ..., L}, the strategy is to consider every subset S \subseteq {1, ..., L} of values of the variable and to pick the one leading to the largest reduction of impurity. As such, splits are defined as yes-no questions of the form "is x in S?".

In scikit-learn, we don't implement that. The main reason is that it blows up computing time: if L is the cardinality of X, then there are 2^(L-1) - 1 subsets to consider (up to complementation). The best you can do with our implementation is to one-hot encode your categorical variables, which amounts to selecting subsets of size 1, as Andy said.

If you don't one-hot encode your categorical variables, then you have to be aware that the construction procedure will implicitly assume that the categorical values are ordered (which may make no sense depending on your dataset).

Gilles
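To see the blow-up concretely, a quick enumeration in plain Python (since "is x in S?" and "is x in the complement of S?" induce the same partition, one value is pinned to the left side to avoid double counting):

from itertools import combinations

values = ["a", "b", "c", "d", "e"]  # L = 5 category values
L = len(values)

# Pin values[0] to the left side; any strict subset of the remaining
# values may join it, giving every distinct non-trivial binary split.
splits = [{values[0], *rest}
          for r in range(L - 1)
          for rest in combinations(values[1:], r)]

print(len(splits), 2 ** (L - 1) - 1)  # 15 15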
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On 06/03/2013 04:41 AM, Christian Jauvin wrote:
>> Sklearn does not implement any special treatment for categorical variables.
>> You can feed it any float. The question is whether it would work / what it does.
> I think I'm confused about a couple of aspects (that's what happens I
> guess when you play with algorithms for which you don't have a
> complete and firm understanding beforehand!). I assumed that
> sklearn-RF's requirement for numerical inputs was just a data
> representation/implementation aspect, and that once properly
> transformed (i.e. using a LabelEncoder), it wouldn't matter, under the
> hood, whether a predictor was categorical or numerical.
>
> Now if I understand you well, sklearn shouldn't be able to explicitly
> handle the categorical case where no order exists (i.e. categorical,
> as opposed to ordinal).

Yes. At least the splitting criterion is not the one usually used.

> But you seem to also imply that sklearn can indirectly support it
> using dummy variables..

Yes.

> Bigger question: given that Decision Trees (in general) support pure
> categorical variables.. shouldn't Random Forests do so as well?

As I said, trees in sklearn don't. But that is a purely implementation / API problem.

>> Not sure what this says about your dataset / features.
>> If the variables don't have any ordering and the splits take arbitrary
>> subsets, that would seem a bit weird to me.
> In fact that's really what I observe: apart from the first of my 4
> variables, which is a year, the remaining 3 are purely categorical,
> with no implicit order. So that result is weird because it is not in
> line with what you've been saying.

Actually, I think all these classifiers can also be represented by treating the categorical features as ordinal ones; it is just that the tree needs to be deeper and the splits are a bit weird. Imagine you want to get category c out of a, b, c, d, e: you have to threshold between b and c and then between c and d, so you get three branches ('a', 'b'), ('c'), ('d', 'e'). If there is no ordering to the variables, that is really weird. If you have enough data, it might not make a difference, though - if your trees are not too deep (or too many), you can dump them using dot.

I don't have time to look at the documentation now, but maybe we should clear it up a bit. Also, maybe we should tell the Kaggle folks to add a sentence to their tutorial.

Cheers,
Andy
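This is easy to reproduce with scikit-learn itself (toy data; export_text post-dates this thread, so recent versions only):

from sklearn.tree import DecisionTreeClassifier, export_text

# a..e ordinally encoded as 0..4; the target is "is it category c?".
X = [[0], [1], [2], [3], [4]]
y = [0, 0, 1, 0, 0]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree))
# The fitted tree needs two thresholds (between b and c, and between
# c and d) to carve out 'c', producing exactly the three groups
# ('a', 'b'), ('c'), ('d', 'e') described above.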
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On 06/03/2013 05:19 AM, Joel Nothman wrote:
> However, in these last two cases, the number of possible splits at a
> single node is linear in the number of categories. Selecting an
> arbitrary partition allows exponentially many splits with respect to
> the number of categories (though there may be approximations to avoid
> evaluating all possible splits; I'm not familiar with the literature).

I think the standard split is asking whether a variable is equal to a value, i.e. selecting subsets of size 1. That is possible to do with two thresholds, but leads to a weird tree in a way.
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On Mon, Jun 3, 2013 at 12:41 PM, Christian Jauvin wrote:
>> Sklearn does not implement any special treatment for categorical
>> variables. You can feed it any float. The question is whether it would
>> work / what it does.
>
> I think I'm confused about a couple of aspects (that's what happens I
> guess when you play with algorithms for which you don't have a
> complete and firm understanding beforehand!). I assumed that
> sklearn-RF's requirement for numerical inputs was just a data
> representation/implementation aspect, and that once properly
> transformed (i.e. using a LabelEncoder), it wouldn't matter, under the
> hood, whether a predictor was categorical or numerical.
>
> Now if I understand you well, sklearn shouldn't be able to explicitly
> handle the categorical case where no order exists (i.e. categorical,
> as opposed to ordinal).

It comes down to what sort of decision can be made at each node. scikit-learn always uses decisions of the form (x > t) for some feature value x and some threshold t.

Let's make this more concrete: say you have a feature with possible values {A, B, C, D}.

Ideal categorical treatment might partition the set of categories indicated by variable x, so that each side of the partition corresponds to a different child in the decision tree. So possible decisions would distinguish {A} from {B, C, D}; {B} from {A, C, D}; {C} from {A, B, D}; {D} from {A, B, C}; {A, B} from {C, D}; {A, C} from {B, D}; and {A, D} from {B, C}. Scikit-learn can't make these sorts of splits.

LabelEncoder will turn these values into [0, 1, 2, 3]. Then only splits respecting that ordering are possible, so a single split can distinguish {A} from {B, C, D}; {A, B} from {C, D}; and {A, B, C} from {D}.

LabelBinarizer will allow a single split to distinguish any one category from all others: {A} from {B, C, D}; {B} from {A, C, D}; {C} from {A, B, D}; {D} from {A, B, C}.

Note that all these trees can represent the same hypothesis space; it just might require a deeper tree to represent the same thing (and the learning process can't take advantage of similar categories). However, in these last two cases, the number of possible splits at a single node is linear in the number of categories, whereas selecting an arbitrary partition allows exponentially many splits with respect to the number of categories (though there may be approximations to avoid evaluating all possible splits; I'm not familiar with the literature). So it should be quite clear that binarized categories allow the most meaningful decisions with the least complexity.

Cheers,
- Joel
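Concretely, the two encodings look like this (both classes live in sklearn.preprocessing):

from sklearn.preprocessing import LabelBinarizer, LabelEncoder

x = ["A", "B", "C", "D", "B"]

print(LabelEncoder().fit_transform(x))
# [0 1 2 3 1] -- a single ordered column: a threshold split can only
# separate order-respecting groups such as {A} vs {B, C, D}.

print(LabelBinarizer().fit_transform(x))
# [[1 0 0 0]
#  [0 1 0 0]
#  [0 0 1 0]
#  [0 0 0 1]
#  [0 1 0 0]] -- four indicator columns: one split isolates one category.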
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
> Sklearn does not implement any special treatment for categorical variables.
> You can feed it any float. The question is whether it would work / what it does.

I think I'm confused about a couple of aspects (that's what happens I guess when you play with algorithms for which you don't have a complete and firm understanding beforehand!). I assumed that sklearn-RF's requirement for numerical inputs was just a data representation/implementation aspect, and that once properly transformed (i.e. using a LabelEncoder), it wouldn't matter, under the hood, whether a predictor was categorical or numerical.

Now if I understand you well, sklearn shouldn't be able to explicitly handle the categorical case where no order exists (i.e. categorical, as opposed to ordinal). But you seem to also imply that sklearn can indirectly support it using dummy variables..

Bigger question: given that Decision Trees (in general) support pure categorical variables, shouldn't Random Forests do so as well?

>> https://github.com/benhamner/Stack-Overflow-Competition/blob/master/features.py
> I don't see where categorical variables are used in this code. Could you
> please point it out?

You're right, my bad: those are not categorical predictors.

> Not sure what this says about your dataset / features.
> If the variables don't have any ordering and the splits take arbitrary
> subsets, that would seem a bit weird to me.

In fact that's really what I observe: apart from the first of my 4 variables, which is a year, the remaining 3 are purely categorical, with no implicit order. So that result is weird, because it is not in line with what you've been saying.

Anyway, thanks for your time and patience,

Christian
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On 06/02/2013 10:53 PM, Christian Jauvin wrote:
> Hi Andreas,
>
>> Btw, you do encode the categorical variables using one-hot, right?
>> The sklearn trees don't really support categorical variables.
> I'm rather perplexed by this.. I assumed that sklearn's RF only
> required its input to be numerical, so I only used a LabelEncoder up
> to now.

Hum. I have not considered that. Peter? Gilles? Lars? A little help?

Sklearn does not implement any special treatment for categorical variables. You can feed it any float. The question is whether it would work / what it does. I guess you (and Kaggle) observed that it does work somewhat; I'm not sure it does what you want. The splits will be as for numerical variables, i.e. > threshold. If the variables have an ordering (and LabelEncoder respects that ordering), that makes sense. If the variables don't have an ordering (which I would assume is the more common case for categorical variables), I don't think that makes much sense.

> My assumption was backed by two external sources of information:
> (1) The benchmark code provided by Kaggle in the SO contest (which was
> actually the first time I used RFs) didn't seem to perform such a
> transformation:
> https://github.com/benhamner/Stack-Overflow-Competition/blob/master/features.py

I don't see where categorical variables are used in this code. Could you please point it out?

> (2) It doesn't seem to be mentioned in this Kaggle tutorial about RFs:
> http://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests

I am not that experienced with categorical variables. The catch here seems to be "not too many values". Maybe it works for "few" values, but it is not what I would expect a random forest implementation to do on categorical variables. I think it is rather bad that the tutorial doesn't mention one-hot encoding if it is using sklearn.

It is somewhat trivial to perform the usual categorical tests. They are not implemented in sklearn, though, as there is no obvious way to declare a column a categorical variable (you need an auxiliary array, and no one has done this yet).

> Moreover, I just tested it with my own experiment, and I found that a
> RF trained on a (21080 x 4) input matrix (i.e. 4 categorical
> variables, not one-hot encoded) performs the same (to the third
> decimal in accuracy and AUC, with 10-fold CV) as with its equivalent,
> one-hot encoded (21080 x 1347) matrix.

Not sure what this says about your dataset / features. If the variables don't have any ordering and the splits take arbitrary subsets, that would seem a bit weird to me.

> Sorry if the confusion is on my side, but did I miss something?

Maybe I'm just not well-versed enough in the use of numerically encoded categorical variables in random forests.

Cheers,
Andy
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
Hi Andreas,

> Btw, you do encode the categorical variables using one-hot, right?
> The sklearn trees don't really support categorical variables.

I'm rather perplexed by this.. I assumed that sklearn's RF only required its input to be numerical, so I have only used a LabelEncoder up to now. My assumption was backed by two external sources of information:

(1) The benchmark code provided by Kaggle in the SO contest (which was actually the first time I used RFs) didn't seem to perform such a transformation:
https://github.com/benhamner/Stack-Overflow-Competition/blob/master/features.py

(2) It doesn't seem to be mentioned in this Kaggle tutorial about RFs:
http://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests

Moreover, I just tested it in my own experiment, and I found that a RF trained on a (21080 x 4) input matrix (i.e. 4 categorical variables, not one-hot encoded) performs the same (to the third decimal in accuracy and AUC, with 10-fold CV) as with its equivalent, one-hot encoded (21080 x 1347) matrix.

Sorry if the confusion is on my side, but did I miss something?

Christian
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
I got very good results on text century dating using random forests on very few (20-ish) bag-of-words tf-idf features selected by chi2. It depends on the problem.

Cheers,
Vlad

On Sat, Jun 1, 2013 at 9:01 PM, Andreas Mueller wrote:
> On 06/01/2013 08:30 PM, Christian Jauvin wrote:
>> Hi,
>>
>> I asked a (perhaps too vague?) question about the use of Random
>> Forests with a mix of categorical and lexical features on two ML
>> forums (stats.SE and MetaOp), but since it has received no attention,
>> I figured that it might work better on this list (I'm using sklearn's
>> RF of course):
>>
>> "I'm working on a binary classification problem for which the dataset
>> is mostly composed of categorical features, but also a few lexical
>> ones (i.e. article titles and abstracts). I'm experimenting with
>> Random Forests, and my current strategy is to build the training set
>> by appending the k best lexical features (chosen with univariate
>> feature selection, and weighted with tf-idf) to the full set of
>> categorical features. This works reasonably well, but as I cannot find
>> explicit references to such a strategy of using hybrid features for
>> RF, I have doubts about my approach: does it make sense? Am I
>> "diluting" the power of the RF by doing so, and should I rather try to
>> combine two classifiers specializing on both types of features?"
>>
> I think it is ok, though people rarely use RF on bag-of-words features.
> Btw, you do encode the categorical variables using one-hot, right?
> The sklearn trees don't really support categorical variables.
> An alternative approach would be to run a linear classifier on all
> tf-idf features and feed its output, together with the other variables,
> to the RF.
>
> HTH,
> Andy
>
> ps: try Stack Overflow with the scikit-learn tag next time.
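A sketch of that kind of setup (texts and y stand for the documents and century labels, which are not shown; the parameter values are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# texts: list of document strings; y: century labels (assumed given).

X_tfidf = TfidfVectorizer().fit_transform(texts)
X_small = SelectKBest(chi2, k=20).fit_transform(X_tfidf, y)  # ~20 best

forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_small.toarray(), y)  # densify: only 20 columns remain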
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
On 06/01/2013 08:30 PM, Christian Jauvin wrote:
> Hi,
>
> I asked a (perhaps too vague?) question about the use of Random
> Forests with a mix of categorical and lexical features on two ML
> forums (stats.SE and MetaOp), but since it has received no attention,
> I figured that it might work better on this list (I'm using sklearn's
> RF of course):
>
> "I'm working on a binary classification problem for which the dataset
> is mostly composed of categorical features, but also a few lexical
> ones (i.e. article titles and abstracts). I'm experimenting with
> Random Forests, and my current strategy is to build the training set
> by appending the k best lexical features (chosen with univariate
> feature selection, and weighted with tf-idf) to the full set of
> categorical features. This works reasonably well, but as I cannot find
> explicit references to such a strategy of using hybrid features for
> RF, I have doubts about my approach: does it make sense? Am I
> "diluting" the power of the RF by doing so, and should I rather try to
> combine two classifiers specializing on both types of features?"

I think it is ok, though people rarely use RF on bag-of-words features.

Btw, you do encode the categorical variables using one-hot, right? The sklearn trees don't really support categorical variables.

An alternative approach would be to run a linear classifier on all tf-idf features and feed its output, together with the other variables, to the RF.

HTH,
Andy

ps: try Stack Overflow with the scikit-learn tag next time.
Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features
Hi Christian,

Some time ago I had similar problems, i.e. I wanted to use additional features alongside my lexical features, and simple concatenation didn't work that well for me, even though both feature sets performed pretty well on their own. You can follow the discussion about my problem here [1] if you scroll down - ignore the starting discussion.

The best solution I ended up with was the one suggested by Olivier: you basically train a linear classifier on your lexical features, and then use the predict_proba outcome together with your additional categorical features to train a second classifier - for example, random forests. It was also helpful to perform leave-one-out when training the probabilities (if you have few samples).

[1] http://sourceforge.net/mailarchive/forum.php?thread_name=CAFvE7K5F2BJ_ms51a-61HwmNrAyRTb1W0KK7ziBPzGAcdiBRqQ%40mail.gmail.com&forum_name=scikit-learn-general

If you find out anything else, let us know ;)

Regards,
Philipp

On 01.06.2013 20:30, Christian Jauvin wrote:
> Hi,
>
> I asked a (perhaps too vague?) question about the use of Random
> Forests with a mix of categorical and lexical features on two ML
> forums (stats.SE and MetaOp), but since it has received no attention,
> I figured that it might work better on this list (I'm using sklearn's
> RF of course):
>
> "I'm working on a binary classification problem for which the dataset
> is mostly composed of categorical features, but also a few lexical
> ones (i.e. article titles and abstracts). I'm experimenting with
> Random Forests, and my current strategy is to build the training set
> by appending the k best lexical features (chosen with univariate
> feature selection, and weighted with tf-idf) to the full set of
> categorical features. This works reasonably well, but as I cannot find
> explicit references to such a strategy of using hybrid features for
> RF, I have doubts about my approach: does it make sense? Am I
> "diluting" the power of the RF by doing so, and should I rather try to
> combine two classifiers specializing on both types of features?"
>
> http://stats.stackexchange.com/questions/60162/random-forest-with-a-mix-of-categorical-and-lexical-features
>
> Thanks,
>
> Christian
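A sketch of that two-stage setup using the modern cross_val_predict API (which post-dates this thread); out-of-fold probabilities are the k-fold analogue of the leave-one-out idea above. X_lex, X_cat and y are assumed given:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# X_lex: tf-idf matrix of lexical features; X_cat: encoded categorical
# features (dense array); y: binary labels.

# Out-of-fold probabilities, so the second stage never sees predictions
# made on a sample that was in the first stage's training fold.
p_lex = cross_val_predict(LogisticRegression(), X_lex, y,
                          cv=10, method="predict_proba")[:, 1]

X_stage2 = np.hstack([X_cat, p_lex[:, None]])
clf = RandomForestClassifier(n_estimators=100).fit(X_stage2, y)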