Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Gilles Louppe
On 3 June 2013 08:43, Andreas Mueller wrote:
> On 06/03/2013 05:19 AM, Joel Nothman wrote:
>>
>> However, in these last two cases, the number of possible splits at a
>> single node is linear in the number of categories. Selecting an
>> arbitrary partition allows exponentially many splits with resp
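To make the counting argument in the quoted text concrete, here is a small sketch in plain Python (not scikit-learn code; the function names are made up for illustration). It assumes threshold splits over an integer coding as one example of a scheme whose split count grows linearly, and compares it with enumerating every binary partition of the categories.

```python
from itertools import combinations


def threshold_splits(categories):
    """Splits of the form `code < threshold` over a sorted integer coding."""
    ordered = sorted(categories)
    # One cut point between each pair of neighbouring categories: k - 1 splits.
    return [(set(ordered[:i]), set(ordered[i:])) for i in range(1, len(ordered))]


def arbitrary_partitions(categories):
    """Every way to send a non-empty proper subset left and the rest right."""
    cats = sorted(categories)
    first, rest = cats[0], cats[1:]
    splits = []
    # Pin the first category to the left side so that a partition and its
    # mirror image are counted only once. Stopping at len(rest) - 1 extra
    # members keeps the right side non-empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            splits.append((left, set(cats) - left))
    return splits


if __name__ == "__main__":
    for k in (4, 10):
        cats = list(range(k))
        print(k, len(threshold_splits(cats)), len(arbitrary_partitions(cats)))
```

For k categories the first function yields k - 1 candidate splits and the second 2^(k-1) - 1, which is why exhaustive partition search becomes impractical quickly as the number of categories grows.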

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Peter Prettenhofer
Our decision tree implementation only supports numerical splits, i.e. it tests val < threshold. Categorical features need to be encoded properly. I recommend one-hot encoding for features with small cardinality (e.g. < 50) and ordinal encoding (simply assign each category an integer value) for fe
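A minimal sketch of the two encodings mentioned above, in plain Python so that it does not depend on any particular scikit-learn version. The helper name encode_column and its max_onehot_cardinality parameter are hypothetical; the default of 50 only mirrors the "e.g. < 50" rule of thumb in the message and is not a library setting.

```python
def encode_column(values, max_onehot_cardinality=50):
    """Turn one categorical column into numeric rows.

    Columns with fewer than `max_onehot_cardinality` distinct values are
    one-hot encoded; all others get a single integer code per category.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}

    if len(categories) < max_onehot_cardinality:
        # One-hot: one 0/1 column per category, so a test such as
        # `column_j < 0.5` isolates exactly one category.
        return [
            [1.0 if index[v] == j else 0.0 for j in range(len(categories))]
            for v in values
        ]

    # Ordinal: one column holding the category's integer code. The order
    # is arbitrary (alphabetical here), so `code < threshold` groups
    # categories by that arbitrary order.
    return [[float(index[v])] for v in values]


if __name__ == "__main__":
    colors = ["red", "green", "blue", "green"]
    print(encode_column(colors))                            # one-hot rows
    print(encode_column(colors, max_onehot_cardinality=2))  # ordinal rows
```

The resulting rows can be concatenated with the other numeric features before being passed to a tree-based estimator. The trade-off is width versus meaning: one-hot adds a column per category, while the ordinal form stays at one column but imposes an ordering the data may not have.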

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Andreas Mueller
On 06/03/2013 09:15 AM, Peter Prettenhofer wrote:
> Our decision tree implementation only supports numerical splits; i.e.
> if tests val < threshold .
>
> Categorical features need to be encoded properly. I recommend one-hot
> encoding for features with small cardinality (e.g. < 50) and ordinal

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-03 Thread Lars Buitinck
2013/6/2 Harold Nguyen:
> http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
> Does TfidfVectorizer take a sequence of filenames, where each file is just a
> plain text file?

Depends on the parameter input (the first in the list). In the example, I

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-03 Thread Lars Buitinck
2013/6/3 Andreas Mueller:
> I named the variable, I think, and it is a bad name :-(
> Should we rename it?
>
> I think giving a count makes more sense than giving a frequency: you want to
> exclude outliers that appear only once or twice for example.

I actually hadn't seen this reply. It's not a

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-03 Thread Andreas Mueller
On 06/03/2013 04:09 PM, Lars Buitinck wrote:
> 2013/6/3 Andreas Mueller:
>> I named the variable, I think, and it is a bad name :-(
>> Should we rename it?
>>
>> I think giving a count makes more sense than giving a frequency: you want to
>> exclude outliers that appear only once or twice for exam

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-03 Thread Joel Nothman
On Tue, Jun 4, 2013 at 12:14 AM, Andreas Mueller wrote:
> On 06/03/2013 04:09 PM, Lars Buitinck wrote:
> > 2013/6/3 Andreas Mueller:
> >> I named the variable, I think, and it is a bad name :-(
> >> Should we rename it?
> >>
> >> I think giving a count makes more sense than giving a frequency: yo

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Christian Jauvin
Many thanks to all for your help and detailed answers; I really appreciate it. I wanted to test the discussion's takeaway, namely what Peter suggested: one-hot encode the categorical features with small cardinality, and leave the others in their ordinal form. So from the same dataset I mentio