Re: [Scikit-learn-general] does sklearn implement svm by itself?

2013-06-03 Thread Andreas Mueller
On 06/03/2013 06:41 AM, Vlad Niculae wrote: With the right settings, SGDClassifier is a home cooked implementation of SVM so there's that too. That is true. Thinking about it, it is a bit weird that SGDClassifier is in linear_model and LinearSVC is in svm, as they both solve the same
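
A minimal sketch of the point above: with `loss="hinge"` and an L2 penalty, `SGDClassifier` optimizes a linear SVM objective, so it can be compared against `LinearSVC` on the same data. The toy dataset and the mapping `alpha = 1 / (C * n_samples)` are assumptions made for this illustration; the two solvers also differ in how they treat the intercept, so the fits match only approximately.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Toy data (an assumption for this sketch); scaling helps SGD converge.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)

C = 1.0
alpha = 1.0 / (C * X.shape[0])  # rough correspondence between the two parametrizations

# Hinge loss + L2 penalty: a linear SVM trained by stochastic gradient descent.
sgd = SGDClassifier(loss="hinge", penalty="l2", alpha=alpha,
                    max_iter=1000, tol=1e-3, random_state=0).fit(X, y)

# The same kind of model, trained by liblinear's dual solver.
svc = LinearSVC(loss="hinge", C=C, dual=True,
                max_iter=10000, random_state=0).fit(X, y)

# Fraction of training points on which the two models predict the same label.
agreement = np.mean(sgd.predict(X) == svc.predict(X))
print(f"prediction agreement: {agreement:.2f}")
```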

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Andreas Mueller
On 06/03/2013 05:19 AM, Joel Nothman wrote: However, in these last two cases, the number of possible splits at a single node is linear in the number of categories. Selecting an arbitrary partition allows exponentially many splits with respect to the number of categories (though there may
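
The linear-versus-exponential contrast can be made concrete with a short count. This sketch assumes the "last two cases" are threshold splits on integer-coded categories and one-vs-rest splits on one-hot columns, which the truncated quote does not confirm; the function names are hypothetical.

```python
def threshold_splits(k):
    """Distinct cut points between k ordered category codes."""
    return k - 1


def one_vs_rest_splits(k):
    """Upper bound on splits from k one-hot columns (one category vs. the rest)."""
    return k


def partition_splits(k):
    """Ways to divide k categories into two non-empty groups."""
    return 2 ** (k - 1) - 1


for k in (3, 10, 20):
    print(k, threshold_splits(k), one_vs_rest_splits(k), partition_splits(k))
```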

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Andreas Mueller
On 06/03/2013 04:41 AM, Christian Jauvin wrote: Sklearn does not implement any special treatment for categorical variables. You can feed any float. The question is if it would work / what it does. I think I'm confused about a couple of aspects (that's what happens I guess when you play with

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-03 Thread Andreas Mueller
On 06/02/2013 08:48 PM, Harold Nguyen wrote: Hi Lars, Thank you very much for this response. Please excuse my questions since I'm new. From here the document on TfidfVectorizer here: http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html Does

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Gilles Louppe
On 3 June 2013 08:43, Andreas Mueller amuel...@ais.uni-bonn.de wrote: On 06/03/2013 05:19 AM, Joel Nothman wrote: However, in these last two cases, the number of possible splits at a single node is linear in the number of categories. Selecting an arbitrary partition allows exponentially many

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Andreas Mueller
On 06/03/2013 09:15 AM, Peter Prettenhofer wrote: Our decision tree implementation only supports numerical splits; i.e. if tests val < threshold. Categorical features need to be encoded properly. I recommend one-hot encoding for features with small cardinality (e.g. < 50) and ordinal
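
A minimal sketch of the recommended encoding, using `ColumnTransformer`, `OneHotEncoder` and `OrdinalEncoder` from current scikit-learn releases (these are not necessarily what was available when the thread was written). The two synthetic columns and the choice of which one counts as "small cardinality" are assumptions for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

rng = np.random.RandomState(0)
n = 300
# Hypothetical features: one with 3 categories, one with up to 200.
color = rng.choice(["red", "green", "blue"], size=n)
zone = rng.choice([f"z{i:03d}" for i in range(200)], size=n)
X = np.column_stack([color, zone]).astype(object)
y = (color == "red").astype(int)

encode = ColumnTransformer([
    # Small cardinality: one binary column per category.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), [0]),
    # Large cardinality: a single integer-coded column.
    ("ordinal", OrdinalEncoder(), [1]),
])

model = make_pipeline(encode, RandomForestClassifier(n_estimators=50, random_state=0))
model.fit(X, y)
print(model.score(X, y))
```

The trees then split each encoded column with ordinary numerical threshold tests, so the one-hot columns act as "is this category" checks while the integer-coded column is split on its arbitrary code order.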

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-03 Thread Lars Buitinck
2013/6/2 Harold Nguyen har...@nexgate.com: http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html Does TfidfVectorizer take a sequence of filenames, where each file is just a plain text file ? Depends on the parameter input (the first in the list).
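
As a sketch of the `input` parameter mentioned above: with `input="filename"`, `TfidfVectorizer` expects a sequence of paths and reads each file itself (the default, `input="content"`, expects the text directly). The temporary files and their contents are made up for this example.

```python
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Write two small plain-text files to a temporary directory (illustrative data).
tmpdir = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "the cat sat on the mat"),
                   ("b.txt", "the dog chased the cat")]:
    path = os.path.join(tmpdir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    paths.append(path)

# input="filename": each item passed to fit/transform is a path to read.
vectorizer = TfidfVectorizer(input="filename")
X = vectorizer.fit_transform(paths)

print(X.shape)                         # (n_documents, n_terms)
print(sorted(vectorizer.vocabulary_))  # terms found across both files
```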

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-03 Thread Lars Buitinck
2013/6/3 Andreas Mueller amuel...@ais.uni-bonn.de: I named the variable, I think, and it is a bad name :-( Should we rename it? I think giving a count makes more sense than giving a frequency: you want to exclude outliers that appear only once or twice for example. I actually hadn't seen
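
The snippet does not name the variable being discussed. Assuming it is the document-frequency cutoff `min_df` of the text vectorizers, the count-versus-frequency distinction looks like this: an integer is read as an absolute number of documents, a float as a proportion of documents. The four toy documents are invented for the sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "apple banana",
    "apple cherry",
    "apple banana durian",
    "apple",
]

# Integer: keep terms that appear in at least 2 documents.
by_count = CountVectorizer(min_df=2).fit(docs)

# Float: keep terms that appear in at least 50% of the documents.
by_fraction = CountVectorizer(min_df=0.5).fit(docs)

print(sorted(by_count.vocabulary_))
print(sorted(by_fraction.vocabulary_))
```

With four documents the two settings select the same terms; on a larger corpus a fixed count keeps dropping terms seen only once or twice, while a fixed proportion scales with corpus size.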

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-03 Thread Christian Jauvin
Many thanks to all for your help and detailed answers, I really appreciate it. So I wanted to test the discussion's takeaway, namely, what Peter suggested: one-hot encode the categorical features with small cardinality, and leave the others in their ordinal form. So from the same dataset I