Re: [Scikit-learn-general] Multilabel sequences of sequences considered harmful

2013-06-02 Thread Joel Nothman
On Sun, Jun 2, 2013 at 4:43 PM, Mathieu Blondel wrote: > > >> Sounds good to me. Only I would like some confirmation on whether >> deprecating support for sequences of sequences is sensible. >> > > Sequences of sequences and arrays of sets are both iterables of iterables, > right? So, since it only

Re: [Scikit-learn-general] Multilabel sequences of sequences considered harmful

2013-06-02 Thread Mathieu Blondel
On Sun, Jun 2, 2013 at 4:26 PM, Joel Nothman wrote: > > That's only true if users know they are required to pass binarized input > to cross-validation routines such as GridSearchCV and cross_val_score, or > else they might land up with a 2d array of ints instead of a 1d array of > objects. > I ha

Re: [Scikit-learn-general] Multilabel sequences of sequences considered harmful

2013-06-02 Thread Joel Nothman
On Sun, Jun 2, 2013 at 6:08 PM, Mathieu Blondel wrote: > > > On Sun, Jun 2, 2013 at 4:26 PM, Joel Nothman > wrote: > >> >> That's only true if users know they are required to pass binarized input >> to cross-validation routines such as GridSearchCV and cross_val_score, or >> else they might land

Re: [Scikit-learn-general] To standardize is the question ...

2013-06-02 Thread Andreas Mueller
On 06/01/2013 11:43 PM, o m wrote: > Andy, on reading your tip, and reflecting on what I do, I'm tempted to > claim > that standardization is very important, regardless ... > > Assume x0 is very important but has a tiny range (-1/100, 1/100) I think that something with a tiny range can be more "i

Re: [Scikit-learn-general] Multilabel sequences of sequences considered harmful

2013-06-02 Thread Joel Nothman
On Sun, Jun 2, 2013 at 6:34 PM, Joel Nothman wrote: > On Sun, Jun 2, 2013 at 6:08 PM, Mathieu Blondel wrote: > >> >> >> On Sun, Jun 2, 2013 at 4:26 PM, Joel Nothman < >> jnoth...@student.usyd.edu.au> wrote: >> >>> >>> That's only true if users know they are required to pass binarized input >>> to

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Vlad Niculae
I got very good results on text century dating using random forests on very few (20-ish) bag-of-words tf-idf features selected by chi2. It depends on the problem. Cheers, Vlad On Sat, Jun 1, 2013 at 9:01 PM, Andreas Mueller wrote: > On 06/01/2013 08:30 PM, Christian Jauvin wrote: >> Hi, >> >> I

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-02 Thread Lars Buitinck
2013/6/1 Harold Nguyen : > I was wondering if anyone can point me to a tutorial on clustering text > documents, but then also displaying the results in a graph ? I see some > examples on clustering text documents, but I'd like to be able to visualize > the clusters. You'll need dimensionality redu

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-02 Thread Harold Nguyen
Hi Lars, Thank you very much for this response. Please excuse my questions since I'm new. From here the document on TfidfVectorizer here: http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html Does TfidfVectorizer take a sequence of filenames, where

[Scikit-learn-general] does sklearn implement svm by itself?

2013-06-02 Thread mike
or it invokes svm implementation from libsvm?

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Christian Jauvin
Hi Andreas, > Btw, you do encode the categorical variables using one-hot, right? > The sklearn trees don't really support categorical variables. I'm rather perplexed by this.. I assumed that sklearn's RF only required its input to be numerical, so I only used a LabelEncoder up to now. My assumpt
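The difference between integer label codes and one-hot columns can be illustrated with a small sketch. This is plain Python, not scikit-learn code: the helper names (label_encode, one_hot_encode, single_split_isolates) and the toy "colors" data are hypothetical, and the sketch assumes a tree node splits a single column with a "value <= threshold" test.

```python
def label_encode(values):
    """Map each distinct category to an integer code (alphabetical order)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Return one 0/1 column per category, plus the category order."""
    cats = sorted(set(values))
    columns = [[1 if v == c else 0 for v in values] for c in cats]
    return columns, cats


def single_split_isolates(column, target_rows):
    """True if one 'value <= threshold' test separates exactly target_rows."""
    target = set(target_rows)
    all_rows = set(range(len(column)))
    for threshold in sorted(set(column)):
        left = {i for i, x in enumerate(column) if x <= threshold}
        right = all_rows - left
        if left == target or right == target:
            return True
    return False


# Hypothetical toy feature with three categories.
colors = ["blue", "green", "red", "green", "blue"]
green_rows = [i for i, c in enumerate(colors) if c == "green"]

codes, mapping = label_encode(colors)      # blue=0, green=1, red=2
columns, cats = one_hot_encode(colors)

# Integer codes impose an order, so "green" (the middle code) cannot be
# separated from the rest with a single threshold.
codes_ok = single_split_isolates(codes, green_rows)

# With one-hot columns, the "green" column isolates it in one split.
onehot_ok = any(single_split_isolates(col, green_rows) for col in columns)

print(codes_ok, onehot_ok)
```

Under these assumptions, integer codes still work as numeric input, but the tree needs more than one split to single out a category that sits in the middle of the arbitrary ordering.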

Re: [Scikit-learn-general] does sklearn implement svm by itself?

2013-06-02 Thread Andreas Mueller
On 06/02/2013 10:18 PM, mike wrote: > or it invokes svm implementation from libsvm? Yes it does, as it says in the docs: http://scikit-learn.org/dev/modules/svm.html#implementation-details Maybe we should put this into a more prominent place? (in particular libsvm and liblinear are mentioned above

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Andreas Mueller
On 06/02/2013 10:53 PM, Christian Jauvin wrote: > Hi Andreas, > >> Btw, you do encode the categorical variables using one-hot, right? >> The sklearn trees don't really support categorical variables. > I'm rather perplexed by this.. I assumed that sklearn's RF only > required its input to be numeric

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Christian Jauvin
> Sklearn does not implement any special treatment for categorical variables. > You can feed any float. The question is if it would work / what it does. I think I'm confused about a couple of aspects (that's what happens I guess when you play with algorithms for which you don't have a complete and

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Joel Nothman
On Mon, Jun 3, 2013 at 12:41 PM, Christian Jauvin wrote: > > Sklearn does not implement any special treatment for categorical > variables. > > You can feed any float. The question is if it would work / what it does. > > I think I'm confused about a couple of aspects (that's what happens I > guess

Re: [Scikit-learn-general] does sklearn implement svm by itself?

2013-06-02 Thread Vlad Niculae
With the right settings, SGDClassifier is a home cooked implementation of SVM so there's that too. Vlad On Mon, Jun 3, 2013 at 12:23 AM, Andreas Mueller wrote: > On 06/02/2013 10:18 PM, mike wrote: >> or it invokes svm implementation from libsvm? > Yes it does, as it says in the docs: > http://s
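As a rough illustration of what "SVM by SGD" means, here is a minimal pure-Python sketch of stochastic gradient descent on an L2-regularized hinge loss. It is not scikit-learn's implementation: the function names, the toy data, and the hyperparameter values are assumptions chosen for the example. In scikit-learn, the corresponding setting is SGDClassifier with loss="hinge" and an L2 penalty.

```python
def decision(w, b, x):
    """Linear decision function w.x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_hinge_sgd(X, y, epochs=50, lr=0.1, alpha=0.01):
    """SGD on L2-regularized hinge loss; labels must be -1 or +1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * decision(w, b, xi)
            # Gradient step for the L2 penalty: shrink the weights.
            w = [wj * (1.0 - lr * alpha) for wj in w]
            # Hinge loss only contributes when the margin is violated.
            if margin < 1.0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Class label from the sign of the decision function."""
    return 1 if decision(w, b, x) >= 0 else -1


# Hypothetical, linearly separable toy data.
X = [[2.0, 2.0], [3.0, 1.5], [2.5, 3.0],
     [-2.0, -1.0], [-1.5, -2.5], [-3.0, -2.0]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_hinge_sgd(X, y)
predictions = [predict(w, b, xi) for xi in X]
print(predictions)
```

The sketch omits shuffling, learning-rate schedules, and stopping criteria, all of which a practical solver would need.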

Re: [Scikit-learn-general] does sklearn implement svm by itself?

2013-06-02 Thread Andreas Mueller
On 06/03/2013 06:41 AM, Vlad Niculae wrote: > With the right settings, SGDClassifier is a home cooked implementation > of SVM so there's that too. > That is true. Thinking about it, it is a bit weird that SGDClassifier is in linear_model and LinearSVC is in svm, as they both solve the same optimiz

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Andreas Mueller
On 06/03/2013 05:19 AM, Joel Nothman wrote: > > However, in these last two cases, the number of possible splits at a > single node is linear in the number of categories. Selecting an > arbitrary partition allows exponentially many splits with respect to > the number of categories (though there m
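The growth rates mentioned in the quoted text can be checked by enumeration. The sketch below is plain Python with hypothetical helper names. It assumes that the "linear" case means threshold splits over k ordered codes (k - 1 candidates) and that an "arbitrary partition" means any division of the categories into two non-empty groups.

```python
from itertools import combinations


def ordered_split_count(k):
    """Threshold splits available when k categories are treated as ordered."""
    return max(k - 1, 0)


def binary_partitions(categories):
    """Enumerate all ways to divide categories into two non-empty groups."""
    cats = list(categories)
    if len(cats) < 2:
        return []
    first, rest = cats[0], cats[1:]
    partitions = []
    # Pin `first` to the left group so each partition is listed once;
    # take fewer than len(rest) extras so the right group is never empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(cats) - left
            partitions.append((left, right))
    return partitions


for k in range(2, 8):
    cats = [f"c{i}" for i in range(k)]
    print(k, ordered_split_count(k), len(binary_partitions(cats)))
```

The partition count matches 2^(k-1) - 1, so an exhaustive search over arbitrary groupings becomes impractical long before a threshold search does.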

Re: [Scikit-learn-general] Random Forest with a mix of categorical and lexical features

2013-06-02 Thread Andreas Mueller
On 06/03/2013 04:41 AM, Christian Jauvin wrote: >> Sklearn does not implement any special treatment for categorical variables. >> You can feed any float. The question is if it would work / what it does. > I think I'm confused about a couple of aspects (that's what happens I > guess when you play wi

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-02 Thread Andreas Mueller
On 06/02/2013 08:48 PM, Harold Nguyen wrote: Hi Lars, Thank you very much for this response. Please excuse my questions since I'm new. From here the document on TfidfVectorizer here: http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html Does Tfidf