Re: [Scikit-learn-general] Subclassing vectorizers

2016-03-22 Thread Sebastian Raschka
Hah, and I just wanted to write regarding the VotingClassifier. I remember my struggle quite well when I tried to make it Pipeline- and GridSearch-compatible, until I figured that one out :P
> On Mar 23, 2016, at 12:34 AM, Joel Nothman wrote:
>
> And I lied that none

Re: [Scikit-learn-general] Subclassing vectorizers

2016-03-22 Thread Joel Nothman
And I lied that none of the scikit-learn estimators define their own get_params. Of course the following do: VotingClassifier, Kernel (and subclasses), Pipeline and FeatureUnion.
On 23 March 2016 at 15:04, Joel Nothman wrote:
> something like the following may suffice:
>

Re: [Scikit-learn-general] Subclassing vectorizers

2016-03-22 Thread Joel Nothman
something like the following may suffice:

    def get_params(self, deep=True):
        out = super(WordCooccurrenceVectorizer, self).get_params(deep=deep)
        out['w2v_clusters'] = self.w2v_clusters
        return out

On 23 March 2016 at 15:01, Joel Nothman wrote:
> Hi Fred,
>
> We
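To see the suggested get_params override in context, here is a minimal, self-contained sketch. It assumes a stand-in parent class (ParentVectorizer, hypothetical, not scikit-learn's TfidfVectorizer) that only reports the parameters it knows about, and a hypothetical file name "clusters.txt" for the extra parameter. It does not reproduce scikit-learn's own parameter discovery; it only illustrates the pattern of extending the parent's dictionary.

```python
class ParentVectorizer:
    """Hypothetical stand-in for a library vectorizer; not scikit-learn's code."""

    def __init__(self, lowercase=True, min_df=1):
        self.lowercase = lowercase
        self.min_df = min_df

    def get_params(self, deep=True):
        # The parent only reports the parameters it knows about.
        return {"lowercase": self.lowercase, "min_df": self.min_df}


class WordCooccurrenceVectorizer(ParentVectorizer):
    def __init__(self, w2v_clusters=None, **kwargs):
        # Forward everything except the extra parameter to the parent.
        super().__init__(**kwargs)
        self.w2v_clusters = w2v_clusters

    def get_params(self, deep=True):
        # Start from the parent's dictionary, then add the extra parameter.
        out = super().get_params(deep=deep)
        out["w2v_clusters"] = self.w2v_clusters
        return out


# "clusters.txt" is a made-up path used only for illustration.
vec = WordCooccurrenceVectorizer(w2v_clusters="clusters.txt", min_df=2)
params = vec.get_params()

# Because get_params now reports every constructor argument,
# an equivalent instance can be rebuilt from the dictionary alone.
rebuilt = WordCooccurrenceVectorizer(**params)
print(params)
```

The round trip through the constructor at the end is the property that matters here: if the extra parameter were missing from the dictionary, the rebuilt instance would silently fall back to its default.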

Re: [Scikit-learn-general] Subclassing vectorizers

2016-03-22 Thread Joel Nothman
Hi Fred, We use the __init__ signature to get the list of parameters that (a) can be set by grid search; (b) need to be copied to a cloned instance of the estimator (with any fitted model discarded) in constructing ensembles, cross validation, etc. While none of the scikit-learn library of
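A rough sketch of the idea described above, using only the standard library: parameter names are read from the __init__ signature, and the same list drives both parameter setting and cloning. SketchEstimator, clone_sketch, ToyVectorizer and the path "clusters.txt" are all hypothetical names invented for this illustration; this is a simplified stand-in under those assumptions, not scikit-learn's actual implementation.

```python
import inspect


class SketchEstimator:
    """Hypothetical stand-in for an estimator base class (not scikit-learn's code)."""

    @classmethod
    def _param_names(cls):
        # Read the explicitly named parameters from the __init__ signature.
        sig = inspect.signature(cls.__init__)
        return sorted(
            name
            for name, p in sig.parameters.items()
            if name != "self" and p.kind == p.POSITIONAL_OR_KEYWORD
        )

    def get_params(self):
        # Each constructor parameter is assumed to be stored under the same name.
        return {name: getattr(self, name) for name in self._param_names()}

    def set_params(self, **params):
        # Only names found in the signature may be set (as a grid search would).
        valid = self._param_names()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self


def clone_sketch(estimator):
    # Rebuild an unfitted copy from the constructor parameters only.
    return type(estimator)(**estimator.get_params())


class ToyVectorizer(SketchEstimator):
    def __init__(self, lowercase=True, clusters_path=None):
        self.lowercase = lowercase
        self.clusters_path = clusters_path

    def fit(self, docs):
        # Fitted state lives in a trailing-underscore attribute.
        self.vocabulary_ = sorted({w for d in docs for w in d.split()})
        return self


vec = ToyVectorizer(clusters_path="clusters.txt").fit(["a b", "b c"])
fresh = clone_sketch(vec)
print(fresh.get_params())
print(hasattr(fresh, "vocabulary_"))
```

In this sketch, a parameter that is not named in the __init__ signature is invisible to both set_params and clone_sketch, which is why an extra constructor argument has to be declared explicitly or reported some other way.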

[Scikit-learn-general] Subclassing vectorizers

2016-03-22 Thread Fred Mailhot
Hello list, Firstly, thanks for this incredible package; I use it daily at work. Now on to the meat: I'm trying to subclass TfidfVectorizer and running into issues. I want to specify an extra param for __init__() that points to a file that gets used in build_analyzer(). Skipping irrelevant bits,

Re: [Scikit-learn-general] Comparisons of classifiers

2016-03-22 Thread Raphael C
> > - In tree-based Not handling categorical variables as such hurts us a lot
> There's a PR to fix that, it still needs a bit of love:
> https://github.com/scikit-learn/scikit-learn/pull/4899
>
This is a conversation moved from https://github.com/scikit-learn/scikit-learn/pull/4899. In the

Re: [Scikit-learn-general] Speed up Random Forest/ Extra Trees tuning

2016-03-22 Thread Gilles Louppe
Unfortunately, the most important parameters to adjust to maximize accuracy are often those controlling the randomness in the algorithm, i.e. max_features, for which this strategy is not possible. That being said, in the case of boosting, I think this strategy would be worth automating, e.g. to