Re: [Scikit-learn-general] Subclassing vectorizers
Thanks very much everyone; seems to be working now!

On 23 March 2016 at 00:58, Sebastian Raschka wrote:
> Hah, and I just wanted to write regarding the VotingClassifier -- I
> remember my struggle quite well when I tried to make it pipeline and
> GridSearch compatible until I figured that one out :P
Re: [Scikit-learn-general] Subclassing vectorizers
Hah, and I just wanted to write regarding the VotingClassifier -- I remember my struggle quite well when I tried to make it pipeline and GridSearch compatible until I figured that one out :P

> On Mar 23, 2016, at 12:34 AM, Joel Nothman wrote:
>
> And I lied that none of the scikit-learn estimators define their own
> get_params. Of course the following do: VotingClassifier, Kernel (and
> subclasses), Pipeline and FeatureUnion
Re: [Scikit-learn-general] Subclassing vectorizers
And I lied that none of the scikit-learn estimators define their own get_params. Of course the following do: VotingClassifier, Kernel (and subclasses), Pipeline and FeatureUnion.

On 23 March 2016 at 15:04, Joel Nothman wrote:
> something like the following may suffice:
>
> def get_params(self, deep=True):
>     out = super(WordCooccurrenceVectorizer, self).get_params(deep=deep)
>     out['w2v_clusters'] = self.w2v_clusters
>     return out
Re: [Scikit-learn-general] Subclassing vectorizers
Something like the following may suffice:

def get_params(self, deep=True):
    out = super(WordCooccurrenceVectorizer, self).get_params(deep=deep)
    out['w2v_clusters'] = self.w2v_clusters
    return out

On 23 March 2016 at 15:01, Joel Nothman wrote:
> [...] While none of the scikit-learn library of estimators
> do this, in practice you can overload get_params to define your own
> parameter listing. See
> http://scikit-learn.org/stable/developers/contributing.html#get-params-and-set-params
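The get_params override suggested above can be tried in isolation. What follows is only a minimal sketch, assuming Python 3, and it is not scikit-learn code: BaseVectorizer is a hypothetical stand-in whose get_params is hard-coded, and its two parameters are picked purely for illustration. One assumption is worth spelling out: the override reads self.w2v_clusters, so the constructor has to store the argument under that same name (the class posted in the question stores it as w2v_cluster_path).

```python
class BaseVectorizer:
    """Hypothetical stand-in for a library vectorizer (not scikit-learn code)."""

    def __init__(self, max_df=1.0, binary=False):
        self.max_df = max_df
        self.binary = binary

    def get_params(self, deep=True):
        # Hard-coded here for brevity; scikit-learn derives this listing
        # from the constructor signature instead.
        return {"max_df": self.max_df, "binary": self.binary}


class WordCooccurrenceVectorizer(BaseVectorizer):
    def __init__(self, w2v_clusters=None, max_df=1.0, binary=False):
        # Stored under the same name that get_params reports below.
        self.w2v_clusters = w2v_clusters
        super().__init__(max_df=max_df, binary=binary)

    def get_params(self, deep=True):
        # Extend the parent's listing with the extra constructor argument.
        out = super().get_params(deep=deep)
        out["w2v_clusters"] = self.w2v_clusters
        return out


vec = WordCooccurrenceVectorizer(w2v_clusters="clusters.w2v", binary=True)
# Rebuilding an instance from get_params() is what cloning relies on.
rebuilt = type(vec)(**vec.get_params())
print(sorted(rebuilt.get_params().items()))
```

In scikit-learn itself, get_params inspects the subclass constructor, so whether the override alone is enough depends on how __init__ is declared; that is not verified here.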
Re: [Scikit-learn-general] Subclassing vectorizers
Hi Fred,

We use the __init__ signature to get the list of parameters that (a) can be set by grid search; (b) need to be copied to a cloned instance of the estimator (with any fitted model discarded) in constructing ensembles, cross validation, etc. While none of the scikit-learn library of estimators do this, in practice you can overload get_params to define your own parameter listing. See
http://scikit-learn.org/stable/developers/contributing.html#get-params-and-set-params

On 23 March 2016 at 14:45, Fred Mailhot wrote:
> [...]
> The error is clear enough -- I can't use *args and **kwargs in a sklearn
> estimator's __init__() -- but I'm not sure what the correct way is to do
> what I need to do. Do I literally need to specify all of the __init__
> params in my subclass and then pass them on to the __init__ of super()? If
> so, what's the reason for setting this up this way?

___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
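The reasoning in this reply (parameter names are read off the __init__ signature, then used for grid search and cloning) can be illustrated with the standard library alone. This is a simplified sketch assuming Python 3, not scikit-learn's actual implementation; MiniEstimator, Explicit, Varargs and this clone function are all hypothetical names made up for the example.

```python
import inspect


class MiniEstimator:
    """Hypothetical stand-in for an estimator base class (not scikit-learn code)."""

    @classmethod
    def _param_names(cls):
        # Read the constructor signature and skip `self`.
        params = list(inspect.signature(cls.__init__).parameters.values())[1:]
        for p in params:
            if p.kind is p.VAR_POSITIONAL:
                # *args hides the parameter names, so they cannot be listed.
                raise RuntimeError(
                    "%s must name its parameters explicitly (no varargs)"
                    % cls.__name__
                )
        return sorted(p.name for p in params if p.kind is not p.VAR_KEYWORD)

    def get_params(self):
        # Each parameter is expected to be stored under its own name.
        return {name: getattr(self, name) for name in self._param_names()}


def clone(estimator):
    # Rebuild an unfitted copy from the constructor parameters alone.
    return type(estimator)(**estimator.get_params())


class Explicit(MiniEstimator):
    def __init__(self, max_df=1.0, binary=False):
        self.max_df = max_df
        self.binary = binary


class Varargs(MiniEstimator):
    def __init__(self, *args, **kwargs):
        self.kwargs = kwargs


est = Explicit(max_df=0.5, binary=True)
print(clone(est).get_params())
```

Calling Varargs().get_params() in this sketch raises RuntimeError, which mirrors the situation in the question: with *args in the signature there is nothing to list, so nothing can be searched over or copied.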
[Scikit-learn-general] Subclassing vectorizers
Hello list,

Firstly, thanks for this incredible package; I use it daily at work. Now on to the meat: I'm trying to subclass TfidfVectorizer and running into issues. I want to specify an extra param for __init__() that points to a file that gets used in build_analyzer(). Skipping irrelevant bits, I've got the following:

#==
class WordCooccurrenceVectorizer(TfidfVectorizer):

    ### override __init__ to add w2v_clusters arg
    # see http://stackoverflow.com/questions/2215923/avoid-specifying-all-arguments-in-a-subclass
    # for explanation of syntax
    def __init__(self, *args, **kwargs):
        try:
            self.w2v_cluster_path = kwargs.pop("w2v_clusters")
        except KeyError:
            pass
        super(WordCooccurrenceVectorizer, self).__init__(*args, **kwargs)

    def build_analyzer(self):
        preprocess = self.build_preprocessor()
        stopwords = self.get_stop_words()
        w2v_clusters = self.load_w2v_clusters()
        tokenize = self.build_tokenizer()
        return lambda doc: self._nwise(tokenize(preprocess(self.decode(doc))), stopwords, w2v_clusters)

    [...]
#==

I can instantiate this, but when I want to inspect it, I get the following (this is in ipython; in a script it just hangs):

#==
In [2]: vec = WordCooccurrenceVectorizer(ngram_range=(2,2), stop_words="english", max_df=0.5, min_df=1, max_features=1, w2v_clusters="clusters.20160322_1803.w2v", binary=True)

In [3]: vec
Out[3]:
---
RuntimeError                              Traceback (most recent call last)
/Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
    697                 type_pprinters=self.type_printers,
    698                 deferred_pprinters=self.deferred_printers)
--> 699             printer.pretty(obj)
    700             printer.flush()
    701             return stream.getvalue()

[...]

/Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/sklearn/base.pyc in _get_param_names(cls)
    193                 " %s with constructor %s doesn't "
    194                 " follow this convention."
--> 195                 % (cls, init_signature))
    196         # Extract and sort argument names excluding 'self'
    197         return sorted([p.name for p in parameters])

RuntimeError: scikit-learn estimators should always specify their parameters in the signature of their __init__ (no varargs). <class 'cooc_vectorizer.WordCooccurrenceVectorizer'> with constructor (<self>, *args, **kwargs) doesn't follow this convention.

In [4]:
#==

The error is clear enough -- I can't use *args and **kwargs in a sklearn estimator's __init__() -- but I'm not sure what the correct way is to do what I need to do. Do I literally need to specify all of the __init__ params in my subclass and then pass them on to the __init__ of super()? If so, what's the reason for setting this up this way?

Thanks for any pointers/guidance,
Fred.
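On the closing question: under the convention the error message describes, the usual pattern is indeed to spell the parameters out in the subclass and forward them to the parent. A minimal sketch, assuming Python 3 and a hypothetical BaseVectorizer stand-in with only three parameters (the real TfidfVectorizer has many more, and its signature differs between versions); param_names is likewise a made-up helper that only mimics reading names from the signature.

```python
import inspect


class BaseVectorizer:
    """Hypothetical stand-in for TfidfVectorizer with only three parameters."""

    def __init__(self, ngram_range=(1, 1), stop_words=None, binary=False):
        self.ngram_range = ngram_range
        self.stop_words = stop_words
        self.binary = binary


class WordCooccurrenceVectorizer(BaseVectorizer):
    def __init__(self, w2v_clusters=None, ngram_range=(1, 1),
                 stop_words=None, binary=False):
        # Store the new argument under the same name as the keyword.
        self.w2v_clusters = w2v_clusters
        # Forward every inherited parameter explicitly, by keyword.
        super().__init__(ngram_range=ngram_range, stop_words=stop_words,
                         binary=binary)


def param_names(cls):
    # List the constructor's parameter names, excluding `self`.
    sig = inspect.signature(cls.__init__)
    return sorted(name for name in sig.parameters if name != "self")


vec = WordCooccurrenceVectorizer(w2v_clusters="clusters.w2v",
                                 ngram_range=(2, 2), binary=True)
print(param_names(WordCooccurrenceVectorizer))
```

The repetition is the price of making every parameter visible by name: once the signature is explicit, the new argument is listed next to the inherited ones without any special handling.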