Re: [Scikit-learn-general] Pickling custom Transformers in a Pipeline

2016-04-05 Thread Fred Mailhot
Mueller <t3k...@gmail.com> wrote: > What's the type of self.custom? > > Also, you can step into the debugger to see which function it is that can > not be pickled. > > > > > On 04/05/2016 04:14 PM, Fred Mailhot wrote: > > Hi all, > > I've got a

[Scikit-learn-general] Pickling custom Transformers in a Pipeline

2016-04-05 Thread Fred Mailhot
Hi all, I've got a pipeline with some custom transformers that's not pickling, and I'm not sure why. I've had this previously when using custom preprocessors & tokenizers with CountVectorizers. I dealt with it then by defining the custom bits at the module level. I assumed I could avoid that by
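
The module-level workaround mentioned in this message follows from how pickle stores functions: by importable name rather than by value. Below is a minimal sketch using only the standard library. The names `whitespace_tokenizer` and `can_pickle` are hypothetical and chosen for illustration; they are not part of scikit-learn, and the original pipeline is not shown in the message.

```python
import pickle


def whitespace_tokenizer(text):
    """Module-level function: pickle stores a reference to it by name."""
    return text.split()


def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:  # typically pickle.PicklingError or AttributeError
        return False


# A lambda has no importable name, so pickle cannot store a reference to it.
inline_tokenizer = lambda text: text.split()

print("module-level function picklable:", can_pickle(whitespace_tokenizer))
print("lambda picklable:", can_pickle(inline_tokenizer))
```

The same check can be applied to each attribute of a custom transformer to narrow down which one blocks pickling.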

Re: [Scikit-learn-general] Announcing lightning v0.1

2016-03-25 Thread Fred Mailhot
I imagine a lot of people might be interested in this, but be in a position where they need to justify bringing in a new package that mimics sklearn, rather than just using the linear models that are already available there. Could you say a bit more about how/why this is better? Thanks! Fred. On

Re: [Scikit-learn-general] Subclassing vectorizers

2016-03-23 Thread Fred Mailhot
n overload get_params to define your own > parameter listing. See > http://scikit-learn.org/stable/developers/contributing.html#get-params-and-set-params > > > > On 23 March 2016 at 14:45, Fred Mailhot <fred.mail...@gmail.com> wrote: > > Hello list, > > >

[Scikit-learn-general] Subclassing vectorizers

2016-03-22 Thread Fred Mailhot
Hello list, Firstly, thanks for this incredible package; I use it daily at work. Now on to the meat: I'm trying to subclass TfidfVectorizer and running into issues. I want to specify an extra param for __init__() that points to a file that gets used in build_analyzer(). Skipping irrelevant bits,

Re: [Scikit-learn-general] [TfidfVectorizer problem]

2015-11-19 Thread Fred Mailhot
Have you checked that your other program tokenizes the same way as the default sklearn tokenization? On 19 November 2015 at 11:09, Ehsan Asgari wrote: > Hi, > > Thank you, but it didn't work. > I checked len(tf.vocabulary_) and it is also 1900 instead of 1914. > I have
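
One possible source of such a vocabulary mismatch is the default `token_pattern` of scikit-learn's text vectorizers, `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters. The sketch below uses the standard library alone to compare that pattern with plain whitespace splitting. The sample text and helper names are invented for illustration, and nothing here is taken from the poster's data.

```python
import re

# Default token_pattern used by scikit-learn's CountVectorizer/TfidfVectorizer.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def sklearn_style_tokens(text):
    """Tokenize the way the default pattern does (after lowercasing)."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())


def whitespace_tokens(text):
    """Tokenize by splitting on whitespace only."""
    return text.lower().split()


sample = "a b cd efg I am x"
vocab_default = set(sklearn_style_tokens(sample))
vocab_whitespace = set(whitespace_tokens(sample))

# Terms that whitespace splitting keeps but the default pattern drops.
print(sorted(vocab_whitespace - vocab_default))
```

If single-character terms account for the difference, passing a custom `token_pattern` or `tokenizer` to the vectorizer is one way to align the two vocabularies.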

Re: [Scikit-learn-general] [TfidfVectorizer problem]

2015-11-19 Thread Fred Mailhot
but actually there is no punctuation in my text, only space between >> terms. >> >> Best, >> Ehsan >> >> >> On Thu, Nov 19, 2015 at 11:13 AM, Fred Mailhot <fred.mail...@gmail.com> >> wrote: >> >>> Have you checked that your other p

Re: [Scikit-learn-general] Library of pre-trained models

2015-07-01 Thread Fred Mailhot
. FM. On 1 July 2015 at 11:42, Lars Buitinck larsm...@gmail.com wrote: 2015-07-01 16:27 GMT+02:00 Fred Mailhot fred.mail...@gmail.com: 2) The gensim implementation predates the patenting Does that matter?

Re: [Scikit-learn-general] Library of pre-trained models

2015-07-01 Thread Fred Mailhot
1) The upshot seems to be that it's a defensive patent, and in any case the code was released under Apache 2.0, so it's fine to use. https://code.google.com/p/word2vec/ https://groups.google.com/forum/#!topic/word2vec-toolkit/1hID9F74_Ho 2) The gensim implementation predates the patenting

Re: [Scikit-learn-general] Library of pre-trained models

2015-07-01 Thread Fred Mailhot
://www.google.com/patents/US9037464 Filed on 15 March 2013 On Thu, Jul 2, 2015 at 4:03 AM, Matthieu Brucher matthieu.bruc...@gmail.com wrote: 2015-07-01 19:43 GMT+01:00 Andreas Mueller t3k...@gmail.com: On 07/01/2015 02:42 PM, Lars Buitinck wrote: 2015-07-01 16:27 GMT+02:00 Fred Mailhot

Re: [Scikit-learn-general] Library of pre-trained models

2015-06-30 Thread Fred Mailhot
Tangent: Are we even allowed to use word2vec anymore, now that Goog has patented it? (in any case, I'll be looking a bit more closely at GloVe) F. On 30 June 2015 at 19:26, Mathieu Blondel math...@mblondel.org wrote: For unsupervised models that take a long time to train, such as deep

Re: [Scikit-learn-general] issue with custom regressor in the pipeline

2015-05-19 Thread Fred Mailhot
Parenthesis error in the estimators list? estimators = [('my_regressor', myRegressor(blahblah)), ...] On 19 May 2015 at 15:47, Pagliari, Roberto rpagli...@appcomsci.com wrote: I'm trying to add a custom regressor to a pipeline. For debugging purposes I commented

[Scikit-learn-general] Grid searching over FeatureUnion.transformer_weights

2015-05-19 Thread Fred Mailhot
Hi all, It appears that FeatureUnion.transformer_weights isn't exposed by the get_params() method, which in turn means that it isn't grid-searchable, which seems unfortunate to me (I've had cause to do so manually recently, and wished it could be automated). Is this something that other people

Re: [Scikit-learn-general] Integrating HashingVectorizer into Pipeline

2015-05-07 Thread Fred Mailhot
I think possibly you want the TfidfTransformer, *before* the HashingVectorizer...BUT...the documentation for the HashingVectorizer appears to discount the possibility of IDF-weighting: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html On 7
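
For reference, the HashingVectorizer is stateless and so has no IDF step of its own, but a TfidfTransformer can be fitted on its output. The following is a minimal sketch of that arrangement, assuming a reasonably recent scikit-learn (the `alternate_sign` parameter does not exist in very old releases). The documents and the feature count are made up for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for this sketch.
docs = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog jumps",
]

# Hash raw term counts first, then learn IDF weights on the hashed columns.
hashing_tfidf = make_pipeline(
    HashingVectorizer(n_features=2**10, norm=None, alternate_sign=False),
    TfidfTransformer(),
)

X = hashing_tfidf.fit_transform(docs)
print(X.shape)
```

The IDF weights are learned per hashed column, so colliding terms share a weight; whether that is acceptable depends on the application.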

Re: [Scikit-learn-general] Re : Pull Request : Renyi entropy and Cauchy-Schwartz mutual information

2015-02-23 Thread Fred Mailhot
A good MI-based feature selector would be welcome, I think. Well, by me, anyway. On 23 February 2015 at 09:37, Andy t3k...@gmail.com wrote: Hi Cecilia. An MI estimate currently seems a bit out of scope of sklearn. What context would a user apply it in? Sklearn currently contains more

Re: [Scikit-learn-general] NIPS

2014-11-18 Thread Fred Mailhot
I'm going to be at the ML+NLP workshop. On 18 November 2014 07:32, Mathieu Blondel math...@mblondel.org wrote: Hi, Anyone from the mailing-list going to NIPS this year? See you there, Mathieu

Re: [Scikit-learn-general] Sensitivity analysis

2014-01-23 Thread Fred Mailhot
Is your aim to use this information for feature selection, or do you actually want to see which features are being maximally weighted? There's a SO question that addresses the latter use: http://stackoverflow.com/questions/6697/how-to-get-most-informative-features-for-scikit-learn-classifiers

Re: [Scikit-learn-general] K Nearest Neighbour with 3d array and custom distance metric

2014-01-10 Thread Fred Mailhot
There are a few implementations of DTW in Cython floating around...I think mblondel has one. Maybe you could tweak one of these and see whether it yields a useful speed-up? https://github.com/SnippyHolloW/DTW_Cython http://www.mblondel.org/journal/2009/08/31/dynamic-time-warping-theory/
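
As a baseline for judging any Cython speed-up, here is a plain-Python sketch of the classic dynamic time warping recurrence on 1-D sequences with absolute difference as the local cost. It is a textbook formulation written for illustration, not the code from either linked implementation, and `dtw_distance` is a hypothetical name.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] also matched to an earlier b
                cost[i][j - 1],      # b[j-1] also matched to an earlier a
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

The double loop is O(n*m) per pair, which is why a compiled version tends to matter when the function is used as a nearest-neighbour metric.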

Re: [Scikit-learn-general] Save trained classifier

2013-12-19 Thread Fred Mailhot
On 19 December 2013 15:16, Olivier Grisel olivier.gri...@ensta.org wrote: [...] But on the other hand that makes it possible to [...] to memory map the large parameter arrays by passing mmap_mode='r' to joblib.load for instance. Memory mapping can be useful to share the memory of models
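
The memory-mapping option mentioned here can be tried in a few lines. This is a minimal sketch, assuming joblib and NumPy are installed; the dictionary stands in for a fitted model with a large parameter array, and the file name is arbitrary.

```python
import os
import tempfile

import joblib
import numpy as np

# Stand-in for a fitted estimator holding a large parameter array.
model_like = {"coef": np.arange(1_000_000, dtype=np.float64)}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "model.joblib")
    joblib.dump(model_like, path)  # uncompressed, so arrays can be memory-mapped

    # mmap_mode="r" maps the array data read-only instead of copying it into RAM.
    loaded = joblib.load(path, mmap_mode="r")
    is_memmap = isinstance(loaded["coef"], np.memmap)
    first_values = loaded["coef"][:3].tolist()
    del loaded  # release the mapping before the temporary directory is removed

print(is_memmap, first_values)
```

Memory mapping only applies to uncompressed dumps, so the `compress` option of `joblib.dump` should be left at its default for this use.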

Re: [Scikit-learn-general] Feature Filtering

2013-10-15 Thread Fred Mailhot
Use the same DictVectorizer that you called fit_transform() on with the training data, but just call transform() for the test data... dv = DictVectorizer() train_feats = dv.fit_transform(train_feature_dict) test_feats = dv.transform(test_feature_dict) On 15 October 2013 03:52, Lars Buitinck
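
The advice above, expanded into a runnable sketch. The feature dictionaries are invented for illustration; the point is that the test matrix gets the same columns as the training matrix, and feature values unseen during fitting are silently dropped.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy data, made up for this sketch.
train_feature_dicts = [
    {"color": "red", "size": 1.0},
    {"color": "blue", "size": 2.0},
]
test_feature_dicts = [{"color": "green", "size": 3.0}]  # "green" was never seen

dv = DictVectorizer()
train_feats = dv.fit_transform(train_feature_dicts)  # learns the feature mapping
test_feats = dv.transform(test_feature_dicts)        # reuses it unchanged

print(train_feats.shape, test_feats.shape)
```

Calling `fit_transform()` on the test data instead would build a different mapping, and the columns would no longer line up with what the classifier was trained on.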

Re: [Scikit-learn-general] HMM with von Mises Emmissions

2013-10-14 Thread Fred Mailhot
On 14 October 2013 20:48, Robert McGibbon rmcgi...@gmail.com wrote: [...] p.s. core devs: pretty please don't remove the HMM code from the scikit :) +1E6

[Scikit-learn-general] EMNLP?

2013-09-25 Thread Fred Mailhot
Hi list, Just wondering whether anyone on here is planning on attending EMNLP. I'll be there, and as a heavy user (and hopeful eventual contributor), I'd love to meet with some of you. Fred.

Re: [Scikit-learn-general] Representing classifiers outside of Python

2013-09-23 Thread Fred Mailhot
FYI, I've used sklearn's LogisticRegression in an online/real-time text classification app without having to dig into the internals and gotten ~2.5ms response time (including vectorizing; vocab size ~200k). On 23 September 2013 06:37, Peter Prettenhofer peter.prettenho...@gmail.com wrote: We

Re: [Scikit-learn-general] Vectorization/tokenization question...

2013-07-19 Thread Fred Mailhot
Oh, right (duh)...I wasn't thinking clearly about the padding for char_wb. I'll do some tests with stopword removal for char_wb and submit a PR if it looks worthwhile. Cheers, Fred. On 19 July 2013 13:27, Olivier Grisel olivier.gri...@ensta.org wrote: 2013/7/19 Fred Mailhot fred.mail

[Scikit-learn-general] Vectorization/tokenization question...

2013-07-19 Thread Fred Mailhot
Hello list... I'm a huge fan of sklearn and use it daily at work. I was confused by the results of some recent text classification experiments and started looking more closely at the vectorization code. I'm wondering about the logic behind: 1) not doing stopword removal for the char_wb analyzer

Re: [Scikit-learn-general] Text processing using nltk, sklearn and pandas

2013-07-12 Thread Fred Mailhot
On 12 July 2013 09:48, Lars Buitinck l.j.buiti...@uva.nl wrote: 2013/7/11 Tom Fawcett tom.fawc...@gmail.com: [...] I guess because it's terribly slow. I recently tried to cluster a sample of Wikipedia text at the word level. What kind of results did you get? I did some work recently

[Scikit-learn-general] Sklearn book?

2013-02-11 Thread Fred Mailhot
Hi list, Is anyone working on a book showcasing scikit-learn? I'm thinking something along the lines of Mahout In Action, that would showcase each of the parts of scikit-learn and provide a dead-tree reference with a lot of worked-out examples. I suppose it would make sense to wait for a 1.0

Re: [Scikit-learn-general] Sklearn book?

2013-02-11 Thread Fred Mailhot
and basically not making any money (From what I read, writing an O'Reilly book pays less than any research position). So I don't see that happening soon. Cheers, Andy On 02/11/2013 06:22 PM, Fred Mailhot wrote: Hi list, Is anyone working on a book showcasing scikit-learn? I'm thinking

Re: [Scikit-learn-general] Error when chosing large number of clusters

2013-02-01 Thread Fred Mailhot
I just had the same issue recently. It's been fixed in the dev (0.14) branch. If you pull/build/install that, everything should be fine. F. On 1 February 2013 13:40, Vinay B, vybe3...@gmail.com wrote: From the scikit script at http://scikit-learn.org/dev/_downloads/document_clustering.py ,

Re: [Scikit-learn-general] Text document clustering: How can I access the actual clustered documents

2013-01-31 Thread Fred Mailhot
Given a fitted KMeans named km, and a numpy array of documents, to get a list of documents associated with cluster i: documents[np.where(km.labels_ == i)] Not sure what you mean by a list of cluster terms, though (a list of all terms from all docs associated with a given cluster?)... On 31
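
The indexing expression above can be tried without fitting anything. In this sketch the `labels` array is a stand-in for `km.labels_`, and both the documents and the helper name `documents_in_cluster` are hypothetical.

```python
import numpy as np

documents = np.array(["doc a", "doc b", "doc c", "doc d"])
labels = np.array([0, 1, 0, 2])  # stand-in for km.labels_ of a fitted KMeans


def documents_in_cluster(documents, labels, i):
    """Return the documents whose cluster label equals i."""
    return documents[np.where(labels == i)]


cluster_0 = documents_in_cluster(documents, labels, 0)
print(cluster_0)
```

This relies on `documents` being a NumPy array; a plain Python list would need a list comprehension over `labels` instead.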

Re: [Scikit-learn-general] GridSearch example

2012-11-16 Thread Fred Mailhot
Thanks to all for the tips on GridSearch with FeatureUnion, I'll be trying those out today. And @amueller I've been following the development of your PR for the random sampling of param space with great interest. But back to the initial problem...it seems that an empty input is the cause. My raw

Re: [Scikit-learn-general] GridSearch example

2012-11-16 Thread Fred Mailhot
learning with Scikit? I have a data set that is 20gb that I want to train on I don't think I can do that easily, so what should I do? Thanks, Shomiron Ghose On 15 November 2012 15:45, Fred Mailhot fred.mail...@gmail.com wrote: Dear list, I'm using GridSearchCV to do some simple model

Re: [Scikit-learn-general] GridSearch example

2012-11-16 Thread Fred Mailhot
On 15 November 2012 23:20, Andreas Mueller amuel...@ais.uni-bonn.de wrote: [...] You can give GridSearchCV not only a grid but also a list of grids. I would go with that. (is that sufficiently documented?) This doesn't appear to be documented (at least not at
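
A list of grids, as suggested in the quoted reply, lets each dictionary carry only the parameters that make sense together. The sketch below assumes a recent scikit-learn, where `ParameterGrid` lives in `sklearn.model_selection` (at the time of this thread the module was `sklearn.grid_search`). The SVM-style parameter names are illustrative and not taken from the thread.

```python
from sklearn.model_selection import ParameterGrid

# Two separate grids: gamma is only searched together with the rbf kernel.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1]},
]

# ParameterGrid enumerates the same candidates GridSearchCV would try.
candidates = list(ParameterGrid(param_grid))
for params in candidates:
    print(params)
```

The same `param_grid` list can be passed directly as the `param_grid` argument of `GridSearchCV`.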

Re: [Scikit-learn-general] GridSearch example

2012-11-15 Thread Fred Mailhot
to n_jobs, not a specific classifier? Could you run with n_jobs=1 and a very small training set (like 100 examples or something) and see if it runs through? (Actually I'm totally clueless but that doesn't look like a multiprocessing error to me) On 11/15/2012 10:06 PM, Fred Mailhot wrote

Re: [Scikit-learn-general] Online learning

2012-07-14 Thread Fred Mailhot
On 14 July 2012 04:22, Olivier Grisel olivier.gri...@ensta.org wrote: 2012/7/13 Abhi kolhe_a...@yahoo.co.in: Hello, My problem is to classify a set of 200k+ emails into approx. 2800 categories. Currently the method I am using is calculating tfidf and using LinearSVC() [with a good

[Scikit-learn-general] SGDClassifier(loss=log)...

2012-06-17 Thread Fred Mailhot
Dear all, Just *bump*ing my last two questions. Apologies if this is considered poor etiquette... Thanks! -- Forwarded message -- From: Fred Mailhot fred.mail...@gmail.com Date: 15 June 2012 17:22 [...] 1) I'd like to compute the class probs; are the probs for the individual

[Scikit-learn-general] LogisticRegression versus SGDClassifier(loss=log)?

2012-06-15 Thread Fred Mailhot
Dear all, What are the advantages of choosing one of the Subject line classifiers over the other? At a quick glance, I see the following: - LogisticRegression implements predict_proba for the multiclass case, while SGDClassifier doesn't - SGDClassifier(loss=log) lets you specify multiple CPUs