Mueller <t3k...@gmail.com> wrote:
> What's the type of self.custom?
>
> Also, you can step into the debugger to see which function it is that can
> not be pickled.
>
>
>
>
> On 04/05/2016 04:14 PM, Fred Mailhot wrote:
>
> Hi all,
>
> I've got a
Hi all,
I've got a pipeline with some custom transformers that's not pickling, and
I'm not sure why. I've had this previously when using custom preprocessors
& tokenizers with CountVectorizers. I dealt with it then by defining the
custom bits at the module level.
I assumed I could avoid that by
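
The module-level workaround mentioned above can be checked without scikit-learn at all, since the limitation comes from pickle itself: functions are pickled by qualified name, so lambdas and functions defined inside another function cannot be looked up again. A minimal sketch, assuming the custom bits are plain Python callables (lowercase_tokens, make_local_tokenizer and is_picklable are hypothetical names, not taken from the original pipeline):

```python
import pickle


def lowercase_tokens(text):
    """Hypothetical module-level tokenizer; pickle stores it by qualified name."""
    return text.lower().split()


def make_local_tokenizer():
    """Return a function defined inside another function."""
    def local_tokens(text):
        return text.lower().split()
    return local_tokens


def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


if __name__ == "__main__":
    # Report which kinds of callables survive pickling.
    print("module-level function:", is_picklable(lowercase_tokens))
    print("lambda:", is_picklable(lambda text: text.split()))
    print("nested function:", is_picklable(make_local_tokenizer()))
```

Running is_picklable over each callable attribute of a transformer is one way to find the offending one without stepping through the debugger.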
I imagine a lot of people might be interested in this, but be in a position
where they need to justify bringing in a new package that mimics sklearn,
rather than just using the linear models that are already available there.
Could you say a bit more about how/why this is better?
Thanks!
Fred.
On
n overload get_params to define your own
> parameter listing. See
> http://scikit-learn.org/stable/developers/contributing.html#get-params-and-set-params
> >
> > On 23 March 2016 at 14:45, Fred Mailhot <fred.mail...@gmail.com> wrote:
> > Hello list,
> >
>
Hello list,
Firstly, thanks for this incredible package; I use it daily at work. Now on
to the meat: I'm trying to subclass TfidfVectorizer and running into
issues. I want to specify an extra param for __init__() that points to a
file that gets used in build_analyzer(). Skipping irrelevant bits,
Have you checked that your other program tokenizes the same way as the
default sklearn tokenization?
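
One way to run that check without fitting anything is to compare a plain whitespace split against the default token_pattern documented for CountVectorizer and TfidfVectorizer, r"(?u)\b\w\w+\b", which only keeps tokens of two or more word characters. A minimal sketch, assuming the other program splits on whitespace (that is a guess about the other program, and the helper names and sample documents are hypothetical):

```python
import re

# Default token_pattern documented for CountVectorizer / TfidfVectorizer:
# runs of two or more word characters; single-character tokens are dropped.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def sklearn_style_vocabulary(docs):
    """Vocabulary obtained with lowercasing plus the default token pattern."""
    return {tok for doc in docs for tok in TOKEN_PATTERN.findall(doc.lower())}


def whitespace_vocabulary(docs):
    """Vocabulary obtained with lowercasing plus a plain whitespace split."""
    return {tok for doc in docs for tok in doc.lower().split()}


if __name__ == "__main__":
    docs = ["a quick test", "x marks the spot"]  # made-up sample documents
    missing = whitespace_vocabulary(docs) - sklearn_style_vocabulary(docs)
    print(sorted(missing))
```

If the terms that go missing from vocabulary_ are all one character long, passing a custom token_pattern or tokenizer to the vectorizer would be the thing to try.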
On 19 November 2015 at 11:09, Ehsan Asgari wrote:
> Hi,
>
> Thank you, but it didn't work.
> I checked len(tf.vocabulary_) and it is also 1900 instead of 1914.
> I have
but actually there is no punctuation in my text, only space between
>> terms.
>>
>> Best,
>> Ehsan
>>
>>
>> On Thu, Nov 19, 2015 at 11:13 AM, Fred Mailhot <fred.mail...@gmail.com>
>> wrote:
>>
>>> Have you checked that your other p
.
FM.
On 1 July 2015 at 11:42, Lars Buitinck larsm...@gmail.com wrote:
2015-07-01 16:27 GMT+02:00 Fred Mailhot fred.mail...@gmail.com:
2) The gensim implementation predates the patenting
Does that matter?
--
Don't
1) The upshot seems to be that it's a defensive patent, and in any case the
code was released under Apache 2.0, so it's fine to use.
https://code.google.com/p/word2vec/
https://groups.google.com/forum/#!topic/word2vec-toolkit/1hID9F74_Ho
2) The gensim implementation predates the patenting
://www.google.com/patents/US9037464
Filed on 15 March 2013
On Thu, Jul 2, 2015 at 4:03 AM, Matthieu Brucher
matthieu.bruc...@gmail.com wrote:
2015-07-01 19:43 GMT+01:00 Andreas Mueller t3k...@gmail.com:
On 07/01/2015 02:42 PM, Lars Buitinck wrote:
2015-07-01 16:27 GMT+02:00 Fred Mailhot
Tangent: Are we even allowed to use word2vec anymore, now that Goog has
patented it? (in any case, I'll be looking a bit more closely at GloVe)
F.
On 30 June 2015 at 19:26, Mathieu Blondel math...@mblondel.org wrote:
For unsupervised models that take a long time to train, such as deep
Parenthesis error in the estimators list?
estimators = [('my_regressor', myRegressor(blahblah)),
...]
On 19 May 2015 at 15:47, Pagliari, Roberto rpagli...@appcomsci.com wrote:
I'm trying to add a custom regressor to a pipeline.
For debugging purposes I commented
Hi all,
It appears that FeatureUnion.transformer_weights isn't exposed by the
get_params() method, which in turn means that it isn't grid-searchable,
which seems unfortunate to me (I've had cause to do so manually recently,
and wished it could be automated).
Is this something that other people
I think possibly you want the TfidfTransformer, *before* the
HashingVectorizer...BUT...the documentation for the HashingVectorizer
appears to discount the possibility of IDF-weighting:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
On 7
A good MI-based feature selector would be welcome, I think. Well, by me,
anyway.
On 23 February 2015 at 09:37, Andy t3k...@gmail.com wrote:
Hi Cecilia.
An MI estimate currently seems a bit out of scope of sklearn.
What context would a user apply it in?
Sklearn currently contains more
I'm going to be at the ML+NLP workshop.
On 18 November 2014 07:32, Mathieu Blondel math...@mblondel.org wrote:
Hi,
Anyone from the mailing-list going to NIPS this year?
See you there,
Mathieu
Is your aim to use this information for feature selection, or do you
actually want to see which features are being maximally weighted? There's a
SO question that addresses the latter use:
http://stackoverflow.com/questions/6697/how-to-get-most-informative-features-for-scikit-learn-classifiers
There are a few implementations of DTW in Cython floating around...I think
mblondel has one. Maybe you could tweak one of these and see whether it
yields a useful speed-up?
https://github.com/SnippyHolloW/DTW_Cython
http://www.mblondel.org/journal/2009/08/31/dynamic-time-warping-theory/
On 19 December 2013 15:16, Olivier Grisel olivier.gri...@ensta.org wrote:
[...]
But on the other hand that makes it possible to [...] to memory map the
large parameter
arrays by passing mmap_mode='r' to joblib.load for instance.
Memory mapping can be useful to share the memory of models
Use the same DictVectorizer that you called fit_transform() on with the
training data, but just call transform() for the test data...
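from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer()
# learn the feature-to-column mapping from the training data only
train_feats = dv.fit_transform(train_feature_dict)
# reuse that mapping for the test data; features unseen in training are ignored
test_feats = dv.transform(test_feature_dict)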
On 15 October 2013 03:52, Lars Buitinck
On 14 October 2013 20:48, Robert McGibbon rmcgi...@gmail.com wrote:
[...]
p.s. core devs: pretty please don't remove the HMM code from the scikit :)
+1E6
Hi list,
Just wondering whether anyone on here is planning on attending EMNLP. I'll
be there, and as a heavy user (and hopeful eventual contributor), I'd love
to meet with some of you.
Fred.
FYI, I've used sklearn's LogisticRegression in an online/real-time text
classification app without having to dig into the internals and gotten
~2.5ms response time (including vectorizing; vocab size ~200k).
On 23 September 2013 06:37, Peter Prettenhofer peter.prettenho...@gmail.com
wrote:
We
Oh, right (duh)...I wasn't thinking clearly about the padding for char_wb.
I'll do some tests with stopword removal for char_wb and submit a PR if it
looks worthwhile.
Cheers,
Fred.
On 19 July 2013 13:27, Olivier Grisel olivier.gri...@ensta.org wrote:
2013/7/19 Fred Mailhot fred.mail
Hello list...
I'm a huge fan of sklearn and use it daily at work. I was confused by the
results of some recent text classification experiments and started looking
more closely at the vectorization code.
I'm wondering about the logic behind:
1) not doing stopword removal for the char_wb analyzer
On 12 July 2013 09:48, Lars Buitinck l.j.buiti...@uva.nl wrote:
2013/7/11 Tom Fawcett tom.fawc...@gmail.com:
[...]
I guess because it's terribly slow. I recently tried to cluster a
sample of Wikipedia text at the word level.
What kind of results did you get? I did some work recently
Hi list,
Is anyone working on a book showcasing scikit-learn? I'm thinking something
along the lines of Mahout In Action, that would showcase each of the
parts of scikit-learn and provide a dead-tree reference with a lot of
worked-out examples. I suppose it would make sense to wait for a 1.0
and basically not making any
money (From what I read, writing an O'Reilly book
pays less than any research position).
So I don't see that happening soon.
Cheers,
Andy
On 02/11/2013 06:22 PM, Fred Mailhot wrote:
Hi list,
Is anyone working on a book showcasing scikit-learn? I'm thinking
I just had the same issue recently. It's been fixed in the dev (0.14)
branch. If you pull/build/install that, everything should be fine.
F.
On 1 February 2013 13:40, Vinay B, vybe3...@gmail.com wrote:
From the scikit script at
http://scikit-learn.org/dev/_downloads/document_clustering.py ,
Given a fitted KMeans named km, and a numpy array of documents, to get a
list of documents associated with cluster i:
documents[np.where(km.labels_ == i)]
Not sure what you mean by a list of cluster terms, though (a list of all
terms from all docs associated with a given cluster?)...
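
To get the documents for every cluster at once rather than one i at a time, the same selection can be written as a grouping. A minimal sketch in plain Python, assuming labels is a sequence such as km.labels_ with one cluster id per document (documents_by_cluster is a hypothetical helper, not part of scikit-learn):

```python
from collections import defaultdict


def documents_by_cluster(documents, labels):
    """Group documents by cluster label, preserving document order."""
    if len(documents) != len(labels):
        raise ValueError("documents and labels must have the same length")
    clusters = defaultdict(list)
    for doc, label in zip(documents, labels):
        clusters[label].append(doc)
    return dict(clusters)


if __name__ == "__main__":
    docs = ["doc a", "doc b", "doc c", "doc d"]  # made-up documents
    labels = [0, 1, 0, 1]                        # stand-in for km.labels_
    for cluster_id, members in sorted(documents_by_cluster(docs, labels).items()):
        print(cluster_id, members)
```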
On 31
Thanks to all for the tips on GridSearch with FeatureUnion, I'll be trying
those out today. And @amueller I've been following the development of your
PR for the random sampling of param space with great interest.
But back to the initial problem...it seems that an empty input is the
cause. My raw
learning with Scikit? I have a data set that is
20 GB that I want to train on. I don't think I can do that easily, so
what should I do?
Thanks,
Shomiron Ghose
On 15 November 2012 15:45, Fred Mailhot fred.mail...@gmail.com wrote:
Dear list,
I'm using GridSearchCV to do some simple model
On 15 November 2012 23:20, Andreas Mueller amuel...@ais.uni-bonn.de wrote:
[...]
You can give GridSearchCV not only a grid but also a list of grids.
I would go with that.
(is that sufficiently documented?)
This doesn't appear to be documented (at least not at
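
For reference, a list of grids means a list of dicts: each dict is expanded into its own Cartesian product, and the search covers the union of those products. A minimal sketch of that enumeration in plain Python, as a stand-in for what GridSearchCV does with such a list and not scikit-learn's own code (expand_grids is a hypothetical helper, and the parameter names and values are only illustrative):

```python
from itertools import product


def expand_grids(param_grids):
    """Enumerate every parameter combination from a grid or a list of grids."""
    if isinstance(param_grids, dict):
        param_grids = [param_grids]
    candidates = []
    for grid in param_grids:
        names = sorted(grid)
        # Cartesian product within one grid; grids are never crossed with each other.
        for values in product(*(grid[name] for name in names)):
            candidates.append(dict(zip(names, values)))
    return candidates


# Illustrative grids: gamma is only searched together with the rbf kernel.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.001, 0.0001]},
]

if __name__ == "__main__":
    for candidate in expand_grids(param_grid):
        print(candidate)
```

The same list of dicts would be passed as the param_grid argument; the point of splitting it up is that parameters which only make sense together are not crossed with the rest.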
to n_jobs, not a specific classifier?
Could you run with n_jobs=1 and a very small training set (like 100
examples or something)
and see if it runs through?
(Actually I'm totally clueless but that doesn't look like a
multiprocessing error to me)
On 11/15/2012 10:06 PM, Fred Mailhot wrote
On 14 July 2012 04:22, Olivier Grisel olivier.gri...@ensta.org wrote:
2012/7/13 Abhi kolhe_a...@yahoo.co.in:
Hello,
My problem is to classify a set of 200k+ emails into approx. 2800
categories.
Currently the method I am using is calculating tfidf and using
LinearSVC()
[with a good
Dear all,
Just *bump*ing my last two questions. Apologies if this is considered poor
etiquette...
Thanks!
-- Forwarded message --
From: Fred Mailhot fred.mail...@gmail.com
Date: 15 June 2012 17:22
[...]
1) I'd like to compute the class probs; are the probs for the individual
Dear all,
What are the advantages of choosing one of the Subject line classifiers
over the other? At a quick glance, I see the following:
- LogisticRegression implements predict_proba for the multiclass case,
while SGDClassifier doesn't
- SGDClassifier(loss='log') lets you specify multiple CPUs