Here's what I have so far:
http://pastie.org/6464655
It's about 40% faster.
I still need to add the fixed vocabulary option and parallelize.
2013/3/11 Lars Buitinck :
> 2013/3/11 Olivier Grisel :
>> 2013/3/11 Roman Sinayev :
>>> I got CountVectorizer about 2x faster without multiprocessing so far,
>>> however I have a couple of questions.
>
> I'm curious how you pulled that off.
>
>>> 1. Why do we not use max_df and min_df and max_featu
2013/3/11 Olivier Grisel :
> 2013/3/11 Roman Sinayev :
>> I got CountVectorizer about 2x faster without multiprocessing so far,
>> however I have a couple of questions.
I'm curious how you pulled that off.
>> 1. Why do we not use max_df and min_df and max_features when custom
>> vocabulary is pro
2013/3/11 Roman Sinayev :
> I got CountVectorizer about 2x faster without multiprocessing so far,
> however I have a couple of questions.
>
> 1. Why do we not use max_df and min_df and max_features when custom
> vocabulary is provided?
> Some people may provide a huge vocabulary, but they wouldn't
I got CountVectorizer about 2x faster without multiprocessing so far,
however I have a couple of questions.
1. Why do we not use max_df and min_df and max_features when custom
vocabulary is provided?
Some people may provide a huge vocabulary, but they wouldn't be
interested in some words if they'r
> That doesn't mean you should try, though ;)
I believe Andy meant that it doesn't mean you *shouldn't* try :)
2013/3/7 Andreas Mueller :
> On 03/07/2013 09:40 AM, Roman Sinayev wrote:
>> I tried but TfIDF is slow after the vectorization. The other thing
>> was since it is stateless, wouldn't transformation of a test corpus
>> followed by tfidf result in a totally different matrix? You won't
>> know which
On 03/07/2013 09:40 AM, Roman Sinayev wrote:
> I tried but TfIDF is slow after the vectorization. The other thing
> was since it is stateless, wouldn't transformation of a test corpus
> followed by tfidf result in a totally different matrix? You won't
> know which words are responsible for what.
I tried, but TF-IDF is slow after the vectorization. The other issue
is that, since it is stateless, wouldn't transforming a test corpus
and then applying TF-IDF result in a totally different matrix? You won't
know which words are responsible for what.
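
To make the statelessness question concrete, here is a minimal sketch of
the hashing trick in plain Python. It assumes the stateless vectorizer
under discussion is a hashing one. It is not scikit-learn's
implementation: the md5 hash, the whitespace tokenizer, the tiny
N_FEATURES value, and the function names are all stand-ins chosen for
illustration. The point it shows is that a token's column depends only
on the token itself, so train and test matrices share the same column
layout, but the original words cannot be recovered from a column index.

```python
import hashlib

# Hypothetical, deliberately tiny feature space so the idea is easy to see.
N_FEATURES = 16


def column_for(token, n_features=N_FEATURES):
    """Map a token to a column index with a deterministic hash (md5 stand-in)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % n_features


def hash_vectorize(doc, n_features=N_FEATURES):
    """Return a dense count vector for one document; no vocabulary is stored."""
    row = [0] * n_features
    for token in doc.lower().split():
        row[column_for(token, n_features)] += 1
    return row


# "cat" lands in the same column for both documents, even though the
# two documents are vectorized independently.
train_row = hash_vectorize("the cat sat on the mat")
test_row = hash_vectorize("a cat and a dog")
cat_col = column_for("cat")
print(cat_col, train_row[cat_col], test_row[cat_col])
```

So the matrices stay aligned column by column. The second concern does
hold, though: distinct words can collide in one column, and there is no
stored mapping from a column back to a word.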
>On 03/07/2013 09:13 AM, Roman Sinayev wrote:
>> Thi
On 03/07/2013 09:13 AM, Roman Sinayev wrote:
This module is a crucial bottleneck in NLP problems. I am trying to
refactor it and also make it parallel across documents with Python's
multiprocessing module. Is anyone else working on this?
If this is your bottleneck, you should consider using Hash
This module is a crucial bottleneck in NLP problems. I am trying to
refactor it and also make it parallel across documents with Python's
multiprocessing module. Is anyone else working on this?