Re: [Scikit-learn-general] CountVectorizer in feature extraction is still slow

2013-03-12 Thread Roman Sinayev
Here's what I got so far: http://pastie.org/6464655 It's about 40% faster. I still need to add the fixed vocabulary option and parallelize.

Re: [Scikit-learn-general] CountVectorizer in feature extraction is still slow

2013-03-11 Thread Olivier Grisel
2013/3/11 Lars Buitinck:
> 2013/3/11 Olivier Grisel:
>> 2013/3/11 Roman Sinayev:
>>> I got CountVectorizer about 2x faster without multiprocessing so far,
>>> however I have a couple of questions.
>
> I'm curious how you pulled that off.
>
>>> 1. Why do we not use max_df and min_df and max_featu

Re: [Scikit-learn-general] CountVectorizer in feature extraction is still slow

2013-03-11 Thread Lars Buitinck
2013/3/11 Olivier Grisel:
> 2013/3/11 Roman Sinayev:
>> I got CountVectorizer about 2x faster without multiprocessing so far,
>> however I have a couple of questions.

I'm curious how you pulled that off.

>> 1. Why do we not use max_df and min_df and max_features when custom
>> vocabulary is pro

Re: [Scikit-learn-general] CountVectorizer in feature extraction is still slow

2013-03-11 Thread Olivier Grisel
2013/3/11 Roman Sinayev:
> I got CountVectorizer about 2x faster without multiprocessing so far,
> however I have a couple of questions.
>
> 1. Why do we not use max_df and min_df and max_features when custom
> vocabulary is provided?
> Some people may provide a huge vocabulary, but they wouldn't

Re: [Scikit-learn-general] CountVectorizer in feature extraction is still slow

2013-03-11 Thread Roman Sinayev
I got CountVectorizer about 2x faster without multiprocessing so far, however I have a couple of questions.

1. Why do we not use max_df and min_df and max_features when custom vocabulary is provided? Some people may provide a huge vocabulary, but they wouldn't be interested in some words if they'r
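
To make question 1 concrete, here is a minimal pure-Python sketch of what applying document-frequency bounds to a user-supplied vocabulary could look like. It is not how CountVectorizer works internally: `prune_vocabulary` is a hypothetical helper, and treating `min_df` as an absolute document count and `max_df` as a fraction of documents is an assumption made only for this illustration.

```python
from collections import Counter


def prune_vocabulary(tokenized_docs, vocabulary, min_df=1, max_df=1.0):
    """Keep only the vocabulary terms whose document frequency is in bounds.

    Hypothetical helper for illustration. Assumes min_df is an absolute
    document count and max_df is a fraction of the number of documents.
    """
    vocab = set(vocabulary)
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # A term counts at most once per document.
        doc_freq.update(set(tokens) & vocab)
    max_count = max_df * n_docs
    return sorted(t for t in vocab if min_df <= doc_freq[t] <= max_count)


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]
vocab = ["the", "cat", "dog", "sat", "zebra"]

# "the" appears in every document, "cat", "dog" and "zebra" in fewer than two.
kept = prune_vocabulary(docs, vocab, min_df=2, max_df=0.9)
```

In this toy corpus only "sat" survives: it appears in two of the three documents, while "the" exceeds the upper bound and the remaining terms fall below the lower one.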

Re: [Scikit-learn-general] CountVectorizer in feature extraction is still slow

2013-03-08 Thread Vlad Niculae
> That doesn't mean you should try, though ;)

I believe Andy meant that it doesn't mean you *shouldn't* try :)

Re: [Scikit-learn-general] CountVectorizer in feature extraction is still slow

2013-03-07 Thread Lars Buitinck
2013/3/7 Andreas Mueller:
> On 03/07/2013 09:40 AM, Roman Sinayev wrote:
>> I tried but TfIDF is slow after the vectorization. The other thing
>> was since it is stateless, wouldn't transformation of a test corpus
>> followed by tfidf result in a totally different matrix? You won't
>> know which

Re: [Scikit-learn-general] CountVectorizer in feature extraction is still slow

2013-03-07 Thread Andreas Mueller
On 03/07/2013 09:40 AM, Roman Sinayev wrote:
> I tried but TfIDF is slow after the vectorization. The other thing
> was since it is stateless, wouldn't transformation of a test corpus
> followed by tfidf result in a totally different matrix? You won't
> know which words are responsible for what.

Re: [Scikit-learn-general] CountVectorizer in feature extraction is still slow

2013-03-07 Thread Roman Sinayev
I tried, but TF-IDF is slow after the vectorization. The other thing is, since it is stateless, wouldn't transforming a test corpus and then applying TF-IDF result in a totally different matrix? You won't know which words are responsible for what.

>On 03/07/2013 09:13 AM, Roman Sinayev wrote:
>> Thi
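
On the statelessness question, a small sketch may help, assuming the stateless vectorizer under discussion is a hashing-based one (the earlier message is truncated, so this is an assumption). The code below is a pure-Python illustration of the hashing trick, not scikit-learn's implementation; `column_for`, `hash_vectorize`, the choice of MD5 and the tiny `N_FEATURES` are all made up for the example.

```python
import hashlib

N_FEATURES = 16  # assumed and deliberately tiny; real uses pick a much larger size


def column_for(token, n_features=N_FEATURES):
    """Map a token to a column index with a fixed, deterministic hash."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % n_features


def hash_vectorize(tokens, n_features=N_FEATURES):
    """Turn a token list into a dense count row without any fitted vocabulary."""
    row = [0] * n_features
    for token in tokens:
        row[column_for(token, n_features)] += 1
    return row


# The "training" and "test" documents are vectorized independently,
# yet a shared token lands in the same column in both rows.
train_row = hash_vectorize(["slow", "vectorizer", "slow"])
test_row = hash_vectorize(["vectorizer", "fast"])
col = column_for("vectorizer")
```

Because the column depends only on the token and the table size, train and test matrices stay aligned without shared state. The second concern does hold in this scheme: a column index cannot be mapped back to a word unless that mapping is recorded separately, and distinct words can collide in one column.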

Re: [Scikit-learn-general] CountVectorizer in feature extraction is still slow

2013-03-07 Thread Andreas Mueller
On 03/07/2013 09:13 AM, Roman Sinayev wrote:
> This module is a crucial bottleneck in NLP problems. I am trying to refactor it and also make it parallel across documents with python multiprocessing module. Is anyone else working on this?

If this is your bottleneck, you should consider using Hash

[Scikit-learn-general] CountVectorizer in feature extraction is still slow

2013-03-07 Thread Roman Sinayev
This module is a crucial bottleneck in NLP problems. I am trying to refactor it and also make it parallel across documents with python multiprocessing module. Is anyone else working on this?
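
As a rough illustration of what "parallel across documents" could mean, here is a sketch using the standard multiprocessing module. It is not the refactoring described in this thread: the whitespace tokenizer and the helper names `count_tokens`, `count_corpus` and `build_vocabulary` are assumptions made for the example, and the chunk size is arbitrary.

```python
from collections import Counter
from multiprocessing import Pool


def count_tokens(doc):
    """Count tokens in one document (naive lowercase whitespace tokenizer)."""
    return Counter(doc.lower().split())


def count_corpus(docs, n_jobs=1):
    """Count tokens per document, optionally spread over worker processes."""
    if n_jobs > 1:
        with Pool(processes=n_jobs) as pool:
            # Documents are independent, so they can be counted in any order;
            # pool.map still returns results in input order.
            return pool.map(count_tokens, docs, chunksize=256)
    return [count_tokens(doc) for doc in docs]


def build_vocabulary(per_doc_counts):
    """Merge per-document counts into a sorted term -> column index mapping."""
    terms = sorted(set().union(*per_doc_counts))
    return {term: idx for idx, term in enumerate(terms)}


docs = ["The cat sat", "the dog sat"]
counts = count_corpus(docs)  # serial path; pass n_jobs > 1 to use a Pool
vocab = build_vocabulary(counts)
```

The per-document counting parallelizes cleanly, but building the vocabulary is a merge step that still runs in one process, and each Counter has to be pickled back from the workers, so the speedup depends on how expensive tokenization is relative to that overhead. On platforms that start workers by spawning a fresh interpreter, the call with `n_jobs > 1` should sit under an `if __name__ == "__main__":` guard.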