Re: [Scikit-learn-general] TF-IDF and LSI

2013-09-26 Thread Lars Buitinck
2013/9/26 Olivier Grisel : > 2013/9/7 Tasos Ventouris : >> I tried to run my script and then create a string from the list for each >> text and include those texts into the TfidfVectorizer. I am satisfied with >> the results, but unfortunately, if I have 1000 or more documents, this isn't >> the mo

Re: [Scikit-learn-general] TF-IDF and LSI

2013-09-26 Thread Olivier Grisel
BTW, if you want to do LSI on a large corpus, you should rather use Gensim, which supports tuned data structures and out-of-core processing for this specific application domain: http://radimrehurek.com/gensim/ -- Olivier -

Re: [Scikit-learn-general] TF-IDF and LSI

2013-09-26 Thread Olivier Grisel
2013/9/7 Tasos Ventouris : > Hello, I have two questions on which I would like your feedback. > > The first one: > > Here is my code: > > from sklearn.feature_extraction.text import TfidfVectorizer > > documents = [doc1,doc2,doc3] > tfidf = TfidfVectorizer().fit_transform(documents) > pairwise_similari

[Scikit-learn-general] TF-IDF and LSI

2013-09-26 Thread Tasos Ventouris
Hello, I have two questions on which I would like your feedback. The first one: Here is my code:
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [doc1, doc2, doc3]
tfidf = TfidfVectorizer().fit_transform(documents)
pairwise_similarity = tfidf * tfidf.T
print pairwise_similarity.A
W
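
A minimal sketch of the pairwise-similarity computation from the post above, written for a current Python 3 and scikit-learn. The three document strings are placeholders invented for illustration (the original post only names `doc1`, `doc2`, `doc3`). It relies on `TfidfVectorizer` L2-normalizing each row by default, so the product of the matrix with its transpose should give cosine similarities.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents (hypothetical; the post only names doc1, doc2, doc3).
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock markets fell sharply today",
]

# Rows are L2-normalized by default (norm="l2"), so the dot product
# of two rows is their cosine similarity.
tfidf = TfidfVectorizer().fit_transform(documents)

# Sparse matrix product; .toarray() densifies the result for printing.
pairwise_similarity = (tfidf @ tfidf.T).toarray()
print(pairwise_similarity)
```

The dense conversion is only reasonable for a small number of documents; for a large corpus the product is better kept sparse.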

Re: [Scikit-learn-general] TF-Idf

2012-10-25 Thread Ark
> Can you try to turn off IDF normalization using `use_idf=False` in > the constructor params of your vectorizer and retry (fit + predict) to > see if it's related to IDF normalization? > How many dimensions do you have in your fitted model? https://gist.github.com/3933727 data_vectors.shape = (10361
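
A rough sketch of the experiment suggested in the quoted text: fit one vectorizer with IDF weighting and one without, then time `transform` on a single document. The training strings and `my_text_document` are made-up stand-ins, and timings on such tiny inputs are not meaningful in themselves; the point is only the shape of the comparison.

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in data; substitute the real corpus and document.
train_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "investors sold bank shares",
]
my_text_document = "the cat chased the dog"

timings = {}
for use_idf in (True, False):
    # use_idf=False switches off the IDF reweighting step.
    vectorizer = TfidfVectorizer(use_idf=use_idf).fit(train_docs)
    start = time.perf_counter()
    vector = vectorizer.transform([my_text_document])
    timings[use_idf] = time.perf_counter() - start
    print("use_idf=%s: %.6f s, %d non-zeros"
          % (use_idf, timings[use_idf], vector.nnz))
```

If the two timings differ greatly on a real document, that would point at the IDF step rather than at tokenization or counting.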

Re: [Scikit-learn-general] TF-Idf

2012-10-23 Thread Olivier Grisel
2012/10/22 Ark : > e if it's related to IDF normalization? >> >> How many dimensions do you have in your fitted model? >> >> >>> print len(vectorizer.vocabulary_) >> >> How many documents do you have in your training corpus? >> >> How many non-zeros do you have in your transformed document? >> >> >

Re: [Scikit-learn-general] TF-Idf

2012-10-22 Thread Ark
e if it's related to IDF normalization? > > How many dimensions do you have in your fitted model? > > >>> print len(vectorizer.vocabulary_) > > How many documents do you have in your training corpus? > > How many non-zeros do you have in your transformed document? > > >>> print vectorizer.tran

Re: [Scikit-learn-general] TF-Idf

2012-10-22 Thread Olivier Grisel
2012/10/22 Ark : > >> I don't see the number of non-zeros: could you please do: >> >> >>> print vectorizer.transform([my_text_document]) >> >> as I asked previously? The run time should be linear with the number >> of non zeros. > > > ipdb> print self.ve

Re: [Scikit-learn-general] TF-Idf

2012-10-22 Thread Ark
> I don't see the number of non-zeros: could you please do: > > >>> print vectorizer.transform([my_text_document]) > > as I asked previously? The run time should be linear with the number > of non zeros. ipdb> print self.vectorizer.transform([doc])
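
For reference, a small sketch of how the quantities asked about in this thread can be read off a fitted vectorizer: the number of dimensions from `vocabulary_` and the number of non-zeros from the `nnz` attribute of the transformed sparse row. The training strings and `my_text_document` are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training corpus and query document.
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
my_text_document = "the cat and the dog"

vectorizer = TfidfVectorizer().fit(train_docs)

# Number of dimensions of the fitted model.
n_features = len(vectorizer.vocabulary_)

# transform() returns a SciPy sparse matrix with one row per document;
# .nnz is the number of stored non-zero entries.
vector = vectorizer.transform([my_text_document])
print("dimensions:", n_features)
print("shape:", vector.shape, "non-zeros:", vector.nnz)
```

Words that were not seen during `fit` (here "and") are ignored by `transform`, so they do not contribute non-zeros.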

Re: [Scikit-learn-general] TF-Idf

2012-10-13 Thread Olivier Grisel
2012/10/13 Ark : > Olivier Grisel writes: > > >> > https://gist.github.com/3815467 >> >> The offending line seems to be: >> >> 1 1.193 1.193 7.473 7.473 base.py:529(setdiag) >> >> which I don't understand how it could happen at predict time. At fit >> time it could have been: >

Re: [Scikit-learn-general] TF-Idf

2012-10-12 Thread Ark
Olivier Grisel writes: > > https://gist.github.com/3815467 > > The offending line seems to be: > > 1 1.193 1.193 7.473 7.473 base.py:529(setdiag) > > which I don't understand how it could happen at predict time. At fit > time it could have been: > > https://github.com/sci

Re: [Scikit-learn-general] TF-Idf

2012-10-02 Thread Olivier Grisel
2012/10/2 Ark : > >> >> 7s is very long. How long is your text document in bytes ? >> > The text documents are around 50kB. >> >> That should not take 7s to extract a TF-IDF for a single 50kb >> document. There must be a bug, can you please put a minimalistic code >> snippet + example document that

Re: [Scikit-learn-general] TF-Idf

2012-10-01 Thread Joseph Turian
Try dividing the email in half and seeing if one half takes much more than 50% of the time. Repeat until you have a sample that you can share :) On Mon, Oct 1, 2012 at 8:44 PM, Ark wrote: > >> >> 7s is very long. How long is your text document in bytes ? >> > The text documents are around 50k
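
The halving strategy described above could be sketched roughly as follows. The helper names (`seconds_to_run`, `shrink_slow_input`, `fake_cost`) are hypothetical and not part of any library, and the demonstration uses a deterministic stand-in cost (counting "X" characters) instead of real timings so that it behaves the same on every run.

```python
import time


def seconds_to_run(func):
    """Wrap `func` so that calling the wrapper returns its wall-clock run time."""
    def cost(text):
        start = time.perf_counter()
        func(text)
        return time.perf_counter() - start
    return cost


def shrink_slow_input(text, cost, min_length=64):
    """Halve `text` repeatedly, keeping whichever half has the larger cost."""
    while len(text) > min_length:
        middle = len(text) // 2
        first, second = text[:middle], text[middle:]
        text = first if cost(first) >= cost(second) else second
    return text


# Deterministic stand-in for a timing function: pretend that every "X"
# character is what makes processing slow.
def fake_cost(text):
    return text.count("X")


document = "a" * 1000 + "X" + "a" * 500
sample = shrink_slow_input(document, fake_cost, min_length=8)
print(len(sample), "X" in sample)
```

With a real vectorizer the cost function could be built as `seconds_to_run(lambda t: vectorizer.transform([t]))`. Real timings are noisy, and the approach only helps when the slowness is concentrated in one part of the document.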

Re: [Scikit-learn-general] TF-Idf

2012-10-01 Thread Ark
> >> 7s is very long. How long is your text document in bytes ? > > The text documents are around 50kB. > > That should not take 7s to extract a TF-IDF for a single 50kb > document. There must be a bug, can you please put a minimalistic code > snippet + example document that reproduce the issue o

Re: [Scikit-learn-general] TF-Idf

2012-09-25 Thread Olivier Grisel
2012/9/24 Ark : > Olivier Grisel writes: > >> You can use the Pipeline class to build a compound classifier that >> binds a text feature extractor with a classifier to get a text >> document classifier in the end. >> > Done! > >> >> 7s is very long. How long is your text document in bytes ? > The

Re: [Scikit-learn-general] TF-Idf

2012-09-24 Thread Ark
Olivier Grisel writes: > You can use the Pipeline class to build a compound classifier that > binds a text feature extractor with a classifier to get a text > document classifier in the end. > Done! > > 7s is very long. How long is your text document in bytes ? The text documents are around

Re: [Scikit-learn-general] TF-Idf

2012-09-22 Thread Olivier Grisel
2012/9/22 Ark : > Hello, > I am trying to classify a large document set with LinearSVC. I get good > accuracy. However I was wondering how to optimize the interface to this > classifier. E.g., if I have a predict interface that accepts the raw > document, You can use the Pipeline class to
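
A minimal sketch of the Pipeline suggestion above, assuming a `TfidfVectorizer` as the text feature extractor and `LinearSVC` as the classifier. The training documents, labels, and step names ("tfidf", "svm") are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training set.
train_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply",
    "investors sold bank shares",
]
train_labels = ["animals", "animals", "finance", "finance"]

# The pipeline binds feature extraction and classification, so that
# fit() and predict() both accept raw text documents.
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])
text_clf.fit(train_docs, train_labels)

predicted = text_clf.predict(["a dog sat on a mat"])
print(predicted)
```

The fitted pipeline can be persisted as a single object, which keeps the vectorizer vocabulary and the classifier weights in sync.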

[Scikit-learn-general] TF-Idf

2012-09-21 Thread Ark
Hello, I am trying to classify a large document set with LinearSVC. I get good accuracy. However, I was wondering how to optimize the interface to this classifier. E.g., if I have a predict interface that accepts the raw document, and uses a precomputed classifier object, the time to predic

Re: [Scikit-learn-general] tf-idf changes

2012-03-27 Thread Jaques Grobler
Thanks a lot. I've let the author know :) On 26 March 2012 14:14, Jaques Grobler wrote: > > > Hi everyone- > > > > > > I stumbled upon this post that offers a quick run-through of > > > text-feature-extraction using > > > sklearn.feature_extraction.text's CountVectorizer: > > > > > > > > > http:

Re: [Scikit-learn-general] tf-idf changes

2012-03-26 Thread Olivier Grisel
On 26 March 2012 14:14, Jaques Grobler wrote: > Hi everyone- > > I stumbled upon this post that offers a quick run-through of > text-feature-extraction using > sklearn.feature_extraction.text's CountVectorizer: > > > http://pyevolve.sourceforge.net/wordpress/?p=1589&cpage=1#comment-15857 > > Upon

[Scikit-learn-general] tf-idf changes

2012-03-26 Thread Jaques Grobler
Hi everyone- I stumbled upon this post that offers a quick run-through of text-feature-extraction using sklearn.feature_extraction.text's CountVectorizer: http://pyevolve.sourceforge.net/wordpress/?p=1589&cpage=1#comment-15857 Upon copying the code into ipython, I get different outputs from