2013/9/26 Olivier Grisel :
> 2013/9/7 Tasos Ventouris :
>> I tried to run my script and then create a string from the list for each
>> text and include those texts in the TfidfVectorizer. I am satisfied with
>> the results, but unfortunately, if I have 1000 or more documents, this isn't
>> the mo
BTW, if you want to do LSI on a large corpus, you should rather use
Gensim, which supports tuned data structures and out-of-core processing
for this specific application domain:
http://radimrehurek.com/gensim/
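For context, LSI amounts to a truncated SVD of the (typically tf-idf weighted) term-document matrix; Gensim's value is that it computes this incrementally, so the corpus never has to fit in memory. A minimal in-memory numpy sketch of the same decomposition (the toy matrix and rank below are made up for illustration):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# In practice this would be the (sparse) tf-idf matrix.
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

k = 2  # number of latent "topics" to keep

# Full SVD, then truncate to the top-k singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each document's coordinates in the k-dimensional latent space.
doc_topics = (np.diag(s_k) @ Vt_k).T
print(doc_topics.shape)  # one k-dimensional vector per document: (3, 2)
```

Gensim's `LsiModel` produces the same kind of low-rank document representation without materialising the dense matrices.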
--
Olivier
2013/9/7 Tasos Ventouris :
> Hello, I have two questions where I would like your feedback.
>
> The first one:
>
> Here is my code:
>
> from sklearn.feature_extraction.text import TfidfVectorizer
>
> documents = [doc1, doc2, doc3]
> tfidf = TfidfVectorizer().fit_transform(documents)
> pairwise_similarity = tfidf * tfidf.T
> print pairwise_similarity.A
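The snippet above works because rows of the tf-idf matrix are L2-normalised by default, so `tfidf * tfidf.T` is exactly the matrix of pairwise cosine similarities. A dependency-free sketch of the same computation on plain term-count vectors (the documents and whitespace tokenisation here are illustrative):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count dicts."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = ["the cat sat", "the cat ran", "dogs bark loudly"]
vectors = [Counter(doc.split()) for doc in documents]

# Same shape of result as pairwise_similarity.A above.
pairwise = [[cosine(a, b) for b in vectors] for a in vectors]
print(pairwise[0][0])  # 1.0: every document is identical to itself
```

With 1000+ documents the sparse-matrix product in scikit-learn is far faster than this pure-Python version; the sketch is only meant to show what the product computes.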
>Can you try to turn off IDF normalization using `use_idf=False` in
>the constructor params of your vectorizer and retry (fit + predict) to
>see if it's related to IDF normalization?
>How many dimensions do you have in your fitted model?
https://gist.github.com/3933727
data_vectors.shape = (10361
2012/10/22 Ark :
> to see if it's related to IDF normalization?
>
> How many dimensions do you have in your fitted model?
>
> >>> print len(vectorizer.vocabulary_)
>
> How many documents do you have in your training corpus?
>
> How many non-zeros do you have in your transformed document?
>
> >>> print vectorizer.transform([my_text_document])
2012/10/22 Ark :
>
> I don't see the number of non-zeros: could you please do:
>
> >>> print vectorizer.transform([my_text_document])
>
> as I asked previously? The run time should be linear with the number
> of non-zeros.

ipdb> print self.vectorizer.transform([doc])
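The non-zero count being asked about is simply the number of distinct in-vocabulary terms the document contains, which is why the transform (and any linear model applied to it) should scale with it. A stdlib-only illustration (the vocabulary and tokenisation below are made up):

```python
from collections import Counter

# Hypothetical fitted vocabulary, term -> column index.
vocabulary = {"cat": 0, "dog": 1, "sat": 2, "mat": 3}

def nnz(document):
    """Count non-zero features: distinct in-vocabulary tokens."""
    counts = Counter(t for t in document.lower().split() if t in vocabulary)
    return len(counts)

print(nnz("the cat sat on the mat"))  # 3 distinct terms: cat, sat, mat
```

Printing a scipy sparse row, as asked above, shows exactly these (row, column) -> value entries, so the non-zero count can be read straight off the output.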
2012/10/13 Ark :
> Olivier Grisel writes:
>
>> > https://gist.github.com/3815467
>>
>> The offending line seems to be:
>>
>>      1    1.193    1.193    7.473    7.473  base.py:529(setdiag)
>>
>> which I don't understand how it could happen at predict time. At fit
>> time it could have been:
>>
>> https://github.com/sci
2012/10/2 Ark :
>
>> >> 7s is very long. How long is your text document in bytes ?
>> > The text documents are around 50kB.
>>
>> That should not take 7s to extract a TF-IDF for a single 50kb
>> document. There must be a bug, can you please put a minimalistic code
>> snippet + example document that
Try dividing the email in half and seeing if one half takes much
more than 50% of the time.
Repeat until you have a sample that you can share :)
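The halving strategy above is a straightforward bisection on the input: time each half, recurse into the slower one, stop once no single part dominates or the sample is small enough to share. A rough sketch of that loop (the `vectorize` callable is a hypothetical stand-in for the real transform):

```python
import time

def slow_chunk(text, vectorize, min_len=100):
    """Bisect `text` to isolate the part that makes `vectorize` slow."""
    while len(text) > min_len:
        mid = len(text) // 2
        halves = text[:mid], text[mid:]
        timings = []
        for half in halves:
            t0 = time.perf_counter()
            vectorize(half)
            timings.append(time.perf_counter() - t0)
        # Only recurse if one half clearly dominates the cost.
        if max(timings) < 2 * min(timings):
            break  # cost is spread evenly; no single culprit
        text = halves[timings.index(max(timings))]
    return text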
On Mon, Oct 1, 2012 at 8:44 PM, Ark wrote:
>
> >> 7s is very long. How long is your text document in bytes ?
> > The text documents are around 50kB.
>
> That should not take 7s to extract a TF-IDF for a single 50kB
> document. There must be a bug, can you please put a minimalistic code
> snippet + example document that reproduce the issue
2012/9/24 Ark :
> Olivier Grisel writes:
>
>> You can use the Pipeline class to build a compound classifier that
>> binds a text feature extractor with a classifier to get a text
>> document classifier in the end.
>>
> Done!
>
>> 7s is very long. How long is your text document in bytes ?
> The text documents are around 50kB.
2012/9/22 Ark :
> Hello,
> I am trying to classify a large document set with LinearSVC. I get good
> accuracy. However, I was wondering how to optimize the interface to this
> classifier. For example, if I have a predict interface that accepts the raw
> document and uses a precomputed classifier object, the time to predict
You can use the Pipeline class to build a compound classifier that
binds a text feature extractor with a classifier to get a text
document classifier in the end.
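To make the Pipeline suggestion concrete: a pipeline chains `fit`/`transform` steps and exposes the final step's `predict`, so callers can pass raw documents to a single object. A minimal stdlib-only sketch of the idea, with two toy steps standing in for `TfidfVectorizer` and `LinearSVC` (all names below are illustrative, not scikit-learn's implementation):

```python
class MiniPipeline:
    """Chain transformer steps, ending in an estimator with fit/predict."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for step in self.steps[:-1]:
            X = step.fit_transform(X)
        self.steps[-1].fit(X, y)
        return self

    def predict(self, X):
        for step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1].predict(X)

class LengthFeaturizer:
    """Toy stand-in for a text vectorizer: document -> [length]."""
    def fit_transform(self, docs):
        return self.transform(docs)
    def transform(self, docs):
        return [[len(d)] for d in docs]

class ThresholdClassifier:
    """Toy stand-in for a classifier: label by mean-length threshold."""
    def fit(self, X, y):
        self.cut = sum(x[0] for x in X) / len(X)
        return self
    def predict(self, X):
        return [int(x[0] > self.cut) for x in X]

clf = MiniPipeline([LengthFeaturizer(), ThresholdClassifier()])
clf.fit(["short", "a much longer document"], [0, 1])
print(clf.predict(["tiny", "this one is clearly the longer raw document"]))
# [0, 1]
```

The real `sklearn.pipeline.Pipeline` works the same way, which is why it lets a precomputed vectorizer + classifier pair be pickled and served as one predict-on-raw-text object.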
Thanks a lot. I've let the author know
On 26 March 2012 at 14:14, Jaques Grobler wrote:
> Hi everyone-
>
> I stumbled upon this post that offers a quick run-through of
> text-feature-extraction using
> sklearn.feature_extraction.text's CountVectorizer:
>
> http://pyevolve.sourceforge.net/wordpress/?p=1589&cpage=1#comment-15857
>
> Upon copying the code into ipython, I get different outputs from