Re: [Scikit-learn-general] Text Documents - Vectorizer

2012-03-30 Thread Olivier Grisel
On 30 March 2012 at 14:50, Philipp Singer wrote:
>
> I just have another question regarding this because some of my coworkers brought this idea up and I can't argue about it the way I'd like.
>
> So let's assume you have 10 documents in the training set and 10 documents in the test set.
>
> M

Re: [Scikit-learn-general] Text Documents - Vectorizer

2012-03-30 Thread Philipp Singer
On 23 March 2012 at 13:58, Olivier Grisel wrote:
> On 23 March 2012 at 13:27, Philipp Singer wrote:
>> The IDF statistics are computed once on the whole training corpus as passed to the `fit` method and then reused on each call to the `transform` method.
>>
>> For a train / test split one typically

Re: [Scikit-learn-general] Text Documents - Vectorizer

2012-03-23 Thread Lars Buitinck
On 23 March 2012 at 13:58, Olivier Grisel wrote:
> On 23 March 2012 at 13:27, Philipp Singer wrote:
>> Okay, so the TF-IDF values are for the whole corpus.
>
> Well, not exactly: the IDF weights are "trained" on the training slice of the corpus and can then be reused for the ne

Re: [Scikit-learn-general] Text Documents - Vectorizer

2012-03-23 Thread Olivier Grisel
On 23 March 2012 at 13:27, Philipp Singer wrote:
> The IDF statistics are computed once on the whole training corpus as passed to the `fit` method and then reused on each call to the `transform` method.
>
> For a train / test split one typically calls `fit_transform` on the train split (to compute

Re: [Scikit-learn-general] Text Documents - Vectorizer

2012-03-23 Thread Philipp Singer
The IDF statistics are computed once on the whole training corpus as passed to the `fit` method and then reused on each call to the `transform` method.

For a train / test split one typically calls `fit_transform` on the train split (to compute the IDF vector on the train split only) and reuses those ID
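
The fit-then-reuse behaviour described in this message can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's implementation: the helper names `fit_idf` and `transform`, the whitespace tokenizer, the toy documents, and the smoothed IDF formula are all assumptions made for the example.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn one IDF weight per term from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        # Count each term at most once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    # Smoothed IDF: one common variant, chosen here purely for illustration.
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }


def transform(docs, idf):
    """Weight raw term counts by the stored IDF values.

    Terms that were never seen during fitting have no IDF weight
    and are dropped.
    """
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append({t: c * idf[t] for t, c in counts.items() if t in idf})
    return rows


# Hypothetical toy corpus.
train_docs = ["the cat sat", "the dog barked", "the cat purred"]
test_docs = ["the cat barked loudly"]

idf = fit_idf(train_docs)                   # plays the role of `fit` on the train split
train_vectors = transform(train_docs, idf)
test_vectors = transform(test_docs, idf)    # reuses the training IDF; test docs do not change it
```

In this sketch the test documents never influence `idf`, so a term such as "loudly" that appears only in the test split is simply ignored.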

Re: [Scikit-learn-general] Text Documents - Vectorizer

2012-03-23 Thread Olivier Grisel
On 23 March 2012 at 12:06, Philipp Singer wrote:
> Hey!
>
> I am currently using sklearn.feature_extraction.text.Vectorizer for feature extraction of text documents I have.
>
> I am now curious and don't quite understand how the TF-IDF calculation is done. Is it done separately for each document

[Scikit-learn-general] Text Documents - Vectorizer

2012-03-23 Thread Philipp Singer
Hey!

I am currently using sklearn.feature_extraction.text.Vectorizer for feature extraction of text documents I have.

I am now curious and don't quite understand how the TFIDF calculation is don