Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-30 Thread Joel Nothman
A very good point! (Although augmented and log-average tf both do some kind of normalisation of the tf distribution before IDF weighting.)
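The two weighting schemes Joel names can be sketched as follows. This is a minimal illustration of the standard definitions (augmented tf dampens by the document's maximum count, sublinear tf by a log), not code from scikit-learn itself; the helper names are my own:

```python
import numpy as np

def augmented_tf(counts):
    # augmented tf: 0.5 + 0.5 * tf / (max tf in the document),
    # which damps the influence of raw document length
    m = counts.max(axis=1, keepdims=True).astype(float)
    m[m == 0] = 1.0                       # avoid division by zero
    return 0.5 + 0.5 * counts / m

def sublinear_tf(counts):
    # sublinear tf, as in TfidfTransformer(sublinear_tf=True): 1 + log(tf)
    out = counts.astype(float)
    nz = out > 0
    out[nz] = 1.0 + np.log(out[nz])
    return out
```

Both rescale the tf distribution within each document before any IDF multiplication, which is the point being made above.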

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-30 Thread Roman Yurchak
Hi Yacine, On 29/01/18 16:39, Yacine MAZARI wrote: >> I wouldn't hate if length normalisation was added to if it was shown that normalising before IDF multiplication was more effective than (or complementary >> to) norming afterwards. I think this is one of the most important points here. T

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-30 Thread Yacine MAZARI
Okay, thanks for the replies. @Joel: Should I go ahead and send a PR with the change to TfidfTransformer? On Tue, Jan 30, 2018 at 5:27 AM, Joel Nothman wrote: > I don't think you will do this without an O(N) cost. The fact that it's > done with a second pass is moot. > > My position stands: if

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-29 Thread Joel Nothman
I don't think you will do this without an O(N) cost. The fact that it's done with a second pass is moot. My position stands: if this change happens, it should be to TfidfTransformer (which should perhaps be called something like CountVectorWeighter!) alone. On 30 January 2018 at 02:39, Yacine MAZ

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-29 Thread Yacine MAZARI
Hi Folks, Thank you all for the feedback and interesting discussion. I do realize that adding a feature comes with risks, and that there should really be compelling reasons to do so. Let me try to address your comments here, and make one final case for the value of this feature: 1) Use Normaliz

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-28 Thread Gael Varoquaux
On Sun, Jan 28, 2018 at 08:29:58PM +1100, Joel Nothman wrote: > I can't say it's especially obvious that these features are available, and > improvements to the documentation are welcome, but CountVectorizer is > complicated enough and we would rather avoid more parameters if we can. Same feeling here

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-28 Thread Joel Nothman
That's equivalent to Normalizer(norm='l1') or FunctionTransformer(np.linalg.norm, kw_args={'axis': 1, 'ord': 1}). The problem is that length norm followed by TfidfTransformer now can't do sublinear TF right... But that's alright if we know we can always do FunctionTransformer(lambda X: calc_sublin
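As a side note, the `FunctionTransformer` call quoted above would return the per-row norms rather than the scaled rows (`np.linalg.norm` reduces each row to a scalar). A working dense-input equivalent of `Normalizer(norm='l1')` might look like this; the helper name is my own:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def l1_scale(X):
    # divide each row by its L1 norm; leave all-zero rows untouched
    norms = np.abs(X).sum(axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return X / norms

ft = FunctionTransformer(l1_scale)
```

For sparse input you would use `Normalizer(norm='l1')` directly rather than this sketch.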

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-28 Thread Sebastian Raschka
Good point Joel, and I actually forgot that you can set the norm param in the TfidfVectorizer, so one could basically do vect = TfidfVectorizer(use_idf=False, norm='l1') to have the CountVectorizer behavior but normalizing by the document length. Best, Sebastian > On Jan 28, 2018, at 1:29 AM,
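Sebastian's one-liner can be sketched as follows (the toy documents are my own):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# use_idf=False drops the IDF multiplier; norm='l1' then divides each
# row of raw counts by its L1 norm, i.e. by the number of tokens the
# vectorizer kept for that document
vect = TfidfVectorizer(use_idf=False, norm="l1")
X = vect.fit_transform(["the cat sat on the mat", "the dog"])
```

Each row of `X` then sums to 1, which is exactly the "counts divided by document length" behaviour being asked for.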

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-28 Thread Sebastian Raschka
Hi, Yacine, Just on a side note, you can set use_idf=False in the TfidfVectorizer and only normalize the vectors by their L2 norm. But yeah, the normalization you suggest might be really handy in certain cases. I am not sure though if it's worth making this another parameter in the CountVectorizer (which al

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-28 Thread Joel Nothman
sklearn.preprocessing.Normalizer allows you to normalize any vector by its L1 or L2 norm. L1 would be equivalent to "document length" as long as you did not intend to count stop words in the length. sklearn.feature_extraction.text.TfidfTransformer offers similar norming, but does so only after acco
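A minimal sketch of the two-step pipeline Joel describes (the documents are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer

docs = ["apple apple banana", "banana cherry"]
X = CountVectorizer().fit_transform(docs)       # raw term counts
X_l1 = Normalizer(norm="l1").transform(X)       # counts / document length
# note: "length" here counts only tokens the vectorizer kept, so
# stop-word filtering changes the denominator, as Joel points out
```

This keeps the counting and the length normalisation as separate, composable steps rather than a new CountVectorizer parameter.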

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-27 Thread Yacine MAZARI
Hi Jake, Thanks for the quick reply. What I meant is different from the TfidfVectorizer. Let me clarify: In the TfidfVectorizer, the raw counts are multiplied by IDF, which basically means normalizing the counts by document frequencies, tf * idf. But still, tf is defined here as the raw count of
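The distinction Yacine is drawing, raw counts versus counts divided by document length, can be made concrete (toy data; variable names are my own):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple apple banana", "apple"]
X = CountVectorizer().fit_transform(docs).toarray().astype(float)
# raw tf: "apple" scores 2 in doc 0 and 1 in doc 1, even though it is
# every token of doc 1; dividing by document length removes that bias
tf_rel = X / X.sum(axis=1, keepdims=True)
```

IDF weighting normalises across documents (by document frequency), while the division above normalises within each document (by its length); the two are independent.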

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-27 Thread Jacob Vanderplas
Hi Yacine, If I'm understanding you correctly, I think what you have in mind is already implemented in scikit-learn in the TF-IDF vectorizer. Best, Jake Jake VanderPlas Senior Data Scienc