Hi Ted,

My dataset is a collection of documents in German, and the scores seem better than my TFIDF scores. The results make more sense now, especially for my bi-grams.

Arian Pasquali
http://about.me/arianpasquali

2014-10-01 13:09 GMT+01:00 Ted Dunning <[email protected]>:

> Thanks so much for the feedback. Glad to hear it was straightforward.
>
> But the important question is ....
>
> how did BM25 work for you?
>
> On Wed, Oct 1, 2014 at 6:18 AM, Arian Pasquali <[email protected]> wrote:
>
> > Hey guys,
> > I think it is fair to give you some feedback.
> > I managed to implement BM25+ <http://en.wikipedia.org/wiki/Okapi_BM25> term
> > score on Mahout.
> > It was straightforward using the current TFIDF implementation as an
> > example.
> >
> > Basically what I did was implement the interface
> > org.apache.mahout.vectorizer.Weight, create a BM25Converter and
> > BM25PartialVectorReducer similar to TFIDFConverter
> > <https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/tfidf/TFIDFConverter.html>
> > and TFIDFPartialVectorReducer
> > <https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/tfidf/TFIDFPartialVectorReducer.html>
> > respectively.
> >
> > cheers
> > Arian
> >
> > Arian Pasquali
> > http://about.me/arianpasquali
> >
> > 2014-09-24 14:14 GMT+01:00 Arian Pasquali <[email protected]>:
> >
> > > Yes,
> > > I'm studying his work <http://nlp.uned.es/~jperezi/Lucene-BM25/> and the
> > > current mahout's tfidf code.
> > > Trying to understand how I would port that to mr.
> > > I ll try to share something if I succeed.
> > >
> > > Arian Pasquali
> > > http://about.me/arianpasquali
> > >
> > > 2014-09-24 5:12 GMT+01:00 Suneel Marthi <[email protected]>:
> > >
> > >> Lucene 4.x supports okapi-bm25. So it should be easy to implement.
> > >>
> > >> On Tue, Sep 23, 2014 at 11:57 PM, Ted Dunning <[email protected]>
> > >> wrote:
> > >>
> > >> > Should be pretty easy. I haven't heard of anyone doing it.
> > >> >
> > >> > Sent from my iPhone
> > >> >
> > >> > > On Sep 23, 2014, at 18:53, Arian Pasquali <[email protected]>
> > >> > > wrote:
> > >> > >
> > >> > > Hi,
> > >> > > I was wondering if would be possible to support bm25 term weighting
> > >> > > extending Mahout's tf-idf implementation.
> > >> > >
> > >> > > I was curious to know if anyone here has already tried to do so.
> > >> > > If not, what would be your suggestion for such implementation on
> > >> > > Mahout?
> > >> > >
> > >> > > Arian Pasquali
> > >> > > http://about.me/arianpasquali
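
For anyone following the thread, here is a minimal sketch of the per-term BM25 weight being discussed, with an optional BM25+ lower-bound term. It is not the code from the Mahout implementation described above. The function name `bm25_weight`, its parameters, the defaults `k1=1.2` and `b=0.75`, and the non-negative IDF form are all assumptions chosen for illustration. A plain Python function is used only to keep the formula easy to read.

```python
import math


def bm25_weight(tf, df, doc_len, avg_doc_len, num_docs,
                k1=1.2, b=0.75, delta=0.0):
    """Weight of one term in one document (illustrative sketch only).

    tf          -- term frequency in the document
    df          -- number of documents containing the term
    doc_len     -- length of this document in tokens
    avg_doc_len -- average document length over the collection
    num_docs    -- number of documents in the collection
    delta       -- 0.0 gives plain BM25; a positive value gives the BM25+ variant
    """
    if tf <= 0:
        return 0.0
    # Assumption: a non-negative IDF variant, so very common terms never
    # receive a negative weight.
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalisation: documents longer than average are penalised.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # The tf component saturates: it approaches (k1 + 1) as tf grows.
    return idf * (tf * (k1 + 1.0) / (tf + norm) + delta)


if __name__ == "__main__":
    # Hypothetical collection statistics, for illustration only.
    for tf in (1, 2, 5, 20):
        print(tf, bm25_weight(tf, df=5, doc_len=100, avg_doc_len=100, num_docs=1000))
```

One difference from a TF-IDF style weight is that BM25 also needs the average document length, which is a collection-level statistic, so an implementation has to compute it up front and pass it to the weighting step. The saturating term-frequency component may be part of why repeated terms dominate less than under TF-IDF, though that is a guess and not something measured in this thread.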
