RE: BM25 Scoring Patch

2010-02-18 Thread Yuval Feinstein
-Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Thursday, February 18, 2010 3:09 PM To: java-user@lucene.apache.org Subject: Re: BM25 Scoring Patch Yuval, don't we still need this 'document-level IDF' for BM25f? - Yes, we do need 'document-

Re: BM25 Scoring Patch

2010-02-18 Thread Robert Muir
d be a great help. > Thanks, > Yuval > > -Original Message- > From: Robert Muir [mailto:rcm...@gmail.com] > Sent: Wednesday, February 17, 2010 6:47 PM > To: java-user@lucene.apache.org > Subject: Re: BM25 Scoring Patch > > I tend to agree with you Marvin, you are righ

RE: BM25 Scoring Patch

2010-02-18 Thread Yuval Feinstein
me of this work myself, but guidance from a Lucene scoring guru would be a great help. Thanks, Yuval -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Wednesday, February 17, 2010 6:47 PM To: java-user@lucene.apache.org Subject: Re: BM25 Scoring Patch I tend to agree

Re: BM25 Scoring Patch

2010-02-17 Thread Robert Muir
I tend to agree with you Marvin, you are right, the different scoring mechanisms need different information available and this is the problem. Although last I checked, one hard part of BM25 revolves around fields versus documents... e.g. BM25's IDF calculation. But maybe this is just an extreme fo

Re: BM25 Scoring Patch

2010-02-17 Thread Marvin Humphrey
On Wed, Feb 17, 2010 at 10:31:19AM -0500, Robert Muir wrote: > yet if we don't do the hard work up front to make it easy to plug in things > like BM25, then no one will implement additional scoring formulas for > Lucene, we currently make it terribly difficult to do this. FWIW... Similarity and po

Re: BM25 Scoring Patch

2010-02-17 Thread Robert Muir
> > We opened up the TermScorer class for that. > > Thanks, > > Ivan > > --- On Wed, 2/17/10, Robert Muir wrote: > > > From: Robert Muir > > Subject: Re: BM25 Scoring Patch > > To: java-user@lucene.apache.org > > Date: Wednesday, Feb

Re: BM25 Scoring Patch

2010-02-17 Thread Ivan Provalov
Muir wrote: > From: Robert Muir > Subject: Re: BM25 Scoring Patch > To: java-user@lucene.apache.org > Date: Wednesday, February 17, 2010, 10:31 AM > Yuval, i apologize for not having an > intelligent response for your question > (if i did i would try to formulate it as a patch),

Re: BM25 Scoring Patch

2010-02-17 Thread Robert Muir
m] > Sent: Tuesday, February 16, 2010 10:38 PM > To: java-user@lucene.apache.org > Subject: Re: BM25 Scoring Patch > > Joaquin, I have a typical methodology where I don't optimize any scoring > params: be it BM25 params (I stick with your defaults), or lnb.ltc params &

RE: BM25 Scoring Patch

2010-02-17 Thread Yuval Feinstein
we > >> are > >> >> dealing here. > >> >> > >> >> PS2: In relation with TREC4 Cornell used a pivoted length > >> normalisation > >> >> and they were applying pseudo-relevance feedback, what honestly makes > >>

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
e > >> >> dealing here. > >> >> > >> >> PS2: In relation with TREC4 Cornell used a pivoted length > >> normalisation > >> >> and they were applying pseudo-relevance feedback, what honestly makes > >> >> much > >

Re: BM25 Scoring Patch

2010-02-16 Thread JOAQUIN PEREZ IGLESIAS
e >> >> part of the pool. >> >> >> >> Sorry for the huge mail :- >> >> >> >> > Hi Ivan, >> >> > >> >> > the problem is that unfortunately BM25 >> >> > cannot be implemented overwriting

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
IDF (what is > >> > interesting only at search time). > >> > If you set BM25Similarity at indexing time > >> > some basic stats are not stored > >> > correctly in the segments (like docs length). > >> > > >> > When you use BM25BooleanQuery this c

Re: BM25 Scoring Patch

2010-02-16 Thread JOAQUIN PEREZ IGLESIAS
> > not interfering on the typical use of Lucene (so no changing >> > DefaultSimilarity). >> > >> >> Joaquin, Robert, >> >> >> >> I followed Joaquin's recommendation and removed the call to set >> >> similarity >> >> to BM25 expli

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
publish the results once we run the > experiments on a full collection. Are you talking about the bias caused by > using a sub-collection? > > Thanks, > > Ivan > > --- On Tue, 2/16/10, Robert Muir wrote: > > > From: Robert Muir > > Subject: Re: BM25 Sc

Re: BM25 Scoring Patch

2010-02-16 Thread Ivan Provalov
By the end of the week, I will publish the results once we run the experiments on a full collection. Are you talking about the bias caused by using a sub-collection? Thanks, Ivan --- On Tue, 2/16/10, Robert Muir wrote: > From: Robert Muir > Subject: Re: BM25 Scoring Patch > To:

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
l numbers on the complete collection. > > We are planning to also apply the stemming. Right now we are trying to > isolate each improvement experiment. > > Thanks, > > Ivan > > > > --- On Tue, 2/16/10, Robert Muir wrote: > > > From: Robert Muir > > Sub

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
r). The results showed 55% > >> improvement for the MAP score (0.141->0.219) over default similarity. > >> > >> Joaquin, how would setting the similarity to BM25 explicitly make the > >> score worse? > >> > >> Thank you, > >>

Re: BM25 Scoring Patch

2010-02-16 Thread Ivan Provalov
ng to isolate each improvement experiment. Thanks, Ivan --- On Tue, 2/16/10, Robert Muir wrote: > From: Robert Muir > Subject: Re: BM25 Scoring Patch > To: java-user@lucene.apache.org > Date: Tuesday, February 16, 2010, 1:14 PM > Ivan just a little more food for > though

Re: BM25 Scoring Patch

2010-02-16 Thread JOAQUIN PEREZ IGLESIAS
itly (indexer, searcher). The results showed 55% >> improvement for the MAP score (0.141->0.219) over default similarity. >> >> Joaquin, how would setting the similarity to BM25 explicitly make the >> score worse? >> >> Thank you, >> >> Ivan >>

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
cher). The results showed 55% improvement > for the MAP score (0.141->0.219) over default similarity. > > Joaquin, how would setting the similarity to BM25 explicitly make the score > worse? > > Thank you, > > Ivan > > > > --- On Tue, 2/16/10, Robert Muir wrot

Re: BM25 Scoring Patch

2010-02-16 Thread JOAQUIN PEREZ IGLESIAS
> Joaquin, how would setting the similarity to BM25 explicitly make the > score worse? > > Thank you, > > Ivan > > > > --- On Tue, 2/16/10, Robert Muir wrote: > >> From: Robert Muir >> Subject: Re: BM25 Scoring Patch >> To: java-user@lucene.apache.org >

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
van > > > > --- On Tue, 2/16/10, Robert Muir wrote: > > > From: Robert Muir > > Subject: Re: BM25 Scoring Patch > > To: java-user@lucene.apache.org > > Date: Tuesday, February 16, 2010, 11:36 AM > > yes Ivan, if possible please report > > back any

Re: BM25 Scoring Patch

2010-02-16 Thread Ivan Provalov
tly make the score worse? Thank you, Ivan --- On Tue, 2/16/10, Robert Muir wrote: > From: Robert Muir > Subject: Re: BM25 Scoring Patch > To: java-user@lucene.apache.org > Date: Tuesday, February 16, 2010, 11:36 AM > yes Ivan, if possible please report > back a

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
yes Ivan, if possible please report back any findings you can on the experiments you are doing! On Tue, Feb 16, 2010 at 11:22 AM, Joaquin Perez Iglesias < joaquin.pe...@lsi.uned.es> wrote: > Hi Ivan, > > You shouldn't set the BM25Similarity for indexing or searching. > Please try removing the lin

Re: BM25 Scoring Patch

2010-02-16 Thread Joaquin Perez Iglesias
Hi Ivan, You shouldn't set the BM25Similarity for indexing or searching. Please try removing the lines: writer.setSimilarity(new BM25Similarity()); searcher.setSimilarity(sim); Please let us/me know if you improve your results with these changes. Robert Muir wrote: Hi Ivan, I've seen

Re: BM25 Scoring Patch

2010-02-16 Thread Robert Muir
Hi Ivan, I've seen many cases where BM25 performs worse than Lucene's default Similarity. Perhaps this is just another one? Again, while I have not worked with this particular collection, I looked at the statistics and noted that it's composed of several 'sub-collections': for example the PAT docume