Re: Relevancy Practices

2010-05-05 Thread Grant Ingersoll

On May 2, 2010, at 5:50 AM, Avi Rosenschein wrote:

> On 4/30/10, Grant Ingersoll  wrote:
>> 
>> On Apr 30, 2010, at 8:00 AM, Avi Rosenschein wrote:
>>> Also, tuning the algorithms to the users can be very important. For
>>> instance, we have found that in a basic search functionality, the default
>>> query parser operator OR works very well. But on a page for advanced
>>> users,
>>> who want to very precisely tune their search results, a default of AND
>>> works
>>> better.
>> 
>> Avi,
>> 
>> Great example.  Can you elaborate on how you arrived at this conclusion?
>> What things did you do to determine it was a problem?
>> 
>> -Grant
> 
> Hi Grant,
> 
> Sure. On http://wiki.answers.com/, we use search in a variety of
> places and ways.
> 
> In the basic search box (what you get if you look stuff up in the main
> Ask box on the home page), we generally want the relevancy matching to
> be pretty fuzzy. For example, if the user looked up "Where can you see
> photos of the Aurora Borealis effect?" I would still want to show them
> "Where can you see photos of the Aurora Borealis?" as a match.
> 
> However, the advanced search page,
> http://wiki.answers.com/Q/Special:Search, is used by advanced users to
> filter questions by various facets and searches, and to them it is
> important for the filter to filter out non-matches, since they use it
> as a working page. For example, if they want to do a search for "Harry
> Potter" and classify all results into the "Harry Potter" category, it
> is important that not every match for "Harry" is returned.

I'm curious, Avi, if you can share how you came to these conclusions?  For 
instance, did you have any qualitative evidence that "fuzzy" was better for the 
main page?  Or was it an "I know it when I see it" kind of thing?
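A minimal sketch of the per-page switch Avi describes, assuming Lucene's classic
QueryParser and a hypothetical "contents" field; only the default operator changes
between the basic and advanced pages:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class PerPageQueryParsing {
    // Basic search box: keep the lenient OR default.
    // Advanced search page: require every term with AND.
    public static Query parse(String userInput, boolean advancedPage) throws ParseException {
        QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
                new StandardAnalyzer(Version.LUCENE_30));
        parser.setDefaultOperator(advancedPage ? QueryParser.AND_OPERATOR
                                               : QueryParser.OR_OPERATOR);
        return parser.parse(userInput);
    }
}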






Re: Relevancy Practices

2010-05-05 Thread Grant Ingersoll
Thanks, Peter.

Can you share what kind of evaluations you did to determine that the end user 
believed the results were equally relevant?  How formal was that process?

-Grant

On May 3, 2010, at 11:08 AM, Peter Keegan wrote:

> We discovered very soon after going to production that Lucene's scores were
> often 'too precise'. For example, a page of 25 results may have several
> different score values, all within 15% of each other, but to the end
> user all 25 results were equally relevant. Thus we wanted the secondary sort
> field to determine the order instead. This required writing a custom score
> comparator to 'round' the scores. The same thing occurred for distance
> sorting. We also limit the effect of term frequency to help prevent
> spamming.  In contrast to Avi, we use 'AND' as the default operator for
> keyword queries, and if no docs are found, the query is automatically retried
> with 'OR'. This improves precision a bit and only occurs if the user
> provides no operators.
> 
> Lucene's Explanation class has been invaluable in helping me to explain a
> particular sort order in many, many situations.
> Most of our relevance tuning has occurred after deployment to production.
> 
> Peter
> 
> On Thu, Apr 29, 2010 at 10:14 AM, Grant Ingersoll wrote:
> 
>> I'm putting on a talk at Lucene Eurocon (
>> http://lucene-eurocon.org/sessions-track1-day2.html#1) on "Practical
>> Relevance" and I'm curious as to what people put in practice for testing and
>> improving relevance.  I have my own inclinations, but I don't want to muddy
>> the water just yet.  So, if you have a few moments, I'd love to hear
>> responses to the following questions.
>> 
>> What worked?
>> What didn't work?
>> What didn't you understand about it?
>> What tools did you use?
>> What tools did you wish you had either for debugging relevance or "fixing"
>> it?
>> How much time did you spend on it?
>> How did you avoid over/under tuning?
>> What stage of development/testing/production did you decide to do relevance
>> tuning?  Was that timing planned or not?
>> 
>> 
>> Thanks,
>> Grant
>> 
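A minimal sketch (field name and page size are made up) of the AND-first, OR-fallback
behaviour Peter describes above: parse with AND as the default operator and, only if
nothing matches, re-run the same input with OR:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.Version;

public class AndThenOrSearch {
    public static TopDocs search(IndexSearcher searcher, String userInput)
            throws IOException, ParseException {
        TopDocs hits = run(searcher, userInput, QueryParser.AND_OPERATOR);
        if (hits.totalHits == 0) {
            // Nothing matched the strict query; fall back to the lenient default.
            // (In practice you would skip the retry if the user typed explicit operators.)
            hits = run(searcher, userInput, QueryParser.OR_OPERATOR);
        }
        return hits;
    }

    private static TopDocs run(IndexSearcher searcher, String userInput, QueryParser.Operator op)
            throws IOException, ParseException {
        QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
                new StandardAnalyzer(Version.LUCENE_30));
        parser.setDefaultOperator(op);
        Query query = parser.parse(userInput);
        return searcher.search(query, 25);
    }
}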





Re: Relevancy Practices

2010-05-05 Thread Peter Keegan
The feedback came directly from customers and customer-facing support folks.
Here is an example of a query with keywords: nurse, rn, nursing, hospital.
The top 2 hits have scores of 26.86348 and 26.407215. To the customer, both
results were equally relevant because all of their keywords were in the
documents. For this application, the subtleties of TF/IDF are not
appreciated by the end user ;-).  Here are the Explanations for the scores
(I hope they are readable):

Doc 1:

26.86348  sum of:
  26.86348  product of:
33.57935  sum of:
  10.403484  weight(contents:nurse in 110320), product of:
0.30413723  queryWeight(contents:nurse), product of:
  4.8375363  idf(contents:  nurse=9554)
  0.06287027  queryNorm
34.206547  fieldWeight(contents:nurse in 110320), product of:
  7.071068  btq, product of:
1.4142135  tf(phraseFreq=2.0)
5.0  scorePayload(...)
  4.8375363  idf(contents:  nurse=9554)
  1.0  fieldNorm(field=contents, doc=110320)
  11.005695  weight(contents:rn in 110320), product of:
0.31281596  queryWeight(contents:rn), product of:
  4.9755783  idf(contents:  rn=8322)
  0.06287027  queryNorm
35.18265  fieldWeight(contents:rn in 110320), product of:
  7.071068  btq, product of:
1.4142135  tf(phraseFreq=3.0)
5.0  scorePayload(...)
  4.9755783  idf(contents:  rn=8322)
  1.0  fieldNorm(field=contents, doc=110320)
  10.136917  weight(contents:nursing in 110320), product of:
0.3002155  queryWeight(contents:nursing), product of:
  4.7751584  idf(contents:  nursing=10169)
  0.06287027  queryNorm
33.76547  fieldWeight(contents:nursing in 110320), product of:
  7.071068  btq, product of:
1.4142135  tf(phraseFreq=11.0)
5.0  scorePayload(...)
  4.7751584  idf(contents:  nursing=10169)
  1.0  fieldNorm(field=contents, doc=110320)
  2.0332527  weight(contents:hospital in 110320), product of:
0.30064976  queryWeight(contents:hospital), product of:
  4.7820654  idf(contents:  hospital=10099)
  0.06287027  queryNorm
6.7628617  fieldWeight(contents:hospital in 110320), product of:
  1.4142135  btq, product of:
1.4142135  tf(phraseFreq=3.0)
1.0  scorePayload(...)
  4.7820654  idf(contents:  hospital=10099)
  1.0  fieldNorm(field=contents, doc=110320)
0.8  coord(4/5)

Doc 2:

26.407215  sum of:
  26.407215  product of:
33.009018  sum of:
  10.403484  weight(contents:nurse in 271166), product of:
0.30413723  queryWeight(contents:nurse), product of:
  4.8375363  idf(contents:  nurse=9554)
  0.06287027  queryNorm
34.206547  fieldWeight(contents:nurse in 271166), product of:
  7.071068  btq, product of:
1.4142135  tf(phraseFreq=4.0)
5.0  scorePayload(...)
  4.8375363  idf(contents:  nurse=9554)
  1.0  fieldNorm(field=contents, doc=271166)
  11.005695  weight(contents:rn in 271166), product of:
0.31281596  queryWeight(contents:rn), product of:
  4.9755783  idf(contents:  rn=8322)
  0.06287027  queryNorm
35.18265  fieldWeight(contents:rn in 271166), product of:
  7.071068  btq, product of:
1.4142135  tf(phraseFreq=4.0)
5.0  scorePayload(...)
  4.9755783  idf(contents:  rn=8322)
  1.0  fieldNorm(field=contents, doc=271166)
  1.4335766  weight(contents:nursing in 271166), product of:
0.3002155  queryWeight(contents:nursing), product of:
  4.7751584  idf(contents:  nursing=10169)
  0.06287027  queryNorm
4.7751584  fieldWeight(contents:nursing in 271166), product of:
  1.0  btq, product of:
1.0  tf(phraseFreq=1.0)
1.0  scorePayload(...)
  4.7751584  idf(contents:  nursing=10169)
  1.0  fieldNorm(field=contents, doc=271166)
  10.166264  weight(contents:hospital in 271166), product of:
0.30064976  queryWeight(contents:hospital), product of:
  4.7820654  idf(contents:  hospital=10099)
  0.06287027  queryNorm
33.81431  fieldWeight(contents:hospital in 271166), product of:
  7.071068  btq, product of:
1.4142135  tf(phraseFreq=9.0)
5.0  scorePayload(...)
  4.7820654  idf(contents:  hospital=10099)
  1.0  fieldNorm(field=contents, doc=271166)
0.8  coord(4/5)
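Explanations like the two above come straight from the searcher; a minimal sketch
(query and searcher are assumed to exist already):

import java.io.IOException;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class ExplainTopHits {
    public static void printExplanations(IndexSearcher searcher, Query query) throws IOException {
        TopDocs top = searcher.search(query, 2);
        for (ScoreDoc hit : top.scoreDocs) {
            // Produces the nested sum/product breakdown shown above.
            Explanation explanation = searcher.explain(query, hit.doc);
            System.out.println(explanation.toString());
        }
    }
}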

Peter
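Peter's custom score comparator itself isn't shown here; as a rough sketch of the
effect he describes (coarse score buckets, with ties broken by a hypothetical
secondary value such as recency), one could post-process the hits like this:

import java.util.Arrays;
import java.util.Comparator;

public class RoundedScoreOrdering {
    public static class Hit {
        final int docId;
        final float score;   // raw Lucene score
        final long recency;  // hypothetical secondary sort value
        Hit(int docId, float score, long recency) {
            this.docId = docId; this.score = score; this.recency = recency;
        }
    }

    /** Sort by score rounded to the nearest whole point (granularity is application-specific),
        newest first within a bucket. */
    public static void sortRounded(Hit[] hits) {
        Arrays.sort(hits, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                int byBucket = Float.compare(Math.round(b.score), Math.round(a.score));
                if (byBucket != 0) return byBucket;
                return a.recency < b.recency ? 1 : (a.recency > b.recency ? -1 : 0);
            }
        });
    }
}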

On Wed, May 5, 2010 at 10:10 AM, Grant Ingersoll wrote:

> Thanks, Peter.
>
> Can you share what kind of evaluations you did to determine that the end
> user believed the results were equally relevant?  How formal was that
> process?
>
> -Grant
>
> On May 3, 2010, at 11:08 AM, Peter Keegan wrote:
>
> > We discovered very soon after going to production that Lucene's scores
> were
> > often 'too precise'. F

Re: Relevancy Practices

2010-05-05 Thread Avi Rosenschein
On Wed, May 5, 2010 at 5:08 PM, Grant Ingersoll  wrote:

>
> On May 2, 2010, at 5:50 AM, Avi Rosenschein wrote:
>
> > On 4/30/10, Grant Ingersoll  wrote:
> >>
> >> On Apr 30, 2010, at 8:00 AM, Avi Rosenschein wrote:
> >>> Also, tuning the algorithms to the users can be very important. For
> >>> instance, we have found that in a basic search functionality, the
> default
> >>> query parser operator OR works very well. But on a page for advanced
> >>> users,
> >>> who want to very precisely tune their search results, a default of AND
> >>> works
> >>> better.
> >>
> >> Avi,
> >>
> >> Great example.  Can you elaborate on how you arrived at this conclusion?
> >> What things did you do to determine it was a problem?
> >>
> >> -Grant
> >
> > Hi Grant,
> >
> > Sure. On http://wiki.answers.com/, we use search in a variety of
> > places and ways.
> >
> > In the basic search box (what you get if you look stuff up in the main
> > Ask box on the home page), we generally want the relevancy matching to
> > be pretty fuzzy. For example, if the user looked up "Where can you see
> > photos of the Aurora Borealis effect?" I would still want to show them
> > "Where can you see photos of the Aurora Borealis?" as a match.
> >
> > However, the advanced search page,
> > http://wiki.answers.com/Q/Special:Search, is used by advanced users to
> > filter questions by various facets and searches, and to them it is
> > important for the filter to filter out non-matches, since they use it
> > as a working page. For example, if they want to do a search for "Harry
> > Potter" and classify all results into the "Harry Potter" category, it
> > is important that not every match for "Harry" is returned.
>
> I'm curious, Avi, if you can share how you came to these conclusions?  For
> instance, did you have any qualitative evidence that "fuzzy" was better for
> the main page?  Or was it an "I know it when I see it" kind of thing?
>

I guess it was an "I know it when I see it" kind of thing. But it is
supported by evidence from our testing team and direct feedback from users.
I guess one could say that the difference is less in the level of user
sophistication (though that is part of it), and more in user expectations
when using different search input methods.

Our home page encourages asking questions in natural language, and therefore
search based on that query is going to need to be "fuzzier" than a strict
match of all the terms.

-- Avi


problem in Lucene's ranking function

2010-05-05 Thread José Ramón Pérez Agüera
Hi all,

We have realized that there is a bug in Lucene's ranking function. Most
ranking functions use a non-linear method to saturate the computation
of term frequencies.
This is due to the fact that the information gained on observing a
term the first time is greater than the information gained on
subsequently seeing the same term. The non-linear method can be as
simple as a logarithmic or square-root function, or a more complex
parameter-based approach like BM25's k1 parameter. S. Robertson (2004,
http://portal.acm.org/citation.cfm?id=1031181) has described the
dangers of combining scores from different document fields and the
most typical errors made when ranking functions are modified to
consider the structure of documents.

To rank these structured documents, Lucene combines the scores from
document fields. The method used by Lucene to compute the score of a
structured document is based on the linear combination of the scores
for each field of the document.

Lucene's ranking function uses the square root of the term frequency
as its non-linear saturation method, but the linear combination of
per-field scores that Lucene uses to compute the score for the whole
document breaks the saturation effect, since the per-field boost
factors are applied after the non-linear method. The consequence is
that a document matching a single query term over several fields could
score much higher than a document matching several query terms in one
field only, which is not a good way to compute relevance and tends to
hurt ranking performance dramatically.
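As a back-of-the-envelope illustration of this (not taken from the paper; idf,
norms and the coord factor are all dropped, leaving only the sqrt(tf) saturation
and the per-field sum):

public class SaturationAcrossFields {
    public static void main(String[] args) {
        // Query: harry potter
        // Doc A contains only "harry", 4 times in each of three fields.
        double docA = Math.sqrt(4) + Math.sqrt(4) + Math.sqrt(4);  // = 6.0, one query term matched
        // Doc B contains both "harry" and "potter", once each, in a single field.
        double docB = Math.sqrt(1) + Math.sqrt(1);                 // = 2.0, both query terms matched
        System.out.println(docA > docB);  // true: the single-term document wins
    }
}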

We have written a paper in which this problem is described and some
experiments are carried out to show its effect on Lucene's performance.
http://km.aifb.kit.edu/ws/semsearch10/Files/bm25f.pdf

Would it be possible to fix this problem so that Lucene works
properly with structured documents?

thank you very much in advance

jose

-- 
Jose R. Pérez-Agüera

Clinical Assistant Professor
Metadata Research Center
School of Information and Library Science
University of North Carolina at Chapel Hill
email: jagu...@email.unc.edu
Web page: http://www.unc.edu/~jaguera/
MRC website: http://ils.unc.edu/mrc/




Re: problem in Lucene's ranking function

2010-05-05 Thread Robert Muir
José, you might want to watch LUCENE-2392.

In this issue, we are proposing adding additional flexibility to the scoring
mechanism including:
* controlling scoring on a per-field basis
* the ability to compute and use aggregate statistics (average field length,
total TF across all docs)
* fine-grained calculation of the score: essentially at the end of the day
if you want, you can implement score() in your Similarity and do whatever
you want, so things like tf() and idf() as methods "go away" in that they
might not even make sense for your scorer. So, SimilarityProvider in this
model gets the flexibility of Scorer hopefully without the hassles.

As far as combining scores across fields, I do not see why
2010/5/5 José Ramón Pérez Agüera 

> Hi all,
>
> We have realized that there is a bug in Lucene's ranking function. Most
> ranking functions use a non-linear method to saturate the computation
> of term frequencies.
> This is due to the fact that the information gained on observing a
> term the first time is greater than the information gained on
> subsequently seeing the same term. The non-linear method can be as
> simple as a logarithmic or square-root function, or a more complex
> parameter-based approach like BM25's k1 parameter. S. Robertson (2004,
> http://portal.acm.org/citation.cfm?id=1031181) has described the
> dangers of combining scores from different document fields and the
> most typical errors made when ranking functions are modified to
> consider the structure of documents.
>
> To rank these structured documents, Lucene combines the scores from
> document fields. The method used by Lucene to compute the score of a
> structured document is based on the linear combination of the scores
> for each field of the document.
>
> Lucene's ranking function uses the square root of the term frequency
> as its non-linear saturation method, but the linear combination of
> per-field scores that Lucene uses to compute the score for the whole
> document breaks the saturation effect, since the per-field boost
> factors are applied after the non-linear method. The consequence is
> that a document matching a single query term over several fields could
> score much higher than a document matching several query terms in one
> field only, which is not a good way to compute relevance and tends to
> hurt ranking performance dramatically.
>
> We have written a paper in which this problem is described and some
> experiments are carried out to show its effect on Lucene's performance.
> http://km.aifb.kit.edu/ws/semsearch10/Files/bm25f.pdf
>
> Would it be possible to fix this problem so that Lucene works
> properly with structured documents?
>
> thank you very much in advance
>
> jose
>
> --
> Jose R. Pérez-Agüera
>
> Clinical Assistant Professor
> Metadata Research Center
> School of Information and Library Science
> University of North Carolina at Chapel Hill
> email: jagu...@email.unc.edu
> Web page: http://www.unc.edu/~jaguera/
> MRC website: http://ils.unc.edu/mrc/
>
>
>


-- 
Robert Muir
rcm...@gmail.com


Re: problem in Lucene's ranking function

2010-05-05 Thread José Ramón Pérez Agüera
Hi Robert,

thank you very much for your quick response, I have a couple of questions,

did you read the papers that I mention in my e-mail?
do you think that Lucene ranking function could have this problem?

My concern is not about how to implement different kinds of ranking
functions for Lucene; I know that you are doing very nice work to
implement a very flexible ranking framework for Lucene. My concern is
about a bug which is independent of the ranking function that you are
using and which appears whenever some kind of saturation function is
used in combination with a linear combination of fields for structured
documents.

Maybe I'm wrong, but if the linear combination of fields remains in
the core of Lucene's ranking function, Lucene is never going to work
properly when computing the score for structured documents.

I know how to solve the problem, and we have our own implementation of
BM25F for Lucene whose performance is much better than that of the
standard Lucene ranking function, but I think it would be useful for
other Lucene users to know what the problem is when dealing with
structured documents, and how to fix it for the next version,
independently of what ranking function is finally implemented for Lucene.

jose

On Wed, May 5, 2010 at 1:38 PM, Robert Muir  wrote:
> José, you might want to watch LUCENE-2392.
>
> In this issue, we are proposing adding additional flexibility to the scoring
> mechanism including:
> * controlling scoring on a per-field basis
> * the ability to compute and use aggregate statistics (average field length,
> total TF across all docs)
> * fine-grained calculation of the score: essentially at the end of the day
> if you want, you can implement score() in your Similarity and do whatever
> you want, so things like tf() and idf() as methods "go away" in that they
> might not even make sense for your scorer. So, SimilarityProvider in this
> model gets the flexibility of Scorer hopefully without the hassles.
>
> As far as combining scores across fields, I do not see why
> 2010/5/5 José Ramón Pérez Agüera 
>
>> Hi all,
>>
>> We have realized that there is a bug in Lucene's ranking function. Most
>> ranking functions use a non-linear method to saturate the computation
>> of term frequencies.
>> This is due to the fact that the information gained on observing a
>> term the first time is greater than the information gained on
>> subsequently seeing the same term. The non-linear method can be as
>> simple as a logarithmic or square-root function, or a more complex
>> parameter-based approach like BM25's k1 parameter. S. Robertson (2004,
>> http://portal.acm.org/citation.cfm?id=1031181) has described the
>> dangers of combining scores from different document fields and the
>> most typical errors made when ranking functions are modified to
>> consider the structure of documents.
>>
>> To rank these structured documents, Lucene combines the scores from
>> document fields. The method used by Lucene to compute the score of a
>> structured document is based on the linear combination of the scores
>> for each field of the document.
>>
>> Lucene's ranking function uses the square root of the term frequency
>> as its non-linear saturation method, but the linear combination of
>> per-field scores that Lucene uses to compute the score for the whole
>> document breaks the saturation effect, since the per-field boost
>> factors are applied after the non-linear method. The consequence is
>> that a document matching a single query term over several fields could
>> score much higher than a document matching several query terms in one
>> field only, which is not a good way to compute relevance and tends to
>> hurt ranking performance dramatically.
>>
>> We have written a paper in which this problem is described and some
>> experiments are carried out to show its effect on Lucene's performance.
>> http://km.aifb.kit.edu/ws/semsearch10/Files/bm25f.pdf
>>
>> Would it be possible to fix this problem so that Lucene works
>> properly with structured documents?
>>
>> thank you very much in advance
>>
>> jose
>>
>> --
>> Jose R. Pérez-Agüera
>>
>> Clinical Assistant Professor
>> Metadata Research Center
>> School of Information and Library Science
>> University of North Carolina at Chapel Hill
>> email: jagu...@email.unc.edu
>> Web page: http://www.unc.edu/~jaguera/
>> MRC website: http://ils.unc.edu/mrc/
>>
>>
>>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>



-- 
Jose R. Pérez-Agüera

Clinical Assistant Professor
Metadata Research Center
School of Information and Library Science
University of North Carolina at Chapel Hill
email: jagu...@email.unc.edu
Web page: http://www.unc.edu/~jaguera/
MRC website: http://ils.unc.edu/mrc/


Re: problem in Lucene's ranking function

2010-05-05 Thread Robert Muir
2010/5/5 José Ramón Pérez Agüera 

> Hi Robert,
>
> thank you very much for your quick response, I have a couple of questions,
>
> did you read the papers that I mention in my e-mail?
>

Yes.


> do you think that Lucene ranking function could have this problem?
>
>
I know it does.


> My concern is not about how to implement different kinds of ranking
> functions for Lucene; I know that you are doing very nice work to
> implement a very flexible ranking framework for Lucene. My concern is
> about a bug which is independent of the ranking function that you are
> using and which appears whenever some kind of saturation function is
> used in combination with a linear combination of fields for structured
> documents.
>

I think we might disagree here though. Must 'the combining of scores from
different fields' be hardcoded to one simple solution, or should it be
something that you can control yourself?

For example, it appears Terrier implements something different for this
problem, not the paper you referenced but a different technique?:
http://terrier.org/docs/v3.0/javadoc/org/terrier/matching/models/BM25F.html But
I don't quite understand all the subtleties involved... it seems in this
other paper there is still a linear combination, but you introduce
additional per-field parameters.

The thing that makes me nervous about "hardcoding/changing" the way that
scores are combined across fields is that Lucene presents some strange
peculiarities, most notably the ability to use different scoring models for
different fields. This in fact already exists today, if you "omitTF" for one
field but not for another, you are using a different scoring model for the
two fields.
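A small sketch of the per-field difference Robert mentions, using the Field API
available in Lucene 2.9/3.0 (field names are hypothetical):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class OmitTfExample {
    public static Document makeDoc(String title, String body) {
        Document doc = new Document();
        Field titleField = new Field("title", title, Field.Store.YES, Field.Index.ANALYZED);
        titleField.setOmitTermFreqAndPositions(true); // "title" is scored without term frequencies
        doc.add(titleField);
        doc.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED)); // "body" keeps full tf
        return doc;
    }
}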


> Maybe I'm wrong, but if the linear combination of fields remains in
> the core of Lucene's ranking function, Lucene is never going to work
> properly when computing the score for structured documents.
>

I wouldn't say never, maybe we will not get there in the first go, but
hopefully at least you will be able to do the things I mentioned above, such
as using different similarities for different fields, including ones that
are not supported today.


>
> I know how to solve the problem, and we have our own implementation of
> BM25F for Lucene whose performance is much better than that of the
> standard Lucene ranking function, but I think it would be useful for
> other Lucene users to know what the problem is when dealing with
> structured documents, and how to fix it for the next version,
> independently of what ranking function is finally implemented for Lucene.
>
>
It would be great if you could help us on that issue (I know the patch is a
bit out of date), to try to fix the scoring APIs, including perhaps thinking
about how to improve search across multiple fields for structured documents.

In my opinion, I would like to see the situation evolve away from "which
ranking function is implemented for Lucene" and instead toward having a variety of
built-in functions you can choose from.

So, I would rather it be more like Analyzers, where we have a variety of
high-quality implementations available, and you can make your own if you
must, but there is no real default.

-- 
Robert Muir
rcm...@gmail.com


Re: problem in Lucene's ranking function

2010-05-05 Thread José Ramón Pérez Agüera
Hi Robert,

The problem is not the linear combination of fields; the problem is
applying the per-field boost factor after the term frequency saturation
function and then making the linear combination of fields. Every system
that implements BM25F, including Terrier, takes care of that, because if
you don't do it you have a bug in your ranking function and not just a
different ranking function.

It is very easy to solve the problem in the current Lucene ranking
function: just move the per-field boost factor inside the term
frequency square root, that's all. If you implement this little
change, Lucene's ranking function will work properly with structured
documents, and all your other concerns about allowing users to
implement different ranking functions for different situations will
not be affected by this change.
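A minimal numeric sketch of the suggested change (idf and norm factors dropped):
accumulating the boosted term frequencies first and saturating once keeps the
across-field case in line with the single-field case:

public class BoostInsideSqrt {
    public static void main(String[] args) {
        double[] tfPerField    = {1, 1, 1};  // the same term once in each of three fields
        double[] boostPerField = {1, 1, 1};

        double boostOutside = 0, weightedTf = 0;
        for (int f = 0; f < tfPerField.length; f++) {
            // Current combination: saturate per field, then apply the boost and sum.
            boostOutside += boostPerField[f] * Math.sqrt(tfPerField[f]);
            // Suggested change: accumulate the boosted raw tf first.
            weightedTf += boostPerField[f] * tfPerField[f];
        }
        double boostInside = Math.sqrt(weightedTf);  // saturate once per document, BM25F-style

        System.out.println(boostOutside);  // 3.0  (saturation defeated by the field split)
        System.out.println(boostInside);   // ~1.73 (same as tf=3 in a single field)
    }
}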

I really appreciate your work to improve Lucene's ranking function, and
your time spent responding to these emails :-)

best

jose

On Wed, May 5, 2010 at 2:12 PM, Robert Muir  wrote:
> 2010/5/5 José Ramón Pérez Agüera 
>
>> Hi Robert,
>>
>> thank you very much for your quick response, I have a couple of questions,
>>
>> did you read the papers that I mention in my e-mail?
>>
>
> Yes.
>
>
>> do you think that Lucene ranking function could have this problem?
>>
>>
> I know it does.
>
>
>> My concern is not about how to implement different kinds of ranking
>> functions for Lucene; I know that you are doing very nice work to
>> implement a very flexible ranking framework for Lucene. My concern is
>> about a bug which is independent of the ranking function that you are
>> using and which appears whenever some kind of saturation function is
>> used in combination with a linear combination of fields for structured
>> documents.
>>
>
> I think we might disagree here though. Must 'the combining of scores from
> different fields' be hardcoded to one simple solution, or should it be
> something that you can control yourself?
>
> For example, it appears Terrier implements something different for this
> problem, not the paper you referenced but a different technique?:
> http://terrier.org/docs/v3.0/javadoc/org/terrier/matching/models/BM25F.html 
> But
> I don't quite understand all the subtleties involved... it seems in this
> other paper there is still a linear combination, but you introduce
> additional per-field parameters.
>
> The thing that makes me nervous about "hardcoding/changing" the way that
> scores are combined across fields is that Lucene presents some strange
> peculiarities, most notably the ability to use different scoring models for
> different fields. This in fact already exists today, if you "omitTF" for one
> field but not for another, you are using a different scoring model for the
> two fields.
>
>
>> Maybe I'm wrong, but if the linear combination of fields remains in
>> the core of Lucene's ranking function, Lucene is never going to work
>> properly when computing the score for structured documents.
>>
>
> I wouldn't say never, maybe we will not get there in the first go, but
> hopefully at least you will be able to do the things I mentioned above, such
> as using different similarities for different fields, including ones that
> are not supported today.
>
>
>>
>> I know how to solve the problem, and we have our own implementation of
>> BM25F for Lucene whose performance is much better than that of the
>> standard Lucene ranking function, but I think it would be useful for
>> other Lucene users to know what the problem is when dealing with
>> structured documents, and how to fix it for the next version,
>> independently of what ranking function is finally implemented for Lucene.
>>
>>
> It would be great if you could help us on that issue (I know the patch is a
> bit out of date), to try to fix the scoring APIs, including perhaps thinking
> about how to improve search across multiple fields for structured documents.
>
> In my opinion, I would like to see the situation evolve away from "which
> ranking function is implemented for Lucene" and instead toward having a variety of
> built-in functions you can choose from.
>
> So, I would rather it be more like Analyzers, where we have a variety of
> high-quality implementations available, and you can make your own if you
> must, but there is no real default.
>
> --
> Robert Muir
> rcm...@gmail.com
>



-- 
Jose R. Pérez-Agüera

Clinical Assistant Professor
Metadata Research Center
School of Information and Library Science
University of North Carolina at Chapel Hill
email: jagu...@email.unc.edu
Web page: http://www.unc.edu/~jaguera/
MRC website: http://ils.unc.edu/mrc/




Re: problem in Lucene's ranking function

2010-05-05 Thread Robert Muir
2010/5/5 José Ramón Pérez Agüera 

> Hi Robert,
>
> The problem is not the linear combination of fields; the problem is
> applying the per-field boost factor after the term frequency saturation
> function and then making the linear combination of fields. Every system
> that implements BM25F, including Terrier, takes care of that, because if
> you don't do it you have a bug in your ranking function and not just a
> different ranking function.
>

José, well then this should not be much of a problem to handle in
LUCENE-2392, because as I mentioned, if you have a tf() or idf() it's really
because you decided to do this yourself. So you could easily apply the boost
inside your log or sqrt or whatever, if you want.

But what I propose we do, is make sure the relevance functions we provide
(especially any default for 4.0) take care of this for your structured case,
while still providing the capability for someone to get the old behavior
[see below]


> If you implement this little
> change, Lucene's ranking function will work properly with structured
> documents, and all your other concerns about allowing users to
> implement different ranking functions for different situations will
> not be affected by this change.
>
>
Well, I'm not sure all my concerns go away! I think its best to implement a
change like this in the flexible scoring framework (LUCENE-2392), so that
users, if they want, can get the old behavior: "the bug" as you call it.

The reason I say this is due to the unique cases of Lucene: some people are
doing scoring in very crazy ways and if they aren't able to get the old
behavior with regards to boosting, they might be upset... even if it is
really giving them worse relevance...

-- 
Robert Muir
rcm...@gmail.com


Re: problem in Lucene's ranking function

2010-05-05 Thread José Ramón Pérez Agüera
Hi Robert,

I will be very happy to see this problem fixed :-) I cannot imagine
what reasons people have to use software with bugs; I guess that
other bugs in Lucene get removed. Anyway, if you are finally going to
fix the problem, that is good news :-) Thank you very much for your
time.

jose

On Wed, May 5, 2010 at 3:10 PM, Robert Muir  wrote:
> 2010/5/5 José Ramón Pérez Agüera 
>
>> Hi Robert,
>>
>> The problem is not the linear combination of fields; the problem is
>> applying the per-field boost factor after the term frequency saturation
>> function and then making the linear combination of fields. Every system
>> that implements BM25F, including Terrier, takes care of that, because if
>> you don't do it you have a bug in your ranking function and not just a
>> different ranking function.
>>
>
> José, well then this should not be much of a problem to handle in
> LUCENE-2392, because as I mentioned, if you have a tf() or idf() it's really
> because you decided to do this yourself. So you could easily apply the boost
> inside your log or sqrt or whatever, if you want.
>
> But what I propose we do, is make sure the relevance functions we provide
> (especially any default for 4.0) take care of this for your structured case,
> while still providing the capability for someone to get the old behavior
> [see below]
>
>
>> If you implement this little
>> change, Lucene's ranking function will work properly with structured
>> documents, and all your other concerns about allowing users to
>> implement different ranking functions for different situations will
>> not be affected by this change.
>>
>>
> Well, I'm not sure all my concerns go away! I think it's best to implement a
> change like this in the flexible scoring framework (LUCENE-2392), so that
> users, if they want, can get the old behavior: "the bug" as you call it.
>
> The reason I say this is due to the unique cases of Lucene: some people are
> doing scoring in very crazy ways and if they aren't able to get the old
> behavior with regards to boosting, they might be upset... even if it is
> really giving them worse relevance...
>
> --
> Robert Muir
> rcm...@gmail.com
>



-- 
Jose R. Pérez-Agüera

Clinical Assistant Professor
Metadata Research Center
School of Information and Library Science
University of North Carolina at Chapel Hill
email: jagu...@email.unc.edu
Web page: http://www.unc.edu/~jaguera/
MRC website: http://ils.unc.edu/mrc/




Re: problem in Lucene's ranking function

2010-05-05 Thread Yonik Seeley
2010/5/5 José Ramón Pérez Agüera :
[...]
> The consequence is that a document
> matching a single query term over several fields could score much
> higher than a document matching several query terms in one field only,

One partial workaround that people use is DisjunctionMaxQuery (used by
"dismax" query parser in Solr).
http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/DisjunctionMaxQuery.html
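A minimal sketch of that workaround (field names are made up): with a
DisjunctionMaxQuery only the best field match contributes fully, plus a small
tie-breaker, instead of every field's score being summed:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class DisMaxExample {
    public static Query acrossFields(String term) {
        DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(0.1f); // 0.1 = tie-breaker multiplier
        dmq.add(new TermQuery(new Term("title", term)));
        dmq.add(new TermQuery(new Term("body", term)));
        dmq.add(new TermQuery(new Term("keywords", term)));
        return dmq;
    }
}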

-Yonik
Apache Lucene Eurocon 2010
18-21 May 2010 | Prague




Re: How can I merge .cfx and .cfs into a single cfs file?

2010-05-05 Thread 张志田
Thank you Mike.

Garry
- Original Message - 
From: "Michael McCandless" 
To: 
Sent: Wednesday, May 05, 2010 8:24 PM
Subject: Re: How can I merge .cfx and .cfs into a single cfs file?


Lucene considers an index with a single .cfx and a single .cfs as optimized.

Also, note that how Lucene stores files in the index is an impl detail
-- it can change from release to release -- so relying on any of these
details is dangerous.

That said, with recent Lucene versions, if you really want to force
these two files to be consolidated, you can make a custom MergePolicy
that returns a merge from the findMergesForOptimize for this case (the
default MergePolicy returns null since it thinks this case is already
optimized).

Or... if you index with two separate IndexWriter sessions, and then
call optimize, that should also merge down to one file.

Mike

2010/5/5 张志田 :
> Uwe, thank you very much.
>
> What is the mechanizm lucene will merge these two kinds of files? Sometimes I 
> found there was only one .cfs file, but in another time there may be one cfs 
> and cfx. I understand the .cfx is used to store the term vectors etc, but why 
> does the index result not seem to be consistent?
>
> Thanks,
> Garry
> - Original Message -
> From: "Uwe Goetzke" 
> To: 
> Sent: Wednesday, May 05, 2010 3:57 PM
> Subject: AW: How can I merge .cfx and .cfs into a single cfs file?
>
>
> Index all into a directory and determine the size of all files in it.
>
> From http://lucene.apache.org/java/3_0_1/fileformats.html
> Starting with Lucene 2.3, doc store files (stored field values and term 
> vectors) can be shared in a single set of files for more than one segment. 
> When compound file is enabled, these shared files will be added into a single 
> compound file (same format as above) but with the extension .cfx.
>
> In addition to
> Compound File  .cfs  An optional "virtual" file consisting of all the other 
> index files for systems that frequently run out of file handles.
>
> Uwe
>
>
> -Original Message-
> From: 张志田 [mailto:zhitian.zh...@dianping.com]
> Sent: Wednesday, May 5, 2010 08:24
> To: java-user@lucene.apache.org
> Subject: How can I merge .cfx and .cfs into a single cfs file?
>
> Hi all,
>
> I have an index task which will index thousands of records with Lucene 3.0.1. 
> My confusion is that Lucene will always create a .cfx and a .cfs file in the 
> file system, sometimes more, while I thought it should create a single .cfs 
> file if I optimize the index data. Is it by design? If yes, is there any 
> way/configuration I can use to merge all of the index files into a single one?
>
> By the way, I have some logic to validate the index data: if the size of the 
> .cfs file increases dramatically compared to the file generated last time, 
> there may be something wrong, and a warning message will be thrown. This is 
> the reason that I want to generate a single .cfs file. Any other suggestion 
> about the index validation?
>
> Can anybody give me a hand?
>
> Thanks in advance.
>
> Garry
>
>



Re: Using IndexReader in the web environment

2010-05-05 Thread Ivan Liu
You might look at this (Config and log are application-specific helpers):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

private static IndexSearcher indexSearcher = null;

// Synchronized so only one thread opens or reopens the shared searcher at a time.
public static synchronized IndexSearcher newIndexSearcher() {
  try {
    if (indexSearcher == null) {
      // First call: open a read-only reader on the index directory.
      Directory directory = FSDirectory.open(new File(Config.DB_DIR + "/rssindex"));
      indexSearcher = new IndexSearcher(IndexReader.open(directory, true));
    } else {
      // Later calls: reopen the old reader; reopen() returns the same
      // instance if the index has not changed.
      IndexReader indexReader = indexSearcher.getIndexReader();
      IndexReader newIndexReader = indexReader.reopen();
      if (newIndexReader != indexReader) {
        indexReader.close();   // close the old reader...
        indexSearcher.close(); // ...and the searcher that wrapped it
        indexSearcher = new IndexSearcher(newIndexReader);
      }
    }
    return indexSearcher;
  } catch (CorruptIndexException e) {
    log.error(e.getMessage(), e);
    return null;
  } catch (IOException e) {
    log.error(e.getMessage(), e);
    return null;
  }
}



-- 
冲浪板

my blog:http://chonglangban.appspot.com/
my site:http://kejiblog.appspot.com/


AW: How can I merge .cfx and .cfs into a single cfs file?

2010-05-05 Thread Uwe Goetzke
Index all into a directory and determine the size of all files in it.
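A minimal sketch of that check, summing every file in the index directory instead
of watching the .cfs file alone:

import java.io.File;

public class IndexSizeCheck {
    public static long totalIndexBytes(File indexDir) {
        long total = 0;
        File[] files = indexDir.listFiles();
        if (files != null) {
            for (File f : files) {
                if (f.isFile()) {
                    total += f.length(); // sum the size of every index file
                }
            }
        }
        return total;
    }
}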

From http://lucene.apache.org/java/3_0_1/fileformats.html 
Starting with Lucene 2.3, doc store files (stored field values and term 
vectors) can be shared in a single set of files for more than one segment. When 
compound file is enabled, these shared files will be added into a single 
compound file (same format as above) but with the extension .cfx.

In addition to
Compound File  .cfs  An optional "virtual" file consisting of all the other 
index files for systems that frequently run out of file handles.

Uwe


-Original Message-
From: 张志田 [mailto:zhitian.zh...@dianping.com] 
Sent: Wednesday, May 5, 2010 08:24
To: java-user@lucene.apache.org
Subject: How can I merge .cfx and .cfs into a single cfs file?

Hi all,

I have an index task which will index thousands of records with Lucene 3.0.1. 
My confusion is that Lucene will always create a .cfx and a .cfs file in the file 
system, sometimes more, while I thought it should create a single .cfs file if 
I optimize the index data. Is it by design? If yes, is there any 
way/configuration I can use to merge all of the index files into a single one?

By the way, I have some logic to validate the index data: if the size of the .cfs 
file increases dramatically compared to the file generated last time, there may be 
something wrong, and a warning message will be thrown. This is the reason that I 
want to generate a single .cfs file. Any other suggestion about the index 
validation?

Can anybody give me a hand?

Thanks in advance.

Garry




Re: How can I merge .cfx and .cfs into a single cfs file?

2010-05-05 Thread 张志田
Uwe, thank you very much.

What is the mechanism by which Lucene merges these two kinds of files? Sometimes I 
found there was only one .cfs file, but at other times there may be one .cfs 
and one .cfx. I understand the .cfx is used to store the term vectors etc., but why 
does the index result not seem to be consistent?

Thanks,
Garry
- Original Message - 
From: "Uwe Goetzke" 
To: 
Sent: Wednesday, May 05, 2010 3:57 PM
Subject: AW: How can I merge .cfx and .cfs into a single cfs file?


Index all into a directory and determine the size of all files in it.

From http://lucene.apache.org/java/3_0_1/fileformats.html 
Starting with Lucene 2.3, doc store files (stored field values and term 
vectors) can be shared in a single set of files for more than one segment. When 
compound file is enabled, these shared files will be added into a single 
compound file (same format as above) but with the extension .cfx.

In addition to
Compound File  .cfs  An optional "virtual" file consisting of all the other 
index files for systems that frequently run out of file handles.

Uwe


-Original Message-
From: 张志田 [mailto:zhitian.zh...@dianping.com] 
Sent: Wednesday, May 5, 2010 08:24
To: java-user@lucene.apache.org
Subject: How can I merge .cfx and .cfs into a single cfs file?

Hi all,

I have an index task which will index thousands of records with Lucene 3.0.1. 
My confusion is that Lucene will always create a .cfx and a .cfs file in the file 
system, sometimes more, while I thought it should create a single .cfs file if 
I optimize the index data. Is it by design? If yes, is there any 
way/configuration I can use to merge all of the index files into a single one?

By the way, I have some logic to validate the index data: if the size of the .cfs 
file increases dramatically compared to the file generated last time, there may be 
something wrong, and a warning message will be thrown. This is the reason that I 
want to generate a single .cfs file. Any other suggestion about the index 
validation?

Can anybody give me a hand?

Thanks in advance.

Garry



Re: Using IndexReader in the web environment

2010-05-05 Thread Ian Lea
You could tell the searching part of your app, via some notification
or messaging call.  Or call IndexReader.isCurrent() from time to time,
or even on every search, and reopen() if necessary.  See the javadocs
and don't forget to close the old reader when you do call reopen.
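A minimal sketch of the isCurrent()/reopen() route (a shared reader held by the
web application; names are made up):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

public class ReaderRefresher {
    private IndexReader reader;

    public ReaderRefresher(IndexReader initial) {
        this.reader = initial;
    }

    /** Call periodically, or before a search, to pick up index changes. */
    public synchronized IndexReader current() throws IOException {
        if (!reader.isCurrent()) {
            IndexReader newReader = reader.reopen();
            if (newReader != reader) {
                reader.close();   // don't forget to close the old reader
                reader = newReader;
            }
        }
        return reader;
    }
}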


--
Ian.


On Wed, May 5, 2010 at 5:17 AM, Vijay Veeraraghavan
 wrote:
> hey Ian,
> thanks for the reply, I find it very useful. My report-generating
> scheduler will run periodically; once done it will invoke the indexer
> and exit. In this case I do not know if the index has changed or not.
> How do I keep track of the changes in the index, since the two entities,
> the scheduler/indexer and the web application, are totally separate?
>
> Vijay
>
> On 5/4/10, Ian Lea  wrote:
>> For best performance you should aim to keep a shared index searcher,
>> or the underlying index reader, open as long as possible.  You may of
>> course need to reopen it if/when the index changes.  As to scope, you
>> can store it wherever it makes sense for your application.
>>
>>
>> --
>> Ian.
>>
>>
>> On Tue, May 4, 2010 at 10:13 AM, Vijay Veeraraghavan
>>  wrote:
>>> Hi,
>>> Thanks for the reply. So I will have a dedicated servlet to search the
>>> index, but does it mean that the IndexSearcher does not close the
>>> index and keeps it open? Is it not possible to keep it in the application
>>> scope?
>>>
>>> Vijay
>>>
>>> On 5/3/10, Vijay Veeraraghavan  wrote:
 Hi all,

 In a clustered environment I search the index from the web
 application. In the web application I am creating IndexReader on each
 request. Is it expensive to do it like this? I read somewhere on the web
 that I should try using the same reader as much as possible. Can I keep the
 initially created IndexReader in the session/application scopes and
 use the same for each request? Any other idea?

 Viay

 On 5/3/10, Vijay Veeraraghavan  wrote:
> dear all,
>
> as replied below, if I search again for each document in the index and,
> if it is found, skip the indexing, else index it, is this not similar to
> indexing all the pdf documents once again? Is this not overhead? I am
> not going to index the details of the pdf (so if an indexed pdf was
> recreated I need not reindex it) but just the paths of the documents.
>
> Vijay
>
>>> Hey there,
>>>
>>> you might have to implement some kind of unique identifier using an
>>> indexed lucene field. When you are indexing you should fire a query
>>> with
>>> the
>>> uuid of your document (maybe the path to you pdf document) and check
>>> if
>>> the
>>> document is in the index already. You could also do a boolean query
>>> combining UUID, timestamp and / or a hash value to see if the document
>>> has
>>> been changed. if so you can simply update the document by its UUID
>>> (something like indexwriter.updateDocument(new Term("uuid",
>>> value),document);)
>>>
>>> Unfortunately you have to implement this yourself but it should not be
>>> that
>>> much of a deal.
>>>
>>> simon
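A minimal sketch of the UUID idea Simon describes above, using the PDF's absolute
path as the unique term so re-indexing a file replaces its old document instead of
adding a duplicate (field names are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpsertByPath {
    public static void indexPdf(IndexWriter writer, String absolutePath, String fileName) throws Exception {
        Document doc = new Document();
        doc.add(new Field("path", absolutePath, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("name", fileName, Field.Store.YES, Field.Index.ANALYZED));
        // Deletes any existing document with this path, then adds the new one.
        writer.updateDocument(new Term("path", absolutePath), doc);
    }
}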
>>>
>>> On Mon, May 3, 2010 at 9:21 AM, Vijay Veeraraghavan <
>>> vijay.raghava...@gmail.com> wrote:
>>>
 Dear all,
 I am using lucene 3.0 to index the pdf reports that I generate
 dynamically. I index the pdf file name (without extension), file path
 and its absolute path as fields. I search with the file name without
 extension; it retrieves a list, as usually 2 or more files are
 present
 with the same name in different sub directories. As I create the index
 for the first time it updates, assuming 100 pdf files in different
 directories, the files' meta info. If I do the indexing again, while my
 report generator scheduler has produced 500 more pdf files
 totaling to 600 files in different directories, I wish to index only
 the new files to the index. But presently it’s doing the whole thing
 again (600 files). How to implement this functionality? Think of the
 thousands of pdf files created on each run.

 P.S: I cannot keep the meta-info of generated pdf files in the java
 memory, as it exceeds thousands in a single run, and update the index
 looping this list.

 new IndexWriter(FSDirectory.open(this.indexDir),
     new StandardAnalyzer(Version.LUCENE_CURRENT),
     true,
     IndexWriter.MaxFieldLength.LIMITED);

 Is the boolean parameter for this purpose? Please guide me.

 --
 Thanks
 Vijay Veeraraghavan



 --
 Thanks & Regards
 Vijay Veeraraghavan


Re: How can I merge .cfx and .cfs into a single cfs file?

2010-05-05 Thread Michael McCandless
Lucene considers an index with a single .cfx and a single .cfs as optimized.

Also, note that how Lucene stores files in the index is an impl detail
-- it can change from release to release -- so relying on any of these
details is dangerous.

That said, with recent Lucene versions, if you really want to force
these two files to be consolidated, you can make a custom MergePolicy
that returns a merge from the findMergesForOptimize for this case (the
default MergePolicy returns null since it thinks this case is already
optimized).

Or... if you index with two separate IndexWriter sessions, and then
call optimize, that should also merge down to one file.
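A minimal sketch of that second option (paths and analyzer are made up): open a
writer on the existing index in append mode and optimize:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ForceMergeDown {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
        // false = append to the existing index rather than recreating it.
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
                false, IndexWriter.MaxFieldLength.UNLIMITED);
        writer.optimize(); // merges segments; see Mike's caveats above about when this collapses to one file
        writer.close();
    }
}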

Mike

2010/5/5 张志田 :
> Uwe, thank you very much.
>
> What is the mechanism by which Lucene merges these two kinds of files? Sometimes I 
> found there was only one .cfs file, but at other times there may be one .cfs 
> and one .cfx. I understand the .cfx is used to store the term vectors etc., but why 
> does the index result not seem to be consistent?
>
> Thanks,
> Garry
> - Original Message -
> From: "Uwe Goetzke" 
> To: 
> Sent: Wednesday, May 05, 2010 3:57 PM
> Subject: AW: How can I merge .cfx and .cfs into a single cfs file?
>
>
> Index all into a directory and determine the size of all files in it.
>
> From http://lucene.apache.org/java/3_0_1/fileformats.html
> Starting with Lucene 2.3, doc store files (stored field values and term 
> vectors) can be shared in a single set of files for more than one segment. 
> When compound file is enabled, these shared files will be added into a single 
> compound file (same format as above) but with the extension .cfx.
>
> In addition to
> Compound File  .cfs  An optional "virtual" file consisting of all the other 
> index files for systems that frequently run out of file handles.
>
> Uwe
>
>
> -Original Message-
> From: 张志田 [mailto:zhitian.zh...@dianping.com]
> Sent: Wednesday, May 5, 2010 08:24
> To: java-user@lucene.apache.org
> Subject: How can I merge .cfx and .cfs into a single cfs file?
>
> Hi all,
>
> I have an index task which will index thousands of records with Lucene 3.0.1. 
> My confusion is that Lucene will always create a .cfx and a .cfs file in the 
> file system, sometimes more, while I thought it should create a single .cfs 
> file if I optimize the index data. Is it by design? If yes, is there any 
> way/configuration I can use to merge all of the index files into a single one?
>
> By the way, I have some logic to validate the index data: if the size of the 
> .cfs file increases dramatically compared to the file generated last time, 
> there may be something wrong, and a warning message will be thrown. This is 
> the reason that I want to generate a single .cfs file. Any other suggestion 
> about the index validation?
>
> Can anybody give me a hand?
>
> Thanks in advance.
>
> Garry
>
>
