Re: Relevancy Practices
On May 2, 2010, at 5:50 AM, Avi Rosenschein wrote: > On 4/30/10, Grant Ingersoll wrote: >> >> On Apr 30, 2010, at 8:00 AM, Avi Rosenschein wrote: >>> Also, tuning the algorithms to the users can be very important. For >>> instance, we have found that in a basic search functionality, the default >>> query parser operator OR works very well. But on a page for advanced >>> users, >>> who want to very precisely tune their search results, a default of AND >>> works >>> better. >> >> Avi, >> >> Great example. Can you elaborate on how you arrived at this conclusion? >> What things did you do to determine it was a problem? >> >> -Grant > > Hi Grant, > > Sure. On http://wiki.answers.com/, we use search in a variety of > places and ways. > > In the basic search box (what you get if you look stuff up in the main > Ask box on the home page), we generally want the relevancy matching to > be pretty fuzzy. For example, if the user looked up "Where can you see > photos of the Aurora Borealis effect?" I would still want to show them > "Where can you see photos of the Aurora Borealis?" as a match. > > However, the advanced search page, > http://wiki.answers.com/Q/Special:Search, is used by advanced users to > filter questions by various facets and searches, and to them it is > important for the filter to filter out non-matches, since they use it > as a working page. For example, if they want to do a search for "Harry > Potter" and classify all results into the "Harry Potter" category, it > is important that not every match for "Harry" is returned.

I'm curious, Avi, if you can share how you came to these conclusions? For instance, did you have any qualitative evidence that "fuzzy" was better for the main page? Or was it an "I know it when I see it" kind of thing?

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
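For reference, flipping the default operator per page is a one-line QueryParser setting. A minimal sketch against the Lucene 3.0-era API -- the field name "contents" and the analyzer variable are placeholders, not Answers.com's actual configuration, and ParseException handling is omitted:

    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    // basic search box: leave the default OR so loose, fuzzy matches still surface
    QueryParser basicParser = new QueryParser(Version.LUCENE_30, "contents", analyzer);
    basicParser.setDefaultOperator(QueryParser.OR_OPERATOR);

    // advanced search page: require every term so the result set behaves like a filter
    QueryParser advancedParser = new QueryParser(Version.LUCENE_30, "contents", analyzer);
    advancedParser.setDefaultOperator(QueryParser.AND_OPERATOR);

    Query q = advancedParser.parse("harry potter"); // parsed as contents:harry AND contents:potter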
Re: Relevancy Practices
Thanks, Peter. Can you share what kind of evaluations you did to determine that the end user believed the results were equally relevant? How formal was that process? -Grant On May 3, 2010, at 11:08 AM, Peter Keegan wrote: > We discovered very soon after going to production that Lucene's scores were > often 'too precise'. For example, a page of 25 results may have several > different score values, and all within 15% of each other, but to the end > user all 25 results were equally relevant. Thus we wanted the secondary sort > field to determine the order, instead. This required writing a custom score > comparator to 'round' the scores. The same thing occurred for distance > sorting. We also limit the effect of term frequency to help prevent > spamming. In comparison to Avi, we use 'AND' as the default operator for > keyword queries and if no docs are found, the query is automatically retried > with 'OR'. This improves precision a bit and only occurs if the user > provides no operators. > > Lucene's Explanation class has been invaluable in helping me to explain a > particular sort order in many, many situations. > Most of our relevance tuning has occurred after deployment to production. > > Peter > > On Thu, Apr 29, 2010 at 10:14 AM, Grant Ingersoll wrote: > >> I'm putting on a talk at Lucene Eurocon ( >> http://lucene-eurocon.org/sessions-track1-day2.html#1) on "Practical >> Relevance" and I'm curious as to what people put in practice for testing and >> improving relevance. I have my own inclinations, but I don't want to muddy >> the water just yet. So, if you have a few moments, I'd love to hear >> responses to the following questions. >> >> What worked? >> What didn't work? >> What didn't you understand about it? >> What tools did you use? >> What tools did you wish you had either for debugging relevance or "fixing" >> it? >> How much time did you spend on it? >> How did you avoid over/under tuning? >> What stage of development/testing/production did you decide to do relevance >> tuning? Was that timing planned or not? >> >> >> Thanks, >> Grant >> - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
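A rough sketch of the AND-then-OR retry described above -- the searcher, analyzer, user input, and page size of 25 are assumptions, and the real implementation (which also checks that the user supplied no explicit operators) may well differ:

    QueryParser parser = new QueryParser(Version.LUCENE_30, "contents", analyzer);
    parser.setDefaultOperator(QueryParser.AND_OPERATOR);
    TopDocs hits = searcher.search(parser.parse(userInput), 25);
    if (hits.totalHits == 0) {
        // nothing matched all terms: retry with OR to recover recall
        parser.setDefaultOperator(QueryParser.OR_OPERATOR);
        hits = searcher.search(parser.parse(userInput), 25);
    }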
Re: Relevancy Practices
The feedback came directly from customers and customer facing support folks. Here is an example of a query with keywords: nurse, rn, nursing, hospital. The top 2 hits have scores of 26.86348 and 26.407215. To the customer, both results were equally relevant because all of their keywords were in the documents. For this application, the subtleties of TF/IDF are not appreciated by the end user ;-). Here are the Explanations for the scores (I hope they are readable):

Doc 1:
26.86348 sum of:
  26.86348 product of:
    33.57935 sum of:
      10.403484 weight(contents:nurse in 110320), product of:
        0.30413723 queryWeight(contents:nurse), product of:
          4.8375363 idf(contents: nurse=9554)
          0.06287027 queryNorm
        34.206547 fieldWeight(contents:nurse in 110320), product of:
          7.071068 btq, product of:
            1.4142135 tf(phraseFreq=2.0)
            5.0 scorePayload(...)
          4.8375363 idf(contents: nurse=9554)
          1.0 fieldNorm(field=contents, doc=110320)
      11.005695 weight(contents:rn in 110320), product of:
        0.31281596 queryWeight(contents:rn), product of:
          4.9755783 idf(contents: rn=8322)
          0.06287027 queryNorm
        35.18265 fieldWeight(contents:rn in 110320), product of:
          7.071068 btq, product of:
            1.4142135 tf(phraseFreq=3.0)
            5.0 scorePayload(...)
          4.9755783 idf(contents: rn=8322)
          1.0 fieldNorm(field=contents, doc=110320)
      10.136917 weight(contents:nursing in 110320), product of:
        0.3002155 queryWeight(contents:nursing), product of:
          4.7751584 idf(contents: nursing=10169)
          0.06287027 queryNorm
        33.76547 fieldWeight(contents:nursing in 110320), product of:
          7.071068 btq, product of:
            1.4142135 tf(phraseFreq=11.0)
            5.0 scorePayload(...)
          4.7751584 idf(contents: nursing=10169)
          1.0 fieldNorm(field=contents, doc=110320)
      2.0332527 weight(contents:hospital in 110320), product of:
        0.30064976 queryWeight(contents:hospital), product of:
          4.7820654 idf(contents: hospital=10099)
          0.06287027 queryNorm
        6.7628617 fieldWeight(contents:hospital in 110320), product of:
          1.4142135 btq, product of:
            1.4142135 tf(phraseFreq=3.0)
            1.0 scorePayload(...)
          4.7820654 idf(contents: hospital=10099)
          1.0 fieldNorm(field=contents, doc=110320)
    0.8 coord(4/5)

Doc 2:
26.407215 sum of:
  26.407215 product of:
    33.009018 sum of:
      10.403484 weight(contents:nurse in 271166), product of:
        0.30413723 queryWeight(contents:nurse), product of:
          4.8375363 idf(contents: nurse=9554)
          0.06287027 queryNorm
        34.206547 fieldWeight(contents:nurse in 271166), product of:
          7.071068 btq, product of:
            1.4142135 tf(phraseFreq=4.0)
            5.0 scorePayload(...)
          4.8375363 idf(contents: nurse=9554)
          1.0 fieldNorm(field=contents, doc=271166)
      11.005695 weight(contents:rn in 271166), product of:
        0.31281596 queryWeight(contents:rn), product of:
          4.9755783 idf(contents: rn=8322)
          0.06287027 queryNorm
        35.18265 fieldWeight(contents:rn in 271166), product of:
          7.071068 btq, product of:
            1.4142135 tf(phraseFreq=4.0)
            5.0 scorePayload(...)
          4.9755783 idf(contents: rn=8322)
          1.0 fieldNorm(field=contents, doc=271166)
      1.4335766 weight(contents:nursing in 271166), product of:
        0.3002155 queryWeight(contents:nursing), product of:
          4.7751584 idf(contents: nursing=10169)
          0.06287027 queryNorm
        4.7751584 fieldWeight(contents:nursing in 271166), product of:
          1.0 btq, product of:
            1.0 tf(phraseFreq=1.0)
            1.0 scorePayload(...)
          4.7751584 idf(contents: nursing=10169)
          1.0 fieldNorm(field=contents, doc=271166)
      10.166264 weight(contents:hospital in 271166), product of:
        0.30064976 queryWeight(contents:hospital), product of:
          4.7820654 idf(contents: hospital=10099)
          0.06287027 queryNorm
        33.81431 fieldWeight(contents:hospital in 271166), product of:
          7.071068 btq, product of:
            1.4142135 tf(phraseFreq=9.0)
            5.0 scorePayload(...)
          4.7820654 idf(contents: hospital=10099)
          1.0 fieldNorm(field=contents, doc=271166)
    0.8 coord(4/5)

Peter

On Wed, May 5, 2010 at 10:10 AM, Grant Ingersoll wrote: > Thanks, Peter. > > Can you share what kind of evaluations you did to determine that the end > user believed the results were equally relevant? How formal was that > process? > > -Grant > > On May 3, 2010, at 11:08 AM, Peter Keegan wrote: > > > We discovered very soon after going to production that Lucene's scores > were > > often 'too precise'. F
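For anyone who wants to produce output like the above, a small sketch (the query, searcher, and page size of 25 are assumed; exception handling omitted):

    TopDocs top = searcher.search(query, 25);
    for (ScoreDoc sd : top.scoreDocs) {
        // the Explanation is the value/description tree pasted above
        Explanation exp = searcher.explain(query, sd.doc);
        System.out.println(exp.toString());
    }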
Re: Relevancy Practices
On Wed, May 5, 2010 at 5:08 PM, Grant Ingersoll wrote: > > On May 2, 2010, at 5:50 AM, Avi Rosenschein wrote: > > > On 4/30/10, Grant Ingersoll wrote: > >> > >> On Apr 30, 2010, at 8:00 AM, Avi Rosenschein wrote: > >>> Also, tuning the algorithms to the users can be very important. For > >>> instance, we have found that in a basic search functionality, the > default > >>> query parser operator OR works very well. But on a page for advanced > >>> users, > >>> who want to very precisely tune their search results, a default of AND > >>> works > >>> better. > >> > >> Avi, > >> > >> Great example. Can you elaborate on how you arrived at this conclusion? > >> What things did you do to determine it was a problem? > >> > >> -Grant > > > > Hi Grant, > > > > Sure. On http://wiki.answers.com/, we use search in a variety of > > places and ways. > > > > In the basic search box (what you get if you look stuff up in the main > > Ask box on the home page), we generally want the relevancy matching to > > be pretty fuzzy. For example, if the user looked up "Where can you see > > photos of the Aurora Borealis effect?" I would still want to show them > > "Where can you see photos of the Aurora Borealis?" as a match. > > > > However, the advanced search page, > > http://wiki.answers.com/Q/Special:Search, is used by advanced users to > > filter questions by various facets and searches, and to them it is > > important for the filter to filter out non-matches, since they use it > > as a working page. For example, if they want to do a search for "Harry > > Potter" and classify all results into the "Harry Potter" category, it > > is important that not every match for "Harry" is returned. > > I'm curious, Avi, if you can share how you came to these conclusions? For > instance, did you have any qualitative evidence that "fuzzy" was better for > the main page? Or was it a "I know it when I see it" kind of thing. > I guess it was an "I know it when I see it" kind of thing. But it is supported by evidence from our testing team and direct feedback from users. I guess one could say that the difference is less in level of user sophistication (though that is part of it), and more in user expectation when using different input methods of search. Our home page encourages asking questions in natural language, and therefore search based on that query is going to need to be "fuzzier" than a strict match of all the terms. -- Avi
problem in Lucene's ranking function
Hi all,

We have realized that there is a bug in Lucene's ranking function. Most ranking functions use a non-linear method to saturate the term-frequency computation. This reflects the fact that the information gained on observing a term the first time is greater than the information gained on subsequently seeing the same term. The non-linear method can be as simple as a logarithm or a square root, or a more complex parameter-based approach like BM25's k1 parameter. S. Robertson (2004, http://portal.acm.org/citation.cfm?id=1031181) has described the dangers of combining scores from different document fields, and the most typical errors made when ranking functions are modified to take document structure into account.

To rank these structured documents, Lucene combines the scores from document fields. The method Lucene uses to compute the score of a structured document is a linear combination of the scores for each field of the document.

Lucene's ranking function uses the square root of the term frequency as its saturation function, but the linear combination of per-field scores that Lucene uses to compute the whole-document score breaks the saturation effect, because the per-field boost factors are applied after the non-linear function. The consequence is that a document matching a single query term over several fields can score much higher than a document matching several query terms in one field only, which is not a good way to compute relevance and tends to hurt ranking performance dramatically.

We have written a paper that describes this problem and reports experiments showing its effect on Lucene's performance: http://km.aifb.kit.edu/ws/semsearch10/Files/bm25f.pdf

Would it be possible to fix this problem so that Lucene works properly for structured documents?

thank you very much in advance

jose

-- Jose R. Pérez-Agüera Clinical Assistant Professor Metadata Research Center School of Information and Library Science University of North Carolina at Chapel Hill email: jagu...@email.unc.edu Web page: http://www.unc.edu/~jaguera/ MRC website: http://ils.unc.edu/mrc/ - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
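A rough sketch of the two combination orders being contrasted, in simplified notation that is ours rather than the paper's (b_f is the per-field boost, tf_f the term frequency in field f; query norm, coord, and length normalization are omitted):

    % per-field saturation, then linear combination (roughly what Lucene does today)
    \mathrm{score}(d,t) \approx \sum_{f} b_f \sqrt{tf_f(t,d)} \cdot idf(t)

    % BM25F-style: combine the per-field frequencies first, then saturate once
    \mathrm{score}(d,t) \approx \frac{\sum_{f} b_f\, tf_f(t,d)}{k_1 + \sum_{f} b_f\, tf_f(t,d)} \cdot idf(t)

In the first form each extra field re-enters the score linearly, so spreading a term across fields escapes the saturation; in the second form the saturation is applied once, to the combined frequency.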
Re: problem in Lucene's ranking function
José, you might want to watch LUCENE-2392. In this issue, we are proposing adding additional flexibility to the scoring mechanism including: * controlling scoring on a per-field basis * the ability to compute and use aggregate statistics (average field length, total TF across all docs) * fine-grained calculation of the score: essentially at the end of the day if you want, you can implement score() in your Similarity and do whatever you want, so things like tf() and idf() as methods "go away" in that they might not even make sense for your scorer. So, SimilarityProvider in this model gets the flexibility of Scorer hopefully without the hassles. As far as combining scores across fields, I do not see why 2010/5/5 José Ramón Pérez Agüera > Hi all, > > We realize that there is a bug in Lucene's ranking function. Most > ranking functions, use a non-linear method to saturate the computation > of the frequencies. > This is due to the fact that the information gained on observing a > term the first time is greater than the information gained on > subsequently seeing the same term. The non-linear method can be as > simple as a logarithmic or a square-root function or more complex > parameter-based approaches like BM25 k1 parameter. S. Robertson 2004 > http://portal.acm.org/citation.cfm?id=1031181 has described the > dangers to combine scores from different document fields and what are > the most tipical errors when ranking functions are modified to > consider the structure of the documents. > > To rank these structured documents, Lucene combines the scores from > document fields. The method used by Lucene to compute the score of an > structured document is based on the linear combination of the scores > for each field of the document. > > Lucene's ranking function uses the square root of the term frequency > to implement the non-linear method to saturate the computation of the > frequencies, but the linear combination of the scores by field to > compute the score for the whole document that Lucene implements breaks > the saturation effect, since field's boost factors are applied after > of non-linear methods are used. The consequence is that a document > matching a single query term over several fields could score much > higher than a document matching several query terms in one field only, > which is not a good way to compute relevance and use to hurt > dramatically ranking function performance. > > We have written a paper where this problem is described and some > experiments are carried out to show the effect in Lucene performance. > http://km.aifb.kit.edu/ws/semsearch10/Files/bm25f.pdf > > It would be possible to fix this problem to have Lucene working > properly for structured documents? > > thank you very much in advance > > jose > > -- > Jose R. Pérez-Agüera > > Clinical Assistant Professor > Metadata Research Center > School of Information and Library Science > University of North Carolina at Chapel Hill > email: jagu...@email.unc.edu > Web page: http://www.unc.edu/~jaguera/ > MRC website: http://ils.unc.edu/mrc/ > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Robert Muir rcm...@gmail.com
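Until that lands, a hedged sketch of what the current 3.x Similarity already lets you override -- the tf cap of 2 and the "title" field are made-up examples, and note that lengthNorm is baked into the norms at index time, so it must be in effect on the IndexWriter to matter:

    Similarity custom = new DefaultSimilarity() {
        @Override
        public float tf(float freq) {
            // cap the term-frequency contribution, e.g. to damp keyword stuffing
            return (float) Math.sqrt(Math.min(freq, 2.0f));
        }
        @Override
        public float lengthNorm(String fieldName, int numTerms) {
            // crude per-field control: skip length normalization for a short field
            return "title".equals(fieldName) ? 1.0f : super.lengthNorm(fieldName, numTerms);
        }
    };
    searcher.setSimilarity(custom);  // query-time effect (tf, idf, coord, ...)
    writer.setSimilarity(custom);    // index-time effect (lengthNorm feeds the norms)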
Re: problem in Lucene's ranking function
Hi Robert,

thank you very much for your quick response. I have a couple of questions:

did you read the papers that I mention in my e-mail? Do you think that Lucene's ranking function could have this problem?

My concern is not about how to implement different kinds of ranking functions for Lucene; I know that you are doing very nice work to implement a very flexible ranking framework for Lucene. My concern is about a bug which is independent of the ranking function being used and which appears whenever some kind of saturation function is used in combination with a linear combination of fields for structured documents.

Maybe I'm wrong, but if the linear combination of fields remains in the core of Lucene's ranking function, Lucene is never going to compute the score for structured documents properly.

I know how to solve the problem, and we have our own implementation of BM25F for Lucene whose performance is much better than standard Lucene's ranking function, but I think it would be useful for other Lucene users to know what the problem with structured documents is, and how to fix it for the next version, independently of what ranking function is finally implemented for Lucene.

jose

On Wed, May 5, 2010 at 1:38 PM, Robert Muir wrote: > José, you might want to watch LUCENE-2392. > > In this issue, we are proposing adding additional flexibility to the scoring > mechanism including: > * controlling scoring on a per-field basis > * the ability to compute and use aggregate statistics (average field length, > total TF across all docs) > * fine-grained calculation of the score: essentially at the end of the day > if you want, you can implement score() in your Similarity and do whatever > you want, so things like tf() and idf() as methods "go away" in that they > might not even make sense for your scorer. So, SimilarityProvider in this > model gets the flexibility of Scorer hopefully without the hassles. > > As far as combining scores across fields, I do not see why > 2010/5/5 José Ramón Pérez Agüera > >> Hi all, >> >> We realize that there is a bug in Lucene's ranking function. Most >> ranking functions, use a non-linear method to saturate the computation >> of the frequencies. >> This is due to the fact that the information gained on observing a >> term the first time is greater than the information gained on >> subsequently seeing the same term. The non-linear method can be as >> simple as a logarithmic or a square-root function or more complex >> parameter-based approaches like BM25 k1 parameter. S. Robertson 2004 >> http://portal.acm.org/citation.cfm?id=1031181 has described the >> dangers to combine scores from different document fields and what are >> the most tipical errors when ranking functions are modified to >> consider the structure of the documents. >> >> To rank these structured documents, Lucene combines the scores from >> document fields. The method used by Lucene to compute the score of an >> structured document is based on the linear combination of the scores >> for each field of the document. >> >> Lucene's ranking function uses the square root of the term frequency >> to implement the non-linear method to saturate the computation of the >> frequencies, but the linear combination of the scores by field to >> compute the score for the whole document that Lucene implements breaks >> the saturation effect, since field's boost factors are applied after >> of non-linear methods are used.
The consequence is that a document >> matching a single query term over several fields could score much >> higher than a document matching several query terms in one field only, >> which is not a good way to compute relevance and use to hurt >> dramatically ranking function performance. >> >> We have written a paper where this problem is described and some >> experiments are carried out to show the effect in Lucene performance. >> http://km.aifb.kit.edu/ws/semsearch10/Files/bm25f.pdf >> >> It would be possible to fix this problem to have Lucene working >> properly for structured documents? >> >> thank you very much in advance >> >> jose >> >> -- >> Jose R. Pérez-Agüera >> >> Clinical Assistant Professor >> Metadata Research Center >> School of Information and Library Science >> University of North Carolina at Chapel Hill >> email: jagu...@email.unc.edu >> Web page: http://www.unc.edu/~jaguera/ >> MRC website: http://ils.unc.edu/mrc/ >> >> - >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > > -- > Robert Muir > rcm...@gmail.com > -- Jose R. Pérez-Agüera Clinical Assistant Professor Metadata Research Center School of Information and Library Science University of North Carolina at Chapel Hill email: jagu...@email.unc.edu Web page: http://www.unc.edu/~jaguera/ MRC website: http://ils.unc.edu/mrc/
Re: problem in Lucene's ranking function
2010/5/5 José Ramón Pérez Agüera > Hi Robert, > > thank you very much for your quick response, I have a couple of questions, > > did you read the papers that I mention in my e-mail? >

Yes.

> do you think that Lucene ranking function could have this problem? > >

I know it does.

> My concern is not about how to implement different kind of ranking > functions for Lucene, I know that you are doing a very nice work to > implement a very flexible ranking framework for Lucene, my concern is > about a bug, which is independent of the ranking function that you are > using and which appears whether some kind of saturation function is > used in combination with a linear combination of fields for structured > documents. >

I think we might disagree here though. Must 'the combining of scores from different fields' be hardcoded to one simple solution, or should it be something that you can control yourself?

For example, it appears Terrier implements something different for this problem, not the paper you referenced but a different technique?: http://terrier.org/docs/v3.0/javadoc/org/terrier/matching/models/BM25F.html But I don't quite understand all the subtleties involved... it seems in this other paper there is still a linear combination, but you introduce additional per-field parameters.

The thing that makes me nervous about "hardcoding/changing" the way that scores are combined across fields is that Lucene presents some strange peculiarities, most notably the ability to use different scoring models for different fields. This in fact already exists today: if you "omitTF" for one field but not for another, you are using a different scoring model for the two fields.

> Maybe I'm wrong, but if the linear combination of fields remains in > lucene ranking function core, Lucene is never going to work properly > to compute the score for structured documents. >

I wouldn't say never, maybe we will not get there in the first go, but hopefully at least you will be able to do the things I mentioned above, such as using different similarities for different fields, including ones that are not supported today.

> > I know how to solve the problem, and we have our own implementation of > BM25F for Lucene which performance is much better that standard > Lucene's ranking function, but I think that would be useful for other > Lucene users to know what is the problem to deal with structured > documents, and how to fix this problem for the next version, > independently what ranking function is finally implemented for Lucene. > >

It would be great if you could help us on that issue (I know the patch is a bit out of date), to try to fix the scoring APIs, including perhaps thinking about how to improve search across multiple fields for structured documents.

In my opinion, I would like to see the situation evolve away from "which ranking function is implemented for Lucene" and instead toward having a variety of built-in functions you can choose from. So, I would rather it be more like Analyzers, where we have a variety of high-quality implementations available, and you can make your own if you must, but there is no real default.

-- Robert Muir rcm...@gmail.com
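For reference, the per-field omitTF mix mentioned above is set per Field instance in the 3.0 API -- the field names and value variables here are placeholders:

    Document doc = new Document();
    Field title = new Field("title", titleText, Field.Store.YES, Field.Index.ANALYZED);
    Field tags  = new Field("tags",  tagsText,  Field.Store.NO,  Field.Index.ANALYZED);
    tags.setOmitTermFreqAndPositions(true); // this field is scored without tf or positions
    doc.add(title);
    doc.add(tags);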
Re: problem in Lucene's ranking function
Hi Robert,

the problem is not the linear combination of fields itself; the problem is applying the boost factor per field after the term-frequency saturation function and only then making the linear combination of fields. Every system that implements BM25F, including Terrier, takes care of that, because if you don't you have a bug in your ranking function and not just a different ranking function.

It is very easy to solve the problem in the current Lucene ranking function: just move the per-field boost factor inside the term-frequency square root, that's all (see the small worked example below). If you implement this little change, Lucene's ranking function will work properly with structured documents, and all your other concerns about allowing users to implement different ranking functions for different situations will not be affected by this change.

I really appreciate your work to improve Lucene's ranking function, and your time responding to these emails :-)

best

jose

On Wed, May 5, 2010 at 2:12 PM, Robert Muir wrote: > 2010/5/5 José Ramón Pérez Agüera > >> Hi Robert, >> >> thank you very much for your quick response, I have a couple of questions, >> >> did you read the papers that I mention in my e-mail? >> > > Yes. > > >> do you think that Lucene ranking function could have this problem? >> >> > I know it does. > > >> My concern is not about how to implement different kind of ranking >> functions for Lucene, I know that you are doing a very nice work to >> implement a very flexible ranking framework for Lucene, my concern is >> about a bug, which is independent of the ranking function that you are >> using and which appears whether some kind of saturation function is >> used in combination with a linear combination of fields for structured >> documents. >> > > I think we might disagree here though. Must 'the combining of scores from > different fields' must be hardcoded to one simple solution, or should it be > something that you can control yourself? > > For example, it appears Terrier implements something different for this > problem, not the paper you referenced but a different technique?: > http://terrier.org/docs/v3.0/javadoc/org/terrier/matching/models/BM25F.html > But > I don't quite understand all the subleties involved... it seems in this > other paper there is still a linear combination, but you introduce > additional per-field parameters. > > The thing that makes me nervous about "hardcoding/changing" the way that > scores are combined across fields is that Lucene presents some strange > peculiarities, most notably the ability to use different scoring models for > different fields. This in fact already exists today, if you "omitTF" for one > field but not for another, you are using a different scoring model for the > two fields. > > >> Maybe I'm wrong, but if the linear combination of fields remains in >> lucene ranking function core, Lucene is never going to work properly >> to compute the score for structured documents. >> > > I wouldn't say never, maybe we will not get there in the first go, but > hopefully at least you will be able to do the things i mentioned above, such > as using different similarities for different fields, including ones that > are not supported today.
> > >> >> I know how to solve the problem, and we have our own implementation of >> BM25F for Lucene which performance is much better that standard >> Lucene's ranking function, but I think that would be useful for other >> Lucene users to know what is the problem to deal with structured >> documents, and how to fix this problem for the next version, >> independently what ranking function is finally implemented for Lucene. >> >> > It would be great if you could help us on that issue (I know the patch is a > bit out of date), to try to fix the scoring APIs, including perhaps thinking > about how to improve search across multiple fields for structured documents. > > In my opinion, I would like to see the situation evolve away from "which > ranking function is implemented for Lucene" instead to having a variety of > built-in functions you can choose from. > > So, I would rather it be more like Analyzers, where we have a variety of > high-quality implementations available, and you can make your own if you > must, but there is no real default. > > -- > Robert Muir > rcm...@gmail.com > -- Jose R. Pérez-Agüera Clinical Assistant Professor Metadata Research Center School of Information and Library Science University of North Carolina at Chapel Hill email: jagu...@email.unc.edu Web page: http://www.unc.edu/~jaguera/ MRC website: http://ils.unc.edu/mrc/ - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
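A tiny worked example of the change proposed above, with made-up numbers: equal field boosts b_f = 1 and a term that occurs once in each of four fields, versus four times in a single field.

    \sum_f b_f \sqrt{tf_f} = 4\sqrt{1} = 4        % boost outside the square root (current behaviour)
    \sqrt{\sum_f b_f\, tf_f} = \sqrt{4} = 2       % boost moved inside (the proposed change)

Four occurrences concentrated in one field score sqrt(4) = 2 either way, so with the boost outside the root the spread-out document is inflated to twice that value, while moving the boost inside restores the same saturation a single field would get.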
Re: problem in Lucene's ranking function
2010/5/5 José Ramón Pérez Agüera > Hi Robert, > > the problem is not the linear combination of fields, the problem is to > apply the boost factor per field after the term frequency saturation > function and then make the linear combination of fields. Every system > that implement BM25F, including terrier, take care of that, because if > you don't do it you have a bug in your ranking function and not just a > different ranking function. >

José, well then this should not be much of a problem to handle in LUCENE-2392, because as I mentioned, if you have a tf() or idf() it's really because you decided to do this yourself. So you could easily apply the boost inside your log or sqrt or whatever, if you want.

But what I propose we do is make sure the relevance functions we provide (especially any default for 4.0) take care of this for your structured case, while still providing the capability for someone to get the old behavior [see below]

> If you implement this little > change, Lucene ranking fucntion will work properly with structured > documents and all your other concerns about allowing users to > implement different ranking functions for different situations will be > not affected by this change. > >

Well, I'm not sure all my concerns go away! I think it's best to implement a change like this in the flexible scoring framework (LUCENE-2392), so that users, if they want, can get the old behavior: "the bug" as you call it.

The reason I say this is due to the unique cases of Lucene: some people are doing scoring in very crazy ways, and if they aren't able to get the old behavior with regards to boosting, they might be upset... even if it is really giving them worse relevance...

-- Robert Muir rcm...@gmail.com
Re: problem in Lucene's ranking function
Hi Robert,

I will be very happy to see this problem fixed :-) I cannot imagine what reasons people have for using software with bugs; I guess other bugs in Lucene get removed too. Anyway, if you are finally going to fix the problem, that is good news :-)

thank you very much for your time.

jose

On Wed, May 5, 2010 at 3:10 PM, Robert Muir wrote: > 2010/5/5 José Ramón Pérez Agüera > >> Hi Robert, >> >> the problem is not the linear combination of fields, the problem is to >> apply the boost factor per field after the term frequency saturation >> function and then make the linear combination of fields. Every system >> that implement BM25F, including terrier, take care of that, because if >> you don't do it you have a bug in your ranking function and not just a >> different ranking function. >> > > José, well then this should not be much of a problem to handle in > LUCENE-2392, because as I mentioned, if you have a tf() or idf() its really > because you decided to do this yourself. So you could easily apply the boost > inside your log or sqrt or whatever, if you want. > > But what I propose we do, is make sure the relevance functions we provide > (especially any default for 4.0) take care of this for your structured case, > while still providing the capability for someone to get the old behavior > [see below] > > >> If you implement this little >> change, Lucene ranking fucntion will work properly with structured >> documents and all your other concerns about allowing users to >> implement different ranking functions for different situations will be >> not affected by this change. >> >> > Well, I'm not sure all my concerns go away! I think its best to implement a > change like this in the flexible scoring framework (LUCENE-2392), so that > users, if they want, can get the old behavior: "the bug" as you call it. > > The reason I say this due to the unique cases of lucene, some people are > doing scoring in very crazy ways and if they aren't able to get the old > behavior with regards to boosting, they might be upset... even if it is > really giving them worse relevance... > > -- > Robert Muir > rcm...@gmail.com > -- Jose R. Pérez-Agüera Clinical Assistant Professor Metadata Research Center School of Information and Library Science University of North Carolina at Chapel Hill email: jagu...@email.unc.edu Web page: http://www.unc.edu/~jaguera/ MRC website: http://ils.unc.edu/mrc/ - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: problem in Lucene's ranking function
2010/5/5 José Ramón Pérez Agüera : [...] > The consequence is that a document > matching a single query term over several fields could score much > higher than a document matching several query terms in one field only, One partial workaround that people use is DisjunctionMaxQuery (used by "dismax" query parser in Solr). http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/DisjunctionMaxQuery.html -Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
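A hedged sketch of that workaround -- the field names, term, tie-breaker of 0.1, and page size are placeholders: the document is scored by its best-matching field plus a small contribution from the others, rather than by a straight sum across fields.

    DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(0.1f); // tie-breaker multiplier
    dmq.add(new TermQuery(new Term("title", "hospital")));
    dmq.add(new TermQuery(new Term("body", "hospital")));
    TopDocs hits = searcher.search(dmq, 10); // score = max field score + 0.1 * sum of the others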
Re: How can I merge .cfx and .cfs into a single cfs file?
Thank you Mike. Garry - Original Message - From: "Michael McCandless" To: Sent: Wednesday, May 05, 2010 8:24 PM Subject: Re: How can I merge .cfx and .cfs into a single cfs file? Lucene considers an index with a single .cfx and a single .cfs as optimized. Also, note that how Lucene stores files in the index is an impl detail -- it can change from release to release -- so relying on any of these details is dangerous. That said, with recent Lucene versions, if you really want to force these two files to be consolidated, you can make a custom MergePolicy that returns a merge from the findMergesForOptimize for this case (the default MergePolicy returns null since it thinks this case is already optimized). Or... if you index with two separate IndexWriter sessions, and then call optimize, that should also merge down to one file. Mike 2010/5/5 张志田 : > Uwe, thank you very much. > > What is the mechanizm lucene will merge these two kinds of files? Sometimes I > found there was only one .cfs file, but in another time there may be one cfs > and cfx. I understand the .cfx is used to store the term vectors etc, but why > does the index result not seem to be consistent? > > Thanks, > Garry > - Original Message - > From: "Uwe Goetzke" > To: > Sent: Wednesday, May 05, 2010 3:57 PM > Subject: AW: How can I merge .cfx and .cfs into a single cfs file? > > > Index all into a directory and determine the size of all files in it. > > From http://lucene.apache.org/java/3_0_1/fileformats.html > Starting with Lucene 2.3, doc store files (stored field values and term > vectors) can be shared in a single set of files for more than one segment. > When compound file is enabled, these shared files will be added into a single > compound file (same format as above) but with the extension .cfx. > > In addition to > Compound File .cfs An optional "virtual" file consisting of all the other > index files for systems that frequently run out of file handles. > > Uwe > > > -Ursprüngliche Nachricht- > Von: 张志田 [mailto:zhitian.zh...@dianping.com] > Gesendet: Mittwoch, 5. Mai 2010 08:24 > An: java-user@lucene.apache.org > Betreff: How can I merge .cfx and .cfs into a single cfs file? > > Hi all, > > I have an index task which will index thousands of records with lucene 3.0.1. > My confusion is lucene will always create a .cfx and a .cfs file in the file > system, sometimes more, while I thought it should create a single .cfs file > if I optimize the index data. Is it by design? If yes, is there any > way/configuration I can do to merge all of the index files into a singe one? > > By the way, I have a logic to validate the index data, if the size of .cfs > increases dramatically comparing to the file generated last time, there may > be something wrong, a warning message will be threw. This is the reason that > I want to generate a single .cfs file. Any other suggestion about the index > validation? > > Any body can give me a hand? > > Thanks in advance. > > Garry > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Using IndexReader in the web environment
You may look at this:

    private static IndexSearcher indexSearcher = null;

    public synchronized IndexSearcher newIndexSearcher() {
        try {
            if (null == indexSearcher) {
                Directory directory = FSDirectory.open(new File(Config.DB_DIR + "/rssindex"));
                indexSearcher = new IndexSearcher(IndexReader.open(directory, true));
            } else {
                IndexReader indexReader = indexSearcher.getIndexReader();
                IndexReader newIndexReader = indexReader.reopen(); // reopen old indexReader
                if (newIndexReader != indexReader) {
                    // the index changed: release the old reader/searcher and switch to the new reader
                    indexReader.close();
                    indexSearcher.close();
                    indexSearcher = new IndexSearcher(newIndexReader);
                }
            }
            return indexSearcher;
        } catch (CorruptIndexException e) {
            log.error(e.getMessage(), e);
            return null;
        } catch (IOException e) {
            log.error(e.getMessage(), e);
            return null;
        }
    }

-- 冲浪板 my blog: http://chonglangban.appspot.com/ my site: http://kejiblog.appspot.com/
AW: How can I merge .cfx and .cfs into a single cfs file?
Index all into a directory and determine the size of all files in it.

From http://lucene.apache.org/java/3_0_1/fileformats.html: Starting with Lucene 2.3, doc store files (stored field values and term vectors) can be shared in a single set of files for more than one segment. When compound file is enabled, these shared files will be added into a single compound file (same format as above) but with the extension .cfx. In addition to Compound File .cfs: An optional "virtual" file consisting of all the other index files for systems that frequently run out of file handles.

Uwe

-Original Message- From: 张志田 [mailto:zhitian.zh...@dianping.com] Sent: Wednesday, 5 May 2010 08:24 To: java-user@lucene.apache.org Subject: How can I merge .cfx and .cfs into a single cfs file?

Hi all, I have an index task which will index thousands of records with lucene 3.0.1. My confusion is lucene will always create a .cfx and a .cfs file in the file system, sometimes more, while I thought it should create a single .cfs file if I optimize the index data. Is it by design? If yes, is there any way/configuration I can do to merge all of the index files into a singe one? By the way, I have a logic to validate the index data, if the size of .cfs increases dramatically comparing to the file generated last time, there may be something wrong, a warning message will be threw. This is the reason that I want to generate a single .cfs file. Any other suggestion about the index validation? Any body can give me a hand? Thanks in advance. Garry

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: How can I merge .cfx and .cfs into a single cfs file?
Uwe, thank you very much.

What is the mechanism by which Lucene merges these two kinds of files? Sometimes I found there was only one .cfs file, but at other times there may be one .cfs and one .cfx. I understand the .cfx is used to store the term vectors etc., but why does the index result not seem to be consistent?

Thanks, Garry

- Original Message - From: "Uwe Goetzke" To: Sent: Wednesday, May 05, 2010 3:57 PM Subject: AW: How can I merge .cfx and .cfs into a single cfs file? Index all into a directory and determine the size of all files in it. From http://lucene.apache.org/java/3_0_1/fileformats.html Starting with Lucene 2.3, doc store files (stored field values and term vectors) can be shared in a single set of files for more than one segment. When compound file is enabled, these shared files will be added into a single compound file (same format as above) but with the extension .cfx. In addition to Compound File .cfs An optional "virtual" file consisting of all the other index files for systems that frequently run out of file handles. Uwe -Ursprüngliche Nachricht- Von: 张志田 [mailto:zhitian.zh...@dianping.com] Gesendet: Mittwoch, 5. Mai 2010 08:24 An: java-user@lucene.apache.org Betreff: How can I merge .cfx and .cfs into a single cfs file? Hi all, I have an index task which will index thousands of records with lucene 3.0.1. My confusion is lucene will always create a .cfx and a .cfs file in the file system, sometimes more, while I thought it should create a single .cfs file if I optimize the index data. Is it by design? If yes, is there any way/configuration I can do to merge all of the index files into a singe one? By the way, I have a logic to validate the index data, if the size of .cfs increases dramatically comparing to the file generated last time, there may be something wrong, a warning message will be threw. This is the reason that I want to generate a single .cfs file. Any other suggestion about the index validation? Any body can give me a hand? Thanks in advance. Garry - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Using IndexReader in the web environment
You could tell the searching part of your app, via some notification or messaging call. Or call IndexReader.isCurrent() from time to time, or even on every search, and reopen() if necessary. See the javadocs and don't forget to close the old reader when you do call reopen. -- Ian. On Wed, May 5, 2010 at 5:17 AM, Vijay Veeraraghavan wrote: > hey Ian, > thanks for the reply. I find it very useful. My report generating > scheduler will run periodically, once done it will invoke the indexer > and exit. In this case I do not know if the index has changed or not. > How do i keep track of the changes in the index? As the two entities, > scheduler/indexer and the web application, are totally different. > > Vijay > > On 5/4/10, Ian Lea wrote: >> For best performance you should aim to keep a shared index searcher, >> or the underlying index reader, open as long as possible. You may of >> course need to reopen it if/when the index changes. As to scope, you >> can store it wherever it makes sense for your application. >> >> >> -- >> Ian. >> >> >> On Tue, May 4, 2010 at 10:13 AM, Vijay Veeraraghavan >> wrote: >>> Hi, >>> Thanks for the reply. So I will have a dedicated servlet to search the >>> index, but does it mean that the indexsearcher does not close the >>> index, keep it open? Is it not possible to keep it in the application >>> scope? >>> >>> Vijay >>> >>> On 5/3/10, Vijay Veeraraghavan wrote: Hi all, In a clustered environment I search the index from the web application. In the web application I am creating IndexReader on each request. is it expensive to do like this? I read somewhere in the web that try using the same reader as much as possible. Can i keep the initially created IndexReader in the session/application scopes and use the same for each request? Any other idea? Viay On 5/3/10, Vijay Veeraraghavan wrote: > dear all, > > as replied below, does searching again for the document in the index > and if found skip the indexing else index it, is this not similar to > indexing all pdf documents once again, is not this overhead? As I am > not going to index the details of the pdf (so if an indexed pdf was > recreated i need not reindex it) but just the paths of the documents. > > Vijay > >>> Hey there, >>> >>> you might have to implement a some kind of unique identifier using an >>> indexed lucene field. When you are indexing you should fire a query >>> with >>> the >>> uuid of your document (maybe the path to you pdf document) and check >>> if >>> the >>> document is in the index already. You could also do a boolean query >>> combining UUID, timestamp and / or a hash value to see if the document >>> has >>> been changed. if so you can simply update the document by its UUID >>> (something like indexwriter.updateDocument(new Term("uuid", >>> value),document);) >>> >>> Unfortunately you have to implement this yourself but it should not be >>> that >>> much of a deal. >>> >>> simon >>> >>> On Mon, May 3, 2010 at 9:21 AM, Vijay Veeraraghavan < >>> vijay.raghava...@gmail.com> wrote: >>> Dear all, I am using lucene 3.0 to index the pdf reports that I generate dynamically. I index the pdf file name (without extension), file path and its absolute path as fields. I search with the file name without extension; it retrieves a list, as usually 2 or more files are present in the same name in different sub directories. As I create the index for the first time it updates, assuming 100 pdf files in different directories, the files meta info. 
If again I do indexing, while my report generator scheduler has the produced 500 more pdf files totaling to 600 files in different directories, I wish to index only the new files to the index. But presently it’s doing the whole thing again (600 files). How to implement this functionality? Think of the thousands of pdf files created on each run. P.S: I cannot keep the meta-info of generated pdf files in the java memory, as it exceeds thousands in a single run, and update the index looping this list. new IndexWriter(FSDirectory.open(this.indexDir), new StandardAnalyzer( Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED); is the boolean parameter is for this purpose? Please guide me. -- Thanks Vijay Veeraraghavan -- Thanks & Regards Vijay Veeraraghavan - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.or
Re: How can I merge .cfx and .cfs into a single cfs file?
Lucene considers an index with a single .cfx and a single .cfs as optimized. Also, note that how Lucene stores files in the index is an impl detail -- it can change from release to release -- so relying on any of these details is dangerous. That said, with recent Lucene versions, if you really want to force these two files to be consolidated, you can make a custom MergePolicy that returns a merge from the findMergesForOptimize for this case (the default MergePolicy returns null since it thinks this case is already optimized). Or... if you index with two separate IndexWriter sessions, and then call optimize, that should also merge down to one file. Mike 2010/5/5 张志田 : > Uwe, thank you very much. > > What is the mechanizm lucene will merge these two kinds of files? Sometimes I > found there was only one .cfs file, but in another time there may be one cfs > and cfx. I understand the .cfx is used to store the term vectors etc, but why > does the index result not seem to be consistent? > > Thanks, > Garry > - Original Message - > From: "Uwe Goetzke" > To: > Sent: Wednesday, May 05, 2010 3:57 PM > Subject: AW: How can I merge .cfx and .cfs into a single cfs file? > > > Index all into a directory and determine the size of all files in it. > > From http://lucene.apache.org/java/3_0_1/fileformats.html > Starting with Lucene 2.3, doc store files (stored field values and term > vectors) can be shared in a single set of files for more than one segment. > When compound file is enabled, these shared files will be added into a single > compound file (same format as above) but with the extension .cfx. > > In addition to > Compound File .cfs An optional "virtual" file consisting of all the other > index files for systems that frequently run out of file handles. > > Uwe > > > -Ursprüngliche Nachricht- > Von: 张志田 [mailto:zhitian.zh...@dianping.com] > Gesendet: Mittwoch, 5. Mai 2010 08:24 > An: java-user@lucene.apache.org > Betreff: How can I merge .cfx and .cfs into a single cfs file? > > Hi all, > > I have an index task which will index thousands of records with lucene 3.0.1. > My confusion is lucene will always create a .cfx and a .cfs file in the file > system, sometimes more, while I thought it should create a single .cfs file > if I optimize the index data. Is it by design? If yes, is there any > way/configuration I can do to merge all of the index files into a singe one? > > By the way, I have a logic to validate the index data, if the size of .cfs > increases dramatically comparing to the file generated last time, there may > be something wrong, a warning message will be threw. This is the reason that > I want to generate a single .cfs file. Any other suggestion about the index > validation? > > Any body can give me a hand? > > Thanks in advance. > > Garry > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
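For completeness, a rough sketch of the second suggestion (the path and analyzer are placeholders; 3.0.1 API): once a later IndexWriter session has added new segments, an explicit optimize should merge everything, including the shared doc store, back down to a single compound file.

    Directory dir = FSDirectory.open(new File("/path/to/index"));
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
            false, IndexWriter.MaxFieldLength.LIMITED); // open the existing index
    writer.optimize();  // force-merge down to one segment
    writer.close();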