Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen
On Wed, Mar 7, 2012 at 10:11 AM, Mike Rylander mrylan...@gmail.com wrote:

> On Wed, Mar 7, 2012 at 8:35 AM, Hardy, Elaine eha...@georgialibraries.org wrote:
>
>> Kathy,
>>
>> While the relevance display is much improved in 2.x, it would be good to have greater relevance given, in a keyword search, to title (specifically the 245) and then subject fields. I also see where having a popularity ranking might be beneficial. I just had to explain to a board member of one of our libraries why his search for John Sandford turned up children's titles first. So having MARC 100 fields ranked higher than 700 fields in author searches would be beneficial as well.
>
> To be clear, weighting hits that come from different index definitions has always been possible. 2.2 will have a staff client interface to make it easier, but the capability has been there all along.
>
> Weighting different parts of one indexed term -- say, weighting the title embedded in the keyword blob higher than the subjects embedded in the same blob -- would require the above-mentioned use of tsearch class weighting. But one can approximate that today by duplicating the index definitions from, say, the title, author and subject classes within the keyword class.

We've been doing the latter (duplicating title inside the keyword class) since the 1.6 days -- see http://coffeecode.net/archives/218-Adjusting-relevancy-rankings-in-Evergreen-1.6,-some-explorations.html for a description of how I added a keyword|title field and then boosted its weight to 10 (versus the default of 1 that the rest of the keyword fields get). So a general keyword search for "programming languages" on our system by far prefers results that contain "programming languages" in the title... this is still working nicely in 2.1 for us.

Note, however, that we did clear all entries out of the search.relevance_adjustment table, as that was found to slow things down massively in the 2.0 era.
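The keyword|title trick described above can be sketched in SQL against config.metabib_field. This is only an illustration of the approach: the exact column list varies by Evergreen version (check \d config.metabib_field first), the "Title (keyword)" label is invented here, and the affected bib records typically need to be reingested before the new field has any entries to search.

```sql
-- Sketch: clone the title|proper definition into the keyword class and
-- give it weight 10 (stock keyword fields default to 1). Column names
-- follow the 2.x-era schema; verify against your version before running.
INSERT INTO config.metabib_field
       (field_class, name, label, xpath, weight, format, search_field)
SELECT 'keyword', 'title', 'Title (keyword)', xpath, 10, format, TRUE
  FROM config.metabib_field
 WHERE field_class = 'title' AND name = 'proper';
```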
Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen
On Mon, May 7, 2012 at 3:12 PM, Kathy Lussier kluss...@masslnc.org wrote:

> Hi Mike,
>
>> FWIW, there is a library testing some new combinations of CD modifiers and having some success. As soon as I know more I will share (if they don't first).
>
> Did anything ever come of this? I would be interested in seeing any examples that resulted in improved relevancy.

The testing occurred, but I haven't heard the outcome yet. I'll dig for it ASAP.

>> The fairly mechanical change from GIST to GIN indexing is definitely a small-effort thing. I think the other ideas listed here (and still others from the past, like direct MARC indexing, and use of tsearch weighting classes) are probably worth trying -- particularly the relevance-adjustment-functions-in-C idea -- as GSoC projects, but may turn out to be too big. It's worth listing them as ideas for candidates to propose, though.
>
> I was happy to see that "Optimize Evergreen: Convert PL/Perl-based PostgreSQL stored procedures to PL/SQL or PL/C" was one of the accepted GSoC projects. However, since I got a little lost in the technical details of this discussion, I was curious whether, when this GSoC project is complete, we can feel more comfortable about using search.relevance_ranking to tweak the relevancy without adversely affecting search performance.

Short version: yes.

Longer version: that's exactly one of the goals, and there are some other avenues of attack as well that should speed search and are related to (but not strictly inside) the GSoC project.

--miker

> I know there were two related GSoC ideas listed, and I wasn't sure if both needed to be done together to ultimately improve search speeds. Thanks!
Kathy

--
Kathy Lussier
Project Coordinator
Massachusetts Library Network Cooperative
(508) 756-0172
(508) 755-3721 (fax)
kluss...@masslnc.org
Twitter: http://www.twitter.com/kmlussier

On 3/6/2012 5:00 PM, Mike Rylander wrote:

> On Tue, Mar 6, 2012 at 4:42 PM, Kathy Lussier kluss...@masslnc.org wrote:
>
>> Hi all,
>>
>> I mentioned this during an e-mail discussion on the list last month, but I just wanted to hear from others in the Evergreen community about whether there is a desire to improve the relevance ranking for search results in Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it can look at things like the document length, word proximity, and unique word count. We've found that we had to remove the modifiers for document length and unique word count to prevent a problem where brief bib records were ranked way too high in our search results.
>
> FWIW, there is a library testing some new combinations of CD modifiers and having some success. As soon as I know more I will share (if they don't first).
>
>> In our local discussions, we've thought the following enhancements could improve the ranking of search results:
>>
>> * Giving greater weight to a record if the search terms appear in the title or subject (ideally, we would like these fields to be configurable). This is something that is tweakable in search.relevance_ranking, but my understanding is that the use of these tweaks results in a major reduction in search performance.
>
> Indeed they do; however, rewriting them in C to be super-fast would improve this situation. It's primarily a matter of available time and effort. It's also, however, pretty specialized work, as you're dealing with Postgres at a very intimate level.
>
>> * Using some type of popularity metric to boost relevancy for popular titles. I'm not sure what this metric should be (number of copies attached to a record? Total circs in the last x months? Total current circs?), but we believe some type of popularity measure would be particularly helpful in a public library, where searches will often be for titles that are popular. For example, a search for twilight will most likely be for the Stephanie Meyers novel and not this: http://books.google.com/books/about/Twilight.html?id=zEhkpXCyGzIC. Mike Rylander had indicated in a previous e-mail (http://markmail.org/message/h6u5r3sy4nr36wsl) that we might be able to handle this through an overnight cron job without a negative impact on search speeds.
>
> Right ... A regular stats-gathering job could certainly allow this, and (if the QueryParser explain branch gets merged to master, so we have a standard search canonicalization function) logged query analysis is another option as well.
>
>> Do others think these two enhancements would improve the search results in Evergreen? Do you think there are other things we could do to improve relevancy? My main concern would be that any changes might slow down search speeds, and I would want to make sure that we could do something to retrieve better search results without a slowdown.
>
> I would prefer better results with a speed /increase/! :) But, who wouldn't. I can offer at least one lower-hanging fruit idea: switch from GIST indexes to GIN indexes by default, as they're much faster these days.
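The lower-hanging fruit Mike mentions is mechanical: the tsvector columns in the metabib.*_field_entry tables carry the full-text indexes, and a GIST index can be swapped for GIN simply by rebuilding it. A sketch only; the index and table names below follow the stock naming convention but may differ locally, so confirm the real names (e.g. with \di metabib.*) before dropping anything.

```sql
-- Illustrative rebuild of one class's full-text index as GIN instead
-- of GIST. GIN trades slower writes for much faster @@ lookups.
DROP INDEX IF EXISTS metabib.metabib_keyword_field_entry_index_vector_idx;
CREATE INDEX metabib_keyword_field_entry_index_vector_idx
    ON metabib.keyword_field_entry
 USING GIN (index_vector);
```

Repeating this for the title, author, subject, and series field_entry tables would complete the switch.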
Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen
Hi Mike,

> FWIW, there is a library testing some new combinations of CD modifiers and having some success. As soon as I know more I will share (if they don't first).

Did anything ever come of this? I would be interested in seeing any examples that resulted in improved relevancy.

> The fairly mechanical change from GIST to GIN indexing is definitely a small-effort thing. I think the other ideas listed here (and still others from the past, like direct MARC indexing, and use of tsearch weighting classes) are probably worth trying -- particularly the relevance-adjustment-functions-in-C idea -- as GSoC projects, but may turn out to be too big. It's worth listing them as ideas for candidates to propose, though.

I was happy to see that "Optimize Evergreen: Convert PL/Perl-based PostgreSQL stored procedures to PL/SQL or PL/C" was one of the accepted GSoC projects. However, since I got a little lost in the technical details of this discussion, I was curious whether, when this GSoC project is complete, we can feel more comfortable about using search.relevance_ranking to tweak the relevancy without adversely affecting search performance. I know there were two related GSoC ideas listed, and I wasn't sure if both needed to be done together to ultimately improve search speeds.

Thanks!
Kathy

--
Kathy Lussier
Project Coordinator
Massachusetts Library Network Cooperative
(508) 756-0172
(508) 755-3721 (fax)
kluss...@masslnc.org
Twitter: http://www.twitter.com/kmlussier

On 3/6/2012 5:00 PM, Mike Rylander wrote:

> [snip]
Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen
On Thu, Mar 22, 2012 at 4:56 PM, Brian Greene bgre...@cgcc.cc.or.us wrote:

> Does relevancy ranking currently take publication date into account? I think this could be especially helpful with topical searches when, all other things being equal, I'd probably consider the newer item to be more relevant.

It is. The Date1 fixed field is used as a first tie-breaker after the primary (user-chosen) sort axis.

--miker

> Similarly, I could see home library (in cases where that can be determined) being considered and used when there are two otherwise equally relevant items. Note that in both cases I don't want them to become de facto limiters, but rather act more like tie-breakers after the other factors have been weighed. I also support taking into account some sort of popularity measure.
>
> Thanks,
> Brian
>
> Brian Greene, Library Director
> Columbia Gorge Community College
> The Dalles, Oregon 97058
> (541) 506-6080 | www.cgcc.cc.or.us
>
> Mike Rylander mrylan...@gmail.com 3/8/2012 10:55 AM:
>
>> On Thu, Mar 8, 2012 at 12:10 PM, Elizabeth Longwell blong...@eou.edu wrote:
>>
>>> Hi, Is it necessary to re-index after changing weights for relevancy?
>>
>> Not at all. The only gotcha is that cached searches won't show the changed weighting (of course). So, say you searched for rowling (sans quotes) and wanted to test an author-weighting change made after the search (but before the cache expired), search again for rowling -asdlfkaf (again, sans quotes). That negated random string at the end kills the cache without materially changing the query.
>>
>> -- Mike Rylander | Director of Research and Development | Equinox Software, Inc. / Your Library's Guide to Open Source | phone: 1-877-OPEN-ILS (673-6457) | email: mi...@esilibrary.com | web: http://www.esilibrary.com
>>
>>> Beth Longwell
>>> Sage Library System
>>>
>>> On Wed, Mar 7, 2012 at 5:29 PM, Mike Rylander mrylan...@gmail.com wrote:
>>>
>>>> On Wed, Mar 7, 2012 at 2:57 PM, Kathy Lussier kluss...@masslnc.org wrote:
>>>>
>>>>> Hi Mike,
>>>>>
>>>>>> To be clear, weighting hits that come from different index definitions has always been possible. 2.2 will have a staff client interface to make it easier, but the capability has been there all along.
>>>>>
>>>>> Is this staff client interface already available in master? If so, can you give me a little more information on how this is done?
>>>>
>>>> It is. Go to Admin -> Server Administration -> MARC Search/Facet Fields and see the Weight field. The higher the number, the more important the field.
>>>>
>>>> [snip]
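Brian's "tie-breaker, not limiter" distinction maps directly onto a multi-key ORDER BY: a secondary sort key only ever matters between rows whose primary key values are equal, so it can never filter results or reorder records with distinct ranks. A sketch with purely illustrative table and column names:

```sql
-- The user-chosen axis (rank) stays primary; the Date1 year decides
-- only exact ties, so publication date never acts as a limiter.
SELECT record, rank, date1
  FROM search_results            -- hypothetical staging of ranked hits
 ORDER BY rank DESC,             -- primary: relevance
          date1 DESC NULLS LAST; -- tie-breaker: newer pub date first
```

A home-library tie-breaker would simply be another trailing ORDER BY key in the same spirit.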
Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen
Hi,

Is it necessary to re-index after changing weights for relevancy?

Beth Longwell
Sage Library System

On Wed, Mar 7, 2012 at 5:29 PM, Mike Rylander mrylan...@gmail.com wrote:

> [snip]
Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen
On Thu, Mar 8, 2012 at 12:10 PM, Elizabeth Longwell blong...@eou.edu wrote:

> Hi, Is it necessary to re-index after changing weights for relevancy?

Not at all. The only gotcha is that cached searches won't show the changed weighting (of course). So, say you searched for rowling (sans quotes) and wanted to test an author-weighting change made after the search (but before the cache expired), search again for rowling -asdlfkaf (again, sans quotes). That negated random string at the end kills the cache without materially changing the query.

--
Mike Rylander | Director of Research and Development | Equinox Software, Inc. / Your Library's Guide to Open Source | phone: 1-877-OPEN-ILS (673-6457) | email: mi...@esilibrary.com | web: http://www.esilibrary.com

> Beth Longwell
> Sage Library System
>
> On Wed, Mar 7, 2012 at 5:29 PM, Mike Rylander mrylan...@gmail.com wrote:
>
> [snip]
Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen
On Wed, Mar 7, 2012 at 2:28 AM, Dan Scott d...@coffeecode.net wrote:

> Lots of snips implied below; also note that I'm running James' 2.0 query on a 2.1 system.
>
> On Tue, Mar 06, 2012 at 10:55:24PM -0500, Mike Rylander wrote:
>
>> On Tue, Mar 6, 2012 at 6:13 PM, James Fournie jfour...@sitka.bclibraries.ca wrote:
>>
>>> * Giving greater weight to a record if the search terms appear in the title or subject (ideally, we would like these fields to be configurable). This is something that is tweakable in search.relevance_ranking, but my understanding is that the use of these tweaks results in a major reduction in search performance.
>>
>> Indeed they do; however, rewriting them in C to be super-fast would improve this situation. It's primarily a matter of available time and effort. It's also, however, pretty specialized work, as you're dealing with Postgres at a very intimate level.
>
> Hmm. For sure, C beats Perl for performance and would undoubtedly offer an improvement, but it looks like another bottleneck for broad searches is in having to visit and sort hundreds of thousands of rows so that they can be ordered by rank, with the added I/O cost of using a disk merge for these broad searches rather than an in-memory quicksort. For comparison, I swapped out 'canada' for 'paraguay' and EXPLAIN ANALYZEd the results; 'canada' uses a disk merge because it needs to deal with 482,000 rows and sort 596,000 KB of data, while 'paraguay' (which only has to sort 322 rows) used an in-memory quicksort at 582 KB. This is on a system where work_mem is set to 288 MB -- much higher than one would generally want, particularly for the number of physical connections that could potentially get ramped up. That high work_mem helps with reasonably broad searches, but searching for Canada in a Canadian academic library, you might as well be searching for the...

All true, but also not something we can do much about (without a precalculated rank, a la PageRank); also, testing on 9.0 around its release showed that pre-limiting as we used to was slower than what we do today. I don't have the details in front of me, but there it is.

Indeed, and naco_normalize is not necessarily the only normalizer that will be applied to each and every field! If you search a class or field that uses other (pos = 0) normalizers, all of those will also be applied to both the column value and the user input. There's some good news on this front, though. Galen recently implemented a trimmed-down version of naco_normalize, called search_normalize, that should be a bit faster. That should lower the total cost by a noticeable amount over many thousands of rows.

> You might be thinking of something else? I'm pretty sure that 2bc4e97f72b shows that I implemented search_normalize() simply to avoid problems with apostrophe mangling in the strict naco_normalize() function -- and I doubt there will be any observable difference in performance.

Sorry, I probably am. My apologies, Dan; I didn't intend to misdirect your credit. That said, I'd be surprised if anything that shortened the pl/perl we use didn't help some, in aggregate, on very large queries. It's testable...

>> Hrm... and looking at your example, I spotted a chance for at least one optimization. If we recognize that there is only one term in a search (as in your canada example), we can skip the word-order rel_adjustment even if we're told to apply it, saving ~1/3 of the cost of that particular chunk.
>
> I can confirm this; running the same query on our system with word order removed carved response times down to 390 seconds from 580 seconds. Still unusable, but better. (EXPLAIN ANALYZE of the inner query attached.)

>> * Alternatively, I mentioned off-hand the option of direct indexing. The idea is to use an expression index defined by each row in config.metabib_field and have a background process keep the indexes in sync with configuration as things in that table (and the related normalizer configuration, etc.) change. I fear that's the path to madness, but it would be the most space-efficient way to handle things.
>
> I don't think that's the path to madness; it appeals to me, at least. (Okay, it's probably insane then.)

In a broad sense it appeals to me, too. You and I were the ones who discussed this in the long-long-ago, IYR. It's when I start digging into the details of implementation and the implications for config-change-based thrashing that I start going a little mad ...

OH! Ranking via ts_rank[_cd]. We can't do it without the tsvector in hand. But if we can work around that somehow (a table that stores only that value, reducing the sort size you mention above?), it's not impossible.

[Side note: if you don't need the language-based relevance bump (because, say, the vast majority of your collection is English), remove the default_preferred_language[_weight] elements from your opensrf.xml -- you should save a good bit from
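Dan's disk-merge versus quicksort observation can be reproduced on any PostgreSQL system: EXPLAIN ANALYZE reports the sort method, which shows whether the sort fit in work_mem or spilled to disk. The query below is a placeholder in the spirit of the thread, not the actual Evergreen search SQL; table and column names follow the stock metabib convention.

```sql
-- Watch the plan output for "Sort Method: quicksort  Memory: ..."
-- versus "Sort Method: external merge  Disk: ..." -- the latter means
-- the sort spilled past work_mem. Placeholder query only.
SET work_mem = '288MB';   -- session-local; the value discussed above

EXPLAIN ANALYZE
SELECT source, ts_rank(index_vector, query) AS rank
  FROM metabib.keyword_field_entry,
       to_tsquery('english', 'canada') AS query
 WHERE index_vector @@ query
 ORDER BY rank DESC;
```

Re-running with 'paraguay' in place of 'canada' should show the row-count difference that flips the plan between the two sort methods.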
Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen
Kathy, While the relevance display is much improved in 2.x, it would be good to have greater relevance given, in a keyword search, to title (specifically the 245)and then subject fields. I also see where having a popularity ranking might be beneficial. I just had to explain to a board member of one of our libraries why his search for John Sandford turned up children's titles first. So having MARC field 100s ranked higher than 700 in author searches would be beneficial as well. I can't comment on any of the coding possibilities other than to say which every way doesn't negatively impact search return time is preferable. Elaine J. Elaine Hardy PINES Bibliographic Projects and Metadata Manager Georgia Public Library Service, A Unit of the University System of Georgia 1800 Century Place, Suite 150 Atlanta, Ga. 30345-4304 404.235-7128 404.235-7201, fax eha...@georgialibraries.org www.georgialibraries.org http://www.georgialibraries.org/pines/ -Original Message- From: open-ils-general-boun...@list.georgialibraries.org [mailto:open-ils-general-boun...@list.georgialibraries.org] On Behalf Of Kathy Lussier Sent: Tuesday, March 06, 2012 4:43 PM To: 'Evergreen Discussion Group' Subject: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen Hi all, I mentioned this during an e-mail discussion on the list last month, but I just wanted to hear from others in the Evergreen community about whether there is a desire to improve the relevance ranking for search results in Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it can look at things like the document length, word proximity, and unique word count. We've found that we had to remove the modifiers for document length and unique word count to prevent a problem where brief bib records were ranked way too high in our search results. 
In our local discussions, we've thought the following enhancements could improve the ranking of search results:

* Giving greater weight to a record if the search terms appear in the title or subject (ideally, we would like these fields to be configurable). This is something that is tweakable in search.relevance_ranking, but my understanding is that the use of these tweaks results in a major reduction in search performance.

* Using some type of popularity metric to boost relevancy for popular titles. I'm not sure what this metric should be (number of copies attached to record? Total circs in last x months? Total current circs?), but we believe some type of popularity measure would be particularly helpful in a public library where searches will often be for titles that are popular. For example, a search for twilight will most likely be for the Stephenie Meyer novel and not this http://books.google.com/books/about/Twilight.html?id=zEhkpXCyGzIC. Mike Rylander had indicated in a previous e-mail (http://markmail.org/message/h6u5r3sy4nr36wsl) that we might be able to handle this through an overnight cron job without a negative impact on search speeds.

Do others think these two enhancements would improve the search results in Evergreen? Do you think there are other things we could do to improve relevancy? My main concern would be that any changes might slow down search speeds, and I would want to make sure that we could do something to retrieve better search results without a slowdown. Also, I was wondering if this type of project might be a good candidate for a Google Summer of Code project. I look forward to hearing your feedback!

Kathy - Kathy Lussier Project Coordinator Massachusetts Library Network Cooperative (508) 756-0172 (508) 755-3721 (fax) kluss...@masslnc.org IM: kmlussier (AOL Yahoo) Twitter: http://www.twitter.com/kmlussier
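To make the popularity idea more concrete, a nightly rollup along these lines is one possibility. This is only a hedged sketch: the local_stats.record_popularity table is invented for illustration, and the join path through action.circulation, asset.copy, and asset.call_number should be verified against your Evergreen version before use.

```sql
-- Hypothetical cron-driven rollup: count recent circulations per bib
-- record into a side table that a search could join against for a boost.
-- local_stats.record_popularity is NOT real Evergreen schema; create it
-- once, then refresh it nightly.
CREATE TABLE local_stats.record_popularity (
    record      BIGINT PRIMARY KEY,
    recent_circ INT NOT NULL DEFAULT 0
);

TRUNCATE local_stats.record_popularity;

INSERT INTO local_stats.record_popularity (record, recent_circ)
SELECT cn.record, COUNT(*)
  FROM action.circulation circ
  JOIN asset.copy cp ON (cp.id = circ.target_copy)
  JOIN asset.call_number cn ON (cn.id = cp.call_number)
 WHERE circ.xact_start > NOW() - INTERVAL '6 months'
 GROUP BY cn.record;
```

Because the table is rebuilt offline, the only search-time cost would be one join against a small, indexed table.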
Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen
On Wed, Mar 7, 2012 at 8:35 AM, Hardy, Elaine eha...@georgialibraries.org wrote:

Kathy, While the relevance display is much improved in 2.x, it would be good to have greater relevance given, in a keyword search, to title (specifically the 245) and then subject fields. I also see where having a popularity ranking might be beneficial. I just had to explain to a board member of one of our libraries why his search for John Sandford turned up children's titles first. So having MARC field 100s ranked higher than 700 in author searches would be beneficial as well.

To be clear, weighting hits that come from different index definitions has always been possible. 2.2 will have a staff client interface to make it easier, but the capability has been there all along. Weighting different parts of one indexed term -- say, weighting the title embedded in the keyword blob higher than the subjects embedded in the same blob -- would require the above-mentioned "make use of tsearch class weighting". But one can approximate that today by duplicating the index definitions from, say, title, author and subject classes within the keyword class.

-- Mike Rylander | Director of Research and Development | Equinox Software, Inc. / Your Library's Guide to Open Source | phone: 1-877-OPEN-ILS (673-6457) | email: mi...@esilibrary.com | web: http://www.esilibrary.com

I can't comment on any of the coding possibilities other than to say whichever way doesn't negatively impact search return time is preferable.

Elaine J. Elaine Hardy PINES Bibliographic Projects and Metadata Manager Georgia Public Library Service, A Unit of the University System of Georgia 1800 Century Place, Suite 150 Atlanta, Ga.
30345-4304 404.235-7128 404.235-7201, fax eha...@georgialibraries.org www.georgialibraries.org http://www.georgialibraries.org/pines/ -----Original Message----- From: open-ils-general-boun...@list.georgialibraries.org [mailto:open-ils-general-boun...@list.georgialibraries.org] On Behalf Of Kathy Lussier Sent: Tuesday, March 06, 2012 4:43 PM To: 'Evergreen Discussion Group' Subject: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen Hi all, I mentioned this during an e-mail discussion on the list last month, but I just wanted to hear from others in the Evergreen community about whether there is a desire to improve the relevance ranking for search results in Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it can look at things like the document length, word proximity, and unique word count. We've found that we had to remove the modifiers for document length and unique word count to prevent a problem where brief bib records were ranked way too high in our search results. In our local discussions, we've thought the following enhancements could improve the ranking of search results: * Giving greater weight to a record if the search terms appear in the title or subject (ideally, we would like these fields to be configurable). This is something that is tweakable in search.relevance_ranking, but my understanding is that the use of these tweaks results in a major reduction in search performance. * Using some type of popularity metric to boost relevancy for popular titles. I'm not sure what this metric should be (number of copies attached to record? Total circs in last x months? Total current circs?), but we believe some type of popularity measure would be particularly helpful in a public library where searches will often be for titles that are popular. For example, a search for twilight will most likely be for the Stephenie Meyer novel and not this http://books.google.com/books/about/Twilight.html?id=zEhkpXCyGzIC.
Mike Rylander had indicated in a previous e-mail (http://markmail.org/message/h6u5r3sy4nr36wsl) that we might be able to handle this through an overnight cron job without a negative impact on search speeds. Do others think these two enhancements would improve the search results in Evergreen? Do you think there are other things we could do to improve relevancy? My main concern would be that any changes might slow down search speeds, and I would want to make sure that we could do something to retrieve better search results without a slowdown. Also, I was wondering if this type of project might be a good candidate for a Google Summer of Code project. I look forward to hearing your feedback! Kathy - Kathy Lussier Project Coordinator Massachusetts Library Network Cooperative (508) 756-0172 (508) 755-3721 (fax) kluss...@masslnc.org IM: kmlussier (AOL Yahoo) Twitter: http://www.twitter.com/kmlussier
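The keyword|title duplication described at the top of the thread (Dan Scott's keyword|title field, boosted to weight 10 versus the keyword default of 1) amounts to adding a second index definition in the keyword class that reuses the title XPath with a bigger weight. A rough sketch follows, with the caveat that the column list of config.metabib_field differs between Evergreen versions, so treat this as pseudo-SQL rather than a drop-in:

```sql
-- Duplicate the main title definition into the keyword class at weight 10.
-- Column names here are assumed from memory of config.metabib_field and
-- may not match your Evergreen version exactly.
INSERT INTO config.metabib_field (field_class, name, label, xpath, weight)
SELECT 'keyword', 'title', 'Title (keyword)', xpath, 10
  FROM config.metabib_field
 WHERE field_class = 'title' AND name = 'proper';
```

After changing index definitions like this, the affected records would need to be reingested for the new field to take effect.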
Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen
Thanks guys for your feedback, just a few more thoughts I had...

On 2012-03-06, at 11:28 PM, Dan Scott wrote:

Indeed they do, however rewriting them in C to be super-fast would improve this situation. It's primarily a matter of available time and effort. It's also, however, pretty specialized work as you're dealing with Postgres at a very intimate level.

Hmm. For sure, C beats Perl for performance and would undoubtedly offer an improvement. For me, I am not terribly excited about the prospect of another bit of C floating around, particularly if it's in the Postgres bits. I personally am not familiar with C except I can compile it and run it and maybe vaguely get an idea of what it does, but I can't really debug it very well or do anything terribly useful with it. My concern is that adding more specialized work to the system makes things harder to maintain and less accessible to newcomers with the typical Linux+Perl+SQL or JS knowledge. We could write open-ils.circ in C as well, but it's just not a good idea, so I think in general we should dismiss "rewrite it in C" as a solution to anything.

Doing some digging into the SQL logs and QueryParser.pm, we observed that the naco_normalize function appears to be what's slowing the use of relevance_adjustment down.
While the naco_normalize function itself is quite fast on its own, it slows down exponentially when run on many records: explain analyze select naco_normalize(value) from metabib.keyword_field_entry limit 1;

To quibble, the relationship between the number of records and the time to process doesn't appear to be an exponential slowdown; it's linear (at least on our system); 10 times the records = (roughly) 10 times as long to retrieve, which is what I would expect:

Yes, sorry, my mistake. I had done the same measurement, so I don't know why I used the word -- sort of like using the word literally when you don't mean literally, I guess :)

When using the relevance adjustments, it is run on each metabib.x_entry.value that is retrieved in the initial resultset, which in many cases would be thousands of records. You can adjust the LIMIT in the above query to see how it slows down as the result set gets larger. It is also run for each relevance_adjustment, however I'm assuming that the query parser is treating it properly as IMMUTABLE and only running it once for each adjustment.

Have you tried giving the function a different cost estimate, per https://bugs.launchpad.net/evergreen/+bug/874603/comments/3 for a different but related problem? It's quite possible that something like:

That said, some quick testing suggests that it doesn't make a difference to the plan, at least for the inner query that's being sent to search.query_parser_fts().

Yeah, I tried it but it seemed like it did nothing. I suspect that it is actually being treated as IMMUTABLE

* Alternatively, I mentioned off-hand the option of direct indexing. The idea is to use an expression index defined by each row in config.metabib_field and have a background process keep the indexes in sync with configuration as things in that table (and the related normalizer configuration, etc.) change. I fear that's the path to madness, but it would be the most space-efficient way to handle things.
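For reference, the cost-estimate experiment discussed above would look something like the following; the exact signature of naco_normalize should be checked in your database first, and as noted it may not change the plan at all:

```sql
-- Raise the planner's cost estimate for naco_normalize (the default is
-- 100 for non-C-language functions), in the hope that the planner
-- evaluates it less eagerly. Signature assumed; verify with \df first.
ALTER FUNCTION naco_normalize(TEXT) COST 1000;
```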
I don't think that's the path to madness; it appeals to me, at least. (Okay, it's probably insane then.)

I am going to be totally frank and say that I think it honestly might be the path to madness. Why do we need more moving parts like a background process? I'm a little confused: how would a background process keep the indexes in sync when the configuration changes? Can you flesh this out a bit more? I don't feel I fully understand how this would work.

To go back to my original idea of normalizing the indexes -- snip back to Mike's first response -- I'm just wondering:

We need the pre-normalized form for some things * We could find those things, other than search, for which we use m.X_entry.value and move them elsewhere. The tradeoff would be that any change in config.metabib_field or normalizer configuration would have to cause a rewrite of that column.

Realistically, how often would one change normalizers or metabib_fields? I don't think it's done lightly, so it seems like a simple solution with a reasonable tradeoff -- you'd rarely want to change your normalizers, so you're only running naco_normalize on the table once in a blue moon vs running it on much of the table every time a search happens. In Solr and I'm sure other search engines you have to reindex if you change these kinds of things. The question is where is the m.X_entry.value used? It doesn't seem like anywhere when I skim the code, but everything's so dynamically generated that it's hard to tell. Any thoughts where it might be used? I just have a hard time thinking of a use case for that field.

~James Fournie BC Libraries Cooperative
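For what it's worth, the simplest concrete instance of the direct-indexing idea under discussion would be a per-field expression index. This is a hedged sketch only (index name invented, and it relies on naco_normalize being marked IMMUTABLE):

```sql
-- An expression index over the normalized value lets comparisons against
-- naco_normalize(value) use the index instead of recomputing it per row.
CREATE INDEX keyword_value_naco_idx
    ON metabib.keyword_field_entry (naco_normalize(value));
```

The maintenance problem being debated is that each row of config.metabib_field would need its own such index, dropped and rebuilt whenever the normalizer configuration changes.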
Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen
On Wed, Mar 7, 2012 at 1:29 PM, James Fournie jfour...@sitka.bclibraries.ca wrote:

Thanks guys for your feedback, just a few more thoughts I had... On 2012-03-06, at 11:28 PM, Dan Scott wrote: Indeed they do, however rewriting them in C to be super-fast would improve this situation. It's primarily a matter of available time and effort. It's also, however, pretty specialized work as you're dealing with Postgres at a very intimate level. Hmm. For sure, C beats Perl for performance and would undoubtedly offer an improvement. For me, I am not terribly excited about the prospect of another bit of C floating around, particularly if it's in the Postgres bits. I personally am not familiar with C except I can compile it and run it and maybe vaguely get an idea of what it does, but I can't really debug it very well or do anything terribly useful with it. My concern is that adding more specialized work to the system makes things harder to maintain and less accessible to newcomers with the typical Linux+Perl+SQL or JS knowledge. We could write open-ils.circ in C as well, but it's just not a good idea, so I think in general we should dismiss "rewrite it in C" as a solution to anything.

I can't say I agree with that -- the rewrite of open-ils.auth in C is an unmitigated win, and open-ils.cstore (and pcrud, and other derivatives) replacing (most of) open-ils.storage is as well, IMO. It's all about using the right tool for the job, and once an API is deemed very stable, performance optimization is a valid next step. And C is, generally speaking, a better speed-oriented tool than Perl. That said, I agree that open-ils.circ is not the next thing in line for the translation treatment. :)

As for it integrating with postgres, IMO http://www.postgresql.org/docs/9.1/interactive/extend-extensions.html shows how to do this right in modern times, and http://pgxn.org/ is starting to become the CPAN of postgres extensions. Ideally, all of our stored procs would best be rebundled as extensions.
That said, some quick testing suggests that it doesn't make a difference to the plan, at least for the inner query that's being sent to search.query_parser_fts(). Yeah, I tried it but it seemed like it did nothing. I suspect that it is actually being treated as IMMUTABLE

That's unfortunate... :(

* Alternatively, I mentioned off-hand the option of direct indexing. The idea is to use an expression index defined by each row in config.metabib_field and have a background process keep the indexes in sync with configuration as things in that table (and the related normalizer configuration, etc.) change. I fear that's the path to madness, but it would be the most space-efficient way to handle things. I don't think that's the path to madness; it appeals to me, at least. (Okay, it's probably insane then.) I am going to be totally frank and say that I think it honestly might be the path to madness. Why do we need more moving parts like a background process? I'm a little confused: how would a background process keep the indexes in sync when the configuration changes? Can you flesh this out a bit more? I don't feel I fully understand how this would work.

You wouldn't want to lock up the database with a DROP INDEX / CREATE INDEX pair each time a row on that table changed, so we'd need a process by which changes are registered and batched together.

To go back to my original idea of normalizing the indexes -- snip back to Mike's first response -- I'm just wondering: We need the pre-normalized form for some things * We could find those things, other than search, for which we use m.X_entry.value and move them elsewhere. The tradeoff would be that any change in config.metabib_field or normalizer configuration would have to cause a rewrite of that column. Realistically, how often would one change normalizers or metabib_fields?
I don't think it's done lightly, so it seems like a simple solution with a reasonable tradeoff -- you'd rarely want to change your normalizers, so you're only running naco_normalize on the table once in a blue moon vs running it on much of the table every time a search happens. In Solr and I'm sure other search engines you have to reindex if you change these kinds of things. The question is where is the m.X_entry.value used? It doesn't seem like anywhere when I skim the code, but everything's so dynamically generated that it's hard to tell. Any thoughts where it might be used? I just have a hard time thinking of a use case for that field.

Hrm... now that facets live on their own table, and may even end up being folded into the browse_entry infrastructure (just a thought right now, needs analysis), I'm not thinking of anything off the top of my head, except for the current incarnation of the display_field branch. If that ends up having its own table (reasonable, I think) then ... it may be safe to fully-normalize (or, at
Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen
Hi Mike,

To be clear, weighting hits that come from different index definitions has always been possible. 2.2 will have a staff client interface to make it easier, but the capability has been there all along.

Is this staff client interface already available in master? If so, can you give me a little more information on how this is done? Thanks!

Kathy

-----Original Message-----
From: open-ils-general-boun...@list.georgialibraries.org [mailto:open-ils-general-boun...@list.georgialibraries.org] On Behalf Of Mike Rylander
Sent: Wednesday, March 07, 2012 10:11 AM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

On Wed, Mar 7, 2012 at 8:35 AM, Hardy, Elaine eha...@georgialibraries.org wrote: Kathy, While the relevance display is much improved in 2.x, it would be good to have greater relevance given, in a keyword search, to title (specifically the 245) and then subject fields. I also see where having a popularity ranking might be beneficial. I just had to explain to a board member of one of our libraries why his search for John Sandford turned up children's titles first. So having MARC field 100s ranked higher than 700 in author searches would be beneficial as well. To be clear, weighting hits that come from different index definitions has always been possible. 2.2 will have a staff client interface to make it easier, but the capability has been there all along. Weighting different parts of one indexed term -- say, weighting the title embedded in the keyword blob higher than the subjects embedded in the same blob -- would require the above-mentioned "make use of tsearch class weighting". But one can approximate that today by duplicating the index definitions from, say, title, author and subject classes within the keyword class. -- Mike Rylander | Director of Research and Development | Equinox Software, Inc.
/ Your Library's Guide to Open Source | phone: 1-877-OPEN-ILS (673-6457) | email: mi...@esilibrary.com | web: http://www.esilibrary.com I can't comment on any of the coding possibilities other than to say whichever way doesn't negatively impact search return time is preferable. Elaine J. Elaine Hardy PINES Bibliographic Projects and Metadata Manager Georgia Public Library Service, A Unit of the University System of Georgia 1800 Century Place, Suite 150 Atlanta, Ga. 30345-4304 404.235-7128 404.235-7201, fax eha...@georgialibraries.org www.georgialibraries.org http://www.georgialibraries.org/pines/ -----Original Message----- From: open-ils-general-boun...@list.georgialibraries.org [mailto:open-ils-general-boun...@list.georgialibraries.org] On Behalf Of Kathy Lussier Sent: Tuesday, March 06, 2012 4:43 PM To: 'Evergreen Discussion Group' Subject: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen Hi all, I mentioned this during an e-mail discussion on the list last month, but I just wanted to hear from others in the Evergreen community about whether there is a desire to improve the relevance ranking for search results in Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it can look at things like the document length, word proximity, and unique word count. We've found that we had to remove the modifiers for document length and unique word count to prevent a problem where brief bib records were ranked way too high in our search results. In our local discussions, we've thought the following enhancements could improve the ranking of search results: * Giving greater weight to a record if the search terms appear in the title or subject (ideally, we would like these fields to be configurable). This is something that is tweakable in search.relevance_ranking, but my understanding is that the use of these tweaks results in a major reduction in search performance. * Using some type of popularity metric to boost relevancy for popular titles.
I'm not sure what this metric should be (number of copies attached to record? Total circs in last x months? Total current circs?), but we believe some type of popularity measure would be particularly helpful in a public library where searches will often be for titles that are popular. For example, a search for twilight will most likely be for the Stephenie Meyer novel and not this http://books.google.com/books/about/Twilight.html?id=zEhkpXCyGzIC. Mike Rylander had indicated in a previous e-mail (http://markmail.org/message/h6u5r3sy4nr36wsl) that we might be able to handle this through an overnight cron job without a negative impact on search speeds. Do others think these two enhancements would improve the search results in Evergreen? Do you think there are other things we could do to improve relevancy? My main concern would be that any changes might slow down search speeds, and I would want to make sure that we could do something to retrieve better search results without a slowdown. Also, I
Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen
On Wed, Mar 7, 2012 at 2:57 PM, Kathy Lussier kluss...@masslnc.org wrote: Hi Mike, To be clear, weighting hits that come from different index definitions has always been possible. 2.2 will have a staff client interface to make it easier, but the capability has been there all along. Is this staff client interface already available in master? If so, can you give me a little more information on how this is done?

It is. Go to Admin - Server Administration - MARC Search/Facet Fields and see the Weight field. The higher the number, the more important the field.

-- Mike Rylander | Director of Research and Development | Equinox Software, Inc. / Your Library's Guide to Open Source | phone: 1-877-OPEN-ILS (673-6457) | email: mi...@esilibrary.com | web: http://www.esilibrary.com

Thanks! Kathy

-----Original Message-----
From: open-ils-general-boun...@list.georgialibraries.org [mailto:open-ils-general-boun...@list.georgialibraries.org] On Behalf Of Mike Rylander
Sent: Wednesday, March 07, 2012 10:11 AM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

On Wed, Mar 7, 2012 at 8:35 AM, Hardy, Elaine eha...@georgialibraries.org wrote: Kathy, While the relevance display is much improved in 2.x, it would be good to have greater relevance given, in a keyword search, to title (specifically the 245) and then subject fields. I also see where having a popularity ranking might be beneficial. I just had to explain to a board member of one of our libraries why his search for John Sandford turned up children's titles first. So having MARC field 100s ranked higher than 700 in author searches would be beneficial as well. To be clear, weighting hits that come from different index definitions has always been possible. 2.2 will have a staff client interface to make it easier, but the capability has been there all along.
Weighting different parts of one indexed term -- say, weighting the title embedded in the keyword blob higher than the subjects embedded in the same blob -- would require the above-mentioned "make use of tsearch class weighting". But one can approximate that today by duplicating the index definitions from, say, title, author and subject classes within the keyword class. -- Mike Rylander | Director of Research and Development | Equinox Software, Inc. / Your Library's Guide to Open Source | phone: 1-877-OPEN-ILS (673-6457) | email: mi...@esilibrary.com | web: http://www.esilibrary.com I can't comment on any of the coding possibilities other than to say whichever way doesn't negatively impact search return time is preferable. Elaine J. Elaine Hardy PINES Bibliographic Projects and Metadata Manager Georgia Public Library Service, A Unit of the University System of Georgia 1800 Century Place, Suite 150 Atlanta, Ga. 30345-4304 404.235-7128 404.235-7201, fax eha...@georgialibraries.org www.georgialibraries.org http://www.georgialibraries.org/pines/ -----Original Message----- From: open-ils-general-boun...@list.georgialibraries.org [mailto:open-ils-general-boun...@list.georgialibraries.org] On Behalf Of Kathy Lussier Sent: Tuesday, March 06, 2012 4:43 PM To: 'Evergreen Discussion Group' Subject: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen Hi all, I mentioned this during an e-mail discussion on the list last month, but I just wanted to hear from others in the Evergreen community about whether there is a desire to improve the relevance ranking for search results in Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it can look at things like the document length, word proximity, and unique word count. We've found that we had to remove the modifiers for document length and unique word count to prevent a problem where brief bib records were ranked way too high in our search results.
In our local discussions, we've thought the following enhancements could improve the ranking of search results: * Giving greater weight to a record if the search terms appear in the title or subject (ideally, we would like these fields to be configurable). This is something that is tweakable in search.relevance_ranking, but my understanding is that the use of these tweaks results in a major reduction in search performance. * Using some type of popularity metric to boost relevancy for popular titles. I'm not sure what this metric should be (number of copies attached to record? Total circs in last x months? Total current circs?), but we believe some type of popularity measure would be particularly helpful in a public library where searches will often be for titles that are popular. For example, a search for twilight will most likely be for the Stephenie Meyer novel and not this http://books.google.com/books/about/Twilight.html?id=zEhkpXCyGzIC. Mike Rylander had indicated in a previous e-mail (http://markmail.org/message/h6u5r3sy4nr36wsl) that we might be able
Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen
On Tue, Mar 6, 2012 at 4:42 PM, Kathy Lussier kluss...@masslnc.org wrote:

Hi all, I mentioned this during an e-mail discussion on the list last month, but I just wanted to hear from others in the Evergreen community about whether there is a desire to improve the relevance ranking for search results in Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it can look at things like the document length, word proximity, and unique word count. We've found that we had to remove the modifiers for document length and unique word count to prevent a problem where brief bib records were ranked way too high in our search results.

FWIW, there is a library testing some new combinations of CD modifiers and having some success. As soon as I know more I will share (if they don't first).

In our local discussions, we've thought the following enhancements could improve the ranking of search results: * Giving greater weight to a record if the search terms appear in the title or subject (ideally, we would like these fields to be configurable). This is something that is tweakable in search.relevance_ranking, but my understanding is that the use of these tweaks results in a major reduction in search performance.

Indeed they do, however rewriting them in C to be super-fast would improve this situation. It's primarily a matter of available time and effort. It's also, however, pretty specialized work as you're dealing with Postgres at a very intimate level.

* Using some type of popularity metric to boost relevancy for popular titles. I'm not sure what this metric should be (number of copies attached to record? Total circs in last x months? Total current circs?), but we believe some type of popularity measure would be particularly helpful in a public library where searches will often be for titles that are popular. For example, a search for twilight will most likely be for the Stephenie Meyer novel and not this http://books.google.com/books/about/Twilight.html?id=zEhkpXCyGzIC.
Mike Rylander had indicated in a previous e-mail (http://markmail.org/message/h6u5r3sy4nr36wsl) that we might be able to handle this through an overnight cron job without a negative impact on search speeds.

Right ... A regular stats-gathering job could certainly allow this, and (if the QueryParser explain branch gets merged to master so we have a standard search canonicalization function) logged query analysis is another option as well.

Do others think these two enhancements would improve the search results in Evergreen? Do you think there are other things we could do to improve relevancy? My main concern would be that any changes might slow down search speeds, and I would want to make sure that we could do something to retrieve better search results without a slowdown.

I would prefer better results with a speed /increase/! :) But, who wouldn't. I can offer at least one lower-hanging fruit idea: switch from GiST indexes to GIN indexes by default, as they're much faster these days.

Also, I was wondering if this type of project might be a good candidate for a Google Summer of Code project.

The fairly mechanical change from GiST to GIN indexing is definitely a small-effort thing. I think the other ideas listed here (and still others from the past, like direct MARC indexing, and use of tsearch weighting classes) are probably worth trying -- particularly the relevance-adjustment-functions-in-C idea -- as GSoC projects, but may turn out to be too big. It's worth listing them as ideas for candidates to propose, though.

-- Mike Rylander | Director of Research and Development | Equinox Software, Inc. / Your Library's Guide to Open Source | phone: 1-877-OPEN-ILS (673-6457) | email: mi...@esilibrary.com | web: http://www.esilibrary.com
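The GiST-to-GIN change mentioned above is mechanical per index; a hedged sketch, with the index and table names assumed for illustration rather than taken from the actual Evergreen schema:

```sql
-- Rebuild a tsvector index as GIN (faster to search, slower to update,
-- larger on disk than GiST). Names here are illustrative; check your
-- schema for the real index names before dropping anything.
DROP INDEX metabib.keyword_field_entry_index_vector_idx;
CREATE INDEX keyword_field_entry_index_vector_idx
    ON metabib.keyword_field_entry USING GIN (index_vector);
```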
Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen
* Giving greater weight to a record if the search terms appear in the title or subject (ideally, we would like these fields to be configurable). This is something that is tweakable in search.relevance_ranking, but my understanding is that the use of these tweaks results in a major reduction in search performance.

Indeed they do, however rewriting them in C to be super-fast would improve this situation. It's primarily a matter of available time and effort. It's also, however, pretty specialized work as you're dealing with Postgres at a very intimate level.

Mike, could you elaborate on what bits of code you're talking about here that could be rewritten in C? Some of my colleagues at Sitka and I were trying to find out why broad searches are unusually slow and eventually found that our adjustments in search.relevance_adjustment were slowing things down. Months earlier the CD patch was added to trunk to circumvent this problem without our knowledge, so we tried backporting that code and testing it; however, in our initial tests, we weren't entirely satisfied with the CD modifiers' ability to rank items.

Doing some digging into the SQL logs and QueryParser.pm, we observed that the naco_normalize function appears to be what's slowing the use of relevance_adjustment down. While the naco_normalize function itself is quite fast on its own, it slows down exponentially when run on many records: explain analyze select naco_normalize(value) from metabib.keyword_field_entry limit 1;

When using the relevance adjustments, it is run on each metabib.x_entry.value that is retrieved in the initial resultset, which in many cases would be thousands of records. You can adjust the LIMIT in the above query to see how it slows down as the result set gets larger. It is also run for each relevance_adjustment, however I'm assuming that the query parser is treating it properly as IMMUTABLE and only running it once for each adjustment.
Anyway, I'm not entirely sure how this analysis holds up in trunk, as we've done this testing on Postgres 8.4 and Evergreen 2.0, and it looks like there's new code in trunk in O:A:Storage:Driver:Pg:QueryParser.pm, but no changes to those bits. I've attached some sample SQL of part of a 2.0 query and the same query without naco_normalize run on the metabib table. In my testing on our production dataset, this query -- a search for Canada -- went from over 80 seconds to less than 10 by removing the naco_normalize (it's still being run on the incoming term, though, which is probably unavoidable).

My thought for a solution would be that we could have naco_normalize run as an INSERT trigger on that field. Obviously the whole tables would need to be updated, which is no small task. I'm also not sure whether that would impact other things, i.e. where else the metabib.x_field_entry.value field is used, but generally I'd think we'd almost always be using that value for a comparison of some kind and would want it in a normalized form. Another option may be to not normalize in those comparisons; however, that's slightly less attractive IMO. Anyway, I'd be interested to hear your thoughts on that.
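James's trigger idea could look roughly like the following sketch. It assumes the pre-normalized text would live in a new column alongside the original; the value_norm column, the trigger, and the trigger function are all invented names for illustration, not anything that exists in Evergreen.

```sql
-- Hypothetical: keep a pre-normalized copy of value so searches can
-- compare against it without calling naco_normalize() once per row.
ALTER TABLE metabib.keyword_field_entry ADD COLUMN value_norm TEXT;

CREATE OR REPLACE FUNCTION metabib.normalize_value_trigger()
RETURNS TRIGGER AS $$
BEGIN
    NEW.value_norm := naco_normalize(NEW.value);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER keyword_field_entry_normalize
    BEFORE INSERT OR UPDATE ON metabib.keyword_field_entry
    FOR EACH ROW EXECUTE PROCEDURE metabib.normalize_value_trigger();

-- Backfilling existing rows is the "no small task" part:
-- UPDATE metabib.keyword_field_entry SET value_norm = naco_normalize(value);
```

With such a column in place, the rel_adjustment regexp comparisons could match against value_norm directly, trading index/storage overhead at write time for the per-row normalization cost at search time.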
-- normal EG 2.0 query
EXPLAIN ANALYZE SELECT * /* bib search */
FROM search.query_parser_fts( 1::INT, 0::INT, $core_query_25078$
SELECT m.source AS id,
       ARRAY_ACCUM(DISTINCT m.source) AS records,
       (AVG(
          (rank(x7b52820_keyword.index_vector, x7b52820_keyword.tsq)
           * x7b52820_keyword.weight
           * /* word_order */ COALESCE(NULLIF( (naco_normalize(x7b52820_keyword.value) ~ (naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 2, 1)
           * /* first_word */ COALESCE(NULLIF( (naco_normalize(x7b52820_keyword.value) ~ ('^'||naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 5, 1)
           * /* full_match */ COALESCE(NULLIF( (naco_normalize(x7b52820_keyword.value) ~ ('^'||naco_normalize($_25078$canada$_25078$)||'$')), FALSE )::INT * 5, 1))
       ) * COALESCE( NULLIF( FIRST(mrd.item_lang) = $_25078$eng$_25078$ , FALSE )::INT * 5, 1))::NUMERIC AS rel,
       (AVG(
          (rank(x7b52820_keyword.index_vector, x7b52820_keyword.tsq)
           * x7b52820_keyword.weight
           * /* word_order */ COALESCE(NULLIF( (naco_normalize(x7b52820_keyword.value) ~ (naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 2, 1)
           * /* first_word */ COALESCE(NULLIF( (naco_normalize(x7b52820_keyword.value) ~ ('^'||naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 5, 1)
           * /* full_match */ COALESCE(NULLIF( (naco_normalize(x7b52820_keyword.value) ~ ('^'||naco_normalize($_25078$canada$_25078$)||'$')), FALSE )::INT * 5, 1))
       ) * COALESCE( NULLIF( FIRST(mrd.item_lang) = $_25078$eng$_25078$ , FALSE )::INT * 5, 1))::NUMERIC AS rank,
       FIRST(mrd.date1) AS tie_break
FROM metabib.metarecord_source_map m
JOIN metabib.rec_descriptor mrd ON (m.source =
Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen
On Tue, Mar 6, 2012 at 6:13 PM, James Fournie jfour...@sitka.bclibraries.ca wrote:

* Giving greater weight to a record if the search terms appear in the title or subject (ideally, we would like these fields to be configurable). This is something that is tweakable in search.relevance_ranking, but my understanding is that the use of these tweaks results in a major reduction in search performance.

Indeed they do; however, rewriting them in C to be super-fast would improve this situation. It's primarily a matter of available time and effort. It's also, however, pretty specialized work, as you're dealing with Postgres at a very intimate level.

Mike, could you elaborate on which bits of code you're talking about here that could be rewritten in C?

I mean specifically the elaborate COALESCE/NULLIF/regexp (aka ~) parts of the SELECT clauses that implement the first-word, word-order and full-phrase relevance bumps that come from search.relevance_adjustment. There's also the option of attempting to rewrite naco_normalize and search_normalize (see below) in C. It's lots of string mangling, to which Perl is particularly suited, but it's not impossible by any means, and there are Postgres components (the 'unaccent' contrib/extension, for instance) that we could probably build on.

Some of my colleagues at Sitka and I were trying to find out why broad searches are unusually slow, and eventually found that our adjustments in search.relevance_adjustment were slowing things down. Months earlier, the CD patch had been added to trunk to circumvent this problem without our knowledge, so we tried backporting that code and testing it; however, in our initial tests, we weren't entirely satisfied with the CD modifiers' ability to rank items.

Right. These are more subtle than the heavy-handed search.relevance_adjustment settings, and therefore have a less drastic effect.
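For reference, the COALESCE/NULLIF/regexp bumps Mike means are the ones visible in James's attached sample SQL. Pulled out and annotated in isolation (with the literal 'canada' standing in for the user's normalized term), one of them reads as:

```sql
-- first_word bump from search.relevance_adjustment: multiply the rank
-- by 5 when the normalized field value *begins with* the normalized
-- search term. NULLIF(..., FALSE) turns a non-match into NULL, and
-- COALESCE then falls back to a neutral multiplier of 1.
COALESCE(
    NULLIF(
        (naco_normalize(x7b52820_keyword.value) ~ ('^' || naco_normalize('canada'))),
        FALSE
    )::INT * 5,
    1
)
```

The word_order and full_match bumps follow the same COALESCE/NULLIF pattern, differing only in the regexp anchors and the multiplier; the per-row naco_normalize() call on the left-hand side is what makes the whole construct expensive.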
But they also reduce the need for some of the search.relevance_adjustment entries, so in combination we should be able to find a good balance, especially if some of the rel_adjustment effects can be rewritten in C.

Doing some digging into the SQL logs and QueryParser.pm, we observed that the naco_normalize function appears to be what's slowing the use of relevance_adjustment down. While the naco_normalize function itself is quite fast on its own, its cumulative cost grows quickly when it is run over many records:

explain analyze select naco_normalize(value) from metabib.keyword_field_entry limit 1;

When using the relevance adjustments, it is run on each metabib.x_entry.value that is retrieved in the initial result set, which in many cases would be thousands of records. You can adjust the LIMIT in the above query to see how it slows down as the result set gets larger. It is also run for each relevance_adjustment; however, I'm assuming that the query parser is treating it properly as IMMUTABLE and only running it once for each adjustment.

Indeed, and naco_normalize is not necessarily the only normalizer that will be applied to each and every field! If you search a class or field that uses other (pos = 0) normalizers, all of those will also be applied to both the column value and the user input. There's some good news on this front, though. Galen recently implemented a trimmed-down version of naco_normalize, called search_normalize, that should be a bit faster. That should lower the total cost by a noticeable amount over many thousands of rows.

Anyway, I'm not entirely sure how this analysis holds up in trunk, as we've done this testing on Postgres 8.4 and Evergreen 2.0, and it looks like there's new code in trunk in O:A:Storage:Driver:Pg:QueryParser.pm, but no changes to those bits. I've attached some sample SQL of part of a 2.0 query and the same query without naco_normalize run on the metabib table.
In my testing on our production dataset, this query -- a search for Canada -- went from over 80 seconds to less than 10 by removing the naco_normalize (it's still being run on the incoming term, though, which is probably unavoidable).

It is unavoidable, but they should only be run once on user input and the result cached. EXPLAIN will tell the tale, and if it's not, then the normalizer functions aren't properly marked STABLE.

Hrm... and looking at your example, I spotted a chance for at least one optimization. If we recognize that there is only one term in a search (as in your canada example), we can skip the word-order rel_adjustment if we're told to apply it, saving ~1/3 of the cost of that particular chunk.

My thought for a solution would be that we could have naco_normalize run as an INSERT trigger on that field. Obviously the whole tables would need to be updated, which is no small task. I'm also not sure whether that would impact other things, i.e. where else the metabib.x_field_entry.value field is used, but generally I'd think we'd almost always be using that value for a comparison of some kind and