Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

2012-05-12 Thread Dan Scott
On Wed, Mar 7, 2012 at 10:11 AM, Mike Rylander mrylan...@gmail.com wrote:
 On Wed, Mar 7, 2012 at 8:35 AM, Hardy, Elaine
 eha...@georgialibraries.org wrote:
 Kathy,

 While the relevance display is much improved in 2.x, it would be good to
 have greater relevance given, in a keyword search, to title (specifically
 the 245) and then subject fields. I also see where having a popularity
 ranking might be beneficial.

 I just had to explain to a board member of one of our libraries why his
 search for "John Sandford" turned up children's titles first. So having MARC
 field 100s ranked higher than 700 in author searches would be beneficial
 as well.


 To be clear, weighting hits that come from different index definitions
 has always been possible.  2.2 will have a staff client interface to
 make it easier, but the capability has been there all along.

 Weighting different parts of one indexed term -- say, weighting the
 title embedded in the keyword blob higher than the subjects embedded
 in the same blob -- would require the above-mentioned "make use of
 tsearch class weighting."  But one can approximate that today by
 duplicating the index definitions from, say, title, author and subject
 classes within the keyword class.

We've been doing the latter (duplicating title inside the keyword
class) since 1.6 days - see
http://coffeecode.net/archives/218-Adjusting-relevancy-rankings-in-Evergreen-1.6,-some-explorations.html
for a description of how I added a keyword|title field, and then
boosted its weight to 10 (versus the default of 1 that the rest of the
keyword fields get). So a general keyword search for "programming
languages" on our system by far prefers results that contain
"programming languages" in the title... this is still working nicely
in 2.1 for us.

Note, however, that we did clear all entries out of the
search.relevance_adjustment table as that was found to slow things
down massively in the 2.0 era.
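The "tsearch class weighting" Mike mentions above is PostgreSQL's built-in setweight()/ts_rank() mechanism. As a hedged illustration only -- this is plain Postgres, not Evergreen's actual schema or indexing pipeline -- title tokens can be labeled class 'A' and subject tokens class 'B' inside one tsvector, and ts_rank() then weights A-class matches more heavily:

```sql
-- Plain-Postgres sketch of weight classes (not Evergreen schema):
-- the weight array is in {D, C, B, A} order, so A-class (title) hits
-- score well above B-class (subject) hits for the same term.
SELECT ts_rank(
         '{0.1, 0.2, 0.4, 1.0}'::float4[],   -- D, C, B, A weights
         setweight(to_tsvector('english', 'programming languages'), 'A')
           || setweight(to_tsvector('english', 'computer science'), 'B'),
         to_tsquery('english', 'programming')
       ) AS weighted_rank;
```

This is the native alternative to duplicating title/author/subject index definitions inside the keyword class, which achieves a similar boost at the cost of extra index size.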


Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

2012-05-10 Thread Mike Rylander
On Mon, May 7, 2012 at 3:12 PM, Kathy Lussier kluss...@masslnc.org wrote:
 Hi Mike,


 FWIW, there is a library testing some new combinations of CD modifiers
 and having some success.  As soon as I know more I will share (if they
 don't first).

 Did anything ever come of this? I would be interested in seeing any examples
 that resulted in improved relevancy.


The testing occurred, but I haven't heard the outcome yet.  I'll
dig for it ASAP.


 The fairly mechanical change from GIST to GIN indexing is definitely a
 small-effort thing. I think the other ideas listed here (and still
 others from the past, like direct MARC indexing, and use of tsearch
 weighting classes) are probably worth trying -- particularly the
 relevance-adjustment-functions-in-C idea -- as GSoC projects, but may
 turn out to be too big.  It's worth listing them as ideas for
 candidates to propose, though.

 I was happy to see that "Optimize Evergreen: Convert PL/Perl-based
 PostgreSQL stored procedures to PL/SQL or PL/C" was one of the accepted GSoC
 projects. However, since I got a little lost in the technical details of
 this discussion, I was curious if, when this GSoC project is complete, we
 can feel more comfortable about using search.relevance_ranking to tweak
 the relevancy without adversely affecting search performance.


Short version: yes.

Longer version: that's exactly one of the goals, and there are some
other avenues of attack as well that should speed search and are
related to (but not strictly inside) the GSoC project.

--miker

 I know there were two related GSoC ideas listed, and I wasn't sure if both
 needed to be done together to ultimately improve search speeds.

 Thanks!

 Kathy


 --

 Kathy Lussier
 Project Coordinator
 Massachusetts Library Network Cooperative
 (508) 756-0172
 (508) 755-3721 (fax)
 kluss...@masslnc.org
 Twitter: http://www.twitter.com/kmlussier


 On 3/6/2012 5:00 PM, Mike Rylander wrote:

 On Tue, Mar 6, 2012 at 4:42 PM, Kathy Lussier kluss...@masslnc.org
  wrote:

 Hi all,

 I mentioned this during an e-mail discussion on the list last month, but I
 just wanted to hear from others in the Evergreen community about whether
 there is a desire to improve the relevance ranking for search results in
 Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it can
 look at things like the document length, word proximity, and unique word
 count. We've found that we had to remove the modifiers for document length
 and unique word count to prevent a problem where brief bib records were
 ranked way too high in our search results.


 FWIW, there is a library testing some new combinations of CD modifiers
 and having some success.  As soon as I know more I will share (if they
 don't first).


 In our local discussions, we've thought the following enhancements could
 improve the ranking of search results:

 * Giving greater weight to a record if the search terms appear in the
 title or subject (ideally, we would like these fields to be configurable.)
 This is something that is tweakable in search.relevance_ranking, but my
 understanding is that the use of these tweaks results in a major
 reduction in search performance.


 Indeed they do, however rewriting them in C to be super-fast would
 improve this situation.  It's primarily a matter of available time and
 effort.  It's also, however, pretty specialized work as you're dealing
 with Postgres at a very intimate level.

 * Using some type of popularity metric to boost relevancy for popular
 titles. I'm not sure what this metric should be (number of copies attached
 to record? Total circs in last x months? Total current circs?), but we
 believe some type of popularity measure would be particularly helpful in a
 public library where searches will often be for titles that are popular.
 For example, a search for "twilight" will most likely be for the Stephenie
 Meyer novel and not this
 http://books.google.com/books/about/Twilight.html?id=zEhkpXCyGzIC. Mike
 Rylander had indicated in a previous e-mail
 (http://markmail.org/message/h6u5r3sy4nr36wsl) that we might be able to
 handle this through an overnight cron job without a negative impact on
 search speeds.


 Right ... A regular stats-gathering job could certainly allow this,
 and (if the QueryParser explain branch gets merged to master so we
 have a standard search canonicalization function) logged query
 analysis is another option as well.
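The overnight stats-gathering job could take many shapes; one hedged sketch (all table and view names here are hypothetical, not Evergreen's schema) is a nightly cron that caches one popularity number per bib record, so searches can blend it into ranking without touching circulation tables at query time:

```sql
-- Hypothetical nightly job: cache a per-record popularity score.
CREATE TABLE IF NOT EXISTS bib_popularity (
    record     BIGINT PRIMARY KEY,
    pop_score  NUMERIC NOT NULL DEFAULT 0
);

-- Log-scale recent circ counts so runaway bestsellers don't swamp
-- everything else in the blended rank.
INSERT INTO bib_popularity (record, pop_score)
SELECT c.record, ln(1 + count(*))
  FROM recent_circulations c      -- hypothetical view of last-6-months circs
 GROUP BY c.record
ON CONFLICT (record) DO UPDATE SET pop_score = EXCLUDED.pop_score;
```

A search query could then multiply or add pop_score into its relevance expression; since the score is precomputed, the search-time cost is a single join.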


 Do others think these two enhancements would improve the search results in
 Evergreen? Do you think there are other things we could do to improve
 relevancy? My main concern would be that any changes might slow down search
 speeds, and I would want to make sure that we could do something to retrieve
 better search results without a slowdown.


 I would prefer better results with a speed /increase/! :)  But, who
 wouldn't.

 I can offer at least one lower-hanging fruit idea: switch from GIST
 indexes to GIN indexes by default, as they're much faster these days.

Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

2012-05-07 Thread Kathy Lussier

Hi Mike,

 FWIW, there is a library testing some new combinations of CD modifiers
 and having some success.  As soon as I know more I will share (if they
 don't first).

Did anything ever come of this? I would be interested in seeing any 
examples that resulted in improved relevancy.


 The fairly mechanical change from GIST to GIN indexing is definitely a
 small-effort thing. I think the other ideas listed here (and still
 others from the past, like direct MARC indexing, and use of tsearch
 weighting classes) are probably worth trying -- particularly the
 relevance-adjustment-functions-in-C idea -- as GSoC projects, but may
 turn out to be too big.  It's worth listing them as ideas for
 candidates to propose, though.

I was happy to see that "Optimize Evergreen: Convert PL/Perl-based
PostgreSQL stored procedures to PL/SQL or PL/C" was one of the accepted
GSoC projects. However, since I got a little lost in the technical
details of this discussion, I was curious if, when this GSoC project is
complete, we can feel more comfortable about using
search.relevance_ranking to tweak the relevancy without adversely
affecting search performance.


I know there were two related GSoC ideas listed, and I wasn't sure if 
both needed to be done together to ultimately improve search speeds.


Thanks!

Kathy

--

Kathy Lussier
Project Coordinator
Massachusetts Library Network Cooperative
(508) 756-0172
(508) 755-3721 (fax)
kluss...@masslnc.org
Twitter: http://www.twitter.com/kmlussier

On 3/6/2012 5:00 PM, Mike Rylander wrote:

On Tue, Mar 6, 2012 at 4:42 PM, Kathy Lussier kluss...@masslnc.org wrote:

Hi all,

I mentioned this during an e-mail discussion on the list last month, but I
just wanted to hear from others in the Evergreen community about whether
there is a desire to improve the relevance ranking for search results in
Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it can
look at things like the document length, word proximity, and unique word
count. We've found that we had to remove the modifiers for document length
and unique word count to prevent a problem where brief bib records were
ranked way too high in our search results.


FWIW, there is a library testing some new combinations of CD modifiers
and having some success.  As soon as I know more I will share (if they
don't first).



In our local discussions, we've thought the following enhancements could
improve the ranking of search results:

* Giving greater weight to a record if the search terms appear in the title
or subject (ideally, we would like these fields to be configurable.) This is
something that is tweakable in search.relevance_ranking, but my
understanding is that the use of these tweaks results in a major reduction
in search performance.



Indeed they do, however rewriting them in C to be super-fast would
improve this situation.  It's primarily a matter of available time and
effort.  It's also, however, pretty specialized work as you're dealing
with Postgres at a very intimate level.


* Using some type of popularity metric to boost relevancy for popular
titles. I'm not sure what this metric should be (number of copies attached
to record? Total circs in last x months? Total current circs?), but we
believe some type of popularity measure would be particularly helpful in a
public library where searches will often be for titles that are popular. For
example, a search for "twilight" will most likely be for the Stephenie
Meyer novel and not this
http://books.google.com/books/about/Twilight.html?id=zEhkpXCyGzIC. Mike
Rylander had indicated in a previous e-mail
(http://markmail.org/message/h6u5r3sy4nr36wsl) that we might be able to
handle this through an overnight cron job without a negative impact on
search speeds.


Right ... A regular stats-gathering job could certainly allow this,
and (if the QueryParser explain branch gets merged to master so we
have a standard search canonicalization function) logged query
analysis is another option as well.



Do others think these two enhancements would improve the search results in
Evergreen? Do you think there are other things we could do to improve
relevancy? My main concern would be that any changes might slow down search
speeds, and I would want to make sure that we could do something to retrieve
better search results without a slowdown.



I would prefer better results with a speed /increase/! :)  But, who wouldn't.

I can offer at least one lower-hanging fruit idea: switch from GIST
indexes to GIN indexes by default, as they're much faster these days.
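For reference, the "fairly mechanical" swap is a drop-and-recreate of the full-text index. A hedged sketch -- the table and index names below are illustrative, so check them against your actual Evergreen schema before running anything:

```sql
-- Sketch of the GiST -> GIN swap for one full-text column.
-- Names are illustrative; verify against your schema first.
DROP INDEX IF EXISTS metabib.keyword_field_entry_index_vector_idx;
CREATE INDEX keyword_field_entry_index_vector_idx
    ON metabib.keyword_field_entry USING GIN (index_vector);
```

GIN indexes are slower to build and update than GiST but markedly faster to query, which is the right trade-off for a catalog that is read far more often than it is written.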


Also, I was wondering if this type of project might be a good candidate for
a Google Summer of Code project.



The fairly mechanical change from GIST to GIN indexing is definitely a
small-effort thing. I think the other ideas listed here (and still
others from the past, like direct MARC indexing, and use of tsearch
weighting classes) are probably worth trying -- particularly the
relevance-adjustment-functions-in-C idea -- as GSoC projects, but may
turn out to be too big.

Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

2012-03-22 Thread Mike Rylander
On Thu, Mar 22, 2012 at 4:56 PM, Brian Greene bgre...@cgcc.cc.or.us wrote:
 Does relevancy ranking currently take publication date into account? I think
 this could be especially helpful with topical searches when, all other
 things being equal, I'd probably consider the newer item to be more
 relevant.

It does. The Date1 fixed field is used as a first tie-breaker after the
primary (user-chosen) sort axis.

--miker

 Similarly, I could see home library (in cases where that can
 be determined) being considered and used when there are two otherwise
 equally relevant items. Note that in both cases I don't want them to become
 de facto limiters, but rather act more like tie-breakers after the other
 factors have been weighed.

 I also support taking into account some sort of popularity measure.

 Thanks,
 Brian


 Brian Greene, Library Director
 Columbia Gorge Community College
 The Dalles, Oregon 97058
 (541) 506-6080 | www.cgcc.cc.or.us
 Mike Rylander mrylan...@gmail.com 3/8/2012 10:55 AM 
 On Thu, Mar 8, 2012 at 12:10 PM, Elizabeth Longwell blong...@eou.edu
 wrote:
 Hi,

 Is it necessary to re-index after changing weights for relevancy?

 Not at all. The only gotcha is that cached searches won't show the
 changed weighting (of course).  So, say you searched for "rowling"
 (sans quotes) and wanted to test an author-weighting change made after
 the search (but before the cache expired), search again for "rowling
 -asdlfkaf" (again, sans quotes).  That negated random string at the
 end kills the cache without materially changing the query.

 --
 Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  mi...@esilibrary.com
 | web:  http://www.esilibrary.com



 Beth Longwell
 Sage Library System

 On Wed, Mar 7, 2012 at 5:29 PM, Mike Rylander mrylan...@gmail.com wrote:
 On Wed, Mar 7, 2012 at 2:57 PM, Kathy Lussier kluss...@masslnc.org
 wrote:
 Hi Mike,

To be clear, weighting hits that come from different index definitions
has always been possible.  2.2 will have a staff client interface to
make it easier, but the capability has been there all along.

 Is this staff client interface already available in master? If so, can
 you
 give me a little more information on how this is done?

 It is.  Go to Admin -> Server Administration -> MARC Search/Facet
 Fields and see the "Weight" field.  The higher the number, the more
 important the field.
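The same knob can be reached from the database side. Per-field weights live in config.metabib_field; the "weight" column name is assumed from this thread, so treat this as a sketch and verify against your version's schema:

```sql
-- Boost keyword-class title hits tenfold relative to the default
-- weight of 1 (column name assumed; verify before use).
UPDATE config.metabib_field
   SET weight = 10
 WHERE field_class = 'keyword' AND name = 'title';
```

As noted elsewhere in the thread, no re-index is needed after a weight change; only cached searches will keep showing the old ordering until they expire.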

 --
 Mike Rylander
  | Director of Research and Development
  | Equinox Software, Inc. / Your Library's Guide to Open Source
  | phone:  1-877-OPEN-ILS (673-6457)
  | email:  mi...@esilibrary.com
  | web:  http://www.esilibrary.com



 Thanks!
 Kathy



-Original Message-
From: open-ils-general-boun...@list.georgialibraries.org [mailto:open-
ils-general-boun...@list.georgialibraries.org] On Behalf Of Mike
Rylander
Sent: Wednesday, March 07, 2012 10:11 AM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] Improving relevance ranking in
Evergreen

On Wed, Mar 7, 2012 at 8:35 AM, Hardy, Elaine
eha...@georgialibraries.org wrote:
 Kathy,

 While the relevance display is much improved in 2.x, it would be good to
 have greater relevance given, in a keyword search, to title (specifically
 the 245) and then subject fields. I also see where having a popularity
 ranking might be beneficial.

 I just had to explain to a board member of one of our libraries why his
 search for "John Sandford" turned up children's titles first. So having
 MARC field 100s ranked higher than 700 in author searches would be
 beneficial as well.


To be clear, weighting hits that come from different index definitions
has always been possible.  2.2 will have a staff client interface to
make it easier, but the capability has been there all along.

Weighting different parts of one indexed term -- say, weighting the
title embedded in the keyword blob higher than the subjects embedded
in the same blob -- would require the above-mentioned "make use of
tsearch class weighting."  But one can approximate that today by
duplicating the index definitions from, say, title, author and subject
classes within the keyword class.

--
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  mi...@esilibrary.com
 | web:  http://www.esilibrary.com


 I can't comment on any of the coding possibilities other than to say
 whichever way doesn't negatively impact search return time is preferable.

 Elaine


 J. Elaine Hardy
 PINES Bibliographic Projects and Metadata Manager
 Georgia Public Library Service,
 A Unit of the University System of Georgia
 1800 Century Place, Suite 150
 Atlanta, Ga. 30345-4304
 404.235-7128
 404.235-7201, fax

 eha...@georgialibraries.org
 www.georgialibraries.org
 http://www.georgialibraries.org/pines/


 -Original Message-
 From: open-ils-general-boun

Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

2012-03-08 Thread Elizabeth Longwell
Hi,

Is it necessary to re-index after changing weights for relevancy?

Beth Longwell
Sage Library System

On Wed, Mar 7, 2012 at 5:29 PM, Mike Rylander mrylan...@gmail.com wrote:
 On Wed, Mar 7, 2012 at 2:57 PM, Kathy Lussier kluss...@masslnc.org wrote:
 Hi Mike,

To be clear, weighting hits that come from different index definitions
has always been possible.  2.2 will have a staff client interface to
make it easier, but the capability has been there all along.

 Is this staff client interface already available in master? If so, can you
 give me a little more information on how this is done?

 It is.  Go to Admin -> Server Administration -> MARC Search/Facet
 Fields and see the "Weight" field.  The higher the number, the more
 important the field.

 --
 Mike Rylander
  | Director of Research and Development
  | Equinox Software, Inc. / Your Library's Guide to Open Source
  | phone:  1-877-OPEN-ILS (673-6457)
  | email:  mi...@esilibrary.com
  | web:  http://www.esilibrary.com



 Thanks!
 Kathy



-Original Message-
From: open-ils-general-boun...@list.georgialibraries.org [mailto:open-
ils-general-boun...@list.georgialibraries.org] On Behalf Of Mike
Rylander
Sent: Wednesday, March 07, 2012 10:11 AM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] Improving relevance ranking in
Evergreen

On Wed, Mar 7, 2012 at 8:35 AM, Hardy, Elaine
eha...@georgialibraries.org wrote:
 Kathy,

 While the relevance display is much improved in 2.x, it would be good to
 have greater relevance given, in a keyword search, to title (specifically
 the 245) and then subject fields. I also see where having a popularity
 ranking might be beneficial.

 I just had to explain to a board member of one of our libraries why his
 search for "John Sandford" turned up children's titles first. So having
 MARC field 100s ranked higher than 700 in author searches would be
 beneficial as well.


To be clear, weighting hits that come from different index definitions
has always been possible.  2.2 will have a staff client interface to
make it easier, but the capability has been there all along.

Weighting different parts of one indexed term -- say, weighting the
title embedded in the keyword blob higher than the subjects embedded
in the same blob -- would require the above-mentioned "make use of
tsearch class weighting."  But one can approximate that today by
duplicating the index definitions from, say, title, author and subject
classes within the keyword class.

--
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  mi...@esilibrary.com
 | web:  http://www.esilibrary.com


 I can't comment on any of the coding possibilities other than to say
 whichever way doesn't negatively impact search return time is preferable.

 Elaine


 J. Elaine Hardy
 PINES Bibliographic Projects and Metadata Manager
 Georgia Public Library Service,
 A Unit of the University System of Georgia
 1800 Century Place, Suite 150
 Atlanta, Ga. 30345-4304
 404.235-7128
 404.235-7201, fax

 eha...@georgialibraries.org
 www.georgialibraries.org
 http://www.georgialibraries.org/pines/


 -Original Message-
 From: open-ils-general-boun...@list.georgialibraries.org
 [mailto:open-ils-general-boun...@list.georgialibraries.org] On Behalf
Of
 Kathy Lussier
 Sent: Tuesday, March 06, 2012 4:43 PM
 To: 'Evergreen Discussion Group'
 Subject: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

 Hi all,

 I mentioned this during an e-mail discussion on the list last month, but I
 just wanted to hear from others in the Evergreen community about whether
 there is a desire to improve the relevance ranking for search results in
 Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it
 can look at things like the document length, word proximity, and unique
 word count. We've found that we had to remove the modifiers for document
 length and unique word count to prevent a problem where brief bib records
 were ranked way too high in our search results.

 In our local discussions, we've thought the following enhancements could
 improve the ranking of search results:

 * Giving greater weight to a record if the search terms appear in the
 title or subject (ideally, we would like these fields to be configurable.)
 This is something that is tweakable in search.relevance_ranking, but my
 understanding is that the use of these tweaks results in a major
 reduction in search performance.

 * Using some type of popularity metric to boost relevancy for popular
 titles. I'm not sure what this metric should be (number of copies attached
 to record? Total circs in last x months? Total current circs?), but we
 believe some type of popularity measure would be particularly helpful in a
 public library where searches will often be for titles that are popular.
 For example, a search for "twilight" will most likely be for the
 Stephenie Meyer novel

Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

2012-03-08 Thread Mike Rylander
On Thu, Mar 8, 2012 at 12:10 PM, Elizabeth Longwell blong...@eou.edu wrote:
 Hi,

 Is it necessary to re-index after changing weights for relevancy?

Not at all. The only gotcha is that cached searches won't show the
changed weighting (of course).  So, say you searched for "rowling"
(sans quotes) and wanted to test an author-weighting change made after
the search (but before the cache expired), search again for "rowling
-asdlfkaf" (again, sans quotes).  That negated random string at the
end kills the cache without materially changing the query.

-- 
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  mi...@esilibrary.com
 | web:  http://www.esilibrary.com



 Beth Longwell
 Sage Library System

 On Wed, Mar 7, 2012 at 5:29 PM, Mike Rylander mrylan...@gmail.com wrote:
 On Wed, Mar 7, 2012 at 2:57 PM, Kathy Lussier kluss...@masslnc.org wrote:
 Hi Mike,

To be clear, weighting hits that come from different index definitions
has always been possible.  2.2 will have a staff client interface to
make it easier, but the capability has been there all along.

 Is this staff client interface already available in master? If so, can you
 give me a little more information on how this is done?

 It is.  Go to Admin -> Server Administration -> MARC Search/Facet
 Fields and see the "Weight" field.  The higher the number, the more
 important the field.

 --
 Mike Rylander
  | Director of Research and Development
  | Equinox Software, Inc. / Your Library's Guide to Open Source
  | phone:  1-877-OPEN-ILS (673-6457)
  | email:  mi...@esilibrary.com
  | web:  http://www.esilibrary.com



 Thanks!
 Kathy



-Original Message-
From: open-ils-general-boun...@list.georgialibraries.org [mailto:open-
ils-general-boun...@list.georgialibraries.org] On Behalf Of Mike
Rylander
Sent: Wednesday, March 07, 2012 10:11 AM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] Improving relevance ranking in
Evergreen

On Wed, Mar 7, 2012 at 8:35 AM, Hardy, Elaine
eha...@georgialibraries.org wrote:
 Kathy,

 While the relevance display is much improved in 2.x, it would be good to
 have greater relevance given, in a keyword search, to title (specifically
 the 245) and then subject fields. I also see where having a popularity
 ranking might be beneficial.

 I just had to explain to a board member of one of our libraries why his
 search for "John Sandford" turned up children's titles first. So having
 MARC field 100s ranked higher than 700 in author searches would be
 beneficial as well.


To be clear, weighting hits that come from different index definitions
has always been possible.  2.2 will have a staff client interface to
make it easier, but the capability has been there all along.

Weighting different parts of one indexed term -- say, weighting the
title embedded in the keyword blob higher than the subjects embedded
in the same blob -- would require the above-mentioned "make use of
tsearch class weighting."  But one can approximate that today by
duplicating the index definitions from, say, title, author and subject
classes within the keyword class.

--
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  mi...@esilibrary.com
 | web:  http://www.esilibrary.com


 I can't comment on any of the coding possibilities other than to say
 whichever way doesn't negatively impact search return time is preferable.

 Elaine


 J. Elaine Hardy
 PINES Bibliographic Projects and Metadata Manager
 Georgia Public Library Service,
 A Unit of the University System of Georgia
 1800 Century Place, Suite 150
 Atlanta, Ga. 30345-4304
 404.235-7128
 404.235-7201, fax

 eha...@georgialibraries.org
 www.georgialibraries.org
 http://www.georgialibraries.org/pines/


 -Original Message-
 From: open-ils-general-boun...@list.georgialibraries.org
 [mailto:open-ils-general-boun...@list.georgialibraries.org] On Behalf
Of
 Kathy Lussier
 Sent: Tuesday, March 06, 2012 4:43 PM
 To: 'Evergreen Discussion Group'
 Subject: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

 Hi all,

 I mentioned this during an e-mail discussion on the list last month, but I
 just wanted to hear from others in the Evergreen community about whether
 there is a desire to improve the relevance ranking for search results in
 Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it
 can look at things like the document length, word proximity, and unique
 word count. We've found that we had to remove the modifiers for document
 length and unique word count to prevent a problem where brief bib records
 were ranked way too high in our search results.

 In our local discussions, we've thought the following enhancements could
 improve the ranking of search results:

 * Giving greater weight to a record if the search terms appear in the
 title

Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

2012-03-07 Thread Mike Rylander
On Wed, Mar 7, 2012 at 2:28 AM, Dan Scott d...@coffeecode.net wrote:
 Lots of snips implied below; also note that I'm running James' 2.0
 query on a 2.1 system.

 On Tue, Mar 06, 2012 at 10:55:24PM -0500, Mike Rylander wrote:
 On Tue, Mar 6, 2012 at 6:13 PM, James Fournie
 jfour...@sitka.bclibraries.ca wrote:
 
  * Giving greater weight to a record if the search terms appear in the
  title or subject (ideally, we would like these fields to be configurable.)
  This is something that is tweakable in search.relevance_ranking, but my
  understanding is that the use of these tweaks results in a major
  reduction in search performance.
 
 
  Indeed they do, however rewriting them in C to be super-fast would
  improve this situation.  It's primarily a matter of available time and
  effort.  It's also, however, pretty specialized work as you're dealing
  with Postgres at a very intimate level.

 Hmm. For sure, C beats Perl for performance and would undoubtedly offer an
 improvement, but it looks like another bottleneck for broad searches is
 in having to visit & sort hundreds of thousands of rows, so that they
 can be sorted by rank, with the added I/O cost of using a disk merge for
 these broad searches rather than in-memory quicksort.

 For comparison, I swapped out 'canada' for 'paraguay' and explain
 analyzed the results; 'canada' uses a disk merge because it needs to
 deal with 482,000 rows of data and sort 596,000 KB of data, while
 'paraguay' (which only has to sort 322 rows) used an in-memory quicksort
 at 582 KB.

 This is on a system where work_mem is set to 288 MB - much higher than
 one would generally want, particularly for the number of physical
 connections that could potentially get ramped up. That high work_mem
 helps with reasonably broad searches, but searching for Canada in a
 Canadian academic library, you might as well be searching for the...
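Dan's in-memory-vs-disk distinction is visible directly in the query plan. A sketch of how to check it (the inner query is a placeholder, not the real Evergreen search SQL):

```sql
-- Bump work_mem for this session only, then check the "Sort Method"
-- line in the plan output.  The SELECT is a placeholder query.
SET work_mem = '288MB';
EXPLAIN (ANALYZE, BUFFERS)
  SELECT id
    FROM some_search_results        -- placeholder relation
   ORDER BY rank DESC;
-- "Sort Method: quicksort  Memory: ..."      = in-memory (fast path)
-- "Sort Method: external merge  Disk: ..."   = spilled to disk (slow path)
```

Raising work_mem per session avoids committing that much memory to every connection globally, which matters when many backends can sort concurrently.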


All true, but also not something we can do much about (without a
precalculated rank, a la PageRank); also, testing on 9.0 around its
release shows that pre-limiting as we used to was slower than what we
do today.  I don't have the details in front of me, but there it is.


 Indeed, and naco_normalize is not necessarily the only normalizer that
 will be applied to each and every field!  If you search a class or
 field that uses other (pos >= 0) normalizers, all of those will also
 be applied to both the column value and the user input.

 There's some good news on this front, though.  Galen recently
 implemented a trimmed down version of naco_normalize, called
 search_normalize, that should be a bit faster.  That should lower the
 total cost by a noticeable amount over many thousands of rows.

 You might be thinking of something else? I'm pretty sure that
 2bc4e97f72b shows that I implemented search_normalize() simply to avoid
 problems with apostrophe mangling in the strict naco_normalize()
 function - and I doubt there will be any observable difference in
 performance.

Sorry, I probably am.  My apologies, Dan, I didn't intend to misdirect
your credit.  That said, I'd be surprised if anything that shortened
the pl/perl we use didn't help some, in aggregate, on very large
queries.  It's testable...
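To make the normalizer discussion concrete, here is a toy stand-in -- NOT the real naco_normalize or search_normalize, which are PL/Perl and considerably more careful -- showing the kind of transformation involved and why it must be applied to both sides of the comparison:

```sql
-- Toy normalizer sketch: lowercase, strip diacritics (requires the
-- unaccent extension), drop punctuation, squeeze runs of spaces.
CREATE OR REPLACE FUNCTION toy_search_normalize(txt TEXT) RETURNS TEXT AS $f$
  SELECT btrim(regexp_replace(
           regexp_replace(lower(unaccent($1)), '[^a-z0-9 ]', '', 'g'),
           ' +', ' ', 'g'));
$f$ LANGUAGE SQL STABLE;

-- Must run over the indexed value AND the user's input, so the two
-- compare normalized-to-normalized:
SELECT toy_search_normalize('Rowling, J. K.');   -- 'rowling j k'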


 Hrm... and looking at your example, I spotted a chance for at least
 one optimization.  If we recognize that there is only one term in a
 search (as in your canada example) we can skip the word-order
 rel_adjustment if we're told to apply it, saving ~1/3 of the cost of
 that particular chunk.

 I can confirm this; running the same query on our system with word order
 removed carved response times down to 390 seconds from 580 seconds.
 Still unusable, but better. (EXPLAIN ANALYZE of the inner query
 attached).

   * Alternatively, I mentioned off-hand the option of direct indexing.
  The idea is to use an expression index defined by each row in
 config.metabib_field and have a background process keep the indexes in
 sync with configuration as things in that table (and the related
 normalizer configuration, etc) changes.  I fear that's the path to
 madness, but it would be the most space efficient way to handle
 things.

 I don't think that's the path to madness; it appeals to me, at least.
 (Okay, it's probably insane then.)


In a broad sense it appeals to me, too.  You and I were the ones who
discussed this in the long-long-ago, IYR.  It's when I start digging
into the details of implementation and the implications for
config-change-based thrashing that I start going a little mad ...

OH!  Ranking via ts_rank[_cd].  We can't do it without the tsvector in hand.

But if we can work around that somehow (a table that stores only that
value, reducing the sort size you mention above?), it's not
impossible.
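
A rough sketch of what direct indexing plus ts_rank might look like (index name and text-search configuration are illustrative; Evergreen's real index definitions are driven by config.metabib_field):

```sql
-- One expression index per indexed field definition; the XPath-extracted
-- value is assumed to already live in metabib.keyword_field_entry.value.
CREATE INDEX mkfe_direct_idx
    ON metabib.keyword_field_entry
 USING GIN (to_tsvector('english', value));

-- ts_rank_cd still needs a tsvector in hand, so the same expression is
-- recomputed at ranking time (or read from a narrow side table, as
-- suggested above):
SELECT source,
       ts_rank_cd(to_tsvector('english', value),
                  to_tsquery('english', 'twilight')) AS rank
  FROM metabib.keyword_field_entry
 WHERE to_tsvector('english', value) @@ to_tsquery('english', 'twilight')
 ORDER BY rank DESC;
```

The index can satisfy the WHERE clause, but the ranking expression is evaluated only for matching rows, which is what makes the approach plausible despite the recomputation.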

 [Side note: if you don't need the language-base relevance bump
 (because, say, the vast majority of your collection is english),
 remove the default_preferred_language[_weight] elements from your
 opensrf.xml -- you should save a good bit from 

Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

2012-03-07 Thread Hardy, Elaine
Kathy,

While the relevance display is much improved in 2.x, it would be good to
have greater relevance given, in a keyword search, to title (specifically
the 245) and then subject fields. I also see where having a popularity
ranking might be beneficial.

I just had to explain to a board member of one of our libraries why his
search for John Sandford turned up children's titles first. So having MARC
field 100s ranked higher than 700 in author searches would be beneficial
as well.

I can't comment on any of the coding possibilities other than to say
whichever way doesn't negatively impact search return time is preferable.

Elaine
 

J. Elaine Hardy
PINES Bibliographic Projects and Metadata Manager
Georgia Public Library Service,
A Unit of the University System of Georgia
1800 Century Place, Suite 150
Atlanta, Ga. 30345-4304
404.235-7128
404.235-7201, fax

eha...@georgialibraries.org
www.georgialibraries.org
http://www.georgialibraries.org/pines/


-Original Message-
From: open-ils-general-boun...@list.georgialibraries.org
[mailto:open-ils-general-boun...@list.georgialibraries.org] On Behalf Of
Kathy Lussier
Sent: Tuesday, March 06, 2012 4:43 PM
To: 'Evergreen Discussion Group'
Subject: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

Hi all,

I mentioned this during an e-mail discussion on the list last month, but I
just wanted to hear from others in the Evergreen community about whether
there is a desire to improve the relevance ranking for search results in
Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it
can look at things like the document length, word proximity, and unique
word count. We've found that we had to remove the modifiers for document
length and unique word count to prevent a problem where brief bib records
were ranked way too high in our search results.

In our local discussions, we've thought the following enhancements could
improve the ranking of search results:

* Giving greater weight to a record if the search terms appear in the
title or subject (ideally, we would like these fields to be configurable).
This is something that is tweakable in search.relevance_ranking, but my
understanding is that the use of these tweaks results in a major reduction
in search performance.

* Using some type of popularity metric to boost relevancy for popular
titles. I'm not sure what this metric should be (number of copies attached
to record? Total circs in last x months? Total current circs?), but we
believe some type of popularity measure would be particularly helpful in a
public library where searches will often be for titles that are popular.
For example, a search for "twilight" will most likely be for the Stephenie
Meyer novel and not this
http://books.google.com/books/about/Twilight.html?id=zEhkpXCyGzIC. Mike
Rylander had indicated in a previous e-mail
(http://markmail.org/message/h6u5r3sy4nr36wsl) that we might be able to
handle this through an overnight cron job without a negative impact on
search speeds.
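
As a sketch of what that overnight cron job might do (the search.record_popularity table is invented here for illustration; the circulation-side table and column names follow the Evergreen schema but should be verified against your version):

```sql
-- Nightly job: materialize a per-record popularity score from recent circs.
CREATE TABLE IF NOT EXISTS search.record_popularity (
    record BIGINT PRIMARY KEY,
    score  NUMERIC NOT NULL DEFAULT 0
);

TRUNCATE search.record_popularity;

INSERT INTO search.record_popularity (record, score)
SELECT cn.record, COUNT(circ.id)
  FROM action.circulation circ
  JOIN asset.copy cp        ON cp.id = circ.target_copy
  JOIN asset.call_number cn ON cn.id = cp.call_number
 WHERE circ.xact_start > NOW() - INTERVAL '6 months'
 GROUP BY cn.record;
```

The search query could then join against this small table and fold the score into the rank, keeping the expensive aggregation out of the search path entirely.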

Do others think these two enhancements would improve the search results in
Evergreen? Do you think there are other things we could do to improve
relevancy? My main concern would be that any changes might slow down
search speeds, and I would want to make sure that we could do something to
retrieve better search results without a slowdown.

Also, I was wondering if this type of project might be a good candidate
for a Google Summer of Code project.

I look forward to hearing your feedback!

Kathy

-
Kathy Lussier
Project Coordinator
Massachusetts Library Network Cooperative
(508) 756-0172
(508) 755-3721 (fax)
kluss...@masslnc.org
IM: kmlussier (AOL & Yahoo)
Twitter: http://www.twitter.com/kmlussier






Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

2012-03-07 Thread Mike Rylander
On Wed, Mar 7, 2012 at 8:35 AM, Hardy, Elaine
eha...@georgialibraries.org wrote:
 Kathy,

 While the relevance display is much improved in 2.x, it would be good to
 have greater relevance given, in a keyword search, to title (specifically
 the 245)and then subject fields. I also see where having a popularity
 ranking might be beneficial.

 I just had to explain to a board member of one of our libraries why his
 search for John Sandford turned up children's titles first. So having MARC
 field 100s ranked higher than 700 in author searches would be beneficial
 as well.


To be clear, weighting hits that come from different index definitions
has always been possible.  2.2 will have a staff client interface to
make it easier, but the capability has been there all along.

Weighting different parts of one indexed term -- say, weighting the
title embedded in the keyword blob higher than the subjects embedded
in the same blob -- would require the above-mentioned "make use of
tsearch class weighting."  But one can approximate that today by
duplicating the index definitions from, say, title, author and subject
classes within the keyword class.
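
As a concrete sketch of that duplication approach (column list abbreviated and values illustrative; Dan Scott's blog post linked elsewhere in this thread describes the full keyword|title recipe):

```sql
-- Clone the title|proper definition into the keyword class with a higher
-- weight, so title matches outrank plain keyword matches.
INSERT INTO config.metabib_field
       (field_class, name, label, xpath, weight, format, search_field)
SELECT 'keyword', 'title', 'Title (keyword)', xpath, 10, format, search_field
  FROM config.metabib_field
 WHERE field_class = 'title' AND name = 'proper';
```

Records would then need to be reingested before the new field entries exist to be searched.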

-- 
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  mi...@esilibrary.com
 | web:  http://www.esilibrary.com



Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

2012-03-07 Thread James Fournie
Thanks guys for your feedback, just a few more thoughts I had...

On 2012-03-06, at 11:28 PM, Dan Scott wrote:
 Indeed they do, however rewriting them in C to be super-fast would
 improve this situation.  It's primarily a matter of available time and
 effort.  It's also, however, pretty specialized work as you're dealing
 with Postgres at a very intimate level.
 
 Hmm. For sure, C beats Perl for performance and would undoubtedly offer an
 improvement,

For me, I am not terribly excited about the prospect of another bit of C 
floating around, particularly if it's in the Postgres bits.  I personally am 
not familiar with C except I can compile it and run it and maybe vaguely get an 
idea of what it does, but I can't really debug it very well or do anything 
terribly useful with it.  My concern is adding more specialized work to the 
system makes things harder to maintain and less accessible to newcomers with 
the typical Linux+Perl+SQL or JS knowledge.   We could write open-ils.circ in C 
as well, but it's just not a good idea, so I think in general we should dismiss 
"rewrite it in C" as a solution to anything.

 Doing some digging into the SQL logs and QueryParser.pm, we observed that 
 the naco_normalize function appears to be what's slowing the use of 
 relevance_adjustment down.  While the naco_normalize function itself is 
 quite fast on its own, it slows down exponentially when run on many records:
 
 explain analyze select naco_normalize(value) from 
 metabib.keyword_field_entry limit 1;
 
 To quibble, the relationship between the number of records and the time
 to process doesn't appear to be exponential; it's linear (at least on
 our system): 10 times the records = (roughly) 10 times as long to
 retrieve, which is what I would expect:

Yes, sorry, my mistake; I had done the same measurement, so I don't know why I 
used that word. Sort of like using the word "literally" when you don't mean 
literally, I guess :)


 When using the relevance adjustments, it is run on each 
 metabib.x_entry.value that is retrieved in the initial resultset, which in 
 many cases would be thousands of records.  You can adjust the LIMIT in the 
 above query to see how it slows down as the result set gets larger.  It is 
 also run for each relevance_adjustment, however I'm assuming that the query 
 parser is treating it properly as IMMUTABLE and only running it once for 
 each adjustment.
 
 
 Have you tried giving the function a different cost estimate, per
 https://bugs.launchpad.net/evergreen/+bug/874603/comments/3 for a
 different but related problem? It's quite possible that something like:
 
 That said, some quick testing suggests that it doesn't make a difference
 to the plan, at least for the inner query that's being sent to
 search.query_parser_fts().

Yeah, I tried it, but it seemed like it did nothing. I suspect that it is 
actually being treated as IMMUTABLE.
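
For reference, the cost-estimate experiment in question amounts to something like this (the single-argument signature shown is one of the function's overloads; adjust to match your schema):

```sql
-- Tell the planner the pl/perl normalizer is expensive, in the hope that
-- it will be evaluated as late, and as rarely, as possible:
ALTER FUNCTION naco_normalize(TEXT) COST 1000;
```

As noted above, this changed nothing observable here, consistent with the function already being collapsed to a single evaluation per adjustment.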

 
  * Alternatively, I mentioned off-hand the option of direct indexing.
 The idea is to use an expression index defined by each row in
 config.metabib_field and have a background process keep the indexes in
 sync with configuration as things in that table (and the related
 normalizer configuration, etc) changes.  I fear that's the path to
 madness, but it would be the most space efficient way to handle
 things.
 
 I don't think that's the path to madness; it appeals to me, at least.
 (Okay, it's probably insane then.)

I am going to be totally frank and say that I think it honestly might be the path 
to madness.  Why do we need more moving parts like a background process?  I'm a 
little confused, how would a background process keep the indexes in sync when 
the configuration changes?  Can you flesh this out a bit more?  I don't feel I 
fully understand how this would work.

To go back to my original idea of normalizing the indexes -- snip back to 
Mike's first response -- I'm just wondering:

 We need the pre-normalized form for some things
* We could find those things, other than search, for which we use
 m.X_entry.value and move them elsewhere.  The tradeoff would be that
 any change in config.metabib_field or normalizer configuration would
 have to cause a rewrite of that column.

Realistically, how often would one change normalizers or metabib_fields?   I 
don't think it's done lightly so it seems like a simple solution with a 
reasonable tradeoff -- you'd rarely want to change your normalizers so you're 
only running naco_normalize on the table once in a blue moon vs. running it on 
much of the table every time a search happens.   In Solr and I'm sure other 
search engines you have to reindex if you change these kinds of things. 

The question is: where is the m.X_entry.value used?  It doesn't seem like 
anywhere when I skim the code but everything's so dynamically generated that 
it's hard to tell.  Any thoughts where it might be used?   I just have a hard 
time thinking of a use case for that field.  

~James Fournie
BC Libraries Cooperative




Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

2012-03-07 Thread Mike Rylander
On Wed, Mar 7, 2012 at 1:29 PM, James Fournie
jfour...@sitka.bclibraries.ca wrote:
 Thanks guys for your feedback, just a few more thoughts I had...

 On 2012-03-06, at 11:28 PM, Dan Scott wrote:
 Indeed they do, however rewriting them in C to be super-fast would
 improve this situation.  It's primarily a matter of available time and
 effort.  It's also, however, pretty specialized work as you're dealing
 with Postgres at a very intimate level.

 Hmm. For sure, C beats Perl for performance and would undoubtedly offer an
 improvement,

 For me, I am not terribly excited about the prospect of another bit of C 
 floating around, particularly if it's in the Postgres bits.  I personally am 
 not familiar with C except I can compile it and run it and maybe vaguely get 
 an idea of what it does, but I can't really debug it very well or do anything 
 terribly useful with it.  My concern is adding more specialized work to the 
 system makes things harder to maintain and less accessible to newcomers with 
 the typical Linux+Perl+SQL or JS knowledge.   We could write open-ils.circ in 
 C as well, but it's just not a good idea, so I think in general we should 
 dismiss "rewrite it in C" as a solution to anything.


I can't say I agree with that -- the rewrite of open-ils.auth in C is
an unmitigated win, and open-ils.cstore (and pcrud, and other
derivatives) replacing (most of) open-ils.storage is as well, IMO.
It's all about using the right tool for the job, and once an API is
deemed very stable, performance optimization is a valid next step.
And C is, generally speaking, a better speed-oriented tool than Perl.

That said, I agree that open-ils.circ is not the next thing in line
for the translation treatment.  :)

As for integrating with Postgres, IMO
http://www.postgresql.org/docs/9.1/interactive/extend-extensions.html
shows how to do this right in modern times, and http://pgxn.org/ is
starting to become the CPAN of postgres extensions.  Ideally, all of
our stored procs would best be rebundled as extensions.

 
 That said, some quick testing suggests that it doesn't make a difference
 to the plan, at least for the inner query that's being sent to
 search.query_parser_fts().

 Yeah, I tried it but it seemed like it did nothing.   I suspect that it is 
 actually being treated as IMMUTABLE


That's unfortunate... :(


  * Alternatively, I mentioned off-hand the option of direct indexing.
 The idea is to use an expression index defined by each row in
 config.metabib_field and have a background process keep the indexes in
 sync with configuration as things in that table (and the related
 normalizer configuration, etc) changes.  I fear that's the path to
 madness, but it would be the most space efficient way to handle
 things.

 I don't think that's the path to madness; it appeals to me, at least.
 (Okay, it's probably insane then.)

 I am going to be totally frank and say that I think it honestly might be the 
 path to madness.  Why do we need more moving parts like a background process? 
  I'm a little confused, how would a background process keep the indexes in 
 sync when the configuration changes?  Can you flesh this out a bit more?  I 
 don't feel I fully understand how this would work.


You wouldn't want to lock up the database with a DROP INDEX / CREATE
INDEX pair each time a row on that table changed, so we'd need a
process by which changes are registered and batched together.
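
One way such a batching process could sidestep the locking problem is CREATE INDEX CONCURRENTLY, building the replacement before retiring the old definition (a sketch with illustrative index names; note that CONCURRENTLY cannot run inside a transaction block):

```sql
-- Build the new expression index without blocking writers:
CREATE INDEX CONCURRENTLY mkfe_value_idx_new
    ON metabib.keyword_field_entry
 USING GIN (to_tsvector('english', value));

-- Then retire the stale index and rename the replacement into place:
DROP INDEX metabib.mkfe_value_idx;
ALTER INDEX metabib.mkfe_value_idx_new RENAME TO mkfe_value_idx;
```

The background process would queue config.metabib_field changes and run one rebuild per affected index rather than one per row change.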

 To go back to my original idea of normalizing the indexes -- snip back to 
 Mike's first response -- I'm just wondering:

 We need the pre-normalized form for some things
* We could find those things, other than search, for which we use
 m.X_entry.value and move them elsewhere.  The tradeoff would be that
 any change in config.metabib_field or normalizer configuration would
 have to cause a rewrite of that column.

 Realistically, how often would one change normalizers or metabib_fields?   I 
 don't think it's done lightly so it seems like a simple solution with a 
 reasonable tradeoff -- you'd rarely want to change your normalizers so you're 
 only running naco_normalize on the table once in a blue moon vs. running it 
 on much of the table every time a search happens.   In Solr and I'm sure 
 other search engines you have to reindex if you change these kinds of things.

 The question is: where is the m.X_entry.value used?  It doesn't seem like 
 anywhere when I skim the code but everything's so dynamically generated that 
 it's hard to tell.  Any thoughts where it might be used?   I just have a hard 
 time thinking of a use case for that field.


Hrm... now that facets live on their own table, and may even end up
being folded into the browse_entry infrastructure (just a thought
right now, needs analysis), I'm not thinking of anything off the top
of my head, except for the current incarnation of the display_field
branch.  If that ends up having its own table (reasonable, I think)
then ... it may be safe to fully-normalize (or, at 

Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

2012-03-07 Thread Kathy Lussier
Hi Mike,

To be clear, weighting hits that come from different index definitions
has always been possible.  2.2 will have a staff client interface to
make it easier, but the capability has been there all along.

Is this staff client interface already available in master? If so, can you
give me a little more information on how this is done?

Thanks!
Kathy
 
 


Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

2012-03-07 Thread Mike Rylander
On Wed, Mar 7, 2012 at 2:57 PM, Kathy Lussier kluss...@masslnc.org wrote:
 Hi Mike,

To be clear, weighting hits that come from different index definitions
has always been possible.  2.2 will have a staff client interface to
make it easier, but the capability has been there all along.

 Is this staff client interface already available in master? If so, can you
 give me a little more information on how this is done?

It is.  Go to Admin -> Server Administration -> MARC Search/Facet
Fields and see the Weight field.  The higher the number, the more
important the field.

-- 
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  mi...@esilibrary.com
 | web:  http://www.esilibrary.com



Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

2012-03-06 Thread Mike Rylander
On Tue, Mar 6, 2012 at 4:42 PM, Kathy Lussier kluss...@masslnc.org wrote:
 Hi all,

 I mentioned this during an e-mail discussion on the list last month, but I
 just wanted to hear from others in the Evergreen community about whether
 there is a desire to improve the relevance ranking for search results in
 Evergreen. Currently, we can tweak relevancy in the opensrf.xml, and it can
 look at things like the document length, word proximity, and unique word
 count. We've found that we had to remove the modifiers for document length
 and unique word count to prevent a problem where brief bib records were
 ranked way too high in our search results.

FWIW, there is a library testing some new combinations of CD modifiers
and having some success.  As soon as I know more I will share (if they
don't first).


 In our local discussions, we've thought the following enhancements could
 improve the ranking of search results:

 * Giving greater weight to a record if the search terms appear in the title
 or subject (ideally, we would like these fields to be configurable). This is
 something that is tweakable in search.relevance_ranking, but my
 understanding is that the use of these tweaks results in a major reduction
 in search performance.


Indeed they do, however rewriting them in C to be super-fast would
improve this situation.  It's primarily a matter of available time and
effort.  It's also, however, pretty specialized work as you're dealing
with Postgres at a very intimate level.

 * Using some type of popularity metric to boost relevancy for popular
 titles. I'm not sure what this metric should be (number of copies attached
 to record? Total circs in last x months? Total current circs?), but we
 believe some type of popularity measure would be particularly helpful in a
 public library where searches will often be for titles that are popular. For
 example, a search for "twilight" will most likely be for the Stephenie
 Meyer novel and not this
 http://books.google.com/books/about/Twilight.html?id=zEhkpXCyGzIC. Mike
 Rylander had indicated in a previous e-mail
 (http://markmail.org/message/h6u5r3sy4nr36wsl) that we might be able to
 handle this through an overnight cron job without a negative impact on
 search speeds.

Right ... A regular stats-gathering job could certainly allow this,
and (if the QueryParser explain branch gets merged to master so we
have a standard search canonicalization function) logged query
analysis is another option as well.


 Do others think these two enhancements would improve the search results in
 Evergreen? Do you think there are other things we could do to improve
 relevancy? My main concern would be that any changes might slow down search
 speeds, and I would want to make sure that we could do something to retrieve
 better search results without a slowdown.


I would prefer better results with a speed /increase/! :)  But who wouldn't?

I can offer at least one lower-hanging fruit idea: switch from GiST
indexes to GIN indexes by default, as GIN is much faster these days.
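As a sketch of how mechanical that change is (the index name below is an assumption; actual names vary by site, so check `\di metabib.*` in psql first):

```sql
-- Hypothetical sketch: replace a GiST full-text index with a GIN one
-- on one of the metabib entry tables.  GIN trades slower writes for
-- considerably faster tsvector @@ tsquery lookups.
BEGIN;
DROP INDEX IF EXISTS metabib.metabib_keyword_field_entry_index_vector_idx;
CREATE INDEX metabib_keyword_field_entry_index_vector_idx
    ON metabib.keyword_field_entry
    USING GIN (index_vector);
COMMIT;
```

The same pattern would apply to each of the metabib field-entry tables that currently carry a GiST index on index_vector.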

 Also, I was wondering if this type of project might be a good candidate for
 a Google Summer of Code project.


The fairly mechanical change from GiST to GIN indexing is definitely a
small-effort thing. I think the other ideas listed here (and still
others from the past, like direct MARC indexing, and use of tsearch
weighting classes) are probably worth trying -- particularly the
relevance-adjustment-functions-in-C idea -- as GSoC projects, but may
turn out to be too big.  It's worth listing them as ideas for
candidates to propose, though.

-- 
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  mi...@esilibrary.com
 | web:  http://www.esilibrary.com


Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

2012-03-06 Thread James Fournie
 
 * Giving greater weight to a record if the search terms appear in the title
 or subject (ideally, we would like these field to be configurable.) This is
 something that is tweakable in search.relevance_ranking, but my
 understanding is that the use of these tweaks results in a major reduction
 in search performance.
 
 
 Indeed they do, however rewriting them in C to be super-fast would
 improve this situation.  It's primarily a matter of available time and
 effort.  It's also, however, pretty specialized work as you're dealing
 with Postgres at a very intimate level.

Mike, could you elaborate what bits of code you're talking about here that 
could be rewritten in C?

Some of my colleagues at Sitka and I were trying to find out why broad searches 
are unusually slow, and eventually found that our adjustments in 
search.relevance_adjustment were slowing things down.  Months earlier, the CD 
patch had been added to trunk to circumvent this problem without our knowledge, 
so we tried backporting that code and testing it; however, in our initial tests, 
we weren't entirely satisfied with the CD modifiers' ability to rank items.

Doing some digging into the SQL logs and QueryParser.pm, we observed that the 
naco_normalize function appears to be what's slowing the use of 
relevance_adjustment down.  While the naco_normalize function itself is quite 
fast on its own, it slows down exponentially when run on many records:

explain analyze select naco_normalize(value) from metabib.keyword_field_entry 
limit 1;

When using the relevance adjustments, it is run on each metabib.x_entry.value 
that is retrieved in the initial resultset, which in many cases would be 
thousands of records.  You can adjust the LIMIT in the above query to see how 
it slows down as the result set gets larger.  It is also run for each 
relevance_adjustment; however, I'm assuming that the query parser is treating it 
properly as IMMUTABLE and only running it once for each adjustment.

Anyway, I'm not entirely sure how this analysis holds up in trunk, as we've 
done this testing on Postgres 8.4 and Eg 2.0, and it looks like there's new code 
in trunk in O:A:Storage:Driver:Pg:QueryParser.pm, but no changes to those bits. 
 

I've attached some sample SQL of part of a 2.0 query and the same query without 
naco_normalize run on the metabib table.  In my testing on our production 
dataset, this query -- a search for "Canada" -- went from over 80 seconds to 
less than 10 by removing the naco_normalize (it's still being run on the 
incoming term, though, which is probably unavoidable).

My thought for a solution would be that we could have naco_normalize run as an 
INSERT trigger on that field.  Obviously the whole tables would need to be 
updated, which is no small task.  I'm also not sure if that would impact other 
things, i.e., where else the metabib.x_field_entry.value field is used, but 
generally I'd think we'd almost always be using that value for a comparison of 
some kind and want that value in a normalized form.  Another option may be to 
not normalize in those comparisons, but that's slightly less attractive IMO. 
Anyway, I'd be interested to hear your thoughts on that.
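A trigger along the lines proposed above might look like the following. This is purely a sketch of the idea, not anything stock Evergreen does; the trigger function name is invented, and only one of the metabib entry tables is shown:

```sql
-- Hypothetical sketch of the proposed approach: normalize the value
-- at write time so search-time comparisons can skip naco_normalize
-- on the stored side entirely.
CREATE OR REPLACE FUNCTION metabib.normalize_entry_value()
RETURNS TRIGGER AS $$
BEGIN
    NEW.value := naco_normalize(NEW.value);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER normalize_value_tgr
    BEFORE INSERT OR UPDATE ON metabib.keyword_field_entry
    FOR EACH ROW EXECUTE PROCEDURE metabib.normalize_entry_value();
```

Existing rows would still need a one-time UPDATE pass per table, which is the "no small task" part.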


-- normal EG 2.0 query
EXPLAIN ANALYZE
 SELECT  * /* bib search */
   FROM  search.query_parser_fts(
 1::INT,
 0::INT,
 $core_query_25078$SELECT  m.source AS id,
 ARRAY_ACCUM(DISTINCT m.source) AS records,
 (AVG(
   (rank(x7b52820_keyword.index_vector, x7b52820_keyword.tsq) * 
x7b52820_keyword.weight
  *  /* word_order */ COALESCE(NULLIF( 
(naco_normalize(x7b52820_keyword.value) ~ 
(naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 2, 1)
   *  /* first_word */ COALESCE(NULLIF( 
(naco_normalize(x7b52820_keyword.value) ~ 
('^'||naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 5, 1)
   *  /* full_match */ COALESCE(NULLIF( 
(naco_normalize(x7b52820_keyword.value) ~ 
('^'||naco_normalize($_25078$canada$_25078$)||'$')), FALSE )::INT * 5, 1))
   ) * COALESCE( NULLIF( FIRST(mrd.item_lang) = $_25078$eng$_25078$ , FALSE 
)::INT * 5, 1))::NUMERIC AS rel,
  (AVG(
(rank(x7b52820_keyword.index_vector, x7b52820_keyword.tsq) * 
x7b52820_keyword.weight
   *  /* word_order */ COALESCE(NULLIF( 
(naco_normalize(x7b52820_keyword.value) ~ 
(naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 2, 1)
   *  /* first_word */ COALESCE(NULLIF( 
(naco_normalize(x7b52820_keyword.value) ~ 
('^'||naco_normalize($_25078$canada$_25078$))), FALSE )::INT * 5, 1)
   *  /* full_match */ COALESCE(NULLIF( 
(naco_normalize(x7b52820_keyword.value) ~ 
('^'||naco_normalize($_25078$canada$_25078$)||'$')), FALSE )::INT * 5, 1))
   ) * COALESCE( NULLIF( FIRST(mrd.item_lang) = $_25078$eng$_25078$ , FALSE 
)::INT * 5, 1))::NUMERIC AS rank, 
  FIRST(mrd.date1) AS tie_break
FROM  metabib.metarecord_source_map m
  JOIN metabib.rec_descriptor mrd ON (m.source = 

Re: [OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

2012-03-06 Thread Mike Rylander
On Tue, Mar 6, 2012 at 6:13 PM, James Fournie
jfour...@sitka.bclibraries.ca wrote:

 * Giving greater weight to a record if the search terms appear in the title
 or subject (ideally, we would like these field to be configurable.) This is
 something that is tweakable in search.relevance_ranking, but my
 understanding is that the use of these tweaks results in a major reduction
 in search performance.


 Indeed they do, however rewriting them in C to be super-fast would
 improve this situation.  It's primarily a matter of available time and
 effort.  It's also, however, pretty specialized work as you're dealing
 with Postgres at a very intimate level.

 Mike, could you elaborate what bits of code you're talking about here that 
 could be rewritten in C?


I mean specifically the elaborate COALESCE/NULLIF/regexp (aka ~) parts
of the SELECT clauses that implement the first-word, word-order and
full-phrase relevance bumps that come from
search.relevance_adjustment.  There's also the option of attempting to
rewrite naco_normalize and search_normalize (see below) in C.  Lots of
string mangling to which Perl is particularly suited, but it's not
impossible by any means, and there are Postgres components (the
'unaccent' contrib/extension, for instance) that we could probably
build on.
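To get a feel for what's already available to build on, the unaccent contrib module does part of what naco_normalize does, in C:

```sql
-- unaccent ships as a Postgres contrib module (installable with
-- CREATE EXTENSION as of 9.1) and strips diacritics in C, covering
-- one piece of naco_normalize's string mangling.
CREATE EXTENSION IF NOT EXISTS unaccent;
SELECT lower(unaccent('Montréal, Québec'));  -- montreal, quebec
```

The remaining NACO rules (punctuation folding, subfield handling, and so on) would still need to be implemented on top of it.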

 Some of my colleagues at Sitka and I were trying to find out why broad 
 searches are unusually slow and eventually found that our adjustments in 
 search.relevance_adjustment were slowing things down.   Months earlier the CD 
 patch was added to trunk to circumvent this problem without our knowledge, so 
 we tried backporting that code and testing it however, in our initial tests, 
 we weren't entirely satisfied with the CD modifiers' ability to rank items.


Right.  These are more subtle than the heavy-handed
search.relevance_adjustment settings, and therefore have a less
drastic effect.  But they also reduce the need for some of the
search.relevance_adjustment entries, so in combination we should be
able to find a good balance, especially if some of the rel_adjustment
effects can be rewritten in C.

 Doing some digging into the SQL logs and QueryParser.pm, we observed that the 
 naco_normalize function appears to be what's slowing the use of 
 relevance_adjustment down.  While the naco_normalize function itself is quite 
 fast on its own, it slows down exponentially when run on many records:

 explain analyze select naco_normalize(value) from metabib.keyword_field_entry 
 limit 1;

 When using the relevance adjustments, it is run on each metabib.x_entry.value 
 that is retrieved in the initial resultset, which in many cases would be 
 thousands of records.  You can adjust the LIMIT in the above query to see how 
 it slows down as the result set gets larger.  It is also run for each 
 relevance_adjustment, however I'm assuming that the query parser is treating 
 it properly as IMMUTABLE and only running it once for each adjustment.


Indeed, and naco_normalize is not necessarily the only normalizer that
will be applied to each and every field!  If you search a class or
field that uses other (pos >= 0) normalizers, all of those will also
be applied to both the column value and the user input.

There's some good news on this front, though.  Galen recently
implemented a trimmed down version of naco_normalize, called
search_normalize, that should be a bit faster.  That should lower the
total cost by a noticeable amount over many thousands of rows.

 Anyway, not entirely sure about how this analysis holds up in trunk as we've 
 done this testing on Postgres 8.4 and Eg 2.0 and it looks like there's new 
 code in trunk in O:A:Storage:Driver:Pg:QueryParser.pm, but no changes to 
 those bits.

 I've attached some sample SQL of part of a 2.0 query and the same query 
 without naco_normalize run on the metabib table.  In my testing on our 
 production dataset, this query -- a search for Canada -- went from over 80 
 seconds to less than 10 by removing the naco_normalize (it's still being run 
 on the incoming term though which is probably unavoidable)


It is unavoidable, but they should only be run once on user input and
the result cached.  EXPLAIN will tell the tale, and if it's not then
the normalizer functions aren't properly marked STABLE.
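Checking and fixing the volatility marking is a one-liner. This is generic Postgres, not Evergreen-specific, and the single-argument signature shown for naco_normalize is an assumption to verify against your schema:

```sql
-- Inspect the function's volatility marking:
-- 'i' = IMMUTABLE, 's' = STABLE, 'v' = VOLATILE.
SELECT proname, provolatile
  FROM pg_proc
 WHERE proname = 'naco_normalize';

-- If it shows 'v', re-mark it so the planner is allowed to avoid
-- re-evaluating it needlessly for the same input within a query:
ALTER FUNCTION naco_normalize(TEXT) STABLE;
```

EXPLAIN ANALYZE before and after should show whether the normalizer calls on the user's input are actually being collapsed.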

Hrm... and looking at your example, I spotted a chance for at least
one optimization.  If we recognize that there is only one term in a
search (as in your "canada" example) we can skip the word-order
rel_adjustment even if we're told to apply it, saving ~1/3 of the cost of
that particular chunk.

 My thought for a solution would be that we could have naco_normalize run as 
 an INSERT trigger on that field.  Obviously the whole tables would need to be 
 updated which is no small task.  I'm also not sure if that would impact other 
 things, ie: where else the metabib.x_field_entry.value field is used, but but 
 generally I'd think we'd almost always be using that value for a comparison 
 of some kind and