[CODE4LIB] solr - search query count | highlighting
For our finding aids, we are using fedoragenericsearch 2.2 with Solr as the index. Because the EADs can be huge, they are indexed but not stored (with stored EADs, search time for ~500 objects is 20 min rather than 1 sec).

However, we would like to show the number of times the search terms were found within each hit. For example, see CDL's collection: http://www.oac.cdlib.org/search?query=Donner

We would also like highlighting/snippets of the search term similar to CDL's.

Is it a lost cause to have this functionality without storing the EAD? Is there a way to store the EAD and still have a reasonable response time?

---
Eric James
Yale University Libraries
Re: [CODE4LIB] solr - search query count | highlighting
Hi Eric,

You do not have to store the entire text content of the EAD guide in order to enable facets. Here's an example: http://kittredgecollection.org/results?q=*:* . There are about 15 facets enabled on a collection of almost 1500 EAD documents (though quite small in filesize compared to traditional EAD finding aids), and there's no slowdown whatsoever.

I don't believe you need to store the guides to enable highlighting either, though I have heard there is some dropoff in performance with highlighting enabled. I've never done benchmarking on highlighting enabled versus disabled, so I can't tell you how much of a dropoff there is. In an index of only several hundred documents, I would think that the dropoff with highlighting enabled would be fairly negligible.

Ethan
Re: [CODE4LIB] solr - search query count | highlighting
Thanks for your response. Yes, I'm able to use facets in general, and yes, I'm able to do highlighting on stored fields. My question is how to find how many times the query appears in the full text.

For example, say you search on Heisenberg. We'd like to see:

Hit 1: Your search for Heisenberg appears 10 times within the Finding Aid
Hit 2: Your search for Heisenberg appears 3 times within the Finding Aid
Hit 3: Your search for Heisenberg appears 88 times within the Finding Aid
etc.

Could there be a Solr parameter that calculates this?

Otherwise, a kludgy, not very scalable method would be: once you retrieve a Solr result XML, find the Fedora PID, retrieve the EAD full text, run a standard function to count how many times the query appears in the text for each hit, and add parameters with these counts back into the XML.
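The count-after-retrieval fallback described in this message could look roughly like the sketch below. It is only an illustration under stated assumptions: `fetch_full_text` is a hypothetical caller-supplied function that maps a Fedora PID to the EAD text, and the match is a literal, case-insensitive whole-word match, which will not necessarily agree with what Solr's analyzer matched (stemming, for example).

```python
import re


def count_occurrences(full_text: str, query: str) -> int:
    """Count case-insensitive whole-word matches of `query` in `full_text`."""
    query = query.strip()
    if not query:
        return 0
    # re.escape keeps punctuation in the query from being read as regex syntax.
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    return len(pattern.findall(full_text))


def annotate_hits(pids, query, fetch_full_text):
    """Return a list of (pid, count) pairs, one per hit.

    `fetch_full_text` is a hypothetical callable: PID -> EAD full text.
    """
    return [(pid, count_occurrences(fetch_full_text(pid), query)) for pid in pids]
```

Every hit costs one full-text retrieval, so this scales with the size of the result page rather than the index, which is why it is best limited to the hits actually displayed.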
Re: [CODE4LIB] solr - search query count | highlighting
I think some of the new TermVectorComponent stuff might be applicable. I've not experimented with it yet, though, so YMMV.

http://wiki.apache.org/solr/TermVectorComponent

It's only part of 1.4, which is due for a release any day now, once they patch up a Lucene bug.
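For anyone who wants to try the TermVectorComponent route, a request might be built along these lines. This is a sketch, not a tested recipe: it assumes a request handler named `tvrh` has been configured with the component, that the field (here the hypothetical `text`) was indexed with termVectors="true", and that the unique key is called `id`.

```python
from urllib.parse import urlencode


def term_vector_url(base_url: str, query: str, field: str, rows: int = 10) -> str:
    """Build a Solr URL asking for per-document term frequencies."""
    params = {
        "q": query,
        "rows": rows,
        "fl": "id",        # assumed unique-key field; keeps the response small
        "qt": "tvrh",      # assumed name of a handler with the component enabled
        "tv": "true",      # turn the term vector component on
        "tv.tf": "true",   # include term frequency for each term
        "tv.fl": field,    # restrict term vectors to this field
    }
    return base_url + "?" + urlencode(params)
```

The response lists frequencies for every term in the field of each returned document, so the client still has to pick out the analyzed form of the query term.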
Re: [CODE4LIB] solr - search query count | highlighting
Hi Eric,

If you use the debugQuery=on parameter, you'll receive the explain structure, which tells you the factors that go into the score calculation. An example:

<str name="oai:URMST:Transformation_Service/1">
1.5076942 = (MATCH) fieldWeight(text:chant in 0), product of:
  1.4142135 = tf(termFreq(text:chant)=2)
  6.8230457 = idf(docFreq=1, numDocs=676)
  0.15625 = fieldNorm(field=text, doc=0)
</str>

Here tf(termFreq(text:chant)=2) tells you that the queried term was found two times in the document. You could apply a regex to extract this information from the explain string. Because this is an analyzed term, it may not be identical to the user's input, but the 'parsedquery' element of the debug output tells you which terms Solr actually searched for behind the scenes.

In Lucene, if the field stores the term vector's positions, there are API calls that give you the exact place of the term within the field (as character positions, or as the n-th token), but I don't know how to extract this information through Solr.

Hope this helps.

Király Péter
eXtensible Catalog
http://xcproject.org
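The regex extraction suggested in this message could be sketched as follows. The pattern is written against the explain format shown in the example (`termFreq(field:term)=N`); other Solr versions may print term frequency differently, so treat the pattern as an assumption to verify against your own debug output.

```python
import re

# Matches fragments such as "termFreq(text:chant)=2" in an explain string.
TERM_FREQ = re.compile(
    r"termFreq\((?P<field>[^:()]+):(?P<term>[^()]+)\)=(?P<freq>\d+(?:\.\d+)?)"
)


def term_frequencies(explain: str) -> dict:
    """Map (field, analyzed term) to its frequency in one document's explain text."""
    counts = {}
    for match in TERM_FREQ.finditer(explain):
        key = (match.group("field"), match.group("term"))
        # A term repeated in several clauses reports the same frequency each time,
        # so overwriting (rather than summing) avoids double counting.
        counts[key] = int(float(match.group("freq")))
    return counts
```

The keys are analyzed terms, so a stemmed or lowercased form may appear instead of what the user typed.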
Re: [CODE4LIB] solr - search query count | highlighting
On Fri, Oct 16, 2009 at 3:12 PM, Eric James wrote:
> For our finding aids, we are using fedoragenericsearch 2.2 with solr as
> index. Because the EADs can be huge, the EADs are indexed but not stored
> (with stored EADs, search time for ~500 objects = 20 min rather than 1 sec).

Eric, what do your actual schema and Solr configuration look like? One possibility would be to store and index the actual contents of the EAD in a separate field and not return that field by default in query responses.

For what it's worth, this is what we're doing at NYPL for our EAD files that are being indexed as part of the new Drupal-based site we're building.

Mark A. Matienzo
Applications Developer, Digital Experience Group
The New York Public Library
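On the query side, keeping a large stored field out of the response while still highlighting on it might look like the sketch below. The field names `id`, `title`, and `ead_text` are hypothetical, and the snippet count and fragment size are arbitrary starting values; `fl`, `hl`, `hl.fl`, `hl.snippets`, and `hl.fragsize` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def search_params(query, return_fields=("id", "title"), highlight_field="ead_text"):
    """Query parameters that omit the large field from results but highlight on it."""
    return {
        "q": query,
        "fl": ",".join(return_fields),  # the big stored field is deliberately left out
        "hl": "true",
        "hl.fl": highlight_field,       # snippets are generated from the stored text
        "hl.snippets": 3,
        "hl.fragsize": 200,
    }


def search_url(base_url, query):
    """Full request URL for the parameters above."""
    return base_url + "?" + urlencode(search_params(query))
```

Whether this brings response time back down depends on how much of the earlier slowdown came from returning the stored text with every hit, which would need measuring.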
Re: [CODE4LIB] solr - search query count | highlighting
Maybe you should look into using what CDL uses to get that functionality, which is also based on Lucene:

http://www.cdlib.org/inside/projects/xtf/

Roy