Re: [CODE4LIB] solr - search query count | highlighting

2009-10-16 Thread Roy Tennant
Maybe you should look into using what CDL uses to get that functionality,
which is also based on Lucene:

http://www.cdlib.org/inside/projects/xtf/

Roy


On 10/16/09 12:12 PM, "Eric James" wrote:

> For our finding aids, we are using fedoragenericsearch 2.2 with solr as index.
> Because the EADs can be huge, the EADs are indexed but not stored (with stored
> EADs, search time for ~500 objects = 20 min rather than < 1 sec).
> 
>  
> 
> However, we would like to have number of search terms found within each hit.
> For example, CDL's collection:
> 
> http://www.oac.cdlib.org/search?query=Donner
> 
>  
> 
> Also we would like highlighting/snippets of the search term similar to CDL's.
> 
>  
> 
> Is it a lost cause to have this functionality without storing the EAD?  Is
> there a way to store the EAD and have a reasonable response time?
> 
>  
> 
> ---
> 
> Eric James
> 
> Yale University Libraries
> 
>  
> 
>  
>  


Re: [CODE4LIB] solr - search query count | highlighting

2009-10-16 Thread Mark A. Matienzo
On Fri, Oct 16, 2009 at 3:12 PM, Eric James  wrote:
> For our finding aids, we are using fedoragenericsearch 2.2 with solr as 
> index.  Because the EADs can be huge, the EADs are indexed but not stored 
> (with stored EADs, search time for ~500 objects = 20 min rather than < 1 sec).

Eric, what do your actual schema and Solr configuration look like?
One possibility would be to store and index the actual contents of the
EAD in a separate field and not return that field by default in query
responses. For what it's worth, this is what we're doing at NYPL for
our EAD files that are being indexed as part of the new Drupal-based
site we're building.
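
A minimal sketch of that idea from the client side, assuming the full EAD text lives in a stored field called "ead_text" and that "id" and "title" are small stored fields (all three field names, and the localhost URL, are hypothetical). The "fl" parameter limits which stored fields come back with each hit, while the "hl.*" parameters ask Solr to build snippets from the large field without returning it whole:

```python
from urllib.parse import urlencode

def build_search_url(base_url, user_query, stored_text_field="ead_text"):
    """Build a Solr select URL that highlights a large stored field
    but does not return that field in the result documents."""
    params = {
        "q": user_query,
        "fl": "id,title,score",        # small fields only; the big field is left out
        "hl": "true",                  # turn highlighting on
        "hl.fl": stored_text_field,    # field to build snippets from (hypothetical name)
        "hl.snippets": 3,              # up to three snippets per hit
        "hl.fragsize": 150,            # approximate snippet length in characters
    }
    return base_url + "?" + urlencode(params)

url = build_search_url("http://localhost:8983/solr/select", "Donner")
print(url)
```

Whether this is fast enough for very large EADs would still need to be measured, since the highlighter has to read the stored text for each hit on the page.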


Mark A. Matienzo
Applications Developer, Digital Experience Group
The New York Public Library





Re: [CODE4LIB] solr - search query count | highlighting

2009-10-16 Thread Király Péter

Hi Eric,

If you use the &debugQuery=on parameter, you'll receive the "explain" structure,
which tells you about the factors used to calculate the score. An example:


1.5076942 = (MATCH) fieldWeight(text:chant in 0), product of:
 1.4142135 = tf(termFreq(text:chant)=2)
 6.8230457 = idf(docFreq=1, numDocs=676)
 0.15625 = fieldNorm(field=text, doc=0)


Here tf(termFreq(text:chant)=2) tells you that the queried term was found two
times in the document. You could apply a regex to extract this info from the
explain string. Since this is an analyzed term, it may not equal the user's
input, but the debug output's 'parsedquery' parameter tells you the terms Solr
searches for behind the scenes.
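
A rough sketch of such a regex in Python, run against the explain text shown above. The function name is made up for illustration, and the pattern only assumes the "termFreq(field:term)=N" shape seen in this example; other Solr versions or query types may format the explain output differently:

```python
import re

# Matches fragments such as "termFreq(text:chant)=2" in the explain output.
TERM_FREQ = re.compile(
    r"termFreq\((?P<field>[^:()]+):(?P<term>[^)]+)\)=(?P<count>\d+(?:\.\d+)?)"
)

def term_frequencies(explain):
    """Return {(field, term): count} for every termFreq entry in an explain string."""
    counts = {}
    for match in TERM_FREQ.finditer(explain):
        key = (match.group("field"), match.group("term"))
        counts[key] = int(float(match.group("count")))
    return counts

sample = """1.5076942 = (MATCH) fieldWeight(text:chant in 0), product of:
 1.4142135 = tf(termFreq(text:chant)=2)
 6.8230457 = idf(docFreq=1, numDocs=676)
 0.15625 = fieldNorm(field=text, doc=0)"""

print(term_frequencies(sample))  # {('text', 'chant'): 2}
```

The keys are the analyzed terms, so they would have to be mapped back to what the user typed before being displayed.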

In Lucene, if the field stores the term vector's positions, there are API
calls that give you the exact place of the term within the field (as character
positions, or as the n-th token), but I don't know how to extract this info
through Solr.


Hope this helps.

Király Péter
eXtensible Catalog
http://xcproject.org

----- Original Message -----
From: "Eric James"
To:
Sent: Friday, October 16, 2009 9:52 PM
Subject: Re: [CODE4LIB] solr - search query count | highlighting


Thanks for your response.  But, yes I'm able to use facets in general, and 
yes I'm able to do highlighting on stored fields.




But finding how many times the query appears in the full text is my 
question. For example say you search on "Heisenberg"   We'd like to see:




Hit 1: Your search for Heisenberg appears 10 times within the Finding Aid

Hit 2: Your search for Heisenberg appears 3 times within the Finding Aid

Hit 3: Your search for Heisenberg appears 88 times within the Finding Aid

etc



Could there be a solr parameter that calculates this? Otherwise a klugey, 
not very scalable method could be that once you retrieve a solr result xml, 
find the fedora pid, retrieve the EAD full text, run a standard function to 
count how many times the query appears in the text for each hit, and add 
parameters back into the xml with these counts.






Date: Fri, 16 Oct 2009 15:27:42 -0400
From: ewg4x...@gmail.com
Subject: Re: [CODE4LIB] solr - search query count | highlighting
To: CODE4LIB@LISTSERV.ND.EDU

Hi Eric,

You do not have to store the entire text content of the EAD guide in order
to enable facets. Here's an example:
http://kittredgecollection.org/results?q=*:* . There are about 15 facets
enabled on a collection of almost 1500 EAD documents (though quite small in
filesize compared to traditional EAD finding aids), and there's no slowdown
whatsoever. I don't believe you need to store the guides to enable
highlighting either, though I have heard there is some dropoff in
performance with highlighting enabled. I've never done benchmarking on
highlighting enabled versus disabled, so I can't tell you how much of a
dropoff there is. In an index of only several hundred documents, I would
think that the dropoff with highlighting enabled would be fairly negligible.


Ethan

On Fri, Oct 16, 2009 at 3:12 PM, Eric James  wrote:

> For our finding aids, we are using fedoragenericsearch 2.2 with solr as
> index. Because the EADs can be huge, the EADs are indexed but not stored
> (with stored EADs, search time for ~500 objects = 20 min rather than < 1
> sec).
>
>
>
> However, we would like to have number of search terms found within each
> hit. For example, CDL's collection:
>
> http://www.oac.cdlib.org/search?query=Donner
>
>
>
> Also we would like highlighting/snippets of the search term similar to
> CDL's.
>
>
>
> Is it a lost cause to have this functionality without storing the EAD? Is
> there a way to store the EAD and have a reasonable response time?
>
>
>
> ---
>
> Eric James
>
> Yale University Libraries
>
>
>
>
>




Re: [CODE4LIB] solr - search query count | highlighting

2009-10-16 Thread Rob Casson
i think some of the new TermVectorComponent stuff might be
applicable...i've not experimented with it yet tho, so YMMV.

 http://wiki.apache.org/solr/TermVectorComponent

it's only part of 1.4, which is due for a release any day now, once
they patch up a Lucene bug
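
An untested sketch of what such a request might look like, going by the parameter names on that wiki page (tv, tv.tf, tv.fl). The handler name "tvrh" is the one I believe the 1.4 example solrconfig.xml registers, and the field name "text" and the localhost URL are only assumptions. The field would also need termVectors="true" in the schema:

```python
from urllib.parse import urlencode

def term_vector_url(base_url, user_query, field="text"):
    """Build a request asking the TermVectorComponent for per-document term frequencies."""
    params = {
        "q": user_query,
        "qt": "tvrh",      # assumed name of a handler with the TermVectorComponent enabled
        "fl": "id",        # keep the normal result documents small
        "tv": "true",      # switch the component on
        "tv.tf": "true",   # include each term's frequency within the document
        "tv.fl": field,    # field whose term vectors are wanted (assumed name)
    }
    return base_url + "?" + urlencode(params)

tv_url = term_vector_url("http://localhost:8983/solr/select", "Heisenberg")
print(tv_url)
```

The response lists the terms of each matching document, so the client would still have to pick out the entries for the analyzed query terms.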


On Fri, Oct 16, 2009 at 3:52 PM, Eric James  wrote:
> Thanks for your response.  But, yes I'm able to use facets in general, and 
> yes I'm able to do highlighting on stored fields.
>
>
>
> But finding how many times the query appears in the full text is my question. 
> For example say you search on "Heisenberg"   We'd like to see:
>
>
>
> Hit 1: Your search for Heisenberg appears 10 times within the Finding Aid
>
> Hit 2: Your search for Heisenberg appears 3 times within the Finding Aid
>
> Hit 3: Your search for Heisenberg appears 88 times within the Finding Aid
>
> etc
>
>
>
> Could there be a solr parameter that calculates this? Otherwise a klugey, not 
> very scalable method could be that once you retrieve a solr result xml, find 
> the fedora pid, retrieve the EAD full text, run a standard function to count 
> how many times the query appears in the text for each hit, and add parameters 
> back into the xml with these counts.


Re: [CODE4LIB] solr - search query count | highlighting

2009-10-16 Thread Eric James
Thanks for your response.  But, yes I'm able to use facets in general, and yes 
I'm able to do highlighting on stored fields.

 

But finding how many times the query appears in the full text is my question. 
For example say you search on "Heisenberg"   We'd like to see:

 

Hit 1: Your search for Heisenberg appears 10 times within the Finding Aid

Hit 2: Your search for Heisenberg appears 3 times within the Finding Aid

Hit 3: Your search for Heisenberg appears 88 times within the Finding Aid

etc

 

Could there be a solr parameter that calculates this? Otherwise a klugey, not 
very scalable method could be that once you retrieve a solr result xml, find 
the fedora pid, retrieve the EAD full text, run a standard function to count 
how many times the query appears in the text for each hit, and add parameters 
back into the xml with these counts. 
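
To make the klugey approach concrete, here is a minimal sketch of the counting step only. How the full text is fetched from Fedora is left as a caller-supplied function, and the names "pid", "query_count" and "fetch_text" are hypothetical. It does plain case-insensitive whole-word matching, so unlike Solr it knows nothing about stemming or other analysis:

```python
import re

def count_occurrences(full_text, query):
    """Count case-insensitive whole-word matches of query in full_text."""
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    return len(pattern.findall(full_text))

def annotate_hits(hits, fetch_text, query):
    """Return copies of the hits with a 'query_count' entry added.

    hits:       list of dicts, each with a 'pid' key (taken from the Solr result)
    fetch_text: callable mapping a pid to the EAD full text (e.g. a Fedora lookup)
    """
    return [
        dict(hit, query_count=count_occurrences(fetch_text(hit["pid"]), query))
        for hit in hits
    ]

# Stand-in for the repository: an in-memory dict instead of a real Fedora call.
texts = {"demo:1": "Heisenberg wrote to Bohr. Later heisenberg replied to Heisenbergs."}
annotated = annotate_hits([{"pid": "demo:1"}], texts.get, "Heisenberg")
print(annotated)  # [{'pid': 'demo:1', 'query_count': 2}]
```

Each page of results costs one full-text fetch per hit, which is why this would not scale well beyond small result pages.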

 

 
> Date: Fri, 16 Oct 2009 15:27:42 -0400
> From: ewg4x...@gmail.com
> Subject: Re: [CODE4LIB] solr - search query count | highlighting
> To: CODE4LIB@LISTSERV.ND.EDU
> 
> Hi Eric,
> 
> You do not have to store the entire text content of the EAD guide in order
> to enable facets. Here's an example:
> http://kittredgecollection.org/results?q=*:* . There are about 15 facets
> enabled on a collection of almost 1500 EAD documents (though quite small in
> filesize compared to traditional EAD finding aids), and there's no slowdown
> whatsoever. I don't believe you need to store the guides to enable
> highlighting either, though I have heard there is some dropoff in
> performance with highlighting enabled. I've never done benchmarking on
> highlighting enabled versus disabled, so I can't tell you how much of a
> dropoff there is. In an index of only several hundred documents, I would
> think that the dropoff with highlighting enabled would be fairly negligible.
> 
> Ethan
> 
> On Fri, Oct 16, 2009 at 3:12 PM, Eric James  wrote:
> 
> > For our finding aids, we are using fedoragenericsearch 2.2 with solr as
> > index. Because the EADs can be huge, the EADs are indexed but not stored
> > (with stored EADs, search time for ~500 objects = 20 min rather than < 1
> > sec).
> >
> >
> >
> > However, we would like to have number of search terms found within each
> > hit. For example, CDL's collection:
> >
> > http://www.oac.cdlib.org/search?query=Donner
> >
> >
> >
> > Also we would like highlighting/snippets of the search term similar to
> > CDL's.
> >
> >
> >
> > Is it a lost cause to have this functionality without storing the EAD? Is
> > there a way to store the EAD and have a reasonable response time?
> >
> >
> >
> > ---
> >
> > Eric James
> >
> > Yale University Libraries
> >
> >
> >
> >
> >
  

Re: [CODE4LIB] solr - search query count | highlighting

2009-10-16 Thread Eric Lease Morgan
On Oct 16, 2009, at 3:12 PM, Eric James wrote:

> For our finding aids, we are using fedoragenericsearch 2.2 with solr  
> as index.  Because the EADs can be huge, the EADs are indexed but  
> not stored (with stored EADs, search time for ~500 objects = 20 min  
> rather than < 1 sec).
>
> However, we would like to have number of search terms found within  
> each hit.  For example, CDL's collection:
>
> http://www.oac.cdlib.org/search?query=Donner
>
> Also we would like highlighting/snippets of the search term similar  
> to CDL's.
>
> Is it a lost cause to have this functionality without storing the  
> EAD?  Is there a way to store the EAD and have a reasonable response  
> time?


Hmmm... I'm not an expert, only a novice Solr hacker, but I've had
pretty good success full-text indexing entire books, denoting them as
stored, and searching the index, with results that come complete with
highlighted snippets. Here's my field definition:

   

While search response times are about 2 seconds or so, it certainly
does not take 20 minutes to return. There are about 16,000 indexed books. Try:

   http://infomotions.com/alex/

Yes, things like snippets are a lost cause without storing the indexed
data, unless maybe you can link to the content. The latter was alluded
to in the (one and only) Solr book, but I didn't even consider it.
It seemed too expensive.

Query counts? I don't know about those.

-- 
Eric Lease Morgan


Re: [CODE4LIB] solr - search query count | highlighting

2009-10-16 Thread Ethan Gruber
Hi Eric,

You do not have to store the entire text content of the EAD guide in order
to enable facets.  Here's an example:
http://kittredgecollection.org/results?q=*:* .  There are about 15 facets
enabled on a collection of almost 1500 EAD documents (though quite small in
filesize compared to traditional EAD finding aids), and there's no slowdown
whatsoever.  I don't believe you need to store the guides to enable
highlighting either, though I have heard there is some dropoff in
performance with highlighting enabled.  I've never done benchmarking on
highlighting enabled versus disabled, so I can't tell you how much of a
dropoff there is.  In an index of only several hundred documents, I would
think that the dropoff with highlighting enabled would be fairly negligible.

Ethan

On Fri, Oct 16, 2009 at 3:12 PM, Eric James  wrote:

> For our finding aids, we are using fedoragenericsearch 2.2 with solr as
> index.  Because the EADs can be huge, the EADs are indexed but not stored
> (with stored EADs, search time for ~500 objects = 20 min rather than < 1
> sec).
>
>
>
> However, we would like to have number of search terms found within each
> hit.  For example, CDL's collection:
>
> http://www.oac.cdlib.org/search?query=Donner
>
>
>
> Also we would like highlighting/snippets of the search term similar to
> CDL's.
>
>
>
> Is it a lost cause to have this functionality without storing the EAD?  Is
> there a way to store the EAD and have a reasonable response time?
>
>
>
> ---
>
> Eric James
>
> Yale University Libraries
>
>
>
>
>


[CODE4LIB] solr - search query count | highlighting

2009-10-16 Thread Eric James
For our finding aids, we are using fedoragenericsearch 2.2 with solr as index.  
Because the EADs can be huge, the EADs are indexed but not stored (with stored 
EADs, search time for ~500 objects = 20 min rather than < 1 sec).

 

However, we would like to have number of search terms found within each hit.  
For example, CDL's collection:

http://www.oac.cdlib.org/search?query=Donner

 

Also we would like highlighting/snippets of the search term similar to CDL's.

 

Is it a lost cause to have this functionality without storing the EAD?  Is 
there a way to store the EAD and have a reasonable response time?

 

---

Eric James

Yale University Libraries