Re: Deciding whether to stem at query time

2012-04-25 Thread Otis Gospodnetic
Hi Andrew,

* store == keep the original value of the text/string being indexed
* index == run analysis on the original text/string, which typically means
tokenizing the text and optionally filtering or modifying the tokens
(e.g. removing or stemming them)

* indexed fields can be searched
* fields that are only stored cannot

* stored fields can be used for display in the UI and for highlighting
* fields that are only indexed and not stored cannot

HTH,
Otis

Solr Performance Monitoring - 
http://sematext.com/spm/solr-performance-monitoring





 From: Andrew Wagner wagner.and...@gmail.com
To: solr-user@lucene.apache.org 
Sent: Tuesday, April 24, 2012 10:40 AM
Subject: Re: Deciding whether to stem at query time
 
I'm sorry, I'm missing something. What's the difference between storing
and indexing a field?

On Tue, Apr 24, 2012 at 10:28 AM, Paul Libbrecht p...@hoplahup.net wrote:


 On 24 Apr 2012, at 17:16, Otis Gospodnetic wrote:
  This would not necessarily increase the size of your index that much -
 you don't need to store both fields, just 1 of them if you really need it for
 highlighting or displaying.  If not, just index.

 I second this.
 The query expansion process is far from being a slow thing... you can
 easily expand to tens of fields with a fairly small penalty.

 Where you have a penalty is at stored fields... these need to be really
 carefully avoided as much as possible.
 As long as you keep them small, the legendary performance of SOLR will
 still hold.

 paul




Re: Deciding whether to stem at query time

2012-04-24 Thread Andrew Wagner
Ah, this is a really good point. Still seems like it has the downsides of
#2, though, much bigger space requirements and possibly some time lost on
queries.

On Mon, Apr 23, 2012 at 3:35 PM, Walter Underwood wun...@wunderwood.org wrote:

 There is a third approach. Create two fields and always query both of
 them, with the exact field given a higher weight. This works great and
 performs well.

 It is what we did at Netflix and what I'm doing at Chegg.

 wunder

 On Apr 23, 2012, at 12:21 PM, Andrew Wagner wrote:

  So I just realized the other day that stemming basically happens at index
  time. If I'm understanding correctly, there's no way to allow a user to
  specify, at run time, whether to stem particular words or not based on a
  single index. I think there are two options, but I'd love to hear that
  I'm wrong:

  1.) Incrementally build up a white list of words that don't stem very
  well. To pick a random example out of the blue, "light" isn't super closely
  related to "lighter", so I might choose not to stem that. If I wanted to
  do this, I think (if I understand correctly), stemmerOverrideFilter would
  help me out with this. I'm not a big fan of this approach.

  2.) Index all the text in two fields, once with stemming and once
  without. Then build some kind of option into the UI for specifying whether
  to stem the words or not, and search the appropriate field. Unfortunately,
  this would roughly double the size of my index, and probably affect query
  times too. Plus, the UI would probably suck.

  Am I missing an option? Has anyone tried one of these approaches?

  Thanks!
  Andrew
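
Option 1 in the question above (a list of words exempt from stemming) can be sketched roughly as follows. This is a toy illustration, not how Solr's stemmerOverrideFilter is implemented: `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer, and `PROTECTED` is an assumed, hand-built word list.

```python
# Toy illustration only: crude_stem is a hypothetical suffix stripper,
# not the Porter stemmer and not any Solr filter.
PROTECTED = {"lighter"}  # assumed list of words that should not be stemmed


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in ("ing", "er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text, protected=PROTECTED):
    """Lowercase, split on whitespace, and stem every token not in `protected`."""
    tokens = text.lower().split()
    return [t if t in protected else crude_stem(t) for t in tokens]


print(analyze("Lighter lights"))                   # "lighter" is left alone
print(analyze("Lighter lights", protected=set()))  # both tokens are stemmed
```

With the protected list in place, "lighter" and "light" stay distinct terms; without it, both collapse to the same stem. The maintenance cost Andrew mentions comes from having to grow that list by hand.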








Re: Deciding whether to stem at query time

2012-04-24 Thread Otis Gospodnetic
Hi Andrew,

This would not necessarily increase the size of your index that much - you 
don't need to store both fields, just 1 of them if you really need it for 
highlighting or displaying.  If not, just index.

Otis 

 From: Andrew Wagner wagner.and...@gmail.com
To: solr-user@lucene.apache.org 
Sent: Tuesday, April 24, 2012 7:21 AM
Subject: Re: Deciding whether to stem at query time
 
Ah, this is a really good point. Still seems like it has the downsides of
#2, though, much bigger space requirements and possibly some time lost on
queries.

Re: Deciding whether to stem at query time

2012-04-24 Thread Paul Libbrecht

On 24 Apr 2012, at 17:16, Otis Gospodnetic wrote:
 This would not necessarily increase the size of your index that much - you 
 don't need to store both fields, just 1 of them if you really need it for 
 highlighting or displaying.  If not, just index.

I second this.
The query expansion process is far from being a slow thing... you can easily 
expand to tens of fields with a fairly small penalty.

Where you do pay a penalty is with stored fields... avoid these as much as 
possible.
As long as you keep them small, the legendary performance of Solr will still 
hold.

paul

Re: Deciding whether to stem at query time

2012-04-24 Thread Andrew Wagner
I'm sorry, I'm missing something. What's the difference between storing
and indexing a field?


Re: Deciding whether to stem at query time

2012-04-24 Thread Erick Erickson
When you set stored="true" in your schema, a verbatim copy of
the raw input is placed in the *.fdt file. That is the information
returned when you specify the fl parameter, for instance.

When you set indexed="true", the input is analyzed and the
resulting terms are placed in the inverted index and are
searchable.

The two are essentially orthogonal, even though you
specify them at the same time.

So, a field that's stored but not indexed would be displayable
to the user, but no searches could be performed on it.

A field indexed but not stored can be searched, but the
information is not retrievable.

Why are there two options? Well, you may use copyField to
index the data two different ways for two different purposes, as
in this thread. Putting the verbatim data in twice is wasteful;
you only ever need it once.

Why store in the first place? Because all that gets into the
inverted index is the results of the analysis. So if you indexed
"story" with stemming turned on, it might result in "stori" being
in the index. And if you use phonetic filters, it's much worse:
your terms will be something like UNT4 or KMPT, which are
totally unsuitable to show the user. So if you want to _search_
phonetically but display the field to the user, you would both
index and store.

And even if you could recover the terms from the inverted
index as they were fed in, it would be a very expensive
process. Luke does this; you might try reconstructing
a document with Luke to see what a reconstructed doc
looks like, and how long it takes.

Hope that helps
Erick
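
The stored/indexed distinction described above can be modelled with a small toy. This is only a sketch under stated assumptions: `ToyIndex` and `crude_stem` are hypothetical names, the "stemmer" just turns a trailing "y" into "i" to mimic the story/stori example, and nothing here reflects Lucene's actual file formats.

```python
from collections import defaultdict


def crude_stem(token):
    # Hypothetical stand-in for a real stemmer: a trailing "y" becomes "i".
    if token.endswith("y") and len(token) > 2:
        return token[:-1] + "i"
    return token


class ToyIndex:
    def __init__(self):
        self.inverted = defaultdict(set)  # analyzed term -> ids of docs containing it
        self.stored = {}                  # doc id -> verbatim input

    def add(self, doc_id, text, indexed=True, stored=True):
        if indexed:
            for token in text.lower().split():
                self.inverted[crude_stem(token)].add(doc_id)
        if stored:
            self.stored[doc_id] = text

    def search(self, word):
        # The query word goes through the same analysis as the indexed text.
        return sorted(self.inverted.get(crude_stem(word.lower()), set()))

    def display(self, doc_id):
        # Only stored fields can be handed back to the user.
        return self.stored.get(doc_id)


idx = ToyIndex()
idx.add(1, "A short Story", indexed=True, stored=False)
idx.add(2, "Another story", indexed=False, stored=True)
idx.add(3, "Story time", indexed=True, stored=True)

print(idx.search("story"))   # ids of the docs that were indexed
print(idx.display(1))        # nothing to show: indexed but not stored
print(idx.display(2))        # verbatim text: stored but not searchable
print(sorted(idx.inverted))  # the analyzed terms, not the original words
```

Document 1 is found by a search but has nothing to display, document 2 can be displayed but is never found, and the term list contains "stori" rather than "story", which is why the analyzed terms are a poor substitute for the stored value.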


Re: Deciding whether to stem at query time

2012-04-23 Thread Walter Underwood
There is a third approach. Create two fields and always query both of them, 
with the exact field given a higher weight. This works great and performs well.

It is what we did at Netflix and what I'm doing at Chegg.

wunder
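
One way to express "query both fields, weight the exact one higher" is through the qf parameter of Solr's edismax query parser. The message above does not say which query parser was used, so treat this as an assumption; the field names `text_exact` and `text_stemmed` and the boost of 2.0 are hypothetical as well.

```python
from urllib.parse import urlencode


def two_field_query(user_query,
                    exact_field="text_exact",      # hypothetical unstemmed field
                    stemmed_field="text_stemmed",  # hypothetical stemmed field
                    exact_boost=2.0):              # assumed weight, tune to taste
    """Build a query string that searches both fields and boosts the exact one."""
    params = {
        "q": user_query,
        "defType": "edismax",
        "qf": f"{exact_field}^{exact_boost} {stemmed_field}",
    }
    return urlencode(params)


print(two_field_query("lighter"))
```

The resulting string would be appended to the select handler URL of your own core. A document matching only in the stemmed field still comes back, while a match in the exact field scores higher.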

Re: Deciding whether to stem at query time

2012-04-23 Thread Michael Sokolov
Yes, and you might choose to use different options for different 
fields.  For dictionary searches, where users are searching for specific 
words, and a high degree of precision is called for, stemming is less 
helpful, but for full text searches, more so.


-Mike

Re: Deciding whether to stem at query time

2012-04-23 Thread Walter Underwood
Right. Stemming is less useful for author fields; you don't need to match 
"bill gate" or "steve job".

Also, if you want to do fuzzy matching, you should only do that on the exact 
fields, not the stemmed fields.

wunder

On Apr 23, 2012, at 3:45 PM, Michael Sokolov wrote:

 Yes, and you might choose to use different options for different fields.  For 
 dictionary searches, where users are searching for specific words, and a high 
 degree of precision is called for, stemming is less helpful, but for full 
 text searches, more so.
 
 -Mike
 

--
Walter Underwood
wun...@wunderwood.org