Re: Deciding whether to stem at query time
Hi Andrew,

* store == keep the original value of the text/string being indexed
* index == run analysis on the original text/string, which typically means tokenizing the text and optionally filtering or modifying tokens (e.g. removing or stemming them)
* indexed fields can be searched
* fields that are only stored cannot be searched
* stored fields can be used for display in the UI and for highlighting
* fields that are indexed but not stored generally cannot

HTH,
Otis

Solr Performance Monitoring - http://sematext.com/spm/solr-performance-monitoring

From: Andrew Wagner wagner.and...@gmail.com
To: solr-user@lucene.apache.org
Sent: Tuesday, April 24, 2012 10:40 AM
Subject: Re: Deciding whether to stem at query time

I'm sorry, I'm missing something. What's the difference between storing and indexing a field?

On Tue, Apr 24, 2012 at 10:28 AM, Paul Libbrecht p...@hoplahup.net wrote:

On 24 Apr 2012, at 17:16, Otis Gospodnetic wrote:

This would not necessarily increase the size of your index that much - you don't need to store both fields, just one of them, and only if you really need it for highlighting or displaying. If not, just index.

I second this. The query expansion process is far from being a slow thing... you can easily expand to tens of fields with a fairly small penalty. Where you do pay a penalty is with stored fields... these need to be avoided as much as possible. As long as you keep them small, the legendary performance of Solr will still hold.

paul
Re: Deciding whether to stem at query time
Ah, this is a really good point. It still seems to have the downsides of #2, though: much bigger space requirements and possibly some time lost on queries.

On Mon, Apr 23, 2012 at 3:35 PM, Walter Underwood wun...@wunderwood.org wrote:

There is a third approach. Create two fields and always query both of them, with the exact field given a higher weight. This works great and performs well. It is what we did at Netflix and what I'm doing at Chegg.

wunder

On Apr 23, 2012, at 12:21 PM, Andrew Wagner wrote:

So I just realized the other day that stemming basically happens at index time. If I'm understanding correctly, there's no way to let a user specify, at run time, whether to stem particular words or not based on a single index. I think there are two options, but I'd love to hear that I'm wrong:

1.) Incrementally build up a whitelist of words that don't stem very well. To pick a random example out of the blue, "light" isn't super closely related to "lighter", so I might choose not to stem that. If I wanted to do this, I think (if I understand correctly) StemmerOverrideFilter would help me out. I'm not a big fan of this approach.

2.) Index all the text in two fields, once with stemming and once without. Then build some kind of option into the UI for specifying whether to stem the words or not, and search the appropriate field. Unfortunately, this would roughly double the size of my index, and probably affect query times too. Plus, the UI would probably suck.

Am I missing an option? Has anyone tried one of these approaches?

Thanks!
Andrew
Re: Deciding whether to stem at query time
Hi Andrew,

This would not necessarily increase the size of your index that much - you don't need to store both fields, just one of them, and only if you really need it for highlighting or displaying. If not, just index.

Otis

Performance Monitoring for Solr - http://sematext.com/spm/solr-performance-monitoring

From: Andrew Wagner wagner.and...@gmail.com
To: solr-user@lucene.apache.org
Sent: Tuesday, April 24, 2012 7:21 AM
Subject: Re: Deciding whether to stem at query time

Ah, this is a really good point. It still seems to have the downsides of #2, though: much bigger space requirements and possibly some time lost on queries.

On Mon, Apr 23, 2012 at 3:35 PM, Walter Underwood wun...@wunderwood.org wrote:

There is a third approach. Create two fields and always query both of them, with the exact field given a higher weight. This works great and performs well. It is what we did at Netflix and what I'm doing at Chegg.

wunder

On Apr 23, 2012, at 12:21 PM, Andrew Wagner wrote:

So I just realized the other day that stemming basically happens at index time. If I'm understanding correctly, there's no way to let a user specify, at run time, whether to stem particular words or not based on a single index. I think there are two options, but I'd love to hear that I'm wrong:

1.) Incrementally build up a whitelist of words that don't stem very well. To pick a random example out of the blue, "light" isn't super closely related to "lighter", so I might choose not to stem that. If I wanted to do this, I think (if I understand correctly) StemmerOverrideFilter would help me out. I'm not a big fan of this approach.

2.) Index all the text in two fields, once with stemming and once without. Then build some kind of option into the UI for specifying whether to stem the words or not, and search the appropriate field. Unfortunately, this would roughly double the size of my index, and probably affect query times too. Plus, the UI would probably suck.

Am I missing an option? Has anyone tried one of these approaches?

Thanks!
Andrew
Re: Deciding whether to stem at query time
On 24 Apr 2012, at 17:16, Otis Gospodnetic wrote:

This would not necessarily increase the size of your index that much - you don't need to store both fields, just one of them, and only if you really need it for highlighting or displaying. If not, just index.

I second this. The query expansion process is far from being a slow thing... you can easily expand to tens of fields with a fairly small penalty. Where you do pay a penalty is with stored fields... these need to be avoided as much as possible. As long as you keep them small, the legendary performance of Solr will still hold.

paul
Re: Deciding whether to stem at query time
I'm sorry, I'm missing something. What's the difference between storing and indexing a field?

On Tue, Apr 24, 2012 at 10:28 AM, Paul Libbrecht p...@hoplahup.net wrote:

On 24 Apr 2012, at 17:16, Otis Gospodnetic wrote:

This would not necessarily increase the size of your index that much - you don't need to store both fields, just one of them, and only if you really need it for highlighting or displaying. If not, just index.

I second this. The query expansion process is far from being a slow thing... you can easily expand to tens of fields with a fairly small penalty. Where you do pay a penalty is with stored fields... these need to be avoided as much as possible. As long as you keep them small, the legendary performance of Solr will still hold.

paul
Re: Deciding whether to stem at query time
When you set stored="true" in your schema, a verbatim copy of the raw input is placed in the *.fdt file. That is the information returned when you specify the fl parameter, for instance.

When you set indexed="true", the input is analyzed and the resulting terms are placed in the inverted index and are searchable.

The two are essentially orthogonal, even though you specify them in the same place. So a field that's stored but not indexed would be displayable to the user, but no searches could be performed on it. A field that's indexed but not stored can be searched, but the original information is not retrievable.

Why are there two options? Well, you may use copyField to index the data two different ways for two different purposes, as in this thread. Putting the verbatim data in twice is wasteful; you only ever need it once.

Why store in the first place? Because all that gets into the inverted index is the result of the analysis. So if you indexed "story" with stemming turned on, it might result in "stori" being in the index. And if you use phonetic filters, it's much worse: your terms will be something like UNT4 or KMPT, which are totally unsuitable to show the user. So if you want to _search_ phonetically but display the field to the user, you would both index and store.

And even if you could recover the terms from the inverted index as they were fed in, it would be a very expensive process. Luke does this; you might try reconstructing a document with Luke to see what a reconstructed doc looks like, and how long it takes.

Hope that helps
Erick

On Tue, Apr 24, 2012 at 10:40 AM, Andrew Wagner wagner.and...@gmail.com wrote:

I'm sorry, I'm missing something. What's the difference between storing and indexing a field?

On Tue, Apr 24, 2012 at 10:28 AM, Paul Libbrecht p...@hoplahup.net wrote:

On 24 Apr 2012, at 17:16, Otis Gospodnetic wrote:

This would not necessarily increase the size of your index that much - you don't need to store both fields, just one of them, and only if you really need it for highlighting or displaying. If not, just index.

I second this. The query expansion process is far from being a slow thing... you can easily expand to tens of fields with a fairly small penalty. Where you do pay a penalty is with stored fields... these need to be avoided as much as possible. As long as you keep them small, the legendary performance of Solr will still hold.

paul
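To make the stored/indexed distinction in the message above concrete, here is a minimal toy sketch in Python. It is not how Lucene or Solr is implemented; `ToyIndex`, `toy_stem`, and the sample documents are all made up for illustration, and the stemmer is a crude stand-in chosen only so that "story" becomes "stori" as in the example above.

```python
from collections import defaultdict


def toy_stem(token):
    """Crude suffix stripper for illustration only (not a real stemmer)."""
    if token.endswith("ies"):
        return token[:-3] + "i"
    if token.endswith("y"):
        return token[:-1] + "i"
    if token.endswith("s"):
        return token[:-1]
    return token


class ToyIndex:
    """Hypothetical model: 'indexed' and 'stored' are independent switches."""

    def __init__(self):
        self.postings = defaultdict(set)  # analyzed term -> doc ids (searchable)
        self.stored = {}                  # doc id -> verbatim input (displayable)

    def add(self, doc_id, text, indexed=True, stored=True):
        if indexed:
            # Only the output of analysis reaches the inverted index.
            for token in text.lower().split():
                self.postings[toy_stem(token)].add(doc_id)
        if stored:
            # The raw input is kept untouched for display.
            self.stored[doc_id] = text

    def search(self, word):
        # The query term goes through the same analysis as the indexed text.
        return sorted(self.postings.get(toy_stem(word.lower()), set()))

    def display(self, doc_id):
        return self.stored.get(doc_id)


index = ToyIndex()
index.add(1, "A short story", indexed=True, stored=True)
index.add(2, "Two stories", indexed=True, stored=False)
index.add(3, "Hidden story", indexed=False, stored=True)

hits = index.search("story")
```

In this sketch, document 2 is found by a search but has nothing to display, and document 3 can be displayed but is never found, which mirrors the two "only one flag set" cases described above. The index holds the term "stori", not "story", which is why the stored copy is needed for anything shown to a user.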
Re: Deciding whether to stem at query time
There is a third approach. Create two fields and always query both of them, with the exact field given a higher weight. This works great and performs well. It is what we did at Netflix and what I'm doing at Chegg.

wunder

On Apr 23, 2012, at 12:21 PM, Andrew Wagner wrote:

So I just realized the other day that stemming basically happens at index time. If I'm understanding correctly, there's no way to let a user specify, at run time, whether to stem particular words or not based on a single index. I think there are two options, but I'd love to hear that I'm wrong:

1.) Incrementally build up a whitelist of words that don't stem very well. To pick a random example out of the blue, "light" isn't super closely related to "lighter", so I might choose not to stem that. If I wanted to do this, I think (if I understand correctly) StemmerOverrideFilter would help me out. I'm not a big fan of this approach.

2.) Index all the text in two fields, once with stemming and once without. Then build some kind of option into the UI for specifying whether to stem the words or not, and search the appropriate field. Unfortunately, this would roughly double the size of my index, and probably affect query times too. Plus, the UI would probably suck.

Am I missing an option? Has anyone tried one of these approaches?

Thanks!
Andrew
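One way the "query both fields, weight the exact one higher" approach above might be expressed is with the edismax query parser's `qf` parameter. The sketch below only builds the request parameters; the field names `body_exact` and `body_stemmed`, the boost of 3.0, and the `fl` list are assumptions for illustration, not values from this thread.

```python
from urllib.parse import urlencode


def build_params(user_query,
                 exact_field="body_exact",      # hypothetical unstemmed field
                 stemmed_field="body_stemmed",  # hypothetical stemmed field
                 exact_boost=3.0):              # assumed weight, tune for your data
    """Return Solr request parameters that search both fields at once."""
    return {
        "q": user_query,
        "defType": "edismax",
        # Both fields are always searched; a match on the exact field scores higher.
        "qf": f"{exact_field}^{exact_boost} {stemmed_field}",
        "fl": "id,score",
    }


params = build_params("lighter")
query_string = urlencode(params)
```

With this shape there is no stemming switch in the UI: a document containing the literal word ranks above one that matches only through the stem, so users who want the exact form tend to see it first.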
Re: Deciding whether to stem at query time
Yes, and you might choose to use different options for different fields. For dictionary searches, where users are searching for specific words and a high degree of precision is called for, stemming is less helpful; for full-text searches it is more so.

-Mike

On 4/23/2012 3:35 PM, Walter Underwood wrote:

There is a third approach. Create two fields and always query both of them, with the exact field given a higher weight. This works great and performs well. It is what we did at Netflix and what I'm doing at Chegg.

wunder

On Apr 23, 2012, at 12:21 PM, Andrew Wagner wrote:

So I just realized the other day that stemming basically happens at index time. If I'm understanding correctly, there's no way to let a user specify, at run time, whether to stem particular words or not based on a single index. I think there are two options, but I'd love to hear that I'm wrong:

1.) Incrementally build up a whitelist of words that don't stem very well. To pick a random example out of the blue, "light" isn't super closely related to "lighter", so I might choose not to stem that. If I wanted to do this, I think (if I understand correctly) StemmerOverrideFilter would help me out. I'm not a big fan of this approach.

2.) Index all the text in two fields, once with stemming and once without. Then build some kind of option into the UI for specifying whether to stem the words or not, and search the appropriate field. Unfortunately, this would roughly double the size of my index, and probably affect query times too. Plus, the UI would probably suck.

Am I missing an option? Has anyone tried one of these approaches?

Thanks!
Andrew
Re: Deciding whether to stem at query time
Right. Stemming is less useful for author fields; you don't need to match "bill gate" or "steve job".

Also, if you want to do fuzzy matching, you should only do that on the exact fields, not the stemmed fields.

wunder

On Apr 23, 2012, at 3:45 PM, Michael Sokolov wrote:

Yes, and you might choose to use different options for different fields. For dictionary searches, where users are searching for specific words and a high degree of precision is called for, stemming is less helpful; for full-text searches it is more so.

-Mike

On 4/23/2012 3:35 PM, Walter Underwood wrote:

There is a third approach. Create two fields and always query both of them, with the exact field given a higher weight. This works great and performs well. It is what we did at Netflix and what I'm doing at Chegg.

wunder

On Apr 23, 2012, at 12:21 PM, Andrew Wagner wrote:

So I just realized the other day that stemming basically happens at index time. If I'm understanding correctly, there's no way to let a user specify, at run time, whether to stem particular words or not based on a single index. I think there are two options, but I'd love to hear that I'm wrong:

1.) Incrementally build up a whitelist of words that don't stem very well. To pick a random example out of the blue, "light" isn't super closely related to "lighter", so I might choose not to stem that. If I wanted to do this, I think (if I understand correctly) StemmerOverrideFilter would help me out. I'm not a big fan of this approach.

2.) Index all the text in two fields, once with stemming and once without. Then build some kind of option into the UI for specifying whether to stem the words or not, and search the appropriate field. Unfortunately, this would roughly double the size of my index, and probably affect query times too. Plus, the UI would probably suck.

Am I missing an option? Has anyone tried one of these approaches?

Thanks!
Andrew

--
Walter Underwood
wun...@wunderwood.org
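The advice above about keeping fuzzy matching on the exact field could look roughly like the following when written in Lucene query syntax, where a trailing `~` marks a fuzzy term and `^` a boost. This is a sketch under assumptions: the field names `text_exact` and `text_stemmed` and the boost value are hypothetical, and the helper expects a single plain token with no query-syntax special characters, since it does no escaping.

```python
def build_query(term,
                exact_field="text_exact",      # hypothetical unstemmed field
                stemmed_field="text_stemmed",  # hypothetical stemmed field
                exact_boost=2.0):              # assumed weight
    """Combine exact, fuzzy, and stemmed clauses for one plain token."""
    clauses = [
        f"{exact_field}:{term}^{exact_boost}",  # literal match, weighted highest
        f"{exact_field}:{term}~",               # fuzzy match, exact field only
        f"{stemmed_field}:{term}",              # stemmed match, no fuzzy operator
    ]
    return " OR ".join(clauses)


query = build_query("gates")
```

The fuzzy clause is deliberately absent from the stemmed field: the terms in that field have already been altered by the stemmer, so edit-distance matching against them would compare the user's word with stems rather than with words anyone typed.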