Tokenization: How to Allow Multiple Strategies?
Hey everyone,

Tokenization seems inherently fuzzy and imprecise, yet Solr/Lucene doesn't appear to provide an easy mechanism to account for this fuzziness. Let's take an example, where the document I'm indexing is:

v1.1.0 mr. jones www.gmail.com

I may want to tokenize this as follows:

[v1.1.0, mr, jones, www.gmail.com]

...or I may want to tokenize this as follows:

[v1, 1.0, mr, jones, www, gmail.com]

...or I may want to tokenize it another way. I would think that the best approach would be indexing using multiple strategies at once, such as:

[v1.1.0, v1, 1.0, mr, jones, www.gmail.com, www, gmail.com]

However, this would destroy phrase queries. And while Lucene lets you index multiple tokens at the same position, I haven't found a way to index a *set* of tokens at one position, nor is it clear that would even make sense. For instance, I can't index [www, gmail.com] in the same position as www.gmail.com.

So:

- Any thoughts, in general, about how you all approach this fuzziness? Do you just choose one tokenization strategy and hope for the best?
- Might there be a way to use multiple strategies and *not* break phrase queries that I'm overlooking?

Thanks!
Tavi
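To make the position problem concrete, here's a toy Python sketch (plain Python, not Lucene itself) of how Lucene-style position increments behave. An increment of 0 stacks a token on the previous position, which works for single-token synonyms but not for a multi-token set like [www, gmail.com]: the second token must consume a position of its own, shifting everything after it.

```python
# Toy model of a Lucene-style token stream: tokens are recorded as
# (term, position_increment) pairs, roughly the way Lucene's
# PositionIncrementAttribute works. This is an illustration, not Solr code.

def positions(pairs):
    """Expand (term, position_increment) pairs into (term, absolute_position)."""
    out, pos = [], -1
    for term, inc in pairs:
        pos += inc
        out.append((term, pos))
    return out

def phrase_match(stream, terms):
    """True if the terms occur at strictly consecutive positions."""
    pos = {}
    for term, p in stream:
        pos.setdefault(term, set()).add(p)
    return any(all(p + i in pos.get(t, set()) for i, t in enumerate(terms))
               for p in pos.get(terms[0], set()))

# "mr jones www.gmail.com inbox", with "www" stacked (inc=0) on the URL's
# position and "gmail.com" emitted after it:
stream = positions([
    ("mr", 1), ("jones", 1),
    ("www.gmail.com", 1), ("www", 0), ("gmail.com", 1),
    ("inbox", 1),
])
print(stream)
# The split-up variant works as a phrase...
print(phrase_match(stream, ["jones", "www", "gmail.com"]))  # True
# ...but "gmail.com" consumed a position of its own, so the whole-URL
# phrase is now broken:
print(phrase_match(stream, ["www.gmail.com", "inbox"]))     # False
```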
Re: Tokenization: How to Allow Multiple Strategies?
Hi Tavi,

If you want to use multiple tokenization strategies (different tokenizers, so to speak), you have to use different fieldTypes. You may have to create your own tokenizer to do what you want, or a PatternTokenizer might help.

However, your examples of placing specific terms at different positions remind me of the WordDelimiterFilter (see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory ). It does almost everything you described and is close to what you want, I think. Have a look at it.

Regards

--
View this message in context: http://lucene.472066.n3.nabble.com/Tokenization-How-to-Allow-Multiple-Strategies-tp2452505p2453215.html
Sent from the Solr - User mailing list archive at Nabble.com.
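For a feel of what the WordDelimiterFilter produces, here's a rough Python approximation (a deliberate simplification, not Solr's implementation) of its generateWordParts and catenateAll behavior: it only splits on non-alphanumeric delimiters and ignores case/number transitions and position bookkeeping.

```python
import re

def word_delimiter_like(token):
    """Rough sketch of WordDelimiterFilter with generateWordParts=1 and
    catenateAll=1. Simplified: splits only on non-alphanumeric runs, and
    returns a flat list instead of position-stacked tokens."""
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    out = list(parts)
    catenated = "".join(parts)
    if len(parts) > 1 and catenated not in out:
        out.append(catenated)  # catenateAll-style combined token
    return out

print(word_delimiter_like("123-4567"))       # ['123', '4567', '1234567']
print(word_delimiter_like("www.gmail.com"))  # ['www', 'gmail', 'com', 'wwwgmailcom']
```

The real filter also stacks the catenated token at an existing position rather than appending it, which is why it plays reasonably well with phrase queries.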
Re: Tokenization: How to Allow Multiple Strategies?
Thanks for the suggestions!

Using a new field makes sense, except it would double the size of the index. I'd like to add additional terms, at my discretion, only when there's ambiguity.

More specifically, do you know of any way to put multiple *token sets* at the same position of the same field? If I can tokenize "123-4567 apple" as:

[Token(123), Token(-), Token(4567), Token(apple)]

or:

[Token(123-4567), Token(apple)]

...might there be a way to put [Token(123), Token(-), Token(4567)] *and* [Token(123-4567)] in the index in such a way that the PhraseQuery "Token(123-4567) Token(apple)" would match the above string, *and* the PhraseQuery "Token(123) Token(-) Token(4567) Token(apple)" would also match it?

Thanks!
Tavi

On Tue, Feb 8, 2011 at 10:34 AM, Em mailformailingli...@yahoo.de wrote:
> ...
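To make the question concrete, here's a toy Python sketch (not Lucene code) of what happens if both token sets are merged into one stream using ordinary position increments. A Lucene position is a single integer, so "apple" can only be adjacent to one of the two variants, and one of the phrase queries necessarily fails:

```python
# Toy Lucene-style token stream: (term, position_increment) pairs.
# Illustration only, not actual Solr/Lucene behavior.

def positions(pairs):
    """Expand (term, position_increment) pairs into (term, absolute_position)."""
    out, pos = [], -1
    for term, inc in pairs:
        pos += inc
        out.append((term, pos))
    return out

def phrase_match(stream, terms):
    """True if the terms occur at strictly consecutive positions."""
    pos = {}
    for term, p in stream:
        pos.setdefault(term, set()).add(p)
    return any(all(p + i in pos.get(t, set()) for i, t in enumerate(terms))
               for p in pos.get(terms[0], set()))

# "123-4567 apple" indexed both ways at once: "123" stacked (inc=0) on
# "123-4567"'s position, then "4567" and "apple" follow.
merged = positions([
    ("123-4567", 1), ("123", 0), ("4567", 1), ("apple", 1),
])
print(phrase_match(merged, ["123", "4567", "apple"]))  # True
print(phrase_match(merged, ["123-4567", "apple"]))     # False: "apple" is
                                                       # two positions away
```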
Re: Tokenization: How to Allow Multiple Strategies?
A couple of things...

First, you haven't provided any evidence that increasing the index size is actually a concern. If your index isn't all that large, it really doesn't matter, and conserving index size may not be worth the effort.

WordDelimiterFilterFactory (WDFF) will handle the use cases you outlined below. But don't get stuck on, for instance, having the '-' be a token unless you can say for certain that it has benefits over both indexing and searching on just 123 followed by 4567, which is what would happen with WDFF.

I recommend that you look at the analysis page (check the verbose box) to see the effects of tokenization with various analysis chains before making any firm decisions.

Best
Erick

On Tue, Feb 8, 2011 at 6:24 PM, Tavi Nathanson tavi.nathan...@gmail.com wrote:
> ...
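The point about not needing the '-' as a token can be sketched in a few lines of Python (a deliberate simplification, not Solr's analysis chain): when the same analysis runs at index time and query time, a phrase query for "123-4567 apple" is split exactly the way the document was, so the two sides still line up.

```python
import re

def analyze(text):
    """Toy stand-in for an analysis chain that splits on non-alphanumeric
    runs, the way WDFF's generateWordParts would. Hypothetical simplification."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]

def phrase_match(doc_tokens, query_tokens):
    """True if the query tokens appear consecutively in the doc tokens."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))

doc = analyze("123-4567 apple")    # ['123', '4567', 'apple']
query = analyze("123-4567 apple")  # same analysis at query time
print(phrase_match(doc, query))    # True: '-' never needs to be a token
```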