Tokenization: How to Allow Multiple Strategies?

2011-02-08 Thread Tavi Nathanson
Hey everyone,

Tokenization seems inherently fuzzy and imprecise, yet Solr/Lucene does not
appear to provide an easy mechanism to account for this fuzziness.

Let's take an example, where the document I'm indexing is v1.1.0 mr. jones
www.gmail.com

I may want to tokenize this as follows: [v1.1.0, mr, jones, www.gmail.com]
...or I may want to tokenize this as follows: [v1, 1.0, mr, jones, www, gmail.com]
...or I may want to tokenize it another way.

I would think that the best approach would be indexing using multiple
strategies, such as:

[v1.1.0, v1, 1.0, mr, jones, www.gmail.com, www, gmail.com]

However, this would destroy phrase queries. And while Lucene lets you index
multiple tokens at the same position, I haven't found a way to handle cases
where you want to index a sequence of tokens at one position; nor does that
even make sense. For instance, I can't index [www, gmail.com] in the
same position as www.gmail.com.
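
To make the phrase-query problem concrete, here is a minimal Python sketch of a
toy positional index. It is not Lucene code: the helpers with_positions,
build_index and phrase_match are hypothetical names, and a "phrase match" is
assumed to mean that the terms occur at consecutive positions.

```python
from collections import defaultdict

def with_positions(tokens):
    """Assign each token its own position: 0, 1, 2, ..."""
    return [(term, pos) for pos, term in enumerate(tokens)]

def build_index(tokens):
    """Map each term to the set of positions where it occurs."""
    index = defaultdict(set)
    for term, pos in tokens:
        index[term].add(pos)
    return index

def phrase_match(index, phrase):
    """True if the phrase terms occur at consecutive positions."""
    if not phrase:
        return False
    return any(
        all(start + i in index.get(term, ()) for i, term in enumerate(phrase))
        for start in index.get(phrase[0], ())
    )

# One strategy at a time: phrases that follow that strategy match.
index_a = build_index(with_positions(["v1.1.0", "mr", "jones", "www.gmail.com"]))
index_b = build_index(with_positions(["v1", "1.0", "mr", "jones", "www", "gmail.com"]))

# Both strategies flattened into one stream, each token at its own position.
index_flat = build_index(with_positions(
    ["v1.1.0", "v1", "1.0", "mr", "jones", "www.gmail.com", "www", "gmail.com"]
))

a_ok = phrase_match(index_a, ["v1.1.0", "mr"])        # adjacent under strategy A
b_ok = phrase_match(index_b, ["jones", "www"])        # adjacent under strategy B
flat_a = phrase_match(index_flat, ["v1.1.0", "mr"])   # extra tokens sit in between
flat_b = phrase_match(index_flat, ["jones", "www"])   # same problem
```

In this toy model each strategy matches its own phrases, but the flattened
stream pushes unrelated tokens between terms that were adjacent in the text, so
both phrases stop matching.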

So:

- Any thoughts, in general, about how you all approach this fuzziness? Do
you just choose one tokenization strategy and hope for the best?
- Might there be a way to use multiple strategies and *not* break phrase
queries that I'm overlooking?

Thanks!

Tavi


Re: Tokenization: How to Allow Multiple Strategies?

2011-02-08 Thread Em

Hi Tavi,

If you want to use multiple tokenization strategies (different tokenizers, so
to speak), you have to use different fieldTypes.

You may have to write your own tokenizer to do what you want, or a
PatternTokenizer might help you.

However, your examples of different positions for specific terms remind
me of the WordDelimiterFilter (see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
).

It does almost everything you described and is close to what you want, I think.
Have a look at it.
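
As a rough illustration of the idea (splitting a word on delimiters while
keeping the original at the same position), here is a toy Python sketch. It is
only an approximation under stated assumptions, not the filter's actual
implementation: split_with_original is a hypothetical helper, it splits only on
non-alphanumeric characters, and the real filter has more rules and options.

```python
import re

def split_with_original(text):
    """Return (term, position) pairs for a whitespace-separated text.

    Each word is split on runs of non-alphanumeric characters. If a word
    yields more than one part, the unsplit word is also emitted, stacked
    at the position of its first part (an assumption of this sketch).
    """
    tokens = []
    pos = 0
    for word in text.split():
        parts = [p for p in re.split(r"[^0-9A-Za-z]+", word) if p]
        if not parts:
            continue  # the word was made up of delimiters only
        if len(parts) > 1:
            tokens.append((word, pos))
        for part in parts:
            tokens.append((part, pos))
            pos += 1
    return tokens

stream = split_with_original("mr. jones www.gmail.com")
```

Here stream holds ("www.gmail.com", 2) and ("www", 2) at the same position,
followed by ("gmail", 3) and ("com", 4), so both the whole host name and its
parts are searchable from a single field.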

Regards
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Tokenization-How-to-Allow-Multiple-Strategies-tp2452505p2453215.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Tokenization: How to Allow Multiple Strategies?

2011-02-08 Thread Tavi Nathanson
Thanks for the suggestions! Using a new field makes sense, except it would
double the size of the index. I'd like to add additional terms, at my
discretion, only when there's ambiguity.

More specifically, do you know of any way to put multiple *token sets* at
the same position of the same field?

If I can tokenize 123-4567 apple as:

[Token(123), Token(-), Token(4567), Token(apple)]
or
[Token(123-4567), Token(apple)]

...might there be a way to put [Token(123), Token(-), Token(4567)] *and*
[Token(123-4567)]  in the index in such a way that the PhraseQuery
Token(123-4567) Token(apple) would match the above string, *and* the
PhraseQuery Token(123) Token(-) Token(4567) Token(apple) would also match
it?
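
The difficulty can be sketched in Python with a toy position model (not Lucene
code). The assumptions: every token occupies exactly one position, a phrase
matches when its terms sit at consecutive positions, and the leading word
"call" is a hypothetical addition so that both neighbours of the number can be
tested. phrase_match is a hypothetical helper.

```python
def phrase_match(tokens, phrase):
    """tokens: list of (term, position). True if the phrase terms sit at
    consecutive positions somewhere in the token list."""
    positions = {}
    for term, pos in tokens:
        positions.setdefault(term, set()).add(pos)
    return bool(phrase) and any(
        all(start + i in positions.get(term, ()) for i, term in enumerate(phrase))
        for start in positions.get(phrase[0], ())
    )

# "call 123-4567 apple" split into parts, one position each.
parts = [("call", 0), ("123", 1), ("-", 2), ("4567", 3), ("apple", 4)]

# Option 1: stack the whole number on its first part.
stack_first = parts + [("123-4567", 1)]
# Option 2: stack the whole number on its last part.
stack_last = parts + [("123-4567", 3)]

split_phrase = ["123", "-", "4567", "apple"]

first_split = phrase_match(stack_first, split_phrase)
first_left = phrase_match(stack_first, ["call", "123-4567"])
first_right = phrase_match(stack_first, ["123-4567", "apple"])

last_split = phrase_match(stack_last, split_phrase)
last_left = phrase_match(stack_last, ["call", "123-4567"])
last_right = phrase_match(stack_last, ["123-4567", "apple"])
```

In this model the split phrase matches either way, but the whole-number token
can be adjacent to only one neighbour: stacking on the first part keeps
"call 123-4567" and loses "123-4567 apple", and stacking on the last part does
the reverse.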

Thanks!
Tavi


Re: Tokenization: How to Allow Multiple Strategies?

2011-02-08 Thread Erick Erickson
A couple of things...

First, you haven't provided any evidence that increasing the index size is a
concern. If your index isn't all that large, it really doesn't matter.

WordDelimiterFilterFactory (WDFF) will handle the use cases you outlined, but
don't get stuck on, for instance, having the '-' be a token unless you can say
for certain that it has benefits over both indexing and searching on just
123 followed by 4567, which is what would happen with WDFF.
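
A minimal Python sketch of that point, assuming the same analysis runs at index
time and at query time. The analyze and contains_phrase helpers are
hypothetical and only approximate a delimiter-splitting chain; they are not
WDFF itself.

```python
import re

def analyze(text):
    """Lowercase, then split on any run of non-alphanumeric characters,
    dropping the delimiters themselves."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def contains_phrase(doc_terms, query_terms):
    """True if query_terms appears as a contiguous run inside doc_terms."""
    n = len(query_terms)
    return n > 0 and any(
        doc_terms[i:i + n] == query_terms
        for i in range(len(doc_terms) - n + 1)
    )

doc_terms = analyze("123-4567 apple")

# Both spellings of the query reduce to the same terms as the document.
hyphen_query = contains_phrase(doc_terms, analyze("123-4567 apple"))
spaced_query = contains_phrase(doc_terms, analyze("123 4567 apple"))
# Terms that are not adjacent in the document still do not match as a phrase.
skipping_query = contains_phrase(doc_terms, analyze("123 apple"))
```

Because the '-' is dropped on both sides, the hyphenated and the spaced query
both match as phrases, without ever indexing '-' as a token of its own.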

I recommend that you look at the analysis page (check the verbose box)
to see the effects of tokenization with various analysis chains before
making any firm decisions.

Best
Erick
