StandardTokenizer vs. hyphens

Kai Gülzau Fri, 17 May 2013 09:27:24 -0700

Is there a StandardTokenizer implementation that does not break words on 
hyphens?


I think it would be more flexible to retain hyphens in the tokenizer and use a 
WordDelimiterFilterFactory afterwards to split these tokens.


StandardTokenizer today:
doc1: email -> email
doc2: e-mail -> e|mail
doc3: e mail -> e|mail

query1: email -> doc1
query2: e-mail -> doc2,doc3
query3: e mail -> doc2,doc3


StandardTokenizer which keeps hyphens + WDF:
doc1: email -> email
doc2: e-mail -> e-mail|email|e|mail
doc3: e mail -> e|mail

query1: email -> doc1,doc2
query2: e-mail -> doc1,doc2,doc3
query3: e mail -> doc2,doc3


Any suggestions on how to configure or code the second behavior?
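
To make the two behaviors concrete, here is a rough Python sketch (not Lucene 
or Solr code) that simulates both analysis chains on the three documents. 
The function names standard_like, hyphen_keeping_with_wdf and search are 
hypothetical, the tokenizers are crude stand-ins for the real ones, and 
matching is simplified to "query and document share at least one token", 
which ignores positions and scoring.

```python
import re

# The three example documents from the tables.
DOCS = {"doc1": "email", "doc2": "e-mail", "doc3": "e mail"}


def standard_like(text):
    """Crude stand-in for today's behavior: split on any non-alphanumeric
    character, so a hyphen separates tokens just like a space does."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]


def hyphen_keeping_with_wdf(text):
    """Hypothetical chain: split on whitespace only (hyphens survive), then
    expand each hyphenated token the way a word-delimiter step might:
    original, catenated form, and the individual parts."""
    tokens = []
    for tok in text.lower().split():
        parts = [p for p in tok.split("-") if p]
        if len(parts) > 1:
            tokens.append(tok)             # e-mail
            tokens.append("".join(parts))  # email
            tokens.extend(parts)           # e, mail
        else:
            tokens.append(tok)
    return tokens


def search(query, analyze):
    """Simplified matching: a document is a hit if it shares at least one
    token with the query. The same analyzer is used for index and query."""
    query_tokens = set(analyze(query))
    return [name for name, text in DOCS.items()
            if query_tokens & set(analyze(text))]


if __name__ == "__main__":
    for analyze in (standard_like, hyphen_keeping_with_wdf):
        print(analyze.__name__)
        for query in ("email", "e-mail", "e mail"):
            print("  query:", query, "->", ",".join(search(query, analyze)))
```

With these three documents the sketch gives the same hit lists as the two 
tables above, so the open question is only how to get the second chain out 
of the real tokenizer and filter.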

Regards,

Kai Gülzau
