Is there a StandardTokenizer implementation that does not break words on hyphens?
I think it would be more flexible to retain hyphens and use a WordDelimiterFilterFactory (WDF) to split these tokens.

StandardTokenizer today:

  doc1: email  -> email
  doc2: e-mail -> e|mail
  doc3: e mail -> e|mail

  query1: email  -> doc1
  query2: e-mail -> doc2,doc3
  query3: e mail -> doc2,doc3

StandardTokenizer which keeps hyphens + WDF:

  doc1: email  -> email
  doc2: e-mail -> e-mail|email|e|mail
  doc3: e mail -> e|mail

  query1: email  -> doc1,doc2
  query2: e-mail -> doc1,doc2,doc3
  query3: e mail -> doc2,doc3

Any suggestions on how to configure or code the second behavior?

Regards,
Kai Gülzau
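To make the two behaviors concrete, here is a minimal Python sketch that only simulates the token streams listed above. It does not use Lucene or Solr. The function names `standard_like` and `hyphen_keeping_with_wdf` are hypothetical, and the ASCII-only regular expressions are a simplifying assumption, not the real tokenizer grammar.

```python
import re


def standard_like(text):
    """Simulate today's behavior: a hyphen acts as a word break."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def hyphen_keeping_with_wdf(text):
    """Simulate the wished-for behavior.

    Step 1 keeps hyphenated words as one token (hypothetical tokenizer).
    Step 2 expands each hyphenated token into: the original token,
    the catenated form, and the individual parts (WDF-like step).
    """
    tokens = []
    for tok in re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text):
        tok = tok.lower()
        if "-" in tok:
            parts = tok.split("-")
            tokens.extend([tok, "".join(parts), *parts])
        else:
            tokens.append(tok)
    return tokens


if __name__ == "__main__":
    # Print both token streams for the three example documents.
    for doc in ("email", "e-mail", "e mail"):
        print(doc, "|".join(standard_like(doc)),
              "|".join(hyphen_keeping_with_wdf(doc)))
```

The sketch covers only token generation. How the expanded query tokens are combined at search time (for example, whether "e" and "mail" must both match) is left out, since that depends on the query parser.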