Re: [jira] Commented: (SOLR-234) TrimFilter should update the start and end offsets

Chris Hostetter Sat, 12 May 2007 11:30:40 -0700

: > Incidentally, PatternTokenizerFactory seems to have the annoying limitation
: > of assuming there is a token prior to each match -- even if the match
: > explicitly matches on the start of the string (so it creates a 0 width
: > token) ... that seems like a bug, right?


: how would you change it?  I don't know regex well enough to see the
: limitation.  My only criteria was that the output is the same as if you
: send it to string.split( pattern );

Ahhh.... yes, I see ... if you are trying to mimic String.split (or
Pattern.split) then the current behavior is correct.  My thinking was that
if you were trying to use this to tokenize on whitespace (or something
like that) and your input was "  aaa bbb   ccc  " ... this would create 4
tokens: a zero-width token, followed by tokens for aaa, bbb, and ccc ...
but that first token seemed like a mistake to me (or if it's not a
mistake, then it seemed like there should also be a zero-width token
at the end after the last space too) ... but that's the way string
splitting works too.
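
For reference, a minimal sketch of the java.util.regex behavior being
mimicked. The class and method names (SplitDemo, splitDefault,
splitKeepTrailing) are made up for illustration and are not part of Solr,
and the "\\s+" pattern is just an assumed whitespace delimiter. With the
default limit, a leading empty string is kept when the input starts with a
delimiter, while trailing empty strings are dropped; a negative limit keeps
the trailing one as well.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Solr or PatternTokenizerFactory.
class SplitDemo {

    // Default split (limit 0): trailing empty strings are discarded,
    // but a leading empty string before a positive-width match is kept.
    static String[] splitDefault(String input, String regex) {
        return Pattern.compile(regex).split(input);
    }

    // Negative limit: the pattern is applied as often as possible and
    // trailing empty strings are kept.
    static String[] splitKeepTrailing(String input, String regex) {
        return Pattern.compile(regex).split(input, -1);
    }

    public static void main(String[] args) {
        String input = "  aaa bbb   ccc  ";
        // Four elements: "", "aaa", "bbb", "ccc"
        System.out.println(Arrays.toString(splitDefault(input, "\\s+")));
        // Five elements: "", "aaa", "bbb", "ccc", ""
        System.out.println(Arrays.toString(splitKeepTrailing(input, "\\s+")));
    }
}
```

So the asymmetry (an empty token at the front but none at the end) comes
from the default limit in the split call rather than from anything specific
to the tokenizer.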


-Hoss
