: > Incidentally, PatternTokenizerFactory seems to have the annoying limitation
: > of assuming there is a token prior to each match -- even if the match
: > explicitly matches on the start of the string (so it creates a 0 width
: > token) ... that seems like a bug, right?
: how would you change it?  I don't know regex well enough to see the
: limitation.  My only criterion was that the output is the same as if you
: send it to string.split( pattern );

Ahhh.... yes, i see ... if you are trying to mimic String.split (or
Pattern.split) then the current behavior is correct.

my thinking was that if you were trying to use this to tokenize on
whitespace (or something like that) and your input was " aaa bbb ccc " ...
this would create 4 tokens: a zero width token, followed by tokens for
aaa, bbb, and ccc ... but that first token seemed like a mistake to me (or
if it's not a mistake, then it seemed like there should also be a zero
width token at the end, after the last space, too) ... but that's the way
string splitting works too.

-Hoss
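
For reference, a minimal sketch of the plain JDK splitting behavior discussed above. It only exercises java.util.regex.Pattern, not PatternTokenizerFactory itself, and the class and method names (SplitDemo, splitLikeString, splitKeepTrailing) are made up for illustration. The whitespace pattern "\\s+" is likewise just an assumed example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr or Lucene.
public class SplitDemo {

    // Splits the way String.split(regex) does: a leading empty string is
    // kept when the input starts with a non-empty match, while trailing
    // empty strings are dropped.
    static String[] splitLikeString(String input, String regex) {
        return Pattern.compile(regex).split(input);
    }

    // Same split, but a negative limit tells the JDK to keep trailing
    // empty strings as well.
    static String[] splitKeepTrailing(String input, String regex) {
        return Pattern.compile(regex).split(input, -1);
    }

    public static void main(String[] args) {
        String input = " aaa bbb ccc ";

        // 4 elements: "", "aaa", "bbb", "ccc"
        System.out.println(Arrays.toString(splitLikeString(input, "\\s+")));

        // 5 elements: "", "aaa", "bbb", "ccc", ""
        System.out.println(Arrays.toString(splitKeepTrailing(input, "\\s+")));
    }
}
```

So the asymmetry (empty token at the front, none at the end) comes from the default split semantics; with a negative limit the empty element shows up at both ends.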
