Re: Erroneous tokenization behavior

2016-09-14 Thread Sattam Alsubaiee
Thanks, Steve.

Sattam

On Tue, Sep 13, 2016 at 5:51 PM, Steve Rowe wrote:
> Hi Sattam,
>
> You’re right, StandardTokenizer's behavior changed (in 4.9.1/4.10) to
> split long tokens at maxTokenLength rather than ignore tokens longer than
> maxTokenLength.
>
> You can simulate

Re: Erroneous tokenization behavior

2016-09-13 Thread Steve Rowe
Hi Sattam,

You’re right, StandardTokenizer's behavior changed (in 4.9.1/4.10) to split long tokens at maxTokenLength rather than ignore tokens longer than maxTokenLength.

You can simulate the old behavior by setting maxTokenLength to the length of the longest token you want to be able to
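
The archived snippet is cut off here, so Steve's exact recipe isn't shown. One common way to approximate the old "drop too-long tokens" behavior on 4.10.x is to raise maxTokenLength high enough that nothing gets split and then discard over-length tokens with a LengthFilter. A minimal sketch under that assumption (the limit of 4 and the class name are just illustrative; the Version-taking constructors used below exist throughout 4.x, though they are deprecated in 4.10):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.miscellaneous.LengthFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class DropLongTokens {
        public static void main(String[] args) throws Exception {
            int limit = 4;  // hypothetical per-token length limit for this example

            // Raise maxTokenLength well above the limit so no token gets split...
            StandardTokenizer tokenizer =
                new StandardTokenizer(Version.LUCENE_47, new StringReader("Tokenize me!"));
            tokenizer.setMaxTokenLength(255);

            // ...then drop tokens longer than the limit, as 4.7.x used to do.
            TokenStream stream = new LengthFilter(Version.LUCENE_47, tokenizer, 1, limit);

            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());  // with the sample input, prints only "me"
            }
            stream.end();
            stream.close();
        }
    }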

Re: Erroneous tokenization behavior

2016-09-13 Thread Sattam Alsubaiee
Hi Michael,

Yes, that's the desired behavior. The setMaxTokenLength method is supposed to allow that.

Cheers,
Sattam

On Tue, Sep 13, 2016 at 11:57 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> I guess this was a change in behavior in those versions.
>
> Are you wanting to

Re: Erroneous tokenization behavior

2016-09-13 Thread Michael McCandless
I guess this was a change in behavior in those versions.

Are you wanting to discard the too-long terms (the 4.7.x behavior)?

Mike McCandless
http://blog.mikemccandless.com

On Tue, Sep 13, 2016 at 12:42 AM, Sattam Alsubaiee wrote:
> I'm trying to understand the

Erroneous tokenization behavior

2016-09-12 Thread Sattam Alsubaiee
I'm trying to understand the tokenization behavior in Lucene. When using the StandardTokenizer in Lucene version 4.7.1 and tokenizing the string "Tokenize me!" with the max token length set to 4, I get only the token "me", but when using Lucene version 4.10.4, I get the following
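
For reference, a minimal way to reproduce the comparison (the expected outputs in the comments are inferred from Steve's reply above, not from the truncated snippet; the Version-taking constructor exists in both 4.7.1 and 4.10.4, though it is deprecated in the latter):

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class MaxTokenLengthDemo {
        public static void main(String[] args) throws Exception {
            StandardTokenizer tokenizer =
                new StandardTokenizer(Version.LUCENE_47, new StringReader("Tokenize me!"));
            tokenizer.setMaxTokenLength(4);  // tokens longer than 4 chars are affected

            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.println(term.toString());
            }
            tokenizer.end();
            tokenizer.close();

            // On 4.7.1 the 8-character "Tokenize" is silently dropped, so only "me" prints.
            // On 4.10.4 the too-long token is split at maxTokenLength instead,
            // so something like "Toke", "nize", "me" prints.
        }
    }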