Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-02 Thread Steve Rowe
Paul, You should also check out ICUTokenizer/DefaultICUTokenizerConfig, which adds better handling for some languages to UAX#29 Word Break rules conformance, and also finds token boundaries when the writing system (aka script) changes. This is intended to be extensible per script. The root

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Dawid Weiss
are skipped over and not returned as tokens. Steve On Sep 30, 2014, at 3:54 PM, Paul Taylor paul_t...@fastmail.fm wrote: Does StandardTokenizer remove punctuation (in Lucene 4.1) I'm just trying to move back to StandardTokenizer from my own old custom implementation because the newer

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Paul Taylor
On 01/10/2014 08:08, Dawid Weiss wrote: Hi Steve, I have to admit I also find it frequently useful to include punctuation as tokens (even if it's filtered out by subsequent token filters for indexing, it's a useful to-have for other NLP tasks). Do you think it'd be possible (read: relatively

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Michael McCandless
remove punctuation (in Lucene 4.1) I'm just trying to move back to StandardTokenizer from my own old custom implementation because the newer version seems to have much better support for Asian languages. However, this code excerpt fails on incrementToken(), implying that the !!! are removed from

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Steve Rowe
Paul, Boilerplate upgrade recommendation: consider using the most recent Lucene release (4.10.1) - it’s the most stable, performant, and featureful release available, and many bugs have been fixed since the 4.1 release. FYI, StandardTokenizer doesn’t find word boundaries for Chinese, Japanese,

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Paul Taylor
On 01/10/2014 18:42, Steve Rowe wrote: Paul, Boilerplate upgrade recommendation: consider using the most recent Lucene release (4.10.1) - it’s the most stable, performant, and featureful release available, and many bugs have been fixed since the 4.1 release. Yeah sure, I did try this and hit

Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-09-30 Thread Paul Taylor
Does StandardTokenizer remove punctuation (in Lucene 4.1) I'm just trying to move back to StandardTokenizer from my own old custom implementation because the newer version seems to have much better support for Asian languages. However, this code excerpt fails on incrementToken(), implying

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-09-30 Thread Jack Krupansky
4.1) Does StandardTokenizer remove punctuation (in Lucene 4.1) I'm just trying to move back to StandardTokenizer from my own old custom implementation because the newer version seems to have much better support for Asian languages. However, this code excerpt fails on incrementToken(), implying

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-09-30 Thread Steve Rowe
sequences between boundaries that contain letters and/or digits are returned as tokens; all other sequences between boundaries are skipped over and not returned as tokens. Steve On Sep 30, 2014, at 3:54 PM, Paul Taylor paul_t...@fastmail.fm wrote: Does StandardTokenizer remove punctuation
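
For readers skimming the thread, the rule Steve describes (segments between word boundaries are kept only if they contain a letter or digit) can be roughly illustrated without Lucene. The sketch below is an assumption-laden stand-in, not StandardTokenizer itself: it uses the JDK's java.text.BreakIterator, whose word boundaries are similar in spirit to UAX#29 but not guaranteed to match Lucene's, and the class and method names (WordBreakSketch, tokens) are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: approximates the "skip segments without
// letters or digits" rule using the JDK's BreakIterator, NOT Lucene's tokenizer.
class WordBreakSketch {

    // Returns the segments between word boundaries that contain at least one
    // letter or digit; every other segment (punctuation, whitespace) is skipped.
    static List<String> tokens(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Keep the segment only if some code point is a letter or digit.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A run such as "!!!" never contains a letter or digit, so it is dropped.
        System.out.println(tokens("abc !!! def"));
    }
}
```

Under this sketch a run such as "!!!" is never returned, which is consistent with the incrementToken() behaviour Paul reports. Removing the letter-or-digit filter would return punctuation segments as well, along the lines of what Dawid asks about, though whether and how that can be done inside StandardTokenizer is not something this sketch shows.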