Paul,
You should also check out ICUTokenizer/DefaultICUTokenizerConfig, which, in
addition to conforming to the UAX#29 Word Break rules, adds better handling
for some languages and also finds token boundaries when the writing system
(a.k.a. script) changes. This is intended to be extensible per script.
The root […] are skipped over and not returned as tokens.
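Minimal usage looks something like this (a sketch only, assuming Lucene 4.x
with the lucene-analyzers-icu module on the classpath; the input text is
illustrative):

import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ICUTokenizerSketch {
  public static void main(String[] args) throws Exception {
    // Default config = DefaultICUTokenizerConfig: per-script rules, plus a
    // token boundary wherever the script changes in the input.
    Tokenizer tok = new ICUTokenizer(new StringReader("testing テスト 测试"));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term.toString());  // one token per line
    }
    tok.end();
    tok.close();
  }
}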
Steve
On Sep 30, 2014, at 3:54 PM, Paul Taylor paul_t...@fastmail.fm wrote:
> Does StandardTokenizer remove punctuation (in Lucene 4.1)
> I'm just trying to move back to StandardTokenizer from my own old custom
> implementation because the newer […]
On 01/10/2014 08:08, Dawid Weiss wrote:
Hi Steve,
I have to admit I also find it frequently useful to include
punctuation as tokens (even if it's filtered out by subsequent token
filters for indexing, it's useful to have for other NLP tasks). Do
you think it'd be possible (read: relatively […]
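One way to sketch that today with stock classes, rather than a change to
StandardTokenizer: PatternTokenizer emitting punctuation runs as tokens, plus
a FilteringTokenFilter dropping them again at indexing time. The class name
and regex below are illustrative only, assuming the Lucene 4.10 API:

import java.io.StringReader;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.PatternTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.FilteringTokenFilter;
import org.apache.lucene.util.Version;

public class KeepPunctuationSketch {

  // Drops punctuation-only tokens for indexing; upstream, the tokenizer
  // still emits them for other NLP consumers.
  static class DropPunctuationFilter extends FilteringTokenFilter {
    private final CharTermAttribute term = addAttribute(CharTermAttribute.class);
    DropPunctuationFilter(TokenStream in) { super(Version.LUCENE_4_10_1, in); }
    @Override
    protected boolean accept() {
      return !term.toString().matches("\\p{Punct}+");
    }
  }

  public static void main(String[] args) throws Exception {
    // Group 0 = every match becomes a token; the pattern matches word runs
    // OR punctuation runs, so "!!!" comes through the tokenizer as a token.
    Pattern p = Pattern.compile("\\w+|\\p{Punct}+");
    Tokenizer tok = new PatternTokenizer(new StringReader("Hello, world!!!"), p, 0);
    TokenStream ts = new DropPunctuationFilter(tok);
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());  // Hello / world
    }
    ts.end();
    ts.close();
  }
}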
> Does StandardTokenizer remove punctuation (in Lucene 4.1)
> I'm just trying to move back to StandardTokenizer from my own old custom
> implementation because the newer version seems to have much better support
> for Asian languages.
> However this code excerpt fails on incrementToken(), implying that the !!!
> are removed from […]
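A minimal sketch of code along those lines, assuming the Lucene 4.1 API -
with punctuation-only input the very first incrementToken() call returns
false:

import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class PunctuationOnlySketch {
  public static void main(String[] args) throws Exception {
    Tokenizer tok = new StandardTokenizer(Version.LUCENE_41, new StringReader("!!!"));
    tok.reset();  // required before the first incrementToken() in 4.x
    System.out.println(tok.incrementToken());  // false: "!!!" yields no token
    tok.end();
    tok.close();
  }
}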
Paul,
Boilerplate upgrade recommendation: consider using the most recent Lucene
release (4.10.1) - it’s the most stable, performant, and featureful release
available, and many bugs have been fixed since the 4.1 release.
FYI, StandardTokenizer doesn’t find word boundaries for Chinese, Japanese, […]
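In practice that means one token per ideograph rather than real words. A
sketch, assuming the Lucene 4.10 API (the sample text is illustrative):

import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

public class CjkUnigramSketch {
  public static void main(String[] args) throws Exception {
    Tokenizer tok = new StandardTokenizer(Version.LUCENE_4_10_1,
        new StringReader("我购买了道具"));  // Chinese: "I bought props"
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    TypeAttribute type = tok.addAttribute(TypeAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      // Each Han character comes back as its own <IDEOGRAPHIC> token.
      System.out.println(term.toString() + " " + type.type());
    }
    tok.end();
    tok.close();
  }
}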
On 01/10/2014 18:42, Steve Rowe wrote:
> Paul,
> Boilerplate upgrade recommendation: consider using the most recent Lucene
> release (4.10.1) - it’s the most stable, performant, and featureful release
> available, and many bugs have been fixed since the 4.1 release.
Yeah sure, I did try this and hit […]
> Does StandardTokenizer remove punctuation (in Lucene 4.1)
> I'm just trying to move back to StandardTokenizer from my own old custom
> implementation because the newer version seems to have much better
> support for Asian languages.
> However this code excerpt fails on incrementToken(), implying […]
[…] sequences between boundaries that contain letters and/or digits are
returned as tokens; all other sequences between boundaries are skipped over
and not returned as tokens.
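For example (a sketch assuming the Lucene 4.1 API), only the letter/digit
sequences come back from mixed input:

import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StandardTokenizerSketch {
  public static void main(String[] args) throws Exception {
    Tokenizer tok = new StandardTokenizer(Version.LUCENE_41,
        new StringReader("Hello, world!!! 42"));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term.toString());  // Hello / world / 42
    }
    tok.end();
    tok.close();
  }
}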
Steve
On Sep 30, 2014, at 3:54 PM, Paul Taylor paul_t...@fastmail.fm wrote:
> Does StandardTokenizer remove punctuation (in Lucene 4.1) […]