Thanks to all for clearing this up. It seems we are still quite far
away from full Unicode support :-(
As to the questions about the encoding raised in previous messages,
all of the other characters in the documents come through without a
glitch, so there is definitely no general encoding issue involved.
What was the actual format of the Extension B characters in the XML
being posted?
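To illustrate what I'm asking: a character such as U+20000 could show
up in the XML either as the raw character (four UTF-8 bytes on the
wire) or as a numeric character reference. A rough Java sketch of the
two forms (hypothetical illustration only, and it assumes Java 1.5's
Character.toChars(), since 1.4 has no code point API):

    // Hypothetical illustration - two ways U+20000 might appear in the posted XML.
    int cp = 0x20000;                               // first CJK Ideograph Extension B code point
    String raw = new String(Character.toChars(cp)); // the literal character (a surrogate pair in Java)
    String ncr = "&#x20000;";                       // XML numeric character reference form
    System.out.println(raw.length());               // prints 2 - stored as a UTF-16 surrogate pair
    System.out.println(ncr);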
-- Ken
Erik Hatcher wrote:
Wow - great stuff, Steve!
As for StandardTokenizer and the Java version - no worries there,
really: Solr itself requires Java 1.5+, so when such a tokenizer
becomes available it could be used just fine in Solr even if it
isn't built into a core Lucene release for a while.
Erik
On Feb 28, 2008, at 12:08 PM, Steven A Rowe wrote:
On 02/28/2008 at 11:26 AM, Ken Krugler wrote:
And as Erik mentioned, it appears that line 114 of
StandardTokenizerImpl.jflex:
http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
needs to be updated to include the Extension B character range.
JFlex 1.4.1 (the latest release) does not support supplementary
code points (those above the BMP - Basic Multilingual Plane:
[U+0000-U+FFFF]), and CJK Ideograph Extension B is definitely a
supplementary range - see the first column from
<http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt>
(the extent of this range is unchanged through the latest [beta]
version, 5.1.0):
20000;<CJK Ideograph Extension B, First> ...
2A6D6;<CJK Ideograph Extension B, Last> ...
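As a quick sanity check - this is just an illustrative sketch, not
anything from JFlex itself - Java stores such code points as UTF-16
surrogate pairs, which is why a scanner that only sees 16-bit chars
cannot match them as single characters (the calls below are Java
1.5+ APIs):

    // Illustrative only: U+20000 (CJK Ideograph Extension B, First) in Java.
    int first = 0x20000;
    System.out.println(Character.isSupplementaryCodePoint(first)); // true - above the BMP
    char[] pair = Character.toChars(first);                        // high + low surrogate
    System.out.println(pair.length);                               // 2
    String s = new String(pair);
    System.out.println(s.codePointAt(0) == first);                 // true, but s.charAt(0) is only the high surrogate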
I am working with Gerwin Klein on the development version of
JFlex, and am hoping to get Level 1 [Regular Expression] Basic
Unicode Support into the next release (see
<http://unicode.org/reports/tr18/>) - among other things, this
entails accepting supplementary code points.
However, the next release of JFlex will require Java 1.5+, and
Lucene 2.X requires Java 1.4, so until Lucene reaches release 3.0
and begins requiring Java 1.5 (and Solr incorporates it), JFlex
support of supplementary code points is moot.
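For concreteness, the kind of thing a generated scanner would have to
do on 1.5 - a hypothetical sketch, not actual JFlex output - is walk
the input by code point rather than by char:

    // Hypothetical sketch of code-point-wise iteration (Java 1.5+ API only).
    String input = "...";                        // whatever text is being tokenized
    for (int i = 0; i < input.length(); ) {
        int cp = input.codePointAt(i);           // full code point, even above the BMP
        // match cp against the lexer's character classes here
        i += Character.charCount(cp);            // advance 1 or 2 chars depending on surrogates
    }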
In short, it'll probably be at least a year before the
StandardTokenizer can be modified to accept supplementary
characters, given the processes involved.
Steve
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"