Fwd: Standard or Modified UTF-8?
Greets,

It was suggested that I move this to the developers list from the users list...

-- Marvin Humphrey

Begin forwarded message:

From: Marvin Humphrey [EMAIL PROTECTED]
Date: August 26, 2005 4:51:27 PM PDT
To: java-user@lucene.apache.org
Subject: Standard or Modified UTF-8?
Reply-To: java-user@lucene.apache.org

Greets,

As part of my attempt to speed up Plucene and establish index compatibility between Plucene and Java Lucene, I'm porting InputStream and OutputStream to XS (the C API for accessing Perl's guts), and I believe I have found a documentation bug in the file-format spec at...

http://lucene.apache.org/java/docs/fileformats.html

    "Lucene writes unicode character sequences using the standard UTF-8 encoding."

Snooping the code in OutputStream, it looks like you are writing modified UTF-8 -- NOT standard -- because a null byte is written using the two-byte form:

    } else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
        writeByte((byte)(0xC0 | (code >> 6)));
        writeByte((byte)(0x80 | (code & 0x3F)));

http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

Can someone please confirm that the intention is to write modified UTF-8?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

---
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
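[Editor's note: a minimal sketch of the two encodings Marvin contrasts, using the JDK's own facilities (DataOutputStream.writeUTF for modified UTF-8, a modern-JDK StandardCharsets encoder for standard UTF-8) rather than Lucene's OutputStream; the class and method names are illustrative only.]

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Contrast standard UTF-8 with Java's "modified UTF-8" for U+0000.
public class NullByteDemo {

    // Standard UTF-8: U+0000 encodes as the single byte 0x00.
    public static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // DataOutputStream.writeUTF emits modified UTF-8 after a 2-byte
    // length prefix; strip the prefix to see the payload. U+0000
    // becomes the two-byte sequence 0xC0 0x80 -- the same two-byte
    // form the Lucene snippet above produces for code == 0.
    public static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeUTF(s);
            byte[] withPrefix = bytes.toByteArray();
            byte[] payload = new byte[withPrefix.length - 2];
            System.arraycopy(withPrefix, 2, payload, 0, payload.length);
            return payload;
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen for a byte array
        }
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        System.out.println(standardUtf8(nul).length); // 1 byte:  0x00
        System.out.println(modifiedUtf8(nul).length); // 2 bytes: 0xC0 0x80
    }
}
```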
Fwd: Lucene does NOT use UTF-8.
Greets,

Discussion moved from the users list as per suggestion...

-- Marvin Humphrey

Begin forwarded message:

From: Marvin Humphrey [EMAIL PROTECTED]
Date: August 26, 2005 9:18:21 PM PDT
To: java-user@lucene.apache.org, [EMAIL PROTECTED]
Subject: Lucene does NOT use UTF-8.
Reply-To: java-user@lucene.apache.org

Greets,

[crossposted to java-user@lucene.apache.org and [EMAIL PROTECTED]]

I've delved into the matter of Lucene and UTF-8 a little further, and I am discouraged by what I believe I've uncovered. Lucene should not be advertising that it uses standard UTF-8 -- or even UTF-8 at all, since Modified UTF-8 is _illegal_ UTF-8.

The two distinguishing characteristics of Modified UTF-8 are the treatment of code points above the BMP (which are written as surrogate pairs), and the encoding of the null byte as the two-byte sequence 1100 0000 1000 0000 rather than the single byte 0000 0000. Both of these became illegal as of Unicode 3.1 (IIRC), because they are not shortest-form, and non-shortest-form UTF-8 presents a security risk. The documentation should really state that Lucene stores strings in a Java-only adulteration of UTF-8, unsuitable for interchange.

Since Perl uses true shortest-form UTF-8 as its native encoding, Plucene would have to jump through two efficiency-killing hoops in order to write files that would not choke Lucene: instead of writing out its true, legal UTF-8 directly, it would be necessary to first translate to UTF-16, then duplicate the Lucene encoding algorithm from OutputStream. In theory.

Below you will find a simple Perl script which illustrates what happens when Perl encounters malformed UTF-8. Run it (you need Perl 5.8 or higher) and you will see why, even if I thought it was a good idea to emulate the Java hack for encoding Modified UTF-8, trying to make it work in practice would be a nightmare.

If Plucene were to write legal UTF-8 strings to its index files, Java Lucene would misbehave and possibly blow up any time a string contained either a 4-byte character or a null byte.
On the flip side, Perl will spew warnings like crazy and possibly blow up whenever it encounters a Lucene-encoded null or surrogate pair. The potential blowups are due to the fact that Lucene and Plucene will not agree on how many characters a string contains, resulting in overruns or underruns.

I am hoping that the answer to this will be a fix to the encoding mechanism in Lucene so that it really does use legal UTF-8. The most efficient way to go about this has not yet presented itself.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

    #!/usr/bin/perl
    use strict;
    use warnings;

    # illegal_null.plx -- Perl complains about non-shortest-form null.
    my $data = "foo\xC0\x80\n";
    open (my $virtual_filehandle, "+<:utf8", \$data);
    print <$virtual_filehandle>;
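[Editor's note: the same objection can be seen from the Java side. A sketch, assuming a modern JDK whose strict UTF-8 decoder enforces shortest form: a CharsetDecoder set to REPORT rejects the 0xC0 0x80 sequence for the same reason Perl's :utf8 layer warns about it. The class and method names are illustrative only.]

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// A strict standard-UTF-8 decoder rejects non-shortest-form input
// such as the modified-UTF-8 null (0xC0 0x80).
public class StrictDecodeDemo {

    public static boolean isLegalUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // malformed: not legal UTF-8
        }
    }

    public static void main(String[] args) {
        byte[] shortest = { 0x00 };                   // legal encoding of U+0000
        byte[] overlong = { (byte) 0xC0, (byte) 0x80 }; // modified-UTF-8 null
        System.out.println(isLegalUtf8(shortest)); // true
        System.out.println(isLegalUtf8(overlong)); // false
    }
}
```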
Re: Lucene does NOT use UTF-8.
On Aug 26, 2005, at 10:14 PM, jian chen wrote:

    It seems to me that in theory, Lucene storage code could use true UTF-8 to store terms. Maybe it is just a legacy issue that the modified UTF-8 is used?

It's not a matter of a simple switch. The VInt count at the head of a Lucene string is not the number of Unicode code points the string contains. It's the number of Java chars necessary to contain that string. Code points above the BMP require 2 Java chars, since they must be represented by surrogate pairs. The same code point must be represented by one character in legal UTF-8.

If Plucene counts the number of legal UTF-8 characters and assigns that number as the VInt at the front of a string, when Java Lucene decodes the string it will allocate an array of char which is too small to hold the string.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
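[Editor's note: the three counts Marvin distinguishes can be checked directly in Java with a code point above the BMP, e.g. U+1D11E MUSICAL SYMBOL G CLEF. StandardCharsets is a modern-JDK convenience; the counting behavior itself is as described in the thread.]

```java
import java.nio.charset.StandardCharsets;

// For one supplementary code point: 2 Java chars, 1 code point,
// 4 standard-UTF-8 bytes. The VInt Lucene writes is the first count,
// not the second.
public class CountDemo {
    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E)); // U+1D11E

        System.out.println(clef.length());                         // 2 (surrogate pair)
        System.out.println(clef.codePointCount(0, clef.length())); // 1 code point
        System.out.println(clef.getBytes(StandardCharsets.UTF_8).length); // 4 bytes
    }
}
```

If Plucene wrote "1" (the code point count) as the VInt here, a decoder expecting the Java char count would under-allocate, which is exactly the overrun Marvin describes.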
Re: Lucene does NOT use UTF-8.
Hi, Ken,

Thanks for your email. You are right, I meant to propose that Lucene switch to using true UTF-8, rather than having to work around this issue by fixing the problems it causes elsewhere. Also, conforming to standards like UTF-8 will make the code easier for new developers to pick up. Just my 2 cents.

Thanks,
Jian

On 8/27/05, Ken Krugler [EMAIL PROTECTED] wrote:

        On Aug 26, 2005, at 10:14 PM, jian chen wrote:

            It seems to me that in theory, Lucene storage code could use true UTF-8 to store terms. Maybe it is just a legacy issue that the modified UTF-8 is used?

    The use of 0xC0 0x80 to encode the U+0000 Unicode code point is an aspect of Java serialization of character streams. Java uses what they call a "modified" version of UTF-8, though that's a really bad way to describe it. It's a different Unicode encoding, one that resembles UTF-8, but that's it.

        It's not a matter of a simple switch. The VInt count at the head of a Lucene string is not the number of Unicode code points the string contains. It's the number of Java chars necessary to contain that string. Code points above the BMP require 2 Java chars, since they must be represented by surrogate pairs. The same code point must be represented by one character in legal UTF-8. If Plucene counts the number of legal UTF-8 characters and assigns that number as the VInt at the front of a string, when Java Lucene decodes the string it will allocate an array of char which is too small to hold the string.

    I think Jian was proposing that Lucene switch to using a true UTF-8 encoding, which would make things a bit cleaner. And probably easier than changing all references to CEUS-8 :)

    And yes, given that the integer count is the number of UTF-16 code units required to represent the string, your code will need to do a bit more processing when calculating the character count, but that's a one-liner, right?

    -- Ken

    --
    Ken Krugler
    TransPac Software, Inc.
    http://www.transpac.com
    +1 530-470-9200
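[Editor's note: a sketch of what Ken's "one-liner" might look like when writing from a representation that already knows its code points, as Plucene would. The function name is hypothetical; the rule itself -- a code point above the BMP costs two UTF-16 code units, anything else costs one -- is standard Unicode arithmetic.]

```java
// Compute the UTF-16 code-unit count (what Lucene's VInt prefix
// expects) directly from code points, without building a UTF-16 string.
public class Utf16CountDemo {

    public static int utf16Units(int[] codePoints) {
        int units = 0;
        for (int cp : codePoints) {
            units += (cp > 0xFFFF) ? 2 : 1; // supplementary -> surrogate pair
        }
        return units;
    }

    public static void main(String[] args) {
        int[] codePoints = { 'f', 'o', 'o', 0x1D11E }; // "foo" + U+1D11E
        System.out.println(utf16Units(codePoints));    // 5: three BMP chars + one pair
    }
}
```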
Re: Fwd: Lucene does NOT use UTF-8.
On Saturday 27 August 2005 16:05, Marvin Humphrey wrote:

    Lucene should not be advertising that it uses standard UTF-8 -- or even UTF-8 at all, since Modified UTF-8 is _illegal_ UTF-8.

For now, I've changed that information in the file format documentation.

Regards
Daniel

--
http://www.danielnaber.de