Fwd: Standard or Modified UTF-8?

2005-08-27 Thread Marvin Humphrey

Greets,

It was suggested that I move this to the developers list from the  
users list...


-- Marvin Humphrey

Begin forwarded message:

From: Marvin Humphrey [EMAIL PROTECTED]
Date: August 26, 2005 4:51:27 PM PDT
To: java-user@lucene.apache.org
Subject: Standard or Modified UTF-8?
Reply-To: java-user@lucene.apache.org


Greets,

As part of my attempt to speed up Plucene and establish index  
compatibility between Plucene and Java Lucene, I'm porting  
InputStream and OutputStream to XS (the C API for accessing Perl's  
guts), and I believe I have found a documentation bug in the file- 
format spec at...


http://lucene.apache.org/java/docs/fileformats.html

"Lucene writes unicode character sequences using the standard UTF-8  
encoding."


Snooping the code in OutputStream, it looks like you are writing  
modified UTF-8 -- NOT standard -- because a null byte is written  
using the two-byte form.


  } else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
    writeByte((byte)(0xC0 | (code >> 6)));
    writeByte((byte)(0x80 | (code & 0x3F)));
  }

http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
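For illustration, the same two-byte null shows up in Java's own serialization routines. Here's a sketch using `DataOutputStream.writeUTF`, which emits the same modified UTF-8 as the snippet above (the class name is mine, not anything in Lucene):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        String s = "a\u0000b";

        // Standard UTF-8: U+0000 is the single byte 0x00.
        byte[] standard = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(standard.length);  // 3 bytes: 61 00 62

        // Java's modified UTF-8 (as written by writeUTF): U+0000
        // becomes the two-byte sequence 0xC0 0x80.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        new DataOutputStream(baos).writeUTF(s);
        byte[] modified = baos.toByteArray();
        // writeUTF prefixes a 2-byte length, so the null's bytes
        // sit at offsets 3 and 4.
        System.out.printf("%02X %02X%n",
                modified[3] & 0xFF, modified[4] & 0xFF);  // C0 80
    }
}
```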

Can someone please confirm that the intention is to write modified  
UTF-8?


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Fwd: Lucene does NOT use UTF-8.

2005-08-27 Thread Marvin Humphrey

Greets,

Discussion moved from the users list as per suggestion...

-- Marvin Humphrey

Begin forwarded message:

From: Marvin Humphrey [EMAIL PROTECTED]
Date: August 26, 2005 9:18:21 PM PDT
To: java-user@lucene.apache.org, [EMAIL PROTECTED]
Subject: Lucene does NOT use UTF-8.
Reply-To: java-user@lucene.apache.org


Greets,

[crossposted to java-user@lucene.apache.org and [EMAIL PROTECTED]]

I've delved into the matter of Lucene and UTF-8 a little further, and  
I am discouraged by what I believe I've uncovered.


Lucene should not be advertising that it uses standard UTF-8 -- or  
even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8.  The  
two distinguishing characteristics of Modified UTF-8 are the  
treatment of codepoints above the BMP (which are written as surrogate  
pairs), and the encoding of null bytes as 1100 0000 1000 0000 rather  
than 0000 0000.  Both of these became illegal as of Unicode 3.1  
(IIRC), because they are not shortest-form, and non-shortest-form  
UTF-8 presents a security risk.
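The surrogate-pair characteristic is easy to demonstrate from Java itself (a sketch; `writeUTF` produces the same modified encoding discussed here, and the class name is made up):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SupplementaryDemo {
    public static void main(String[] args) throws IOException {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies above the BMP, so
        // Java stores it as a surrogate pair of two chars.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());  // 2 UTF-16 code units

        // Standard UTF-8 encodes it as a single 4-byte sequence.
        System.out.println(
                clef.getBytes(StandardCharsets.UTF_8).length);  // 4

        // Modified UTF-8 encodes each surrogate separately as a
        // 3-byte sequence: 6 bytes total.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        new DataOutputStream(baos).writeUTF(clef);
        System.out.println(baos.size() - 2);  // 6 (minus length prefix)
    }
}
```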


The documentation should really state that Lucene stores strings in a  
Java-only adulteration of UTF-8, unsuitable for interchange.  Since  
Perl uses true shortest-form UTF-8 as its native encoding, Plucene  
would have to jump through two efficiency-killing hoops in order to  
write files that would not choke Lucene: instead of writing out its  
true, legal UTF-8 directly, it would be necessary to first translate  
to UTF-16, then duplicate the Lucene encoding algorithm from  
OutputStream.  In theory.


Below you will find a simple Perl script which illustrates what  
happens when Perl encounters malformed UTF-8.  Run it (you need Perl  
5.8 or higher) and you will see why, even if I thought it was a good  
idea to emulate the Java hack for encoding "Modified UTF-8", trying  
to make it work in practice would be a nightmare.


If Plucene were to write legal UTF-8 strings to its index files, Java  
Lucene would misbehave and possibly blow up any time a string  
contained either a 4-byte character or a null byte.  On the flip  
side, Perl will spew warnings like crazy and possibly blow up  
whenever it encounters a Lucene-encoded null or surrogate pair.  The  
potential blowups are due to the fact that Lucene and Plucene will  
not agree on how many characters a string contains, resulting in  
overruns or underruns.


I am hoping that the answer to this will be a fix to the encoding  
mechanism in Lucene so that it really does use legal UTF-8.  The most  
efficient way to go about this has not yet presented itself.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

#

#!/usr/bin/perl
use strict;
use warnings;

# illegal_null.plx -- Perl complains about non-shortest-form null.

my $data = "foo\xC0\x80\n";

open(my $virtual_filehandle, "+<:utf8", \$data);
print <$virtual_filehandle>;







Re: Lucene does NOT use UTF-8.

2005-08-27 Thread Marvin Humphrey

On Aug 26, 2005, at 10:14 PM, jian chen wrote:

> It seems to me that in theory, Lucene storage code could use true
> UTF-8 to store terms. Maybe it is just a legacy issue that the
> modified UTF-8 is used?


It's not a matter of a simple switch.  The VInt count at the head of  
a Lucene string is not the number of Unicode code points the string  
contains.  It's the number of Java chars necessary to contain that  
string.  Code points above the BMP require 2 java chars, since they  
must be represented by surrogate pairs.  The same code point must be  
represented by one character in legal UTF-8.


If Plucene counts the number of legal UTF-8 characters and assigns  
that number as the VInt at the front of a string, when Java Lucene  
decodes the string it will allocate an array of char which is too  
small to hold the string.
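The mismatch in the two counts can be seen directly (a minimal sketch; the class name is mine):

```java
public class CharCountDemo {
    public static void main(String[] args) {
        // One code point above the BMP plus one ASCII letter.
        String s = new String(Character.toChars(0x1D11E)) + "x";

        // What Lucene's VInt prefix counts: UTF-16 code units
        // (Java chars) -- the supplementary char needs two.
        System.out.println(s.length());  // 3

        // What a shortest-form UTF-8 writer would naturally count:
        // Unicode code points.
        System.out.println(s.codePointCount(0, s.length()));  // 2
    }
}
```

Allocate an array of 2 chars from the second count and the string no longer fits.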


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Lucene does NOT use UTF-8.

2005-08-27 Thread jian chen
Hi, Ken,

Thanks for your email. You are right -- I meant to propose that Lucene 
switch to using true UTF-8, rather than working around this issue by 
fixing the resulting problems elsewhere. 

Also, conforming to standards like UTF-8 will make the code easier for new 
developers to pick up.

Just my 2 cents.

Thanks,

Jian

On 8/27/05, Ken Krugler [EMAIL PROTECTED] wrote:
 
 On Aug 26, 2005, at 10:14 PM, jian chen wrote:
 
 It seems to me that in theory, Lucene storage code could use true UTF-8 
 to
 store terms. Maybe it is just a legacy issue that the modified UTF-8 is
 used?
 
 The use of 0xC0 0x80 to encode the U+0000 Unicode code point is an
 aspect of Java serialization of character streams. Java uses what
 they call a "modified" version of UTF-8, though that's a really bad
 way to describe it. It's a different Unicode encoding, one that
 resembles UTF-8, but that's it.
 
 It's not a matter of a simple switch. The VInt count at the head of
 a Lucene string is not the number of Unicode code points the string
 contains. It's the number of Java chars necessary to contain that
 string. Code points above the BMP require 2 java chars, since they
 must be represented by surrogate pairs. The same code point must be
 represented by one character in legal UTF-8.
 
 If Plucene counts the number of legal UTF-8 characters and assigns
 that number as the VInt at the front of a string, when Java Lucene
 decodes the string it will allocate an array of char which is too
 small to hold the string.
 
 I think Jian was proposing that Lucene switch to using a true UTF-8
 encoding, which would make things a bit cleaner. And probably easier
 than changing all references to CESU-8 :)
 
 And yes, given that the integer count is the number of UTF-16 code
 units required to represent the string, your code will need to do a
 bit more processing when calculating the character count, but that's
 a one-liner, right?
 
 -- Ken
 --
 Ken Krugler
 TransPac Software, Inc.
 http://www.transpac.com
 +1 530-470-9200
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
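Ken's "one-liner" might look something like this in Java (a hedged sketch; the method name and the array-of-code-points representation are my own assumptions, not Lucene or Plucene API):

```java
public class Utf16Count {
    // Counting UTF-16 code units for a run of code points: each
    // code point above the BMP needs a surrogate pair, i.e. two
    // Java chars; everything else needs one.
    static int utf16Length(int[] codePoints) {
        int units = 0;
        for (int cp : codePoints) {
            units += cp > 0xFFFF ? 2 : 1;
        }
        return units;
    }

    public static void main(String[] args) {
        // "A" plus U+1D11E: one BMP char + one surrogate pair.
        System.out.println(utf16Length(new int[] { 0x41, 0x1D11E }));  // 3
    }
}
```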
 



Re: Fwd: Lucene does NOT use UTF-8.

2005-08-27 Thread Daniel Naber
On Saturday 27 August 2005 16:05, Marvin Humphrey wrote:

 Lucene should not be advertising that it uses standard UTF-8 -- or  
 even UTF-8 at all, since Modified UTF-8 is _illegal_ UTF-8.  

For now, I've changed that statement in the file format documentation.

Regards
 Daniel

-- 
http://www.danielnaber.de
