Fwd: Lucene does NOT use UTF-8.

2005-08-27 Thread Marvin Humphrey

Greets,

Discussion moved from the users list as per suggestion...

-- Marvin Humphrey

Begin forwarded message:

From: Marvin Humphrey [EMAIL PROTECTED]
Date: August 26, 2005 9:18:21 PM PDT
To: java-user@lucene.apache.org, [EMAIL PROTECTED]
Subject: Lucene does NOT use UTF-8.
Reply-To: java-user@lucene.apache.org


Greets,

[crossposted to java-user@lucene.apache.org and [EMAIL PROTECTED]]

I've delved into the matter of Lucene and UTF-8 a little further, and  
I am discouraged by what I believe I've uncovered.


Lucene should not be advertising that it uses standard UTF-8 -- or  
even UTF-8 at all, since Modified UTF-8 is _illegal_ UTF-8.  The  
two distinguishing characteristics of Modified UTF-8 are the  
treatment of codepoints above the BMP (which are written as surrogate  
pairs), and the encoding of null bytes as 1100 0000 1000 0000 (that  
is, 0xC0 0x80) rather than 0000 0000.  Both of these became illegal  
as of Unicode 3.1  
(IIRC), because they are not shortest-form and non-shortest-form  
UTF-8 presents a security risk.
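
To make the difference concrete, here's a quick sketch -- my own  
illustration, nothing from the Lucene source -- that prints the legal  
UTF-8 bytes for U+0000 and U+1D11E next to the bytes the modified  
scheme produces (which I've written out by hand from the Java spec):

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Legal, shortest-form UTF-8, as Perl's Encode produces it:
#   U+0000  -> 00
#   U+1D11E -> F0 9D 84 9E   (one 4-byte sequence)
my $legal = encode("UTF-8", "\x{0000}\x{1D11E}");
print "legal UTF-8:    ",
    join(" ", map { sprintf "%02X", $_ } unpack("C*", $legal)), "\n";

# What Modified UTF-8 writes instead:
#   U+0000  -> C0 80              (non-shortest-form null)
#   U+1D11E -> ED A0 B4 ED B4 9E  (surrogate pair D834/DD1E,
#                                  three bytes per surrogate)
my $modified = "\xC0\x80\xED\xA0\xB4\xED\xB4\x9E";
print "modified UTF-8: ",
    join(" ", map { sprintf "%02X", $_ } unpack("C*", $modified)), "\n";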


The documentation should really state that Lucene stores strings in a  
Java-only adulteration of UTF-8, unsuitable for interchange.  Since  
Perl uses true shortest-form UTF-8 as its native encoding, Plucene  
would have to jump through two efficiency-killing hoops in order to  
write files that would not choke Lucene: instead of writing out its  
true, legal UTF-8 directly, it would be necessary to first translate  
to UTF-16, then duplicate the Lucene encoding algorithm from  
OutputStream.  In theory.


Below you will find a simple Perl script which illustrates what  
happens when Perl encounters malformed UTF-8.  Run it (you need Perl  
5.8 or higher) and you will see why, even if I thought it was a good  
idea to emulate the Java hack for encoding Modified UTF-8, trying  
to make it work in practice would be a nightmare.


If Plucene were to write legal UTF-8 strings to its index files, Java  
Lucene would misbehave and possibly blow up any time a string  
contained either a 4-byte character or a null byte.  On the flip  
side, Perl will spew warnings like crazy and possibly blow up  
whenever it encounters a Lucene-encoded null or surrogate pair.  The  
potential blowups are due to the fact that Lucene and Plucene will  
not agree on how many characters a string contains, resulting in  
overruns or underruns.
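
The root of the disagreement is easy to show from the Perl side  
(again, my own example, not code from either project): Perl counts  
codepoints, Java counts UTF-16 code units.

#!/usr/bin/perl
use strict;
use warnings;

my $clef = "\x{1D11E}";        # one codepoint above the BMP
print length($clef), "\n";     # Perl says 1 character

# Java's String.length() reports 2 for the same text, because it
# counts the surrogate pair D834 DD1E as two chars.  Any character
# count written by one side and trusted by the other is off by one
# per such character -- hence the overruns and underruns.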


I am hoping that the answer to this will be a fix to the encoding  
mechanism in Lucene so that it really does use legal UTF-8.  The most  
efficient way to go about this has not yet presented itself.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

#

#!/usr/bin/perl
use strict;
use warnings;

# illegal_null.plx -- Perl complains about non-shortest-form null.

my $data = "foo\xC0\x80\n";

# Read the malformed bytes back in through the :utf8 layer.
open (my $virtual_filehandle, "+<:utf8", \$data);
print <$virtual_filehandle>;







Re: Fwd: Lucene does NOT use UTF-8.

2005-08-27 Thread Daniel Naber
On Saturday 27 August 2005 16:05, Marvin Humphrey wrote:

> Lucene should not be advertising that it uses standard UTF-8 -- or
> even UTF-8 at all, since Modified UTF-8 is _illegal_ UTF-8.

For now, I've changed that information in the file format documentation.

Regards
 Daniel

-- 
http://www.danielnaber.de
