Re: Fwd: Lucene does NOT use UTF-8.

2005-08-27 Thread Daniel Naber
On Saturday 27 August 2005 16:05, Marvin Humphrey wrote:

> Lucene should not be advertising that it uses "standard UTF-8" -- or  
> even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8.  

For now, I've changed that information in the file format documentation.

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: Lucene does NOT use UTF-8.

2005-08-27 Thread jian chen
Hi, Ken,

Thanks for your email. You are right, I meant to propose that Lucene 
switch to true UTF-8, rather than working around this issue by fixing 
the resulting problems elsewhere. 

Also, conforming to standards like UTF-8 will make the code easier for new 
developers to pick up.

Just my 2 cents.

Thanks,

Jian

On 8/27/05, Ken Krugler <[EMAIL PROTECTED]> wrote:
> 
> >On Aug 26, 2005, at 10:14 PM, jian chen wrote:
> >
> >>It seems to me that in theory, Lucene storage code could use true UTF-8 to
> >>store terms. Maybe it is just a legacy issue that the modified UTF-8 is
> >>used?
> 
> The use of 0xC0 0x80 to encode a U+0000 Unicode code point is an
> aspect of Java serialization of character streams. Java uses what
> they call "a modified version of UTF-8", though that's a really bad
> way to describe it. It's a different Unicode encoding, one that
> resembles UTF-8, but that's it.
> 
> >It's not a matter of a simple switch. The VInt count at the head of
> >a Lucene string is not the number of Unicode code points the string
> >contains. It's the number of Java chars necessary to contain that
> >string. Code points above the BMP require 2 java chars, since they
> >must be represented by surrogate pairs. The same code point must be
> >represented by one character in legal UTF-8.
> >
> >If Plucene counts the number of legal UTF-8 characters and assigns
> >that number as the VInt at the front of a string, when Java Lucene
> >decodes the string it will allocate an array of char which is too
> >small to hold the string.
> 
> I think Jian was proposing that Lucene switch to using a true UTF-8
> encoding, which would make things a bit cleaner. And probably easier
> than changing all references to CESU-8 :)
> 
> And yes, given that the integer count is the number of UTF-16 code
> units required to represent the string, your code will need to do a
> bit more processing when calculating the character count, but that's
> a one-liner, right?
> 
> -- Ken
> --
> Ken Krugler
> TransPac Software, Inc.
> 
> +1 530-470-9200
> 


Re: Lucene does NOT use UTF-8.

2005-08-27 Thread Ken Krugler

>On Aug 26, 2005, at 10:14 PM, jian chen wrote:
>
>>It seems to me that in theory, Lucene storage code could use true UTF-8 to
>>store terms. Maybe it is just a legacy issue that the modified UTF-8 is
>>used?


The use of 0xC0 0x80 to encode a U+0000 Unicode code point is an 
aspect of Java serialization of character streams. Java uses what 
they call "a modified version of UTF-8", though that's a really bad 
way to describe it. It's a different Unicode encoding, one that 
resembles UTF-8, but that's it.


>It's not a matter of a simple switch.  The VInt count at the head of
>a Lucene string is not the number of Unicode code points the string
>contains.  It's the number of Java chars necessary to contain that
>string.  Code points above the BMP require 2 java chars, since they
>must be represented by surrogate pairs.  The same code point must be
>represented by one character in legal UTF-8.
>
>If Plucene counts the number of legal UTF-8 characters and assigns
>that number as the VInt at the front of a string, when Java Lucene
>decodes the string it will allocate an array of char which is too
>small to hold the string.


I think Jian was proposing that Lucene switch to using a true UTF-8 
encoding, which would make things a bit cleaner. And probably easier 
than changing all references to CESU-8 :)


And yes, given that the integer count is the number of UTF-16 code 
units required to represent the string, your code will need to do a 
bit more processing when calculating the character count, but that's 
a one-liner, right?
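
In Java 5+, that count conversion really is a one-liner; here is a 
sketch, with an invented sample string (U+1D11E lies outside the BMP):

  String s = "clef: \uD834\uDD1E";                    // 7 code points, 8 Java chars
  int utf16Units = s.length();                        // 8 -- what the VInt records
  int codePoints = s.codePointCount(0, s.length());   // 7 -- what legal UTF-8 would count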


-- Ken
--
Ken Krugler
TransPac Software, Inc.

+1 530-470-9200




Re: Lucene does NOT use UTF-8.

2005-08-27 Thread Marvin Humphrey

On Aug 26, 2005, at 10:14 PM, jian chen wrote:

>It seems to me that in theory, Lucene storage code could use true UTF-8 to
>store terms. Maybe it is just a legacy issue that the modified UTF-8 is
>used?


It's not a matter of a simple switch.  The VInt count at the head of  
a Lucene string is not the number of Unicode code points the string  
contains.  It's the number of Java chars necessary to contain that  
string.  Code points above the BMP require 2 java chars, since they  
must be represented by surrogate pairs.  The same code point must be  
represented by one character in legal UTF-8.


If Plucene counts the number of legal UTF-8 characters and assigns  
that number as the VInt at the front of a string, when Java Lucene  
decodes the string it will allocate an array of char which is too  
small to hold the string.
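
A minimal Java sketch of that mismatch, using an invented string that 
holds a single code point above the BMP:

  String s = "\uD834\uDD1E";                          // U+1D11E: one code point, two Java chars
  int codePoints = s.codePointCount(0, s.length());   // 1 -- what a true-UTF-8 writer would count
  int javaChars  = s.length();                        // 2 -- what Java Lucene's reader expects
  char[] buf = new char[codePoints];                  // too small: s.getChars(0, javaChars, buf, 0) overflows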


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Fwd: Lucene does NOT use UTF-8.

2005-08-27 Thread Marvin Humphrey

Greets,

Discussion moved from the users list as per suggestion...

-- Marvin Humphrey

Begin forwarded message:

From: Marvin Humphrey <[EMAIL PROTECTED]>
Date: August 26, 2005 9:18:21 PM PDT
To: java-user@lucene.apache.org, [EMAIL PROTECTED]
Subject: Lucene does NOT use UTF-8.
Reply-To: java-user@lucene.apache.org


Greets,

[crossposted to java-user@lucene.apache.org and [EMAIL PROTECTED]]

I've delved into the matter of Lucene and UTF-8 a little further, and  
I am discouraged by what I believe I've uncovered.


Lucene should not be advertising that it uses "standard UTF-8" -- or  
even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8.  The  
two distinguishing characteristics of "Modified UTF-8" are the  
treatment of codepoints above the BMP (which are written as surrogate  
pairs), and the encoding of null bytes as 1100 0000 1000 0000 rather  
than 0000 0000.  Both of these became illegal as of Unicode 3.1  
(IIRC), because they are not shortest-form and non-shortest-form  
UTF-8 presents a security risk.
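
A small Java illustration of those two differences, using an invented 
demo class (writeUTF emits Java's modified form; getBytes("UTF-8") 
emits the standard form):

  import java.io.ByteArrayOutputStream;
  import java.io.DataOutputStream;

  public class ModifiedUtf8Demo {
      public static void main(String[] args) throws Exception {
          String s = "\0\uD834\uDD1E";        // U+0000 plus U+1D11E (above the BMP)

          // Standard UTF-8: the null is a single 0x00 byte and U+1D11E
          // becomes the four bytes F0 9D 84 9E, so 5 bytes in total.
          byte[] standard = s.getBytes("UTF-8");

          // Modified UTF-8 (what writeUTF produces): the null becomes C0 80,
          // and each surrogate of U+1D11E is encoded separately as three
          // bytes (ED A0 B4 ED B4 9E), so 8 bytes after the length prefix.
          ByteArrayOutputStream bytes = new ByteArrayOutputStream();
          new DataOutputStream(bytes).writeUTF(s);
          byte[] modified = bytes.toByteArray();

          System.out.println(standard.length + " vs " + (modified.length - 2));
      }
  }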


The documentation should really state that Lucene stores strings in a  
Java-only adulteration of UTF-8, unsuitable for interchange.  Since  
Perl uses true shortest-form UTF-8 as its native encoding, Plucene  
would have to jump through two efficiency-killing hoops in order to  
write files that would not choke Lucene: instead of writing out its  
true, legal UTF-8 directly, it would be necessary to first translate  
to UTF-16, then duplicate the Lucene encoding algorithm from  
OutputStream.  In theory.


Below you will find a simple Perl script which illustrates what  
happens when Perl encounters malformed UTF-8.  Run it (you need Perl  
5.8 or higher) and you will see why even if I thought it was a good  
idea to emulate the Java hack for encoding "Modified UTF-8", trying  
to make it work in practice would be a nightmare.


If Plucene were to write legal UTF-8 strings to its index files, Java  
Lucene would misbehave and possibly blow up any time a string  
contained either a 4-byte character or a null byte.  On the flip  
side, Perl will spew warnings like crazy and possibly blow up  
whenever it encounters a Lucene-encoded null or surrogate pair.  The  
potential blowups are due to the fact that Lucene and Plucene will  
not agree on how many characters a string contains, resulting in  
overruns or underruns.


I am hoping that the answer to this will be a fix to the encoding  
mechanism in Lucene so that it really does use legal UTF-8.  The most  
efficient way to go about this has not yet presented itself.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

#

#!/usr/bin/perl
use strict;
use warnings;

# illegal_null.plx -- Perl complains about non-shortest-form null.

my $data = "foo\xC0\x80\n";

open (my $virtual_filehandle, "+<:utf8", \$data);
print <$virtual_filehandle>;







Fwd: Standard or Modified UTF-8?

2005-08-27 Thread Marvin Humphrey

Greets,

It was suggested that I move this to the developers list from the  
users list...


-- Marvin Humphrey

Begin forwarded message:

From: Marvin Humphrey <[EMAIL PROTECTED]>
Date: August 26, 2005 4:51:27 PM PDT
To: java-user@lucene.apache.org
Subject: Standard or Modified UTF-8?
Reply-To: java-user@lucene.apache.org


Greets,

As part of my attempt to speed up Plucene and establish index  
compatibility between Plucene and Java Lucene, I'm porting  
InputStream and OutputStream to XS (the C API for accessing Perl's  
guts), and I believe I have found a documentation bug in the file- 
format spec at...


http://lucene.apache.org/java/docs/fileformats.html

"Lucene writes unicode character sequences using the standard UTF-8  
encoding."


Snooping the code in OutputStream, it looks like you are writing  
modified UTF-8 -- NOT standard -- because a null byte is written  
using the two-byte form.


  else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
    writeByte((byte)(0xC0 | (code >> 6)));
    writeByte((byte)(0x80 | (code & 0x3F)));
  }

http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
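
For comparison, a writer producing standard UTF-8 would keep the null in 
the one-byte branch; a rough sketch, not the actual Lucene source (the 
three-byte case for the rest of the BMP is omitted):

  if (code <= 0x7F) {                  // includes code == 0: null stays a single 0x00 byte
    writeByte((byte)code);
  } else if (code <= 0x7FF) {
    writeByte((byte)(0xC0 | (code >> 6)));
    writeByte((byte)(0x80 | (code & 0x3F)));
  }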

Can someone please confirm that the intention is to write modified  
UTF-8?


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
