Fwd: Standard or Modified UTF-8?

2005-08-27 Thread Marvin Humphrey
Greets, It was suggested that I move this to the developers list from the users list... -- Marvin Humphrey Begin forwarded message: From: Marvin Humphrey [EMAIL PROTECTED] Date: August 26, 2005 4:51:27 PM PDT To: java-user@lucene.apache.org Subject: Standard or Modified UTF-8? Reply-To:

Fwd: Lucene does NOT use UTF-8.

2005-08-27 Thread Marvin Humphrey
Greets, Discussion moved from the users list as per suggestion... -- Marvin Humphrey Begin forwarded message: From: Marvin Humphrey [EMAIL PROTECTED] Date: August 26, 2005 9:18:21 PM PDT To: java-user@lucene.apache.org, [EMAIL PROTECTED] Subject: Lucene does NOT use UTF-8. Reply-To:

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread Marvin Humphrey
On Aug 26, 2005, at 10:14 PM, jian chen wrote: It seems to me that in theory, Lucene storage code could use true UTF-8 to store terms. Maybe it is just a legacy issue that the modified UTF-8 is used? It's not a matter of a simple switch. The VInt count at the head of a Lucene string is

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread jian chen
Hi, Ken, Thanks for your email. You are right, I meant to propose that Lucene switch to true UTF-8, rather than having to work around this issue by fixing the resulting problems elsewhere. Also, conforming to standards like UTF-8 will make the code easier for new developers to pick up.

Re: Fwd: Lucene does NOT use UTF-8.

2005-08-27 Thread Daniel Naber
On Saturday 27 August 2005 16:05, Marvin Humphrey wrote: Lucene should not be advertising that it uses standard UTF-8 -- or even UTF-8 at all, since Modified UTF-8 is _illegal_ UTF-8. For now, I've changed the information about the file format documentation. Regards Daniel --
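
For anyone following the thread who wants to see the difference between standard and Modified UTF-8 in bytes, here is a minimal sketch. It assumes the JDK's own Modified UTF-8 writer (DataOutputStream.writeUTF) as a stand-in; it is not Lucene's string-writing code, and the class and method names (Utf8Compare, standardUtf8, modifiedUtf8, hex) are made up for illustration. It encodes U+0000 and a character outside the Basic Multilingual Plane both ways.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: compares the JDK's standard UTF-8 encoder with the
// Modified UTF-8 produced by DataOutputStream.writeUTF. Not Lucene code.
public class Utf8Compare {

    // Standard UTF-8, as defined by the Unicode standard.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by java.io.DataOutput. writeUTF prepends a
    // two-byte length, which is dropped here so only the payload is compared.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = "\0";                                        // U+0000
        String clef = new String(Character.toChars(0x1D11E));     // outside the BMP

        System.out.println("U+0000  standard: " + hex(standardUtf8(nul)));
        System.out.println("U+0000  modified: " + hex(modifiedUtf8(nul)));
        System.out.println("U+1D11E standard: " + hex(standardUtf8(clef)));
        System.out.println("U+1D11E modified: " + hex(modifiedUtf8(clef)));
    }
}
```

In the modified form, U+0000 becomes the two-byte sequence C0 80 instead of a single 00 byte, and U+1D11E is written as two three-byte surrogate encodings (ED A0 B4 ED B4 9E) instead of the four-byte F0 9D 84 9E. A strict UTF-8 decoder rejects both modified sequences, which is the sense in which Modified UTF-8 is not legal UTF-8.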