On Saturday 27 August 2005 16:05, Marvin Humphrey wrote:
> Lucene should not be advertising that it uses "standard UTF-8" -- or
> even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8.
For now, I've changed that information in the file format documentation.
Regards
Daniel
--
http
Hi, Ken,
Thanks for your email. You are right, I meant to propose that Lucene
switch to true UTF-8, rather than working around this issue by
fixing the resulting problems elsewhere.
Also, conforming to standards like UTF-8 will make the code easier for new
developers to pick up.
On Aug 26, 2005, at 10:14 PM, jian chen wrote:
It seems to me that in theory, Lucene storage code could use true UTF-8 to
store terms. Maybe it is just a legacy issue that the modified UTF-8 is
used?
The use of 0xC0 0x80 to encode a U+0000 Unicode code point is an
aspect of Java serialization
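
To make the 0xC0 0x80 point concrete, here is a minimal sketch using only
the JDK. It is not Lucene's own writer code: it assumes that
DataOutputStream.writeUTF is a fair stand-in for the "modified UTF-8"
under discussion, and the class and method names (ModifiedUtf8Demo,
modifiedUtf8, standardUtf8, isValidUtf8) are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Lucene.
public class ModifiedUtf8Demo {

    // Encode with writeUTF (modified UTF-8) and drop its 2-byte length prefix.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Encode with the JDK's standard UTF-8 charset.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // A fresh decoder reports malformed input, so decode() throws on bad bytes.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Render bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String nul = "\0"; // a one-char string holding U+0000
        System.out.println("modified: " + hex(modifiedUtf8(nul)));  // C0 80
        System.out.println("standard: " + hex(standardUtf8(nul)));  // 00
        System.out.println("modified bytes valid UTF-8? "
                + isValidUtf8(modifiedUtf8(nul)));                  // false
    }
}
```

For plain ASCII text the two encoders produce identical bytes, so the
difference only shows up with input such as U+0000.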
On Aug 26, 2005, at 10:14 PM, jian chen wrote:
It seems to me that in theory, Lucene storage code could use true UTF-8 to
store terms. Maybe it is just a legacy issue that the modified UTF-8 is
used?
It's not a matter of a simple switch. The VInt count at the head of
a Lucene string is
Greets,
Discussion moved from the users list as per suggestion...
-- Marvin Humphrey
Begin forwarded message:
From: Marvin Humphrey <[EMAIL PROTECTED]>
Date: August 26, 2005 9:18:21 PM PDT
To: java-user@lucene.apache.org, [EMAIL PROTECTED]
Subject: Lucene does NOT use UTF-8.
Reply-To: java-use
Greets,
It was suggested that I move this to the developers list from the
users list...
-- Marvin Humphrey
Begin forwarded message:
From: Marvin Humphrey <[EMAIL PROTECTED]>
Date: August 26, 2005 4:51:27 PM PDT
To: java-user@lucene.apache.org
Subject: Standard or Modified UTF-8?
Reply-To: j