Greets,
It was suggested that I move this to the developers list from the
users list...
-- Marvin Humphrey
Begin forwarded message:
From: Marvin Humphrey [EMAIL PROTECTED]
Date: August 26, 2005 4:51:27 PM PDT
To: java-user@lucene.apache.org
Subject: Standard or Modified UTF-8?
Greets,
Discussion moved from the users list as per suggestion...
-- Marvin Humphrey
Begin forwarded message:
From: Marvin Humphrey [EMAIL PROTECTED]
Date: August 26, 2005 9:18:21 PM PDT
To: java-user@lucene.apache.org, [EMAIL PROTECTED]
Subject: Lucene does NOT use UTF-8.
On Aug 26, 2005, at 10:14 PM, jian chen wrote:
It seems to me that in theory, Lucene storage code could use true UTF-8 to
store terms. Maybe it is just a legacy issue that the modified UTF-8 is
used?
It's not a matter of a simple switch. The VInt count at the head of
a Lucene string is
Hi, Ken,
Thanks for your email. You are right, I meant to propose that Lucene
switch to true UTF-8, rather than having to work around this issue by
fixing the resulting problems elsewhere.
Also, conforming to standards like UTF-8 will make the code easier for new
developers to pick up.
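
For anyone following along, here is a rough sketch of the difference between
standard UTF-8 and Modified UTF-8 that this thread is about. It is not taken
from Lucene's code. It uses the JDK's DataOutputStream.writeUTF, which the JDK
documents as writing Modified UTF-8, as a stand-in for a Modified UTF-8
encoder. The class name Utf8Comparison and its helper methods are made up for
this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: compares the two encodings for the same string.
public class Utf8Comparison {

    // Standard UTF-8 bytes, via the JDK's UTF-8 charset.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 bytes, as written by DataOutputStream.writeUTF.
    // writeUTF prefixes a 2-byte length, which is stripped here so that
    // only the encoded characters are compared.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = buffer.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    // Render bytes as space-separated upper-case hex for display.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two cases where the encodings differ: U+0000 and any
        // code point above U+FFFF (here U+1D11E, stored in a Java
        // String as a surrogate pair).
        String nul = String.valueOf((char) 0);
        String supplementary = new String(Character.toChars(0x1D11E));

        System.out.println("U+0000  standard: " + hex(standardUtf8(nul)));
        System.out.println("U+0000  modified: " + hex(modifiedUtf8(nul)));
        System.out.println("U+1D11E standard: " + hex(standardUtf8(supplementary)));
        System.out.println("U+1D11E modified: " + hex(modifiedUtf8(supplementary)));
    }
}
```

The modified form uses two bytes for U+0000 and encodes each half of a
surrogate pair separately (six bytes in total), where standard UTF-8 uses one
byte and four bytes respectively. A strict UTF-8 decoder rejects both of the
modified sequences.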
On Saturday 27 August 2005 16:05, Marvin Humphrey wrote:
Lucene should not be advertising that it uses standard UTF-8 -- or
even UTF-8 at all, since Modified UTF-8 is _illegal_ UTF-8.
For now, I've changed that information in the file format documentation.
Regards
Daniel