On Aug 30, 2005, at 12:47 PM, Doug Cutting wrote:
Yonik Seeley wrote:
I've been looking around... do you have a pointer to the source
where just the suffix is converted from UTF-8?
I understand the index format, but I'm not sure I understand the
problem that would be posed by the prefix
Wolfgang Hoschek wrote:
I don't know if it matters for Lucene usage. But if using
CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a
significant problem, it's probably due to startup/init time of these
methods for individually converting many small strings, not inherently
due to
The temporary char[] buffer is cached per InputStream instance, so the extra
memory allocation shouldn't be a big deal. One could also use
new String(byte[], offset, len, "UTF-8"), which creates a char[] that is used
directly by the string instead of being copied. It remains to be seen how
fast the
How will the difference impact String memory allocations? Looking at the
String code, I can't see where it would make an impact.
This is from Lucene InputStream:
public final String readString() throws IOException {
  int length = readVInt();
  if (chars == null || length > chars.length)
    chars = new char[length];
  readChars(chars, 0, length);
  return new String(chars, 0, length);
}
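To make the alternative mentioned above concrete, here is a minimal, self-contained sketch of decoding a byte slice with the String(byte[], offset, len, charset) constructor; the class name and sample string are mine, not from the thread:

```java
import java.io.UnsupportedEncodingException;

public class Utf8DecodeSketch {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // "café" encodes to 5 bytes in UTF-8 ('c','a','f', 0xC3, 0xA9)
        byte[] buf = "caf\u00e9".getBytes("UTF-8");
        // Decode a slice of the buffer directly; the constructor
        // allocates and fills the String's char[] itself, so no
        // separate temporary char[] copy is needed in user code.
        String s = new String(buf, 0, buf.length, "UTF-8");
        System.out.println(buf.length);  // 5 bytes
        System.out.println(s.length());  // 4 chars
    }
}
```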
-----Original Message-----
From: Yonik Seeley [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, August 30, 2005 11:28 AM
To: java-dev@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.
How will the difference impact String memory allocations? Looking at the
String code, I can't see where it would make
[EMAIL PROTECTED] wrote:
How will the difference impact String memory allocations? Looking at
the String code, I can't see where it would make an impact.
I spoke a bit too soon. I should have looked at the code first. You're
right, I don't think it would require more allocations.
When
On Monday 29 August 2005 19:56, Ken Krugler wrote:
Lucene writes strings as a VInt representing the length of the
string in Java chars (UTF-16 code units), followed by the character
data.
But wouldn't UTF-16 mean 2 bytes per character?
Yes, UTF-16 means two bytes per code unit. A Unicode character, however,
can take one or two code units.
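A short illustration of the code-unit/character distinction (U+1D11E is my own choice of supplementary character; any code point above U+FFFF behaves the same):

```java
public class CodeUnitsSketch {
    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies above the BMP, so UTF-16
        // represents it as a surrogate pair: two code units, one character.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                         // 2 code units
        System.out.println(clef.codePointCount(0, clef.length())); // 1 character
    }
}
```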
Daniel Naber wrote:
On Monday 29 August 2005 19:56, Ken Krugler wrote:
Lucene writes strings as a VInt representing the length of the
string in Java chars (UTF-16 code units), followed by the character
data.
But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the
Sent: Monday, August 29, 2005 4:24 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.
Ken Krugler wrote:
The remaining issue is dealing with old-format indexes.
I think that revving the version number on the segments file would be a
good start. This file must be read before any others
Ken Krugler wrote:
I think the VInt should be the number of bytes to be stored using
the UTF-8 encoding.
It is trivial to use the String methods identified before to do the
conversion. The String(char[]) allocates a new char array.
For performance, you can use the actual Charset encoding
DM Smith wrote:
Daniel Naber wrote:
But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to
be the case.
UTF-16 is a fixed 2 byte/char representation.
Except when it's not. I.e., above the BMP.
From the Unicode 4.0 standard
-----Original Message-----
From: Ken Krugler [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 30, 2005 11:54 AM
To: java-dev@lucene.apache.org
Subject: RE: Lucene does NOT use UTF-8.
I think the VInt should be the number of bytes to be stored using
the UTF-8 encoding.
It is trivial to use the String methods
I think you guys are WAY overcomplicating things, or you just don't know
enough about the Java class libraries.
People were just pointing out that if the vint isn't String.length(), then
one has to either buffer the entire string, or pre-scan it.
It's a valid point, and CharsetEncoder
On 8/30/05, Robert Engels [EMAIL PROTECTED] wrote:
Not true. You do not need to pre-scan it.
What I previously wrote, with emphasis on key words added:
one has to *either* buffer the entire string, *or* pre-scan it.
-Yonik
Now hiring -- http://tinyurl.com/7m67g
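To illustrate the either/or above: the pre-scan option computes the UTF-8 byte count in one pass over the code points, so the byte-count vint could be written before encoding, with no buffering. A sketch under my own naming (not Lucene code):

```java
public class Utf8LengthSketch {
    // Pre-scan: the UTF-8 byte length of a string, computed without
    // actually encoding it (1-4 bytes per code point).
    static int utf8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80)         bytes += 1;
            else if (cp < 0x800)   bytes += 2;
            else if (cp < 0x10000) bytes += 3;
            else                   bytes += 4;
        }
        return bytes;
    }

    public static void main(String[] args) throws Exception {
        // 'a' (1 byte) + U+00E9 (2 bytes) + U+1D11E (4 bytes) = 7 bytes
        String s = "a\u00e9" + new String(Character.toChars(0x1D11E));
        System.out.println(utf8Length(s));               // 7
        System.out.println(s.getBytes("UTF-8").length);  // 7, agrees with the encoder
    }
}
```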
I've been looking around... do you have a pointer to the source where just
the suffix is converted from UTF-8?
I understand the index format, but I'm not sure I understand the problem
that would be posed by the prefix length being a byte count.
-Yonik
Now hiring -- http://tinyurl.com/7m67g
On
Daniel Naber wrote:
On Monday 29 August 2005 19:56, Ken Krugler wrote:
Lucene writes strings as a VInt representing the length of the
string in Java chars (UTF-16 code units), followed by the character
data.
But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem
to be the
The inefficiency would be if prefix were re-converted from UTF-8
for each term, e.g., in order to compare it to the target.
Ahhh, gotcha.
A related problem exists even if the prefix length vInt is changed to
represent the number of unicode chars (as opposed to number of java chars),
right?
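For readers following along, a toy model of the prefix compression being discussed (my own simplified decoder, not the actual Lucene term-dictionary code): each term is stored as a shared-prefix length plus a suffix, and the prefix length is used as a direct offset into the previous term's characters, which is why changing its unit would break the scheme.

```java
public class PrefixDecodeSketch {
    // Rebuild terms from (prefixLength, suffix) pairs. prefixLens[i] is
    // an offset in Java chars into the previous term; if the stored
    // number counted bytes or Unicode code points instead, this direct
    // substring offset would be wrong for non-ASCII terms.
    static String[] decode(int[] prefixLens, String[] suffixes) {
        String[] terms = new String[suffixes.length];
        String prev = "";
        for (int i = 0; i < suffixes.length; i++) {
            terms[i] = prev.substring(0, prefixLens[i]) + suffixes[i];
            prev = terms[i];
        }
        return terms;
    }

    public static void main(String[] args) {
        String[] terms = decode(new int[] {0, 3, 4},
                                new String[] {"apple", "ly", "es"});
        for (String t : terms)
            System.out.println(t);  // apple, apply, apples
    }
}
```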
On 8/30/05, Ken Krugler [EMAIL PROTECTED] wrote:
Daniel Naber wrote:
On Monday 29 August 2005 19:56, Ken Krugler wrote:
Lucene writes strings as a VInt representing the length of the
string in Java chars (UTF-16 code units), followed by the character
data.
But wouldn't UTF-16 mean
Yonik Seeley wrote:
A related problem exists even if the prefix length vInt is changed
to represent the number of unicode chars (as opposed to number of
java chars), right? The prefix length is no longer the offset into
the char[] to put the suffix.
Yes, I suppose this is a problem too.
Where/how is the Lucene ordering of terms used?
An ordering is necessary to be able to find things in the index.
For the most part, the ordering doesn't seem to matter... the only query that
comes to mind where it does matter is RangeQuery.
For sorting queries, one is able to specify a Locale.
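The ordering question has a concrete wrinkle: String.compareTo orders by UTF-16 code unit, which disagrees with code-point order (and hence binary UTF-8 order) once supplementary characters are involved. A small illustration with characters I picked for the purpose:

```java
public class TermOrderSketch {
    public static void main(String[] args) {
        String bmp = "\uFF61";                                 // U+FF61, in the BMP
        String clef = new String(Character.toChars(0x1D11E));  // U+1D11E, above the BMP
        // UTF-16 code-unit order: the high surrogate 0xD834 sorts
        // before 0xFF61, so compareTo says bmp > clef...
        System.out.println(bmp.compareTo(clef) > 0);  // true
        // ...but in code-point order (= binary UTF-8 order) U+FF61 < U+1D11E,
        // so true UTF-8 bytes would sort the two terms the other way around.
        System.out.println(0xFF61 < 0x1D11E);         // true
    }
}
```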
Yonik Seeley wrote:
Where/how is the Lucene ordering of terms used?
An ordering is necessary to be able to find things in the index.
For the most part, the ordering doesn't seem to matter... the only query that
comes to mind where it does matter is RangeQuery.
For back-compatibility it would
I'm not familiar with UTF-8 enough to follow the details of this
discussion. I hope other Lucene developers are, so we can resolve this
issue. Anyone raising a hand?
I could, but recent posts make me think this is heading towards a
religious debate :)
I think the following statements
On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:
I'm not familiar with UTF-8 enough to follow the details of this
discussion. I hope other Lucene developers are, so we can resolve
this
issue. Anyone raising a hand?
I could, but recent posts make me think this is heading towards a
Erik Hatcher wrote:
On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:
I'm not familiar with UTF-8 enough to follow the details of this
discussion. I hope other Lucene developers are, so we can resolve
this
issue. Anyone raising a hand?
I could, but recent posts make me think this is
[snip]
The surrogate pair problem is another matter entirely. First of all,
let's see if I do understand the problem correctly: Some Unicode
characters can be represented by one codepoint outside the BMP (i.
e., not with 16 bits) and alternatively with two codepoints, both of
them in the
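To pin the terminology down: a character above the BMP is a single code point but two UTF-16 code units (a surrogate pair), not two code points. True UTF-8 encodes the code point in 4 bytes; Java's modified UTF-8 (as written by DataOutputStream.writeUTF) encodes each surrogate separately in 3 bytes, giving 6. The example character is my own choice:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SurrogateBytesSketch {
    public static void main(String[] args) throws IOException {
        String clef = new String(Character.toChars(0x1D11E)); // 1 code point, 2 code units
        // True UTF-8: one 4-byte sequence for the code point.
        System.out.println(clef.getBytes("UTF-8").length);    // 4
        // Modified UTF-8: each surrogate becomes its own 3-byte sequence.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(clef);
        System.out.println(bos.size() - 2);                   // 6 (after the 2-byte length prefix)
    }
}
```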
If the rest of the world of Lucene ports followed suit with PyLucene and
did the GCJ/SWIG thing, we'd have no problems :) What are the
disadvantages to following this model with Plucene?
Some parts of the Lucene API require subclassing (e. g., Analyzer) and SWIG
does support
Erik Hatcher wrote...
What, if any, performance impact would changing Java Lucene in this
regard have?
And Ken Krugler wrote...
Lucene writes strings as a VInt representing the length of the
string in Java chars (UTF-16 code units), followed by the character
data.
I had been working
Ken Krugler wrote:
The remaining issue is dealing with old-format indexes.
I think that revving the version number on the segments file would be a
good start. This file must be read before any others. Its current
version is -1 and would become -2. (All positive values are version 0,
for
Doug,
How will the difference impact String memory allocations? Looking at the
String code, I can't see where it would make an impact.
Tim
I would argue that the length written be the number of characters in the
string, rather than the number of bytes written, since that can minimize
Sent: August 27, 2005 2:11:34 PM PDT
To: java-user@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.
Reply-To: java-user@lucene.apache.org
I've delved into the matter of Lucene and UTF-8 a little further,
and I am discouraged by what I believe I've uncovered.
Lucene should not be advertising that it uses UTF-8.
Hi Marvin,
Thanks for the detailed response. After spending a bit more time in
the code, I think you're right - all strings seem to be funnelled
through IndexOutput. The remaining issue is dealing with old-format
indexes.
I'm going to take this off-list now, since I'm guessing most list
Hello, Robert...
On Aug 28, 2005, at 7:50 PM, Robert Engels wrote:
Sorry, but I think you are barking up the wrong tree... and your tone is
quite bizarre. My personal OPINION is that your script language is an
abomination, and anyone that develops in it is clearly hurting the
advancement
On Aug 26, 2005, at 10:14 PM, jian chen wrote:
It seems to me that in theory, Lucene storage code could use true UTF-8
to store terms. Maybe it is just a legacy issue that the modified UTF-8
is used?
It's not a matter of a simple switch. The VInt count at the head of
a Lucene string is
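One concrete difference between true and modified UTF-8, relevant to the legacy question: the encoding of U+0000. A minimal sketch (my own example, not Lucene code):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class NullByteSketch {
    public static void main(String[] args) throws IOException {
        // True UTF-8 encodes U+0000 as a single 0x00 byte...
        System.out.println("\0".getBytes("UTF-8").length);  // 1
        // ...while modified UTF-8 uses the overlong two-byte form
        // 0xC0 0x80, so the encoded data never contains a raw zero byte.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF("\0");
        System.out.println(bos.size() - 2);                 // 2 (after the 2-byte length prefix)
    }
}
```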
Hi, Ken,
Thanks for your email. You are right, I meant to propose that Lucene
switch to true UTF-8, rather than working around the issue by fixing
the problems it causes elsewhere.
Also, conforming to standards like UTF-8 will make the code easier for new
developers to pick up.