Re: Lucene does NOT use UTF-8.

2005-08-31 Thread Wolfgang Hoschek
On Aug 30, 2005, at 12:47 PM, Doug Cutting wrote: Yonik Seeley wrote: I've been looking around... do you have a pointer to the source where just the suffix is converted from UTF-8? I understand the index format, but I'm not sure I understand the problem that would be posed by the prefix

Re: Lucene does NOT use UTF-8.

2005-08-31 Thread Doug Cutting
Wolfgang Hoschek wrote: I don't know if it matters for Lucene usage. But if using CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a significant problem, it's probably due to startup/init time of these methods for individually converting many small strings, not inherently due to

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
The temporary char[] buffer is cached per InputStream instance, so the extra memory allocation shouldn't be a big deal. One could also use new String(byte[], offset, len, "UTF-8"), and that creates a char[] that is used directly by the string instead of being copied. It remains to be seen how fast the
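
For concreteness, a sketch of the two decoding paths compared above; the class and method names are illustrative, not Lucene's.

    import java.io.UnsupportedEncodingException;

    // Sketch of the two decoding paths (names illustrative, not Lucene's).
    public class DecodeSketch {

        // Path 1: decode into a cached char[] first, then copy into a String.
        static String fromChars(char[] chars, int length) {
            return new String(chars, 0, length); // the constructor copies the array
        }

        // Path 2: let the String constructor decode the bytes itself; the char[]
        // it builds internally backs the String, so there is no second copy.
        static String fromBytes(byte[] bytes, int offset, int len)
                throws UnsupportedEncodingException {
            return new String(bytes, offset, len, "UTF-8");
        }
    }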

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
How will the difference impact String memory allocations? Looking at the String code, I can't see where it would make an impact. This is from Lucene InputStream: public final String readString() throws IOException { int length = readVInt(); if (chars == null || length > chars.length) chars =
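
The quoted method is cut off mid-line; here is a hedged reconstruction of the 1.4-era code under discussion, with readVInt() filled in from the same class. readByte() and readChars() stand for the stream's underlying primitives.

    import java.io.IOException;

    // Reconstruction of the readString() quoted above (a sketch, not verbatim).
    public abstract class InputStreamSketch {
        private char[] chars;                         // buffer cached per instance

        public abstract byte readByte() throws IOException;
        public abstract void readChars(char[] buf, int start, int length)
                throws IOException;

        // Lucene's variable-length int: 7 bits per byte, high bit = "more follows".
        public final int readVInt() throws IOException {
            byte b = readByte();
            int i = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = readByte();
                i |= (b & 0x7F) << shift;
            }
            return i;
        }

        public final String readString() throws IOException {
            int length = readVInt();                  // length in Java chars
            if (chars == null || length > chars.length)
                chars = new char[length];             // grow the cached buffer
            readChars(chars, 0, length);
            return new String(chars, 0, length);      // copies the char[]
        }
    }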

RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Robert Engels
-Original Message- From: Yonik Seeley [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 30, 2005 11:28 AM To: java-dev@lucene.apache.org Subject: Re: Lucene does NOT use UTF-8. How will the difference impact String memory allocations? Looking at the String code, I can't see where it would make

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Doug Cutting
[EMAIL PROTECTED] wrote: How will the difference impact String memory allocations? Looking at the String code, I can't see where it would make an impact. I spoke a bit too soon. I should have looked at the code first. You're right, I don't think it would require more allocations. When

Re: Lucene does NOT use UTF-8

2005-08-30 Thread Ken Krugler
On Monday 29 August 2005 19:56, Ken Krugler wrote: Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data. But wouldn't UTF-16 mean 2 bytes per character? Yes, UTF-16 means two bytes per code unit. A Unicode

Re: Lucene does NOT use UTF-8

2005-08-30 Thread DM Smith
Daniel Naber wrote: On Monday 29 August 2005 19:56, Ken Krugler wrote: Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data. But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the

RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Ken Krugler
, August 29, 2005 4:24 PM To: java-dev@lucene.apache.org Subject: Re: Lucene does NOT use UTF-8. Ken Krugler wrote: The remaining issue is dealing with old-format indexes. I think that revving the version number on the segments file would be a good start. This file must be read before any others

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread DM Smith
Ken Krugler wrote: I think the VInt should be the number of bytes to be stored using the UTF-8 encoding. It is trivial to use the String methods identified before to do the conversion. The String(char[]) allocates a new char array. For performance, you can use the actual Charset encoding

Re: Lucene does NOT use UTF-8

2005-08-30 Thread Steven Rowe
DM Smith wrote: Daniel Naber wrote: But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the case. UTF-16 is a fixed 2 byte/char representation. Except when it's not. I.e., above the BMP. From the Unicode 4.0 standard
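
A two-line demonstration of the point: a single character above the BMP occupies two Java chars (one surrogate pair). codePointCount() requires Java 5.

    // U+10400 (DESERET CAPITAL LETTER LONG I) lies above the BMP, so UTF-16
    // represents it with a surrogate pair -- two code units.
    public class SurrogateDemo {
        public static void main(String[] args) {
            String s = "\uD801\uDC00";  // one code point, U+10400
            System.out.println(s.length());                      // 2 Java chars
            System.out.println(s.codePointCount(0, s.length())); // 1 character
        }
    }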

RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Robert Engels
Message- From: Ken Krugler [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 30, 2005 11:54 AM To: java-dev@lucene.apache.org Subject: RE: Lucene does NOT use UTF-8. I think the VInt should be the number of bytes to be stored using the UTF-8 encoding. It is trivial to use the String methods

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
I think you guys are WAY overcomplicating things, or you just don't know enough about the Java class libraries. People were just pointing out that if the vint isn't String.length(), then one has to either buffer the entire string, or pre-scan it. It's a valid point, and CharsetEncoder
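
For concreteness, a sketch of the "buffer the entire string" option using CharsetEncoder: encode everything into a ByteBuffer first, so the byte count is known before the VInt prefix is written. The class and method names are hypothetical, not Lucene's.

    import java.io.DataOutput;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;

    // Hypothetical writer: UTF-8-encode the whole string up front, then emit
    // a byte-count VInt followed by the encoded bytes.
    public class Utf8StringWriter {
        private final CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();

        public void writeString(DataOutput out, String s) throws IOException {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(s)); // whole string buffered
            int len = bytes.remaining();
            writeVInt(out, len);                                   // prefix: length in bytes
            out.write(bytes.array(), bytes.arrayOffset() + bytes.position(), len);
        }

        // Lucene-style variable-length int: 7 bits per byte, high bit = continuation.
        private void writeVInt(DataOutput out, int i) throws IOException {
            while ((i & ~0x7F) != 0) {
                out.writeByte((i & 0x7F) | 0x80);
                i >>>= 7;
            }
            out.writeByte(i);
        }
    }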

RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Robert Engels
: Lucene does NOT use UTF-8. I think you guys are WAY overcomplicating things, or you just don't know enough about the Java class libraries. People were just pointing out that if the vint isn't String.length(), then one has to either buffer the entire string, or pre-scan it. It's a valid point

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
On 8/30/05, Robert Engels [EMAIL PROTECTED] wrote: Not true. You do not need to pre-scan it. What I previously wrote, with emphasis on key words added: one has to *either* buffer the entire string, *or* pre-scan it. -Yonik Now hiring -- http://tinyurl.com/7m67g

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
I've been looking around... do you have a pointer to the source where just the suffix is converted from UTF-8? I understand the index format, but I'm not sure I understand the problem that would be posed by the prefix length being a byte count. -Yonik Now hiring -- http://tinyurl.com/7m67g On

Re: Lucene does NOT use UTF-8

2005-08-30 Thread Ken Krugler
Daniel Naber wrote: On Monday 29 August 2005 19:56, Ken Krugler wrote: Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data. But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
The inefficiency would be if the prefix were re-converted from UTF-8 for each term, e.g., in order to compare it to the target. Ahhh, gotcha. A related problem exists even if the prefix length vInt is changed to represent the number of Unicode chars (as opposed to the number of Java chars), right?

Re: Lucene does NOT use UTF-8

2005-08-30 Thread Tom White
On 8/30/05, Ken Krugler [EMAIL PROTECTED] wrote: Daniel Naber wrote: On Monday 29 August 2005 19:56, Ken Krugler wrote: Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data. But wouldn't UTF-16 mean

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Ken Krugler
Yonik Seeley wrote: A related problem exists even if the prefix length vInt is changed to represent the number of Unicode chars (as opposed to the number of Java chars), right? The prefix length is no longer the offset into the char[] to put the suffix. Yes, I suppose this is a problem too.
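
A schematic of the term-dictionary read loop to show why the prefix count doubles as an offset (simplified; the real code lives in the SegmentTermEnum area). Because the VInt counts Java chars, the suffix can be decoded in place directly after the shared prefix; a byte or code point count would first have to be mapped back to a char offset.

    import java.io.IOException;

    // Schematic only: prefix/suffix term decompression against a reused buffer.
    public abstract class TermReaderSketch {
        public abstract int readVInt() throws IOException;
        public abstract void readChars(char[] buf, int start, int length)
                throws IOException;

        private char[] text = new char[64];          // holds the previous term

        public void readTerm() throws IOException {
            int prefix = readVInt();                 // chars shared with previous term
            int suffix = readVInt();                 // chars that differ
            if (prefix + suffix > text.length) {     // grow, keeping the shared prefix
                char[] grown = new char[prefix + suffix];
                System.arraycopy(text, 0, grown, 0, prefix);
                text = grown;
            }
            readChars(text, prefix, suffix);         // decode suffix at char offset
        }
    }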

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
Where/how is the Lucene ordering of terms used? An ordering is necessary to be able to find things in the index. For the most part, the ordering doesn't seem to matter... the only query that comes to mind where it does matter is RangeQuery. For sorting queries, one is able to specify a Locale.

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Doug Cutting
Yonik Seeley wrote: Where/how is the Lucene ordering of terms used? An ordering is necessary to be able to find things in the index. For the most part, the ordering doesn't seem to matter... the only query that comes to mind where it does matter is RangeQuery. For back-compatibility it would
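
A small demo (modern JDK assumed) of where the two orderings diverge: String.compareTo compares UTF-16 code units, while UTF-8 byte order tracks code point order, and the two disagree exactly for BMP characters at U+E000 and above versus supplementary characters.

    // U+FF61 sorts after U+10000 in UTF-16 code-unit order, but before it in
    // code point (and UTF-8 byte) order.
    public class OrderDemo {
        public static void main(String[] args) throws Exception {
            String bmp  = "\uFF61";        // U+FF61, halfwidth ideographic full stop
            String supp = "\uD800\uDC00";  // U+10000, first supplementary code point

            System.out.println(bmp.compareTo(supp) > 0);    // true: 0xFF61 > 0xD800
            System.out.println(compareUtf8(bmp, supp) < 0); // true: 0xEF.. < 0xF0..
        }

        static int compareUtf8(String a, String b) throws Exception {
            byte[] x = a.getBytes("UTF-8"), y = b.getBytes("UTF-8");
            for (int i = 0; i < Math.min(x.length, y.length); i++) {
                int d = (x[i] & 0xFF) - (y[i] & 0xFF);  // compare unsigned bytes
                if (d != 0) return d;
            }
            return x.length - y.length;
        }
    }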

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ken Krugler
I'm not familiar enough with UTF-8 to follow the details of this discussion. I hope other Lucene developers are, so we can resolve this issue... anyone raising a hand? I could, but recent posts make me think this is heading towards a religious debate :) I think the following statements

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Erik Hatcher
On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote: I'm not familiar enough with UTF-8 to follow the details of this discussion. I hope other Lucene developers are, so we can resolve this issue... anyone raising a hand? I could, but recent posts make me think this is heading towards a

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ronald Dauster
Erik Hatcher wrote: On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote: I'm not familiar enough with UTF-8 to follow the details of this discussion. I hope other Lucene developers are, so we can resolve this issue... anyone raising a hand? I could, but recent posts make me think this is

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ken Krugler
[snip] The surrogate pair problem is another matter entirely. First of all, let's see if I understand the problem correctly: some Unicode characters can be represented by one code point outside the BMP (i.e., not with 16 bits) and alternatively with two code points, both of them in the

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ken Krugler
On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote: I'm not familiar enough with UTF-8 to follow the details of this discussion. I hope other Lucene developers are, so we can resolve this issue... anyone raising a hand? I could, but recent posts make me think this is heading towards a

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Andi Vajda
If the rest of the world of Lucene ports followed suit with PyLucene and did the GCJ/SWIG thing, we'd have no problems :) What are the disadvantages to following this model with Plucene? Some parts of the Lucene API require subclassing (e.g., Analyzer) and SWIG does support

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Marvin Humphrey
Erik Hatcher wrote... What, if any, performance impact would changing Java Lucene in this regard have? And Ken Krugler wrote... Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data. I had been working

Re: Lucene does NOT use UTF-8.

2005-08-29 Thread Doug Cutting
Ken Krugler wrote: The remaining issue is dealing with old-format indexes. I think that revving the version number on the segments file would be a good start. This file must be read before any others. Its current version is -1 and would become -2. (All positive values are version 0, for
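
A sketch of the back-compatibility check this describes, assuming the format marker is the first int of the segments file; illustrative, not the actual SegmentInfos code.

    import java.io.DataInput;
    import java.io.IOException;

    // Pre-versioned segments files begin with a non-negative int, so any
    // negative first value can serve as an explicit format marker:
    // -1 for the current format, -2 once the string encoding changes.
    public class SegmentsFormatSketch {
        public static int readFormat(DataInput in) throws IOException {
            int first = in.readInt();
            if (first >= 0)
                return 0;      // legacy file: all positive values are version 0
            return -first;     // -1 -> 1, -2 -> 2, ...
        }
    }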

Re: Lucene does NOT use UTF-8.

2005-08-29 Thread tjones
Doug, How will the difference impact String memory allocations? Looking at the String code, I can't see where it would make an impact. Tim I would argue that the length written be the number of characters in the string, rather than the number of bytes written, since that can minimize

Re: Lucene does NOT use UTF-8.

2005-08-28 Thread Marvin Humphrey
I've delved into the matter of Lucene and UTF-8 a little further, and I am discouraged by what I believe I've uncovered. Lucene should not be advertising

Re: Lucene does NOT use UTF-8.

2005-08-28 Thread Ken Krugler
Hi Marvin, Thanks for the detailed response. After spending a bit more time in the code, I think you're right - all strings seem to be funnelled through IndexOutput. The remaining issue is dealing with old-format indexes. I'm going to take this off-list now, since I'm guessing most list

Re: Lucene does NOT use UTF-8.

2005-08-28 Thread Marvin Humphrey
Hello, Robert... On Aug 28, 2005, at 7:50 PM, Robert Engels wrote: Sorry, but I think you are barking up the wrong tree... and your tone is quite bizarre. My personal OPINION is that your script language is an abomination, and anyone that develops in it is clearly hurting the advancement

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread Marvin Humphrey
On Aug 26, 2005, at 10:14 PM, jian chen wrote: It seems to me that in theory, Lucene storage code could use true UTF-8 to store terms. Maybe it is just a legacy issue that the modified UTF-8 is used? It's not a matter of a simple switch. The VInt count at the head of a Lucene string is
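
What "true UTF-8" versus Java's modified UTF-8 means in bytes, on a modern JDK: the two encodings differ exactly on U+0000 and on supplementary characters.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;

    // Standard UTF-8: NUL is 1 byte, U+10400 is 4 bytes.
    // Modified UTF-8 (DataOutput.writeUTF): NUL is 2 bytes (C0 80) and each
    // surrogate of the pair is encoded separately as 3 bytes.
    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws Exception {
            String s = "\0\uD801\uDC00";   // NUL + U+10400 (above the BMP)

            System.out.println(s.getBytes("UTF-8").length); // 5 = 1 + 4

            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            System.out.println(bos.size());                 // 10 = 2 (length) + 2 + 6
        }
    }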

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread jian chen
Hi, Ken, Thanks for your email. You are right, I meant to propose that Lucene switch to true UTF-8, rather than working around this issue by fixing the resulting problems elsewhere. Also, conforming to standards like UTF-8 will make the code easier for new developers to pick up.