Re: Lucene does NOT use UTF-8.

2005-08-29 Thread tjones
Doug, How will the difference impact String memory allocations? Looking at the String code, I can't see where it would make an impact. Tim I would argue that the length written be the number of characters in the string, rather than the number of bytes written, since that can minimize stri

RE: Lucene does NOT use UTF-8.

2005-08-29 Thread Robert Engels
I think the VInt should be the number of bytes to be stored using the UTF-8 encoding. It is trivial to use the String methods identified before to do the conversion. The String(char[]) allocates a new char array. For performance, you can use the actual Charset encoding classes - avoiding all of
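
A minimal sketch of the byte-count proposal above, assuming a VInt laid out as seven data bits per byte, low-order group first, with the high bit set when more bytes follow. The class and method names are hypothetical, and this is not Lucene's own writer code; it uses String.getBytes for brevity rather than the Charset encoder classes mentioned for performance.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of Lucene.
final class Utf8LengthPrefixSketch {

    // Writes a non-negative int as a VInt: 7 data bits per byte,
    // low-order group first, high bit set while more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Writes the UTF-8 byte count as a VInt, then the UTF-8 bytes themselves.
    static byte[] writeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }
}
```

With a byte-count prefix, a reader can skip a string by reading the VInt and advancing that many bytes, without decoding any characters.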

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Daniel Naber
On Monday 29 August 2005 19:56, Ken Krugler wrote: "Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data." But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the case. Regards Daniel
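
One way to see why a length counted in Java chars says nothing about bytes on disk is to compare the counts for a few sample characters. The sketch below is only an illustration with hypothetical names; it uses standard JDK calls and does not reproduce Lucene's writer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of Lucene.
final class CharVersusByteCounts {

    // Number of Java chars, i.e. UTF-16 code units.
    static int javaChars(String s) {
        return s.length();
    }

    // Number of bytes when the string is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes when the string is encoded as UTF-16 (big-endian, no BOM).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }
}
```

For "a", "\u00E9" and "\u20AC" the char count is 1 in each case and the UTF-16 size is 2 bytes, while a UTF-8 style encoding takes 1, 2 and 3 bytes respectively. So the unit used for the length prefix and the encoding used for the character data are separate questions.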

Re: Lucene does NOT use UTF-8.

2005-08-29 Thread Doug Cutting
Ken Krugler wrote: The remaining issue is dealing with old-format indexes. I think that revving the version number on the segments file would be a good start. This file must be read before any others. Its current version is -1 and would become -2. (All positive values are version 0, for b

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Marvin Humphrey
Erik Hatcher wrote... What, if any, performance impact would changing Java Lucene in this regard have? And Ken Krugler wrote... "Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data." I had been working

Re: Lucene and UTF-8

2005-08-29 Thread Ken Krugler
Hi Marvin, I'm guessing that since I'm the one that cares most about interoperability, I'll have to volunteer to do the heavy lifting. Tomorrow I'll go through and survey how many and which things would need to change to achieve full UTF-8 compliance. One concern is that I think in order to m

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Andi Vajda
If the rest of the world of Lucene ports followed suit with PyLucene and did the GCJ/SWIG thing, we'd have no problems :) What are the disadvantages to following this model with Plucene? Some parts of the Lucene API require subclassing (e.g., Analyzer) and SWIG does support cross-languag

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Andi Vajda
I'm also curious about the existing CLucene & PyLucene ports. Would they also need to be similarly modified, with the proposed changes? PyLucene is built from the Java Lucene source code, so any change made to Java Lucene is getting reflected in PyLucene once it gets refreshed. The next refr

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ken Krugler
On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote: I'm not familiar with UTF-8 enough to follow the details of this discussion. I hope other Lucene developers are, so we can resolve this issue. Anyone raising a hand? I could, but recent posts make me think this is heading towards a religiou

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ken Krugler
[snip] The surrogate pair problem is another matter entirely. First of all, let's see if I do understand the problem correctly: Some Unicode characters can be represented by one codepoint outside the BMP (i.e., not with 16 bits) and alternatively with two codepoints, both of them in the 16-bi
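
A small sketch of the surrogate pair case, using only standard JDK calls. In UTF-16 a code point above U+FFFF is stored as two code units (a high and a low surrogate); standard UTF-8 encodes the code point itself in 4 bytes, whereas encoding each UTF-16 code unit on its own gives 3 + 3 = 6 bytes, which is not valid UTF-8. The class and the perUnitBytes helper are hypothetical and only count bytes; they are not Lucene code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of Lucene.
final class SurrogatePairSketch {

    // Number of UTF-16 code units (Java chars) needed for one code point.
    static int utf16Units(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    // Number of bytes in the standard UTF-8 encoding of one code point.
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Byte count if every UTF-16 code unit is encoded separately,
    // so each surrogate becomes its own 3-byte sequence.
    static int perUnitBytes(int codePoint) {
        int total = 0;
        for (char c : Character.toChars(codePoint)) {
            total += (c < 0x80) ? 1 : (c < 0x800) ? 2 : 3;
        }
        return total;
    }
}
```

For U+1D11E (musical G clef), utf16Units gives 2, utf8Bytes gives 4 and perUnitBytes gives 6; for a BMP character such as U+20AC the two byte counts agree at 3.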

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ronald Dauster
Erik Hatcher wrote: On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote: I'm not familiar with UTF-8 enough to follow the details of this discussion. I hope other Lucene developers are, so we can resolve this issue. Anyone raising a hand? I could, but recent posts make me think this is

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Erik Hatcher
On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote: I'm not familiar with UTF-8 enough to follow the details of this discussion. I hope other Lucene developers are, so we can resolve this issue. Anyone raising a hand? I could, but recent posts make me think this is heading towards a rel

Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ken Krugler
I'm not familiar with UTF-8 enough to follow the details of this discussion. I hope other Lucene developers are, so we can resolve this issue. Anyone raising a hand? I could, but recent posts make me think this is heading towards a religious debate :) I think the following statements are