Doug,
How will the difference impact String memory allocations? Looking at the
String code, I can't see where it would make an impact.
Tim
I would argue that the length written be the number of characters in the
string, rather than the number of bytes written, since that can minimize
stri
I think the VInt should be the number of bytes to be stored using the UTF-8
encoding.
It is trivial to use the String methods identified before to do the
conversion. The String(char[]) allocates a new char array.
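
A minimal sketch of the byte-count layout suggested here, under stated assumptions: the string is encoded to UTF-8, the number of bytes is written as a VInt (seven bits per byte, low-order group first, high bit set on every byte except the last), and the encoded bytes follow. The class and method names are hypothetical and are not Lucene's API, and the sketch uses String.getBytes rather than the Charset encoder classes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of Lucene.
class Utf8StringWriter {

    // Write a non-negative int as a VInt: 7 data bits per byte,
    // low-order bits first, high bit set when more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Write the UTF-8 byte count as a VInt, then the UTF-8 bytes.
    static byte[] writeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }
}
```

With this layout a reader can skip a string without decoding it, because the prefix is the exact number of bytes that follow.
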
For performance, you can use the actual Charset encoding classes - avoiding
all of
On Monday 29 August 2005 19:56, Ken Krugler wrote:
> "Lucene writes strings as a VInt representing the length of the
> string in Java chars (UTF-16 code units), followed by the character
> data."
But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the
case.
Regards
Daniel
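
For reference, the quoted sentence only says that the length prefix is counted in Java chars (UTF-16 code units); it does not say that each char occupies two bytes in the file. The sketch below, which uses a hypothetical class name and only standard JDK calls, shows how the char count and the UTF-8 byte count of the same string can differ.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class LengthDemo {

    // Number of Java chars, i.e. UTF-16 code units.
    static int charCount(String s) {
        return s.length();
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "abc",           // ASCII: 3 chars, 3 bytes
            "caf\u00e9",     // one accented letter: 4 chars, 5 bytes
            "\u4e2d\u6587",  // two CJK characters: 2 chars, 6 bytes
            "\uD835\uDD0A"   // one code point outside the BMP: 2 chars, 4 bytes
        };
        for (String s : samples) {
            System.out.println(charCount(s) + " chars, "
                    + s.codePointCount(0, s.length()) + " code points, "
                    + utf8ByteCount(s) + " UTF-8 bytes");
        }
    }
}
```
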
Ken Krugler wrote:
The remaining issue is dealing with old-format indexes.
I think that revving the version number on the segments file would be a
good start. This file must be read before any others. Its current
version is -1 and would become -2. (All positive values are version 0,
for b
Erik Hatcher wrote...
What, if any, performance impact would changing Java Lucene in this
regard have?
And Ken Krugler wrote...
"Lucene writes strings as a VInt representing the length of the
string in Java chars (UTF-16 code units), followed by the character
data."
I had been working
Hi Marvin,
I'm guessing that since I'm the one that cares most about
interoperability, I'll have to volunteer to do the heavy lifting.
Tomorrow I'll go through and survey how many and which things would
need to change to achieve full UTF-8 compliance. One concern is
that I think in order to m
If the rest of the world of Lucene ports followed suit with PyLucene and
did the GCJ/SWIG thing, we'd have no problems :) What are the
disadvantages to following this model with Plucene?
Some parts of the Lucene API require subclassing (e.g., Analyzer) and SWIG
does support cross-languag
I'm also curious about the existing CLucene & PyLucene ports. Would they also
need to be similarly modified, with the proposed changes?
PyLucene is built from the Java Lucene source code, so any change made to Java
Lucene is getting reflected in PyLucene once it gets refreshed. The next
refr
On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:
I'm not familiar with UTF-8 enough to follow the details of this
discussion. I hope other Lucene developers are, so we can resolve this
issue. Anyone raising a hand?
I could, but recent posts make me think this is heading towards a
religiou
[snip]
The surrogate pair problem is another matter entirely. First of all,
let's see if I understand the problem correctly: Some Unicode
characters can be represented by one codepoint outside the BMP
(i.e., not with 16 bits) and alternatively with two codepoints, both of
them in the 16-bi
Erik Hatcher wrote:
On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:
I'm not familiar with UTF-8 enough to follow the details of this
discussion. I hope other Lucene developers are, so we can resolve this
issue. Anyone raising a hand?
I could, but recent posts make me think this is
On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:
I'm not familiar with UTF-8 enough to follow the details of this
discussion. I hope other Lucene developers are, so we can resolve this
issue. Anyone raising a hand?
I could, but recent posts make me think this is heading towards a
rel
I'm not familiar with UTF-8 enough to follow the details of this
discussion. I hope other Lucene developers are, so we can resolve this
issue. Anyone raising a hand?
I could, but recent posts make me think this is heading towards a
religious debate :)
I think the following statements are