Re: Lucene does NOT use UTF-8.

2005-08-31 Thread Wolfgang Hoschek

On Aug 30, 2005, at 12:47 PM, Doug Cutting wrote:


Yonik Seeley wrote:

I've been looking around... do you have a pointer to the source  
where just the suffix is converted from UTF-8?
I understand the index format, but I'm not sure I understand the  
problem that would be posed by the prefix length being a byte count.




TermBuffer.java:66

Things could work fine if the prefix length were a byte count.  A  
byte buffer could easily be constructed that contains the full byte  
sequence (prefix + suffix), and then this could be converted to a  
String.  The inefficiency would be if prefix were re-converted from  
UTF-8 for each term, e.g., in order to compare it to the target.   
Prefixes are frequently longer than suffixes, so this could be  
significant.  Does that make sense?  I don't know whether it would  
actually be significant, although TermBuffer.java was added  
recently as a measurable performance enhancement, so this is  
performance critical code.


We need to stop discussing this in the abstract and start coding  
alternatives and benchmarking them.  Is  
java.nio.charset.CharsetEncoder fast enough?  Will moving things  
through CharBuffer and ByteBuffer be too slow?  Should Lucene keep  
maintaining its own UTF-8 implementation for performance?  I don't  
know, only some experiments will tell.
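For anyone who wants to run that experiment, a minimal sketch of the
CharsetEncoder variant (class and method names are made up; whether the
per-call encode cost is acceptable is exactly what would need measuring):

  import java.nio.ByteBuffer;
  import java.nio.CharBuffer;
  import java.nio.charset.Charset;
  import java.nio.charset.CharsetEncoder;

  class ReusableUtf8Encoder {
    private final CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
    private ByteBuffer buffer = ByteBuffer.allocate(256);

    // Encodes s into the reused buffer and returns the number of UTF-8 bytes.
    int encode(String s) {
      CharBuffer chars = CharBuffer.wrap(s);
      encoder.reset();
      buffer.clear();
      if (encoder.encode(chars, buffer, true).isOverflow()) {
        // Grow once and re-encode; 4 bytes per Java char is a safe upper bound.
        buffer = ByteBuffer.allocate(s.length() * 4 + 16);
        encoder.reset();
        chars.rewind();
        encoder.encode(chars, buffer, true);
      }
      encoder.flush(buffer);
      return buffer.position();
    }
  }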


Doug



I don't know if it matters for Lucene usage. But if using  
CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a  
significant problem, it's probably due to startup/init time of these  
methods for individually converting many small strings, not  
inherently due to UTF-8 usage. I'm confident that a custom UTF-8  
implementation can almost completely eliminate these issues. I've  
done this before for binary XML with great success, and it could  
certainly be done for lucene just as well. Bottom line: It's probably  
an issue that can be dealt with via proper impl; it probably  
shouldn't dictate design directions.
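The kind of hand-rolled loop meant here is only a few lines. A rough sketch
(BMP-only, so it still lacks the 4-byte surrogate-pair case this thread is
about, and it assumes the caller has sized out[] generously):

  static int encodeUtf8(String s, byte[] out) {
    int p = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c < 0x80) {                              // 1 byte: ASCII
        out[p++] = (byte) c;
      } else if (c < 0x800) {                      // 2 bytes
        out[p++] = (byte) (0xC0 | (c >> 6));
        out[p++] = (byte) (0x80 | (c & 0x3F));
      } else {                                     // 3 bytes (BMP only)
        out[p++] = (byte) (0xE0 | (c >> 12));
        out[p++] = (byte) (0x80 | ((c >> 6) & 0x3F));
        out[p++] = (byte) (0x80 | (c & 0x3F));
      }
    }
    return p;                                      // number of bytes written
  }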


Wolfgang.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-31 Thread Doug Cutting

Wolfgang Hoschek wrote:
I don't know if it matters for Lucene usage. But if using  
CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a  
significant problem, it's probably due to startup/init time of these  
methods for individually converting many small strings, not  inherently 
due to UTF-8 usage. I'm confident that a custom UTF-8  implementation 
can almost completely eliminate these issues. I've  done this before for 
binary XML with great success, and it could  certainly be done for 
lucene just as well. Bottom line: It's probably  an issue that can be 
dealt with via proper impl; it probably  shouldn't dictate design 
directions.


Good point.  Currently Lucene already has its own (buggy) UTF-8 
implementation for performance, so that wouldn't really be a big change.


The big question now seems to be whether the stored character sequence 
lengths should be in bytes or characters.  Bytes might be fast and 
simple (whether we implement our own UTF-8 in Java or not) but are not 
back-compatible.  So do we bite the bullet and make a very incompatible 
change to index formats?  Or do we make these counts be unicode 
characters (which is mostly back-compatible) and make the code a bit 
more awkward?  Some implementations would be nice, just to see how 
awkward things get.


Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
The temporary char[] buffer is cached per InputStream instance, so the extra 
memory allocation shouldn't be a big deal. One could also use 
new String(byte[], offset, len, "UTF-8"), and that creates a char[] that is used 
directly by the string instead of being copied. It remains to be seen how 
fast the native java char converter is though.

I like the idea of the length being the number of bytes... it encapsulates 
the content in case you want to rapidly skip over it (or rapidly copy it). 
It's more future proof w.r.t. alternate encodings (or binary), and if it had 
been the number of bytes from the start, it wouldn't have to be changed now.
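A sketch of what that encapsulation buys (hypothetical helper; it assumes
the stored length is a byte count and uses InputStream's seek/getFilePointer):

  // Skip over a stored string without decoding it -- only possible if the
  // stored length is a byte count.
  static void skipString(org.apache.lucene.store.InputStream in)
      throws java.io.IOException {
    int byteLength = in.readVInt();                // length in bytes, not chars
    in.seek(in.getFilePointer() + byteLength);     // jump past the encoded bytes
  }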

-Yonik

On 8/29/05, Doug Cutting [EMAIL PROTECTED] wrote:

 I would argue that the length written be the number of characters in the
 string, rather than the number of bytes written, since that can minimize
 string memory allocations.



Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
 How will the difference impact String memory allocations? Looking at the
 String code, I can't see where it would make an impact.


This is from Lucene InputStream:
public final String readString() throws IOException {
  int length = readVInt();
  if (chars == null || length > chars.length)
    chars = new char[length];
  readChars(chars, 0, length);
  return new String(chars, 0, length);
}

If you know the length in bytes, you still have to allocate that many chars 
(even though the number of chars may be less than the number of bytes). Not 
a big deal IMHO.

A bigger pain is on the writing side, where you can't stream things because 
you don't know what the length is going to be (in either bytes *or* UTF-8 
chars).

So it turns out that Java's 16 bit chars were just a waste... it's still a 
multibyte format *and* it takes up more space. UTF-8 would have been nice - 
no conversions necessary.


-Yonik Now hiring -- http://tinyurl.com/7m67g


RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Robert Engels
That method should easily be changed to

public final String readString() throws IOException {
  int length = readVInt();
  return new String(readBytes(length), "UTF-8");
}

readBytes() could reuse the same array if it was large enough. Then only the
single char[] is created in the String code.

-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 30, 2005 11:28 AM
To: java-dev@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.


 How will the difference impact String memory allocations? Looking at the
 String code, I can't see where it would make an impact.


This is from Lucene InputStream:
public final String readString() throws IOException {
  int length = readVInt();
  if (chars == null || length > chars.length)
    chars = new char[length];
  readChars(chars, 0, length);
  return new String(chars, 0, length);
}

If you know the length in bytes, you still have to allocate that many chars
(even though the number of chars may be less than the number of bytes). Not
a big deal IMHO.

A bigger pain is on the writing side, where you can't stream things because
you don't know what the length is going to be (in either bytes *or* UTF-8
chars).

So it turns out that Java's 16 bit chars were just a waste... it's still a
multibyte format *and* it takes up more space. UTF-8 would have been nice -
no conversions necessary.


-Yonik Now hiring -- http://tinyurl.com/7m67g


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Doug Cutting

[EMAIL PROTECTED] wrote:
How will the difference impact String memory allocations?  Looking at 
the String code, I can't see where it would make an impact.


I spoke a bit too soon.  I should have looked at the code first.  You're 
right, I don't think it would require more allocations.


When considering this byte-count versus character-count issue please 
note that it also arises elsewhere.  The PrefixLength in the Term 
Dictionary section of the file format document is currently defined as a 
number of characters, not bytes.


http://lucene.apache.org/java/docs/fileformats.html#Term Dictionary

Implementing this in terms of bytes may have performance implications, 
since, at first glance, the entire byte sequence would need to be 
converted from UTF-8 into the internal string representation for each 
term, rather than just the suffix.  Does anyone see a way around that?


As for how we got to this point: I wrote Lucene's UTF-8 reading and 
writing code in 1998, back when Unicode still had fewer than 2^16 
characters.  It's surprising that it has lasted this long without anyone 
noticing!


Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8

2005-08-30 Thread Ken Krugler

On Monday 29 August 2005 19:56, Ken Krugler wrote:


 Lucene writes strings as a VInt representing the length of the
 string in Java chars (UTF-16 code units), followed by the character
 data.


But wouldn't UTF-16 mean 2 bytes per character?


Yes, UTF-16 means two bytes per code unit. A Unicode character (code 
point) is encoded as either one or two UTF-16 code units.



That doesn't seem to be the
case.


The case where? You mean in what actually gets written out?

String.length() is the length in terms of Java chars, which means 
UTF-16 code units (well, sort of...see below). Looking at the code, 
IndexOutput.writeString() calls writeVInt() with the string length.


One related note. Java 1.4 supports Unicode 3.0, while Java 5.0 
supports Unicode 4.0. It was in Unicode 3.1 that supplementary 
characters (code points above U+FFFF, i.e. outside of the BMP) were added, 
and the UTF-16 encoding formalized.


So I think the issue of non-BMP characters is currently a bit 
esoteric for Lucene, since I'm guessing there are other places in the 
code (e.g. JDK calls used by Lucene) where non-BMP characters won't 
be properly handled. Though some quick tests indicate that there is 
some knowledge of surrogate pairs in 1.4 (e.g. converting a String 
w/surrogate pairs to UTF-8 does the right thing).


-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8

2005-08-30 Thread DM Smith

Daniel Naber wrote:


On Monday 29 August 2005 19:56, Ken Krugler wrote:
 


Lucene writes strings as a VInt representing the length of the
string in Java chars (UTF-16 code units), followed by the character
data.
   

But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the 
case.


UTF-16 is a fixed 2 byte/char representation. But one cannot equate the 
character count with the byte count. Each Java char is 2 bytes. I think 
all that is being said is that the VInt is equal to str.length() as java 
gives it.


On an unrelated project we are determining whether we should use a 
denormalized (letter followed by an accents) or a normalized form 
(letter with accents) of accented characters as we present the text to a 
GUI. We have found that font support varies but appears to be better for 
denormalized. This is not an issue for storage, as it can be transformed 
before it goes to screen. However, it is useful to know which form it is in.


The reason I mention this is that I seem to remember that the length of 
the java string varies with the representation. So then the count would 
not be the number of glyphs that the user sees. Please correct me if I 
am wrong.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Ken Krugler

I think the VInt should be the number of bytes to be stored using the UTF-8
encoding.

It is trivial to use the String methods identified before to do the
conversion. The String(char[]) allocates a new char array.

For performance, you can use the actual CharSet encoding classes - avoiding
all of the lookups performed by the String class.


Regardless of what underlying support is used, if you want to write 
out the VInt value as UTF-8 bytes versus Java chars, the Java String 
has to either be converted to UTF-8 in memory first, or pre-scanned. 
The first is a memory hit, and the second is a performance hit. I 
don't know the extent of either, but it's there.


Note that since the VInt is a variable size, you can't write out the 
bytes first and then fill in the correct value later.
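A sketch of what that pre-scan might look like: computing the UTF-8 byte
length of a Java String without building the byte[] first (hypothetical
helper, surrogate handling simplified):

  static int utf8Length(String s) {
    int bytes = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c < 0x80) bytes += 1;
      else if (c < 0x800) bytes += 2;
      else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length()) {
        bytes += 4;                                // surrogate pair -> one 4-byte sequence
        i++;                                       // consume the low surrogate too
      } else bytes += 3;
    }
    return bytes;
  }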


-- Ken



-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, August 29, 2005 4:24 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.


Ken Krugler wrote:

 The remaining issue is dealing with old-format indexes.


I think that revving the version number on the segments file would be a
good start.  This file must be read before any others.  Its current
version is -1 and would become -2.  (All positive values are version 0,
for back-compatibility.)  Implementations can be modified to pass the
version around if they wish to be back-compatible, or they can simply
throw exceptions for old format indexes.

I would argue that the length written be the number of characters in the
string, rather than the number of bytes written, since that can minimize
string memory allocations.


 I'm going to take this off-list now [ ... ]


Please don't.  It's better to have a record of the discussion.

Doug



--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-30 Thread DM Smith



Ken Krugler wrote:

I think the VInt should be the number of bytes to be stored using 
the UTF-8

encoding.

It is trivial to use the String methods identified before to do the
conversion. The String(char[]) allocates a new char array.

For performance, you can use the actual CharSet encoding classes - 
avoiding

all of the lookups performed by the String class.



Regardless of what underlying support is used, if you want to write 
out the VInt value as UTF-8 bytes versus Java chars, the Java String 
has to either be converted to UTF-8 in memory first, or pre-scanned. 
The first is a memory hit, and the second is a performance hit. I 
don't know the extent of either, but it's there.


Note that since the VInt is a variable size, you can't write out the 
bytes first and then fill in the correct value later.


Sure you can. Do a tell to get the position. Write any number. Write 
the text. Do another tell to note the position. Based on the 
difference between the two tells, you have the length. Rewind to the 
first tell and write out the number. Then advance to the end.


I am not recommending this, but it can be done.

There may be other ways.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8

2005-08-30 Thread Steven Rowe

DM Smith wrote:

Daniel Naber wrote:
But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to 
be the case.



UTF-16 is a fixed 2 byte/char representation.


Except when it's not.  I.e., above the BMP.

From the Unicode 4.0 standard 
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf:


   In the UTF-16 encoding form, code points in the
   range U+0000..U+FFFF are represented as a single
   16-bit code unit; code points in the supplementary
   planes, in the range U+10000..U+10FFFF, are
   instead represented as pairs of 16-bit code units.
   These pairs of special code units are known as
   surrogate pairs.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Robert Engels
I think you guys are WAY overcomplicating things, or you just don't know
enough about the Java class libraries.

If you use the java.nio.charset.CharsetEncoder class, then you can reuse the
byte[] array, and then it is a simple write of the length, and a blast copy
of the required number of bytes to the OutputStream (which will either fit
or expand its byte[]). You can perform all of this WITHOUT creating new
byte[] or char[] (as long as the existing one is large enough to fit the
encoded/decoded data).

There is no need to use any sort of file position mark/reset stuff.

R




-Original Message-
From: Ken Krugler [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 30, 2005 11:54 AM
To: java-dev@lucene.apache.org
Subject: RE: Lucene does NOT use UTF-8.


I think the VInt should be the number of bytes to be stored using the
UTF-8
encoding.

It is trivial to use the String methods identified before to do the
conversion. The String(char[]) allocates a new char array.

For performance, you can use the actual CharSet encoding classes - avoiding
all of the lookups performed by the String class.

Regardless of what underlying support is used, if you want to write
out the VInt value as UTF-8 bytes versus Java chars, the Java String
has to either be converted to UTF-8 in memory first, or pre-scanned.
The first is a memory hit, and the second is a performance hit. I
don't know the extent of either, but it's there.

Note that since the VInt is a variable size, you can't write out the
bytes first and then fill in the correct value later.

-- Ken


-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, August 29, 2005 4:24 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.


Ken Krugler wrote:
  The remaining issue is dealing with old-format indexes.

I think that revving the version number on the segments file would be a
good start.  This file must be read before any others.  Its current
version is -1 and would become -2.  (All positive values are version 0,
for back-compatibility.)  Implementations can be modified to pass the
version around if they wish to be back-compatible, or they can simply
throw exceptions for old format indexes.

I would argue that the length written be the number of characters in the
string, rather than the number of bytes written, since that can minimize
string memory allocations.

  I'm going to take this off-list now [ ... ]

Please don't.  It's better to have a record of the discussion.

Doug


--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
 I think you guys are WAY overcomplicating things, or you just don't know
 enough about the Java class libraries.


People were just pointing out that if the vint isn't String.length(), then 
one has to either buffer the entire string, or pre-scan it.

It's a valid point, and CharsetEncoder doesn't change that.

 -Yonik Now hiring -- http://tinyurl.com/7m67g


RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Robert Engels
Not true. You do not need to pre-scan it.

When you use CharSet encoder, it will write the bytes to a buffer (expanding
as needed). At the end of the encoding you can get the actual number of
bytes needed.

The pseudo-code is

use CharsetEncoder to write String to ByteBuffer
write VInt using ByteBuffer.getLength()
write bytes using ByteBuffer.getByte[]

better yet, use NIO so you can pass the ByteBuffer directly.
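In actual Java that pseudo-code might look roughly like this (a sketch only;
writeVInt/writeBytes are the existing OutputStream methods, the encoding step
is plain java.nio, and the per-call ByteBuffer allocation could be replaced by
a reusable buffer as described above):

  import java.nio.ByteBuffer;
  import java.nio.CharBuffer;
  import java.nio.charset.Charset;
  import java.nio.charset.CharsetEncoder;

  class Utf8StringWriter {
    private final CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();

    void writeString(org.apache.lucene.store.OutputStream out, String s)
        throws java.io.IOException {
      ByteBuffer bytes = encoder.encode(CharBuffer.wrap(s)); // encode first, so the
      out.writeVInt(bytes.remaining());                      // byte count is known
      out.writeBytes(bytes.array(), bytes.remaining());      // then blast-copy the bytes
    }
  }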


-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 30, 2005 12:56 PM
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: Lucene does NOT use UTF-8.


 I think you guys are WAY overcomplicating things, or you just don't know
 enough about the Java class libraries.


People were just pointing out that if the vint isn't String.length(), then
one has to either buffer the entire string, or pre-scan it.

It's a valid point, and CharsetEncoder doesn't change that.

 -Yonik Now hiring -- http://tinyurl.com/7m67g


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
On 8/30/05, Robert Engels [EMAIL PROTECTED] wrote:
 
 Not true. You do not need to pre-scan it.


What I previously wrote, with emphasis on key words added:
one has to *either* buffer the entire string, *or* pre-scan it.

-Yonik Now hiring -- http://tinyurl.com/7m67g


Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
I've been looking around... do you have a pointer to the source where just 
the suffix is converted from UTF-8?

I understand the index format, but I'm not sure I understand the problem 
that would be posed by the prefix length being a byte count.

-Yonik Now hiring -- http://tinyurl.com/7m67g

On 8/30/05, Doug Cutting [EMAIL PROTECTED] wrote:
 
 [EMAIL PROTECTED] wrote:
  How will the difference impact String memory allocations? Looking at
  the String code, I can't see where it would make an impact.
 
 I spoke a bit too soon. I should have looked at the code first. You're
 right, I don't think it would require more allocations.
 
 When considering this byte-count versus character-count issue please
 note that it also arises elsewhere. The PrefixLength in the Term
 Dictionary section of the file format document is currently defined as a
 number of characters, not bytes.
 
 http://lucene.apache.org/java/docs/fileformats.html#Term Dictionary
 
 Implementing this in terms of bytes may have performance implications,
 since, at first glance, the entire byte sequence would need to be
 converted from UTF-8 into the internal string representation for each
 term, rather than just the suffix. Does anyone see a way around that?
 
 As for how we got to this point: I wrote Lucene's UTF-8 reading and
 writing code in 1998, back when Unicode still had fewer than 2^16
 characters. It's surprising that it has lasted this long without anyone
 noticing!
 
 Doug
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene does NOT use UTF-8

2005-08-30 Thread Ken Krugler

Daniel Naber wrote:


On Monday 29 August 2005 19:56, Ken Krugler wrote:


Lucene writes strings as a VInt representing the length of the
string in Java chars (UTF-16 code units), followed by the character
data.
  

But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem 
to be the case.



UTF-16 is a fixed 2 byte/char representation.


I hate to keep beating this horse, but I want to emphasize that it's 
2 bytes per Java char (or UTF-16 code unit), not Unicode character 
(code point).


But one cannot equate the character count with the byte count. Each 
Java char is 2 bytes. I think all that is being said is that the 
VInt is equal to str.length() as java gives it.


On an unrelated project we are determining whether we should use a 
denormalized (letter followed by an accents) or a normalized form 
(letter with accents) of accented characters as we present the text 
to a GUI. We have found that font support varies but appears to be 
better for denormalized. This is not an issue for storage, as it can 
be transformed before it goes to screen. However, it is useful to 
know which form it is in.


The reason I mention this is that I seem to remember that the length 
of the java string varies with the representation.


String.length() is the number of Java chars, which always uses 
UTF-16. If you normalize text, then yes that can change the number of 
code units and thus the length of the string, but so can doing any 
kind of text munging (e.g. replacement) operation on characters in 
the string.


So then the count would not be the number of glyphs that the user 
sees. Please correct me if I am wrong.


All kinds of mxn mappings (both at the layout engine level, and using 
font tables) are possible when going from Unicode characters to 
display glyphs. Plus zero-width left-kerning glyphs would also alter 
the relationship between # of visual characters and backing store 
characters.


-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
 The inefficiency would be if prefix were re-converted from UTF-8
 for each term, e.g., in order to compare it to the target.

Ahhh, gotcha.

A related problem exists even if the prefix length vInt is changed to 
represent the number of unicode chars (as opposed to number of java chars), 
right? The prefix length is no longer the offset into the char[] to put the 
suffix.

Another approach might be to convert the target to a UTF-8 byte[] 
and do all comparisons on byte[]. UTF-8 has some very nice properties, 
including that the byte[] representation of UTF-8 strings compare the same 
as UCS-4 would.

As you say, the variations need to be tested.
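A sketch of that comparison (treating the bytes as unsigned is the only
subtle part; with that, UTF-8 byte order matches code point order):

  static int compareUtf8(byte[] a, int aLen, byte[] b, int bLen) {
    int n = Math.min(aLen, bLen);
    for (int i = 0; i < n; i++) {
      int diff = (a[i] & 0xFF) - (b[i] & 0xFF);   // compare as unsigned bytes
      if (diff != 0) return diff;
    }
    return aLen - bLen;                           // shorter string sorts first
  }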

-Yonik 
Now hiring -- http://tinyurl.com/7m67g


Re: Lucene does NOT use UTF-8

2005-08-30 Thread Tom White
On 8/30/05, Ken Krugler [EMAIL PROTECTED] wrote:
 
 Daniel Naber wrote:
 
 On Monday 29 August 2005 19:56, Ken Krugler wrote:
 
 Lucene writes strings as a VInt representing the length of the
 string in Java chars (UTF-16 code units), followed by the character
 data.
 
 
 But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem
 to be the case.
 
 UTF-16 is a fixed 2 byte/char representation.
 
 I hate to keep beating this horse, but I want to emphasize that it's
 2 bytes per Java char (or UTF-16 code unit), not Unicode character
 (code point).


There's more horse beating on Java and Unicode 4 in this blog entry: 
http://weblogs.java.net/blog/joconner/archive/2005/08/how_long_is_you.html.


Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Ken Krugler

Yonik Seeley wrote:
A related problem exists even if the prefix length vInt is changed 
to represent the number of unicode chars (as opposed to number of 
java chars), right? The prefix length is no longer the offset into 
the char[] to put the suffix.


Yes, I suppose this is a problem too.  Sigh.

Another approach might be to convert the target to a UTF-8 byte[] 
and do all comparisons on byte[]. UTF-8 has some very nice 
properties, including that the byte[] representation of UTF-8 
strings compare the same as UCS-4 would.


I was not aware of that, but I see you are correct:

   o  The byte-value lexicographic sorting order of UTF-8 strings is the
  same as if ordered by character numbers.

(From http://www.faqs.org/rfcs/rfc3629.html)

That makes the byte representation much more palatable, since Lucene 
orders terms lexicographically.


Where/how is the Lucene ordering of terms used?

I'm asking because people often confuse lexicographic order with 
dictionary order, whereas in the context of UTF-8 it just means 
the same order as Unicode code points. And the order of Java chars 
would be the same as for Unicode code points, other than non-BMP 
characters.


Thanks,

-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
 Where/how is the Lucene ordering of terms used?


An ordering is necessary to be able to find things in the index.
For the most part, the ordering doesn't seem to matter... the only query that 
comes to mind where it does matter is RangeQuery.

For sorting queries, one is able to specify a Locale.
-Yonik 
Now hiring -- http://tinyurl.com/7m67g


Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Doug Cutting

Yonik Seeley wrote:

Where/how is the Lucene ordering of terms used?


An ordering is necessary to be able to find things in the index.
For the most part, the ordering doesn't seem to matter... the only query that 
comes to mind where it does matter is RangeQuery.


For back-compatibility it would be best if the ordering is consistent 
with the current ordering, i.e., lexicographic by character (or code 
point, if you prefer).  Fortunately, UTF-8 makes this easy.


Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ken Krugler

I'm not familiar with UTF-8 enough to follow the details of this
discussion.  I hope other Lucene developers are, so we can resolve this
issue... anyone raising a hand?


I could, but recent posts make me think this is heading towards a 
religious debate :)


I think the following statements are all true:

a. Using UTF-8 for strings would make it easier for Lucene indexes to 
be used by other implementations besides the reference Java version.


b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings.

c. The hard(er) part would be backwards compatibility with older 
indexes. I haven't looked at this enough to really know, but one 
example is the compound file (xx.cfs) format...I didn't see a version 
number, and it contains strings.


d. The documentation could be clearer on what is meant by the string 
length, but this is a trivial change.


What's unclear to me (not being a Perl, Python, etc jock) is how much 
easier it would be to get these other implementations working with 
Lucene, following a change to UTF-8. So I can't comment on the return 
on time required to change things.


I'm also curious about the existing CLucene & PyLucene ports. Would 
they also need to be similarly modified, with the proposed changes?


One final point. I doubt people have been adding strings with 
embedded nulls, and text outside of the Unicode BMP is also very 
rare. So _most_ Lucene indexes only contain valid UTF-8 data. It's 
only the above two edge cases that create an interoperability problem.


-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8

2005-08-29 Thread Erik Hatcher

On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:

I'm not familiar with UTF-8 enough to follow the details of this
discussion.  I hope other Lucene developers are, so we can resolve  
this

issue... anyone raising a hand?



I could, but recent posts make me think this is heading towards a  
religious debate :)


Ken - you mentioned taking the discussion off-line in a previous  
post.  Please don't.  Let's keep it alive on java-dev until we have a  
resolution to it.



I think the following statements are all true:

a. Using UTF-8 for strings would make it easier for Lucene indexes  
to be used by other implementations besides the reference Java  
version.


b. It would be easy to tweak Lucene to read/write conformant UTF-8  
strings.


What, if any, performance impact would changing Java Lucene in this  
regard have?   (I realize this is rhetorical at this point, until a  
solution is at hand)


c. The hard(er) part would be backwards compatibility with older  
indexes. I haven't looked at this enough to really know, but one  
example is the compound file (xx.cfs) format...I didn't see a  
version number, and it contains strings.


I don't know the gory details, but we've made compatibility breaking  
changes in the past and the current version of Lucene can open older  
formats, but only write the most current format.  I suspect it could  
be made to be backwards compatible.  Worst case, we break  
compatibility in 2.0.


d. The documentation could be clearer on what is meant by the  
string length, but this is a trivial change.


That change was made by Daniel soon after this discussion began.

What's unclear to me (not being a Perl, Python, etc jock) is how  
much easier it would be to get these other implementations working  
with Lucene, following a change to UTF-8. So I can't comment on the  
return on time required to change things.


I'm also curious about the existing CLucene & PyLucene ports. Would  
they also need to be similarly modified, with the proposed changes?


PyLucene is literally the Java version of Lucene underneath (via GCJ/ 
SWIG), so no worries there.  CLucene would need to be changed, as  
well as DotLucene and the other ports out there.


If the rest of the world of Lucene ports followed suit with PyLucene  
and did the GCJ/SWIG thing, we'd have no problems :)  What are the  
disadvantages to following this model with Plucene?


Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ronald Dauster

Erik Hatcher wrote:


On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:


I'm not familiar with UTF-8 enough to follow the details of this
discussion.  I hope other Lucene developers are, so we can resolve  
this

issue... anyone raising a hand?



I could, but recent posts make me think this is heading towards a  
religious debate :)



Ken - you mentioned taking the discussion off-line in a previous  
post.  Please don't.  Let's keep it alive on java-dev until we have a  
resolution to it.



I'd also like to follow this thread.


I think the following statements are all true:

a. Using UTF-8 for strings would make it easier for Lucene indexes  
to be used by other implementations besides the reference Java  version.


b. It would be easy to tweak Lucene to read/write conformant UTF-8  
strings.



What, if any, performance impact would changing Java Lucene in this  
regard have?   (I realize this is rhetorical at this point, until a  
solution is at hand)


Looking at the source of 1.4.3, fixing the NUL character encoding is 
trivial for writing and reading already works for both the standard and 
the java-style encoding. Not much work and absolutely no performance 
impact here.


The surrogate pair problem is another matter entirely. First of all, 
lets see if I do understand the problem correctly: Some unicode 
characters can be represented by one codepoint outside the BMP (i. e., 
not with 16 bits) and alternatively with two codepoints, both of them in 
the 16-bit range. According to Marvin's explanations, the Unicode 
standard requires these characters to be represented as the one 
codepoint in UTF-8, resulting in a 4-, 5-, or 6-byte encoding for that 
character.


But since a Java char _is_ 16 bit, the codepoints beyond the 16-bit 
range cannot be represented as chars.  That is, the 
in-memory-representation still requires the use of the surrogate pairs.  
Therefore, writing consists of translating the surrogate pair to the 
16bit representation of the same character and then algorithmically 
encoding that.  Reading is exactly the reverse process.


Adding code to handle the 4 to 6 byte encodings to the 
readChars/writeChars method is simple, but how do you do the mapping 
from surrogate pairs to the chars they represent? Is there an algorithm 
for doing that except for table lookups or huge switch statements?


c. The hard(er) part would be backwards compatibility with older  
indexes. I haven't looked at this enough to really know, but one  
example is the compound file (xx.cfs) format...I didn't see a  
version number, and it contains strings.



I don't know the gory details, but we've made compatibility breaking  
changes in the past and the current version of Lucene can open older  
formats, but only write the most current format.  I suspect it could  
be made to be backwards compatible.  Worst case, we break  
compatibility in 2.0.


I believe backward compatibility is the easy part and comes for free.  
As I mentioned above, reading the correct NUL encoding already works 
and the non-BMP characters will have to be represented as surrogate 
pairs internally anyway.  So there is no problem with reading the old 
encoding and there is nothing wrong with still using or reading the 
surrogate pairs, only that they would not be written. Even indices with 
mixed segments are not a problem. 

Given that the CompoundFileReader/Writer use a 
lucene.store.OutputStream/InputStream for their FileEntries, they would 
also be able to read older files but potentially write incompatible 
files.  OTOH, when used inside lucene, the filenames do not contain NULs 
or non-BMP chars.


But: Is the compound file format supposed to be interoperable? Which 
formats are?



[...]

What's unclear to me (not being a Perl, Python, etc jock) is how  
much easier it would be to get these other implementations working  
with Lucene, following a change to UTF-8. So I can't comment on the  
return on time required to change things.


[...]



PyLucene is literally the Java version of Lucene underneath (via GCJ/ 
SWIG), so no worries there.  CLucene would need to be changed, as  
well as DotLucene and the other ports out there.


If the rest of the world of Lucene ports followed suit with PyLucene  
and did the GCJ/SWIG thing, we'd have no problems :)  What are the  
disadvantages to following this model with Plucene?



Some parts of the Lucene API require subclassing (e. g., Analyzer) and 
SWIG does support cross-language polymorphism only for a few languages, 
notably Python and Java but not for Perl. Noticing the smiley I won't 
mention the zillion other reasons not to use the GCJ/SWIG thing.


Ronald

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ken Krugler

[snip]

The surrogate pair problem is another matter entirely. First of all, 
lets see if I do understand the problem correctly: Some unicode 
characters can be represented by one codepoint outside the BMP (i. 
e., not with 16 bits) and alternatively with two codepoints, both of 
them in the 16-bit range.


A Unicode character has a code point, which is a scalar value in the 
range U+0000 to U+10FFFF. The code point for every character in the 
Unicode character set will fall in this range.


There are Unicode encoding schemes, which specify how Unicode code 
point values are serialized. Examples include UTF-8, UTF-16LE, 
UTF-16BE, UTF-32, UTF-7, etc.


The UTF-16 (big or little endian) encoding scheme uses two code units 
(16-bit values) to encode Unicode characters with code point values above 
U+FFFF.


According to Marvin's explanations, the Unicode standard requires 
these characters to be represented as the one codepoint in UTF-8, 
resulting in a 4-, 5-, or 6-byte encoding for that character.


Since the Unicode code point range is constrained to 
U+0000...U+10FFFF, the longest valid UTF-8 sequence is 4 bytes.


But since a Java char _is_ 16 bit, the codepoints beyond the 16-bit 
range cannot be represented as chars.  That is, the 
in-memory-representation still requires the use of the surrogate 
pairs.  Therefore, writing consists of translating the surrogate 
pair to the 16bit representation of the same character and then 
algorithmically encoding that.  Reading is exactly the reverse 
process.


Yes. Writing requires that you combine the two surrogate characters 
into a Unicode code point, then converting that value into the UTF-8 
4 byte sequence.


Adding code to handle the 4 to 6 byte encodings to the 
readChars/writeChars method is simple, but how do you do the mapping 
from surrogate pairs to the chars they represent? Is there an 
algorithm for doing that except for table lookups or huge switch 
statements?


It's easy, since U+D800...U+DBFF is defined as the range for the high 
(most significant) surrogate, and U+DC00...U+DFFF is defined as the 
range for the low (least significant) surrogate.
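In code, the whole mapping is a couple of shifts and adds; a sketch
(hypothetical helpers, no tables or switch statements needed):

  // Combine a surrogate pair (high in U+D800..U+DBFF, low in U+DC00..U+DFFF)
  // into a code point in U+10000..U+10FFFF.
  static int toCodePoint(char high, char low) {
    return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
  }

  // Emit that code point as a 4-byte UTF-8 sequence; returns the new offset.
  static int writeUtf8Supplementary(int codePoint, byte[] out, int p) {
    out[p++] = (byte) (0xF0 | (codePoint >> 18));
    out[p++] = (byte) (0x80 | ((codePoint >> 12) & 0x3F));
    out[p++] = (byte) (0x80 | ((codePoint >> 6) & 0x3F));
    out[p++] = (byte) (0x80 | (codePoint & 0x3F));
    return p;
  }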


-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8

2005-08-29 Thread Ken Krugler

On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:

I'm not familiar with UTF-8 enough to follow the details of this
discussion.  I hope other Lucene developers are, so we can resolve this
issue... anyone raising a hand?


I could, but recent posts make me think this is heading towards a 
religious debate :)


Ken - you mentioned taking the discussion off-line in a previous 
post.  Please don't.  Let's keep it alive on java-dev until we have 
a resolution to it.



I think the following statements are all true:

a. Using UTF-8 for strings would make it easier for Lucene indexes 
to be used by other implementations besides the reference Java 
version.


b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings.


What, if any, performance impact would changing Java Lucene in this 
regard have?   (I realize this is rhetorical at this point, until a 
solution is at hand)


Almost zero. A tiny hit when reading/writing surrogate pairs, to 
properly encode them as a 4 byte UTF-8 sequence versus two 3-byte 
sequences.


c. The hard(er) part would be backwards compatibility with older 
indexes. I haven't looked at this enough to really know, but one 
example is the compound file (xx.cfs) format...I didn't see a 
version number, and it contains strings.


I don't know the gory details, but we've made compatibility breaking 
changes in the past and the current version of Lucene can open older 
formats, but only write the most current format.  I suspect it could 
be made to be backwards compatible.  Worst case, we break 
compatibility in 2.0.


Ronald is correct in that it would be easy to make the reader handle 
both Java modified UTF-8 and UTF-8, and the writer always output 
UTF-8. So the only problem would be if older versions of Lucene (or 
maybe CLucene) wound up trying to read strings that contained 4-byte 
UTF-8 sequences, as they wouldn't know how to convert this into two 
UTF-16 Java chars.
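A sketch of such a lenient reader (not Lucene's readChars, just an
illustration): the lead byte says whether a 1-, 2-, 3- or 4-byte sequence
follows, so the same loop accepts the old Java-style output (0xC0 0x80 nulls,
surrogate pairs as two 3-byte sequences) as well as standard UTF-8:

  static int decodeUtf8(byte[] in, int byteLen, char[] out) {
    int i = 0, o = 0;
    while (i < byteLen) {
      int b = in[i++] & 0xFF;
      if (b < 0x80) {                               // 1 byte (includes a real NUL)
        out[o++] = (char) b;
      } else if (b < 0xE0) {                        // 2 bytes (includes 0xC0 0x80)
        out[o++] = (char) (((b & 0x1F) << 6) | (in[i++] & 0x3F));
      } else if (b < 0xF0) {                        // 3 bytes (BMP, incl. old-style surrogates)
        out[o++] = (char) (((b & 0x0F) << 12)
                         | ((in[i++] & 0x3F) << 6) | (in[i++] & 0x3F));
      } else {                                      // 4 bytes: split into a surrogate pair
        int cp = ((b & 0x07) << 18) | ((in[i++] & 0x3F) << 12)
               | ((in[i++] & 0x3F) << 6) | (in[i++] & 0x3F);
        out[o++] = (char) (0xD800 + ((cp - 0x10000) >> 10));
        out[o++] = (char) (0xDC00 + ((cp - 0x10000) & 0x3FF));
      }
    }
    return o;                                       // number of chars produced
  }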


Since 4-byte UTF-8 sequences are only for characters outside of the 
BMP, and these are rare, it seems like an OK thing to do, but that's 
just my uninformed view.


d. The documentation could be clearer on what is meant by the 
string length, but this is a trivial change.


That change was made by Daniel soon after this discussion began.


Daniel changed the definition of Chars, but the String section still 
needs to be clarified. Currently it says:


Lucene writes strings as a VInt representing the length, followed by 
the character data.


It should read:

Lucene writes strings as a VInt representing the length of the 
string in Java chars (UTF-16 code units), followed by the character 
data.


-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8

2005-08-29 Thread Andi Vajda


If the rest of the world of Lucene ports followed suit with PyLucene  and 
did the GCJ/SWIG thing, we'd have no problems :)  What are the 
disadvantages to following this model with Plucene?



Some parts of the Lucene API require subclassing (e. g., Analyzer) and SWIG 
does support cross-language polymorphism only for a few languages, notably 
Python and Java but not for Perl. Noticing the smiley I won't mention the 
zillion other reasons not to use the GCJ/SWIG thing.


Yes, that's true, Java Lucene requires a bunch of subclassing to truly shine 
in any sizable application. I didn't use SWIG's director feature to implement 
extension but a more or less hand-carved SWIG-in-reverse trick that can easily 
be reproduced by other such SWIG-based ports.

See http://svn.osafoundation.org/pylucene/trunk/README for more details...

Andi..


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8

2005-08-29 Thread Marvin Humphrey


Eric Hatcher wrote...

What, if any, performance impact would changing Java Lucene in this  
regard have?


And Ken Krugler wrote...

Lucene writes strings as a VInt representing the length of the  
string in Java chars (UTF-16 code units), followed by the character  
data.


I had been working under the assumption that the value of the VInt  
would be changed as well.  It seemed logical that if strings were  
encoded with legal UTF-8, the count at the head should indicate  
either 1) the number of UTF-8 characters in the string, or 2) the  
number of bytes occupied by the encoded string.


Do either of those and more substantial changes to Java Lucene would  
be required.  I expect that the impact on performance could be made  
negligible for the first option, but the question of backwards  
compatibility would become a lot messier.


It simply had not occurred to me to keep the VInt as is.  If you do  
that, this becomes a much more localized problem.


For Plucene, I'll avoid the gory details and just say that having the  
VInt continue to represent UTF-16 code units limits the availability  
of certain options, but doesn't cause major inefficiencies.  Now that  
we know that's what it does, we can work with it.  A transition to  
always-legal UTF-8 obviates the need to scan for and fix the edge  
cases, and addresses my main concern.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-29 Thread Doug Cutting

Ken Krugler wrote:

The remaining issue is dealing with old-format indexes.


I think that revving the version number on the segments file would be a 
good start.  This file must be read before any others.  Its current 
version is -1 and would become -2.  (All positive values are version 0, 
for back-compatibility.)  Implementations can be modified to pass the 
version around if they wish to be back-compatible, or they can simply 
throw exceptions for old format indexes.


I would argue that the length written be the number of characters in the 
string, rather than the number of bytes written, since that can minimize 
string memory allocations.



I'm going to take this off-list now [ ... ]


Please don't.  It's better to have a record of the discussion.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-29 Thread tjones

Doug,

How will the difference impact String memory allocations?  Looking at the 
String code, I can't see where it would make an impact.


Tim


I would argue that the length written be the number of characters in the 
string, rather than the number of bytes written, since that can minimize 
string memory allocations.



I'm going to take this off-list now [ ... ]


Please don't.  It's better to have a record of the discussion.

Doug



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-28 Thread Marvin Humphrey
 it with pointers.  Not sure how to
  // make it work in Java.
  break;
  }
}
  }


Initial benchmarking experiments appear to indicate negligible impact
on performance.

 So I doubt
 this would be a slam-dunk in the Lucene community.

I appreciate your willingness to at least weigh the matter, and I
understand the potential reluctance.  Hopefully the comparable
performance of the standards-compliant code above will render the issue
moot, and the next release of Lucene will use legal UTF-8.

Best,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



From: Ken Krugler [EMAIL PROTECTED]
Date: August 27, 2005 2:11:34 PM PDT
To: java-user@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.
Reply-To: java-user@lucene.apache.org


I've delved into the matter of Lucene and UTF-8 a little further,  
and I am discouraged by what I believe I've uncovered.


Lucene should not be advertising that it uses standard UTF-8 --  
or even UTF-8 at all, since Modified UTF-8 is _illegal_ UTF-8.




Unfortunately this is how Sun documents the format they use for  
serialized strings.



The two distinguishing characteristics of Modified UTF-8 are the  
treatment of codepoints above the BMP (which are written as  
surrogate pairs), and the encoding of null bytes as 11000000 10000000  
rather than 00000000.  Both of these became illegal as of  
Unicode 3.1 (IIRC), because they are not shortest-form and non- 
shortest-form UTF-8 presents a security risk.




For UTF-8 these were always invalid, but the standard wasn't very  
clear about it. Unfortunately the fuzzy nature of the 1.0/2.0 specs  
encouraged some sloppy implementations.



The documentation should really state that Lucene stores strings in  
a Java-only adulteration of UTF-8,




Yes, good point. I don't know who's in charge of that page, but it  
should be fixed.




unsuitable for interchange.



Other than as an internal representation for Java serialization.


Since Perl uses true shortest-form UTF-8 as its native encoding,  
Plucene would have to jump through two efficiency-killing hoops in  
order to write files that would not choke Lucene: instead of  
writing out its true, legal UTF-8 directly, it would be necessary  
to first translate to UTF-16, then duplicate the Lucene encoding  
algorithm from OutputStream.  In theory.




Actually I don't think it would be all that bad. Since a null in the  
middle of a string is rare, as is a character outside of the BMP, a  
quick scan of the text should be sufficient to determine if it can be  
written as-is.


The ICU project has C code that can be used to quickly walk a string.  
I believe these would find/report such invalid code points, if you  
use the safe (versus faster unsafe) versions.



Below you will find a simple Perl script which illustrates what  
happens when Perl encounters malformed UTF-8.  Run it (you need  
Perl 5.8 or higher) and you will see why even if I thought it was a  
good idea to emulate the Java hack for encoding Modified UTF-8,  
trying to make it work in practice would be a nightmare.


If Plucene were to write legal UTF-8 strings to its index files,  
Java Lucene would misbehave and possibly blow up any time a string  
contained either a 4-byte character or a null byte.  On the flip  
side, Perl will spew warnings like crazy and possibly blow up  
whenever it encounters a Lucene-encoded null or surrogate pair.   
The potential blowups are due to the fact that Lucene and Plucene  
will not agree on how many characters a string contains, resulting  
in overruns or underruns.


I am hoping that the answer to this will be a fix to the encoding  
mechanism in Lucene so that it really does use legal UTF-8.  The  
most efficient way to go about this has not yet presented itself.




I'd need to look at the code more, but using something other than the  
Java serialized format would probably incur a performance penalty for  
the Java implementation. Or at least make it harder to handle the  
strings using the standard Java serialization support. So I doubt  
this would be a slam-dunk in the Lucene community.


-- Ken




#

#!/usr/bin/perl
use strict;
use warnings;

# illegal_null.plx -- Perl complains about non-shortest-form null.

my $data = "foo\xC0\x80\n";

open (my $virtual_filehandle, "+<:utf8", \$data);
print <$virtual_filehandle>;



--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-28 Thread Ken Krugler

Hi Marvin,

Thanks for the detailed response. After spending a bit more time in 
the code, I think you're right - all strings seem to be funnelled 
through IndexOutput. The remaining issue is dealing with old-format 
indexes.


I'm going to take this off-list now, since I'm guessing most list 
readers aren't too interested in the on-going discussion. If anybody 
else would like to be copied, send me an email.


-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-28 Thread Marvin Humphrey

Hello, Robert...

On Aug 28, 2005, at 7:50 PM, Robert Engels wrote:

Sorry, but I think you are barking up the wrong tree... and your  
tone is
quite bizarre. My personal OPINION is that your script language  
is an

abomination, and anyone that develops in it is clearly hurting the
advancement of all software - but that is another story, and  
doesn't matter

much to the discussion - in a similar fashion your choice of words is
clearly not gong to help matters.


My personal perspective is a utilitarian one: languages, platforms,  
they all come and go eventually, and in between a lot of stuff gets  
done.  I enjoy and appreciate Java (what I know of it), and I watched  
the Ruby/Java spat a little while ago with dismay.  The enmity is not  
returned.  :)


It may be less efficient to decode in other languages, but I don't  
think the
original Lucene designers were too worried about the efficiencies  
of other

languages/platforms.


That may be the case.  I suppose we're about to find out how  
important the Lucene development community considers interchange.   
The phrase standard UTF-8 in the documentation led me to believe  
that the intention was to deploy honest-to-goodness UTF-8.  In fact,  
as was pointed out, the early versions of the Unicode standard were  
not very clear.  Lucene was originally begun in 1998, and Unicode  
Corrigendum #1: UTF-8 Shortest Form wasn't released until 2001.  My  
best guess is that it was supposed to be legal UTF-8 and that the non- 
conformance is unintentional.


Otis Gospodnetic raised objections when the Plucene project made the  
decision to abandon index compatibility with Java Lucene.  I've been  
arguing that that decision ought to be reconsidered.  It will make it  
easier to achieve this shared goal of interoperability if Plucene  
does not have to go out of its way to defeat measures painstakingly  
put in place by the Perl5Porters team to ensure secure and robust  
Unicode support.


One of the reasons I have placed my own search engine project on hold  
was that I concluded I could not improve in a meaningful way on  
Lucene's file format.  It's really a marvelous piece of work.   
Perhaps it will become the TIFF of inverted index formats.  It seems  
to me that the Lucene project would benefit from having it widely  
adopted.  I'd like to help with that.


Using String.getBytes("UTF-8"), and String.String(byte[],"UTF-8")  
is all

that is needed.


Thank you for the tip.  At first blush, I'm concerned that those may  
be difficult to make work with OutputStream's readByte() without  
incurring a performance penalty, but if I'm wrong and it's six-of-one- 
half-dozen-of-another for Java Lucene, then if a change is going to  
be made, I'll argue for that one.  That would harmonize with the way  
binary field data is stored, assuming that I can trust that portion  
of the spec document. ;)


Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-27 Thread Marvin Humphrey

On Aug 26, 2005, at 10:14 PM, jian chen wrote:

It seems to me that in theory, Lucene storage code could use true  
UTF-8 to
store terms. Maybe it is just a legacy issue that the modified  
UTF-8 is

used?


It's not a matter of a simple switch.  The VInt count at the head of  
a Lucene string is not the number of Unicode code points the string  
contains.  It's the number of Java chars necessary to contain that  
string.  Code points above the BMP require 2 java chars, since they  
must be represented by surrogate pairs.  The same code point must be  
represented by one character in legal UTF-8.


If Plucene counts the number of legal UTF-8 characters and assigns  
that number as the VInt at the front of a string, when Java Lucene  
decodes the string it will allocate an array of char which is too  
small to hold the string.
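A concrete example of the mismatch, using U+1D11E (MUSICAL SYMBOL G CLEF),
a code point outside the BMP:

  public class LengthMismatch {
    public static void main(String[] args) throws Exception {
      String s = "\uD834\uDD1E";                       // one code point, two Java chars
      System.out.println(s.length());                  // prints 2 -- Lucene's current VInt
      System.out.println(s.getBytes("UTF-8").length);  // prints 4 -- bytes of legal UTF-8
      // A count of 1 "UTF-8 character" would make Java Lucene under-allocate its char[].
    }
  }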


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-27 Thread jian chen
Hi, Ken,

Thanks for your email. You are right, I was meant to propose that Lucene 
switch to use true UTF-8, rather than having to work around this issue by 
fixing the caused problems elsewhere. 

Also, conforming to standards like UTF-8 will make the code easier for new 
developers to pick up.

Just my 2 cents.

Thanks,

Jian

On 8/27/05, Ken Krugler [EMAIL PROTECTED] wrote:
 
 On Aug 26, 2005, at 10:14 PM, jian chen wrote:
 
 It seems to me that in theory, Lucene storage code could use true UTF-8 
 to
 store terms. Maybe it is just a legacy issue that the modified UTF-8 is
 used?
 
The use of 0xC0 0x80 to encode a U+0000 Unicode code point is an
 aspect of Java serialization of character streams. Java uses what
 they call a modified version of UTF-8, though that's a really bad
 way to describe it. It's a different Unicode encoding, one that
 resembles UTF-8, but that's it.
 
 It's not a matter of a simple switch. The VInt count at the head of
 a Lucene string is not the number of Unicode code points the string
 contains. It's the number of Java chars necessary to contain that
 string. Code points above the BMP require 2 java chars, since they
 must be represented by surrogate pairs. The same code point must be
 represented by one character in legal UTF-8.
 
 If Plucene counts the number of legal UTF-8 characters and assigns
 that number as the VInt at the front of a string, when Java Lucene
 decodes the string it will allocate an array of char which is too
 small to hold the string.
 
 I think Jian was proposing that Lucene switch to using a true UTF-8
 encoding, which would make things a bit cleaner. And probably easier
 than changing all references to CESU-8 :)
 
 And yes, given that the integer count is the number of UTF-16 code
 units required to represent the string, your code will need to do a
 bit more processing when calculating the character count, but that's
 a one-liner, right?
 
 -- Ken
 --
 Ken Krugler
 TransPac Software, Inc.
 http://www.transpac.com
 +1 530-470-9200
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]