Re: Lucene does NOT use UTF-8.
On Aug 30, 2005, at 12:47 PM, Doug Cutting wrote: Yonik Seeley wrote: I've been looking around... do you have a pointer to the source where just the suffix is converted from UTF-8? I understand the index format, but I'm not sure I understand the problem that would be posed by the prefix length being a byte count. TermBuffer.java:66 Things could work fine if the prefix length were a byte count. A byte buffer could easily be constructed that contains the full byte sequence (prefix + suffix), and then this could be converted to a String. The inefficiency would be if prefix were re-converted from UTF-8 for each term, e.g., in order to compare it to the target. Prefixes are frequently longer than suffixes, so this could be significant. Does that make sense? I don't know whether it would actually be significant, although TermBuffer.java was added recently as a measurable performance enhancement, so this is performance critical code. We need to stop discussing this in the abstract and start coding alternatives and benchmarking them. Is java.nio.charset.CharsetEncoder fast enough? Will moving things through CharBuffer and ByteBuffer be too slow? Should Lucene keep maintaining its own UTF-8 implementation for performance? I don't know, only some experiments will tell. Doug I don't know if it matters for Lucene usage. But if using CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a significant problem, it's probably due to startup/init time of these methods for individually converting many small strings, not inherently due to UTF-8 usage. I'm confident that a custom UTF-8 implementation can almost completely eliminate these issues. I've done this before for binary XML with great success, and it could certainly be done for lucene just as well. Bottom line: It's probably an issue that can be dealt with via proper impl; it probably shouldn't dictate design directions. Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
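For a sense of what the hand-rolled encoder Wolfgang describes involves, here is a minimal sketch (hypothetical code, not from Lucene; it assumes the caller reuses the byte[] across calls and that the input chars are well-formed UTF-16):

    private byte[] bytes = new byte[256];  // reused across calls

    int encodeUTF8(char[] chars, int off, int len) {
      if (bytes.length < len * 4) bytes = new byte[len * 4];  // worst case
      int p = 0;
      for (int i = off; i < off + len; i++) {
        int c = chars[i];
        if (c < 0x80) {
          bytes[p++] = (byte) c;                        // incl. a real 0x00
        } else if (c < 0x800) {
          bytes[p++] = (byte) (0xC0 | (c >> 6));
          bytes[p++] = (byte) (0x80 | (c & 0x3F));
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < off + len) {
          // combine a surrogate pair into one standard 4-byte sequence
          int cp = 0x10000 + ((c - 0xD800) << 10) + (chars[++i] - 0xDC00);
          bytes[p++] = (byte) (0xF0 | (cp >> 18));
          bytes[p++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
          bytes[p++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
          bytes[p++] = (byte) (0x80 | (cp & 0x3F));
        } else {
          bytes[p++] = (byte) (0xE0 | (c >> 12));
          bytes[p++] = (byte) (0x80 | ((c >> 6) & 0x3F));
          bytes[p++] = (byte) (0x80 | (c & 0x3F));
        }
      }
      return p;  // number of bytes written
    }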
Re: Lucene does NOT use UTF-8.
Wolfgang Hoschek wrote: I don't know if it matters for Lucene usage. But if using CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a significant problem, it's probably due to startup/init time of these methods for individually converting many small strings, not inherently due to UTF-8 usage. I'm confident that a custom UTF-8 implementation can almost completely eliminate these issues. I've done this before for binary XML with great success, and it could certainly be done for lucene just as well. Bottom line: It's probably an issue that can be dealt with via proper impl; it probably shouldn't dictate design directions. Good point. Currently Lucene already has its own (buggy) UTF-8 implementation for performance, so that wouldn't really be a big change. The big question now seems to be whether the stored character sequence lengths should be in bytes or characters. Bytes might be fast and simple (whether we implement our own UTF-8 in Java or not) but are not back-compatible. So do we bite the bullet and make a very incompatible change to index formats? Or do we make these counts be unicode characters (which is mostly back-compatible) and make the code a bit more awkward? Some implementations would be nice to see just how awkward things get. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8.
The temporary char[] buffer is cached per InputStream instance, so the extra memory allocation shouldn't be a big deal. One could also use String(byte[],offset,len,"UTF-8"), and that creates a char[] that is used directly by the string instead of being copied. It remains to be seen how fast the native java char converter is though. I like the idea of the length being the number of bytes... it encapsulates the content in case you want to rapidly skip over it (or rapidly copy it). It's more future proof w.r.t. alternate encodings (or binary), and if it had been the number of bytes from the start, it wouldn't have to be changed now. -Yonik On 8/29/05, Doug Cutting [EMAIL PROTECTED] wrote: I would argue that the length written be the number of characters in the string, rather than the number of bytes written, since that can minimize string memory allocations.
Re: Lucene does NOT use UTF-8.
How will the difference impact String memory allocations? Looking at the String code, I can't see where it would make an impact. This is from Lucene InputStream:

    public final String readString() throws IOException {
      int length = readVInt();
      if (chars == null || length > chars.length)
        chars = new char[length];
      readChars(chars, 0, length);
      return new String(chars, 0, length);
    }

If you know the length in bytes, you still have to allocate that many chars (even though the number of chars may be less than the number of bytes). Not a big deal IMHO. A bigger pain is on the writing side, where you can't stream things because you don't know what the length is going to be (in either bytes *or* UTF-8 chars). So it turns out that Java's 16 bit chars were just a waste... it's still a multibyte format *and* it takes up more space. UTF-8 would have been nice - no conversions necessary. -Yonik Now hiring -- http://tinyurl.com/7m67g
RE: Lucene does NOT use UTF-8.
That method should easily be changed to:

    public final String readString() throws IOException {
      int length = readVInt();
      return new String(readBytes(length), "UTF-8");
    }

readBytes() could reuse the same array if it was large enough. Then only the single char[] is created in the String code. -Original Message- From: Yonik Seeley [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 30, 2005 11:28 AM To: java-dev@lucene.apache.org Subject: Re: Lucene does NOT use UTF-8. How will the difference impact String memory allocations? Looking at the String code, I can't see where it would make an impact. This is from Lucene InputStream:

    public final String readString() throws IOException {
      int length = readVInt();
      if (chars == null || length > chars.length)
        chars = new char[length];
      readChars(chars, 0, length);
      return new String(chars, 0, length);
    }

If you know the length in bytes, you still have to allocate that many chars (even though the number of chars may be less than the number of bytes). Not a big deal IMHO. A bigger pain is on the writing side, where you can't stream things because you don't know what the length is going to be (in either bytes *or* UTF-8 chars). So it turns out that Java's 16 bit chars were just a waste... it's still a multibyte format *and* it takes up more space. UTF-8 would have been nice - no conversions necessary. -Yonik Now hiring -- http://tinyurl.com/7m67g - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
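Spelled out with the array reuse Robert describes, the read path might look like this sketch (the readBytes(byte[],int,int) bulk read is hypothetical, and the stored length here is assumed to be a byte count):

    private byte[] bytes;  // reused across calls

    public final String readString() throws IOException {
      int length = readVInt();               // length in bytes
      if (bytes == null || length > bytes.length)
        bytes = new byte[length];
      readBytes(bytes, 0, length);           // hypothetical bulk read
      return new String(bytes, 0, length, "UTF-8");
    }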
Re: Lucene does NOT use UTF-8.
[EMAIL PROTECTED] wrote: How will the difference impact String memory allocations? Looking at the String code, I can't see where it would make an impact. I spoke a bit too soon. I should have looked at the code first. You're right, I don't think it would require more allocations. When considering this byte-count versus character-count issue please note that it also arises elsewhere. The PrefixLength in the Term Dictionary section of the file format document is currently defined as a number of characters, not bytes. http://lucene.apache.org/java/docs/fileformats.html#Term Dictionary Implementing this in terms of bytes may have performance implications, since, at first glance, the entire byte sequence would need to be converted from UTF-8 into the internal string representation for each term, rather than just the suffix. Does anyone see a way around that? As for how we got to this point: I wrote Lucene's UTF-8 reading and writing code in 1998, back when Unicode still had fewer than 2^16 characters. It's surprising that it has lasted this long without anyone noticing! Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8
On Monday 29 August 2005 19:56, Ken Krugler wrote: Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data. But wouldn't UTF-16 mean 2 bytes per character? Yes, UTF-16 means two bytes per code unit. A Unicode character (code point) is encoded as either one or two UTF-16 code units. That doesn't seem to be the case. The case where? You mean in what actually gets written out? String.length() is the length in terms of Java chars, which means UTF-16 code units (well, sort of...see below). Looking at the code, IndexOutput.writeString() calls writeVInt() with the string length. One related note. Java 1.4 supports Unicode 3.0, while Java 5.0 supports Unicode 4.0. It was in Unicode 3.1 that supplementary characters (code points above U+FFFF, i.e. outside of the BMP) were added, and the UTF-16 encoding formalized. So I think the issue of non-BMP characters is currently a bit esoteric for Lucene, since I'm guessing there are other places in the code (e.g. JDK calls used by Lucene) where non-BMP characters won't be properly handled. Though some quick tests indicate that there is some knowledge of surrogate pairs in 1.4 (e.g. converting a String w/surrogate pairs to UTF-8 does the right thing). -- Ken -- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-470-9200 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
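To make the code unit vs. code point distinction concrete, a short illustration (codePointCount is a Java 5 method; U+1D11E, musical symbol G clef, lies outside the BMP):

    String s = "\uD834\uDD1E";  // U+1D11E as a surrogate pair
    System.out.println(s.length());                       // 2 -- UTF-16 code units
    System.out.println(s.codePointCount(0, s.length()));  // 1 -- Unicode characters
    // writeString() stores the first number, not the second.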
Re: Lucene does NOT use UTF-8
Daniel Naber wrote: On Monday 29 August 2005 19:56, Ken Krugler wrote: Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data. But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the case. UTF-16 is a fixed 2 byte/char representation. But one cannot equate the character count with the byte count. Each Java char is 2 bytes. I think all that is being said is that the VInt is equal to str.length() as java gives it. On an unrelated project we are determining whether we should use a denormalized (letter followed by accents) or a normalized form (letter with accents) of accented characters as we present the text to a GUI. We have found that font support varies but appears to be better for denormalized. This is not an issue for storage, as it can be transformed before it goes to screen. However, it is useful to know which form it is in. The reason I mention this is that I seem to remember that the length of the java string varies with the representation. So then the count would not be the number of glyphs that the user sees. Please correct me if I am wrong. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
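The recollection is right; a quick illustration with plain string literals (no normalizer API assumed):

    String composed = "\u00E9";         // e-acute as one code point (NFC form)
    String decomposed = "e\u0301";      // e + combining acute accent (NFD form)
    System.out.println(composed.length());            // 1
    System.out.println(decomposed.length());          // 2
    System.out.println(composed.equals(decomposed));  // false, same glyph on screen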
RE: Lucene does NOT use UTF-8.
I think the VInt should be the number of bytes to be stored using the UTF-8 encoding. It is trivial to use the String methods identified before to do the conversion. The String(char[]) allocates a new char array. For performance, you can use the actual CharSet encoding classes - avoiding all of the lookups performed by the String class. Regardless of what underlying support is used, if you want to write out the VInt value as UTF-8 bytes versus Java chars, the Java String has to either be converted to UTF-8 in memory first, or pre-scanned. The first is a memory hit, and the second is a performance hit. I don't know the extent of either, but it's there. Note that since the VInt is a variable size, you can't write out the bytes first and then fill in the correct value later. -- Ken -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Monday, August 29, 2005 4:24 PM To: java-dev@lucene.apache.org Subject: Re: Lucene does NOT use UTF-8. Ken Krugler wrote: The remaining issue is dealing with old-format indexes. I think that revving the version number on the segments file would be a good start. This file must be read before any others. Its current version is -1 and would become -2. (All positive values are version 0, for back-compatibility.) Implementations can be modified to pass the version around if they wish to be back-compatible, or they can simply throw exceptions for old format indexes. I would argue that the length written be the number of characters in the string, rather than the number of bytes written, since that can minimize string memory allocations. I'm going to take this off-list now [ ... ] Please don't. It's better to have a record of the discussion. Doug -- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-470-9200 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8.
Ken Krugler wrote: I think the VInt should be the number of bytes to be stored using the UTF-8 encoding. It is trivial to use the String methods identified before to do the conversion. The String(char[]) allocates a new char array. For performance, you can use the actual CharSet encoding classes - avoiding all of the lookups performed by the String class. Regardless of what underlying support is used, if you want to write out the VInt value as UTF-8 bytes versus Java chars, the Java String has to either be converted to UTF-8 in memory first, or pre-scanned. The first is a memory hit, and the second is a performance hit. I don't know the extent of either, but it's there. Note that since the VInt is a variable size, you can't write out the bytes first and then fill in the correct value later. Sure you can. Do a tell to get the position. Write any number. Write the text. Do another tell to note the position. Based on the difference between the two tells, you have the length. Rewind to the first tell and write out the number. Then advance to the end. I am not recommending this, but it can be done. There may be other ways. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
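A sketch of that two-tells trick in Lucene-style calls (getFilePointer() and seek() do exist on Lucene's store.OutputStream; note the placeholder must be fixed-width, so this writes a plain int rather than a VInt, and the streaming writeEncodedChars helper is hypothetical):

    long lengthPos = out.getFilePointer();   // first tell
    out.writeInt(0);                         // fixed-width placeholder
    long start = out.getFilePointer();
    writeEncodedChars(out, s);               // hypothetical: stream the UTF-8 bytes
    long end = out.getFilePointer();         // second tell
    out.seek(lengthPos);
    out.writeInt((int) (end - start));       // backpatch the real byte count
    out.seek(end);                           // continue writing from the end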
Re: Lucene does NOT use UTF-8
DM Smith wrote: Daniel Naber wrote: But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the case. UTF-16 is a fixed 2 byte/char representation. Except when it's not. I.e., above the BMP. From the Unicode 4.0 standard http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf: In the UTF-16 encoding form, code points in the range U+0000..U+FFFF are represented as a single 16-bit code unit; code points in the supplementary planes, in the range U+10000..U+10FFFF, are instead represented as pairs of 16-bit code units. These pairs of special code units are known as surrogate pairs. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Lucene does NOT use UTF-8.
I think you guys are WAY overcomplicating things, or you just don't know enough about the Java class libraries. If you use the java.nio.charset.CharsetEncoder class, then you can reuse the byte[] array, and then it is a simple write of the length, and a blast copy of the required number of bytes to the OutputStream (which will either fit or expand its byte[]). You can perform all of this WITHOUT creating new byte[] or char[] (as long as the existing one is large enough to fit the encoded/decoded data). There is no need to use any sort of file position mark/reset stuff. R -Original Message- From: Ken Krugler [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 30, 2005 11:54 AM To: java-dev@lucene.apache.org Subject: RE: Lucene does NOT use UTF-8. I think the VInt should be the numbers of bytes to be stored using the UTF-8 encoding. It is trivial to use the String methods identified before to do the conversion. The String(char[]) allocates a new char array. For performance, you can use the actual CharSet encoding classes - avoiding all of the lookups performed by the String class. Regardless of what underlying support is used, if you want to write out the VInt value as UTF-8 bytes versus Java chars, the Java String has to either be converted to UTF-8 in memory first, or pre-scanned. The first is a memory hit, and the second is a performance hit. I don't know the extent of either, but it's there. Note that since the VInt is a variable size, you can't write out the bytes first and then fill in the correct value later. -- Ken -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Monday, August 29, 2005 4:24 PM To: java-dev@lucene.apache.org Subject: Re: Lucene does NOT use UTF-8. Ken Krugler wrote: The remaining issue is dealing with old-format indexes. I think that revving the version number on the segments file would be a good start. This file must be read before any others. Its current version is -1 and would become -2. (All positive values are version 0, for back-compatibility.) Implementations can be modified to pass the version around if they wish to be back-compatible, or they can simply throw exceptions for old format indexes. I would argue that the length written be the number of characters in the string, rather than the number of bytes written, since that can minimize string memory allocations. I'm going to take this off-list now [ ... ] Please don't. It's better to have a record of the discussion. Doug -- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-470-9200 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8.
I think you guys are WAY overcomplicating things, or you just don't know enough about the Java class libraries. People were just pointing out that if the vint isn't String.length(), then one has to either buffer the entire string, or pre-scan it. It's a valid point, and CharsetEncoder doesn't change that. -Yonik Now hiring -- http://tinyurl.com/7m67g
RE: Lucene does NOT use UTF-8.
Not true. You do not need to pre-scan it. When you use CharSet encoder, it will write the bytes to a buffer (expanding as needed). At the end of the encoding you can get the actual number of bytes needed. The pseudo-code is:

    use CharsetEncoder to write String to ByteBuffer
    write VInt using ByteBuffer.getLength()
    write bytes using ByteBuffer.getByte[]

better yet, use NIO so you can pass the ByteBuffer directly. -Original Message- From: Yonik Seeley [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 30, 2005 12:56 PM To: java-dev@lucene.apache.org; [EMAIL PROTECTED] Subject: Re: Lucene does NOT use UTF-8. I think you guys are WAY overcomplicating things, or you just don't know enough about the Java class libraries. People were just pointing out that if the vint isn't String.length(), then one has to either buffer the entire string, or pre-scan it. It's a valid point, and CharsetEncoder doesn't change that. -Yonik Now hiring -- http://tinyurl.com/7m67g - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
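In terms of the real java.nio.charset classes, that pseudo-code might look like the following sketch (buffer growth on CoderResult.OVERFLOW is elided; writeVInt and writeBytes stand in for Lucene's output primitives):

    CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
    ByteBuffer bytes = ByteBuffer.allocate(4096);  // reused across strings

    void writeString(String s) throws IOException {
      bytes.clear();
      encoder.reset();
      encoder.encode(CharBuffer.wrap(s), bytes, true);  // grow and retry on OVERFLOW
      encoder.flush(bytes);
      writeVInt(bytes.position());                      // byte length of the encoding
      writeBytes(bytes.array(), bytes.position());      // blast copy, no new arrays
    }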
Re: Lucene does NOT use UTF-8.
On 8/30/05, Robert Engels [EMAIL PROTECTED] wrote: Not true. You do not need to pre-scan it. What I previously wrote, with emphasis on key words added: one has to *either* buffer the entire string, *or* pre-scan it. -Yonik Now hiring -- http://tinyurl.com/7m67g
Re: Lucene does NOT use UTF-8.
I've been looking around... do you have a pointer to the source where just the suffix is converted from UTF-8? I understand the index format, but I'm not sure I understand the problem that would be posed by the prefix length being a byte count. -Yonik Now hiring -- http://tinyurl.com/7m67g On 8/30/05, Doug Cutting [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote: How will the difference impact String memory allocations? Looking at the String code, I can't see where it would make an impact. I spoke a bit too soon. I should have looked at the code first. You're right, I don't think it would require more allocations. When considering this byte-count versus character-count issue please note that it also arises elsewhere. The PrefixLength in the Term Dictionary section of the file format document is currently defined as a number of characters, not bytes. http://lucene.apache.org/java/docs/fileformats.html#Term Dictionary Implementing this in terms of bytes may have performance implications, since, at first glance, the entire byte sequence would need to be converted from UTF-8 into the internal string representation for each term, rather than just the suffix. Does anyone see a way around that? As for how we got to this point: I wrote Lucene's UTF-8 reading and writing code in 1998, back when Unicode still had fewer than 2^16 characters. It's surprising that it has lasted this long without anyone noticing! Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8
Daniel Naber wrote: On Monday 29 August 2005 19:56, Ken Krugler wrote: Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data. But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the case. UTF-16 is a fixed 2 byte/char representation. I hate to keep beating this horse, but I want to emphasize that it's 2 bytes per Java char (or UTF-16 code unit), not Unicode character (code point). But one cannot equate the character count with the byte count. Each Java char is 2 bytes. I think all that is being said is that the VInt is equal to str.length() as java gives it. On an unrelated project we are determining whether we should use a denormalized (letter followed by an accents) or a normalized form (letter with accents) of accented characters as we present the text to a GUI. We have found that font support varies but appears to be better for denormalized. This is not an issue for storage, as it can be transformed before it goes to screen. However, it is useful to know which form it is in. The reason I mention this is that I seem to remember that the length of the java string varies with the representation. String.length() is the number of Java chars, which always uses UTF-16. If you normalize text, then yes that can change the number of code units and thus the length of the string, but so can doing any kind of text munging (e.g. replacement) operation on characters in the string. So then the count would not be the number of glyphs that the user sees. Please correct me if I am wrong. All kinds of mxn mappings (both at the layout engine level, and using font tables) are possible when going from Unicode characters to display glyphs. Plus zero-width left-kerning glyphs would also alter the relationship between # of visual characters and backing store characters. -- Ken -- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-470-9200 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8.
The inefficiency would be if prefix were re-converted from UTF-8 for each term, e.g., in order to compare it to the target. Ahhh, gotcha. A related problem exists even if the prefix length vInt is changed to represent the number of unicode chars (as opposed to number of java chars), right? The prefix length is no longer the offset into the char[] to put the suffix. Another approach might be to convert the target to a UTF-8 byte[] and do all comparisons on byte[]. UTF-8 has some very nice properties, including that the byte[] representation of UTF-8 strings compare the same as UCS-4 would. As you say, the variations need to be tested. -Yonik Now hiring -- http://tinyurl.com/7m67g
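A sketch of what comparing terms directly as UTF-8 bytes would look like (unsigned byte comparison; per RFC 3629 this yields the same order as Unicode code points):

    static int compareUTF8(byte[] a, int aLen, byte[] b, int bLen) {
      int n = Math.min(aLen, bLen);
      for (int i = 0; i < n; i++) {
        int x = a[i] & 0xFF;  // mask: Java bytes are signed, UTF-8 order needs unsigned
        int y = b[i] & 0xFF;
        if (x != y) return x - y;
      }
      return aLen - bLen;     // shorter string sorts first on a shared prefix
    }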
Re: Lucene does NOT use UTF-8
On 8/30/05, Ken Krugler [EMAIL PROTECTED] wrote: Daniel Naber wrote: On Monday 29 August 2005 19:56, Ken Krugler wrote: Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data. But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the case. UTF-16 is a fixed 2 byte/char representation. I hate to keep beating this horse, but I want to emphasize that it's 2 bytes per Java char (or UTF-16 code unit), not Unicode character (code point). There's more horse beating on Java and Unicode 4 in this blog entry: http://weblogs.java.net/blog/joconner/archive/2005/08/how_long_is_you.html.
Re: Lucene does NOT use UTF-8.
Yonik Seeley wrote: A related problem exists even if the prefix length vInt is changed to represent the number of unicode chars (as opposed to number of java chars), right? The prefix length is no longer the offset into the char[] to put the suffix. Yes, I suppose this is a problem too. Sigh. Another approach might be to convert the target to a UTF-8 byte[] and do all comparisons on byte[]. UTF-8 has some very nice properties, including that the byte[] representation of UTF-8 strings compare the same as UCS-4 would. I was not aware of that, but I see you are correct: o The byte-value lexicographic sorting order of UTF-8 strings is the same as if ordered by character numbers. (From http://www.faqs.org/rfcs/rfc3629.html) That makes the byte representation much more palatable, since Lucene orders terms lexicographically. Where/how is the Lucene ordering of terms used? I'm asking because people often confuse lexicographic order with dictionary order, whereas in the context of UTF-8 it just means the same order as Unicode code points. And the order of Java chars would be the same as for Unicode code points, other than non-BMP characters. Thanks, -- Ken -- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-470-9200 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
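Ken's caveat about non-BMP characters can be shown in a few lines (a hypothetical demo; U+E000 is a BMP code point, U+10000 the first supplementary one):

    String bmp = "\uE000";          // U+E000
    String supp = "\uD800\uDC00";   // U+10000 as a surrogate pair
    System.out.println(bmp.compareTo(supp));  // > 0: UTF-16 code unit order puts U+10000 first
    // In UTF-8, U+E000 = EE 80 80 and U+10000 = F0 90 80 80, so byte order
    // instead agrees with code point order: U+E000 < U+10000.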
Re: Lucene does NOT use UTF-8.
Where/how is the Lucene ordering of terms used? An ordering is necessary to be able to find things in the index. For the most part, the ordering doesn't seem to matter... the only query that comes to mind where it does matter is RangeQuery. For sorting queries, one is able to specify a Locale. -Yonik Now hiring -- http://tinyurl.com/7m67g
Re: Lucene does NOT use UTF-8.
Yonik Seeley wrote: Where/how is the Lucene ordering of terms used? An ordering is necessary to be able to find things in the index. For the most part, the ordering doesn't seem to matter... the only query that comes to mind where it does matter is RangeQuery. For back-compatibility it would be best if the ordering is consistent with the current ordering, i.e., lexicographic by character (or code point, if you prefer). Fortunately, UTF-8 makes this easy. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8
I'm not familiar with UTF-8 enough to follow the details of this discussion. I hope other Lucene developers are, so we can resolve this issue - anyone raising a hand? I could, but recent posts make me think this is heading towards a religious debate :) I think the following statements are all true: a. Using UTF-8 for strings would make it easier for Lucene indexes to be used by other implementations besides the reference Java version. b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings. c. The hard(er) part would be backwards compatibility with older indexes. I haven't looked at this enough to really know, but one example is the compound file (xx.cfs) format...I didn't see a version number, and it contains strings. d. The documentation could be clearer on what is meant by the string length, but this is a trivial change. What's unclear to me (not being a Perl, Python, etc jock) is how much easier it would be to get these other implementations working with Lucene, following a change to UTF-8. So I can't comment on the return on time required to change things. I'm also curious about the existing CLucene and PyLucene ports. Would they also need to be similarly modified, with the proposed changes? One final point. I doubt people have been adding strings with embedded nulls, and text outside of the Unicode BMP is also very rare. So _most_ Lucene indexes only contain valid UTF-8 data. It's only the above two edge cases that create an interoperability problem. -- Ken -- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-470-9200 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8
On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote: I'm not familiar with UTF-8 enough to follow the details of this discussion. I hope other Lucene developers are, so we can resolve this issue - anyone raising a hand? I could, but recent posts make me think this is heading towards a religious debate :) Ken - you mentioned taking the discussion off-line in a previous post. Please don't. Let's keep it alive on java-dev until we have a resolution to it. I think the following statements are all true: a. Using UTF-8 for strings would make it easier for Lucene indexes to be used by other implementations besides the reference Java version. b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings. What, if any, performance impact would changing Java Lucene in this regard have? (I realize this is rhetorical at this point, until a solution is at hand) c. The hard(er) part would be backwards compatibility with older indexes. I haven't looked at this enough to really know, but one example is the compound file (xx.cfs) format...I didn't see a version number, and it contains strings. I don't know the gory details, but we've made compatibility breaking changes in the past and the current version of Lucene can open older formats, but only write the most current format. I suspect it could be made to be backwards compatible. Worst case, we break compatibility in 2.0. d. The documentation could be clearer on what is meant by the string length, but this is a trivial change. That change was made by Daniel soon after this discussion began. What's unclear to me (not being a Perl, Python, etc jock) is how much easier it would be to get these other implementations working with Lucene, following a change to UTF-8. So I can't comment on the return on time required to change things. I'm also curious about the existing CLucene and PyLucene ports. Would they also need to be similarly modified, with the proposed changes? PyLucene is literally the Java version of Lucene underneath (via GCJ/SWIG), so no worries there. CLucene would need to be changed, as well as DotLucene and the other ports out there. If the rest of the world of Lucene ports followed suit with PyLucene and did the GCJ/SWIG thing, we'd have no problems :) What are the disadvantages to following this model with Plucene? Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8
Erik Hatcher wrote: On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote: I'm not familiar with UTF-8 enough to follow the details of this discussion. I hope other Lucene developers are, so we can resolve this issue - anyone raising a hand? I could, but recent posts make me think this is heading towards a religious debate :) Ken - you mentioned taking the discussion off-line in a previous post. Please don't. Let's keep it alive on java-dev until we have a resolution to it. I'd also like to follow this thread. I think the following statements are all true: a. Using UTF-8 for strings would make it easier for Lucene indexes to be used by other implementations besides the reference Java version. b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings. What, if any, performance impact would changing Java Lucene in this regard have? (I realize this is rhetorical at this point, until a solution is at hand) Looking at the source of 1.4.3, fixing the NUL character encoding is trivial for writing, and reading already works for both the standard and the java-style encoding. Not much work and absolutely no performance impact here. The surrogate pair problem is another matter entirely. First of all, let's see if I do understand the problem correctly: Some unicode characters can be represented by one codepoint outside the BMP (i. e., not with 16 bits) and alternatively with two codepoints, both of them in the 16-bit range. According to Marvin's explanations, the Unicode standard requires these characters to be represented as the one codepoint in UTF-8, resulting in a 4-, 5-, or 6-byte encoding for that character. But since a Java char _is_ 16 bit, the codepoints beyond the 16-bit range cannot be represented as chars. That is, the in-memory-representation still requires the use of the surrogate pairs. Therefore, writing consists of translating the surrogate pair to the >16bit representation of the same character and then algorithmically encoding that. Reading is exactly the reverse process. Adding code to handle the 4 to 6 byte encodings to the readChars/writeChars method is simple, but how do you do the mapping from surrogate pairs to the chars they represent? Is there an algorithm for doing that except for table lookups or huge switch statements? c. The hard(er) part would be backwards compatibility with older indexes. I haven't looked at this enough to really know, but one example is the compound file (xx.cfs) format...I didn't see a version number, and it contains strings. I don't know the gory details, but we've made compatibility breaking changes in the past and the current version of Lucene can open older formats, but only write the most current format. I suspect it could be made to be backwards compatible. Worst case, we break compatibility in 2.0. I believe backward compatibility is the easy part and comes for free. As I mentioned above, reading the correct NUL encoding already works and the non-BMP characters will have to be represented as surrogate pairs internally anyway. So there is no problem with reading the old encoding and there is nothing wrong with still using or reading the surrogate pairs, only that they would not be written. Even indices with mixed segments are not a problem. Given that the CompoundFileReader/Writer use a lucene.store.OutputStream/InputStream for their FileEntries, they would also be able to read older files but potentially write incompatible files. OTOH, when used inside lucene, the filenames do not contain NULs or non-BMP chars.
But: Is the compound file format supposed to be interoperable? Which formats are? [...] What's unclear to me (not being a Perl, Python, etc jock) is how much easier it would be to get these other implementations working with Lucene, following a change to UTF-8. So I can't comment on the return on time required to change things. [...] PyLucene is literally the Java version of Lucene underneath (via GCJ/ SWIG), so no worries there. CLucene would need to be changed, as well as DotLucene and the other ports out there. If the rest of the world of Lucene ports followed suit with PyLucene and did the GCJ/SWIG thing, we'd have no problems :) What are the disadvantages to following this model with Plucene? Some parts of the Lucene API require subclassing (e. g., Analyzer) and SWIG does support cross-language polymorphism only for a few languages, notably Python and Java but not for Perl. Noticing the smiley I won't mention the zillion other reasons not to use the GCJ/SWIG thing. Ronald - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8
[snip] The surrogate pair problem is another matter entirely. First of all, let's see if I do understand the problem correctly: Some unicode characters can be represented by one codepoint outside the BMP (i. e., not with 16 bits) and alternatively with two codepoints, both of them in the 16-bit range. A Unicode character has a code point, which is a scalar value in the range U+0000 to U+10FFFF. The code point for every character in the Unicode character set will fall in this range. There are Unicode encoding schemes, which specify how Unicode code point values are serialized. Examples include UTF-8, UTF-16LE, UTF-16BE, UTF-32, UTF-7, etc. The UTF-16 (big or little endian) encoding scheme uses two code units (16-bit values) to encode Unicode characters with code point values above U+FFFF. According to Marvin's explanations, the Unicode standard requires these characters to be represented as the one codepoint in UTF-8, resulting in a 4-, 5-, or 6-byte encoding for that character. Since the Unicode code point range is constrained to U+0000...U+10FFFF, the longest valid UTF-8 sequence is 4 bytes. But since a Java char _is_ 16 bit, the codepoints beyond the 16-bit range cannot be represented as chars. That is, the in-memory-representation still requires the use of the surrogate pairs. Therefore, writing consists of translating the surrogate pair to the >16bit representation of the same character and then algorithmically encoding that. Reading is exactly the reverse process. Yes. Writing requires that you combine the two surrogate characters into a Unicode code point, then converting that value into the UTF-8 4 byte sequence. Adding code to handle the 4 to 6 byte encodings to the readChars/writeChars method is simple, but how do you do the mapping from surrogate pairs to the chars they represent? Is there an algorithm for doing that except for table lookups or huge switch statements? It's easy, since U+D800...U+DBFF is defined as the range for the high (most significant) surrogate, and U+DC00...U+DFFF is defined as the range for the low (least significant) surrogate. -- Ken -- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-470-9200 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
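The mapping is pure arithmetic - no tables or switch statements needed. A sketch:

    // surrogate pair -> code point, for writing a standard 4-byte sequence
    static int toCodePoint(char hi, char lo) {
      return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }

    // code point -> surrogate pair, for reading one back into Java chars
    static char highSurrogate(int cp) { return (char) (0xD800 + ((cp - 0x10000) >> 10)); }
    static char lowSurrogate(int cp)  { return (char) (0xDC00 + ((cp - 0x10000) & 0x3FF)); }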
Re: Lucene does NOT use UTF-8
On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote: I'm not familiar with UTF-8 enough to follow the details of this discussion. I hope other Lucene developers are, so we can resolve this issue - anyone raising a hand? I could, but recent posts make me think this is heading towards a religious debate :) Ken - you mentioned taking the discussion off-line in a previous post. Please don't. Let's keep it alive on java-dev until we have a resolution to it. I think the following statements are all true: a. Using UTF-8 for strings would make it easier for Lucene indexes to be used by other implementations besides the reference Java version. b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings. What, if any, performance impact would changing Java Lucene in this regard have? (I realize this is rhetorical at this point, until a solution is at hand) Almost zero. A tiny hit when reading/writing surrogate pairs, to properly encode them as a 4-byte UTF-8 sequence versus two 3-byte sequences. c. The hard(er) part would be backwards compatibility with older indexes. I haven't looked at this enough to really know, but one example is the compound file (xx.cfs) format...I didn't see a version number, and it contains strings. I don't know the gory details, but we've made compatibility breaking changes in the past and the current version of Lucene can open older formats, but only write the most current format. I suspect it could be made to be backwards compatible. Worst case, we break compatibility in 2.0. Ronald is correct in that it would be easy to make the reader handle both Java modified UTF-8 and UTF-8, and the writer always output UTF-8. So the only problem would be if older versions of Lucene (or maybe CLucene) wound up trying to read strings that contained 4-byte UTF-8 sequences, as they wouldn't know how to convert this into two UTF-16 Java chars. Since 4-byte UTF-8 sequences are only for characters outside of the BMP, and these are rare, it seems like an OK thing to do, but that's just my uninformed view. d. The documentation could be clearer on what is meant by the string length, but this is a trivial change. That change was made by Daniel soon after this discussion began. Daniel changed the definition of Chars, but the String section still needs to be clarified. Currently it says: Lucene writes strings as a VInt representing the length, followed by the character data. It should read: Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data. -- Ken -- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-470-9200 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
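A sketch of a reader that accepts both encodings, along the lines Ronald and Ken describe (hypothetical; readByte() stands in for Lucene's InputStream primitive, and malformed-input checks are omitted):

    // decode one code point from the stream into buf starting at pos;
    // returns the number of chars written (1, or 2 for a surrogate pair)
    int readChar(char[] buf, int pos) throws IOException {
      int b = readByte() & 0xFF;
      if (b < 0x80) {
        buf[pos] = (char) b;                    // ASCII, incl. a real 0x00
        return 1;
      } else if ((b & 0xE0) == 0xC0) {          // 2-byte form; also accepts
        buf[pos] = (char) (((b & 0x1F) << 6)    // the old 0xC0 0x80 null
            | (readByte() & 0x3F));
        return 1;
      } else if ((b & 0xF0) == 0xE0) {          // 3-byte form; also accepts
        buf[pos] = (char) (((b & 0x0F) << 12)   // old-style encoded surrogates
            | ((readByte() & 0x3F) << 6)
            | (readByte() & 0x3F));
        return 1;
      } else {                                  // 4-byte form: standard UTF-8
        int cp = ((b & 0x07) << 18)             // for code points above U+FFFF
            | ((readByte() & 0x3F) << 12)
            | ((readByte() & 0x3F) << 6)
            | (readByte() & 0x3F);
        buf[pos] = (char) (0xD800 + ((cp - 0x10000) >> 10));      // high surrogate
        buf[pos + 1] = (char) (0xDC00 + ((cp - 0x10000) & 0x3FF)); // low surrogate
        return 2;
      }
    }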
Re: Lucene does NOT use UTF-8
If the rest of the world of Lucene ports followed suit with PyLucene and did the GCJ/SWIG thing, we'd have no problems :) What are the disadvantages to following this model with Plucene? Some parts of the Lucene API require subclassing (e. g., Analyzer) and SWIG does support cross-language polymorphism only for a few languages, notably Python and Java but not for Perl. Noticing the smiley I won't mention the zillion other reasons not to use the GCJ/SWIG thing. Yes, that's true, Java Lucene requires a bunch of subclassing to truly shine in any sizable application. I didn't use SWIG's director feature to implement extension but a more or less hand-carved SWIG-in-reverse trick that can easily be reproduced by other such SWIG-based ports. See http://svn.osafoundation.org/pylucene/trunk/README for more details... Andi.. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8
Erik Hatcher wrote... What, if any, performance impact would changing Java Lucene in this regard have? And Ken Krugler wrote... Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data. I had been working under the assumption that the value of the VInt would be changed as well. It seemed logical that if strings were encoded with legal UTF-8, the count at the head should indicate either 1) the number of UTF-8 characters in the string, or 2) the number of bytes occupied by the encoded string. Do either of those, and more substantial changes to Java Lucene would be required. I expect that the impact on performance could be made negligible for the first option, but the question of backwards compatibility would become a lot messier. It simply had not occurred to me to keep the VInt as is. If you do that, this becomes a much more localized problem. For Plucene, I'll avoid the gory details and just say that having the VInt continue to represent UTF-16 code units limits the availability of certain options, but doesn't cause major inefficiencies. Now that we know that's what it does, we can work with it. A transition to always-legal UTF-8 obviates the need to scan for and fix the edge cases, and addresses my main concern. Marvin Humphrey Rectangular Research http://www.rectangular.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8.
Ken Krugler wrote: The remaining issue is dealing with old-format indexes. I think that revving the version number on the segments file would be a good start. This file must be read before any others. Its current version is -1 and would become -2. (All positive values are version 0, for back-compatibility.) Implementations can be modified to pass the version around if they wish to be back-compatible, or they can simply throw exceptions for old format indexes. I would argue that the length written be the number of characters in the string, rather than the number of bytes written, since that can minimize string memory allocations. I'm going to take this off-list now [ ... ] Please don't. It's better to have a record of the discussion. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8.
Doug, How will the difference impact String memory allocations? Looking at the String code, I can't see where it would make an impact. Tim I would argue that the length written be the number of characters in the string, rather than the number of bytes written, since that can minimize string memory allocations. I'm going to take this off-list now [ ... ] Please don't. It's better to have a record of the discussion. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8.
[Beginning of message truncated in the archive; it contained the standards-compliant decoding code referred to below.] Initial benchmarking experiments appear to indicate negligible impact on performance. So I doubt this would be a slam-dunk in the Lucene community. I appreciate your willingness to at least weigh the matter, and I understand the potential reluctance. Hopefully the comparable performance of the standards-compliant code above will render the issue moot, and the next release of Lucene will use legal UTF-8. Best, Marvin Humphrey Rectangular Research http://www.rectangular.com/ From: Ken Krugler [EMAIL PROTECTED] Date: August 27, 2005 2:11:34 PM PDT To: java-user@lucene.apache.org Subject: Re: Lucene does NOT use UTF-8. Reply-To: java-user@lucene.apache.org I've delved into the matter of Lucene and UTF-8 a little further, and I am discouraged by what I believe I've uncovered. Lucene should not be advertising that it uses standard UTF-8 -- or even UTF-8 at all, since Modified UTF-8 is _illegal_ UTF-8. Unfortunately this is how Sun documents the format they use for serialized strings. The two distinguishing characteristics of Modified UTF-8 are the treatment of codepoints above the BMP (which are written as surrogate pairs), and the encoding of null bytes as 1100 0000 1000 0000 rather than 0000 0000. Both of these became illegal as of Unicode 3.1 (IIRC), because they are not shortest-form and non-shortest-form UTF-8 presents a security risk. For UTF-8 these were always invalid, but the standard wasn't very clear about it. Unfortunately the fuzzy nature of the 1.0/2.0 specs encouraged some sloppy implementations. The documentation should really state that Lucene stores strings in a Java-only adulteration of UTF-8, Yes, good point. I don't know who's in charge of that page, but it should be fixed. unsuitable for interchange. Other than as an internal representation for Java serialization. Since Perl uses true shortest-form UTF-8 as its native encoding, Plucene would have to jump through two efficiency-killing hoops in order to write files that would not choke Lucene: instead of writing out its true, legal UTF-8 directly, it would be necessary to first translate to UTF-16, then duplicate the Lucene encoding algorithm from OutputStream. In theory. Actually I don't think it would be all that bad. Since a null in the middle of a string is rare, as is a character outside of the BMP, a quick scan of the text should be sufficient to determine if it can be written as-is. The ICU project has C code that can be used to quickly walk a string. I believe these would find/report such invalid code points, if you use the safe (versus faster unsafe) versions. Below you will find a simple Perl script which illustrates what happens when Perl encounters malformed UTF-8. Run it (you need Perl 5.8 or higher) and you will see why even if I thought it was a good idea to emulate the Java hack for encoding Modified UTF-8, trying to make it work in practice would be a nightmare. If Plucene were to write legal UTF-8 strings to its index files, Java Lucene would misbehave and possibly blow up any time a string contained either a 4-byte character or a null byte. On the flip side, Perl will spew warnings like crazy and possibly blow up whenever it encounters a Lucene-encoded null or surrogate pair. The potential blowups are due to the fact that Lucene and Plucene will not agree on how many characters a string contains, resulting in overruns or underruns.
I am hoping that the answer to this will be a fix to the encoding mechanism in Lucene so that it really does use legal UTF-8. The most efficient way to go about this has not yet presented itself. I'd need to look at the code more, but using something other than the Java serialized format would probably incur a performance penalty for the Java implementation. Or at least make it harder to handle the strings using the standard Java serialization support. So I doubt this would be a slam-dunk in the Lucene community. -- Ken

    #!/usr/bin/perl
    use strict;
    use warnings;

    # illegal_null.plx -- Perl complains about non-shortest-form null.
    my $data = "foo\xC0\x80\n";
    open (my $virtual_filehandle, "+<:utf8", \$data);
    print <$virtual_filehandle>;

-- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-470-9200 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8.
Hi Marvin, Thanks for the detailed response. After spending a bit more time in the code, I think you're right - all strings seem to be funnelled through IndexOutput. The remaining issue is dealing with old-format indexes. I'm going to take this off-list now, since I'm guessing most list readers aren't too interested in the on-going discussion. If anybody else would like to be copied, send me an email. -- Ken -- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-470-9200 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8.
Hello, Robert... On Aug 28, 2005, at 7:50 PM, Robert Engels wrote: Sorry, but I think you are barking up the wrong tree... and your tone is quite bizarre. My personal OPINION is that your script language is an abomination, and anyone that develops in it is clearly hurting the advancement of all software - but that is another story, and doesn't matter much to the discussion - in a similar fashion your choice of words is clearly not going to help matters. My personal perspective is a utilitarian one: languages, platforms, they all come and go eventually, and in between a lot of stuff gets done. I enjoy and appreciate Java (what I know of it), and I watched the Ruby/Java spat a little while ago with dismay. The enmity is not returned. :) It may be less efficient to decode in other languages, but I don't think the original Lucene designers were too worried about the efficiencies of other languages/platforms. That may be the case. I suppose we're about to find out how important the Lucene development community considers interchange. The phrase standard UTF-8 in the documentation led me to believe that the intention was to deploy honest-to-goodness UTF-8. In fact, as was pointed out, the early versions of the Unicode standard were not very clear. Lucene was originally begun in 1998, and Unicode Corrigendum #1: UTF-8 Shortest Form wasn't released until 2001. My best guess is that it was supposed to be legal UTF-8 and that the non-conformance is unintentional. Otis Gospodnetic raised objections when the Plucene project made the decision to abandon index compatibility with Java Lucene. I've been arguing that that decision ought to be reconsidered. It will make it easier to achieve this shared goal of interoperability if Plucene does not have to go out of its way to defeat measures painstakingly put in place by the Perl5Porters team to ensure secure and robust Unicode support. One of the reasons I have placed my own search engine project on hold was that I concluded I could not improve in a meaningful way on Lucene's file format. It's really a marvelous piece of work. Perhaps it will become the TIFF of inverted index formats. It seems to me that the Lucene project would benefit from having it widely adopted. I'd like to help with that. Using String.getBytes(UTF-8), and String.String(byte[],UTF-8) is all that is needed. Thank you for the tip. At first blush, I'm concerned that those may be difficult to make work with OutputStream's readByte() without incurring a performance penalty, but if I'm wrong and it's six-of-one-half-dozen-of-another for Java Lucene, then if a change is going to be made, I'll argue for that one. That would harmonize with the way binary field data is stored, assuming that I can trust that portion of the spec document. ;) Cheers, Marvin Humphrey Rectangular Research http://www.rectangular.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8.
On Aug 26, 2005, at 10:14 PM, jian chen wrote: It seems to me that in theory, Lucene storage code could use true UTF-8 to store terms. Maybe it is just a legacy issue that the modified UTF-8 is used? It's not a matter of a simple switch. The VInt count at the head of a Lucene string is not the number of Unicode code points the string contains. It's the number of Java chars necessary to contain that string. Code points above the BMP require 2 java chars, since they must be represented by surrogate pairs. The same code point must be represented by one character in legal UTF-8. If Plucene counts the number of legal UTF-8 characters and assigns that number as the VInt at the front of a string, when Java Lucene decodes the string it will allocate an array of char which is too small to hold the string. Marvin Humphrey Rectangular Research http://www.rectangular.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8.
Hi, Ken, Thanks for your email. You are right, I meant to propose that Lucene switch to use true UTF-8, rather than having to work around this issue by fixing the resulting problems elsewhere. Also, conforming to standards like UTF-8 will make the code easier for new developers to pick up. Just my 2 cents. Thanks, Jian On 8/27/05, Ken Krugler [EMAIL PROTECTED] wrote: On Aug 26, 2005, at 10:14 PM, jian chen wrote: It seems to me that in theory, Lucene storage code could use true UTF-8 to store terms. Maybe it is just a legacy issue that the modified UTF-8 is used? The use of 0xC0 0x80 to encode a U+0000 Unicode code point is an aspect of Java serialization of character streams. Java uses what they call a modified version of UTF-8, though that's a really bad way to describe it. It's a different Unicode encoding, one that resembles UTF-8, but that's it. It's not a matter of a simple switch. The VInt count at the head of a Lucene string is not the number of Unicode code points the string contains. It's the number of Java chars necessary to contain that string. Code points above the BMP require 2 java chars, since they must be represented by surrogate pairs. The same code point must be represented by one character in legal UTF-8. If Plucene counts the number of legal UTF-8 characters and assigns that number as the VInt at the front of a string, when Java Lucene decodes the string it will allocate an array of char which is too small to hold the string. I think Jian was proposing that Lucene switch to using a true UTF-8 encoding, which would make things a bit cleaner. And probably easier than changing all references to CEUS-8 :) And yes, given that the integer count is the number of UTF-16 code units required to represent the string, your code will need to do a bit more processing when calculating the character count, but that's a one-liner, right? -- Ken -- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-470-9200 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]