Re: Lucene and UTF-8
> Perl development is going very well, by the way. On the indexing side,
> I've got a new app going which solves both the index compatibility
> issue and the speed issue, about which I'll make a presentation in
> this forum after I flesh it out and clean it up. Well, I'm lying a
> little. The app doesn't quite write a valid Lucene 1.4.3 index, since
> it writes true UTF-8. If these patches get adopted prior to the
> release of 1.9, though, it will write valid Lucene 1.9 indexes.
>
> This UTF stuff is not my thing, and I have a hard time following all
> the discussion here (read: I don't get it)... but it sounds like good
> changes. Could one of the other Lucene committers following this
> thread apply the patches and commit the stuff if it looks good?
>
> Perhaps this is something we should do between 1.9 and 2.0, since the
> patch will make the new indices incompatible, and breaking
> compatibility at version 2.0 would be okay, while 1.9 should remain
> compatible with 1.4.3 indices and just have a bunch of methods
> deprecated.

Just to clarify, an incompatibility will occur if:

a. The new code is used to write the index.
b. The text being written contains an embedded null or an extended (not in the BMP) Unicode code point.
c. Old code is then used to read the index.

It may still make sense to defer this change to 2.0, but it's not at the level of changing the format of an index file.

-- Ken

--
Ken Krugler
Krugle, Inc.
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene and UTF-8
Hello,

> Perl development is going very well, by the way. On the indexing side,
> I've got a new app going which solves both the index compatibility
> issue and the speed issue, about which I'll make a presentation in
> this forum after I flesh it out and clean it up. Well, I'm lying a
> little. The app doesn't quite write a valid Lucene 1.4.3 index, since
> it writes true UTF-8. If these patches get adopted prior to the
> release of 1.9, though, it will write valid Lucene 1.9 indexes.
>
> This UTF stuff is not my thing, and I have a hard time following all
> the discussion here (read: I don't get it)... but it sounds like good
> changes. Could one of the other Lucene committers following this
> thread apply the patches and commit the stuff if it looks good?

Perhaps this is something we should do between 1.9 and 2.0, since the patch will make the new indices incompatible. Breaking compatibility at version 2.0 would be okay, while 1.9 should remain compatible with 1.4.3 indices and just have a bunch of methods deprecated.

If some job changes work out for me, I may have some time to make the 1.9 release.

Otis
Re: Lucene and UTF-8
On Sep 20, 2005, at 11:53 PM, Chris Lamprecht wrote:

> import java.util.Arrays;
> ...
> Arrays.equals(array1, array2);

Great, thank you, Chris.

The patch for IndexOutput.java is done. It will now write valid UTF-8. Older versions of Lucene will not be able to read indexes written using this class, as they will choke if they encounter a null byte or a 4-byte UTF-8 sequence.

As an added bonus, this patch yields a speedup of a couple of percentage points (on my machine), made possible by simplified conditionals. For instance, the first if() clause...

    if (code >= 0x01 && code <= 0x7F)

...is now...

    if (code < 0x80)

The new TestIndexOutput.java class is sort of done. It has all the tests Ken suggested, though I think it could stand the addition of a randomized test to exercise edge cases. The data mirrors the data from TestIndexInput.java, and that's by design; with so much overlap, I think the two ought to be merged. How does TestIndexIO.java grab you all?

On Aug 29, 2005, at 11:49 AM, Ken Krugler wrote:

> a. Single surrogate pair (two Java chars)
> b. Surrogate pair at the beginning, followed by regular data.
> c. Surrogate pair at the end, preceded by regular data.
> d. Two surrogate pairs in a row.
>
> Then all of the above, but remove the second (low-order) surrogate
> character (busted format). Then all of the above, but replace the
> first (high-order) surrogate character.

A minor wrinkle: each unpaired surrogate will have to be replaced by the Unicode replacement character U+FFFD, or the VInt count will be off. This means that a UTF-16LE sequence will grow by a code point, as the (mis-ordered) surrogate pair (representing a single code point) will get subbed out for two replacement characters. I don't think this is serious, though.

> Then all of the above, but replace the surrogate pair with an
> \xC0\x80 encoded null byte.

I left this out of the test cases for IndexOutput (it's in there, and important, for IndexInput). The UTF-16 sequence \u00C0\u0080 doesn't map to a null, so I used the regular UTF-16 null \u0000. As before, I think this is what you intended.

Files and patches can be found here:

http://www.rectangular.com/downloads/IndexOutput.patch
http://www.rectangular.com/downloads/MockIndexOutput.java
http://www.rectangular.com/downloads/TestIndexOutput.java

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
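[The encoder behavior Marvin describes above -- writing true UTF-8, branching on `code < 0x80`, and substituting U+FFFD for unpaired surrogates -- can be sketched as a stand-alone class. This is illustrative code only, not the actual IndexOutput patch; the class and method names are invented for the example.]

```java
import java.io.ByteArrayOutputStream;

// Sketch of a standard-UTF-8 encoder for Java strings: pairs of
// surrogates become 4-byte sequences, unpaired surrogates become
// U+FFFD, and a real null char is written as a single 0x00 byte
// (unlike modified UTF-8, which writes it as 0xC0 0x80).
public class Utf8EncodeSketch {
    public static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int code = c;
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                code = Character.toCodePoint(c, s.charAt(++i));
            } else if (Character.isSurrogate(c)) {
                code = 0xFFFD; // unpaired surrogate -> replacement char
            }
            if (code < 0x80) {            // 1 byte, including \x00
                out.write(code);
            } else if (code < 0x800) {    // 2 bytes
                out.write(0xC0 | (code >> 6));
                out.write(0x80 | (code & 0x3F));
            } else if (code < 0x10000) {  // 3 bytes
                out.write(0xE0 | (code >> 12));
                out.write(0x80 | ((code >> 6) & 0x3F));
                out.write(0x80 | (code & 0x3F));
            } else {                      // 4 bytes, non-BMP
                out.write(0xF0 | (code >> 18));
                out.write(0x80 | ((code >> 12) & 0x3F));
                out.write(0x80 | ((code >> 6) & 0x3F));
                out.write(0x80 | (code & 0x3F));
            }
        }
        return out.toByteArray();
    }
}
```

Note that the U+FFFD substitution keeps the stored VInt char count honest, per Marvin's "minor wrinkle" above: one Java char in, one code point out.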
Re: Lucene and UTF-8
Hello again,

I've prepared a patch for IndexInput.java, and an accompanying patch for TestIndexInput.java. I figured I would submit them for discussion here before filing them via Jira. The patches are attached to this email; if I find that they get stripped by the listserv, I'll post them on a website.

The patch to IndexInput.java makes it capable of decoding both modified UTF-8 and valid UTF-8, so backwards compatibility is preserved. I'll have another patch for IndexOutput.java soon, but IndexInput.java doesn't have to wait for it.

A crude benchmarking app I already have set up (it just builds an index with 1000 docs) seems to support my expectation: this change to IndexInput should have little or no impact on speed with western, mostly-ASCII text. It might actually be a smidgen faster with text which is mostly multi-byte UTF-8, since an if-else-if chain with calculations within conditionals has been replaced by a switch based on a lookup table. The only real cost for this patch is the memory hit for loading the 248-byte lookup table.

My local copy of trunk revision 590297 passes all tests with these patches, except for TestIndexModifier, which fails regardless.

Ken Krugler wrote:

> Good test data for the decoder would be the following:
>
> a. Single surrogate pair (two Java chars)
> b. Surrogate pair at the beginning, followed by regular data.
> c. Surrogate pair at the end, preceded by regular data.
> d. Two surrogate pairs in a row.

I've selected U+1D11E MUSICAL SYMBOL G CLEF and U+1D160 MUSICAL SYMBOL EIGHTH NOTE as the non-BMP code points of choice.

http://www.fileformat.info/info/unicode/char/01d11e/index.htm
http://www.fileformat.info/info/unicode/char/01d160/index.htm

It might be my quadranoia acting up again, but it seemed like a good idea to add another test case, since UTF-8 is a stateful encoding (within a short span):

e. A string with two embedded surrogate pairs:

Lu\uD834\uDD1Ece\uD834\uDD60ne

> Then all of the above, but remove the second (low-order) surrogate
> character (busted format). Then all of the above, but replace the
> first (high-order) surrogate character.

These are interesting. Lucene isn't equipped for detection/correction of invalid Unicode when reading its own index files, and implementing such capabilities would impose a performance penalty. The assumption is that Lucene will always read its own files and that those files will never contain corrupt data. Debatable, but it doesn't seem to have caused problems up till now. Since there's no way to check whether IndexInput catches invalid input, I've skipped these two cases -- but I'll put them in my upcoming IndexOutput patches, which I think is what you intended anyway.

> Then all of the above, but replace the surrogate pair with an
> \xC0\x80 encoded null byte.

Done.

Three more test batches seemed appropriate:

Cases for the \x00 null, which would previously have been interpreted incorrectly as the start of a 3-byte UTF-8 sequence.

Cases for two-byte UTF-8, using U+00BF INVERTED QUESTION MARK.
http://www.fileformat.info/info/unicode/char/00bf/index.htm

Cases for three-byte UTF-8, using U+2620 SKULL AND CROSSBONES.
http://www.fileformat.info/info/unicode/char/2620/index.htm

Previously, there was only a test for the string "Lucene".

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
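[The table-driven decoding strategy Marvin describes can be sketched like this. It is not the actual patch code -- the patch's table reportedly has 248 entries, while this illustrative class uses a simpler 256-entry table and assumes well-formed, untruncated input. Note how the modified-UTF-8 null \xC0\x80 decodes to U+0000 through the ordinary 2-byte case, which is why one decoder can read both modified and valid UTF-8.]

```java
// Sketch: decode UTF-8 (valid or Java-modified) into a String using a
// lookup table of sequence lengths indexed by the lead byte, replacing
// an if-else-if chain of range tests.
public class Utf8DecodeSketch {
    private static final byte[] SEQ_LEN = new byte[256];
    static {
        for (int b = 0x00; b < 0x80; b++) SEQ_LEN[b] = 1;
        for (int b = 0xC0; b < 0xE0; b++) SEQ_LEN[b] = 2;
        for (int b = 0xE0; b < 0xF0; b++) SEQ_LEN[b] = 3;
        for (int b = 0xF0; b < 0xF8; b++) SEQ_LEN[b] = 4;
        // 0x80-0xBF (continuation) and 0xF8-0xFF stay 0: invalid leads.
    }

    public static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < bytes.length) {
            int b = bytes[i] & 0xFF;
            int code;
            switch (SEQ_LEN[b]) {
                case 1:
                    code = b;
                    break;
                case 2:
                    // \xC0\x80 lands here and yields 0: modified-UTF-8
                    // nulls from old indexes still decode correctly.
                    code = ((b & 0x1F) << 6) | (bytes[i + 1] & 0x3F);
                    break;
                case 3:
                    code = ((b & 0x0F) << 12) | ((bytes[i + 1] & 0x3F) << 6)
                            | (bytes[i + 2] & 0x3F);
                    break;
                case 4:
                    code = ((b & 0x07) << 18) | ((bytes[i + 1] & 0x3F) << 12)
                            | ((bytes[i + 2] & 0x3F) << 6)
                            | (bytes[i + 3] & 0x3F);
                    break;
                default:
                    throw new IllegalArgumentException("invalid lead byte");
            }
            i += SEQ_LEN[b];
            sb.appendCodePoint(code); // non-BMP becomes a surrogate pair
        }
        return sb.toString();
    }
}
```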
Re: Lucene and UTF-8
I wrote:

> The patches are attached to this email; if I find that they get
> stripped by the listserv, I'll post them on a website.

They got stripped, so here are the links:

http://www.rectangular.com/downloads/IndexInput.patch
http://www.rectangular.com/downloads/TestIndexInput.patch

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Lucene and UTF-8
Greets,

I don't see any JUnit tests which address IndexOutput directly. I'm going to create one unless someone points out a file, or portion thereof, that I've overlooked.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Lucene and UTF-8
Hi Marvin,

> I'm guessing that since I'm the one that cares most about
> interoperability, I'll have to volunteer to do the heavy lifting.
> Tomorrow I'll go through and survey how many and which things would
> need to change to achieve full UTF-8 compliance.
>
> One concern is that I think in order to make that last case work,
> readChars() may need to return an array. Since readChars() is part of
> the public API and may be called by something other than readString(),
> I don't know if that'll fly.

I don't believe such a change would be required, since the ultimate data source/destination on the Java side will look the same (an array of Java chars) -- the only issue is how it looks when serialized.

> It seems clear that you have sufficient expertise to hone my rough
> contributions into final form. If you have the interest, would that be
> a good division of labor? I wish I could do this alone and just supply
> finished, tested patches, but obviously I can't. Or perhaps I'm
> underestimating your level of interest -- do you want to take the ball
> and run with it?

I can take a look at the code, sure. The hard part will be coding up the JUnit test cases (see below).

I think we could stand to have two corpuses of test documents available: one which is predominantly 2-byte and 3-byte UTF-8 (but no 4-byte), and another which has the full range, including non-BMP code points. I can hunt those down or maybe get somebody from the Plucene community to create them, but perhaps they already exist?

Good test data for the decoder would be the following:

a. Single surrogate pair (two Java chars)
b. Surrogate pair at the beginning, followed by regular data.
c. Surrogate pair at the end, preceded by regular data.
d. Two surrogate pairs in a row.

Then all of the above, but remove the second (low-order) surrogate character (busted format).

Then all of the above, but replace the first (high-order) surrogate character.

Then all of the above, but replace the surrogate pair with an \xC0\x80 encoded null byte.

And no, I don't think this test data exists, unfortunately. But it shouldn't be too hard to generate.

-- Ken

--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200
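[Ken's test data targets the two places where Java's "modified UTF-8" (what DataOutputStream.writeUTF emits, and what pre-patch Lucene wrote) diverges from standard UTF-8: an embedded null becomes the two bytes \xC0\x80 instead of one \x00 byte, and a non-BMP code point is stored as two 3-byte encoded surrogates (6 bytes) instead of one 4-byte sequence. The divergence is easy to generate with the standard library; the helper name below is invented for this example.]

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch: produce modified-UTF-8 bytes for a string via writeUTF,
// stripping the 2-byte length prefix writeUTF prepends, so the result
// can be compared byte-for-byte against String.getBytes("UTF-8").
public class ModifiedUtf8Demo {
    public static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeUTF(s);
            byte[] withLen = bytes.toByteArray();
            byte[] out = new byte[withLen.length - 2];
            System.arraycopy(withLen, 2, out, 0, out.length);
            return out;
        } catch (IOException e) {
            // ByteArrayOutputStream never actually throws here.
            throw new RuntimeException(e);
        }
    }
}
```

For example, `modifiedUtf8("\u0000")` yields the two bytes 0xC0 0x80, while standard UTF-8 encodes the same string as the single byte 0x00; a G clef (U+1D11E, the surrogate pair \uD834\uDD1E) takes 6 bytes in modified form but 4 in standard UTF-8. Test corpora built from such strings would exercise exactly the incompatibility cases listed earlier in the thread.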