Re: Lucene and UTF-8

2005-09-27 Thread Ken Krugler
  Perl development is going very well, by the way.  On the indexing 
 side, I've got a new app going which solves both the index 
 compatibility issue and the speed issue, about which I'll make a 
 presentation in this forum after I flesh it out and clean it up.



  Well, I'm lying a little.  The app doesn't quite write a valid Lucene
  1.4.3 index, since it writes true UTF-8.  If these patches get
  adopted prior to the release of 1.9, though, it will write valid
  Lucene 1.9 indexes.

This UTF stuff is not my thing, and I have a hard time following all
the discussion here (read: I don't get it)... but it sounds like good
changes. 


Could one of the other Lucene committers following this thread apply
the patches and commit the stuff if it looks good?  Perhaps this is
something we should do between 1.9 and 2.0, since the patch will make
the new indices incompatible, and breaking the compatibility at version
2.0 would be okay, while 1.9 should remain compatible with 1.4.3
indices and just have a bunch of methods deprecated.


Just to clarify, an incompatibility will occur if:

a. The new code is used to write the index.
b. The text being written contains an embedded null or an extended
   (not in the BMP) Unicode code point.
c. Old code is then used to read the index.

It may still make sense to defer this change to 2.0, but it's not at 
the level of changing the format of an index file.
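
To make condition (b) concrete, here is a minimal sketch that compares modified UTF-8 with standard UTF-8 for a few strings. It uses the JDK's DataOutputStream.writeUTF as a stand-in for the old index encoding, on the assumption (taken from this thread's description of the old format as "modified UTF-8") that the per-character bytes match; the class and method names are made up for illustration and are not Lucene code.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Lucene.
final class Utf8Divergence {

    // Modified UTF-8 as produced by DataOutputStream.writeUTF,
    // with the 2-byte length prefix stripped off.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = buf.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // True when the two encodings produce different bytes for s.
    static boolean encodingsDiffer(String s) {
        return !Arrays.equals(modifiedUtf8(s), standardUtf8(s));
    }

    public static void main(String[] args) {
        String[] samples = {
            "Lucene",        // ASCII only
            "\u00BF\u2620",  // 2-byte and 3-byte BMP characters
            "a\u0000b",      // embedded null
            "\uD834\uDD1E"   // U+1D11E, outside the BMP
        };
        for (String s : samples) {
            System.out.println(encodingsDiffer(s));
        }
    }
}
```

Only the last two samples encode differently, which is why text without embedded nulls or non-BMP code points should be unaffected.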


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene and UTF-8

2005-09-25 Thread Otis Gospodnetic
Hello,

 Perl development is going very well, by the way.  On the indexing  
 side, I've got a new app going which solves both the index  
 compatibility issue and the speed issue, about which I'll make a  
 presentation in this forum after I flesh it out and clean it up.
 
 Well, I'm lying a little.  The app doesn't quite write a valid Lucene
 1.4.3 index, since it writes true UTF-8.  If these patches get
 adopted prior to the release of 1.9, though, it will write valid
 Lucene 1.9 indexes.

This UTF stuff is not my thing, and I have a hard time following all
the discussion here (read: I don't get it)... but it sounds like good
changes.  

Could one of the other Lucene committers following this thread apply
the patches and commit the stuff if it looks good?  Perhaps this is
something we should do between 1.9 and 2.0, since the patch will make
the new indices incompatible, and breaking the compatibility at version
2.0 would be okay, while 1.9 should remain compatible with 1.4.3
indices and just have a bunch of methods deprecated.

If some job changes work out for me, I may have some time to make the
1.9 release.

Otis




Re: Lucene and UTF-8

2005-09-21 Thread Marvin Humphrey

On Sep 20, 2005, at 11:53 PM, Chris Lamprecht wrote:


import java.util.Arrays;

...

Arrays.equals(array1, array2);


Great, thank you, Chris.

The patch for IndexOutput.java is done.  It will now write valid  
UTF-8.  Older versions of Lucene will not be able to read indexes  
written using this class, as they will choke if they encounter a null  
byte or a 4-byte UTF-8 sequence.


As an added bonus, this patch yields a speedup of a couple percentage  
points (on my machine), made possible by simplified conditionals.   
For instance, the first if() clause...


if (code >= 0x01 && code <= 0x7F)

...is now...

if (code < 0x80)
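
For readers who have not opened the patch, the following is a rough sketch of a standard UTF-8 encoder built around that simplified first test. It is not the code from IndexOutput.patch; the class and method names are invented, and it assumes the input string contains no unpaired surrogates.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only; not the patched IndexOutput.
final class Utf8EncodeSketch {

    // Appends the standard UTF-8 encoding of one code point to out.
    static void writeCodePoint(ByteArrayOutputStream out, int code) {
        if (code < 0x80) {
            // One byte; U+0000 is written as a plain zero byte.
            out.write(code);
        } else if (code < 0x800) {
            out.write(0xC0 | (code >> 6));
            out.write(0x80 | (code & 0x3F));
        } else if (code < 0x10000) {
            out.write(0xE0 | (code >> 12));
            out.write(0x80 | ((code >> 6) & 0x3F));
            out.write(0x80 | (code & 0x3F));
        } else {
            // Non-BMP code point: a single 4-byte sequence.
            out.write(0xF0 | (code >> 18));
            out.write(0x80 | ((code >> 12) & 0x3F));
            out.write(0x80 | ((code >> 6) & 0x3F));
            out.write(0x80 | (code & 0x3F));
        }
    }

    // Assumes s is well-formed UTF-16 (every surrogate is paired).
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); ) {
            int code = s.codePointAt(i);
            writeCodePoint(out, code);
            i += Character.charCount(code);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : encode("\uD834\uDD1E")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```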

The new TestIndexOutput.java class is sort of done.  It has all the  
tests Ken suggested, though I think it could stand the addition of a  
randomized test to excite edge cases.  The data mirrors the data from  
TestIndexInput.java, and that's by design, as I think with so much  
overlap the two ought to be merged.  How does TestIndexIO.java grab  
you all?


On Aug 29, 2005, at 11:49 AM, Ken Krugler wrote:


a. Single surrogate pair (two Java chars)
b. Surrogate pair at the beginning, followed by regular data.
c. Surrogate pair at the end, preceded by regular data.
d. Two surrogate pairs in a row.

Then all of the above, but remove the second (low-order) surrogate  
character (busted format).


Then all of the above, but replace the first (high-order) surrogate  
character.


A minor wrinkle: each unpaired surrogate will have to be replaced by
the Unicode replacement character U+FFFD, or the VInt count will be
off.  This means that a UTF-16LE sequence will grow by a code point,
as the (mis-ordered) surrogate pair (representing a single code
point) will get subbed out for two replacement characters.  I don't
think this is serious, though.
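
A small sketch of that substitution, for illustration only. It is not taken from the patch, the names are invented, and it assumes the length prefix counts Java chars, so a one-for-one swap keeps the count intact.

```java
// Illustrative only; not the patched IndexOutput.
final class SurrogateRepair {

    // Replaces every unpaired surrogate with U+FFFD, one char for one char,
    // and leaves well-formed surrogate pairs untouched.
    static String replaceUnpairedSurrogates(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Well-formed pair: copy both halves and skip the low one.
                sb.append(c).append(s.charAt(i + 1));
                i++;
            } else if (Character.isSurrogate(c)) {
                // Lone high or low surrogate.
                sb.append('\uFFFD');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A mis-ordered pair (low surrogate first) becomes two U+FFFD chars.
        String repaired = replaceUnpairedSurrogates("\uDD1E\uD834");
        System.out.println(repaired.length());
    }
}
```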


Then all of the above, but replace the surrogate pair with an xC0  
x80 encoded null byte.


I left this out of the test cases for IndexOutput (it's in there, and
important, for IndexInput).  The UTF-16 sequence \u00C0\u0080
doesn't map to a null, so I used the regular UTF-16 null \u0000.
As before, I think this is what you intended.


Files and patches can be found here:

http://www.rectangular.com/downloads/IndexOutput.patch
http://www.rectangular.com/downloads/MockIndexOutput.java
http://www.rectangular.com/downloads/TestIndexOutput.java

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Lucene and UTF-8

2005-09-20 Thread Marvin Humphrey

Hello again,

I've prepared a patch for IndexInput.java, and an accompanying patch  
for TestIndexInput.java.  I figured I would submit them for  
discussion here before filing them via Jira.  The patches are  
attached to this email; if I find that they get stripped by the  
listserv, I'll post them on a website.


The patch to IndexInput.java makes it capable of decoding both  
modified UTF-8 and valid UTF-8, so backwards compatibility is  
preserved.  I'll have another patch for IndexOutput.java soon, but  
IndexInput.java doesn't have to wait for it.


A crude benchmarking app I already have set up (it just builds an  
index with 1000 docs) seems to support my expectation: this change to  
IndexInput should have little or no impact on speed with western,  
mostly-ascii text.  It might actually be a smidgen faster with text  
which is mostly multi-byte UTF-8, since an if-else-if chain with  
calculations within conditionals has been replaced by a switch based  
on a lookup table.  The only real cost for this patch is the memory  
hit for loading the 248-byte lookup table.
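
To illustrate the general shape of a switch driven by a lookup table, here is a minimal sketch. It is not the code from IndexInput.patch: the table here has 256 entries rather than the 248 mentioned above, the names are invented, and, in line with how Lucene treats its own index files, it does no validation of continuation bytes or truncated input.

```java
// Illustrative only; not the patched IndexInput.
final class Utf8DecodeSketch {

    // Sequence length keyed by lead byte; 0 marks a byte that cannot
    // start a sequence (a continuation byte or 0xF8 and above).
    private static final byte[] LENGTH = new byte[256];
    static {
        for (int b = 0x00; b <= 0x7F; b++) LENGTH[b] = 1;
        // 0xC0 is accepted so that C0 80 (the modified UTF-8 null) decodes.
        for (int b = 0xC0; b <= 0xDF; b++) LENGTH[b] = 2;
        for (int b = 0xE0; b <= 0xEF; b++) LENGTH[b] = 3;
        for (int b = 0xF0; b <= 0xF7; b++) LENGTH[b] = 4;
    }

    // Decodes both standard and modified UTF-8. Surrogates written as two
    // 3-byte sequences come back as two chars that form a pair again.
    static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < bytes.length) {
            int lead = bytes[i] & 0xFF;
            int code;
            switch (LENGTH[lead]) {
                case 1:
                    code = lead;
                    break;
                case 2:
                    code = ((lead & 0x1F) << 6) | (bytes[i + 1] & 0x3F);
                    break;
                case 3:
                    code = ((lead & 0x0F) << 12)
                         | ((bytes[i + 1] & 0x3F) << 6)
                         | (bytes[i + 2] & 0x3F);
                    break;
                case 4:
                    code = ((lead & 0x07) << 18)
                         | ((bytes[i + 1] & 0x3F) << 12)
                         | ((bytes[i + 2] & 0x3F) << 6)
                         | (bytes[i + 3] & 0x3F);
                    break;
                default:
                    throw new IllegalArgumentException("invalid lead byte at offset " + i);
            }
            sb.appendCodePoint(code);
            i += LENGTH[lead];
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] modifiedNull = {(byte) 0xC0, (byte) 0x80};
        byte[] plainNull = {0x00};
        System.out.println(decode(modifiedNull).equals(decode(plainNull)));
    }
}
```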


My local copy of trunk revision 590297 passes all tests with these  
patches, except for TestIndexModifier which fails regardless.


Ken Krugler wrote...


Good test data for the decoder would be the following:

a. Single surrogate pair (two Java chars)
b. Surrogate pair at the beginning, followed by regular data.
c. Surrogate pair at the end, preceded by regular data.
d. Two surrogate pairs in a row.


I've selected U+1D11E MUSICAL SYMBOL G CLEF and U+1D160 MUSICAL  
SYMBOL EIGHTH NOTE as the non-BMP code points of choice.


http://www.fileformat.info/info/unicode/char/01d11e/index.htm
http://www.fileformat.info/info/unicode/char/01d160/index.htm

It might be my quadranoia acting up again, but it seemed like a good  
idea to add another test case, since UTF-8 is a stateful encoding  
(within a short span):


e. A string with two embedded surrogate pairs.

Lu\uD834\uDD1Ece\uD834\uDD60ne
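
As a quick check on what this string exercises, the sketch below counts its Java chars, its code points, and its standard UTF-8 bytes using only JDK calls. The class name is invented for illustration and does not appear in the test patch.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only; not part of TestIndexInput.
final class TestStringFacts {

    // "Lu", G CLEF, "ce", EIGHTH NOTE, "ne"
    static final String SAMPLE = "Lu\uD834\uDD1Ece\uD834\uDD60ne";

    static int javaChars() {
        return SAMPLE.length();
    }

    static int codePoints() {
        return SAMPLE.codePointCount(0, SAMPLE.length());
    }

    static int utf8Bytes() {
        return SAMPLE.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(javaChars() + " chars, " + codePoints()
                + " code points, " + utf8Bytes() + " UTF-8 bytes");
    }
}
```

The three counts differ (10, 8 and 14), so a decoder that confuses chars, code points and bytes should fail on this string.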

Then all of the above, but remove the second (low-order) surrogate  
character (busted format).


Then all of the above, but replace the first (high-order) surrogate  
character.


These are interesting.  Lucene isn't equipped for detection/correction
of invalid Unicode when reading its own index files, and
implementing such capabilities would impose a performance penalty.
The assumption is that Lucene will always read its own files and that
those files will never contain corrupt data.  Debatable, but it
doesn't seem to have caused problems up till now.


Since there's no way to check if IndexInput catches invalid input,  
I've skipped these two cases -- but I'll put them in my upcoming  
IndexOutput patches, which is I think what you intended anyway.


Then all of the above, but replace the surrogate pair with an xC0  
x80 encoded null byte.


Done.

Three more test batches seemed appropriate.

Cases for the \x00 null, which would previously have been interpreted  
incorrectly as the start of a 3-byte UTF-8 sequence.


Cases for two-byte UTF-8, using U+00BF INVERTED QUESTION MARK.
http://www.fileformat.info/info/unicode/char/00bf/index.htm

Cases for three-byte UTF-8, using U+2620 SKULL AND CROSSBONES.
http://www.fileformat.info/info/unicode/char/2620/index.htm

Previously, there was only a test for the string "Lucene".

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Lucene and UTF-8

2005-09-20 Thread Marvin Humphrey

I wrote:

The patches are attached to this email; if I find that they get  
stripped by the listserv, I'll post them on a website.


They got stripped, so here are the links:

http://www.rectangular.com/downloads/IndexInput.patch
http://www.rectangular.com/downloads/TestIndexInput.patch

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Lucene and UTF-8

2005-09-20 Thread Marvin Humphrey

Greets,

I don't see any junit tests which address IndexOutput directly.  I'm  
going to create one unless someone points out a file or portion  
thereof that I've overlooked.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Lucene and UTF-8

2005-08-29 Thread Ken Krugler

Hi Marvin,

I'm guessing that since I'm the one that cares most about 
interoperability, I'll have to volunteer to do the heavy lifting.
Tomorrow I'll go through and survey how many and which things would 
need to change to achieve full UTF-8 compliance.  One concern is 
that I think in order to make that last case work, readChars() may 
need to return an array.  Since readChars() is part of the public 
API and may be called by something other than readString(), I don't 
know if that'll fly.


I don't believe such a change would be required, since the ultimate 
data source/destination on the Java side will look the same (array of 
Java chars) - the only issue is how it looks when serialized.


It seems clear that you have sufficient expertise to hone my rough 
contributions into final form.  If you have the interest, would that 
be a good division of labor?  I wish I could do this alone and just 
supply finished, tested patches, but obviously I can't.  Or perhaps 
I'm underestimating your level of interest -- do you want to take 
the ball and run with it?


I can take a look at the code, sure. The hard part will be coding up 
the JUnit test cases (see below).


I think we could stand to have 2 corpuses of test documents 
available: one which is predominantly 2-byte and 3-byte UTF-8 (but 
no 4-byte), and another which has the full range including non-BMP 
code points.  I can hunt those down or maybe get somebody from the 
Plucene community to create them, but perhaps they already exist?


Good test data for the decoder would be the following:

a. Single surrogate pair (two Java chars)
b. Surrogate pair at the beginning, followed by regular data.
c. Surrogate pair at the end, preceded by regular data.
d. Two surrogate pairs in a row.

Then all of the above, but remove the second (low-order) surrogate 
character (busted format).


Then all of the above, but replace the first (high-order) surrogate character.

Then all of the above, but replace the surrogate pair with an xC0 x80 
encoded null byte.


And no, I don't think this test data exists, unfortunately. But it 
shouldn't be too hard to generate.


-- Ken
--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-470-9200
