On second thought... I think we'd better go with the C++ interface code you posted (and we should apply it across the other filters as appropriate). I still want to run some tests to see if it's really going to kill performance when we run it on large chunks of data, but I would guess we can commit your patch tomorrow.
In order to do things right using the C interface, we would basically have to re-implement all of the inefficient parts hidden behind the C++ interface anyway. Specifically, I'm thinking of the pre-flighting of conversion calls to determine string sizes when we're converting between UTF-8 & UTF-16, and also when we run various normalizations, converters, & transliterators on the UTF-16 itself.

DM Smith wrote:
> On Feb 23, 2008, at 7:46 AM, Chris Little wrote:
>> X*2 could be either doubling the byte size to accommodate conversion
>> from 8-bit chars to 16-bit chars OR could be acceptance of the fact that
>> characters we encounter might actually be represented as surrogate pairs
>> in UTF-16. (ICU uses UTF-16 internally.)
>
> I don't think the former applies. SWBuf.length() will return the
> number of bytes in the array, which will be either equal or greater
> than the number of UTF-8 characters. I think that a UChar is the size
> of a UTF-16 character, so the receiving buffer, source, needs only to
> be big enough for the maximal number of UTF-16 bytes.
>
> There are comments that the *2 represents space for surrogate pairs.

ICU UChars are 16 bits long. A character in UTF-16 can be either one or two 16-bit code units long: if the character is in Plane 0 (the BMP), it's one unit long; if it's outside Plane 0, it's represented by a surrogate pair (two units). So the number of UChars in a string might be as much as double the number of characters in that string.

Now that I think about it, the number of UTF-8 bytes necessary to represent a character is always greater than or equal to the number of UTF-16 code units necessary to represent it. But this is exactly the sort of reasoning I'd like to avoid by using your patch, assuming the C++ interface doesn't slow things down too badly in actual usage. Normalization itself could cause growth of the string size that I don't really want to think about.
--Chris

_______________________________________________
sword-devel mailing list: [email protected]
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
