On Feb 23, 2008, at 7:46 AM, Chris Little wrote:

> DM Smith wrote:
>> The thing I noticed in Sword's ICU filters is that it was not
>> consistent in how it set up the UChar array or converted that back
>> to a SWBuf.
>
> Thanks for digging through everything. I will see if I can't make
> things a little more consistent once I get UTF8NFC debugged.
>
>> The setup may be wrong:
>>     int32_t len = text.length() * 2;
>>     source = new UChar[len + 1];
>>     len = ucnv_toUChars(conv, source, len, text.c_str(), -1, &err);
>
> Yes, that's where I'm focusing my attention.
>
>> Many of the filters just use text.length(), one uses
>> text.length()*2+1, another 5+text.length()*5 and only this one uses
>> text.length()*2.
>
> Well, here are some guesses as to what these might have come from....
>
> X+1 is probably making room for a null termination (probably
> unnecessary since everything is null terminated to begin with).

The SWBuf is null terminated. From what I can understand from the ICU
docs, UChar buffers do not need to be null terminated, but they can be.
The +1 is necessary to ensure space for a null terminator. If the UChar
buffer is not null terminated, then the actual length needs to be
remembered at every stage.

> X*2 could be either doubling the byte size to accommodate conversion
> from 8-bit chars to 16-bit chars OR could be acceptance of the fact
> that characters we encounter might actually be represented as
> surrogate pairs in UTF-16. (ICU uses UTF-16 internally.)

I don't think the former applies. SWBuf.length() will return the number
of bytes in the array, which will be equal to or greater than the
number of UTF-8 characters. I think that a UChar is the size of one
UTF-16 code unit, so the receiving buffer, source, needs only to be big
enough for the maximal number of UTF-16 code units. There are comments
that the *2 represents space for surrogate pairs.

> X*5 is probably allowing for expansion from a character to its UTF-8
> representation, which is maximally 5-bytes long.

This is only used in the nfkd filter, so *5 probably represents the
maximal size of a decomposition. I have no guess as to why +5. I'm not
familiar with surrogate pairs, but it appears that there is no
accounting for them.

> I'll get it all sorted out eventually, but those are what those
> numbers probably represent.
>
> I had a bit of difficulty getting BCB5 installed and working in Vista,
> but I think I've got everything running well enough for the moment so
> that I can get to work on this.

Many thanks!
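
To make the sizing argument above concrete, here is a minimal sketch of why text.length() + 1 UChars should be enough when the input is UTF-8. It assumes well-formed UTF-8 and that conv is a UTF-8 converter. Both helper names are hypothetical and belong to neither Sword nor ICU. A UTF-8 sequence of 1 to 3 bytes becomes one UTF-16 code unit, and a 4-byte sequence becomes a surrogate pair (two code units), so the number of code units never exceeds the number of bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Sword or ICU): counts the UTF-16 code
// units needed to hold a UTF-8 string. Assumes the input is well formed.
std::size_t utf16UnitsForUtf8(const std::string &utf8) {
    std::size_t units = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) == 0x80) {
            continue;  // continuation byte, already counted via its lead byte
        }
        // Lead bytes from 0xF0 up start a 4-byte sequence, i.e. a code point
        // above U+FFFF, which UTF-16 stores as a surrogate pair (2 units).
        units += (byte >= 0xF0) ? 2 : 1;
    }
    // 1-3 bytes -> 1 unit, 4 bytes -> 2 units, so units never exceed bytes.
    assert(units <= utf8.size());
    return units;
}

// Hypothetical helper: a UChar capacity that covers the converted text
// plus one slot for a null terminator, under the assumptions above.
std::size_t ucharCapacityFor(const std::string &utf8) {
    return utf8.size() + 1;
}
```

Under these assumptions the *2 is not needed for the UTF-8 to UTF-16 step, even with surrogate pairs. It says nothing about the nfkd filter, where decomposition can make the text longer after conversion.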
