I originally did the UTF-8 transcoder that way, with a leading loop.
However, it was removed at some point, I think because it caused a
problem. I don't remember what the problem was, but it must have been
removed for a reason, because I'd already proven that it was a lot faster.

--------------
Dean Roddey
Software Geek Extraordinaire
Portal, Inc
[EMAIL PROTECTED]


-----Original Message-----
From: Arnold, Curt [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 30, 2001 10:53 AM
To: '[EMAIL PROTECTED]'
Subject: UTF8 transcode optimization (was Andy Heninger's new DOM)


Andy Heninger wrote:

>> About 6% of the time is taken in the UTF8 transcoder (this on a file
>> with no characters with code points higher than 127). Seems like
>> that could be whittled down a bit.

> I've tried to get this one down already, and couldn't find anything more
> to do to that loop. I tried several different forms without getting any
> further improvement.

The percentage looks even bigger with SAX. Doing a profile of thread test
using SAX showed UTF8 transcoding taking about 10% of the time.

I was able to cut that down to 2.2% in a profile build, and to see Win32
release performance go up by around 3%, by adding a preliminary loop that
copies until it runs into the first multibyte sequence or exhausts either
the source or destination buffer.

I was originally trying to do something more elaborate (casting to long*
and AND'ing with 0x80808080 to check four bytes at a time), but this seems
to do the trick just fine without adding any dependency on the length of
a long.


unsigned int
XMLUTF8Transcoder::transcodeFrom(const  XMLByte* const          srcData
                                , const unsigned int            srcCount
                                ,       XMLCh* const            toFill
                                , const unsigned int            maxChars
                                ,       unsigned int&           bytesEaten
                                ,       unsigned char* const    charSizes)
{
    // Watch for pathological scenario. Shouldn't happen, but...
    if (!srcCount || !maxChars)
        return 0;

    // If debugging, make sure that the block size is legal
    #if defined(XERCES_DEBUG)
    checkBlockSize(maxChars);
    #endif

    //
    //  Get pointers to our start and end points of the input and output
    //  buffers.
    //
    const XMLByte*  srcPtr = srcData;
    const XMLByte*  srcEnd = srcPtr + srcCount;
    XMLCh*          outPtr = toFill;
    XMLCh*          outEnd = outPtr + maxChars;
    unsigned char*  sizePtr = charSizes;

+   //
+   //  copy characters until the first multibyte sequence or
+   //  exhaustion of the source or destination buffers
+   //
+   unsigned int bytesToEat = srcCount;
+   if(srcCount > maxChars) {
+       bytesToEat = maxChars;
+   }
+   for(unsigned int i = 0; i < bytesToEat && *srcPtr < 128; i++) {
+       *outPtr++ = *srcPtr++;
+   }

    //
    //  We now loop until we either run out of input data, or room to store
    //  output chars.
    //
    while ((srcPtr < srcEnd) && (outPtr < outEnd))
    {
        // Get the next leading byte out
        const XMLByte firstByte = *srcPtr;

        // Special-case ASCII, which is a leading byte value of <= 127
        if (firstByte <= 127)

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
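
For anyone who wants to try the two approaches from the message outside of Xerces, here is a minimal standalone sketch. The names copyAsciiPrefix and asciiPrefixLength are hypothetical and do not exist in the library, and char16_t stands in for XMLCh. One assumption worth flagging: the posted loop never advances sizePtr, and the excerpt is cut off before the main loop shows how charSizes is used. If callers expect one size entry per output character, the fast path would probably have to write a 1 for each byte it copies, so the sketch does that. The second function is one way the four-bytes-at-a-time check might look, using memcpy into a fixed-width integer rather than a long* cast.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper, not part of Xerces: copies the leading run of
// single-byte (ASCII) UTF-8 values from src to dst, widening each byte to
// a 16-bit unit. Stops at the first byte >= 0x80 or when either buffer is
// exhausted, and returns the number of bytes copied.
// Assumption: the caller wants sizes[i] == 1 for every copied character,
// mirroring the charSizes parameter in the posted signature.
std::size_t copyAsciiPrefix(const unsigned char* src, std::size_t srcCount,
                            char16_t* dst, std::size_t maxChars,
                            unsigned char* sizes)
{
    assert(src != nullptr && dst != nullptr && sizes != nullptr);
    const std::size_t limit = (srcCount < maxChars) ? srcCount : maxChars;
    std::size_t i = 0;
    // Check the bound before reading src[i] so we never read past the end.
    while (i < limit && src[i] < 0x80) {
        dst[i] = static_cast<char16_t>(src[i]);
        sizes[i] = 1;
        ++i;
    }
    return i;
}

// Hypothetical variant of the four-bytes-at-a-time idea: returns the length
// of the leading all-ASCII run, testing 4 bytes per step by masking with
// 0x80808080. memcpy into a uint32_t avoids alignment problems and any
// dependency on sizeof(long); the mask is symmetric, so byte order does
// not matter.
std::size_t asciiPrefixLength(const unsigned char* src, std::size_t count)
{
    assert(src != nullptr || count == 0);
    std::size_t i = 0;
    while (count - i >= 4) {
        std::uint32_t word;
        std::memcpy(&word, src + i, sizeof word);
        if (word & 0x80808080u)
            break;  // at least one byte in this group has its high bit set
        i += 4;
    }
    // Finish the tail, or locate the exact non-ASCII byte, one at a time.
    while (i < count && src[i] < 0x80)
        ++i;
    return i;
}
```

In a real transcodeFrom the return value of the first helper would be used to advance srcPtr, outPtr and sizePtr together before falling into the general loop. Whether the word-at-a-time variant is worth the extra code is something only a profile can say; the message above reports that the simple byte loop was already enough.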
