RE: UTF8 transcode optimization (was Andy Heninger's new DOM)

Dean Roddey Fri, 30 Mar 2001 10:52:57 -0800
I originally did the UTF-8 transcoder that way, with a leading loop.
However, it was removed at some point, I think because it caused a problem.
I don't remember what the problem was, but there must have been a reason
for removing it, because I'd already proven that it was a lot faster.

--------------
Dean Roddey
Software Geek Extraordinaire
Portal, Inc
[EMAIL PROTECTED]



-----Original Message-----
From: Arnold, Curt [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 30, 2001 10:53 AM
To: '[EMAIL PROTECTED]'
Subject: UTF8 transcode optimization (was Andy Heninger's new DOM)


Andy Heninger wrote:
>> About 6% of the time is taken in the UTF8 transcoder (this on a file
>>with no characters with code points higher than 127).  Seems like
>>that could be whittled down a bit.

>I've tried to get this one down already, and couldn't find anything more
>to do to that loop.  I tried several different forms without getting any
>further improvement.  The percentage looks even bigger with SAX.

Doing a profile of thread test using SAX showed UTF8 transcoding taking
about 10% of the time.  I was able to cut that down to 2.2% in a profile
build, and to see Win32 release performance go up by around 3%, by adding
a preliminary loop that copies bytes straight across until it runs into
the first multibyte sequence or exhausts either the source or the
destination buffer.

I originally was trying to do something more elaborate (casting to long*'s
and AND'ing with 0x80808080 to check four bytes at a time), but this seems
to do the trick just fine without adding any dependencies on the length of
longs.
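
For reference, the four-bytes-at-a-time check mentioned above might look
something like the sketch below.  It is not code from Xerces or from the
original experiment: asciiPrefixLength is a hypothetical helper, and it
uses a fixed-width std::uint32_t filled via memcpy rather than a cast to
long*, so it does not depend on the size of long or on alignment.  A byte
is non-ASCII when its high bit is set, so masking the word with 0x80808080
and testing for a non-zero result detects any such byte among the four.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper, not part of Xerces: returns the number of leading
// bytes in [src, src + count) that are plain ASCII (high bit clear).
std::size_t asciiPrefixLength(const unsigned char* src, std::size_t count)
{
    std::size_t i = 0;

    // Check four bytes at a time.  memcpy into a fixed-width integer
    // avoids alignment problems and any dependency on sizeof(long).
    while (count - i >= 4)
    {
        std::uint32_t word;
        std::memcpy(&word, src + i, sizeof(word));
        if (word & 0x80808080u)
            break;      // at least one of these four bytes is non-ASCII
        i += 4;
    }

    // Finish the tail, or locate the exact offending byte, one at a time.
    while (i < count && src[i] < 0x80)
        ++i;

    return i;
}
```

A caller would copy that many bytes directly and hand the rest to the
general multibyte loop.  Whether this beats the simple byte loop is
something only a profile can answer.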


unsigned int
XMLUTF8Transcoder::transcodeFrom(const  XMLByte* const          srcData
                                , const unsigned int            srcCount
                                ,       XMLCh* const            toFill
                                , const unsigned int            maxChars
                                ,       unsigned int&           bytesEaten
                                ,       unsigned char* const    charSizes)
{
    // Watch for pathological scenario. Shouldn't happen, but...
    if (!srcCount || !maxChars)
        return 0;

    // If debugging, make sure that the block size is legal
    #if defined(XERCES_DEBUG)
    checkBlockSize(maxChars);
    #endif

    //
    //  Get pointers to our start and end points of the input and output
    //  buffers.
    //
    const XMLByte*  srcPtr = srcData;
    const XMLByte*  srcEnd = srcPtr + srcCount;
    XMLCh*          outPtr = toFill;
    XMLCh*          outEnd = outPtr + maxChars;
    unsigned char*  sizePtr = charSizes;

+    //
+    //  Copy characters until the first multibyte sequence or
+    //  exhaustion of the source or destination buffers.
+    //
+    unsigned int bytesToEat = srcCount;
+    if (srcCount > maxChars) {
+        bytesToEat = maxChars;
+    }
+    for (unsigned int i = 0; i < bytesToEat && *srcPtr < 128; i++) {
+        *outPtr++ = *srcPtr++;
+        // Assumption: charSizes must record one source byte per ASCII
+        // char here too, as the main loop is expected to do.
+        *sizePtr++ = 1;
+    }

    //
    //  We now loop until we either run out of input data, or room to store
    //  output chars.
    //
    while ((srcPtr < srcEnd) && (outPtr < outEnd))
    {
        // Get the next leading byte out
        const XMLByte firstByte = *srcPtr;

        // Special-case ASCII, which is a leading byte value of <= 127
        if (firstByte <= 127)

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
