Jill Ramonsky scripsit:

> I had to write an API for my employer last year to handle some aspects 
> of Unicode. We normalised everything to NFD, not NFC (but that's easier, 
> not harder). Nonetheless, all the string handling routines were not 
> allowed to /assume/ that the input was in NFD, but they had to guarantee 
> that the output was. These routines, therefore, had to do a "convert to 
> NFD" on every input, even if the input were already in NFD. This did 
> have a significant performance hit, since we were handling (Unicode) 
> strings throughout the app.

Indeed it would.  However, checking for normalization is cheaper than
normalizing, and Unicode makes properties available (the quick-check
properties defined in UAX #15) that allow a streamlined but incomplete
check which returns either "not normalized" or "maybe normalized".
So input can be handled as follows:

        if maybeNormalized(input)
        then    if normalized(input)
                then    doTheWork(input)
                else    doTheWork(normalize(input))
                fi
        else    doTheWork(normalize(input))
        fi
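
In Python 3.8 and later, for instance, the check is exposed directly as
unicodedata.is_normalized(), which is documented to be cheaper than
normalizing and plays the role of both maybeNormalized() and
normalized() above.  A minimal sketch of the same pattern (handleInput
is just a placeholder name here):

        import unicodedata

        def handleInput(s: str) -> str:
            # Guarantee NFD on output without renormalizing input
            # that is already in NFD.
            if unicodedata.is_normalized("NFD", s):
                return doTheWork(s)
            return doTheWork(unicodedata.normalize("NFD", s))

        def doTheWork(s: str) -> str:
            # Stand-in for the real string handling, which may now
            # assume its input is in NFD.
            return s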

The W3C recommends, however, that non-normalized input be rejected rather
than forcibly normalized, on the ground that the supplier of the input
is not meeting his contract.
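
Under that policy the handling shrinks to a guard.  A sketch along the
same lines, with a made-up exception name:

        import unicodedata

        class NotNormalizedError(ValueError):
            pass

        def requireNFD(s: str) -> str:
            # Reject rather than repair: the supplier, not this code,
            # is responsible for normalizing.
            if not unicodedata.is_normalized("NFD", s):
                raise NotNormalizedError("input is not in NFD")
            return s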

> I think that next time I write a similar API, I will deal with 
> (string+bool) pairs, instead of plain strings, with the bool meaning 
> "already normalised". This would definitely speed things up. Of course, 
> for any strings coming in from "outside", I'd still have to assume they 
> were not normalised, just in case.

W3C refers to this concept as "certified text".  It's a good idea.
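
A minimal sketch of such a (string+bool) pair, with an invented name
(CertifiedText), might look like:

        import unicodedata
        from dataclasses import dataclass

        @dataclass(frozen=True)
        class CertifiedText:
            text: str
            normalized: bool = False    # "already normalised" flag

            def toNFD(self) -> "CertifiedText":
                # Trust the certification: strings marked as normalized
                # pass straight through; anything from outside is
                # normalized exactly once and then certified.
                if self.normalized:
                    return self
                return CertifiedText(
                    unicodedata.normalize("NFD", self.text), True)

Strings coming in from outside are constructed with the default
(uncertified) flag, so they still pay for one normalization, but only
one.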

> Jill
> 

-- 
Verbogeny is one of the pleasurettes    John Cowan <[EMAIL PROTECTED]>
of a creatific thinkerizer.             http://www.reutershealth.com
   -- Peter da Silva                    http://www.ccil.org/~cowan
