There was talk recently on this list of mapping grapheme clusters to the PUA (for application-internal use only, obviously, not for export to the real world). I actually did this recently, though my app ended up in an incomplete state since I got bored and moved on to something else. The algorithm worked though, so I present it here and place it in the public domain, licence-free, for anyone who wants to use it. I called such an encoded string a "grapheme string", or "gstring" for short. Of course, that was before "grapheme" was renamed "default grapheme cluster", so the name doesn't work quite as well now.

The range of characters I reserved for my private use actually consisted of the surrogate codepoints, not the PUA codepoints. I reasoned that the PUA might already be in use for something else, but the surrogate codepoints were illegal in decoded text and therefore available. Despite the fact that the number of possible graphemes is infinite, I never actually ran out of codepoints.
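To put a number on the reserved range, here is a small Python sketch. It assumes the whole surrogate block U+D800..U+DFFF is reserved (the post does not say whether all of it was used), and the constant names are mine. It also shows a side effect of choosing surrogates: Python's strict UTF-8 encoder refuses them, so an internal string cannot be exported by accident.

```python
# Assumption: the entire surrogate block is reserved for internal words.
RESERVED_FIRST = 0xD800
RESERVED_LAST = 0xDFFF

# Number of distinct multi-codepoint graphemes that can be mapped.
capacity = RESERVED_LAST - RESERVED_FIRST + 1

# A lone surrogate is not valid in UTF-8, so strict encoding fails.
try:
    chr(RESERVED_FIRST).encode("utf-8")
    export_refused = False
except UnicodeEncodeError:
    export_refused = True
```

So there is room for 2048 distinct mapped graphemes before the table fills up.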

Here's the algorithm in pseudo-code:


// The following are static and global
max_word (a 16-bit unsigned integer, initially the lowest codepoint you reserve (e.g. the start of the PUA) minus one)
map_grapheme_to_word[]  (a mapping from grapheme (=array of codepoints) to 16-bit word, initially empty)
map_word_to_grapheme[]  (a mapping from 16-bit word to grapheme, initially empty)



// Convert unicode text to internal representation with one 16-bit word per grapheme
// -- input (text_unicode) is an array of codepoints (ie. it has already been decoded from UTF-whatever)
// -- output (text_internal) is an array of 16-bit words, each representing one grapheme. THIS STRING MAY NEVER BE EXPORTED.

text_internal = ""
for (each grapheme in text_unicode) // each grapheme is a substring of one or more codepoints
{
    grapheme = convert_to_NFC(grapheme);
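    // A single reserved codepoint in the input must go through the map too, or it would collide with an assigned word
    if (num_codepoints(grapheme) == 1 && codepoint_of(grapheme) < 0x10000 && codepoint_of(grapheme) not in reserved range)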
    {
        text_internal += codepoint_of(grapheme);
    }
    else
    {
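        // Assign a new word only if the grapheme is unmapped and a reserved codepoint is still free
        // ("still in range" means max_word < highest reserved codepoint)
        if (!exists(map_grapheme_to_word[grapheme]) && max_word still in range)
        {
            map_grapheme_to_word[grapheme] = ++max_word;
            map_word_to_grapheme[max_word] = grapheme;
        }
        if (exists(map_grapheme_to_word[grapheme]))
        {
            text_internal += map_grapheme_to_word[grapheme];
        }
        else
        {
            text_internal += U+FFFD; // Whoa!! Ran out of reserved characters! Could add error handling here.
        }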
    }
}
return text_internal;



// The converse process
text_unicode = "";
for (each word in text_internal)
{
    if (word in correct range) // e.g. PUA but doesn't have to be
    {
        if (exists(map_word_to_grapheme[word])) // look up the current word, not max_word
        {
            text_unicode += map_word_to_grapheme[word];
        }
        else
        {
            // error - should never happen
            text_unicode += U+FFFD;
        }
    }
    else
    {
        text_unicode += word;
    }
}
return text_unicode;
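For anyone who wants to try this out, here is a minimal runnable sketch of both directions in Python. It is not my original code. Assumptions: the reserved range is the whole surrogate block, the class and function names (GStringCodec, crude_graphemes) are made up for illustration, and crude_graphemes is only a rough stand-in for real default grapheme cluster segmentation (it glues combining marks onto the preceding character and nothing more).

```python
import unicodedata

RESERVED_FIRST = 0xD800  # assumption: whole surrogate block is reserved
RESERVED_LAST = 0xDFFF
REPLACEMENT = 0xFFFD


class GStringCodec:
    """Holds the two global maps and the allocation counter."""

    def __init__(self):
        self.max_word = RESERVED_FIRST - 1
        self.grapheme_to_word = {}
        self.word_to_grapheme = {}

    def encode(self, graphemes):
        """Turn an iterable of graphemes into a list of 16-bit words."""
        words = []
        for g in graphemes:
            g = unicodedata.normalize("NFC", g)
            cp = ord(g) if len(g) == 1 else None
            reserved = cp is not None and RESERVED_FIRST <= cp <= RESERVED_LAST
            if cp is not None and cp < 0x10000 and not reserved:
                words.append(cp)  # single BMP codepoint: stored as itself
                continue
            word = self.grapheme_to_word.get(g)
            if word is None:
                if self.max_word >= RESERVED_LAST:
                    words.append(REPLACEMENT)  # reserved range exhausted
                    continue
                self.max_word += 1
                word = self.max_word
                self.grapheme_to_word[g] = word
                self.word_to_grapheme[word] = g
            words.append(word)
        return words

    def decode(self, words):
        """The converse process: words back to a Unicode string."""
        out = []
        for w in words:
            if RESERVED_FIRST <= w <= RESERVED_LAST:
                out.append(self.word_to_grapheme.get(w, "\ufffd"))
            else:
                out.append(chr(w))
        return "".join(out)


def crude_graphemes(text):
    """Rough segmentation: a character plus any following combining marks."""
    cluster = ""
    for ch in text:
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster


codec = GStringCodec()
sample = "e\u0301x\u0323\u0301\U0001F600a"
words = codec.encode(crude_graphemes(sample))
round_trip = codec.decode(words)
```

Note that the round trip returns the NFC form of the input, not necessarily the original codepoint sequence, since each grapheme is normalised before it is looked up.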



Jill
