There was talk recently on this list of mapping grapheme clusters to
the PUA (for application internal use only, obviously, not for export
to the real world). I actually did this recently, though my app ended
up in an incomplete state since I got bored and moved on to something
else. The algorithm worked though, so I present it here and place it in
the public domain, licence free, for anyone to use who wants to do so.
Such an encoded string I called a "grapheme string", or "gstring" for
short. Of course, that was before "grapheme" was renamed as "default
grapheme cluster", so the name doesn't work quite as well now.
The range of characters I reserved for my private use actually
consisted of the surrogate codepoints, not the PUA codepoints. I
reasoned that the PUA area might already be in use for something
else, but the surrogate codepoints were illegal in decoded text and
therefore available. Despite the fact that the number of possible
graphemes is infinite, I never actually ran out of codepoints.
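To give a sense of how much room that leaves, here is a small Python sketch of the two candidate ranges. The constant and function names are made up for illustration; the range boundaries are the standard surrogate block and the BMP Private Use Area.

```python
# Candidate reserved ranges (boundaries from the Unicode Standard).
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF
BMP_PUA_FIRST, BMP_PUA_LAST = 0xE000, 0xF8FF


def capacity(first, last):
    """Number of 16-bit words available in an inclusive range."""
    return last - first + 1


def is_reserved(word, first=SURROGATE_FIRST, last=SURROGATE_LAST):
    """True if a 16-bit word falls inside the reserved range."""
    return first <= word <= last


print(capacity(SURROGATE_FIRST, SURROGATE_LAST))  # 2048
print(capacity(BMP_PUA_FIRST, BMP_PUA_LAST))      # 6400
```

So the surrogate block allows 2048 distinct multi-codepoint graphemes per process, and the BMP PUA would allow 6400.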
Here's the algorithm in pseudo-code:
// The following are static and global
max_word (a 16-bit unsigned integer, initially the lowest codepoint you
    reserve (e.g. the start of the PUA) minus one)
map_grapheme_to_word[] (a mapping from grapheme (=array of codepoints)
    to 16-bit word, initially empty)
map_word_to_grapheme[] (a mapping from 16-bit word to grapheme,
    initially empty)
// Convert unicode text to internal representation with one 16-bit word
// per grapheme
// -- input (text_unicode) is an array of codepoints (i.e. it has already
//    been decoded from UTF-whatever)
// -- output (text_internal) is an array of 16-bit words, each
//    representing one grapheme. THIS STRING MAY NEVER BE EXPORTED.
text_internal = "";
for (each grapheme in text_unicode) // each grapheme is a substring of
                                    // one or more codepoints
{
    grapheme = convert_to_NFC(grapheme);
    if (num_codepoints(grapheme) == 1 && codepoint_of(grapheme) < 0x10000
        && codepoint_of(grapheme) not in reserved range)
    {
        // Ordinary BMP character: stored as itself. The reserved-range
        // test matters if you reserve the PUA, since PUA characters can
        // legitimately occur in the input.
        text_internal += codepoint_of(grapheme);
    }
    else if (exists(map_grapheme_to_word[grapheme]))
    {
        text_internal += map_grapheme_to_word[grapheme];
    }
    else if (max_word still in range) // i.e. below the highest reserved
                                      // codepoint
    {
        map_grapheme_to_word[grapheme] = ++max_word;
        map_word_to_grapheme[max_word] = grapheme;
        text_internal += max_word;
    }
    else
    {
        // Whoa!! Ran out of reserved characters! Could add error
        // handling here.
        text_internal += U+FFFD;
    }
}
return text_internal;
// The converse process
text_unicode = "";
for (each word in text_internal)
{
    if (word in reserved range) // e.g. PUA but doesn't have to be
    {
        if (exists(map_word_to_grapheme[word]))
        {
            text_unicode += map_word_to_grapheme[word];
        }
        else
        {
            // error - should never happen
            text_unicode += U+FFFD;
        }
    }
    else
    {
        text_unicode += word;
    }
}
return text_unicode;
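
For anyone who wants something runnable, here is a minimal Python sketch of both directions. It is an illustration under stated assumptions, not the code from my app: the names (GStringCodec, split_graphemes and so on) are made up, the reserved range is the surrogate block, and the grapheme splitter is a crude stand-in that only glues combining marks onto the preceding base character. Real default grapheme cluster segmentation follows UAX #29 and handles much more (Hangul, ZWJ sequences, etc.).

```python
import unicodedata

# Assumption: reserve the surrogate block, as described in the post.
RESERVED_FIRST = 0xD800
RESERVED_LAST = 0xDFFF
REPLACEMENT = 0xFFFD


def split_graphemes(text):
    """Crude stand-in for grapheme segmentation: a base character plus
    any following combining marks. NOT full UAX #29 segmentation."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


class GStringCodec:
    """Holds the two global mappings and the max_word counter."""

    def __init__(self):
        self.max_word = RESERVED_FIRST - 1
        self.grapheme_to_word = {}
        self.word_to_grapheme = {}

    def encode(self, text):
        """Return a list of 16-bit words, one per grapheme."""
        words = []
        for grapheme in split_graphemes(text):
            grapheme = unicodedata.normalize("NFC", grapheme)
            cp = ord(grapheme[0])
            if (len(grapheme) == 1 and cp < 0x10000
                    and not RESERVED_FIRST <= cp <= RESERVED_LAST):
                words.append(cp)  # ordinary BMP character, stored as itself
                continue
            word = self.grapheme_to_word.get(grapheme)
            if word is None:
                if self.max_word >= RESERVED_LAST:
                    words.append(REPLACEMENT)  # ran out of reserved words
                    continue
                self.max_word += 1
                word = self.max_word
                self.grapheme_to_word[grapheme] = word
                self.word_to_grapheme[word] = grapheme
            words.append(word)
        return words

    def decode(self, words):
        """The converse process: expand reserved words back to graphemes."""
        parts = []
        for word in words:
            if RESERVED_FIRST <= word <= RESERVED_LAST:
                parts.append(self.word_to_grapheme.get(word, "\ufffd"))
            else:
                parts.append(chr(word))
        return "".join(parts)
```

A round trip returns the NFC form of the input rather than the input itself, since each grapheme is normalized on the way in. The mappings live in the codec object, so a gstring can only be decoded by the same instance that produced it, which is another reason it must never be exported.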
Jill

