Mike C. Fletcher wrote:
[1] by transcoded, I mean using some stable, predictable mechanism that
always converts the same e.g. Arabic word to the same sequence of ascii
characters, a normalized token that always represents the same Arabic
word, but has no necessary English representation.
A possible transcoder could use unicodedata, as in the attached example
--
Ian Bicking : [EMAIL PROTECTED] : http://blog.ianbicking.org
: Write code, do good : http://topp.openplans.org/careers
"""
Example:
>>> data = u'Hi\u1000there'
>>> data.encode('ascii', 'unicode_name')
'HiMYANMAR_LETTER_KA_there'
"""
import unicodedata
import codecs
def unicode_name_transcoder(exc):
data = exc.object[exc.start:exc.end]
name = unicodedata.name(data).replace(' ', '_')+'_'
return (unicode(name), exc.end)
codecs.register_error('unicode_name', unicode_name_transcoder)
if __name__ == '__main__':
import doctest
doctest.testmod()
_______________________________________________
Sugar mailing list
[email protected]
http://lists.laptop.org/listinfo/sugar