Mike C. Fletcher wrote:
[1] by transcoded, I mean using some stable, predictable mechanism that always converts the same e.g. Arabic word to the same sequence of ascii characters, a normalized token that always represents the same Arabic word, but has no necessary English representation.

A possible transcoder could use unicodedata, as in the attached example


--
Ian Bicking : [EMAIL PROTECTED] : http://blog.ianbicking.org
            : Write code, do good : http://topp.openplans.org/careers
"""
Example:

    >>> data = u'Hi\u1000there'
    >>> data.encode('ascii', 'unicode_name')
    'HiMYANMAR_LETTER_KA_there'
"""
import unicodedata
import codecs

def unicode_name_transcoder(exc):
    data = exc.object[exc.start:exc.end]
    name = unicodedata.name(data).replace(' ', '_')+'_'
    return (unicode(name), exc.end)

codecs.register_error('unicode_name', unicode_name_transcoder)

if __name__ == '__main__':
    import doctest
    doctest.testmod()
    
_______________________________________________
Sugar mailing list
[email protected]
http://lists.laptop.org/listinfo/sugar

Reply via email to