Doug Ewell wrote:

> [...] Far from being a simple operation like Latin
> case mapping (to which it was compared), TC/SC
> requires potentially complex analysis of the text
> being converted.
>
> This is the opinion of many experts within, as well as
> outside, the Unicode standardization effort, and it is
> the reason you will not find a Unicode TC/SC mapping
> table.
Actually, such a table can easily be extracted from Unicode's Unihan database (a huge file: <http://www.unicode.org/Public/UNIDATA/Unihan.txt>). The relevant information for TC->SC is in the field <kSimplifiedVariant>, and for SC->TC in the field <kTraditionalVariant>. As each field is on a separate line, the information can be extracted quite simply, e.g. with the DOS command:

    find "kSimplifiedVariant" Unihan.txt > kSimplifiedVariant.txt

However, as Doug explained, this 1-to-1 data is NOT suitable for a full-fledged conversion. Still, it may be a good starting point for more complex approaches. It can also prove useful for implementing things such as a user-friendly search function that matches any variant of the sought characters. In this respect, Unihan contains two more fields that may be useful: <kSemanticVariant> and <kSpecializedSemanticVariant>.

_ Marco
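P.S. For anyone who wants to go beyond the DOS one-liner, here is a minimal sketch in Python of building such a variant table. It assumes the tab-separated layout used by Unihan.txt (`U+XXXX<TAB>kFieldName<TAB>U+YYYY ...`); the `build_variant_map` function and the toy sample are mine, not part of the database. As noted above, mapping each character to its first listed variant is only a naive 1-to-1 substitution, not a real TC/SC converter.

```python
import re

def build_variant_map(unihan_text, field="kSimplifiedVariant"):
    """Build a character -> list-of-variant-characters map.

    Assumes Unihan.txt's tab-separated layout:
        U+XXXX<TAB>kFieldName<TAB>U+YYYY [U+ZZZZ ...]
    A character may list several variants; all are kept,
    which is exactly why 1-to-1 conversion is unreliable.
    """
    mapping = {}
    for line in unihan_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] != field:
            continue  # keep only the requested field
        source = chr(int(parts[0][2:], 16))
        variants = [chr(int(cp[2:], 16))
                    for cp in re.findall(r"U\+[0-9A-F]+", parts[2])]
        if variants:
            mapping[source] = variants
    return mapping

# Toy sample; the real data would come from Unihan.txt itself.
sample = ("U+6C23\tkSimplifiedVariant\tU+6C14\n"
          "U+9580\tkSimplifiedVariant\tU+95E8")
tc_to_sc = build_variant_map(sample)

# Naive conversion: always take the first listed variant.
print("".join(tc_to_sc.get(c, [c])[0] for c in "氣門"))  # prints 气门
```

The `.get(c, [c])` fallback simply passes through any character that has no simplified variant, which is the behaviour you'd want in a search function that matches all variants of the sought characters.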

