Re: [whatwg] Encoding: big5 and big5-hkscs

Anne van Kesteren Wed, 04 Apr 2012 09:23:58 -0700

On Fri, 30 Mar 2012 14:00:38 +0200, Anne van Kesteren <[email protected]> wrote:

Ideally someone does detailed content analysis to figure out what the best path forward is here, though I'm not entirely sure how.

I still don't know how, but thanks to Simon Pieters I gathered some URLs from http://dotnetdotcom.org/ and found that 22 out of 609 pages (at least two of which are big5-hkscs encoded) have byte sequences in the ranges where big5 and big5-hkscs differ in most implementations (in IE the two are identical; in Opera big5-hkscs is a superset, I believe). The byte sequences found per URL are published here: http://lists.w3.org/Archives/Public/www-archive/2012Apr/0020.html

To go from (lead, trail) to an index usable in big5.json you can use a function such as:


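# Assumption: RANGE is the number of low trail bytes (0x40..0x7E).
RANGE = 0x7E - 0x40 + 1

def get_index(lead, trail):
    row = 0xFE - 0xA1 + RANGE + 1  # cells per lead byte
    # High trail bytes (0xA1..0xFE) follow the low ones within a row.
    cell = (trail - 0xA1 + RANGE) if trail > (0x7E + 1) else trail - 0x40
    return (lead - 0x81) * row + cell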

I can do that for the dataset, but I need someone who is able to interpret the results to see which decoding makes more sense.
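
Applying the mapping to a page's raw bytes could look roughly like the sketch below. It is only an illustration under stated assumptions: lead bytes are taken to be 0x81..0xFE, trail bytes 0x40..0x7E and 0xA1..0xFE, and the names is_trail, iter_pairs and the *_COUNT constants are hypothetical helpers, not anything from big5.json or an existing tool. It does not attempt to decide which ranges differ between big5 and big5-hkscs.

```python
# Sketch only: byte ranges below are assumptions, not taken from a spec.
LOW_TRAIL_COUNT = 0x7E - 0x40 + 1    # trail bytes 0x40..0x7E
HIGH_TRAIL_COUNT = 0xFE - 0xA1 + 1   # trail bytes 0xA1..0xFE
ROW = LOW_TRAIL_COUNT + HIGH_TRAIL_COUNT  # cells per lead byte


def is_trail(byte):
    """Return True if byte falls in one of the assumed trail ranges."""
    return 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE


def get_index(lead, trail):
    """Map a (lead, trail) pair to a linear index, rejecting invalid bytes."""
    if not (0x81 <= lead <= 0xFE and is_trail(trail)):
        raise ValueError("not a two-byte sequence: %#x %#x" % (lead, trail))
    if trail >= 0xA1:
        # High trail bytes are placed after the low ones within a row.
        cell = trail - 0xA1 + LOW_TRAIL_COUNT
    else:
        cell = trail - 0x40
    return (lead - 0x81) * ROW + cell


def iter_pairs(data):
    """Yield (lead, trail) for each two-byte sequence found in data (bytes)."""
    i = 0
    while i < len(data):
        byte = data[i]
        if 0x81 <= byte <= 0xFE and i + 1 < len(data) and is_trail(data[i + 1]):
            yield byte, data[i + 1]
            i += 2  # consume both bytes of the pair
        else:
            i += 1  # ASCII or an unpaired byte: skip it


if __name__ == "__main__":
    sample = b"a\xa4\x40b"
    for lead, trail in iter_pairs(sample):
        print(hex(lead), hex(trail), get_index(lead, trail))
```

For the sample bytes above, iter_pairs yields the single pair (0xA4, 0x40), which get_index maps to 5495. Bytes that do not form a pair are skipped rather than reported, which may or may not be what an actual analysis wants.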



--
Anne van Kesteren
http://annevankesteren.nl/
