Re: Unihan.txt and the four dictionary sorting algorithm

John Jenkins Tue, 20 Apr 2004 19:29:40 -0700

On Apr 20, 2004, at 5:11 PM, [EMAIL PROTECTED] wrote:

The DOS editor chokes on such a large text file, and so does my older hex
editor.  Thank goodness for BabelPad, otherwise it would've been hard
to insert proper (for my system) line breaks into the file.

BBEdit on the Mac tends to be unhappy with it, too.

The tab "character" is used in the file. Arguably, this "character" should never appear in a plain text file; rather, it should be converted to an appropriate number of U+0020 characters by the application on save. Of course, this would make the file even bigger.

Tab-separated data files are quite common. (Indeed, I tend to get annoyed with the main UCD file because it's semicolon-separated.) I'm not sure why you'd want a tab never to appear in a plain-text file.
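To illustrate why tab-separated data is convenient to process, here is a minimal Python sketch. It assumes each data line has the shape code point, tab, field name, tab, value, and that comment lines start with "#". The function name parse_unihan_line is hypothetical and not part of any Unicode tooling.

```python
def parse_unihan_line(line):
    """Split one tab-separated line into (codepoint, field, value).

    Returns None for blank lines and for comment lines starting with '#'.
    Assumes the three-column layout described in the lead-in.
    """
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    # Split on the first two tabs only, so a value that itself
    # contains a tab would be kept intact in the third column.
    codepoint, field, value = line.split("\t", 2)
    return codepoint, field, value
```

Because the tab cannot be confused with punctuation inside a value, no quoting or escaping logic is needed here.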

Instead of (for instance) "KUA4", why not "KUA⁴"?

I think your text got garbled here, but in any event, you've replaced one four-character word with another one. :-)

Realistically, the earliest versions of the Unihan.txt file predate the ability to safely exchange or use anything other than ASCII. Our Mandarin romanization dates back to those days.

Now that UTF-8 support is relatively common, we're moving more and more data in the file to non-ASCII form.

Much of the text in UNIHAN.TXT is redundant; the hex code point
is repeated along with each field name over and over again.

Putting the hex character at the beginning of each line, with one
character per line and CSVs would make UNIHAN.TXT *much* smaller.
Of course, commas would have to be removed from the definition
fields.  (Hmmm, maybe definition field commas could be replaced
with MIDDLE DOT?)

Hmm. Interesting suggestion.
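As a rough sketch of the one-line-per-character layout suggested above, the following Python pivots the three-column lines into CSV rows. It assumes the code point, tab, field name, tab, value layout and "#" comment lines; the function name pivot_to_csv and the "codepoint" column header are hypothetical. This is only one possible way to do it, not a proposed replacement format.

```python
import csv
import io
from collections import defaultdict


def pivot_to_csv(lines, fields):
    """Return CSV text with one row per code point and one column per field."""
    rows = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        codepoint, field, value = line.split("\t", 2)
        rows[codepoint][field] = value

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["codepoint"] + list(fields))
    # Sort numerically by the hex digits after the "U+" prefix.
    for codepoint in sorted(rows, key=lambda cp: int(cp[2:], 16)):
        writer.writerow([codepoint] + [rows[codepoint].get(f, "") for f in fields])
    return out.getvalue()
```

Note that Python's csv module quotes any value containing a comma, so in this sketch the definition fields would not need their commas removed or replaced.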

OTOH, the current format lends itself nicely to use with some utilities, like the Unix grep command.
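The reason line-oriented tools work well is that every line is self-describing: it carries its own code point and field name. A minimal Python equivalent of grepping for one field might look like this, again assuming the code point, tab, field name, tab, value layout. The name grep_field is hypothetical.

```python
def grep_field(lines, field):
    """Yield (codepoint, value) for every line that carries the given field."""
    for line in lines:
        parts = line.rstrip("\r\n").split("\t", 2)
        # Comment and blank lines do not split into three columns,
        # so they are skipped by the length check.
        if len(parts) == 3 and parts[1] == field:
            yield parts[0], parts[2]
```

No state has to be kept between lines, which is exactly what a one-row-per-character layout would give up.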

Fundamentally, any format we select would be nice in some situations and not so nice in others.

But, changing the format of the file might make it harder for some
users to find the data they seek.  So, I'm not necessarily proposing
any change, but rather pointing out that alternatives exist.

That's the *real* problem. Goodness knows the current format has real problems, and brevity is not among its virtues. (OTOH, the format it replaces was brief to the point of being incomprehensible.) Unfortunately, nobody's come up with a good strategy for migrating to something else.

(Which is why we're stuck with a misspelling in one of the field names.)

And, of course, you're perfectly free to massage the data as suits your own purposes. My Unihan lookup tool for Mac OS X converts it all to XML, for instance.
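For readers who want to try something similar, here is a minimal Python sketch of such a conversion. The element and attribute names ("unihan", "char", "cp", "field", "name") are invented for illustration and are not the schema used by the Mac OS X tool mentioned above; the input layout is assumed to be code point, tab, field name, tab, value, with "#" comment lines.

```python
import xml.etree.ElementTree as ET


def unihan_to_xml(lines):
    """Build an XML string grouping all fields under one element per code point."""
    root = ET.Element("unihan")
    chars = {}  # code point -> its <char> element
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        codepoint, field, value = line.split("\t", 2)
        char = chars.get(codepoint)
        if char is None:
            char = ET.SubElement(root, "char", cp=codepoint)
            chars[codepoint] = char
        ET.SubElement(char, "field", name=field).text = value
    return ET.tostring(root, encoding="unicode")
```

ElementTree escapes characters such as "&" and "<" in values automatically, so the raw field text can be passed through unchanged.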

In spite of its unwieldy size, UNIHAN.TXT is a useful tool and I'm
grateful for its existence.

Thanks.

========
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/
