Re: Unihan.txt and the four dictionary sorting algorithm

Doug Ewell Wed, 21 Apr 2004 09:16:44 -0700

Raymond Mercier <RaymondM at compuserve dot com> wrote:

> The problem of the size of Unihan has nothing at all to do with the
> cost of storage, and everything to do with the functioning of programs
> that might open and read it.
> Since the lines in Unihan are separated by 0x0A alone, not 0x0A0x0D,
> this means that when opened in notepad the lines are not separated...


I have to agree that an ordinary plain-text editor is probably not the
right tool for browsing a 25-megabyte data file, even though I've been
known to do the same with UnicodeData.txt (which is admittedly an order
of magnitude smaller).

Even though Unihan is packaged as plain text, one record per
LF-terminated line (well, sort of), it's really better to think of it
as a data file, intended to be read by software.  Something like a batch
file that calls grep (or another plain-text search tool) would be a
more appropriate way to look things up in it than an editor.
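
As a rough illustration of the "read it with software" approach, here is
a minimal Python sketch that filters records one line at a time, much as
grep would.  It assumes the usual Unihan layout of three tab-separated
fields (code point, field tag, value) with '#' comment lines; the
function name iter_unihan and the sample lines are made up for the
example and are not part of any Unicode tool.

```python
def iter_unihan(lines, field=None, codepoint=None):
    """Yield (codepoint, field, value) tuples from Unihan-style lines.

    `lines` is any iterable of text lines, e.g. an open file object,
    so the whole file never has to be held in memory.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # not a well-formed three-field record
        cp, tag, value = parts
        if field is not None and tag != field:
            continue
        if codepoint is not None and cp != codepoint:
            continue
        yield cp, tag, value


# Illustrative sample lines only, not an excerpt to rely on.
sample = [
    "# comment line\n",
    "U+4E00\tkDefinition\tone; a, an; alone\n",
    "U+4E00\tkMandarin\tYI1\n",
    "U+4E8C\tkDefinition\ttwo; twice\n",
]

definitions = list(iter_unihan(sample, field="kDefinition"))
```

With the real file, something like
open("Unihan.txt", encoding="utf-8") could be passed in place of the
sample list; because the file is read a line at a time, its size matters
much less than it does in an editor.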

And as John said, converting LF to CRLF is quite a simple task -- it can
even be done by your FTP client, while downloading the file -- and
should not be thought of as a deficiency in the current plain-text
format.
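
For anyone who prefers to do the conversion locally rather than in the
FTP client, a small Python sketch along these lines should do.  The
names lf_to_crlf and convert_file are hypothetical, and the sketch
assumes the input uses LF or CRLF line ends only (no bare CR).

```python
def lf_to_crlf(data: bytes) -> bytes:
    """Return `data` with every line ending converted to CR LF."""
    # Normalise any existing CR LF to LF first, so that already-converted
    # lines are not turned into CR CR LF by the second replace.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def convert_file(src_path, dst_path):
    """Copy src_path to dst_path, converting line endings as it goes."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Iterating a binary file yields chunks ending in b"\n", so each
        # chunk can be converted on its own without loading the whole file.
        for line in src:
            dst.write(lf_to_crlf(line))
```

Working in binary mode leaves the UTF-8 content untouched; only the
line-end bytes change.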

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
