Raymond Mercier wrote,

> John Jenkins writes
>
> >>Also, even though the full Unihan database is 25+ Mb in size, given the
> cheapness of disk space nowadays, it's not all *that* big, surely.
> <<
>
> The problem of the size of Unihan has nothing at all to do with the cost of
> storage, and everything to do with the functioning of programs that might
> open and read it.
> Since the lines in Unihan are separated by 0x0A alone, not 0x0A0x0D, this
> means that when opened in notepad the lines are not separated. Notepad does
> have the advantage that the UTF-8 encoding is recognized, and the characters
> are displayed.

UNIHAN.TXT isn't going to get any smaller by itself. The trend indicates that it will just keep on growing, even if VS characters are used with CJK.

The DOS editor chokes on such a large text file, and so does my older hex editor. Thank goodness for BabelPad; otherwise it would've been hard to insert proper (for my system) line breaks into the file.

The tab "character" is used in the file. Arguably, this "character" should never appear in a plain text file; rather, it should be converted to an appropriate number of U+0020 characters by the application on save. Of course, this would make the file even bigger.

Instead of (for instance) "KUA4", why not "KUA⁴"?

Much of the text in UNIHAN.TXT is redundant: the hex code point is repeated, along with a field name, on every line. Putting the hex code point once at the beginning of each line, with one character per line and comma-separated values, would make UNIHAN.TXT *much* smaller. Of course, commas would have to be removed from the definition fields. (Hmmm, maybe definition field commas could be replaced with MIDDLE DOT?)

But changing the format of the file might make it harder for some users to find the data they seek. So I'm not necessarily proposing any change, but rather pointing out that alternatives exist.

In spite of its unwieldy size, UNIHAN.TXT is a useful tool and I'm grateful for its existence.

Best regards,

James Kass
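
For illustration only, here is a rough Python sketch of the one-line-per-character layout mentioned above. It assumes each data line has the shape code point, tab, field name, tab, value, and that comment lines start with "#". The function name collapse_unihan, the "code" header column, and the sample lines are made up for this example; they are not taken from the real file, and this is not a proposal for an official format.

```python
MIDDLE_DOT = "\u00b7"

def collapse_unihan(lines):
    """Turn 'U+XXXX<TAB>field<TAB>value' lines into one CSV row per code point."""
    records = {}  # code point -> {field name: value}, in first-seen order
    fields = []   # field names, in first-seen order
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip anything not in the assumed three-column shape
        code, field, value = parts
        if field not in fields:
            fields.append(field)
        # Commas would clash with the CSV delimiter, so swap them for MIDDLE DOT.
        records.setdefault(code, {})[field] = value.replace(",", MIDDLE_DOT)
    header = ",".join(["code"] + fields)
    rows = [",".join([code] + [rec.get(f, "") for f in fields])
            for code, rec in records.items()]
    return [header] + rows

# Hypothetical sample lines in the assumed input shape.
sample = [
    "# sample data, not copied from the real file",
    "U+4E00\tkDefinition\tone; a, an; alone",
    "U+4E00\tkMandarin\tYI1",
    "U+4E01\tkMandarin\tDING1",
]
for row in collapse_unihan(sample):
    print(row)
```

Each code point then appears once per row and each field name once in the header, which is where the saving would come from. The cost is that a character with few fields still carries an empty slot for every column it lacks.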

