On Apr 30, 2004, at 1:12 AM, [EMAIL PROTECTED] wrote:
> Like UNIHAN.TXT, brevity is not a feature of the following...
> Tabs... In addition to the points Mike made about the tab character having
> different semantics depending on the application/platform, I just don't
> think a control character like tab belongs in a *.TXT file, period.
I'm sorry, but I still don't get this point. To say that a tab doesn't belong in a plain text file makes as much sense to me as saying that a carriage return or a line feed doesn't belong in a plain text file.
> Although UNIHAN.TXT is referred to as a database, it isn't.
Well, I guess we're going to have to figure out what we mean by "database".
> Rather, it's the raw material for a database offered in plain-text form.
True.
> Still, tabs are arguably
OK. It's easy enough to strip them out when they're not wanted. (I'd
rather deal with tabs in a text file which is to be imported into a database
than ASCII quotes.)
> Unix -vs- DOS... I'll stick with the tools I've been using for a quarter century
> and their descendants, thanks just the same. With respect to the idea that a
> text editor is not the proper tool with which to open a *.TXT file, well...
It's not that a text editor isn't the proper tool; it's that some text editors barf when they encounter files that are too big or that don't follow a particular set of line-break conventions.
> Trivial -vs- non-trivial... Once the raw data has been imported into a database,
> it's trivial to massage or manipulate it. It's easy enough to generate a CSV
> file from a database application, and I've done so. But, the only reason that
> I wanted it in CSV in the first place was to make it easy to import the data
> into the database application. This was *not* trivial to do; it involved a lot
> of coding and counting, and a bit of trial-and-error with various field lengths.
> Still, the task managed to keep me quiet for a few days...
Perl is your friend. It would be easy to write a Perl script to do the job of converting the existing file to CSV. That would be better to post than a duplicate of the .txt file, anyway, because it would be longer-lived and smaller.
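For what it's worth, the conversion really is a few lines. Here's a sketch of the idea in Python rather than Perl, for illustration only; the sample lines and the tab-delimited codepoint/field/value layout are assumptions about the file format, not a transcript of the real data:

```python
import csv
import io

# Hypothetical sample in the assumed Unihan.txt layout:
# codepoint <TAB> field name <TAB> value, with '#' comment lines.
sample = (
    "# Unihan database sample\n"
    "U+4E00\tkMandarin\tYI1\n"
    "U+4E00\tkDefinition\tone; a, an; alone\n"
)

def unihan_to_csv(text):
    """Convert tab-delimited Unihan-style lines to CSV, skipping comments."""
    out = io.StringIO()
    writer = csv.writer(out)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        # Split into at most three fields; the value may contain anything
        # except a tab, so limit the split.
        writer.writerow(line.split("\t", 2))
    return out.getvalue()

print(unihan_to_csv(sample))
```

Note that the CSV writer quotes the definition field automatically because it contains a comma, which is exactly the sort of bookkeeping that makes hand-rolled CSV conversion fiddly.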
> With a CSV file, importing data from a text file into a database file simply
> involves a single line command in the interactive mode (once the database
> file structure has been established). This is true for dBASE, FoxPro, and
> related database applications.
But not, apparently, MySQL, which is what we use to maintain the Unihan database.
> But, if you wanted to modify only one field, it's more efficient to skip
> through 71098 records reading and modifying only the appropriate field
> in the record than to go skipping through all 1063127. Easier to program, too.
No, not really. It depends on your programming tools. Personally, I find it much easier to write programs that process the file as-is than would be the case were it to have a more CSV-like syntax. (Or XML, or whatever.)
IOW, there's no way we can maximize ease-of-use for everybody. No matter what format we pick, somebody's going to be inconvenienced by it.
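To give a flavor of what "process the file as-is" means here, a minimal Python sketch; the sample lines and the choice of kMandarin are made up for illustration, assuming the tab-delimited codepoint/field/value layout:

```python
# Scan the file as-is for one field and ignore everything else --
# no CSV or XML layer needed.
sample_lines = [
    "# header comment",
    "U+4E00\tkMandarin\tYI1",
    "U+4E00\tkTotalStrokes\t1",
    "U+4E01\tkMandarin\tDING1",
]

def field_values(lines, key):
    """Yield (codepoint, value) pairs for a single field name."""
    for line in lines:
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t", 2)
        if len(parts) == 3 and parts[1] == key:
            yield parts[0], parts[2]

print(dict(field_values(sample_lines, "kMandarin")))
# → {'U+4E00': 'YI1', 'U+4E01': 'DING1'}
```

Because each line carries its own field name, touching one field means a simple filter over the lines, which is the point being made above.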
> (Suppose you were a purist who wanted to see Stimson's pronunciations using
> the actual characters that Stimson used?)
You can use the next edition of the file; we're switching over. :-)
> John Jenkins wrote,
>> Unfortunately, nobody's come up with a good strategy for migrating to
>> something else.
> I could send you the CSV file for posting, if you think anyone else would
> want it.
In this case, just having the CSV file doesn't really help. The problem with migration is that anything that depends on the current format will break if we switch formats. It's easier IMHO to make available techniques for people to massage the data into alternate forms if they really want it that way.
========
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/

