On Apr 30, 2004, at 1:12 AM, [EMAIL PROTECTED] wrote:
> Like UNIHAN.TXT, brevity is not a feature of the following...
> Tabs... In addition to the points Mike made about the tab character having
> different semantics depending on the application/platform, I just don't
> think a control character like tab belongs in a *.TXT file, period.
I'm sorry, but I still don't get this point. To say that a tab doesn't belong in a plain text file makes as much sense to me as saying that a carriage return or a line feed doesn't belong in a plain text file.
> Although UNIHAN.TXT is referred to as a database, it isn't.
Well, I guess we're going to have to figure out what we mean by "database".
> Rather, it's the raw material for a database offered in plain-text form.
True.
> Still, tabs are arguably
OK. It's easy enough to strip them out when they're not wanted. (I'd
rather deal with tabs in a text file which is to be imported into a database
than ASCII quotes.)
> Unix -vs- DOS... I'll stick with the tools I've been using for a quarter century
> and their descendants, thanks just the same. With respect to the idea that a
> text editor is not the proper tool with which to open a *.TXT file, well...
It's not that a text editor isn't the proper tool; it's that some text editors barf when they encounter files that are too big or that don't follow a particular set of line-break conventions.
> Trivial -vs- non-trivial... Once the raw data has been imported into a database,
> it's trivial to massage or manipulate it. It's easy enough to generate a CSV
> file from a database application, and I've done so. But, the only reason that
> I wanted it in CSV in the first place was to make it easy to import the data
> into the database application. This was *not* trivial to do; it involved a lot
> of coding and counting, and a bit of trial-and-error with various field lengths.
> Still, the task managed to keep me quiet for a few days...
Perl is your friend. It would be easy to write a Perl script to do the job of converting the existing file to CSV. That would be better to post than a duplicate of the .txt file, anyway, because it would be longer-lived and smaller.
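For what it's worth, the conversion really is a few lines. Here's a sketch of the idea in Python rather than Perl, for illustration only; the sample lines and the tab-delimited codepoint/field/value layout are assumptions about the file format, not a transcript of the real data:

```python
import csv
import io

# Hypothetical sample in the assumed Unihan.txt layout:
# codepoint <TAB> field name <TAB> value, with '#' comment lines.
sample = (
    "# Unihan database sample\n"
    "U+4E00\tkMandarin\tYI1\n"
    "U+4E00\tkDefinition\tone; a, an; alone\n"
)

def unihan_to_csv(text):
    """Convert tab-delimited Unihan-style lines to CSV, skipping comments."""
    out = io.StringIO()
    writer = csv.writer(out)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        # Split into at most three fields; the value may contain anything
        # except a tab, so limit the split.
        writer.writerow(line.split("\t", 2))
    return out.getvalue()

print(unihan_to_csv(sample))
```

Note that the CSV writer quotes the definition field automatically because it contains a comma, which is exactly the sort of bookkeeping that makes hand-rolled CSV conversion fiddly.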
> With a CSV file, importing data from a text file into a database file simply
> involves a single line command in the interactive mode (once the database
> file structure has been established). This is true for dBASE, FoxPro, and
> related database applications.
But not, apparently, MySQL, which is what we use to maintain the Unihan database.
> But, if you wanted to modify only one field, it's more efficient to skip
> through 71098 records reading and modifying only the appropriate field
> in the record than to go skipping through all 1063127. Easier to program, too.
No, not really. It depends on your programming tools. Personally, I find it much easier to write programs that process the file as-is than would be the case were it to have a more CSV-like syntax. (Or XML, or whatever.)
IOW, there's no way we can maximize ease-of-use for everybody. No matter what format we pick, somebody's going to be inconvenienced by it.
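To give a flavor of what "process the file as-is" means here, a minimal Python sketch; the sample lines and the choice of kMandarin are made up for illustration, assuming the tab-delimited codepoint/field/value layout:

```python
# Scan the file as-is for one field and ignore everything else --
# no CSV or XML layer needed.
sample_lines = [
    "# header comment",
    "U+4E00\tkMandarin\tYI1",
    "U+4E00\tkTotalStrokes\t1",
    "U+4E01\tkMandarin\tDING1",
]

def field_values(lines, key):
    """Yield (codepoint, value) pairs for a single field name."""
    for line in lines:
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t", 2)
        if len(parts) == 3 and parts[1] == key:
            yield parts[0], parts[2]

print(dict(field_values(sample_lines, "kMandarin")))
# → {'U+4E00': 'YI1', 'U+4E01': 'DING1'}
```

Because each line carries its own field name, touching one field means a simple filter over the lines, which is the point being made above.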
> (Suppose you were a purist who wanted to see Stimson's pronunciations using
> the actual characters that Stimson used?)
You can use the next edition of the file; we're switching over. :-)
> John Jenkins wrote,
>> Unfortunately, nobody's come up with a good strategy for migrating to
>> something else.
> I could send you the CSV file for posting, if you think anyone else would
> want it.
In this case, just having the CSV file doesn't really help. The problem with migration is that anything that depends on the current format will break if we switch formats. It's easier IMHO to make available techniques for people to massage the data into alternate forms if they really want it that way.
========
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/

