On 29/08/18 07:55, Janusz S. Bień via Unicode wrote:
> 
> On Tue, Aug 28 2018 at 9:43 -0700, unicode@unicode.org writes:
> > On August 23, 2011, Asmus Freytag wrote:
> >
> >> On 8/23/2011 7:22 AM, Doug Ewell wrote:
> >>> Of all applications, a word processor or DTP application would want
> >>> to know more about the properties of characters than just whether
> >>> they are RTL. Line breaking, word breaking, and case mapping come to
> >>> mind.
> >>>
> >>> I would think the format used by standard UCD files, or the XML
> >>> equivalent, would be preferable to making one up:
[…]
> >>
> >> The right answer would follow the XML format of the UCD.
> >>
> >> That's the only format that allows all necessary information contained
> >> in one file,
> 
> For me necessary are also comments and crossreferences contained in
> NamesList.txt. Do I understand correctly that only "ISO Comment
> properties" are included in the file?

Even that comment field is obsoleted. But it’s unclear to me what exactly 
it was providing from ISO.

> 
> >> and it would leverage of any effort that users of the
> >> main UCD have made in parsing the XML format.
> >>
> >> An XML format shold also be flexible in that you can add/remove not
> >> just characters, but properties as needed.
> >>
> >> The worst thing do do, other than designing something from scratch,
> >> would be to replicate the UnicodeData.txt layout with its random, but
> >> fixed collection of properties and insanely many semi-colons. None of
> >> the existing UCD txt files carries all the needed data in a single
> >> file.

Curiously, UnicodeData.txt is lacking the header line. That makes it unflexible.
I never wondered why the header line is missing, probably because compared
to the other UCD files, the file looks really odd without a file header showing 
at least the version number and datestamp. It’s like the file was made up for 
dumb parsers unable to handle comment delimiters, and never to be upgraded
to do so.

But I like the format, and that’s why at some point I submitted feedback asking 
for an extension. Indeed we could use more information than what is yielded 
by UCD \setminus NamesList.txt (that we may not parse, as per file header).
Given NamesList.txt / Code Charts comments are kept minimal by design, 
one couldn’t simply pop them into XML or whatever, as the result would be 
disappointing and call for completion in the aftermath. Yet another task 
competing with CLDR survey. Reviewing CLDR data is IMO top priority.
There are many flaws to be fixed in many languages including in English.
A lot of useful digest charts are extracted from XML there, and we really 
need to go through the data and correct the many many errors, please.

Unlike XML, human readability of CSV may not be immediate. Yes you simply 
cannot always count the semicolons and remember the property name from 
the value position if it isn’t obvious by itself. But we use spreadsheets. At 
least 
some people do. That’s where the magic works. 

Looking up things in a spreadsheet is a good way to find out about wrong 
property values. Looks like handling files only programmatically gets
everything screwed up.


Marcel

Reply via email to