Kenneth Whistler wrote:
> 
> Theo Veenker wrote:
> 
> > I'd like to make a few remarks about the UCD files.
> 
> First of all, while I'd like to thank Theo for going to the
> trouble of checking the data files so carefully, and coming
> up with some genuine errors in the data, I have a couple of
> comments for people who are checking and reporting errors.
> 
> 1. The preferred mechanism for reporting errors in data files
>    or other errors in the standard is to make use of the
>    reporting form on the Unicode website, rather than broadcasting
>    email to the open list, in hope that someone will notice and
>    take action. Please use:
> 
>    http://www.unicode.org/unicode/reporting.html
> 
>    (which you can also find by following the "Contact Us" link
>    on the home page)

I apologize. Next time I will use the reporting form. Maybe somebody
should make a note of this in the readme that accompanies the data files.

[snip]

> >  o  UnicodeData-3.2.0.txt still uses this notation:
> >       1234;<Blah, First>;Lo;0;L;;;;;N;;;;;
> >       5678;<Blah, Last>;Lo;0;L;;;;;N;;;;;
> >     instead of
> >       1234..5678;<Blah, First>..<Blah, Last>;Lo;0;L;;;;;N;;;;;
> >     Since all other UCD files use the latter notation why not change this
> >     one too? IMHO backward compatibility with existing UCD file parsers
> >     shouldn't be an issue in this particular case.
> 
> It is an issue for some parsers. (And a burden on me, personally,
> to fix them, since some of them are used in utilities which maintain
> other parts of the Unicode Standard, or the Unicode Collation Algorithm.)
> And we don't know how many other old parsers would blow up if we
> just changed it. The UTC decided to leave it alone for now -- although
> it might modify it in the future.

I know it would break current parsers (actually, the new parser implementation
would be a tiny bit simpler), and I won't lose any sleep if it is kept the way
it is, BUT:

In UnicodeCharacterDatabase.html, where the UCD File Format is described, it
says: "Files in the UCD use the following format, unless otherwise specified."
What is the point of giving a detailed description of the format if the
phrase "unless otherwise specified" is required? It makes the description
rather useless. Because of this clause, which as far as I can tell is only
required because of the historic notation used in the main UCD data file,
I cannot assume that the format of a particular UCD file won't be modified
across releases. Well... I can assume so, but not rely on it. Anyway, I hope
the UTC will one day decide to use exactly the same format for all UCD files
instead of more or less the same one.
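
To make the special case concrete, here is a rough Python sketch of what a
reader of the main data file has to do today: pair up each "<..., First>"
line with the "<..., Last>" line that follows it. The function name
parse_unicode_data and the returned tuple layout are made up for this
example, and the error handling is only an assumption about what a careful
parser might want; it is not taken from any existing UCD tool.

```python
def parse_unicode_data(lines):
    """Yield (first, last, name, other_fields) for each entry.

    Ordinary lines yield first == last. A "<X, First>" line followed by a
    "<X, Last>" line is merged into a single range entry named X.
    """
    pending = None  # (first code point, range name) while awaiting the Last line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(";")
        code = int(fields[0], 16)
        name = fields[1]

        if name.startswith("<") and name.endswith(", First>"):
            if pending is not None:
                raise ValueError("two First lines in a row at %04X" % code)
            pending = (code, name[1:-len(", First>")])
            continue

        if name.startswith("<") and name.endswith(", Last>"):
            range_name = name[1:-len(", Last>")]
            if pending is None or pending[1] != range_name:
                raise ValueError("unmatched Last line at %04X" % code)
            yield pending[0], code, range_name, fields[2:]
            pending = None
            continue

        if pending is not None:
            raise ValueError("First line not followed by Last at %04X" % code)
        yield code, code, name, fields[2:]

    if pending is not None:
        raise ValueError("file ends inside a First/Last pair")
```

With the "1234..5678" notation used in the other files, the pending state
and the three error checks around it would go away: each line could be
parsed on its own by splitting the first field on "..".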

Best regards,
Theo
