Kenneth Whistler wrote:
>
> Theo Veenker wrote:
>
> > I'd like to make a few remarks about the UCD files.
>
> First of all, while I'd like to thank Theo for going to the
> trouble of checking the data files so carefully, and coming
> up with some genuine errors in the data, I have a couple of
> comments for people who are checking and reporting errors.
>
> 1. The preferred mechanism for reporting errors in data files
> or other errors in the standard is to make use of the
> reporting form on the Unicode website, rather than broadcasting
> email to the open list, in hope that someone will notice and
> take action. Please use:
>
> http://www.unicode.org/unicode/reporting.html
>
> (which you can also find by following the "Contact Us" link
> on the home page)

I apologize. Next time I will use the reporting form. Maybe somebody
should make a note of this in the readme that accompanies the data files.

[snip]

> > o UnicodeData-3.2.0.txt still uses this notation:
> >   1234;<Blah, First>;Lo;0;L;;;;;N;;;;;
> >   5678;<Blah, Last>;Lo;0;L;;;;;N;;;;;
> > instead of
> >   1234..5678;<Blah, First>..<Blah, Last>;Lo;0;L;;;;;N;;;;;
> > Since all other UCD files use the latter notation why not change this
> > one too? IMHO backward compatibility with existing UCD file parsers
> > shouldn't be an issue in this particular case.
>
> It is an issue for some parsers. (And a burden on me, personally,
> to fix them, since some of them are used in utilities which maintain
> other parts of the Unicode Standard, or the Unicode Collation Algorithm.)
> And we don't know how many other old parsers would blow up if we
> just changed it. The UTC decided to leave it alone for now -- although
> it might modify it in the future.

I know it would break current parsers (actually, the new parser
implementation would be a tiny bit simpler), and I won't lose sleep if it
is kept the way it is, BUT:

In UnicodeCharacterDatabase.html, where the UCD File Format is described,
it says:

  "Files in the UCD use the following format, unless otherwise specified."

What is the point of giving a detailed description of the format if the
phrase "unless otherwise specified" is required? It makes the description
rather useless. Because of this clause, which as far as I can tell is
only required because of the historic notation used in the main UCD data
file, I cannot assume that the format of a particular UCD file isn't
modified across releases. Well... I can assume so, but not rely on it.

Anyway, I hope the UTC will one day decide to use exactly the same format
for all UCD files instead of more or less the same.

Best regards,
Theo
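
For illustration only, here is a minimal sketch of how a parser might fold
the historic First/Last pair notation quoted above into a single range
record, so that downstream code sees the same shape as the "1234..5678"
notation. The function name parse_unicode_data and the tuple layout are
assumptions made for this example; they do not come from any UCD tool or
from the UTC.

```python
def parse_unicode_data(lines):
    """Yield (first, last, name, other_fields) from UnicodeData-style lines.

    A "<Name, First>" line followed by a "<Name, Last>" line is merged into
    one record covering the whole range. Hypothetical helper, not a UCD API.
    """
    pending = None  # (first code point, base name) of an open range
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split(";")
        code = int(fields[0], 16)
        name = fields[1]

        if name.startswith("<") and name.endswith(", First>"):
            if pending is not None:
                raise ValueError(f"nested range start: {line!r}")
            pending = (code, name[1:-len(", First>")])
            continue

        if name.startswith("<") and name.endswith(", Last>"):
            base = name[1:-len(", Last>")]
            if pending is None or pending[1] != base:
                raise ValueError(f"unmatched range end: {line!r}")
            yield (pending[0], code, base, fields[2:])
            pending = None
            continue

        if pending is not None:
            raise ValueError(f"range start without matching end: {line!r}")
        yield (code, code, name, fields[2:])

    if pending is not None:
        raise ValueError("input ended inside an open range")


# The placeholder records from the message above.
sample = [
    "1234;<Blah, First>;Lo;0;L;;;;;N;;;;;",
    "5678;<Blah, Last>;Lo;0;L;;;;;N;;;;;",
]
records = list(parse_unicode_data(sample))
```

With the two sample lines, records holds a single entry spanning 0x1234
through 0x5678 with the name "Blah". The extra state (pending) is the
special-casing that a uniform range notation would make unnecessary.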

