Theo Venker wrote:

> I'd like to make a few remarks about the UCD files.

First of all, while I'd like to thank Theo for going to the trouble of
checking the data files so carefully, and coming up with some genuine
errors in the data, I have a couple of comments for people who are
checking and reporting errors.

1. The preferred mechanism for reporting errors in data files or other
   errors in the standard is to make use of the reporting form on the
   Unicode website, rather than broadcasting email to the open list in
   the hope that someone will notice and take action. Please use:

   http://www.unicode.org/unicode/reporting.html

   (which you can also find by following the "Contact Us" link on the
   home page)

2. There is a reason why the UTC announces an extended BETA period
   before the release of a Unicode version, and encourages people to
   report errors in the data files during that period, *before* the
   actual release is finalized. Errors reported then can be fixed
   before the release. But at this point, the Unicode 3.2.0 data files
   are finalized, warts and all. Reporting an error immediately
   *after* a release is actually one of the worst times to do so,
   since that is the maximal time before the next release, meaning
   that the chance of an error report being lost or forgotten before
   the next opportunity to fix it is greatest.

So in the future, please do take the BETA period as your best
opportunity for getting errors in the data files fixed in a timely
manner.

> The following things I ran into when checking out the 3.2.0 release:
>
> o In PropertyValueAliases-3.2.0.txt line 79:
>     ccc; 202; ATBL ; Attached_Below_Left
>   whereas in UnicodeData-3.2.0.html I read:
>     200: Below left attached
>     202: Below attached
>   What is the correct value for "attached below left", 200 or 202?

200. The error is in PropertyValueAliases-3.2.0.txt, where the entry
should be for Attached_Below, rather than Attached_Below_Left:

    ccc; 202; ATB ; Attached_Below

> o In SpecialCasing-3.2.0.txt lines 234 and 235 are missing the closing
>   semicolon. This problem also appeared in 3.1.1.

Noted. To be fixed.

> o Typo in UnicodeCharacterDatabase-3.2.0.html:
>   "DerivedNormalizationProperties", should be "DerivedNormalizationProps".

Noted. To be fixed.

> Minor points that I find a bit annoying:
>
> o Many of the UCD files have a comment header with lines longer than 80
>   characters. Viewing these files using the page utility on an 80-column
>   terminal window gives ugly output due to the forced line wrapping.

Noted. This could be corrected, but is not a high priority. There are
many other lines which exceed 80 characters in the data, too.

> o All UCD files except CaseFolding-3.2.0.txt and SpecialCasing-3.2.0.txt
>   *separate* columns by semicolons. For the two exceptions the semicolon
>   *terminates* a column, why not keep it the same for all UCD files?

This is an issue for the UTC to decide.

> o UnicodeData-3.2.0.txt still uses this notation:
>     1234;<Blah, First>;Lo;0;L;;;;;N;;;;;
>     5678;<Blah, Last>;Lo;0;L;;;;;N;;;;;
>   instead of
>     1234..5678;<Blah, First>..<Blah, Last>;Lo;0;L;;;;;N;;;;;
>   Since all other UCD files use the latter notation why not change this
>   one too? IMHO backward compatibility with existing UCD file parsers
>   shouldn't be an issue in this particular case.

It is an issue for some parsers. (And a burden on me, personally, to
fix them, since some of them are used in utilities which maintain other
parts of the Unicode Standard, or the Unicode Collation Algorithm.) And
we don't know how many other old parsers would blow up if we just
changed it. The UTC decided to leave it alone for now -- although it
might modify it in the future.

--Ken
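
For readers who need to handle the paired First/Last notation discussed
above, here is a minimal sketch in Python of how a parser might collapse
each pair into a single range. The function name iter_ranges and the
tuple layout it yields are illustrative assumptions, not part of any
Unicode tooling, and the sample lines are the placeholder entries quoted
in the message, not real data.

```python
FIRST_SUFFIX = ", First>"
LAST_SUFFIX = ", Last>"


def iter_ranges(lines):
    """Yield (first, last, name, other_fields) for each logical entry.

    Ordinary lines yield first == last. A "<X, First>" line followed by
    a matching "<X, Last>" line is collapsed into one range entry.
    (Hypothetical helper; the layout of the result is an assumption.)
    """
    pending = None  # (start code point, label) of an open First line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split(";")
        code = int(fields[0], 16)
        name = fields[1]

        if name.startswith("<") and name.endswith(FIRST_SUFFIX):
            pending = (code, name[1:-len(FIRST_SUFFIX)])
            continue

        if name.startswith("<") and name.endswith(LAST_SUFFIX):
            label = name[1:-len(LAST_SUFFIX)]
            if pending is None or pending[1] != label:
                raise ValueError("Last without matching First: " + line)
            yield pending[0], code, label, fields[2:]
            pending = None
            continue

        if pending is not None:
            raise ValueError("First not followed by Last: " + line)
        yield code, code, name, fields[2:]

    if pending is not None:
        raise ValueError("unterminated First entry")


# Placeholder lines taken from the message above, not real UCD data.
sample = [
    "1234;<Blah, First>;Lo;0;L;;;;;N;;;;;",
    "5678;<Blah, Last>;Lo;0;L;;;;;N;;;;;",
]
ranges = list(iter_ranges(sample))
```

A parser written this way treats the pair as one entry internally, so a
later switch to the 1234..5678 style would only require changing how the
first field is read, not the code that consumes the ranges.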

