On 11/9/2011 9:30 AM, Asmus Freytag wrote:
On 11/9/2011 1:18 AM, "Martin J. Dürst" wrote:
I tried to find something like a normative description of the default bidi class of unassigned code points.

In UTR #9, it says (http://www.unicode.org/reports/tr9/tr9-23.html#Bidirectional_Character_Types):

Unassigned characters are given strong types in the algorithm. This is an explicit exception to the general Unicode conformance requirements with respect to unassigned characters. As characters become assigned in the future, these bidirectional types may change. For assignments to character types, see DerivedBidiClass.txt [DerivedBIDI] in the [UCD].

That *is* the normative description of the default Bidi_Class for unassigned code points.


The DerivedBidiClass.txt file, as far as I understand, is mainly a condensation of bidi classes into character ranges (rather than giving them for each codepoint independently as in UnicodeData.txt). I.e. it can at any moment be derived automatically from UnicodeData.txt, and is as such not normative.

Because the default values for Bidi_Class are complicated, and cannot be derived simply by parsing the values for *assigned* characters in UnicodeData.txt, the listing of the default values for Bidi_Class in DerivedBidiClass.txt has to be taken as normative for those values.


Why is it then that the default class assignments are only given in this file (unless I have overlooked something)? And why is it that they are only given in comments?

Because the UnicodeData.txt file has no header (for historical compatibility).

Because, like the practice of putting <style> in HTML inside comments, these things (@missing) are in comments to protect older parsers.

And to go beyond what Asmus said there, the "@missing" hack was created as a syntax for specifying *the* default values for properties where it makes sense that they have a *single* default value. It doesn't work for specifying multiple default values differing by code point range. Hence the addition of the @missing comment in DerivedBidiClass.txt (or its potential addition to PropertyValueAliases.txt) doesn't suffice for the entire definition.

I'm trying to create a program that takes all the bidi assignments (including default ones) and creates the data part of a bidi algorithm implementation, but I don't feel confident to code against stuff that's in comments. Any advice?

Use the values in the comments.

Remember that this is not *code* with comments that get stripped out before compiling. These are text data files for parsing. The fact that people are already parsing the @missing statements indicates that those are being treated normatively now. You could say the same thing for the titles, dates, and copyright notices on these data files: they aren't "optional" content to be ignored.
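
To make "use the values in the comments" concrete, here is a rough Python sketch of one way a data generator might combine the three sources discussed above: explicit data lines, per-range defaults, and the single @missing fallback. It is only an illustration under stated assumptions. The names parse_bidi_file, bidi_class, SAMPLE and RANGE_DEFAULTS are made up for this example, the sample text is a tiny stand-in and not the real contents of DerivedBidiClass.txt, and the range table is assumed to be transcribed by hand from the file's header comments, since @missing cannot express it.

```python
import re

# Assumed line shapes (illustrative, modeled on the UCD data file layout):
#   "# @missing: 0000..10FFFF; Left_To_Right"   single file-wide default
#   "0041..005A    ; L # comment"               explicit range assignment
#   "05BE          ; R # comment"               explicit single code point
MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-Fa-f]+)\.\.([0-9A-Fa-f]+)\s*;\s*(\S+)")
DATA_RE = re.compile(
    r"^([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\w+)")


def parse_bidi_file(text):
    """Return (assigned_ranges, file_default) parsed from the file text."""
    assigned, file_default = [], None
    for raw in text.splitlines():
        line = raw.strip()
        m = MISSING_RE.match(line)
        if m:
            # The @missing line lives in a comment but is treated as data.
            file_default = m.group(3)
            continue
        m = DATA_RE.match(line)
        if m:
            start = int(m.group(1), 16)
            end = int(m.group(2), 16) if m.group(2) else start
            assigned.append((start, end, m.group(3)))
    return assigned, file_default


def bidi_class(cp, assigned, range_defaults, file_default):
    """Explicit assignment first, then per-range default, then @missing."""
    for table in (assigned, range_defaults):
        for start, end, value in table:
            if start <= cp <= end:
                return value
    return file_default


# Illustrative stand-in for the data file, not its real contents.
SAMPLE = """\
# @missing: 0000..10FFFF; Left_To_Right
0041..005A    ; L # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
05BE          ; R # Pd       HEBREW PUNCTUATION MAQAF
"""

# Hypothetical hand-transcribed per-range defaults (one entry shown).
RANGE_DEFAULTS = [(0x0590, 0x05FF, "R")]

assigned, file_default = parse_bidi_file(SAMPLE)
```

With these inputs, bidi_class(0x05FF, assigned, RANGE_DEFAULTS, file_default) falls through to the per-range table, while a code point outside every listed range gets the @missing value. Note that the @missing line uses the long value name and the data lines the short alias, so a real generator would also need to normalize the two via PropertyValueAliases.txt.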

Is it possible that this could be fixed (making it more normative, and putting it in a form that's easier to process automatically)?

This is part of a very large problem for creating a more complete and machine-parseable means of accessing *all* of the Unicode character property data, including data about the *status* of properties and their default values. It won't, IMO, be fixed by individual file fixes one at a time, although incremental improvement can be helpful.

Note that the UCD in XML was created to address this problem in part, but it still cannot answer many questions about the status of properties, their full derivations, their interactions, and their functions.

--Ken



