At 11:11 AM 8/5/2004, Peter Kirk wrote:
In TUS 4.0 Section 5.3, p.111, the following is stated of default ignorable code points:

These characters are also ignored except with respect to specific, defined processes; for example, ZERO WIDTH NON-JOINER is ignored in collation. ... For more information, see Section 5.20, Default Ignorable Code Points.


But in Section 5.20, although there is a lot about rendering default ignorable code points, there is no further information about any other processing of them. The implication of that section seems to be that these characters are intended to be ignored in rendering but not in other processes such as collation.

You are correct in that the default (!) behavior of these characters in any processing depends on the purpose of that process. For most Unicode-defined processes (even where the definitions themselves are 'default' definitions), the behavior of all characters is in fact defined by the combination of their relevant Unicode properties and the rules of the published algorithm.


Rendering is special in that we do *not* provide a general algorithm, so if we intend a specific default behavior, it needs to be stated in the text.

Is this or the summary in Section 5.3 in fact to be taken as the intention of the standard? Has the summary simply not been updated for consistency with the fuller details? Or has the fuller description been unintentionally restricted to rendering?

The summary is correct.

Is it in fact the intention that all default ignorable characters must always be ignored in collation? Or is it possible to tailor collation not to ignore them? The collation algorithm seems to suggest the latter, in that there seems to be no mention of these characters being obligatorily ignored - although I presume they have zero weight by default (in DUCET).

Correct. By the way, these characters are called Default_Ignorable, and not Must_Ignore, for a reason. You are always free to tailor things so that they are not ignored. Even rendering has such a tailoring: the 'show controls' mode, which makes some or all of these characters visible.
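To make the default-versus-tailored distinction concrete, here is a rough sketch. It is not the Unicode Collation Algorithm: plain code point order stands in for real collation weights, `sort_key` and its `ignore_default_ignorables` flag are hypothetical names, and `SAMPLE_IGNORABLES` is only a small illustrative subset of the Default_Ignorable code points.

```python
# Illustrative subset only; the authoritative list is the
# Default_Ignorable_Code_Point property in the Unicode Character Database.
SAMPLE_IGNORABLES = {0x00AD, 0x200C, 0x200D} | set(range(0xFE00, 0xFE10))


def sort_key(text, ignore_default_ignorables=True):
    """Hypothetical sort key; code point order stands in for collation weights."""
    if ignore_default_ignorables:
        # Default behavior: the sample ignorables contribute nothing to the key.
        return tuple(ord(c) for c in text if ord(c) not in SAMPLE_IGNORABLES)
    # Tailored behavior: every character, including ZWNJ, is significant.
    return tuple(ord(c) for c in text)


plain = "ab"
with_zwnj = "a\u200cb"  # ZERO WIDTH NON-JOINER between the letters

same_by_default = sort_key(plain) == sort_key(with_zwnj)
same_when_tailored = sort_key(plain, False) == sort_key(with_zwnj, False)
print(same_by_default, same_when_tailored)
```

With the default setting the two strings get identical keys; with the tailoring switched on they no longer do.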


This has some quite serious implications for processing of texts including ZW(N)J, variation selectors etc.

How these characters are treated is important, but there isn't as much of an issue here as you make it out to be.


A./

Other relevant sources of text about these are

UCD.html:

For programmatic determination of default-ignorable code points. New characters that should be ignored in processing (unless explicitly supported) will be assigned in these ranges, permitting programs to correctly handle the default behavior of such characters when not otherwise supported. For more information, see UAX #29: Text Boundaries <http://www.unicode.org/reports/tr29/>.

with no mention of default-ignorable in the text of that UAX. (I've just filed a web report on that.)
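The range-based determination described in the UCD.html text could look roughly like the sketch below. The table `SAMPLE_RANGES` and the function `is_default_ignorable` are hypothetical, and the table is deliberately partial; a real implementation would load the full Default_Ignorable_Code_Point data from the UCD rather than hard-code ranges.

```python
from bisect import bisect_right

# Hypothetical, partial table of (first, last) code point ranges, sorted by
# start. The complete data lives in the Unicode Character Database.
SAMPLE_RANGES = [
    (0x00AD, 0x00AD),    # SOFT HYPHEN
    (0x200B, 0x200F),    # zero width space, ZWNJ, ZWJ, LRM, RLM
    (0x2060, 0x206F),    # WORD JOINER and following format characters
    (0xFE00, 0xFE0F),    # VARIATION SELECTOR-1 .. VARIATION SELECTOR-16
    (0xE0000, 0xE0FFF),  # tag characters and supplementary variation selectors
]
_STARTS = [first for first, _ in SAMPLE_RANGES]


def is_default_ignorable(ch):
    """Return True if ch falls inside one of the sample ranges."""
    cp = ord(ch)
    # Find the last range whose start is <= cp, then check its end.
    i = bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= SAMPLE_RANGES[i][1]


print(is_default_ignorable("\u200c"))  # ZWNJ
print(is_default_ignorable("a"))
```

Because the test is by range rather than by individual assigned character, a code point that is reserved today but later assigned inside one of these ranges is already handled, which is the point the UCD.html text makes.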





