On 3/6/2012 8:27 PM, fantasai wrote:
> Unicode has a Pc category into which it assigns various low lines:
>
> _    U+005F     LOW LINE
> ‿    U+203F     UNDERTIE
> ⁀    U+2040     CHARACTER TIE
> ⁔    U+2054     INVERTED UNDERTIE

Those 4 are the actual connectors. The concept arose because of the
peculiar behavior of U+005F LOW LINE, which although classed as
"punctuation", in majority usage doesn't actually serve to delimit things,
but rather is a way of tying them together, particularly for identifier syntaxes.
For decades now, programmers have been using it as a replacement
for SPACE which allows for visual separation of "words" without the
segmentation effects.

The various TIEs are traditional editing marks which have a comparable
effect. Although they don't occur in regular orthographies and are not
widely used in any syntax, if they *do* occur in digital text, the default
behavior you would want for them would be to keep elements together,
rather than separate them.

> ︳    U+FE33     PRESENTATION FORM FOR VERTICAL LOW LINE
> ︴    U+FE34     PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
> ﹍    U+FE4D     DASHED LOW LINE
> ﹎    U+FE4E     CENTRELINE LOW LINE
> ﹏    U+FE4F     WAVY LOW LINE
> ＿    U+FF3F     FULLWIDTH LOW LINE



Those 6 are completely different. The first 5 are compatibility dreck coming
out of CNS, and their original intent (most likely) was to represent various
styles of underlining of Chinese text. They cannot be meaningfully used for
that now -- you would do that instead with text styles -- but they are encoded
for roundtrip conversion to CNS. U+FF3F is just a fullwidth variant from
Shift-JIS (etc.).

The reason they are gc=Pc is entirely a normalization consistency issue, because
they all have compatibility decompositions to U+005F LOW LINE.
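
That claim can be spot-checked. The following is a minimal sketch, assuming
Python's standard unicodedata module (which tracks whatever Unicode version
the interpreter ships with) is an acceptable stand-in for the UCD. The names
COMPAT_LOW_LINES and describe are made up for illustration.

```python
import unicodedata

# The six compatibility characters listed above (code points from this thread).
COMPAT_LOW_LINES = [0xFE33, 0xFE34, 0xFE4D, 0xFE4E, 0xFE4F, 0xFF3F]

def describe(cp):
    """Return (General_Category, NFKC form) for one code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.normalize("NFKC", ch)

for cp in COMPAT_LOW_LINES:
    gc, nfkc = describe(cp)
    # Each line shows the category and what the character folds to under NFKC.
    print(f"U+{cp:04X}  gc={gc}  NFKC={nfkc!r}")
```

Keeping the category of these characters equal to that of U+005F means
compatibility normalization does not change the category of the text.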

> However, the overlines, which are almost exactly the same thing, are categorized as Po:
>
> ‾    U+203E     OVERLINE

The overline isn't typically used to tie anything together. This is essentially
just a spacing clone of the combining overline.

> ﹉    U+FE49     DASHED OVERLINE
> ﹊    U+FE4A     CENTRELINE OVERLINE
> ﹋    U+FE4B     WAVY OVERLINE
> ﹌    U+FE4C     DOUBLE WAVY OVERLINE

And those 4 are more CNS compatibility dreck, again representing badly encoded
characters for what should actually be done with text styles.
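
For comparison, here is a similar sketch for the overlines, again assuming
Python's unicodedata module reflects the UCD closely enough. OVERLINES and
overline_info are hypothetical names. The point is that none of these
decompose to U+005F: the CNS forms map to U+203E, which in turn maps to
SPACE plus the combining overline.

```python
import unicodedata

# U+203E plus the four CNS compatibility overlines from this thread.
OVERLINES = [0x203E, 0xFE49, 0xFE4A, 0xFE4B, 0xFE4C]

def overline_info(cp):
    """Return (General_Category, raw decomposition mapping) for one code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.decomposition(ch)

for cp in OVERLINES:
    gc, decomp = overline_info(cp)
    print(f"U+{cp:04X}  gc={gc}  decomposition={decomp}")
```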


> Is this a bug or a feature? :) Shouldn't they be Pc?

It is a feature. And no, they should not be gc=Pc.

The main algorithmic consequences of gc=Pc are that U+005F (and the kin it
drags along) are Word_Break=ExtendNumLet, which keeps them from
defining default word boundaries, and gc=Pc is included in the
derivation of ID_Continue (and XID_Continue), which keeps them in
identifiers.
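
The identifier half of that can be illustrated with a rough sketch. It
assumes Python's str.isidentifier(), whose identifier syntax is built on
XID_Start and XID_Continue, is a reasonable proxy for the default identifier
definition. It says nothing about Word_Break, which the standard library does
not expose. The helper name stays_in_identifier is made up for illustration.

```python
def stays_in_identifier(cp):
    """True if the character can sit between two letters of an identifier."""
    return f"a{chr(cp)}b".isidentifier()

# Connectors (gc=Pc) versus the spacing overline (gc=Po).
for cp in (0x005F, 0x203F, 0x2040, 0x203E):
    print(f"U+{cp:04X}  stays in identifier: {stays_in_identifier(cp)}")
```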

I don't know of any particular reason why anyone would want to keep the
spacing overline either inside default word segments or inside identifiers.

--Ken



