Re: Connector Punctuation and Overlines

Ken Whistler Wed, 07 Mar 2012 11:50:35 -0800

On 3/6/2012 8:27 PM, fantasai wrote:

Unicode has a Pc category into which it assigns various low lines:


_    U+005F     LOW LINE
‿    U+203F     UNDERTIE
⁀    U+2040     CHARACTER TIE
⁔    U+2054     INVERTED UNDERTIE


Those 4 are the actual connectors. The concept arose because of the
peculiar behavior of U+005F LOW LINE, which although classed as
"punctuation", in majority usage doesn't actually serve to delimit things,
but rather is a way of tying them together, particularly for identifier
syntaxes.

For decades now, programmers have been using it as a replacement
for SPACE which allows for visual separation of "words" without the
segmentation effects.

The various TIEs are traditional editing marks which have a comparable
effect. Although they don't occur in regular orthographies and are not
widely used in any syntax, if they *do* occur in digital text, the default
behavior you would want for them would be to keep elements together,
rather than separate them.

︳    U+FE33     PRESENTATION FORM FOR VERTICAL LOW LINE
︴    U+FE34     PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
﹍    U+FE4D     DASHED LOW LINE
﹎    U+FE4E     CENTRELINE LOW LINE
﹏    U+FE4F     WAVY LOW LINE
＿    U+FF3F     FULLWIDTH LOW LINE

Those 6 are completely different. The first 5 are compatibility dreck
coming out of CNS, and their original intent (most likely) was to represent
various styles of underlining of Chinese text. They cannot be meaningfully
used for that now -- you would do that instead with text styles -- but they
are encoded for roundtrip conversion to CNS. U+FF3F is just a fullwidth
variant from Shift-JIS (etc.)

The reason they are gc=Pc is entirely a normalization consistency issue,
because they all have compatibility decompositions to U+005F LOW LINE.
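
The decomposition claim can be checked against whatever version of the Unicode Character Database ships with a given Python build. The following is only a sketch using the standard `unicodedata` module; the helper name `describe` and the list name are made up for illustration.

```python
import unicodedata

# The six compatibility low lines listed above, by code point.
COMPAT_LOW_LINES = [0xFE33, 0xFE34, 0xFE4D, 0xFE4E, 0xFE4F, 0xFF3F]

def describe(cp):
    """Return (General_Category, raw decomposition field, NFKC form) for a code point."""
    ch = chr(cp)
    return (
        unicodedata.category(ch),
        unicodedata.decomposition(ch),
        unicodedata.normalize("NFKC", ch),
    )

for cp in COMPAT_LOW_LINES:
    gc, decomp, nfkc = describe(cp)
    # Each NFKC form is expected to be the single character U+005F.
    print(f"U+{cp:04X}  gc={gc}  decomposition={decomp!r}  NFKC=U+{ord(nfkc):04X}")
```

If the six did not share a General_Category with U+005F, normalizing an identifier could change which characters are allowed in it, which is the consistency issue described above.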

However, the overlines that are almost exactly the same thing, are categorized as Po:
‾    U+203E     OVERLINE

The overline isn't typically used to tie anything together. This is
essentially just a spacing clone of the combining overline.
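
The "spacing clone" relationship shows up in the compatibility decomposition: U+203E maps to SPACE followed by U+0305 COMBINING OVERLINE. Here is a small sketch, again assuming Python's standard `unicodedata` module, with `compat_form` as an illustrative helper name.

```python
import unicodedata

OVERLINE = "\u203e"            # OVERLINE
COMBINING_OVERLINE = "\u0305"  # COMBINING OVERLINE

def compat_form(ch):
    """Full compatibility decomposition (NFKD) of a single character."""
    return unicodedata.normalize("NFKD", ch)

# General_Category of the spacing overline.
print(unicodedata.category(OVERLINE))
# Code points of its compatibility decomposition.
print([f"U+{ord(c):04X}" for c in compat_form(OVERLINE)])
```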

﹉    U+FE49     DASHED OVERLINE
﹊    U+FE4A     CENTRELINE OVERLINE
﹋    U+FE4B     WAVY OVERLINE
﹌    U+FE4C     DOUBLE WAVY OVERLINE

And those 4 are more CNS compatibility dreck, again representing badly
encoded characters for what should actually be done with text styles.


Is this a bug or a feature? :) Shouldn't they be Pc?


It is a feature. And no, they should not be gc=Pc.

The main algorithmic consequences of gc=Pc are that U+005F (and the kin it
drags along) are Word_Break=ExtendNumLet, which keeps them from
defining default word boundaries, and gc=Pc is included in the
derivation of ID_Continue (and XID_Continue), which keeps them in
identifiers.
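
One way to see the identifier consequence in practice is through a language whose identifier syntax is built on XID_Start and XID_Continue. The sketch below uses Python's `str.isidentifier()` as a stand-in. It tests Python's own identifier rules, not the ID_Continue property directly, and the helper name `allowed_inside_identifier` is made up for illustration. The Word_Break property is not exposed by the Python standard library, so it is not shown here.

```python
import unicodedata

def allowed_inside_identifier(ch):
    """True if `ch` can sit between two letters in a Python identifier."""
    return f"a{ch}b".isidentifier()

# LOW LINE and UNDERTIE are gc=Pc; OVERLINE is gc=Po.
for ch in ["\u005f", "\u203f", "\u203e"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), allowed_inside_identifier(ch))
```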

I don't know of any particular reason why anyone would want to keep the
spacing overline either inside default word segments or inside identifiers.

--Ken
