Re: Regarding canonical combining class value for U+0F76 and similar characters (Unicode 6.2.0)

Richard Wordingham Sat, 18 May 2013 02:51:51 -0700

On Sat, 18 May 2013 09:18:37 +0200
Philippe Verdy <[email protected]> wrote:


> 2013/5/18 Richard Wordingham <[email protected]>

> > It [CGJ - RW] cannot be discarded when collation is used for
> > sorting.
 
> it can, after the initial normalization steps (well it also blocks
> recognizing digrams as a single "letter" in some alphabet, but the
> use of CGJ for that purpose is deprecated in favor of using joiner
> controls before another starter character).

Where is this deprecation?  Have you been confused by the following
paragraph in the Unicode Collation Algorithm:

"Sequences of characters which include the combining grapheme joiner or
other completely ignorable characters may also be given tailored
weights. Thus the sequence <c, CGJ, h> could be weighted completely
differently from either the contraction "ch" or the sequence "c"
followed by "h" without the contraction. However, this application of
CGJ is not recommended, because it would produce effects much different
than the normal usage above, which is to simply interrupt contractions."

While a soft hyphen might be a reasonable alternative in the Welsh place
name 'Llangollen', to ensure it is sorted after 'Llanberis', I'd be
reluctant to use it to ensure 'Bangor' is sorted after 'Bala'.  Are you
suggesting I should use ZWNJ in 'Llangollen' and 'Bangor' when there
is a risk of them being sorted according to Welsh rules?  (The relevant
Welsh rule is g < ng < h, but Welsh has many words in which the
visual sequence 'ng' is not a 'letter'.)
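
To make the 'Llangollen' and 'Bangor' cases concrete, here is a minimal
sketch in Python of a toy Welsh-style sort key in which CGJ does nothing
except interrupt a contraction.  It is not the UCA and not any real
library's API: the name toy_welsh_key, the rank table and the greedy
two-letter matching are assumptions made purely for illustration, and
only a single (primary-like) level is modelled.

```python
CGJ = "\u034f"  # U+034F COMBINING GRAPHEME JOINER

# Welsh alphabet order, including the digraph letters (note g < ng < h).
WELSH_ORDER = [
    "a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h",
    "i", "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s",
    "t", "th", "u", "w", "y",
]
RANK = {letter: i for i, letter in enumerate(WELSH_ORDER, start=1)}


def toy_welsh_key(word):
    """Return a list of ranks; hypothetical helper, single level only."""
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        if text[i] == CGJ:
            # Completely ignorable: contributes no weight, but because it
            # sits between the two letters it prevents the digraph match.
            i += 1
            continue
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            key.append(RANK[pair])  # contraction, e.g. 'ng' or 'll'
            i += 2
        else:
            key.append(RANK[text[i]])  # KeyError for letters not in the table
            i += 1
    return key


plain = sorted(["Llangollen", "Llanberis", "Bangor", "Bala"],
               key=toy_welsh_key)
marked = sorted(["Llan" + CGJ + "gollen", "Llanberis",
                 "Ban" + CGJ + "gor", "Bala"],
                key=toy_welsh_key)
```

Under these toy rules the unmarked spellings put 'Bangor' before 'Bala'
and 'Llangollen' before 'Llanberis', because 'ng' is taken as one
letter; with CGJ between 'n' and 'g' the expected order is restored.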

> Once digrams have been isolated, the CGJ gets discarded for collation
> purposes as well (it becomes ignorable).

That's close to the truth; it needs to be retained when identity
strength collation is used.
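
A rough sketch in Python of why the CGJ cannot simply be thrown away at
identical strength.  The function toy_key is hypothetical, and using
code points as stand-ins for collation weights is an assumption made
only to keep the example short; the point is the extra final level
built from the NFD form of the string, in which CGJ survives.

```python
import unicodedata

CGJ = "\u034f"  # U+034F COMBINING GRAPHEME JOINER


def toy_key(text, identical=False):
    """Toy sort key: one weighted level, plus an optional identical level."""
    # Weighted level: CGJ is completely ignorable, so it adds nothing here.
    weights = tuple(ord(ch) for ch in text if ch != CGJ)
    if not identical:
        return (weights,)
    # Identical level: a tie-break on the NFD code points, CGJ included.
    nfd = unicodedata.normalize("NFD", text)
    return (weights, tuple(ord(ch) for ch in nfd))


same_without = toy_key("Bangor") == toy_key("Ban" + CGJ + "gor")
same_with = (toy_key("Bangor", identical=True)
             == toy_key("Ban" + CGJ + "gor", identical=True))
```

Here same_without is True and same_with is False: the two spellings tie
on the weighted level and are only told apart by the final level.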

> You'll still find exceptions in some tailorings, but tailorings can
> in fact do what they want ; <snip>

Now, tailoring using CGJ is 'not recommended'.

Richard.
