On Sat, 18 May 2013 09:18:37 +0200 Philippe Verdy <[email protected]> wrote:
> 2013/5/18 Richard Wordingham <[email protected]>
> > It [CGJ - RW] cannot be discarded when collation is used for
> > sorting.

> it can, after the initial normalization steps (well it also blocks
> recognizing digrams as a single "letter" in some alphabet, but the
> use of CGJ for that purpose is deprecated in favor of using joiner
> controls before another starter character).

Where is this deprecation?  Have you been confused by the following
paragraph in the Unicode Collation Algorithm:

"Sequences of characters which include the combining grapheme joiner or
other completely ignorable characters may also be given tailored
weights. Thus the sequence <c, CGJ, h> could be weighted completely
differently from either the contraction "ch" or the sequence "c"
followed by "h" without the contraction. However, this application of
CGJ is not recommended, because it would produce effects much different
than the normal usage above, which is to simply interrupt contractions."

While a soft hyphen might be a reasonable alternative in the Welsh
place name 'Llangollen', to ensure it is sorted after 'Llanberis', I'd
be reluctant to use it to ensure 'Bangor' is sorted after 'Bala'.  Are
you suggesting I should use ZWNJ in 'Llangollen' and 'Bangor' when
there is a risk of them being sorted according to Welsh rules?  (The
relevant Welsh rule is g < ng < h, but Welsh has many words in which
the visual sequence 'ng' is not a 'letter'.)

> Once digrams have been isolated, the CGJ gets discarded for collation
> purpose as well (it becomes ignorable).

That's close to the truth; it needs to be retained when identity
strength collation is used.

> You'll still find exceptions in some tailorings, but tailorings can
> in fact do what they want ;
<snip>

Now, tailoring using CGJ is 'not recommended'.

Richard.
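
P.S. To make the Welsh example concrete, here is a toy sketch of how a
CGJ between 'n' and 'g' interrupts the 'ng' contraction.  It is not the
UCA or any real tailoring: the letter table, the single-level key and
the name toy_welsh_key are assumptions made for illustration only, and
it simply drops CGJ, so it does not model identity strength.

```python
CGJ = "\u034f"  # U+034F COMBINING GRAPHEME JOINER

# Welsh alphabet order, with digraphs as single letters (note g < ng < h).
WELSH_LETTERS = [
    "a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h",
    "i", "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s",
    "t", "th", "u", "w", "y",
]
RANK = {letter: i for i, letter in enumerate(WELSH_LETTERS)}
UNKNOWN = len(WELSH_LETTERS)  # anything else sorts after the alphabet


def toy_welsh_key(word):
    """Hypothetical one-level sort key: digraphs contract unless split by CGJ."""
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        if text[i] == CGJ:
            # Ignorable: contributes no weight, but it has already
            # prevented the two letters around it from being adjacent.
            i += 1
            continue
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            key.append(RANK[pair])  # contraction, e.g. 'ng' or 'll'
            i += 2
        else:
            key.append(RANK.get(text[i], UNKNOWN))
            i += 1
    return key


plain = sorted(["Bala", "Bangor"], key=toy_welsh_key)
marked = sorted(["Bala", "Ban" + CGJ + "gor"], key=toy_welsh_key)
```

Without the CGJ, 'Bangor' is read as b, a, ng, o, r and sorts before
'Bala' because ng < l; with the CGJ it is read as b, a, n, g, o, r and
sorts after 'Bala' because l < n.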

