On Sun, 4 Oct 2015 15:44:32 +0200 Mark Davis ☕️ <[email protected]> wrote:
> When I use http://unicode.org/cldr/utility/breaks.jsp, it does show
> the sequence 𑒏�𑒺 as just two grapheme clusters.

But that's the sequence <U+1148F, U+FFFD, U+114BA>, which has no lone surrogates at all! (I had to look at the raw email file to be sure of what the text was - my email client displays U+FFFD and malformed alleged UTF-8 the same.) I believe I would have a good chance of repairing that by replacing U+FFFD by nothing.

It's not even certain that the substitution to replace U+FFFD would work. With a more fully supported script in LibreOffice, I would have to switch 'CTL diacritic' matching off and hope that substitution replaced the shortest match. That currently works for replacing one Thai consonant by another. To systematically replace a non-spacing Thai character by another, I have to resort to 'regular expression' search and replace. I must hope that they never choose to interpret the search as matching extended grapheme clusters.

Do all Unicode character properties extend to all codepoints? If not, how does one tell which do and which don't?

If the Unicode segmentation algorithms do apply to sequences of codepoints, as opposed to merely to Unicode strings, then indeed <U+D805, U+114BA> is a legacy grapheme cluster. It's an extremely unhelpful one!

> In #29 we are specifically not concerned about ill-formed text (or
> other degenerate cases). I suppose it would be possible to handle
> isolated surrogates in different way (eg always breaking) if it
> represented a common problem, but someone would have to make a very
> good case for that.

I suppose the argument will go that by using rare scripts or obsolete characters, one deserves all the problems that one gets. The only widely used script where one is likely to encounter lone surrogates is CJK, and they are less of a problem there.

Ideally, one shouldn't get isolated surrogates, but when one does, the mechanisms intended to prevent them occurring can make dealing with them difficult.

Richard.
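
For illustration only, the two sequences discussed above can be inspected with a short Python sketch. It is not part of the original exchange. The helper names has_lone_surrogate and strip_replacement are hypothetical, and the sketch assumes a Python 3 build whose unicodedata tables include Tirhuta. It lists the General_Category of each code point in <U+1148F, U+FFFD, U+114BA>, checks for surrogate code points, and removes U+FFFD with a plain code-point replacement rather than a grapheme-cluster match.

```python
import unicodedata

# The text as it arrived in the email: <U+1148F, U+FFFD, U+114BA>.
received = "\U0001148F\uFFFD\U000114BA"

# The sequence it was presumably meant to illustrate: <U+D805, U+114BA>.
# Python's str can hold a lone surrogate code point, so this is representable.
intended = "\ud805\U000114BA"


def has_lone_surrogate(text: str) -> bool:
    """Return True if any code point in text lies in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def strip_replacement(text: str) -> str:
    """Drop every U+FFFD REPLACEMENT CHARACTER, code point by code point."""
    return text.replace("\uFFFD", "")


# Print code point and General_Category only, never the characters
# themselves, so the output does not depend on the terminal encoding.
for label, text in (("received", received), ("intended", intended)):
    for ch in text:
        print(f"{label}: U+{ord(ch):04X} gc={unicodedata.category(ch)}")
    print(f"{label}: lone surrogate present? {has_lone_surrogate(text)}")

repaired = strip_replacement(received)
print("repaired:", " ".join(f"U+{ord(ch):04X}" for ch in repaired))
```

At least General_Category is assigned to surrogate code points (the value is Cs), which is why unicodedata.category accepts U+D805 here. Whether every other property is defined for them is the open question raised above, and this sketch does not settle it.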

