2015-10-04 21:30 GMT+02:00 Richard Wordingham <richard.wording...@ntlworld.com>:
> On Sun, 4 Oct 2015 15:44:32 +0200
> Mark Davis ☕️ <m...@macchiato.com> wrote:
>
> > When I use http://unicode.org/cldr/utility/breaks.jsp, it does show
> > the sequence 𑒏�𑒺 as just two grapheme clusters.
>
> But that's the sequence <U+1148F, U+FFFD, U+114BA>, which has no lone
> surrogates at all! (I had to look at the raw email file to be sure of
> what the text was - my email client displays U+FFFD and malformed
> alleged UTF-8 the same.)

Mark only said that this is what the tool showed, i.e. the lone surrogate was treated as U+FFFD.

However, although 𑒏�𑒺 (using U+FFFD substitution) gives 2 grapheme clusters, I would prefer a solution that gives 3 grapheme clusters, as if the lone surrogate were a line-break control. The third character (a combining mark placed just after the lone surrogate) would then not combine with it, but would be handled as a defective combining sequence with no starter before it.
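
To make the difference between the two outcomes concrete, here is a rough Python sketch. It is not the algorithm used by breaks.jsp and not a full UAX #29 implementation: it assumes a deliberately simplified model in which any mark (general category M*) attaches to the preceding code point unless that code point is a control, a line or paragraph separator, or a surrogate. The name `count_clusters` and the set `BREAK_AROUND` are made up for this illustration.

```python
import unicodedata

# Assumption for this sketch: code points in these general categories never
# take a following mark into their cluster (control-like behaviour).
BREAK_AROUND = {"Cc", "Cs", "Zl", "Zp"}


def count_clusters(text):
    """Count clusters under the simplified model described above."""
    count = 0
    prev = None
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        after_control = (
            prev is not None and unicodedata.category(prev) in BREAK_AROUND
        )
        # A new cluster starts at the first code point, at any non-mark,
        # and at a mark that follows a control-like code point.
        if prev is None or not is_mark or after_control:
            count += 1
        prev = ch
    return count


KA = "\U0001148F"    # U+1148F, the first character of the sequence
SIGN = "\U000114BA"  # U+114BA, the combining third character

substituted = KA + "\uFFFD" + SIGN  # lone surrogate replaced by U+FFFD
raw = KA + "\uD800" + SIGN          # lone surrogate kept (U+D800 as a stand-in)

print(count_clusters(substituted))  # 2: the mark attaches to U+FFFD
print(count_clusters(raw))          # 3: the mark is left without a starter
```

Under this model the U+FFFD substitution yields 2 clusters, while keeping the lone surrogate and treating it like a control yields the 3 clusters I would prefer. U+D800 is only an arbitrary example here, since the original surrogate value is not known.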