If you know of combining marks whose scx values should include Thai, please let us know.
Also, by "Latin is not a complex script" I mean it in the narrow sense I stated, where the goal is the ordering of characters. That is, nobody would normally wonder whether 0.5 when expressed by a sequence with U+2044 FRACTION SLASH should be written as the sequence <2, U+2044 FRACTION SLASH, 1>! There will always be some edge cases, but the target is Tibetan or Myanmar, not Latin or Cyrillic.

Mark

On Thu, Jan 12, 2017 at 10:26 PM, Richard Wordingham <[email protected]> wrote:

> On Thu, 12 Jan 2017 21:03:29 +0100
> Mark Davis ☕️ <[email protected]> wrote:
>
> > That was just an example off the top of my head of the format for
> > using with regex; I don't pretend that it is vetted. Latin is not a
> > complex script, so it was only an illustration. However, it was just
> > brain freeze on my part to not also include Inherited or ZWJ. A more
> > serious effort would look at some of the issues from
> > http://unicode.org/reports/tr29/, for example. On the other hand, CGJ
> > is not a problem: it is Mn
> > <http://unicode.org/cldr/utility/character.jsp?a=034F>. And (say)
> > U+064B ARABIC FATHATAN has scx=Arabic,Syriac, so wouldn't be included.
>
> Ah, I had not appreciated that sc=Inherited does not imply
> scx=Inherited. Using Script_Extensions to document the international
> combining characters that are used, for example, with Thai bases could
> have all sorts of undesirable knock-on effects.
>
> Richard.
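For readers who want to try the checks discussed in this thread, here is a minimal Python sketch. It is only an illustration, not a vetted implementation: the standard library does not expose Script_Extensions, so `SCX_SAMPLE` is a small hand-written excerpt and `mark_allowed` is a hypothetical helper name. The values for U+064B are the ones quoted above; the entries for U+034F and U+0E31 are assumptions that should be checked against ScriptExtensions.txt and Scripts.txt.

```python
import unicodedata

# Hand-written excerpt for illustration only. Real values should be read
# from the Unicode Character Database (ScriptExtensions.txt, Scripts.txt).
SCX_SAMPLE = {
    "\u064B": {"Arabic", "Syriac"},  # ARABIC FATHATAN, scx as quoted in the thread
    "\u034F": {"Inherited"},         # COMBINING GRAPHEME JOINER (assumed scx)
    "\u0E31": {"Thai"},              # THAI CHARACTER MAI HAN-AKAT (assumed scx)
}


def is_mark(ch):
    """True if the General_Category of ch is Mn, Mc, or Me."""
    return unicodedata.category(ch).startswith("M")


def mark_allowed(ch, script):
    """True if ch is a mark whose sample scx set contains script or Inherited."""
    scx = SCX_SAMPLE.get(ch, set())
    return is_mark(ch) and (script in scx or "Inherited" in scx)
```

With this sample table, `mark_allowed("\u064B", "Thai")` is false because the scx set of U+064B lists only Arabic and Syriac, while a mark whose scx is Inherited is accepted for any script. That is the reason a pattern of this kind needs to name Inherited explicitly.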

