That was just an off-the-top-of-my-head example of the format to use with regex; I don't pretend that it is vetted. Latin is not a complex script, so it was only an illustration. However, it was just brain freeze on my part not to include Inherited and ZWJ as well. A more serious effort would look at some of the issues from http://unicode.org/reports/tr29/, for example.

On the other hand, CGJ is not a problem: it is Mn <http://unicode.org/cldr/utility/character.jsp?a=034F>. And (say) U+064B ARABIC FATHATAN has scx=Arabic,Syriac, so it wouldn't be included.
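For anyone who wants to poke at the cases in this thread, here is a rough, unvetted sketch of the "letter followed by attaching marks" idea with ZWJ and combining marks let in. Python's standard library does not expose Script or Script_Extensions, so the sketch assumes a crude stand-in: character names beginning with "LATIN " play the role of scx=Latn, and names beginning with "COMBINING " play the role of Inherited. The function names are made up for illustration, and a real implementation would test the actual properties.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER (gc=Cf), allowed explicitly


def is_latin_letter(ch: str) -> bool:
    # Stand-in for [[:scx=Latn:]&[:L:]]: a letter whose name starts with "LATIN ".
    # ASSUMPTION: the name prefix only approximates the script property.
    return (unicodedata.category(ch).startswith("L")
            and unicodedata.name(ch, "").startswith("LATIN "))


def is_attaching(ch: str) -> bool:
    # Stand-in for "Latin or Inherited nonspacing mark, or ZWJ".
    # ASSUMPTION: "COMBINING " names approximate Inherited marks; this is
    # cruder than Script_Extensions (it lets in some script-specific marks).
    if ch == ZWJ:
        return True
    return (unicodedata.category(ch) == "Mn"
            and unicodedata.name(ch, "").startswith("COMBINING "))


def is_latin_word(text: str) -> bool:
    # word = (latinLetter attaching*)*
    seen_letter = False
    for ch in text:
        if is_latin_letter(ch):
            seen_letter = True
        elif is_attaching(ch) and seen_letter:
            continue  # a mark or ZWJ following a letter or another mark
        else:
            return False
    return True


if __name__ == "__main__":
    for sample in ["Ca\u200Desar", "Llan\u034Fgollen", "s\u0304", "a\u064B"]:
        print(ascii(sample), is_latin_word(sample))
```

With this stand-in, the ZWJ, CGJ and combining-macron cases are accepted, while a Latin letter followed by U+064B ARABIC FATHATAN is rejected because the mark's name does not start with "COMBINING ".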
Mark

On Thu, Jan 12, 2017 at 7:42 PM, Richard Wordingham <[email protected]> wrote:

> On Thu, 12 Jan 2017 14:12:09 +0100
> Mark Davis ☕️ <[email protected]> wrote:
>
> > I agree that comprehension is a goal. I'd imagine using a BNF regex,
> > like the following. This is simple, since I'm just doing Latin, but
> > you can see what I mean.
> >
> > word = base* ;
> > base = (latinLetter latinMn*) ;
> > latinLetter = [[:scx=Latn:]&[:L:]] ;
> > latinMn = [[:scx=Latn:][:scx=Common:]&[:Mn:]] ;
> >
> > which turns into the single regex expression:
> >
> > ([[:scx=Latn:]&[:L:]][[:scx=Latn:][:scx=Common:]&[:Mn:]]*)*
>
> Ouch! That's alarmingly wrong. You've excluded the likes of
> English 'Caesar' with ZWJ, Welsh 'Llan͏gollen' with CGJ (the word
> doesn't contain the letter 'ng') and the ISO-sanctioned transliteration
> of Thai SO SUEA as 's̄'. Fixinɡ it isn't easy. At least, I assume
> Arabic harakat don't attach to Latin letters in your conception of
> Latin script text, so replacing 'scx=Common' by 'sc=Inherited' doesn't
> work well.
>
> The problem may be conflicting requirements on the Script_Extensions
> property.
>
> Richard.

