If you look back at http://www.unicode.org/reports/tr29/tr29-27.html#GB8a (2015), the rule was simply not to break sequences of RI characters.
We changed that in http://www.unicode.org/reports/tr29/tr29-29.html#GB12 (2016) to only group pairs. Unfortunately, the (informative) table http://www.unicode.org/reports/tr29/tr29-31.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters was not updated after 2015 to keep pace with the changes in rules. So that is still to do.

Mark <https://twitter.com/mark_e_davis>

On Mon, Dec 18, 2017 at 10:59 AM, Andre Schappo via Unicode <unicode@unicode.org> wrote:

> Ah! That explains why
>
> pcre2grep -u '^\X{1}$'
>
> matches with
>
> 🇬🇧
> 🇩🇪🇫🇷
> 🇨🇳🇮🇹🇲🇾
> 🇪🇸🇦🇺🇷🇺🇳🇱🇯🇵
>
> ...etc...
>
> André Schappo
>
> On 17 Dec 2017, at 17:17, Mark Davis ☕️ via Unicode <unicode@unicode.org> wrote:
>
> Thanks for the feedback. You're correct about this; that is a holdover
> from an earlier version of the document when there was a more basic
> treatment of RI sequences.
>
> There is already an action to modify these. There is a placeholder review
> note about that just above
>
> http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters
>
> (scroll up just a bit).
>
> Mark
>
> Mark <https://twitter.com/mark_e_davis>
>
> On Sun, Dec 17, 2017 at 4:16 PM, David P. Kendal via Unicode <unicode@unicode.org> wrote:
>
>> Hi,
>>
>> It’s possible I’m missing something, but the formal grammar/regular
>> expression given for extended grapheme clusters appears to have a bug
>> in it.
>> <https://unicode.org/reports/tr29/#Table_Combining_Char_Sequences_and_Grapheme_Clusters>
>>
>> The bug is here:
>>
>> RI-Sequence := Regional_Indicator+
>>
>> If the formal grammar is intended to exactly match the rules given in
>> the “Grapheme Cluster Boundary Rules” section below it as-is, then
>> this should be
>>
>> RI-Sequence := Regional_Indicator Regional_Indicator
>>
>> since as given it would cause any number of RI characters to coalesce
>> into a single grapheme cluster, instead of pairs of characters. That
>> is, the text U+1F1EC U+1F1E7 U+1F1EA U+1F1FA would represent one
>> grapheme cluster instead of the correct two.
>>
>> --
>> dpk (David P. Kendal) · Nassauische Str. 36, 10717 DE · http://dpk.io/
>> we do these things not because they are easy, +49 159 03847809
>> but because we thought they were going to be easy
>> — ‘The Programmers’ Credo’, Maciej Cegłowski
>
> 🌏 🌍 🌎
> André Schappo
> https://schappo.blogspot.co.uk
> https://twitter.com/andreschappo
> https://weibo.com/andreschappo
> https://groups.google.com/forum/#!forum/computer-science-curriculum-internationalization
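
For anyone who wants to see the difference between the two grammars concretely, here is a minimal Python sketch. It is not taken from UAX #29 or from any library: the names `is_ri` and `split_ri` are made up for illustration, and every character that is not a Regional Indicator is simply treated as its own cluster, which is a deliberate simplification of the real boundary rules.

```python
def is_ri(ch):
    """True for Regional Indicator symbols, U+1F1E6..U+1F1FF."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def split_ri(text, pairs_only=True):
    """Group Regional Indicator characters into clusters.

    pairs_only=True  models  RI-Sequence := Regional_Indicator Regional_Indicator
    pairs_only=False models  RI-Sequence := Regional_Indicator+
    Non-RI characters become one cluster each (an assumption made
    to keep this sketch short, not what UAX #29 specifies).
    """
    clusters = []
    run = ""  # RI characters collected for the cluster being built
    for ch in text:
        if is_ri(ch):
            run += ch
            if pairs_only and len(run) == 2:
                clusters.append(run)  # close the cluster after a pair
                run = ""
        else:
            if run:
                clusters.append(run)  # a non-RI character ends the run
                run = ""
            clusters.append(ch)
    if run:
        clusters.append(run)  # trailing unpaired RI or unterminated run
    return clusters


# The four code points from the original report.
sample = "\U0001F1EC\U0001F1E7\U0001F1EA\U0001F1FA"
print(len(split_ri(sample, pairs_only=True)))   # 2
print(len(split_ri(sample, pairs_only=False)))  # 1
```

With the pair grammar the sample splits into two clusters; with `Regional_Indicator+` it collapses into one, which is the behaviour the `pcre2grep` output above shows.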