I have two points about this excerpt from Synopsis 6:

The :m (or :ignoremark) modifier scopes exactly like :ignorecase except that it ignores marks (accents and such) instead of case. It is equivalent to taking each grapheme (in both target and pattern), converting both to NFD (maximally decomposed) and then comparing the two base characters (Unicode non-mark characters) while ignoring any trailing mark characters. The mark characters are ignored only for the purpose of determining the truth of the assertion; the actual text matched includes all ignored characters, including any that follow the final base character.
The :mm (or :samemark) variant may be used on a substitution to change the substituted string to the same mark/accent pattern as the matched string. Mark info is carried across on a character by character basis. If the right string is longer than the left one, the remaining characters are substituted without any modification. (Note that NFD/NFC distinctions are usually immaterial, since Perl encapsulates that in grapheme mode.) Under :sigspace the preceding rules are applied word by word.

In perl5, one must manually run two matches on all data.

First: I notice that ignoring marks (and such) and ignoring case are both different-strength effects of the Unicode Collation Algorithm. What about simply allowing folks to specify which of the four (or more, I guess) levels of UCA equivalence/folding they want?

Second: I'm not altogether reassured by the parenned bit about NFD/NFC being immaterial. That's because I've been pretty annoyed lately in perl5 at having to manually run *everything* through a double match every time, and I can't avoid it by prenormalizing. I'm just hoping that perl6 will handle this better. It's usually like this:

    NFD($data) =~ $pattern
    NFC($data) =~ $pattern

Or if you know your data is NFD:

    $data =~ $pattern
    NFC($data) =~ $pattern

Or if you know your data is NFC:

    NFD($data) =~ $pattern
    $data =~ $pattern

That's because even if your data is in a known state with respect to normalization, if your pattern admits both NFD and NFC forms, which it would if read in from a file etc., then you have to run them both. For example, suppose you read a pattern whose characters are specified indirectly/symbolically:

    $pattern = q<\xE9>;        # LATIN SMALL LETTER E WITH ACUTE

or

    $pattern = q<e\x{301}>;    # "e" + COMBINING ACUTE ACCENT

It would be ok if those were literal characters, because you could just NFD the patterns and be done. But they're not.
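To make the failure concrete, here is a minimal sketch of the double match. It is written in Python rather than Perl, purely for illustration, and uses the standard `re` and `unicodedata` modules. The helper name `matches_either_form` and the sample word are hypothetical, and the Perl escape `\x{301}` is spelled `\u0301` because that is how Python's regex syntax writes it.

```python
import re
import unicodedata


def matches_either_form(pattern: str, data: str) -> bool:
    """Hypothetical helper: try the pattern against both NFD and NFC data."""
    nfd = unicodedata.normalize("NFD", data)
    nfc = unicodedata.normalize("NFC", data)
    return bool(re.search(pattern, nfd) or re.search(pattern, nfc))


# Both patterns name their characters symbolically (raw strings, so the
# regex engine sees the escapes). Normalizing the pattern's source text
# would therefore not change what it matches.
precomposed = r"caf\xE9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = r"cafe\u0301"   # "e" + COMBINING ACUTE ACCENT

data_nfc = "caf\u00e9"       # data already in NFC
data_nfd = "cafe\u0301"      # data already in NFD

# A single match fails whenever pattern and data disagree on the form.
single_miss_1 = re.search(precomposed, data_nfd) is None
single_miss_2 = re.search(decomposed, data_nfc) is None

# The double match finds the text in every combination.
double_hits = [
    matches_either_form(p, d)
    for p in (precomposed, decomposed)
    for d in (data_nfc, data_nfd)
]
```

Knowing the normalization state of the data only saves one of the two `normalize` calls; the second match is still needed as long as the pattern's form is unknown.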
So in order for $data =~ $pattern to work properly with both, you really have to do a guaranteed double-convert/match each time. This is rather unfortunate, to put it mildly. What you really want is a pattern compile flag that imposes canonical matching, and does this correctly even when faced with named characters, etc. My read of S06 suggests that this will not be an issue.

I do wonder what happens when you want to match just the combining part. Does that fail in grapheme mode? It shouldn't: you *can* have standalones. But then we're back to partial matches in the middle of things, which is something that plagues us with full Unicode case-folding. This is the

    "\N{LATIN SMALL LIGATURE FFI}" =~ /(f)(f)/i

problem, amongst others. Seems that you are going to get into the same dilemma if you allow matching partial graphemes in grapheme mode.

Hm.

--tom