Re: Annoyances from Implementation of Canonical Equivalence (was: Pure Regular Expression Engines and Literal Clusters)
> Date: Tue, 15 Oct 2019 00:23:59 +0100
> From: Richard Wordingham via Unicode
>
> > I'm well aware of the official position.  However, when we attempted
> > to implement it unconditionally in Emacs, some people objected, and
> > brought up good reasons.  You can, of course, elect to disregard this
> > experience, and instead learn it from your own.
>
> Is there a good record of these complaints anywhere?

You could look up these discussions:

  https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00189.html
  https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html

> (It would occasionally be useful to have an easily issued command
> like 'delete preceding NFD codepoint'.)

I agree.  Emacs commands that delete characters backward (usually
invoked by the Backspace key) do that automatically, if the text
before cursor was produced by composing several codepoints.

> I did mention above that occasionally one needs to know what
> codepoints were used and in what order.

Sure.  There's an Emacs command (C-u C-x =) which shows that
information for the text at a given position.
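Outside Emacs, the difference between deleting a trailing codepoint and
deleting a whole composed character can be sketched in a few lines of
Python (an illustration only; the helper name is hypothetical):

```python
import unicodedata

def delete_preceding_codepoint(text: str) -> str:
    """Remove just the last codepoint, e.g. one combining mark of a
    decomposed sequence, rather than the whole composed character."""
    return text[:-1]

# "e" typed as base letter plus combining acute (decomposed order):
s = "caf" + "e\u0301"
assert delete_preceding_codepoint(s) == "cafe"   # only U+0301 removed

# The same text stored precomposed has no trailing mark to strip:
t = unicodedata.normalize("NFC", s)              # "caf" + U+00E9
assert delete_preceding_codepoint(t) == "caf"    # whole letter removed
```

Whether Backspace should behave like the first case or the second is
exactly the user-visible difference the thread is discussing.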
Annoyances from Implementation of Canonical Equivalence (was: Pure Regular Expression Engines and Literal Clusters)
On Mon, 14 Oct 2019 21:41:19 +0300
Eli Zaretskii via Unicode wrote:

> > Date: Mon, 14 Oct 2019 19:29:39 +0100
> > From: Richard Wordingham via Unicode
> >
> > The official position is that text that is canonically
> > equivalent is the same.  There are problem areas where traditional
> > modes of expression require that canonically equivalent text be
> > treated differently.  For these, it is useful to have tools that
> > treat them differently.  However, the normal presumption should be
> > that canonically equivalent text is the same.
>
> I'm well aware of the official position.  However, when we attempted
> to implement it unconditionally in Emacs, some people objected, and
> brought up good reasons.  You can, of course, elect to disregard this
> experience, and instead learn it from your own.

Is there a good record of these complaints anywhere?  It is annoying
when a text entry function does not keep the text as one enters it,
but it would be interesting to know what the other complaints were.
(It would occasionally be useful to have an easily issued command like
'delete preceding NFD codepoint'.)

I did mention above that occasionally one needs to know what
codepoints were used and in what order.

Richard.
Re: Pure Regular Expression Engines and Literal Clusters
> Date: Mon, 14 Oct 2019 19:29:39 +0100
> From: Richard Wordingham via Unicode
>
> On Mon, 14 Oct 2019 10:05:49 +0300
> Eli Zaretskii via Unicode wrote:
>
> > I think these are two separate issues: whether search should
> > normalize (a.k.a. performs character folding) should be a user
> > option.  You are talking only about canonical equivalence, but
> > there's also compatibility decomposition, so, for example,
> > searching for "1" should perhaps match ¹ and ①.
>
> HERETIC!
>
> The official position is that text that is canonically
> equivalent is the same.  There are problem areas where traditional
> modes of expression require that canonically equivalent text be
> treated differently.  For these, it is useful to have tools that
> treat them differently.  However, the normal presumption should be
> that canonically equivalent text is the same.

I'm well aware of the official position.  However, when we attempted
to implement it unconditionally in Emacs, some people objected, and
brought up good reasons.  You can, of course, elect to disregard this
experience, and instead learn it from your own.

> The party line seems to be that most searching should actually be
> done using a 'collation', which brings with it different levels of
> 'folding'.  In multilingual use, a collation used for searching
> should be quite different to one used for sorting.

Alas, collation is locale- and language-dependent.  And, if you are
going to use your search in a multilingual application (Emacs is such
an application), you will have a hard time even knowing which
tailoring to apply for each potential match, because you will need to
support the use case of working with text that mixes languages.

Leaving the conundrum to the user to resolve seems to be a good
compromise, and might actually teach us something that is useful for
future modifications of the "party line".
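The locale-dependence can be made concrete with a toy sketch (pure
Python; the folding tables are invented stand-ins for real CLDR search
tailorings): a German-style search folding treats "ä" as "a" plus an
accent, while a Swedish-style one keeps it a distinct letter.

```python
import unicodedata

# Toy per-language "search folding" tables; a real implementation
# would take these from CLDR/UCA collation tailorings.
FOLD = {
    "de": {"ä": "a", "ö": "o", "ü": "u"},   # accents fold away
    "sv": {},                               # ä/ö are distinct letters
}

def search_key(text: str, lang: str) -> str:
    """Case-insensitive, tailoring-dependent key for matching."""
    nfc = unicodedata.normalize("NFC", text.casefold())
    return "".join(FOLD[lang].get(ch, ch) for ch in nfc)

assert search_key("Bär", "de") == search_key("bar", "de")   # match
assert search_key("Bär", "sv") != search_key("bar", "sv")   # no match
```

The same query against the same text matches under one tailoring and
not the other, which is the problem for a multilingual buffer: there
is no single right table to pick.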
Re: Pure Regular Expression Engines and Literal Clusters
On Mon, 14 Oct 2019 10:05:49 +0300
Eli Zaretskii via Unicode wrote:

> > Date: Mon, 14 Oct 2019 01:10:45 +0100
> > From: Richard Wordingham via Unicode
> >
> > They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*,
> > and were expecting normalisation (even to NFC) to be a possible
> > cure.  They had begun to realise that converting expressions to
> > match all or none of a set of canonical equivalents was hard; the
> > issue of non-contiguous matches wasn't mentioned.
>
> I think these are two separate issues: whether search should normalize
> (a.k.a. performs character folding) should be a user option.  You are
> talking only about canonical equivalence, but there's also
> compatibility decomposition, so, for example, searching for "1" should
> perhaps match ¹ and ①.

HERETIC!

The official position is that text that is canonically equivalent is
the same.  There are problem areas where traditional modes of
expression require that canonically equivalent text be treated
differently.  For these, it is useful to have tools that treat them
differently.  However, the normal presumption should be that
canonically equivalent text is the same.

The party line seems to be that most searching should actually be done
using a 'collation', which brings with it different levels of
'folding'.  In multilingual use, a collation used for searching should
be quite different to one used for sorting.

Now, there is a case for being able to switch off normalisation and
canonical equivalence generally, e.g. when dealing with ISO 10646 text
instead of Unicode text.  This of course still leaves the question of
what character classes defined by Unicode properties then mean.

If one converts the regular expression so that what it matches is
closed under canonical equivalence, then visibly normalising the
searched text becomes irrelevant.  For working with Unicode traces, I
actually do both.

I convert the text to NFD but report matches in terms of the original
code point sequence; working this way simplifies the conversion of the
regular expression, which I do as part of its compilation.  For
traces, it seems only natural to treat precomposed characters as
syntactic sugar for the NFD decompositions.  (They have no place in
the formal theory of traces.)  However, I go further and convert the
decomposed text to NFD.  (Recall that conversion to NFD can change the
stored order of combining marks.)

One of the simplifications I get is that straight runs of text in the
regular expression then match in the middle just by converting that
run and the searched strings.

For the concatenation of expressions A and B, once I am looking at the
possible interleaving of two traces, I am dealing with NFA states of
the form states(A) × {1..254} × states(B), so that for an element
(a, n, b), a corresponds to starts of words with a match in A, b
corresponds to starts of _words_ with a match in B, and n is the ccc
of the last character used to advance to b.  The element n blocks
non-starters that can't belong to a word matching A.  If I didn't
(internally) convert the searched text to NFD, the element n would
have to be a set of blocked canonical combining classes, changing the
number of possible values from 54 to 2^54 - 1.

While aficionados of regular languages may object that converting the
searched text to NFD is cheating, there is a theorem that if I have a
finite automaton that recognises a family of NFD strings, there is
another finite automaton that will recognise all their canonical
equivalents.

Richard.
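The remark that conversion to NFD can change the stored order of
combining marks can be checked with a small Python sketch (stdlib
unicodedata; an illustration, not the poster's matcher):

```python
import unicodedata

# U+0323 COMBINING DOT BELOW has ccc 220; U+0301 COMBINING ACUTE
# ACCENT has ccc 230.  Canonical ordering puts the lower ccc first,
# so canonically equivalent inputs reach a single stored order.
a = "q\u0301\u0323"            # acute then dot below
b = "q\u0323\u0301"            # dot below then acute
assert unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
assert unicodedata.normalize("NFD", a) == "q\u0323\u0301"
assert unicodedata.combining("\u0323") == 220
assert unicodedata.combining("\u0301") == 230

# Precomposed characters decompose to the same canonical order:
assert unicodedata.normalize("NFD", "\u1e0d") == "d\u0323"  # ḍ
```

This single stored order is what lets the matcher track one blocking
ccc value n instead of a set of blocked combining classes.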
Re: Pure Regular Expression Engines and Literal Clusters
> On 14 Oct 2019, at 02:10, Richard Wordingham via Unicode wrote:
>
> On Mon, 14 Oct 2019 00:22:36 +0200
> Hans Åberg via Unicode wrote:
>
>>> On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode wrote:
>>>
>>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>>> should match.  For example, with PCRE syntax, GNU grep Version 2.25
>>> \p{Lu} matches U+0100 but not <U+0041, U+0304>.  When I'm
>>> respecting canonical equivalence, I want both to match [:Lu:], and
>>> that's what I do.  [:Lu:] can then match a sequence of up to 4 NFD
>>> characters.
>
>> Hopefully some experts here can tune in, explaining exactly what
>> regular expressions they have in mind.
>
> The best indication lies at
> https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents

The browser says the certificate expired one day ago, risking theft of
personal and financial information, and refuses to load it.  So one
has to load the totally insecure HTTP page at the risk of creating
mayhem on the computer. :-)

> (2008), which is the last version before support for canonical
> equivalence was dropped as a requirement.

As said there, one might add all the equivalents if one can find them.
Alternatively, one could normalize the regex and the string, keeping
track of the translation boundaries on the string so that a match can
be translated back to a match on the original string if called for.
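The normalize-and-translate-back idea might be sketched like this
(Python, stdlib only; nfd_with_map is a hypothetical helper, and the
per-character mapping shown here ignores mark reordering across
character boundaries, which a real implementation would have to
handle):

```python
import unicodedata

def nfd_with_map(text: str):
    """Normalize each character to NFD, recording for each NFD index
    the index of the original character it came from."""
    out, back = [], []
    for i, ch in enumerate(text):
        for d in unicodedata.normalize("NFD", ch):
            out.append(d)
            back.append(i)
    return "".join(out), back

# Search for decomposed "e" + U+0301 inside text stored precomposed:
text = "caf\u00e9 au lait"
nfd, back = nfd_with_map(text)
pat = unicodedata.normalize("NFD", "\u00e9")       # "e" + U+0301
start = nfd.find(pat)
orig_start = back[start]
orig_end = back[start + len(pat) - 1] + 1
assert text[orig_start:orig_end] == "\u00e9"       # match on original
```

The back-mapping is what lets the caller report the match in terms of
the un-normalized string, as discussed above.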
Re: Pure Regular Expression Engines and Literal Clusters
On Sun, 13 Oct 2019 21:28:34 -0700
Mark Davis ☕️ via Unicode wrote:

> The problem is that most regex engines are not written to handle some
> "interesting" features of canonical equivalence, like discontinuity.
> Suppose that X is canonically equivalent to AC.
>
> - A query /X/ can match the separated A and C in the target string
>   "AbC".  So if I have code do [replace /X/ in "AbC" by "pq"], how
>   should it behave?  "pqb", "pbq", "bpq"?

If A contains a non-starter, pqbC.  If C contains a non-starter, Abpq.
Otherwise, if the results are canonically inequivalent, it should
raise an exception for attempting a process that is either ill-defined
or not Unicode-compliant.

> If the input was in NFD (for example), should the output be
> rearranged/decomposed so that it is NFD?  And so on.

That is not a new issue.  It exists already.

> - A query /A/ can match *part* of the X in the target string "aXb".
>   So if I have code to do [replace /A/ in "aXb" by "pq"], what
>   should result: "apqCb"?

Yes, unless raising an exception is appropriate (see above).

> The syntax and APIs for regex engines are not built to handle these
> features.  It introduces enough complications in the code, syntax,
> and semantics that no major implementation has seen fit to do it.
> We used to have a section in the spec about this, but were convinced
> that it was better off handled at a higher level.

What higher level?  If anything, I would say that the handler is at a
lower level (character fragments and the like).  The potential
requirement should be restored, but not subsumed in Levels 1 to 3.  It
is a sufficiently different level of endeavour.

Richard.
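The discontinuity problem is easy to demonstrate with a stock regex
engine (a Python sketch using the stdlib re module, which compares
codepoints with no normalization at all):

```python
import re

# U+00E9 and "e" + U+0301 are canonically equivalent, but re compares
# raw codepoints, so neither form of the pattern finds the other:
assert re.search("\u00e9", "e\u0301") is None
assert re.search("e\u0301", "\u00e9") is None

# And a replacement can cut across a canonical-equivalence class:
# substituting the bare "e" inside the decomposed form strands the
# accent on the replacement text.
assert re.sub("e", "pq", "e\u0301") == "pq\u0301"
```

This is the "match *part* of the X" case from the quoted message: the
engine happily produces output that is canonically equivalent to
neither intended reading.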
Re: Pure Regular Expression Engines and Literal Clusters
On Sun, 13 Oct 2019 20:25:25 -0700
Asmus Freytag via Unicode wrote:

> On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote:
> On Sun, 13 Oct 2019 17:13:28 -0700
>> Yes.  There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so
>> [:Lu:] should not match <LATIN CAPITAL LETTER M, COMBINING
>> CIRCUMFLEX ACCENT>.
>
> Why does it matter if it is precomposed?  Why should it?  (For anyone
> other than a character coding maven.)

Because general_category is a property of characters, not strings.  It
matters to anyone who intends to conform to a standard.

>> Now, I could invent a string property so that \p{xLu} meant
>> (?:\p{Lu}\p{Mn}*).

No, I shouldn't!  \p{xLu} is infinite, which would not be allowed for
a Unicode set.  I'd have to resort to a wordy definition for it to be
a property.

Richard.
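Python's stdlib re has no \p{...} classes, but the proposed string
property can be approximated as a predicate over general_category
values (a sketch; \p{xLu} is the poster's invented name, and the set
of strings it describes is indeed infinite, since the trailing marks
are unbounded):

```python
import unicodedata

def matches_xLu(s: str) -> bool:
    """True if s is one Lu character followed only by Mn marks,
    i.e. the (?:\\p{Lu}\\p{Mn}*) idea recast as a string predicate."""
    return (len(s) >= 1
            and unicodedata.category(s[0]) == "Lu"
            and all(unicodedata.category(c) == "Mn" for c in s[1:]))

assert matches_xLu("\u0100")          # Ā, precomposed
assert matches_xLu("A\u0304")         # A + combining macron
assert matches_xLu("M\u0302")         # no precomposed form exists
assert not matches_xLu("a\u0304")     # base letter is Ll, not Lu
```

Treating it as a predicate over strings sidesteps the objection that a
Unicode set must be finite, at the cost of a wordier definition.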
Re: Pure Regular Expression Engines and Literal Clusters
> Date: Mon, 14 Oct 2019 01:10:45 +0100
> From: Richard Wordingham via Unicode
>
> >> Besides invalidating complexity metrics, the issue was what \p{Lu}
> >> should match.  For example, with PCRE syntax, GNU grep Version 2.25
> >> \p{Lu} matches U+0100 but not <U+0041, U+0304>.  When I'm
> >> respecting canonical equivalence, I want both to match [:Lu:], and
> >> that's what I do.  [:Lu:] can then match a sequence of up to 4 NFD
> >> characters.
>
> > Hopefully some experts here can tune in, explaining exactly what
> > regular expressions they have in mind.
>
> The best indication lies at
> https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents
> (2008), which is the last version before support for canonical
> equivalence was dropped as a requirement.
>
> It's not entirely coherent, as the authors don't seem to find an
> expression like
>
>   \p{L}\p{gcb=extend}*
>
> a natural thing to use, as the second factor is mostly sequences of
> non-starters.  At that point, I would say they weren't expecting
> \p{Lu} to not match <U+0041, U+0304>, as they were still expecting
> [ä] to match both "ä" and "a\u0308".
>
> They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*,
> and were expecting normalisation (even to NFC) to be a possible
> cure.  They had begun to realise that converting expressions to
> match all or none of a set of canonical equivalents was hard; the
> issue of non-contiguous matches wasn't mentioned.

I think these are two separate issues: whether search should normalize
(a.k.a. performs character folding) should be a user option.  You are
talking only about canonical equivalence, but there's also
compatibility decomposition, so, for example, searching for "1" should
perhaps match ¹ and ①.
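The compatibility-decomposition point can be checked directly with
Python's unicodedata (a minimal illustration, not part of the original
exchange):

```python
import unicodedata

# NFKD erases compatibility differences, so a folded search for "1"
# can match SUPERSCRIPT ONE and CIRCLED DIGIT ONE:
for ch in ("1", "\u00b9", "\u2460"):      # "1", "¹", "①"
    assert unicodedata.normalize("NFKD", ch) == "1"

# Canonical normalization alone does not fold them:
assert unicodedata.normalize("NFD", "\u2460") == "\u2460"
```

Whether a search should apply that folding is exactly the user option
being argued for.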