Re: Annoyances from Implementation of Canonical Equivalence (was: Pure Regular Expression Engines and Literal Clusters)

2019-10-14 Thread Eli Zaretskii via Unicode
> Date: Tue, 15 Oct 2019 00:23:59 +0100
> From: Richard Wordingham via Unicode 
> 
> > I'm well aware of the official position.  However, when we attempted
> > to implement it unconditionally in Emacs, some people objected, and
> > brought up good reasons.  You can, of course, elect to disregard this
> > experience, and instead learn it from your own.
> 
> Is there a good record of these complaints anywhere?

You could look up these discussions:

  https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00189.html
  https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html

> (It would occasionally be useful to have an easily issued command
> like 'delete preceding NFD codepoint'.)

I agree.  Emacs commands that delete characters backward (usually
invoked by the Backspace key) do that automatically if the text
before the cursor was produced by composing several codepoints.

> I did mention above that occasionally one needs to know what
> codepoints were used and in what order.

Sure.  There's an Emacs command (C-u C-x =) which shows that
information for the text at a given position.
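
(For readers outside Emacs, that kind of inspection is easy to
approximate.  A minimal Python sketch using only the standard
unicodedata module; the sample string is purely illustrative:)

    import unicodedata

    s = "a\u0301"  # 'a' followed by COMBINING ACUTE ACCENT
    for ch in s:
        # Show each code point and its name, in stored order.
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # U+0061 LATIN SMALL LETTER A
    # U+0301 COMBINING ACUTE ACCENT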


Annoyances from Implementation of Canonical Equivalence (was: Pure Regular Expression Engines and Literal Clusters)

2019-10-14 Thread Richard Wordingham via Unicode
On Mon, 14 Oct 2019 21:41:19 +0300
Eli Zaretskii via Unicode  wrote:

> > Date: Mon, 14 Oct 2019 19:29:39 +0100
> > From: Richard Wordingham via Unicode 

> > The official position is that text that is canonically
> > equivalent is the same.  There are problem areas where traditional
> > modes of expression require that canonically equivalent text be
> > treated differently.  For these, it is useful to have tools that
> > treat them differently.  However, the normal presumption should be
> > that canonically equivalent text is the same.  

> I'm well aware of the official position.  However, when we attempted
> to implement it unconditionally in Emacs, some people objected, and
> brought up good reasons.  You can, of course, elect to disregard this
> experience, and instead learn it from your own.

Is there a good record of these complaints anywhere?  It is annoying
when a text entry function does not keep the text as one enters it, but
it would be interesting to know what the other complaints were.  (It
would occasionally be useful to have an easily issued command like
'delete preceding NFD codepoint'.)  I did mention above that
occasionally one needs to know what codepoints were used and in what
order.

Richard.


Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Eli Zaretskii via Unicode
> Date: Mon, 14 Oct 2019 19:29:39 +0100
> From: Richard Wordingham via Unicode 
> 
> On Mon, 14 Oct 2019 10:05:49 +0300
> Eli Zaretskii via Unicode  wrote:
> 
> > I think these are two separate issues: whether search should normalize
> (a.k.a. perform character folding) should be a user option.  You are
> > talking only about canonical equivalence, but there's also
> > compatibility decomposition, so, for example, searching for "1" should
> > perhaps match ¹ and ①.
> 
> HERETIC!
> 
> The official position is that text that is canonically
> equivalent is the same.  There are problem areas where traditional
> modes of expression require that canonically equivalent text be treated
> differently.  For these, it is useful to have tools that treat them
> differently.  However, the normal presumption should be that
> canonically equivalent text is the same.

I'm well aware of the official position.  However, when we attempted
to implement it unconditionally in Emacs, some people objected, and
brought up good reasons.  You can, of course, elect to disregard this
experience, and instead learn it from your own.

> The party line seems to be that most searching should actually be done
> using a 'collation', which brings with it different levels of
> 'folding'.  In multilingual use, a collation used for searching should
> be quite different to one used for sorting.

Alas, collation is locale- and language-dependent.  And, if you are
going to use your search in a multilingual application (Emacs is such
an application), you will have a hard time even knowing which tailoring
to apply for each potential match, because you will need to support
the use case of working with text that mixes languages.

Leaving the conundrum to the user to resolve seems to be a good
compromise, and might actually teach us something that is useful for
future modifications of the "party line".


Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Richard Wordingham via Unicode
On Mon, 14 Oct 2019 10:05:49 +0300
Eli Zaretskii via Unicode  wrote:

> > Date: Mon, 14 Oct 2019 01:10:45 +0100
> > From: Richard Wordingham via Unicode 

> > They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*,
> > and were expecting normalisation (even to NFC) to be a possible
> > cure.  They had begun to realise that converting expressions to
> > match all or none of a set of canonical equivalents was hard; the
> > issue of non-contiguous matches wasn't mentioned.  

> I think these are two separate issues: whether search should normalize
> (a.k.a. perform character folding) should be a user option.  You are
> talking only about canonical equivalence, but there's also
> compatibility decomposition, so, for example, searching for "1" should
> perhaps match ¹ and ①.

HERETIC!

The official position is that text that is canonically
equivalent is the same.  There are problem areas where traditional
modes of expression require that canonically equivalent text be treated
differently.  For these, it is useful to have tools that treat them
differently.  However, the normal presumption should be that
canonically equivalent text is the same.

The party line seems to be that most searching should actually be done
using a 'collation', which brings with it different levels of
'folding'.  In multilingual use, a collation used for searching should
be quite different to one used for sorting.
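
(A rough illustration of what different folding levels mean, as a
Python sketch: this approximates a primary-strength "search" fold by
case-folding, compatibility-decomposing, and dropping combining marks.
A real collation is locale-tailored; this deliberately is not.)

    import unicodedata

    def search_fold(s):
        # Crude stand-in for a primary-strength search collation:
        # case-fold, compatibility-decompose, drop combining marks.
        s = unicodedata.normalize("NFKD", s.casefold())
        return "".join(c for c in s if not unicodedata.combining(c))

    print(search_fold("Résumé") == search_fold("resume"))  # True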

Now, there is a case for being able to switch off normalisation and
canonical equivalence generally, e.g. when dealing with ISO 10646 text
instead of Unicode text.  This of course still leaves the question of
what character classes defined by Unicode properties then mean.

If one converts the regular expression so that what it matches is
closed under canonical equivalence, then visibly normalising the
searched text becomes irrelevant.  For working with Unicode traces, I
actually do both.  I convert the text to NFD but report matches in terms
of the original code point sequence; working this way simplifies the
conversion of the regular expression, which I do as part of its
compilation.  For traces, it seems only natural to treat precomposed
characters as syntactic sugar for the NFD decompositions.  (They
have no place in the formal theory of traces.)  However, I go further
and convert the decomposed text to NFD. (Recall that conversion to NFD
can change the stored order of combining marks.)
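
(A quick Python check of that reordering, with characters chosen here
for illustration: COMBINING DOT BELOW has ccc 220 and so sorts before
COMBINING CIRCUMFLEX ACCENT, ccc 230, under NFD.)

    import unicodedata

    s = "q\u0302\u0323"  # q, circumflex (ccc 230), dot below (ccc 220)
    print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", s)])
    # ['U+0071', 'U+0323', 'U+0302'] -- the two marks have been reordered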

One of the simplifications I get is that literal runs of text in the
regular expression then match, even in the middle of the text, simply
by converting both the run and the searched string to NFD.
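
(A hedged sketch of that simplification in Python: once both sides are
in NFD, a literal run is found by plain substring search, whichever
canonically equivalent encoding the text happened to use.)

    import unicodedata

    def nfd(s):
        return unicodedata.normalize("NFD", s)

    pattern = "\u00e9"                 # é, precomposed
    text = "expose\u0301, cafe\u0301"  # the same accent, decomposed
    print(nfd(pattern) in nfd(text))   # True, despite the different encodings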

For the concatenation of expressions A and B, once I am looking at the
possible interleaving of two traces, I am dealing with NFA states of
the form states(A) × {1..254} × states(B), so that for an element (a,
n, b), a corresponds to starts of words with a match in A, b
corresponds to starts of _words_ with a match in B, and n is the ccc
of the last character used to advance to b.  The element n blocks
non-starters that can't belong to a word matching A.  If I didn't
(internally) convert the searched text to NFD, the element n would have
to be a set of blocked canonical combining classes, changing the number
of possible values from 54 to 2^54 - 1.

While aficionados of regular languages may object that converting the
searched text to NFD is cheating, there is a theorem that if I have a
finite automaton that recognises a family of NFD strings, there is
another finite automaton that will recognise all their canonical
equivalents.

Richard.



Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Hans Åberg via Unicode



> On 14 Oct 2019, at 02:10, Richard Wordingham via Unicode wrote:
> 
> On Mon, 14 Oct 2019 00:22:36 +0200
> Hans Åberg via Unicode  wrote:
> 
>>> On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode wrote:
> 
>>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>>> should match.  For example, with PCRE syntax, GNU grep Version 2.25
>>> \p{Lu} matches U+0100 but not <U+0041, U+0304>.  When I'm respecting
>>> canonical equivalence, I want both to match [:Lu:], and that's what
>>> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.  
> 
>> Hopefully some experts here can tune in, explaining exactly what
>> regular expressions they have in mind.
> 
> The best indication lies at
> https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents

The browser says the certificate expired one day ago, warns of the risk
of personal and financial information being stolen, and refuses to load
the page.  So one has to load the totally insecure HTTP page, at the
risk of creating mayhem on the computer. :-)

> (2008), which is the last version before support for canonical
> equivalence was dropped as a requirement.

As it says there, one might add all the equivalents if one can find
them.  Alternatively, one could normalize both the regex and the
string, keeping track of the translation boundaries in the string so
that a match can be mapped back onto the original string if called for.
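
(A rough Python sketch of that bookkeeping, under simplifying
assumptions of our own: each original code point is decomposed
separately and the origin of every NFD code point is recorded, so a
match in the normalized string can be mapped back.  Canonical
reordering across code-point boundaries is ignored here; a full
solution would have to track it too.)

    import unicodedata

    def nfd_with_map(s):
        # For each NFD code point, remember the index of the original
        # code point it came from.
        out, idx = [], []
        for i, ch in enumerate(s):
            for d in unicodedata.normalize("NFD", ch):
                out.append(d)
                idx.append(i)
        return "".join(out), idx

    def find_literal(pattern, text):
        p = unicodedata.normalize("NFD", pattern)
        t, idx = nfd_with_map(text)
        j = t.find(p)
        if j < 0:
            return None
        return idx[j], idx[j + len(p) - 1] + 1  # span in the original text

    print(find_literal("e\u0301", "caf\u00e9"))  # (3, 4): the é of "café"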





Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 21:28:34 -0700
Mark Davis ☕️ via Unicode  wrote:

> The problem is that most regex engines are not written to handle some
> "interesting" features of canonical equivalence, like discontinuity.
> Suppose that X is canonically equivalent to AC.
> 
> - A query /X/ can match the separated A and C in the target string
>   "AbC". So if I have code to do [replace /X/ in "AbC" by "pq"], how
>   should it behave? "pqb", "pbq", "bpq"?

If A contains a non-starter, pqbC.
If C contains a non-starter, Abpq.
Otherwise, if the results are canonically inequivalent, it should
raise an exception for attempting a process that is either ill-defined
or not Unicode-compliant. 
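
(To make the discontiguity concrete, a small Python check; the
characters are our choice, with X = U+00E2, which is canonically
equivalent to "a" plus U+0302, and COMBINING DOT BELOW standing in for
the intervening "b".)

    import unicodedata

    X = "\u00e2"              # â, canonically equivalent to "a" + U+0302
    target = "a\u0323\u0302"  # a, dot below (ccc 220), circumflex (ccc 230)

    # target contains no contiguous "a\u0302", yet it is canonically
    # equivalent to X followed by the dot below, so a match for /X/
    # would have to be discontiguous.
    print(unicodedata.normalize("NFD", target)
          == unicodedata.normalize("NFD", X + "\u0323"))  # True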

> If the input was in NFD (for
> example), should the output be rearranged/decomposed so that it is
> NFD? and so on.

That is not a new issue.  It exists already.

> - A query /A/ can match *part* of the X in the target string
>   "aXb". So if I have code to do [replace /A/ in "aXb" by "pq"], what
>   should result: "apqCb"?

Yes, unless raising an exception is appropriate (see above).

> The syntax and APIs for regex engines are not built to handle these
> features. It introduces enough complications in the code, syntax,
> and semantics that no major implementation has seen fit to do it. We
> used to have a section in the spec about this, but were convinced
> that it was better off handled at a higher level.

What higher level?  If anything, I would say that the handler is at a
lower level (character fragments and the like).

The potential requirement should be restored, but not subsumed in
Levels 1 to 3.  It is a sufficiently different level of endeavour.

Richard.



Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Richard Wordingham via Unicode
On Sun, 13 Oct 2019 20:25:25 -0700
Asmus Freytag via Unicode  wrote:

> On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote:
> On Sun, 13 Oct 2019 17:13:28 -0700

>> Yes.  There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so
>> [:Lu:] should not match <LATIN CAPITAL LETTER M, COMBINING CIRCUMFLEX ACCENT>.

> Why does it matter if it is precomposed? Why should it? (For anyone
> other than a character coding maven).

Because general_category is a property of characters, not strings.  It
matters to anyone who intends to conform to a standard.

>> Now, I could invent a string
>> property so that \p{xLu} meant (?:\p{Lu}\p{Mn}*).

No, I shouldn't!  \p{xLu} is infinite, which would not be allowed for
a Unicode set.  I'd have to resort to a wordy definition for it to be a
property.
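
(The expansion itself is easy to experiment with; a hedged sketch using
the third-party Python "regex" module, which supports \p{...}
properties:)

    import regex  # third-party: pip install regex

    # An uppercase letter plus trailing nonspacing marks, along the
    # lines of the hypothetical \p{xLu} above.
    pat = regex.compile(r"\p{Lu}\p{Mn}*")
    print(bool(pat.fullmatch("M\u0302")))  # True: M + COMBINING CIRCUMFLEX
    print(bool(pat.fullmatch("\u0100")))   # True: U+0100 alone is Lu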

Richard.


Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Eli Zaretskii via Unicode
> Date: Mon, 14 Oct 2019 01:10:45 +0100
> From: Richard Wordingham via Unicode 
> 
> >> Besides invalidating complexity metrics, the issue was what \p{Lu}
> >> should match.  For example, with PCRE syntax, GNU grep Version 2.25
> >> \p{Lu} matches U+0100 but not <U+0041, U+0304>.  When I'm respecting
> >> canonical equivalence, I want both to match [:Lu:], and that's what
> >> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.  
>  
> > Hopefully some experts here can tune in, explaining exactly what
> > regular expressions they have in mind.
> 
> The best indication lies at
> https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents
> (2008), which is the last version before support for canonical
> equivalence was dropped as a requirement.
> 
> It's not entirely coherent, as the authors don't seem to find an
> expression like
> 
> \p{L}\p{gcb=extend}*
> 
> a natural thing to use, as the second factor is mostly sequences of
> non-starters.  At that point, I would say they weren't expecting
> \p{Lu} to not match <U+0041, U+0304>, as they were still expecting [ä] to
> match both "ä" and "a\u0308".
> 
> They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*, and
> were expecting normalisation (even to NFC) to be a possible cure.  They
> had begun to realise that converting expressions to match all or none
> of a set of canonical equivalents was hard; the issue of non-contiguous
> matches wasn't mentioned.

I think these are two separate issues: whether search should normalize
(a.k.a. perform character folding) should be a user option.  You are
talking only about canonical equivalence, but there's also
compatibility decomposition, so, for example, searching for "1" should
perhaps match ¹ and ①.
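
(That compatibility folding is easy to check in Python; a minimal
sketch:)

    import unicodedata

    # Compatibility decomposition: ¹ (U+00B9) and ① (U+2460)
    # both decompose to the plain digit "1".
    print([unicodedata.normalize("NFKD", c) for c in "1\u00b9\u2460"])
    # ['1', '1', '1']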