Re: [Toybox] Unicode string comparison.

Rob Landley Fri, 23 Sep 2022 01:28:24 -0700

On 9/22/22 10:48, enh wrote:
> On Thu, Sep 22, 2022 at 2:43 AM Rob Landley wrote:
> 
>     If you have the same set of combining characters in a different order, is
>     the result still considered the same character for string matching purposes?
> 
> "depends". there are multiple normalization forms.


Oh joy.

> "For most full-featured regular expression engines, it is quite difficult to
> match under canonical equivalence, which may involve reordering, splitting, or
> merging of characters."

I've gone back to just punting Unicode to regcomp() and friends: if you stick a
character above 127 in your pattern, it doesn't take the fast path I'm
implementing. (Not that I expect the regex engine to do better, but then it's
not _my_ fault quite so much.)
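
A minimal sketch of that "anything above 127 punts" check, assuming the pattern
is a NUL-terminated byte string. The name pattern_is_ascii() is hypothetical
and is not taken from the toybox source:

```c
// Hypothetical helper, not toybox's actual code: returns 1 when every byte
// of the pattern is 7-bit ASCII, so a byte-oriented fast path could take it.
// Returns 0 as soon as any byte is above 127, meaning: punt to regcomp().
int pattern_is_ascii(const char *pat)
{
  const unsigned char *s = (const unsigned char *)pat;

  // Scan as unsigned char so bytes >= 0x80 do not show up as negative values.
  for (; *s; s++) if (*s > 127) return 0;

  return 1;
}
```

A caller would try the fast path only when pattern_is_ascii(pattern) returns 1
and otherwise hand the pattern to regcomp() unchanged.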

But I'm trying to understand regex escapes, and...

$ echo 'a[c' | grep 'a\[c'
a[c
$ echo 'a\bc' | grep 'a\bc'
$ echo abc | grep 'a\bc'
$ echo ac | grep 'a\bc'
$ echo 'a^c' | grep 'a\^c'
a^c
$ echo 'a^c' | grep 'a^c'
a^c
$ echo 'a\b' | grep 'a\b'
a\b
$ echo 'a\b' | grep 'a\b.'
a\b
$ echo 'a\b' | grep 'a\b..'
a\b
$ echo 'a\b' | grep 'a\b...'
$

I do not understand regex escapes. (This is all with the Debian grep.)
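
The escapes in the transcript that POSIX does define can be checked against
regcomp() directly. This is a sketch under the assumption that passing no
REG_EXTENDED flag gives the same basic regex syntax grep uses by default;
bre_matches() is a made-up helper name. The \b cases are left out on purpose:
POSIX leaves a backslash before an ordinary character undefined in basic
regexes, and GNU grep treats \b as a word-boundary extension.

```c
#include <regex.h>
#include <stddef.h>

// Hypothetical helper: compile pat as a POSIX basic regex (no REG_EXTENDED)
// and report whether it matches anywhere in line.
// Returns 1 on match, 0 on no match, -1 if the pattern fails to compile.
int bre_matches(const char *pat, const char *line)
{
  regex_t re;
  int rc;

  // REG_NOSUB: only a yes/no answer is needed, not match offsets.
  if (regcomp(&re, pat, REG_NOSUB)) return -1;
  rc = regexec(&re, line, 0, NULL, 0);
  regfree(&re);

  return rc == 0;
}
```

With this, bre_matches("a\\[c", "a[c"), bre_matches("a\\^c", "a^c") and
bre_matches("a^c", "a^c") should all report a match, since ^ is only an anchor
at the start of a basic regex.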

Rob