Re: find(1): I18N gone wild? [[:alpha:]] not a substitute to refer 26 English letters A-Z

2023-04-21 Thread Yuri
Yuri wrote:
> parv/FreeBSD wrote:
>> Wrote Dimitry Andric on Fri, 21 Apr 2023 10:38:05 UTC
>> (via
>> https://lists.freebsd.org/archives/freebsd-current/2023-April/003556.html 
>>  )
>>>
>>> ... However, I have read that with unicode, you should *never*
>>> use [A-Z] or [0-9], but character classes instead. That seems to give
>>> both files on macOS and Linux with [[:alpha:]]:
>> ...
>>
>> Subject to the locale, the problem with that is that "[[:alpha:]]" will
>> match more than the 26 English letters "A" through "Z" (besides also
>> matching lower case "a" through "z"), even if none of the 26 * 2 English
>> letters appear in a string.
> 
> (replying to random recent message)
> 
> There is also some fairly recent history for fnmatch() related to
> [a-z]; the same was done for regex, with the same outcome: the attempt to
> make the [a-z] range (and, I guess, [A-Z] as well) non-collating failed.
> I am not aware of which failures were encountered; hopefully someone
> remembers:

I just tried a less intrusive change that seems to help with these ranges
(though the question of what failed previously remains):

diff --git a/lib/libc/gen/fnmatch.c b/lib/libc/gen/fnmatch.c
index 40670545993..3234c14 100644
--- a/lib/libc/gen/fnmatch.c
+++ b/lib/libc/gen/fnmatch.c
@@ -295,10 +295,11 @@ rangematch(const char *pattern, wchar_t test, int flags, char **newp,
 			if (flags & FNM_CASEFOLD)
 				c2 = towlower(c2);
 
-			if (table->__collate_load_error ?
+			if (table->__collate_load_error ||
+			    iswascii(test) ?
 			    c <= test && test <= c2 :
-			       __wcollate_range_cmp(c, test) <= 0
-			    && __wcollate_range_cmp(test, c2) <= 0
+			    __wcollate_range_cmp(c, test) <= 0 &&
+			    __wcollate_range_cmp(test, c2) <= 0
 			   )
 				ok = 1;
 		} else if (c == test)

$ LC_ALL=en_US.UTF-8 \
    LD_PRELOAD=/usr/obj/home/yuri/ws/find/amd64.amd64/lib/libc/libc.so.7 \
    find . -name '[a-z]*'
./bar
$ LC_ALL=en_US.UTF-8 \
    LD_PRELOAD=/usr/obj/home/yuri/ws/find/amd64.amd64/lib/libc/libc.so.7 \
    find . -name '[A-Z]*'
./FOO

> 
> commit 5a5807dd4ca34467ac5fb458bc19f12bf62075a5
> Author: Andrey A. Chernov 
> Date:   Sun Jul 10 03:49:38 2016 +
> 
> Remove broken support for collation in [a-z] type ranges.
> Only the first 256 wide chars are currently considered; all others are
> just dropped from the range. A proper implementation requires a reverse
> tables database lookup, since the objects are really big, up to the UTF-8
> maximum (1114112 code points), so the same scanning as was done for 256
> chars would slow things down.
> 
> POSIX does not require collation for [a-z] type ranges and does not
> prohibit it for non-POSIX locales. POSIX require collation for ranges
> only for POSIX (or C) locale which is equal to ASCII and binary for
> other chars, so we already have it.
> 
> No other *BSD implements collation for [a-z] type ranges.
> 
> Restore ABI compatibility with unused now __collate_range_cmp() which
> is visible from outside (will be removed later).
> 
> commit 1daad8f5ad767dfe7896b8d1959a329785c9a76b
> Author: Andrey A. Chernov 
> Date:   Thu Jul 14 08:18:12 2016 +
> 
> Back out non-collating [a-z] ranges.
> Instead of changing whole course to another POSIX-permitted way
> for consistency and uniformity I decide to completely ignore missing
> regex functionality and concentrate on fixing bugs in what we have now,
> too many small obstacles instead, counting ports.
> 
> commit 12eae8c8f346cb459a388259ca98faebdac47038
> Author: Andrey A. Chernov 
> Date:   Thu Jul 14 09:07:25 2016 +
> 
> 1) Eliminate possibility to call __*collate_range_cmp() with incomplete
> locale (which cause core dump) by removing whole 'table' argument
> by which it passed.
> 
> 2) Restore __collate_range_cmp() in __sccl().
> 
> 3) Collating [a-z] range in regcomp() only for single bytes locales
> (we can't do it now for other ones). In previous state only first 256
> wchars are considered and all others are just silently dropped from the
> range.
> 
> 




Re: find(1): I18N gone wild? [[:alpha:]] not a substitute to refer 26 English letters A-Z

2023-04-21 Thread Yuri
parv/FreeBSD wrote:
> Wrote Dimitry Andric on Fri, 21 Apr 2023 10:38:05 UTC
> (via
> https://lists.freebsd.org/archives/freebsd-current/2023-April/003556.html 
>  )
>>
>> ... However, I have read that with unicode, you should *never*
>> use [A-Z] or [0-9], but character classes instead. That seems to give
>> both files on macOS and Linux with [[:alpha:]]:
> ...
> 
> Subject to the locale, the problem with that is that "[[:alpha:]]" will
> match more than the 26 English letters "A" through "Z" (besides also
> matching lower case "a" through "z"), even if none of the 26 * 2 English
> letters appear in a string.

(replying to random recent message)

There is also some fairly recent history for fnmatch() related to
[a-z]; the same was done for regex, with the same outcome: the attempt to
make the [a-z] range (and, I guess, [A-Z] as well) non-collating failed.
I am not aware of which failures were encountered; hopefully someone
remembers:


commit 5a5807dd4ca34467ac5fb458bc19f12bf62075a5
Author: Andrey A. Chernov 
Date:   Sun Jul 10 03:49:38 2016 +

Remove broken support for collation in [a-z] type ranges.
Only the first 256 wide chars are currently considered; all others are
just dropped from the range. A proper implementation requires a reverse
tables database lookup, since the objects are really big, up to the UTF-8
maximum (1114112 code points), so the same scanning as was done for 256
chars would slow things down.

POSIX does not require collation for [a-z] type ranges and does not
prohibit it for non-POSIX locales. POSIX require collation for ranges
only for POSIX (or C) locale which is equal to ASCII and binary for
other chars, so we already have it.

No other *BSD implements collation for [a-z] type ranges.

Restore ABI compatibility with unused now __collate_range_cmp() which
is visible from outside (will be removed later).

commit 1daad8f5ad767dfe7896b8d1959a329785c9a76b
Author: Andrey A. Chernov 
Date:   Thu Jul 14 08:18:12 2016 +

Back out non-collating [a-z] ranges.
Instead of changing whole course to another POSIX-permitted way
for consistency and uniformity I decide to completely ignore missing
regex functionality and concentrate on fixing bugs in what we have now,
too many small obstacles instead, counting ports.

commit 12eae8c8f346cb459a388259ca98faebdac47038
Author: Andrey A. Chernov 
Date:   Thu Jul 14 09:07:25 2016 +

1) Eliminate possibility to call __*collate_range_cmp() with incomplete
locale (which cause core dump) by removing whole 'table' argument
by which it passed.

2) Restore __collate_range_cmp() in __sccl().

3) Collating [a-z] range in regcomp() only for single bytes locales
(we can't do it now for other ones). In previous state only first 256
wchars are considered and all others are just silently dropped from the
range.




Re: find(1): I18N gone wild? [[:alpha:]] not a substitute to refer 26 English letters A-Z

2023-04-21 Thread parv/FreeBSD
Wrote Dimitry Andric on Fri, 21 Apr 2023 10:38:05 UTC
(via
https://lists.freebsd.org/archives/freebsd-current/2023-April/003556.html )
>
> ... However, I have read that with unicode, you should *never*
> use [A-Z] or [0-9], but character classes instead. That seems to give
> both files on macOS and Linux with [[:alpha:]]:
...

Subject to the locale, the problem with that is that "[[:alpha:]]" will
match more than the 26 English letters "A" through "Z" (besides also
matching lower case "a" through "z"), even if none of the 26 * 2 English
letters appear in a string.


- parv