Re: svn commit: r301461 - in head/lib/libc: gen locale regex
On 06/05/16 14:49, Andrey Chernov wrote: On 05.06.2016 22:12, Pedro F. Giffuni wrote: --- head/lib/libc/regex/regcomp.c Sun Jun 5 18:16:33 2016 (r301460) +++ head/lib/libc/regex/regcomp.c Sun Jun 5 19:12:52 2016 (r301461) @@ -821,10 +821,10 @@ p_b_term(struct parse *p, cset *cs) (void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE); CHaddrange(p, cs, start, finish); } else { - (void)REQUIRE(__collate_range_cmp(table, start, finish) <= 0, REG_ERANGE); + (void)REQUIRE(__wcollate_range_cmp(table, start, finish) <= 0, REG_ERANGE); for (i = 0; i <= UCHAR_MAX; i++) { - if ( __collate_range_cmp(table, start, i) <= 0 - && __collate_range_cmp(table, i, finish) <= 0 + if ( __wcollate_range_cmp(table, start, i) <= 0 + && __wcollate_range_cmp(table, i, finish) <= 0 ) CHadd(p, cs, i); } As I already mention in PR, we have broken regcomp after someone adds wchar_t support there. Now regcomp ranges works only for the first 256 wchars of the current locale, notice that loop upper limit: for (i = 0; i <= UCHAR_MAX; i++) { In general, ranges are either broken in regcomp now or are memory eating. We have bitmask only for the first 256 wchars, all other added to the range literally. Imagine what happens if someone specify full Unicode range in regexp. Proper fix will be adding bitmask for the whole Unicode range, and even in that case regcomp attempting to use collation in ranges will be _very_slow_ since needs to check all Unicode chars in its for (i = 0; i <= Max_Unicode_wchar; i++) { loop. Better stop pretending that we are able to do collation support in the ranges, since POSIX cares about its own locale only here: "In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched." Until whole Unicode range bitmask will be implemented (if ever), better stop pretending to honor collation order, we just can't do it with wchars now and do what NetBSD/OpenBSD does (using wchar_t) instead. It does not prevent memory eating on big ranges (bitmask is needed, see above), but at least fix the thing that only first 256 wchars are considered. Sadly regex is one part of the system that could use a maintainer :(, I have been forced to look at it more than I'd like to but I don't really use the collation support at all. Pedro. ___ svn-src-head@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
Re: svn commit: r301461 - in head/lib/libc: gen locale regex
On 05.06.2016 22:12, Pedro F. Giffuni wrote: > When collation support was brought in, the second and third > arguments in __collate_range_cmp() were changed from int to > wchar_t, breaking the ABI. Change them to a "char" type which > makes more sense and keeps the ABI compatible. Not only that breaks ABI, but changing strcoll_l() to wcscoll_l() in the __collate_range_cmp() too (now fixed), while this function is visible outside of libc. ___ svn-src-head@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
Re: svn commit: r301461 - in head/lib/libc: gen locale regex
On 05.06.2016 22:12, Pedro F. Giffuni wrote: > --- head/lib/libc/regex/regcomp.c Sun Jun 5 18:16:33 2016 > (r301460) > +++ head/lib/libc/regex/regcomp.c Sun Jun 5 19:12:52 2016 > (r301461) > @@ -821,10 +821,10 @@ p_b_term(struct parse *p, cset *cs) > (void)REQUIRE((uch)start <= (uch)finish, > REG_ERANGE); > CHaddrange(p, cs, start, finish); > } else { > - (void)REQUIRE(__collate_range_cmp(table, start, > finish) <= 0, REG_ERANGE); > + (void)REQUIRE(__wcollate_range_cmp(table, > start, finish) <= 0, REG_ERANGE); > for (i = 0; i <= UCHAR_MAX; i++) { > - if ( __collate_range_cmp(table, > start, i) <= 0 > - && __collate_range_cmp(table, i, > finish) <= 0 > + if ( __wcollate_range_cmp(table, > start, i) <= 0 > + && __wcollate_range_cmp(table, i, > finish) <= 0 > ) > CHadd(p, cs, i); > } > As I already mention in PR, we have broken regcomp after someone adds wchar_t support there. Now regcomp ranges works only for the first 256 wchars of the current locale, notice that loop upper limit: for (i = 0; i <= UCHAR_MAX; i++) { In general, ranges are either broken in regcomp now or are memory eating. We have bitmask only for the first 256 wchars, all other added to the range literally. Imagine what happens if someone specify full Unicode range in regexp. Proper fix will be adding bitmask for the whole Unicode range, and even in that case regcomp attempting to use collation in ranges will be _very_slow_ since needs to check all Unicode chars in its for (i = 0; i <= Max_Unicode_wchar; i++) { loop. Better stop pretending that we are able to do collation support in the ranges, since POSIX cares about its own locale only here: "In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched." Until whole Unicode range bitmask will be implemented (if ever), better stop pretending to honor collation order, we just can't do it with wchars now and do what NetBSD/OpenBSD does (using wchar_t) instead. It does not prevent memory eating on big ranges (bitmask is needed, see above), but at least fix the thing that only first 256 wchars are considered. ___ svn-src-head@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"