Re: [PATCH v3] wildmatch: properly fold case everywhere

2013-05-29 Thread Duy Nguyen
On Thu, May 30, 2013 at 12:57 AM, Anthony Ramine  wrote:
>>> If the range to match against is [A-_], it will become [a-_] which is an 
>>> empty range, ord('a') > ord('_'). I think it is simpler to reuse toupper() 
>>> after the fact as I did.
>>>
>>> Anyway maybe I should add a test for that corner case?
>>
>> Yeah I was thinking about such a case, but I saw glibc do it... I
>> guess we just found another bug, at least in compat/fnmatch.c. Yes a
>> test for it would be great, in case I change my mind 2 years from now
>> and decide to turn it the other way ;)
>
> Should I patch compat/fnmatch.c too? That would make it diverge from
> glibc's version.

No. I plan to remove compat/fnmatch and always use wildmatch, even
ignoring the system's fnmatch. That would keep the matching behavior
consistent across platforms.
--
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] wildmatch: properly fold case everywhere

2013-05-29 Thread Anthony Ramine
Replied inline.

-- 
Anthony Ramine

On 29 May 2013 at 15:52, Duy Nguyen wrote:

> On Wed, May 29, 2013 at 8:37 PM, Anthony Ramine  wrote:
>> On 29 May 2013 at 15:22, Duy Nguyen wrote:
>> 
>>> On Tue, May 28, 2013 at 8:58 PM, Anthony Ramine  wrote:
>>>> Case folding is not done correctly when matching against the [:upper:]
>>>> character class and uppercased character ranges (e.g. A-Z).
>>>> Specifically, an uppercase letter fails to match against any of them
>>>> when case folding is requested because plain characters in the pattern
>>>> and the whole string are preemptively lowercased to handle the base case
>>>> fast.
>>> 
>>> I did a little test with glibc fnmatch and also checked the source
>>> code. I don't think 'a' matches [:upper:]. So I'm not sure if that's a
>>> correct behavior or a bug in glibc. The spec is not clear (I think) on
>>> this. I guess we should just assume that 'a' should match '[:upper:]'?
>> 
>> I don't know, in my opinion if case folding is enabled we should say 
>> [:upper:], [:lower:] and [:alpha:] are equivalent.
>> 
>> This opinion is shared by GNU Flex [1]:
>> 
>>>  • If your scanner is case-insensitive (the ‘-i’ flag), then 
>>> ‘[:upper:]’ and ‘[:lower:]’ are equivalent to ‘[:alpha:]’.
>> 
>> [1] http://flex.sourceforge.net/manual/Patterns.html
> 
> Then we should do it too because of this precedent, I think.
> 
>>>> @@ -196,6 +196,11 @@ static int dowild(const uchar *p, const uchar *text, unsigned int flags)
>>>>          }
>>>>          if (t_ch <= p_ch && t_ch >= prev_ch)
>>>>              matched = 1;
>>>> +        else if ((flags & WM_CASEFOLD) && ISLOWER(t_ch)) {
>>>> +            uchar t_ch_upper = toupper(t_ch);
>>>> +            if (t_ch_upper <= p_ch && t_ch_upper >= prev_ch)
>>>> +                matched = 1;
>>>> +        }
>>> 
>>> Or we could stick with tolower. Something like this:
>>> 
>>> if ((t_ch <= p_ch && t_ch >= prev_ch) ||
>>>     ((flags & WM_CASEFOLD) &&
>>>      t_ch <= tolower(p_ch) && t_ch >= tolower(prev_ch)))
>>>         matched = 1;
>>> 
>>> I think it's easier to read if we either downcase all, or upcase all, not 
>>> both.
>> 
>> If the range to match against is [A-_], it will become [a-_] which is an 
>> empty range, ord('a') > ord('_'). I think it is simpler to reuse toupper() 
>> after the fact as I did.
>> 
>> Anyway maybe I should add a test for that corner case?
> 
> Yeah I was thinking about such a case, but I saw glibc do it... I
> guess we just found another bug, at least in compat/fnmatch.c. Yes a
> test for it would be great, in case I change my mind 2 years from now
> and decide to turn it the other way ;)

Should I patch compat/fnmatch.c too? That would make it diverge from
glibc's version.

>> 
>>>>          p_ch = 0; /* This makes "prev_ch" get set to 0. */
>>>>      } else if (p_ch == '[' && p[1] == ':') {
>>>>          const uchar *s;
>>>> @@ -245,6 +250,8 @@ static int dowild(const uchar *p, const uchar *text, unsigned int flags)
>>>>          } else if (CC_EQ(s,i, "upper")) {
>>>>              if (ISUPPER(t_ch))
>>>>                  matched = 1;
>>>> +            else if ((flags & WM_CASEFOLD) && ISLOWER(t_ch))
>>>> +                matched = 1;
>>>>          } else if (CC_EQ(s,i, "xdigit")) {
>>>>              if (ISXDIGIT(t_ch))
>>>>                  matched = 1;
>>> 
>>> If WM_CASEFOLD is set, maybe isalpha(t_ch) is enough then?
>> 
>> Yes, isalpha() is enough, but I wanted to keep the two cases separate. I can
>> amend that if you want.
> 
> Either way is fine. I don't think this code is performance critical. Your 
> call.
> --
> Duy



Re: [PATCH v3] wildmatch: properly fold case everywhere

2013-05-29 Thread Duy Nguyen
On Wed, May 29, 2013 at 8:37 PM, Anthony Ramine  wrote:
> On 29 May 2013 at 15:22, Duy Nguyen wrote:
>
>> On Tue, May 28, 2013 at 8:58 PM, Anthony Ramine  wrote:
>>> Case folding is not done correctly when matching against the [:upper:]
>>> character class and uppercased character ranges (e.g. A-Z).
>>> Specifically, an uppercase letter fails to match against any of them
>>> when case folding is requested because plain characters in the pattern
>>> and the whole string are preemptively lowercased to handle the base case
>>> fast.
>>
>> I did a little test with glibc fnmatch and also checked the source
>> code. I don't think 'a' matches [:upper:]. So I'm not sure if that's a
>> correct behavior or a bug in glibc. The spec is not clear (I think) on
>> this. I guess we should just assume that 'a' should match '[:upper:]'?
>
> I don't know, in my opinion if case folding is enabled we should say 
> [:upper:], [:lower:] and [:alpha:] are equivalent.
>
> This opinion is shared by GNU Flex [1]:
>
>>   • If your scanner is case-insensitive (the ‘-i’ flag), then 
>> ‘[:upper:]’ and ‘[:lower:]’ are equivalent to ‘[:alpha:]’.
>
> [1] http://flex.sourceforge.net/manual/Patterns.html

Then we should do it too because of this precedent, I think.

>>> @@ -196,6 +196,11 @@ static int dowild(const uchar *p, const uchar *text, unsigned int flags)
>>>          }
>>>          if (t_ch <= p_ch && t_ch >= prev_ch)
>>>              matched = 1;
>>> +        else if ((flags & WM_CASEFOLD) && ISLOWER(t_ch)) {
>>> +            uchar t_ch_upper = toupper(t_ch);
>>> +            if (t_ch_upper <= p_ch && t_ch_upper >= prev_ch)
>>> +                matched = 1;
>>> +        }
>>
>> Or we could stick with tolower. Something like this:
>>
>> if ((t_ch <= p_ch && t_ch >= prev_ch) ||
>>     ((flags & WM_CASEFOLD) &&
>>      t_ch <= tolower(p_ch) && t_ch >= tolower(prev_ch)))
>>         matched = 1;
>>
>> I think it's easier to read if we either downcase all, or upcase all, not 
>> both.
>
> If the range to match against is [A-_], it will become [a-_] which is an 
> empty range, ord('a') > ord('_'). I think it is simpler to reuse toupper() 
> after the fact as I did.
>
> Anyway maybe I should add a test for that corner case?

Yeah I was thinking about such a case, but I saw glibc do it... I
guess we just found another bug, at least in compat/fnmatch.c. Yes a
test for it would be great, in case I change my mind 2 years from now
and decide to turn it the other way ;)

>
>>>          p_ch = 0; /* This makes "prev_ch" get set to 0. */
>>>      } else if (p_ch == '[' && p[1] == ':') {
>>>          const uchar *s;
>>> @@ -245,6 +250,8 @@ static int dowild(const uchar *p, const uchar *text, unsigned int flags)
>>>          } else if (CC_EQ(s,i, "upper")) {
>>>              if (ISUPPER(t_ch))
>>>                  matched = 1;
>>> +            else if ((flags & WM_CASEFOLD) && ISLOWER(t_ch))
>>> +                matched = 1;
>>>          } else if (CC_EQ(s,i, "xdigit")) {
>>>              if (ISXDIGIT(t_ch))
>>>                  matched = 1;
>>
>> If WM_CASEFOLD is set, maybe isalpha(t_ch) is enough then?
>
> Yes, isalpha() is enough, but I wanted to keep the two cases separate. I can
> amend that if you want.

Either way is fine. I don't think this code is performance critical. Your call.
--
Duy


Re: [PATCH v3] wildmatch: properly fold case everywhere

2013-05-29 Thread Anthony Ramine
Replied inline.

Regards,

-- 
Anthony Ramine

On 29 May 2013 at 15:22, Duy Nguyen wrote:

> On Tue, May 28, 2013 at 8:58 PM, Anthony Ramine  wrote:
>> Case folding is not done correctly when matching against the [:upper:]
>> character class and uppercased character ranges (e.g. A-Z).
>> Specifically, an uppercase letter fails to match against any of them
>> when case folding is requested because plain characters in the pattern
>> and the whole string are preemptively lowercased to handle the base case
>> fast.
> 
> I did a little test with glibc fnmatch and also checked the source
> code. I don't think 'a' matches [:upper:]. So I'm not sure if that's a
> correct behavior or a bug in glibc. The spec is not clear (I think) on
> this. I guess we should just assume that 'a' should match '[:upper:]'?

I don't know, in my opinion if case folding is enabled we should say [:upper:], 
[:lower:] and [:alpha:] are equivalent.

This opinion is shared by GNU Flex [1]:

>   • If your scanner is case-insensitive (the ‘-i’ flag), then ‘[:upper:]’ 
> and ‘[:lower:]’ are equivalent to ‘[:alpha:]’.

[1] http://flex.sourceforge.net/manual/Patterns.html

>> @@ -196,6 +196,11 @@ static int dowild(const uchar *p, const uchar *text, unsigned int flags)
>>          }
>>          if (t_ch <= p_ch && t_ch >= prev_ch)
>>              matched = 1;
>> +        else if ((flags & WM_CASEFOLD) && ISLOWER(t_ch)) {
>> +            uchar t_ch_upper = toupper(t_ch);
>> +            if (t_ch_upper <= p_ch && t_ch_upper >= prev_ch)
>> +                matched = 1;
>> +        }
> 
> Or we could stick with tolower. Something like this:
> 
> if ((t_ch <= p_ch && t_ch >= prev_ch) ||
>     ((flags & WM_CASEFOLD) &&
>      t_ch <= tolower(p_ch) && t_ch >= tolower(prev_ch)))
>         matched = 1;
> 
> I think it's easier to read if we either downcase all, or upcase all, not 
> both.

If the range to match against is [A-_], it will become [a-_] which is an empty 
range, ord('a') > ord('_'). I think it is simpler to reuse toupper() after the 
fact as I did.

Anyway maybe I should add a test for that corner case?

>>          p_ch = 0; /* This makes "prev_ch" get set to 0. */
>>      } else if (p_ch == '[' && p[1] == ':') {
>>          const uchar *s;
>> @@ -245,6 +250,8 @@ static int dowild(const uchar *p, const uchar *text, unsigned int flags)
>>          } else if (CC_EQ(s,i, "upper")) {
>>              if (ISUPPER(t_ch))
>>                  matched = 1;
>> +            else if ((flags & WM_CASEFOLD) && ISLOWER(t_ch))
>> +                matched = 1;
>>          } else if (CC_EQ(s,i, "xdigit")) {
>>              if (ISXDIGIT(t_ch))
>>                  matched = 1;
> 
> If WM_CASEFOLD is set, maybe isalpha(t_ch) is enough then?

Yes, isalpha() is enough, but I wanted to keep the two cases separate. I can
amend that if you want.



Re: [PATCH v3] wildmatch: properly fold case everywhere

2013-05-29 Thread Duy Nguyen
On Tue, May 28, 2013 at 8:58 PM, Anthony Ramine  wrote:
> Case folding is not done correctly when matching against the [:upper:]
> character class and uppercased character ranges (e.g. A-Z).
> Specifically, an uppercase letter fails to match against any of them
> when case folding is requested because plain characters in the pattern
> and the whole string are preemptively lowercased to handle the base case
> fast.

I did a little test with glibc fnmatch and also checked the source
code. I don't think 'a' matches [:upper:]. So I'm not sure if that's a
correct behavior or a bug in glibc. The spec is not clear (I think) on
this. I guess we should just assume that 'a' should match '[:upper:]'?

> @@ -196,6 +196,11 @@ static int dowild(const uchar *p, const uchar *text, unsigned int flags)
>          }
>          if (t_ch <= p_ch && t_ch >= prev_ch)
>              matched = 1;
> +        else if ((flags & WM_CASEFOLD) && ISLOWER(t_ch)) {
> +            uchar t_ch_upper = toupper(t_ch);
> +            if (t_ch_upper <= p_ch && t_ch_upper >= prev_ch)
> +                matched = 1;
> +        }

Or we could stick with tolower. Something like this:

if ((t_ch <= p_ch && t_ch >= prev_ch) ||
    ((flags & WM_CASEFOLD) &&
     t_ch <= tolower(p_ch) && t_ch >= tolower(prev_ch)))
        matched = 1;

I think it's easier to read if we either downcase all, or upcase all, not both.

>          p_ch = 0; /* This makes "prev_ch" get set to 0. */
>      } else if (p_ch == '[' && p[1] == ':') {
>          const uchar *s;
> @@ -245,6 +250,8 @@ static int dowild(const uchar *p, const uchar *text, unsigned int flags)
>          } else if (CC_EQ(s,i, "upper")) {
>              if (ISUPPER(t_ch))
>                  matched = 1;
> +            else if ((flags & WM_CASEFOLD) && ISLOWER(t_ch))
> +                matched = 1;
>          } else if (CC_EQ(s,i, "xdigit")) {
>              if (ISXDIGIT(t_ch))
>                  matched = 1;

If WM_CASEFOLD is set, maybe isalpha(t_ch) is enough then?
--
Duy