Re: [PATCH v3 4/4] urlmatch: allow globbing for the URL host part

2017-01-27 Thread Junio C Hamano
Patrick Steinhardt  writes:

>> This is probably a useful improvement.
>> 
>> Having said that, when I mentioned "glob", I meant to also support
>> something like this:
>> 
>>  https://www[1-4].ibm.com/
>
> The problem with additional extended syntax like proposed by you
> is that we would indeed need an escaping mechanism here.

True.  I think a true shell globbing is overkill (so is regexp) and
just a simple wildcarding with '*' would be a good first step that
is easy to explain and later extend as needed.

Thanks.


Re: [PATCH v3 4/4] urlmatch: allow globbing for the URL host part

2017-01-26 Thread Patrick Steinhardt
On Thu, Jan 26, 2017 at 12:43:31PM -0800, Junio C Hamano wrote:
> Patrick Steinhardt  writes:
> 
> > The URL matching function computes for two URLs whether they match not.
> > The match is performed by splitting up the URL into different parts and
> > then doing an exact comparison with the to-be-matched URL.
> >
> > The main user of `urlmatch` is the configuration subsystem. It allows to
> > set certain configurations based on the URL which is being connected to
> > via keys like `http..*`. A common use case for this is to set
> > proxies for only some remotes which match the given URL. Unfortunately,
> > having exact matches for all parts of the URL can become quite tedious
> > in some setups. Imagine for example a corporate network where there are
> > dozens or even hundreds of subdomains, which would have to be configured
> > individually.
> >
> > This commit introduces the ability to use globbing in the host-part of
> > the URLs. A user can simply specify a `*` as part of the host name to
> > match all subdomains at this level. For example adding a configuration
> > key `http.https://*.example.com.proxy` will match all subdomains of
> > `https://example.com`.
> 
> This is probably a useful improvement.
> 
> Having said that, when I mentioned "glob", I meant to also support
> something like this:
> 
>   https://www[1-4].ibm.com/

The problem with additional extended syntax like proposed by you
is that we would indeed need an escaping mechanism here. '[]' are
already allowed inside the host part to enable IPv6 hosts of the
form 'https://[2001:0db8:]/', so the syntax is now ambiguous. So
we have to be cautios which characters to enable for globbing
syntax. As of now, I think we can only safely include '*' and '?'
here without escaping mechanisms.

If additional use cases come up we might still extend the syntax
later on to allow for more special syntax.

> And when people read "glob", that is what they expect.
> 
> So calling this "the ability to use globbing" is misleading.
> The last paragraph in the log message above needs a bit of
> tweaking, perhaps like this:
> 
>   Allow users to write an asterisk '*' in place of any 'host'
>   or 'subdomain' label as part of the host name.  For example,
>   "http.https://*.example.com.proxy; sets "http.proxy" for all
>   direct subdomains of "https://example.com;,
>   e.g. "https://foo.example.com;, but not
>   "https://foo.bar.example.com;.
> 
> Fortunately, your update to config.txt, which is facing the end
> users, does not misuse the word and instead is explicit that the
> only thing the matcher does is to match '*' to a single hierarchy.
> It is clear that even http://www*.ibm.com/ is not supported from
> the description, which is good.

I agree that globbing is the wrong word here. I'll swap in
"wildcard" where applicable.

I'll send a version 4 later on. Thanks again for your feedback
and improvements.

Regards
Patrick


signature.asc
Description: PGP signature


Re: [PATCH v3 4/4] urlmatch: allow globbing for the URL host part

2017-01-26 Thread Junio C Hamano
Junio C Hamano  writes:

> Junio C Hamano  writes:
>
>> +while (url_len && pat_len) {
>> +const char *url_next = end_of_token(url, '.', url_len);
>> +const char *pat_next = end_of_token(pat, '.', pat_len);
>> + ...
>>  }
>>  
>> +return 1;
>
> Embarrassing.  The last one must be "have they both run out?" i.e.
>
>   return (!url_len && !pat_len);

OK, here is my second try.  The added test piece is to catch the
silly mistake I made above.

diff --git a/t/t1300-repo-config.sh b/t/t1300-repo-config.sh
index ec545e0929..33fd59fbb3 100755
--- a/t/t1300-repo-config.sh
+++ b/t/t1300-repo-config.sh
@@ -1177,7 +1177,7 @@ test_expect_success 'urlmatch' '
test_cmp expect actual
 '
 
-test_expect_success 'glob-based urlmatch' '
+test_expect_success 'urlmatch with wildcard' '
cat >.git/config <<-\EOF &&
[http]
sslVerify
@@ -1210,6 +1210,10 @@ test_expect_success 'glob-based urlmatch' '
echo http.sslverify false
} >expect &&
git config --get-urlmatch HTTP https://good.example.com >actual &&
+   test_cmp expect actual &&
+
+   echo http.sslverify >expect &&
+   git config --get-urlmatch HTTP https://more.example.com.au >actual &&
test_cmp expect actual
 '
 
diff --git a/urlmatch.c b/urlmatch.c
index 53ff972a60..0e007a3e07 100644
--- a/urlmatch.c
+++ b/urlmatch.c
@@ -63,36 +63,47 @@ static int append_normalized_escapes(struct strbuf *buf,
return 1;
 }
 
+static const char *end_of_token(const char *s, int c, size_t n)
+{
+   const char *next = memchr(s, c, n);
+   if (!next)
+   next = s + n;
+   return next;
+}
+
 static int match_host(const struct url_info *url_info,
  const struct url_info *pattern_info)
 {
-   char *url = xmemdupz(url_info->url + url_info->host_off, 
url_info->host_len);
-   char *pat = xmemdupz(pattern_info->url + pattern_info->host_off, 
pattern_info->host_len);
-   char *url_tok, *pat_tok, *url_save, *pat_save;
-   int matching;
-
-   url_tok = strtok_r(url, ".", _save);
-   pat_tok = strtok_r(pat, ".", _save);
-
-   for (; url_tok && pat_tok; url_tok = strtok_r(NULL, ".", _save),
-  pat_tok = strtok_r(NULL, ".", _save)) {
-   if (!strcmp(pat_tok, "*"))
-   continue; /* a simple glob matches everything */
-
-   if (strcmp(url_tok, pat_tok)) {
-   /* subdomains do not match */
-   matching = 0;
-   break;
-   }
+   const char *url = url_info->url + url_info->host_off;
+   const char *pat = pattern_info->url + pattern_info->host_off;
+   int url_len = url_info->host_len;
+   int pat_len = pattern_info->host_len;
+
+   while (url_len && pat_len) {
+   const char *url_next = end_of_token(url, '.', url_len);
+   const char *pat_next = end_of_token(pat, '.', pat_len);
+
+   if (pat_next == pat + 1 && pat[0] == '*')
+   /* wildcard matches anything */
+   ;
+   else if ((pat_next - pat) == (url_next - url) &&
+!memcmp(url, pat, url_next - url))
+   /* the components are the same */
+   ;
+   else
+   return 0; /* found an unmatch */
+
+   if (url_next < url + url_len)
+   url_next++;
+   url_len -= url_next - url;
+   url = url_next;
+   if (pat_next < pat + pat_len)
+   pat_next++;
+   pat_len -= pat_next - pat;
+   pat = pat_next;
}
 
-   /* matching if both URL and pattern are at their ends */
-   matching = (url_tok == NULL && pat_tok == NULL);
-
-   free(url);
-   free(pat);
-
-   return matching;
+   return (!url_len && !pat_len);
 }
 
 static char *url_normalize_1(const char *url, struct url_info *out_info, char 
allow_globs)


Re: [PATCH v3 4/4] urlmatch: allow globbing for the URL host part

2017-01-26 Thread Junio C Hamano
Junio C Hamano  writes:

> + while (url_len && pat_len) {
> + const char *url_next = end_of_token(url, '.', url_len);
> + const char *pat_next = end_of_token(pat, '.', pat_len);
> + ...
>   }
>  
> + return 1;

Embarrassing.  The last one must be "have they both run out?" i.e.

return (!url_len && !pat_len);



Re: [PATCH v3 4/4] urlmatch: allow globbing for the URL host part

2017-01-26 Thread Junio C Hamano
Patrick Steinhardt  writes:

> The URL matching function computes for two URLs whether they match not.
> The match is performed by splitting up the URL into different parts and
> then doing an exact comparison with the to-be-matched URL.
>
> The main user of `urlmatch` is the configuration subsystem. It allows to
> set certain configurations based on the URL which is being connected to
> via keys like `http..*`. A common use case for this is to set
> proxies for only some remotes which match the given URL. Unfortunately,
> having exact matches for all parts of the URL can become quite tedious
> in some setups. Imagine for example a corporate network where there are
> dozens or even hundreds of subdomains, which would have to be configured
> individually.
>
> This commit introduces the ability to use globbing in the host-part of
> the URLs. A user can simply specify a `*` as part of the host name to
> match all subdomains at this level. For example adding a configuration
> key `http.https://*.example.com.proxy` will match all subdomains of
> `https://example.com`.

This is probably a useful improvement.

Having said that, when I mentioned "glob", I meant to also support
something like this:

https://www[1-4].ibm.com/

And when people read "glob", that is what they expect.

So calling this "the ability to use globbing" is misleading.
The last paragraph in the log message above needs a bit of
tweaking, perhaps like this:

Allow users to write an asterisk '*' in place of any 'host'
or 'subdomain' label as part of the host name.  For example,
"http.https://*.example.com.proxy; sets "http.proxy" for all
direct subdomains of "https://example.com;,
e.g. "https://foo.example.com;, but not
"https://foo.bar.example.com;.

Fortunately, your update to config.txt, which is facing the end
users, does not misuse the word and instead is explicit that the
only thing the matcher does is to match '*' to a single hierarchy.
It is clear that even http://www*.ibm.com/ is not supported from
the description, which is good.

>  . Host/domain name (e.g., `example.com` in `https://example.com/`).
> -  This field must match exactly between the config key and the URL.
> +  This field must match between the config key and the URL. It is
> +  possible to specify a `*` as part of the host name to match all subdomains
> +  at this level. `https://*.example.com/` for example would match
> +  `https://foo.example.com/`, but not `https://foo.bar.example.com/`.

This is good as-is.

>  . Port number (e.g., `8080` in `http://example.com:8080/`).
>This field must match exactly between the config key and the URL.
> diff --git a/t/t1300-repo-config.sh b/t/t1300-repo-config.sh
> index 923bfc5a2..ec545e092 100755
> --- a/t/t1300-repo-config.sh
> +++ b/t/t1300-repo-config.sh
> @@ -1177,6 +1177,42 @@ test_expect_success 'urlmatch' '
>   test_cmp expect actual
>  '
>  
> +test_expect_success 'glob-based urlmatch' '

This is not "glob".  A more generic term "wildcard" is OK.

> + cat >.git/config <<-\EOF &&
> + [http]
> + sslVerify
> ...
> +static int match_host(const struct url_info *url_info,
> +   const struct url_info *pattern_info)
> +{
> + char *url = xmemdupz(url_info->url + url_info->host_off, 
> url_info->host_len);
> + char *pat = xmemdupz(pattern_info->url + pattern_info->host_off, 
> pattern_info->host_len);
> + char *url_tok, *pat_tok, *url_save, *pat_save;
> + int matching;
> +
> + url_tok = strtok_r(url, ".", _save);
> + pat_tok = strtok_r(pat, ".", _save);

Hmph, this will be the first use of strtok_r() in our codebase.
Does everybody have it?

For a use like this where your delimiter set is a singleton, it may
be simpler to do the usual strchrnul() or memchr() based loop.  The
attached is my attempt to do so on top of this patch.

> +
> + for (; url_tok && pat_tok; url_tok = strtok_r(NULL, ".", _save),
> +pat_tok = strtok_r(NULL, ".", _save)) {
> + if (!strcmp(pat_tok, "*"))
> + continue; /* a simple glob matches everything */

s/glob/asterisk/

Other than that, the patch looks OK.



diff --git a/urlmatch.c b/urlmatch.c
index 53ff972a60..8dfc7fd28a 100644
--- a/urlmatch.c
+++ b/urlmatch.c
@@ -63,36 +63,47 @@ static int append_normalized_escapes(struct strbuf *buf,
return 1;
 }
 
+static const char *end_of_token(const char *s, int c, size_t n)
+{
+   const char *next = memchr(s, c, n);
+   if (!next)
+   next = s + n;
+   return next;
+}
+
 static int match_host(const struct url_info *url_info,
  const struct url_info *pattern_info)
 {
-   char *url = xmemdupz(url_info->url + url_info->host_off, 
url_info->host_len);
-   char *pat = xmemdupz(pattern_info->url + pattern_info->host_off, 
pattern_info->host_len);
-   char *url_tok, *pat_tok, *url_save, 

[PATCH v3 4/4] urlmatch: allow globbing for the URL host part

2017-01-25 Thread Patrick Steinhardt
The URL matching function computes for two URLs whether they match not.
The match is performed by splitting up the URL into different parts and
then doing an exact comparison with the to-be-matched URL.

The main user of `urlmatch` is the configuration subsystem. It allows to
set certain configurations based on the URL which is being connected to
via keys like `http..*`. A common use case for this is to set
proxies for only some remotes which match the given URL. Unfortunately,
having exact matches for all parts of the URL can become quite tedious
in some setups. Imagine for example a corporate network where there are
dozens or even hundreds of subdomains, which would have to be configured
individually.

This commit introduces the ability to use globbing in the host-part of
the URLs. A user can simply specify a `*` as part of the host name to
match all subdomains at this level. For example adding a configuration
key `http.https://*.example.com.proxy` will match all subdomains of
`https://example.com`.

Signed-off-by: Patrick Steinhardt 
---
 Documentation/config.txt |  5 -
 t/t1300-repo-config.sh   | 36 
 urlmatch.c   | 38 ++
 3 files changed, 74 insertions(+), 5 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 506431267..078e9b490 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1914,7 +1914,10 @@ http..*::
   must match exactly between the config key and the URL.
 
 . Host/domain name (e.g., `example.com` in `https://example.com/`).
-  This field must match exactly between the config key and the URL.
+  This field must match between the config key and the URL. It is
+  possible to specify a `*` as part of the host name to match all subdomains
+  at this level. `https://*.example.com/` for example would match
+  `https://foo.example.com/`, but not `https://foo.bar.example.com/`.
 
 . Port number (e.g., `8080` in `http://example.com:8080/`).
   This field must match exactly between the config key and the URL.
diff --git a/t/t1300-repo-config.sh b/t/t1300-repo-config.sh
index 923bfc5a2..ec545e092 100755
--- a/t/t1300-repo-config.sh
+++ b/t/t1300-repo-config.sh
@@ -1177,6 +1177,42 @@ test_expect_success 'urlmatch' '
test_cmp expect actual
 '
 
+test_expect_success 'glob-based urlmatch' '
+   cat >.git/config <<-\EOF &&
+   [http]
+   sslVerify
+   [http "https://*.example.com;]
+   sslVerify = false
+   cookieFile = /tmp/cookie.txt
+   EOF
+
+   test_expect_code 1 git config --bool --get-urlmatch doesnt.exist 
https://good.example.com >actual &&
+   test_must_be_empty actual &&
+
+   echo true >expect &&
+   git config --bool --get-urlmatch http.SSLverify https://example.com 
>actual &&
+   test_cmp expect actual &&
+
+   echo true >expect &&
+   git config --bool --get-urlmatch http.SSLverify 
https://good-example.com >actual &&
+   test_cmp expect actual &&
+
+   echo true >expect &&
+   git config --bool --get-urlmatch http.sslverify 
https://deep.nested.example.com >actual &&
+   test_cmp expect actual &&
+
+   echo false >expect &&
+   git config --bool --get-urlmatch http.sslverify 
https://good.example.com >actual &&
+   test_cmp expect actual &&
+
+   {
+   echo http.cookiefile /tmp/cookie.txt &&
+   echo http.sslverify false
+   } >expect &&
+   git config --get-urlmatch HTTP https://good.example.com >actual &&
+   test_cmp expect actual
+'
+
 # good section hygiene
 test_expect_failure 'unsetting the last key in a section removes header' '
cat >.git/config <<-\EOF &&
diff --git a/urlmatch.c b/urlmatch.c
index e328905eb..53ff972a6 100644
--- a/urlmatch.c
+++ b/urlmatch.c
@@ -63,6 +63,38 @@ static int append_normalized_escapes(struct strbuf *buf,
return 1;
 }
 
+static int match_host(const struct url_info *url_info,
+ const struct url_info *pattern_info)
+{
+   char *url = xmemdupz(url_info->url + url_info->host_off, 
url_info->host_len);
+   char *pat = xmemdupz(pattern_info->url + pattern_info->host_off, 
pattern_info->host_len);
+   char *url_tok, *pat_tok, *url_save, *pat_save;
+   int matching;
+
+   url_tok = strtok_r(url, ".", _save);
+   pat_tok = strtok_r(pat, ".", _save);
+
+   for (; url_tok && pat_tok; url_tok = strtok_r(NULL, ".", _save),
+  pat_tok = strtok_r(NULL, ".", _save)) {
+   if (!strcmp(pat_tok, "*"))
+   continue; /* a simple glob matches everything */
+
+   if (strcmp(url_tok, pat_tok)) {
+   /* subdomains do not match */
+   matching = 0;
+   break;
+   }
+   }
+
+   /* matching if both URL and pattern are at their ends */
+