Re: [PATCH v2 02/21] prefix_ref_iterator: break when we leave the prefix

2017-09-21 Thread Stefan Beller
On Wed, Sep 20, 2017 at 9:59 PM, Jeff King  wrote:

>> But this compare function is not to order by the natural encoding order,
>> but it's used to detect the '0' at the end of prefix, which orders
>> before *any* unsigned char.
>
> It's not just detecting the "0". We care about the ordering overall (so
> that "refs/foo" comes after "refs/bar", and we know that "refs/bar/baz"
> cannot come after "refs/foo", and we can stop iterating).

refs/{foo,bar,bar/baz} is all ASCII, such that the ordering by byte value
and byte position (2nd order) orders 'correctly'. correct in the sense as
Git expects. But if you use other encodings, the natural encoding
may differ from the arbitrary order that we have here. (Example
utf-8 with BOM and smileys)

However these different orders do not matter (according to my initial
conclusion), because we need (a) find out about '\0' and (b) only need
only 'arbitrary' order (which is referred to by the " ordered" trait).

>> Essentially compare_prefix wants to provide the same return
>> value as `strncmp(refname, prefix, min(strlen(refname), strlen(prefix)));`
>> except that it is optimized as we do not have to walk over a string
>> multiple times (to determine length and then pass it to compare).
>
> Hmm, yeah, I think that would be an equivalent. I didn't think of that,
> but as you say it would be less efficient.

I was just too lazy to check if we had the results of strlen around already,
such that the comparison would be equally cheap, but more readable.
Also IIRC later patches enhance this function, so we shall not make use
of the strncmp function here.

> The patch is credited to me, but I actually screwed up the ordering by
> failing to do the unsigned cast. Michael fixed that part before posting
> it. :)

Thanks,
Stefan


Re: [PATCH v2 02/21] prefix_ref_iterator: break when we leave the prefix

2017-09-21 Thread Michael Haggerty
On 09/20/2017 10:25 PM, Stefan Beller wrote:
> On Mon, Sep 18, 2017 at 11:22 PM, Michael Haggerty  
> wrote:
>> [...]
>> +/* Return -1, 0, 1 if refname is before, inside, or after the prefix. */
>> +static int compare_prefix(const char *refname, const char *prefix)
>> +{
>> +   while (*prefix) {
>> +   if (*refname != *prefix)
>> +   return ((unsigned char)*refname < (unsigned 
>> char)*prefix) ? -1 : +1;
> 
> This looks interesting.
> 
> We get a signed char* , cast it to unsigned char* and then
> compare byte by byte.

Not quite. We get a `char *` of unknown signedness (it's
implementation-dependent), dereference it into a `char`, then cast that
value to `unsigned char`. What you described would be

*(const unsigned char *)refname < *(const unsigned char *)prefix

But I assume that these two variants would result in identical assembly
code.

> [...]

Michael


Re: [PATCH v2 02/21] prefix_ref_iterator: break when we leave the prefix

2017-09-20 Thread Jeff King
On Wed, Sep 20, 2017 at 01:25:43PM -0700, Stefan Beller wrote:

> > +/* Return -1, 0, 1 if refname is before, inside, or after the prefix. */
> > +static int compare_prefix(const char *refname, const char *prefix)
> > +{
> > +   while (*prefix) {
> > +   if (*refname != *prefix)
> > +   return ((unsigned char)*refname < (unsigned 
> > char)*prefix) ? -1 : +1;
> 
> This looks interesting.
> 
> We get a signed char* , cast it to unsigned char* and then
> compare byte by byte.
> 
> The cast lead me to think that this should work for non-ASCII
> (e.g. ANSI?), but given that there are multi-byte characters (e.g.
> UTF-8) depending on your encoding used, we may not assume
> that in the given encoding the characters order by their byte order?

We don't really care about encodings here. We're doing a byte-wise
comparison. Which may include high-bit bytes for some encodings.

The important thing is that it match the byte-wise comparison that we
use elsewhere (e.g., in memcmp or strcmp), since our sort order must be
the same. And those functions are defined to do the funny "subtract as
unsigned" rule.

> But this compare function is not to order by the natural encoding order,
> but it's used to detect the '0' at the end of prefix, which orders
> before *any* unsigned char.

It's not just detecting the "0". We care about the ordering overall (so
that "refs/foo" comes after "refs/bar", and we know that "refs/bar/baz"
cannot come after "refs/foo", and we can stop iterating).

> We cannot enhance starts_with for this case as we'd have to invert
> the return value to differentiate between the prefix ordering before
> or after the given string.

Exactly. starts_with() is about a boolean match. But this is inherently
about matching the ordering of other "cmp" functions, but for a partial
match. It's possible that's something we could use in a more general
case, but I don't know of one offhand.

> Essentially compare_prefix wants to provide the same return
> value as `strncmp(refname, prefix, min(strlen(refname), strlen(prefix)));`
> except that it is optimized as we do not have to walk over a string
> multiple times (to determine length and then pass it to compare).

Hmm, yeah, I think that would be an equivalent. I didn't think of that,
but as you say it would be less efficient.

> Feel free to ignore the rambling, I just had to sort my thoughts.
> The patch looks good.

Rambling is sometimes good if it points out holes in the original
reasoning.

The patch is credited to me, but I actually screwed up the ordering by
failing to do the unsigned cast. Michael fixed that part before posting
it. :)

-Peff


Re: [PATCH v2 02/21] prefix_ref_iterator: break when we leave the prefix

2017-09-20 Thread Stefan Beller
On Mon, Sep 18, 2017 at 11:22 PM, Michael Haggerty  wrote:
> From: Jeff King 
>
> If the underlying iterator is ordered, then `prefix_ref_iterator` can
> stop as soon as it sees a refname that comes after the prefix. This
> will rarely make a big difference now, because `ref_cache_iterator`
> only iterates over the directory containing the prefix (and usually
> the prefix will span a whole directory anyway). But if *hint, hint* a
> future reference backend doesn't itself know where to stop the
> iteration, then this optimization will be a big win.
>
> Note that there is no guarantee that the underlying iterator doesn't
> include output preceding the prefix, so we have to skip over any
> unwanted references before we get to the ones that we want.
>
> Signed-off-by: Jeff King 
> Signed-off-by: Michael Haggerty 
> ---
>  refs/iterator.c | 32 +++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/refs/iterator.c b/refs/iterator.c
> index c475360f0a..bd35da4e62 100644
> --- a/refs/iterator.c
> +++ b/refs/iterator.c
> @@ -287,6 +287,20 @@ struct prefix_ref_iterator {
> int trim;
>  };
>
> +/* Return -1, 0, 1 if refname is before, inside, or after the prefix. */
> +static int compare_prefix(const char *refname, const char *prefix)
> +{
> +   while (*prefix) {
> +   if (*refname != *prefix)
> +   return ((unsigned char)*refname < (unsigned 
> char)*prefix) ? -1 : +1;

This looks interesting.

We get a signed char* , cast it to unsigned char* and then
compare byte by byte.

The cast lead me to think that this should work for non-ASCII
(e.g. ANSI?), but given that there are multi-byte characters (e.g.
UTF-8) depending on your encoding used, we may not assume
that in the given encoding the characters order by their byte order?

But this compare function is not to order by the natural encoding order,
but it's used to detect the '0' at the end of prefix, which orders
before *any* unsigned char.

We cannot enhance starts_with for this case as we'd have to invert
the return value to differentiate between the prefix ordering before
or after the given string.

Essentially compare_prefix wants to provide the same return
value as `strncmp(refname, prefix, min(strlen(refname), strlen(prefix)));`
except that it is optimized as we do not have to walk over a string
multiple times (to determine length and then pass it to compare).

Feel free to ignore the rambling, I just had to sort my thoughts.
The patch looks good.

Thanks,
Stefan

> +
> +   refname++;
> +   prefix++;
> +   }
> +
> +   return 0;
> +}
> +
>  static int prefix_ref_iterator_advance(struct ref_iterator *ref_iterator)
>  {
> struct prefix_ref_iterator *iter =
> @@ -294,9 +308,25 @@ static int prefix_ref_iterator_advance(struct 
> ref_iterator *ref_iterator)
> int ok;
>
> while ((ok = ref_iterator_advance(iter->iter0)) == ITER_OK) {
> -   if (!starts_with(iter->iter0->refname, iter->prefix))
> +   int cmp = compare_prefix(iter->iter0->refname, iter->prefix);
> +
> +   if (cmp < 0)
> continue;
>
> +   if (cmp > 0) {
> +   /*
> +* If the source iterator is ordered, then we
> +* can stop the iteration as soon as we see a
> +* refname that comes after the prefix:
> +*/
> +   if (iter->iter0->ordered) {
> +   ok = ref_iterator_abort(iter->iter0);
> +   break;
> +   } else {
> +   continue;
> +   }
> +   }
> +
> if (iter->trim) {
> /*
>  * It is nonsense to trim off characters that
> --
> 2.14.1
>