Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-14 Thread Quint Guvernator
2014-03-14 0:57 GMT-04:00 Jeff King p...@peff.net:
 This discussion ended up encompassing a lot of other related cleanups. I
 hope we didn't scare you away. :)

I don't think you could; this community is much more accepting than
other software communities around the web. The fact that I received
constructive feedback rather than a lecture when formatting issues
slipped my mind (i.e. forgetting [PATCH v2]) is reason enough to stick
around!

 My understanding is that you were approaching this as a micro-project
 for GSoC. I'd love it if you want to pick up and run with some of the
 ideas discussed here. But as far as a microproject goes, I think it
 would make sense to identify one or two no-brainer improvement spots by
 hand, and submit a patch with just those (and I think Junio gave some
 good guidelines in his reply).

I agree with trying to push a few uncontroversial changes through. I'd
love to take a deeper look at these helper functions and related
cleanups…perhaps it would be worth it to identify a few key areas to
work on in addition to a main GSoC project? In fact, the project I'm
looking to take on (rebase --interactive) also involves code cleanup
and might not take all summer, so I could see how those could work
well together in a proposal.

I'll be re-reading this thread and working on this patch over the
weekend to try to identify the more straightforward hunks I could
submit in a patch.

Thanks Peff and everyone else for your help.
Quint
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-14 Thread Junio C Hamano
Quint Guvernator quintus.pub...@gmail.com writes:

 I'll be re-reading this thread and working on this patch over the
 weekend to try to identify the more straightforward hunks I could
 submit in a patch.

Thanks.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-13 Thread David Kastrup
Junio C Hamano gits...@pobox.com writes:

 Taking two random examples from an early and a late parts of the
 patch:

 --- a/builtin/cat-file.c
 +++ b/builtin/cat-file.c
 @@ -82,7 +82,7 @@ static int cat_one_file(int opt, const char *exp_type, 
 const char *obj_name)
   enum object_type type;
   unsigned long size;
   char *buffer = read_sha1_file(sha1, type, 
 size);
 - if (memcmp(buffer, object , 7) ||
 + if (!starts_with(buffer, object ) ||

[...]

 The original hunks show that the code knows and relies on magic
 numbers 7 and 8 very clearly and there are rooms for improvement.

Like: what if the file is empty?

-- 
David Kastrup
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-13 Thread Junio C Hamano
Quint Guvernator quintus.pub...@gmail.com writes:

 The result after the conversion, however, still have the same magic
 numbers, but one less of them each.  Doesn't it make it harder to
 later spot the patterns to come up with a better abstraction that
 does not rely on the magic number?

 It is _not_ my goal to make the code harder to maintain down the road.

Good.

 So, at this point, which hunks (if any) are worth patching?

Will, I am not going through all the mechanical hits to memcmp() and
judge each and every one if it is a good idea to convert.  Anybody
who does so in order to tell you which hunks are worth patching
would end up being the one doing the real work, and at that point
there is nothing left to be credited as your work anymore ;-)

But as Peff said, there are good bits, like these ones just for a
few examples.

diff --git a/builtin/apply.c b/builtin/apply.c
index a7e72d5..16c20af 100644
--- a/builtin/apply.c
+++ b/builtin/apply.c
@@ -1631,7 +1631,7 @@ static int parse_fragment(const char *line, unsigned long 
size,
 * l10n of \ No newline... is at least that long.
 */
case '\\':
-   if (len  12 || memcmp(line, \\ , 2))
+   if (len  12 || !starts_with(line, \\ ))
return -1;
break;
}
@@ -1646,7 +1646,7 @@ static int parse_fragment(const char *line, unsigned long 
size,
 * it in the above loop because we hit oldlines == newlines == 0
 * before seeing it.
 */
-   if (12  size  !memcmp(line, \\ , 2))
+   if (12  size  starts_with(line, \\ ))
offset += linelen(line, size);
 
patch-lines_added += added;

These two are about An incomplete line marker begins with a
backslash and a SP and there is no other significance in the
constant 2 (like, after we recognise the match, we start scanning
the remainder of the line starting at the offset 2).

It is a tangent but I notice that these two parts (both in the
original and in the version after patch) contradict what the
incomplete last line marker should look like in a minor detail.

On the other hand, I think this one from nearby is iffy.

@@ -846,8 +846,8 @@ static int has_epoch_timestamp(const char *nameline)
 * -MM-DD hh:mm:ss must be from either 1969-12-31
 * (west of GMT) or 1970-01-01 (east of GMT)
 */
-   if ((zoneoffset  0  memcmp(timestamp, 1969-12-31, 10)) ||
-   (0 = zoneoffset  memcmp(timestamp, 1970-01-01, 10)))
+   if ((zoneoffset  0  !starts_with(timestamp, 1969-12-31)) ||
+   (0 = zoneoffset  !starts_with(timestamp, 1970-01-01)))
return 0;
 
hourminute = (strtol(timestamp + 11, NULL, 10) * 60 +

If one looks at the post-context of the hunk, one would realize that
this codepath very intimately knows how the timestamp should look
like at the byte-offset level, not just -MM-DD ought to be
10-byte long, but there should be two-digit hour part after
skipping one byte after that -MM-DD part, followed by two-digit
minute part after further skipping one byte, knowing that these
details are guaranteed by the stamp_regexp[] pattern it earlier made
sure the given string would match.

I do not think it is a good idea to reduce this kind of precise
format knowledge from this function in the first place (after all,
this is parsing a header line in a traditional diff whose format is
known to us), and even if it were our eventual goal to reduce the
precise format knowledge, it would not help very much to get rid of
constant 10 only from these two memcmp() calls, and that is why I
think this hunk is iffy.

Hope the above shows what kind of thinking is needed to judge each
change from memcmp() to !starts_with().

Thanks.


--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-13 Thread Junio C Hamano
David Kastrup d...@gnu.org writes:

 Junio C Hamano gits...@pobox.com writes:

 Taking two random examples from an early and a late parts of the
 patch:

 --- a/builtin/cat-file.c
 +++ b/builtin/cat-file.c
 @@ -82,7 +82,7 @@ static int cat_one_file(int opt, const char *exp_type, 
 const char *obj_name)
  enum object_type type;
  unsigned long size;
  char *buffer = read_sha1_file(sha1, type, 
 size);
 -if (memcmp(buffer, object , 7) ||
 +if (!starts_with(buffer, object ) ||

 [...]

 The original hunks show that the code knows and relies on magic
 numbers 7 and 8 very clearly and there are rooms for improvement.

 Like: what if the file is empty?

Yes.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-13 Thread Jeff King
On Thu, Mar 13, 2014 at 10:47:28AM -0700, Junio C Hamano wrote:

  --- a/builtin/cat-file.c
  +++ b/builtin/cat-file.c
  @@ -82,7 +82,7 @@ static int cat_one_file(int opt, const char *exp_type, 
  const char *obj_name)
 enum object_type type;
 unsigned long size;
 char *buffer = read_sha1_file(sha1, type, 
  size);
  -  if (memcmp(buffer, object , 7) ||
  +  if (!starts_with(buffer, object ) ||
 
  [...]
 
  The original hunks show that the code knows and relies on magic
  numbers 7 and 8 very clearly and there are rooms for improvement.
 
  Like: what if the file is empty?
 
 Yes.

I think this one is actually OK. The result of read_sha1_file is
NUL-terminated, and we know that starts_with reads byte-by-byte (the
prior memcmp is wrong, but only if you care about accessing bytes past
the end of the NUL).

That whole piece of code seems silly, though, IMHO. It should be using
parse_tag or peel_to_type.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-13 Thread Jeff King
On Wed, Mar 12, 2014 at 11:33:50PM -0400, Quint Guvernator wrote:

 It is _not_ my goal to make the code harder to maintain down the road.
 So, at this point, which hunks (if any) are worth patching?

This discussion ended up encompassing a lot of other related cleanups. I
hope we didn't scare you away. :)

My understanding is that you were approaching this as a micro-project
for GSoC. I'd love it if you want to pick up and run with some of the
ideas discussed here. But as far as a microproject goes, I think it
would make sense to identify one or two no-brainer improvement spots by
hand, and submit a patch with just those (and I think Junio gave some
good guidelines in his reply).

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-12 Thread Jeff King
On Wed, Mar 12, 2014 at 10:43:54AM -0400, Quint Guvernator wrote:

 memcmp() is replaced with negated starts_with() when comparing strings
 from the beginning. starts_with() looks nicer and it saves the extra
 argument of the length of the comparing string.

Thanks, I think this is a real readability improvement in most cases.

One of the things to do when reviewing a patch like this is make sure
that there aren't any silly arithmetic mistakes. Checking that can be
tedious, so I thought about how _I_ would do it programatically, and
then compared our results.

I tried:

  perl -i -lpe '
s/memcmp\(([^,]+), (.*?), (\d+)\)/
 length($2) == $3 ?
   qq{!starts_with($1, $2)} :
   $
/ge
  ' $(git ls-files '*.c')

That comes up with almost the same results as yours, but misses a few
cases that you caught (in addition to the need to simplify
!!starts_with()):

  - memcmp(foo, bar, strlen(bar))

  - strings with interpolation, like foo\n, which is actually 4
characters in length, not 5.

But there were only a few of those, and I hand-verified them. So I feel
pretty good that the changes are all correct transformations.

That leaves the question of whether they are all improvements in
readability (more on that below).

 note: this commit properly handles negation in starts_with()
 
 Signed-off-by: Quint Guvernator quintus.pub...@gmail.com
 ---

This note should go after the --- line so that it is not part of the
commit message (it is only interesting to people seeing v2 and wondering
what changed from v1 earlier on the list, not people reading the history
much later).

 diff --git a/builtin/commit-tree.c b/builtin/commit-tree.c
 index 987a4c3..2777519 100644
 --- a/builtin/commit-tree.c
 +++ b/builtin/commit-tree.c
 @@ -66,7 +66,7 @@ int cmd_commit_tree(int argc, const char **argv, const char 
 *prefix)
   continue;
   }
  
 - if (!memcmp(arg, -S, 2)) {
 + if (starts_with(arg, -S)) {
   sign_commit = arg + 2;
   continue;
   }

This is an improvement, but we still have the magic 2 below. Would
skip_prefix be a better match here, like:

  if ((val = skip_prefix(arg, -S))) {
  sign_commit = val;
  continue;
  }

We've also talked about refactoring skip_prefix to return a boolean, and
put the value in an out-parameter. Which would make this even more
readable:

  if (skip_prefix(arg, -S, sign_commit))
  continue;

Maybe the time has come to do that.

 --- a/builtin/mailinfo.c
 +++ b/builtin/mailinfo.c
 @@ -626,7 +626,7 @@ static int handle_boundary(void)
   strbuf_addch(newline, '\n');
  again:
   if (line.len = (*content_top)-len + 2 
 - !memcmp(line.buf + (*content_top)-len, --, 2)) {
 + starts_with(line.buf + (*content_top)-len, --)) {

I'm not sure if this improves readability or not. It's certainly nice to
get rid of the magic 2, but starts_with is a bit of a misnomer, since
we are indexing deep into the buffer anyway. And we still have the magic
2 above anyway.

I'm on the fence.

 @@ -727,8 +727,8 @@ static int is_scissors_line(const struct strbuf *line)
   continue;
   }
   if (i + 1  len 
 - (!memcmp(buf + i, 8, 2) || !memcmp(buf + i, 8, 2) ||
 -  !memcmp(buf + i, %, 2) || !memcmp(buf + i, %, 2))) {
 + (starts_with(buf + i, 8) || starts_with(buf + i, 8) ||
 +  starts_with(buf + i, %) || starts_with(buf + i, %))) 
 {

Same as above.

 @@ -66,7 +66,7 @@ static int verify_tag(char *buffer, unsigned long size)
   return error(char%PRIuMAX: could not find next \\\n\,
   (uintmax_t) (type_line - buffer));
   tag_line++;
 - if (memcmp(tag_line, tag , 4) || tag_line[4] == '\n')
 + if (!starts_with(tag_line, tag ) || tag_line[4] == '\n')
   return error(char%PRIuMAX: no \tag \ found,
   (uintmax_t) (tag_line - buffer));

I think this is another that could benefit from an enhanced skip_prefix:

  if (!skip_prefix(tag_line, tag , tag_name) || *tag_name == '\n')
 ...

and then we can get rid of the tag_line += 4 that is used much later
(in fact, I suspect that whole function could be improved in this
respect).

 @@ -269,7 +269,7 @@ int parse_commit_buffer(struct commit *item, const void 
 *buffer, unsigned long s
   return 0;
   item-object.parsed = 1;
   tail += size;
 - if (tail = bufptr + 46 || memcmp(bufptr, tree , 5) || bufptr[45] != 
 '\n')
 + if (tail = bufptr + 46 || !starts_with(bufptr, tree ) || bufptr[45] 
 != '\n')
   return error(bogus commit object %s, 
 sha1_to_hex(item-object.sha1));
   if (get_sha1_hex(bufptr + 5, parent)  0)

Again, we just use bufptr + 5 a bit later here. So a candidate for
skip_prefix.

   graft = lookup_commit_graft(item-object.sha1);
 - while (bufptr + 48  

Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-12 Thread Junio C Hamano
Jeff King p...@peff.net writes:

  static inline int standard_header_field(const char *field, size_t len)
  {
 -return ((len == 4  !memcmp(field, tree , 5)) ||
 -(len == 6  !memcmp(field, parent , 7)) ||
 -(len == 6  !memcmp(field, author , 7)) ||
 -(len == 9  !memcmp(field, committer , 10)) ||
 -(len == 8  !memcmp(field, encoding , 9)));
 +return ((len == 4  starts_with(field, tree )) ||
 +(len == 6  starts_with(field, parent )) ||
 +(len == 6  starts_with(field, author )) ||
 +(len == 9  starts_with(field, committer )) ||
 +(len == 8  starts_with(field, encoding )));

 These extra len checks are interesting.  They look like an attempt to
 optimize lookup, since the caller will already have scanned forward to
 the space.

If one really wants to remove the magic constants from this, then
one must take advantage of the pattern

len == strlen(S) - 1  !memcmp(field, S, strlen(S))

that appears here, and come up with a simple abstraction to express
that we are only using the string S (e.g. tree ), length len and
location field of the counted string.

Blindly replacing starts_with() with !memcmp() in the above part is
a readability regression otherwise.

 ... I
 think with a few more helpers we could really further clean up some of
 these callsites.

Yes.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-12 Thread Jeff King
On Wed, Mar 12, 2014 at 12:39:01PM -0700, Junio C Hamano wrote:

 Jeff King p...@peff.net writes:
 
   static inline int standard_header_field(const char *field, size_t len)
   {
  -  return ((len == 4  !memcmp(field, tree , 5)) ||
  -  (len == 6  !memcmp(field, parent , 7)) ||
  -  (len == 6  !memcmp(field, author , 7)) ||
  -  (len == 9  !memcmp(field, committer , 10)) ||
  -  (len == 8  !memcmp(field, encoding , 9)));
  +  return ((len == 4  starts_with(field, tree )) ||
  +  (len == 6  starts_with(field, parent )) ||
  +  (len == 6  starts_with(field, author )) ||
  +  (len == 9  starts_with(field, committer )) ||
  +  (len == 8  starts_with(field, encoding )));
 
  These extra len checks are interesting.  They look like an attempt to
  optimize lookup, since the caller will already have scanned forward to
  the space.
 
 If one really wants to remove the magic constants from this, then
 one must take advantage of the pattern
 
   len == strlen(S) - 1  !memcmp(field, S, strlen(S))
 
 that appears here, and come up with a simple abstraction to express
 that we are only using the string S (e.g. tree ), length len and
 location field of the counted string.
 
 Blindly replacing starts_with() with !memcmp() in the above part is
 a readability regression otherwise.

I actually think the right solution is:

  static inline int standard_header_field(const char *field, size_t len)
  {
  return mem_equals(field, len, tree ) ||
 mem_equals(field, len, parent ) ||
 ...;
  }

and the caller should tell us it's OK to look at field[len]:

  standard_header_field(line, eof - line + 1)

We could also omit the space from the standard_header_field. The caller
just ran strchr() looking for the space, so we know that either it is
there, or we are at the end of the line/buffer. Arguably a string like
parent\n should be a parent header with no data (but right now it is
not matched by this function). I'm not aware of an implementation that
writes such a thing, but it seems to fall in the be liberal in what you
accept category.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-12 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 Blindly replacing starts_with() with !memcmp() in the above part is
 a readability regression otherwise.

 I actually think the right solution is:

   static inline int standard_header_field(const char *field, size_t len)
   {
   return mem_equals(field, len, tree ) ||
  mem_equals(field, len, parent ) ||
  ...;
   }

 and the caller should tell us it's OK to look at field[len]:

   standard_header_field(line, eof - line + 1)

 We could also omit the space from the standard_header_field.

Yes, that was what I had in mind.  The only reason why the callee
(over-)optimizes the SP must follow these know keywords part by
using the extra len parameter is because the caller has to do a
single strchr() to skip an arbitrary field name anyway so computing
len is essentially free.

 The caller
 just ran strchr() looking for the space, so we know that either it is
 there, or we are at the end of the line/buffer. Arguably a string like
 parent\n should be a parent header with no data (but right now it is
 not matched by this function). I'm not aware of an implementation that
 writes such a thing, but it seems to fall in the be liberal in what you
 accept category.

It is _not_ a standard header field, so it will be read by the logic
in the caller as a non-standard header field without getting lost.

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-12 Thread David Kastrup
Junio C Hamano gits...@pobox.com writes:

 Jeff King p...@peff.net writes:

  static inline int standard_header_field(const char *field, size_t len)
  {
 -   return ((len == 4  !memcmp(field, tree , 5)) ||
 -   (len == 6  !memcmp(field, parent , 7)) ||
 -   (len == 6  !memcmp(field, author , 7)) ||
 -   (len == 9  !memcmp(field, committer , 10)) ||
 -   (len == 8  !memcmp(field, encoding , 9)));
 +   return ((len == 4  starts_with(field, tree )) ||
 +   (len == 6  starts_with(field, parent )) ||
 +   (len == 6  starts_with(field, author )) ||
 +   (len == 9  starts_with(field, committer )) ||
 +   (len == 8  starts_with(field, encoding )));

 These extra len checks are interesting.  They look like an attempt to
 optimize lookup, since the caller will already have scanned forward to
 the space.

 If one really wants to remove the magic constants from this, then
 one must take advantage of the pattern

   len == strlen(S) - 1  !memcmp(field, S, strlen(S))

If the caller has already scanned forward to the space, then there is no
actual need to let the comparison compare the space again after checking
len, is there?  That makes for a more consistent look in the memcmp
case.

One could do this in a switch on len instead.  Not that it seems
terribly more efficient.

 that appears here, and come up with a simple abstraction to express
 that we are only using the string S (e.g. tree ), length len and
 location field of the counted string.

 Blindly replacing starts_with() with !memcmp() in the above part is
 a readability regression otherwise.

Don't really think so: if the len at the front and the back is the same
and the space is not explicitly compared any more, both look pretty much
the same to me.

 ... I think with a few more helpers we could really further clean up
 some of these callsites.

 Yes.

Possibly.  But it does seem like overkill.

-- 
David Kastrup
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-12 Thread René Scharfe

Am 12.03.2014 20:39, schrieb Junio C Hamano:

Jeff King p...@peff.net writes:


  static inline int standard_header_field(const char *field, size_t len)
  {
-   return ((len == 4  !memcmp(field, tree , 5)) ||
-   (len == 6  !memcmp(field, parent , 7)) ||
-   (len == 6  !memcmp(field, author , 7)) ||
-   (len == 9  !memcmp(field, committer , 10)) ||
-   (len == 8  !memcmp(field, encoding , 9)));
+   return ((len == 4  starts_with(field, tree )) ||
+   (len == 6  starts_with(field, parent )) ||
+   (len == 6  starts_with(field, author )) ||
+   (len == 9  starts_with(field, committer )) ||
+   (len == 8  starts_with(field, encoding )));


These extra len checks are interesting.  They look like an attempt to
optimize lookup, since the caller will already have scanned forward to
the space.


I wonder what the performance impact might be.  The length checks are 
also needed for correctness, however, to avoid running over the end of 
the buffer.



If one really wants to remove the magic constants from this, then
one must take advantage of the pattern

len == strlen(S) - 1  !memcmp(field, S, strlen(S))

that appears here, and come up with a simple abstraction to express
that we are only using the string S (e.g. tree ), length len and
location field of the counted string.


This might be a job for kwset.

René

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-12 Thread Jeff King
On Wed, Mar 12, 2014 at 01:08:03PM -0700, Junio C Hamano wrote:

 Jeff King p...@peff.net writes:
 
  Blindly replacing starts_with() with !memcmp() in the above part is
  a readability regression otherwise.
 
  I actually think the right solution is:
 
static inline int standard_header_field(const char *field, size_t len)
{
return mem_equals(field, len, tree ) ||
   mem_equals(field, len, parent ) ||
   ...;
}
 
  and the caller should tell us it's OK to look at field[len]:
 
standard_header_field(line, eof - line + 1)
 
  We could also omit the space from the standard_header_field.
 
 Yes, that was what I had in mind.  The only reason why the callee
 (over-)optimizes the SP must follow these know keywords part by
 using the extra len parameter is because the caller has to do a
 single strchr() to skip an arbitrary field name anyway so computing
 len is essentially free.

One thing that bugs me about the current code is that the sub-function
looks one past the end of the length given to it by the caller.
Switching it to pass eof - line + 1 resolves that, but is it right?

The character pointed at by eof is either:

  1. space, if our strchr(line, ' ') found something

  2. the first character of the next line, if our
 memchr(line, '\n', eob - line) found something

  3. Whatever is at eob (end-of-buffer)

There are two questionable things here. In (1), we use strchr on a
sized buffer.  And in (3), we look one past the size that was passed in.

In both cases, we are saved by the fact that the buffer is actually NUL
terminated at the end of size (because it comes from read_sha1_file).
But we may find a space much further than the line ending which is
supposed to be our boundary, and end up having to do a comparison to
cover this case.

So I think the current code is correct, but we could make it more
obvious by:

  1. Restricting our search for the field separator to the current line.

  2. Explicitly avoid looking for headers when we did not find a space,
 since we cannot match anything anyway.

Like:

diff --git a/commit.c b/commit.c
index 6bf4fe0..9383cc1 100644
--- a/commit.c
+++ b/commit.c
@@ -1325,14 +1325,14 @@ static struct commit_extra_header 
*read_commit_extra_header_lines(
strbuf_reset(buf);
it = NULL;
 
-   eof = strchr(line, ' ');
-   if (next = eof)
+   eof = memchr(line, ' ', next - line);
+   if (eof) {
+   if (standard_header_field(line, eof - line + 1) ||
+   excluded_header_field(line, eof - line, exclude))
+   continue;
+   } else
eof = next;
 
-   if (standard_header_field(line, eof - line) ||
-   excluded_header_field(line, eof - line, exclude))
-   continue;
-
it = xcalloc(1, sizeof(*it));
it-key = xmemdupz(line, eof-line);
*tail = it;

I also think that eof = next (which I retained here) is off-by-one.
next here is not the newline, but the start of the next line. And I'm
guessing the code actually wanted the newline (otherwise it-key ends
up with the newline in it). But we cannot just subtract one, because if
we hit eob, it really is in the right spot.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-12 Thread Jeff King
On Wed, Mar 12, 2014 at 05:14:15PM -0400, Jeff King wrote:

 I also think that eof = next (which I retained here) is off-by-one.
 next here is not the newline, but the start of the next line. And I'm
 guessing the code actually wanted the newline (otherwise it-key ends
 up with the newline in it). But we cannot just subtract one, because if
 we hit eob, it really is in the right spot.

I started on a patch for this, but found another interesting corner
case. If we do not find a newline and hit end-of-buffer (which
_shouldn't_ happen, as that is a malformed commit object), we set next
to eob. But then we copy the whole string, including *next into the
value of the header.

So we intend to capture the trailing newline in the value, and do in the
normal case. But in the end-of-buffer case, we capture an extra NUL. I
think that's OK, because the eventual fate of the buffer is to become a
C-string. But it does mean that the result sometimes has a
newline-terminator and sometimes doesn't, and the calling code needs to
handle this when printing, or risk lines running together.

Should this function append a newline if there is not already one? If
it's a mergetag header, we feed the result to gpg, etc, and expect to
get the data out verbatim. We would not want to mess that up. OTOH, the
commit object is bogus (and possibly truncated) in the first place, so
it probably doesn't really matter. And the GPG signature on tag objects
has its own delimiters, so a stray newline present or not at the end
should not matter.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-12 Thread Jeff King
On Wed, Mar 12, 2014 at 05:14:15PM -0400, Jeff King wrote:

 One thing that bugs me about the current code is that the sub-function
 looks one past the end of the length given to it by the caller.
 Switching it to pass eof - line + 1 resolves that, but is it right?
 
 The character pointed at by eof is either:
 
   1. space, if our strchr(line, ' ') found something
 
   2. the first character of the next line, if our
  memchr(line, '\n', eob - line) found something
 
   3. Whatever is at eob (end-of-buffer)

This misses a case. We might find no space at all, and eof is NULL. We
never dereference it, so we don't segfault, but it does cause a bogus
giant allocation.

I'm out of time for the day, but here is a test I started on that
demonstrates the failure:

diff --git a/t/t7513-commit-bogus-extra-headers.sh 
b/t/t7513-commit-bogus-extra-headers.sh
index e69de29..834db08 100755
--- a/t/t7513-commit-bogus-extra-headers.sh
+++ b/t/t7513-commit-bogus-extra-headers.sh
@@ -0,0 +1,26 @@
+#!/bin/sh
+
+test_description='check parsing of badly formed commit objects'
+. ./test-lib.sh
+
+EMPTY_TREE=4b825dc642cb6eb9a060e54bf8d69288fbee4904
+
+test_expect_success 'newline right after mergetag header' '
+   test_tick 
+   cat commit -EOF 
+   tree $EMPTY_TREE
+   author $GIT_AUTHOR_NAME $GIT_AUTHOR_EMAIL $GIT_AUTHOR_DATE
+   committer $GIT_COMMITTER_NAME $GIT_COMMITTER_EMAIL $GIT_COMMITTER_DATE
+   mergetag
+
+   foo
+   EOF
+   commit=$(git hash-object -w -t commit commit) 
+   cat expect -EOF 
+   todo...
+   EOF
+   git --no-pager show --show-signature $commit actual 
+   test_cmp expect actual
+'
+
+test_done

The git show fails with:

  fatal: Out of memory, malloc failed (tried to allocate 18446744073699764927 
bytes)

So I think the whole function could use some refactoring to handle
corner cases better. I'll try to take a look tomorrow, but please feel
free if somebody else wants to take a crack at it.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-12 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 Thanks, I think this is a real readability improvement in most cases.
 ...

 I tried:

   perl -i -lpe '
 s/memcmp\(([^,]+), (.*?), (\d+)\)/
  length($2) == $3 ?
qq{!starts_with($1, $2)} :
$
 /ge
   ' $(git ls-files '*.c')

 That comes up with almost the same results as yours, but misses a few
 cases that you caught (in addition to the need to simplify
 !!starts_with()):

   - memcmp(foo, bar, strlen(bar))

   - strings with interpolation, like foo\n, which is actually 4
 characters in length, not 5.

 But there were only a few of those, and I hand-verified them. So I feel
 pretty good that the changes are all correct transformations.

 That leaves the question of whether they are all improvements in
 readability (more on that below).

After reading the patch and the result of your Perl replacement, I'd
agree with the correctness but I am not as convinced as you seem
to be about the real readability improvement in most cases.  some
cases, perhaps, though.

Taking two random examples from an early and a late parts of the
patch:

--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -82,7 +82,7 @@ static int cat_one_file(int opt, const char *exp_type, const 
char *obj_name)
enum object_type type;
unsigned long size;
char *buffer = read_sha1_file(sha1, type, 
size);
-   if (memcmp(buffer, object , 7) ||
+   if (!starts_with(buffer, object ) ||
get_sha1_hex(buffer + 7, blob_sha1))
die(%s not a valid tag, 
sha1_to_hex(sha1));
free(buffer);

diff --git a/transport.c b/transport.c
index ca7bb44..a267822 100644
--- a/transport.c
+++ b/transport.c
@@ -1364,7 +1364,7 @@ static int refs_from_alternate_cb(struct 
alternate_object_database *e,
 
while (other[len-1] == '/')
other[--len] = '\0';
-   if (len  8 || memcmp(other + len - 8, /objects, 8))
+   if (len  8 || !starts_with(other + len - 8, /objects))
return 0;
/* Is this a git repository with refs? */
memcpy(other + len - 8, /refs, 6);


The original hunks show that the code knows and relies on magic
numbers 7 and 8 very clearly and there are rooms for improvement.
The result after the conversion, however, still have the same magic
numbers, but one less of them each.  Doesn't it make it harder to
later spot the patterns to come up with a better abstraction that
does not rely on the magic number?  Especially in the first hunk, we
can spot the repeated 7s in the original that make it glaringly
clear that we might want a better abstraction there, but in the text
after the patch, there is less clue that encourages us to take a
look at that buffer + 7 further to make sure we do not feed a
wrong string to get_sha1_hex() by mistake when we update the
surrounding code or the data format this codepath parses.

I think grepping for memcmp() that compares the same number of
bytes as a constant string, like you showed in that Perl scriptlet,
is a very good first step to locate where we might want to look for
improvements.  I however think that a mechanical replacement of all
such memcmp() with !start_with() is of a dubious value.

After finding the hunk in the cat-file.c shown above, and after
seeing many other similar patterns, one may realize that it is a
good use case for a variant of skip_prefix() that returns boolean,
which we discussed earlier, perhaps like so:

if (!skip_over(buffer, object , object_name) ||
get_sha1_hex(object_name, blob_sha1))
die(...);

and such a solution would be what truly eradicates the reliance of
magic constants that might break by miscounting bytes in string
constants.  I do not think mechanical replacement to !start_with()
is going in the right direction and reaching a halfway to that
goal.  I honestly think that it instead is taking us into a wrong
direction, without really solving the use of brittle magic constants
and making remaining reliance of them even harder to spot.

So
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-12 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 So I think the whole function could use some refactoring to handle
 corner cases better.  I'll try to take a look tomorrow, but please
 feel free if somebody else wants to take a crack at it.

Yup, thanks.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] general style: replaces memcmp() with proper starts_with()

2014-03-12 Thread Quint Guvernator
From what I can gather, there seems to be opposition to specific
pieces of this patch.

The following area is clearly the most controversial:

  static inline int standard_header_field(const char *field, size_t len)
  {
 -return ((len == 4  !memcmp(field, tree , 5)) ||
 -(len == 6  !memcmp(field, parent , 7)) ||
 -(len == 6  !memcmp(field, author , 7)) ||
 -(len == 9  !memcmp(field, committer , 10)) ||
 -(len == 8  !memcmp(field, encoding , 9)));
 +return ((len == 4  starts_with(field, tree )) ||
 +(len == 6  starts_with(field, parent )) ||
 +(len == 6  starts_with(field, author )) ||
 +(len == 9  starts_with(field, committer )) ||
 +(len == 8  starts_with(field, encoding )));

I am happy to submit a version of this patch excluding this section
(and/or others), but it seems I've stumbled into a more fundamental
conversation about the place for helper functions in general (and
about refactoring skip_prefix()). I am working on this particular
change as a microproject, #14 on the list [1], and am not as familiar
with the conventions of the Git codebase as many of you on this list
are.

Junio said:

 The result after the conversion, however, still have the same magic
 numbers, but one less of them each.  Doesn't it make it harder to
 later spot the patterns to come up with a better abstraction that
 does not rely on the magic number?

It is _not_ my goal to make the code harder to maintain down the road.
So, at this point, which hunks (if any) are worth patching?

Quint


[1]: http://git.github.io/SoC-2014-Microprojects.html
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html