Re: [PATCH] general style: replaces memcmp() with proper starts_with()
2014-03-14 0:57 GMT-04:00 Jeff King p...@peff.net: This discussion ended up encompassing a lot of other related cleanups. I hope we didn't scare you away. :) I don't think you could; this community is much more accepting than other software communities around the web. The fact that I received constructive feedback rather than a lecture when formatting issues slipped my mind (i.e. forgetting [PATCH v2]) is reason enough to stick around! My understanding is that you were approaching this as a micro-project for GSoC. I'd love it if you want to pick up and run with some of the ideas discussed here. But as far as a microproject goes, I think it would make sense to identify one or two no-brainer improvement spots by hand, and submit a patch with just those (and I think Junio gave some good guidelines in his reply). I agree with trying to push a few uncontroversial changes through. I'd love to take a deeper look at these helper functions and related cleanups…perhaps it would be worth it to identify a few key areas to work on in addition to a main GSoC project? In fact, the project I'm looking to take on (rebase --interactive) also involves code cleanup and might not take all summer, so I could see how those could work well together in a proposal. I'll be re-reading this thread and working on this patch over the weekend to try to identify the more straightforward hunks I could submit in a patch. Thanks Peff and everyone else for your help. Quint -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
Quint Guvernator quintus.pub...@gmail.com writes: I'll be re-reading this thread and working on this patch over the weekend to try to identify the more straightforward hunks I could submit in a patch. Thanks. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
Junio C Hamano gits...@pobox.com writes: Taking two random examples from an early and a late parts of the patch: --- a/builtin/cat-file.c +++ b/builtin/cat-file.c @@ -82,7 +82,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name) enum object_type type; unsigned long size; char *buffer = read_sha1_file(sha1, type, size); - if (memcmp(buffer, object , 7) || + if (!starts_with(buffer, object ) || [...] The original hunks show that the code knows and relies on magic numbers 7 and 8 very clearly and there are rooms for improvement. Like: what if the file is empty? -- David Kastrup -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
Quint Guvernator quintus.pub...@gmail.com writes: The result after the conversion, however, still have the same magic numbers, but one less of them each. Doesn't it make it harder to later spot the patterns to come up with a better abstraction that does not rely on the magic number? It is _not_ my goal to make the code harder to maintain down the road. Good. So, at this point, which hunks (if any) are worth patching? Will, I am not going through all the mechanical hits to memcmp() and judge each and every one if it is a good idea to convert. Anybody who does so in order to tell you which hunks are worth patching would end up being the one doing the real work, and at that point there is nothing left to be credited as your work anymore ;-) But as Peff said, there are good bits, like these ones just for a few examples. diff --git a/builtin/apply.c b/builtin/apply.c index a7e72d5..16c20af 100644 --- a/builtin/apply.c +++ b/builtin/apply.c @@ -1631,7 +1631,7 @@ static int parse_fragment(const char *line, unsigned long size, * l10n of \ No newline... is at least that long. */ case '\\': - if (len 12 || memcmp(line, \\ , 2)) + if (len 12 || !starts_with(line, \\ )) return -1; break; } @@ -1646,7 +1646,7 @@ static int parse_fragment(const char *line, unsigned long size, * it in the above loop because we hit oldlines == newlines == 0 * before seeing it. */ - if (12 size !memcmp(line, \\ , 2)) + if (12 size starts_with(line, \\ )) offset += linelen(line, size); patch-lines_added += added; These two are about An incomplete line marker begins with a backslash and a SP and there is no other significance in the constant 2 (like, after we recognise the match, we start scanning the remainder of the line starting at the offset 2). It is a tangent but I notice that these two parts (both in the original and in the version after patch) contradict what the incomplete last line marker should look like in a minor detail. On the other hand, I think this one from nearby is iffy. @@ -846,8 +846,8 @@ static int has_epoch_timestamp(const char *nameline) * -MM-DD hh:mm:ss must be from either 1969-12-31 * (west of GMT) or 1970-01-01 (east of GMT) */ - if ((zoneoffset 0 memcmp(timestamp, 1969-12-31, 10)) || - (0 = zoneoffset memcmp(timestamp, 1970-01-01, 10))) + if ((zoneoffset 0 !starts_with(timestamp, 1969-12-31)) || + (0 = zoneoffset !starts_with(timestamp, 1970-01-01))) return 0; hourminute = (strtol(timestamp + 11, NULL, 10) * 60 + If one looks at the post-context of the hunk, one would realize that this codepath very intimately knows how the timestamp should look like at the byte-offset level, not just -MM-DD ought to be 10-byte long, but there should be two-digit hour part after skipping one byte after that -MM-DD part, followed by two-digit minute part after further skipping one byte, knowing that these details are guaranteed by the stamp_regexp[] pattern it earlier made sure the given string would match. I do not think it is a good idea to reduce this kind of precise format knowledge from this function in the first place (after all, this is parsing a header line in a traditional diff whose format is known to us), and even if it were our eventual goal to reduce the precise format knowledge, it would not help very much to get rid of constant 10 only from these two memcmp() calls, and that is why I think this hunk is iffy. Hope the above shows what kind of thinking is needed to judge each change from memcmp() to !starts_with(). Thanks. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
David Kastrup d...@gnu.org writes: Junio C Hamano gits...@pobox.com writes: Taking two random examples from an early and a late parts of the patch: --- a/builtin/cat-file.c +++ b/builtin/cat-file.c @@ -82,7 +82,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name) enum object_type type; unsigned long size; char *buffer = read_sha1_file(sha1, type, size); -if (memcmp(buffer, object , 7) || +if (!starts_with(buffer, object ) || [...] The original hunks show that the code knows and relies on magic numbers 7 and 8 very clearly and there are rooms for improvement. Like: what if the file is empty? Yes. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
On Thu, Mar 13, 2014 at 10:47:28AM -0700, Junio C Hamano wrote: --- a/builtin/cat-file.c +++ b/builtin/cat-file.c @@ -82,7 +82,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name) enum object_type type; unsigned long size; char *buffer = read_sha1_file(sha1, type, size); - if (memcmp(buffer, object , 7) || + if (!starts_with(buffer, object ) || [...] The original hunks show that the code knows and relies on magic numbers 7 and 8 very clearly and there are rooms for improvement. Like: what if the file is empty? Yes. I think this one is actually OK. The result of read_sha1_file is NUL-terminated, and we know that starts_with reads byte-by-byte (the prior memcmp is wrong, but only if you care about accessing bytes past the end of the NUL). That whole piece of code seems silly, though, IMHO. It should be using parse_tag or peel_to_type. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
On Wed, Mar 12, 2014 at 11:33:50PM -0400, Quint Guvernator wrote: It is _not_ my goal to make the code harder to maintain down the road. So, at this point, which hunks (if any) are worth patching? This discussion ended up encompassing a lot of other related cleanups. I hope we didn't scare you away. :) My understanding is that you were approaching this as a micro-project for GSoC. I'd love it if you want to pick up and run with some of the ideas discussed here. But as far as a microproject goes, I think it would make sense to identify one or two no-brainer improvement spots by hand, and submit a patch with just those (and I think Junio gave some good guidelines in his reply). -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
On Wed, Mar 12, 2014 at 10:43:54AM -0400, Quint Guvernator wrote: memcmp() is replaced with negated starts_with() when comparing strings from the beginning. starts_with() looks nicer and it saves the extra argument of the length of the comparing string. Thanks, I think this is a real readability improvement in most cases. One of the things to do when reviewing a patch like this is make sure that there aren't any silly arithmetic mistakes. Checking that can be tedious, so I thought about how _I_ would do it programatically, and then compared our results. I tried: perl -i -lpe ' s/memcmp\(([^,]+), (.*?), (\d+)\)/ length($2) == $3 ? qq{!starts_with($1, $2)} : $ /ge ' $(git ls-files '*.c') That comes up with almost the same results as yours, but misses a few cases that you caught (in addition to the need to simplify !!starts_with()): - memcmp(foo, bar, strlen(bar)) - strings with interpolation, like foo\n, which is actually 4 characters in length, not 5. But there were only a few of those, and I hand-verified them. So I feel pretty good that the changes are all correct transformations. That leaves the question of whether they are all improvements in readability (more on that below). note: this commit properly handles negation in starts_with() Signed-off-by: Quint Guvernator quintus.pub...@gmail.com --- This note should go after the --- line so that it is not part of the commit message (it is only interesting to people seeing v2 and wondering what changed from v1 earlier on the list, not people reading the history much later). diff --git a/builtin/commit-tree.c b/builtin/commit-tree.c index 987a4c3..2777519 100644 --- a/builtin/commit-tree.c +++ b/builtin/commit-tree.c @@ -66,7 +66,7 @@ int cmd_commit_tree(int argc, const char **argv, const char *prefix) continue; } - if (!memcmp(arg, -S, 2)) { + if (starts_with(arg, -S)) { sign_commit = arg + 2; continue; } This is an improvement, but we still have the magic 2 below. Would skip_prefix be a better match here, like: if ((val = skip_prefix(arg, -S))) { sign_commit = val; continue; } We've also talked about refactoring skip_prefix to return a boolean, and put the value in an out-parameter. Which would make this even more readable: if (skip_prefix(arg, -S, sign_commit)) continue; Maybe the time has come to do that. --- a/builtin/mailinfo.c +++ b/builtin/mailinfo.c @@ -626,7 +626,7 @@ static int handle_boundary(void) strbuf_addch(newline, '\n'); again: if (line.len = (*content_top)-len + 2 - !memcmp(line.buf + (*content_top)-len, --, 2)) { + starts_with(line.buf + (*content_top)-len, --)) { I'm not sure if this improves readability or not. It's certainly nice to get rid of the magic 2, but starts_with is a bit of a misnomer, since we are indexing deep into the buffer anyway. And we still have the magic 2 above anyway. I'm on the fence. @@ -727,8 +727,8 @@ static int is_scissors_line(const struct strbuf *line) continue; } if (i + 1 len - (!memcmp(buf + i, 8, 2) || !memcmp(buf + i, 8, 2) || - !memcmp(buf + i, %, 2) || !memcmp(buf + i, %, 2))) { + (starts_with(buf + i, 8) || starts_with(buf + i, 8) || + starts_with(buf + i, %) || starts_with(buf + i, %))) { Same as above. @@ -66,7 +66,7 @@ static int verify_tag(char *buffer, unsigned long size) return error(char%PRIuMAX: could not find next \\\n\, (uintmax_t) (type_line - buffer)); tag_line++; - if (memcmp(tag_line, tag , 4) || tag_line[4] == '\n') + if (!starts_with(tag_line, tag ) || tag_line[4] == '\n') return error(char%PRIuMAX: no \tag \ found, (uintmax_t) (tag_line - buffer)); I think this is another that could benefit from an enhanced skip_prefix: if (!skip_prefix(tag_line, tag , tag_name) || *tag_name == '\n') ... and then we can get rid of the tag_line += 4 that is used much later (in fact, I suspect that whole function could be improved in this respect). @@ -269,7 +269,7 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s return 0; item-object.parsed = 1; tail += size; - if (tail = bufptr + 46 || memcmp(bufptr, tree , 5) || bufptr[45] != '\n') + if (tail = bufptr + 46 || !starts_with(bufptr, tree ) || bufptr[45] != '\n') return error(bogus commit object %s, sha1_to_hex(item-object.sha1)); if (get_sha1_hex(bufptr + 5, parent) 0) Again, we just use bufptr + 5 a bit later here. So a candidate for skip_prefix. graft = lookup_commit_graft(item-object.sha1); - while (bufptr + 48
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
Jeff King p...@peff.net writes: static inline int standard_header_field(const char *field, size_t len) { -return ((len == 4 !memcmp(field, tree , 5)) || -(len == 6 !memcmp(field, parent , 7)) || -(len == 6 !memcmp(field, author , 7)) || -(len == 9 !memcmp(field, committer , 10)) || -(len == 8 !memcmp(field, encoding , 9))); +return ((len == 4 starts_with(field, tree )) || +(len == 6 starts_with(field, parent )) || +(len == 6 starts_with(field, author )) || +(len == 9 starts_with(field, committer )) || +(len == 8 starts_with(field, encoding ))); These extra len checks are interesting. They look like an attempt to optimize lookup, since the caller will already have scanned forward to the space. If one really wants to remove the magic constants from this, then one must take advantage of the pattern len == strlen(S) - 1 !memcmp(field, S, strlen(S)) that appears here, and come up with a simple abstraction to express that we are only using the string S (e.g. tree ), length len and location field of the counted string. Blindly replacing starts_with() with !memcmp() in the above part is a readability regression otherwise. ... I think with a few more helpers we could really further clean up some of these callsites. Yes. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
On Wed, Mar 12, 2014 at 12:39:01PM -0700, Junio C Hamano wrote: Jeff King p...@peff.net writes: static inline int standard_header_field(const char *field, size_t len) { - return ((len == 4 !memcmp(field, tree , 5)) || - (len == 6 !memcmp(field, parent , 7)) || - (len == 6 !memcmp(field, author , 7)) || - (len == 9 !memcmp(field, committer , 10)) || - (len == 8 !memcmp(field, encoding , 9))); + return ((len == 4 starts_with(field, tree )) || + (len == 6 starts_with(field, parent )) || + (len == 6 starts_with(field, author )) || + (len == 9 starts_with(field, committer )) || + (len == 8 starts_with(field, encoding ))); These extra len checks are interesting. They look like an attempt to optimize lookup, since the caller will already have scanned forward to the space. If one really wants to remove the magic constants from this, then one must take advantage of the pattern len == strlen(S) - 1 !memcmp(field, S, strlen(S)) that appears here, and come up with a simple abstraction to express that we are only using the string S (e.g. tree ), length len and location field of the counted string. Blindly replacing starts_with() with !memcmp() in the above part is a readability regression otherwise. I actually think the right solution is: static inline int standard_header_field(const char *field, size_t len) { return mem_equals(field, len, tree ) || mem_equals(field, len, parent ) || ...; } and the caller should tell us it's OK to look at field[len]: standard_header_field(line, eof - line + 1) We could also omit the space from the standard_header_field. The caller just ran strchr() looking for the space, so we know that either it is there, or we are at the end of the line/buffer. Arguably a string like parent\n should be a parent header with no data (but right now it is not matched by this function). I'm not aware of an implementation that writes such a thing, but it seems to fall in the be liberal in what you accept category. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
Jeff King p...@peff.net writes: Blindly replacing starts_with() with !memcmp() in the above part is a readability regression otherwise. I actually think the right solution is: static inline int standard_header_field(const char *field, size_t len) { return mem_equals(field, len, tree ) || mem_equals(field, len, parent ) || ...; } and the caller should tell us it's OK to look at field[len]: standard_header_field(line, eof - line + 1) We could also omit the space from the standard_header_field. Yes, that was what I had in mind. The only reason why the callee (over-)optimizes the SP must follow these know keywords part by using the extra len parameter is because the caller has to do a single strchr() to skip an arbitrary field name anyway so computing len is essentially free. The caller just ran strchr() looking for the space, so we know that either it is there, or we are at the end of the line/buffer. Arguably a string like parent\n should be a parent header with no data (but right now it is not matched by this function). I'm not aware of an implementation that writes such a thing, but it seems to fall in the be liberal in what you accept category. It is _not_ a standard header field, so it will be read by the logic in the caller as a non-standard header field without getting lost. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
Junio C Hamano gits...@pobox.com writes: Jeff King p...@peff.net writes: static inline int standard_header_field(const char *field, size_t len) { - return ((len == 4 !memcmp(field, tree , 5)) || - (len == 6 !memcmp(field, parent , 7)) || - (len == 6 !memcmp(field, author , 7)) || - (len == 9 !memcmp(field, committer , 10)) || - (len == 8 !memcmp(field, encoding , 9))); + return ((len == 4 starts_with(field, tree )) || + (len == 6 starts_with(field, parent )) || + (len == 6 starts_with(field, author )) || + (len == 9 starts_with(field, committer )) || + (len == 8 starts_with(field, encoding ))); These extra len checks are interesting. They look like an attempt to optimize lookup, since the caller will already have scanned forward to the space. If one really wants to remove the magic constants from this, then one must take advantage of the pattern len == strlen(S) - 1 !memcmp(field, S, strlen(S)) If the caller has already scanned forward to the space, then there is no actual need to let the comparison compare the space again after checking len, is there? That makes for a more consistent look in the memcmp case. One could do this in a switch on len instead. Not that it seems terribly more efficient. that appears here, and come up with a simple abstraction to express that we are only using the string S (e.g. tree ), length len and location field of the counted string. Blindly replacing starts_with() with !memcmp() in the above part is a readability regression otherwise. Don't really think so: if the len at the front and the back is the same and the space is not explicitly compared any more, both look pretty much the same to me. ... I think with a few more helpers we could really further clean up some of these callsites. Yes. Possibly. But it does seem like overkill. -- David Kastrup -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
Am 12.03.2014 20:39, schrieb Junio C Hamano: Jeff King p...@peff.net writes: static inline int standard_header_field(const char *field, size_t len) { - return ((len == 4 !memcmp(field, tree , 5)) || - (len == 6 !memcmp(field, parent , 7)) || - (len == 6 !memcmp(field, author , 7)) || - (len == 9 !memcmp(field, committer , 10)) || - (len == 8 !memcmp(field, encoding , 9))); + return ((len == 4 starts_with(field, tree )) || + (len == 6 starts_with(field, parent )) || + (len == 6 starts_with(field, author )) || + (len == 9 starts_with(field, committer )) || + (len == 8 starts_with(field, encoding ))); These extra len checks are interesting. They look like an attempt to optimize lookup, since the caller will already have scanned forward to the space. I wonder what the performance impact might be. The length checks are also needed for correctness, however, to avoid running over the end of the buffer. If one really wants to remove the magic constants from this, then one must take advantage of the pattern len == strlen(S) - 1 !memcmp(field, S, strlen(S)) that appears here, and come up with a simple abstraction to express that we are only using the string S (e.g. tree ), length len and location field of the counted string. This might be a job for kwset. René -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
On Wed, Mar 12, 2014 at 01:08:03PM -0700, Junio C Hamano wrote: Jeff King p...@peff.net writes: Blindly replacing starts_with() with !memcmp() in the above part is a readability regression otherwise. I actually think the right solution is: static inline int standard_header_field(const char *field, size_t len) { return mem_equals(field, len, tree ) || mem_equals(field, len, parent ) || ...; } and the caller should tell us it's OK to look at field[len]: standard_header_field(line, eof - line + 1) We could also omit the space from the standard_header_field. Yes, that was what I had in mind. The only reason why the callee (over-)optimizes the SP must follow these know keywords part by using the extra len parameter is because the caller has to do a single strchr() to skip an arbitrary field name anyway so computing len is essentially free. One thing that bugs me about the current code is that the sub-function looks one past the end of the length given to it by the caller. Switching it to pass eof - line + 1 resolves that, but is it right? The character pointed at by eof is either: 1. space, if our strchr(line, ' ') found something 2. the first character of the next line, if our memchr(line, '\n', eob - line) found something 3. Whatever is at eob (end-of-buffer) There are two questionable things here. In (1), we use strchr on a sized buffer. And in (3), we look one past the size that was passed in. In both cases, we are saved by the fact that the buffer is actually NUL terminated at the end of size (because it comes from read_sha1_file). But we may find a space much further than the line ending which is supposed to be our boundary, and end up having to do a comparison to cover this case. So I think the current code is correct, but we could make it more obvious by: 1. Restricting our search for the field separator to the current line. 2. Explicitly avoid looking for headers when we did not find a space, since we cannot match anything anyway. Like: diff --git a/commit.c b/commit.c index 6bf4fe0..9383cc1 100644 --- a/commit.c +++ b/commit.c @@ -1325,14 +1325,14 @@ static struct commit_extra_header *read_commit_extra_header_lines( strbuf_reset(buf); it = NULL; - eof = strchr(line, ' '); - if (next = eof) + eof = memchr(line, ' ', next - line); + if (eof) { + if (standard_header_field(line, eof - line + 1) || + excluded_header_field(line, eof - line, exclude)) + continue; + } else eof = next; - if (standard_header_field(line, eof - line) || - excluded_header_field(line, eof - line, exclude)) - continue; - it = xcalloc(1, sizeof(*it)); it-key = xmemdupz(line, eof-line); *tail = it; I also think that eof = next (which I retained here) is off-by-one. next here is not the newline, but the start of the next line. And I'm guessing the code actually wanted the newline (otherwise it-key ends up with the newline in it). But we cannot just subtract one, because if we hit eob, it really is in the right spot. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
On Wed, Mar 12, 2014 at 05:14:15PM -0400, Jeff King wrote: I also think that eof = next (which I retained here) is off-by-one. next here is not the newline, but the start of the next line. And I'm guessing the code actually wanted the newline (otherwise it-key ends up with the newline in it). But we cannot just subtract one, because if we hit eob, it really is in the right spot. I started on a patch for this, but found another interesting corner case. If we do not find a newline and hit end-of-buffer (which _shouldn't_ happen, as that is a malformed commit object), we set next to eob. But then we copy the whole string, including *next into the value of the header. So we intend to capture the trailing newline in the value, and do in the normal case. But in the end-of-buffer case, we capture an extra NUL. I think that's OK, because the eventual fate of the buffer is to become a C-string. But it does mean that the result sometimes has a newline-terminator and sometimes doesn't, and the calling code needs to handle this when printing, or risk lines running together. Should this function append a newline if there is not already one? If it's a mergetag header, we feed the result to gpg, etc, and expect to get the data out verbatim. We would not want to mess that up. OTOH, the commit object is bogus (and possibly truncated) in the first place, so it probably doesn't really matter. And the GPG signature on tag objects has its own delimiters, so a stray newline present or not at the end should not matter. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
On Wed, Mar 12, 2014 at 05:14:15PM -0400, Jeff King wrote: One thing that bugs me about the current code is that the sub-function looks one past the end of the length given to it by the caller. Switching it to pass eof - line + 1 resolves that, but is it right? The character pointed at by eof is either: 1. space, if our strchr(line, ' ') found something 2. the first character of the next line, if our memchr(line, '\n', eob - line) found something 3. Whatever is at eob (end-of-buffer) This misses a case. We might find no space at all, and eof is NULL. We never dereference it, so we don't segfault, but it does cause a bogus giant allocation. I'm out of time for the day, but here is a test I started on that demonstrates the failure: diff --git a/t/t7513-commit-bogus-extra-headers.sh b/t/t7513-commit-bogus-extra-headers.sh index e69de29..834db08 100755 --- a/t/t7513-commit-bogus-extra-headers.sh +++ b/t/t7513-commit-bogus-extra-headers.sh @@ -0,0 +1,26 @@ +#!/bin/sh + +test_description='check parsing of badly formed commit objects' +. ./test-lib.sh + +EMPTY_TREE=4b825dc642cb6eb9a060e54bf8d69288fbee4904 + +test_expect_success 'newline right after mergetag header' ' + test_tick + cat commit -EOF + tree $EMPTY_TREE + author $GIT_AUTHOR_NAME $GIT_AUTHOR_EMAIL $GIT_AUTHOR_DATE + committer $GIT_COMMITTER_NAME $GIT_COMMITTER_EMAIL $GIT_COMMITTER_DATE + mergetag + + foo + EOF + commit=$(git hash-object -w -t commit commit) + cat expect -EOF + todo... + EOF + git --no-pager show --show-signature $commit actual + test_cmp expect actual +' + +test_done The git show fails with: fatal: Out of memory, malloc failed (tried to allocate 18446744073699764927 bytes) So I think the whole function could use some refactoring to handle corner cases better. I'll try to take a look tomorrow, but please feel free if somebody else wants to take a crack at it. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
Jeff King p...@peff.net writes: Thanks, I think this is a real readability improvement in most cases. ... I tried: perl -i -lpe ' s/memcmp\(([^,]+), (.*?), (\d+)\)/ length($2) == $3 ? qq{!starts_with($1, $2)} : $ /ge ' $(git ls-files '*.c') That comes up with almost the same results as yours, but misses a few cases that you caught (in addition to the need to simplify !!starts_with()): - memcmp(foo, bar, strlen(bar)) - strings with interpolation, like foo\n, which is actually 4 characters in length, not 5. But there were only a few of those, and I hand-verified them. So I feel pretty good that the changes are all correct transformations. That leaves the question of whether they are all improvements in readability (more on that below). After reading the patch and the result of your Perl replacement, I'd agree with the correctness but I am not as convinced as you seem to be about the real readability improvement in most cases. some cases, perhaps, though. Taking two random examples from an early and a late parts of the patch: --- a/builtin/cat-file.c +++ b/builtin/cat-file.c @@ -82,7 +82,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name) enum object_type type; unsigned long size; char *buffer = read_sha1_file(sha1, type, size); - if (memcmp(buffer, object , 7) || + if (!starts_with(buffer, object ) || get_sha1_hex(buffer + 7, blob_sha1)) die(%s not a valid tag, sha1_to_hex(sha1)); free(buffer); diff --git a/transport.c b/transport.c index ca7bb44..a267822 100644 --- a/transport.c +++ b/transport.c @@ -1364,7 +1364,7 @@ static int refs_from_alternate_cb(struct alternate_object_database *e, while (other[len-1] == '/') other[--len] = '\0'; - if (len 8 || memcmp(other + len - 8, /objects, 8)) + if (len 8 || !starts_with(other + len - 8, /objects)) return 0; /* Is this a git repository with refs? */ memcpy(other + len - 8, /refs, 6); The original hunks show that the code knows and relies on magic numbers 7 and 8 very clearly and there are rooms for improvement. The result after the conversion, however, still have the same magic numbers, but one less of them each. Doesn't it make it harder to later spot the patterns to come up with a better abstraction that does not rely on the magic number? Especially in the first hunk, we can spot the repeated 7s in the original that make it glaringly clear that we might want a better abstraction there, but in the text after the patch, there is less clue that encourages us to take a look at that buffer + 7 further to make sure we do not feed a wrong string to get_sha1_hex() by mistake when we update the surrounding code or the data format this codepath parses. I think grepping for memcmp() that compares the same number of bytes as a constant string, like you showed in that Perl scriptlet, is a very good first step to locate where we might want to look for improvements. I however think that a mechanical replacement of all such memcmp() with !start_with() is of a dubious value. After finding the hunk in the cat-file.c shown above, and after seeing many other similar patterns, one may realize that it is a good use case for a variant of skip_prefix() that returns boolean, which we discussed earlier, perhaps like so: if (!skip_over(buffer, object , object_name) || get_sha1_hex(object_name, blob_sha1)) die(...); and such a solution would be what truly eradicates the reliance of magic constants that might break by miscounting bytes in string constants. I do not think mechanical replacement to !start_with() is going in the right direction and reaching a halfway to that goal. I honestly think that it instead is taking us into a wrong direction, without really solving the use of brittle magic constants and making remaining reliance of them even harder to spot. So -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
Jeff King p...@peff.net writes: So I think the whole function could use some refactoring to handle corner cases better. I'll try to take a look tomorrow, but please feel free if somebody else wants to take a crack at it. Yup, thanks. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] general style: replaces memcmp() with proper starts_with()
From what I can gather, there seems to be opposition to specific pieces of this patch. The following area is clearly the most controversial: static inline int standard_header_field(const char *field, size_t len) { -return ((len == 4 !memcmp(field, tree , 5)) || -(len == 6 !memcmp(field, parent , 7)) || -(len == 6 !memcmp(field, author , 7)) || -(len == 9 !memcmp(field, committer , 10)) || -(len == 8 !memcmp(field, encoding , 9))); +return ((len == 4 starts_with(field, tree )) || +(len == 6 starts_with(field, parent )) || +(len == 6 starts_with(field, author )) || +(len == 9 starts_with(field, committer )) || +(len == 8 starts_with(field, encoding ))); I am happy to submit a version of this patch excluding this section (and/or others), but it seems I've stumbled into a more fundamental conversation about the place for helper functions in general (and about refactoring skip_prefix()). I am working on this particular change as a microproject, #14 on the list [1], and am not as familiar with the conventions of the Git codebase as many of you on this list are. Junio said: The result after the conversion, however, still have the same magic numbers, but one less of them each. Doesn't it make it harder to later spot the patterns to come up with a better abstraction that does not rely on the magic number? It is _not_ my goal to make the code harder to maintain down the road. So, at this point, which hunks (if any) are worth patching? Quint [1]: http://git.github.io/SoC-2014-Microprojects.html -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html