[PATCH] commit: abort before commit-msg if empty message

2018-12-07 Thread Jonathan Tan
When a user runs "git commit" without specifying a message, an editor
appears with advice:

Please enter the commit message for your changes. Lines starting
with '#' will be ignored, and an empty message aborts the commit.

However, if the user supplies an empty message and has a commit-msg hook
that updates the message to be non-empty, the commit proceeds anyway,
contrary to what the advice states.

Teach commit to also check the emptiness of the commit message before it
invokes the commit-msg hook if an editor is used and if no_verify is not
set (that is, commit-msg is not suppressed). This makes the advice true.

(The implementation in this commit reads the commit message twice even
if there is no commit-msg hook. I think that this is fine, since this is
for interactive use - an alternative would be to plumb information about
the absence of the hook from run_hook_ve() upwards, which seems more
complicated.)

Signed-off-by: Jonathan Tan 
---
This was noticed with the commit-msg hook that comes with Gerrit, which
basically just calls git interpret-trailers to add a Change-Id trailer.
---
 builtin/commit.c           | 43 ++++++++++++++++++++++++++++++++-----------
 t/t7504-commit-msg-hook.sh | 11 +++++++++++
 2 files changed, 43 insertions(+), 11 deletions(-)

diff --git a/builtin/commit.c b/builtin/commit.c
index c021b119bb..3681a59af8 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -652,6 +652,21 @@ static void adjust_comment_line_char(const struct strbuf *sb)
comment_line_char = *p;
 }
 
+static void read_and_clean_commit_message(struct strbuf *sb)
+{
+   if (strbuf_read_file(sb, git_path_commit_editmsg(), 0) < 0) {
+   int saved_errno = errno;
+   rollback_index_files();
+   die(_("could not read commit message: %s"), 
strerror(saved_errno));
+   }
+
+   if (verbose || /* Truncate the message just before the diff, if any. */
+   cleanup_mode == COMMIT_MSG_CLEANUP_SCISSORS)
+   strbuf_setlen(sb, wt_status_locate_end(sb->buf, sb->len));
+   if (cleanup_mode != COMMIT_MSG_CLEANUP_NONE)
+   strbuf_stripspace(sb, cleanup_mode == COMMIT_MSG_CLEANUP_ALL);
+}
+
 static int prepare_to_commit(const char *index_file, const char *prefix,
 struct commit *current_head,
 struct wt_status *s,
@@ -970,6 +985,22 @@ static int prepare_to_commit(const char *index_file, const char *prefix,
argv_array_clear(&env);
}
 
+   if (use_editor && !no_verify) {
+   /*
+* Abort the commit if the user supplied an empty commit
+* message in the editor. (Because the commit-msg hook is to be
+* run, we need to check this now, since that hook may change
+* the commit message.)
+*/
+   read_and_clean_commit_message(&sb);
+   if (message_is_empty(&sb, cleanup_mode) && !allow_empty_message) {
+   rollback_index_files();
+   fprintf(stderr, _("Aborting commit due to empty commit 
message.\n"));
+   exit(1);
+   }
+   strbuf_release(&sb);
+   }
+
if (!no_verify &&
run_commit_hook(use_editor, index_file, "commit-msg", git_path_commit_editmsg(), NULL)) {
return 0;
@@ -1608,17 +1639,7 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
 
/* Finally, get the commit message */
strbuf_reset(&sb);
-   if (strbuf_read_file(&sb, git_path_commit_editmsg(), 0) < 0) {
-   int saved_errno = errno;
-   rollback_index_files();
-   die(_("could not read commit message: %s"), strerror(saved_errno));
-   }
-
-   if (verbose || /* Truncate the message just before the diff, if any. */
-   cleanup_mode == COMMIT_MSG_CLEANUP_SCISSORS)
-   strbuf_setlen(&sb, wt_status_locate_end(sb.buf, sb.len));
-   if (cleanup_mode != COMMIT_MSG_CLEANUP_NONE)
-   strbuf_stripspace(&sb, cleanup_mode == COMMIT_MSG_CLEANUP_ALL);
+   read_and_clean_commit_message(&sb);

if (message_is_empty(&sb, cleanup_mode) && !allow_empty_message) {
rollback_index_files();
diff --git a/t/t7504-commit-msg-hook.sh b/t/t7504-commit-msg-hook.sh
index 31b9c6a2c1..b44d6fc43e 100755
--- a/t/t7504-commit-msg-hook.sh
+++ b/t/t7504-commit-msg-hook.sh
@@ -122,6 +122,17 @@ test_expect_success 'with failing hook (editor)' '
 
 '
 
+test_expect_success 'hook is not run if commit message was empty' '
+   echo "yet more another" >>file &&
+   git add file &&
+   echo >FAKE_MSG &&
+   test_must_fail env GIT_EDITOR="\"\$FAKE_EDITOR\"" git commit 2>err &&
+
+   # Verify that git stopped

[PATCH on master v2] revision: use commit graph in get_reference()

2018-12-07 Thread Jonathan Tan
When fetching into a repository, a connectivity check is first made by
check_exist_and_connected() in builtin/fetch.c that runs:

  git rev-list --objects --stdin --not --all --quiet <(list of objects)

If the client repository has many refs, this command can be slow,
regardless of the nature of the server repository or what is being
fetched. A profiler reveals that most of the time is spent in
setup_revisions() (approx. 60/63), and of the time spent in
setup_revisions(), most of it is spent in parse_object() (approx.
49/60). This is because setup_revisions() parses the target of every ref
(from "--all"), and parse_object() reads the buffer of the object.

Reading the buffer is unnecessary if the repository has a commit graph
and if the ref points to a commit (which is typically the case). This
patch uses the commit graph wherever possible; on my computer, when I
run the above command with a list of 1 object on a many-ref repository,
I get a speedup from 1.8s to 1.0s.
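
For intuition: a commit-graph lookup is a fanout-table consult plus a
binary search over the sorted OID list, so no object buffer is ever
inflated. The following self-contained sketch shows the idea (git's real
helper is bsearch_hash(); the fixed 20-byte entries and flat table
layout here are illustrative assumptions, not git's actual code):

  #include <stdint.h>
  #include <string.h>

  /* Return 1 and set *pos if oid occurs in the sorted table of 20-byte
   * entries; fanout[b] counts entries whose first byte is <= b. */
  static int fanout_bsearch(const unsigned char *oid,
                            const uint32_t fanout[256],
                            const unsigned char *table, uint32_t *pos)
  {
          uint32_t lo = oid[0] ? fanout[oid[0] - 1] : 0;
          uint32_t hi = fanout[oid[0]];

          while (lo < hi) {
                  uint32_t mi = lo + (hi - lo) / 2;
                  int cmp = memcmp(oid, table + 20 * (size_t)mi, 20);

                  if (!cmp) {
                          *pos = mi;
                          return 1;
                  }
                  if (cmp < 0)
                          hi = mi;
                  else
                          lo = mi + 1;
          }
          return 0;
  }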

Another way to accomplish this effect would be to modify parse_object()
to use the commit graph if possible; however, I did not want to change
parse_object()'s current behavior of always checking the object
signature of the returned object.

Signed-off-by: Jonathan Tan 
---
This patch is now on master.

v2 makes use of the optimization Stolee describes in [1], except that I
have arranged the functions slightly differently. In particular, I
didn't want to add even more ways to obtain objects, so I let
parse_commit_in_graph() be able to take in either a commit shell or an
OID, and did not create the parse_probably_commit() function he
suggested. But I'm not really attached to this design choice, and can
change it if requested.

[1] https://public-inbox.org/git/aa0cd481-c135-47aa-2a69-e3dc71661...@gmail.com/
---
 commit-graph.c             | 38 +++++++++++++++++++++++++++-----------
 commit-graph.h             | 12 +++++++++---
 commit.c                   |  2 +-
 revision.c                 |  5 ++++-
 t/helper/test-repository.c |  4 ++--
 5 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 40c855f185..a571b523b7 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -286,7 +286,8 @@ void close_commit_graph(struct repository *r)
r->objects->commit_graph = NULL;
 }
 
-static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t *pos)
+static int bsearch_graph(struct commit_graph *g, const struct object_id *oid,
+uint32_t *pos)
 {
return bsearch_hash(oid->hash, g->chunk_oid_fanout,
g->chunk_oid_lookup, g->hash_len, pos);
@@ -374,24 +375,41 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
}
 }
 
-static int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
+static struct commit *parse_commit_in_graph_one(struct repository *r,
+   struct commit_graph *g,
+   struct commit *shell,
+   const struct object_id *oid)
 {
uint32_t pos;
 
-   if (item->object.parsed)
-   return 1;
+   if (shell && shell->object.parsed)
+   return shell;
 
-   if (find_commit_in_graph(item, g, &pos))
-   return fill_commit_in_graph(item, g, pos);
+   if (shell && shell->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+   pos = shell->graph_pos;
+   } else if (bsearch_graph(g, shell ? &shell->object.oid : oid, &pos)) {
+   /* bsearch_graph sets pos */
+   } else {
+   return NULL;
+   }
 
-   return 0;
+   if (!shell) {
+   shell = lookup_commit(r, oid);
+   if (!shell)
+   return NULL;
+   }
+
+   fill_commit_in_graph(shell, g, pos);
+   return shell;
 }
 
-int parse_commit_in_graph(struct repository *r, struct commit *item)
+struct commit *parse_commit_in_graph(struct repository *r, struct commit *shell,
+const struct object_id *oid)
 {
if (!prepare_commit_graph(r))
return 0;
-   return parse_commit_in_graph_one(r->objects->commit_graph, item);
+   return parse_commit_in_graph_one(r, r->objects->commit_graph, shell,
+oid);
 }
 
 void load_commit_graph_info(struct repository *r, struct commit *item)
@@ -1025,7 +1043,7 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g)
}
 
graph_commit = lookup_commit(r, &cur_oid);
-   if (!parse_commit_in_graph_one(g, graph_commit))
+   if (!parse_commit_in_graph_one(r, g, graph_commit, NULL))
graph_report("failed to parse %s from commit-graph",

Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()

2018-12-06 Thread Jonathan Tan
Also CC-ing Stolee since I mention multi-pack indices at the end.

> This seems like a reasonable thing to do, but I have sort of a
> meta-comment. In several places we've started doing this kind of "if
> it's this type of object, do X, otherwise do Y" optimization (e.g.,
> handling large blobs for streaming).
> 
> And in the many cases we end up doubling the effort to do object
> lookups: here we do one lookup to get the type, and then if it's not a
> commit (or if we don't have a commit graph) we end up parsing it anyway.
> 
> I wonder if we could do better. In this instance, it might make sense
> to first see if we actually have a commit graph available (it might not
> have this object, of course, but at least we'd expect it to have most
> commits).

This makes sense - I thought I shouldn't mention the commit graph in the
code since it seems like a layering violation, but I already felt the
need to mention it in a comment, so maybe it deserves mention in the
code too. Subsequently, maybe the lookup-for-type
could be replaced by a lookup-in-commit-graph (maybe by using
parse_commit_in_graph() directly), which should be at least slightly
faster.
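
For concreteness, a hedged sketch of that replacement inside
get_reference(), borrowing the v2 signature of parse_commit_in_graph()
from the updated patch in this thread (NULL shell, lookup by OID;
returns the parsed commit or NULL) - a sketch, not a tested
implementation:

  struct commit *c = parse_commit_in_graph(revs->repo, NULL, oid);

  if (c)
          object = &c->object;                    /* no type lookup, no buffer read */
  else
          object = parse_object(revs->repo, oid); /* slow path */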

> In general, it would be nice if we had a more incremental API
> for accessing objects: open, get metadata, then read the data. That
> would make these kinds of optimizations "free".

Would this be assuming that to read the data, you would (1) first need to
read the metadata, and (2) there would be no redundancy in reading the
two? It seems to me that for loose objects, you would want to perform
all your reads at once, since any read requires opening the file, and
for commit graphs, you just want to read what you want, since the
metadata and the data are in separate places.
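
To make the shape of such an incremental API concrete, here is one
hypothetical sketch - every name below is invented for illustration and
none of this exists in git today:

  struct object_handle;   /* opaque: remembers where the object lives */

  /* Cheap: locate the object (pack idx entry, loose file, or commit
   * graph position) without inflating its contents. */
  int oh_open(struct repository *r, const struct object_id *oid,
              struct object_handle **out);

  /* Cheap for packed and graph-backed objects; for a loose object this
   * may already require inflating the header. */
  enum object_type oh_type(const struct object_handle *oh);
  unsigned long oh_size(const struct object_handle *oh);

  /* Expensive: inflate and return the full buffer. */
  int oh_read_data(struct object_handle *oh, void **buf,
                   unsigned long *size);

  void oh_close(struct object_handle *oh);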

> I don't have numbers for how much the extra lookups cost. The lookups
> are probably dwarfed by parse_object() in general, so even if we save
> only a few full object loads, it may be a win. It just seems a shame
> that we may be making the "slow" paths (when our type-specific check
> doesn't match) even slower.

I agree. I think it will always remain a tradeoff when we have multiple
data sources of objects (loose, packed, commit graph - and we can't
unify them all, since they each have their uses). Unless the multi-pack
index can reference commit graphs as well...then it could be our first
point of reference without introducing any inefficiencies...


Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()

2018-12-06 Thread Jonathan Tan
> > This is on sb/more-repo-in-api because I'm using the repo_parse_commit()
> > function.
> 
> This is a mere nicety, not strictly required.
> Before we had parse_commit(struct commit *) which would accomplish the
> same, (and we'd still have that afterwards as a #define falling back onto
> the_repository). As the function get_reference() is not the_repository safe
> as it contains a call to is_promisor_object() that is repository
> agnostic, I think
> it would be fair game to not depend on that series. I am not
> complaining, though.

Good point - I'll base the next version on master (and add a TODO
explaining which functions are not yet converted).

> AFAICT oid_object_info doesn't take advantage of the commit graph,
> but just looks up the object header, which is still less than completely
> parsing it. Then lookup_commit is overly strict, as it may return
> NULL as when there still is a type mismatch (I don't think a mismatch
> could happen here, as both rely on just the object store, and not the
> commit graph.), so this would be just defensive programming for
> the sake of it. I dunno.
> 
> struct commit *c;
> 
> if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT &&
> (c = lookup_commit(revs->repo, oid)) &&
> !repo_parse_commit(revs->repo, c))
> object = (struct object *) c;
> else
> object = parse_object(revs->repo, oid);

I like this way better - I'll do it in the next version.


Re: [WIP RFC 4/5] upload-pack: refactor writing of "packfile" line

2018-12-06 Thread Jonathan Tan
> Jonathan Tan  writes:
> 
> > @@ -126,6 +129,12 @@ static int read_pack_objects_stdout(int outfd, struct 
> > output_state *os)
> > }
> > os->used += readsz;
> >  
> > +   if (!os->packfile_started) {
> > +   os->packfile_started = 1;
> > +   if (use_protocol_v2)
> > +   packet_write_fmt(1, "packfile\n");
> 
> If we fix this function so that the only byte in the buffer is held
> back without emitted when os->used == 1 as I alluded to, this may
> have to be done a bit later, as with such a change, it is no longer
> guaranteed that send_client_data() will be called after this point.

I'm not sure what you mean about there being no guarantee that
send_client_data() is called - in create_pack_file(), there is an
"if (output_state.used > 0)" line (previously "if (0 <= buffered)") that
outputs anything remaining.

> Isn't progress output that goes to the channel #2 pretty much
> independent from the payload stream that carries the pkt-line 
> command like "packfile" plus the raw pack stream?  It somehow
> feels like an oxymoron to _buffer_ progress indicator, as it
> defeats the whole point of progress report to buffer it.

Yes, it is - I don't fully like this part of the design. I alluded to a
similar issue (keepalive) in the toplevel email [1] and that it might be
better if the server can send sideband throughout the whole response -
perhaps that should be investigated first. If we had sideband throughout
the whole response, we wouldn't need to buffer the progress indicator.

[1] https://public-inbox.org/git/cover.1543879256.git.jonathanta...@google.com/


Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc

2018-12-06 Thread Jonathan Tan
> > +This feature allows servers to serve part of their packfile response as 
> > URIs.
> > +This allows server designs that improve scalability in bandwidth and CPU 
> > usage
> > +(for example, by serving some data through a CDN), and (in the future) 
> > provides
> > +some measure of resumability to clients.
> 
> Without reading the remainder, this makes readers anticipate a few
> good things ;-)
> 
>  - "part of", so pre-generated constant material can be given from
>CDN and then followed-up by "filling the gaps" small packfile,
>perhaps?

Yes :-)

>  - The "part of" transmission may not bring the repository up to
>date wrt to the "want" objects; would this feature involve "you
>asked history up to these commits, but with this pack-uri, you'll
>be getting history up to these (somewhat stale) commits"?

It could be, but not necessarily. In my current WIP implementation, for
example, pack URIs don't give you any commits at all (and thus, no
history) - only blobs. Quite a few people first think of the "stale
clone then top-up" case, though - I wonder if it would be a good idea to
give the blob example in this paragraph in order to put people in the
right frame of mind.

> > +If the client replies with the following arguments:
> > +
> > + * packfile-uris
> > + * thin-pack
> > + * ofs-delta
> 
> "with the following" meaning "with all of the following", or "with
> any of the following"?  Is there a reason why the server side must
> require that the client understands and is willing to accept a
> thin-pack when wanting to use packfile-uris?  The same question for
> the ofs-delta.

"All of the following", but from your later comments, we probably don't
need this section anyway.

> My recommendation is to drop the mention of "thin" and "ofs" from
> the above list, and also from the following paragraph.  The "it MAY
> send" will serve as a practical escape clause to allow a server/CDN
> implementation that *ALWAYS* prepares pregenerated material that can
> only be digested by clients that supports thin and ofs.  Such a server
> can send packfile-URIs only when all of the three are given by the
> client and be compliant.  And such an update to the proposed document
> would allow a more diskful server to prepare both thin and non-thin
> pregenerated packs and choose which one to give to the client depending
> on the capability.

That is true - we can just let the server decide. I'll update the patch
accordingly, and state that the URIs should point to packs with features
like thin-pack and ofs-delta only if the client has declared that it
supports them.

> > +Clients then should understand that the returned packfile could be 
> > incomplete,
> > +and that it needs to download all the given URIs before the fetch or clone 
> > is
> > +complete. Each URI should point to a Git packfile (which may be a thin 
> > pack and
> > +which may contain offset deltas).
> 
> weaken or remove the (parenthetical comment) in the last sentence,
> and replace the beginning of the section with something like
> 
>   If the client replies with 'packfile-uris', when the server
>   sends the packfile, it MAY send a `packfile-uris` section...
> 
> You may steal what I wrote in the above response to help the
> server-side folks to decide how to actually implement the "it MAY
> send a packfile-uris" part in the document.

OK, will do.

> OK, this comes back to what I alluded to at the beginning.  We could
> respond to a full-clone request by feeding a series of packfile-uris
> and some ref information, perhaps like this:
> 
>   * Grab this packfile and update your remote-tracking refs
>   and tags to these values; you'd be as if you cloned the
>   project when it was at v1.0.
> 
>   * When you are done with the above, grab this packfile and
>   update your remote-tracking refs and tags to these values;
>   you'd be as if you cloned the project when it was at v2.0.
> 
>   * When you are done with the above, grab this packfile and
>   update your remote-tracking refs and tags to these values;
>   you'd be as if you cloned the project when it was at v3.0.
> 
>   ...
> 
>   * When you are done with the above, here is the remaining
>   packdata to bring you fully up to date with your original
>   "want"s.
> 
> and before fully reading the proposal, I anticipated that it was
> what you were going to describe.  The major difference is "up to the
> packdata given to you so far, you'd be as if you fetched these" ref
> information, which would allow you to be interrupted and then simply
> resume, without having to remember the set of packfile-uris yet to
> be processed across a fetch/clone failure.  If you sucessfully fetch
> packfile for ..v1.0, you can update the remote-tracking refs to
> match as if you fetched back when that was the most recent state of
> the project, and then if you failed while transferring packfile for
> v1.0..v2.0, 

Re: [WIP RFC 1/5] Documentation: order protocol v2 sections

2018-12-06 Thread Jonathan Tan
> > The git command line expects Git servers to follow a specific order of
> 
> "Command line"?  It sounds like you are talking about the order of
> command line arguments and options, but apparently that is not what
> you are doing.  Is it "The git over-the-wire protocol"?

I meant to say the current Git implementation, as opposed to what is
written in the specification. I'll replace it with "The current C Git
implementation".

> Earlier, we said that shallow-info is not given when packfile is not
> there.  That is captured in the updated EBNF above.  We don't have a
> corresponding removal of a bullet point for wanted-refs section below
> but probably that is because the original did not have corresponding
> bullet point to begin with.

That's because the corresponding bullet point had other information.
Quoted in full below:

>   * This section is only included if the client has requested a
> ref using a 'want-ref' line and if a packfile section is also
> included in the response.

I could reword it to "If a packfile section is included in the response,
this section is only included if the client has requested a ref using a
'want-ref' line", but I don't think that is significantly clearer.


Re: [PATCH 9/9] fetch: try fetching submodules if needed objects were not fetched

2018-12-04 Thread Jonathan Tan
> Try fetching a submodule by object id if the object id that the
> superproject points to, cannot be found.

Mention here the consequences of what happens when this attempt to fetch
fails. Also, this seems to be a case of "do or do not, there is no try"
- maybe it's better to say "Fetch commits by ID from the submodule's
origin if the submodule doesn't already contain the commit that the
superproject references" (note that there is no "Try" word, since the
consequence is to fail the entire command).

Also mention that this fetch is always from the origin.

> builtin/fetch used to only inspect submodules when they were fetched
> "on-demand", as in either on/off case it was clear whether the submodule
> needs to be fetched. However to know whether we need to try fetching the
> object ids, we need to identify the object names, which is done in this
> function check_for_new_submodule_commits(), so we'll also run that code
> in case the submodule recursion is set to "on".
> 
> The submodule checks were done only when a ref in the superproject
> changed, these checks were extended to also be performed when fetching
> into FETCH_HEAD for completeness, and add a test for that too.

"inspect submodules" and "submodule checks" are unnecessarily vague to
me - might be better to just say "A list of new submodule commits is
already generated in certain conditions (by
check_for_new_submodule_commits()); this new feature invokes that
function in more situations".

> - if ((recurse_submodules != RECURSE_SUBMODULES_OFF) &&
> - (recurse_submodules != RECURSE_SUBMODULES_ON))
> - check_for_new_submodule_commits(&ref->new_oid);
>   r = s_update_ref(msg, ref, 0);
>   format_display(display, r ? '!' : '*', what,
>  r ? _("unable to update local ref") : NULL,
> @@ -779,9 +776,6 @@ static int update_local_ref(struct ref *ref,
>   strbuf_add_unique_abbrev(&quickref, &current->object.oid, 
> DEFAULT_ABBREV);
>   strbuf_addstr(&quickref, "..");
>   strbuf_add_unique_abbrev(&quickref, &ref->new_oid, 
> DEFAULT_ABBREV);
> - if ((recurse_submodules != RECURSE_SUBMODULES_OFF) &&
> - (recurse_submodules != RECURSE_SUBMODULES_ON))
> - check_for_new_submodule_commits(&ref->new_oid);
>   r = s_update_ref("fast-forward", ref, 1);
>   format_display(display, r ? '!' : ' ', quickref.buf,
>  r ? _("unable to update local ref") : NULL,
> @@ -794,9 +788,6 @@ static int update_local_ref(struct ref *ref,
>   strbuf_add_unique_abbrev(&quickref, &current->object.oid, 
> DEFAULT_ABBREV);
>   strbuf_addstr(&quickref, "...");
>   strbuf_add_unique_abbrev(&quickref, &ref->new_oid, 
> DEFAULT_ABBREV);
> - if ((recurse_submodules != RECURSE_SUBMODULES_OFF) &&
> - (recurse_submodules != RECURSE_SUBMODULES_ON))
> - check_for_new_submodule_commits(&ref->new_oid);
>   r = s_update_ref("forced-update", ref, 1);
>   format_display(display, r ? '!' : '+', quickref.buf,
>  r ? _("unable to update local ref") : _("forced 
> update"),
> @@ -892,6 +883,8 @@ static int store_updated_refs(const char *raw_url, const 
> char *remote_name,
>   ref->force = rm->peer_ref->force;
>   }
>  
> + if (recurse_submodules != RECURSE_SUBMODULES_OFF)
> + check_for_new_submodule_commits(&rm->old_oid);

As discussed above, indeed, check_for_new_submodule_commits() is now
invoked in more situations (not just when there is a local ref, and also
when recurse_submodules is _ON).

> @@ -1231,8 +1231,14 @@ struct submodule_parallel_fetch {
>   int result;
>  
>   struct string_list changed_submodule_names;
> +
> + /* The submodules to fetch in */
> + struct fetch_task **oid_fetch_tasks;
> + int oid_fetch_tasks_nr, oid_fetch_tasks_alloc;
>  };

Better to document as "Pending fetches by OIDs", I think. (These are not
fetches by default refspec, and are not already in progress.)

> +struct fetch_task {
> + struct repository *repo;
> + const struct submodule *sub;
> + unsigned free_sub : 1; /* Do we need to free the submodule? */
> +
> + /* fetch specific oids if set, otherwise fetch default refspec */
> + struct oid_array *commits;
> +};

I would document this as "Fetch in progress (if callback data) or
pending (if in oid_fetch_tasks in struct submodule_parallel_fetch)".
This potential confusion is why I wanted 2 separate types, as I wrote in
[1].

[1] https://public-inbox.org/git/20181026204106.132296-1-jonathanta...@google.com/
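
For illustration, the split into two types that I proposed in [1] might
look like this (type names invented; the fields echo the patch's struct
fetch_task):

  /* Queued by OID in submodule_parallel_fetch; not yet started. */
  struct pending_oid_fetch {
          const struct submodule *sub;
          struct oid_array *commits; /* fetch these, not the default refspec */
  };

  /* Callback data of a fetch that is already in progress. */
  struct in_progress_fetch {
          struct repository *repo;
          const struct submodule *sub;
          unsigned free_sub : 1; /* was sub synthesized and must be freed? */
  };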

> +/**
> + * When a submodule is not defined in .gitmodules, we cannot access it
> + * via the regular submodule-config. Create a fake submodule, which we can
> + * work on.
> + */
> +static const struct submodule *get_non_gitmodules_submodule(const char *path)


Re: [PATCH 8/9] submodule.c: fetch in submodules git directory instead of in worktree

2018-12-04 Thread Jonathan Tan
> Keep the properties introduced in 10f5c52656 (submodule: avoid
> auto-discovery in prepare_submodule_repo_env(), 2016-09-01), by fixating
> the git directory of the submodule.

This is to avoid the autodetection of the Git repository, making it less
error-prone; looks good to me.


Re: [PATCH 7/9] submodule: migrate get_next_submodule to use repository structs

2018-12-04 Thread Jonathan Tan
> We used to recurse into submodules, even if they were broken having
> only an objects directory. The child process executed in the submodule
> would fail though if the submodule was broken. This is tested via
> "fetching submodule into a broken repository" in t5526.
> 
> This patch tightens the check upfront, such that we do not need
> to spawn a child process to find out if the submodule is broken.

Thanks, patches 4-7 look good to me - I see that you have addressed all
my comments. Not sending one email each for patches 4, 5, and 6 -
although I have commented on all of them, my comments were minor.

My more in-depth review was done on a previous version [1], and I see
that my comments have been addressed. Also, Stefan says [2] (and implements
in this patch):

> > > If the working tree directory is empty for that submodule, it means
> > > it is likely not initialized. But why would we use that as a signal to
> > > skip the submodule?
> >
> > What I meant was: if empty, skip it completely. Otherwise, do the
> > repo_submodule_init() and repo_init() thing, and if they both fail, set
> > spf->result to 1, preserving existing behavior.
> 
> I did it the other way round:
> 
> If repo_[submodule_]init fails, see if we have a gitlink in tree and
> an empty dir in the FS, to decide if we need to signal failure.

This works too.

[1] https://public-inbox.org/git/20181017225811.66554-1-jonathanta...@google.com/
[2] https://public-inbox.org/git/CAGZ79kbNXD35ZwevjLZcrGsT=2hncupmvuwvp1rjsksh0gd...@mail.gmail.com/


Re: [PATCH 3/9] submodule.c: sort changed_submodule_names before searching it

2018-12-04 Thread Jonathan Tan
> We can string_list_insert() to maintain sorted-ness of the
> list as we find new items, or we can string_list_append() to
> build an unsorted list and sort it at the end just once.
> 
> As we do not rely on the sortedness while building the
> list, we pick the "append and sort at the end" as it
> has better worst case execution times.

I would write this entire commit message as:

  Instead of using unsorted_string_list_lookup(), sort
  changed_submodule_names before performing any lookups so that we can
  use the faster string_list_lookup() instead.
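
For reference, the append-then-sort pattern with git's string_list API
is just the following (a minimal sketch, not code from the patch):

  struct string_list names = STRING_LIST_INIT_DUP;

  string_list_append(&names, "submodule/b"); /* O(1) per item */
  string_list_append(&names, "submodule/a");
  string_list_sort(&names);                  /* one O(n log n) sort */

  if (string_list_lookup(&names, "submodule/a"))
          ; /* found - a binary search, valid only on a sorted list */

  string_list_clear(&names, 0);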

The code in this patch is fine, and patches 1-2 are fine too.


[PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()

2018-12-04 Thread Jonathan Tan
When fetching into a repository, a connectivity check is first made by
check_exist_and_connected() in builtin/fetch.c that runs:

  git rev-list --objects --stdin --not --all --quiet <(list of objects)

If the client repository has many refs, this command can be slow,
regardless of the nature of the server repository or what is being
fetched. A profiler reveals that most of the time is spent in
setup_revisions() (approx. 60/63), and of the time spent in
setup_revisions(), most of it is spent in parse_object() (approx.
49/60). This is because setup_revisions() parses the target of every ref
(from "--all"), and parse_object() reads the buffer of the object.

Reading the buffer is unnecessary if the repository has a commit graph
and if the ref points to a commit (which is typically the case). This
patch uses the commit graph wherever possible; on my computer, when I
run the above command with a list of 1 object on a many-ref repository,
I get a speedup from 1.8s to 1.0s.

Another way to accomplish this effect would be to modify parse_object()
to use the commit graph if possible; however, I did not want to change
parse_object()'s current behavior of always checking the object
signature of the returned object.

Signed-off-by: Jonathan Tan 
---
This is on sb/more-repo-in-api because I'm using the repo_parse_commit()
function.

A colleague noticed this issue when handling a mirror clone.

Looking at the bigger picture, the speed of the connectivity check
during a fetch might be further improved by passing only the negotiation
tips (obtained through --negotiation-tip) instead of "--all". This patch
just handles the low-hanging fruit first.
---
 revision.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/revision.c b/revision.c
index b5108b75ab..e7da2c57ab 100644
--- a/revision.c
+++ b/revision.c
@@ -212,7 +212,20 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
 {
struct object *object;
 
-   object = parse_object(revs->repo, oid);
+   /*
+* If the repository has commit graphs, repo_parse_commit() avoids
+* reading the object buffer, so use it whenever possible.
+*/
+   if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) {
+   struct commit *c = lookup_commit(revs->repo, oid);
+   if (!repo_parse_commit(revs->repo, c))
+   object = (struct object *) c;
+   else
+   object = NULL;
+   } else {
+   object = parse_object(revs->repo, oid);
+   }
+
if (!object) {
if (revs->ignore_missing)
return object;
-- 
2.19.0.271.gfe8321ec05.dirty



Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc

2018-12-04 Thread Jonathan Tan
> Some thoughts here:
> 
> First, I'd like to see a section (and a bit in the implementation)
> requiring HTTPS if the original protocol is secure (SSH or HTTPS).
> Allowing the server to downgrade to HTTP, even by accident, would be a
> security problem.
> 
> Second, this feature likely should be opt-in for SSH. One issue I've
> seen repeatedly is that people don't want to use HTTPS to fetch things
> when they're using SSH for Git. Many people in corporate environments
> have proxies that break HTTP for non-browser use cases[0], and using SSH
> is the only way that they can make a functional Git connection.

Good points about SSH support and the client needing to control which
protocols the server will send URIs for. I'll include a line in the
client request in which the client can specify which protocols it is OK
with.

> Third, I think the server needs to be required to both support Range
> headers and never change the content of a URI, so that we can have
> resumable clone implicit in this design. There are some places in the
> world where connections are poor and fetching even the initial packfile
> at once might be a problem. (I've seen such questions on Stack
> Overflow, for example.)

Good points. I'll add these in the next revision.

> Having said that, I think overall this is a good idea and I'm glad to
> see a proposal for it.

Thanks, and thanks for your comments too.


[WIP RFC 0/5] Design for offloading part of packfile response to CDN

2018-12-03 Thread Jonathan Tan
Some of us have been working on a design to improve the scalability of
Git servers by allowing them to offload part of the packfile response to
CDNs in this way: returning HTTP(S) URIs in fetch responses in addition
to packfiles.

This can reduce the load on individual Git servers and improves
proximity (by having data served from closer to the user).

I have included here a design document (patch 2) and a rough
implementation of the server (patch 5). Currently, the implementation
only allows replacing single blobs with URIs, but the protocol
improvement is designed in such a way as to allow independent
improvement of Git server implementations.

There is a potential issue: a server which produces both the URIs and
the packfile at roughly the same time (like the implementation in this
patch set) will not have sideband access until it has concluded sending
the URIs. Among other things, this means that the server cannot send
keepalive packets until quite late in the response. One solution to this
might be to add a feature that allows the server to use a sideband
throughout the whole response - and this has other benefits too like
allowing servers to inform the client throughout the whole fetch, not
just at the end.

Jonathan Tan (5):
  Documentation: order protocol v2 sections
  Documentation: add Packfile URIs design doc
  upload-pack: refactor reading of pack-objects out
  upload-pack: refactor writing of "packfile" line
  upload-pack: send part of packfile response as uri

 Documentation/technical/packfile-uri.txt |  83 +++++++++++++++++++++++++
 Documentation/technical/protocol-v2.txt  |  22 +++++---
 builtin/pack-objects.c                   |  48 +++++++++++++++
 fetch-pack.c                             |   9 +++
 t/t5702-protocol-v2.sh                   |  25 ++++++++
 upload-pack.c                            | 150 +++++++++++++++---------
 6 files changed, 285 insertions(+), 52 deletions(-)
 create mode 100644 Documentation/technical/packfile-uri.txt

-- 
2.19.0.271.gfe8321ec05.dirty



[WIP RFC 1/5] Documentation: order protocol v2 sections

2018-12-03 Thread Jonathan Tan
The git command line expects Git servers to follow a specific order of
sections when transmitting protocol v2 responses, but this is not
explicit in the documentation. Make the order explicit.

Signed-off-by: Jonathan Tan 
---
 Documentation/technical/protocol-v2.txt | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index 09e4e0273f..345c00e08c 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -309,11 +309,11 @@ the 'wanted-refs' section in the server's response as explained below.
 
 The response of `fetch` is broken into a number of sections separated by
 delimiter packets (0001), with each section beginning with its section
-header.
+header. Most sections are sent only when the packfile is sent.
 
-output = *section
-section = (acknowledgments | shallow-info | wanted-refs | packfile)
- (flush-pkt | delim-pkt)
+output = acknowledgements flush-pkt |
+[acknowledgments delim-pkt] [shallow-info delim-pkt]
+[wanted-refs delim-pkt] packfile flush-pkt
 
 acknowledgments = PKT-LINE("acknowledgments" LF)
  (nak | *ack)
@@ -335,9 +335,10 @@ header.
   *PKT-LINE(%x01-03 *%x00-ff)
 
 acknowledgments section
-   * If the client determines that it is finished with negotiations
- by sending a "done" line, the acknowledgments sections MUST be
- omitted from the server's response.
+   * If the client determines that it is finished with negotiations by
+ sending a "done" line (thus requiring the server to send a packfile),
+ the acknowledgments sections MUST be omitted from the server's
+ response.
 
* Always begins with the section header "acknowledgments"
 
@@ -388,9 +389,6 @@ header.
  which the client has not indicated was shallow as a part of
  its request.
 
-   * This section is only included if a packfile section is also
- included in the response.
-
 wanted-refs section
* This section is only included if the client has requested a
  ref using a 'want-ref' line and if a packfile section is also
-- 
2.19.0.271.gfe8321ec05.dirty



[WIP RFC 5/5] upload-pack: send part of packfile response as uri

2018-12-03 Thread Jonathan Tan
This is a partial implementation of upload-pack sending part of its
packfile response as URIs.

The client is not fully implemented - it knows to ignore the
"packfile-uris" section, but because it does not actually fetch those
URIs, the returned packfile is incomplete. A test is included to show
that the appropriate URI is indeed transmitted, and that the returned
packfile is lacking exactly the expected object.

Signed-off-by: Jonathan Tan 
---
 builtin/pack-objects.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 fetch-pack.c           |  9 +++++++++
 t/t5702-protocol-v2.sh | 25 +++++++++++++++++++++++++
 upload-pack.c          | 37 ++++++++++++++++++++++++++++++----
 4 files changed, 115 insertions(+), 4 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index e7ea206c08..2abbddd3cb 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -117,6 +117,15 @@ enum missing_action {
 static enum missing_action arg_missing_action;
 static show_object_fn fn_show_object;
 
+struct configured_exclusion {
+   struct oidmap_entry e;
+   char *uri;
+};
+static struct oidmap configured_exclusions;
+
+static int exclude_configured_blobs;
+static struct oidset excluded_by_config;
+
 /*
  * stats
  */
@@ -831,6 +840,23 @@ static off_t write_reused_pack(struct hashfile *f)
return reuse_packfile_offset - sizeof(struct pack_header);
 }
 
+static void write_excluded_by_configs(void)
+{
+   struct oidset_iter iter;
+   const struct object_id *oid;
+
+   oidset_iter_init(&excluded_by_config, &iter);
+   while ((oid = oidset_iter_next(&iter))) {
+   struct configured_exclusion *ex =
+   oidmap_get(&configured_exclusions, oid);
+
+   if (!ex)
+   BUG("configured exclusion wasn't configured");
+   write_in_full(1, ex->uri, strlen(ex->uri));
+   write_in_full(1, "\n", 1);
+   }
+}
+
 static const char no_split_warning[] = N_(
 "disabling bitmap writing, packs are split due to pack.packSizeLimit"
 );
@@ -1124,6 +1150,12 @@ static int want_object_in_pack(const struct object_id *oid,
}
}
 
+   if (exclude_configured_blobs &&
+   oidmap_get(&configured_exclusions, oid)) {
+   oidset_insert(&excluded_by_config, oid);
+   return 0;
+   }
+
return 1;
 }
 
@@ -2728,6 +2760,19 @@ static int git_pack_config(const char *k, const char *v, void *cb)
pack_idx_opts.version);
return 0;
}
+   if (!strcmp(k, "uploadpack.blobpackfileuri")) {
+   struct configured_exclusion *ex = xmalloc(sizeof(*ex));
+   const char *end;
+
+   if (parse_oid_hex(v, &ex->e.oid, &end) || *end != ' ')
+   die(_("value of uploadpack.blobpackfileuri must be "
+ "of the form '<sha1> <uri>' (got '%s')"), v);
+   if (oidmap_get(&configured_exclusions, &ex->e.oid))
+   die(_("object already configured in another "
+ "uploadpack.blobpackfileuri (got '%s')"), v);
+   ex->uri = xstrdup(end + 1);
+   oidmap_put(&configured_exclusions, ex);
+   }
return git_default_config(k, v, cb);
 }
 
@@ -3314,6 +3359,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 N_("do not pack objects in promisor packfiles")),
OPT_BOOL(0, "delta-islands", _delta_islands,
 N_("respect islands during delta compression")),
+   OPT_BOOL(0, "exclude-configured-blobs", 
_configured_blobs,
+N_("respect uploadpack.blobpackfileuri")),
OPT_END(),
};
 
@@ -3487,6 +3534,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
return 0;
if (nr_result)
prepare_pack(window, depth);
+   write_excluded_by_configs();
write_pack_file();
if (progress)
fprintf_ln(stderr,
diff --git a/fetch-pack.c b/fetch-pack.c
index 9691046e64..6e1985ab55 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1413,6 +1413,15 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
receive_wanted_refs(&reader, sought, nr_sought);
 
/* get the pack */
+   if (process_section_header(&reader, "packfile-uris", 1)) {
+   /* skip the whole section */
+   process_section_header(&reader, "packfile-uris", 0);
+   while (packet_reader_read(&reader) == PACKET_READ_NORMAL) {
+   /* do nothing */
+   }
+   if (read

[WIP RFC 4/5] upload-pack: refactor writing of "packfile" line

2018-12-03 Thread Jonathan Tan
A subsequent patch allows pack-objects to output additional information
(in addition to the packfile that it currently outputs). This means that
we must hold off on writing the "packfile" section header to the client
before we process the output of pack-objects, so move the writing of
the "packfile" section header to read_pack_objects_stdout().

Unfortunately, this also means that we cannot send keepalive packets
until pack-objects starts sending out the packfile, since the sideband
is only established when the "packfile" section header is sent.

Signed-off-by: Jonathan Tan 
---
 upload-pack.c | 47 ++++++++++++++++++++++++++++++++++-----------
 1 file changed, 36 insertions(+), 11 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index cec43e0a46..aa2589b858 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -104,9 +104,12 @@ static int write_one_shallow(const struct commit_graft *graft, void *cb_data)
 struct output_state {
char buffer[8193];
int used;
+   unsigned packfile_started : 1;
+   struct strbuf progress_buf;
 };
 
-static int read_pack_objects_stdout(int outfd, struct output_state *os)
+static int read_pack_objects_stdout(int outfd, struct output_state *os,
+   int use_protocol_v2)
 {
/* Data ready; we keep the last byte to ourselves
 * in case we detect broken rev-list, so that we
@@ -126,6 +129,12 @@ static int read_pack_objects_stdout(int outfd, struct output_state *os)
}
os->used += readsz;
 
+   if (!os->packfile_started) {
+   os->packfile_started = 1;
+   if (use_protocol_v2)
+   packet_write_fmt(1, "packfile\n");
+   }
+
if (os->used > 1) {
send_client_data(1, os->buffer, os->used - 1);
os->buffer[0] = os->buffer[os->used - 1];
@@ -138,8 +147,17 @@ static int read_pack_objects_stdout(int outfd, struct output_state *os)
return readsz;
 }
 
+static void flush_progress_buf(struct strbuf *progress_buf)
+{
+   if (!progress_buf->len)
+   return;
+   send_client_data(2, progress_buf->buf, progress_buf->len);
+   strbuf_reset(progress_buf);
+}
+
 static void create_pack_file(const struct object_array *have_obj,
-const struct object_array *want_obj)
+const struct object_array *want_obj,
+int use_protocol_v2)
 {
struct child_process pack_objects = CHILD_PROCESS_INIT;
struct output_state output_state = {0};
@@ -260,9 +278,13 @@ static void create_pack_file(const struct object_array *have_obj,
 */
sz = xread(pack_objects.err, progress,
  sizeof(progress));
-   if (0 < sz)
-   send_client_data(2, progress, sz);
-   else if (sz == 0) {
+   if (0 < sz) {
+   if (output_state.packfile_started)
+   send_client_data(2, progress, sz);
+   else
+   strbuf_add(&output_state.progress_buf,
+  progress, sz);
+   } else if (sz == 0) {
close(pack_objects.err);
pack_objects.err = -1;
}
@@ -273,8 +295,10 @@ static void create_pack_file(const struct object_array *have_obj,
}
if (0 <= pu && (pfd[pu].revents & (POLLIN|POLLHUP))) {
int result = read_pack_objects_stdout(pack_objects.out,
- &output_state);
-
+ &output_state,
+ use_protocol_v2);
+   if (output_state.packfile_started)
+   flush_progress_buf(&output_state.progress_buf);
if (result == 0) {
close(pack_objects.out);
pack_objects.out = -1;
@@ -293,12 +317,14 @@ static void create_pack_file(const struct object_array *have_obj,
 * protocol to say anything, so those clients are just out of
 * luck.
 */
-   if (!ret && use_sideband) {
+   if (!ret && use_sideband && output_state.packfile_started) {
static const char buf[] = "0005\1";
write_or_die(1, buf, 5);
}
}
 
+   flush_progress_buf(&output_state.progress_buf);
+
if (finish_command(_objects)) {
error("git upl

[WIP RFC 2/5] Documentation: add Packfile URIs design doc

2018-12-03 Thread Jonathan Tan
Signed-off-by: Jonathan Tan 
---
 Documentation/technical/packfile-uri.txt | 83 +++++++++++++++++++++++++++
 Documentation/technical/protocol-v2.txt  |  6 +++++-
 2 files changed, 88 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/technical/packfile-uri.txt

diff --git a/Documentation/technical/packfile-uri.txt b/Documentation/technical/packfile-uri.txt
new file mode 100644
index 00..6535801486
--- /dev/null
+++ b/Documentation/technical/packfile-uri.txt
@@ -0,0 +1,83 @@
+Packfile URIs
+=
+
+This feature allows servers to serve part of their packfile response as URIs.
+This allows server designs that improve scalability in bandwidth and CPU usage
+(for example, by serving some data through a CDN), and (in the future) provides
+some measure of resumability to clients.
+
+This feature is available only in protocol version 2.
+
+Protocol
+
+
+The server advertises `packfile-uris`.
+
+If the client replies with the following arguments:
+
+ * packfile-uris
+ * thin-pack
+ * ofs-delta
+
+when the server sends the packfile, it MAY send a `packfile-uris` section
+directly before the `packfile` section (right after `wanted-refs` if it is
+sent) containing HTTP(S) URIs. See protocol-v2.txt for the documentation of
+this section.
+
+Clients then should understand that the returned packfile could be incomplete,
+and that it needs to download all the given URIs before the fetch or clone is
+complete. Each URI should point to a Git packfile (which may be a thin pack and
+which may contain offset deltas).
+
+Server design
+-
+
+The server can be trivially made compatible with the proposed protocol by
+having it advertise `packfile-uris`, tolerating the client sending
+`packfile-uris`, and never sending any `packfile-uris` section. But we should
+include some sort of non-trivial implementation in the Minimum Viable Product,
+at least so that we can test the client.
+
+This is the implementation: a feature, marked experimental, that allows the
+server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
+<uri>` entries. Whenever the list of objects to be sent is assembled, a blob
+with the given sha1 can be replaced by the given URI. This allows, for example,
+servers to delegate serving of large blobs to CDNs.
+
+Client design
+-
+
+While fetching, the client needs to remember the list of URIs and cannot
+declare that the fetch is complete until all URIs have been downloaded as
+packfiles.
+
+The division of work (initial fetch + additional URIs) introduces convenient
+points for resumption of an interrupted clone - such resumption can be done
+after the Minimum Viable Product (see "Future work").
+
+The client can inhibit this feature (i.e. refrain from sending the
+`packfile-urls` parameter) by passing --no-packfile-urls to `git fetch`.
+
+Future work
+---
+
+The protocol design allows some evolution of the server and client without any
+need for protocol changes, so only a small-scoped design is included here to
+form the MVP. For example, the following can be done:
+
+ * On the server, a long-running process that takes in entire requests and
+   outputs a list of URIs and the corresponding inclusion and exclusion sets of
+   objects. This allows, e.g., signed URIs to be used and packfiles for common
+   requests to be cached.
+ * On the client, resumption of clone. If a clone is interrupted, information
+   could be recorded in the repository's config and a "clone-resume" command
+   can resume the clone in progress. (Resumption of subsequent fetches is more
+   difficult because that must deal with the user wanting to use the repository
+   even after the fetch was interrupted.)
+
+There are some possible features that will require a change in protocol:
+
+ * Additional HTTP headers (e.g. authentication)
+ * Byte range support
+ * Different file formats referenced by URIs (e.g. raw object)
+
diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index 345c00e08c..2cb1c41742 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -313,7 +313,8 @@ header. Most sections are sent only when the packfile is sent.
 
 output = acknowledgements flush-pkt |
 [acknowledgments delim-pkt] [shallow-info delim-pkt]
-[wanted-refs delim-pkt] packfile flush-pkt
+[wanted-refs delim-pkt] [packfile-uris delim-pkt]
+packfile flush-pkt
 
 acknowledgments = PKT-LINE("acknowledgments" LF)
  (nak | *ack)
@@ -331,6 +332,9 @@ header. Most sections are sent only when the packfile is sent.
  *PKT-LINE(wanted-ref LF)
 wanted-ref = obj-id SP refname
 
+packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
+packfile-uri = PKT-LINE("uri" SP *%x20-ff LF)
+
 packfile = PKT-LINE("packfile" LF)
   *PKT-LINE(%x01-03 *%x00-ff)
 
-- 
2.19.0.271.gfe8321ec05.dirty



[WIP RFC 3/5] upload-pack: refactor reading of pack-objects out

2018-12-03 Thread Jonathan Tan
Subsequent patches will change how the output of pack-objects is
processed, so extract that processing into its own function.

Currently, at most 1 character can be buffered (in the "buffered" local
variable). One of those patches will require a larger buffer, so replace
that "buffered" local variable with a buffer array.

Signed-off-by: Jonathan Tan 
---
 upload-pack.c | 80 ++++++++++++++++++++++++++++++++----------------------
 1 file changed, 46 insertions(+), 34 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index 5e81f1ff24..cec43e0a46 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -101,14 +101,51 @@ static int write_one_shallow(const struct commit_graft *graft, void *cb_data)
return 0;
 }
 
+struct output_state {
+   char buffer[8193];
+   int used;
+};
+
+static int read_pack_objects_stdout(int outfd, struct output_state *os)
+{
+   /* Data ready; we keep the last byte to ourselves
+* in case we detect broken rev-list, so that we
+* can leave the stream corrupted.  This is
+* unfortunate -- unpack-objects would happily
+* accept a valid packdata with trailing garbage,
+* so appending garbage after we pass all the
+* pack data is not good enough to signal
+* breakage to downstream.
+*/
+   ssize_t readsz;
+
+   readsz = xread(outfd, os->buffer + os->used,
+  sizeof(os->buffer) - os->used);
+   if (readsz < 0) {
+   return readsz;
+   }
+   os->used += readsz;
+
+   if (os->used > 1) {
+   send_client_data(1, os->buffer, os->used - 1);
+   os->buffer[0] = os->buffer[os->used - 1];
+   os->used = 1;
+   } else {
+   send_client_data(1, os->buffer, os->used);
+   os->used = 0;
+   }
+
+   return readsz;
+}
+
 static void create_pack_file(const struct object_array *have_obj,
 const struct object_array *want_obj)
 {
struct child_process pack_objects = CHILD_PROCESS_INIT;
-   char data[8193], progress[128];
+   struct output_state output_state = {0};
+   char progress[128];
char abort_msg[] = "aborting due to possible repository "
"corruption on the remote side.";
-   int buffered = -1;
ssize_t sz;
int i;
FILE *pipe_fd;
@@ -235,39 +272,15 @@ static void create_pack_file(const struct object_array *have_obj,
continue;
}
if (0 <= pu && (pfd[pu].revents & (POLLIN|POLLHUP))) {
-   /* Data ready; we keep the last byte to ourselves
-* in case we detect broken rev-list, so that we
-* can leave the stream corrupted.  This is
-* unfortunate -- unpack-objects would happily
-* accept a valid packdata with trailing garbage,
-* so appending garbage after we pass all the
-* pack data is not good enough to signal
-* breakage to downstream.
-*/
-   char *cp = data;
-   ssize_t outsz = 0;
-   if (0 <= buffered) {
-   *cp++ = buffered;
-   outsz++;
-   }
-   sz = xread(pack_objects.out, cp,
- sizeof(data) - outsz);
-   if (0 < sz)
-   ;
-   else if (sz == 0) {
+   int result = read_pack_objects_stdout(pack_objects.out,
+ &output_state);
+
+   if (result == 0) {
close(pack_objects.out);
pack_objects.out = -1;
-   }
-   else
+   } else if (result < 0) {
goto fail;
-   sz += outsz;
-   if (1 < sz) {
-   buffered = data[sz-1] & 0xFF;
-   sz--;
}
-   else
-   buffered = -1;
-   send_client_data(1, data, sz);
}
 
/*
@@ -292,9 +305,8 @@ static void create_pack_file(const struct object_array *have_obj,
}
 
/* flush the data */
-   if (0 <= buffered) {
-   data[0] = buffered;
-   send_client_data(1, data, 1);
+   if (output_state.used > 0) {
+   send_client_data(1, output_state.buffer, output_state.used);
fprintf(stderr, "flushed.\n");
}
if (use_sideband)
-- 
2.19.0.271.gfe8321ec05.dirty



Re: [PATCHv3 00/23] Bring more repository handles into our code base

2018-11-15 Thread Jonathan Tan
> Please have a look at the last 4 patches specifically as they were new in
> the last iteration (but did not receive any comment), as they demonstrate
> and fix a problem that is only exposed when using GIT_TEST_COMMIT_GRAPH=1
> for the test suite.

Thanks. I only reviewed patches 18 and 20-23, as only those are
different from the previous iteration according to the range-diff.

I've written my comments about patch 18 already [1], and the other
patches look good to me.

In patch 21, I could go either way about whether it's more desirable to
pass the pool or the repository to the freeing functions.

Thanks for discovering the issue that patch 23 illustrates. I thought
that the tests were written carefully enough in that the_repository
didn't have any relevant objects or configurations (all relevant data
was in a path that is not the default repository), but apparently some
still slipped through.

[1] https://public-inbox.org/git/20181115195416.21890-1-jonathanta...@google.com/


Re: [PATCH 18/23] submodule: use submodule repos for object lookup

2018-11-15 Thread Jonathan Tan
> +/*
> + * Initialize 'out' based on the provided submodule path.
> + *
> + * Unlike repo_submodule_init, this tolerates submodules not present
> + * in .gitmodules. This function exists only to preserve historical behavior,
> + *
> + * Returns 0 on success, -1 when the submodule is not present.
>   */
> -static void show_submodule_header(struct diff_options *o, const char *path,
> +static struct repository *open_submodule(const char *path)

The function documentation needs to be reworded - there's no "out", and
the return value is now a possibly NULL pointer to struct repository.

> +{
> + struct strbuf sb = STRBUF_INIT;
> + struct repository *out = xmalloc(sizeof(*out));
> +
> + if (submodule_to_gitdir(&sb, path) || repo_init(out, sb.buf, NULL)) {
> + strbuf_release(&sb);
> + free(out);
> + return NULL;
> + }
> +
> + out->submodule_prefix = xstrdup(path);

I've discussed this submodule_prefix line before [1] - do we really need
this? Tests pass even if I remove this line.

Other than that, this patch looks good.

[1] https://public-inbox.org/git/20181019203750.110741-1-jonathanta...@google.com/


Re: [PATCH 2/3] ieot: default to not writing IEOT section

2018-11-12 Thread Jonathan Tan
> +index.recordOffsetTable::
> + Specifies whether the index file should include an "Index Entry
> + Offset Table" section. This reduces index load time on
> + multiprocessor machines but produces a message "ignoring IEOT
> + extension" when reading the index using Git versions before 2.20.
> + Defaults to 'false'.

Probably worth adding a test that exercises this new config option -
somehow create an index with index.recordOffsetTable=1, check that the
index contains the appropriate string (a few ways to do this, but I'm
not sure which are portable), and then run a Git command that reads the
index to make sure it is valid; then do the same except
index.recordOffsetTable=0.

The code itself looks good to me.

Same comment for patch 1.


Re: [PATCH 1/2] ls-remote: do not send ref prefixes for patterns

2018-11-08 Thread Jonathan Tan
> Jeff King  writes:
> 
> > Since b4be74105f (ls-remote: pass ref prefixes when requesting a
> > remote's refs, 2018-03-15), "ls-remote foo" will pass "refs/heads/foo",
> > "refs/tags/foo", etc to the transport code in an attempt to let the
> > other side reduce the size of its advertisement.
> 
> Jonathan, seeing 2b554353 ("fetch: send "refs/tags/" prefix upon CLI
> refspecs", 2018-06-05), I am guessing that you are doing the proto v2
> work inherited from Brandon?

Sorry for the late reply - I had some personal events, but I should be
able to look more at Git stuff from now on.

Well, it's true that I have been fixing some bugs related to protocol
v2.

> Having to undo this is unfortunate, but
> I agree with this patch that we need to do this until ref prefix learns
> to grok wildcards.

It is unfortunate, although as far as I can tell, the performance
improvements from "git fetch" (which I think is the more frequent
command called) remain, so it might not be so bad. I see from your
What's Cooking that these patches are to be merged to master, which I
agree with.


Re: [PATCH v3 2/3] shallow: offer to prune only non-existing entries

2018-10-26 Thread Jonathan Tan
> Thanks for confirming.
> 
> So even better would be to use `is_promisor_object(oid) ||
> has_object_file(oid)`, right?
> 
> This is something that is probably not even needed: as I mentioned, the
> shallow commits are *expected* to be local. It should not ever happen that
> they are fetched.

That would help, but I don't think it would help in the "fast-forward
from A to B where A is B's parent" case I describe in [1].

My suggestion was:

> It sounds safer to me to use the fast approach in this patch when the
> repository is not partial, and stick to the slow approach when it is.

which can be done by replacing "prune_shallow(0, 1)" in patch 3 with
"prune_shallow(0, !repository_format_partial_clone)", possibly with a comment
that the fast method checks object existence for each shallow line directly,
which is undesirable when the repository is a partial clone.
(repository_format_partial_clone is non-NULL with the name of the promisor
remote if the repository is a partial clone, and NULL otherwise).
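
In code, the suggestion amounts to something like this (a sketch):

	/*
	 * The fast mode checks object existence for each shallow line
	 * directly; in a partial clone such a check could trigger a lazy
	 * fetch, so stick to the slow mode there.
	 */
	prune_shallow(0, !repository_format_partial_clone);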

[1] 
https://public-inbox.org/git/20181025185459.206127-1-jonathanta...@google.com/


Re: [PATCH 10/10] builtin/fetch: check for submodule updates in any ref update

2018-10-26 Thread Jonathan Tan
> Gerrit, the code review tool, has a different workflow than our mailing
> list based approach. Usually users upload changes to a Gerrit server and
> continuous integration and testing happens by bots. Sometimes however a
> user wants to checkout a change locally and look at it locally. For this
> use case, Gerrit offers a command line snippet to copy and paste to your
> terminal, which looks like

As I said in my review of the previous patch, I think this should be
squashed into the previous patch.

Also, remember to add the version when formatting the e-mail ("[PATCH
v? ??/??]").


Re: [PATCH 09/10] fetch: try fetching submodules if needed objects were not fetched

2018-10-26 Thread Jonathan Tan
> But this default fetch is not sufficient, as a newly fetched commit in
> the superproject could point to a commit in the submodule that is not
> in the default refspec. This is common in workflows like Gerrit's.
> When fetching a Gerrit change under review (from refs/changes/??), the
> commits in that change likely point to submodule commits that have not
> been merged to a branch yet.
> 
> Try fetching a submodule by object id if the object id that the
> superproject points to, cannot be found.

I see that some of my suggestions (from [1]) were implemented, but not
others. If you disagree with the rest, that's fine, but I think they
should be discussed.

[1] 
https://public-inbox.org/git/20181018003954.139498-1-jonathanta...@google.com/

> The try does not happen when the "git fetch" done at the
> superproject is not storing the fetched results in remote
> tracking branches (i.e. instead just recording them to
> FETCH_HEAD) in this step. A later patch will fix this.

E.g. here, I said that there was no remote tracking branch involved.

> - if ((recurse_submodules != RECURSE_SUBMODULES_OFF) &&
> - (recurse_submodules != RECURSE_SUBMODULES_ON))
> + if (recurse_submodules != RECURSE_SUBMODULES_OFF)

I think the next patch should be squashed into this patch. Then you can
say that these are now redundant and can be removed.

> @@ -1218,8 +1218,12 @@ struct submodule_parallel_fetch {
>   int result;
>  
>   struct string_list changed_submodule_names;
> + struct get_next_submodule_task **fetch_specific_oids;
> + int fetch_specific_oids_nr, fetch_specific_oids_alloc;
>  };

Add documentation for fetch_specific_oids. Also, it might be better to
call these oid_fetch_tasks and the struct, "struct fetch_task".

Here, struct get_next_submodule_task is used for 2 different things:
 (1) After the first round fetch, fetch_finish() uses it to determine if
 a second round is needed.
 (2) In submodule_parallel_fetch.fetch_specific_oids, to tell the
 parallel runner (through get_next_submodule_task()) to do this
 fetch.

I think that it's better to have 2 different structs for these 2
different uses. (Note that task_cb can be NULL for the second round.
Unless we plan to recheck the OIDs? Currently we recheck them, but we
don't do anything either way.)
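
As a sketch of the split I have in mind (the names here are just
suggestions):

/* State for one submodule fetch, inspected later by fetch_finish(). */
struct fetch_task {
	struct repository *repo;
	const struct submodule *sub;
	unsigned free_sub : 1; /* Do we need to free the submodule? */
	struct oid_array *commits;
};

/* A queued instruction to fetch specific OIDs from one submodule. */
struct oid_fetch_task {
	struct repository *repo;
	struct oid_array *commits;
};

submodule_parallel_fetch would then hold a "struct oid_fetch_task
**oid_fetch_tasks" array with the usual nr/alloc pair.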

> +static const struct submodule *get_default_submodule(const char *path)
> +{
> + struct submodule *ret = NULL;
> + const char *name = default_name_or_path(path);
> +
> + if (!name)
> + return NULL;
> +
> + ret = xmalloc(sizeof(*ret));
> + memset(ret, 0, sizeof(*ret));
> + ret->path = name;
> + ret->name = name;
> +
> + return (const struct submodule *) ret;
> +}

I think that this is best described as the submodule that has no entry
in .gitmodules? Maybe call it "get_non_gitmodules_submodule" and
document it with a similar comment as in get_submodule_repo_for().
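
That is, something like this (a sketch):

/*
 * No entry in .gitmodules? Technically not a submodule, but historically
 * we supported repositories that happen to be in-place where a gitlink
 * is. Keep supporting them.
 */
static const struct submodule *get_non_gitmodules_submodule(const char *path)
{
	struct submodule *ret;
	const char *name = default_name_or_path(path);

	if (!name)
		return NULL;

	ret = xcalloc(1, sizeof(*ret));
	ret->path = name;
	ret->name = name;
	return (const struct submodule *)ret;
}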

> +
> +static struct get_next_submodule_task *get_next_submodule_task_create(
> + struct repository *r, const char *path)
> +{
> + struct get_next_submodule_task *task = xmalloc(sizeof(*task));
> + memset(task, 0, sizeof(*task));
> +
> + task->sub = submodule_from_path(r, &null_oid, path);
> + if (!task->sub) {
> + task->sub = get_default_submodule(path);
> + task->free_sub = 1;
> + }
> +
> + return task;
> +}

Clearer to me would be to make get_next_submodule_task_create() return
NULL if submodule_from_path() returns NULL.
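
That way the caller no longer needs to peek at task->sub. One way to
structure it (a sketch, not tested):

static struct get_next_submodule_task *get_next_submodule_task_create(
	struct repository *r, const char *path)
{
	struct get_next_submodule_task *task = xcalloc(1, sizeof(*task));

	task->sub = submodule_from_path(r, &null_oid, path);
	if (!task->sub) {
		task->sub = get_default_submodule(path);
		task->free_sub = 1;
	}
	if (!task->sub) {
		/* neither in .gitmodules nor fake-able: not a submodule */
		free(task);
		return NULL;
	}
	return task;
}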

> + if (spf->fetch_specific_oids_nr) {
> + struct get_next_submodule_task *task = 
> spf->fetch_specific_oids[spf->fetch_specific_oids_nr - 1];

Break lines to 80.

> + argv_array_pushv(&cp->args, spf->args.argv);
> + argv_array_push(&cp->args, "on-demand");

Same comment about "on-demand" as in my previous e-mail.

> + argv_array_push(&cp->args, "--submodule-prefix");
> + argv_array_push(&cp->args, submodule_prefix.buf);
> +
> + /* NEEDSWORK: have get_default_remote from s--h */

Same comment about "s--h" as in my previous e-mail.

> + commits = it->util;
> + oid_array_filter(commits,
> +  commit_exists_in_sub,
> +  task->repo);
> +
> + /* Are there commits that do not exist? */
> + if (commits->nr) {
> + /* We already tried fetching them, do not try again. */
> + if (task->commits)
> + return 0;

Same comment about "task->commits" as in my previous e-mail.

> diff --git a/t/t5526-fetch-submodules.sh b/t/t5526-fetch-submodules.sh
> index 6c2f9b2ba2..5a75b57852 100755

One more thing to test is the case where a submodule doesn't have a
.gitmodules entry.


Re: [PATCH 07/10] submodule: migrate get_next_submodule to use repository structs

2018-10-26 Thread Jonathan Tan
> We used to recurse into submodules, even if they were broken having
> only an objects directory. The child process executed in the submodule
> would fail though if the submodule was broken.
> 
> This patch tightens the check upfront, such that we do not need
> to spawn a child process to find out if the submodule is broken.

Thanks for the clear commit message. Also mention which tests currently
exercise this.

> +static struct repository *get_submodule_repo_for(struct repository *r,
> +  const struct submodule *sub)
> +{
> + struct repository *ret = xmalloc(sizeof(*ret));
> +
> + if (repo_submodule_init(ret, r, sub)) {
> + /*
> +  * No entry in .gitmodules? Technically not a submodule,
> +  * but historically we supported repositories that happen to be
> +  * in-place where a gitlink is. Keep supporting them.
> +  */
> + struct strbuf gitdir = STRBUF_INIT;
> + strbuf_repo_worktree_path(&gitdir, r, "%s/.git", sub->path);
> + if (repo_init(ret, gitdir.buf, NULL)) {
> + strbuf_release(&gitdir);

You should also free ret here.

> + return NULL;
> + }
> + strbuf_release(&gitdir);
> + }
> +
> + return ret;
> +}

The above is the rest of get_submodule_repo_for(), so that we can see
that gitdir is indeed freed.
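
I.e. (sketch):

		struct strbuf gitdir = STRBUF_INIT;
		strbuf_repo_worktree_path(&gitdir, r, "%s/.git", sub->path);
		if (repo_init(ret, gitdir.buf, NULL)) {
			strbuf_release(&gitdir);
			free(ret); /* do not leak the allocation on failure */
			return NULL;
		}
		strbuf_release(&gitdir);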

> - if (!git_dir)
> - git_dir = submodule_git_dir.buf;
> - if (is_directory(git_dir)) {
> + repo = get_submodule_repo_for(spf->r, submodule);
> + if (repo) {
>   child_process_init(cp);
> - cp->dir = strbuf_detach(&submodule_path, NULL);
> - prepare_submodule_repo_env(&cp->env_array);
> + cp->dir = xstrdup(repo->worktree);

Move the cp->dir one line up for a cleaner diff.

[snip]

> + repo_clear(repo);
> + free(repo);
>   ret = 1;
> + } else {
> + /*
> +  * An empty directory is normal,
> +  * the submodule is not initialized
> +  */
> + if (S_ISGITLINK(ce->ce_mode) &&
> + !is_empty_dir(ce->name))
> + die(_("Could not access submodule '%s'"), 
> ce->name);

Previously, a failed fetch would just set spf->result = 1 (in
fetch_finish()), allowing other fetches to still proceed. This sounds
like a better idea to me instead of die-ing outright. (Also remember to
print a warning message - since we no longer spawn a child process to
try a fetch that will fail, we need to print a message ourselves.)
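
Something along these lines, assuming the err strbuf that the
parallel-runner callback receives is in scope (a sketch, not tested):

	} else {
		/*
		 * An empty directory is normal: the submodule is simply
		 * not initialized. Anything else is worth reporting, but
		 * should not stop the other fetches.
		 */
		if (S_ISGITLINK(ce->ce_mode) && !is_empty_dir(ce->name)) {
			spf->result = 1;
			strbuf_addf(err, _("Could not access submodule '%s'\n"),
				    ce->name);
		}
	}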


Re: [PATCH 06/10] repository: repo_submodule_init to take a submodule struct

2018-10-26 Thread Jonathan Tan
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 7da8fef31a..ba7634258a 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -418,7 +418,10 @@ static int grep_submodule(struct grep_opt *opt, struct 
> repository *superproject,
> const struct object_id *oid,
> const char *filename, const char *path)
>  {
> - struct repository submodule;
> + struct repository subrepo;
> + const struct submodule *sub = submodule_from_path(superproject,
> +   &null_oid, path);

[snip]

> - if (repo_submodule_init(&submodule, superproject, path)) {
> + if (repo_submodule_init(&subrepo, superproject, sub)) {

The last argument to repo_submodule_init is now
"submodule_from_path(superproject, &null_oid, path)" instead of "path",
and looking forward into the patch, we do not need a NULL check because
repo_submodule_init() tolerates NULL in that argument.

Let's see if the rest of the code follows the pattern - a call to
submodule_from_path() with the 3 expected arguments (repo, null OID,
path).

> diff --git a/builtin/ls-files.c b/builtin/ls-files.c
> index 7f9919a362..4d1649c1b3 100644
> --- a/builtin/ls-files.c
> +++ b/builtin/ls-files.c
> @@ -206,17 +206,19 @@ static void show_files(struct repository *repo, struct 
> dir_struct *dir);
>  static void show_submodule(struct repository *superproject,
>  struct dir_struct *dir, const char *path)
>  {
> - struct repository submodule;
> + struct repository subrepo;
> + const struct submodule *sub = submodule_from_path(superproject,
> +   &null_oid, path);
>  
> - if (repo_submodule_init(&submodule, superproject, path))
> + if (repo_submodule_init(&subrepo, superproject, sub))

So far so good.

> diff --git a/builtin/submodule--helper.c b/builtin/submodule--helper.c
> index 5f8a804a6e..015aa1471f 100644
> --- a/builtin/submodule--helper.c
> +++ b/builtin/submodule--helper.c
> @@ -2037,7 +2037,7 @@ static int ensure_core_worktree(int argc, const char 
> **argv, const char *prefix)
>   if (!sub)
>   BUG("We could get the submodule handle before?");
>  
> - if (repo_submodule_init(&subrepo, the_repository, path))
> + if (repo_submodule_init(&subrepo, the_repository, sub))

The definition of "sub" is not quoted here in this e-mail, but it is
indeed "submodule_from_path(the_repository, &null_oid, path)".
("the_repository" in the invocation to submodule_from_path() is correct
because the 2nd argument to the invocation of repo_submodule_init() is
"the_repository".)

> -int repo_submodule_init(struct repository *submodule,
> +int repo_submodule_init(struct repository *subrepo,
>   struct repository *superproject,
> - const char *path)
> + const struct submodule *sub)
>  {
> - const struct submodule *sub;
>   struct strbuf gitdir = STRBUF_INIT;
>   struct strbuf worktree = STRBUF_INIT;
>   int ret = 0;
>  
> - sub = submodule_from_path(superproject, &null_oid, path);

As expected, this line is removed.

>   if (!sub) {
>   ret = -1;
>   goto out;
>   }
>  
> - strbuf_repo_worktree_path(&gitdir, superproject, "%s/.git", path);
> - strbuf_repo_worktree_path(&worktree, superproject, "%s", path);
> + strbuf_repo_worktree_path(&gitdir, superproject, "%s/.git", sub->path);
> + strbuf_repo_worktree_path(&worktree, superproject, "%s", sub->path);

path and sub->path are the same, so this is fine. (This can be seen from
cache_put_path() and cache_lookup_path() in submodule-config.c.)

> - submodule->submodule_prefix = xstrfmt("%s%s/",
> -   superproject->submodule_prefix ?
> -   superproject->submodule_prefix :
> -   "", path);
> + subrepo->submodule_prefix = xstrfmt("%s%s/",
> + superproject->submodule_prefix ?
> + superproject->submodule_prefix :
> + "", sub->path);

Likewise.

> +/*
> + * Initialize the repository 'subrepo' as the submodule given by the
> + * struct submodule 'sub' in parent repository 'superproject'.
> + * Return 0 upon success and a non-zero value upon failure, which may happen
> + * if the submodule is not found, or 'sub' is NULL.
> + */
> +struct submodule;
> +int repo_submodule_init(struct repository *subrepo,
>   struct repository *superproject,
> - const char *path);
> + const struct submodule *sub);

Here is where it says that the last argument can be NULL.

> diff --git a/t/helper/test-submodule-nested-repo-config.c 
> b/t/helper/test-submodule-nested-repo-config.c
> index a31e2a9bea..bc97929bbc 100644
> --- a/t/helper/test-submodule-nested-repo-config.c
> +++ b/t/helper/test-submodule-nested-repo-config.c
> 

Re: [PATCH 05/10] submodule: store OIDs in changed_submodule_names

2018-10-26 Thread Jonathan Tan
> Reviewed-by: Jonathan Tan 

Probably better not to include such lines, especially since the review
by me is not yet complete.

Having said that, patches 1-5 look good to me. Patches 1-3 are identical
to the previous version, which I have already reviewed. In patch 4,
Stefan made the code change I suggested [1].

In this patch, compared to the previous version which I have already
reviewed [2], the code is unchanged aside from some variable renaming. I
suggested a commit title change which Stefan has done.

[1] 
https://public-inbox.org/git/20181017212624.196598-1-jonathanta...@google.com/
[2] 
https://public-inbox.org/git/20181017214534.199890-1-jonathanta...@google.com/


Re: [PATCH v3 2/3] shallow: offer to prune only non-existing entries

2018-10-25 Thread Jonathan Tan
> On Wed, 24 Oct 2018, Johannes Schindelin wrote:
> 
> > On Wed, 24 Oct 2018, Junio C Hamano wrote:
> > 
> > > Jonathan, do you see any issues with the use of lookup_commit() in
> > > this change wrt lazy clone?  I am wondering what happens when the
> > > commit in question is at, an immediate parent of, or an immediate
> > > child of a promisor object.  I _think_ this change won't make it
> > > worse for two features in playing together, but thought that it
> > > would be better to double check.
> > 
> > Good point.
> > 
> > Instinctively, I would say that no promised object can be a shallow
> > commit. The entire idea of a shallow commit is that it *is* present, but
> > none of its parents.

I'm envisioning a client repo with a single branch, cloned both with
--depth=1 and with --filter=, that has just fetched to the same
branch also with --depth=1 resulting in a fast-forward from A to B.

If A is B's parent, then A would be known to be promised. (Incidentally,
the problem is greater in current Git, because for performance reasons,
we do not check promisor status when lazily fetching - so it doesn't
matter here whether an object is known to be promised or not.)

When pruning shallow and checking the existence of A, this would trigger
a fetch for A, which would download all commits and trees reachable from
it.

It sounds safer to me to use the fast approach in this patch when the
repository is not partial, and stick to the slow approach when it is.

> > However, I am curious whether there is a better way to check for the
> > presence of a local commit? Do we have an API function for that, that I
> > missed? (I do not really want to parse the commit, after all, just verify
> > that it is not pruned.)
> 
> Okay, I looked around a bit. It seems that there is an
> `is_promisor_object(oid)` function in `pu` to see whether an object was
> promised. If need be (and I am still not convinced that there is a need),
> then we can always add a call to that function to the condition.

I don't think there is a need for is_promisor_object() either - as
described above, it doesn't completely solve the problem.

> Coming back to my question whether there is a better way to check for the
> presence of a local commit, I figured that I can use `has_object_file()`
> instead of looking up and parsing the commit, as this code does not really
> need to verify that the shallow entry refers to a commit, but only that it
> refers to a local object.

Note that has_object_file() also triggers the lazy fetch if needed, but
I agree that it's better because you don't really need to parse the
commit.

There is the possibility of just checking for loose objects (which does
not trigger any lazy fetches), which works for builtin/prune since it
only prunes loose objects, but doesn't work in the general case, I
guess.


Re: [PATCH 8/9] fetch: retry fetching submodules if needed objects were not fetched

2018-10-23 Thread Jonathan Tan
> > Another thing you need to clarify is what happens if the fetch-by-commit
> > fails. Right now, it seems that it will make the whole thing fail, which
> > might be a surprising change in behavior.
> 
> But a positive surprise, I would assume?

Whether positive or negative, I think that this needs to be mentioned in
the commit message.

As for positive or negative, I tend to agree that it's positive - sure,
some previously successful fetches would now fail, but the results of
those fetches could not be recursively checked out anyway.

> > The test stores the result in a normal branch, not a remote tracking
> > branch. Is storing in a normal branch required?
> 
> In the test we fetch from another repository, such that in the
> repository-under-test this will show up in a remote tracking branch?

If that were true, I would expect that when this line:

> git fetch --recurse-submodules --recurse-submodules-default on-demand origin 
> refs/changes/2:refs/heads/my_branch &&

is replaced by this line:

> git fetch --recurse-submodules --recurse-submodules-default on-demand origin 
> refs/changes/2 &&

then things would still work. The tests pass with the first line (after
I fixed a type mismatch) but not with the second. (Also I don't think a
remote-tracking branch is generated here - the output printed doesn't
indicate so, and refs/changes/2 is not a branch anyway.)

> > Also, do you know why this is required? A naive reading of the patch
> > leads me to believe that this should work even if merely fetching to
> > FETCH_HEAD.
> 
> See the next patch, check_for_new_submodule_commits() is missing
> for FETCH_HEAD.

I see in the next patch that there is an "if" branch in
store_updated_refs() without update_local_ref() in which
"check_for_new_submodule_commits(>old_oid)" needs to be inserted. I
think this is a symptom that maybe check_for_new_submodule_commits()
needs to be extracted from update_local_ref() and put into
store_updated_refs() instead? In update_local_ref(), it is called on
ref->new_oid, which is actually the same as rm->old_oid anyway (there is
an oidcpy earlier).
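
In other words, something like this once per fetched ref in
store_updated_refs(), replacing the call inside update_local_ref() (a
sketch):

	/*
	 * rm->old_oid is the same object as ref->new_oid in
	 * update_local_ref() (there is an oidcpy earlier), so one call
	 * here covers the FETCH_HEAD-only branch as well.
	 */
	if (recurse_submodules != RECURSE_SUBMODULES_OFF)
		check_for_new_submodule_commits(&rm->old_oid);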

> > > +static const struct submodule *get_default_submodule(const char *path)
> > > +{
> > > + struct submodule *ret = NULL;
> > > + const char *name = default_name_or_path(path);
> > > +
> > > + if (!name)
> > > + return NULL;
> > > +
> > > + ret = xmalloc(sizeof(*ret));
> > > + memset(ret, 0, sizeof(*ret));
> > > + ret->path = name;
> > > + ret->name = name;
> > > +
> > > + return (const struct submodule *) ret;
> > > +}
> >
> > What is a "default" submodule and why would you need one?
> 
> s/default/artificial/. Such a submodule is a submodule that has no
> config in the .gitmodules file and its name == path.
> We need to keep those around for historic reasons AFAICT, c.f.
> c68f837576 (implement fetching of moved submodules, 2017-10-16)

Ah, OK. I would call it a fake submodule then, and copy over the "No
entry in .gitmodules?" comment.

> > Will task->sub ever be NULL?
> 
> Yes, if we fall back to these "default" submodule and just try if it
> can be handled
> as a submodule, but it cannot be handled as such,
> get_next_submodule_task_create has
> 
> task->sub = submodule_from_path(r, _oid, path);
> if (!task->sub) {
> task->sub = get_default_submodule(path);
> 
> and get_default_submodule can return NULL.

Ah, yes you're right.


Re: [PATCH 7/9] submodule: fetch in submodules git directory instead of in worktree

2018-10-23 Thread Jonathan Tan
> > Why does GIT_DIR need to be set? Is it to avoid subcommands recursively
> > checking the parent directories in case the CWD is a malformed Git
> > repository? If yes, maybe it's worth adding a comment.
> 
> It is copying the structure from prepare_submodule_repo_env,
> specifically 10f5c52656 (submodule: avoid auto-discovery in
> prepare_submodule_repo_env(), 2016-09-01), which sounds
> appealing (and brings real benefits for the working directory),
> but I have not thought about this protection for the git dir.
> 
> Maybe another approach is to not set the cwd for the child process
> and instead point GIT_DIR_ENVIRONMENT only to the right
> directory.
> 
> Then the use of GIT_DIR_ENVIRONMENT is obvious and
> is not just for protection of corner cases.
> 
> However I think this protection is really valuable for the
> .git dir as well as the submodule may be broken and we do not
> want to end up in an infinite loop (as the discovery would find
> the superproject which then tries to recurse, again, into the
> submodule with the broken git dir)
> 
> When adding the comment here, we'd also want to have
> the comment in prepare_submodule_repo_env, which
> could be its own preparation commit.

I agree with the protection. As for the preparation commit, I don't
think it's always the code author's responsibility to tidy up the
surrounding code, but since you're adding an identical comment here,
it's probably worth it to add the comment there too.
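
For what it's worth, the comment I have in mind would look something
like this (a sketch):

static void prepare_submodule_repo_env_in_gitdir(struct argv_array *out)
{
	prepare_submodule_repo_env_no_git_dir(out);
	/*
	 * Pin GIT_DIR to the current directory so that spawned commands
	 * never fall back to auto-discovery: if the submodule's git dir
	 * is broken, discovery could find the superproject instead and
	 * recurse back into the same broken submodule forever.
	 */
	argv_array_pushf(out, "%s=.", GIT_DIR_ENVIRONMENT);
}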

> > This is the significant thing that this patch does more - an unskipped
> > submodule is now something that either passes the checks in
> > repo_submodule_init() or the checks in repo_init(), which seems to be
> > stricter than the current check that ".git" points to a directory or is
> > one. This means that we skip certain broken repositories, and this
> > necessitates a change in the test.
> 
> I see. However there is no change in function, the check in repo_init
> (or repo_submodule_init) is less strict than the check in the child process.
> So if there are broken submodule repositories, the difference of this
> patch is the layer at which it is caught, i.e. we would not spawn a child
> that fails, but skip the submodule.
> 
> Thinking of that, maybe we need to announce that in get_next_submodule

The consequence of getting caught changes, though. Currently,
spf->result is set to 1 whenever a child process fails. But in this
patch, some of these repositories would be entirely skipped, meaning
that no child process is run, and spf->result is never modified.

> > I think we should be more particular about what we're allowed to skip -
> > in particular, maybe if we're planning to skip this submodule, its
> > corresponding directory in the worktree (if one exists) needs to be
> > empty.
> 
> If the working tree directory is empty for that submodule, it means
> it is likely not initialized. But why would we use that as a signal to
> skip the submodule?

What I meant was: if empty, skip it completely. Otherwise, do the
repo_submodule_init() and repo_init() thing, and if they both fail, set
spf->result to 1, preserving existing behavior.
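
In code, roughly (a sketch against the loop shape in this patch):

	repo = get_submodule_repo_for(spf->r, submodule);
	if (!repo) {
		if (is_empty_dir(ce->name))
			continue; /* not initialized; skip it completely */
		spf->result = 1; /* broken repository: preserve old behavior */
		continue;
	}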


[PATCH] fetch-pack: be more precise in parsing v2 response

2018-10-19 Thread Jonathan Tan
Each section in a protocol v2 response is followed by either a DELIM
packet (indicating more sections to follow) or a FLUSH packet
(indicating none to follow). But when parsing the "acknowledgments"
section, do_fetch_pack_v2() is liberal in accepting both, but determines
whether to continue reading or not based solely on the contents of the
"acknowledgments" section, not on whether DELIM or FLUSH was read.

There is no issue with a protocol-compliant server, but this can result in
confusing error messages when communicating with a server that
serves unexpected additional sections. Consider a server that sends
"new-section" after "acknowledgments":

 - client writes request
 - client reads the "acknowledgments" section which contains no "ready",
   then DELIM
 - since there was no "ready", client needs to continue negotiation, and
   writes request
 - client reads "new-section", and reports to the end user "expected
   'acknowledgments', received 'new-section'"

For the person debugging the involved Git implementation(s), the error
message is confusing in that "new-section" was not received in response
to the latest request, but to the first one.

One solution is to always continue reading after DELIM, but in this
case, we can do better. We know from the protocol that "ready" means at
least the packfile section is coming (hence, DELIM) and that no "ready"
means that no sections are to follow (hence, FLUSH). So teach
process_acks() to enforce this.

Signed-off-by: Jonathan Tan 
---
 fetch-pack.c   | 12 ++
 t/t5702-protocol-v2.sh | 50 ++
 2 files changed, 62 insertions(+)

diff --git a/fetch-pack.c b/fetch-pack.c
index b3ed7121bc..9691046e64 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1248,6 +1248,18 @@ static int process_acks(struct fetch_negotiator 
*negotiator,
reader->status != PACKET_READ_DELIM)
die(_("error processing acks: %d"), reader->status);
 
+   /*
+* If an "acknowledgments" section is sent, a packfile is sent if and
+* only if "ready" was sent in this section. The other sections
+* ("shallow-info" and "wanted-refs") are sent only if a packfile is
+* sent. Therefore, a DELIM is expected if "ready" is sent, and a FLUSH
+* otherwise.
+*/
+   if (received_ready && reader->status != PACKET_READ_DELIM)
+   die(_("expected packfile to be sent after 'ready'"));
+   if (!received_ready && reader->status != PACKET_READ_FLUSH)
+   die(_("expected no other sections to be sent after no 
'ready'"));
+
/* return 0 if no common, 1 if there are common, or 2 if ready */
return received_ready ? 2 : (received_ack ? 1 : 0);
 }
diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
index 8360188c01..51009ca391 100755
--- a/t/t5702-protocol-v2.sh
+++ b/t/t5702-protocol-v2.sh
@@ -512,6 +512,56 @@ test_expect_success 'push with http:// and a config of v2 
does not request v2' '
! grep "git< version 2" log
 '
 
+test_expect_success 'when server sends "ready", expect DELIM' '
+   rm -rf "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" http_child &&
+
+   git init "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" &&
+   test_commit -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" one &&
+
+   git clone "$HTTPD_URL/smart/http_parent" http_child &&
+
+   test_commit -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" two &&
+
+   # After "ready" in the acknowledgments section, pretend that a FLUSH
+   # (0000) was sent instead of a DELIM (0001).
+   printf "/ready/,$ s/0001/0000/" \
+   >"$HTTPD_ROOT_PATH/one-time-sed" &&
+
+   test_must_fail git -C http_child -c protocol.version=2 \
+   fetch "$HTTPD_URL/one_time_sed/http_parent" 2> err &&
+   test_i18ngrep "expected packfile to be sent after .ready." err
+'
+
+test_expect_success 'when server does not send "ready", expect FLUSH' '
+   rm -rf "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" http_child log &&
+
+   git init "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" &&
+   test_commit -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" one &&
+
+   git clone "$HTTPD_URL/smart/http_parent" http_child &&
+
+   test_commit -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" two &&
+
+   # Create many commits to extend the negotiation phase across multiple
+   # requests, so that the server does not send "ready" in the first
+   #

Re: [PATCH 19/19] submodule: don't add submodule as odb for push

2018-10-19 Thread Jonathan Tan
> In push_submodule(), because we do not actually need access to objects
> in the submodule, do not invoke add_submodule_odb().
> (for_each_remote_ref_submodule() does not require access to those
> objects, and the actual push is done by spawning another process,
> which handles object access by itself.)

The code looks good - my analysis is the same as that in my review of
the previous version [1].

Can you mention, in the commit message, the tests that exercise the
functionality here (and say that they still pass)?

[1] 
https://public-inbox.org/git/20181011230028.200488-1-jonathanta...@google.com/
> 
> Signed-off-by: Stefan Beller 
> ---
>  submodule.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/submodule.c b/submodule.c
> index 7305ae2e10..e623e6bf7f 100644
> --- a/submodule.c
> +++ b/submodule.c
> @@ -1024,9 +1024,6 @@ static int push_submodule(const char *path,
> const struct string_list *push_options,
> int dry_run)
>  {
> - if (add_submodule_odb(path))
> - return 1;
> -
>   if (for_each_remote_ref_submodule(path, has_remote, NULL) > 0) {
>   struct child_process cp = CHILD_PROCESS_INIT;
> argv_array_push(&cp.args, "push");
> -- 
> 2.19.0
> 


Re: [PATCH 18/19] submodule: use submodule repos for object lookup

2018-10-19 Thread Jonathan Tan
> This converts the 'show_submodule_header' function to use
> the repository API properly, such that the submodule objects
> are not added to the main object store.

There is also a side effect in that the submodule now needs to pass all
the checks done by repo_init() instead of merely having the "objects/"
directory. Can you add information about this to the commit message?

Also, which tests exercise this functionality? Mention them in the
commit message.

> +/*
> + * Initialize 'out' based on the provided submodule path.
> + *
> + * Unlike repo_submodule_init, this tolerates submodules not present
> + * in .gitmodules. This function exists only to preserve historical behavior,
> + *
> + * Returns 0 on success, -1 when the submodule is not present.
> + */
> +static int open_submodule(struct repository *out, const char *path)
> +{
> + struct strbuf sb = STRBUF_INIT;
> +
> + if (submodule_to_gitdir(&sb, path) || repo_init(out, sb.buf, NULL)) {
> + strbuf_release(&sb);
> + return -1;
> + }
> +
> + out->submodule_prefix = xstrdup(path);
> + out->submodule_prefix = xstrfmt("%s%s/",
> + the_repository->submodule_prefix ?
> + the_repository->submodule_prefix :
> + "", path);

You seem to say here [1] that we don't need submodule_prefix, but you're
instead setting it twice? :-)

[1] 
https://public-inbox.org/git/cagz79kztxonmuyx+ehg0q3ss2m-etkf0yiw3e40u3vct4qm...@mail.gmail.com/

> +/*
> + * Helper function to display the submodule header line prior to the full
> + * summary output.
> + *
> + * If it can locate the submodule git directory it will create a repository
> + * handle for the submodule and lookup both the left and right commits and
> + * put them into the left and right pointers.
>   */
> -static void show_submodule_header(struct diff_options *o, const char *path,
> +static void show_submodule_header(struct diff_options *o,
> + const char *path,
>   struct object_id *one, struct object_id *two,
>   unsigned dirty_submodule,
> + struct repository *sub,
>   struct commit **left, struct commit **right,
>   struct commit_list **merge_bases)
>  {

Location of the submodule git directory is done by the caller of this
function, not by this function. Also this function doesn't create any
repository handles. The documentation probably needs to be updated. Also
mention what happens if sub is NULL.

> @@ -563,16 +596,20 @@ void show_submodule_summary(struct diff_options *o, 
> const char *path,
>   struct rev_info rev;
>   struct commit *left = NULL, *right = NULL;
>   struct commit_list *merge_bases = NULL;
> + struct repository subrepo, *sub = &subrepo;
> +
> + if (open_submodule(&subrepo, path) < 0)
> + sub = NULL;

Handling of the subrepo and *sub seems clumsy - it might be better to
just let open_submodule() return a struct repository pointer.

Previous 17 patches look good - most are the same as the previous
version, which I already was happy with, and I see that the patch I
requested in [2] was added.

[2] 
https://public-inbox.org/git/20181011221114.186377-1-jonathanta...@google.com/


Re: [RFC PATCH 0/2] Bring the_repository into cmd_foo

2018-10-18 Thread Jonathan Tan
> > Or instead we could accelerate the long term plan of removing a
> > hard coded the_repository and have each cmd builtin take an additional
> > repository pointer from the init code, such that we'd bring all of Git to
> > work on arbitrary repositories. Then the standard test suite should be
> > okay, as there is no special case for the_repository any more.
> 
> Demo'd in this RFC series for git-merge-base.
> 
> The core idea is found in patch 1,
> and the proof of concept is found in patch 2.

I don't think working around the_repository is sufficient, as there are
other ways to access the same repository state (the_index, directly
accessing files through file I/O). Instead I would prefer a test like in
t/test-repository.c - each patch set would probably only need one test
for the last function converted, since typically the last function uses
every other function converted.

Also, even if we decided that working around the_repository is
sufficient, I don't think this get_the_repository() is a good approach.
If (or when) we decide to convert all builtins to not use
the_repository, we would have to clean up all such calls.

Better would be to pass the_repository from the invoker of the cmd
builtin, and reuse NO_THE_REPOSITORY_COMPATIBILITY_MACROS in the
builtin. (I haven't thought much about how to transition to this, but
one way might be to extend "struct cmd_struct" in git.c to also have a
new-style function pointer, and #define GIT_CMD(c, f, o) {c, NULL, o, f}
or something like that.)
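
To illustrate what I mean (a sketch only, not a worked-out design):

struct cmd_struct {
	const char *cmd;
	int (*fn)(int, const char **, const char *);
	unsigned int option;
	/* new-style entry point that takes the repository explicitly */
	int (*repo_fn)(struct repository *, int, const char **,
		       const char *);
};

#define GIT_CMD(c, f, o) { c, NULL, o, f }

run_builtin() would then prefer the new-style pointer:

	if (p->repo_fn)
		status = p->repo_fn(the_repository, argc, argv, prefix);
	else
		status = p->fn(argc, argv, prefix);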

This doesn't directly address the fact that the builtin might call lib
code that indirectly references the_repository, but I think that this
won't be an issue because by the time we're ready to convert builtins to
not use the_repository, most if not all of the lib code would have
NO_THE_REPOSITORY_COMPATIBILITY_MACROS defined anyway.


[PATCH v2 2/3] upload-pack: make want_obj not global

2018-10-18 Thread Jonathan Tan
Because upload_pack_v2() can be invoked multiple times in the same
process, the static variable want_obj may not be empty when it is
invoked. To make further analysis of this situation easier, make the
variable local; analysis will be done in a subsequent patch.

The new local variable in upload_pack_v2() is static to preserve
existing behavior; this is not necessary in upload_pack() because
upload_pack() is only invoked once per process.

Signed-off-by: Jonathan Tan 
---
 upload-pack.c | 116 --
 1 file changed, 66 insertions(+), 50 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index cb2401f90d..451bf47e7f 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -52,7 +52,6 @@ static int no_progress, daemon_mode;
 #define ALLOW_ANY_SHA1 07
 static unsigned int allow_unadvertised_object_request;
 static int shallow_nr;
-static struct object_array want_obj;
 static struct object_array extra_edge_obj;
 static unsigned int timeout;
 static int keepalive = 5;
@@ -98,7 +97,8 @@ static int write_one_shallow(const struct commit_graft 
*graft, void *cb_data)
return 0;
 }
 
-static void create_pack_file(const struct object_array *have_obj)
+static void create_pack_file(const struct object_array *have_obj,
+const struct object_array *want_obj)
 {
struct child_process pack_objects = CHILD_PROCESS_INIT;
char data[8193], progress[128];
@@ -159,9 +159,9 @@ static void create_pack_file(const struct object_array 
*have_obj)
if (shallow_nr)
for_each_commit_graft(write_one_shallow, pipe_fd);
 
-   for (i = 0; i < want_obj.nr; i++)
+   for (i = 0; i < want_obj->nr; i++)
fprintf(pipe_fd, "%s\n",
-   oid_to_hex(&want_obj.objects[i].item->oid));
+   oid_to_hex(&want_obj->objects[i].item->oid));
fprintf(pipe_fd, "--not\n");
for (i = 0; i < have_obj->nr; i++)
fprintf(pipe_fd, "%s\n",
@@ -337,19 +337,21 @@ static int got_oid(const char *hex, struct object_id *oid,
return 0;
 }
 
-static int ok_to_give_up(const struct object_array *have_obj)
+static int ok_to_give_up(const struct object_array *have_obj,
+struct object_array *want_obj)
 {
uint32_t min_generation = GENERATION_NUMBER_ZERO;
 
if (!have_obj->nr)
return 0;
 
-   return can_all_from_reach_with_flag(&want_obj, THEY_HAVE,
+   return can_all_from_reach_with_flag(want_obj, THEY_HAVE,
COMMON_KNOWN, oldest_have,
min_generation);
 }
 
-static int get_common_commits(struct object_array *have_obj)
+static int get_common_commits(struct object_array *have_obj,
+ struct object_array *want_obj)
 {
struct object_id oid;
char last_hex[GIT_MAX_HEXSZ + 1];
@@ -367,7 +369,7 @@ static int get_common_commits(struct object_array *have_obj)
 
if (!line) {
if (multi_ack == 2 && got_common
-   && !got_other && ok_to_give_up(have_obj)) {
+   && !got_other && ok_to_give_up(have_obj, want_obj)) 
{
sent_ready = 1;
packet_write_fmt(1, "ACK %s ready\n", last_hex);
}
@@ -388,7 +390,7 @@ static int get_common_commits(struct object_array *have_obj)
switch (got_oid(arg, &oid, have_obj)) {
case -1: /* they have what we do not */
got_other = 1;
-   if (multi_ack && ok_to_give_up(have_obj)) {
+   if (multi_ack && ok_to_give_up(have_obj, 
want_obj)) {
const char *hex = oid_to_hex(&oid);
if (multi_ack == 2) {
sent_ready = 1;
@@ -581,7 +583,7 @@ static int has_unreachable(struct object_array *src)
return 1;
 }
 
-static void check_non_tip(void)
+static void check_non_tip(struct object_array *want_obj)
 {
int i;
 
@@ -592,14 +594,14 @@ static void check_non_tip(void)
 */
if (!stateless_rpc && !(allow_unadvertised_object_request & 
ALLOW_REACHABLE_SHA1))
goto error;
-   if (!has_unreachable(&want_obj))
+   if (!has_unreachable(want_obj))
/* All the non-tip ones are ancestors of what we advertised */
return;
 
 error:
/* Pick one of them (we know there at least is one) */
-   for (i = 0; i < want_obj.nr; i++) {
-   struct object *o = want_obj.objects[i].item;
+   for (i = 0; i < want_obj->nr; i++) {
+ 

[PATCH v2 0/3] Clear flags before each v2 request

2018-10-18 Thread Jonathan Tan
To explain the differences in this version of the patch set, I'll quote
an email [1] from Junio:

[1] https://public-inbox.org/git/xmqq5zxzvnq1@gitster-ct.c.googlers.com/

> The change to the code itself sort-of makes sense (I say sort-of
> because I didn't carefully look at the callers to see if they mind
> getting all these flags cleared, but the ones that are cleared are
> the ones that are involved mostly in the negotiation and shold be
> OK).

I have included 2 additional patches for these reasons:

 - After reading the section of Junio's email quoted above, I took
   another look at the flags, and found that not only is state stored in
   the flags between invocations of upload_pack_v2(), state is also
   stored in the want_obj and have_obj global variables. The additional
   patches help clean those up.

 - To help reviewers who want to see if the callers mind getting all 8
   flags cleared, I have included a discussion of all 8 flags in the
   commit message of patch 3. The additional patches made the discussion
   easier.

Responses to other points:

> Hmph, what if commit O had a long history behind it?  
> 
> Should fetching of B result in fetching the whole history?

I think so - when we fetch without --depth or any similar arguments, I
think it's reasonable to have all objects referenced by the fetched
tips.

> Would we
> notice that now all of A's parents are available locally and declare
> that the repository is no longer shallow?

We could, but I think this is outside the scope of this patch set.

> Use test_seq instead, or you'll get hit by test-lint?

Thanks for the pointer to test-lint. I've used test_seq, and checked
that test-lint doesn't print any errors.

> Applied on 'master' or 'maint', this new test does not pass even
> with s/seq/test_&/, so there may be something else wrong with it,
> though.

Thanks - there was a copy-and-paste error (should have grepped for
"fetch< version 2", not "git< version 2").

Jonathan Tan (3):
  upload-pack: make have_obj not global
  upload-pack: make want_obj not global
  upload-pack: clear flags before each v2 request

 t/t5702-protocol-v2.sh |  25 +++
 upload-pack.c  | 153 -
 2 files changed, 115 insertions(+), 63 deletions(-)

-- 
2.19.0.271.gfe8321ec05.dirty



[PATCH v2 1/3] upload-pack: make have_obj not global

2018-10-18 Thread Jonathan Tan
Because upload_pack_v2() can be invoked multiple times in the same
process, the static variable have_obj may not be empty when it is
invoked. To make further analysis of this situation easier, make the
variable local; analysis will be done in a subsequent patch.

The new local variable in upload_pack_v2() is static to preserve
existing behavior; this is not necessary in upload_pack() because
upload_pack() is only invoked once per process.

Signed-off-by: Jonathan Tan 
---
 upload-pack.c | 58 ---
 1 file changed, 32 insertions(+), 26 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index 62a1000f44..cb2401f90d 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -52,7 +52,6 @@ static int no_progress, daemon_mode;
 #define ALLOW_ANY_SHA1 07
 static unsigned int allow_unadvertised_object_request;
 static int shallow_nr;
-static struct object_array have_obj;
 static struct object_array want_obj;
 static struct object_array extra_edge_obj;
 static unsigned int timeout;
@@ -99,7 +98,7 @@ static int write_one_shallow(const struct commit_graft 
*graft, void *cb_data)
return 0;
 }
 
-static void create_pack_file(void)
+static void create_pack_file(const struct object_array *have_obj)
 {
struct child_process pack_objects = CHILD_PROCESS_INIT;
char data[8193], progress[128];
@@ -164,9 +163,9 @@ static void create_pack_file(void)
fprintf(pipe_fd, "%s\n",
oid_to_hex(&want_obj.objects[i].item->oid));
fprintf(pipe_fd, "--not\n");
-   for (i = 0; i < have_obj.nr; i++)
+   for (i = 0; i < have_obj->nr; i++)
fprintf(pipe_fd, "%s\n",
-   oid_to_hex(&have_obj.objects[i].item->oid));
+   oid_to_hex(&have_obj->objects[i].item->oid));
for (i = 0; i < extra_edge_obj.nr; i++)
fprintf(pipe_fd, "%s\n",
oid_to_hex(&extra_edge_obj.objects[i].item->oid));
@@ -303,7 +302,8 @@ static void create_pack_file(void)
die("git upload-pack: %s", abort_msg);
 }
 
-static int got_oid(const char *hex, struct object_id *oid)
+static int got_oid(const char *hex, struct object_id *oid,
+  struct object_array *have_obj)
 {
struct object *o;
int we_knew_they_have = 0;
@@ -331,17 +331,17 @@ static int got_oid(const char *hex, struct object_id *oid)
parents->item->object.flags |= THEY_HAVE;
}
if (!we_knew_they_have) {
-   add_object_array(o, NULL, &have_obj);
+   add_object_array(o, NULL, have_obj);
return 1;
}
return 0;
 }
 
-static int ok_to_give_up(void)
+static int ok_to_give_up(const struct object_array *have_obj)
 {
uint32_t min_generation = GENERATION_NUMBER_ZERO;
 
-   if (!have_obj.nr)
+   if (!have_obj->nr)
return 0;
 
return can_all_from_reach_with_flag(&want_obj, THEY_HAVE,
@@ -349,7 +349,7 @@ static int ok_to_give_up(void)
min_generation);
 }
 
-static int get_common_commits(void)
+static int get_common_commits(struct object_array *have_obj)
 {
struct object_id oid;
char last_hex[GIT_MAX_HEXSZ + 1];
@@ -367,11 +367,11 @@ static int get_common_commits(void)
 
if (!line) {
if (multi_ack == 2 && got_common
-   && !got_other && ok_to_give_up()) {
+   && !got_other && ok_to_give_up(have_obj)) {
sent_ready = 1;
packet_write_fmt(1, "ACK %s ready\n", last_hex);
}
-   if (have_obj.nr == 0 || multi_ack)
+   if (have_obj->nr == 0 || multi_ack)
packet_write_fmt(1, "NAK\n");
 
if (no_done && sent_ready) {
@@ -385,10 +385,10 @@ static int get_common_commits(void)
continue;
}
if (skip_prefix(line, "have ", &arg)) {
-   switch (got_oid(arg, &oid)) {
+   switch (got_oid(arg, &oid, have_obj)) {
case -1: /* they have what we do not */
got_other = 1;
-   if (multi_ack && ok_to_give_up()) {
+   if (multi_ack && ok_to_give_up(have_obj)) {
const char *hex = oid_to_hex(&oid);
if (multi_ack == 2) {
sent_ready = 1;
@@ -404,14 +404,14 @@ static int get_common_commits(void)
packet_write_fmt(1, "ACK %s common\n", 

[PATCH v2 3/3] upload-pack: clear flags before each v2 request

2018-10-18 Thread Jonathan Tan
Suppose a server has the following commit graph:

 A   B
  \ /
   O

We create a client by cloning A from the server with depth 1, and add
many commits to it (so that future fetches span multiple requests due to
lengthy negotiation). If it then fetches B using protocol v2, the fetch
spanning multiple requests, the resulting packfile does not contain O
even though the client did report that A is shallow.

This is because upload_pack_v2() can be called multiple times while
processing the same session. During the 2nd and all subsequent
invocations, some object flags remain from the previous invocations. In
particular, CLIENT_SHALLOW remains, preventing process_shallow() from
adding client-reported shallows to the "shallows" array, and hence
pack-objects not knowing about these client-reported shallows.

Therefore, teach upload_pack_v2() to clear object flags at the start of
each invocation. This has some other results:

 - THEY_HAVE gates addition of objects to have_obj in process_haves().
   Previously in upload_pack_v2(), have_obj needed to be static because
   once an object is added to have_obj, it is never readded and thus we
   needed to retain the contents of have_obj between invocations. Now
   that flags are cleared, this is no longer necessary. This patch does
   not change the behavior of ok_to_give_up() (THEY_HAVE is still set on
   each "have") and got_oid() (used only in non-v2); THEY_HAVE is not
   used in any other function.

 - WANTED gates addition of objects to want_obj in parse_want() and
   parse_want_ref(). It is also used in receive_needs(), but that is
   only used in non-v2. For the same reasons as THEY_HAVE, want_obj no
   longer needs to be static in upload_pack_v2().

 - CLIENT_SHALLOW is changed as discussed above.

Clearing of the other 5 flags does not affect functionality in v2. (Note
that in non-v2, upload_pack() is only called once per process, so each
invocation starts with blank flags anyway.)

 - OUR_REF is only used in non-v2.

 - COMMON_KNOWN is only used as a scratch flag in ok_to_give_up().

 - SHALLOW is passed to invocations in deepen() and
   deepen_by_rev_list(), but upload-pack doesn't use it.

 - NOT_SHALLOW is used by send_shallow() and send_unshallow(), but
   invocations of those functions are always preceded by code that sets
   NOT_SHALLOW on the appropriate objects.

 - HIDDEN_REF is only used in non-v2.

Signed-off-by: Jonathan Tan 
---
 t/t5702-protocol-v2.sh | 25 +
 upload-pack.c  | 13 +
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
index 88a886975d..4adc4b00e3 100755
--- a/t/t5702-protocol-v2.sh
+++ b/t/t5702-protocol-v2.sh
@@ -429,6 +429,31 @@ test_expect_success 'fetch supports include-tag and tag 
following' '
git -C client cat-file -e $(git -C client rev-parse annotated_tag)
 '
 
+test_expect_success 'upload-pack respects client shallows' '
+   rm -rf server client trace &&
+
+   git init server &&
+   test_commit -C server base &&
+   test_commit -C server client_has &&
+
+   git clone --depth=1 "file://$(pwd)/server" client &&
+
+   # Add extra commits to the client so that the whole fetch takes more
+   # than 1 request (due to negotiation)
+   for i in $(test_seq 1 32)
+   do
+   test_commit -C client c$i
+   done &&
+
+   git -C server checkout -b newbranch base &&
+   test_commit -C server client_wants &&
+
+   GIT_TRACE_PACKET="$(pwd)/trace" git -C client -c protocol.version=2 \
+   fetch origin newbranch &&
+   # Ensure that protocol v2 is used
+   grep "fetch< version 2" trace
+'
+
 # Test protocol v2 with 'http://' transport
 #
 . "$TEST_DIRECTORY"/lib-httpd.sh
diff --git a/upload-pack.c b/upload-pack.c
index 451bf47e7f..42c5ec44cf 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -37,6 +37,9 @@
 #define CLIENT_SHALLOW (1u << 18)
 #define HIDDEN_REF (1u << 19)
 
+#define ALL_FLAGS (THEY_HAVE | OUR_REF | WANTED | COMMON_KNOWN | SHALLOW | \
+   NOT_SHALLOW | CLIENT_SHALLOW | HIDDEN_REF)
+
 static timestamp_t oldest_have;
 
 static int deepen_relative;
@@ -1409,10 +1412,10 @@ int upload_pack_v2(struct repository *r, struct 
argv_array *keys,
 {
enum fetch_state state = FETCH_PROCESS_ARGS;
struct upload_pack_data data;
-   /* NEEDSWORK: make this non-static */
-   static struct object_array have_obj;
-   /* NEEDSWORK: make this non-static */
-   static struct object_array want_obj;
+   struct object_array have_obj = OBJECT_ARRAY_INIT;
+   struct object_array want_obj = OBJECT_ARRAY_INIT;
+
+   clear_object_flags(ALL_FLAGS);
 
git_config(upload_pack_config, NULL);
 
@@ -1464,6 +1467,8 @@ int upload_pack_v2(struct repos

Re: [PATCH 9/9] builtin/fetch: check for submodule updates for non branch fetches

2018-10-17 Thread Jonathan Tan
> @@ -887,11 +887,14 @@ static int store_updated_refs(const char *raw_url, 
> const char *remote_name,
>   rc |= update_local_ref(ref, what, rm, ,
>  summary_width);
>   free(ref);
> - } else
> + } else {
> + check_for_new_submodule_commits(&rm->old_oid);

Does this need to be guarded with a recurse_submodules check, just like
in update_local_ref()?

Also, this warrants a comment - this is here because there is some code
later that requires the new submodule commits to be registered, and the
other branch does not require it only because update_local_ref() calls
it.

> @@ -615,7 +615,7 @@ test_expect_success "fetch new commits on-demand when 
> they are not reachable" '
>   git update-ref refs/changes/2 $D &&
>   (
>   cd downstream &&
> - git fetch --recurse-submodules --recurse-submodules-default 
> on-demand origin refs/changes/2:refs/heads/my_branch &&
> + git fetch --recurse-submodules origin refs/changes/2 &&
>   git -C submodule cat-file -t $C &&
>   git checkout --recurse-submodules FETCH_HEAD
>   )

I think there should be a new test - we can tell from the code that just
because fetching to FETCH_HEAD works doesn't mean that fetching to a ref
works, and vice versa.

Also, can you make the test fetch 2 refs? So that we know that it works
with more than one.


Re: [PATCH 8/9] fetch: retry fetching submodules if needed objects were not fetched

2018-10-17 Thread Jonathan Tan
> Currently when git-fetch is asked to recurse into submodules, it dispatches
> a plain "git-fetch -C " (with some submodule related options
> such as prefix and recursing strategy, but) without any information of the
> remote or the tip that should be fetched.
> 
> This works surprisingly well in some workflows (such as using submodules
> as a third party library), while not so well in other scenarios, such
> as in a Gerrit topic-based workflow, that can tie together changes
> (potentially across repositories) on the server side. One of the parts
> of such a Gerrit workflow is to download a change when wanting to examine
> it, and you'd want to have its submodule changes that are in the same
> topic downloaded as well. However these submodule changes reside in their
> own repository in their own ref (refs/changes/).

It seems that the pertinent situation is when, in the superproject, you
fetch a commit (which may be the target of a ref, or an ancestor of the
target of a ref) that points to a submodule commit that was not fetched
by the default refspec-less fetch that you describe in the first
paragraph. I would just describe it as follows:

  But this default fetch is not sufficient, as a newly fetched commit in
  the superproject could point to a commit in the submodule that is not
  in the default refspec. This is common in workflows like Gerrit's.
  When fetching a Gerrit change under review (from refs/changes/??), the
  commits in that change likely point to submodule commits that have not
  been merged to a branch yet.

Another thing you need to clarify is what happens if the fetch-by-commit
fails. Right now, it seems that it will make the whole thing fail, which
might be a surprising change in behavior.

> Retry fetching a submodule by object name if the object id that the
> superproject points to, cannot be found.

I don't really think of this as a retry - the first time, you're
fetching the default refspec, and the second time, with a specific list
of object IDs. Also, be consistent in the usage of "object name" and
"object id" - as far as I know, they are the same thing.

> This retrying does not happen when the "git fetch" done at the
> superproject is not storing the fetched results in remote
> tracking branches (i.e. instead just recording them to
> FETCH_HEAD) in this step. A later patch will fix this.

The test stores the result in a normal branch, not a remote tracking
branch. Is storing in a normal branch required?

Also, do you know why this is required? A naive reading of the patch
leads me to believe that this should work even if merely fetching to
FETCH_HEAD.

> +struct get_next_submodule_task {
> + struct repository *repo;
> + const struct submodule *sub;
> + unsigned free_sub : 1; /* Do we need to free the submodule? */
> + struct oid_array *commits;
> +};

Firstly, I don't think "commits" needs to be a pointer.

Document at least "commits". As far as I can tell, if NULL (or empty if
you take my suggestion), this means that this task is the first pass for
this particular submodule and the fetch needs to be done using the
default refspec. Otherwise, this task is the second pass for this
particular submodule and the fetch needs to be done passing the
contained OIDs.
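
Something like this, also taking "commits" by value as suggested above
(a sketch):

struct get_next_submodule_task {
	struct repository *repo;
	const struct submodule *sub;
	unsigned free_sub : 1; /* Do we need to free the submodule? */
	/*
	 * Empty: first pass for this submodule; fetch with the default
	 * refspec. Non-empty: second pass; fetch exactly these OIDs.
	 */
	struct oid_array commits;
};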

> +static const struct submodule *get_default_submodule(const char *path)
> +{
> + struct submodule *ret = NULL;
> + const char *name = default_name_or_path(path);
> +
> + if (!name)
> + return NULL;
> +
> + ret = xmalloc(sizeof(*ret));
> + memset(ret, 0, sizeof(*ret));
> + ret->path = name;
> + ret->name = name;
> +
> + return (const struct submodule *) ret;
> +}

What is a "default" submodule and why would you need one?

> + task = get_next_submodule_task_create(spf->r, ce->name);
> +
> + if (!task->sub) {
> + free(task);
> + continue;
>   }

Will task->sub ever be NULL?

> + if (spf->retry_nr) {
> + struct get_next_submodule_task *task = spf->retry[spf->retry_nr 
> - 1];
> + struct strbuf submodule_prefix = STRBUF_INIT;
> + spf->retry_nr--;
> +
> + strbuf_addf(&submodule_prefix, "%s%s/", spf->prefix, 
> task->sub->path);
> +
> + child_process_init(cp);
> + prepare_submodule_repo_env_in_gitdir(&cp->env_array);
> + cp->git_cmd = 1;
> + cp->dir = task->repo->gitdir;
> +
> + argv_array_init(&cp->args);
> + argv_array_pushv(&cp->args, spf->args.argv);
> + argv_array_push(&cp->args, "on-demand");

This means that we need to trust that the last entry in spf->args can
take an "on-demand" parameter. Could we supply that argument here
explicitly instead?

> + argv_array_push(&cp->args, "--submodule-prefix");
> + argv_array_push(&cp->args, submodule_prefix.buf);
> +
> + /* NEEDSWORK: have get_default_remote from s--h */

What is s--h?

> +static int 

Re: [PATCH 7/9] submodule: fetch in submodules git directory instead of in worktree

2018-10-17 Thread Jonathan Tan
> This patch started as a refactoring to make 'get_next_submodule' more
> readable, but upon doing so, I realized that "git fetch" of the submodule
> actually doesn't need to be run in the submodules worktree. So let's run
> it in its git dir instead.

The commit message needs to be updated, I think - this patch does
significantly more than fetching in the gitdir.

> This patch leaks the cp->dir in get_next_submodule, as any further
> callback in run_processes_parallel doesn't have access to the child
> process any more.

The cp->dir is already leaked - probably better to write "cp->dir in
get_next_submodule() is still leaked, but this will be fixed in a
subsequent patch".

> +static void prepare_submodule_repo_env_in_gitdir(struct argv_array *out)
> +{
> + prepare_submodule_repo_env_no_git_dir(out);
> + argv_array_pushf(out, "%s=.", GIT_DIR_ENVIRONMENT);

Why does GIT_DIR need to be set? Is it to avoid subcommands recursively
checking the parent directories in case the CWD is a malformed Git
repository? If yes, maybe it's worth adding a comment.

> +static struct repository *get_submodule_repo_for(struct repository *r,
> +  const struct submodule *sub)
> +{
> + struct repository *ret = xmalloc(sizeof(*ret));
> +
> + if (repo_submodule_init(ret, r, sub)) {
> + /*
> +  * No entry in .gitmodules? Technically not a submodule,
> +  * but historically we supported repositories that happen to be
> +  * in-place where a gitlink is. Keep supporting them.
> +  */
> + struct strbuf gitdir = STRBUF_INIT;
> + strbuf_repo_worktree_path(&gitdir, r, "%s/.git", sub->path);
> + if (repo_init(ret, gitdir.buf, NULL)) {
> + strbuf_release(&gitdir);
> + return NULL;
> + }
> + strbuf_release(&gitdir);
> + }
> +
> + return ret;
> +}

This is the significant extra thing that this patch does: an unskipped
submodule must now pass either the checks in repo_submodule_init() or
the checks in repo_init(), which seems stricter than the current check
that ".git" points to a directory or is one. This means that we skip
certain broken repositories, which necessitates a change in the test.

I think we should be more particular about what we're allowed to skip -
in particular, maybe if we're planning to skip this submodule, its
corresponding directory in the worktree (if one exists) needs to be
empty.
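
Something along these lines is what I have in mind (a rough sketch
only, with the surrounding variable names assumed; is_empty_dir() is
the helper I would reach for, assuming it is usable here):

  repo = get_submodule_repo_for(spf->r, submodule);
  if (!repo) {
          if (is_empty_dir(ce->name))
                  continue;       /* nothing checked out, safe to skip */
          die(_("could not open submodule '%s'"), ce->name);
  }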

> - cp->dir = strbuf_detach(&submodule_path, NULL);
> - prepare_submodule_repo_env(&cp->env_array);
> + prepare_submodule_repo_env_in_gitdir(&cp->env_array);
> + cp->dir = xstrdup(repo->gitdir);

Here is where the functionality change (fetch in ".git") described in
the commit message occurs.


Re: [PATCH 6/9] repository: repo_submodule_init to take a submodule struct

2018-10-17 Thread Jonathan Tan
> When constructing a struct repository for a submodule for some revision
> of the superproject where the submodule is not contained in the index,
> it may not be present in the working tree currently either. In that
> situation giving a 'path' argument is not useful. Upgrade the
> repo_submodule_init function to take a struct submodule instead.

Are there ways for other code to create a struct submodule without using
submodule_from_path()? If yes, maybe outline them here and say that this
makes repo_submodule_init() more useful, since it can now be used with
those methods.

> While we are at it, overhaul the repo_submodule_init function by renaming
> the submodule repository struct, which is to be initialized, to a name
> that is not confused with the struct submodule as easily.

"Overhaul" is probably not the right word for just a rename of a local
variable. Also mention the other functions in which you did this rename
(or just say "repo_submodule_init() and other functions").

> +/*
> + * Initialize the repository 'subrepo' as the submodule given by the
> + * struct submodule 'sub' in parent repository 'superproject'.
> + * Return 0 upon success and a non-zero value upon failure.
> + */
> +struct submodule;
> +int repo_submodule_init(struct repository *subrepo,
>   struct repository *superproject,
> - const char *path);
> + const struct submodule *sub);

From this description, I would expect "sub" to not be allowed to be
NULL, but from the code I don't think that's the case. Should we
prohibit NULL (and add checks to all repo_submodule_init()'s callers) or
document that a NULL sub is allowed (and what happens in that case)?


Re: [PATCH 5/9] submodule.c: do not copy around submodule list

2018-10-17 Thread Jonathan Tan
> By doing so we'll have access to the util pointer for longer that
> contains the commits that we need to fetch, which will be
> useful in a later patch.

It seems that the main point of this patch is so that the OIDs be stored
in changed_submodule_names, because you need them in a later patch. If
so, I would have expected a commit title like "submodule: store OIDs in
changed_submodule_names".

> @@ -1142,8 +1142,7 @@ static void calculate_changed_submodule_paths(
>   struct submodule_parallel_fetch *spf)
>  {
>   struct argv_array argv = ARGV_ARRAY_INIT;
> - struct string_list changed_submodules = STRING_LIST_INIT_DUP;
> - const struct string_list_item *name;
> + struct string_list_item *name;

Prior to this patch, changed_submodules is here just so that we know
what to add to changed_submodule_names (it will be cleared at the end of
the function). So removing it is fine.

> - collect_changed_submodules(&changed_submodules, &argv);
> + collect_changed_submodules(&spf->changed_submodule_names, &argv);
>  
> - for_each_string_list_item(name, &changed_submodules) {
> + for_each_string_list_item(name, &spf->changed_submodule_names) {

We add all the changed submodules directly to changed_submodule_names,
and iterate over it. So we use changed_submodule_names as a
scratchpad...

> - if (!submodule_has_commits(path, commits))
> - string_list_append(>changed_submodule_names,
> -name->string);
> + if (submodule_has_commits(path, commits)) {
> + oid_array_clear(commits);
> + *name->string = '\0';
> + }

...but this is fine because previously, we appended the name->string to
changed_submodule_names (with no util) whereas now, we make the
name->string empty in the opposite condition.

Before this patch, at the end of the loop, we had all the wanted
submodule names with no util in changed_submodule_names. With this
patch, at the end of the loop, we have all the wanted submodule names
with util pointing to an OID array, and also some blanks with util
pointing to a cleared OID array.
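
To illustrate (my own sketch of the list state, not taken from the
patch), after the loop the list could look like:

  "sub-a" -> util = oid_array of commits to fetch  (kept)
  ""      -> util = cleared oid_array              (blank, to be removed)
  "sub-b" -> util = oid_array of commits to fetch  (kept)

and the removal below then drops the blank entries.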

> - free_submodules_oids(&changed_submodules);
> + string_list_remove_empty_items(&spf->changed_submodule_names, 1);

The local variable changed_submodules no longer exists, so removing that
line is fine.

And we remove all the blanks from changed_submodule_names.

> @@ -1377,7 +1378,7 @@ int fetch_populated_submodules(struct repository *r,
>  
>   argv_array_clear();
>  out:
> - string_list_clear(&spf.changed_submodule_names, 1);
> + free_submodules_oids(&spf.changed_submodule_names);

And because changed_submodule_names now has a non-trivial util pointer,
we need to free it properly.

This patch looks good, except that the commit title and message could
perhaps be clearer.


Re: [PATCH 4/9] submodule.c: move global changed_submodule_names into fetch submodule struct

2018-10-17 Thread Jonathan Tan
> The `changed_submodule_names` are only used for fetching, so let's make it
> part of the struct that is passed around for fetching submodules.

Keep the titles of commit messages to 50 characters or under.

> +static void calculate_changed_submodule_paths(
> + struct submodule_parallel_fetch *spf)

Instead of taking the entire struct, could this just take the list of
changed submodule names instead?


Re: [PATCH 3/9] submodule.c: sort changed_submodule_names before searching it

2018-10-17 Thread Jonathan Tan
> We can string_list_insert() to maintain sorted-ness of the
> list as we find new items, or we can string_list_append() to
> build an unsorted list and sort it at the end just once.

This confused me at first, because neither function is mentioned in the
patch.

> As we do not rely on the sortedness while building the
> list, we pick the "append and sort at the end" as it
> has better worst case execution times.

It took me some time to see that you were rejecting the two solutions
you listed in the first paragraph, and are instead using a third (that
you describe in this paragraph).

The code itself looks fine.

In the future, I think that it's better if this type of patch went into
its own patch set - this seems independent of the concerns of this patch
set, so splitting up keeps patch sets small.


Re: Git wire protocol v2 fails fetching shallow changes with `pack has XXX unresolved deltas` on large repos with lots of tags

2018-10-17 Thread Jonathan Tan
> No changes are needed in mirrored repository. Crash happens both with
> 2.18.0 and 2.19.1 git versions. Having repository locally is not
> required but reduces test runtime, you can quite reliably reproduce
> issue when fetching over net directly from chromium.org bypassing
> mirroring step.

Do you have the reproduction steps for this? If I run

  git -c protocol.version=2 fetch --tags \
https://chromium.googlesource.com/chromium/src \
+refs/heads/*:refs/remotes/origin/* --depth=1

repeatedly in the same repository, it succeeds. (And I've checked with
GIT_TRACE_PACKET that it uses protocol v2.)


Re: [PATCH 0/4] Bloom filter experiment

2018-10-16 Thread Jonathan Tan
> | Implementation | Queries | Maybe | FP # | FP %  |
> ||-|---|--|---|
> | Szeder | 66095   | 1142  | 256  | 0.38% |
> | Jonathan   | 66459   | 107   | 89   | 0.16% |
> | Stolee | 53025   | 492   | 479  | 0.90% |
> 
> (Note that we must have used different starting points, which is why my 
> "Queries" is so much smaller.)

I suspect it's because your bloom filter implementation covers only the
first parent (if I'm understanding get_bloom_filter() correctly). When I
only covered the first parent in my initial test (see patch 2 of [1]), I
got (following the columns in the table above):

  53096 107 89 0.001676

Also, I think that the rejecting power (Queries - Maybe)/(Total tree
comparisons if no bloom filters were used) needs to be in the evaluation
criteria somewhere, as that indicates how many tree comparisons we
managed to avoid.
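
For example, with the numbers in the table above, the numerators
(Queries - Maybe) work out to (my arithmetic; the denominator, the
total number of tree comparisons without filters, is not shown in the
table):

  Szeder:   66095 - 1142 = 64953
  Jonathan: 66459 -  107 = 66352
  Stolee:   53025 -  492 = 52533

i.e. the absolute number of avoided comparisons is similar across the
three, so the denominator is what would make them comparable across
different starting points.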

We probably should also test on a file that changes more
frequently :-)

[1] https://public-inbox.org/git/cover.1539219248.git.jonathanta...@google.com/

> The increase in false-positive percentage is expected in my 
> implementation. I'm using the correct filter sizes to hit a <1% FP 
> ratio. This could be lowered by changing the settings, and the size 
> would dynamically grow. For my Git repo (which contains 
> git-for-windows/git and microsoft/git) this implementation grows the 
> commit-graph file from 5.8 MB to 7.3 MB (1.5 MB total, compared to 
> Szeder's 8MB filter). For 105,260 commits, that rounds out to less than 
> 20 bytes per commit (compared to Jonathan's 256 bytes per commit).

Mine has 256 bits per commit, which is 32 bytes per commit (still more
than yours).

Having said all that, thanks for writing up your version - in
particular, variable sized filters (like in yours) seem to be the way to
go.

> We'll see how much time I have to do this polish, but I think the 
> benefit is proven.

Agreed.


Re: [PATCH 17/19] submodule: use submodule repos for object lookup

2018-10-16 Thread Jonathan Tan
> Thanks for the review of the whole series!
> 
> I have redone this series, addressing all your comments. I addressed
> this comment differently than suggested, and put the submodule
> repository argument at the end of the parameter list, such that it
> goes with all the other arguments to be filled in.

Sounds good.

> I was about to resend the series, but test-merged with the other
> submodule series in flight (origin/sb/submodule-recursive-fetch-gets-the-tip)
> which had some conflicts that I can easily resolve by rebasing on top.

I presume you are talking about [1]? Maybe consider rebasing that one on
top of this instead, since this is just a refactoring whereas
submodule-recursive-fetch-gets-the-tip changes functionality, from what
I understand of patches 8 and 9.

[1] https://public-inbox.org/git/20181016181327.107186-1-sbel...@google.com/

> It conflicts a lot when merging to next, due to the latest patch
> ("Apply semantic patches from previous patches"), so I am not sure
> how to proceed properly. Maybe we'd omit that patch and
> run 'make coccicheck' on next to apply the semantic patches
> there instead.

Omitting the patch sounds good to me. For me, just stating that you have
excluded any coccinelle-generated patches in order to ease merging into
the various branches is sufficient, and people can test the coccinelle
patches included by running "make coccicheck" then "patch -p1


Re: [PATCH v4 9/9] Documentation/config: add odb.<name>.promisorRemote

2018-10-16 Thread Jonathan Tan
>  1. Teaching partial clone to attempt to fetch missing objects from
> multiple remotes instead of only one.  This is useful because you
> can have a server that is nearby and cheaper to serve from (some
> kind of local cache server) that you make requests to first before
> falling back to the canonical source of objects.

Quoting the above definition of (1) for reference. I think Jonathan
Nieder has covered the relevant points well - I'll just expand on (1).

> So much for the current setup.  For (1), I believe you are proposing to
> still have only one effective <remote>, so it doesn't necessarily
> require modifying the extensions.* configuration.  Instead, the idea is
> that when trying to access an object, we would follow one of a list of
> steps:
> 
>  1. First, check the local object store. If it's there, we're done.
>  2. Second, try alternates --- maybe the object is in one of those!
>  3. Now, try promisor remotes, one at a time, in user-configured order.
> 
> In other words, I think that for (1) all we would need is a new
> configuration
> 
>   [object]
>   missingObjectRemote = local-cache-remote
>   missingObjectRemote = origin
> 
> The semantics would be that when trying to access a promised object,
> we attempt to fetch from these remotes one at a time, in the order
> specified.  We could require that the remote named in
> extensions.partialClone be one of the listed remotes, without having
> to care where it shows up in the list.

Or allow extensions.partialClone=<remote> wherein <remote> is not in the
missingObjectRemote, in which case <remote> is tried first, so that we
don't have to reject some configurations.
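
Concretely, a configuration like the following (hypothetical remote
names) would then be valid, with "cache" tried first, then "origin",
then "backup":

  [extensions]
          partialClone = cache
  [object]
          missingObjectRemote = origin
          missingObjectRemote = backup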

> That way, we get the benefit (1) without having to change the
> semantics of extensions.partialClone and without having to care about
> the order of sections in the config.  What do you think?

Let's define the promisor remotes of a repository as those in
missingObjectRemote or extensions.partialClone (currently, we talk about
"the promisor remote" (singular), defined in extensions.partialClone).

Overall, this seems like a reasonable idea to me, if we keep the
restriction that we can only fetch with filter from a promisor remote.
This allows us to extend the definition of a promisor object in a
manner consistent to the current definition - to say "a promisor object
is one promised by at least one promisor remote" (currently, "a promisor
object is promised by the promisor remote").

In the presence of missingObjectRemote, old versions of Git, when lazily
fetching, would only know to try the extensions.partialClone remote. But
this is safe because existing data wouldn't be clobbered (since we're
not using ideas like adding meaning to the contents of the .promisor
file). Also, other things like fsck and gc still work.


[PATCH] upload-pack: clear flags before each v2 request

2018-10-16 Thread Jonathan Tan
Suppose a server has the following commit graph:

 A   B
  \ /
   O

We create a client by cloning A from the server with depth 1, and add
many commits to it (so that future fetches span multiple requests due to
lengthy negotiation). If it then fetches B using protocol v2, the fetch
spanning multiple requests, the resulting packfile does not contain O
even though the client did report that A is shallow.

This is because upload_pack_v2() can be called multiple times while
processing the same session. During the 2nd and all subsequent
invocations, some object flags remain from the previous invocations. In
particular, CLIENT_SHALLOW remains, preventing process_shallow() from
adding client-reported shallows to the "shallows" array, and hence
pack-objects not knowing about these client-reported shallows.

Therefore, teach upload_pack_v2() to clear object flags at the start of
each invocation.

(One alternative is to reduce or eliminate usage of object flags in
protocol v2, but that doesn't seem feasible because almost all 8 flags
are used pervasively in v2 code.)

Signed-off-by: Jonathan Tan 
---
This was noticed by Arturas Moskvinas in [1]. The
reproduction steps given were to repeat a shallow fetch twice in
succession, but I found it easier to write a more understandable test if
I made the 2nd fetch an ordinary fetch. In any case, I also reran the
original reproduction steps, and the fetch completes without error.

This patch doesn't cover the negotiation issue that I mentioned in my
previous reply [2].

[1] 
https://public-inbox.org/git/CAGY-PBgsG-T3JY=awszwgmpfx+jdx-a1fcv0s6vr067bsqg...@mail.gmail.com/
[2] 
https://public-inbox.org/git/20181013004356.257709-1-jonathanta...@google.com/
---
 t/t5702-protocol-v2.sh | 25 +
 upload-pack.c  |  5 +
 2 files changed, 30 insertions(+)

diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
index 88a886975d..70b88385ba 100755
--- a/t/t5702-protocol-v2.sh
+++ b/t/t5702-protocol-v2.sh
@@ -429,6 +429,31 @@ test_expect_success 'fetch supports include-tag and tag following' '
git -C client cat-file -e $(git -C client rev-parse annotated_tag)
 '
 
+test_expect_success 'upload-pack respects client shallows' '
+   rm -rf server client trace &&
+
+   git init server &&
+   test_commit -C server base &&
+   test_commit -C server client_has &&
+
+   git clone --depth=1 "file://$(pwd)/server" client &&
+
+   # Add extra commits to the client so that the whole fetch takes more
+   # than 1 request (due to negotiation)
+   for i in $(seq 1 32)
+   do
+   test_commit -C client c$i
+   done &&
+
+   git -C server checkout -b newbranch base &&
+   test_commit -C server client_wants &&
+
+   GIT_TRACE_PACKET="$(pwd)/trace" git -C client -c protocol.version=2 \
+   fetch origin newbranch &&
+   # Ensure that protocol v2 is used
+   grep "git< version 2" trace
+'
+
 # Test protocol v2 with 'http://' transport
 #
 . "$TEST_DIRECTORY"/lib-httpd.sh
diff --git a/upload-pack.c b/upload-pack.c
index 62a1000f44..de7de1de38 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -37,6 +37,9 @@
 #define CLIENT_SHALLOW (1u << 18)
 #define HIDDEN_REF (1u << 19)
 
+#define ALL_FLAGS (THEY_HAVE | OUR_REF | WANTED | COMMON_KNOWN | SHALLOW | \
+   NOT_SHALLOW | CLIENT_SHALLOW | HIDDEN_REF)
+
 static timestamp_t oldest_have;
 
 static int deepen_relative;
@@ -1393,6 +1396,8 @@ int upload_pack_v2(struct repository *r, struct argv_array *keys,
enum fetch_state state = FETCH_PROCESS_ARGS;
struct upload_pack_data data;
 
+   clear_object_flags(ALL_FLAGS);
+
git_config(upload_pack_config, NULL);
 
upload_pack_data_init(&data);
-- 
2.19.0.271.gfe8321ec05.dirty



Re: Git wire protocol v2 fails fetching shallow changes with `pack has XXX unresolved deltas` on large repos with lots of tags

2018-10-12 Thread Jonathan Tan
> On large repositories with lots of tags - git wire protocol v2 fails
> to fetch shallow changes, it fails with error `pack has XXX unresolved
> deltas`. Unfortunately I didn't find easy way to reproduce it except
> cloning+fetching chromium repository, the way jenkins does.
> Reproduction steps:

[snip]

Thanks for your bug report and reproduction steps. I managed to
reproduce your issue and took a look.

The main issue seems to be that with v2, upload-pack doesn't pass
"--shallow X" to pack-objects (the write_one_shallow() callback is never
called, even if I change the "if (shallow_nr)" to "if (1)"), so
pack-objects probably doesn't know that some objects cannot be used as
delta bases. (With v0, write_one_shallow() is indeed called.) The issue
probably lies in how v0 and v2 handle client-provided shallows
differently.

There also seems to be another issue in that negotiation occurs
differently in these 2 protocols - I see the full list of "have" lines
being sent in the final request to the server in v0, but a very limited
list in v2. This might be because of the ref prefix limiting in v2, but
I haven't fully investigated it.

I'll look some more into this.


Re: [RFC PATCH 00/19] Bring more repository handles into our code base

2018-10-11 Thread Jonathan Tan
> This series takes another approach as it doesn't change the signature of
> functions, but introduces new functions that can deal with arbitrary 
> repositories, keeping the old function signature around using a shallow 
> wrapper.
> 
> Additionally each patch adds a semantic patch, that would port from the
> old to the new function. These semantic patches are all applied in the
> very last patch, but we could omit applying the last patch if it causes
> too many merge conflicts and trickle in the semantic patches over time
> when there are no merge conflicts.

Thanks, this looks like a good plan.

One concern is that if we leave 2 versions of functions around, it will
be difficult to look at a function and see if it's truly
multi-repository-compatible (or making a call to a function that
internally uses the_repository, and is thus wrong). But with the plan
Stefan quoted [1], mentioned in commit e675765235 ("diff.c: remove
implicit dependency on the_index", 2018-09-21):

  The plan is these macros will always be defined for all library files
  and the macros are only accessible in builtin/

(The macros include NO_THE_REPOSITORY_COMPATIBILITY_MACROS, which
disables the single-repository function-like macros.) This mitigates the
concern somewhat.

[1] https://public-inbox.org/git/20181011211754.31369-1-sbel...@google.com/


Re: [PATCH 18/19] submodule: don't add submodule as odb for push

2018-10-11 Thread Jonathan Tan
> The submodule was added as an alternative in eb21c732d6 (push: teach
> --recurse-submodules the on-demand option, 2012-03-29), but was
> not explained, why.
> 
> In similar code, submodule_has_commits, the submodule is added as an
> alternative to perform a quick check if we need to dive into the submodule.
> 
> However in push_submodule
> (a) for_each_remote_ref_submodule will also provide the quick check and
> (b) after that we don't need to have submodule objects around, as all
> further code is to spawn a separate process.

After some investigation, I think I understand. I would explain it this
way:

  In push_submodule(), because we do not actually need access to objects
  in the submodule, do not invoke add_submodule_odb().
  (for_each_remote_ref_submodule() does not require access to those
  objects, and the actual push is done by spawning another process,
  which handles object access by itself.)

I'm not sure if it's worth mentioning the commit in which the call was
introduced, since nothing seems to have changed between then and now
(the same bug is present when it was introduced, and now).

I also checked the users of push_submodule() (transport_push()) and
indeed it doesn't seem to make use of the additional objects brought in
by add_submodule_odb().

Do you know if pushing of submodules is exercised by any test?

Other than that, the code itself looks good.


Re: [PATCH 17/19] submodule: use submodule repos for object lookup

2018-10-11 Thread Jonathan Tan
> +/*
> + * Initialize 'out' based on the provided submodule path.
> + *
> + * Unlike repo_submodule_init, this tolerates submodules not present
> + * in .gitmodules. NEEDSWORK: The repo_submodule_init behavior is
> + * preferable. This function exists only to preserve historical behavior.

What do you mean by "The repo_submodule_init behavior is preferable"? If
we need to preserve historical behavior, then it seems that the most
preferable one is something that meets our needs (like open_submodule()
in this patch).

> +static int open_submodule(struct repository *out, const char *path)
> +{
> + struct strbuf sb = STRBUF_INIT;
> +
> + if (submodule_to_gitdir(&sb, path) || repo_init(out, sb.buf, NULL)) {
> + strbuf_release(&sb);
> + return -1;
> + }
> +
> + out->submodule_prefix = xstrdup(path);

Do we need to set submodule_prefix?

> @@ -507,7 +533,7 @@ static void show_submodule_header(struct diff_options *o, 
> const char *path,
>   else if (is_null_oid(two))
>   message = "(submodule deleted)";
>  
> - if (add_submodule_odb(path)) {
> + if (open_submodule(sub, path) < 0) {

This function, as a side effect, writes the open repository to "sub" for
its caller to use. I think it's better if its callers open "sub" and
then pass it to show_submodule_header() if successful. If opening the
submodule is expected to fail sometimes (like it seems here), then we
can allow callers to also pass NULL, and document what happens when NULL
is passed.

Also, repositories open in this way should also be freed.
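
In other words, something like this at the call site (a sketch only,
with the unrelated arguments elided; repo_clear() is what I would
expect for the freeing, assuming it is suitable here):

  struct repository subrepo;
  int have_subrepo = !open_submodule(&subrepo, path);

  /* pass NULL when the submodule could not be opened */
  show_submodule_header(o, path, /* ... */,
                        have_subrepo ? &subrepo : NULL);
  if (have_subrepo)
          repo_clear(&subrepo);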


Re: [PATCH 16/19] pretty: prepare format_commit_message to handle arbitrary repositories

2018-10-11 Thread Jonathan Tan
Patches 6-16 are all quite straightforward, and are reviewed-by: me.


Re: [PATCH 05/19] object: parse_object to honor its repository argument

2018-10-11 Thread Jonathan Tan
> In 8e4b0b6047 (object.c: allow parse_object to handle
> arbitrary repositories, 2018-06-28), we forgot to pass the
> repository down to the read_object_file.

[snip]

> @@ -270,7 +270,7 @@ struct object *parse_object(struct repository *r, const 
> struct object_id *oid)
>   return lookup_object(r, oid->hash);
>   }
>  
> - buffer = read_object_file(oid, , );
> + buffer = repo_read_object_file(r, oid, , );

There is still the matter of the 2 invocations of has_object_file()
earlier in this function. The first one probably can be replaced with
oid_object_info_extended() (see the definition of
has_sha1_file_with_flags() to see how to do it), and the second one
looks redundant to me and can be removed. Apart from that, I don't see
any other invocations that need to be converted, so parse_object() will
indeed fully honor its repository argument.
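
Concretely, the repo-aware stand-in for has_object_file(oid) would be
something like the following, mirroring what has_sha1_file_with_flags()
does internally (a sketch, not tested):

  /* instead of has_object_file(oid): */
  oid_object_info_extended(r, oid, NULL, OBJECT_INFO_SKIP_CACHED) >= 0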


Re: [PATCH 04/19] object-store: prepare read_object_file to deal with arbitrary repositories

2018-10-11 Thread Jonathan Tan
> Introduce repo_read_object_file which takes the repository argument, and
> hide the original read_object_file as a macro behind
> NO_THE_REPOSITORY_COMPATIBILITY_MACROS, which we planned for in
> e675765235 (diff.c: remove implicit dependency on the_index, 2018-09-21)

That commit didn't seem to plan for anything - it just seems to add a
new function with the name "repo_" preprended and define a macro if
NO_THE_REPOSITORY_COMPATIBILITY_MACROS is not set, just like this patch.
Maybe s/which we planned for in/just like in/.
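
For readers following along, the pattern in both commits looks roughly
like this (reconstructed from the description above, not copied from
the patch):

  void *repo_read_object_file(struct repository *r,
                              const struct object_id *oid,
                              enum object_type *type,
                              unsigned long *size);
  #ifndef NO_THE_REPOSITORY_COMPATIBILITY_MACROS
  #define read_object_file(oid, type, size) \
          repo_read_object_file(the_repository, oid, type, size)
  #endif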

The patch itself looks good.


Re: [PATCH 03/19] object-store: allow read_object_file_extended to read from arbitrary repositories

2018-10-11 Thread Jonathan Tan
> @@ -1413,10 +1414,10 @@ void *read_object_file_extended(const struct 
> object_id *oid,
>   const char *path;
>   struct stat st;
>   const struct object_id *repl = lookup_replace ?
> - lookup_replace_object(the_repository, oid) : oid;
> + lookup_replace_object(r, oid) : oid;

Firstly, patches 1 and 2 are obviously correct.

This lookup_replace_object() is a bit tricky in that at 'master',
register_replace_ref() (indirectly called by lookup_replace_object())
did not handle arbitrary repositories correctly, but 'next' has a patch
that solves this, so this shouldn't be an issue. The other function
changes look fine. So this patch looks correct too.


[PATCH] Per-commit and per-parent filters for 2 parents

2018-10-11 Thread Jonathan Tan
Signed-off-by: Jonathan Tan 
---
> One main benefit of storing on Bloom filter per commit is to avoid
> recomputing hashes at every commit. Currently, this patch only improves
> locality when checking membership at the cost of taking up more space.
> Drop the dependence on the parent oid and then we can save the time
> spent hashing during history queries.

I've removed the hashing of the parent OID here and tried having
per-parent and per-commit hashes for the first 2 parents of each commit
instead of only 1, thus doubling the filter size. The results are not
much of an improvement though:

bloom filter total queries: 66409 definitely not: 56424 maybe: 9985 false 
positives: 9099 fp ratio: 0.137015
0:01.17
---
 bloom-filter.c | 25 -
 bloom-filter.h |  4 ++--
 commit-graph.c | 13 -
 revision.c | 11 +--
 4 files changed, 27 insertions(+), 26 deletions(-)

diff --git a/bloom-filter.c b/bloom-filter.c
index 39b453908f..10c73c45ae 100644
--- a/bloom-filter.c
+++ b/bloom-filter.c
@@ -11,7 +11,7 @@ void bloom_filter_init(struct bloom_filter *bf, uint32_t commit_nr, uint32_t bit
bf->nr_entries = 0;
bf->commit_nr = commit_nr;
bf->bit_size = bit_size;
-   bf->bits = xcalloc(1, commit_nr * bit_size / CHAR_BIT);
+   bf->bits = xcalloc(1, 2 * commit_nr * bit_size / CHAR_BIT);
 }
 
 void bloom_filter_free(struct bloom_filter *bf)
@@ -22,24 +22,24 @@ void bloom_filter_free(struct bloom_filter *bf)
 }
 
 
-static void bloom_filter_set_bits(struct bloom_filter *bf, uint32_t graph_pos, const uint32_t *offsets,
+static void bloom_filter_set_bits(struct bloom_filter *bf, uint32_t graph_pos, int parent_index, const uint32_t *offsets,
   int nr_offsets, int nr_entries)
 {
int i;
for (i = 0; i < nr_offsets; i++) {
-   uint32_t byte_offset = (offsets[i] % bf->bit_size + graph_pos * bf->bit_size) / CHAR_BIT;
+   uint32_t byte_offset = (offsets[i] % bf->bit_size + (2 * graph_pos + parent_index) * bf->bit_size) / CHAR_BIT;
unsigned char mask = 1 << offsets[i] % CHAR_BIT;
bf->bits[byte_offset] |= mask;
}
bf->nr_entries += nr_entries;
 }
 
-static int bloom_filter_check_bits(struct bloom_filter *bf, uint32_t graph_pos, const uint32_t *offsets,
+static int bloom_filter_check_bits(struct bloom_filter *bf, uint32_t graph_pos, int parent_index, const uint32_t *offsets,
int nr)
 {
int i;
for (i = 0; i < nr; i++) {
-   uint32_t byte_offset = (offsets[i] % bf->bit_size + graph_pos * bf->bit_size) / CHAR_BIT;
+   uint32_t byte_offset = (offsets[i] % bf->bit_size + (2 * graph_pos + parent_index) * bf->bit_size) / CHAR_BIT;
unsigned char mask = 1 << offsets[i] % CHAR_BIT;
if (!(bf->bits[byte_offset] & mask))
return 0;
@@ -48,19 +48,18 @@ static int bloom_filter_check_bits(struct bloom_filter *bf, uint32_t graph_pos,
 }
 
 
-void bloom_filter_add_hash(struct bloom_filter *bf, uint32_t graph_pos, const unsigned char *hash)
+void bloom_filter_add_hash(struct bloom_filter *bf, uint32_t graph_pos, int parent_index, const unsigned char *hash)
 {
uint32_t offsets[GIT_MAX_RAWSZ / sizeof(uint32_t)];
hashcpy((unsigned char*)offsets, hash);
-   bloom_filter_set_bits(bf, graph_pos, offsets,
-      the_hash_algo->rawsz / sizeof(*offsets), 1);
+   bloom_filter_set_bits(bf, graph_pos, parent_index, offsets, the_hash_algo->rawsz / sizeof(*offsets), 1);
 }
 
-int bloom_filter_check_hash(struct bloom_filter *bf, uint32_t graph_pos, const unsigned char *hash)
+int bloom_filter_check_hash(struct bloom_filter *bf, uint32_t graph_pos, int parent_index, const unsigned char *hash)
 {
uint32_t offsets[GIT_MAX_RAWSZ / sizeof(uint32_t)];
hashcpy((unsigned char*)offsets, hash);
-   return bloom_filter_check_bits(bf, graph_pos, offsets,
+   return bloom_filter_check_bits(bf, graph_pos, parent_index, offsets,
the_hash_algo->rawsz / sizeof(*offsets));
 }
 
@@ -87,8 +86,8 @@ int bloom_filter_load(struct bloom_filter *bf)
read_in_full(fd, &bf->bit_size, sizeof(bf->bit_size));
if (bf->bit_size % CHAR_BIT)
BUG("invalid size for bloom filter");
-   bf->bits = xmalloc(bf->commit_nr * bf->bit_size / CHAR_BIT);
-   read_in_full(fd, bf->bits, bf->commit_nr * bf->bit_size / CHAR_BIT);
+   bf->bits = xmalloc(2 * bf->commit_nr * bf->bit_size / CHAR_BIT);
+   read_in_full(fd, bf->bits, 2 * bf->commit_nr * bf->bit_size / CHAR_BIT);
 
close(fd);
 
@@ -102,7 +101,7 @@ void bloom_filter_write(struct bloom_filter *bf)
write_in_full(fd, &bf->nr_entries, sizeof(bf->nr

[PATCH 2/2] Only make bloom filter for first parent

2018-10-10 Thread Jonathan Tan
Signed-off-by: Jonathan Tan 
---
 commit-graph.c |  4 ++--
 revision.c | 20 
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 90b0b3df90..d21d555611 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -782,9 +782,9 @@ static void fill_bloom_filter(struct bloom_filter *bf,
 
for (i = 0; i < commit_nr; i++) {
struct commit *commit = commits[i];
-   struct commit_list *parent;
+   struct commit_list *parent = commit->parents;
 
-   for (parent = commit->parents; parent; parent = parent->next)
+   if (parent)
add_changes_to_bloom_filter(bf, parent->item, commit, i,
);
 
diff --git a/revision.c b/revision.c
index c84a997928..5a433a5878 100644
--- a/revision.c
+++ b/revision.c
@@ -539,11 +539,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 }
 
 static int rev_compare_tree(struct rev_info *revs,
-   struct commit *parent, struct commit *commit)
+   struct commit *parent, struct commit *commit, int nth_parent)
 {
struct tree *t1 = get_commit_tree(parent);
struct tree *t2 = get_commit_tree(commit);
-   int bloom_ret;
+   int bloom_ret = 1;
 
if (!t1)
return REV_TREE_NEW;
@@ -568,17 +568,21 @@ static int rev_compare_tree(struct rev_info *revs,
return REV_TREE_SAME;
}
 
-   bloom_ret = check_maybe_different_in_bloom_filter(revs, parent, commit);
-   if (bloom_ret == 0)
-   return REV_TREE_SAME;
+   if (!nth_parent) {
+   bloom_ret = check_maybe_different_in_bloom_filter(revs, parent, commit);
+   if (bloom_ret == 0)
+   return REV_TREE_SAME;
+   }
 
tree_difference = REV_TREE_SAME;
revs->pruning.flags.has_changes = 0;
if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "",
    &revs->pruning) < 0)
return REV_TREE_DIFFERENT;
-   if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
-   bloom_filter_count_false_positive++;
+   if (!nth_parent) {
+   if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
+   bloom_filter_count_false_positive++;
+   }
return tree_difference;
 }
 
@@ -776,7 +780,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit)
die("cannot simplify commit %s (because of %s)",
oid_to_hex(&commit->object.oid),
oid_to_hex(&p->object.oid));
-   switch (rev_compare_tree(revs, p, commit)) {
+   switch (rev_compare_tree(revs, p, commit, nth_parent)) {
case REV_TREE_SAME:
if (!revs->simplify_history || !relevant_commit(p)) {
/* Even if a merge with an uninteresting
-- 
2.19.0.271.gfe8321ec05.dirty



[PATCH 1/2] One filter per commit

2018-10-10 Thread Jonathan Tan
Signed-off-by: Jonathan Tan 
---
 bloom-filter.c | 31 ++-
 bloom-filter.h | 12 
 commit-graph.c | 26 --
 revision.c |  9 +++--
 4 files changed, 37 insertions(+), 41 deletions(-)

diff --git a/bloom-filter.c b/bloom-filter.c
index 7dce0e35fa..39b453908f 100644
--- a/bloom-filter.c
+++ b/bloom-filter.c
@@ -1,14 +1,17 @@
 #include "cache.h"
 #include "bloom-filter.h"
 
-void bloom_filter_init(struct bloom_filter *bf, uint32_t bit_size)
+void bloom_filter_init(struct bloom_filter *bf, uint32_t commit_nr, uint32_t bit_size)
 {
if (bit_size % CHAR_BIT)
BUG("invalid size for bloom filter");
+   if (bit_size > 1024)
+   BUG("aborting: the bit size is per commit, not for the whole 
filter");
 
bf->nr_entries = 0;
+   bf->commit_nr = commit_nr;
bf->bit_size = bit_size;
-   bf->bits = xmalloc(bit_size / CHAR_BIT);
+   bf->bits = xcalloc(1, commit_nr * bit_size / CHAR_BIT);
 }
 
 void bloom_filter_free(struct bloom_filter *bf)
@@ -19,24 +22,24 @@ void bloom_filter_free(struct bloom_filter *bf)
 }
 
 
-void bloom_filter_set_bits(struct bloom_filter *bf, const uint32_t *offsets,
+static void bloom_filter_set_bits(struct bloom_filter *bf, uint32_t graph_pos, 
const uint32_t *offsets,
   int nr_offsets, int nr_entries)
 {
int i;
for (i = 0; i < nr_offsets; i++) {
-   uint32_t byte_offset = (offsets[i] % bf->bit_size) / CHAR_BIT;
+   uint32_t byte_offset = (offsets[i] % bf->bit_size + graph_pos * bf->bit_size) / CHAR_BIT;
unsigned char mask = 1 << offsets[i] % CHAR_BIT;
bf->bits[byte_offset] |= mask;
}
bf->nr_entries += nr_entries;
 }
 
-int bloom_filter_check_bits(struct bloom_filter *bf, const uint32_t *offsets,
+static int bloom_filter_check_bits(struct bloom_filter *bf, uint32_t graph_pos, const uint32_t *offsets,
int nr)
 {
int i;
for (i = 0; i < nr; i++) {
-   uint32_t byte_offset = (offsets[i] % bf->bit_size) / CHAR_BIT;
+   uint32_t byte_offset = (offsets[i] % bf->bit_size + graph_pos * bf->bit_size) / CHAR_BIT;
unsigned char mask = 1 << offsets[i] % CHAR_BIT;
if (!(bf->bits[byte_offset] & mask))
return 0;
@@ -45,19 +48,19 @@ int bloom_filter_check_bits(struct bloom_filter *bf, const uint32_t *offsets,
 }
 
 
-void bloom_filter_add_hash(struct bloom_filter *bf, const unsigned char *hash)
+void bloom_filter_add_hash(struct bloom_filter *bf, uint32_t graph_pos, const unsigned char *hash)
 {
uint32_t offsets[GIT_MAX_RAWSZ / sizeof(uint32_t)];
hashcpy((unsigned char*)offsets, hash);
-   bloom_filter_set_bits(bf, offsets,
+   bloom_filter_set_bits(bf, graph_pos, offsets,
 the_hash_algo->rawsz / sizeof(*offsets), 1);
 }
 
-int bloom_filter_check_hash(struct bloom_filter *bf, const unsigned char *hash)
+int bloom_filter_check_hash(struct bloom_filter *bf, uint32_t graph_pos, const unsigned char *hash)
 {
uint32_t offsets[GIT_MAX_RAWSZ / sizeof(uint32_t)];
hashcpy((unsigned char*)offsets, hash);
-   return bloom_filter_check_bits(bf, offsets,
+   return bloom_filter_check_bits(bf, graph_pos, offsets,
the_hash_algo->rawsz / sizeof(*offsets));
 }
 
@@ -80,11 +83,12 @@ int bloom_filter_load(struct bloom_filter *bf)
return -1;
 
read_in_full(fd, &bf->nr_entries, sizeof(bf->nr_entries));
+   read_in_full(fd, &bf->commit_nr, sizeof(bf->commit_nr));
read_in_full(fd, &bf->bit_size, sizeof(bf->bit_size));
if (bf->bit_size % CHAR_BIT)
BUG("invalid size for bloom filter");
-   bf->bits = xmalloc(bf->bit_size / CHAR_BIT);
-   read_in_full(fd, bf->bits, bf->bit_size / CHAR_BIT);
+   bf->bits = xmalloc(bf->commit_nr * bf->bit_size / CHAR_BIT);
+   read_in_full(fd, bf->bits, bf->commit_nr * bf->bit_size / CHAR_BIT);
 
close(fd);
 
@@ -96,8 +100,9 @@ void bloom_filter_write(struct bloom_filter *bf)
int fd = xopen(git_path_bloom(), O_WRONLY | O_CREAT | O_TRUNC, 0666);
 
write_in_full(fd, &bf->nr_entries, sizeof(bf->nr_entries));
+   write_in_full(fd, &bf->commit_nr, sizeof(bf->commit_nr));
write_in_full(fd, &bf->bit_size, sizeof(bf->bit_size));
-   write_in_full(fd, bf->bits, bf->bit_size / CHAR_BIT);
+   write_in_full(fd, bf->bits, bf->commit_nr * bf->bit_size / CHAR_BIT);
 
close(fd);
 }
diff --git a/bloom-filter.h b/bloom-filter.h
index 94d0af1708..607649b8db 100644
--- a/bloom-filter.h
+++ b/bloom-filter.h
@@ -5,30 +5,26 @@
 
 struct bloom_filter {
  

[PATCH 0/2] Per-commit filter proof of concept

2018-10-10 Thread Jonathan Tan
I did my own experiments on top of what Szeder provided - the first
patch is to have one fixed-size bloom filter per commit, and the second
patch makes that bloom filter apply to only the first parent of each
commit. The results are:

  Original (szeder's)
  $ GIT_USE_POC_BLOOM_FILTER=$((8*1024*1024*8)) time ./git commit-graph write
  0:10.28
  $ ls -l .git/objects/info/bloom
  8388616
  $ GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y time ./git -c \
core.commitgraph=true rev-list --count --full-history HEAD -- \
t/valgrind/valgrind.sh
  886
  bloom filter total queries: 66459 definitely not: 65276 maybe: 1183 false 
positives: 297 fp ratio: 0.004469
  0:00.24

  With patch 1
  $ GIT_USE_POC_BLOOM_FILTER=256 time ./git commit-graph write
  0:16.22
  $ ls -l .git/objects/info/bloom
  1832620
  $ GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y time ./git -c \
core.commitgraph=true rev-list --count --full-history HEAD -- \
t/valgrind/valgrind.sh
  886
  bloom filter total queries: 66459 definitely not: 46637 maybe: 19822 false 
positives: 18936 fp ratio: 0.284928
  0:01.53

  With patch 2
  $ GIT_USE_POC_BLOOM_FILTER=256 time ./git commit-graph write
  0:06.70
  $ ls -l .git/objects/info/bloom
  1832620
  $ GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y time ./git -c \
core.commitgraph=true rev-list --count --full-history HEAD -- \
t/valgrind/valgrind.sh
  886
  bloom filter total queries: 53096 definitely not: 52989 maybe: 107 false 
positives: 89 fp ratio: 0.001676
  0:01.29

For comparison, a non-GIT_USE_POC_BLOOM_FILTER rev-list takes 3.517
seconds.

I haven't investigated why patch 1 takes longer than the original to
create the bloom filter.

Using per-commit filters and restricting the bloom filter to a single
parent increases the relative power of the filter in omitting tree
inspections compared to the original (107/53096 vs 1183/66459), but the
lack of coverage w.r.t. the non-first parents had a more significant
effect than I thought (1.29s vs .24s). It might be best to have one
filter for each (commit, parent) pair (or, at least, the first two
parents of each commit - we probably don't need to care that much about
octopus merges) - this would take up more disk space than if we only
store filters for the first parent, but is still less than the original
example of storing information for all commits in one filter.
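
To put rough numbers on that option (my own arithmetic, using the
256-bit setting above and inferring the commit count from the
1832620-byte filter file):

  256 bits = 32 bytes per commit for one filter
  1832620 bytes / 32 bytes ~= 57000 commits in this repository
  2 filters per commit: 57000 * 64 bytes ~= 3.7 MB

which is still well under the 8 MB single-filter file from the
original experiment.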

There are more possibilities like dynamic filter sizing, different
hashing, and hashing to support wildcard matches, which I haven't looked
into.

Jonathan Tan (2):
  One filter per commit
  Only make bloom filter for first parent

 bloom-filter.c | 31 ++-
 bloom-filter.h | 12 
 commit-graph.c | 30 ++
 revision.c | 29 +++--
 4 files changed, 51 insertions(+), 51 deletions(-)

-- 
2.19.0.271.gfe8321ec05.dirty



Re: [PATCH] builtin/grep.c: remote superflous submodule code

2018-10-09 Thread Jonathan Tan
> It claimed that grep would still need some explicit handling, but that is
> not the call to repo_read_gitmodules (applying this patch on top of
> ff6f1f564c4 still keep the test suite happy, specifically
> t7814-grep-recurse-submodules, which contains a test
> "grep history with moved submoules")

Firstly, spelling of "remove" and "superfluous" in the commit title.

I don't think the "grep history with moved submodules" test exercises
much. That test only tests the superproject > submodule case, but we
need a superproject > submodule > sub-submodule case, because what is
being removed is a call to repo_read_gitmodules() on a repository
("struct repository submodule") that has a superproject ("struct
repository *superproject"). In other words, we need a submodule that has
its own gitmodules.

Alternatively, it would be fine if someone could point out where the
.gitmodules file is lazily loaded when grep_submodule() is invoked. I
couldn't find it, although I wasn't looking very hard. I did look at the
invocation of repo_submodule_init() (right before the removed lines),
which indeed calls repo_read_gitmodules() indirectly through
submodule_from_path(), but that is called on the superproject, whereas
what is being removed is a call on the submodule.

> The special handling is the call to gitmodules_config_oid which was added
> already in 74ed43711f (grep: enable recurse-submodules to work on
> <tree-ish> objects, 2016-12-16), but then was still named
> gitmodules_config_sha1.

If you're stating that gitmodules_config_oid() is where the .gitmodules
file is lazily loaded, it doesn't seem to be that way, because that
function works only on the_repository (at least on 'master' and 'next').

> This is a resend of origin/sb/grep-submodule-cleanup,
> and I think picking ff6f1f564c4 as the base for the series would
> also be appropriate.

Any particular reason why you suggest that commit (which is more than a
year old)? It seems that basing this on 'master' is fine.


[PATCH v2] cache-tree: skip some blob checks in partial clone

2018-10-09 Thread Jonathan Tan
In a partial clone, whenever a sparse checkout occurs, the existence of
all blobs in the index is verified, whether they are included or
excluded by the .git/info/sparse-checkout specification. This
significantly degrades performance because a lazy fetch occurs whenever
the existence of a missing blob is checked.

This is because cache_tree_update() checks the existence of all objects
in the index, whether or not CE_SKIP_WORKTREE is set on them. Teach
cache_tree_update() to skip checking CE_SKIP_WORKTREE objects when the
repository is a partial clone. This improves performance for sparse
checkout and also other operations that use cache_tree_update().

Instead of completely removing the check, an argument could be made that
the check should instead be replaced by a check that the blob is
promised, but for performance reasons, I decided not to do this.
If the user needs to verify the repository, it can be done using fsck
(which will notify if a tree points to a missing and non-promised blob,
whether the blob is included or excluded by the sparse-checkout
specification).

Signed-off-by: Jonathan Tan 
---
Changes from v1:

After feedback, I restricted this to partial clone. Once restricted, I
agree with Ben that this can be done for all users of
cache_tree_update(), not just unpack-trees, so I have removed the
ability to control the behavior using a flag.

I also took the opportunity to simplify the missing check by using a
variable.
---
 cache-tree.c |  6 +-
 t/t1090-sparse-checkout-scope.sh | 33 
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 5ce51468f0..f210481f9b 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -326,6 +326,7 @@ static int update_one(struct cache_tree *it,
unsigned mode;
int expected_missing = 0;
int contains_ita = 0;
+   int ce_missing_ok;
 
path = ce->name;
pathlen = ce_namelen(ce);
@@ -355,8 +356,11 @@ static int update_one(struct cache_tree *it,
i++;
}
 
+   ce_missing_ok = mode == S_IFGITLINK || missing_ok ||
+   (repository_format_partial_clone &&
+ce_skip_worktree(ce));
if (is_null_oid(oid) ||
-   (mode != S_IFGITLINK && !missing_ok && !has_object_file(oid))) {
+   (!ce_missing_ok && !has_object_file(oid))) {
strbuf_release(&buffer);
if (expected_missing)
return -1;
diff --git a/t/t1090-sparse-checkout-scope.sh b/t/t1090-sparse-checkout-scope.sh
index 25d7c700f6..090b7fc3d3 100755
--- a/t/t1090-sparse-checkout-scope.sh
+++ b/t/t1090-sparse-checkout-scope.sh
@@ -63,4 +63,37 @@ test_expect_success 'return to full checkout of master' '
test "$(cat b)" = "modified"
 '
 
+test_expect_success 'in partial clone, sparse checkout only fetches needed blobs' '
+   test_create_repo server &&
+   git clone "file://$(pwd)/server" client &&
+
+   test_config -C server uploadpack.allowfilter 1 &&
+   test_config -C server uploadpack.allowanysha1inwant 1 &&
+   echo a >server/a &&
+   echo bb >server/b &&
+   mkdir server/c &&
+   echo ccc >server/c/c &&
+   git -C server add a b c/c &&
+   git -C server commit -m message &&
+
+   test_config -C client core.sparsecheckout 1 &&
+   test_config -C client extensions.partialclone origin &&
+   echo "!/*" >client/.git/info/sparse-checkout &&
+   echo "/a" >>client/.git/info/sparse-checkout &&
+   git -C client fetch --filter=blob:none origin &&
+   git -C client checkout FETCH_HEAD &&
+
+   git -C client rev-list HEAD \
+   --quiet --objects --missing=print >unsorted_actual &&
+   (
+   printf "?" &&
+   git hash-object server/b &&
+   printf "?" &&
+   git hash-object server/c/c
+   ) >unsorted_expect &&
+   sort unsorted_actual >actual &&
+   sort unsorted_expect >expect &&
+   test_cmp expect actual
+'
+
 test_done
-- 
2.19.0.271.gfe8321ec05.dirty



[PATCH] unpack-trees: allow missing CE_SKIP_WORKTREE objs

2018-10-08 Thread Jonathan Tan
Whenever a sparse checkout occurs, the existence of all blobs in the
index is verified, whether they are included or excluded by the
.git/info/sparse-checkout specification. This degrades performance,
significantly in the case of a partial clone, because a lazy fetch
occurs whenever the existence of a missing blob is checked.

At the point of invoking cache_tree_update() in unpack_trees(),
CE_SKIP_WORKTREE is already set on all excluded blobs
(mark_new_skip_worktree() is called twice to set CE_NEW_SKIP_WORKTREE,
then apply_sparse_checkout() is called which copies over
CE_NEW_SKIP_WORKTREE to CE_SKIP_WORKTREE). cache_tree_update() can use
this information to know which blobs are excluded, and thus skip the
checking of these.

Because cache_tree_update() is used from multiple places, this new
behavior is guarded by a new flag WRITE_TREE_SKIP_WORKTREE_MISSING_OK.
Implement this new flag, and teach unpack_trees() to invoke
cache_tree_update() with this new flag.

Signed-off-by: Jonathan Tan 
---
 cache-tree.c |  6 +-
 cache-tree.h |  4 
 t/t1090-sparse-checkout-scope.sh | 33 
 unpack-trees.c   |  1 +
 4 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 5ce51468f0..340caf2d34 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -246,6 +246,8 @@ static int update_one(struct cache_tree *it,
int missing_ok = flags & WRITE_TREE_MISSING_OK;
int dryrun = flags & WRITE_TREE_DRY_RUN;
int repair = flags & WRITE_TREE_REPAIR;
+   int skip_worktree_missing_ok =
+   flags & WRITE_TREE_SKIP_WORKTREE_MISSING_OK;
int to_invalidate = 0;
int i;
 
@@ -356,7 +358,9 @@ static int update_one(struct cache_tree *it,
}
 
if (is_null_oid(oid) ||
-   (mode != S_IFGITLINK && !missing_ok && !has_object_file(oid))) {
+   (mode != S_IFGITLINK && !missing_ok &&
+!(skip_worktree_missing_ok && ce_skip_worktree(ce)) &&
+!has_object_file(oid))) {
strbuf_release(&buffer);
if (expected_missing)
return -1;
diff --git a/cache-tree.h b/cache-tree.h
index 0ab6784ffe..655d487619 100644
--- a/cache-tree.h
+++ b/cache-tree.h
@@ -40,6 +40,10 @@ void cache_tree_verify(struct index_state *);
 #define WRITE_TREE_DRY_RUN 4
 #define WRITE_TREE_SILENT 8
 #define WRITE_TREE_REPAIR 16
+/*
+ * Do not check for the presence of cache entry objects with CE_SKIP_WORKTREE.
+ */
+#define WRITE_TREE_SKIP_WORKTREE_MISSING_OK 32
 
 /* error return codes */
 #define WRITE_TREE_UNREADABLE_INDEX (-1)
diff --git a/t/t1090-sparse-checkout-scope.sh b/t/t1090-sparse-checkout-scope.sh
index 25d7c700f6..090b7fc3d3 100755
--- a/t/t1090-sparse-checkout-scope.sh
+++ b/t/t1090-sparse-checkout-scope.sh
@@ -63,4 +63,37 @@ test_expect_success 'return to full checkout of master' '
test "$(cat b)" = "modified"
 '
 
+test_expect_success 'in partial clone, sparse checkout only fetches needed blobs' '
+   test_create_repo server &&
+   git clone "file://$(pwd)/server" client &&
+
+   test_config -C server uploadpack.allowfilter 1 &&
+   test_config -C server uploadpack.allowanysha1inwant 1 &&
+   echo a >server/a &&
+   echo bb >server/b &&
+   mkdir server/c &&
+   echo ccc >server/c/c &&
+   git -C server add a b c/c &&
+   git -C server commit -m message &&
+
+   test_config -C client core.sparsecheckout 1 &&
+   test_config -C client extensions.partialclone origin &&
+   echo "!/*" >client/.git/info/sparse-checkout &&
+   echo "/a" >>client/.git/info/sparse-checkout &&
+   git -C client fetch --filter=blob:none origin &&
+   git -C client checkout FETCH_HEAD &&
+
+   git -C client rev-list HEAD \
+   --quiet --objects --missing=print >unsorted_actual &&
+   (
+   printf "?" &&
+   git hash-object server/b &&
+   printf "?" &&
+   git hash-object server/c/c
+   ) >unsorted_expect &&
+   sort unsorted_actual >actual &&
+   sort unsorted_expect >expect &&
+   test_cmp expect actual
+'
+
 test_done
diff --git a/unpack-trees.c b/unpack-trees.c
index 51bfac6aa0..39e0e7a6c7 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1635,6 +1635,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
o->result.cache_tree = cache_tree();
 

Re: [PATCH v3 2/5] fetch-pack: load tip_oids eagerly iff needed

2018-10-04 Thread Jonathan Tan
> Yes, I agree with you that the loops are still entwined. They're at
> least now in a single function, though, which IMHO is a slight
> improvement.

Hmm...originally, the main functionality was in a single loop in a
single function. But I say that because I consider the lazy loading in
tip_oids_contain() as something both peripheral and independent (if the
main loop's logic changes, the lazy loading most likely does not need to
be changed).

> I agree with you that just checking:
> 
>   if (oidset_count(&tip_oids) != 0)
> 
> would be fine, too.

OK, we're agreed on this :-)

> Or I am even OK with leaving the existing tablesize
> check. It is a little intimate with the implementation details, but I
> suspect that if oidset were to change (e.g., to initialize the buckets
> immediately), the problem would be pretty apparent in the tests.

I am OK with this too, except that (as far as I can tell) the point of
this patch set is to replace the internals of oidset, so we no longer
have the tablesize check. Unless you meant the khash analog of
tablesize? I would be OK if all tablesize references are replaced with
the khash analog in the same patch that the oidset internals are
replaced.


Re: [PATCH v3 2/5] fetch-pack: load tip_oids eagerly iff needed

2018-10-04 Thread Jonathan Tan
> Determine if the oidset needs to be populated upfront and then do that
> instead.  This duplicates a loop, but simplifies the existing one by
> separating concerns between the two.

[snip]

> + if (strict) {
> + for (i = 0; i < nr_sought; i++) {
> + ref = sought[i];
> + if (!is_unmatched_ref(ref))
> + continue;
> +
> + add_refs_to_oidset(&tip_oids, unmatched);
> + add_refs_to_oidset(&tip_oids, newlist);
> + break;
> + }
> + }
> +
>   /* Append unmatched requests to the list */
>   for (i = 0; i < nr_sought; i++) {
>   ref = sought[i];
>   if (!is_unmatched_ref(ref))
>   continue;
>  
> - if ((allow_unadvertised_object_request &
> -  (ALLOW_TIP_SHA1 | ALLOW_REACHABLE_SHA1)) ||
> - tip_oids_contain(&tip_oids, unmatched, newlist,
> -  &ref->old_oid)) {
> + if (!strict || oidset_contains(&tip_oids, &ref->old_oid)) {
>   ref->match_status = REF_MATCHED;
>   *newtail = copy_ref(ref);
>   newtail = &(*newtail)->next;

I don't think the concerns are truly separated - the first loop quoted
needs to know that in the second loop, tip_oids is accessed only if
there is at least one unmatched ref. Would it be better to expose the
size of the oidset and then use it in place of
"tip_oids->map.map.tablesize"? Checking for initialization (using
"tablesize") is conceptually different from checking the size, but in
this code, both initialization and the first increase in size occur upon
the first oidset_insert(), so we should still get the same result.
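
If we go that route, the accessor could be as simple as this (a sketch
against the khash-based oidset that this series introduces; the struct
field name is an assumption on my part):

  static inline unsigned int oidset_count(const struct oidset *set)
  {
          return kh_size(&set->set);
  }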

But if we do end up going with the approach in this patch, then this
patch is obviously correct.

I also looked at patch 1 of this patch set and it is obviously correct.
I didn't look at the other patches.


[PATCH v2 0/2] Lazy fetch bug fix (and feature that reveals it)

2018-10-03 Thread Jonathan Tan
> This seems to break 5520 and 5616 when merged to 'pu'.  
> 
> It seems that merging master to md/filter-trees and then applying
> this is sufficient to break 5616.

I verified that 5616 is broken on [master + md/filter-trees + my patch],
and after investigation, found a bug in the lazy fetch implementation.
Patch 1 contains a fix, and more information about that bug.

Patch 2 is the original patch, updated with the commit message I
suggested [1] in reply to Junio's comment.

I bundled these patches together because patch 2 (in combination with
certain other commits) reveals the bug that patch 1 fixes, and to make
it easier for others to verify that these patches together pass all
tests when rebased on [master + md/filter-trees] or 'pu' (at least, as
of the time of writing).

[1] 
https://public-inbox.org/git/20180927183718.89804-1-jonathanta...@google.com/

Jonathan Tan (2):
  fetch-pack: avoid object flags if no_dependents
  fetch-pack: exclude blobs when lazy-fetching trees

 fetch-pack.c | 115 +--
 fetch-pack.h |   7 +++
 t/t0410-partial-clone.sh |  41 ++
 3 files changed, 121 insertions(+), 42 deletions(-)

-- 
2.19.0.605.g01d371f741-goog



[PATCH v2 2/2] fetch-pack: exclude blobs when lazy-fetching trees

2018-10-03 Thread Jonathan Tan
A partial clone with missing trees can be obtained using "git clone
--filter=tree:none <repo>". In such a repository, when a tree needs to
be lazily fetched, any tree or blob it directly or indirectly references
is fetched as well, regardless of whether the original command required
those objects, or if the local repository already had some of them.

This is because the fetch protocol, which the lazy fetch uses, does not
allow clients to request that only the wanted objects be sent, which
would be the ideal solution. This patch implements a partial solution:
specify the "blob:none" filter, somewhat reducing the fetch payload.

This change has no effect when lazily fetching blobs (due to how filters
work). And if lazily fetching a commit (such repositories are difficult
to construct and not a use case we support very well, but it is
possible), referenced commits and trees are still fetched - only the
blobs are not fetched.

The necessary code change is done in fetch_pack() instead of somewhere
closer to where the "filter" instruction is written to the wire so that
only one part of the code needs to be changed in order for users of all
protocol versions to benefit from this optimization.

Signed-off-by: Jonathan Tan 
---
 fetch-pack.c | 14 ++
 fetch-pack.h |  7 +++
 t/t0410-partial-clone.sh | 41 
 3 files changed, 62 insertions(+)

diff --git a/fetch-pack.c b/fetch-pack.c
index b9b1211dda..5d82666a52 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1615,6 +1615,20 @@ struct ref *fetch_pack(struct fetch_pack_args *args,
if (nr_sought)
nr_sought = remove_duplicates_in_refs(sought, nr_sought);
 
+   if (args->no_dependents && !args->filter_options.choice) {
+   /*
+* The protocol does not support requesting that only the
+* wanted objects be sent, so approximate this by setting a
+* "blob:none" filter if no filter is already set. This works
+* for all object types: note that wanted blobs will still be
+* sent because they are directly specified as a "want".
+*
+* NEEDSWORK: Add an option in the protocol to request that
+* only the wanted objects be sent, and implement it.
+*/
+   parse_list_objects_filter(&args->filter_options, "blob:none");
+   }
+
if (!ref) {
packet_flush(fd[1]);
die(_("no matching remote head"));
diff --git a/fetch-pack.h b/fetch-pack.h
index 5b6e868802..43ec344d95 100644
--- a/fetch-pack.h
+++ b/fetch-pack.h
@@ -43,6 +43,13 @@ struct fetch_pack_args {
unsigned from_promisor:1;
 
/*
+* Attempt to fetch only the wanted objects, and not any objects
+* referred to by them. Due to protocol limitations, extraneous
+* objects may still be included. (When fetching non-blob
+* objects, only blobs are excluded; when fetching a blob, the
+* blob itself will still be sent. The client does not need to
+* know whether a wanted object is a blob or not.)
+*
 * If 1, fetch_pack() will also not modify any object flags.
 * This allows fetch_pack() to safely be called by any function,
 * regardless of which object flags it uses (if any).
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index cfd0655ea1..c521d7d6c6 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -182,6 +182,47 @@ test_expect_success 'fetching of missing objects works with ref-in-want enabled'
grep "git< fetch=.*ref-in-want" trace
 '
 
+test_expect_success 'fetching of missing blobs works' '
+   rm -rf server repo &&
+   test_create_repo server &&
+   test_commit -C server foo &&
+   git -C server repack -a -d --write-bitmap-index &&
+
+   git clone "file://$(pwd)/server" repo &&
+   git hash-object repo/foo.t >blobhash &&
+   rm -rf repo/.git/objects/* &&
+
+   git -C server config uploadpack.allowanysha1inwant 1 &&
+   git -C server config uploadpack.allowfilter 1 &&
+   git -C repo config core.repositoryformatversion 1 &&
+   git -C repo config extensions.partialclone "origin" &&
+
+   git -C repo cat-file -p $(cat blobhash)
+'
+
+test_expect_success 'fetching of missing trees does not fetch blobs' '
+   rm -rf server repo &&
+   test_create_repo server &&
+   test_commit -C server foo &&
+   git -C server repack -a -d --write-bitmap-index &&
+
+   git clone "file://$(pwd)/server" repo &&
+   git -C repo rev-parse foo^{tree} >treehash &&
+   git hash-

[PATCH v2 1/2] fetch-pack: avoid object flags if no_dependents

2018-10-03 Thread Jonathan Tan
When fetch_pack() is invoked as part of another Git command (due to a
lazy fetch from a partial clone, for example), it uses object flags that
may already be used by the outer Git command.

The commit that introduced the lazy fetch feature (88e2f9ed8e
("introduce fetch-object: fetch one promisor object", 2017-12-05)) tried
to avoid this overlap, but it did not avoid it totally. It was
successful in avoiding writing COMPLETE, but did not avoid reading
COMPLETE, and did not avoid writing and reading ALTERNATE.

Ensure that no flags are written or read by fetch_pack() in the case
where it is used to perform a lazy fetch. To do this, it is sufficient
to avoid checking completeness of wanted refs (unnecessary in the case
of lazy fetches), and to avoid negotiation-related work (in the current
implementation, no negotiation is performed anyway). After that was
done, the lack of overlap was verified by checking all direct and
indirect usages of COMPLETE and ALTERNATE - that they are read or
written only if no_dependents is false.

There are other possible solutions to this issue:

 (1) Split fetch-pack.{c,h} into a flag-using part and a non-flag-using
 part, and whenever no_dependents is set, only use the
 non-flag-using part.
 (2) Make fetch_pack() be able to be used with arbitrary repository
 objects. fetch_pack() should then create its own repository object
 based on the given repository object, with its own object
 hashtable, so that the flags do not conflict.

(1) is possible but invasive - some functions would need to be split;
and such invasiveness would potentially be unnecessary if we ever were
to need (2) anyway. (2) would be useful if we were to support, say,
submodules that were partial clones themselves, but I don't know when or
if the Git project plans to support those.

Signed-off-by: Jonathan Tan 
---
 fetch-pack.c | 101 ++-
 1 file changed, 59 insertions(+), 42 deletions(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index 75047a4b2a..b9b1211dda 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -253,8 +253,10 @@ static int find_common(struct fetch_negotiator *negotiator,
if (args->stateless_rpc && multi_ack == 1)
die(_("--stateless-rpc requires multi_ack_detailed"));
 
-   mark_tips(negotiator, args->negotiation_tips);
-   for_each_cached_alternate(negotiator, insert_one_alternate_object);
+   if (!args->no_dependents) {
+   mark_tips(negotiator, args->negotiation_tips);
+   for_each_cached_alternate(negotiator, insert_one_alternate_object);
+   }
 
fetching = 0;
for ( ; refs ; refs = refs->next) {
@@ -271,8 +273,12 @@ static int find_common(struct fetch_negotiator *negotiator,
 * We use lookup_object here because we are only
 * interested in the case we *know* the object is
 * reachable and we have already scanned it.
+*
+* Do this only if args->no_dependents is false (if it is true,
+* we cannot trust the object flags).
 */
-   if (((o = lookup_object(the_repository, remote->hash)) != NULL) &&
+   if (!args->no_dependents &&
+   ((o = lookup_object(the_repository, remote->hash)) != NULL) &&
(o->flags & COMPLETE)) {
continue;
}
@@ -710,31 +716,29 @@ static void mark_complete_and_common_ref(struct fetch_negotiator *negotiator,
 
oidset_clear(&loose_oid_set);
 
-   if (!args->no_dependents) {
-   if (!args->deepen) {
-   for_each_ref(mark_complete_oid, NULL);
-   for_each_cached_alternate(NULL, mark_alternate_complete);
-   commit_list_sort_by_date(&complete);
-   if (cutoff)
-   mark_recent_complete_commits(args, cutoff);
-   }
+   if (!args->deepen) {
+   for_each_ref(mark_complete_oid, NULL);
+   for_each_cached_alternate(NULL, mark_alternate_complete);
+   commit_list_sort_by_date(&complete);
+   if (cutoff)
+   mark_recent_complete_commits(args, cutoff);
+   }
 
-   /*
-* Mark all complete remote refs as common refs.
-* Don't mark them common yet; the server has to be told so first.
-*/
-   for (ref = *refs; ref; ref = ref->next) {
-   struct object *o = deref_tag(the_repository,
-   lookup_object(the_repository,
-   ref->old_oid.hash),
-   NULL, 0);
+   /*
+* Mark all complete 

Re: [RFC PATCH v2 4/4] fetch: do not list refs if fetching only hashes

2018-09-28 Thread Jonathan Tan
> > +   /*
> > +* We can avoid listing refs if all of them are exact
> > +* OIDs
> > +*/
> > +   must_list_refs = 0;
> > +   for (i = 0; i < rs->nr; i++) {
> > +   if (!rs->items[i].exact_sha1) {
> > +   must_list_refs = 1;
> > +   break;
> > +   }
> > +   }
> 
> This seems to be a repeat pattern. Is it worth it to encapsulate it
> as a function in transport or refs?
> 
>   int must_list_refs(struct ref **to_fetch)
>   {
> // body as the loop above
>   }

The repetition is unfortunate - I tried to think of a better way to do
it but couldn't. We can't do what you suggest because this one loops
over refspecs but the other one loops over refs.
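
To illustrate (a sketch only, reusing the field names that appear in the
patches in this thread, not code from the series): the two checks would
end up as two differently-typed helpers anyway - one walking refspec
items, the other walking the refs to be fetched.

    /* builtin/fetch.c checks the refspecs given on the command line... */
    static int all_refspec_items_exact(const struct refspec *rs)
    {
            int i;
            for (i = 0; i < rs->nr; i++)
                    if (!rs->items[i].exact_sha1)
                            return 0;
            return 1;
    }

    /* ...while transport.c checks the "struct ref *" array to be fetched. */
    static int all_refs_exact(struct ref **to_fetch, int nr_heads)
    {
            int i;
            for (i = 0; i < nr_heads; i++)
                    if (!to_fetch[i]->exact_oid)
                            return 0;
            return 1;
    }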


[PATCH] negotiator/skipping: parse commit before queueing

2018-09-27 Thread Jonathan Tan
The skipping negotiator pushes entries onto the priority queue without
ensuring that the commit is parsed, resulting in the entry ending up in
the wrong position due to a lack of commit time. Fix this by parsing the
commit whenever we push an entry onto the priority queue.

Signed-off-by: Jonathan Tan 
---
This was noticed by Aevar in [1]. With this fix, 163 "have" lines are
produced instead of the 14002 reported in [1].

I have included a test to demonstrate the issue, but I'm not sure if
it's worth including it in the source tree.

[1] https://public-inbox.org/git/87o9ciisg6@evledraar.gmail.com/
---
 negotiator/skipping.c|  2 +-
 t/t5552-skipping-fetch-negotiator.sh | 22 ++
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/negotiator/skipping.c b/negotiator/skipping.c
index dffbc76c49..9ce0555840 100644
--- a/negotiator/skipping.c
+++ b/negotiator/skipping.c
@@ -64,6 +64,7 @@ static struct entry *rev_list_push(struct data *data, struct commit *commit, int mark)
 
entry = xcalloc(1, sizeof(*entry));
entry->commit = commit;
+   parse_commit(commit);
+   prio_queue_put(&data->rev_list, entry);
 
if (!(mark & COMMON))
@@ -177,7 +178,6 @@ static const struct object_id *get_rev(struct data *data)
if (!(commit->object.flags & COMMON) && !entry->ttl)
to_send = commit;
 
-   parse_commit(commit);
for (p = commit->parents; p; p = p->next)
parent_pushed |= push_parent(data, entry, p->item);
 
diff --git a/t/t5552-skipping-fetch-negotiator.sh b/t/t5552-skipping-fetch-negotiator.sh
index 30857b84a8..f2f938e54e 100755
--- a/t/t5552-skipping-fetch-negotiator.sh
+++ b/t/t5552-skipping-fetch-negotiator.sh
@@ -83,6 +83,28 @@ test_expect_success 'unknown fetch.negotiationAlgorithm values error out' '
test_i18ngrep ! "unknown fetch negotiation algorithm" err
 '
 
+test_expect_success 'works in absence of tags' '
+   rm -rf server client trace &&
+   git init server &&
+   test_commit -C server to_fetch &&
+
+   git init client &&
+   for i in $(test_seq 11)
+   do
+   test_commit -C client c$i
+   done &&
+   git -C client tag middle c6 &&
+   for i in $(test_seq 11)
+   do
+   git -C client tag -d c$i
+   done &&
+
+   test_config -C client fetch.negotiationalgorithm skipping &&
+   trace_fetch client "$(pwd)/server" &&
+   have_sent HEAD HEAD~2 HEAD~5 HEAD~10 &&
+   have_not_sent HEAD~1 HEAD~3 HEAD~4 HEAD~6 HEAD~7 HEAD~8 HEAD~9
+'
+
 test_expect_success 'when two skips collide, favor the larger one' '
rm -rf server client trace &&
git init server &&
-- 
2.19.0.605.g01d371f741-goog



Re: [PATCH 0/2] negotiator: improve recent behavior + docs

2018-09-27 Thread Jonathan Tan
> I get:
> 
> warning: Ignoring --negotiation-tip because the protocol does not support it.

When I implemented --negotiation-tip, I only implemented it for
protocols that support connect or stateless-connect, because
implementing it fully would have required expanding the protocol helper
functionality. For reference, the commit is 3390e42adb ("fetch-pack:
support negotiation tip whitelist", 2018-07-03).

So HTTPS wouldn't work unless you were using protocol v2.

> So that seems like another bug, and as an aside, a "skipping"
> implementation that sends ~1/4 of the commits in the repo seems way less
> aggressive than it should be. I was expecting something that would
> gradually "ramp up" from the tips. Where say starting at master/next/pu
> we present every 100th commit as a "have" until the 1000th commit, then
> every 1000 commits until 10k and quickly after that step up the size
> rapidly.

I reproduced using your commands, and yes, there is a bug - I'll take a
look.


Re: [PATCH 0/2] negotiator: improve recent behavior + docs

2018-09-27 Thread Jonathan Tan
> > If you wanted to do this, it seems better to me to just declare a "null"
> > negotiation algorithm that does not perform any negotiation at all.
> 
> I think such an algorithm is a good idea in general, especially for
> testing, and yeah, maybe that's the best way out of this, i.e. to do:
> 
> if git rev-parse {}/HEAD 2>/dev/null
> then
> git fetch --negotiation-tip={}/HEAD {}
> else
> git -c fetch.negotiationAlgorithm=null fetch {}
> fi
> 
> Would such an algorithm be added by overriding default.c's add_tip
> function to never add anything by calling default_negotiator_init()
> followed by null_negotiator_init(), which would only override add_tip?
> (yay C OO)
> 
> If so from fetch-pack.c it looks like there may be the limitation on the
> interface that the negotiator can't exit early (in
> fetch-pack.c:mark_tips). But I've just skimmed this, so maybe I've
> missed something.

(I was reminded to reply to this offlist - sorry for the late reply.)

I think too many things need to be replaced (known_common, add_tip, and
ack all need to do nothing), so it's best to start from scratch. That
way, we also don't need to deal with the subtleties of C OO :-)

> Also, looks like because of the current interface =null and
> --negotiation-tip=* would (somewhat confusingly) do a "real" negotiation
> if done that way, since it'll bypass the API and insert tips for it to
> negotiate, but it looks like overriding next() will get around that.

If you do it as I suggest (in particular, add_tip doing nothing) then
there is the opposite problem that it won't be easy to inform the user
that --negotiation-tip does nothing in this case. Maybe there needs to
be an "accepts_tips" field in struct fetch_negotiator that, if false,
means that custom tips (or any tips) are not accepted, allowing the
caller of the negotiator to print a warning message in this case.
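
For concreteness, a from-scratch "null" negotiator might look like the
sketch below (hypothetical code, not part of any series; it assumes the
fetch_negotiator callback interface - known_common, add_tip, next, ack,
release - that default.c and skipping.c implement). The accepts_tips
bit suggested above would then be a new field in struct fetch_negotiator
next to these callbacks:

    /* negotiator/null.c (hypothetical sketch) */
    #include "cache.h"
    #include "../fetch-negotiator.h"

    static void known_common(struct fetch_negotiator *n, struct commit *c)
    {
            /* do nothing - claim no commits as common */
    }

    static void add_tip(struct fetch_negotiator *n, struct commit *c)
    {
            /* do nothing - tips are irrelevant if we never send "have" */
    }

    static const struct object_id *next(struct fetch_negotiator *n)
    {
            return NULL; /* emit no "have" lines at all */
    }

    static int ack(struct fetch_negotiator *n, struct commit *c)
    {
            return 0; /* we sent nothing, so nothing becomes common */
    }

    static void release(struct fetch_negotiator *n)
    {
            /* nothing to free */
    }

    void null_negotiator_init(struct fetch_negotiator *negotiator)
    {
            negotiator->known_common = known_common;
            negotiator->add_tip = add_tip;
            negotiator->next = next;
            negotiator->ack = ack;
            negotiator->release = release;
            negotiator->data = NULL;
    }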


[RFC PATCH v2 4/4] fetch: do not list refs if fetching only hashes

2018-09-27 Thread Jonathan Tan
If only hash literals are given on a "git fetch" command-line, tag
following is not requested, and the fetch is done using protocol v2, a
list of refs is not required from the remote. Therefore, optimize by
invoking transport_get_remote_refs() only if we need the refs.

Signed-off-by: Jonathan Tan 
---
 builtin/fetch.c | 32 ++--
 t/t5551-http-fetch-smart.sh | 15 +++
 t/t5702-protocol-v2.sh  | 13 +
 3 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 0696abfc2a..4c4f8fa194 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1175,6 +1175,7 @@ static int do_fetch(struct transport *transport,
int retcode = 0;
const struct ref *remote_refs;
struct argv_array ref_prefixes = ARGV_ARRAY_INIT;
+   int must_list_refs = 1;
 
if (tags == TAGS_DEFAULT) {
if (transport->remote->fetch_tags == 2)
@@ -1190,17 +1191,36 @@ static int do_fetch(struct transport *transport,
goto cleanup;
}
 
-   if (rs->nr)
+   if (rs->nr) {
+   int i;
+
refspec_ref_prefixes(rs, &ref_prefixes);
-   else if (transport->remote && transport->remote->fetch.nr)
+
+   /*
+* We can avoid listing refs if all of them are exact
+* OIDs
+*/
+   must_list_refs = 0;
+   for (i = 0; i < rs->nr; i++) {
+   if (!rs->items[i].exact_sha1) {
+   must_list_refs = 1;
+   break;
+   }
+   }
+   } else if (transport->remote && transport->remote->fetch.nr)
refspec_ref_prefixes(&transport->remote->fetch, &ref_prefixes);
 
-   if (ref_prefixes.argc &&
-   (tags == TAGS_SET || (tags == TAGS_DEFAULT))) {
-   argv_array_push(&ref_prefixes, "refs/tags/");
+   if (tags == TAGS_SET || tags == TAGS_DEFAULT) {
+   must_list_refs = 1;
+   if (ref_prefixes.argc)
+   argv_array_push(&ref_prefixes, "refs/tags/");
}
 
-   remote_refs = transport_get_remote_refs(transport, &ref_prefixes);
+   if (must_list_refs)
+   remote_refs = transport_get_remote_refs(transport, &ref_prefixes);
+   else
+   remote_refs = NULL;
+
argv_array_clear(&ref_prefixes);
 
ref_map = get_ref_map(transport->remote, remote_refs, rs,
diff --git a/t/t5551-http-fetch-smart.sh b/t/t5551-http-fetch-smart.sh
index 771f36f9ff..12b339d239 100755
--- a/t/t5551-http-fetch-smart.sh
+++ b/t/t5551-http-fetch-smart.sh
@@ -381,6 +381,21 @@ test_expect_success 'using fetch command in remote-curl updates refs' '
test_cmp expect actual
 '
 
+test_expect_success 'fetch by SHA-1 without tag following' '
+   SERVER="$HTTPD_DOCUMENT_ROOT_PATH/server" &&
+   rm -rf "$SERVER" client &&
+
+   git init "$SERVER" &&
+   test_commit -C "$SERVER" foo &&
+
+   git clone $HTTPD_URL/smart/server client &&
+
+   test_commit -C "$SERVER" bar &&
+   git -C "$SERVER" rev-parse bar >bar_hash &&
+   git -C client -c protocol.version=0 fetch \
+   --no-tags origin $(cat bar_hash)
+'
+
 test_expect_success 'GIT_REDACT_COOKIES redacts cookies' '
rm -rf clone &&
echo "Set-Cookie: Foo=1" >cookies &&
diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
index a316bb9bf4..1a97331648 100755
--- a/t/t5702-protocol-v2.sh
+++ b/t/t5702-protocol-v2.sh
@@ -79,6 +79,19 @@ test_expect_success 'fetch with git:// using protocol v2' '
grep "fetch< version 2" log
 '
 
+test_expect_success 'fetch by hash without tag following with protocol v2 does not list refs' '
+   test_when_finished "rm -f log" &&
+
+   test_commit -C "$daemon_parent" two_a &&
+   git -C "$daemon_parent" rev-parse two_a >two_a_hash &&
+
+   GIT_TRACE_PACKET="$(pwd)/log" git -C daemon_child -c protocol.version=2 \
+   fetch --no-tags origin $(cat two_a_hash) &&
+
+   grep "fetch< version 2" log &&
+   ! grep "fetch> command=ls-refs" log
+'
+
 test_expect_success 'pull with git:// using protocol v2' '
test_when_finished "rm -f log" &&
 
-- 
2.19.0.605.g01d371f741-goog



[RFC PATCH v2 3/4] transport: list refs before fetch if necessary

2018-09-27 Thread Jonathan Tan
The built-in bundle transport and the transport helper interface do not
work when transport_fetch_refs() is called immediately after transport
creation. This will be needed in a subsequent patch, so fix this.

Evidence: fetch_refs_from_bundle() relies on data->header being
initialized in get_refs_from_bundle(), and fetch() in transport-helper.c
relies on either data->fetch or data->import being set by get_helper(),
but neither transport_helper_init() nor fetch() calls get_helper().

Up until the introduction of the partial clone feature, this has not
been a problem, because transport_fetch_refs() is always called after
transport_get_remote_refs(). With the introduction of the partial clone
feature, which involves calling transport_fetch_refs() (to fetch objects
by their OIDs) without transport_get_remote_refs(), this is still not a
problem, but only coincidentally - we do not support partially cloning a
bundle, and as for cloning using a transport-helper-using protocol, it
so happens that before transport_fetch_refs() is called, fetch_refs() in
fetch-object.c calls transport_set_option(), which means that the
aforementioned get_helper() is invoked through set_helper_option() in
transport-helper.c.

This could be fixed by fixing the transports themselves, but it doesn't
seem like a good idea to me to open up previously untested code paths;
also, there may be transport helpers in the wild that assume that "list"
is always called before "fetch". Instead, fix this by having
transport_fetch_refs() call transport_get_remote_refs() to ensure that
the latter is always called at least once, unless the transport
explicitly states that it supports fetching without listing refs.

Signed-off-by: Jonathan Tan 
---
 transport-helper.c   |  1 +
 transport-internal.h |  6 ++
 transport.c  | 12 
 3 files changed, 19 insertions(+)

diff --git a/transport-helper.c b/transport-helper.c
index 143ca008c8..7213fa0d32 100644
--- a/transport-helper.c
+++ b/transport-helper.c
@@ -1105,6 +1105,7 @@ static struct ref *get_refs_list(struct transport *transport, int for_push,
 }
 
 static struct transport_vtable vtable = {
+   0,
set_helper_option,
get_refs_list,
fetch,
diff --git a/transport-internal.h b/transport-internal.h
index 1cde6258a7..004bee5e36 100644
--- a/transport-internal.h
+++ b/transport-internal.h
@@ -6,6 +6,12 @@ struct transport;
 struct argv_array;
 
 struct transport_vtable {
+   /**
+* This transport supports the fetch() function being called
+* without get_refs_list() first being called.
+*/
+   unsigned fetch_without_list : 1;
+
/**
 * Returns 0 if successful, positive if the option is not
 * recognized or is inapplicable, and negative if the option
diff --git a/transport.c b/transport.c
index 4329cca8e5..ea72fff6a6 100644
--- a/transport.c
+++ b/transport.c
@@ -733,6 +733,7 @@ static int disconnect_git(struct transport *transport)
 }
 
 static struct transport_vtable taken_over_vtable = {
+   1,
NULL,
get_refs_via_connect,
fetch_refs_via_pack,
@@ -882,6 +883,7 @@ void transport_check_allowed(const char *type)
 }
 
 static struct transport_vtable bundle_vtable = {
+   0,
NULL,
get_refs_from_bundle,
fetch_refs_from_bundle,
@@ -891,6 +893,7 @@ static struct transport_vtable bundle_vtable = {
 };
 
 static struct transport_vtable builtin_smart_vtable = {
+   1,
NULL,
get_refs_via_connect,
fetch_refs_via_pack,
@@ -1254,6 +1257,15 @@ int transport_fetch_refs(struct transport *transport, struct ref *refs)
struct ref **heads = NULL;
struct ref *rm;
 
+   if (!transport->vtable->fetch_without_list)
+   /*
+* Some transports (e.g. the built-in bundle transport and the
+* transport helper interface) do not work when fetching is
+* done immediately after transport creation. List the remote
+* refs anyway (if not already listed) as a workaround.
+*/
+   transport_get_remote_refs(transport, NULL);
+
for (rm = refs; rm; rm = rm->next) {
nr_refs++;
if (rm->peer_ref &&
-- 
2.19.0.605.g01d371f741-goog



[RFC PATCH v2 2/4] transport: do not list refs if possible

2018-09-27 Thread Jonathan Tan
When all refs to be fetched are exact OIDs, it is possible to perform a
fetch without requiring the remote to list refs if protocol v2 is used.
Teach Git to do this.

This currently has an effect only for lazy fetches done from partial
clones. The change necessary to likewise optimize "git fetch <remote>
<sha-1>" will be done in a subsequent patch.

Signed-off-by: Jonathan Tan 
---
 fetch-pack.c   |  2 +-
 t/t5702-protocol-v2.sh |  5 +
 transport.c| 13 +++--
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index 75047a4b2a..15652b4776 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1598,7 +1598,7 @@ struct ref *fetch_pack(struct fetch_pack_args *args,
if (nr_sought)
nr_sought = remove_duplicates_in_refs(sought, nr_sought);
 
-   if (!ref) {
+   if (version != protocol_v2 && !ref) {
packet_flush(fd[1]);
die(_("no matching remote head"));
}
diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
index 3beeed4546..a316bb9bf4 100755
--- a/t/t5702-protocol-v2.sh
+++ b/t/t5702-protocol-v2.sh
@@ -286,6 +286,11 @@ test_expect_success 'dynamically fetch missing object' '
grep "version 2" trace
 '
 
+test_expect_success 'when dynamically fetching missing object, do not list refs' '
+   cat trace &&
+   ! grep "git> command=ls-refs" trace
+'
+
 test_expect_success 'partial fetch' '
rm -rf client "$(pwd)/trace" &&
git init client &&
diff --git a/transport.c b/transport.c
index 5fb9ff6b56..4329cca8e5 100644
--- a/transport.c
+++ b/transport.c
@@ -341,8 +341,17 @@ static int fetch_refs_via_pack(struct transport *transport,
args.server_options = transport->server_options;
args.negotiation_tips = data->options.negotiation_tips;
 
-   if (!data->got_remote_heads)
-   refs_tmp = get_refs_via_connect(transport, 0, NULL);
+   if (!data->got_remote_heads) {
+   int i;
+   int must_list_refs = 0;
+   for (i = 0; i < nr_heads; i++) {
+   if (!to_fetch[i]->exact_oid) {
+   must_list_refs = 1;
+   break;
+   }
+   }
+   refs_tmp = handshake(transport, 0, NULL, must_list_refs);
+   }
 
switch (data->version) {
case protocol_v2:
-- 
2.19.0.605.g01d371f741-goog



[RFC PATCH v2 1/4] transport: allow skipping of ref listing

2018-09-27 Thread Jonathan Tan
The get_refs_via_connect() function both performs the handshake
(including determining the protocol version) and obtaining the list of
remote refs. However, the fetch protocol v2 supports fetching objects
without the listing of refs, so make it possible for the user to skip
the listing by creating a new handshake() function. This will be used in
a subsequent commit.
---
 transport.c | 29 +
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/transport.c b/transport.c
index 1c76d64aba..5fb9ff6b56 100644
--- a/transport.c
+++ b/transport.c
@@ -252,8 +252,18 @@ static int connect_setup(struct transport *transport, int 
for_push)
return 0;
 }
 
-static struct ref *get_refs_via_connect(struct transport *transport, int for_push,
-   const struct argv_array *ref_prefixes)
+/*
+ * Obtains the protocol version from the transport and writes it to
+ * transport->data->version, first connecting if not already connected.
+ *
+ * If the protocol version is one that allows skipping the listing of remote
+ * refs, and must_list_refs is 0, the listing of remote refs is skipped and
+ * this function returns NULL. Otherwise, this function returns the list of
+ * remote refs.
+ */
+static struct ref *handshake(struct transport *transport, int for_push,
+const struct argv_array *ref_prefixes,
+int must_list_refs)
 {
struct git_transport_data *data = transport->data;
struct ref *refs = NULL;
@@ -268,8 +278,10 @@ static struct ref *get_refs_via_connect(struct transport *transport, int for_push,
data->version = discover_version();
switch (data->version) {
case protocol_v2:
-   get_remote_refs(data->fd[1], &reader, &refs, for_push,
-   ref_prefixes, transport->server_options);
+   if (must_list_refs)
+   get_remote_refs(data->fd[1], &reader, &refs, for_push,
+   ref_prefixes,
+   transport->server_options);
break;
case protocol_v1:
case protocol_v0:
@@ -283,9 +295,18 @@ static struct ref *get_refs_via_connect(struct transport *transport, int for_push,
}
data->got_remote_heads = 1;
 
+   if (reader.line_peeked)
+   BUG("buffer must be empty at the end of handshake()");
+
return refs;
 }
 
+static struct ref *get_refs_via_connect(struct transport *transport, int for_push,
+   const struct argv_array *ref_prefixes)
+{
+   return handshake(transport, for_push, ref_prefixes, 1);
+}
+
 static int fetch_refs_via_pack(struct transport *transport,
   int nr_heads, struct ref **to_fetch)
 {
-- 
2.19.0.605.g01d371f741-goog



[RFC PATCH v2 0/4] Avoid ls-refs when possible in protocol v2

2018-09-27 Thread Jonathan Tan
To answer Junio's questions in [1], I think it's best to include the
full patch set that I'm developing, so here it is. The original patch is
now patch 3 of this set.

[1] https://public-inbox.org/git/xmqq8t3pnphe@gitster-ct.c.googlers.com/

Rearranging Junio's questions:

> ... ah, do you mean that this is not a new feature, but is a bugfix
> for some callers that are not calling get-remote-refs before calling
> fetch-refs, and the bit is to work around the fact that some
> transport not just can function without get-remote-refs first but do
> not want to call it?

Yes, it is the bugfix you describe, except that the bug coincidentally
does not cause any bad behavior. fetch-object.c indeed does not call
get-remote-refs before fetch-refs, but it calls transport_set_option(),
which so happens to do what we need (call set_helper_option()).

However, we need it now, because ...

> But this I do not quite understand.  It looks saying "when asked to
> fetch, if the transport does not allow us to do so without first
> getting the advertisement, lazily do that", and that may be a good
> thing to do, but then aren't the current set of callers already
> calling transport-get-remote-refs elsewhere before they call
> transport-fetch-refs?  IOW, I would have expected to see a matching
> removal, or at least a code that turns an unconditional call to
> get-remote-refs to a conditional one that is done only for the
> transport that lacks the capability, or something along that line.

... this "matching removal" you are talking about is in the subsequent
patch 4. And there is no transport_set_option() to save us this time, so
we really do need this bugfix.

> IOW, I am a bit confused by this comment (copied from an earlier part)
> 
> > +   /**
> > +* This transport supports the fetch() function being called
> > +* without get_refs_list() first being called.
> > +*/
> 
> Shouldn't it read more like "this transport does not want its
> get-refs-list called when fetch-refs is done"?
> 
> I dunno.

I'm not sure I understand - transports generally don't care if
get-refs-list is called after fetch-refs. Also, this already happens
when fetching with tag following from a server that does not support tag
following, using a transport that supports reuse.

Jonathan Tan (4):
  transport: allow skipping of ref listing
  transport: do not list refs if possible
  transport: list refs before fetch if necessary
  fetch: do not list refs if fetching only hashes

 builtin/fetch.c | 32 +-
 fetch-pack.c|  2 +-
 t/t5551-http-fetch-smart.sh | 15 +++
 t/t5702-protocol-v2.sh  | 18 +
 transport-helper.c  |  1 +
 transport-internal.h|  6 +
 transport.c | 54 -
 7 files changed, 115 insertions(+), 13 deletions(-)

-- 
2.19.0.605.g01d371f741-goog



Re: [PATCH] fetch-pack: approximate no_dependents with filter

2018-09-27 Thread Jonathan Tan
> It is very clear how you are churning the code, but it is utterly
> unclear from the description what you perceived as a problem and why
> this change is a good (if not the best) solution for that problem,
> at least to me.

Firstly, thanks for your comments and questions - it's sometimes hard
for me to think of the questions someone else would ask when reading one
of my patches. I have tried to rewrite the commit message (you can see
it at the end of this e-mail) following your questions.

The new paragraph 1 addresses what I perceive as a problem, and the new
paragraph 2 addresses the ideal and partial solution.

> After reading the above description, I cannot shake the feeling that
> this is tied too strongly to the tree:0 use case?  Does it help
> other use cases (e.g. would it be useful or harmful if a lazy clone
> was done to exclude blobs that are larger than certain threshold, or
> objects of all types that are not referenced by commits younger than
> certain threshold)?

Yes, it is solely for the tree:0 use case. But it doesn't hurt other use
cases, as I have explained in new paragraph 3.

I have retained old paragraph 3 as new paragraph 4, and removed old
paragraph 2 as it mostly duplicates the comments in the code. New commit
message follows:

[start commit message]

fetch-pack: exclude blobs when lazy-fetching trees

A partial clone with missing trees can be obtained using "git clone
--filter=tree:none <repo>". In such a repository, when a tree needs to
be lazily fetched, any tree or blob it directly or indirectly references
is fetched as well, regardless of whether the original command required
those objects, or if the local repository already had some of them.

This is because the fetch protocol, which the lazy fetch uses, does not
allow clients to request that only the wanted objects be sent, which
would be the ideal solution. This patch implements a partial solution:
specify the "blob:none" filter, somewhat reducing the fetch payload.

This change has no effect when lazily fetching blobs (due to how filters
work). And if lazily fetching a commit (such repositories are difficult
to construct and are not a use case we support very well, but it is
possible), referenced commits and trees are still fetched - only the
blobs are not fetched.

The necessary code change is done in fetch_pack() instead of somewhere
closer to where the "filter" instruction is written to the wire so that
only one part of the code needs to be changed in order for users of all
protocol versions to benefit from this optimization.

[end commit message]


[RFC PATCH] transport: list refs before fetch if necessary

2018-09-25 Thread Jonathan Tan
The built-in bundle transport and the transport helper interface do not
work when transport_fetch_refs() is called immediately after transport
creation.

Evidence: fetch_refs_from_bundle() relies on data->header being
initialized in get_refs_from_bundle(), and fetch() in transport-helper.c
relies on either data->fetch or data->import being set by get_helper(),
but neither transport_helper_init() nor fetch() calls get_helper().

Up until the introduction of the partial clone feature, this has not
been a problem, because transport_fetch_refs() is always called after
transport_get_remote_refs(). With the introduction of the partial clone
feature, which involves calling transport_fetch_refs() (to fetch objects
by their OIDs) without transport_get_remote_refs(), this is still not a
problem, but only coincidentally - we do not support partially cloning a
bundle, and as for cloning using a transport-helper-using protocol, it
so happens that before transport_fetch_refs() is called, fetch_refs() in
fetch-object.c calls transport_set_option(), which means that the
aforementioned get_helper() is invoked through set_helper_option() in
transport-helper.c. In the future, though, there may be other use cases
in which we want to fetch without requiring listing of remote refs, so
this is still worth fixing.

This could be fixed by fixing the transports themselves, but it doesn't
seem like a good idea to me to open up previously untested code paths;
also, there may be transport helpers in the wild that assume that "list"
is always called before "fetch". Instead, fix this by having
transport_fetch_refs() call transport_get_remote_refs() to ensure that
the latter is always called at least once, unless the transport
explicitly states that it supports fetching without listing refs.

Signed-off-by: Jonathan Tan 
---
I discovered this while investigating the possibility of taking
advantage of the fact that protocol v2 allows us to fetch without first
invoking ls-refs. This is useful both when lazily fetching to a partial
clone, and when invoking "git fetch --no-tags <remote> <sha-1>" (note
that tag following must be disabled).

Any comments on this (for or against) is appreciated, and suggestions of
better approaches are appreciated too.
---
 transport-helper.c   |  1 +
 transport-internal.h |  6 ++
 transport.c  | 12 
 3 files changed, 19 insertions(+)

diff --git a/transport-helper.c b/transport-helper.c
index 143ca008c8..7213fa0d32 100644
--- a/transport-helper.c
+++ b/transport-helper.c
@@ -1105,6 +1105,7 @@ static struct ref *get_refs_list(struct transport *transport, int for_push,
 }
 
 static struct transport_vtable vtable = {
+   0,
set_helper_option,
get_refs_list,
fetch,
diff --git a/transport-internal.h b/transport-internal.h
index 1cde6258a7..004bee5e36 100644
--- a/transport-internal.h
+++ b/transport-internal.h
@@ -6,6 +6,12 @@ struct transport;
 struct argv_array;
 
 struct transport_vtable {
+   /**
+* This transport supports the fetch() function being called
+* without get_refs_list() first being called.
+*/
+   unsigned fetch_without_list : 1;
+
/**
 * Returns 0 if successful, positive if the option is not
 * recognized or is inapplicable, and negative if the option
diff --git a/transport.c b/transport.c
index 1c76d64aba..ee8a78ff37 100644
--- a/transport.c
+++ b/transport.c
@@ -703,6 +703,7 @@ static int disconnect_git(struct transport *transport)
 }
 
 static struct transport_vtable taken_over_vtable = {
+   1,
NULL,
get_refs_via_connect,
fetch_refs_via_pack,
@@ -852,6 +853,7 @@ void transport_check_allowed(const char *type)
 }
 
 static struct transport_vtable bundle_vtable = {
+   0,
NULL,
get_refs_from_bundle,
fetch_refs_from_bundle,
@@ -861,6 +863,7 @@ static struct transport_vtable bundle_vtable = {
 };
 
 static struct transport_vtable builtin_smart_vtable = {
+   1,
NULL,
get_refs_via_connect,
fetch_refs_via_pack,
@@ -1224,6 +1227,15 @@ int transport_fetch_refs(struct transport *transport, struct ref *refs)
struct ref **heads = NULL;
struct ref *rm;
 
+   if (!transport->vtable->fetch_without_list)
+   /*
+* Some transports (e.g. the built-in bundle transport and the
+* transport helper interface) do not work when fetching is
+* done immediately after transport creation. List the remote
+* refs anyway (if not already listed) as a workaround.
+*/
+   transport_get_remote_refs(transport, NULL);
+
for (rm = refs; rm; rm = rm->next) {
nr_refs++;
if (rm->peer_ref &&
-- 
2.19.0.605.g01d371f741-goog



[PATCH] fetch-pack: approximate no_dependents with filter

2018-09-24 Thread Jonathan Tan
Whenever a lazy fetch is performed for a tree object, any trees and
blobs it directly or indirectly references will be fetched as well.
There is a "no_dependents" argument in struct fetch_pack_args that
indicates that objects that the wanted object references need not be
sent, but it currently has no effect other than to inhibit usage of
object flags.

Extend the "no_dependents" argument to also exclude sending of objects
as much as the current protocol allows: when fetching a tree, all trees
it references will be sent (but not the blobs), and when fetching a
blob, it will still be sent. (If this mechanism is used to fetch a
commit or any other non-blob object, all referenced objects, except
blobs, will be sent.) The client neither needs to know nor specify the
type of each object it wants.

The necessary code change is done in fetch_pack() instead of somewhere
closer to where the "filter" instruction is written to the wire so that
only one part of the code needs to be changed in order for users of all
protocol versions to benefit from this optimization.

Signed-off-by: Jonathan Tan 
---
This was prompted by a user at $DAY_JOB who had a partial clone
excluding trees, and had a workflow that only required tree objects (and
not blobs).

This will hopefully make partial clones excluding trees (with the
"tree:0" filter) a bit better, in that if an operation requires only
trees to be inspected, the required download is much smaller.
---
 fetch-pack.c | 14 ++
 fetch-pack.h |  7 +++
 t/t0410-partial-clone.sh | 41 
 3 files changed, 62 insertions(+)

diff --git a/fetch-pack.c b/fetch-pack.c
index 88a078e9b..c25b0f54c 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1598,6 +1598,20 @@ struct ref *fetch_pack(struct fetch_pack_args *args,
if (nr_sought)
nr_sought = remove_duplicates_in_refs(sought, nr_sought);
 
+   if (args->no_dependents && !args->filter_options.choice) {
+   /*
+* The protocol does not support requesting that only the
+* wanted objects be sent, so approximate this by setting a
+* "blob:none" filter if no filter is already set. This works
+* for all object types: note that wanted blobs will still be
+* sent because they are directly specified as a "want".
+*
+* NEEDSWORK: Add an option in the protocol to request that
+* only the wanted objects be sent, and implement it.
+*/
+   parse_list_objects_filter(&args->filter_options, "blob:none");
+   }
+
if (!ref) {
packet_flush(fd[1]);
die(_("no matching remote head"));
diff --git a/fetch-pack.h b/fetch-pack.h
index 5b6e86880..43ec344d9 100644
--- a/fetch-pack.h
+++ b/fetch-pack.h
@@ -43,6 +43,13 @@ struct fetch_pack_args {
unsigned from_promisor:1;
 
/*
+* Attempt to fetch only the wanted objects, and not any objects
+* referred to by them. Due to protocol limitations, extraneous
+* objects may still be included. (When fetching non-blob
+* objects, only blobs are excluded; when fetching a blob, the
+* blob itself will still be sent. The client does not need to
+* know whether a wanted object is a blob or not.)
+*
 * If 1, fetch_pack() will also not modify any object flags.
 * This allows fetch_pack() to safely be called by any function,
 * regardless of which object flags it uses (if any).
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 128130066..08a0c3651 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -170,6 +170,47 @@ test_expect_success 'fetching of missing objects' '
git verify-pack --verbose "$IDX" | grep "$HASH"
 '
 
+test_expect_success 'fetching of missing blobs works' '
+   rm -rf server repo &&
+   test_create_repo server &&
+   test_commit -C server foo &&
+   git -C server repack -a -d --write-bitmap-index &&
+
+   git clone "file://$(pwd)/server" repo &&
+   git hash-object repo/foo.t >blobhash &&
+   rm -rf repo/.git/objects/* &&
+
+   git -C server config uploadpack.allowanysha1inwant 1 &&
+   git -C server config uploadpack.allowfilter 1 &&
+   git -C repo config core.repositoryformatversion 1 &&
+   git -C repo config extensions.partialclone "origin" &&
+
+   git -C repo cat-file -p $(cat blobhash)
+'
+
+test_expect_success 'fetching of missing trees does not fetch blobs' '
+   rm -rf server repo &&
+   test_create_repo server &&
+   test_commit -C server foo &&

[PATCH v2 0/2] Check presence of targets when fetching to partial clone

2018-09-21 Thread Jonathan Tan
New in v2:
 - added patch to clarify in documentation what check_connected() does
 - renamed quickfetch() to check_exist_and_connected() to better reflect
   what it does
 - also updated its documentation; I avoided usage of "wanted objects"
   and used "fetch targets" instead to clarify that I'm only talking
   about the direct targets, not all objects referenced by those targets
 - in check_exist_and_connected() (previously quickfetch()), check
   existence directly regardless of whether the repository is a partial
   clone or not

This should resolve most of Junio's comments in [1], except that I chose
not to modify or rename check_connected(). In this current world, it is
true that we usually require existence of ref targets, but that might
not be so in the future (having said that, I don't know of any schedules
for this future). Also, check_connected() is used in a few places, so
such a change would cause some churn in the codebase. So I left this
function alone.

[1] https://public-inbox.org/git/xmqqy3bvycie@gitster-ct.c.googlers.com/

Jonathan Tan (2):
  connected: document connectivity in partial clones
  fetch: in partial clone, check presence of targets

 builtin/fetch.c  | 15 +--
 connected.h  |  6 +++---
 t/t5616-partial-clone.sh | 17 +
 3 files changed, 33 insertions(+), 5 deletions(-)

-- 
2.19.0.444.g18242da7ef-goog



[PATCH v2 2/2] fetch: in partial clone, check presence of targets

2018-09-21 Thread Jonathan Tan
When fetching an object that is known as a promisor object to the local
repository, the connectivity check in quickfetch() in builtin/fetch.c
succeeds, causing object transfer to be bypassed. However, this should
not happen if that object is merely promised and not actually present.

Because this happens, when a user invokes "git fetch origin <sha-1>" on
the command-line, the <sha-1> object may not actually be fetched even
though the command returns an exit code of 0. This is a similar issue
(but with a different cause) to the one fixed by a0c9016abd
("upload-pack: send refs' objects despite "filter"", 2018-07-09).

Therefore, update quickfetch() to also directly check for the presence
of all objects to be fetched. Its documentation and name are also
updated to better reflect what it does.

Signed-off-by: Jonathan Tan 
---
 builtin/fetch.c  | 15 +--
 t/t5616-partial-clone.sh | 17 +
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 61bec5d21..b9e74c129 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -924,10 +924,11 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
  * everything we are going to fetch already exists and is connected
  * locally.
  */
-static int quickfetch(struct ref *ref_map)
+static int check_exist_and_connected(struct ref *ref_map)
 {
struct ref *rm = ref_map;
struct check_connected_options opt = CHECK_CONNECTED_INIT;
+   struct ref *r;
 
/*
 * If we are deepening a shallow clone we already have these
@@ -938,13 +939,23 @@ static int quickfetch(struct ref *ref_map)
 */
if (deepen)
return -1;
+
+   /*
+* check_connected() allows objects to merely be promised, but
+* we need all direct targets to exist.
+*/
+   for (r = rm; r; r = r->next) {
+   if (!has_object_file(&r->old_oid))
+   return -1;
+   }
+
opt.quiet = 1;
return check_connected(iterate_ref_map, &rm, &opt);
 }
 
 static int fetch_refs(struct transport *transport, struct ref *ref_map)
 {
-   int ret = quickfetch(ref_map);
+   int ret = check_exist_and_connected(ref_map);
if (ret)
ret = transport_fetch_refs(transport, ref_map);
if (!ret)
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index bbbe7537d..359d27d02 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -170,6 +170,23 @@ test_expect_success 'partial clone fetches blobs pointed to by refs even if norm
git -C dst fsck
 '
 
+test_expect_success 'fetch what is specified on CLI even if already promised' '
+   rm -rf src dst.git &&
+   git init src &&
+   test_commit -C src foo &&
+   test_config -C src uploadpack.allowfilter 1 &&
+   test_config -C src uploadpack.allowanysha1inwant 1 &&
+
+   git hash-object --stdin <src/foo.t >blob &&
+
+   git clone --bare --filter=blob:none "file://$(pwd)/src" dst.git &&
+   git -C dst.git rev-list --objects --quiet --missing=print HEAD >missing_before &&
+   grep "?$(cat blob)" missing_before &&
+   git -C dst.git fetch origin $(cat blob) &&
+   git -C dst.git rev-list --objects --quiet --missing=print HEAD >missing_after &&
+   ! grep "?$(cat blob)" missing_after
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
-- 
2.19.0.444.g18242da7ef-goog



[PATCH v2 1/2] connected: document connectivity in partial clones

2018-09-21 Thread Jonathan Tan
In acb0c57260 ("fetch: support filters", 2017-12-08), check_connected()
was extended to allow objects to either be promised to be available (if
the repository is a partial clone) or to be present; previously, this
function required the latter. However, this change was not reflected in
the documentation of that function. Update the documentation
accordingly.

Signed-off-by: Jonathan Tan 
---
 connected.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/connected.h b/connected.h
index e4c961817..8d5a6b3ad 100644
--- a/connected.h
+++ b/connected.h
@@ -51,9 +51,9 @@ struct check_connected_options {
 #define CHECK_CONNECTED_INIT { 0 }
 
 /*
- * Make sure that our object store has all the commits necessary to
- * connect the ancestry chain to some of our existing refs, and all
- * the trees and blobs that these commits use.
+ * Make sure that all given objects and all objects reachable from them
+ * either exist in our object store or (if the repository is a partial
+ * clone) are promised to be available.
  *
  * Return 0 if Ok, non zero otherwise (i.e. some missing objects)
  *
-- 
2.19.0.444.g18242da7ef-goog



Re: [PATCH] fetch: in partial clone, check presence of targets

2018-09-20 Thread Jonathan Tan
> Jonathan Tan  writes:
> 
> > +   if (repository_format_partial_clone) {
> > +   /*
> > +* For the purposes of the connectivity check,
> > +* check_connected() considers all objects promised by
> > +* promisor objects as existing, which means that the
> > +* connectivity check would pass even if a target object
> > +* in rm is merely promised and not present. When
> > +* fetching objects, we need all of them to be present
> > +* (in addition to having correct connectivity), so
> > +* check them directly.
> > +*/
> > +   struct ref *r;
> > +   for (r = rm; r; r = r->next) {
> > +   if (!has_object_file(&r->old_oid))
> > +   return -1;
> > +   }
> 
> Because check_connected() lies in the presence of "promisor", we can
> defeat it this way, which makes sense.  
> 
> I wonder if it makes sense to do this check unconditionally, even
> when we are in a fully functioning clone.  The check is quite cheap
> and can reject a ref_map that has an object we do not know about,
> without check_connected() having to ask the rev-list.

It makes sense to me from a runtime point of view - the usual case, for
me, is when we're missing at least one object that we're fetching, and
doing the cheap check even allows us to skip the expensive check.

The hard part for me lies in how to communicate to future readers of the
code that they cannot remove this section to simplify the code. We would
need a more complicated comment, something like this:

 /*
  * Check if all wanted objects are present.
  *
  * If the local repository is a partial clone, check_connected() is not
  * sufficient - it would pass even if a wanted object is merely
  * promised and not present. This is because, for the purposes of the
  * connectivity check, check_connected() considers all objects promised
  * by promisor objects as existing.
  *
  * If the local repository is not a partial clone, check_connected() is
  * sufficient, but this loop allows us to avoid the expensive
  * check_connected() in the usual case that at least one wanted object
  * is missing.
  */


[PATCH] fetch: in partial clone, check presence of targets

2018-09-20 Thread Jonathan Tan
When fetching an object that is known as a promisor object to the local
repository, the connectivity check in quickfetch() in builtin/fetch.c
succeeds, causing object transfer to be bypassed. However, this should
not happen if that object is merely promised and not actually present.

Because this happens, when a user invokes "git fetch origin <sha-1>" on
the command-line, the <sha-1> object may not actually be fetched even
though the command returns an exit code of 0. This is a similar issue
(but with a different cause) to the one fixed by a0c9016abd
("upload-pack: send refs' objects despite "filter"", 2018-07-09).

Therefore, update quickfetch() to also directly check for the presence
of all objects to be fetched.

Signed-off-by: Jonathan Tan 
---
In this patch, I have striven to avoid piping from Git commands that may
fail, following the guidelines in [1].

[1] https://public-inbox.org/git/c625bfe2205d51b3158ef71e4bf472708642c146.1537223021.git.matv...@google.com/
---
 builtin/fetch.c  | 19 +++
 t/t5616-partial-clone.sh | 17 +
 2 files changed, 36 insertions(+)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 61bec5d21..e9640fe5a 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -938,6 +938,25 @@ static int quickfetch(struct ref *ref_map)
 */
if (deepen)
return -1;
+
+   if (repository_format_partial_clone) {
+   /*
+* For the purposes of the connectivity check,
+* check_connected() considers all objects promised by
+* promisor objects as existing, which means that the
+* connectivity check would pass even if a target object
+* in rm is merely promised and not present. When
+* fetching objects, we need all of them to be present
+* (in addition to having correct connectivity), so
+* check them directly.
+*/
+   struct ref *r;
+   for (r = rm; r; r = r->next) {
+   if (!has_object_file(&r->old_oid))
+   return -1;
+   }
+   }
+
opt.quiet = 1;
return check_connected(iterate_ref_map, &rm, &opt);
 }
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index bbbe7537d..359d27d02 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -170,6 +170,23 @@ test_expect_success 'partial clone fetches blobs pointed to by refs even if norm
git -C dst fsck
 '
 
+test_expect_success 'fetch what is specified on CLI even if already promised' '
+   rm -rf src dst.git &&
+   git init src &&
+   test_commit -C src foo &&
+   test_config -C src uploadpack.allowfilter 1 &&
+   test_config -C src uploadpack.allowanysha1inwant 1 &&
+
+   git hash-object --stdin <src/foo.t >blob &&
+
+   git clone --bare --filter=blob:none "file://$(pwd)/src" dst.git &&
+   git -C dst.git rev-list --objects --quiet --missing=print HEAD >missing_before &&
+   grep "?$(cat blob)" missing_before &&
+   git -C dst.git fetch origin $(cat blob) &&
+   git -C dst.git rev-list --objects --quiet --missing=print HEAD >missing_after &&
+   ! grep "?$(cat blob)" missing_after
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
-- 
2.19.0.444.g18242da7ef-goog



Re: [PATCH 1/2] fetch-object: provide only one fetching function

2018-09-13 Thread Jonathan Tan
> Instead of explaining why the new convention is better to justify
> (2), the above three lines handwave by saying "more flexible"
> twice.  We should do better.
> 
>   fetch-object: unify fetch_object[s] functions
> 
>   There are fetch_object() and fetch_objects() helpers in
>   fetch-object.h; as the latter takes "struct oid_array",
>   the former cannot be made into a thin wrapper around the
>   latter without an extra allocation and set-up cost.
> 
>   Update fetch_objects() to take an array of "struct
>   object_id" and number of elements in it as separate
>   parameters, remove fetch_object(), and adjust all existing
>   callers of these functions to use the new fetch_objects().
> 
> perhaps?

Thanks - your explanation is much clearer than mine. Let me know if you
want a reroll (or if you can update the commit message yourself, that's
fine too).


[PATCH 0/2] Bugfix for partial clones and ref-in-want servers

2018-09-12 Thread Jonathan Tan
A coworker discovered a bug when having a partial clone against a server
that supports ref-in-want - the lazy fetches no longer work, because the
client attempts to fetch with "want-ref <oid>", which is not permitted
by the protocol.

The first patch deduplicates some code, so that the bugfix need only be
applied once, and the second patch contains the actual bugfix.

Jonathan Tan (2):
  fetch-object: provide only one fetching function
  fetch-object: set exact_oid when fetching

 fetch-object.c   | 17 ++---
 fetch-object.h   |  8 ++--
 sha1-file.c  |  2 +-
 t/t0410-partial-clone.sh | 12 
 unpack-trees.c   |  2 +-
 5 files changed, 22 insertions(+), 19 deletions(-)

-- 
2.19.0.397.gdd90340f6a-goog



[PATCH 2/2] fetch-object: set exact_oid when fetching

2018-09-12 Thread Jonathan Tan
fetch_objects() currently does not set exact_oid in struct ref when
invoking transport_fetch_refs(). If the server supports ref-in-want,
fetch_pack() uses this field to determine whether a wanted ref should be
requested as a "want-ref" line or a "want" line; without the setting of
exact_oid, the wrong line will be sent.

Set exact_oid, so that the correct line is sent.

Signed-off-by: Jonathan Tan 
---
 fetch-object.c   |  1 +
 t/t0410-partial-clone.sh | 12 
 2 files changed, 13 insertions(+)

diff --git a/fetch-object.c b/fetch-object.c
index 1af1bf857..426654880 100644
--- a/fetch-object.c
+++ b/fetch-object.c
@@ -32,6 +32,7 @@ void fetch_objects(const char *remote_name, const struct object_id *oids,
for (i = 0; i < oid_nr; i++) {
struct ref *new_ref = alloc_ref(oid_to_hex(&oids[i]));
oidcpy(&new_ref->old_oid, &oids[i]);
+   new_ref->exact_oid = 1;
new_ref->next = ref;
ref = new_ref;
}
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 128130066..0ab02c337 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -170,6 +170,18 @@ test_expect_success 'fetching of missing objects' '
git verify-pack --verbose "$IDX" | grep "$HASH"
 '
 
+test_expect_success 'fetching of missing objects works with ref-in-want enabled' '
+   # ref-in-want requires protocol version 2
+   git -C server config protocol.version 2 &&
+   git -C server config uploadpack.allowrefinwant 1 &&
+   git -C repo config protocol.version 2 &&
+
+   rm -rf repo/.git/objects/* &&
+   rm -f trace &&
+   GIT_TRACE_PACKET="$(pwd)/trace" git -C repo cat-file -p "$HASH" &&
+   grep "git< fetch=.*ref-in-want" trace
+'
+
 test_expect_success 'rev-list stops traversal at missing and promised commit' '
rm -rf repo &&
test_create_repo repo &&
-- 
2.19.0.397.gdd90340f6a-goog



[PATCH 1/2] fetch-object: provide only one fetching function

2018-09-12 Thread Jonathan Tan
fetch-object.h currently provides two functions (fetch_object() and
fetch_objects()) that could be replaced by a single, more flexible
function. Replace those two functions with the more flexible function.

Signed-off-by: Jonathan Tan 
---
 fetch-object.c | 16 +---
 fetch-object.h |  8 ++--
 sha1-file.c|  2 +-
 unpack-trees.c |  2 +-
 4 files changed, 9 insertions(+), 19 deletions(-)

diff --git a/fetch-object.c b/fetch-object.c
index 853624f81..1af1bf857 100644
--- a/fetch-object.c
+++ b/fetch-object.c
@@ -23,21 +23,15 @@ static void fetch_refs(const char *remote_name, struct ref *ref)
fetch_if_missing = original_fetch_if_missing;
 }
 
-void fetch_object(const char *remote_name, const unsigned char *sha1)
-{
-   struct ref *ref = alloc_ref(sha1_to_hex(sha1));
-   hashcpy(ref->old_oid.hash, sha1);
-   fetch_refs(remote_name, ref);
-}
-
-void fetch_objects(const char *remote_name, const struct oid_array *to_fetch)
+void fetch_objects(const char *remote_name, const struct object_id *oids,
+  int oid_nr)
 {
struct ref *ref = NULL;
int i;
 
-   for (i = 0; i < to_fetch->nr; i++) {
-   struct ref *new_ref = alloc_ref(oid_to_hex(&to_fetch->oid[i]));
-   oidcpy(&new_ref->old_oid, &to_fetch->oid[i]);
+   for (i = 0; i < oid_nr; i++) {
+   struct ref *new_ref = alloc_ref(oid_to_hex(&oids[i]));
+   oidcpy(&new_ref->old_oid, &oids[i]);
new_ref->next = ref;
ref = new_ref;
}
diff --git a/fetch-object.h b/fetch-object.h
index 4b269d07e..d2f996d4e 100644
--- a/fetch-object.h
+++ b/fetch-object.h
@@ -1,11 +1,7 @@
 #ifndef FETCH_OBJECT_H
 #define FETCH_OBJECT_H
 
-#include "sha1-array.h"
-
-extern void fetch_object(const char *remote_name, const unsigned char *sha1);
-
-extern void fetch_objects(const char *remote_name,
- const struct oid_array *to_fetch);
+void fetch_objects(const char *remote_name, const struct object_id *oids,
+  int oid_nr);
 
 #endif
diff --git a/sha1-file.c b/sha1-file.c
index 97b742384..2edf4564f 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -1317,7 +1317,7 @@ int oid_object_info_extended(struct repository *r, const struct object_id *oid,
 * TODO Pass a repository struct through fetch_object,
 * such that arbitrary repositories work.
 */
-   fetch_object(repository_format_partial_clone, real->hash);
+   fetch_objects(repository_format_partial_clone, real, 1);
already_retried = 1;
continue;
}
diff --git a/unpack-trees.c b/unpack-trees.c
index f25089b87..82a83dbc6 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -392,7 +392,7 @@ static int check_updates(struct unpack_trees_options *o)
}
if (to_fetch.nr)
fetch_objects(repository_format_partial_clone,
- &to_fetch);
+ to_fetch.oid, to_fetch.nr);
fetch_if_missing = fetch_if_missing_store;
oid_array_clear(_fetch);
}
-- 
2.19.0.397.gdd90340f6a-goog



Re: Is it possible to git clone --filter= without any objects?

2018-09-11 Thread Jonathan Tan
On Tue, Sep 11, 2018 at 10:15 AM, Stefan Beller  wrote:
> You might be pleased to hear about a series floating on the mailing list,
> that started at
> https://public-inbox.org/git/cover.1533854545.git.matv...@google.com/
> and promised to filter trees away, and its latest version can be found at
> https://public-inbox.org/git/cover.1536081438.git.matv...@google.com/

I mentioned this in an email to the original poster in which I forgot
to also CC the mailing list:

By "without any objects" in your email subject, do you mean "without
blob and tree objects"? If yes, there is some code in the
md/filter-trees branch that can do that with a "--filter=tree:0"
option. As far as I know, it is not yet in any released version of
Git, but hopefully will be in the next one (the "What's Cooking" [1]
email mentions that it will be merged to the "next" branch, which is
one of the steps before it is released).

[1] https://public-inbox.org/git/xmqqmusw6gbo@gitster-ct.c.googlers.com/
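
For illustration, usage would be something like this (a sketch, assuming
the branch has been merged and the server has uploadpack.allowfilter
enabled; the URL is a placeholder):

	# clone with no trees or blobs; they are fetched later on demand
	git clone --no-checkout --filter=tree:0 https://example.com/repo.git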


Re: [PATCH v4 4/6] rev-list: handle missing tree objects properly

2018-08-14 Thread Jonathan Tan
> > So we don't want to die in list-objects.c. If we
> > fail to fetch, then we will die on line 213 in rev-list.c.
> 
> Why don't we want to die in list-objects.c? When --missing=error is
> passed, fetch_if_missing retains its default value of 1, so
> parse_tree_gently() will attempt to fetch it - and if it fails, I think
> it's appropriate to die in list-objects.c (and this should be the
> current behavior). On other values, e.g. --missing=allow-any, there is
> no autofetch (since fetch_if_missing is 0), so it is correct not to die
> in list-objects.c.

After some in-office discussion, I should have checked line 213 in
builtin/rev-list.c more thoroughly. Indeed it is OK not to die in
list-objects.c here, since builtin/rev-list.c already knows how to
handle missing objects in the --missing=error circumstance.


Re: [PATCH v4 4/6] rev-list: handle missing tree objects properly

2018-08-14 Thread Jonathan Tan
> > > @@ -373,6 +375,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
> > >   init_revisions(&revs, prefix);
> > >   revs.abbrev = DEFAULT_ABBREV;
> > >   revs.commit_format = CMIT_FMT_UNSPECIFIED;
> > > + revs.do_not_die_on_missing_tree = 1;
> >
> > Is this correct? I would have expected this to be set only if --missing
> > was set.
> If --missing is not set, then we want to fetch missing objects
> automatically, and then die if we fail to do that, which is what
> happens for blobs.

This is true, and should already be handled. Pay attention to when
fetch_if_missing is set in builtin/rev-list.c.

do_not_die_on_missing_tree should probably be set to 1 whenever
fetch_if_missing is set to 0, I think.

(I acknowledge that the usage of this global variable is confusing, but
I couldn't think of a better way to implement this when I did. Perhaps
when the object store refactoring is done, this can be a store-specific
setting instead of a global variable.)
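
To illustrate the pairing from the user's side (an untested sketch;
"repo" is a placeholder for a partial clone with a missing tree):

	# --missing=allow-any sets fetch_if_missing to 0, so with
	# do_not_die_on_missing_tree also set, this should complete
	# instead of dying on the first missing tree
	git -C repo rev-list --objects --missing=allow-any HEAD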

> So we don't want to die in list-objects.c. If we
> fail to fetch, then we will die on line 213 in rev-list.c.

Why don't we want to die in list-objects.c? When --missing=error is
passed, fetch_if_missing retains its default value of 1, so
parse_tree_gently() will attempt to fetch it - and if it fails, I think
it's appropriate to die in list-objects.c (and this should be the
current behavior). On other values, e.g. --missing=allow-any, there is
no autofetch (since fetch_if_missing is 0), so it is correct not to die
in list-objects.c.

> > As for --missing=allow-promisor, I don't see them being tested anywhere
> > :-( so feel free to make a suggestion. I would put them in t6112 for
> > easy comparison with the other --missing tests.
> Kept my allow-promisor test in t0410 since it requires partial clone
> to be turned on in the config, and because it is pretty similar to
> --exclude-promisor-objects.

OK, sounds good to me.


Re: [PATCH v4 6/6] list-objects-filter: implement filter tree:0

2018-08-14 Thread Jonathan Tan
> - grep -E "tree|blob" objs >trees_and_blobs &&
> - test_line_count = 1 trees_and_blobs
> + grep -E "tree|blob" objs \
> + | awk -f print_1.awk >trees_and_blobs &&
> + git -C r1 rev-parse HEAD: >expected &&
> + test_cmp trees_and_blobs expected

Indent "| awk" (and similar lines in this patch) - although I guess it
is likely that you actually have it indented, and your e-mail client
modified the whitespace so that it looks like there is no indent.
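
That is, something like:

	grep -E "tree|blob" objs \
		| awk -f print_1.awk >trees_and_blobs &&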

Other than that, this interdiff looks good to me. Thanks.


Re: [PATCH v4 6/6] list-objects-filter: implement filter tree:0

2018-08-14 Thread Jonathan Tan
> @@ -743,6 +743,9 @@ specification contained in <path>.
>   A debug option to help with future "partial clone" development.
>   This option specifies how missing objects are handled.
>  +
> +The form '--filter=tree:<depth>' omits all blobs and trees deeper than
> +<depth> from the root tree. Currently, only <depth>=0 is supported.
> ++
>  The form '--missing=error' requests that rev-list stop with an error if
>  a missing object is encountered.  This is the default action.
>  +

The "--filter" documentation should go with the other "--filter"
information, not right after --missing.

> +test_expect_success 'setup for tests of tree:0' '
> + mkdir r1/subtree &&
> + echo "This is a file in a subtree" > r1/subtree/file &&
> + git -C r1 add subtree/file &&
> + git -C r1 commit -m subtree
> +'

Style: no space after >
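
That is:

	echo "This is a file in a subtree" >r1/subtree/file &&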

> +test_expect_success 'grab tree directly when using tree:0' '
> + # We should get the tree specified directly but not its blobs or subtrees.
> + git -C r1 pack-objects --rev --stdout --filter=tree:0 >commitsonly.pack <<-EOF &&
> + HEAD:
> + EOF
> + git -C r1 index-pack ../commitsonly.pack &&
> + git -C r1 verify-pack -v ../commitsonly.pack >objs &&
> + grep -E "tree|blob" objs >trees_and_blobs &&
> + test_line_count = 1 trees_and_blobs
> +'

Can we also verify that the SHA-1 in trees_and_blobs is what we
expected?
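
For example (an untested sketch, reusing the print_1.awk helper from this
patch to strip verify-pack's extra columns):

	grep -E "tree|blob" objs | awk -f print_1.awk >trees_and_blobs &&
	git -C r1 rev-parse HEAD: >expected &&
	test_cmp expected trees_and_blobs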

> +test_expect_success 'use fsck before and after manually fetching a missing subtree' '
> + # push new commit so server has a subtree
> + mkdir src/dir &&
> + echo "in dir" > src/dir/file.txt &&

No space after >

> + git -C src add dir/file.txt &&
> + git -C src commit -m "file in dir" &&
> + git -C src push -u srv master &&
> + SUBTREE=$(git -C src rev-parse HEAD:dir) &&
> +
> + rm -rf dst &&
> + git clone --no-checkout --filter=tree:0 "file://$(pwd)/srv.bare" dst &&
> + git -C dst fsck &&
> + git -C dst cat-file -p $SUBTREE >tree_contents 2>err &&
> + git -C dst fsck
> +'

If you don't need to redirect to err, don't do so.

Before the cat-file, also verify that the tree is missing, most likely
through a "git rev-list" with "--missing=print".

And I would grep on the tree_contents to ensure that the filename
("file.txt") is there, so that we know that we got the correct tree.

> +test_expect_success 'can use tree:0 to filter partial clone' '
> + rm -rf dst &&
> + git clone --no-checkout --filter=tree:0 "file://$(pwd)/srv.bare" dst &&
> + git -C dst rev-list master --missing=allow-any --objects >fetched_objects &&
> + cat fetched_objects \
> + | awk -f print_1.awk \
> + | xargs -n1 git -C dst cat-file -t >fetched_types &&
> + sort fetched_types -u >unique_types.observed &&
> + echo commit > unique_types.expected &&
> + test_cmp unique_types.observed unique_types.expected
> +'
> +
> +test_expect_success 'auto-fetching of trees with --missing=error' '
> + git -C dst rev-list master --missing=error --objects >fetched_objects &&
> + cat fetched_objects \
> + | awk -f print_1.awk \
> + | xargs -n1 git -C dst cat-file -t >fetched_types &&
> + sort fetched_types -u >unique_types.observed &&
> + printf "blob\ncommit\ntree\n" >unique_types.expected &&
> + test_cmp unique_types.observed unique_types.expected
> +'

These two tests seem redundant with the 'use fsck before and after
manually fetching a missing subtree' test (after the latter is
appropriately renamed). I think we only need to test this sequence once,
which can be placed in one or spread over multiple tests:

 1. partial clone with --filter=tree:0
 2. fsck works
 3. verify that trees are indeed missing
 4. autofetch a tree
 5. fsck still works
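
Roughly (an untested sketch, reusing the setup and $SUBTREE from the
quoted tests):

	test_expect_success 'tree:0 clone, fsck, autofetch a tree, fsck again' '
		rm -rf dst &&
		git clone --no-checkout --filter=tree:0 "file://$(pwd)/srv.bare" dst &&
		git -C dst fsck &&
		git -C dst rev-list --objects --missing=print HEAD >missing &&
		grep "^?" missing &&
		git -C dst cat-file -p $SUBTREE >tree_contents &&
		grep file.txt tree_contents &&
		git -C dst fsck
	'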


Re: [PATCH v4 4/6] rev-list: handle missing tree objects properly

2018-08-14 Thread Jonathan Tan
> Previously, we assumed only blob objects could be missing. This patch
> makes rev-list handle missing trees like missing blobs. A missing tree
> will cause an error if --missing indicates an error should be caused,
> and the hash is printed even if the tree is missing.

The last sentence is difficult to understand - probably better to say
that all --missing= arguments and --exclude-promisor-objects work for
missing trees like they currently do for blobs (and do not fixate on
just --missing=error). And also demonstrate this in tests, like in
t6112.

> In list-objects.c we no longer print a message to stderr if a tree
> object is missing (quiet_on_missing is always true). I couldn't find
> any place where this would matter, or where the caller of
> traverse_commit_list would need to be fixed to show the error. However,
> in the future it would be trivial to make the caller show the message if
> we needed to.
> 
> This is not tested very thoroughly, since we cannot create promisor
> objects in tests without using an actual partial clone. t0410 has a
> promise_and_delete utility function, but the is_promisor_object function
> does not return 1 for objects deleted in this way. More tests will
> come in a patch that implements a filter that can be used with git
> clone.

These two paragraphs are no longer applicable, I think.

> @@ -373,6 +375,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
>   init_revisions(&revs, prefix);
>   revs.abbrev = DEFAULT_ABBREV;
>   revs.commit_format = CMIT_FMT_UNSPECIFIED;
> + revs.do_not_die_on_missing_tree = 1;

Is this correct? I would have expected this to be set only if --missing
was set.

> - process_tree_contents(ctx, tree, base);
> + /*
> +  * NEEDSWORK: we should not have to process this tree's contents if the
> +  * filter wants to exclude all its contents AND the filter doesn't need
> +  * to collect the omitted OIDs. We should add a LOFR_SKIP_TREE bit which
> +  * allows skipping all children.
> +  */
> + if (parsed)
> + process_tree_contents(ctx, tree, base);

I agree with Jeff Hostetler in [1] that a LOFR_SKIP_TREE bit is
desirable, but I don't think that this patch is the right place to
introduce this NEEDSWORK. For me, this patch is about skipping iterating
over the contents of a tree because the tree does not exist; this
NEEDSWORK is about skipping iterating over the contents of a tree
because we don't want its contents, and it is quite confusing to
conflate the two.

[1] https://public-inbox.org/git/d751d56b-84bb-a03d-5f2a-7dbaf8d94...@jeffhostetler.com/

> @@ -124,6 +124,7 @@ struct rev_info {
>   first_parent_only:1,
>   line_level_traverse:1,
>   tree_blobs_in_commit_order:1,
> + do_not_die_on_missing_tree:1,

I know that the other flags don't have documentation, but I think it's
worth documenting this one because it is rather complicated. I have
provided a sample one in my earlier review - feel free to use that or
come up with your own.

> diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
> index 4984ca583..74e3c5767 100755
> --- a/t/t0410-partial-clone.sh
> +++ b/t/t0410-partial-clone.sh
> @@ -186,6 +186,72 @@ test_expect_success 'rev-list stops traversal at missing and promised commit' '
>   ! grep $FOO out
>  '
>  
> +test_expect_success 'show missing tree objects with --missing=print' '
> + rm -rf repo &&
> + test_create_repo repo &&
> + test_commit -C repo foo &&
> + test_commit -C repo bar &&
> + test_commit -C repo baz &&
> +
> + TREE=$(git -C repo rev-parse bar^{tree}) &&
> +
> + promise_and_delete $TREE &&
> +
> + git -C repo config core.repositoryformatversion 1 &&
> + git -C repo config extensions.partialclone "arbitrary string" &&
> + git -C repo rev-list --quiet --missing=print --objects HEAD >missing_objs 2>rev_list_err &&
> + echo "?$TREE" >expected &&
> + test_cmp expected missing_objs &&
> +
> + # do not complain when a missing tree cannot be parsed
> + ! grep -q "Could not read " rev_list_err
> +'

I think that the --exclude-promisor-objects tests can go in t0410 as you have
done, but the --missing tests (except for --missing=allow-promisor)
should go in t6112. (And like the existing --missing tests, they should
be done without setting extensions.partialclone.)

As for --missing=allow-promisor, I don't see them being tested anywhere
:-( so feel free to make a suggestion. I would put them in t6112 for
easy comparison with the other --missing tests.
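
For --missing=allow-promisor, perhaps something as simple as this
(untested, assuming a repository set up as a partial clone like the
others in t0410, with a promised tree deleted beforehand):

	git -C repo rev-list --objects --missing=allow-promisor HEAD >objs &&
	! grep "^?" objs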


Re: [PATCH v2 3/5] rev-list: handle missing tree objects properly

2018-08-14 Thread Jonathan Tan
> I don't understand about the ">= 0". What should I replace it with?
> Maybe you mean the return is never positive so I can change:
> 
> parse_tree_gently(tree, 1) >= 0
> 
> to:
> !parse_tree_gently(tree, 1)
> 
> ?

Sorry for the lack of clarity - that is what I meant.

> > The missing mechanism (for error, allow-any, print) should work without
> > needing to consult whether an object is a promisor object or not - it
> > should just print whatever is missing, so the "if
> > (!is_promisor_object..." line looks out of place.
> Done. I considered that a missing object which is not a promisor is a
> serious error, so I had it die here.

It is a serious error, but as far as I can tell, that is what the
--missing flags are supposed to help diagnose (so we can't die since we
need the diagnoses to be printed). See, for example, 'rev-list W/
--missing=print' in t6112 - the "r1" repository does not have partial
clone enabled (which I verified by inserting a test_pause then cat-ting
r1/.git/config), but nothing dies.

> But now that I've added the
> do_not_die_on_missing_tree flag, it's more natural to keep the
> previous promisor check as-is.

OK, I'll take a look once you send out v4.

> Also, is_promisor_object is an
> expensive check, and it would be better to skip it during the common
> execution path (which should be when exclude_promisor_objects, an
> internal-use-only flag, is *not* set, which means we never call
> is_promisor_object).

That's true.

> > In my original review [1], I suggested that we always show a tree if we
> > have its hash - if we don't have the object, we just recurse into it.
> > This would be the same as your patch, except that the 'die("bad tree
> > object...' is totally removed instead of merely moved. I still think
> > this solution has some merit - all the tests still pass (except that we
> > need to check for "unable to read" instead of "bad tree object" in error
> > messages), but I just realized that it might still be backwards
> > incompatible in that a basic "rev-list --objects" would now succeed
> > instead of fail if a tree was missing (I haven't tested this though).
> The presence of the die if !is_promisor_object is what justified the
> changing of the parse_tree_gently to always be gently, since it is
> what showed the OID. Can we really remove both? Maybe in a different
> patch set, since I'm no longer touching that line?

That's true - the idea of removing both needs more thought, and if we
were to do so, we definitely can do it in a different patch set.

> > We might need a flag called "do_not_die_on_missing_tree" (much like your
> > original idea of "show_missing_trees") so that callers that are prepared
> > to deal with missing trees can set this. Sorry for the churn. You can
> > document it as such:
> Added it, but not with a command-line flag, only in rev-info.h. We can
> always add a flag later if people have been relying on the existing
> behavior of git rev-list to balk at missing trees. (That seems
> unlikely though, considering there is no filter to enable that before
> this patchset).

By flag, I indeed meant in rev-info.h - sorry for the confusion. That
sounds good.

