Re: [PATCH v2 04/14] commit-graph: implement construct_commit_graph()

2018-02-07 Thread Derrick Stolee

On 2/7/2018 10:08 AM, SZEDER Gábor wrote:

On Mon, Feb 5, 2018 at 5:06 PM, Derrick Stolee  wrote:

On 2/2/2018 10:32 AM, SZEDER Gábor wrote:

In my git repo, with 9 pack files at the moment, i.e. not that big a
repo and not that many pack files:

$ time ./git commit-graph --write --update-head
4df41a3d1cc408b7ad34bea87b51ec4ccbf4b803

real0m27.550s
user0m27.113s
sys 0m0.376s

In comparison, performing a good old revision walk to gather all the
info that is written into the graph file:

$ time git log --all --topo-order --format='%H %T %P %cd' |wc -l
52954

real0m0.903s
user0m0.972s
sys 0m0.058s


Two reasons this is in here:

(1) It's easier to get the write implemented this way and add the reachable
closure later (which I do).

(2) For GVFS, we want to add all commits that arrived in a "prefetch pack"
to the graph even if we do not have a ref that points to the commit yet. We
expect many commits to become reachable soon and having them in the graph
saves a lot of time in merge-base calculations.

So, (1) is for patch simplicity, and (2) is why I want it to be an option in
the final version. See the --stdin-packs argument later for a way to do this
incrementally.

I expect almost all users to use the reachable closure method with
--stdin-commits (and that's how I will integrate automatic updates with
'fetch', 'repack', and 'gc' in a later patch).

I see.  I was about to ask about the expected use-cases of the
'--stdin-packs' option, considering how much slower it is to enumerate
all objects in pack files, but run out of time after patch 10.

The run-time using '--stdin-commits' is indeed great:

   $ time { git for-each-ref --format='%(objectname)' refs/heads/ | ./git
 commit-graph --write --update-head --stdin-commits ; }
   82fe9a5cd715ff578f01f7f44e0611d7902d20c8

   real  0m0.985s
   user  0m0.916s
   sys   0m0.024s

Considering the run-time difference, I think in the end it would be a
better default for a plain 'git commit-graph --write' to traverse
history from all refs, and it should enumerate pack files only if
explicitly told so via '--stdin-packs'.

To be clear: I'm not saying that traversing history should already be
the default when introducing construct_commit_graph() and '--write'.  If
enumerating pack files keeps the earlier patches simpler and easier to
review, then by all means stick with it, and only change the
'--stdin-*'-less behavior near the end of the series, when all the
building blocks are already in place (but then mention this in the early
commit messages).


I will consider this.


I have also noticed a segfault when feeding non-commit object names to
'--stdin-commits', i.e. when I run the above command without restricting
'git for-each-ref' to branches and it listed object names of tags as
well.

   $ git rev-parse v2.16.1 |./git commit-graph --write --update-head
--stdin-commits
   error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
   error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
   error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
   Segmentation fault

(gdb) bt
#0  __memcpy_avx_unaligned ()
 at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:126
#1  0x004ea97c in sha1write (f=0x356bbf0, buf=0x4, count=20)
 at csum-file.c:104
#2  0x004d98e1 in write_graph_chunk_data (f=0x356bbf0, hash_len=20,
 commits=0x3508de0, nr_commits=50615) at commit-graph.c:506
#3  0x004da9ca in construct_commit_graph (
 pack_dir=0x8ff360 ".git/objects/pack", pack_indexes=0x0, nr_packs=0,
 commit_hex=0x8ff790, nr_commits=1) at commit-graph.c:818
#4  0x0044184e in graph_write () at builtin/commit-graph.c:149
#5  0x00441a8c in cmd_commit_graph (argc=0, argv=0x7fffe310,
 prefix=0x0) at builtin/commit-graph.c:224
#6  0x00405a0a in run_builtin (p=0x8ad950 , argc=4,
 argv=0x7fffe310) at git.c:346
#7  0x00405ce4 in handle_builtin (argc=4, argv=0x7fffe310)
 at git.c:555
#8  0x00405ec8 in run_argv (argcp=0x7fffe1cc, argv=0x7fffe1c0)
 at git.c:607
#9  0x00406079 in cmd_main (argc=4, argv=0x7fffe310) at git.c:684
#10 0x004a85c8 in main (argc=5, argv=0x7fffe308)
 at common-main.c:43


I noticed this while preparing v3. I have a fix, but you now remind me 
that I need to add tags to the test script.


Thanks,
-Stolee


Re: [PATCH v2 04/14] commit-graph: implement construct_commit_graph()

2018-02-07 Thread SZEDER Gábor
On Mon, Feb 5, 2018 at 5:06 PM, Derrick Stolee  wrote:
> On 2/2/2018 10:32 AM, SZEDER Gábor wrote:

>> In my git repo, with 9 pack files at the moment, i.e. not that big a
>> repo and not that many pack files:
>>
>>$ time ./git commit-graph --write --update-head
>>4df41a3d1cc408b7ad34bea87b51ec4ccbf4b803
>>
>>real0m27.550s
>>user0m27.113s
>>sys 0m0.376s
>>
>> In comparison, performing a good old revision walk to gather all the
>> info that is written into the graph file:
>>
>>$ time git log --all --topo-order --format='%H %T %P %cd' |wc -l
>>52954
>>
>>real0m0.903s
>>user0m0.972s
>>sys 0m0.058s
>
>
> Two reasons this is in here:
>
> (1) It's easier to get the write implemented this way and add the reachable
> closure later (which I do).
>
> (2) For GVFS, we want to add all commits that arrived in a "prefetch pack"
> to the graph even if we do not have a ref that points to the commit yet. We
> expect many commits to become reachable soon and having them in the graph
> saves a lot of time in merge-base calculations.
>
> So, (1) is for patch simplicity, and (2) is why I want it to be an option in
> the final version. See the --stdin-packs argument later for a way to do this
> incrementally.
>
> I expect almost all users to use the reachable closure method with
> --stdin-commits (and that's how I will integrate automatic updates with
> 'fetch', 'repack', and 'gc' in a later patch).

I see.  I was about to ask about the expected use-cases of the
'--stdin-packs' option, considering how much slower it is to enumerate
all objects in pack files, but run out of time after patch 10.

The run-time using '--stdin-commits' is indeed great:

  $ time { git for-each-ref --format='%(objectname)' refs/heads/ | ./git
commit-graph --write --update-head --stdin-commits ; }
  82fe9a5cd715ff578f01f7f44e0611d7902d20c8

  real  0m0.985s
  user  0m0.916s
  sys   0m0.024s

Considering the run-time difference, I think in the end it would be a
better default for a plain 'git commit-graph --write' to traverse
history from all refs, and it should enumerate pack files only if
explicitly told so via '--stdin-packs'.

To be clear: I'm not saying that traversing history should already be
the default when introducing construct_commit_graph() and '--write'.  If
enumerating pack files keeps the earlier patches simpler and easier to
review, then by all means stick with it, and only change the
'--stdin-*'-less behavior near the end of the series, when all the
building blocks are already in place (but then mention this in the early
commit messages).


I have also noticed a segfault when feeding non-commit object names to
'--stdin-commits', i.e. when I run the above command without restricting
'git for-each-ref' to branches and it listed object names of tags as
well.

  $ git rev-parse v2.16.1 |./git commit-graph --write --update-head
--stdin-commits
  error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
  error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
  error: Object eb5fcb24f69e13335cf6a6a1b1d4553fa2b0f202 not a commit
  Segmentation fault

(gdb) bt
#0  __memcpy_avx_unaligned ()
at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:126
#1  0x004ea97c in sha1write (f=0x356bbf0, buf=0x4, count=20)
at csum-file.c:104
#2  0x004d98e1 in write_graph_chunk_data (f=0x356bbf0, hash_len=20,
commits=0x3508de0, nr_commits=50615) at commit-graph.c:506
#3  0x004da9ca in construct_commit_graph (
pack_dir=0x8ff360 ".git/objects/pack", pack_indexes=0x0, nr_packs=0,
commit_hex=0x8ff790, nr_commits=1) at commit-graph.c:818
#4  0x0044184e in graph_write () at builtin/commit-graph.c:149
#5  0x00441a8c in cmd_commit_graph (argc=0, argv=0x7fffe310,
prefix=0x0) at builtin/commit-graph.c:224
#6  0x00405a0a in run_builtin (p=0x8ad950 , argc=4,
argv=0x7fffe310) at git.c:346
#7  0x00405ce4 in handle_builtin (argc=4, argv=0x7fffe310)
at git.c:555
#8  0x00405ec8 in run_argv (argcp=0x7fffe1cc, argv=0x7fffe1c0)
at git.c:607
#9  0x00406079 in cmd_main (argc=4, argv=0x7fffe310) at git.c:684
#10 0x004a85c8 in main (argc=5, argv=0x7fffe308)
at common-main.c:43


Re: [PATCH v2 04/14] commit-graph: implement construct_commit_graph()

2018-02-05 Thread Derrick Stolee

On 2/2/2018 10:32 AM, SZEDER Gábor wrote:

Teach Git to write a commit graph file by checking all packed objects
to see if they are commits, then store the file in the given pack
directory.

I'm afraid that scanning all packed objects is a bit of a roundabout
way to approach this.

In my git repo, with 9 pack files at the moment, i.e. not that big a
repo and not that many pack files:

   $ time ./git commit-graph --write --update-head
   4df41a3d1cc408b7ad34bea87b51ec4ccbf4b803

   real0m27.550s
   user0m27.113s
   sys 0m0.376s

In comparison, performing a good old revision walk to gather all the
info that is written into the graph file:

   $ time git log --all --topo-order --format='%H %T %P %cd' |wc -l
   52954

   real0m0.903s
   user0m0.972s
   sys 0m0.058s


Two reasons this is in here:

(1) It's easier to get the write implemented this way and add the 
reachable closure later (which I do).


(2) For GVFS, we want to add all commits that arrived in a "prefetch 
pack" to the graph even if we do not have a ref that points to the 
commit yet. We expect many commits to become reachable soon and having 
them in the graph saves a lot of time in merge-base calculations.


So, (1) is for patch simplicity, and (2) is why I want it to be an 
option in the final version. See the --stdin-packs argument later for a 
way to do this incrementally.


I expect almost all users to use the reachable closure method with 
--stdin-commits (and that's how I will integrate automatic updates with 
'fetch', 'repack', and 'gc' in a later patch).





+char* get_commit_graph_filename_hash(const char *pack_dir,
+struct object_id *hash)
+{
+   size_t len;
+   struct strbuf head_path = STRBUF_INIT;
+   strbuf_addstr(&head_path, pack_dir);
+   strbuf_addstr(&head_path, "/graph-");
+   strbuf_addstr(&head_path, oid_to_hex(hash));
+   strbuf_addstr(&head_path, ".graph");

Nit: this is assembling the path of a graph file, not that of a
graph-head, so the strbuf should be renamed accordingly.


+
+   return strbuf_detach(&head_path, &len);
+}




Re: [PATCH v2 04/14] commit-graph: implement construct_commit_graph()

2018-02-02 Thread SZEDER Gábor

> Teach Git to write a commit graph file by checking all packed objects
> to see if they are commits, then store the file in the given pack
> directory.

I'm afraid that scanning all packed objects is a bit of a roundabout
way to approach this.

In my git repo, with 9 pack files at the moment, i.e. not that big a
repo and not that many pack files:

  $ time ./git commit-graph --write --update-head
  4df41a3d1cc408b7ad34bea87b51ec4ccbf4b803

  real0m27.550s
  user0m27.113s
  sys 0m0.376s

In comparison, performing a good old revision walk to gather all the
info that is written into the graph file:

  $ time git log --all --topo-order --format='%H %T %P %cd' |wc -l
  52954

  real0m0.903s
  user0m0.972s
  sys 0m0.058s



> +char* get_commit_graph_filename_hash(const char *pack_dir,
> +  struct object_id *hash)
> +{
> + size_t len;
> + struct strbuf head_path = STRBUF_INIT;
> + strbuf_addstr(&head_path, pack_dir);
> + strbuf_addstr(&head_path, "/graph-");
> + strbuf_addstr(&head_path, oid_to_hex(hash));
> + strbuf_addstr(&head_path, ".graph");

Nit: this is assembling the path of a graph file, not that of a
graph-head, so the strbuf should be renamed accordingly.

> +
> + return strbuf_detach(&head_path, &len);
> +}



Re: [PATCH v2 04/14] commit-graph: implement construct_commit_graph()

2018-02-01 Thread SZEDER Gábor

> Teach Git to write a commit graph file by checking all packed objects
> to see if they are commits, then store the file in the given pack
> directory.
> 
> Signed-off-by: Derrick Stolee 
> ---
>  Makefile   |   1 +
>  commit-graph.c | 376 
> +
>  commit-graph.h |  20 +++
>  3 files changed, 397 insertions(+)
>  create mode 100644 commit-graph.c
>  create mode 100644 commit-graph.h


> diff --git a/commit-graph.c b/commit-graph.c
> new file mode 100644
> index 00..db2b7390c7
> --- /dev/null
> +++ b/commit-graph.c

> +struct packed_commit_list {
> + struct commit **list;
> + int num;
> + int size;
> +};
> +
> +struct packed_oid_list {
> + struct object_id **list;
> + int num;
> + int size;
> +};

When we manage the memory allocation of an array with the ALLOC_GROW
macro, then we tend to call the helper fields as 'alloc' and 'nr'.

> +static int if_packed_commit_add_to_list(const struct object_id *oid,
> + struct packed_git *pack,
> + uint32_t pos,
> + void *data)
> +{
> + struct packed_oid_list *list = (struct packed_oid_list*)data;
> + enum object_type type;
> + unsigned long size;
> + void *inner_data;
> + off_t offset = nth_packed_object_offset(pack, pos);
> + inner_data = unpack_entry(pack, offset, &type, &size);
> +
> + if (inner_data)
> + free(inner_data);
> +
> + if (type != OBJ_COMMIT)
> + return 0;
> +
> + ALLOC_GROW(list->list, list->num + 1, list->size);
> + list->list[list->num] = (struct object_id *)malloc(sizeof(struct 
> object_id));
> + oidcpy(list->list[list->num], oid);
> + (list->num)++;
> +
> + return 0;
> +}
> +
> +struct object_id *construct_commit_graph(const char *pack_dir)
> +{
> + struct packed_oid_list oids;
> + struct packed_commit_list commits;
> + struct commit_graph_header hdr;
> + struct sha1file *f;
> + int i, count_distinct = 0;
> + struct strbuf tmp_file = STRBUF_INIT;
> + unsigned char final_hash[GIT_MAX_RAWSZ];
> + char *graph_name;
> + int fd;
> + uint32_t chunk_ids[5];
> + uint64_t chunk_offsets[5];
> + int num_long_edges;
> + struct object_id *f_hash;
> + char *fname;
> + struct commit_list *parent;
> +
> + oids.num = 0;
> + oids.size = 1024;
> + ALLOC_ARRAY(oids.list, oids.size);
> + for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
> + QSORT(oids.list, oids.num, commit_compare);
> +
> + count_distinct = 1;
> + for (i = 1; i < oids.num; i++) {
> + if (oidcmp(oids.list[i-1], oids.list[i]))
> + count_distinct++;
> + }
> +
> + commits.num = 0;
> + commits.size = count_distinct;
> + ALLOC_ARRAY(commits.list, commits.size);
> +
> + num_long_edges = 0;
> + for (i = 0; i < oids.num; i++) {
> + int num_parents = 0;
> + if (i > 0 && !oidcmp(oids.list[i-1], oids.list[i]))
> + continue;
> +
> + commits.list[commits.num] = lookup_commit(oids.list[i]);
> + parse_commit(commits.list[commits.num]);
> +
> + for (parent = commits.list[commits.num]->parents;
> +  parent; parent = parent->next)
> + num_parents++;
> +
> + if (num_parents > 2)
> + num_long_edges += num_parents - 1;
> +
> + commits.num++;
> + }
> +
> + strbuf_addstr(&tmp_file, pack_dir);
> + strbuf_addstr(&tmp_file, "/tmp_graph_XX");
> +
> + fd = git_mkstemp_mode(tmp_file.buf, 0444);
> + if (fd < 0)
> + die_errno("unable to create '%s'", tmp_file.buf);
> +
> + graph_name = strbuf_detach(&tmp_file, NULL);
> + f = sha1fd(fd, graph_name);
> +
> + hdr.graph_signature = htonl(GRAPH_SIGNATURE);
> + hdr.graph_version = GRAPH_VERSION;
> + hdr.hash_version = GRAPH_OID_VERSION;
> + hdr.hash_len = GRAPH_OID_LEN;
> + hdr.num_chunks = 4;
> +
> + assert(sizeof(hdr) == 8);
> + sha1write(f, &hdr, sizeof(hdr));
> +
> + chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
> + chunk_ids[1] = GRAPH_CHUNKID_OIDLOOKUP;
> + chunk_ids[2] = GRAPH_CHUNKID_DATA;
> + chunk_ids[3] = GRAPH_CHUNKID_LARGEEDGES;
> + chunk_ids[4] = 0;
> +
> + chunk_offsets[0] = sizeof(hdr) + GRAPH_CHUNKLOOKUP_SIZE;
> + chunk_offsets[1] = chunk_offsets[0] + GRAPH_FANOUT_SIZE;
> + chunk_offsets[2] = chunk_offsets[1] + GRAPH_OID_LEN * commits.num;
> + chunk_offsets[3] = chunk_offsets[2] + (GRAPH_OID_LEN + 16) * 
> commits.num;
> + chunk_offsets[4] = chunk_offsets[3] + 4 * num_long_edges;
> +
> + for (i = 0; i <= hdr.num_chunks; i++) {
> + uint32_t chunk_write[3];
> +
> + chunk_write[0] = htonl(chunk_ids[i]);
> + chunk_write[1] = htonl(chunk_offsets[i] >>

Re: [PATCH v2 04/14] commit-graph: implement construct_commit_graph()

2018-02-01 Thread Jonathan Tan
On Tue, 30 Jan 2018 16:39:33 -0500
Derrick Stolee  wrote:

> +#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
> +#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
> +#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
> +#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> +#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */

Could all these just be string constants? sha1write can handle them well
enough.

> +static void write_graph_chunk_fanout(struct sha1file *f,
> +  struct commit **commits,
> +  int nr_commits)
> +{
> + uint32_t i, count = 0;
> + struct commit **list = commits;
> + struct commit **last = commits + nr_commits;
> +
> + /*
> +  * Write the first-level table (the list is sorted,
> +  * but we use a 256-entry lookup to be able to avoid
> +  * having to do eight extra binary search iterations).
> +  */
> + for (i = 0; i < 256; i++) {
> + uint32_t swap_count;
> +
> + while (list < last) {
> + if ((*list)->object.oid.hash[0] != i)
> + break;
> + count++;
> + list++;
> + }
> +
> + swap_count = htonl(count);
> + sha1write(f, &swap_count, 4);

You can use sha1write_be32() instead of swapping.

> +static void write_graph_chunk_large_edges(struct sha1file *f,
> +   struct commit **commits,
> +   int nr_commits)
> +{
> + struct commit **list = commits;
> + struct commit **last = commits + nr_commits;
> + struct commit_list *parent;
> +
> + while (list < last) {
> + int num_parents = 0;
> + for (parent = (*list)->parents; num_parents < 3 && parent;
> +  parent = parent->next)
> + num_parents++;
> +
> + if (num_parents <= 2) {
> + list++;
> + continue;
> + }
> +
> + for (parent = (*list)->parents; parent; parent = parent->next) {
> + uint32_t int_id, swap_int_id;
> + uint32_t last_edge = 0;
> +
> + if (parent == (*list)->parents)
> + continue;

Probably better to just initialize "parent = (*list)->parents->next".
Also probably best to add a comment describing why you are doing this
(e.g. "The first parent is already in the main commit table; the large
edges table only contains the second parent onwards").

> +struct packed_commit_list {
> + struct commit **list;
> + int num;
> + int size;
> +};
> +
> +struct packed_oid_list {
> + struct object_id **list;
> + int num;
> + int size;
> +};

What are num and size? If they're nr and alloc, maybe use those names
instead.

> +static int if_packed_commit_add_to_list(const struct object_id *oid,
> + struct packed_git *pack,
> + uint32_t pos,
> + void *data)
> +{
> + struct packed_oid_list *list = (struct packed_oid_list*)data;
> + enum object_type type;
> + unsigned long size;
> + void *inner_data;
> + off_t offset = nth_packed_object_offset(pack, pos);
> + inner_data = unpack_entry(pack, offset, &type, &size);
> +
> + if (inner_data)
> + free(inner_data);
> +
> + if (type != OBJ_COMMIT)
> + return 0;
> +
> + ALLOC_GROW(list->list, list->num + 1, list->size);
> + list->list[list->num] = (struct object_id *)malloc(sizeof(struct 
> object_id));

No need to cast return value of malloc. Also, use xmalloc?

> +struct object_id *construct_commit_graph(const char *pack_dir)
> +{
> + struct packed_oid_list oids;
> + struct packed_commit_list commits;
> + struct commit_graph_header hdr;
> + struct sha1file *f;
> + int i, count_distinct = 0;
> + struct strbuf tmp_file = STRBUF_INIT;
> + unsigned char final_hash[GIT_MAX_RAWSZ];
> + char *graph_name;
> + int fd;
> + uint32_t chunk_ids[5];
> + uint64_t chunk_offsets[5];
> + int num_long_edges;
> + struct object_id *f_hash;
> + char *fname;
> + struct commit_list *parent;
> +
> + oids.num = 0;
> + oids.size = 1024;
> + ALLOC_ARRAY(oids.list, oids.size);
> + for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
> + QSORT(oids.list, oids.num, commit_compare);
> +
> + count_distinct = 1;
> + for (i = 1; i < oids.num; i++) {
> + if (oidcmp(oids.list[i-1], oids.list[i]))
> + count_distinct++;
> + }
> +
> + commits.num = 0;
> + commits.size = count_distinct;
> + ALLOC_ARRAY(commits.list, commits.size);
> +
> + num_long_edges = 0;
> + for (i = 0; i < oids.num; i++) {
> + int num_parents = 0;
> + if (