Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-19 Thread Derrick Stolee

On 3/19/2018 8:55 AM, Derrick Stolee wrote:


Thanks for this! Fixing this performance problem is very important to 
me, as we will use the "--stdin-packs" mechanism in the GVFS scenario 
(we will walk all commits in the prefetch packs full of commits and 
trees instead of relying on refs). This speedup is very valuable!


Thanks,
-Stolee


Also, for those interested in this series, I plan to do a rebase onto 
2.17.0, when available, as my re-roll. I pushed my responses to the 
current feedback at the GitHub PR for the series [1].


If you are planning to provide more feedback to the series, then please 
let me know and I'll delay my re-roll so you have a chance to review.


Thanks,
-Stolee

[1] https://github.com/derrickstolee/git/pull/2


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-19 Thread Derrick Stolee

On 3/16/2018 12:28 PM, Lars Schneider wrote:



On 14 Mar 2018, at 21:43, Junio C Hamano  wrote:

Derrick Stolee  writes:



Hopefully this version is ready to merge. I have several follow-up topics
in mind to submit soon after, including:

A few patches add trailing blank lines and other whitespace
breakages, which will stop my "git merge" later to 'next' and down,
as I have a pre-commit hook to catch them.

@stolee:

I run "git --no-pager diff --check $BASE_HASH...$HEAD_HASH" to detect
these kinds of things. I run this as part of my "prepare patch" [1] script
which is inspired by a similar script originally written by Dscho.

Do you think it would make sense to mention (or even
recommend) such a script in your awesome GfW CONTRIBUTING.md?


- Lars


[1] 
https://github.com/larsxschneider/git-list-helper/blob/master/prepare-patch.sh#L71



Thanks for the suggestions. Somehow I got extra whitespace doing 
copy/paste in vim and I never re-opened that file in my normal editor 
(VS Code with an extension that shows trailing whitespace).


On 3/15/2018 1:23 PM, Johannes Schindelin wrote:

git log --check`
was introduced to show you whitespace problems). If all of those
whitespace issues are unintentional, you can fix them using `git rebase
--whitespace=fix` in the most efficient way.


Thanks for both of the suggestions. The `rebase` check was already in 
the document, so I put the checks immediately above that line. PR is 
available now [1].


Thanks,
-Stolee

[1] https://github.com/git-for-windows/git/pull/1567


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-19 Thread Derrick Stolee



On 3/16/2018 4:19 PM, Jeff King wrote:

On Fri, Mar 16, 2018 at 04:06:39PM -0400, Jeff King wrote:


Furthermore, in order to look at an object it has to be zlib inflated
first, and since commit objects tend to be much smaller than trees and
especially blobs, there are a lot less bytes to inflate:

   $ grep ^commit type-size |cut -d' ' -f2 |avg
   34395730 / 53754 = 639
   $ cat type-size |cut -d' ' -f2 |avg
   3866685744 / 244723 = 15800

So a simple revision walk inflates less than 1% of the bytes that the
"enumerate objects packfiles" approach has to inflate.

I don't think this is quite accurate. It's true that we have to
_consider_ every object, but Git is smart enough not to inflate each one
to find its type. For loose objects we just inflate the header. For
packed objects, we either pick the type directly out of the packfile
header (for a non-delta) or can walk the delta chain (without actually
looking at the data bytes!) until we hit the base.

Hmm, so that's a big part of the problem with this patch series. It
actually _does_ unpack every object with --stdin-packs to get the type,
which is just silly. With the patch below, my time for "commit-graph
write --stdin-packs" on linux.git goes from over 5 minutes (I got bored
and killed it) to 17 seconds.

diff --git a/commit-graph.c b/commit-graph.c
index 6348bab82b..cf1da2e8c1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -491,11 +491,12 @@ static int add_packed_commits(const struct object_id *oid,
  {
struct packed_oid_list *list = (struct packed_oid_list*)data;
enum object_type type;
-   unsigned long size;
-   void *inner_data;
off_t offset = nth_packed_object_offset(pack, pos);
-   inner_data = unpack_entry(pack, offset, , );
-   FREE_AND_NULL(inner_data);
+   struct object_info oi = OBJECT_INFO_INIT;
+
+   oi.typep = 
+   if (packed_object_info(pack, offset, ) < 0)
+   die("unable to get type of object %s", oid_to_hex(oid));
  
  	if (type != OBJ_COMMIT)

return 0;

-Peff


Thanks for this! Fixing this performance problem is very important to 
me, as we will use the "--stdin-packs" mechanism in the GVFS scenario 
(we will walk all commits in the prefetch packs full of commits and 
trees instead of relying on refs). This speedup is very valuable!


Thanks,
-Stolee


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-19 Thread Johannes Schindelin
Hi Junio,

On Fri, 16 Mar 2018, Junio C Hamano wrote:

> Johannes Schindelin  writes:
> 
> > I understand that it is a trade-off between time you have to spend and
> > that others have to spend, and since you do not scale, that trade-off
> > has to be in your favor.
> 
> That tradeoff may exist, but it does not weigh in the picture above
> at all.

It does, however. You frequently do not even tell the original contributor
what changes you made while applying the patches. I know because I have
been surprised by some of those changes, long after you merged them into
`master`.

And quite honestly, my time is valuable, too. So you should stop assuming
that I, for one (and probably other contributors, too) compare carefully
what differences exist between the local topic branch and what you chose
to make of the patches. I cannot make you stop suggesting that, but I can
tell you right away that I won't do that unless I *have* to. It is an
inefficient use of my time. I wish you would also realize that it
invariably leads to your having to "touch up" iteration after iteration
because your touch-ups had not been picked up.

Having said that, I can live with the status quo. I have a track record of
being able to.

Ciao,
Dscho


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread Jeff King
On Fri, Mar 16, 2018 at 11:33:55AM -0700, Junio C Hamano wrote:

> It is not so surprising that history walking runs rings around
> enumerating objects in packfiles, if packfiles are built well.
> 
> A well-built packfile tends to has newer objects in base form and
> has delta that goes in backward direction (older objects are
> represented as delta against newer ones).  This helps warlking from
> the tips of the history quite a bit, because your delta base cache
> will tend to have the base object (i.e. objects in the newer part of
> the history you just walked) that will be required to access the
> "next" older part of the history more often than not.
> 
> Trying to read the objects in the pack in their object name order
> would essentially mean reading them in a cryptgraphically random
> order.  Half the time you will end up wanting to access an object
> that is near the tip of a very deep delta chain even before you've
> accessed any of the base objects in the delta chain.

I coincidentally was doing some experiments in this area a few weeks
ago, and found a few things:

  1. The ordering makes a _huge_ difference for accessing trees and
 blobs.

  2. Pack order (not pack-idx order) is actually the best order, since
 it tends to follow the delta patterns (it's close to traversal
 order, but packs delta families more tightly).

  3. None of this really matters for commits, since we almost never
 store them as deltas anyway.

Here are a few experiments people can do themselves to demonstrate (my
numbers here are all from linux.git, which is sort of a wort-case
for bad ordering because its size stresses the default delta cache):

  [every object in sha1 order: slow]
  $ time git cat-file --batch-all-objects --batch >/dev/null
  real  8m44.041s
  user  8m31.359s
  sys   0m12.262s

  [every object from a traversal: faster, but --objects traversals are
   actually CPU heavy due to all of the hash lookups for each tree. Note
   not just wall-clock time but the CPU since it's split across two
   processes]
  $ time git rev-list --objects --all |
 cut -d' ' -f2 |
 git cat-file --batch >/dev/null
  real  1m2.667s
  user  0m58.537s
  sys   0m32.392s

  [every object in pack order: fastest. This is due to skipping the
   traversal overhead, and should use our delta cache quite efficiently.
   I'm assuming a single pack and no loose objects here, but the
   performance should generalize since accessing the "big" pack
   dominates]
  $ time git show-index <$(ls .git/objects/pack/*.idx) |
 sort -n |
 cut -d' ' -f2 |
 git cat-file --batch >/dev/null
  real  0m51.718s
  user  0m50.963s
  sys   0m7.068s

  [just commits, sha1 order: not horrible]
  $ time git cat-file --batch-all-objects --batch-check='%(objecttype) 
%(objectname)' |
 grep ^commit |
 cut -d' ' -f2 |
 git cat-file --batch >/dev/null
  real  0m8.115s
  user  0m14.033s
  sys   0m1.170s

  [just commits, pack order: slightly worse due to the extra piping, but
   obviously that could be done more quickly internally]
  $ time git show-index <$(ls .git/objects/pack/*.idx) |
 sort -n |
 cut -d' ' -f2 |
 git cat-file --batch-check='%(objecttype) %(objectname)' |
 grep ^commit |
 cut -d' ' -f2 |
 git cat-file --batch >/dev/null
  real  0m21.670s
  user  0m24.867s
  sys   0m9.600s

  [and the reason is that hardly any commits get deltas]
  $ git cat-file --batch-all-objects --batch-check='%(objecttype) %(deltabase)' 
|
grep ^commit >commits
  $ wc -l commits
  692596
  $ grep -v '' commits | wc -l
  18856

For the purposes of this patch series, I don't think the order matters
much, since we're only dealing with commits. For doing --batch-check, I
think the sha1 ordering given by "cat-file --batch-all-objects" is
convenient, and doesn't have a big impact on performance. But it's
_awful_ for --batch. I think we may want to add a sorting option to just
return the objects in the original packfile order.

-Peff


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread Jeff King
On Fri, Mar 16, 2018 at 04:06:39PM -0400, Jeff King wrote:

> > Furthermore, in order to look at an object it has to be zlib inflated
> > first, and since commit objects tend to be much smaller than trees and
> > especially blobs, there are a lot less bytes to inflate:
> > 
> >   $ grep ^commit type-size |cut -d' ' -f2 |avg
> >   34395730 / 53754 = 639
> >   $ cat type-size |cut -d' ' -f2 |avg
> >   3866685744 / 244723 = 15800
> > 
> > So a simple revision walk inflates less than 1% of the bytes that the
> > "enumerate objects packfiles" approach has to inflate.
> 
> I don't think this is quite accurate. It's true that we have to
> _consider_ every object, but Git is smart enough not to inflate each one
> to find its type. For loose objects we just inflate the header. For
> packed objects, we either pick the type directly out of the packfile
> header (for a non-delta) or can walk the delta chain (without actually
> looking at the data bytes!) until we hit the base.

Hmm, so that's a big part of the problem with this patch series. It
actually _does_ unpack every object with --stdin-packs to get the type,
which is just silly. With the patch below, my time for "commit-graph
write --stdin-packs" on linux.git goes from over 5 minutes (I got bored
and killed it) to 17 seconds.

diff --git a/commit-graph.c b/commit-graph.c
index 6348bab82b..cf1da2e8c1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -491,11 +491,12 @@ static int add_packed_commits(const struct object_id *oid,
 {
struct packed_oid_list *list = (struct packed_oid_list*)data;
enum object_type type;
-   unsigned long size;
-   void *inner_data;
off_t offset = nth_packed_object_offset(pack, pos);
-   inner_data = unpack_entry(pack, offset, , );
-   FREE_AND_NULL(inner_data);
+   struct object_info oi = OBJECT_INFO_INIT;
+
+   oi.typep = 
+   if (packed_object_info(pack, offset, ) < 0)
+   die("unable to get type of object %s", oid_to_hex(oid));
 
if (type != OBJ_COMMIT)
return 0;

-Peff


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread SZEDER Gábor
On Fri, Mar 16, 2018 at 7:33 PM, Junio C Hamano  wrote:
> SZEDER Gábor  writes:
>
>> You should forget '--stdin-packs' and use '--stdin-commits' to generate
>> the initial graph, it's much faster even without '--additive'[1].  See
>>
>>   
>> https://public-inbox.org/git/CAM0VKj=wmkBNH=pscrztxfrc13rig1easw89q6ljansdjde...@mail.gmail.com/
>>
>> I still think that the default behaviour for 'git commit-graph write'
>> should simply walk history from all refs instead of enumerating all
>> objects in all packfiles.
>
> Somehow I missed that one.  Thanks for the link to it.
>
> It is not so surprising that history walking runs rings around
> enumerating objects in packfiles, if packfiles are built well.
>
> A well-built packfile tends to has newer objects in base form and
> has delta that goes in backward direction (older objects are
> represented as delta against newer ones).  This helps warlking from
> the tips of the history quite a bit, because your delta base cache
> will tend to have the base object (i.e. objects in the newer part of
> the history you just walked) that will be required to access the
> "next" older part of the history more often than not.
>
> Trying to read the objects in the pack in their object name order
> would essentially mean reading them in a cryptgraphically random
> order.  Half the time you will end up wanting to access an object
> that is near the tip of a very deep delta chain even before you've
> accessed any of the base objects in the delta chain.

I came up with a different explanation back then: we are only interested
in commit objects when creating the commit graph, and only a small-ish
fraction of all objects are commit objects, so the "enumerate objects in
packfiles" approach has to look at a lot more objects:

  # in my git fork
  $ git rev-list --all --objects |cut -d' ' -f1 |\
git cat-file --batch-check='%(objecttype) %(objectsize)' >type-size
  $ grep -c ^commit type-size
  53754
  $ wc -l type-size
  244723 type-size

I.e. only about 20% of all objects are commit objects.

Furthermore, in order to look at an object it has to be zlib inflated
first, and since commit objects tend to be much smaller than trees and
especially blobs, there are a lot less bytes to inflate:

  $ grep ^commit type-size |cut -d' ' -f2 |avg
  34395730 / 53754 = 639
  $ cat type-size |cut -d' ' -f2 |avg
  3866685744 / 244723 = 15800

So a simple revision walk inflates less than 1% of the bytes that the
"enumerate objects packfiles" approach has to inflate.


>> [1] - Please excuse the bikeshed: '--additive' is such a strange
>>   sounding option name, at least for me.  '--append', perhaps?
>
> Yeah, I think "fetch --append" is probably a precedence.


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread Jeff King
On Fri, Mar 16, 2018 at 08:48:49PM +0100, SZEDER Gábor wrote:

> I came up with a different explanation back then: we are only interested
> in commit objects when creating the commit graph, and only a small-ish
> fraction of all objects are commit objects, so the "enumerate objects in
> packfiles" approach has to look at a lot more objects:
> 
>   # in my git fork
>   $ git rev-list --all --objects |cut -d' ' -f1 |\
> git cat-file --batch-check='%(objecttype) %(objectsize)' >type-size
>   $ grep -c ^commit type-size
>   53754
>   $ wc -l type-size
>   244723 type-size
> 
> I.e. only about 20% of all objects are commit objects.
> 
> Furthermore, in order to look at an object it has to be zlib inflated
> first, and since commit objects tend to be much smaller than trees and
> especially blobs, there are a lot less bytes to inflate:
> 
>   $ grep ^commit type-size |cut -d' ' -f2 |avg
>   34395730 / 53754 = 639
>   $ cat type-size |cut -d' ' -f2 |avg
>   3866685744 / 244723 = 15800
> 
> So a simple revision walk inflates less than 1% of the bytes that the
> "enumerate objects packfiles" approach has to inflate.

I don't think this is quite accurate. It's true that we have to
_consider_ every object, but Git is smart enough not to inflate each one
to find its type. For loose objects we just inflate the header. For
packed objects, we either pick the type directly out of the packfile
header (for a non-delta) or can walk the delta chain (without actually
looking at the data bytes!) until we hit the base.

So starting from scratch:

  git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
  grep ^commit |
  cut -d' ' -f2 |
  git cat-file --batch

is in the same ballpark for most repos as:

  git rev-list --all |
  git cat-file --batch

though in my timings the traversal is a little bit faster (and I'd
expect that to remain the case when doing it all in a single process,
since the traversal only follows commit links, whereas processing the
object list has to do the type lookup for each object before deciding
whether to inflate it).

I'm not sure, though, if that edge would remain for incremental updates.
For instance, after we take in some new objects via "fetch", the
traversal strategy would want to do something like:

  git rev-list $new_tips --not --all |
  git cat-file --batch

whose performance will depend on the refs _currently_ in the repository,
as we load them as UNINTERESTING tips for the walk. Whereas doing:

  git show-index <.git/objects/pack/the-one-new-packfile.idx |
  cut -d' ' -f2 |
  git cat-file --batch-check='%(objecttype) %(objectname)' |
  grep ^commit |
  cut -d' ' -f2 |
  git cat-file --batch

always scales exactly with the size of the new objects (obviously that's
kind of baroque and this would all be done internally, but I'm trying to
demonstrate the algorithmic complexity). I'm not sure what the plan
would be if we explode loose objects, though. ;)

-Peff


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread Junio C Hamano
SZEDER Gábor  writes:

> You should forget '--stdin-packs' and use '--stdin-commits' to generate
> the initial graph, it's much faster even without '--additive'[1].  See
>
>   
> https://public-inbox.org/git/CAM0VKj=wmkBNH=pscrztxfrc13rig1easw89q6ljansdjde...@mail.gmail.com/
>
> I still think that the default behaviour for 'git commit-graph write'
> should simply walk history from all refs instead of enumerating all
> objects in all packfiles.

Somehow I missed that one.  Thanks for the link to it.

It is not so surprising that history walking runs rings around
enumerating objects in packfiles, if packfiles are built well.

A well-built packfile tends to has newer objects in base form and
has delta that goes in backward direction (older objects are
represented as delta against newer ones).  This helps warlking from
the tips of the history quite a bit, because your delta base cache
will tend to have the base object (i.e. objects in the newer part of
the history you just walked) that will be required to access the
"next" older part of the history more often than not.

Trying to read the objects in the pack in their object name order
would essentially mean reading them in a cryptgraphically random
order.  Half the time you will end up wanting to access an object
that is near the tip of a very deep delta chain even before you've
accessed any of the base objects in the delta chain.

> [1] - Please excuse the bikeshed: '--additive' is such a strange
>   sounding option name, at least for me.  '--append', perhaps?

Yeah, I think "fetch --append" is probably a precedence.


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread Junio C Hamano
Johannes Schindelin  writes:

>> > Stolee, you definitely want to inspect those changes (`git log --check`
>> > was introduced to show you whitespace problems). If all of those
>> > whitespace issues are unintentional, you can fix them using `git rebase
>> > --whitespace=fix` in the most efficient way.
>> 
>> Another way that may be easier (depending on the way Derrick works)
>> is to fetch from me and start working from there, as if they were
>> the last set of commits that were sent to the list.  "git log
>> --first-parent --oneline master..pu" would show where the tip of the
>> topic is.
>
> That is not really easier. We had that discussion before. Stolee would
> have to remove your Signed-off-by: lines *manually*.

In return, all the whitespace fixes (and other fixes if any) I did
on my end can be reused free by the submitter, instead of having to
redo it *manually*.

If a reroll of the series does not touch one specific commit, that
commit can be left as-is; I do not see a need to remove anybody's
sign-off or add yet another of your own, if the last two sign-offs
are from you and your upstream maintainer, if you did not change
anythning in what you got from the latter.  This depends on what
tool is used to work on refinement, but with "rebase -i", you'd
leave "pick" as "pick" and not "edit" or "reword" and it would do
the right thing.

If you did refine, you get an editor when you record that
refinement, so it is just a few key strokes, either "dd" or \C-k, to
do that removal *manually*.  So I am not sure why you are making a
mountain out of this molehill.

If you do want to remove the last two sign-off (i.e. penultimate one
by the author done during the initial submission, plus the last one
by me), well, "rebase -i" is open source.  We can add features to
the tool to help everybody collaborate better.  Extending changes
like planned addition of --signoff by Phillip, it is not all that
far-fetched to add a mechanism that notices a project-specific
trailer rewrite rules in-tree and uses that in between each step to
rewrite the trailer block of the commit message, for example, and
the rule

> I understand that it is a trade-off between time you have to spend and
> that others have to spend, and since you do not scale, that trade-off has
> to be in your favor.

That tradeoff may exist, but it does not weigh in the picture above
at all.

Perhaps it is better to try to actually think of a way to work
together better, instead of just whining.


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread SZEDER Gábor
On Fri, Mar 16, 2018 at 4:06 PM, Ævar Arnfjörð Bjarmason
 wrote:

> I noticed that it takes a *long* time to generate the graph, on a bigger
> repo I have it takes 20 minutes, and this is a repo where repack -A -d
> itself takes 5-8 minutes, probably on the upper end of that with the
> bitmap, but once you do that it's relatively snappy with --stdin-commits
> --additive when I feed it the new commits.
>
> I don't have any need really to make this run in 10m instead of 20m,
> just something I found interesting, i.e. how it compares to the repack
> itself.

You should forget '--stdin-packs' and use '--stdin-commits' to generate
the initial graph, it's much faster even without '--additive'[1].  See

  
https://public-inbox.org/git/CAM0VKj=wmkBNH=pscrztxfrc13rig1easw89q6ljansdjde...@mail.gmail.com/

I still think that the default behaviour for 'git commit-graph write'
should simply walk history from all refs instead of enumerating all
objects in all packfiles.


[1] - Please excuse the bikeshed: '--additive' is such a strange
  sounding option name, at least for me.  '--append', perhaps?


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread Lars Schneider

> On 14 Mar 2018, at 21:43, Junio C Hamano  wrote:
> 
> Derrick Stolee  writes:
> 
>> This v6 includes feedback around csum-file.c and the rename of hashclose()
>> to finalize_hashfile(). These are the first two commits of the series, so
>> they could be pulled out independently.
>> 
>> The only other change since v5 is that I re-ran the performance numbers
>> in "commit: integrate commit graph with commit parsing".
> 
> Thanks.
> 
>> Hopefully this version is ready to merge. I have several follow-up topics
>> in mind to submit soon after, including:
> 
> A few patches add trailing blank lines and other whitespace
> breakages, which will stop my "git merge" later to 'next' and down,
> as I have a pre-commit hook to catch them.

@stolee: 

I run "git --no-pager diff --check $BASE_HASH...$HEAD_HASH" to detect
these kinds of things. I run this as part of my "prepare patch" [1] script
which is inspired by a similar script originally written by Dscho.

Do you think it would make sense to mention (or even
recommend) such a script in your awesome GfW CONTRIBUTING.md?


- Lars


[1] 
https://github.com/larsxschneider/git-list-helper/blob/master/prepare-patch.sh#L71




Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread Ævar Arnfjörð Bjarmason

On Wed, Mar 14 2018, Derrick Stolee jotted:

> Hopefully this version is ready to merge. I have several follow-up topics
> in mind to submit soon after, including:

I've been doing some preliminary testing of this internally, all good so
far, on a relatively small repo (~100k commits) I was using for testing:

- git -c core.commitGraph=true -C  rev-list --all:
-  /mnt/ext4_graph => min:273mean:279max:389-- (273 274 
275 276 277 279 282 282 345 389)
-/mnt/ext4 => min:1087   mean:1123   max:1175   -- (1087 
1092 1092 1104 1117 1123 1126 1136 1143 1175)

This is on a fresh clone with one giant pack and where the commit graph
data was generated afterwards with "git commit-graph write" for the
*_graph repo, so it contains all the commits.

So less than 25% of the mean time it spent before. Nice. Those are times
in milliseconds over 10 runs, for this particular one I didn't get much
of an improvement in --graph, but still ~10%:

- git -c core.commitGraph=true -C  log --oneline --graph:
-  /mnt/ext4_graph => min:1420   mean:1449   max:1586   -- (1420 
1423 1428 1434 1449 1449 1490 1548 1567 1586)
-/mnt/ext4 => min:1547   mean:1616   max:2136   -- (1547 
1557 1581 1585 1598 1616 1621 1740 1964 2136)

I noticed that it takes a *long* time to generate the graph, on a bigger
repo I have it takes 20 minutes, and this is a repo where repack -A -d
itself takes 5-8 minutes, probably on the upper end of that with the
bitmap, but once you do that it's relatively snappy with --stdin-commits
--additive when I feed it the new commits.

I don't have any need really to make this run in 10m instead of 20m,
just something I found interesting, i.e. how it compares to the repack
itself.


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread Johannes Schindelin
Hi Junio,

On Thu, 15 Mar 2018, Junio C Hamano wrote:

> Johannes Schindelin  writes:
> 
> > Stolee, you definitely want to inspect those changes (`git log --check`
> > was introduced to show you whitespace problems). If all of those
> > whitespace issues are unintentional, you can fix them using `git rebase
> > --whitespace=fix` in the most efficient way.
> 
> Another way that may be easier (depending on the way Derrick works)
> is to fetch from me and start working from there, as if they were
> the last set of commits that were sent to the list.  "git log
> --first-parent --oneline master..pu" would show where the tip of the
> topic is.

That is not really easier. We had that discussion before. Stolee would
have to remove your Signed-off-by: lines *manually*.

I understand that it is a trade-off between time you have to spend and
that others have to spend, and since you do not scale, that trade-off has
to be in your favor.

My hope is that we will eventually collaborate more effectively using Git
itself, then those trade-offs will become a lot less involved because the
overall cost will be a lot smaller.

Ciao,
Dscho


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-15 Thread Ramsay Jones


On 15/03/18 18:41, Junio C Hamano wrote:
> Johannes Schindelin  writes:
> 
>> Stolee, you definitely want to inspect those changes (`git log --check`
>> was introduced to show you whitespace problems). If all of those
>> whitespace issues are unintentional, you can fix them using `git rebase
>> --whitespace=fix` in the most efficient way.
> 
> Another way that may be easier (depending on the way Derrick works)
> is to fetch from me and start working from there, as if they were
> the last set of commits that were sent to the list.  "git log
> --first-parent --oneline master..pu" would show where the tip of the
> topic is.

BTW, thanks for adding the 'SQUASH??? sparse fixes' on top of that
branch - sparse is now quiet on the 'pu' branch. (The same can't
be said of static-check.pl, but that is a different issue. ;-) ).

Thanks!

ATB,
Ramsay Jones



Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-15 Thread Junio C Hamano
Johannes Schindelin  writes:

> Stolee, you definitely want to inspect those changes (`git log --check`
> was introduced to show you whitespace problems). If all of those
> whitespace issues are unintentional, you can fix them using `git rebase
> --whitespace=fix` in the most efficient way.

Another way that may be easier (depending on the way Derrick works)
is to fetch from me and start working from there, as if they were
the last set of commits that were sent to the list.  "git log
--first-parent --oneline master..pu" would show where the tip of the
topic is.



Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-15 Thread Johannes Schindelin
Hi Junio,

On Wed, 14 Mar 2018, Junio C Hamano wrote:

> A few patches add trailing blank lines and other whitespace
> breakages, which will stop my "git merge" later to 'next' and down,
> as I have a pre-commit hook to catch them.

I wonder how you cope with the intentional "whitespace breakage" caused by
a TAB after HS in my recreate-merges patch series...

> Here is the output from my "git am -s" session.
> 
> Applying: csum-file: rename hashclose() to finalize_hashfile()
> Applying: csum-file: refactor finalize_hashfile() method
> .git/rebase-apply/patch:109: new blank line at EOF.

Stolee, you definitely want to inspect those changes (`git log --check`
was introduced to show you whitespace problems). If all of those
whitespace issues are unintentional, you can fix them using `git rebase
--whitespace=fix` in the most efficient way.

Ciao,
Dscho


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-14 Thread Junio C Hamano
Derrick Stolee  writes:

> This v6 includes feedback around csum-file.c and the rename of hashclose()
> to finalize_hashfile(). These are the first two commits of the series, so
> they could be pulled out independently.
>
> The only other change since v5 is that I re-ran the performance numbers
> in "commit: integrate commit graph with commit parsing".

Thanks.

> Hopefully this version is ready to merge. I have several follow-up topics
> in mind to submit soon after, including:

A few patches add trailing blank lines and other whitespace
breakages, which will stop my "git merge" later to 'next' and down,
as I have a pre-commit hook to catch them.

Here is the output from my "git am -s" session.

Applying: csum-file: rename hashclose() to finalize_hashfile()
Applying: csum-file: refactor finalize_hashfile() method
.git/rebase-apply/patch:109: new blank line at EOF.
+
warning: 1 line adds whitespace errors.
Applying: commit-graph: add format document
.git/rebase-apply/patch:175: new blank line at EOF.
+
warning: 1 line adds whitespace errors.
Applying: graph: add commit graph design document
.git/rebase-apply/patch:42: new blank line at EOF.
+
.git/rebase-apply/patch:109: new blank line at EOF.
+
warning: 2 lines add whitespace errors.
Applying: commit-graph: create git-commit-graph builtin
.git/rebase-apply/patch:323: space before tab in indent.
fd = hold_lock_file_for_update(, graph_name, 0);
.git/rebase-apply/patch:334: space before tab in indent.
fd = hold_lock_file_for_update(, graph_name, 
LOCK_DIE_ON_ERROR);
.git/rebase-apply/patch:385: new blank line at EOF.
+
.git/rebase-apply/patch:398: new blank line at EOF.
+
warning: 2 lines applied after fixing whitespace errors.
Applying: commit-graph: implement write_commit_graph()
.git/rebase-apply/patch:138: indent with spaces.
cd "$TRASH_DIRECTORY/full" &&
.git/rebase-apply/patch:144: indent with spaces.
cd "$TRASH_DIRECTORY/full" &&
.git/rebase-apply/patch:154: indent with spaces.
cd "$TRASH_DIRECTORY/full" &&
.git/rebase-apply/patch:160: indent with spaces.
cd "$TRASH_DIRECTORY/full" &&
.git/rebase-apply/patch:197: indent with spaces.
cd "$TRASH_DIRECTORY/full" &&
warning: squelched 6 whitespace errors
warning: 10 lines applied after fixing whitespace errors.
Applying: commit-graph: implement 'git-commit-graph write'
Test number t5318 already taken
.git/rebase-apply/patch:346: indent with spaces.
cd "$TRASH_DIRECTORY/full" &&
.git/rebase-apply/patch:356: indent with spaces.
cd "$TRASH_DIRECTORY/full" &&
.git/rebase-apply/patch:366: indent with spaces.
cd "$TRASH_DIRECTORY/full" &&
.git/rebase-apply/patch:374: indent with spaces.
cd "$TRASH_DIRECTORY/full" &&
.git/rebase-apply/patch:384: indent with spaces.
cd "$TRASH_DIRECTORY/bare" &&
warning: 5 lines add whitespace errors.
Applying: commit-graph: implement git commit-graph read
Applying: commit-graph: add core.commitGraph setting
Applying: commit-graph: close under reachability
.git/rebase-apply/patch:302: indent with spaces.
cd "$TRASH_DIRECTORY/full" &&
.git/rebase-apply/patch:310: indent with spaces.
cd "$TRASH_DIRECTORY/full" &&
.git/rebase-apply/patch:321: indent with spaces.
cd "$TRASH_DIRECTORY/full" &&
.git/rebase-apply/patch:331: indent with spaces.
cd "$TRASH_DIRECTORY/full" &&
.git/rebase-apply/patch:341: indent with spaces.
cd "$TRASH_DIRECTORY/full" &&
warning: squelched 2 whitespace errors
warning: 7 lines add whitespace errors.
Applying: commit: integrate commit graph with commit parsing
.git/rebase-apply/patch:224: indent with spaces.
cd "$TRASH_DIRECTORY/full" &&
.git/rebase-apply/patch:227: trailing whitespace.
graph_read_expect "9" "large_edges" 
.git/rebase-apply/patch:234: indent with spaces.
cd "$TRASH_DIRECTORY" &&
warning: 2 lines applied after fixing whitespace errors.
Applying: commit-graph: read only from specific pack-indexes
.git/rebase-apply/patch:196: indent with spaces.
cd "$TRASH_DIRECTORY/full" &&
.git/rebase-apply/patch:209: indent with spaces.
cd "$TRASH_DIRECTORY" &&
warning: 1 line applied after fixing whitespace errors.
Applying: commit-graph: build graph from starting commits
.git/rebase-apply/patch:148: indent with spaces.
cd "$TRASH_DIRECTORY/full" &&
.git/rebase-apply/patch:158: indent with spaces.
cd "$TRASH_DIRECTORY" &&
warning: 1 line applied after fixing whitespace errors.
Applying: commit-graph: implement "--additive" option



Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-14 Thread Ramsay Jones


On 14/03/18 19:27, Derrick Stolee wrote:
> This v6 includes feedback around csum-file.c and the rename of hashclose()
> to finalize_hashfile(). These are the first two commits of the series, so
> they could be pulled out independently.
> 
> The only other change since v5 is that I re-ran the performance numbers
> in "commit: integrate commit graph with commit parsing".

I haven't looked at v6 (I will wait for it to hit pu), but v5 is
still causing sparse to complain.

The diff given below (on top of current pu @9e418c7c9), fixes it
for me. (Using a plain integer as a NULL pointer, in builtin/commit-
graph.c, and the 'commit_graph' symbol should be file-local, in
commit-graph.c).

Thanks!

ATB,
Ramsay Jones

-- >8 --
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 62ac26e44..855df66bd 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -31,7 +31,7 @@ static struct opts_commit_graph {
 
 static int graph_read(int argc, const char **argv)
 {
-   struct commit_graph *graph = 0;
+   struct commit_graph *graph = NULL;
char *graph_name;
 
static struct option builtin_commit_graph_read_options[] = {
diff --git a/commit-graph.c b/commit-graph.c
index 631edac4c..7b45fe85d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -182,7 +182,7 @@ struct commit_graph *load_commit_graph_one(const char 
*graph_file)
 }
 
 /* global storage */
-struct commit_graph *commit_graph = NULL;
+static struct commit_graph *commit_graph = NULL;
 
 static void prepare_commit_graph_one(const char *obj_dir)
 {



[PATCH v6 00/14] Serialized Git Commit Graph

2018-03-14 Thread Derrick Stolee
This v6 includes feedback around csum-file.c and the rename of hashclose()
to finalize_hashfile(). These are the first two commits of the series, so
they could be pulled out independently.

The only other change since v5 is that I re-ran the performance numbers
in "commit: integrate commit graph with commit parsing".

Hopefully this version is ready to merge. I have several follow-up topics
in mind to submit soon after, including:

* Auto-generate the commit graph as the repo changes:
   i. teach git-commit-graph an "fsck" subcommand and integrate with git-fsck
  ii. teach git-repack to call git-commit-graph
* Generation numbers:
   i. teach git-commit-graph to compute generation numbers
  ii. consume generation numbers in paint_down_to_common()
* Move globals from commit-graph.c to the_repository

The three bullets (*) are relatively independent but have sub-items that
appear in priority order.

Derrick Stolee (14):
  csum-file: rename hashclose() to finalize_hashfile()
  csum-file: refactor finalize_hashfile() method
  commit-graph: add format document
  graph: add commit graph design document
  commit-graph: create git-commit-graph builtin
  commit-graph: implement write_commit_graph()
  commit-graph: implement 'git-commit-graph write'
  commit-graph: implement git commit-graph read
  commit-graph: add core.commitGraph setting
  commit-graph: close under reachability
  commit: integrate commit graph with commit parsing
  commit-graph: read only from specific pack-indexes
  commit-graph: build graph from starting commits
  commit-graph: implement "--additive" option

 .gitignore  |   1 +
 Documentation/config.txt|   3 +
 Documentation/git-commit-graph.txt  |  93 +++
 Documentation/technical/commit-graph-format.txt |  98 
 Documentation/technical/commit-graph.txt| 164 ++
 Makefile|   2 +
 alloc.c |   1 +
 builtin.h   |   1 +
 builtin/commit-graph.c  | 172 ++
 builtin/index-pack.c|   2 +-
 builtin/pack-objects.c  |   6 +-
 bulk-checkin.c  |   4 +-
 cache.h |   1 +
 command-list.txt|   1 +
 commit-graph.c  | 719 
 commit-graph.h  |  47 ++
 commit.c|   3 +
 commit.h|   3 +
 config.c|   5 +
 contrib/completion/git-completion.bash  |   2 +
 csum-file.c |  10 +-
 csum-file.h |   9 +-
 environment.c   |   1 +
 fast-import.c   |   2 +-
 git.c   |   1 +
 pack-bitmap-write.c |   2 +-
 pack-write.c|   5 +-
 packfile.c  |   4 +-
 packfile.h  |   2 +
 t/t5318-commit-graph.sh | 225 
 30 files changed, 1568 insertions(+), 21 deletions(-)
 create mode 100644 Documentation/git-commit-graph.txt
 create mode 100644 Documentation/technical/commit-graph-format.txt
 create mode 100644 Documentation/technical/commit-graph.txt
 create mode 100644 builtin/commit-graph.c
 create mode 100644 commit-graph.c
 create mode 100644 commit-graph.h
 create mode 100755 t/t5318-commit-graph.sh


base-commit: d0db9edba0050ada6f6eac68061599690d2a4333
-- 
2.14.1