Re: Ideas to speed up repacking
> Martin Fick writes:
> > * Setup 1:
> >
> >   Do a full repack. All loose and packed objects are added ...
> >
> > * Scenario 1:
> >
> >   Start with Setup 1. Nothing has changed in the repo contents (no new objects/packs, refs all the same), but repacking config options have changed (for example, the compression level has changed).

On Tuesday, December 03, 2013 10:50:07 am Junio C Hamano wrote:
> Duy Nguyen writes:
> > Reading Martin's mail again I wonder how we just "grab all objects and skip history traversal". Who will decide object order in the new pack if we don't traverse history and collect path information?
>
> I vaguely recall raising a related topic for "quick repack, assuming everything in existing packfiles is reachable, that only removes loose cruft" several weeks ago. Once you decide that your quick repack does not care about ejecting objects from existing packs, which is where I suspect Martin's outline will lead us, we can repack the reachable loose ones on the recent surface of the history and then concatenate the contents of existing packs, excluding duplicates and possibly adjusting the delta base offsets for some entries, without traversing the bulk of the history.

From this, it sounds like scenario 1 (a single pack being repacked) might then be doable (just trying to establish a really simple baseline)? Except that it would potentially not result in the same ordering without traversing history? Or would the current pack ordering be preserved and thus be correct?

-Martin

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: Ideas to speed up repacking
Duy Nguyen writes:

> Reading Martin's mail again I wonder how we just "grab all objects and skip history traversal". Who will decide object order in the new pack if we don't traverse history and collect path information?

I vaguely recall raising a related topic for "quick repack, assuming everything in existing packfiles is reachable, that only removes loose cruft" several weeks ago. Once you decide that your quick repack does not care about ejecting objects from existing packs, which is where I suspect Martin's outline will lead us, we can repack the reachable loose ones on the recent surface of the history and then concatenate the contents of existing packs, excluding duplicates and possibly adjusting the delta base offsets for some entries, without traversing the bulk of the history.
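The "repack the reachable loose ones and leave existing packs alone" half of this maps roughly onto an incremental repack that stock git can already do; a minimal sketch under that assumption (the pack-concatenation half has no existing command and is only described above):

    : pack reachable loose objects into a new pack, keeping existing packs untouched
    $ git repack -d

    : roughly the same thing spelled out with plumbing
    $ git rev-list --objects --all |
      git pack-objects --incremental .git/objects/pack/pack

The concatenation step is the part that would need new code; by construction it would preserve whatever object ordering the existing packs already have.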
Re: Ideas to speed up repacking
On Tue, Dec 3, 2013 at 2:17 PM, Junio C Hamano wrote:
> Duy Nguyen writes:
>
>>> If nothing else has happened in the repository, perhaps, but I suspect that the real problem is how you would prove it. For example, I am guessing that your Scenario 4 could be something like:
>>>
>>>     : setup #1
>>>     $ git repack -a -d -f
>>>     $ git prune
>>>
>>>     : scenario #4
>>>     $ git commit --allow-empty -m 'new commit'
>>>
>>> which would add a single loose object to the repository, advancing the current branch ref by one commit, fast-forwarding relative to the state you were in after setup #1.
>>>
>>> But how would you efficiently prove that it was the only thing that happened?
>>
>> Shawn mentioned elsewhere that we could generate a bundle header and keep it in a pack-XXX.bh file at pack creation time. With that information we could verify whether a ref has been reset, just fast-forwarded, or even deleted.
>
> With what information? If you keep the back-then-current information and nothing else, how would you differentiate between the simple scenario #4 above and the 'lost and new' two-commit version of the scenario? The endpoints should both show that one ref (and only one ref) advanced by one commit, but one has cruft in the object database while the other does not.

Yeah, I was wrong. Reading Martin's mail again I wonder how we just "grab all objects and skip history traversal". Who will decide object order in the new pack if we don't traverse history and collect path information?
--
Duy
Re: Ideas to speed up repacking
Duy Nguyen writes:

>> If nothing else has happened in the repository, perhaps, but I suspect that the real problem is how you would prove it. For example, I am guessing that your Scenario 4 could be something like:
>>
>>     : setup #1
>>     $ git repack -a -d -f
>>     $ git prune
>>
>>     : scenario #4
>>     $ git commit --allow-empty -m 'new commit'
>>
>> which would add a single loose object to the repository, advancing the current branch ref by one commit, fast-forwarding relative to the state you were in after setup #1.
>>
>> But how would you efficiently prove that it was the only thing that happened?
>
> Shawn mentioned elsewhere that we could generate a bundle header and keep it in a pack-XXX.bh file at pack creation time. With that information we could verify whether a ref has been reset, just fast-forwarded, or even deleted.

With what information? If you keep the back-then-current information and nothing else, how would you differentiate between the simple scenario #4 above and the 'lost and new' two-commit version of the scenario? The endpoints should both show that one ref (and only one ref) advanced by one commit, but one has cruft in the object database while the other does not.

>> Also with Scenario #2, how would you prove that the new pack does not contain any cruft that is not reachable? When receiving a pack and updating our refs, we only prove that we have all the objects needed to complete the updated refs---we do not reject packs with cruft that is not necessary.
>
> We trust the pack producer to do it correctly, I guess. If a pack producer guarantees not to store any cruft, it could mark the pack somehow.

That is not an answer. Since when do we design to blindly trust anybody on the other end of the wire?
Re: Ideas to speed up repacking
On Tue, Dec 3, 2013 at 7:44 AM, Junio C Hamano wrote:
>> * Scenario 4:
>>
>>   Starts with Setup 1. Add some loose objects to the repo via a local fast-forward ref update (I am assuming this is possible without adding any new unreferenced objects?)
>>
>> In all 4 scenarios, I believe we should be able to skip history traversal and simply grab all objects and repack them into a new file?
>
> If nothing else has happened in the repository, perhaps, but I suspect that the real problem is how you would prove it. For example, I am guessing that your Scenario 4 could be something like:
>
>     : setup #1
>     $ git repack -a -d -f
>     $ git prune
>
>     : scenario #4
>     $ git commit --allow-empty -m 'new commit'
>
> which would add a single loose object to the repository, advancing the current branch ref by one commit, fast-forwarding relative to the state you were in after setup #1.
>
> But how would you efficiently prove that it was the only thing that happened?

Shawn mentioned elsewhere that we could generate a bundle header and keep it in a pack-XXX.bh file at pack creation time. With that information we could verify whether a ref has been reset, just fast-forwarded, or even deleted.

> Also with Scenario #2, how would you prove that the new pack does not contain any cruft that is not reachable? When receiving a pack and updating our refs, we only prove that we have all the objects needed to complete the updated refs---we do not reject packs with cruft that is not necessary.

We trust the pack producer to do it correctly, I guess. If a pack producer guarantees not to store any cruft, it could mark the pack somehow.
--
Duy
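The pack-XXX.bh file does not exist in git today; as a hedged sketch of the kind of check such recorded ref information would enable, the same idea can be approximated with a plain snapshot of the refs written next to the pack (pack-XXX.refs is a made-up name here, and only stock commands are used):

    : at pack creation time, snapshot the refs next to the new pack
    $ git for-each-ref --format='%(objectname) %(refname)' >pack-XXX.refs

    : later, classify what happened to each recorded ref
    $ while read old ref
      do
        new=$(git rev-parse --verify --quiet "$ref") ||
          { echo "$ref deleted"; continue; }
        if test "$old" = "$new"
        then echo "$ref unchanged"
        elif git merge-base --is-ancestor "$old" "$new"
        then echo "$ref fast-forwarded"
        else echo "$ref rewound or otherwise reset"
        fi
      done <pack-XXX.refs

This only classifies ref movement; as discussed elsewhere in the thread, it cannot by itself prove that no unreachable objects were left behind.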
Re: Ideas to speed up repacking
Martin Fick writes:

> I wanted to explore the idea of exploiting knowledge about previous repacks to help speed up future repacks.
>
> I had various ideas that seemed like they might be good places to start, but things quickly got away from me. Mainly I wanted to focus on reducing, and even sometimes eliminating, reachability calculations, since that seems to be the one major unsolved slow piece during repacking.
>
> My first line of thinking goes like this: "After a full repack, reachability of the current refs is known. Exploit that knowledge for future repacks." There are some very simple scenarios where, if we could figure out how to identify them reliably, I think we could simply avoid reachability calculations entirely and yet end up with the same repacked files as if we had done the reachability calculations. Let me outline some to see if they make sense as a starting place for further discussion.
>
> -
>
> * Setup 1:
>
>   Do a full repack. All loose and packed objects are added to a single pack file (assumes git config repack options do not create multiple packs).
>
> * Scenario 1:
>
>   Start with Setup 1. Nothing has changed in the repo contents (no new objects/packs, refs all the same), but repacking config options have changed (for example, the compression level has changed).
>
> * Scenario 2:
>
>   Starts with Setup 1. Add one new pack file that was pushed to the repo by adding a new ref to the repo (existing refs did not change).
>
> * Scenario 3:
>
>   Starts with Setup 1. Add one new pack file that was pushed to the repo by updating an existing ref with a fast forward.
>
> * Scenario 4:
>
>   Starts with Setup 1. Add some loose objects to the repo via a local fast-forward ref update (I am assuming this is possible without adding any new unreferenced objects?)
>
> In all 4 scenarios, I believe we should be able to skip history traversal and simply grab all objects and repack them into a new file?

If nothing else has happened in the repository, perhaps, but I suspect that the real problem is how you would prove it. For example, I am guessing that your Scenario 4 could be something like:

    : setup #1
    $ git repack -a -d -f
    $ git prune

    : scenario #4
    $ git commit --allow-empty -m 'new commit'

which would add a single loose object to the repository, advancing the current branch ref by one commit, fast-forwarding relative to the state you were in after setup #1.

But how would you efficiently prove that it was the only thing that happened? The user could have done this instead of a single commit:

    : scenario #4 look-alike
    $ git commit --allow-empty -m 'lost commit'
    $ git reset --hard HEAD^
    $ git commit --allow-empty -m 'new commit'

and the reflog entry for HEAD or the current branch ref for that lost commit may already be ancient by the time you looked at this state. Your object database has two loose commits, and you would want to lose the older 'lost commit', which is not reachable.

Also with Scenario #2, how would you prove that the new pack does not contain any cruft that is not reachable? When receiving a pack and updating our refs, we only prove that we have all the objects needed to complete the updated refs---we do not reject packs with cruft that is not necessary.
These two are only examples, and we might be able to convince ourselves that not pruning (or not ejecting cruft from packs) is OK. But that would be introducing a different mode of operation, not optimizing repacking without changing what "repacking" means. (I am not saying it is bad to change the meaning if we can make a good argument about the pros and cons; a small amount of bloat might be acceptable in exchange for a good enough performance gain, but not if the user is using repack && prune as a way to eradicate undesirable contents from the object database.)
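The difference between the two sequences above can be made concrete with stock commands: the ref endpoints look identical, but only the look-alike leaves an unreachable commit behind:

    $ git fsck --unreachable   # lists the 'lost commit' object in the look-alike case
    $ git count-objects -v     # loose-object counts also differ between the two

Of course, running fsck is itself a full reachability walk, which is exactly the cost the proposal is trying to avoid.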
Ideas to speed up repacking
I wanted to explore the idea of exploiting knowledge about previous repacks to help speed up future repacks.

I had various ideas that seemed like they might be good places to start, but things quickly got away from me. Mainly I wanted to focus on reducing, and even sometimes eliminating, reachability calculations, since that seems to be the one major unsolved slow piece during repacking.

My first line of thinking goes like this: "After a full repack, reachability of the current refs is known. Exploit that knowledge for future repacks." There are some very simple scenarios where, if we could figure out how to identify them reliably, I think we could simply avoid reachability calculations entirely and yet end up with the same repacked files as if we had done the reachability calculations. Let me outline some to see if they make sense as a starting place for further discussion.

-

* Setup 1:

  Do a full repack. All loose and packed objects are added to a single pack file (assumes git config repack options do not create multiple packs).

* Scenario 1:

  Start with Setup 1. Nothing has changed in the repo contents (no new objects/packs, refs all the same), but repacking config options have changed (for example, the compression level has changed).

* Scenario 2:

  Starts with Setup 1. Add one new pack file that was pushed to the repo by adding a new ref to the repo (existing refs did not change).

* Scenario 3:

  Starts with Setup 1. Add one new pack file that was pushed to the repo by updating an existing ref with a fast forward.

* Scenario 4:

  Starts with Setup 1. Add some loose objects to the repo via a local fast-forward ref update (I am assuming this is possible without adding any new unreferenced objects?)

In all 4 scenarios, I believe we should be able to skip history traversal and simply grab all objects and repack them into a new file?

-

Of the 4 scenarios above, it seems like #3 and #4 are very common operations (#2 is perhaps even more common for Gerrit)? If these scenarios can be reliably identified somehow, then perhaps they could be used to reduce repacking time for these scenarios, and later used as building blocks to reduce repacking time for other related but slightly more complicated scenarios (with reduced history walking instead of none)?

For example, to identify scenario 1, what if we kept a copy of all refs and their shas used during a full repack along with the newly repacked file? A simplistic approach would store them in the same format as the packed-refs file, as pack-<sha>.refs. During repacking, if none of the refs have changed and there are no new objects... Then, if none of the refs have changed and there are new objects, we can just throw the new objects away? ...

I am going to stop here because this email is long enough, and I wanted to get some feedback on the ideas first before offering more solutions.

Thanks,

-Martin

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
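A minimal sketch of the scenario-1 check described above, assuming a hypothetical pack-XXX.refs snapshot written at full-repack time (neither the file nor the check exists in git; the packed-refs-like format here is just '<sha> <refname>' lines, and the pack/loose-object tests are only one rough way to approximate "no new objects"):

    : at full repack time, record the refs the pack was built from
    $ git for-each-ref --format='%(objectname) %(refname)' | sort >pack-XXX.refs

    : at the next repack, test for scenario 1: same refs, one pack, no loose objects
    $ git for-each-ref --format='%(objectname) %(refname)' | sort >current.refs
    $ if cmp -s pack-XXX.refs current.refs &&
         test "$(git count-objects | cut -d' ' -f1)" -eq 0 &&
         test "$(ls .git/objects/pack/*.pack | wc -l)" -eq 1
      then
        echo "scenario 1: refs unchanged, single pack, no new objects"
      fi

In that case the repack would only need to re-deltify and recompress the objects already in the single pack under the new config, with no reachability walk.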