Re: commit-message attack for extracting sensitive data from rewritten Git history
Am 4/8/2013 23:54, schrieb Jeff King: Yeah, it would make sense for filter-branch to have a --map-commit-ids option or similar that does the update. At first I thought it might take two passes, but I don't think it is necessary, as long as we traverse the commits topologically (i.e., you cannot have mentioned X in a commit that is an ancestor of X, so you do not have to worry about mapping it until after it has been processed). Topological traversal is not sufficient. Consider this history: o--A--o-- / / --o--B--o If A mentions B (think of cherry-pick -x), then you must ensure that the branch containing B was traversed first. -- Hannes -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: commit-message attack for extracting sensitive data from rewritten Git history
On Tue, Apr 09, 2013 at 08:03:24AM +0200, Johannes Sixt wrote: Am 4/8/2013 23:54, schrieb Jeff King: Yeah, it would make sense for filter-branch to have a --map-commit-ids option or similar that does the update. At first I thought it might take two passes, but I don't think it is necessary, as long as we traverse the commits topologically (i.e., you cannot have mentioned X in a commit that is an ancestor of X, so you do not have to worry about mapping it until after it has been processed). Topological traversal is not sufficient. Consider this history: o--A--o-- / / --o--B--o If A mentions B (think of cherry-pick -x), then you must ensure that the branch containing B was traversed first. Yeah, you're right. Multiple passes are necessary to get it completely right. And because each pass may change more commit id's, you have to recurse to pick up those changes, and keep going until you have a pass with no changes. But I haven't thought that hard about it. There might be a clever optimization where you can prune out parts of the history (e.g., if you know that all changes to consider are in descendants of a commit, you do not have to care about rewriting the commit or its ancestors). -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: commit-message attack for extracting sensitive data from rewritten Git history
On 9 April 2013 18:01, Jeff King p...@peff.net wrote: On Tue, Apr 09, 2013 at 08:03:24AM +0200, Johannes Sixt wrote: If A mentions B (think of cherry-pick -x), then you must ensure that the branch containing B was traversed first. Yeah, you're right. Multiple passes are necessary to get it completely right. And because each pass may change more commit id's, you have to recurse to pick up those changes, and keep going until you have a pass with no changes. Just to give some context on how the BFG handles this (without doing multiple passes): The BFG makes a design choice (based on it's intended use-case of annihilating unwanted data) that a specific tree or blob will always be cleaned in exactly the same way - because when you're trying to get rid of large blobs or private data, you most likely /don't care/ where it is, what commit it belongs to, how old it is. The id for a cleaned tree or blob is always the same no matter where it came from, and so the BFG maintains a in-memory mapping of 'dirty' to 'clean' object ids while cleaning a repo - whenever an object (commit, tag, tree, blob) is cleaned, these values are stored in the map: dirty-id - clean-id clean-id - clean-id (in terms of memory overhead, this amounts to only ~ 128MB for even quite a large repo like the linux kernel, so I don't spend much time worrying about it) The map memoises the cleaning functions on all objects, so an object (particularly a tree) never gets cleaned more than once, which is one of the things that makes the BFG fast. Having these memoised functions makes cleaning commit messages fairly easy - the message is grepped for hex strings more than a few characters in length, and if a matched string resolves uniquely to an object id in the repo, the clean() method is called on it to get the cleaned id - which will either return immediately with a previously calculated result, or if the id came from a different branch, trigger a cascade of more cleaning, eventually returning the required cleaned id. In the case of git-filter-branch, the user has a lot more freedom to change the tree-structure of commits on a commit-by-commit basis, so memoising tree-cleaning is out of the question, but I guess it might be possible to do memoisation of just the commit ids to short-cut the multiple-pass problem. - Roberto Tyley -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: commit-message attack for extracting sensitive data from rewritten Git history
Roberto Tyley roberto.ty...@gmail.com writes: Here's an unmodified repo, in which the user unwisely committed a database password: https://github.com/bfg-repo-cleaner-demos/gma-demo-repo-original/commit/8c9cfe3c The unwise commit is reverted with a second commit using 'git revert', which obviously leaves the password in Git history, and - some time later - it's decided to properly clean the repo history with git-filter-branch git gc, purging the password so the repo can be more widely shared (open-sourced, or just externally hosted). git-filter-branch works exactly as intended, purging the password, but the one thing it does not- typically - do is update the commit message The git-filter-branch command has a --msg-filter option which could be used for this purpose, with the application of some judicious bash-scripting, grepsed-ing. However, I must confess that I believe users would be better advised to use The BFG: http://rtyley.github.io/bfg-repo-cleaner/ With or without the security issue, leaving old object names that will become irrelevant in the rewritten history will make the resulting history less useful, simply because people cannot look at the objects these messages refer to. The same argument is behind the reason why cherry-pick -x was originally the default, found to be a mistake and made optional. filter-branch provides map helper function to help mapping old object names to rewritten object names, but stops there; it leaves it up to the message filter script to identify what string in the message is an object name to be rewritten. It can be taught to be more helpful to the message filter writers, and you seem to have done so in BFG, which is very good. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
commit-message attack for extracting sensitive data from rewritten Git history
This is a demonstration of a mildly-interesting security concern relating to Git git-filter-branch - not a vulnerability in Git itself, just in the way it can be used. I thought it was interesting to demonstrate that there is sometimes an avenue of attack for recovering sensitive data that's been removed from Git history using git-filter-branch. I think it's a low-severity issue, you may wish to ignore this, and indeed I've been very politely told already that it's clearly nonsense :) Here's an unmodified repo, in which the user unwisely committed a database password: https://github.com/bfg-repo-cleaner-demos/gma-demo-repo-original/commit/8c9cfe3c The unwise commit is reverted with a second commit using 'git revert', which obviously leaves the password in Git history, and - some time later - it's decided to properly clean the repo history with git-filter-branch git gc, purging the password so the repo can be more widely shared (open-sourced, or just externally hosted). git-filter-branch works exactly as intended, purging the password, but the one thing it does not- typically - do is update the commit message. So in the cleaned repo, the commit message for the revert commit still looks like this: https://github.com/bfg-repo-cleaner-demos/gma-demo-repo-git-filter-branch-cleaned/commit/bf0637a5 It contains a commit id (8c9cfe3) which is no longer in the repo, but can very easily be associated with an existing commit simply by examining the subject line of the reverted commit (Carelessly checking password into source control). It's also obvious, from examining the repo, where the excised data was removed (ie at the db.password= line). At this point it's possible to do a brute-force attack where you generate possible passwords, insert them into the available commit's tree, and compare them against the leaked commit id. When the the commit id matches, the sensitive data has been recovered. A proof-of-concept implementation of this attack was indeed able to recover the purged password: -- $ java -jar gma-0.1.jar 8c9cfe3c attack-pinpoint gma-demo-repo-git-filter-branch-cleaned Brute-force search using these characters : 0123456789abcdefghijklmnopqrstuvwxyz Available commit, presumed cleaned : 8ebbf661 File path : src/main/resources/config.properties Template blob : dca1a2fb Exhausted strings of length 1 or less ... Exhausted strings of length 4 or less Match with '0g6rw' -- So all of this amounts to a fairly low severity issue - people should always change credentials when they mistakenly commit them to a repo - but I guess the point is that from a paranoia point of view, you want to remove all information - including old commit hashes buried in commit messages - that relate to sensitive data when you clean a repo for sharing. The git-filter-branch command has a --msg-filter option which could be used for this purpose, with the application of some judicious bash-scripting, grepsed-ing. However, I must confess that I believe users would be better advised to use The BFG: http://rtyley.github.io/bfg-repo-cleaner/ The BFG already addresses this issue by replacing all old Git object-ids found in commit/tag messages with the updated id. For instance, here's that exact same commit message when cleaned with the BFG: https://github.com/bfg-repo-cleaner-demos/gma-demo-repo-bfg-cleaned/commit/35840201 In the case that the users specifies a filtering operation is not removing 'private' data, the BFG replaces old ids with text of the form 'newid [formerly oldid], but if the operation is in fact to strip private data, the replacement value is simply the newid - and without the old commit id, the attack described above is not possible. I believe it's worth educating users to give them a more realistic understanding of their exposure, and would like to update the documentation of git-filter-branch to give them a better idea of their options for removing private data - that would include noting the BFG as alternative. - Roberto Tyley https://github.com/rtyley/bfg-repo-cleaner/blob/v1.2.0/src/main/scala/com/madgag/git/bfg/cleaner/ObjectIdSubstitutor.scala#L33-L60 -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html