Re: commit-message attack for extracting sensitive data from rewritten Git history

2013-04-09 Thread Johannes Sixt
Am 4/8/2013 23:54, schrieb Jeff King:
 Yeah, it would make sense for filter-branch to have a --map-commit-ids
 option or similar that does the update. At first I thought it might take
 two passes, but I don't think it is necessary, as long as we traverse
 the commits topologically (i.e., you cannot have mentioned X in a commit
 that is an ancestor of X, so you do not have to worry about mapping it
 until after it has been processed).

Topological traversal is not sufficient. Consider this history:

 o--A--o--
/ /
 --o--B--o

If A mentions B (think of cherry-pick -x), then you must ensure that the
branch containing B was traversed first.

-- Hannes
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: commit-message attack for extracting sensitive data from rewritten Git history

2013-04-09 Thread Jeff King
On Tue, Apr 09, 2013 at 08:03:24AM +0200, Johannes Sixt wrote:

 Am 4/8/2013 23:54, schrieb Jeff King:
  Yeah, it would make sense for filter-branch to have a --map-commit-ids
  option or similar that does the update. At first I thought it might take
  two passes, but I don't think it is necessary, as long as we traverse
  the commits topologically (i.e., you cannot have mentioned X in a commit
  that is an ancestor of X, so you do not have to worry about mapping it
  until after it has been processed).
 
 Topological traversal is not sufficient. Consider this history:
 
  o--A--o--
 / /
  --o--B--o
 
 If A mentions B (think of cherry-pick -x), then you must ensure that the
 branch containing B was traversed first.

Yeah, you're right. Multiple passes are necessary to get it
completely right. And because each pass may change more commit id's, you
have to recurse to pick up those changes, and keep going until you have
a pass with no changes.

But I haven't thought that hard about it. There might be a clever
optimization where you can prune out parts of the history (e.g., if you
know that all changes to consider are in descendants of a commit, you do
not have to care about rewriting the commit or its ancestors).

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: commit-message attack for extracting sensitive data from rewritten Git history

2013-04-09 Thread Roberto Tyley
On 9 April 2013 18:01, Jeff King p...@peff.net wrote:
 On Tue, Apr 09, 2013 at 08:03:24AM +0200, Johannes Sixt wrote:
 If A mentions B (think of cherry-pick -x), then you must ensure that the
 branch containing B was traversed first.

 Yeah, you're right. Multiple passes are necessary to get it
 completely right. And because each pass may change more commit id's, you
 have to recurse to pick up those changes, and keep going until you have
 a pass with no changes.

Just to give some context on how the BFG handles this (without doing
multiple passes):

The BFG makes a design choice (based on it's intended use-case of
annihilating unwanted data) that a specific tree or blob will always
be cleaned in exactly the same way - because when you're trying to get
rid of large blobs or private data, you most likely /don't care/ where
it is, what commit it belongs to, how old it is. The id for a cleaned
tree or blob is always the same no matter where it came from, and so
the BFG maintains a in-memory mapping of 'dirty' to 'clean' object ids
while cleaning a repo - whenever an object (commit, tag, tree, blob)
is cleaned, these values are stored in the map:


  dirty-id - clean-id
  clean-id - clean-id

(in terms of memory overhead, this amounts to only ~ 128MB for even
quite a large repo like the linux kernel, so I don't spend much time
worrying about it)


The map memoises the cleaning functions on all objects, so an object
(particularly a tree) never gets cleaned more than once, which is one
of the things that makes the BFG fast.

Having these memoised functions makes cleaning commit messages fairly
easy - the message is grepped for hex strings more than a few
characters in length, and if a matched string resolves uniquely to an
object id in the repo, the clean() method is called on it to get the
cleaned id - which will either return immediately with a previously
calculated result, or if the id came from a different branch, trigger
a cascade of more cleaning, eventually returning the required cleaned
id.

In the case of git-filter-branch, the user has a lot more freedom to
change the tree-structure of commits on a commit-by-commit basis, so
memoising tree-cleaning is out of the question, but I guess it might
be possible to do memoisation of just the commit ids to short-cut the
multiple-pass problem.

- Roberto Tyley
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: commit-message attack for extracting sensitive data from rewritten Git history

2013-04-08 Thread Junio C Hamano
Roberto Tyley roberto.ty...@gmail.com writes:

 Here's an unmodified repo, in which the user unwisely committed a
 database password:

 https://github.com/bfg-repo-cleaner-demos/gma-demo-repo-original/commit/8c9cfe3c

 The unwise commit is reverted with a second commit using 'git revert',
 which obviously leaves the password in Git history, and - some time
 later - it's decided to properly clean the repo history with
 git-filter-branch  git gc, purging the password so the repo can be
 more widely shared (open-sourced, or just externally hosted).

 git-filter-branch works exactly as intended, purging the password, but
 the one thing it does not- typically - do is update the commit
 message
  The git-filter-branch command has a --msg-filter option
 which could be used for this purpose, with the application of some
 judicious bash-scripting, grepsed-ing. However, I must confess that I
 believe users would be better advised to use The BFG:

 http://rtyley.github.io/bfg-repo-cleaner/

With or without the security issue, leaving old object names that
will become irrelevant in the rewritten history will make the
resulting history less useful, simply because people cannot look at
the objects these messages refer to. The same argument is behind the
reason why cherry-pick -x was originally the default, found to be
a mistake and made optional.

filter-branch provides map helper function to help mapping old
object names to rewritten object names, but stops there; it leaves
it up to the message filter script to identify what string in the
message is an object name to be rewritten.

It can be taught to be more helpful to the message filter writers,
and you seem to have done so in BFG, which is very good.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


commit-message attack for extracting sensitive data from rewritten Git history

2013-04-07 Thread Roberto Tyley
This is a demonstration of a mildly-interesting security concern
relating to Git  git-filter-branch - not a vulnerability in Git
itself, just in the way it can be used. I thought it was interesting
to demonstrate that there is sometimes an avenue of attack for
recovering sensitive data that's been removed from Git history using
git-filter-branch. I think it's a low-severity issue, you may wish to
ignore this, and indeed I've been very politely told already that it's
clearly nonsense :)

Here's an unmodified repo, in which the user unwisely committed a
database password:

https://github.com/bfg-repo-cleaner-demos/gma-demo-repo-original/commit/8c9cfe3c

The unwise commit is reverted with a second commit using 'git revert',
which obviously leaves the password in Git history, and - some time
later - it's decided to properly clean the repo history with
git-filter-branch  git gc, purging the password so the repo can be
more widely shared (open-sourced, or just externally hosted).

git-filter-branch works exactly as intended, purging the password, but
the one thing it does not- typically - do is update the commit
message. So in the cleaned repo, the commit message for the revert
commit still looks like this:

https://github.com/bfg-repo-cleaner-demos/gma-demo-repo-git-filter-branch-cleaned/commit/bf0637a5

It contains a commit id (8c9cfe3) which is no longer in the repo, but
can very easily be associated with an existing commit simply by
examining the subject line of the reverted commit (Carelessly
checking password into source control). It's also obvious, from
examining the repo, where the excised data was removed (ie at the
db.password= line). At this point it's possible to do a brute-force
attack where you generate possible passwords, insert them into the
available commit's tree, and compare them against the leaked commit
id. When the the commit id matches, the sensitive data has been
recovered.

A proof-of-concept implementation of this attack was indeed able to
recover the purged password:

--
$ java -jar gma-0.1.jar 8c9cfe3c attack-pinpoint
gma-demo-repo-git-filter-branch-cleaned

Brute-force search using these characters : 0123456789abcdefghijklmnopqrstuvwxyz
Available commit, presumed cleaned : 8ebbf661
File path : src/main/resources/config.properties
Template blob : dca1a2fb
Exhausted strings of length 1 or less
...
Exhausted strings of length 4 or less
Match with '0g6rw'
--

So all of this amounts to a fairly low severity issue - people should
always change credentials when they mistakenly commit them to a repo -
but I guess the point is that from a paranoia point of view, you want
to remove all information - including old commit hashes buried in
commit messages - that relate to sensitive data when you clean a repo
for sharing. The git-filter-branch command has a --msg-filter option
which could be used for this purpose, with the application of some
judicious bash-scripting, grepsed-ing. However, I must confess that I
believe users would be better advised to use The BFG:

http://rtyley.github.io/bfg-repo-cleaner/

The BFG already addresses this issue by replacing all old Git
object-ids found in commit/tag messages with the updated id. For
instance, here's that exact same commit message when cleaned with the
BFG:

https://github.com/bfg-repo-cleaner-demos/gma-demo-repo-bfg-cleaned/commit/35840201

In the case that the users specifies a filtering operation is not
removing 'private' data, the BFG replaces old ids with text of the
form 'newid [formerly oldid], but if the operation is in fact to
strip private data, the replacement value is simply the newid - and
without the old commit id, the attack described above is not possible.

I believe it's worth educating users to give them a more realistic
understanding of their exposure, and would like to update the
documentation of git-filter-branch to give them a better idea of their
options for removing private data - that would include noting the BFG
as alternative.

- Roberto Tyley

https://github.com/rtyley/bfg-repo-cleaner/blob/v1.2.0/src/main/scala/com/madgag/git/bfg/cleaner/ObjectIdSubstitutor.scala#L33-L60
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html