[PATCH] teach fast-export an --anonymize option

2014-08-21 Thread Jeff King
Sometimes users want to report a bug they experience on
their repository, but they are not at liberty to share the
contents of the repository. It would be useful if they could
produce a repository that has a similar shape to its history
and tree, but without leaking any information. This
anonymized repository could then be shared with developers
(assuming it still replicates the original problem).

This patch implements an --anonymize option to
fast-export, which generates a stream that can recreate such
a repository. Producing a single stream makes it easy for
the caller to verify that they are not leaking any useful
information. You can get an overview of what will be shared
by running a command like:

  git fast-export --anonymize --all |
  perl -pe 's/\d+/X/g' |
  sort -u |
  less

which will show every unique line we generate, modulo any
numbers (each anonymized token is assigned a number, like
"User 0", and we replace it consistently in the output).

In addition to anonymizing, this produces test cases that
are relatively small (compared to the original repository)
and fast to generate (compared to using filter-branch, or
modifying the output of fast-export yourself). Here are
numbers for git.git:

  $ time git fast-export --anonymize --all \
      --tag-of-filtered-object=drop >output
  real    0m2.883s
  user    0m2.828s
  sys     0m0.052s

  $ gzip output
  $ ls -lh output.gz | awk '{print $5}'
  2.9M

Signed-off-by: Jeff King p...@peff.net
---
I haven't used this for anything real yet. It was a fun exercise, and I
do think it should work in practice. I'd be curious to hear a success
report of somebody actually debugging something with this.

In theory we could anonymize in a reversible way (e.g., by encrypting
each token with a key, and then not sharing the key), but it's a lot
more complicated and I don't think it buys us much. The one thing I'd
really like is to be able to test packing on an anonymized repository,
but two objects which delta well together will not have their encrypted
contents delta (unless you use something weak like ECB mode, in which
case the contents are not really as anonymized as you would hope).

I think most interesting cases involve things like commit traversal, and
that should still work here, even with made-up contents. Some weird
cases involving trees would not work if they depend on the filenames
(e.g., things that impact sort order). We could allow finer-grained
control, like --anonymize=commits,blobs if somebody was OK sharing
their filenames. I did not go that far here, but it should be pretty
easy to build on top.

 Documentation/git-fast-export.txt |   6 +
 builtin/fast-export.c | 280 --
 t/t9351-fast-export-anonymize.sh  | 117 
 3 files changed, 392 insertions(+), 11 deletions(-)
 create mode 100755 t/t9351-fast-export-anonymize.sh

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 221506b..0ec7cad 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -105,6 +105,12 @@ marks the same across runs.
 	in the commit (as opposed to just listing the files which are
 	different from the commit's first parent).
 
+--anonymize::
+   Replace all paths, blob contents, commit and tag messages,
+   names, and email addresses in the output with anonymized data,
+   while still retaining the shape of history and of the stored
+   tree.
+
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
 	be specified.
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 92b4624..acd2838 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -18,6 +18,7 @@
 #include "parse-options.h"
 #include "quote.h"
 #include "remote.h"
+#include "blob.h"
 
 static const char *fast_export_usage[] = {
 	N_("git fast-export [rev-list-opts]"),
@@ -34,6 +35,7 @@ static int full_tree;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct refspec *refspecs;
 static int refspecs_nr;
+static int anonymize;
 
 static int parse_opt_signed_tag_mode(const struct option *opt,
 const char *arg, int unset)
@@ -81,6 +83,76 @@ static int has_unshown_parent(struct commit *commit)
 	return 0;
 }
 
+struct anonymized_entry {
+   struct hashmap_entry hash;
+   const char *orig;
+   size_t orig_len;
+   const char *anon;
+   size_t anon_len;
+};
+
+static int anonymized_entry_cmp(const void *va, const void *vb,
+   const void *data)
+{
+   const struct anonymized_entry *a = va, *b = vb;
+   return a->orig_len != b->orig_len ||
+   memcmp(a->orig, b->orig, a->orig_len);
+}
+
+/*
+ * Basically keep a cache of X->Y so that we can repeatedly replace
+ * the same anonymized string with another. The actual generation
+ * is farmed out to the generate function.
+ */
+static 

Re: [PATCH] teach fast-export an --anonymize option

2014-08-21 Thread Junio C Hamano
Jeff King p...@peff.net writes:

> +/*
> + * We anonymize each component of a path individually,
> + * so that paths "a/b" and "a/c" will share a common root.
> + * The paths are cached via anonymize_mem so that repeated
> + * lookups for "a" will yield the same value.
> + */
> +static void anonymize_path(struct strbuf *out, const char *path,
> +			   struct hashmap *map,
> +			   char *(*generate)(const char *, size_t *))
> +{
> +	while (*path) {
> +		const char *end_of_component = strchrnul(path, '/');
> +		size_t len = end_of_component - path;
> +		const char *c = anonymize_mem(map, generate, path, &len);
> +		strbuf_add(out, c, len);
> +		path = end_of_component;
> +		if (*path)
> +			strbuf_addch(out, *path++);
> +	}
> +}
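
For readers without the full patch at hand, the per-component idea in the
quoted comment can be sketched as a standalone program. This is not the
patch's code: the fixed-size array cache, the names anonymize_component and
anonymize_path_sketch, and the "pathN" replacement scheme are all assumptions
made for illustration (the patch itself uses a hashmap and a caller-supplied
generate function).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_COMPONENTS 64

/* One cached mapping from an original path component to its replacement. */
struct component_entry {
	char *orig;
	char anon[32];
};

static struct component_entry cache[MAX_COMPONENTS];
static int cache_nr;

/* Look up a component; generate "pathN" (hypothetical scheme) on first use. */
static const char *anonymize_component(const char *orig, size_t len)
{
	struct component_entry *e;
	int i;

	for (i = 0; i < cache_nr; i++)
		if (strlen(cache[i].orig) == len &&
		    !memcmp(cache[i].orig, orig, len))
			return cache[i].anon;

	if (cache_nr == MAX_COMPONENTS)
		return NULL; /* toy limit reached */
	e = &cache[cache_nr];
	e->orig = malloc(len + 1);
	if (!e->orig)
		return NULL;
	memcpy(e->orig, orig, len);
	e->orig[len] = '\0';
	snprintf(e->anon, sizeof(e->anon), "path%d", cache_nr);
	cache_nr++;
	return e->anon;
}

/* Anonymize each '/'-separated component; returns 0 on success, -1 on error. */
static int anonymize_path_sketch(char *out, size_t outsz, const char *path)
{
	size_t used = 0;

	assert(out && path);
	if (!outsz)
		return -1;
	out[0] = '\0';
	while (*path) {
		size_t len = strcspn(path, "/");
		const char *c = anonymize_component(path, len);
		size_t clen;

		if (!c)
			return -1;
		clen = strlen(c);
		if (used + clen + 2 > outsz)
			return -1; /* no room for name, '/', and NUL */
		memcpy(out + used, c, clen);
		used += clen;
		path += len;
		if (*path)
			out[used++] = *path++; /* copy the '/' separator */
		out[used] = '\0';
	}
	return 0;
}
```

Feeding "a/b" and then "a/c" through anonymize_path_sketch gives two results
that share the same first component, because "a" is found in the cache on the
second call.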

Do two paths sort the same way before and after anonymisation?  For
example, if generate() works as a simple substitution, it should map
a character that sorts before (or after) '/' with another that also
sorts before (or after) '/' for us to be able to diagnose an error
that comes from D/F sort order confusion.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] teach fast-export an --anonymize option

2014-08-21 Thread Junio C Hamano
Jeff King p...@peff.net writes:

> +--anonymize::
> +	Replace all paths, blob contents, commit and tag messages,
> +	names, and email addresses in the output with anonymized data,
> +	while still retaining the shape of history and of the stored
> +	tree.

Sometimes branch names can contain codenames the project may prefer
to hide from the general public, so they may need to be anonymised
as well.


Re: [PATCH] teach fast-export an --anonymize option

2014-08-21 Thread Jeff King
On Thu, Aug 21, 2014 at 01:15:10PM -0700, Junio C Hamano wrote:

> Jeff King p...@peff.net writes:
>
> > +/*
> > + * We anonymize each component of a path individually,
> > + * so that paths "a/b" and "a/c" will share a common root.
> > + * The paths are cached via anonymize_mem so that repeated
> > + * lookups for "a" will yield the same value.
> > + */
> > +static void anonymize_path(struct strbuf *out, const char *path,
> > +			   struct hashmap *map,
> > +			   char *(*generate)(const char *, size_t *))
> > +{
> > +	while (*path) {
> > +		const char *end_of_component = strchrnul(path, '/');
> > +		size_t len = end_of_component - path;
> > +		const char *c = anonymize_mem(map, generate, path, &len);
> > +		strbuf_add(out, c, len);
> > +		path = end_of_component;
> > +		if (*path)
> > +			strbuf_addch(out, *path++);
> > +	}
> > +}
>
> Do two paths sort the same way before and after anonymisation?  For
> example, if generate() works as a simple substitution, it should map
> a character that sorts before (or after) '/' with another that also
> sorts before (or after) '/' for us to be able to diagnose an error
> that comes from D/F sort order confusion.

No, the sort order is totally lost. I'd be afraid that a general scheme
would end up leaking information about what was in the filenames. It
might be acceptable to leak some information here, though, if it adds to
the realism of the result.
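
To make the D/F concern concrete, here is a minimal sketch (not part of the
patch) of the ordering rule for tree entries, in which names are compared
bytewise and a directory compares as if its name ended in '/'. The function
name tree_entry_cmp and the sample names below are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/*
 * Simplified tree-entry ordering: compare names bytewise; when one name
 * is a prefix of the other, a directory acts as if it had a trailing '/'.
 */
static int tree_entry_cmp(const char *a, int a_is_dir,
			  const char *b, int b_is_dir)
{
	size_t alen, blen, min;
	unsigned char ca, cb;
	int cmp;

	assert(a && b);
	alen = strlen(a);
	blen = strlen(b);
	min = alen < blen ? alen : blen;
	cmp = memcmp(a, b, min);
	if (cmp)
		return cmp;

	/* Common prefix is identical; compare the byte that follows it. */
	ca = alen > min ? (unsigned char)a[min] : (a_is_dir ? '/' : '\0');
	cb = blen > min ? (unsigned char)b[min] : (b_is_dir ? '/' : '\0');
	return (int)ca - (int)cb;
}
```

With a directory "foo" and a file "foo.c", the file sorts first because '.'
is less than '/'. If those two were anonymized to, say, "path0" (the
directory) and "path1" (the file), the directory would sort first instead, so
a bug that depends on the original order would not reproduce.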

I tried here to lay the basic infrastructure and do the simplest thing
that might work, so we could evaluate proposals like that independently
(and also because I didn't come up with a clever enough algorithm to do
what you're asking).  Patches welcome on top. :)

-Peff


Re: [PATCH] teach fast-export an --anonymize option

2014-08-21 Thread Jeff King
On Thu, Aug 21, 2014 at 02:57:22PM -0700, Junio C Hamano wrote:

> Jeff King p...@peff.net writes:
>
> > +--anonymize::
> > +	Replace all paths, blob contents, commit and tag messages,
> > +	names, and email addresses in the output with anonymized data,
> > +	while still retaining the shape of history and of the stored
> > +	tree.
>
> Sometimes branch names can contain codenames the project may prefer
> to hide from the general public, so they may need to be anonymised
> as well.

Yes, I do anonymize them (and check it in the tests). See
anonymize_refname. I just forgot to include it in the list. Trivial
squashable patch is below.

The few things I don't anonymize are:

  1. ref prefixes. We see the same distribution of refs/heads vs
 refs/tags, etc.

  2. refs/heads/master is left untouched, for convenience (and because
 it's not really a secret). The implementation is lazy, though, and
 would leave refs/heads/master-supersecret, as well. I can tighten
 that if we really want to be careful.

  3. gitlinks are left untouched, since sha1s cannot be reversed. This
     could leak some information (if your private repo points to a
     public one, I can find out you have it as a submodule). I doubt it
 matters, but we can also scramble the sha1s.
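
As an illustration of item 2, the difference between a lazy check and a
tightened one might look like the sketch below. It assumes the laziness
amounts to a plain prefix match; both helper names are hypothetical and
neither is taken from the patch.

```c
#include <assert.h>
#include <string.h>

/* Lazy variant (assumed): any ref that merely starts with the name matches. */
static int is_master_lazy(const char *refname)
{
	assert(refname);
	return !strncmp(refname, "refs/heads/master",
			strlen("refs/heads/master"));
}

/* Tightened variant: only the exact ref name is left untouched. */
static int is_master_strict(const char *refname)
{
	assert(refname);
	return !strcmp(refname, "refs/heads/master");
}
```

The lazy variant accepts refs/heads/master-supersecret, while the strict one
accepts only refs/heads/master itself.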

---
diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 0ec7cad..52831fa 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -106,10 +106,10 @@ marks the same across runs.
 	different from the commit's first parent).
 
 --anonymize::
-   Replace all paths, blob contents, commit and tag messages,
-   names, and email addresses in the output with anonymized data,
-   while still retaining the shape of history and of the stored
-   tree.
+   Replace all refnames, paths, blob contents, commit and tag
+   messages, names, and email addresses in the output with
+   anonymized data, while still retaining the shape of history and
+   of the stored tree.
 
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can