[PATCH] teach fast-export an --anonymize option
Sometimes users want to report a bug they experience on
their repository, but they are not at liberty to share the
contents of the repository. It would be useful if they could
produce a repository that has a similar shape to its history
and tree, but without leaking any information. This
anonymized repository could then be shared with developers
(assuming it still replicates the original problem).

This patch implements an "--anonymize" option to
fast-export, which generates a stream that can recreate such
a repository. Producing a single stream makes it easy for
the caller to verify that they are not leaking any useful
information. You can get an overview of what will be shared
by running a command like:

  git fast-export --anonymize --all |
  perl -pe 's/\d+/X/g' |
  sort -u |
  less

which will show every unique line we generate, modulo any
numbers (each anonymized token is assigned a number, like
"User 0", and we replace it consistently in the output).

In addition to anonymizing, this produces test cases that
are relatively small (compared to the original repository)
and fast to generate (compared to using filter-branch, or
modifying the output of fast-export yourself). Here are
numbers for git.git:

  $ time git fast-export --anonymize --all \
         --tag-of-filtered-object=drop >output
  real	0m2.883s
  user	0m2.828s
  sys	0m0.052s

  $ gzip output
  $ ls -lh output.gz | awk '{print $5}'
  2.9M

Signed-off-by: Jeff King <p...@peff.net>
---
I haven't used this for anything real yet. It was a fun exercise, and I
do think it should work in practice. I'd be curious to hear a success
report of somebody actually debugging something with this.

In theory we could anonymize in a reversible way (e.g., by encrypting
each token with a key, and then not sharing the key), but it's a lot
more complicated and I don't think it buys us much.
The one thing I'd really like is to be able to test packing on an
anonymized repository, but two objects which delta well together will
not have their encrypted contents delta (unless you use something weak
like ECB mode, in which case the contents are not really as anonymized
as you would hope).

I think most interesting cases involve things like commit traversal,
and that should still work here, even with made-up contents. Some weird
cases involving trees would not work if they depend on the filenames
(e.g., things that impact sort order). We could allow finer-grained
control, like "--anonymize=commits,blobs" if somebody was OK sharing
their filenames. I did not go that far here, but it should be pretty
easy to build on top.

 Documentation/git-fast-export.txt |   6 +
 builtin/fast-export.c             | 280 --
 t/t9351-fast-export-anonymize.sh  | 117
 3 files changed, 392 insertions(+), 11 deletions(-)
 create mode 100755 t/t9351-fast-export-anonymize.sh

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 221506b..0ec7cad 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -105,6 +105,12 @@ marks the same across runs.
 	in the commit (as opposed to just listing the files which are
 	different from the commit's first parent).
 
+--anonymize::
+	Replace all paths, blob contents, commit and tag messages,
+	names, and email addresses in the output with anonymized data,
+	while still retaining the shape of history and of the stored
+	tree.
+
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
 	be specified.
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 92b4624..acd2838 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -18,6 +18,7 @@
 #include "parse-options.h"
 #include "quote.h"
 #include "remote.h"
+#include "blob.h"
 
 static const char *fast_export_usage[] = {
 	N_("git fast-export [rev-list-opts]"),
@@ -34,6 +35,7 @@ static int full_tree;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct refspec *refspecs;
 static int refspecs_nr;
+static int anonymize;
 
 static int parse_opt_signed_tag_mode(const struct option *opt,
 				     const char *arg, int unset)
@@ -81,6 +83,76 @@ static int has_unshown_parent(struct commit *commit)
 	return 0;
 }
 
+struct anonymized_entry {
+	struct hashmap_entry hash;
+	const char *orig;
+	size_t orig_len;
+	const char *anon;
+	size_t anon_len;
+};
+
+static int anonymized_entry_cmp(const void *va, const void *vb,
+				const void *data)
+{
+	const struct anonymized_entry *a = va, *b = vb;
+	return a->orig_len != b->orig_len ||
+		memcmp(a->orig, b->orig, a->orig_len);
+}
+
+/*
+ * Basically keep a cache of X->Y so that we can repeatedly replace
+ * the same anonymized string with another. The actual generation
+ * is farmed out to the generate function.
+ */
+static
Re: [PATCH] teach fast-export an --anonymize option
Jeff King <p...@peff.net> writes:

> +/*
> + * We anonymize each component of a path individually,
> + * so that paths "a/b" and "a/c" will share a common root.
> + * The paths are cached via anonymize_mem so that repeated
> + * lookups for "a" will yield the same value.
> + */
> +static void anonymize_path(struct strbuf *out, const char *path,
> +			   struct hashmap *map,
> +			   char *(*generate)(const char *, size_t *))
> +{
> +	while (*path) {
> +		const char *end_of_component = strchrnul(path, '/');
> +		size_t len = end_of_component - path;
> +		const char *c = anonymize_mem(map, generate, path, &len);
> +		strbuf_add(out, c, len);
> +		path = end_of_component;
> +		if (*path)
> +			strbuf_addch(out, *path++);
> +	}
> +}

Do two paths sort the same way before and after anonymisation?
For example, if generate() works as a simple substitution, it
should map a character that sorts before (or after) '/' with
another that also sorts before (or after) '/' for us to be able to
diagnose an error that comes from D/F sort order confusion.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
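To make the per-component scheme in the quoted anonymize_path() concrete, here is a minimal standalone sketch. It is not the patch's code: token_cache, token_id and anonymize_path_sketch are hypothetical names, a small linear table stands in for the hashmap, and the "pathN" token format is an assumption chosen for illustration. The point is only that each component is looked up once and then replaced consistently, so two paths under the same directory keep a common root.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical toy cache; the patch itself keeps entries in a hashmap. */
#define MAX_TOKENS 64
#define MAX_LEN 64

struct token_cache {
	char orig[MAX_TOKENS][MAX_LEN];
	int nr;
};

/*
 * Return the number assigned to the component path[0..len), handing out
 * the next free number the first time a component is seen. Returns -1 if
 * the toy cache is full or the component is too long for it.
 */
static int token_id(struct token_cache *cache, const char *path, size_t len)
{
	int i;

	if (len >= MAX_LEN)
		return -1;
	for (i = 0; i < cache->nr; i++)
		if (strlen(cache->orig[i]) == len &&
		    !memcmp(cache->orig[i], path, len))
			return i;
	if (cache->nr == MAX_TOKENS)
		return -1;
	memcpy(cache->orig[cache->nr], path, len);
	cache->orig[cache->nr][len] = '\0';
	return cache->nr++;
}

/*
 * Write the anonymized form of path into out (outsz must be non-zero),
 * one "pathN" token per '/'-separated component. Returns 0 on success
 * and -1 if the cache or the output buffer is too small.
 */
static int anonymize_path_sketch(struct token_cache *cache, const char *path,
				 char *out, size_t outsz)
{
	size_t used = 0;

	out[0] = '\0';
	while (*path) {
		size_t len = strcspn(path, "/");
		int id = token_id(cache, path, len);
		int n;

		if (id < 0)
			return -1;
		path += len;
		n = snprintf(out + used, outsz - used, "path%d%s", id,
			     *path ? "/" : "");
		if (n < 0 || (size_t)n >= outsz - used)
			return -1;
		used += (size_t)n;
		if (*path)
			path++; /* skip the separator we just copied */
	}
	return 0;
}
```

Starting from a zero-initialized cache, anonymizing "src/main.c" and then "src/util.c" with this sketch gives "path0/path1" and "path0/path2": the shared directory maps to the same token both times.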
Re: [PATCH] teach fast-export an --anonymize option
Jeff King <p...@peff.net> writes:

> +--anonymize::
> +	Replace all paths, blob contents, commit and tag messages,
> +	names, and email addresses in the output with anonymized data,
> +	while still retaining the shape of history and of the stored
> +	tree.

Sometimes branch names can contain codenames the project may
prefer to hide from the general public, so they may need to be
anonymised as well.
Re: [PATCH] teach fast-export an --anonymize option
On Thu, Aug 21, 2014 at 01:15:10PM -0700, Junio C Hamano wrote:

> Jeff King <p...@peff.net> writes:
>
> > +/*
> > + * We anonymize each component of a path individually,
> > + * so that paths "a/b" and "a/c" will share a common root.
> > + * The paths are cached via anonymize_mem so that repeated
> > + * lookups for "a" will yield the same value.
> > + */
> > +static void anonymize_path(struct strbuf *out, const char *path,
> > +			   struct hashmap *map,
> > +			   char *(*generate)(const char *, size_t *))
> > +{
> > +	while (*path) {
> > +		const char *end_of_component = strchrnul(path, '/');
> > +		size_t len = end_of_component - path;
> > +		const char *c = anonymize_mem(map, generate, path, &len);
> > +		strbuf_add(out, c, len);
> > +		path = end_of_component;
> > +		if (*path)
> > +			strbuf_addch(out, *path++);
> > +	}
> > +}
>
> Do two paths sort the same way before and after anonymisation?
> For example, if generate() works as a simple substitution, it
> should map a character that sorts before (or after) '/' with
> another that also sorts before (or after) '/' for us to be able to
> diagnose an error that comes from D/F sort order confusion.

No, the sort order is totally lost. I'd be afraid that a general scheme
would end up leaking information about what was in the filenames. It
might be acceptable to leak some information here, though, if it adds to
the realism of the result.

I tried here to lay the basic infrastructure and do the simplest thing
that might work, so we could evaluate proposals like that independently
(and also because I didn't come up with a clever enough algorithm to do
what you're asking). Patches welcome on top. :)

-Peff
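For readers unfamiliar with the D/F ordering being discussed, the following sketch may help. It is a simplified comparison modeled on how git orders tree entries (names compare bytewise, and a directory compares as if its name ended in '/'). The name entry_cmp_sketch and the is_dir flags are hypothetical, and the token names in the note below are assumed for illustration, not output of the patch.

```c
#include <string.h>

/*
 * Compare two tree entry names. A directory is treated as if its name
 * had a trailing '/', which is what makes the byte that follows a
 * shared prefix matter ('.' sorts before '/', '0' sorts after it).
 */
static int entry_cmp_sketch(const char *name1, int is_dir1,
			    const char *name2, int is_dir2)
{
	size_t len1 = strlen(name1), len2 = strlen(name2);
	size_t len = len1 < len2 ? len1 : len2;
	unsigned char c1, c2;
	int cmp = memcmp(name1, name2, len);

	if (cmp)
		return cmp;
	/* One name is a prefix of the other: compare the next byte. */
	c1 = (unsigned char)name1[len];
	c2 = (unsigned char)name2[len];
	if (!c1 && is_dir1)
		c1 = '/';
	if (!c2 && is_dir2)
		c2 = '/';
	return (c1 < c2) ? -1 : (c1 > c2) ? 1 : 0;
}
```

Under this rule the file "a.c" sorts before the directory "a", because '.' sorts before '/'. If the directory happened to be anonymized to "path0" and the file to "path1", the comparison would put the directory first instead, which is the loss of ordering acknowledged above.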
Re: [PATCH] teach fast-export an --anonymize option
On Thu, Aug 21, 2014 at 02:57:22PM -0700, Junio C Hamano wrote:

> Jeff King <p...@peff.net> writes:
>
> > +--anonymize::
> > +	Replace all paths, blob contents, commit and tag messages,
> > +	names, and email addresses in the output with anonymized data,
> > +	while still retaining the shape of history and of the stored
> > +	tree.
>
> Sometimes branch names can contain codenames the project may
> prefer to hide from the general public, so they may need to be
> anonymised as well.

Yes, I do anonymize them (and check it in the tests). See
anonymize_refname. I just forgot to include it in the list. Trivial
squashable patch is below.

The few things I don't anonymize are:

  1. ref prefixes. We see the same distribution of refs/heads vs
     refs/tags, etc.

  2. refs/heads/master is left untouched, for convenience (and because
     it's not really a secret). The implementation is lazy, though, and
     would leave "refs/heads/master-supersecret", as well. I can tighten
     that if we really want to be careful.

  3. gitlinks are left untouched, since sha1s cannot be reversed. This
     could leak some information (if your private repo points to a
     public, I can find out you have it as submodule). I doubt it
     matters, but we can also scramble the sha1s.

---
diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 0ec7cad..52831fa 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -106,10 +106,10 @@ marks the same across runs.
 	different from the commit's first parent).
 
 --anonymize::
-	Replace all paths, blob contents, commit and tag messages,
-	names, and email addresses in the output with anonymized data,
-	while still retaining the shape of history and of the stored
-	tree.
+	Replace all refnames, paths, blob contents, commit and tag
+	messages, names, and email addresses in the output with
+	anonymized data, while still retaining the shape of history and
+	of the stored tree.
 
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
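The difference between the lazy and the tightened treatment of refs/heads/master mentioned in point 2 can be sketched as below. The body of anonymize_refname is not shown in this thread, so this assumes that "lazy" means a plain prefix match; both function names are hypothetical.

```c
#include <string.h>

/*
 * Assumed lazy check: a prefix match, so anything that merely starts
 * with "refs/heads/master" (such as "refs/heads/master-supersecret")
 * would also be left untouched.
 */
static int keeps_refname_lazy(const char *refname)
{
	static const char keep[] = "refs/heads/master";
	return !strncmp(refname, keep, strlen(keep));
}

/* Tightened check: only the exact ref name is left untouched. */
static int keeps_refname_exact(const char *refname)
{
	return !strcmp(refname, "refs/heads/master");
}
```

With the exact comparison, "refs/heads/master-supersecret" no longer slips through, while "refs/heads/master" itself is still left readable.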