Re: [PATCH] teach fast-export an --anonymize option
On Thu, Aug 21, 2014 at 02:57:22PM -0700, Junio C Hamano wrote:

> Jeff King writes:
>
> > +--anonymize::
> > +	Replace all paths, blob contents, commit and tag messages,
> > +	names, and email addresses in the output with anonymized data,
> > +	while still retaining the shape of history and of the stored
> > +	tree.
>
> Sometimes branch names can contain codenames the project may prefer
> to hide from the general public, so they may need to be anonymised
> as well.

Yes, I do anonymize them (and check it in the tests). See
anonymize_refname. I just forgot to include it in the list. Trivial
squashable patch is below.

The few things I don't anonymize are:

  1. ref prefixes. We see the same distribution of refs/heads vs
     refs/tags, etc.

  2. refs/heads/master is left untouched, for convenience (and because
     it's not really a secret). The implementation is lazy, though, and
     would leave "refs/heads/master-supersecret", as well. I can tighten
     that if we really want to be careful.

  3. gitlinks are left untouched, since sha1s cannot be reversed. This
     could leak some information (if your private repo points to a
     public, I can find out you have it as submodule). I doubt it
     matters, but we can also scramble the sha1s.

---
diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 0ec7cad..52831fa 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -106,10 +106,10 @@ marks the same across runs.
 	different from the commit's first parent).
 
 --anonymize::
-	Replace all paths, blob contents, commit and tag messages,
-	names, and email addresses in the output with anonymized data,
-	while still retaining the shape of history and of the stored
-	tree.
+	Replace all refnames, paths, blob contents, commit and tag
+	messages, names, and email addresses in the output with
+	anonymized data, while still retaining the shape of history and
+	of the stored tree.
 
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
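
To make points 1 and 2 of the list above concrete, here is a minimal
sketch of refname handling that keeps the ref prefix and special-cases
master. It is not the code from the patch: the function names, the
prefix table, and the "ref<N>" naming scheme are all assumptions made
for illustration. It contrasts a prefix-only check (which would also
keep "refs/heads/master-supersecret") with an exact match.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed prefix list; the message only names refs/heads and refs/tags. */
static const char *ref_prefixes[] = {
	"refs/heads/", "refs/tags/", "refs/remotes/", "refs/"
};

/* Hypothetical "lazy" check: anything starting with the master ref is kept. */
static int keeps_ref_lazy(const char *refname)
{
	const char *keep = "refs/heads/master";
	assert(refname);
	return strncmp(refname, keep, strlen(keep)) == 0;
}

/* Hypothetical tightened check: only the exact ref is kept. */
static int keeps_ref_strict(const char *refname)
{
	assert(refname);
	return strcmp(refname, "refs/heads/master") == 0;
}

/*
 * Write an anonymized refname into out. The prefix survives so that
 * branches stay branches and tags stay tags; the rest is replaced by
 * "ref<counter>" (an invented naming scheme).
 */
static void anonymize_refname_sketch(char *out, size_t outlen,
				     const char *refname, unsigned counter)
{
	size_t i;

	assert(out && refname && outlen > 0);
	if (keeps_ref_strict(refname)) {
		snprintf(out, outlen, "%s", refname);
		return;
	}
	for (i = 0; i < sizeof(ref_prefixes) / sizeof(ref_prefixes[0]); i++) {
		size_t plen = strlen(ref_prefixes[i]);
		if (!strncmp(refname, ref_prefixes[i], plen)) {
			snprintf(out, outlen, "%sref%u", ref_prefixes[i], counter);
			return;
		}
	}
	/* No known prefix: anonymize the whole name. */
	snprintf(out, outlen, "ref%u", counter);
}
```

With the strict check, "refs/heads/master-supersecret" would be
rewritten like any other branch, while the distribution of refs/heads
vs refs/tags stays visible in the output.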
Re: [PATCH] teach fast-export an --anonymize option
On Thu, Aug 21, 2014 at 01:15:10PM -0700, Junio C Hamano wrote:

> Jeff King writes:
>
> > +/*
> > + * We anonymize each component of a path individually,
> > + * so that paths a/b and a/c will share a common root.
> > + * The paths are cached via anonymize_mem so that repeated
> > + * lookups for "a" will yield the same value.
> > + */
> > +static void anonymize_path(struct strbuf *out, const char *path,
> > +			   struct hashmap *map,
> > +			   char *(*generate)(const char *, size_t *))
> > +{
> > +	while (*path) {
> > +		const char *end_of_component = strchrnul(path, '/');
> > +		size_t len = end_of_component - path;
> > +		const char *c = anonymize_mem(map, generate, path, &len);
> > +		strbuf_add(out, c, len);
> > +		path = end_of_component;
> > +		if (*path)
> > +			strbuf_addch(out, *path++);
> > +	}
> > +}
>
> Do two paths sort the same way before and after anonymisation? For
> example, if generate() works as a simple substitution, it should map
> a character that sorts before (or after) '/' with another that also
> sorts before (or after) '/' for us to be able to diagnose an error
> that comes from D/F sort order confusion.

No, the sort order is totally lost. I'd be afraid that a general scheme
would end up leaking information about what was in the filenames. It
might be acceptable to leak some information here, though, if it adds to
the realism of the result.

I tried here to lay the basic infrastructure and do the simplest thing
that might work, so we could evaluate proposals like that independently
(and also because I didn't come up with a clever enough algorithm to do
what you're asking). Patches welcome on top. :)

-Peff
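
For readers without the strbuf and hashmap APIs at hand, the
per-component idea in the quoted anonymize_path() can be sketched in
plain C. This is an illustration under stated assumptions, not the
patch's implementation: the fixed-size lookup table stands in for
anonymize_mem and its hashmap, the "path<N>" names are invented, and
strchr() replaces the GNU strchrnul().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_COMPONENTS 64
#define MAX_NAME 64

/* Stand-in for the cache: every component seen so far, in first-seen order. */
static char seen[MAX_COMPONENTS][MAX_NAME];
static size_t nr_seen;

/* Return a stable id for a component, registering it on first sight. */
static size_t lookup_component(const char *s, size_t len)
{
	size_t i;

	assert(len < MAX_NAME);
	for (i = 0; i < nr_seen; i++)
		if (strlen(seen[i]) == len && !memcmp(seen[i], s, len))
			return i;
	assert(nr_seen < MAX_COMPONENTS);
	memcpy(seen[nr_seen], s, len);
	seen[nr_seen][len] = '\0';
	return nr_seen++;
}

/*
 * Anonymize each '/'-separated component on its own, so that two paths
 * sharing a leading directory still share it after anonymization.
 */
static void anonymize_path_sketch(char *out, size_t outlen, const char *path)
{
	size_t used = 0;

	assert(out && path && outlen > 0);
	out[0] = '\0';
	while (*path) {
		const char *end = strchr(path, '/');
		size_t len = end ? (size_t)(end - path) : strlen(path);
		int n = snprintf(out + used, outlen - used, "path%zu%s",
				 lookup_component(path, len), end ? "/" : "");

		/* The sketch treats truncation as a caller error. */
		assert(n >= 0 && (size_t)n < outlen - used);
		used += (size_t)n;
		path += len + (end ? 1 : 0);
	}
}
```

Feeding it "a/b" and then "a/c" yields two results with the same first
component and different second components, and asking for "a/b" again
returns the same string as the first time. As the reply above says, the
mapping makes no attempt to preserve sort order.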
Re: [PATCH] teach fast-export an --anonymize option
Jeff King writes:

> +--anonymize::
> +	Replace all paths, blob contents, commit and tag messages,
> +	names, and email addresses in the output with anonymized data,
> +	while still retaining the shape of history and of the stored
> +	tree.

Sometimes branch names can contain codenames the project may prefer
to hide from the general public, so they may need to be anonymised
as well.
Re: [PATCH] teach fast-export an --anonymize option
Jeff King writes:

> +/*
> + * We anonymize each component of a path individually,
> + * so that paths a/b and a/c will share a common root.
> + * The paths are cached via anonymize_mem so that repeated
> + * lookups for "a" will yield the same value.
> + */
> +static void anonymize_path(struct strbuf *out, const char *path,
> +			   struct hashmap *map,
> +			   char *(*generate)(const char *, size_t *))
> +{
> +	while (*path) {
> +		const char *end_of_component = strchrnul(path, '/');
> +		size_t len = end_of_component - path;
> +		const char *c = anonymize_mem(map, generate, path, &len);
> +		strbuf_add(out, c, len);
> +		path = end_of_component;
> +		if (*path)
> +			strbuf_addch(out, *path++);
> +	}
> +}

Do two paths sort the same way before and after anonymisation? For
example, if generate() works as a simple substitution, it should map
a character that sorts before (or after) '/' with another that also
sorts before (or after) '/' for us to be able to diagnose an error
that comes from D/F sort order confusion.
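
The sort-order concern can be illustrated with a small comparison
function. The sketch below assumes the usual tree-entry rule that a
directory compares as if its name ended in '/'; tree_entry_cmp is a
simplified stand-in written for this example, not a function from the
patch, and the "path0"/"path1" names are invented.

```c
#include <assert.h>
#include <string.h>

/*
 * Compare two tree entries by name, treating a directory as if its
 * name had a trailing '/'. Returns <0, 0 or >0 like strcmp().
 */
static int tree_entry_cmp(const char *n1, int is_dir1,
			  const char *n2, int is_dir2)
{
	size_t l1, l2, min;
	unsigned char c1, c2;
	int cmp;

	assert(n1 && n2);
	l1 = strlen(n1);
	l2 = strlen(n2);
	min = l1 < l2 ? l1 : l2;
	cmp = memcmp(n1, n2, min);
	if (cmp)
		return cmp;
	/* Common prefix: the next byte decides, with '/' implied for dirs. */
	c1 = (unsigned char)n1[min];
	c2 = (unsigned char)n2[min];
	if (!c1 && is_dir1)
		c1 = '/';
	if (!c2 && is_dir2)
		c2 = '/';
	return (c1 < c2) ? -1 : (c1 > c2) ? 1 : 0;
}
```

Under this rule the file "foo.c" sorts before the directory "foo",
because '.' sorts before '/', while a file "foo0" sorts after it. If the
two names are anonymized to, say, "path1" (the file) and "path0" (the
directory), the file now sorts after the directory, so a bug that only
shows up with the original ordering would no longer reproduce.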