Re: [PATCH] teach fast-export an --anonymize option

2014-08-21 Thread Jeff King
On Thu, Aug 21, 2014 at 02:57:22PM -0700, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > +--anonymize::
> > +   Replace all paths, blob contents, commit and tag messages,
> > +   names, and email addresses in the output with anonymized data,
> > +   while still retaining the shape of history and of the stored
> > +   tree.
> 
> Sometimes branch names can contain codenames the project may prefer
> to hide from the general public, so they may need to be anonymised
> as well.

Yes, I do anonymize them (and check it in the tests). See
anonymize_refname. I just forgot to include it in the list. Trivial
squashable patch is below.

The few things I don't anonymize are:

  1. ref prefixes. We see the same distribution of refs/heads vs
     refs/tags, etc.

  2. refs/heads/master is left untouched, for convenience (and because
     it's not really a secret). The implementation is lazy, though, and
     would leave "refs/heads/master-supersecret" untouched as well. I
     can tighten that if we really want to be careful.

  3. gitlinks are left untouched, since sha1s cannot be reversed. This
     could leak some information (if your private repo points to a
     public one, I can find out you have it as a submodule). I doubt it
     matters, but we can also scramble the sha1s.

---
diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 0ec7cad..52831fa 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -106,10 +106,10 @@ marks the same across runs.
    different from the commit's first parent).
 
 --anonymize::
-   Replace all paths, blob contents, commit and tag messages,
-   names, and email addresses in the output with anonymized data,
-   while still retaining the shape of history and of the stored
-   tree.
+   Replace all refnames, paths, blob contents, commit and tag
+   messages, names, and email addresses in the output with
+   anonymized data, while still retaining the shape of history and
+   of the stored tree.
 
 --refspec::
    Apply the specified refspec to each ref exported. Multiple of them can
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] teach fast-export an --anonymize option

2014-08-21 Thread Jeff King
On Thu, Aug 21, 2014 at 01:15:10PM -0700, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > +/*
> > + * We anonymize each component of a path individually,
> > + * so that paths a/b and a/c will share a common root.
> > + * The paths are cached via anonymize_mem so that repeated
> > + * lookups for "a" will yield the same value.
> > + */
> > +static void anonymize_path(struct strbuf *out, const char *path,
> > +                           struct hashmap *map,
> > +                           char *(*generate)(const char *, size_t *))
> > +{
> > +   while (*path) {
> > +           const char *end_of_component = strchrnul(path, '/');
> > +           size_t len = end_of_component - path;
> > +           const char *c = anonymize_mem(map, generate, path, &len);
> > +           strbuf_add(out, c, len);
> > +           path = end_of_component;
> > +           if (*path)
> > +                   strbuf_addch(out, *path++);
> > +   }
> > +}
> 
> Do two paths sort the same way before and after anonymisation?  For
> example, if generate() works as a simple substitution, it should map
> a character that sorts before (or after) '/' with another that also
> sorts before (or after) '/' for us to be able to diagnose an error
> that comes from D/F sort order confusion.

No, the sort order is totally lost. I'd be afraid that a general scheme
would end up leaking information about what was in the filenames. It
might be acceptable to leak some information here, though, if it adds to
the realism of the result.

I tried here to lay the basic infrastructure and do the simplest thing
that might work, so we could evaluate proposals like that independently
(and also because I didn't come up with a clever enough algorithm to do
what you're asking).  Patches welcome on top. :)

-Peff


Re: [PATCH] teach fast-export an --anonymize option

2014-08-21 Thread Junio C Hamano
Jeff King  writes:

> +--anonymize::
> + Replace all paths, blob contents, commit and tag messages,
> + names, and email addresses in the output with anonymized data,
> + while still retaining the shape of history and of the stored
> + tree.

Sometimes branch names can contain codenames the project may prefer
to hide from the general public, so they may need to be anonymised
as well.


Re: [PATCH] teach fast-export an --anonymize option

2014-08-21 Thread Junio C Hamano
Jeff King  writes:

> +/*
> + * We anonymize each component of a path individually,
> + * so that paths a/b and a/c will share a common root.
> + * The paths are cached via anonymize_mem so that repeated
> + * lookups for "a" will yield the same value.
> + */
> +static void anonymize_path(struct strbuf *out, const char *path,
> +                           struct hashmap *map,
> +                           char *(*generate)(const char *, size_t *))
> +{
> +     while (*path) {
> +             const char *end_of_component = strchrnul(path, '/');
> +             size_t len = end_of_component - path;
> +             const char *c = anonymize_mem(map, generate, path, &len);
> +             strbuf_add(out, c, len);
> +             path = end_of_component;
> +             if (*path)
> +                     strbuf_addch(out, *path++);
> +     }
> +}

Do two paths sort the same way before and after anonymisation?  For
example, if generate() works as a simple substitution, it should map
a character that sorts before (or after) '/' with another that also
sorts before (or after) '/' for us to be able to diagnose an error
that comes from D/F sort order confusion.