Re: [PATCH v2 0/9] Teach 'run' perf script to read config files

2017-09-26 Thread Roberto Tyley
On 26 September 2017 at 16:40, Christian Couder wrote:
> On Sun, Sep 24, 2017 at 9:59 AM, Junio C Hamano  wrote:
>> Christian Couder  writes:
>>
>>> (It looks like smtp.gmail.com isn't working anymore for me, so I am
>>> trying to send this using Gmail for the cover letter and Submitgit for
>>> the patches.)
>>
>> SubmitGit may want to learn the "change the timestamps of the
>> individual patches by 1 second" trick from "git send-email" to help
>> threading (you can view inbox/comp.version-control.git/ group over
>> nntp and tell your newsreader to sort-by-date).
>
> Roberto is now in CC. I will let him answer about that.

I had a quick look at git-send-email.perl, I see the trick is the `time++` one
introduced with https://github.com/git/git/commit/a5370b16 - seems reasonable!

SubmitGit makes all emails in-reply-to the initial email, which I think
is correct behaviour, but I can see that offsetting the times would
probably give a more reliable sorting in a newsreader. Unfortunately the
documentation for AWS Simple Email Service (SES) says:

  "Note: Amazon SES overrides any Date header you provide with the time
  that Amazon SES accepts the message."

http://docs.aws.amazon.com/ses/latest/DeveloperGuide/header-fields.html

...so the only way SubmitGit can offset the times is to literally delay
the sending of the emails, which is a bit unfortunate for patchbombs more
than a few dozen commits long!
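
For illustration only, the timestamp-offset idea could be sketched in
Python roughly as below. This is not the git-send-email code; the helper
name `staggered_dates` and the example start time are assumptions made
up for the sketch.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def staggered_dates(start, count, step_seconds=1):
    """Return one RFC 2822 Date header value per patch, step_seconds apart."""
    return [
        format_datetime(start + timedelta(seconds=i * step_seconds))
        for i in range(count)
    ]


# Hypothetical three-patch series, stamped from an arbitrary example time.
start = datetime(2017, 9, 26, 16, 40, 0, tzinfo=timezone.utc)
dates = staggered_dates(start, 3)
for d in dates:
    print("Date:", d)
```

Headers built this way only help threading if the sending service
passes the Date header through unchanged, which is exactly what the SES
note above rules out.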

I'll take a further look at this when I get a bit more free time.

Roberto


[PATCH 1/2] Partition SubmittingPatches doc into two files

2016-04-14 Thread Roberto Tyley
No editorial changes in this commit, the text that is transferred into the
second file is unchanged apart from minor chunk re-ordering.

The split is based on:

* Information needed for all users, whether using `git send-email` or
  submitGit (ie good commit practice, mailing list etiquette)
* Information needed just for `git send-email`/MUA users (generating the
  right kind of diff, avoid MIME & PGP, send-email & MUA specific hints)
---
 Documentation/SubmittingPatches  | 137 -
 Documentation/SubmittingPatchesByMUA | 142 +++
 2 files changed, 142 insertions(+), 137 deletions(-)
 create mode 100644 Documentation/SubmittingPatchesByMUA

diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches
index 98fc4cc..6dca41d 100644
--- a/Documentation/SubmittingPatches
+++ b/Documentation/SubmittingPatches
@@ -119,11 +119,6 @@ archive, summarize the relevant points of the discussion.
 
 (3) Generate your patch using Git tools out of your commits.
 
-Git based diff tools generate unidiff which is the preferred format.
-
-You do not have to be afraid to use -M option to "git diff" or
-"git format-patch", if your patch involves file renames.  The
-receiving end can handle them just fine.
 
 Please make sure your patch does not add commented out debugging code,
 or include any extra files which do not relate to what your patch
@@ -136,11 +131,6 @@ that is fine, but please mark it as such.
 
 (4) Sending your patches.
 
-Learn to use format-patch and send-email if possible.  These commands
-are optimized for the workflow of sending patches, avoiding many ways
-your existing e-mail client that is optimized for "multipart/*" mime
-type e-mails to corrupt and render your patches unusable.
-
 People on the Git mailing list need to be able to read and
 comment on the changes you are submitting.  It is important for
 a developer to be able to "quote" your changes, using standard
@@ -148,18 +138,8 @@ e-mail tools, so that they may comment on specific portions of
 your code.  For this reason, each patch should be submitted
 "inline" in a separate message.
 
-Multiple related patches should be grouped into their own e-mail
-thread to help readers find all parts of the series.  To that end,
-send them as replies to either an additional "cover letter" message
-(see below), the first patch, or the respective preceding patch.
 
-If your log message (including your name on the
-Signed-off-by line) is not writable in ASCII, make sure that
-you send off a message in the correct encoding.
 
-WARNING: Be wary of your MUAs word-wrap
-corrupting your patch.  Do not cut-n-paste your patch; you can
-lose tabs that way if you are not careful.
 
 It is a common convention to prefix your subject line with
 [PATCH].  This lets people easily distinguish patches from other
@@ -187,31 +167,6 @@ an explanation of changes between each iteration can be kept in
 Git-notes and inserted automatically following the three-dash
 line via `git format-patch --notes`.
 
-Do not attach the patch as a MIME attachment, compressed or not.
-Do not let your e-mail client send quoted-printable.  Do not let
-your e-mail client send format=flowed which would destroy
-whitespaces in your patches. Many
-popular e-mail applications will not always transmit a MIME
-attachment as plain text, making it impossible to comment on
-your code.  A MIME attachment also takes a bit more time to
-process.  This does not decrease the likelihood of your
-MIME-attached change being accepted, but it makes it more likely
-that it will be postponed.
-
-Exception:  If your mailer is mangling patches then someone may ask
-you to re-send them using MIME, that is OK.
-
-Do not PGP sign your patch, at least for now.  Most likely, your
-maintainer or other people on the list would not have your PGP
-key and would not bother obtaining it anyway.  Your patch is not
-judged by who you are; a good patch from an unknown origin has a
-far better chance of being accepted than a patch from a known,
-respected origin that is done poorly or does incorrect things.
-
-If you really really really really want to do a PGP signed
-patch, format it as "multipart/signed", not a text/plain message
-that starts with '-BEGIN PGP SIGNED MESSAGE-'.  That is
-not a text/plain, it's something else.
 
 Send your patch with "To:" set to the mailing list, with "cc:" listing
 people who are involved in the area you are touching (the output from
@@ -370,95 +325,3 @@ Know the status of your patch after submission
   entitled "What's cooking in git.git" and "What's in git.git" giving
   the status of various proposed changes.
 
-
-MUA specific hints
-
-Some of patches I receive or pick up from the list share common
-patterns of breakage.  Please make sure your MUA is set up
-properly not to corrupt whitespaces.
-
-See the DISCUSSION section of git-format-patch(1) for hints on

[PATCH 2/2] Add submitGit patch-submission information

2016-04-14 Thread Roberto Tyley
Most of the guidance on how to use submitGit will stay with the tool
itself, so the edits here are mostly to make the choice clear to users.

Because generation of patches is quite different for MUA-users and
submitGit users, I've merged section 3 and 4 together:

section 3 - 'Generate your patch using Git tools out of your commits.'
+
section 4 - 'Sending your patches.'
=
new section 3 - 'Generate and send your patch to the Git mailing list'

I've edited the text of old section 3 to make it more concise (using
'make sure' for emphasis just once before presenting the requirements
list).
---
 Documentation/SubmittingPatches | 44 +++--
 1 file changed, 29 insertions(+), 15 deletions(-)

diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches
index 6dca41d..9735236 100644
--- a/Documentation/SubmittingPatches
+++ b/Documentation/SubmittingPatches
@@ -117,29 +117,43 @@ without external resources. Instead of giving a URL to a mailing list
 archive, summarize the relevant points of the discussion.
 
 
-(3) Generate your patch using Git tools out of your commits.
-
-
-Please make sure your patch does not add commented out debugging code,
-or include any extra files which do not relate to what your patch
-is trying to achieve. Make sure to review
-your patch after generating it, to ensure accuracy.  Before
-sending out, please make sure it cleanly applies to the "master"
-branch head.  If you are preparing a work based on "next" branch,
-that is fine, but please mark it as such.
-
-
-(4) Sending your patches.
+(3) Generate and send your patch to the Git mailing list
 
 People on the Git mailing list need to be able to read and
 comment on the changes you are submitting.  It is important for
 a developer to be able to "quote" your changes, using standard
 e-mail tools, so that they may comment on specific portions of
 your code.  For this reason, each patch should be submitted
-"inline" in a separate message.
+"inline" (not as an attachment) in a separate message.
+
+There can be unexpected problems in sending patches:
+
+  . Webmail clients like Gmail generally corrupt whitespace in patches.
+  . messages using HTML-formatting (used by default in many webmail
+clients) are automatically rejected by the Git mailing list server.
+
+Because of these factors, it's recommended that you use one of these
+specific methods to generate and send your patches:
+
+  - Generate mail-ready patch files using "git format-patch" and
+send them using "git send-email" to the Git mailing list.
+See SubmittingPatchesByMUA for further details.
 
+  - Create a pull request on https://github.com/git/git and
+use https://submitgit.herokuapp.com/ to send it as a patch series
+to the mailing list.  Note that the PR is just the place where your
+patch is born - discussion of the patch should still take place on
+the Git mailing list.
 
+Please make sure to review your patch before sending it, to ensure that
+it:
 
+  . accurately reflects the change you want to make
+  . does not add commented-out debugging code, or include any extra
+files which do not relate to what your patch is trying to achieve.
+  . cleanly applies to the "master" branch head.  If you are preparing
+a work based on "next" branch, that is fine, but please mark it as
+such.
 
 It is a common convention to prefix your subject line with
 [PATCH].  This lets people easily distinguish patches from other
@@ -186,7 +200,7 @@ patch.
  *2* The mailing list: git@vger.kernel.org
 
 
-(5) Sign your work
+(4) Sign your work
 
 To improve tracking of who did what, we've borrowed the
 "sign-off" procedure from the Linux kernel project on patches

--
https://github.com/git/git/pull/223
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4] commit: add a commit.verbose config variable

2016-03-11 Thread Roberto Tyley
On 11 March 2016 at 05:44, Eric Sunshine  wrote:
> On Fri, Mar 11, 2016 at 05:45:27AM +0530, Pranit Bauva wrote:
>> Actually I am sending the patches with submitGit herokuapp because my
>> institute proxy does not allow IMAP/POP3 connections.

Really glad to hear this is helping you Pranit - I hadn't even thought
of the blocked IMAP/POP3 connections problem, I'm not sure what other
method you could have easily used to get round this.

> That's unfortunate. Your separate "cover letter" often arrives hours
> later than the patch itself. Perhaps Roberto can comment on submitGit
> and per-patch commentary.

This sounds like an improvement I need to make to submitGit, I've
created an issue here:

https://github.com/rtyley/submitgit/issues/30


Re: [PATCH] Update diff-highlight

2016-02-26 Thread Roberto Tyley
On 22 February 2016 at 04:49, Eric Sunshine  wrote:
> On Sun, Feb 21, 2016 at 11:14 PM, Peter Dave Hello
>  wrote:
>> From: Peter Dave Hello 
>
> This "From:" line looks suspiciously incorrect. If anything, you'd
> probably want to drop the line altogether or use:
>
> From: Peter Dave Hello 

Peter's commit (https://github.com/git/git/commit/15415c6e) had an author of
'peterdavehe...@users.noreply.github.com' (perhaps because the commit was
generated through GitHub's interface?), and submitGit added it as an in-body
'From: ' line because it differed from the address used to send the email
(h...@peterdavehello.org - submitGit always uses the user's
primary-email-address-in-GitHub to send the email).

A 'noreply' address is obviously not wanted in this context though, so
I've updated submitGit to disregard them when deciding whether or not to
generate an in-body 'From: ' header:
https://github.com/rtyley/submitgit/pull/29
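
As a rough sketch of the rule described above (written in Python purely
for illustration; it is not the submitGit code, and the function name
and example addresses are hypothetical):

```python
NOREPLY_SUFFIX = "@users.noreply.github.com"


def in_body_from(author_name, author_email, sender_email):
    """Return an in-body 'From: ' line, or None when none should be added."""
    author = author_email.lower()
    if author.endswith(NOREPLY_SUFFIX):
        # A noreply author address is never useful on the list.
        return None
    if author == sender_email.lower():
        # The author is already the envelope sender; no extra line needed.
        return None
    return f"From: {author_name} <{author_email}>"


# Hypothetical addresses, for illustration only.
print(in_body_from("A U Thor", "author@example.org", "sender@example.org"))
```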

Roberto


Re: [PATCH v1] travis-ci: override CFLAGS properly, add -Wdeclaration-after-statement

2016-02-09 Thread Roberto Tyley
On 9 February 2016 at 18:42, Junio C Hamano  wrote:
> Lars Schneider  writes:
>> Jeff Merkey made me aware of http://kernelnewbies.org/FirstKernelPatch [2]
>> where I found checkpatch.pl [3]. Would it make sense to check all commits
>> that are not in next/master/maint with this script on Travis-CI?
>
> That does not help very much.  These changes are already shown to
> people and dirtied their eyes, and most likely I've already have
> wasted time tweaking the glitches out locally.  The damage has
> already been done.
>
> It would make a lot of sense if the checkpatch is called inside
> Roberto Tyley's "pull-request-to-patch-submission" thing, though.

I've not personally run checkpatch.pl (as Peff mentioned, it's not
actually a documented part of the Git project's recommended contribution
workflow) - I'm still trying to understand whether it will restrict
its errors to just the things that are introduced in a patch, or if
it will indiscriminately mention existing problems too (of which I
guess there are many already present in the live Git codebase?). If it
mentions _existing_ problems, I wouldn't personally want it in any
automated flow until it can be tuned to find the trees of master/maint
totally clean. At that point it could be added to the Travis build,
and GitHub would automatically reflect the Travis status in any
git/git PR.

I like the idea of giving helpful guidance to users on how to make
their patches cleaner - I'm not that enthusiastic about submitGit
invoking the checkpatch.pl script directly at this point, given that
it lives in a separate project (the linux kernel) and the version
Junio uses is patched off _that_ - I'm lazy enough to not want to try
to get that all to work reliably on a little transient Heroku box.


Re: [PATCH 2/2] stash: use "stash--helper"

2016-01-28 Thread Roberto Tyley
On 28 January 2016 at 21:41, Stefan Beller  wrote:
> On Thu, Jan 28, 2016 at 1:25 PM, Matthias Aßhauer  wrote:
>>>> https://github.com/git/git/pull/191
>>>
>>> Oh I see you're using the pull-request to email translator, cool!

Yay!

>> Yes, I did. It definitely makes things easier if you are not used to mailing
>> lists, but it was also a bit of a kerfuffle. I tried to start working on
>> cover letter support, but I couldn't get it to accept the Amazon SES
>> credentials I provided. I ended up manually submitting the cover letter. It
>> also didn't like my name.

Apologies for that - https://github.com/rtyley/submitgit/pull/26 has
just been deployed, which should resolve the encoding for non-US ASCII
characters - if you feel like submitting another patch, and want to
put the eszett back into your GitHub account display name, I'd be
interested to know how that goes.

> Not sure if Roberto, the creator of that tool, follows the mailing
> list.  I cc'd him.

I don't closely follow the mailing list, so thanks for the cc!

Roberto


Re: [RFC/PATCH v1] Add Travis CI support

2015-10-03 Thread Roberto Tyley
On 28 September 2015 at 19:47, Junio C Hamano  wrote:
> I won't enable it on github.com:gitster/git anyway, so I do not
> think that is a concern.  I thought what people are talking about
> was to add it on github.com:git/git, but have I been misreading the
> thread?  I do not even own the latter repository (I only can push
> into it).

I was momentarily surprised to hear that Junio doesn't own github.com/git/git
but I had a quick look at the github.com/git organisation, and it turns
out that Peff and Scott Chacon are the current owners - so at the
moment I think they're the only ones who could switch on the GitHub
webhook to hit Travis.

For what it's worth, I'd love to see Travis CI - or any form of CI -
running for the core Git project. It doesn't require giving write
access to Travis, and beyond the good reasons given by Lars,
I'm also personally interested because it opens up the possibility
of some useful enhancements to the submitGit flow - so that you
can't send email to the list without knowing you've broken tests
first.

Regarding Luke's concerns about excess emails coming from CI,
default Travis behaviour is for emails to be sent to the committer and
author, but only if they have write access to the repository the commit
was pushed to:

http://docs.travis-ci.com/user/notifications/#How-is-the-build-email-receiver-determined%3F

If Travis emails do become problematic, you can disable them
completely by adding 2 lines of config to the .travis.yml:

http://docs.travis-ci.com/user/notifications/#Email-notifications

Given this, enabling Travis CI for git/git seems pretty low risk,
are there any strong objections to it happening?


Re: [PATCH 1/2] rebase -i: demonstrate incorrect behavior of post-rewrite

2015-06-01 Thread Roberto Tyley
On 22 May 2015 at 16:59, Junio C Hamano gits...@pobox.com wrote:
> Roberto, isn't your threading of multi-patch series busted?
>
> Why is 1/2 a follow-up to 2/2?  Do you have a time-machine ;-)?

Oh, embarrassing, I better destroy the time-machine:

https://github.com/rtyley/submitgit/pull/5

This was due to me not realising that the GitHub API returns commit lists for
PRs in reverse-chronological order... thanks for pointing that out!
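
A minimal Python sketch of the kind of fix involved, assuming the commit
list really does arrive newest-first as described (this is not the
submitGit code; `number_patches` and the example subjects are made up
for the illustration):

```python
def number_patches(subjects_newest_first):
    """Number patch subjects oldest-first from a newest-first commit list."""
    ordered = list(reversed(subjects_newest_first))
    total = len(ordered)
    return [
        f"[PATCH {i}/{total}] {subject}"
        for i, subject in enumerate(ordered, start=1)
    ]


# Hypothetical two-commit PR, listed newest-first.
numbered = number_patches(["second change", "first change"])
for line in numbered:
    print(line)
```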


[Announce] submitGit for patch submission (was Diffing submodule does not yield complete logs)

2015-05-22 Thread Roberto Tyley
On Tuesday, 19 May 2015, Stefan Beller sbel...@google.com wrote:
> On Tue, May 19, 2015 at 12:29 PM, Robert Dailey
> rcdailey.li...@gmail.com wrote:
>> How do you send your patches inline?
[snip]
> This workflow discussion was a topic at the GitMerge2015 conference,
> and there are essentially 2 groups, those who know how to send email
> and those who complain about it. A solution was agreed on by nearly all
> of the contributors. It would be awesome to have a git-to-email proxy,
> such that you could do a git push proxy master:refs/for/mailinglist
> and this proxy would convert the push into sending patch series to the
> mailing list. It could even convert the following discussion back into
> comments (on Github?) but as a first step we'd want to try out a one
> way proxy.
>
> Unfortunately nobody stepped up to actually do the work, yet :(


Hello, I'm stepping up to do that work :) Or at least, I'm implementing a
one-way GitHub PR -> Mailing list tool, called submitGit:

https://submitgit.herokuapp.com/

Here's what a user does:

* creates a PR on https://github.com/git/git
* logs into https://submitgit.herokuapp.com/ with GitHub auth
* selects their PR on https://submitgit.herokuapp.com/git/git/pulls
* gets submitGit to email the PR as patches to themselves, in order to
  check it looks ok
* when they're ready, gets submitGit to send it to the mailing list on
  their behalf

All discussion of the patch *stays* on the mailing list - I'm not
attempting to change anything about the Git community process, other
than make it easier for a wider group of people to submit patches to
the list.

For hard-core contributors to Git, I'd imagine that git format-patch and
send-email remain the fastest way to do their work. But those tools are
_unfamiliar to the majority of Git users_ - so submitGit aims to cater to
those users, because they definitely have valuable contributions to make,
which would be tragic to throw away.

I've been working on submitGit in my spare time for the past few weeks,
and there are still features I plan to add (like guiding the user to more
'correct' word wrapping, sign-off, etc), but given this discussion, I
wanted to chime in and let people know what's here so far. It would be
great if people could take the time to explore the tool (you don't have
to raise a git/git PR in order to try sending one *to yourself*, for
instance) and give feedback on list, or in GitHub issues:

https://github.com/rtyley/submitgit/issues

I've been lucky enough to discuss the ideas around submitGit with a few
people at the Git-Merge conf, so thanks to Peff, Thomas Ferris Nicolaisen,
and Emma Jane Hogbin Westby for listening to me (not to imply their
endorsement of what I've done, just thanks for talking about it!).

Roberto


Re: Diffing submodule does not yield complete logs for merge commits

2015-05-22 Thread Roberto Tyley
On Tuesday, 19 May 2015, Stefan Beller sbel...@google.com wrote:
> On Tue, May 19, 2015 at 12:29 PM, Robert Dailey
> rcdailey.li...@gmail.com wrote:
>> How do you send your patches inline?
>
> This workflow discussion was a topic at the GitMerge2015 conference,
> and there are essentially 2 groups, those who know how to send email
> and those who complain about it. A solution was agreed on by nearly all
> of the contributors. It would be awesome to have a git-to-email proxy,
> such that you could do a git push proxy master:refs/for/mailinglist
> and this proxy would convert the push into sending patch series to the
> mailing list. It could even convert the following discussion back into
> comments (on Github?) but as a first step we'd want to try out a one
> way proxy.
>
> Unfortunately nobody stepped up to actually do the work, yet :(

I've replied to this on a separate announcement thread on the Git mailing
list here:

http://thread.gmane.org/gmane.comp.version-control.git/269699

...I've created a new tool called submitGit, which aims to help.

>> I am willing to review the typical workflow for contributing via git
>> on mailing lists but I haven't seen any informative reading material
>> on this. I just find using command line to email patches and dealing
>> with other issues not worth the trouble. Lack of syntax highlighting,
>> lack of monospace font, the fact that I'm basically forced to install
>> mail client software just to contribute a single git patch.

I'd be interested to know what you think!

Roberto


Re: filter-branch performance

2014-12-10 Thread Roberto Tyley
On 9 December 2014 at 18:59, Jeff King p...@peff.net wrote:
> On Tue, Dec 09, 2014 at 07:52:33PM +0100, Henning Moll wrote:
>> I assume that there is a lot of process forking going on. Could that be the
>> cause?
>
> Yes. filter-branch is a shell script, and it is probably running
> multiple git commands per commit it is filtering.

>> Any ideas how to further improve?

Depending on how much time you can sink into improving the performance
(versus just allowing the process to run to completion), you could
also look into a non-forking solution, as well as not bothering to
load the commit trees. To me non-forking means putting everything into
the JVM by using JGit, like the BFG does, though libgit2 might also be
an option.

Changing the BFG's code to do the transformation in your script is
absolutely trivial - define a commit-node cleaner like this:

object SetCommitterToAuthor extends CommitNodeCleaner {
  override def fixer(kit: CommitNodeCleaner.Kit) = c =>
    c.copy(committer = c.author) // PersonIdent class holds name, email & time
}

...trivial if you don't mind compiling Scala with SBT that is, and I'm
sure some people do! A DSL for non-Scala people to define their own
BFG scripts would be good, I must get on that some day.

The BFG is generally faster than filter-branch for 3 reasons:

1. No forking - everything stays in the JVM process
2. Embarrassingly parallel algorithm makes good use of multi-core machines
3. Memoization means no Git object (file or folder) is cleaned more than once

In the case of your problem, only the first factor will be noticeably
helpful. Unfortunately commits do need to be cleaned sequentially, as
their hashes depend on the hashes of their parents, and filter-branch
doesn't clean /commits/ more than once, the way it does with files or
folders - so the last 2 reasons in the list won't be significant.

For your specific use case tho', the fact that BFG doesn't load the
file tree at all unless it needs to clean it will also help.
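
To illustrate just the memoization point (reason 3 in the list above),
here is a toy Python sketch. It is not BFG code, which is Scala on top
of JGit; `memoized`, `toy_clean` and the object ids are invented for the
example.

```python
def memoized(clean):
    """Wrap clean(object_id) so each distinct object id is cleaned only once."""
    cache = {}

    def cleaner(object_id):
        if object_id not in cache:
            cache[object_id] = clean(object_id)
        return cache[object_id]

    return cleaner


calls = []  # records which ids actually reached the expensive cleaner


def toy_clean(object_id):
    calls.append(object_id)
    return "cleaned-" + object_id


clean_once = memoized(toy_clean)
# The same tree id shows up in several commits, but is cleaned a single time.
results = [clean_once(oid) for oid in ["tree-a", "blob-b", "tree-a", "tree-a"]]
print(calls)
```

Because each commit id is only ever seen once, a cache like this gains
nothing for commits themselves, which matches the point made above.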

I decided to knock up an egregious hack in the BFG to see what
performance would be like. I ran it against a fairly large repo
(https://github.com/bfg-repo-cleaner-demos/intellij-community-original),
100k commits, stored in /dev/shm, and used the SetCommitterToAuthor
code above. The BFG run completed in 31.7 seconds, you can see the
resulting repo here:

https://github.com/rtyley/intellij-community-set-committer-to-author

I started running the same test some time ago using filter-branch,
unfortunately that test has not completed yet - the BFG appears to be
substantially faster.

Before:
$ git cat-file -p b02bf46c4e93c2e8570910cdd68eb6f4ce21ff81
tree 7a412e49ecdbd966d7efe5fe746ff3ea3b6067d1
parent 8794219e3e84aed3cc8af926ffd74beafa51fb6b
author peter pe...@jetbrains.com 1370854045 +0200
committer peter pe...@jetbrains.com 1370854098 +0200

After:
$ git cat-file -p 3adb7b2a5c87320a5a028b6a59a7132c75a6e91c
tree 7a412e49ecdbd966d7efe5fe746ff3ea3b6067d1
parent 5efcdb551789b0d0bb541de9325f09521c5fbcb6
author peter pe...@jetbrains.com 1370854045 +0200
committer peter pe...@jetbrains.com 1370854045 +0200 <- time fixed

The relevant code is in:
https://github.com/rtyley/bfg-repo-cleaner/compare/set-committer-to-author


Re: filter-branch performance

2014-12-10 Thread Roberto Tyley
On 10 December 2014 at 14:37, Jeff King p...@peff.net wrote:
> On Wed, Dec 10, 2014 at 02:18:24PM +, Roberto Tyley wrote:
>> object SetCommitterToAuthor extends CommitNodeCleaner {
>>   override def fixer(kit: CommitNodeCleaner.Kit) = c =>
>>     c.copy(committer = c.author) // PersonIdent class holds name, email & time
>> }
>
> Thanks. I _almost_ mentioned BFG in the original email, but I didn't
> think it could do arbitrary fixes like this. Can you monkey-patch in
> arbitrary code, or do you have to rebuild all of BFG to include the
> snippet above?

Well, I publish a bfg-library jar to Maven Central, so you don't need
to rebuild that:

http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22bfg-library_2.11%22

...in principle you can write a Java/Groovy/whatever project that
calls that jar (your entry point would be
com.madgag.git.bfg.cleaner.RepoRewriter) - tho' to be honest, I can't
swear to how /friendly/ the API would be to call from non-Scala-land,
as I haven't tried it.

Incidentally, if people want to try compiling this monkey-patched BFG
at home, this is how you'd do it:

* Install SBT - http://www.scala-sbt.org/download.html (or 'brew
install sbt' for Mac OS X)
* git clone https://github.com/rtyley/bfg-repo-cleaner.git --branch
set-committer-to-author
* cd bfg-repo-cleaner
* sbt bfg/run --no-blob-protection

There will be a lot of automated downloading of dependencies, and
compilation will be slow the first time around, but at least there
aren't that many steps. I do realise that being Scala/JVM based makes
working on the BFG a bit of a specialist activity at the moment!

>> A DSL for non-Scala people to define their own
>> BFG scripts would be good, I must get on that some day.
>
> That would be cool.  Even if the DSL was just Java, if you could do
> something like:
>
>   vi fix.java
>   javac fix.java
>   bfg --filter=fix.class
>
> that would be very useful (and I am probably showing my lack of Java chops
> by getting the compilation command or filenames wrong :) ).

Your syntax is right :) I'll give it some thought.


>> I started running the same test some time ago using filter-branch,
>> unfortunately that test has not completed yet - the BFG appears to be
>> substantially faster.
>
> No fair if you didn't run filter-branch on a PC and BFG on a Raspberry
> Pi. You have to give us a fighting chance. :)

I guess I made that rod for my own back :) http://youtu.be/Ir4IHzPhJuI
for those who haven't seen it.


Re: Blobs not referenced by file (anymore) are not removed by GC

2014-12-10 Thread Roberto Tyley
On 10 December 2014 at 16:07, Junio C Hamano gits...@pobox.com wrote:
 Jeff King p...@peff.net writes:
 git reflog expire --expire=now --all && git gc --prune=now --aggressive

 Maybe:

 git gc --purge

 Yeah, that is common enough that it might be worthwhile (you probably
 want --expire-unreachable in the reflog invocation, though).

 Also you would not want an unconditional --aggressive.

After a big rewrite deleting files the re-optimisation of --aggressive
can make a big difference to packsize - for instance 1.2GB to 768MB in
a test I just ran - but of course it is *much* slower, so I suspect
you're right about not including it.

I wasn't aware of the '--expire-unreachable=all' switch, though it
seems like a 'milder' version of the '--expire=now' switch? - in that
it would keep reflog entries if they haven't been changed, which is
fair enough and compatible with the 'purge' goal.


Re: filter-branch performance

2014-12-10 Thread Roberto Tyley
On 10 December 2014 at 16:05, Junio C Hamano gits...@pobox.com wrote:
> Roberto Tyley roberto.ty...@gmail.com writes:
>
>> The BFG is generally faster than filter-branch for 3 reasons:
>>
>> 1. No forking - everything stays in the JVM process
>> 2. Embarrassingly parallel algorithm makes good use of multi-core machines
>> 3. Memoization means no Git object (file or folder) is cleaned more than once
>>
>> In the case of your problem, only the first factor will be noticeably
>> helpful. Unfortunately commits do need to be cleaned sequentially, as
>> their hashes depend on the hashes of their parents, and filter-branch
>> doesn't clean /commits/ more than once, the way it does with files or
>> folders - so the last 2 reasons in the list won't be significant.
>
> Just this part.  If your history is bushy, you should be able to
> rewrite histories of merged branches in parallel up to the point
> they are merged---rewriting of the merge commit of course has to
> wait until all the branches have been rewritten, though.

That's true, and the BFG does take advantage of that parallelism, so
as well as point 1, point 2 will provide some benefit if history is
bushy enough :)
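
As a rough illustration of the point about bushy history (a sketch only, not how the BFG or filter-branch is actually implemented): if commits are grouped by their depth in the parent graph, every commit in a group depends only on earlier groups, so each group could in principle be rewritten in parallel. The `parents` mapping and the commit names below are made up for the example.

```python
from functools import lru_cache

def rewrite_waves(parents):
    """Group commits into waves; a wave only depends on earlier waves.

    `parents` maps a commit name to the list of its parent names
    (hypothetical data, not read from a real repository).
    """
    @lru_cache(maxsize=None)
    def depth(commit):
        # A root commit has depth 0; otherwise one more than its deepest parent.
        return 1 + max((depth(p) for p in parents[commit]), default=-1)

    waves = {}
    for commit in parents:
        waves.setdefault(depth(commit), []).append(commit)
    return [sorted(waves[d]) for d in sorted(waves)]

history = {
    "root": [],
    "a1": ["root"], "a2": ["a1"],   # branch A
    "b1": ["root"], "b2": ["b1"],   # branch B
    "merge": ["a2", "b2"],          # has to wait for both branches
}
for wave in rewrite_waves(history):
    print(wave)
```

A purely linear history gives waves of one commit each, which is why only a bushy history benefits.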


Re: Blobs not referenced by file (anymore) are not removed by GC

2014-12-09 Thread Roberto Tyley
On 9 December 2014 at 14:14, Jeff King p...@peff.net wrote:
 On Mon, Dec 08, 2014 at 05:22:23PM +0100, Martin Scherer wrote:

 # invoke bfg --delete-folders something multiple times with different
 pattern.

 # try to cleanup

 git gc --aggressive --prune=now # big blobs still in history
 git fsck # no results
 git fsck --full  --unreachable --dangling # no results

 Might you still have reflogs pointing to the objects? Try:

   git reflog expire --expire-unreachable=now --all

Yeah, we figured that's what it was!

https://github.com/rtyley/bfg-repo-cleaner/issues/62#issuecomment-66152559

 I also don't know if BFG keeps backup refs around (filter-branch, for
 example, writes a copy of the original refs into refs/original; you
 would want to delete that if you're trying to slim down the repo).

The BFG reports the ref changes to the command line (and outputs a
full list of changed object-ids in
repo-name.git.bfg-report/[datetime]/object-id-map.old-new.txt) but
doesn't keep refs (like refs/original) around because that would get
in the way of the BFG's explicit intended use-case of removing
unwanted data.

Thanks for the object-size checking scripts, very useful.

Roberto


Re: Blobs not referenced by file (anymore) are not removed by GC

2014-12-09 Thread Roberto Tyley
On Tuesday, 9 December 2014, Jeff King p...@peff.net wrote:
 I actually think filter-branch's refs/original is a bit outdated at
 this point. The information is there in the reflogs already, and
 dealing with refs/original often causes confusion in my experience. It
 could probably use a git filter-branch --restore or something to
 switch each $ref to $ref@{1} (after making sure that the reflog entry
 was from filter-branch, of course).

Yeah, I'd agree that refs/original can cause confusion.


 Not that I expect you to want to work on filter-branch. :) But maybe
 food for thought for a BFG feature.

I haven't heard much demand for a recover/restore feature on the BFG
(I think by the time people get to the BFG, they're pretty sure they
want to go ahead with the procedure!) but I'll bear it in mind. Mind
you, to make the post-rewrite clean-up easier, I'd be happy to
contribute a patch that gives 'gc' a flag to do the equivalent of:

git reflog expire --expire=now --all && git gc --prune=now --aggressive

Maybe:

git gc --purge

??
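
A minimal sketch of what such a flag would be shorthand for, written as a small Python wrapper. The function names are hypothetical and `--purge` is only the proposal made here, not an existing git option; the two git invocations are the ones already given in this message.

```python
import subprocess

def purge_commands():
    """The two git invocations a hypothetical 'git gc --purge' would stand for."""
    return [
        ["git", "reflog", "expire", "--expire=now", "--all"],
        ["git", "gc", "--prune=now", "--aggressive"],
    ]

def purge(repo_dir):
    """Run the commands in order inside repo_dir, stopping at the first failure."""
    for argv in purge_commands():
        subprocess.run(argv, cwd=repo_dir, check=True)
```

Calling `purge("/path/to/repo.git")` (path is a placeholder) would run both steps; `check=True` means the gc is skipped if the reflog expiry fails.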


Re: Blobs not referenced by file (anymore) are not removed by GC

2014-12-08 Thread Roberto Tyley
Hi Martin, I'm the developer of the BFG - I'd guess that there
probably isn't a bug for Git developers here, so you might want to
open one or more issues at
https://github.com/rtyley/bfg-repo-cleaner/issues, where I'd be happy
to take a look.

best regards,
Roberto

 On 8 Dec 2014 16:35, Martin Scherer m.sche...@fu-berlin.de wrote:

 Hi,

 after using BFG on a repo given certain directory globs, all of those
 files(names) are gone from history, but can not be collected by garbage
 collection anymore. So the blobs of the underlying files are not deleted
 and only the file names are not associated with the blob anymore. I
 wonder, if I discovered a bug (at least in bfg). But I expect git to
 discover that these blobs are not used in any way (so they have to be
 associated to something, right?)

 # invoke bfg --delete-folders something multiple times with different
 pattern.

 # try to cleanup

 git gc --aggressive --prune=now # big blobs still in history
 git fsck # no results
 git fsck --full  --unreachable --dangling # no results

 to verify if the blobs are still there, see the output of

 git gc && git verify-pack -v .git/objects/pack/pack-*.idx | egrep "^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$" | sort -k 3 -n -r > bigobjects.txt

 head bigobjects.txt # outputs 9451427d7335395779b91864418630d2f0af780a
 blob   7895212 1869047 7657491


 Also if bfg is being told to remove the biggest blob (bfg -B 1) with
 no-blob-protection, it does not succeed in removing it.

 --- output of bfg -B 1

 Found 1 blob ids for large blobs - biggest=7895212 smallest=7895212
 

 BFG aborting: No refs to update - no dirty commits found??
 ---

 The repo can be found here.

 https://github.com/marscher/stallone_stale_objects

 I will restart all over to cleanup the history, but I guess this might
 be interesting for git developers.


 Best,
 Martin


old git documentation pages hosted at kernel.org

2014-07-26 Thread Roberto Tyley
The Git documentation pages hosted at kernel.org are a bit over a year
out of date (Last updated 2013-02-15 19:24:31 UTC) - so from around
Git v1.8:

https://www.kernel.org/pub/software/scm/git/docs/

Are they fiddly to update? Should they be updated in celebration of
Git 2.0, or maybe instead redirect to http://git-scm.com/docs ?


[PATCH] Fix documentation AsciiDoc links for external urls

2014-02-18 Thread Roberto Tyley
Turns out that putting 'link:' before the 'http' is actually superfluous
in AsciiDoc, as there's already a predefined macro to handle it.

http, https, [etc] URLs are rendered using predefined inline macros.
http://www.methods.co.nz/asciidoc/userguide.html#_urls

Hypertext links to files on the local file system are specified
using the link inline macro.
http://www.methods.co.nz/asciidoc/userguide.html#_linking_to_local_documents

Despite being superfluous, the reference implementation of AsciiDoc
tolerates the extra 'link:' and silently removes it, giving a functioning
link in the generated HTML. However, AsciiDoctor (the Ruby implementation
of AsciiDoc used to render the http://git-scm.com/ site) does /not/ have
this behaviour, and so generates broken links, as can be seen here:

http://git-scm.com/docs/git-cvsimport (links to cvs2git & parsecvs)
http://git-scm.com/docs/git-filter-branch (link to The BFG)

It's worth noting that after this change, the html generated by 'make html'
in the git project is identical, and all links still work.

Signed-off-by: Roberto Tyley roberto.ty...@gmail.com
---
 Documentation/git-cvsimport.txt   | 4 ++--
 Documentation/git-filter-branch.txt   | 4 ++--
 Documentation/gitcore-tutorial.txt| 2 +-
 Documentation/gitcvs-migration.txt| 2 +-
 Documentation/gitweb.txt  | 2 +-
 Documentation/technical/http-protocol.txt | 4 ++--
 6 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/Documentation/git-cvsimport.txt b/Documentation/git-cvsimport.txt
index 2df9953..260f39f 100644
--- a/Documentation/git-cvsimport.txt
+++ b/Documentation/git-cvsimport.txt
@@ -21,8 +21,8 @@ DESCRIPTION
 *WARNING:* `git cvsimport` uses cvsps version 2, which is considered
 deprecated; it does not work with cvsps version 3 and later.  If you are
 performing a one-shot import of a CVS repository consider using
-link:http://cvs2svn.tigris.org/cvs2git.html[cvs2git] or
-link:https://github.com/BartMassey/parsecvs[parsecvs].
+http://cvs2svn.tigris.org/cvs2git.html[cvs2git] or
+https://github.com/BartMassey/parsecvs[parsecvs].
 
 Imports a CVS repository into Git. It will either create a new
 repository, or incrementally import into an existing one.
diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
index 2eba627..09535f2 100644
--- a/Documentation/git-filter-branch.txt
+++ b/Documentation/git-filter-branch.txt
@@ -436,7 +436,7 @@ git-filter-branch allows you to make complex shell-scripted rewrites
 of your Git history, but you probably don't need this flexibility if
 you're simply _removing unwanted data_ like large files or passwords.
 For those operations you may want to consider
-link:http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner],
+http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner],
 a JVM-based alternative to git-filter-branch, typically at least
 10-50x faster for those use-cases, and with quite different
 characteristics:
@@ -455,7 +455,7 @@ characteristics:
   _is_ possible to write filters that include their own parallellism,
   in the scripts executed against each commit.
 
-* The link:http://rtyley.github.io/bfg-repo-cleaner/#examples[command options]
+* The http://rtyley.github.io/bfg-repo-cleaner/#examples[command options]
   are much more restrictive than git-filter branch, and dedicated just
   to the tasks of removing unwanted data- e.g:
   `--strip-blobs-bigger-than 1M`.
diff --git a/Documentation/gitcore-tutorial.txt b/Documentation/gitcore-tutorial.txt
index 058a352..d2d7c21 100644
--- a/Documentation/gitcore-tutorial.txt
+++ b/Documentation/gitcore-tutorial.txt
@@ -1443,7 +1443,7 @@ Although Git is a truly distributed system, it is often
 convenient to organize your project with an informal hierarchy
 of developers. Linux kernel development is run this way. There
 is a nice illustration (page 17, Merges to Mainline) in
-link:http://www.xenotime.net/linux/mentor/linux-mentoring-2006.pdf[Randy Dunlap's presentation].
+http://www.xenotime.net/linux/mentor/linux-mentoring-2006.pdf[Randy Dunlap's presentation].
 
 It should be stressed that this hierarchy is purely *informal*.
 There is nothing fundamental in Git that enforces the chain of
diff --git a/Documentation/gitcvs-migration.txt b/Documentation/gitcvs-migration.txt
index 5ea94cb..5f4e890 100644
--- a/Documentation/gitcvs-migration.txt
+++ b/Documentation/gitcvs-migration.txt
@@ -117,7 +117,7 @@ Importing a CVS archive
 ---
 
 First, install version 2.1 or higher of cvsps from
-link:http://www.cobite.com/cvsps/[http://www.cobite.com/cvsps/] and make
+http://www.cobite.com/cvsps/[http://www.cobite.com/cvsps/] and make
 sure it is in your path.  Then cd to a checked out CVS working directory
 of the project you are interested in and run linkgit:git-cvsimport[1]:
 
diff --git a/Documentation/gitweb.txt b/Documentation/gitweb.txt
index cca14b8..cd9c895 100644
--- a/Documentation/gitweb.txt
+++ b

[PATCH] Fix documentation AsciiDoc links for external urls

2014-02-15 Thread Roberto Tyley
Turns out that putting 'link:' before the 'http' is actually superfluous
in AsciiDoc, as there's already a predefined macro to handle it.

http, https, [etc] URLs are rendered using predefined inline macros.
http://www.methods.co.nz/asciidoc/userguide.html#_urls

Hypertext links to files on the local file system are specified
using the link inline macro.
http://www.methods.co.nz/asciidoc/userguide.html#_linking_to_local_documents

Despite being superfluous, the reference implementation of AsciiDoc
tolerates the extra 'link:' and silently removes it, giving a functioning
link in the generated HTML. However, AsciiDoctor (the Ruby implementation
of AsciiDoc used to render the http://git-scm.com/ site) does /not/ have
this behaviour, and so generates broken links, as can be seen here:

http://git-scm.com/docs/git-cvsimport (links to cvs2git & parsecvs)
http://git-scm.com/docs/git-filter-branch (link to The BFG)

It's worth noting that after this change, the html generated by 'make html'
in the git project is identical, and all links still work.
---
 Documentation/git-cvsimport.txt   | 4 ++--
 Documentation/git-filter-branch.txt   | 4 ++--
 Documentation/gitcore-tutorial.txt| 2 +-
 Documentation/gitcvs-migration.txt| 2 +-
 Documentation/gitweb.txt  | 2 +-
 Documentation/technical/http-protocol.txt | 4 ++--
 6 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/Documentation/git-cvsimport.txt b/Documentation/git-cvsimport.txt
index 2df9953..260f39f 100644
--- a/Documentation/git-cvsimport.txt
+++ b/Documentation/git-cvsimport.txt
@@ -21,8 +21,8 @@ DESCRIPTION
 *WARNING:* `git cvsimport` uses cvsps version 2, which is considered
 deprecated; it does not work with cvsps version 3 and later.  If you are
 performing a one-shot import of a CVS repository consider using
-link:http://cvs2svn.tigris.org/cvs2git.html[cvs2git] or
-link:https://github.com/BartMassey/parsecvs[parsecvs].
+http://cvs2svn.tigris.org/cvs2git.html[cvs2git] or
+https://github.com/BartMassey/parsecvs[parsecvs].
 
 Imports a CVS repository into Git. It will either create a new
 repository, or incrementally import into an existing one.
diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
index 2eba627..09535f2 100644
--- a/Documentation/git-filter-branch.txt
+++ b/Documentation/git-filter-branch.txt
@@ -436,7 +436,7 @@ git-filter-branch allows you to make complex shell-scripted rewrites
 of your Git history, but you probably don't need this flexibility if
 you're simply _removing unwanted data_ like large files or passwords.
 For those operations you may want to consider
-link:http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner],
+http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner],
 a JVM-based alternative to git-filter-branch, typically at least
 10-50x faster for those use-cases, and with quite different
 characteristics:
@@ -455,7 +455,7 @@ characteristics:
   _is_ possible to write filters that include their own parallellism,
   in the scripts executed against each commit.
 
-* The link:http://rtyley.github.io/bfg-repo-cleaner/#examples[command options]
+* The http://rtyley.github.io/bfg-repo-cleaner/#examples[command options]
   are much more restrictive than git-filter branch, and dedicated just
   to the tasks of removing unwanted data- e.g:
   `--strip-blobs-bigger-than 1M`.
diff --git a/Documentation/gitcore-tutorial.txt b/Documentation/gitcore-tutorial.txt
index 058a352..d2d7c21 100644
--- a/Documentation/gitcore-tutorial.txt
+++ b/Documentation/gitcore-tutorial.txt
@@ -1443,7 +1443,7 @@ Although Git is a truly distributed system, it is often
 convenient to organize your project with an informal hierarchy
 of developers. Linux kernel development is run this way. There
 is a nice illustration (page 17, Merges to Mainline) in
-link:http://www.xenotime.net/linux/mentor/linux-mentoring-2006.pdf[Randy Dunlap's presentation].
+http://www.xenotime.net/linux/mentor/linux-mentoring-2006.pdf[Randy Dunlap's presentation].
 
 It should be stressed that this hierarchy is purely *informal*.
 There is nothing fundamental in Git that enforces the chain of
diff --git a/Documentation/gitcvs-migration.txt b/Documentation/gitcvs-migration.txt
index 5ea94cb..5f4e890 100644
--- a/Documentation/gitcvs-migration.txt
+++ b/Documentation/gitcvs-migration.txt
@@ -117,7 +117,7 @@ Importing a CVS archive
 ---
 
 First, install version 2.1 or higher of cvsps from
-link:http://www.cobite.com/cvsps/[http://www.cobite.com/cvsps/] and make
+http://www.cobite.com/cvsps/[http://www.cobite.com/cvsps/] and make
 sure it is in your path.  Then cd to a checked out CVS working directory
 of the project you are interested in and run linkgit:git-cvsimport[1]:
 
diff --git a/Documentation/gitweb.txt b/Documentation/gitweb.txt
index cca14b8..cd9c895 100644
--- a/Documentation/gitweb.txt
+++ b/Documentation/gitweb.txt
@@ -84,7 +84,7 @@ separator 

Re: [BUG?] inconsistent `git reflog show` output, possibly `git fsck` output

2013-09-22 Thread Roberto Tyley

On 21/09/2013 23:16, Keshav Kini wrote:

[SNIP]
This situation came about because the BFG Repo-Cleaner doesn't write new
reflog entries after creating its new objects and moving refs around.


True enough - I don't think the BFG does write new entries to the
reflog when it does the final ref-update, and it would be nicer if it 
did. I'll get that fixed.


thanks,
Roberto


Re: commit-message attack for extracting sensitive data from rewritten Git history

2013-04-09 Thread Roberto Tyley
On 9 April 2013 18:01, Jeff King p...@peff.net wrote:
 On Tue, Apr 09, 2013 at 08:03:24AM +0200, Johannes Sixt wrote:
 If A mentions B (think of cherry-pick -x), then you must ensure that the
 branch containing B was traversed first.

 Yeah, you're right. Multiple passes are necessary to get it
 completely right. And because each pass may change more commit id's, you
 have to recurse to pick up those changes, and keep going until you have
 a pass with no changes.

Just to give some context on how the BFG handles this (without doing
multiple passes):

The BFG makes a design choice (based on its intended use-case of
annihilating unwanted data) that a specific tree or blob will always
be cleaned in exactly the same way - because when you're trying to get
rid of large blobs or private data, you most likely /don't care/ where
it is, what commit it belongs to, how old it is. The id for a cleaned
tree or blob is always the same no matter where it came from, and so
the BFG maintains an in-memory mapping of 'dirty' to 'clean' object ids
while cleaning a repo - whenever an object (commit, tag, tree, blob)
is cleaned, these values are stored in the map:


  dirty-id -> clean-id
  clean-id -> clean-id

(in terms of memory overhead, this amounts to only ~ 128MB for even
quite a large repo like the linux kernel, so I don't spend much time
worrying about it)


The map memoises the cleaning functions on all objects, so an object
(particularly a tree) never gets cleaned more than once, which is one
of the things that makes the BFG fast.
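
A minimal sketch of that memoising map, in Python rather than the BFG's own Scala, with a toy cleaning function standing in for the real work. All names here are hypothetical; only the two map entries per cleaned object come from the description above.

```python
def make_cleaner(clean_one):
    """Wrap clean_one (object id -> cleaned id) with a dirty->clean map."""
    id_map = {}

    def clean(object_id):
        if object_id not in id_map:
            cleaned = clean_one(object_id)
            id_map[object_id] = cleaned   # dirty-id -> clean-id
            id_map[cleaned] = cleaned     # clean-id -> clean-id
        return id_map[object_id]

    return clean, id_map

calls = []

def toy_clean(object_id):
    # Stand-in for real cleaning work: record the call and rename the id.
    calls.append(object_id)
    return object_id.replace("dirty", "clean")

clean, id_map = make_cleaner(toy_clean)
clean("dirty-tree-1")
clean("dirty-tree-1")   # answered from the map, toy_clean is not called again
clean("clean-tree-1")   # already-clean ids map to themselves
print(calls)
```

The second and third calls never reach `toy_clean`, which is the effect that keeps a tree from being cleaned more than once.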

Having these memoised functions makes cleaning commit messages fairly
easy - the message is grepped for hex strings more than a few
characters in length, and if a matched string resolves uniquely to an
object id in the repo, the clean() method is called on it to get the
cleaned id - which will either return immediately with a previously
calculated result, or if the id came from a different branch, trigger
a cascade of more cleaning, eventually returning the required cleaned
id.
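
The message-cleaning step can be sketched roughly as follows. This is an illustration under assumptions, not the BFG's code: the minimum length of 7 hex characters, the `known_ids` list and the `clean` callable are all made up for the example, and the ids are obviously fake.

```python
import re

# Assumption: "more than a few characters" is taken to mean 7 to 40 hex digits.
HEX_ID = re.compile(r"\b[0-9a-f]{7,40}\b")

def clean_message(message, known_ids, clean):
    """Replace abbreviated or full object ids in message with cleaned ids."""
    def substitute(match):
        prefix = match.group(0)
        candidates = [i for i in known_ids if i.startswith(prefix)]
        if len(candidates) != 1:
            return prefix  # not an object id, or ambiguous: leave untouched
        # Keep the abbreviation length the author used in the message.
        return clean(candidates[0])[:len(prefix)]
    return HEX_ID.sub(substitute, message)

OLD = "aaaaaaa1" + "0" * 32   # fake 40-character ids for illustration
NEW = "bbbbbbb2" + "0" * 32
mapping = {OLD: NEW}

print(clean_message("This reverts commit aaaaaaa1.", [OLD], mapping.get))
print(clean_message("See deadbeef for details.", [OLD], mapping.get))
```

Hex strings that do not resolve uniquely to a known object, like the second example, pass through unchanged.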

In the case of git-filter-branch, the user has a lot more freedom to
change the tree-structure of commits on a commit-by-commit basis, so
memoising tree-cleaning is out of the question, but I guess it might
be possible to do memoisation of just the commit ids to short-cut the
multiple-pass problem.

- Roberto Tyley


commit-message attack for extracting sensitive data from rewritten Git history

2013-04-07 Thread Roberto Tyley
This is a demonstration of a mildly-interesting security concern
relating to Git & git-filter-branch - not a vulnerability in Git
itself, just in the way it can be used. I thought it was interesting
to demonstrate that there is sometimes an avenue of attack for
recovering sensitive data that's been removed from Git history using
git-filter-branch. I think it's a low-severity issue, you may wish to
ignore this, and indeed I've been very politely told already that it's
clearly nonsense :)

Here's an unmodified repo, in which the user unwisely committed a
database password:

https://github.com/bfg-repo-cleaner-demos/gma-demo-repo-original/commit/8c9cfe3c

The unwise commit is reverted with a second commit using 'git revert',
which obviously leaves the password in Git history, and - some time
later - it's decided to properly clean the repo history with
git-filter-branch & git gc, purging the password so the repo can be
more widely shared (open-sourced, or just externally hosted).

git-filter-branch works exactly as intended, purging the password, but
the one thing it does not - typically - do is update the commit
message. So in the cleaned repo, the commit message for the revert
commit still looks like this:

https://github.com/bfg-repo-cleaner-demos/gma-demo-repo-git-filter-branch-cleaned/commit/bf0637a5

It contains a commit id (8c9cfe3) which is no longer in the repo, but
can very easily be associated with an existing commit simply by
examining the subject line of the reverted commit ("Carelessly
checking password into source control"). It's also obvious, from
examining the repo, where the excised data was removed (ie at the
db.password= line). At this point it's possible to do a brute-force
attack where you generate possible passwords, insert them into the
available commit's tree, and compare them against the leaked commit
id. When the commit id matches, the sensitive data has been
recovered.
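
To make the mechanics concrete, here is a deliberately simplified sketch. The attack described above compares against the leaked commit id, which also means rebuilding the tree and commit objects; this example stops at the blob level and compares blob ids only. The template, alphabet and function names are assumptions for illustration. The one fixed fact is how git derives a blob id: SHA-1 over a `blob <size>\0` header followed by the content.

```python
import hashlib
from itertools import product

def git_blob_id(content: bytes) -> str:
    """Id git assigns to a blob: SHA-1 of a 'blob <size>\\0' header plus content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

def search(template: bytes, target_id: str, alphabet: str, max_len: int):
    """Try every string up to max_len in the template until the blob id matches."""
    for length in range(1, max_len + 1):
        for chars in product(alphabet, repeat=length):
            guess = "".join(chars)
            candidate = template.replace(b"{}", guess.encode())
            if git_blob_id(candidate) == target_id:
                return guess
    return None

# Toy example: a two-character secret over a three-letter alphabet.
template = b"db.password={}\n"
target = git_blob_id(b"db.password=ab\n")
print(search(template, target, "abc", 2))
```

The search space grows exponentially with the secret's length, which is why this only works against short or guessable values.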

A proof-of-concept implementation of this attack was indeed able to
recover the purged password:

--
$ java -jar gma-0.1.jar 8c9cfe3c attack-pinpoint
gma-demo-repo-git-filter-branch-cleaned

Brute-force search using these characters : 0123456789abcdefghijklmnopqrstuvwxyz
Available commit, presumed cleaned : 8ebbf661
File path : src/main/resources/config.properties
Template blob : dca1a2fb
Exhausted strings of length 1 or less
...
Exhausted strings of length 4 or less
Match with '0g6rw'
--

So all of this amounts to a fairly low severity issue - people should
always change credentials when they mistakenly commit them to a repo -
but I guess the point is that from a paranoia point of view, you want
to remove all information - including old commit hashes buried in
commit messages - that relate to sensitive data when you clean a repo
for sharing. The git-filter-branch command has a --msg-filter option
which could be used for this purpose, with the application of some
judicious bash-scripting, grep & sed-ing. However, I must confess that I
believe users would be better advised to use The BFG:

http://rtyley.github.io/bfg-repo-cleaner/

The BFG already addresses this issue by replacing all old Git
object-ids found in commit/tag messages with the updated id. For
instance, here's that exact same commit message when cleaned with the
BFG:

https://github.com/bfg-repo-cleaner-demos/gma-demo-repo-bfg-cleaned/commit/35840201

In the case that the user specifies that a filtering operation is not
removing 'private' data, the BFG replaces old ids with text of the
form 'newid [formerly oldid]', but if the operation is in fact to
strip private data, the replacement value is simply the newid - and
without the old commit id, the attack described above is not possible.

I believe it's worth educating users to give them a more realistic
understanding of their exposure, and would like to update the
documentation of git-filter-branch to give them a better idea of their
options for removing private data - that would include noting the BFG
as an alternative.

- Roberto Tyley

https://github.com/rtyley/bfg-repo-cleaner/blob/v1.2.0/src/main/scala/com/madgag/git/bfg/cleaner/ObjectIdSubstitutor.scala#L33-L60


A fast alternative to git-filter-branch - The BFG Repo-Cleaner

2013-02-04 Thread Roberto Tyley
I recently released The BFG Repo-Cleaner, a new tool for cleansing bad
data out of Git repository histories. The BFG is typically at least
10-50x faster than git-filter-branch at these tasks:

* Removing Crazy Big Files from repo history
* Removing Passwords, Credentials & other Private data

http://rtyley.github.com/bfg-repo-cleaner/

As an example, these are timings for deleting an arbitrary file from
the large GCC repository (148495 commits):

The BFG : 3m29s
$ bfg -D README-fixinc

git filter-branch : 472m31s
$ git filter-branch --index-filter 'git rm --cached --ignore-unmatch
gcc/README-fixinc' --prune-empty --tag-name-filter cat -- --all

(roughly a 135x speed increase, reducing the task of processing a
large codebase from an overnight job to the work of a few minutes;
all timings done in a 4GB tmpfs ramdisk)
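
For anyone checking the arithmetic behind that ratio, a quick worked calculation from the two timings quoted above (the helper name is just for this example):

```python
def seconds(minutes, secs):
    """Convert a 'XmYs' timing into seconds."""
    return minutes * 60 + secs

bfg = seconds(3, 29)               # 209 s
filter_branch = seconds(472, 31)   # 28351 s
print(round(filter_branch / bfg, 1))
```

The ratio comes out a little over 135, matching the rounded figure.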


The BFG has some simple but very powerful command-line options, which
perform at similar speed:

remove all blobs bigger than 1 megabyte :
$ bfg --strip-blobs-bigger-than 1M  my-repo.git

replace all passwords (listed in a file 'passwords.txt') with ***REMOVED*** :
$ bfg --replace-banned-strings passwords.txt  my-repo.git


The main source of the BFG's performance advantage comes from
preventing repeated examination of the same tree objects. The approach
of git-filter-branch performs filtering for each commit, against the
complete file-hierarchy of each commit, one after the other, even
though commit trees are largely very similar. For the use-cases of The
BFG that's unnecessary - we don't care where, and in which commit, a
'bad' file exists - we just want it dealt with. Consequently the BFG
processes the Git object db on a memoised tree-by-tree basis,
processing each and every file & folder exactly once - the final
processing of the commit hierarchy is very quick. This _does_ mean
that it's not possible to delete files based on their absolute path
within the repo, but they can be deleted based on their filename,
blob-id, or contents. This, and multi-core processing by default,
gives the dramatic speed-up while still providing the same results.
There's more performance data here:
https://docs.google.com/spreadsheet/ccc?key=0AsR1d5Zpes8HdER3VGU1a3dOcmVHMmtzT2dsS2xNenc

I'd welcome feedback, and if anyone has cause to filter a repository's
history in future, I'd appreciate you giving the BFG a try and letting
me know how you found it.

thanks,
Roberto Tyley
software dev @ The Guardian

http://rtyley.github.com/bfg-repo-cleaner/