Re: [PATCHv2] write_index: optionally allow broken null sha1s

2013-08-26 Thread Junio C Hamano
Jeff King  writes:

> I'd be very surprised if this works in practice on most of our current
> test scripts. There are often subtle dependencies on the state left over
> from previous tests. Running the script below up through t3800 (at which
> point I lost patience) reveals 37 test scripts that are broken. Which is
> only about 17%, but we're clearly not quite there yet.
> ...

Yes, I agree that in an ideal world with infinite amount of time, we
should strive to make sure that each test is independent from
previous steps, and if we do not have infinite amount of time, we
should make sure we find some (say 5%) time to various code clean-up
effort, in both writing and reviewing such, including tests.

The attached is cute.  Thanks for a food for thought.

> -- >8 --
> #!/usr/bin/perl
> #
> # Run as "perl foo.pl t-basic.sh" (or whatever script you want to
> # check).
>
> my $script = shift;
> my ($script_num) = $script =~ /^(t\d+)/;
>
> # run once to get the list of tests
> my @tests = run_tests($script);
>
> # mark some tests as "setup" tests that will always be run
> foreach my $test (@tests) {
>   if ($test->{desc} =~ /set.?up/i) {
>   print STDERR "marking $test->{num} - $test->{desc} as setup\n";
>   $test->{setup} = 1;
>   }
> }
>
> # now try each test in isolation, but including setup tests
> foreach my $test (@tests) {
>   $ENV{GIT_SKIP_TESTS} = skip_all_except($script_num, $test, @tests);
>   run_tests($script) or die "failed $test->{num} ($test->{desc})\n";
> }
>
> sub run_tests {
>   my @r;
>   open(my $fh, '-|', qw(sh), $script);
>   while (<$fh>) {
>   /^ok (\d+) - (.*)/ and push @r, { num => $1, desc => $2 };
>   }
>   $? and return ();
>   return @r;
> }
>
> sub skip_all_except {
>   my $prefix = shift;
>   my $want = shift;
>
>   return join(' ',
>   map { "$prefix.$_->{num}" }
>   grep { !$_->{setup} && $_->{num} != $want->{num} }
>   @_);
> }
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] write_index: optionally allow broken null sha1s

2013-08-26 Thread Jeff King
On Mon, Aug 26, 2013 at 09:02:46AM -0700, Junio C Hamano wrote:

> > There's one subtle thing I didn't mention in the "it is already on stack
> > overflow". If you have a version of git which complains about the null
> > sha1, then the SO advice is already broken. But if the SO works, then
> > you do not have a version of git which complains. So why do you care?
> >
> > And the answer is: you may be pushing to a remote with a version of git
> > that complains, and which has receive.fsckObjects set (and in many
> > cases, that remote is GitHub, since we have had that check on for a
> > while).
> >
> > I don't know if it is worth spelling that out or not.
> 
> Probably not.
> 
> You could aim to correct each and every wrong suggestions on a site
> where misguided leads other misguided, but it is a hopeless task.

I do no think the advice on SO was misguided. My point was that a reader
of the revised commit message mentioning SO may wonder "why would people
on SO suggest that? It does not work since we introduced the check in
the first place".

But if neither you nor Jonathan wondered about it, then it is probably
not worth caring.

> >> > After this patch, do you think (in a separate change) it would make
> >> > sense for cache-tree.c::update_one() to check for null sha1 and error
> >> > out unless GIT_ALLOW_NULL_SHA1 is true?  That would let us get rid of
> >> > the caveat from the last paragraph.
> [...]

Hmm. We are already complaining about the null sha1 in most cases,
because we do not have the object. It is only for submodules that the
bogus entry can make it into a tree.  So I tried this:

diff --git a/cache-tree.c b/cache-tree.c
index 0bbec43..0ec3923 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -343,7 +343,8 @@ static int update_one(struct cache_tree *it,
entlen = pathlen - baselen;
i++;
}
-   if (mode != S_IFGITLINK && !missing_ok && !has_sha1_file(sha1)) 
{
+   if ((mode != S_IFGITLINK && !missing_ok && !has_sha1_file(sha1))
+   || (mode == S_IFGITLINK && is_null_sha1(sha1))) {
strbuf_release(&buffer);
return error("invalid object %06o %s for '%.*s'",
mode, sha1_to_hex(sha1), entlen+baselen, path);
diff --git a/t/t7009-filter-branch-null-sha1.sh 
b/t/t7009-filter-branch-null-sha1.sh
index 50d5689..59c5edc 100755
--- a/t/t7009-filter-branch-null-sha1.sh
+++ b/t/t7009-filter-branch-null-sha1.sh
@@ -31,10 +31,10 @@ test_expect_success 'filter commands are still checked' '
git commit -a -m "back to normal"
 '
 
-test_expect_success 'filter commands are still checked' '
+test_expect_success 'complain about trees that are broken after filtering' '
test_must_fail git filter-branch \
--force --prune-empty \
-   --index-filter "git rm --cached --ignore-unmatch three.t"
+   --index-filter ""
 '
 
 test_expect_success 'removing the broken entry works' '

But it doesn't work. The reason is that we do not need to re-make the
cache-tree, because we are just handing back the crap that we got from
the original tree. And I do not think we would want to invalidate the
whole cache tree just on the off chance that it has a problem;
filter-branch is slow enough as it is.

So I think we are better off to let filter-branch run at full speed,
even if optimizations like cache-tree mean we are missing out on
validating all of the data.  If you want to know whether your repo is in
good shape, you should run `git fsck`.

-Peff

PS I got more information from the user who originally experienced this
   problem (and yes, it was a submodule entry). It turns out that he had
   done some unusual stuff with having a git repository inside his Eclipse
   project, and adding it as a sub-project of his build.  So it was
   almost certainly EGit/JGit which wrote the bad entry in the first
   place (and he is making a bug report to EGit folks).
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] write_index: optionally allow broken null sha1s

2013-08-26 Thread Jeff King
On Mon, Aug 26, 2013 at 10:35:01AM -0700, Jonathan Nieder wrote:

> > I don't see that splitting it up more hurts this. If we wanted more
> > automatic rearranging or skipping of tests, we would need tests to
> > declare dependencies on their setup. And we would need to be able to
> > declare dependencies on multiple tests, since having a single setup test
> > does not work in all cases (e.g., sometimes you are testing each step,
> > and the final step relies on earlier steps).
> 
> Actually dependencies can already be inferred for most test scripts
> using the following rule:
> 
>   Each test depends on all tests with the word "setup" or the
>   phrase "set up" in their title that precede it.

I'd be very surprised if this works in practice on most of our current
test scripts. There are often subtle dependencies on the state left over
from previous tests. Running the script below up through t3800 (at which
point I lost patience) reveals 37 test scripts that are broken. Which is
only about 17%, but we're clearly not quite there yet.

But sure, I agree that is something to strive for. But I think my point
stands; splitting up the setup doesn't hurt as long as you note all of
the tests as dependencies. And by your rule above, the advice would be
"each of these tests should say "setup" in the description". That makes
sense to me, and I don't mind doing it here.

> Of course splitting up the setup into 3 steps neither helps nor hurts
> that.  What I was complaining about is splitting the filter-branch
> from the verification that filter-branch had the right result.

I totally missed your original point, then. I don't mind combining the
last two into a single test. That makes sense to me (I only split them
at all because I added the second one much later during the
development).

-Peff

-- >8 --
#!/usr/bin/perl
#
# Run as "perl foo.pl t-basic.sh" (or whatever script you want to
# check).

my $script = shift;
my ($script_num) = $script =~ /^(t\d+)/;

# run once to get the list of tests
my @tests = run_tests($script);

# mark some tests as "setup" tests that will always be run
foreach my $test (@tests) {
if ($test->{desc} =~ /set.?up/i) {
print STDERR "marking $test->{num} - $test->{desc} as setup\n";
$test->{setup} = 1;
}
}

# now try each test in isolation, but including setup tests
foreach my $test (@tests) {
$ENV{GIT_SKIP_TESTS} = skip_all_except($script_num, $test, @tests);
run_tests($script) or die "failed $test->{num} ($test->{desc})\n";
}

sub run_tests {
my @r;
open(my $fh, '-|', qw(sh), $script);
while (<$fh>) {
/^ok (\d+) - (.*)/ and push @r, { num => $1, desc => $2 };
}
$? and return ();
return @r;
}

sub skip_all_except {
my $prefix = shift;
my $want = shift;

return join(' ',
map { "$prefix.$_->{num}" }
grep { !$_->{setup} && $_->{num} != $want->{num} }
@_);
}
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] write_index: optionally allow broken null sha1s

2013-08-26 Thread Jonathan Nieder
Jeff King wrote:
> On Sun, Aug 25, 2013 at 12:54:12PM -0700, Jonathan Nieder wrote:

>> [setup split across three tests]
>>
>> This is kind of an old-fashioned test, since each step of the setup is
>> treated as a separate test assertion.  I don't really mind until we
>> get better automation to make it easy to skip or rearrange tests.
>> Just for reference, I think the usual way to do this now is
>
> I don't see that splitting it up more hurts this. If we wanted more
> automatic rearranging or skipping of tests, we would need tests to
> declare dependencies on their setup. And we would need to be able to
> declare dependencies on multiple tests, since having a single setup test
> does not work in all cases (e.g., sometimes you are testing each step,
> and the final step relies on earlier steps).

Actually dependencies can already be inferred for most test scripts
using the following rule:

Each test depends on all tests with the word "setup" or the
phrase "set up" in their title that precede it.

And while there's no existing automation for testing that that's all
the tests rely on (by automatically skipping or reordering tests, or
running non-setup tests in separate sandboxes in parallel), in
practice it is still already useful since it makes it safe to use
GIT_SKIP_TESTS subject to the following constraint:

In each test script, for every setup test skipped, all later
tests are skipped as well.

I don't care as much about GIT_SKIP_TESTS as about being able to
introduce new tests in the middle of a file.

Of course splitting up the setup into 3 steps neither helps nor hurts
that.  What I was complaining about is splitting the filter-branch
from the verification that filter-branch had the right result.

Sorry for the confusion,
Jonathan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] write_index: optionally allow broken null sha1s

2013-08-26 Thread Junio C Hamano
Jeff King  writes:

>> I found this version more readable than Peff's (albeit slightly).
>
> OK. Do you want to apply with Jonathan's wording, then?

I can do that, as it seems all of us are in agreement.

> There's one subtle thing I didn't mention in the "it is already on stack
> overflow". If you have a version of git which complains about the null
> sha1, then the SO advice is already broken. But if the SO works, then
> you do not have a version of git which complains. So why do you care?
>
> And the answer is: you may be pushing to a remote with a version of git
> that complains, and which has receive.fsckObjects set (and in many
> cases, that remote is GitHub, since we have had that check on for a
> while).
>
> I don't know if it is worth spelling that out or not.

Probably not.

You could aim to correct each and every wrong suggestions on a site
where misguided leads other misguided, but it is a hopeless task.

>> > After this patch, do you think (in a separate change) it would make
>> > sense for cache-tree.c::update_one() to check for null sha1 and error
>> > out unless GIT_ALLOW_NULL_SHA1 is true?  That would let us get rid of
>> > the caveat from the last paragraph.
>> 
>> Hmm, interesting thought.
>
> I think it is worth doing. The main reason I put the original check on
> writing to the index is that it more clearly pinpoints the source of the
> error. If we just died during git-write-tree, then you know somebody
> broke your index, but you don't know which command.
>
> But checking in both places would add extra protection, and would make
> possible the "relax on read, strict on write" policy that filter-branch
> wants to do.

Yeah, I agree with all of the above.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] write_index: optionally allow broken null sha1s

2013-08-26 Thread Jeff King
On Sun, Aug 25, 2013 at 11:03:54PM -0700, Junio C Hamano wrote:

> Jonathan Nieder  writes:
> 
> > In other words, why not use something like this?
> >
> > write_index: optionally allow broken null sha1s
> >
> > Commit 4337b58 (do not write null sha1s to on-disk index, 2012-07-28)
> > added a safety check preventing git from writing null sha1s into the
> > index. The intent was to catch errors in other parts of the code that
> > might let such an entry slip into the index (or worse, a tree).
> >
> > Some existing repositories have some invalid trees that contain null
> > sha1s already, though.  Until 4337b58, a common way to clean this up
> > would be to use git-filter-branch's index-filter to repair such broken
> > entries.  That now fails when filter-branch tries to write out the
> > index.
> >
> > Introduce a GIT_ALLOW_NULL_SHA1 environment variable to relax this check
> > and make it easier to recover from such a history.
> 
> I found this version more readable than Peff's (albeit slightly).

OK. Do you want to apply with Jonathan's wording, then?

There's one subtle thing I didn't mention in the "it is already on stack
overflow". If you have a version of git which complains about the null
sha1, then the SO advice is already broken. But if the SO works, then
you do not have a version of git which complains. So why do you care?

And the answer is: you may be pushing to a remote with a version of git
that complains, and which has receive.fsckObjects set (and in many
cases, that remote is GitHub, since we have had that check on for a
while).

I don't know if it is worth spelling that out or not.

> > After this patch, do you think (in a separate change) it would make
> > sense for cache-tree.c::update_one() to check for null sha1 and error
> > out unless GIT_ALLOW_NULL_SHA1 is true?  That would let us get rid of
> > the caveat from the last paragraph.
> 
> Hmm, interesting thought.

I think it is worth doing. The main reason I put the original check on
writing to the index is that it more clearly pinpoints the source of the
error. If we just died during git-write-tree, then you know somebody
broke your index, but you don't know which command.

But checking in both places would add extra protection, and would make
possible the "relax on read, strict on write" policy that filter-branch
wants to do.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] write_index: optionally allow broken null sha1s

2013-08-26 Thread Jeff King
On Sun, Aug 25, 2013 at 12:54:12PM -0700, Jonathan Nieder wrote:

> [setup split across three tests]
>
> This is kind of an old-fashioned test, since each step of the setup is
> treated as a separate test assertion.  I don't really mind until we
> get better automation to make it easy to skip or rearrange tests.
> Just for reference, I think the usual way to do this now is

I don't see that splitting it up more hurts this. If we wanted more
automatic rearranging or skipping of tests, we would need tests to
declare dependencies on their setup. And we would need to be able to
declare dependencies on multiple tests, since having a single setup test
does not work in all cases (e.g., sometimes you are testing each step,
and the final step relies on earlier steps).

And I do think splitting it up has more granularity. For example, in the
sequence here:

>   test_expect_success 'setup' '
>   # create base commits
>   ...
> 
>   # create a commit with bogus null sha1 in the tree
>   ...
> 
>   # We have to make one more commit on top removing the broken
>   # entry, since otherwise our index does not match HEAD and
>   # filter-branch will complain. We could make the index match
>   # HEAD, but doing so would involve writing a null sha1 into
>   # the index.
>   ...
>   '

Right now it is not hard to do step 2. But I could easily imagine a
world in which git-mktree learns to complain about such bogus trees. And
the sequence would become:

  1. create base commits

  2. check that mktree barfs

  3. check that we can override mktree to create a broken tree

  4. clean up tip state

in which case you really want to have individual tests.

I do not care _that_ much, since mktree does not behave that way now,
and the setup is otherwise not that likely to fail.  But I do not think
more granularity actually hurts us, and it can sometimes help (as
described above, but also when something fails, the test output more
clearly pinpoints what happened).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] write_index: optionally allow broken null sha1s

2013-08-25 Thread Junio C Hamano
Jonathan Nieder  writes:

> In other words, why not use something like this?
>
>   write_index: optionally allow broken null sha1s
>
>   Commit 4337b58 (do not write null sha1s to on-disk index, 2012-07-28)
>   added a safety check preventing git from writing null sha1s into the
>   index. The intent was to catch errors in other parts of the code that
>   might let such an entry slip into the index (or worse, a tree).
>
>   Some existing repositories have some invalid trees that contain null
>   sha1s already, though.  Until 4337b58, a common way to clean this up
>   would be to use git-filter-branch's index-filter to repair such broken
>   entries.  That now fails when filter-branch tries to write out the
>   index.
>
>   Introduce a GIT_ALLOW_NULL_SHA1 environment variable to relax this check
>   and make it easier to recover from such a history.

I found this version more readable than Peff's (albeit slightly).

> After this patch, do you think (in a separate change) it would make
> sense for cache-tree.c::update_one() to check for null sha1 and error
> out unless GIT_ALLOW_NULL_SHA1 is true?  That would let us get rid of
> the caveat from the last paragraph.

Hmm, interesting thought.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] write_index: optionally allow broken null sha1s

2013-08-25 Thread Jonathan Nieder
On Sun, Aug 25, 2013 at 05:58:18AM -0400, Jeff King wrote:
> On Sat, Aug 24, 2013 at 11:15:00PM -0700, Jonathan Nieder wrote:

>>> I was tempted to not involve filter-branch in this commit at all, and
>>> instead require the user to manually invoke
>>>
>>>   GIT_ALLOW_NULL_SHA1=1 git filter-branch ...
>>>
>>> to perform such a filter.  That would be slightly safer, but requires
>>> some specialized knowledge from the user (and advice on using
>>> filter-branch to remove such entries already exists on places like
>>> stackoverflow, and this patch makes it Just Work on recent versions of
>>> git).
>>
>> The above few paragraphs explained the most mysterious part of the
>> patch to me.  I think they would be fine to include in the commit
>> message.
>
> I rolled them into the commit message.

Hm --- the most useful part was "advice on using filter-branch to
remove such entries already exists on places like stackoverflow",
which was dropped.  From my point of view, that is exactly the
motivation of the patch.

I also found the "I was tempted to ... That would be slightly safer,
but requires ..." structure easier to read.

In other words, why not use something like this?

write_index: optionally allow broken null sha1s

Commit 4337b58 (do not write null sha1s to on-disk index, 2012-07-28)
added a safety check preventing git from writing null sha1s into the
index. The intent was to catch errors in other parts of the code that
might let such an entry slip into the index (or worse, a tree).

Some existing repositories have some invalid trees that contain null
sha1s already, though.  Until 4337b58, a common way to clean this up
would be to use git-filter-branch's index-filter to repair such broken
entries.  That now fails when filter-branch tries to write out the
index.

Introduce a GIT_ALLOW_NULL_SHA1 environment variable to relax this check
and make it easier to recover from such a history.

It is tempting to not involve filter-branch in this commit at all, and
instead require the user to manually invoke

GIT_ALLOW_NULL_SHA1=1 git filter-branch ...

to perform an index-filter on a history with trees with null sha1s.
That would be slightly safer, but requires some specialized knowledge
from the user.  So let's set the GIT_ALLOW_NULL_SHA1 variable
automatically when checking out the to-be-filtered trees.  Advice on
using filter-branch to remove such entries already exists on places like
stackoverflow, and this patch makes it Just Work again on recent
versions of git.

Further commands that touch the index will still notice and fail,   
unless
they actually remove the broken entries.  A filter-branch whose filters
do not touch the index at all will not error out (since we complain of
the null sha1 only on writing, not when making a tree out of the index),
but this is acceptable, as we still print a loud warning, so the problem
is unlikely to go unnoticed.

With or without such a change,
Reviewed-by: Jonathan Nieder 

After this patch, do you think (in a separate change) it would make
sense for cache-tree.c::update_one() to check for null sha1 and error
out unless GIT_ALLOW_NULL_SHA1 is true?  That would let us get rid of
the caveat from the last paragraph.

[...]
> --- /dev/null
> +++ b/t/t7009-filter-branch-null-sha1.sh
> @@ -0,0 +1,52 @@
> +#!/bin/sh
> +
> +test_description='filter-branch removal of trees with null sha1'
> +. ./test-lib.sh
> +
> +test_expect_success 'create base commits' '
> + test_commit one &&
> + test_commit two &&
> + test_commit three
> +'
> +
> +test_expect_success 'create a commit with a bogus null sha1 in the tree' '
> + {
> + git ls-tree HEAD &&
> + printf "16 commit $_z40\\tbroken\\n"
> + } >broken-tree
> + echo "add broken entry" >msg &&
> +
> + tree=$(git mktree  + test_tick &&
> + commit=$(git commit-tree $tree -p HEAD  + git update-ref HEAD "$commit"
> +'
> +
> +# we have to make one more commit on top removing the broken
> +# entry, since otherwise our index does not match HEAD (and filter-branch 
> will
> +# complain). We could make the index match HEAD, but doing so would involve
> +# writing a null sha1 into the index.
> +test_expect_success 'create a commit dropping the broken entry' '
> + test_tick &&
> + git commit -a -m "back to normal"
> +'

This is kind of an old-fashioned test, since each step of the setup is
treated as a separate test assertion.  I don't really mind until we
get better automation to make it easy to skip or rearrange tests.
Just for reference, I think the usual way to do this now is

test_expect_success 'setup' '
# create base commits
...

# create a commit with bogus null sha1 in the tree