Re: pack corruption post-mortem

2013-10-25 Thread Jeff King
On Thu, Oct 17, 2013 at 08:47:05AM -0700, Junio C Hamano wrote:

  In general, I don't think we know enough about patterns of recovery
  corruption to say which commands would definitely be worth implementing.
  Part of the reason I wrote this up is to document this one case. But
  this is the first time in 7 years of git usage that I've had to do this.
  So I'd feel a little bit better about sinking time into it after seeing
  a few more cases and realizing where the patterns are.
 
 There was one area in our Documentation/ set we used to use to keep
 this kind of message almost as-is; perhaps this message fits there?

I've wondered if that howto section has much value, as they are already
available in the list archive, and often show their age after a while.
Still, I suppose it is a sort of curated list of interesting posts.

Here's my article, all gussied up for the howto directory. Take it or
leave it.

-- 8 --
Subject: [PATCH] howto: add article on recovering a corrupted object

This is an asciidoc-ified version of a corruption post-mortem sent to
the git list. It complements the existing howto article, since it covers
a case where the object couldn't be easily recreated or copied from
elsewhere.

Signed-off-by: Jeff King p...@peff.net
---
 Documentation/Makefile |   1 +
 .../howto/recover-corrupted-object-harder.txt  | 242 +
 2 files changed, 243 insertions(+)
 create mode 100644 Documentation/howto/recover-corrupted-object-harder.txt

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 4f13a23..91a12c7 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -53,6 +53,7 @@ SP_ARTICLES += howto/setup-git-server-over-http
 SP_ARTICLES += howto/separating-topic-branches
 SP_ARTICLES += howto/revert-a-faulty-merge
 SP_ARTICLES += howto/recover-corrupted-blob-object
+SP_ARTICLES += howto/recover-corrupted-object-harder
 SP_ARTICLES += howto/rebuild-from-update-hook
 SP_ARTICLES += howto/rebase-from-internal-branch
 SP_ARTICLES += howto/maintain-git
diff --git a/Documentation/howto/recover-corrupted-object-harder.txt 
b/Documentation/howto/recover-corrupted-object-harder.txt
new file mode 100644
index 000..6f33dac
--- /dev/null
+++ b/Documentation/howto/recover-corrupted-object-harder.txt
@@ -0,0 +1,242 @@
+Date: Wed, 16 Oct 2013 04:34:01 -0400
+From: Jeff King p...@peff.net
+Subject: pack corruption post-mortem
+Abstract: Recovering a corrupted object when no good copy is available.
+Content-type: text/asciidoc
+
+How to recover an object from scratch
+=
+
+I was recently presented with a repository with a corrupted packfile,
+and was asked if the data was recoverable. This post-mortem describes
+the steps I took to investigate and fix the problem. I thought others
+might find the process interesting, and it might help somebody in the
+same situation.
+
+
+Note: In this case, no good copy of the repository was available. For
+the much easier case where you can get the corrupted object from
+elsewhere, see link:recover-corrupted-blob-object.html[this howto].
+
+
+I started with an fsck, which found a problem with exactly one object
+(I've used $pack and $obj below to keep the output readable, and also
+because I'll refer to them later):
+
+---
+$ git fsck
+error: $pack SHA1 checksum mismatch
+error: index CRC mismatch for object $obj from $pack at offset 51653873
+error: inflate: data stream error (incorrect data check)
+error: cannot unpack $obj from $pack at offset 51653873
+---
+
+The pack checksum failing means a byte is munged somewhere, and it is
+presumably in the object mentioned (since both the index checksum and
+zlib were failing).
+
+Reading the zlib source code, I found that incorrect data check means
+that the adler-32 checksum at the end of the zlib data did not match the
+inflated data. So stepping the data through zlib would not help, as it
+did not fail until the very end, when we realize the crc does not match.
+The problematic bytes could be anywhere in the object data.
+
+The first thing I did was pull the broken data out of the packfile. I
+needed to know how big the object was, which I found out with:
+
+
+$ git show-index $idx | cut -d' ' -f1 | sort -n | grep -A1 51653873
+51653873
+51664736
+
+
+Show-index gives us the list of objects and their offsets. We throw away
+everything but the offsets, and then sort them so that our interesting
+offset (which we got from the fsck output above) is followed immediately
+by the offset of the next object. Now we know that the object data is
+10863 bytes long, and we can grab it with:
+
+
+  dd if=$pack of=object bs=1 skip=51653873 count=10863
+
+
+I inspected a hexdump of the data, looking for any obvious bogosity
+(e.g., a 4K run of zeroes would be a good sign

Re: pack corruption post-mortem

2013-10-20 Thread Nicolas Pitre
On Sat, 19 Oct 2013, Shawn Pearce wrote:

 On Sat, Oct 19, 2013 at 7:41 AM, Nicolas Pitre n...@fluxnic.net wrote:
  On Sat, 19 Oct 2013, Duy Nguyen wrote:
 
  On Wed, Oct 16, 2013 at 3:34 PM, Jeff King p...@peff.net wrote:
   I was recently presented with a repository with a corrupted packfile,
   and was asked if the data was recoverable. This post-mortem describes
   the steps I took to investigate and fix the problem. I thought others
   might find the process interesting, and it might help somebody in the
   same situation.
  
   I started with an fsck, which found a problem with exactly one object
   (I've used $pack and $obj below to keep the output readable, and also
   because I'll refer to them later):
  
   $ git fsck
   error: $pack SHA1 checksum mismatch
   error: index CRC mismatch for object $obj from $pack at offset 
   51653873
   error: inflate: data stream error (incorrect data check)
   error: cannot unpack $obj from $pack at offset 51653873
 
  I wonder if we should protect the sha-1 and pathname tables in packv4
  with CRC too. A bit flipped in there could cause stream of corrupt
  objects and make it hard to pinpoint the corrupt location..
 
  It turns out that we already have this covered.
 
  The SHA1 used in the name of the pack file is actually the SHA1 checksum
  of the SHA1 table.
 
 I continue to believe this naming is wrong. The pack file name should
 be the SHA1 checksum of the pack data stream, but the SHA1 table. This
 would allow cleaner update of a repository that was repacked with
 different compression settings, but identical objects.

OK, after some thoughts, I decided that it is best _not_ to rely on the 
pack name.  The pack name currently carries no meaning what so ever and 
git works just fine if some packs are arbitrarily named.  Your concern 
about its current naming scheme is certainly legitimate, and I don't 
want pack v4 to introduce any kind of restrictions here.

Furthermore, the SHA1 table is the only pack element which integrity may 
not independently be verified at the moment.  This makes corruption 
isolation much harder when receiving a streamed pack.

Therefore I decided to introduce a small pack v4 format change with the 
following patch:

- 8
From: Nicolas Pitre n...@fluxnic.net
Date: Sun, 20 Oct 2013 14:52:24 -0400
Subject: [PATCH] pack v4: add a SHA1 checksum to the SHA1 table

Every packed element currently has some integrity protection coming from
the CRC32 embedded in the zlib deflated stream, except for the SHA1 table.

Some bit flip corruption in the SHA1 table may have repercutions on a
whole lot of objects and could be very hard to isolate.

Let's add some integrity protection on the SHA1 table by terminating it
with an additional SHA1 entry being the SHA1 checksum of the table.

Signed-off-by: Nicolas Pitre n...@fluxnic.net

diff --git a/Documentation/technical/pack-format.txt 
b/Documentation/technical/pack-format.txt
index fd2e737..a370f26 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -94,7 +94,9 @@ Git pack format
 === Pack v4 tables
 
  - A table of sorted SHA-1 object names for all objects contained in
-   the on-disk pack.
+   the on-disk pack, with a final entry being the SHA1 sum of all the
+   previous entries.  The size of this table is therefore
+   (nr_objects + 1) * 20 bytes.
 
The SHA-1 table in thin packs must include the omitted objects as well.
 
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index caec388..01300d6 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1471,7 +1471,10 @@ static struct packv4_dict *read_dict(void)
 
 static void parse_dictionaries(void)
 {
+   git_SHA_CTX ctx;
+   unsigned char table_sha1[20];
int i;
+
if (!packv4)
return;
 
@@ -1485,6 +1488,12 @@ static void parse_dictionaries(void)
die(_(wrong order in SHA-1 table at entry %d), i);
}
 
+   git_SHA1_Init(ctx);
+   git_SHA1_Update(ctx, sha1_table, 20 * nr_objects_final);
+   git_SHA1_Final(table_sha1, ctx);
+   if (hashcmp(table_sha1, fill_and_use(20)) != 0)
+   die(_(SHA-1 table checksum mismatch));
+
name_dict = read_dict();
path_dict = read_dict();
 }
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index 9fd5640..eb57ada 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -816,11 +816,18 @@ static void unpack_all(void)
use(sizeof(struct pack_header));
 
if (packv4) {
+   git_SHA_CTX ctx;
+   unsigned char table_sha1[20];
sha1_table = xmalloc(20 * nr_objects);
for (i = 0; i  nr_objects; i++) {
unsigned char *p = sha1_table + i * 20;
hashcpy(p, fill_and_use(20));
}
+   git_SHA1_Init(ctx);
+   git_SHA1_Update(ctx, sha1_table, 20 * 

Re: pack corruption post-mortem

2013-10-20 Thread Nicolas Pitre
On Sun, 20 Oct 2013, Duy Nguyen wrote:

 On Sat, Oct 19, 2013 at 9:41 PM, Nicolas Pitre n...@fluxnic.net wrote:
  On Sat, 19 Oct 2013, Duy Nguyen wrote:
  The SHA1 used in the name of the pack file is actually the SHA1 checksum
  of the SHA1 table.
 
  The path and ident tables are already protected by the CRC32 in the zlib
  deflated stream.
 
  Normal objects are also zlib deflated (except for their header) but you
  need to inflate them in order to have this CRC verified, which the pack
  data copy tries to avoid.  Hence the separate CRC32 in the index file in
  that case.
 
 OK slight change in the subject, what about reading code (i.e.
 sha1_file.c)? With v2 crc32 is verified by object inflate code. With
 v4 trees or commits, because we store some (or all) data outside of
 the deflated stream, we will not benefit from crc32 verifcation
 previously done for all trees and commits. Should we perform explict
 crc32 check when reading v4 trees and commits (and maybe verify the
 sha-1 table too)?

I suppose that we should... at some point.

I did the SHA1 table check only for index-pack and unpack-objects in my 
latest patch.  Adding it to check_packed_git_idx() as well should be 
trivial.

I'm not sure about the best way to do systematic checks on tree objects 
though.  We have both the CRC32 recorded in the index file and the 
object SHA1 recorded in the SHA1 table.  But any of them needs to be 
computed as we walk the object and we currently havn't found the best 
way to do that yet.  So I'd suggest postponing this until the tree walk 
is properly implemented to perform well first.


Nicolas
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: pack corruption post-mortem

2013-10-19 Thread Duy Nguyen
On Wed, Oct 16, 2013 at 3:34 PM, Jeff King p...@peff.net wrote:
 I was recently presented with a repository with a corrupted packfile,
 and was asked if the data was recoverable. This post-mortem describes
 the steps I took to investigate and fix the problem. I thought others
 might find the process interesting, and it might help somebody in the
 same situation.

 I started with an fsck, which found a problem with exactly one object
 (I've used $pack and $obj below to keep the output readable, and also
 because I'll refer to them later):

 $ git fsck
 error: $pack SHA1 checksum mismatch
 error: index CRC mismatch for object $obj from $pack at offset 51653873
 error: inflate: data stream error (incorrect data check)
 error: cannot unpack $obj from $pack at offset 51653873

I wonder if we should protect the sha-1 and pathname tables in packv4
with CRC too. A bit flipped in there could cause stream of corrupt
objects and make it hard to pinpoint the corrupt location..
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: pack corruption post-mortem

2013-10-19 Thread Nicolas Pitre
On Sat, 19 Oct 2013, Duy Nguyen wrote:

 On Wed, Oct 16, 2013 at 3:34 PM, Jeff King p...@peff.net wrote:
  I was recently presented with a repository with a corrupted packfile,
  and was asked if the data was recoverable. This post-mortem describes
  the steps I took to investigate and fix the problem. I thought others
  might find the process interesting, and it might help somebody in the
  same situation.
 
  I started with an fsck, which found a problem with exactly one object
  (I've used $pack and $obj below to keep the output readable, and also
  because I'll refer to them later):
 
  $ git fsck
  error: $pack SHA1 checksum mismatch
  error: index CRC mismatch for object $obj from $pack at offset 51653873
  error: inflate: data stream error (incorrect data check)
  error: cannot unpack $obj from $pack at offset 51653873
 
 I wonder if we should protect the sha-1 and pathname tables in packv4
 with CRC too. A bit flipped in there could cause stream of corrupt
 objects and make it hard to pinpoint the corrupt location..

It turns out that we already have this covered.

The SHA1 used in the name of the pack file is actually the SHA1 checksum 
of the SHA1 table.

The path and ident tables are already protected by the CRC32 in the zlib 
deflated stream.

Normal objects are also zlib deflated (except for their header) but you 
need to inflate them in order to have this CRC verified, which the pack 
data copy tries to avoid.  Hence the separate CRC32 in the index file in 
that case.

However the pack v4 tables are very unlikely to be reused as is from one 
pack to another.


Nicolas
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: pack corruption post-mortem

2013-10-19 Thread Shawn Pearce
On Sat, Oct 19, 2013 at 7:41 AM, Nicolas Pitre n...@fluxnic.net wrote:
 On Sat, 19 Oct 2013, Duy Nguyen wrote:

 On Wed, Oct 16, 2013 at 3:34 PM, Jeff King p...@peff.net wrote:
  I was recently presented with a repository with a corrupted packfile,
  and was asked if the data was recoverable. This post-mortem describes
  the steps I took to investigate and fix the problem. I thought others
  might find the process interesting, and it might help somebody in the
  same situation.
 
  I started with an fsck, which found a problem with exactly one object
  (I've used $pack and $obj below to keep the output readable, and also
  because I'll refer to them later):
 
  $ git fsck
  error: $pack SHA1 checksum mismatch
  error: index CRC mismatch for object $obj from $pack at offset 51653873
  error: inflate: data stream error (incorrect data check)
  error: cannot unpack $obj from $pack at offset 51653873

 I wonder if we should protect the sha-1 and pathname tables in packv4
 with CRC too. A bit flipped in there could cause stream of corrupt
 objects and make it hard to pinpoint the corrupt location..

 It turns out that we already have this covered.

 The SHA1 used in the name of the pack file is actually the SHA1 checksum
 of the SHA1 table.

I continue to believe this naming is wrong. The pack file name should
be the SHA1 checksum of the pack data stream, but the SHA1 table. This
would allow cleaner update of a repository that was repacked with
different compression settings, but identical objects.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: pack corruption post-mortem

2013-10-19 Thread Duy Nguyen
On Sat, Oct 19, 2013 at 9:41 PM, Nicolas Pitre n...@fluxnic.net wrote:
 On Sat, 19 Oct 2013, Duy Nguyen wrote:
 The SHA1 used in the name of the pack file is actually the SHA1 checksum
 of the SHA1 table.

 The path and ident tables are already protected by the CRC32 in the zlib
 deflated stream.

 Normal objects are also zlib deflated (except for their header) but you
 need to inflate them in order to have this CRC verified, which the pack
 data copy tries to avoid.  Hence the separate CRC32 in the index file in
 that case.

OK slight change in the subject, what about reading code (i.e.
sha1_file.c)? With v2 crc32 is verified by object inflate code. With
v4 trees or commits, because we store some (or all) data outside of
the deflated stream, we will not benefit from crc32 verifcation
previously done for all trees and commits. Should we perform explict
crc32 check when reading v4 trees and commits (and maybe verify the
sha-1 table too)?
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: pack corruption post-mortem

2013-10-17 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 On Wed, Oct 16, 2013 at 09:41:16AM -0600, Martin Fick wrote:

 I have nightmares about this sort of thing every now and 
 then, and we even experience some corruption here and there 
 that needs to be fixed (mainly missing objects when we toy 
 with different git repack arguments).  I cannot help but 
 wonder, how we can improve git further to either help 
 diagnose or even fix some of these problems?  More inline 
 below...

 In general, I don't think we know enough about patterns of recovery
 corruption to say which commands would definitely be worth implementing.
 Part of the reason I wrote this up is to document this one case. But
 this is the first time in 7 years of git usage that I've had to do this.
 So I'd feel a little bit better about sinking time into it after seeing
 a few more cases and realizing where the patterns are.

There was one area in our Documentation/ set we used to use to keep
this kind of message almost as-is; perhaps this message fits there?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


pack corruption post-mortem

2013-10-16 Thread Jeff King
I was recently presented with a repository with a corrupted packfile,
and was asked if the data was recoverable. This post-mortem describes
the steps I took to investigate and fix the problem. I thought others
might find the process interesting, and it might help somebody in the
same situation.

I started with an fsck, which found a problem with exactly one object
(I've used $pack and $obj below to keep the output readable, and also
because I'll refer to them later):

$ git fsck
error: $pack SHA1 checksum mismatch
error: index CRC mismatch for object $obj from $pack at offset 51653873
error: inflate: data stream error (incorrect data check)
error: cannot unpack $obj from $pack at offset 51653873

The pack checksum failing means a byte is munged somewhere, and it is
presumably in the object mentioned (since both the index checksum and
zlib were failing).

Reading the zlib source code, I found that incorrect data check means
that the adler-32 checksum at the end of the zlib data did not match the
inflated data. So stepping the data through zlib would not help, as it
did not fail until the very end, when we realize the crc does not match.
The problematic bytes could be anywhere in the object data.

The first thing I did was pull the broken data out of the packfile. I
needed to know how big the object was, which I found out with:

  $ git show-index $idx | cut -d' ' -f1 | sort -n | grep -A1 51653873
  51653873
  51664736

Show-index gives us the list of objects and their offsets. We throw away
everything but the offsets, and then sort them so that our interesting
offset (which we got from the fsck output above) is followed immediately
by the offset of the next object. Now we know that the object data is
10863 bytes long, and we can grab it with:

  dd if=$pack of=object bs=1 skip=51653873 count=10863

I inspected a hexdump of the data, looking for any obvious bogosity
(e.g., a 4K run of zeroes would be a good sign of filesystem
corruption). But everything looked pretty reasonable.

Note that the object file isn't fit for feeding straight to zlib; it
has the git packed object header, which is variable-length. We want to
strip that off so we can start playing with the zlib data directly. You
can either work your way through it manually (the format is described in
Documentation/technical/pack-format.txt), or you can walk through it in
a debugger. I did the latter, creating a valid pack like:

  # pack magic and version
  printf 'PACK\0\0\0\2' tmp.pack
  # pack has one object
  printf '\0\0\0\1' tmp.pack
  # now add our object data
  cat object tmp.pack
  # and then append the pack trailer
  /path/to/git.git/test-sha1 -b tmp.pack trailer
  cat trailer tmp.pack

and then running git index-pack tmp.pack in the debugger (stop at
unpack_raw_entry). Doing this, I found that there were 3 bytes of header
(and the header itself had a sane type and size). So I stripped those
off with:

  dd if=object of=zlib bs=1 skip=3

I ran the result through zlib's inflate using a custom C program. And
while it did report the error, I did get the right number of output
bytes (i.e., it matched git's size header that we decoded above). But
feeding the result back to git hash-object didn't produce the same
sha1. So there were some wrong bytes, but I didn't know which. The file
happened to be C source code, so I hoped I could notice something
obviously wrong with it, but I didn't. I even got it to compile!

I also tried comparing it to other versions of the same path in the
repository, hoping that there would be some part of the diff that didn't
make sense. Unfortunately, this happened to be the only revision of this
particular file in the repository, so I had nothing to compare against.

So I took a different approach. Working under the guess that the
corruption was limited to a single byte, I wrote a program to munge each
byte individually, and try inflating the result. Since the object was
only 10K compressed, that worked out to about 2.5M attempts, which took
a few minutes.

The program I used is here:

-- 8 --
#include stdio.h
#include unistd.h
#include string.h
#include signal.h
#include zlib.h

static int try_zlib(unsigned char *buf, int len)
{
  /* make this absurdly large so we don't have to loop */
  static unsigned char out[1024*1024];
  z_stream z;
  int ret;

  memset(z, 0, sizeof(z));
  inflateInit(z);

  z.next_in = buf;
  z.avail_in = len;
  z.next_out = out;
  z.avail_out = sizeof(out);

  ret = inflate(z, 0);
  inflateEnd(z);
  return ret = 0;
}

/* eye candy */
static int counter = 0;
static void progress(int sig)
{
  fprintf(stderr, \r%d, counter);
  alarm(1);
}

int main(void)
{
  /* oversized so we can read the whole buffer in */
  unsigned char buf[1024*1024];
  int len;
  unsigned i, j;

  signal(SIGALRM, progress);
  alarm(1);

  len = read(0, buf, sizeof(buf));
  for (i = 0; i  len; i++) {
unsigned char c = buf[i];
for (j = 0; j = 0xff; j++) {
  buf[i] = j;

  counter++;
  

Re: pack corruption post-mortem

2013-10-16 Thread Duy Nguyen
On Wed, Oct 16, 2013 at 3:34 PM, Jeff King p...@peff.net wrote:
 I was recently presented with a repository with a corrupted packfile,
 and was asked if the data was recoverable. This post-mortem describes
 the steps I took to investigate and fix the problem. I thought others
 might find the process interesting, and it might help somebody in the
 same situation.

It's like reading an LWN article. Thank you.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: pack corruption post-mortem

2013-10-16 Thread Martin Fick
On Wednesday, October 16, 2013 02:34:01 am Jeff King wrote:
 I was recently presented with a repository with a
 corrupted packfile, and was asked if the data was
 recoverable. This post-mortem describes the steps I took
 to investigate and fix the problem. I thought others
 might find the process interesting, and it might help
 somebody in the same situation.

This is awesome Peff, thanks for the great writeup!

I have nightmares about this sort of thing every now and 
then, and we even experience some corruption here and there 
that needs to be fixed (mainly missing objects when we toy 
with different git repack arguments).  I cannot help but 
wonder, how we can improve git further to either help 
diagnose or even fix some of these problems?  More inline 
below...


 The first thing I did was pull the broken data out of the
 packfile. I needed to know how big the object was, which
 I found out with:
 
   $ git show-index $idx | cut -d' ' -f1 | sort -n | grep
 -A1 51653873 51653873
   51664736
 
 Show-index gives us the list of objects and their
 offsets. We throw away everything but the offsets, and
 then sort them so that our interesting offset (which we
 got from the fsck output above) is followed immediately
 by the offset of the next object. Now we know that the
 object data is 10863 bytes long, and we can grab it
 with:
 
   dd if=$pack of=object bs=1 skip=51653873 count=10863

Is there a current plumbing command that should be enhanced 
to be able to do the 2 steps above directly for people 
debugging (maybe with some new switch)?  If not, should we 
create one, git show --zlib, or git cat-file --zlib?


 Note that the object file isn't fit for feeding
 straight to zlib; it has the git packed object header,
 which is variable-length. We want to strip that off so
 we can start playing with the zlib data directly. You
 can either work your way through it manually (the format
 is described in
 Documentation/technical/pack-format.txt), or you can
 walk through it in a debugger. I did the latter,
 creating a valid pack like:
 
   # pack magic and version
   printf 'PACK\0\0\0\2' tmp.pack
   # pack has one object
   printf '\0\0\0\1' tmp.pack
   # now add our object data
   cat object tmp.pack
   # and then append the pack trailer
   /path/to/git.git/test-sha1 -b tmp.pack trailer
   cat trailer tmp.pack
 
 and then running git index-pack tmp.pack in the
 debugger (stop at unpack_raw_entry). Doing this, I found
 that there were 3 bytes of header (and the header itself
 had a sane type and size). So I stripped those off with:
 
   dd if=object of=zlib bs=1 skip=3

This too feels like something we should be able to do with a 
plumbing command eventually?

git zlib-extract

 So I took a different approach. Working under the guess
 that the corruption was limited to a single byte, I
 wrote a program to munge each byte individually, and try
 inflating the result. Since the object was only 10K
 compressed, that worked out to about 2.5M attempts,
 which took a few minutes.

Awesome!  Would this make a good new plumbing command, git 
zlib-fix?


 I fixed the packfile itself with:
 
   chmod +w $pack
   printf '\xc7' | dd of=$pack bs=1 seek=51659518
 conv=notrunc chmod -w $pack
 
 The '\xc7' comes from the replacement byte our munge
 program found. The offset 51659518 is derived by taking
 the original object offset (51653873), adding the
 replacement offset found by munge (5642), and then
 adding back in the 3 bytes of git header we stripped.

Another plumbing command needed?  git pack-put --zlib?

I am not saying my command suggestions are good, but maybe 
they will inspire the right answer?

-Martin
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: pack corruption post-mortem

2013-10-16 Thread Jeff King
On Wed, Oct 16, 2013 at 09:41:16AM -0600, Martin Fick wrote:

 I have nightmares about this sort of thing every now and 
 then, and we even experience some corruption here and there 
 that needs to be fixed (mainly missing objects when we toy 
 with different git repack arguments).  I cannot help but 
 wonder, how we can improve git further to either help 
 diagnose or even fix some of these problems?  More inline 
 below...

In general, I don't think we know enough about patterns of recovery
corruption to say which commands would definitely be worth implementing.
Part of the reason I wrote this up is to document this one case. But
this is the first time in 7 years of git usage that I've had to do this.
So I'd feel a little bit better about sinking time into it after seeing
a few more cases and realizing where the patterns are.

One of the major hassles is that the assumptions you can and can't make
depend on what data you have that _isn't_ corrupted. Do you have a pack
index, or a bare pack? Do you have zlib data that fails the crc, or zlib
data that cannot be parsed?

In this case there was no other copy of the repository. But if you know
the broken object (which we did here), and you can copy it from
elsewhere, then git already will try to find other sources of the
object (loose, or in another pack).

dd if=$pack of=object bs=1 skip=51653873 count=10863
 
 Is there a current plumbing command that should be enhanced 
 to be able to do the 2 steps above directly for people 
 debugging (maybe with some new switch)?  If not, should we 
 create one, git show --zlib, or git cat-file --zlib?

Most of the git plumbing commands deal with data at the object layer.
This is really about going a step below and saying Give me the on-disk
representation of the object. We recently introduced an
%(objectsize:disk) formatter for cat-file. The logical extension would
be to ask for %(contents:disk) or something. Though what you get would
depend on how the object is stored, so you would need to figure that out
to do anything useful with it.

Note that this implies you actually have a packfile index that says
object XXX is at offset YYY. In some corruption cases, you might have
only a packfile. That is generally enough to generate the index, but if
there is corruption, you cannot actually parse the pack to find out the
sha1 of the objects.

So in the worst case, what you really want is something like dump the
object in packfile X at offset Y. But even then, you don't know the
length of the object. The packfile is a stream, and the length we
calculated is from the index, which depends on the zlib data parsing in
some sane way.

  and then running git index-pack tmp.pack in the
  debugger (stop at unpack_raw_entry). Doing this, I found
  that there were 3 bytes of header (and the header itself
  had a sane type and size). So I stripped those off with:
  
dd if=object of=zlib bs=1 skip=3
 
 This too feels like something we should be able to do with a 
 plumbing command eventually?
 
 git zlib-extract

Perhaps. I think if you had some extract object at offset X from the
packfile command, it would be optional to give you the whole thing, or
just the zlib data.

  So I took a different approach. Working under the guess
  that the corruption was limited to a single byte, I
  wrote a program to munge each byte individually, and try
  inflating the result. Since the object was only 10K
  compressed, that worked out to about 2.5M attempts,
  which took a few minutes.
 
 Awesome!  Would this make a good new plumbing command, git 
 zlib-fix?

I'd like to see it actually work more than once first. This relies on
there being single-byte corruption. Even double-byte corruption starts
to get expensive to brute-force like this. SHA1, by its nature, requires
brute-forcing. But it's possible that the crc, not being
cryptographically secure, could be reverse-engineered to find likely
spots of corruption. I don't know enough about it to say.

  I fixed the packfile itself with:
  
chmod +w $pack
printf '\xc7' | dd of=$pack bs=1 seek=51659518
  conv=notrunc chmod -w $pack
  
  The '\xc7' comes from the replacement byte our munge
  program found. The offset 51659518 is derived by taking
  the original object offset (51653873), adding the
  replacement offset found by munge (5642), and then
  adding back in the 3 bytes of git header we stripped.
 
 Another plumbing command needed?  git pack-put --zlib?

I think in this case that dd does a nice job of solving the problem.
Some of the stuff I did was very git-specific and required knowledge of
the formats. But this one is really just replace byte X at offset Y,
and I don't see any need to avoid a general-purpose tool (except that
dd is itself reasonably arcane :) ).

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: pack corruption post-mortem

2013-10-16 Thread Duy Nguyen
On Wed, Oct 16, 2013 at 10:41 PM, Martin Fick mf...@codeaurora.org wrote:
 and then running git index-pack tmp.pack in the
 debugger (stop at unpack_raw_entry). Doing this, I found
 that there were 3 bytes of header (and the header itself
 had a sane type and size). So I stripped those off with:

   dd if=object of=zlib bs=1 skip=3

 This too feels like something we should be able to do with a
 plumbing command eventually?

 git zlib-extract

Not an official plumbing, but I faced similar problems with pack v4. I
needed to verify that the output is correct and low level decoding
like this is generally a good thing to start with. So I wrote
test-dump [1] that can take an offset, a format and try to decode it.
It does not support zlib inflation yet, but adding one should be easy.
And because this is just a test program we don't really need to think
hard before adding something.

[1] http://article.gmane.org/gmane.comp.version-control.git/235388
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html