Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change

2009-03-06 Thread Filip Navara
On Wed, Feb 11, 2009 at 5:48 PM, Jamie Lokier  wrote:
> Kevin Wolf wrote:
>> Besides reviewing the code over and over again, I think the only real
>> chance is that you can get a non-production copy of your image and add
>> some debug code so that we can at least see which code path is causing
>> problems.
>
> I have a copy of my image to reproduce the bug, so I can test patches
> including diagnostic patches.  That's what I did to narrow it down.

Let's see. I have looked at the change in revision 5006 back and forth
and this is the only bug that I can see...

Does the patch help any?

Best regards,
Filip Navara


block-qcow2.diff
Description: Binary data


Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change

2009-02-16 Thread Jamie Lokier
Kevin Wolf wrote:
> >> By the way and completely off-topic: Have you already tried to use the
> >> VHD patches? I would really like to know if they fix your problems.
> > 
> > Are those patches in kvm-83?  I still have the image that was causing
> > problems way back, and I'm converting it to raw now with kvm-83 to see
> > if it now matches the raw image produced by VPC's own tool.
> 
> Avi mentioned the patches in the kvm-84 announcement yesterday, so it
> seems they are not in kvm-83.

Ok.  I did the conversion from VHD to raw with "kvm-83-img convert"
and it gave a different result from the one MSVPC's own tool produces -
i.e. wrong.  I'll try it again sometime with kvm-84.

-- Jamie


Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change

2009-02-16 Thread Kevin Wolf
Jamie Lokier schrieb:
> Kevin Wolf wrote:
>> Besides reviewing the code over and over again, I think the only real
>> chance is that you can get a non-production copy of your image and add
>> some debug code so that we can at least see which code path is causing
>> problems.
> 
> I have a copy of my image to reproduce the bug, so I can test patches
> including diagnostic patches.  That's what I did to narrow it down.
> 
> Being a company mail server, I can't send you the image of course.

I perfectly understand that; it just makes debugging hard when you have
neither a specific suspicion nor the problematic image.

Do you need diagnostic patches from me, or will you try to put debug
messages into the code yourself? I would just start putting random
printfs into alloc_cluster_offset() and the functions it calls to get
some information about the failing write. I guess we'd need a few
iterations with my diagnostic patch anyway, and it wouldn't be any
better than ad-hoc hacks done by yourself.
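
For concreteness, here is a minimal, self-contained sketch of the kind of
tracing I mean. The function below is only a stand-in, not the real
alloc_cluster_offset() from block-qcow2.c, and the logged fields are just
an assumption about what would be useful:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Stand-in for the qcow2 allocation path: log the guest offset, the
 * requested sector count and the returned cluster offset, so a failing
 * write can be correlated with the code path that produced it. */
static uint64_t traced_alloc_cluster_offset(uint64_t offset, int *num)
{
    uint64_t cluster_offset = (offset & ~0xffffULL) + 0x10000; /* dummy result */
    fprintf(stderr,
            "qcow2 debug: alloc_cluster_offset(offset=0x%" PRIx64
            ", num=%d) -> cluster_offset=0x%" PRIx64 "\n",
            offset, *num, cluster_offset);
    return cluster_offset;
}

int main(void)
{
    int num = 8;
    traced_alloc_cluster_offset(0x2a4200ULL, &num);
    return 0;
}

In the real code the fprintf() would simply go at the entry and exit of
alloc_cluster_offset() and the helpers it calls, printing whatever local
state looks relevant.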

>> By the way and completely off-topic: Have you already tried to use the
>> VHD patches? I would really like to know if they fix your problems.
> 
> Are those patches in kvm-83?  I still have the image that was causing
> problems way back, and I'm converting it to raw now with kvm-83 to see
> if it now matches the raw image produced by VPC's own tool.

Avi mentioned the patches in the kvm-84 announcement yesterday, so it
seems they are not in kvm-83.

Kevin


Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change

2009-02-12 Thread Johannes Schindelin
Hi,

On Wed, 11 Feb 2009, Chris Wright wrote:

> * Kevin Wolf (kw...@suse.de) wrote:
> > I would suspect that simply having a 64-bit host isn't enough to trigger
> > the problem. These patches have been in for half a year now without anyone
> > noticing such a failure.
> 
> BTW, we've seen similar corruption, but haven't narrowed it down
> successfully, as it's been quite difficult to trigger.

Same here,
Dscho



Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change

2009-02-11 Thread Chris Wright
* Kevin Wolf (kw...@suse.de) wrote:
> I would suspect that simply having a 64-bit host isn't enough to trigger
> the problem. These patches have been in for half a year now without anyone
> noticing such a failure.

BTW, we've seen similar corruption, but haven't narrowed it down
successfully, as it's been quite difficult to trigger.

thanks,
-chris


Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change

2009-02-11 Thread Jamie Lokier
Kevin Wolf wrote:
> Besides reviewing the code over and over again, I think the only real
> chance is that you can get a non-production copy of your image and add
> some debug code so that we can at least see which code path is causing
> problems.

I have a copy of my image to reproduce the bug, so I can test patches
including diagnostic patches.  That's what I did to narrow it down.

Being a company mail server, I can't send you the image of course.

> > Aside from logic, the code mixes signed 32-bit with unsigned 64-bit
> > with unclear naming which would make me nervous.  My host is 64-bit,
> > by the way.
> 
> I would suspect that simply having a 64-bit host isn't enough to trigger
> the problem. These patches have been in for half a year now without anyone
> noticing such a failure.

That was just for clarity.  If there are any bugs, they're more likely
to be truncation on a 32-bit host :-)

I didn't see any mention of "long", so the code should behave the
same on 64-bit and 32-bit hosts.
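
As a self-contained illustration of why that matters - this is not code
from block-qcow2.c, just a small demo of the type sizes involved:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    /* Fixed-width types are the same size on 32- and 64-bit hosts;
     * long is typically 4 bytes on a 32-bit Linux host and 8 on x86-64. */
    printf("sizeof(long)=%zu  sizeof(int64_t)=%zu  sizeof(uint64_t)=%zu\n",
           sizeof(long), sizeof(int64_t), sizeof(uint64_t));

    /* A >4GB image offset is safe in uint64_t, but would be truncated if
     * it ever passed through a 32-bit long on a 32-bit host. */
    uint64_t offset = (uint64_t)5 << 30;   /* roughly 5 GB into the image */
    printf("offset = 0x%" PRIx64 " (%" PRIu64 " bytes)\n", offset, offset);
    return 0;
}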

> By the way and completely off-topic: Have you already tried to use the
> VHD patches? I would really like to know if they fix your problems.

Are those patches in kvm-83?  I still have the image that was causing
problems way back, and I'm converting it to raw now with kvm-83 to see
if it now matches the raw image produced by VPC's own tool.

Beyond checking that it reads ok, which it didn't before, I don't know
how to test the VPC support properly, but I can try booting the image
and see if it at least doesn't fsck^H^H^H^Hscandisk like it used to.

I'm not using VPC images any more; we just install Windows into empty
QCOW2 or raw images, like everyone else. :-)

At some point I may test the VPC support with prebuilt images
downloaded from Microsoft - you can too!

-- Jamie


Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change

2009-02-11 Thread Kevin Wolf
Jamie Lokier schrieb:
> Kevin Wolf wrote:
>> Jamie Lokier schrieb:
>>> Although there are many ways to make Windows blue screen in KVM, in
>>> this case I've narrowed it down to the difference in
>>> qemu/block-qcow2.c between kvm-72 and kvm-73 (not -83).
>> This must be one of SVN revisions 5003 to 5008 in upstream qemu. Can you
>> narrow it down to one of these? I certainly don't feel like reviewing
>> all of them once again.
> 
> It's QEMU SVN delta 5005-5006, copied below.
> 
> I've tested by applying the diffs up to each of QEMU SVN revs 5003 to
> 5008 onto kvm-72, and 5005-5006 is the diff which consistently triggers
> the failed guest boot.

That's exactly what I was afraid of... It's the most difficult patch of
the series. I'm adding Laurent, who wrote the patch series back then, to
CC, but I can imagine he'd rather do different things in his spare time.

Besides reviewing the code over and over again, I think the only real
chance is that you can get a non-production copy of your image and add
some debug code so that we can at least see which code path is causing
problems.

> Aside from logic, the code mixes signed 32-bit with unsigned 64-bit
> with unclear naming which would make me nervous.  My host is 64-bit,
> by the way.

I would suspect that simply having a 64-bit host isn't enough to trigger
the problem. These patches have been in for half a year now without anyone
noticing such a failure.

By the way and completely off-topic: Have you already tried to use the
VHD patches? I would really like to know if they fix your problems.

Kevin


Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change

2009-02-11 Thread Jamie Lokier
Kevin Wolf wrote:
> Jamie Lokier schrieb:
> > Although there are many ways to make Windows blue screen in KVM, in
> > this case I've narrowed it down to the difference in
> > qemu/block-qcow2.c between kvm-72 and kvm-73 (not -83).
> 
> This must be one of SVN revisions 5003 to 5008 in upstream qemu. Can you
> narrow it down to one of these? I certainly don't feel like reviewing
> all of them once again.

It's QEMU SVN delta 5005-5006, copied below.

I've tested by applying the diffs up to each of QEMU SVN revs 5003 to
5008 onto kvm-72, and 5005-5006 is the diff which consistently triggers
the failed guest boot.

Aside from logic, the code mixes signed 32-bit with unsigned 64-bit
with unclear naming which would make me nervous.  My host is 64-bit,
by the way.
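
To make the concern concrete - purely illustrative, not a claim about a
specific line in the patch - this is the kind of mistake that mixing
invites:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    /* 1. Truncation: storing a 64-bit cluster offset in a 32-bit variable
     *    silently drops the high bits once the image grows past 4 GB. */
    uint64_t cluster_offset = ((uint64_t)3 << 32) | 0x200;
    int32_t narrowed = (int32_t)cluster_offset;  /* high bits are lost */
    printf("cluster_offset=0x%" PRIx64 "  narrowed=0x%" PRIx32 "\n",
           cluster_offset, (uint32_t)narrowed);

    /* 2. Sign extension: a signed count that goes negative turns into a
     *    gigantic byte count as soon as it meets uint64_t arithmetic. */
    int nb_sectors = -1;
    uint64_t bytes = (uint64_t)nb_sectors << 9;  /* 0xfffffffffffffe00 */
    printf("bytes = 0x%" PRIx64 "\n", bytes);
    return 0;
}

Consistently keeping anything that can hold a file offset in uint64_t,
and giving counts clearly named types, avoids both traps.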

-- Jamie


--- trunk/block-qcow2.c 2008/08/14 18:09:32 5005
+++ trunk/block-qcow2.c 2008/08/14 18:10:28 5006
@@ -52,6 +52,8 @@
 #define QCOW_CRYPT_NONE 0
 #define QCOW_CRYPT_AES  1
 
+#define QCOW_MAX_CRYPT_CLUSTERS 32
+
 /* indicate that the refcount of the referenced cluster is exactly one. */
 #define QCOW_OFLAG_COPIED (1LL << 63)
 /* indicate that the cluster is compressed (they never have the copied flag) */
@@ -263,7 +265,8 @@
 if (!s->cluster_cache)
 goto fail;
 /* one more sector for decompressed data alignment */
-s->cluster_data = qemu_malloc(s->cluster_size + 512);
+s->cluster_data = qemu_malloc(QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size
+  + 512);
 if (!s->cluster_data)
 goto fail;
 s->cluster_cache_offset = -1;
@@ -612,43 +615,98 @@
  * For a given offset of the disk image, return cluster offset in
  * qcow2 file.
  *
+ * on entry, *num is the number of contiguous clusters we'd like to
+ * access following offset.
+ *
+ * on exit, *num is the number of contiguous clusters we can read.
+ *
  * Return 1, if the offset is found
  * Return 0, otherwise.
  *
  */
 
-static uint64_t get_cluster_offset(BlockDriverState *bs, uint64_t offset)
+static uint64_t get_cluster_offset(BlockDriverState *bs,
+   uint64_t offset, int *num)
 {
 BDRVQcowState *s = bs->opaque;
 int l1_index, l2_index;
-uint64_t l2_offset, *l2_table, cluster_offset;
+uint64_t l2_offset, *l2_table, cluster_offset, next;
+int l1_bits;
+int index_in_cluster, nb_available, nb_needed;
+
+index_in_cluster = (offset >> 9) & (s->cluster_sectors - 1);
+nb_needed = *num + index_in_cluster;
+
+l1_bits = s->l2_bits + s->cluster_bits;
+
+/* compute how many bytes there are between the offset and
+ * and the end of the l1 entry
+ */
+
+nb_available = (1 << l1_bits) - (offset & ((1 << l1_bits) - 1));
+
+/* compute the number of available sectors */
+
+nb_available = (nb_available >> 9) + index_in_cluster;
+
+cluster_offset = 0;
 
 /* seek the the l2 offset in the l1 table */
 
-l1_index = offset >> (s->l2_bits + s->cluster_bits);
+l1_index = offset >> l1_bits;
 if (l1_index >= s->l1_size)
-return 0;
+goto out;
 
 l2_offset = s->l1_table[l1_index];
 
 /* seek the l2 table of the given l2 offset */
 
 if (!l2_offset)
-return 0;
+goto out;
 
 /* load the l2 table in memory */
 
 l2_offset &= ~QCOW_OFLAG_COPIED;
 l2_table = l2_load(bs, l2_offset);
 if (l2_table == NULL)
-return 0;
+goto out;
 
 /* find the cluster offset for the given disk offset */
 
 l2_index = (offset >> s->cluster_bits) & (s->l2_size - 1);
 cluster_offset = be64_to_cpu(l2_table[l2_index]);
+nb_available = s->cluster_sectors;
+l2_index++;
+
+if (!cluster_offset) {
 
-return cluster_offset & ~QCOW_OFLAG_COPIED;
+   /* how many empty clusters ? */
+
+   while (nb_available < nb_needed && !l2_table[l2_index]) {
+   l2_index++;
+   nb_available += s->cluster_sectors;
+   }
+} else {
+
+   /* how many allocated clusters ? */
+
+   cluster_offset &= ~QCOW_OFLAG_COPIED;
+   while (nb_available < nb_needed) {
+   next = be64_to_cpu(l2_table[l2_index]) & ~QCOW_OFLAG_COPIED;
+   if (next != cluster_offset + (nb_available << 9))
+   break;
+   l2_index++;
+   nb_available += s->cluster_sectors;
+   }
+   }
+
+out:
+if (nb_available > nb_needed)
+nb_available = nb_needed;
+
+*num = nb_available - index_in_cluster;
+
+return cluster_offset;
 }
 
 /*
@@ -659,13 +717,10 @@
  */
 
 static void free_any_clusters(BlockDriverState *bs,
-  uint64_t cluster_offset)
+  uint64_t cluster_offset, int nb_clusters)
 {
 BDRVQcowState *s = bs->opaque;
 
-if (cluster_offset == 0)
-return;
-
 /* free the cluster */
 
 if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
@@ -677,7 +732,9 @@
 return;
 }
 
-free_clusters(bs, cluster_offset, s->cluster_size);
+free_clusters

Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change

2009-02-11 Thread Jamie Lokier
Kevin Wolf wrote:
> Jamie Lokier schrieb:
> > Although there are many ways to make Windows blue screen in KVM, in
> > this case I've narrowed it down to the difference in
> > qemu/block-qcow2.c between kvm-72 and kvm-73 (not -83).
> 
> This must be one of SVN revisions 5003 to 5008 in upstream qemu. Can you
> narrow it down to one of these? I certainly don't feel like reviewing
> all of them once again.

That's helpful, thanks.  It's a production mail server which was
affected, and it's being used at the moment.  Not sure if I can narrow
it down that easily :-)

> Do I understand correctly that your image fails with the kvm-73 version
> of block-qcow2.c but afterwards still boots with kvm-72? So the image is
> not really corrupted; it's more likely an I/O error which brings the OS down?

That's correct; it has always booted when trying again with kvm-72, or a
later kvm with block-qcow2.c reverted.

It might be an I/O error rather than corruption.  Up to kvm-76, I/O
errors aren't reported through the IDE emulation.

-- Jamie


Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change

2009-02-11 Thread Kevin Wolf
Jamie Lokier schrieb:
> Although there are many ways to make Windows blue screen in KVM, in
> this case I've narrowed it down to the difference in
> qemu/block-qcow2.c between kvm-72 and kvm-73 (not -83).

This must be one of SVN revisions 5003 to 5008 in upstream qemu. Can you
narrow it down to one of these? I certainly don't feel like reviewing
all of them once again.

Do I understand correctly that your image fails with the kvm-73 version
of block-qcow2.c but afterwards still boots with kvm-72? So the image is
not really corrupted; it's more likely an I/O error which brings the OS down?

Kevin