Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change
On Wed, Feb 11, 2009 at 5:48 PM, Jamie Lokier wrote:
> Kevin Wolf wrote:
>> Besides reviewing the code over and over again, I think the only real
>> chance is that you can get a non-productive copy of your image and add
>> some debug code so that we can see at least which code path is causing
>> problems.
>
> I have a copy of my image to reproduce the bug, so I can test patches
> including diagnostic patches. That's what I did to narrow it down.

Let's see. I have looked at the change in revision 5006 back and forth
and this is the only bug that I can see... Does the patch help any?

Best regards,
Filip Navara

block-qcow2.diff
Description: Binary data
Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change
Kevin Wolf wrote:
> >> By the way and completely off-topic: Have you already tried to use the
> >> VHD patches? I would really like to know if they fix your problems.
> >
> > Are those patches in kvm-83? I still have the image that was causing
> > problems way back, and I'm converting it to raw now with kvm-83 to see
> > if it now matches the raw image produced by VPC's own tool.
>
> Avi mentioned the patches in the kvm-84 announcement yesterday, so it
> seems they are not in kvm-83.

Ok. I did the conversion from VHD to raw with "kvm-83-img convert" and
it gave a different result to MSVPC - i.e. wrong. I'll try it again
sometime with kvm-84.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change
Jamie Lokier schrieb:
> Kevin Wolf wrote:
>> Besides reviewing the code over and over again, I think the only real
>> chance is that you can get a non-productive copy of your image and add
>> some debug code so that we can see at least which code path is causing
>> problems.
>
> I have a copy of my image to reproduce the bug, so I can test patches
> including diagnostic patches. That's what I did to narrow it down.
>
> Being a company mail server, I can't send you the image of course.

I perfectly understand that; it just makes debugging hard when you have
neither a specific suspicion nor the problematic image.

Do you need those diagnostic patches from me, or will you try to put
debug messages into the code yourself? I would just start putting
random printfs into alloc_cluster_offset() and the functions it calls
to get some information about the failing write. I guess we'd need
several iterations with my diagnostic patch, and it wouldn't be any
better than ad-hoc hacks done by yourself.

>> By the way and completely off-topic: Have you already tried to use the
>> VHD patches? I would really like to know if they fix your problems.
>
> Are those patches in kvm-83? I still have the image that was causing
> problems way back, and I'm converting it to raw now with kvm-83 to see
> if it now matches the raw image produced by VPC's own tool.

Avi mentioned the patches in the kvm-84 announcement yesterday, so it
seems they are not in kvm-83.

Kevin
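[Editor's sketch] The ad-hoc printf instrumentation Kevin suggests could look
something like the following; the QCOW_TRACE macro and the stand-in function
body are hypothetical illustrations, not part of any actual patch:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical trace macro: prints function, line, and a formatted message
 * to stderr so the code path taken by the failing write can be
 * reconstructed from the log afterwards.  ##__VA_ARGS__ is a GNU
 * extension, which is fine for code built with gcc as qemu is. */
#define QCOW_TRACE(fmt, ...) \
    fprintf(stderr, "qcow2: %s:%d: " fmt "\n", __func__, __LINE__, ##__VA_ARGS__)

/* Toy stand-in for the real alloc_cluster_offset(), showing where such
 * traces would go; the body is placeholder sector-alignment arithmetic,
 * not the real qcow2 allocation logic. */
static uint64_t alloc_cluster_offset(uint64_t offset, int n_sectors)
{
    QCOW_TRACE("offset=0x%" PRIx64 " n_sectors=%d", offset, n_sectors);
    uint64_t cluster_offset = (offset + 511) & ~511ULL;  /* placeholder */
    QCOW_TRACE("-> cluster_offset=0x%" PRIx64, cluster_offset);
    return cluster_offset;
}
```

Sprinkling calls like these through the allocation path and diffing the logs
of a good and a bad run is essentially the iteration Kevin describes.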
Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change
Hi,

On Wed, 11 Feb 2009, Chris Wright wrote:
> * Kevin Wolf (kw...@suse.de) wrote:
> > I would suspect that simply having a 64 bit host isn't enough to
> > trigger the problem. These patches were in for half a year now
> > without anyone noticing such failure.
>
> BTW, we've seen similar corruption, but not narrowed it down
> successfully as it's been quite difficult to trigger.

Same here,
Dscho
Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change
* Kevin Wolf (kw...@suse.de) wrote:
> I would suspect that simply having a 64 bit host isn't enough to
> trigger the problem. These patches were in for half a year now without
> anyone noticing such failure.

BTW, we've seen similar corruption, but not narrowed it down
successfully as it's been quite difficult to trigger.

thanks,
-chris
Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change
Kevin Wolf wrote:
> Besides reviewing the code over and over again, I think the only real
> chance is that you can get a non-productive copy of your image and add
> some debug code so that we can see at least which code path is causing
> problems.

I have a copy of my image to reproduce the bug, so I can test patches
including diagnostic patches. That's what I did to narrow it down.

Being a company mail server, I can't send you the image of course.

> > Aside from logic, the code mixes signed 32-bit with unsigned 64-bit
> > with unclear naming which would make me nervous. My host is 64-bit,
> > by the way.
>
> I would suspect that simply having a 64 bit host isn't enough to
> trigger the problem. These patches were in for half a year now without
> anyone noticing such failure.

It was just for clarity. If there are any bugs, it's more likely to be
truncation on a 32 bit host :-) I didn't see any mention of "long", so
the code should behave the same on 64-bit and 32-bit hosts.

> By the way and completely off-topic: Have you already tried to use the
> VHD patches? I would really like to know if they fix your problems.

Are those patches in kvm-83? I still have the image that was causing
problems way back, and I'm converting it to raw now with kvm-83 to see
if it now matches the raw image produced by VPC's own tool.

Beyond checking that it reads ok, which it didn't before, I don't know
how to test the VPC support properly, but I can try booting the image
and see if it at least doesn't fsck^H^H^H^Hscandisk like it used to.

I'm not using VPC images any more; we just install Windows into empty
QCOW2 or raw images, like everyone else. :-)

At some point I may test the VPC support with prebuilt images
downloaded from Microsoft - you can too!

-- Jamie
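[Editor's sketch] The width mix Jamie is nervous about can be shown in
isolation. This is illustrative only (the actual qcow2 bug had not been
identified at this point in the thread): expressions in the patch such as
cluster_offset + (nb_available << 9) shift a plain int *before* the addition
widens it to uint64_t, so the shift itself happens in 32-bit arithmetic.

```c
#include <stdint.h>

/* Shift performed in 32-bit arithmetic, then widened: any bits shifted
 * past bit 31 are already gone before the cast.  (With a *signed* int,
 * as in the patch, overflowing the shift is undefined behaviour; an
 * unsigned value is used here so the wraparound is well-defined and
 * demonstrable.) */
static uint64_t shift_then_widen(uint32_t nb_sectors)
{
    return (uint64_t)(nb_sectors << 9);
}

/* Widen first, then shift in 64-bit arithmetic: correct byte count. */
static uint64_t widen_then_shift(uint32_t nb_sectors)
{
    return (uint64_t)nb_sectors << 9;
}
```

For small sector counts both functions agree, which is why such a bug can
hide for a long time; they diverge only once nb_sectors reaches 2^23 (a
4 GiB byte count), at which point the 32-bit shift wraps to zero.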
Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change
Jamie Lokier schrieb:
> Kevin Wolf wrote:
>> Jamie Lokier schrieb:
>>> Although there are many ways to make Windows blue screen in KVM, in
>>> this case I've narrowed it down to the difference in
>>> qemu/block-qcow2.c between kvm-72 and kvm-73 (not -83).
>>
>> This must be one of SVN revisions 5003 to 5008 in upstream qemu. Can
>> you narrow it down to one of these? I certainly don't feel like
>> reviewing all of them once again.
>
> It's QEMU SVN delta 5005-5006, copied below.
>
> I've tested by applying the diffs up to QEMU SVN revs 5003 to 500,
> onto kvm-72, and 5005-5006 is the diff which triggers the failed guest
> boot, consistently.

That's exactly what I was afraid of... It's the most difficult patch of
the series. I'm adding Laurent to CC, who wrote the patch series back
then, but I can imagine he wants to do different things in his spare
time.

Besides reviewing the code over and over again, I think the only real
chance is that you can get a non-productive copy of your image and add
some debug code so that we can see at least which code path is causing
problems.

> Aside from logic, the code mixes signed 32-bit with unsigned 64-bit
> with unclear naming which would make me nervous. My host is 64-bit,
> by the way.

I would suspect that simply having a 64 bit host isn't enough to
trigger the problem. These patches were in for half a year now without
anyone noticing such failure.

By the way and completely off-topic: Have you already tried to use the
VHD patches? I would really like to know if they fix your problems.

Kevin
Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change
Kevin Wolf wrote:
> Jamie Lokier schrieb:
>> Although there are many ways to make Windows blue screen in KVM, in
>> this case I've narrowed it down to the difference in
>> qemu/block-qcow2.c between kvm-72 and kvm-73 (not -83).
>
> This must be one of SVN revisions 5003 to 5008 in upstream qemu. Can
> you narrow it down to one of these? I certainly don't feel like
> reviewing all of them once again.

It's QEMU SVN delta 5005-5006, copied below.

I've tested by applying the diffs up to QEMU SVN revs 5003 to 500,
onto kvm-72, and 5005-5006 is the diff which triggers the failed guest
boot, consistently.

Aside from logic, the code mixes signed 32-bit with unsigned 64-bit
with unclear naming which would make me nervous. My host is 64-bit,
by the way.

-- Jamie

--- trunk/block-qcow2.c 2008/08/14 18:09:32     5005
+++ trunk/block-qcow2.c 2008/08/14 18:10:28     5006
@@ -52,6 +52,8 @@
 #define QCOW_CRYPT_NONE 0
 #define QCOW_CRYPT_AES  1

+#define QCOW_MAX_CRYPT_CLUSTERS 32
+
 /* indicate that the refcount of the referenced cluster is exactly one. */
 #define QCOW_OFLAG_COPIED     (1LL << 63)
 /* indicate that the cluster is compressed (they never have the copied flag) */
@@ -263,7 +265,8 @@
     if (!s->cluster_cache)
         goto fail;
     /* one more sector for decompressed data alignment */
-    s->cluster_data = qemu_malloc(s->cluster_size + 512);
+    s->cluster_data = qemu_malloc(QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size
+                                  + 512);
     if (!s->cluster_data)
         goto fail;
     s->cluster_cache_offset = -1;
@@ -612,43 +615,98 @@
  * For a given offset of the disk image, return cluster offset in
  * qcow2 file.
  *
+ * on entry, *num is the number of contiguous clusters we'd like to
+ * access following offset.
+ *
+ * on exit, *num is the number of contiguous clusters we can read.
+ *
  * Return 1, if the offset is found
  * Return 0, otherwise.
  *
  */
-static uint64_t get_cluster_offset(BlockDriverState *bs, uint64_t offset)
+static uint64_t get_cluster_offset(BlockDriverState *bs,
+                                   uint64_t offset, int *num)
 {
     BDRVQcowState *s = bs->opaque;
     int l1_index, l2_index;
-    uint64_t l2_offset, *l2_table, cluster_offset;
+    uint64_t l2_offset, *l2_table, cluster_offset, next;
+    int l1_bits;
+    int index_in_cluster, nb_available, nb_needed;
+
+    index_in_cluster = (offset >> 9) & (s->cluster_sectors - 1);
+    nb_needed = *num + index_in_cluster;
+
+    l1_bits = s->l2_bits + s->cluster_bits;
+
+    /* compute how many bytes there are between the offset and
+     * and the end of the l1 entry
+     */
+
+    nb_available = (1 << l1_bits) - (offset & ((1 << l1_bits) - 1));
+
+    /* compute the number of available sectors */
+
+    nb_available = (nb_available >> 9) + index_in_cluster;
+
+    cluster_offset = 0;

     /* seek the the l2 offset in the l1 table */

-    l1_index = offset >> (s->l2_bits + s->cluster_bits);
+    l1_index = offset >> l1_bits;
     if (l1_index >= s->l1_size)
-        return 0;
+        goto out;

     l2_offset = s->l1_table[l1_index];

     /* seek the l2 table of the given l2 offset */

     if (!l2_offset)
-        return 0;
+        goto out;

     /* load the l2 table in memory */

     l2_offset &= ~QCOW_OFLAG_COPIED;
     l2_table = l2_load(bs, l2_offset);
     if (l2_table == NULL)
-        return 0;
+        goto out;

     /* find the cluster offset for the given disk offset */

     l2_index = (offset >> s->cluster_bits) & (s->l2_size - 1);
     cluster_offset = be64_to_cpu(l2_table[l2_index]);
+    nb_available = s->cluster_sectors;
+    l2_index++;
+
+    if (!cluster_offset) {

-    return cluster_offset & ~QCOW_OFLAG_COPIED;
+        /* how many empty clusters ? */
+
+        while (nb_available < nb_needed && !l2_table[l2_index]) {
+            l2_index++;
+            nb_available += s->cluster_sectors;
+        }
+    } else {
+
+        /* how many allocated clusters ? */
+
+        cluster_offset &= ~QCOW_OFLAG_COPIED;
+        while (nb_available < nb_needed) {
+            next = be64_to_cpu(l2_table[l2_index]) & ~QCOW_OFLAG_COPIED;
+            if (next != cluster_offset + (nb_available << 9))
+                break;
+            l2_index++;
+            nb_available += s->cluster_sectors;
+        }
+    }
+
+out:
+    if (nb_available > nb_needed)
+        nb_available = nb_needed;
+
+    *num = nb_available - index_in_cluster;
+
+    return cluster_offset;
 }

 /*
@@ -659,13 +717,10 @@
  */

 static void free_any_clusters(BlockDriverState *bs,
-                              uint64_t cluster_offset)
+                              uint64_t cluster_offset, int nb_clusters)
 {
     BDRVQcowState *s = bs->opaque;

-    if (cluster_offset == 0)
-        return;
-
     /* free the cluster */

     if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
@@ -677,7 +732,9 @@
         return;
     }

-    free_clusters(bs, cluster_offset, s->cluster_size);
+    free_clusters
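[Editor's sketch] To make the new logic in this patch easier to follow, here
is a self-contained toy model of the contiguous-sector counting that the
hunk adds to get_cluster_offset(). It assumes 4 KiB clusters and a flat L2
array of host offsets, and omits the real code's L1 lookup, empty-cluster
branch, and QCOW_OFLAG_COPIED handling; it is not the qemu implementation:

```c
#include <stdint.h>

#define CLUSTER_BITS 12                         /* assumed: 4 KiB clusters */
#define CLUSTER_SECTORS (1 << (CLUSTER_BITS - 9))  /* 8 sectors per cluster */

/* Given a guest byte offset and, in *num, the number of 512-byte sectors
 * the caller would like to access, scan the (toy) L2 table for physically
 * contiguous allocated clusters and clamp *num to what a single
 * contiguous host read can cover.  Returns the host offset of the first
 * cluster. */
static uint64_t get_cluster_offset(const uint64_t *l2_table, int l2_size,
                                   uint64_t offset, int *num)
{
    int index_in_cluster = (offset >> 9) & (CLUSTER_SECTORS - 1);
    int nb_needed = *num + index_in_cluster;   /* sectors from cluster start */
    int l2_index = offset >> CLUSTER_BITS;
    uint64_t cluster_offset = l2_table[l2_index];
    int nb_available = CLUSTER_SECTORS;        /* first cluster is usable */

    /* count following clusters that are physically contiguous on disk;
     * the cast widens before the shift, avoiding 32-bit truncation */
    while (nb_available < nb_needed && l2_index + 1 < l2_size) {
        uint64_t next = l2_table[++l2_index];
        if (next != cluster_offset + ((uint64_t)nb_available << 9))
            break;
        nb_available += CLUSTER_SECTORS;
    }

    if (nb_available > nb_needed)
        nb_available = nb_needed;
    *num = nb_available - index_in_cluster;
    return cluster_offset;
}
```

For example, with an L2 table {0x10000, 0x11000, 0x50000} the first two
clusters are contiguous (0x11000 = 0x10000 + 0x1000) and the third is not,
so a 20-sector request starting at offset 0 is clamped to 16 sectors.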
Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change
Kevin Wolf wrote:
> Jamie Lokier schrieb:
>> Although there are many ways to make Windows blue screen in KVM, in
>> this case I've narrowed it down to the difference in
>> qemu/block-qcow2.c between kvm-72 and kvm-73 (not -83).
>
> This must be one of SVN revisions 5003 to 5008 in upstream qemu. Can
> you narrow it down to one of these? I certainly don't feel like
> reviewing all of them once again.

That's helpful, thanks. It's a production mail server which was
affected, and it's being used at the moment. Not sure if I can narrow
it down that easily :-)

> Do I understand correctly that your image fails with the kvm-73
> version of block-qcow2.c and afterwards boots with kvm-72? So the
> image is not really corrupted but it's more likely an IO error which
> brings the OS down?

That's correct; it has always booted when trying again with kvm-72, or
a later kvm with block-qcow2.c reverted. It might be an I/O error
rather than corruption. Up to kvm-76, I/O errors aren't reported over
the IDE driver.

-- Jamie
Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change
Jamie Lokier schrieb:
> Although there are many ways to make Windows blue screen in KVM, in
> this case I've narrowed it down to the difference in
> qemu/block-qcow2.c between kvm-72 and kvm-73 (not -83).

This must be one of SVN revisions 5003 to 5008 in upstream qemu. Can
you narrow it down to one of these? I certainly don't feel like
reviewing all of them once again.

Do I understand correctly that your image fails with the kvm-73 version
of block-qcow2.c and afterwards boots with kvm-72? So the image is not
really corrupted, but it's more likely an IO error which brings the OS
down?

Kevin