Re: Endless loop in qcow2_alloc_cluster_offset
Am 07.05.2010 03:19, schrieb Marcelo Tosatti:
On Thu, Nov 19, 2009 at 01:19:55PM +0100, Jan Kiszka wrote:
Hi,

I just managed to push a qemu-kvm process (git rev. b496fe3431) into an endless loop in qcow2_alloc_cluster_offset, namely over QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight):

(gdb) bt
#0  0x0048614b in qcow2_alloc_cluster_offset (bs=0xc4e1d0, offset=7417184256, n_start=0, n_end=16, num=0xcb351c, m=0xcb3568) at /data/qemu-kvm/block/qcow2-cluster.c:750
#1  0x004828d0 in qcow_aio_write_cb (opaque=0xcb34d0, ret=0) at /data/qemu-kvm/block/qcow2.c:587
#2  0x00482a44 in qcow_aio_writev (bs=<value optimized out>, sector_num=<value optimized out>, qiov=<value optimized out>, nb_sectors=<value optimized out>, cb=<value optimized out>, opaque=<value optimized out>) at /data/qemu-kvm/block/qcow2.c:645
#3  0x00470e89 in bdrv_aio_writev (bs=0xc4e1d0, sector_num=2, qiov=0x7f48a9010ed0, nb_sectors=16, cb=0x470d20 <bdrv_rw_em_cb>, opaque=0x7f48a9010f0c) at /data/qemu-kvm/block.c:1362
#4  0x00472991 in bdrv_write_em (bs=0xc4e1d0, sector_num=14486688, buf=0xd67200 "H\a", nb_sectors=16) at /data/qemu-kvm/block.c:1736
#5  0x00435581 in ide_sector_write (s=0xc92650) at /data/qemu-kvm/hw/ide/core.c:622
#6  0x00425fc2 in kvm_handle_io (env=<value optimized out>) at /data/qemu-kvm/kvm-all.c:553
#7  kvm_run (env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:964
#8  0x00426049 in kvm_cpu_exec (env=0x1000) at /data/qemu-kvm/qemu-kvm.c:1651
#9  0x0042627d in kvm_main_loop_cpu (_env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:1893
#10 ap_main_loop (_env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:1943
#11 0x7f48ae89d070 in start_thread () from /lib64/libpthread.so.0
#12 0x7f48abf0711d in clone () from /lib64/libc.so.6
#13 0x in ?? ()

(gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
$5 = (struct QCowL2Meta *) 0xcb3568
(gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
$6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}

So next == first.

Seen the exact same bug twice in a row while installing FC12 with IDE disk, current qemu-kvm.git.

qemu-system-x86_64 -drive file=/root/images/fc12-ide.img,cache=writeback \
    -m 1000 -vnc :1 \
    -net nic,model=virtio \
    -net tap,script=/root/ifup.sh -serial stdio \
    -cdrom /root/iso/linux/Fedora-12-x86_64-DVD.iso -monitor telnet::4445,server,nowait -usbdevice tablet

Can't reproduce though.

In current git master? That's interesting news. I had kind of expected it would be fixed with c644db3d.

Kevin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Endless loop in qcow2_alloc_cluster_offset
On Fri, May 07, 2010 at 09:37:22AM +0200, Kevin Wolf wrote:
Am 07.05.2010 03:19, schrieb Marcelo Tosatti:
On Thu, Nov 19, 2009 at 01:19:55PM +0100, Jan Kiszka wrote:
Hi,

I just managed to push a qemu-kvm process (git rev. b496fe3431) into an endless loop in qcow2_alloc_cluster_offset, namely over QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight):

(gdb) bt
#0  0x0048614b in qcow2_alloc_cluster_offset (bs=0xc4e1d0, offset=7417184256, n_start=0, n_end=16, num=0xcb351c, m=0xcb3568) at /data/qemu-kvm/block/qcow2-cluster.c:750
#1  0x004828d0 in qcow_aio_write_cb (opaque=0xcb34d0, ret=0) at /data/qemu-kvm/block/qcow2.c:587
#2  0x00482a44 in qcow_aio_writev (bs=<value optimized out>, sector_num=<value optimized out>, qiov=<value optimized out>, nb_sectors=<value optimized out>, cb=<value optimized out>, opaque=<value optimized out>) at /data/qemu-kvm/block/qcow2.c:645
#3  0x00470e89 in bdrv_aio_writev (bs=0xc4e1d0, sector_num=2, qiov=0x7f48a9010ed0, nb_sectors=16, cb=0x470d20 <bdrv_rw_em_cb>, opaque=0x7f48a9010f0c) at /data/qemu-kvm/block.c:1362
#4  0x00472991 in bdrv_write_em (bs=0xc4e1d0, sector_num=14486688, buf=0xd67200 "H\a", nb_sectors=16) at /data/qemu-kvm/block.c:1736
#5  0x00435581 in ide_sector_write (s=0xc92650) at /data/qemu-kvm/hw/ide/core.c:622
#6  0x00425fc2 in kvm_handle_io (env=<value optimized out>) at /data/qemu-kvm/kvm-all.c:553
#7  kvm_run (env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:964
#8  0x00426049 in kvm_cpu_exec (env=0x1000) at /data/qemu-kvm/qemu-kvm.c:1651
#9  0x0042627d in kvm_main_loop_cpu (_env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:1893
#10 ap_main_loop (_env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:1943
#11 0x7f48ae89d070 in start_thread () from /lib64/libpthread.so.0
#12 0x7f48abf0711d in clone () from /lib64/libc.so.6
#13 0x in ?? ()

(gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
$5 = (struct QCowL2Meta *) 0xcb3568
(gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
$6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}

So next == first.

Seen the exact same bug twice in a row while installing FC12 with IDE disk, current qemu-kvm.git.

qemu-system-x86_64 -drive file=/root/images/fc12-ide.img,cache=writeback \
    -m 1000 -vnc :1 \
    -net nic,model=virtio \
    -net tap,script=/root/ifup.sh -serial stdio \
    -cdrom /root/iso/linux/Fedora-12-x86_64-DVD.iso -monitor telnet::4445,server,nowait -usbdevice tablet

Can't reproduce though.

In current git master? That's interesting news. I had kind of expected it would be fixed with c644db3d.

Yes, with 31b460256 more precisely. And the symptom was the same as Jan reported, cluster_allocs.lh_first had le_next pointing to itself.

Perhaps you can add an assert there, so it abort()'s in that case along with some useful information? I'll try to reproduce.
Re: Endless loop in qcow2_alloc_cluster_offset
On Thu, Nov 19, 2009 at 01:19:55PM +0100, Jan Kiszka wrote:
Hi,

I just managed to push a qemu-kvm process (git rev. b496fe3431) into an endless loop in qcow2_alloc_cluster_offset, namely over QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight):

(gdb) bt
#0  0x0048614b in qcow2_alloc_cluster_offset (bs=0xc4e1d0, offset=7417184256, n_start=0, n_end=16, num=0xcb351c, m=0xcb3568) at /data/qemu-kvm/block/qcow2-cluster.c:750
#1  0x004828d0 in qcow_aio_write_cb (opaque=0xcb34d0, ret=0) at /data/qemu-kvm/block/qcow2.c:587
#2  0x00482a44 in qcow_aio_writev (bs=<value optimized out>, sector_num=<value optimized out>, qiov=<value optimized out>, nb_sectors=<value optimized out>, cb=<value optimized out>, opaque=<value optimized out>) at /data/qemu-kvm/block/qcow2.c:645
#3  0x00470e89 in bdrv_aio_writev (bs=0xc4e1d0, sector_num=2, qiov=0x7f48a9010ed0, nb_sectors=16, cb=0x470d20 <bdrv_rw_em_cb>, opaque=0x7f48a9010f0c) at /data/qemu-kvm/block.c:1362
#4  0x00472991 in bdrv_write_em (bs=0xc4e1d0, sector_num=14486688, buf=0xd67200 "H\a", nb_sectors=16) at /data/qemu-kvm/block.c:1736
#5  0x00435581 in ide_sector_write (s=0xc92650) at /data/qemu-kvm/hw/ide/core.c:622
#6  0x00425fc2 in kvm_handle_io (env=<value optimized out>) at /data/qemu-kvm/kvm-all.c:553
#7  kvm_run (env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:964
#8  0x00426049 in kvm_cpu_exec (env=0x1000) at /data/qemu-kvm/qemu-kvm.c:1651
#9  0x0042627d in kvm_main_loop_cpu (_env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:1893
#10 ap_main_loop (_env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:1943
#11 0x7f48ae89d070 in start_thread () from /lib64/libpthread.so.0
#12 0x7f48abf0711d in clone () from /lib64/libc.so.6
#13 0x in ?? ()

(gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
$5 = (struct QCowL2Meta *) 0xcb3568
(gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
$6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}

So next == first.

Seen the exact same bug twice in a row while installing FC12 with IDE disk, current qemu-kvm.git.

qemu-system-x86_64 -drive file=/root/images/fc12-ide.img,cache=writeback \
    -m 1000 -vnc :1 \
    -net nic,model=virtio \
    -net tap,script=/root/ifup.sh -serial stdio \
    -cdrom /root/iso/linux/Fedora-12-x86_64-DVD.iso -monitor telnet::4445,server,nowait -usbdevice tablet

Can't reproduce though.
Re: [Qemu-devel] Re: Endless loop in qcow2_alloc_cluster_offset
Am 07.12.2009 16:00, schrieb Kevin Wolf:
Am 07.12.2009 15:16, schrieb Jan Kiszka:
Likely not. What I did was nothing special, and I did not notice such a crash in the last months. And now it happened again (qemu-kvm head, during kernel installation from network onto local qcow2-disk). Any clever idea how to proceed with this?

I still haven't seen this and I still have no theory on what could be happening here. I'm just trying to write down what I think must happen to get into this situation. Maybe you can point at something I'm missing or maybe it helps you to have a sudden inspiration.

The crash happens because we have a loop in the s->cluster_allocs list. A loop can only be created by inserting an object twice. The only insert to this list happens in qcow2_alloc_cluster_offset (though an earlier call than that of the stack trace). There is only one relevant caller of this function, qcow_aio_write_cb.

Part of it is a call to run_dependent_requests which removes the request from s->cluster_allocs. So after the QLIST_REMOVE in run_dependent_requests the request can't be contained in the list, but at the call of qcow2_alloc_cluster_offset it must be contained again. It must be added somewhere in between these two calls.

In qcow_aio_write_cb there isn't much happening between these calls. The only thing that could somehow become dangerous is the qcow_aio_write_cb(req, 0); for queued requests in run_dependent_requests.

Hm, you're using only one disk, and it's an IDE disk, right? Then the queue of dependent requests should be empty anyway, so no dangerous calls here. Maybe your theory of a memory corruption is the better one.

Kevin
Re: Endless loop in qcow2_alloc_cluster_offset
Jan Kiszka wrote:
Kevin Wolf wrote:
Hi Jan,

Am 19.11.2009 13:19, schrieb Jan Kiszka:
(gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
$5 = (struct QCowL2Meta *) 0xcb3568
(gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
$6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}

So next == first.

Oops. Doesn't sound quite right...

Is something fiddling with cluster_allocs concurrently, e.g. some signal handler? Or what could cause this list corruption? Would it be enough to move to QLIST_FOREACH_SAFE?

Are there any specific signals you're thinking of?

No, was just blind guessing.

Related to block code I can only think of SIGUSR2 and this one shouldn't call any block driver functions directly. You're using aio=threads, I assume? (It's the default)

Yes, all on defaults.

QLIST_FOREACH_SAFE shouldn't make a difference in this place as the loop doesn't insert or remove any elements. If the list is corrupted now, I think it would be corrupted with QLIST_FOREACH_SAFE as well - at best, the endless loop would occur one call later. The only way I see to get such a loop in a list is to re-insert an element that already is part of the list. The only insert is at qcow2-cluster.c:777. Remains the question how we came there twice without run_dependent_requests() removing the L2Meta from our list first - because this is definitely wrong...

Presumably, it's not reproducible?

Likely not. What I did was nothing special, and I did not notice such a crash in the last months.

And now it happened again (qemu-kvm head, during kernel installation from network onto local qcow2-disk). Any clever idea how to proceed with this? I could try to run the step in a loop, hopefully retriggering it once in a (likely longer) while. But then we need some good instrumentation first.

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: Endless loop in qcow2_alloc_cluster_offset
Jan Kiszka wrote:
And now it happened again (qemu-kvm head, during kernel installation from network onto local qcow2-disk). Any clever idea how to proceed with this? I could try to run the step in a loop, hopefully retriggering it once in a (likely longer) while. But then we need some good instrumentation first.

Maybe I'm seeing ghosts, and I don't even have a minimal clue about what goes on in the code, but this looks fishy: preallocate() invokes qcow2_alloc_cluster_offset(), passing &meta, a stack variable. It seems that qcow2_alloc_cluster_offset() may insert this structure into cluster_allocs and leave it there. So we corrupt the queue as soon as preallocate() returns, no?

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: Endless loop in qcow2_alloc_cluster_offset
Am 07.12.2009 15:16, schrieb Jan Kiszka:
Likely not. What I did was nothing special, and I did not notice such a crash in the last months. And now it happened again (qemu-kvm head, during kernel installation from network onto local qcow2-disk). Any clever idea how to proceed with this?

I still haven't seen this and I still have no theory on what could be happening here. I'm just trying to write down what I think must happen to get into this situation. Maybe you can point at something I'm missing or maybe it helps you to have a sudden inspiration.

The crash happens because we have a loop in the s->cluster_allocs list. A loop can only be created by inserting an object twice. The only insert to this list happens in qcow2_alloc_cluster_offset (though an earlier call than that of the stack trace). There is only one relevant caller of this function, qcow_aio_write_cb.

Part of it is a call to run_dependent_requests which removes the request from s->cluster_allocs. So after the QLIST_REMOVE in run_dependent_requests the request can't be contained in the list, but at the call of qcow2_alloc_cluster_offset it must be contained again. It must be added somewhere in between these two calls.

In qcow_aio_write_cb there isn't much happening between these calls. The only thing that could somehow become dangerous is the qcow_aio_write_cb(req, 0); for queued requests in run_dependent_requests.

I could try to run the step in a loop, hopefully retriggering it once in a (likely longer) while. But then we need some good instrumentation first.

I can't explain what exactly would be going wrong there, but if my thoughts are right so far, I think that moving this into a Bottom Half would help. So if you can reproduce it in a loop, this could be worth a try. I'd certainly prefer to understand the problem first, but thinking about AIO is the perfect way to make your brain hurt...

Kevin
Re: Endless loop in qcow2_alloc_cluster_offset
On 12/07/2009 04:50 PM, Jan Kiszka wrote:
Maybe I'm seeing ghosts, and I don't even have a minimal clue about what goes on in the code, but this looks fishy:

Plenty of ghosts in qcow2, of all those explorers who tried to brave the code. Only Kevin has ever come back.

preallocate() invokes qcow2_alloc_cluster_offset(), passing &meta, a stack variable. It seems that qcow2_alloc_cluster_offset() may insert this structure into cluster_allocs and leave it there. So we corrupt the queue as soon as preallocate() returns, no?

We invoke run_dependent_requests() which should dequeue those meta again (I think).

--
error compiling committee.c: too many arguments to function
Re: Endless loop in qcow2_alloc_cluster_offset
Am 07.12.2009 15:50, schrieb Jan Kiszka:
Jan Kiszka wrote:
And now it happened again (qemu-kvm head, during kernel installation from network onto local qcow2-disk). Any clever idea how to proceed with this? I could try to run the step in a loop, hopefully retriggering it once in a (likely longer) while. But then we need some good instrumentation first.

Maybe I'm seeing ghosts, and I don't even have a minimal clue about what goes on in the code, but this looks fishy: preallocate() invokes qcow2_alloc_cluster_offset(), passing &meta, a stack variable. It seems that qcow2_alloc_cluster_offset() may insert this structure into cluster_allocs and leave it there. So we corrupt the queue as soon as preallocate() returns, no?

preallocate() is about metadata preallocation during image creation. It is only ever run by qemu-img. Apart from that, it calls run_dependent_requests() which removes the request from the list again.

Kevin
Re: Endless loop in qcow2_alloc_cluster_offset
Kevin Wolf wrote:
Am 07.12.2009 15:50, schrieb Jan Kiszka:
Jan Kiszka wrote:
And now it happened again (qemu-kvm head, during kernel installation from network onto local qcow2-disk). Any clever idea how to proceed with this? I could try to run the step in a loop, hopefully retriggering it once in a (likely longer) while. But then we need some good instrumentation first.

Maybe I'm seeing ghosts, and I don't even have a minimal clue about what goes on in the code, but this looks fishy: preallocate() invokes qcow2_alloc_cluster_offset(), passing &meta, a stack variable. It seems that qcow2_alloc_cluster_offset() may insert this structure into cluster_allocs and leave it there. So we corrupt the queue as soon as preallocate() returns, no?

preallocate() is about metadata preallocation during image creation. It is only ever run by qemu-img. Apart from that, it calls run_dependent_requests() which removes the request from the list again.

OK, I see - was far too easy anyway.

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: Endless loop in qcow2_alloc_cluster_offset
Kevin Wolf wrote:
Am 07.12.2009 15:16, schrieb Jan Kiszka:
Likely not. What I did was nothing special, and I did not notice such a crash in the last months. And now it happened again (qemu-kvm head, during kernel installation from network onto local qcow2-disk). Any clever idea how to proceed with this?

I still haven't seen this and I still have no theory on what could be happening here. I'm just trying to write down what I think must happen to get into this situation. Maybe you can point at something I'm missing or maybe it helps you to have a sudden inspiration.

The crash happens because we have a loop in the s->cluster_allocs list. A loop can only be created by inserting an object twice. The only insert to this list happens in qcow2_alloc_cluster_offset (though an earlier call than that of the stack trace). There is only one relevant caller of this function, qcow_aio_write_cb.

Part of it is a call to run_dependent_requests which removes the request from s->cluster_allocs. So after the QLIST_REMOVE in run_dependent_requests the request can't be contained in the list, but at the call of qcow2_alloc_cluster_offset it must be contained again. It must be added somewhere in between these two calls.

In qcow_aio_write_cb there isn't much happening between these calls. The only thing that could somehow become dangerous is the qcow_aio_write_cb(req, 0); for queued requests in run_dependent_requests.

If m->nb_clusters is 0, the entry won't be removed from the list. And if something corrupted nb_clusters so that it became 0 although it's still enqueued, we would see the deadly loop I faced, right? Unfortunately, any arbitrary memory corruption that generates such zeros can cause this...

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: Endless loop in qcow2_alloc_cluster_offset
Am 07.12.2009 17:09, schrieb Jan Kiszka:
Kevin Wolf wrote:
In qcow_aio_write_cb there isn't much happening between these calls. The only thing that could somehow become dangerous is the qcow_aio_write_cb(req, 0); for queued requests in run_dependent_requests.

If m->nb_clusters is 0, the entry won't be removed from the list. And if something corrupted nb_clusters so that it became 0 although it's still enqueued, we would see the deadly loop I faced, right? Unfortunately, any arbitrary memory corruption that generates such zeros can cause this...

Right, this looks like another way to get into that endless loop. I don't think it's very likely the cause, but who knows.

Kevin
Endless loop in qcow2_alloc_cluster_offset
Hi,

I just managed to push a qemu-kvm process (git rev. b496fe3431) into an endless loop in qcow2_alloc_cluster_offset, namely over QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight):

(gdb) bt
#0  0x0048614b in qcow2_alloc_cluster_offset (bs=0xc4e1d0, offset=7417184256, n_start=0, n_end=16, num=0xcb351c, m=0xcb3568) at /data/qemu-kvm/block/qcow2-cluster.c:750
#1  0x004828d0 in qcow_aio_write_cb (opaque=0xcb34d0, ret=0) at /data/qemu-kvm/block/qcow2.c:587
#2  0x00482a44 in qcow_aio_writev (bs=<value optimized out>, sector_num=<value optimized out>, qiov=<value optimized out>, nb_sectors=<value optimized out>, cb=<value optimized out>, opaque=<value optimized out>) at /data/qemu-kvm/block/qcow2.c:645
#3  0x00470e89 in bdrv_aio_writev (bs=0xc4e1d0, sector_num=2, qiov=0x7f48a9010ed0, nb_sectors=16, cb=0x470d20 <bdrv_rw_em_cb>, opaque=0x7f48a9010f0c) at /data/qemu-kvm/block.c:1362
#4  0x00472991 in bdrv_write_em (bs=0xc4e1d0, sector_num=14486688, buf=0xd67200 "H\a", nb_sectors=16) at /data/qemu-kvm/block.c:1736
#5  0x00435581 in ide_sector_write (s=0xc92650) at /data/qemu-kvm/hw/ide/core.c:622
#6  0x00425fc2 in kvm_handle_io (env=<value optimized out>) at /data/qemu-kvm/kvm-all.c:553
#7  kvm_run (env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:964
#8  0x00426049 in kvm_cpu_exec (env=0x1000) at /data/qemu-kvm/qemu-kvm.c:1651
#9  0x0042627d in kvm_main_loop_cpu (_env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:1893
#10 ap_main_loop (_env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:1943
#11 0x7f48ae89d070 in start_thread () from /lib64/libpthread.so.0
#12 0x7f48abf0711d in clone () from /lib64/libc.so.6
#13 0x in ?? ()

(gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
$5 = (struct QCowL2Meta *) 0xcb3568
(gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
$6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}

So next == first.

Is something fiddling with cluster_allocs concurrently, e.g. some signal handler? Or what could cause this list corruption? Would it be enough to move to QLIST_FOREACH_SAFE?

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: Endless loop in qcow2_alloc_cluster_offset
Hi Jan,

Am 19.11.2009 13:19, schrieb Jan Kiszka:
(gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
$5 = (struct QCowL2Meta *) 0xcb3568
(gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
$6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}

So next == first.

Oops. Doesn't sound quite right...

Is something fiddling with cluster_allocs concurrently, e.g. some signal handler? Or what could cause this list corruption? Would it be enough to move to QLIST_FOREACH_SAFE?

Are there any specific signals you're thinking of? Related to block code, I can only think of SIGUSR2 and this one shouldn't call any block driver functions directly. You're using aio=threads, I assume? (It's the default)

QLIST_FOREACH_SAFE shouldn't make a difference in this place as the loop doesn't insert or remove any elements. If the list is corrupted now, I think it would be corrupted with QLIST_FOREACH_SAFE as well - at best, the endless loop would occur one call later. The only way I see to get such a loop in a list is to re-insert an element that already is part of the list. The only insert is at qcow2-cluster.c:777. Remains the question how we came there twice without run_dependent_requests() removing the L2Meta from our list first - because this is definitely wrong...

Presumably, it's not reproducible?

Kevin
Re: Endless loop in qcow2_alloc_cluster_offset
Kevin Wolf wrote:
Hi Jan,

Am 19.11.2009 13:19, schrieb Jan Kiszka:
(gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
$5 = (struct QCowL2Meta *) 0xcb3568
(gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
$6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}

So next == first.

Oops. Doesn't sound quite right...

Is something fiddling with cluster_allocs concurrently, e.g. some signal handler? Or what could cause this list corruption? Would it be enough to move to QLIST_FOREACH_SAFE?

Are there any specific signals you're thinking of?

No, was just blind guessing.

Related to block code I can only think of SIGUSR2 and this one shouldn't call any block driver functions directly. You're using aio=threads, I assume? (It's the default)

Yes, all on defaults.

QLIST_FOREACH_SAFE shouldn't make a difference in this place as the loop doesn't insert or remove any elements. If the list is corrupted now, I think it would be corrupted with QLIST_FOREACH_SAFE as well - at best, the endless loop would occur one call later. The only way I see to get such a loop in a list is to re-insert an element that already is part of the list. The only insert is at qcow2-cluster.c:777. Remains the question how we came there twice without run_dependent_requests() removing the L2Meta from our list first - because this is definitely wrong...

Presumably, it's not reproducible?

Likely not. What I did was nothing special, and I did not notice such a crash in the last months.

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux