Re: Endless loop in qcow2_alloc_cluster_offset

2010-05-07 Thread Kevin Wolf
On 07.05.2010 03:19, Marcelo Tosatti wrote:
 On Thu, Nov 19, 2009 at 01:19:55PM +0100, Jan Kiszka wrote:
 Hi,

 I just managed to push a qemu-kvm process (git rev. b496fe3431) into an
 endless loop in qcow2_alloc_cluster_offset, namely over
 QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight):

 (gdb) bt
 #0  0x0048614b in qcow2_alloc_cluster_offset (bs=0xc4e1d0,
 offset=7417184256, n_start=0, n_end=16, num=0xcb351c, m=0xcb3568) at
 /data/qemu-kvm/block/qcow2-cluster.c:750
 #1  0x004828d0 in qcow_aio_write_cb (opaque=0xcb34d0, ret=0) at
 /data/qemu-kvm/block/qcow2.c:587
 #2  0x00482a44 in qcow_aio_writev (bs=<value optimized out>,
 sector_num=<value optimized out>, qiov=<value optimized out>,
 nb_sectors=<value optimized out>, cb=<value optimized out>,
 opaque=<value optimized out>) at /data/qemu-kvm/block/qcow2.c:645
 #3  0x00470e89 in bdrv_aio_writev (bs=0xc4e1d0, sector_num=2,
 qiov=0x7f48a9010ed0, nb_sectors=16, cb=0x470d20 <bdrv_rw_em_cb>,
 opaque=0x7f48a9010f0c) at /data/qemu-kvm/block.c:1362
 #4  0x00472991 in bdrv_write_em (bs=0xc4e1d0, sector_num=14486688,
 buf=0xd67200 "H\a", nb_sectors=16) at /data/qemu-kvm/block.c:1736
 #5  0x00435581 in ide_sector_write (s=0xc92650) at
 /data/qemu-kvm/hw/ide/core.c:622
 #6  0x00425fc2 in kvm_handle_io (env=<value optimized out>) at
 /data/qemu-kvm/kvm-all.c:553
 #7  kvm_run (env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:964
 #8  0x00426049 in kvm_cpu_exec (env=0x1000) at
 /data/qemu-kvm/qemu-kvm.c:1651
 #9  0x0042627d in kvm_main_loop_cpu (_env=<value optimized out>) at
 /data/qemu-kvm/qemu-kvm.c:1893
 #10 ap_main_loop (_env=<value optimized out>) at
 /data/qemu-kvm/qemu-kvm.c:1943
 #11 0x7f48ae89d070 in start_thread () from /lib64/libpthread.so.0
 #12 0x7f48abf0711d in clone () from /lib64/libc.so.6
 #13 0x0000000000000000 in ?? ()
 (gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
 $5 = (struct QCowL2Meta *) 0xcb3568
 (gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
 $6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0,
 depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight
 = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}

 So next == first.

 
 Seen the exact same bug twice in a row while installing FC12 with IDE
 disk, current qemu-kvm.git. 
 
 qemu-system-x86_64 -drive file=/root/images/fc12-ide.img,cache=writeback \
 -m 1000  -vnc :1 \
 -net nic,model=virtio \
 -net tap,script=/root/ifup.sh -serial stdio \
 -cdrom /root/iso/linux/Fedora-12-x86_64-DVD.iso -monitor \
 telnet::4445,server,nowait -usbdevice tablet
 
 Can't reproduce though.

In current git master? That's interesting news. I had kind of expected
it would be fixed with c644db3d.

Kevin


Re: Endless loop in qcow2_alloc_cluster_offset

2010-05-07 Thread Marcelo Tosatti
On Fri, May 07, 2010 at 09:37:22AM +0200, Kevin Wolf wrote:
 On 07.05.2010 03:19, Marcelo Tosatti wrote:
  On Thu, Nov 19, 2009 at 01:19:55PM +0100, Jan Kiszka wrote:
  Hi,
 
  I just managed to push a qemu-kvm process (git rev. b496fe3431) into an
  endless loop in qcow2_alloc_cluster_offset, namely over
  QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight):
 
  (gdb) bt
  #0  0x0048614b in qcow2_alloc_cluster_offset (bs=0xc4e1d0,
  offset=7417184256, n_start=0, n_end=16, num=0xcb351c, m=0xcb3568) at
  /data/qemu-kvm/block/qcow2-cluster.c:750
  #1  0x004828d0 in qcow_aio_write_cb (opaque=0xcb34d0, ret=0) at
  /data/qemu-kvm/block/qcow2.c:587
  #2  0x00482a44 in qcow_aio_writev (bs=<value optimized out>,
  sector_num=<value optimized out>, qiov=<value optimized out>,
  nb_sectors=<value optimized out>, cb=<value optimized out>,
  opaque=<value optimized out>) at /data/qemu-kvm/block/qcow2.c:645
  #3  0x00470e89 in bdrv_aio_writev (bs=0xc4e1d0, sector_num=2,
  qiov=0x7f48a9010ed0, nb_sectors=16, cb=0x470d20 <bdrv_rw_em_cb>,
  opaque=0x7f48a9010f0c) at /data/qemu-kvm/block.c:1362
  #4  0x00472991 in bdrv_write_em (bs=0xc4e1d0, sector_num=14486688,
  buf=0xd67200 "H\a", nb_sectors=16) at /data/qemu-kvm/block.c:1736
  #5  0x00435581 in ide_sector_write (s=0xc92650) at
  /data/qemu-kvm/hw/ide/core.c:622
  #6  0x00425fc2 in kvm_handle_io (env=<value optimized out>) at
  /data/qemu-kvm/kvm-all.c:553
  #7  kvm_run (env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:964
  #8  0x00426049 in kvm_cpu_exec (env=0x1000) at
  /data/qemu-kvm/qemu-kvm.c:1651
  #9  0x0042627d in kvm_main_loop_cpu (_env=<value optimized out>)
  at /data/qemu-kvm/qemu-kvm.c:1893
  #10 ap_main_loop (_env=<value optimized out>) at
  /data/qemu-kvm/qemu-kvm.c:1943
  #11 0x7f48ae89d070 in start_thread () from /lib64/libpthread.so.0
  #12 0x7f48abf0711d in clone () from /lib64/libc.so.6
  #13 0x0000000000000000 in ?? ()
  (gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
  $5 = (struct QCowL2Meta *) 0xcb3568
  (gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
  $6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters =
  0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0},
  next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}
 
  So next == first.
 
  
  Seen the exact same bug twice in a row while installing FC12 with IDE
  disk, current qemu-kvm.git. 
  
  qemu-system-x86_64 -drive file=/root/images/fc12-ide.img,cache=writeback \
  -m 1000  -vnc :1 \
  -net nic,model=virtio \
  -net tap,script=/root/ifup.sh -serial stdio \
  -cdrom /root/iso/linux/Fedora-12-x86_64-DVD.iso -monitor \
  telnet::4445,server,nowait -usbdevice tablet
  
  Can't reproduce though.
 
 In current git master? That's interesting news. I had kind of expected
 it would be fixed with c644db3d.

Yes, with 31b460256 more precisely. And the symptom was the same as Jan
reported, cluster_allocs.lh_first had le_next pointing to itself.

Perhaps you can add an assert there, so it abort()s in that case along
with some useful information? I'll try to reproduce.
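
Something along these lines, maybe (only a sketch, not code from the
tree; check_not_in_flight() is a made-up helper that would be called
just before the QLIST_INSERT_HEAD in qcow2_alloc_cluster_offset()):

static void check_not_in_flight(BDRVQcowState *s, QCowL2Meta *m)
{
    QCowL2Meta *old_alloc;

    QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight) {
        if (old_alloc == m) {
            /* re-inserting an element that is already on the list
             * would create the self-loop seen in the gdb dump */
            fprintf(stderr, "qcow2: L2Meta %p already in flight "
                    "(offset %" PRIu64 ", nb_clusters %d)\n",
                    (void *)m, m->offset, m->nb_clusters);
            abort();
        }
    }
}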



Re: Endless loop in qcow2_alloc_cluster_offset

2010-05-06 Thread Marcelo Tosatti
On Thu, Nov 19, 2009 at 01:19:55PM +0100, Jan Kiszka wrote:
 Hi,
 
 I just managed to push a qemu-kvm process (git rev. b496fe3431) into an
 endless loop in qcow2_alloc_cluster_offset, namely over
 QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight):
 
 (gdb) bt
 #0  0x0048614b in qcow2_alloc_cluster_offset (bs=0xc4e1d0,
 offset=7417184256, n_start=0, n_end=16, num=0xcb351c, m=0xcb3568) at
 /data/qemu-kvm/block/qcow2-cluster.c:750
 #1  0x004828d0 in qcow_aio_write_cb (opaque=0xcb34d0, ret=0) at
 /data/qemu-kvm/block/qcow2.c:587
 #2  0x00482a44 in qcow_aio_writev (bs=<value optimized out>,
 sector_num=<value optimized out>, qiov=<value optimized out>,
 nb_sectors=<value optimized out>, cb=<value optimized out>,
 opaque=<value optimized out>) at /data/qemu-kvm/block/qcow2.c:645
 #3  0x00470e89 in bdrv_aio_writev (bs=0xc4e1d0, sector_num=2,
 qiov=0x7f48a9010ed0, nb_sectors=16, cb=0x470d20 <bdrv_rw_em_cb>,
 opaque=0x7f48a9010f0c) at /data/qemu-kvm/block.c:1362
 #4  0x00472991 in bdrv_write_em (bs=0xc4e1d0, sector_num=14486688,
 buf=0xd67200 "H\a", nb_sectors=16) at /data/qemu-kvm/block.c:1736
 #5  0x00435581 in ide_sector_write (s=0xc92650) at
 /data/qemu-kvm/hw/ide/core.c:622
 #6  0x00425fc2 in kvm_handle_io (env=<value optimized out>) at
 /data/qemu-kvm/kvm-all.c:553
 #7  kvm_run (env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:964
 #8  0x00426049 in kvm_cpu_exec (env=0x1000) at
 /data/qemu-kvm/qemu-kvm.c:1651
 #9  0x0042627d in kvm_main_loop_cpu (_env=<value optimized out>) at
 /data/qemu-kvm/qemu-kvm.c:1893
 #10 ap_main_loop (_env=<value optimized out>) at
 /data/qemu-kvm/qemu-kvm.c:1943
 #11 0x7f48ae89d070 in start_thread () from /lib64/libpthread.so.0
 #12 0x7f48abf0711d in clone () from /lib64/libc.so.6
 #13 0x0000000000000000 in ?? ()
 (gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
 $5 = (struct QCowL2Meta *) 0xcb3568
 (gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
 $6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0,
 depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight
 = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}
 
 So next == first.
 

Seen the exact same bug twice in a row while installing FC12 with IDE
disk, current qemu-kvm.git. 

qemu-system-x86_64 -drive file=/root/images/fc12-ide.img,cache=writeback \
-m 1000  -vnc :1 \
-net nic,model=virtio \
-net tap,script=/root/ifup.sh -serial stdio \
-cdrom /root/iso/linux/Fedora-12-x86_64-DVD.iso -monitor \
telnet::4445,server,nowait -usbdevice tablet

Can't reproduce though.



Re: [Qemu-devel] Re: Endless loop in qcow2_alloc_cluster_offset

2009-12-08 Thread Kevin Wolf
On 07.12.2009 16:00, Kevin Wolf wrote:
 On 07.12.2009 15:16, Jan Kiszka wrote:
 Likely not. What I did was nothing special, and I did not notice such a
 crash in the last months.

 And now it happened again (qemu-kvm head, during kernel installation
 from network onto local qcow2-disk). Any clever idea how to proceed with
 this?
 
 I still haven't seen this and I still have no theory on what could be
 happening here. I'm just trying to write down what I think must happen
 to get into this situation. Maybe you can point at something I'm missing
 or maybe it helps you to have a sudden inspiration.
 
 The crash happens because we have a loop in the s->cluster_allocs list.
 A loop can only be created by inserting an object twice. The only insert
 to this list happens in qcow2_alloc_cluster_offset (though an earlier
 call than that of the stack trace).
 
 There is only one relevant caller of this function, qcow_aio_write_cb.
 Part of it is a call to run_dependent_requests which removes the request
 from s->cluster_allocs. So after the QLIST_REMOVE in
 run_dependent_requests the request can't be contained in the list, but
 at the call of qcow2_alloc_cluster_offset it must be contained again. It
 must be added somewhere in between these two calls.
 
 In qcow_aio_write_cb there isn't much happening between these calls. The
 only thing that could somehow become dangerous is the
 qcow_aio_write_cb(req, 0); for queued requests in run_dependent_requests.

Hm, you're using only one disk, and it's an IDE disk, right? Then the
queue of dependent requests should be empty anyway, so no dangerous
calls here. Maybe your theory of a memory corruption is the better one.

Kevin


Re: Endless loop in qcow2_alloc_cluster_offset

2009-12-07 Thread Jan Kiszka
Jan Kiszka wrote:
 Kevin Wolf wrote:
 Hi Jan,

 On 19.11.2009 13:19, Jan Kiszka wrote:
 (gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
 $5 = (struct QCowL2Meta *) 0xcb3568
 (gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
 $6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0,
 depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0},
 next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}

 So next == first.
 Oops. Doesn't sound quite right...

 Is something fiddling with cluster_allocs concurrently, e.g. some signal
 handler? Or what could cause this list corruption? Would it be enough to
 move to QLIST_FOREACH_SAFE?
 Are there any specific signals you're thinking of? Related to block code
 
 No, was just blind guessing.
 
 I can only think of SIGUSR2 and this one shouldn't call any block driver
 functions directly. You're using aio=threads, I assume? (It's the default)
 
 Yes, all on defaults.
 
 QLIST_FOREACH_SAFE shouldn't make a difference in this place as the loop
 doesn't insert or remove any elements. If the list is corrupted now, I
 think it would be corrupted with QLIST_FOREACH_SAFE as well - at best,
 the endless loop would occur one call later.

 The only way I see to get such a loop in a list is to re-insert an
 element that already is part of the list. The only insert is at
 qcow2-cluster.c:777. That leaves the question of how we got there twice
 without run_dependent_requests() removing the L2Meta from our list first
 - because this is definitely wrong...

 Presumably, it's not reproducible?
 
 Likely not. What I did was nothing special, and I did not notice such a
 crash in the last months.

And now it happened again (qemu-kvm head, during kernel installation
from network onto local qcow2-disk). Any clever idea how to proceed with
this?

I could try to run the step in a loop, hopefully retriggering it once in
a (likely longer) while. But then we need some good instrumentation first.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: Endless loop in qcow2_alloc_cluster_offset

2009-12-07 Thread Jan Kiszka
Jan Kiszka wrote:
 And now it happened again (qemu-kvm head, during kernel installation
 from network onto local qcow2-disk). Any clever idea how to proceed with
 this?
 
 I could try to run the step in a loop, hopefully retriggering it once in
 a (likely longer) while. But then we need some good instrumentation first.
 

Maybe I'm seeing ghosts, and I don't even have a minimal clue about what
goes on in the code, but this looks fishy:

preallocate() invokes qcow2_alloc_cluster_offset() passing &meta, a
stack variable. It seems that qcow2_alloc_cluster_offset() may insert
this structure into cluster_allocs and leave it there. So we corrupt the
queue as soon as preallocate() returns, no?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: Endless loop in qcow2_alloc_cluster_offset

2009-12-07 Thread Kevin Wolf
On 07.12.2009 15:16, Jan Kiszka wrote:
 Likely not. What I did was nothing special, and I did not notice such a
 crash in the last months.
 
 And now it happened again (qemu-kvm head, during kernel installation
 from network onto local qcow2-disk). Any clever idea how to proceed with
 this?

I still haven't seen this and I still have no theory on what could be
happening here. I'm just trying to write down what I think must happen
to get into this situation. Maybe you can point at something I'm missing
or maybe it helps you to have a sudden inspiration.

The crash happens because we have a loop in the s->cluster_allocs list.
A loop can only be created by inserting an object twice. The only insert
to this list happens in qcow2_alloc_cluster_offset (though an earlier
call than that of the stack trace).

There is only one relevant caller of this function, qcow_aio_write_cb.
Part of it is a call to run_dependent_requests which removes the request
from s->cluster_allocs. So after the QLIST_REMOVE in
run_dependent_requests the request can't be contained in the list, but
at the call of qcow2_alloc_cluster_offset it must be contained again. It
must be added somewhere in between these two calls.

In qcow_aio_write_cb there isn't much happening between these calls. The
only thing that could somehow become dangerous is the
qcow_aio_write_cb(req, 0); for queued requests in run_dependent_requests.

 I could try to run the step in a loop, hopefully retriggering it once in
 a (likely longer) while. But then we need some good instrumentation first.

I can't explain what exactly would be going wrong there, but if my
thoughts are right so far, I think that moving this into a Bottom Half
would help. So if you can reproduce it in a loop this could be worth a try.
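
Roughly like this, to give an idea (just a sketch; the restart_bh field
on QCowAIOCB is invented here, qemu_bh_* is the existing bottom half
API):

/* Restart a dependent request from the main loop instead of calling
 * qcow_aio_write_cb() recursively from run_dependent_requests() */
static void qcow_aio_restart_bh(void *opaque)
{
    QCowAIOCB *req = opaque;

    qemu_bh_delete(req->restart_bh);
    req->restart_bh = NULL;
    qcow_aio_write_cb(req, 0);
}

and in run_dependent_requests(), instead of the direct call:

    req->restart_bh = qemu_bh_new(qcow_aio_restart_bh, req);
    qemu_bh_schedule(req->restart_bh);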

I'd certainly prefer to understand the problem first, but thinking about
AIO is the perfect way to make your brain hurt...

Kevin


Re: Endless loop in qcow2_alloc_cluster_offset

2009-12-07 Thread Avi Kivity

On 12/07/2009 04:50 PM, Jan Kiszka wrote:
 Maybe I'm seeing ghosts, and I don't even have a minimal clue about what
 goes on in the code, but this looks fishy:

Plenty of ghosts in qcow2, of all those explorers who tried to brave the
code.  Only Kevin has ever come back.

 preallocate() invokes qcow2_alloc_cluster_offset() passing &meta, a
 stack variable. It seems that qcow2_alloc_cluster_offset() may insert
 this structure into cluster_allocs and leave it there. So we corrupt the
 queue as soon as preallocate() returns, no?

We invoke run_dependent_requests() which should dequeue those meta
again (I think).


--
error compiling committee.c: too many arguments to function



Re: Endless loop in qcow2_alloc_cluster_offset

2009-12-07 Thread Kevin Wolf
On 07.12.2009 15:50, Jan Kiszka wrote:
 Jan Kiszka wrote:
 And now it happened again (qemu-kvm head, during kernel installation
 from network onto local qcow2-disk). Any clever idea how to proceed with
 this?

 I could try to run the step in a loop, hopefully retriggering it once in
 a (likely longer) while. But then we need some good instrumentation first.

 
 Maybe I'm seeing ghosts, and I don't even have a minimal clue about what
 goes on in the code, but this looks fishy:
 
 preallocate() invokes qcow2_alloc_cluster_offset() passing &meta, a
 stack variable. It seems that qcow2_alloc_cluster_offset() may insert
 this structure into cluster_allocs and leave it there. So we corrupt the
 queue as soon as preallocate() returns, no?

preallocate() is about metadata preallocation during image creation. It
is only ever run by qemu-img. Apart from that it calls
run_dependent_requests() which removes the request from the list again.
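
Condensed, the relevant part of preallocate() looks roughly like this
(quoted from memory, so take it as a sketch rather than the exact code):

    QCowL2Meta meta;
    uint64_t offset = 0;
    uint64_t nb_sectors = bdrv_getlength(bs) >> 9;
    uint64_t cluster_offset;
    int num;

    QLIST_INIT(&meta.dependent_requests);

    while (nb_sectors) {
        num = MIN(nb_sectors, INT_MAX >> 9);
        cluster_offset = qcow2_alloc_cluster_offset(bs, offset, 0, num,
                                                    &num, &meta);
        if (cluster_offset == 0) {
            return -1;
        }

        /* There are no dependent requests, but we still need to take
         * our request off the list of in-flight requests */
        run_dependent_requests(&meta);

        offset += num << 9;
        nb_sectors -= num;
    }

So every insert into cluster_allocs is paired with the removal in
run_dependent_requests() before meta goes out of scope.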

Kevin


Re: Endless loop in qcow2_alloc_cluster_offset

2009-12-07 Thread Jan Kiszka
Kevin Wolf wrote:
 On 07.12.2009 15:50, Jan Kiszka wrote:
 Jan Kiszka wrote:
 And now it happened again (qemu-kvm head, during kernel installation
 from network onto local qcow2-disk). Any clever idea how to proceed with
 this?

 I could try to run the step in a loop, hopefully retriggering it once in
 a (likely longer) while. But then we need some good instrumentation first.

 Maybe I'm seeing ghosts, and I don't even have a minimal clue about what
 goes on in the code, but this looks fishy:

 preallocate() invokes qcow2_alloc_cluster_offset() passing &meta, a
 stack variable. It seems that qcow2_alloc_cluster_offset() may insert
 this structure into cluster_allocs and leave it there. So we corrupt the
 queue as soon as preallocate() returns, no?
 
 preallocate() is about metadata preallocation during image creation. It
 is only ever run by qemu-img. Apart from that it calls
 run_dependent_requests() which removes the request from the list again.

OK, I see - was far too easy anyway.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: Endless loop in qcow2_alloc_cluster_offset

2009-12-07 Thread Jan Kiszka
Kevin Wolf wrote:
 On 07.12.2009 15:16, Jan Kiszka wrote:
 Likely not. What I did was nothing special, and I did not notice such a
 crash in the last months.
 And now it happened again (qemu-kvm head, during kernel installation
 from network onto local qcow2-disk). Any clever idea how to proceed with
 this?
 
 I still haven't seen this and I still have no theory on what could be
 happening here. I'm just trying to write down what I think must happen
 to get into this situation. Maybe you can point at something I'm missing
 or maybe it helps you to have a sudden inspiration.
 
 The crash happens because we have a loop in the s->cluster_allocs list.
 A loop can only be created by inserting an object twice. The only insert
 to this list happens in qcow2_alloc_cluster_offset (though an earlier
 call than that of the stack trace).
 
 There is only one relevant caller of this function, qcow_aio_write_cb.
 Part of it is a call to run_dependent_requests which removes the request
 from s->cluster_allocs. So after the QLIST_REMOVE in
 run_dependent_requests the request can't be contained in the list, but
 at the call of qcow2_alloc_cluster_offset it must be contained again. It
 must be added somewhere in between these two calls.
 
 In qcow_aio_write_cb there isn't much happening between these calls. The
 only thing that could somehow become dangerous is the
 qcow_aio_write_cb(req, 0); for queued requests in run_dependent_requests.

If m->nb_clusters is 0, the entry won't be removed from the list. And
if something corrupted nb_clusters so that it became 0 although it's
still enqueued, we would see the deadly loop I faced, right?
Unfortunately, any arbitrary memory corruption that generates such zeros
can cause this...
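
For reference, this is roughly what run_dependent_requests() looks like
(condensed from block/qcow2.c):

static void run_dependent_requests(QCowL2Meta *m)
{
    QCowAIOCB *req;
    QCowAIOCB *next;

    /* Take the request off the list of running requests -
     * skipped when nb_clusters is 0, which is the window above */
    if (m->nb_clusters != 0) {
        QLIST_REMOVE(m, next_in_flight);
    }

    /* Restart all dependent requests */
    QLIST_FOREACH_SAFE(req, &m->dependent_requests, next_depend, next) {
        qcow_aio_write_cb(req, 0);
    }

    /* Empty the list for the next part of the request */
    QLIST_INIT(&m->dependent_requests);
}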

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: Endless loop in qcow2_alloc_cluster_offset

2009-12-07 Thread Kevin Wolf
On 07.12.2009 17:09, Jan Kiszka wrote:
 Kevin Wolf wrote:
 In qcow_aio_write_cb there isn't much happening between these calls. The
 only thing that could somehow become dangerous is the
 qcow_aio_write_cb(req, 0); for queued requests in run_dependent_requests.
 
 If m->nb_clusters is 0, the entry won't be removed from the list. And
 if something corrupted nb_clusters so that it became 0 although it's
 still enqueued, we would see the deadly loop I faced, right?
 Unfortunately, any arbitrary memory corruption that generates such zeros
 can cause this...

Right, this looks like another way to get into that endless loop. I
don't think it's very likely the cause, but who knows.

Kevin


Endless loop in qcow2_alloc_cluster_offset

2009-11-19 Thread Jan Kiszka
Hi,

I just managed to push a qemu-kvm process (git rev. b496fe3431) into an
endless loop in qcow2_alloc_cluster_offset, namely over
QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight):

(gdb) bt
#0  0x0048614b in qcow2_alloc_cluster_offset (bs=0xc4e1d0,
offset=7417184256, n_start=0, n_end=16, num=0xcb351c, m=0xcb3568) at
/data/qemu-kvm/block/qcow2-cluster.c:750
#1  0x004828d0 in qcow_aio_write_cb (opaque=0xcb34d0, ret=0) at
/data/qemu-kvm/block/qcow2.c:587
#2  0x00482a44 in qcow_aio_writev (bs=<value optimized out>,
sector_num=<value optimized out>, qiov=<value optimized out>,
nb_sectors=<value optimized out>, cb=<value optimized out>,
opaque=<value optimized out>) at /data/qemu-kvm/block/qcow2.c:645
#3  0x00470e89 in bdrv_aio_writev (bs=0xc4e1d0, sector_num=2,
qiov=0x7f48a9010ed0, nb_sectors=16, cb=0x470d20 <bdrv_rw_em_cb>,
opaque=0x7f48a9010f0c) at /data/qemu-kvm/block.c:1362
#4  0x00472991 in bdrv_write_em (bs=0xc4e1d0, sector_num=14486688,
buf=0xd67200 "H\a", nb_sectors=16) at /data/qemu-kvm/block.c:1736
#5  0x00435581 in ide_sector_write (s=0xc92650) at
/data/qemu-kvm/hw/ide/core.c:622
#6  0x00425fc2 in kvm_handle_io (env=<value optimized out>) at
/data/qemu-kvm/kvm-all.c:553
#7  kvm_run (env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:964
#8  0x00426049 in kvm_cpu_exec (env=0x1000) at
/data/qemu-kvm/qemu-kvm.c:1651
#9  0x0042627d in kvm_main_loop_cpu (_env=<value optimized out>) at
/data/qemu-kvm/qemu-kvm.c:1893
#10 ap_main_loop (_env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:1943
#11 0x7f48ae89d070 in start_thread () from /lib64/libpthread.so.0
#12 0x7f48abf0711d in clone () from /lib64/libc.so.6
#13 0x0000000000000000 in ?? ()
(gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
$5 = (struct QCowL2Meta *) 0xcb3568
(gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
$6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0,
depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight =
{le_next = 0xcb3568, le_prev = 0xc4ebd8}}

So next == first.
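
For reference, QLIST_FOREACH from qemu-queue.h:

#define QLIST_FOREACH(var, head, field)                                 \
        for ((var) = ((head)->lh_first);                                \
                (var);                                                  \
                (var) = ((var)->field.le_next))

With le_next pointing back at the element itself, the update step never
advances old_alloc, so the loop spins forever.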

Is something fiddling with cluster_allocs concurrently, e.g. some signal
handler? Or what could cause this list corruption? Would it be enough to
move to QLIST_FOREACH_SAFE?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: Endless loop in qcow2_alloc_cluster_offset

2009-11-19 Thread Kevin Wolf
Hi Jan,

On 19.11.2009 13:19, Jan Kiszka wrote:
 (gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
 $5 = (struct QCowL2Meta *) 0xcb3568
 (gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
 $6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0,
 depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight
 = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}
 
 So next == first.

Oops. Doesn't sound quite right...

 Is something fiddling with cluster_allocs concurrently, e.g. some signal
 handler? Or what could cause this list corruption? Would it be enough to
 move to QLIST_FOREACH_SAFE?

Are there any specific signals you're thinking of? Related to block code
I can only think of SIGUSR2 and this one shouldn't call any block driver
functions directly. You're using aio=threads, I assume? (It's the default)

QLIST_FOREACH_SAFE shouldn't make a difference in this place as the loop
doesn't insert or remove any elements. If the list is corrupted now, I
think it would be corrupted with QLIST_FOREACH_SAFE as well - at best,
the endless loop would occur one call later.
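
For comparison, QLIST_FOREACH_SAFE from qemu-queue.h:

#define QLIST_FOREACH_SAFE(var, head, field, next_var)                  \
        for ((var) = ((head)->lh_first);                                \
                (var) && ((next_var) = ((var)->field.le_next), 1);      \
                (var) = (next_var))

It only saves le_next before the body runs, which protects against
removing the current element - with le_next == var it still assigns the
same element to var on every iteration.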

The only way I see to get such a loop in a list is to re-insert an
element that already is part of the list. The only insert is at
qcow2-cluster.c:777. That leaves the question of how we got there twice
without run_dependent_requests() removing the L2Meta from our list first
- because this is definitely wrong...
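
To illustrate, QLIST_INSERT_HEAD from qemu-queue.h:

#define QLIST_INSERT_HEAD(head, elm, field) do {                        \
        if (((elm)->field.le_next = (head)->lh_first) != NULL)          \
                (head)->lh_first->field.le_prev =                       \
                    &(elm)->field.le_next;                              \
        (head)->lh_first = (elm);                                       \
        (elm)->field.le_prev = &(head)->lh_first;                       \
} while (/*CONSTCOND*/0)

If elm is already the list head and gets inserted a second time, the
first line evaluates to elm->field.le_next = elm - exactly the
le_next == self state in your gdb dump.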

Presumably, it's not reproducible?

Kevin


Re: Endless loop in qcow2_alloc_cluster_offset

2009-11-19 Thread Jan Kiszka
Kevin Wolf wrote:
 Hi Jan,
 
 On 19.11.2009 13:19, Jan Kiszka wrote:
 (gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
 $5 = (struct QCowL2Meta *) 0xcb3568
 (gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first
 $6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0,
 depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight
 = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}

 So next == first.
 
 Oops. Doesn't sound quite right...
 
 Is something fiddling with cluster_allocs concurrently, e.g. some signal
 handler? Or what could cause this list corruption? Would it be enough to
 move to QLIST_FOREACH_SAFE?
 
 Are there any specific signals you're thinking of? Related to block code

No, was just blind guessing.

 I can only think of SIGUSR2 and this one shouldn't call any block driver
 functions directly. You're using aio=threads, I assume? (It's the default)

Yes, all on defaults.

 
 QLIST_FOREACH_SAFE shouldn't make a difference in this place as the loop
 doesn't insert or remove any elements. If the list is corrupted now, I
 think it would be corrupted with QLIST_FOREACH_SAFE as well - at best,
 the endless loop would occur one call later.
 
 The only way I see to get such a loop in a list is to re-insert an
 element that already is part of the list. The only insert is at
 qcow2-cluster.c:777. That leaves the question of how we got there twice
 without run_dependent_requests() removing the L2Meta from our list first
 - because this is definitely wrong...
 
 Presumably, it's not reproducible?

Likely not. What I did was nothing special, and I did not notice such a
crash in the last months.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux