Re: [Qemu-devel] Fwd: [RFC 00/27] Migration thread (WIP)

2012-07-27 Thread Juan Quintela
Chegu Vinod chegu_vi...@hp.com wrote:
 On 7/26/2012 11:41 AM, Chegu Vinod wrote:


 
 
 Original Message 
Subject: [Qemu-devel] [RFC 00/27] Migration thread (WIP)
Date:    Tue, 24 Jul 2012 20:36:25 +0200
From:    Juan Quintela quint...@redhat.com
To:      qemu-devel@nongnu.org


 
 
 Hi

 This series is on top of the migration-next-v5 series just posted.

 First of all, this is an RFC/work in progress.  A lot of people
 asked for it, and I would like a review of the design.

 Hello,
 
 Thanks for sharing this early/WIP version for evaluation. 
 
 Still in the middle of the code review, but wanted to share a couple
 of quick observations.
 I tried to use it to migrate a 128G/10-vCPU guest (speed set to 10G
 and downtime to 2s), once with no workload (i.e. an idle guest) and
 once with SpecJBB running in the guest.
 
 The idle guest case seemed to migrate fine...
 
 
 capabilities: xbzrle: off
 Migration status: completed
 transferred ram: 3811345 kbytes
 remaining ram: 0 kbytes
 total ram: 134226368 kbytes
 total time: 199743 milliseconds
 
 
 In the case of SpecJBB I ran into issues during stage 3: the
 source host's qemu and the guest hung. I need to debug this
 more... (if you already have some hints, please let me know).
 
 
 capabilities: xbzrle: off 
 Migration status: active
 transferred ram: 127618578 kbytes
 remaining ram: 2386832 kbytes
 total ram: 134226368 kbytes
 total time: 526139 milliseconds
 (qemu) qemu_savevm_state_complete called 
 qemu_savevm_state_complete calling ram_save_complete
  
 ---  hung somewhere after this (I need to get more info).
 
 


 Appears to be some race condition, as in some cases it hangs
 and in some cases it succeeds.

Wild guess: try to use fewer vCPUs with the same RAM.  The way that we
stop CPUs is _hacky_, to put it mildly.  Will try to think about that part.

Thanks for the testing.  All my testing has been done with 8GB guests
and 2 vCPUs.  Will try with more vCPUs to see if it makes a difference.





 (qemu) info migrate
 capabilities: xbzrle: off 
 Migration status: completed
 transferred ram: 129937687 kbytes
 remaining ram: 0 kbytes
 total ram: 134226368 kbytes
 total time: 543228 milliseconds

Hmm, _that_ is stranger.  It means that the migration finished.  Could you
run qemu under gdb and send me the stack traces?

I don't know your gdb thread kung-fu, so here are the instructions just
in case:

gdb --args <the exact qemu command line you used>
C-c to break in when it hangs
(gdb) info threads
    shows all the running threads
(gdb) thread 1
    (or whatever other thread number)
(gdb) bt
    prints the backtrace of that thread

I am especially interested in the backtraces of the migration thread and
of the iothread.
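
As a shortcut (standard gdb usage, nothing specific to this series), a
single command dumps the backtraces of all threads at once, which is
usually what we want for a hang like this:

(gdb) set pagination off
(gdb) thread apply all bt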

Thanks, Juan.


 Need to review/debug...

 Vinod



 ---
 
 As with the non-migration-thread version, the SpecJBB workload
 completed before the migration attempted to move to stage 3 (i.e.
 it didn't converge while the workload was still active).
 
 BTW, with this version of the bits (i.e. while running SpecJBB,
 which is supposed to dirty quite a bit of memory) I noticed that
 there wasn't much change in the bandwidth usage of the dedicated 10Gb
 private network link (it was still ~1.5-3.0 Gb/sec).  I expected
 this to be a little better since we have a separate thread... not
 sure what else is in play here (NUMA locality of where the
 migration thread runs, or some other basic tuning in the
 implementation?).
 
 I have a high-level design question... (perhaps folks have already
 thought about it and categorized it as a potential future
 optimization?)

 Would it be possible to offload the iothread completely from all
 migration-related activity and have one thread (with the
 appropriate protection) take care of collecting the list of the
 dirty pages, and one or more threads dedicated to pushing multiple
 streams of data to saturate the allocated network bandwidth?  This
 may help with large + busy guests. Comments?
 There are perhaps other implications of doing all of this (like
 burning more host CPU cycles), but perhaps this could be configurable
 based on the user's needs... e.g. fewer but larger guests on a host
 with no oversubscription.
 
 Thanks
 Vinod


Re: [Qemu-devel] Fwd: [RFC 00/27] Migration thread (WIP)

2012-07-26 Thread Chegu Vinod




 Original Message 
Subject: [Qemu-devel] [RFC 00/27] Migration thread (WIP)
Date:    Tue, 24 Jul 2012 20:36:25 +0200
From:    Juan Quintela quint...@redhat.com
To:      qemu-devel@nongnu.org



Hi

This series is on top of the migration-next-v5 series just posted.

First of all, this is an RFC/work in progress.  A lot of people
asked for it, and I would like a review of the design.

Hello,

Thanks for sharing this early/WIP version for evaluation.

Still in the middle of the code review, but wanted to share a couple of
quick observations.
I tried to use it to migrate a 128G/10-vCPU guest (speed set to 10G and
downtime to 2s), once with no workload (i.e. an idle guest) and once with
SpecJBB running in the guest.


The idle guest case seemed to migrate fine...


capabilities: xbzrle: off
Migration status: completed
transferred ram: 3811345 kbytes
remaining ram: 0 kbytes
total ram: 134226368 kbytes
total time: 199743 milliseconds


In the case of SpecJBB I ran into issues during stage 3: the source
host's qemu and the guest hung. I need to debug this more... (if you
already have some hints, please let me know).



capabilities: xbzrle: off
Migration status: active
transferred ram: 127618578 kbytes
remaining ram: 2386832 kbytes
total ram: 134226368 kbytes
total time: 526139 milliseconds
(qemu) qemu_savevm_state_complete called
qemu_savevm_state_complete calling ram_save_complete

---  hung somewhere after this (I need to get more info).


---

As with the non-migration-thread version, the SpecJBB workload completed
before the migration attempted to move to stage 3 (i.e. it didn't converge
while the workload was still active).


BTW, with this version of the bits (i.e. while running SpecJBB, which is
supposed to dirty quite a bit of memory) I noticed that there wasn't
much change in the bandwidth usage of the dedicated 10Gb private network
link (it was still ~1.5-3.0 Gb/sec).  I expected this to be a little better
since we have a separate thread... not sure what else is in play here
(NUMA locality of where the migration thread runs, or some other
basic tuning in the implementation?).


I have a high-level design question... (perhaps folks have already thought
about it and categorized it as a potential future optimization?)


Would it be possible to offload the iothread completely from all
migration-related activity and have one thread (with the appropriate
protection) take care of collecting the list of the dirty pages, and
one or more threads dedicated to pushing multiple streams of data to
saturate the allocated network bandwidth (roughly as in the sketch
below)?  This may help with large + busy guests. Comments?  There are
perhaps other implications of doing all of this (like burning more host
CPU cycles), but perhaps this could be configurable based on the user's
needs... e.g. fewer but larger guests on a host with no oversubscription.
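
Purely to illustrate the idea above (this is not part of the posted
series, and every name in it is made up): N sender threads, each draining
a shared queue of page buffers onto its own socket, so that a single
stream does not become the bandwidth cap.

#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

struct chunk {                  /* one buffer of copied dirty pages */
    const char *buf;
    size_t      len;
};

/* Assumed helper: pops work from a locked queue that is filled by the
 * thread walking the dirty bitmap; returns NULL when migration is done. */
extern struct chunk *get_next_chunk(void);

struct sender {
    pthread_t thread;
    int       fd;               /* one TCP socket per sender thread */
};

static void *sender_main(void *opaque)
{
    struct sender *s = opaque;
    struct chunk *c;

    while ((c = get_next_chunk()) != NULL) {
        const char *p = c->buf;
        size_t left = c->len;

        while (left > 0) {
            ssize_t n = write(s->fd, p, left);  /* plain blocking write */
            if (n < 0) {
                return NULL;                    /* error handling elided */
            }
            p += n;
            left -= (size_t)n;
        }
    }
    return NULL;
}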


Thanks
Vinod



It does:
- get a new bitmap for migration, and that bitmap uses 1 bit per page
  (a rough sketch of such a bitmap follows below)
- it unfolds migration_buffered_file.  Only one user existed.
- it simplifies buffered_file a lot.

- About the migration thread, special attention was given to trying to
   get the series reviewable (reviewers will tell me if I got it).
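
A minimal sketch of what "1 bit per page" means in practice (the names
and the flat-array layout here are illustrative only, not the code in
the series; it assumes 4 KiB target pages):

#include <stdint.h>
#include <stdlib.h>

#define TARGET_PAGE_BITS 12
#define BITS_PER_LONG    (sizeof(unsigned long) * 8)

static unsigned long *migration_bitmap;

/* One bit per guest page: 128G of RAM needs 128G/4K/8 = 4 MB of bitmap. */
static int migration_bitmap_alloc(uint64_t ram_bytes)
{
    uint64_t pages = ram_bytes >> TARGET_PAGE_BITS;
    size_t   longs = (pages + BITS_PER_LONG - 1) / BITS_PER_LONG;

    migration_bitmap = calloc(longs, sizeof(unsigned long));
    return migration_bitmap ? 0 : -1;
}

static void migration_bitmap_set_dirty(uint64_t addr)
{
    uint64_t page = addr >> TARGET_PAGE_BITS;

    migration_bitmap[page / BITS_PER_LONG] |= 1UL << (page % BITS_PER_LONG);
}

static int migration_bitmap_test_and_clear(uint64_t addr)
{
    uint64_t page = addr >> TARGET_PAGE_BITS;
    unsigned long mask = 1UL << (page % BITS_PER_LONG);
    int dirty = (migration_bitmap[page / BITS_PER_LONG] & mask) != 0;

    migration_bitmap[page / BITS_PER_LONG] &= ~mask;
    return dirty;
}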

Basic design:
- we create a new thread instead of a timer function
- we move all the migration work to that thread (but run everything
   except the waits with the iothread lock held)
- we move all the writing outside the iothread lock, i.e.
   we walk the state with the iothread lock held and copy everything to one buffer,
   then we write that buffer to the sockets outside the iothread lock
- once here, we move to writing synchronously to the sockets
- this allows us to simplify quite a lot

And basically, that is it.  Notice that we still do the iterative page
walking with the iothread lock held.  Light testing shows that we get
similar speed and latencies as without the thread (notice that
almost no optimizations have been done here yet).
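
In rough pseudo-C, the loop described above looks like this
(qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread() are the real
QEMU calls; the other helper names and the MigrationState fields are
invented for illustration and differ from the actual patches):

static void *migration_thread(void *opaque)
{
    MigrationState *s = opaque;

    while (!migration_iteration_done(s)) {
        /* Walk guest RAM with the iothread lock held and copy the
         * dirty pages into a private buffer... */
        qemu_mutex_lock_iothread();
        size_t len = copy_dirty_pages(s, s->buffer, s->buffer_size);
        qemu_mutex_unlock_iothread();

        /* ...then push that buffer to the socket outside the lock,
         * with a plain synchronous (blocking) write. */
        send_all(s->fd, s->buffer, len);
    }
    return NULL;
}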

Apart from the review:
- Are there any locking issues that I have missed (I guess so)?
- stop all CPUs correctly.  vm_stop should be called from the iothread;
   I use the trick of a bottom half to get that working correctly,
   but this _implementation_ is ugly as hell (see the sketch below).
   Is there an easier way of doing it?
- Do I really have to export last_ram_offset()?  Is there no other way
   of knowing the amount of RAM?
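
For readers unfamiliar with the trick, here is a sketch of the
bottom-half dance (qemu_bh_new/qemu_bh_schedule and
vm_stop(RUN_STATE_FINISH_MIGRATE) are real QEMU APIs; the semaphore
hand-off and the field/helper names are just one way to do the wait,
not necessarily what the series does):

/* Runs in the iothread, where vm_stop() is allowed to be called. */
static void migrate_vm_stop_bh(void *opaque)
{
    MigrationState *s = opaque;

    vm_stop(RUN_STATE_FINISH_MIGRATE);
    qemu_sem_post(&s->vm_stopped);      /* wake the migration thread */
}

/* Called from the migration thread at the start of stage 3. */
static void migrate_stop_guest(MigrationState *s)
{
    QEMUBH *bh = qemu_bh_new(migrate_vm_stop_bh, s);

    qemu_bh_schedule(bh);
    qemu_sem_wait(&s->vm_stopped);      /* block until the guest is stopped */
    qemu_bh_delete(bh);
}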

Known issues:

- for some reason, when it has to start a 2nd round of bitmap
   handling, it decides to dirty all pages.  Still haven't found out why
   this happens.

If you can test it and tell me where it breaks, that would also help.

Work is based on Umesh's thread work, and on work that Paolo Bonzini
did on top of that.  The migration thread itself was done from scratch
because I was unable to debug why the earlier version was failing, but
it owes a lot to the previous design.

Thanks in advance, Juan.

The following changes since commit a21143486b9c6d7a50b7b62877c02b3c686943cb:

   Merge remote-tracking branch 

Re: [Qemu-devel] Fwd: [RFC 00/27] Migration thread (WIP)

2012-07-26 Thread Chegu Vinod

On 7/26/2012 11:41 AM, Chegu Vinod wrote:




 Original Message 
Subject: [Qemu-devel] [RFC 00/27] Migration thread (WIP)
Date:    Tue, 24 Jul 2012 20:36:25 +0200
From:    Juan Quintela quint...@redhat.com
To:      qemu-devel@nongnu.org



Hi

This series is on top of the migration-next-v5 series just posted.

First of all, this is an RFC/work in progress.  A lot of people
asked for it, and I would like a review of the design.

Hello,

Thanks for sharing this early/WIP version for evaluation.

Still in the middle of the code review, but wanted to share a couple of
quick observations.
I tried to use it to migrate a 128G/10-vCPU guest (speed set to 10G and
downtime to 2s), once with no workload (i.e. an idle guest) and once with
SpecJBB running in the guest.


The idle guest case seemed to migrate fine...


capabilities: xbzrle: off
Migration status: completed
transferred ram: 3811345 kbytes
remaining ram: 0 kbytes
total ram: 134226368 kbytes
total time: 199743 milliseconds


In the case of SpecJBB I ran into issues during stage 3: the
source host's qemu and the guest hung. I need to debug this more...
(if you already have some hints, please let me know).



capabilities: xbzrle: off
Migration status: active
transferred ram: 127618578 kbytes
remaining ram: 2386832 kbytes
total ram: 134226368 kbytes
total time: 526139 milliseconds
(qemu) qemu_savevm_state_complete called
qemu_savevm_state_complete calling ram_save_complete

---  hung somewhere after this (I need to get more info).




Appears to be some race condition, as in some cases it hangs and
in some cases it succeeds.


(qemu) info migrate
capabilities: xbzrle: off
Migration status: completed
transferred ram: 129937687 kbytes
remaining ram: 0 kbytes
total ram: 134226368 kbytes
total time: 543228 milliseconds

Need to review/debug...

Vinod




---

As with the non-migration-thread version, the SpecJBB workload
completed before the migration attempted to move to stage 3 (i.e.
it didn't converge while the workload was still active).


BTW, with this version of the bits (i.e. while running SpecJBB, which
is supposed to dirty quite a bit of memory) I noticed that there
wasn't much change in the bandwidth usage of the dedicated 10Gb private
network link (it was still ~1.5-3.0 Gb/sec).  I expected this to be a
little better since we have a separate thread... not sure what else
is in play here (NUMA locality of where the migration thread runs, or
some other basic tuning in the implementation?).


I have a high-level design question... (perhaps folks have already
thought about it and categorized it as a potential future optimization?)


Would it be possible to offload the iothread completely from all
migration-related activity and have one thread (with the appropriate
protection) take care of collecting the list of the dirty pages, and
one or more threads dedicated to pushing multiple streams of data to
saturate the allocated network bandwidth?  This may help with large +
busy guests. Comments?  There are perhaps other implications of doing
all of this (like burning more host CPU cycles), but perhaps this could
be configurable based on the user's needs... e.g. fewer but larger
guests on a host with no oversubscription.


Thanks
Vinod



It does:
- get a new bitmap for migration, and that bitmap uses 1 bit per page
- it unfolds migration_buffered_file.  Only one user existed.
- it simplifies buffered_file a lot.

- About the migration thread, special attention was given to trying to
   get the series reviewable (reviewers will tell me if I got it).

Basic design:
- we create a new thread instead of a timer function
- we move all the migration work to that thread (but run everything
   except the waits with the iothread lock held)
- we move all the writing outside the iothread lock, i.e.
   we walk the state with the iothread lock held and copy everything to one buffer,
   then we write that buffer to the sockets outside the iothread lock
- once here, we move to writing synchronously to the sockets
- this allows us to simplify quite a lot

And basically, that is it.  Notice that we still do the iterative page
walking with the iothread lock held.  Light testing shows that we get
similar speed and latencies as without the thread (notice that
almost no optimizations have been done here yet).

Apart from the review:
- Are there any locking issues that I have missed (I guess so)?
- stop all CPUs correctly.  vm_stop should be called from the iothread;
   I use the trick of a bottom half to get that working correctly,
   but this _implementation_ is ugly as hell.  Is there an easier way
   of doing it?
- Do I really have to export last_ram_offset()?  Is there no other way
   of knowing the amount of RAM?

Known issues:

- for some reason, when it has to start a 2nd round of bitmap
   handling, it decides to dirty all pages.  Still haven't found out why
   this happens.

If you can test it and tell me where it breaks, that would also help.