Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-21 Thread liu ping fan
On Thu, Sep 20, 2012 at 5:15 PM, Avi Kivity a...@redhat.com wrote:
 On 09/20/2012 10:51 AM, liu ping fan wrote:
 On Wed, Sep 19, 2012 at 5:05 PM, Avi Kivity a...@redhat.com wrote:
 On 09/19/2012 11:36 AM, liu ping fan wrote:

 It basically means you can't hold contents of device state in local
 variables.  You need to read everything again from the device.  That
 includes things like DMA enable bits.

 I think that "read everything again from the device" cannot work.
 Suppose the following scenario: the device's state is changed through
 a series of internal registers (say partA+partB determine
 partC+partD). After partA is changed, the device's lock is dropped. At
 that point, another access to this device will work on partA+partB
 to determine C+D, but partB is not correctly updated yet, so C+D
 may be decided by a broken context and be wrong.

 That's the guest's problem.  Note it could have happened even before the
 change, since the writes to A/B/C/D are unordered wrt the DMA.

 Yes, agreed, it is the guest's problem. So it means that if ready_of(A+B)
 is not signaled to the guest, the guest should not launch operations on
 (C+D). Right? But here comes the question: if ready is not signaled to
 the guest, how can the guest launch operations on (A+B) again?

 It may be evil.

 I.e., although the local lock is dropped, (A+B) is still intact when we
 re-acquire the local lock. So we need not read everything again from the
 device. Wrong?

 The device needs to perform according to its specifications.  If the
 specifications allow for this kind of access, we must ensure it works.
 If they don't, we must ensure something sane happens, instead of a qemu
 crash or exploit.  This means that anything dangerous like pointers must
 be revalidated.  To be on the safe side, I recommend revalidating (or
 reloading) everything, but it may not be necessary in all cases.

Yeah, that catches the two points exactly.

 What my suggestion is:
 lock();
 set_and_test(dev->busy);
 if busy
   unlock and return;
 change device registers;
 do other things, including calling c_p_m_rw(); // here the lock is
 dropped, but set_and_test() still works
 clear(dev->busy);
 unlock();

 So changing device registers is protected, and unbreakable.
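
A minimal C sketch of this check-and-set busy-flag pattern (the names
MyDeviceState and my_mmio_write are illustrative, not from the patch; it
assumes QEMU's QemuMutex and cpu_physical_memory_rw()):

    typedef struct MyDeviceState {
        QemuMutex lock;
        bool busy;          /* set while a register update + DMA is in flight */
        uint64_t dma_addr;  /* example register driving the DMA */
    } MyDeviceState;

    static void my_mmio_write(MyDeviceState *s, uint64_t addr, uint64_t val)
    {
        uint8_t buf[64];    /* addr/val handling elided in this sketch */

        qemu_mutex_lock(&s->lock);
        if (s->busy) {                  /* set_and_test: update in flight */
            qemu_mutex_unlock(&s->lock);
            return;
        }
        s->busy = true;
        /* change device registers here: protected and unbreakable */
        qemu_mutex_unlock(&s->lock);    /* lock dropped around the DMA... */
        cpu_physical_memory_rw(s->dma_addr, buf, sizeof(buf), 0);
        qemu_mutex_lock(&s->lock);      /* ...but s->busy kept others out */
        s->busy = false;
        qemu_mutex_unlock(&s->lock);
    }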

 But the changes may be legitimate.  Maybe you're writing to a completely
 unrelated register, from a different vcpu, now that write is lost.

 But I think it will mean finer-grained locks for each group of unrelated
 registers, and accordingly, the busy flag should be a bitmap with one
 bit per group.

 It's possible.  Let's defer the discussion until a concrete case is
 before us.  It may be that different devices will want different
 solutions (though there is value in applying one solution everywhere).

Okay, I appreciate the detailed explanation.

Regards,
pingfan

 --
 error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-20 Thread liu ping fan
On Wed, Sep 19, 2012 at 5:05 PM, Avi Kivity a...@redhat.com wrote:
 On 09/19/2012 11:36 AM, liu ping fan wrote:

 It basically means you can't hold contents of device state in local
 variables.  You need to read everything again from the device.  That
 includes things like DMA enable bits.

 I think that "read everything again from the device" cannot work.
 Suppose the following scenario: the device's state is changed through
 a series of internal registers (say partA+partB determine
 partC+partD). After partA is changed, the device's lock is dropped. At
 that point, another access to this device will work on partA+partB
 to determine C+D, but partB is not correctly updated yet, so C+D
 may be decided by a broken context and be wrong.

 That's the guest's problem.  Note it could have happened even before the
 change, since the writes to A/B/C/D are unordered wrt the DMA.

Yes, agreed, it is the guest's problem. So it means that if ready_of(A+B)
is not signaled to the guest, the guest should not launch operations on
(C+D). Right? But here comes the question: if ready is not signaled to
the guest, how can the guest launch operations on (A+B) again?
I.e., although the local lock is dropped, (A+B) is still intact when we
re-acquire the local lock. So we need not read everything again from the
device. Wrong?


 (btw: do we respect PCI_COMMAND_MASTER? it seems that we do not.  This
 is a trivial example of an iommu, we should get that going).

 Or for a converted device, we can just tag it with a
 busy flag, that is, check-and-set the busy flag at the entry of the
 device's mmio-dispatch. So when we re-acquire the device's lock, the
 device's state is intact.

 The state can be changed by a parallel access to another register, which
 is valid.

 Do you mean that the device can be accessed in parallel? But how? We
 use the device's lock.

 Some DMA is asynchronous already, like networking and IDE DMA.  So we
 drop the big lock while doing it.  With the change to drop device locks
 around c_p_m_rw(), we have that problem everywhere.

 What my suggestion is:
 lock();
 set_and_test(dev->busy);
 if busy
   unlock and return;
 change device registers;
 do other things, including calling c_p_m_rw(); // here the lock is
 dropped, but set_and_test() still works
 clear(dev->busy);
 unlock();

 So changing device registers is protected, and unbreakable.

 But the changes may be legitimate.  Maybe you're writing to a completely
 unrelated register, from a different vcpu, now that write is lost.

But I think it will mean finer-grained locks for each group of unrelated
registers, and accordingly, the busy flag should be a bitmap with one
bit per group.

Thanks and regards,
pingfan

 --
 error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-20 Thread Avi Kivity
On 09/20/2012 10:51 AM, liu ping fan wrote:
 On Wed, Sep 19, 2012 at 5:05 PM, Avi Kivity a...@redhat.com wrote:
 On 09/19/2012 11:36 AM, liu ping fan wrote:

 It basically means you can't hold contents of device state in local
 variables.  You need to read everything again from the device.  That
 includes things like DMA enable bits.

 I think that "read everything again from the device" cannot work.
 Suppose the following scenario: the device's state is changed through
 a series of internal registers (say partA+partB determine
 partC+partD). After partA is changed, the device's lock is dropped. At
 that point, another access to this device will work on partA+partB
 to determine C+D, but partB is not correctly updated yet, so C+D
 may be decided by a broken context and be wrong.

 That's the guest's problem.  Note it could have happened even before the
 change, since the writes to A/B/C/D are unordered wrt the DMA.

 Yes, agreed, it is the guest's problem. So it means that if ready_of(A+B)
 is not signaled to the guest, the guest should not launch operations on
 (C+D). Right? But here comes the question: if ready is not signaled to
 the guest, how can the guest launch operations on (A+B) again?

It may be evil.

 I.e., although the local lock is dropped, (A+B) is still intact when we
 re-acquire the local lock. So we need not read everything again from the
 device. Wrong?

The device needs to perform according to its specifications.  If the
specifications allow for this kind of access, we must ensure it works.
If they don't, we must ensure something sane happens, instead of a qemu
crash or exploit.  This means that anything dangerous like pointers must
be revalidated.  To be on the safe side, I recommend revalidating (or
reloading) everything, but it may not be necessary in all cases.

 What my suggestion is:
 lock();
 set_and_test(dev->busy);
 if busy
   unlock and return;
 change device registers;
 do other things, including calling c_p_m_rw(); // here the lock is
 dropped, but set_and_test() still works
 clear(dev->busy);
 unlock();

 So changing device registers is protected, and unbreakable.

 But the changes may be legitimate.  Maybe you're writing to a completely
 unrelated register, from a different vcpu, now that write is lost.

 But I think it will mean finer-grained locks for each group of unrelated
 registers, and accordingly, the busy flag should be a bitmap with one
 bit per group.

It's possible.  Let's defer the discussion until a concrete case is
before us.  It may be that different devices will want different
solutions (though there is value in applying one solution everywhere).


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-19 Thread Avi Kivity
On 09/19/2012 07:40 AM, Peter Crosthwaite wrote:
 On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
 edgar.igles...@gmail.com wrote:
 On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
 Ping for PMM,

 This is the root cause of your block on the SDHCI series - this is a
 discussion on the resolution to bogus infinite looping DMA. For current
 participants in this discussion, here's our thread on the same topic
 over in SD land:

 http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html

 With the findings here and the acknowledgement that this is a general
 problem, we either need to declare this issue out of scope for SDHCI, or
 work with these guys (in the immediate future) on the DMA infinite
 loop problem flagged here. I don't mind if SDHCI+ADMA is the guinea pig
 here, but can we get a decisive plan for going forward with this issue
 if it is going to continue to block SDHCI.

 Thanks to Igor for identifying the overlap between these discussions.

 Hi,

 A couple of dumb questions.

 What is the reason for the blocker? That a possible guest DoS is worse
 than no functionality at all?

 Can't you put the DMA/IO processing into the background?
 
 I don't know a way to do asynchronous DMA unless I am missing something
 here.

You could schedule a bottom half and do the accesses from there.  Solves
nothing though.

 So what happens is we have a device that walks an SG list
 synchronously while holding all the locks and stuff being discussed
 here. If that SG list loops infinitely, we crash.

Did you mean loop, or recursion into the memory region that initiates DMA?

As this problem already exists, I don't think new device models need to
block on it.

Maybe a simple fix is:

void my_device_write(...)
{
    if (s->dma_in_progress) {
        /* re-entered via DMA into our own MMIO region: ignore */
        return;
    }
    ...
    s->dma_in_progress = true;
    cpu_physical_memory_rw(...);   /* may recurse back into this device */
    s->dma_in_progress = false;
}

and we could try to figure out how to move this into common code.
-- 

error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-19 Thread Jan Kiszka
On 2012-09-19 06:40, Peter Crosthwaite wrote:
 On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
 edgar.igles...@gmail.com wrote:
 On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
 Ping for PMM,

 This is the root cause of your block on the SDHCI series - this is a
 discussion on the resolution to bogus infinite looping DMA. For current
 participants in this discussion, here's our thread on the same topic
 over in SD land:

 http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html

 With the findings here and the acknowledgement that this is a general
 problem, we either need to declare this issue out of scope for SDHCI, or
 work with these guys (in the immediate future) on the DMA infinite
 loop problem flagged here. I don't mind if SDHCI+ADMA is the guinea pig
 here, but can we get a decisive plan for going forward with this issue
 if it is going to continue to block SDHCI.

 Thanks to Igor for identifying the overlap between these discussions.

 Hi,

 A couple of dumb questions.

 What is the reason for the blocker? That a possible guest DoS is worse
 than no functionality at all?

 Can't you put the DMA/IO processing into the background?
 
 I don't know a way to do asynchronous DMA unless I am missing something
 here. So what happens is we have a device that walks an SG list
 synchronously while holding all the locks and stuff being discussed
 here. If that SG list loops infinitely, we crash.

We could switch to async DMA, but most DMA-capable devices aren't
prepared for rescheduling points in the middle of their code. This will
require quite a bit of code review.

 

 what's the diff between setup of bad descriptors and  writing a
 while (1) loop in the guest CPU?

 
 While(1) in the guest doesn't freeze all of QEMU (the monitor still
 works and such), whereas the DMA ops going in circles do, as locks are
 taken by an infinite loop.

That's the point. If we loop, we need at least a way to stop it by
issuing a device or system reset. But I tend to think that detecting
loops and failing/ignoring requests is easier to implement.
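
A minimal sketch of such loop/recursion detection, assuming a hypothetical
per-thread nesting counter wrapped around cpu_physical_memory_rw() (the
helper name and the depth bound are illustrative only):

    #define MAX_DMA_DEPTH 8

    static __thread int dma_depth;     /* DMA nesting on this thread */

    static bool dma_rw_checked(uint64_t addr, uint8_t *buf, int len,
                               int is_write)
    {
        if (dma_depth >= MAX_DMA_DEPTH) {
            /* Probably a descriptor loop or MMIO->DMA recursion:
               fail/ignore the request rather than hang QEMU. */
            return false;
        }
        dma_depth++;
        cpu_physical_memory_rw(addr, buf, len, is_write);
        dma_depth--;
        return true;
    }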

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-19 Thread Avi Kivity
On 09/17/2012 05:24 AM, liu ping fan wrote:
 On Thu, Sep 13, 2012 at 4:19 PM, Avi Kivity a...@redhat.com wrote:
 On 09/13/2012 09:55 AM, liu ping fan wrote:
 On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity a...@redhat.com wrote:
 On 09/11/2012 03:24 PM, Avi Kivity wrote:
 On 09/11/2012 12:57 PM, Jan Kiszka wrote:
 On 2012-09-11 11:44, liu ping fan wrote:
 On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity a...@redhat.com wrote:
 On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 The func call chain can suffer from recursively holding
 qemu_mutex_lock_iothread. We introduce a lockmap to record the
 lock depth.

 What is the root cause?  io handlers initiating I/O?

 cpu_physical_memory_rw() can be called nested, and when called, it can
 be protected by no lock, a device lock, or the big lock.
 I think that without the big lock, the io-dispatcher faces the same
 issue. As to the main loop, I have not considered it carefully, but at
 least dma-helper will call cpu_physical_memory_rw() with the big lock held.

 That is our core problem: inconsistent invocation of existing services
 /wrt locking. For portio, I was lucky that there is no nesting and I was
 able to drop the big lock around all (x86) call sites. But MMIO is way
 more tricky due to DMA nesting.

 Maybe we need to switch to a continuation style.  Instead of expecting
 cpu_physical_memory_rw() to complete synchronously, it becomes an
 asynchronous call and you provide it with a completion.  That means
 devices which use it are forced to drop the lock in between.  Block and
 network clients will be easy to convert since they already use APIs that
 drop the lock (except for accessing the descriptors).

 We could try to introduce a different version of cpu_physical_memory_rw,
 cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
 request can trigger the very same access in a nested fashion, and we
 will have to detect this to avoid locking up QEMU (locking up the guest
 might be OK).

 An async version of c_p_m_rw() will just cause a completion to bounce
 around, consuming cpu but not deadlocking anything.  If we can keep a
 count of the bounces, we might be able to stall it indefinitely or at
 least ratelimit it.


 Another option is to require all users of c_p_m_rw() and related to use
 a coroutine or thread.  That makes the programming easier (but still
 requires a revalidation after the dropped lock).

 For the nested cpu_physical_memory_rw(), we change its internals but
 keep the sync API as it is.  (Wrap the current cpu_physical_memory_rw()
 into cpu_physical_memory_rw_internal().)


 LOCK()  // can be device or big lock or both, depends on the caller.
 ..
 cpu_physical_memory_rw()
 {
UNLOCK() //unlock all the locks
queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
 // cpu_physical_memory_rw_internal can take lock(device,biglock) again
wait_for_completion(completion)
LOCK()
 }
 ..
 UNLOCK()


 This is dangerous.  The caller expects to hold the lock across the call,
 and will not re-validate its state.

 Although cpu_physical_memory_rw_internal() would be free to take locks,
 there is a precondition -- we still need to trace the lock stack taken
 by cpu_physical_memory_rw(), so that it can return to the caller
 correctly. Is that OK?

 I'm convinced that we need a recursive lock if we don't convert
 everything at once.

 So how about:

 - make bql a recursive lock (use pthreads, don't invent our own)
 - change c_p_m_rw() calling convention to caller may hold the BQL, but
 no device lock

 this allows devices to DMA each other.  Unconverted device models will
 work as they used to.  Converted device models will need to drop the
 device lock, re-acquire it, and revalidate any state.  That will cause
 
 I think we are cornered by devices DMA-ing to each other, and that
 raises unavoidable nested locking. Also, to avoid deadlock caused by the
 devices' lock ordering, we should drop the current device lock to
 acquire another one. The only little divergence is about the
 revalidation: do we need to roll back?

It basically means you can't hold contents of device state in local
variables.  You need to read everything again from the device.  That
includes things like DMA enable bits.

(btw: do we respect PCI_COMMAND_MASTER? it seems that we do not.  This
is a trivial example of an iommu, we should get that going).
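
A sketch of such a bus-master check, assuming a PCIDevice *d (pci_get_word,
PCI_COMMAND and PCI_COMMAND_MASTER are the standard PCI accessor and
register constants; where each device model calls it is up to the model):

    static bool bus_master_enabled(PCIDevice *d)
    {
        /* bit 2 of the PCI command register gates DMA by the device */
        return pci_get_word(d->config + PCI_COMMAND) & PCI_COMMAND_MASTER;
    }

A converted device would test this before each c_p_m_rw() and drop the
request when the bit is clear, as real hardware would.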

 Or for a converted device, we can just tag it with a
 busy flag, that is, check-and-set the busy flag at the entry of the
 device's mmio-dispatch. So when we re-acquire the device's lock, the
 device's state is intact.

The state can be changed by a parallel access to another register, which
is valid.

 
 Does anybody have another suggestion?


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-19 Thread liu ping fan
On Wed, Sep 19, 2012 at 4:01 PM, Avi Kivity a...@redhat.com wrote:
 On 09/17/2012 05:24 AM, liu ping fan wrote:
 On Thu, Sep 13, 2012 at 4:19 PM, Avi Kivity a...@redhat.com wrote:
 On 09/13/2012 09:55 AM, liu ping fan wrote:
 On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity a...@redhat.com wrote:
 On 09/11/2012 03:24 PM, Avi Kivity wrote:
 On 09/11/2012 12:57 PM, Jan Kiszka wrote:
 On 2012-09-11 11:44, liu ping fan wrote:
 On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity a...@redhat.com wrote:
 On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 The func call chain can suffer from recursively holding
 qemu_mutex_lock_iothread. We introduce a lockmap to record the
 lock depth.

 What is the root cause?  io handlers initiating I/O?

 cpu_physical_memory_rw() can be called nested, and when called, it can
 be protected by no lock, a device lock, or the big lock.
 I think that without the big lock, the io-dispatcher faces the same
 issue. As to the main loop, I have not considered it carefully, but at
 least dma-helper will call cpu_physical_memory_rw() with the big lock held.

 That is our core problem: inconsistent invocation of existing services
 /wrt locking. For portio, I was lucky that there is no nesting and I was
 able to drop the big lock around all (x86) call sites. But MMIO is way
 more tricky due to DMA nesting.

 Maybe we need to switch to a continuation style.  Instead of expecting
 cpu_physical_memory_rw() to complete synchronously, it becomes an
 asynchronous call and you provide it with a completion.  That means
 devices which use it are forced to drop the lock in between.  Block and
 network clients will be easy to convert since they already use APIs that
 drop the lock (except for accessing the descriptors).

 We could try to introduce a different version of cpu_physical_memory_rw,
 cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
 request can trigger the very same access in a nested fashion, and we
 will have to detect this to avoid locking up QEMU (locking up the guest
 might be OK).

 An async version of c_p_m_rw() will just cause a completion to bounce
 around, consuming cpu but not deadlocking anything.  If we can keep a
 count of the bounces, we might be able to stall it indefinitely or at
 least ratelimit it.


 Another option is to require all users of c_p_m_rw() and related to use
 a coroutine or thread.  That makes the programming easier (but still
 requires a revalidation after the dropped lock).

 For the nested cpu_physical_memory_rw(), we change its internals but
 keep the sync API as it is.  (Wrap the current cpu_physical_memory_rw()
 into cpu_physical_memory_rw_internal().)


 LOCK()  // can be device or big lock or both, depends on the caller.
 ..
 cpu_physical_memory_rw()
 {
UNLOCK() //unlock all the locks
queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
 // cpu_physical_memory_rw_internal can take lock(device,biglock) again
wait_for_completion(completion)
LOCK()
 }
 ..
 UNLOCK()


 This is dangerous.  The caller expects to hold the lock across the call,
 and will not re-validate its state.

 Although cpu_physical_memory_rw_internal() would be free to take locks,
 there is a precondition -- we still need to trace the lock stack taken
 by cpu_physical_memory_rw(), so that it can return to the caller
 correctly. Is that OK?

 I'm convinced that we need a recursive lock if we don't convert
 everything at once.

 So how about:

 - make bql a recursive lock (use pthreads, don't invent our own)
 - change c_p_m_rw() calling convention to caller may hold the BQL, but
 no device lock

 this allows devices to DMA each other.  Unconverted device models will
 work as they used to.  Converted device models will need to drop the
 device lock, re-acquire it, and revalidate any state.  That will cause

 I think we are cornered by devices DMA-ing to each other, and that
 raises unavoidable nested locking. Also, to avoid deadlock caused by the
 devices' lock ordering, we should drop the current device lock to
 acquire another one. The only little divergence is about the
 revalidation: do we need to roll back?

 It basically means you can't hold contents of device state in local
 variables.  You need to read everything again from the device.  That
 includes things like DMA enable bits.

I think that "read everything again from the device" cannot work.
Suppose the following scenario: the device's state is changed through
a series of internal registers (say partA+partB determine
partC+partD). After partA is changed, the device's lock is dropped. At
that point, another access to this device will work on partA+partB
to determine C+D, but partB is not correctly updated yet, so C+D
may be decided by a broken context and be wrong.

 (btw: do we respect PCI_COMMAND_MASTER? it seems that we do not.  This
 is a trivial example of an iommu, we should get that going).

 Or for a converted device, we can just tag it with a
 busy flag, that is, check-and-set the busy flag at 

Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-19 Thread Avi Kivity
On 09/19/2012 11:36 AM, liu ping fan wrote:

 It basically means you can't hold contents of device state in local
 variables.  You need to read everything again from the device.  That
 includes things like DMA enable bits.

 I think that "read everything again from the device" cannot work.
 Suppose the following scenario: the device's state is changed through
 a series of internal registers (say partA+partB determine
 partC+partD). After partA is changed, the device's lock is dropped. At
 that point, another access to this device will work on partA+partB
 to determine C+D, but partB is not correctly updated yet, so C+D
 may be decided by a broken context and be wrong.

That's the guest's problem.  Note it could have happened even before the
change, since the writes to A/B/C/D are unordered wrt the DMA.

 
 (btw: do we respect PCI_COMMAND_MASTER? it seems that we do not.  This
 is a trivial example of an iommu, we should get that going).

 Or for a converted device, we can just tag it with a
 busy flag, that is, check-and-set the busy flag at the entry of the
 device's mmio-dispatch. So when we re-acquire the device's lock, the
 device's state is intact.

 The state can be changed by a parallel access to another register, which
 is valid.

 Do you mean that the device can be accessed in parallel? But how? We
 use the device's lock.

Some DMA is asynchronous already, like networking and IDE DMA.  So we
drop the big lock while doing it.  With the change to drop device locks
around c_p_m_rw(), we have that problem everywhere.

 What my suggestion is:
 lock();
 set_and_test(dev->busy);
 if busy
   unlock and return;
 change device registers;
 do other things, including calling c_p_m_rw(); // here the lock is
 dropped, but set_and_test() still works
 clear(dev->busy);
 unlock();
 
 So changing device registers is protected, and unbreakable.

But the changes may be legitimate.  Maybe you're writing to a completely
unrelated register, from a different vcpu, now that write is lost.


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-19 Thread Edgar E. Iglesias
On Wed, Sep 19, 2012 at 10:55:30AM +0300, Avi Kivity wrote:
 On 09/19/2012 07:40 AM, Peter Crosthwaite wrote:
  On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
  edgar.igles...@gmail.com wrote:
  On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
  Ping for PMM,
 
  This is the root cause of your block on the SDHCI series - this is a
  discussion on the resolution to bogus infinite looping DMA. For current
  participants in this discussion, here's our thread on the same topic
  over in SD land:

  http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html

  With the findings here and the acknowledgement that this is a general
  problem, we either need to declare this issue out of scope for SDHCI, or
  work with these guys (in the immediate future) on the DMA infinite
  loop problem flagged here. I don't mind if SDHCI+ADMA is the guinea pig
  here, but can we get a decisive plan for going forward with this issue
  if it is going to continue to block SDHCI.
 
  Thanks to Igor for identifying the overlap between these discussions.
 
  Hi,
 
  A couple of dumb questions.
 
  What is the reason for the blocker? That a possible guest DoS is worse
  than no functionality at all?
 
  Can't you put the DMA/IO processing into the background?
  
  I don't know a way to do asynchronous DMA unless I am missing something
  here.
 
 You could schedule a bottom half and do the accesses from there.  Solves
 nothing though.
 
  So what happens is we have a device that walks an SG list
  synchronously while holding all the locks and stuff being discussed
  here. If that SG list loops infinitely, we crash.
 
 Did you mean loop, or recursion into the memory region that initiates DMA?

I think we were discussing the loop issue (I hadn't thought of recursion
issues before) :)

Jan's point of resetability was interesting.

Processing a finite number of DMA descriptors and scheduling bhs
to continuously get back and continue processing should be OK, no?

That would avoid the lockups and allow the device to be reset at
any time. Or am I missing something?

IIRC the etrax dma does something like that but only for receive
dma rings...
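
A minimal sketch of that bh-based approach, assuming QEMU's bottom-half
API (qemu_bh_new/qemu_bh_schedule); MyDMAState, s->running and
process_one_descriptor are hypothetical names:

    #define DESC_BATCH 32              /* bound the work per bh run */

    static void my_dma_bh(void *opaque)
    {
        MyDMAState *s = opaque;
        int i;

        for (i = 0; i < DESC_BATCH && s->running; i++) {
            if (!process_one_descriptor(s)) {
                s->running = false;    /* end of the descriptor ring */
                return;
            }
        }
        /* More descriptors pending: reschedule ourselves and return to
           the main loop, which stays responsive; a device reset can
           stop the chain by clearing s->running. */
        qemu_bh_schedule(s->bh);
    }

    /* at device init: s->bh = qemu_bh_new(my_dma_bh, s); */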

Cheers,
Edgar



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-19 Thread Avi Kivity
On 09/19/2012 02:46 PM, Edgar E. Iglesias wrote:
 On Wed, Sep 19, 2012 at 10:55:30AM +0300, Avi Kivity wrote:
 On 09/19/2012 07:40 AM, Peter Crosthwaite wrote:
  On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
  edgar.igles...@gmail.com wrote:
  On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
  Ping for PMM,
 
  This is the root cause of your block on the SDHCI series - this is a
  discussion on the resolution to bogus infinite looping DMA. For current
  participants in this discussion, here's our thread on the same topic
  over in SD land:

  http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html

  With the findings here and the acknowledgement that this is a general
  problem, we either need to declare this issue out of scope for SDHCI, or
  work with these guys (in the immediate future) on the DMA infinite
  loop problem flagged here. I don't mind if SDHCI+ADMA is the guinea pig
  here, but can we get a decisive plan for going forward with this issue
  if it is going to continue to block SDHCI.
 
  Thanks to Igor for identifying the overlap between these discussions.
 
  Hi,
 
  A couple of dumb questions.
 
  What is the reason for the blocker? That a possible guest DoS is worse
  than no functionality at all?
 
  Can't you put the DMA/IO processing into the background?
  
  I don't know a way to do asynchronous DMA unless I am missing something
  here.
 
 You could schedule a bottom half and do the accesses from there.  Solves
 nothing though.
 
  So what happens is we have a device that walks an SG list
  synchronously while holding all the locks and stuff being discussed
  here. If that SG list loops infinitely, we crash.
 
 Did you mean loop, or recursion into the memory region that initiates DMA?
 
 I think we were discussing the loop issue (I hadn't thought of recursion
 issues before) :)
 
 Jan's point of resetability was interesting.
 
 Processing a finite number of DMA descriptors and scheduling bhs
 to continuously get back and continue processing should be OK, no?
 

Yes.

 That would avoid the lockups and allow the device to be reset at
 any time. Or am I missing something?
 

Not that I can see.  If real hardware can be looped, so can qemu.  I'm
only worried about recursion and deadlocks (while real hardware can
deadlock, we'd prefer to avoid that).


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-19 Thread Edgar E. Iglesias
On Wed, Sep 19, 2012 at 03:12:12PM +0300, Avi Kivity wrote:
 On 09/19/2012 02:46 PM, Edgar E. Iglesias wrote:
  On Wed, Sep 19, 2012 at 10:55:30AM +0300, Avi Kivity wrote:
  On 09/19/2012 07:40 AM, Peter Crosthwaite wrote:
   On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
   edgar.igles...@gmail.com wrote:
   On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
   Ping for PMM,
  
   This is the root cause of your block on the SDHCI series - this is a
   discussion on the resolution to bogus infinite looping DMA. For current
   participants in this discussion, here's our thread on the same topic
   over in SD land:

   http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html

   With the findings here and the acknowledgement that this is a general
   problem, we either need to declare this issue out of scope for SDHCI, or
   work with these guys (in the immediate future) on the DMA infinite
   loop problem flagged here. I don't mind if SDHCI+ADMA is the guinea pig
   here, but can we get a decisive plan for going forward with this issue
   if it is going to continue to block SDHCI.
  
   Thanks to Igor for identifying the overlap between these discussions.
  
   Hi,
  
   A couple of dumb questions.
  
   What is the reason for the blocker? That a possible guest DoS is worse
   than no functionality at all?
  
   Can't you put the DMA/IO processing into the background?
   
   I don't know a way to do asynchronous DMA unless I am missing something
   here.
  
  You could schedule a bottom half and do the accesses from there.  Solves
  nothing though.
  
   So what happens is we have a device that walks an SG list
   synchronously while holding all the locks and stuff being discussed
   here. If that SG list loops infinitely, we crash.
  
  Did you mean loop, or recursion into the memory region that initiates DMA?
  
  I think we were discussing the loop issue (I hadn't thought of recursion
  issues before) :)
  
  Jan's point of resetability was interesting.
  
  Processing a finite number of DMA descriptors and scheduling bhs
  to continuously get back and continue processing should be OK, no?
  
 
 Yes.
 
  That would avoid the lockups and allow the device to be reset at
  any time. Or am I missing something?
  
 
 Not that I can see.  If real hardware can be looped, so can qemu.  I'm
 only worried about recursion and deadlocks (while real hardware can
 deadlock, we'd prefer to avoid that).

Agreed, thanks



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-19 Thread Igor Mitsyanko

On 09/19/2012 04:12 PM, Avi Kivity wrote:

On 09/19/2012 02:46 PM, Edgar E. Iglesias wrote:

On Wed, Sep 19, 2012 at 10:55:30AM +0300, Avi Kivity wrote:

On 09/19/2012 07:40 AM, Peter Crosthwaite wrote:

On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
edgar.igles...@gmail.com wrote:

On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:

Ping for PMM,

This is the root cause of your block on the SDHCI series - this is a
discussion on the resolution to bogus infinite looping DMA. For current
participants in this discussion, here's our thread on the same topic
over in SD land:

http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html

With the findings here and the acknowledgement that this is a general
problem, we either need to declare this issue out of scope for SDHCI, or
work with these guys (in the immediate future) on the DMA infinite
loop problem flagged here. I don't mind if SDHCI+ADMA is the guinea pig
here, but can we get a decisive plan for going forward with this issue
if it is going to continue to block SDHCI.

Thanks to Igor for identifying the overlap between these discussions.

Hi,

A couple of dumb questions.

What is the reason for the blocker? That a possible guest DoS is worse
than no functionality at all?

Can't you put the DMA/IO processing into the background?

I don't know a way to do asynchronous DMA unless I am missing something
here.

You could schedule a bottom half and do the accesses from there.  Solves
nothing though.


So what happens is we have a device that walks an SG list
synchronously while holding all the locks and stuff being discussed
here. If that SG list loops infinitely, we crash.

Did you mean loop, or recursion into the memory region that initiates DMA?

I think we were discussing the loop issue (I hadn't thought of recursion
issues before) :)

Jan's point of resetability was interesting.

Processing a finite number of DMA descriptors and scheduling bhs
to continuously get back and continue processing should be OK, no?


Yes.


Bottom half idea sounds good.
Also, for this particular device, rather than limiting the number of
processed descriptor entries in the bottom half, it seems reasonable to
interrupt the bh handler when we encounter a LINK descriptor, since only
recursive LINKs can generate infinite loops. In that case, a very long
descriptor table without a single LINK entry could potentially hang QEMU
for a long time, but that time will be finite anyway. Long term, this
problem will disappear when/if someone converts the QEMU SD card model
to use async I/O.

That would avoid the lockups and allow the device to be reset at
any time. Or am I missing something?


Not that I can see.  If real hardware can be looped, so can qemu.  I'm
only worried about recursion and deadlocks (while real hardware can
deadlock, we'd prefer to avoid that).




So, I think the idea here is that if real hardware can be locked up, we
should lock up too, but provide the guest CPU a possibility to abort the
locked operation. For this particular example, the SD card controller
can deadlock on recursive descriptor LINK entries, but the CPU can abort
the ongoing transaction at any time by issuing an ABORT command. And if
the guest CPU busy-waits in an infinite while() loop for a TRANSFER OVER
flag to be set, then it had it coming.





Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-19 Thread Avi Kivity
On 09/19/2012 04:01 PM, Igor Mitsyanko wrote:
 That would avoid the lockups and allow the device to be reset at
 any time. Or am I missing something?

 Not that I can see.  If real hardware can be looped, so can qemu.  I'm
 only worried about recursion and deadlocks (while real hardware can
 deadlock, we'd prefer to avoid that).


 
 So, I think the idea here is that if real hardware can be locked up, we
 should lock up too, but provide the guest CPU a possibility to abort the
 locked operation. For this particular example, the SD card controller
 can deadlock on recursive descriptor LINK entries, but the CPU can abort
 the ongoing transaction at any time by issuing an ABORT command. And if
 the guest CPU busy-waits in an infinite while() loop for a TRANSFER OVER
 flag to be set, then it had it coming.

If real hardware would lock up, so should we (in the guest's eyes), but
the qemu monitor should remain responsive.  It is not part of the
emulated hardware.


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-19 Thread Igor Mitsyanko

On 09/19/2012 11:57 AM, Jan Kiszka wrote:

On 2012-09-19 06:40, Peter Crosthwaite wrote:

On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
edgar.igles...@gmail.com wrote:

On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:

Ping for PMM,

This is the root cause of your block on the SDHCI series - this is a
discussion on the resolution to bogus infinite looping DMA. For current
participants in this discussion, here's our thread on the same topic
over in SD land:

http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html

With the findings here and the acknowledgement that this is a general
problem, we either need to declare this issue out of scope for SDHCI, or
work with these guys (in the immediate future) on the DMA infinite
loop problem flagged here. I don't mind if SDHCI+ADMA is the guinea pig
here, but can we get a decisive plan for going forward with this issue
if it is going to continue to block SDHCI.

Thanks to Igor for identifying the overlap between these discussions.

Hi,

A couple of dumb questions.

What is the reason for the blocker? That a possible guest DoS is worse
than no functionality at all?

Can't you put the DMA/IO processing into the background?

I don't know a way to do asynchronous DMA unless I am missing something
here. So what happens is we have a device that walks an SG list
synchronously while holding all the locks and stuff being discussed
here. If that SG list loops infinitely, we crash.

We could switch to async DMA, but most DMA-capable devices aren't
prepared for rescheduling points in the middle of their code. This will
require quite a bit of code review.


From what I understand, we can't switch to async DMA while the SD card
model doesn't support it. And introducing this support would require
converting every existing SD card user in QEMU.
I recall that such a conversion was already suggested on the mailing
list a while ago, but no one (including me) volunteered to do it :)



what's the diff between setup of bad descriptors and  writing a
while (1) loop in the guest CPU?


While(1) in the guest doesn't freeze all of QEMU (the monitor still
works and such), whereas the DMA ops going in circles do, as locks are
taken by an infinite loop.

That's the point. If we loop, we need at least a way to stop it by
issuing a device or system reset. But I tend to think that detecting
loops and failing/ignoring requests is easier to implement.

Jan






Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-18 Thread Peter Crosthwaite
Ping for PMM,

This is the root cause of your block on the SDHCI series - this is a
discussion on the resolution to bogus infinite looping DMA. For current
participants in this discussion, here's our thread on the same topic
over in SD land:

http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html

With the findings here and the acknowledgement that this is a general
problem, we either need to declare this issue out of scope for SDHCI, or
work with these guys (in the immediate future) on the DMA infinite
loop problem flagged here. I don't mind if SDHCI+ADMA is the guinea pig
here, but can we get a decisive plan for going forward with this issue
if it is going to continue to block SDHCI.

Thanks to Igor for identifying the overlap between these discussions.

Regards,
Peter

On Tue, Sep 11, 2012 at 10:39 PM, Avi Kivity a...@redhat.com wrote:
 On 09/11/2012 03:35 PM, Jan Kiszka wrote:
 On 2012-09-11 14:30, Avi Kivity wrote:
 The other option is to keep DMA requests issued by devices synchronous
 but let them fail if we are about to lock up. Still requires changes,
 but is probably more comprehensible for device model developers.

 How do you handle failures?

 By not sending a network frame or dropping an incoming one, e.g., and
 signaling this in a device specific way.

 Doesn't work for block devices.

 Because the block layer API cannot report errors to the devices? What
 happens if there is a real I/O error?

 We report real I/O errors.  But if we report a transient error due to
 some lock being taken as an I/O error, the guest will take unwarranted
 action.

 If the errors are not expected in normal operation (we can avoid them if
 all DMA is to real RAM) then this is an acceptable solution.  Still it
 generates a lot of rarely used code paths and so isn't very good for
 security.

 I'm not talking about transient errors. Recursions like this are always
 guest configuration errors that would cause real devices to lock up or
 timeout as well. This is practically about avoiding that a malicious
 guest can lock up QEMU, leaving it inoperative even for management tools.

 Ok.  That's more palatable.  We don't even have to report an error in
 that case, we can just perform the operation incorrectly (as I'm sure
 real hardware will), log an error, and continue.


 --
 error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-18 Thread Edgar E. Iglesias
On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
 Ping for PMM,
 
 This is the root cause of your block on the SDHCI series - this is a
 discussion on the resolution to bogus infinite looping DMA. For current
 participants in this discussion, here's our thread on the same topic
 over in SD land:

 http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html

 With the findings here and the acknowledgement that this is a general
 problem, we either need to declare this issue out of scope for SDHCI, or
 work with these guys (in the immediate future) on the DMA infinite
 loop problem flagged here. I don't mind if SDHCI+ADMA is the guinea pig
 here, but can we get a decisive plan for going forward with this issue
 if it is going to continue to block SDHCI.
 
 Thanks to Igor for identifying the overlap between these discussions.

Hi,

A couple of dumb questions.

What is the reason for the blocker? That a possible guest DoS is worse
than no functionality at all?

Can't you put the DMA/IO processing into the background?

what's the diff between setup of bad descriptors and  writing a
while (1) loop in the guest CPU?

Cheers,
E



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-18 Thread Peter Crosthwaite
On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
edgar.igles...@gmail.com wrote:
 On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
 Ping for PMM,

 This is the root cause of your block on the SDHCI series - this is a
 discussion on the resolution to bogus infinite looping DMA. For current
 participants in this discussion, here's our thread on the same topic
 over in SD land:

 http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html

 With the findings here and the acknowledgement that this is a general
 problem, we either need to declare this issue out of scope for SDHCI, or
 work with these guys (in the immediate future) on the DMA infinite
 loop problem flagged here. I don't mind if SDHCI+ADMA is the guinea pig
 here, but can we get a decisive plan for going forward with this issue
 if it is going to continue to block SDHCI.

 Thanks to Igor for identifying the overlap between these discussions.

 Hi,

 A couple of dumb questions.

 What is the reason for the blocker? That a possible guest DoS is worse
 than no functionality at all?

 Can't you put the DMA/IO processing into the background?

I don't know a way to do asynchronous DMA unless I am missing something
here. So what happens is we have a device that walks an SG list
synchronously while holding all the locks and stuff being discussed
here. If that SG list loops infinitely, we crash.


 what's the diff between setup of bad descriptors and  writing a
 while (1) loop in the guest CPU?


While(1) in the guest doesn't freeze all of QEMU (the monitor still
works and such), whereas the DMA ops going in circles do, as locks are
taken by an infinite loop.

Regards,
peter

 Cheers,
 E



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-16 Thread liu ping fan
On Thu, Sep 13, 2012 at 4:19 PM, Avi Kivity a...@redhat.com wrote:
 On 09/13/2012 09:55 AM, liu ping fan wrote:
 On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity a...@redhat.com wrote:
 On 09/11/2012 03:24 PM, Avi Kivity wrote:
 On 09/11/2012 12:57 PM, Jan Kiszka wrote:
 On 2012-09-11 11:44, liu ping fan wrote:
 On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity a...@redhat.com wrote:
 On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 The func call chain can suffer from recursively holding
 qemu_mutex_lock_iothread. We introduce a lockmap to record the
 lock depth.

 What is the root cause?  io handlers initiating I/O?

 cpu_physical_memory_rw() can be called nested, and when called, it can
 be protected by no lock, a device lock, or the big lock.
 I think that without the big lock, the io-dispatcher faces the same
 issue. As to the main loop, I have not considered it carefully, but at
 least dma-helper will call cpu_physical_memory_rw() with the big lock held.

 That is our core problem: inconsistent invocation of existing services
 /wrt locking. For portio, I was lucky that there is no nesting and I was
 able to drop the big lock around all (x86) call sites. But MMIO is way
 more tricky due to DMA nesting.

 Maybe we need to switch to a continuation style.  Instead of expecting
 cpu_physical_memory_rw() to complete synchronously, it becomes an
 asynchronous call and you provide it with a completion.  That means
 devices which use it are forced to drop the lock in between.  Block and
 network clients will be easy to convert since they already use APIs that
 drop the lock (except for accessing the descriptors).

 We could try to introduce a different version of cpu_physical_memory_rw,
 cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
 request can trigger the very same access in a nested fashion, and we
 will have to detect this to avoid locking up QEMU (locking up the guest
 might be OK).

 An async version of c_p_m_rw() will just cause a completion to bounce
 around, consuming cpu but not deadlocking anything.  If we can keep a
 count of the bounces, we might be able to stall it indefinitely or at
 least ratelimit it.


 Another option is to require all users of c_p_m_rw() and related to use
 a coroutine or thread.  That makes the programming easier (but still
 requires a revalidation after the dropped lock).

 For the nested cpu_physical_memory_rw(), we change its internals but
 keep the sync API as it is.  (Wrap the current cpu_physical_memory_rw()
 into cpu_physical_memory_rw_internal().)


 LOCK()  // can be device or big lock or both, depends on the caller.
 ..
 cpu_physical_memory_rw()
 {
UNLOCK() //unlock all the locks
queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
 // cpu_physical_memory_rw_internal can take lock(device,biglock) again
wait_for_completion(completion)
LOCK()
 }
 ..
 UNLOCK()


 This is dangerous.  The caller expects to hold the lock across the call,
 and will not re-validate its state.

 Although cpu_physical_memory_rw_internal() would be free to take locks,
 there is a precondition -- we still need to trace the lock stack taken
 by cpu_physical_memory_rw(), so that it can return to the caller
 correctly. Is that OK?

 I'm convinced that we need a recursive lock if we don't convert
 everything at once.

 So how about:

 - make bql a recursive lock (use pthreads, don't invent our own)
 - change c_p_m_rw() calling convention to caller may hold the BQL, but
 no device lock

 this allows devices to DMA each other.  Unconverted device models will
 work as they used to.  Converted device models will need to drop the
 device lock, re-acquire it, and revalidate any state.  That will cause

I think we are cornered by devices DMA-ing to each other, and that
raises unavoidable nested locking. Also, to avoid deadlock caused by the
devices' lock ordering, we should drop the current device lock to
acquire another one. The only little divergence is about the
revalidation: do we need to roll back? Or, for a converted device, we
can just tag it with a busy flag, that is, check-and-set the busy flag
at the entry of the device's mmio-dispatch. So when we re-acquire the
device's lock, the device's state is intact.

Does anybody have another suggestion?

Thanks and regards,
pingfan
 them to become more correct wrt recursive invocation.


 --
 error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-13 Thread liu ping fan
On Tue, Sep 11, 2012 at 10:54 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Tue, Sep 11, 2012 at 03:41:15PM +0300, Avi Kivity wrote:
 On 09/11/2012 03:24 PM, Avi Kivity wrote:
  On 09/11/2012 12:57 PM, Jan Kiszka wrote:
  On 2012-09-11 11:44, liu ping fan wrote:
  On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity a...@redhat.com wrote:
  On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
  From: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
  The func call chain can suffer from recursively holding
  qemu_mutex_lock_iothread. We introduce a lockmap to record the
  lock depth.
 
  What is the root cause?  io handlers initiating I/O?
 
  cpu_physical_memory_rw() can be called nested, and when called, it can
  be protected by no lock, a device lock, or the big lock.
  I think that without the big lock, the io-dispatcher faces the same
  issue. As to the main loop, I have not considered it carefully, but at
  least dma-helper will call cpu_physical_memory_rw() with the big lock held.
 
  That is our core problem: inconsistent invocation of existing services
  /wrt locking. For portio, I was lucky that there is no nesting and I was
  able to drop the big lock around all (x86) call sites. But MMIO is way
  more tricky due to DMA nesting.
 
  Maybe we need to switch to a continuation style.  Instead of expecting
  cpu_physical_memory_rw() to complete synchronously, it becomes an
  asynchronous call and you provide it with a completion.  That means
  devices which use it are forced to drop the lock in between.  Block and
  network clients will be easy to convert since they already use APIs that
  drop the lock (except for accessing the descriptors).
 
  We could try to introduce a different version of cpu_physical_memory_rw,
  cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
  request can trigger the very same access in a nested fashion, and we
  will have to detect this to avoid locking up QEMU (locking up the guest
  might be OK).
 
  An async version of c_p_m_rw() will just cause a completion to bounce
  around, consuming cpu but not deadlocking anything.  If we can keep a
  count of the bounces, we might be able to stall it indefinitely or at
  least ratelimit it.
 

 Another option is to require all users of c_p_m_rw() and related to use
 a coroutine or thread.  That makes the programming easier (but still
 requires a revalidation after the dropped lock).

 Unless i am misunderstanding this thread, the iothread flow section of
 http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg04315.html
 contains possible solutions for this.

 Basically use trylock() and if it fails, then choose a bailout option
 (which there are more than one possibilities listed).

I think in that thread, the original goal of the trylock() was to handle
a lock held by another thread, but here we want to resolve nested
locking. But yes, there is a similar point: solving it with async works
if we do not allow nested locks.

Regards,
pingfan




Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-13 Thread liu ping fan
On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity a...@redhat.com wrote:
 On 09/11/2012 03:24 PM, Avi Kivity wrote:
 On 09/11/2012 12:57 PM, Jan Kiszka wrote:
 On 2012-09-11 11:44, liu ping fan wrote:
 On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity a...@redhat.com wrote:
 On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 The func call chain can suffer from recursively holding
 qemu_mutex_lock_iothread. We introduce a lockmap to record the
 lock depth.

 What is the root cause?  io handlers initiating I/O?

 cpu_physical_memory_rw() can be called nested, and when called, it can
 be protected by no lock, a device lock, or the big lock.
 I think that without the big lock, the io-dispatcher faces the same
 issue. As to the main loop, I have not considered it carefully, but at
 least dma-helper will call cpu_physical_memory_rw() with the big lock held.

 That is our core problem: inconsistent invocation of existing services
 /wrt locking. For portio, I was lucky that there is no nesting and I was
 able to drop the big lock around all (x86) call sites. But MMIO is way
 more tricky due to DMA nesting.

 Maybe we need to switch to a continuation style.  Instead of expecting
 cpu_physical_memory_rw() to complete synchronously, it becomes an
 asynchronous call and you provide it with a completion.  That means
 devices which use it are forced to drop the lock in between.  Block and
 network clients will be easy to convert since they already use APIs that
 drop the lock (except for accessing the descriptors).

 We could try to introduce a different version of cpu_physical_memory_rw,
 cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
 request can trigger the very same access in a nested fashion, and we
 will have to detect this to avoid locking up QEMU (locking up the guest
 might be OK).

 An async version of c_p_m_rw() will just cause a completion to bounce
 around, consuming cpu but not deadlocking anything.  If we can keep a
 count of the bounces, we might be able to stall it indefinitely or at
 least ratelimit it.


 Another option is to require all users of c_p_m_rw() and related to use
 a coroutine or thread.  That makes the programming easier (but still
 requires a revalidation after the dropped lock).

For the nested cpu_physical_memory_rw(), we change its internals but
keep the sync API as it is.  (Wrap the current cpu_physical_memory_rw()
into cpu_physical_memory_rw_internal().)


LOCK()  // can be device or big lock or both, depends on the caller.
..
cpu_physical_memory_rw()
{
   UNLOCK() //unlock all the locks
   queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
// cpu_physical_memory_rw_internal can take lock(device,biglock) again
   wait_for_completion(completion)
   LOCK()
}
..
UNLOCK()

Although cpu_physical_memory_rw_internal() would be free to take locks,
there is a precondition -- we still need to trace the lock stack taken
by cpu_physical_memory_rw(), so that it can return to the caller
correctly. Is that OK?
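
A sketch of the wrapper's completion mechanics with plain pthreads, purely
illustrative (queue_work_on_thread and the lock-stack unwinding/re-taking
are hypothetical, not existing QEMU APIs):

    typedef struct Completion {
        pthread_mutex_t mu;
        pthread_cond_t cv;
        bool done;
    } Completion;

    static void completion_wait(Completion *c)   /* caller blocks here */
    {
        pthread_mutex_lock(&c->mu);
        while (!c->done) {
            pthread_cond_wait(&c->cv, &c->mu);
        }
        pthread_mutex_unlock(&c->mu);
    }

    static void completion_signal(Completion *c) /* worker calls this */
    {
        pthread_mutex_lock(&c->mu);
        c->done = true;
        pthread_cond_signal(&c->cv);
        pthread_mutex_unlock(&c->mu);
    }

The wrapper would unwind its lock stack, queue
cpu_physical_memory_rw_internal() plus a Completion on the worker, call
completion_wait(), and then re-take the recorded lock stack.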

Regards,
pingfan

 --
 error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-13 Thread Avi Kivity
On 09/13/2012 09:55 AM, liu ping fan wrote:
 On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity a...@redhat.com wrote:
 On 09/11/2012 03:24 PM, Avi Kivity wrote:
 On 09/11/2012 12:57 PM, Jan Kiszka wrote:
 On 2012-09-11 11:44, liu ping fan wrote:
 On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity a...@redhat.com wrote:
 On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 The func call chain can suffer from recursively holding
 qemu_mutex_lock_iothread. We introduce a lockmap to record the
 lock depth.

 What is the root cause?  io handlers initiating I/O?

 cpu_physical_memory_rw() can be called nested, and when called, it can
 be protected by no lock, a device lock, or the big lock.
 I think that without the big lock, the io-dispatcher faces the same
 issue. As to the main loop, I have not considered it carefully, but at
 least dma-helper will call cpu_physical_memory_rw() with the big lock held.

 That is our core problem: inconsistent invocation of existing services
 /wrt locking. For portio, I was lucky that there is no nesting and I was
 able to drop the big lock around all (x86) call sites. But MMIO is way
 more tricky due to DMA nesting.

 Maybe we need to switch to a continuation style.  Instead of expecting
 cpu_physical_memory_rw() to complete synchronously, it becomes an
 asynchronous call and you provide it with a completion.  That means
 devices which use it are forced to drop the lock in between.  Block and
 network clients will be easy to convert since they already use APIs that
 drop the lock (except for accessing the descriptors).

 We could try to introduce a different version of cpu_physical_memory_rw,
 cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
 request can trigger the very same access in a nested fashion, and we
 will have to detect this to avoid locking up QEMU (locking up the guest
 might be OK).

 An async version of c_p_m_rw() will just cause a completion to bounce
 around, consuming cpu but not deadlocking anything.  If we can keep a
 count of the bounces, we might be able to stall it indefinitely or at
 least ratelimit it.


 Another option is to require all users of c_p_m_rw() and related to use
 a coroutine or thread.  That makes the programming easier (but still
 requires a revalidation after the dropped lock).

 For the nested cpu_physical_memory_rw(), we change its internals but
 keep the API synchronous as it is.  (Wrap the current
 cpu_physical_memory_rw() into cpu_physical_memory_rw_internal().)
 
 
 LOCK()  // device lock, big lock, or both, depending on the caller
 ..
 cpu_physical_memory_rw()
 {
     UNLOCK()  // release all locks currently held
     queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
     // cpu_physical_memory_rw_internal() may take the locks (device, big lock) again
     wait_for_completion(completion)
     LOCK()    // re-acquire the locks before returning to the caller
 }
 ..
 UNLOCK()
 

This is dangerous.  The caller expects to hold the lock across the call,
and will not re-validate its state.

 cpu_physical_memory_rw_internal() would then be free to take locks,
 but with a precondition -- we still need to track the lock stack taken
 by the caller of cpu_physical_memory_rw(), so that the locks can be
 restored correctly before returning.  Is that OK?

I'm convinced that we need a recursive lock if we don't convert
everything at once.

So how about:

- make bql a recursive lock (use pthreads, don't invent our own)
- change the c_p_m_rw() calling convention to: the caller may hold the
BQL, but no device lock

This allows devices to DMA to each other.  Unconverted device models
will work as they used to.  Converted device models will need to drop
the device lock, re-acquire it, and revalidate any state.  That will
make them more correct wrt recursive invocation.
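
For the recursive BQL, plain pthreads is enough; a minimal
initialization sketch (function name and variable placement are
illustrative):

#include <pthread.h>

static pthread_mutex_t qemu_global_mutex;

static void qemu_bql_init(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    /* A recursive mutex may be re-locked by the owning thread and is
     * released only after a matching number of unlocks. */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&qemu_global_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
}

Nested qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread() pairs
then balance naturally -- the depth counting that the lockmap patch
does by hand.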


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-11 Thread Avi Kivity
On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
 The function call chain can suffer from recursively holding
 qemu_mutex_lock_iothread.  We introduce a lockmap to record the
 lock depth.

What is the root cause?  I/O handlers initiating I/O?

Perhaps we can have a solution that is local to exec.c.


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-11 Thread liu ping fan
On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity a...@redhat.com wrote:
 On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 The function call chain can suffer from recursively holding
 qemu_mutex_lock_iothread.  We introduce a lockmap to record the
 lock depth.

 What is the root cause?  I/O handlers initiating I/O?

cpu_physical_memory_rw() can be called nested, and when called, it can
be protected by no lock, the device lock, or the big lock.
I think that without the big lock, the I/O dispatcher would face the
same issue.  As for the main loop, I have not considered it carefully,
but at least dma-helpers will call cpu_physical_memory_rw() with the
big lock held.

Regards,
pingfan

 Perhaps we can have a solution that is local to exec.c.


 --
 error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-11 Thread Avi Kivity
On 09/11/2012 12:44 PM, liu ping fan wrote:
 On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity a...@redhat.com wrote:
 On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 The function call chain can suffer from recursively holding
 qemu_mutex_lock_iothread.  We introduce a lockmap to record the
 lock depth.

 What is the root cause?  I/O handlers initiating I/O?

 cpu_physical_memory_rw() can be called nested, and when called, it can
 be protected by no lock, the device lock, or the big lock.

Then we should look for a solution that is local to exec.c (and the
nested dispatch problem).  I think we can identify and fix all nested
dispatches (converting them to either async dma, or letting the memory
core do a single dispatch without indirection through I/O handlers).

 I think that without the big lock, the I/O dispatcher would face the
 same issue.  As for the main loop, I have not considered it carefully,
 but at least dma-helpers will call cpu_physical_memory_rw() with the
 big lock held.

DMA is inherently asynchronous, so we already drop the lock between
initiation and completion; we need to find a way to make it easy to use
the API without taking the lock while the transfer takes place.
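
As a sketch, this is the split the block layer's AIO interface already
enforces; the device structure and field names below are made up:

/* Completion: invoked later from the main loop; device state must be
 * revalidated here, since the guest may have changed it meanwhile. */
static void mydev_dma_complete(void *opaque, int ret)
{
    MyDevState *s = opaque;

    if (ret < 0) {
        /* report the error in a device-specific way */
    }
    /* ... update status registers, raise the completion interrupt ... */
}

/* Initiation: runs under the lock, but only queues the transfer. */
static void mydev_start_dma(MyDevState *s)
{
    s->aiocb = bdrv_aio_readv(s->bs, s->sector, &s->qiov, s->nsectors,
                              mydev_dma_complete, s);
    /* the lock is dropped once we return to the main loop */
}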

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-11 Thread Jan Kiszka
On 2012-09-11 11:44, liu ping fan wrote:
 On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity a...@redhat.com wrote:
 On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 The function call chain can suffer from recursively holding
 qemu_mutex_lock_iothread.  We introduce a lockmap to record the
 lock depth.

 What is the root cause?  I/O handlers initiating I/O?

 cpu_physical_memory_rw() can be called nested, and when called, it can
 be protected by no lock, the device lock, or the big lock.
 I think that without the big lock, the I/O dispatcher would face the
 same issue.  As for the main loop, I have not considered it carefully,
 but at least dma-helpers will call cpu_physical_memory_rw() with the
 big lock held.

That is our core problem: inconsistent invocation of existing services
/wrt locking. For portio, I was lucky that there is no nesting and I was
able to drop the big lock around all (x86) call sites. But MMIO is way
more tricky due to DMA nesting.

We could try to introduce a different version of cpu_physical_memory_rw,
cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
request can trigger the very same access in a nested fashion, and we
will have to detect this to avoid locking up QEMU (locking up the guest
might be OK).

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-11 Thread Jan Kiszka
On 2012-09-11 11:54, Avi Kivity wrote:
 On 09/11/2012 12:44 PM, liu ping fan wrote:
 On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity a...@redhat.com wrote:
 On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 The function call chain can suffer from recursively holding
 qemu_mutex_lock_iothread.  We introduce a lockmap to record the
 lock depth.

 What is the root cause?  I/O handlers initiating I/O?

 cpu_physical_memory_rw() can be called nested, and when called, it can
 be protected by no lock, the device lock, or the big lock.
 
 Then we should look for a solution that is local to exec.c (and the
 nested dispatch problem).  I think we can identify and fix all nested
 dispatches (converting them to either async dma, or letting the memory
 core do a single dispatch without indirection through I/O handlers).
 
 I think that without the big lock, the I/O dispatcher would face the
 same issue.  As for the main loop, I have not considered it carefully,
 but at least dma-helpers will call cpu_physical_memory_rw() with the
 big lock held.
 
 DMA is inherently asynchronous, so we already drop the lock between
 initiation and completion; we need to find a way to make it easy to use
 the API without taking the lock while the transfer takes place.

We will have to review/rework device models that want to use the new
locking scheme in a way that they can drop their own lock while issuing
DMA. But that is surely non-trivial.

The other option is to keep DMA requests issued by devices synchronous
but let them fail if we are about to lock up. Still requires changes,
but is probably more comprehensible for device model developers.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-11 Thread Avi Kivity
On 09/11/2012 01:04 PM, Jan Kiszka wrote:

 DMA is inherently asynchronous, so we already drop the lock between
 initiation and completion; we need to find a way to make it easy to use
 the API without taking the lock while the transfer takes place.
 
 We will have to review/rework device models that want to use the new
 locking scheme in a way that they can drop their own lock while issuing
 DMA. But that is surely non-trivial.

Most DMA today happens without the big qemu lock.  We only need to
convert the paths that actually access memory; these are the block and
network layers (for the respective devices).

 The other option is to keep DMA requests issued by devices synchronous
 but let them fail if we are about to lock up. Still requires changes,
 but is probably more comprehensible for device model developers.

How do you handle failures?

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-11 Thread Jan Kiszka
On 2012-09-11 13:03, Avi Kivity wrote:
 On 09/11/2012 01:04 PM, Jan Kiszka wrote:
 
 DMA is inherently asynchronous, so we already drop the lock between
 initiation and completion; we need to find a way to make it easy to use
 the API without taking the lock while the transfer takes place.

 We will have to review/rework device models that want to use the new
 locking scheme in a way that they can drop their own lock while issuing
 DMA. But that is surely non-trivial.
 
 Most DMA today happens without the big qemu lock.  We only need to
 convert the paths that actually access memory; these are the block and
 network layers (for the respective devices).

...and the devices themselves, of course.

 
 The other option is to keep DMA requests issued by devices synchronous
 but let them fail if we are about to lock up. Still requires changes,
 but is probably more comprehensible for device model developers.
 
 How do you handle failures?

By not sending a network frame or dropping an incoming one, for
example, and signaling this in a device-specific way.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-11 Thread Avi Kivity
On 09/11/2012 02:08 PM, Jan Kiszka wrote:
 On 2012-09-11 13:03, Avi Kivity wrote:
 On 09/11/2012 01:04 PM, Jan Kiszka wrote:
 
 DMA is inherently asynchronous, so we already drop the lock between
 initiation and completion; we need to find a way to make it easy to use
 the API without taking the lock while the transfer takes place.

 We will have to review/rework device models that want to use the new
 locking scheme in a way that they can drop their own lock while issuing
 DMA. But that is surely non-trivial.
 
 Most DMA today happens without the big qemu lock.  We only need to
 convert the paths that actually access memory; these are the block and
 network layers (for the respective devices).
 
 ...and the devices themselves, of course.

Right, for descriptors and stuff.  So a guest can set a DMA address to
point at its own BAR, and cause qemu to deadlock.

The problem is not limited to locking.  A guest can also cause a write
to a BAR to initiate a synchronous write with the same address and data,
triggering infinite recursion.

Perhaps one fix is the mythical DMA API, which will provide each DMA
initiator with its own view of memory (a private MemoryRegion root
node).  Even that can be worked around with a pair of devices accessing
each other.

 
 
 The other option is to keep DMA requests issued by devices synchronous
 but let them fail if we are about to lock up. Still requires changes,
 but is probably more comprehensible for device model developers.
 
 How do you handle failures?
 
 By not sending a network frame or dropping an incoming one, for
 example, and signaling this in a device-specific way.

Doesn't work for block devices.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-11 Thread Avi Kivity
On 09/11/2012 12:57 PM, Jan Kiszka wrote:
 On 2012-09-11 11:44, liu ping fan wrote:
 On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity a...@redhat.com wrote:
 On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 The function call chain can suffer from recursively holding
 qemu_mutex_lock_iothread.  We introduce a lockmap to record the
 lock depth.

 What is the root cause?  I/O handlers initiating I/O?

 cpu_physical_memory_rw() can be called nested, and when called, it can
 be protected by no lock, the device lock, or the big lock.
 I think that without the big lock, the I/O dispatcher would face the
 same issue.  As for the main loop, I have not considered it carefully,
 but at least dma-helpers will call cpu_physical_memory_rw() with the
 big lock held.
 
 That is our core problem: inconsistent invocation of existing services
 /wrt locking. For portio, I was lucky that there is no nesting and I was
 able to drop the big lock around all (x86) call sites. But MMIO is way
 more tricky due to DMA nesting.

Maybe we need to switch to a continuation style.  Instead of expecting
cpu_physical_memory_rw() to complete synchronously, it becomes an
asynchronous call and you provide it with a completion.  That means
devices which use it are forced to drop the lock in between.  Block and
network clients will be easy to convert since they already use APIs that
drop the lock (except for accessing the descriptors).

 We could try to introduce a different version of cpu_physical_memory_rw,
 cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
 request can trigger the very same access in a nested fashion, and we
 will have to detect this to avoid locking up QEMU (locking up the guest
 might be OK).

An async version of c_p_m_rw() will just cause a completion to bounce
around, consuming cpu but not deadlocking anything.  If we can keep a
count of the bounces, we might be able to stall it indefinitely or at
least ratelimit it.
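
A sketch of that bounce accounting (the structure and callback are
invented for illustration):

#define CPM_RW_MAX_BOUNCES 16

typedef struct AsyncCpmRw {
    /* ... address, buffer, length, direction ... */
    int bounces;    /* how many times the completion has re-fired */
} AsyncCpmRw;

/* Hypothetical completion callback of an asynchronous c_p_m_rw(). */
static void cpm_rw_completion(void *opaque)
{
    AsyncCpmRw *req = opaque;

    if (++req->bounces > CPM_RW_MAX_BOUNCES) {
        /* A guest that points MMIO/DMA back at itself just makes the
         * completion bounce; throttle it instead of burning CPU,
         * e.g. by rescheduling it with an increasing delay. */
        return;
    }
    /* ... re-issue the nested access asynchronously ... */
}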

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-11 Thread Jan Kiszka
On 2012-09-11 14:20, Avi Kivity wrote:
 On 09/11/2012 02:08 PM, Jan Kiszka wrote:
 On 2012-09-11 13:03, Avi Kivity wrote:
 On 09/11/2012 01:04 PM, Jan Kiszka wrote:

 DMA is inherently asynchronous, so we already drop the lock between
 initiation and completion; we need to find a way to make it easy to use
 the API without taking the lock while the transfer takes place.

 We will have to review/rework device models that want to use the new
 locking scheme in a way that they can drop their own lock while issuing
 DMA. But that is surely non-trivial.

 Most DMA today happens without the big qemu lock.  We only need to
 convert the paths that actually access memory; these are the block and
 network layers (for the respective devices).

 ...and the devices themselves, of course.
 
 Right, for descriptors and stuff.  So a guest can set a DMA address to
 point at its own BAR, and cause qemu to deadlock.
 
 The problem is not limited to locking.  A guest can also cause a write
 to a BAR to initiate a synchronous write with the same address and data,
 triggering infinite recursion.
 
 Perhaps one fix is the mythical DMA API, which will provide each DMA
 initiator with its own view of memory (a private MemoryRegion root
 node).  Even that can be worked around with a pair of devices accessing
 each other.

Hmm, don't see an alternative to runtime loop detection yet.

 


 The other option is to keep DMA requests issued by devices synchronous
 but let them fail if we are about to lock up. Still requires changes,
 but is probably more comprehensible for device model developers.

 How do you handle failures?

 By not sending a network frame or dropping an incoming one, for
 example, and signaling this in a device-specific way.
 
 Doesn't work for block devices.

Because the block layer API cannot report errors to the devices? What
happens if there is a real I/O error?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-11 Thread Avi Kivity
On 09/11/2012 03:25 PM, Jan Kiszka wrote:

 Most DMA today happens without the big qemu lock.  We only need to
 convert the paths that actually access memory; these are the block and
 network layers (for the respective devices).

 ...and the devices themselves, of course.
 
 Right, for descriptors and stuff.  So a guest can set a DMA address to
 point at its own BAR, and cause qemu to deadlock.
 
 The problem is not limited to locking.  A guest can also cause a write
 to a BAR to initiate a synchronous write with the same address and data,
 triggering infinite recursion.
 
 Perhaps one fix is the mythical DMA API, which will provide each DMA
 initiator with its own view of memory (a private MemoryRegion root
 node).  Even that can be worked around with a pair of devices accessing
 each other.
 
 Hmm, don't see an alternative to runtime loop detection yet.

See an expensive one on the other branch of this thread.

 
 


 The other option is to keep DMA requests issued by devices synchronous
 but let them fail if we are about to lock up. Still requires changes,
 but is probably more comprehensible for device model developers.

 How do you handle failures?

 By not sending a network frame or dropping an incoming one, for
 example, and signaling this in a device-specific way.
 
 Doesn't work for block devices.
 
 Because the block layer API cannot report errors to the devices? What
 happens if there is a real I/O error?

We report real I/O errors.  But if we report a transient error (one
due to some lock being held) as an I/O error, the guest will take
unwarranted action.

If the errors are not expected in normal operation (we can avoid them
if all DMA is to real RAM), then this is an acceptable solution.
Still, it generates a lot of rarely used code paths and so isn't very
good for security.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-11 Thread Jan Kiszka
On 2012-09-11 14:30, Avi Kivity wrote:
 The other option is to keep DMA requests issued by devices synchronous
 but let them fail if we are about to lock up. Still requires changes,
 but is probably more comprehensible for device model developers.

 How do you handle failures?

 By not sending a network frame or dropping an incoming one, for
 example, and signaling this in a device-specific way.

 Doesn't work for block devices.

 Because the block layer API cannot report errors to the devices? What
 happens if there is a real I/O error?
 
 We report real I/O errors.  But if we report a transient error (one
 due to some lock being held) as an I/O error, the guest will take
 unwarranted action.
 
 If the errors are not expected in normal operation (we can avoid them
 if all DMA is to real RAM), then this is an acceptable solution.
 Still, it generates a lot of rarely used code paths and so isn't very
 good for security.

I'm not talking about transient errors. Recursions like this are
always guest configuration errors that would cause real devices to
lock up or time out as well. This is practically about preventing a
malicious guest from locking up QEMU, leaving it inoperative even for
management tools.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-11 Thread Avi Kivity
On 09/11/2012 03:35 PM, Jan Kiszka wrote:
 On 2012-09-11 14:30, Avi Kivity wrote:
 The other option is to keep DMA requests issued by devices synchronous
 but let them fail if we are about to lock up. Still requires changes,
 but is probably more comprehensible for device model developers.

 How do you handle failures?

 By not sending a network frame or dropping an incoming one, for
 example, and signaling this in a device-specific way.

 Doesn't work for block devices.

 Because the block layer API cannot report errors to the devices? What
 happens if there is a real I/O error?
 
 We report real I/O errors.  But if we report a transient error (one
 due to some lock being held) as an I/O error, the guest will take
 unwarranted action.
 
 If the errors are not expected in normal operation (we can avoid them
 if all DMA is to real RAM), then this is an acceptable solution.
 Still, it generates a lot of rarely used code paths and so isn't very
 good for security.
 
 I'm not talking about transient errors. Recursions like this are
 always guest configuration errors that would cause real devices to
 lock up or time out as well. This is practically about preventing a
 malicious guest from locking up QEMU, leaving it inoperative even for
 management tools.

Ok.  That's more palatable.  We don't even have to report an error in
that case; we can just perform the operation incorrectly (as I'm sure
real hardware will), log an error, and continue.


-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-11 Thread Avi Kivity
On 09/11/2012 03:24 PM, Avi Kivity wrote:
 On 09/11/2012 12:57 PM, Jan Kiszka wrote:
 On 2012-09-11 11:44, liu ping fan wrote:
 On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity a...@redhat.com wrote:
 On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan pingf...@linux.vnet.ibm.com

 The function call chain can suffer from recursively holding
 qemu_mutex_lock_iothread.  We introduce a lockmap to record the
 lock depth.

 What is the root cause?  I/O handlers initiating I/O?

 cpu_physical_memory_rw() can be called nested, and when called, it can
 be protected by no lock, the device lock, or the big lock.
 I think that without the big lock, the I/O dispatcher would face the
 same issue.  As for the main loop, I have not considered it carefully,
 but at least dma-helpers will call cpu_physical_memory_rw() with the
 big lock held.
 
 That is our core problem: inconsistent invocation of existing services
 /wrt locking. For portio, I was lucky that there is no nesting and I was
 able to drop the big lock around all (x86) call sites. But MMIO is way
 more tricky due to DMA nesting.
 
 Maybe we need to switch to a continuation style.  Instead of expecting
 cpu_physical_memory_rw() to complete synchronously, it becomes an
 asynchronous call and you provide it with a completion.  That means
 devices which use it are forced to drop the lock in between.  Block and
 network clients will be easy to convert since they already use APIs that
 drop the lock (except for accessing the descriptors).
 
 We could try to introduce a different version of cpu_physical_memory_rw,
 cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
 request can trigger the very same access in a nested fashion, and we
 will have to detect this to avoid locking up QEMU (locking up the guest
 might be OK).
 
 An async version of c_p_m_rw() will just cause a completion to bounce
 around, consuming cpu but not deadlocking anything.  If we can keep a
 count of the bounces, we might be able to stall it indefinitely or at
 least ratelimit it.
 

Another option is to require all users of c_p_m_rw() and related to use
a coroutine or thread.  That makes the programming easier (but still
requires a revalidation after the dropped lock).

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-11 Thread Marcelo Tosatti
On Tue, Sep 11, 2012 at 03:41:15PM +0300, Avi Kivity wrote:
 On 09/11/2012 03:24 PM, Avi Kivity wrote:
  On 09/11/2012 12:57 PM, Jan Kiszka wrote:
  On 2012-09-11 11:44, liu ping fan wrote:
  On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity a...@redhat.com wrote:
  On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
  From: Liu Ping Fan pingf...@linux.vnet.ibm.com
 
  The function call chain can suffer from recursively holding
  qemu_mutex_lock_iothread.  We introduce a lockmap to record the
  lock depth.
 
  What is the root cause?  I/O handlers initiating I/O?
 
  cpu_physical_memory_rw() can be called nested, and when called, it can
  be protected by no lock, the device lock, or the big lock.
  I think that without the big lock, the I/O dispatcher would face the
  same issue.  As for the main loop, I have not considered it carefully,
  but at least dma-helpers will call cpu_physical_memory_rw() with the
  big lock held.
  
  That is our core problem: inconsistent invocation of existing services
  /wrt locking. For portio, I was lucky that there is no nesting and I was
  able to drop the big lock around all (x86) call sites. But MMIO is way
  more tricky due to DMA nesting.
  
  Maybe we need to switch to a continuation style.  Instead of expecting
  cpu_physical_memory_rw() to complete synchronously, it becomes an
  asynchronous call and you provide it with a completion.  That means
  devices which use it are forced to drop the lock in between.  Block and
  network clients will be easy to convert since they already use APIs that
  drop the lock (except for accessing the descriptors).
  
  We could try to introduce a different version of cpu_physical_memory_rw,
  cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
  request can trigger the very same access in a nested fashion, and we
  will have to detect this to avoid locking up QEMU (locking up the guest
  might be OK).
  
  An async version of c_p_m_rw() will just cause a completion to bounce
  around, consuming cpu but not deadlocking anything.  If we can keep a
  count of the bounces, we might be able to stall it indefinitely or at
  least ratelimit it.
  
 
 Another option is to require all users of c_p_m_rw() and related to use
 a coroutine or thread.  That makes the programming easier (but still
 requires a revalidation after the dropped lock).

Unless I am misunderstanding this thread, the iothread flow section of
http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg04315.html
contains possible solutions for this.

Basically, use trylock(), and if it fails, choose a bailout option
(more than one possibility is listed there).
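
In pthreads terms, the core of that pattern looks like this (a sketch;
the bailout action itself is device-specific):

#include <errno.h>
#include <pthread.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 0 on success, EBUSY if taking the lock could deadlock. */
static int locked_memory_access(void)
{
    if (pthread_mutex_trylock(&big_lock) == EBUSY) {
        /* The lock is already held -- possibly by this very thread
         * via a nested dispatch.  Bail out: defer the access to a
         * bottom half, or fail the request, instead of deadlocking. */
        return EBUSY;
    }
    /* ... perform the memory access ... */
    pthread_mutex_unlock(&big_lock);
    return 0;
}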





[Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap

2012-09-11 Thread Liu Ping Fan
From: Liu Ping Fan pingf...@linux.vnet.ibm.com

The function call chain can suffer from recursively holding
qemu_mutex_lock_iothread.  We introduce a lockmap to record the
lock depth.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 cpus.c  |   18 ++
 qemu-thread-posix.c |   23 +++
 qemu-thread-posix.h |5 +
 qemu-thread.h   |3 +++
 4 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/cpus.c b/cpus.c
index 4cd7f85..09f6670 100644
--- a/cpus.c
+++ b/cpus.c
@@ -736,6 +736,8 @@ static void *qemu_kvm_cpu_thread_fn(void *arg)
     int r;
 
     pthread_setspecific(qemu_thread_key, cpu->thread);
+    cpu->thread->lockmap.biglock = 0;
+    qemu_big_lockmap_inc();
     qemu_mutex_lock(&qemu_global_mutex);
     qemu_thread_get_self(cpu->thread);
     env->thread_id = qemu_get_thread_id();
@@ -905,6 +907,14 @@ int qemu_cpu_is_self(void *_env)
 
 void qemu_mutex_lock_iothread(void)
 {
+    unsigned int map;
+
+    if (!qemu_thread_is_self(&io_thread)) {
+        map = qemu_big_lockmap_inc();
+        if (map > 1) {
+            return;
+        }
+    }
     if (!tcg_enabled()) {
         qemu_mutex_lock(&qemu_global_mutex);
     } else {
@@ -920,6 +930,14 @@ void qemu_mutex_lock_iothread(void)
 
 void qemu_mutex_unlock_iothread(void)
 {
+    unsigned int map;
+
+    if (!qemu_thread_is_self(&io_thread)) {
+        map = qemu_big_lockmap_dec();
+        if (map != 0) {
+            return;
+        }
+    }
     qemu_mutex_unlock(&qemu_global_mutex);
 }
 
diff --git a/qemu-thread-posix.c b/qemu-thread-posix.c
index f448fcb..1e07dc2 100644
--- a/qemu-thread-posix.c
+++ b/qemu-thread-posix.c
@@ -17,6 +17,7 @@
 #include <signal.h>
 #include <stdint.h>
 #include <string.h>
+#include <glib.h>
 #include "qemu-thread.h"
 
 pthread_key_t qemu_thread_key;
@@ -158,6 +159,28 @@ void qemu_thread_key_create(void)
     pthread_key_create(&qemu_thread_key, NULL);
 }
 
+int16_t qemu_big_lockmap_inc(void)
+{
+    QemuThread *t = pthread_getspecific(qemu_thread_key);
+
+    return ++t->lockmap.biglock;
+}
+
+int16_t qemu_big_lockmap_dec(void)
+{
+    QemuThread *t = pthread_getspecific(qemu_thread_key);
+    g_assert(t->lockmap.biglock > 0);
+
+    return --t->lockmap.biglock;
+}
+
+int16_t qemu_big_lockmap_get(void)
+{
+    QemuThread *t = pthread_getspecific(qemu_thread_key);
+
+    return t->lockmap.biglock;
+}
+
 bool qemu_thread_is_self(QemuThread *thread)
 {
     return pthread_equal(pthread_self(), thread->thread);
diff --git a/qemu-thread-posix.h b/qemu-thread-posix.h
index 2607b1c..8f9506b 100644
--- a/qemu-thread-posix.h
+++ b/qemu-thread-posix.h
@@ -10,8 +10,13 @@ struct QemuCond {
     pthread_cond_t cond;
 };
 
+typedef struct Lockmap {
+    int16_t biglock;
+} Lockmap;
+
 struct QemuThread {
     pthread_t thread;
+    Lockmap lockmap;
 };
 
 extern pthread_key_t qemu_thread_key;
diff --git a/qemu-thread.h b/qemu-thread.h
index 4a6427d..529850b 100644
--- a/qemu-thread.h
+++ b/qemu-thread.h
@@ -47,4 +47,7 @@ bool qemu_thread_is_self(QemuThread *thread);
 void qemu_thread_exit(void *retval);
 
 void qemu_thread_key_create(void);
+int16_t qemu_big_lockmap_inc(void);
+int16_t qemu_big_lockmap_dec(void);
+int16_t qemu_big_lockmap_get(void);
 #endif
-- 
1.7.4.4