Re: hexagon: modeling a shared lock state

2024-01-26 Thread Richard Henderson

On 1/26/24 02:28, Brian Cain wrote:

static void do_hwlock(CPUHexagonState *env, bool *lock)
{
    CPUState *cs = env_cpu(env);

    bql_lock();

    if (*lock) {
        env->hwlock_pending = true;
        cs->halted = true;
        cs->exception_index = EXCP_HALTED;
        bql_unlock();
        cpu_loop_exit(cs);


In place of the above - we have cpu_interrupt(cs, CPU_INTERRUPT_HALT) -- but is 
that equivalent?


No, it is not.  Raising an interrupt will not take effect immediately.
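
Roughly, cpu_interrupt() only records the request and prods the vCPU, so it is 
serviced at the next interrupt check, after the current insn (and TB) has 
completed; cpu_loop_exit() longjmps out of the helper on the spot:

    /* Deferred: cpu_interrupt() returns, the lock insn completes, and
     * CPU_INTERRUPT_HALT is handled at the next interrupt check. */
    cpu_interrupt(cs, CPU_INTERRUPT_HALT);

    /* Immediate: longjmp back to the cpu_exec() loop; the insn does not
     * complete and can be restarted after a kick. */
    cpu_loop_exit(cs);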


Is there one idiom that's preferred over another?  Somehow it seems simpler if 
we don't need to longjmp and IIRC some of these patterns do.


You need the longjmp to halt and stop execution without completing the current insn, so 
that the insn can be restarted later.


Oh!  Just remembered: you'll want cpu_loop_exit_restore() there, so that the pc is 
updated properly.  Better to unwind in the contended case than to require the pc be 
updated along the uncontended fast path.
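
Concretely, the contended arm might then look like this (a sketch following the 
example quoted below in the original message; note GETPC() must be captured in 
the top-level helper, where the return address into the TB is still valid):

    static void do_hwlock(CPUHexagonState *env, bool *lock, uintptr_t ra)
    {
        CPUState *cs = env_cpu(env);

        bql_lock();
        if (*lock) {
            env->hwlock_pending = true;
            cs->halted = true;
            cs->exception_index = EXCP_HALTED;
            bql_unlock();
            /* Unwind: recovers the pc of the lock insn from the TB,
             * so the insn is re-executed after qemu_cpu_kick(). */
            cpu_loop_exit_restore(cs, ra);
        }
        *lock = true;
        bql_unlock();
    }

    void HELPER(k0lock)(CPUHexagonState *env)
    {
        do_hwlock(env, &k0lock, GETPC());
    }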



r~



RE: hexagon: modeling a shared lock state

2024-01-25 Thread Brian Cain


> On 1/25/24 01:07, Brian Cain wrote:
> > Philippe,
> >
> > For hexagon sysemu, while internally reviewing patches to be upstreamed we
> > noticed that our design for a lock instruction is not quite suitable.  The
> > k0lock instruction will halt if some other hexagon hardware CPU has already
> > claimed the kernel lock, only to continue executing once some CPU executes
> > the unlock instruction.  We modeled this using a lock state enumeration
> > member { OWNER, WAITING, UNLOCKED } in **each** vCPU, and atomically
> > transitioning the lock required us to have vCPU[n] write the updated lock
> > state to vCPU[m] when unlocking.
> >
> > In context:
> > https://github.com/quic/qemu/blob/hexagon_sysemu_20_dec_2023/target/hexagon/op_helper.c#L1790-L1821
> >
> > So instead, maybe it makes more sense to have a system device hold a single
> > representation of the lock’s state, and each vCPU can do some kind of access
> > via load/store and/or interrupts to/from the device?  I was thinking of
> > raising an interrupt on the lock device at the vCPU’s k0lock instruction to
> > indicate demand for the lock, and then the device could raise an interrupt
> > to one of the vCPUs when it’s granted the lock.  One drawback is that this
> > might add significant latency to the uncontested lock case.  So I also
> > considered doing a load of some part of the lock device’s memory that could
> > indicate whether the lock was available, and if available it would claim it
> > with a store (both ld/st while holding the BQL).  Only if unavailable would
> > it halt and raise the interrupt.  Is it possible to create an address space
> > that’s independent of the true system memory map for this use case, or would
> > we need to carve out some memory from the existing memory map for this
> > synthetic device?  Or – do you have a suggestion for a better approach
> > overall?
> 
> I think you are over-thinking this.  A system device would just get in the
> way.

Ok, great - our existing design is ~roughly like this.  But we can explore this 
-- thanks for writing this example!

> I think you want something like
> 
>    struct CPUHexagonState {
>        ...
>        bool hwlock_pending;
>    };
> 
>    hexagon_cpu_has_work(CPUState *cs) {
>        if (cpu_env(cs)->hwlock_pending) {
>            return false;
>        }
>        ...
>    }
> 
>    static void do_hwlock(CPUHexagonState *env, bool *lock)
>    {
>        CPUState *cs = env_cpu(env);
> 
>        bql_lock();
> 
>        if (*lock) {
>            env->hwlock_pending = true;
>            cs->halted = true;
>            cs->exception_index = EXCP_HALTED;
>            bql_unlock();
>            cpu_loop_exit(cs);

In place of the above - we have cpu_interrupt(cs, CPU_INTERRUPT_HALT) -- but is 
that equivalent?  Is there one idiom that's preferred over another?  Somehow it 
seems simpler if we don't need to longjmp and IIRC some of these patterns do.

>        } else {
>            *lock = true;
>            bql_unlock();
>        }
>    }
> 
>    static void do_hwunlock(CPUHexagonState *env, bool *lock)
>    {
>        CPUState *cs;
>        BQL_LOCK_GUARD();
> 
>        *lock = false;
> 
>        CPU_FOREACH(cs) {
>            if (cpu_env(cs)->hwlock_pending) {
>                cpu_env(cs)->hwlock_pending = false;
>                qemu_cpu_kick(cs);
>            }
>        }
>    }
> 
>static bool k0lock, tlblock;
> 
>void HELPER(k0lock)(CPUHexagonState *env)
>void HELPER(tlblock)(CPUHexagonState *env)
>void HELPER(k0unlock)(CPUHexagonState *env)
>void HELPER(tlbunlock)(CPUHexagonState *env)
> 
> Open questions are:
> 
> (1) Do interrupts cancel lock wait?  Either way, placement in
> hexagon_cpu_has_work is key.

Yeah - they do; we will double-check the interaction with has_work.

> (2) I assume that the pc will not be advanced, so that after the kick we will
> re-execute the hwlock instruction.  There will be a thundering herd racing to
> grab the lock again, but it saves queuing logic, which might be complicated
> if interrupts are involved.

Yes that's right, too.  Thanks!


Re: hexagon: modeling a shared lock state

2024-01-24 Thread Richard Henderson

On 1/25/24 01:07, Brian Cain wrote:

Philippe,

For hexagon sysemu, while internally reviewing patches to be upstreamed we noticed that 
our design for a lock instruction is not quite suitable.  The k0lock instruction will halt 
if some other hexagon hardware CPU has already claimed the kernel lock, only to continue 
executing once some CPU executes the unlock instruction.  We modeled this using a lock 
state enumeration member { OWNER, WAITING, UNLOCKED } in **each** vCPU, and atomically 
transitioning the lock required us to have vCPU[n] write the updated lock state to vCPU[m] 
when unlocking.


In context: 
https://github.com/quic/qemu/blob/hexagon_sysemu_20_dec_2023/target/hexagon/op_helper.c#L1790-L1821 


So instead, maybe it makes more sense to have a system device hold a single representation 
of the lock’s state, and each vCPU can do some kind of access via load/store and/or 
interrupts to/from the device?  I was thinking of raising an interrupt on the lock device 
at the vCPU’s k0lock instruction to indicate demand for the lock, and then the device 
could raise an interrupt to one of the vCPUs when it’s granted the lock.  One drawback is 
that this might add significant latency to the uncontested lock case.  So I also 
considered doing a load of some part of the lock device’s memory that could indicate 
whether the lock was available, and if available it would claim it with a store (both 
ld/st while holding the BQL).  Only if unavailable would it halt and raise the interrupt.  
Is it possible to create an address space that’s independent of the true system memory map 
for this use case, or would we need to carve out some memory from the existing memory map 
for this synthetic device?  Or – do you have a suggestion for a better approach overall?


I think you are over-thinking this.  A system device would just get in the way.  I think 
you want something like


  struct CPUHexagonState {
      ...
      bool hwlock_pending;
  };

  hexagon_cpu_has_work(CPUState *cs) {
      if (cpu_env(cs)->hwlock_pending) {
          return false;
      }
      ...
  }

  static void do_hwlock(CPUHexagonState *env, bool *lock)
  {
      CPUState *cs = env_cpu(env);

      bql_lock();

      if (*lock) {
          /* Contended: halt until the holder unlocks and kicks us. */
          env->hwlock_pending = true;
          cs->halted = true;
          cs->exception_index = EXCP_HALTED;
          bql_unlock();
          cpu_loop_exit(cs);
      } else {
          *lock = true;
          bql_unlock();
      }
  }

  static void do_hwunlock(CPUHexagonState *env, bool *lock)
  {
      CPUState *cs;
      BQL_LOCK_GUARD();

      *lock = false;

      CPU_FOREACH(cs) {
          if (cpu_env(cs)->hwlock_pending) {
              cpu_env(cs)->hwlock_pending = false;
              /* Wake the halted vCPU; cpu_exec() re-evaluates has_work. */
              qemu_cpu_kick(cs);
          }
      }
  }

  static bool k0lock, tlblock;

  void HELPER(k0lock)(CPUHexagonState *env)    { do_hwlock(env, &k0lock); }
  void HELPER(tlblock)(CPUHexagonState *env)   { do_hwlock(env, &tlblock); }
  void HELPER(k0unlock)(CPUHexagonState *env)  { do_hwunlock(env, &k0lock); }
  void HELPER(tlbunlock)(CPUHexagonState *env) { do_hwunlock(env, &tlblock); }

Open questions are:

(1) Do interrupts cancel lock wait?  Either way, placement in 
hexagon_cpu_has_work is key (see the sketch after these questions).

(2) I assume that the pc will not be advanced, so that after the kick we will re-execute 
the hwlock instruction.  There will be a thundering herd racing to grab the lock again, 
but it saves queuing logic, which might be complicated if interrupts are involved.
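
To make the placement in (1) concrete, a sketch (the interrupt-pending test 
here is a placeholder, not the actual hexagon predicate):

    static bool hexagon_cpu_has_work(CPUState *cs)
    {
        CPUHexagonState *env = cpu_env(cs);

        /* If interrupts do NOT cancel the lock wait, test hwlock_pending
         * first: the vCPU stays halted even with an interrupt pending,
         * until do_hwunlock() clears the flag and kicks it. */
        if (env->hwlock_pending) {
            return false;
        }

        /* If interrupts DO cancel the wait, the interrupt test must come
         * first, and the interrupt path must also clear hwlock_pending. */
        return cs->interrupt_request & CPU_INTERRUPT_HARD;
    }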



r~