Re: [Qemu-devel] Making TCG configurable in system mode

2016-12-16 Thread Alvise Rigo
On Wed, Dec 14, 2016 at 10:51 PM, Paolo Bonzini  wrote:
>
>> I am looking at the possibility of adding a new QEMU configuration
>> option to make TCG optional (in qemu-system-*). What I am exploring is
>> a way to exclude from the QEMU binary any TCG code not needed by KVM.
>> There has been a previous attempt at this from Paolo Bonzini, namely
>> https://github.com/bonzini/qemu/tree/disable-tcg, which eventually was
>> not upstreamed. I have been looking into this work mostly to understand
>> whether the same approach can be respun and used to support all of
>> QEMU's targets. Any input on this is welcome.
>
> Yes, it sure can!  However I suggest doing it one target at a time,
> because there can be tricky dependencies between helper files and

Indeed, doing it gradually is the right way to tackle this. I might come
back later with some more concrete work. Thank you for now.

> KVM support code.  It's fine as long as the configure script only
> allows --disable-tcg for specific targets where it works.
>
> IIRC my branch only covered x86 (or maybe PPC too?!?  I don't remember).
>
> The hardest part is making sure that it doesn't bitrot, and it's hard
> because we don't have CI for architectures other than x86.  But at least
> Peter builds on ARM, and target submaintainers do build on PPC and s390
> so it's not that bad perhaps.

Would a CI set up just to test this feature make sense?

Thank you,
alvise

>
>> I was also wondering if an approach could be based on the recent patch
>> series that allows using the TCG front end as a library --
>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg415514.html.
>> Making qemu-user and qemu-system users of such a library might help
>> make TCG optional. Obviously this solution introduces many other
>> challenges and I'm not even sure it is actually viable.
>
> I think making qemu-system use such a library would be very hard, because
> of the different implementation of qemu_ld/qemu_st in user and system
> emulation.  I don't think it's important for your purpose.
>
> Paolo



[Qemu-devel] Making TCG configurable in system mode

2016-12-14 Thread Alvise Rigo
Hi all,

I am looking at the possibility of adding a new QEMU configuration
option to make TCG optional (in qemu-system-*). What I am exploring is
a way to exclude from the QEMU binary any TCG code not needed by KVM.
There has been a previous attempt at this from Paolo Bonzini, namely
https://github.com/bonzini/qemu/tree/disable-tcg, which eventually was
not upstreamed. I have been looking into this work mostly to understand
whether the same approach can be respun and used to support all of
QEMU's targets. Any input on this is welcome.
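(To make the idea a bit more concrete, here is a minimal, self-contained sketch of the kind of build-time stubbing such an option would imply. CONFIG_TCG and tcg_only_flush() are purely illustrative names, not existing QEMU symbols, and this is not the actual implementation:)

#include <stdio.h>
#include <stdlib.h>

#ifdef CONFIG_TCG
void tcg_only_flush(void);             /* real implementation lives in TCG code */
#else
static inline void tcg_only_flush(void)
{
    /* TCG compiled out: a KVM-only binary must never reach this path. */
    abort();
}
#endif

int main(void)
{
    int tcg_in_use = 0;                /* stand-in for a runtime accel check */

    if (tcg_in_use) {
        tcg_only_flush();
    } else {
        printf("KVM-only configuration: TCG path stubbed out\n");
    }
    return 0;
}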

I was also wondering if an approach could be based on the recent patch
series that allows using the TCG front end as a library --
https://www.mail-archive.com/qemu-devel@nongnu.org/msg415514.html.
Making qemu-user and qemu-system users of such a library might help
make TCG optional. Obviously this solution introduces many other
challenges and I'm not even sure it is actually viable.

I would like to hear your opinion on this, ideally to identify what
would be the best direction to follow for bringing this new feature
into QEMU.

Thank you,
alvise



Re: [Qemu-devel] [RFC 7/8] cpu-exec-common: Introduce async_safe_run_on_cpu()

2016-07-01 Thread Alvise Rigo
Hi Sergey,

On Mon, Jun 20, 2016 at 12:28 AM, Sergey Fedorov
<sergey.fedo...@linaro.org> wrote:
>
> From: Sergey Fedorov <serge.f...@gmail.com>
>
> This patch is based on the ideas found in work of KONRAD Frederic [1],
> Alex Bennée [2], and Alvise Rigo [3].
>
> This mechanism allows performing an operation safely in a quiescent
> state. Quiescent state means: (1) no vCPU is running and (2) the BQL in
> system-mode, or 'exclusive_lock' in user-mode emulation, is held while
> performing the operation. This functionality is required e.g. for
> performing a translation buffer flush safely in multi-threaded user-mode
> emulation.
>
> The existing CPU work queue is used to schedule such safe operations. A
> new 'safe' flag is added to struct qemu_work_item to designate the
> special requirements of the safe work. An operation in a quiescent state
> can be scheduled with the async_safe_run_on_cpu() function, which is
> essentially the same as async_run_on_cpu() except that it marks the
> queued work item with the 'safe' flag set to true. With this flag set,
> queue_work_on_cpu() atomically increments the 'safe_work_pending' global
> counter and kicks all the CPUs instead of just the target CPU, as is
> done for normal CPU work. This forces the other CPUs to exit their
> execution loops and wait in the wait_safe_cpu_work() function for the
> safe work to finish. When a CPU drains its work queue and encounters a
> work item marked as safe, it first waits for the other CPUs to exit
> their execution loops, then calls the work item function, and finally
> decrements the 'safe_work_pending' counter, signalling the other CPUs to
> continue execution as soon as all pending safe work items have been
> processed. The 'tcg_pending_cpus' counter, protected by 'exclusive_lock'
> in user-mode or by 'qemu_global_mutex' in system-mode emulation, is used
> to determine whether any CPU is still running and to wait for it to exit
> the execution loop. Fairness of all the CPU work queues is ensured by
> draining all the pending safe work items before any CPU can run.
>
> [1] http://lists.nongnu.org/archive/html/qemu-devel/2015-08/msg01128.html
> [2] http://lists.nongnu.org/archive/html/qemu-devel/2016-04/msg02531.html
> [3] http://lists.nongnu.org/archive/html/qemu-devel/2016-05/msg04792.html
>
> Signed-off-by: Sergey Fedorov <serge.f...@gmail.com>
> Signed-off-by: Sergey Fedorov <sergey.fedo...@linaro.org>
> ---
>  cpu-exec-common.c   | 45 -
>  cpus.c  | 16 
>  include/exec/exec-all.h |  2 ++
>  include/qom/cpu.h   | 14 ++
>  linux-user/main.c   |  2 +-
>  5 files changed, 77 insertions(+), 2 deletions(-)
>
> diff --git a/cpu-exec-common.c b/cpu-exec-common.c
> index 8184e0662cbd..3056324738f8 100644
> --- a/cpu-exec-common.c
> +++ b/cpu-exec-common.c
> @@ -25,6 +25,7 @@
>
>  bool exit_request;
>  CPUState *tcg_current_cpu;
> +int tcg_pending_cpus;
>
>  /* exit the current TB, but without causing any exception to be raised */
>  void cpu_loop_exit_noexc(CPUState *cpu)
> @@ -78,6 +79,15 @@ void cpu_loop_exit_restore(CPUState *cpu, uintptr_t pc)
>  siglongjmp(cpu->jmp_env, 1);
>  }
>
> +static int safe_work_pending;
> +
> +void wait_safe_cpu_work(void)
> +{
> +while (atomic_mb_read(&safe_work_pending) > 0) {
> +wait_cpu_work();
> +}
> +}
> +

Is this piece of code deadlock-safe once we are in MTTCG mode?
What happens when two threads call async_safe_run_on_cpu() simultaneously?

Thank you,
alvise

>
>  static void queue_work_on_cpu(CPUState *cpu, struct qemu_work_item *wi)
>  {
>  qemu_mutex_lock(&cpu->work_mutex);
> @@ -89,9 +99,18 @@ static void queue_work_on_cpu(CPUState *cpu, struct 
> qemu_work_item *wi)
>  cpu->queued_work_last = wi;
>  wi->next = NULL;
>  wi->done = false;
> +if (wi->safe) {
> +atomic_inc(&safe_work_pending);
> +}
>  qemu_mutex_unlock(&cpu->work_mutex);
>
> -qemu_cpu_kick(cpu);
> +if (!wi->safe) {
> +qemu_cpu_kick(cpu);
> +} else {
> +CPU_FOREACH(cpu) {
> +qemu_cpu_kick(cpu);
> +}
> +}
>  }
>
>  void run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
> @@ -106,6 +125,7 @@ void run_on_cpu(CPUState *cpu, run_on_cpu_func func, void 
> *data)
>  wi.func = func;
>  wi.data = data;
>  wi.free = false;
> +wi.safe = false;
>
>  queue_work_on_cpu(cpu, &wi);
>  while (!atomic_mb_read(&wi.done)) {
> @@ -129,6 +149,20 @@ void async_run_on_cpu(CPUState *cpu, run_on_cpu_func 
> func, void *data)
>  wi->func = func;
>  wi->data = data;
>  wi-

Re: [Qemu-devel] Any topics for today's MTTCG sync-up call?

2016-06-20 Thread alvise rigo
On Mon, Jun 20, 2016 at 4:12 PM, Alex Bennée <alex.ben...@linaro.org> wrote:

>
> alvise rigo <a.r...@virtualopensystems.com> writes:
>
> > Hi Alex,
> >
> > I'm looking into the concerns that Sergey raised in his review of the
> > last LL/SC series. The goal is to reduce the TLB flushes by using an
> > exclusive history of dynamic length. I don't have anything ready yet,
> > though.
>
> Are you also tackling the race condition and ensuring all flushes are
> done before the critical work?
>

Yes, I am.


>
> Sergey has posted an RFC for his "quiescent work" solution which is
> worth looking at.
>

I will see if it fits well with my use case and let him know.

Regards,
alvise


>
> >
> > Best regards,
> > alvise
> >
> > On Mon, Jun 20, 2016 at 1:57 PM, Alex Bennée <alex.ben...@linaro.org>
> wrote:
> >>
> >> Hi,
> >>
> >> We missed the last call (sorry I was travelling). Have we any topics we
> >> would like to cover this week?
> >>
> >> --
> >> Alex Bennée
>
>
> --
> Alex Bennée
>


Re: [Qemu-devel] Any topics for today's MTTCG sync-up call?

2016-06-20 Thread alvise rigo
Hi Alex,

I'm looking into the concerns that Sergey raised in his review of the
last LL/SC series. The goal is to reduce the TLB flushes by using an
exclusive history of dynamic length. I don't have anything ready yet,
though.

Best regards,
alvise

On Mon, Jun 20, 2016 at 1:57 PM, Alex Bennée  wrote:
>
> Hi,
>
> We missed the last call (sorry I was travelling). Have we any topics we
> would like to cover this week?
>
> --
> Alex Bennée



Re: [Qemu-devel] exec: Safe work in quiescent state

2016-06-15 Thread alvise rigo
On Wed, Jun 15, 2016 at 4:51 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> alvise rigo <a.r...@virtualopensystems.com> writes:
>
>> Hi Sergey,
>>
>> Nice review of the implementations we have so far.
>> Just a few comments below.
>>
>> On Wed, Jun 15, 2016 at 2:59 PM, Sergey Fedorov <serge.f...@gmail.com> wrote:
>>> On 10/06/16 00:51, Sergey Fedorov wrote:
>>>> For certain kinds of tasks we might need a quiescent state to perform an
>>>> operation safely. Quiescent state means no CPU thread executing, and
>>>> probably BQL held as well. The tasks could include:
> 
>>>
>>> Alvise's async_wait_run_on_cpu() [3]:
>>> - uses the same queue as async_run_on_cpu();
>>> - the CPU that requested the job is recorded in qemu_work_item;
>>> - each CPU has a counter of such jobs it has requested;
>>> - the counter is decremented upon job completion;
>>> - only the target CPU is forced to exit the execution loop, i.e. the job
>>> is not run in quiescent state;
>>
>> async_wait_run_on_cpu() kicks the target VCPU before calling
>> cpu_exit() on the current VCPU, so all the VCPUs are forced to exit.
>> Moreover, the current VCPU waits for all the tasks to be completed.
>
> The effect of qemu_cpu_kick() for TCG is effectively just doing a
> cpu_exit() anyway. Once done, any TCG code will exit on its next
> intra-block transition.
>
>>
> 
>>> Distilling the requirements, safe work mechanism should:
>>> - support both system and user-mode emulation;
>>> - allow to schedule an asynchronous operation to be performed out of CPU
>>> execution loop;
>>> - guarantee that all CPUs are out of execution loop before the operation
>>> can begin;
>>
>> This requirement is probably not necessary if we need to request TLB
>> flushes from other VCPUs, since every VCPU will flush its own TLB.
>> For this reason we probably need two mechanisms:
>> - The first allows a VCPU to submit a job to all the others and wait
>> for all of them to be done (as for a global TLB flush)
>
> Do we need to wait?

Yes, otherwise the instruction (like an MCR that performs TLB
invalidation) is not completely emulated before the following one is
executed.
Waiting is also required during LL emulation, since it avoids possible
race conditions.

>
>> - The second allows a VCPU to perform a task in quiescent state i.e.
>> the task starts and finishes when all VCPUs are out of the execution
>> loop (translation buffer flush)
>
> If you really want to ensure everything is done then you can exit the
> block early. To get the sort of dsb() flush semantics mentioned you
> simply:
>
>   - queue your async safe work
>   - exit block on dsb()
>
>   This ensures by the time the TCG thread restarts for the next
>   instruction all pending work has been flushed.
>
>> Does this make sense?
>
> I think we want one way of doing things for anything that is Cross CPU
> and requires a degree of synchronisation. If it ends up being too
> expensive then we can look at more efficient special casing solutions.

OK, I agree that we should start with an approach that fits the two use cases.

Thank you,
alvise

>
>>
>>> - guarantee that no CPU enters execution loop before all the scheduled
>>> operations are complete.
>>
>> This is probably too much in some cases, for the reasons given above.
>>
>> Best regards,
>> alvise
>>
>>>
>>> If that sounds like a sane approach, I'll come up with a more specific
>>> solution to discuss. The solution could be merged into v2.7 along with
>>> safe translation buffer flush in user-mode as an actual use case. Safe
>>> cross-CPU TLB flush would become a part of MTTCG work. Comments,
>>> suggestions, arguments etc. are welcome!
>>>
>>> [1] http://thread.gmane.org/gmane.comp.emulators.qemu/355323/focus=355632
>>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/407030/focus=407039
>>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=413982
>>> [4] http://thread.gmane.org/gmane.comp.emulators.qemu/356765/focus=356789
>>> [5] http://thread.gmane.org/gmane.comp.emulators.qemu/397295/focus=397301
>>> [6] http://thread.gmane.org/gmane.comp.emulators.qemu/413978/focus=417231
>>>
>>> Kind regards,
>>> Sergey
>
>
> --
> Alex Bennée



Re: [Qemu-devel] exec: Safe work in quiescent state

2016-06-15 Thread alvise rigo
Hi Sergey,

Nice review of the implementations we have so far.
Just a few comments below.

On Wed, Jun 15, 2016 at 2:59 PM, Sergey Fedorov  wrote:
> On 10/06/16 00:51, Sergey Fedorov wrote:
>> For certain kinds of tasks we might need a quiescent state to perform an
>> operation safely. Quiescent state means no CPU thread executing, and
>> probably BQL held as well. The tasks could include:
>> - Translation buffer flush (user and system-mode)
>> - Cross-CPU TLB flush (system-mode)
>> - Exclusive operation emulation (user-mode)
>>
>> If we use a single shared translation buffer which is not managed by RCU
>> and simply flushed when full, we'll need a quiescent state to flush it
>> safely.
>>
>> In multi-threaded TCG, cross-CPU TLB flush from TCG helpers could
>> probably be made with async_run_on_cpu(). I suppose it is always the
>> guest system that needs to synchronise this operation properly. And as
>> soon as we request the target CPU to exit its execution loop for serving
>> the asynchronous work, we should probably be okay to continue execution
>> on the CPU that requested the operation, while the target CPU executes
>> to the end of its current TB before it actually flushes its TLB.
>>
>> As of slow-path LL/SC emulation in multi-threaded TCG, cross-CPU TLB
>> flushes (actually TLB flushes on all CPUs) must be done synchronously
>> and thus might require quiescent state.
>>
>> Exclusive operation emulation in user-mode is currently implemented in
>> this manner, see for start_exclusive(). It might change to some generic
>> mechanism of atomic/exclusive instruction emulation for system and
>> user-mode.
>>
>> It looks like we need to implement a common mechanism to perform safe
>> work in a quiescent state which could work in both system and user-mode,
>> at least for safe translation buffer flush in user-mode and MTTCG. I'm
>> going to implement such a mechanism. I would appreciate any suggestions,
>> comments and remarks.
>
> Considering different attempts to implement similar functionality, I've
> got the following summary.
>
> Fred's original async_run_safe_work_on_cpu() [1]:
> - resembles async_run_on_cpu();
> - introduces a per-CPU safe work queue, a per-CPU flag to prevent the
> CPU from executing code, and a global counter of pending jobs;
> - implements rather complicated scheduling of jobs relying on both the
> per-CPU flag and the global counter;
> - may be not entirely safe when draining work queues if multiple CPUs
> have scheduled safe work;
> - does not support user-mode emulation.
>
> Alex's reiteration of Fred's approach [2]:
> - maintains a single global safe work queue;
> - uses GArray rather than linked list to implement the work queue;
> - introduces a global counter of CPUs which have entered their execution
> loop;
> - makes use of the last CPU exited its execution loop to drain the safe
> work queue;
> - still does not support user-mode emulation.
>
> Alvise's async_wait_run_on_cpu() [3]:
> - uses the same queue as async_run_on_cpu();
> - the CPU that requested the job is recorded in qemu_work_item;
> - each CPU has a counter of such jobs it has requested;
> - the counter is decremented upon job completion;
> - only the target CPU is forced to exit the execution loop, i.e. the job
> is not run in quiescent state;

async_wait_run_on_cpu() kicks the target VCPU before calling
cpu_exit() on the current VCPU, so all the VCPUs are forced to exit.
Moreover, the current VCPU waits for all the tasks to be completed.

> - does not support user-mode emulation.
>
> Emilio's cpu_tcg_sched_work() [4]:
> - exploits tb_lock() to force CPUs exit their execution loop;
> - requires 'tb_lock' to be held when scheduling a job;
> - allows each CPU to schedule only a single job;
> - handles scheduled work right in cpu_exec();
> - exploits synchronize_rcu() to wait for other CPUs to exit their
> execution loop;
> - implements a complicated synchronization scheme;
> - should support both system and user-mode emulation.
>
>
> As of requirements for common safe work mechanism, each use case has its
> own considerations.
>
> Translation buffer flush just requires that no CPU is executing
> generated code during the operation.
>
> Cross-CPU TLB flush basically requires no CPU is performing TLB
> lookup/modification. Some architectures might require TLB flush be
> complete before the requesting CPU can continue execution; other might
> allow to delay it until some "synchronization point". In case of ARM,
> one of such synchronization points is DMB instruction. We might allow
> the operation to be performed asynchronously and continue execution, but
> we'd need to end TB and synchronize on each DMB instruction. That
> doesn't seem very efficient. So a simple approach to force the operation
> to complete before executing anything else would probably make sense in
> both cases. Slow-path LL/SC emulation also requires cross-CPU TLB flush
> to be complete before it can finish emulation of a LL 

Re: [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading

2016-06-14 Thread alvise rigo
1. LL(x)   // x requires a flush
2. query flush to all the n VCPUs
3. exit from the CPU loop and wait until all the flushes are done
4. enter the loop to re-execute LL(x). This time no flush is required

Now, points 2. and 3. can be done either with n calls of
async_safe_run_on_cpu() or with n calls of async_wait_run_on_cpu(). In my
opinion the former is not really suited for this use case, since it would
call cpu_exit() n^2 times and it would not really ensure that each VCPU
has exited from the guest code to make an iteration of the CPU loop.
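(A self-contained toy model of the four steps above, just to make the retry explicit; all the names are made up, and the real series of course uses the QEMU helpers rather than these stubs:)

#include <stdbool.h>
#include <stdio.h>

static bool addr_in_excl_history;     /* step 4 succeeds once this is set */

static void flush_all_vcpus(void)     /* stands in for the n waited-for flushes */
{
    printf("  queue TLB flush on every vCPU and wait for completion\n");
    addr_in_excl_history = true;
}

static bool try_ldlink(void)
{
    if (!addr_in_excl_history) {      /* step 1: LL(x) needs a flush */
        flush_all_vcpus();            /* steps 2-3: request flushes, leave loop */
        return false;                 /* the TB ends; LL(x) will be retried */
    }
    printf("  LL(x) re-executed, no flush needed\n");   /* step 4 */
    return true;
}

int main(void)
{
    while (!try_ldlink()) {
        printf("re-entering the CPU loop\n");
    }
    return 0;
}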

Regards,
alvise

On Tue, Jun 14, 2016 at 2:00 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> alvise rigo <a.r...@virtualopensystems.com> writes:
>
>> On Fri, Jun 10, 2016 at 5:21 PM, Sergey Fedorov <serge.f...@gmail.com> wrote:
>>> On 26/05/16 19:35, Alvise Rigo wrote:
>>>> Using tcg_exclusive_{lock,unlock}(), make the emulation of
>>>> LoadLink/StoreConditional thread safe.
>>>>
>>>> During an LL access, this lock protects the load access itself, the
>>>> update of the exclusive history and the update of the VCPU's protected
>>>> range.  In a SC access, the lock protects the store access itself, the
>>>> possible reset of other VCPUs' protected range and the reset of the
>>>> exclusive context of calling VCPU.
>>>>
>>>> The lock is also taken when a normal store happens to access an
>>>> exclusive page to reset other VCPUs' protected range in case of
>>>> collision.
>>>
>>> I think the key problem here is that the load in LL helper can race with
>>> a concurrent regular fast-path store. It's probably easier to annotate
>>> the source here:
>>>
>>>  1  WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>>  2  TCGMemOpIdx oi, uintptr_t retaddr)
>>>  3  {
>>>  4  WORD_TYPE ret;
>>>  5  int index;
>>>  6  CPUState *this_cpu = ENV_GET_CPU(env);
>>>  7  CPUClass *cc = CPU_GET_CLASS(this_cpu);
>>>  8  hwaddr hw_addr;
>>>  9  unsigned mmu_idx = get_mmuidx(oi);
>>>
>>> 10  index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>>
>>> 11  tcg_exclusive_lock();
>>>
>>> 12  /* Use the proper load helper from cpu_ldst.h */
>>> 13  ret = helper_ld(env, addr, oi, retaddr);
>>>
>>> 14  /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr
>>> + xlat)
>>> 15   * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>> 16  hw_addr = (env->iotlb[mmu_idx][index].addr &
>>> TARGET_PAGE_MASK) + addr;
>>> 17  if (likely(!(env->tlb_table[mmu_idx][index].addr_read &
>>> TLB_MMIO))) {
>>> 18  /* If all the vCPUs have the EXCL bit set for this page
>>> there is no need
>>> 19   * to request any flush. */
>>> 20  if (cpu_physical_memory_set_excl(hw_addr)) {
>>> 21  CPUState *cpu;
>>>
>>> 22  excl_history_put_addr(hw_addr);
>>> 23  CPU_FOREACH(cpu) {
>>> 24  if (this_cpu != cpu) {
>>> 25  tlb_flush_other(this_cpu, cpu, 1);
>>> 26  }
>>> 27  }
>>> 28  }
>>> 29  /* For this vCPU, just update the TLB entry, no need to
>>> flush. */
>>> 30  env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
>>> 31  } else {
>>> 32  /* Set a pending exclusive access in the MemoryRegion */
>>> 33  MemoryRegion *mr = iotlb_to_region(this_cpu,
>>> 34
>>> env->iotlb[mmu_idx][index].addr,
>>> 35
>>> env->iotlb[mmu_idx][index].attrs);
>>> 36  mr->pending_excl_access = true;
>>> 37  }
>>>
>>> 38  cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>>>
>>> 39  tcg_exclusive_unlock();
>>>
>>> 40  /* From now on we are in LL/SC context */
>>> 41  this_cpu->ll_sc_context = true;
>>>
>>> 42  return ret;
>>> 43  }
>>>
>>>
>>> The exclusive lock at line 11 doesn't help if concurrent fast-patch
>>> store at this address occurs after we finished load at line 13 but
>>> b

Re: [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading

2016-06-10 Thread alvise rigo
This would require filling the whole history again, which I find very
unlikely. In any case, this has to be documented.

Thank you,
alvise

On Fri, Jun 10, 2016 at 6:00 PM, Sergey Fedorov <serge.f...@gmail.com> wrote:
> On 10/06/16 18:53, alvise rigo wrote:
>> On Fri, Jun 10, 2016 at 5:21 PM, Sergey Fedorov <serge.f...@gmail.com> wrote:
>>> On 26/05/16 19:35, Alvise Rigo wrote:
>>>> Using tcg_exclusive_{lock,unlock}(), make the emulation of
>>>> LoadLink/StoreConditional thread safe.
>>>>
>>>> During an LL access, this lock protects the load access itself, the
>>>> update of the exclusive history and the update of the VCPU's protected
>>>> range.  In a SC access, the lock protects the store access itself, the
>>>> possible reset of other VCPUs' protected range and the reset of the
>>>> exclusive context of calling VCPU.
>>>>
>>>> The lock is also taken when a normal store happens to access an
>>>> exclusive page to reset other VCPUs' protected range in case of
>>>> collision.
>>> I think the key problem here is that the load in LL helper can race with
>>> a concurrent regular fast-path store. It's probably easier to annotate
>>> the source here:
>>>
>>>  1  WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>>>  2  TCGMemOpIdx oi, uintptr_t retaddr)
>>>  3  {
>>>  4  WORD_TYPE ret;
>>>  5  int index;
>>>  6  CPUState *this_cpu = ENV_GET_CPU(env);
>>>  7  CPUClass *cc = CPU_GET_CLASS(this_cpu);
>>>  8  hwaddr hw_addr;
>>>  9  unsigned mmu_idx = get_mmuidx(oi);
>>>
>>> 10  index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>>
>>> 11  tcg_exclusive_lock();
>>>
>>> 12  /* Use the proper load helper from cpu_ldst.h */
>>> 13  ret = helper_ld(env, addr, oi, retaddr);
>>>
>>> 14  /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr
>>> + xlat)
>>> 15   * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
>>> 16  hw_addr = (env->iotlb[mmu_idx][index].addr &
>>> TARGET_PAGE_MASK) + addr;
>>> 17  if (likely(!(env->tlb_table[mmu_idx][index].addr_read &
>>> TLB_MMIO))) {
>>> 18  /* If all the vCPUs have the EXCL bit set for this page
>>> there is no need
>>> 19   * to request any flush. */
>>> 20  if (cpu_physical_memory_set_excl(hw_addr)) {
>>> 21  CPUState *cpu;
>>>
>>> 22  excl_history_put_addr(hw_addr);
>>> 23  CPU_FOREACH(cpu) {
>>> 24  if (this_cpu != cpu) {
>>> 25  tlb_flush_other(this_cpu, cpu, 1);
>>> 26  }
>>> 27  }
>>> 28  }
>>> 29  /* For this vCPU, just update the TLB entry, no need to
>>> flush. */
>>> 30  env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
>>> 31  } else {
>>> 32  /* Set a pending exclusive access in the MemoryRegion */
>>> 33  MemoryRegion *mr = iotlb_to_region(this_cpu,
>>> 34
>>> env->iotlb[mmu_idx][index].addr,
>>> 35
>>> env->iotlb[mmu_idx][index].attrs);
>>> 36  mr->pending_excl_access = true;
>>> 37  }
>>>
>>> 38  cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>>>
>>> 39  tcg_exclusive_unlock();
>>>
>>> 40  /* From now on we are in LL/SC context */
>>> 41  this_cpu->ll_sc_context = true;
>>>
>>> 42  return ret;
>>> 43  }
>>>
>>>
>>> The exclusive lock at line 11 doesn't help if concurrent fast-patch
>>> store at this address occurs after we finished load at line 13 but
>>> before TLB is flushed as a result of line 25. If we reorder the load to
>>> happen after the TLB flush request we still must be sure that the flush
>>> is complete before we can do the load safely.
>> You are right, the risk actually exists. One solution to the problem
>> could be to ignore the data acquired by the load and redo the LL after
>> the flushes have been completed (basically the disas_ctx->pc points to
>> the LL instruction). This time the LL will happen without flush
>> requests and the access will be actually protected by the lock.
>
> Yes, if some other CPU wouldn't evict an entry with the same address
> from the exclusive history...
>
> Kind regards,
> Sergey



Re: [Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading

2016-06-10 Thread alvise rigo
On Fri, Jun 10, 2016 at 5:21 PM, Sergey Fedorov <serge.f...@gmail.com> wrote:
> On 26/05/16 19:35, Alvise Rigo wrote:
>> Using tcg_exclusive_{lock,unlock}(), make the emulation of
>> LoadLink/StoreConditional thread safe.
>>
>> During an LL access, this lock protects the load access itself, the
>> update of the exclusive history and the update of the VCPU's protected
>> range.  In a SC access, the lock protects the store access itself, the
>> possible reset of other VCPUs' protected range and the reset of the
>> exclusive context of calling VCPU.
>>
>> The lock is also taken when a normal store happens to access an
>> exclusive page to reset other VCPUs' protected range in case of
>> collision.
>
> I think the key problem here is that the load in LL helper can race with
> a concurrent regular fast-path store. It's probably easier to annotate
> the source here:
>
>  1  WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
>  2  TCGMemOpIdx oi, uintptr_t retaddr)
>  3  {
>  4  WORD_TYPE ret;
>  5  int index;
>  6  CPUState *this_cpu = ENV_GET_CPU(env);
>  7  CPUClass *cc = CPU_GET_CLASS(this_cpu);
>  8  hwaddr hw_addr;
>  9  unsigned mmu_idx = get_mmuidx(oi);
>
> 10  index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>
> 11  tcg_exclusive_lock();
>
> 12  /* Use the proper load helper from cpu_ldst.h */
> 13  ret = helper_ld(env, addr, oi, retaddr);
>
> 14  /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr
> + xlat)
> 15   * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
> 16  hw_addr = (env->iotlb[mmu_idx][index].addr &
> TARGET_PAGE_MASK) + addr;
> 17  if (likely(!(env->tlb_table[mmu_idx][index].addr_read &
> TLB_MMIO))) {
> 18  /* If all the vCPUs have the EXCL bit set for this page
> there is no need
> 19   * to request any flush. */
> 20  if (cpu_physical_memory_set_excl(hw_addr)) {
> 21  CPUState *cpu;
>
> 22  excl_history_put_addr(hw_addr);
> 23  CPU_FOREACH(cpu) {
> 24  if (this_cpu != cpu) {
> 25  tlb_flush_other(this_cpu, cpu, 1);
> 26  }
> 27  }
> 28  }
> 29  /* For this vCPU, just update the TLB entry, no need to
> flush. */
> 30  env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
> 31  } else {
> 32  /* Set a pending exclusive access in the MemoryRegion */
> 33  MemoryRegion *mr = iotlb_to_region(this_cpu,
> 34
> env->iotlb[mmu_idx][index].addr,
> 35
> env->iotlb[mmu_idx][index].attrs);
> 36  mr->pending_excl_access = true;
> 37  }
>
> 38  cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
>
> 39  tcg_exclusive_unlock();
>
> 40  /* From now on we are in LL/SC context */
> 41  this_cpu->ll_sc_context = true;
>
> 42  return ret;
> 43  }
>
>
> The exclusive lock at line 11 doesn't help if concurrent fast-patch
> store at this address occurs after we finished load at line 13 but
> before TLB is flushed as a result of line 25. If we reorder the load to
> happen after the TLB flush request we still must be sure that the flush
> is complete before we can do the load safely.

You are right, the risk actually exists. One solution to the problem
could be to ignore the data acquired by the load and redo the LL after
the flushes have been completed (basically the disas_ctx->pc points to
the LL instruction). This time the LL will happen without flush
requests and the access will be actually protected by the lock.

Regards,
alvise

>
>>
>> Moreover, adapt target-arm to also cope with the new multi-threaded
>> execution.
>>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  softmmu_llsc_template.h | 11 +--
>>  softmmu_template.h  |  6 ++
>>  target-arm/op_helper.c  |  6 ++
>>  3 files changed, 21 insertions(+), 2 deletions(-)
>>
>> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
>> index 2c4a494..d3810c0 100644
>> --- a/softmmu_llsc_template.h
>> +++ b/softmmu_llsc_template.h
>> @@ -62,11 +62,13 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, 
>> target_ulong addr,
>>  hwaddr hw_addr;
>>  unsigned mmu_idx = get_mmuidx(oi);
>>
>> +index = (addr >> TAR

Re: [Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns

2016-06-10 Thread alvise rigo
I might have broken something while rebasing on top of
enable-mttcg-for-armv7-v1.
I will sort this problem out.

Thank you,
alvise

On Fri, Jun 10, 2016 at 5:21 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> Hi,
>>
>> This series ports the latest iteration of the LL/SC work on top of the
>> latest MTTCG reference branch posted recently by Alex.
>>
>> These patches apply on top of the following series:
>>
>> - [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86
>>   https://github.com/stsquad/qemu/tree/mttcg/enable-mttcg-for-armv7-v1
>> - [RFC v8 00/14] Slow-path for atomic instruction translation
>>   https://git.virtualopensystems.com/dev/qemu-mt/tree/\
>>   slowpath-for-atomic-v8-no-mttcg - only minor changes have been necessary
>> - Few recent patches from Emilio regarding the spinlock implementation
>>
>> Overall, these patches allow the LL/SC infrastructure to work in 
>> multi-threaded
>> mode (patches 01-02-04) and make TLB flushes to other VCPUs safe.
>>
>> Patch 03 introduces a new API to submit a work item to a VCPU and wait for 
>> its
>> completion. This API is used to query TLB flushes that result from the
>> emulation of some ARM instructions. Patches 07, 08 and 09 modify the current
>> tlb_flush_* functions to use the new API.  Patch 10 fixes a rare hang that I
>> was experiencing with this branch.
>>
>> The whole work can be fetched from the following repository:
>> g...@git.virtualopensystems.com:dev/qemu-mt.git
>> at the branch "slowpath-for-atomic-v8-mttcg".
>
> Hmm this branch has build failures for linux-user and other
> architectures. Is this the latest one?
>
>>
>> Alvise Rigo (10):
>>   exec: Introduce tcg_exclusive_{lock,unlock}()
>>   softmmu_llsc_template.h: Move to multi-threading
>>   cpus: Introduce async_wait_run_on_cpu()
>>   cputlb: Introduce tlb_flush_other()
>>   target-arm: End TB after ldrex instruction
>>   cputlb: Add tlb_tables_flush_bitmap()
>>   cputlb: Query tlb_flush_by_mmuidx
>>   cputlb: Query tlb_flush_page_by_mmuidx
>>   cputlb: Query tlb_flush_page_all
>>   cpus: Do not sleep if some work item is pending
>>
>>  cpus.c |  48 ++-
>>  cputlb.c   | 202 
>> ++---
>>  exec.c |  18 
>>  include/exec/exec-all.h|  13 +--
>>  include/qom/cpu.h  |  36 
>>  softmmu_llsc_template.h|  13 ++-
>>  softmmu_template.h |   6 ++
>>  target-arm/helper.c|  79 +-
>>  target-arm/op_helper.c |   6 ++
>>  target-arm/translate-a64.c |   2 +
>>  target-arm/translate.c |   2 +
>>  11 files changed, 327 insertions(+), 98 deletions(-)
>
>
> --
> Alex Bennée



Re: [Qemu-devel] [RFC v8 00/14] Slow-path for atomic instruction translation

2016-06-09 Thread alvise rigo
Hi Sergey,

Thank you for this precise summary.

On Thu, Jun 9, 2016 at 1:42 PM, Sergey Fedorov <serge.f...@gmail.com> wrote:
> Hi,
>
> On 19/04/16 16:39, Alvise Rigo wrote:
>> This patch series provides an infrastructure for atomic instruction
>> implementation in QEMU, thus offering a 'legacy' solution for
>> translating guest atomic instructions. Moreover, it can be considered as
>> a first step toward a multi-thread TCG.
>>
>> The underlying idea is to provide new TCG helpers (sort of softmmu
>> helpers) that guarantee atomicity to some memory accesses or in general
>> a way to define memory transactions.
>>
>> More specifically, the new softmmu helpers behave as LoadLink and
>> StoreConditional instructions, and are called from TCG code by means of
>> target specific helpers. This work includes the implementation for all
>> the ARM atomic instructions, see target-arm/op_helper.c.
>
> I think that is a generally good idea to provide LL/SC TCG operations
> for emulating guest atomic instruction behaviour as those operations
> allow to implement other atomic primitives such as copmare-and-swap and
> atomic arithmetic easily. Another advantage of these operations is that
> they are free from ABA problem.
>
>> The implementation heavily uses the software TLB together with a new
>> bitmap that has been added to the ram_list structure which flags, on a
>> per-CPU basis, all the memory pages that are in the middle of a LoadLink
>> (LL), StoreConditional (SC) operation.  Since all these pages can be
>> accessed directly through the fast-path and alter a vCPU's linked value,
>> the new bitmap has been coupled with a new TLB flag for the TLB virtual
>> address which forces the slow-path execution for all the accesses to a
>> page containing a linked address.
>
> But I'm afraid we've got a scalability problem using software TLB engine
> heavily. This approach relies on TLB flush of all CPUs which is not very
> cheap operation. That is going to be even more expansive in case of
> MTTCG as you need to exit the CPU execution loop in order to avoid
> deadlocks.
>
> I see you try mitigate this issue by introducing a history of N last
> pages touched by an exclusive access. That would work fine avoiding
> excessive TLB flushes as long as the current working set of exclusively
> accessed pages does not go beyond N. Once we exceed this limit we'll get
> a global TLB flush on most LL operations. I'm afraid we can get dramatic

Indeed, if the guest does a loop of N+1 atomic operations, at each
iteration we will have N flushes.
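(To make the effect concrete, a small self-contained simulation of a FIFO history of length N being walked by a guest loop over N+1 distinct pages; the data structure and replacement policy here are only illustrative, so the exact flush count may differ from the real history, but the point is that almost every LL misses and requests a flush:)

#include <stdio.h>

#define HIST_LEN  4                    /* N entries in the history */
#define N_ADDRS   (HIST_LEN + 1)       /* guest loops over N+1 pages */

static long hist[HIST_LEN];
static int hist_pos;

/* Returns 1 when a global TLB flush would be requested (history miss). */
static int excl_history_put(long addr)
{
    for (int i = 0; i < HIST_LEN; i++) {
        if (hist[i] == addr) {
            return 0;                  /* hit: no flush needed */
        }
    }
    hist[hist_pos] = addr;             /* miss: evict oldest, request flush */
    hist_pos = (hist_pos + 1) % HIST_LEN;
    return 1;
}

int main(void)
{
    int flushes = 0;

    for (int iter = 0; iter < 100; iter++) {
        for (long a = 1; a <= N_ADDRS; a++) {
            flushes += excl_history_put(a * 4096);
        }
    }
    printf("flushes over 100 iterations of the guest loop: %d\n", flushes);
    return 0;
}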

> performance decrease as guest code implements finer-grained locking
> scheme. I would like to emphasise that performance can degrade sharply
> and dramatically as soon as the limit gets exceeded. How could we tackle
> this problem?

In my opinion, the length of the history should not be fixed, precisely to
avoid the drawback above. We can make the history's length dynamic (up to
a given threshold) according to the pressure of atomic instructions. What
should remain constant is the time it takes to make a full cycle of the
history's array. We could, for instance, store in the lower bits of the
addresses in the history a sort of timestamp used to estimate that period
and adjust the length of the history accordingly.
What do you think?
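(A minimal sketch of the encoding idea, assuming the history stores page-aligned addresses so that the low bits are free for a coarse timestamp; the names and the 4K page size are illustrative only, this is not a patch:)

#include <stdint.h>
#include <stdio.h>

#define EXCL_PAGE_MASK  (~(uint64_t)0xfff)   /* 4K pages: low 12 bits are free */
#define EXCL_TS_MASK    ((uint64_t)0xfff)

/* Pack a coarse timestamp into the low bits of a page-aligned address. */
static uint64_t excl_entry_pack(uint64_t page_addr, uint64_t ts)
{
    return (page_addr & EXCL_PAGE_MASK) | (ts & EXCL_TS_MASK);
}

static uint64_t excl_entry_addr(uint64_t e) { return e & EXCL_PAGE_MASK; }
static uint64_t excl_entry_ts(uint64_t e)   { return e & EXCL_TS_MASK; }

int main(void)
{
    uint64_t e = excl_entry_pack(0x40001000, 42);

    /* Comparing the timestamp of the entry being evicted with the current
     * one gives the time needed to cycle through the array, which could
     * then drive growing or shrinking the history. */
    printf("addr=0x%llx ts=%llu\n",
           (unsigned long long)excl_entry_addr(e),
           (unsigned long long)excl_entry_ts(e));
    return 0;
}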

I will try to explore other ways to tackle the problem.

Best regards,
alvise

>
> Kind regards,
> Sergey



Re: [Qemu-devel] [RFC 03/10] cpus: Introduce async_wait_run_on_cpu()

2016-06-08 Thread alvise rigo
I think that async_safe_run_on_cpu() does a different thing: it
submits a job to the target vCPU and wants all the others to "observe"
the submitted task. However, we only have the certainty that the
target vCPU observed the task; the others might still be running
guest code.

alvise

On Wed, Jun 8, 2016 at 5:20 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Sergey Fedorov <serge.f...@gmail.com> writes:
>
>> On 08/06/16 17:10, alvise rigo wrote:
>>> Using run_on_cpu() we might deadlock QEMU if other vCPUs are waiting
>>> for the current vCPU. We need to exit from the vCPU loop in order to
>>> avoid this.
>>
>> I see, we could deadlock indeed. Alternatively, we may want to fix
>> run_on_cpu() to avoid waiting for completion by itself when called from
>> the CPU loop.
>
> async_safe_run_on_cpu can't deadlock as all vCPUs are suspended (or
> waiting) for the work to complete. The tasks are run in strict order so
> if you queued async tasks for other vCPUs first you could ensure
> everything is in the state you want it when you finally service the
> calling vCPU.
>
>>
>> Kind regards,
>> Sergey
>
>
> --
> Alex Bennée



Re: [Qemu-devel] [RFC 03/10] cpus: Introduce async_wait_run_on_cpu()

2016-06-08 Thread alvise rigo
Using run_on_cpu() we might deadlock QEMU if other vCPUs are waiting
for the current vCPU. We need to exit from the vCPU loop in order to
avoid this.
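(A self-contained toy illustration of that deadlock using plain pthreads -- the names and structure are made up; it only mimics the blocking behaviour of run_on_cpu() against a vCPU that wants everyone out of the execution loop before servicing work:)

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int work_done;    /* set by B once it runs A's work item */
static atomic_int a_left_loop;  /* set by A once it exits its CPU loop */

static void *vcpu_a(void *arg)
{
    (void)arg;
    /* run_on_cpu()-style blocking wait for the work item queued on B */
    while (!atomic_load(&work_done)) {
        /* spinning here means a_left_loop is never set */
    }
    return NULL;
}

static void *vcpu_b(void *arg)
{
    (void)arg;
    /* B wants every vCPU out of its loop before servicing queued work */
    while (!atomic_load(&a_left_loop)) {
        /* waits forever */
    }
    atomic_store(&work_done, 1);
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, vcpu_a, NULL);
    pthread_create(&b, NULL, vcpu_b, NULL);
    sleep(1);
    printf("no progress: A waits for B, B waits for A\n");
    return 0;   /* exiting main tears the demo down; cpu_exit() is the cure */
}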

Regards,
alvise

On Wed, Jun 8, 2016 at 3:54 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> Introduce a new function that allows the calling VCPU to add a work item
>> to another VCPU (aka the target VCPU). This new function differs from
>> async_run_on_cpu() since it makes the calling VCPU wait for the target
>> VCPU to finish the work item. The mechanism makes use of halt_cond to
>> wait and, if needed, to process pending work items in the meantime.
>
> Isn't this exactly what would happen if you use run_on_cpu()? That will
> stall the current vCPU and busy-wait until the work item is completed.
>
>>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  cpus.c| 44 ++--
>>  include/qom/cpu.h | 31 +++
>>  2 files changed, 73 insertions(+), 2 deletions(-)
>>
>> diff --git a/cpus.c b/cpus.c
>> index b9ec903..7bc96e2 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -89,7 +89,7 @@ static bool cpu_thread_is_idle(CPUState *cpu)
>>  if (cpu->stop || cpu->queued_work_first) {
>>  return false;
>>  }
>> -if (cpu_is_stopped(cpu)) {
>> +if (cpu_is_stopped(cpu) || async_waiting_for_work(cpu)) {
>>  return true;
>>  }
>>  if (!cpu->halted || cpu_has_work(cpu) ||
>> @@ -1012,6 +1012,7 @@ void async_run_on_cpu(CPUState *cpu, run_on_cpu_func 
>> func, void *data)
>>  wi->func = func;
>>  wi->data = data;
>>  wi->free = true;
>> +wi->wcpu = NULL;
>>
>>  qemu_mutex_lock(&cpu->work_mutex);
>>  if (cpu->queued_work_first == NULL) {
>> @@ -1027,6 +1028,40 @@ void async_run_on_cpu(CPUState *cpu, run_on_cpu_func 
>> func, void *data)
>>  qemu_cpu_kick(cpu);
>>  }
>>
>> +void async_wait_run_on_cpu(CPUState *cpu, CPUState *wcpu, run_on_cpu_func 
>> func,
>> +void 
>> *data)
>> +{
>> +struct qemu_work_item *wwi;
>> +
>> +assert(wcpu != cpu);
>> +
>> +wwi = g_malloc0(sizeof(struct qemu_work_item));
>> +wwi->func = func;
>> +wwi->data = data;
>> +wwi->free = true;
>> +wwi->wcpu = wcpu;
>> +
>> +/* Increase the number of pending work items */
>> +atomic_inc(&wcpu->pending_work_items);
>> +
>> +qemu_mutex_lock(&cpu->work_mutex);
>> +/* Add the waiting work items at the beginning to free as soon as 
>> possible
>> + * the waiting CPU. */
>> +if (cpu->queued_work_first == NULL) {
>> +cpu->queued_work_last = wwi;
>> +} else {
>> +wwi->next = cpu->queued_work_first;
>> +}
>> +cpu->queued_work_first = wwi;
>> +wwi->done = false;
>> +qemu_mutex_unlock(&cpu->work_mutex);
>> +
>> +qemu_cpu_kick(cpu);
>> +
>> +/* In order to wait, @wcpu has to exit the CPU loop */
>> +cpu_exit(wcpu);
>> +}
>> +
>>  /*
>>   * Safe work interface
>>   *
>> @@ -1120,6 +1155,10 @@ static void flush_queued_work(CPUState *cpu)
>>  qemu_mutex_unlock(&cpu->work_mutex);
>>  wi->func(cpu, wi->data);
>>  qemu_mutex_lock(&cpu->work_mutex);
>> +if (wi->wcpu != NULL) {
>> +atomic_dec(&wi->wcpu->pending_work_items);
>> +qemu_cond_broadcast(wi->wcpu->halt_cond);
>> +}
>>  if (wi->free) {
>>  g_free(wi);
>>  } else {
>> @@ -1406,7 +1445,8 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>>  while (1) {
>>  bool sleep = false;
>>
>> -if (cpu_can_run(cpu) && !async_safe_work_pending()) {
>> +if (cpu_can_run(cpu) && !async_safe_work_pending()
>> + && !async_waiting_for_work(cpu)) {
>>  int r;
>>
>>  atomic_inc(_scheduled_cpus);
>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>> index 019f06d..7be82ed 100644
>> --- a/include/qom/cpu.h
>> +++ b/include/qom/cpu.h
>> @@ -259,6 +259,8 @@ struct qemu_work_item {
>>  void *data;
>>  int done;
>>  bool free;
>> +/* CPU waiting for this work item to finish. If NULL, no 

Re: [Qemu-devel] [RFC 01/10] exec: Introduce tcg_exclusive_{lock, unlock}()

2016-06-08 Thread alvise rigo
As far as I understand, linux-user uses a mutex to make the atomic
accesses exclusive with respect to other CPUs' atomic accesses. So
basically in the LDREX case it implements:
lock() -> access() -> unlock()
This patch series makes the atomic accesses exclusive with respect to
every memory access, which is made possible by the softmmu. The access
is now something like:
lock() -> softmmu_access() -> unlock()
where "softmmu_access()" is not just a memory access, but includes a
manipulation of the EXCL bitmap and possible requests for TLB flushes.
So there are similarities, but they are pretty much confined to the
locking/unlocking of a spinlock/mutex.

This made me think: how can linux-user properly work with upstream TCG
in, for instance, an absurd configuration like target-arm on an ARM
host?

alvise

On Wed, Jun 8, 2016 at 11:21 AM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> Add tcg_exclusive_{lock,unlock}() functions that will be used for making
>> the emulation of LL and SC instructions thread safe.
>
> I wonder how much similarity there is to the mechanism linux-user ends
> up using for its exclusive-start/end?
>
>>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  cpus.c|  2 ++
>>  exec.c| 18 ++
>>  include/qom/cpu.h |  5 +
>>  3 files changed, 25 insertions(+)
>>
>> diff --git a/cpus.c b/cpus.c
>> index 860e7ba..b9ec903 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -961,6 +961,8 @@ void qemu_init_cpu_loop(void)
>>  qemu_cond_init(&qemu_work_cond);
>>  qemu_mutex_init(&qemu_global_mutex);
>>
>> +qemu_spin_init(&cpu_exclusive_lock);
>>
>>  qemu_thread_get_self(&io_thread);
>>
>>  safe_work = g_array_sized_new(TRUE, TRUE, sizeof(qemu_safe_work_item), 
>> 128);
>> diff --git a/exec.c b/exec.c
>> index a24b31c..1c72113 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -197,6 +197,24 @@ void cpu_exclusive_history_free(void)
>>  g_free(excl_history.c_array);
>>  }
>>  }
>> +
>> +__thread bool cpu_have_exclusive_lock;
>> +QemuSpin cpu_exclusive_lock;
>> +inline void tcg_exclusive_lock(void)
>> +{
>> +if (!cpu_have_exclusive_lock) {
>> +qemu_spin_lock(&cpu_exclusive_lock);
>> +cpu_have_exclusive_lock = true;
>> +}
>> +}
>> +
>> +inline void tcg_exclusive_unlock(void)
>> +{
>> +if (cpu_have_exclusive_lock) {
>> +cpu_have_exclusive_lock = false;
>> +qemu_spin_unlock(&cpu_exclusive_lock);
>> +}
>> +}
>>  #endif
>>
>>  #if !defined(CONFIG_USER_ONLY)
>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>> index 0f51870..019f06d 100644
>> --- a/include/qom/cpu.h
>> +++ b/include/qom/cpu.h
>> @@ -201,6 +201,11 @@ typedef struct CPUClass {
>>  void (*disas_set_info)(CPUState *cpu, disassemble_info *info);
>>  } CPUClass;
>>
>> +/* Protect cpu_exclusive_* variable .*/
>> +void tcg_exclusive_lock(void);
>> +void tcg_exclusive_unlock(void);
>> +extern QemuSpin cpu_exclusive_lock;
>> +
>>  #ifdef HOST_WORDS_BIGENDIAN
>>  typedef struct icount_decr_u16 {
>>  uint16_t high;
>
>
> --
> Alex Bennée



Re: [Qemu-devel] [RFC 01/10] exec: Introduce tcg_exclusive_{lock, unlock}()

2016-06-02 Thread alvise rigo
Hi Pranith,

Thank you for the hint, I will keep this in mind for the next version.

Regards,
alvise

On Tue, May 31, 2016 at 5:03 PM, Pranith Kumar <bobby.pr...@gmail.com> wrote:
> Hi Alvise,
>
> On Thu, May 26, 2016 at 12:35 PM, Alvise Rigo
> <a.r...@virtualopensystems.com> wrote:
>> Add tcg_exclusive_{lock,unlock}() functions that will be used for making
>> the emulation of LL and SC instructions thread safe.
>>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>
> 
>
>> +__thread bool cpu_have_exclusive_lock;
>> +QemuSpin cpu_exclusive_lock;
>> +inline void tcg_exclusive_lock(void)
>> +{
>> +if (!cpu_have_exclusive_lock) {
> +qemu_spin_lock(&cpu_exclusive_lock);
>> +cpu_have_exclusive_lock = true;
>> +}
>> +}
>> +
>> +inline void tcg_exclusive_unlock(void)
>> +{
>> +if (cpu_have_exclusive_lock) {
>> +cpu_have_exclusive_lock = false;
> +qemu_spin_unlock(&cpu_exclusive_lock);
>> +}
>> +}
>
> I think the unlock() here should have an assert if
> cpu_have_exclusive_lock is false. From what I can see, a thread should
> either take the exclusive lock or wait spinning for it in lock(). So
> unlock() should always see cpu_have_exclusive_lock as true. It is a
> good place to find locking bugs.
>
> --
> Pranith
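(For completeness, a sketch of tcg_exclusive_unlock() with the assert Pranith suggests, against the same thread-local flag and spinlock as in the patch above -- just an illustration, not a tested change:)

#include <assert.h>
#include <stdbool.h>

/* Same flag and lock as in the patch quoted above. */
extern __thread bool cpu_have_exclusive_lock;
extern QemuSpin cpu_exclusive_lock;

inline void tcg_exclusive_unlock(void)
{
    /* The calling thread must hold the lock; catch unbalanced unlocks
     * loudly instead of silently ignoring them. */
    assert(cpu_have_exclusive_lock);
    cpu_have_exclusive_lock = false;
    qemu_spin_unlock(&cpu_exclusive_lock);
}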



[Qemu-devel] [RFC 06/10] cputlb: Add tlb_tables_flush_bitmap()

2016-05-26 Thread Alvise Rigo
Add a simple helper function to flush the TLB at the indexes specified
by a bitmap. The function will be more useful in the following patches,
when it will be possible to request tlb_flush_by_mmuidx() on other VCPUs.

Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cputlb.c | 30 +++---
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 55f7447..5bbbf1b 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -129,15 +129,34 @@ void tlb_flush(CPUState *cpu, int flush_global)
 }
 }
 
-static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
+/* Flush tlb_table[] and tlb_v_table[] of @cpu at MMU indexes given by @bitmap.
+ * Flush also tb_jmp_cache. */
+static inline void tlb_tables_flush_bitmap(CPUState *cpu, unsigned long 
*bitmap)
 {
-CPUArchState *env = cpu->env_ptr;
+int mmu_idx;
 
 tlb_debug("start\n");
 /* must reset current TB so that interrupts cannot modify the
links while we are modifying them */
 cpu->current_tb = NULL;
 
+for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+if (test_bit(mmu_idx, bitmap)) {
+CPUArchState *env = cpu->env_ptr;
+
+tlb_debug("%d\n", mmu_idx);
+
+memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[0]));
+memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
+}
+}
+memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
+}
+
+static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
+{
+DECLARE_BITMAP(idxmap, NB_MMU_MODES) = { 0 };
+
 for (;;) {
 int mmu_idx = va_arg(argp, int);
 
@@ -145,13 +164,10 @@ static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, 
va_list argp)
 break;
 }
 
-tlb_debug("%d\n", mmu_idx);
-
-memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[0]));
-memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
+set_bit(mmu_idx, idxmap);
 }
 
-memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
+tlb_tables_flush_bitmap(cpu, idxmap);
 }
 
 void tlb_flush_by_mmuidx(CPUState *cpu, ...)
-- 
2.8.3




[Qemu-devel] [RFC 10/10] cpus: Do not sleep if some work item is pending

2016-05-26 Thread Alvise Rigo
If a VCPU returns EXCP_HALTED from guest code execution and in the
meantime receives a work item, it will go to sleep without processing
the job.

Before sleeping, check if any work has been added.

Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cpus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cpus.c b/cpus.c
index 7bc96e2..3d19a2e 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1477,7 +1477,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
 
 handle_icount_deadline();
 
-if (sleep) {
+if (sleep && cpu->queued_work_first == NULL) {
 qemu_cond_wait(cpu->halt_cond, &qemu_global_mutex);
 }
 
-- 
2.8.3




[Qemu-devel] [RFC 09/10] cputlb: Query tlb_flush_page_all

2016-05-26 Thread Alvise Rigo
Make tlb_flush_page_all() safe by waiting for the requested flushes to
actually complete, using async_wait_run_on_cpu().

Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cputlb.c| 15 ++-
 include/exec/exec-all.h |  4 ++--
 target-arm/helper.c |  4 ++--
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 77a1997..4ed0cc8 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -346,13 +346,18 @@ static void tlb_flush_page_async_work(CPUState *cpu, void 
*opaque)
 tlb_flush_page(cpu, GPOINTER_TO_UINT(opaque));
 }
 
-void tlb_flush_page_all(target_ulong addr)
+void tlb_flush_page_all(CPUState *this_cpu, target_ulong addr)
 {
-CPUState *cpu;
+CPUState *other_cpu;
 
-CPU_FOREACH(cpu) {
-async_run_on_cpu(cpu, tlb_flush_page_async_work,
- GUINT_TO_POINTER(addr));
+CPU_FOREACH(other_cpu) {
+if (other_cpu != this_cpu) {
+async_wait_run_on_cpu(other_cpu, this_cpu,
+  tlb_flush_page_async_work,
+  GUINT_TO_POINTER(addr));
+} else {
+tlb_flush_page(current_cpu, addr);
+}
 }
 }
 
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index cb891d2..36f1b81 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -191,7 +191,7 @@ void tlb_set_page(CPUState *cpu, target_ulong vaddr,
 void tb_invalidate_phys_addr(AddressSpace *as, hwaddr addr);
 void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
  uintptr_t retaddr);
-void tlb_flush_page_all(target_ulong addr);
+void tlb_flush_page_all(CPUState *this_cpu, target_ulong addr);
 #else
 static inline void tlb_flush_page(CPUState *cpu, target_ulong addr)
 {
@@ -209,7 +209,7 @@ static inline void tlb_flush_page_by_mmuidx(CPUState *cpu, 
CPUState *target,
 static inline void tlb_flush_by_mmuidx(CPUState *cpu, CPUState *target ...)
 {
 }
-static inline void tlb_flush_page_all(target_ulong addr)
+static inline void tlb_flush_page_all(CPUState *this_cpu, target_ulong addr)
 {
 }
 #endif
diff --git a/target-arm/helper.c b/target-arm/helper.c
index 0187c0a..8988c8b 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -554,13 +554,13 @@ static void tlbiasid_is_write(CPUARMState *env, const 
ARMCPRegInfo *ri,
 static void tlbimva_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
  uint64_t value)
 {
-tlb_flush_page_all(value & TARGET_PAGE_MASK);
+tlb_flush_page_all(ENV_GET_CPU(env), value & TARGET_PAGE_MASK);
 }
 
 static void tlbimvaa_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
  uint64_t value)
 {
-tlb_flush_page_all(value & TARGET_PAGE_MASK);
+tlb_flush_page_all(ENV_GET_CPU(env), value & TARGET_PAGE_MASK);
 }
 
 static const ARMCPRegInfo cp_reginfo[] = {
-- 
2.8.3




[Qemu-devel] [RFC 07/10] cputlb: Query tlb_flush_by_mmuidx

2016-05-26 Thread Alvise Rigo
Some architectures need to flush the TLB by MMU index. As with
tlb_flush(), these flushes also have to be properly forwarded to the
target VCPU. For the time being, this type of flush is used only by the
ARM/aarch64 target architecture and is the result of guest instruction
emulation. As a result, we can always safely get the CPUState of the
current VCPU without relying on current_cpu. This, however, complicates
the function prototype a bit by adding an argument pointing to the
current VCPU's CPUState.

Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cputlb.c| 49 +++--
 include/exec/exec-all.h |  4 ++--
 target-arm/helper.c | 40 +---
 3 files changed, 62 insertions(+), 31 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 5bbbf1b..73624d6 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -59,6 +59,8 @@
 /* We need a solution for stuffing 64 bit pointers in 32 bit ones if
  * we care about this combination */
 QEMU_BUILD_BUG_ON(sizeof(target_ulong) > sizeof(void *));
+/* Size, in bytes, of the bitmap used by tlb_flush_by_mmuidx functions */
+#define MMUIDX_BITMAP_SIZE sizeof(unsigned long) * BITS_TO_LONGS(NB_MMU_MODES)
 
 /* statistics */
 int tlb_flush_count;
@@ -153,10 +155,41 @@ static inline void tlb_tables_flush_bitmap(CPUState *cpu, 
unsigned long *bitmap)
 memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
 }
 
-static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
+struct TLBFlushByMMUIdxParams {
+DECLARE_BITMAP(idx_to_flush, NB_MMU_MODES);
+};
+
+static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, void *opaque)
+{
+struct TLBFlushByMMUIdxParams *params = opaque;
+
+tlb_tables_flush_bitmap(cpu, params->idx_to_flush);
+
+g_free(params);
+}
+
+static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, CPUState *target,
+ unsigned long *idxmap)
 {
+if (!qemu_cpu_is_self(target)) {
+struct TLBFlushByMMUIdxParams *params;
+
+params = g_malloc(sizeof(struct TLBFlushByMMUIdxParams));
+memcpy(params->idx_to_flush, idxmap, MMUIDX_BITMAP_SIZE);
+async_wait_run_on_cpu(target, cpu, tlb_flush_by_mmuidx_async_work,
+  params);
+} else {
+tlb_tables_flush_bitmap(cpu, idxmap);
+}
+}
+
+void tlb_flush_by_mmuidx(CPUState *cpu, CPUState *target_cpu, ...)
+{
+va_list argp;
 DECLARE_BITMAP(idxmap, NB_MMU_MODES) = { 0 };
 
+va_start(argp, target_cpu);
+
 for (;;) {
 int mmu_idx = va_arg(argp, int);
 
@@ -167,15 +200,9 @@ static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, 
va_list argp)
 set_bit(mmu_idx, idxmap);
 }
 
-tlb_tables_flush_bitmap(cpu, idxmap);
-}
-
-void tlb_flush_by_mmuidx(CPUState *cpu, ...)
-{
-va_list argp;
-va_start(argp, cpu);
-v_tlb_flush_by_mmuidx(cpu, argp);
 va_end(argp);
+
+v_tlb_flush_by_mmuidx(cpu, target_cpu, idxmap);
 }
 
 static inline void tlb_flush_entry(CPUTLBEntry *tlb_entry, target_ulong addr)
@@ -244,7 +271,9 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong 
addr, ...)
   TARGET_FMT_lx "/" TARGET_FMT_lx ")\n",
   env->tlb_flush_addr, env->tlb_flush_mask);
 
-v_tlb_flush_by_mmuidx(cpu, argp);
+/* Temporarily use current_cpu until tlb_flush_page_by_mmuidx
+ * is reworked */
+tlb_flush_by_mmuidx(current_cpu, cpu, argp);
 va_end(argp);
 return;
 }
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index bc97683..066870b 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -152,7 +152,7 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong 
addr, ...);
  * Flush all entries from the TLB of the specified CPU, for the specified
  * MMU indexes.
  */
-void tlb_flush_by_mmuidx(CPUState *cpu, ...);
+void tlb_flush_by_mmuidx(CPUState *cpu, CPUState *target, ...);
 /**
  * tlb_set_page_with_attrs:
  * @cpu: CPU to add this TLB entry for
@@ -205,7 +205,7 @@ static inline void tlb_flush_page_by_mmuidx(CPUState *cpu,
 {
 }
 
-static inline void tlb_flush_by_mmuidx(CPUState *cpu, ...)
+static inline void tlb_flush_by_mmuidx(CPUState *cpu, CPUState *target ...)
 {
 }
 static inline void tlb_flush_page_all(target_ulong addr)
diff --git a/target-arm/helper.c b/target-arm/helper.c
index bc9fbda..3dcd910 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -2388,7 +2388,7 @@ static void vttbr_write(CPUARMState *env, const 
ARMCPRegInfo *ri,
 
 /* Accesses to VTTBR may change the VMID so we must flush the TLB.  */
 if (raw_read(env, ri) != value) {
-tlb_flush_by_mmuidx(cs, ARMMMUIdx_S12NSE1, ARMMMUIdx_S12NSE0,
+tlb_flush_by_mmuidx(cs, cs, ARMMMUIdx_S12NSE1, ARMMMUIdx_S12NSE0,
 ARMMMUIdx_S2NS, -1);
 raw_write(env, ri, v

[Qemu-devel] [RFC 02/10] softmmu_llsc_template.h: Move to multi-threading

2016-05-26 Thread Alvise Rigo
Using tcg_exclusive_{lock,unlock}(), make the emulation of
LoadLink/StoreConditional thread safe.

During an LL access, this lock protects the load access itself, the
update of the exclusive history and the update of the VCPU's protected
range.  In a SC access, the lock protects the store access itself, the
possible reset of other VCPUs' protected range and the reset of the
exclusive context of calling VCPU.

The lock is also taken when a normal store happens to access an
exclusive page, in order to reset other VCPUs' protected ranges in case
of collision.

Moreover, adapt target-arm to also cope with the new multi-threaded
execution.
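
For illustration only (this sketch is not part of the patch and the
function names are invented; the real call sites are spread over the
three files touched below), the lock ends up bracketing three paths:

/* Sketch: where tcg_exclusive_{lock,unlock}() are taken. */
void tcg_exclusive_lock(void);
void tcg_exclusive_unlock(void);

void sketch_ll_helper(void)
{
    tcg_exclusive_lock();
    /* do the load, update the exclusive history,
     * set this vCPU's protected range */
    tcg_exclusive_unlock();
}

void sketch_sc_helper(void)
{
    tcg_exclusive_lock();
    /* do the store if still allowed, reset colliding protected
     * ranges of other vCPUs, reset this vCPU's LL/SC context */
    tcg_exclusive_unlock();
}

void sketch_store_to_exclusive_page(void)
{
    tcg_exclusive_lock();
    /* do the store, then reset any other vCPU's protected range
     * that overlaps the written address */
    tcg_exclusive_unlock();
}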

Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 softmmu_llsc_template.h | 11 +--
 softmmu_template.h  |  6 ++
 target-arm/op_helper.c  |  6 ++
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
index 2c4a494..d3810c0 100644
--- a/softmmu_llsc_template.h
+++ b/softmmu_llsc_template.h
@@ -62,11 +62,13 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, 
target_ulong addr,
 hwaddr hw_addr;
 unsigned mmu_idx = get_mmuidx(oi);
 
+index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+
+tcg_exclusive_lock();
+
 /* Use the proper load helper from cpu_ldst.h */
 ret = helper_ld(env, addr, oi, retaddr);
 
-index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
-
 /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
  * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
 hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
@@ -95,6 +97,8 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong 
addr,
 
 cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
 
+tcg_exclusive_unlock();
+
 /* From now on we are in LL/SC context */
 this_cpu->ll_sc_context = true;
 
@@ -114,6 +118,8 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, 
target_ulong addr,
  * access as one made by the store conditional wrapper. If the store
  * conditional does not succeed, the value will be set to 0.*/
 cpu->excl_succeeded = true;
+
+tcg_exclusive_lock();
 helper_st(env, addr, val, oi, retaddr);
 
 if (cpu->excl_succeeded) {
@@ -123,6 +129,7 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, 
target_ulong addr,
 
 /* Unset LL/SC context */
 cc->cpu_reset_excl_context(cpu);
+tcg_exclusive_unlock();
 
 return ret;
 }
diff --git a/softmmu_template.h b/softmmu_template.h
index 76fe37e..9363a7b 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -537,11 +537,16 @@ static inline void 
smmu_helper(do_excl_store)(CPUArchState *env,
 }
 }
 
+/* Take the lock in case we are not coming from a SC */
+tcg_exclusive_lock();
+
 smmu_helper(do_ram_store)(env, little_endian, val, addr, oi,
   get_mmuidx(oi), index, retaddr);
 
 reset_other_cpus_colliding_ll_addr(hw_addr, DATA_SIZE);
 
+tcg_exclusive_unlock();
+
 return;
 }
 
@@ -572,6 +577,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 /* Handle an IO access or exclusive access.  */
 if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
 if (tlb_addr & TLB_EXCL) {
+
 smmu_helper(do_excl_store)(env, true, val, addr, oi, index,
retaddr);
 return;
diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
index e22afc5..19ea52d 100644
--- a/target-arm/op_helper.c
+++ b/target-arm/op_helper.c
@@ -35,7 +35,9 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
 cs->exception_index = excp;
 env->exception.syndrome = syndrome;
 env->exception.target_el = target_el;
+tcg_exclusive_lock();
 cc->cpu_reset_excl_context(cs);
+tcg_exclusive_unlock();
 cpu_loop_exit(cs);
 }
 
@@ -58,7 +60,9 @@ void HELPER(atomic_clear)(CPUARMState *env)
 CPUState *cs = ENV_GET_CPU(env);
 CPUClass *cc = CPU_GET_CLASS(cs);
 
+tcg_exclusive_lock();
 cc->cpu_reset_excl_context(cs);
+tcg_exclusive_unlock();
 }
 
 uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
@@ -874,7 +878,9 @@ void HELPER(exception_return)(CPUARMState *env)
 
 aarch64_save_sp(env, cur_el);
 
+tcg_exclusive_lock();
 cc->cpu_reset_excl_context(cs);
+tcg_exclusive_unlock();
 
 /* We must squash the PSTATE.SS bit to zero unless both of the
  * following hold:
-- 
2.8.3




[Qemu-devel] [RFC 03/10] cpus: Introduce async_wait_run_on_cpu()

2016-05-26 Thread Alvise Rigo
Introduce a new function that allows the calling VCPU to add a work item
to another VCPU (aka the target VCPU). This new function differs from
async_run_on_cpu() in that it makes the calling VCPU wait for the target
VCPU to finish the work item. The mechanism makes use of the halt_cond
to wait and, if needed, to process pending work items in the meantime.
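
For illustration only (not part of the patch): a minimal, self-contained
pthread sketch of the same idea -- a requester queues a work item on a
worker thread and blocks on a condition variable until the item has run.
The real implementation differs (it reuses the vCPU's halt_cond, lets the
waiting vCPU keep draining its own queue, and kicks the target out of the
execution loop), so all names below are invented. Compile with -lpthread.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

typedef void (*work_fn)(void *data);

struct work_item {
    work_fn func;
    void *data;
    bool done;
};

static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t work_avail = PTHREAD_COND_INITIALIZER;
static pthread_cond_t work_done = PTHREAD_COND_INITIALIZER;
static struct work_item *pending;          /* single-slot "queue" */

/* Target thread: wait for a work item, run it, signal completion. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&work_lock);
        while (pending == NULL) {
            pthread_cond_wait(&work_avail, &work_lock);
        }
        struct work_item *wi = pending;
        pending = NULL;
        pthread_mutex_unlock(&work_lock);

        wi->func(wi->data);                /* run outside the lock */

        pthread_mutex_lock(&work_lock);
        wi->done = true;
        pthread_cond_broadcast(&work_done);
        pthread_mutex_unlock(&work_lock);
    }
    return NULL;
}

/* Requester: schedule func on the worker and wait until it has run. */
static void wait_run_on_worker(work_fn func, void *data)
{
    struct work_item wi = { func, data, false };

    pthread_mutex_lock(&work_lock);
    pending = &wi;
    pthread_cond_signal(&work_avail);
    while (!wi.done) {
        pthread_cond_wait(&work_done, &work_lock);
    }
    pthread_mutex_unlock(&work_lock);
}

static void say_hello(void *data)
{
    printf("worker ran: %s\n", (const char *)data);
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, worker, NULL);
    wait_run_on_worker(say_hello, "flush your TLB, please");
    return 0;
}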

Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cpus.c| 44 ++--
 include/qom/cpu.h | 31 +++
 2 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/cpus.c b/cpus.c
index b9ec903..7bc96e2 100644
--- a/cpus.c
+++ b/cpus.c
@@ -89,7 +89,7 @@ static bool cpu_thread_is_idle(CPUState *cpu)
 if (cpu->stop || cpu->queued_work_first) {
 return false;
 }
-if (cpu_is_stopped(cpu)) {
+if (cpu_is_stopped(cpu) || async_waiting_for_work(cpu)) {
 return true;
 }
 if (!cpu->halted || cpu_has_work(cpu) ||
@@ -1012,6 +1012,7 @@ void async_run_on_cpu(CPUState *cpu, run_on_cpu_func 
func, void *data)
 wi->func = func;
 wi->data = data;
 wi->free = true;
+wi->wcpu = NULL;
 
qemu_mutex_lock(&cpu->work_mutex);
 if (cpu->queued_work_first == NULL) {
@@ -1027,6 +1028,40 @@ void async_run_on_cpu(CPUState *cpu, run_on_cpu_func 
func, void *data)
 qemu_cpu_kick(cpu);
 }
 
+void async_wait_run_on_cpu(CPUState *cpu, CPUState *wcpu, run_on_cpu_func func,
+void *data)
+{
+struct qemu_work_item *wwi;
+
+assert(wcpu != cpu);
+
+wwi = g_malloc0(sizeof(struct qemu_work_item));
+wwi->func = func;
+wwi->data = data;
+wwi->free = true;
+wwi->wcpu = wcpu;
+
+/* Increase the number of pending work items */
+atomic_inc(&wcpu->pending_work_items);
+
+qemu_mutex_lock(&cpu->work_mutex);
+/* Add the waiting work items at the beginning to free as soon as possible
+ * the waiting CPU. */
+if (cpu->queued_work_first == NULL) {
+cpu->queued_work_last = wwi;
+} else {
+wwi->next = cpu->queued_work_first;
+}
+cpu->queued_work_first = wwi;
+wwi->done = false;
+qemu_mutex_unlock(&cpu->work_mutex);
+
+qemu_cpu_kick(cpu);
+
+/* In order to wait, @wcpu has to exit the CPU loop */
+cpu_exit(wcpu);
+}
+
 /*
  * Safe work interface
  *
@@ -1120,6 +1155,10 @@ static void flush_queued_work(CPUState *cpu)
qemu_mutex_unlock(&cpu->work_mutex);
 wi->func(cpu, wi->data);
qemu_mutex_lock(&cpu->work_mutex);
+if (wi->wcpu != NULL) {
+atomic_dec(&wi->wcpu->pending_work_items);
+qemu_cond_broadcast(wi->wcpu->halt_cond);
+}
 if (wi->free) {
 g_free(wi);
 } else {
@@ -1406,7 +1445,8 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
 while (1) {
 bool sleep = false;
 
-if (cpu_can_run(cpu) && !async_safe_work_pending()) {
+if (cpu_can_run(cpu) && !async_safe_work_pending()
+ && !async_waiting_for_work(cpu)) {
 int r;
 
 atomic_inc(_scheduled_cpus);
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 019f06d..7be82ed 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -259,6 +259,8 @@ struct qemu_work_item {
 void *data;
 int done;
 bool free;
+/* CPU waiting for this work item to finish. If NULL, no CPU is waiting. */
+CPUState *wcpu;
 };
 
 /**
@@ -303,6 +305,7 @@ struct qemu_work_item {
  * @kvm_fd: vCPU file descriptor for KVM.
  * @work_mutex: Lock to prevent multiple access to queued_work_*.
  * @queued_work_first: First asynchronous work pending.
+ * @pending_work_items: Work items whose completion the CPU needs to wait for.
  *
  * State of one CPU core or thread.
  */
@@ -337,6 +340,7 @@ struct CPUState {
 
 QemuMutex work_mutex;
 struct qemu_work_item *queued_work_first, *queued_work_last;
+int pending_work_items;
 
 CPUAddressSpace *cpu_ases;
 int num_ases;
@@ -398,6 +402,9 @@ struct CPUState {
  * by a stcond (see softmmu_template.h). */
 bool excl_succeeded;
 
+/* True if some CPU requested a TLB flush for this CPU. */
+bool pending_tlb_flush;
+
 /* Note that this is accessed at the start of every TB via a negative
offset from AREG0.  Leave this field at the end so as to make the
(absolute value) offset as small as possible.  This reduces code
@@ -680,6 +687,19 @@ void run_on_cpu(CPUState *cpu, run_on_cpu_func func, void 
*data);
 void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data);
 
 /**
+ * async_wait_run_on_cpu:
+ * @cpu: The vCPU to run on.
+ * @wcpu: The vCPU submitting the work.
+ * @func: The function to be executed.
+ * @data: Data to pass to the function.
+ *
+ * Schedules the function @func for execution on the vCPU @cpu asynchron

[Qemu-devel] [RFC 08/10] cputlb: Query tlb_flush_page_by_mmuidx

2016-05-26 Thread Alvise Rigo
Similarly to the previous commit, make tlb_flush_page_by_mmuidx query the
flushes when targeting different VCPUs.

Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cputlb.c| 90 ++---
 include/exec/exec-all.h |  5 +--
 target-arm/helper.c | 35 ++-
 3 files changed, 85 insertions(+), 45 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 73624d6..77a1997 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -157,6 +157,8 @@ static inline void tlb_tables_flush_bitmap(CPUState *cpu, 
unsigned long *bitmap)
 
 struct TLBFlushByMMUIdxParams {
 DECLARE_BITMAP(idx_to_flush, NB_MMU_MODES);
+/* Used by tlb_flush_page_by_mmuidx */
+target_ulong addr;
 };
 
 static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, void *opaque)
@@ -255,28 +257,13 @@ void tlb_flush_page(CPUState *cpu, target_ulong addr)
 tb_flush_jmp_cache(cpu, addr);
 }
 
-void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, ...)
+static void tlb_flush_page_by_mmuidx_async_work(CPUState *cpu, void *opaque)
 {
 CPUArchState *env = cpu->env_ptr;
-int i, k;
-va_list argp;
-
-va_start(argp, addr);
-
-tlb_debug("addr "TARGET_FMT_lx"\n", addr);
-
-/* Check if we need to flush due to large pages.  */
-if ((addr & env->tlb_flush_mask) == env->tlb_flush_addr) {
-tlb_debug("forced full flush ("
-  TARGET_FMT_lx "/" TARGET_FMT_lx ")\n",
-  env->tlb_flush_addr, env->tlb_flush_mask);
+struct TLBFlushByMMUIdxParams *params = opaque;
+target_ulong addr = params->addr;
+int mmu_idx, i;
 
-/* Temporarily use current_cpu until tlb_flush_page_by_mmuidx
- * is reworked */
-tlb_flush_by_mmuidx(current_cpu, cpu, argp);
-va_end(argp);
-return;
-}
 /* must reset current TB so that interrupts cannot modify the
links while we are modifying them */
 cpu->current_tb = NULL;
@@ -284,6 +271,49 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong 
addr, ...)
 addr &= TARGET_PAGE_MASK;
 i = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
 
+for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+if (test_bit(mmu_idx, params->idx_to_flush)) {
+int k;
+
+tlb_debug("idx %d\n", mmu_idx);
tlb_flush_entry(&env->tlb_table[mmu_idx][i], addr);
+/* check whether there are vltb entries that need to be flushed */
+for (k = 0; k < CPU_VTLB_SIZE; k++) {
tlb_flush_entry(&env->tlb_v_table[mmu_idx][k], addr);
+}
+}
+}
+
+tb_flush_jmp_cache(cpu, addr);
+
+g_free(params);
+}
+
+static void v_tlb_flush_page_by_mmuidx(CPUState *cpu, CPUState *target_cpu,
+target_ulong addr, unsigned long *idxmap)
+{
+if (!qemu_cpu_is_self(target_cpu)) {
+struct TLBFlushByMMUIdxParams *params;
+
+params = g_malloc(sizeof(struct TLBFlushByMMUIdxParams));
+params->addr = addr;
+memcpy(params->idx_to_flush, idxmap, MMUIDX_BITMAP_SIZE);
+async_wait_run_on_cpu(target_cpu, cpu,
+  tlb_flush_page_by_mmuidx_async_work, params);
+} else {
+tlb_tables_flush_bitmap(cpu, idxmap);
+}
+}
+
+void tlb_flush_page_by_mmuidx(CPUState *cpu, CPUState *target,
+  target_ulong addr, ...)
+{
+DECLARE_BITMAP(idxmap, NB_MMU_MODES) = { 0 };
+CPUArchState *env = target->env_ptr;
+va_list argp;
+
+va_start(argp, addr);
+
 for (;;) {
 int mmu_idx = va_arg(argp, int);
 
@@ -291,18 +321,24 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong 
addr, ...)
 break;
 }
 
-tlb_debug("idx %d\n", mmu_idx);
+set_bit(mmu_idx, idxmap);
+}
 
-tlb_flush_entry(&env->tlb_table[mmu_idx][i], addr);
+va_end(argp);
 
-/* check whether there are vltb entries that need to be flushed */
-for (k = 0; k < CPU_VTLB_SIZE; k++) {
-tlb_flush_entry(&env->tlb_v_table[mmu_idx][k], addr);
-}
+tlb_debug("addr "TARGET_FMT_lx"\n", addr);
+
+/* Check if we need to flush due to large pages.  */
+if ((addr & env->tlb_flush_mask) == env->tlb_flush_addr) {
+tlb_debug("forced full flush ("
+  TARGET_FMT_lx "/" TARGET_FMT_lx ")\n",
+  env->tlb_flush_addr, env->tlb_flush_mask);
+
+v_tlb_flush_by_mmuidx(cpu, target, idxmap);
+return;
 }
-va_end(argp);
 
-tb_flush_jmp_cache(cpu, addr);
+v_tlb_flush_page_by_mmuidx(cpu, target, addr, idxmap);
 }
 
 static void tlb_flush_page_async_work(CPUState *cpu, void *opaque)
diff --git a/include/exec/exec-all.h b/include/exec/exec-all

[Qemu-devel] [RFC 04/10] cputlb: Introduce tlb_flush_other()

2016-05-26 Thread Alvise Rigo
In some cases (like in softmmu_llsc_template.h) we know for certain that
we need to flush other VCPUs' TLBs. tlb_flush_other() serves this
purpose, allowing the VCPU @cpu to query a global flush to @other.

In addition, also use it in softmmu_llsc_template.h and tlb_flush()
where possible.

Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cputlb.c| 28 +++-
 softmmu_llsc_template.h |  2 +-
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 1586b64..55f7447 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -81,12 +81,24 @@ static void tlb_flush_nocheck(CPUState *cpu, int 
flush_global)
 env->tlb_flush_addr = -1;
 env->tlb_flush_mask = 0;
 tlb_flush_count++;
-/* atomic_mb_set(&cpu->pending_tlb_flush, 0); */
 }
 
 static void tlb_flush_global_async_work(CPUState *cpu, void *opaque)
 {
 tlb_flush_nocheck(cpu, GPOINTER_TO_INT(opaque));
+atomic_mb_set(&cpu->pending_tlb_flush, false);
+}
+
+static void tlb_flush_other(CPUState *cpu, CPUState *other, int flush_global)
+{
+if (other->created) {
+if (!atomic_xchg(&other->pending_tlb_flush, true)) {
+async_wait_run_on_cpu(other, cpu, tlb_flush_global_async_work,
+  GINT_TO_POINTER(flush_global));
+}
+} else {
+tlb_flush_nocheck(other, flush_global);
+}
 }
 
 /* NOTE:
@@ -103,11 +115,17 @@ static void tlb_flush_global_async_work(CPUState *cpu, 
void *opaque)
  */
 void tlb_flush(CPUState *cpu, int flush_global)
 {
-if (cpu->created) {
-async_run_on_cpu(cpu, tlb_flush_global_async_work,
- GINT_TO_POINTER(flush_global));
-} else {
+/* if @cpu has not been created yet or it is the current_cpu, we do not
+ * need to query the flush. */
+if (current_cpu == cpu || !cpu->created) {
 tlb_flush_nocheck(cpu, flush_global);
+} else {
+if (current_cpu) {
+tlb_flush_other(current_cpu, cpu, flush_global);
+} else {
+async_run_on_cpu(cpu, tlb_flush_global_async_work,
+ GINT_TO_POINTER(flush_global));
+}
 }
 }
 
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
index d3810c0..51ce58f 100644
--- a/softmmu_llsc_template.h
+++ b/softmmu_llsc_template.h
@@ -81,7 +81,7 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong 
addr,
 excl_history_put_addr(hw_addr);
 CPU_FOREACH(cpu) {
 if (this_cpu != cpu) {
-tlb_flush(cpu, 1);
+tlb_flush_other(this_cpu, cpu, 1);
 }
 }
 }
-- 
2.8.3




[Qemu-devel] [RFC 05/10] target-arm: End TB after ldrex instruction

2016-05-26 Thread Alvise Rigo
A VCPU executing a ldrex instruction might query flushes to other VCPUs:
in such cases, the calling VCPU uses cpu_exit to exit from the cpu loop
and waits for the other VCPUs to perform the flush. In order to exit from the
cpu loop as soon as possible, interrupt the TB after the ldrex
instruction.

Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 target-arm/translate-a64.c | 2 ++
 target-arm/translate.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
index 376cb1c..2a14c14 100644
--- a/target-arm/translate-a64.c
+++ b/target-arm/translate-a64.c
@@ -1875,6 +1875,8 @@ static void disas_ldst_excl(DisasContext *s, uint32_t 
insn)
 if (!is_store) {
 s->is_ldex = true;
 gen_load_exclusive(s, rt, rt2, tcg_addr, size, is_pair);
+gen_a64_set_pc_im(s->pc);
+s->is_jmp = DISAS_JUMP;
 } else {
 gen_store_exclusive(s, rs, rt, rt2, tcg_addr, size, is_pair);
 }
diff --git a/target-arm/translate.c b/target-arm/translate.c
index 0677e04..7c1cb19 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -8807,6 +8807,8 @@ static void disas_arm_insn(DisasContext *s, unsigned int 
insn)
 default:
 abort();
 }
+gen_set_pc_im(s, s->pc);
+s->is_jmp = DISAS_JUMP;
 } else {
 rm = insn & 0xf;
 switch (op1) {
-- 
2.8.3




[Qemu-devel] [RFC 01/10] exec: Introduce tcg_exclusive_{lock, unlock}()

2016-05-26 Thread Alvise Rigo
Add tcg_exclusive_{lock,unlock}() functions that will be used for making
the emulation of LL and SC instructions thread safe.
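
A self-contained sketch of the pattern used below (a POSIX spinlock plus
a thread-local "already held" flag; the names are stand-ins, and the real
code uses QemuSpin). The per-thread flag makes nested lock/unlock calls
harmless, which -- judging by the "Take the lock in case we are not
coming from a SC" comment in patch 02 of this series -- matters because a
store-conditional can re-enter the plain store slow path that also takes
the lock:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_spinlock_t excl_lock;
static __thread bool have_excl_lock;

/* Acquire only if this thread does not hold the lock already. */
static void excl_lock_acquire(void)
{
    if (!have_excl_lock) {
        pthread_spin_lock(&excl_lock);
        have_excl_lock = true;
    }
}

/* Release only if this thread actually holds the lock. */
static void excl_lock_release(void)
{
    if (have_excl_lock) {
        have_excl_lock = false;
        pthread_spin_unlock(&excl_lock);
    }
}

int main(void)
{
    pthread_spin_init(&excl_lock, PTHREAD_PROCESS_PRIVATE);
    excl_lock_acquire();
    excl_lock_acquire();        /* nested call: no self-deadlock */
    excl_lock_release();        /* a single release is enough */
    printf("no deadlock\n");
    return 0;
}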

Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cpus.c|  2 ++
 exec.c| 18 ++
 include/qom/cpu.h |  5 +
 3 files changed, 25 insertions(+)

diff --git a/cpus.c b/cpus.c
index 860e7ba..b9ec903 100644
--- a/cpus.c
+++ b/cpus.c
@@ -961,6 +961,8 @@ void qemu_init_cpu_loop(void)
qemu_cond_init(&qemu_work_cond);
qemu_mutex_init(&qemu_global_mutex);
 
+qemu_spin_init(&cpu_exclusive_lock);
+
qemu_thread_get_self(&io_thread);
 
 safe_work = g_array_sized_new(TRUE, TRUE, sizeof(qemu_safe_work_item), 
128);
diff --git a/exec.c b/exec.c
index a24b31c..1c72113 100644
--- a/exec.c
+++ b/exec.c
@@ -197,6 +197,24 @@ void cpu_exclusive_history_free(void)
 g_free(excl_history.c_array);
 }
 }
+
+__thread bool cpu_have_exclusive_lock;
+QemuSpin cpu_exclusive_lock;
+inline void tcg_exclusive_lock(void)
+{
+if (!cpu_have_exclusive_lock) {
+qemu_spin_lock(&cpu_exclusive_lock);
+cpu_have_exclusive_lock = true;
+}
+}
+
+inline void tcg_exclusive_unlock(void)
+{
+if (cpu_have_exclusive_lock) {
+cpu_have_exclusive_lock = false;
+qemu_spin_unlock(&cpu_exclusive_lock);
+}
+}
 #endif
 
 #if !defined(CONFIG_USER_ONLY)
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 0f51870..019f06d 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -201,6 +201,11 @@ typedef struct CPUClass {
 void (*disas_set_info)(CPUState *cpu, disassemble_info *info);
 } CPUClass;
 
+/* Protect the cpu_exclusive_* variables. */
+void tcg_exclusive_lock(void);
+void tcg_exclusive_unlock(void);
+extern QemuSpin cpu_exclusive_lock;
+
 #ifdef HOST_WORDS_BIGENDIAN
 typedef struct icount_decr_u16 {
 uint16_t high;
-- 
2.8.3




[Qemu-devel] [RFC 00/10] MTTCG: Slow-path for atomic insns

2016-05-26 Thread Alvise Rigo
Hi,

This series ports the latest iteration of the LL/SC work on top of the
latest MTTCG reference branch posted recently by Alex.

These patches apply on top of the following series:

- [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86
  https://github.com/stsquad/qemu/tree/mttcg/enable-mttcg-for-armv7-v1
- [RFC v8 00/14] Slow-path for atomic instruction translation
  https://git.virtualopensystems.com/dev/qemu-mt/tree/\
  slowpath-for-atomic-v8-no-mttcg - only minor changes have been necessary
- Few recent patches from Emilio regarding the spinlock implementation

Overall, these patches allow the LL/SC infrastructure to work in multi-threaded
mode (patches 01-02-04) and make TLB flushes to other VCPUs safe.

Patch 03 introduces a new API to submit a work item to a VCPU and wait for its
completion. This API is used to query TLB flushes that result from the
emulation of some ARM instructions. Patches 07, 08 and 09 modify the current
tlb_flush_* functions to use the new API.  Patch 10 fixes a rare hang that I
was experiencing with this branch.

The whole work can be fetched from the following repository:
g...@git.virtualopensystems.com:dev/qemu-mt.git
at the branch "slowpath-for-atomic-v8-mttcg".

Alvise Rigo (10):
  exec: Introduce tcg_exclusive_{lock,unlock}()
  softmmu_llsc_template.h: Move to multi-threading
  cpus: Introduce async_wait_run_on_cpu()
  cputlb: Introduce tlb_flush_other()
  target-arm: End TB after ldrex instruction
  cputlb: Add tlb_tables_flush_bitmap()
  cputlb: Query tlb_flush_by_mmuidx
  cputlb: Query tlb_flush_page_by_mmuidx
  cputlb: Query tlb_flush_page_all
  cpus: Do not sleep if some work item is pending

 cpus.c |  48 ++-
 cputlb.c   | 202 ++---
 exec.c |  18 
 include/exec/exec-all.h|  13 +--
 include/qom/cpu.h  |  36 
 softmmu_llsc_template.h|  13 ++-
 softmmu_template.h |   6 ++
 target-arm/helper.c|  79 +-
 target-arm/op_helper.c |   6 ++
 target-arm/translate-a64.c |   2 +
 target-arm/translate.c |   2 +
 11 files changed, 327 insertions(+), 98 deletions(-)

-- 
2.8.3




Re: [Qemu-devel] Any topics for today's MTTCG sync-up call?

2016-05-23 Thread alvise rigo
Hi Alex,

I finally solved the issue I had, the branch is working well as far as I
can say.
The work I will share, in addition to making the LL/SC work mttcg-aware,
extends the various TLB flush calls with the query-based mechanism: the
requesting CPU queries the flushes to the target CPUs and waits for them
to complete.

Sorry for the delay, I have been quite busy. I just need to polish some
commits, than (this week) I will share the branch.

Best regards,
alvise

On Mon, May 23, 2016 at 12:57 PM, Alex Bennée 
wrote:

> Hi,
>
> It's been a while since the last sync-up call. Have we got any topics to
> discuss today?
>
> Sergey and I (with a little Paolo) have spent some of last week delving
> into the locking hierarchy w.r.t to tb_lock vs mmap_lock to see if there
> is any simplification to be had. I'm not sure if this is a topic
> conducive to a phone call instead of the mailing list but if others want
> to discuss it we can add it as an agenda item.
>
> We also have a new member of the team. Pranith has joined as a GSoC
> student. He'll be looking at memory ordering with his first pass at the
> problem looking to solve the store-after-load issues which do show up on
> ARM-on-x86 (see my testcase).
>
> Alvise, is there any help you need with the LL/SC stuff? The MTTCG aware
> version has been taking some time so would it be worth sharing the
> issues you have hit with the group?
>
> Emilio, is there anything you want to add? I've been following the QHT
> stuff which is a really positive addition which my v3 base patches is
> based upon (making the hot-path non lock contended). Do you have
> anything in the works above that?
>
> Cheers,
>
> --
> Alex Bennée
>


Re: [Qemu-devel] MTTCG Sync-up call today?

2016-05-09 Thread alvise rigo
Not from my side.
Hope to have some news by the end of the week.

Regards,
alvise

On Mon, May 9, 2016 at 1:56 PM, Alex Bennée  wrote:

>
> Hi,
>
> Do we have anything we want to discuss today?
>
> --
> Alex Bennée
>


Re: [Qemu-devel] MTTCG Sync-up call today?

2016-04-25 Thread alvise rigo
Hi Alex,

On Mon, Apr 25, 2016 at 11:53 AM, Alex Bennée 
wrote:

> Hi,
>
> We are due to have a sync-up call today but I don't think I'll be able
> to make it thanks to a very rough voice courtesy of my
> petri-dishes/children. However since the last call:
>
>  * Posted final parts of the MTTCG patch set
>  * Lots of reviews
>
> Please welcome Pranith to the group who is participating as a GSoC
> student. His project will be looking at the modelling of memory barriers
> in MTTCG.
>
> My plan for the next week is look in more detail at the tb_find_fast
> lock breaking patch. I had to drop it from the original series due to
> regressions but it has a massive effect on performance that means it
> needs sorting. I'll probably start re-building a tree with Emilio's QHT
> patches included which will allow for lockless lookups in the fast path
> and then start re-basing the enabling and ARMv7 patches on top of that.
>
> I'll also do another review pass of Alvise' LL/SC patches but a fresh
> pair of eyes on them will be appreciated.
>
> Alvise, how are you doing with the MTTCG version?
>

I'm working on this, mostly I'm investigating some issues related to the
dirty bitmaps' code that lately changed a bit.
Apart from that, the branch is almost ready.

Best regards,
alvise


>
> --
> Alex Bennée
>


[Qemu-devel] [RFC v8 14/14] target-arm: aarch64: Use ls/st exclusive for atomic insns

2016-04-19 Thread Alvise Rigo
Use the new LL/SC runtime helpers to handle the aarch64 atomic instructions
in softmmu_llsc_template.h.

The STXP emulation required a dedicated helper to handle the paired
doubleword case.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 target-arm/helper-a64.c|  55 +++
 target-arm/helper-a64.h|   2 +
 target-arm/translate-a64.c | 168 +
 target-arm/translate.c |   7 --
 4 files changed, 149 insertions(+), 83 deletions(-)

diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
index c7bfb4d..170c59b 100644
--- a/target-arm/helper-a64.c
+++ b/target-arm/helper-a64.c
@@ -26,6 +26,7 @@
 #include "qemu/bitops.h"
 #include "internals.h"
 #include "qemu/crc32c.h"
+#include "tcg/tcg.h"
#include <zlib.h> /* For crc32 */
 
 /* C2.4.7 Multiply and divide */
@@ -443,3 +444,57 @@ uint64_t HELPER(crc32c_64)(uint64_t acc, uint64_t val, 
uint32_t bytes)
 /* Linux crc32c converts the output to one's complement.  */
return crc32c(acc, buf, bytes) ^ 0xffffffff;
 }
+
+/* STXP emulation for two 64 bit doublewords. We can't use directly two
+ * stcond_i64 accesses, otherwise the first will conclude the LL/SC pair.
+ * Instead, two normal 64-bit accesses are used and the CPUState is
+ * updated accordingly.
+ *
+ * We do not support paired STXPs to MMIO memory, this will become trivial
+ * when the softmmu will support 128bit memory accesses.
+ */
+target_ulong HELPER(stxp_i128)(CPUArchState *env, target_ulong addr,
+   uint64_t vall, uint64_t valh,
+   uint32_t mmu_idx)
+{
+CPUState *cpu = ENV_GET_CPU(env);
+CPUClass *cc = CPU_GET_CLASS(cpu);
+TCGMemOpIdx op;
+target_ulong ret = 0;
+
+if (!cpu->ll_sc_context) {
+cpu->excl_succeeded = false;
+ret = 1;
+goto out;
+}
+
+op = make_memop_idx(MO_BEQ, mmu_idx);
+
+/* According to section C6.6.191 of ARM ARM DDI 0487A.h, the access has
+ * to be quadword aligned. */
+if (addr & 0xf) {
+/* TODO: Do unaligned access */
+qemu_log_mask(LOG_UNIMP, "aarch64: silently executing STXP quadword"
+  "unaligned, exception not implemented yet.\n");
+}
+
+/* Setting excl_succeeded to true will make the store exclusive. */
+cpu->excl_succeeded = true;
+helper_ret_stq_mmu(env, addr, vall, op, GETRA());
+
+if (!cpu->excl_succeeded) {
+ret = 1;
+goto out;
+}
+
+helper_ret_stq_mmu(env, addr + 8, valh, op, GETRA());
+if (!cpu->excl_succeeded) {
+ret = 1;
+}
+
+out:
+/* Unset LL/SC context */
+cc->cpu_reset_excl_context(cpu);
+
+return ret;
+}
diff --git a/target-arm/helper-a64.h b/target-arm/helper-a64.h
index 1d3d10f..4ecb118 100644
--- a/target-arm/helper-a64.h
+++ b/target-arm/helper-a64.h
@@ -46,3 +46,5 @@ DEF_HELPER_FLAGS_2(frecpx_f32, TCG_CALL_NO_RWG, f32, f32, ptr)
 DEF_HELPER_FLAGS_2(fcvtx_f64_to_f32, TCG_CALL_NO_RWG, f32, f64, env)
 DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
+/* STXP helper */
+DEF_HELPER_5(stxp_i128, i64, env, i64, i64, i64, i32)
diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
index 80f6c20..d5f613e 100644
--- a/target-arm/translate-a64.c
+++ b/target-arm/translate-a64.c
@@ -37,9 +37,6 @@
 static TCGv_i64 cpu_X[32];
 static TCGv_i64 cpu_pc;
 
-/* Load/store exclusive handling */
-static TCGv_i64 cpu_exclusive_high;
-
 static const char *regnames[] = {
 "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7",
 "x8", "x9", "x10", "x11", "x12", "x13", "x14", "x15",
@@ -93,9 +90,6 @@ void a64_translate_init(void)
   offsetof(CPUARMState, xregs[i]),
   regnames[i]);
 }
-
-cpu_exclusive_high = tcg_global_mem_new_i64(TCG_AREG0,
-offsetof(CPUARMState, exclusive_high), "exclusive_high");
 }
 
 static inline ARMMMUIdx get_a64_user_mem_index(DisasContext *s)
@@ -1219,7 +1213,7 @@ static void handle_hint(DisasContext *s, uint32_t insn,
 
 static void gen_clrex(DisasContext *s, uint32_t insn)
 {
-tcg_gen_movi_i64(cpu_exclusive_addr, -1);
+gen_helper_atomic_clear(cpu_env);
 }
 
 /* CLREX, DSB, DMB, ISB */
@@ -1685,11 +1679,9 @@ static void disas_b_exc_sys(DisasContext *s, uint32_t 
insn)
 }
 
 /*
- * Load/Store exclusive instructions are implemented by remembering
- * the value/address loaded, and seeing if these are the same
- * when the store is performed. This is not actually

[Qemu-devel] [RFC v8 12/14] target-arm: translate: Use ld/st excl for atomic insns

2016-04-19 Thread Alvise Rigo
Use the new LL/SC runtime helpers to handle the ARM atomic instructions
in softmmu_llsc_template.h.

In general, the helper generators
gen_{ldrex,strex}_{8,16a,32a,64a}() call the functions
helper_{le,be}_{ldlink,stcond}{ub,uw,ul,q}_mmu() implemented in
softmmu_llsc_template.h, performing an alignment check.

In addition, add a simple helper function to emulate the CLREX instruction.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 target-arm/cpu.h   |   3 +
 target-arm/helper.h|   2 +
 target-arm/machine.c   |   7 ++
 target-arm/op_helper.c |  14 ++-
 target-arm/translate.c | 258 -
 5 files changed, 174 insertions(+), 110 deletions(-)

diff --git a/target-arm/cpu.h b/target-arm/cpu.h
index b8b3364..46ab87f 100644
--- a/target-arm/cpu.h
+++ b/target-arm/cpu.h
@@ -462,6 +462,9 @@ typedef struct CPUARMState {
 float_status fp_status;
 float_status standard_fp_status;
 } vfp;
+/* Even if we don't use these values anymore, we still keep them for
+ * backward compatibility in case of migration toward QEMU versions
+ * without the LoadLink/StoreExclusive backend. */
 uint64_t exclusive_addr;
 uint64_t exclusive_val;
 uint64_t exclusive_high;
diff --git a/target-arm/helper.h b/target-arm/helper.h
index c2a85c7..37cec49 100644
--- a/target-arm/helper.h
+++ b/target-arm/helper.h
@@ -532,6 +532,8 @@ DEF_HELPER_2(dc_zva, void, env, i64)
 DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
 DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
 
+DEF_HELPER_1(atomic_clear, void, env)
+
 #ifdef TARGET_AARCH64
 #include "helper-a64.h"
 #endif
diff --git a/target-arm/machine.c b/target-arm/machine.c
index ed1925a..9660163 100644
--- a/target-arm/machine.c
+++ b/target-arm/machine.c
@@ -203,6 +203,7 @@ static const VMStateInfo vmstate_cpsr = {
 static void cpu_pre_save(void *opaque)
 {
 ARMCPU *cpu = opaque;
+CPUARMState *env = >env;
 
 if (kvm_enabled()) {
 if (!write_kvmstate_to_list(cpu)) {
@@ -221,6 +222,12 @@ static void cpu_pre_save(void *opaque)
cpu->cpreg_array_len * sizeof(uint64_t));
 memcpy(cpu->cpreg_vmstate_values, cpu->cpreg_values,
cpu->cpreg_array_len * sizeof(uint64_t));
+
+/* Ensure to fail the next STREX for versions of QEMU with the
+ * old backend. */
+env->exclusive_addr = -1;
+env->exclusive_val = -1;
+env->exclusive_high = -1;
 }
 
 static int cpu_post_load(void *opaque, int version_id)
diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
index a5ee65f..3ae0b6a 100644
--- a/target-arm/op_helper.c
+++ b/target-arm/op_helper.c
@@ -29,11 +29,13 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
 uint32_t syndrome, uint32_t target_el)
 {
 CPUState *cs = CPU(arm_env_get_cpu(env));
+CPUClass *cc = CPU_GET_CLASS(cs);
 
 assert(!excp_is_internal(excp));
 cs->exception_index = excp;
 env->exception.syndrome = syndrome;
 env->exception.target_el = target_el;
+cc->cpu_reset_excl_context(cs);
 cpu_loop_exit(cs);
 }
 
@@ -51,6 +53,14 @@ static int exception_target_el(CPUARMState *env)
 return target_el;
 }
 
+void HELPER(atomic_clear)(CPUARMState *env)
+{
+CPUState *cs = ENV_GET_CPU(env);
+CPUClass *cc = CPU_GET_CLASS(cs);
+
+cc->cpu_reset_excl_context(cs);
+}
+
 uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
   uint32_t rn, uint32_t maxindex)
 {
@@ -681,6 +691,8 @@ static int el_from_spsr(uint32_t spsr)
 
 void HELPER(exception_return)(CPUARMState *env)
 {
+CPUState *cs = ENV_GET_CPU(env);
+CPUClass *cc = CPU_GET_CLASS(cs);
 int cur_el = arm_current_el(env);
 unsigned int spsr_idx = aarch64_banked_spsr_index(cur_el);
 uint32_t spsr = env->banked_spsr[spsr_idx];
@@ -689,7 +701,7 @@ void HELPER(exception_return)(CPUARMState *env)
 
 aarch64_save_sp(env, cur_el);
 
-env->exclusive_addr = -1;
+cc->cpu_reset_excl_context(cs);
 
 /* We must squash the PSTATE.SS bit to zero unless both of the
  * following hold:
diff --git a/target-arm/translate.c b/target-arm/translate.c
index cff511b..9c2b197 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -60,6 +60,7 @@ TCGv_ptr cpu_env;
 static TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
 static TCGv_i32 cpu_R[16];
 TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
+/* The following two variables are still used by the aarch64 front-end */
 TCGv_i64 cpu_exclusive_addr;
 TCGv_i64 cpu_exclusive_val;
 #ifdef CONFIG_USER_ONLY
@@ -7413,57 +7414,139 @@ static void gen_logicq_cc(TCGv_i32 lo, TCGv_i32 hi)
 tcg_gen_or_i32(cpu_ZF, lo, hi);
 }
 
-/* Load/Store exclusive instructions are implemented b

[Qemu-devel] [RFC v8 13/14] target-arm: cpu64: use custom set_excl hook

2016-04-19 Thread Alvise Rigo
In aarch64 the LDXP/STXP instructions allow exclusive accesses of up to
128 bits. However, due to a softmmu limitation, such wide accesses are
not allowed.

To work around this limitation, we need to support LoadLink instructions
that cover at least 128 consecutive bits (see the next patch for more
details).

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 target-arm/cpu64.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/target-arm/cpu64.c b/target-arm/cpu64.c
index cc177bb..1d45e66 100644
--- a/target-arm/cpu64.c
+++ b/target-arm/cpu64.c
@@ -287,6 +287,13 @@ static void aarch64_cpu_set_pc(CPUState *cs, vaddr value)
 }
 }
 
+static void aarch64_set_excl_range(CPUState *cpu, hwaddr addr, hwaddr size)
+{
+cpu->excl_protected_range.begin = addr;
+/* At least cover 128 bits for a STXP access (two paired doublewords 
case)*/
+cpu->excl_protected_range.end = addr + 16;
+}
+
 static void aarch64_cpu_class_init(ObjectClass *oc, void *data)
 {
 CPUClass *cc = CPU_CLASS(oc);
@@ -297,6 +304,7 @@ static void aarch64_cpu_class_init(ObjectClass *oc, void 
*data)
 cc->gdb_write_register = aarch64_cpu_gdb_write_register;
 cc->gdb_num_core_regs = 34;
 cc->gdb_core_xml_file = "aarch64-core.xml";
+cc->cpu_set_excl_protected_range = aarch64_set_excl_range;
 }
 
 static void aarch64_cpu_register(const ARMCPUInfo *info)
-- 
2.8.0




[Qemu-devel] [RFC v8 11/14] tcg: Create new runtime helpers for excl accesses

2016-04-19 Thread Alvise Rigo
Introduce a set of new runtime helpers to handle exclusive instructions.
These helpers are used as hooks to call the respective LL/SC helpers in
softmmu_llsc_template.h from TCG code.

The helpers ending with an "a" perform an alignment check.
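
For example -- the exact instantiations are not visible in the quoted
hunks, so the arguments here are only illustrative -- an invocation such
as LDEX_HELPER(32, MO_TEUL, helper_le_ldlinkul_mmu) of the macro shown in
the diff below would expand to roughly:

uint32_t helper_ldlink_i32(CPUArchState *env, target_ulong addr,
                           uint32_t index)
{
    CPUArchState *state = env;
    TCGMemOpIdx op;

    /* Encode the memory op and mmu index, then forward to the
     * LoadLink slow-path helper from softmmu_llsc_template.h. */
    op = make_memop_idx(MO_TEUL, index);

    return (uint32_t)helper_le_ldlinkul_mmu(state, addr, op, GETRA());
}

TCG front ends would then presumably reach it through the
gen_helper_ldlink_i32() wrapper that helper-gen.h derives from the
DEF_HELPER entries in tcg-llsc-gen-helper.h.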

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 Makefile.target |   2 +-
 include/exec/helper-gen.h   |   3 ++
 include/exec/helper-proto.h |   1 +
 include/exec/helper-tcg.h   |   3 ++
 tcg-llsc-helper.c   | 104 
 tcg-llsc-helper.h   |  61 ++
 tcg/tcg-llsc-gen-helper.h   |  67 
 7 files changed, 240 insertions(+), 1 deletion(-)
 create mode 100644 tcg-llsc-helper.c
 create mode 100644 tcg-llsc-helper.h
 create mode 100644 tcg/tcg-llsc-gen-helper.h

diff --git a/Makefile.target b/Makefile.target
index 34ddb7e..faf32a2 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -135,7 +135,7 @@ obj-y += arch_init.o cpus.o monitor.o gdbstub.o balloon.o 
ioport.o numa.o
 obj-y += qtest.o bootdevice.o
 obj-y += hw/
 obj-$(CONFIG_KVM) += kvm-all.o
-obj-y += memory.o cputlb.o
+obj-y += memory.o cputlb.o tcg-llsc-helper.o
 obj-y += memory_mapping.o
 obj-y += dump.o
 obj-y += migration/ram.o migration/savevm.o
diff --git a/include/exec/helper-gen.h b/include/exec/helper-gen.h
index 0d0da3a..f8483a9 100644
--- a/include/exec/helper-gen.h
+++ b/include/exec/helper-gen.h
@@ -60,6 +60,9 @@ static inline void glue(gen_helper_, 
name)(dh_retvar_decl(ret)  \
 #include "trace/generated-helpers.h"
 #include "trace/generated-helpers-wrappers.h"
 #include "tcg-runtime.h"
+#if defined(CONFIG_SOFTMMU)
+#include "tcg-llsc-gen-helper.h"
+#endif
 
 #undef DEF_HELPER_FLAGS_0
 #undef DEF_HELPER_FLAGS_1
diff --git a/include/exec/helper-proto.h b/include/exec/helper-proto.h
index effdd43..90be2fd 100644
--- a/include/exec/helper-proto.h
+++ b/include/exec/helper-proto.h
@@ -29,6 +29,7 @@ dh_ctype(ret) HELPER(name) (dh_ctype(t1), dh_ctype(t2), 
dh_ctype(t3), \
 #include "helper.h"
 #include "trace/generated-helpers.h"
 #include "tcg-runtime.h"
+#include "tcg/tcg-llsc-gen-helper.h"
 
 #undef DEF_HELPER_FLAGS_0
 #undef DEF_HELPER_FLAGS_1
diff --git a/include/exec/helper-tcg.h b/include/exec/helper-tcg.h
index 79fa3c8..6228a7f 100644
--- a/include/exec/helper-tcg.h
+++ b/include/exec/helper-tcg.h
@@ -38,6 +38,9 @@
 #include "helper.h"
 #include "trace/generated-helpers.h"
 #include "tcg-runtime.h"
+#ifdef CONFIG_SOFTMMU
+#include "tcg-llsc-gen-helper.h"
+#endif
 
 #undef DEF_HELPER_FLAGS_0
 #undef DEF_HELPER_FLAGS_1
diff --git a/tcg-llsc-helper.c b/tcg-llsc-helper.c
new file mode 100644
index 000..646b4ba
--- /dev/null
+++ b/tcg-llsc-helper.c
@@ -0,0 +1,104 @@
+/*
+ * Runtime helpers for atomic instruction emulation
+ *
+ * Copyright (c) 2015 Virtual Open Systems
+ *
+ * Authors:
+ *  Alvise Rigo <a.r...@virtualopensystems.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "exec/cpu_ldst.h"
+#include "exec/helper-head.h"
+#include "tcg-llsc-helper.h"
+
+#define LDEX_HELPER(SUFF, OPC, FUNC)   \
+uint32_t HELPER(ldlink_i##SUFF)(CPUArchState *env, target_ulong addr,  \
+uint32_t index)\
+{  \
+CPUArchState *state = env; \
+TCGMemOpIdx op;\
+   \
+op = make_memop_idx((OPC), index); \
+   \
+return (uint32_t)FUNC(state, addr, op, GETRA());   \
+}
+
+#define STEX_HELPER(SUFF, DATA_TYPE, OPC, FUNC)\
+target_ulong HELPER(stcond_i##SUFF)(CPUArchState *env, target_ulong addr,

[Qemu-devel] [RFC v8 10/14] softmmu: Support MMIO exclusive accesses

2016-04-19 Thread Alvise Rigo
Enable exclusive accesses when the MMIO flag is set in the TLB entry.

In case a LL access is done to MMIO memory, we treat it differently from
a RAM access in that we do not rely on the EXCL bitmap to flag the page
as exclusive. In fact, we don't even need the TLB_EXCL flag to force the
slow path, since it is always forced anyway.

As in the RAM case, the MMIO exclusive ranges also have to be protected
from other CPUs' accesses. In order to do that, we flag the accessed
MemoryRegion to mark that an exclusive access has been performed and is
not concluded yet. This flag will force the other CPUs to invalidate the
exclusive range in case of collision: basically, it serves the same
purpose as TLB_EXCL for the TLBEntries referring to exclusive memory.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cputlb.c|  7 +--
 include/exec/memory.h   |  1 +
 softmmu_llsc_template.h | 11 +++
 softmmu_template.h  | 22 ++
 4 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index e5df3a5..3cf40a3 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -29,7 +29,6 @@
 #include "exec/memory-internal.h"
 #include "exec/ram_addr.h"
 #include "tcg/tcg.h"
-#include "hw/hw.h"
 
 //#define DEBUG_TLB
 //#define DEBUG_TLB_CHECK
@@ -508,9 +507,10 @@ static inline void excl_history_put_addr(hwaddr addr)
 /* For every vCPU compare the exclusive address and reset it in case of a
  * match. Since only one vCPU is running at once, no lock has to be held to
  * guard this operation. */
-static inline void reset_other_cpus_colliding_ll_addr(hwaddr addr, hwaddr size)
+static inline bool reset_other_cpus_colliding_ll_addr(hwaddr addr, hwaddr size)
 {
 CPUState *cpu;
+bool ret = false;
 
 CPU_FOREACH(cpu) {
 if (current_cpu != cpu &&
@@ -520,8 +520,11 @@ static inline void 
reset_other_cpus_colliding_ll_addr(hwaddr addr, hwaddr size)
cpu->excl_protected_range.begin,
addr, size)) {
 cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+ret = true;
 }
 }
+
+return ret;
 }
 
 #define MMUSUFFIX _mmu
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 71e0480..bacb3ad 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -171,6 +171,7 @@ struct MemoryRegion {
 bool rom_device;
 bool flush_coalesced_mmio;
 bool global_locking;
+bool pending_excl_access; /* A vCPU issued an exclusive access */
 uint8_t dirty_log_mask;
 ram_addr_t ram_addr;
 Object *owner;
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
index 1e24fec..ca55502 100644
--- a/softmmu_llsc_template.h
+++ b/softmmu_llsc_template.h
@@ -84,15 +84,18 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, 
target_ulong addr,
 }
 }
 }
+/* For this vCPU, just update the TLB entry, no need to flush. */
+env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
 } else {
-hw_error("EXCL accesses to MMIO regions not supported yet.");
+/* Set a pending exclusive access in the MemoryRegion */
+MemoryRegion *mr = iotlb_to_region(this_cpu,
+   env->iotlb[mmu_idx][index].addr,
+   env->iotlb[mmu_idx][index].attrs);
+mr->pending_excl_access = true;
 }
 
 cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
 
-/* For this vCPU, just update the TLB entry, no need to flush. */
-env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
-
 /* From now on we are in LL/SC context */
 this_cpu->ll_sc_context = true;
 
diff --git a/softmmu_template.h b/softmmu_template.h
index 2934a0c..2dc5e01 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -360,6 +360,28 @@ static inline void glue(io_write, SUFFIX)(CPUArchState 
*env,
 MemoryRegion *mr = iotlb_to_region(cpu, physaddr, iotlbentry->attrs);
 
 physaddr = (physaddr & TARGET_PAGE_MASK) + addr;
+
+/* While for normal RAM accesses we define exclusive memory at TLBEntry
+ * granularity, for MMIO memory we use a MemoryRegion granularity.
+ * The pending_excl_access flag is the analogous of TLB_EXCL. */
+if (unlikely(mr->pending_excl_access)) {
+if (cpu->excl_succeeded) {
+/* This SC access finalizes the LL/SC pair, thus the MemoryRegion
+ * has no pending exclusive access anymore.
+ * N.B.: Here excl_succeeded == true means that this access
+ * comes from an exclusive instruction. */
+MemoryRegion *mr = iotlb_to_region(cpu, iotlbentry->addr,
+

[Qemu-devel] [RFC v8 06/14] qom: cpu: Add CPUClass hooks for exclusive range

2016-04-19 Thread Alvise Rigo
The excl_protected_range is a hwaddr range set by the VCPU at the
execution of a LoadLink instruction. If a normal access writes to this
range, the corresponding StoreCond will fail.

Each architecture can set the exclusive range when issuing the LoadLink
operation through a CPUClass hook. This comes in handy to emulate, for
instance, the exclusive monitor implemented in some ARM architectures
(more precisely, the Exclusive Reservation Granule).

In addition, add another CPUClass hook that is called to decide whether
a StoreCond has to fail or not.
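
As a hypothetical example of why the hook is useful (not code from this
series; simplified stand-ins for hwaddr/Range/CPUState are declared
locally so the snippet is self-contained), a target whose exclusive
monitor tracks whole Exclusive Reservation Granules could widen the
protected range like this:

#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for the QEMU types used by the hook. */
typedef uint64_t hwaddr;
struct Range { hwaddr begin; hwaddr end; };
typedef struct CPUState { struct Range excl_protected_range; } CPUState;

#define ERG_BYTES 64   /* hypothetical Exclusive Reservation Granule */

/* Round the protected range out to ERG boundaries, mimicking a monitor
 * that tracks granules rather than the exact access size. */
static void mycpu_set_excl_protected_range(CPUState *cpu, hwaddr addr,
                                           hwaddr size)
{
    cpu->excl_protected_range.begin = addr & ~(hwaddr)(ERG_BYTES - 1);
    cpu->excl_protected_range.end =
        (addr + size + ERG_BYTES - 1) & ~(hwaddr)(ERG_BYTES - 1);
}

int main(void)
{
    CPUState cpu;

    /* An exclusive word access at 0x1004 protects [0x1000, 0x1040). */
    mycpu_set_excl_protected_range(&cpu, 0x1004, 4);
    printf("[0x%llx, 0x%llx)\n",
           (unsigned long long)cpu.excl_protected_range.begin,
           (unsigned long long)cpu.excl_protected_range.end);
    return 0;
}

A real target would install such a hook from its class init, e.g.
cc->cpu_set_excl_protected_range = mycpu_set_excl_protected_range;, much
as the aarch64 patch later in this series does for its 128-bit STXP case.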

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 include/qom/cpu.h | 20 
 qom/cpu.c | 27 +++
 2 files changed, 47 insertions(+)

diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 2e5229d..21f10eb 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -29,6 +29,7 @@
 #include "qemu/queue.h"
 #include "qemu/thread.h"
 #include "qemu/typedefs.h"
+#include "qemu/range.h"
 
 typedef int (*WriteCoreDumpFunction)(const void *buf, size_t size,
  void *opaque);
@@ -123,6 +124,10 @@ struct TranslationBlock;
  * @cpu_exec_enter: Callback for cpu_exec preparation.
  * @cpu_exec_exit: Callback for cpu_exec cleanup.
  * @cpu_exec_interrupt: Callback for processing interrupts in cpu_exec.
+ * @cpu_set_excl_protected_range: Callback used by LL operation for setting the
+ *exclusive range.
+ * @cpu_valid_excl_access: Callback for checking the validity of a SC 
operation.
+ * @cpu_reset_excl_context: Callback for resetting the exclusive context.
  * @disas_set_info: Setup architecture specific components of disassembly info
  *
  * Represents a CPU family or model.
@@ -183,6 +188,13 @@ typedef struct CPUClass {
 void (*cpu_exec_exit)(CPUState *cpu);
 bool (*cpu_exec_interrupt)(CPUState *cpu, int interrupt_request);
 
+/* Atomic instruction handling */
+void (*cpu_set_excl_protected_range)(CPUState *cpu, hwaddr addr,
+ hwaddr size);
+bool (*cpu_valid_excl_access)(CPUState *cpu, hwaddr addr,
+ hwaddr size);
+void (*cpu_reset_excl_context)(CPUState *cpu);
+
 void (*disas_set_info)(CPUState *cpu, disassemble_info *info);
 } CPUClass;
 
@@ -219,6 +231,9 @@ struct kvm_run;
 #define TB_JMP_CACHE_BITS 12
 #define TB_JMP_CACHE_SIZE (1 << TB_JMP_CACHE_BITS)
 
+/* Atomic insn translation TLB support. */
+#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
+
 /**
  * CPUState:
  * @cpu_index: CPU index (informative).
@@ -341,6 +356,11 @@ struct CPUState {
  */
 bool throttle_thread_scheduled;
 
+/* vCPU's exclusive addresses range.
+ * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not
+ * in the middle of a LL/SC. */
+struct Range excl_protected_range;
+
 /* Note that this is accessed at the start of every TB via a negative
offset from AREG0.  Leave this field at the end so as to make the
(absolute value) offset as small as possible.  This reduces code
diff --git a/qom/cpu.c b/qom/cpu.c
index 8f537a4..309d487 100644
--- a/qom/cpu.c
+++ b/qom/cpu.c
@@ -203,6 +203,29 @@ static bool cpu_common_exec_interrupt(CPUState *cpu, int 
int_req)
 return false;
 }
 
+static void cpu_common_set_excl_range(CPUState *cpu, hwaddr addr, hwaddr size)
+{
+cpu->excl_protected_range.begin = addr;
+cpu->excl_protected_range.end = addr + size;
+}
+
+static bool cpu_common_valid_excl_access(CPUState *cpu, hwaddr addr, hwaddr 
size)
+{
+/* Check if the excl range completely covers the access */
+if (cpu->excl_protected_range.begin <= addr &&
+cpu->excl_protected_range.end >= addr + size) {
+
+return true;
+}
+
+return false;
+}
+
+static void cpu_common_reset_excl_context(CPUState *cpu)
+{
+cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+}
+
 void cpu_dump_state(CPUState *cpu, FILE *f, fprintf_function cpu_fprintf,
 int flags)
 {
@@ -252,6 +275,7 @@ static void cpu_common_reset(CPUState *cpu)
 cpu->can_do_io = 1;
 cpu->exception_index = -1;
 cpu->crash_occurred = false;
+cpu_common_reset_excl_context(cpu);
 memset(cpu->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof(void *));
 }
 
@@ -355,6 +379,9 @@ static void cpu_class_init(ObjectClass *klass, void *data)
 k->cpu_exec_enter = cpu_common_noop;
 k->cpu_exec_exit = cpu_common_noop;
 k->cpu_exec_interrupt = cpu_common_exec_interrupt;
+k->cpu_set_excl_protected_range = cpu_common_set_excl_range;
+k->cpu_valid_excl_access = cpu_common_valid_excl_access;
+k->cpu_reset_excl_context = cpu_common_reset_excl_context;
 dc->realize = cpu_common_realizefn;
 /*
  * Reason: CPUs still need special care by board code: wiring up
-- 
2.8.0




[Qemu-devel] [RFC v8 09/14] softmmu: Honor the new exclusive bitmap

2016-04-19 Thread Alvise Rigo
Pages set as exclusive (clean) in the DIRTY_MEMORY_EXCLUSIVE bitmap
have to have their TLB entries flagged with TLB_EXCL. Accesses to
pages with the TLB_EXCL flag set have to be handled properly, since they
can potentially invalidate an open LL/SC transaction.

Modify the TLB entries generation to honor the new bitmap and extend
the softmmu_template to handle the accesses made to guest pages marked
as exclusive. The TLB_EXCL flag is used only for normal RAM memory.

Exclusive accesses to MMIO memory are still not supported, but they will
with the next patch.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cputlb.c   | 36 ++
 softmmu_template.h | 65 +++---
 2 files changed, 89 insertions(+), 12 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 02b0d14..e5df3a5 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -416,11 +416,20 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong 
vaddr,
 || memory_region_is_romd(section->mr)) {
 /* Write access calls the I/O callback.  */
 te->addr_write = address | TLB_MMIO;
-} else if (memory_region_is_ram(section->mr)
-   && cpu_physical_memory_is_clean(section->mr->ram_addr
-   + xlat)) {
-te->addr_write = address | TLB_NOTDIRTY;
 } else {
+if (memory_region_is_ram(section->mr)
+&& cpu_physical_memory_is_clean(section->mr->ram_addr
+   + xlat)) {
+address |= TLB_NOTDIRTY;
+}
+/* Only normal RAM accesses need the TLB_EXCL flag to handle
+ * exclusive store operations. */
+if (!(address & TLB_MMIO) &&
+cpu_physical_memory_is_excl(section->mr->ram_addr + xlat)) 
{
+/* There is at least one vCPU that has flagged the address as
+ * exclusive. */
+address |= TLB_EXCL;
+}
 te->addr_write = address;
 }
 } else {
@@ -496,6 +505,25 @@ static inline void excl_history_put_addr(hwaddr addr)
 excl_history.c_array[excl_history.last_idx] = addr & TARGET_PAGE_MASK;
 }
 
+/* For every vCPU compare the exclusive address and reset it in case of a
+ * match. Since only one vCPU is running at once, no lock has to be held to
+ * guard this operation. */
+static inline void reset_other_cpus_colliding_ll_addr(hwaddr addr, hwaddr size)
+{
+CPUState *cpu;
+
+CPU_FOREACH(cpu) {
+if (current_cpu != cpu &&
+cpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
+ranges_overlap(cpu->excl_protected_range.begin,
+   cpu->excl_protected_range.end -
+   cpu->excl_protected_range.begin,
+   addr, size)) {
+cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+}
+}
+}
+
 #define MMUSUFFIX _mmu
 
 /* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
diff --git a/softmmu_template.h b/softmmu_template.h
index ede1240..2934a0c 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -469,6 +469,43 @@ static inline void smmu_helper(do_ram_store)(CPUArchState 
*env,
 #endif
 }
 
+static inline void smmu_helper(do_excl_store)(CPUArchState *env,
+  bool little_endian,
+  DATA_TYPE val, target_ulong addr,
+  TCGMemOpIdx oi, int index,
+  uintptr_t retaddr)
+{
+CPUIOTLBEntry *iotlbentry = &env->iotlb[get_mmuidx(oi)][index];
+CPUState *cpu = ENV_GET_CPU(env);
+CPUClass *cc = CPU_GET_CLASS(cpu);
+/* The slow-path has been forced since we are writing to
+ * exclusive-protected memory. */
+hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
+
+/* The function reset_other_cpus_colliding_ll_addr could have reset
+ * the exclusive address. Fail the SC in this case.
+ * N.B.: here excl_succeed == true means that the caller is
+ * helper_stcond_name in softmmu_llsc_template.
+ * On the contrary, excl_succeeded == false occurs when a VCPU is
+ * writing through normal store to a page with TLB_EXCL bit set. */
+if (cpu->excl_succeeded) {
+if (!cc->cpu_valid_excl_access(cpu, hw_addr, DATA_SIZE)) {
+/* The vCPU is SC-ing to an unprotected address. */
+cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+cpu->excl_succeeded = false;
+
+return;
+   

[Qemu-devel] [RFC v8 07/14] softmmu: Add helpers for a new slowpath

2016-04-19 Thread Alvise Rigo
The new helpers rely on the legacy ones to perform the actual read/write.

The LoadLink helper (helper_ldlink_name) prepares the way for the
following StoreCond operation. It sets the linked address and the size
of the access. The LoadLink helper also updates the TLB entry of the
page involved in the LL/SC to all vCPUs by forcing a TLB flush, so that
the following accesses made by all the vCPUs will follow the slow path.

The StoreConditional helper (helper_stcond_name) returns 1 if the
store has to fail due to a concurrent access to the same page by
another vCPU. A 'concurrent access' can be a store made by *any* vCPU
(although, some implementations allow stores made by the CPU that issued
the LoadLink).

For the time being we do not support exclusive accesses to MMIO memory.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cputlb.c|   4 ++
 include/qom/cpu.h   |   5 ++
 qom/cpu.c   |   2 +
 softmmu_llsc_template.h | 132 
 softmmu_template.h  |  12 +
 tcg/tcg.h   |  31 
 6 files changed, 186 insertions(+)
 create mode 100644 softmmu_llsc_template.h

diff --git a/cputlb.c b/cputlb.c
index f6fb161..58d6f03 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -29,6 +29,7 @@
 #include "exec/memory-internal.h"
 #include "exec/ram_addr.h"
 #include "tcg/tcg.h"
+#include "hw/hw.h"
 
 //#define DEBUG_TLB
 //#define DEBUG_TLB_CHECK
@@ -476,6 +477,8 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, 
target_ulong addr)
 
 #define MMUSUFFIX _mmu
 
+/* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
+#define GEN_EXCLUSIVE_HELPERS
 #define SHIFT 0
 #include "softmmu_template.h"
 
@@ -488,6 +491,7 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, 
target_ulong addr)
 #define SHIFT 3
 #include "softmmu_template.h"
 #undef MMUSUFFIX
+#undef GEN_EXCLUSIVE_HELPERS
 
 #define MMUSUFFIX _cmmu
 #undef GETPC_ADJ
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 21f10eb..014851e 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -356,10 +356,15 @@ struct CPUState {
  */
 bool throttle_thread_scheduled;
 
+/* Used by the atomic insn translation backend. */
+bool ll_sc_context;
 /* vCPU's exclusive addresses range.
  * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not
  * in the middle of a LL/SC. */
 struct Range excl_protected_range;
+/* Used to carry the SC result but also to flag a normal store access made
+ * by a stcond (see softmmu_template.h). */
+bool excl_succeeded;
 
 /* Note that this is accessed at the start of every TB via a negative
offset from AREG0.  Leave this field at the end so as to make the
diff --git a/qom/cpu.c b/qom/cpu.c
index 309d487..3280735 100644
--- a/qom/cpu.c
+++ b/qom/cpu.c
@@ -224,6 +224,8 @@ static bool cpu_common_valid_excl_access(CPUState *cpu, 
hwaddr addr, hwaddr size
 static void cpu_common_reset_excl_context(CPUState *cpu)
 {
 cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+cpu->ll_sc_context = false;
+cpu->excl_succeeded = false;
 }
 
 void cpu_dump_state(CPUState *cpu, FILE *f, fprintf_function cpu_fprintf,
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
new file mode 100644
index 000..ca2ac95
--- /dev/null
+++ b/softmmu_llsc_template.h
@@ -0,0 +1,132 @@
+/*
+ *  Software MMU support (exclusive load/store operations)
+ *
+ * Generate helpers used by TCG for qemu_ldlink/stcond ops.
+ *
+ * Included from softmmu_template.h only.
+ *
+ * Copyright (c) 2015 Virtual Open Systems
+ *
+ * Authors:
+ *  Alvise Rigo <a.r...@virtualopensystems.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* This template does not generate together the le and be version, but only one
+ * of the two depending on whether BIGENDIAN_EXCLUSIVE_HELPERS has been set.
+ * The same nomenclature as softmmu_template.h is used for the exclusive
+ * helpers.  */
+
+#ifdef BIGENDIAN_EXCLUSIVE_HELPERS
+
+#define helper_ldlink_name  glue(glue(helper_be_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_stcond_name 

[Qemu-devel] [RFC v8 08/14] softmmu: Add history of excl accesses

2016-04-19 Thread Alvise Rigo
Add a circular buffer to store the hw addresses used in the last
EXCLUSIVE_HISTORY_LEN exclusive accesses.

When an address is popped from the buffer, its page is set as not
exclusive again. In this way we avoid frequently setting/unsetting the
same page (which would cause frequent flushes as well).
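
A self-contained toy version of that circular buffer (the real one lives
in exec.c/cputlb.c, is sized per vCPU, masks the address to its page and
clears the page's bit in the exclusive bitmap; here the "unset" step is
just a printf):

#include <stdint.h>
#include <stdio.h>

#define HISTORY_LEN 4                  /* the series uses 256 * max_cpus */
#define RESET_ADDR UINT64_MAX

typedef uint64_t hwaddr;

static hwaddr history[HISTORY_LEN] = {
    RESET_ADDR, RESET_ADDR, RESET_ADDR, RESET_ADDR
};
static unsigned last_idx;

/* Insert a new exclusive address; the entry it evicts has its page
 * marked as non-exclusive again, so it stops forcing the slow path. */
static void history_put(hwaddr addr)
{
    last_idx = (last_idx + 1) % HISTORY_LEN;

    if (history[last_idx] != RESET_ADDR) {
        printf("unset exclusive flag of 0x%llx\n",
               (unsigned long long)history[last_idx]);
    }
    history[last_idx] = addr;          /* overwrite the oldest entry */
}

int main(void)
{
    /* Five insertions into a 4-entry history: the oldest address is
     * evicted and its page would become non-exclusive again. */
    for (hwaddr a = 0x1000; a <= 0x5000; a += 0x1000) {
        history_put(a);
    }
    return 0;
}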

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cputlb.c| 21 +
 exec.c  | 19 +++
 include/qom/cpu.h   |  8 
 softmmu_llsc_template.h |  1 +
 vl.c|  3 +++
 5 files changed, 52 insertions(+)

diff --git a/cputlb.c b/cputlb.c
index 58d6f03..02b0d14 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -475,6 +475,27 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, 
target_ulong addr)
 return qemu_ram_addr_from_host_nofail(p);
 }
 
+/* Keep a circular array with the last excl_history.length addresses used for
+ * exclusive accesses. The exiting addresses are marked as non-exclusive. */
+extern CPUExclusiveHistory excl_history;
+static inline void excl_history_put_addr(hwaddr addr)
+{
+hwaddr last;
+
+/* Calculate the index of the next exclusive address */
+excl_history.last_idx = (excl_history.last_idx + 1) % excl_history.length;
+
+last = excl_history.c_array[excl_history.last_idx];
+
+/* Unset EXCL bit of the oldest entry */
+if (last != EXCLUSIVE_RESET_ADDR) {
+cpu_physical_memory_unset_excl(last);
+}
+
+/* Add a new address, overwriting the oldest one */
+excl_history.c_array[excl_history.last_idx] = addr & TARGET_PAGE_MASK;
+}
+
 #define MMUSUFFIX _mmu
 
 /* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
diff --git a/exec.c b/exec.c
index cefee1b..3c54b92 100644
--- a/exec.c
+++ b/exec.c
@@ -177,6 +177,25 @@ struct CPUAddressSpace {
 MemoryListener tcg_as_listener;
 };
 
+/* Exclusive memory support */
+CPUExclusiveHistory excl_history;
+void cpu_exclusive_history_init(void)
+{
+/* Initialize exclusive history for atomic instruction handling. */
+if (tcg_enabled()) {
+g_assert(EXCLUSIVE_HISTORY_CPU_LEN * max_cpus <= UINT16_MAX);
+excl_history.length = EXCLUSIVE_HISTORY_CPU_LEN * max_cpus;
+excl_history.c_array = g_malloc(excl_history.length * sizeof(hwaddr));
+memset(excl_history.c_array, -1, excl_history.length * sizeof(hwaddr));
+}
+}
+
+void cpu_exclusive_history_free(void)
+{
+if (tcg_enabled()) {
+g_free(excl_history.c_array);
+}
+}
 #endif
 
 #if !defined(CONFIG_USER_ONLY)
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 014851e..de144f6 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -232,7 +232,15 @@ struct kvm_run;
 #define TB_JMP_CACHE_SIZE (1 << TB_JMP_CACHE_BITS)
 
 /* Atomic insn translation TLB support. */
+typedef struct CPUExclusiveHistory {
+uint16_t last_idx;   /* index of last insertion */
+uint16_t length; /* history's length, it depends on smp_cpus */
+hwaddr *c_array; /* history's circular array */
+} CPUExclusiveHistory;
 #define EXCLUSIVE_RESET_ADDR ULLONG_MAX
+#define EXCLUSIVE_HISTORY_CPU_LEN 256
+void cpu_exclusive_history_init(void);
+void cpu_exclusive_history_free(void);
 
 /**
  * CPUState:
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
index ca2ac95..1e24fec 100644
--- a/softmmu_llsc_template.h
+++ b/softmmu_llsc_template.h
@@ -77,6 +77,7 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong 
addr,
 CPUState *cpu;
 
 cpu_physical_memory_set_excl(hw_addr);
+excl_history_put_addr(hw_addr);
 CPU_FOREACH(cpu) {
 if (this_cpu != cpu) {
 tlb_flush(cpu, 1);
diff --git a/vl.c b/vl.c
index f043009..b22d99b 100644
--- a/vl.c
+++ b/vl.c
@@ -547,6 +547,7 @@ static void res_free(void)
 {
 g_free(boot_splash_filedata);
 boot_splash_filedata = NULL;
+cpu_exclusive_history_free();
 }
 
 static int default_driver_check(void *opaque, QemuOpts *opts, Error **errp)
@@ -4322,6 +4323,8 @@ int main(int argc, char **argv, char **envp)
 
 configure_accelerator(current_machine);
 
+cpu_exclusive_history_init();
+
 if (qtest_chrdev) {
 qtest_init(qtest_chrdev, qtest_log, _fatal);
 }
-- 
2.8.0




[Qemu-devel] [RFC v8 02/14] softmmu: Simplify helper_*_st_name, wrap unaligned code

2016-04-19 Thread Alvise Rigo
In an attempt to simplify the helper_*_st_name functions, wrap the
do_unaligned_access code into a shared inline function. Since this also
removes the goto statement, the inline code is expanded twice in each
helper.

From Message-id 1452268394-31252-2-git-send-email-alex.ben...@linaro.org:
There is a minor wrinkle that we need to use a unique name for each
inline fragment as the template is included multiple times. For this the
smmu_helper macro does the appropriate glue magic.
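
For illustration only, this is how the expansion works, assuming SUFFIX is
l and MMUSUFFIX is _mmu (values picked just for this example):

/* glue() pastes tokens, so with SUFFIX = l and MMUSUFFIX = _mmu:
 *
 *   smmu_helper(do_unl_store)(env, ...);
 *
 * expands to
 *
 *   smmu_helper_l_mmu_do_unl_store(env, ...);
 *
 * giving each inclusion of the template a uniquely named inline helper.
 */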

I've tested the result with no change to functionality. Comparing the
objdump of cputlb.o shows minimal changes in probe_write; everything else
is identical.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
CC: Alvise Rigo <a.r...@virtualopensystems.com>
Signed-off-by: Alex Bennée <alex.ben...@linaro.org>
[Alex Bennée: define smmu_helper and unified logic between be/le]

Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 softmmu_template.h | 82 ++
 1 file changed, 46 insertions(+), 36 deletions(-)

diff --git a/softmmu_template.h b/softmmu_template.h
index 208f808..3eb54f8 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -370,6 +370,46 @@ static inline void glue(io_write, SUFFIX)(CPUArchState 
*env,
  iotlbentry->attrs);
 }
 
+/* Inline helper functions for SoftMMU
+ *
+ * These functions help reduce code duplication in the various main
+ * helper functions. Constant arguments (like endian state) will allow
+ * the compiler to skip code which is never called in a given inline.
+ */
+#define smmu_helper(name) glue(glue(glue(smmu_helper_, SUFFIX), \
+MMUSUFFIX), _##name)
+static inline void smmu_helper(do_unl_store)(CPUArchState *env,
+ bool little_endian,
+ DATA_TYPE val,
+ target_ulong addr,
+ TCGMemOpIdx oi,
+ unsigned mmu_idx,
+ uintptr_t retaddr)
+{
+int i;
+
+if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+ mmu_idx, retaddr);
+}
+/* Note: relies on the fact that tlb_fill() does not remove the
+ * previous page from the TLB cache.  */
+for (i = DATA_SIZE - 1; i >= 0; i--) {
+uint8_t val8;
+if (little_endian) {
+/* Little-endian extract.  */
+val8 = val >> (i * 8);
+} else {
+/* Big-endian extract.  */
+val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
+}
+/* Note the adjustment at the beginning of the function.
+   Undo that for the recursion.  */
+glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
+oi, retaddr + GETPC_ADJ);
+}
+}
+
 void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
TCGMemOpIdx oi, uintptr_t retaddr)
 {
@@ -399,7 +439,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
 CPUIOTLBEntry *iotlbentry;
 if ((addr & (DATA_SIZE - 1)) != 0) {
-goto do_unaligned_access;
+smmu_helper(do_unl_store)(env, false, val, addr, oi, mmu_idx, 
retaddr);
+return;
 }
iotlbentry = &env->iotlb[mmu_idx][index];
 
@@ -414,23 +455,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 if (DATA_SIZE > 1
 && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
  >= TARGET_PAGE_SIZE)) {
-int i;
-do_unaligned_access:
-if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
-cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
- mmu_idx, retaddr);
-}
-/* XXX: not efficient, but simple */
-/* Note: relies on the fact that tlb_fill() does not remove the
- * previous page from the TLB cache.  */
-for (i = DATA_SIZE - 1; i >= 0; i--) {
-/* Little-endian extract.  */
-uint8_t val8 = val >> (i * 8);
-/* Note the adjustment at the beginning of the function.
-   Undo that for the recursion.  */
-glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
-oi, retaddr + GETPC_ADJ);
-}
+smmu_helper(do_unl_store)(env, true, val, addr, oi, mmu_idx, retaddr);
 return;
 }
 
@@ -479,7 +504,8 @@ void helper_be_st_name(CPUArchState *env, target_ulon

[Qemu-devel] [RFC v8 01/14] exec.c: Add new exclusive bitmap to ram_list

2016-04-19 Thread Alvise Rigo
The purpose of this new bitmap is to flag the memory pages that are in
the middle of LL/SC operations (after an LL, before an SC). For all these
pages, the corresponding TLB entries will be generated in such a way as to
force the slow path for all the VCPUs (see the following patches).

When the system starts, the whole memory is set to dirty.
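
The intended flow, as later patches in this series wire the accessors up,
is roughly the following sketch (illustrative only, not part of this
patch):

/* 1) LoadLink: flag the page and make the other vCPUs refill their TLB
 *    entries so that they pick up the slow-path flag. */
cpu_physical_memory_set_excl(hw_addr);
CPU_FOREACH(cpu) {
    if (cpu != this_cpu) {
        tlb_flush(cpu, 1);
    }
}

/* 2) TLB refill: pages still flagged in the bitmap get a special TLB flag
 *    (added by a later patch), forcing stores through the slow path. */
if (cpu_physical_memory_is_excl(section->mr->ram_addr + xlat)) {
    /* mark the TLB entry so that writes are trapped */
}

/* 3) When the page no longer needs protection (e.g. the exclusive access
 *    completed or the address is evicted from the history), clear it. */
cpu_physical_memory_unset_excl(hw_addr);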

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 exec.c  |  2 +-
 include/exec/memory.h   |  3 ++-
 include/exec/ram_addr.h | 31 +++
 3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/exec.c b/exec.c
index 7115403..cefee1b 100644
--- a/exec.c
+++ b/exec.c
@@ -1579,7 +1579,7 @@ static ram_addr_t ram_block_add(RAMBlock *new_block, 
Error **errp)
 ram_list.dirty_memory[i] =
 bitmap_zero_extend(ram_list.dirty_memory[i],
old_ram_size, new_ram_size);
-   }
+}
 }
 cpu_physical_memory_set_dirty_range(new_block->offset,
 new_block->used_length,
diff --git a/include/exec/memory.h b/include/exec/memory.h
index c92734a..71e0480 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -19,7 +19,8 @@
 #define DIRTY_MEMORY_VGA   0
 #define DIRTY_MEMORY_CODE  1
 #define DIRTY_MEMORY_MIGRATION 2
-#define DIRTY_MEMORY_NUM   3/* num of dirty bits */
+#define DIRTY_MEMORY_EXCLUSIVE 3
+#define DIRTY_MEMORY_NUM   4/* num of dirty bits */
 
 #include 
 #include 
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index ef1489d..19789fc 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -21,6 +21,7 @@
 
 #ifndef CONFIG_USER_ONLY
 #include "hw/xen/xen.h"
+#include "sysemu/sysemu.h"
 
 struct RAMBlock {
 struct rcu_head rcu;
@@ -172,6 +173,9 @@ static inline void 
cpu_physical_memory_set_dirty_range(ram_addr_t start,
 if (unlikely(mask & (1 << DIRTY_MEMORY_CODE))) {
 bitmap_set_atomic(d[DIRTY_MEMORY_CODE], page, end - page);
 }
+if (unlikely(mask & (1 << DIRTY_MEMORY_EXCLUSIVE))) {
+bitmap_set_atomic(d[DIRTY_MEMORY_EXCLUSIVE], page, end - page);
+}
 xen_modified_memory(start, length);
 }
 
@@ -287,5 +291,32 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(unsigned 
long *dest,
 }
 
 void migration_bitmap_extend(ram_addr_t old, ram_addr_t new);
+
+/* Exclusive bitmap support. */
+#define EXCL_BITMAP_GET_OFFSET(addr) (addr >> TARGET_PAGE_BITS)
+
+/* Make the page of @addr not exclusive. */
+static inline void cpu_physical_memory_unset_excl(ram_addr_t addr)
+{
+set_bit_atomic(EXCL_BITMAP_GET_OFFSET(addr),
+   ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+}
+
+/* Return true if the page of @addr is exclusive, i.e. the EXCL bit is set. */
+static inline int cpu_physical_memory_is_excl(ram_addr_t addr)
+{
+return !test_bit(EXCL_BITMAP_GET_OFFSET(addr),
+ ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+}
+
+/* Set the page of @addr as exclusive clearing its EXCL bit and return the
+ * previous bit's state. */
+static inline int cpu_physical_memory_set_excl(ram_addr_t addr)
+{
+return bitmap_test_and_clear_atomic(
+ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
+EXCL_BITMAP_GET_OFFSET(addr), 1);
+}
+
 #endif
 #endif
-- 
2.8.0




[Qemu-devel] [RFC v8 03/14] softmmu: Simplify helper_*_st_name, wrap MMIO code

2016-04-19 Thread Alvise Rigo
In an attempt to simplify the helper_*_st_name functions, wrap the MMIO
code into an inline function. The function covers both BE and LE cases and
is expanded twice in each helper (TODO: check this last statement).

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
CC: Alex Bennée <alex.ben...@linaro.org>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 softmmu_template.h | 49 +++--
 1 file changed, 27 insertions(+), 22 deletions(-)

diff --git a/softmmu_template.h b/softmmu_template.h
index 3eb54f8..9185486 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -410,6 +410,29 @@ static inline void smmu_helper(do_unl_store)(CPUArchState 
*env,
 }
 }
 
+static inline void smmu_helper(do_mmio_store)(CPUArchState *env,
+  bool little_endian,
+  DATA_TYPE val,
+  target_ulong addr,
+  TCGMemOpIdx oi, unsigned mmu_idx,
+  int index, uintptr_t retaddr)
+{
+CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
+
+if ((addr & (DATA_SIZE - 1)) != 0) {
+smmu_helper(do_unl_store)(env, little_endian, val, addr, mmu_idx, oi,
+  retaddr);
+}
+/* ??? Note that the io helpers always read data in the target
+   byte ordering.  We should push the LE/BE request down into io.  */
+if (little_endian) {
+val = TGT_LE(val);
+} else {
+val = TGT_BE(val);
+}
+glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+}
+
 void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
TCGMemOpIdx oi, uintptr_t retaddr)
 {
@@ -437,17 +460,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 
 /* Handle an IO access.  */
 if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
-CPUIOTLBEntry *iotlbentry;
-if ((addr & (DATA_SIZE - 1)) != 0) {
-smmu_helper(do_unl_store)(env, false, val, addr, oi, mmu_idx, 
retaddr);
-return;
-}
-iotlbentry = &env->iotlb[mmu_idx][index];
-
-/* ??? Note that the io helpers always read data in the target
-   byte ordering.  We should push the LE/BE request down into io.  */
-val = TGT_LE(val);
-glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+smmu_helper(do_mmio_store)(env, true, val, addr, oi, mmu_idx, index,
+   retaddr);
 return;
 }
 
@@ -502,17 +516,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 
 /* Handle an IO access.  */
 if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
-CPUIOTLBEntry *iotlbentry;
-if ((addr & (DATA_SIZE - 1)) != 0) {
-smmu_helper(do_unl_store)(env, true, val, addr, oi, mmu_idx, 
retaddr);
-return;
-}
-iotlbentry = &env->iotlb[mmu_idx][index];
-
-/* ??? Note that the io helpers always read data in the target
-   byte ordering.  We should push the LE/BE request down into io.  */
-val = TGT_BE(val);
-glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+smmu_helper(do_mmio_store)(env, false, val, addr, oi, mmu_idx, index,
+   retaddr);
 return;
 }
 
-- 
2.8.0




[Qemu-devel] [RFC v8 04/14] softmmu: Simplify helper_*_st_name, wrap RAM code

2016-04-19 Thread Alvise Rigo
In an attempt to simplify the helper_*_st_name functions, wrap the code
relating to a RAM access into an inline function. The function covers both
BE and LE cases and is expanded twice in each helper (TODO: check this last
statement).

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
CC: Alex Bennée <alex.ben...@linaro.org>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 softmmu_template.h | 80 +++---
 1 file changed, 40 insertions(+), 40 deletions(-)

diff --git a/softmmu_template.h b/softmmu_template.h
index 9185486..ea6a0fb 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -433,13 +433,48 @@ static inline void 
smmu_helper(do_mmio_store)(CPUArchState *env,
 glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
 }
 
+static inline void smmu_helper(do_ram_store)(CPUArchState *env,
+ bool little_endian, DATA_TYPE val,
+ target_ulong addr, TCGMemOpIdx oi,
+ unsigned mmu_idx, int index,
+ uintptr_t retaddr)
+{
+uintptr_t haddr;
+
+/* Handle slow unaligned access (it spans two pages or IO).  */
+if (DATA_SIZE > 1
+&& unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
+ >= TARGET_PAGE_SIZE)) {
+smmu_helper(do_unl_store)(env, little_endian, val, addr, oi, mmu_idx,
+  retaddr);
+return;
+}
+
+/* Handle aligned access or unaligned access in the same page.  */
+if ((addr & (DATA_SIZE - 1)) != 0
+&& (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+ mmu_idx, retaddr);
+}
+
+haddr = addr + env->tlb_table[mmu_idx][index].addend;
+#if DATA_SIZE == 1
+glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
+#else
+if (little_endian) {
+glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
+} else {
+glue(glue(st, SUFFIX), _be_p)((uint8_t *)haddr, val);
+}
+#endif
+}
+
 void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
TCGMemOpIdx oi, uintptr_t retaddr)
 {
 unsigned mmu_idx = get_mmuidx(oi);
 int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
 target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
-uintptr_t haddr;
 
 /* Adjust the given return address.  */
 retaddr -= GETPC_ADJ;
@@ -465,27 +500,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 return;
 }
 
-/* Handle slow unaligned access (it spans two pages or IO).  */
-if (DATA_SIZE > 1
-&& unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
- >= TARGET_PAGE_SIZE)) {
-smmu_helper(do_unl_store)(env, true, val, addr, oi, mmu_idx, retaddr);
-return;
-}
-
-/* Handle aligned access or unaligned access in the same page.  */
-if ((addr & (DATA_SIZE - 1)) != 0
-&& (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
-cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
- mmu_idx, retaddr);
-}
-
-haddr = addr + env->tlb_table[mmu_idx][index].addend;
-#if DATA_SIZE == 1
-glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
-#else
-glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
-#endif
+smmu_helper(do_ram_store)(env, true, val, addr, oi, mmu_idx, index,
+  retaddr);
 }
 
 #if DATA_SIZE > 1
@@ -495,7 +511,6 @@ void helper_be_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 unsigned mmu_idx = get_mmuidx(oi);
 int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
 target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
-uintptr_t haddr;
 
 /* Adjust the given return address.  */
 retaddr -= GETPC_ADJ;
@@ -521,23 +536,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 return;
 }
 
-/* Handle slow unaligned access (it spans two pages or IO).  */
-if (DATA_SIZE > 1
-&& unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
- >= TARGET_PAGE_SIZE)) {
-smmu_helper(do_unl_store)(env, false, val, addr, oi, mmu_idx, retaddr);
-return;
-}
-
-/* Handle aligned access or unaligned access in the same page.  */
-if ((addr & (DATA_SIZE - 1)) != 0
-&& (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
-cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
- mmu_idx, retaddr);
-}
-
-haddr = 

[Qemu-devel] [RFC v8 05/14] softmmu: Add new TLB_EXCL flag

2016-04-19 Thread Alvise Rigo
Add a new TLB flag to force all the accesses made to a page to follow
the slow-path.

The TLB entries referring guest pages with the DIRTY_MEMORY_EXCLUSIVE
bit clean will have this flag set.
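
As a reminder of why a low flag bit is enough to force the slow path, this
is essentially the existing check in the softmmu helpers (shown here only
for illustration, it is not new code):

target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;

if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
    /* Any flag stored in the low bits of the entry -- TLB_NOTDIRTY,
     * TLB_MMIO and now TLB_EXCL -- makes this test true, so the access
     * cannot be completed on the fast path and is handled by the helper
     * instead. */
}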

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 include/exec/cpu-all.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
index 83b1781..f8d8feb 100644
--- a/include/exec/cpu-all.h
+++ b/include/exec/cpu-all.h
@@ -277,6 +277,14 @@ CPUArchState *cpu_copy(CPUArchState *env);
 #define TLB_NOTDIRTY(1 << 4)
 /* Set if TLB entry is an IO callback.  */
 #define TLB_MMIO(1 << 5)
+/* Set if TLB entry references a page that requires exclusive access.  */
+#define TLB_EXCL(1 << 6)
+
+/* Do not allow a TARGET_PAGE_MASK which covers one or more bits defined
+ * above. */
+#if TLB_EXCL >= TARGET_PAGE_SIZE
+#error TARGET_PAGE_MASK covering the low bits of the TLB virtual address
+#endif
 
 void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
 void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf);
-- 
2.8.0




[Qemu-devel] [RFC v8 00/14] Slow-path for atomic instruction translation

2016-04-19 Thread Alvise Rigo
 aarch64 TCG backend support
- part of the code has been rewritten

Changes from v2:
- the bitmap accessors are now atomic
- a rendezvous between vCPUs and a simple callback support before executing
  a TB have been added to handle the TLB flush support
- the softmmu_template and softmmu_llsc_template have been adapted to work
  on real multi-threading

Changes from v1:
- The ram bitmap is not reversed anymore, 1 = dirty, 0 = exclusive
- The way how the offset to access the bitmap is calculated has
  been improved and fixed
- A page to be set as dirty requires a vCPU to target the protected address
  and not just an address in the page
- Addressed comments from Richard Henderson to improve the logic in
  softmmu_template.h and to simplify the methods generation through
  softmmu_llsc_template.h
- Added initial implementation of qemu_{ldlink,stcond}_i32 for tcg/i386

This work has been sponsored by Huawei Technologies Duesseldorf GmbH.

Alvise Rigo (14):
  exec.c: Add new exclusive bitmap to ram_list
  softmmu: Simplify helper_*_st_name, wrap unaligned code
  softmmu: Simplify helper_*_st_name, wrap MMIO code
  softmmu: Simplify helper_*_st_name, wrap RAM code
  softmmu: Add new TLB_EXCL flag
  qom: cpu: Add CPUClass hooks for exclusive range
  softmmu: Add helpers for a new slowpath
  softmmu: Add history of excl accesses
  softmmu: Honor the new exclusive bitmap
  softmmu: Support MMIO exclusive accesses
  tcg: Create new runtime helpers for excl accesses
  target-arm: translate: Use ld/st excl for atomic insns
  target-arm: cpu64: use custom set_excl hook
  target-arm: aarch64: Use ls/st exclusive for atomic insns

 Makefile.target |   2 +-
 cputlb.c|  64 ++-
 exec.c  |  21 +++-
 include/exec/cpu-all.h  |   8 ++
 include/exec/helper-gen.h   |   3 +
 include/exec/helper-proto.h |   1 +
 include/exec/helper-tcg.h   |   3 +
 include/exec/memory.h   |   4 +-
 include/exec/ram_addr.h |  31 +
 include/qom/cpu.h   |  33 ++
 qom/cpu.c   |  29 +
 softmmu_llsc_template.h | 136 ++
 softmmu_template.h  | 274 ++--
 target-arm/cpu.h|   3 +
 target-arm/cpu64.c  |   8 ++
 target-arm/helper-a64.c |  55 +
 target-arm/helper-a64.h |   2 +
 target-arm/helper.h |   2 +
 target-arm/machine.c|   7 ++
 target-arm/op_helper.c  |  14 ++-
 target-arm/translate-a64.c  | 168 +++
 target-arm/translate.c  | 263 +++---
 tcg-llsc-helper.c   | 104 +
 tcg-llsc-helper.h   |  61 ++
 tcg/tcg-llsc-gen-helper.h   |  67 +++
 tcg/tcg.h   |  31 +
 vl.c|   3 +
 27 files changed, 1110 insertions(+), 287 deletions(-)
 create mode 100644 softmmu_llsc_template.h
 create mode 100644 tcg-llsc-helper.c
 create mode 100644 tcg-llsc-helper.h
 create mode 100644 tcg/tcg-llsc-gen-helper.h

-- 
2.8.0




Re: [Qemu-devel] MTTCG Sync-up call today? Agenda items?

2016-04-11 Thread alvise rigo
Hi Alex,

On Mon, Apr 11, 2016 at 1:21 PM, Alex Bennée  wrote:
>
> Hi,
>
> It's been awhile since we synced-up with quite weeks and Easter out of
> the way are we good for a call today?


Indeed, it has been a while.

>
>
> Some items I can think would be worth covering:
>
>   - State of MTTCG enabled LL/SC
>
> I think Alvise was looking at some run-loop changes for the MTTCG
> enabled part of his LL/SC patch set. I haven't heard anything for a
> while.


I've been quite busy lately, but expect the v8 of the LL/SC patch
series by the end of this week. Sorry for the delay.

In any case, I'm in if any call will be held.

Regards,
alvise

>
>
>   - Linaro's current efforts
>
> Sergey is currently working with us in up-streaming MTTCG related
> patches. We've taken stuff from Paolo and Fred's trees and push
> several series to the list for review:
>   - various TCG clean-ups
>   - atomic jump patching
>   - base enabling patches
>
>   - Memory ordering work
>
> I put this up as a suggested project for GSoC and we had several
> applicants. We are currently awaiting Google's decision on slot
> allocations so hopefully we'll have an extra pair of hands on this
> chunk.
>
>   - Emilio's work
>
> Emilio posted a series last year with some alternative approaches to
> solving some of the MTTCG problems. He's back and working on this
> again and has posted his qht series as a precursor to his next
> revision of the MTTCG tree.
>
>   - TCG maintainers view
>
> It would be useful if we had a view from the TCG maintainers of how
> we are doing and if the approaches have a chance of getting merged.
>
>   - Timescales
>
> Internally at Linaro we've been discussing potential timescales. We are
> currently aiming to have all major pieces of MTTCG posted on list and
> being iterated through reviews before this years KVM Forum. This will
> leave us KVM forum to resolve any remaining barriers to eventual
> up-streaming.
>
> Anything else worthy of discussion? If you need the number to call ping
> me off list.
>
> --
> Alex Bennée



Re: [Qemu-devel] [mttcg] cputlb: Use async tlb_flush_by_mmuidx

2016-03-11 Thread alvise rigo
Hi Paolo,

On Mon, Mar 7, 2016 at 10:18 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>
> On 04/03/2016 15:28, alvise rigo wrote:
>> A small update on this. I have a working implementation of the "halted
>> state" mechanism for waiting all the pending flushes to be completed.
>> However, the way I'm going back to the cpus.c loop (the while(1) in
>> qemu_tcg_cpu_thread_fn) is a bit convoluted. In the case of the TLB ops
>> that always end the TB, a simple cpu_exit() allows me to go back to the
>> main loop. I think in this case we can also use the cpu_loop_exit(),
>> though making the code a bit more complicated since the PC would require
>> some adjustments.
>
> I think in both cases the function to use is cpu_loop_exit_restore.  It
> will restart execution of the current instruction so it should be fine
> as long as you don't call it unconditionally.

Indeed, cpu_loop_exit_restore() works just fine for those helpers that
do not return any value, thank you.

>
> If you're not calling it directly from the helper, you need to save
> GETPC() in the helper and propagate it down to the call site.  Then the
> call site can use it as the last argument.  For an example see
> helper_ljmp_protected's call to switch_tss_ra in target-i386/seg_helper.c.
>
>> I wanted then to apply the same "halted state" to the LoadLink helper,
>> since also this one might cause some flush requests.
>
> Interesting, where is this documented in the ARM ARM?

I'm referring to the usual flush requests that an LL(x) operation might
issue in order to have all the VCPUs agreeing on "x is an exclusive
address". By adding the halted state we ensure that the calling VCPU
resumes its execution only after all the other VCPUs have set the TLB_EXCL
flag (this should also fix the race condition you were worried about).
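
Roughly, the idea looks like the following sketch; the pending_flushes
counter and tlb_flush_and_ack_work are made-up names used only to
illustrate the mechanism, not existing code:

/* Requesting vCPU, e.g. in the LL helper */
CPU_FOREACH(cpu) {
    if (cpu != this_cpu) {
        this_cpu->pending_flushes++;    /* illustrative field */
        /* the work item flushes cpu's TLB and then acknowledges */
        async_run_on_cpu(cpu, tlb_flush_and_ack_work, this_cpu);
    }
}
/* enter the halted state; the vCPU sits in the cpus.c loop (and can
 * service other CPUs' requests) until pending_flushes drops to zero */

/* In tlb_flush_and_ack_work, run on each target vCPU:
 *     tlb_flush(cpu, 1);
 *     if (--requester->pending_flushes == 0) {
 *         qemu_cpu_kick(requester);    // wake it up, resume after the LL
 *     }
 */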

>
>> In this case, we
>> can not just call cpu_loop_exit() in that the guest code would miss the
>> returned value. Forcing the LDREX instruction to also end the TB through
>> an empty 'is_jmp' condition did the trick allowing once again to use
>> cpu_exit(). Is there another better solution?
>
> Perhaps cpu_loop_exit_restore()?

For some reason this does not work for exiting from helper_ldlink_name in
softmmu_llsc_template.h (the method returns a "WORD_TYPE"). Guest
execution ends up in an infinite loop, most likely because of a deadlock
caused by improper emulation of LDREX and STREX. In any case, the
cpu_exit() solution still works, with the downside of a slightly bigger
overhead in exiting/entering the TB.
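
For reference, the pattern suggested above looks roughly like this for a
void helper; helper_do_flush and need_restart are made-up names for the
example, not existing QEMU symbols:

void helper_do_flush(CPUArchState *env, target_ulong addr)
{
    uintptr_t ra = GETPC();    /* must be taken in the helper itself */

    if (need_restart(env, addr)) {
        /* go back to the cpus.c loop; the same guest instruction will be
         * re-executed once the pending work has been carried out */
        cpu_loop_exit_restore(ENV_GET_CPU(env), ra);
    }
    /* ... normal path ... */
}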

Thank you,
alvise

>
> Paolo



Re: [Qemu-devel] [RFC v7 14/16] target-arm: translate: Use ld/st excl for atomic insns

2016-03-07 Thread alvise rigo
On Thu, Feb 18, 2016 at 6:02 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> Use the new LL/SC runtime helpers to handle the ARM atomic instructions
>> in softmmu_llsc_template.h.
>>
>> In general, the helper generator
>> gen_{ldrex,strex}_{8,16a,32a,64a}() calls the function
>> helper_{le,be}_{ldlink,stcond}{ub,uw,ul,q}_mmu() implemented in
>> softmmu_llsc_template.h, doing an alignment check.
>>
>> In addition, add a simple helper function to emulate the CLREX instruction.
>>
>> Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  target-arm/cpu.h   |   2 +
>>  target-arm/helper.h|   4 ++
>>  target-arm/machine.c   |   2 +
>>  target-arm/op_helper.c |  10 +++
>>  target-arm/translate.c | 188 
>> +++--
>>  5 files changed, 202 insertions(+), 4 deletions(-)
>>
>> diff --git a/target-arm/cpu.h b/target-arm/cpu.h
>> index b8b3364..bb5361f 100644
>> --- a/target-arm/cpu.h
>> +++ b/target-arm/cpu.h
>> @@ -462,9 +462,11 @@ typedef struct CPUARMState {
>>  float_status fp_status;
>>  float_status standard_fp_status;
>>  } vfp;
>> +#if !defined(CONFIG_ARM_USE_LDST_EXCL)
>>  uint64_t exclusive_addr;
>>  uint64_t exclusive_val;
>>  uint64_t exclusive_high;
>> +#endif
>>  #if defined(CONFIG_USER_ONLY)
>>  uint64_t exclusive_test;
>>  uint32_t exclusive_info;
>> diff --git a/target-arm/helper.h b/target-arm/helper.h
>> index c2a85c7..6bc3c0a 100644
>> --- a/target-arm/helper.h
>> +++ b/target-arm/helper.h
>> @@ -532,6 +532,10 @@ DEF_HELPER_2(dc_zva, void, env, i64)
>>  DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
>>  DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
>>
>> +#ifdef CONFIG_ARM_USE_LDST_EXCL
>> +DEF_HELPER_1(atomic_clear, void, env)
>> +#endif
>> +
>>  #ifdef TARGET_AARCH64
>>  #include "helper-a64.h"
>>  #endif
>> diff --git a/target-arm/machine.c b/target-arm/machine.c
>> index ed1925a..7adfb4d 100644
>> --- a/target-arm/machine.c
>> +++ b/target-arm/machine.c
>> @@ -309,9 +309,11 @@ const VMStateDescription vmstate_arm_cpu = {
>>  VMSTATE_VARRAY_INT32(cpreg_vmstate_values, ARMCPU,
>>   cpreg_vmstate_array_len,
>>   0, vmstate_info_uint64, uint64_t),
>> +#if !defined(CONFIG_ARM_USE_LDST_EXCL)
>>  VMSTATE_UINT64(env.exclusive_addr, ARMCPU),
>>  VMSTATE_UINT64(env.exclusive_val, ARMCPU),
>>  VMSTATE_UINT64(env.exclusive_high, ARMCPU),
>> +#endif
>
> Hmm this does imply we either need to support migration of the LL/SC
> state in the generic code or map the generic state into the ARM specific
> machine state or we'll break migration.
>
> The later if probably better so you can save machine state from a
> pre-LL/SC build and migrate to a new LL/SC enabled build.

This would basically require adding some code to cpu_pre_save to copy
env.exclusive_* to the new structures. As a consequence, it would not get
rid of the pre-LL/SC variables.
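
For illustration only, one possible shape of that mapping, assuming the
legacy env.exclusive_* fields are kept and the generic state lives in
CPUState as in this series (a sketch, not the actual change):

/* target-arm/machine.c, sketch */
static void cpu_pre_save(void *opaque)
{
    ARMCPU *cpu = opaque;

    /* expose the generic LL/SC state through the legacy vmstate field */
    cpu->env.exclusive_addr = CPU(cpu)->excl_protected_range.begin;
}

static int cpu_post_load(void *opaque, int version_id)
{
    ARMCPU *cpu = opaque;

    /* rebuild the generic state from the migrated legacy field */
    CPU(cpu)->excl_protected_range.begin = cpu->env.exclusive_addr;
    return 0;
}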

>
>>  VMSTATE_UINT64(env.features, ARMCPU),
>>  VMSTATE_UINT32(env.exception.syndrome, ARMCPU),
>>  VMSTATE_UINT32(env.exception.fsr, ARMCPU),
>> diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
>> index a5ee65f..404c13b 100644
>> --- a/target-arm/op_helper.c
>> +++ b/target-arm/op_helper.c
>> @@ -51,6 +51,14 @@ static int exception_target_el(CPUARMState *env)
>>  return target_el;
>>  }
>>
>> +#ifdef CONFIG_ARM_USE_LDST_EXCL
>> +void HELPER(atomic_clear)(CPUARMState *env)
>> +{
>> +ENV_GET_CPU(env)->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
>> +ENV_GET_CPU(env)->ll_sc_context = false;
>> +}
>> +#endif
>> +
>
> Given this is just touching generic CPU state this helper should probably be
> part of the generic TCG runtime. I assume other arches will just call
> this helper as well.

Would it make sense instead to add a new CPUClass hook for this? Other
architectures might want a different behaviour (or add something
else).

Thank you,
alvise

>
>
>>  uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
>>uint32_t rn, uint32_t maxindex)
>>  

Re: [Qemu-devel] [RFC v7 10/16] softmmu: Protect MMIO exclusive range

2016-03-07 Thread alvise rigo
On Thu, Feb 18, 2016 at 5:25 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> alvise rigo <a.r...@virtualopensystems.com> writes:
>
>> On Wed, Feb 17, 2016 at 7:55 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>>>
>>> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>>>
>>>> As for the RAM case, also the MMIO exclusive ranges have to be protected
>>>> by other CPU's accesses. In order to do that, we flag the accessed
>>>> MemoryRegion to mark that an exclusive access has been performed and is
>>>> not concluded yet.
>>>>
>>>> This flag will force the other CPUs to invalidate the exclusive range in
>>>> case of collision.
>>>>
>>>> Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
>>>> Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
>>>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>>>> ---
>>>>  cputlb.c| 20 +---
>>>>  include/exec/memory.h   |  1 +
>>>>  softmmu_llsc_template.h | 11 +++
>>>>  softmmu_template.h  | 22 ++
>>>>  4 files changed, 43 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/cputlb.c b/cputlb.c
>>>> index 87d09c8..06ce2da 100644
>>>> --- a/cputlb.c
>>>> +++ b/cputlb.c
>>>> @@ -496,19 +496,25 @@ tb_page_addr_t get_page_addr_code(CPUArchState 
>>>> *env1, target_ulong addr)
>>>>  /* For every vCPU compare the exclusive address and reset it in case of a
>>>>   * match. Since only one vCPU is running at once, no lock has to be held 
>>>> to
>>>>   * guard this operation. */
>>>> -static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
>>>> +static inline bool lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
>>>>  {
>>>>  CPUState *cpu;
>>>> +bool ret = false;
>>>>
>>>>  CPU_FOREACH(cpu) {
>>>> -if (cpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
>>>> -ranges_overlap(cpu->excl_protected_range.begin,
>>>> -   cpu->excl_protected_range.end -
>>>> -   cpu->excl_protected_range.begin,
>>>> -   addr, size)) {
>>>> -cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
>>>> +if (current_cpu != cpu) {
>>>
>>> I'm confused by this change. I don't see anywhere in the MMIO handling
>>> why we would want to change skipping the CPU. Perhaps this belongs in
>>> the previous patch? Maybe the function should really be
>>> lookup_and_maybe_reset_other_cpu_ll_addr?
>>
>> This is actually used later on in this patch.
>
> But aren't there other users before the functional change was made to
> skip the current_cpu? Where their expectations wrong or should we have
> always skipped the current CPU?

I see your point now. When current_cpu was skipped, there was no need for
the line
cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
in helper_stcond_name() when we returned from softmmu_template.h.

The mistake is that this line should have been added in this patch, not in
PATCH 07/16. I will fix it for the next version.

>
> The additional of the bool return I agree only needs to be brought in
> now when there are functions that care.
>
>>
>>>
>>>> +if (cpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
>>>> +ranges_overlap(cpu->excl_protected_range.begin,
>>>> +   cpu->excl_protected_range.end -
>>>> +   cpu->excl_protected_range.begin,
>>>> +   addr, size)) {
>>>> +cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
>>>> +ret = true;
>>>> +}
>>>>  }
>>>>  }
>>>> +
>>>> +return ret;
>>>>  }
>>>>
>>>>  #define MMUSUFFIX _mmu
>>>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>>>> index 71e0480..bacb3ad 100644
>>>> --- a/include/exec/memory.h
>>>> +++ b/include/exec/memory.h
>>>> @@ -171,6 +171,7 @@ struct MemoryRegion {
>>>>  bool rom_device;
>>>>  bool flush_coa

Re: [Qemu-devel] [RFC v7 12/16] configure: Use slow-path for atomic only when the softmmu is enabled

2016-03-07 Thread alvise rigo
On Thu, Feb 18, 2016 at 5:40 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> Use the new slow path for atomic instruction translation when the
>> softmmu is enabled.
>>
>> At the moment only arm and aarch64 use the new LL/SC backend. It is
>> possible to disable such backed with --disable-arm-llsc-backend.
>
> Do we want to disable the backend once it is merged? Does it serve a
> purpose other than to confuse the user?

I added this option in order to have a quick way to build binaries with
and without the backend; it has been useful during development. Now it's
probably time to drop it.

Thank you,
alvise

>
>>
>> Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  configure | 14 ++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/configure b/configure
>> index 44ac9ab..915efcc 100755
>> --- a/configure
>> +++ b/configure
>> @@ -294,6 +294,7 @@ solaris="no"
>>  profiler="no"
>>  cocoa="no"
>>  softmmu="yes"
>> +arm_tcg_use_llsc="yes"
>>  linux_user="no"
>>  bsd_user="no"
>>  aix="no"
>> @@ -880,6 +881,10 @@ for opt do
>>;;
>>--disable-debug-tcg) debug_tcg="no"
>>;;
>> +  --enable-arm-llsc-backend) arm_tcg_use_llsc="yes"
>> +  ;;
>> +  --disable-arm-llsc-backend) arm_tcg_use_llsc="no"
>> +  ;;
>>--enable-debug)
>># Enable debugging options that aren't excessively noisy
>>debug_tcg="yes"
>> @@ -4751,6 +4756,7 @@ echo "host CPU  $cpu"
>>  echo "host big endian   $bigendian"
>>  echo "target list   $target_list"
>>  echo "tcg debug enabled $debug_tcg"
>> +echo "arm use llsc backend" $arm_tcg_use_llsc
>>  echo "gprof enabled $gprof"
>>  echo "sparse enabled$sparse"
>>  echo "strip binaries$strip_opt"
>> @@ -4806,6 +4812,7 @@ echo "Install blobs $blobs"
>>  echo "KVM support   $kvm"
>>  echo "RDMA support  $rdma"
>>  echo "TCG interpreter   $tcg_interpreter"
>> +echo "use ld/st excl$softmmu"
>
> I think we can drop everything above here.
>
>>  echo "fdt support   $fdt"
>>  echo "preadv support$preadv"
>>  echo "fdatasync $fdatasync"
>> @@ -5863,6 +5870,13 @@ fi
>>  echo "LDFLAGS+=$ldflags" >> $config_target_mak
>>  echo "QEMU_CFLAGS+=$cflags" >> $config_target_mak
>>
>> +# Use tcg LL/SC tcg backend for exclusive instruction is arm/aarch64
>> +# softmmus targets
>> +if test "$arm_tcg_use_llsc" = "yes" ; then
>> +  if test "$target" = "arm-softmmu" ; then
>> +echo "CONFIG_ARM_USE_LDST_EXCL=y" >> $config_target_mak
>> +  fi
>> +fi
>
> This isn't going to be just ARM specific and it will be progressively
> turned on for other arches. So perhaps with the CONFIG_SOFTMMU section:
>
> if test "$target_softmmu" = "yes" ; then
> echo "CONFIG_SOFTMMU=y" >> $config_target_mak
>
> # Use SoftMMU LL/SC primitives?
> case "$target_name" in
> arm | aarch64)
> echo "CONFIG_USE_LDST_EXCL=y" >> $config_target_mak
> ;;
> esac
> fi
>
>
>>  done # for target in $targets
>>
>>  if [ "$pixman" = "internal" ]; then
>
>
> --
> Alex Bennée



Re: [Qemu-devel] [mttcg] cputlb: Use async tlb_flush_by_mmuidx

2016-03-04 Thread alvise rigo
A small update on this. I have a working implementation of the "halted
state" mechanism for waiting for all the pending flushes to complete.
However, the way I'm going back to the cpus.c loop (the while(1) in
qemu_tcg_cpu_thread_fn) is a bit convoluted. For the TLB ops that always
end the TB, a simple cpu_exit() allows me to go back to the main loop. I
think in this case we could also use cpu_loop_exit(), though that makes the
code a bit more complicated since the PC would require some adjustments.

I then wanted to apply the same "halted state" to the LoadLink helper,
since this one might also cause some flush requests. In this case we cannot
just call cpu_loop_exit(), because the guest code would miss the returned
value. Forcing the LDREX instruction to also end the TB through an empty
'is_jmp' condition did the trick, allowing once again the use of
cpu_exit(). Is there a better solution?

Thank you,
alvise

On Mon, Feb 29, 2016 at 3:18 PM, alvise rigo <a.r...@virtualopensystems.com>
wrote:

> I see the risk. I will come back with something and let you know.
>
> Thank you,
> alvise
>
> On Mon, Feb 29, 2016 at 3:06 PM, Paolo Bonzini <pbonz...@redhat.com>
> wrote:
> >
> >
> > On 29/02/2016 15:02, alvise rigo wrote:
> >> > Yeah, that's the other approach -- really split the things that can
> >> > be async and do real "wait for completion" at points which must
> >> > synchronize. (Needs a little care since DMB is not the only such
> point.)
> >> > An initial implementation that does an immediate wait-for-completion
> >> > is probably simpler to review though, and add the real asynchrony
> >> > later. And either way you need an API for the target to wait for
> >> > completion.
> >> OK, so basically being sure that the target CPU performs the flush
> >> before executing the next TB is not enough. We need a sort of feedback
> >> that the flush has been done before emulating the next guest
> >> instruction. Did I get it right?
> >
> > That risks getting deadlocks if CPU A asks B to flush the TLB and vice
> > versa.  Using a halted state means that the VCPU thread goes through the
> > cpus.c loop and can for example service other CPUs' TLB flush requests.
> >
> > Paolo
>


Re: [Qemu-devel] [mttcg] cputlb: Use async tlb_flush_by_mmuidx

2016-02-29 Thread alvise rigo
I see the risk. I will come back with something and let you know.

Thank you,
alvise

On Mon, Feb 29, 2016 at 3:06 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>
> On 29/02/2016 15:02, alvise rigo wrote:
>> > Yeah, that's the other approach -- really split the things that can
>> > be async and do real "wait for completion" at points which must
>> > synchronize. (Needs a little care since DMB is not the only such point.)
>> > An initial implementation that does an immediate wait-for-completion
>> > is probably simpler to review though, and add the real asynchrony
>> > later. And either way you need an API for the target to wait for
>> > completion.
>> OK, so basically being sure that the target CPU performs the flush
>> before executing the next TB is not enough. We need a sort of feedback
>> that the flush has been done before emulating the next guest
>> instruction. Did I get it right?
>
> That risks getting deadlocks if CPU A asks B to flush the TLB and vice
> versa.  Using a halted state means that the VCPU thread goes through the
> cpus.c loop and can for example service other CPUs' TLB flush requests.
>
> Paolo



Re: [Qemu-devel] [mttcg] cputlb: Use async tlb_flush_by_mmuidx

2016-02-29 Thread alvise rigo
On Mon, Feb 29, 2016 at 2:55 PM, Peter Maydell <peter.mayd...@linaro.org> wrote:
> On 29 February 2016 at 13:50, Paolo Bonzini <pbonz...@redhat.com> wrote:
>>
>>
>> On 29/02/2016 14:21, Peter Maydell wrote:
>>> On 29 February 2016 at 13:16, Alvise Rigo <a.r...@virtualopensystems.com> 
>>> wrote:
>>>> > As in the case of tlb_flush(), also tlb_flush_by_mmuidx has to query the
>>>> > TLB flush if it targets another VCPU. To accomplish this, a new async
>>>> > work has been added, together with a new TLBFlushByMMUIdxParams. A
>>>> > bitmap is used to track the MMU indexes to flush.
>>>> >
>>>> > This patch applies to the multi_tcg_v8 branch.
>>> What's the API for a target CPU emulation to say "and now I must
>>> wait for the TLB op to finish" before completing this guest
>>> instruction?
>>
>> My proposal has been for a while for DMB to put the CPU in a halted
>> state (remote TLB callbacks then can decrement a counter and signal
>> cpu_halt_cond when it's zero), but no one has implemented this.
>
> Yeah, that's the other approach -- really split the things that can
> be async and do real "wait for completion" at points which must
> synchronize. (Needs a little care since DMB is not the only such point.)
> An initial implementation that does an immediate wait-for-completion
> is probably simpler to review though, and add the real asynchrony
> later. And either way you need an API for the target to wait for
> completion.

OK, so basically being sure that the target CPU performs the flush
before executing the next TB is not enough. We need a sort of feedback
that the flush has been done before emulating the next guest
instruction. Did I get it right?

Thank you,
alvise

>
> thanks
> -- PMM



[Qemu-devel] [mttcg] cputlb: Use async tlb_flush_by_mmuidx

2016-02-29 Thread Alvise Rigo
As in the case of tlb_flush(), tlb_flush_by_mmuidx also has to queue the
TLB flush if it targets another VCPU. To accomplish this, a new async work
item has been added, together with a new TLBFlushByMMUIdxParams structure.
A bitmap is used to track the MMU indexes to flush.

This patch applies to the multi_tcg_v8 branch.

Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cputlb.c | 65 
 1 file changed, 53 insertions(+), 12 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 29252d1..1eeeccb 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -103,9 +103,11 @@ void tlb_flush(CPUState *cpu, int flush_global)
 }
 }
 
-static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
+/* Flush tlb_table[] and tlb_v_table[] of @cpu at MMU indexes given by @bitmap.
+ * Flush also tb_jmp_cache. */
+static inline void tlb_tables_flush_bitmap(CPUState *cpu, unsigned long 
*bitmap)
 {
-CPUArchState *env = cpu->env_ptr;
+int mmu_idx;
 
 #if defined(DEBUG_TLB)
 printf("tlb_flush_by_mmuidx:");
@@ -114,6 +116,41 @@ static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, 
va_list argp)
links while we are modifying them */
 cpu->current_tb = NULL;
 
+for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+if (test_bit(mmu_idx, bitmap)) {
+CPUArchState *env = cpu->env_ptr;
+#if defined(DEBUG_TLB)
+printf(" %d", mmu_idx);
+#endif
+memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[0]));
+memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
+}
+}
+memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
+
+#if defined(DEBUG_TLB)
+printf("\n");
+#endif
+}
+
+struct TLBFlushByMMUIdxParams {
+CPUState *cpu;
+DECLARE_BITMAP(idx_to_flush, NB_MMU_MODES);
+};
+
+static void tlb_flush_by_mmuidx_async_work(void *opaque)
+{
+struct TLBFlushByMMUIdxParams *params = opaque;
+
+tlb_tables_flush_bitmap(params->cpu, params->idx_to_flush);
+
+g_free(params);
+}
+
+static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
+{
+DECLARE_BITMAP(idxmap, NB_MMU_MODES) = { 0 };
+
 for (;;) {
 int mmu_idx = va_arg(argp, int);
 
@@ -121,19 +158,23 @@ static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, 
va_list argp)
 break;
 }
 
-#if defined(DEBUG_TLB)
-printf(" %d", mmu_idx);
-#endif
-
-memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[0]));
-memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
+set_bit(mmu_idx, idxmap);
 }
 
-#if defined(DEBUG_TLB)
-printf("\n");
-#endif
+if (!qemu_cpu_is_self(cpu)) {
+/* We do not set the pending_tlb_flush bit, only a global flush
+ * does that. */
+if (!atomic_read(&cpu->pending_tlb_flush)) {
+ struct TLBFlushByMMUIdxParams *params;
 
-memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
+ params = g_malloc(sizeof(struct TLBFlushByMMUIdxParams));
+ params->cpu = cpu;
+ memcpy(params->idx_to_flush, idxmap, sizeof(idxmap));
+ async_run_on_cpu(cpu, tlb_flush_by_mmuidx_async_work, params);
+ }
+} else {
+tlb_tables_flush_bitmap(cpu, idxmap);
+}
 }
 
 void tlb_flush_by_mmuidx(CPUState *cpu, ...)
-- 
2.7.2




Re: [Qemu-devel] [RFC v7 00/16] Slow-path for atomic instruction translation

2016-02-19 Thread alvise rigo
On Fri, Feb 19, 2016 at 12:44 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> This is the seventh iteration of the patch series which applies to the
>> upstream branch of QEMU (v2.5.0-rc4).
>>
>> Changes versus previous versions are at the bottom of this cover letter.
>>
>> The code is also available at following repository:
>> https://git.virtualopensystems.com/dev/qemu-mt.git
>> branch:
>> slowpath-for-atomic-v7-no-mttcg
>
> OK I'm done on this review pass. I think generally we are in pretty good
> shape although I await to see what extra needs to be done for the MTTCG
> case.

Hi Alex,

Thank you for this review. Regarding the extra needs and the integration
with the MTTCG code, I've made available at this address [1] a working
branch with the two patch series merged together. The branch boots Linux
fine on both the aarch64 and arm architectures. There is still the known
issue with virtio, which Fred should fix soon. Let me know your first
impressions.

[1] https://git.virtualopensystems.com/dev/qemu-mt.git (branch
"merging-slowpath-v7-mttcg-v8-wip")

Thank you,
alvise

>
> We are coming up to soft-freeze on 1/3/16 and it would be nice to get
> this merged by then. As it is a fairly major chunk of work it would need
> to get the initial commit by that date.
>
> However before we can get to that stage we need some review from the
> maintainers. For your next version can you please:
>
>   - Drop the RFC tag, I think we have had enough comment ;-)
>   - Make sure you CC the TCG maintainers (Paolo, Peter C and Richard 
> Henderson)
>   - Also CC the ARM maintainers (Peter M)
>   - Be ready for a fast turnaround
>
> Paolo/Richard,
>
> Do you have any comments on this iteration?
>
> --
> Alex Bennée



Re: [Qemu-devel] [RFC v7 08/16] softmmu: Honor the new exclusive bitmap

2016-02-18 Thread alvise rigo
On Tue, Feb 16, 2016 at 6:39 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
> > The pages set as exclusive (clean) in the DIRTY_MEMORY_EXCLUSIVE bitmap
> > have to have their TLB entries flagged with TLB_EXCL. The accesses to
> > pages with TLB_EXCL flag set have to be properly handled in that they
> > can potentially invalidate an open LL/SC transaction.
> >
> > Modify the TLB entries generation to honor the new bitmap and extend
> > the softmmu_template to handle the accesses made to guest pages marked
> > as exclusive.
> >
> > In the case we remove a TLB entry marked as EXCL, we unset the
> > corresponding exclusive bit in the bitmap.
> >
> > Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
> > Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
> > Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
> > ---
> >  cputlb.c   | 44 --
> >  softmmu_template.h | 80 
> > --
> >  2 files changed, 113 insertions(+), 11 deletions(-)
> >
> > diff --git a/cputlb.c b/cputlb.c
> > index ce6d720..aa9cc17 100644
> > --- a/cputlb.c
> > +++ b/cputlb.c
> > @@ -395,6 +395,16 @@ void tlb_set_page_with_attrs(CPUState *cpu, 
> > target_ulong vaddr,
> >  env->tlb_v_table[mmu_idx][vidx] = *te;
> >  env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
> >
> > +if (unlikely(!(te->addr_write & TLB_MMIO) && (te->addr_write & 
> > TLB_EXCL))) {
> > +/* We are removing an exclusive entry, set the page to dirty. This
> > + * is not be necessary if the vCPU has performed both SC and LL. */
> > +hwaddr hw_addr = (env->iotlb[mmu_idx][index].addr & 
> > TARGET_PAGE_MASK) +
> > +  (te->addr_write & 
> > TARGET_PAGE_MASK);
> > +if (!cpu->ll_sc_context) {
> > +cpu_physical_memory_unset_excl(hw_addr);
> > +}
> > +}
> > +
>
> I'm confused by the later patches removing this code and its comments
> about missing the setting of flags.


I hope I answered to this question in the other thread.

>
>
> >  /* refill the tlb */
> >  env->iotlb[mmu_idx][index].addr = iotlb - vaddr;
> >  env->iotlb[mmu_idx][index].attrs = attrs;
> > @@ -418,9 +428,19 @@ void tlb_set_page_with_attrs(CPUState *cpu, 
> > target_ulong vaddr,
> >  } else if (memory_region_is_ram(section->mr)
> > && cpu_physical_memory_is_clean(section->mr->ram_addr
> > + xlat)) {
> > -te->addr_write = address | TLB_NOTDIRTY;
> > -} else {
> > -te->addr_write = address;
> > +address |= TLB_NOTDIRTY;
> > +}
> > +
> > +/* Since the MMIO accesses follow always the slow path, we do not 
> > need
> > + * to set any flag to trap the access */
> > +if (!(address & TLB_MMIO)) {
> > +if (cpu_physical_memory_is_excl(section->mr->ram_addr + xlat)) 
> > {
> > +/* There is at least one vCPU that has flagged the address 
> > as
> > + * exclusive. */
> > +te->addr_write = address | TLB_EXCL;
> > +} else {
> > +te->addr_write = address;
> > +}
>
> Again this is confusing when following patches blat over the code.
> Perhaps this part of the patch should be:
>
> /* Since the MMIO accesses follow always the slow path, we do not need
>  * to set any flag to trap the access */
> if (!(address & TLB_MMIO)) {
> if (cpu_physical_memory_is_excl(section->mr->ram_addr + xlat)) {
> /* There is at least one vCPU that has flagged the address as
>  * exclusive. */
> address |= TLB_EXCL;
> }
> }
> te->addr_write = address;
>
> So the future patch is clearer about what it does?


Yes, this is clearer. I will fix it.

>
>
> >  }
> >  } else {
> >  te->addr_write = -1;
> > @@ -474,6 +494,24 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, 
> > target_ulong addr)
> >  return qemu_ram_addr_from_host_nofail(p);
> >  }
> >
> > +/* For every vCPU compare t

Re: [Qemu-devel] [RFC v7 09/16] softmmu: Include MMIO/invalid exclusive accesses

2016-02-18 Thread alvise rigo
On Tue, Feb 16, 2016 at 6:49 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> Enable exclusive accesses when the MMIO/invalid flag is set in the TLB
>> entry.
>>
>> In case a LL access is done to MMIO memory, we treat it differently from
>> a RAM access in that we do not rely on the EXCL bitmap to flag the page
>> as exclusive. In fact, we don't even need the TLB_EXCL flag to force the
>> slow path, since it is always forced anyway.
>>
>> This commit does not take care of invalidating an MMIO exclusive range from
>> other non-exclusive accesses i.e. CPU1 LoadLink to MMIO address X and
>> CPU2 writes to X. This will be addressed in the following commit.
>>
>> Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  cputlb.c   |  7 +++
>>  softmmu_template.h | 26 --
>>  2 files changed, 23 insertions(+), 10 deletions(-)
>>
>> diff --git a/cputlb.c b/cputlb.c
>> index aa9cc17..87d09c8 100644
>> --- a/cputlb.c
>> +++ b/cputlb.c
>> @@ -424,7 +424,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong 
>> vaddr,
>>  if ((memory_region_is_ram(section->mr) && section->readonly)
>>  || memory_region_is_romd(section->mr)) {
>>  /* Write access calls the I/O callback.  */
>> -te->addr_write = address | TLB_MMIO;
>> +address |= TLB_MMIO;
>>  } else if (memory_region_is_ram(section->mr)
>> && cpu_physical_memory_is_clean(section->mr->ram_addr
>> + xlat)) {
>> @@ -437,11 +437,10 @@ void tlb_set_page_with_attrs(CPUState *cpu, 
>> target_ulong vaddr,
>>  if (cpu_physical_memory_is_excl(section->mr->ram_addr + xlat)) {
>>  /* There is at least one vCPU that has flagged the address 
>> as
>>   * exclusive. */
>> -te->addr_write = address | TLB_EXCL;
>> -} else {
>> -te->addr_write = address;
>> +address |= TLB_EXCL;
>>  }
>>  }
>> +te->addr_write = address;
>
> As mentioned before I think this bit belongs in the earlier patch.
>
>>  } else {
>>  te->addr_write = -1;
>>  }
>> diff --git a/softmmu_template.h b/softmmu_template.h
>> index 267c52a..c54bdc9 100644
>> --- a/softmmu_template.h
>> +++ b/softmmu_template.h
>> @@ -476,7 +476,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
>> addr, DATA_TYPE val,
>>
>>  /* Handle an IO access or exclusive access.  */
>>  if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>> -if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
>> +if (tlb_addr & TLB_EXCL) {
>>  CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
>>  CPUState *cpu = ENV_GET_CPU(env);
>>  CPUClass *cc = CPU_GET_CLASS(cpu);
>> @@ -500,8 +500,15 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
>> addr, DATA_TYPE val,
>>  }
>>  }
>>
>> -glue(helper_le_st_name, _do_ram_access)(env, val, addr, oi,
>> -mmu_idx, index, 
>> retaddr);
>> +if (tlb_addr & ~(TARGET_PAGE_MASK | TLB_EXCL)) { /* MMIO
>> access */
>
> What about the other flags? Shouldn't this be tlb_addr & TLB_MMIO?

Upstream QEMU's condition to follow the IO access path is:
if (unlikely(tlb_addr & ~TARGET_PAGE_MASK))
Now we split this into:
if (tlb_addr & TLB_EXCL)
for RAM exclusive accesses, and
if (tlb_addr & ~(TARGET_PAGE_MASK | TLB_EXCL))
for IO accesses. In this last case we also handle the exclusive IO accesses.
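
Condensed from the hunk quoted above, the resulting dispatch in the store
helper is therefore (a sketch, not new code):

if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
    if (tlb_addr & TLB_EXCL) {
        /* RAM page flagged as exclusive: may invalidate an open LL/SC
         * transaction before the store is performed */
    }
    if (tlb_addr & ~(TARGET_PAGE_MASK | TLB_EXCL)) {
        /* MMIO access (including exclusive MMIO accesses) */
    } else {
        /* plain RAM access of an exclusive page */
    }
}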

>
>> +glue(helper_le_st_name, _do_mmio_access)(env, val, addr, oi,
>> + mmu_idx, index,
>> + retaddr);
>> +} else {
>> +glue(helper_le_st_name, _do_ram_access)(env, val, addr, oi,
>> +mmu_idx, index,
>> +retaddr);
>> +}
>>
>> 

Re: [Qemu-devel] [RFC v7 13/16] softmmu: Add history of excl accesses

2016-02-18 Thread alvise rigo
On Tue, Feb 16, 2016 at 6:07 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> Add a circular buffer to store the hw addresses used in the last
>> EXCLUSIVE_HISTORY_LEN exclusive accesses.
>>
>> When an address is pop'ed from the buffer, its page will be set as not
>> exclusive. In this way, we avoid:
>> - frequent set/unset of a page (causing frequent flushes as well)
>> - the possibility to forget the EXCL bit set.
>
> Why was this a possibility before? Shouldn't that be tackled in the
> patch that introduced it?

Yes and no. The problem happens for instance when an LL is not followed
by the SC. In this situation, the flag will be set for the page, but might
remain set for the rest of the execution (unless a complete LL/SC is
performed later on in the same guest page).

>
>>
>> Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  cputlb.c| 29 +++--
>>  exec.c  | 19 +++
>>  include/qom/cpu.h   |  8 
>>  softmmu_llsc_template.h |  1 +
>>  vl.c|  3 +++
>>  5 files changed, 50 insertions(+), 10 deletions(-)
>>
>> diff --git a/cputlb.c b/cputlb.c
>> index 06ce2da..f3c4d97 100644
>> --- a/cputlb.c
>> +++ b/cputlb.c
>> @@ -395,16 +395,6 @@ void tlb_set_page_with_attrs(CPUState *cpu, 
>> target_ulong vaddr,
>>  env->tlb_v_table[mmu_idx][vidx] = *te;
>>  env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
>>
>> -if (unlikely(!(te->addr_write & TLB_MMIO) && (te->addr_write & 
>> TLB_EXCL))) {
>> -/* We are removing an exclusive entry, set the page to dirty. This
>> - * is not be necessary if the vCPU has performed both SC and LL. */
>> -hwaddr hw_addr = (env->iotlb[mmu_idx][index].addr & 
>> TARGET_PAGE_MASK) +
>> -  (te->addr_write & 
>> TARGET_PAGE_MASK);
>> -if (!cpu->ll_sc_context) {
>> -cpu_physical_memory_unset_excl(hw_addr);
>> -}
>> -}
>> -
>
> Errm is this right? I got confused reviewing 8/16 because my final tree
> didn't have this code. I'm not sure the adding of history obviates the
> need to clear the exclusive flag?

We clear it when adding a new item to the history: when an entry is
added, the oldest one is evicted and its page is cleaned, solving the
problem mentioned above.
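
To make the eviction explicit, here is a tiny self-contained model of that
circular history (the names loosely follow the quoted excl_history_put_addr();
the cleanup is reduced to a printf, nothing here is the actual QEMU code):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define HISTORY_LEN 8
#define RESET_ADDR  UINT64_MAX

static uint64_t history[HISTORY_LEN];
static unsigned last_idx;

/* Stand-in for cpu_physical_memory_unset_excl(): the real code clears the
 * page's EXCL state in the dirty bitmap (i.e. sets it back to "dirty"). */
static void unset_excl(uint64_t addr)
{
    printf("page 0x%" PRIx64 " set back to non-exclusive\n", addr);
}

/* Same shape as the quoted excl_history_put_addr(): adding a new address
 * evicts and cleans the oldest one, so a dangling LL cannot leave a page
 * flagged forever. */
static void history_put(uint64_t page_addr)
{
    last_idx = (last_idx + 1) % HISTORY_LEN;
    if (history[last_idx] != RESET_ADDR) {
        unset_excl(history[last_idx]);
    }
    history[last_idx] = page_addr;
}

int main(void)
{
    for (unsigned i = 0; i < HISTORY_LEN; i++) {
        history[i] = RESET_ADDR;                 /* nothing tracked yet */
    }
    for (uint64_t i = 1; i <= 12; i++) {
        history_put(i << 12);                    /* eviction starts at the 9th put */
    }
    return 0;
}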

Thank you,
alvise

>
>>  /* refill the tlb */
>>  env->iotlb[mmu_idx][index].addr = iotlb - vaddr;
>>  env->iotlb[mmu_idx][index].attrs = attrs;
>> @@ -517,6 +507,25 @@ static inline bool lookup_and_reset_cpus_ll_addr(hwaddr 
>> addr, hwaddr size)
>>  return ret;
>>  }
>>
>> +extern CPUExclusiveHistory excl_history;
>> +static inline void excl_history_put_addr(hwaddr addr)
>> +{
>> +hwaddr last;
>> +
>> +/* Calculate the index of the next exclusive address */
>> +excl_history.last_idx = (excl_history.last_idx + 1) % 
>> excl_history.length;
>> +
>> +last = excl_history.c_array[excl_history.last_idx];
>> +
>> +/* Unset EXCL bit of the oldest entry */
>> +if (last != EXCLUSIVE_RESET_ADDR) {
>> +cpu_physical_memory_unset_excl(last);
>> +}
>> +
>> +/* Add a new address, overwriting the oldest one */
>> +excl_history.c_array[excl_history.last_idx] = addr & TARGET_PAGE_MASK;
>> +}
>> +
>>  #define MMUSUFFIX _mmu
>>
>>  /* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
>> diff --git a/exec.c b/exec.c
>> index 51f366d..2e123f1 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -177,6 +177,25 @@ struct CPUAddressSpace {
>>  MemoryListener tcg_as_listener;
>>  };
>>
>> +/* Exclusive memory support */
>> +CPUExclusiveHistory excl_history;
>> +void cpu_exclusive_history_init(void)
>> +{
>> +/* Initialize exclusive history for atomic instruction handling. */
>> +if (tcg_enabled()) {
>> +g_assert(EXCLUSIVE_HISTORY_CPU_LEN * max_cpus <= UINT16_MAX);
>> +excl_history.length = EXCLUSIVE_HISTORY_CPU_LEN * max_cpus;
>> +excl_history.c_array = g_malloc(excl_history.length * 
>> sizeof(hwaddr));
>> +memset(excl_history.c_array, -1, excl_history.length * 
>> sizeo

Re: [Qemu-devel] [RFC v7 10/16] softmmu: Protect MMIO exclusive range

2016-02-18 Thread alvise rigo
On Wed, Feb 17, 2016 at 7:55 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> As for the RAM case, also the MMIO exclusive ranges have to be protected
>> by other CPU's accesses. In order to do that, we flag the accessed
>> MemoryRegion to mark that an exclusive access has been performed and is
>> not concluded yet.
>>
>> This flag will force the other CPUs to invalidate the exclusive range in
>> case of collision.
>>
>> Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  cputlb.c| 20 +---
>>  include/exec/memory.h   |  1 +
>>  softmmu_llsc_template.h | 11 +++
>>  softmmu_template.h  | 22 ++
>>  4 files changed, 43 insertions(+), 11 deletions(-)
>>
>> diff --git a/cputlb.c b/cputlb.c
>> index 87d09c8..06ce2da 100644
>> --- a/cputlb.c
>> +++ b/cputlb.c
>> @@ -496,19 +496,25 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, 
>> target_ulong addr)
>>  /* For every vCPU compare the exclusive address and reset it in case of a
>>   * match. Since only one vCPU is running at once, no lock has to be held to
>>   * guard this operation. */
>> -static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
>> +static inline bool lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
>>  {
>>  CPUState *cpu;
>> +bool ret = false;
>>
>>  CPU_FOREACH(cpu) {
>> -if (cpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
>> -ranges_overlap(cpu->excl_protected_range.begin,
>> -   cpu->excl_protected_range.end -
>> -   cpu->excl_protected_range.begin,
>> -   addr, size)) {
>> -cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
>> +if (current_cpu != cpu) {
>
> I'm confused by this change. I don't see anywhere in the MMIO handling
> why we would want to change skipping the CPU. Perhaps this belongs in
> the previous patch? Maybe the function should really be
> lookup_and_maybe_reset_other_cpu_ll_addr?

This is actually used later on in this patch.

>
>> +if (cpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
>> +ranges_overlap(cpu->excl_protected_range.begin,
>> +   cpu->excl_protected_range.end -
>> +   cpu->excl_protected_range.begin,
>> +   addr, size)) {
>> +cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
>> +ret = true;
>> +}
>>  }
>>  }
>> +
>> +return ret;
>>  }
>>
>>  #define MMUSUFFIX _mmu
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 71e0480..bacb3ad 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -171,6 +171,7 @@ struct MemoryRegion {
>>  bool rom_device;
>>  bool flush_coalesced_mmio;
>>  bool global_locking;
>> +bool pending_excl_access; /* A vCPU issued an exclusive access */
>>  uint8_t dirty_log_mask;
>>  ram_addr_t ram_addr;
>>  Object *owner;
>> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
>> index 101f5e8..b4712ba 100644
>> --- a/softmmu_llsc_template.h
>> +++ b/softmmu_llsc_template.h
>> @@ -81,15 +81,18 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, 
>> target_ulong addr,
>>  }
>>  }
>>  }
>> +/* For this vCPU, just update the TLB entry, no need to flush. */
>> +env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
>>  } else {
>> -hw_error("EXCL accesses to MMIO regions not supported yet.");
>> +/* Set a pending exclusive access in the MemoryRegion */
>> +MemoryRegion *mr = iotlb_to_region(this,
>> +   env->iotlb[mmu_idx][index].addr,
>> +   
>> env->iotlb[mmu_idx][index].attrs);
>> +mr->pending_excl_access = true;
>>  }
>>
>>  cc->cpu_set_excl_protected_range(this, hw_addr, DATA_SIZE);
>>
>> -/* For this vCPU, just upd

Re: [Qemu-devel] [RFC v7 07/16] softmmu: Add helpers for a new slowpath

2016-02-18 Thread alvise rigo
On Thu, Feb 11, 2016 at 5:33 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> The new helpers rely on the legacy ones to perform the actual read/write.
>>
>> The LoadLink helper (helper_ldlink_name) prepares the way for the
>> following StoreCond operation. It sets the linked address and the size
>> of the access. The LoadLink helper also updates the TLB entry of the
>> page involved in the LL/SC to all vCPUs by forcing a TLB flush, so that
>> the following accesses made by all the vCPUs will follow the slow path.
>>
>> The StoreConditional helper (helper_stcond_name) returns 1 if the
>> store has to fail due to a concurrent access to the same page by
>> another vCPU. A 'concurrent access' can be a store made by *any* vCPU
>> (although, some implementations allow stores made by the CPU that issued
>> the LoadLink).
>>
>> Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  cputlb.c|   3 ++
>>  include/qom/cpu.h   |   5 ++
>>  softmmu_llsc_template.h | 133 
>> 
>>  softmmu_template.h  |  12 +
>>  tcg/tcg.h   |  31 +++
>>  5 files changed, 184 insertions(+)
>>  create mode 100644 softmmu_llsc_template.h
>>
>> diff --git a/cputlb.c b/cputlb.c
>> index f6fb161..ce6d720 100644
>> --- a/cputlb.c
>> +++ b/cputlb.c
>> @@ -476,6 +476,8 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, 
>> target_ulong addr)
>>
>>  #define MMUSUFFIX _mmu
>>
>> +/* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
>> +#define GEN_EXCLUSIVE_HELPERS
>>  #define SHIFT 0
>>  #include "softmmu_template.h"
>>
>> @@ -488,6 +490,7 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, 
>> target_ulong addr)
>>  #define SHIFT 3
>>  #include "softmmu_template.h"
>>  #undef MMUSUFFIX
>> +#undef GEN_EXCLUSIVE_HELPERS
>>
>>  #define MMUSUFFIX _cmmu
>>  #undef GETPC_ADJ
>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>> index 682c81d..6f6c1c0 100644
>> --- a/include/qom/cpu.h
>> +++ b/include/qom/cpu.h
>> @@ -351,10 +351,15 @@ struct CPUState {
>>   */
>>  bool throttle_thread_scheduled;
>>
>> +/* Used by the atomic insn translation backend. */
>> +bool ll_sc_context;
>>  /* vCPU's exclusive addresses range.
>>   * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not
>>   * in the middle of a LL/SC. */
>>  struct Range excl_protected_range;
>> +/* Used to carry the SC result but also to flag a normal store access 
>> made
>> + * by a stcond (see softmmu_template.h). */
>> +bool excl_succeeded;
>>
>>  /* Note that this is accessed at the start of every TB via a negative
>> offset from AREG0.  Leave this field at the end so as to make the
>> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
>> new file mode 100644
>> index 000..101f5e8
>> --- /dev/null
>> +++ b/softmmu_llsc_template.h
>> @@ -0,0 +1,133 @@
>> +/*
>> + *  Software MMU support (esclusive load/store operations)
>> + *
>> + * Generate helpers used by TCG for qemu_ldlink/stcond ops.
>> + *
>> + * Included from softmmu_template.h only.
>> + *
>> + * Copyright (c) 2015 Virtual Open Systems
>> + *
>> + * Authors:
>> + *  Alvise Rigo <a.r...@virtualopensystems.com>
>> + *
>> + * This library is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU Lesser General Public
>> + * License as published by the Free Software Foundation; either
>> + * version 2 of the License, or (at your option) any later version.
>> + *
>> + * This library is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> + * Lesser General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU Lesser General Public
>> + * License along with this library; if not, see 
>> <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +/* This template does not generate together the le and be version, but only 
>> one
>> + *

Re: [Qemu-devel] [RFC v7 06/16] qom: cpu: Add CPUClass hooks for exclusive range

2016-02-18 Thread alvise rigo
On Thu, Feb 11, 2016 at 2:22 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> The excl_protected_range is a hwaddr range set by the VCPU at the
>> execution of a LoadLink instruction. If a normal access writes to this
>> range, the corresponding StoreCond will fail.
>>
>> Each architecture can set the exclusive range when issuing the LoadLink
>> operation through a CPUClass hook. This comes in handy to emulate, for
>> instance, the exclusive monitor implemented in some ARM architectures
>> (more precisely, the Exclusive Reservation Granule).
>>
>> In addition, add another CPUClass hook called to decide whether a
>> StoreCond has to fail or not.
>>
>> Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  include/qom/cpu.h | 15 +++
>>  qom/cpu.c | 20 
>>  2 files changed, 35 insertions(+)
>>
>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>> index 2e5229d..682c81d 100644
>> --- a/include/qom/cpu.h
>> +++ b/include/qom/cpu.h
>> @@ -29,6 +29,7 @@
>>  #include "qemu/queue.h"
>>  #include "qemu/thread.h"
>>  #include "qemu/typedefs.h"
>> +#include "qemu/range.h"
>>
>>  typedef int (*WriteCoreDumpFunction)(const void *buf, size_t size,
>>   void *opaque);
>> @@ -183,6 +184,12 @@ typedef struct CPUClass {
>>  void (*cpu_exec_exit)(CPUState *cpu);
>>  bool (*cpu_exec_interrupt)(CPUState *cpu, int interrupt_request);
>>
>> +/* Atomic instruction handling */
>> +void (*cpu_set_excl_protected_range)(CPUState *cpu, hwaddr addr,
>> + hwaddr size);
>> +int (*cpu_valid_excl_access)(CPUState *cpu, hwaddr addr,
>> + hwaddr size);
>> +
>>  void (*disas_set_info)(CPUState *cpu, disassemble_info *info);
>>  } CPUClass;
>>
>> @@ -219,6 +226,9 @@ struct kvm_run;
>>  #define TB_JMP_CACHE_BITS 12
>>  #define TB_JMP_CACHE_SIZE (1 << TB_JMP_CACHE_BITS)
>>
>> +/* Atomic insn translation TLB support. */
>> +#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
>> +
>>  /**
>>   * CPUState:
>>   * @cpu_index: CPU index (informative).
>> @@ -341,6 +351,11 @@ struct CPUState {
>>   */
>>  bool throttle_thread_scheduled;
>>
>> +/* vCPU's exclusive addresses range.
>> + * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not
>> + * in the middle of a LL/SC. */
>> +struct Range excl_protected_range;
>> +
>
> In which case we should probably initialise that on CPU creation as we
> don't start in the middle of a LL/SC.

Agreed.

>
>>  /* Note that this is accessed at the start of every TB via a negative
>> offset from AREG0.  Leave this field at the end so as to make the
>> (absolute value) offset as small as possible.  This reduces code
>> diff --git a/qom/cpu.c b/qom/cpu.c
>> index 8f537a4..a5d360c 100644
>> --- a/qom/cpu.c
>> +++ b/qom/cpu.c
>> @@ -203,6 +203,24 @@ static bool cpu_common_exec_interrupt(CPUState *cpu, 
>> int int_req)
>>  return false;
>>  }
>>
>> +static void cpu_common_set_excl_range(CPUState *cpu, hwaddr addr, hwaddr 
>> size)
>> +{
>> +cpu->excl_protected_range.begin = addr;
>> +cpu->excl_protected_range.end = addr + size;
>> +}
>> +
>> +static int cpu_common_valid_excl_access(CPUState *cpu, hwaddr addr, hwaddr 
>> size)
>> +{
>> +/* Check if the excl range completely covers the access */
>> +if (cpu->excl_protected_range.begin <= addr &&
>> +cpu->excl_protected_range.end >= addr + size) {
>> +
>> +return 1;
>> +}
>> +
>> +return 0;
>> +}
>
> This can be a bool function.

OK.
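
Something like the following would do (a sketch of the agreed change; the
QEMU types are reduced to stand-ins so the snippet is self-contained, it is
not the respun patch itself):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;                         /* stand-in for the QEMU typedef */
struct Range { hwaddr begin, end; };             /* reduced qemu/range.h Range */
typedef struct { struct Range excl_protected_range; } CPUState;   /* reduced */

/* bool variant of the hook: true only if the exclusive range completely
 * covers the access. */
static bool cpu_common_valid_excl_access(CPUState *cpu, hwaddr addr, hwaddr size)
{
    return cpu->excl_protected_range.begin <= addr &&
           cpu->excl_protected_range.end >= addr + size;
}

int main(void)
{
    CPUState cpu = { .excl_protected_range = { 0x1000, 0x1010 } };

    return cpu_common_valid_excl_access(&cpu, 0x1008, 8) ? 0 : 1;   /* covered */
}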

Thank you,
alvise

>
>> +
>>  void cpu_dump_state(CPUState *cpu, FILE *f, fprintf_function cpu_fprintf,
>>  int flags)
>>  {
>> @@ -355,6 +373,8 @@ static void cpu_class_init(ObjectClass *klass, void 
>> *data)
>>  k->cpu_exec_enter = cpu_common_noop;
>>  k->cpu_exec_exit = cpu_common_noop;
>>  k->cpu_exec_interrupt = cpu_common_exec_interrupt;
>> +k->cpu_set_excl_protected_range = cpu_common_set_excl_range;
>> +k->cpu_valid_excl_access = cpu_common_valid_excl_access;
>>  dc->realize = cpu_common_realizefn;
>>  /*
>>   * Reason: CPUs still need special care by board code: wiring up
>
>
> --
> Alex Bennée



Re: [Qemu-devel] [RFC v7 01/16] exec.c: Add new exclusive bitmap to ram_list

2016-02-11 Thread alvise rigo
You are right, the for loop with i < DIRTY_MEMORY_NUM works just fine.

Thank you,
alvise

On Thu, Feb 11, 2016 at 2:00 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> The purpose of this new bitmap is to flag the memory pages that are in
>> the middle of LL/SC operations (after a LL, before a SC). For all these
>> pages, the corresponding TLB entries will be generated in such a way to
>> force the slow-path for all the VCPUs (see the following patches).
>>
>> When the system starts, the whole memory is set to dirty.
>>
>> Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  exec.c  |  7 +--
>>  include/exec/memory.h   |  3 ++-
>>  include/exec/ram_addr.h | 31 +++
>>  3 files changed, 38 insertions(+), 3 deletions(-)
>>
>> diff --git a/exec.c b/exec.c
>> index 7115403..51f366d 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -1575,11 +1575,14 @@ static ram_addr_t ram_block_add(RAMBlock *new_block, 
>> Error **errp)
>>  int i;
>>
>>  /* ram_list.dirty_memory[] is protected by the iothread lock.  */
>> -for (i = 0; i < DIRTY_MEMORY_NUM; i++) {
>> +for (i = 0; i < DIRTY_MEMORY_EXCLUSIVE; i++) {
>>  ram_list.dirty_memory[i] =
>>  bitmap_zero_extend(ram_list.dirty_memory[i],
>> old_ram_size, new_ram_size);
>> -   }
>> +}
>> +ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE] =
>> +
>> bitmap_zero_extend(ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
>> +   old_ram_size, new_ram_size);
>
> In the previous patch you moved this out of the loop as
> ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE] was a different size to
> the other dirty bitmaps. This no longer seems to be the case so this
> seems pointless.
>
>>  }
>>  cpu_physical_memory_set_dirty_range(new_block->offset,
>>  new_block->used_length,
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index c92734a..71e0480 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -19,7 +19,8 @@
>>  #define DIRTY_MEMORY_VGA   0
>>  #define DIRTY_MEMORY_CODE  1
>>  #define DIRTY_MEMORY_MIGRATION 2
>> -#define DIRTY_MEMORY_NUM   3/* num of dirty bits */
>> +#define DIRTY_MEMORY_EXCLUSIVE 3
>> +#define DIRTY_MEMORY_NUM   4/* num of dirty bits */
>>
>>  #include 
>>  #include 
>> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
>> index ef1489d..19789fc 100644
>> --- a/include/exec/ram_addr.h
>> +++ b/include/exec/ram_addr.h
>> @@ -21,6 +21,7 @@
>>
>>  #ifndef CONFIG_USER_ONLY
>>  #include "hw/xen/xen.h"
>> +#include "sysemu/sysemu.h"
>>
>>  struct RAMBlock {
>>  struct rcu_head rcu;
>> @@ -172,6 +173,9 @@ static inline void 
>> cpu_physical_memory_set_dirty_range(ram_addr_t start,
>>  if (unlikely(mask & (1 << DIRTY_MEMORY_CODE))) {
>>  bitmap_set_atomic(d[DIRTY_MEMORY_CODE], page, end - page);
>>  }
>> +if (unlikely(mask & (1 << DIRTY_MEMORY_EXCLUSIVE))) {
>> +bitmap_set_atomic(d[DIRTY_MEMORY_EXCLUSIVE], page, end - page);
>> +}
>>  xen_modified_memory(start, length);
>>  }
>>
>> @@ -287,5 +291,32 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(unsigned 
>> long *dest,
>>  }
>>
>>  void migration_bitmap_extend(ram_addr_t old, ram_addr_t new);
>> +
>> +/* Exclusive bitmap support. */
>> +#define EXCL_BITMAP_GET_OFFSET(addr) (addr >> TARGET_PAGE_BITS)
>> +
>> +/* Make the page of @addr not exclusive. */
>> +static inline void cpu_physical_memory_unset_excl(ram_addr_t addr)
>> +{
>> +set_bit_atomic(EXCL_BITMAP_GET_OFFSET(addr),
>> +   ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
>> +}
>> +
>> +/* Return true if the page of @addr is exclusive, i.e. the EXCL bit is set. 
>> */
>> +static inline int cpu_physical_memory_is_excl(ram_addr_t addr)
>> +{
>> +return !test_bit(EXCL_BITMAP_GET_OFFSET(addr),
>> + ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
>> +}
>> +
>> +/* Set the page of @addr as exclusive clearing its EXCL bit and return the
>> + * previous bit's state. */
>> +static inline int cpu_physical_memory_set_excl(ram_addr_t addr)
>> +{
>> +return bitmap_test_and_clear_atomic(
>> +
>> ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
>> +EXCL_BITMAP_GET_OFFSET(addr), 1);
>> +}
>> +
>>  #endif
>>  #endif
>
>
> --
> Alex Bennée



[Qemu-devel] [RFC v7 01/16] exec.c: Add new exclusive bitmap to ram_list

2016-01-29 Thread Alvise Rigo
The purpose of this new bitmap is to flag the memory pages that are in
the middle of LL/SC operations (after a LL, before a SC). For all these
pages, the corresponding TLB entries will be generated in such a way to
force the slow-path for all the VCPUs (see the following patches).

When the system starts, the whole memory is set to dirty.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 exec.c  |  7 +--
 include/exec/memory.h   |  3 ++-
 include/exec/ram_addr.h | 31 +++
 3 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/exec.c b/exec.c
index 7115403..51f366d 100644
--- a/exec.c
+++ b/exec.c
@@ -1575,11 +1575,14 @@ static ram_addr_t ram_block_add(RAMBlock *new_block, 
Error **errp)
 int i;
 
 /* ram_list.dirty_memory[] is protected by the iothread lock.  */
-for (i = 0; i < DIRTY_MEMORY_NUM; i++) {
+for (i = 0; i < DIRTY_MEMORY_EXCLUSIVE; i++) {
 ram_list.dirty_memory[i] =
 bitmap_zero_extend(ram_list.dirty_memory[i],
old_ram_size, new_ram_size);
-   }
+}
+ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE] =
+bitmap_zero_extend(ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
+   old_ram_size, new_ram_size);
 }
 cpu_physical_memory_set_dirty_range(new_block->offset,
 new_block->used_length,
diff --git a/include/exec/memory.h b/include/exec/memory.h
index c92734a..71e0480 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -19,7 +19,8 @@
 #define DIRTY_MEMORY_VGA   0
 #define DIRTY_MEMORY_CODE  1
 #define DIRTY_MEMORY_MIGRATION 2
-#define DIRTY_MEMORY_NUM   3/* num of dirty bits */
+#define DIRTY_MEMORY_EXCLUSIVE 3
+#define DIRTY_MEMORY_NUM   4/* num of dirty bits */
 
 #include 
 #include 
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index ef1489d..19789fc 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -21,6 +21,7 @@
 
 #ifndef CONFIG_USER_ONLY
 #include "hw/xen/xen.h"
+#include "sysemu/sysemu.h"
 
 struct RAMBlock {
 struct rcu_head rcu;
@@ -172,6 +173,9 @@ static inline void 
cpu_physical_memory_set_dirty_range(ram_addr_t start,
 if (unlikely(mask & (1 << DIRTY_MEMORY_CODE))) {
 bitmap_set_atomic(d[DIRTY_MEMORY_CODE], page, end - page);
 }
+if (unlikely(mask & (1 << DIRTY_MEMORY_EXCLUSIVE))) {
+bitmap_set_atomic(d[DIRTY_MEMORY_EXCLUSIVE], page, end - page);
+}
 xen_modified_memory(start, length);
 }
 
@@ -287,5 +291,32 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(unsigned 
long *dest,
 }
 
 void migration_bitmap_extend(ram_addr_t old, ram_addr_t new);
+
+/* Exclusive bitmap support. */
+#define EXCL_BITMAP_GET_OFFSET(addr) (addr >> TARGET_PAGE_BITS)
+
+/* Make the page of @addr not exclusive. */
+static inline void cpu_physical_memory_unset_excl(ram_addr_t addr)
+{
+set_bit_atomic(EXCL_BITMAP_GET_OFFSET(addr),
+   ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+}
+
+/* Return true if the page of @addr is exclusive, i.e. the EXCL bit is set. */
+static inline int cpu_physical_memory_is_excl(ram_addr_t addr)
+{
+return !test_bit(EXCL_BITMAP_GET_OFFSET(addr),
+ ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+}
+
+/* Set the page of @addr as exclusive clearing its EXCL bit and return the
+ * previous bit's state. */
+static inline int cpu_physical_memory_set_excl(ram_addr_t addr)
+{
+return bitmap_test_and_clear_atomic(
+ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
+EXCL_BITMAP_GET_OFFSET(addr), 1);
+}
+
 #endif
 #endif
-- 
2.7.0
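
As a reading aid, a toy self-contained model of the three helpers introduced
above (same names, but the bitmap is reduced to a single 64-bit word; this is
only an illustration, not the actual QEMU implementation):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TARGET_PAGE_BITS 12
#define PAGE_OF(addr)    ((addr) >> TARGET_PAGE_BITS)

static uint64_t excl_bitmap = ~UINT64_C(0);      /* every page starts "dirty" */

/* The EXCL bit is kept *clear* while a page is exclusive, mirroring the
 * helpers above: clean bit == page under LL/SC. */
static bool cpu_physical_memory_is_excl(uint64_t addr)
{
    return !(excl_bitmap & (UINT64_C(1) << PAGE_OF(addr)));
}

static bool cpu_physical_memory_set_excl(uint64_t addr)
{
    bool was_dirty = excl_bitmap & (UINT64_C(1) << PAGE_OF(addr));
    excl_bitmap &= ~(UINT64_C(1) << PAGE_OF(addr));
    return was_dirty;                            /* previous state, as in the patch */
}

static void cpu_physical_memory_unset_excl(uint64_t addr)
{
    excl_bitmap |= UINT64_C(1) << PAGE_OF(addr);
}

int main(void)
{
    uint64_t addr = UINT64_C(3) << TARGET_PAGE_BITS;

    printf("excl before LL: %d\n", cpu_physical_memory_is_excl(addr));
    cpu_physical_memory_set_excl(addr);          /* LoadLink flags the page */
    printf("excl after LL:  %d\n", cpu_physical_memory_is_excl(addr));
    cpu_physical_memory_unset_excl(addr);        /* SC done or entry evicted */
    printf("excl after SC:  %d\n", cpu_physical_memory_is_excl(addr));
    return 0;
}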




[Qemu-devel] [RFC v7 14/16] target-arm: translate: Use ld/st excl for atomic insns

2016-01-29 Thread Alvise Rigo
Use the new LL/SC runtime helpers to handle the ARM atomic instructions
in softmmu_llsc_template.h.

In general, the helper generator
gen_{ldrex,strex}_{8,16a,32a,64a}() calls the function
helper_{le,be}_{ldlink,stcond}{ub,uw,ul,q}_mmu() implemented in
softmmu_llsc_template.h, doing an alignment check.

In addition, add a simple helper function to emulate the CLREX instruction.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 target-arm/cpu.h   |   2 +
 target-arm/helper.h|   4 ++
 target-arm/machine.c   |   2 +
 target-arm/op_helper.c |  10 +++
 target-arm/translate.c | 188 +++--
 5 files changed, 202 insertions(+), 4 deletions(-)

diff --git a/target-arm/cpu.h b/target-arm/cpu.h
index b8b3364..bb5361f 100644
--- a/target-arm/cpu.h
+++ b/target-arm/cpu.h
@@ -462,9 +462,11 @@ typedef struct CPUARMState {
 float_status fp_status;
 float_status standard_fp_status;
 } vfp;
+#if !defined(CONFIG_ARM_USE_LDST_EXCL)
 uint64_t exclusive_addr;
 uint64_t exclusive_val;
 uint64_t exclusive_high;
+#endif
 #if defined(CONFIG_USER_ONLY)
 uint64_t exclusive_test;
 uint32_t exclusive_info;
diff --git a/target-arm/helper.h b/target-arm/helper.h
index c2a85c7..6bc3c0a 100644
--- a/target-arm/helper.h
+++ b/target-arm/helper.h
@@ -532,6 +532,10 @@ DEF_HELPER_2(dc_zva, void, env, i64)
 DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
 DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
 
+#ifdef CONFIG_ARM_USE_LDST_EXCL
+DEF_HELPER_1(atomic_clear, void, env)
+#endif
+
 #ifdef TARGET_AARCH64
 #include "helper-a64.h"
 #endif
diff --git a/target-arm/machine.c b/target-arm/machine.c
index ed1925a..7adfb4d 100644
--- a/target-arm/machine.c
+++ b/target-arm/machine.c
@@ -309,9 +309,11 @@ const VMStateDescription vmstate_arm_cpu = {
 VMSTATE_VARRAY_INT32(cpreg_vmstate_values, ARMCPU,
  cpreg_vmstate_array_len,
  0, vmstate_info_uint64, uint64_t),
+#if !defined(CONFIG_ARM_USE_LDST_EXCL)
 VMSTATE_UINT64(env.exclusive_addr, ARMCPU),
 VMSTATE_UINT64(env.exclusive_val, ARMCPU),
 VMSTATE_UINT64(env.exclusive_high, ARMCPU),
+#endif
 VMSTATE_UINT64(env.features, ARMCPU),
 VMSTATE_UINT32(env.exception.syndrome, ARMCPU),
 VMSTATE_UINT32(env.exception.fsr, ARMCPU),
diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
index a5ee65f..404c13b 100644
--- a/target-arm/op_helper.c
+++ b/target-arm/op_helper.c
@@ -51,6 +51,14 @@ static int exception_target_el(CPUARMState *env)
 return target_el;
 }
 
+#ifdef CONFIG_ARM_USE_LDST_EXCL
+void HELPER(atomic_clear)(CPUARMState *env)
+{
+ENV_GET_CPU(env)->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+ENV_GET_CPU(env)->ll_sc_context = false;
+}
+#endif
+
 uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
   uint32_t rn, uint32_t maxindex)
 {
@@ -689,7 +697,9 @@ void HELPER(exception_return)(CPUARMState *env)
 
 aarch64_save_sp(env, cur_el);
 
+#if !defined(CONFIG_ARM_USE_LDST_EXCL)
 env->exclusive_addr = -1;
+#endif
 
 /* We must squash the PSTATE.SS bit to zero unless both of the
  * following hold:
diff --git a/target-arm/translate.c b/target-arm/translate.c
index cff511b..5150841 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -60,8 +60,10 @@ TCGv_ptr cpu_env;
 static TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
 static TCGv_i32 cpu_R[16];
 TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
+#if !defined(CONFIG_ARM_USE_LDST_EXCL)
 TCGv_i64 cpu_exclusive_addr;
 TCGv_i64 cpu_exclusive_val;
+#endif
 #ifdef CONFIG_USER_ONLY
 TCGv_i64 cpu_exclusive_test;
 TCGv_i32 cpu_exclusive_info;
@@ -94,10 +96,12 @@ void arm_translate_init(void)
 cpu_VF = tcg_global_mem_new_i32(TCG_AREG0, offsetof(CPUARMState, VF), 
"VF");
 cpu_ZF = tcg_global_mem_new_i32(TCG_AREG0, offsetof(CPUARMState, ZF), 
"ZF");
 
+#if !defined(CONFIG_ARM_USE_LDST_EXCL)
 cpu_exclusive_addr = tcg_global_mem_new_i64(TCG_AREG0,
 offsetof(CPUARMState, exclusive_addr), "exclusive_addr");
 cpu_exclusive_val = tcg_global_mem_new_i64(TCG_AREG0,
 offsetof(CPUARMState, exclusive_val), "exclusive_val");
+#endif
 #ifdef CONFIG_USER_ONLY
 cpu_exclusive_test = tcg_global_mem_new_i64(TCG_AREG0,
 offsetof(CPUARMState, exclusive_test), "exclusive_test");
@@ -7413,15 +7417,145 @@ static void gen_logicq_cc(TCGv_i32 lo, TCGv_i32 hi)
 tcg_gen_or_i32(cpu_ZF, lo, hi);
 }
 
-/* Load/Store exclusive instructions are implemented by remembering
+/* If the softmmu is enabled, the translation of Load/Store exclusive
+   instructions will rely on the gen_helper_{ldlink,stcond} hel

[Qemu-devel] [RFC v7 05/16] softmmu: Add new TLB_EXCL flag

2016-01-29 Thread Alvise Rigo
Add a new TLB flag to force all the accesses made to a page to follow
the slow-path.

The TLB entries referring guest pages with the DIRTY_MEMORY_EXCLUSIVE
bit clean will have this flag set.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 include/exec/cpu-all.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
index 83b1781..f8d8feb 100644
--- a/include/exec/cpu-all.h
+++ b/include/exec/cpu-all.h
@@ -277,6 +277,14 @@ CPUArchState *cpu_copy(CPUArchState *env);
 #define TLB_NOTDIRTY(1 << 4)
 /* Set if TLB entry is an IO callback.  */
 #define TLB_MMIO(1 << 5)
+/* Set if TLB entry references a page that requires exclusive access.  */
+#define TLB_EXCL(1 << 6)
+
+/* Do not allow a TARGET_PAGE_MASK which covers one or more bits defined
+ * above. */
+#if TLB_EXCL >= TARGET_PAGE_SIZE
+#error TARGET_PAGE_MASK covering the low bits of the TLB virtual address
+#endif
 
 void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
 void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf);
-- 
2.7.0
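
To see why the compile-time check above matters, a small self-contained
illustration (page size and flag value copied from the patch, everything else
is made up for the example):

/* TLB flags are stored in the bits of addr_write that page alignment leaves
 * free, so every flag, TLB_EXCL included, must stay below TARGET_PAGE_SIZE. */
#include <stdio.h>

#define TARGET_PAGE_BITS 12
#define TARGET_PAGE_SIZE (1 << TARGET_PAGE_BITS)
#define TARGET_PAGE_MASK (~(TARGET_PAGE_SIZE - 1))
#define TLB_EXCL         (1 << 6)

/* Same guard as the patch. */
#if TLB_EXCL >= TARGET_PAGE_SIZE
#error TARGET_PAGE_MASK covering the low bits of the TLB virtual address
#endif

int main(void)
{
    unsigned addr_write = 0x40001000u | TLB_EXCL;

    printf("page:  0x%08x\n", addr_write & TARGET_PAGE_MASK);   /* 0x40001000 */
    printf("flags: 0x%08x\n", addr_write & ~TARGET_PAGE_MASK);  /* 0x00000040 */
    return 0;
}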




[Qemu-devel] [RFC v7 03/16] softmmu: Simplify helper_*_st_name, wrap MMIO code

2016-01-29 Thread Alvise Rigo
Attempting to simplify the helper_*_st_name, wrap the MMIO code into an
inline function.

Based on this work, Alex proposed the following patch series
https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg01136.html
that reduces code duplication of the softmmu_helpers.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 softmmu_template.h | 66 --
 1 file changed, 44 insertions(+), 22 deletions(-)

diff --git a/softmmu_template.h b/softmmu_template.h
index 7029a03..3d388ec 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -396,6 +396,26 @@ static inline void glue(helper_le_st_name, 
_do_unl_access)(CPUArchState *env,
 }
 }
 
+static inline void glue(helper_le_st_name, _do_mmio_access)(CPUArchState *env,
+DATA_TYPE val,
+target_ulong addr,
+TCGMemOpIdx oi,
+unsigned mmu_idx,
+int index,
+uintptr_t retaddr)
+{
+CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
+
+if ((addr & (DATA_SIZE - 1)) != 0) {
+glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
+oi, retaddr);
+}
+/* ??? Note that the io helpers always read data in the target
+   byte ordering.  We should push the LE/BE request down into io.  */
+val = TGT_LE(val);
+glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+}
+
 void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
TCGMemOpIdx oi, uintptr_t retaddr)
 {
@@ -423,17 +443,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 
 /* Handle an IO access.  */
 if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
-CPUIOTLBEntry *iotlbentry;
-if ((addr & (DATA_SIZE - 1)) != 0) {
-glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
-oi, retaddr);
-}
-iotlbentry = &env->iotlb[mmu_idx][index];
-
-/* ??? Note that the io helpers always read data in the target
-   byte ordering.  We should push the LE/BE request down into io.  */
-val = TGT_LE(val);
-glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+glue(helper_le_st_name, _do_mmio_access)(env, val, addr, oi,
+ mmu_idx, index, retaddr);
 return;
 }
 
@@ -488,6 +499,26 @@ static inline void glue(helper_be_st_name, 
_do_unl_access)(CPUArchState *env,
 }
 }
 
+static inline void glue(helper_be_st_name, _do_mmio_access)(CPUArchState *env,
+DATA_TYPE val,
+target_ulong addr,
+TCGMemOpIdx oi,
+unsigned mmu_idx,
+int index,
+uintptr_t retaddr)
+{
+CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
+
+if ((addr & (DATA_SIZE - 1)) != 0) {
+glue(helper_be_st_name, _do_unl_access)(env, val, addr, mmu_idx,
+oi, retaddr);
+}
+/* ??? Note that the io helpers always read data in the target
+   byte ordering.  We should push the LE/BE request down into io.  */
+val = TGT_BE(val);
+glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+}
+
 void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
TCGMemOpIdx oi, uintptr_t retaddr)
 {
@@ -515,17 +546,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 
 /* Handle an IO access.  */
 if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
-CPUIOTLBEntry *iotlbentry;
-if ((addr & (DATA_SIZE - 1)) != 0) {
-glue(helper_be_st_name, _do_unl_access)(env, val, addr, mmu_idx,
-oi, retaddr);
-}
-iotlbentry = &env->iotlb[mmu_idx][index];
-
-/* ??? Note that the io helpers always read data in the target
-   byte ordering.  We should push the LE/BE request down into io.  */
-val = TGT_BE(val);
-glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+glue(

[Qemu-devel] [RFC v7 09/16] softmmu: Include MMIO/invalid exclusive accesses

2016-01-29 Thread Alvise Rigo
Enable exclusive accesses when the MMIO/invalid flag is set in the TLB
entry.

In case a LL access is done to MMIO memory, we treat it differently from
a RAM access in that we do not rely on the EXCL bitmap to flag the page
as exclusive. In fact, we don't even need the TLB_EXCL flag to force the
slow path, since it is always forced anyway.

This commit does not take care of invalidating an MMIO exclusive range from
other non-exclusive accesses i.e. CPU1 LoadLink to MMIO address X and
CPU2 writes to X. This will be addressed in the following commit.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cputlb.c   |  7 +++
 softmmu_template.h | 26 --
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index aa9cc17..87d09c8 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -424,7 +424,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong 
vaddr,
 if ((memory_region_is_ram(section->mr) && section->readonly)
 || memory_region_is_romd(section->mr)) {
 /* Write access calls the I/O callback.  */
-te->addr_write = address | TLB_MMIO;
+address |= TLB_MMIO;
 } else if (memory_region_is_ram(section->mr)
&& cpu_physical_memory_is_clean(section->mr->ram_addr
+ xlat)) {
@@ -437,11 +437,10 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong 
vaddr,
 if (cpu_physical_memory_is_excl(section->mr->ram_addr + xlat)) {
 /* There is at least one vCPU that has flagged the address as
  * exclusive. */
-te->addr_write = address | TLB_EXCL;
-} else {
-te->addr_write = address;
+address |= TLB_EXCL;
 }
 }
+te->addr_write = address;
 } else {
 te->addr_write = -1;
 }
diff --git a/softmmu_template.h b/softmmu_template.h
index 267c52a..c54bdc9 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -476,7 +476,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 
 /* Handle an IO access or exclusive access.  */
 if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
-if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
+if (tlb_addr & TLB_EXCL) {
CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
 CPUState *cpu = ENV_GET_CPU(env);
 CPUClass *cc = CPU_GET_CLASS(cpu);
@@ -500,8 +500,15 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 }
 }
 
-glue(helper_le_st_name, _do_ram_access)(env, val, addr, oi,
-mmu_idx, index, retaddr);
+if (tlb_addr & ~(TARGET_PAGE_MASK | TLB_EXCL)) { /* MMIO access */
+glue(helper_le_st_name, _do_mmio_access)(env, val, addr, oi,
+ mmu_idx, index,
+ retaddr);
+} else {
+glue(helper_le_st_name, _do_ram_access)(env, val, addr, oi,
+mmu_idx, index,
+retaddr);
+}
 
 lookup_and_reset_cpus_ll_addr(hw_addr, DATA_SIZE);
 
@@ -620,7 +627,7 @@ void helper_be_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 
 /* Handle an IO access or exclusive access.  */
 if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
-if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
+if (tlb_addr & TLB_EXCL) {
CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
 CPUState *cpu = ENV_GET_CPU(env);
 CPUClass *cc = CPU_GET_CLASS(cpu);
@@ -644,8 +651,15 @@ void helper_be_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 }
 }
 
-glue(helper_be_st_name, _do_ram_access)(env, val, addr, oi,
-mmu_idx, index, retaddr);
+if (tlb_addr & ~(TARGET_PAGE_MASK | TLB_EXCL)) { /* MMIO access */
+glue(helper_be_st_name, _do_mmio_access)(env, val, addr, oi,
+ mmu_idx, index,
+ retaddr);
+} else {
+glue(helper_be_st_name, _do_ram_access)(env, val, addr, oi,
+mmu_idx, index,
+retaddr);
+}
 
 lookup_and_reset_cpus_ll_addr(hw_addr, DATA_SIZE);
 
-- 
2.7.0




[Qemu-devel] [RFC v7 02/16] softmmu: Simplify helper_*_st_name, wrap unaligned code

2016-01-29 Thread Alvise Rigo
Attempting to simplify the helper_*_st_name, wrap the
do_unaligned_access code into an inline function.
Remove also the goto statement.

Based on this work, Alex proposed the following patch series
https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg01136.html
that reduces code duplication of the softmmu_helpers.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 softmmu_template.h | 96 ++
 1 file changed, 60 insertions(+), 36 deletions(-)

diff --git a/softmmu_template.h b/softmmu_template.h
index 208f808..7029a03 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -370,6 +370,32 @@ static inline void glue(io_write, SUFFIX)(CPUArchState 
*env,
  iotlbentry->attrs);
 }
 
+static inline void glue(helper_le_st_name, _do_unl_access)(CPUArchState *env,
+   DATA_TYPE val,
+   target_ulong addr,
+   TCGMemOpIdx oi,
+   unsigned mmu_idx,
+   uintptr_t retaddr)
+{
+int i;
+
+if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+ mmu_idx, retaddr);
+}
+/* XXX: not efficient, but simple */
+/* Note: relies on the fact that tlb_fill() does not remove the
+ * previous page from the TLB cache.  */
+for (i = DATA_SIZE - 1; i >= 0; i--) {
+/* Little-endian extract.  */
+uint8_t val8 = val >> (i * 8);
+/* Note the adjustment at the beginning of the function.
+   Undo that for the recursion.  */
+glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
+oi, retaddr + GETPC_ADJ);
+}
+}
+
 void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
TCGMemOpIdx oi, uintptr_t retaddr)
 {
@@ -399,7 +425,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
 CPUIOTLBEntry *iotlbentry;
 if ((addr & (DATA_SIZE - 1)) != 0) {
-goto do_unaligned_access;
+glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
+oi, retaddr);
 }
 iotlbentry = >iotlb[mmu_idx][index];
 
@@ -414,23 +441,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 if (DATA_SIZE > 1
 && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
  >= TARGET_PAGE_SIZE)) {
-int i;
-do_unaligned_access:
-if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
-cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
- mmu_idx, retaddr);
-}
-/* XXX: not efficient, but simple */
-/* Note: relies on the fact that tlb_fill() does not remove the
- * previous page from the TLB cache.  */
-for (i = DATA_SIZE - 1; i >= 0; i--) {
-/* Little-endian extract.  */
-uint8_t val8 = val >> (i * 8);
-/* Note the adjustment at the beginning of the function.
-   Undo that for the recursion.  */
-glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
-oi, retaddr + GETPC_ADJ);
-}
+glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
+oi, retaddr);
 return;
 }
 
@@ -450,6 +462,32 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 }
 
 #if DATA_SIZE > 1
+static inline void glue(helper_be_st_name, _do_unl_access)(CPUArchState *env,
+   DATA_TYPE val,
+   target_ulong addr,
+   TCGMemOpIdx oi,
+   unsigned mmu_idx,
+   uintptr_t retaddr)
+{
+int i;
+
+if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+ mmu_idx, retaddr);
+}
+/* XXX: not efficient, but simple */
+/* Note: relies on the fact that tlb_fill() does not remove the
+ * previous page from the TLB cache.  */
+for (i = DATA_SIZE - 1; i >= 0; i--) {
+/* 

[Qemu-devel] [RFC v7 07/16] softmmu: Add helpers for a new slowpath

2016-01-29 Thread Alvise Rigo
The new helpers rely on the legacy ones to perform the actual read/write.

The LoadLink helper (helper_ldlink_name) prepares the way for the
following StoreCond operation. It sets the linked address and the size
of the access. The LoadLink helper also updates the TLB entry of the
page involved in the LL/SC to all vCPUs by forcing a TLB flush, so that
the following accesses made by all the vCPUs will follow the slow path.

The StoreConditional helper (helper_stcond_name) returns 1 if the
store has to fail due to a concurrent access to the same page by
another vCPU. A 'concurrent access' can be a store made by *any* vCPU
(although, some implementations allow stores made by the CPU that issued
the LoadLink).

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cputlb.c|   3 ++
 include/qom/cpu.h   |   5 ++
 softmmu_llsc_template.h | 133 
 softmmu_template.h  |  12 +
 tcg/tcg.h   |  31 +++
 5 files changed, 184 insertions(+)
 create mode 100644 softmmu_llsc_template.h

diff --git a/cputlb.c b/cputlb.c
index f6fb161..ce6d720 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -476,6 +476,8 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, 
target_ulong addr)
 
 #define MMUSUFFIX _mmu
 
+/* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
+#define GEN_EXCLUSIVE_HELPERS
 #define SHIFT 0
 #include "softmmu_template.h"
 
@@ -488,6 +490,7 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, 
target_ulong addr)
 #define SHIFT 3
 #include "softmmu_template.h"
 #undef MMUSUFFIX
+#undef GEN_EXCLUSIVE_HELPERS
 
 #define MMUSUFFIX _cmmu
 #undef GETPC_ADJ
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 682c81d..6f6c1c0 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -351,10 +351,15 @@ struct CPUState {
  */
 bool throttle_thread_scheduled;
 
+/* Used by the atomic insn translation backend. */
+bool ll_sc_context;
 /* vCPU's exclusive addresses range.
  * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not
  * in the middle of a LL/SC. */
 struct Range excl_protected_range;
+/* Used to carry the SC result but also to flag a normal store access made
+ * by a stcond (see softmmu_template.h). */
+bool excl_succeeded;
 
 /* Note that this is accessed at the start of every TB via a negative
offset from AREG0.  Leave this field at the end so as to make the
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
new file mode 100644
index 000..101f5e8
--- /dev/null
+++ b/softmmu_llsc_template.h
@@ -0,0 +1,133 @@
+/*
+ *  Software MMU support (esclusive load/store operations)
+ *
+ * Generate helpers used by TCG for qemu_ldlink/stcond ops.
+ *
+ * Included from softmmu_template.h only.
+ *
+ * Copyright (c) 2015 Virtual Open Systems
+ *
+ * Authors:
+ *  Alvise Rigo <a.r...@virtualopensystems.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* This template does not generate together the le and be version, but only one
+ * of the two depending on whether BIGENDIAN_EXCLUSIVE_HELPERS has been set.
+ * The same nomenclature as softmmu_template.h is used for the exclusive
+ * helpers.  */
+
+#ifdef BIGENDIAN_EXCLUSIVE_HELPERS
+
+#define helper_ldlink_name  glue(glue(helper_be_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_stcond_name  glue(glue(helper_be_stcond, SUFFIX), MMUSUFFIX)
+#define helper_ld glue(glue(helper_be_ld, USUFFIX), MMUSUFFIX)
+#define helper_st glue(glue(helper_be_st, SUFFIX), MMUSUFFIX)
+
+#else /* LE helpers + 8bit helpers (generated only once for both LE end BE) */
+
+#if DATA_SIZE > 1
+#define helper_ldlink_name  glue(glue(helper_le_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_stcond_name  glue(glue(helper_le_stcond, SUFFIX), MMUSUFFIX)
+#define helper_ld glue(glue(helper_le_ld, USUFFIX), MMUSUFFIX)
+#define helper_st glue(glue(helper_le_st, SUFFIX), MMUSUFFIX)
+#else /* DATA_SIZE <= 1 */
+#define helper_ldlink_name  glue(glue(helper_ret_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_stcond_name  glue(glue(helper_ret_stcond, SUFFIX), MMUSUFFIX)
+#define helper_ld glue(glue(helper_ret_ld, USUFFIX

[Qemu-devel] [RFC v7 06/16] qom: cpu: Add CPUClass hooks for exclusive range

2016-01-29 Thread Alvise Rigo
The excl_protected_range is a hwaddr range set by the VCPU at the
execution of a LoadLink instruction. If a normal access writes to this
range, the corresponding StoreCond will fail.

Each architecture can set the exclusive range when issuing the LoadLink
operation through a CPUClass hook. This comes in handy to emulate, for
instance, the exclusive monitor implemented in some ARM architectures
(more precisely, the Exclusive Reservation Granule).

In addition, add another CPUClass hook called to decide whether a
StoreCond has to fail or not.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 include/qom/cpu.h | 15 +++
 qom/cpu.c | 20 
 2 files changed, 35 insertions(+)

diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 2e5229d..682c81d 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -29,6 +29,7 @@
 #include "qemu/queue.h"
 #include "qemu/thread.h"
 #include "qemu/typedefs.h"
+#include "qemu/range.h"
 
 typedef int (*WriteCoreDumpFunction)(const void *buf, size_t size,
  void *opaque);
@@ -183,6 +184,12 @@ typedef struct CPUClass {
 void (*cpu_exec_exit)(CPUState *cpu);
 bool (*cpu_exec_interrupt)(CPUState *cpu, int interrupt_request);
 
+/* Atomic instruction handling */
+void (*cpu_set_excl_protected_range)(CPUState *cpu, hwaddr addr,
+ hwaddr size);
+int (*cpu_valid_excl_access)(CPUState *cpu, hwaddr addr,
+ hwaddr size);
+
 void (*disas_set_info)(CPUState *cpu, disassemble_info *info);
 } CPUClass;
 
@@ -219,6 +226,9 @@ struct kvm_run;
 #define TB_JMP_CACHE_BITS 12
 #define TB_JMP_CACHE_SIZE (1 << TB_JMP_CACHE_BITS)
 
+/* Atomic insn translation TLB support. */
+#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
+
 /**
  * CPUState:
  * @cpu_index: CPU index (informative).
@@ -341,6 +351,11 @@ struct CPUState {
  */
 bool throttle_thread_scheduled;
 
+/* vCPU's exclusive addresses range.
+ * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not
+ * in the middle of a LL/SC. */
+struct Range excl_protected_range;
+
 /* Note that this is accessed at the start of every TB via a negative
offset from AREG0.  Leave this field at the end so as to make the
(absolute value) offset as small as possible.  This reduces code
diff --git a/qom/cpu.c b/qom/cpu.c
index 8f537a4..a5d360c 100644
--- a/qom/cpu.c
+++ b/qom/cpu.c
@@ -203,6 +203,24 @@ static bool cpu_common_exec_interrupt(CPUState *cpu, int 
int_req)
 return false;
 }
 
+static void cpu_common_set_excl_range(CPUState *cpu, hwaddr addr, hwaddr size)
+{
+cpu->excl_protected_range.begin = addr;
+cpu->excl_protected_range.end = addr + size;
+}
+
+static int cpu_common_valid_excl_access(CPUState *cpu, hwaddr addr, hwaddr 
size)
+{
+/* Check if the excl range completely covers the access */
+if (cpu->excl_protected_range.begin <= addr &&
+cpu->excl_protected_range.end >= addr + size) {
+
+return 1;
+}
+
+return 0;
+}
+
 void cpu_dump_state(CPUState *cpu, FILE *f, fprintf_function cpu_fprintf,
 int flags)
 {
@@ -355,6 +373,8 @@ static void cpu_class_init(ObjectClass *klass, void *data)
 k->cpu_exec_enter = cpu_common_noop;
 k->cpu_exec_exit = cpu_common_noop;
 k->cpu_exec_interrupt = cpu_common_exec_interrupt;
+k->cpu_set_excl_protected_range = cpu_common_set_excl_range;
+k->cpu_valid_excl_access = cpu_common_valid_excl_access;
 dc->realize = cpu_common_realizefn;
 /*
  * Reason: CPUs still need special care by board code: wiring up
-- 
2.7.0




[Qemu-devel] [RFC v7 15/16] target-arm: cpu64: use custom set_excl hook

2016-01-29 Thread Alvise Rigo
In aarch64 the LDXP/STXP instructions allow to perform up to 128 bits
exclusive accesses. However, due to a softmmu limitation, such wide
accesses are not allowed.

To workaround this limitation, we need to support LoadLink instructions
that cover at least 128 consecutive bits (see the next patch for more
details).

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 target-arm/cpu64.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/target-arm/cpu64.c b/target-arm/cpu64.c
index cc177bb..1d45e66 100644
--- a/target-arm/cpu64.c
+++ b/target-arm/cpu64.c
@@ -287,6 +287,13 @@ static void aarch64_cpu_set_pc(CPUState *cs, vaddr value)
 }
 }
 
+static void aarch64_set_excl_range(CPUState *cpu, hwaddr addr, hwaddr size)
+{
+cpu->excl_protected_range.begin = addr;
+/* At least cover 128 bits for a STXP access (two paired doublewords 
case)*/
+cpu->excl_protected_range.end = addr + 16;
+}
+
 static void aarch64_cpu_class_init(ObjectClass *oc, void *data)
 {
 CPUClass *cc = CPU_CLASS(oc);
@@ -297,6 +304,7 @@ static void aarch64_cpu_class_init(ObjectClass *oc, void 
*data)
 cc->gdb_write_register = aarch64_cpu_gdb_write_register;
 cc->gdb_num_core_regs = 34;
 cc->gdb_core_xml_file = "aarch64-core.xml";
+cc->cpu_set_excl_protected_range = aarch64_set_excl_range;
 }
 
 static void aarch64_cpu_register(const ARMCPUInfo *info)
-- 
2.7.0




[Qemu-devel] [RFC v7 16/16] target-arm: aarch64: add atomic instructions

2016-01-29 Thread Alvise Rigo
Use the new LL/SC runtime helpers to handle the aarch64 atomic instructions
in softmmu_llsc_template.h.

The STXP emulation required a dedicated helper to handle the paired
doubleword case.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 configure  |   6 +-
 target-arm/helper-a64.c|  55 +++
 target-arm/helper-a64.h|   4 ++
 target-arm/op_helper.c |   8 +++
 target-arm/translate-a64.c | 134 -
 5 files changed, 204 insertions(+), 3 deletions(-)

diff --git a/configure b/configure
index 915efcc..38121ff 100755
--- a/configure
+++ b/configure
@@ -5873,9 +5873,11 @@ echo "QEMU_CFLAGS+=$cflags" >> $config_target_mak
 # Use tcg LL/SC tcg backend for exclusive instruction is arm/aarch64
 # softmmus targets
 if test "$arm_tcg_use_llsc" = "yes" ; then
-  if test "$target" = "arm-softmmu" ; then
+  case "$target" in
+arm-softmmu | aarch64-softmmu)
 echo "CONFIG_ARM_USE_LDST_EXCL=y" >> $config_target_mak
-  fi
+;;
+  esac
 fi
 done # for target in $targets
 
diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
index c7bfb4d..dcee66f 100644
--- a/target-arm/helper-a64.c
+++ b/target-arm/helper-a64.c
@@ -26,6 +26,7 @@
 #include "qemu/bitops.h"
 #include "internals.h"
 #include "qemu/crc32c.h"
+#include "tcg/tcg.h"
#include <zlib.h> /* For crc32 */
 
 /* C2.4.7 Multiply and divide */
@@ -443,3 +444,57 @@ uint64_t HELPER(crc32c_64)(uint64_t acc, uint64_t val, 
uint32_t bytes)
 /* Linux crc32c converts the output to one's complement.  */
 return crc32c(acc, buf, bytes) ^ 0x;
 }
+
+#ifdef CONFIG_ARM_USE_LDST_EXCL
+/* STXP emulation for two 64 bit doublewords. We can't use directly two
+ * stcond_i64 accesses, otherwise the first will conclude the LL/SC pair.
+ * Instead, two normal 64-bit accesses are used and the CPUState is
+ * updated accordingly. */
+target_ulong HELPER(stxp_i128)(CPUArchState *env, target_ulong addr,
+   uint64_t vall, uint64_t valh,
+   uint32_t mmu_idx)
+{
+CPUState *cpu = ENV_GET_CPU(env);
+TCGMemOpIdx op;
+target_ulong ret = 0;
+
+if (!cpu->ll_sc_context) {
+cpu->excl_succeeded = false;
+ret = 1;
+goto out;
+}
+
+op = make_memop_idx(MO_BEQ, mmu_idx);
+
+/* According to section C6.6.191 of ARM ARM DDI 0487A.h, the access has to
+ * be quadword aligned.  For the time being, we do not support paired STXPs
+ * to MMIO memory, this will become trivial when the softmmu will support
+ * 128bit memory accesses. */
+if (addr & 0xf) {
+/* TODO: Do unaligned access */
+}
+
+/* Setting excl_succeeded to true will make the store exclusive. */
+cpu->excl_succeeded = true;
+helper_ret_stq_mmu(env, addr, vall, op, GETRA());
+
+if (!cpu->excl_succeeded) {
+ret = 1;
+goto out;
+}
+
+helper_ret_stq_mmu(env, addr + 8, valh, op, GETRA());
+if (!cpu->excl_succeeded) {
+ret = 1;
+} else {
+cpu->excl_succeeded = false;
+}
+
+out:
+/* Unset LL/SC context */
+cpu->ll_sc_context = false;
+cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+
+return ret;
+}
+#endif
diff --git a/target-arm/helper-a64.h b/target-arm/helper-a64.h
index 1d3d10f..c416a83 100644
--- a/target-arm/helper-a64.h
+++ b/target-arm/helper-a64.h
@@ -46,3 +46,7 @@ DEF_HELPER_FLAGS_2(frecpx_f32, TCG_CALL_NO_RWG, f32, f32, ptr)
 DEF_HELPER_FLAGS_2(fcvtx_f64_to_f32, TCG_CALL_NO_RWG, f32, f64, env)
 DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
+#ifdef CONFIG_ARM_USE_LDST_EXCL
+/* STXP helper */
+DEF_HELPER_5(stxp_i128, i64, env, i64, i64, i64, i32)
+#endif
diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
index 404c13b..146fc9a 100644
--- a/target-arm/op_helper.c
+++ b/target-arm/op_helper.c
@@ -34,6 +34,14 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
 cs->exception_index = excp;
 env->exception.syndrome = syndrome;
 env->exception.target_el = target_el;
+#ifdef CONFIG_ARM_USE_LDST_EXCL
+HELPER(atomic_clear)(env);
+/* If the exception happens in the middle of a LL/SC, we need to clear
+ * excl_succeeded to avoid that the normal store following the exception is
+ * wrongly interpreted as exclusive.
+ * */
+cs->excl_succeeded = 0;
+#endif
 cpu_loop_exit(cs);
 }
 
diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
index 80f6c20..f34e957 100644
--- a/target-arm/translate-a64.c
+++ b/target-arm/translate-a64.c
@@ -37,8 +37,10 @@
 static TCGv

[Qemu-devel] [RFC v7 08/16] softmmu: Honor the new exclusive bitmap

2016-01-29 Thread Alvise Rigo
The pages set as exclusive (clean) in the DIRTY_MEMORY_EXCLUSIVE bitmap
have to have their TLB entries flagged with TLB_EXCL. The accesses to
pages with TLB_EXCL flag set have to be properly handled in that they
can potentially invalidate an open LL/SC transaction.

Modify the TLB entries generation to honor the new bitmap and extend
the softmmu_template to handle the accesses made to guest pages marked
as exclusive.

In the case we remove a TLB entry marked as EXCL, we unset the
corresponding exclusive bit in the bitmap.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cputlb.c   | 44 --
 softmmu_template.h | 80 --
 2 files changed, 113 insertions(+), 11 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index ce6d720..aa9cc17 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -395,6 +395,16 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong 
vaddr,
 env->tlb_v_table[mmu_idx][vidx] = *te;
 env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
 
+if (unlikely(!(te->addr_write & TLB_MMIO) && (te->addr_write & TLB_EXCL))) 
{
+/* We are removing an exclusive entry, set the page to dirty. This
+ * is not be necessary if the vCPU has performed both SC and LL. */
+hwaddr hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) +
+  (te->addr_write & TARGET_PAGE_MASK);
+if (!cpu->ll_sc_context) {
+cpu_physical_memory_unset_excl(hw_addr);
+}
+}
+
 /* refill the tlb */
 env->iotlb[mmu_idx][index].addr = iotlb - vaddr;
 env->iotlb[mmu_idx][index].attrs = attrs;
@@ -418,9 +428,19 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong 
vaddr,
 } else if (memory_region_is_ram(section->mr)
&& cpu_physical_memory_is_clean(section->mr->ram_addr
+ xlat)) {
-te->addr_write = address | TLB_NOTDIRTY;
-} else {
-te->addr_write = address;
+address |= TLB_NOTDIRTY;
+}
+
+/* Since the MMIO accesses always follow the slow path, we do not need
+ * to set any flag to trap the access */
+if (!(address & TLB_MMIO)) {
+if (cpu_physical_memory_is_excl(section->mr->ram_addr + xlat)) {
+/* There is at least one vCPU that has flagged the address as
+ * exclusive. */
+te->addr_write = address | TLB_EXCL;
+} else {
+te->addr_write = address;
+}
 }
 } else {
 te->addr_write = -1;
@@ -474,6 +494,24 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, 
target_ulong addr)
 return qemu_ram_addr_from_host_nofail(p);
 }
 
+/* For every vCPU compare the exclusive address and reset it in case of a
+ * match. Since only one vCPU is running at once, no lock has to be held to
+ * guard this operation. */
+static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
+{
+CPUState *cpu;
+
+CPU_FOREACH(cpu) {
+if (cpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
+ranges_overlap(cpu->excl_protected_range.begin,
+   cpu->excl_protected_range.end -
+   cpu->excl_protected_range.begin,
+   addr, size)) {
+cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+}
+}
+}
+
 #define MMUSUFFIX _mmu
 
 /* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
diff --git a/softmmu_template.h b/softmmu_template.h
index 4332db2..267c52a 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -474,11 +474,43 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
 }
 
-/* Handle an IO access.  */
+/* Handle an IO access or exclusive access.  */
 if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
-glue(helper_le_st_name, _do_mmio_access)(env, val, addr, oi,
- mmu_idx, index, retaddr);
-return;
+if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
+CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
+CPUState *cpu = ENV_GET_CPU(env);
+CPUClass *cc = CPU_GET_CLASS(cpu);
+/* The slow-path has been forced since we are writing to
+ * exclusive-protected memory. */
+hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
+
+/* The function lookup_and_reset_cpus_l

[Qemu-devel] [RFC v7 11/16] tcg: Create new runtime helpers for excl accesses

2016-01-29 Thread Alvise Rigo
Introduce a set of new runtime helpers to handle exclusive instructions.
These helpers are used as hooks to call the respective LL/SC helpers in
softmmu_llsc_template.h from TCG code.

The helpers whose names end with an "a" also perform an alignment check.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 Makefile.target |   2 +-
 include/exec/helper-gen.h   |   3 ++
 include/exec/helper-proto.h |   1 +
 include/exec/helper-tcg.h   |   3 ++
 tcg-llsc-helper.c   | 104 
 tcg-llsc-helper.h   |  61 ++
 tcg/tcg-llsc-gen-helper.h   |  67 
 7 files changed, 240 insertions(+), 1 deletion(-)
 create mode 100644 tcg-llsc-helper.c
 create mode 100644 tcg-llsc-helper.h
 create mode 100644 tcg/tcg-llsc-gen-helper.h

diff --git a/Makefile.target b/Makefile.target
index 34ddb7e..faf32a2 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -135,7 +135,7 @@ obj-y += arch_init.o cpus.o monitor.o gdbstub.o balloon.o 
ioport.o numa.o
 obj-y += qtest.o bootdevice.o
 obj-y += hw/
 obj-$(CONFIG_KVM) += kvm-all.o
-obj-y += memory.o cputlb.o
+obj-y += memory.o cputlb.o tcg-llsc-helper.o
 obj-y += memory_mapping.o
 obj-y += dump.o
 obj-y += migration/ram.o migration/savevm.o
diff --git a/include/exec/helper-gen.h b/include/exec/helper-gen.h
index 0d0da3a..f8483a9 100644
--- a/include/exec/helper-gen.h
+++ b/include/exec/helper-gen.h
@@ -60,6 +60,9 @@ static inline void glue(gen_helper_, 
name)(dh_retvar_decl(ret)  \
 #include "trace/generated-helpers.h"
 #include "trace/generated-helpers-wrappers.h"
 #include "tcg-runtime.h"
+#if defined(CONFIG_SOFTMMU)
+#include "tcg-llsc-gen-helper.h"
+#endif
 
 #undef DEF_HELPER_FLAGS_0
 #undef DEF_HELPER_FLAGS_1
diff --git a/include/exec/helper-proto.h b/include/exec/helper-proto.h
index effdd43..90be2fd 100644
--- a/include/exec/helper-proto.h
+++ b/include/exec/helper-proto.h
@@ -29,6 +29,7 @@ dh_ctype(ret) HELPER(name) (dh_ctype(t1), dh_ctype(t2), 
dh_ctype(t3), \
 #include "helper.h"
 #include "trace/generated-helpers.h"
 #include "tcg-runtime.h"
+#include "tcg/tcg-llsc-gen-helper.h"
 
 #undef DEF_HELPER_FLAGS_0
 #undef DEF_HELPER_FLAGS_1
diff --git a/include/exec/helper-tcg.h b/include/exec/helper-tcg.h
index 79fa3c8..6228a7f 100644
--- a/include/exec/helper-tcg.h
+++ b/include/exec/helper-tcg.h
@@ -38,6 +38,9 @@
 #include "helper.h"
 #include "trace/generated-helpers.h"
 #include "tcg-runtime.h"
+#ifdef CONFIG_SOFTMMU
+#include "tcg-llsc-gen-helper.h"
+#endif
 
 #undef DEF_HELPER_FLAGS_0
 #undef DEF_HELPER_FLAGS_1
diff --git a/tcg-llsc-helper.c b/tcg-llsc-helper.c
new file mode 100644
index 000..646b4ba
--- /dev/null
+++ b/tcg-llsc-helper.c
@@ -0,0 +1,104 @@
+/*
+ * Runtime helpers for atomic istruction emulation
+ *
+ * Copyright (c) 2015 Virtual Open Systems
+ *
+ * Authors:
+ *  Alvise Rigo <a.r...@virtualopensystems.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "exec/cpu_ldst.h"
+#include "exec/helper-head.h"
+#include "tcg-llsc-helper.h"
+
+#define LDEX_HELPER(SUFF, OPC, FUNC)   \
+uint32_t HELPER(ldlink_i##SUFF)(CPUArchState *env, target_ulong addr,  \
+uint32_t index)\
+{  \
+CPUArchState *state = env; \
+TCGMemOpIdx op;\
+   \
+op = make_memop_idx((OPC), index); \
+   \
+return (uint32_t)FUNC(state, addr, op, GETRA());   \
+}
+
+#define STEX_HELPER(SUFF, DATA_TYPE, OPC, FUNC)\
+target_ulong HELPER(stcond_i##SUFF)(CPUArchState *env, target_ulong addr,

[Qemu-devel] [RFC v7 04/16] softmmu: Simplify helper_*_st_name, wrap RAM code

2016-01-29 Thread Alvise Rigo
Attempting to simplify the helper_*_st_name, wrap the code relative to a
RAM access into an inline function.

Based on this work, Alex proposed the following patch series
https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg01136.html
that reduces code duplication of the softmmu_helpers.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 softmmu_template.h | 110 +
 1 file changed, 68 insertions(+), 42 deletions(-)

diff --git a/softmmu_template.h b/softmmu_template.h
index 3d388ec..6279437 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -416,13 +416,46 @@ static inline void glue(helper_le_st_name, 
_do_mmio_access)(CPUArchState *env,
 glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
 }
 
+static inline void glue(helper_le_st_name, _do_ram_access)(CPUArchState *env,
+   DATA_TYPE val,
+   target_ulong addr,
+   TCGMemOpIdx oi,
+   unsigned mmu_idx,
+   int index,
+   uintptr_t retaddr)
+{
+uintptr_t haddr;
+
+/* Handle slow unaligned access (it spans two pages or IO).  */
+if (DATA_SIZE > 1
+&& unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
+ >= TARGET_PAGE_SIZE)) {
+glue(helper_le_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
+retaddr);
+return;
+}
+
+/* Handle aligned access or unaligned access in the same page.  */
+if ((addr & (DATA_SIZE - 1)) != 0
+&& (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+ mmu_idx, retaddr);
+}
+
+haddr = addr + env->tlb_table[mmu_idx][index].addend;
+#if DATA_SIZE == 1
+glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
+#else
+glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
+#endif
+}
+
 void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
TCGMemOpIdx oi, uintptr_t retaddr)
 {
 unsigned mmu_idx = get_mmuidx(oi);
 int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
 target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
-uintptr_t haddr;
 
 /* Adjust the given return address.  */
 retaddr -= GETPC_ADJ;
@@ -448,28 +481,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 return;
 }
 
-/* Handle slow unaligned access (it spans two pages or IO).  */
-if (DATA_SIZE > 1
-&& unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
- >= TARGET_PAGE_SIZE)) {
-glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
-oi, retaddr);
-return;
-}
-
-/* Handle aligned access or unaligned access in the same page.  */
-if ((addr & (DATA_SIZE - 1)) != 0
-&& (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
-cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
- mmu_idx, retaddr);
-}
-
-haddr = addr + env->tlb_table[mmu_idx][index].addend;
-#if DATA_SIZE == 1
-glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
-#else
-glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
-#endif
+glue(helper_le_st_name, _do_ram_access)(env, val, addr, oi, mmu_idx, index,
+retaddr);
 }
 
 #if DATA_SIZE > 1
@@ -519,13 +532,42 @@ static inline void glue(helper_be_st_name, 
_do_mmio_access)(CPUArchState *env,
 glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
 }
 
+static inline void glue(helper_be_st_name, _do_ram_access)(CPUArchState *env,
+   DATA_TYPE val,
+   target_ulong addr,
+   TCGMemOpIdx oi,
+   unsigned mmu_idx,
+   int index,
+   uintptr_t retaddr)
+{
+uintptr_t haddr;
+
+/* Handle slow unaligned access (it spans two pages or IO).  */
+if (DATA_SIZE > 1
+&& unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
+ >= TARGET_PAGE_SIZE)) {
+   

[Qemu-devel] [RFC v7 10/16] softmmu: Protect MMIO exclusive range

2016-01-29 Thread Alvise Rigo
As in the RAM case, the MMIO exclusive ranges also have to be protected
from other CPUs' accesses. In order to do that, we flag the accessed
MemoryRegion to mark that an exclusive access has been performed and is
not yet concluded.

This flag will force the other CPUs to invalidate the exclusive range in
case of collision.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cputlb.c| 20 +---
 include/exec/memory.h   |  1 +
 softmmu_llsc_template.h | 11 +++
 softmmu_template.h  | 22 ++
 4 files changed, 43 insertions(+), 11 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 87d09c8..06ce2da 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -496,19 +496,25 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, 
target_ulong addr)
 /* For every vCPU compare the exclusive address and reset it in case of a
  * match. Since only one vCPU is running at once, no lock has to be held to
  * guard this operation. */
-static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
+static inline bool lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
 {
 CPUState *cpu;
+bool ret = false;
 
 CPU_FOREACH(cpu) {
-if (cpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
-ranges_overlap(cpu->excl_protected_range.begin,
-   cpu->excl_protected_range.end -
-   cpu->excl_protected_range.begin,
-   addr, size)) {
-cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+if (current_cpu != cpu) {
+if (cpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
+ranges_overlap(cpu->excl_protected_range.begin,
+   cpu->excl_protected_range.end -
+   cpu->excl_protected_range.begin,
+   addr, size)) {
+cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+ret = true;
+}
 }
 }
+
+return ret;
 }
 
 #define MMUSUFFIX _mmu
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 71e0480..bacb3ad 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -171,6 +171,7 @@ struct MemoryRegion {
 bool rom_device;
 bool flush_coalesced_mmio;
 bool global_locking;
+bool pending_excl_access; /* A vCPU issued an exclusive access */
 uint8_t dirty_log_mask;
 ram_addr_t ram_addr;
 Object *owner;
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
index 101f5e8..b4712ba 100644
--- a/softmmu_llsc_template.h
+++ b/softmmu_llsc_template.h
@@ -81,15 +81,18 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, 
target_ulong addr,
 }
 }
 }
+/* For this vCPU, just update the TLB entry, no need to flush. */
+env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
 } else {
-hw_error("EXCL accesses to MMIO regions not supported yet.");
+/* Set a pending exclusive access in the MemoryRegion */
+MemoryRegion *mr = iotlb_to_region(this,
+   env->iotlb[mmu_idx][index].addr,
+   env->iotlb[mmu_idx][index].attrs);
+mr->pending_excl_access = true;
 }
 
 cc->cpu_set_excl_protected_range(this, hw_addr, DATA_SIZE);
 
-/* For this vCPU, just update the TLB entry, no need to flush. */
-env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
-
 /* From now on we are in LL/SC context */
 this->ll_sc_context = true;
 
diff --git a/softmmu_template.h b/softmmu_template.h
index c54bdc9..71c5152 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -360,6 +360,14 @@ static inline void glue(io_write, SUFFIX)(CPUArchState 
*env,
 MemoryRegion *mr = iotlb_to_region(cpu, physaddr, iotlbentry->attrs);
 
 physaddr = (physaddr & TARGET_PAGE_MASK) + addr;
+
+/* Invalidate the exclusive range that overlaps this access */
+if (mr->pending_excl_access) {
+if (lookup_and_reset_cpus_ll_addr(physaddr, 1 << SHIFT)) {
+mr->pending_excl_access = false;
+}
+}
+
+if (mr != &io_mem_rom && mr != &io_mem_notdirty && !cpu->can_do_io) {
 cpu_io_recompile(cpu, retaddr);
 }
@@ -504,6 +512,13 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
addr, DATA_TYPE val,
 glue(helper_le_st_name, _do_mmio_access)(env, val, addr, oi,
  mmu_idx, index,
  retaddr);
+/* N.B.: Here excl_succeeded == true means that this access
+   

[Qemu-devel] [RFC v7 00/16] Slow-path for atomic instruction translation

2016-01-29 Thread Alvise Rigo
 added to handle the TLB flush support
- the softmmu_template and softmmu_llsc_template have been adapted to work
  on real multi-threading

Changes from v1:
- The ram bitmap is not reversed anymore, 1 = dirty, 0 = exclusive
- The way the offset used to access the bitmap is calculated has
  been improved and fixed
- A page to be set as dirty requires a vCPU to target the protected address
  and not just an address in the page
- Addressed comments from Richard Henderson to improve the logic in
  softmmu_template.h and to simplify the methods generation through
  softmmu_llsc_template.h
- Added initial implementation of qemu_{ldlink,stcond}_i32 for tcg/i386

This work has been sponsored by Huawei Technologies Duesseldorf GmbH.



Alvise Rigo (16):
  exec.c: Add new exclusive bitmap to ram_list
  softmmu: Simplify helper_*_st_name, wrap unaligned code
  softmmu: Simplify helper_*_st_name, wrap MMIO code
  softmmu: Simplify helper_*_st_name, wrap RAM code
  softmmu: Add new TLB_EXCL flag
  qom: cpu: Add CPUClass hooks for exclusive range
  softmmu: Add helpers for a new slowpath
  softmmu: Honor the new exclusive bitmap
  softmmu: Include MMIO/invalid exclusive accesses
  softmmu: Protect MMIO exclusive range
  tcg: Create new runtime helpers for excl accesses
  configure: Use slow-path for atomic only when the softmmu is enabled
  softmmu: Add history of excl accesses
  target-arm: translate: Use ld/st excl for atomic insns
  target-arm: cpu64: use custom set_excl hook
  target-arm: aarch64: add atomic instructions

 Makefile.target |   2 +-
 configure   |  16 +++
 cputlb.c|  63 +++-
 exec.c  |  26 +++-
 include/exec/cpu-all.h  |   8 ++
 include/exec/helper-gen.h   |   3 +
 include/exec/helper-proto.h |   1 +
 include/exec/helper-tcg.h   |   3 +
 include/exec/memory.h   |   4 +-
 include/exec/ram_addr.h |  31 
 include/qom/cpu.h   |  28 
 qom/cpu.c   |  20 +++
 softmmu_llsc_template.h | 137 ++
 softmmu_template.h  | 342 ++--
 target-arm/cpu.h|   2 +
 target-arm/cpu64.c  |   8 ++
 target-arm/helper-a64.c |  55 +++
 target-arm/helper-a64.h |   4 +
 target-arm/helper.h |   4 +
 target-arm/machine.c|   2 +
 target-arm/op_helper.c  |  18 +++
 target-arm/translate-a64.c  | 134 -
 target-arm/translate.c  | 188 +++-
 tcg-llsc-helper.c   | 104 ++
 tcg-llsc-helper.h   |  61 
 tcg/tcg-llsc-gen-helper.h   |  67 +
 tcg/tcg.h   |  31 
 vl.c|   3 +
 28 files changed, 1273 insertions(+), 92 deletions(-)
 create mode 100644 softmmu_llsc_template.h
 create mode 100644 tcg-llsc-helper.c
 create mode 100644 tcg-llsc-helper.h
 create mode 100644 tcg/tcg-llsc-gen-helper.h

-- 
2.7.0




[Qemu-devel] [RFC v7 12/16] configure: Use slow-path for atomic only when the softmmu is enabled

2016-01-29 Thread Alvise Rigo
Use the new slow path for atomic instruction translation when the
softmmu is enabled.

At the moment only arm and aarch64 use the new LL/SC backend. It is
possible to disable such a backend with --disable-arm-llsc-backend.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 configure | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/configure b/configure
index 44ac9ab..915efcc 100755
--- a/configure
+++ b/configure
@@ -294,6 +294,7 @@ solaris="no"
 profiler="no"
 cocoa="no"
 softmmu="yes"
+arm_tcg_use_llsc="yes"
 linux_user="no"
 bsd_user="no"
 aix="no"
@@ -880,6 +881,10 @@ for opt do
   ;;
   --disable-debug-tcg) debug_tcg="no"
   ;;
+  --enable-arm-llsc-backend) arm_tcg_use_llsc="yes"
+  ;;
+  --disable-arm-llsc-backend) arm_tcg_use_llsc="no"
+  ;;
   --enable-debug)
   # Enable debugging options that aren't excessively noisy
   debug_tcg="yes"
@@ -4751,6 +4756,7 @@ echo "host CPU  $cpu"
 echo "host big endian   $bigendian"
 echo "target list   $target_list"
 echo "tcg debug enabled $debug_tcg"
+echo "arm use llsc backend" $arm_tcg_use_llsc
 echo "gprof enabled $gprof"
 echo "sparse enabled$sparse"
 echo "strip binaries$strip_opt"
@@ -4806,6 +4812,7 @@ echo "Install blobs $blobs"
 echo "KVM support   $kvm"
 echo "RDMA support  $rdma"
 echo "TCG interpreter   $tcg_interpreter"
+echo "use ld/st excl$softmmu"
 echo "fdt support   $fdt"
 echo "preadv support$preadv"
 echo "fdatasync $fdatasync"
@@ -5863,6 +5870,13 @@ fi
 echo "LDFLAGS+=$ldflags" >> $config_target_mak
 echo "QEMU_CFLAGS+=$cflags" >> $config_target_mak
 
+# Use tcg LL/SC tcg backend for exclusive instruction is arm/aarch64
+# softmmus targets
+if test "$arm_tcg_use_llsc" = "yes" ; then
+  if test "$target" = "arm-softmmu" ; then
+echo "CONFIG_ARM_USE_LDST_EXCL=y" >> $config_target_mak
+  fi
+fi
 done # for target in $targets
 
 if [ "$pixman" = "internal" ]; then
-- 
2.7.0




[Qemu-devel] [RFC v7 13/16] softmmu: Add history of excl accesses

2016-01-29 Thread Alvise Rigo
Add a circular buffer to store the hw addresses used in the last
EXCLUSIVE_HISTORY_LEN exclusive accesses.

When an address is popped from the buffer, its page will be set as not
exclusive. In this way, we avoid:
- frequent set/unset of a page (causing frequent flushes as well)
- the possibility of leaving the EXCL bit set and forgotten.

Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 cputlb.c| 29 +++--
 exec.c  | 19 +++
 include/qom/cpu.h   |  8 
 softmmu_llsc_template.h |  1 +
 vl.c|  3 +++
 5 files changed, 50 insertions(+), 10 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 06ce2da..f3c4d97 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -395,16 +395,6 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong 
vaddr,
 env->tlb_v_table[mmu_idx][vidx] = *te;
 env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
 
-if (unlikely(!(te->addr_write & TLB_MMIO) && (te->addr_write & TLB_EXCL))) 
{
-/* We are removing an exclusive entry, set the page to dirty. This
- * is not necessary if the vCPU has performed both SC and LL. */
-hwaddr hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) +
-  (te->addr_write & TARGET_PAGE_MASK);
-if (!cpu->ll_sc_context) {
-cpu_physical_memory_unset_excl(hw_addr);
-}
-}
-
 /* refill the tlb */
 env->iotlb[mmu_idx][index].addr = iotlb - vaddr;
 env->iotlb[mmu_idx][index].attrs = attrs;
@@ -517,6 +507,25 @@ static inline bool lookup_and_reset_cpus_ll_addr(hwaddr 
addr, hwaddr size)
 return ret;
 }
 
+extern CPUExclusiveHistory excl_history;
+static inline void excl_history_put_addr(hwaddr addr)
+{
+hwaddr last;
+
+/* Calculate the index of the next exclusive address */
+excl_history.last_idx = (excl_history.last_idx + 1) % excl_history.length;
+
+last = excl_history.c_array[excl_history.last_idx];
+
+/* Unset EXCL bit of the oldest entry */
+if (last != EXCLUSIVE_RESET_ADDR) {
+cpu_physical_memory_unset_excl(last);
+}
+
+/* Add a new address, overwriting the oldest one */
+excl_history.c_array[excl_history.last_idx] = addr & TARGET_PAGE_MASK;
+}
+
 #define MMUSUFFIX _mmu
 
 /* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
diff --git a/exec.c b/exec.c
index 51f366d..2e123f1 100644
--- a/exec.c
+++ b/exec.c
@@ -177,6 +177,25 @@ struct CPUAddressSpace {
 MemoryListener tcg_as_listener;
 };
 
+/* Exclusive memory support */
+CPUExclusiveHistory excl_history;
+void cpu_exclusive_history_init(void)
+{
+/* Initialize exclusive history for atomic instruction handling. */
+if (tcg_enabled()) {
+g_assert(EXCLUSIVE_HISTORY_CPU_LEN * max_cpus <= UINT16_MAX);
+excl_history.length = EXCLUSIVE_HISTORY_CPU_LEN * max_cpus;
+excl_history.c_array = g_malloc(excl_history.length * sizeof(hwaddr));
+memset(excl_history.c_array, -1, excl_history.length * sizeof(hwaddr));
+}
+}
+
+void cpu_exclusive_history_free(void)
+{
+if (tcg_enabled()) {
+g_free(excl_history.c_array);
+}
+}
 #endif
 
 #if !defined(CONFIG_USER_ONLY)
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 6f6c1c0..0452fd0 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -227,7 +227,15 @@ struct kvm_run;
 #define TB_JMP_CACHE_SIZE (1 << TB_JMP_CACHE_BITS)
 
 /* Atomic insn translation TLB support. */
+typedef struct CPUExclusiveHistory {
+uint16_t last_idx;   /* index of last insertion */
+uint16_t length; /* history's length, it depends on smp_cpus */
+hwaddr *c_array; /* history's circular array */
+} CPUExclusiveHistory;
 #define EXCLUSIVE_RESET_ADDR ULLONG_MAX
+#define EXCLUSIVE_HISTORY_CPU_LEN 256
+void cpu_exclusive_history_init(void);
+void cpu_exclusive_history_free(void);
 
 /**
  * CPUState:
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
index b4712ba..b4e7f9d 100644
--- a/softmmu_llsc_template.h
+++ b/softmmu_llsc_template.h
@@ -75,6 +75,7 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong 
addr,
  * to request any flush. */
 if (!cpu_physical_memory_is_excl(hw_addr)) {
 cpu_physical_memory_set_excl(hw_addr);
+excl_history_put_addr(hw_addr);
 CPU_FOREACH(cpu) {
 if (current_cpu != cpu) {
 tlb_flush(cpu, 1);
diff --git a/vl.c b/vl.c
index f043009..b22d99b 100644
--- a/vl.c
+++ b/vl.c
@@ -547,6 +547,7 @@ static void res_free(void)
 {
 g_free(boot_splash_filedata);
 boot_splash_filedata = NULL;
+cpu_exclusive_history_free();
 }
 
 static int default_driver_check(void *opaque, Qem

Re: [Qemu-devel] Status of my hacks on the MTTCG WIP branch

2016-01-19 Thread alvise rigo
On Mon, Jan 18, 2016 at 8:09 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
>
> Alex Bennée <alex.ben...@linaro.org> writes:
>
> > alvise rigo <a.r...@virtualopensystems.com> writes:
> >
> >> On Fri, Jan 15, 2016 at 4:25 PM, Alex Bennée <alex.ben...@linaro.org> 
> >> wrote:
> >>>
> >>> alvise rigo <a.r...@virtualopensystems.com> writes:
> >>>
> >>>> On Fri, Jan 15, 2016 at 3:51 PM, Alex Bennée <alex.ben...@linaro.org> 
> >>>> wrote:
> >>>>>
> >>>>> alvise rigo <a.r...@virtualopensystems.com> writes:
> >>>>>
> 
> >>>> Keep in mind that Linux on arm64 uses the LDXP/STXP instructions that
> >>>> exist solely in aarch64.
> >>>> These instructions are purely emulated now and can potentially write
> >>>> 128 bits of data in a non-atomic fashion.
> >>>
> >>> Sure, but I doubt they are the reason for this hang as the kernel
> >>> doesn't use them.
> >>
> >> The kernel does use them for __cmpxchg_double in
> >> arch/arm64/include/asm/atomic_ll_sc.h.
> >
> > I take it back, if I'd have grepped for "ldxp" instead of "stxp" I would
> > have seen it, sorry about that ;-)
> >
> >> In any case, the normal exclusive instructions are also emulated in
> >> target-arm/translate-a64.c.
> >
> > I'll check on them on Monday. I'd assumed all the stuff was in the
> > helpers as I scanned through and missed the translate.c changes Fred
> > made. Hopefully that will be the last hurdle.
>
> I'm pleased to confirm you were right. I hacked up Fred's helper based
> solution for aarch64 including the ldxp/stxp stuff. It's not
> semantically correct because:
>
>   result = atomic_bool_cmpxchg(p, oldval, (uint8_t)newval) &&
>            atomic_bool_cmpxchg(&p[1], oldval2, (uint8_t)newval2);
>
> won't leave the system as it was before if the race causes the second

Exactly.

> cmpxchg to fail. I assume this won't be a problem in the LL/SC world as
> we'll be able to serialise all accesses to the exclusive page properly?

In the LL/SC world the idea would be to dedicate one ARM-specific helper (in
target-arm/helper-a64.c) to handle this case.
Once the helper has grabbed the excl mutex, we are allowed to make 128-bit
or bigger accesses.
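Roughly, a minimal sketch of that idea, reusing the same calls as the
helper_stxp_i128 helper earlier in this series but serialising the paired
store behind a mutex (excl_mutex and stxp_i128_serialised are illustrative
names only, not existing QEMU symbols):

/* Sketch only: needs "qemu/thread.h"; excl_mutex would be initialised once
 * at startup with qemu_mutex_init().  Both 64-bit halves of the pair are
 * stored while the mutex is held, so no other vCPU can observe a partially
 * written quadword. */
static QemuMutex excl_mutex;

uint64_t HELPER(stxp_i128_serialised)(CPUARMState *env, uint64_t addr,
                                      uint64_t vall, uint64_t valh,
                                      uint32_t oi)
{
    qemu_mutex_lock(&excl_mutex);
    helper_ret_stq_mmu(env, addr, vall, oi, GETRA());
    helper_ret_stq_mmu(env, addr + 8, valh, oi, GETRA());
    qemu_mutex_unlock(&excl_mutex);

    return 0; /* 0 = store succeeded, following the STXP convention */
}

A real implementation would also have to release the mutex on the exception
paths that can be taken inside the store helpers.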

>
>
> See:
>
> https://github.com/stsquad/qemu/tree/mttcg/multi_tcg_v8_wip_ajb_fix_locks-r2
>
> >
> > In the meantime if I'm not booting Jessie I can get MTTCG aarch64
> > working with a initrd based rootfs. Once I've gone through those I'm
> > planning on giving it a good stress test with -fsantize=threads.
>
> My first pass with this threw up a bunch of errors with the RCU code
> like this:
>
> WARNING: ThreadSanitizer: data race (pid=15387)
>   Atomic write of size 4 at 0x7f59efa51d48 by main thread (mutexes: write 
> M172):
> #0 __tsan_atomic32_fetch_add  (libtsan.so.0+0x00058e8f)
> #1 call_rcu1 util/rcu.c:288 (qemu-system-aarch64+0x006c3bd0)
> #2 address_space_update_topology 
> /home/alex/lsrc/qemu/qemu.git/memory.c:806 
> (qemu-system-aarch64+0x001ed9ca)
> #3 memory_region_transaction_commit 
> /home/alex/lsrc/qemu/qemu.git/memory.c:842 
> (qemu-system-aarch64+0x001ed9ca)
> #4 address_space_init /home/alex/lsrc/qemu/qemu.git/memory.c:2136 
> (qemu-system-aarch64+0x001f1fa6)
> #5 memory_map_init /home/alex/lsrc/qemu/qemu.git/exec.c:2344 
> (qemu-system-aarch64+0x00196607)
> #6 cpu_exec_init_all /home/alex/lsrc/qemu/qemu.git/exec.c:2795 
> (qemu-system-aarch64+0x00196607)
> #7 main /home/alex/lsrc/qemu/qemu.git/vl.c:4083 
> (qemu-system-aarch64+0x001829aa)
>
>   Previous read of size 4 at 0x7f59efa51d48 by thread T1:
> #0 call_rcu_thread util/rcu.c:242 (qemu-system-aarch64+0x006c3d92)
> #1   (libtsan.so.0+0x000235f9)
>
>   Location is global 'rcu_call_count' of size 4 at 0x7f59efa51d48 
> (qemu-system-aarch64+0x010f1d48)
>
>   Mutex M172 (0x7f59ef6254e0) created at:
> #0 pthread_mutex_init  (libtsan.so.0+0x00027ee5)
> #1 qemu_mutex_init util/qemu-thread-posix.c:55 
> (qemu-system-aarch64+0x006ad747)
> #2 qemu_init_cpu_loop /home/alex/lsrc/qemu/qemu.git/cpus.c:890 
> (qemu-system-aarch64+0x001d4166)
> #3 main /home/alex/lsrc/qemu/qemu.git/vl.c:3005 
> (qemu-system-aarch64+0x001820ac)
>
>   Thread T1 (tid=15389, running) created by main thread at:
> #0 pthread_create  (libtsan.so.0+0x000274c7)
> #1 qemu_thread_create util/qemu-thread-posix.c:525 
> (qemu-system-aarch64+0x006ae04d)
> #2 rcu_init_complete util/rcu.c:320 (qemu-system-aarch64+0x006c3d52)
> #3 rcu_init util/rcu.c:351 (qemu-system-aarch64+0x0018e288)
> #4 __libc_csu_init  (qemu-system-aarch64+0x006c63ec)
>
>
> but I don't know how many are false positives so I'm going to look in more
> detail now.

Umm...I'm not very familiar with the sanitize option, I'll let you
follow this lead :).

alvise

>
> 
>
> --
> Alex Bennée



Re: [Qemu-devel] [RFC PATCH 2/2] softmmu: simplify helper_*_st_name with smmu_helper(do_unl_store)

2016-01-19 Thread alvise rigo
On Fri, Jan 8, 2016 at 4:53 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
> From: Alvise Rigo <a.r...@virtualopensystems.com>
>
> Attempting to simplify the helper_*_st_name, wrap the
> do_unaligned_access code into a shared inline function. As this also
> removes the goto statement the inline code is expanded twice in each
> helper.
>
> Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
> Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
> CC: Alvise Rigo <a.r...@virtualopensystems.com>
> Signed-off-by: Alex Bennée <alex.ben...@linaro.org>
>
> ---
> v2
>   - based on original patch from Alvise
>   - uses a single shared inline function to reduce duplication
> ---
>  softmmu_template.h | 75 
> --
>  1 file changed, 39 insertions(+), 36 deletions(-)
>
> diff --git a/softmmu_template.h b/softmmu_template.h
> index 0074bd7..ac0b4ac 100644
> --- a/softmmu_template.h
> +++ b/softmmu_template.h
> @@ -159,6 +159,39 @@ static inline int smmu_helper(victim_tlb_hit) (const 
> bool is_read, CPUArchState
>  }
>
>  #ifndef SOFTMMU_CODE_ACCESS
> +
> +static inline void smmu_helper(do_unl_store)(CPUArchState *env,
> + bool little_endian,
> + DATA_TYPE val,
> + target_ulong addr,
> + TCGMemOpIdx oi,
> + unsigned mmu_idx,
> + uintptr_t retaddr)
> +{
> +int i;
> +
> +if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> +cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> + mmu_idx, retaddr);
> +}
> +/* Note: relies on the fact that tlb_fill() does not remove the
> + * previous page from the TLB cache.  */
> +for (i = DATA_SIZE - 1; i >= 0; i--) {
> +uint8_t val8;
> +if (little_endian) {
> +/* Little-endian extract.  */
> +val8 = val >> (i * 8);
> +} else {
> +/* Big-endian extract.  */
> +val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
> +}
> +/* Note the adjustment at the beginning of the function.
> +   Undo that for the recursion.  */
> +glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> +oi, retaddr + GETPC_ADJ);
> +}
> +}
> +
>  static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env,
>CPUIOTLBEntry *iotlbentry,
>target_ulong addr,
> @@ -416,7 +449,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
> addr, DATA_TYPE val,
>  if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>  CPUIOTLBEntry *iotlbentry;
>  if ((addr & (DATA_SIZE - 1)) != 0) {
> -goto do_unaligned_access;
> +smmu_helper(do_unl_store)(env, true, val, addr, oi, mmu_idx, 
> retaddr);
> +return;
>  }
>  iotlbentry = &env->iotlb[mmu_idx][index];
>
> @@ -431,23 +465,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
> addr, DATA_TYPE val,
>  if (DATA_SIZE > 1
>  && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>   >= TARGET_PAGE_SIZE)) {
> -int i;
> -do_unaligned_access:
> -if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> -cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> - mmu_idx, retaddr);
> -}
> -/* XXX: not efficient, but simple */
> -/* Note: relies on the fact that tlb_fill() does not remove the
> - * previous page from the TLB cache.  */
> -for (i = DATA_SIZE - 1; i >= 0; i--) {
> -/* Little-endian extract.  */
> -uint8_t val8 = val >> (i * 8);
> -/* Note the adjustment at the beginning of the function.
> -   Undo that for the recursion.  */
> -glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> -oi, retaddr + GETPC_ADJ);
> -}
> +smmu_helper(do_unl_store)(env, true, val, addr, oi, mmu_idx, 
> retaddr);
>  return;
>  }
>
> @@ -496,7 +514,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong 
> addr, DATA_TYPE val,
>  if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>  CPUIOTLBEntry *iotlbentry;
>

[Qemu-devel] [PATCH v2] target-arm: Use the right MMU index in arm_regime_using_lpae_format

2016-01-15 Thread Alvise Rigo
arm_regime_using_lpae_format checks whether the LPAE extension is used
for stage 1 translation regimes. MMU indexes not exclusively of a stage 1
regime won't work with this method.

In case of ARMMMUIdx_S12NSE0 or ARMMMUIdx_S12NSE1, offset these values
by ARMMMUIdx_S1NSE0 to get the right index indicating a stage 1
translation regime.

Rename also the function to arm_s1_regime_using_lpae_format and update
the comments to reflect the change.

Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 target-arm/helper.c| 12 
 target-arm/internals.h |  5 +++--
 target-arm/op_helper.c |  2 +-
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/target-arm/helper.c b/target-arm/helper.c
index 59d5a41..faeaaa8 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -5996,11 +5996,15 @@ static inline bool regime_using_lpae_format(CPUARMState 
*env,
 return false;
 }
 
-/* Returns true if the translation regime is using LPAE format page tables.
- * Used when raising alignment exceptions, whose FSR changes depending on
- * whether the long or short descriptor format is in use. */
-bool arm_regime_using_lpae_format(CPUARMState *env, ARMMMUIdx mmu_idx)
+/* Returns true if the stage 1 translation regime is using LPAE format page
+ * tables. Used when raising alignment exceptions, whose FSR changes depending
+ * on whether the long or short descriptor format is in use. */
+bool arm_s1_regime_using_lpae_format(CPUARMState *env, ARMMMUIdx mmu_idx)
 {
+if (mmu_idx == ARMMMUIdx_S12NSE0 || mmu_idx == ARMMMUIdx_S12NSE1) {
+mmu_idx += ARMMMUIdx_S1NSE0;
+}
+
 return regime_using_lpae_format(env, mmu_idx);
 }
 
diff --git a/target-arm/internals.h b/target-arm/internals.h
index b925aaa..d226bbe 100644
--- a/target-arm/internals.h
+++ b/target-arm/internals.h
@@ -441,8 +441,9 @@ struct ARMMMUFaultInfo {
 bool arm_tlb_fill(CPUState *cpu, vaddr address, int rw, int mmu_idx,
   uint32_t *fsr, ARMMMUFaultInfo *fi);
 
-/* Return true if the translation regime is using LPAE format page tables */
-bool arm_regime_using_lpae_format(CPUARMState *env, ARMMMUIdx mmu_idx);
+/* Return true if the stage 1 translation regime is using LPAE format page
+ * tables */
+bool arm_s1_regime_using_lpae_format(CPUARMState *env, ARMMMUIdx mmu_idx);
 
 /* Raise a data fault alignment exception for the specified virtual address */
 void arm_cpu_do_unaligned_access(CPUState *cs, vaddr vaddr, int is_write,
diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
index e42d287..951fc5a 100644
--- a/target-arm/op_helper.c
+++ b/target-arm/op_helper.c
@@ -149,7 +149,7 @@ void arm_cpu_do_unaligned_access(CPUState *cs, vaddr vaddr, 
int is_write,
 /* the DFSR for an alignment fault depends on whether we're using
  * the LPAE long descriptor format, or the short descriptor format
  */
-if (arm_regime_using_lpae_format(env, cpu_mmu_index(env, false))) {
+if (arm_s1_regime_using_lpae_format(env, cpu_mmu_index(env, false))) {
 env->exception.fsr = 0x21;
 } else {
 env->exception.fsr = 0x1;
-- 
2.7.0




Re: [Qemu-devel] [PATCH] target-arm: fix MMU index in arm_cpu_do_unaligned_access

2016-01-15 Thread alvise rigo
On Fri, Jan 15, 2016 at 11:04 AM, Peter Maydell
<peter.mayd...@linaro.org> wrote:
> On 15 January 2016 at 09:59, Alvise Rigo <a.r...@virtualopensystems.com> 
> wrote:
>> arm_regime_using_lpae_format checks whether the LPAE extension is used
>> for stage 1 translation regimes. MMU indexes not exclusively of a stage 1
>> regime won't work with this method.
>>
>> In case of ARMMMUIdx_S12NSE0 or ARMMMUIdx_S12NSE1, offset these values
>> by ARMMMUIdx_S1NSE0 to get the right index indicating a stage 1
>> translation regime.
>>
>> Rename also the function to arm_s1_regime_using_lpae_format and update
>> the comments to reflect the change.
>>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  target-arm/helper.c| 8 
>>  target-arm/internals.h | 5 +++--
>>  target-arm/op_helper.c | 8 ++--
>>  3 files changed, 13 insertions(+), 8 deletions(-)
>>
>> diff --git a/target-arm/helper.c b/target-arm/helper.c
>> index 59d5a41..8317ff5 100644
>> --- a/target-arm/helper.c
>> +++ b/target-arm/helper.c
>> @@ -5996,10 +5996,10 @@ static inline bool 
>> regime_using_lpae_format(CPUARMState *env,
>>  return false;
>>  }
>>
>> -/* Returns true if the translation regime is using LPAE format page tables.
>> - * Used when raising alignment exceptions, whose FSR changes depending on
>> - * whether the long or short descriptor format is in use. */
>> -bool arm_regime_using_lpae_format(CPUARMState *env, ARMMMUIdx mmu_idx)
>> +/* Returns true if the stage 1 translation regime is using LPAE format page
>> + * tables. Used when raising alignment exceptions, whose FSR changes 
>> depending
>> + * on whether the long or short descriptor format is in use. */
>> +bool arm_s1_regime_using_lpae_format(CPUARMState *env, ARMMMUIdx mmu_idx)
>>  {
>>  return regime_using_lpae_format(env, mmu_idx);
>>  }
>> diff --git a/target-arm/internals.h b/target-arm/internals.h
>> index b925aaa..d226bbe 100644
>> --- a/target-arm/internals.h
>> +++ b/target-arm/internals.h
>> @@ -441,8 +441,9 @@ struct ARMMMUFaultInfo {
>>  bool arm_tlb_fill(CPUState *cpu, vaddr address, int rw, int mmu_idx,
>>uint32_t *fsr, ARMMMUFaultInfo *fi);
>>
>> -/* Return true if the translation regime is using LPAE format page tables */
>> -bool arm_regime_using_lpae_format(CPUARMState *env, ARMMMUIdx mmu_idx);
>> +/* Return true if the stage 1 translation regime is using LPAE format page
>> + * tables */
>> +bool arm_s1_regime_using_lpae_format(CPUARMState *env, ARMMMUIdx mmu_idx);
>>
>>  /* Raise a data fault alignment exception for the specified virtual address 
>> */
>>  void arm_cpu_do_unaligned_access(CPUState *cs, vaddr vaddr, int is_write,
>> diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
>> index e42d287..ccc505d 100644
>> --- a/target-arm/op_helper.c
>> +++ b/target-arm/op_helper.c
>> @@ -133,7 +133,7 @@ void arm_cpu_do_unaligned_access(CPUState *cs, vaddr 
>> vaddr, int is_write,
>>  {
>>  ARMCPU *cpu = ARM_CPU(cs);
>  CPUARMState *env = &cpu->env;
>> -int target_el;
>> +int target_el, mmu_idx;
>>  bool same_el;
>>
>>  if (retaddr) {
>> @@ -146,10 +146,14 @@ void arm_cpu_do_unaligned_access(CPUState *cs, vaddr 
>> vaddr, int is_write,
>>
>>  env->exception.vaddress = vaddr;
>>
>> +mmu_idx = cpu_mmu_index(env, false);
>> +if (mmu_idx == ARMMMUIdx_S12NSE0 || mmu_idx == ARMMMUIdx_S12NSE1) {
>> +mmu_idx += ARMMMUIdx_S1NSE0;
>> +}
>
> I would let the arm_s1_regime_using_lpae_format() function do this conversion
> from the S12 index to the S1 index.

OK, I will send the updated version right away.

>
> Otherwise this looks good to me.

Thank you,
alvise

>
>>  /* the DFSR for an alignment fault depends on whether we're using
>>   * the LPAE long descriptor format, or the short descriptor format
>>   */
>> -if (arm_regime_using_lpae_format(env, cpu_mmu_index(env, false))) {
>> +if (arm_s1_regime_using_lpae_format(env, mmu_idx)) {
>>  env->exception.fsr = 0x21;
>>  } else {
>>  env->exception.fsr = 0x1;
>> --
>> 2.7.0
>
> thanks
> -- PMM



Re: [Qemu-devel] Status of my hacks on the MTTCG WIP branch

2016-01-15 Thread alvise rigo
This problem could be related to a missing multi-threaded aware
translation of the atomic instructions.
I'm working on this missing piece, probably the next week I will
publish something.

Regards,
alvise

On Fri, Jan 15, 2016 at 3:24 PM, Pranith Kumar  wrote:
> Hi Alex,
>
> On Fri, Jan 15, 2016 at 8:53 AM, Alex Bennée  wrote:
>> Can you try this branch:
>>
>>
>> https://github.com/stsquad/qemu/tree/mttcg/multi_tcg_v8_wip_ajb_fix_locks-r1
>>
>> I think I've caught all the things likely to screw up addressing.
>>
>
> I tried this branch and the boot hangs like follows:
>
> [2.001083] random: systemd-udevd urandom read with 1 bits of entropy
> available
> main-loop: WARNING: I/O thread spun for 1000 iterations
> [   23.778970] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected
> by 0, t=2102 jiffies, g=-165, c=-166, q=83)
> [   23.780265] All QSes seen, last rcu_sched kthread activity 2101
> (4294939656-4294937555), jiffies_till_next_fqs=1, root ->qsmask 0x0
> [   23.781228] swapper/0   R  running task0 0  0
> 0x0080
> [   23.781977] Call trace:
> [   23.782375] [] dump_backtrace+0x0/0x170
> [   23.782852] [] show_stack+0x20/0x2c
> [   23.783279] [] sched_show_task+0x9c/0xf0
> [   23.783746] [] rcu_check_callbacks+0x7b8/0x828
> [   23.784230] [] update_process_times+0x40/0x74
> [   23.784723] [] tick_sched_handle.isra.15+0x38/0x7c
> [   23.785247] [] tick_sched_timer+0x48/0x84
> [   23.785705] [] __run_hrtimer+0x90/0x200
> [   23.786148] [] hrtimer_interrupt+0xec/0x268
> [   23.786612] [] arch_timer_handler_virt+0x38/0x48
> [   23.787120] [] handle_percpu_devid_irq+0x90/0x12c
> [   23.787621] [] generic_handle_irq+0x38/0x54
> [   23.788093] [] __handle_domain_irq+0x68/0xc4
> [   23.788578] [] gic_handle_irq+0x38/0x84
> [   23.789035] Exception stack(0xffc00073bde0 to 0xffc00073bf00)
> [   23.789650] bde0: 00738000 ffc0 0073e71c ffc0 0073bf20 ffc0
> 00086948 ffc0
> [   23.790356] be00: 000d848c ffc0   3ffcdb0c ffc0
>  0100
> [   23.791030] be20: 38b97100 ffc0 0073bea0 ffc0 67f6e000 0005
> 567f1c33 
> [   23.791744] be40: 00748cf0 ffc0 0073be70 ffc0 c1e2e4a0 ffbd
> 3a801148 ffc0
> [   23.792406] be60:  0040 0073e000 ffc0 3a801168 ffc0
> 97bbb588 007f
> [   23.793055] be80: 0021d7e8 ffc0 97b3d6ec 007f c37184d0 007f
> 00738000 ffc0
> [   23.793720] bea0: 0073e71c ffc0 006ff7e8 ffc0 007c8000 ffc0
> 0073e680 ffc0
> [   23.794373] bec0: 0072fac0 ffc0 0001  0073bf30 ffc0
> 0050e9e8 ffc0
> [   23.795025] bee0:   0073bf20 ffc0 00086944 ffc0
> 0073bf20 ffc0
> [   23.795721] [] el1_irq+0x64/0xc0
> [   23.796131] [] cpu_startup_entry+0x130/0x204
> [   23.796605] [] rest_init+0x78/0x84
> [   23.797028] [] start_kernel+0x3a0/0x3b8
> [   23.797528] rcu_sched kthread starved for 2101 jiffies!
>
>
> I will try to debug and see where it is hanging.
>
> Thanks!
> --
> Pranith



Re: [Qemu-devel] Status of my hacks on the MTTCG WIP branch

2016-01-15 Thread alvise rigo
On Fri, Jan 15, 2016 at 3:51 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> alvise rigo <a.r...@virtualopensystems.com> writes:
>
>> This problem could be related to a missing multi-threaded aware
>> translation of the atomic instructions.
>> I'm working on this missing piece, probably the next week I will
>> publish something.
>
> Maybe. We still have Fred's:
>
>   Use atomic cmpxchg to atomically check the exclusive value in a STREX
>
> Which I think papers over the cracks for both arm and aarch64 in MTTCG
> while not being as correct as your work.

Keep in mind that Linux on arm64 uses the LDXP/STXP instructions that
exist solely in aarch64.
These instructions are purely emulated now and can potentially write
128 bits of data in a non-atomic fashion.

Regards,
alvise

>
>>
>> Regards,
>> alvise
>>
>> On Fri, Jan 15, 2016 at 3:24 PM, Pranith Kumar <bobby.pr...@gmail.com> wrote:
>>> Hi Alex,
>>>
>>> On Fri, Jan 15, 2016 at 8:53 AM, Alex Bennée <alex.ben...@linaro.org> wrote:
>>>> Can you try this branch:
>>>>
>>>>
>>>> https://github.com/stsquad/qemu/tree/mttcg/multi_tcg_v8_wip_ajb_fix_locks-r1
>>>>
>>>> I think I've caught all the things likely to screw up addressing.
>>>>
>>>
>>> I tried this branch and the boot hangs like follows:
>>>
>>> [2.001083] random: systemd-udevd urandom read with 1 bits of entropy
>>> available
>>> main-loop: WARNING: I/O thread spun for 1000 iterations
>>> [   23.778970] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected
>>> by 0, t=2102 jiffies, g=-165, c=-166, q=83)
>
> This is just saying the kernel has been waiting for a while and nothing
> has happened.
>
>>> I will try to debug and see where it is hanging.
>
> If we knew what the kernel was waiting for that would be useful to know.
>
>>>
>>> Thanks!
>>> --
>>> Pranith
>
>
> --
> Alex Bennée



Re: [Qemu-devel] Status of my hacks on the MTTCG WIP branch

2016-01-15 Thread alvise rigo
On Fri, Jan 15, 2016 at 4:25 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> alvise rigo <a.r...@virtualopensystems.com> writes:
>
>> On Fri, Jan 15, 2016 at 3:51 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>>>
>>> alvise rigo <a.r...@virtualopensystems.com> writes:
>>>
>>>> This problem could be related to a missing multi-threaded aware
>>>> translation of the atomic instructions.
>>>> I'm working on this missing piece, probably the next week I will
>>>> publish something.
>>>
>>> Maybe. We still have Fred's:
>>>
>>>   Use atomic cmpxchg to atomically check the exclusive value in a STREX
>>>
>>> Which I think papers over the cracks for both arm and aarch64 in MTTCG
>>> while not being as correct as your work.
>>
>> Keep in mind that Linux on arm64 uses the LDXP/STXP instructions that
>> exist solely in aarch64.
>> These instructions are purely emulated now and can potentially write
>> 128 bits of data in a non-atomic fashion.
>
> Sure, but I doubt they are the reason for this hang as the kernel
> doesn't use them.

The kernel does use them for __cmpxchg_double in
arch/arm64/include/asm/atomic_ll_sc.h.
In any case, the normal exclusive instructions are also emulated in
target-arm/translate-a64.c.
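For reference, the sequence __cmpxchg_double relies on boils down to
something like the following (a from-memory sketch, not a verbatim copy of
the kernel source; barriers and prefetch hints are omitted):

/* Compare-and-swap a 128-bit pair as one quadword: both halves are loaded
 * with LDXP and stored with STXP under the same exclusive monitor.
 * Returns 1 if the exchange happened, 0 if either half differed. */
static inline int cmpxchg_double_sketch(unsigned long *ptr,
                                        unsigned long old1, unsigned long old2,
                                        unsigned long new1, unsigned long new2)
{
    unsigned long status, diff;

    asm volatile(
    "1: ldxp    %0, %1, %2\n"       /* exclusive load of the 128-bit pair */
    "   eor     %0, %0, %3\n"
    "   eor     %1, %1, %4\n"
    "   orr     %1, %0, %1\n"       /* non-zero if either half differs    */
    "   cbnz    %1, 2f\n"
    "   stxp    %w0, %5, %6, %2\n"  /* exclusive store of the new pair    */
    "   cbnz    %w0, 1b\n"          /* retry if the exclusive was lost    */
    "2:"
    : "=&r" (status), "=&r" (diff), "+Q" (*ptr)
    : "r" (old1), "r" (old2), "r" (new1), "r" (new2)
    : "memory");

    return diff == 0;
}

If the STXP there is emulated as two independent 64-bit stores, another vCPU
can observe (or clobber) a half-updated pair, which is exactly the
non-atomic 128-bit write problem mentioned above.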

alvise

>
>>
>> Regards,
>> alvise
>>
>>>
>>>>
>>>> Regards,
>>>> alvise
>>>>
>>>> On Fri, Jan 15, 2016 at 3:24 PM, Pranith Kumar <bobby.pr...@gmail.com> 
>>>> wrote:
>>>>> Hi Alex,
>>>>>
>>>>> On Fri, Jan 15, 2016 at 8:53 AM, Alex Bennée <alex.ben...@linaro.org> 
>>>>> wrote:
>>>>>> Can you try this branch:
>>>>>>
>>>>>>
>>>>>> https://github.com/stsquad/qemu/tree/mttcg/multi_tcg_v8_wip_ajb_fix_locks-r1
>>>>>>
>>>>>> I think I've caught all the things likely to screw up addressing.
>>>>>>
>>>>>
>>>>> I tried this branch and the boot hangs like follows:
>>>>>
>>>>> [2.001083] random: systemd-udevd urandom read with 1 bits of entropy
>>>>> available
>>>>> main-loop: WARNING: I/O thread spun for 1000 iterations
>>>>> [   23.778970] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected
>>>>> by 0, t=2102 jiffies, g=-165, c=-166, q=83)
>>>
>>> This is just saying the kernel has been waiting for a while and nothing
>>> has happened.
>>>
>>>>> I will try to debug and see where it is hanging.
>>>
>>> If we knew what the kernel was waiting for that would be useful to know.
>>>
>>>>>
>>>>> Thanks!
>>>>> --
>>>>> Pranith
>>>
>>>
>>> --
>>> Alex Bennée
>
>
> --
> Alex Bennée



[Qemu-devel] [PATCH] target-arm: fix MMU index in arm_cpu_do_unaligned_access

2016-01-15 Thread Alvise Rigo
arm_regime_using_lpae_format checks whether the LPAE extension is used
for stage 1 translation regimes. MMU indexes not exclusively of a stage 1
regime won't work with this method.

In case of ARMMMUIdx_S12NSE0 or ARMMMUIdx_S12NSE1, offset these values
by ARMMMUIdx_S1NSE0 to get the right index indicating a stage 1
translation regime.

Rename also the function to arm_s1_regime_using_lpae_format and update
the comments to reflect the change.

Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
---
 target-arm/helper.c| 8 
 target-arm/internals.h | 5 +++--
 target-arm/op_helper.c | 8 ++--
 3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/target-arm/helper.c b/target-arm/helper.c
index 59d5a41..8317ff5 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -5996,10 +5996,10 @@ static inline bool regime_using_lpae_format(CPUARMState 
*env,
 return false;
 }
 
-/* Returns true if the translation regime is using LPAE format page tables.
- * Used when raising alignment exceptions, whose FSR changes depending on
- * whether the long or short descriptor format is in use. */
-bool arm_regime_using_lpae_format(CPUARMState *env, ARMMMUIdx mmu_idx)
+/* Returns true if the stage 1 translation regime is using LPAE format page
+ * tables. Used when raising alignment exceptions, whose FSR changes depending
+ * on whether the long or short descriptor format is in use. */
+bool arm_s1_regime_using_lpae_format(CPUARMState *env, ARMMMUIdx mmu_idx)
 {
 return regime_using_lpae_format(env, mmu_idx);
 }
diff --git a/target-arm/internals.h b/target-arm/internals.h
index b925aaa..d226bbe 100644
--- a/target-arm/internals.h
+++ b/target-arm/internals.h
@@ -441,8 +441,9 @@ struct ARMMMUFaultInfo {
 bool arm_tlb_fill(CPUState *cpu, vaddr address, int rw, int mmu_idx,
   uint32_t *fsr, ARMMMUFaultInfo *fi);
 
-/* Return true if the translation regime is using LPAE format page tables */
-bool arm_regime_using_lpae_format(CPUARMState *env, ARMMMUIdx mmu_idx);
+/* Return true if the stage 1 translation regime is using LPAE format page
+ * tables */
+bool arm_s1_regime_using_lpae_format(CPUARMState *env, ARMMMUIdx mmu_idx);
 
 /* Raise a data fault alignment exception for the specified virtual address */
 void arm_cpu_do_unaligned_access(CPUState *cs, vaddr vaddr, int is_write,
diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
index e42d287..ccc505d 100644
--- a/target-arm/op_helper.c
+++ b/target-arm/op_helper.c
@@ -133,7 +133,7 @@ void arm_cpu_do_unaligned_access(CPUState *cs, vaddr vaddr, 
int is_write,
 {
 ARMCPU *cpu = ARM_CPU(cs);
CPUARMState *env = &cpu->env;
-int target_el;
+int target_el, mmu_idx;
 bool same_el;
 
 if (retaddr) {
@@ -146,10 +146,14 @@ void arm_cpu_do_unaligned_access(CPUState *cs, vaddr 
vaddr, int is_write,
 
 env->exception.vaddress = vaddr;
 
+mmu_idx = cpu_mmu_index(env, false);
+if (mmu_idx == ARMMMUIdx_S12NSE0 || mmu_idx == ARMMMUIdx_S12NSE1) {
+mmu_idx += ARMMMUIdx_S1NSE0;
+}
 /* the DFSR for an alignment fault depends on whether we're using
  * the LPAE long descriptor format, or the short descriptor format
  */
-if (arm_regime_using_lpae_format(env, cpu_mmu_index(env, false))) {
+if (arm_s1_regime_using_lpae_format(env, mmu_idx)) {
 env->exception.fsr = 0x21;
 } else {
 env->exception.fsr = 0x1;
-- 
2.7.0




Re: [Qemu-devel] [PATCH v2] target-arm: raise exception on misaligned LDREX operands

2016-01-14 Thread alvise rigo
Forcing an unaligned LDREX access in aarch32 makes QEMU hit the following assert:
target-arm/helper.c:5921:regime_el: code should not be reached

Running this snippet both baremetal and on top of Linux will trigger
the problem:

static inline int cmpxchg(volatile void *ptr, unsigned int old,
                          unsigned int new)
{
    unsigned int oldval, res;

    do {
        asm volatile("@ __cmpxchg4\n"
        "ldrex   %1, [%2]\n"
        "mov     %0, #0\n"
        "teq     %1, %3\n"
        "strexeq %0, %4, [%2]\n"
        : "=&r" (res), "=&r" (oldval)
        : "r" (ptr), "Ir" (old), "r" (new)
        : "memory", "cc");
    } while (res);

    return oldval;
}

void main(void)
{
    int arr[2] = {0, 0};
    int *ptr = (int *)(((void *)&arr) + 1);

    cmpxchg(ptr, 0, 0xbeef);
}

On Thu, Dec 3, 2015 at 7:36 PM, Andrew Baumann
 wrote:
> Qemu does not generally perform alignment checks. However, the ARM ARM
> requires implementation of alignment exceptions for a number of cases
> including LDREX, and Windows-on-ARM relies on this.
>
> This change adds plumbing to enable alignment checks on loads using
> MO_ALIGN, a do_unaligned_access hook to raise the exception (data
> abort), and uses the new aligned loads in LDREX (for all but
> single-byte loads).
>
> Signed-off-by: Andrew Baumann 
> ---
> Thanks for the feedback on v1! I wish I had known about (or gone
> looking for) MO_ALIGN sooner...
>
> arm_regime_using_lpae_format() is a no-op wrapper I added to export
> regime_using_lpae_format (which is a static inline). Would it be
> preferable to simply export the existing function, and rename it? If
> so, is this still the correct name to use for the function?
>
>  target-arm/cpu.c   |  1 +
>  target-arm/helper.c|  8 
>  target-arm/internals.h |  7 +++
>  target-arm/op_helper.c | 35 ++-
>  target-arm/translate.c | 11 +++
>  5 files changed, 57 insertions(+), 5 deletions(-)
>
> diff --git a/target-arm/cpu.c b/target-arm/cpu.c
> index 30739fc..35a1f12 100644
> --- a/target-arm/cpu.c
> +++ b/target-arm/cpu.c
> @@ -1417,6 +1417,7 @@ static void arm_cpu_class_init(ObjectClass *oc, void 
> *data)
>  cc->handle_mmu_fault = arm_cpu_handle_mmu_fault;
>  #else
>  cc->do_interrupt = arm_cpu_do_interrupt;
> +cc->do_unaligned_access = arm_cpu_do_unaligned_access;
>  cc->get_phys_page_debug = arm_cpu_get_phys_page_debug;
>  cc->vmsd = _arm_cpu;
>  cc->virtio_is_big_endian = arm_cpu_is_big_endian;
> diff --git a/target-arm/helper.c b/target-arm/helper.c
> index afc4163..59d5a41 100644
> --- a/target-arm/helper.c
> +++ b/target-arm/helper.c
> @@ -5996,6 +5996,14 @@ static inline bool 
> regime_using_lpae_format(CPUARMState *env,
>  return false;
>  }
>
> +/* Returns true if the translation regime is using LPAE format page tables.
> + * Used when raising alignment exceptions, whose FSR changes depending on
> + * whether the long or short descriptor format is in use. */
> +bool arm_regime_using_lpae_format(CPUARMState *env, ARMMMUIdx mmu_idx)
> +{
> +return regime_using_lpae_format(env, mmu_idx);
> +}
> +
>  static inline bool regime_is_user(CPUARMState *env, ARMMMUIdx mmu_idx)
>  {
>  switch (mmu_idx) {
> diff --git a/target-arm/internals.h b/target-arm/internals.h
> index 347998c..b925aaa 100644
> --- a/target-arm/internals.h
> +++ b/target-arm/internals.h
> @@ -441,4 +441,11 @@ struct ARMMMUFaultInfo {
>  bool arm_tlb_fill(CPUState *cpu, vaddr address, int rw, int mmu_idx,
>uint32_t *fsr, ARMMMUFaultInfo *fi);
>
> +/* Return true if the translation regime is using LPAE format page tables */
> +bool arm_regime_using_lpae_format(CPUARMState *env, ARMMMUIdx mmu_idx);
> +
> +/* Raise a data fault alignment exception for the specified virtual address 
> */
> +void arm_cpu_do_unaligned_access(CPUState *cs, vaddr vaddr, int is_write,
> + int is_user, uintptr_t retaddr);
> +
>  #endif
> diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
> index 6cd54c8..c6995ca 100644
> --- a/target-arm/op_helper.c
> +++ b/target-arm/op_helper.c
> @@ -126,7 +126,40 @@ void tlb_fill(CPUState *cs, target_ulong addr, int 
> is_write, int mmu_idx,
>  raise_exception(env, exc, syn, target_el);
>  }
>  }
> -#endif
> +
> +/* Raise a data fault alignment exception for the specified virtual address 
> */
> +void arm_cpu_do_unaligned_access(CPUState *cs, vaddr vaddr, int is_write,
> + int is_user, uintptr_t retaddr)
> +{
> +ARMCPU *cpu = ARM_CPU(cs);
> +CPUARMState *env = &cpu->env;
> +int target_el;
> +bool same_el;
> +
> +if (retaddr) {
> +/* now we have a real cpu fault */
> +cpu_restore_state(cs, retaddr);
> +}
> +
> +target_el = exception_target_el(env);
> +same_el = 

Re: [Qemu-devel] [RFC v6 11/14] softmmu: Simplify helper_*_st_name, wrap MMIO code

2016-01-11 Thread alvise rigo
On Mon, Jan 11, 2016 at 10:54 AM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> Attempting to simplify the helper_*_st_name, wrap the MMIO code into an
>> inline function.
>>
>> Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  softmmu_template.h | 64 
>> +-
>>  1 file changed, 44 insertions(+), 20 deletions(-)
>>
>> diff --git a/softmmu_template.h b/softmmu_template.h
>> index 92f92b1..2ebf527 100644
>> --- a/softmmu_template.h
>> +++ b/softmmu_template.h
>> @@ -396,6 +396,26 @@ static inline void glue(helper_le_st_name, 
>> _do_unl_access)(CPUArchState *env,
>>  }
>>  }
>>
>> +static inline void glue(helper_le_st_name, _do_mmio_access)(CPUArchState 
>> *env,
>> +DATA_TYPE val,
>> +target_ulong 
>> addr,
>> +TCGMemOpIdx oi,
>> +unsigned 
>> mmu_idx,
>> +int index,
>> +uintptr_t 
>> retaddr)
>> +{
>> +CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
>> +
>> +if ((addr & (DATA_SIZE - 1)) != 0) {
>> +glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
>> +oi, retaddr);
>> +}
>> +/* ??? Note that the io helpers always read data in the target
>> +   byte ordering.  We should push the LE/BE request down into io.  */
>> +val = TGT_LE(val);
>> +glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>> +}
>> +
>
> Some comment as previous patches. I think we can have a single function
> that is shared between both helpers.

Of course. If the objdump you got from this version and the one with a
single helper are basically the same, then there's no reason to make
two distinct variants.

Thank you,
alvise

>
>>  void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>> TCGMemOpIdx oi, uintptr_t retaddr)
>>  {
>> @@ -458,16 +478,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong 
>> addr, DATA_TYPE val,
>>
>>  return;
>>  } else {
>> -if ((addr & (DATA_SIZE - 1)) != 0) {
>> -glue(helper_le_st_name, _do_unl_access)(env, val, addr, 
>> mmu_idx,
>> -oi, retaddr);
>> -}
>> -iotlbentry = >iotlb[mmu_idx][index];
>> -
>> -/* ??? Note that the io helpers always read data in the target
>> -   byte ordering.  We should push the LE/BE request down into 
>> io.  */
>> -val = TGT_LE(val);
>> -glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>> +glue(helper_le_st_name, _do_mmio_access)(env, val, addr, oi,
>> + mmu_idx, index, 
>> retaddr);
>>  return;
>>  }
>>  }
>> @@ -523,6 +535,26 @@ static inline void glue(helper_be_st_name, 
>> _do_unl_access)(CPUArchState *env,
>>  }
>>  }
>>
>> +static inline void glue(helper_be_st_name, _do_mmio_access)(CPUArchState 
>> *env,
>> +DATA_TYPE val,
>> +target_ulong 
>> addr,
>> +TCGMemOpIdx oi,
>> +unsigned 
>> mmu_idx,
>> +int index,
>> +uintptr_t 
>> retaddr)
>> +{
>> +CPUIOTLBEntry *iotlbentry = >iotlb[mmu_idx][index];
>> +
>> +if ((addr & (DATA_SIZE - 1)) != 0) {
>> +glue(helper_be_st_name, _do_unl_access)(env, val, addr, mmu_idx,
>> +oi, retaddr);
>> +}
>> +/* ??? Note that the io helpers always 

Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2016-01-07 Thread alvise rigo
On Thu, Jan 7, 2016 at 11:22 AM, Peter Maydell <peter.mayd...@linaro.org> wrote:
> On 7 January 2016 at 10:21, alvise rigo <a.r...@virtualopensystems.com> wrote:
>> Hi,
>>
>> On Wed, Jan 6, 2016 at 7:00 PM, Andrew Baumann
>> <andrew.baum...@microsoft.com> wrote:
>>> As a heads up, we just added support for alignment checks in LDREX:
>>> https://github.com/qemu/qemu/commit/30901475b91ef1f46304404ab4bfe89097f61b96
>
>> It should be, if we add an aligned variant for each of the exclusive helpers.
>> BTW, why don't we also make the check for the STREX instruction?
>
> Andrew's patch only changed the bits Windows cares about, I think.
> We should indeed extend this to cover also STREX and the A64 instructions
> as well, I think.

The alignment check is easily doable in general. The only tricky part
I found is the A64 STXP instruction, which requires quadword alignment
for the 64-bit paired access.
In that case, the translation of the instruction will rely on an
aarch64-only helper. The alternative solution would be to extend
softmmu_template.h to generate 128-bit accesses, but I don't believe
this is the right way to go.

Regards,
alvise

>
> thanks
> -- PMM
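
For illustration, a minimal standalone sketch (not the actual QEMU
helper; the function name is made up) of the kind of check an
aarch64-only STXP helper would need: the paired 64-bit exclusive store
transfers 16 bytes, so the effective address must be quadword aligned.

#include <stdbool.h>
#include <stdint.h>

/* STXP writes two 64-bit registers, i.e. a 16-byte transfer, so the
 * effective address has to be aligned on the full quadword. */
static inline bool stxp_quadword_aligned(uint64_t addr)
{
    return (addr & 0xf) == 0;
}

In the real helper this check would of course raise the alignment
exception instead of returning a bool.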



Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2016-01-07 Thread alvise rigo
Hi,

On Wed, Jan 6, 2016 at 7:00 PM, Andrew Baumann
<andrew.baum...@microsoft.com> wrote:
>
> Hi,
>
> > From: qemu-devel-bounces+andrew.baumann=microsoft@nongnu.org
> > [mailto:qemu-devel-
> > bounces+andrew.baumann=microsoft....@nongnu.org] On Behalf Of
> > Alvise Rigo
> > Sent: Monday, 14 December 2015 00:41
> >
> > This is the sixth iteration of the patch series which applies to the
> > upstream branch of QEMU (v2.5.0-rc3).
> >
> > Changes versus previous versions are at the bottom of this cover letter.
> >
> > The code is also available at following repository:
> > https://git.virtualopensystems.com/dev/qemu-mt.git
> > branch:
> > slowpath-for-atomic-v6-no-mttcg
> >
> > This patch series provides an infrastructure for atomic instruction
> > implementation in QEMU, thus offering a 'legacy' solution for
> > translating guest atomic instructions. Moreover, it can be considered as
> > a first step toward a multi-thread TCG.
> >
> > The underlying idea is to provide new TCG helpers (sort of softmmu
> > helpers) that guarantee atomicity to some memory accesses or in general
> > a way to define memory transactions.
> >
> > More specifically, the new softmmu helpers behave as LoadLink and
> > StoreConditional instructions, and are called from TCG code by means of
> > target specific helpers. This work includes the implementation for all
> > the ARM atomic instructions, see target-arm/op_helper.c.
>
> As a heads up, we just added support for alignment checks in LDREX:
> https://github.com/qemu/qemu/commit/30901475b91ef1f46304404ab4bfe89097f61b96

Thank you for the update.

>
> Hopefully it is an easy change to ensure that the same check happens for the 
> relevant loads when CONFIG_TCG_USE_LDST_EXCL is enabled?

It should be, if we add an aligned variant for each of the exclusive helpers.
BTW, why don't we also make the check for the STREX instruction?

Regards,
alvise

>
> Thanks,
> Andrew



Re: [Qemu-devel] [RFC v6 10/14] softmmu: Simplify helper_*_st_name, wrap unaligned code

2016-01-07 Thread alvise rigo
On Thu, Jan 7, 2016 at 3:46 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> Attempting to simplify the helper_*_st_name, wrap the
>> do_unaligned_access code into an inline function.
>> Remove also the goto statement.
>
> As I said in the other thread I think these sort of clean-ups can come
> before the ll/sc implementations and potentially get merged ahead of the
> rest of it.
>
>>
>> Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  softmmu_template.h | 96 
>> ++
>>  1 file changed, 60 insertions(+), 36 deletions(-)
>>
>> diff --git a/softmmu_template.h b/softmmu_template.h
>> index d3d5902..92f92b1 100644
>> --- a/softmmu_template.h
>> +++ b/softmmu_template.h
>> @@ -370,6 +370,32 @@ static inline void glue(io_write, SUFFIX)(CPUArchState 
>> *env,
>>   iotlbentry->attrs);
>>  }
>>
>> +static inline void glue(helper_le_st_name, _do_unl_access)(CPUArchState 
>> *env,
>> +   DATA_TYPE val,
>> +   target_ulong 
>> addr,
>> +   TCGMemOpIdx oi,
>> +   unsigned mmu_idx,
>> +   uintptr_t 
>> retaddr)
>> +{
>> +int i;
>> +
>> +if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>> +cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>> + mmu_idx, retaddr);
>> +}
>> +/* XXX: not efficient, but simple */
>> +/* Note: relies on the fact that tlb_fill() does not remove the
>> + * previous page from the TLB cache.  */
>> +for (i = DATA_SIZE - 1; i >= 0; i--) {
>> +/* Little-endian extract.  */
>> +uint8_t val8 = val >> (i * 8);
>> +/* Note the adjustment at the beginning of the function.
>> +   Undo that for the recursion.  */
>> +glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>> +oi, retaddr + GETPC_ADJ);
>> +}
>> +}
>
> There is still duplication of 99% of the code here which is silly given

Then why should we keep this template-like design in the first place?
I kept the two variants separate for performance reasons (otherwise how
could we justify having two almost identical versions of the helper?),
while still trying to make the code more compact and readable.

> the compiler inlines the code anyway. If we gave the helper a more
> generic name and passed the endianess in via args I would hope the
> compiler did the sensible thing and constant fold the code. Something
> like:
>
> static inline void glue(helper_generic_st_name, _do_unl_access)
> (CPUArchState *env,
> bool little_endian,
> DATA_TYPE val,
> target_ulong addr,
> TCGMemOpIdx oi,
> unsigned mmu_idx,
> uintptr_t retaddr)
> {
> int i;
>
> if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>  mmu_idx, retaddr);
> }
> /* Note: relies on the fact that tlb_fill() does not remove the
>  * previous page from the TLB cache.  */
> for (i = DATA_SIZE - 1; i >= 0; i--) {
> if (little_endian) {

little_endian will have the same value more than 99% of the time; does
it make sense to have a branch here?

Thank you,
alvise
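
As an aside, a standalone sketch (not QEMU code) of why the branch is
usually free: when the always-inline helper is called with a
compile-time constant flag, the compiler folds the branch away, so one
generic helper ends up generating the same code as two specialised
copies.

#include <stdint.h>

static inline __attribute__((always_inline))
uint8_t extract_byte(uint64_t val, int i, int size, int little_endian)
{
    if (little_endian) {
        return val >> (i * 8);                  /* little-endian extract */
    } else {
        return val >> (((size - 1) - i) * 8);   /* big-endian extract */
    }
}

/* Both callers pass a constant flag, so no branch survives in either. */
uint8_t le_byte(uint64_t v, int i) { return extract_byte(v, i, 8, 1); }
uint8_t be_byte(uint64_t v, int i) { return extract_byte(v, i, 8, 0); }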

> /* Little-endian extract.  */
> uint8_t val8 = val >> (i * 8);
> } else {
> /* Big-endian extract.  */
> uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
> }
> /* Note the adjustment at the beginning of the function.
>Undo that for the recursion.  */
> glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> oi, retaddr + GETPC_ADJ);
> }
> }
>
>
>> +
>>  void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>> TCGMemOpIdx oi, uintptr_t retaddr)
>>  {

Re: [Qemu-devel] [RFC v6 08/14] target-arm: Add atomic_clear helper for CLREX insn

2016-01-06 Thread alvise rigo
On Wed, Jan 6, 2016 at 6:13 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> Add a simple helper function to emulate the CLREX instruction.
>
> And now I see ;-)
>
> I suspect this should be merged with the other helpers as a generic helper.

Agreed.

>
>>
>> Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  target-arm/helper.h| 2 ++
>>  target-arm/op_helper.c | 6 ++
>>  target-arm/translate.c | 1 +
>>  3 files changed, 9 insertions(+)
>>
>> diff --git a/target-arm/helper.h b/target-arm/helper.h
>> index c2a85c7..37cec49 100644
>> --- a/target-arm/helper.h
>> +++ b/target-arm/helper.h
>> @@ -532,6 +532,8 @@ DEF_HELPER_2(dc_zva, void, env, i64)
>>  DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
>>  DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
>>
>> +DEF_HELPER_1(atomic_clear, void, env)
>> +
>>  #ifdef TARGET_AARCH64
>>  #include "helper-a64.h"
>>  #endif
>> diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
>> index 6cd54c8..5a67557 100644
>> --- a/target-arm/op_helper.c
>> +++ b/target-arm/op_helper.c
>> @@ -50,6 +50,12 @@ static int exception_target_el(CPUARMState *env)
>>  return target_el;
>>  }
>>
>> +void HELPER(atomic_clear)(CPUARMState *env)
>> +{
>> +ENV_GET_CPU(env)->excl_protected_range.begin = -1;
>
> Is there not a defined reset value EXCLUSIVE_RESET_ADDR we should use here?

Yes, I will move the EXCLUSIVE_RESET_ADDR definition somewhere else in
order to include it in this file.

>
>> +ENV_GET_CPU(env)->ll_sc_context = false;
>> +}
>> +
>>  uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
>>uint32_t rn, uint32_t maxindex)
>>  {
>> diff --git a/target-arm/translate.c b/target-arm/translate.c
>> index e88d8a3..e0362e0 100644
>> --- a/target-arm/translate.c
>> +++ b/target-arm/translate.c
>> @@ -7514,6 +7514,7 @@ static void gen_load_exclusive(DisasContext *s, int 
>> rt, int rt2,
>>  static void gen_clrex(DisasContext *s)
>>  {
>>  #ifdef CONFIG_TCG_USE_LDST_EXCL
>> +gen_helper_atomic_clear(cpu_env);
>>  #else
>>  tcg_gen_movi_i64(cpu_exclusive_addr, -1);
>>  #endif
>
>
> --
> Alex Bennée



Re: [Qemu-devel] [RFC v6 02/14] softmmu: Add new TLB_EXCL flag

2016-01-05 Thread alvise rigo
On Tue, Jan 5, 2016 at 5:10 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> Add a new TLB flag to force all the accesses made to a page to follow
>> the slow-path.
>>
>> In the case we remove a TLB entry marked as EXCL, we unset the
>> corresponding exclusive bit in the bitmap.
>>
>> Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  cputlb.c|  38 +++-
>>  include/exec/cpu-all.h  |   8 
>>  include/exec/cpu-defs.h |   1 +
>>  include/qom/cpu.h   |  14 ++
>>  softmmu_template.h  | 114 
>> ++--
>>  5 files changed, 152 insertions(+), 23 deletions(-)
>>
>> diff --git a/cputlb.c b/cputlb.c
>> index bf1d50a..7ee0c89 100644
>> --- a/cputlb.c
>> +++ b/cputlb.c
>> @@ -394,6 +394,16 @@ void tlb_set_page_with_attrs(CPUState *cpu, 
>> target_ulong vaddr,
>>  env->tlb_v_table[mmu_idx][vidx] = *te;
>>  env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
>>
>> +if (unlikely(!(te->addr_write & TLB_MMIO) && (te->addr_write &
>> TLB_EXCL))) {
>
> Why do we care about TLB_MMIO flags here? Does it actually happen? Would
> bad things happen if we enforced exclusivity for an MMIO write? Do the
> other flags matter?

In the previous version of the patch series it turned out that accesses
to MMIO regions have to be supported since, for instance, the GDB stub
relies on them.
The last two patches actually finalize the MMIO support.

>
> There should be a comment as to why MMIO is mentioned I think.

OK.

>
>> +/* We are removing an exclusive entry, set the page to dirty. This
>> + * is not be necessary if the vCPU has performed both SC and LL. */
>> +hwaddr hw_addr = (env->iotlb[mmu_idx][index].addr & 
>> TARGET_PAGE_MASK) +
>> +  (te->addr_write & 
>> TARGET_PAGE_MASK);
>> +if (!cpu->ll_sc_context) {
>> +cpu_physical_memory_unset_excl(hw_addr, cpu->cpu_index);
>> +}
>> +}
>> +
>>  /* refill the tlb */
>>  env->iotlb[mmu_idx][index].addr = iotlb - vaddr;
>>  env->iotlb[mmu_idx][index].attrs = attrs;
>> @@ -419,7 +429,15 @@ void tlb_set_page_with_attrs(CPUState *cpu, 
>> target_ulong vaddr,
>> + xlat)) {
>>  te->addr_write = address | TLB_NOTDIRTY;
>>  } else {
>> -te->addr_write = address;
>> +if (!(address & TLB_MMIO) &&
>> +cpu_physical_memory_atleast_one_excl(section->mr->ram_addr
>> +   + xlat)) {
>> +/* There is at least one vCPU that has flagged the address 
>> as
>> + * exclusive. */
>> +te->addr_write = address | TLB_EXCL;
>> +} else {
>> +te->addr_write = address;
>> +}
>>  }
>>  } else {
>>  te->addr_write = -1;
>> @@ -471,6 +489,24 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, 
>> target_ulong addr)
>>  return qemu_ram_addr_from_host_nofail(p);
>>  }
>>
>> +/* For every vCPU compare the exclusive address and reset it in case of a
>> + * match. Since only one vCPU is running at once, no lock has to be held to
>> + * guard this operation. */
>> +static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
>> +{
>> +CPUState *cpu;
>> +
>> +CPU_FOREACH(cpu) {
>> +if (cpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
>> +ranges_overlap(cpu->excl_protected_range.begin,
>> +   cpu->excl_protected_range.end -
>> +   cpu->excl_protected_range.begin,
>> +   addr, size)) {
>> +cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
>> +}
>> +}
>> +}
>> +
>>  #define MMUSUFFIX _mmu
>>
>>  #define SHIFT 0
>> diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
>> index 83b1781..f8d8feb 100644
>> --- a/include/exec/cpu-all.h
>> +++ b/include/exec/cpu-all.h
>> @@ -277,6

Re: [Qemu-devel] [RFC v6 01/14] exec.c: Add new exclusive bitmap to ram_list

2015-12-18 Thread alvise rigo
On Fri, Dec 18, 2015 at 2:18 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> The purpose of this new bitmap is to flag the memory pages that are in
>> the middle of LL/SC operations (after a LL, before a SC) on a per-vCPU
>> basis.
>> For all these pages, the corresponding TLB entries will be generated
>> in such a way to force the slow-path if at least one vCPU has the bit
>> not set.
>> When the system starts, the whole memory is dirty (all the bitmap is
>> set). A page, after being marked as exclusively-clean, will be
>> restored as dirty after the SC.
>>
>> For each page we keep 8 bits to be shared among all the vCPUs available
>> in the system. In general, bit n % 8 corresponds to vCPU n.
>>
>> Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  exec.c  |  8 --
>>  include/exec/memory.h   |  3 +-
>>  include/exec/ram_addr.h | 76 
>> +
>>  3 files changed, 84 insertions(+), 3 deletions(-)
>>
>> diff --git a/exec.c b/exec.c
>> index 0bf0a6e..e66d232 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -1548,11 +1548,15 @@ static ram_addr_t ram_block_add(RAMBlock *new_block, 
>> Error **errp)
>>  int i;
>>
>>  /* ram_list.dirty_memory[] is protected by the iothread lock.  */
>> -for (i = 0; i < DIRTY_MEMORY_NUM; i++) {
>> +for (i = 0; i < DIRTY_MEMORY_EXCLUSIVE; i++) {
>>  ram_list.dirty_memory[i] =
>>  bitmap_zero_extend(ram_list.dirty_memory[i],
>> old_ram_size, new_ram_size);
>> -   }
>> +}
>> +ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE] = bitmap_zero_extend(
>> +ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
>> +old_ram_size * EXCL_BITMAP_CELL_SZ,
>> +new_ram_size * EXCL_BITMAP_CELL_SZ);
>>  }
>
> I'm wondering is old/new_ram_size should be renamed to
> old/new_ram_pages?

Yes, I think it makes more sense.

>
> So as I understand it we have created a bitmap which has
> EXCL_BITMAP_CELL_SZ bits per page.
>
>>  cpu_physical_memory_set_dirty_range(new_block->offset,
>>  new_block->used_length,
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 0f07159..2782c77 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -19,7 +19,8 @@
>>  #define DIRTY_MEMORY_VGA   0
>>  #define DIRTY_MEMORY_CODE  1
>>  #define DIRTY_MEMORY_MIGRATION 2
>> -#define DIRTY_MEMORY_NUM   3/* num of dirty bits */
>> +#define DIRTY_MEMORY_EXCLUSIVE 3
>> +#define DIRTY_MEMORY_NUM   4/* num of dirty bits */
>>
>>  #include 
>>  #include 
>> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
>> index 7115154..b48af27 100644
>> --- a/include/exec/ram_addr.h
>> +++ b/include/exec/ram_addr.h
>> @@ -21,6 +21,7 @@
>>
>>  #ifndef CONFIG_USER_ONLY
>>  #include "hw/xen/xen.h"
>> +#include "sysemu/sysemu.h"
>>
>>  struct RAMBlock {
>>  struct rcu_head rcu;
>> @@ -82,6 +83,13 @@ int qemu_ram_resize(ram_addr_t base, ram_addr_t newsize, 
>> Error **errp);
>>  #define DIRTY_CLIENTS_ALL ((1 << DIRTY_MEMORY_NUM) - 1)
>>  #define DIRTY_CLIENTS_NOCODE  (DIRTY_CLIENTS_ALL & ~(1 << 
>> DIRTY_MEMORY_CODE))
>>
>> +/* Exclusive bitmap support. */
>> +#define EXCL_BITMAP_CELL_SZ 8
>> +#define EXCL_BITMAP_GET_BIT_OFFSET(addr) \
>> +(EXCL_BITMAP_CELL_SZ * (addr >> TARGET_PAGE_BITS))
>> +#define EXCL_BITMAP_GET_BYTE_OFFSET(addr) (addr >> TARGET_PAGE_BITS)
>> +#define EXCL_IDX(cpu) (cpu % EXCL_BITMAP_CELL_SZ)
>> +
>
> I think some of the explanation of what CELL_SZ means from your commit
> message needs to go here.

OK.
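
For what it's worth, a small standalone illustration (not QEMU code;
TARGET_PAGE_BITS is passed as a parameter here) of how the quoted
macros map a (ram address, vCPU) pair to a bit of the
DIRTY_MEMORY_EXCLUSIVE bitmap: each page owns a cell of
EXCL_BITMAP_CELL_SZ bits, and vCPU n uses bit n % 8 of that cell.

#include <stdint.h>

#define EXCL_BITMAP_CELL_SZ 8

static inline uint64_t excl_bit_index(uint64_t addr, unsigned cpu_index,
                                      unsigned target_page_bits)
{
    uint64_t page = addr >> target_page_bits;   /* page number */

    /* the cell for this page starts at bit EXCL_BITMAP_CELL_SZ * page,
     * and cpu_index selects the bit inside the cell */
    return page * EXCL_BITMAP_CELL_SZ + (cpu_index % EXCL_BITMAP_CELL_SZ);
}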

>
>>  static inline bool cpu_physical_memory_get_dirty(ram_addr_t start,
>>   ram_addr_t length,
>>   unsigned client)
>> @@ -173,6 +181,11 @@ static inline void 
>> cpu_physical_memory_set_dirty_range(ram_addr_t start,
>>  if (unlikely(mask & (1 << DIRTY_MEMORY_CODE))) {
>>  bitmap

Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2015-12-17 Thread alvise rigo
Hi Alex,

On Thu, Dec 17, 2015 at 5:06 PM, Alex Bennée <alex.ben...@linaro.org> wrote:

>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
> > This is the sixth iteration of the patch series which applies to the
> > upstream branch of QEMU (v2.5.0-rc3).
> >
> > Changes versus previous versions are at the bottom of this cover letter.
> >
> > The code is also available at following repository:
> > https://git.virtualopensystems.com/dev/qemu-mt.git
> > branch:
> > slowpath-for-atomic-v6-no-mttcg
>
> I'm starting to look through this now. However one problem that
>

Thank you for this.


> immediately comes up is the aarch64 breakage. Because there is an
> intrinsic link between a lot of the arm and aarch64 code it breaks the
> other targets.
>
> You could fix this by ensuring that CONFIG_TCG_USE_LDST_EXCL doesn't get
> passed to the aarch64 build (tricky as aarch64-softmmu.mak includes
> arm-softmmu.mak) or bite the bullet now and add the 64 bit helpers that
> will be needed to convert the aarch64 exclusive equivalents.
>

This is what I'm doing right now :)

Best regards,
alvise


>
> >
> > This patch series provides an infrastructure for atomic instruction
> > implementation in QEMU, thus offering a 'legacy' solution for
> > translating guest atomic instructions. Moreover, it can be considered as
> > a first step toward a multi-thread TCG.
> >
> > The underlying idea is to provide new TCG helpers (sort of softmmu
> > helpers) that guarantee atomicity to some memory accesses or in general
> > a way to define memory transactions.
> >
> > More specifically, the new softmmu helpers behave as LoadLink and
> > StoreConditional instructions, and are called from TCG code by means of
> > target specific helpers. This work includes the implementation for all
> > the ARM atomic instructions, see target-arm/op_helper.c.
> >
> > The implementation heavily uses the software TLB together with a new
> > bitmap that has been added to the ram_list structure which flags, on a
> > per-CPU basis, all the memory pages that are in the middle of a LoadLink
> > (LL), StoreConditional (SC) operation.  Since all these pages can be
> > accessed directly through the fast-path and alter a vCPU's linked value,
> > the new bitmap has been coupled with a new TLB flag for the TLB virtual
> > address which forces the slow-path execution for all the accesses to a
> > page containing a linked address.
> >
> > The new slow-path is implemented such that:
> > - the LL behaves as a normal load slow-path, except for clearing the
> >   dirty flag in the bitmap.  The cputlb.c code while generating a TLB
> >   entry, checks if there is at least one vCPU that has the bit cleared
> >   in the exclusive bitmap, it that case the TLB entry will have the EXCL
> >   flag set, thus forcing the slow-path.  In order to ensure that all the
> >   vCPUs will follow the slow-path for that page, we flush the TLB cache
> >   of all the other vCPUs.
> >
> >   The LL will also set the linked address and size of the access in a
> >   vCPU's private variable. After the corresponding SC, this address will
> >   be set to a reset value.
> >
> > - the SC can fail returning 1, or succeed, returning 0.  It has to come
> >   always after a LL and has to access the same address 'linked' by the
> >   previous LL, otherwise it will fail. If in the time window delimited
> >   by a legit pair of LL/SC operations another write access happens to
> >   the linked address, the SC will fail.
> >
> > In theory, the provided implementation of TCG LoadLink/StoreConditional
> > can be used to properly handle atomic instructions on any architecture.
> >
> > The code has been tested with bare-metal test cases and by booting Linux.
> >
> > * Performance considerations
> > The new slow-path adds some overhead to the translation of the ARM
> > atomic instructions, since their emulation doesn't happen anymore only
> > in the guest (by mean of pure TCG generated code), but requires the
> > execution of two helpers functions. Despite this, the additional time
> > required to boot an ARM Linux kernel on an i7 clocked at 2.5GHz is
> > negligible.
> > Instead, on a LL/SC bound test scenario - like:
> > https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git - this
> > solution requires 30% (1 million iterations) and 70% (10 millions
> > iterations) of additional time for the test to complete.
> >
> > Changes from v5:
> > - The exclusive memory region is now set through a CPUClass hook,
> >   allowi
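
As an aside, the LL/SC rules described in the quoted cover letter can
be condensed into a toy, single-threaded model (purely illustrative C,
not QEMU code; all names are made up). The point it captures is that
any intervening store breaks the link recorded by LL, so the following
SC returns 1:

#include <stdint.h>
#include <stdio.h>

#define RESET_ADDR UINT64_MAX   /* stands in for EXCLUSIVE_RESET_ADDR */
#define NUM_CPUS   2

static uint32_t ram[256];
static uint64_t linked_addr[NUM_CPUS] = { RESET_ADDR, RESET_ADDR };

static uint32_t ll(int cpu, uint64_t addr)
{
    linked_addr[cpu] = addr;           /* record the protected address */
    return ram[addr];
}

static void store(uint64_t addr, uint32_t val)
{
    /* a plain store breaks every matching link, as the slow-path does */
    for (int c = 0; c < NUM_CPUS; c++) {
        if (linked_addr[c] == addr) {
            linked_addr[c] = RESET_ADDR;
        }
    }
    ram[addr] = val;
}

static int sc(int cpu, uint64_t addr, uint32_t val)
{
    if (linked_addr[cpu] != addr) {
        return 1;                      /* fail, mirroring STREX's encoding */
    }
    ram[addr] = val;
    linked_addr[cpu] = RESET_ADDR;
    return 0;                          /* success */
}

int main(void)
{
    ll(0, 4);
    store(4, 7);                            /* write from "another vCPU" */
    printf("SC result: %d\n", sc(0, 4, 9)); /* prints 1: the SC failed */
    return 0;
}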

Re: [Qemu-devel] [RFC v6 12/14] softmmu: Simplify helper_*_st_name, wrap RAM code

2015-12-17 Thread alvise rigo
On Thu, Dec 17, 2015 at 5:52 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Alvise Rigo <a.r...@virtualopensystems.com> writes:
>
>> Attempting to simplify the helper_*_st_name, wrap the code relative to a
>> RAM access into an inline function.
>
> This commit breaks a default x86_64-softmmu build:

I see. Would these three commits make more sense squashed together,
or is it better to leave them distinct and fix the compilation issue?

BTW, I will now start using that script.

Thank you,
alvise

>
>   CCx86_64-softmmu/../hw/audio/pcspk.o
> In file included from /home/alex/lsrc/qemu/qemu.git/cputlb.c:527:0:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function 
> ‘helper_ret_stb_mmu’:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: error: ‘haddr’ 
> undeclared (first use in this function)
>  haddr = addr + env->tlb_table[mmu_idx][index].addend;
>  ^
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: note: each 
> undeclared identifier is reported only once for each function it appears in
> In file included from /home/alex/lsrc/qemu/qemu.git/cputlb.c:530:0:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function 
> ‘helper_le_stw_mmu’:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: error: ‘haddr’ 
> undeclared (first use in this function)
>  haddr = addr + env->tlb_table[mmu_idx][index].addend;
>  ^
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function 
> ‘helper_be_stw_mmu’:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:651:13: error: ‘haddr’ 
> undeclared (first use in this function)
>  haddr = addr + env->tlb_table[mmu_idx][index].addend;
>  ^
> In file included from /home/alex/lsrc/qemu/qemu.git/cputlb.c:533:0:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function 
> ‘helper_le_stl_mmu’:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: error: ‘haddr’ 
> undeclared (first use in this function)
>  haddr = addr + env->tlb_table[mmu_idx][index].addend;
>  ^
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function 
> ‘helper_be_stl_mmu’:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:651:13: error: ‘haddr’ 
> undeclared (first use in this function)
>  haddr = addr + env->tlb_table[mmu_idx][index].addend;
>  ^
> In file included from /home/alex/lsrc/qemu/qemu.git/cputlb.c:536:0:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function 
> ‘helper_le_stq_mmu’:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: error: ‘haddr’ 
> undeclared (first use in this function)
>  haddr = addr + env->tlb_table[mmu_idx][index].addend;
>  ^
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function 
> ‘helper_be_stq_mmu’:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:651:13: error: ‘haddr’ 
> undeclared (first use in this function)
>  haddr = addr + env->tlb_table[mmu_idx][index].addend;
>  ^
> make[1]: *** [cputlb.o] Error 1
> make[1]: *** Waiting for unfinished jobs
>   CCx86_64-softmmu/../hw/block/fdc.o
> make: *** [subdir-x86_64-softmmu] Error 2
>
> ERROR: commit 3a371deaf11ce944127a00eadbc7e811b6798de1 failed to build!
> commit 3a371deaf11ce944127a00eadbc7e811b6798de1
> Author: Alvise Rigo <a.r...@virtualopensystems.com>
> Date:   Thu Dec 10 17:26:54 2015 +0100
>
> softmmu: Simplify helper_*_st_name, wrap RAM code
>
> Found while checking with Jeff's compile-check script:
>
> https://github.com/codyprime/git-scripts
>
> git compile-check -r c3626ca7df027dabf0568284360a23faf18f0884..HEAD
>
>>
>> Suggested-by: Jani Kokkonen <jani.kokko...@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.font...@huawei.com>
>> Signed-off-by: Alvise Rigo <a.r...@virtualopensystems.com>
>> ---
>>  softmmu_template.h | 110 
>> +
>>  1 file changed, 68 insertions(+), 42 deletions(-)
>>
>> diff --git a/softmmu_template.h b/softmmu_template.h
>> index 2ebf527..262c95f 100644
>> --- a/softmmu_template.h
>> +++ b/softmmu_template.h
>> @@ -416,13 +416,46 @@ static inline void glue(helper_le_st_name, 
>> _do_mmio_access)(CPUArchState *env,
>>  glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>>  }
>>
>> +static inline void glue(helper_le_st_name, _do_ram_access)(CPUArchState 
>> *env,
>> +   DATA_TYPE val,
>> +   target_ulong 
>> a

Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2015-12-15 Thread alvise rigo
Hi Andreas,

On Mon, Dec 14, 2015 at 11:09 PM, Andreas Tobler <andre...@fgznet.ch> wrote:
> Alvise,
>
> On 14.12.15 09:41, Alvise Rigo wrote:
>>
>> This is the sixth iteration of the patch series which applies to the
>> upstream branch of QEMU (v2.5.0-rc3).
>>
>> Changes versus previous versions are at the bottom of this cover letter.
>>
>> The code is also available at following repository:
>> https://git.virtualopensystems.com/dev/qemu-mt.git
>> branch:
>> slowpath-for-atomic-v6-no-mttcg
>
>
> Thank you very much for this work. I tried to rebase myself, but it was over
> my head.
>
> I'm looking for a qemu solution where I can use my cores.
>
> My use case is doing gcc porting for aarch64-*-freebsd*. I think it doesn't
> matter which OS. This arch has not enough real affordable HW solutions on
> the market yet. So I was looking for your solution. Claudio gave me a hint
> about it.
>
> Your recent merge/rebase only covers arm itself, not aarch64, right?

Indeed, only arm. Keep in mind that this patch series applies to the
upstream version of QEMU, not to the mttcg branch.
In other words, the repo includes a version of QEMU that is still
single-threaded, with some changes to the handling of atomic
instructions in preparation for multi-threaded emulation.

>
> Linking fails with unreferenced cpu_exclusive_addr stuff in
> target-arm/translate-a64.c

Even though aarch64 is not supported, this error should not happen. My
fault, I will fix it in the next version.

>
> Are you working on this already? Or Claudio?

As soon as the mttcg branch is updated, I will rebase this patch
series on top of it, and possibly also cover the aarch64 architecture.

Thank you,
alvise

>
>> This work has been sponsored by Huawei Technologies Duesseldorf GmbH.
>
>
> ...
>
> Thank you!
> Andreas
>



Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

2015-12-15 Thread alvise rigo
Hi Paolo,

On Mon, Dec 14, 2015 at 11:17 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>
> On 14/12/2015 11:04, alvise rigo wrote:
>> In any case, what I proposed in the mttcg based v5 was:
>> - A LL ensures that the TLB_EXCL flag is set on all the CPU's TLB.
>> This is done by querying a TLB flush to all (not exactly all...) the
>> CPUs. To be 100% safe, probably we should also wait that the flush is
>> actually performed
>> - A TLB_EXCL flag set always forces the slow-path, allowing the CPUs
>> to check for possible collision with a "exclusive memory region"
>>
>> Now, why the fact of querying the flush (and possibly ensuring that
>> the flush has been actually done) should not be enough?
>
> There will always be a race where the normal store fails.  While I
> haven't studied your code enough to do a constructive proof, it's enough
> to prove the impossibility of what you're trying to do.  Mind, I also
> believed for a long time that it was possible to do it!
>
> If we have two CPUs, with CPU 0 executing LL and the CPU 1 executing a
> store, you can model this as a consensus problem.  For example, CPU 0
> could propose that the subsequent SC succeeds, while CPU 1 proposes that
> it fails.  The outcome of the SC instruction depends on who wins.

I see your point. This, as you wrote, holds only when we attempt to
make the fast path wait-free.
However, the implementation I proposed is not wait-free: it serializes
the accesses to the shared resources (those that determine whether the
access was successful or not) by means of a mutex.
The assumption I made - and to some extent verified - is that such
"colliding fast accesses" are rare. I guess you also agree on this,
otherwise how could a wait-free implementation possibly work without
being coupled with primitives of an appropriate consensus number?

Thank you,
alvise

>
> Therefore, implementing LL/SC problem requires---on both CPU 0 and CPU
> 1, and hence for both LL/SC and normal store---an atomic primitive with
> consensus number >= 2.  Other than LL/SC itself, the commonly-available
> operations satisfying this requirement are test-and-set (consensus
> number 2) and compare-and-swap (infinite consensus number).  Normal
> memory reads and writes (called "atomic registers" in multi-processing
> research lingo) have consensus number 1; it's not enough.
>
> If the host had LL/SC, CPU 1 could in principle delegate its side of the
> consensus problem to the processor; but even that is not a solution
> because processors constrain the instructions that can appear between
> the load and the store, and this could cause an infinite sequence of
> spurious failed SCs.  Another option is transactional memory, but it's
> also too slow for normal stores.
>
> The simplest solution is not to implement full LL/SC semantics; instead,
> similar to linux-user, a SC operation can perform a cmpxchg from the
> value fetched by LL to the argument of SC.  This bypasses the issue
> because stores do not have to be instrumented at all, but it does mean
> that the emulation suffers from the ABA problem.
>
> TLB_EXCL is also a middle-ground, a little bit stronger than cmpxchg.
> It's more complex and more accurate, but also not perfect.  Which is
> okay, but has to be documented.
>
> Paolo
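
For concreteness, a standalone sketch in plain C11 (not QEMU code; the
names are invented) of the weaker scheme described above: SC is
emulated as a compare-and-swap against the value fetched by LL, so
normal stores need no instrumentation at all, but an A-B-A rewrite of
the location between LL and SC goes undetected (the ABA problem).

#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint32_t *addr;   /* location "linked" by the last LL */
    uint32_t          val;    /* value observed by that LL        */
} ExclusiveState;

static uint32_t emulate_ll(ExclusiveState *st, _Atomic uint32_t *addr)
{
    st->addr = addr;
    st->val  = atomic_load(addr);
    return st->val;
}

/* Returns 0 on success, 1 on failure, mirroring STREX's result encoding. */
static int emulate_sc(ExclusiveState *st, _Atomic uint32_t *addr,
                      uint32_t newval)
{
    uint32_t expected = st->val;

    if (addr != st->addr) {
        return 1;                      /* not the linked location */
    }
    /* succeeds as long as the current value still *looks* unchanged,
     * even if it went A -> B -> A in the meantime */
    return atomic_compare_exchange_strong(addr, &expected, newval) ? 0 : 1;
}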



Re: [Qemu-devel] [RFC v6 09/14] softmmu: Add history of excl accesses

2015-12-15 Thread alvise rigo
On Mon, Dec 14, 2015 at 10:35 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>
> On 14/12/2015 09:41, Alvise Rigo wrote:
>> +static inline void excl_history_put_addr(CPUState *cpu, hwaddr addr)
>> +{
>> +/* Avoid some overhead if the address we are about to put is equal to
>> + * the last one */
>> +if (cpu->excl_protected_addr[cpu->excl_protected_last] !=
>> +(addr & TARGET_PAGE_MASK)) {
>> +cpu->excl_protected_last = (cpu->excl_protected_last + 1) %
>> +EXCLUSIVE_HISTORY_LEN;
>
> Either use "&" here...
>
>> +/* Unset EXCL bit of the oldest entry */
>> +if (cpu->excl_protected_addr[cpu->excl_protected_last] !=
>> +EXCLUSIVE_RESET_ADDR) {
>> +cpu_physical_memory_unset_excl(
>> +cpu->excl_protected_addr[cpu->excl_protected_last],
>> +cpu->cpu_index);
>> +}
>> +
>> +/* Add a new address, overwriting the oldest one */
>> +cpu->excl_protected_addr[cpu->excl_protected_last] =
>> +addr & TARGET_PAGE_MASK;
>> +}
>> +}
>> +
>>  #define MMUSUFFIX _mmu
>>
>>  /* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>> index 9e409ce..5f65ebf 100644
>> --- a/include/qom/cpu.h
>> +++ b/include/qom/cpu.h
>> @@ -217,6 +217,7 @@ struct kvm_run;
>>
>>  /* Atomic insn translation TLB support. */
>>  #define EXCLUSIVE_RESET_ADDR ULLONG_MAX
>> +#define EXCLUSIVE_HISTORY_LEN 8
>>
>>  /**
>>   * CPUState:
>> @@ -343,6 +344,8 @@ struct CPUState {
>>   * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not.
>>   * in the middle of a LL/SC. */
>>  struct Range excl_protected_range;
>> +hwaddr excl_protected_addr[EXCLUSIVE_HISTORY_LEN];
>> +int excl_protected_last;
>
> ... or make this an "unsigned int".  Otherwise the code will contain an
> actual (and slow) modulo operation.

Absolutely true.

Thank you,
alvise
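
A quick standalone illustration of the point (not QEMU code): with a
power-of-two history length the wrap-around can be written as a mask,
and with an unsigned index the compiler turns the '%' into that same
mask anyway, while a signed '%' needs an extra correction sequence.

#define EXCLUSIVE_HISTORY_LEN 8   /* must stay a power of two for the mask */

static inline unsigned next_history_slot(unsigned last)
{
    /* same as (last + 1) % EXCLUSIVE_HISTORY_LEN, but explicitly a mask */
    return (last + 1) & (EXCLUSIVE_HISTORY_LEN - 1);
}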

>
> Paolo
>
>>  /* Used to carry the SC result but also to flag a normal (legacy)
>>   * store access made by a stcond (see softmmu_template.h). */
>>  int excl_succeeded;


