Warnings with GCC 9.3

2020-04-23 Thread Pranith Kumar
Hello,

I keep seeing these warnings on the latest master with GCC 9.3:

/home/pranith/qemu/hw/block/pflash_cfi01.c: In function
‘pflash_mem_read_with_attrs’:

/home/pranith/qemu/hw/block/pflash_cfi01.c:667:20: note: parameter passing
for argument of type ‘MemTxAttrs’ {aka ‘struct MemTxAttrs’} changed in GCC
9.1
  667 | static MemTxResult pflash_mem_read_with_attrs(void *opaque, hwaddr
addr, uint64_t *value,



Are there any patches in queue to fix this before the release?

Thanks,
-- 
Pranith


Re: [Qemu-devel] [PATCH v3 00/50] tcg plugin support

2019-06-21 Thread Pranith Kumar
On Fri, Jun 21, 2019 at 1:21 AM Alex Bennée  wrote:

> > * Register and memory read/write API
> >
> >   It would be great to have register and memory read/write API i.e., ability
> >   to read/write to registers/memory from within the callback. This gives the
> >   plugin ability to do system introspection. (Not sure if the current 
> > patchset
> >   implements this already).
>
> Not currently. The trick is to have something flexible enough without
> exposing internals. I guess we could consider the gdb register
> enumeration or maybe hook into stuff declared as
> tcg_global_mem_new_i[64|32]. That won't get every "register" but it
> should certainly cover the main general purpose ones. Things like SVE
> and AdvSIMD vector registers wouldn't be seen though.

I guess general registers could be a good starting point. We can then
implement arch-specific register access APIs.
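
A minimal starting point might look something like the sketch below,
keyed to the gdb register enumeration you mention (the name and
signature are invented for illustration; nothing like this exists in
the patchset yet):

/*
 * Hypothetical sketch: read a general purpose register of @vcpu_index,
 * identified by its gdb enumeration number. Returns 0 on success,
 * non-zero if the register is not exposed.
 */
int qemu_plugin_read_register(unsigned int vcpu_index, int gdb_regno,
                              uint64_t *val);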

>
> > * Register callbacks
> >
> >   A callback needs to be invoked whenever a specified register is read or
> >   written to.
>
> Again tricky as not every register read/write is obvious from TCG -
> vector registers tweaked from helpers would be a good example.
>
> >
> > * Where do new plugins live in the tree?
> >
> >   The current source files in plugins (api, core etc.,) I think are better 
> > if
> >   moved to tcg/plugins/.  The various plugins we write would then live in 
> > the
> >   plugins/ folder instead of the current tests/plugin/ folder.
>
> The example plugins are really just toys for experimenting with the API
> - I don't see too much problem with them being in tests. However the
> howvec plugin is very guest architecture specific so we could consider a
> bit more of a hierarchy. Maybe these should all live in tests/tcg?
>

So where do you want 'real' plugins to live in the tree? It would be
good to think about the structure for those.

> >
> > * Timer interrupts
> >
> >   What I observed is that the system execution is affected by what you do in
> >   the callbacks because of timer interrupts. For example, if you spend time 
> > in
> >   the memory callback doing a bit of processing and writing to a file, you
> >   will see more timer interrupt instructions. One solution to this would be 
> > to
> >   use 'icount', but it does not work that well. I think we need to do
> >   something similar to what gdb does in debug mode. How would you handle 
> > MTTCG
> >   guests in that case?
>
> icount is going to be the best you can get for deterministic time -
> other efforts to pause/restart virtual time going in and out of plugins
> are just going to add a lot of overhead.

I wonder why using icount is not working in this case. Are there any
timers that fire non-deterministically even when icount is used?

>
> Remember QEMU doesn't even try to be a cycle accurate emulation so
> expecting to get reasonable timing information out of these plugins is a
> stretch. Maybe I should make that clearer in the design document?

It is less about being cycle accurate and more about being
deterministic. For example, when tracing using plugins+callbacks, you
will see a lot more interrupt code in the trace than when you execute
without tracing. How do we get the two runs to be more similar?

Another idea would be to provide an API for the plugin to generate the
timer interrupt. This allows the plugin to generate regular interrupts
irrespective of what is being done in the callbacks.
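
One possible shape for such an API, purely illustrative (neither the
name nor the mechanism exists in the series):

/*
 * Hypothetical: raise the guest timer interrupt on @vcpu_index every
 * @interval executed guest instructions, decoupled from host time.
 */
void qemu_plugin_set_timer_interval(unsigned int vcpu_index,
                                    uint64_t interval);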

>
> The gdb behaviour is just a massive hack. When single-stepping in GDB we
> prevent timer IRQs from being delivered - they have still fired and are
> pending and will execute as soon as you hit continue.
>
> >   Another approach would be to offload callback generation to a separate
> >   plugin thread. The main thread will copy data required by a callback and
> >   invoke the callback asynchronously (std::async in C++ if you are
> >   familiar).
>
> This would complicate things - the current iteration I'm working on
> drops the haddr cb in favour of dynamically resolving the vaddr in the
> callback. But that approach is only valid during the callback before
> something else potentially pollutes the TLB.
>

> >
> > * Starting and stopping callback generation
> >
> >   It would be great if we have a mechanism to dynamically start/stop 
> > callbacks
> >   when a sequence of code (magic instruction) is executed. This would be
> >   useful to annotate region-of-interest (ROI) in benchmarks to
> > generate callbacks.
>
> Well we have that now. At each TB generation event the callback is free to 
> register
> as many or few callbacks as it likes dynamically.

But how does the plugin know that the TB being generated is the first
TB in the ROI? Similarly, the plugin needs to know when the end of the
ROI has been reached.

Also, please note that there can be multiple ROIs. It would be good to
know if we can assign IDs to each ROI for the plugin. A rough sketch of
the kind of thing I mean follows.
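
This assumes the tb-translation hook from this series plus a
qemu_plugin_tb_vaddr() helper (treat both names as illustrative), with
the ROI boundaries passed as plugin arguments instead of magic
instructions:

static uint64_t roi_start, roi_end; /* parsed from plugin argv */
static bool in_roi;

static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
{
    uint64_t pc = qemu_plugin_tb_vaddr(tb);

    if (pc == roi_start) {
        in_roi = true;
    } else if (pc == roi_end) {
        in_roi = false;
    }
    if (!in_roi) {
        return; /* register no callbacks outside the ROI */
    }
    /* ... register insn/mem exec callbacks for this TB ... */
}

Note this only toggles at translation time, so TBs that were already
translated (and cached) before a boundary is crossed are missed, which
is really the same question again.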

Thanks,
-- 
Pranith



Re: [Qemu-devel] [PATCH v3 00/50] tcg plugin support

2019-06-20 Thread Pranith Kumar
Hi Alex/Emilio,

I am really happy to see the progress you made on the plugin feature. Looking
forward to seeing it merged soon! Please CC me on future versions of the
patchset. I am happy to help review and contribute to this effort.

I have a few general comments from experience writing a very similar system
(qsim) below.

On Fri, Jun 14, 2019 at 10:23 AM Alex Bennée  wrote:
>
> Hi,
>
> This is v3 of the proposed plugins API for QEMU. As Emilio is busy
> having finished his time at Columbia I have volunteered to take the
> patch series forward. Emilio's RFC v2 was posted last year:
>
>   Subject: [RFC v2 00/38] Plugin support
>   Date: Sun,  9 Dec 2018 14:37:11 -0500
>   Message-Id: <20181209193749.12277-1-c...@braap.org>
>

* Register and memory read/write API

  It would be great to have a register and memory read/write API, i.e., the
  ability to read/write registers/memory from within a callback. This gives
  the plugin the ability to do system introspection. (Not sure if the current
  patchset implements this already.)

* Register callbacks

  A callback needs to be invoked whenever a specified register is read or
  written to.

* Where do new plugins live in the tree?

  I think the current source files in plugins/ (api, core, etc.) would be
  better moved to tcg/plugins/. The various plugins we write would then live
  in the plugins/ folder instead of the current tests/plugin/ folder.

* Timer interrupts

  What I observed is that the system execution is affected by what you do in
  the callbacks because of timer interrupts. For example, if you spend time in
  the memory callback doing a bit of processing and writing to a file, you
  will see more timer interrupt instructions. One solution to this would be to
  use 'icount', but it does not work that well. I think we need to do
  something similar to what gdb does in debug mode. How would you handle MTTCG
  guests in that case?

  Another approach would be to offload callback generation to a separate
  plugin thread. The main thread will copy data required by a callback and
  invoke the callback asynchronously (std::async in C++ if you are familiar).

* Starting and stopping callback generation

  It would be great if we had a mechanism to dynamically start/stop callbacks
  when a specific sequence of code (a magic instruction) is executed. This
  would be useful to annotate regions of interest (ROIs) in benchmarks and
  generate callbacks only within them.

  Also, the return value from a callback can be used to decide the further
  course of action. For example, a plugin that needs only a fixed number of
  callbacks can indicate in the return value that no further callbacks should
  be generated once it has received enough.

* State saving API

  An API to save the state of the VM from the plugin code.

Let me know your thoughts and any other ideas you might have.

Thanks,
--
Pranith



Re: [Qemu-devel] [PATCH v3 07/50] plugin: add user-facing API

2019-06-18 Thread Pranith Kumar
On Fri, Jun 14, 2019 at 10:24 AM Alex Bennée  wrote:
>
> From: "Emilio G. Cota" 
>
> Add the API first to ease review.
>
> Signed-off-by: Emilio G. Cota 
> Signed-off-by: Alex Bennée 
>
> ---
> v3
>   - merge in changes to plugin install/reset/uninstall
>   - split api file
> ---
>  include/qemu/qemu-plugin.h | 339 +
>  1 file changed, 339 insertions(+)
>  create mode 100644 include/qemu/qemu-plugin.h
>
> diff --git a/include/qemu/qemu-plugin.h b/include/qemu/qemu-plugin.h
> new file mode 100644
> index 00..0db1ef9714
> --- /dev/null
> +++ b/include/qemu/qemu-plugin.h
> @@ -0,0 +1,339 @@
> +/*
> + * Copyright (C) 2017, Emilio G. Cota 
> + * Copyright (C) 2019, Linaro
> + *
> + * License: GNU GPL, version 2 or later.
> + *   See the COPYING file in the top-level directory.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +#ifndef QEMU_PLUGIN_API_H
> +#define QEMU_PLUGIN_API_H
> +
> +#include 
> +#include 
> +
> +/*
> + * For best performance, build the plugin with -fvisibility=hidden so that
> + * QEMU_PLUGIN_LOCAL is implicit. Then, just mark qemu_plugin_install with
> + * QEMU_PLUGIN_EXPORT. For more info, see
> + *   https://gcc.gnu.org/wiki/Visibility
> + */
> +#if defined _WIN32 || defined __CYGWIN__
> +  #ifdef BUILDING_DLL
> +#define QEMU_PLUGIN_EXPORT __declspec(dllexport)
> +  #else
> +#define QEMU_PLUGIN_EXPORT __declspec(dllimport)
> +  #endif
> +  #define QEMU_PLUGIN_LOCAL
> +#else
> +  #if __GNUC__ >= 4
> +#define QEMU_PLUGIN_EXPORT __attribute__((visibility("default")))
> +#define QEMU_PLUGIN_LOCAL  __attribute__((visibility("hidden")))
> +  #else
> +#define QEMU_PLUGIN_EXPORT
> +#define QEMU_PLUGIN_LOCAL
> +  #endif
> +#endif
> +
> +typedef uint64_t qemu_plugin_id_t;
> +
> +/**
> + * qemu_plugin_install() - Install a plugin
> + * @id: this plugin's opaque ID
> + * @argc: number of arguments
> + * @argv: array of arguments (@argc elements)
> + *
> + * All plugins must export this symbol.
> + *
> + * Note: Calling qemu_plugin_uninstall() from this function is a bug. To 
> raise
> + * an error during install, return !0.
> + *
> + * Note: @argv remains valid throughout the lifetime of the loaded plugin.
> + */
> +QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id, int argc,
> +   char **argv);
> +
> +/*
> + * Prototypes for the various callback styles we will be registering
> + * in the following functions.
> + */
> +typedef void (*qemu_plugin_simple_cb_t)(qemu_plugin_id_t id);
> +
> +typedef void (*qemu_plugin_udata_cb_t)(qemu_plugin_id_t id, void *userdata);
> +
> +typedef void (*qemu_plugin_vcpu_simple_cb_t)(qemu_plugin_id_t id,
> + unsigned int vcpu_index);
> +
> +typedef void (*qemu_plugin_vcpu_udata_cb_t)(unsigned int vcpu_index,
> +void *userdata);
> +
> +/**
> + * qemu_plugin_uninstall() - Uninstall a plugin
> + * @id: this plugin's opaque ID
> + * @cb: callback to be called once the plugin has been removed
> + *
> + * Do NOT assume that the plugin has been uninstalled once this function
> + * returns. Plugins are uninstalled asynchronously, and therefore the given
> + * plugin receives callbacks until @cb is called.
> + *
> + * Note: Calling this function from qemu_plugin_install() is a bug.
> + */
> +void qemu_plugin_uninstall(qemu_plugin_id_t id, qemu_plugin_simple_cb_t cb);
> +
> +/**
> + * qemu_plugin_reset() - Reset a plugin
> + * @id: this plugin's opaque ID
> + * @cb: callback to be called once the plugin has been reset
> + *
> + * Unregisters all callbacks for the plugin given by @id.
> + *
> + * Do NOT assume that the plugin has been reset once this function returns.
> + * Plugins are reset asynchronously, and therefore the given plugin receives
> + * callbacks until @cb is called.
> + */
> +void qemu_plugin_reset(qemu_plugin_id_t id, qemu_plugin_simple_cb_t cb);
> +
> +/**
> + * qemu_plugin_register_vcpu_init_cb() - register a vCPU initialization 
> callback
> + * @id: plugin ID
> + * @cb: callback function
> + *
> + * The @cb function is called every time a vCPU is initialized.
> + *
> + * See also: qemu_plugin_register_vcpu_exit_cb()
> + */
> +void qemu_plugin_register_vcpu_init_cb(qemu_plugin_id_t id,
> +   qemu_plugin_vcpu_simple_cb_t cb);
> +
> +/**
> + * qemu_plugin_register_vcpu_exit_cb() - register a vCPU exit callback
> + * @id: plugin ID
> + * @cb: callback function
> + *
> + * The @cb function is called every time a vCPU exits.
> + *
> + * See also: qemu_plugin_register_vcpu_init_cb()
> + */
> +void qemu_plugin_register_vcpu_exit_cb(qemu_plugin_id_t id,
> +   qemu_plugin_vcpu_simple_cb_t cb);
> +
> +/**
> + * qemu_plugin_register_vcpu_idle_cb() - register a vCPU idle callback
> + * @id: plugin ID
> + * @cb: callback function
> + *
> + * The @cb function is called every time a 
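
For what it's worth, a minimal plugin against this API could look like
the sketch below. The atexit registration is assumed from the
qemu_plugin_udata_cb_t type above (that part of the header is not
quoted here), so treat its exact signature as an assumption:

#include <stdio.h>
#include "qemu/qemu-plugin.h"

static unsigned long inits;

static void vcpu_init(qemu_plugin_id_t id, unsigned int vcpu_index)
{
    inits++; /* racy under MTTCG, fine for a sketch */
}

static void at_exit(qemu_plugin_id_t id, void *udata)
{
    printf("vCPU inits: %lu\n", inits);
}

QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id, int argc,
                                           char **argv)
{
    qemu_plugin_register_vcpu_init_cb(id, vcpu_init);
    qemu_plugin_register_atexit_cb(id, at_exit, NULL); /* assumed API */
    return 0; /* returning !0 signals an install error */
}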

Re: [Qemu-devel] [PATCH v3 22/50] *-user: plugin syscalls

2019-06-18 Thread Pranith Kumar
Minor nits.

On Fri, Jun 14, 2019 at 11:41 AM Alex Bennée  wrote:
>
> From: "Emilio G. Cota" 
>
> Signed-off-by: Emilio G. Cota 
> ---
>  bsd-user/syscall.c   | 9 +
>  linux-user/syscall.c | 3 +++
>  2 files changed, 12 insertions(+)
>
> diff --git a/bsd-user/syscall.c b/bsd-user/syscall.c
> index 84a983a9a1..50e47d217c 100644
> --- a/bsd-user/syscall.c
> +++ b/bsd-user/syscall.c
> @@ -323,6 +323,8 @@ abi_long do_freebsd_syscall(void *cpu_env, int num, 
> abi_long arg1,
>  gemu_log("freebsd syscall %d\n", num);
>  #endif
>  trace_guest_user_syscall(cpu, num, arg1, arg2, arg3, arg4, arg5, arg6, 
> arg7, arg8);
> +qemu_plugin_vcpu_syscall(cpu, num, arg1, arg2, arg3, arg4, arg5, arg6, 
> arg7,
> + arg8);

Looking at the previous line, it seems you can avoid splitting this
line into two. That keeps it more consistent.

>  if(do_strace)
>  print_freebsd_syscall(num, arg1, arg2, arg3, arg4, arg5, arg6);
>
> @@ -404,6 +406,7 @@ abi_long do_freebsd_syscall(void *cpu_env, int num, 
> abi_long arg1,
>  if (do_strace)
>  print_freebsd_syscall_ret(num, ret);
>  trace_guest_user_syscall_ret(cpu, num, ret);
> +qemu_plugin_vcpu_syscall_ret(cpu, num, ret);
>  return ret;
>   efault:
>  ret = -TARGET_EFAULT;
> @@ -422,6 +425,8 @@ abi_long do_netbsd_syscall(void *cpu_env, int num, 
> abi_long arg1,
>  gemu_log("netbsd syscall %d\n", num);
>  #endif
>  trace_guest_user_syscall(cpu, num, arg1, arg2, arg3, arg4, arg5, arg6, 
> 0, 0);
> +qemu_plugin_vcpu_syscall(cpu, num, arg1, arg2, arg3, arg4, arg5, arg6, 0,
> + 0);

ditto.

>  if(do_strace)
>  print_netbsd_syscall(num, arg1, arg2, arg3, arg4, arg5, arg6);
>
> @@ -480,6 +485,7 @@ abi_long do_netbsd_syscall(void *cpu_env, int num, 
> abi_long arg1,
>  if (do_strace)
>  print_netbsd_syscall_ret(num, ret);
>  trace_guest_user_syscall_ret(cpu, num, ret);
> +qemu_plugin_vcpu_syscall_ret(cpu, num, ret);
>  return ret;
>   efault:
>  ret = -TARGET_EFAULT;
> @@ -498,6 +504,8 @@ abi_long do_openbsd_syscall(void *cpu_env, int num, 
> abi_long arg1,
>  gemu_log("openbsd syscall %d\n", num);
>  #endif
>  trace_guest_user_syscall(cpu, num, arg1, arg2, arg3, arg4, arg5, arg6, 
> 0, 0);
> +qemu_plugin_vcpu_syscall(cpu, num, arg1, arg2, arg3, arg4, arg5, arg6, 0,
> + 0);

ditto.

>  if(do_strace)
>  print_openbsd_syscall(num, arg1, arg2, arg3, arg4, arg5, arg6);
>
> @@ -556,6 +564,7 @@ abi_long do_openbsd_syscall(void *cpu_env, int num, 
> abi_long arg1,
>  if (do_strace)
>  print_openbsd_syscall_ret(num, ret);
>  trace_guest_user_syscall_ret(cpu, num, ret);
> +qemu_plugin_vcpu_syscall_ret(cpu, num, ret);
>  return ret;
>   efault:
>  ret = -TARGET_EFAULT;
> diff --git a/linux-user/syscall.c b/linux-user/syscall.c
> index b187c1281d..7f3cfdee84 100644
> --- a/linux-user/syscall.c
> +++ b/linux-user/syscall.c
> @@ -11724,6 +11724,8 @@ abi_long do_syscall(void *cpu_env, int num, abi_long 
> arg1,
>
>  trace_guest_user_syscall(cpu, num, arg1, arg2, arg3, arg4,
>   arg5, arg6, arg7, arg8);
> +qemu_plugin_vcpu_syscall(cpu, num, arg1, arg2, arg3, arg4, arg5, arg6, 
> arg7,
> + arg8);

This one I am not sure about.


>
>  if (unlikely(do_strace)) {
>  print_syscall(num, arg1, arg2, arg3, arg4, arg5, arg6);
> @@ -11736,5 +11738,6 @@ abi_long do_syscall(void *cpu_env, int num, abi_long 
> arg1,
>  }
>
>  trace_guest_user_syscall_ret(cpu, num, ret);
> +qemu_plugin_vcpu_syscall_ret(cpu, num, ret);
>  return ret;
>  }
> --
> 2.20.1
>
>



Re: [Qemu-devel] [PATCH v3 05/50] docs/devel: add plugins.rst design document

2019-06-18 Thread Pranith Kumar
Hi,

On Fri, Jun 14, 2019 at 10:21 AM Alex Bennée  wrote:
>
> This is mostly extracted from Emilio's more verbose commit comments
> with some additional verbiage from me.
>
> Signed-off-by: Alex Bennée 
> ---
>  docs/devel/index.rst   |  1 +
>  docs/devel/plugins.rst | 99 ++
>  2 files changed, 100 insertions(+)
>  create mode 100644 docs/devel/plugins.rst
>
> diff --git a/docs/devel/index.rst b/docs/devel/index.rst
> index 2a4ddf40ad..7e6d20c970 100644
> --- a/docs/devel/index.rst
> +++ b/docs/devel/index.rst
> @@ -21,3 +21,4 @@ Contents:
> testing
> decodetree
> secure-coding-practices
> +   plugins
> diff --git a/docs/devel/plugins.rst b/docs/devel/plugins.rst
> new file mode 100644
> index 00..b0c30375ef
> --- /dev/null
> +++ b/docs/devel/plugins.rst
> @@ -0,0 +1,99 @@
> +..
> +   Copyright (C) 2017, Emilio G. Cota 
> +   Copyright (c) 2019, Linaro Limited
> +   Written by Emilio Cota and Alex Bennée
> +
> +
> +QEMU TCG Plugins
> +
> +
> +QEMU TCG plugins provide a way for users to run experiments taking
> +advantage of the total system control emulation can have over a guest.
> +It provides a mechanism for plugins to subscribe to events during
> +translation and execution and optionally callback into the plugin
> +during these events.
> +
> +API Stability
> +=
> +
> +This is a new feature for QEMU and it does allow people to develop
> +out-of-tree plugins than can be dynamically linked into a running QEMU

s/than/that/

> +process. However the project reserves the right to change or break the
> +API should it need to do so.
> +
> +Exposure of QEMU internals
> +--
> +
> +The plugin architecture actively avoids leaking implementation details
> +about how QEMU's translation works to the plugins. While there are
> +conceptions such as translation time and translation blocks the
> +details are opaque to plugins. The plugin is able to query select
> +details of instructions and system configuration only through the
> +exported *qemu_plugin* functions. The types used to describe
> +instructions and events are opaque to the plugins themselves.
> +
> +Usage
> +=
> +
> +The QEMU binary needs to be compiled for plugin support:
> +
> +::
> +configure --enable-plugins
> +
> +Once built a program can be run with multiple plugins loaded each with
> +their own arguments:
> +
> +::
> +$QEMU $OTHER_QEMU_ARGS \
> +  -plugin tests/plugin/libhowvec.so,arg=inline,arg=hint \
> +  -plugin tests/plugin/libhotblocks.so

I think this might be a good place to describe what these arguments are.

> +
> +Plugin Life cycle
> +=
> +
> +First the plugin is loaded and the public qemu_plugin_install function
> +is called. The plugin with then register callbacks for various plugin

s/with/will/

> +events. Generally at least the atexit_cb is registered so the plugin
> +can dump its information at the end of a run.

Is that a hard requirement?

> +
> +When a registered event occurs the plugin callback is called. The

I would prefer 'callback is invoked'.

> +callbacks may provide additional information. In the case of a
> +translation event the plugin has an option to enumerate the
> +instructions in a block of instructions and optionally register
> +callbacks to some or all instructions when they are executed.
> +
> +There is also a facility to add an inline event where code to
> +increment a counter can be directly inlined with the translation.
> +Currently only a simple increment is supported. This is not atomic so
> +the plugin must either keep its counters separated and indexed by CPU
> +or use a callback which can ensure atomicity.
> +
> +Finally when QEMU exits all the registered atexit callbacks are called

Add period at end of sentence and preferably "s/called/invoked/"

> +
> +Internals
> +=
> +
> +Locking
> +---
> +
> +We have to ensure we cannot deadlock, particularly under MTTCG. For
> +this we acquire a lock when called from plugin code. We also keep the
> +list of callbacks under RCU so that we do not have to hold the lock
> +when calling the callbacks. This is also for performance, since some
> +callbacks (e.g. memory access callbacks) might be called very
> +frequently.
> +
> +  * A consequence of this is that we keep our own list of CPUs, so that
> +we do not have to worry about locking order wrt cpu_list_lock.
> +  * Use a recursive lock, since we can get registration calls from
> +callbacks.
> +
> +As a result registering/unregistering callbacks is "slow", since it
> +takes a lock. But this is very infrequent; we want performance when
> +calling (or not calling) callbacks, not when registering them. Using
> +RCU is great for this.
> +
> +We support the uninstallation of a plugin at any time (e.g. from plugin
> +callbacks). This means some callbacks might still be called after the 
> uninstall
> +function returns. The plugin isn't 

Re: [Qemu-devel] qemu-riscv64 seg fault

2018-09-03 Thread Pranith Kumar
On Mon, Sep 3, 2018 at 1:07 AM Michael Clark  wrote:
>
> Thanks. I was just about to log an issue in the riscv-qemu issue tracker on 
> GitHub.
>
> I reproduced it on my side. The fact that it is causes QEMU user to crash in 
> translate.c is interesting.
>
> I ran your program with -d in_asm and it appears to crash in thread::join

Interestingly, qemu-x86_64 crashes too. But running natively on x86
works fine. There is something off somewhere...

~/quickht$ qemu-x86_64 ./bench -t 2 -u 2
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
Segmentation fault

~/quickht$ ./bench -t 2 -u 2
Comparisons: 23922
Buckets touched: 4933
Operations: 2000

--
Pranith



Re: [Qemu-devel] qemu-riscv64 seg fault

2018-09-03 Thread Pranith Kumar
On second look, running the benchmark on a RISC-V processor also gives
a seg fault. So maybe there is something wrong with the benchmark...
OTOH, the x86 version runs fine... hmm

Please ignore this report, I will try to investigate further.

Thanks,
On Mon, Sep 3, 2018 at 12:45 AM Pranith Kumar  wrote:
>
> Hi Michael,
>
> qemu-riscv64 seg faults for me on a static binary. You can build the
> binary from here: https://github.com/pranith/quickht
>
> $ STATIC=1 RISCV=1 make
>
> $ qemu-riscv64 ./bench -t 1 -u 1
> 
>
> Thanks,
> --
> Pranith



-- 
Pranith



[Qemu-devel] qemu-riscv64 seg fault

2018-09-03 Thread Pranith Kumar
Hi Michael,

qemu-riscv64 seg faults for me on a static binary. You can build the
binary from here: https://github.com/pranith/quickht

$ STATIC=1 RISCV=1 make

$ qemu-riscv64 ./bench -t 1 -u 1


Thanks,
-- 
Pranith



Re: [Qemu-devel] [PATCH 1/5] target/arm: Remove stale comment

2017-09-05 Thread Pranith Kumar
Hi Alex,

On Tue, Sep 5, 2017 at 8:02 AM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Pranith Kumar <bobby.pr...@gmail.com> writes:
>
>> Update the comment which is not true since MTTCG.
>
> What happened to the cover letter? We seem to have a mix of patches but
> no summary of the overall outcome.
>

These are a bunch of unrelated patches, so there is no theme. I will
include a cover letter saying so from now on.

Thanks,
-- 
Pranith



Re: [Qemu-devel] [PATCH] tcg/softmmu: Increase size of TLB caches

2017-09-05 Thread Pranith Kumar
On Tue, Sep 5, 2017 at 5:50 PM, Richard Henderson <r...@twiddle.net> wrote:
> On 08/29/2017 10:23 AM, Pranith Kumar wrote:
>> This patch increases the number of entries cached in the TLB. I went
>> over a few architectures to see if increasing it is problematic. Only
>> armv6 seems to have a limitation that only 8 bits can be used for
>> indexing these entries. For other architectures, the number of TLB
>> entries is increased to a 4K-sized cache. The patch also doubles the
>> number of victim TLB entries.
>>
>> Some statistics collected from a build benchmark for various cache
>> sizes is listed below:
>>
>> | TLB bits\vTLB entries |       8       |       16       |       32      |
>> |           8           | 952.94(+0.0%) | 929.99(+2.4%)  | 919.02(+3.6%) |
>> |          10           | 898.92(+5.6%) | 886.13(+7.0%)  | 887.03(+6.9%) |
>> |          12           | 878.56(+7.8%) | 873.53(+8.3%)* | 875.34(+8.1%) |
>>
>> The best combination for this workload came out to be 12 bits for the
>> TLB and a 16 entry vTLB cache.
>
> This significantly degrades performance of alpha-softmmu.
> It spends about 25% of all cpu time in memset.

What workload does it degrade for? I will try to reproduce and see
which memset is causing this.

-- 
Pranith



Re: [Qemu-devel] [PATCH] arm_gicv3_kvm: Fix compile warning

2017-08-31 Thread Pranith Kumar
CC'ing stable for 2.10.

On Tue, Aug 29, 2017 at 1:32 PM, Pranith Kumar <bobby.pr...@gmail.com> wrote:
> Fix the following warning:
>
> /home/pranith/qemu/hw/intc/arm_gicv3_kvm.c:296:17: warning: logical not is 
> only applied to the left hand side of this bitwise operator 
> [-Wlogical-not-parentheses]
> if (!c->gicr_ctlr & GICR_CTLR_ENABLE_LPIS) {
> ^ ~
> /home/pranith/qemu/hw/intc/arm_gicv3_kvm.c:296:17: note: add parentheses 
> after the '!' to evaluate the bitwise operator first
> if (!c->gicr_ctlr & GICR_CTLR_ENABLE_LPIS) {
> ^
> /home/pranith/qemu/hw/intc/arm_gicv3_kvm.c:296:17: note: add parentheses 
> around left hand side expression to silence this warning
> if (!c->gicr_ctlr & GICR_CTLR_ENABLE_LPIS) {
> ^
>
> Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
> ---
>  hw/intc/arm_gicv3_kvm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/intc/arm_gicv3_kvm.c b/hw/intc/arm_gicv3_kvm.c
> index 6051c77705..481fe5405a 100644
> --- a/hw/intc/arm_gicv3_kvm.c
> +++ b/hw/intc/arm_gicv3_kvm.c
> @@ -293,7 +293,7 @@ static void kvm_arm_gicv3_put(GICv3State *s)
>  kvm_gicr_access(s, GICR_PROPBASER + 4, ncpu, &regh, true);
>
>  reg64 = c->gicr_pendbaser;
> -if (!c->gicr_ctlr & GICR_CTLR_ENABLE_LPIS) {
> +if (!(c->gicr_ctlr & GICR_CTLR_ENABLE_LPIS)) {
>  /* Setting PTZ is advised if LPIs are disabled, to reduce
>   * GIC initialization time.
>   */
> --
> 2.11.0
>



-- 
Pranith



Re: [Qemu-devel] [PATCH v4 11/43] tcg: define CF_PARALLEL and use it for TB hashing along with CF_COUNT_MASK

2017-08-30 Thread Pranith Kumar
On Tue, Aug 29, 2017 at 5:16 PM, Emilio G. Cota <c...@braap.org> wrote:
> On Sun, Aug 27, 2017 at 18:15:50 -0400, Pranith Kumar wrote:
>> Hi Emilio,
>>
>> On Fri, Jul 21, 2017 at 1:59 AM, Emilio G. Cota <c...@braap.org> wrote:
>> > This will enable us to decouple code translation from the value
>> > of parallel_cpus at any given time. It will also help us minimize
>> > TB flushes when generating code via EXCP_ATOMIC.
>> >
>> > Note that the declaration of parallel_cpus is brought to exec-all.h
>> > to be able to define there the "curr_cflags" inline.
>> >
>> > Signed-off-by: Emilio G. Cota <c...@braap.org>
>>
>> I was testing a winxp image today and I bisected a infinite loop to
>> this commit. The loop happens both with and without mttcg, so I think
>> it has got to do with something else.
>
> Can you test the below? It lets me boot ubuntu, otherwise it reliably
> chokes on a 'rep movsb' *very* early (doesn't even get to grub).
>
> This discusson on v2 might be relevant (I added CF_COUNT_MASK as a
> result of it, but it seems I have to revisit that):
>   https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg06456.html
>
> Anyway let me know if this fixes it for you. Thanks for testing!
>

Yes, this fixes the issue for me.

Thanks,
--
Pranith



Re: [Qemu-devel] [PATCH] arm_gicv3_kvm: Fix compile warning

2017-08-29 Thread Pranith Kumar
I should have worded the subject better. The warning is pointing to an
actual bug.

On Tue, Aug 29, 2017 at 1:32 PM, Pranith Kumar <bobby.pr...@gmail.com> wrote:
> Fix the following warning:
>
> /home/pranith/qemu/hw/intc/arm_gicv3_kvm.c:296:17: warning: logical not is 
> only applied to the left hand side of this bitwise operator 
> [-Wlogical-not-parentheses]
> if (!c->gicr_ctlr & GICR_CTLR_ENABLE_LPIS) {
> ^ ~
> /home/pranith/qemu/hw/intc/arm_gicv3_kvm.c:296:17: note: add parentheses 
> after the '!' to evaluate the bitwise operator first
> if (!c->gicr_ctlr & GICR_CTLR_ENABLE_LPIS) {
> ^
> /home/pranith/qemu/hw/intc/arm_gicv3_kvm.c:296:17: note: add parentheses 
> around left hand side expression to silence this warning
> if (!c->gicr_ctlr & GICR_CTLR_ENABLE_LPIS) {
> ^
>
> Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
> ---
>  hw/intc/arm_gicv3_kvm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/intc/arm_gicv3_kvm.c b/hw/intc/arm_gicv3_kvm.c
> index 6051c77705..481fe5405a 100644
> --- a/hw/intc/arm_gicv3_kvm.c
> +++ b/hw/intc/arm_gicv3_kvm.c
> @@ -293,7 +293,7 @@ static void kvm_arm_gicv3_put(GICv3State *s)
>  kvm_gicr_access(s, GICR_PROPBASER + 4, ncpu, &regh, true);
>
>  reg64 = c->gicr_pendbaser;
> -if (!c->gicr_ctlr & GICR_CTLR_ENABLE_LPIS) {
> +if (!(c->gicr_ctlr & GICR_CTLR_ENABLE_LPIS)) {
>  /* Setting PTZ is advised if LPIs are disabled, to reduce
>   * GIC initialization time.
>   */
> --
> 2.11.0
>



-- 
Pranith



[Qemu-devel] [PATCH] arm_gicv3_kvm: Fix compile warning

2017-08-29 Thread Pranith Kumar
Fix the following warning:

/home/pranith/qemu/hw/intc/arm_gicv3_kvm.c:296:17: warning: logical not is only 
applied to the left hand side of this bitwise operator 
[-Wlogical-not-parentheses]
if (!c->gicr_ctlr & GICR_CTLR_ENABLE_LPIS) {
^ ~
/home/pranith/qemu/hw/intc/arm_gicv3_kvm.c:296:17: note: add parentheses after 
the '!' to evaluate the bitwise operator first
if (!c->gicr_ctlr & GICR_CTLR_ENABLE_LPIS) {
^
/home/pranith/qemu/hw/intc/arm_gicv3_kvm.c:296:17: note: add parentheses around 
left hand side expression to silence this warning
if (!c->gicr_ctlr & GICR_CTLR_ENABLE_LPIS) {
    ^

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 hw/intc/arm_gicv3_kvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/intc/arm_gicv3_kvm.c b/hw/intc/arm_gicv3_kvm.c
index 6051c77705..481fe5405a 100644
--- a/hw/intc/arm_gicv3_kvm.c
+++ b/hw/intc/arm_gicv3_kvm.c
@@ -293,7 +293,7 @@ static void kvm_arm_gicv3_put(GICv3State *s)
 kvm_gicr_access(s, GICR_PROPBASER + 4, ncpu, &regh, true);
 
 reg64 = c->gicr_pendbaser;
-if (!c->gicr_ctlr & GICR_CTLR_ENABLE_LPIS) {
+if (!(c->gicr_ctlr & GICR_CTLR_ENABLE_LPIS)) {
 /* Setting PTZ is advised if LPIs are disabled, to reduce
  * GIC initialization time.
  */
-- 
2.11.0




[Qemu-devel] [PATCH] tcg/softmmu: Increase size of TLB caches

2017-08-29 Thread Pranith Kumar
This patch increases the number of entries cached in the TLB. I went
over a few architectures to see if increasing it is problematic. Only
armv6 seems to have a limitation that only 8 bits can be used for
indexing these entries. For other architectures, the number of TLB
entries is increased to a 4K-sized cache. The patch also doubles the
number of victim TLB entries.

Some statistics collected from a build benchmark for various cache
sizes is listed below:

| TLB bits\vTLB entries |       8       |       16       |       32      |
|           8           | 952.94(+0.0%) | 929.99(+2.4%)  | 919.02(+3.6%) |
|          10           | 898.92(+5.6%) | 886.13(+7.0%)  | 887.03(+6.9%) |
|          12           | 878.56(+7.8%) | 873.53(+8.3%)* | 875.34(+8.1%) |

The best combination for this workload came out to be 12 bits for the
TLB and a 16 entry vTLB cache.
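
For scale, a back-of-the-envelope on the sizes involved (assuming the
64-bit host case where CPU_TLB_ENTRY_BITS is 5, i.e. 32-byte entries):
12 TLB bits means 2^12 = 4096 entries per MMU mode, or 128KiB of TLB
per mode, up from 2^8 * 32B = 8KiB at the old 8-bit default. Each
tlb_flush() has to clear all of it, so flush cost grows by the same
factor of 16.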

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 include/exec/cpu-defs.h  | 6 +++---
 tcg/aarch64/tcg-target.h | 1 +
 tcg/arm/tcg-target.h | 1 +
 tcg/i386/tcg-target.h| 2 ++
 tcg/ia64/tcg-target.h| 1 +
 tcg/mips/tcg-target.h| 2 ++
 tcg/ppc/tcg-target.h | 1 +
 tcg/s390/tcg-target.h| 1 +
 tcg/sparc/tcg-target.h   | 1 +
 tcg/tci/tcg-target.h | 1 +
 10 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index bc8e7f848d..1957e3f32c 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -57,8 +57,8 @@ typedef uint64_t target_ulong;
 #endif
 
 #if !defined(CONFIG_USER_ONLY) && defined(CONFIG_TCG)
-/* use a fully associative victim tlb of 8 entries */
-#define CPU_VTLB_SIZE 8
+/* use a fully associative victim tlb of 16 entries */
+#define CPU_VTLB_SIZE 16
 
 #if HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32
 #define CPU_TLB_ENTRY_BITS 4
@@ -89,7 +89,7 @@ typedef uint64_t target_ulong;
  * of tlb_table inside env (which is non-trivial but not huge).
  */
 #define CPU_TLB_BITS \
-MIN(8,   \
+MIN(MIN(12, TCG_TARGET_TLB_MAX_INDEX_BITS),  \
 TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS -  \
 (NB_MMU_MODES <= 1 ? 0 : \
  NB_MMU_MODES <= 2 ? 1 : \
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index b41a248bee..9f4558cd83 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -15,6 +15,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE  4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 24
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 32
 #undef TCG_TARGET_STACK_GROWSUP
 
 typedef enum {
diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
index a38be15a39..ebe27991f3 100644
--- a/tcg/arm/tcg-target.h
+++ b/tcg/arm/tcg-target.h
@@ -60,6 +60,7 @@ extern int arm_arch;
 #undef TCG_TARGET_STACK_GROWSUP
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 8
 
 typedef enum {
 TCG_REG_R0 = 0,
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 73a15f7e80..5279af6eb1 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -162,6 +162,8 @@ extern bool have_popcnt;
 # define TCG_AREG0 TCG_REG_EBP
 #endif
 
+#define TCG_TARGET_TLB_MAX_INDEX_BITS (32 - CPU_TLB_ENTRY_BITS)
+
 static inline void flush_icache_range(uintptr_t start, uintptr_t stop)
 {
 }
diff --git a/tcg/ia64/tcg-target.h b/tcg/ia64/tcg-target.h
index 8f475fe742..35878e20c7 100644
--- a/tcg/ia64/tcg-target.h
+++ b/tcg/ia64/tcg-target.h
@@ -28,6 +28,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 16
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 21
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 32
 
 typedef struct {
 uint64_t lo __attribute__((aligned(16)));
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
index e9558d15bc..1b60e53169 100644
--- a/tcg/mips/tcg-target.h
+++ b/tcg/mips/tcg-target.h
@@ -39,6 +39,8 @@
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
 #define TCG_TARGET_NB_REGS 32
 
+#define TCG_TARGET_TLB_MAX_INDEX_BITS (16 - CPU_TLB_ENTRY_BITS)
+
 typedef enum {
 TCG_REG_ZERO = 0,
 TCG_REG_AT,
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
index 5a092b038a..82e10c9471 100644
--- a/tcg/ppc/tcg-target.h
+++ b/tcg/ppc/tcg-target.h
@@ -34,6 +34,7 @@
 #define TCG_TARGET_NB_REGS 32
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 32
 
 typedef enum {
 TCG_REG_R0,  TCG_REG_R1,  TCG_REG_R2,  TCG_REG_R3,
diff --git a/tcg/s390/tcg-target.h b/tcg/s390/tcg-target.h
index dc0e59193c..57f0e22532 100644
--- a/tcg/s390/tcg-target.h
+++ b/tcg/s390/tcg-target.h
@@ -27,6 +27,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 2
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 19
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 32
 
 typedef enum TCGReg {
 TCG_REG_R0 = 0,
diff --git a/tcg/sparc/

Re: [Qemu-devel] [RFC v3 PATCH 5/5] tcg/softmmu: Increase size of TLB caches

2017-08-29 Thread Pranith Kumar
On Tue, Aug 29, 2017 at 11:01 AM, Richard Henderson
<richard.hender...@linaro.org> wrote:
> On 08/28/2017 11:33 PM, Pranith Kumar wrote:
>> + * TODO: rewrite this comment
>>   */
>> -#define CPU_TLB_BITS \
>> -MIN(8,   \
>> -TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS -  \
>> -(NB_MMU_MODES <= 1 ? 0 : \
>> - NB_MMU_MODES <= 2 ? 1 : \
>> - NB_MMU_MODES <= 4 ? 2 : \
>> - NB_MMU_MODES <= 8 ? 3 : 4))
>> +#define CPU_TLB_BITS MIN(12, TCG_TARGET_TLB_MAX_INDEX_BITS)
>>
>
> Ah, no.  This will cause several builds to fail.
> You still need to restrict the *total* size of
> the TLB to TCG_TARGET_TLB_DISPLACEMENT_BITS.
>
> (That's not a 100% accurate statement, but is close.
> See the QEMU_BUILD_BUG_ON in tcg/*/*.c for specifics.)
>
> The upshot is that if a target has 2 MMU modes,
> we can allow them to be bigger.  But if it has 8,
> we have to make them smaller.
>
> I was expecting you to write
>
>   MIN(MIN(12, TCG_TARGET_TLB_MAX_INDEX_BITS)
>   TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS -
>   ...)

I see what you mean. I will fix the blunder and send an updated patch.

Thanks!
-- 
Pranith



[Qemu-devel] [RFC v3 PATCH 3/5] mttcg: Add tcg target default memory ordering

2017-08-29 Thread Pranith Kumar
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 tcg/aarch64/tcg-target.h | 2 ++
 tcg/arm/tcg-target.h | 2 ++
 tcg/ia64/tcg-target.h| 2 ++
 tcg/mips/tcg-target.h| 2 ++
 tcg/ppc/tcg-target.h | 2 ++
 tcg/s390/tcg-target.h| 2 ++
 tcg/sparc/tcg-target.h   | 2 ++
 7 files changed, 14 insertions(+)

diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 55a46ac825..b41a248bee 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -117,4 +117,6 @@ static inline void flush_icache_range(uintptr_t start, 
uintptr_t stop)
 __builtin___clear_cache((char *)start, (char *)stop);
 }
 
+#define TCG_TARGET_DEFAULT_MO (0)
+
 #endif /* AARCH64_TCG_TARGET_H */
diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
index 5ef1086710..a38be15a39 100644
--- a/tcg/arm/tcg-target.h
+++ b/tcg/arm/tcg-target.h
@@ -134,4 +134,6 @@ static inline void flush_icache_range(uintptr_t start, 
uintptr_t stop)
 __builtin___clear_cache((char *) start, (char *) stop);
 }
 
+#define TCG_TARGET_DEFAULT_MO (0)
+
 #endif
diff --git a/tcg/ia64/tcg-target.h b/tcg/ia64/tcg-target.h
index 901bb7575d..8f475fe742 100644
--- a/tcg/ia64/tcg-target.h
+++ b/tcg/ia64/tcg-target.h
@@ -195,4 +195,6 @@ static inline void flush_icache_range(uintptr_t start, 
uintptr_t stop)
 asm volatile (";;sync.i;;srlz.i;;");
 }
 
+#define TCG_TARGET_DEFAULT_MO (0)
+
 #endif
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
index d75cb63ed3..e9558d15bc 100644
--- a/tcg/mips/tcg-target.h
+++ b/tcg/mips/tcg-target.h
@@ -206,4 +206,6 @@ static inline void flush_icache_range(uintptr_t start, 
uintptr_t stop)
 cacheflush ((void *)start, stop-start, ICACHE);
 }
 
+#define TCG_TARGET_DEFAULT_MO (0)
+
 #endif
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
index 5f4a40a5b4..5a092b038a 100644
--- a/tcg/ppc/tcg-target.h
+++ b/tcg/ppc/tcg-target.h
@@ -125,4 +125,6 @@ extern bool have_isa_3_00;
 
 void flush_icache_range(uintptr_t start, uintptr_t stop);
 
+#define TCG_TARGET_DEFAULT_MO (0)
+
 #endif
diff --git a/tcg/s390/tcg-target.h b/tcg/s390/tcg-target.h
index 957f0c0afe..dc0e59193c 100644
--- a/tcg/s390/tcg-target.h
+++ b/tcg/s390/tcg-target.h
@@ -133,6 +133,8 @@ extern uint64_t s390_facilities;
 
 #define TCG_TARGET_EXTEND_ARGS 1
 
+#define TCG_TARGET_DEFAULT_MO (TCG_MO_ALL & ~TCG_MO_ST_LD)
+
 enum {
 TCG_AREG0 = TCG_REG_R10,
 };
diff --git a/tcg/sparc/tcg-target.h b/tcg/sparc/tcg-target.h
index 854a0afd70..4515c9ab48 100644
--- a/tcg/sparc/tcg-target.h
+++ b/tcg/sparc/tcg-target.h
@@ -162,6 +162,8 @@ extern bool use_vis3_instructions;
 
 #define TCG_AREG0 TCG_REG_I0
 
+#define TCG_TARGET_DEFAULT_MO (0)
+
 static inline void flush_icache_range(uintptr_t start, uintptr_t stop)
 {
 uintptr_t p;
-- 
2.13.0




[Qemu-devel] [RFC v3 PATCH 5/5] tcg/softmmu: Increase size of TLB caches

2017-08-29 Thread Pranith Kumar
This patch increases the number of entries cached in the TLB. I went
over a few architectures to see if increasing it is problematic. Only
armv6 seems to have a limitation that only 8 bits can be used for
indexing these entries. For other architectures, the number of TLB
entries is increased to a 4K-sized cache. The patch also doubles the
number of victim TLB entries.

Some statistics collected from a build benchmark for various cache
sizes is listed below:

| TLB bits\vTLB entries |       8       |       16       |       32      |
|           8           | 952.94(+0.0%) | 929.99(+2.4%)  | 919.02(+3.6%) |
|          10           | 898.92(+5.6%) | 886.13(+7.0%)  | 887.03(+6.9%) |
|          12           | 878.56(+7.8%) | 873.53(+8.3%)* | 875.34(+8.1%) |

The best combination for this workload came out to be 12 bits for the
TLB and a 16 entry vTLB cache.

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 include/exec/cpu-defs.h  | 13 -
 tcg/aarch64/tcg-target.h |  1 +
 tcg/arm/tcg-target.h |  1 +
 tcg/i386/tcg-target.h|  6 ++
 tcg/ia64/tcg-target.h|  1 +
 tcg/mips/tcg-target.h|  6 ++
 tcg/ppc/tcg-target.h |  1 +
 tcg/s390/tcg-target.h|  1 +
 tcg/sparc/tcg-target.h   |  1 +
 tcg/tci/tcg-target.h |  2 ++
 10 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index bc8e7f848d..33b0ac6fe0 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -57,8 +57,8 @@ typedef uint64_t target_ulong;
 #endif
 
 #if !defined(CONFIG_USER_ONLY) && defined(CONFIG_TCG)
-/* use a fully associative victim tlb of 8 entries */
-#define CPU_VTLB_SIZE 8
+/* use a fully associative victim tlb of 16 entries */
+#define CPU_VTLB_SIZE 16
 
 #if HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32
 #define CPU_TLB_ENTRY_BITS 4
@@ -87,14 +87,9 @@ typedef uint64_t target_ulong;
  * could be something like 0xC000 (the offset of the last TLB table) plus
  * 0x18 (the offset of the addend field in each TLB entry) plus the offset
  * of tlb_table inside env (which is non-trivial but not huge).
+ * TODO: rewrite this comment
  */
-#define CPU_TLB_BITS \
-MIN(8,   \
-TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS -  \
-(NB_MMU_MODES <= 1 ? 0 : \
- NB_MMU_MODES <= 2 ? 1 : \
- NB_MMU_MODES <= 4 ? 2 : \
- NB_MMU_MODES <= 8 ? 3 : 4))
+#define CPU_TLB_BITS MIN(12, TCG_TARGET_TLB_MAX_INDEX_BITS)
 
 #define CPU_TLB_SIZE (1 << CPU_TLB_BITS)
 
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index b41a248bee..9f4558cd83 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -15,6 +15,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE  4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 24
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 32
 #undef TCG_TARGET_STACK_GROWSUP
 
 typedef enum {
diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
index a38be15a39..ebe27991f3 100644
--- a/tcg/arm/tcg-target.h
+++ b/tcg/arm/tcg-target.h
@@ -60,6 +60,7 @@ extern int arm_arch;
 #undef TCG_TARGET_STACK_GROWSUP
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 8
 
 typedef enum {
 TCG_REG_R0 = 0,
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 73a15f7e80..456d57115c 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -162,6 +162,12 @@ extern bool have_popcnt;
 # define TCG_AREG0 TCG_REG_EBP
 #endif
 
+#if HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 28
+#else
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 27
+#endif
+
 static inline void flush_icache_range(uintptr_t start, uintptr_t stop)
 {
 }
diff --git a/tcg/ia64/tcg-target.h b/tcg/ia64/tcg-target.h
index 8f475fe742..35878e20c7 100644
--- a/tcg/ia64/tcg-target.h
+++ b/tcg/ia64/tcg-target.h
@@ -28,6 +28,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 16
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 21
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 32
 
 typedef struct {
 uint64_t lo __attribute__((aligned(16)));
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
index e9558d15bc..0c7c5cf64c 100644
--- a/tcg/mips/tcg-target.h
+++ b/tcg/mips/tcg-target.h
@@ -39,6 +39,12 @@
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
 #define TCG_TARGET_NB_REGS 32
 
+#if HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 12
+#else
+#define TCG_TARGET_TLB_MAX_INDEX_BITS 11
+#endif
+
 typedef enum {
 TCG_REG_ZERO = 0,
 TCG_REG_AT,
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
index 5a092b038a..82e10c9471 100644
--- a/tcg/ppc/tcg-target.h
+++ b/tcg/ppc/tcg-target.h
@@ -34,6 +34,7 @@
 #define TCG_TARGET_

[Qemu-devel] [RFC v3 PATCH 4/5] mttcg: Implement implicit ordering semantics

2017-08-29 Thread Pranith Kumar
Currently, we cannot use mttcg for running strong memory model guests
on weak memory model hosts due to missing ordering semantics.

We implicitly generate fence instructions for stronger guests if an
ordering mismatch is detected. We generate fences only for the orders
for which fence instructions are necessary, for example a fence is not
necessary between a store and a subsequent load on x86 since its
absence in the guest binary tells that ordering need not be
ensured. Also note that if we find multiple subsequent fence
instructions in the generated IR, we combine them in the TCG
optimization pass.

This patch allows us to boot an x86 guest on ARM64 hosts using mttcg.
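
To make the effect concrete, here is what the check computes for a
guest load when running an x86 guest on an AArch64 host, assuming
TCG_GUEST_DEFAULT_MO for x86 is TCG_MO_ALL & ~TCG_MO_ST_LD and
TCG_TARGET_DEFAULT_MO for aarch64 is 0 (the aarch64 value is from the
previous patch; the x86 value is an assumption):

type = TCG_MO_LD_LD | TCG_MO_ST_LD; /* orderings a load must honour */
type &= TCG_GUEST_DEFAULT_MO;       /* -> TCG_MO_LD_LD: x86 does not
                                       guarantee store->load order  */
type &= ~TCG_TARGET_DEFAULT_MO;     /* host guarantees nothing, so
                                       nothing is subtracted        */
tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC); /* fence is emitted */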

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 tcg/tcg-op.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 87f673ef49..688d91755b 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -28,6 +28,7 @@
 #include "exec/exec-all.h"
 #include "tcg.h"
 #include "tcg-op.h"
+#include "tcg-mo.h"
 #include "trace-tcg.h"
 #include "trace/mem.h"
 
@@ -2662,8 +2663,20 @@ static void gen_ldst_i64(TCGOpcode opc, TCGv_i64 val, 
TCGv addr,
 #endif
 }
 
+static void tcg_gen_req_mo(TCGBar type)
+{
+#ifdef TCG_GUEST_DEFAULT_MO
+type &= TCG_GUEST_DEFAULT_MO;
+#endif
+type &= ~TCG_TARGET_DEFAULT_MO;
+if (type) {
+tcg_gen_mb(type | TCG_BAR_SC);
+}
+}
+
 void tcg_gen_qemu_ld_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
+tcg_gen_req_mo(TCG_MO_LD_LD | TCG_MO_ST_LD);
 memop = tcg_canonicalize_memop(memop, 0, 0);
 trace_guest_mem_before_tcg(tcg_ctx.cpu, tcg_ctx.tcg_env,
addr, trace_mem_get_info(memop, 0));
@@ -2672,6 +2685,7 @@ void tcg_gen_qemu_ld_i32(TCGv_i32 val, TCGv addr, TCGArg 
idx, TCGMemOp memop)
 
 void tcg_gen_qemu_st_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
+tcg_gen_req_mo(TCG_MO_LD_ST | TCG_MO_ST_ST);
 memop = tcg_canonicalize_memop(memop, 0, 1);
 trace_guest_mem_before_tcg(tcg_ctx.cpu, tcg_ctx.tcg_env,
addr, trace_mem_get_info(memop, 1));
@@ -2680,6 +2694,7 @@ void tcg_gen_qemu_st_i32(TCGv_i32 val, TCGv addr, TCGArg 
idx, TCGMemOp memop)
 
 void tcg_gen_qemu_ld_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
+tcg_gen_req_mo(TCG_MO_LD_LD | TCG_MO_ST_LD);
 if (TCG_TARGET_REG_BITS == 32 && (memop & MO_SIZE) < MO_64) {
 tcg_gen_qemu_ld_i32(TCGV_LOW(val), addr, idx, memop);
 if (memop & MO_SIGN) {
@@ -2698,6 +2713,7 @@ void tcg_gen_qemu_ld_i64(TCGv_i64 val, TCGv addr, TCGArg 
idx, TCGMemOp memop)
 
 void tcg_gen_qemu_st_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
+tcg_gen_req_mo(TCG_MO_LD_ST | TCG_MO_ST_ST);
 if (TCG_TARGET_REG_BITS == 32 && (memop & MO_SIZE) < MO_64) {
 tcg_gen_qemu_st_i32(TCGV_LOW(val), addr, idx, memop);
 return;
-- 
2.13.0




[Qemu-devel] [PATCH 1/5] target/arm: Remove stale comment

2017-08-29 Thread Pranith Kumar
Update the comment which is not true since MTTCG.

Reviewed-by: Richard Henderson <r...@twiddle.net>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 target/arm/translate-a64.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 2200e25be0..f42b155d7d 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -2012,10 +2012,6 @@ static void disas_ldst_excl(DisasContext *s, uint32_t 
insn)
 }
 tcg_addr = read_cpu_reg_sp(s, rn, 1);
 
-/* Note that since TCG is single threaded load-acquire/store-release
- * semantics require no extra if (is_lasr) { ... } handling.
- */
-
 if (is_excl) {
 if (!is_store) {
 s->is_ldex = true;
-- 
2.13.0




[Qemu-devel] [RFC v3 PATCH 2/5] cpus-common: Cache allocated work items

2017-08-29 Thread Pranith Kumar
Using heaptrack, I found that quite a few of our temporary allocations
are coming from allocating work items. Instead of doing this
continuously, we can cache the allocated items and reuse them instead
of freeing them.

Stats from an ARM64 guest (boot+shutdown):

heaptrack stats(before):
allocations:1471317
leaked allocations: 73824
temporary allocations:  651293

heaptrack stats(after):
allocations:1143130
leaked allocations: 73693
temporary allocations:  487342

The improvement in speed is minor and within error margins; however, I
think the patch is still worthwhile. We can also explore atomics instead
of taking a lock for the work item pool.

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 cpus-common.c | 75 +++
 1 file changed, 60 insertions(+), 15 deletions(-)

diff --git a/cpus-common.c b/cpus-common.c
index 59f751ecf9..ccf5f50e4e 100644
--- a/cpus-common.c
+++ b/cpus-common.c
@@ -24,6 +24,7 @@
 #include "sysemu/cpus.h"
 
 static QemuMutex qemu_cpu_list_lock;
+static QemuMutex qemu_wi_pool_lock;
 static QemuCond exclusive_cond;
 static QemuCond exclusive_resume;
 static QemuCond qemu_work_cond;
@@ -33,6 +34,49 @@ static QemuCond qemu_work_cond;
  */
 static int pending_cpus;
 
+typedef struct qemu_work_item {
+struct qemu_work_item *next;
+run_on_cpu_func func;
+run_on_cpu_data data;
+bool free, exclusive, done;
+} qemu_work_item;
+
+typedef struct qemu_wi_pool {
+qemu_work_item *head;
+int num_items;
+} qemu_wi_pool;
+
+qemu_wi_pool *wi_free_pool;
+
+static void qemu_init_workitem_pool(void)
+{
+wi_free_pool = g_malloc0(sizeof(qemu_wi_pool));
+}
+
+static void qemu_wi_pool_insert(qemu_work_item *item)
+{
+qemu_mutex_lock(&qemu_wi_pool_lock);
+qemu_work_item *curr = atomic_read(&wi_free_pool->head);
+item->next = curr;
+wi_free_pool->head = item;
+qemu_mutex_unlock(&qemu_wi_pool_lock);
+}
+
+static qemu_work_item *qemu_wi_pool_remove(void)
+{
+qemu_mutex_lock(&qemu_wi_pool_lock);
+qemu_work_item *curr = atomic_read(&wi_free_pool->head);
+if (curr == NULL) {
+goto out;
+}
+wi_free_pool->head = curr->next;
+curr->next = NULL;
+
+ out:
+qemu_mutex_unlock(&qemu_wi_pool_lock);
+return curr;
+}
+
 void qemu_init_cpu_list(void)
 {
 /* This is needed because qemu_init_cpu_list is also called by the
@@ -43,6 +87,9 @@ void qemu_init_cpu_list(void)
 qemu_cond_init(&exclusive_cond);
 qemu_cond_init(&exclusive_resume);
 qemu_cond_init(&qemu_work_cond);
+
+qemu_init_workitem_pool();
+qemu_mutex_init(&qemu_wi_pool_lock);
 }
 
 void cpu_list_lock(void)
@@ -106,14 +153,7 @@ void cpu_list_remove(CPUState *cpu)
 qemu_mutex_unlock(&qemu_cpu_list_lock);
 }
 
-struct qemu_work_item {
-struct qemu_work_item *next;
-run_on_cpu_func func;
-run_on_cpu_data data;
-bool free, exclusive, done;
-};
-
-static void queue_work_on_cpu(CPUState *cpu, struct qemu_work_item *wi)
+static void queue_work_on_cpu(CPUState *cpu, qemu_work_item *wi)
 {
 qemu_mutex_lock(&cpu->work_mutex);
 if (cpu->queued_work_first == NULL) {
@@ -132,7 +172,7 @@ static void queue_work_on_cpu(CPUState *cpu, struct 
qemu_work_item *wi)
 void do_run_on_cpu(CPUState *cpu, run_on_cpu_func func, run_on_cpu_data data,
QemuMutex *mutex)
 {
-struct qemu_work_item wi;
+qemu_work_item wi;
 
 if (qemu_cpu_is_self(cpu)) {
 func(cpu, data);
@@ -156,9 +196,11 @@ void do_run_on_cpu(CPUState *cpu, run_on_cpu_func func, 
run_on_cpu_data data,
 
 void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, run_on_cpu_data 
data)
 {
-struct qemu_work_item *wi;
+qemu_work_item *wi = qemu_wi_pool_remove();
 
-wi = g_malloc0(sizeof(struct qemu_work_item));
+if (!wi) {
+wi = g_malloc0(sizeof(qemu_work_item));
+}
 wi->func = func;
 wi->data = data;
 wi->free = true;
@@ -299,9 +341,11 @@ void cpu_exec_end(CPUState *cpu)
 void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func func,
run_on_cpu_data data)
 {
-struct qemu_work_item *wi;
+qemu_work_item *wi = qemu_wi_pool_remove();
 
-wi = g_malloc0(sizeof(struct qemu_work_item));
+if (!wi) {
+wi = g_malloc0(sizeof(qemu_work_item));
+}
 wi->func = func;
 wi->data = data;
 wi->free = true;
@@ -312,7 +356,7 @@ void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func 
func,
 
 void process_queued_cpu_work(CPUState *cpu)
 {
-struct qemu_work_item *wi;
+qemu_work_item *wi;
 
 if (cpu->queued_work_first == NULL) {
 return;
@@ -343,7 +387,8 @@ void process_queued_cpu_work(CPUState *cpu)
 }
 qemu_mutex_lock(&cpu->work_mutex);
 if (wi->free) {
-g_free(wi);
+memset(wi, 0, sizeof(qemu_work_item));
+qemu_wi_pool_insert(wi);
 } else {
 atomic_mb_set(&wi->done, true);
 }
-- 
2.13.0




Re: [Qemu-devel] [RFC PATCH 2/3] cpus-common: Cache allocated work items

2017-08-28 Thread Pranith Kumar
On Mon, Aug 28, 2017 at 3:05 PM, Emilio G. Cota <c...@braap.org> wrote:
> On Sun, Aug 27, 2017 at 23:53:25 -0400, Pranith Kumar wrote:
>> Using heaptrack, I found that quite a few of our temporary allocations
>> are coming from allocating work items. Instead of doing this
>> continuously, we can cache the allocated items and reuse them instead
>> of freeing them.
>>
>> This reduces the number of allocations by 25% (20 -> 15 for
>> ARM64 boot+shutdown test).
>>
>
> But what is the perf difference, if any?
>
> Adding a lock (or a cmpxchg) here is not a great idea. However, this is not 
> yet
> immediately obvious because of other scalability bottlenecks. (if
> you boot many arm64 cores you'll see most of the time is spent idling
> on the BQL, see
>   https://lists.gnu.org/archive/html/qemu-devel/2017-08/msg05207.html )
>
> You're most likely better off using glib's slices, see
>   https://developer.gnome.org/glib/stable/glib-Memory-Slices.html
> These slices use per-thread lists, so scalability should be OK.

I think we should modify our g_malloc() to internally use this. Seems
like an idea worth trying out.
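
For reference, the slice API would slot in with minimal churn; a sketch:

#include <glib.h>

qemu_work_item *wi = g_slice_new0(qemu_work_item); /* per-thread magazines */
/* ... fill in func/data/flags and queue it ... */
g_slice_free(qemu_work_item, wi);                  /* back to the slab */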

>
> I also suggest profiling with either or both of jemalloc/tcmalloc
> (build with --enable-jemalloc/tcmalloc) in addition to using glibc's
> allocator, and then based on perf numbers decide whether this is something
> worth optimizing.
>

OK, I will try to get some perf numbers.

-- 
Pranith



Re: [Qemu-devel] [RFC PATCH 2/3] cpus-common: Cache allocated work items

2017-08-28 Thread Pranith Kumar
On Mon, Aug 28, 2017 at 1:47 PM, Richard Henderson
<richard.hender...@linaro.org> wrote:
> On 08/27/2017 08:53 PM, Pranith Kumar wrote:
>> Using heaptrack, I found that quite a few of our temporary allocations
>> are coming from allocating work items. Instead of doing this
>> continously, we can cache the allocated items and reuse them instead
>> of freeing them.
>>
>> This reduces the number of allocations by 25% (20 -> 15 for
>> ARM64 boot+shutdown test).
>>
>> Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
>
> Why does this list need to record a "last" element?
> It would seem a simple lifo would be sufficient.
>
> (You would also be able to manage the list via cmpxchg without a separate 
> lock,
> but perhaps the difference between the two isn't measurable.)
>

Yes, seems like a better design choice. Will fix in next iteration.
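
For the record, the push side looks straightforward with QEMU's
atomic_cmpxchg (which returns the previous value); a sketch:

static void qemu_wi_pool_insert(qemu_work_item *item)
{
    qemu_work_item *old;

    do {
        old = atomic_read(&wi_free_pool->head);
        item->next = old;
    } while (atomic_cmpxchg(&wi_free_pool->head, old, item) != old);
}

The pop side is the tricky part: a single-cmpxchg LIFO that recycles
nodes is exposed to the classic ABA problem, so it would need a
tag/version word or has to keep the lock.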

Thanks,
-- 
Pranith



Re: [Qemu-devel] [RFC PATCH 3/3] mttcg: Implement implicit ordering semantics

2017-08-28 Thread Pranith Kumar
On Mon, Aug 28, 2017 at 1:57 PM, Richard Henderson <r...@twiddle.net> wrote:
> On 08/27/2017 08:53 PM, Pranith Kumar wrote:
>> diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
>> index 55a46ac825..b41a248bee 100644
>> --- a/tcg/aarch64/tcg-target.h
>> +++ b/tcg/aarch64/tcg-target.h
>> @@ -117,4 +117,6 @@ static inline void flush_icache_range(uintptr_t start, 
>> uintptr_t stop)
>>  __builtin___clear_cache((char *)start, (char *)stop);
>>  }
>>
>> +#define TCG_TARGET_DEFAULT_MO (0)
>> +
>>  #endif /* AARCH64_TCG_TARGET_H */
>
> Please add all of these in one patch, separate from the tcg-op.c changes.
> We should also just make this mandatory and remove any related #ifdefs.

I tried looking up the ordering semantics for architectures like ia64
and s390, and it is not really clear. I think every arch except x86 can
be defined as weak, even though archs like sparc can also be configured
as TSO. Is this right?

>
>> +void tcg_gen_req_mo(TCGBar type)
>
> static, until we find that we need it somewhere else.
>

Will fix.

>> +#if defined(TCG_GUEST_DEFAULT_MO) && defined(TCG_TARGET_DEFAULT_MO)
>> +TCGBar order_mismatch = type & (TCG_GUEST_DEFAULT_MO & 
>> ~TCG_TARGET_DEFAULT_MO);
>> +if (order_mismatch) {
>> +tcg_gen_mb(order_mismatch | TCG_BAR_SC);
>> +}
>> +#else
>> +tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);
>> +#endif
>
> Hmm.  How about
>
> static void tcg_gen_reg_mo(TCGBar type)
> {
> #ifdef TCG_GUEST_DEFAULT_MO
> type &= TCG_GUEST_DEFAULT_MO;
> #endif
> #ifdef TCG_TARGET_DEFAULT_MO
> type &= ~TCG_TARGET_DEFAULT_MO;
> #endif
> if (type) {
> tcg_gen_mb(type | TCG_BAR_SC);
> }
> }

Yes, this looks better and until we can get all the possible
definitions of TCG_GUEST_DEFAULT_MO and TCG_TARGET_DEFAULT_MO figured
out I would like to keep the #ifdefs.

>
> Just because one of those is undefined doesn't mean we can't infer tighter
> barriers from the others, including the initial value of TYPE.
>
>>  void tcg_gen_qemu_ld_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp 
>> memop)
>>  {
>> +tcg_gen_req_mo(TCG_MO_LD_LD | TCG_MO_LD_ST);
>>  memop = tcg_canonicalize_memop(memop, 0, 0);
>
> You're putting the barrier before the load, so that should be
>
>   TCG_MO_LD_LD | TCG_MO_ST_LD
>
> i.e.  TCG_MO_<any>_LD
>
> If you were putting the barrier afterward (an equally reasonable option), 
> you'd
> reverse that and use what you have above.

OK, will fix this. Thanks for the review.
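
(To convince myself about the direction, here is a standalone toy that
computes the fence mask for a barrier placed before a load; the
TCG_MO_* values mirror tcg-mo.h, and x86-on-aarch64 is assumed:)

    #include <stdio.h>

    enum {  /* TCG_MO_<earlier op>_<later op>, as in tcg-mo.h */
        TCG_MO_LD_LD = 0x01,
        TCG_MO_ST_LD = 0x02,
        TCG_MO_LD_ST = 0x04,
        TCG_MO_ST_ST = 0x08,
        TCG_MO_ALL   = 0x0f,
    };

    int main(void)
    {
        int guest_mo  = TCG_MO_ALL & ~TCG_MO_ST_LD; /* x86: TSO */
        int target_mo = 0;                          /* aarch64: weak */

        /* a barrier before a load upholds the <earlier>_LD orderings */
        int type = TCG_MO_LD_LD | TCG_MO_ST_LD;
        int need = type & (guest_mo & ~target_mo);

        /* prints 0x1: only LD_LD -- x86 does not order ST->LD, so no
         * fence is needed for that direction */
        printf("fence mask before loads: 0x%x\n", need);
        return 0;
    }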

-- 
Pranith



[Qemu-devel] [PATCH 1/3] target/arm: Remove stale comment

2017-08-27 Thread Pranith Kumar
Remove the comment, which is no longer true since MTTCG.

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 target/arm/translate-a64.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 2200e25be0..f42b155d7d 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -2012,10 +2012,6 @@ static void disas_ldst_excl(DisasContext *s, uint32_t 
insn)
 }
 tcg_addr = read_cpu_reg_sp(s, rn, 1);
 
-/* Note that since TCG is single threaded load-acquire/store-release
- * semantics require no extra if (is_lasr) { ... } handling.
- */
-
 if (is_excl) {
 if (!is_store) {
 s->is_ldex = true;
-- 
2.13.0




[Qemu-devel] [RFC PATCH 2/3] cpus-common: Cache allocated work items

2017-08-27 Thread Pranith Kumar
Using heaptrack, I found that quite a few of our temporary allocations
are coming from allocating work items. Instead of doing this
continuously, we can cache the allocated items and reuse them instead
of freeing them.

This reduces the number of allocations by 25% (20 -> 15 for
ARM64 boot+shutdown test).

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 cpus-common.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 70 insertions(+), 15 deletions(-)

diff --git a/cpus-common.c b/cpus-common.c
index 59f751ecf9..a1c4c7d1a3 100644
--- a/cpus-common.c
+++ b/cpus-common.c
@@ -24,6 +24,7 @@
 #include "sysemu/cpus.h"
 
 static QemuMutex qemu_cpu_list_lock;
+static QemuMutex qemu_wi_pool_lock;
 static QemuCond exclusive_cond;
 static QemuCond exclusive_resume;
 static QemuCond qemu_work_cond;
@@ -33,6 +34,58 @@ static QemuCond qemu_work_cond;
  */
 static int pending_cpus;
 
+typedef struct qemu_work_item {
+struct qemu_work_item *next;
+run_on_cpu_func func;
+run_on_cpu_data data;
+bool free, exclusive, done;
+} qemu_work_item;
+
+typedef struct qemu_wi_pool {
+qemu_work_item *first, *last;
+} qemu_wi_pool;
+
+qemu_wi_pool *wi_free_pool;
+
+static void qemu_init_workitem_pool(void)
+{
+wi_free_pool = g_malloc0(sizeof(qemu_wi_pool));
+wi_free_pool->first = NULL;
+wi_free_pool->last  = NULL;
+}
+
+static void qemu_wi_pool_insert(qemu_work_item *item)
+{
+qemu_mutex_lock(&qemu_wi_pool_lock);
+if (wi_free_pool->last == NULL) {
+wi_free_pool->first = item;
+wi_free_pool->last = item;
+} else {
+wi_free_pool->last->next = item;
+wi_free_pool->last = item;
+}
+qemu_mutex_unlock(&qemu_wi_pool_lock);
+}
+
+static qemu_work_item* qemu_wi_pool_remove(void)
+{
+qemu_mutex_lock(&qemu_wi_pool_lock);
+qemu_work_item *ret = wi_free_pool->first;
+
+if (ret == NULL)
+goto out;
+
+wi_free_pool->first = ret->next;
+if (wi_free_pool->last == ret) {
+wi_free_pool->last = NULL;
+}
+ret->next = NULL;
+
+ out:
+qemu_mutex_unlock(&qemu_wi_pool_lock);
+return ret;
+}
+
 void qemu_init_cpu_list(void)
 {
 /* This is needed because qemu_init_cpu_list is also called by the
@@ -43,6 +96,9 @@ void qemu_init_cpu_list(void)
 qemu_cond_init(&exclusive_cond);
 qemu_cond_init(&exclusive_resume);
 qemu_cond_init(&qemu_work_cond);
+
+qemu_init_workitem_pool();
+qemu_mutex_init(&qemu_wi_pool_lock);
 }
 
 void cpu_list_lock(void)
@@ -106,14 +162,7 @@ void cpu_list_remove(CPUState *cpu)
 qemu_mutex_unlock(&qemu_cpu_list_lock);
 }
 
-struct qemu_work_item {
-struct qemu_work_item *next;
-run_on_cpu_func func;
-run_on_cpu_data data;
-bool free, exclusive, done;
-};
-
-static void queue_work_on_cpu(CPUState *cpu, struct qemu_work_item *wi)
+static void queue_work_on_cpu(CPUState *cpu, qemu_work_item *wi)
 {
 qemu_mutex_lock(&cpu->work_mutex);
 if (cpu->queued_work_first == NULL) {
@@ -132,7 +181,7 @@ static void queue_work_on_cpu(CPUState *cpu, struct 
qemu_work_item *wi)
 void do_run_on_cpu(CPUState *cpu, run_on_cpu_func func, run_on_cpu_data data,
QemuMutex *mutex)
 {
-struct qemu_work_item wi;
+qemu_work_item wi;
 
 if (qemu_cpu_is_self(cpu)) {
 func(cpu, data);
@@ -145,6 +194,7 @@ void do_run_on_cpu(CPUState *cpu, run_on_cpu_func func, 
run_on_cpu_data data,
 wi.free = false;
 wi.exclusive = false;
 
+
 queue_work_on_cpu(cpu, &wi);
 while (!atomic_mb_read(&wi.done)) {
 CPUState *self_cpu = current_cpu;
@@ -156,9 +206,11 @@ void do_run_on_cpu(CPUState *cpu, run_on_cpu_func func, 
run_on_cpu_data data,
 
 void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, run_on_cpu_data 
data)
 {
-struct qemu_work_item *wi;
+qemu_work_item *wi = qemu_wi_pool_remove();
 
-wi = g_malloc0(sizeof(struct qemu_work_item));
+if (!wi) {
+wi = g_malloc0(sizeof(qemu_work_item));
+}
 wi->func = func;
 wi->data = data;
 wi->free = true;
@@ -299,9 +351,11 @@ void cpu_exec_end(CPUState *cpu)
 void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func func,
run_on_cpu_data data)
 {
-struct qemu_work_item *wi;
+qemu_work_item *wi = qemu_wi_pool_remove();
 
-wi = g_malloc0(sizeof(struct qemu_work_item));
+if (!wi) {
+wi = g_malloc0(sizeof(qemu_work_item));
+}
 wi->func = func;
 wi->data = data;
 wi->free = true;
@@ -312,7 +366,7 @@ void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func 
func,
 
 void process_queued_cpu_work(CPUState *cpu)
 {
-struct qemu_work_item *wi;
+qemu_work_item *wi;
 
 if (cpu->queued_work_first == NULL) {
 return;
@@ -343,7 +397,8 @@ void process_queued_cpu_work(CPUState *cpu)
 }
 qemu_mutex_lock(&cpu->work_mutex);
 if (wi->free) {
-g_free(wi);
+  

[Qemu-devel] [RFC PATCH 3/3] mttcg: Implement implicit ordering semantics

2017-08-27 Thread Pranith Kumar
Currently, we cannot use mttcg for running strong memory model guests
on weak memory model hosts due to missing ordering semantics.

We implicitly generate fence instructions for stronger guests if an
ordering mismatch is detected. We generate fences only for the orders
for which fence instructions are necessary: for example, a fence is not
necessary between a store and a subsequent load on x86, since the
absence of a fence in the guest binary tells us that this ordering does
not need to be enforced. Also note that if we find multiple subsequent
fence instructions in the generated IR, we combine them in the TCG
optimization pass.

This patch allows us to boot an x86 guest on ARM64 hosts using mttcg.
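
(For concreteness, the guest side of the mismatch check is just a
per-target define; for x86 it would presumably end up as something
like the following -- a sketch, the exact header placement is TBD:)

    /* target/i386 (sketch): x86 is TSO -- every ordering is guaranteed
     * except a store followed by a load, so only ST_LD is left out. */
    #define TCG_GUEST_DEFAULT_MO  (TCG_MO_ALL & ~TCG_MO_ST_LD)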

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 tcg/aarch64/tcg-target.h |  2 ++
 tcg/arm/tcg-target.h |  2 ++
 tcg/mips/tcg-target.h|  2 ++
 tcg/ppc/tcg-target.h |  2 ++
 tcg/tcg-op.c | 17 +++++++++++++++++
 tcg/tcg-op.h |  1 +
 6 files changed, 26 insertions(+)

diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 55a46ac825..b41a248bee 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -117,4 +117,6 @@ static inline void flush_icache_range(uintptr_t start, 
uintptr_t stop)
 __builtin___clear_cache((char *)start, (char *)stop);
 }
 
+#define TCG_TARGET_DEFAULT_MO (0)
+
 #endif /* AARCH64_TCG_TARGET_H */
diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
index 5ef1086710..a38be15a39 100644
--- a/tcg/arm/tcg-target.h
+++ b/tcg/arm/tcg-target.h
@@ -134,4 +134,6 @@ static inline void flush_icache_range(uintptr_t start, 
uintptr_t stop)
 __builtin___clear_cache((char *) start, (char *) stop);
 }
 
+#define TCG_TARGET_DEFAULT_MO (0)
+
 #endif
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
index d75cb63ed3..e9558d15bc 100644
--- a/tcg/mips/tcg-target.h
+++ b/tcg/mips/tcg-target.h
@@ -206,4 +206,6 @@ static inline void flush_icache_range(uintptr_t start, 
uintptr_t stop)
 cacheflush ((void *)start, stop-start, ICACHE);
 }
 
+#define TCG_TARGET_DEFAULT_MO (0)
+
 #endif
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
index 5f4a40a5b4..5a092b038a 100644
--- a/tcg/ppc/tcg-target.h
+++ b/tcg/ppc/tcg-target.h
@@ -125,4 +125,6 @@ extern bool have_isa_3_00;
 
 void flush_icache_range(uintptr_t start, uintptr_t stop);
 
+#define TCG_TARGET_DEFAULT_MO (0)
+
 #endif
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 87f673ef49..085fe66fb2 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -28,6 +28,7 @@
 #include "exec/exec-all.h"
 #include "tcg.h"
 #include "tcg-op.h"
+#include "tcg-mo.h"
 #include "trace-tcg.h"
 #include "trace/mem.h"
 
@@ -2662,8 +2663,21 @@ static void gen_ldst_i64(TCGOpcode opc, TCGv_i64 val, 
TCGv addr,
 #endif
 }
 
+void tcg_gen_req_mo(TCGBar type)
+{
+#if defined(TCG_GUEST_DEFAULT_MO) && defined(TCG_TARGET_DEFAULT_MO)
+TCGBar order_mismatch = type & (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO);
+if (order_mismatch) {
+tcg_gen_mb(order_mismatch | TCG_BAR_SC);
+}
+#else
+tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);
+#endif
+}
+
 void tcg_gen_qemu_ld_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
+tcg_gen_req_mo(TCG_MO_LD_LD | TCG_MO_LD_ST);
 memop = tcg_canonicalize_memop(memop, 0, 0);
 trace_guest_mem_before_tcg(tcg_ctx.cpu, tcg_ctx.tcg_env,
addr, trace_mem_get_info(memop, 0));
@@ -2672,6 +2686,7 @@ void tcg_gen_qemu_ld_i32(TCGv_i32 val, TCGv addr, TCGArg 
idx, TCGMemOp memop)
 
 void tcg_gen_qemu_st_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
+tcg_gen_req_mo(TCG_MO_ST_LD | TCG_MO_ST_ST);
 memop = tcg_canonicalize_memop(memop, 0, 1);
 trace_guest_mem_before_tcg(tcg_ctx.cpu, tcg_ctx.tcg_env,
addr, trace_mem_get_info(memop, 1));
@@ -2680,6 +2695,7 @@ void tcg_gen_qemu_st_i32(TCGv_i32 val, TCGv addr, TCGArg 
idx, TCGMemOp memop)
 
 void tcg_gen_qemu_ld_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
+tcg_gen_req_mo(TCG_MO_LD_LD | TCG_MO_LD_ST);
 if (TCG_TARGET_REG_BITS == 32 && (memop & MO_SIZE) < MO_64) {
 tcg_gen_qemu_ld_i32(TCGV_LOW(val), addr, idx, memop);
 if (memop & MO_SIGN) {
@@ -2698,6 +2714,7 @@ void tcg_gen_qemu_ld_i64(TCGv_i64 val, TCGv addr, TCGArg 
idx, TCGMemOp memop)
 
 void tcg_gen_qemu_st_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
+tcg_gen_req_mo(TCG_MO_ST_LD | TCG_MO_ST_ST);
 if (TCG_TARGET_REG_BITS == 32 && (memop & MO_SIZE) < MO_64) {
 tcg_gen_qemu_st_i32(TCGV_LOW(val), addr, idx, memop);
 return;
diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 5d3278f243..6ad2c6d60e 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -262,6 +262,7 @@ static inline void tcg_gen_br(TCGLabel *l)
 }
 
 void tcg_gen_mb(TCGBar);
+void tcg_gen_req_mo(TCGBar type);
 
 /* Helper calls. */
 
-- 
2.13.0




Re: [Qemu-devel] [PATCH v4 11/43] tcg: define CF_PARALLEL and use it for TB hashing along with CF_COUNT_MASK

2017-08-27 Thread Pranith Kumar
Hi Emilio,

On Fri, Jul 21, 2017 at 1:59 AM, Emilio G. Cota  wrote:
> This will enable us to decouple code translation from the value
> of parallel_cpus at any given time. It will also help us minimize
> TB flushes when generating code via EXCP_ATOMIC.
>
> Note that the declaration of parallel_cpus is brought to exec-all.h
> to be able to define there the "curr_cflags" inline.
>
> Signed-off-by: Emilio G. Cota 

I was testing a winxp image today and I bisected an infinite loop to
this commit. The loop happens both with and without mttcg, so I think
it has got to do with something else.

It seems to be executing this instruction:

Trace 0x7fffc1d036c0 [0: 000c9a6b]

IN:
0x000c9a6b:  rep movsb %cs:(%si),%es:(%di)

and never stops.

You can find an execution log here: http://pranith.org/files/log_n.txt.gz

Let me know if you need more information.

Thanks,

> ---
>  include/exec/exec-all.h   | 20 +++-
>  include/exec/tb-hash-xx.h |  9 ++---
>  include/exec/tb-hash.h|  4 ++--
>  include/exec/tb-lookup.h  |  6 +++---
>  tcg/tcg.h |  1 -
>  accel/tcg/cpu-exec.c  | 45 +++--
>  accel/tcg/translate-all.c | 13 +
>  exec.c|  2 +-
>  tcg/tcg-runtime.c |  2 +-
>  tests/qht-bench.c |  2 +-
>  10 files changed, 65 insertions(+), 39 deletions(-)
>
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index 8cbd90b..f6a928d 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -353,6 +353,9 @@ struct TranslationBlock {
>  #define CF_USE_ICOUNT  0x20000
>  #define CF_IGNORE_ICOUNT 0x40000 /* Do not generate icount code */
>  #define CF_INVALID 0x80000 /* TB is stale. Setters must acquire tb_lock */
> +#define CF_PARALLEL    0x100000 /* Generate code for a parallel context */
> +/* cflags' mask for hashing/comparison */
> +#define CF_HASH_MASK (CF_COUNT_MASK | CF_PARALLEL)
>
>  /* Per-vCPU dynamic tracing state used to generate this TB */
>  uint32_t trace_vcpu_dstate;
> @@ -396,11 +399,26 @@ struct TranslationBlock {
>  uintptr_t jmp_list_first;
>  };
>
> +extern bool parallel_cpus;
> +
> +/* Hide the atomic_read to make code a little easier on the eyes */
> +static inline uint32_t tb_cflags(const TranslationBlock *tb)
> +{
> +return atomic_read(&tb->cflags);
> +}
> +
> +/* current cflags for hashing/comparison */
> +static inline uint32_t curr_cflags(void)
> +{
> +return parallel_cpus ? CF_PARALLEL : 0;
> +}
> +
>  void tb_free(TranslationBlock *tb);
>  void tb_flush(CPUState *cpu);
>  void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
>  TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
> -   target_ulong cs_base, uint32_t flags);
> +   target_ulong cs_base, uint32_t flags,
> +   uint32_t cf_mask);
>
>  #if defined(USE_DIRECT_JUMP)
>
> diff --git a/include/exec/tb-hash-xx.h b/include/exec/tb-hash-xx.h
> index 6cd3022..747a9a6 100644
> --- a/include/exec/tb-hash-xx.h
> +++ b/include/exec/tb-hash-xx.h
> @@ -48,8 +48,8 @@
>   * xxhash32, customized for input variables that are not guaranteed to be
>   * contiguous in memory.
>   */
> -static inline
> -uint32_t tb_hash_func6(uint64_t a0, uint64_t b0, uint32_t e, uint32_t f)
> +static inline uint32_t
> +tb_hash_func7(uint64_t a0, uint64_t b0, uint32_t e, uint32_t f, uint32_t g)
>  {
>  uint32_t v1 = TB_HASH_XX_SEED + PRIME32_1 + PRIME32_2;
>  uint32_t v2 = TB_HASH_XX_SEED + PRIME32_2;
> @@ -78,7 +78,7 @@ uint32_t tb_hash_func6(uint64_t a0, uint64_t b0, uint32_t 
> e, uint32_t f)
>  v4 *= PRIME32_1;
>
>  h32 = rol32(v1, 1) + rol32(v2, 7) + rol32(v3, 12) + rol32(v4, 18);
> -h32 += 24;
> +h32 += 28;
>
>  h32 += e * PRIME32_3;
>  h32  = rol32(h32, 17) * PRIME32_4;
> @@ -86,6 +86,9 @@ uint32_t tb_hash_func6(uint64_t a0, uint64_t b0, uint32_t 
> e, uint32_t f)
>  h32 += f * PRIME32_3;
>  h32  = rol32(h32, 17) * PRIME32_4;
>
> +h32 += g * PRIME32_3;
> +h32  = rol32(h32, 17) * PRIME32_4;
> +
>  h32 ^= h32 >> 15;
>  h32 *= PRIME32_2;
>  h32 ^= h32 >> 13;
> diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
> index 17b5ee0..0526c4f 100644
> --- a/include/exec/tb-hash.h
> +++ b/include/exec/tb-hash.h
> @@ -59,9 +59,9 @@ static inline unsigned int 
> tb_jmp_cache_hash_func(target_ulong pc)
>
>  static inline
>  uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint32_t 
> flags,
> -  uint32_t trace_vcpu_dstate)
> +  uint32_t cf_mask, uint32_t trace_vcpu_dstate)
>  {
> -return tb_hash_func6(phys_pc, pc, flags, trace_vcpu_dstate);
> +return tb_hash_func7(phys_pc, pc, flags, cf_mask, trace_vcpu_dstate);
>  }
>
>  #endif
> diff --git a/include/exec/tb-lookup.h 

Re: [Qemu-devel] [RFC v2 PATCH] tcg/softmmu: Increase size of TLB caches

2017-08-24 Thread Pranith Kumar
On Thu, Aug 24, 2017 at 11:58 AM, Pranith Kumar <bobby.pr...@gmail.com> wrote:
> This patch increases the number of entries cached in the TLB. I went
> over a few architectures to see if increasing it is problematic. Only
> armv6 seems to have a limitation that only 8 bits can be used for
> indexing these entries. For other architectures, the number of TLB
> entries is increased to a 4K-sized cache. The patch also doubles the
> number of victim TLB entries.
>
> A few statistics collected from a build benchmark for various cache
> sizes are below:
>
> | TLB bits\vTLB entries | 8 |16  |32 |
> | 8 | 952.94(+0.0%) | 929.99(+2.4%)  | 919.02(+3.6%) |
> |10 | 898.92(+5.6%) | 886.13(+7.0%)  | 887.03(+6.9%) |
> |12 | 878.56(+7.8%) | 873.53(+8.3%)* | 875.34(+8.1%) |
>
> The best combination for this workload came out to be 12 bits for the
> TLB and a 16 entry vTLB cache.

You can find the raw data here: http://paste.ubuntu.com/25383585/

Thanks,
-- 
Pranith



[Qemu-devel] [RFC v2 PATCH] tcg/softmmu: Increase size of TLB caches

2017-08-24 Thread Pranith Kumar
This patch increases the number of entries cached in the TLB. I went
over a few architectures to see if increasing it is problematic. Only
armv6 seems to have a limitation that only 8 bits can be used for
indexing these entries. For other architectures, the number of TLB
entries is increased to a 4K-sized cache. The patch also doubles the
number of victim TLB entries.

A few statistics collected from a build benchmark for various cache
sizes are below:

| TLB bits\vTLB entries | 8 |16  |32 |
| 8 | 952.94(+0.0%) | 929.99(+2.4%)  | 919.02(+3.6%) |
|10 | 898.92(+5.6%) | 886.13(+7.0%)  | 887.03(+6.9%) |
|12 | 878.56(+7.8%) | 873.53(+8.3%)* | 875.34(+8.1%) |

The best combination for this workload came out to be 12 bits for the
TLB and a 16 entry vTLB cache.
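
(For context, CPU_TLB_BITS directly sizes the direct-mapped lookup, so
going from 8 to 12 bits is a 16x larger table per MMU mode; a sketch
of the index computation, names loosely following cputlb.c:)

    #include <stdint.h>

    /* pick the TLB slot for a guest virtual address */
    static inline uintptr_t tlb_index(uint64_t addr,
                                      unsigned tlb_bits,  /* CPU_TLB_BITS */
                                      unsigned page_bits) /* TARGET_PAGE_BITS */
    {
        return (addr >> page_bits) & (((uintptr_t)1 << tlb_bits) - 1);
    }
    /* 8 bits -> 256 entries, 12 bits -> 4096 entries per MMU mode */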

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 include/exec/cpu-defs.h  | 6 +++---
 tcg/aarch64/tcg-target.h | 1 +
 tcg/arm/tcg-target.h | 1 +
 tcg/i386/tcg-target.h| 2 ++
 tcg/ia64/tcg-target.h| 1 +
 tcg/mips/tcg-target.h| 1 +
 tcg/ppc/tcg-target.h | 1 +
 tcg/s390/tcg-target.h| 1 +
 tcg/sparc/tcg-target.h   | 1 +
 tcg/tci/tcg-target.h | 1 +
 10 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index bc8e7f848d..a5e1ad6cea 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -57,8 +57,8 @@ typedef uint64_t target_ulong;
 #endif
 
 #if !defined(CONFIG_USER_ONLY) && defined(CONFIG_TCG)
-/* use a fully associative victim tlb of 8 entries */
-#define CPU_VTLB_SIZE 8
+/* use a fully associative victim tlb of 16 entries */
+#define CPU_VTLB_SIZE 16
 
 #if HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32
 #define CPU_TLB_ENTRY_BITS 4
@@ -89,7 +89,7 @@ typedef uint64_t target_ulong;
  * of tlb_table inside env (which is non-trivial but not huge).
  */
 #define CPU_TLB_BITS \
-MIN(8,   \
+MIN(CPU_TLB_BITS_MAX,\
 TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS -  \
 (NB_MMU_MODES <= 1 ? 0 : \
  NB_MMU_MODES <= 2 ? 1 : \
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 55a46ac825..f428e09c98 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -15,6 +15,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE  4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 24
+#define CPU_TLB_BITS_MAX 12
 #undef TCG_TARGET_STACK_GROWSUP
 
 typedef enum {
diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
index 5ef1086710..69414be393 100644
--- a/tcg/arm/tcg-target.h
+++ b/tcg/arm/tcg-target.h
@@ -60,6 +60,7 @@ extern int arm_arch;
 #undef TCG_TARGET_STACK_GROWSUP
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
+#define CPU_TLB_BITS_MAX 8
 
 typedef enum {
 TCG_REG_R0 = 0,
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 73a15f7e80..35c27a977b 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -162,6 +162,8 @@ extern bool have_popcnt;
 # define TCG_AREG0 TCG_REG_EBP
 #endif
 
+#define CPU_TLB_BITS_MAX 12
+
 static inline void flush_icache_range(uintptr_t start, uintptr_t stop)
 {
 }
diff --git a/tcg/ia64/tcg-target.h b/tcg/ia64/tcg-target.h
index 901bb7575d..fd713f7adf 100644
--- a/tcg/ia64/tcg-target.h
+++ b/tcg/ia64/tcg-target.h
@@ -28,6 +28,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 16
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 21
+#define CPU_TLB_BITS_MAX 8
 
 typedef struct {
 uint64_t lo __attribute__((aligned(16)));
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
index d75cb63ed3..fd9046b7ad 100644
--- a/tcg/mips/tcg-target.h
+++ b/tcg/mips/tcg-target.h
@@ -37,6 +37,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
+#define CPU_TLB_BITS_MAX 12
 #define TCG_TARGET_NB_REGS 32
 
 typedef enum {
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
index 5f4a40a5b4..f5071f706d 100644
--- a/tcg/ppc/tcg-target.h
+++ b/tcg/ppc/tcg-target.h
@@ -34,6 +34,7 @@
 #define TCG_TARGET_NB_REGS 32
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
+#define CPU_TLB_BITS_MAX 8
 
 typedef enum {
 TCG_REG_R0,  TCG_REG_R1,  TCG_REG_R2,  TCG_REG_R3,
diff --git a/tcg/s390/tcg-target.h b/tcg/s390/tcg-target.h
index 957f0c0afe..218be322ad 100644
--- a/tcg/s390/tcg-target.h
+++ b/tcg/s390/tcg-target.h
@@ -27,6 +27,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 2
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 19
+#define CPU_TLB_BITS_MAX 12
 
 typedef enum TCGReg {
 TCG_REG_R0 = 0,
diff --git a/tcg/sparc/tcg-target.h b/tcg/sparc/tcg-target.h
index 854a0afd70..9fd59a64f2 100644
--- a/tcg/sparc/tcg-target.h
+++ b/tcg/sparc/tcg-target.h

Re: [Qemu-devel] [PATCH v2 10/13] vvfat: correctly generate numeric-tail of short file names

2017-08-08 Thread Pranith Kumar
On Mon, Aug 7, 2017 at 7:07 AM, Eric Blake <ebl...@redhat.com> wrote:
> On 08/05/2017 01:52 PM, Pranith Kumar wrote:
>> FYI,
>>
>> This commit breaks the build with gcc-7:
>>
>>   CC  block/vvfat.o
>> qemu/block/vvfat.c: In function ‘read_directory’:
>> qemu/block/vvfat.c:605:37: error: ‘__builtin___sprintf_chk’ may write
>> a terminating nul past the end of the destination
>> [-Werror=format-overflow=]
>>  int len = sprintf(tail, "~%d", i);
>>  ^
>
> Doesn't commit 7c8730d45f6 fix that?

Indeed it does. I hadn't rebased my branch before reporting. Sorry for
the noise!

-- 
Pranith



Re: [Qemu-devel] [PATCH v2 10/13] vvfat: correctly generate numeric-tail of short file names

2017-08-05 Thread Pranith Kumar
FYI,

This commit breaks the build with gcc-7:

  CC  block/vvfat.o
qemu/block/vvfat.c: In function ‘read_directory’:
qemu/block/vvfat.c:605:37: error: ‘__builtin___sprintf_chk’ may write
a terminating nul past the end of the destination
[-Werror=format-overflow=]
 int len = sprintf(tail, "~%d", i);
 ^
In file included from /usr/include/stdio.h:938:0,
 from qemu/include/qemu/osdep.h:68,
 from qemu/block/vvfat.c:25:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:33:10: note:
‘__builtin___sprintf_chk’ output between 3 and 12 bytes into a
destination of size 11
   return __builtin___sprintf_chk (__s, __USE_FORTIFY_LEVEL - 1,
  ^~
   __bos (__s), __fmt, __va_arg_pack ());
   ~
cc1: all warnings being treated as errors
qemu/rules.mak:66: recipe for target 'block/vvfat.o' failed
make: *** [block/vvfat.o] Error 1


Thanks,

On Mon, May 22, 2017 at 5:12 PM, Hervé Poussineau  wrote:
> More specifically:
> - try without numeric-tail only if LFN didn't have invalid short chars
> - start at ~1 (instead of ~0)
> - handle case if numeric tail is more than one char (ie > 10)
>
> Windows 9x Scandisk doesn't see anymore mismatches between short file names
> and long file names for non-ASCII filenames.
>
> Specification: "FAT: General overview of on-disk format" v1.03, page 31
> Signed-off-by: Hervé Poussineau 
> ---
>  block/vvfat.c | 62 +++++++++++++++++++++++++++++---------------------------------
>  1 file changed, 29 insertions(+), 33 deletions(-)
>
> diff --git a/block/vvfat.c b/block/vvfat.c
> index 3cb65493cd..414bca6dee 100644
> --- a/block/vvfat.c
> +++ b/block/vvfat.c
> @@ -529,7 +529,8 @@ static uint8_t to_valid_short_char(gunichar c)
>  }
>
>  static direntry_t *create_short_filename(BDRVVVFATState *s,
> - const char *filename)
> + const char *filename,
> + unsigned int directory_start)
>  {
>  int i, j = 0;
>  direntry_t *entry = array_get_next(&(s->directory));
> @@ -587,8 +588,32 @@ static direntry_t *create_short_filename(BDRVVVFATState 
> *s,
>  }
>  }
>  }
> -(void)lossy_conversion;
> -return entry;
> +
> +/* numeric-tail generation */
> +for (j = 0; j < 8; j++) {
> +if (entry->name[j] == ' ') {
> +break;
> +}
> +}
> +for (i = lossy_conversion ? 1 : 0; i < 99; i++) {
> +direntry_t *entry1;
> +if (i > 0) {
> +int len = sprintf(tail, "~%d", i);
> +memcpy(entry->name + MIN(j, 8 - len), tail, len);
> +}
> +for (entry1 = array_get(&(s->directory), directory_start);
> + entry1 < entry; entry1++) {
> +if (!is_long_name(entry1) &&
> +!memcmp(entry1->name, entry->name, 11)) {
> +break; /* found dupe */
> +}
> +}
> +if (entry1 == entry) {
> +/* no dupe found */
> +return entry;
> +}
> +}
> +return NULL;
>  }
>
>  /* fat functions */
> @@ -701,36 +726,7 @@ static inline direntry_t* 
> create_short_and_long_name(BDRVVVFATState* s,
>  }
>
>  entry_long=create_long_filename(s,filename);
> -entry = create_short_filename(s, filename);
> -
> -/* mangle duplicates */
> -while(1) {
> -direntry_t* entry1=array_get(&(s->directory),directory_start);
> -int j;
> -
> -for(;entry1<entry;entry1++)
> -if(!is_long_name(entry1) && !memcmp(entry1->name,entry->name,11))
> -break; /* found dupe */
> -if(entry1==entry) /* no dupe found */
> -break;
> -
> -/* use all 8 characters of name */
> -if(entry->name[7]==' ') {
> -int j;
> -for(j=6;j>0 && entry->name[j]==' ';j--)
> -entry->name[j]='~';
> -}
> -
> -/* increment number */
> -for(j=7;j>0 && entry->name[j]=='9';j--)
> -entry->name[j]='0';
> -if(j>0) {
> -if(entry->name[j]<'0' || entry->name[j]>'9')
> -entry->name[j]='0';
> -else
> -entry->name[j]++;
> -}
> -}
> +entry = create_short_filename(s, filename, directory_start);
>
>  /* calculate checksum; propagate to long name */
>  if(entry_long) {
> --
> 2.11.0
>
>

-- 
Pranith



[Qemu-devel] [RFC PATCH] tcg/softmmu: Increase size of TLB cache

2017-07-24 Thread Pranith Kumar
This patch increases the number of entries we allow in the TLB. I went
over a few architectures to see if increasing it is problematic. Only
armv6 seems to have a limitation that only 8 bits can be used for
indexing these entries. For other architectures, I increased the
number of TLB entries to a 4K-sized cache.

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 include/exec/cpu-defs.h  | 5 ++++-
 tcg/aarch64/tcg-target.h | 1 +
 tcg/i386/tcg-target.h| 2 ++
 tcg/mips/tcg-target.h| 1 +
 tcg/s390/tcg-target.h| 1 +
 tcg/sparc/tcg-target.h   | 1 +
 6 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 29b3c2ada8..cb81232b83 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -64,6 +64,9 @@ typedef uint64_t target_ulong;
 #define CPU_TLB_ENTRY_BITS 5
 #endif
 
+#ifndef CPU_TLB_BITS_MAX
+# define CPU_TLB_BITS_MAX 8
+#endif
 /* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in CPU_TLB_BITS to ensure that
  * the TLB is not unnecessarily small, but still small enough for the
  * TLB lookup instruction sequence used by the TCG target.
@@ -87,7 +90,7 @@ typedef uint64_t target_ulong;
  * of tlb_table inside env (which is non-trivial but not huge).
  */
 #define CPU_TLB_BITS \
-MIN(8,   \
+MIN(CPU_TLB_BITS_MAX,\
 TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS -  \
 (NB_MMU_MODES <= 1 ? 0 : \
  NB_MMU_MODES <= 2 ? 1 : \
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 55a46ac825..f428e09c98 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -15,6 +15,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE  4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 24
+#define CPU_TLB_BITS_MAX 12
 #undef TCG_TARGET_STACK_GROWSUP
 
 typedef enum {
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 73a15f7e80..35c27a977b 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -162,6 +162,8 @@ extern bool have_popcnt;
 # define TCG_AREG0 TCG_REG_EBP
 #endif
 
+#define CPU_TLB_BITS_MAX 12
+
 static inline void flush_icache_range(uintptr_t start, uintptr_t stop)
 {
 }
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
index d75cb63ed3..fd9046b7ad 100644
--- a/tcg/mips/tcg-target.h
+++ b/tcg/mips/tcg-target.h
@@ -37,6 +37,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16
+#define CPU_TLB_BITS_MAX 12
 #define TCG_TARGET_NB_REGS 32
 
 typedef enum {
diff --git a/tcg/s390/tcg-target.h b/tcg/s390/tcg-target.h
index 957f0c0afe..218be322ad 100644
--- a/tcg/s390/tcg-target.h
+++ b/tcg/s390/tcg-target.h
@@ -27,6 +27,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 2
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 19
+#define CPU_TLB_BITS_MAX 12
 
 typedef enum TCGReg {
 TCG_REG_R0 = 0,
diff --git a/tcg/sparc/tcg-target.h b/tcg/sparc/tcg-target.h
index 854a0afd70..9fd59a64f2 100644
--- a/tcg/sparc/tcg-target.h
+++ b/tcg/sparc/tcg-target.h
@@ -29,6 +29,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE 4
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 32
+#define CPU_TLB_BITS_MAX 12
 #define TCG_TARGET_NB_REGS 32
 
 typedef enum {
-- 
2.13.0




Re: [Qemu-devel] [PATCH] tcg/aarch64: Use ADR for shorter jumps

2017-07-12 Thread Pranith Kumar
On Wed, Jul 12, 2017 at 7:08 PM, Richard Henderson <r...@twiddle.net> wrote:
> On 07/12/2017 12:14 PM, Pranith Kumar wrote:
>>
>> Use ADR instruction for shorter jumps.
>>
>> I was going through rth's email and realized that I should have done
>> this the first time.
>>
>> Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
>> ---
>>   tcg/aarch64/tcg-target.inc.c | 14 +++++++++-----
>>   1 file changed, 9 insertions(+), 5 deletions(-)
>>
>> diff --git a/tcg/aarch64/tcg-target.inc.c b/tcg/aarch64/tcg-target.inc.c
>> index 04bc369a92..5121ebc1a1 100644
>> --- a/tcg/aarch64/tcg-target.inc.c
>> +++ b/tcg/aarch64/tcg-target.inc.c
>> @@ -886,12 +886,16 @@ void aarch64_tb_set_jmp_target(uintptr_t jmp_addr,
>> uintptr_t addr)
>>   i1 = I3206_B | ((offset >> 2) & 0x3ffffff);
>>   i2 = NOP;
>>   } else {
>> -offset = (addr >> 12) - (jmp_addr >> 12);
>> +if (offset == sextract64(offset, 0, 21)) {
>> +i1 = I3406_ADR;
>> +i2 = NOP;
>> +} else {
>
>
> This is dead code because it is covered by the direct jump above.
> B has a 26-bit range, whereas ADR has a 21-bit range.

Hmm, yep. And I guess it doesn't make sense to use ADR for short jumps
because this would be a two-instruction jump vs. one instruction for a
direct branch.
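
(For anyone following along, the immediate ranges that drive this
choice, with a tiny fits-check in the spirit of the sextract64() tests
in the code:)

    #include <stdbool.h>
    #include <stdint.h>

    /* does a value fit in a signed immediate of the given width? */
    static bool fits_signed(int64_t v, unsigned bits)
    {
        int64_t lim = (int64_t)1 << (bits - 1);
        return v >= -lim && v < lim;
    }

    /* B:    26-bit word offset -> +/-128MiB: fits_signed(off >> 2, 26)  */
    /* ADR:  21-bit byte offset -> +/-1MiB:   fits_signed(off, 21)       */
    /* ADRP: 21-bit page offset -> +/-4GiB:   fits_signed(off >> 12, 21) */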

Thanks,
-- 
Pranith



[Qemu-devel] [PATCH] tcg/aarch64: Use ADR for shorter jumps

2017-07-12 Thread Pranith Kumar
Use ADR instruction for shorter jumps.

I was going through rth's email and realized that I should have done
this the first time.

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 tcg/aarch64/tcg-target.inc.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/tcg/aarch64/tcg-target.inc.c b/tcg/aarch64/tcg-target.inc.c
index 04bc369a92..5121ebc1a1 100644
--- a/tcg/aarch64/tcg-target.inc.c
+++ b/tcg/aarch64/tcg-target.inc.c
@@ -886,12 +886,16 @@ void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, 
uintptr_t addr)
 i1 = I3206_B | ((offset >> 2) & 0x3ffffff);
 i2 = NOP;
 } else {
-offset = (addr >> 12) - (jmp_addr >> 12);
+if (offset == sextract64(offset, 0, 21)) {
+i1 = I3406_ADR;
+i2 = NOP;
+} else {
+offset = (addr >> 12) - (jmp_addr >> 12);
 
-/* patch ADRP */
-i1 = I3406_ADRP | (offset & 3) << 29 | (offset & 0x1ffffc) << (5 - 2) | rd;
-/* patch ADDI */
-i2 = I3401_ADDI | rt << 31 | (addr & 0xfff) << 10 | rd << 5 | rd;
+i1 = I3406_ADRP;
+i2 = I3401_ADDI | rt << 31 | (addr & 0xfff) << 10 | rd << 5 | rd;
+}
+i1 |= (offset & 3) << 29 | (offset & 0x1ffffc) << (5 - 2) | rd;
 }
 pair = (uint64_t)i2 << 32 | i1;
 atomic_set((uint64_t *)jmp_addr, pair);
-- 
2.13.0




[Qemu-devel] [PATCH v4 2/2] mttcg/i386: Patch instruction using async_safe_* framework

2017-07-12 Thread Pranith Kumar
In mttcg, calling pause_all_vcpus() during execution from the
generated TBs causes a deadlock if some vCPU is waiting for exclusive
execution in start_exclusive(). Fix this by using the async_safe_*
framework instead of pausing vcpus for patching instructions.
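
(The shape of the fix, reduced to its bones -- async_safe_run_on_cpu()
and RUN_ON_CPU_HOST_PTR() are the existing cpus-common helpers, and
PatchInfo is the small struct the patch introduces to carry the
arguments across; a sketch, not the full patch below:)

    /* runs while every other vCPU is outside generated code */
    static void do_patch_cb(CPUState *cs, run_on_cpu_data data)
    {
        struct PatchInfo *info = data.host_ptr;

        /* ... rewrite the bytes at info->ip via cpu_memory_rw_debug() ... */

        g_free(info);
    }

    /* requester side: queue the work and return, no pausing */
    static void request_patch(CPUState *cs, VAPICHandlers *handlers,
                              target_ulong ip)
    {
        struct PatchInfo *info = g_new(struct PatchInfo, 1);

        info->handler = handlers;
        info->ip = ip;
        async_safe_run_on_cpu(cs, do_patch_cb, RUN_ON_CPU_HOST_PTR(info));
    }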

CC: Paolo Bonzini <pbonz...@redhat.com>
CC: Peter Maydell <peter.mayd...@linaro.org>
Reviewed-by: Richard Henderson <r...@twiddle.net>
Reviewed-by: Alex Bennée <alex.ben...@linaro.org>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 hw/i386/kvmvapic.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------
 1 file changed, 55 insertions(+), 46 deletions(-)

diff --git a/hw/i386/kvmvapic.c b/hw/i386/kvmvapic.c
index 0d9ef77580..4e8f1f4e55 100644
--- a/hw/i386/kvmvapic.c
+++ b/hw/i386/kvmvapic.c
@@ -383,8 +383,7 @@ static void patch_byte(X86CPU *cpu, target_ulong addr, 
uint8_t byte)
 cpu_memory_rw_debug(CPU(cpu), addr, &byte, 1, 1);
 }
 
-static void patch_call(VAPICROMState *s, X86CPU *cpu, target_ulong ip,
-   uint32_t target)
+static void patch_call(X86CPU *cpu, target_ulong ip, uint32_t target)
 {
 uint32_t offset;
 
@@ -393,77 +392,87 @@ static void patch_call(VAPICROMState *s, X86CPU *cpu, 
target_ulong ip,
 cpu_memory_rw_debug(CPU(cpu), ip + 1, (void *)&offset, sizeof(offset), 1);
 }
 
-static void patch_instruction(VAPICROMState *s, X86CPU *cpu, target_ulong ip)
+struct PatchInfo {
+VAPICHandlers *handler;
+target_ulong ip;
+};
+
+static void do_patch_instruction(CPUState *cs, run_on_cpu_data data)
 {
-CPUState *cs = CPU(cpu);
-CPUX86State *env = &cpu->env;
-VAPICHandlers *handlers;
+X86CPU *x86_cpu = X86_CPU(cs);
+struct PatchInfo *info = (struct PatchInfo *) data.host_ptr;
+VAPICHandlers *handlers = info->handler;
+target_ulong ip = info->ip;
 uint8_t opcode[2];
 uint32_t imm32 = 0;
-target_ulong current_pc = 0;
-target_ulong current_cs_base = 0;
-uint32_t current_flags = 0;
-
-if (smp_cpus == 1) {
-handlers = &s->rom_state.up;
-} else {
-handlers = &s->rom_state.mp;
-}
-
-if (tcg_enabled()) {
-cpu_restore_state(cs, cs->mem_io_pc);
-cpu_get_tb_cpu_state(env, &current_pc, &current_cs_base,
- &current_flags);
-/* Account this instruction, because we will exit the tb.
-   This is the first instruction in the block. Therefore
-   there is no need in restoring CPU state. */
-if (use_icount) {
---cs->icount_decr.u16.low;
-}
-}
-
-pause_all_vcpus();
 
 cpu_memory_rw_debug(cs, ip, opcode, sizeof(opcode), 0);
 
 switch (opcode[0]) {
 case 0x89: /* mov r32 to r/m32 */
-patch_byte(cpu, ip, 0x50 + modrm_reg(opcode[1]));  /* push reg */
-patch_call(s, cpu, ip + 1, handlers->set_tpr);
+patch_byte(x86_cpu, ip, 0x50 + modrm_reg(opcode[1]));  /* push reg */
+patch_call(x86_cpu, ip + 1, handlers->set_tpr);
 break;
 case 0x8b: /* mov r/m32 to r32 */
-patch_byte(cpu, ip, 0x90);
-patch_call(s, cpu, ip + 1, handlers->get_tpr[modrm_reg(opcode[1])]);
+patch_byte(x86_cpu, ip, 0x90);
+patch_call(x86_cpu, ip + 1, handlers->get_tpr[modrm_reg(opcode[1])]);
 break;
 case 0xa1: /* mov abs to eax */
-patch_call(s, cpu, ip, handlers->get_tpr[0]);
+patch_call(x86_cpu, ip, handlers->get_tpr[0]);
 break;
 case 0xa3: /* mov eax to abs */
-patch_call(s, cpu, ip, handlers->set_tpr_eax);
+patch_call(x86_cpu, ip, handlers->set_tpr_eax);
 break;
 case 0xc7: /* mov imm32, r/m32 (c7/0) */
-patch_byte(cpu, ip, 0x68);  /* push imm32 */
+patch_byte(x86_cpu, ip, 0x68);  /* push imm32 */
 cpu_memory_rw_debug(cs, ip + 6, (void *)&imm32, sizeof(imm32), 0);
 cpu_memory_rw_debug(cs, ip + 1, (void *)&imm32, sizeof(imm32), 1);
-patch_call(s, cpu, ip + 5, handlers->set_tpr);
+patch_call(x86_cpu, ip + 5, handlers->set_tpr);
 break;
 case 0xff: /* push r/m32 */
-patch_byte(cpu, ip, 0x50); /* push eax */
-patch_call(s, cpu, ip + 1, handlers->get_tpr_stack);
+patch_byte(x86_cpu, ip, 0x50); /* push eax */
+patch_call(x86_cpu, ip + 1, handlers->get_tpr_stack);
 break;
 default:
 abort();
 }
 
-resume_all_vcpus();
+g_free(info);
+}
+
+static void patch_instruction(VAPICROMState *s, X86CPU *cpu, target_ulong ip)
+{
+CPUState *cs = CPU(cpu);
+CPUX86State *env = &cpu->env;
+VAPICHandlers *handlers;
+struct PatchInfo *info;
+const run_on_cpu_func fn = do_patch_instruction;
+target_ulong current_pc = 0;
+target_ulong current_cs_base = 0;
+uint32_t current_flags = 0;
+
+if (smp_cpus == 1) {
+handlers = &s->rom_state.up;
+} else {
+handlers = &s->rom_state.mp;
+}
+
+info  = g_new(struct PatchInfo, 1);
+in

[Qemu-devel] [PATCH v4 1/2] Revert "exec.c: Fix breakpoint invalidation race"

2017-07-12 Thread Pranith Kumar
Now that we have proper locking after MTTCG patches have landed, we
can revert the commit.

This reverts commit a9353fe897ca2687e5b3385ed39e3db3927a90e0.

CC: Peter Maydell <peter.mayd...@linaro.org>
CC: Alex Bennée <alex.ben...@linaro.org>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 exec.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/exec.c b/exec.c
index a083ff89ad..48de9a67f5 100644
--- a/exec.c
+++ b/exec.c
@@ -775,15 +775,28 @@ void cpu_exec_realizefn(CPUState *cpu, Error **errp)
 #endif
 }
 
+#if defined(CONFIG_USER_ONLY)
 static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
 {
-/* Flush the whole TB as this will not have race conditions
- * even if we don't have proper locking yet.
- * Ideally we would just invalidate the TBs for the
- * specified PC.
- */
-tb_flush(cpu);
+mmap_lock();
+tb_lock();
+tb_invalidate_phys_page_range(pc, pc + 1, 0);
+tb_unlock();
+mmap_unlock();
 }
+#else
+static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
+{
+MemTxAttrs attrs;
+hwaddr phys = cpu_get_phys_page_attrs_debug(cpu, pc, );
+int asidx = cpu_asidx_from_attrs(cpu, attrs);
+if (phys != -1) {
+/* Locks grabbed by tb_invalidate_phys_addr */
+tb_invalidate_phys_addr(cpu->cpu_ases[asidx].as,
+phys | (pc & ~TARGET_PAGE_MASK));
+}
+}
+#endif
 
 #if defined(CONFIG_USER_ONLY)
 void cpu_watchpoint_remove_all(CPUState *cpu, int mask)
-- 
2.13.0




[Qemu-devel] [PATCH] util/cacheinfo: Fix warning generated by clang

2017-06-30 Thread Pranith Kumar
Clang generates the following warning on aarch64 host:

  CC  util/cacheinfo.o
/home/pranith/qemu/util/cacheinfo.c:121:48: warning: value size does not match 
register size specified by the constraint and modifier [-Wasm-operand-widths]
asm volatile("mrs\t%0, ctr_el0" : "=r"(ctr));
   ^
/home/pranith/qemu/util/cacheinfo.c:121:28: note: use constraint modifier "w"
asm volatile("mrs\t%0, ctr_el0" : "=r"(ctr));
   ^~
   %w0

Constraint modifier 'w' is not (yet?) accepted by gcc. Fix this by increasing 
the ctr size.
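
(For completeness, this is roughly how the function decodes the
register once the operand is 64-bit -- IminLine is CTR_EL0[3:0] and
DminLine is CTR_EL0[19:16], both log2 of the line size in words:)

    static void arch_cache_info(int *isize, int *dsize)
    {
        if (*isize == 0 || *dsize == 0) {
            uint64_t ctr;

            /* ctr_el0 is a 64-bit system register, so the output operand
             * must be 64-bit for "%0" to name an x register */
            asm volatile("mrs\t%0, ctr_el0" : "=r"(ctr));

            if (*isize == 0) {
                *isize = 4 << (ctr & 0xf);         /* IminLine, in bytes */
            }
            if (*dsize == 0) {
                *dsize = 4 << ((ctr >> 16) & 0xf); /* DminLine, in bytes */
            }
        }
    }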

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 util/cacheinfo.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/util/cacheinfo.c b/util/cacheinfo.c
index f987522df4..6253049533 100644
--- a/util/cacheinfo.c
+++ b/util/cacheinfo.c
@@ -112,7 +112,7 @@ static void sys_cache_info(int *isize, int *dsize)
 static void arch_cache_info(int *isize, int *dsize)
 {
 if (*isize == 0 || *dsize == 0) {
-unsigned ctr;
+unsigned long ctr;
 
 /* The real cache geometry is in CCSIDR_EL1/CLIDR_EL1/CSSELR_EL1,
but (at least under Linux) these are marked protected by the
-- 
2.13.0




[Qemu-devel] [PATCH v4 2/3] tcg/aarch64: Use ADRP+ADD to compute target address

2017-06-30 Thread Pranith Kumar
We use ADRP+ADD to compute the target address for goto_tb. This patch
introduces the NOP instruction which is used to align the above
instruction pair so that we can use one atomic instruction to patch
the destination offsets.
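
(A quick sanity check on the headroom ADRP+ADD buys: the pair encodes
a signed 21-bit count of 4KiB pages plus a 12-bit byte offset:)

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int64_t pages = (int64_t)1 << 20; /* +/- 2^20 pages (21-bit signed) */
        int64_t reach = pages * 4096;     /* in bytes */

        /* prints 4096 MiB, i.e. +/- 4GiB of reach from the jump site,
         * which is why a 2GB code buffer is comfortably addressable */
        printf("ADRP reach: +/- %lld MiB\n", (long long)(reach >> 20));
        return 0;
    }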

CC: Richard Henderson <r...@twiddle.net>
CC: Alex Bennée <alex.ben...@linaro.org>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 accel/tcg/translate-all.c|  2 +-
 tcg/aarch64/tcg-target.inc.c | 36 ++++++++++++++++++++++++++++++------
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index f6ad46b613..65a92dbf67 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -522,7 +522,7 @@ static inline PageDesc *page_find(tb_page_addr_t index)
 #elif defined(__powerpc__)
 # define MAX_CODE_GEN_BUFFER_SIZE  (32u * 1024 * 1024)
 #elif defined(__aarch64__)
-# define MAX_CODE_GEN_BUFFER_SIZE  (128ul * 1024 * 1024)
+# define MAX_CODE_GEN_BUFFER_SIZE  (2ul * 1024 * 1024 * 1024)
 #elif defined(__s390x__)
   /* We have a +- 4GB range on the branches; leave some slop.  */
 # define MAX_CODE_GEN_BUFFER_SIZE  (3ul * 1024 * 1024 * 1024)
diff --git a/tcg/aarch64/tcg-target.inc.c b/tcg/aarch64/tcg-target.inc.c
index 8fce11ace7..a84422d633 100644
--- a/tcg/aarch64/tcg-target.inc.c
+++ b/tcg/aarch64/tcg-target.inc.c
@@ -372,6 +372,7 @@ typedef enum {
 I3510_EON   = 0x4a200000,
 I3510_ANDS  = 0x6a000000,
 
+NOP = 0xd503201f,
 /* System instructions.  */
 DMB_ISH = 0xd50338bf,
 DMB_LD  = 0x00000100,
@@ -865,11 +866,27 @@ static inline void tcg_out_call(TCGContext *s, 
tcg_insn_unit *target)
 
 void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr)
 {
-tcg_insn_unit *code_ptr = (tcg_insn_unit *)jmp_addr;
-tcg_insn_unit *target = (tcg_insn_unit *)addr;
+tcg_insn_unit i1, i2;
+TCGType rt = TCG_TYPE_I64;
+TCGReg  rd = TCG_REG_TMP;
+uint64_t pair;
 
-reloc_pc26_atomic(code_ptr, target);
-flush_icache_range(jmp_addr, jmp_addr + 4);
+ptrdiff_t offset = addr - jmp_addr;
+
+if (offset == sextract64(offset, 0, 26)) {
+i1 = I3206_B | ((offset >> 2) & 0x3ffffff);
+i2 = NOP;
+} else {
+offset = (addr >> 12) - (jmp_addr >> 12);
+
+/* patch ADRP */
+i1 = I3406_ADRP | (offset & 3) << 29 | (offset & 0x1ffffc) << (5 - 2) | rd;
+/* patch ADDI */
+i2 = I3401_ADDI | rt << 31 | (addr & 0xfff) << 10 | rd << 5 | rd;
+}
+pair = (uint64_t)i2 << 32 | i1;
+atomic_set((uint64_t *)jmp_addr, pair);
+flush_icache_range(jmp_addr, jmp_addr + 8);
 }
 
 static inline void tcg_out_goto_label(TCGContext *s, TCGLabel *l)
@@ -1388,10 +1405,17 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
 #endif
 /* consistency for USE_DIRECT_JUMP */
 tcg_debug_assert(s->tb_jmp_insn_offset != NULL);
+/* Ensure that ADRP+ADD are 8-byte aligned so that an atomic
+   write can be used to patch the target address. */
+if ((uintptr_t)s->code_ptr & 7) {
+tcg_out32(s, NOP);
+}
 s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
 /* actual branch destination will be patched by
-   aarch64_tb_set_jmp_target later, beware retranslation. */
-tcg_out_goto_noaddr(s);
+   aarch64_tb_set_jmp_target later. */
+tcg_out_insn(s, 3406, ADRP, TCG_REG_TMP, 0);
+tcg_out_insn(s, 3401, ADDI, TCG_TYPE_I64, TCG_REG_TMP, TCG_REG_TMP, 0);
+tcg_out_insn(s, 3207, BR, TCG_REG_TMP);
 s->tb_jmp_reset_offset[a0] = tcg_current_code_size(s);
 break;
 
-- 
2.13.0




[Qemu-devel] [PATCH v4 1/3] tcg/aarch64: Introduce and use long branch to register

2017-06-30 Thread Pranith Kumar
We can use a branch to register instruction for exit_tb for offsets
greater than 128MB.

CC: Alex Bennée <alex.ben...@linaro.org>
Reviewed-by: Richard Henderson <r...@twiddle.net>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 tcg/aarch64/tcg-target.inc.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/tcg/aarch64/tcg-target.inc.c b/tcg/aarch64/tcg-target.inc.c
index 1fa3bccc89..8fce11ace7 100644
--- a/tcg/aarch64/tcg-target.inc.c
+++ b/tcg/aarch64/tcg-target.inc.c
@@ -819,6 +819,17 @@ static inline void tcg_out_goto(TCGContext *s, 
tcg_insn_unit *target)
 tcg_out_insn(s, 3206, B, offset);
 }
 
+static inline void tcg_out_goto_long(TCGContext *s, tcg_insn_unit *target)
+{
+ptrdiff_t offset = target - s->code_ptr;
+if (offset == sextract64(offset, 0, 26)) {
+tcg_out_insn(s, 3206, BL, offset);
+} else {
+tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP, (intptr_t)target);
+tcg_out_insn(s, 3207, BR, TCG_REG_TMP);
+}
+}
+
 static inline void tcg_out_goto_noaddr(TCGContext *s)
 {
 /* We pay attention here to not modify the branch target by reading from
@@ -1364,10 +1375,10 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
 case INDEX_op_exit_tb:
 /* Reuse the zeroing that exists for goto_ptr.  */
 if (a0 == 0) {
-tcg_out_goto(s, s->code_gen_epilogue);
+tcg_out_goto_long(s, s->code_gen_epilogue);
 } else {
 tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X0, a0);
-tcg_out_goto(s, tb_ret_addr);
+tcg_out_goto_long(s, tb_ret_addr);
 }
 break;
 
-- 
2.13.0




[Qemu-devel] [PATCH v4 3/3] tcg/aarch64: Enable indirect jump path using LDR (literal)

2017-06-30 Thread Pranith Kumar
This patch enables the indirect jump path using an LDR (literal)
instruction. It will be interesting to test and see which of the two
paths performs better.
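
(For reference, a sketch of the encoding this relies on -- LDR
(literal) carries a signed 19-bit word offset, so the address pool
must sit within +/-1MiB of the load itself:)

    #include <stdint.h>

    /* LDR (literal), 64-bit variant: base opcode 0x58000000,
     * imm19 (signed, in words) at bits [23:5], Rt at bits [4:0] */
    static uint32_t ldr_literal(int32_t imm19, unsigned rt)
    {
        return 0x58000000u | (((uint32_t)imm19 & 0x7ffffu) << 5) | rt;
    }
    /* reach: +/- (1 << 18) words = +/- 1MiB from the instruction */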

CC: Alex Bennée <alex.ben...@linaro.org>
Reviewed-by: Richard Henderson <r...@twiddle.net>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 tcg/aarch64/tcg-target.inc.c | 42 ++++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

diff --git a/tcg/aarch64/tcg-target.inc.c b/tcg/aarch64/tcg-target.inc.c
index a84422d633..04bc369a92 100644
--- a/tcg/aarch64/tcg-target.inc.c
+++ b/tcg/aarch64/tcg-target.inc.c
@@ -269,6 +269,8 @@ typedef enum {
 I3207_BLR   = 0xd63f0000,
 I3207_RET   = 0xd65f0000,
 
+/* Load literal for loading the address at pc-relative offset */
+I3305_LDR   = 0x58000000,
 /* Load/store register.  Described here as 3.3.12, but the helper
that emits them can transform to 3.3.10 or 3.3.13.  */
 I3312_STRB  = 0x38000000 | LDST_ST << 22 | MO_8 << 30,
@@ -389,6 +391,11 @@ static inline uint32_t tcg_in32(TCGContext *s)
 #define tcg_out_insn(S, FMT, OP, ...) \
 glue(tcg_out_insn_,FMT)(S, glue(glue(glue(I,FMT),_),OP), ## __VA_ARGS__)
 
+static void tcg_out_insn_3305(TCGContext *s, AArch64Insn insn, int imm19, TCGReg rt)
+{
+tcg_out32(s, insn | (imm19 & 0x7ffff) << 5 | rt);
+}
+
 static void tcg_out_insn_3201(TCGContext *s, AArch64Insn insn, TCGType ext,
   TCGReg rt, int imm19)
 {
@@ -864,6 +871,8 @@ static inline void tcg_out_call(TCGContext *s, 
tcg_insn_unit *target)
 }
 }
 
+#ifdef USE_DIRECT_JUMP
+
 void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr)
 {
 tcg_insn_unit i1, i2;
@@ -889,6 +898,8 @@ void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, 
uintptr_t addr)
 flush_icache_range(jmp_addr, jmp_addr + 8);
 }
 
+#endif
+
 static inline void tcg_out_goto_label(TCGContext *s, TCGLabel *l)
 {
 if (!l->has_value) {
@@ -1400,21 +1411,24 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
 break;
 
 case INDEX_op_goto_tb:
-#ifndef USE_DIRECT_JUMP
-#error "USE_DIRECT_JUMP required for aarch64"
-#endif
-/* consistency for USE_DIRECT_JUMP */
-tcg_debug_assert(s->tb_jmp_insn_offset != NULL);
-/* Ensure that ADRP+ADD are 8-byte aligned so that an atomic
-   write can be used to patch the target address. */
-if ((uintptr_t)s->code_ptr & 7) {
-tcg_out32(s, NOP);
+if (s->tb_jmp_insn_offset != NULL) {
+/* USE_DIRECT_JUMP */
+/* Ensure that ADRP+ADD are 8-byte aligned so that an atomic
+   write can be used to patch the target address. */
+if ((uintptr_t)s->code_ptr & 7) {
+tcg_out32(s, NOP);
+}
+s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
+/* actual branch destination will be patched by
+   aarch64_tb_set_jmp_target later. */
+tcg_out_insn(s, 3406, ADRP, TCG_REG_TMP, 0);
+tcg_out_insn(s, 3401, ADDI, TCG_TYPE_I64, TCG_REG_TMP, TCG_REG_TMP, 0);
+} else {
+/* !USE_DIRECT_JUMP */
+tcg_debug_assert(s->tb_jmp_target_addr != NULL);
+intptr_t offset = tcg_pcrel_diff(s, (s->tb_jmp_target_addr + a0)) >> 2;
+tcg_out_insn(s, 3305, LDR, offset, TCG_REG_TMP);
 }
-s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
-/* actual branch destination will be patched by
-   aarch64_tb_set_jmp_target later. */
-tcg_out_insn(s, 3406, ADRP, TCG_REG_TMP, 0);
-tcg_out_insn(s, 3401, ADDI, TCG_TYPE_I64, TCG_REG_TMP, TCG_REG_TMP, 0);
 tcg_out_insn(s, 3207, BR, TCG_REG_TMP);
 s->tb_jmp_reset_offset[a0] = tcg_current_code_size(s);
 break;
-- 
2.13.0




Re: [Qemu-devel] [PATCH v3 2/3] tcg/aarch64: Use ADRP+ADD to compute target address

2017-06-30 Thread Pranith Kumar
On Fri, Jun 30, 2017 at 12:47 AM, Richard Henderson <r...@twiddle.net> wrote:
> On 06/29/2017 05:40 PM, Pranith Kumar wrote:
>>
>>   void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr)
>>   {
>>   tcg_insn_unit *code_ptr = (tcg_insn_unit *)jmp_addr;
>> -tcg_insn_unit *target = (tcg_insn_unit *)addr;
>> +tcg_insn_unit i1, i2;
>> +uint64_t pair;
>>   +ptrdiff_t offset = addr - jmp_addr;
>> +
>> +if (offset == sextract64(offset, 0, 26)) {
>> +i1 = NOP;
>> +i2 = I3206_B | ((offset >> 2) & 0x3ffffff);
>
>
> Branch first, since that's the offset you calculated.
> Also, the nop need not be executed.

This is exactly how I form the instruction pair below (B+NOP, not
NOP+B). But I get your point. It is confusing to use i1 for the second
instruction. I'll change it.

>
>> +} else {
>> +offset = (addr >> 12) - (jmp_addr >> 12);
>> +
>> +/* patch ADRP */
>> +i2 = deposit32(*code_ptr++, 29, 2, offset & 0x3);
>> +i2 = deposit32(i2, 5, 19, offset >> 2);
>> +/* patch ADDI */
>> +i1 = deposit32(*code_ptr, 10, 12, addr & 0xfff);
>
>
> You can't just patch these insns, because they aren't necessarily ADRP+ADD.
> Indeed, they will very likely be B and NOP.  The first address we patch in
> is tb_jmp_reset_offset, which is the following opcode, which is definitely
> in range of the branch above.

Whoops, I totally missed that we patch these out the first time out. I
will explicitly generate the ADRP+ADD pair from here.

Thanks,
-- 
Pranith



[Qemu-devel] [PATCH v3 2/3] tcg/aarch64: Use ADRP+ADD to compute target address

2017-06-29 Thread Pranith Kumar
We use ADRP+ADD to compute the target address for goto_tb. This patch
introduces the NOP instruction which is used to align the above
instruction pair so that we can use one atomic instruction to patch
the destination offsets.

CC: Richard Henderson <r...@twiddle.net>
CC: Alex Bennée <alex.ben...@linaro.org>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 accel/tcg/translate-all.c|  2 +-
 tcg/aarch64/tcg-target.inc.c | 34 +++++++++++++++++++++++++++++-----
 2 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index f6ad46b613..65a92dbf67 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -522,7 +522,7 @@ static inline PageDesc *page_find(tb_page_addr_t index)
 #elif defined(__powerpc__)
 # define MAX_CODE_GEN_BUFFER_SIZE  (32u * 1024 * 1024)
 #elif defined(__aarch64__)
-# define MAX_CODE_GEN_BUFFER_SIZE  (128ul * 1024 * 1024)
+# define MAX_CODE_GEN_BUFFER_SIZE  (2ul * 1024 * 1024 * 1024)
 #elif defined(__s390x__)
   /* We have a +- 4GB range on the branches; leave some slop.  */
 # define MAX_CODE_GEN_BUFFER_SIZE  (3ul * 1024 * 1024 * 1024)
diff --git a/tcg/aarch64/tcg-target.inc.c b/tcg/aarch64/tcg-target.inc.c
index 8fce11ace7..f059d9d781 100644
--- a/tcg/aarch64/tcg-target.inc.c
+++ b/tcg/aarch64/tcg-target.inc.c
@@ -372,6 +372,7 @@ typedef enum {
 I3510_EON   = 0x4a200000,
 I3510_ANDS  = 0x6a000000,
 
+NOP = 0xd503201f,
 /* System instructions.  */
 DMB_ISH = 0xd50338bf,
 DMB_LD  = 0x00000100,
@@ -866,10 +867,26 @@ static inline void tcg_out_call(TCGContext *s, 
tcg_insn_unit *target)
 void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr)
 {
 tcg_insn_unit *code_ptr = (tcg_insn_unit *)jmp_addr;
-tcg_insn_unit *target = (tcg_insn_unit *)addr;
+tcg_insn_unit i1, i2;
+uint64_t pair;
 
-reloc_pc26_atomic(code_ptr, target);
-flush_icache_range(jmp_addr, jmp_addr + 4);
+ptrdiff_t offset = addr - jmp_addr;
+
+if (offset == sextract64(offset, 0, 26)) {
+i1 = NOP;
+i2 = I3206_B | ((offset >> 2) & 0x3ffffff);
+} else {
+offset = (addr >> 12) - (jmp_addr >> 12);
+
+/* patch ADRP */
+i2 = deposit32(*code_ptr++, 29, 2, offset & 0x3);
+i2 = deposit32(i2, 5, 19, offset >> 2);
+/* patch ADDI */
+i1 = deposit32(*code_ptr, 10, 12, addr & 0xfff);
+}
+pair = (uint64_t)i1 << 32 | i2;
+atomic_set((uint64_t *)jmp_addr, pair);
+flush_icache_range(jmp_addr, jmp_addr + 8);
 }
 
 static inline void tcg_out_goto_label(TCGContext *s, TCGLabel *l)
@@ -1388,10 +1405,17 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
 #endif
 /* consistency for USE_DIRECT_JUMP */
 tcg_debug_assert(s->tb_jmp_insn_offset != NULL);
+/* Ensure that ADRP+ADD are 8-byte aligned so that an atomic
+   write can be used to patch the target address. */
+if ((uintptr_t)s->code_ptr & 7) {
+tcg_out32(s, NOP);
+}
 s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
 /* actual branch destination will be patched by
-   aarch64_tb_set_jmp_target later, beware retranslation. */
-tcg_out_goto_noaddr(s);
+   aarch64_tb_set_jmp_target later. */
+tcg_out_insn(s, 3406, ADRP, TCG_REG_TMP, 0);
+tcg_out_insn(s, 3401, ADDI, TCG_TYPE_I64, TCG_REG_TMP, TCG_REG_TMP, 0);
+tcg_out_insn(s, 3207, BR, TCG_REG_TMP);
 s->tb_jmp_reset_offset[a0] = tcg_current_code_size(s);
 break;
 
-- 
2.13.0




[Qemu-devel] [PATCH v3 0/3] Relax code buffer size limitation on aarch64 hosts

2017-06-29 Thread Pranith Kumar
Hello,

The following patches enable us to relax the 128MB code buffer size
limitation on ARM64 hosts.

Patch 2 increases this limitation to 3GB, even though ADRP+ADD can
address 4GB of pc-relative addresses to give us some slack.

Patch 3 uses LDR (literal) to load the address, allowing us to remove
the code buffer size limitation altogether. However, I feel that 3GB
should be sufficient for now and hence did not change it ;). It
however enables the !USE_DIRECT_JUMP path on aarch64 hosts.

Thanks,

v3:
* Update with comments and reviews by Richard

Pranith Kumar (3):
  tcg/aarch64: Introduce and use long branch to register
  tcg/aarch64: Use ADRP+ADD to compute target address
  tcg/aarch64: Enable indirect jump path using LDR (literal)

 accel/tcg/translate-all.c|  2 +-
 tcg/aarch64/tcg-target.inc.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 64 insertions(+), 15 deletions(-)

-- 
2.13.0




[Qemu-devel] [PATCH v3 3/3] tcg/aarch64: Enable indirect jump path using LDR (literal)

2017-06-29 Thread Pranith Kumar
This patch enables the indirect jump path using an LDR (literal)
instruction. It will be interesting to test and see which of the two
paths performs better.

CC: Alex Bennée <alex.ben...@linaro.org>
Reviewed-by: Richard Henderson <r...@twiddle.net>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 tcg/aarch64/tcg-target.inc.c | 42 ++++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

diff --git a/tcg/aarch64/tcg-target.inc.c b/tcg/aarch64/tcg-target.inc.c
index f059d9d781..b0801d0259 100644
--- a/tcg/aarch64/tcg-target.inc.c
+++ b/tcg/aarch64/tcg-target.inc.c
@@ -269,6 +269,8 @@ typedef enum {
 I3207_BLR   = 0xd63f0000,
 I3207_RET   = 0xd65f0000,
 
+/* Load literal for loading the address at pc-relative offset */
+I3305_LDR   = 0x58000000,
 /* Load/store register.  Described here as 3.3.12, but the helper
that emits them can transform to 3.3.10 or 3.3.13.  */
 I3312_STRB  = 0x38000000 | LDST_ST << 22 | MO_8 << 30,
@@ -389,6 +391,11 @@ static inline uint32_t tcg_in32(TCGContext *s)
 #define tcg_out_insn(S, FMT, OP, ...) \
 glue(tcg_out_insn_,FMT)(S, glue(glue(glue(I,FMT),_),OP), ## __VA_ARGS__)
 
+static void tcg_out_insn_3305(TCGContext *s, AArch64Insn insn, int imm19, TCGReg rt)
+{
+tcg_out32(s, insn | (imm19 & 0x7ffff) << 5 | rt);
+}
+
 static void tcg_out_insn_3201(TCGContext *s, AArch64Insn insn, TCGType ext,
   TCGReg rt, int imm19)
 {
@@ -864,6 +871,8 @@ static inline void tcg_out_call(TCGContext *s, 
tcg_insn_unit *target)
 }
 }
 
+#ifdef USE_DIRECT_JUMP
+
 void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr)
 {
 tcg_insn_unit *code_ptr = (tcg_insn_unit *)jmp_addr;
@@ -889,6 +898,8 @@ void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, 
uintptr_t addr)
 flush_icache_range(jmp_addr, jmp_addr + 8);
 }
 
+#endif
+
 static inline void tcg_out_goto_label(TCGContext *s, TCGLabel *l)
 {
 if (!l->has_value) {
@@ -1400,21 +1411,24 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
 break;
 
 case INDEX_op_goto_tb:
-#ifndef USE_DIRECT_JUMP
-#error "USE_DIRECT_JUMP required for aarch64"
-#endif
-/* consistency for USE_DIRECT_JUMP */
-tcg_debug_assert(s->tb_jmp_insn_offset != NULL);
-/* Ensure that ADRP+ADD are 8-byte aligned so that an atomic
-   write can be used to patch the target address. */
-if ((uintptr_t)s->code_ptr & 7) {
-tcg_out32(s, NOP);
+if (s->tb_jmp_insn_offset != NULL) {
+/* USE_DIRECT_JUMP */
+/* Ensure that ADRP+ADD are 8-byte aligned so that an atomic
+   write can be used to patch the target address. */
+if ((uintptr_t)s->code_ptr & 7) {
+tcg_out32(s, NOP);
+}
+s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
+/* actual branch destination will be patched by
+   aarch64_tb_set_jmp_target later. */
+tcg_out_insn(s, 3406, ADRP, TCG_REG_TMP, 0);
+tcg_out_insn(s, 3401, ADDI, TCG_TYPE_I64, TCG_REG_TMP, TCG_REG_TMP, 0);
+} else {
+/* !USE_DIRECT_JUMP */
+tcg_debug_assert(s->tb_jmp_target_addr != NULL);
+intptr_t offset = tcg_pcrel_diff(s, (s->tb_jmp_target_addr + a0)) >> 2;
+tcg_out_insn(s, 3305, LDR, offset, TCG_REG_TMP);
 }
-s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
-/* actual branch destination will be patched by
-   aarch64_tb_set_jmp_target later. */
-tcg_out_insn(s, 3406, ADRP, TCG_REG_TMP, 0);
-tcg_out_insn(s, 3401, ADDI, TCG_TYPE_I64, TCG_REG_TMP, TCG_REG_TMP, 0);
 tcg_out_insn(s, 3207, BR, TCG_REG_TMP);
 s->tb_jmp_reset_offset[a0] = tcg_current_code_size(s);
 break;
-- 
2.13.0




[Qemu-devel] [PATCH v3 1/3] tcg/aarch64: Introduce and use long branch to register

2017-06-29 Thread Pranith Kumar
We can use a branch to register instruction for exit_tb for offsets
greater than 128MB.

CC: Alex Bennée <alex.ben...@linaro.org>
Reviewed-by: Richard Henderson <r...@twiddle.net>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 tcg/aarch64/tcg-target.inc.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/tcg/aarch64/tcg-target.inc.c b/tcg/aarch64/tcg-target.inc.c
index 1fa3bccc89..8fce11ace7 100644
--- a/tcg/aarch64/tcg-target.inc.c
+++ b/tcg/aarch64/tcg-target.inc.c
@@ -819,6 +819,17 @@ static inline void tcg_out_goto(TCGContext *s, 
tcg_insn_unit *target)
 tcg_out_insn(s, 3206, B, offset);
 }
 
+static inline void tcg_out_goto_long(TCGContext *s, tcg_insn_unit *target)
+{
+ptrdiff_t offset = target - s->code_ptr;
+if (offset == sextract64(offset, 0, 26)) {
+tcg_out_insn(s, 3206, BL, offset);
+} else {
+tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP, (intptr_t)target);
+tcg_out_insn(s, 3207, BR, TCG_REG_TMP);
+}
+}
+
 static inline void tcg_out_goto_noaddr(TCGContext *s)
 {
 /* We pay attention here to not modify the branch target by reading from
@@ -1364,10 +1375,10 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
 case INDEX_op_exit_tb:
 /* Reuse the zeroing that exists for goto_ptr.  */
 if (a0 == 0) {
-tcg_out_goto(s, s->code_gen_epilogue);
+tcg_out_goto_long(s, s->code_gen_epilogue);
 } else {
 tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X0, a0);
-tcg_out_goto(s, tb_ret_addr);
+tcg_out_goto_long(s, tb_ret_addr);
 }
 break;
 
-- 
2.13.0




[Qemu-devel] [PATCH v2 1/3] tcg/aarch64: Introduce and use long branch to register

2017-06-29 Thread Pranith Kumar
We can use a branch to register instruction for exit_tb for offsets
greater than 128MB.

CC: Richard Henderson <r...@twiddle.net>
CC: Alex Bennée <alex.ben...@linaro.org>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 tcg/aarch64/tcg-target.inc.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/tcg/aarch64/tcg-target.inc.c b/tcg/aarch64/tcg-target.inc.c
index 1fa3bccc89..8fce11ace7 100644
--- a/tcg/aarch64/tcg-target.inc.c
+++ b/tcg/aarch64/tcg-target.inc.c
@@ -819,6 +819,17 @@ static inline void tcg_out_goto(TCGContext *s, 
tcg_insn_unit *target)
 tcg_out_insn(s, 3206, B, offset);
 }
 
+static inline void tcg_out_goto_long(TCGContext *s, tcg_insn_unit *target)
+{
+ptrdiff_t offset = target - s->code_ptr;
+if (offset == sextract64(offset, 0, 26)) {
+tcg_out_insn(s, 3206, BL, offset);
+} else {
+tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP, (intptr_t)target);
+tcg_out_insn(s, 3207, BR, TCG_REG_TMP);
+}
+}
+
 static inline void tcg_out_goto_noaddr(TCGContext *s)
 {
 /* We pay attention here to not modify the branch target by reading from
@@ -1364,10 +1375,10 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
 case INDEX_op_exit_tb:
 /* Reuse the zeroing that exists for goto_ptr.  */
 if (a0 == 0) {
-tcg_out_goto(s, s->code_gen_epilogue);
+tcg_out_goto_long(s, s->code_gen_epilogue);
 } else {
 tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X0, a0);
-tcg_out_goto(s, tb_ret_addr);
+tcg_out_goto_long(s, tb_ret_addr);
 }
 break;
 
-- 
2.13.0




[Qemu-devel] [PATCH v3 3/3] tcg/aarch64: Enable indirect jump path using LDR (literal)

2017-06-29 Thread Pranith Kumar
This patch enables the indirect jump path using an LDR (literal)
instruction. It will be interesting to test and see which of the two
paths performs better.

CC: Richard Henderson <r...@twiddle.net>
CC: Alex Bennée <alex.ben...@linaro.org>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 tcg/aarch64/tcg-target.inc.c | 42 --
 1 file changed, 28 insertions(+), 14 deletions(-)

diff --git a/tcg/aarch64/tcg-target.inc.c b/tcg/aarch64/tcg-target.inc.c
index b7670ecc90..5381c31b45 100644
--- a/tcg/aarch64/tcg-target.inc.c
+++ b/tcg/aarch64/tcg-target.inc.c
@@ -269,6 +269,8 @@ typedef enum {
 I3207_BLR   = 0xd63f0000,
 I3207_RET   = 0xd65f0000,
 
+/* Load literal for loading the address at pc-relative offset */
+I3305_LDR   = 0x58000000,
 /* Load/store register.  Described here as 3.3.12, but the helper
that emits them can transform to 3.3.10 or 3.3.13.  */
 I3312_STRB  = 0x38000000 | LDST_ST << 22 | MO_8 << 30,
@@ -389,6 +391,11 @@ static inline uint32_t tcg_in32(TCGContext *s)
 #define tcg_out_insn(S, FMT, OP, ...) \
 glue(tcg_out_insn_,FMT)(S, glue(glue(glue(I,FMT),_),OP), ## __VA_ARGS__)
 
+static void tcg_out_insn_3305(TCGContext *s, AArch64Insn insn, int imm19, TCGReg rt)
+{
+tcg_out32(s, insn | (imm19 & 0x7ffff) << 5 | rt);
+}
+
 static void tcg_out_insn_3201(TCGContext *s, AArch64Insn insn, TCGType ext,
   TCGReg rt, int imm19)
 {
@@ -864,6 +871,8 @@ static inline void tcg_out_call(TCGContext *s, 
tcg_insn_unit *target)
 }
 }
 
+#ifdef USE_DIRECT_JUMP
+
 void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr)
 {
 tcg_insn_unit *code_ptr = (tcg_insn_unit *)jmp_addr;
@@ -881,6 +890,8 @@ void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, 
uintptr_t addr)
 flush_icache_range(jmp_addr, jmp_addr + 8);
 }
 
+#endif
+
 static inline void tcg_out_goto_label(TCGContext *s, TCGLabel *l)
 {
 if (!l->has_value) {
@@ -1392,21 +1403,24 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
 break;
 
 case INDEX_op_goto_tb:
-#ifndef USE_DIRECT_JUMP
-#error "USE_DIRECT_JUMP required for aarch64"
-#endif
-/* consistency for USE_DIRECT_JUMP */
-tcg_debug_assert(s->tb_jmp_insn_offset != NULL);
-/* Ensure that ADRP+ADD are 8-byte aligned so that an atomic
-   write can be used to patch the target address. */
-if ((uintptr_t)s->code_ptr & 7) {
-tcg_out32(s, NOP);
+if (s->tb_jmp_insn_offset != NULL) {
+/* USE_DIRECT_JUMP */
+/* Ensure that ADRP+ADD are 8-byte aligned so that an atomic
+   write can be used to patch the target address. */
+if ((uintptr_t)s->code_ptr & 7) {
+tcg_out32(s, NOP);
+}
+s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
+/* actual branch destination will be patched by
+   aarch64_tb_set_jmp_target later, beware of retranslation */
+tcg_out_insn(s, 3406, ADRP, TCG_REG_TMP, 0);
+tcg_out_insn(s, 3401, ADDI, TCG_TYPE_I64, TCG_REG_TMP, TCG_REG_TMP, 0);
+} else {
+/* !USE_DIRECT_JUMP */
+tcg_debug_assert(s->tb_jmp_target_addr != NULL);
+intptr_t offset = tcg_pcrel_diff(s, (s->tb_jmp_target_addr + a0)) >> 2;
+tcg_out_insn(s, 3305, LDR, offset, TCG_REG_TMP);
 }
-s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
-/* actual branch destination will be patched by
-   aarch64_tb_set_jmp_target later, beware of retranslation */
-tcg_out_insn(s, 3406, ADRP, TCG_REG_TMP, 0);
-tcg_out_insn(s, 3401, ADDI, TCG_TYPE_I64, TCG_REG_TMP, TCG_REG_TMP, 0);
 tcg_out_callr(s, TCG_REG_TMP);
 s->tb_jmp_reset_offset[a0] = tcg_current_code_size(s);
 break;
-- 
2.13.0




[Qemu-devel] [PATCH v2 2/3] tcg/aarch64: Use ADRP+ADD to compute target address

2017-06-29 Thread Pranith Kumar
We use ADRP+ADD to compute the target address for goto_tb. This patch
introduces the NOP instruction which is used to align the above
instruction pair so that we can use one atomic instruction to patch
the destination offsets.
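
For reference, the core of the trick in isolation (my own sketch, not
part of the patch; it assumes a little-endian host, as the patch does):

    #include <stdint.h>

    /* Replace two adjacent 32-bit insns with one aligned 64-bit store.
       AArch64 treats aligned 64-bit stores as single-copy atomic, so
       another CPU can never observe a half-patched ADRP+ADD pair. */
    static void patch_insn_pair(uint64_t *aligned_slot, uint32_t lo, uint32_t hi)
    {
        uint64_t pair = (uint64_t)lo | ((uint64_t)hi << 32);
        __atomic_store_n(aligned_slot, pair, __ATOMIC_RELAXED);
        /* the caller then flushes the icache for the 8 patched bytes */
    }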

CC: Richard Henderson <r...@twiddle.net>
CC: Alex Bennée <alex.ben...@linaro.org>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 accel/tcg/translate-all.c|  2 +-
 tcg/aarch64/tcg-target.inc.c | 26 +-
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index f6ad46b613..b6d122e087 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -522,7 +522,7 @@ static inline PageDesc *page_find(tb_page_addr_t index)
 #elif defined(__powerpc__)
 # define MAX_CODE_GEN_BUFFER_SIZE  (32u * 1024 * 1024)
 #elif defined(__aarch64__)
-# define MAX_CODE_GEN_BUFFER_SIZE  (128ul * 1024 * 1024)
+# define MAX_CODE_GEN_BUFFER_SIZE  (3ul * 1024 * 1024 * 1024)
 #elif defined(__s390x__)
   /* We have a +- 4GB range on the branches; leave some slop.  */
 # define MAX_CODE_GEN_BUFFER_SIZE  (3ul * 1024 * 1024 * 1024)
diff --git a/tcg/aarch64/tcg-target.inc.c b/tcg/aarch64/tcg-target.inc.c
index 8fce11ace7..b7670ecc90 100644
--- a/tcg/aarch64/tcg-target.inc.c
+++ b/tcg/aarch64/tcg-target.inc.c
@@ -372,6 +372,7 @@ typedef enum {
 I3510_EON   = 0x4a200000,
 I3510_ANDS  = 0x6a000000,
 
+NOP = 0xd503201f,
 /* System instructions.  */
 DMB_ISH = 0xd50338bf,
 DMB_LD  = 0x00000100,
@@ -866,10 +867,18 @@ static inline void tcg_out_call(TCGContext *s, 
tcg_insn_unit *target)
 void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr)
 {
 tcg_insn_unit *code_ptr = (tcg_insn_unit *)jmp_addr;
-tcg_insn_unit *target = (tcg_insn_unit *)addr;
+tcg_insn_unit adrp_insn = *code_ptr++;
+tcg_insn_unit addi_insn = *code_ptr;
 
-reloc_pc26_atomic(code_ptr, target);
-flush_icache_range(jmp_addr, jmp_addr + 4);
+ptrdiff_t offset = (addr >> 12) - (jmp_addr >> 12);
+
+/* patch ADRP */
+adrp_insn = deposit32(adrp_insn, 29, 2, offset & 0x3);
+adrp_insn = deposit32(adrp_insn, 5, 19, offset >> 2);
+/* patch ADDI */
+addi_insn = deposit32(addi_insn, 10, 12, addr & 0xfff);
+atomic_set((uint64_t *)jmp_addr, adrp_insn | ((uint64_t)addi_insn << 32));
+flush_icache_range(jmp_addr, jmp_addr + 8);
 }
 
 static inline void tcg_out_goto_label(TCGContext *s, TCGLabel *l)
@@ -1388,10 +1397,17 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
 #endif
 /* consistency for USE_DIRECT_JUMP */
 tcg_debug_assert(s->tb_jmp_insn_offset != NULL);
+/* Ensure that ADRP+ADD are 8-byte aligned so that an atomic
+   write can be used to patch the target address. */
+if ((uintptr_t)s->code_ptr & 7) {
+tcg_out32(s, NOP);
+}
 s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
 /* actual branch destination will be patched by
-   aarch64_tb_set_jmp_target later, beware retranslation. */
-tcg_out_goto_noaddr(s);
+   aarch64_tb_set_jmp_target later, beware of retranslation */
+tcg_out_insn(s, 3406, ADRP, TCG_REG_TMP, 0);
+tcg_out_insn(s, 3401, ADDI, TCG_TYPE_I64, TCG_REG_TMP, TCG_REG_TMP, 0);
+tcg_out_callr(s, TCG_REG_TMP);
 s->tb_jmp_reset_offset[a0] = tcg_current_code_size(s);
 break;
 
-- 
2.13.0




[Qemu-devel] [PATCH v2 0/3] Relax code buffer size limitation on aarch64 hosts

2017-06-29 Thread Pranith Kumar
Hello,

The following patches enable us to relax the 128MB code buffer size
limitation on ARM64 hosts.

Patch 2 raises this limit to 3GB; ADRP+ADD can address 4GB of
pc-relative offsets, so this leaves us some slack.

Patch 3 uses LDR (literal) to load the address, allowing us to remove
the code buffer size limitation altogether. However, I feel that 3GB
should be sufficient for now and hence did not change it ;). It does,
however, enable the !USE_DIRECT_JUMP path on aarch64 hosts.
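
For reference, the reachability arithmetic involved (my own sketch, not
code from the patches):

    #include <stdint.h>
    #include <stdbool.h>

    /* B/BL take a signed 26-bit word offset: +/-128MB. */
    static bool reachable_by_b(intptr_t from, intptr_t to)
    {
        intptr_t diff = (to - from) >> 2;
        return diff >= -((intptr_t)1 << 25) && diff < ((intptr_t)1 << 25);
    }

    /* ADRP takes a signed 21-bit page offset: +/-4GB. */
    static bool reachable_by_adrp(intptr_t from, intptr_t to)
    {
        intptr_t pages = (to >> 12) - (from >> 12);
        return pages >= -((intptr_t)1 << 20) && pages < ((intptr_t)1 << 20);
    }

LDR (literal) instead loads an absolute address from a nearby
pc-relative slot, so the jump target itself is unconstrained.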

Thanks,

Pranith Kumar (3):
  tcg/aarch64: Introduce and use long branch to register
  tcg/aarch64: Use ADRP+ADD to compute target address
  tcg/aarch64: Enable indirect jump path using LDR (literal)

 accel/tcg/translate-all.c|  2 +-
 tcg/aarch64/tcg-target.inc.c | 69 +++-
 2 files changed, 56 insertions(+), 15 deletions(-)

-- 
2.13.0




[Qemu-devel] [PATCH v3 2/2] mttcg/i386: Patch instruction using async_safe_* framework

2017-06-29 Thread Pranith Kumar
In mttcg, calling pause_all_vcpus() during execution from the
generated TBs causes a deadlock if some vCPU is waiting for exclusive
execution in start_exclusive(). Fix this by using the async_safe_*
framework instead of pausing vcpus for patching instructions.
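
For reference, the queueing side of the pattern ends up looking roughly
like this (a sketch, not the verbatim hunk below):

    struct PatchInfo *info = g_new(struct PatchInfo, 1);
    info->handler = handlers;
    info->ip = ip;
    /* do_patch_instruction() runs while every other vCPU is outside
       the execution loop, so the patching needs no pause_all_vcpus() */
    async_safe_run_on_cpu(cs, do_patch_instruction, RUN_ON_CPU_HOST_PTR(info));
    cpu_loop_exit_noexc(cs);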

CC: Paolo Bonzini <pbonz...@redhat.com>
CC: Peter Maydell <peter.mayd...@linaro.org>
Reviewed-by: Richard Henderson <r...@twiddle.net>
Reviewed-by: Alex Bennée <alex.ben...@linaro.org>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 hw/i386/kvmvapic.c | 73 +++---
 1 file changed, 42 insertions(+), 31 deletions(-)

diff --git a/hw/i386/kvmvapic.c b/hw/i386/kvmvapic.c
index 82a49556af..5e0c8219b0 100644
--- a/hw/i386/kvmvapic.c
+++ b/hw/i386/kvmvapic.c
@@ -383,8 +383,7 @@ static void patch_byte(X86CPU *cpu, target_ulong addr, 
uint8_t byte)
 cpu_memory_rw_debug(CPU(cpu), addr, &byte, 1, 1);
 }
 
-static void patch_call(VAPICROMState *s, X86CPU *cpu, target_ulong ip,
-   uint32_t target)
+static void patch_call(X86CPU *cpu, target_ulong ip, uint32_t target)
 {
 uint32_t offset;
 
@@ -393,23 +392,24 @@ static void patch_call(VAPICROMState *s, X86CPU *cpu, 
target_ulong ip,
 cpu_memory_rw_debug(CPU(cpu), ip + 1, (void *)&offset, sizeof(offset), 1);
 }
 
-static void patch_instruction(VAPICROMState *s, X86CPU *cpu, target_ulong ip)
+struct PatchInfo {
+VAPICHandlers *handler;
+target_ulong ip;
+};
+
+static void do_patch_instruction(CPUState *cs, run_on_cpu_data data)
 {
-CPUState *cs = CPU(cpu);
-CPUX86State *env = &cpu->env;
-VAPICHandlers *handlers;
+X86CPU *x86_cpu = X86_CPU(cs);
+CPUX86State *env = &x86_cpu->env;
+struct PatchInfo *info = (struct PatchInfo *) data.host_ptr;
+VAPICHandlers *handlers = info->handler;
+target_ulong ip = info->ip;
 uint8_t opcode[2];
 uint32_t imm32 = 0;
 target_ulong current_pc = 0;
 target_ulong current_cs_base = 0;
 uint32_t current_flags = 0;
 
-if (smp_cpus == 1) {
-handlers = &s->rom_state.up;
-} else {
-handlers = &s->rom_state.mp;
-}
-
 if (!kvm_enabled()) {
 cpu_get_tb_cpu_state(env, &current_pc, &current_cs_base,
  &current_flags);
@@ -421,48 +421,59 @@ static void patch_instruction(VAPICROMState *s, X86CPU 
*cpu, target_ulong ip)
 }
 }
 
-pause_all_vcpus();
-
 cpu_memory_rw_debug(cs, ip, opcode, sizeof(opcode), 0);
 
 switch (opcode[0]) {
 case 0x89: /* mov r32 to r/m32 */
-patch_byte(cpu, ip, 0x50 + modrm_reg(opcode[1]));  /* push reg */
-patch_call(s, cpu, ip + 1, handlers->set_tpr);
+patch_byte(x86_cpu, ip, 0x50 + modrm_reg(opcode[1]));  /* push reg */
+patch_call(x86_cpu, ip + 1, handlers->set_tpr);
 break;
 case 0x8b: /* mov r/m32 to r32 */
-patch_byte(cpu, ip, 0x90);
-patch_call(s, cpu, ip + 1, handlers->get_tpr[modrm_reg(opcode[1])]);
+patch_byte(x86_cpu, ip, 0x90);
+patch_call(x86_cpu, ip + 1, handlers->get_tpr[modrm_reg(opcode[1])]);
 break;
 case 0xa1: /* mov abs to eax */
-patch_call(s, cpu, ip, handlers->get_tpr[0]);
+patch_call(x86_cpu, ip, handlers->get_tpr[0]);
 break;
 case 0xa3: /* mov eax to abs */
-patch_call(s, cpu, ip, handlers->set_tpr_eax);
+patch_call(x86_cpu, ip, handlers->set_tpr_eax);
 break;
 case 0xc7: /* mov imm32, r/m32 (c7/0) */
-patch_byte(cpu, ip, 0x68);  /* push imm32 */
+patch_byte(x86_cpu, ip, 0x68);  /* push imm32 */
 cpu_memory_rw_debug(cs, ip + 6, (void *)&imm32, sizeof(imm32), 0);
 cpu_memory_rw_debug(cs, ip + 1, (void *)&imm32, sizeof(imm32), 1);
-patch_call(s, cpu, ip + 5, handlers->set_tpr);
+patch_call(x86_cpu, ip + 5, handlers->set_tpr);
 break;
 case 0xff: /* push r/m32 */
-patch_byte(cpu, ip, 0x50); /* push eax */
-patch_call(s, cpu, ip + 1, handlers->get_tpr_stack);
+patch_byte(x86_cpu, ip, 0x50); /* push eax */
+patch_call(x86_cpu, ip + 1, handlers->get_tpr_stack);
 break;
 default:
 abort();
 }
 
-resume_all_vcpus();
+g_free(info);
+}
 
-if (!kvm_enabled()) {
-/* Both tb_lock and iothread_mutex will be reset when
- *  longjmps back into the cpu_exec loop. */
-tb_lock();
-tb_gen_code(cs, current_pc, current_cs_base, current_flags, 1);
-cpu_loop_exit_noexc(cs);
+static void patch_instruction(VAPICROMState *s, X86CPU *cpu, target_ulong ip)
+{
+CPUState *cs = CPU(cpu);
+VAPICHandlers *handlers;
+struct PatchInfo *info;
+const run_on_cpu_func fn = do_patch_instruction;
+
+if (smp_cpus == 1) {
+handlers = &s->rom_state.up;
+} else {
+handlers = &s->rom_state.mp;
 }
+
+info  = g_new(struct PatchInfo, 1);
+info->handler = handlers;
+inf

[Qemu-devel] [PATCH v3 1/2] Revert "exec.c: Fix breakpoint invalidation race"

2017-06-29 Thread Pranith Kumar
Now that we have proper locking after MTTCG patches have landed, we
can revert the commit. This reverts commit
a9353fe897ca2687e5b3385ed39e3db3927a90e0.

CC: Peter Maydell <peter.mayd...@linaro.org>
CC: Alex Bennée <alex.ben...@linaro.org>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 exec.c | 25 +++--
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/exec.c b/exec.c
index 42ad1eaedd..c8403baa46 100644
--- a/exec.c
+++ b/exec.c
@@ -770,15 +770,28 @@ void cpu_exec_realizefn(CPUState *cpu, Error **errp)
 #endif
 }
 
+#if defined(CONFIG_USER_ONLY)
 static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
 {
-/* Flush the whole TB as this will not have race conditions
- * even if we don't have proper locking yet.
- * Ideally we would just invalidate the TBs for the
- * specified PC.
- */
-tb_flush(cpu);
+mmap_lock();
+tb_lock();
+tb_invalidate_phys_page_range(pc, pc + 1, 0);
+tb_unlock();
+mmap_unlock();
 }
+#else
+static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
+{
+MemTxAttrs attrs;
+hwaddr phys = cpu_get_phys_page_attrs_debug(cpu, pc, &attrs);
+int asidx = cpu_asidx_from_attrs(cpu, attrs);
+if (phys != -1) {
+/* Locks grabbed by tb_invalidate_phys_addr */
+tb_invalidate_phys_addr(cpu->cpu_ases[asidx].as,
+phys | (pc & ~TARGET_PAGE_MASK));
+}
+}
+#endif
 
 #if defined(CONFIG_USER_ONLY)
 void cpu_watchpoint_remove_all(CPUState *cpu, int mask)
-- 
2.13.0




[Qemu-devel] [PATCH 0/2] Pending MTTCG patches

2017-06-29 Thread Pranith Kumar
Hello,

Please find these two pending MTTCG fixes I have in my repo.

I've reworked the async_safe_* patch according to pbonzini's suggestion.

Thanks,

Pranith Kumar (2):
  Revert "exec.c: Fix breakpoint invalidation race"
  mttcg/i386: Patch instruction using async_safe_* framework

 exec.c | 25 ++-
 hw/i386/kvmvapic.c | 73 +++---
 2 files changed, 61 insertions(+), 37 deletions(-)

-- 
2.13.0




[Qemu-devel] [PATCH 2/2] [TEST] Collect TLB and victim TLB hit/miss stats

2017-06-27 Thread Pranith Kumar
I used the following patch to collect hit/miss TLB ratios for a few
benchmarks. The results can be found here: http://imgur.com/a/gee1o

Please note that these results also include boot/shutdown as the
per-region instrumentation patch came later.
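
The ratios can be derived from the printed counters like this (my own
sketch; the field names match the patch):

    static double tlb_hit_ratio(const CPUArchState *cs)
    {
        return cs->tlb_access_total ?
            (double)cs->tlb_access_hit / (double)cs->tlb_access_total : 0.0;
    }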

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 accel/tcg/cputlb.c| 12 
 cpus.c| 26 ++
 include/exec/cpu-defs.h   |  4 
 include/sysemu/cpus.h |  2 ++
 target/arm/helper.c   |  6 +-
 tcg/i386/tcg-target.inc.c | 16 ++--
 vl.c  |  3 +++
 7 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index ef52a7e5e0..2ac2397431 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -864,12 +864,19 @@ static void io_writex(CPUArchState *env, CPUIOTLBEntry 
*iotlbentry,
 }
 }
 
+extern bool enable_instrumentation;
+
 /* Return true if ADDR is present in the victim tlb, and has been copied
back to the main tlb.  */
 static bool victim_tlb_hit(CPUArchState *env, size_t mmu_idx, size_t index,
size_t elt_ofs, target_ulong page)
 {
 size_t vidx;
+
+if (enable_instrumentation) {
+env->tlb_access_victim++;
+}
+
 for (vidx = 0; vidx < CPU_VTLB_SIZE; ++vidx) {
 CPUTLBEntry *vtlb = &env->tlb_v_table[mmu_idx][vidx];
 target_ulong cmp = *(target_ulong *)((uintptr_t)vtlb + elt_ofs);
@@ -885,6 +892,11 @@ static bool victim_tlb_hit(CPUArchState *env, size_t 
mmu_idx, size_t index,
 CPUIOTLBEntry tmpio, *io = &env->iotlb[mmu_idx][index];
 CPUIOTLBEntry *vio = &env->iotlb_v[mmu_idx][vidx];
 tmpio = *io; *io = *vio; *vio = tmpio;
+
+if (enable_instrumentation) {
+env->tlb_access_victim_hit++;
+}
+
 return true;
 }
 }
diff --git a/cpus.c b/cpus.c
index 14bb8d552e..14669b3469 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1602,6 +1602,32 @@ static bool all_vcpus_paused(void)
 return true;
 }
 
+void print_tlb_stats(void)
+{
+CPUState *cpu;
+CPU_FOREACH(cpu) {
+CPUArchState *cs = cpu->env_ptr;
+
+fprintf(stderr, "TLB accesses %lu, hits %lu, victim accesses %lu, hits %lu\n",
+cs->tlb_access_total, cs->tlb_access_hit, cs->tlb_access_victim,
+cs->tlb_access_victim_hit);
+}
+}
+
+void clear_tlb_stats(void)
+{
+CPUState *cpu;
+CPU_FOREACH(cpu) {
+CPUArchState *cs = cpu->env_ptr;
+
+cs->tlb_access_total= 0;
+cs->tlb_access_hit  = 0;
+cs->tlb_access_victim   = 0;
+cs->tlb_access_victim_hit   = 0;
+}
+}
+
 void pause_all_vcpus(void)
 {
 CPUState *cpu;
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 5f4e303635..29b3c2ada8 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -138,6 +138,10 @@ typedef struct CPUIOTLBEntry {
 target_ulong tlb_flush_addr;\
 target_ulong tlb_flush_mask;\
 target_ulong vtlb_index;\
+target_ulong tlb_access_hit;\
+target_ulong tlb_access_total;  \
+target_ulong tlb_access_victim; \
+target_ulong tlb_access_victim_hit; \
 
 #else
 
diff --git a/include/sysemu/cpus.h b/include/sysemu/cpus.h
index 731756d948..7d8d92646c 100644
--- a/include/sysemu/cpus.h
+++ b/include/sysemu/cpus.h
@@ -10,6 +10,8 @@ void resume_all_vcpus(void);
 void pause_all_vcpus(void);
 void cpu_stop_current(void);
 void cpu_ticks_init(void);
+void print_tlb_stats(void);
+void clear_tlb_stats(void);
 
 void configure_icount(QemuOpts *opts, Error **errp);
 extern int use_icount;
diff --git a/target/arm/helper.c b/target/arm/helper.c
index dfbf03676c..d2e75b0f20 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -1124,7 +1124,9 @@ static uint64_t pmxevtyper_read(CPUARMState *env, const 
ARMCPRegInfo *ri)
 }
 }
 
-bool enable_instrumentation;
+extern bool enable_instrumentation;
+extern void print_tlb_stats(void);
+extern void clear_tlb_stats(void);
 
 static void pmuserenr_write(CPUARMState *env, const ARMCPRegInfo *ri,
 uint64_t value)
@@ -1139,6 +1141,8 @@ static void pmuserenr_write(CPUARMState *env, const 
ARMCPRegInfo *ri,
 } else if (value == 0xfa11dead) {
 printf("Disabling instrumentation\n");
 enable_instrumentation = false;
+print_tlb_stats();
+clear_tlb_stats();
 tb_flush(cs);
 }
 
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 9d7d25c017..b75bd54c35 100644
--- a/tcg/i386/tcg-t

[Qemu-devel] [PATCH 1/2] [TEST] aarch64: Use pmuserenr_el0 register for instrumentation

2017-06-27 Thread Pranith Kumar
We need a way for the benchmark running in the guest to signal us to
start/stop our instrumentation. On x86, we could use the 'cpuid'
instruction along with an appropriately populated 'eax'
register. However, no such dummy instruction exists for aarch64. So
we modify the permission bits of the 'pmuserenr_el0' register and tap
writes to it to instrument the guest code.

You can use the following annotations on your region-of-interest to
instrument the code.

#define magic_enable() \
  asm volatile ("msr pmuserenr_el0, %0" :: "r" (0x));
#define magic_disable() \
  asm volatile ("msr pmuserenr_el0, %0" :: "r" (0xfa11dead));

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 target/arm/helper.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/target/arm/helper.c b/target/arm/helper.c
index 2594faa9b8..dfbf03676c 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -1124,9 +1124,24 @@ static uint64_t pmxevtyper_read(CPUARMState *env, const 
ARMCPRegInfo *ri)
 }
 }
 
+bool enable_instrumentation;
+
 static void pmuserenr_write(CPUARMState *env, const ARMCPRegInfo *ri,
 uint64_t value)
 {
+ARMCPU *cpu = arm_env_get_cpu(env);
+CPUState *cs = CPU(cpu);
+
+if (value == 0x) {
+printf("Enabling instrumentation\n");
+enable_instrumentation = true;
+tb_flush(cs);
+} else if (value == 0xfa11dead) {
+printf("Disabling instrumentation\n");
+enable_instrumentation = false;
+tb_flush(cs);
+}
+
 if (arm_feature(env, ARM_FEATURE_V8)) {
 env->cp15.c9_pmuserenr = value & 0xf;
 } else {
@@ -1316,13 +1331,13 @@ static const ARMCPRegInfo v7_cp_reginfo[] = {
   .access = PL0_RW, .type = ARM_CP_CONST, .resetvalue = 0,
   .accessfn = pmreg_access_xevcntr },
 { .name = "PMUSERENR", .cp = 15, .crn = 9, .crm = 14, .opc1 = 0, .opc2 = 0,
-  .access = PL0_R | PL1_RW, .accessfn = access_tpm,
+  .access = PL0_RW | PL1_RW, .accessfn = access_tpm,
   .fieldoffset = offsetof(CPUARMState, cp15.c9_pmuserenr),
   .resetvalue = 0,
   .writefn = pmuserenr_write, .raw_writefn = raw_write },
 { .name = "PMUSERENR_EL0", .state = ARM_CP_STATE_AA64,
   .opc0 = 3, .opc1 = 3, .crn = 9, .crm = 14, .opc2 = 0,
-  .access = PL0_R | PL1_RW, .accessfn = access_tpm, .type = ARM_CP_ALIAS,
+  .access = PL0_RW | PL1_RW, .accessfn = access_tpm, .type = ARM_CP_ALIAS,
   .fieldoffset = offsetof(CPUARMState, cp15.c9_pmuserenr),
   .resetvalue = 0,
   .writefn = pmuserenr_write, .raw_writefn = raw_write },
-- 
2.13.0




[Qemu-devel] [TEST PATCH 0/2] Instrumentation and TLB stats

2017-06-27 Thread Pranith Kumar
The following two patches are what I use to instrument guest code and
collect TLB hit/miss information. These patches are for informational
and discussion purposes only.

Pranith Kumar (2):
  [TEST] aarch64: Use pmuserenr_el0 register for instrumentation
  [TEST] Collect TLB stats along with victim TLB

 accel/tcg/cputlb.c| 12 
 cpus.c| 26 ++
 include/exec/cpu-defs.h   |  4 
 include/sysemu/cpus.h |  2 ++
 target/arm/helper.c   | 23 +--
 tcg/i386/tcg-target.inc.c | 16 ++--
 vl.c  |  3 +++
 7 files changed, 82 insertions(+), 4 deletions(-)

-- 
2.13.0




Re: [Qemu-devel] [PATCH v1 2/3] tcg-runtime: light re-factor of lookup_tb_ptr

2017-06-14 Thread Pranith Kumar
Hi Alex,

On Wed, Jun 14, 2017 at 10:02 AM, Alex Bennée <alex.ben...@linaro.org> wrote:
> Just a little precursor re-factoring before I was going to add a trace
> point:
>
>   - single return point, defaulting to tcg_ctx.code_gen_epilogue
>   - move cs_base, pc and flags inside the jump cache hit scope
>   - calculate the tb_jmp_cache hash once
>
> Signed-off-by: Alex Bennée <alex.ben...@linaro.org>
> ---
>  tcg-runtime.c | 35 +++
>  1 file changed, 19 insertions(+), 16 deletions(-)

Phew, glad you got this figured out! I tested it on the images I have
and it works. Please add:

Tested-by: Pranith Kumar <bobby.pr...@gmail.com>


>
> diff --git a/tcg-runtime.c b/tcg-runtime.c
> index 7fa90ce508..f4bfa9cea6 100644
> --- a/tcg-runtime.c
> +++ b/tcg-runtime.c
> @@ -147,30 +147,33 @@ uint64_t HELPER(ctpop_i64)(uint64_t arg)
>  void *HELPER(lookup_tb_ptr)(CPUArchState *env, target_ulong addr)
>  {
>  CPUState *cpu = ENV_GET_CPU(env);
> +unsigned int addr_hash = tb_jmp_cache_hash_func(addr);
> +void *code_ptr = NULL;
>  TranslationBlock *tb;
> -target_ulong cs_base, pc;
> -uint32_t flags;
>
> -tb = atomic_rcu_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(addr)]);
> +tb = atomic_rcu_read(&cpu->tb_jmp_cache[addr_hash]);
>  if (likely(tb)) {
> +target_ulong cs_base, pc;
> +uint32_t flags;
> +
>  cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
> +
>  if (likely(tb->pc == addr && tb->cs_base == cs_base &&
> tb->flags == flags)) {
> -goto found;
> -}
> -tb = tb_htable_lookup(cpu, addr, cs_base, flags);
> -if (likely(tb)) {
> -atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(addr)], tb);
> -goto found;
> +code_ptr = tb->tc_ptr;
> +} else {
> +/* If we didn't find it in the jmp_cache we still might
> + * find it in the global tb_htable
> + */
> +tb = tb_htable_lookup(cpu, addr, cs_base, flags);
> +if (likely(tb)) {
> +atomic_set(&cpu->tb_jmp_cache[addr_hash], tb);
> +code_ptr = tb->tc_ptr;
> +}
>  }
>  }
> -return tcg_ctx.code_gen_epilogue;
> - found:
> -qemu_log_mask_and_addr(CPU_LOG_EXEC, addr,
> -   "Chain %p [%d: " TARGET_FMT_lx "] %s\n",
> -   tb->tc_ptr, cpu->cpu_index, addr,
> -   lookup_symbol(addr));
> -return tb->tc_ptr;
> +
> +return code_ptr ? code_ptr : tcg_ctx.code_gen_epilogue;
>  }
>
>  void HELPER(exit_atomic)(CPUArchState *env)
> --
> 2.13.0
>
>



[Qemu-devel] [RFC PATCH 1/3] tcg/aarch64: Introduce and use jump to register

2017-06-07 Thread Pranith Kumar
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 tcg/aarch64/tcg-target.inc.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/tcg/aarch64/tcg-target.inc.c b/tcg/aarch64/tcg-target.inc.c
index 1fa3bccc89..ab0a8caa03 100644
--- a/tcg/aarch64/tcg-target.inc.c
+++ b/tcg/aarch64/tcg-target.inc.c
@@ -819,6 +819,12 @@ static inline void tcg_out_goto(TCGContext *s, 
tcg_insn_unit *target)
 tcg_out_insn(s, 3206, B, offset);
 }
 
+static inline void tcg_out_goto_register(TCGContext *s, intptr_t target)
+{
+tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP, target);
+tcg_out_insn(s, 3207, BR, TCG_REG_TMP);
+}
+
 static inline void tcg_out_goto_noaddr(TCGContext *s)
 {
 /* We pay attention here to not modify the branch target by reading from
@@ -1364,10 +1370,10 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
 case INDEX_op_exit_tb:
 /* Reuse the zeroing that exists for goto_ptr.  */
 if (a0 == 0) {
-tcg_out_goto(s, s->code_gen_epilogue);
+tcg_out_goto_register(s, (intptr_t)(s->code_gen_epilogue));
 } else {
 tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X0, a0);
-tcg_out_goto(s, tb_ret_addr);
+tcg_out_goto_register(s, (intptr_t)(tb_ret_addr));
 }
 break;
 
-- 
2.13.0




[Qemu-devel] [RFC PATCH 0/3] Remove code buffer size limitation on aarch64 hosts

2017-06-07 Thread Pranith Kumar
Hi,

The following patches apply on top of tcg-next of rth's branch. These
patches make use of LDR (literal) on aarch64 and enable us to remove
the 128MB code buffer size limitation.

Pranith Kumar (3):
  tcg/aarch64: Introduce and use jump to register
  tcg/aarch64: Introduce LDR (literal) generation for aarch64
  tcg/aarch64: Remove code buffer size limitation

 include/exec/exec-all.h  |  6 +-
 tcg/aarch64/tcg-target.inc.c | 42 +-
 translate-all.c  |  2 --
 3 files changed, 22 insertions(+), 28 deletions(-)

-- 
2.13.0




[Qemu-devel] [RFC PATCH 2/3] tcg/aarch64: Introduce LDR (literal) for aarch64

2017-06-07 Thread Pranith Kumar
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 tcg/aarch64/tcg-target.inc.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/tcg/aarch64/tcg-target.inc.c b/tcg/aarch64/tcg-target.inc.c
index ab0a8caa03..e488aacadb 100644
--- a/tcg/aarch64/tcg-target.inc.c
+++ b/tcg/aarch64/tcg-target.inc.c
@@ -269,6 +269,8 @@ typedef enum {
 I3207_BLR   = 0xd63f0000,
 I3207_RET   = 0xd65f0000,
 
+/* Load literal for loading the address at pc-relative offset */
+I3305_LDR   = 0x58000000,
 /* Load/store register.  Described here as 3.3.12, but the helper
that emits them can transform to 3.3.10 or 3.3.13.  */
 I3312_STRB  = 0x38000000 | LDST_ST << 22 | MO_8 << 30,
@@ -388,6 +390,11 @@ static inline uint32_t tcg_in32(TCGContext *s)
 #define tcg_out_insn(S, FMT, OP, ...) \
 glue(tcg_out_insn_,FMT)(S, glue(glue(glue(I,FMT),_),OP), ## __VA_ARGS__)
 
+static void tcg_out_insn_3305(TCGContext *s, AArch64Insn insn, int imm19, TCGReg rt)
+{
+tcg_out32(s, insn | (imm19 & 0x7ffff) << 5 | rt);
+}
+
 static void tcg_out_insn_3201(TCGContext *s, AArch64Insn insn, TCGType ext,
   TCGReg rt, int imm19)
 {
-- 
2.13.0




[Qemu-devel] [RFC PATCH 3/3] tcg/aarch64: Remove code buffer size limitation

2017-06-07 Thread Pranith Kumar
This enables the indirect jump path on aarch64 hosts. Tested by booting an
x86 guest on an aarch64 host.

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 include/exec/exec-all.h  |  6 +-
 tcg/aarch64/tcg-target.inc.c | 25 ++---
 translate-all.c  |  2 --
 3 files changed, 7 insertions(+), 26 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 724ec73dce..a6bd3c7d1e 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -301,9 +301,8 @@ static inline void 
tlb_flush_by_mmuidx_all_cpus_synced(CPUState *cpu,
 #define CODE_GEN_AVG_BLOCK_SIZE 150
 #endif
 
-#if defined(_ARCH_PPC) \
+#if defined(_ARCH_PPC) || defined(__sparc__)\
 || defined(__x86_64__) || defined(__i386__) \
-|| defined(__sparc__) || defined(__aarch64__) \
 || defined(__s390x__) || defined(__mips__) \
 || defined(CONFIG_TCG_INTERPRETER)
 /* NOTE: Direct jump patching must be atomic to be thread-safe. */
@@ -398,9 +397,6 @@ static inline void tb_set_jmp_target1(uintptr_t jmp_addr, 
uintptr_t addr)
 atomic_set((int32_t *)jmp_addr, disp / 2);
 /* no need to flush icache explicitly */
 }
-#elif defined(__aarch64__)
-void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr);
-#define tb_set_jmp_target1 aarch64_tb_set_jmp_target
 #elif defined(__sparc__) || defined(__mips__)
 void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr);
 #else
diff --git a/tcg/aarch64/tcg-target.inc.c b/tcg/aarch64/tcg-target.inc.c
index e488aacadb..cc81a0d5ff 100644
--- a/tcg/aarch64/tcg-target.inc.c
+++ b/tcg/aarch64/tcg-target.inc.c
@@ -865,15 +865,6 @@ static inline void tcg_out_call(TCGContext *s, 
tcg_insn_unit *target)
 }
 }
 
-void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr)
-{
-tcg_insn_unit *code_ptr = (tcg_insn_unit *)jmp_addr;
-tcg_insn_unit *target = (tcg_insn_unit *)addr;
-
-reloc_pc26_atomic(code_ptr, target);
-flush_icache_range(jmp_addr, jmp_addr + 4);
-}
-
 static inline void tcg_out_goto_label(TCGContext *s, TCGLabel *l)
 {
 if (!l->has_value) {
@@ -1385,16 +1376,12 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
 break;
 
 case INDEX_op_goto_tb:
-#ifndef USE_DIRECT_JUMP
-#error "USE_DIRECT_JUMP required for aarch64"
-#endif
-/* consistency for USE_DIRECT_JUMP */
-tcg_debug_assert(s->tb_jmp_insn_offset != NULL);
-s->tb_jmp_insn_offset[a0] = tcg_current_code_size(s);
-/* actual branch destination will be patched by
-   aarch64_tb_set_jmp_target later, beware retranslation. */
-tcg_out_goto_noaddr(s);
-s->tb_jmp_reset_offset[a0] = tcg_current_code_size(s);
+{
+intptr_t offset = tcg_pcrel_diff(s, (s->tb_jmp_target_addr + a0)) >> 2;
+tcg_out_insn(s, 3305, LDR, offset, TCG_REG_TMP);
+tcg_out_callr(s, TCG_REG_TMP);
+s->tb_jmp_reset_offset[a0] = tcg_current_code_size(s);
+}
 break;
 
 case INDEX_op_goto_ptr:
diff --git a/translate-all.c b/translate-all.c
index 966747ad60..e4cd849931 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -521,8 +521,6 @@ static inline PageDesc *page_find(tb_page_addr_t index)
 # define MAX_CODE_GEN_BUFFER_SIZE  (2ul * 1024 * 1024 * 1024)
 #elif defined(__powerpc__)
 # define MAX_CODE_GEN_BUFFER_SIZE  (32u * 1024 * 1024)
-#elif defined(__aarch64__)
-# define MAX_CODE_GEN_BUFFER_SIZE  (128ul * 1024 * 1024)
 #elif defined(__s390x__)
   /* We have a +- 4GB range on the branches; leave some slop.  */
 # define MAX_CODE_GEN_BUFFER_SIZE  (3ul * 1024 * 1024 * 1024)
-- 
2.13.0




[Qemu-devel] [PATCH RESEND] mttcg/i386: Patch instruction using async_safe_* framework

2017-06-07 Thread Pranith Kumar
In mttcg, calling pause_all_vcpus() during execution from the
generated TBs causes a deadlock if some vCPU is waiting for exclusive
execution in start_exclusive(). Fix this by using the async_safe_*
framework instead of pausing vcpus for patching instructions.

CC: Paolo Bonzini <pbonz...@redhat.com>
CC: Peter Maydell <peter.mayd...@linaro.org>
Reviewed-by: Richard Henderson <r...@twiddle.net>
Reviewed-by: Alex Bennée <alex.ben...@linaro.org>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 hw/i386/kvmvapic.c | 82 ++
 1 file changed, 52 insertions(+), 30 deletions(-)

diff --git a/hw/i386/kvmvapic.c b/hw/i386/kvmvapic.c
index 82a49556af..11b0d494eb 100644
--- a/hw/i386/kvmvapic.c
+++ b/hw/i386/kvmvapic.c
@@ -383,8 +383,7 @@ static void patch_byte(X86CPU *cpu, target_ulong addr, 
uint8_t byte)
 cpu_memory_rw_debug(CPU(cpu), addr, &byte, 1, 1);
 }
 
-static void patch_call(VAPICROMState *s, X86CPU *cpu, target_ulong ip,
-   uint32_t target)
+static void patch_call(X86CPU *cpu, target_ulong ip, uint32_t target)
 {
 uint32_t offset;
 
@@ -393,23 +392,24 @@ static void patch_call(VAPICROMState *s, X86CPU *cpu, 
target_ulong ip,
 cpu_memory_rw_debug(CPU(cpu), ip + 1, (void *)&offset, sizeof(offset), 1);
 }
 
-static void patch_instruction(VAPICROMState *s, X86CPU *cpu, target_ulong ip)
+struct PatchInfo {
+VAPICHandlers *handler;
+target_ulong ip;
+};
+
+static void do_patch_instruction(CPUState *cs, run_on_cpu_data data)
 {
-CPUState *cs = CPU(cpu);
-CPUX86State *env = &cpu->env;
-VAPICHandlers *handlers;
+X86CPU *x86_cpu = X86_CPU(cs);
+CPUX86State *env = &x86_cpu->env;
+struct PatchInfo *info = (struct PatchInfo *) data.host_ptr;
+VAPICHandlers *handlers = info->handler;
+target_ulong ip = info->ip;
 uint8_t opcode[2];
 uint32_t imm32 = 0;
 target_ulong current_pc = 0;
 target_ulong current_cs_base = 0;
 uint32_t current_flags = 0;
 
-if (smp_cpus == 1) {
-handlers = &s->rom_state.up;
-} else {
-handlers = &s->rom_state.mp;
-}
-
 if (!kvm_enabled()) {
 cpu_get_tb_cpu_state(env, &current_pc, &current_cs_base,
  &current_flags);
@@ -421,48 +421,70 @@ static void patch_instruction(VAPICROMState *s, X86CPU 
*cpu, target_ulong ip)
 }
 }
 
-pause_all_vcpus();
-
 cpu_memory_rw_debug(cs, ip, opcode, sizeof(opcode), 0);
 
 switch (opcode[0]) {
 case 0x89: /* mov r32 to r/m32 */
-patch_byte(cpu, ip, 0x50 + modrm_reg(opcode[1]));  /* push reg */
-patch_call(s, cpu, ip + 1, handlers->set_tpr);
+patch_byte(x86_cpu, ip, 0x50 + modrm_reg(opcode[1]));  /* push reg */
+patch_call(x86_cpu, ip + 1, handlers->set_tpr);
 break;
 case 0x8b: /* mov r/m32 to r32 */
-patch_byte(cpu, ip, 0x90);
-patch_call(s, cpu, ip + 1, handlers->get_tpr[modrm_reg(opcode[1])]);
+patch_byte(x86_cpu, ip, 0x90);
+patch_call(x86_cpu, ip + 1, handlers->get_tpr[modrm_reg(opcode[1])]);
 break;
 case 0xa1: /* mov abs to eax */
-patch_call(s, cpu, ip, handlers->get_tpr[0]);
+patch_call(x86_cpu, ip, handlers->get_tpr[0]);
 break;
 case 0xa3: /* mov eax to abs */
-patch_call(s, cpu, ip, handlers->set_tpr_eax);
+patch_call(x86_cpu, ip, handlers->set_tpr_eax);
 break;
 case 0xc7: /* mov imm32, r/m32 (c7/0) */
-patch_byte(cpu, ip, 0x68);  /* push imm32 */
+patch_byte(x86_cpu, ip, 0x68);  /* push imm32 */
 cpu_memory_rw_debug(cs, ip + 6, (void *)&imm32, sizeof(imm32), 0);
 cpu_memory_rw_debug(cs, ip + 1, (void *)&imm32, sizeof(imm32), 1);
-patch_call(s, cpu, ip + 5, handlers->set_tpr);
+patch_call(x86_cpu, ip + 5, handlers->set_tpr);
 break;
 case 0xff: /* push r/m32 */
-patch_byte(cpu, ip, 0x50); /* push eax */
-patch_call(s, cpu, ip + 1, handlers->get_tpr_stack);
+patch_byte(x86_cpu, ip, 0x50); /* push eax */
+patch_call(x86_cpu, ip + 1, handlers->get_tpr_stack);
 break;
 default:
 abort();
 }
 
-resume_all_vcpus();
+g_free(info);
+}
+
+static void patch_instruction(VAPICROMState *s, X86CPU *cpu, target_ulong ip)
+{
+CPUState *cs = CPU(cpu);
+VAPICHandlers *handlers;
+struct PatchInfo *info;
+
+if (smp_cpus == 1) {
+handlers = &s->rom_state.up;
+} else {
+handlers = &s->rom_state.mp;
+}
+
+info  = g_new(struct PatchInfo, 1);
+info->handler = handlers;
+info->ip = ip;
 
 if (!kvm_enabled()) {
-/* Both tb_lock and iothread_mutex will be reset when
- *  longjmps back into the cpu_exec loop. */
-tb_lock();
-tb_gen_code(cs, current_pc, current_cs_base, current_flags, 1);
-cpu_loop_exit_noexc(cs);
+const run_on_cpu_func fn = do_pat

Re: [Qemu-devel] [PATCH v2] mttcg/i386: Patch instruction using async_safe_* framework

2017-06-07 Thread Pranith Kumar
On Wed, Jun 7, 2017 at 2:09 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> Pranith Kumar <bobby.pr...@gmail.com> writes:
>
>> Can someone please pick this up?
>
> It needs to be re-posted with the review tag and ping Paolo re: async
> work for KVM.
>

Will do.

Thanks,
-- 
Pranith



[Qemu-devel] [PATCH] Revert "exec.c: Fix breakpoint invalidation race"

2017-06-07 Thread Pranith Kumar
Now that we have proper locking after MTTCG patches have landed, we
can revert the commit. This reverts commit
a9353fe897ca2687e5b3385ed39e3db3927a90e0.

CC: Peter Maydell <peter.mayd...@linaro.org>
CC: Alex Bennée <alex.ben...@linaro.org>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 exec.c | 25 +++--
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/exec.c b/exec.c
index eac6085760..ce7a3412bf 100644
--- a/exec.c
+++ b/exec.c
@@ -731,15 +731,28 @@ void cpu_exec_realizefn(CPUState *cpu, Error **errp)
 #endif
 }
 
+#if defined(CONFIG_USER_ONLY)
 static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
 {
-/* Flush the whole TB as this will not have race conditions
- * even if we don't have proper locking yet.
- * Ideally we would just invalidate the TBs for the
- * specified PC.
- */
-tb_flush(cpu);
+mmap_lock();
+tb_lock();
+tb_invalidate_phys_page_range(pc, pc + 1, 0);
+tb_unlock();
+mmap_unlock();
 }
+#else
+static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
+{
+MemTxAttrs attrs;
+hwaddr phys = cpu_get_phys_page_attrs_debug(cpu, pc, &attrs);
+int asidx = cpu_asidx_from_attrs(cpu, attrs);
+if (phys != -1) {
+/* Locks grabbed by tb_invalidate_phys_addr */
+tb_invalidate_phys_addr(cpu->cpu_ases[asidx].as,
+phys | (pc & ~TARGET_PAGE_MASK));
+}
+}
+#endif
 
 #if defined(CONFIG_USER_ONLY)
 void cpu_watchpoint_remove_all(CPUState *cpu, int mask)
-- 
2.11.0




Re: [Qemu-devel] [PATCH v2] mttcg/i386: Patch instruction using async_safe_* framework

2017-06-07 Thread Pranith Kumar
Can someone please pick this up?

Thanks,

On Fri, Feb 24, 2017 at 12:42 AM, Pranith Kumar <bobby.pr...@gmail.com> wrote:
> In mttcg, calling pause_all_vcpus() during execution from the
> generated TBs causes a deadlock if some vCPU is waiting for exclusive
> execution in start_exclusive(). Fix this by using the async_safe_*
> framework instead of pausing vcpus for patching instructions.
>
> CC: Richard Henderson <r...@twiddle.net>
> CC: Peter Maydell <peter.mayd...@linaro.org>
> CC: Alex Bennée <alex.ben...@linaro.org>
> Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
> ---
>  hw/i386/kvmvapic.c | 82 
> ++
>  1 file changed, 52 insertions(+), 30 deletions(-)
>
> diff --git a/hw/i386/kvmvapic.c b/hw/i386/kvmvapic.c
> index 82a4955..11b0d49 100644
> --- a/hw/i386/kvmvapic.c
> +++ b/hw/i386/kvmvapic.c
> @@ -383,8 +383,7 @@ static void patch_byte(X86CPU *cpu, target_ulong addr, 
> uint8_t byte)
>  cpu_memory_rw_debug(CPU(cpu), addr, &byte, 1, 1);
>  }
>
> -static void patch_call(VAPICROMState *s, X86CPU *cpu, target_ulong ip,
> -   uint32_t target)
> +static void patch_call(X86CPU *cpu, target_ulong ip, uint32_t target)
>  {
>  uint32_t offset;
>
> @@ -393,23 +392,24 @@ static void patch_call(VAPICROMState *s, X86CPU *cpu, 
> target_ulong ip,
>  cpu_memory_rw_debug(CPU(cpu), ip + 1, (void *)&offset, sizeof(offset), 1);
>  }
>
> -static void patch_instruction(VAPICROMState *s, X86CPU *cpu, target_ulong ip)
> +struct PatchInfo {
> +VAPICHandlers *handler;
> +target_ulong ip;
> +};
> +
> +static void do_patch_instruction(CPUState *cs, run_on_cpu_data data)
>  {
> -CPUState *cs = CPU(cpu);
> -CPUX86State *env = &cpu->env;
> -VAPICHandlers *handlers;
> +X86CPU *x86_cpu = X86_CPU(cs);
> +CPUX86State *env = &x86_cpu->env;
> +struct PatchInfo *info = (struct PatchInfo *) data.host_ptr;
> +VAPICHandlers *handlers = info->handler;
> +target_ulong ip = info->ip;
>  uint8_t opcode[2];
>  uint32_t imm32 = 0;
>  target_ulong current_pc = 0;
>  target_ulong current_cs_base = 0;
>  uint32_t current_flags = 0;
>
> -if (smp_cpus == 1) {
> -handlers = &s->rom_state.up;
> -} else {
> -handlers = &s->rom_state.mp;
> -}
> -
>  if (!kvm_enabled()) {
>  cpu_get_tb_cpu_state(env, &current_pc, &current_cs_base,
>   &current_flags);
> @@ -421,48 +421,70 @@ static void patch_instruction(VAPICROMState *s, X86CPU 
> *cpu, target_ulong ip)
>  }
>  }
>
> -pause_all_vcpus();
> -
>  cpu_memory_rw_debug(cs, ip, opcode, sizeof(opcode), 0);
>
>  switch (opcode[0]) {
>  case 0x89: /* mov r32 to r/m32 */
> -patch_byte(cpu, ip, 0x50 + modrm_reg(opcode[1]));  /* push reg */
> -patch_call(s, cpu, ip + 1, handlers->set_tpr);
> +patch_byte(x86_cpu, ip, 0x50 + modrm_reg(opcode[1]));  /* push reg */
> +patch_call(x86_cpu, ip + 1, handlers->set_tpr);
>  break;
>  case 0x8b: /* mov r/m32 to r32 */
> -patch_byte(cpu, ip, 0x90);
> -patch_call(s, cpu, ip + 1, handlers->get_tpr[modrm_reg(opcode[1])]);
> +patch_byte(x86_cpu, ip, 0x90);
> +patch_call(x86_cpu, ip + 1, handlers->get_tpr[modrm_reg(opcode[1])]);
>  break;
>  case 0xa1: /* mov abs to eax */
> -patch_call(s, cpu, ip, handlers->get_tpr[0]);
> +patch_call(x86_cpu, ip, handlers->get_tpr[0]);
>  break;
>  case 0xa3: /* mov eax to abs */
> -patch_call(s, cpu, ip, handlers->set_tpr_eax);
> +patch_call(x86_cpu, ip, handlers->set_tpr_eax);
>  break;
>  case 0xc7: /* mov imm32, r/m32 (c7/0) */
> -patch_byte(cpu, ip, 0x68);  /* push imm32 */
> +patch_byte(x86_cpu, ip, 0x68);  /* push imm32 */
>  cpu_memory_rw_debug(cs, ip + 6, (void *)&imm32, sizeof(imm32), 0);
>  cpu_memory_rw_debug(cs, ip + 1, (void *)&imm32, sizeof(imm32), 1);
> -patch_call(s, cpu, ip + 5, handlers->set_tpr);
> +patch_call(x86_cpu, ip + 5, handlers->set_tpr);
>  break;
>  case 0xff: /* push r/m32 */
> -patch_byte(cpu, ip, 0x50); /* push eax */
> -patch_call(s, cpu, ip + 1, handlers->get_tpr_stack);
> +patch_byte(x86_cpu, ip, 0x50); /* push eax */
> +patch_call(x86_cpu, ip + 1, handlers->get_tpr_stack);
>  break;
>  default:
>  abort();
>  }
>
> -resume_all_vcpus();
> +g_free(info);
> +}
> +
> +static void patch_instruction(VAPICROMState *s, X86CPU *cpu, targ

Re: [Qemu-devel] [PATCH v3 1/2 -- fixed] util: add cacheinfo

2017-06-07 Thread Pranith Kumar
On Tue, Jun 6, 2017 at 8:17 PM, Emilio G. Cota <c...@braap.org> wrote:
> Add helpers to gather cache info from the host at init-time.
>
> For now, only export the host's I/D cache line sizes, which we
> will use to improve cache locality to avoid false sharing.
>
> Suggested-by: Richard Henderson <r...@twiddle.net>
> Suggested-by: Geert Martin Ijewski <gm.ijew...@web.de>
> Tested-by: Geert Martin Ijewski <gm.ijew...@web.de>
> Signed-off-by: Emilio G. Cota <c...@braap.org>
> ---

Reviewed-by: Pranith Kumar <bobby.pr...@gmail.com>

-- 
Pranith



Re: [Qemu-devel] [PATCH v2 2/3] tests: use QEMU_CACHELINE_SIZE instead of hard-coding it

2017-06-05 Thread Pranith Kumar
On Mon, Jun 5, 2017 at 6:49 PM, Emilio G. Cota <c...@braap.org> wrote:
> Signed-off-by: Emilio G. Cota <c...@braap.org>

Reviewed-by: Pranith Kumar <bobby.pr...@gmail.com>

> ---
>  tests/atomic_add-bench.c | 4 ++--
>  tests/qht-bench.c| 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/tests/atomic_add-bench.c b/tests/atomic_add-bench.c
> index caa1e8e..c219109 100644
> --- a/tests/atomic_add-bench.c
> +++ b/tests/atomic_add-bench.c
> @@ -5,11 +5,11 @@
>
>  struct thread_info {
>  uint64_t r;
> -} QEMU_ALIGNED(64);
> +} QEMU_ALIGNED(QEMU_CACHELINE_SIZE);
>
>  struct count {
>  unsigned long val;
> -} QEMU_ALIGNED(64);
> +} QEMU_ALIGNED(QEMU_CACHELINE_SIZE);
>
>  static QemuThread *threads;
>  static struct thread_info *th_info;
> diff --git a/tests/qht-bench.c b/tests/qht-bench.c
> index 2afa09d..3f4b5eb 100644
> --- a/tests/qht-bench.c
> +++ b/tests/qht-bench.c
> @@ -28,7 +28,7 @@ struct thread_info {
>  uint64_t r;
>  bool write_op; /* writes alternate between insertions and removals */
>  bool resize_down;
> -} QEMU_ALIGNED(64); /* avoid false sharing among threads */
> +} QEMU_ALIGNED(QEMU_CACHELINE_SIZE); /* avoid false sharing among threads */
>
>  static struct qht ht;
>  static QemuThread *rw_threads;
> --
> 2.7.4
>



-- 
Pranith



Re: [Qemu-devel] [PATCH v2 1/3] compiler: define QEMU_CACHELINE_SIZE

2017-06-05 Thread Pranith Kumar
On Mon, Jun 5, 2017 at 6:49 PM, Emilio G. Cota  wrote:
> This is a constant used as a hint for padding structs to hopefully avoid
> false cache line sharing.
>
> The constant can be set at configure time by defining QEMU_CACHELINE_SIZE
> via --extra-cflags. If not set there, we try to obtain the value from
> the machine running the configure script. If we fail, we default to
> reasonable values, i.e. 128 bytes for ppc64 and 64 bytes for all others.
>
> Note: the configure script only picks up the cache line size when run
> on Linux hosts because I have no other platforms (e.g. Windows, BSD's)
> to test on.
>
> Signed-off-by: Emilio G. Cota 
> ---
>  configure   | 38 ++
>  include/qemu/compiler.h | 17 +
>  2 files changed, 55 insertions(+)
>
> diff --git a/configure b/configure
> index 13e040d..6a68cb2 100755
> --- a/configure
> +++ b/configure
> @@ -4832,6 +4832,41 @@ EOF
>fi
>  fi
>
> +# Find out the size of a cache line on the host
> +# TODO: support more platforms
> +cat > $TMPC <<EOF
> +#ifdef __linux__
> +
> +#include <stdio.h>
> +
> +#define SYSFS "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size"
> +
> +int main(int argc, char *argv[])
> +{
> +unsigned int size;
> +FILE *fp;
> +
> +fp = fopen(SYSFS, "r");
> +if (fp == NULL) {
> +return -1;
> +}
> +if (!fscanf(fp, "%u", &size)) {
> +return -1;
> +}
> +return size;
> +}
> +#else
> +#error Cannot find host cache line size
> +#endif
> +EOF

Is there any reason not to use sysconf(_SC_LEVEL1_DCACHE_LINESIZE)?
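
Something like this, for instance (untested sketch; note that
_SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, so other libcs would
still need a fallback):

    #include <unistd.h>

    long linesize = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (linesize <= 0) {
        linesize = 64;  /* not available; fall back to a default */
    }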

Thanks,
-- 
Pranith



Re: [Qemu-devel] [PATCH v2 3/3] tcg: allocate TB structs before the corresponding translated code

2017-06-05 Thread Pranith Kumar
On Mon, Jun 5, 2017 at 6:49 PM, Emilio G. Cota <c...@braap.org> wrote:
> Allocating an arbitrarily-sized array of tbs results in either
> (a) a lot of memory wasted or (b) unnecessary flushes of the code
> cache when we run out of TB structs in the array.
>
> An obvious solution would be to just malloc a TB struct when needed,
> and keep the TB array as an array of pointers (recall that tb_find_pc()
> needs the TB array to run in O(log n)).
>
> Perhaps a better solution, which is implemented in this patch, is to
> allocate TB's right before the translated code they describe. This
> results in some memory waste due to padding to have code and TBs in
> separate cache lines--for instance, I measured 4.7% of padding in the
> used portion of code_gen_buffer when booting aarch64 Linux on a
> host with 64-byte cache lines. However, it can allow for optimizations
> in some host architectures, since TCG backends could safely assume that
> the TB and the corresponding translated code are very close to each
> other in memory. See this message by rth for a detailed explanation:
>
>   https://lists.gnu.org/archive/html/qemu-devel/2017-03/msg05172.html
>   Subject: Re: GSoC 2017 Proposal: TCG performance enhancements
>   Message-ID: <1e67644b-4b30-887e-d329-1848e94c9...@twiddle.net>

Reviewed-by: Pranith Kumar <bobby.pr...@gmail.com>

Thanks for doing this Emilio. Do you plan to continue working on rth's
suggestions in that email? If so, can we co-ordinate our work?

-- 
Pranith



Re: [Qemu-devel] [PATCH v2 6/6] new script/analyse-tlb-flushes-simpletrace.py

2017-05-30 Thread Pranith Kumar
Hi Alex,

Please find some comments and questions below:

On Wed, May 17, 2017 at 10:52 AM, Alex Bennée  wrote:
> This is a simple helper script to extract TLB flush stats from the a
> simpletrace file and plot the results.
>
> Signed-off-by: Alex Bennée 
>
> ---
> v2
>   - re-factored for new trace events
>   - added time and latency graphs
> ---
>  scripts/analyse-tlb-flushes-simpletrace.py | 144 
> +
>  1 file changed, 144 insertions(+)
>  create mode 100755 scripts/analyse-tlb-flushes-simpletrace.py
>
> diff --git a/scripts/analyse-tlb-flushes-simpletrace.py 
> b/scripts/analyse-tlb-flushes-simpletrace.py
> new file mode 100755
> index 00..03fab8c86b
> --- /dev/null
> +++ b/scripts/analyse-tlb-flushes-simpletrace.py



> +
> +def get_args():
> +"Grab options"
> +parser = argparse.ArgumentParser()
> +parser.add_argument("--output", "-o", type=str, help="Render plot to 
> file")
> +parser.add_argument("--vcpus", type=int, help="Number of vCPUS")

It is not really clear what this argument is for. I guess it is the
number of vCPUs in the guest that the trace file was generated from?
What happens if we pass in fewer vCPUs than were used for generation?

> +parser.add_argument("--graph", choices=['time', 'latency'], 
> default='time')

What does latency here indicate? I tried this argument on a sample
trace file I generated, and it had three empty boxes.

> +parser.add_argument("events", type=str, help='trace file read from')
> +parser.add_argument("tracefile", type=str, help='trace file read from')

The help text for 'events' file here should be something like 'the
trace events file'.

Thanks,
--
Pranith



Re: [Qemu-devel] [PATCH] tcg/i386: 'nop' instruction with 'lock' prefix is illegal

2017-05-15 Thread Pranith Kumar
On Sun, May 14, 2017 at 5:12 PM, Richard Henderson  wrote:
>>
> Surely you'd also want to make this change for 0x11a and 0x11b.  Which would
> also simplify that code a bit.
>
> That said, there's *lots* of missing LOCK prefix checks.  What brings this
> one in particular to your attention?
>

The motivation for this change is here:
https://github.com/aquynh/capstone/issues/915

Apparently LLVM generates it in certain scenarios when padding with
multi-byte nop (it shouldn't).

From what I understand, a proper instruction like "lock; <insn>"
is converted to "lock; multi-byte nop; <insn>" due to code
alignment.
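
A minimal reproducer would be something like this (my own sketch;
0xf0 0x0f 0x1f 0x00 encodes "lock nopl (%rax)"):

    int main(void)
    {
        /* f0 = lock prefix, 0f 1f /0 = multi-byte nop */
        __asm__ volatile(".byte 0xf0, 0x0f, 0x1f, 0x00");
        return 0;
    }

On real hardware this dies with SIGILL (#UD); with the proposed patch
TCG behaves the same.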

There were bugs reported regarding this:
https://bugs.chromium.org/p/nativeclient/issues/detail?id=3929

I am not sure we want to fix this, but I thought it would be easy
enough to cover this case.

Thanks,
-- 
Pranith



[Qemu-devel] [PATCH] tcg/i386: 'nop' instruction with 'lock' prefix is illegal

2017-05-13 Thread Pranith Kumar
The instruction "lock nopl (%rax)" should raise an exception. However,
we don't do that since we do not check for lock prefix for nop
instructions. The following patch adds this check and makes the
behavior similar to hardware.

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 target/i386/translate.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/target/i386/translate.c b/target/i386/translate.c
index 1d1372fb43..76f4ccd3b4 100644
--- a/target/i386/translate.c
+++ b/target/i386/translate.c
@@ -7881,6 +7881,9 @@ static target_ulong disas_insn(CPUX86State *env, 
DisasContext *s,
 gen_nop_modrm(env, s, modrm);
 break;
 case 0x119: case 0x11c ... 0x11f: /* nop (multi byte) */
+if (prefixes & PREFIX_LOCK) {
+goto illegal_op;
+}
 modrm = cpu_ldub_code(env, s->pc++);
 gen_nop_modrm(env, s, modrm);
 break;
-- 
2.13.0




Re: [Qemu-devel] [Bug 1653063] [NEW] qemu-system-arm hangs with -icount and -nodefaults

2017-04-21 Thread Pranith Kumar
On Thu, Dec 29, 2016 at 5:04 AM, Andrew Jones  wrote:
> On Thu, Dec 29, 2016 at 08:02:16AM -0000, Hansni Bu wrote:
>> Public bug reported:
> ...
>> https://bugs.launchpad.net/bugs/1653063
> ...
>>   After console prints the message below:
>>   "Uncompressing 
>> Linux..
>>  done, booting the kernel."
>>   there's no further action noticed.
>>
>>   If "-icount" is not set, namely run as:
>>   qemu-system-arm -M integratorcp -kernel arm-test/zImage.integrator -initrd 
>> arm-test/arm_root.img -nographic -nodefaults -chardev stdio,mux=on,id=char0 
>> -serial chardev:char0 --append "console=ttyAMA0"
>>
>>   or if "-nodefaults" is not set, namely run as:
>>   qemu-system-arm -M integratorcp -kernel arm-test/zImage.integrator -initrd 
>> arm-test/arm_root.img -nographic -icount 1 --append "console=ttyAMA0"
>>
>>   The Linux boot procedure can finish successfully.
>
> Hi Hansni,
>
> The fact things work when you remove -nodefaults is a sign that with it
> your single cpu may just not be getting scheduled again. Does the patch
> from Alex Bennée here[*] help?
>
> [*] https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01743.html
>

This bug is still reproducible with the latest git.

--
Pranith

-- 
Pranith



Re: [Qemu-devel] [PATCH] tcg/i386: Do not display HT warning for TCG

2017-04-19 Thread Pranith Kumar
On Wed, Apr 19, 2017 at 10:26 PM, Eduardo Habkost <ehabk...@redhat.com> wrote:
> On Wed, Apr 19, 2017 at 06:03:01PM -0400, Pranith Kumar wrote:
>> On Wed, Apr 19, 2017 at 5:33 PM, Eduardo Habkost <ehabk...@redhat.com> wrote:
>> > On Wed, Apr 19, 2017 at 05:25:23PM -0400, Pranith Kumar wrote:
>> >> On Wed, Apr 19, 2017 at 4:57 PM, Eduardo Habkost <ehabk...@redhat.com> 
>> >> wrote:
>> >> > On Wed, Apr 19, 2017 at 04:16:53PM -0400, Pranith Kumar wrote:
>> >> >> On Wed, Apr 19, 2017 at 4:13 PM, Eduardo Habkost <ehabk...@redhat.com> 
>> >> >> wrote:
>> >> >> > On Wed, Apr 19, 2017 at 04:00:49PM -0400, Pranith Kumar wrote:
>> >> >> >> On Wed, Apr 19, 2017 at 3:54 PM, Pranith Kumar 
>> >> >> >> <bobby.pr...@gmail.com> wrote:
>> >> >> >> > When we enable hyperthreading (using threads smp argument), we 
>> >> >> >> > warn
>> >> >> >> > the user if the cpu is an AMD cpu. This does not make sense on 
>> >> >> >> > TCG and
>> >> >> >> > is also obsolete now that AMD Ryzen support hyperthreading.
>> >> >> >> >
>> >> >> >> > Fix this by adding CPUID_HT bit to the TCG features and explicitly
>> >> >> >> > checking this bit in the cpuid.
>> >> >> >> >
>> >> >> >> > Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
>> >> >> >> > ---
>> >> >> >> >  target/i386/cpu.c | 10 +-
>> >> >> >> >  1 file changed, 5 insertions(+), 5 deletions(-)
>> >> >> >> >
>> >> >> >> > diff --git a/target/i386/cpu.c b/target/i386/cpu.c
>> >> >> >> > index 13c0985f11..f34bb5ead7 100644
>> >> >> >> > --- a/target/i386/cpu.c
>> >> >> >> > +++ b/target/i386/cpu.c
>> >> >> >> > @@ -202,12 +202,12 @@ static void x86_cpu_vendor_words2str(char 
>> >> >> >> > *dst, uint32_t vendor1,
>> >> >> >> >  #define TCG_FEATURES (CPUID_FP87 | CPUID_PSE | CPUID_TSC | 
>> >> >> >> > CPUID_MSR | \
>> >> >> >> >CPUID_PAE | CPUID_MCE | CPUID_CX8 | CPUID_APIC | 
>> >> >> >> > CPUID_SEP | \
>> >> >> >> >CPUID_MTRR | CPUID_PGE | CPUID_MCA | CPUID_CMOV | 
>> >> >> >> > CPUID_PAT | \
>> >> >> >> > -  CPUID_PSE36 | CPUID_CLFLUSH | CPUID_ACPI | CPUID_MMX | 
>> >> >> >> > \
>> >> >> >> > +  CPUID_PSE36 | CPUID_CLFLUSH | CPUID_ACPI | CPUID_MMX | 
>> >> >> >> > CPUID_HT | \
>> >> >> >> >CPUID_FXSR | CPUID_SSE | CPUID_SSE2 | CPUID_SS | 
>> >> >> >> > CPUID_DE)
>> >> >> >> >/* partly implemented:
>> >> >> >> >CPUID_MTRR, CPUID_MCA, CPUID_CLFLUSH (needed for 
>> >> >> >> > Win64) */
>> >> >> >> >/* missing:
>> >> >> >> > -  CPUID_VME, CPUID_DTS, CPUID_SS, CPUID_HT, CPUID_TM, 
>> >> >> >> > CPUID_PBE */
>> >> >> >> > +  CPUID_VME, CPUID_DTS, CPUID_SS, CPUID_TM, CPUID_PBE */
>> >> >> >> >  #define TCG_EXT_FEATURES (CPUID_EXT_SSE3 | CPUID_EXT_PCLMULQDQ | 
>> >> >> >> > \
>> >> >> >> >CPUID_EXT_MONITOR | CPUID_EXT_SSSE3 | CPUID_EXT_CX16 | 
>> >> >> >> > \
>> >> >> >> >CPUID_EXT_SSE41 | CPUID_EXT_SSE42 | CPUID_EXT_POPCNT | 
>> >> >> >> > \
>> >> >> >> > @@ -3643,7 +3643,7 @@ static void x86_cpu_realizefn(DeviceState 
>> >> >> >> > *dev, Error **errp)
>> >> >> >> >
>> >> >> >> >  qemu_init_vcpu(cs);
>> >> >> >> >
>> >> >> >> > -/* Only Intel CPUs support hyperthreading. Even though QEMU 
>> >> >> >> > fixes this
>> >> >> >> > +/* Only some CPUs support hyperthreading. Even though QEMU 
>> >> >> >> > fixes this

Re: [Qemu-devel] [PATCH] tcg/i386: Do not display HT warning for TCG

2017-04-19 Thread Pranith Kumar
On Wed, Apr 19, 2017 at 5:33 PM, Eduardo Habkost <ehabk...@redhat.com> wrote:
> On Wed, Apr 19, 2017 at 05:25:23PM -0400, Pranith Kumar wrote:
>> On Wed, Apr 19, 2017 at 4:57 PM, Eduardo Habkost <ehabk...@redhat.com> wrote:
>> > On Wed, Apr 19, 2017 at 04:16:53PM -0400, Pranith Kumar wrote:
>> >> On Wed, Apr 19, 2017 at 4:13 PM, Eduardo Habkost <ehabk...@redhat.com> 
>> >> wrote:
>> >> > On Wed, Apr 19, 2017 at 04:00:49PM -0400, Pranith Kumar wrote:
>> >> >> On Wed, Apr 19, 2017 at 3:54 PM, Pranith Kumar <bobby.pr...@gmail.com> 
>> >> >> wrote:
>> >> >> > When we enable hyperthreading (using threads smp argument), we warn
>> >> >> > the user if the cpu is an AMD cpu. This does not make sense on TCG 
>> >> >> > and
>> >> >> > is also obsolete now that AMD Ryzen support hyperthreading.
>> >> >> >
>> >> >> > Fix this by adding CPUID_HT bit to the TCG features and explicitly
>> >> >> > checking this bit in the cpuid.
>> >> >> >
>> >> >> > Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
>> >> >> > ---
>> >> >> >  target/i386/cpu.c | 10 +-
>> >> >> >  1 file changed, 5 insertions(+), 5 deletions(-)
>> >> >> >
>> >> >> > diff --git a/target/i386/cpu.c b/target/i386/cpu.c
>> >> >> > index 13c0985f11..f34bb5ead7 100644
>> >> >> > --- a/target/i386/cpu.c
>> >> >> > +++ b/target/i386/cpu.c
>> >> >> > @@ -202,12 +202,12 @@ static void x86_cpu_vendor_words2str(char 
>> >> >> > *dst, uint32_t vendor1,
>> >> >> >  #define TCG_FEATURES (CPUID_FP87 | CPUID_PSE | CPUID_TSC | 
>> >> >> > CPUID_MSR | \
>> >> >> >CPUID_PAE | CPUID_MCE | CPUID_CX8 | CPUID_APIC | 
>> >> >> > CPUID_SEP | \
>> >> >> >CPUID_MTRR | CPUID_PGE | CPUID_MCA | CPUID_CMOV | 
>> >> >> > CPUID_PAT | \
>> >> >> > -  CPUID_PSE36 | CPUID_CLFLUSH | CPUID_ACPI | CPUID_MMX | \
>> >> >> > +  CPUID_PSE36 | CPUID_CLFLUSH | CPUID_ACPI | CPUID_MMX | 
>> >> >> > CPUID_HT | \
>> >> >> >CPUID_FXSR | CPUID_SSE | CPUID_SSE2 | CPUID_SS | CPUID_DE)
>> >> >> >/* partly implemented:
>> >> >> >CPUID_MTRR, CPUID_MCA, CPUID_CLFLUSH (needed for Win64) */
>> >> >> >/* missing:
>> >> >> > -  CPUID_VME, CPUID_DTS, CPUID_SS, CPUID_HT, CPUID_TM, 
>> >> >> > CPUID_PBE */
>> >> >> > +  CPUID_VME, CPUID_DTS, CPUID_SS, CPUID_TM, CPUID_PBE */
>> >> >> >  #define TCG_EXT_FEATURES (CPUID_EXT_SSE3 | CPUID_EXT_PCLMULQDQ | \
>> >> >> >CPUID_EXT_MONITOR | CPUID_EXT_SSSE3 | CPUID_EXT_CX16 | \
>> >> >> >CPUID_EXT_SSE41 | CPUID_EXT_SSE42 | CPUID_EXT_POPCNT | \
>> >> >> > @@ -3643,7 +3643,7 @@ static void x86_cpu_realizefn(DeviceState 
>> >> >> > *dev, Error **errp)
>> >> >> >
>> >> >> >  qemu_init_vcpu(cs);
>> >> >> >
>> >> >> > -/* Only Intel CPUs support hyperthreading. Even though QEMU 
>> >> >> > fixes this
>> >> >> > +/* Only some CPUs support hyperthreading. Even though QEMU 
>> >> >> > fixes this
>> >> >> >   * issue by adjusting CPUID__0001_EBX and 
>> >> >> > CPUID_8000_0008_ECX
>> >> >> >   * based on inputs (sockets,cores,threads), it is still better 
>> >> >> > to gives
>> >> >> >   * users a warning.
>> >> >> > @@ -3651,8 +3651,8 @@ static void x86_cpu_realizefn(DeviceState 
>> >> >> > *dev, Error **errp)
>> >> >> >   * NOTE: the following code has to follow qemu_init_vcpu(). 
>> >> >> > Otherwise
>> >> >> >   * cs->nr_threads hasn't be populated yet and the checking is 
>> >> >> > incorrect.
>> >> >> >   */
>> >> >> > -if (!IS_INTEL_CPU(env) && cs->nr_threads > 1 && !ht_warned) {
>> >> >> > -   

Re: [Qemu-devel] [PATCH] tcg/i386: Do not display HT warning for TCG

2017-04-19 Thread Pranith Kumar
On Wed, Apr 19, 2017 at 4:57 PM, Eduardo Habkost <ehabk...@redhat.com> wrote:
> On Wed, Apr 19, 2017 at 04:16:53PM -0400, Pranith Kumar wrote:
>> On Wed, Apr 19, 2017 at 4:13 PM, Eduardo Habkost <ehabk...@redhat.com> wrote:
>> > On Wed, Apr 19, 2017 at 04:00:49PM -0400, Pranith Kumar wrote:
>> >> On Wed, Apr 19, 2017 at 3:54 PM, Pranith Kumar <bobby.pr...@gmail.com> 
>> >> wrote:
>> >> > When we enable hyperthreading (using threads smp argument), we warn
>> >> > the user if the cpu is an AMD cpu. This does not make sense on TCG and
>> >> > is also obsolete now that AMD Ryzen support hyperthreading.
>> >> >
>> >> > Fix this by adding CPUID_HT bit to the TCG features and explicitly
>> >> > checking this bit in the cpuid.
>> >> >
>> >> > Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
>> >> > ---
>> >> >  target/i386/cpu.c | 10 +-
>> >> >  1 file changed, 5 insertions(+), 5 deletions(-)
>> >> >
>> >> > diff --git a/target/i386/cpu.c b/target/i386/cpu.c
>> >> > index 13c0985f11..f34bb5ead7 100644
>> >> > --- a/target/i386/cpu.c
>> >> > +++ b/target/i386/cpu.c
>> >> > @@ -202,12 +202,12 @@ static void x86_cpu_vendor_words2str(char *dst, 
>> >> > uint32_t vendor1,
>> >> >  #define TCG_FEATURES (CPUID_FP87 | CPUID_PSE | CPUID_TSC | CPUID_MSR | 
>> >> > \
>> >> >CPUID_PAE | CPUID_MCE | CPUID_CX8 | CPUID_APIC | CPUID_SEP | 
>> >> > \
>> >> >CPUID_MTRR | CPUID_PGE | CPUID_MCA | CPUID_CMOV | CPUID_PAT 
>> >> > | \
>> >> > -  CPUID_PSE36 | CPUID_CLFLUSH | CPUID_ACPI | CPUID_MMX | \
>> >> > +  CPUID_PSE36 | CPUID_CLFLUSH | CPUID_ACPI | CPUID_MMX | 
>> >> > CPUID_HT | \
>> >> >CPUID_FXSR | CPUID_SSE | CPUID_SSE2 | CPUID_SS | CPUID_DE)
>> >> >/* partly implemented:
>> >> >CPUID_MTRR, CPUID_MCA, CPUID_CLFLUSH (needed for Win64) */
>> >> >/* missing:
>> >> > -  CPUID_VME, CPUID_DTS, CPUID_SS, CPUID_HT, CPUID_TM, 
>> >> > CPUID_PBE */
>> >> > +  CPUID_VME, CPUID_DTS, CPUID_SS, CPUID_TM, CPUID_PBE */
>> >> >  #define TCG_EXT_FEATURES (CPUID_EXT_SSE3 | CPUID_EXT_PCLMULQDQ | \
>> >> >CPUID_EXT_MONITOR | CPUID_EXT_SSSE3 | CPUID_EXT_CX16 | \
>> >> >CPUID_EXT_SSE41 | CPUID_EXT_SSE42 | CPUID_EXT_POPCNT | \
>> >> > @@ -3643,7 +3643,7 @@ static void x86_cpu_realizefn(DeviceState *dev, 
>> >> > Error **errp)
>> >> >
>> >> >  qemu_init_vcpu(cs);
>> >> >
>> >> > -/* Only Intel CPUs support hyperthreading. Even though QEMU fixes 
>> >> > this
>> >> > +/* Only some CPUs support hyperthreading. Even though QEMU fixes 
>> >> > this
>> >> >   * issue by adjusting CPUID__0001_EBX and CPUID_8000_0008_ECX
>> >> >   * based on inputs (sockets,cores,threads), it is still better to 
>> >> > gives
>> >> >   * users a warning.
>> >> > @@ -3651,8 +3651,8 @@ static void x86_cpu_realizefn(DeviceState *dev, 
>> >> > Error **errp)
>> >> >   * NOTE: the following code has to follow qemu_init_vcpu(). 
>> >> > Otherwise
>> >> >   * cs->nr_threads hasn't be populated yet and the checking is 
>> >> > incorrect.
>> >> >   */
>> >> > -if (!IS_INTEL_CPU(env) && cs->nr_threads > 1 && !ht_warned) {
>> >> > -error_report("AMD CPU doesn't support hyperthreading. Please 
>> >> > configure"
>> >> > +if ((env->features[FEAT_1_EDX] & CPUID_HT) && (cs->nr_threads > 1) 
>> >> > && !ht_warned) {
>> >>
>> >> I missed a '!' here. We need to warn if CPUID_HT is not set. But I see
>> >> that it is not being set even on HT enabled processors (i7-3770). How
>> >> do I test for it?
>> >
>> > CPUID_HT is automatically set on the CPU if
>> > (nr_threads * nr_cores) > 1. See the "case 1:" block in
>> > cpu_x86_cpuid().
>> >
>> > AFAIK, the point of the warning is to let the user know that the
>> > guest OS is likely to ignore

Re: [Qemu-devel] [PATCH] tcg/i386: Do not display HT warning for TCG

2017-04-19 Thread Pranith Kumar
On Wed, Apr 19, 2017 at 4:13 PM, Eduardo Habkost <ehabk...@redhat.com> wrote:
> On Wed, Apr 19, 2017 at 04:00:49PM -0400, Pranith Kumar wrote:
>> On Wed, Apr 19, 2017 at 3:54 PM, Pranith Kumar <bobby.pr...@gmail.com> wrote:
>> > When we enable hyperthreading (using threads smp argument), we warn
>> > the user if the cpu is an AMD cpu. This does not make sense on TCG and
>> > is also obsolete now that AMD Ryzen support hyperthreading.
>> >
>> > Fix this by adding CPUID_HT bit to the TCG features and explicitly
>> > checking this bit in the cpuid.
>> >
>> > Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
>> > ---
>> >  target/i386/cpu.c | 10 +-
>> >  1 file changed, 5 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/target/i386/cpu.c b/target/i386/cpu.c
>> > index 13c0985f11..f34bb5ead7 100644
>> > --- a/target/i386/cpu.c
>> > +++ b/target/i386/cpu.c
>> > @@ -202,12 +202,12 @@ static void x86_cpu_vendor_words2str(char *dst, 
>> > uint32_t vendor1,
>> >  #define TCG_FEATURES (CPUID_FP87 | CPUID_PSE | CPUID_TSC | CPUID_MSR | \
>> >CPUID_PAE | CPUID_MCE | CPUID_CX8 | CPUID_APIC | CPUID_SEP | \
>> >CPUID_MTRR | CPUID_PGE | CPUID_MCA | CPUID_CMOV | CPUID_PAT | \
>> > -  CPUID_PSE36 | CPUID_CLFLUSH | CPUID_ACPI | CPUID_MMX | \
>> > +  CPUID_PSE36 | CPUID_CLFLUSH | CPUID_ACPI | CPUID_MMX | CPUID_HT 
>> > | \
>> >CPUID_FXSR | CPUID_SSE | CPUID_SSE2 | CPUID_SS | CPUID_DE)
>> >/* partly implemented:
>> >CPUID_MTRR, CPUID_MCA, CPUID_CLFLUSH (needed for Win64) */
>> >/* missing:
>> > -  CPUID_VME, CPUID_DTS, CPUID_SS, CPUID_HT, CPUID_TM, CPUID_PBE */
>> > +  CPUID_VME, CPUID_DTS, CPUID_SS, CPUID_TM, CPUID_PBE */
>> >  #define TCG_EXT_FEATURES (CPUID_EXT_SSE3 | CPUID_EXT_PCLMULQDQ | \
>> >CPUID_EXT_MONITOR | CPUID_EXT_SSSE3 | CPUID_EXT_CX16 | \
>> >CPUID_EXT_SSE41 | CPUID_EXT_SSE42 | CPUID_EXT_POPCNT | \
>> > @@ -3643,7 +3643,7 @@ static void x86_cpu_realizefn(DeviceState *dev, 
>> > Error **errp)
>> >
>> >  qemu_init_vcpu(cs);
>> >
>> > -/* Only Intel CPUs support hyperthreading. Even though QEMU fixes this
>> > +/* Only some CPUs support hyperthreading. Even though QEMU fixes this
>> >   * issue by adjusting CPUID__0001_EBX and CPUID_8000_0008_ECX
>> >   * based on inputs (sockets,cores,threads), it is still better to 
>> > gives
>> >   * users a warning.
>> > @@ -3651,8 +3651,8 @@ static void x86_cpu_realizefn(DeviceState *dev, 
>> > Error **errp)
>> >   * NOTE: the following code has to follow qemu_init_vcpu(). Otherwise
>> >   * cs->nr_threads hasn't be populated yet and the checking is 
>> > incorrect.
>> >   */
>> > -if (!IS_INTEL_CPU(env) && cs->nr_threads > 1 && !ht_warned) {
>> > -error_report("AMD CPU doesn't support hyperthreading. Please 
>> > configure"
>> > +if ((env->features[FEAT_1_EDX] & CPUID_HT) && (cs->nr_threads > 1) && 
>> > !ht_warned) {
>>
>> I missed a '!' here. We need to warn if CPUID_HT is not set. But I see
>> that it is not being set even on HT enabled processors (i7-3770). How
>> do I test for it?
>
> CPUID_HT is automatically set on the CPU if
> (nr_threads * nr_cores) > 1. See the "case 1:" block in
> cpu_x86_cpuid().
>
> AFAIK, the point of the warning is to let the user know that the
> guest OS is likely to ignore thread topology information if CPUID
> vendor is not Intel, and has nothing to do with TCG or KVM
> capabilities. Maybe the warning is obsolete today and guests
> won't be confused by a HT AMD CPU, but we need to confirm that.

We assume an AMD CPU for x86 TCG, so this warning is printed whenever
an SMP guest is run with a (cores=2,threads=2) topology. The main point
of this patch is to remove that warning when using TCG.
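
For reference, the topology-driven setting Eduardo describes boils down
to something like the following paraphrase (not the exact
cpu_x86_cpuid() code; CPUID_HT is bit 28 of leaf 1 EDX):

#include <stdint.h>

#define CPUID_HT (1u << 28)   /* leaf 1, EDX bit 28 */

/* Paraphrased sketch: the HT bit is derived purely from the configured
 * guest topology, independent of the accelerator in use. */
static uint32_t leaf1_edx(uint32_t features, unsigned nr_cores,
                          unsigned nr_threads)
{
    uint32_t edx = features;
    if (nr_cores * nr_threads > 1) {
        edx |= CPUID_HT;   /* more than one logical processor per package */
    }
    return edx;
}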

-- 
Pranith



Re: [Qemu-devel] [PATCH] tcg/i386: Do not display HT warning for TCG

2017-04-19 Thread Pranith Kumar
On Wed, Apr 19, 2017 at 3:54 PM, Pranith Kumar <bobby.pr...@gmail.com> wrote:
> When we enable hyperthreading (using threads smp argument), we warn
> the user if the cpu is an AMD cpu. This does not make sense on TCG and
> is also obsolete now that AMD Ryzen support hyperthreading.
>
> Fix this by adding CPUID_HT bit to the TCG features and explicitly
> checking this bit in the cpuid.
>
> Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
> ---
>  target/i386/cpu.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 13c0985f11..f34bb5ead7 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -202,12 +202,12 @@ static void x86_cpu_vendor_words2str(char *dst, 
> uint32_t vendor1,
>  #define TCG_FEATURES (CPUID_FP87 | CPUID_PSE | CPUID_TSC | CPUID_MSR | \
>CPUID_PAE | CPUID_MCE | CPUID_CX8 | CPUID_APIC | CPUID_SEP | \
>CPUID_MTRR | CPUID_PGE | CPUID_MCA | CPUID_CMOV | CPUID_PAT | \
> -  CPUID_PSE36 | CPUID_CLFLUSH | CPUID_ACPI | CPUID_MMX | \
> +  CPUID_PSE36 | CPUID_CLFLUSH | CPUID_ACPI | CPUID_MMX | CPUID_HT | \
>CPUID_FXSR | CPUID_SSE | CPUID_SSE2 | CPUID_SS | CPUID_DE)
>/* partly implemented:
>CPUID_MTRR, CPUID_MCA, CPUID_CLFLUSH (needed for Win64) */
>/* missing:
> -  CPUID_VME, CPUID_DTS, CPUID_SS, CPUID_HT, CPUID_TM, CPUID_PBE */
> +  CPUID_VME, CPUID_DTS, CPUID_SS, CPUID_TM, CPUID_PBE */
>  #define TCG_EXT_FEATURES (CPUID_EXT_SSE3 | CPUID_EXT_PCLMULQDQ | \
>CPUID_EXT_MONITOR | CPUID_EXT_SSSE3 | CPUID_EXT_CX16 | \
>CPUID_EXT_SSE41 | CPUID_EXT_SSE42 | CPUID_EXT_POPCNT | \
> @@ -3643,7 +3643,7 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
> **errp)
>
>  qemu_init_vcpu(cs);
>
> -/* Only Intel CPUs support hyperthreading. Even though QEMU fixes this
> +/* Only some CPUs support hyperthreading. Even though QEMU fixes this
>   * issue by adjusting CPUID__0001_EBX and CPUID_8000_0008_ECX
>   * based on inputs (sockets,cores,threads), it is still better to gives
>   * users a warning.
> @@ -3651,8 +3651,8 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
> **errp)
>   * NOTE: the following code has to follow qemu_init_vcpu(). Otherwise
>   * cs->nr_threads hasn't be populated yet and the checking is incorrect.
>   */
> -if (!IS_INTEL_CPU(env) && cs->nr_threads > 1 && !ht_warned) {
> -error_report("AMD CPU doesn't support hyperthreading. Please 
> configure"
> +if ((env->features[FEAT_1_EDX] & CPUID_HT) && (cs->nr_threads > 1) && 
> !ht_warned) {

I missed a '!' here. We need to warn if CPUID_HT is not set. But I see
that it is not being set even on HT enabled processors (i7-3770). How
do I test for it?

-- 
Pranith



[Qemu-devel] [RFC PATCH] tcg/i386: Do not display HT warning for TCG

2017-04-19 Thread Pranith Kumar
When we enable hyperthreading (using the threads smp argument), we warn
the user if the CPU is an AMD CPU. This does not make sense for TCG and
is also obsolete now that AMD Ryzen supports hyperthreading.

Fix this by adding CPUID_HT bit to the TCG features and explicitly
checking this bit in the cpuid.

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 target/i386/cpu.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 13c0985f11..f34bb5ead7 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -202,12 +202,12 @@ static void x86_cpu_vendor_words2str(char *dst, uint32_t 
vendor1,
 #define TCG_FEATURES (CPUID_FP87 | CPUID_PSE | CPUID_TSC | CPUID_MSR | \
   CPUID_PAE | CPUID_MCE | CPUID_CX8 | CPUID_APIC | CPUID_SEP | \
   CPUID_MTRR | CPUID_PGE | CPUID_MCA | CPUID_CMOV | CPUID_PAT | \
-  CPUID_PSE36 | CPUID_CLFLUSH | CPUID_ACPI | CPUID_MMX | \
+  CPUID_PSE36 | CPUID_CLFLUSH | CPUID_ACPI | CPUID_MMX | CPUID_HT | \
   CPUID_FXSR | CPUID_SSE | CPUID_SSE2 | CPUID_SS | CPUID_DE)
   /* partly implemented:
   CPUID_MTRR, CPUID_MCA, CPUID_CLFLUSH (needed for Win64) */
   /* missing:
-  CPUID_VME, CPUID_DTS, CPUID_SS, CPUID_HT, CPUID_TM, CPUID_PBE */
+  CPUID_VME, CPUID_DTS, CPUID_SS, CPUID_TM, CPUID_PBE */
 #define TCG_EXT_FEATURES (CPUID_EXT_SSE3 | CPUID_EXT_PCLMULQDQ | \
   CPUID_EXT_MONITOR | CPUID_EXT_SSSE3 | CPUID_EXT_CX16 | \
   CPUID_EXT_SSE41 | CPUID_EXT_SSE42 | CPUID_EXT_POPCNT | \
@@ -3643,7 +3643,7 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
**errp)
 
 qemu_init_vcpu(cs);
 
-/* Only Intel CPUs support hyperthreading. Even though QEMU fixes this
+/* Only some CPUs support hyperthreading. Even though QEMU fixes this
  * issue by adjusting CPUID__0001_EBX and CPUID_8000_0008_ECX
  * based on inputs (sockets,cores,threads), it is still better to gives
  * users a warning.
@@ -3651,8 +3651,8 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
**errp)
  * NOTE: the following code has to follow qemu_init_vcpu(). Otherwise
  * cs->nr_threads hasn't be populated yet and the checking is incorrect.
  */
-if (!IS_INTEL_CPU(env) && cs->nr_threads > 1 && !ht_warned) {
-error_report("AMD CPU doesn't support hyperthreading. Please configure"
+if ((env->features[FEAT_1_EDX] & CPUID_HT) && (cs->nr_threads > 1) && 
!ht_warned) {
+error_report("This CPU doesn't support hyperthreading. Please 
configure"
  " -smp options properly.");
 ht_warned = true;
 }
-- 
2.11.0




[Qemu-devel] [PATCH] tcg/i386: Do not display HT warning for TCG

2017-04-19 Thread Pranith Kumar
When we enable hyperthreading (using the threads smp argument), we warn
the user if the CPU is an AMD CPU. This does not make sense for TCG and
is also obsolete now that AMD Ryzen supports hyperthreading.

Fix this by adding CPUID_HT bit to the TCG features and explicitly
checking this bit in the cpuid.

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 target/i386/cpu.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 13c0985f11..f34bb5ead7 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -202,12 +202,12 @@ static void x86_cpu_vendor_words2str(char *dst, uint32_t 
vendor1,
 #define TCG_FEATURES (CPUID_FP87 | CPUID_PSE | CPUID_TSC | CPUID_MSR | \
   CPUID_PAE | CPUID_MCE | CPUID_CX8 | CPUID_APIC | CPUID_SEP | \
   CPUID_MTRR | CPUID_PGE | CPUID_MCA | CPUID_CMOV | CPUID_PAT | \
-  CPUID_PSE36 | CPUID_CLFLUSH | CPUID_ACPI | CPUID_MMX | \
+  CPUID_PSE36 | CPUID_CLFLUSH | CPUID_ACPI | CPUID_MMX | CPUID_HT | \
   CPUID_FXSR | CPUID_SSE | CPUID_SSE2 | CPUID_SS | CPUID_DE)
   /* partly implemented:
   CPUID_MTRR, CPUID_MCA, CPUID_CLFLUSH (needed for Win64) */
   /* missing:
-  CPUID_VME, CPUID_DTS, CPUID_SS, CPUID_HT, CPUID_TM, CPUID_PBE */
+  CPUID_VME, CPUID_DTS, CPUID_SS, CPUID_TM, CPUID_PBE */
 #define TCG_EXT_FEATURES (CPUID_EXT_SSE3 | CPUID_EXT_PCLMULQDQ | \
   CPUID_EXT_MONITOR | CPUID_EXT_SSSE3 | CPUID_EXT_CX16 | \
   CPUID_EXT_SSE41 | CPUID_EXT_SSE42 | CPUID_EXT_POPCNT | \
@@ -3643,7 +3643,7 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
**errp)
 
 qemu_init_vcpu(cs);
 
-/* Only Intel CPUs support hyperthreading. Even though QEMU fixes this
+/* Only some CPUs support hyperthreading. Even though QEMU fixes this
  * issue by adjusting CPUID__0001_EBX and CPUID_8000_0008_ECX
  * based on inputs (sockets,cores,threads), it is still better to gives
  * users a warning.
@@ -3651,8 +3651,8 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
**errp)
  * NOTE: the following code has to follow qemu_init_vcpu(). Otherwise
  * cs->nr_threads hasn't be populated yet and the checking is incorrect.
  */
-if (!IS_INTEL_CPU(env) && cs->nr_threads > 1 && !ht_warned) {
-error_report("AMD CPU doesn't support hyperthreading. Please configure"
+if ((env->features[FEAT_1_EDX] & CPUID_HT) && (cs->nr_threads > 1) && 
!ht_warned) {
+error_report("This CPU doesn't support hyperthreading. Please 
configure"
  " -smp options properly.");
 ht_warned = true;
 }
-- 
2.11.0




Re: [Qemu-devel] [PATCH] timer.h: Provide monotonic time for ARM guests

2017-04-18 Thread Pranith Kumar
On Tue, Apr 18, 2017 at 5:56 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>
> On 17/04/2017 20:55, Pranith Kumar wrote:
>>>> +/* ARM does not have a user-space readable cycle counter available.
>>>> + * This is a compromise to get monotonically increasing time.
>>>> + */
>>>> +static inline int64_t cpu_get_host_ticks(void)
>>>> +{
>>>> +return get_clock();
>>>> +}
>>> This doesn't look like it should be ARM-specific. Is it
>>> better than the current default implementation? If so,
>>> why not make this the default implementation?
>>
>> I think we can do that...
>
> Yes, it is always better for emulation accuracy.  It may be much slower,
> depending on your OS (especially if get_clock requires a
> user->kernel->user transition), but the current code is quite broken.
>

OK, I sent an updated patch using get_clock() for all other cases.
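
For context, the get_clock() being used here is essentially a monotonic
clock read; a sketch assuming POSIX clock_gettime(), not the exact QEMU
source:

#include <stdint.h>
#include <time.h>

static inline int64_t get_clock(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;   /* nanoseconds */
}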

--
Pranith



[Qemu-devel] [PATCH] timer.h: Provide better monotonic time

2017-04-18 Thread Pranith Kumar
Tested and confirmed that the Debian stretch i386 qcow2 image boots on
a Raspberry Pi 2.

Fixes: LP#: 893208 <https://bugs.launchpad.net/qemu/+bug/893208/>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 include/qemu/timer.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/qemu/timer.h b/include/qemu/timer.h
index 8a1eb74839..1b518bca30 100644
--- a/include/qemu/timer.h
+++ b/include/qemu/timer.h
@@ -1020,10 +1020,9 @@ static inline int64_t cpu_get_host_ticks(void)
 /* The host CPU doesn't have an easily accessible cycle counter.
Just return a monotonically increasing value.  This will be
totally wrong, but hopefully better than nothing.  */
-static inline int64_t cpu_get_host_ticks (void)
+static inline int64_t cpu_get_host_ticks(void)
 {
-static int64_t ticks = 0;
-return ticks++;
+return get_clock();
 }
 #endif
 
-- 
2.11.0




Re: [Qemu-devel] [PATCH] timer.h: Provide monotonic time for ARM guests

2017-04-17 Thread Pranith Kumar
On Mon, Apr 17, 2017 at 2:42 PM, Peter Maydell <peter.mayd...@linaro.org> wrote:
> On 15 April 2017 at 20:29, Pranith Kumar <bobby.pr...@gmail.com> wrote:
>> Tested and confirmed that the stretch i386 debian qcow2 image on a
>> raspberry pi 2 works.
>>
>> Fixes: LP#: 893208 <https://bugs.launchpad.net/qemu/+bug/893208/>
>> Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
>> ---
>>  include/qemu/timer.h | 10 ++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/include/qemu/timer.h b/include/qemu/timer.h
>> index e1742f2f3d..14c9558da4 100644
>> --- a/include/qemu/timer.h
>> +++ b/include/qemu/timer.h
>> @@ -1015,6 +1015,16 @@ static inline int64_t cpu_get_host_ticks(void)
>>  return cur - ofs;
>>  }
>>
>> +#elif defined(__arm__) || defined(__aarch64__)
>> +
>> +/* ARM does not have a user-space readable cycle counter available.
>> + * This is a compromise to get monotonically increasing time.
>> + */
>> +static inline int64_t cpu_get_host_ticks(void)
>> +{
>> +return get_clock();
>> +}
>
> This doesn't look like it should be ARM-specific. Is it
> better than the current default implementation? If so,
> why not make this the default implementation?
>

I think we can do that...

>> +
>>  #else
>>  /* The host CPU doesn't have an easily accessible cycle counter.
>> Just return a monotonically increasing value.  This will be
>
> The comment here says that our default is already a monotonically
> increasing implementation -- is it wrong, or is there some other
> advantage of your version?

Comment #6 in the bug report by Laszlo Ersek explains why: the counter
advancing by only 1 over a window of approximately 55 ms is what
causes the divide by zero in grub.
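
To illustrate the failure mode, here is a hypothetical calibration loop
of the kind involved (illustrative only, not GRUB's actual source;
wait_ms() is a made-up helper):

#include <stdint.h>

extern int64_t cpu_get_host_ticks(void);
extern void wait_ms(unsigned ms);            /* hypothetical helper */

uint64_t ticks_per_ms(void)
{
    int64_t t0 = cpu_get_host_ticks();
    wait_ms(55);
    int64_t t1 = cpu_get_host_ticks();
    return (uint64_t)(t1 - t0) / 55;         /* old stub: (1 - 0) / 55 == 0 */
}
/* Any subsequent "x / ticks_per_ms()" then divides by zero. */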

-- 
Pranith



[Qemu-devel] [PATCH] timer.h: Provide monotonic time for ARM guests

2017-04-15 Thread Pranith Kumar
Tested and confirmed that the Debian stretch i386 qcow2 image boots on
a Raspberry Pi 2.

Fixes: LP#: 893208 <https://bugs.launchpad.net/qemu/+bug/893208/>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 include/qemu/timer.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/qemu/timer.h b/include/qemu/timer.h
index e1742f2f3d..14c9558da4 100644
--- a/include/qemu/timer.h
+++ b/include/qemu/timer.h
@@ -1015,6 +1015,16 @@ static inline int64_t cpu_get_host_ticks(void)
 return cur - ofs;
 }
 
+#elif defined(__arm__) || defined(__aarch64__)
+
+/* ARM does not have a user-space readable cycle counter available.
+ * This is a compromise to get monotonically increasing time.
+ */
+static inline int64_t cpu_get_host_ticks(void)
+{
+return get_clock();
+}
+
 #else
 /* The host CPU doesn't have an easily accessible cycle counter.
Just return a monotonically increasing value.  This will be
-- 
2.11.0




Re: [Qemu-devel] What is the best commit for record-replay?

2017-04-09 Thread Pranith Kumar
On Thu, Mar 23, 2017 at 4:05 AM, Igor R  wrote:
> Hi,
>
> I'm trying to use the deterministic record/replay feature, and I would
> like to know which commit I should take to get it work.
> In RC0 it seems to be broken. I tried pre-MTTCG commit 2421f381dc, as

Can you retry with the latest rc? There were some fixes regarding rr since rc0.

Thanks,
--
Pranith



[Qemu-devel] [PATCH] tcg/i386: Display AMD HT warning only for KVM

2017-03-28 Thread Pranith Kumar
TCG models an AMD CPU, which triggers this warning when we enable
hyperthreading. Disable the warning for TCG since it does not apply there.

Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 target/i386/cpu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 7aa762245a..66242893b6 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -3647,7 +3647,7 @@ static void x86_cpu_realizefn(DeviceState *dev, Error 
**errp)
  * NOTE: the following code has to follow qemu_init_vcpu(). Otherwise
  * cs->nr_threads hasn't be populated yet and the checking is incorrect.
  */
-if (!IS_INTEL_CPU(env) && cs->nr_threads > 1 && !ht_warned) {
+if (!IS_INTEL_CPU(env) && cs->nr_threads > 1 && !ht_warned && 
kvm_enabled()) {
 error_report("AMD CPU doesn't support hyperthreading. Please configure"
  " -smp options properly.");
 ht_warned = true;
-- 
2.11.0




Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements

2017-03-27 Thread Pranith Kumar
On Mon, Mar 27, 2017 at 11:03 PM, Pranith Kumar <bobby.pr...@gmail.com> wrote:

>
> If you think the project makes sense, I will add it to the GSoC wiki
> so that others can also apply for it. Please let me know if you are
> interested in mentoring it along with Alex.
>

One other thing: if you think the scope is too vast, can we split
this into multiple GSoC projects? In that case, having more
mentors should help.

-- 
Pranith



Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements

2017-03-27 Thread Pranith Kumar
Hi Paolo,

On Mon, Mar 27, 2017 at 7:32 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>
> On 25/03/2017 17:52, Pranith Kumar wrote:
>> * Implement an LRU translation block code cache.
>>
>>   In the current TCG design, when the translation cache fills up, we flush 
>> all
>>   the translated blocks (TBs) to free up space. We can improve this situation
>>   by not flushing the TBs that were recently used i.e., by implementing an 
>> LRU
>>   policy for freeing the blocks. This should avoid the re-translation 
>> overhead
>>   for frequently used blocks and improve performance.
>
> IIRC, Emilio measured one flush every roughly 10 seconds with 128 MB
> cache in system emulation mode---and "never" is a pretty accurate
> estimate for user-mode emulation.  This means that a really hot block
> would be retranslated very quickly.
>

OK. The problem with re-translation is that it is a serializing step
in the current design: all the cores have to wait for the translation
to complete. I think it would be a win if we could avoid it, although
I admit I am not sure how large that benefit would be.

-- 
Pranith



Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements

2017-03-27 Thread Pranith Kumar
Hi Richard,

Thanks for the feedback. Please find some comments inline.

On Mon, Mar 27, 2017 at 6:57 AM, Richard Henderson  wrote:
>
> 128MB is really quite large.  I doubt doubling the cache size will really
> help that much.  That said, it's really quite trivial to make this change,
> if you'd like to experiment.
>
> FWIW, I rarely see TB flushes for alpha -- not one during an entire gcc
> bootstrap.  Now, this is usually with 4GB ram, which by default implies
> 512MB translation cache.  But it does mean that, given an ideal guest, TB
> flushes should not dominate anything at all.
>
> If you're seeing multiple flushes during the startup of a browser, your
> guest must be flushing for other reasons than the code_gen_buffer being
> full.
>

This is indeed the case. From commit a9353fe897ca onwards, we flush
the whole TB cache instead of invalidating a single TB from
breakpoint_invalidate(). Now that MTTCG has added proper tb/mmap
locking, we can revert that commit. I will do so once the merge
window opens.

>
>> * Implement an LRU translation block code cache.
>
>
> The major problem you'll encounter is how to manage allocation in this case.
>
> The current mechanism means that it is trivial to not know how much code is
> going to be generated for a given set of TCG opcodes.  When we reach the
> high-water mark, we've run out of room.  We then flush everything and start
> over at the beginning of the buffer.
>
> If you manage the cache with an allocator, you'll need to know in advance
> how much code is going to be generated.  This is going to require that you
> either (1) severely over-estimate the space required (qemu_ld generates lots
> more code than just add), (2) severely increase the time required, by
> generating code twice, or (3) somewhat increase the time required, by
> generating position-independent code into an external buffer and copying it
> into place after determining the size.
>

(3) seems to be the only feasible option, but I am not sure how easy
it is to generate position-independent code. Do you think it can be
done as a GSoC project?
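
For the record, option (3) would look roughly like the following
sketch; every function name here is hypothetical:

#include <stddef.h>
#include <string.h>

extern size_t gen_code_pic(void *scratch, const void *tcg_ops);
extern void *tb_cache_alloc(void *cache, size_t len);
extern void flush_icache_range(void *start, size_t len);

/* Generate position-independent code into a scratch buffer, then copy
 * it into the code cache once the exact size is known. */
static void *emit_tb(void *cache, void *scratch, const void *ops)
{
    size_t len = gen_code_pic(scratch, ops);
    void *dst = tb_cache_alloc(cache, len);
    memcpy(dst, scratch, len);
    flush_icache_range(dst, len);
    return dst;
}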

>
>> * Avoid consistency overhead for strong memory model guests by generating
>>   load-acquire and store-release instructions.
>
>
> This is probably required for good performance of the user-only code path,
> but considering the number of other insns required for the system tlb
> lookup, I'm surprised that the memory barrier matters.
>

I know that having some experimental data would help to accurately show
the benefit here, but my observation from generating store-release
instructions instead of store+fence is that it helps make the system
more usable. I will try to collect this data for a Linux x86 guest.

>
> I think it would be interesting to place TranslationBlock structures into
> the same memory block as code_gen_buffer, immediately before the code that
> implements the TB.
>
> Consider what happens within every TB:
>
> (1) We have one or more references to the TB address, via exit_tb.
>
> For aarch64, this will normally require 2-4 insns.
>
> # alpha-softmmu
> 0x7f75152114:  d0ffb320  adrp x0, #-0x99a000 (addr 0x7f747b8000)
> 0x7f75152118:  91004c00  add x0, x0, #0x13 (19)
> 0x7f7515211c:  17c3  b #-0xf4 (addr 0x7f75152028)
>
> # alpha-linux-user
> 0x00569500:  d2800260  mov x0, #0x13
> 0x00569504:  f2b59820  movk x0, #0xacc1, lsl #16
> 0x00569508:  f2c00fe0  movk x0, #0x7f, lsl #32
> 0x0056950c:  17df  b #-0x84 (addr 0x569488)
>
> We would reduce this to one insn, always, if the TB were close by, since the
> ADR instruction has a range of 1MB.
>
> (2) We have zero to two references to a linked TB, via goto_tb.
>
> Your stated goal above for eliminating the code_gen_buffer maximum of 128MB
> can be done in two ways.
>
> (2A) Raise the maximum to 2GB.  For this we would align an instruction pair,
> adrp+add, to compute the address; the following insn would branch.  The
> update code would write a new destination by modifing the adrp+add with a
> single 64-bit store.
>
> (2B) Eliminate the maximum altogether by referencing the destination
> directly in the TB.  This is the !USE_DIRECT_JUMP path.  It is normally not
> used on 64-bit targets because computing the full 64-bit address of the TB
> is harder, or just as hard, as computing the full 64-bit address of the
> destination.
>
> However, if the TB is nearby, aarch64 can load the address from
> TB.jmp_target_addr in one insn, with LDR (literal).  This pc-relative load
> also has a 1MB range.
>
> This has the side benefit that it is much quicker to re-link TBs, both in
> the computation of the code for the destination as well as re-flushing the
> icache.

This (2B) is the idea I had in mind, ideally combined with (2A): if the
destination falls outside the 1MB range, we take the penalty and
generate the full 64-bit address.

>
>
> In addition, I strongly suspect the 1,342,177 entries (153MB) that we
> currently allocate for 

Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements

2017-03-27 Thread Pranith Kumar
Hi Stefan,

On Mon, Mar 27, 2017 at 11:54 AM, Stefan Hajnoczi <stefa...@gmail.com> wrote:
> On Sat, Mar 25, 2017 at 12:52:35PM -0400, Pranith Kumar wrote:
>> Alex Bennée, who mentored me last year, has agreed to mentor me again this
>> time if the proposal is accepted.
>
> Thanks, the project idea looks good for GSoC.  I've talked to Alex about
> adding it to the wiki page.
>
> The "How to propose a custom project idea" section on the wiki says:
>
>   Note that other candidates can apply for newly added project ideas.
>   This ensures that custom project ideas are fair and open.
>
> This means that Alex has agreed to mentor the _project idea_.  Proposing
> a custom project idea doesn't guarantee that you will be selected for
> it.
>
> I think you already knew that but I wanted to clarify in case someone
> reading misinterprets what you wrote to think custom project ideas are a
> loophole for getting into GSoC.

Yes, I was waiting for the project idea to be finalized before mailing
you the filled-out template. But if you think it would be easier if I
add it first and then edit it, I will send you the template. I will
update the wiki as the discussion progresses.

-- 
Pranith



[Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements

2017-03-25 Thread Pranith Kumar
Hello,

With MTTCG code now merged in mainline, I tried to see if we are able to run
x86 SMP guests on ARM64 hosts. For this I tried running a Windows XP guest on
a DragonBoard 410c, which has 1GB RAM. Since x86 has a strong memory model
whereas ARM64 has a weak memory model, I added a patch to generate fence
instructions for every guest memory access. After some minor fixes, I was
able to successfully boot a 4-core guest all the way to the desktop (albeit
with a 1GB backing swap). However, the performance is severely
limited and the guest is barely usable. Based on my observations, I think
there are some easily implementable additions we can make to improve the
performance of TCG in general and on ARM64 in particular. I propose to do the
following as part of Google Summer of Code 2017.


* Implement jump-to-register instruction on ARM64 to overcome the 128MB
  translation cache size limit.

  The translation cache size for an ARM64 host is currently limited to 128
  MB. This limit is imposed by the direct branch instruction used for jumps,
  whose encoding has only a fixed number of bits for the jump offset. The
  performance impact of this limit is severe and can be observed when you try
  to run large programs like a browser in the guest: the cache is flushed
  several times before the browser starts and the performance is not
  satisfactory. The limit can be overcome by generating a branch-to-register
  instruction and using it whenever the destination address is outside the
  range that the direct branch instruction can encode (see the sketch below).
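
  A minimal sketch of the range check involved (the AArch64 "b" instruction
  encodes a signed 26-bit word offset, i.e. +/-128 MB; this is illustrative,
  not the actual tcg/aarch64 backend):

#include <stdbool.h>
#include <stdint.h>

static bool fits_direct_branch(intptr_t from, intptr_t to)
{
    intptr_t disp = (to - from) >> 2;               /* word offset */
    return disp >= -(1 << 25) && disp < (1 << 25);  /* signed 26 bits */
}
/* When this returns false, emit a mov/movk sequence plus "br" instead. */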

* Implement an LRU translation block code cache.

  In the current TCG design, when the translation cache fills up, we flush all
  the translated blocks (TBs) to free up space. We can improve this situation
  by not flushing the TBs that were recently used, i.e., by implementing an LRU
  policy for freeing the blocks. This should avoid the re-translation overhead
  for frequently used blocks and improve performance (see the sketch below).
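
  A minimal sketch of the bookkeeping this would need, assuming an LRU list
  threaded through each TB (illustrative types, not QEMU's TranslationBlock):

#include <stddef.h>

typedef struct TB {
    struct TB *lru_prev, *lru_next;
    /* ...host code pointer, guest PC, etc... */
} TB;

typedef struct {
    TB *head;   /* most recently used */
    TB *tail;   /* least recently used: evicted first */
} LRUList;

/* Move a TB to the front of the list on every lookup hit. */
static void lru_touch(LRUList *l, TB *tb)
{
    if (l->head == tb) {
        return;
    }
    if (tb->lru_prev) tb->lru_prev->lru_next = tb->lru_next;
    if (tb->lru_next) tb->lru_next->lru_prev = tb->lru_prev;
    if (l->tail == tb) l->tail = tb->lru_prev;
    tb->lru_prev = NULL;
    tb->lru_next = l->head;
    if (l->head) l->head->lru_prev = tb;
    l->head = tb;
    if (!l->tail) l->tail = tb;
}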

* Avoid consistency overhead for strong memory model guests by generating
  load-acquire and store-release instructions.

  To run a strongly ordered guest on a weakly ordered host using MTTCG, for
  example, x86 on ARM64, we have to generate fence instructions for all the
  guest memory accesses to ensure consistency. The overhead imposed by these
  fence instructions is significant (almost 3x when compared to a run without
  fence instructions). ARM64 provides load-acquire and store-release
  instructions which are sequentially consistent and can be used instead of
  generating fence instructions. I plan to add support to generate these
  instructions in the TCG run-time to reduce the consistency overhead in
  MTTCG.
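
  To make the comparison concrete, here are the two AArch64 sequences in
  question, written with C11 atomics (a sketch, not TCG backend output): on
  AArch64 the release store compiles to a single stlr, while the fence-based
  version needs dmb barriers around a plain str.

#include <stdatomic.h>

void store_with_fences(_Atomic int *p, int v)
{
    atomic_thread_fence(memory_order_seq_cst);          /* dmb ish */
    atomic_store_explicit(p, v, memory_order_relaxed);  /* str     */
    atomic_thread_fence(memory_order_seq_cst);          /* dmb ish */
}

void store_release(_Atomic int *p, int v)
{
    atomic_store_explicit(p, v, memory_order_release);  /* stlr    */
}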

Alex Bennée, who mentored me last year, has agreed to mentor me again this
time if the proposal is accepted.

Please let me know if you have any comments or suggestions. Also please let me
know if there are other enhancements that are easily implementable to increase
TCG performance as part of this project or otherwise.

Thanks,
-- 
Pranith



[Qemu-devel] [PATCH] tcg/i386: Check the size of instruction being translated

2017-03-23 Thread Pranith Kumar
Sending again since I messed up pbonzini's email.

This fixes the bug: 'user-to-root privesc inside VM via bad translation
caching' reported by Jann Horn here:
https://bugs.chromium.org/p/project-zero/issues/detail?id=1122

CC: Richard Henderson <r...@twiddle.net>
CC: Peter Maydell <peter.mayd...@linaro.org>
CC: Paolo Bonzini <pbonz...@redhat.com>
Reported-by: Jann Horn <ja...@google.com>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 target/i386/translate.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/target/i386/translate.c b/target/i386/translate.c
index 72c1b03a2a..1d1372fb43 100644
--- a/target/i386/translate.c
+++ b/target/i386/translate.c
@@ -4418,6 +4418,13 @@ static target_ulong disas_insn(CPUX86State *env, 
DisasContext *s,
 s->vex_l = 0;
 s->vex_v = 0;
  next_byte:
+/* x86 has an upper limit of 15 bytes for an instruction. Since we
+ * do not want to decode and generate IR for an illegal
+ * instruction, the following check limits the instruction size to
+ * 25 bytes: 14 prefix + 1 opc + 6 (modrm+sib+ofs) + 4 imm */
+if (s->pc - pc_start > 14) {
+goto illegal_op;
+}
 b = cpu_ldub_code(env, s->pc);
 s->pc++;
 /* Collect prefixes.  */
-- 
2.11.0




[Qemu-devel] [PATCH] tcg/i386: Check the size of instruction being translated

2017-03-23 Thread Pranith Kumar
This fixes the bug: 'user-to-root privesc inside VM via bad translation
caching' reported by Jann Horn here:
https://bugs.chromium.org/p/project-zero/issues/detail?id=1122

CC: Richard Henderson <r...@twiddle.net>
CC: Peter Maydell <peter.mayd...@linaro.org>
CC: Paolo Bonzini <paolo.bonz...@redhat.com>
Reported-by: Jann Horn <ja...@google.com>
Signed-off-by: Pranith Kumar <bobby.pr...@gmail.com>
---
 target/i386/translate.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/target/i386/translate.c b/target/i386/translate.c
index 72c1b03a2a..1d1372fb43 100644
--- a/target/i386/translate.c
+++ b/target/i386/translate.c
@@ -4418,6 +4418,13 @@ static target_ulong disas_insn(CPUX86State *env, 
DisasContext *s,
 s->vex_l = 0;
 s->vex_v = 0;
  next_byte:
+/* x86 has an upper limit of 15 bytes for an instruction. Since we
+ * do not want to decode and generate IR for an illegal
+ * instruction, the following check limits the instruction size to
+ * 25 bytes: 14 prefix + 1 opc + 6 (modrm+sib+ofs) + 4 imm */
+if (s->pc - pc_start > 14) {
+goto illegal_op;
+}
 b = cpu_ldub_code(env, s->pc);
 s->pc++;
 /* Collect prefixes.  */
-- 
2.11.0




Re: [Qemu-devel] [BUG] user-to-root privesc inside VM via bad translation caching

2017-03-23 Thread Pranith Kumar
On Thu, Mar 23, 2017 at 1:37 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>
> On 23/03/2017 17:50, Pranith Kumar wrote:
>> On Thu, Mar 23, 2017 at 6:27 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>>>
>>>
>>> On 22/03/2017 21:01, Richard Henderson wrote:
>>>>>
>>>>> Ah, OK. Thanks for the explanation. May be we should check the size of
>>>>> the instruction while decoding the prefixes and error out once we
>>>>> exceed the limit. We would not generate any IR code.
>>>>
>>>> Yes.
>>>>
>>>> It would not enforce a true limit of 15 bytes, since you can't know that
>>>> until you've done the rest of the decode.  But you'd be able to say that
>>>> no more than 14 prefix + 1 opc + 6 modrm+sib+ofs + 4 immediate = 25
>>>> bytes is used.
>>>>
>>>> Which does fix the bug.
>>>
>>> Yeah, that would work for 2.9 if somebody wants to put together a patch.
>>>  Ensuring that all instruction fetching happens before translation side
>>> effects is a little harder, but perhaps it's also the opportunity to get
>>> rid of s->rip_offset which is a little ugly.
>>
>> How about the following?
>>
>> diff --git a/target/i386/translate.c b/target/i386/translate.c
>> index 72c1b03a2a..67c58b8900 100644
>> --- a/target/i386/translate.c
>> +++ b/target/i386/translate.c
>> @@ -4418,6 +4418,11 @@ static target_ulong disas_insn(CPUX86State
>> *env, DisasContext *s,
>>  s->vex_l = 0;
>>  s->vex_v = 0;
>>   next_byte:
>> +/* The prefixes can at most be 14 bytes since x86 has an upper
>> +   limit of 15 bytes for the instruction */
>> +if (s->pc - pc_start > 14) {
>> +goto illegal_op;
>> +}
>>  b = cpu_ldub_code(env, s->pc);
>>  s->pc++;
>>  /* Collect prefixes.  */
>
> Please make the comment more verbose, based on Richard's remark.  We
> should apply it to 2.9.
>
> Also, QEMU usually formats comments with stars on every line.


OK. I'll send a proper patch with updated comment.

Thanks,
--
Pranith



Re: [Qemu-devel] [Qemu-arm] about armv8's prefetch decode

2017-03-23 Thread Pranith Kumar
Hi Jed,

On Mon, Mar 20, 2017 at 2:35 AM, Wangjintang  wrote:
> Hi,
>
> We see that armv8's prefetch instruction decode has been skipped in 
> qemu. Some users need the prefetch instructions, for example, when they 
> use qemu to generate an instruction trace. We want to merge this patch 
> to the community; is that ok?  Thanks.
>

Your patch seems to be missing. Can you retry with the content of the
patch pasted in the email?

Thanks,
--
Pranith



Re: [Qemu-devel] [BUG] user-to-root privesc inside VM via bad translation caching

2017-03-23 Thread Pranith Kumar
On Thu, Mar 23, 2017 at 6:27 AM, Paolo Bonzini  wrote:
>
>
> On 22/03/2017 21:01, Richard Henderson wrote:
>>>
>>> Ah, OK. Thanks for the explanation. May be we should check the size of
>>> the instruction while decoding the prefixes and error out once we
>>> exceed the limit. We would not generate any IR code.
>>
>> Yes.
>>
>> It would not enforce a true limit of 15 bytes, since you can't know that
>> until you've done the rest of the decode.  But you'd be able to say that
>> no more than 14 prefix + 1 opc + 6 modrm+sib+ofs + 4 immediate = 25
>> bytes is used.
>>
>> Which does fix the bug.
>
> Yeah, that would work for 2.9 if somebody wants to put together a patch.
>  Ensuring that all instruction fetching happens before translation side
> effects is a little harder, but perhaps it's also the opportunity to get
> rid of s->rip_offset which is a little ugly.

How about the following?

diff --git a/target/i386/translate.c b/target/i386/translate.c
index 72c1b03a2a..67c58b8900 100644
--- a/target/i386/translate.c
+++ b/target/i386/translate.c
@@ -4418,6 +4418,11 @@ static target_ulong disas_insn(CPUX86State
*env, DisasContext *s,
 s->vex_l = 0;
 s->vex_v = 0;
  next_byte:
+/* The prefixes can at most be 14 bytes since x86 has an upper
+   limit of 15 bytes for the instruction */
+if (s->pc - pc_start > 14) {
+goto illegal_op;
+}
 b = cpu_ldub_code(env, s->pc);
 s->pc++;
 /* Collect prefixes.  */

--
Pranith



Re: [Qemu-devel] [BUG] user-to-root privesc inside VM via bad translation caching

2017-03-22 Thread Pranith Kumar
On Wed, Mar 22, 2017 at 11:21 AM, Peter Maydell
<peter.mayd...@linaro.org> wrote:
> On 22 March 2017 at 15:14, Pranith Kumar <bobby.prani+q...@gmail.com> wrote:
>> On Wed, Mar 22, 2017 at 11:04 AM, Peter Maydell
>> <peter.mayd...@linaro.org> wrote:
>>> This doesn't look right because it means we'll check
>>> only after we've emitted all the code to do the
>>> instruction operation, so the effect will be
>>> "execute instruction, then take illegal-opcode
>>> exception".
>
>> The pc is restored to original address (s->pc = pc_start), so the
>> exception will overwrite the generated illegal instruction and will be
>> executed first.
>
> s->pc is the guest PC -- moving that backwards will
> not do anything about the generated TCG IR that's
> already been written. You'd need to rewind the
> write pointer in the IR stream, which there is
> no support for doing AFAIK.

Ah, OK. Thanks for the explanation. Maybe we should check the size of
the instruction while decoding the prefixes and error out once we
exceed the limit. That way we would not generate any IR code.

--
Pranith



Re: [Qemu-devel] [BUG] user-to-root privesc inside VM via bad translation caching

2017-03-22 Thread Pranith Kumar
On Wed, Mar 22, 2017 at 11:04 AM, Peter Maydell
 wrote:
>>
>> How about doing the instruction size check as follows?
>>
>> diff --git a/target/i386/translate.c b/target/i386/translate.c
>> index 72c1b03a2a..94cf3da719 100644
>> --- a/target/i386/translate.c
>> +++ b/target/i386/translate.c
>> @@ -8235,6 +8235,10 @@ static target_ulong disas_insn(CPUX86State
>> *env, DisasContext *s,
>>  default:
>>  goto unknown_op;
>>  }
>> +if (s->pc - pc_start > 15) {
>> +s->pc = pc_start;
>> +goto illegal_op;
>> +}
>>  return s->pc;
>>   illegal_op:
>>  gen_illegal_opcode(s);
>
> This doesn't look right because it means we'll check
> only after we've emitted all the code to do the
> instruction operation, so the effect will be
> "execute instruction, then take illegal-opcode
> exception".
>

The pc is restored to the original address (s->pc = pc_start), so the
exception will overwrite the generated illegal instruction and will be
executed first.

But yes, it's better to follow the architecture manual.

Thanks,
--
Pranith



Re: [Qemu-devel] [BUG] user-to-root privesc inside VM via bad translation caching

2017-03-22 Thread Pranith Kumar
On Mon, Mar 20, 2017 at 10:46 AM, Peter Maydell wrote:
> On 20 March 2017 at 14:36, Jann Horn  wrote:
>> This is an issue in QEMU's system emulation for X86 in TCG mode.
>> The issue permits an attacker who can execute code in guest ring 3
>> with normal user privileges to inject code into other processes that
>> are running in guest ring 3, in particular root-owned processes.
>
>> I am sending this to qemu-devel because a QEMU security contact
>> told me that QEMU does not consider privilege escalation inside a
>> TCG VM to be a security concern.
>
> Correct; it's just a bug. Don't trust TCG QEMU as a security boundary.
>
> We should really fix the crossing-a-page-boundary code for x86.
> I believe we do get it correct for ARM Thumb instructions.

How about doing the instruction size check as follows?

diff --git a/target/i386/translate.c b/target/i386/translate.c
index 72c1b03a2a..94cf3da719 100644
--- a/target/i386/translate.c
+++ b/target/i386/translate.c
@@ -8235,6 +8235,10 @@ static target_ulong disas_insn(CPUX86State
*env, DisasContext *s,
 default:
 goto unknown_op;
 }
+if (s->pc - pc_start > 15) {
+s->pc = pc_start;
+goto illegal_op;
+}
 return s->pc;
  illegal_op:
 gen_illegal_opcode(s);

Thanks,
--
Pranith


