Re: [Qemu-devel] [PATCH v7 0/7] trace: [tcg] Optimize per-vCPU tracing states with separate TB caches

2017-06-11 Thread Lluís Vilanova
Emilio G Cota writes:

> On Fri, Jan 13, 2017 at 21:48:09 +0100, Lluís Vilanova wrote:
> (snip)
>> To handle both issues, this series integrates the dynamic tracing event state
>> into the TB hashing function, so that vCPUs tracing different events will use
>> separate TBs. Note that only events with the 'vcpu' property are used for
>> hashing (as stored in the bitmap of CPUState->trace_dstate).

> Is this going to be picked up by anyone? AFAICT the patchset is close
> to being merge-ready.

> Lluís: I'm very interested in your instrumentation work [1]:

> - How up to date are the branches in [1]? I couldn't find this
>   v7 iteration in there, although maybe I didn't look carefully enough.

> - Are you planning on upstreaming it? I have some time to help with
>   that if you're interested.

After your latest re-spin on this series, there are two basic pieces still
missing upstream before instrumentation is possible:

* An interface to insert user callbacks on events.

* A useful mechanism to pass value placeholders (their values are calculated later)
  to these callbacks. This is useful when instrumenting an event before it
  happens (e.g., before an instruction is executed) to pass it some values that
  are only known afterwards (e.g., number of instructions in a BBL). I called
  them promises, and you can see an example in [2] (using an older API of QDBI).
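
For readers unfamiliar with the idea, here is a minimal sketch of what such a
promise could look like. It is not the QDBI API: the promise type, the helper
names and the count_insns callback are all hypothetical, and the only point is
that the callback is handed a slot whose value is filled in after registration
but before the callback runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical promise: a slot for a value that is only known later. */
typedef struct {
    bool resolved;   /* true once the value has been filled in */
    uint64_t value;  /* e.g., number of instructions in a BBL */
} promise;

static void promise_init(promise *p)
{
    p->resolved = false;
    p->value = 0;
}

/* Called once the value is known (e.g., when BBL translation ends). */
static void promise_resolve(promise *p, uint64_t v)
{
    assert(!p->resolved);
    p->value = v;
    p->resolved = true;
}

/* Called from a user callback at execution time. */
static uint64_t promise_get(const promise *p)
{
    assert(p->resolved);
    return p->value;
}

/* Hypothetical user callback: accumulate executed instructions per BBL. */
static uint64_t total_insns;

static void count_insns(uint64_t pc, const promise *icount)
{
    (void)pc;  /* unused in this sketch */
    total_insns += promise_get(icount);
}
```

In this sketch the instrumentation would create the promise when it sees the
start of a BBL, resolve it when the BBL ends, and the callback would only read
it every time the BBL executes.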


> - Do you have instrumentation examples beyond what's in
>   docs/instrumentation.txt? In particular I'd like to see how the basic
>   block (BBL) instrumentation works, i.e. how a 'skeleton' simulator
>   would work to decode the guest instructions and also track their
>   dependences.

No, sorry, I never got around to writing that kind of instrumentation
example. The closest I have is [2].


[2] https://projects.gso.ac.upc.edu/projects/qdbi-simpoint


> Thanks,

>   Emilio

> [1] https://projects.gso.ac.upc.edu/projects/qemu-dbi


Cheers,
  Lluis



Re: [Qemu-devel] [PATCH v7 0/7] trace: [tcg] Optimize per-vCPU tracing states with separate TB caches

2017-06-06 Thread Stefan Hajnoczi
On Thu, Jun 01, 2017 at 03:55:44PM -0400, Emilio G. Cota wrote:
> On Fri, Jan 13, 2017 at 21:48:09 +0100, Lluís Vilanova wrote:
> (snip)
> > To handle both issues, this series integrates the dynamic tracing event state
> > into the TB hashing function, so that vCPUs tracing different events will use
> > separate TBs. Note that only events with the 'vcpu' property are used for
> > hashing (as stored in the bitmap of CPUState->trace_dstate).
> 
> Is this going to be picked up by anyone? AFAICT the patchset is close
> to being merge-ready.

You left comments and the discussion with Richard Henderson is also not
complete yet.

Once there is consensus I will merge it.

Stefan




Re: [Qemu-devel] [PATCH v7 0/7] trace: [tcg] Optimize per-vCPU tracing states with separate TB caches

2017-06-01 Thread Emilio G. Cota
On Fri, Jan 13, 2017 at 21:48:09 +0100, Lluís Vilanova wrote:
(snip)
> To handle both issues, this series integrates the dynamic tracing event state
> into the TB hashing function, so that vCPUs tracing different events will use
> separate TBs. Note that only events with the 'vcpu' property are used for
> hashing (as stored in the bitmap of CPUState->trace_dstate).

Is this going to be picked up by anyone? AFAICT the patchset is close
to being merge-ready.

Lluís: I'm very interested in your instrumentation work [1]:

- How up to date are the branches in [1]? I couldn't find this
  v7 iteration in there, although maybe I didn't look carefully enough.

- Are you planning on upstreaming it? I have some time to help with
  that if you're interested.

- Do you have instrumentation examples beyond what's in
  docs/instrumentation.txt? In particular I'd like to see how the basic
  block (BBL) instrumentation works, i.e. how a 'skeleton' simulator
  would work to decode the guest instructions and also track their
  dependences.

Thanks,

Emilio

[1] https://projects.gso.ac.upc.edu/projects/qemu-dbi



[Qemu-devel] [PATCH v7 0/7] trace: [tcg] Optimize per-vCPU tracing states with separate TB caches

2017-01-13 Thread Lluís Vilanova
Optimizes tracing of events with the 'tcg' and 'vcpu' properties (e.g., memory
accesses), making it feasible to statically enable them by default on all QEMU
builds.

Some quick'n'dirty numbers with 400.perlbench (SPECcpu2006) on the train input
(medium size - suns.pl) and the guest_mem_before event:

* vanilla, statically disabled
  real    0m2,259s
  user    0m2,252s
  sys     0m0,004s

* vanilla, statically enabled (overhead: 2.18x)
  real    0m4,921s
  user    0m4,912s
  sys     0m0,008s

* multi-tb, statically disabled (overhead: 0.99x) [within noise range]
  real    0m2,228s
  user    0m2,216s
  sys     0m0,008s

* multi-tb, statically enabled (overhead: 0.99x) [within noise range]
  real    0m2,229s
  user    0m2,224s
  sys     0m0,004s


Right now, events with the 'tcg' property always generate TCG code to trace that
event at guest code execution time, where the event's dynamic state is checked.

This series adds a performance optimization where TCG code for events with the
'tcg' and 'vcpu' properties is not generated if the event is dynamically
disabled. This optimization raises two issues:

* An event can be dynamically disabled/enabled after the corresponding TCG code
  has been generated (i.e., a new TB with the corresponding code should be
  used).

* Each vCPU can have a different dynamic state for the same event (e.g., tracing
  the memory accesses of only one process pinned to a vCPU).
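
As a rough illustration of the optimization and of why both issues arise, the
toy "translator" below decides at translation time whether to emit the tracing
call. It is only a sketch: translate_mem_access() and the event index are
hypothetical and do not correspond to the actual QEMU code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit index of the guest_mem_before event in the bitmap. */
enum { EV_GUEST_MEM_BEFORE = 0 };

/*
 * Toy translator for a single guest memory access. It returns the number
 * of ops it would emit, given the dynamic state seen at translation time.
 */
static int translate_mem_access(uint32_t trace_dstate)
{
    int ops = 0;

    if (trace_dstate & (1u << EV_GUEST_MEM_BEFORE)) {
        ops++;  /* emit the call to the tracing helper */
    }
    ops++;      /* emit the memory access itself */
    return ops;
}
```

The generated code is fixed once translation is done, so a later change to the
dynamic state, or a vCPU with a different state, cannot reuse it as-is.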

To handle both issues, this series integrates the dynamic tracing event state
into the TB hashing function, so that vCPUs tracing different events will use
separate TBs. Note that only events with the 'vcpu' property are used for
hashing (as stored in the bitmap of CPUState->trace_dstate).

This makes dynamic event state changes on vCPUs very efficient, since they can
use TBs produced by other vCPUs while on the same event state combination (or
produced by the same vCPU, earlier).
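
A minimal sketch of the idea, assuming a simplified lookup key: the tb_key
struct, the helper names and the FNV-1a style mixer below are all hypothetical
stand-ins (QEMU uses its own hash function and more key fields). The point is
only that the event bitmap takes part in both hashing and comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified lookup key of a translation block (TB). */
typedef struct {
    uint64_t pc;            /* guest program counter */
    uint32_t flags;         /* CPU mode flags */
    uint32_t trace_dstate;  /* bitmap of enabled 'vcpu' events */
} tb_key;

/* Toy byte-wise mixer (FNV-1a style), not the hash used by QEMU. */
static uint32_t mix(uint32_t h, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        h ^= (uint32_t)(v & 0xff);
        h *= 16777619u;
        v >>= 8;
    }
    return h;
}

/* The tracing state is hashed together with the rest of the key. */
static uint32_t tb_key_hash(const tb_key *k)
{
    uint32_t h = 2166136261u;

    assert(k != 0);
    h = mix(h, k->pc);
    h = mix(h, k->flags);
    h = mix(h, k->trace_dstate);
    return h;
}

/* Two TBs are interchangeable only if the tracing state matches too. */
static int tb_key_equal(const tb_key *a, const tb_key *b)
{
    return a->pc == b->pc &&
           a->flags == b->flags &&
           a->trace_dstate == b->trace_dstate;
}
```

With a key like this, two vCPUs with the same event combination look up the
same TB, while a vCPU that toggles an event simply starts hitting a different
entry without flushing the existing ones.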

Discarded alternatives:

* Emitting TCG code to check if an event needs tracing, where we would still
  have to either move the tracing call code to a cold path (making tracing
  performance worse) or leave it inlined (making non-tracing performance worse).

* Eliding TCG code only when *zero* vCPUs are tracing an event, since enabling
  it on a single vCPU will impact the performance of all other vCPUs that are
  not tracing that event.

Signed-off-by: Lluís Vilanova 
---

Changes in v7
=============

* Fix delayed dstate changes (now uses async_run_on_cpu() as suggested by Paolo
  Bonzini).

* Note to Richard: patch 4 has been adapted to the new patch 3 async callback,
  but is essentially the same.


Changes in v6
=============

* Check hashing size error with QEMU_BUILD_BUG_ON [Richard Henderson].


Changes in v5
=============

* Move define into "qemu-common.h" to allow compilation of tests.


Changes in v4
=============

* Incorporate trace_dstate into the TB hashing function instead of using
  multiple physical TB caches [suggested by Richard Henderson].


Changes in v3
=============

* Rebase on 0737f32daf.
* Do not use reserved symbol prefixes ("__") [Stefan Hajnoczi].
* Refactor trace_get_vcpu_event_count() to be inlinable.
* Optimize cpu_tb_cache_set_requested() (hottest path).


Changes in v2
=============

* Fix bitmap copy in cpu_tb_cache_set_apply().
* Split generated code re-alignment into a separate patch [Daniel P. Berrange].


Lluís Vilanova (7):
  exec: [tcg] Refactor flush of per-CPU virtual TB cache
  trace: Make trace_get_vcpu_event_count() inlinable
  trace: [tcg] Delay changes to dynamic state when translating
  exec: [tcg] Use different TBs according to the vCPU's dynamic tracing state
  trace: [tcg] Do not generate TCG code to trace dynamically-disabled events
  trace: [tcg,trivial] Re-align generated code
  trace: [trivial] Statically enable all guest events


 cpu-exec.c                               |   22 +-
 cputlb.c                                 |    2 +-
 include/exec/exec-all.h                  |   11 +++
 include/exec/tb-hash-xx.h                |    8 +++-
 include/exec/tb-hash.h                   |    5 +++--
 include/qemu-common.h                    |    3 +++
 include/qom/cpu.h                        |    3 +++
 qom/cpu.c                                |    2 ++
 scripts/tracetool/__init__.py            |    3 ++-
 scripts/tracetool/backend/dtrace.py      |    4 ++--
 scripts/tracetool/backend/ftrace.py      |   20 ++--
 scripts/tracetool/backend/log.py         |   19 ++-
 scripts/tracetool/backend/simple.py      |    4 ++--
 scripts/tracetool/backend/syslog.py      |    6 +++---
 scripts/tracetool/backend/ust.py         |    4 ++--
 scripts/tracetool/format/h.py            |   26 +++---
 scripts/tracetool/format/tcg_h.py        |   21 +
 scripts/tracetool/format/tcg_helper_c.py |    5 +++--
 tests/qht-bench.c                        |    2 +-
 trace-events