Re: [PATCH 03/12] tests/plugin: add test plugin for inline operations

2024-01-15 Thread Pierrick Bouvier
On 1/15/24 13:04, Alex Bennée wrote:
Pierrick Bouvier writes:

On 1/13/24 21:16, Alex Bennée wrote:

Pierrick Bouvier  writes:


On 1/12/24 21:20, Alex Bennée wrote:

Pierrick Bouvier  writes:


On 1/11/24 19:57, Philippe Mathieu-Daudé wrote:

Hi Pierrick,
On 11/1/24 15:23, Pierrick Bouvier wrote:

For now, it simply performs instruction, bb and mem counts, and ensures
that inline vs callback versions have the same result. Later, we'll
extend it when new inline operations are added.

Using existing plugins to test that everything works is a bit cumbersome, as
different events are treated in different plugins. Thus, this new one.




+#define MAX_CPUS 8

Where does this value come from?



The plugin tests/plugin/insn.c had this constant, so I picked it up
from here.


Should the plugin API provide a helper to ask TCG how many
vCPUs are created?


In user mode, we can't know in advance how many simultaneous threads (and thus
vcpus) will be triggered. I'm not sure if additional cpus
can be added in system mode.

One problem, though, is that when you register an inline op on a
dynamic array, you can't change the pointer afterwards when the array
is resized (on detecting a new vcpu). So you need statically sized
storage somewhere.

Your question is good, and maybe we should define a MAX constant that
plugins should rely on, instead of a random amount.

For user-mode it can be infinite. The existing plugins handle this by
indexing with vcpu_index % max_vcpu. Perhaps we just ensure that for the
scoreboard as well? Of course that does introduce a trap for those using
user-mode...



The problem with vcpu_index % max_vcpu is that it reintroduces a race
condition, though it's probably less frequent than on a single
variable. IMHO, yes, it solves the memory error, but does not solve the
initial problem itself.

The simplest solution would be to have a size "big enough" for most
cases, and abort when it's reached.

Well that is simple enough for system emulation as max_vcpus is a
bounded
number.


Another solution, much more complicated, but correct, would be to move
memory management of plugin scoreboard to plugin runtime, and add a
level of indirection to access it.

That certainly gives us the most control and safety. We can then
ensure we'll never be writing past the bounds of the buffer. The plugin would
have to use an access function to get the pointer to read at the time it
cared, and of course inline checks should be pretty simple.


Every time a new vcpu is added, we
can grow dynamically. This way, the array can grow, and ultimately,
plugin can poke its content/size. I'm not sure this complexity is what
we want though.

It doesn't seem too bad. We have a start/end_exclusive in *-user
do_fork
where we could update pointers. If we are smart about growing the size
of the arrays we could avoid too much re-translation.



I was concerned about a potential race when the scoreboard updates
this pointer, and other cpus are executing tb (using it). But this
concern is not valid, since start_exclusive ensures all other cpus are
stopped.

vcpu_init_hook function in plugins/core.c seems a good location to add
this logic. We would check if an update is needed, then
start_exclusive(), update the scoreboard and exit exclusive section.


It might already be in the exclusive section. We should try and re-use
an existing exclusive region rather than adding a new one. It's
expensive so best avoided if we can.


Do you think it's worth to try to inline scoreboard pointer (and flush
all tb when updated), instead of simply adding an indirection to it?
With this, we could avoid any need to re-translate anything.


Depends on the cost/complexity of the indirection I guess.
Re-translation isn't super expensive if we say double the size of the
score board each time we need to.


Do we want a limit of one scoreboard per thread? Can we store structures
in there?



 From the current plugins' use cases, it seems that several scoreboards
are needed.
Allowing structure storage seems a bit more tricky to me: since
memory may be reallocated, users won't be allowed to point
directly into it. I would be in favor of avoiding this (comments are
welcome).


The init function can take a sizeof(entry) field and the inline op can
have an offset field (which for the simple 0 case can be folded away by
TCG).



Thanks for all your comments and guidance on this.

I implemented a new version, working with a scoreboard that gets resized 
automatically, which allows usage of structs as well. The result is 
pretty satisfying as there is almost no more need to manually keep track 
of how many cpus have been used, while offering thread-safety by default.


I'll re-spin the series once I've cleaned up the commits and ported the
existing plugins to this.


Re: [PATCH 03/12] tests/plugin: add test plugin for inline operations

2024-01-15 Thread Alex Bennée
Pierrick Bouvier  writes:

> On 1/13/24 21:16, Alex Bennée wrote:
>> Pierrick Bouvier  writes:
>> 
>>> On 1/12/24 21:20, Alex Bennée wrote:
 Pierrick Bouvier  writes:

> On 1/11/24 19:57, Philippe Mathieu-Daudé wrote:
>> Hi Pierrick,
>> On 11/1/24 15:23, Pierrick Bouvier wrote:
>>> For now, it simply performs instruction, bb and mem count, and ensure
>>> that inline vs callback versions have the same result. Later, we'll
>>> extend it when new inline operations are added.
>>>
>>> Use existing plugins to test everything works is a bit cumbersome, as
>>> different events are treated in different plugins. Thus, this new one.
>>>
>> 
>>> +#define MAX_CPUS 8
>> Where does this value come from?
>>
>
> The plugin tests/plugin/insn.c had this constant, so I picked it up
> from here.
>
>> Should the pluggin API provide a helper to ask TCG how many
>> vCPUs are created?
>
> In user mode, we can't know how many simultaneous threads (and thus
> vcpu) will be triggered by advance. I'm not sure if additional cpus
> can be added in system mode.
>
> One problem though, is that when you register an inline op with a
> dynamic array, when you resize it (when detecting a new vcpu), you
> can't change it afterwards. So, you need a storage statically sized
> somewhere.
>
> Your question is good, and maybe we should define a MAX constant that
> plugins should rely on, instead of a random amount.
 For user-mode it can be infinite. The existing plugins do this by
 ensuring vcpu_index % max_vcpu. Perhaps we just ensure that for the
 scoreboard as well? Of course that does introduce a trap for those using
 user-mode...

>>>
>>> The problem with vcpu-index % max_vcpu is that it reintroduces race
>>> condition, though it's probably less frequent than on a single
>>> variable. IMHO, yes it solves memory error, but does not solve the
>>> initial problem itself.
>>>
>>> The simplest solution would be to have a size "big enough" for most
>>> cases, and abort when it's reached.
>> Well that is simple enough for system emulation as max_vcpus is a
>> bounded
>> number.
>> 
>>> Another solution, much more complicated, but correct, would be to move
>>> memory management of plugin scoreboard to plugin runtime, and add a
>>> level of indirection to access it.
>> That certainly gives us the most control and safety. We can then
>> ensure
>> we'll never to writing past the bounds of the buffer. The plugin would
>> have to use an access function to get the pointer to read at the time it
>> cared and of course inline checks should be pretty simple.
>> 
>>> Every time a new vcpu is added, we
>>> can grow dynamically. This way, the array can grow, and ultimately,
>>> plugin can poke its content/size. I'm not sure this complexity is what
>>> we want though.
>> It doesn't seem too bad. We have a start/end_exclusive is *-user
>> do_fork
>> where we could update pointers. If we are smart about growing the size
>> of the arrays we could avoid too much re-translation.
>>
>
> I was concerned about a potential race when the scoreboard updates
> this pointer, and other cpus are executing tb (using it). But this
> concern is not valid, since start_exclusive ensures all other cpus are
> stopped.
>
> vcpu_init_hook function in plugins/core.c seems a good location to add
> this logic. We would check if an update is needed, then
> start_exclusive(), update the scoreboard and exit exclusive section.

It might already be in the exclusive section. We should try and re-use
an existing exclusive region rather than adding a new one. It's
expensive so best avoided if we can.

> Do you think it's worth to try to inline scoreboard pointer (and flush
> all tb when updated), instead of simply adding an indirection to it?
> With this, we could avoid any need to re-translate anything.

Depends on the cost/complexity of the indirection I guess.
Re-translation isn't super expensive if we, say, double the size of the
scoreboard each time we need to.

>> Do we want a limit of one scoreboard per thread? Can we store structures
>> in there?
>> 
>
> From the current plugins use case, it seems that several scoreboards
> are needed.
> Allowing structure storage seems a bit more tricky to me, because
> since memory may be reallocated, users won't be allowed to point
> directly to it. I would be in favor to avoid this (comments are
> welcome).

The init function can take a sizeof(entry) field and the inline op can
have an offset field (which for the simple 0 case can be folded away by
TCG).

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



Re: [PATCH 03/12] tests/plugin: add test plugin for inline operations

2024-01-14 Thread Pierrick Bouvier

On 1/13/24 21:16, Alex Bennée wrote:

Pierrick Bouvier  writes:


On 1/12/24 21:20, Alex Bennée wrote:

Pierrick Bouvier  writes:


On 1/11/24 19:57, Philippe Mathieu-Daudé wrote:

Hi Pierrick,
On 11/1/24 15:23, Pierrick Bouvier wrote:

For now, it simply performs instruction, bb and mem count, and ensure
that inline vs callback versions have the same result. Later, we'll
extend it when new inline operations are added.

Use existing plugins to test everything works is a bit cumbersome, as
different events are treated in different plugins. Thus, this new one.




+#define MAX_CPUS 8

Where does this value come from?



The plugin tests/plugin/insn.c had this constant, so I picked it up
from here.


Should the pluggin API provide a helper to ask TCG how many
vCPUs are created?


In user mode, we can't know how many simultaneous threads (and thus
vcpu) will be triggered by advance. I'm not sure if additional cpus
can be added in system mode.

One problem though, is that when you register an inline op with a
dynamic array, when you resize it (when detecting a new vcpu), you
can't change it afterwards. So, you need a storage statically sized
somewhere.

Your question is good, and maybe we should define a MAX constant that
plugins should rely on, instead of a random amount.

For user-mode it can be infinite. The existing plugins do this by
ensuring vcpu_index % max_vcpu. Perhaps we just ensure that for the
scoreboard as well? Of course that does introduce a trap for those using
user-mode...



The problem with vcpu-index % max_vcpu is that it reintroduces race
condition, though it's probably less frequent than on a single
variable. IMHO, yes it solves memory error, but does not solve the
initial problem itself.

The simplest solution would be to have a size "big enough" for most
cases, and abort when it's reached.


Well that is simple enough for system emulation as max_vcpus is a bounded
number.


Another solution, much more complicated, but correct, would be to move
memory management of plugin scoreboard to plugin runtime, and add a
level of indirection to access it.


That certainly gives us the most control and safety. We can then ensure
we'll never to writing past the bounds of the buffer. The plugin would
have to use an access function to get the pointer to read at the time it
cared and of course inline checks should be pretty simple.


Every time a new vcpu is added, we
can grow dynamically. This way, the array can grow, and ultimately,
plugin can poke its content/size. I'm not sure this complexity is what
we want though.


It doesn't seem too bad. We have a start/end_exclusive is *-user do_fork
where we could update pointers. If we are smart about growing the size
of the arrays we could avoid too much re-translation.



I was concerned about a potential race when the scoreboard updates this 
pointer, and other cpus are executing tb (using it). But this concern is 
not valid, since start_exclusive ensures all other cpus are stopped.


vcpu_init_hook function in plugins/core.c seems a good location to add 
this logic. We would check if an update is needed, then 
start_exclusive(), update the scoreboard and exit exclusive section.


Do you think it's worth to try to inline scoreboard pointer (and flush 
all tb when updated), instead of simply adding an indirection to it? 
With this, we could avoid any need to re-translate anything.



Do we want a limit of one scoreboard per thread? Can we store structures
in there?



From the current plugins' use cases, it seems that several scoreboards 
are needed.
Allowing structure storage seems a bit more tricky to me: since 
memory may be reallocated, users won't be allowed to point directly into 
it. I would be in favor of avoiding this (comments are welcome).


Re: [PATCH 03/12] tests/plugin: add test plugin for inline operations

2024-01-13 Thread Alex Bennée
Pierrick Bouvier  writes:

> On 1/12/24 21:20, Alex Bennée wrote:
>> Pierrick Bouvier  writes:
>> 
>>> On 1/11/24 19:57, Philippe Mathieu-Daudé wrote:
 Hi Pierrick,
 On 11/1/24 15:23, Pierrick Bouvier wrote:
> For now, it simply performs instruction, bb and mem count, and ensure
> that inline vs callback versions have the same result. Later, we'll
> extend it when new inline operations are added.
>
> Use existing plugins to test everything works is a bit cumbersome, as
> different events are treated in different plugins. Thus, this new one.
>

> +#define MAX_CPUS 8
 Where does this value come from?

>>>
>>> The plugin tests/plugin/insn.c had this constant, so I picked it up
>>> from here.
>>>
 Should the pluggin API provide a helper to ask TCG how many
 vCPUs are created?
>>>
>>> In user mode, we can't know how many simultaneous threads (and thus
>>> vcpu) will be triggered by advance. I'm not sure if additional cpus
>>> can be added in system mode.
>>>
>>> One problem though, is that when you register an inline op with a
>>> dynamic array, when you resize it (when detecting a new vcpu), you
>>> can't change it afterwards. So, you need a storage statically sized
>>> somewhere.
>>>
>>> Your question is good, and maybe we should define a MAX constant that
>>> plugins should rely on, instead of a random amount.
>> For user-mode it can be infinite. The existing plugins do this by
>> ensuring vcpu_index % max_vcpu. Perhaps we just ensure that for the
>> scoreboard as well? Of course that does introduce a trap for those using
>> user-mode...
>> 
>
> The problem with vcpu-index % max_vcpu is that it reintroduces race
> condition, though it's probably less frequent than on a single
> variable. IMHO, yes it solves memory error, but does not solve the
> initial problem itself.
>
> The simplest solution would be to have a size "big enough" for most
> cases, and abort when it's reached.

Well that is simple enough for system emulation as max_vcpus is a bounded
number.

> Another solution, much more complicated, but correct, would be to move
> memory management of plugin scoreboard to plugin runtime, and add a
> level of indirection to access it.

That certainly gives us the most control and safety. We can then ensure
we'll never be writing past the bounds of the buffer. The plugin would
have to use an access function to get the pointer to read at the time it
cared, and of course inline checks should be pretty simple.

> Every time a new vcpu is added, we
> can grow dynamically. This way, the array can grow, and ultimately,
> plugin can poke its content/size. I'm not sure this complexity is what
> we want though.

It doesn't seem too bad. We have a start/end_exclusive in *-user do_fork
where we could update pointers. If we are smart about growing the size
of the arrays we could avoid too much re-translation.

Do we want a limit of one scoreboard per thread? Can we store structures
in there?

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



Re: [PATCH 03/12] tests/plugin: add test plugin for inline operations

2024-01-12 Thread Pierrick Bouvier

On 1/12/24 21:20, Alex Bennée wrote:

Pierrick Bouvier  writes:


On 1/11/24 19:57, Philippe Mathieu-Daudé wrote:

Hi Pierrick,
On 11/1/24 15:23, Pierrick Bouvier wrote:

For now, it simply performs instruction, bb and mem count, and ensure
that inline vs callback versions have the same result. Later, we'll
extend it when new inline operations are added.

Use existing plugins to test everything works is a bit cumbersome, as
different events are treated in different plugins. Thus, this new one.

Signed-off-by: Pierrick Bouvier 
---
tests/plugin/inline.c| 183 +++
tests/plugin/meson.build |   2 +-
2 files changed, 184 insertions(+), 1 deletion(-)
create mode 100644 tests/plugin/inline.c



+#define MAX_CPUS 8

Where does this value come from?



The plugin tests/plugin/insn.c had this constant, so I picked it up
from here.


Should the pluggin API provide a helper to ask TCG how many
vCPUs are created?


In user mode, we can't know how many simultaneous threads (and thus
vcpu) will be triggered by advance. I'm not sure if additional cpus
can be added in system mode.

One problem though, is that when you register an inline op with a
dynamic array, when you resize it (when detecting a new vcpu), you
can't change it afterwards. So, you need a storage statically sized
somewhere.

Your question is good, and maybe we should define a MAX constant that
plugins should rely on, instead of a random amount.


For user-mode it can be infinite. The existing plugins do this by
ensuring vcpu_index % max_vcpu. Perhaps we just ensure that for the
scoreboard as well? Of course that does introduce a trap for those using
user-mode...



The problem with vcpu_index % max_vcpu is that it reintroduces a race 
condition, though it's probably less frequent than on a single variable. 
IMHO, yes, it solves the memory error, but does not solve the initial 
problem itself.


The simplest solution would be to have a size "big enough" for most 
cases, and abort when it's reached.


Another solution, much more complicated, but correct, would be to move 
memory management of plugin scoreboard to plugin runtime, and add a 
level of indirection to access it. Every time a new vcpu is added, we 
can grow dynamically. This way, the array can grow, and ultimately, 
plugin can poke its content/size. I'm not sure this complexity is what 
we want though.


Re: [PATCH 03/12] tests/plugin: add test plugin for inline operations

2024-01-12 Thread Alex Bennée
Pierrick Bouvier  writes:

> On 1/11/24 19:57, Philippe Mathieu-Daudé wrote:
>> Hi Pierrick,
>> On 11/1/24 15:23, Pierrick Bouvier wrote:
>>> For now, it simply performs instruction, bb and mem count, and ensure
>>> that inline vs callback versions have the same result. Later, we'll
>>> extend it when new inline operations are added.
>>>
>>> Use existing plugins to test everything works is a bit cumbersome, as
>>> different events are treated in different plugins. Thus, this new one.
>>>
>>> Signed-off-by: Pierrick Bouvier 
>>> ---
>>>tests/plugin/inline.c| 183 +++
>>>tests/plugin/meson.build |   2 +-
>>>2 files changed, 184 insertions(+), 1 deletion(-)
>>>create mode 100644 tests/plugin/inline.c
>> 
>>> +#define MAX_CPUS 8
>> Where does this value come from?
>> 
>
> The plugin tests/plugin/insn.c had this constant, so I picked it up
> from here.
>
>> Should the pluggin API provide a helper to ask TCG how many
>> vCPUs are created?
>
> In user mode, we can't know how many simultaneous threads (and thus
> vcpu) will be triggered by advance. I'm not sure if additional cpus
> can be added in system mode.
>
> One problem though, is that when you register an inline op with a
> dynamic array, when you resize it (when detecting a new vcpu), you
> can't change it afterwards. So, you need a storage statically sized
> somewhere.
>
> Your question is good, and maybe we should define a MAX constant that
> plugins should rely on, instead of a random amount.

For user-mode it can be infinite. The existing plugins handle this by
indexing with vcpu_index % max_vcpu. Perhaps we just ensure that for the
scoreboard as well? Of course that does introduce a trap for those using
user-mode...

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



Re: [PATCH 03/12] tests/plugin: add test plugin for inline operations

2024-01-11 Thread Pierrick Bouvier

On 1/11/24 19:57, Philippe Mathieu-Daudé wrote:

Hi Pierrick,

On 11/1/24 15:23, Pierrick Bouvier wrote:

For now, it simply performs instruction, bb and mem count, and ensure
that inline vs callback versions have the same result. Later, we'll
extend it when new inline operations are added.

Use existing plugins to test everything works is a bit cumbersome, as
different events are treated in different plugins. Thus, this new one.

Signed-off-by: Pierrick Bouvier 
---
   tests/plugin/inline.c| 183 +++
   tests/plugin/meson.build |   2 +-
   2 files changed, 184 insertions(+), 1 deletion(-)
   create mode 100644 tests/plugin/inline.c



+#define MAX_CPUS 8


Where does this value come from?



The plugin tests/plugin/insn.c had this constant, so I picked it up from 
here.



Should the pluggin API provide a helper to ask TCG how many
vCPUs are created?


In user mode, we can't know in advance how many simultaneous threads (and 
thus vcpus) will be triggered. I'm not sure if additional cpus can 
be added in system mode.


One problem, though, is that when you register an inline op on a 
dynamic array, you can't change the pointer afterwards when the array is 
resized (on detecting a new vcpu). So you need statically sized storage somewhere.


Your question is good, and maybe we should define a MAX constant that 
plugins should rely on, instead of a random amount.


Re: [PATCH 03/12] tests/plugin: add test plugin for inline operations

2024-01-11 Thread Philippe Mathieu-Daudé

Hi Pierrick,

On 11/1/24 15:23, Pierrick Bouvier wrote:

For now, it simply performs instruction, bb and mem count, and ensure
that inline vs callback versions have the same result. Later, we'll
extend it when new inline operations are added.

Use existing plugins to test everything works is a bit cumbersome, as
different events are treated in different plugins. Thus, this new one.

Signed-off-by: Pierrick Bouvier 
---
  tests/plugin/inline.c| 183 +++
  tests/plugin/meson.build |   2 +-
  2 files changed, 184 insertions(+), 1 deletion(-)
  create mode 100644 tests/plugin/inline.c



+#define MAX_CPUS 8


Where does this value come from?

Should the plugin API provide a helper to ask TCG how many
vCPUs are created?



[PATCH 03/12] tests/plugin: add test plugin for inline operations

2024-01-11 Thread Pierrick Bouvier
For now, it simply performs instruction, bb and mem counts, and ensures
that inline vs callback versions have the same result. Later, we'll
extend it when new inline operations are added.

Using existing plugins to test that everything works is a bit cumbersome, as
different events are treated in different plugins. Thus, this new one.

Signed-off-by: Pierrick Bouvier 
---
 tests/plugin/inline.c| 183 +++
 tests/plugin/meson.build |   2 +-
 2 files changed, 184 insertions(+), 1 deletion(-)
 create mode 100644 tests/plugin/inline.c

diff --git a/tests/plugin/inline.c b/tests/plugin/inline.c
new file mode 100644
index 000..6114ebca545
--- /dev/null
+++ b/tests/plugin/inline.c
@@ -0,0 +1,183 @@
+/*
+ * Copyright (C) 2023, Pierrick Bouvier 
+ *
+ * Demonstrates and tests usage of inline ops.
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+
+#include 
+#include 
+#include 
+
+#include 
+
+#define MAX_CPUS 8
+
+static uint64_t count_tb;
+static uint64_t count_tb_per_vcpu[MAX_CPUS];
+static uint64_t count_tb_inline_per_vcpu[MAX_CPUS];
+static uint64_t count_tb_inline_racy;
+static uint64_t count_insn;
+static uint64_t count_insn_per_vcpu[MAX_CPUS];
+static uint64_t count_insn_inline_per_vcpu[MAX_CPUS];
+static uint64_t count_insn_inline_racy;
+static uint64_t count_mem;
+static uint64_t count_mem_per_vcpu[MAX_CPUS];
+static uint64_t count_mem_inline_per_vcpu[MAX_CPUS];
+static uint64_t count_mem_inline_racy;
+static GMutex tb_lock;
+static GMutex insn_lock;
+static GMutex mem_lock;
+
+QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;
+
+static uint64_t collect_per_vcpu(uint64_t *values)
+{
+uint64_t count = 0;
+for (int i = 0; i < MAX_CPUS; ++i) {
+count += values[i];
+}
+return count;
+}
+
+static void stats_insn(void)
+{
+const uint64_t expected = count_insn;
+const uint64_t per_vcpu = collect_per_vcpu(count_insn_per_vcpu);
+const uint64_t inl_per_vcpu = collect_per_vcpu(count_insn_inline_per_vcpu);
+printf("insn: %" PRIu64 "\n", expected);
+printf("insn: %" PRIu64 " (per vcpu)\n", per_vcpu);
+printf("insn: %" PRIu64 " (per vcpu inline)\n", inl_per_vcpu);
+printf("insn: %" PRIu64 " (inline racy)\n", count_insn_inline_racy);
+g_assert(expected > 0);
+g_assert(per_vcpu == expected);
+g_assert(inl_per_vcpu == expected);
+g_assert(count_insn_inline_racy <= expected);
+}
+
+static void stats_tb(void)
+{
+const uint64_t expected = count_tb;
+const uint64_t per_vcpu = collect_per_vcpu(count_tb_per_vcpu);
+const uint64_t inl_per_vcpu = collect_per_vcpu(count_tb_inline_per_vcpu);
+printf("tb: %" PRIu64 "\n", expected);
+printf("tb: %" PRIu64 " (per vcpu)\n", per_vcpu);
+printf("tb: %" PRIu64 " (per vcpu inline)\n", inl_per_vcpu);
+printf("tb: %" PRIu64 " (inline racy)\n", count_tb_inline_racy);
+g_assert(expected > 0);
+g_assert(per_vcpu == expected);
+g_assert(inl_per_vcpu == expected);
+g_assert(count_tb_inline_racy <= expected);
+}
+
+static void stats_mem(void)
+{
+const uint64_t expected = count_mem;
+const uint64_t per_vcpu = collect_per_vcpu(count_mem_per_vcpu);
+const uint64_t inl_per_vcpu = collect_per_vcpu(count_mem_inline_per_vcpu);
+printf("mem: %" PRIu64 "\n", expected);
+printf("mem: %" PRIu64 " (per vcpu)\n", per_vcpu);
+printf("mem: %" PRIu64 " (per vcpu inline)\n", inl_per_vcpu);
+printf("mem: %" PRIu64 " (inline racy)\n", count_mem_inline_racy);
+g_assert(expected > 0);
+g_assert(per_vcpu == expected);
+g_assert(inl_per_vcpu == expected);
+g_assert(count_mem_inline_racy <= expected);
+}
+
+static void plugin_exit(qemu_plugin_id_t id, void *udata)
+{
+for (int i = 0; i < MAX_CPUS; ++i) {
+const uint64_t tb = count_tb_per_vcpu[i];
+const uint64_t tb_inline = count_tb_inline_per_vcpu[i];
+const uint64_t insn = count_insn_per_vcpu[i];
+const uint64_t insn_inline = count_insn_inline_per_vcpu[i];
+const uint64_t mem = count_mem_per_vcpu[i];
+const uint64_t mem_inline = count_mem_inline_per_vcpu[i];
+printf("cpu %d: tb (%" PRIu64 ", %" PRIu64 ") | "
+   "insn (%" PRIu64 ", %" PRIu64 ") | "
+   "mem (%" PRIu64 ", %" PRIu64 ")"
+   "\n",
+   i, tb, tb_inline, insn, insn_inline, mem, mem_inline);
+g_assert(tb == tb_inline);
+g_assert(insn == insn_inline);
+g_assert(mem == mem_inline);
+}
+
+stats_tb();
+stats_insn();
+stats_mem();
+}
+
+static void vcpu_tb_exec(unsigned int cpu_index, void *udata)
+{
+count_tb_per_vcpu[cpu_index]++;
+g_mutex_lock(&tb_lock);
+count_tb++;
+g_mutex_unlock(&tb_lock);
+}
+
+static void vcpu_insn_exec(unsigned int cpu_index, void *udata)
+{
+count_insn_per_vcpu[cpu_index]++;
+g_mutex_lock(&insn_lock);
+count_insn++;
+g_mutex_unlock(&insn_lock);
+}
+
+static