Re: [Qemu-devel] [RFC v2 03/11] docs: new design document multi-thread-tcg.txt (DRAFTING)

2016-05-25 Thread Sergey Fedorov
On 25/05/16 21:03, Paolo Bonzini wrote:
>> The page table seems to be protected by 'mmap_lock' in user mode
>> emulation but by 'tb_lock' in system mode emulation. It may turn to be
>> possible to read it safely even with no lock held.
> Yes, it is possible to at least follow the radix tree safely with no
> lock held.  The fields in the leaves can be either lockless or protected
> by a lock.
>
> The radix tree can be followed without a lock just like you do with RCU.
> The difference with RCU is that:
>
> 1) the leaves are protected with a lock, so you don't do the "copy";
> instead after reading you lock around updates
>
> 2) the radix tree is only ever added to, so you don't need to protect
> the reads with rcu_read_lock/rcu_read_unlock.  rcu_read_lock and
> rcu_read_unlock are only needed to inform the deleters that something
> cannot yet go away.  Without deleters, you don't need rcu_read_lock
> and rcu_read_unlock (but you still need atomic_rcu_read/atomic_rcu_set).
>
>

Yes, however looking closer at how the leaves are used, I can't see much
point in doing this so far...

Thanks,
Sergey



Re: [Qemu-devel] [RFC v2 03/11] docs: new design document multi-thread-tcg.txt (DRAFTING)

2016-05-25 Thread Paolo Bonzini
> The page table seems to be protected by 'mmap_lock' in user mode
> emulation but by 'tb_lock' in system mode emulation. It may turn to be
> possible to read it safely even with no lock held.

Yes, it is possible to at least follow the radix tree safely with no
lock held.  The fields in the leaves can be either lockless or protected
by a lock.

The radix tree can be followed without a lock just like you do with RCU.
The difference with RCU is that:

1) the leaves are protected with a lock, so you don't do the "copy";
instead after reading you lock around updates

2) the radix tree is only ever added to, so you don't need to protect
the reads with rcu_read_lock/rcu_read_unlock.  rcu_read_lock and
rcu_read_unlock are only needed to inform the deleters that something
cannot yet go away.  Without deleters, you don't need rcu_read_lock
and rcu_read_unlock (but you still need atomic_rcu_read/atomic_rcu_set).
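
The add-only scheme described above can be sketched roughly as follows. This is
only an illustration under stated assumptions: it uses plain C11 atomics
(acquire loads and a releasing compare-and-swap) as a stand-in for
atomic_rcu_read/atomic_rcu_set, and the names Leaf, l1_table, leaf_find and
leaf_find_alloc are hypothetical, not the actual QEMU code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical two-level, add-only table; names are illustrative only. */
#define L1_BITS 8
#define L2_BITS 8
#define L1_SIZE (1u << L1_BITS)
#define L2_SIZE (1u << L2_BITS)

typedef struct Leaf {
    unsigned long flags;   /* leaf fields would be updated under a lock */
} Leaf;

/* Top level: each slot is either NULL or a published second-level array. */
static _Atomic(Leaf *) l1_table[L1_SIZE];

/* Lockless read: the acquire load pairs with the releasing publish below.
 * No read-side critical section is needed because nothing is ever freed. */
static Leaf *leaf_find(unsigned index)
{
    assert(index < L1_SIZE * L2_SIZE);
    Leaf *level2 = atomic_load_explicit(&l1_table[index >> L2_BITS],
                                        memory_order_acquire);
    return level2 ? &level2[index & (L2_SIZE - 1)] : NULL;
}

/* Add-only insert: allocate and zero a level, then publish it atomically. */
static Leaf *leaf_find_alloc(unsigned index)
{
    assert(index < L1_SIZE * L2_SIZE);
    _Atomic(Leaf *) *slot = &l1_table[index >> L2_BITS];
    Leaf *level2 = atomic_load_explicit(slot, memory_order_acquire);

    if (!level2) {
        Leaf *fresh = calloc(L2_SIZE, sizeof(Leaf));
        Leaf *expected = NULL;

        if (!fresh) {
            return NULL;
        }
        if (atomic_compare_exchange_strong_explicit(slot, &expected, fresh,
                                                    memory_order_acq_rel,
                                                    memory_order_acquire)) {
            level2 = fresh;
        } else {
            free(fresh);       /* another thread published first */
            level2 = expected; /* use the winner's level */
        }
    }
    return &level2[index & (L2_SIZE - 1)];
}
```

Readers only ever call leaf_find() and take no lock; writers that lose the
publishing race free their own unpublished copy, so nothing a reader can
already see is ever deleted.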

Paolo



Re: [Qemu-devel] [RFC v2 03/11] docs: new design document multi-thread-tcg.txt (DRAFTING)

2016-05-25 Thread Alex Bennée

Sergey Fedorov  writes:

> On 11/04/16 23:00, Sergey Fedorov wrote:
>> On 05/04/16 18:32, Alex Bennée wrote:
>>
>> (snip)
>>> +
>>> +Memory maps and TLBs
>>> +
>>> +
>>> +The memory handling code is fairly critical to the speed of memory
>>> +access in the emulated system.
>>> +
>> It would be nice to put some intro sentence for the following bullets :)
>>
>>> +  - Memory regions (dividing up access to PIO, MMIO and RAM)
>>> +  - Dirty page tracking (for code gen, migration and display)
>>> +  - Virtual TLB (for translating guest address->real address)
>
> There's also a global page table - called 'l1_map' - which is used for:
>  * keeping a list of TBs generated from a given physical guest page for
>    further code invalidation on page writes
>  * holding a bitmap to track which regions of a given physical guest page
>    actually contain code for optimized code invalidation on page writes
>    (used only in system mode emulation)
>  * holding page flags, e.g. protection bits (used only in user mode
>    emulation)
>
> The page table seems to be protected by 'mmap_lock' in user mode
> emulation but by 'tb_lock' in system mode emulation. It may turn to be
> possible to read it safely even with no lock held.

I've started adding words to that effect to the document.

>
> Kind regards,
> Sergey
>
>>> +
>>> +There is a both a fast path walked by the generated code and a slow
>>> +path when resolution is required. When the TLB tables are updated we
>>> +need to ensure they are done in a safe way by bringing all executing
>>> +threads to a halt before making the modifications.
>> Again, I think we could benefit if we could possibly manage to avoid
>> bringing vCPU threads to halt.
>>
>> Nothing about memory regions and dirty page tracking?
>>
>>> +
>>> +DESIGN REQUIREMENTS:
>>> +
>>> +  - TLB Flush All/Page
>>> +- can be across-CPUs
>>> +- will need all other CPUs brought to a halt
>> s/will/may/ ?
>>
>>> +  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>>> +- This is a per-CPU table - by definition can't race
>>> +- updated by it's own thread when the slow-path is forced
>> (snip)
>
> Kind regards,
> Sergey


--
Alex Bennée



Re: [Qemu-devel] [RFC v2 03/11] docs: new design document multi-thread-tcg.txt (DRAFTING)

2016-05-25 Thread Sergey Fedorov
On 11/04/16 23:00, Sergey Fedorov wrote:
> On 05/04/16 18:32, Alex Bennée wrote:
>
> (snip)
>> +
>> +Memory maps and TLBs
>> +
>> +
>> +The memory handling code is fairly critical to the speed of memory
>> +access in the emulated system.
>> +
> It would be nice to put some intro sentence for the following bullets :)
>
>> +  - Memory regions (dividing up access to PIO, MMIO and RAM)
>> +  - Dirty page tracking (for code gen, migration and display)
>> +  - Virtual TLB (for translating guest address->real address)

There's also a global page table - called 'l1_map' - which is used for:
 * keeping a list of TBs generated from a given physical guest page for
   further code invalidation on page writes
 * holding a bitmap to track which regions of a given physical guest page
   actually contain code for optimized code invalidation on page writes
   (used only in system mode emulation)
 * holding page flags, e.g. protection bits (used only in user mode
   emulation)

The page table seems to be protected by 'mmap_lock' in user mode
emulation but by 'tb_lock' in system mode emulation. It may turn out to
be possible to read it safely even with no lock held.
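
As a rough illustration of the three uses listed above, a per-page entry might
look like the sketch below. Everything here is an assumption made for the
example: the names PageEntry, page_add_tb and write_hits_code are hypothetical,
and the 64-byte bitmap granularity is chosen only to keep the bitmap in one
word. It is not the real layout behind 'l1_map'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GUEST_PAGE_SIZE 4096u
#define GRANULE 64u   /* hypothetical: one bitmap bit per 64-byte chunk */

typedef struct TB {
    struct TB *page_next;   /* next TB generated from the same page */
    uint32_t start;         /* offset of the translated code in the page */
    uint32_t size;          /* length of the translated code in bytes */
} TB;

typedef struct PageEntry {
    TB *first_tb;           /* list of TBs to invalidate on page writes */
    uint64_t code_bitmap;   /* which chunks contain code (system mode) */
    unsigned long flags;    /* e.g. protection bits (user mode) */
} PageEntry;

/* Record a TB on its page and mark every chunk its code touches. */
static void page_add_tb(PageEntry *p, TB *tb)
{
    assert(tb->size > 0 && tb->start + tb->size <= GUEST_PAGE_SIZE);
    tb->page_next = p->first_tb;
    p->first_tb = tb;
    for (uint32_t c = tb->start / GRANULE;
         c <= (tb->start + tb->size - 1) / GRANULE; c++) {
        p->code_bitmap |= UINT64_C(1) << c;
    }
}

/* A write only needs to trigger invalidation if it lands on a code chunk. */
static bool write_hits_code(const PageEntry *p, uint32_t offset)
{
    assert(offset < GUEST_PAGE_SIZE);
    return (p->code_bitmap >> (offset / GRANULE)) & 1;
}
```

In this sketch both updates to an entry would still need whichever lock
protects the table; only the decision of when readers may skip it is open.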

Kind regards,
Sergey

>> +
>> +There is a both a fast path walked by the generated code and a slow
>> +path when resolution is required. When the TLB tables are updated we
>> +need to ensure they are done in a safe way by bringing all executing
>> +threads to a halt before making the modifications.
> Again, I think we could benefit if we could possibly manage to avoid
> bringing vCPU threads to halt.
>
> Nothing about memory regions and dirty page tracking?
>
>> +
>> +DESIGN REQUIREMENTS:
>> +
>> +  - TLB Flush All/Page
>> +- can be across-CPUs
>> +- will need all other CPUs brought to a halt
> s/will/may/ ?
>
>> +  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>> +- This is a per-CPU table - by definition can't race
>> +- updated by it's own thread when the slow-path is forced
> (snip)

Kind regards,
Sergey



Re: [Qemu-devel] [RFC v2 03/11] docs: new design document multi-thread-tcg.txt (DRAFTING)

2016-05-06 Thread Sergey Fedorov
On 05/04/16 18:32, Alex Bennée wrote:
> This is a current DRAFT of a design proposal for upgrading TCG emulation
> to take advantage of modern CPUs by running a thread-per-CPU. The
> document goes through the various areas of the code affected by such a
> change and proposes design requirements for each part of the solution.
>
> It has been written *without* explicit reference to the current ongoing
> efforts to introduce this feature. The hope being we can review and
> discuss the design choices without assuming the current choices taken by
> the implementation are correct.
>
> Signed-off-by: Alex Bennée 
>
> ---
> v1
>   - initial version
> v2
>   - update discussion on locks
>   - bit more detail on vCPU scheduling
>   - explicitly mention Translation Blocks
>   - emulated hardware state already covered by iomutex
>   - a few minor rewords

We could also include a few words about icount mode support in MTTCG.

Kind regards,
Sergey



Re: [Qemu-devel] [RFC v2 03/11] docs: new design document multi-thread-tcg.txt (DRAFTING)

2016-04-11 Thread Sergey Fedorov
On 05/04/16 18:32, Alex Bennée wrote:

(snip)
> +Introduction
> +
> +
> +This document outlines the design for multi-threaded TCG emulation.
> +The original TCG implementation was single threaded and dealt with
> +multiple CPUs by with simple round-robin scheduling. This simplified a
> +lot of things but became increasingly limited as systems being
> +emulated gained additional cores and per-core performance gains for
> +host systems started to level off.

This looks like a description of system-mode TCG only. Maybe it would
be worth mentioning the current status of user-mode multithreading support
as well?

(snip)
> +Shared Data Structures
> +==
> +
> +Global TCG State
> +
> +
> +We need to protect the entire code generation cycle including any post
> +generation patching of the translated code. This also implies a shared
> +translation buffer which contains code running on all cores. Any
> +execution path that comes to the main run loop will need to hold a
> +mutex for code generation. This also includes times when we need flush
> +code or jumps from the tb_cache.
> +
> +DESIGN REQUIREMENT: Add locking around all code generation, patching
> +and jump cache modification.

I think we could also benefit from some kind of "lock-free" algorithms
where possible, so making locking a requirement seems a bit too
restrictive. Regarding the shared translation buffer, how is it implied?
Don't we have the option of a separate per-vCPU code cache? (Maybe I
missed some discussion on this?)
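
One way to combine the two positions, sketched here purely as an assumption and
not as the proposed design, is to keep a mutex around code generation while
letting look-ups run without it. The names tb_hash, gen_lock, tb_lookup_lockless
and tb_find are hypothetical, and the "translation" is reduced to allocating a
record.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define HASH_SIZE 64

typedef struct TB {
    unsigned long pc;       /* guest address this block was built for */
} TB;

static _Atomic(TB *) tb_hash[HASH_SIZE];
static pthread_mutex_t gen_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned translations;   /* only modified under gen_lock */

/* Fast path: no lock, just an acquire load of the published pointer. */
static TB *tb_lookup_lockless(unsigned long pc)
{
    TB *tb = atomic_load_explicit(&tb_hash[pc % HASH_SIZE],
                                  memory_order_acquire);
    return (tb && tb->pc == pc) ? tb : NULL;
}

/* Slow path: take the lock, re-check, then generate and publish. */
static TB *tb_find(unsigned long pc)
{
    TB *tb = tb_lookup_lockless(pc);

    if (tb) {
        return tb;
    }
    pthread_mutex_lock(&gen_lock);
    tb = tb_lookup_lockless(pc);        /* someone may have beaten us */
    if (!tb) {
        tb = malloc(sizeof(*tb));
        assert(tb != NULL);
        tb->pc = pc;                    /* stands in for code generation */
        translations++;
        /* A replaced entry is deliberately leaked in this sketch: a
         * lockless reader may still hold a pointer to it. */
        atomic_store_explicit(&tb_hash[pc % HASH_SIZE], tb,
                              memory_order_release);
    }
    pthread_mutex_unlock(&gen_lock);
    return tb;
}
```

The re-check under the lock is what keeps two vCPUs that miss on the same
address from both generating a block for it.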

> +
> +Translation Blocks
> +--
> +
> +Currently the whole system shares a single code generation buffer
> +which when full will force a flush of all translations and start from
> +scratch again.
> +
> +Once a basic block has been translated it will continue to be used
> +until it is invalidated. These invalidation events are typically due
> +to page changes in system emulation

I didn't dig too deep into this yet, but TLB invalidation after
virtual-to-physical address mapping changes doesn't seem to invalidate
any TBs in system mode...

>  and changes in memory mapping in
> +user mode. Debugging operations 

and self modifying code

> can also trigger invalidation's.
> +
> +The invalidation also requires removing the TB from look-ups
> +(tb_phys_hash and tb_jmp_cache) as well removing any direct TB to TB
> +patched jumps.

We could probably get by with a lazy approach that just prevents
invalidated TBs from being picked up by look-ups. It is also possible not
to remove direct jumps from the TB being invalidated to other TBs, since
it is not going to be executed anyway.
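
A minimal sketch of that lazy approach, assuming a per-TB "invalid" flag that
look-ups check before returning a block. The names jmp_cache, tb_cache_insert,
tb_mark_invalid and tb_lookup are hypothetical, and the cache is a simplified
stand-in for tb_jmp_cache.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define JMP_CACHE_SIZE 16

typedef struct TB {
    unsigned long pc;       /* guest address of the block */
    atomic_bool invalid;    /* set once, never cleared */
} TB;

/* Hypothetical per-vCPU look-up cache, indexed by guest address. */
static TB *jmp_cache[JMP_CACHE_SIZE];

static void tb_cache_insert(TB *tb)
{
    assert(tb != NULL);
    jmp_cache[tb->pc % JMP_CACHE_SIZE] = tb;
}

/* Lazy invalidation: flag the TB and leave every table untouched. */
static void tb_mark_invalid(TB *tb)
{
    atomic_store(&tb->invalid, true);
}

/* Look-ups refuse invalidated TBs, so the caller retranslates instead. */
static TB *tb_lookup(unsigned long pc)
{
    TB *tb = jmp_cache[pc % JMP_CACHE_SIZE];

    if (tb && tb->pc == pc && !atomic_load(&tb->invalid)) {
        return tb;
    }
    return NULL;
}
```

The stale cache entry is simply overwritten the next time a block for that
slot is inserted, so no other thread's table has to be edited at invalidation
time.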

> +
> +DESIGN REQUIREMENT: Safely handle invalidation of TBs

It would probably be a good idea to include translation buffer flush
considerations as well.

> +
> +Memory maps and TLBs
> +
> +
> +The memory handling code is fairly critical to the speed of memory
> +access in the emulated system.
> +

It would be nice to put some intro sentence for the following bullets :)

> +  - Memory regions (dividing up access to PIO, MMIO and RAM)
> +  - Dirty page tracking (for code gen, migration and display)
> +  - Virtual TLB (for translating guest address->real address)
> +
> +There is a both a fast path walked by the generated code and a slow
> +path when resolution is required. When the TLB tables are updated we
> +need to ensure they are done in a safe way by bringing all executing
> +threads to a halt before making the modifications.

Again, I think we could benefit if we could manage to avoid bringing
vCPU threads to a halt.

Nothing about memory regions and dirty page tracking?

> +
> +DESIGN REQUIREMENTS:
> +
> +  - TLB Flush All/Page
> +- can be across-CPUs
> +- will need all other CPUs brought to a halt

s/will/may/ ?

> +  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
> +- This is a per-CPU table - by definition can't race
> +- updated by it's own thread when the slow-path is forced
(snip)
> +Memory Consistency
> +==
> +
> +Between emulated guests and host systems there are a range of memory
> +consistency models. While emulating weakly ordered systems on strongly
> +ordered hosts shouldn't cause any problems the same is not true for
> +the reverse setup.
> +
> +The proposed design currently does not address the problem of
> +emulating strong ordering on a weakly ordered host although even on
> +strongly ordered systems software should be using synchronisation
> +primitives to ensure correct operation.

e.g. strongly-ordered x86 still allows a load to be reordered ahead of an
earlier store to a different location, and provides memory fences to
prevent that where it matters.
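
The x86 case is the classic store-buffering pattern. A minimal sketch in C11
atomics, with hypothetical names (flag0, flag1, side0, side1, check_outcome):
each side stores its own flag, issues a full fence, then loads the other
side's flag. With the fences in place, the outcome where both sides read 0 is
not allowed; without them, both hardware and compiler may produce it.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int flag0, flag1;

/* Meant to run on one thread: store own flag, fence, load the other. */
static int side0(void)
{
    atomic_store_explicit(&flag0, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* orders store before load */
    return atomic_load_explicit(&flag1, memory_order_relaxed);
}

/* Mirror image, meant to run on a second thread. */
static int side1(void)
{
    atomic_store_explicit(&flag1, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&flag0, memory_order_relaxed);
}

/* With both fences present, at least one side must observe the other. */
static void check_outcome(int r0, int r1)
{
    assert(r0 == 1 || r1 == 1);
}
```

Calling the two functions back to back on one thread only exercises the
sequential case; the interesting behaviour needs side0() and side1() running
concurrently on separate threads.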

> +
> +Memory Barriers
> +---
> +
> +Barriers (sometimes known as fences) provide a mechanism for software
> +to enforce a particular ordering of memory operations from the point
> +of view of external observers (e.g. another processor core). They can
> +apply to any memory operations as well as just loads or stores.
> +
> +The 

[Qemu-devel] [RFC v2 03/11] docs: new design document multi-thread-tcg.txt (DRAFTING)

2016-04-05 Thread Alex Bennée
This is a current DRAFT of a design proposal for upgrading TCG emulation
to take advantage of modern CPUs by running a thread-per-CPU. The
document goes through the various areas of the code affected by such a
change and proposes design requirements for each part of the solution.

It has been written *without* explicit reference to the current ongoing
efforts to introduce this feature. The hope being we can review and
discuss the design choices without assuming the current choices taken by
the implementation are correct.

Signed-off-by: Alex Bennée 

---
v1
  - initial version
v2
  - update discussion on locks
  - bit more detail on vCPU scheduling
  - explicitly mention Translation Blocks
  - emulated hardware state already covered by iomutex
  - a few minor rewords
---
 docs/multi-thread-tcg.txt | 184 ++
 1 file changed, 184 insertions(+)
 create mode 100644 docs/multi-thread-tcg.txt

diff --git a/docs/multi-thread-tcg.txt b/docs/multi-thread-tcg.txt
new file mode 100644
index 0000000..32e2f46
--- /dev/null
+++ b/docs/multi-thread-tcg.txt
@@ -0,0 +1,184 @@
+Copyright (c) 2015 Linaro Ltd.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.  See
+the COPYING file in the top-level directory.
+
+STATUS: DRAFTING
+
+Introduction
+============
+
+This document outlines the design for multi-threaded TCG emulation.
+The original TCG implementation was single threaded and dealt with
+multiple CPUs with simple round-robin scheduling. This simplified a
+lot of things but became increasingly limited as systems being
+emulated gained additional cores and per-core performance gains for
+host systems started to level off.
+
+vCPU Scheduling
+===============
+
+We introduce a new running mode where each vCPU will run on its own
+user-space thread. This will be enabled by default for all
+FE/BE combinations that have had the required work done to support
+this safely.
+
+In the general case of running translated code there should be no
+inter-vCPU dependencies and all vCPUs should be able to run at full
+speed. Synchronisation will only be required while accessing internal
+shared data structures or when the emulated architecture requires a
+coherent representation of the emulated machine state.
+
+Shared Data Structures
+======================
+
+Global TCG State
+----------------
+
+We need to protect the entire code generation cycle including any post
+generation patching of the translated code. This also implies a shared
+translation buffer which contains code running on all cores. Any
+execution path that comes to the main run loop will need to hold a
+mutex for code generation. This also includes times when we need to
+flush code or jumps from the tb_cache.
+
+DESIGN REQUIREMENT: Add locking around all code generation, patching
+and jump cache modification.
+
+Translation Blocks
+------------------
+
+Currently the whole system shares a single code generation buffer
+which when full will force a flush of all translations and start from
+scratch again.
+
+Once a basic block has been translated it will continue to be used
+until it is invalidated. These invalidation events are typically due
+to page changes in system emulation and changes in memory mapping in
+user mode. Debugging operations can also trigger invalidations.
+
+The invalidation also requires removing the TB from look-ups
+(tb_phys_hash and tb_jmp_cache) as well as removing any direct TB to TB
+patched jumps.
+
+DESIGN REQUIREMENT: Safely handle invalidation of TBs
+
+Memory maps and TLBs
+--------------------
+
+The memory handling code is fairly critical to the speed of memory
+access in the emulated system.
+
+  - Memory regions (dividing up access to PIO, MMIO and RAM)
+  - Dirty page tracking (for code gen, migration and display)
+  - Virtual TLB (for translating guest address->real address)
+
+There is both a fast path walked by the generated code and a slow
+path when resolution is required. When the TLB tables are updated we
+need to ensure they are done in a safe way by bringing all executing
+threads to a halt before making the modifications.
+
+DESIGN REQUIREMENTS:
+
+  - TLB Flush All/Page
+    - can be across-CPUs
+    - will need all other CPUs brought to a halt
+  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
+    - This is a per-CPU table - by definition can't race
+    - updated by its own thread when the slow-path is forced
+
+Emulated hardware state
+-----------------------
+
+Currently thanks to KVM work any access to IO memory is automatically
+protected by the global iothread mutex. Any IO region that doesn't use
+the global mutex is expected to do its own locking.
+
+Memory Consistency
+==================
+
+Between emulated guests and host systems there is a range of memory
+consistency models. While emulating weakly ordered systems on strongly
+ordered hosts shouldn't cause any problems the same is not true for
+the