Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-11 Thread Jan Beulich
>>> On 11.12.17 at 21:26,  wrote:
> On 12/11/2017 10:06 AM, Jan Beulich wrote:
> On 08.12.17 at 15:38,  wrote:
>>> On 08/12/17 08:03, Tim Deegan wrote:
 It should be possible to do something like the misconfigured-entry bit
 trick by _allocating_ the memory up-front and building the p2m entries
 but only making them usable by the {IO}MMUs on first access.  That
 would make these early p2m walks shorter (because they can skip whole
 subtrees that aren't marked present yet) without making major changes
 to domain build or introducing run-time failures.
>>>
>>> I am not aware of any way on Arm to misconfigure an entry. We do have
>>> valid and access bits, although they will affect the IOMMU as well. So
>>> it will not be possible to get page-table sharing with this "feature"
>>> enabled.
>> 
>> How would you intend to solve the IOMMU part of the problem with
>> PoD? As was pointed out before - IOMMU and PoD are incompatible
>> on x86.
> 
> I am not sure why you ask about PoD here when I acknowledged I will look 
> at a different solution. And again, misconfiguring an entry is not 
> possible on Arm.

I'm sorry if I've overlooked any such acknowledgment; it's certainly
not in context above.

Jan



Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-11 Thread Julien Grall

On 12/11/2017 11:10 AM, Andre Przywara wrote:

Hi,


Hi Andre,


But on the other hand, KVM already had PoD naturally, so this came
at no cost.
So I believe it would be worth investigating what the actual impact is
on booting a 32-bit kernel, with emulating s/w ops like KVM does (see
below), but cleaning the *whole VA space*. If this is somewhat
acceptable (I assume we have no more than 2GB for a typical ARM32
guest), it might be worth ignoring PoD, at least for now, and solving
this problem (and the IOMMU consequences).


I am fairly surprised you think I came up with this solution without any 
investigation. I actually clearly stated in my first e-mail that 
Linux is not able to bring up a CPU with a flush of the "whole VA space".


At the moment, Linux 32-bit has a 1-second timeout to bring up a 
secondary CPU. Within that second we need to do at least one full flush 
(I think there is a second one). In the case of Xen Arm32, the domain heap 
(where domain memory belongs) is not mapped in the hypervisor. So you 
end up creating a mapping for every page-table and the final memory. To 
that, you add the cost of doing the cache maintenance itself. Then, you 
finally add the potential cost of preemption (the vCPU might be scheduled 
out).


During my initial investigation, I was not able to boot Dom0 with 512MB. 
I tried to optimize the mapping path, but it didn't show much 
improvement in general.
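To make the cost concrete, the full clean amounts to a loop of this
shape (a sketch only; map_domain_page() and the dcache helper are
Xen-style, while gfn_to_mfn() and max_gfn() are hypothetical stand-ins
for the p2m lookup):

/* Illustrative only: why a full clean is expensive on Xen Arm32.  The
 * domain heap is not permanently mapped in the hypervisor, so every
 * guest page must be mapped before it can be maintained. */
static void flush_guest_ram_sketch(struct domain *d)
{
    gfn_t gfn;

    for ( gfn = _gfn(0); gfn_x(gfn) < max_gfn(d); gfn = gfn_add(gfn, 1) )
    {
        mfn_t mfn = gfn_to_mfn(d, gfn);    /* a p2m walk: not free either */
        void *va;

        if ( mfn_eq(mfn, INVALID_MFN) )
            continue;

        va = map_domain_page(mfn);         /* per-page map... */
        clean_and_invalidate_dcache_va_range(va, PAGE_SIZE);
        unmap_domain_page(va);             /* ...and unmap */
    }
}

Each iteration pays for a map, the maintenance itself, and an unmap, and 
the whole loop can be preempted at any point.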


Regarding the IOMMU consequences, S/W ops are not easily virtualizable. 
If a guest uses them, that is the price to pay. It is better than not 
being able to boot current kernels, or randomly crashing.


Cheers,

--
Julien Grall


Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-11 Thread Julien Grall

Hi Jan,

On 12/11/2017 10:06 AM, Jan Beulich wrote:

On 08.12.17 at 15:38,  wrote:

On 08/12/17 08:03, Tim Deegan wrote:

It should be possible to do something like the misconfigured-entry bit
trick by _allocating_ the memory up-front and building the p2m entries
but only making them usable by the {IO}MMUs on first access.  That
would make these early p2m walks shorter (because they can skip whole
subtrees that aren't marked present yet) without making major changes
to domain build or introducing run-time failures.


I am not aware of any way on Arm to misconfigure an entry. We do have
valid and access bits, although they will affect the IOMMU as well. So
it will not be possible to get page-table sharing with this "feature"
enabled.


How would you intend to solve the IOMMU part of the problem with
PoD? As was pointed out before - IOMMU and PoD are incompatible
on x86.


I am not sure why you ask about PoD here when I acknowledged I will look 
at a different solution. And again, misconfiguring an entry is not 
possible on Arm.


But to answer your question, the IOMMU will be supported neither with PoD 
nor with the access/valid-bit solution. And that's fine: because S/W ops 
are not easily virtualizable, I take that as a hint that "all the features 
may not be available when using S/W in a guest".


Cheers,

--
Julien Grall


Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-11 Thread Julien Grall

Hi,

On 12/10/2017 03:22 PM, Tim Deegan wrote:

At 14:38 + on 08 Dec (1512743913), Julien Grall wrote:

On 08/12/17 08:03, Tim Deegan wrote:

+1 for avoiding the full majesty of PoD if you don't need it.

It should be possible to do something like the misconfigured-entry bit
trick by _allocating_ the memory up-front and building the p2m entries
but only making them usable by the {IO}MMUs on first access.  That
would make these early p2m walks shorter (because they can skip whole
subtrees that aren't marked present yet) without making major changes
to domain build or introducing run-time failures.


I am not aware of any way on Arm to misconfigure an entry. We do have
valid and access bits, although they will affect the IOMMU as well. So
it will not be possible to get page-table sharing with this "feature"
enabled.


How unfortunate.  How does KVM's demand-population scheme handle the IOMMU?


From what I have heard, when using the IOMMU all the memory is pinned. 
They also don't share page-tables.


But I am not a KVM expert, maybe Andre/Marc can confirm here?

Cheers,

--
Julien Grall


Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-10 Thread Tim Deegan
At 14:38 + on 08 Dec (1512743913), Julien Grall wrote:
> On 08/12/17 08:03, Tim Deegan wrote:
> > +1 for avoiding the full majesty of PoD if you don't need it.
> > 
> > It should be possible to do something like the misconfigured-entry bit
> > trick by _allocating_ the memory up-front and building the p2m entries
> > but only making them usable by the {IO}MMUs on first access.  That
> > would make these early p2m walks shorter (because they can skip whole
> > subtrees that aren't marked present yet) without making major changes
> > to domain build or introducing run-time failures.
> 
> I am not aware of any way on Arm to misconfigure an entry. We do have 
> valid and access bits, although they will affect the IOMMU as well. So 
> it will not be possible to get page-table sharing with this "feature" 
> enabled.

How unfortunate.  How does KVM's demand-population scheme handle the IOMMU? 

Tim.


Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-08 Thread Julien Grall

On 08/12/17 08:03, Tim Deegan wrote:

Hi,


Hi Tim,

Somehow your e-mail was marked as spam by gmail.


At 12:58 + on 06 Dec (1512565090), Julien Grall wrote:

On 12/06/2017 12:28 PM, George Dunlap wrote:

2. It sounds like rather than using PoD, you could use the
"misconfigured p2m table" technique that x86 uses: set bits in the p2m
entry which cause a specific kind of HAP fault when accessed.  The fault
handler then looks in the p2m entry, and if it finds an otherwise valid
entry, it just fixes the "misconfigured" bits and continues.


I thought about this. But when do you set the entry to misconfigured?

Take the example of Linux 32-bit: there are a couple of full
cache cleans during a uni-processor boot. So you would need to go
through the p2m multiple times and reset the access bits.


My 2c (echoing what some others have already said):

+1 for avoiding the full majesty of PoD if you don't need it.

It should be possible to do something like the misconfigured-entry bit
trick by _allocating_ the memory up-front and building the p2m entries
but only making them usable by the {IO}MMUs on first access.  That
would make these early p2m walks shorter (because they can skip whole
subtrees that aren't marked present yet) without making major changes
to domain build or introducing run-time failures.


I am not aware of any way on Arm to misconfigure an entry. We do have 
valid and access bits, although they will affect the IOMMU as well. So 
it will not be possible to get page-table sharing with this "feature" 
enabled.


At the moment, I am thinking of providing a per-guest option to turn 
on/off the use of the valid/access bit. That will come at the 
expense of doing a full invalidate on S/W.



Also beware of DoS conditions -- a guest that touches all its memory
and then flushes by set/way mustn't be allowed to hurt the rest of the
system.  That probably means the set/way flush has to be preemptable.


I am fully aware of it :). This was actually mentioned in my first 
e-mail.


Cheers,

--
Julien Grall


Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-08 Thread George Dunlap
On 12/07/2017 07:21 PM, Marc Zyngier wrote:
> On 07/12/17 18:06, George Dunlap wrote:
>> On 12/07/2017 04:58 PM, Marc Zyngier wrote:
>>> On 07/12/17 16:44, George Dunlap wrote:
 On 12/07/2017 04:04 PM, Julien Grall wrote:
> Hi Jan,
>
> On 07/12/17 15:45, Jan Beulich wrote:
> On 07.12.17 at 15:53,  wrote:
>>> On 07/12/17 13:52, Julien Grall wrote:
>>> There is exactly one case where set/way makes sense, and that's when
>>> you're the only CPU left in the system, your MMU is off, and you're
>>> about to go down.
>>
>> With this and ...
>>
>>> On top of bypassing the coherency, S/W CMOs do not prevent lines from
>>> migrating from one CPU to another. So you could happily be flushing by
>>> S/W, and still end up with dirty lines in your cache. Success!
>>
>> ... this I wonder what value emulating those insns then has in the first
>> place. Can't you as well simply skip and ignore them, with the same
>> (bad) result?
>
> The result will be much, much worse. Here is a concrete example with Linux
> Arm 32-bit:
>
> 1) Cache enabled
> 2) Decompress
> 3) Nuke cache (S/W)
> 4) Cache off
> 5) Access new kernel
>
> If you skip #3, the decompressed data may not have reached memory, so
> you would access stale data.
>
> This would effectively mean we don't support Linux Arm 32-bit.

 So Marc said that #3 "doesn't make sense", since although it might be
 the only cpu on in the system, you're not "about to go down"; but Linux
 32-bit is doing that anyway.
>>>
>>> "Doesn't make sense" on an ARMv7+ with SMP. That code dates back to
>>> ARMv4, and has been left untouched ever since. "If it ain't broke..."
>>>
 It sounds like from the slides the purpose of #3 might be to get stuff
 out of the D-cache into the I-cache.  But why is the cache turned off?
>>>
>>> Linux mandates that the kernel is entered with the MMU off, which has
>>> the effect of disabling the caches too (VIVT caches and all that jazz).
>>>
 And why doesn't Linux use the VA-based flushes rather than the S/W flushes?
>>>
>>> Linux/arm64 does. Changing the 32bit port to use VA CMOs would probably
>>> break stuff from the late 90s, so that's not going to happen. These
>>> days, I tend to pick my battles... ;-)
>>
>> OK, so let me try to state this "forwards" for those of us not familiar
>> with the situation:
>>
>> 1. Linux expects to start in 'linear' mode, with the MMU disabled.
>>
>> 2. On ARM, disabling the MMU disables caching (!).  But disabling
>> caching doesn't flush the cache; it just means the cache is bypassed (!).
>>
>> 3. Which means for Linux on ARM, after unzipping the kernel image, you
>> need to flush the cache before disabling the MMU and starting Linux proper
>>
>> 4. For historical reasons, 32-bit ARM Linux uses the S/W instructions to
>> flush the cache.  This still works on 32-bit hardware, and so the Linux
>> maintainers are loath to change it, even though more reliable VA-based
>> instructions are available (?).
> 
> It also works on 64bit HW. It is just not easily virtualizable, which is
> why we've removed all S/W from the 64bit Linux port a while ago.

From the diagram in your talk, it looked like the "flush the cache"
operation *doesn't* work anywhere that has a "system cache", even on
bare metal.

>> 6. Rather than fix this in Linux, KVM has added a work-around in which
>> the *hypervisor* flushes the caches at certain points (!!!).  Julien is
>> looking into doing the same with Xen.
> 
> The "at certain points" doesn't quite describe it. We fully emulate S/W
> instruction using the biggest hammer we can find.

Oh, I thought Julien was saying something about flushing the guest's RAM
every time caching was enabled or disabled.

>> Given the variety of hardware that Linux has to run on, it's hard to
>> understand why 1) 32-bit ARM Linux couldn't detect whether it would be
>> appropriate to use VA-based instructions rather than S/W instructions, or
>> 2) there couldn't at least be a Kconfig option to use VA instructions
>> instead of S/W instructions.
> 
> [Linux hat on]
> 
> 1) There is hardly anything to detect. Both sets of CMOs are available
> on a moderately recent implementation. What you'd want to detect is whether
> the kernel is "virtualizable", which is not an easy task.

> An alternative option would be to switch to VA CMOs if compiled for
> ARMv7 (and maybe v6), assuming that doesn't have any horrible side
> effect with broken cache implementations (and there are a few out there).
> You'll have to check that this doesn't regress on any existing HW.

So the idea would be to use the VA-based operations if available, and
then special-case specific chipsets known to have issues.  Linux (and
Xen and...) end up doing this for lots of different kinds of hardware;
this would be no different.

> 2) Kconfig options are the way to hell. It took us 5 years to get a
> 32bit kernel that would boot on about anything, and we're not going to
> go back.

Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-08 Thread Tim Deegan
Hi,

At 12:58 + on 06 Dec (1512565090), Julien Grall wrote:
> On 12/06/2017 12:28 PM, George Dunlap wrote:
> > 2. It sounds like rather than using PoD, you could use the
> > "misconfigured p2m table" technique that x86 uses: set bits in the p2m
> > entry which cause a specific kind of HAP fault when accessed.  The fault
> > handler then looks in the p2m entry, and if it finds an otherwise valid
> > entry, it just fixes the "misconfigured" bits and continues.
> 
> I thought about this. But when do you set the entry to misconfigured?
> 
> Take the example of Linux 32-bit: there are a couple of full 
> cache cleans during a uni-processor boot. So you would need to go 
> through the p2m multiple times and reset the access bits.

My 2c (echoing what some others have already said):

+1 for avoiding the full majesty of PoD if you don't need it.

It should be possible to do something like the misconfigured-entry bit
trick by _allocating_ the memory up-front and building the p2m entries
but only making them usable by the {IO}MMUs on first access.  That
would make these early p2m walks shorter (because they can skip whole
subtrees that aren't marked present yet) without making major changes
to domain build or introducing run-time failures.
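For illustration, a minimal sketch of that trick using the Arm valid bit
(helper names are hypothetical, loosely modelled on Xen's arm p2m code;
the hardware ignores the remaining bits of an invalid entry, so the
mapping information can stay in place):

/* Domain build: create the full, correct entry, then strip the valid
 * bit.  The mapping information is preserved in the ignored bits. */
static void p2m_set_entry_lazy(struct p2m_domain *p2m, gfn_t gfn, mfn_t mfn)
{
    lpae_t e = mfn_to_p2m_entry(mfn, p2m_ram_rw);   /* hypothetical */

    e.p2m.valid = 0;
    p2m_write_entry(p2m, gfn, e);                   /* hypothetical */
}

/* Stage-2 abort handler: if the entry is populated but merely invalid,
 * flip the bit and let the guest retry the access. */
static bool p2m_resolve_lazy(struct p2m_domain *p2m, gfn_t gfn)
{
    lpae_t e = p2m_read_entry(p2m, gfn);            /* hypothetical */

    if ( e.p2m.valid || !e.p2m.base )   /* truly empty: a real fault */
        return false;

    e.p2m.valid = 1;
    p2m_write_entry(p2m, gfn, e);
    return true;
}

As noted elsewhere in the thread, the catch on Arm is that an IOMMU
sharing these tables honours the same valid bit, so page-table sharing
would be lost.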

Also beware of DoS conditions -- a guest that touches all its memory
and then flushes by set/way mustn't be allowed to hurt the rest of the
system.  That probably means the set/way flush has to be preemptable.
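The usual Xen pattern would apply here: work in batches and check for
pending work between them.  A sketch (flush_one_gfn() and max_gfn() are
hypothetical; hypercall_preempt_check() and -ERESTART are the existing
Xen idiom):

/* Sketch: make the set/way-triggered flush preemptible. */
static int flush_guest_ram_preemptible(struct domain *d, gfn_t *progress)
{
    unsigned long done = 0;
    gfn_t gfn;

    for ( gfn = *progress; gfn_x(gfn) < max_gfn(d); gfn = gfn_add(gfn, 1) )
    {
        flush_one_gfn(d, gfn);          /* map + clean/invalidate + unmap */

        if ( !(++done & 0xff) && hypercall_preempt_check() )
        {
            *progress = gfn;            /* remember where to resume */
            return -ERESTART;           /* reschedule, then continue */
        }
    }

    return 0;
}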

Tim.


Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-07 Thread Marc Zyngier
On 07/12/17 18:06, George Dunlap wrote:
> On 12/07/2017 04:58 PM, Marc Zyngier wrote:
>> On 07/12/17 16:44, George Dunlap wrote:
>>> On 12/07/2017 04:04 PM, Julien Grall wrote:
 Hi Jan,

 On 07/12/17 15:45, Jan Beulich wrote:
 On 07.12.17 at 15:53,  wrote:
>> On 07/12/17 13:52, Julien Grall wrote:
>> There is exactly one case where set/way makes sense, and that's when
>> you're the only CPU left in the system, your MMU is off, and you're
>> about to go down.
>
> With this and ...
>
>> On top of bypassing the coherency, S/W CMOs do not prevent lines from
>> migrating from one CPU to another. So you could happily be flushing by
>> S/W, and still end up with dirty lines in your cache. Success!
>
> ... this I wonder what value emulating those insns then has in the first
> place. Can't you as well simply skip and ignore them, with the same
> (bad) result?

The result will be much, much worse. Here is a concrete example with Linux
Arm 32-bit:

 1) Cache enabled
 2) Decompress
 3) Nuke cache (S/W)
 4) Cache off
 5) Access new kernel

If you skip #3, the decompressed data may not have reached memory, so
you would access stale data.

 This would effectively mean we don't support Linux Arm 32-bit.
>>>
>>> So Marc said that #3 "doesn't make sense", since although it might be
>>> the only cpu on in the system, you're not "about to go down"; but Linux
>>> 32-bit is doing that anyway.
>>
>> "Doesn't make sense" on an ARMv7+ with SMP. That code dates back to
>> ARMv4, and has been left untouched ever since. "If it ain't broke..."
>>
>>> It sounds like from the slides the purpose of #3 might be to get stuff
>>> out of the D-cache into the I-cache.  But why is the cache turned off?
>>
>> Linux mandates that the kernel is entered with the MMU off, which has
>> the effect of disabling the caches too (VIVT caches and all that jazz).
>>
>>> And why doesn't Linux use the VA-based flushes rather than the S/W flushes?
>>
>> Linux/arm64 does. Changing the 32bit port to use VA CMOs would probably
>> break stuff from the late 90s, so that's not going to happen. These
>> days, I tend to pick my battles... ;-)
> 
> OK, so let me try to state this "forwards" for those of us not familiar
> with the situation:
> 
> 1. Linux expects to start in 'linear' mode, with the MMU disabled.
> 
> 2. On ARM, disabling the MMU disables caching (!).  But disabling
> caching doesn't flush the cache; it just means the cache is bypassed (!).
> 
> 3. Which means for Linux on ARM, after unzipping the kernel image, you
> need to flush the cache before disabling the MMU and starting Linux proper
> 
> 4. For historical reasons, 32-bit ARM Linux uses the S/W instructions to
> flush the cache.  This still works on 32-bit hardware, and so the Linux
> maintainers are loath to change it, even though more reliable VA-based
> instructions are available (?).

It also works on 64bit HW. It is just not easily virtualizable, which is
why we've removed all S/W from the 64bit Linux port a while ago.

> 
> 5. For 64-bit hardware, the S/W instructions don't affect the L3 cache
> [1] (?!).  So for a 32-bit guest on a 64-bit host, the above is entirely broken.

System caches in general can avoid implementing S/W. That's not specific
to 64bit. It is just that in general, 32bit systems do not have a very
deep cache hierarchy (there are of course a number of exceptions to this
rule). 64bit systems, on the other hand, can be much bigger and are
quite happily stacking a deep cache hierarchy.

> 6. Rather than fix this in Linux, KVM has added a work-around in which
> the *hypervisor* flushes the caches at certain points (!!!).  Julien is
> looking into doing the same with Xen.

The "at certain points" doesn't quite describe it. We fully emulate S/W
instruction using the biggest hammer we can find.

> Is that about right?

I think you got the gist of it.

> Given the variety of hardware that Linux has to run on, it's hard to
> understand why 1) 32-bit ARM Linux couldn't detect whether it would be
> appropriate to use VA-based instructions rather than S/W instructions, or
> 2) there couldn't at least be a Kconfig option to use VA instructions
> instead of S/W instructions.

[Linux hat on]

1) There is hardly anything to detect. Both sets of CMOs are available
on a moderately recent implementation. What you'd want to detect is whether
the kernel is "virtualizable", which is not an easy task.

2) Kconfig options are the way to hell. It took us 5 years to get a
32bit kernel that would boot on about anything, and we're not going to
go back.

An alternative option would be to switch to VA CMOs if compiled for
ARMv7 (and maybe v6), assuming that doesn't have any horrible side
effect with broken cache implementations (and there are a few out there).
You'll have to check that this doesn't regress on any existing HW.

Of course, none 

Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-07 Thread George Dunlap
On 12/07/2017 04:58 PM, Marc Zyngier wrote:
> On 07/12/17 16:44, George Dunlap wrote:
>> On 12/07/2017 04:04 PM, Julien Grall wrote:
>>> Hi Jan,
>>>
>>> On 07/12/17 15:45, Jan Beulich wrote:
>>> On 07.12.17 at 15:53,  wrote:
> On 07/12/17 13:52, Julien Grall wrote:
> There is exactly one case where set/way makes sense, and that's when
> you're the only CPU left in the system, your MMU is off, and you're
> about to go down.

 With this and ...

> On top of bypassing the coherency, S/W CMOs do not prevent lines from
> migrating from one CPU to another. So you could happily be flushing by
> S/W, and still end up with dirty lines in your cache. Success!

 ... this I wonder what value emulating those insns then has in the first
 place. Can't you as well simply skip and ignore them, with the same
 (bad) result?
>>>
>>> The result will be much, much worse. Here is a concrete example with Linux
>>> Arm 32-bit:
>>>
>>> 1) Cache enabled
>>> 2) Decompress
>>> 3) Nuke cache (S/W)
>>> 4) Cache off
>>> 5) Access new kernel
>>>
>>> If you skip #3, the decompressed data may not have reached memory, so
>>> you would access stale data.
>>>
>>> This would effectively mean we don't support Linux Arm 32-bit.
>>
>> So Marc said that #3 "doesn't make sense", since although it might be
>> the only cpu on in the system, you're not "about to go down"; but Linux
>> 32-bit is doing that anyway.
> 
> "Doesn't make sense" on an ARMv7+ with SMP. That code dates back to
> ARMv4, and has been left untouched ever since. "If it ain't broke..."
> 
>> It sounds like from the slides the purpose of #3 might be to get stuff
>> out of the D-cache into the I-cache.  But why is the cache turned off?
> 
> Linux mandates that the kernel is entered with the MMU off, which has
> the effect of disabling the caches too (VIVT caches and all that jazz).
> 
>> And why doesn't Linux use the VA-based flushes rather than the S/W flushes?
> 
> Linux/arm64 does. Changing the 32bit port to use VA CMOs would probably
> break stuff from the late 90s, so that's not going to happen. These
> days, I tend to pick my battles... ;-)

OK, so let me try to state this "forwards" for those of us not familiar
with the situation:

1. Linux expects to start in 'linear' mode, with the MMU disabled.

2. On ARM, disabling the MMU disables caching (!).  But disabling
caching doesn't flush the cache; it just means the cache is bypassed (!).

3. Which means for Linux on ARM, after unzipping the kernel image, you
need to flush the cache before disabling the MMU and starting Linux proper.

4. For historical reasons, 32-bit ARM Linux uses the S/W instructions to
flush the cache.  This still works on 32-bit hardware, and so the Linux
maintainers are loath to change it, even though more reliable VA-based
instructions are available (?). (A sketch of such a S/W walk follows below.)

5. For 64-bit hardware, the S/W instructions don't affect the L3 cache
[1] (?!).  So for a 32-bit guest on a 64-bit host, the above is entirely broken.

6. Rather than fix this in Linux, KVM has added a work-around in which
the *hypervisor* flushes the caches at certain points (!!!).  Julien is
looking into doing the same with Xen.

Is that about right?

Given the variety of hardware that Linux has to run on, it's hard to
understand why 1) 32-bit ARM Linux couldn't detect whether it would be
appropriate to use VA-based instructions rather than S/W instructions, or
2) there couldn't at least be a Kconfig option to use VA instructions
instead of S/W instructions.
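For concreteness, the S/W walk under discussion looks roughly like this
(a sketch of the architectural ARMv7 clean+invalidate-all-by-set/way
algorithm, not the actual Linux code; LoC handling and error cases are
simplified):

#include <stdint.h>

static void dcache_clean_inv_all_sw(void)
{
    uint32_t clidr, level;

    asm volatile("mrc p15, 1, %0, c0, c0, 1" : "=r" (clidr)); /* CLIDR */

    for ( level = 0; level < 7; level++ )
    {
        uint32_t ctype = (clidr >> (level * 3)) & 7;
        uint32_t ccsidr, lshift, ways, sets, wshift, way, set;

        if ( ctype < 2 )       /* no data or unified cache at this level */
            continue;

        /* Select the level in CSSELR, then read its geometry via CCSIDR. */
        asm volatile("mcr p15, 2, %0, c0, c0, 0" :: "r" (level << 1));
        asm volatile("isb");
        asm volatile("mrc p15, 1, %0, c0, c0, 0" : "=r" (ccsidr));

        lshift = (ccsidr & 7) + 4;                 /* log2(line size) */
        ways   = ((ccsidr >> 3) & 0x3ff) + 1;
        sets   = ((ccsidr >> 13) & 0x7fff) + 1;
        wshift = (ways > 1) ? __builtin_clz(ways - 1) : 0;

        for ( way = 0; way < ways; way++ )
            for ( set = 0; set < sets; set++ )
            {
                uint32_t sw = (way << wshift) | (set << lshift) |
                              (level << 1);
                /* DCCISW: clean+invalidate this line by set/way. */
                asm volatile("mcr p15, 0, %0, c7, c14, 2"
                             :: "r" (sw) : "memory");
            }
    }

    asm volatile("dsb\n\tisb" ::: "memory");
}

Nothing in this loop is virtualizable in a useful way: if the vCPU
migrates between two iterations, lines already "cleaned" may have moved
to another pCPU's cache.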

 -George

[1]
https://events.linuxfoundation.org/sites/events/files/slides/slides_10.pdf,
slide 9


Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-07 Thread George Dunlap
On 12/07/2017 04:04 PM, Julien Grall wrote:
> Hi Jan,
> 
> On 07/12/17 15:45, Jan Beulich wrote:
> On 07.12.17 at 15:53,  wrote:
>>> On 07/12/17 13:52, Julien Grall wrote:
>>> There is exactly one case where set/way makes sense, and that's when
>>> you're the only CPU left in the system, your MMU is off, and you're
>>> about to go down.
>>
>> With this and ...
>>
>>> On top of bypassing the coherency, S/W CMOs do not prevent lines from
>>> migrating from one CPU to another. So you could happily be flushing by
>>> S/W, and still end up with dirty lines in your cache. Success!
>>
>> ... this I wonder what value emulating those insns then has in the first
>> place. Can't you as well simply skip and ignore them, with the same
>> (bad) result?
> 
> The result will be much, much worse. Here is a concrete example with Linux
> Arm 32-bit:
> 
> 1) Cache enabled
> 2) Decompress
> 3) Nuke cache (S/W)
> 4) Cache off
> 5) Access new kernel
> 
> If you skip #3, the decompressed data may not have reached memory, so
> you would access stale data.
> 
> This would effectively mean we don't support Linux Arm 32-bit.

So Marc said that #3 "doesn't make sense", since although it might be
the only cpu on in the system, you're not "about to go down"; but Linux
32-bit is doing that anyway.

It sounds like from the slides the purpose of #3 might be to get stuff
out of the D-cache into the I-cache.  But why is the cache turned off?
And why doesn't Linux use the VA-based flushes rather than the S/W flushes?

 -George



Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-07 Thread Julien Grall

Hi Jan,

On 07/12/17 15:45, Jan Beulich wrote:

On 07.12.17 at 15:53,  wrote:

On 07/12/17 13:52, Julien Grall wrote:
There is exactly one case where set/way makes sense, and that's when
you're the only CPU left in the system, your MMU is off, and you're
about to go down.


With this and ...


On top of bypassing the coherency, S/W CMOs do not prevent lines from
migrating from one CPU to another. So you could happily be flushing by
S/W, and still end up with dirty lines in your cache. Success!


... this I wonder what value emulating those insns then has in the first
place. Can't you as well simply skip and ignore them, with the same
(bad) result?


The result will be much, much worse. Here is a concrete example with Linux 
Arm 32-bit:


1) Cache enabled
2) Decompress
3) Nuke cache (S/W)
4) Cache off
5) Access new kernel

If you skip #3, the decompressed data may not have reached memory, so 
you would access stale data.


This would effectively mean we don't support Linux Arm 32-bit.

Cheers,

--
Julien Grall


Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-07 Thread Jan Beulich
>>> On 07.12.17 at 16:22,  wrote:
> On 07/12/17 09:39, Jan Beulich wrote:
> On 06.12.17 at 18:52,  wrote:
>>> But I think this brings another class of problem. When a
>>> misconfigured entry is accessed, we would need to clean & invalidate the cache
>>> for that region.
>> 
>> Why? (Please remember that I'm an x86 person, so may simply
>> not be aware of extra constraints ARM has.) The data in the
>> cache (if any) doesn't change while the mapping is invalid (unless
>> Xen modifies it, but if there was a coherency problem between
>> Xen and guest accesses, you'd have the issue with hypercalls
>> which you describe later independent of the approach suggested
>> here).
> 
> Caches on Arm are coherent and are controlled by attributes in the 
> page-tables. The coherency is lost if you access a region with different 
> memory attributes.
> 
> To take the hypercall case, we require memory shared with the hypervisor 
> or any other guest to have specific memory attributes. So this will 
> ensure cache coherency. This applies to:
>   - hypercall arguments passed via a pointer to guest memory
>   - memory shared via the grant table mechanism
>   - memory shared with the hypervisor (shared_info, vcpu_info, grant 
> table...).
> 
> Now regarding access by a guest. Even though the entry is 
> "misconfigured" in the guest page-tables, this same physical address may 
> have been mapped in other places (e.g. Xen, other guests...).

But that's not an issue specific to the situation here, i.e. multiple
mappings with different memory attributes would always be a
problem. Hence I assume you have code in place to deal with that.
By retaining the entry contents except for the valid bit (or
something else to allow you to gain control upon access) nothing
should really change for the rest of the hypervisor logic, provided
such entries are not explicitly ignored in any of the logic involved.

Jan



Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-07 Thread Julien Grall

(+ Marc)

@Marc: My Arm cache knowledge is somewhat limited. Feel free to correct 
me if I am wrong.


On 07/12/17 09:39, Jan Beulich wrote:

On 06.12.17 at 18:52,  wrote:

On 12/06/2017 03:15 PM, Jan Beulich wrote:

What we do in x86 is that we flag all entries at the top level as
misconfigured at any time where otherwise we would have to
walk the full tree. Upon access, the misconfigured flag is being
propagated down the page table hierarchy, with only the
intermediate and leaf entries needed for the current access
becoming properly configured again. In your case, as long as
only a limited set of leaf entries are being touched before any
S/W emulation is needed, you'd be able to skip all misconfigured
entries in your traversal, just like with PoD you'd skip
unpopulated ones.


Oh, what you call "misconfigured bits" would be clearing the valid bit
of an entry on Arm. The entry would be considered invalid, but it is
still possible to store information in it (the rest of the bits are ignored
by the hardware).


Well, on x86 we don't always have a separate "valid" bit, hence
we set something else to a value which will cause a suitable VM
exit when being accessed by the guest.


But I think this brings another class of problem. When a
misconfigured entry is accessed, we would need to clean & invalidate the cache
for that region.


Why? (Please remember that I'm an x86 person, so may simply
not be aware of extra constraints ARM has.) The data in the
cache (if any) doesn't change while the mapping is invalid (unless
Xen modifies it, but if there was a coherency problem between
Xen and guest accesses, you'd have the issue with hypercalls
which you describe later independent of the approach suggested
here).


Caches on Arm are coherent and are controlled by attributes in the 
page-tables. The coherency is lost if you access a region with different 
memory attributes.


To take the hypercall case, we require memory shared with the hypervisor 
or any other guest to have specific memory attributes. So this will 
ensure cache coherency. This applies to:

- hypercall arguments passed via a pointer to guest memory
- memory shared via the grant table mechanism
- memory shared with the hypervisor (shared_info, vcpu_info, grant 
table...).


Now regarding access by a guest. Even though the entry is 
"misconfigured" in the guest page-tables, this same physical address may 
have been mapped in other places (e.g. Xen, other guests...). Because of 
speculation, a line could have been pulled into the cache. As we don't know 
the memory attributes used by the guest, you have to clean & invalidate 
that region on a guest access.


Getting back to the hypercall case, I am still trying to figure out if 
we need to clean & invalidate the buffer used when the guest entry is 
"misconfigured". I can't convince myself why this would not be 
necessary. I need to have a more thorough think.
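A sketch of the fault-side maintenance this implies (the dcache helper
is Xen-style; p2m_set_valid() is hypothetical):

/* When lazily validating a previously invalid entry, purge any line
 * that may have been speculated in under different attributes. */
static void p2m_validate_and_maintain(struct p2m_domain *p2m,
                                      gfn_t gfn, mfn_t mfn)
{
    void *va = map_domain_page(mfn);

    clean_and_invalidate_dcache_va_range(va, PAGE_SIZE);
    unmap_domain_page(va);

    p2m_set_valid(p2m, gfn);    /* now safe to expose to the guest */
}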


Cheers,

--
Julien Grall


Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-07 Thread Marc Zyngier
On 07/12/17 13:52, Julien Grall wrote:
> (+ Marc)
> 
> Hi,
> 
> @Marc: My Arm cache knowledge is somewhat limited. Feel free to correct 
> me if I am wrong.
> 
> Before answering the rest of the e-mail, let me reinforce what I said 
> in my first e-mail. Set/Way ops are very complex to emulate, and an OS using 
> them should never expect good performance in a virtualization context. The 
> difficulty is clearly spelled out in the Arm Arm.

It is actually even worse than that. Software using set/way operations
is simply not virtualizable, full stop. Yes, we paper over it in ugly
ways, but nobody should really use set/way.

There is exactly one case where set/way makes sense, and that's when
you're the only CPU left in the system, your MMU is off, and you're
about to go down.

> So the main goal here is to work around such software.

Quite. Said SW is usually a 32bit Linux kernel.

> 
> On 06/12/17 17:49, George Dunlap wrote:
>> On 12/06/2017 12:58 PM, Julien Grall wrote:
>>> Hi George,
>>>
>>> On 12/06/2017 12:28 PM, George Dunlap wrote:
 On 12/05/2017 06:39 PM, Julien Grall wrote:
> Hi all,
>
> Even though it is an Arm failure, I have CCed x86 folks to get feedback
> on the approach. I have a WIP branch I could share if that interests
> people.
>
> A few months ago, we noticed a heisenbug on jobs run by osstest on the
> cubietrucks (see [1]). From the log, we figured out that the guest vCPU
> 0 is in data/prefetch abort state at early boot. I have been able to
> reproduce it reliably, although from the little information I have I
> think it is related to a cache issue because we don't trap cache
> maintenance instructions by set/way.
>
> This is a set of 3 instructions (clean, clean & invalidate, invalidate)
> working on a given cache level by S/W. Because the OS is not allowed to
> infer the S/W to PA mapping, it can only use S/W to nuke the whole
> cache. "The expected usage of the cache maintenance that operate by
> set/way is associated with powerdown and powerup of caches, if this is
> required by the implementation" (see D3-2020 ARM DDI 0487B.b).
>
> These instructions target a local processor and usually work in
> batches to nuke the cache. This means if the vCPU is migrated to
> another pCPU in the middle of the process, the cache may not be cleaned.
> This would result in data corruption and a potential crash of the OS.

 I don't quite understand the failure mode here: Why does vCPU migration
 cause cache inconsistency in the middle of one of these "cleans", but
 not under normal operation?
>>>
>>> Because they target a specific S/W cache level, whereas other cache
>>> operations work on VAs.
>>>
>>> To make it short, the other VA cache instructions work to the Point of
>>> Coherency/Point of Unification and guarantee that the caches will be
>>> consistent. For more details see B2.2.6 in ARM DDI 0406C.c.
>>
>> I skimmed that section, and I'm not much the wiser.
>>
>> Just to be clear, this is my question.
>>
>> Suppose we have the following sequence of events (where vN[pM] means
>> vcpu N running on pcpu M):
>>
>> Start with A == 0
>>
>> 1. v0[p1] Read A
>>    p1 has 'A==0' in the cache
>> 2. scheduler migrates v0 to p0
>> 3. v0[p0] A=2
>>    p0 has 'A==2' in the cache
>> 4. scheduler migrates v0 to p1
>> 5. v0[p1] Read A
>>
>> Now, I presume that with the guest not doing anything, the Read of A at
>> #5 will end up as '2'; i.e., behind the scenes somewhere, either by Xen
>> or by the hardware, between #1 and #5, p0's version of A gets "cleaned"
>> and p1's version of A gets "invalidated" (to use the terminology from
>> the section mentioned above).
> 
> Caches on Arm are coherent and are controlled by the attributes in the 
> page-tables. Imagine the region is Normal cacheable and inner-shareable: 
> a data synchronization barrier in #4 will ensure the visibility of A 
> to p1. So A will be read as 2.
> 
>>
>> So my question is, how does *adding* cache flushing of any sort end up
>> violating the integrity in a situation like the above?
> 
> Because the integrity is based on the memory attributes in the 
> page-tables. S/W instructions work directly on the cache and will break 
> the coherency. Marc pointed me to his talk [1] that explains caches on Arm 
> and also the set/way problem (see slide 8 onwards).

On top of bypassing the coherency, S/W CMOs do not prevent lines from
migrating from one CPU to another. So you could happily be flushing by
S/W, and still end up with dirty lines in your cache. Success!

At that point, performance is the least of your worries.

> 
>>
> For those worried about the performance impact, I have looked at the
> current use of S/W instructions:
>   - Linux Arm64: The last use in the kernel was at the beginning of 2015
>   - Linux Arm32: Still uses S/W for boot and secondary CPU bring-up. No
> plan to change.

Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-07 Thread Jan Beulich
>>> On 07.12.17 at 14:52,  wrote:
> On 06/12/17 17:49, George Dunlap wrote:
>> Do you want to reset the p2m multiple times?  I thought the goal was
>> simply to keep the amount of p2m space you need to flush to a minimum;
>> if you expect the memory which has been faulted in by the *last* flush
>> to be relatively small, you could just always flush all memory that had
>> been touched to that point.
>> 
>> If you *do* need to go through the p2m multiple times, then
>> misconfiguration is a much better option than PoD.  In PoD, once a page
>> has data on it, it can't be removed from the p2m anymore.  For the
>> misconfiguration technique, you can go through and misconfigure the
>> entries in the top-level p2m table as many times as you want.  The whole
>> reason for doing it on x86 is that it's a relatively lightweight
>> operation: we use it to modify MMIO mappings, to enable or disable
>> logdirty for migrate, 
> 
> Does this also work when you share the page-tables with the IOMMU? It 
> just occurred to me that for both PoD and "misconfigured bits" we would 
> get into trouble because page-tables are shared with the IOMMU.

PoD and IOMMU are incompatible on x86 at present.

The bits we use for "mis-configuring" entries are ignored by the IOMMU,
which is not a problem since all we use this approach for (right now) is
to update the memory type (i.e. cacheability) for possibly huge ranges.

Jan



Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-07 Thread Julien Grall

(+ Marc)

Hi,

@Marc: My Arm cache knowledge is somewhat limited. Feel free to correct 
me if I am wrong.


Before answering the rest of the e-mail, let me reinforce what I said 
in my first e-mail. Set/Way ops are very complex to emulate, and an OS using 
them should never expect good performance in a virtualization context. The 
difficulty is clearly spelled out in the Arm Arm.


So the main goal here is to work around such software.

On 06/12/17 17:49, George Dunlap wrote:

On 12/06/2017 12:58 PM, Julien Grall wrote:

Hi George,

On 12/06/2017 12:28 PM, George Dunlap wrote:

On 12/05/2017 06:39 PM, Julien Grall wrote:

Hi all,

Even though it is an Arm failure, I have CCed x86 folks to get feedback
on the approach. I have a WIP branch I could share if that interests
people.

A few months ago, we noticed a heisenbug on jobs run by osstest on the
cubietrucks (see [1]). From the log, we figured out that the guest vCPU
0 is in data/prefetch abort state at early boot. I have been able to
reproduce it reliably, although from the little information I have I
think it is related to a cache issue because we don't trap cache
maintenance instructions by set/way.

This is a set of 3 instructions (clean, clean & invalidate, invalidate)
working on a given cache level by S/W. Because the OS is not allowed to
infer the S/W to PA mapping, it can only use S/W to nuke the whole
cache. "The expected usage of the cache maintenance that operate by
set/way is associated with powerdown and powerup of caches, if this is
required by the implementation" (see D3-2020 ARM DDI 0487B.b).

These instructions target a local processor and usually work in
batches to nuke the cache. This means if the vCPU is migrated to
another pCPU in the middle of the process, the cache may not be cleaned.
This would result in data corruption and a potential crash of the OS.


I don't quite understand the failure mode here: Why does vCPU migration
cause cache inconsistency in the middle of one of these "cleans", but
not under normal operation?


Because they target a specific S/W cache level, whereas other cache
operations work on VAs.

To make it short, the other VA cache instructions work to the Point of
Coherency/Point of Unification and guarantee that the caches will be
consistent. For more details see B2.2.6 in ARM DDI 0406C.c.


I skimmed that section, and I'm not much the wiser.

Just to be clear, this is my question.

Suppose we have the following sequence of events (where vN[pM] means
vcpu N running on pcpu M):

Start with A == 0

1. v0[p1] Read A
   p1 has 'A==0' in the cache
2. scheduler migrates v0 to p0
3. v0[p0] A=2
   p0 has 'A==2' in the cache
4. scheduler migrates v0 to p1
5. v0[p1] Read A

Now, I presume that with the guest not doing anything, the Read of A at
#5 will end up as '2'; i.e., behind the scenes somewhere, either by Xen
or by the hardware, between #1 and #5, p0's version of A gets "cleaned"
and p1's version of A gets "invalidated" (to use the terminology from
the section mentioned above).


Caches on Arm are coherent and are controlled by the attributes in the 
page-tables. Imagine the region is Normal cacheable and inner-shareable: 
a data synchronization barrier in #4 will ensure the visibility of A 
to p1. So A will be read as 2.
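In code terms the point is about ordering, not maintenance (a sketch
using the Xen-style dsb() macro; the steps mirror George's #3-#5):

/* For Normal, cacheable, inner-shareable memory: */
*a = 2;         /* step 3 on p0: the line may sit dirty in p0's cache */
dsb(ish);       /* step 4, migration path: order the store for every
                 * observer in the inner-shareable domain */
/* step 5 on p1: the read of *a returns 2 -- hardware coherency snoops
 * or migrates the line; no clean/invalidate is needed anywhere. */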




So my question is, how does *adding* cache flushing of any sort end up
violating the integrity in a situation like the above?


Because the integrity is based on the memory attributes in the 
page-tables. S/W instructions work directly on the cache and will break 
the coherency. Marc pointed me to his talk [1] that explains caches on Arm 
and also the set/way problem (see slide 8 onwards).





For those worried about the performance impact, I have looked at the
current use of S/W instructions:
  - Linux Arm64: The last use in the kernel was at the beginning of 2015
  - Linux Arm32: Still uses S/W for boot and secondary CPU bring-up. No
plan to change.
  - UEFI: A couple of uses in UEFI, but I have heard they plan to
remove them (needs confirmation).

I haven't looked at all the OSes. However, given the Arm Arm clearly
states S/W instructions are not easily virtualizable, I would expect
guest OS developers to try their best to limit the use of these
instructions.

To limit the performance impact, we could introduce a guest option to
tell whether the guest will use S/W. If it does plan to use S/W, PoD
will be disabled.

Now regarding the hardware domain. At the moment, it has its RAM direct
mapped. Supporting direct mapping in PoD would be quite a pain for a
limited benefit (see why above). In that case I would suggest imposing
vCPU pinning for the hardware domain if S/W ops are expected to be used.
Again, a command line option could be introduced here.

Any feedback on the approach is welcome.


I still don't entirely understand the underlying failure mode, but there
are a couple of things we could consider:

1. Automatically disabling 'vcpu migration' when caching is turned off.
This wouldn't prevent a vcpu from being preempted, just from being run
somewhere else.

Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-07 Thread Jan Beulich
>>> On 06.12.17 at 18:52,  wrote:
> On 12/06/2017 03:15 PM, Jan Beulich wrote:
>> What we do in x86 is that we flag all entries at the top level as
>> misconfigured at any time where otherwise we would have to
>> walk the full tree. Upon access, the misconfigured flag is being
>> propagated down the page table hierarchy, with only the
>> intermediate and leaf entries needed for the current access
>> becoming properly configured again. In your case, as long as
>> only a limited set of leaf entries are being touched before any
>> S/W emulation is needed, you'd be able to skip all misconfigured
>> entries in your traversal, just like with PoD you'd skip
>> unpopulated ones.
> 
> Oh, what you call "misconfigured bits" would be clearing the valid bit 
> of an entry on Arm. The entry would be considered invalid, but it is 
> still possible to store information in it (the rest of the bits are ignored 
> by the hardware).

Well, on x86 we don't always have a separate "valid" bit, hence
we set something else to a value which will cause a suitable VM
exit when being accessed by the guest.

> But I think this brings another class of problem. When a 
> misconfigured entry is accessed, we would need to clean & invalidate the cache 
> for that region.

Why? (Please remember that I'm an x86 person, so may simply
not be aware of extra constraints ARM has.) The data in the
cache (if any) doesn't change while the mapping is invalid (unless
Xen modifies it, but if there was a coherency problem between
Xen and guest accesses, you'd have the issue with hypercalls
which you describe later independent of the approach suggested
here).

Jan



Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-06 Thread George Dunlap
On 12/06/2017 12:58 PM, Julien Grall wrote:
> Hi George,
> 
> On 12/06/2017 12:28 PM, George Dunlap wrote:
>> On 12/05/2017 06:39 PM, Julien Grall wrote:
>>> Hi all,
>>>
>>> Even though it is an Arm failure, I have CCed x86 folks to get feedback
>>> on the approach. I have a WIP branch I could share if that interests
>>> people.
>>>
>>> A few months ago, we noticed a heisenbug on jobs run by osstest on the
>>> cubietrucks (see [1]). From the log, we figured out that the guest vCPU
>>> 0 is in data/prefetch abort state at early boot. I have been able to
>>> reproduce it reliably, although from the little information I have I
>>> think it is related to a cache issue because we don't trap cache
>>> maintenance instructions by set/way.
>>>
>>> This is a set of 3 instructions (clean, clean & invalidate, invalidate)
>>> working on a given cache level by S/W. Because the OS is not allowed to
>>> infer the S/W to PA mapping, it can only use S/W to nuke the whole
>>> cache. "The expected usage of the cache maintenance that operate by
>>> set/way is associated with powerdown and powerup of caches, if this is
>>> required by the implementation" (see D3-2020 ARM DDI 0487B.b).
>>>
>>> These instructions target a local processor and usually work in
>>> batches to nuke the cache. This means if the vCPU is migrated to
>>> another pCPU in the middle of the process, the cache may not be cleaned.
>>> This would result in data corruption and a potential crash of the OS.
>>
>> I don't quite understand the failure mode here: Why does vCPU migration
>> cause cache inconsistency in the middle of one of these "cleans", but
>> not under normal operation?
> 
> Because they target a specific S/W cache level, whereas other cache
> operations work on VAs.
> 
> To make it short, the other VA cache instructions work to the Point of
> Coherency/Point of Unification and guarantee that the caches will be
> consistent. For more details see B2.2.6 in ARM DDI 0406C.c.

I skimmed that section, and I'm not much the wiser.

Just to be clear, this is my question.

Suppose we have the following sequence of events (where vN[pM] means
vcpu N running on pcpu M):

Start with A == 0

1. v0[p1] Read A
  p1 has 'A==0' in the cache
2. scheduler migrates v0 to p0
3. v0[p0] A=2
  p0 has 'A==2' in the cache
4. scheduler migrates v0 to p1
5. v0[p1] Read A

Now, I presume that with the guest not doing anything, the Read of A at
#5 will end up as '2'; i.e., behind the scenes somewhere, either by Xen
or by the hardware, between #1 and #5, p0's version of A gets "cleaned"
and p1's version of A gets "invalidated" (to use the terminology from
the section mentioned above).

So my question is, how does *adding* cache flushing of any sort end up
violating the integrity in a situation like the above?

>>> For those worried about the performance impact, I have looked at the
>>> current use of S/W instructions:
>>>  - Linux Arm64: The last use in the kernel was at the beginning of 2015
>>>  - Linux Arm32: Still uses S/W for boot and secondary CPU bring-up. No
>>> plan to change.
>>>  - UEFI: A couple of uses in UEFI, but I have heard they plan to
>>> remove them (needs confirmation).
>>>
>>> I haven't looked at all the OSes. However, given the Arm Arm clearly
>>> states S/W instructions are not easily virtualizable, I would expect
>>> guest OS developers to try their best to limit the use of these
>>> instructions.
>>>
>>> To limit the performance impact, we could introduce a guest option to
>>> tell whether the guest will use S/W. If it does plan to use S/W, PoD
>>> will be disabled.
>>>
>>> Now regarding the hardware domain. At the moment, it has its RAM direct
>>> mapped. Supporting direct mapping in PoD would be quite a pain for a
>>> limited benefit (see why above). In that case I would suggest imposing
>>> vCPU pinning for the hardware domain if S/W ops are expected to be used.
>>> Again, a command line option could be introduced here.
>>>
>>> Any feedback on the approach is welcome.
>>
>> I still don't entirely understand the underlying failure mode, but there
>> are a couple of things we could consider:
>>
>> 1. Automatically disabling 'vcpu migration' when caching is turned off.
>> This wouldn't prevent a vcpu from being preempted, just from being run
>> somewhere else.
> 
> This suggests the guest will directly perform S/W, right? So you leave
> the guest the possibility to flush all caches the vCPU can access.
> This is an easy way for the guest to affect the cache entries of other guests.
> 
> I think this would enable some potential data attacks.

Well, it's the equivalent of your "imposing vcpu pinning" solution
above, but only temporary.  Was that suggestion meant to allow the
hardware domain to directly perform S/W?

>> 2. It sounds like rather than using PoD, you could use the
>> "misconfigured p2m table" technique that x86 uses: set bits in the p2m
>> entry which cause a specific kind of HAP fault when accessed.  The fault
>> 

Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-06 Thread Julien Grall



On 12/06/2017 03:24 PM, George Dunlap wrote:

On 12/06/2017 03:19 PM, Julien Grall wrote:

Hi Konrad,

On 12/06/2017 03:10 PM, Konrad Rzeszutek Wilk wrote:

.snip..

The suggested policy is based on the KVM one:
 - If we trap an S/W instruction, we enable VM trapping (e.g.
HCR_EL2.TVM) to detect the cache being turned on/off, and do a full clean.
 - We flush the caches both when they are turned on and when they are
turned off.
 - Once the caches are enabled, we stop trapping VM instructions.

Doing a full clean requires going through the P2M and flushing the entries
one by one. At the moment, all the memory is mapped. As you can imagine,
flushing a guest with hundreds of MB will take a very long time (Linux
times out during CPU bring-up).


Yikes. Since you mention 'based on the KVM one' - did they solve this
particular
problem or do they also have the same issue?


KVM is using populate on demand by default.


If I understand properly, it's probably more accurate to say that KVM
uses "allocate on demand".  The complicated part of populate-on-demand
is the fact that it's not allowed to allocate anything.


Hmmm yes. You are right on the wording.

Cheers,

--
Julien Grall


Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-06 Thread George Dunlap
On 12/06/2017 03:19 PM, Julien Grall wrote:
> Hi Konrad,
> 
> On 12/06/2017 03:10 PM, Konrad Rzeszutek Wilk wrote:
>> .snip..
>>> The suggested policy is based on the KVM one:
>>> - If we trap an S/W instruction, we enable VM trapping (e.g.
>>> HCR_EL2.TVM) to detect the cache being turned on/off, and do a full clean.
>>> - We flush the caches both when they are turned on and when they are
>>> turned off.
>>> - Once the caches are enabled, we stop trapping VM instructions.
>>>
>>> Doing a full clean requires going through the P2M and flushing the entries
>>> one by one. At the moment, all the memory is mapped. As you can imagine,
>>> flushing a guest with hundreds of MB will take a very long time (Linux
>>> times out during CPU bring-up).
>>
>> Yikes. Since you mention 'based on the KVM one' - did they solve this
>> particular
>> problem or do they also have the same issue?
> 
> KVM is using populate on demand by default.

If I understand properly, it's probably more accurate to say that KVM
uses "allocate on demand".  The complicated part of populate-on-demand
is the fact that it's not allowed to allocate anything.

 -George


Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-06 Thread Julien Grall

Hi Konrad,

On 12/06/2017 03:10 PM, Konrad Rzeszutek Wilk wrote:

.snip..

The suggested policy is based on the KVM one:
- If we trap an S/W instruction, we enable VM trapping (e.g. 
HCR_EL2.TVM) to detect the cache being turned on/off, and do a full clean.
- We flush the caches both when they are turned on and when they are
turned off.
- Once the caches are enabled, we stop trapping VM instructions.

Doing a full clean requires going through the P2M and flushing the entries
one by one. At the moment, all the memory is mapped. As you can imagine,
flushing a guest with hundreds of MB will take a very long time (Linux
times out during CPU bring-up).


Yikes. Since you mention 'based on the KVM one' - did they solve this particular
problem or do they also have the same issue?


KVM is using populate on demand by default.
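For reference, the trap-and-flush policy quoted above would look roughly
like this (a sketch with hypothetical names; the real logic lives in
KVM's arch/arm code and would be adapted for Xen):

/* An S/W instruction trapped: flush everything and start watching for
 * the guest turning its caches on or off. */
void handle_sw_cmo_trap(struct vcpu *v)
{
    if ( !v->arch.trapping_vm_regs )
    {
        enable_trap(v, HCR_TVM);        /* trap SCTLR writes from now on */
        v->arch.trapping_vm_regs = true;
    }
    flush_guest_ram(v->domain);         /* the big hammer */
}

/* A trapped write to SCTLR while the workaround is active. */
void handle_sctlr_write(struct vcpu *v, register_t val)
{
    bool caches_on = val & SCTLR_C;     /* the cache-enable bit */

    if ( caches_on != v->arch.caches_on )
        flush_guest_ram(v->domain);     /* flush on every on/off toggle */
    v->arch.caches_on = caches_on;

    if ( caches_on )                    /* caches enabled again: done */
    {
        disable_trap(v, HCR_TVM);
        v->arch.trapping_vm_regs = false;
    }
}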

Cheers,

--
Julien Grall


Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-06 Thread Jan Beulich
>>> On 06.12.17 at 13:58,  wrote:
> On 12/06/2017 12:28 PM, George Dunlap wrote:
>> 2. It sounds like rather than using PoD, you could use the
>> "misconfigured p2m table" technique that x86 uses: set bits in the p2m
>> entry which cause a specific kind of HAP fault when accessed.  The fault
>> handler then looks in the p2m entry, and if it finds an otherwise valid
>> entry, it just fixes the "misconfigured" bits and continues.
> 
> I thought about this. But when do you set the entry to misconfigured?

What we do in x86 is that we flag all entries at the top level as
misconfigured at any time where otherwise we would have to
walk the full tree. Upon access, the misconfigured flag is being
propagated down the page table hierarchy, with only the
intermediate and leaf entries needed for the current access
becoming properly configured again. In your case, as long as
only a limited set of leaf entries are being touched before any
S/W emulation is needed, you'd be able to skip all misconfigured
entries in your traversal, just like with PoD you'd skip
unpopulated ones.
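In sketch form (hypothetical helpers; on x86 the real entry point for
this is ept_handle_misconfig() and friends):

/* Invalidate cheaply: only the top-level entries are touched. */
static void p2m_mark_all_misconfigured(struct p2m_domain *p2m)
{
    unsigned int i;

    for ( i = 0; i < P2M_TOP_ENTRIES; i++ )
        set_misconfigured(&p2m->top[i]);    /* whole subtree now stale */
}

/* Misconfig fault: repair only the path down to the faulting gfn,
 * pushing the "stale" flag one level down as we go. */
static void p2m_fix_misconfig_path(struct p2m_domain *p2m, gfn_t gfn)
{
    pte_t *e = &p2m->top[p2m_index(gfn, P2M_TOP_LEVEL)];
    unsigned int lvl;

    for ( lvl = P2M_TOP_LEVEL; lvl > 0; lvl-- )
    {
        if ( is_misconfigured(e) )
        {
            mark_children_misconfigured(e); /* defer the rest of the subtree */
            reconfigure(e);                 /* recompute this entry */
        }
        e = entry_at_next_level(e, gfn, lvl);
    }

    reconfigure(e);                         /* finally fix the leaf */
}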

> If you take the example of 32-bit Linux, there are a couple of full 
> cache cleans during the boot of a uniprocessor system. So you would need to go 
> through the p2m multiple times and reset the access bits.

The proposed mechanism isn't really similar to traditional accessed
bit handling. If there is no other use for the accessed bit (assuming
there is one in ARM PTEs in the first place), and as long as the bit
being clear gives you some sort of signal (on x86 this and the dirty
bit are being updated by hardware, as kind of a side effect of a
page table walk), it could of course be used for the purpose here.

Jan
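
A sketch of that propagation on a generic p2m walk; every name below is
illustrative rather than the actual x86 implementation:

static void p2m_fix_misconfig(struct p2m *p2m, gfn_t gfn)
{
    pte_t *table = p2m->root;
    unsigned int level;

    for ( level = 0; level < P2M_LEVELS; level++ )
    {
        pte_t *entry = &table[gfn_index(gfn, level)];

        if ( pte_is_misconfigured(*entry) )
        {
            /* The entry is otherwise valid: repair it here... */
            pte_clear_misconfig(entry);
            /*
             * ...and push the flag one level down, so only the path
             * actually used by this access becomes properly configured.
             */
            if ( !pte_is_leaf(*entry, level) )
                table_mark_all_misconfig(pte_to_table(*entry));
        }

        if ( pte_is_leaf(*entry, level) )
            return;
        table = pte_to_table(*entry);
    }
}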


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-06 Thread Konrad Rzeszutek Wilk
.snip..
> The suggested policy is based on the KVM one:
>   - If we trap an S/W instruction, we enable VM trapping (e.g. 
> HCR_EL2.TVM) to
> detect the caches being turned on/off, and do a full clean.
>   - We flush the caches both when they are turned on and when they are 
> turned off.
>   - Once the caches are enabled, we stop trapping VM instructions.
> 
> Doing a full clean will require going through the P2M and flushing the entries
> one by one. At the moment, all the memory is mapped. As you can imagine,
> flushing a guest with hundreds of MB will take a very long time (Linux times out
> during CPU bring-up).

Yikes. Since you mention 'based on the KVM one' - did they solve this particular
problem or do they also have the same issue?

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-06 Thread Julien Grall



On 12/06/2017 12:58 PM, Julien Grall wrote:

Hi George,

On 12/06/2017 12:28 PM, George Dunlap wrote:

On 12/05/2017 06:39 PM, Julien Grall wrote:

Hi all,

Even though it is an Arm failure, I have CCed x86 folks to get feedback
on the approach. I have a WIP branch I could share if that interests 
people.


A few months ago, we noticed a heisenbug on jobs run by osstest on the
cubietrucks (see [1]). From the log, we figured out that the guest vCPU
0 is in data/prefetch abort state at early boot. I have been able to
reproduce it reliably, although from the little information I have I
think it is related to a cache issue because we don't trap cache
maintenance instructions by set/way.

This is a set of 3 instructions (clean, clean & invalidate, invalidate)
operating on a given cache level by S/W. Because the OS is not allowed to
infer the S/W to PA mapping, it can only use S/W to nuke the whole
cache. "The expected usage of the cache maintenance that operate by
set/way is associated with powerdown and powerup of caches, if this is
required by the implementation" (see D3-2020 ARM DDI 0487B.b).

These instructions target the local processor and are usually issued in
a batch to nuke the whole cache. This means that if the vCPU is migrated to
another pCPU in the middle of the process, the cache may not be fully cleaned.
This would result in data corruption and a potential crash of the OS.


I don't quite understand the failure mode here: Why does vCPU migration
cause cache inconsistency in the middle of one of these "cleans", but
not under normal operation?


Because they target a specific cache level by S/W, whereas the other cache 
operations work on VAs.


To make it short, the other VA cache instructions work to the Point of 
Coherency/Point of Unification and guarantee that the caches will be 
consistent. For more details see B2.2.6 in ARM DDI 0406C.c.





For those worried about the performance impact, I have looked at the
current use of S/W instructions:
 - Linux Arm64: The last use in the kernel was at the beginning of 2015.
 - Linux Arm32: Still uses S/W for boot and secondary CPU bring-up. 
No plan to change.
 - UEFI: A couple of uses in UEFI, but I have heard they plan to
remove them (needs confirmation).

I haven't looked at all the OSes. However, given that the Arm ARM clearly
states S/W instructions are not easily virtualizable, I would expect
guest OS developers to try their best to limit the use of these
instructions.

To limit the performance impact, we could introduce a guest option to
tell whether the guest will use S/W. If it does plan to use S/W, PoD
will be disabled.

Now regarding the hardware domain: at the moment, it has its RAM direct
mapped. Supporting direct mapping in PoD will be quite a pain for
limited benefit (see why above). In that case I would suggest imposing
vCPU pinning for the hardware domain if S/W instructions are expected to be used.
Again, a command line option could be introduced here.

Any feedback on the approach will be welcome.


I still don't entirely understand the underlying failure mode, but there
are a couple of things we could consider:

1. Automatically disabling 'vcpu migration' when caching is turned off.
This wouldn't prevent a vcpu from being preempted, just from being run
somewhere else.


This suggests the guest will directly perform S/W, right? So you leave 
the guest the possibility to flush all the caches its vCPUs can access. 
This is an easy way for the guest to affect the cache entries of other guests.


I think this could aid potential data attacks.



2. It sounds like rather than using PoD, you could use the
"misconfigured p2m table" technique that x86 uses: set bits in the p2m
entry which cause a specific kind of HAP fault when accessed.  The fault
handler then looks in the p2m entry, and if it finds an otherwise valid
entry, it just fixes the "misconfigured" bits and continues.


I thought about this. But when do you set the entry to misconfigured?

If you take the example of 32-bit Linux, there are a couple of full 
cache cleans during the boot of a uniprocessor system. So you would need to go 
through the p2m multiple times and reset the access bits.


To complete here: I agree that using PoD to emulate S/W is not great. 
But after looking at all the other solutions, it was the only one that 
could provide better isolation of the guests with some decent 
performance.


--
Julien Grall

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-06 Thread Julien Grall

Hi George,

On 12/06/2017 12:28 PM, George Dunlap wrote:

On 12/05/2017 06:39 PM, Julien Grall wrote:

Hi all,

Even though it is an Arm failure, I have CCed x86 folks to get feedback
on the approach. I have a WIP branch I could share if that interests people.

A few months ago, we noticed a heisenbug on jobs run by osstest on the
cubietrucks (see [1]). From the log, we figured out that the guest vCPU
0 is in data/prefetch abort state at early boot. I have been able to
reproduce it reliably, although from the little information I have I
think it is related to a cache issue because we don't trap cache
maintenance instructions by set/way.

This is a set of 3 instructions (clean, clean & invalidate, invalidate)
operating on a given cache level by S/W. Because the OS is not allowed to
infer the S/W to PA mapping, it can only use S/W to nuke the whole
cache. "The expected usage of the cache maintenance that operate by
set/way is associated with powerdown and powerup of caches, if this is
required by the implementation" (see D3-2020 ARM DDI 0487B.b).

These instructions target the local processor and are usually issued in
a batch to nuke the whole cache. This means that if the vCPU is migrated to
another pCPU in the middle of the process, the cache may not be fully cleaned.
This would result in data corruption and a potential crash of the OS.


I don't quite understand the failure mode here: Why does vCPU migration
cause cache inconsistency in the middle of one of these "cleans", but
not under normal operation?


Because they target a specific cache level by S/W, whereas the other cache 
operations work on VAs.


To make it short, the other VA cache instructions work to the Point of 
Coherency/Point of Unification and guarantee that the caches will be 
consistent. For more details see B2.2.6 in ARM DDI 0406C.c.





For those worried about the performance impact, I have looked at the
current use of S/W instructions:
 - Linux Arm64: The last use in the kernel was at the beginning of 2015.
 - Linux Arm32: Still uses S/W for boot and secondary CPU bring-up. No
plan to change.
 - UEFI: A couple of uses in UEFI, but I have heard they plan to
remove them (needs confirmation).

I haven't looked at all the OSes. However, given that the Arm ARM clearly
states S/W instructions are not easily virtualizable, I would expect
guest OS developers to try their best to limit the use of these
instructions.

To limit the performance impact, we could introduce a guest option to
tell whether the guest will use S/W. If it does plan to use S/W, PoD
will be disabled.

Now regarding the hardware domain: at the moment, it has its RAM direct
mapped. Supporting direct mapping in PoD will be quite a pain for
limited benefit (see why above). In that case I would suggest imposing
vCPU pinning for the hardware domain if S/W instructions are expected to be used.
Again, a command line option could be introduced here.

Any feedback on the approach will be welcome.


I still don't entirely understand the underlying failure mode, but there
are a couple of things we could consider:

1. Automatically disabling 'vcpu migration' when caching is turned off.
This wouldn't prevent a vcpu from being preempted, just from being run
somewhere else.


This suggests the guest will directly perform S/W, right? So you leave 
the guest the possibility to flush all the caches its vCPUs can access. 
This is an easy way for the guest to affect the cache entries of other guests.


I think this could aid potential data attacks.



2. It sounds like rather than using PoD, you could use the
"misconfigured p2m table" technique that x86 uses: set bits in the p2m
entry which cause a specific kind of HAP fault when accessed.  The fault
handler then looks in the p2m entry, and if it finds an otherwise valid
entry, it just fixes the "misconfigured" bits and continues.


I thought about this. But when do you set the entry to misconfigured?

If you take the example of 32-bit Linux, there are a couple of full 
cache cleans during the boot of a uniprocessor system. So you would need to go 
through the p2m multiple times and reset the access bits.


Cheers,

--
Julien Grall
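
To illustrate why S/W ops are per-CPU and batch-oriented, here is a
simplified AArch64 version of the classic clean-and-invalidate-by-set/way
loop (compare the example code in the Arm ARM); it covers a single cache
level only and omits the CLIDR walk:

static void clean_inv_dcache_level_sw(unsigned int level)
{
    uint64_t ccsidr;
    unsigned int log2_linesz, ways, sets, way, set, way_shift;

    /* Select the data/unified cache at this level, then read its geometry. */
    asm volatile("msr csselr_el1, %0; isb" :: "r" ((uint64_t)level << 1));
    asm volatile("mrs %0, ccsidr_el1" : "=r" (ccsidr));

    log2_linesz = (ccsidr & 0x7) + 4;          /* log2(bytes per line) */
    ways = ((ccsidr >> 3) & 0x3ff) + 1;
    sets = ((ccsidr >> 13) & 0x7fff) + 1;
    way_shift = ways > 1 ? __builtin_clz(ways - 1) : 32;

    /*
     * The loop only acts on the caches of the CPU executing it: if the
     * vCPU running this is migrated mid-loop, lines left on the old
     * pCPU stay dirty.
     */
    asm volatile("dsb sy");
    for ( way = 0; way < ways; way++ )
        for ( set = 0; set < sets; set++ )
            asm volatile("dc cisw, %0" ::
                         "r" (((uint64_t)way << way_shift) |
                              ((uint64_t)set << log2_linesz) |
                              (uint64_t)(level << 1)));
    asm volatile("dsb sy; isb");
}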

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-06 Thread George Dunlap
On 12/05/2017 06:39 PM, Julien Grall wrote:
> Hi all,
> 
> Even though it is an Arm failure, I have CCed x86 folks to get feedback
> on the approach. I have a WIP branch I could share if that interests people.
> 
> A few months ago, we noticed a heisenbug on jobs run by osstest on the
> cubietrucks (see [1]). From the log, we figured out that the guest vCPU
> 0 is in data/prefetch abort state at early boot. I have been able to
> reproduce it reliably, although from the little information I have I
> think it is related to a cache issue because we don't trap cache
> maintenance instructions by set/way.
> 
> This is a set of 3 instructions (clean, clean & invalidate, invalidate)
> operating on a given cache level by S/W. Because the OS is not allowed to
> infer the S/W to PA mapping, it can only use S/W to nuke the whole
> cache. "The expected usage of the cache maintenance that operate by
> set/way is associated with powerdown and powerup of caches, if this is
> required by the implementation" (see D3-2020 ARM DDI 0487B.b).
> 
> These instructions target the local processor and are usually issued in
> a batch to nuke the whole cache. This means that if the vCPU is migrated to
> another pCPU in the middle of the process, the cache may not be fully cleaned.
> This would result in data corruption and a potential crash of the OS.

I don't quite understand the failure mode here: Why does vCPU migration
cause cache inconsistency in the middle of one of these "cleans", but
not under normal operation?

> For those worried about the performance impact, I have looked at the
> current use of S/W instructions:
> - Linux Arm64: The last use in the kernel was at the beginning of 2015.
> - Linux Arm32: Still uses S/W for boot and secondary CPU bring-up. No
> plan to change.
> - UEFI: A couple of uses in UEFI, but I have heard they plan to
> remove them (needs confirmation).
> 
> I haven't looked at all the OSes. However, given that the Arm ARM clearly
> states S/W instructions are not easily virtualizable, I would expect
> guest OS developers to try their best to limit the use of these
> instructions.
> 
> To limit the performance impact, we could introduce a guest option to
> tell whether the guest will use S/W. If it does plan to use S/W, PoD
> will be disabled.
> 
> Now regarding the hardware domain: at the moment, it has its RAM direct
> mapped. Supporting direct mapping in PoD will be quite a pain for
> limited benefit (see why above). In that case I would suggest imposing
> vCPU pinning for the hardware domain if S/W instructions are expected to be used.
> Again, a command line option could be introduced here.
> 
> Any feedback on the approach will be welcome.

I still don't entirely understand the underlying failure mode, but there
are a couple of things we could consider:

1. Automatically disabling 'vcpu migration' when caching is turned off.
This wouldn't prevent a vcpu from being preempted, just from being run
somewhere else.

2. It sounds like rather than using PoD, you could use the
"misconfigured p2m table" technique that x86 uses: set bits in the p2m
entry which cause a specific kind of HAP fault when accessed.  The fault
handler then looks in the p2m entry, and if it finds an otherwise valid
entry, it just fixes the "misconfigured" bits and continues.

 -George
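
A sketch of how suggestion 1 could hook into a trapped SCTLR write.
SCTLR_C and smp_processor_id() exist in Xen; the pinning helpers and the
handler name are made up for illustration:

static void vcpu_sctlr_write_trap(struct vcpu *v, register_t sctlr)
{
    if ( !(sctlr & SCTLR_C) )
        /*
         * Caches off: stick to the current pCPU.  The vCPU can still
         * be preempted, it just cannot be run somewhere else.
         */
        vcpu_pin_temporary(v, smp_processor_id());
    else
        vcpu_unpin_temporary(v);
}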

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-06 Thread Julien Grall

Hi Jan,

On 12/06/2017 09:15 AM, Jan Beulich wrote:

On 05.12.17 at 19:39,  wrote:

The suggested policy is based on the KVM one:
- If we trap an S/W instruction, we enable VM trapping (e.g.
HCR_EL2.TVM) to detect the caches being turned on/off, and do a full clean.
- We flush the caches both when they are turned on and when they are turned off.
- Once the caches are enabled, we stop trapping VM instructions.

Doing a full clean will require going through the P2M and flushing the
entries one by one. At the moment, all the memory is mapped. As you can
imagine, flushing a guest with hundreds of MB will take a very long time
(Linux times out during CPU bring-up).

Therefore, we need a way to limit the number of entries we need to
flush. The suggested solution here is to introduce Populate On Demand
(PoD) on Arm.

The guest would boot with no RAM mapped in the stage-2 page-table. At every
prefetch/data abort, the RAM would be mapped, preferably using 2MB chunks
or 4KB. This means that by the time S/W is used, the number of entries
mapped would be very limited. However, for safety, the flush should be
preemptible.


For my own understanding: Here you suggest to use PoD in order
to deal with S/W insn interception.


That's right. PoD would limit the number of entries to flush.




To limit the performance impact, we could introduce a guest option to
tell whether the guest will use S/W. If it does plan to use S/W, PoD
will be disabled.


Therefore I'm wondering if here you mean "If it doesn't plan to ..."


Whoops. I meant "If it doesn't plan".



Independent of this I'm pretty unclear about your conclusion that
there will be only a very limited number of P2M entries at the time
S/W insns would be used by the guest. Are you ignoring potentially
malicious guests for the moment? Otoh you admit that things would
need to be preemptible, so perhaps the argumentation is that you
simply expect well behaved guests to only have such limited amount
of P2M entries.


The preemption is there to cover malicious guests and any well-behaved 
guest use cases I missed. But TBH, the latter would be a call for the OS 
to be reworked, as fast emulation of S/W will be really difficult.




Am I, btw, understanding correctly that other than on x86 you
intend PoD to not be used for maxmem > memory scenarios, at
least for the time being?


Yes. I don't think it would be difficult to add that support for Arm as 
well.


Also, at the moment, the PoD code is nearly a verbatim copy of the x86 
version, differing only in its interface with the rest of the p2m code. 
I am planning to discuss on the ML the possibility of sharing the PoD code.


--
Julien Grall
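
For reference, a sketch of what the preemptible full clean could look like.
map_domain_page(), clean_and_invalidate_dcache_va_range() and
hypercall_preempt_check() exist in Xen, while the lookup helper and the
next_gfn_to_flush field are illustrative:

static int p2m_flush_guest_ram(struct domain *d)
{
    struct p2m_domain *p2m = p2m_get_hostp2m(d);
    unsigned long gfn;

    for ( gfn = p2m->next_gfn_to_flush; gfn < p2m->max_mapped_gfn; gfn++ )
    {
        mfn_t mfn = p2m_gfn_to_mfn(d, gfn); /* simplified lookup */

        if ( mfn_valid(mfn) )
        {
            void *va = map_domain_page(mfn); /* the domheap is not mapped */

            clean_and_invalidate_dcache_va_range(va, PAGE_SIZE);
            unmap_domain_page(va);
        }

        /* Periodically offer to give the pCPU back. */
        if ( !(gfn & 0x3ff) && hypercall_preempt_check() )
        {
            p2m->next_gfn_to_flush = gfn + 1; /* resume from here later */
            return -ERESTART;
        }
    }

    p2m->next_gfn_to_flush = 0;
    return 0;
}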

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-06 Thread Jan Beulich
>>> On 05.12.17 at 19:39,  wrote:
> The suggested policy is based on the KVM one:
>   - If we trap an S/W instruction, we enable VM trapping (e.g. 
> HCR_EL2.TVM) to detect the caches being turned on/off, and do a full clean.
>   - We flush the caches both when they are turned on and when they are turned off.
>   - Once the caches are enabled, we stop trapping VM instructions.
> 
> Doing a full clean will require going through the P2M and flushing the 
> entries one by one. At the moment, all the memory is mapped. As you can 
> imagine, flushing a guest with hundreds of MB will take a very long time 
> (Linux times out during CPU bring-up).
> 
> Therefore, we need a way to limit the number of entries we need to 
> flush. The suggested solution here is to introduce Populate On Demand 
> (PoD) on Arm.
> 
> The guest would boot with no RAM mapped in the stage-2 page-table. At every 
> prefetch/data abort, the RAM would be mapped, preferably using 2MB chunks 
> or 4KB. This means that by the time S/W is used, the number of entries 
> mapped would be very limited. However, for safety, the flush should be 
> preemptible.

For my own understanding: Here you suggest to use PoD in order
to deal with S/W insn interception.

> To limit the performance impact, we could introduce a guest option to 
> tell whether the guest will use S/W. If it does plan to use S/W, PoD 
> will be disabled.

Therefore I'm wondering if here you mean "If it doesn't plan to ..."

Independent of this I'm pretty unclear about your conclusion that
there will be only a very limited number of P2M entries at the time
S/W insns would be used by the guest. Are you ignoring potentially
malicious guests for the moment? Otoh you admit that things would
need to be preemptible, so perhaps the argumentation is that you
simply expect well behaved guests to only have such limited amount
of P2M entries.

Am I, btw, understanding correctly that other than on x86 you
intend PoD to not be used for maxmem > memory scenarios, at
least for the time being?

Jan


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-05 Thread Julien Grall



On 05/12/2017 22:35, Stefano Stabellini wrote:

On Tue, 5 Dec 2017, Julien Grall wrote:

Hi all,

Even though it is an Arm failure, I have CCed x86 folks to get feedback on the
approach. I have a WIP branch I could share if that interests people.

A few months ago, we noticed a heisenbug on jobs run by osstest on the
cubietrucks (see [1]). From the log, we figured out that the guest vCPU 0 is
in data/prefetch abort state at early boot. I have been able to reproduce it
reliably, although from the little information I have I think it is related to
a cache issue because we don't trap cache maintenance instructions by set/way.

This is a set of 3 instructions (clean, clean & invalidate, invalidate)
operating on a given cache level by S/W. Because the OS is not allowed to infer
the S/W to PA mapping, it can only use S/W to nuke the whole cache. "The
expected usage of the cache maintenance that operate by set/way is associated
with powerdown and powerup of caches, if this is required by the
implementation" (see D3-2020 ARM DDI 0487B.b).

These instructions target the local processor and are usually issued in a batch
to nuke the whole cache. This means that if the vCPU is migrated to another pCPU in
the middle of the process, the cache may not be fully cleaned. This would result in
data corruption and a potential crash of the OS.

Thankfully, the Arm architecture offers a way to trap all the cache
maintenance instructions by S/W (e.g. HCR_EL2.TSW). Xen will need to set that
bit and handle S/W.

The major question now is how to handle them. S/W instructions are difficult
to virtualize (see ARMv7 ARM B1.14.4).

The suggested policy is based on the KVM one:
- If we trap an S/W instruction, we enable VM trapping (e.g.
HCR_EL2.TVM) to detect the caches being turned on/off, and do a full clean.
- We flush the caches both when they are turned on and when they are turned off.
- Once the caches are enabled, we stop trapping VM instructions.

Doing a full clean will require going through the P2M and flushing the entries
one by one. At the moment, all the memory is mapped. As you can imagine,
flushing a guest with hundreds of MB will take a very long time (Linux times out
during CPU bring-up).

Therefore, we need a way to limit the number of entries we need to flush. The
suggested solution here is to introduce Populate On Demand (PoD) on Arm.

The guest would boot with no RAM mapped in the stage-2 page-table. At every
prefetch/data abort, the RAM would be mapped, preferably using 2MB chunks or
4KB. This means that by the time S/W is used, the number of entries mapped
would be very limited. However, for safety, the flush should be preemptible.

For those worried about the performance impact, I have looked at the
current use of S/W instructions:
- Linux Arm64: The last use in the kernel was at the beginning of 2015.
- Linux Arm32: Still uses S/W for boot and secondary CPU bring-up. No
plan to change.
- UEFI: A couple of uses in UEFI, but I have heard they plan to remove
them (needs confirmation).

I haven't looked at all the OSes. However, given that the Arm ARM clearly states S/W
instructions are not easily virtualizable, I would expect guest OS
developers to try their best to limit the use of these instructions.

To limit the performance impact, we could introduce a guest option to tell
whether the guest will use S/W. If it does plan to use S/W, PoD will be
disabled.

Now regarding the hardware domain: at the moment, it has its RAM direct
mapped. Supporting direct mapping in PoD will be quite a pain for limited
benefit (see why above). In that case I would suggest imposing vCPU pinning
for the hardware domain if S/W instructions are expected to be used. Again, a command
line option could be introduced here.

Any feedback on the approach will be welcome.


Could we pin the hwdom vcpus only at boot time, until all S/W operations
are issued, then "release" them? If we can detect the last expected S/W
operation with some sort of heuristic.


Feel free to suggest a way; I haven't found one. But to be honest, you 
have seen how much people care about 32-bit hwdom today. So I would not 
spend too much time thinking about optimizing it.




Given the information provided above, would it make sense to consider
avoiding PoD for arm64 kernel direct boots?


Please suggest a way to tell that an arm64 kernel direct boot will not use 
S/W. I don't see any.


The only solution I can see is to provide a configuration option at 
boot time as I suggested a bit above:


"To limit the performance impact, we could introduce a guest option to 
tell whether the guest will use S/W. If it does plan to use S/W, PoD 
will be disabled."


But at this stage, my concern is fixing a blatant bug in Xen; 
performance is a second step.


Cheers,

--
Julien Grall
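
Purely to illustrate the shape of such an option: a per-guest flag
consumed when the p2m is set up. Neither the flag nor the field below
exists in Xen today; both are hypothetical:

/* Hypothetical per-domain creation flag: "this guest will use S/W". */
static void p2m_set_sw_policy(struct domain *d, bool sw_ops_expected)
{
    /*
     * A guest that declares S/W usage gives up PoD; the full clean done
     * when emulating S/W then has to be bounded by other means, such as
     * vCPU pinning.
     */
    d->arch.p2m.pod_enabled = !sw_ops_expected;
}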

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

2017-12-05 Thread Stefano Stabellini
On Tue, 5 Dec 2017, Julien Grall wrote:
> Hi all,
> 
> Even though it is an Arm failure, I have CCed x86 folks to get feedback on the
> approach. I have a WIP branch I could share if that interests people.
> 
> A few months ago, we noticed a heisenbug on jobs run by osstest on the
> cubietrucks (see [1]). From the log, we figured out that the guest vCPU 0 is
> in data/prefetch abort state at early boot. I have been able to reproduce it
> reliably, although from the little information I have I think it is related to
> a cache issue because we don't trap cache maintenance instructions by set/way.
> 
> This is a set of 3 instructions (clean, clean & invalidate, invalidate)
> operating on a given cache level by S/W. Because the OS is not allowed to infer
> the S/W to PA mapping, it can only use S/W to nuke the whole cache. "The
> expected usage of the cache maintenance that operate by set/way is associated
> with powerdown and powerup of caches, if this is required by the
> implementation" (see D3-2020 ARM DDI 0487B.b).
> 
> These instructions target the local processor and are usually issued in a batch
> to nuke the whole cache. This means that if the vCPU is migrated to another pCPU in
> the middle of the process, the cache may not be fully cleaned. This would result in
> data corruption and a potential crash of the OS.
> 
> Thankfully, the Arm architecture offers a way to trap all the cache
> maintenance instructions by S/W (e.g. HCR_EL2.TSW). Xen will need to set that
> bit and handle S/W.
> 
> The major question now is how to handle them. S/W instructions are difficult
> to virtualize (see ARMv7 ARM B1.14.4).
> 
> The suggested policy is based on the KVM one:
>   - If we trap an S/W instruction, we enable VM trapping (e.g.
> HCR_EL2.TVM) to detect the caches being turned on/off, and do a full clean.
>   - We flush the caches both when they are turned on and when they are turned off.
>   - Once the caches are enabled, we stop trapping VM instructions.
> 
> Doing a full clean will require going through the P2M and flushing the entries
> one by one. At the moment, all the memory is mapped. As you can imagine,
> flushing a guest with hundreds of MB will take a very long time (Linux times out
> during CPU bring-up).
> 
> Therefore, we need a way to limit the number of entries we need to flush. The
> suggested solution here is to introduce Populate On Demand (PoD) on Arm.
> 
> The guest would boot with no RAM mapped in the stage-2 page-table. At every
> prefetch/data abort, the RAM would be mapped, preferably using 2MB chunks or
> 4KB. This means that by the time S/W is used, the number of entries mapped
> would be very limited. However, for safety, the flush should be preemptible.
> 
> For those worried about the performance impact, I have looked at the
> current use of S/W instructions:
>   - Linux Arm64: The last use in the kernel was at the beginning of 2015.
>   - Linux Arm32: Still uses S/W for boot and secondary CPU bring-up. No
> plan to change.
>   - UEFI: A couple of uses in UEFI, but I have heard they plan to remove
> them (needs confirmation).
> 
> I haven't looked at all the OSes. However, given that the Arm ARM clearly states S/W
> instructions are not easily virtualizable, I would expect guest OS
> developers to try their best to limit the use of these instructions.
> 
> To limit the performance impact, we could introduce a guest option to tell
> whether the guest will use S/W. If it does plan to use S/W, PoD will be
> disabled.
> 
> Now regarding the hardware domain: at the moment, it has its RAM direct
> mapped. Supporting direct mapping in PoD will be quite a pain for limited
> benefit (see why above). In that case I would suggest imposing vCPU pinning
> for the hardware domain if S/W instructions are expected to be used. Again, a command
> line option could be introduced here.
> 
> Any feedback on the approach will be welcome.
 
Could we pin the hwdom vcpus only at boot time, until all S/W operations
are issued, then "release" them? If we can detect the last expected S/W
operation with some sort of heuristic.

Given the information provided above, would it make sense to consider
avoiding PoD for arm64 kernel direct boots?

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel