Re: simplify gendisk and request_queue allocation for blk-mq based drivers

2021-06-04 Thread Konrad Rzeszutek Wilk
On Wed, Jun 02, 2021 at 09:53:15AM +0300, Christoph Hellwig wrote:
> Hi all,

Hi!

You wouldn't have a nice git repo to pull from so one can test it easily?

Thank you!

Cc-ing Boris/Juergen - please see the Xen bits below.
> 
> this series is the second part of cleaning up lifetimes and allocation of
> the gendisk and request_queue structure.  It adds a new interface to
> allocate the disk and queue together for blk-mq based drivers, and uses that
> in all drivers that do not have any caveats in their gendisk and
> request_queue lifetime rules.
> 
> Diffstat:
>  block/blk-mq.c  |   91 +++---
>  block/blk.h |1 
>  block/elevator.c|2 
>  drivers/block/amiflop.c |   16 +-
>  drivers/block/aoe/aoeblk.c  |   33 
>  drivers/block/aoe/aoedev.c  |3 -
>  drivers/block/ataflop.c |   16 +-
>  drivers/block/floppy.c  |   20 +--
>  drivers/block/loop.c|   19 ++-
>  drivers/block/nbd.c |   53 +++
>  drivers/block/null_blk/main.c   |   11 +---
>  drivers/block/paride/pcd.c  |   19 +++
>  drivers/block/paride/pd.c   |   30 ---
>  drivers/block/paride/pf.c   |   18 ++
>  drivers/block/ps3disk.c |   36 +
>  drivers/block/rbd.c |   52 ++-
>  drivers/block/rnbd/rnbd-clt.c   |   35 +++--
>  drivers/block/sunvdc.c  |   47 -
>  drivers/block/swim.c|   34 +---
>  drivers/block/swim3.c   |   33 +---
>  drivers/block/sx8.c |   23 ++--
>  drivers/block/virtio_blk.c  |   26 ++---
>  drivers/block/xen-blkfront.c|   96 ++--
>  drivers/block/z2ram.c   |   15 +
>  drivers/cdrom/gdrom.c   |   45 +++-
>  drivers/md/dm-rq.c  |9 +--
>  drivers/memstick/core/ms_block.c|   25 +++--
>  drivers/memstick/core/mspro_block.c |   26 -
>  drivers/mtd/mtd_blkdevs.c   |   48 --
>  drivers/mtd/ubi/block.c |   68 ++---
>  drivers/s390/block/scm_blk.c|   21 ++-
>  include/linux/blk-mq.h  |   24 ++---
>  include/linux/elevator.h|1 
>  33 files changed, 386 insertions(+), 610 deletions(-)
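
For readers following along, the driver-side conversion this series enables boils down to roughly the pattern below. This is a sketch only: the blk_mq_alloc_disk() helper name comes from the series itself, but the exact signature and the mydrv_*/MYDRV_MAJOR names are illustrative assumptions, not quotes of the patches.

/*
 * Sketch of a probe path using a combined gendisk + request_queue
 * allocation.  Error handling trimmed to the essentials.
 */
static int mydrv_probe(struct mydrv_device *md)
{
	struct gendisk *disk;

	/* One call allocates the gendisk together with its request_queue. */
	disk = blk_mq_alloc_disk(&md->tag_set, md);
	if (IS_ERR(disk))
		return PTR_ERR(disk);

	md->disk = disk;
	disk->major = MYDRV_MAJOR;
	disk->first_minor = md->index;
	disk->fops = &mydrv_fops;
	disk->private_data = md;
	sprintf(disk->disk_name, "mydrv%d", md->index);

	add_disk(disk);
	return 0;
}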


Re: [RFC PATCH 0/7] Untrusted device support for virtio

2021-06-04 Thread Konrad Rzeszutek Wilk

On 4/29/21 12:16 AM, Jason Wang wrote:


On 2021/4/29 5:06 AM, Konrad Rzeszutek Wilk wrote:

On Wed, Apr 21, 2021 at 11:21:10AM +0800, Jason Wang wrote:

Hi All:

Sometimes, the driver doesn't trust the device. This usually
happens for the encrypted VM or VDUSE[1]. In both cases, technology
like swiotlb is used to prevent the poking/mangling of memory from the
device. But this is not sufficient, since the current virtio driver may
trust what is stored in the descriptor table (coherent mapping) for
performing DMA operations like unmap and bounce, so the device may
choose to utilize the behaviour of swiotlb to perform attacks[2].

We fixed that in SWIOTLB: it now saves the expected length
of the DMA operation. See:

commit daf9514fd5eb098d7d6f3a1247cb8cc48fc94155
Author: Martin Radev 
Date:   Tue Jan 12 16:07:29 2021 +0100

 swiotlb: Validate bounce size in the sync/unmap path
 The size of the buffer being bounced is not checked if it happens
 to be larger than the size of the mapped buffer. Because the size
 can be controlled by a device, as it's the case with virtio devices,
 this can lead to memory corruption.



Good to know this, but this series tries to protect at a different level.
And I believe such protection needs to be done at both levels.




My apologies for taking so long to respond, somehow this disappeared in 
one of the folders.



For double insurance, to protect from a malicious device, when the DMA API
is used for the device, this series stores and uses the descriptor
metadata in an auxiliary structure which cannot be accessed via
swiotlb, instead of the ones in the descriptor table. Actually, we've

Sorry for being dense here, but how would SWIOTLB be utilized for
this attack?



So we still have behaviors that are triggered by a device that is not trusted.
Such behavior is what the series tries to avoid. We've learnt a lot of
lessons about eliminating the potential attacks via this. And it would be too
late to fix if we found another issue in SWIOTLB.


Proving "the unexpected device-triggered behavior is safe" is much harder
(or even impossible) than "eliminating the unexpected device-triggered
behavior entirely".


E.g. I wonder whether something like this can happen: consider that the DMA
direction of the unmap is under the control of the device. The device could cheat
SWIOTLB by changing the flag in order to modify a device-read-only buffer.


 Why would you want to expose that to the device? And wouldn't 
that be specific to Linux devices - because surely Windows DMA APIs are 
different and this 'flag' seems very Linux-kernel specific?



If yes, is it really safe?


Well, no? But neither is 'rm -Rf /', and we still allow folks to do that.


The above patch only logs the bounce size, but it doesn't log the flag.


It logs and panics the system.

Even if it logged the flag, SWIOTLB still doesn't know how each buffer is
used and when it's the appropriate (safe) time to unmap the buffer; only
the driver that is using SWIOTLB knows that.


Fair enough. Is the intent to do the same thing for all the other
drivers that could be running in an encrypted guest and would require
SWIOTLB?


Like legacy devices that KVM can expose (floppy driver? SVGA driver)?



So I think we need to consolidate on both layers instead of solely 
depending on the SWIOTLB.


Please make sure that this explanation is part of the cover letter
or in the commit/Kconfig.

Also, are you aware of the patchset that Andi has been working on, which tries
to give the DMA code extra bells and whistles for this purpose?


Thank you.

Thanks





almost achieved that through packed virtqueue and we just need to fix
a corner case of handling mapping errors. For split virtqueue we just
follow what's done in the packed.

Note that we don't duplicate descriptor metadata for indirect
descriptors since they use stream mapping, which is read only, so it's
safe if the metadata of non-indirect descriptors is correct.

The behavior for the non-DMA API case is kept to minimize the performance
impact.

Slightly tested with packed on/off, iommu on/off, swiotlb force/off in
the guest.

Please review.

[1] 
https://lore.kernel.org/netdev/fab615ce-5e13-a3b3-3715-a4203b4ab...@redhat.com/T/ 

[2] 
https://yhbt.net/lore/all/c3629a27-3590-1d9f-211b-c0b7be152...@redhat.com/T/#mc6b6e2343cbeffca68ca7a97e0f473aaa871c95b 



Jason Wang (7):
   virtio-ring: maintain next in extra state for packed virtqueue
   virtio_ring: rename vring_desc_extra_packed
   virtio-ring: factor out desc_extra allocation
   virtio_ring: secure handling of mapping errors
   virtio_ring: introduce virtqueue_desc_add_split()
   virtio: use err label in __vring_new_virtqueue()
   virtio-ring: store DMA metadata in desc_extra for split virtqueue

  drivers/virtio/virtio_ring.c | 189 ++-
  1 file changed, 141 insertions(+), 48 deletions(-)

--
2.25.1






Re: [RFC PATCH 0/7] Untrusted device support for virtio

2021-04-28 Thread Konrad Rzeszutek Wilk
On Wed, Apr 21, 2021 at 11:21:10AM +0800, Jason Wang wrote:
> Hi All:
> 
> Sometimes, the driver doesn't trust the device. This usually
> happens for the encrypted VM or VDUSE[1]. In both cases, technology
> like swiotlb is used to prevent the poking/mangling of memory from the
> device. But this is not sufficient, since the current virtio driver may
> trust what is stored in the descriptor table (coherent mapping) for
> performing DMA operations like unmap and bounce, so the device may
> choose to utilize the behaviour of swiotlb to perform attacks[2].

We fixed that in SWIOTLB: it now saves the expected length
of the DMA operation. See:

commit daf9514fd5eb098d7d6f3a1247cb8cc48fc94155 
Author: Martin Radev 
Date:   Tue Jan 12 16:07:29 2021 +0100

swiotlb: Validate bounce size in the sync/unmap path

The size of the buffer being bounced is not checked if it happens
to be larger than the size of the mapped buffer. Because the size
can be controlled by a device, as it's the case with virtio devices,
this can lead to memory corruption.
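
The gist of that commit, as a rough sketch (the function signature and names below are illustrative, not the upstream code): record the size of the mapping when the bounce slot is set up and refuse sync/unmap requests that exceed it.

/*
 * 'size' is what the driver (and, indirectly, the device) asks to
 * bounce; 'mapped_size' is what was recorded at map time.
 */
static void swiotlb_bounce_checked(phys_addr_t orig_addr, size_t mapped_size,
				   void *tlb_vaddr, size_t size)
{
	if (size > mapped_size) {
		pr_warn_once("swiotlb: bounce of %zu bytes exceeds the %zu-byte mapping\n",
			     size, mapped_size);
		return;
	}

	/* ... copy 'size' bytes between tlb_vaddr and the buffer at orig_addr ... */
}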


> 
> For double insurance, to protect from a malicious device, when the DMA API
> is used for the device, this series stores and uses the descriptor
> metadata in an auxiliary structure which cannot be accessed via
> swiotlb, instead of the ones in the descriptor table. Actually, we've

Sorry for being dense here, but how would SWIOTLB be utilized for
this attack?

> almost achieved that through packed virtqueue and we just need to fix
> a corner case of handling mapping errors. For split virtqueue we just
> follow what's done in the packed.
> 
> Note that we don't duplicate descriptor metadata for indirect
> descriptors since they use stream mapping, which is read only, so it's
> safe if the metadata of non-indirect descriptors is correct.
> 
> The behavior for the non-DMA API case is kept to minimize the performance
> impact.
> 
> Slightly tested with packed on/off, iommu on/off, swiotlb force/off in
> the guest.
> 
> Please review.
> 
> [1] 
> https://lore.kernel.org/netdev/fab615ce-5e13-a3b3-3715-a4203b4ab...@redhat.com/T/
> [2] 
> https://yhbt.net/lore/all/c3629a27-3590-1d9f-211b-c0b7be152...@redhat.com/T/#mc6b6e2343cbeffca68ca7a97e0f473aaa871c95b
> 
> Jason Wang (7):
>   virtio-ring: maintain next in extra state for packed virtqueue
>   virtio_ring: rename vring_desc_extra_packed
>   virtio-ring: factor out desc_extra allocation
>   virtio_ring: secure handling of mapping errors
>   virtio_ring: introduce virtqueue_desc_add_split()
>   virtio: use err label in __vring_new_virtqueue()
>   virtio-ring: store DMA metadata in desc_extra for split virtqueue
> 
>  drivers/virtio/virtio_ring.c | 189 ++-
>  1 file changed, 141 insertions(+), 48 deletions(-)
> 
> -- 
> 2.25.1
> 
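
For the shape of the idea: keep a driver-private copy of each descriptor's DMA state and unmap from that, never from the device-writable descriptor table. A rough sketch follows; the struct name matches the patch titles above, but the exact field layout and helper in drivers/virtio/virtio_ring.c are assumptions.

struct vring_desc_extra {
	dma_addr_t addr;	/* DMA address recorded when the buffer was mapped */
	u32 len;		/* length recorded at map time */
	u16 flags;		/* direction flags recorded at map time */
	u16 next;		/* driver-private next/free-list pointer */
};

/* Unmap using only the private copy, so a device that rewrites
 * desc[i].addr/len cannot influence the unmap parameters. */
static void vring_unmap_one_sketch(struct device *dma_dev,
				   const struct vring_desc_extra *extra)
{
	dma_unmap_single(dma_dev, extra->addr, extra->len,
			 (extra->flags & VRING_DESC_F_WRITE) ?
				DMA_FROM_DEVICE : DMA_TO_DEVICE);
}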


Re: [PATCH 0/7] x86/seves: Support 32-bit boot path and other updates

2021-02-10 Thread Konrad Rzeszutek Wilk
On Wed, Feb 10, 2021 at 04:12:25PM +0100, Joerg Roedel wrote:
> Hi Konrad,
> 
> On Wed, Feb 10, 2021 at 09:58:35AM -0500, Konrad Rzeszutek Wilk wrote:
> > What GRUB versions are we talking about (CC-ing Daniel Kiper, who owns
> > GRUB).
> 
> I think this was about 32-bit GRUB builds used by distributions. I
> personally tested it with a kernel which has EFI support disabled, in
> this case the OVMF firmware will also boot into the startup_32 boot
> path.

I think I am missing something obvious here - but why would you want
EFI support disabled?

Or is the idea that "legacy" OSes can nicely run under AMD SEV?
But since you are applying a kernel patch, it is not a "legacy OS" anymore.

> 
> > By 'some firmware' are we talking about SeaBIOS?
> 
> No, SeaBIOS is not supported for SEV-ES, only OVMF has handling for #VC
> so far.
> 
> Regards,
> 
>   Joerg


Re: [PATCH 0/7] x86/seves: Support 32-bit boot path and other updates

2021-02-10 Thread Konrad Rzeszutek Wilk
On Wed, Feb 10, 2021 at 11:21:28AM +0100, Joerg Roedel wrote:
> From: Joerg Roedel 
> 
> Hi,
> 
> these patches add support for the 32-bit boot in the decompressor
> code. This is needed to boot an SEV-ES guest on some firmware and grub
> versions. The patches also add the necessary CPUID sanity checks and a

Could you expand a bit please?

What GRUB versions are we talking about (CC-ing Daniel Kiper, who owns
GRUB).

By 'some firmware' are we talking about SeaBIOS?

> 32-bit version of the C-bit check.
> 
> Other updates included here:
> 
>   1. Add code to shut down exception handling in the
>  decompressor code before jumping to the real kernel.
>  Once in the real kernel it is not safe anymore to jump
>  back to the decompressor code via exceptions.
> 
>   2. Replace open-coded hlt loops with proper calls to
>  sev_es_terminate().
> 
> Please review.
> 
> Thanks,
> 
>   Joerg
> 
> Joerg Roedel (7):
>   x86/boot/compressed/64: Cleanup exception handling before booting
> kernel
>   x86/boot/compressed/64: Reload CS in startup_32
>   x86/boot/compressed/64: Setup IDT in startup_32 boot path
>   x86/boot/compressed/64: Add 32-bit boot #VC handler
>   x86/boot/compressed/64: Add CPUID sanity check to 32-bit boot-path
>   x86/boot/compressed/64: Check SEV encryption in 32-bit boot-path
>   x86/sev-es: Replace open-coded hlt-loops with sev_es_terminate()
> 
>  arch/x86/boot/compressed/head_64.S | 168 -
>  arch/x86/boot/compressed/idt_64.c  |  14 +++
>  arch/x86/boot/compressed/mem_encrypt.S | 114 -
>  arch/x86/boot/compressed/misc.c|   7 +-
>  arch/x86/boot/compressed/misc.h|   6 +
>  arch/x86/boot/compressed/sev-es.c  |  12 +-
>  arch/x86/kernel/sev-es-shared.c|  10 +-
>  7 files changed, 307 insertions(+), 24 deletions(-)
> 
> -- 
> 2.30.0
> 


Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path

2021-02-08 Thread Konrad Rzeszutek Wilk
On Fri, Feb 05, 2021 at 06:58:52PM +0100, Christoph Hellwig wrote:
> On Wed, Feb 03, 2021 at 02:36:38PM -0500, Konrad Rzeszutek Wilk wrote:
> > > So what?  If you guys want to provide a new capability you'll have to do
> > > work.  And designing a new protocol based around the fact that the
> > > hardware/hypervisor is not trusted and a copy is always required makes
> > > a lot of more sense than throwing in band aids all over the place.
> > 
> > If you don't trust the hypervisor, what would this capability be in?
> 
> Well, they don't trust the hypervisor to not attack the guest somehow,
> except through the data read.  I never really understood the concept,
> as it leaves too many holes.
> 
> But the point is that these schemes want to force bounce buffering
> because they think it is more secure.  And if that is what you want
> you better have protocol build around the fact that each I/O needs
> to use bounce buffers, so you make those buffers the actual shared
> memory use for communication, and build the protocol around it.

Right. That is what the SWIOTLB pool ends up being, as it is allocated at
boot, where the guest tells the hypervisor: these pages are shared and
clear-text.

> E.g. you don't force the ridiculous NVMe PRP offset rules on the block
> layer, just to make a complicated swiotlb allocation that needs to
> preserve the alignment just do I/O.  But instead you have a trivial

I agree that NVMe is being silly. It could have allocated the coherent
pool, used that, and done its own offsets within it. That would in
essence carve out a static pool within the SWIOTLB static one..

TTM does that - it has its own DMA machinery on top of DMA API to deal
with its "passing" buffers from one application to another and the fun
of keeping track of that.

> ring buffer or whatever because you know I/O will be copied anyway
> and none of all the hard work higher layers do to make the I/O suitable
> for a normal device apply.

I lost you here. Sorry - are you saying we should have a simple ring protocol
(like NVMe has), where the ring entries (SG or DMA phys) are statically
allocated, and whenever the NVMe driver gets data from user space it
would copy it in there?



Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path

2021-02-03 Thread Konrad Rzeszutek Wilk
On Wed, Feb 03, 2021 at 01:49:22PM +0100, Christoph Hellwig wrote:
> On Mon, Jan 18, 2021 at 12:44:58PM +0100, Martin Radev wrote:
> > Your comment makes sense but then that would require the cooperation
> > of these vendors and the cloud providers to agree on something meaningful.
> > I am also not sure whether the end result would be better than hardening
> > this interface to catch corruption. There is already some validation in
> > unmap path anyway.
> 
> So what?  If you guys want to provide a new capability you'll have to do
> work.  And designing a new protocol based around the fact that the
> hardware/hypervisor is not trusted and a copy is always required makes
> a lot of more sense than throwing in band aids all over the place.

If you don't trust the hypervisor, what would this capability be in?

I suppose you mean this would need to be in the guest kernel, and
this protocol would depend on .. not-hypervisor, and most certainly not
the virtio or any SR-IOV device. That removes a lot of options.

The one sensible one (since folks will trust OEM vendors like Intel
or AMD to provide the memory encryption, so they will also trust the
IOMMU - I hope?) - and they do have plans for that with their IOMMU
frameworks, which will remove the need for SWIOTLB (I hope).

But that is not now; that is in the future.


Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path

2021-02-02 Thread Konrad Rzeszutek Wilk
On Tue, Feb 02, 2021 at 04:34:09PM -0600, Tom Lendacky wrote:
> On 2/2/21 10:37 AM, Konrad Rzeszutek Wilk wrote:
> > On Mon, Jan 25, 2021 at 07:33:35PM +0100, Martin Radev wrote:
> >> On Mon, Jan 18, 2021 at 10:14:28AM -0500, Konrad Rzeszutek Wilk wrote:
> >>> On Mon, Jan 18, 2021 at 12:44:58PM +0100, Martin Radev wrote:
> >>>> On Wed, Jan 13, 2021 at 12:30:17PM +0100, Christoph Hellwig wrote:
> >>>>> On Tue, Jan 12, 2021 at 04:07:29PM +0100, Martin Radev wrote:
> >>>>>> The size of the buffer being bounced is not checked if it happens
> >>>>>> to be larger than the size of the mapped buffer. Because the size
> >>>>>> can be controlled by a device, as it's the case with virtio devices,
> >>>>>> this can lead to memory corruption.
> >>>>>>
> >>>>>
> >>>>> I'm really worried about all these hodge podge hacks for not trusted
> >>>>> hypervisors in the I/O stack.  Instead of trying to harden protocols
> >>>>> that are fundamentally not designed for this, how about instead coming
> >>>>> up with a new paravirtualized I/O interface that is specifically
> >>>>> designed for use with an untrusted hypervisor from the start?
> >>>>
> >>>> Your comment makes sense but then that would require the cooperation
> >>>> of these vendors and the cloud providers to agree on something 
> >>>> meaningful.
> >>>> I am also not sure whether the end result would be better than hardening
> >>>> this interface to catch corruption. There is already some validation in
> >>>> unmap path anyway.
> >>>>
> >>>> Another possibility is to move this hardening to the common virtio code,
> >>>> but I think the code may become more complicated there since it would
> >>>> require tracking both the dma_addr and length for each descriptor.
> >>>
> >>> Christoph,
> >>>
> >>> I've been wrestling with the same thing - this is specific to busted
> >>> drivers. And in reality you could do the same thing with a hardware
> >>> virtio device (see example in http://thunderclap.io/) - where the
> >>> mitigation is 'enable the IOMMU to do its job.'.
> >>>
> >>> AMD SEV documents speak about utilizing IOMMU to do this (AMD SEV-SNP)..
> >>> and while that is great in the future, SEV without IOMMU is now here.
> >>>
> >>> Doing a full circle here, this issue can be exploited with virtio
> >>> but you could say do that with real hardware too if you hacked the
> >>> firmware, so if you say used Intel SR-IOV NIC that was compromised
> >>> on an AMD SEV machine, and plumbed in the guest - the IOMMU inside
> >>> of the guest would be SWIOTLB code. Last line of defense against
> >>> bad firmware to say.
> >>>
> >>> As such I am leaning towards taking this code, but I am worried
> >>> about the performance hit .. but perhaps I shouldn't as if you
> >>> are using SWIOTLB=force already you are kind of taking a
> >>> performance hit?
> >>>
> >>
> >> I have not measured the performance degradation. This will hit all AMD SEV,
> >> Intel TDX, IBM Protected Virtualization VMs. I don't expect the hit to
> >> be large since there are only a few added operations per hundreds of copied
> >> bytes. I could try to measure the performance hit by running some benchmark
> >> with virtio-net/virtio-blk/virtio-rng.
> >>
> >> Earlier I said:
> >>>> Another possibility is to move this hardening to the common virtio code,
> >>>> but I think the code may become more complicated there since it would
> >>>> require tracking both the dma_addr and length for each descriptor.
> >>
> >> Unfortunately, this doesn't make sense. Even if there's validation for
> >> the size in the common virtio layer, there will be some other device
> >> which controls a dma_addr and length passed to dma_unmap* in the
> >> corresponding driver. The device can target a specific dma-mapped private
> >> buffer by changing the dma_addr and set a good length to overwrite buffers
> >> following it.

Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path

2021-02-02 Thread Konrad Rzeszutek Wilk
On Mon, Jan 25, 2021 at 07:33:35PM +0100, Martin Radev wrote:
> On Mon, Jan 18, 2021 at 10:14:28AM -0500, Konrad Rzeszutek Wilk wrote:
> > On Mon, Jan 18, 2021 at 12:44:58PM +0100, Martin Radev wrote:
> > > On Wed, Jan 13, 2021 at 12:30:17PM +0100, Christoph Hellwig wrote:
> > > > On Tue, Jan 12, 2021 at 04:07:29PM +0100, Martin Radev wrote:
> > > > > The size of the buffer being bounced is not checked if it happens
> > > > > to be larger than the size of the mapped buffer. Because the size
> > > > > can be controlled by a device, as it's the case with virtio devices,
> > > > > this can lead to memory corruption.
> > > > > 
> > > > 
> > > > I'm really worried about all these hodge podge hacks for not trusted
> > > > hypervisors in the I/O stack.  Instead of trying to harden protocols
> > > > that are fundamentally not designed for this, how about instead coming
> > > > up with a new paravirtualized I/O interface that is specifically
> > > > designed for use with an untrusted hypervisor from the start?
> > > 
> > > Your comment makes sense but then that would require the cooperation
> > > of these vendors and the cloud providers to agree on something meaningful.
> > > I am also not sure whether the end result would be better than hardening
> > > this interface to catch corruption. There is already some validation in
> > > unmap path anyway.
> > > 
> > > Another possibility is to move this hardening to the common virtio code,
> > > but I think the code may become more complicated there since it would
> > > require tracking both the dma_addr and length for each descriptor.
> > 
> > Christoph,
> > 
> > I've been wrestling with the same thing - this is specific to busted
> > drivers. And in reality you could do the same thing with a hardware
> > virtio device (see example in http://thunderclap.io/) - where the
> > mitigation is 'enable the IOMMU to do its job.'.
> > 
> > AMD SEV documents speak about utilizing IOMMU to do this (AMD SEV-SNP)..
> > and while that is great in the future, SEV without IOMMU is now here.
> > 
> > Doing a full circle here, this issue can be exploited with virtio
> > but you could say do that with real hardware too if you hacked the
> > firmware, so if you say used Intel SR-IOV NIC that was compromised
> > on an AMD SEV machine, and plumbed in the guest - the IOMMU inside
> > of the guest would be SWIOTLB code. Last line of defense against
> > bad firmware to say.
> > 
> > As such I am leaning towards taking this code, but I am worried
> > about the performance hit .. but perhaps I shouldn't as if you
> > are using SWIOTLB=force already you are kind of taking a
> > performance hit?
> > 
> 
> I have not measured the performance degradation. This will hit all AMD SEV,
> Intel TDX, IBM Protected Virtualization VMs. I don't expect the hit to
> be large since there are only a few added operations per hundreds of copied
> bytes. I could try to measure the performance hit by running some benchmark
> with virtio-net/virtio-blk/virtio-rng.
> 
> Earlier I said:
> > > Another possibility is to move this hardening to the common virtio code,
> > > but I think the code may become more complicated there since it would
> > > require tracking both the dma_addr and length for each descriptor.
> 
> Unfortunately, this doesn't make sense. Even if there's validation for
> the size in the common virtio layer, there will be some other device
> which controls a dma_addr and length passed to dma_unmap* in the
> corresponding driver. The device can target a specific dma-mapped private
> buffer by changing the dma_addr and set a good length to overwrite buffers
> following it.
> 
> So, instead of doing the check in every driver and hitting a performance
> cost even when swiotlb is not used, it's probably better to fix it in
> swiotlb.
> 
> @Tom Lendacky, do you think that it makes sense to harden swiotlb or
> some other approach may be better for the SEV features?

I am not Tom, but this change seems the right way forward regardless if
is TDX, AMD SEV, or any other architecture that encrypt memory and use
SWIOTLB.

Let me queue it up in development branch and do some regression testing.
> 


Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path

2021-01-18 Thread Konrad Rzeszutek Wilk
On Mon, Jan 18, 2021 at 12:44:58PM +0100, Martin Radev wrote:
> On Wed, Jan 13, 2021 at 12:30:17PM +0100, Christoph Hellwig wrote:
> > On Tue, Jan 12, 2021 at 04:07:29PM +0100, Martin Radev wrote:
> > > The size of the buffer being bounced is not checked if it happens
> > > to be larger than the size of the mapped buffer. Because the size
> > > can be controlled by a device, as it's the case with virtio devices,
> > > this can lead to memory corruption.
> > > 
> > 
> > I'm really worried about all these hodge podge hacks for not trusted
> > hypervisors in the I/O stack.  Instead of trying to harden protocols
> > that are fundamentally not designed for this, how about instead coming
> > up with a new paravirtualized I/O interface that is specifically
> > designed for use with an untrusted hypervisor from the start?
> 
> Your comment makes sense but then that would require the cooperation
> of these vendors and the cloud providers to agree on something meaningful.
> I am also not sure whether the end result would be better than hardening
> this interface to catch corruption. There is already some validation in
> unmap path anyway.
> 
> Another possibility is to move this hardening to the common virtio code,
> but I think the code may become more complicated there since it would
> require tracking both the dma_addr and length for each descriptor.

Christoph,

I've been wrestling with the same thing - this is specific to busted
drivers. And in reality you could do the same thing with a hardware
virtio device (see example in http://thunderclap.io/) - where the
mitigation is 'enable the IOMMU to do its job.'.

AMD SEV documents speak about utilizing IOMMU to do this (AMD SEV-SNP)..
and while that is great in the future, SEV without IOMMU is now here.

Doing a full circle here, this issue can be exploited with virtio
but you could say do that with real hardware too if you hacked the
firmware, so if you say used Intel SR-IOV NIC that was compromised
on an AMD SEV machine, and plumbed in the guest - the IOMMU inside
of the guest would be SWIOTLB code. Last line of defense against
bad firmware to say.

As such I am leaning towards taking this code, but I am worried
about the performance hit .. but perhaps I shouldn't as if you
are using SWIOTLB=force already you are kind of taking a
performance hit?



Re: swiotlb/virtio: unchecked device dma address and length

2020-12-16 Thread Konrad Rzeszutek Wilk
..snip..
>> > > This raises two issues:
>> > > 1) swiotlb_tlb_unmap_single fails to check whether the index
>generated
>> > > from the dma_addr is in range of the io_tlb_orig_addr array.
>> > That is fairly simple to implement I would think. That is it can
>check
>> > that the dma_addr is from the PA in the io_tlb pool when
>SWIOTLB=force
>> > is used.
>> 
>> 
>> I'm not sure this can fix all the cases. It looks to me we should map
>> descriptor coherent but readonly (which is not supported by current
>DMA
>> API).
>
>Neither is this supported by encrypted memory technologies.


-ECONFUSED.

Could you state this once more please? I am not exactly sure what you are
saying.

>
>> Otherwise, device can modify the desc[i].addr/desc[i].len at any time
>to
>> pretend a valid mapping.
>> 
>> Thanks
>> 
>> 
>> > 



Re: swiotlb/virtio: unchecked device dma address and length

2020-12-16 Thread Konrad Rzeszutek Wilk
On December 16, 2020 1:41:48 AM EST, Jason Wang  wrote:
>
>
>- Original Message -
>> 
>> 
>> - Original Message -
>> > .snip.
>> > > > > This raises two issues:
>> > > > > 1) swiotlb_tlb_unmap_single fails to check whether the index
>> > > > > generated
>> > > > > from the dma_addr is in range of the io_tlb_orig_addr array.
>> > > > That is fairly simple to implement I would think. That is it
>can check
>> > > > that the dma_addr is from the PA in the io_tlb pool when
>SWIOTLB=force
>> > > > is used.
>> > > 
>> > > 
>> > > I'm not sure this can fix all the cases. It looks to me we should
>map
>> > > descriptor coherent but readonly (which is not supported by
>current DMA
>> > > API).
>> > 
>> > I think I am missing something obvious here. The attacker is the
>> > hypervisor,
>> > aka
>> > the owner of the VirtIO device (ring0). The attacker is the one
>that
>> > provides the addr/len - having that readonly from a guest
>perspective
>> > does not change the fact that the hypervisor can modify the memory
>range
>> > by mapping it via a different virtual address in the hypervisor?
>(aka
>> > aliasing it).
>> 
>> Right, but if we allow hypervisor to provide arbitrary addr/len, does
>> it mean hypervisor can read encrypted content of encrypted memory of
>> guest through swiotlb?

Yes.
>> 
>> Thanks
>
>Actually not. I think you're right.


Your sentence is very confusing.

On a DMA unmap, SWIOTLB (when force is used) trusts the driver to provide
the correct DMA address and length, which SWIOTLB uses to match to its
associated original PA address.

Think of the original PA having a mapping to a PA in the SWIOTLB pool.


The length is not checked, so the attacker can change it to, say, a huge number
and cause the SWIOTLB bounce code to write or read data from a non-SWIOTLB PA into
the SWIOTLB PA pool.
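
In code terms, the trust problem is roughly the following (a simplified sketch, not actual virtio_ring code):

/* 'len' ultimately comes from state the device can rewrite (the
 * descriptor table), yet it is passed straight into the unmap, where
 * the SWIOTLB bounce code uses it to size the copy. */
static void unmap_one_buffer(struct device *dev, dma_addr_t addr, u32 len)
{
	dma_unmap_single(dev, addr, len, DMA_FROM_DEVICE);
}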




>
>Thanks
>
>> 
>> > > 
>> > > Otherwise, device can modify the desc[i].addr/desc[i].len at any
>time to
>> > > pretend a valid mapping.
>> > 
>> > With the swiotlb=force as long as addr/len are within the PA
>boundaries
>> > within the SWIOTLB pool this should be OK?
>> > 
>> > After all that whole area is in cleartext and visible to the
>attacker.
>> > 
>> > 
>> 



Re: swiotlb/virtio: unchecked device dma address and length

2020-12-15 Thread Konrad Rzeszutek Wilk
On Tue, Dec 15, 2020 at 11:54:08AM +0100, Felicitas Hetzelt wrote:
> Hello,
> thank you all for looking into this! To answer some of the questions:
>  - Did you have already some PoC fixes for this:
>We don't have a full PoC or fix currently, though we have a PoC
>with which we were able to overwrite memory outside of the mapped
>DMA region.
>  - Is there a CVE associated with this?
>No
>  - Is there a paper on this you all are working on?
>Yes, we were planning to use this bug (among others
>in a paper)
> 
> Could you point us to the intel thunder issue that you mentioned?

ThunderClap was it!

https://lwn.net/Articles/786558/

Cc-ing Lu Baolu ..

Hm, this was a year ago and it looks like there are some extra SWIOTLB
patches to be done ?

> 
> On 12/15/20 9:47 AM, Ashish Kalra wrote:
> > On Mon, Dec 14, 2020 at 04:49:50PM -0500, Konrad Rzeszutek Wilk wrote:
> >> On Fri, Dec 11, 2020 at 06:31:21PM +0100, Felicitas Hetzelt wrote:
> >>> Hello,
> >>
> >> Hi! Please see below my responses.
> >>
> >>> we have been analyzing the Hypervisor-OS interface of Linux
> >>> and discovered bugs in the swiotlb/virtio implementation that can be
> >>> triggered from a malicious Hypervisor / virtual device.
> >>> With SEV, the SWIOTLB implementation is forcefully enabled and would
> >>> always be used. Thus, all virtio devices and others would use it under
> >>> the hood.
> >>>
> >>> The reason for analyzing this interface is that, technologies such as
> >>> Intel's Trusted Domain Extensions [1] and AMD's Secure Nested Paging [2]
> >>> change the threat model assumed by various Linux kernel subsystems.
> >>> These technologies take the presence of a fully malicious hypervisor
> >>> into account and aim to provide protection for virtual machines in such
> >>> an environment. Therefore, all input received from the hypervisor or an
> >>> external device should be carefully validated. Note that these issues
> >>> are of little (or no) relevance in a "normal" virtualization setup,
> >>> nevertheless we believe that it is required to fix them if TDX or SNP is
> >>> used.
> >>>
> >>> We are happy to provide more information if needed!
> >>>
> >>> [1]
> >>> https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
> >>>
> >>> [2] 
> >>> https://www.amd.com/en/processors/amd-secure-encrypted-virtualization
> >>>
> >>> Bug:
> >>> OOB memory write.
> >>> dma_unmap_single -> swiotlb_tbl_unmap_single is invoked with dma_addr
> >>> and length parameters that are under control of the device.
> >>> This happens e.g. in virtio_ring:
> >>> https://elixir.bootlin.com/linux/v5.10-rc7/source/drivers/virtio/virtio_ring.c#L378
> >>
> >> Heya!
> >>
> >> Thank you for pointing this out! I've a couple of questions and hope you 
> >> can
> >> help me out with them.
> >>
> >> Also CC-ing AMD / TDX folks.
> >>>
> > 
> > Adding more relevant folks in AMD.
> > 
> > Needless to say, the swiotlb code needs to validate these external, untrusted
> > dma_addr and length input parameters.
> > 
> > Thanks,
> > Ashish
> > 
> >>> This raises two issues:
> >>> 1) swiotlb_tlb_unmap_single fails to check whether the index generated
> >>> from the dma_addr is in range of the io_tlb_orig_addr array.

Re: swiotlb/virtio: unchecked device dma address and length

2020-12-15 Thread Konrad Rzeszutek Wilk
.snip.
> > > This raises two issues:
> > > 1) swiotlb_tlb_unmap_single fails to check whether the index generated
> > > from the dma_addr is in range of the io_tlb_orig_addr array.
> > That is fairly simple to implement I would think. That is it can check
> > that the dma_addr is from the PA in the io_tlb pool when SWIOTLB=force
> > is used.
> 
> 
> I'm not sure this can fix all the cases. It looks to me we should map
> descriptor coherent but readonly (which is not supported by current DMA
> API).

I think I am missing something obvious here. The attacker is the hypervisor, aka
the owner of the VirtIO device (ring0). The attacker is the one that
provides the addr/len - having that readonly from a guest perspective
does not change the fact that the hypervisor can modify the memory range
by mapping it via a different virtual address in the hypervisor? (aka
aliasing it).
> 
> Otherwise, device can modify the desc[i].addr/desc[i].len at any time to
> pretend a valid mapping.

With the swiotlb=force as long as addr/len are within the PA boundaries
within the SWIOTLB pool this should be OK?

After all that whole area is in cleartext and visible to the attacker.


Re: swiotlb/virtio: unchecked device dma address and length

2020-12-14 Thread Konrad Rzeszutek Wilk
On Fri, Dec 11, 2020 at 06:31:21PM +0100, Felicitas Hetzelt wrote:
> Hello,

Hi! Please see below my responses.

> we have been analyzing the Hypervisor-OS interface of Linux
> and discovered bugs in the swiotlb/virtio implementation that can be
> triggered from a malicious Hypervisor / virtual device.
> With SEV, the SWIOTLB implementation is forcefully enabled and would
> always be used. Thus, all virtio devices and others would use it under
> the hood.
> 
> The reason for analyzing this interface is that, technologies such as
> Intel's Trusted Domain Extensions [1] and AMD's Secure Nested Paging [2]
> change the threat model assumed by various Linux kernel subsystems.
> These technologies take the presence of a fully malicious hypervisor
> into account and aim to provide protection for virtual machines in such
> an environment. Therefore, all input received from the hypervisor or an
> external device should be carefully validated. Note that these issues
> are of little (or no) relevance in a "normal" virtualization setup,
> nevertheless we believe that it is required to fix them if TDX or SNP is
> used.
> 
> We are happy to provide more information if needed!
> 
> [1]
> https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
> 
> [2] https://www.amd.com/en/processors/amd-secure-encrypted-virtualization
> 
> Bug:
> OOB memory write.
> dma_unmap_single -> swiotlb_tbl_unmap_single is invoked with dma_addr
> and length parameters that are under control of the device.
> This happens e.g. in virtio_ring:
> https://elixir.bootlin.com/linux/v5.10-rc7/source/drivers/virtio/virtio_ring.c#L378

Heya!

Thank you for pointing this out! I've a couple of questions and hope you can
help me out with them.

Also CC-ing AMD / TDX folks.
> 
> This raises two issues:
> 1) swiotlb_tlb_unmap_single fails to check whether the index generated
> from the dma_addr is in range of the io_tlb_orig_addr array.

That is fairly simple to implement I would think. That is it can check
that the dma_addr is from the PA in the io_tlb pool when SWIOTLB=force
is used.
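
Roughly like this, using the global names from kernel/dma/swiotlb.c of that era (a sketch of the suggested check, not the eventual fix):

static phys_addr_t swiotlb_lookup_orig_addr(phys_addr_t tlb_addr)
{
	unsigned long index;

	/* Reject device-supplied addresses outside the SWIOTLB aperture. */
	if (tlb_addr < io_tlb_start || tlb_addr >= io_tlb_end)
		return INVALID_PHYS_ADDR;

	index = (tlb_addr - io_tlb_start) >> IO_TLB_SHIFT;
	if (index >= io_tlb_nslabs)
		return INVALID_PHYS_ADDR;

	return io_tlb_orig_addr[index];
}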

> 2) when swiotlb_bounce is called the device controls the length of the
> memory copied to the cpu address.

So.. this sounds very similar to the Intel Thunder.. something issue
where this exact issue was fixed by handing the DMA off to the SWIOTLB
bounce code.

But if that is broken, then that CVE is still not fixed?

So the issue here is that swiotlb_tbl_unmap_single(.., mapping_size, ..) is
under the attacker's control. Ugh.

One way could be to have an io_tlb_orig_addr-ish array with the length
of the mappings to double-check?

Couple more questions:
 - Did you have already some PoC fixes for this? 
 - Is there a CVE associated with this?
 - Is there a paper on this you all are working on?

Thank you!


Is: virtio_gpu_object_shmem_init issues? Was:Re: upstream boot error: general protection fault in swiotlb_map

2020-08-24 Thread Konrad Rzeszutek Wilk
On Thu, Aug 06, 2020 at 03:46:23AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:47ec5303 Merge git://git.kernel.org/pub/scm/linux/kernel/g..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=16fe1dea90
> kernel config:  https://syzkaller.appspot.com/x/.config?x=7c06047f622c5724
> dashboard link: https://syzkaller.appspot.com/bug?extid=3f86afd0b1e4bf1cb64c
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+3f86afd0b1e4bf1cb...@syzkaller.appspotmail.com
> 
> ceph: loaded (mds proto 32)
> NET: Registered protocol family 38
> async_tx: api initialized (async)
> Key type asymmetric registered
> Asymmetric key parser 'x509' registered
> Asymmetric key parser 'pkcs8' registered
> Key type pkcs7_test registered
> Asymmetric key parser 'tpm_parser' registered
> Block layer SCSI generic (bsg) driver version 0.4 loaded (major 243)
> io scheduler mq-deadline registered
> io scheduler kyber registered
> io scheduler bfq registered
> hgafb: HGA card not detected.
> hgafb: probe of hgafb.0 failed with error -22
> usbcore: registered new interface driver udlfb
> uvesafb: failed to execute /sbin/v86d
> uvesafb: make sure that the v86d helper is installed and executable
> uvesafb: Getting VBE info block failed (eax=0x4f00, err=-2)
> uvesafb: vbe_init() failed with -22
> uvesafb: probe of uvesafb.0 failed with error -22
> vga16fb: mapped to 0x8aac772d
> Console: switching to colour frame buffer device 80x30
> fb0: VGA16 VGA frame buffer device
> input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> ACPI: Power Button [PWRF]
> ioatdma: Intel(R) QuickData Technology Driver 5.00
> PCI Interrupt Link [GSIF] enabled at IRQ 21
> PCI Interrupt Link [GSIG] enabled at IRQ 22
> PCI Interrupt Link [GSIH] enabled at IRQ 23
> N_HDLC line discipline registered with maxframe=4096
> Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> 00:05: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> Cyclades driver 2.6
> Initializing Nozomi driver 2.1d
> RocketPort device driver module, version 2.09, 12-June-2003
> No rocketport ports found; unloading driver
> Non-volatile memory driver v1.3
> Linux agpgart interface v0.103
> [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
> [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
> usbcore: registered new interface driver udl
> [drm] pci: virtio-vga detected at :00:01.0
> fb0: switching to virtiodrmfb from VGA16 VGA
> Console: switching to colour VGA+ 80x25
> virtio-pci :00:01.0: vgaarb: deactivate vga console
> Console: switching to colour dummy device 80x25
> [drm] features: -virgl +edid
> [drm] number of scanouts: 1
> [drm] number of cap sets: 0
> [drm] Initialized virtio_gpu 0.1.0 0 for virtio0 on minor 2
> general protection fault, probably for non-canonical address 
> 0xdc00:  [#1] PREEMPT SMP KASAN
> KASAN: null-ptr-deref in range [0x-0x0007]
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.8.0-syzkaller #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
> RIP: 0010:swiotlb_map+0x5ac/0x700 kernel/dma/swiotlb.c:683
> Code: 28 04 00 00 48 c1 ea 03 80 3c 02 00 0f 85 4d 01 00 00 4c 8b a5 18 04 00 
> 00 48 b8 00 00 00 00 00 fc ff df 4c 89 e2 48 c1 ea 03 <80> 3c 02 00 0f 85 1e 
> 01 00 00 48 8d 7d 50 4d 8b 24 24 48 b8 00 00
> RSP: :c934f3e0 EFLAGS: 00010246
> RAX: dc00 RBX:  RCX: 8162cc1d
> RDX:  RSI: 8162cc98 RDI: 88802971a470
> RBP: 88802971a048 R08: 0001 R09: 8c5dba77
> R10:  R11:  R12: 
> R13: 7ac0 R14: dc00 R15: 1000
> FS:  () GS:88802ce0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2:  CR3: 09a8d000 CR4: 00350ef0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  dma_direct_map_page include/linux/dma-direct.h:170 [inline]
>  dma_direct_map_sg+0x3bb/0x670 kernel/dma/direct.c:368
>  dma_map_sg_attrs+0xd0/0x160 kernel/dma/mapping.c:183
>  drm_gem_shmem_get_pages_sgt drivers/gpu/drm/drm_gem_shmem_helper.c:700 
> [inline]
>  drm_gem_shmem_get_pages_sgt+0x1fc/0x310 
> drivers/gpu/drm/drm_gem_shmem_helper.c:679
>  virtio_gpu_object_shmem_init drivers/gpu/drm/virtio/virtgpu_object.c:153 
> [inline]
>  virtio_gpu_object_create+0x2fd/0xa70 
> drivers/gpu/drm/virtio/virtgpu_object.c:232
>  virtio_gpu_gem_create drivers/gpu/drm/virtio/virtgpu_gem.c:45 [inline]
>  virtio_gpu_mode_dumb_create+0x298/0x530 
> drivers/gpu/drm/virtio/virtgpu_gem.c:85
>  

Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

2020-06-11 Thread Konrad Rzeszutek Wilk
On Thu, Jun 11, 2020 at 07:34:19AM -0400, Michael S. Tsirkin wrote:
> As testing shows no performance change, switch to that now.

What kind of testing? 100GiB? Low latency?



Re: [PATCH 5/5] virtio: Add bounce DMA ops

2020-04-30 Thread Konrad Rzeszutek Wilk
On Wed, Apr 29, 2020 at 06:20:48AM -0400, Michael S. Tsirkin wrote:
> On Wed, Apr 29, 2020 at 03:39:53PM +0530, Srivatsa Vaddagiri wrote:
> > That would still not work I think where swiotlb is used for pass-thr devices
> > (when private memory is fine) as well as virtio devices (when shared memory 
> > is
> > required).
> 
> So that is a separate question. When there are multiple untrusted
> devices, at the moment it looks like a single bounce buffer is used.
> 
> Which to me seems like a security problem, I think we should protect
> untrusted devices from each other.

There are two DMA pool implementations in Linux already - the TTM one for graphics
and mm/dmapool.c - could those be used instead? Or augmented at least?
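
For reference, mm/dmapool.c already gives a driver per-device pools of small, equal-sized coherent buffers; usage is roughly the following (device, sizes and names are illustrative):

#include <linux/dmapool.h>

static int example_use_dmapool(struct device *dev)
{
	struct dma_pool *pool;
	dma_addr_t handle;
	void *vaddr;

	/* Pool of 256-byte, 16-byte-aligned coherent buffers tied to 'dev'. */
	pool = dma_pool_create("example-pool", dev, 256, 16, 0);
	if (!pool)
		return -ENOMEM;

	vaddr = dma_pool_alloc(pool, GFP_KERNEL, &handle);
	if (vaddr) {
		/* ... hand 'handle' to the device, use 'vaddr' from the CPU ... */
		dma_pool_free(pool, vaddr, handle);
	}

	dma_pool_destroy(pool);
	return 0;
}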


Re: [PATCH RFC v4 00/13] virtio-mem: paravirtualized memory

2019-12-13 Thread Konrad Rzeszutek Wilk
On Thu, Dec 12, 2019 at 06:11:24PM +0100, David Hildenbrand wrote:
> This series is based on latest linux-next. The patches are located at:
> https://github.com/davidhildenbrand/linux.git virtio-mem-rfc-v4
Heya!

Would there be by any chance a virtio-spec git tree somewhere?

..snip..
> --
> 5. Future work
> --
> 
> The separate patches contain a lot of future work items. One of the next
> steps is to make memory unplug more likely to succeed - currently, there
> are no guarantees on how much memory can get unplugged again. I have


Or perhaps tell the caller why we can't and let them sort it out?
For example: "Application XYZ is mlocked. Can't offload'.




Re: [RFC PATCH v6 69/92] kvm: x86: keep the page protected if tracked by the introspection tool

2019-09-10 Thread Konrad Rzeszutek Wilk
On Fri, Aug 09, 2019 at 07:00:24PM +0300, Adalbert Lazăr wrote:
> This patch might be obsolete thanks to single-stepping.

sooo should it be skipped from this large patchset to ease
review?

> 
> Signed-off-by: Adalbert Lazăr 
> ---
>  arch/x86/kvm/x86.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2c06de73a784..06f44ce8ed07 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6311,7 +6311,8 @@ static bool reexecute_instruction(struct kvm_vcpu 
> *vcpu, gva_t cr2,
>   indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
>   spin_unlock(&vcpu->kvm->mmu_lock);
>  
> - if (indirect_shadow_pages)
> + if (indirect_shadow_pages
> + && !kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa)))
>   kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
>  
>   return true;
> @@ -6322,7 +6323,8 @@ static bool reexecute_instruction(struct kvm_vcpu 
> *vcpu, gva_t cr2,
>* and it failed try to unshadow page and re-enter the
>* guest to let CPU execute the instruction.
>*/
> - kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
> + if (!kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa)))
> + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
>  
>   /*
>* If the access faults on its page table, it can not
> @@ -6374,6 +6376,9 @@ static bool retry_instruction(struct x86_emulate_ctxt 
> *ctxt,
>   if (!vcpu->arch.mmu->direct_map)
>   gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);
>  
> + if (kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa)))
> + return false;
> +
>   kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
>  
>   return true;

Re: [PATCH v6 00/27] x86: PIE support and option to extend KASLR randomization

2019-01-31 Thread Konrad Rzeszutek Wilk
On Thu, Jan 31, 2019 at 11:24:07AM -0800, Thomas Garnier wrote:
> There has been no major concern in the latest iterations. I am interested on
> what would be the best way to slowly integrate this patchset upstream.

One question that I somehow expected in this cover letter - what
about all those lovely speculative bugs? As in, say someone hasn't
updated their machine with the Spectre v3a microcode - wouldn't they
be able to get the kernel virtual address space?

In effect rendering all this hard-work not needed?


Re: [PATCH 0/5 v5] Fix virtio-blk issue with SWIOTLB

2019-01-30 Thread Konrad Rzeszutek Wilk
On Wed, Jan 30, 2019 at 05:40:02PM +0100, Joerg Roedel wrote:
> Hi,
> 
> here is the next version of this patch-set. Previous
> versions can be found here:
> 
>   V1: https://lore.kernel.org/lkml/20190110134433.15672-1-j...@8bytes.org/
> 
>   V2: https://lore.kernel.org/lkml/20190115132257.6426-1-j...@8bytes.org/
> 
>   V3: https://lore.kernel.org/lkml/20190123163049.24863-1-j...@8bytes.org/
> 
>   V4: https://lore.kernel.org/lkml/20190129084342.26030-1-j...@8bytes.org/
> 
> The problem solved here is a limitation of the SWIOTLB implementation,
> which does not support allocations larger than 256kb.  When the
> virtio-blk driver tries to read/write a block larger than that, the
> allocation of the dma-handle fails and an IO error is reported.
> 
> Changes to v4 are:
> 
>   - Added Reviewed-by tags from Christoph
> 
>   - Added missing EXPORT_SYMBOL(_GPL) lines
> 
> Please review.

I can put it in my tree and send it to Linus .. unless folks want
to do it through a different tree?


> 
> Thanks,
> 
>   Joerg
> Joerg Roedel (5):
>   swiotlb: Introduce swiotlb_max_mapping_size()
>   swiotlb: Add is_swiotlb_active() function
>   dma: Introduce dma_max_mapping_size()
>   virtio: Introduce virtio_max_dma_size()
>   virtio-blk: Consider virtio_max_dma_size() for maximum segment size
> 
>  Documentation/DMA-API.txt|  8 
>  drivers/block/virtio_blk.c   | 10 ++
>  drivers/virtio/virtio_ring.c | 11 +++
>  include/linux/dma-mapping.h  | 16 
>  include/linux/swiotlb.h  | 11 +++
>  include/linux/virtio.h   |  2 ++
>  kernel/dma/direct.c  | 12 
>  kernel/dma/swiotlb.c | 14 ++
>  8 files changed, 80 insertions(+), 4 deletions(-)
> 
> -- 
> 2.17.1
> 
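
The shape of the fix, per the patch titles above (a sketch of how the helpers chain together; the exact call site in virtio-blk is an assumption):

/*
 * swiotlb_max_mapping_size() -> dma_max_mapping_size(dev)
 *                            -> virtio_max_dma_size(vdev)
 *                            -> blk_queue_max_segment_size(q, ...)
 */
static void virtblk_limit_seg_size(struct virtio_device *vdev,
				   struct request_queue *q)
{
	/* SWIOTLB (when active) caps a single mapping at 256 KiB; the
	 * helpers propagate that limit up to the block layer. */
	u32 max_size = virtio_max_dma_size(vdev);

	blk_queue_max_segment_size(q, max_size);
}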


Re: [PATCH 0/5 v3] Fix virtio-blk issue with SWIOTLB

2019-01-28 Thread Konrad Rzeszutek Wilk
On Mon, Jan 28, 2019 at 10:20:05AM -0500, Michael S. Tsirkin wrote:
> On Wed, Jan 23, 2019 at 04:14:53PM -0500, Konrad Rzeszutek Wilk wrote:
> > On Wed, Jan 23, 2019 at 01:51:29PM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Jan 23, 2019 at 05:30:44PM +0100, Joerg Roedel wrote:
> > > > Hi,
> > > > 
> > > > here is the third version of this patch-set. Previous
> > > > versions can be found here:
> > > > 
> > > > V1: 
> > > > https://lore.kernel.org/lkml/20190110134433.15672-1-j...@8bytes.org/
> > > > 
> > > > V2: 
> > > > https://lore.kernel.org/lkml/20190115132257.6426-1-j...@8bytes.org/
> > > > 
> > > > The problem solved here is a limitation of the SWIOTLB implementation,
> > > > which does not support allocations larger than 256kb.  When the
> > > > virtio-blk driver tries to read/write a block larger than that, the
> > > > allocation of the dma-handle fails and an IO error is reported.
> > > 
> > > 
> > > OK looks good to me.
> > > I will park this in my tree for now this way it will get
> > > testing in linux-next.
> > > Can I get an ack from DMA maintainers on the DMA bits for
> > > merging this in 5.0?
> > 
You got mine (SWIOTLB is my area).
> 
> OK so
> 
> Reviewed-by: Konrad Rzeszutek Wilk 

Yes :-)
> 
> 
> 
> > > 
> > > > Changes to v2 are:
> > > > 
> > > > * Check if SWIOTLB is active before returning its limit in
> > > >   dma_direct_max_mapping_size()
> > > > 
> > > > * Only apply the maximum segment limit in virtio-blk when
> > > >   DMA-API is used for the vring
> > > > 
> > > > Please review.
> > > > 
> > > > Thanks,
> > > > 
> > > > Joerg
> > > > 
> > > > Joerg Roedel (5):
> > > >   swiotlb: Introduce swiotlb_max_mapping_size()
> > > >   swiotlb: Add is_swiotlb_active() function
> > > >   dma: Introduce dma_max_mapping_size()
> > > >   virtio: Introduce virtio_max_dma_size()
> > > >   virtio-blk: Consider virtio_max_dma_size() for maximum segment size
> > > > 
> > > >  drivers/block/virtio_blk.c   | 10 ++
> > > >  drivers/virtio/virtio_ring.c | 10 ++
> > > >  include/linux/dma-mapping.h  | 16 
> > > >  include/linux/swiotlb.h  | 11 +++
> > > >  include/linux/virtio.h   |  2 ++
> > > >  kernel/dma/direct.c  | 11 +++
> > > >  kernel/dma/swiotlb.c | 10 ++
> > > >  7 files changed, 66 insertions(+), 4 deletions(-)
> > > > 
> > > > -- 
> > > > 2.17.1


Re: [PATCH 0/5 v3] Fix virtio-blk issue with SWIOTLB

2019-01-23 Thread Konrad Rzeszutek Wilk
On Wed, Jan 23, 2019 at 01:51:29PM -0500, Michael S. Tsirkin wrote:
> On Wed, Jan 23, 2019 at 05:30:44PM +0100, Joerg Roedel wrote:
> > Hi,
> > 
> > here is the third version of this patch-set. Previous
> > versions can be found here:
> > 
> > V1: https://lore.kernel.org/lkml/20190110134433.15672-1-j...@8bytes.org/
> > 
> > V2: https://lore.kernel.org/lkml/20190115132257.6426-1-j...@8bytes.org/
> > 
> > The problem solved here is a limitation of the SWIOTLB implementation,
> > which does not support allocations larger than 256kb.  When the
> > virtio-blk driver tries to read/write a block larger than that, the
> > allocation of the dma-handle fails and an IO error is reported.
> 
> 
> OK looks good to me.
> I will park this in my tree for now this way it will get
> testing in linux-next.
> Can I get an ack from DMA maintainers on the DMA bits for
> merging this in 5.0?

You got mine (SWIOTLB is my area).
> 
> > Changes to v2 are:
> > 
> > * Check if SWIOTLB is active before returning its limit in
> >   dma_direct_max_mapping_size()
> > 
> > * Only apply the maximum segment limit in virtio-blk when
> >   DMA-API is used for the vring
> > 
> > Please review.
> > 
> > Thanks,
> > 
> > Joerg
> > 
> > Joerg Roedel (5):
> >   swiotlb: Introduce swiotlb_max_mapping_size()
> >   swiotlb: Add is_swiotlb_active() function
> >   dma: Introduce dma_max_mapping_size()
> >   virtio: Introduce virtio_max_dma_size()
> >   virtio-blk: Consider virtio_max_dma_size() for maximum segment size
> > 
> >  drivers/block/virtio_blk.c   | 10 ++
> >  drivers/virtio/virtio_ring.c | 10 ++
> >  include/linux/dma-mapping.h  | 16 
> >  include/linux/swiotlb.h  | 11 +++
> >  include/linux/virtio.h   |  2 ++
> >  kernel/dma/direct.c  | 11 +++
> >  kernel/dma/swiotlb.c | 10 ++
> >  7 files changed, 66 insertions(+), 4 deletions(-)
> > 
> > -- 
> > 2.17.1


Re: [PATCH 0/5 v3] Fix virtio-blk issue with SWIOTLB

2019-01-23 Thread Konrad Rzeszutek Wilk
On Wed, Jan 23, 2019 at 05:30:44PM +0100, Joerg Roedel wrote:
> Hi,
> 
> here is the third version of this patch-set. Previous
> versions can be found here:
> 
>   V1: https://lore.kernel.org/lkml/20190110134433.15672-1-j...@8bytes.org/
> 
>   V2: https://lore.kernel.org/lkml/20190115132257.6426-1-j...@8bytes.org/
> 
> The problem solved here is a limitation of the SWIOTLB implementation,
> which does not support allocations larger than 256kb.  When the
> virtio-blk driver tries to read/write a block larger than that, the
> allocation of the dma-handle fails and an IO error is reported.
> 
> Changes to v2 are:
> 
>   * Check if SWIOTLB is active before returning its limit in
> dma_direct_max_mapping_size()
> 
>   * Only apply the maximum segment limit in virtio-blk when
> DMA-API is used for the vring
> 
> Please review.
> 
> Thanks,
> 
>   Joerg
> 
> Joerg Roedel (5):
>   swiotlb: Introduce swiotlb_max_mapping_size()
>   swiotlb: Add is_swiotlb_active() function
>   dma: Introduce dma_max_mapping_size()
>   virtio: Introduce virtio_max_dma_size()
>   virtio-blk: Consider virtio_max_dma_size() for maximum segment size
> 
>  drivers/block/virtio_blk.c   | 10 ++
>  drivers/virtio/virtio_ring.c | 10 ++

The kvm-devel mailing list should have been copied on those.

When you do, can you please put 'Reviewed-by: Konrad Rzeszutek Wilk
' on all of them?

Thank you!

P.S.
Christopher, I am assuming you are OK with this idea, if so - and once
the owners of 'virtio' Ack, do you want to put it in my tree?

Thanks!
>  include/linux/dma-mapping.h  | 16 
>  include/linux/swiotlb.h  | 11 +++
>  include/linux/virtio.h   |  2 ++
>  kernel/dma/direct.c  | 11 +++
>  kernel/dma/swiotlb.c | 10 ++
>  7 files changed, 66 insertions(+), 4 deletions(-)


> 
> -- 
> 2.17.1
> 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/3] swiotlb: Export maximum allocation size

2019-01-16 Thread Konrad Rzeszutek Wilk
On Fri, Jan 11, 2019 at 10:12:31AM +0100, Joerg Roedel wrote:
> On Thu, Jan 10, 2019 at 12:02:05PM -0500, Konrad Rzeszutek Wilk wrote:
> > Why not use swiotlb_nr_tbl ? That is how drivers/gpu/drm use to figure if 
> > they
> > need to limit the size of pages.
> 
> That function just exports the overall size of the swiotlb aperture, no?
> What I need here is the maximum size for a single mapping.

Yes. The other drivers just assumed that if SWIOTLB is in use they should use
the smaller size by default (as in, they knew the limitation).

But I agree it would be better to have something smarter - and also to convert
the DRM drivers to piggyback on this.

Or, alternatively, we could make SWIOTLB handle bigger sizes.
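
For illustration, a driver consuming such a limit would look roughly like
this - a hedged sketch only, using the dma_max_mapping_size() style helper
this series is heading towards (the foo_* names are made up):

	/* Sketch: clamp the request queue so no single segment exceeds
	 * what the DMA/SWIOTLB layer is able to bounce. */
	static void foo_apply_dma_limit(struct device *dma_dev,
					struct request_queue *q)
	{
		size_t max = dma_max_mapping_size(dma_dev);

		if (max < SIZE_MAX)
			blk_queue_max_segment_size(q,
					min_t(size_t, max, UINT_MAX));
	}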
> 
> Regards,
> 
>   Joerg
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/3] swiotlb: Export maximum allocation size

2019-01-10 Thread Konrad Rzeszutek Wilk
On Thu, Jan 10, 2019 at 02:44:31PM +0100, Joerg Roedel wrote:
> From: Joerg Roedel 
> 
> The SWIOTLB implementation has a maximum size it can
> allocate dma-handles for. This needs to be exported so that
> device drivers don't try to allocate larger chunks.
> 
> This is especially important for block device drivers like
> virtio-blk, that might do DMA through SWIOTLB.

Why not use swiotlb_nr_tbl()? That is how the drivers/gpu/drm drivers figure out
whether they need to limit the size of pages.
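
From memory, the DRM/TTM pattern referred to here is roughly the following -
a hedged sketch of the radeon/amdgpu populate path, not an exact quote
(gtt and rdev are the driver's TTM object and device):

	#ifdef CONFIG_SWIOTLB
		if (swiotlb_nr_tbl()) {
			/* SWIOTLB is active: use the DMA-aware TTM pool so
			 * allocations stay within what can be bounced */
			return ttm_dma_populate(&gtt->ttm, rdev->dev);
		}
	#endif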
> 
> Signed-off-by: Joerg Roedel 
> ---
>  include/linux/swiotlb.h | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 7c007ed7505f..0bcc80a97036 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -72,6 +72,11 @@ static inline bool is_swiotlb_buffer(phys_addr_t paddr)
>   return paddr >= io_tlb_start && paddr < io_tlb_end;
>  }
>  
> +static inline size_t swiotlb_max_alloc_size(void)
> +{
> + return ((1UL << IO_TLB_SHIFT) * IO_TLB_SEGSIZE);
> +}
> +
>  bool swiotlb_map(struct device *dev, phys_addr_t *phys, dma_addr_t *dma_addr,
>   size_t size, enum dma_data_direction dir, unsigned long attrs);
>  void __init swiotlb_exit(void);
> @@ -95,6 +100,13 @@ static inline unsigned int swiotlb_max_segment(void)
>  {
>   return 0;
>  }
> +
> +static inline size_t swiotlb_max_alloc_size(void)
> +{
> + /* There is no limit when SWIOTLB isn't used */
> + return ~0UL;
> +}
> +
>  #endif /* CONFIG_SWIOTLB */
>  
>  extern void swiotlb_print_info(void);
> -- 
> 2.17.1
> 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address

2018-12-16 Thread Konrad Rzeszutek Wilk
.giant snip..
> > +   npinned = get_user_pages_fast(uaddr, npages, write, pages);
> > +   if (npinned != npages)
> > +   goto err;
> > +
> 
> As I said I have doubts about the whole approach, but this
> implementation in particular isn't a good idea
> as it keeps the page around forever.
> So no THP, no NUMA rebalancing, userspace-controlled
> amount of memory locked up and not accounted for.
> 
> Don't get me wrong it's a great patch in an ideal world.
> But then in an ideal world no barriers smap etc are necessary at all.

So... any suggestions on how this could be accepted? As in, other ways
where we still get vmap but the issues you mentioned are not a problem for you?

Thanks!
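
For anyone skimming the thread, the lifetime concern boils down to a pattern
like this (hedged sketch, not the actual patch; error handling omitted):

	struct page **pages;	/* array sized for npages elsewhere */
	void *vaddr;
	int npinned, i;

	/* setup: pin the metadata pages and map them into the kernel */
	npinned = get_user_pages_fast(uaddr, npages, write, pages);
	vaddr = vmap(pages, npinned, VM_MAP, PAGE_KERNEL);

	/* ... the fast path then dereferences vaddr directly ... */

	/* teardown, potentially much later: only here do the pages become
	 * movable again, which is what blocks THP/NUMA balancing meanwhile */
	vunmap(vaddr);
	for (i = 0; i < npinned; i++)
		put_page(pages[i]);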
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: Memory Read Only Enforcement: VMM assisted kernel rootkit mitigation for KVM

2018-07-19 Thread Konrad Rzeszutek Wilk
On Thu, Jul 19, 2018 at 11:37:59PM +0200, Ahmed Abd El Mawgood wrote:
> Hi,
> 
> This is my first set of patches that works as I would expect, and the
> third revision I sent to mailing lists.
> 
> Following up with my previous discussions about kernel rootkit mitigation
> via placing R/O protection on critical data structure, static data,
> privileged registers with static content. These patches present the
> first part where it is only possible to place these protections on
> memory pages. Feature-wise, this set of patches is incomplete in the sense of:
> - They still don't protect privileged registers
> - They don't protect guest TLB from malicious gva -> gpa page mappings.
> But they provide sketches for a basic working design. Note that I am totally
> noob and it took lots of time and effort to get to this point. So sorry in
> advance if I overlooked something.

This reminds me of the Xen PV page model. That is, the hypervisor is the one
auditing the page tables, and the guest's page-table pages are read-only.

Ditto for the IDT, GDT, etc. Gosh, did you by chance look at how the
Xen PV mechanism is done? It may provide the protection you are looking for.
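
As a rough illustration of that model (a hedged sketch from memory of the
Xen PV interface, simplified): the guest cannot write a PTE directly, it has
to ask the hypervisor, and that hypercall is where the auditing happens.

	struct mmu_update u;

	/* ptep/pteval: the PTE slot and the new value being written.
	 * The page-table page is mapped read-only in the guest, so the
	 * write goes through a hypercall that Xen validates first. */
	u.ptr = virt_to_machine(ptep).maddr | MMU_NORMAL_PT_UPDATE;
	u.val = pte_val_ma(pteval);
	HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF);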

CC-ing xen-devel.
> 
> [PATCH 1/3] [RFC V3] KVM: X86: Memory ROE documentation
> [PATCH 2/3] [RFC V3] KVM: X86: Adding arbitrary data pointer in kvm memslot 
> itterator functions
> [PATCH 3/3] [RFC V3] KVM: X86: Adding skeleton for Memory ROE
> 
> Summery:
> 
>  Documentation/virtual/kvm/hypercalls.txt |  14 
>  arch/x86/include/asm/kvm_host.h  |  11 ++-
>  arch/x86/kvm/Kconfig |   7 ++
>  arch/x86/kvm/mmu.c   | 127 
> ++-
>  arch/x86/kvm/x86.c   |  82 +++-
>  include/linux/kvm_host.h |   3 +
>  include/uapi/linux/kvm_para.h|   1 +
>  virt/kvm/kvm_main.c  |  29 ++-
>  8 files changed, 232 insertions(+), 42 deletions(-)
> 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC V3 PATCH 0/8] Packed ring for vhost

2018-04-23 Thread Konrad Rzeszutek Wilk
On Mon, Apr 23, 2018 at 10:59:43PM +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 23, 2018 at 03:31:20PM -0400, Konrad Rzeszutek Wilk wrote:
> > On Mon, Apr 23, 2018 at 01:34:52PM +0800, Jason Wang wrote:
> > > Hi all:
> > > 
> > > This RFC implement packed ring layout. The code were tested with
> > > Tiwei's RFC V2 a thttps://lkml.org/lkml/2018/4/1/48. Some fixups and
> > > tweaks were needed on top of Tiwei's code to make it run. TCP stream
> > > and pktgen does not show obvious difference compared with split ring.
> > 
> > I have to ask then - what is the benefit of this?
> 
> You can use this with dpdk within guest.
> The DPDK version did see a performance improvement so hopefully with

Is there a link to this performance improvement document?

> future versions this will too.
> 
> > > 
> > > Changes from V2:
> > > - do not use & in checking desc_event_flags
> > > - off should be most significant bit
> > > - remove the workaround of mergeable buffer for dpdk prototype
> > > - id should be in the last descriptor in the chain
> > > - keep _F_WRITE for write descriptor when adding used
> > > - device flags updating should use ADDR_USED type
> > > - return error on unexpected unavail descriptor in a chain
> > > - return false in vhost_ve_avail_empty is descriptor is available
> > > - track last seen avail_wrap_counter
> > > - correctly examine available descriptor in get_indirect_packed()
> > > - vhost_idx_diff should return u16 instead of bool
> > > 
> > > Changes from V1:
> > > 
> > > - Refactor vhost used elem code to avoid open coding on used elem
> > > - Event suppression support (compile test only).
> > > - Indirect descriptor support (compile test only).
> > > - Zerocopy support.
> > > - vIOMMU support.
> > > - SCSI/VSOCK support (compile test only).
> > > - Fix several bugs
> > > 
> > > For simplicity, I don't implement batching or other optimizations.
> > > 
> > > Please review.
> > > 
> > > Jason Wang (8):
> > >   vhost: move get_rx_bufs to vhost.c
> > >   vhost: hide used ring layout from device
> > >   vhost: do not use vring_used_elem
> > >   vhost_net: do not explicitly manipulate vhost_used_elem
> > >   vhost: vhost_put_user() can accept metadata type
> > >   virtio: introduce packed ring defines
> > >   vhost: packed ring support
> > >   vhost: event suppression for packed ring
> > > 
> > >  drivers/vhost/net.c| 136 ++
> > >  drivers/vhost/scsi.c   |  62 +--
> > >  drivers/vhost/vhost.c  | 824 
> > > ++---
> > >  drivers/vhost/vhost.h  |  47 ++-
> > >  drivers/vhost/vsock.c  |  42 +-
> > >  include/uapi/linux/virtio_config.h |   9 +
> > >  include/uapi/linux/virtio_ring.h   |  32 ++
> > >  7 files changed, 926 insertions(+), 226 deletions(-)
> > > 
> > > -- 
> > > 2.7.4
> > > 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC V3 PATCH 0/8] Packed ring for vhost

2018-04-23 Thread Konrad Rzeszutek Wilk
On Mon, Apr 23, 2018 at 01:34:52PM +0800, Jason Wang wrote:
> Hi all:
> 
> This RFC implement packed ring layout. The code were tested with
> Tiwei's RFC V2 a thttps://lkml.org/lkml/2018/4/1/48. Some fixups and
> tweaks were needed on top of Tiwei's code to make it run. TCP stream
> and pktgen does not show obvious difference compared with split ring.

I have to ask then - what is the benefit of this?
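
For readers who have not seen the proposal: the packed layout replaces the
separate avail/used rings with a single descriptor array, roughly like this
(a hedged sketch of the then-draft virtio 1.1 format):

	struct vring_packed_desc {
		__le64 addr;	/* buffer guest-physical address */
		__le32 len;	/* buffer length */
		__le16 id;	/* buffer id, echoed back on completion */
		__le16 flags;	/* AVAIL/USED wrap bits, NEXT/WRITE/INDIRECT */
	};

	/* driver and device each keep a wrap counter; a descriptor is
	 * available when its AVAIL bit matches the driver's wrap counter
	 * and its USED bit does not */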

> 
> Changes from V2:
> - do not use & in checking desc_event_flags
> - off should be most significant bit
> - remove the workaround of mergeable buffer for dpdk prototype
> - id should be in the last descriptor in the chain
> - keep _F_WRITE for write descriptor when adding used
> - device flags updating should use ADDR_USED type
> - return error on unexpected unavail descriptor in a chain
> - return false in vhost_ve_avail_empty is descriptor is available
> - track last seen avail_wrap_counter
> - correctly examine available descriptor in get_indirect_packed()
> - vhost_idx_diff should return u16 instead of bool
> 
> Changes from V1:
> 
> - Refactor vhost used elem code to avoid open coding on used elem
> - Event suppression support (compile test only).
> - Indirect descriptor support (compile test only).
> - Zerocopy support.
> - vIOMMU support.
> - SCSI/VSOCK support (compile test only).
> - Fix several bugs
> 
> For simplicity, I don't implement batching or other optimizations.
> 
> Please review.
> 
> Jason Wang (8):
>   vhost: move get_rx_bufs to vhost.c
>   vhost: hide used ring layout from device
>   vhost: do not use vring_used_elem
>   vhost_net: do not explicitly manipulate vhost_used_elem
>   vhost: vhost_put_user() can accept metadata type
>   virtio: introduce packed ring defines
>   vhost: packed ring support
>   vhost: event suppression for packed ring
> 
>  drivers/vhost/net.c| 136 ++
>  drivers/vhost/scsi.c   |  62 +--
>  drivers/vhost/vhost.c  | 824 
> ++---
>  drivers/vhost/vhost.h  |  47 ++-
>  drivers/vhost/vsock.c  |  42 +-
>  include/uapi/linux/virtio_config.h |   9 +
>  include/uapi/linux/virtio_ring.h   |  32 ++
>  7 files changed, 926 insertions(+), 226 deletions(-)
> 
> -- 
> 2.7.4
> 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH V2 0/8] Packed ring for vhost

2018-03-26 Thread Konrad Rzeszutek Wilk
On Mon, Mar 26, 2018 at 11:38:45AM +0800, Jason Wang wrote:
> Hi all:
> 
> This RFC implement packed ring layout. The code were tested with pmd
> implement by Jens at
> http://dpdk.org/ml/archives/dev/2018-January/089417.html. Minor change
> was needed for pmd codes to kick virtqueue since it assumes a busy
> polling backend.
> 
> Test were done between localhost and guest. Testpmd (rxonly) in guest
> reports 2.4Mpps. Testpmd (txonly) repots about 2.1Mpps.

And how does it compare to the older ring layout?

> 
> Notes: The event suppression /indirect descriptor support is complied
>test only because of lacked driver support.
> 
> Changes from V1:
> 
> - Refactor vhost used elem code to avoid open coding on used elem
> - Event suppression support (compile test only).
> - Indirect descriptor support (compile test only).
> - Zerocopy support.
> - vIOMMU support.
> - SCSI/VSOCK support (compile test only).
> - Fix several bugs
> 
> For simplicity, I don't implement batching or other optimizations.
> 
> Please review.
> 
> Thanks
> 
> Jason Wang (8):
>   vhost: move get_rx_bufs to vhost.c
>   vhost: hide used ring layout from device
>   vhost: do not use vring_used_elem
>   vhost_net: do not explicitly manipulate vhost_used_elem
>   vhost: vhost_put_user() can accept metadata type
>   virtio: introduce packed ring defines
>   vhost: packed ring support
>   vhost: event suppression for packed ring
> 
>  drivers/vhost/net.c| 138 ++-
>  drivers/vhost/scsi.c   |  62 +--
>  drivers/vhost/vhost.c  | 818 
> ++---
>  drivers/vhost/vhost.h  |  46 ++-
>  drivers/vhost/vsock.c  |  42 +-
>  include/uapi/linux/virtio_config.h |   9 +
>  include/uapi/linux/virtio_ring.h   |  32 ++
>  7 files changed, 921 insertions(+), 226 deletions(-)
> 
> -- 
> 2.7.4
> 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH RFC v3 0/6] x86/idle: add halt poll support

2017-11-15 Thread Konrad Rzeszutek Wilk
On Mon, Nov 13, 2017 at 06:05:59PM +0800, Quan Xu wrote:
> From: Yang Zhang 
> 
> Some latency-intensive workload have seen obviously performance
> drop when running inside VM. The main reason is that the overhead
> is amplified when running inside VM. The most cost I have seen is
> inside idle path.

Meaning a VMEXIT b/c it is a 'halt' operation? And then going
back into the guest (VMRESUME) takes time. And hence your latency gets
all whacked b/c of this?

So if I understand correctly - you want to use your _full_ timeslice (of the
guest) without ever (or as rarely as possible) going into the hypervisor?

Which means in effect you don't care about power-saving or CPUfreq
savings, you just want to eat the full CPU as a snack?

> 
> This patch introduces a new mechanism to poll for a while before
> entering idle state. If schedule is needed during poll, then we
> don't need to goes through the heavy overhead path.

Schedule of what? The guest or the host?
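
For context, the mechanism being proposed amounts to something like this
hedged sketch (poll_ns is a made-up tunable, not the series' actual knob):

	static unsigned long poll_ns = 200000;	/* hypothetical 200us budget */

	static void guest_idle_poll_then_halt(void)
	{
		u64 deadline = ktime_get_ns() + poll_ns;

		while (ktime_get_ns() < deadline) {
			if (need_resched())
				return;	/* work arrived: skip the halt and its VMEXIT */
			cpu_relax();
		}
		safe_halt();	/* nothing came in: take the expensive exit after all */
	}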

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH v2 1/7] x86/paravirt: Add pv_idle_ops to paravirt ops

2017-08-29 Thread Konrad Rzeszutek Wilk
On Tue, Aug 29, 2017 at 11:46:35AM +, Yang Zhang wrote:
> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in
> idle path which will polling for a while before we enter the real idle
> state.
> 
> In virtualization, idle path includes several heavy operations
> includes timer access(LAPIC timer or TSC deadline timer) which will hurt
> performance especially for latency intensive workload like message
> passing task. The cost is mainly come from the vmexit which is a
> hardware context switch between VM and hypervisor. Our solution is to
> poll for a while and do not enter real idle path if we can get the
> schedule event during polling.
> 
> Poll may cause the CPU waste so we adopt a smart polling mechanism to
> reduce the useless poll.
> 
> Signed-off-by: Yang Zhang 
> Signed-off-by: Quan Xu 
> Cc: Jeremy Fitzhardinge 
> Cc: Chris Wright 
> Cc: Alok Kataria 
> Cc: Rusty Russell 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: x...@kernel.org
> Cc: Peter Zijlstra 
> Cc: Andy Lutomirski 
> Cc: "Kirill A. Shutemov" 
> Cc: Pan Xinhui 
> Cc: Kees Cook 
> Cc: virtualization@lists.linux-foundation.org
> Cc: linux-ker...@vger.kernel.org

Adding xen-devel.

Juergen, we really should replace Jeremy's name with xen-devel or
your name.. Wasn't there a patch by you that took over some of the
maintainership for it?
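
For whoever picks this up: a hypervisor backend would presumably wire into
the new hook along these lines (a hedged sketch; kvm_guest_idle_poll is a
made-up name, not part of the patch below):

	static void kvm_guest_idle_poll(void)
	{
		/* spin briefly, checking need_resched(), before the CPU is
		 * allowed to go into the real (VMEXIT-heavy) idle path */
	}

	static void __init kvm_guest_idle_init(void)
	{
		pv_idle_ops.poll = kvm_guest_idle_poll;
	}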

> ---
>  arch/x86/include/asm/paravirt.h   | 5 +
>  arch/x86/include/asm/paravirt_types.h | 6 ++
>  arch/x86/kernel/paravirt.c| 6 ++
>  3 files changed, 17 insertions(+)
> 
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index 9ccac19..6d46760 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -202,6 +202,11 @@ static inline unsigned long long paravirt_read_pmc(int 
> counter)
>  
>  #define rdpmcl(counter, val) ((val) = paravirt_read_pmc(counter))
>  
> +static inline void paravirt_idle_poll(void)
> +{
> + PVOP_VCALL0(pv_idle_ops.poll);
> +}
> +
>  static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned 
> entries)
>  {
>   PVOP_VCALL2(pv_cpu_ops.alloc_ldt, ldt, entries);
> diff --git a/arch/x86/include/asm/paravirt_types.h 
> b/arch/x86/include/asm/paravirt_types.h
> index 9ffc36b..cf45726 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -324,6 +324,10 @@ struct pv_lock_ops {
>   struct paravirt_callee_save vcpu_is_preempted;
>  } __no_randomize_layout;
>  
> +struct pv_idle_ops {
> + void (*poll)(void);
> +} __no_randomize_layout;
> +
>  /* This contains all the paravirt structures: we get a convenient
>   * number for each function using the offset which we use to indicate
>   * what to patch. */
> @@ -334,6 +338,7 @@ struct paravirt_patch_template {
>   struct pv_irq_ops pv_irq_ops;
>   struct pv_mmu_ops pv_mmu_ops;
>   struct pv_lock_ops pv_lock_ops;
> + struct pv_idle_ops pv_idle_ops;
>  } __no_randomize_layout;
>  
>  extern struct pv_info pv_info;
> @@ -343,6 +348,7 @@ struct paravirt_patch_template {
>  extern struct pv_irq_ops pv_irq_ops;
>  extern struct pv_mmu_ops pv_mmu_ops;
>  extern struct pv_lock_ops pv_lock_ops;
> +extern struct pv_idle_ops pv_idle_ops;
>  
>  #define PARAVIRT_PATCH(x)\
>   (offsetof(struct paravirt_patch_template, x) / sizeof(void *))
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index bc0a849..1b5b247 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -128,6 +128,7 @@ static void *get_call_destination(u8 type)
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>   .pv_lock_ops = pv_lock_ops,
>  #endif
> + .pv_idle_ops = pv_idle_ops,
>   };
>   return *((void **)&tmpl + type);
>  }
> @@ -312,6 +313,10 @@ struct pv_time_ops pv_time_ops = {
>   .steal_clock = native_steal_clock,
>  };
>  
> +struct pv_idle_ops pv_idle_ops = {
> + .poll = paravirt_nop,
> +};
> +
>  __visible struct pv_irq_ops pv_irq_ops = {
>   .save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
>   .restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
> @@ -471,3 +476,4 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
>  EXPORT_SYMBOL(pv_mmu_ops);
>  EXPORT_SYMBOL_GPL(pv_info);
>  EXPORT_SYMBOL(pv_irq_ops);
> +EXPORT_SYMBOL(pv_idle_ops);
> -- 
> 1.8.3.1
> 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH v6 10/11] x86, xen: support vcpu preempted check

2016-10-28 Thread Konrad Rzeszutek Wilk
On Fri, Oct 28, 2016 at 04:11:26AM -0400, Pan Xinhui wrote:
> From: Juergen Gross 
> 
> Support the vcpu_is_preempted() functionality under Xen. This will
> enhance lock performance on overcommitted hosts (more runnable vcpus
> than physical cpus in the system) as doing busy waits for preempted
> vcpus will hurt system performance far worse than early yielding.
> 
> A quick test (4 vcpus on 1 physical cpu doing a parallel build job
> with "make -j 8") reduced system time by about 5% with this patch.
> 
> Signed-off-by: Juergen Gross 
> Signed-off-by: Pan Xinhui 
> ---
>  arch/x86/xen/spinlock.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
> index 3d6e006..74756bb 100644
> --- a/arch/x86/xen/spinlock.c
> +++ b/arch/x86/xen/spinlock.c
> @@ -114,7 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
>   per_cpu(irq_name, cpu) = NULL;
>  }
>  
> -

Spurious change.
>  /*
>   * Our init of PV spinlocks is split in two init functions due to us
>   * using paravirt patching and jump labels patching and having to do
> @@ -137,6 +136,8 @@ void __init xen_init_spinlocks(void)
>   pv_lock_ops.queued_spin_unlock = 
> PV_CALLEE_SAVE(__pv_queued_spin_unlock);
>   pv_lock_ops.wait = xen_qlock_wait;
>   pv_lock_ops.kick = xen_qlock_kick;
> +
> + pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
>  }
>  
>  /*
> -- 
> 2.4.11
> 
> 
> ___
> Xen-devel mailing list
> xen-de...@lists.xen.org
> https://lists.xen.org/xen-devel
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH v6 00/11] implement vcpu preempted check

2016-10-28 Thread Konrad Rzeszutek Wilk
On Fri, Oct 28, 2016 at 04:11:16AM -0400, Pan Xinhui wrote:
> change from v5:
>   spilt x86/kvm patch into guest/host part.
>   introduce kvm_write_guest_offset_cached.
>   fix some typos.
>   rebase patch onto 4.9.2
> change from v4:
>   spilt x86 kvm vcpu preempted check into two patches.
>   add documentation patch.
>   add x86 vcpu preempted check patch under xen
>   add s390 vcpu preempted check patch 
> change from v3:
>   add x86 vcpu preempted check patch
> change from v2:
>   no code change, fix typos, update some comments
> change from v1:
>   a simplier definition of default vcpu_is_preempted
>   skip mahcine type check on ppc, and add config. remove dedicated macro.
>   add one patch to drop overload of rwsem_spin_on_owner and 
> mutex_spin_on_owner. 
>   add more comments
>   thanks boqun and Peter's suggestion.
> 
> This patch set aims to fix lock holder preemption issues.

Do you have a git tree with these patches?
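
For context while reviewing: the new interface ends up being consulted in the
owner-spinning loops, roughly like this hedged and simplified sketch of the
mutex/rwsem change (not the exact patch text):

	static bool spin_on_owner(struct mutex *lock, struct task_struct *owner)
	{
		bool ret = true;

		rcu_read_lock();
		while (READ_ONCE(lock->owner) == owner) {
			/*
			 * Stop spinning if we have other work, or if the lock
			 * owner's vCPU has itself been preempted by the host -
			 * spinning on a descheduled owner is pure waste.
			 */
			if (need_resched() || vcpu_is_preempted(task_cpu(owner))) {
				ret = false;
				break;
			}
			cpu_relax();
		}
		rcu_read_unlock();
		return ret;
	}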

> 
> test-case:
> perf record -a perf bench sched messaging -g 400 -p && perf report
> 
> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
> 
> We introduce interface bool vcpu_is_preempted(int cpu) and use it in some spin
> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
> These spin_on_onwer variant also cause rcu stall before we apply this patch 
> set
> 
> We also have observed some performace improvements in uninx benchmark tests.
> 
> PPC test result:
> 1 copy - 0.94%
> 2 copy - 7.17%
> 4 copy - 11.9%
> 8 copy -  3.04%
> 16 copy - 15.11%
> 
> details below:
> Without patch:
> 
> 1 copy - File Write 4096 bufsize 8000 maxblocks  2188223.0 KBps  (30.0 s, 
> 1 samples)
> 2 copy - File Write 4096 bufsize 8000 maxblocks  1804433.0 KBps  (30.0 s, 
> 1 samples)
> 4 copy - File Write 4096 bufsize 8000 maxblocks  1237257.0 KBps  (30.0 s, 
> 1 samples)
> 8 copy - File Write 4096 bufsize 8000 maxblocks  1032658.0 KBps  (30.0 s, 
> 1 samples)
> 16 copy - File Write 4096 bufsize 8000 maxblocks   768000.0 KBps  (30.1 
> s, 1 samples)
> 
> With patch: 
> 
> 1 copy - File Write 4096 bufsize 8000 maxblocks  2209189.0 KBps  (30.0 s, 
> 1 samples)
> 2 copy - File Write 4096 bufsize 8000 maxblocks  1943816.0 KBps  (30.0 s, 
> 1 samples)
> 4 copy - File Write 4096 bufsize 8000 maxblocks  1405591.0 KBps  (30.0 s, 
> 1 samples)
> 8 copy - File Write 4096 bufsize 8000 maxblocks  1065080.0 KBps  (30.0 s, 
> 1 samples)
> 16 copy - File Write 4096 bufsize 8000 maxblocks   904762.0 KBps  (30.0 
> s, 1 samples)
> 
> X86 test result:
>   test-case   after-patch   before-patch
> Execl Throughput   |18307.9 lps  |11701.6 lps 
> File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
> File Copy 256 bufsize 500 maxblocks|   367555.6 KBps |   222867.7 KBps
> File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
> Pipe Throughput| 11872208.7 lps  | 11855628.9 lps 
> Pipe-based Context Switching   |  1495126.5 lps  |  1490533.9 lps 
> Process Creation   |29881.2 lps  |28572.8 lps 
> Shell Scripts (1 concurrent)   |23224.3 lpm  |22607.4 lpm 
> Shell Scripts (8 concurrent)   | 3531.4 lpm  | 3211.9 lpm 
> System Call Overhead   | 10385653.0 lps  | 10419979.0 lps 
> 
> Christian Borntraeger (1):
>   s390/spinlock: Provide vcpu_is_preempted
> 
> Juergen Gross (1):
>   x86, xen: support vcpu preempted check
> 
> Pan Xinhui (9):
>   kernel/sched: introduce vcpu preempted check interface
>   locking/osq: Drop the overload of osq_lock()
>   kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
>   powerpc/spinlock: support vcpu preempted check
>   x86, paravirt: Add interface to support kvm/xen vcpu preempted check
>   KVM: Introduce kvm_write_guest_offset_cached
>   x86, kvm/x86.c: support vcpu preempted check
>   x86, kernel/kvm.c: support vcpu preempted check
>   Documentation: virtual: kvm: Support vcpu preempted check
> 
>  Documentation/virtual/kvm/msr.txt |  9 -
>  arch/powerpc/include/asm/spinlock.h   |  8 
>  arch/s390/include/asm/spinlock.h  |  8 
>  arch/s390/kernel/smp.c|  9 +++--
>  arch/s390/lib/spinlock.c  | 25 -
>  arch/x86/include/asm/paravirt_types.h |  2 ++
>  arch/x86/include/asm/spinlock.h   |  8 
>  arch/x86/include/uapi/asm/kvm_para.h  |  4 +++-
>  

Re: [PATCH 0/9] qspinlock stuff -v15

2015-03-27 Thread Konrad Rzeszutek Wilk
On Thu, Mar 26, 2015 at 09:21:53PM +0100, Peter Zijlstra wrote:
 On Wed, Mar 25, 2015 at 03:47:39PM -0400, Konrad Rzeszutek Wilk wrote:
  Ah nice. That could be spun out as a seperate patch to optimize the existing
  ticket locks I presume.
 
 Yes I suppose we can do something similar for the ticket and patch in
 the right increment. We'd need to restructure the code a bit, but
 its not fundamentally impossible.
 
 We could equally apply the head hashing to the current ticket
 implementation and avoid the current bitmap iteration.
 
  Now with the old pv ticketlock code an vCPU would only go to sleep once and
  be woken up when it was its turn. With this new code it is woken up twice 
  (and twice it goes to sleep). With an overcommit scenario this would imply
  that we will have at least twice as many VMEXIT as with the previous code.
 
 An astute observation, I had not considered that.

Thank you.
 
  I presume when you did benchmarking this did not even register? Thought
  I wonder if it would if you ran the benchmark for a week or so.
 
 You presume I benchmarked :-) I managed to boot something virt and run
 hackbench in it. I wouldn't know a representative virt setup if I ran
 into it.
 
 The thing is, we want this qspinlock for real hardware because its
 faster and I really want to avoid having to carry two spinlock
 implementations -- although I suppose that if we really really have to
 we could.

In some way you already have that - for virtualized environments where you
don't have a PV mechanism you just use the byte spinlock - which is good.

And switching to the PV ticketlock implementation after boot.. ugh. I feel your
pain.

What if you used a PV bytelock implementation? The code you posted already
'sprays' all the vCPUs to wake up. And that is exactly what you need for PV
bytelocks - well, you only need to wake up the vCPUs that have gone to sleep
waiting on a specific 'struct spinlock' and just stash those in a per-CPU
area. The old Xen spinlock code (before 3.11?) had this.

Just an idea, though.
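
From (fuzzy) memory, the old Xen bytelock unlock slow path did roughly this -
a hedged sketch, not the exact pre-3.11 source:

	/* each vCPU records which lock it went to sleep on */
	static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);

	static void xen_spin_unlock_slow(struct xen_spinlock *xl)
	{
		int cpu;

		/* only kick vCPUs actually blocked on this particular lock */
		for_each_online_cpu(cpu) {
			if (per_cpu(lock_spinners, cpu) == xl)
				xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
		}
	}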
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 0/9] qspinlock stuff -v15

2015-03-25 Thread Konrad Rzeszutek Wilk
On Mon, Mar 16, 2015 at 02:16:13PM +0100, Peter Zijlstra wrote:
 Hi Waiman,
 
 As promised; here is the paravirt stuff I did during the trip to BOS last 
 week.
 
 All the !paravirt patches are more or less the same as before (the only real
 change is the copyright lines in the first patch).
 
 The paravirt stuff is 'simple' and KVM only -- the Xen code was a little more
 convoluted and I've no real way to test that but it should be stright fwd to
 make work.
 
 I ran this using the virtme tool (thanks Andy) on my laptop with a 4x
 overcommit on vcpus (16 vcpus as compared to the 4 my laptop actually has) and
 it both booted and survived a hackbench run (perf bench sched messaging -g 20
 -l 5000).
 
 So while the paravirt code isn't the most optimal code ever conceived it does 
 work.
 
 Also, the paravirt patching includes replacing the call with movb $0, %arg1
 for the native case, which should greatly reduce the cost of having
 CONFIG_PARAVIRT_SPINLOCKS enabled on actual hardware.

Ah nice. That could be spun out as a separate patch to optimize the existing
ticket locks, I presume.

Now with the old PV ticketlock code a vCPU would only go to sleep once and
be woken up when it was its turn. With this new code it is woken up twice
(and twice it goes to sleep). In an overcommit scenario this would imply
that we will have at least twice as many VMEXITs as with the previous code.

I presume when you did benchmarking this did not even register? Though
I wonder if it would if you ran the benchmark for a week or so.

 
 I feel that if someone were to do a Xen patch we can go ahead and merge this
 stuff (finally!).
 
 These patches do not implement the paravirt spinlock debug stats currently
 implemented (separately) by KVM and Xen, but that should not be too hard to do
 on top and in the 'generic' code -- no reason to duplicate all that.
 
 Of course; once this lands people can look at improving the paravirt nonsense.
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] kasan_map_early_shadow() on Xen

2015-03-06 Thread Konrad Rzeszutek Wilk
On Wed, Mar 04, 2015 at 05:47:03PM -0800, Luis R. Rodriguez wrote:
 On Wed, Mar 4, 2015 at 6:36 AM, Andrey Ryabinin a.ryabi...@samsung.com 
 wrote:
  On 03/03/2015 07:02 PM, Konrad Rzeszutek Wilk wrote:
  If it is like that - then just using what had to be implemented
  for the stack protection as a template ought to pave most of the
  work?
 
  Probably. I think I could make this work.
  However, I won't be able to work on this until the end of the next week.
 
 While a solution is likely possible here I'd like for us to notice how
 two features have gone now under our nose for Xen on v4.0 which have
 issues. Proactive maintainer reviews would detect these issues but we
 can't bet on these, and testing is just as reactive and slow. I'm

Hmm, I see you saying 'proactive maintainer review would detect' these,
but that assumes the maintainer would even be copied on these patches?

None of the Xen maintainers were copied on this.

And while we all strive to be proactive, I have to remind you that maintainers
are usually swamped.

 hinting we keep our eyes out for an architectural solution that would
 avoid these issues. Don't ask me what that is just yet...

Keep in mind that a lot of these issues get resolved during the merge window,
thanks to the resilience of Boris monitoring these tests with a sharp eye!

After all, that is what merge windows are for - things break during them.

 
  Luis
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] kasan_map_early_shadow() on Xen

2015-03-03 Thread Konrad Rzeszutek Wilk
On Tue, Mar 03, 2015 at 06:38:20PM +0300, Andrey Ryabinin wrote:
 On 03/03/2015 05:16 PM, Konrad Rzeszutek Wilk wrote:
  On Tue, Mar 03, 2015 at 04:15:06PM +0300, Andrey Ryabinin wrote:
  On 03/03/2015 12:40 PM, Luis R. Rodriguez wrote:
  Andrey,
 
  I believe that on Xen we should disable kasan, would like confirmation
 
  I guess Xen guests won't work with kasan because Xen guests doesn't setup 
  shadow
  (kasan_map_early_shadow() is not called in xen guests).
 
  Disabling kasan for Xen in Kconfig is undesirable because that will 
  disable kasan
  for allmodconfig and allyesconfig builds, but I don't see other option for 
  now.
  
  Was there an bug reported for this? It would be good to CC the maintainers
  of Xen on that sort of thing.
  
 
 There was no report for this. I just looked at Xen code because of this 
 Luis's mail
 and I don't see how it could work with kasan.

Ahh.
 Fixing this might be not trivial. We need to setup shadow memory before any 
 kasan instrumented
 function will run.
 This is like with -fstack-protector (you need to setup gdt before calling any 
 stack protected function).

If it is like that - then just using what had to be implemented
for the stack protection as a template ought to pave the way for most of the
work?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] kasan_map_early_shadow() on Xen

2015-03-03 Thread Konrad Rzeszutek Wilk
On Tue, Mar 03, 2015 at 04:15:06PM +0300, Andrey Ryabinin wrote:
 On 03/03/2015 12:40 PM, Luis R. Rodriguez wrote:
  Andrey,
  
  I believe that on Xen we should disable kasan, would like confirmation
 
 I guess Xen guests won't work with kasan because Xen guests doesn't setup 
 shadow
 (kasan_map_early_shadow() is not called in xen guests).
 
 Disabling kasan for Xen in Kconfig is undesirable because that will disable 
 kasan
 for allmodconfig and allyesconfig builds, but I don't see other option for 
 now.

Was there a bug reported for this? It would be good to CC the maintainers
of Xen on that sort of thing.

Thanks!
.. snip..
  [0] https://lkml.org/lkml/2015/2/23/328

If you look at the other x86 pulls - you will see that the authors
also missed enabling it on 32-bit kernels!

  
   Luis
  
 
 
 ___
 Xen-devel mailing list
 xen-de...@lists.xen.org
 http://lists.xen.org/xen-devel
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v13 10/11] pvqspinlock, x86: Enable PV qspinlock for KVM

2014-12-02 Thread Konrad Rzeszutek Wilk
On Wed, Oct 29, 2014 at 04:19:10PM -0400, Waiman Long wrote:
 This patch adds the necessary KVM specific code to allow KVM to
 support the CPU halting and kicking operations needed by the queue
 spinlock PV code.
 
 Two KVM guests of 20 CPU cores (2 nodes) were created for performance
 testing in one of the following three configurations:
  1) Only 1 VM is active
  2) Both VMs are active and they share the same 20 physical CPUs
 (200% overcommit)
 
 The tests run included the disk workload of the AIM7 benchmark on
 both ext4 and xfs RAM disks at 3000 users on a 3.17 based kernel. The
 ebizzy -m test and futextest was was also run and its performance
 data were recorded.  With two VMs running, the idle=poll kernel
 option was added to simulate a busy guest. If PV qspinlock is not
 enabled, unfairlock will be used automically in a guest.

What is the unfairlock? Isn't it just using a bytelock at this point?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH v4 04/10] x86: paravirt: Wrap initialization of set_iopl_mask in a macro

2014-12-01 Thread Konrad Rzeszutek Wilk
On Sun, Nov 02, 2014 at 09:32:20AM -0800, Josh Triplett wrote:
 This will allow making set_iopl_mask optional later.
 
 Signed-off-by: Josh Triplett j...@joshtriplett.org

Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 ---
  arch/x86/include/asm/paravirt_types.h | 1 +
  arch/x86/kernel/paravirt.c| 2 +-
  arch/x86/xen/enlighten.c  | 4 ++--
  3 files changed, 4 insertions(+), 3 deletions(-)
 
 diff --git a/arch/x86/include/asm/paravirt_types.h 
 b/arch/x86/include/asm/paravirt_types.h
 index 7549b8b..3caf2a8 100644
 --- a/arch/x86/include/asm/paravirt_types.h
 +++ b/arch/x86/include/asm/paravirt_types.h
 @@ -143,6 +143,7 @@ struct pv_cpu_ops {
   void (*load_sp0)(struct tss_struct *tss, struct thread_struct *t);
  
   void (*set_iopl_mask)(unsigned mask);
 +#define INIT_SET_IOPL_MASK(f) .set_iopl_mask = f,
  
   void (*wbinvd)(void);
   void (*io_delay)(void);
 diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
 index 548d25f..e7969d4 100644
 --- a/arch/x86/kernel/paravirt.c
 +++ b/arch/x86/kernel/paravirt.c
 @@ -383,7 +383,7 @@ __visible struct pv_cpu_ops pv_cpu_ops = {
   .iret = native_iret,
   .swapgs = native_swapgs,
  
 - .set_iopl_mask = native_set_iopl_mask,
 + INIT_SET_IOPL_MASK(native_set_iopl_mask)
   .io_delay = native_io_delay,
  
   .start_context_switch = paravirt_nop,
 diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
 index 1a3f044..8ad0778 100644
 --- a/arch/x86/xen/enlighten.c
 +++ b/arch/x86/xen/enlighten.c
 @@ -912,7 +912,7 @@ static void xen_load_sp0(struct tss_struct *tss,
   xen_mc_issue(PARAVIRT_LAZY_CPU);
  }
  
 -static void xen_set_iopl_mask(unsigned mask)
 +static void __maybe_unused xen_set_iopl_mask(unsigned mask)
  {
   struct physdev_set_iopl set_iopl;
  
 @@ -1279,7 +1279,7 @@ static const struct pv_cpu_ops xen_cpu_ops __initconst 
 = {
   .write_idt_entry = xen_write_idt_entry,
   .load_sp0 = xen_load_sp0,
  
 - .set_iopl_mask = xen_set_iopl_mask,
 + INIT_SET_IOPL_MASK(xen_set_iopl_mask)
   .io_delay = xen_io_delay,
  
   /* Xen takes care of %gs when switching to usermode for us */
 -- 
 2.1.1
 
 
 ___
 Xen-devel mailing list
 xen-de...@lists.xen.org
 http://lists.xen.org/xen-devel
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-12-01 Thread Konrad Rzeszutek Wilk
On Tue, Nov 25, 2014 at 07:33:58PM -0500, Waiman Long wrote:
 On 10/27/2014 02:02 PM, Konrad Rzeszutek Wilk wrote:
 On Mon, Oct 27, 2014 at 01:38:20PM -0400, Waiman Long wrote:
 
 My concern is that spin_unlock() can be called in many places, including
 loadable kernel modules. Can the paravirt_patch_ident_32() function able to
 patch all of them in reasonable time? How about a kernel module loaded later
 at run time?
 It has too. When the modules are loaded the .paravirt symbols are exposed
 and the module loader patches that.
 
 And during bootup time (before modules are loaded) it also patches everything
 - when it only runs on one CPU.
 
 
 I have been changing the patching code to patch the unlock call sites and it
 seems to be working now. However, when I manually inserted a kernel module
 using insmod and run the code in the newly inserted module, I got memory
 access violation as follows:
 
 BUG: unable to handle kernel NULL pointer dereference at   (null)
 IP: [  (null)]   (null)
 PGD 18d62f3067 PUD 18d476f067 PMD 0
 Oops: 0010 [#1] SMP
 Modules linked in: locktest(OE) ebtable_nat ebtables xt_CHECKSUM
 iptable_mangle bridge autofs4 8021q garp stp llc ipt_REJECT
 nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT
 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter
 ip6_tables ipv6 vhost_net macvtap macvlan vhost tun uinput ppdev parport_pc
 parport sg microcode pcspkr virtio_balloon snd_hda_codec_generic
 virtio_console snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep
 snd_seq snd_seq_device snd_pcm snd_timer snd soundcore virtio_net i2c_piix4
 i2c_core ext4(E) jbd2(E) mbcache(E) floppy(E) virtio_blk(E) sr_mod(E)
 cdrom(E) virtio_pci(E) virtio_ring(E) virtio(E) pata_acpi(E) ata_generic(E)
 ata_piix(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) [last
 unloaded: speedstep_lib]
 CPU: 1 PID: 3907 Comm: run-locktest Tainted: GW  OE  3.17.0-pvqlock
 #3
 Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
 task: 8818cc5baf90 ti: 8818b7094000 task.ti: 8818b7094000
 RIP: 0010:[]  [  (null)]   (null)
 RSP: 0018:8818b7097db0  EFLAGS: 00010246
 RAX:  RBX: 004c4b40 RCX: 
 RDX: 0001 RSI:  RDI: 8818d3f052c0
 RBP: 8818b7097dd8 R08: 80522014 R09: 
 R10: 1000 R11: 0001 R12: 0001
 R13:  R14: 0001 R15: 8818b7097ea0
 FS:  7fb828ece700() GS:88193ec2() knlGS:
 CS:  0010 DS:  ES:  CR0: 8005003b
 CR2:  CR3: 0018cc7e9000 CR4: 06e0
 Stack:
  a06ff395 8818d465e000 8164bec0 0001
  0050 8818b7097e18 a06ff785 8818b7097e38
  0246 54755e3a 39f8ba72 8818c174f000
 Call Trace:
  [a06ff395] ? test_spinlock+0x65/0x90 [locktest]
  [a06ff785] etime_show+0xd5/0x120 [locktest]
  [812a2dc6] kobj_attr_show+0x16/0x20
  [8121a7fa] sysfs_kf_seq_show+0xca/0x1b0
  [81218a13] kernfs_seq_show+0x23/0x30
  [811c82db] seq_read+0xbb/0x400
  [812197e5] kernfs_fop_read+0x35/0x40
  [811a4223] vfs_read+0xa3/0x110
  [811a47e6] SyS_read+0x56/0xd0
  [810f3e16] ? __audit_syscall_exit+0x216/0x2c0
  [815b3ca9] system_call_fastpath+0x16/0x1b
 Code:  Bad RIP value.
  RSP 8818b7097db0
 CR2: 
 ---[ end trace 69d0e259c9ec632f ]---
 
 It seems like call site patching isn't properly done or the kernel module
 that I built was missing some critical information necessary for the proper

Did the readelf give you the paravirt note section?
 linking. Anyway, I will include the unlock call patching code as a separate
 patch as it seems there may be problem under certain circumstance.

One way to troubleshoot those is to enable the paravirt patching code to
actually print where it is patching the code. That way when you load the
module you can confirm it has done its job.

Then you can verify that the address where the code is called:

a06ff395

is indeed patched. You might as well also do a hexdump during module loading
to confirm that the patching has been done correctly.
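
Something along these lines in apply_paravirt() would do it - a hedged
debugging sketch (field names quoted from memory, double-check them against
struct paravirt_patch_site):

	/* inside apply_paravirt()'s per-site loop, after the site is patched */
	pr_info("paravirt: patched type %u at %p, %u bytes\n",
		p->instrtype, p->instr, p->len);
	print_hex_dump(KERN_INFO, "pv: ", DUMP_PREFIX_ADDRESS,
		       16, 1, p->instr, p->len, false);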
 
 BTW, the kernel panic problem that your team reported had been fixed. The
 fix will be in the next version of the patch.
 
 -Longman
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v13 09/11] pvqspinlock, x86: Add para-virtualization support

2014-12-01 Thread Konrad Rzeszutek Wilk
On Wed, Oct 29, 2014 at 04:19:09PM -0400, Waiman Long wrote:
 This patch adds para-virtualization support to the queue spinlock
 code base with minimal impact to the native case. There are some
 minor code changes in the generic qspinlock.c file which should be
 usable in other architectures. The other code changes are specific
 to x86 processors and so are all put under the arch/x86 directory.
 
 On the lock side, the slowpath code is split into 2 separate functions
 generated from the same code - one for bare metal and one for PV guest.
 The switching is done in the _raw_spin_lock* functions. This makes
 sure that the performance impact to the bare metal case is minimal,
 just a few NOPs in the _raw_spin_lock* functions. In the PV slowpath
 code, there are 2 paravirt callee saved calls that minimize register
 pressure.
 
 On the unlock side, however, the disabling of unlock function inlining
 does have some slight impact on bare metal performance.
 
 The actual paravirt code comes in 5 parts;
 
  - init_node; this initializes the extra data members required for PV
state. PV state data is kept 1 cacheline ahead of the regular data.
 
  - link_and_wait_node; this replaces the regular MCS queuing code. CPU
halting can happen if the wait is too long.
 
  - wait_head; this waits until the lock is avialable and the CPU will
be halted if the wait is too long.
 
  - wait_check; this is called after acquiring the lock to see if the
next queue head CPU is halted. If this is the case, the lock bit is
changed to indicate the queue head will have to be kicked on unlock.
 
  - queue_unlock;  this routine has a jump label to check if paravirt
is enabled. If yes, it has to do an atomic cmpxchg to clear the lock
bit or call the slowpath function to kick the queue head cpu.
 
 Tracking the head is done in two parts, firstly the pv_wait_head will
 store its cpu number in whichever node is pointed to by the tail part
 of the lock word. Secondly, pv_link_and_wait_node() will propagate the
 existing head from the old to the new tail node.
 
 Signed-off-by: Waiman Long waiman.l...@hp.com
 ---
  arch/x86/include/asm/paravirt.h   |   19 ++
  arch/x86/include/asm/paravirt_types.h |   20 ++
  arch/x86/include/asm/pvqspinlock.h|  411 
 +
  arch/x86/include/asm/qspinlock.h  |   71 ++-
  arch/x86/kernel/paravirt-spinlocks.c  |6 +
  include/asm-generic/qspinlock.h   |2 +
  kernel/locking/qspinlock.c|   69 +-
  7 files changed, 591 insertions(+), 7 deletions(-)
  create mode 100644 arch/x86/include/asm/pvqspinlock.h
 
 diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
 index cd6e161..7e296e6 100644
 --- a/arch/x86/include/asm/paravirt.h
 +++ b/arch/x86/include/asm/paravirt.h
 @@ -712,6 +712,24 @@ static inline void __set_fixmap(unsigned /* enum 
 fixed_addresses */ idx,
  
  #if defined(CONFIG_SMP)  defined(CONFIG_PARAVIRT_SPINLOCKS)
  
 +#ifdef CONFIG_QUEUE_SPINLOCK
 +
 +static __always_inline void pv_kick_cpu(int cpu)
 +{
 + PVOP_VCALLEE1(pv_lock_ops.kick_cpu, cpu);
 +}
 +
 +static __always_inline void pv_lockwait(u8 *lockbyte)
 +{
 + PVOP_VCALLEE1(pv_lock_ops.lockwait, lockbyte);
 +}
 +
 +static __always_inline void pv_lockstat(enum pv_lock_stats type)
 +{
 + PVOP_VCALLEE1(pv_lock_ops.lockstat, type);
 +}
 +
 +#else
  static __always_inline void __ticket_lock_spinning(struct arch_spinlock 
 *lock,
   __ticket_t ticket)
  {
 @@ -723,6 +741,7 @@ static __always_inline void __ticket_unlock_kick(struct 
 arch_spinlock *lock,
  {
   PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
  }
 +#endif
  
  #endif
  
 diff --git a/arch/x86/include/asm/paravirt_types.h 
 b/arch/x86/include/asm/paravirt_types.h
 index 7549b8b..49e4b76 100644
 --- a/arch/x86/include/asm/paravirt_types.h
 +++ b/arch/x86/include/asm/paravirt_types.h
 @@ -326,6 +326,9 @@ struct pv_mmu_ops {
  phys_addr_t phys, pgprot_t flags);
  };
  
 +struct mcs_spinlock;
 +struct qspinlock;
 +
  struct arch_spinlock;
  #ifdef CONFIG_SMP
  #include asm/spinlock_types.h
 @@ -333,9 +336,26 @@ struct arch_spinlock;
  typedef u16 __ticket_t;
  #endif
  
 +#ifdef CONFIG_QUEUE_SPINLOCK
 +enum pv_lock_stats {
 + PV_HALT_QHEAD,  /* Queue head halting   */
 + PV_HALT_QNODE,  /* Other queue node halting */
 + PV_HALT_ABORT,  /* Halting aborted  */
 + PV_WAKE_KICKED, /* Wakeup by kicking*/
 + PV_WAKE_SPURIOUS,   /* Spurious wakeup  */
 + PV_KICK_NOHALT  /* Kick but CPU not halted  */
 +};
 +#endif
 +
  struct pv_lock_ops {
 +#ifdef CONFIG_QUEUE_SPINLOCK
 + struct paravirt_callee_save kick_cpu;
 + struct paravirt_callee_save lockstat;
 + struct paravirt_callee_save lockwait;
 +#else
   struct paravirt_callee_save lock_spinning;
   void (*unlock_kick)(struct 

Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-27 Thread Konrad Rzeszutek Wilk
On Mon, Oct 27, 2014 at 01:38:20PM -0400, Waiman Long wrote:
 On 10/24/2014 04:54 AM, Peter Zijlstra wrote:
 On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote:
 
 Since enabling paravirt spinlock will disable unlock function inlining,
 a jump label can be added to the unlock function without adding patch
 sites all over the kernel.
 But you don't have to. My patches allowed for the inline to remain,
 again reducing the overhead of enabling PV spinlocks while running on a
 real machine.
 
 Look at:
 
http://lkml.kernel.org/r/20140615130154.213923...@chello.nl
 
 In particular this hunk:
 
 Index: linux-2.6/arch/x86/kernel/paravirt_patch_64.c
 ===
 --- linux-2.6.orig/arch/x86/kernel/paravirt_patch_64.c
 +++ linux-2.6/arch/x86/kernel/paravirt_patch_64.c
 @@ -22,6 +22,10 @@ DEF_NATIVE(pv_cpu_ops, swapgs, swapgs)
   DEF_NATIVE(, mov32, mov %edi, %eax);
   DEF_NATIVE(, mov64, mov %rdi, %rax);
 
 +#if defined(CONFIG_PARAVIRT_SPINLOCKS)  defined(CONFIG_QUEUE_SPINLOCK)
 +DEF_NATIVE(pv_lock_ops, queue_unlock, movb $0, (%rdi));
 +#endif
 +
   unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
   {
  return paravirt_patch_insns(insnbuf, len,
 @@ -61,6 +65,9 @@ unsigned native_patch(u8 type, u16 clobb
  PATCH_SITE(pv_cpu_ops, clts);
  PATCH_SITE(pv_mmu_ops, flush_tlb_single);
  PATCH_SITE(pv_cpu_ops, wbinvd);
 +#if defined(CONFIG_PARAVIRT_SPINLOCKS)  defined(CONFIG_QUEUE_SPINLOCK)
 +   PATCH_SITE(pv_lock_ops, queue_unlock);
 +#endif
 
  patch_site:
  ret = paravirt_patch_insns(ibuf, len, start, end);
 
 
 That makes sure to overwrite the callee-saved call to the
 pv_lock_ops::queue_unlock with the immediate asm movb $0, (%rdi).
 
 
 Therefore you can retain the inlined unlock with hardly (there might be
 some NOP padding) any overhead at all. On PV it reverts to a callee
 saved function call.
 
 My concern is that spin_unlock() can be called in many places, including
 loadable kernel modules. Can the paravirt_patch_ident_32() function able to
 patch all of them in reasonable time? How about a kernel module loaded later
 at run time?

It has to. When the modules are loaded the .paravirt symbols are exposed
and the module loader patches them.

And during bootup time (before modules are loaded) it also patches everything
- while it only runs on one CPU.
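
Concretely, the x86 module loader walks the module's .parainstructions
section at load time and patches every recorded call site - roughly this
(a hedged sketch of the module_finalize() path, simplified; sechdrs, hdr
and secstrings are module_finalize()'s usual arguments):

	for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) {
		if (!strcmp(".parainstructions", secstrings + s->sh_name))
			para = s;
	}
	if (para) {
		void *pseg = (void *)para->sh_addr;

		/* rewrites each pv_ops call site in the module, which would
		 * include the inlined unlock sites discussed here */
		apply_paravirt(pseg, pseg + para->sh_size);
	}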
 
 So I think we may still need to disable unlock function inlining even if we
 used your way kernel site patching.

No need. Inlining should be (and is) working just fine.
 
 Regards,
 Longman
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible

2014-10-02 Thread Konrad Rzeszutek Wilk
On Tue, Sep 30, 2014 at 11:01:29AM -0700, Andy Lutomirski wrote:
 On Tue, Sep 30, 2014 at 10:53 AM, Konrad Rzeszutek Wilk
 konrad.w...@oracle.com wrote:
  x86 will be worse than PPC, too: the special case needed to support
  QEMU 2.2 with IOMMU and virtio enabled with a Xen guest will be fairly
  large and disgusting and will only exist to support something that IMO
  should never have existed in the first place.
 
  scratches his head I don't follow.
 
 If you boot a Xen PV dom0 on QEMU master with -machine q35,iommu=on
 and you add a virtio device, dom0 will end up with a PCI device that
 does DMA to machine addresses.  These addresses are not compatible
 with the DMA API (which works with bus addresses), nor are they the
 same as physical addresses.

That is presumably because the IOMMU assumes the virtio devices are real
devices, not fake ones.
 
 So virtio in current kernels won't work for the same reason they never
 work on Xen.  But virtio-pci with my patches won't work either,
 because they (or the Xen hypervisor) will try to program the IOMMU
 with a non-identity mapping, causing everything to explode.
 
 Hacking up the virtio-pci driver to explicitly ask Xen for machine
 addresses might work, but, at the very least, it will be a giant
 security hole if anyone binds a virtio device to a domain other than
 dom0 (which, again, is kind of the point of having an IOMMU).
 
 
  PPC at least avoids *that* problem by virtue of not having Xen
  paravirt.  (And please don't add Xen paravirt to PPC -- x86 is trying
  to kill it off, but this is a 5-10 year project.)
 
  Correction:
   - The Xen project is trying to kill some of the paravirts off.
   - KVM uses paravirts as well (and then added some)
 
 By paravirt I meant PV, where there's the weird physical/machine
 address discrepancy that's visible to the guest.  This is not to say
 that Xen PVH wouldn't also be screwed running on QEMU master.
 
 --Andy
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible

2014-09-30 Thread Konrad Rzeszutek Wilk
 x86 will be worse than PPC, too: the special case needed to support
 QEMU 2.2 with IOMMU and virtio enabled with a Xen guest will be fairly
 large and disgusting and will only exist to support something that IMO
 should never have existed in the first place.

*scratches his head* I don't follow.
 
 PPC at least avoids *that* problem by virtue of not having Xen
 paravirt.  (And please don't add Xen paravirt to PPC -- x86 is trying
 to kill it off, but this is a 5-10 year project.)

Correction:
 - The Xen project is trying to kill some of the paravirts off.
 - KVM uses paravirts as well (and then added some)

 
 [..., reordered]
 
 
  Except that I think that PPC is the only platform on which QEMU's code
  actually bypasses any IOMMU.  Unless we've all missed something, there
  is no QEMU release that will put a virtio device behind an IOMMU on
  any platform other than PPC.
 
  I think that is true but it seems that this will be true for x86 for
  QEMU 2.2 unless we make some changes there.
  Which we might not have the time for since 2.2 is feature frozen
  from tomorrow.
  Maybe we should disable the IOMMU in 2.2, this is worth considering.
 
 
 Please do.
 
 Also, try booting this 2.2 QEMU candidate with nested virtualization
 on.  Then bind vfio to a virtio-pci device and watch the guest get
 corrupted.  QEMU will blame Linux for incorrectly programming the

Hehe.
 hardware, and Linux will blame QEMU for its blatant violation of the
 ACPI spec.  Given that this is presumably most of the point of adding
 IOMMU support, it seems like a terrible idea to let code like that
 into the wild.
 
 If this happens, Linux may also end up needing a quirk to prevent vfio
 from binding to QEMU 2.2's virtio-pci devices.
 
 --Andy
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v5 0/3] virtio: Use the DMA API when appropriate

2014-09-19 Thread Konrad Rzeszutek Wilk
On Tue, Sep 16, 2014 at 10:22:25PM -0700, Andy Lutomirski wrote:
 This fixes virtio on Xen guests as well as on any other platform
 that uses virtio_pci on which physical addresses don't match bus
 addresses.

I can do 'Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com',
but I am not sure whose tree this patch should go through?
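
For context, the guest-side decision the series introduces boils down to a
check of this shape (a hedged sketch; the exact predicate kept evolving
across revisions):

	/* Should the vring go through the DMA API instead of handing the
	 * device raw guest-physical addresses?  On Xen it must, since gfns
	 * are not bus/machine addresses. */
	static bool vring_use_dma_api(struct virtio_device *vdev)
	{
		if (xen_domain())
			return true;

		/* otherwise the transport decides (simplified here) */
		return false;
	}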


 
 This can be tested with:
 
 virtme-run --xen xen --kimg arch/x86/boot/bzImage --console
 
 using virtme from here:
 
 https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git
 
 Without these patches, the guest hangs forever.  With these patches,
 everything works.
 
 This should be safe on all platforms that I'm aware of.  That
 doesn't mean that there isn't anything that I missed.
 
 Applies to net-next.
 
 Changes from v4:
  - Rebased onto net-next.
  - Dropped the sg end mark changes from virtio_net, as that issue
is solved differently in net-next.
  - virtio_pci does not use the DMA API on powerpc, for reasons that
are explained in a detailed comment in virtio_ring.
 
 Changes from v3:
  - virtio_pci only asks virtio_ring to use the DMA API if
!PCI_DMA_BUS_IS_PHYS.
  - Reduce tools/virtio breakage.  It's now merely as broken as before
instead of being even more broken.
  - Drop the sg_next changes -- Rusty's version is better.
 
 Changes from v2:
  - Reordered patches.
  - Fixed a virtio_net OOPS.
 
 Changes from v1:
  - Using the DMA API is optional now.  It would be nice to improve the
DMA API to the point that it could be used unconditionally, but s390
proves that we're not there yet.
  - Includes patch 4, which fixes DMA debugging warnings from virtio_net.
 
 Andy Lutomirski (3):
   virtio_ring: Support DMA APIs if requested
   virtio_pci: Use the DMA API for virtqueues when possible
   virtio_net: Stop doing DMA from the stack
 
  drivers/lguest/lguest_device.c |   3 +-
  drivers/misc/mic/card/mic_virtio.c |   2 +-
  drivers/net/virtio_net.c   |  53 ++---
  drivers/remoteproc/remoteproc_virtio.c |   4 +-
  drivers/s390/kvm/kvm_virtio.c  |   2 +-
  drivers/s390/kvm/virtio_ccw.c  |   4 +-
  drivers/virtio/virtio_mmio.c   |   5 +-
  drivers/virtio/virtio_pci.c|  91 ++--
  drivers/virtio/virtio_ring.c   | 194 
 +++--
  include/linux/virtio_ring.h|   1 +
  tools/virtio/linux/dma-mapping.h   |  17 +++
  tools/virtio/linux/virtio.h|   1 +
  tools/virtio/virtio_test.c |   2 +-
  tools/virtio/vringh_test.c |   3 +-
  14 files changed, 314 insertions(+), 68 deletions(-)
  create mode 100644 tools/virtio/linux/dma-mapping.h
 
 -- 
 1.9.3
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API

2014-09-02 Thread Konrad Rzeszutek Wilk
On Wed, Sep 03, 2014 at 06:53:33AM +1000, Benjamin Herrenschmidt wrote:
 On Mon, 2014-09-01 at 22:55 -0700, Andy Lutomirski wrote:
  
  On x86, at least, I doubt that we'll ever see a physically addressed
  PCI virtio device for which ACPI advertises an IOMMU, since any sane
  hypervisor will just not advertise an IOMMU for the virtio device.
  But are there arm64 or PPC guests that use virtio_pci, that have
  IOMMUs, and that will malfunction if the virtio_pci driver ends up
  using the IOMMU?  I certainly hope not, since these systems might be
  very hard-pressed to work right if someone plugged in a physical
  virtio-speaking PCI device.
 
 It will definitely not work on ppc64. We always have IOMMUs on pseries,
 all PCI busses do, and because it's a paravirtualized environment,
 mapping/unmapping pages means hypercalls - expensive.
 
 But our virtio implementation bypasses it in qemu, so if virtio-pci
 starts using the DMA mapping API without changing the DMA ops under the
 hood, it will break for us.

What are the default dma_ops that Linux guests start with
under ppc64?

Thanks!
 
 Cheers,
 Ben.
 
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: virtio DMA API?

2014-08-29 Thread Konrad Rzeszutek Wilk
On Thu, Aug 28, 2014 at 07:31:16AM +1000, Benjamin Herrenschmidt wrote:
 On Wed, 2014-08-27 at 20:40 +0930, Rusty Russell wrote:
 
  Hi Andy,
  
  This has long been a source of contention.  virtio assumes that
  the hypervisor can decode guest-physical addresses.
  
  PowerPC, in particular, doesn't want to pay the cost of IOMMU
  manipulations, and all arguments presented so far for using an IOMMU for
  a virtio device are weak.  And changing to use DMA APIs would break them
  anyway.
  
  Of course, it's Just A Matter of Code, so it's possible to
  create a Xen-specific variant which uses the DMA APIs.  I'm not sure
  what that would look like in the virtio standard, however.
 
 So this has popped up in the past a few times already from people who
 want to use virtio as a transport between physical systems connected
 via a bus like PCI using non-transparent bridges for example.
 
 There's a way to get both here that isn't too nasty... we can make the
 virtio drivers use the dma_map_* APIs and just switch the dma_ops in
 the struct device based on the hypervisor requirements. IE. For KVM we
 could attach a set of ops that basically just return the physical
 address, real PCI transport would use the normal callbacks etc...

Right.
 
 The only problem at the moment is that the dma_map_ops, while
 defined generically, aren't plumbed into the generic struct device
 but instead on some architectures dev_archdata. This includes
 powerpc, ARM and x86 (under a CONFIG option for the latter which
 is only enabled on x86_64 and some oddball i386 variant).

I am not following the interaction between 'struct device', 'struct
dev_archdata' and 'struct dma_map_ops'. Shouldn't 'struct dma_map_ops'
be able to exist without being embedded in the other structures?
Naturally the implementation of 'struct dma_map_ops' has to use
'struct device', otherwise it can't get at details such as the dma mapping.

 
 So either we switch to have all architectures we care about always
 use the generic DMA ops and move the pointer to struct device, or
 we create another inline indirection to deal with the cases
 without the dma_map_ops...

Or you implement the passthrough 'dma_map_ops' that you suggested?

Though I feel I am not grokking something from your email. Hmm, time
to get some more coffee.
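In case it helps the discussion, a passthrough set of ops could look
roughly like the sketch below. This is purely illustrative - the exact
dma_map_ops prototypes (dma_attrs vs. attrs flags, which callbacks are
mandatory) vary between kernel versions, so treat the signatures as
assumptions rather than the real thing:

#include <linux/dma-mapping.h>

/* Rough sketch of "bus address == physical address" dma_map_ops. */
static dma_addr_t nop_map_page(struct device *dev, struct page *page,
			       unsigned long offset, size_t size,
			       enum dma_data_direction dir,
			       struct dma_attrs *attrs)
{
	return page_to_phys(page) + offset;	/* no IOMMU, no bouncing */
}

static void nop_unmap_page(struct device *dev, dma_addr_t addr, size_t size,
			   enum dma_data_direction dir,
			   struct dma_attrs *attrs)
{
	/* nothing to undo */
}

static struct dma_map_ops nop_dma_ops = {
	.map_page	= nop_map_page,
	.unmap_page	= nop_unmap_page,
	/* .map_sg, .unmap_sg, .alloc, .free filled in the same spirit */
};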
 
 Cheers,
 Ben.
 
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/3] virtio_pci: Use the DMA API for virtqueues

2014-08-27 Thread Konrad Rzeszutek Wilk
On Tue, Aug 26, 2014 at 02:17:02PM -0700, Andy Lutomirski wrote:
 A virtqueue is a coherent DMA mapping.  Use the DMA API for it.
 This fixes virtio_pci on Xen.
 
 Signed-off-by: Andy Lutomirski l...@amacapital.net
 ---
  drivers/virtio/virtio_pci.c | 25 ++---
  1 file changed, 18 insertions(+), 7 deletions(-)
 
 diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
 index 3d1463c6b120..19039c5bec24 100644
 --- a/drivers/virtio/virtio_pci.c
 +++ b/drivers/virtio/virtio_pci.c
 @@ -80,8 +80,9 @@ struct virtio_pci_vq_info
   /* the number of entries in the queue */
   int num;
  
 - /* the virtual address of the ring queue */
 - void *queue;
 + /* the ring queue */
 + void *queue;/* virtual address */
 + dma_addr_t queue_dma_addr;  /* bus address */
  
   /* the list node for the virtqueues list */
   struct list_head node;
 @@ -417,15 +418,16 @@ static struct virtqueue *setup_vq(struct virtio_device 
 *vdev, unsigned index,
   info-num = num;
   info-msix_vector = msix_vec;
  
 - size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
 - info-queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
 + size = vring_size(num, VIRTIO_PCI_VRING_ALIGN);
 + info->queue = dma_zalloc_coherent(vdev->dev.parent, size,
 +   &info->queue_dma_addr, GFP_KERNEL);
   if (info->queue == NULL) {
   err = -ENOMEM;
   goto out_info;
   }
  
   /* activate the queue */
 - iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
 + iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
 vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
  
   /* create the vring */
 @@ -462,7 +464,8 @@ out_assign:
   vring_del_virtqueue(vq);
  out_activate_queue:
   iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 - free_pages_exact(info->queue, size);
 + dma_free_coherent(vdev->dev.parent, size,
 +   info->queue, info->queue_dma_addr);
  out_info:
   kfree(info);
   return ERR_PTR(err);
 @@ -493,7 +496,8 @@ static void vp_del_vq(struct virtqueue *vq)
   iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
   size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
 - free_pages_exact(info->queue, size);
 + dma_free_coherent(vq->vdev->dev.parent, size,
 +   info->queue, info->queue_dma_addr);
   kfree(info);
  }
  
 @@ -712,6 +716,13 @@ static int virtio_pci_probe(struct pci_dev *pci_dev,
   if (err)
   goto out;
  
 + /*
 +  * We support 64-bit DMA.  If this fails (e.g. some bridge
 +  * or PV code doesn't or DAC is disabled), then we're okay
 +  * with 32-bit DMA.

*scratches his head*

I am having a hard time parsing that. Could you expand a bit on the
faulting use-case please?

 +  */
 + dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64));

The usual process is:

ret = dma_set_mask_and_coherent(..)
if (ret)
	ret = dma_set_mask_and_coherent(.., DMA_BIT_MASK(32))

if (ret)
	pr_warn("We are truly screwed. Good luck!\n");

 +
   err = pci_request_regions(pci_dev, "virtio-pci");
   if (err)
   goto out_enable_device;
 -- 
 1.9.3
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/3] virtio_ring: Use DMA APIs

2014-08-27 Thread Konrad Rzeszutek Wilk
On Wed, Aug 27, 2014 at 09:29:36AM +0200, Christian Borntraeger wrote:
 On 26/08/14 23:17, Andy Lutomirski wrote:
  virtio_ring currently sends the device (usually a hypervisor)
  physical addresses of its I/O buffers.  This is okay when DMA
  addresses and physical addresses are the same thing, but this isn't
  always the case.  For example, this never works on Xen guests, and
  it is likely to fail if a physical virtio device ever ends up
  behind an IOMMU or swiotlb.
  
  The immediate use case for me is to enable virtio on Xen guests.
  For that to work, we need this fix as well as a corresponding
  fix to virtio_pci or to another driver.
  
  With this patch, virtfs survives kmemleak and CONFIG_DMA_API_DEBUG.
  virtio-net warns (correctly) about DMA from the stack in
  virtnet_set_rx_mode.
  
  This breaks s390's defconfig.  The default configuration for s390
  does virtio through a KVM-specific interface, but that configuration
  does not support DMA.  I could modify this patch to stub out the DMA
  API calls if !CONFIG_HAS_DMA, but it seems to me that it would be
  much nicer to make s390 support DMA unconditionally.
 
 s390 has no DMA per se. Newest systems have a PCI-like  I/O attach in 
 addition to the classic channel I/O, and for that we enable the DMA code 
 just for that transport to be able to reuse some of the existing PCI
 drivers. (only some because, we differ in some aspects from how PCI
 looks like) But the architecture itself (and the virtio interface) does
 not provide the DMA interface as you know it:

Don't most of those DMA API calls end up being nops then? As in, we do
have in the dma-api header the #ifdef case for when a platform does not
do DMA, which ends up with all of the functions stubbed out.

 
 In essence this patch is a no-go for s390.

Is that due to possible compilation breakage?
 
 I am also aksing myself, if this makes virtio-ring more expensive?
 
 Christian
 
 
 
  
  It's actually unclear to me whether this can be fixed without either
  s390 arch support or a special case for s390 in virtio_ring: on
  brief inspection of s390's DMA code, it seems to assume that
  everything goes through a PCI IOMMU, which is presumably not the
  case for virtio.
 
 
  
  Cc: Cornelia Huck cornelia.h...@de.ibm.com
  Signed-off-by: Andy Lutomirski l...@amacapital.net
  ---
   drivers/virtio/Kconfig   |   1 +
   drivers/virtio/virtio_ring.c | 116 
  +++
   2 files changed, 95 insertions(+), 22 deletions(-)
  
  diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
  index c6683f2e396c..b6f3acc61153 100644
  --- a/drivers/virtio/Kconfig
  +++ b/drivers/virtio/Kconfig
  @@ -1,5 +1,6 @@
   config VIRTIO
  tristate
  +   depends on HAS_DMA
  ---help---
This option is selected by any driver which implements the virtio
bus, such as CONFIG_VIRTIO_PCI, CONFIG_VIRTIO_MMIO, CONFIG_LGUEST,
  diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
  index d356a701c9c2..6a78e2fd8e63 100644
  --- a/drivers/virtio/virtio_ring.c
  +++ b/drivers/virtio/virtio_ring.c
  @@ -24,6 +24,7 @@
   #include linux/module.h
   #include linux/hrtimer.h
   #include linux/kmemleak.h
  +#include linux/dma-mapping.h
  
   #ifdef DEBUG
   /* For development, we want to crash whenever the ring is screwed. */
  @@ -54,6 +55,12 @@
   #define END_USE(vq)
   #endif
  
  +struct vring_desc_state
  +{
  +   void *data; /* Data for callback. */
  +   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
  +};
  +
   struct vring_virtqueue
   {
  struct virtqueue vq;
  @@ -93,12 +100,45 @@ struct vring_virtqueue
  ktime_t last_add_time;
   #endif
  
  -   /* Tokens for callbacks. */
  -   void *data[];
  +   /* Per-descriptor state. */
  +   struct vring_desc_state desc_state[];
   };
  
   #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
  
  +/* Helper to map one sg entry, since we can't use dma_map_sg. */
  +static dma_addr_t dma_map_one_sg(struct vring_virtqueue *vq,
  +struct scatterlist *sg,
  +enum dma_data_direction direction)
  +{
  +   return dma_map_page(vq->vq.vdev->dev.parent,
  +   sg_page(sg), sg->offset, sg->length, direction);
  +}
  +
  +static void unmap_one(struct vring_virtqueue *vq, struct vring_desc *desc)
  +{
  +   if (desc->flags & VRING_DESC_F_INDIRECT) {
  +   dma_unmap_single(vq->vq.vdev->dev.parent,
  +desc->addr, desc->len,
  +(desc->flags & VRING_DESC_F_WRITE) ?
  +DMA_FROM_DEVICE : DMA_TO_DEVICE);
  +   } else {
  +   dma_unmap_page(vq->vq.vdev->dev.parent,
  +  desc->addr, desc->len,
  +  (desc->flags & VRING_DESC_F_WRITE) ?
  +  DMA_FROM_DEVICE : DMA_TO_DEVICE);
  +   }
  +}
  +
  +static void unmap_indirect(struct vring_virtqueue *vq, struct 

Re: [PATCH 3/3] virtio_pci: Use the DMA API for virtqueues

2014-08-27 Thread Konrad Rzeszutek Wilk
On Wed, Aug 27, 2014 at 10:35:10AM -0700, Andy Lutomirski wrote:
 On Wed, Aug 27, 2014 at 10:32 AM, Konrad Rzeszutek Wilk
 konrad.w...@oracle.com wrote:
  On Tue, Aug 26, 2014 at 02:17:02PM -0700, Andy Lutomirski wrote:
  A virtqueue is a coherent DMA mapping.  Use the DMA API for it.
  This fixes virtio_pci on Xen.
 
  Signed-off-by: Andy Lutomirski l...@amacapital.net
  ---
   drivers/virtio/virtio_pci.c | 25 ++---
   1 file changed, 18 insertions(+), 7 deletions(-)
 
  diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
  index 3d1463c6b120..19039c5bec24 100644
  --- a/drivers/virtio/virtio_pci.c
  +++ b/drivers/virtio/virtio_pci.c
  @@ -80,8 +80,9 @@ struct virtio_pci_vq_info
/* the number of entries in the queue */
int num;
 
  - /* the virtual address of the ring queue */
  - void *queue;
  + /* the ring queue */
  + void *queue;/* virtual address */
  + dma_addr_t queue_dma_addr;  /* bus address */
 
/* the list node for the virtqueues list */
struct list_head node;
  @@ -417,15 +418,16 @@ static struct virtqueue *setup_vq(struct 
  virtio_device *vdev, unsigned index,
info-num = num;
info-msix_vector = msix_vec;
 
  - size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
  - info-queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
  + size = vring_size(num, VIRTIO_PCI_VRING_ALIGN);
  + info-queue = dma_zalloc_coherent(vdev-dev.parent, size,
  +   info-queue_dma_addr, GFP_KERNEL);
if (info-queue == NULL) {
err = -ENOMEM;
goto out_info;
}
 
/* activate the queue */
  - iowrite32(virt_to_phys(info-queue)  VIRTIO_PCI_QUEUE_ADDR_SHIFT,
  + iowrite32(info-queue_dma_addr  VIRTIO_PCI_QUEUE_ADDR_SHIFT,
  vp_dev-ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
/* create the vring */
  @@ -462,7 +464,8 @@ out_assign:
vring_del_virtqueue(vq);
   out_activate_queue:
iowrite32(0, vp_dev-ioaddr + VIRTIO_PCI_QUEUE_PFN);
  - free_pages_exact(info-queue, size);
  + dma_free_coherent(vdev-dev.parent, size,
  +   info-queue, info-queue_dma_addr);
   out_info:
kfree(info);
return ERR_PTR(err);
  @@ -493,7 +496,8 @@ static void vp_del_vq(struct virtqueue *vq)
iowrite32(0, vp_dev-ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
size = PAGE_ALIGN(vring_size(info-num, VIRTIO_PCI_VRING_ALIGN));
  - free_pages_exact(info-queue, size);
  + dma_free_coherent(vq-vdev-dev.parent, size,
  +   info-queue, info-queue_dma_addr);
kfree(info);
   }
 
  @@ -712,6 +716,13 @@ static int virtio_pci_probe(struct pci_dev *pci_dev,
if (err)
goto out;
 
  + /*
  +  * We support 64-bit DMA.  If this fails (e.g. some bridge
  +  * or PV code doesn't or DAC is disabled), then we're okay
  +  * with 32-bit DMA.
 
  scratches his head
 
  I am having a hard time parsing that. Could you expand a bit the
  faulting use-case please?
 
  +  */
  + dma_set_mask_and_coherent(pci_dev-dev, DMA_BIT_MASK(64));
 
  The usual process is:
 
  ret = dma_set_mask_and_coherent(..)
  if (ret)
  ret = dma_set_mask_and_coherent(.., DMA_BIT_MASK(32))
 
  if (ret)
  pr_warn(We are truly screwed. Good luck!\n);
 
 
 I assumed that, if dma_set_mask_and_coherent(..., DMA_BIT_BASK(64))
 fails, then we can still do DMA, just not 64-bit DMA.  This driver
 should be fine with that -- it'll just be a bit slower.

Right. It should fail back to 32-bit (the default on PCI bus).

I was merely thinking that the error reporting might be useful,
but on second thought - telling the user that they are screwed - is
a bad idea.
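For reference, the explicit negotiation could look something along these
lines (an illustrative sketch only - vp_setup_dma_mask() is a made-up
helper name, not part of the patch):

#include <linux/dma-mapping.h>
#include <linux/pci.h>

static int vp_setup_dma_mask(struct pci_dev *pci_dev)
{
	int err;

	/* Prefer 64-bit DMA, fall back to 32-bit, and only warn if even
	 * that is unavailable. */
	err = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64));
	if (err)
		err = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(32));
	if (err)
		dev_warn(&pci_dev->dev, "no suitable DMA mask available\n");
	return err;
}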
 
 If that's not a safe assumption, I can change it.
 
 --Andy
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 0/3] virtio: Clean up scatterlists and use the DMA API

2014-08-27 Thread Konrad Rzeszutek Wilk
On Wed, Aug 27, 2014 at 07:10:20PM +0100, Stefan Hajnoczi wrote:
 On Wed, Aug 27, 2014 at 6:27 PM, Konrad Rzeszutek Wilk
 konrad.w...@oracle.com wrote:
  On Wed, Aug 27, 2014 at 07:46:46AM +0100, Stefan Hajnoczi wrote:
  On Tue, Aug 26, 2014 at 10:16 PM, Andy Lutomirski l...@amacapital.net 
  wrote:
   There are two outstanding issues.  virtio_net warns if DMA debugging
   is on because it does DMA from the stack.  (The warning is correct.)
   This also is likely to do something unpleasant to s390.
   (Maintainers are cc'd -- I don't know what to do about it.)
 
  This changes the semantics of vring and breaks existing guests when
  bus address != physical address.
 
  Isn't that what this is supposed to fix? Right now the semantics of the
  vring are that bus address == physical address.
 
 Today the assumption is that the device emulation code has access to
 guest physical memory (no translation necessary).

Right.
 
 Changing the semantics breaks today's guests.  You need to make this
 change in a way so that existing guests don't break.

Well, the DMA API will preserve that semantic as under KVM it will
return phys == bus.

But that is a side-effect of the DMA API implementation preserving
this semantic so I understand your point.
 
 Stefan
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: virtio DMA API?

2014-08-27 Thread Konrad Rzeszutek Wilk
On Wed, Aug 27, 2014 at 01:52:50PM +0200, Michael S. Tsirkin wrote:
 On Wed, Aug 27, 2014 at 08:40:51PM +0930, Rusty Russell wrote:
  Andy Lutomirski l...@amacapital.net writes:
   Currently, a lot of the virtio code assumes that bus (i.e. hypervisor)
   addresses are the same as physical address.  This is false on Xen, so
   virtio is completely broken.  I wouldn't be surprised if it also
   becomes a problem the first time that someone sticks a physical
   virtio device on a 32-bit bus on an ARM SOC with more than 4G RAM.
  
   Would you accept patches to convert virtio_ring and virtio_pci to use
   the DMA APIs?  I think that the only real catch will be that
   virtio_ring's approach to freeing indirect blocks is currently
   incompatible with the DMA API -- it assumes that knowing the bus
   address is enough to call kfree, and I don't think that the DMA API
   provides a reverse mapping like that.
  
  Hi Andy,
  
  This has long been a source of contention.  virtio assumes that
  the hypervisor can decode guest-physical addresses.
  
  PowerPC, in particular, doesn't want to pay the cost of IOMMU
  manipulations, and all arguments presented so far for using an IOMMU for
  a virtio device are weak.  And changing to use DMA APIs would break them
  anyway.
  
  Of course, it's Just A Matter of Code, so it's possible to
  create a Xen-specific variant which uses the DMA APIs.  I'm not sure
  what that would look like in the virtio standard, however.
  
  Cheers,
  Rusty.
 
 For x86 as of QEMU 2.0 there's no iommu.
 So a reasonable thing to do for that platform
 might be to always use iommu *if it's there*.
 My understanding is this isn't the case for powerpc?

Wasn't there some implementation of AMD IOMMU code on QEMU
mailing list floating around? Aha:
https://www.mail-archive.com/kvm@vger.kernel.org/msg40516.html

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: virtio DMA API?

2014-08-25 Thread Konrad Rzeszutek Wilk
On Mon, Aug 25, 2014 at 10:18:46AM -0700, Andy Lutomirski wrote:
 Currently, a lot of the virtio code assumes that bus (i.e. hypervisor)
 addresses are the same as physical address.  This is false on Xen, so
 virtio is completely broken.  I wouldn't be surprised if it also
 becomes a problem the first time that someone sticks a physical
 virtio device on a 32-bit bus on an ARM SOC with more than 4G RAM.
 
 Would you accept patches to convert virtio_ring and virtio_pci to use
 the DMA APIs?  I think that the only real catch will be that
 virtio_ring's approach to freeing indirect blocks is currently
 incompatible with the DMA API -- it assumes that knowing the bus
 address is enough to call kfree, and I don't think that the DMA API
 provides a reverse mapping like that.

If you use dma_map/unmap_sg, all of that ends up being stashed in the
sg structure (sg->dma_address ends up with the DMA addr, sg_phys(sg) gives
you the physical address).
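Roughly like this (illustrative sketch; map_buffers() is just a made-up
example function, error paths trimmed):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Sketch of the dma_map_sg() flow: both the bus (DMA) address and the
 * physical address are recoverable from the scatterlist itself, so no
 * separate reverse mapping is needed later for unmap/free. */
static int map_buffers(struct device *dev, struct scatterlist *sgl, int nents)
{
	struct scatterlist *sg;
	int i, mapped;

	mapped = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);
	if (!mapped)
		return -ENOMEM;

	for_each_sg(sgl, sg, mapped, i) {
		dma_addr_t bus = sg_dma_address(sg);	/* == sg->dma_address */
		phys_addr_t phys = sg_phys(sg);

		/* hand 'bus' to the device; 'phys' stays for the CPU side */
		(void)bus;
		(void)phys;
	}
	return 0;
}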


 
 --Andy
 
 -- 
 Andy Lutomirski
 AMA Capital Management, LLC
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 10/11] qspinlock: Paravirt support

2014-07-15 Thread Konrad Rzeszutek Wilk
On Mon, Jul 07, 2014 at 05:27:34PM +0200, Peter Zijlstra wrote:
 On Fri, Jun 20, 2014 at 09:46:08AM -0400, Konrad Rzeszutek Wilk wrote:
  I dug in the code and I have some comments about it, but before
  I post them I was wondering if you have any plans to run any performance
  tests against the PV ticketlock with normal and over-committed scenarios?
 
 I can barely boot a guest.. I'm not sure I can make them do anything
 much at all yet. All this virt crap is totally painful.

HA!

The reason I asked about that is from a pen-and-paper view it looks
suboptimal in the worst case scenario compared to PV ticketlock.

The 'worst case scenario' is when we over-commit (more CPUs than there
are physical CPUs) or have to delay guests (the sum of all virtual
CPUs > physical CPUs and all of the guests are compiling kernels).

In those cases the PV ticketlock goes to sleep and gets woken up
once the ticket holder has finished. In the PV qspinlock we do
wake up the first in queue, but we also wake the next one in queue
so it can progress further. And so on.

Perhaps a better mechanism is to just ditch the queue part and utilize
the byte part and under KVM and Xen just do bytelocking (since we
have 8 bits). For the PV halt/waking we can stash in the 'struct mcs'
the current lock that each CPU is waiting for. And the unlocker
can iterate over all of those and wake them all up. Perhaps make
the iteration random. (A rough sketch of that idea is below.) Anyhow,
that is how the old PV bytelock under Xen worked (before 3.11) and it
had worked pretty well (it didn't do it randomly though - it always
started with 'for_each_online_cpu').
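Something along these lines, to make the idea concrete (a user-space
sketch with C11 atomics; pv_wait()/pv_kick() are stand-ins for the
hypervisor halt/wake hooks and are hypothetical here - this is not the
old Xen code):

#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS		64
#define SPIN_THRESHOLD	(1 << 15)

struct byte_lock { atomic_uchar locked; };

/* Which lock each CPU is currently halted on (NULL if none). */
static struct byte_lock *_Atomic waiting_on[NR_CPUS];

static void pv_wait(int cpu) { (void)cpu; /* halt until kicked */ }
static void pv_kick(int cpu) { (void)cpu; /* wake the halted vCPU */ }

static void byte_lock(struct byte_lock *lock, int cpu)
{
	for (;;) {
		for (int loops = 0; loops < SPIN_THRESHOLD; loops++)
			if (!atomic_exchange(&lock->locked, 1))
				return;			/* got it */

		/* Out of patience: record what we wait for, recheck, halt. */
		atomic_store(&waiting_on[cpu], lock);
		if (!atomic_exchange(&lock->locked, 1)) {
			atomic_store(&waiting_on[cpu], NULL);
			return;
		}
		pv_wait(cpu);	/* a real version must handle a kick that
				 * arrives between the recheck and here */
		atomic_store(&waiting_on[cpu], NULL);
	}
}

static void byte_unlock(struct byte_lock *lock)
{
	atomic_store(&lock->locked, 0);
	/* Kick everyone stashed as waiting on this lock; the iteration
	 * could just as well be randomized, as suggested above. */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (atomic_load(&waiting_on[cpu]) == lock)
			pv_kick(cpu);
}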

Squashing in the ticketlock concept in qspinlock for PV looks
scary.

And as I said - this is all pen-and-paper - so it might be that this
'wake-up-go-sleep-on-the-queue' kick is actually not that bad?

Lastly - thank you for taking a stab at this.
 


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock

2014-06-27 Thread Konrad Rzeszutek Wilk
On Mon, Jun 23, 2014 at 06:26:22PM +0200, Peter Zijlstra wrote:
 On Tue, Jun 17, 2014 at 04:05:31PM -0400, Konrad Rzeszutek Wilk wrote:
   + * The basic principle of a queue-based spinlock can best be understood
   + * by studying a classic queue-based spinlock implementation called the
   + * MCS lock. The paper below provides a good description for this kind
   + * of lock.
   + *
   + * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
   + *
   + * This queue spinlock implementation is based on the MCS lock, however 
   to make
   + * it fit the 4 bytes we assume spinlock_t to be, and preserve its 
   existing
   + * API, we must modify it some.
   + *
   + * In particular; where the traditional MCS lock consists of a tail 
   pointer
   + * (8 bytes) and needs the next pointer (another 8 bytes) of its own 
   node to
   + * unlock the next pending (next-locked), we compress both these: {tail,
   + * next-locked} into a single u32 value.
   + *
   + * Since a spinlock disables recursion of its own context and there is a 
   limit
   + * to the contexts that can nest; namely: task, softirq, hardirq, nmi, 
   we can
   + * encode the tail as and index indicating this context and a cpu number.
   + *
   + * We can further change the first spinner to spin on a bit in the lock 
   word
   + * instead of its node; whereby avoiding the need to carry a node from 
   lock to
   + * unlock, and preserving API.
  
  You also made changes (compared to the MCS) in that the unlock path is not
  spinning waiting for the successor and that the job of passing the lock
  is not done in the unlock path either.
  
  Instead all of that is now done in the path of the lock acquirer logic. 
  
  Could you update the comment to say that please?
 
 I _think_ I know what you mean.. So that is actually implied by the last

You do :-)

 paragraph, but I suppose I can make it explicit; something like:
 
   *
   * Another way to look at it is:
   *
   *  lock(tail,locked)
   *struct mcs_spinlock node;
   *mcs_spin_lock(tail, node);
   *test-and-set locked;
   *mcs_spin_unlock(tail, node);
   *
   *  unlock(tail,locked)
   *clear locked
   *
   * Where we have compressed (tail,locked) into a single u32 word.
 
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock

2014-06-27 Thread Konrad Rzeszutek Wilk
On Mon, Jun 23, 2014 at 06:12:00PM +0200, Peter Zijlstra wrote:
 On Tue, Jun 17, 2014 at 04:03:29PM -0400, Konrad Rzeszutek Wilk wrote:
+   new = tail | (val & _Q_LOCKED_MASK);
+
+   old = atomic_cmpxchg(&lock->val, val, new);
+   if (old == val)
+   break;
+
+   val = old;
+   }
+
+   /*
+* we won the trylock; forget about queueing.
+*/
+   if (new == _Q_LOCKED_VAL)
+   goto release;
+
+   /*
+* if there was a previous node; link it and wait.
+*/
+   if (old & ~_Q_LOCKED_MASK) {
+   prev = decode_tail(old);
+   ACCESS_ONCE(prev->next) = node;
+
+   arch_mcs_spin_lock_contended(&node->locked);
  
  Could you add a comment here:
  
  /* We are spinning forever until the previous node updates locked - which
  it does once it has updated lock->val with our tail number. */
 
 That's incorrect -- or at least, I understand that to be incorrect. The
 previous node will not have changed the tail to point to us. You always
 change to tail to point to yourself, seeing how you add yourself to the
 tail.
 
 Is the existing comment any better if I s/wait./wait for it to release
 us./ ?

Yes!
 
+   /*
+* claim the lock:
+*
+* n,0 - 0,1 : lock, uncontended
+* *,0 - *,1 : lock, contended
+*/
+   for (;;) {
+   new = _Q_LOCKED_VAL;
+   if (val != tail)
+   new |= val;
   
  ..snip..
   
   Could you help a bit in explaining it in English please?
  
  After looking at the assembler code I finally figured out how
  we can get here. And the 'contended' part threw me off. Somehow
 I imagined there are two or more CPUs stampeding here and
 trying to update the lock->val. But in reality the other CPUs
  are stuck in the arch_mcs_spin_lock_contended spinning on their
  local value.
 
 Well, the lock as a whole is contended (there's 1 waiters), and the
 point of MCS style locks it to make sure they're not actually pounding
 on the same cacheline. So the whole thing is consistent.
 
  Perhaps you could add this comment.
  
  /* Once queue_spin_unlock is called (which _subtracts_ _Q_LOCKED_VAL from
  the lock->val while still preserving the tail data), the winner gets to
  claim the ticket. 
 
 There's no tickets :/

s/ticket/be first in line/ ?

 
  Since we still need the other CPUs to continue and
  preserve the strict ordering in which they setup node->next, we:
   1) update lock->val to the tail value (so tail CPU and its index) with
  _Q_LOCKED_VAL.
 
 We don't, we preserve the tail value, unless we're the tail, in which
 case we clear the tail.
 
   2). Once we are done, we poke the other CPU (the one that linked to
  us) by writing to node->locked (below) so they can make progress and
  loop on lock->val changing from _Q_LOCKED_MASK to zero).
 
 _If_ there was another cpu, ie. the tail didn't point to us.

nods
 
 ---
 
 I don't do well with natural language comments like that; they tend to
 confuse me more than anything.
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock

2014-06-27 Thread Konrad Rzeszutek Wilk
On Mon, Jun 23, 2014 at 05:56:50PM +0200, Peter Zijlstra wrote:
 On Mon, Jun 16, 2014 at 04:49:18PM -0400, Konrad Rzeszutek Wilk wrote:
   Index: linux-2.6/kernel/locking/mcs_spinlock.h
   ===
   --- linux-2.6.orig/kernel/locking/mcs_spinlock.h
   +++ linux-2.6/kernel/locking/mcs_spinlock.h
   @@ -17,6 +17,7 @@
struct mcs_spinlock {
 struct mcs_spinlock *next;
 int locked; /* 1 if lock acquired */
   + int count;
  
  This could use a comment.
 
 like so?
 
   int count; /* nesting count, see qspinlock.c */

/* nesting level - in task, softirq, hardirq or nmi context. */ ?

 
 
   +static inline u32 encode_tail(int cpu, int idx)
   +{
   + u32 tail;
   +
   + tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
   + tail |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */
  
  Should there an
  
  ASSERT (idx < 4)
  
  just in case we screw up somehow (I can't figure out how, but
  that is partially why ASSERTS are added).
 
 #ifdef CONFIG_DEBUG_SPINLOCK
   BUG_ON(idx > 3);
 #endif
 
 might do, I suppose.

nods
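To make the encoding concrete, here is a tiny stand-alone illustration
(my own sketch, not kernel code; the bit offsets are assumed from the
pending-bit layout later in the series):

#include <stdio.h>

/* Assumed layout: locked = bits 0-7, pending = bit 8,
 * tail idx = bits 9-10, tail cpu + 1 = bits 11-31. */
#define _Q_TAIL_IDX_OFFSET	9
#define _Q_TAIL_CPU_OFFSET	11

static unsigned int encode_tail(int cpu, int idx)
{
	/* cpu is stored as cpu + 1 so that tail == 0 means "no tail" */
	return ((cpu + 1) << _Q_TAIL_CPU_OFFSET) | (idx << _Q_TAIL_IDX_OFFSET);
}

int main(void)
{
	unsigned int tail = encode_tail(2, 1);	/* cpu 2, nesting level 1 */
	int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
	int idx = (tail >> _Q_TAIL_IDX_OFFSET) & 3;

	/* prints: tail=0x1a00 cpu=2 idx=1 */
	printf("tail=%#x cpu=%d idx=%d\n", tail, cpu, idx);
	return 0;
}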
 
   +/**
   + * queue_spin_lock_slowpath - acquire the queue spinlock
   + * @lock: Pointer to queue spinlock structure
   + * @val: Current value of the queue spinlock 32-bit word
   + *
   + * (queue tail, lock bit)
  
  Except it is not a lock bit. It is a lock uint8_t.
 
 It is indeed, although that's an accident of implementation. I could do
 s/bit// and not mention the entire storage angle at all?

I think giving as much details as possible is good.

What you said about it being an 'accident of implementation' could be woven
in there?
 
  Is the queue tail at this point the composite of 'cpu|idx'?
 
 Yes, as per {en,de}code_tail() above.
 
   + *
   + *  fast  :slow  :   
unlock
   + *:  :
   + * uncontended  (0,0)   --:-- (0,1) 
   :-- (*,0)
   + *:   | ^./  :
   + *:   v   \   |  :
   + * uncontended:(n,x) --+-- (n,0) |  :
  
  So many CPUn come in right? Is 'n' for the number of CPUs?
 
 Nope, 'n' for any one specific tail, in particular the first one to
 arrive. This is the 'uncontended queue' case as per the label, so we
 need a named value for the first, in order to distinguish between the
 state to the right (same tail, but unlocked) and the state below
 (different tail).
 
   + *   queue:   | ^--'  |  :
   + *:   v   |  :
   + * contended  :(*,x) --+-- (*,0) - (*,1) ---'  :
   + *   queue: ^--' :
  
  And here um, what are the '*' for? Are they the four different
  types of handlers that can be nested? So task, softirq, hardirq, and
  nmi?
 
 '*' as in wildcard, any tail, specifically not 'n'.

Ah, thank you for the explanation! Would it be possible to include
that in the comment please?

 
   +void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
   +{
   + struct mcs_spinlock *prev, *next, *node;
   + u32 new, old, tail;
   + int idx;
   +
   + BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
   +
   + node = this_cpu_ptr(&mcs_nodes[0]);
   + idx = node->count++;
  
  If this is the first time we enter this, wouldn't idx end up
  being 1?
 
 Nope, postfix ++ returns first and increments later.

blushes Yes it does.
 
   + tail = encode_tail(smp_processor_id(), idx);
   +
   + node += idx;
  
  Meaning we end up skipping the 'mcs_nodes[0]' one altogether - even
  on the first 'level' (task, softirq, hardirq, nmi)? Won't that
  cause us to blow past the array when we are nested at the nmi
  handler?
 
 Seeing how its all static storage, which is automagically initialized to
 0, combined with the postfix ++ (as opposed to the prefix ++) we should
 be getting 0 here.

I've no idea what I was thinking, but thank you for setting me straight.

 
   + node->locked = 0;
   + node->next = NULL;
   +
   + /*
   +  * trylock || xchg(lock, node)
   +  *
   +  * 0,0 -> 0,1 ; trylock
   +  * p,x -> n,x ; prev = xchg(lock, node)
  
  I looked at that for 10 seconds and I was not sure what you meant.
  Is this related to the MCS document you had pointed to? It would help
  if you mention that the comments follow the document. (But they
  don't seem to)
  
  I presume what you mean is that if we are the next after the
  lock-holder we need only to update the 'next' (or the
  composite value of smp_processor_idx | idx) to point to us.
  
  As in, swap the 'L' with 'I' (looking at the doc)
 
 They are the 'tail','lock' tuples, so this composite atomic operation
 completes either:
 
   0,0 - 0,1  -- we had no tail, not locked; into: no tail, locked.
 
 OR
 
   p,x - n,x  -- tail was p

Re: [PATCH 10/11] qspinlock: Paravirt support

2014-06-20 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:47:07PM +0200, Peter Zijlstra wrote:
 Add minimal paravirt support.
 
 The code aims for minimal impact on the native case.

Woot!
 
 On the lock side we add one jump label (asm_goto) and 4 paravirt
 callee saved calls that default to NOPs. The only effects are the
 extra NOPs and some pointless MOVs to accomodate the calling
 convention. No register spills happen because of this (x86_64).
 
 On the unlock side we have one paravirt callee saved call, which
 defaults to the actual unlock sequence: movb $0, (%rdi) and a NOP.
 
 The actual paravirt code comes in 3 parts;
 
  - init_node; this initializes the extra data members required for PV
state. PV state data is kept 1 cacheline ahead of the regular data.
 
  - link_and_wait_node/kick_node; these are paired with the regular MCS
queueing and are placed resp. before/after the paired MCS ops.
 
  - wait_head/queue_unlock; the interesting part here is finding the
head node to kick.
 
 Tracking the head is done in two parts, firstly the pv_wait_head will
 store its cpu number in whichever node is pointed to by the tail part
 of the lock word. Secondly, pv_link_and_wait_node() will propagate the
 existing head from the old to the new tail node.

I dug in the code and I have some comments about it, but before
I post them I was wondering if you have any plans to run any performance
tests against the PV ticketlock with normal and over-committed scenarios?

Looking at this with a pen and paper I see that compared to the
PV ticketlock, the CPUs that are contending on the queue (so they
go to pv_link_and_wait_node, then progress to pv_wait_head)
go to sleep twice and get woken up twice. In the PV ticketlock the
contending CPUs would only go to sleep once and be woken up once it
was their turn.

That of course is the worst case scenario - where the CPU
that has the lock is taking forever to do its job and the
host is quite overcommitted.

Thanks!
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 03/11] qspinlock: Add pending bit

2014-06-18 Thread Konrad Rzeszutek Wilk
On Wed, Jun 18, 2014 at 01:29:48PM +0200, Paolo Bonzini wrote:
 Il 17/06/2014 22:36, Konrad Rzeszutek Wilk ha scritto:
 +/* One more attempt - but if we fail mark it as pending. */
 +if (val == _Q_LOCKED_VAL) {
 +new = _Q_LOCKED_VAL | _Q_PENDING_VAL;
 +
 +old = atomic_cmpxchg(&lock->val, val, new);
 +if (old == _Q_LOCKED_VAL) /* YEEY! */
 +return;
 +val = old;
 +}
 
 Note that Peter's code is in a for(;;) loop:
 
 
 + for (;;) {
 + /*
 +  * If we observe any contention; queue.
 +  */
 + if (val & ~_Q_LOCKED_MASK)
 + goto queue;
 +
 + new = _Q_LOCKED_VAL;
 + if (val == new)
 + new |= _Q_PENDING_VAL;
 +
 + old = atomic_cmpxchg(&lock->val, val, new);
 + if (old == val)
 + break;
 +
 + val = old;
 + }
 +
 + /*
 +  * we won the trylock
 +  */
 + if (new == _Q_LOCKED_VAL)
 + return;
 
 So what you'd have is basically:
 
   /*
* One more attempt if no one is already in queue.  Perhaps
* they have unlocked the spinlock already.
*/
   if (val == _Q_LOCKED_VAL && atomic_read(&lock->val) == 0) {
   old = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
   if (old == 0) /* YEEY! */
   return;
   val = old;
   }
 
 But I agree with Waiman that this is unlikely to trigger often enough. It
 does have to be handled in the slowpath for correctness, but the most likely
 path is (0,0,1)-(0,1,1).

nods
 
 Paolo
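As an aside, for anyone reading along, the tuples are easier to follow
as concrete 32-bit values. A throwaway illustration (my own, assuming
the _Q_PENDING_BITS == 1 layout: locked = bits 0-7, pending = bit 8,
tail = bits 9-31):

#include <stdio.h>

#define _Q_LOCKED_VAL	(1U << 0)
#define _Q_PENDING_VAL	(1U << 8)

int main(void)
{
	unsigned int val = 0;		/* (0,0,0): unlocked            */

	val |= _Q_LOCKED_VAL;		/* (0,0,1): 0x001, locked       */
	val |= _Q_PENDING_VAL;		/* (0,1,1): 0x101, the most     */
					/* likely path mentioned above  */
	printf("locked+pending word = %#x\n", val);
	return 0;
}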
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word

2014-06-18 Thread Konrad Rzeszutek Wilk
On Wed, Jun 18, 2014 at 01:37:45PM +0200, Paolo Bonzini wrote:
 Il 17/06/2014 22:55, Konrad Rzeszutek Wilk ha scritto:
 On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote:
 From: Waiman Long waiman.l...@hp.com
 
 This patch extracts the logic for the exchange of new and previous tail
 code words into a new xchg_tail() function which can be optimized in a
 later patch.
 
 And also adds a third try on acquiring the lock. That I think should
 be a separate patch.
 
 It doesn't really add a new try, the old code is:
 
 
 - for (;;) {
 - new = _Q_LOCKED_VAL;
 - if (val)
 - new = tail | (val & _Q_LOCKED_PENDING_MASK);
 -
 - old = atomic_cmpxchg(&lock->val, val, new);
 - if (old == val)
 - break;
 -
 - val = old;
 - }
 
   /*
 -  * we won the trylock; forget about queueing.
*/
 - if (new == _Q_LOCKED_VAL)
 - goto release;
 
 The trylock happens if the if (val) hits the else branch.
 
 What the patch does is change it from attempting two transition with a
 single cmpxchg:
 
 -  * 0,0,0 - 0,0,1 ; trylock
 -  * p,y,x - n,y,x ; prev = xchg(lock, node)
 
 to first doing the trylock, then the xchg.  If the trylock passes and the
 xchg returns prev=0,0,0, the next step of the algorithm goes to the
 locked/uncontended state
 
 + /*
 +  * claim the lock:
 +  *
 +  * n,0 - 0,1 : lock, uncontended
 
 Similar to your suggestion of patch 3, it's expected that the xchg will
 *not* return prev=0,0,0 after a failed trylock.

I do like your explanation. I hope that Peter will put it in the
description as it explains the change quite well.

 
 However, I *do* agree with you that it's simpler to just squash this patch
 into 01/11.

Uh, did I say that? Oh, I said why don't we make it right the first time!

I meant in terms of separating the slowpath (aka the bytelock on the pending
bit) from the queue (MCS code). Or renaming the function to be called
'complex' instead of 'slowpath' as it is getting quite hairy.

The #1 patch is nice by itself - as it lays out the foundation of the
MCS-similar code - and if Ingo decides he does not want this pending
byte-lock bit business - it can be easily reverted or dropped.

In terms of squashing this in #1 - I would advocate against that.

Thanks!
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word

2014-06-18 Thread Konrad Rzeszutek Wilk
 However, I *do* agree with you that it's simpler to just squash this patch
 into 01/11.
 Uh, did I say that? Oh, I said why don't we make it right the first time!
 
 I meant in terms of separating the slowpath (aka the bytelock on the pending
 bit) from the queue (MCS code). Or renaming the function to be called
 'complex' instead of 'slowpath' as it is getting quite hairy.
 
 The #1 patch is nice by itself - as it lays out the foundation of the
 MCS-similar code - and if Ingo decides he does not want this pending
 byte-lock bit business - it can be easily reverted or dropped.
 
 The pending bit code is needed for performance parity with ticket spinlock
 for light load. My own measurement indicates that the queuing overhead will
 cause the queue spinlock to be slower than ticket spinlock with 2-4
 contending tasks. The pending bit solves the performance problem with 2

Aha!

 contending tasks, leave only the 3-4 tasks cases being a bit slower than the
 ticket spinlock which should be more than compensated by its superior
 performance with heavy contention and slightly better performance with no
 contention.

That should be mentioned in the commit description as the rationale for
the "qspinlock: Add pending bit" patch, and also in the code.

Thank you!
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 09/11] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled

2014-06-18 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:47:06PM +0200, Peter Zijlstra wrote:
 From: Waiman Long waiman.l...@hp.com
 
 This patch renames the paravirt_ticketlocks_enabled static key to a
 more generic paravirt_spinlocks_enabled name.
 
 Signed-off-by: Waiman Long waiman.l...@hp.com
 Signed-off-by: Peter Zijlstra pet...@infradead.org

Acked-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com

 ---
  arch/x86/include/asm/spinlock.h  |4 ++--
  arch/x86/kernel/kvm.c|2 +-
  arch/x86/kernel/paravirt-spinlocks.c |4 ++--
  arch/x86/xen/spinlock.c  |2 +-
  4 files changed, 6 insertions(+), 6 deletions(-)
 
 --- a/arch/x86/include/asm/spinlock.h
 +++ b/arch/x86/include/asm/spinlock.h
 @@ -39,7 +39,7 @@
  /* How long a lock should spin before we consider blocking */
  #define SPIN_THRESHOLD   (1  15)
  
 -extern struct static_key paravirt_ticketlocks_enabled;
 +extern struct static_key paravirt_spinlocks_enabled;
  static __always_inline bool static_key_false(struct static_key *key);
  
  #ifdef CONFIG_QUEUE_SPINLOCK
 @@ -150,7 +150,7 @@ static inline void __ticket_unlock_slowp
  static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
  {
   if (TICKET_SLOWPATH_FLAG &&
 - static_key_false(paravirt_ticketlocks_enabled)) {
 + static_key_false(paravirt_spinlocks_enabled)) {
   arch_spinlock_t prev;
  
   prev = *lock;
 --- a/arch/x86/kernel/kvm.c
 +++ b/arch/x86/kernel/kvm.c
 @@ -819,7 +819,7 @@ static __init int kvm_spinlock_init_jump
   if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
   return 0;
  
 - static_key_slow_inc(paravirt_ticketlocks_enabled);
 + static_key_slow_inc(paravirt_spinlocks_enabled);
   printk(KERN_INFO KVM setup paravirtual spinlock\n);
  
   return 0;
 --- a/arch/x86/kernel/paravirt-spinlocks.c
 +++ b/arch/x86/kernel/paravirt-spinlocks.c
 @@ -16,5 +16,5 @@ struct pv_lock_ops pv_lock_ops = {
  };
  EXPORT_SYMBOL(pv_lock_ops);
  
 -struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
 -EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
 +struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE;
 +EXPORT_SYMBOL(paravirt_spinlocks_enabled);
 --- a/arch/x86/xen/spinlock.c
 +++ b/arch/x86/xen/spinlock.c
 @@ -293,7 +293,7 @@ static __init int xen_init_spinlocks_jum
   if (!xen_domain())
   return 0;
  
 - static_key_slow_inc(paravirt_ticketlocks_enabled);
 + static_key_slow_inc(paravirt_spinlocks_enabled);
   return 0;
  }
  early_initcall(xen_init_spinlocks_jump);
 
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 08/11] qspinlock: Revert to test-and-set on hypervisors

2014-06-18 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:47:05PM +0200, Peter Zijlstra wrote:
 When we detect a hypervisor (!paravirt, see later patches), revert to

Please spell out the name of the patches.

 a simple test-and-set lock to avoid the horrors of queue preemption.

Heheh.
 
 Signed-off-by: Peter Zijlstra pet...@infradead.org
 ---
  arch/x86/include/asm/qspinlock.h |   14 ++
  include/asm-generic/qspinlock.h  |7 +++
  kernel/locking/qspinlock.c   |3 +++
  3 files changed, 24 insertions(+)
 
 --- a/arch/x86/include/asm/qspinlock.h
 +++ b/arch/x86/include/asm/qspinlock.h
 @@ -1,6 +1,7 @@
  #ifndef _ASM_X86_QSPINLOCK_H
  #define _ASM_X86_QSPINLOCK_H
  
 +#include asm/cpufeature.h
  #include asm-generic/qspinlock_types.h
  
  #if !defined(CONFIG_X86_OOSTORE)  !defined(CONFIG_X86_PPRO_FENCE)
 @@ -20,6 +21,19 @@ static inline void queue_spin_unlock(str
  
  #endif /* !CONFIG_X86_OOSTORE  !CONFIG_X86_PPRO_FENCE */
  
 +#define virt_queue_spin_lock virt_queue_spin_lock
 +
 +static inline bool virt_queue_spin_lock(struct qspinlock *lock)
 +{
 + if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
 + return false;
 +
 + while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
 + cpu_relax();
 +
 + return true;
 +}
 +
  #include asm-generic/qspinlock.h
  
  #endif /* _ASM_X86_QSPINLOCK_H */
 --- a/include/asm-generic/qspinlock.h
 +++ b/include/asm-generic/qspinlock.h
 @@ -98,6 +98,13 @@ static __always_inline void queue_spin_u
  }
  #endif
  
 +#ifndef virt_queue_spin_lock
 +static __always_inline bool virt_queue_spin_lock(struct qspinlock *lock)
 +{
 + return false;
 +}
 +#endif
 +
  /*
   * Initializier
   */
 --- a/kernel/locking/qspinlock.c
 +++ b/kernel/locking/qspinlock.c
 @@ -247,6 +247,9 @@ void queue_spin_lock_slowpath(struct qsp
  
    BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
  
 + if (virt_queue_spin_lock(lock))
 + return;
 +
   /*
* wait for in-progress pending-locked hand-overs
*
 
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 07/11] qspinlock: Use a simple write to grab the lock, if applicable

2014-06-18 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:47:04PM +0200, Peter Zijlstra wrote:
 From: Waiman Long waiman.l...@hp.com
 
 Currently, atomic_cmpxchg() is used to get the lock. However, this is
 not really necessary if there is more than one task in the queue and
 the queue head don't need to reset the queue code word. For that case,

s/queue code word/tail {number,value}/ ?


 a simple write to set the lock bit is enough as the queue head will
 be the only one eligible to get the lock as long as it checks that
 both the lock and pending bits are not set. The current pending bit
 waiting code will ensure that the bit will not be set as soon as the
 queue code word (tail) in the lock is set.

Just use the same word as above.
 
 With that change, there are some slight improvements in the performance
 of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket
 Westmere-EX machine as shown in the tables below.
 
   [Standalone/Embedded - same node]
   # of tasks  Before patchAfter patch %Change
   --  --- --  ---
3   2324/2321  2248/2265-3%/-2%
4   2890/2896  2819/2831-2%/-2%
5   3611/3595  3522/3512-2%/-2%
6   4281/4276  4173/4160-3%/-3%
7   5018/5001  4875/4861-3%/-3%
8   5759/5750  5563/5568-3%/-3%
 
   [Standalone/Embedded - different nodes]
   # of tasks  Before patchAfter patch %Change
   --  --- --  ---
3  12242/12237 12087/12093  -1%/-1%
4  10688/10696 10507/10521  -2%/-2%
 
 It was also found that this change produced a much bigger performance
 improvement in the newer IvyBridge-EX chip and was essentially to close
 the performance gap between the ticket spinlock and queue spinlock.
 
 The disk workload of the AIM7 benchmark was run on a 4-socket
 Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
 on a 3.14 based kernel. The results of the test runs were:
 
 AIM7 XFS Disk Test
   kernel JPMReal Time   Sys TimeUsr Time
   -  ----   
   ticketlock56782333.17   96.61   5.81
   qspinlock 57507993.13   94.83   5.97
 
 AIM7 EXT4 Disk Test
   kernel JPMReal Time   Sys TimeUsr Time
   -  ----   
   ticketlock1114551   16.15  509.72   7.11
   qspinlock 21844668.24  232.99   6.01
 
 The ext4 filesystem run had a much higher spinlock contention than
 the xfs filesystem run.
 
 The ebizzy -m test was also run with the following results:
 
   kernel   records/s  Real Time   Sys TimeUsr Time
   --  -   
   ticketlock 2075   10.00  216.35   3.49
   qspinlock  3023   10.00  198.20   4.80
 
 Signed-off-by: Waiman Long waiman.l...@hp.com
 Signed-off-by: Peter Zijlstra pet...@infradead.org
 ---
  kernel/locking/qspinlock.c |   59 
 -
  1 file changed, 43 insertions(+), 16 deletions(-)
 
 --- a/kernel/locking/qspinlock.c
 +++ b/kernel/locking/qspinlock.c
 @@ -93,24 +93,33 @@ static inline struct mcs_spinlock *decod
   * By using the whole 2nd least significant byte for the pending bit, we
   * can allow better optimization of the lock acquisition for the pending
   * bit holder.
 + *
 + * This internal structure is also used by the set_locked function which
 + * is not restricted to _Q_PENDING_BITS == 8.
   */
 -#if _Q_PENDING_BITS == 8
 -
  struct __qspinlock {
   union {
   atomic_t val;
 - struct {
  #ifdef __LITTLE_ENDIAN
 + u8   locked;
 + struct {
   u16 locked_pending;
   u16 tail;
 + };
  #else
 + struct {
   u16 tail;
   u16 locked_pending;
 -#endif
   };
 + struct {
 + u8  reserved[3];
 + u8  locked;
 + };
 +#endif
   };
  };
  
 +#if _Q_PENDING_BITS == 8
  /**
   * clear_pending_set_locked - take ownership and clear the pending bit.
   * @lock: Pointer to queue spinlock structure
 @@ -197,6 +206,19 @@ static __always_inline u32 xchg_tail(str
  #endif /* _Q_PENDING_BITS == 8 */
  
  /**
 + * set_locked - Set the lock bit and own the lock

Full stop missing.

 + * @lock: Pointer to queue spinlock structure

Ditto.
 + *
 + * *,*,0 - *,0,1
 + */
 +static __always_inline void set_locked(struct qspinlock *lock)
 +{
 + struct __qspinlock *l = (void *)lock;
 +
 + ACCESS_ONCE(l->locked) = _Q_LOCKED_VAL;
 +}
 +
 +/**
   * 

Re: [PATCH 05/11] qspinlock: Optimize for smaller NR_CPUS

2014-06-18 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:47:02PM +0200, Peter Zijlstra wrote:
 From: Peter Zijlstra pet...@infradead.org
 
 When we allow for a max NR_CPUS  2^14 we can optimize the pending
 wait-acquire and the xchg_tail() operations.
 
 By growing the pending bit to a byte, we reduce the tail to 16bit.
 This means we can use xchg16 for the tail part and do away with all
 the repeated compxchg() operations.
 
 This in turn allows us to unconditionally acquire; the locked state
 as observed by the wait loops cannot change. And because both locked
 and pending are now a full byte we can use simple stores for the
 state transition, obviating one atomic operation entirely.

I have to ask - how much more performance do you get from this?

Is this extra atomic operation hurting that much?
 
 All this is horribly broken on Alpha pre EV56 (and any other arch that
 cannot do single-copy atomic byte stores).
 
 Signed-off-by: Peter Zijlstra pet...@infradead.org
 ---
  include/asm-generic/qspinlock_types.h |   13 
  kernel/locking/qspinlock.c|  103 
 ++
  2 files changed, 106 insertions(+), 10 deletions(-)
 
 --- a/include/asm-generic/qspinlock_types.h
 +++ b/include/asm-generic/qspinlock_types.h
 @@ -38,6 +38,14 @@ typedef struct qspinlock {
  /*
   * Bitfields in the atomic value:
   *
 + * When NR_CPUS < 16K
 + *  0- 7: locked byte
 + * 8: pending
 + *  9-15: not used
 + * 16-17: tail index
 + * 18-31: tail cpu (+1)
 + *
 + * When NR_CPUS >= 16K
   *  0- 7: locked byte
   * 8: pending
   *  9-10: tail index
 @@ -50,7 +58,11 @@ typedef struct qspinlock {
  #define _Q_LOCKED_MASK   _Q_SET_MASK(LOCKED)
  
  #define _Q_PENDING_OFFSET(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
 +#if CONFIG_NR_CPUS < (1U << 14)
 +#define _Q_PENDING_BITS  8
 +#else
  #define _Q_PENDING_BITS  1
 +#endif
  #define _Q_PENDING_MASK  _Q_SET_MASK(PENDING)
  
  #define _Q_TAIL_IDX_OFFSET   (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
 @@ -61,6 +73,7 @@ typedef struct qspinlock {
  #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
  #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)
  
 +#define _Q_TAIL_OFFSET   _Q_TAIL_IDX_OFFSET
  #define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
  
  #define _Q_LOCKED_VAL(1U  _Q_LOCKED_OFFSET)
 --- a/kernel/locking/qspinlock.c
 +++ b/kernel/locking/qspinlock.c
 @@ -22,6 +22,7 @@
  #include linux/percpu.h
  #include linux/hardirq.h
  #include linux/mutex.h
 +#include asm/byteorder.h
  #include asm/qspinlock.h
  
  /*
 @@ -48,6 +49,9 @@
   * We can further change the first spinner to spin on a bit in the lock word
   * instead of its node; whereby avoiding the need to carry a node from lock 
 to
   * unlock, and preserving API.
 + *
 + * N.B. The current implementation only supports architectures that allow
 + *  atomic operations on smaller 8-bit and 16-bit data types.
   */
  
  #include mcs_spinlock.h
 @@ -85,6 +89,87 @@ static inline struct mcs_spinlock *decod
  
  #define _Q_LOCKED_PENDING_MASK   (_Q_LOCKED_MASK | _Q_PENDING_MASK)
  
 +/*
 + * By using the whole 2nd least significant byte for the pending bit, we
 + * can allow better optimization of the lock acquisition for the pending
 + * bit holder.
 + */
 +#if _Q_PENDING_BITS == 8
 +
 +struct __qspinlock {
 + union {
 + atomic_t val;
 + struct {
 +#ifdef __LITTLE_ENDIAN
 + u16 locked_pending;
 + u16 tail;
 +#else
 + u16 tail;
 + u16 locked_pending;
 +#endif
 + };
 + };
 +};
 +
 +/**
 + * clear_pending_set_locked - take ownership and clear the pending bit.
 + * @lock: Pointer to queue spinlock structure
 + * @val : Current value of the queue spinlock 32-bit word
 + *
 + * *,1,0 - *,0,1
 + *
 + * Lock stealing is not allowed if this function is used.
 + */
 +static __always_inline void
 +clear_pending_set_locked(struct qspinlock *lock, u32 val)
 +{
 + struct __qspinlock *l = (void *)lock;
 +
 + ACCESS_ONCE(l->locked_pending) = _Q_LOCKED_VAL;
 +}
 +
 +/*
 + * xchg_tail - Put in the new queue tail code word  retrieve previous one

Missing full stop.
 + * @lock : Pointer to queue spinlock structure
 + * @tail : The new queue tail code word
 + * Return: The previous queue tail code word
 + *
 + * xchg(lock, tail)
 + *
 + * p,*,* - n,*,* ; prev = xchg(lock, node)
 + */
 +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 +{
 + struct __qspinlock *l = (void *)lock;
 +
 + return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
 +}
 +
 +#else /* _Q_PENDING_BITS == 8 */
 +
 +/**
 + * clear_pending_set_locked - take ownership and clear the pending bit.
 + * @lock: Pointer to queue spinlock structure
 + * @val : Current value of the queue spinlock 32-bit word
 + *
 + * *,1,0 - *,0,1
 + */
 +static __always_inline void
 +clear_pending_set_locked(struct 

Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock

2014-06-17 Thread Konrad Rzeszutek Wilk
  +   new = tail | (val & _Q_LOCKED_MASK);
  +
  +   old = atomic_cmpxchg(&lock->val, val, new);
  +   if (old == val)
  +   break;
  +
  +   val = old;
  +   }
  +
  +   /*
  +* we won the trylock; forget about queueing.
  +*/
  +   if (new == _Q_LOCKED_VAL)
  +   goto release;
  +
  +   /*
  +* if there was a previous node; link it and wait.
  +*/
  +   if (old & ~_Q_LOCKED_MASK) {
  +   prev = decode_tail(old);
  +   ACCESS_ONCE(prev->next) = node;
  +
  +   arch_mcs_spin_lock_contended(&node->locked);

Could you add a comment here:

/* We are spinning forever until the previous node updates locked - which
it does once it has updated lock->val with our tail number. */

  +   }
  +
  +   /*
  +* we're at the head of the waitqueue, wait for the owner to go away.
  +*
  +* *,x - *,0
  +*/
  +   while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
  +   cpu_relax();
  +
  +   /*
  +* claim the lock:
  +*
  +* n,0 - 0,1 : lock, uncontended
  +* *,0 - *,1 : lock, contended
  +*/
  +   for (;;) {
  +   new = _Q_LOCKED_VAL;
  +   if (val != tail)
  +   new |= val;
 
..snip..
 
 Could you help a bit in explaining it in English please?

After looking at the assembler code I finally figured out how
we can get here. And the 'contended' part threw me off. Somehow
I imagined there are two or more CPUs stampeding here and
trying to update the lock->val. But in reality the other CPUs
are stuck in the arch_mcs_spin_lock_contended spinning on their
local value.

Perhaps you could add this comment.

/* Once queue_spin_unlock is called (which _subtracts_ _Q_LOCKED_VAL from
the lock->val while still preserving the tail data), the winner gets to
claim the ticket. Since we still need the other CPUs to continue and
preserve the strict ordering in which they setup node->next, we:
 1) update lock->val to the tail value (so tail CPU and its index) with
_Q_LOCKED_VAL.
 2) Once we are done, we poke the other CPU (the one that linked to
us) by writing to node->locked (below) so they can make progress and
loop on lock->val changing from _Q_LOCKED_MASK to zero.

*/
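
For completeness, the poke in step 2 looks roughly like the below in the patch,
if I am reading it right - simplified and paraphrased here, not quoted:

	struct mcs_spinlock *next;

	/* we now hold the lock; hand the MCS role to whoever queued behind us */
	while (!(next = ACCESS_ONCE(node->next)))
		cpu_relax();	/* the successor may not have linked itself in yet */

	arch_mcs_spin_unlock_contended(&next->locked);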
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock

2014-06-17 Thread Konrad Rzeszutek Wilk
 + * The basic principle of a queue-based spinlock can best be understood
 + * by studying a classic queue-based spinlock implementation called the
 + * MCS lock. The paper below provides a good description for this kind
 + * of lock.
 + *
 + * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
 + *
 + * This queue spinlock implementation is based on the MCS lock, however to 
 make
 + * it fit the 4 bytes we assume spinlock_t to be, and preserve its existing
 + * API, we must modify it some.
 + *
 + * In particular; where the traditional MCS lock consists of a tail pointer
 + * (8 bytes) and needs the next pointer (another 8 bytes) of its own node to
 + * unlock the next pending (next->locked), we compress both these: {tail,
 + * next->locked} into a single u32 value.
 + *
 + * Since a spinlock disables recursion of its own context and there is a 
 limit
 + * to the contexts that can nest; namely: task, softirq, hardirq, nmi, we can
 + * encode the tail as an index indicating this context and a cpu number.
 + *
 + * We can further change the first spinner to spin on a bit in the lock word
 + * instead of its node; whereby avoiding the need to carry a node from lock 
 to
 + * unlock, and preserving API.

You also made changes (compared to the MCS) in that the unlock path is not
spinning waiting for the successor and that the job of passing the lock
is not done in the unlock path either.

Instead all of that is now done in the path of the lock acquirer logic. 

Could you update the comment to say that please?

Thanks.
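
For contrast, a textbook MCS unlock does the successor hand-off itself - a
sketch (not kernel code) is below; that is exactly the part which moved into
the lock path here:

	void mcs_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
	{
		struct mcs_spinlock *next = ACCESS_ONCE(node->next);

		if (!next) {
			/* no known successor: try to clear the tail pointer */
			if (cmpxchg(lock, node, NULL) == node)
				return;
			/* somebody is in the middle of linking in; wait for next */
			while (!(next = ACCESS_ONCE(node->next)))
				cpu_relax();
		}
		ACCESS_ONCE(next->locked) = 1;	/* pass the lock on */
	}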
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 03/11] qspinlock: Add pending bit

2014-06-17 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:47:00PM +0200, Peter Zijlstra wrote:
 Because the qspinlock needs to touch a second cacheline; add a pending
 bit and allow a single in-word spinner before we punt to the second
 cacheline.

Could you add this in the description please:

And by second cacheline we mean the local 'node'. That is the:
mcs_nodes[0] and mcs_nodes[idx]

Perhaps it might be better then to split this in the header file
as this is trying to not be a slowpath code - but rather - a
pre-slow-path-lets-try-if-we can do another cmpxchg in case
the unlocker has just unlocked itself.

So something like:

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index e8a7ae8..29cc9c7 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -75,11 +75,21 @@ extern void queue_spin_lock_slowpath(struct qspinlock 
*lock, u32 val);
  */
 static __always_inline void queue_spin_lock(struct qspinlock *lock)
 {
-   u32 val;
+   u32 val, new;
 
	val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
if (likely(val == 0))
return;
+
+   /* One more attempt - but if we fail mark it as pending. */
+   if (val == _Q_LOCKED_VAL) {
+   new = Q_LOCKED_VAL |_Q_PENDING_VAL;
+
 +   old = atomic_cmpxchg(&lock->val, val, new);
+   if (old == _Q_LOCKED_VAL) /* YEEY! */
+   return;
+   val = old;
+   }
queue_spin_lock_slowpath(lock, val);
 }

and then the slowpath preserves most of the old logic path
(with the pending bit stuff)?

 
 
 Signed-off-by: Peter Zijlstra pet...@infradead.org
 ---
  include/asm-generic/qspinlock_types.h |   12 ++-
  kernel/locking/qspinlock.c|  109 
 +++---
  2 files changed, 97 insertions(+), 24 deletions(-)
 
 --- a/include/asm-generic/qspinlock_types.h
 +++ b/include/asm-generic/qspinlock_types.h
 @@ -39,8 +39,9 @@ typedef struct qspinlock {
   * Bitfields in the atomic value:
   *
   *  0- 7: locked byte
 - *  8- 9: tail index
 - * 10-31: tail cpu (+1)
 + * 8: pending
 + *  9-10: tail index
 + * 11-31: tail cpu (+1)
   */
  #define _Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
				      << _Q_ ## type ## _OFFSET)
 @@ -48,7 +49,11 @@ typedef struct qspinlock {
  #define _Q_LOCKED_BITS   8
  #define _Q_LOCKED_MASK   _Q_SET_MASK(LOCKED)
  
 -#define _Q_TAIL_IDX_OFFSET   (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
 +#define _Q_PENDING_OFFSET	(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
 +#define _Q_PENDING_BITS  1
 +#define _Q_PENDING_MASK  _Q_SET_MASK(PENDING)
 +
 +#define _Q_TAIL_IDX_OFFSET   (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
  #define _Q_TAIL_IDX_BITS 2
  #define _Q_TAIL_IDX_MASK _Q_SET_MASK(TAIL_IDX)
  
 @@ -57,5 +62,6 @@ typedef struct qspinlock {
  #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)
  
  #define _Q_LOCKED_VAL	(1U << _Q_LOCKED_OFFSET)
 +#define _Q_PENDING_VAL	(1U << _Q_PENDING_OFFSET)
  
  #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
 --- a/kernel/locking/qspinlock.c
 +++ b/kernel/locking/qspinlock.c
 @@ -83,24 +83,28 @@ static inline struct mcs_spinlock *decod
   return per_cpu_ptr(&mcs_nodes[idx], cpu);
  }
  
 +#define _Q_LOCKED_PENDING_MASK   (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 +
  /**
   * queue_spin_lock_slowpath - acquire the queue spinlock
   * @lock: Pointer to queue spinlock structure
   * @val: Current value of the queue spinlock 32-bit word
   *
  - * (queue tail, lock bit)
  - *
  - *              fast      :    slow                                  :    unlock
  - *                        :                                          :
  - * uncontended  (0,0)   --:--> (0,1) --------------------------------:--> (*,0)
  - *                        :       | ^--------.                    /  :
  - *                        :       v           \                   |  :
  - * uncontended            :    (n,x) --+--> (n,0)                 |  :
  - *   queue                :       | ^--'                          |  :
  - *                        :       v                               |  :
  - * contended              :    (*,x) --+--> (*,0) -----> (*,1) ---'  :
  - *   queue                :         ^--'                             :
  + * (queue tail, pending bit, lock bit)
    *
  + *              fast     :    slow                                  :    unlock
  + *                       :                                          :
  + * uncontended  (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
  + *                       :       | ^--------.------.             /  :
  + *                       :       v           \      \            |  :
  + * pending               :    (0,1,1) +--> (0,1,0)   \           |  :
  + *                       :       | ^--'              |           |  :
  + *                       :       v                   |           |  :
 + * 

Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word

2014-06-17 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote:
 From: Waiman Long waiman.l...@hp.com
 
 This patch extracts the logic for the exchange of new and previous tail
 code words into a new xchg_tail() function which can be optimized in a
 later patch.

And also adds a third try on acquiring the lock. That I think should
be a seperate patch.

And instead of saying 'later patch' you should spell out the name
of the patch. Especially as this might not be obvious from somebody
doing git bisection.

 
 Signed-off-by: Waiman Long waiman.l...@hp.com
 Signed-off-by: Peter Zijlstra pet...@infradead.org
 ---
  include/asm-generic/qspinlock_types.h |2 +
  kernel/locking/qspinlock.c|   58 
 +-
  2 files changed, 38 insertions(+), 22 deletions(-)
 
 --- a/include/asm-generic/qspinlock_types.h
 +++ b/include/asm-generic/qspinlock_types.h
 @@ -61,6 +61,8 @@ typedef struct qspinlock {
  #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
  #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)
  
 +#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
 +
  #define _Q_LOCKED_VAL	(1U << _Q_LOCKED_OFFSET)
  #define _Q_PENDING_VAL	(1U << _Q_PENDING_OFFSET)
  
 --- a/kernel/locking/qspinlock.c
 +++ b/kernel/locking/qspinlock.c
 @@ -86,6 +86,31 @@ static inline struct mcs_spinlock *decod
  #define _Q_LOCKED_PENDING_MASK   (_Q_LOCKED_MASK | _Q_PENDING_MASK)
  
  /**
 + * xchg_tail - Put in the new queue tail code word & retrieve previous one
 + * @lock : Pointer to queue spinlock structure
 + * @tail : The new queue tail code word
 + * Return: The previous queue tail code word
 + *
 + * xchg(lock, tail)
 + *
 + * p,*,* -> n,*,* ; prev = xchg(lock, node)
 + */
 +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 +{
 + u32 old, new, val = atomic_read(&lock->val);
 +
 + for (;;) {
 + new = (val & _Q_LOCKED_PENDING_MASK) | tail;
 + old = atomic_cmpxchg(&lock->val, val, new);
 + if (old == val)
 + break;
 +
 + val = old;
 + }
 + return old;
 +}
 +
 +/**
   * queue_spin_lock_slowpath - acquire the queue spinlock
   * @lock: Pointer to queue spinlock structure
   * @val: Current value of the queue spinlock 32-bit word
 @@ -182,36 +207,25 @@ void queue_spin_lock_slowpath(struct qsp
   node->next = NULL;
  
   /*
 -  * we already touched the queueing cacheline; don't bother with pending
 -  * stuff.
 -  *
 -  * trylock || xchg(lock, node)
 -  *
 -  * 0,0,0 -> 0,0,1 ; trylock
 -  * p,y,x -> n,y,x ; prev = xchg(lock, node)
 +  * We touched a (possibly) cold cacheline in the per-cpu queue node;
 +  * attempt the trylock once more in the hope someone let go while we
 +  * weren't watching.
*/
 - for (;;) {
 - new = _Q_LOCKED_VAL;
 - if (val)
 - new = tail | (val & _Q_LOCKED_PENDING_MASK);
 -
 - old = atomic_cmpxchg(&lock->val, val, new);
 - if (old == val)
 - break;
 -
 - val = old;
 - }
 + if (queue_spin_trylock(lock))
 + goto release;

So now are three of them? One in queue_spin_lock, then at the start
of this function when checking for the pending bit, and the once more
here. And that is because the local cache line might be cold for the
'mcs_index' struct?

That all seems to be a bit of experimental. But then we are already
in the slowpath so we could as well do:

for (i = 0; i < 10; i++)
if (queue_spin_trylock(lock))
goto release;

And would have the same effect.


  
   /*
 -  * we won the trylock; forget about queueing.
 +  * we already touched the queueing cacheline; don't bother with pending
 +  * stuff.

I guess we could also just erase the pending bit if we wanted to. The
optimistic spinning will still go to the queue label as lock->val will
have the tail value.

 +  *
 +  * p,*,* -> n,*,*
*/
 - if (new == _Q_LOCKED_VAL)
 - goto release;
 + old = xchg_tail(lock, tail);
  
   /*
* if there was a previous node; link it and wait.
*/
 - if (old & ~_Q_LOCKED_PENDING_MASK) {
 + if (old & _Q_TAIL_MASK) {
   prev = decode_tail(old);
   ACCESS_ONCE(prev->next) = node;
  
 
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v11 14/16] pvqspinlock: Add qspinlock para-virtualization support

2014-06-17 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 03:16:54PM +0200, Peter Zijlstra wrote:
 On Thu, Jun 12, 2014 at 04:48:41PM -0400, Waiman Long wrote:
  I don't have a good understanding of the kernel alternatives mechanism.
 
 I didn't either; I do now, cost me a whole day reading up on
 alternative/paravirt code patching.
 
 See the patches I just send out; I got the 'native' case with paravirt
 enabled to be one NOP worse than the native case without paravirt -- for
 queue_spin_unlock.
 
 The lock slowpath is several nops and some pointless movs more expensive.

You could use the asm goto which would optimize the fast path to be the
'native' case. That way you wouldn't have the nops and movs in the
path.

(And asm goto also uses the alternative_asm macros).
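
Something along the lines of the below is what I have in mind - a static key
(which is asm goto underneath on x86), so the native unlock stays one
not-taken branch. The key name and pv_queue_spin_unlock() are made up; this
is only a sketch, not the actual patch:

	extern struct static_key paravirt_qspinlocks_enabled;

	static __always_inline void queue_spin_unlock(struct qspinlock *lock)
	{
		if (static_key_false(&paravirt_qspinlocks_enabled)) {
			pv_queue_spin_unlock(lock);	/* PV aware slow unlock */
			return;
		}
		/* native: drop _Q_LOCKED_VAL, keep the tail bits
		 * (release barrier elided in this sketch) */
		atomic_sub(_Q_LOCKED_VAL, &lock->val);
	}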
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 03/11] qspinlock: Add pending bit

2014-06-17 Thread Konrad Rzeszutek Wilk
On Tue, Jun 17, 2014 at 04:51:57PM -0400, Waiman Long wrote:
 On 06/17/2014 04:36 PM, Konrad Rzeszutek Wilk wrote:
 On Sun, Jun 15, 2014 at 02:47:00PM +0200, Peter Zijlstra wrote:
 Because the qspinlock needs to touch a second cacheline; add a pending
 bit and allow a single in-word spinner before we punt to the second
 cacheline.
 Could you add this in the description please:
 
 And by second cacheline we mean the local 'node'. That is the:
 mcs_nodes[0] and mcs_nodes[idx]
 
 Perhaps it might be better then to split this in the header file
 as this is trying to not be a slowpath code - but rather - a
 pre-slow-path-lets-try-if-we can do another cmpxchg in case
 the unlocker has just unlocked itself.
 
 So something like:
 
 diff --git a/include/asm-generic/qspinlock.h 
 b/include/asm-generic/qspinlock.h
 index e8a7ae8..29cc9c7 100644
 --- a/include/asm-generic/qspinlock.h
 +++ b/include/asm-generic/qspinlock.h
 @@ -75,11 +75,21 @@ extern void queue_spin_lock_slowpath(struct qspinlock 
 *lock, u32 val);
*/
   static __always_inline void queue_spin_lock(struct qspinlock *lock)
   {
 -u32 val;
 +u32 val, new;
 
	val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
  if (likely(val == 0))
  return;
 +
 +/* One more attempt - but if we fail mark it as pending. */
 +if (val == _Q_LOCKED_VAL) {
 +new = Q_LOCKED_VAL |_Q_PENDING_VAL;
 +
 +old = atomic_cmpxchg(&lock->val, val, new);
 +if (old == _Q_LOCKED_VAL) /* YEEY! */
 +return;
 
 No, it can't leave like that. The unlock path will not clear the pending bit.

Err, you are right. It needs to go back in the slowpath.

 We are trying to make the fastpath as simple as possible as it may be
 inlined. The complexity of the queue spinlock is in the slowpath.

Sure, but then it shouldn't be called slowpath anymore as it is not
slow. It is a combination of fast path (the potential chance of
grabbing the lock and setting the pending lock) and the real slow
path (the queuing). Perhaps it should be called 'queue_spinlock_complex' ?

 
 Moreover, a cmpxchg immediately followed by another cmpxchg will
 just increase the level of memory contention when a lock is fairly
 contended. The chance of second cmpxchg() succeeding will be pretty low.

Then why even do the pending bit - which is what the slowpath does
for the first time. And if it grabs it (and sets the pending bit) it
immediately exits. Why not percolate that piece of code into this header.

And then leave all that slow code (queueing, mcs_lock access, etc.) in the slowpath.

 
 -Longman
 
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 03/11] qspinlock: Add pending bit

2014-06-17 Thread Konrad Rzeszutek Wilk
On Tue, Jun 17, 2014 at 05:07:29PM -0400, Konrad Rzeszutek Wilk wrote:
 On Tue, Jun 17, 2014 at 04:51:57PM -0400, Waiman Long wrote:
  On 06/17/2014 04:36 PM, Konrad Rzeszutek Wilk wrote:
  On Sun, Jun 15, 2014 at 02:47:00PM +0200, Peter Zijlstra wrote:
  Because the qspinlock needs to touch a second cacheline; add a pending
  bit and allow a single in-word spinner before we punt to the second
  cacheline.
  Could you add this in the description please:
  
  And by second cacheline we mean the local 'node'. That is the:
  mcs_nodes[0] and mcs_nodes[idx]
  
  Perhaps it might be better then to split this in the header file
  as this is trying to not be a slowpath code - but rather - a
  pre-slow-path-lets-try-if-we can do another cmpxchg in case
  the unlocker has just unlocked itself.
  
  So something like:
  
  diff --git a/include/asm-generic/qspinlock.h 
  b/include/asm-generic/qspinlock.h
  index e8a7ae8..29cc9c7 100644
  --- a/include/asm-generic/qspinlock.h
  +++ b/include/asm-generic/qspinlock.h
  @@ -75,11 +75,21 @@ extern void queue_spin_lock_slowpath(struct qspinlock 
  *lock, u32 val);
 */
static __always_inline void queue_spin_lock(struct qspinlock *lock)
{
  -  u32 val;
  +  u32 val, new;
  
 	val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
 if (likely(val == 0))
 return;
  +
  +  /* One more attempt - but if we fail mark it as pending. */
  +  if (val == _Q_LOCKED_VAL) {
  +  new = Q_LOCKED_VAL |_Q_PENDING_VAL;
  +
  +  old = atomic_cmpxchg(&lock->val, val, new);
  +  if (old == _Q_LOCKED_VAL) /* YEEY! */
  +  return;
  
  No, it can't leave like that. The unlock path will not clear the pending bit.
 
 Err, you are right. It needs to go back in the slowpath.

What I should have written is:

if (old == 0) /* YEEY */
  return;

As that would be the same thing as this patch does on the pending bit - that
is if we can on the second compare and exchange set the pending bit (and the
lock) and the lock has been released - we are good.

And it is a quick path.

 
  We are trying to make the fastpath as simple as possible as it may be
  inlined. The complexity of the queue spinlock is in the slowpath.
 
 Sure, but then it shouldn't be called slowpath anymore as it is not
 slow. It is a combination of fast path (the potential chance of
 grabbing the lock and setting the pending lock) and the real slow
 path (the queuing). Perhaps it should be called 'queue_spinlock_complex' ?
 

I forgot to mention - that was the crux of my comments - just change
the slowpath to complex name at that point to better reflect what
it does.

  
  Moreover, a cmpxchg immediately followed by another cmpxchg will
  just increase the level of memory contention when a lock is fairly
  contended. The chance of second cmpxchg() succeeding will be pretty low.
 
 Then why even do the pending bit - which is what the slowpath does
 for the first time. And if it grabs it (and sets the pending bit) it
 immediately exits. Why not percolate that piece of code into this header.
 
 And then leave all that slow code (queueing, mcs_lock access, etc.) in the 
 slowpath.
 
  
  -Longman
  
  
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 03/11] qspinlock: Add pending bit

2014-06-17 Thread Konrad Rzeszutek Wilk

On Jun 17, 2014 6:25 PM, Waiman Long waiman.l...@hp.com wrote:

 On 06/17/2014 05:10 PM, Konrad Rzeszutek Wilk wrote: 
  On Tue, Jun 17, 2014 at 05:07:29PM -0400, Konrad Rzeszutek Wilk wrote: 
  On Tue, Jun 17, 2014 at 04:51:57PM -0400, Waiman Long wrote: 
  On 06/17/2014 04:36 PM, Konrad Rzeszutek Wilk wrote: 
  On Sun, Jun 15, 2014 at 02:47:00PM +0200, Peter Zijlstra wrote: 
  Because the qspinlock needs to touch a second cacheline; add a pending 
  bit and allow a single in-word spinner before we punt to the second 
  cacheline. 
  Could you add this in the description please: 
  
  And by second cacheline we mean the local 'node'. That is the: 
  mcs_nodes[0] and mcs_nodes[idx] 
  
  Perhaps it might be better then to split this in the header file 
  as this is trying to not be a slowpath code - but rather - a 
  pre-slow-path-lets-try-if-we can do another cmpxchg in case 
  the unlocker has just unlocked itself. 
  
  So something like: 
  
  diff --git a/include/asm-generic/qspinlock.h 
  b/include/asm-generic/qspinlock.h 
  index e8a7ae8..29cc9c7 100644 
  --- a/include/asm-generic/qspinlock.h 
  +++ b/include/asm-generic/qspinlock.h 
  @@ -75,11 +75,21 @@ extern void queue_spin_lock_slowpath(struct 
  qspinlock *lock, u32 val); 
     */ 
    static __always_inline void queue_spin_lock(struct qspinlock *lock) 
    { 
  - u32 val; 
  + u32 val, new; 
  
    val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL); 
    if (likely(val == 0)) 
    return; 
  + 
  + /* One more attempt - but if we fail mark it as pending. */ 
  + if (val == _Q_LOCKED_VAL) { 
  + new = Q_LOCKED_VAL |_Q_PENDING_VAL; 
  + 
  + old = atomic_cmpxchg(&lock->val, val, new); 
  + if (old == _Q_LOCKED_VAL) /* YEEY! */ 
  + return; 
  No, it can't leave like that. The unlock path will not clear the pending 
  bit. 
  Err, you are right. It needs to go back in the slowpath. 
  What I should have written is: 
  
  if (old == 0) /* YEEY */ 
     return; 

 Unfortunately, that still doesn't work. If old is 0, it just meant the 
 cmpxchg failed. It still hasn't got the lock. 
  As that would be the same thing as this patch does on the pending bit - that 
  is if we can on the second compare and exchange set the pending bit (and 
  the 
  lock) and the lock has been released - we are good. 

 That is not true. When the lock is freed, the pending bit holder will 
 still have to clear the pending bit and set the lock bit as is done in 
 the slowpath. We cannot skip the step here. The problem of moving the 
 pending code here is that it includes a wait loop which we don't want to 
 put in the fastpath. 
  
  And it is a quick path. 
  
  We are trying to make the fastpath as simple as possible as it may be 
  inlined. The complexity of the queue spinlock is in the slowpath. 
  Sure, but then it shouldn't be called slowpath anymore as it is not 
  slow. It is a combination of fast path (the potential chance of 
  grabbing the lock and setting the pending lock) and the real slow 
  path (the queuing). Perhaps it should be called 'queue_spinlock_complex' ? 
  
  I forgot to mention - that was the crux of my comments - just change 
  the slowpath to complex name at that point to better reflect what 
  it does. 

 Actually in my v11 patch, I subdivided the slowpath into a slowpath for 
 the pending code and slowerpath for actual queuing. Perhaps, we could 
 use quickpath and slowpath instead. Anyway, it is a minor detail that we 
 can discuss after the core code get merged.

 -Longman

Why not do it the right way the first time around?

That aside - these optimizations seem to make the code harder to read. And 
they do remind me of the scheduler code in 2.6.x which was based on heuristics 
- and eventually ripped out.

So are these optimizations based on turning off certain hardware features? Say 
hardware prefetching?

What I am getting at - can the hardware do this at some point (or perhaps 
already does on IvyBridge-EX?) - that is prefetch the per-cpu areas so they are 
always hot? And rendering this optimization not needed?

Thanks!
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock

2014-06-16 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:46:58PM +0200, Peter Zijlstra wrote:
 From: Waiman Long waiman.l...@hp.com
 
 This patch introduces a new generic queue spinlock implementation that
 can serve as an alternative to the default ticket spinlock. Compared
 with the ticket spinlock, this queue spinlock should be almost as fair
 as the ticket spinlock. It has about the same speed in single-thread
 and it can be much faster in high contention situations especially when
 the spinlock is embedded within the data structure to be protected.
 
 Only in light to moderate contention where the average queue depth
 is around 1-3 will this queue spinlock be potentially a bit slower
 due to the higher slowpath overhead.
 
 This queue spinlock is especially suited to NUMA machines with a large
 number of cores as the chance of spinlock contention is much higher
 in those machines. The cost of contention is also higher because of
 slower inter-node memory traffic.
 
 Due to the fact that spinlocks are acquired with preemption disabled,
 the process will not be migrated to another CPU while it is trying
 to get a spinlock. Ignoring interrupt handling, a CPU can only be
 contending in one spinlock at any one time. Counting soft IRQ, hard
 IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
 activities.  By allocating a set of per-cpu queue nodes and using them
 to form a waiting queue, we can encode the queue node address into a
 much smaller 24-bit size (including CPU number and queue node index)
 leaving one byte for the lock.
 
 Please note that the queue node is only needed when waiting for the
 lock. Once the lock is acquired, the queue node can be released to
 be used later.
 
 Signed-off-by: Waiman Long waiman.l...@hp.com
 Signed-off-by: Peter Zijlstra pet...@infradead.org

Thank you for the repost. I have some questions about the implementation
that hopefully will be easy to answer and said answers I hope can
be added in the code to enlighten other folks.

See below.
.. snip..

 Index: linux-2.6/kernel/locking/mcs_spinlock.h
 ===
 --- linux-2.6.orig/kernel/locking/mcs_spinlock.h
 +++ linux-2.6/kernel/locking/mcs_spinlock.h
 @@ -17,6 +17,7 @@
  struct mcs_spinlock {
   struct mcs_spinlock *next;
   int locked; /* 1 if lock acquired */
 + int count;

This could use a comment.
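
Something like the below, perhaps (wording is mine, just a suggestion):

	int count;	/* nesting count: how many of this CPU's per-cpu queue
			 * nodes (task/softirq/hardirq/nmi) are currently in use */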

  };
  
  #ifndef arch_mcs_spin_lock_contended
 Index: linux-2.6/kernel/locking/qspinlock.c
 ===
 --- /dev/null
 +++ linux-2.6/kernel/locking/qspinlock.c
 @@ -0,0 +1,197 @@
 +/*
 + * Queue spinlock
 + *
 + * This program is free software; you can redistribute it and/or modify
 + * it under the terms of the GNU General Public License as published by
 + * the Free Software Foundation; either version 2 of the License, or
 + * (at your option) any later version.
 + *
 + * This program is distributed in the hope that it will be useful,
 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 + * GNU General Public License for more details.
 + *
 + * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
 + *
 + * Authors: Waiman Long waiman.l...@hp.com
 + *  Peter Zijlstra pzijl...@redhat.com
 + */
 +#include <linux/smp.h>
 +#include <linux/bug.h>
 +#include <linux/cpumask.h>
 +#include <linux/percpu.h>
 +#include <linux/hardirq.h>
 +#include <linux/mutex.h>
 +#include <asm/qspinlock.h>
 +
 +/*
 + * The basic principle of a queue-based spinlock can best be understood
 + * by studying a classic queue-based spinlock implementation called the
 + * MCS lock. The paper below provides a good description for this kind
 + * of lock.
 + *
 + * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf
 + *
 + * This queue spinlock implementation is based on the MCS lock, however to 
 make
 + * it fit the 4 bytes we assume spinlock_t to be, and preserve its existing
 + * API, we must modify it some.
 + *
 + * In particular; where the traditional MCS lock consists of a tail pointer
 + * (8 bytes) and needs the next pointer (another 8 bytes) of its own node to
 + * unlock the next pending (next->locked), we compress both these: {tail,
 + * next->locked} into a single u32 value.
 + *
 + * Since a spinlock disables recursion of its own context and there is a 
 limit
 + * to the contexts that can nest; namely: task, softirq, hardirq, nmi, we can
 + * encode the tail as an index indicating this context and a cpu number.
 + *
 + * We can further change the first spinner to spin on a bit in the lock word
 + * instead of its node; whereby avoiding the need to carry a node from lock 
 to
 + * unlock, and preserving API.
 + */
 +
 +#include "mcs_spinlock.h"
 +
 +/*
 + * Per-CPU queue node structures; we can never have more than 4 nested
 + * contexts: task, softirq, hardirq, nmi.
 + *
 + * Exactly fits one cacheline.
 + */
 +static 

Re: [PATCH 00/11] qspinlock with paravirt support

2014-06-16 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:46:57PM +0200, Peter Zijlstra wrote:
 Since Waiman seems incapable of doing simple things; here's my take on the
 paravirt crap.
 
 The first few patches are taken from Waiman's latest series, but the virt
 support is completely new. Its primary aim is to not mess up the native code.

OK. I finally cleared some time to look over this and are reading the code
in details to make sure I have it clear in mind. I will most likely ask
some questions that are naive - hopefully they will lead to the code being
self-explanatory for anybody else taking a stab at understanding them when
bugs appear.
 
 I've not stress tested it, but the virt and paravirt (kvm) cases boot on 
 simple
 smp guests. I've not done Xen, but the patch should be simple and similar.

Looking forward to seeing it. Glancing over the KVM one and comparing it
to the original version that Waiman posted it should be fairly simple. Perhaps
even some of the code could be shared?

 
 I ripped out all the unfair nonsense as its not at all required for paravirt
 and optimizations that make paravirt better at the cost of code clarity and/or
 native performance are just not worth it.
 
 Also; if we were to ever add some of that unfair nonsense you do so _after_ 
 you
 got the simple things working.
 
 The thing I'm least sure about is the head tracking, I chose to do something
 different from what Waiman did, because his is O(nr_cpus) and had the
 assumption that guests have small nr_cpus. AFAIK this is not at all true. The
 biggest problem I have with what I did is that it contains wait loops itself.
 
 
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v10 00/19] qspinlock: a 4-byte queue spinlock with PV support

2014-05-07 Thread Konrad Rzeszutek Wilk
On Wed, May 07, 2014 at 11:01:28AM -0400, Waiman Long wrote:
 v9-v10:
   - Make some minor changes to qspinlock.c to accommodate review feedback.
   - Change author to PeterZ for 2 of the patches.
   - Include Raghavendra KT's test results in patch 18.

Any chance you can post these on a git tree? Thanks.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v10 18/19] pvqspinlock, x86: Enable PV qspinlock PV for KVM

2014-05-07 Thread Konrad Rzeszutek Wilk
 Raghavendra KT had done some performance testing on this patch with
 the following results:
 
 Overall we are seeing good improvement for pv-unfair version.
 
 System: 32 cpu sandybridge with HT on (4 node with 32 GB each)
 Guest : 8GB with 16 vcpu/VM.
 Average was taken over 8-10 data points.
 
 Base = 3.15-rc2  with PRAVIRT_SPINLOCK = y
 
 A = 3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y
 PRAVIRT_SPINLOCK = y PARAVIRT_UNFAIR_LOCKS = y (unfair lock)
 
 B =  3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y
 PRAVIRT_SPINLOCK = n PARAVIRT_UNFAIR_LOCKS = n
 (queue spinlock without paravirt)
 
 C = 3.15-rc2 + qspinlock v9 patch with  QUEUE_SPINLOCK = y
 PRAVIRT_SPINLOCK = y  PARAVIRT_UNFAIR_LOCKS = n
 (queue spinlock with paravirt)

Could you do s/PRAVIRT/PARAVIRT/ please?

 
 Ebizzy %improvements
 
 overcommit  ABC
 0.5x 4.42652.0611   1.5824
 1.0x 0.9015   -7.7828   4.5443
 1.5x46.1162   -2.9845  -3.5046
 2.0x99.8150   -2.7116   4.7461

Considering B sucks
 
 Dbench %improvements
 
 overcommit  ABC
 0.5x 3.26173.54362.5676
 1.0x 0.63022.23425.2201
 1.5x 5.00274.82753.8375
 2.0x23.82424.578212.6067
 
 Absolute values of base results: (overcommit, value, stdev)
 Ebizzy ( records / sec with 120 sec run)
 0.5x 20941.8750 (2%)
 1.0x 17623.8750 (5%)
 1.5x  5874.7778 (15%)
 2.0x  3581.8750 (7%)
 
 Dbench (throughput in MB/sec)
 0.5x 10009.6610 (5%)
 1.0x  6583.0538 (1%)
 1.5x  3991.9622 (4%)
 2.0x  2527.0613 (2.5%)
 
 Signed-off-by: Waiman Long waiman.l...@hp.com
 Tested-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 ---
  arch/x86/kernel/kvm.c |  135 
 +
  kernel/Kconfig.locks  |2 +-
  2 files changed, 136 insertions(+), 1 deletions(-)
 
 diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
 index 7ab8ab3..eef427b 100644
 --- a/arch/x86/kernel/kvm.c
 +++ b/arch/x86/kernel/kvm.c
 @@ -567,6 +567,7 @@ static void kvm_kick_cpu(int cpu)
   kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
  }
  
 +#ifndef CONFIG_QUEUE_SPINLOCK
  enum kvm_contention_stat {
   TAKEN_SLOW,
   TAKEN_SLOW_PICKUP,
 @@ -794,6 +795,134 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, 
 __ticket_t ticket)
   }
   }
  }
 +#else /* !CONFIG_QUEUE_SPINLOCK */
 +
 +#ifdef CONFIG_KVM_DEBUG_FS
 +static struct dentry *d_spin_debug;
 +static struct dentry *d_kvm_debug;
 +static u32 kick_nohlt_stats; /* Kick but not halt count  */
 +static u32 halt_qhead_stats; /* Queue head halting count */
 +static u32 halt_qnode_stats; /* Queue node halting count */
 +static u32 halt_abort_stats; /* Halting abort count  */
 +static u32 wake_kick_stats;  /* Wakeup by kicking count  */
 +static u32 wake_spur_stats;  /* Spurious wakeup count*/
 +static u64 time_blocked; /* Total blocking time  */
 +
 +static int __init kvm_spinlock_debugfs(void)
 +{
 + d_kvm_debug = debugfs_create_dir("kvm-guest", NULL);
 + if (!d_kvm_debug) {
 + printk(KERN_WARNING
 +"Could not create 'kvm' debugfs directory\n");
 + return -ENOMEM;
 + }
 + d_spin_debug = debugfs_create_dir("spinlocks", d_kvm_debug);
 +
 + debugfs_create_u32("kick_nohlt_stats",
 +0644, d_spin_debug, &kick_nohlt_stats);
 + debugfs_create_u32("halt_qhead_stats",
 +0644, d_spin_debug, &halt_qhead_stats);
 + debugfs_create_u32("halt_qnode_stats",
 +0644, d_spin_debug, &halt_qnode_stats);
 + debugfs_create_u32("halt_abort_stats",
 +0644, d_spin_debug, &halt_abort_stats);
 + debugfs_create_u32("wake_kick_stats",
 +0644, d_spin_debug, &wake_kick_stats);
 + debugfs_create_u32("wake_spur_stats",
 +0644, d_spin_debug, &wake_spur_stats);
 + debugfs_create_u64("time_blocked",
 +0644, d_spin_debug, &time_blocked);
 + return 0;
 +}
 +
 +static inline void kvm_halt_stats(enum pv_lock_stats type)
 +{
 + if (type == PV_HALT_QHEAD)
 + add_smp(&halt_qhead_stats, 1);
 + else if (type == PV_HALT_QNODE)
 + add_smp(&halt_qnode_stats, 1);
 + else /* type == PV_HALT_ABORT */
 + add_smp(&halt_abort_stats, 1);
 +}
 +
 +static inline void kvm_lock_stats(enum pv_lock_stats type)
 +{
 + if (type == PV_WAKE_KICKED)
 + add_smp(&wake_kick_stats, 1);
 + else if (type == PV_WAKE_SPURIOUS)
 + add_smp(&wake_spur_stats, 1);
 + else /* type == PV_KICK_NOHALT */
 + add_smp(&kick_nohlt_stats, 1);
 +}
 +
 +static inline u64 spin_time_start(void)
 +{
 + return sched_clock();
 +}
 +
 +static inline void spin_time_accum_blocked(u64 start)
 +{
 + u64 delta;
 +
 + 

Re: [PATCH v9 05/19] qspinlock: Optimize for smaller NR_CPUS

2014-04-23 Thread Konrad Rzeszutek Wilk
On Wed, Apr 23, 2014 at 10:23:43AM -0400, Waiman Long wrote:
 On 04/18/2014 05:40 PM, Waiman Long wrote:
 On 04/18/2014 03:05 PM, Peter Zijlstra wrote:
 On Fri, Apr 18, 2014 at 01:52:50PM -0400, Waiman Long wrote:
 I am confused by your notation.
 Nah, I think I was confused :-) Make the 1 _Q_LOCKED_VAL though, as
 that's the proper constant to use.
 
 Everyone gets confused once in a while:-) I have plenty of that myself.
 
 I will change 1 to _Q_LOCKED_VAL as suggested.
 
 -Longman
 
 
 The attached patch file contains the additional changes that I had
 made to qspinlock.c file so far. Please let me know if you or others
 have any additional feedbacks or changes that will need to go to the
 next version of the patch series.
 
 I am going to take vacation starting from tomorrow and will be back
 on 5/5 (Mon). So I will not be able to respond to emails within this
 period.
 
 BTW, is there any chance that this patch can be merged to 3.16?

Um, it needs to have Acks from KVM and Xen maintainers who have not
done so. Also Peter needs to chime in. (BTW, please CC
xen-de...@lists.xenproject.org next time you post so that David and Boris
can take a peek at it).

I would strongly recommend you put all your patches on github (free git
service) so that we can test it and poke at it during your vacation
(and even after).

 
 -Longman

 diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
 index be2adca..2e184b8 100644
 --- a/kernel/locking/qspinlock.c
 +++ b/kernel/locking/qspinlock.c
 @@ -25,10 +25,6 @@
  #include <asm/byteorder.h>
  #include <asm/qspinlock.h>
  
 -#if !defined(__LITTLE_ENDIAN) && !defined(__BIG_ENDIAN)
 -#error "Missing either LITTLE_ENDIAN or BIG_ENDIAN definition."
 -#endif
 -
  /*
   * The basic principle of a queue-based spinlock can best be understood
   * by studying a classic queue-based spinlock implementation called the
 @@ -200,7 +196,7 @@ clear_pending_set_locked(struct qspinlock *lock, u32 val)
  {
   struct __qspinlock *l = (void *)lock;
  
 - ACCESS_ONCE(l->locked_pending) = 1;
 + ACCESS_ONCE(l->locked_pending) = _Q_LOCKED_VAL;
  }
  
  /*
 @@ -567,16 +563,16 @@ static __always_inline int get_qlock(struct qspinlock 
 *lock)
  /**
   * trylock_pending - try to acquire queue spinlock using the pending bit
   * @lock : Pointer to queue spinlock structure
 - * @pval : Pointer to value of the queue spinlock 32-bit word
 + * @val  : Current value of the queue spinlock 32-bit word
   * Return: 1 if lock acquired, 0 otherwise
   *
   * The pending bit won't be set as soon as one or more tasks queue up.
   * This function should only be called when lock stealing will not happen.
   * Otherwise, it has to be disabled.
   */
 -static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
 +static inline int trylock_pending(struct qspinlock *lock, u32 val)
  {
 - u32 old, new, val = *pval;
 + u32 old, new;
   int retry = 1;
  
   /*
 @@ -593,8 +589,7 @@ static inline int trylock_pending(struct qspinlock *lock, 
 u32 *pval)
   if (val & _Q_TAIL_MASK)
   return 0;
  
 - if ((val & _Q_LOCKED_PENDING_MASK) ==
 - (_Q_LOCKED_VAL|_Q_PENDING_VAL)) {
 + if (val == (_Q_LOCKED_VAL|_Q_PENDING_VAL)) {
   /*
* If both the lock and pending bits are set, we wait
* a while to see if that either bit will be cleared.
 @@ -605,9 +600,9 @@ static inline int trylock_pending(struct qspinlock *lock, 
 u32 *pval)
   retry--;
   cpu_relax();
   cpu_relax();
 - *pval = val = atomic_read(&lock->val);
 + val = atomic_read(&lock->val);
   continue;
 - } else if ((val & _Q_LOCKED_PENDING_MASK) == _Q_PENDING_VAL) {
 + } else if (val == _Q_PENDING_VAL) {
   /*
* Pending bit is set, but not the lock bit.
* Assuming that the pending bit holder is going to
 @@ -615,7 +610,7 @@ static inline int trylock_pending(struct qspinlock *lock, 
 u32 *pval)
* it is better to wait than to exit at this point.
*/
   cpu_relax();
 - *pval = val = atomic_read(&lock->val);
 + val = atomic_read(&lock->val);
   continue;
   }
  
 @@ -627,7 +622,7 @@ static inline int trylock_pending(struct qspinlock *lock, 
 u32 *pval)
   if (old == val)
   break;
  
 - *pval = val = old;
 + val = old;
   }
  
   /*
 @@ -643,7 +638,7 @@ static inline int trylock_pending(struct qspinlock *lock, 
 u32 *pval)
*
* this wait loop must be a load-acquire such that we match the
* store-release that clears the locked bit and create lock
 -  * sequentiality; this because 

Re: [PATCH v9 05/19] qspinlock: Optimize for smaller NR_CPUS

2014-04-23 Thread Konrad Rzeszutek Wilk
On Wed, Apr 23, 2014 at 01:43:58PM -0400, Waiman Long wrote:
 On 04/23/2014 10:56 AM, Konrad Rzeszutek Wilk wrote:
 On Wed, Apr 23, 2014 at 10:23:43AM -0400, Waiman Long wrote:
 On 04/18/2014 05:40 PM, Waiman Long wrote:
 On 04/18/2014 03:05 PM, Peter Zijlstra wrote:
 On Fri, Apr 18, 2014 at 01:52:50PM -0400, Waiman Long wrote:
 I am confused by your notation.
 Nah, I think I was confused :-) Make the 1 _Q_LOCKED_VAL though, as
 that's the proper constant to use.
 Everyone gets confused once in a while:-) I have plenty of that myself.
 
 I will change 1 to _Q_LOCKED_VAL as suggested.
 
 -Longman
 
 The attached patch file contains the additional changes that I had
 made to qspinlock.c file so far. Please let me know if you or others
 have any additional feedbacks or changes that will need to go to the
 next version of the patch series.
 
 I am going to take vacation starting from tomorrow and will be back
 on 5/5 (Mon). So I will not be able to respond to emails within this
 period.
 
 BTW, is there any chance that this patch can be merged to 3.16?
 Um, it needs to have Acks from KVM and Xen maintainers who have not
 done so. Also Peter needs to chime in. (BTW, please CC
 xen-de...@lists.xenproject.org next time you post so that David and Boris
 can take a peek at it).
 
 I will cc xen-de...@lists.xenproject.org when I sent out the next patch.
 
 I would strongly recommend you put all your patches on github (free git
 service) so that we can test it and poke it at during your vacation
 (and even after).
 
 
 I am not used to setting up a public repo in github. If I create a
 repo there, should I put a snapshot of the whole kernel source tree
 or just a portion of the relevant files as the base? With the latter,
 it won't be buildable.

You just push your local branch. It should look like a normal
Linux tree with your commits on top.
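
That is, add your github fork as a remote and push the branch, e.g.
'git remote add github https://github.com/your-account/linux.git' followed by
'git push github qspinlock-v10' (names here are just examples), and then post
the URL and branch name along with the next series.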

 
 -Longman
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v9 00/19] qspinlock: a 4-byte queue spinlock with PV support

2014-04-18 Thread Konrad Rzeszutek Wilk
On Thu, Apr 17, 2014 at 09:48:36PM -0400, Waiman Long wrote:
 On 04/17/2014 01:23 PM, Konrad Rzeszutek Wilk wrote:
 On Thu, Apr 17, 2014 at 11:03:52AM -0400, Waiman Long wrote:
 v8-v9:
- Integrate PeterZ's version of the queue spinlock patch with some
  modification:
  http://lkml.kernel.org/r/20140310154236.038181...@infradead.org
- Break the more complex patches into smaller ones to ease review effort.
- Fix a racing condition in the PV qspinlock code.
 I am not seeing anything mentioning that the overcommit scenario
 for KVM and Xen had been fixed. Or was the 'racing condition' said
 issue?
 
 Thanks.
 
 The hanging is caused by a racing condition which should be fixed in
 the v9 patch. Please let me know if you are still seeing it.

OK, is there a git tree with these patches to easily slurp them up?


Thanks!
 
 -Longman
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v9 03/19] qspinlock: Add pending bit

2014-04-18 Thread Konrad Rzeszutek Wilk
On Fri, Apr 18, 2014 at 12:23:29PM -0400, Waiman Long wrote:
 On 04/18/2014 03:42 AM, Ingo Molnar wrote:
 * Waiman Longwaiman.l...@hp.com  wrote:
 
 Because the qspinlock needs to touch a second cacheline; add a pending
 bit and allow a single in-word spinner before we punt to the second
 cacheline.
 
 Signed-off-by: Peter Zijlstrapet...@infradead.org
 Signed-off-by: Waiman Longwaiman.l...@hp.com
 This patch should have a From: Peter in it as well, right?
 
 Thanks,
 
  Ingo
 
 Do you mean a From: line in the mail header? It will be a bit hard
 to have different From: header in the same patch series. I can
 certainly do that if there is an easy way to do it.

It is pretty easy.

Just do 'git commit --amend --author The Right Author' and when
you send the patches (git send-email) it will include that.
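
For example 'git commit --amend --author="Author Name <address@example.com>"'
(placeholder name/address) on the commits in question - git format-patch /
git send-email will then emit the matching From: line for those patches
automatically.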

 
 -Longman
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v9 00/19] qspinlock: a 4-byte queue spinlock with PV support

2014-04-17 Thread Konrad Rzeszutek Wilk
On Thu, Apr 17, 2014 at 11:03:52AM -0400, Waiman Long wrote:
 v8-v9:
   - Integrate PeterZ's version of the queue spinlock patch with some
 modification:
 http://lkml.kernel.org/r/20140310154236.038181...@infradead.org
   - Break the more complex patches into smaller ones to ease review effort.
   - Fix a racing condition in the PV qspinlock code.

I am not seeing anything mentioning that the overcommit scenario
for KVM and Xen had been fixed. Or was the 'racing condition' said
issue?

Thanks.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation

2014-04-07 Thread Konrad Rzeszutek Wilk
On Mon, Apr 07, 2014 at 04:12:58PM +0200, Peter Zijlstra wrote:
 On Fri, Apr 04, 2014 at 12:57:27PM -0400, Konrad Rzeszutek Wilk wrote:
  On Fri, Apr 04, 2014 at 03:00:12PM +0200, Peter Zijlstra wrote:
   
   So I'm just not ever going to pick up this patch; I spend a week trying
   to reverse engineer this; I posted a 7 patch series creating the
   equivalent, but in a gradual and readable fashion:
   
 http://lkml.kernel.org/r/20140310154236.038181...@infradead.org
   
   You keep on ignoring that; I'll keep on ignoring your patches.
   
   I might at some point rewrite some of your pv stuff on top to get this
   moving again, but I'm not really motivated to work with you atm.
  
  Uh? Did you CC also xen-de...@lists.xenproject.org on your patches Peter?
  I hadn't had a chance to see or comment on them :-(
 
 No of course not :-)
 
 Also as noted elsewhere, I didn't actually do any PV muck yet. I spend
 the time trying to get my head around patch 1; all the while Waiman kept
 piling more and more on top.

Ah, I see. Looking forward to seeing your 'muck' code then :-)

Thanks!
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

2014-04-04 Thread Konrad Rzeszutek Wilk
On Wed, Apr 02, 2014 at 10:32:01AM -0400, Konrad Rzeszutek Wilk wrote:
 On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
  N.B. Sorry for the duplicate. This patch series were resent as the
   original one was rejected by the vger.kernel.org list server
   due to long header. There is no change in content.
  
  v7-v8:
- Remove one unneeded atomic operation from the slowpath, thus
  improving performance.
- Simplify some of the codes and add more comments.
- Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
  unfair lock.
- Reduce unfair lock slowpath lock stealing frequency depending
  on its distance from the queue head.
- Add performance data for IvyBridge-EX CPU.
 
 FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
 HVM guest under Xen after a while stops working. The workload
 is doing 'make -j32' on the Linux kernel.
 
 Completely unresponsive. Thoughts?

Each VCPU seems to be stuck with this stack trace:

rip: 810013a8 xen_hypercall_sched_op+0x8
flags: 0002 nz
rsp: 88029f13fb98
rax:    rcx: fffa   rdx: 
rbx:    rsi: 88029f13fba8   rdi: 0003
rbp: 88029f13fbd0r8: 8807ee65a1c0r9: 88080d800b10
r10: 48cb   r11:    r12: 0013
r13: 0004   r14: 0001   r15: ea00076a8cd0
 cs: 0010ss: ds: es: 
 fs:  @ 2b24c3e7e380
 gs:  @ 88080e20/
Code (instr addr 810013a8)
cc cc cc cc cc cc cc cc cc cc cc cc cc b8 1d 00 00 00 0f 01 c1 c3 cc cc cc cc 
cc cc cc cc cc cc


Stack:
 81352d9e 00299f13fbb0 88029f13fba4 88020001
  88029f13fbd0 0045 88029f13fbe0
 81354240 88029f13fc00 81012cb6 88080f4da200
 88080e214b00 88029f13fc48 815e4631 

Call Trace:
  [810013a8] xen_hypercall_sched_op+0x8  --
  [81352d9e] xen_poll_irq_timeout+0x3e
  [81354240] xen_poll_irq+0x10
  [81012cb6] xen_hibernate+0x46
  [815e4631] queue_spin_lock_slowerpath+0x84
  [810ab96e] queue_spin_lock_slowpath+0xee
  [815eff8f] _raw_spin_lock_irqsave+0x3f
  [81144e4d] pagevec_lru_move_fn+0x8d
  [81144780] __pagevec_lru_add_fn
  [81144ed7] __pagevec_lru_add+0x17
  [81145540] __lru_cache_add+0x60
  [8114590e] lru_cache_add+0xe
  [8116d4ba] page_add_new_anon_rmap+0xda
  [81162ab1] handle_mm_fault+0xaa1
  [81169d42] mmap_region+0x2c2
  [815f3c4d] __do_page_fault+0x18d
  [811544e1] vm_mmap_pgoff+0xb1
  [815f3fdb] do_page_fault+0x2b
  [815f06c8] page_fault+0x28
rip: 810013a8 xen_hypercall_sched_op+0x8


 
 (CC ing Marcos who had run the test)
  
  v6-v7:
- Remove an atomic operation from the 2-task contending code
- Shorten the names of some macros
- Make the queue waiter to attempt to steal lock when unfair lock is
  enabled.
- Remove lock holder kick from the PV code and fix a race condition
- Run the unfair lock  PV code on overcommitted KVM guests to collect
  performance data.
  
  v5-v6:
   - Change the optimized 2-task contending code to make it fairer at the
 expense of a bit of performance.
   - Add a patch to support unfair queue spinlock for Xen.
   - Modify the PV qspinlock code to follow what was done in the PV
 ticketlock.
   - Add performance data for the unfair lock as well as the PV
 support code.
  
  v4-v5:
   - Move the optimized 2-task contending code to the generic file to
 enable more architectures to use it without code duplication.
   - Address some of the style-related comments by PeterZ.
   - Allow the use of unfair queue spinlock in a real para-virtualized
 execution environment.
   - Add para-virtualization support to the qspinlock code by ensuring
 that the lock holder and queue head stay alive as much as possible.
  
  v3-v4:
   - Remove debugging code and fix a configuration error
   - Simplify the qspinlock structure and streamline the code to make it
 perform a bit better
   - Add an x86 version of asm/qspinlock.h for holding x86 specific
 optimization.
   - Add an optimized x86 code path for 2 contending tasks to improve
 low contention performance.
  
  v2-v3:
   - Simplify the code by using numerous mode only without an unfair option.
   - Use the latest smp_load_acquire()/smp_store_release() barriers.
   - Move the queue spinlock code to kernel/locking.
   - Make the use of queue spinlock the default for x86-64 without user
 configuration.
   - Additional performance tuning.
  
  v1-v2:
   - Add some more comments to document what the code does.
   - Add a numerous CPU mode to support = 16K CPUs
   - Add a configuration option to allow lock stealing which

Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

2014-04-04 Thread Konrad Rzeszutek Wilk
On Thu, Apr 03, 2014 at 10:57:18PM -0400, Waiman Long wrote:
 On 04/03/2014 01:23 PM, Konrad Rzeszutek Wilk wrote:
 On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote:
 On 04/02/2014 04:35 PM, Waiman Long wrote:
 On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote:
 On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
 N.B. Sorry for the duplicate. This patch series were resent as the
   original one was rejected by the vger.kernel.org list server
   due to long header. There is no change in content.
 
 v7-v8:
- Remove one unneeded atomic operation from the slowpath, thus
  improving performance.
- Simplify some of the codes and add more comments.
- Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
  unfair lock.
- Reduce unfair lock slowpath lock stealing frequency depending
  on its distance from the queue head.
- Add performance data for IvyBridge-EX CPU.
 FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
 HVM guest under Xen after a while stops working. The workload
 is doing 'make -j32' on the Linux kernel.
 
 Completely unresponsive. Thoughts?
 
 Thanks for reporting that. I haven't done that much testing on Xen.
 My focus was in KVM. I will perform more test on Xen to see if I
 can reproduce the problem.
 
 BTW, does the halting and sending IPI mechanism work in HVM? I saw
 Yes.
 that in RHEL7, PV spinlock was explicitly disabled when in HVM mode.
 However, this piece of code isn't in upstream code. So I wonder if
 there is problem with that.
 The PV ticketlock fixed it for HVM. It was disabled before because
 the PV guests were using bytelocks while the HVM were using ticketlocks
 and you couldn't swap in PV bytelocks for ticketlocks during startup.
 
 The RHEL7 code has used PV ticketlock already. RHEL7 uses a single
 kernel for all configurations. So PV ticketlock as well as Xen and
 KVM support was compiled in. I think booting the kernel on bare
 metal will cause the Xen code to work in HVM mode thus activating
 the PV spinlock code which has a negative impact on performance.

Huh? -EPARSE

 That may be why it was disabled so that the bare metal performance
 will not be impacted.

I am not following you.
 
 BTW, could you send me more information about the configuration of
 the machine, like the .config file that you used?

Marcos, could you please send that information to Peter. Thanks!
 
 -Longman
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

2014-04-04 Thread Konrad Rzeszutek Wilk
On Fri, Apr 04, 2014 at 01:13:17PM -0400, Waiman Long wrote:
 On 04/04/2014 12:55 PM, Konrad Rzeszutek Wilk wrote:
 On Thu, Apr 03, 2014 at 10:57:18PM -0400, Waiman Long wrote:
 On 04/03/2014 01:23 PM, Konrad Rzeszutek Wilk wrote:
 On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote:
 On 04/02/2014 04:35 PM, Waiman Long wrote:
 On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote:
 On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
 N.B. Sorry for the duplicate. This patch series were resent as the
   original one was rejected by the vger.kernel.org list server
   due to long header. There is no change in content.
 
 v7-v8:
- Remove one unneeded atomic operation from the slowpath, thus
  improving performance.
- Simplify some of the codes and add more comments.
- Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
  unfair lock.
- Reduce unfair lock slowpath lock stealing frequency depending
  on its distance from the queue head.
- Add performance data for IvyBridge-EX CPU.
 FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
 HVM guest under Xen after a while stops working. The workload
 is doing 'make -j32' on the Linux kernel.
 
 Completely unresponsive. Thoughts?
 
 Thanks for reporting that. I haven't done that much testing on Xen.
 My focus was in KVM. I will perform more test on Xen to see if I
 can reproduce the problem.
 
 BTW, does the halting and sending IPI mechanism work in HVM? I saw
 Yes.
 that in RHEL7, PV spinlock was explicitly disabled when in HVM mode.
 However, this piece of code isn't in upstream code. So I wonder if
 there is problem with that.
 The PV ticketlock fixed it for HVM. It was disabled before because
 the PV guests were using bytelocks while the HVM were using ticketlocks
 and you couldn't swap in PV bytelocks for ticketlocks during startup.
 The RHEL7 code has used PV ticketlock already. RHEL7 uses a single
 kernel for all configurations. So PV ticketlock as well as Xen and
 KVM support was compiled in. I think booting the kernel on bare
 metal will cause the Xen code to work in HVM mode thus activating
 the PV spinlock code which has a negative impact on performance.
 Huh? -EPARSE
 
 That may be why it was disabled so that the bare metal performance
 will not be impacted.
 I am not following you.
 
 What I am saying is that when XEN and PV spinlock is compiled into
 the current upstream kernel, the PV spinlock jump label is turned on
 when booted on bare metal. In other words, the PV spinlock code is

How does it turn it on? I see that the jump labels are only turned
on when the jump label is enabled when it detects that it is running
under Xen or KVM. It won't turn it on under baremetal.

 active even when they are not needed and actually slow things down in
 that situation. This is a problem and we need to find way to make
 sure that the PV spinlock code won't be activated on bare metal.

Could you explain to me which piece of code enables the jump labels
on baremetal please?
 
 -Longman
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

2014-04-04 Thread Konrad Rzeszutek Wilk
On Fri, Apr 04, 2014 at 01:58:15PM -0400, Konrad Rzeszutek Wilk wrote:
 On Fri, Apr 04, 2014 at 01:13:17PM -0400, Waiman Long wrote:
  On 04/04/2014 12:55 PM, Konrad Rzeszutek Wilk wrote:
  On Thu, Apr 03, 2014 at 10:57:18PM -0400, Waiman Long wrote:
  On 04/03/2014 01:23 PM, Konrad Rzeszutek Wilk wrote:
  On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote:
  On 04/02/2014 04:35 PM, Waiman Long wrote:
  On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote:
  On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
  N.B. Sorry for the duplicate. This patch series were resent as the
original one was rejected by the vger.kernel.org list server
due to long header. There is no change in content.
  
  v7-v8:
 - Remove one unneeded atomic operation from the slowpath, thus
   improving performance.
 - Simplify some of the codes and add more comments.
 - Test for X86_FEATURE_HYPERVISOR CPU feature bit to 
   enable/disable
   unfair lock.
 - Reduce unfair lock slowpath lock stealing frequency depending
   on its distance from the queue head.
 - Add performance data for IvyBridge-EX CPU.
  FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
  HVM guest under Xen after a while stops working. The workload
  is doing 'make -j32' on the Linux kernel.
  
  Completely unresponsive. Thoughts?
  
  Thanks for reporting that. I haven't done that much testing on Xen.
  My focus was in KVM. I will perform more test on Xen to see if I
  can reproduce the problem.
  
  BTW, does the halting and sending IPI mechanism work in HVM? I saw
  Yes.
  that in RHEL7, PV spinlock was explicitly disabled when in HVM mode.
  However, this piece of code isn't in upstream code. So I wonder if
  there is problem with that.
  The PV ticketlock fixed it for HVM. It was disabled before because
  the PV guests were using bytelocks while the HVM were using ticketlocks
  and you couldn't swap in PV bytelocks for ticketlocks during startup.
  The RHEL7 code has used PV ticketlock already. RHEL7 uses a single
  kernel for all configurations. So PV ticketlock as well as Xen and
  KVM support was compiled in. I think booting the kernel on bare
  metal will cause the Xen code to work in HVM mode thus activating
  the PV spinlock code which has a negative impact on performance.
  Huh? -EPARSE
  
  That may be why it was disabled so that the bare metal performance
  will not be impacted.
  I am not following you.
  
  What I am saying is that when XEN and PV spinlock is compiled into
  the current upstream kernel, the PV spinlock jump label is turned on
  when booted on bare metal. In other words, the PV spinlock code is
 
 How does it turn it on? I see that the jump labels are only turned
 on when the jump label is enabled, and that only happens when it
 detects that it is running under Xen or KVM. It won't turn it on
 under bare metal.

Well, it seems that it does turn it on on bare metal, which is a stupid mistake.

Sending a patch shortly.
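
The gist of the fix: bail out of the enabling initcall unless we really
are running as a guest, so bare metal never flips the key. A minimal
sketch of that idea (illustrative function name, not the actual patch):

#include <linux/init.h>
#include <linux/jump_label.h>
#include <asm/spinlock.h>	/* paravirt_ticketlocks_enabled */
#include <xen/xen.h>		/* xen_domain() */

/* Sketch only: keep the PV ticketlock static key off on bare metal. */
static __init int xen_init_spinlocks_jump(void)
{
	if (!xen_domain())		/* not a Xen guest: leave the key off */
		return 0;
	static_key_slow_inc(&paravirt_ticketlocks_enabled);
	return 0;
}
early_initcall(xen_init_spinlocks_jump);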
 
  active even when they are not needed and actually slow things down in
  that situation. This is a problem and we need to find a way to make
  sure that the PV spinlock code won't be activated on bare metal.
 
 Could you explain to me which piece of code enables the jump labels
 on baremetal please?
  
  -Longman
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support

2014-04-03 Thread Konrad Rzeszutek Wilk
On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote:
 On 04/02/2014 04:35 PM, Waiman Long wrote:
 On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote:
 On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote:
 N.B. Sorry for the duplicate. This patch series was resent as the
   original one was rejected by the vger.kernel.org list server
   due to long header. There is no change in content.
 
 v7-v8:
- Remove one unneeded atomic operation from the slowpath, thus
  improving performance.
- Simplify some of the codes and add more comments.
- Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
  unfair lock.
- Reduce unfair lock slowpath lock stealing frequency depending
  on its distance from the queue head.
- Add performance data for IvyBridge-EX CPU.
 FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an
 HVM guest under Xen after a while stops working. The workload
 is doing 'make -j32' on the Linux kernel.
 
 Completely unresponsive. Thoughts?
 
 
 Thanks for reporting that. I haven't done that much testing on Xen.
 My focus was on KVM. I will perform more tests on Xen to see if I
 can reproduce the problem.
 
 
 BTW, does the halting and sending IPI mechanism work in HVM? I saw

Yes.
 that in RHEL7, PV spinlock was explicitly disabled when in HVM mode.
 However, this piece of code isn't in upstream code. So I wonder if
 there is problem with that.

The PV ticketlock fixed it for HVM. It was disabled before because
the PV guests were using bytelocks while the HVM were using ticketlocks
and you couldn't swap in PV bytelocks for ticketlocks during startup.

 
 -Longman
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v8 10/10] pvqspinlock, x86: Enable qspinlock PV support for XEN

2014-04-02 Thread Konrad Rzeszutek Wilk
 diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
 index a70fdeb..451e392 100644
 --- a/kernel/Kconfig.locks
 +++ b/kernel/Kconfig.locks
 @@ -229,4 +229,4 @@ config ARCH_USE_QUEUE_SPINLOCK
  
  config QUEUE_SPINLOCK
   def_bool y if ARCH_USE_QUEUE_SPINLOCK
 - depends on SMP && (!PARAVIRT_SPINLOCKS || !XEN)
 + depends on SMP

If I read this correctly that means you cannot select any more the old
ticketlocks? As in, if you select CONFIG_PARAVIRT on X86 it will automatically
select ARCH_USE_QUEUE_SPINLOCK which will then enable this by default?

Should the 'def_bool' be selectable?

 -- 
 1.7.1
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v7 07/11] pvqspinlock, x86: Allow unfair queue spinlock in a XEN guest

2014-03-21 Thread Konrad Rzeszutek Wilk

On Mar 20, 2014 11:40 PM, Waiman Long waiman.l...@hp.com wrote:

 On 03/19/2014 04:28 PM, Konrad Rzeszutek Wilk wrote: 
  On Wed, Mar 19, 2014 at 04:14:05PM -0400, Waiman Long wrote: 
  This patch adds a XEN init function to activate the unfair queue 
  spinlock in a XEN guest when the PARAVIRT_UNFAIR_LOCKS kernel config 
  option is selected. 
  
  Signed-off-by: Waiman Longwaiman.l...@hp.com 
  --- 
    arch/x86/xen/setup.c |   19 +++ 
    1 files changed, 19 insertions(+), 0 deletions(-) 
  
  diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c 
  index 0982233..66bb6f5 100644 
  --- a/arch/x86/xen/setup.c 
  +++ b/arch/x86/xen/setup.c 
  @@ -625,3 +625,22 @@ void __init xen_arch_setup(void) 
    numa_off = 1; 
    #endif 
    } 
  + 
  +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS 
  +/* 
  + * Enable unfair lock if running in a Xen guest 
  + */ 
  +static __init int xen_unfair_locks_init_jump(void) 
  +{ 
  + /* 
  + * Disable unfair lock if not running in a PV domain 
  + */ 
  + if (!xen_pv_domain()) 
  + return 0; 
  I would just make this 'xen_domain'. Not sure why you need 
  to have it only for PV while the PVHVM guests can also use it? 

 The compilation of the setup.c file should have implied xen_domain 
 already (at least HVM). The check is added to make sure that unfair lock 
 won't be enabled on bare metal. As for PVHVM, is there a way to detect
 that it is running as such, as distinct from plain HVM?

Xen_domain() should cover PVHVM and PV.

  Would it also make sense to use the same printk statement 
  that the KVM has? 
  

 Yes, I can add a printk statement like KVM. 

 -Longman 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest

2014-03-19 Thread Konrad Rzeszutek Wilk
On Tue, Mar 18, 2014 at 11:11:43PM -0400, Waiman Long wrote:
 On 03/17/2014 03:10 PM, Konrad Rzeszutek Wilk wrote:
 On Mon, Mar 17, 2014 at 01:44:34PM -0400, Waiman Long wrote:
 On 03/14/2014 04:30 AM, Peter Zijlstra wrote:
 On Thu, Mar 13, 2014 at 04:05:19PM -0400, Waiman Long wrote:
 On 03/13/2014 11:15 AM, Peter Zijlstra wrote:
 On Wed, Mar 12, 2014 at 02:54:52PM -0400, Waiman Long wrote:
 +static inline void arch_spin_lock(struct qspinlock *lock)
 +{
 +   if (static_key_false(&paravirt_unfairlocks_enabled))
 +   queue_spin_lock_unfair(lock);
 +   else
 +   queue_spin_lock(lock);
 +}
 So I would have expected something like:
 
  if (static_key_false(paravirt_spinlock)) {
  while (!queue_spin_trylock(lock))
  cpu_relax();
  return;
  }
 
 At the top of queue_spin_lock_slowpath().
 I don't like the idea of constantly spinning on the lock. That can cause
 all sorts of performance issues.
 Its bloody virt; _that_ is a performance issue to begin with.
 
 Anybody half sane stops using virt (esp. if they care about
 performance).
 
 My version of the unfair lock tries to grab the
 lock ignoring if there are others waiting in the queue or not. So instead 
 of
 the doing a cmpxchg of the whole 32-bit word, I just do a cmpxchg of the
 lock byte in the unfair version. A CPU has only one chance to steal the
 lock. If it can't, it will be lined up in the queue just like the fair
 version. It is not as unfair as the other unfair locking schemes that spin
 on the lock repetitively. So lock starvation should be less of a problem.
 
 On the other hand, it may not perform as well as the other unfair locking
 schemes. It is a compromise to provide some lock unfairness without
 sacrificing the good cacheline behavior of the queue spinlock.
 But but but,.. any kind of queueing gets you into a world of hurt with
 virt.
 
 The simple test-and-set lock (as per the above) still sucks due to lock
 holder preemption, but at least the suckage doesn't queue. Because with
 queueing you not only have to worry about the lock holder getting
 preemption, but also the waiter(s).
 
 Take the situation of 3 (v)CPUs where cpu0 holds the lock but is
 preempted. cpu1 queues, cpu2 queues. Then cpu1 gets preempted, after
 which cpu0 gets back online.
 
 The simple test-and-set lock will now let cpu2 acquire. Your queue
 however will just sit there spinning, waiting for cpu1 to come back from
 holiday.
 
 I think you're way over engineering this. Just do the simple
 test-and-set lock for virt && !paravirt (as I think Paolo Bonzini
 suggested RHEL6 already does).
 The PV ticketlock code was designed to handle lock holder preemption
 by redirecting CPU resources in a preempted guest to another guest
 that can better use it and then return the preempted CPU back
 sooner.
 
 Using a simple test-and-set lock will not allow us to enable this PV
 spinlock functionality as there is no structure to decide who does
 what. I can extend the current unfair lock code to allow those
 And what would be needed to do 'decide who does what'?
 
 
 
  Sorry for not being very clear in my previous mail. The current PV code
  will halt the CPUs if the lock isn't acquired in a certain time.
  When the lock holder comes back, it will wake up the next CPU in
 the ticket lock queue. With a simple set and test lock, there is no
 structure to decide which one to wake up. So you still need to have
 some sort of queue to do that, you just can't wake up all of them
 and let them fight for the lock.

Why not? That is what bytelocks did and it worked OK in the
virtualization environment. The reason it was OK was because the
hypervisor decided which VCPU to schedule and that one would get the
lock. Granted it did have a per-cpu-what-lock-am-I-waiting-for
structure to aid in this.

In some form, the hypervisor serializes who is going to get
the lock, as it ultimately decides which of the kicked VCPUs
actually wakes up - and if the lock is not FIFO - it doesn't
matter which one gets it.

With the 'per-cpu-what-lock-I-am-waiting-for' you can also
have a round-robin of which vCPU you had kick to keep
it fair.

You might want to take a look at the PV bytelock that existed
in the Xen code prior to PV ticketlock (so 3.10) to see how
it did that.
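
The bookkeeping needed for that is small. A rough sketch of the idea,
with made-up hypervisor block/kick hooks (this is not the old Xen code,
just the shape of it):

#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/spinlock.h>

/* Hypothetical hooks: block this vCPU in the hypervisor / kick another. */
void hypervisor_block(void);
void hypervisor_kick_vcpu(unsigned int cpu);

/* Which lock (if any) this vCPU is currently blocked on. */
static DEFINE_PER_CPU(arch_spinlock_t *, lock_waiting_on);

static void pv_wait_on(arch_spinlock_t *lock)
{
	this_cpu_write(lock_waiting_on, lock);
	hypervisor_block();		/* sleep until some unlocker kicks us */
	this_cpu_write(lock_waiting_on, NULL);
}

static void pv_kick_waiter(arch_spinlock_t *lock)
{
	static unsigned int last_kicked;  /* round-robin start; races are harmless */
	unsigned int i, cpu;

	for (i = 0; i < nr_cpu_ids; i++) {
		cpu = (last_kicked + 1 + i) % nr_cpu_ids;
		if (cpu_online(cpu) && per_cpu(lock_waiting_on, cpu) == lock) {
			last_kicked = cpu;
			hypervisor_kick_vcpu(cpu);
			break;
		}
	}
}

Since the hypervisor ultimately decides which kicked vCPU runs next, the
lock need not be FIFO; starting the scan just past the last vCPU kicked
keeps the kicking roughly fair.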

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v7 02/11] qspinlock, x86: Enable x86-64 to use queue spinlock

2014-03-19 Thread Konrad Rzeszutek Wilk
On Wed, Mar 19, 2014 at 04:14:00PM -0400, Waiman Long wrote:
 This patch makes the necessary changes at the x86 architecture
 specific layer to enable the use of queue spinlock for x86-64. As
 x86-32 machines are typically not multi-socket, the benefit of queue
 spinlock may not be apparent, so queue spinlock is not enabled there.
 
 Currently, there are some incompatibilities between the para-virtualized
 spinlock code (which hard-codes the use of ticket spinlock) and the
 queue spinlock. Therefore, the use of queue spinlock is disabled when
 the para-virtualized spinlock is enabled.

And how does this patch do that? I think that comment is obsolete?

 
 The arch/x86/include/asm/qspinlock.h header file includes some x86
 specific optimization which will make the queue spinlock code perform
 better than the generic implementation.
 
 Signed-off-by: Waiman Long waiman.l...@hp.com
 Acked-by: Rik van Riel r...@redhat.com
 ---
  arch/x86/Kconfig  |1 +
  arch/x86/include/asm/qspinlock.h  |   41 
 +
  arch/x86/include/asm/spinlock.h   |5 
  arch/x86/include/asm/spinlock_types.h |4 +++
  4 files changed, 51 insertions(+), 0 deletions(-)
  create mode 100644 arch/x86/include/asm/qspinlock.h
 
 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
 index 0af5250..de573f9 100644
 --- a/arch/x86/Kconfig
 +++ b/arch/x86/Kconfig
 @@ -17,6 +17,7 @@ config X86_64
   depends on 64BIT
   select X86_DEV_DMA_OPS
   select ARCH_USE_CMPXCHG_LOCKREF
 + select ARCH_USE_QUEUE_SPINLOCK
  
  ### Arch settings
  config X86
 diff --git a/arch/x86/include/asm/qspinlock.h 
 b/arch/x86/include/asm/qspinlock.h
 new file mode 100644
 index 000..44cefee
 --- /dev/null
 +++ b/arch/x86/include/asm/qspinlock.h
 @@ -0,0 +1,41 @@
 +#ifndef _ASM_X86_QSPINLOCK_H
 +#define _ASM_X86_QSPINLOCK_H
 +
 +#include <asm-generic/qspinlock_types.h>
 +
 +#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
 +
 +#define _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS
 +
 +/*
 + * x86-64 specific queue spinlock union structure
 + */
 +union arch_qspinlock {
 + struct qspinlock slock;
 + u8   lock;  /* Lock bit */
 +};
 +
 +#define  queue_spin_unlock queue_spin_unlock
 +/**
 + * queue_spin_unlock - release a queue spinlock
 + * @lock : Pointer to queue spinlock structure
 + *
 + * No special memory barrier other than a compiler one is needed for the
 + * x86 architecture. A compiler barrier is added at the end to make sure
 + * that clearing the lock bit is done ASAP without artificial delay
 + * due to compiler optimization.
 + */
 +static inline void queue_spin_unlock(struct qspinlock *lock)
 +{
 + union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
 +
 + barrier();
 + ACCESS_ONCE(qlock->lock) = 0;
 + barrier();
 +}
 +
 +#endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
 +
 +#include <asm-generic/qspinlock.h>
 +
 +#endif /* _ASM_X86_QSPINLOCK_H */
 diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
 index 0f62f54..958d20f 100644
 --- a/arch/x86/include/asm/spinlock.h
 +++ b/arch/x86/include/asm/spinlock.h
 @@ -42,6 +42,10 @@
  extern struct static_key paravirt_ticketlocks_enabled;
  static __always_inline bool static_key_false(struct static_key *key);
  
 +#ifdef CONFIG_QUEUE_SPINLOCK
 +#include <asm/qspinlock.h>
 +#else
 +
  #ifdef CONFIG_PARAVIRT_SPINLOCKS
  
  static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
 @@ -180,6 +184,7 @@ static __always_inline void 
 arch_spin_lock_flags(arch_spinlock_t *lock,
  {
   arch_spin_lock(lock);
  }
 +#endif /* CONFIG_QUEUE_SPINLOCK */
  
  static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
  {
 diff --git a/arch/x86/include/asm/spinlock_types.h 
 b/arch/x86/include/asm/spinlock_types.h
 index 4f1bea1..7960268 100644
 --- a/arch/x86/include/asm/spinlock_types.h
 +++ b/arch/x86/include/asm/spinlock_types.h
 @@ -23,6 +23,9 @@ typedef u32 __ticketpair_t;
  
  #define TICKET_SHIFT (sizeof(__ticket_t) * 8)
  
 +#ifdef CONFIG_QUEUE_SPINLOCK
 +#include <asm-generic/qspinlock_types.h>
 +#else
  typedef struct arch_spinlock {
   union {
   __ticketpair_t head_tail;
 @@ -33,6 +36,7 @@ typedef struct arch_spinlock {
  } arch_spinlock_t;
  
  #define __ARCH_SPIN_LOCK_UNLOCKED{ { 0 } }
 +#endif /* CONFIG_QUEUE_SPINLOCK */
  
  #include <asm/rwlock.h>
  
 -- 
 1.7.1
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v7 07/11] pvqspinlock, x86: Allow unfair queue spinlock in a XEN guest

2014-03-19 Thread Konrad Rzeszutek Wilk
On Wed, Mar 19, 2014 at 04:14:05PM -0400, Waiman Long wrote:
 This patch adds a XEN init function to activate the unfair queue
 spinlock in a XEN guest when the PARAVIRT_UNFAIR_LOCKS kernel config
 option is selected.
 
 Signed-off-by: Waiman Long waiman.l...@hp.com
 ---
  arch/x86/xen/setup.c |   19 +++
  1 files changed, 19 insertions(+), 0 deletions(-)
 
 diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
 index 0982233..66bb6f5 100644
 --- a/arch/x86/xen/setup.c
 +++ b/arch/x86/xen/setup.c
 @@ -625,3 +625,22 @@ void __init xen_arch_setup(void)
   numa_off = 1;
  #endif
  }
 +
 +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
 +/*
 + * Enable unfair lock if running in a Xen guest
 + */
 +static __init int xen_unfair_locks_init_jump(void)
 +{
 + /*
 +  * Disable unfair lock if not running in a PV domain
 +  */
 + if (!xen_pv_domain())
 + return 0;

I would just make this 'xen_domain'. Not sure why you need
to have it only for PV while the PVHVM guests can also use it?

Would it also make sense to use the same printk statement
that the KVM has?
 +
 + static_key_slow_inc(&paravirt_unfairlocks_enabled);
 +
 + return 0;
 +}
 +early_initcall(xen_unfair_locks_init_jump);
 +#endif
 -- 
 1.7.1
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest

2014-03-17 Thread Konrad Rzeszutek Wilk
On Thu, Mar 13, 2014 at 02:16:06PM +0100, Paolo Bonzini wrote:
 On 13/03/2014 11:54, David Vrabel wrote:
 On 12/03/14 18:54, Waiman Long wrote:
 Locking is always an issue in a virtualized environment as the virtual
 CPU that is waiting on a lock may get scheduled out and hence block
 any progress in lock acquisition even when the lock has been freed.
 
 One solution to this problem is to allow unfair lock in a
 para-virtualized environment. In this case, a new lock acquirer can
 come and steal the lock if the next-in-line CPU to get the lock is
 scheduled out. Unfair lock in a native environment is generally not a
 good idea as there is a possibility of lock starvation for a heavily
 contended lock.
 
 I do not think this is a good idea -- the problems with unfair locks are
 worse in a virtualized guest.  If a waiting VCPU deschedules and has to
 be kicked to grab a lock then it is very likely to lose a race with
 another running VCPU trying to take a lock (since it takes time for the
 VCPU to be rescheduled).
 
 Actually, I think the unfair version should be automatically
 selected if running on a hypervisor.  Per-hypervisor pvops can
 choose to enable the fair one.
 
 Lock unfairness may be particularly evident on a virtualized guest
 when the host is overcommitted, but problems with fair locks are
 even worse.
 
 In fact, RHEL/CentOS 6 already uses unfair locks if
 X86_FEATURE_HYPERVISOR is set.  The patch was rejected upstream in
 favor of pv ticketlocks, but pv ticketlocks do not cover all
 hypervisors so perhaps we could revisit that choice.
 
 Measurements were done by Gleb for two guests running 2.6.32 with 16
 vcpus each, on a 16-core system.  One guest ran with unfair locks,
 one guest ran with fair locks.  Two kernel compilations (time make

And when you say fair locks are you saying PV ticketlocks or generic
ticketlocks? 
 -j 16 all) were started at the same time on both guests, and times
 were as follows:
 
 unfair: fair:
 real 13m34.674s real 19m35.827s
 user 96m2.638s  user 102m38.665s
 sys 56m14.991s  sys 158m22.470s
 
 real 13m3.768s  real 19m4.375s
 user 95m34.509s user 111m9.903s
 sys 53m40.550s  sys 141m59.370s
 
 Actually, interpreting the numbers shows an even worse slowdown.
 
 Compilation took ~6.5 minutes in a guest when the host was not
 overcommitted, and with unfair locks everything scaled just fine.

You should see the same values with the PV ticketlock. It is not clear
to me if this testing did include that variant of locks?

 
 Ticketlocks fell completely apart; during the first 13 minutes they
 were allotted 16*6.5=104 minutes of CPU time, and they spent almost
 all of it spinning in the kernel (102 minutes in the first run).

Right, the non-PV variant of them do fall apart. That is why
PV ticketlocks are so nice.

 They did perhaps 30 seconds worth of work because, as soon as the
 unfair-lock guest finished and the host was no longer overcommitted,
 compilation finished in 6 minutes.
 
 So that's approximately 12x slowdown from using non-pv fair locks
 (vs. unfair locks) on a 200%-overcommitted host.

Ah, so it was non-PV.

I am curious if the test was any different if you tested PV ticketlocks
vs Red Hat variant of unfair locks.

 
 Paolo
 
 With the unfair locking activated on bare metal 4-socket Westmere-EX
 box, the execution times (in ms) of a spinlock micro-benchmark were
 as follows:
 
   # of    Ticket      Fair          Unfair
   tasks   lock        queue lock    queue lock
   -----   --------    ----------    ----------
     1         135          135           137
     2        1045         1120           747
     3        1827         2345          1084
     4        2689         2934          1438
     5        3736         3658          1722
     6        4942         4434          2092
     7        6304         5176          2245
     8        7736         5955          2388
 
 Are these figures with or without the later PV support patches?
 
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest

2014-03-17 Thread Konrad Rzeszutek Wilk
On Mon, Mar 17, 2014 at 01:44:34PM -0400, Waiman Long wrote:
 On 03/14/2014 04:30 AM, Peter Zijlstra wrote:
 On Thu, Mar 13, 2014 at 04:05:19PM -0400, Waiman Long wrote:
 On 03/13/2014 11:15 AM, Peter Zijlstra wrote:
 On Wed, Mar 12, 2014 at 02:54:52PM -0400, Waiman Long wrote:
 +static inline void arch_spin_lock(struct qspinlock *lock)
 +{
 + if (static_key_false(&paravirt_unfairlocks_enabled))
 + queue_spin_lock_unfair(lock);
 + else
 + queue_spin_lock(lock);
 +}
 So I would have expected something like:
 
if (static_key_false(paravirt_spinlock)) {
while (!queue_spin_trylock(lock))
cpu_relax();
return;
}
 
 At the top of queue_spin_lock_slowpath().
 I don't like the idea of constantly spinning on the lock. That can cause all
 sorts of performance issues.
 Its bloody virt; _that_ is a performance issue to begin with.
 
 Anybody half sane stops using virt (esp. if they care about
 performance).
 
 My version of the unfair lock tries to grab the
 lock ignoring if there are others waiting in the queue or not. So instead of
 the doing a cmpxchg of the whole 32-bit word, I just do a cmpxchg of the
 lock byte in the unfair version. A CPU has only one chance to steal the
 lock. If it can't, it will be lined up in the queue just like the fair
 version. It is not as unfair as the other unfair locking schemes that spin
 on the lock repetitively. So lock starvation should be less of a problem.
 
 On the other hand, it may not perform as well as the other unfair locking
 schemes. It is a compromise to provide some lock unfairness without
 sacrificing the good cacheline behavior of the queue spinlock.
 But but but,.. any kind of queueing gets you into a world of hurt with
 virt.
 
 The simple test-and-set lock (as per the above) still sucks due to lock
 holder preemption, but at least the suckage doesn't queue. Because with
 queueing you not only have to worry about the lock holder getting
 preemption, but also the waiter(s).
 
 Take the situation of 3 (v)CPUs where cpu0 holds the lock but is
 preempted. cpu1 queues, cpu2 queues. Then cpu1 gets preempted, after
 which cpu0 gets back online.
 
 The simple test-and-set lock will now let cpu2 acquire. Your queue
 however will just sit there spinning, waiting for cpu1 to come back from
 holiday.
 
 I think you're way over engineering this. Just do the simple
 test-and-set lock for virt && !paravirt (as I think Paolo Bonzini
 suggested RHEL6 already does).
 
 The PV ticketlock code was designed to handle lock holder preemption
 by redirecting CPU resources in a preempted guest to another guest
 that can better use it and then return the preempted CPU back
 sooner.
 
 Using a simple test-and-set lock will not allow us to enable this PV
 spinlock functionality as there is no structure to decide who does
 what. I can extend the current unfair lock code to allow those

And what would be needed to do 'decide who does what'?

 waiting in the queue to also attempt to steal the lock, though at a
 lesser frequency so that the queue head has a higher chance of
 getting the lock. This will solve the lock waiter preemption problem
 that you worry about. This does make the code a bit more complex,
 but it allow us to enable both the unfair lock and the PV spinlock
 code together to solve the lock waiter and lock holder preemption
 problems.
 
 -Longman
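
For concreteness, the stealing fast path described above comes down to
something like the following; the names and the literal lock value follow
the shape of the quoted hunks but are an approximation, not the posted
series itself:

#include <linux/types.h>
#include <linux/atomic.h>
#include <linux/compiler.h>
#include <asm-generic/qspinlock_types.h>	/* struct qspinlock, per the series */
#include <asm-generic/qspinlock.h>		/* queue_spin_lock(), per the series */

/* Low byte of the 32-bit lock word holds the lock bit, as in the x86 header. */
union arch_qspinlock {
	struct qspinlock slock;
	u8 lock;
};

static inline void queue_spin_lock_unfair(struct qspinlock *lock)
{
	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;

	/* Exactly one attempt to steal the lock ahead of any queued waiters. */
	if (likely(cmpxchg(&qlock->lock, 0, 1) == 0))
		return;

	/*
	 * Could not steal it: line up like the fair version.  The extension
	 * discussed above would let queued waiters retry this cmpxchg too,
	 * at a frequency that drops the further they are from the queue head.
	 */
	queue_spin_lock(lock);
}

arch_spin_lock() then just picks between this and the plain fair path
based on the paravirt_unfairlocks_enabled static key, as in the hunk
quoted earlier in the thread.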
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio-blk: make the queue depth the max supportable by the hypervisor

2014-03-15 Thread Konrad Rzeszutek Wilk
On March 14, 2014 11:34:31 PM EDT, Theodore Ts'o ty...@mit.edu wrote:
The current virtio block sets a queue depth of 64, which is
insufficient for very fast devices.  It has been demonstrated that
with a high IOPS device, using a queue depth of 256 can double the
IOPS which can be sustained.

As suggested by Venkatesh Srinivas, set the queue depth by default to
be one half the device's virtqueue, which is the maximum queue
depth that can be supported by the channel to the host OS (each I/O
request requires at least two VQ entries).

Also allow the queue depth to be something which can be set at module
load time or via a kernel boot-time parameter, for
testing/benchmarking purposes.

Signed-off-by: Theodore Ts'o ty...@mit.edu
Signed-off-by: Venkatesh Srinivas venkate...@google.com
Cc: Rusty Russell ru...@rustcorp.com.au
Cc: Michael S. Tsirkin m...@redhat.com
Cc: virtio-...@lists.oasis-open.org
Cc: virtualization@lists.linux-foundation.org
Cc: Frank Swiderski f...@google.com
---

This is a combination of my patch and Venkatesh's patch.  I agree that
setting the default automatically is better than requiring the user to
set the value by hand.

 drivers/block/virtio_blk.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 6a680d4..0f70c01 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -481,6 +481,9 @@ static struct blk_mq_ops virtio_mq_ops = {
   .free_hctx  = blk_mq_free_single_hw_queue,
 };
 
+static int queue_depth = -1;
+module_param(queue_depth, int, 0444);

?

+
 static struct blk_mq_reg virtio_mq_reg = {
   .ops= virtio_mq_ops,
   .nr_hw_queues   = 1,
@@ -551,9 +554,14 @@ static int virtblk_probe(struct virtio_device
*vdev)
   goto out_free_vq;
   }
 
+  virtio_mq_reg.queue_depth = queue_depth > 0 ? queue_depth :
+  (vblk->vq->num_free / 2);
   virtio_mq_reg.cmd_size =
   sizeof(struct virtblk_req) +
   sizeof(struct scatterlist) * sg_elems;
+  virtblk_name_format("vd", index, vblk->disk->disk_name,
DISK_NAME_LEN);
+  pr_info("%s: using queue depth %d\n", vblk->disk->disk_name,
+  virtio_mq_reg.queue_depth);

Isn't that visible from sysfs?
 
   q = vblk->disk->queue = blk_mq_init_queue(&virtio_mq_reg, vblk);
   if (!q) {
@@ -565,8 +573,6 @@ static int virtblk_probe(struct virtio_device
*vdev)
 
   q->queuedata = vblk;
 
-  virtblk_name_format("vd", index, vblk->disk->disk_name,
DISK_NAME_LEN);
-
   vblk->disk->major = major;
   vblk->disk->first_minor = index_to_minor(index);
   vblk->disk->private_data = vblk;


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

