Re: simplify gendisk and request_queue allocation for blk-mq based drivers
On Wed, Jun 02, 2021 at 09:53:15AM +0300, Christoph Hellwig wrote:
> Hi all,

Hi!

You wouldn't have a nice git repo to pull so one can test it easily?

Thank you!

Cc-ing Boris/Juergen - pls see below xen.

> this series is the second part of cleaning up lifetimes and allocation
> of the gendisk and request_queue structure. It adds a new interface to
> allocate the disk and queue together for blk-mq based drivers, and
> uses that in all drivers that do not have any caveats in their gendisk
> and request_queue lifetime rules.
>
> Diffstat:
>  block/blk-mq.c                      |   91 +++---
>  block/blk.h                         |    1
>  block/elevator.c                    |    2
>  drivers/block/amiflop.c             |   16 +-
>  drivers/block/aoe/aoeblk.c          |   33
>  drivers/block/aoe/aoedev.c          |    3 -
>  drivers/block/ataflop.c             |   16 +-
>  drivers/block/floppy.c              |   20 +--
>  drivers/block/loop.c                |   19 ++-
>  drivers/block/nbd.c                 |   53 +++
>  drivers/block/null_blk/main.c       |   11 +---
>  drivers/block/paride/pcd.c          |   19 +++
>  drivers/block/paride/pd.c           |   30 ---
>  drivers/block/paride/pf.c           |   18 ++
>  drivers/block/ps3disk.c             |   36 +
>  drivers/block/rbd.c                 |   52 ++-
>  drivers/block/rnbd/rnbd-clt.c       |   35 +++--
>  drivers/block/sunvdc.c              |   47 -
>  drivers/block/swim.c                |   34 +---
>  drivers/block/swim3.c               |   33 +---
>  drivers/block/sx8.c                 |   23 ++--
>  drivers/block/virtio_blk.c          |   26 ++---
>  drivers/block/xen-blkfront.c        |   96 ++--
>  drivers/block/z2ram.c               |   15 +
>  drivers/cdrom/gdrom.c               |   45 +++-
>  drivers/md/dm-rq.c                  |    9 +--
>  drivers/memstick/core/ms_block.c    |   25 +++--
>  drivers/memstick/core/mspro_block.c |   26 -
>  drivers/mtd/mtd_blkdevs.c           |   48 --
>  drivers/mtd/ubi/block.c             |   68 ++---
>  drivers/s390/block/scm_blk.c        |   21 ++-
>  include/linux/blk-mq.h              |   24 ++---
>  include/linux/elevator.h            |    1
>  33 files changed, 386 insertions(+), 610 deletions(-)

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 0/7] Untrusted device support for virtio
On 4/29/21 12:16 AM, Jason Wang wrote:
> On 2021/4/29 5:06 AM, Konrad Rzeszutek Wilk wrote:
> > On Wed, Apr 21, 2021 at 11:21:10AM +0800, Jason Wang wrote:
> > > Hi All:
> > >
> > > Sometimes, the driver doesn't trust the device. This usually
> > > happens for the encrypted VM or VDUSE[1]. In both cases, technology
> > > like swiotlb is used to prevent the poking/mangling of memory from
> > > the device. But this is not sufficient since the current virtio
> > > driver may trust what is stored in the descriptor table (coherent
> > > mapping) for performing the DMA operations like unmap and bounce,
> > > so the device may choose to utilize the behaviour of swiotlb to
> > > perform attacks[2].
> >
> > We fixed it in the SWIOTLB. That is, it saves the expected length of
> > the DMA operation. See
> >
> > commit daf9514fd5eb098d7d6f3a1247cb8cc48fc94155
> > Author: Martin Radev
> > Date:   Tue Jan 12 16:07:29 2021 +0100
> >
> >     swiotlb: Validate bounce size in the sync/unmap path
> >
> >     The size of the buffer being bounced is not checked if it happens
> >     to be larger than the size of the mapped buffer. Because the size
> >     can be controlled by a device, as it's the case with virtio
> >     devices, this can lead to memory corruption.
>
> Good to know this, but this series tries to protect at a different
> level. And I believe such protection needs to be done at both levels.

My apologies for taking so long to respond - somehow this disappeared in
one of the folders.

> > > For double insurance, to protect from a malicious device, when the
> > > DMA API is used for the device, this series stores and uses the
> > > descriptor metadata in an auxiliary structure which can not be
> > > accessed via swiotlb, instead of the ones in the descriptor table.
> > > Actually, we've
> >
> > Sorry for being dense here, but how would SWIOTLB be utilized for
> > this attack?
>
> So we still have behaviors that are triggered by a device that is not
> trusted. Such behavior is what the series tries to avoid. We've learnt
> a lot of lessons to eliminate the potential attacks via this. And it
> would be too late to fix if we found another issue of SWIOTLB.
>
> Proving "the unexpected device triggered behavior is safe" is much
> harder (or even impossible) than "eliminating the unexpected device
> triggered behavior totally". E.g. I wonder whether something like this
> can happen: consider the DMA direction of unmap is under the control
> of the device. The device can cheat the SWIOTLB by changing the flag
> to modify a device read-only buffer.

Why would you want to expose that to the device? And wouldn't that be
specific to Linux devices - because surely Windows DMA APIs are
different and this 'flag' seems very Linux-kernel specific?

> If yes, is it really safe?

Well no? But neither is rm -Rf / but we still allow folks to do that.

> The above patch only logs the bounce size but it doesn't log the flag.

It logs and panics the system.

> Even if it logs the flag, SWIOTLB still doesn't know how each buffer
> is used and when it's the appropriate (safe) time to unmap the buffer;
> only the driver that is using the SWIOTLB knows them.

Fair enough.

Is the intent to do the same thing for all the other drivers that could
be running in an encrypted guest and would require SWIOTLB? Like legacy
devices that KVM can expose (floppy driver? SVGA driver)?

> So I think we need to consolidate on both layers instead of solely
> depending on the SWIOTLB.

Please make sure that this explanation is part of the cover letter or in
the commit/Kconfig.

Also, are you aware of the patchset that Andi has been working on that
tries to make the DMA code have extra bells and whistles for this
purpose?

Thank you.

> Thanks
>
> > > almost achieved that through packed virtqueue and we just need to
> > > fix a corner case of handling mapping errors. For split virtqueue
> > > we just follow what's done in the packed.
> > >
> > > Note that we don't duplicate descriptor metadata for indirect
> > > descriptors since it uses stream mapping which is read-only, so
> > > it's safe if the metadata of non-indirect descriptors are correct.
> > >
> > > The behavior for non-DMA API is kept for minimizing the performance
> > > impact.
> > >
> > > Slightly tested with packed on/off, iommu on/off, swiotlb force/off
> > > in the guest.
> > >
> > > Please review.
> > >
> > > [1] https://lore.kernel.org/netdev/fab615ce-5e13-a3b3-3715-a4203b4ab...@redhat.com/T/
> > > [2] https://yhbt.net/lore/all/c3629a27-3590-1d9f-211b-c0b7be152...@redhat.com/T/#mc6b6e2343cbeffca68ca7a97e0f473aaa871c95b
.snip.
Re: [RFC PATCH 0/7] Untrusted device support for virtio
On Wed, Apr 21, 2021 at 11:21:10AM +0800, Jason Wang wrote:
> Hi All:
>
> Sometimes, the driver doesn't trust the device. This usually happens
> for the encrypted VM or VDUSE[1]. In both cases, technology like
> swiotlb is used to prevent the poking/mangling of memory from the
> device. But this is not sufficient since the current virtio driver may
> trust what is stored in the descriptor table (coherent mapping) for
> performing the DMA operations like unmap and bounce, so the device may
> choose to utilize the behaviour of swiotlb to perform attacks[2].

We fixed it in the SWIOTLB. That is, it saves the expected length of the
DMA operation. See

commit daf9514fd5eb098d7d6f3a1247cb8cc48fc94155
Author: Martin Radev
Date:   Tue Jan 12 16:07:29 2021 +0100

    swiotlb: Validate bounce size in the sync/unmap path

    The size of the buffer being bounced is not checked if it happens
    to be larger than the size of the mapped buffer. Because the size
    can be controlled by a device, as it's the case with virtio devices,
    this can lead to memory corruption.

> For double insurance, to protect from a malicious device, when the DMA
> API is used for the device, this series stores and uses the descriptor
> metadata in an auxiliary structure which can not be accessed via
> swiotlb, instead of the ones in the descriptor table. Actually, we've

Sorry for being dense here, but how would SWIOTLB be utilized for this
attack?

> almost achieved that through packed virtqueue and we just need to fix
> a corner case of handling mapping errors. For split virtqueue we just
> follow what's done in the packed.
>
> Note that we don't duplicate descriptor metadata for indirect
> descriptors since it uses stream mapping which is read-only, so it's
> safe if the metadata of non-indirect descriptors are correct.
>
> The behavior for non-DMA API is kept for minimizing the performance
> impact.
>
> Slightly tested with packed on/off, iommu on/off, swiotlb force/off in
> the guest.
>
> Please review.
> [1] https://lore.kernel.org/netdev/fab615ce-5e13-a3b3-3715-a4203b4ab...@redhat.com/T/
> [2] https://yhbt.net/lore/all/c3629a27-3590-1d9f-211b-c0b7be152...@redhat.com/T/#mc6b6e2343cbeffca68ca7a97e0f473aaa871c95b
>
> Jason Wang (7):
>   virtio-ring: maintain next in extra state for packed virtqueue
>   virtio_ring: rename vring_desc_extra_packed
>   virtio-ring: factor out desc_extra allocation
>   virtio_ring: secure handling of mapping errors
>   virtio_ring: introduce virtqueue_desc_add_split()
>   virtio: use err label in __vring_new_virtqueue()
>   virtio-ring: store DMA metadata in desc_extra for split virtqueue
>
>  drivers/virtio/virtio_ring.c | 189 ++-
>  1 file changed, 141 insertions(+), 48 deletions(-)
>
> --
> 2.25.1
Re: [PATCH 0/7] x86/seves: Support 32-bit boot path and other updates
On Wed, Feb 10, 2021 at 04:12:25PM +0100, Joerg Roedel wrote:
> Hi Konrad,
>
> On Wed, Feb 10, 2021 at 09:58:35AM -0500, Konrad Rzeszutek Wilk wrote:
> > What GRUB versions are we talking about (CC-ing Daniel Kiper, who
> > owns GRUB).
>
> I think this was about 32-bit GRUB builds used by distributions. I
> personally tested it with a kernel which has EFI support disabled; in
> this case the OVMF firmware will also boot into the startup_32 boot
> path.

I think I am missing something obvious here - but why would you want EFI
support disabled? Or is the idea that "legacy" OSes can nicely run under
AMD SEV? But since you are having a kernel patch, that is not a "legacy
OS" anymore.

> > By 'some firmware' are we talking SeaBIOS?
>
> No, SeaBIOS is not supported for SEV-ES, only OVMF has handling for
> #VC so far.
>
> Regards,
>
> Joerg
Re: [PATCH 0/7] x86/seves: Support 32-bit boot path and other updates
On Wed, Feb 10, 2021 at 11:21:28AM +0100, Joerg Roedel wrote:
> From: Joerg Roedel
>
> Hi,
>
> these patches add support for the 32-bit boot in the decompressor
> code. This is needed to boot an SEV-ES guest on some firmware and GRUB
> versions. The patches also add the necessary CPUID sanity checks and a

Could you expand a bit please? What GRUB versions are we talking about
(CC-ing Daniel Kiper, who owns GRUB).

By 'some firmware' are we talking SeaBIOS?

> 32-bit version of the C-bit check.
>
> Other updates included here:
>
>   1. Add code to shut down exception handling in the
>      decompressor code before jumping to the real kernel.
>      Once in the real kernel it is not safe anymore to jump
>      back to the decompressor code via exceptions.
>
>   2. Replace open-coded hlt loops with proper calls to
>      sev_es_terminate().
>
> Please review.
>
> Thanks,
>
> Joerg
>
> Joerg Roedel (7):
>   x86/boot/compressed/64: Cleanup exception handling before booting
>     kernel
>   x86/boot/compressed/64: Reload CS in startup_32
>   x86/boot/compressed/64: Setup IDT in startup_32 boot path
>   x86/boot/compressed/64: Add 32-bit boot #VC handler
>   x86/boot/compressed/64: Add CPUID sanity check to 32-bit boot-path
>   x86/boot/compressed/64: Check SEV encryption in 32-bit boot-path
>   x86/sev-es: Replace open-coded hlt-loops with sev_es_terminate()
>
>  arch/x86/boot/compressed/head_64.S     | 168 -
>  arch/x86/boot/compressed/idt_64.c      |  14 +++
>  arch/x86/boot/compressed/mem_encrypt.S | 114 -
>  arch/x86/boot/compressed/misc.c        |   7 +-
>  arch/x86/boot/compressed/misc.h        |   6 +
>  arch/x86/boot/compressed/sev-es.c      |  12 +-
>  arch/x86/kernel/sev-es-shared.c        |  10 +-
>  7 files changed, 307 insertions(+), 24 deletions(-)
>
> --
> 2.30.0
Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path
On Fri, Feb 05, 2021 at 06:58:52PM +0100, Christoph Hellwig wrote:
> On Wed, Feb 03, 2021 at 02:36:38PM -0500, Konrad Rzeszutek Wilk wrote:
> > > So what? If you guys want to provide a new capability you'll have
> > > to do work. And designing a new protocol based around the fact that
> > > the hardware/hypervisor is not trusted and a copy is always
> > > required makes a lot more sense than throwing in band aids all over
> > > the place.
> >
> > If you don't trust the hypervisor, what would this capability be in?
>
> Well, they don't trust the hypervisor to not attack the guest somehow,
> except through the data read. I never really understood the concept,
> as it leaves too many holes.
>
> But the point is that these schemes want to force bounce buffering
> because they think it is more secure. And if that is what you want you
> better have a protocol built around the fact that each I/O needs to
> use bounce buffers, so you make those buffers the actual shared memory
> used for communication, and build the protocol around it.

Right. That is what the SWIOTLB pool ends up being, as it is allocated
at bootup where the guest tells the hypervisor - these are shared and
clear-text.

> E.g. you don't force the ridiculous NVMe PRP offset rules on the block
> layer, just to make a complicated swiotlb allocation that needs to
> preserve the alignment just do I/O. But instead you have a trivial

I agree that NVMe is being silly. It could have allocated the coherent
pool, used that, and done its own offsets within it. That would in
essence carve out a static pool within the SWIOTLB static one. TTM does
that - it has its own DMA machinery on top of the DMA API to deal with
its "passing" buffers from one application to another and the fun of
keeping track of that.

> ring buffer or whatever because you know I/O will be copied anyway
> and none of all the hard work higher layers do to make the I/O
> suitable for a normal device apply.

I lost you here.
Sorry, are you saying to have a simple ring protocol (like NVMe has),
where the ring entries (SG or DMA physical addresses) are statically
allocated, and whenever the NVMe driver gets data from user-space it
would copy it in there?
Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path
On Wed, Feb 03, 2021 at 01:49:22PM +0100, Christoph Hellwig wrote:
> On Mon, Jan 18, 2021 at 12:44:58PM +0100, Martin Radev wrote:
> > Your comment makes sense but then that would require the cooperation
> > of these vendors and the cloud providers to agree on something
> > meaningful. I am also not sure whether the end result would be
> > better than hardening this interface to catch corruption. There is
> > already some validation in the unmap path anyway.
>
> So what? If you guys want to provide a new capability you'll have to
> do work. And designing a new protocol based around the fact that the
> hardware/hypervisor is not trusted and a copy is always required makes
> a lot more sense than throwing in band aids all over the place.

If you don't trust the hypervisor, what would this capability be in?

I suppose you mean this would need to be in the guest kernel, and this
protocol would depend on .. not-hypervisor, and most certainly not the
virtio or any SR-IOV device. That removes a lot of options.

The one sensible one (since folks will trust OEM vendors like Intel or
AMD to provide the memory encryption, so they will also trust the IOMMU
- I hope?) - and they do have plans for that with their IOMMU
frameworks, which will remove the need for SWIOTLB (I hope). But that is
not now, but in the future.
Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path
On Tue, Feb 02, 2021 at 04:34:09PM -0600, Tom Lendacky wrote:
> On 2/2/21 10:37 AM, Konrad Rzeszutek Wilk wrote:
> > On Mon, Jan 25, 2021 at 07:33:35PM +0100, Martin Radev wrote:
> >> On Mon, Jan 18, 2021 at 10:14:28AM -0500, Konrad Rzeszutek Wilk wrote:
> >>> On Mon, Jan 18, 2021 at 12:44:58PM +0100, Martin Radev wrote:
> >>>> On Wed, Jan 13, 2021 at 12:30:17PM +0100, Christoph Hellwig wrote:
> >>>>> On Tue, Jan 12, 2021 at 04:07:29PM +0100, Martin Radev wrote:
> >>>>>> The size of the buffer being bounced is not checked if it
> >>>>>> happens to be larger than the size of the mapped buffer. Because
> >>>>>> the size can be controlled by a device, as it's the case with
> >>>>>> virtio devices, this can lead to memory corruption.
> >>>>>
> >>>>> I'm really worried about all these hodge podge hacks for not
> >>>>> trusted hypervisors in the I/O stack. Instead of trying to harden
> >>>>> protocols that are fundamentally not designed for this, how about
> >>>>> instead coming up with a new paravirtualized I/O interface that
> >>>>> is specifically designed for use with an untrusted hypervisor
> >>>>> from the start?
> >>>>
> >>>> Your comment makes sense but then that would require the
> >>>> cooperation of these vendors and the cloud providers to agree on
> >>>> something meaningful. I am also not sure whether the end result
> >>>> would be better than hardening this interface to catch corruption.
> >>>> There is already some validation in the unmap path anyway.
> >>>>
> >>>> Another possibility is to move this hardening to the common virtio
> >>>> code, but I think the code may become more complicated there since
> >>>> it would require tracking both the dma_addr and length for each
> >>>> descriptor.
> >>>
> >>> Christoph,
> >>>
> >>> I've been wrestling with the same thing - this is specific to
> >>> busted drivers.
> >>> And in reality you could do the same thing with a hardware
> >>> virtio device (see example in http://thunderclap.io/) - where the
> >>> mitigation is 'enable the IOMMU to do its job.'.
> >>>
> >>> AMD SEV documents speak about utilizing IOMMU to do this (AMD
> >>> SEV-SNP).. and while that is great in the future, SEV without
> >>> IOMMU is now here.
> >>>
> >>> Doing a full circle here, this issue can be exploited with virtio
> >>> but you could say do that with real hardware too if you hacked the
> >>> firmware, so if you say used an Intel SR-IOV NIC that was
> >>> compromised on an AMD SEV machine, and plumbed in the guest - the
> >>> IOMMU inside of the guest would be SWIOTLB code. Last line of
> >>> defense against bad firmware to say.
> >>>
> >>> As such I am leaning towards taking this code, but I am worried
> >>> about the performance hit .. but perhaps I shouldn't as if you
> >>> are using SWIOTLB=force already you are kind of taking a
> >>> performance hit?
> >>
> >> I have not measured the performance degradation. This will hit all
> >> AMD SEV, Intel TDX, IBM Protected Virtualization VMs. I don't expect
> >> the hit to be large since there are only few added operations per
> >> hundreds of copied bytes. I could try to measure the performance hit
> >> by running some benchmark with virtio-net/virtio-blk/virtio-rng.
> >>
> >> Earlier I said:
> >>>> Another possibility is to move this hardening to the common virtio
> >>>> code, but I think the code may become more complicated there since
> >>>> it would require tracking both the dma_addr and length for each
> >>>> descriptor.
> >>
> >> Unfortunately, this doesn't make sense. Even if there's validation
> >> for the size in the common virtio layer, there will be some other
> >> device which controls a dma_addr and length passed to dma_unmap* in
> >> the corresponding driver. The device can target a specific
> >> dma-mapped private buffer by changing the dma_addr and set a good
> >> length to overwrite buffers following it.
Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path
On Mon, Jan 25, 2021 at 07:33:35PM +0100, Martin Radev wrote:
> On Mon, Jan 18, 2021 at 10:14:28AM -0500, Konrad Rzeszutek Wilk wrote:
> > On Mon, Jan 18, 2021 at 12:44:58PM +0100, Martin Radev wrote:
> > > On Wed, Jan 13, 2021 at 12:30:17PM +0100, Christoph Hellwig wrote:
> > > > On Tue, Jan 12, 2021 at 04:07:29PM +0100, Martin Radev wrote:
> > > > > The size of the buffer being bounced is not checked if it
> > > > > happens to be larger than the size of the mapped buffer.
> > > > > Because the size can be controlled by a device, as it's the
> > > > > case with virtio devices, this can lead to memory corruption.
> > > >
> > > > I'm really worried about all these hodge podge hacks for not
> > > > trusted hypervisors in the I/O stack. Instead of trying to harden
> > > > protocols that are fundamentally not designed for this, how about
> > > > instead coming up with a new paravirtualized I/O interface that
> > > > is specifically designed for use with an untrusted hypervisor
> > > > from the start?
> > >
> > > Your comment makes sense but then that would require the
> > > cooperation of these vendors and the cloud providers to agree on
> > > something meaningful. I am also not sure whether the end result
> > > would be better than hardening this interface to catch corruption.
> > > There is already some validation in the unmap path anyway.
> > >
> > > Another possibility is to move this hardening to the common virtio
> > > code, but I think the code may become more complicated there since
> > > it would require tracking both the dma_addr and length for each
> > > descriptor.
> >
> > Christoph,
> >
> > I've been wrestling with the same thing - this is specific to busted
> > drivers. And in reality you could do the same thing with a hardware
> > virtio device (see example in http://thunderclap.io/) - where the
> > mitigation is 'enable the IOMMU to do its job.'.
> >
> > AMD SEV documents speak about utilizing IOMMU to do this (AMD SEV-SNP)..
> > and while that is great in the future, SEV without IOMMU is now
> > here.
> >
> > Doing a full circle here, this issue can be exploited with virtio
> > but you could say do that with real hardware too if you hacked the
> > firmware, so if you say used an Intel SR-IOV NIC that was compromised
> > on an AMD SEV machine, and plumbed in the guest - the IOMMU inside
> > of the guest would be SWIOTLB code. Last line of defense against bad
> > firmware to say.
> >
> > As such I am leaning towards taking this code, but I am worried
> > about the performance hit .. but perhaps I shouldn't as if you are
> > using SWIOTLB=force already you are kind of taking a performance
> > hit?
>
> I have not measured the performance degradation. This will hit all AMD
> SEV, Intel TDX, IBM Protected Virtualization VMs. I don't expect the
> hit to be large since there are only few added operations per hundreds
> of copied bytes. I could try to measure the performance hit by running
> some benchmark with virtio-net/virtio-blk/virtio-rng.
>
> Earlier I said:
> > > Another possibility is to move this hardening to the common virtio
> > > code, but I think the code may become more complicated there since
> > > it would require tracking both the dma_addr and length for each
> > > descriptor.
>
> Unfortunately, this doesn't make sense. Even if there's validation for
> the size in the common virtio layer, there will be some other device
> which controls a dma_addr and length passed to dma_unmap* in the
> corresponding driver. The device can target a specific dma-mapped
> private buffer by changing the dma_addr and set a good length to
> overwrite buffers following it.
>
> So, instead of doing the check in every driver and hitting a
> performance cost even when swiotlb is not used, it's probably better
> to fix it in swiotlb.
>
> @Tom Lendacky, do you think that it makes sense to harden swiotlb or
> some other approach may be better for the SEV features?
I am not Tom, but this change seems the right way forward regardless if
it is TDX, AMD SEV, or any other architecture that encrypts memory and
uses SWIOTLB.

Let me queue it up in the development branch and do some regression
testing.
Re: [PATCH] swiotlb: Validate bounce size in the sync/unmap path
On Mon, Jan 18, 2021 at 12:44:58PM +0100, Martin Radev wrote:
> On Wed, Jan 13, 2021 at 12:30:17PM +0100, Christoph Hellwig wrote:
> > On Tue, Jan 12, 2021 at 04:07:29PM +0100, Martin Radev wrote:
> > > The size of the buffer being bounced is not checked if it happens
> > > to be larger than the size of the mapped buffer. Because the size
> > > can be controlled by a device, as it's the case with virtio
> > > devices, this can lead to memory corruption.
> >
> > I'm really worried about all these hodge podge hacks for not trusted
> > hypervisors in the I/O stack. Instead of trying to harden protocols
> > that are fundamentally not designed for this, how about instead
> > coming up with a new paravirtualized I/O interface that is
> > specifically designed for use with an untrusted hypervisor from the
> > start?
>
> Your comment makes sense but then that would require the cooperation
> of these vendors and the cloud providers to agree on something
> meaningful. I am also not sure whether the end result would be better
> than hardening this interface to catch corruption. There is already
> some validation in the unmap path anyway.
>
> Another possibility is to move this hardening to the common virtio
> code, but I think the code may become more complicated there since it
> would require tracking both the dma_addr and length for each
> descriptor.

Christoph,

I've been wrestling with the same thing - this is specific to busted
drivers. And in reality you could do the same thing with a hardware
virtio device (see example in http://thunderclap.io/) - where the
mitigation is 'enable the IOMMU to do its job.'.

AMD SEV documents speak about utilizing IOMMU to do this (AMD SEV-SNP)..
and while that is great in the future, SEV without IOMMU is now here.
Doing a full circle here, this issue can be exploited with virtio but
you could say do that with real hardware too if you hacked the firmware,
so if you say used an Intel SR-IOV NIC that was compromised on an AMD
SEV machine, and plumbed in the guest - the IOMMU inside of the guest
would be SWIOTLB code. Last line of defense against bad firmware to say.

As such I am leaning towards taking this code, but I am worried about
the performance hit .. but perhaps I shouldn't as if you are using
SWIOTLB=force already you are kind of taking a performance hit?
Re: swiotlb/virtio: unchecked device dma address and length
..snip..
> > > > This raises two issues:
> > > > 1) swiotlb_tlb_unmap_single fails to check whether the index
> > > >    generated from the dma_addr is in range of the
> > > >    io_tlb_orig_addr array.
> > >
> > > That is fairly simple to implement I would think. That is it can
> > > check that the dma_addr is from the PA in the io_tlb pool when
> > > SWIOTLB=force is used.
> >
> > I'm not sure this can fix all the cases. It looks to me we should
> > map the descriptor coherent but readonly (which is not supported by
> > the current DMA API).
>
> Neither is this supported but encrypted memory technologies.

-ECONFUSED. Could you state this once more please? I am not exactly sure
what you are saying.

> > Otherwise, device can modify the desc[i].addr/desc[i].len at any
> > time to pretend a valid mapping.
> >
> > Thanks
Re: swiotlb/virtio: unchecked device dma address and length
On December 16, 2020 1:41:48 AM EST, Jason Wang wrote:
> - Original Message -
> >
> > - Original Message -
> > > .snip.
> > > > > > This raises two issues:
> > > > > > 1) swiotlb_tlb_unmap_single fails to check whether the index
> > > > > >    generated from the dma_addr is in range of the
> > > > > >    io_tlb_orig_addr array.
> > > > >
> > > > > That is fairly simple to implement I would think. That is it
> > > > > can check that the dma_addr is from the PA in the io_tlb pool
> > > > > when SWIOTLB=force is used.
> > > >
> > > > I'm not sure this can fix all the cases. It looks to me we
> > > > should map the descriptor coherent but readonly (which is not
> > > > supported by the current DMA API).
> > >
> > > I think I am missing something obvious here. The attacker is the
> > > hypervisor, aka the owner of the VirtIO device (ring0). The
> > > attacker is the one that provides the addr/len - having that
> > > readonly from a guest perspective does not change the fact that
> > > the hypervisor can modify the memory range by mapping it via a
> > > different virtual address in the hypervisor? (aka aliasing it).
> >
> > Right, but if we allow the hypervisor to provide arbitrary
> > addr/len, does it mean the hypervisor can read encrypted content of
> > encrypted memory of the guest through swiotlb?

Yes.

> >
> > Thanks
>
> Actually not.

I think you're right. Your sentence is very confusing.

On a DMA unmap, SWIOTLB (when force is used) trusts the driver to
provide the correct DMA address and length, which SWIOTLB uses to match
to its associated original PA address. Think of the original PA having a
mapping to a PA in the SWIOTLB pool. The length is not checked, so the
attacker can modify that to say a huge number and cause the SWIOTLB
bounce code to write or read data from a non-SWIOTLB PA into the SWIOTLB
PA pool.

> Thanks
>
> > > > Otherwise, device can modify the desc[i].addr/desc[i].len at
> > > > any time to pretend a valid mapping.
> > > With swiotlb=force, as long as addr/len are within the PA
> > > boundaries within the SWIOTLB pool this should be OK?
> > >
> > > After all that whole area is in cleartext and visible to the
> > > attacker.
Re: swiotlb/virtio: unchecked device dma address and length
On Tue, Dec 15, 2020 at 11:54:08AM +0100, Felicitas Hetzelt wrote:
> Hello,
> thank you all for looking into this! To answer some of the questions:
> - Did you already have some PoC fixes for this:
>   We don't have a full PoC or fix currently. Though we have a PoC
>   with which we were able to overwrite memory outside of the mapped
>   dma region.
> - Is there a CVE associated with this?
>   No
> - Is there a paper on this you all are working on?
>   Yes, we were planning to use this bug (among others) in a paper
>
> Could you point us to the intel thunder issue that you mentioned?

ThunderClap was it! https://lwn.net/Articles/786558/

Cc-ing Lu Baolu ..

Hm, this was a year ago and it looks like there are some extra SWIOTLB
patches to be done?

> On 12/15/20 9:47 AM, Ashish Kalra wrote:
> > On Mon, Dec 14, 2020 at 04:49:50PM -0500, Konrad Rzeszutek Wilk wrote:
> >> On Fri, Dec 11, 2020 at 06:31:21PM +0100, Felicitas Hetzelt wrote:
> >>> Hello,
> >>
> >> Hi! Please see below my responses.
> >>
> >>> we have been analyzing the Hypervisor-OS interface of Linux
> >>> and discovered bugs in the swiotlb/virtio implementation that can
> >>> be triggered from a malicious Hypervisor / virtual device. With
> >>> SEV, the SWIOTLB implementation is forcefully enabled and would
> >>> always be used. Thus, all virtio devices and others would use it
> >>> under the hood.
> >>>
> >>> The reason for analyzing this interface is that technologies such
> >>> as Intel's Trusted Domain Extensions [1] and AMD's Secure Nested
> >>> Paging [2] change the threat model assumed by various Linux kernel
> >>> subsystems. These technologies take the presence of a fully
> >>> malicious hypervisor into account and aim to provide protection
> >>> for virtual machines in such an environment. Therefore, all input
> >>> received from the hypervisor or an external device should be
> >>> carefully validated.
Note that these issues > >>> are of little (or no) relevance in a "normal" virtualization setup, > >>> nevertheless we believe that it is required to fix them if TDX or SNP is > >>> used. > >>> > >>> We are happy to provide more information if needed! > >>> > >>> [1] > >>> https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html > >>> > >>> [2] > >>> https://www.amd.com/en/processors/amd-secure-encrypted-virtualization > >>> > >>> Bug: > >>> OOB memory write. > >>> dma_unmap_single -> swiotlb_tbl_unmap_single is invoked with dma_addr > >>> and length parameters that are under control of the device. > >>> This happens e.g. in virtio_ring: > >>> https://elixir.bootlin.com/linux/v5.10-rc7/source/drivers/virtio/virtio_ring.c#L378 > >> > >> Heya! > >> > >> Thank you for pointing this out! 
I've a couple of questions and hope you > >> can > >> help me out with them. > >> > >> Also CC-ing AMD / TDX folks. > >>> > > > > Adding more relevant folks in AMD. > > > > Needless to say, the swiotlb code needs to validate this external untrusted > > input dma_addr and length parameters. > > > > Thanks, > > Ashish > > > >>> This raises two issues: > >>> 1) swiotlb_tlb_unmap_single fails to check whether the index generated > >>> from the dma_addr is in range
> >>> of the io_tlb_orig_addr array.
Re: swiotlb/virtio: unchecked device dma address and length
.snip. > > > This raises two issues: > > > 1) swiotlb_tlb_unmap_single fails to check whether the index generated > > > from the dma_addr is in range of the io_tlb_orig_addr array. > > That is fairly simple to implement I would think. That is it can check > > that the dma_addr is from the PA in the io_tlb pool when SWIOTLB=force > > is used. > > > I'm not sure this can fix all the cases. It looks to me we should map > descriptor coherent but readonly (which is not supported by current DMA > API). I think I am missing something obvious here. The attacker is the hypervisor, aka the owner of the VirtIO device (ring0). The attacker is the one that provides the addr/len - having that readonly from a guest perspective does not change the fact that the hypervisor can modify the memory range by mapping it via a different virtual address in the hypervisor? (aka aliasing it). > > Otherwise, device can modify the desc[i].addr/desc[i].len at any time to > pretend a valid mapping. With the swiotlb=force as long as addr/len are within the PA boundaries within the SWIOTLB pool this should be OK? After all that whole area is in cleartext and visible to the attacker.
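The pool-boundary check discussed above — rejecting an unmap whose device-supplied addr/len does not fall entirely inside the SWIOTLB pool — can be sketched in user space roughly as follows. This is a minimal sketch, not the kernel code; the `io_tlb_start`/`io_tlb_end` values are made-up stand-ins for the kernel's globals:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Made-up pool bounds standing in for the kernel's io_tlb_start /
 * io_tlb_end globals: a 256 KiB pool at 0x100000. */
static const uint64_t io_tlb_start = 0x100000;
static const uint64_t io_tlb_end   = 0x140000;

/* Reject a device-supplied (addr, len) pair unless it lies entirely
 * inside the swiotlb pool, guarding against integer wrap-around. */
static bool addr_in_io_tlb(uint64_t dma_addr, uint64_t size)
{
	if (size == 0 || dma_addr < io_tlb_start)
		return false;
	if (dma_addr + size < dma_addr)	/* wrap-around */
		return false;
	return dma_addr + size <= io_tlb_end;
}
```

As noted in the thread, this alone only guarantees the write stays inside the (cleartext, attacker-visible) bounce pool; it does not by itself bound the copy back to the original buffer.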
Re: swiotlb/virtio: unchecked device dma address and length
On Fri, Dec 11, 2020 at 06:31:21PM +0100, Felicitas Hetzelt wrote: > Hello, Hi! Please see below my responses. > we have been analyzing the Hypervisor-OS interface of Linux > and discovered bugs in the swiotlb/virtio implementation that can be > triggered from a malicious Hypervisor / virtual device. > With SEV, the SWIOTLB implementation is forcefully enabled and would > always be used. Thus, all virtio devices and others would use it under > the hood. > > The reason for analyzing this interface is that, technologies such as > Intel's Trusted Domain Extensions [1] and AMD's Secure Nested Paging [2] > change the threat model assumed by various Linux kernel subsystems. > These technologies take the presence of a fully malicious hypervisor > into account and aim to provide protection for virtual machines in such > an environment. Therefore, all input received from the hypervisor or an > external device should be carefully validated. Note that these issues > are of little (or no) relevance in a "normal" virtualization setup, > nevertheless we believe that it is required to fix them if TDX or SNP is > used. > > We are happy to provide more information if needed! > > [1] > https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html > > [2] https://www.amd.com/en/processors/amd-secure-encrypted-virtualization > > Bug: > OOB memory write. > dma_unmap_single -> swiotlb_tbl_unmap_single is invoked with dma_addr > and length parameters that are under control of the device. > This happens e.g. in virtio_ring: > https://elixir.bootlin.com/linux/v5.10-rc7/source/drivers/virtio/virtio_ring.c#L378 Heya! Thank you for pointing this out! I've a couple of questions and hope you can help me out with them. Also CC-ing AMD / TDX folks. > > This raises two issues: > 1) swiotlb_tlb_unmap_single fails to check whether the index generated > from the dma_addr is in range of the io_tlb_orig_addr array. That is fairly simple to implement I would think. 
That is it can check that the dma_addr is from the PA in the io_tlb pool when SWIOTLB=force is used. > 2) when swiotlb_bounce is called the device controls the length of the > memory copied to the cpu address. So.. this sounds very similar to the Intel Thunder.. something issue where this exact issue was fixed by handing the DMA off to the SWIOTLB bounce code. But if that is broken, then that CVE is still not fixed? So the issue here is that swiotlb_tbl_unmap_single(..,mapping_size,) is under the attacker's control. Ugh. One way could be to have an io_tlb_orig_addr-ish array with the length of mappings to double check? Couple more questions: - Did you have already some PoC fixes for this? - Is there a CVE associated with this? - Is there a paper on this you all are working on? Thank you!
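The "io_tlb_orig_addr-ish array with the length of mappings" idea floated above is essentially what the later fix (commit daf9514fd5eb, "swiotlb: Validate bounce size in the sync/unmap path") implements: record the mapped size at map time, then clamp the device-supplied length before bouncing. A minimal user-space sketch, with hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 64	/* arbitrary pool size for the sketch */

/* Hypothetical side table: alongside io_tlb_orig_addr, remember how
 * many bytes were actually mapped into each slot at map time. */
static size_t io_tlb_mapped_size[NSLOTS];

static void record_map(size_t index, size_t size)
{
	if (index < NSLOTS)
		io_tlb_mapped_size[index] = size;
}

/* At unmap/sync time, clamp the device-supplied length to what was
 * mapped, so the bounce copy can never overrun the original buffer. */
static size_t clamp_bounce_size(size_t index, size_t dev_size)
{
	if (index >= NSLOTS)
		return 0;	/* out-of-range index: refuse entirely */
	if (dev_size > io_tlb_mapped_size[index])
		return io_tlb_mapped_size[index];
	return dev_size;
}
```

Together with the index-range check from issue 1), this closes both device-controlled parameters: the index and the length.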
Is: virtio_gpu_object_shmem_init issues? Was:Re: upstream boot error: general protection fault in swiotlb_map
On Thu, Aug 06, 2020 at 03:46:23AM -0700, syzbot wrote: > Hello, > > syzbot found the following issue on: > > HEAD commit:47ec5303 Merge git://git.kernel.org/pub/scm/linux/kernel/g.. > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=16fe1dea90 > kernel config: https://syzkaller.appspot.com/x/.config?x=7c06047f622c5724 > dashboard link: https://syzkaller.appspot.com/bug?extid=3f86afd0b1e4bf1cb64c > compiler: gcc (GCC) 10.1.0-syz 20200507 > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+3f86afd0b1e4bf1cb...@syzkaller.appspotmail.com > > ceph: loaded (mds proto 32) > NET: Registered protocol family 38 > async_tx: api initialized (async) > Key type asymmetric registered > Asymmetric key parser 'x509' registered > Asymmetric key parser 'pkcs8' registered > Key type pkcs7_test registered > Asymmetric key parser 'tpm_parser' registered > Block layer SCSI generic (bsg) driver version 0.4 loaded (major 243) > io scheduler mq-deadline registered > io scheduler kyber registered > io scheduler bfq registered > hgafb: HGA card not detected. 
> hgafb: probe of hgafb.0 failed with error -22 > usbcore: registered new interface driver udlfb > uvesafb: failed to execute /sbin/v86d > uvesafb: make sure that the v86d helper is installed and executable > uvesafb: Getting VBE info block failed (eax=0x4f00, err=-2) > uvesafb: vbe_init() failed with -22 > uvesafb: probe of uvesafb.0 failed with error -22 > vga16fb: mapped to 0x8aac772d > Console: switching to colour frame buffer device 80x30 > fb0: VGA16 VGA frame buffer device > input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0 > ACPI: Power Button [PWRF] > ioatdma: Intel(R) QuickData Technology Driver 5.00 > PCI Interrupt Link [GSIF] enabled at IRQ 21 > PCI Interrupt Link [GSIG] enabled at IRQ 22 > PCI Interrupt Link [GSIH] enabled at IRQ 23 > N_HDLC line discipline registered with maxframe=4096 > Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled > 00:05: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A > Cyclades driver 2.6 > Initializing Nozomi driver 2.1d > RocketPort device driver module, version 2.09, 12-June-2003 > No rocketport ports found; unloading driver > Non-volatile memory driver v1.3 > Linux agpgart interface v0.103 > [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0 > [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1 > usbcore: registered new interface driver udl > [drm] pci: virtio-vga detected at :00:01.0 > fb0: switching to virtiodrmfb from VGA16 VGA > Console: switching to colour VGA+ 80x25 > virtio-pci :00:01.0: vgaarb: deactivate vga console > Console: switching to colour dummy device 80x25 > [drm] features: -virgl +edid > [drm] number of scanouts: 1 > [drm] number of cap sets: 0 > [drm] Initialized virtio_gpu 0.1.0 0 for virtio0 on minor 2 > general protection fault, probably for non-canonical address > 0xdc00: [#1] PREEMPT SMP KASAN > KASAN: null-ptr-deref in range [0x-0x0007] > CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.8.0-syzkaller #0 > Hardware name: QEMU Standard PC (Q35 + ICH9, 
2009), BIOS > rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 > RIP: 0010:swiotlb_map+0x5ac/0x700 kernel/dma/swiotlb.c:683 > Code: 28 04 00 00 48 c1 ea 03 80 3c 02 00 0f 85 4d 01 00 00 4c 8b a5 18 04 00 > 00 48 b8 00 00 00 00 00 fc ff df 4c 89 e2 48 c1 ea 03 <80> 3c 02 00 0f 85 1e > 01 00 00 48 8d 7d 50 4d 8b 24 24 48 b8 00 00 > RSP: :c934f3e0 EFLAGS: 00010246 > RAX: dc00 RBX: RCX: 8162cc1d > RDX: RSI: 8162cc98 RDI: 88802971a470 > RBP: 88802971a048 R08: 0001 R09: 8c5dba77 > R10: R11: R12: > R13: 7ac0 R14: dc00 R15: 1000 > FS: () GS:88802ce0() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: CR3: 09a8d000 CR4: 00350ef0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: > dma_direct_map_page include/linux/dma-direct.h:170 [inline] > dma_direct_map_sg+0x3bb/0x670 kernel/dma/direct.c:368 > dma_map_sg_attrs+0xd0/0x160 kernel/dma/mapping.c:183 > drm_gem_shmem_get_pages_sgt drivers/gpu/drm/drm_gem_shmem_helper.c:700 > [inline] > drm_gem_shmem_get_pages_sgt+0x1fc/0x310 > drivers/gpu/drm/drm_gem_shmem_helper.c:679 > virtio_gpu_object_shmem_init drivers/gpu/drm/virtio/virtgpu_object.c:153 > [inline] > virtio_gpu_object_create+0x2fd/0xa70 > drivers/gpu/drm/virtio/virtgpu_object.c:232 > virtio_gpu_gem_create drivers/gpu/drm/virtio/virtgpu_gem.c:45 [inline] > virtio_gpu_mode_dumb_create+0x298/0x530 > drivers/gpu/drm/virtio/virtgpu_gem.c:85 >
Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version
On Thu, Jun 11, 2020 at 07:34:19AM -0400, Michael S. Tsirkin wrote: > As testing shows no performance change, switch to that now. What kind of testing? 100GiB? Low latency?
Re: [PATCH 5/5] virtio: Add bounce DMA ops
On Wed, Apr 29, 2020 at 06:20:48AM -0400, Michael S. Tsirkin wrote: > On Wed, Apr 29, 2020 at 03:39:53PM +0530, Srivatsa Vaddagiri wrote: > > That would still not work I think where swiotlb is used for pass-thr devices > > (when private memory is fine) as well as virtio devices (when shared memory > > is > > required). > > So that is a separate question. When there are multiple untrusted > devices, at the moment it looks like a single bounce buffer is used. > > Which to me seems like a security problem, I think we should protect > untrusted devices from each other. There are two DMA pool implementations in Linux already - the TTM one for graphics and the mm/dmapool.c one - could those be used instead? Or augmented at least?
Re: [PATCH RFC v4 00/13] virtio-mem: paravirtualized memory
On Thu, Dec 12, 2019 at 06:11:24PM +0100, David Hildenbrand wrote: > This series is based on latest linux-next. The patches are located at: > https://github.com/davidhildenbrand/linux.git virtio-mem-rfc-v4 Heya! Would there be by any chance a virtio-spec git tree somewhere? ..snip.. > -- > 5. Future work > -- > > The separate patches contain a lot of future work items. One of the next > steps is to make memory unplug more likely to succeed - currently, there > are no guarantees on how much memory can get unplugged again. I have Or perhaps tell the caller why we can't and let them sort it out? For example: "Application XYZ is mlocked. Can't offload."
Re: [RFC PATCH v6 69/92] kvm: x86: keep the page protected if tracked by the introspection tool
On Fri, Aug 09, 2019 at 07:00:24PM +0300, Adalbert Lazăr wrote: > This patch might be obsolete thanks to single-stepping. sooo should it be skipped from this large patchset to easy review? > > Signed-off-by: Adalbert Lazăr > --- > arch/x86/kvm/x86.c | 9 +++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 2c06de73a784..06f44ce8ed07 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -6311,7 +6311,8 @@ static bool reexecute_instruction(struct kvm_vcpu > *vcpu, gva_t cr2, > indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages; > spin_unlock(>kvm->mmu_lock); > > - if (indirect_shadow_pages) > + if (indirect_shadow_pages > + && !kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa))) > kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); > > return true; > @@ -6322,7 +6323,8 @@ static bool reexecute_instruction(struct kvm_vcpu > *vcpu, gva_t cr2, >* and it failed try to unshadow page and re-enter the >* guest to let CPU execute the instruction. >*/ > - kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); > + if (!kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa))) > + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); > > /* >* If the access faults on its page table, it can not > @@ -6374,6 +6376,9 @@ static bool retry_instruction(struct x86_emulate_ctxt > *ctxt, > if (!vcpu->arch.mmu->direct_map) > gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL); > > + if (kvmi_tracked_gfn(vcpu, gpa_to_gfn(gpa))) > + return false; > + > kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); > > return true; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v6 00/27] x86: PIE support and option to extend KASLR randomization
On Thu, Jan 31, 2019 at 11:24:07AM -0800, Thomas Garnier wrote: > There has been no major concern in the latest iterations. I am interested on > what would be the best way to slowly integrate this patchset upstream. One question that I somehow expected in this cover letter - what about all those lovely speculative bugs? As in say someone hasn't updated their machine with the Spectre v3a microcode - wouldn't they be able to get the kernel virtual address space? In effect rendering all this hard work not needed?
Re: [PATCH 0/5 v5] Fix virtio-blk issue with SWIOTLB
On Wed, Jan 30, 2019 at 05:40:02PM +0100, Joerg Roedel wrote: > Hi, > > here is the next version of this patch-set. Previous > versions can be found here: > > V1: https://lore.kernel.org/lkml/20190110134433.15672-1-j...@8bytes.org/ > > V2: https://lore.kernel.org/lkml/20190115132257.6426-1-j...@8bytes.org/ > > V3: https://lore.kernel.org/lkml/20190123163049.24863-1-j...@8bytes.org/ > > V4: https://lore.kernel.org/lkml/20190129084342.26030-1-j...@8bytes.org/ > > The problem solved here is a limitation of the SWIOTLB implementation, > which does not support allocations larger than 256kb. When the > virtio-blk driver tries to read/write a block larger than that, the > allocation of the dma-handle fails and an IO error is reported. > > Changes to v4 are: > > - Added Reviewed-by tags from Christoph > > - Added missing EXPORT_SYMBOL(_GPL) lines > > Please review. I can put it in my tree and send it to Linus .. unless folks want to do it through a different tree? > > Thanks, > > Joerg > Joerg Roedel (5): > swiotlb: Introduce swiotlb_max_mapping_size() > swiotlb: Add is_swiotlb_active() function > dma: Introduce dma_max_mapping_size() > virtio: Introduce virtio_max_dma_size() > virtio-blk: Consider virtio_max_dma_size() for maximum segment size > > Documentation/DMA-API.txt| 8 > drivers/block/virtio_blk.c | 10 ++ > drivers/virtio/virtio_ring.c | 11 +++ > include/linux/dma-mapping.h | 16 > include/linux/swiotlb.h | 11 +++ > include/linux/virtio.h | 2 ++ > kernel/dma/direct.c | 12 > kernel/dma/swiotlb.c | 14 ++ > 8 files changed, 80 insertions(+), 4 deletions(-) > > -- > 2.17.1 > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/5 v3] Fix virtio-blk issue with SWIOTLB
On Mon, Jan 28, 2019 at 10:20:05AM -0500, Michael S. Tsirkin wrote: > On Wed, Jan 23, 2019 at 04:14:53PM -0500, Konrad Rzeszutek Wilk wrote: > > On Wed, Jan 23, 2019 at 01:51:29PM -0500, Michael S. Tsirkin wrote: > > > On Wed, Jan 23, 2019 at 05:30:44PM +0100, Joerg Roedel wrote: > > > > Hi, > > > > > > > > here is the third version of this patch-set. Previous > > > > versions can be found here: > > > > > > > > V1: > > > > https://lore.kernel.org/lkml/20190110134433.15672-1-j...@8bytes.org/ > > > > > > > > V2: > > > > https://lore.kernel.org/lkml/20190115132257.6426-1-j...@8bytes.org/ > > > > > > > > The problem solved here is a limitation of the SWIOTLB implementation, > > > > which does not support allocations larger than 256kb. When the > > > > virtio-blk driver tries to read/write a block larger than that, the > > > > allocation of the dma-handle fails and an IO error is reported. > > > > > > > > > OK looks good to me. > > > I will park this in my tree for now this way it will get > > > testing in linux-next. > > > Can I get an ack from DMA maintainers on the DMA bits for > > > merging this in 5.0? > > > > You got mine (SWIOTBL is my area). > > OK so > > Reviewed-by: Konrad Rzeszutek Wilk Yes :-) > > > > > > > > > > Changes to v2 are: > > > > > > > > * Check if SWIOTLB is active before returning its limit in > > > > dma_direct_max_mapping_size() > > > > > > > > * Only apply the maximum segment limit in virtio-blk when > > > > DMA-API is used for the vring > > > > > > > > Please review. 
> > > > > > > > Thanks, > > > > > > > > Joerg > > > > > > > > Joerg Roedel (5): > > > > swiotlb: Introduce swiotlb_max_mapping_size() > > > > swiotlb: Add is_swiotlb_active() function > > > > dma: Introduce dma_max_mapping_size() > > > > virtio: Introduce virtio_max_dma_size() > > > > virtio-blk: Consider virtio_max_dma_size() for maximum segment size > > > > > > > > drivers/block/virtio_blk.c | 10 ++ > > > > drivers/virtio/virtio_ring.c | 10 ++ > > > > include/linux/dma-mapping.h | 16 > > > > include/linux/swiotlb.h | 11 +++ > > > > include/linux/virtio.h | 2 ++ > > > > kernel/dma/direct.c | 11 +++ > > > > kernel/dma/swiotlb.c | 10 ++ > > > > 7 files changed, 66 insertions(+), 4 deletions(-) > > > > > > > > -- > > > > 2.17.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/5 v3] Fix virtio-blk issue with SWIOTLB
On Wed, Jan 23, 2019 at 01:51:29PM -0500, Michael S. Tsirkin wrote: > On Wed, Jan 23, 2019 at 05:30:44PM +0100, Joerg Roedel wrote: > > Hi, > > > > here is the third version of this patch-set. Previous > > versions can be found here: > > > > V1: https://lore.kernel.org/lkml/20190110134433.15672-1-j...@8bytes.org/ > > > > V2: https://lore.kernel.org/lkml/20190115132257.6426-1-j...@8bytes.org/ > > > > The problem solved here is a limitation of the SWIOTLB implementation, > > which does not support allocations larger than 256kb. When the > > virtio-blk driver tries to read/write a block larger than that, the > > allocation of the dma-handle fails and an IO error is reported. > > > OK looks good to me. > I will park this in my tree for now this way it will get > testing in linux-next. > Can I get an ack from DMA maintainers on the DMA bits for > merging this in 5.0? You got mine (SWIOTBL is my area). > > > Changes to v2 are: > > > > * Check if SWIOTLB is active before returning its limit in > > dma_direct_max_mapping_size() > > > > * Only apply the maximum segment limit in virtio-blk when > > DMA-API is used for the vring > > > > Please review. > > > > Thanks, > > > > Joerg > > > > Joerg Roedel (5): > > swiotlb: Introduce swiotlb_max_mapping_size() > > swiotlb: Add is_swiotlb_active() function > > dma: Introduce dma_max_mapping_size() > > virtio: Introduce virtio_max_dma_size() > > virtio-blk: Consider virtio_max_dma_size() for maximum segment size > > > > drivers/block/virtio_blk.c | 10 ++ > > drivers/virtio/virtio_ring.c | 10 ++ > > include/linux/dma-mapping.h | 16 > > include/linux/swiotlb.h | 11 +++ > > include/linux/virtio.h | 2 ++ > > kernel/dma/direct.c | 11 +++ > > kernel/dma/swiotlb.c | 10 ++ > > 7 files changed, 66 insertions(+), 4 deletions(-) > > > > -- > > 2.17.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/5 v3] Fix virtio-blk issue with SWIOTLB
On Wed, Jan 23, 2019 at 05:30:44PM +0100, Joerg Roedel wrote: > Hi, > > here is the third version of this patch-set. Previous > versions can be found here: > > V1: https://lore.kernel.org/lkml/20190110134433.15672-1-j...@8bytes.org/ > > V2: https://lore.kernel.org/lkml/20190115132257.6426-1-j...@8bytes.org/ > > The problem solved here is a limitation of the SWIOTLB implementation, > which does not support allocations larger than 256kb. When the > virtio-blk driver tries to read/write a block larger than that, the > allocation of the dma-handle fails and an IO error is reported. > > Changes to v2 are: > > * Check if SWIOTLB is active before returning its limit in > dma_direct_max_mapping_size() > > * Only apply the maximum segment limit in virtio-blk when > DMA-API is used for the vring > > Please review. > > Thanks, > > Joerg > > Joerg Roedel (5): > swiotlb: Introduce swiotlb_max_mapping_size() > swiotlb: Add is_swiotlb_active() function > dma: Introduce dma_max_mapping_size() > virtio: Introduce virtio_max_dma_size() > virtio-blk: Consider virtio_max_dma_size() for maximum segment size > > drivers/block/virtio_blk.c | 10 ++ > drivers/virtio/virtio_ring.c | 10 ++ The kvm-devel mailing list should have been copied on those. When you do can you please put 'Reviewed-by: Konrad Rzeszutek Wilk ' on all of them? Thank you! P.S. Christopher, I am assuming you are OK with this idea, if so - and once the owners of 'virtio' Ack, do you want to put it in my tree? Thanks! > include/linux/dma-mapping.h | 16 > include/linux/swiotlb.h | 11 +++ > include/linux/virtio.h | 2 ++ > kernel/dma/direct.c | 11 +++ > kernel/dma/swiotlb.c | 10 ++ > 7 files changed, 66 insertions(+), 4 deletions(-) > > -- > 2.17.1 > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/3] swiotlb: Export maximum allocation size
On Fri, Jan 11, 2019 at 10:12:31AM +0100, Joerg Roedel wrote: > On Thu, Jan 10, 2019 at 12:02:05PM -0500, Konrad Rzeszutek Wilk wrote: > > Why not use swiotlb_nr_tbl ? That is how drivers/gpu/drm use to figure if > > they > > need to limit the size of pages. > > That function just exports the overall size of the swiotlb aperture, no? > What I need here is the maximum size for a single mapping. Yes. The other drivers just assumed that if there is SWIOTLB they would use the smaller size by default (as in they knew the limitation). But I agree it would be better to have something smarter - and also convert the DRM drivers to piggy back on this. Or alternatively we could make SWIOTLB handle bigger sizes.. > > Regards, > > Joerg
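For reference, the per-mapping limit under discussion falls out of the swiotlb slot geometry: at most IO_TLB_SEGSIZE contiguous 2 KiB slots (IO_TLB_SHIFT = 11) per mapping, i.e. 256 KiB. A stand-alone sketch of the computation:

```c
#include <assert.h>
#include <stddef.h>

/* Slot geometry matching the kernel's swiotlb: 2 KiB slots, at most
 * IO_TLB_SEGSIZE contiguous slots per single mapping. */
#define IO_TLB_SHIFT   11
#define IO_TLB_SEGSIZE 128

/* Largest single mapping swiotlb can satisfy: 128 * 2 KiB = 256 KiB. */
static size_t swiotlb_max_alloc_size(void)
{
	return (size_t)IO_TLB_SEGSIZE << IO_TLB_SHIFT;
}
```

This is a per-mapping limit, distinct from the overall aperture size that swiotlb_nr_tbl reports — which is exactly Joerg's point above.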
Re: [PATCH 1/3] swiotlb: Export maximum allocation size
On Thu, Jan 10, 2019 at 02:44:31PM +0100, Joerg Roedel wrote: > From: Joerg Roedel > > The SWIOTLB implementation has a maximum size it can > allocate dma-handles for. This needs to be exported so that > device drivers don't try to allocate larger chunks. > > This is especially important for block device drivers like > virtio-blk, that might do DMA through SWIOTLB. Why not use swiotlb_nr_tbl ? That is how drivers/gpu/drm use to figure if they need to limit the size of pages. > > Signed-off-by: Joerg Roedel > --- > include/linux/swiotlb.h | 12 > 1 file changed, 12 insertions(+) > > diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h > index 7c007ed7505f..0bcc80a97036 100644 > --- a/include/linux/swiotlb.h > +++ b/include/linux/swiotlb.h > @@ -72,6 +72,11 @@ static inline bool is_swiotlb_buffer(phys_addr_t paddr) > return paddr >= io_tlb_start && paddr < io_tlb_end; > } > > +static inline size_t swiotlb_max_alloc_size(void) > +{ > + return ((1UL << IO_TLB_SHIFT) * IO_TLB_SEGSIZE); > +} > + > bool swiotlb_map(struct device *dev, phys_addr_t *phys, dma_addr_t *dma_addr, > size_t size, enum dma_data_direction dir, unsigned long attrs); > void __init swiotlb_exit(void); > @@ -95,6 +100,13 @@ static inline unsigned int swiotlb_max_segment(void) > { > return 0; > } > + > +static inline size_t swiotlb_max_alloc_size(void) > +{ > + /* There is no limit when SWIOTLB isn't used */ > + return ~0UL; > +} > + > #endif /* CONFIG_SWIOTLB */ > > extern void swiotlb_print_info(void); > -- > 2.17.1 > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
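A driver-side sketch of how such a helper would be consumed (hypothetical illustration, not the actual virtio-blk change): clamp the advertised maximum segment size to what a single bounce mapping can cover.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the helper proposed in the patch above. */
static size_t swiotlb_max_alloc_size(void)
{
	return 128 * 2048;	/* IO_TLB_SEGSIZE << IO_TLB_SHIFT = 256 KiB */
}

/* Hypothetical driver-side use: never advertise a segment larger
 * than a single bounce mapping can remap. */
static size_t clamp_max_segment(size_t driver_max)
{
	size_t lim = swiotlb_max_alloc_size();
	return driver_max < lim ? driver_max : lim;
}
```

With the segment size clamped this way, the dma-handle allocation can no longer exceed the bounce buffer's limit, avoiding the IO errors described in the cover letter.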
Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
.giant snip.. > > + npinned = get_user_pages_fast(uaddr, npages, write, pages); > > + if (npinned != npages) > > + goto err; > > + > > As I said I have doubts about the whole approach, but this > implementation in particular isn't a good idea > as it keeps the page around forever. > So no THP, no NUMA rebalancing, userspace-controlled > amount of memory locked up and not accounted for. > > Don't get me wrong it's a great patch in an ideal world. > But then in an ideal world no barriers smap etc are necessary at all. So .. suggestions on how this could be accepted? As in other ways where we still get vmap and the issues you mentioned are not troubling you? Thanks!
Re: Memory Read Only Enforcement: VMM assisted kernel rootkit mitigation for KVM
On Thu, Jul 19, 2018 at 11:37:59PM +0200, Ahmed Abd El Mawgood wrote: > Hi, > > This is my first set of patches that works as I would expect, and the > third revision I sent to mailing lists. > > Following up with my previous discussions about kernel rootkit mitigation > via placing R/O protection on critical data structure, static data, > privileged registers with static content. These patches present the > first part where it is only possible to place these protections on > memory pages. Feature-wise, this set of patches is incomplete in the sense of: > - They still don't protect privileged registers > - They don't protect guest TLB from malicious gva -> gpa page mappings. > But they provide sketches for a basic working design. Note that I am totally > noob and it took lots of time and effort to get to this point. So sorry in > advance if I overlooked something. This reminds me of Xen PV page model. That is the hypervisor is the one auditing the page tables and the guest's pages are read-only. Ditto for IDT, GDT, etc. Gosh, did you by chance look at how Xen PV mechanism is done? It may provide the protection you are looking for? CC-ing xen-devel. > > [PATCH 1/3] [RFC V3] KVM: X86: Memory ROE documentation > [PATCH 2/3] [RFC V3] KVM: X86: Adding arbitrary data pointer in kvm memslot > itterator functions > [PATCH 3/3] [RFC V3] KVM: X86: Adding skeleton for Memory ROE > > Summery: > > Documentation/virtual/kvm/hypercalls.txt | 14 > arch/x86/include/asm/kvm_host.h | 11 ++- > arch/x86/kvm/Kconfig | 7 ++ > arch/x86/kvm/mmu.c | 127 > ++- > arch/x86/kvm/x86.c | 82 +++- > include/linux/kvm_host.h | 3 + > include/uapi/linux/kvm_para.h| 1 + > virt/kvm/kvm_main.c | 29 ++- > 8 files changed, 232 insertions(+), 42 deletions(-) > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC V3 PATCH 0/8] Packed ring for vhost
On Mon, Apr 23, 2018 at 10:59:43PM +0300, Michael S. Tsirkin wrote: > On Mon, Apr 23, 2018 at 03:31:20PM -0400, Konrad Rzeszutek Wilk wrote: > > On Mon, Apr 23, 2018 at 01:34:52PM +0800, Jason Wang wrote: > > > Hi all: > > > > > > This RFC implement packed ring layout. The code were tested with > > > Tiwei's RFC V2 a thttps://lkml.org/lkml/2018/4/1/48. Some fixups and > > > tweaks were needed on top of Tiwei's code to make it run. TCP stream > > > and pktgen does not show obvious difference compared with split ring. > > > > I have to ask then - what is the benefit of this? > > You can use this with dpdk within guest. > The DPDK version did see a performance improvement so hopefully with Is there a link to this performance improvement document? > future versions this will too. > > > > > > > Changes from V2: > > > - do not use & in checking desc_event_flags > > > - off should be most significant bit > > > - remove the workaround of mergeable buffer for dpdk prototype > > > - id should be in the last descriptor in the chain > > > - keep _F_WRITE for write descriptor when adding used > > > - device flags updating should use ADDR_USED type > > > - return error on unexpected unavail descriptor in a chain > > > - return false in vhost_ve_avail_empty is descriptor is available > > > - track last seen avail_wrap_counter > > > - correctly examine available descriptor in get_indirect_packed() > > > - vhost_idx_diff should return u16 instead of bool > > > > > > Changes from V1: > > > > > > - Refactor vhost used elem code to avoid open coding on used elem > > > - Event suppression support (compile test only). > > > - Indirect descriptor support (compile test only). > > > - Zerocopy support. > > > - vIOMMU support. > > > - SCSI/VSOCK support (compile test only). > > > - Fix several bugs > > > > > > For simplicity, I don't implement batching or other optimizations. > > > > > > Please review. 
> > > > > > Jason Wang (8): > > > vhost: move get_rx_bufs to vhost.c > > > vhost: hide used ring layout from device > > > vhost: do not use vring_used_elem > > > vhost_net: do not explicitly manipulate vhost_used_elem > > > vhost: vhost_put_user() can accept metadata type > > > virtio: introduce packed ring defines > > > vhost: packed ring support > > > vhost: event suppression for packed ring > > > > > > drivers/vhost/net.c| 136 ++ > > > drivers/vhost/scsi.c | 62 +-- > > > drivers/vhost/vhost.c | 824 > > > ++--- > > > drivers/vhost/vhost.h | 47 ++- > > > drivers/vhost/vsock.c | 42 +- > > > include/uapi/linux/virtio_config.h | 9 + > > > include/uapi/linux/virtio_ring.h | 32 ++ > > > 7 files changed, 926 insertions(+), 226 deletions(-) > > > > > > -- > > > 2.7.4 > > > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC V3 PATCH 0/8] Packed ring for vhost
On Mon, Apr 23, 2018 at 01:34:52PM +0800, Jason Wang wrote: > Hi all: > > This RFC implements a packed ring layout. The code was tested with > Tiwei's RFC V2 at https://lkml.org/lkml/2018/4/1/48. Some fixups and > tweaks were needed on top of Tiwei's code to make it run. TCP stream > and pktgen do not show an obvious difference compared with the split ring. I have to ask then - what is the benefit of this? > > Changes from V2: > - do not use & in checking desc_event_flags > - off should be most significant bit > - remove the workaround of mergeable buffer for dpdk prototype > - id should be in the last descriptor in the chain > - keep _F_WRITE for write descriptor when adding used > - device flags updating should use ADDR_USED type > - return error on unexpected unavail descriptor in a chain > - return false in vhost_ve_avail_empty if descriptor is available > - track last seen avail_wrap_counter > - correctly examine available descriptor in get_indirect_packed() > - vhost_idx_diff should return u16 instead of bool > > Changes from V1: > > - Refactor vhost used elem code to avoid open coding on used elem > - Event suppression support (compile test only). > - Indirect descriptor support (compile test only). > - Zerocopy support. > - vIOMMU support. > - SCSI/VSOCK support (compile test only). > - Fix several bugs > > For simplicity, I don't implement batching or other optimizations. > > Please review. 
> > Jason Wang (8): > vhost: move get_rx_bufs to vhost.c > vhost: hide used ring layout from device > vhost: do not use vring_used_elem > vhost_net: do not explicitly manipulate vhost_used_elem > vhost: vhost_put_user() can accept metadata type > virtio: introduce packed ring defines > vhost: packed ring support > vhost: event suppression for packed ring > > drivers/vhost/net.c| 136 ++ > drivers/vhost/scsi.c | 62 +-- > drivers/vhost/vhost.c | 824 > ++--- > drivers/vhost/vhost.h | 47 ++- > drivers/vhost/vsock.c | 42 +- > include/uapi/linux/virtio_config.h | 9 + > include/uapi/linux/virtio_ring.h | 32 ++ > 7 files changed, 926 insertions(+), 226 deletions(-) > > -- > 2.7.4 > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
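For readers following the split-vs-packed discussion above: the packed layout drops the separate avail/used rings and instead encodes availability in per-descriptor flag bits checked against a wrap counter (which is why the changelogs above keep mentioning avail_wrap_counter). Below is a minimal user-space sketch of that check. The bit positions follow the virtio 1.1 packed-ring proposal; the function names are illustrative, not the actual vhost code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions from the packed-ring proposal (virtio 1.1 draft). */
#define VRING_PACKED_DESC_F_AVAIL 7
#define VRING_PACKED_DESC_F_USED  15

/*
 * A descriptor is available to the consumer when its AVAIL bit matches
 * the consumer's wrap counter and its USED bit does not; it has been
 * completed ("used") when both bits match the wrap counter.
 */
static bool desc_is_avail(uint16_t flags, bool wrap_counter)
{
	bool avail = flags & (1u << VRING_PACKED_DESC_F_AVAIL);
	bool used  = flags & (1u << VRING_PACKED_DESC_F_USED);

	return avail == wrap_counter && used != wrap_counter;
}

static bool desc_is_used(uint16_t flags, bool wrap_counter)
{
	bool avail = flags & (1u << VRING_PACKED_DESC_F_AVAIL);
	bool used  = flags & (1u << VRING_PACKED_DESC_F_USED);

	return avail == used && used == wrap_counter;
}
```

The wrap counter flips each time the ring index wraps, which is how both sides distinguish new descriptors from stale ones left over from the previous lap without any extra ring.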
Re: [RFC PATCH V2 0/8] Packed ring for vhost
On Mon, Mar 26, 2018 at 11:38:45AM +0800, Jason Wang wrote: > Hi all: > > This RFC implements a packed ring layout. The code was tested with the pmd > implemented by Jens at > http://dpdk.org/ml/archives/dev/2018-January/089417.html. A minor change > was needed in the pmd code to kick the virtqueue since it assumes a busy > polling backend. > > Tests were done between localhost and guest. Testpmd (rxonly) in guest > reports 2.4Mpps. Testpmd (txonly) reports about 2.1Mpps. And how does it compare to the older ring layout? > > Notes: The event suppression / indirect descriptor support is compile > tested only because of lacking driver support. > > Changes from V1: > > - Refactor vhost used elem code to avoid open coding on used elem > - Event suppression support (compile test only). > - Indirect descriptor support (compile test only). > - Zerocopy support. > - vIOMMU support. > - SCSI/VSOCK support (compile test only). > - Fix several bugs > > For simplicity, I don't implement batching or other optimizations. > > Please review. > > Thanks > > Jason Wang (8): > vhost: move get_rx_bufs to vhost.c > vhost: hide used ring layout from device > vhost: do not use vring_used_elem > vhost_net: do not explicitly manipulate vhost_used_elem > vhost: vhost_put_user() can accept metadata type > virtio: introduce packed ring defines > vhost: packed ring support > vhost: event suppression for packed ring > > drivers/vhost/net.c| 138 ++- > drivers/vhost/scsi.c | 62 +-- > drivers/vhost/vhost.c | 818 > ++--- > drivers/vhost/vhost.h | 46 ++- > drivers/vhost/vsock.c | 42 +- > include/uapi/linux/virtio_config.h | 9 + > include/uapi/linux/virtio_ring.h | 32 ++ > 7 files changed, 921 insertions(+), 226 deletions(-) > > -- > 2.7.4 >
Re: [Xen-devel] [PATCH RFC v3 0/6] x86/idle: add halt poll support
On Mon, Nov 13, 2017 at 06:05:59PM +0800, Quan Xu wrote: > From: Yang Zhang > > Some latency-intensive workloads have seen an obvious performance > drop when running inside a VM. The main reason is that the overhead > is amplified when running inside a VM. The biggest cost I have seen is > in the idle path. Meaning a VMEXIT b/c it is a 'halt' operation? And then going back into the guest (VMRESUME) takes time. And hence your latency gets all whacked b/c of this? So if I understand - you want to use your _full_ timeslice (of the guest) without ever (or as much as possible) going into the hypervisor? Which means in effect you don't care about power-saving or CPUfreq savings, you just want to eat the full CPU for snack? > > This patch introduces a new mechanism to poll for a while before > entering the idle state. If a schedule is needed during the poll, then we > don't need to go through the heavy overhead path. Schedule of what? The guest or the host? 
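A sketch of the mechanism being debated in this thread: poll for a bounded time before really halting, and bail out if work shows up, which skips the VMEXIT/VMRESUME round trip entirely. The clock and resched hooks are injected function pointers so the sketch stays testable in user space; the real series hooks need_resched() and tunes the poll window adaptively, and all names here are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative fixed poll window; the real series grows/shrinks it. */
#define POLL_NS 200000ULL

/*
 * Poll for up to POLL_NS before entering the real (expensive) idle
 * path.  If work becomes runnable while polling, report that so the
 * caller can skip halt entirely.  now_ns() and need_resched() are
 * injected to keep this sketch testable.
 */
static bool poll_before_halt(uint64_t (*now_ns)(void),
			     bool (*need_resched)(void))
{
	uint64_t deadline = now_ns() + POLL_NS;

	while (now_ns() < deadline) {
		if (need_resched())
			return true;	/* work arrived: no halt needed */
		/* cpu_relax() would go here on real hardware */
	}
	return false;			/* nothing came in: really go idle */
}

/* --- tiny stubs for demonstration only --- */
static uint64_t fake_clock_ns;
static uint64_t fake_now(void)      { return fake_clock_ns += 50000; }
static bool resched_never(void)     { return false; }
static bool resched_always(void)    { return true; }
```

The trade-off Konrad raises is visible in the shape of the loop: every nanosecond spent in it is CPU the host could have given to another vCPU, which is why the window must shrink when polling keeps timing out.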
Re: [RFC PATCH v2 1/7] x86/paravirt: Add pv_idle_ops to paravirt ops
On Tue, Aug 29, 2017 at 11:46:35AM +, Yang Zhang wrote: > So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called in > the idle path and polls for a while before we enter the real idle > state. > > In virtualization, the idle path includes several heavy operations, > including timer access (LAPIC timer or TSC deadline timer), which hurt > performance, especially for latency-intensive workloads like message- > passing tasks. The cost mainly comes from the vmexit, which is a > hardware context switch between the VM and the hypervisor. Our solution is to > poll for a while and not enter the real idle path if we can get the > schedule event during polling. > > Polling may waste CPU, so we adopt a smart polling mechanism to > reduce useless polling. > > Signed-off-by: Yang Zhang > Signed-off-by: Quan Xu > Cc: Jeremy Fitzhardinge > Cc: Chris Wright > Cc: Alok Kataria > Cc: Rusty Russell > Cc: Thomas Gleixner > Cc: Ingo Molnar > Cc: "H. Peter Anvin" > Cc: x...@kernel.org > Cc: Peter Zijlstra > Cc: Andy Lutomirski > Cc: "Kirill A. Shutemov" > Cc: Pan Xinhui > Cc: Kees Cook > Cc: virtualization@lists.linux-foundation.org > Cc: linux-ker...@vger.kernel.org Adding xen-devel. Juergen, we really should replace Jeremy's name with xen-devel or your name.. Wasn't there a patch by you that took over some of the maintainership? 
> --- > arch/x86/include/asm/paravirt.h | 5 + > arch/x86/include/asm/paravirt_types.h | 6 ++ > arch/x86/kernel/paravirt.c| 6 ++ > 3 files changed, 17 insertions(+) > > diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h > index 9ccac19..6d46760 100644 > --- a/arch/x86/include/asm/paravirt.h > +++ b/arch/x86/include/asm/paravirt.h > @@ -202,6 +202,11 @@ static inline unsigned long long paravirt_read_pmc(int > counter) > > #define rdpmcl(counter, val) ((val) = paravirt_read_pmc(counter)) > > +static inline void paravirt_idle_poll(void) > +{ > + PVOP_VCALL0(pv_idle_ops.poll); > +} > + > static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned > entries) > { > PVOP_VCALL2(pv_cpu_ops.alloc_ldt, ldt, entries); > diff --git a/arch/x86/include/asm/paravirt_types.h > b/arch/x86/include/asm/paravirt_types.h > index 9ffc36b..cf45726 100644 > --- a/arch/x86/include/asm/paravirt_types.h > +++ b/arch/x86/include/asm/paravirt_types.h > @@ -324,6 +324,10 @@ struct pv_lock_ops { > struct paravirt_callee_save vcpu_is_preempted; > } __no_randomize_layout; > > +struct pv_idle_ops { > + void (*poll)(void); > +} __no_randomize_layout; > + > /* This contains all the paravirt structures: we get a convenient > * number for each function using the offset which we use to indicate > * what to patch. 
*/ > @@ -334,6 +338,7 @@ struct paravirt_patch_template { > struct pv_irq_ops pv_irq_ops; > struct pv_mmu_ops pv_mmu_ops; > struct pv_lock_ops pv_lock_ops; > + struct pv_idle_ops pv_idle_ops; > } __no_randomize_layout; > > extern struct pv_info pv_info; > @@ -343,6 +348,7 @@ struct paravirt_patch_template { > extern struct pv_irq_ops pv_irq_ops; > extern struct pv_mmu_ops pv_mmu_ops; > extern struct pv_lock_ops pv_lock_ops; > +extern struct pv_idle_ops pv_idle_ops; > > #define PARAVIRT_PATCH(x)\ > (offsetof(struct paravirt_patch_template, x) / sizeof(void *)) > diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c > index bc0a849..1b5b247 100644 > --- a/arch/x86/kernel/paravirt.c > +++ b/arch/x86/kernel/paravirt.c > @@ -128,6 +128,7 @@ static void *get_call_destination(u8 type) > #ifdef CONFIG_PARAVIRT_SPINLOCKS > .pv_lock_ops = pv_lock_ops, > #endif > + .pv_idle_ops = pv_idle_ops, > }; > return *((void **) + type); > } > @@ -312,6 +313,10 @@ struct pv_time_ops pv_time_ops = { > .steal_clock = native_steal_clock, > }; > > +struct pv_idle_ops pv_idle_ops = { > + .poll = paravirt_nop, > +}; > + > __visible struct pv_irq_ops pv_irq_ops = { > .save_fl = __PV_IS_CALLEE_SAVE(native_save_fl), > .restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl), > @@ -471,3 +476,4 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = { > EXPORT_SYMBOL(pv_mmu_ops); > EXPORT_SYMBOL_GPL(pv_info); > EXPORT_SYMBOL(pv_irq_ops); > +EXPORT_SYMBOL(pv_idle_ops); > -- > 1.8.3.1 > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
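The patch above is mostly plumbing: it defines a pv_idle_ops struct whose .poll member defaults to paravirt_nop, so native kernels pay (almost) nothing and a guest can install its own poll routine at init time. A user-space sketch of that same default-plus-override pattern (names mirror the patch, but the demo helpers are invented for illustration):

```c
#include <assert.h>

/* User-space analogue of the pv_idle_ops pattern in the patch above:
 * an ops struct whose .poll defaults to a nop and can be overridden
 * by a hypervisor-specific implementation at init time. */
static int polls_taken;

static void poll_nop(void)   { }		/* native: do nothing   */
static void poll_guest(void) { polls_taken++; }	/* a guest's override   */

struct pv_idle_ops {
	void (*poll)(void);
};

static struct pv_idle_ops pv_idle_ops = { .poll = poll_nop };

/* What the inlined paravirt_idle_poll() wrapper boils down to. */
static void paravirt_idle_poll(void)
{
	pv_idle_ops.poll();
}

/* Exercise the default, then the override. */
static int demo_pv_idle(void)
{
	paravirt_idle_poll();		/* default nop: no effect       */
	pv_idle_ops.poll = poll_guest;	/* e.g. a KVM guest registering */
	paravirt_idle_poll();
	return polls_taken;
}
```

In the kernel the indirection goes through PVOP_VCALL0 so the call site can later be patched to a direct call or nop sequence, but the data-structure shape is the same as this sketch.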
Re: [Xen-devel] [PATCH v6 10/11] x86, xen: support vcpu preempted check
On Fri, Oct 28, 2016 at 04:11:26AM -0400, Pan Xinhui wrote: > From: Juergen Gross> > Support the vcpu_is_preempted() functionality under Xen. This will > enhance lock performance on overcommitted hosts (more runnable vcpus > than physical cpus in the system) as doing busy waits for preempted > vcpus will hurt system performance far worse than early yielding. > > A quick test (4 vcpus on 1 physical cpu doing a parallel build job > with "make -j 8") reduced system time by about 5% with this patch. > > Signed-off-by: Juergen Gross > Signed-off-by: Pan Xinhui > --- > arch/x86/xen/spinlock.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c > index 3d6e006..74756bb 100644 > --- a/arch/x86/xen/spinlock.c > +++ b/arch/x86/xen/spinlock.c > @@ -114,7 +114,6 @@ void xen_uninit_lock_cpu(int cpu) > per_cpu(irq_name, cpu) = NULL; > } > > - Spurious change. > /* > * Our init of PV spinlocks is split in two init functions due to us > * using paravirt patching and jump labels patching and having to do > @@ -137,6 +136,8 @@ void __init xen_init_spinlocks(void) > pv_lock_ops.queued_spin_unlock = > PV_CALLEE_SAVE(__pv_queued_spin_unlock); > pv_lock_ops.wait = xen_qlock_wait; > pv_lock_ops.kick = xen_qlock_kick; > + > + pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen; > } > > /* > -- > 2.4.11 > > > ___ > Xen-devel mailing list > xen-de...@lists.xen.org > https://lists.xen.org/xen-devel ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
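The one-line hook wired up above pays off in the kernel's spin-wait loops: a waiter should keep spinning only while the lock owner's vCPU is actually running, because spinning on a preempted owner burns a whole timeslice for nothing. A toy user-space model of that shape (the per-CPU flag array stands in for the shared state that xen_vcpu_stolen()/steal-time reporting exposes; everything here is illustrative):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4
static bool cpu_preempted[NR_CPUS];

/* Toy stand-in for the pv_lock_ops.vcpu_is_preempted hook above. */
static bool vcpu_is_preempted(int cpu)
{
	return cpu_preempted[cpu];
}

/*
 * Shape of the osq_lock()/mutex_spin_on_owner() change in this series:
 * spin while the owner can make progress; if its vCPU was preempted,
 * spinning is futile, so bail out and let the scheduler run instead.
 */
static int spin_on_owner(int owner_cpu, int max_spins)
{
	int spins = 0;

	while (spins < max_spins) {
		if (vcpu_is_preempted(owner_cpu))
			return -1;	/* bail out and reschedule */
		spins++;
	}
	return spins;
}

static int demo_preempted_owner(void)
{
	cpu_preempted[2] = true;	/* owner's vCPU got preempted */
	return spin_on_owner(2, 10);
}
```

This is exactly why the cover letter's perf profile shows osq_lock and rwsem_spin_on_owner dominating before the series: without the check, waiters spin through the owner's entire stolen timeslice.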
Re: [Xen-devel] [PATCH v6 00/11] implement vcpu preempted check
On Fri, Oct 28, 2016 at 04:11:16AM -0400, Pan Xinhui wrote: > change from v5: > split x86/kvm patch into guest/host part. > introduce kvm_write_guest_offset_cached. > fix some typos. > rebase patch onto 4.9.2 > change from v4: > split x86 kvm vcpu preempted check into two patches. > add documentation patch. > add x86 vcpu preempted check patch under xen > add s390 vcpu preempted check patch > change from v3: > add x86 vcpu preempted check patch > change from v2: > no code change, fix typos, update some comments > change from v1: > a simpler definition of default vcpu_is_preempted > skip machine type check on ppc, and add config. remove dedicated macro. > add one patch to drop overload of rwsem_spin_on_owner and > mutex_spin_on_owner. > add more comments > thanks to Boqun's and Peter's suggestions. > > This patch set aims to fix lock holder preemption issues. Do you have a git tree with these patches? > > test-case: > perf record -a perf bench sched messaging -g 400 -p && perf report > > 18.09% sched-messaging [kernel.vmlinux] [k] osq_lock > 12.28% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner > 5.27% sched-messaging [kernel.vmlinux] [k] mutex_unlock > 3.89% sched-messaging [kernel.vmlinux] [k] wait_consider_task > 3.64% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq > 3.41% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner.is > 2.49% sched-messaging [kernel.vmlinux] [k] system_call > > We introduce the interface bool vcpu_is_preempted(int cpu) and use it in some spin > loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner. > These spin_on_owner variants also caused rcu stalls before we applied this patch > set > > We also observed some performance improvements in unix benchmark tests. 
> > PPC test result: > 1 copy - 0.94% > 2 copy - 7.17% > 4 copy - 11.9% > 8 copy - 3.04% > 16 copy - 15.11% > > details below: > Without patch: > > 1 copy - File Write 4096 bufsize 8000 maxblocks 2188223.0 KBps (30.0 s, > 1 samples) > 2 copy - File Write 4096 bufsize 8000 maxblocks 1804433.0 KBps (30.0 s, > 1 samples) > 4 copy - File Write 4096 bufsize 8000 maxblocks 1237257.0 KBps (30.0 s, > 1 samples) > 8 copy - File Write 4096 bufsize 8000 maxblocks 1032658.0 KBps (30.0 s, > 1 samples) > 16 copy - File Write 4096 bufsize 8000 maxblocks 768000.0 KBps (30.1 > s, 1 samples) > > With patch: > > 1 copy - File Write 4096 bufsize 8000 maxblocks 2209189.0 KBps (30.0 s, > 1 samples) > 2 copy - File Write 4096 bufsize 8000 maxblocks 1943816.0 KBps (30.0 s, > 1 samples) > 4 copy - File Write 4096 bufsize 8000 maxblocks 1405591.0 KBps (30.0 s, > 1 samples) > 8 copy - File Write 4096 bufsize 8000 maxblocks 1065080.0 KBps (30.0 s, > 1 samples) > 16 copy - File Write 4096 bufsize 8000 maxblocks 904762.0 KBps (30.0 > s, 1 samples) > > X86 test result: > test-case after-patch before-patch > Execl Throughput |18307.9 lps |11701.6 lps > File Copy 1024 bufsize 2000 maxblocks | 1352407.3 KBps | 790418.9 KBps > File Copy 256 bufsize 500 maxblocks| 367555.6 KBps | 222867.7 KBps > File Copy 4096 bufsize 8000 maxblocks | 3675649.7 KBps | 1780614.4 KBps > Pipe Throughput| 11872208.7 lps | 11855628.9 lps > Pipe-based Context Switching | 1495126.5 lps | 1490533.9 lps > Process Creation |29881.2 lps |28572.8 lps > Shell Scripts (1 concurrent) |23224.3 lpm |22607.4 lpm > Shell Scripts (8 concurrent) | 3531.4 lpm | 3211.9 lpm > System Call Overhead | 10385653.0 lps | 10419979.0 lps > > Christian Borntraeger (1): > s390/spinlock: Provide vcpu_is_preempted > > Juergen Gross (1): > x86, xen: support vcpu preempted check > > Pan Xinhui (9): > kernel/sched: introduce vcpu preempted check interface > locking/osq: Drop the overload of osq_lock() > kernel/locking: Drop the overload of 
{mutex,rwsem}_spin_on_owner > powerpc/spinlock: support vcpu preempted check > x86, paravirt: Add interface to support kvm/xen vcpu preempted check > KVM: Introduce kvm_write_guest_offset_cached > x86, kvm/x86.c: support vcpu preempted check > x86, kernel/kvm.c: support vcpu preempted check > Documentation: virtual: kvm: Support vcpu preempted check > > Documentation/virtual/kvm/msr.txt | 9 - > arch/powerpc/include/asm/spinlock.h | 8 > arch/s390/include/asm/spinlock.h | 8 > arch/s390/kernel/smp.c| 9 +++-- > arch/s390/lib/spinlock.c | 25 - > arch/x86/include/asm/paravirt_types.h | 2 ++ > arch/x86/include/asm/spinlock.h | 8 > arch/x86/include/uapi/asm/kvm_para.h | 4 +++- >
Re: [PATCH 0/9] qspinlock stuff -v15
On Thu, Mar 26, 2015 at 09:21:53PM +0100, Peter Zijlstra wrote: On Wed, Mar 25, 2015 at 03:47:39PM -0400, Konrad Rzeszutek Wilk wrote: Ah nice. That could be spun out as a separate patch to optimize the existing ticket locks I presume. Yes I suppose we can do something similar for the ticket and patch in the right increment. We'd need to restructure the code a bit, but it's not fundamentally impossible. We could equally apply the head hashing to the current ticket implementation and avoid the current bitmap iteration. Now with the old pv ticketlock code a vCPU would only go to sleep once and be woken up when it was its turn. With this new code it is woken up twice (and twice it goes to sleep). With an overcommit scenario this would imply that we will have at least twice as many VMEXITs as with the previous code. An astute observation, I had not considered that. Thank you. I presume when you did benchmarking this did not even register? Though I wonder if it would if you ran the benchmark for a week or so. You presume I benchmarked :-) I managed to boot something virt and run hackbench in it. I wouldn't know a representative virt setup if I ran into it. The thing is, we want this qspinlock for real hardware because it's faster and I really want to avoid having to carry two spinlock implementations -- although I suppose that if we really really have to we could. In some way you already have that - for virtualized environments where you don't have a PV mechanism you just use the byte spinlock - which is good. And switching to a PV ticketlock implementation after boot.. ugh. I feel your pain. What if you used a PV bytelock implementation? The code you posted already 'sprays' all the vCPUs to wake up. And that is exactly what you need for PV bytelocks - well, you only need to wake up the vCPUs that have gone to sleep waiting on a specific 'struct spinlock' and just stash those in a per-cpu area. The old Xen spinlock code (before 3.11?) had this. Just an idea, though. 
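The "byte spinlock" this exchange keeps returning to is just a test-and-set lock on a single byte: unfair (no FIFO order like a ticket or queue lock), but trivially correct and cheap, which is why it is an acceptable fallback for virtualized environments without a PV wake/kick mechanism. A minimal sketch with C11 atomics (names are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_uchar locked; } byte_spinlock_t;

/* Acquire succeeds iff we flipped the byte from 0 to 1. */
static bool byte_spin_trylock(byte_spinlock_t *lock)
{
	return atomic_exchange(&lock->locked, 1) == 0;
}

static void byte_spin_lock(byte_spinlock_t *lock)
{
	while (!byte_spin_trylock(lock))
		;	/* cpu_relax(); a PV variant would halt here */
}

static void byte_spin_unlock(byte_spinlock_t *lock)
{
	atomic_store(&lock->locked, 0);
}

/* Single-threaded demo: held lock rejects trylock; released lock takes it. */
static int demo_bytelock(void)
{
	byte_spinlock_t lock = { 0 };

	byte_spin_lock(&lock);
	if (byte_spin_trylock(&lock))	/* must fail: already held */
		return 1;
	byte_spin_unlock(&lock);
	return byte_spin_trylock(&lock) ? 0 : 2;
}
```

Konrad's "PV bytelock" idea amounts to replacing the empty spin body with a hypercall that puts the vCPU to sleep, plus bookkeeping so unlock can kick only the sleepers waiting on this particular lock.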
Re: [PATCH 0/9] qspinlock stuff -v15
On Mon, Mar 16, 2015 at 02:16:13PM +0100, Peter Zijlstra wrote: Hi Waiman, As promised; here is the paravirt stuff I did during the trip to BOS last week. All the !paravirt patches are more or less the same as before (the only real change is the copyright lines in the first patch). The paravirt stuff is 'simple' and KVM only -- the Xen code was a little more convoluted and I've no real way to test that but it should be straightforward to make work. I ran this using the virtme tool (thanks Andy) on my laptop with a 4x overcommit on vcpus (16 vcpus as compared to the 4 my laptop actually has) and it both booted and survived a hackbench run (perf bench sched messaging -g 20 -l 5000). So while the paravirt code isn't the most optimal code ever conceived it does work. Also, the paravirt patching includes replacing the call with movb $0, %arg1 for the native case, which should greatly reduce the cost of having CONFIG_PARAVIRT_SPINLOCKS enabled on actual hardware. Ah nice. That could be spun out as a separate patch to optimize the existing ticket locks I presume. Now with the old pv ticketlock code a vCPU would only go to sleep once and be woken up when it was its turn. With this new code it is woken up twice (and twice it goes to sleep). With an overcommit scenario this would imply that we will have at least twice as many VMEXITs as with the previous code. I presume when you did benchmarking this did not even register? Though I wonder if it would if you ran the benchmark for a week or so. I feel that if someone were to do a Xen patch we can go ahead and merge this stuff (finally!). These patches do not implement the paravirt spinlock debug stats currently implemented (separately) by KVM and Xen, but that should not be too hard to do on top and in the 'generic' code -- no reason to duplicate all that. Of course; once this lands people can look at improving the paravirt nonsense. 
Re: [Xen-devel] kasan_map_early_shadow() on Xen
On Wed, Mar 04, 2015 at 05:47:03PM -0800, Luis R. Rodriguez wrote: On Wed, Mar 4, 2015 at 6:36 AM, Andrey Ryabinin a.ryabi...@samsung.com wrote: On 03/03/2015 07:02 PM, Konrad Rzeszutek Wilk wrote: If it is like that - then just using what had to be implemented for the stack protection as a template ought to pave most of the work? Probably. I think I could make this work. However, I won't be able to work on this until the end of the next week. While a solution is likely possible here I'd like for us to notice how two features have gone now under our nose for Xen on v4.0 which have issues. Proactive maintainer reviews would detect these issues but we can't bet on these, and testing is just as reactive and slow. I'm Hmm, I see you saying 'Proactive maintainer review would detect' these but that assumes that the maintainer would even be copied on these patches? None of the Xen maintainers were copied on this. And while we all strive to be pro-active I have to remind you maintainers are usually swamped. hinting we keep our eyes out for an architectural solution that would avoid these issues. Don't ask me what that is just yet... Keep in mind that a lot of these issues get resolved during merge window thanks to the resiliancy of Boris monitoring these tests with an sharp eye! After all that is what merge windows are - things break during them. Luis ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] kasan_map_early_shadow() on Xen
On Tue, Mar 03, 2015 at 06:38:20PM +0300, Andrey Ryabinin wrote: On 03/03/2015 05:16 PM, Konrad Rzeszutek Wilk wrote: On Tue, Mar 03, 2015 at 04:15:06PM +0300, Andrey Ryabinin wrote: On 03/03/2015 12:40 PM, Luis R. Rodriguez wrote: Andrey, I believe that on Xen we should disable kasan, would like confirmation I guess Xen guests won't work with kasan because Xen guests don't set up shadow (kasan_map_early_shadow() is not called in xen guests). Disabling kasan for Xen in Kconfig is undesirable because that will disable kasan for allmodconfig and allyesconfig builds, but I don't see another option for now. Was there a bug reported for this? It would be good to CC the maintainers of Xen on that sort of thing. There was no report for this. I just looked at the Xen code because of Luis's mail and I don't see how it could work with kasan. Ahh. Fixing this might not be trivial. We need to set up shadow memory before any kasan-instrumented function runs. This is like with -fstack-protector (you need to set up the gdt before calling any stack-protected function). If it is like that - then just using what had to be implemented for the stack protection as a template ought to pave most of the work? 
Re: [Xen-devel] kasan_map_early_shadow() on Xen
On Tue, Mar 03, 2015 at 04:15:06PM +0300, Andrey Ryabinin wrote: On 03/03/2015 12:40 PM, Luis R. Rodriguez wrote: Andrey, I believe that on Xen we should disable kasan, would like confirmation I guess Xen guests won't work with kasan because Xen guests don't set up shadow (kasan_map_early_shadow() is not called in xen guests). Disabling kasan for Xen in Kconfig is undesirable because that will disable kasan for allmodconfig and allyesconfig builds, but I don't see another option for now. Was there a bug reported for this? It would be good to CC the maintainers of Xen on that sort of thing. Thanks! .. snip.. [0] https://lkml.org/lkml/2015/2/23/328 If you look at the other x86 pulls - you will see that the authors also missed enabling it on 32-bit kernels! Luis 
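For context on what "set up shadow" means in this thread: KASAN maps every 8-byte granule of kernel address space to one shadow byte at a fixed offset, and kasan_map_early_shadow() initially makes that whole shadow range point at a zero page so early instrumented code can load shadow bytes without faulting — which is exactly the step Xen guests were skipping. A sketch of the address arithmetic (the offset shown is, to the best of my knowledge, the x86_64 value of the era, but treat it as illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* KASAN maps each 8-byte granule to one shadow byte:
 *   shadow = (addr >> 3) + KASAN_SHADOW_OFFSET
 */
#define KASAN_SHADOW_SCALE_SHIFT 3
#define KASAN_SHADOW_OFFSET 0xdffffc0000000000ULL	/* assumed x86_64 value */

static uint64_t kasan_mem_to_shadow(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}
```

Because every instrumented load/store consults this mapping, the shadow for at least the early kernel text and data must be backed by real (even if shared, read-only zero) pages before the first instrumented function runs — the same bootstrap-ordering problem as -fstack-protector and the GDT, as the thread notes.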
Re: [PATCH v13 10/11] pvqspinlock, x86: Enable PV qspinlock for KVM
On Wed, Oct 29, 2014 at 04:19:10PM -0400, Waiman Long wrote: This patch adds the necessary KVM specific code to allow KVM to support the CPU halting and kicking operations needed by the queue spinlock PV code. Two KVM guests of 20 CPU cores (2 nodes) were created for performance testing in one of the following three configurations: 1) Only 1 VM is active 2) Both VMs are active and they share the same 20 physical CPUs (200% overcommit) The tests run included the disk workload of the AIM7 benchmark on both ext4 and xfs RAM disks at 3000 users on a 3.17-based kernel. The ebizzy -m test and futextest were also run and their performance data were recorded. With two VMs running, the idle=poll kernel option was added to simulate a busy guest. If PV qspinlock is not enabled, the unfairlock will be used automatically in a guest. What is the unfairlock? Isn't it just using a bytelock at this point? 
Re: [Xen-devel] [PATCH v4 04/10] x86: paravirt: Wrap initialization of set_iopl_mask in a macro
On Sun, Nov 02, 2014 at 09:32:20AM -0800, Josh Triplett wrote: This will allow making set_iopl_mask optional later. Signed-off-by: Josh Triplett j...@joshtriplett.org Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com --- arch/x86/include/asm/paravirt_types.h | 1 + arch/x86/kernel/paravirt.c| 2 +- arch/x86/xen/enlighten.c | 4 ++-- 3 files changed, 4 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 7549b8b..3caf2a8 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -143,6 +143,7 @@ struct pv_cpu_ops { void (*load_sp0)(struct tss_struct *tss, struct thread_struct *t); void (*set_iopl_mask)(unsigned mask); +#define INIT_SET_IOPL_MASK(f) .set_iopl_mask = f, void (*wbinvd)(void); void (*io_delay)(void); diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 548d25f..e7969d4 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -383,7 +383,7 @@ __visible struct pv_cpu_ops pv_cpu_ops = { .iret = native_iret, .swapgs = native_swapgs, - .set_iopl_mask = native_set_iopl_mask, + INIT_SET_IOPL_MASK(native_set_iopl_mask) .io_delay = native_io_delay, .start_context_switch = paravirt_nop, diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c index 1a3f044..8ad0778 100644 --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c @@ -912,7 +912,7 @@ static void xen_load_sp0(struct tss_struct *tss, xen_mc_issue(PARAVIRT_LAZY_CPU); } -static void xen_set_iopl_mask(unsigned mask) +static void __maybe_unused xen_set_iopl_mask(unsigned mask) { struct physdev_set_iopl set_iopl; @@ -1279,7 +1279,7 @@ static const struct pv_cpu_ops xen_cpu_ops __initconst = { .write_idt_entry = xen_write_idt_entry, .load_sp0 = xen_load_sp0, - .set_iopl_mask = xen_set_iopl_mask, + INIT_SET_IOPL_MASK(xen_set_iopl_mask) .io_delay = xen_io_delay, /* Xen takes care of %gs when switching to usermode for us */ -- 2.1.1 ___ 
Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support
On Tue, Nov 25, 2014 at 07:33:58PM -0500, Waiman Long wrote: On 10/27/2014 02:02 PM, Konrad Rzeszutek Wilk wrote: On Mon, Oct 27, 2014 at 01:38:20PM -0400, Waiman Long wrote: My concern is that spin_unlock() can be called in many places, including loadable kernel modules. Can the paravirt_patch_ident_32() function able to patch all of them in reasonable time? How about a kernel module loaded later at run time? It has too. When the modules are loaded the .paravirt symbols are exposed and the module loader patches that. And during bootup time (before modules are loaded) it also patches everything - when it only runs on one CPU. I have been changing the patching code to patch the unlock call sites and it seems to be working now. However, when I manually inserted a kernel module using insmod and run the code in the newly inserted module, I got memory access violation as follows: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [ (null)] (null) PGD 18d62f3067 PUD 18d476f067 PMD 0 Oops: 0010 [#1] SMP Modules linked in: locktest(OE) ebtable_nat ebtables xt_CHECKSUM iptable_mangle bridge autofs4 8021q garp stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan vhost tun uinput ppdev parport_pc parport sg microcode pcspkr virtio_balloon snd_hda_codec_generic virtio_console snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore virtio_net i2c_piix4 i2c_core ext4(E) jbd2(E) mbcache(E) floppy(E) virtio_blk(E) sr_mod(E) cdrom(E) virtio_pci(E) virtio_ring(E) virtio(E) pata_acpi(E) ata_generic(E) ata_piix(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) [last unloaded: speedstep_lib] CPU: 1 PID: 3907 Comm: run-locktest Tainted: GW OE 3.17.0-pvqlock #3 Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011 task: 8818cc5baf90 ti: 8818b7094000 task.ti: 
8818b7094000 RIP: 0010:[] [ (null)] (null) RSP: 0018:8818b7097db0 EFLAGS: 00010246 RAX: RBX: 004c4b40 RCX: RDX: 0001 RSI: RDI: 8818d3f052c0 RBP: 8818b7097dd8 R08: 80522014 R09: R10: 1000 R11: 0001 R12: 0001 R13: R14: 0001 R15: 8818b7097ea0 FS: 7fb828ece700() GS:88193ec2() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: CR3: 0018cc7e9000 CR4: 06e0 Stack: a06ff395 8818d465e000 8164bec0 0001 0050 8818b7097e18 a06ff785 8818b7097e38 0246 54755e3a 39f8ba72 8818c174f000 Call Trace: [a06ff395] ? test_spinlock+0x65/0x90 [locktest] [a06ff785] etime_show+0xd5/0x120 [locktest] [812a2dc6] kobj_attr_show+0x16/0x20 [8121a7fa] sysfs_kf_seq_show+0xca/0x1b0 [81218a13] kernfs_seq_show+0x23/0x30 [811c82db] seq_read+0xbb/0x400 [812197e5] kernfs_fop_read+0x35/0x40 [811a4223] vfs_read+0xa3/0x110 [811a47e6] SyS_read+0x56/0xd0 [810f3e16] ? __audit_syscall_exit+0x216/0x2c0 [815b3ca9] system_call_fastpath+0x16/0x1b Code: Bad RIP value. RSP 8818b7097db0 CR2: ---[ end trace 69d0e259c9ec632f ]--- It seems like call site patching isn't properly done or the kernel module that I built was missing some critical information necessary for the proper Did the readelf give you the paravirt note section? linking. Anyway, I will include the unlock call patching code as a separate patch as it seems there may be problem under certain circumstance. one way to troubleshoot those is to enable the paravirt patching code to actually print where it is patching the code. That way when you load the module you can confirm it has done its job. Then you can verify that the address where the code is called: a06ff395 is indeed patched. You might as well also do a hexdump in the module loading to confim that the patching had been done correctly. BTW, the kernel panic problem that your team reported had been fixed. The fix will be in the next version of the patch. -Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
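The debugging advice in this exchange (make the patcher print where it patches, then hexdump the call site) boils down to verifying that the 5-byte `call rel32` at each pv call site was actually rewritten. A toy user-space model of such a rewrite, useful only to illustrate what the before/after hexdump should show — the opcodes are real x86 (0xe8 = call rel32, 0x90 = nop), but the function and the patch policy are invented for this sketch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Toy model of patching one pv call site: overwrite the 5-byte
 * "call rel32" (opcode 0xe8) with five 1-byte NOPs, as one might
 * for a pv op that is a no-op on bare metal.  Returns -1 if the
 * site does not start with a call, i.e. was already patched or
 * the site table pointed somewhere wrong.
 */
static int patch_call_site(uint8_t *site)
{
	static const uint8_t nops[5] = { 0x90, 0x90, 0x90, 0x90, 0x90 };

	if (site[0] != 0xe8)		/* not a call: nothing to patch */
		return -1;
	memcpy(site, nops, sizeof(nops));
	return 0;
}

static int demo_patch(void)
{
	uint8_t text[5]     = { 0xe8, 0x10, 0x20, 0x30, 0x40 };
	uint8_t not_call[5] = { 0x90, 0x90, 0x90, 0x90, 0x90 };

	if (patch_call_site(not_call) != -1)	/* must refuse non-call */
		return 0;
	if (patch_call_site(text) != 0)		/* must patch the call  */
		return 0;
	return text[0] == 0x90 && text[4] == 0x90;
}
```

The NULL-RIP oops quoted above is what an *unpatched or mispatched* site looks like at runtime: the CPU ends up calling through a slot that was never filled in, which is why comparing the hexdump of the module's call sites before and after load pinpoints the failure.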
Re: [PATCH v13 09/11] pvqspinlock, x86: Add para-virtualization support
On Wed, Oct 29, 2014 at 04:19:09PM -0400, Waiman Long wrote: This patch adds para-virtualization support to the queue spinlock code base with minimal impact to the native case. There are some minor code changes in the generic qspinlock.c file which should be usable in other architectures. The other code changes are specific to x86 processors and so are all put under the arch/x86 directory. On the lock side, the slowpath code is split into 2 separate functions generated from the same code - one for bare metal and one for PV guest. The switching is done in the _raw_spin_lock* functions. This makes sure that the performance impact to the bare metal case is minimal, just a few NOPs in the _raw_spin_lock* functions. In the PV slowpath code, there are 2 paravirt callee saved calls that minimize register pressure. On the unlock side, however, the disabling of unlock function inlining does have some slight impact on bare metal performance. The actual paravirt code comes in 5 parts; - init_node; this initializes the extra data members required for PV state. PV state data is kept 1 cacheline ahead of the regular data. - link_and_wait_node; this replaces the regular MCS queuing code. CPU halting can happen if the wait is too long. - wait_head; this waits until the lock is avialable and the CPU will be halted if the wait is too long. - wait_check; this is called after acquiring the lock to see if the next queue head CPU is halted. If this is the case, the lock bit is changed to indicate the queue head will have to be kicked on unlock. - queue_unlock; this routine has a jump label to check if paravirt is enabled. If yes, it has to do an atomic cmpxchg to clear the lock bit or call the slowpath function to kick the queue head cpu. Tracking the head is done in two parts, firstly the pv_wait_head will store its cpu number in whichever node is pointed to by the tail part of the lock word. 
Secondly, pv_link_and_wait_node() will propagate the existing head from the old to the new tail node. Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/include/asm/paravirt.h | 19 ++ arch/x86/include/asm/paravirt_types.h | 20 ++ arch/x86/include/asm/pvqspinlock.h| 411 + arch/x86/include/asm/qspinlock.h | 71 ++- arch/x86/kernel/paravirt-spinlocks.c |6 + include/asm-generic/qspinlock.h |2 + kernel/locking/qspinlock.c| 69 +- 7 files changed, 591 insertions(+), 7 deletions(-) create mode 100644 arch/x86/include/asm/pvqspinlock.h diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index cd6e161..7e296e6 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -712,6 +712,24 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, #if defined(CONFIG_SMP) defined(CONFIG_PARAVIRT_SPINLOCKS) +#ifdef CONFIG_QUEUE_SPINLOCK + +static __always_inline void pv_kick_cpu(int cpu) +{ + PVOP_VCALLEE1(pv_lock_ops.kick_cpu, cpu); +} + +static __always_inline void pv_lockwait(u8 *lockbyte) +{ + PVOP_VCALLEE1(pv_lock_ops.lockwait, lockbyte); +} + +static __always_inline void pv_lockstat(enum pv_lock_stats type) +{ + PVOP_VCALLEE1(pv_lock_ops.lockstat, type); +} + +#else static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock, __ticket_t ticket) { @@ -723,6 +741,7 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock, { PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket); } +#endif #endif diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 7549b8b..49e4b76 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -326,6 +326,9 @@ struct pv_mmu_ops { phys_addr_t phys, pgprot_t flags); }; +struct mcs_spinlock; +struct qspinlock; + struct arch_spinlock; #ifdef CONFIG_SMP #include asm/spinlock_types.h @@ -333,9 +336,26 @@ struct arch_spinlock; typedef u16 __ticket_t; #endif +#ifdef 
CONFIG_QUEUE_SPINLOCK +enum pv_lock_stats { + PV_HALT_QHEAD, /* Queue head halting */ + PV_HALT_QNODE, /* Other queue node halting */ + PV_HALT_ABORT, /* Halting aborted */ + PV_WAKE_KICKED, /* Wakeup by kicking*/ + PV_WAKE_SPURIOUS, /* Spurious wakeup */ + PV_KICK_NOHALT /* Kick but CPU not halted */ +}; +#endif + struct pv_lock_ops { +#ifdef CONFIG_QUEUE_SPINLOCK + struct paravirt_callee_save kick_cpu; + struct paravirt_callee_save lockstat; + struct paravirt_callee_save lockwait; +#else struct paravirt_callee_save lock_spinning; void (*unlock_kick)(struct
Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support
On Mon, Oct 27, 2014 at 01:38:20PM -0400, Waiman Long wrote: On 10/24/2014 04:54 AM, Peter Zijlstra wrote: On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote: Since enabling paravirt spinlock will disable unlock function inlining, a jump label can be added to the unlock function without adding patch sites all over the kernel. But you don't have to. My patches allowed for the inline to remain, again reducing the overhead of enabling PV spinlocks while running on a real machine. Look at: http://lkml.kernel.org/r/20140615130154.213923...@chello.nl In particular this hunk: Index: linux-2.6/arch/x86/kernel/paravirt_patch_64.c === --- linux-2.6.orig/arch/x86/kernel/paravirt_patch_64.c +++ linux-2.6/arch/x86/kernel/paravirt_patch_64.c @@ -22,6 +22,10 @@ DEF_NATIVE(pv_cpu_ops, swapgs, swapgs) DEF_NATIVE(, mov32, mov %edi, %eax); DEF_NATIVE(, mov64, mov %rdi, %rax); +#if defined(CONFIG_PARAVIRT_SPINLOCKS) defined(CONFIG_QUEUE_SPINLOCK) +DEF_NATIVE(pv_lock_ops, queue_unlock, movb $0, (%rdi)); +#endif + unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len) { return paravirt_patch_insns(insnbuf, len, @@ -61,6 +65,9 @@ unsigned native_patch(u8 type, u16 clobb PATCH_SITE(pv_cpu_ops, clts); PATCH_SITE(pv_mmu_ops, flush_tlb_single); PATCH_SITE(pv_cpu_ops, wbinvd); +#if defined(CONFIG_PARAVIRT_SPINLOCKS) defined(CONFIG_QUEUE_SPINLOCK) + PATCH_SITE(pv_lock_ops, queue_unlock); +#endif patch_site: ret = paravirt_patch_insns(ibuf, len, start, end); That makes sure to overwrite the callee-saved call to the pv_lock_ops::queue_unlock with the immediate asm movb $0, (%rdi). Therefore you can retain the inlined unlock with hardly (there might be some NOP padding) any overhead at all. On PV it reverts to a callee saved function call. My concern is that spin_unlock() can be called in many places, including loadable kernel modules. Can the paravirt_patch_ident_32() function able to patch all of them in reasonable time? How about a kernel module loaded later at run time? 
It has to. When the modules are loaded, the .paravirt symbols are exposed and the module loader patches that. And during bootup time (before modules are loaded) it also patches everything - when it only runs on one CPU.

So I think we may still need to disable unlock function inlining even if we use your way of kernel call-site patching.

No need. Inline should (and is) working just fine.

Regards, Longman
Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
On Tue, Sep 30, 2014 at 11:01:29AM -0700, Andy Lutomirski wrote: On Tue, Sep 30, 2014 at 10:53 AM, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote:

x86 will be worse than PPC, too: the special case needed to support QEMU 2.2 with IOMMU and virtio enabled with a Xen guest will be fairly large and disgusting and will only exist to support something that IMO should never have existed in the first place.

*scratches his head* I don't follow.

If you boot a Xen PV dom0 on QEMU master with -machine q35,iommu=on and you add a virtio device, dom0 will end up with a PCI device that does DMA to machine addresses. These addresses are not compatible with the DMA API (which works with bus addresses), nor are they the same as physical addresses.

That is presumably because the IOMMU assumes the virtio devices are real devices, not fake ones.

So virtio in current kernels won't work for the same reason it never works on Xen. But virtio-pci with my patches won't work either, because they (or the Xen hypervisor) will try to program the IOMMU with a non-identity mapping, causing everything to explode.

Hacking up the virtio-pci driver to explicitly ask Xen for machine addresses might work, but, at the very least, it will be a giant security hole if anyone binds a virtio device to a domain other than dom0 (which, again, is kind of the point of having an IOMMU).

PPC at least avoids *that* problem by virtue of not having Xen paravirt. (And please don't add Xen paravirt to PPC -- x86 is trying to kill it off, but this is a 5-10 year project.)

Correction: - The Xen project is trying to kill some of the paravirts off. - KVM uses paravirts as well (and then added some)

By paravirt I meant PV, where there's the weird physical/machine address discrepancy that's visible to the guest. This is not to say that Xen PVH wouldn't also be screwed running on QEMU master.
--Andy
Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
x86 will be worse than PPC, too: the special case needed to support QEMU 2.2 with IOMMU and virtio enabled with a Xen guest will be fairly large and disgusting and will only exist to support something that IMO should never have existed in the first place. scratches his head I don't follow. PPC at least avoids *that* problem by virtue of not having Xen paravirt. (And please don't add Xen paravirt to PPC -- x86 is trying to kill it off, but this is a 5-10 year project.) Correction: - The Xen project is trying to kill some of the paravirts off. - KVM uses paravirts as well (and then added some) [..., reordered] Except that I think that PPC is the only platform on which QEMU's code actually bypasses any IOMMU. Unless we've all missed something, there is no QEMU release that will put a virtio device behind an IOMMU on any platform other than PPC. I think that is true but it seems that this will be true for x86 for QEMU 2.2 unless we make some changes there. Which we might not have the time for since 2.2 is feature frozen from tomorrow. Maybe we should disable the IOMMU in 2.2, this is worth considering. Please do. Also, try booting this 2.2 QEMU candidate with nested virtualization on. Then bind vfio to a virtio-pci device and watch the guest get corrupted. QEMU will blame Linux for incorrectly programming the Hehe. hardware, and Linux will blame QEMU for its blatant violation of the ACPI spec. Given that this is presumably most of the point of adding IOMMU support, it seems like a terrible idea to let code like that into the wild. If this happens, Linux may also end up needing a quirk to prevent vfio from binding to QEMU 2.2's virtio-pci devices. --Andy ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 0/3] virtio: Use the DMA API when appropriate
On Tue, Sep 16, 2014 at 10:22:25PM -0700, Andy Lutomirski wrote: This fixes virtio on Xen guests as well as on any other platform that uses virtio_pci on which physical addresses don't match bus addresses. I can do 'Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com' but not sure through whom this patch should go through? This can be tested with: virtme-run --xen xen --kimg arch/x86/boot/bzImage --console using virtme from here: https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git Without these patches, the guest hangs forever. With these patches, everything works. This should be safe on all platforms that I'm aware of. That doesn't mean that there isn't anything that I missed. Applies to net-next. Changes from v4: - Rebased onto net-next. - Dropped the sg end mark changes from virtio_net, as that issue is solved differently in net-next. - virtio_pci does not use the DMA API on powerpc, for reasons that are explained in a detailed comment in virtio_ring. Changes from v3: - virtio_pci only asks virtio_ring to use the DMA API if !PCI_DMA_BUS_IS_PHYS. - Reduce tools/virtio breakage. It's now merely as broken as before instead of being even more broken. - Drop the sg_next changes -- Rusty's version is better. Changes from v2: - Reordered patches. - Fixed a virtio_net OOPS. Changes from v1: - Using the DMA API is optional now. It would be nice to improve the DMA API to the point that it could be used unconditionally, but s390 proves that we're not there yet. - Includes patch 4, which fixes DMA debugging warnings from virtio_net. 
Andy Lutomirski (3): virtio_ring: Support DMA APIs if requested virtio_pci: Use the DMA API for virtqueues when possible virtio_net: Stop doing DMA from the stack drivers/lguest/lguest_device.c | 3 +- drivers/misc/mic/card/mic_virtio.c | 2 +- drivers/net/virtio_net.c | 53 ++--- drivers/remoteproc/remoteproc_virtio.c | 4 +- drivers/s390/kvm/kvm_virtio.c | 2 +- drivers/s390/kvm/virtio_ccw.c | 4 +- drivers/virtio/virtio_mmio.c | 5 +- drivers/virtio/virtio_pci.c| 91 ++-- drivers/virtio/virtio_ring.c | 194 +++-- include/linux/virtio_ring.h| 1 + tools/virtio/linux/dma-mapping.h | 17 +++ tools/virtio/linux/virtio.h| 1 + tools/virtio/virtio_test.c | 2 +- tools/virtio/vringh_test.c | 3 +- 14 files changed, 314 insertions(+), 68 deletions(-) create mode 100644 tools/virtio/linux/dma-mapping.h -- 1.9.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 0/4] virtio: Clean up scatterlists and use the DMA API
On Wed, Sep 03, 2014 at 06:53:33AM +1000, Benjamin Herrenschmidt wrote: On Mon, 2014-09-01 at 22:55 -0700, Andy Lutomirski wrote:

On x86, at least, I doubt that we'll ever see a physically addressed PCI virtio device for which ACPI advertises an IOMMU, since any sane hypervisor will just not advertise an IOMMU for the virtio device. But are there arm64 or PPC guests that use virtio_pci, that have IOMMUs, and that will malfunction if the virtio_pci driver ends up using the IOMMU? I certainly hope not, since these systems might be very hard-pressed to work right if someone plugged in a physical virtio-speaking PCI device.

It will definitely not work on ppc64. We always have IOMMUs on pseries, all PCI busses do, and because it's a paravirtualized environment, mapping/unmapping pages means hypercalls - expensive. But our virtio implementation bypasses it in qemu, so if virtio-pci starts using the DMA mapping API without changing the DMA ops under the hood, it will break for us.

What is the default dma_ops that the Linux guests start with as guests under ppc64? Thanks!

Cheers, Ben.
Re: virtio DMA API?
On Thu, Aug 28, 2014 at 07:31:16AM +1000, Benjamin Herrenschmidt wrote: On Wed, 2014-08-27 at 20:40 +0930, Rusty Russell wrote: Hi Andy, This has long been a source of contention. virtio assumes that the hypervisor can decode guest-physical addresses. PowerPC, in particular, doesn't want to pay the cost of IOMMU manipulations, and all arguments presented so far for using an IOMMU for a virtio device are weak. And changing to use DMA APIs would break them anyway. Of course, it's Just A Matter of Code, so it's possible to create a Xen-specific variant which uses the DMA APIs. I'm not sure what that would look like in the virtio standard, however. So this has popped up in the past a few times already from people who want to use virtio as a transport between physical systems connected via a bus like PCI using non-transparent bridges for example. There's a way to get both here that isn't too nasty... we can make the virtio drivers use the dma_map_* APIs and just switch the dma_ops in the struct device based on the hypervisor requirements. IE. For KVM we could attach a set of ops that basically just return the physical address, real PCI transport would use the normal callbacks etc... Right. The only problem at the moment is that the dma_map_ops, while defined generically, aren't plumbed into the generic struct device but instead on some architectures dev_archdata. This includes powerpc, ARM and x86 (under a CONFIG option for the latter which is only enabled on x86_64 and some oddball i386 variant). I am not following the interaction between 'struct device', 'struct dev_archdata' and 'struct dma_map_ops' ? The 'struct dma_ops' should be able to exist without having to exist in the other structures? Naturally the implementation of 'struct dma_ops' has to use 'struct device' otherwise it can't get the details such as dma_mapping. 
So either we switch to have all architectures we care about always use the generic DMA ops and move the pointer to struct device, or we create another inline indirection to deal with the cases without the dma_map_ops...

Or you implement a passthrough 'dma_map_ops' as you suggested? Though I feel I am not grokking something from your email.

Hmm, time to get some more coffee.

Cheers, Ben.
Re: [PATCH 3/3] virtio_pci: Use the DMA API for virtqueues
On Tue, Aug 26, 2014 at 02:17:02PM -0700, Andy Lutomirski wrote: A virtqueue is a coherent DMA mapping. Use the DMA API for it. This fixes virtio_pci on Xen. Signed-off-by: Andy Lutomirski l...@amacapital.net --- drivers/virtio/virtio_pci.c | 25 ++--- 1 file changed, 18 insertions(+), 7 deletions(-) diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c index 3d1463c6b120..19039c5bec24 100644 --- a/drivers/virtio/virtio_pci.c +++ b/drivers/virtio/virtio_pci.c @@ -80,8 +80,9 @@ struct virtio_pci_vq_info /* the number of entries in the queue */ int num; - /* the virtual address of the ring queue */ - void *queue; + /* the ring queue */ + void *queue;/* virtual address */ + dma_addr_t queue_dma_addr; /* bus address */ /* the list node for the virtqueues list */ struct list_head node; @@ -417,15 +418,16 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index, info-num = num; info-msix_vector = msix_vec; - size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN)); - info-queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO); + size = vring_size(num, VIRTIO_PCI_VRING_ALIGN); + info-queue = dma_zalloc_coherent(vdev-dev.parent, size, + info-queue_dma_addr, GFP_KERNEL); if (info-queue == NULL) { err = -ENOMEM; goto out_info; } /* activate the queue */ - iowrite32(virt_to_phys(info-queue) VIRTIO_PCI_QUEUE_ADDR_SHIFT, + iowrite32(info-queue_dma_addr VIRTIO_PCI_QUEUE_ADDR_SHIFT, vp_dev-ioaddr + VIRTIO_PCI_QUEUE_PFN); /* create the vring */ @@ -462,7 +464,8 @@ out_assign: vring_del_virtqueue(vq); out_activate_queue: iowrite32(0, vp_dev-ioaddr + VIRTIO_PCI_QUEUE_PFN); - free_pages_exact(info-queue, size); + dma_free_coherent(vdev-dev.parent, size, + info-queue, info-queue_dma_addr); out_info: kfree(info); return ERR_PTR(err); @@ -493,7 +496,8 @@ static void vp_del_vq(struct virtqueue *vq) iowrite32(0, vp_dev-ioaddr + VIRTIO_PCI_QUEUE_PFN); size = PAGE_ALIGN(vring_size(info-num, VIRTIO_PCI_VRING_ALIGN)); - free_pages_exact(info-queue, 
size); + dma_free_coherent(vq-vdev-dev.parent, size, + info-queue, info-queue_dma_addr); kfree(info); } @@ -712,6 +716,13 @@ static int virtio_pci_probe(struct pci_dev *pci_dev, if (err) goto out; + /* + * We support 64-bit DMA. If this fails (e.g. some bridge + * or PV code doesn't or DAC is disabled), then we're okay + * with 32-bit DMA. scratches his head I am having a hard time parsing that. Could you expand a bit the faulting use-case please? + */ + dma_set_mask_and_coherent(pci_dev-dev, DMA_BIT_MASK(64)); The usual process is: ret = dma_set_mask_and_coherent(..) if (ret) ret = dma_set_mask_and_coherent(.., DMA_BIT_MASK(32)) if (ret) pr_warn(We are truly screwed. Good luck!\n); + err = pci_request_regions(pci_dev, virtio-pci); if (err) goto out_enable_device; -- 1.9.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/3] virtio_ring: Use DMA APIs
On Wed, Aug 27, 2014 at 09:29:36AM +0200, Christian Borntraeger wrote: On 26/08/14 23:17, Andy Lutomirski wrote:

virtio_ring currently sends the device (usually a hypervisor) physical addresses of its I/O buffers. This is okay when DMA addresses and physical addresses are the same thing, but this isn't always the case. For example, this never works on Xen guests, and it is likely to fail if a physical virtio device ever ends up behind an IOMMU or swiotlb.

The immediate use case for me is to enable virtio on Xen guests. For that to work, we need this fix as well as a corresponding fix to virtio_pci or to another driver.

With this patch, virtfs survives kmemleak and CONFIG_DMA_API_DEBUG. virtio-net warns (correctly) about DMA from the stack in virtnet_set_rx_mode.

This breaks s390's defconfig. The default configuration for s390 does virtio through a KVM-specific interface, but that configuration does not support DMA. I could modify this patch to stub out the DMA API calls if !CONFIG_HAS_DMA, but it seems to me that it would be much nicer to make s390 support DMA unconditionally.

s390 has no DMA per se. Newest systems have a PCI-like I/O attach in addition to the classic channel I/O, and for that we enable the DMA code just for that transport to be able to reuse some of the existing PCI drivers. (only some, because we differ in some aspects from how PCI looks like) But the architecture itself (and the virtio interface) does not provide the DMA interface as you know it:

Don't most of those DMA API calls end up being nops then? As in, we do have in the dma-api file the #ifdef case for when a platform does not do DMA, which ends up with all functions stubbed out.

In essence this patch is a no-go for s390.

Is that due to possible compilation?

I am also asking myself if this makes virtio-ring more expensive?
Christian It's actually unclear to me whether this can be fixed without either s390 arch support or a special case for s390 in virtio_ring: on brief inspection of s390's DMA code, it seems to assume that everything goes through a PCI IOMMU, which is presumably not the case for virtio. Cc: Cornelia Huck cornelia.h...@de.ibm.com Signed-off-by: Andy Lutomirski l...@amacapital.net --- drivers/virtio/Kconfig | 1 + drivers/virtio/virtio_ring.c | 116 +++ 2 files changed, 95 insertions(+), 22 deletions(-) diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig index c6683f2e396c..b6f3acc61153 100644 --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -1,5 +1,6 @@ config VIRTIO tristate + depends on HAS_DMA ---help--- This option is selected by any driver which implements the virtio bus, such as CONFIG_VIRTIO_PCI, CONFIG_VIRTIO_MMIO, CONFIG_LGUEST, diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index d356a701c9c2..6a78e2fd8e63 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -24,6 +24,7 @@ #include linux/module.h #include linux/hrtimer.h #include linux/kmemleak.h +#include linux/dma-mapping.h #ifdef DEBUG /* For development, we want to crash whenever the ring is screwed. */ @@ -54,6 +55,12 @@ #define END_USE(vq) #endif +struct vring_desc_state +{ + void *data; /* Data for callback. */ + struct vring_desc *indir_desc; /* Indirect descriptor, if any. */ +}; + struct vring_virtqueue { struct virtqueue vq; @@ -93,12 +100,45 @@ struct vring_virtqueue ktime_t last_add_time; #endif - /* Tokens for callbacks. */ - void *data[]; + /* Per-descriptor state. */ + struct vring_desc_state desc_state[]; }; #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq) +/* Helper to map one sg entry, since we can't use dma_map_sg. 
*/ +static dma_addr_t dma_map_one_sg(struct vring_virtqueue *vq, +struct scatterlist *sg, +enum dma_data_direction direction) +{ + return dma_map_page(vq-vq.vdev-dev.parent, + sg_page(sg), sg-offset, sg-length, direction); +} + +static void unmap_one(struct vring_virtqueue *vq, struct vring_desc *desc) +{ + if (desc-flags VRING_DESC_F_INDIRECT) { + dma_unmap_single(vq-vq.vdev-dev.parent, +desc-addr, desc-len, +(desc-flags VRING_DESC_F_WRITE) ? +DMA_FROM_DEVICE : DMA_TO_DEVICE); + } else { + dma_unmap_page(vq-vq.vdev-dev.parent, + desc-addr, desc-len, + (desc-flags VRING_DESC_F_WRITE) ? + DMA_FROM_DEVICE : DMA_TO_DEVICE); + } +} + +static void unmap_indirect(struct vring_virtqueue *vq, struct
Re: [PATCH 3/3] virtio_pci: Use the DMA API for virtqueues
On Wed, Aug 27, 2014 at 10:35:10AM -0700, Andy Lutomirski wrote: On Wed, Aug 27, 2014 at 10:32 AM, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote: On Tue, Aug 26, 2014 at 02:17:02PM -0700, Andy Lutomirski wrote: A virtqueue is a coherent DMA mapping. Use the DMA API for it. This fixes virtio_pci on Xen. Signed-off-by: Andy Lutomirski l...@amacapital.net --- drivers/virtio/virtio_pci.c | 25 ++--- 1 file changed, 18 insertions(+), 7 deletions(-) diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c index 3d1463c6b120..19039c5bec24 100644 --- a/drivers/virtio/virtio_pci.c +++ b/drivers/virtio/virtio_pci.c @@ -80,8 +80,9 @@ struct virtio_pci_vq_info /* the number of entries in the queue */ int num; - /* the virtual address of the ring queue */ - void *queue; + /* the ring queue */ + void *queue;/* virtual address */ + dma_addr_t queue_dma_addr; /* bus address */ /* the list node for the virtqueues list */ struct list_head node; @@ -417,15 +418,16 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index, info-num = num; info-msix_vector = msix_vec; - size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN)); - info-queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO); + size = vring_size(num, VIRTIO_PCI_VRING_ALIGN); + info-queue = dma_zalloc_coherent(vdev-dev.parent, size, + info-queue_dma_addr, GFP_KERNEL); if (info-queue == NULL) { err = -ENOMEM; goto out_info; } /* activate the queue */ - iowrite32(virt_to_phys(info-queue) VIRTIO_PCI_QUEUE_ADDR_SHIFT, + iowrite32(info-queue_dma_addr VIRTIO_PCI_QUEUE_ADDR_SHIFT, vp_dev-ioaddr + VIRTIO_PCI_QUEUE_PFN); /* create the vring */ @@ -462,7 +464,8 @@ out_assign: vring_del_virtqueue(vq); out_activate_queue: iowrite32(0, vp_dev-ioaddr + VIRTIO_PCI_QUEUE_PFN); - free_pages_exact(info-queue, size); + dma_free_coherent(vdev-dev.parent, size, + info-queue, info-queue_dma_addr); out_info: kfree(info); return ERR_PTR(err); @@ -493,7 +496,8 @@ static void vp_del_vq(struct virtqueue *vq) 
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN)); - free_pages_exact(info->queue, size); + dma_free_coherent(vq->vdev->dev.parent, size, + info->queue, info->queue_dma_addr); kfree(info); } @@ -712,6 +716,13 @@ static int virtio_pci_probe(struct pci_dev *pci_dev, if (err) goto out; + /* + * We support 64-bit DMA. If this fails (e.g. some bridge + * or PV code doesn't or DAC is disabled), then we're okay + * with 32-bit DMA.

*scratches his head* I am having a hard time parsing that. Could you expand a bit on the faulting use-case please?

+ */ + dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64));

The usual process is:

ret = dma_set_mask_and_coherent(..);
if (ret)
	ret = dma_set_mask_and_coherent(.., DMA_BIT_MASK(32));
if (ret)
	pr_warn("We are truly screwed. Good luck!\n");

I assumed that, if dma_set_mask_and_coherent(..., DMA_BIT_MASK(64)) fails, then we can still do DMA, just not 64-bit DMA. This driver should be fine with that -- it'll just be a bit slower.

Right. It should fall back to 32-bit (the default on the PCI bus). I was merely thinking the error reporting might be useful, but on second thought - saying that the user is screwed - is a bad idea.

If that's not a safe assumption, I can change it.

--Andy
Re: [PATCH 0/3] virtio: Clean up scatterlists and use the DMA API
On Wed, Aug 27, 2014 at 07:10:20PM +0100, Stefan Hajnoczi wrote: On Wed, Aug 27, 2014 at 6:27 PM, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote: On Wed, Aug 27, 2014 at 07:46:46AM +0100, Stefan Hajnoczi wrote: On Tue, Aug 26, 2014 at 10:16 PM, Andy Lutomirski l...@amacapital.net wrote:

There are two outstanding issues. virtio_net warns if DMA debugging is on because it does DMA from the stack. (The warning is correct.) This also is likely to do something unpleasant to s390. (Maintainers are cc'd -- I don't know what to do about it.)

This changes the semantics of vring and breaks existing guests when bus address != physical address.

Isn't that what this is supposed to fix? Right now the semantics of the vring is that bus address == physical address.

Today the assumption is that the device emulation code has access to guest physical memory (no translation necessary).

Right.

Changing the semantics breaks today's guests. You need to make this change in a way so that existing guests don't break.

Well, the DMA API will preserve that semantic as under KVM it will return phys == bus. But that is a side-effect of the DMA API implementation preserving this semantic, so I understand your point.

Stefan
Re: virtio DMA API?
On Wed, Aug 27, 2014 at 01:52:50PM +0200, Michael S. Tsirkin wrote: On Wed, Aug 27, 2014 at 08:40:51PM +0930, Rusty Russell wrote: Andy Lutomirski l...@amacapital.net writes: Currently, a lot of the virtio code assumes that bus (i.e. hypervisor) addresses are the same as physical address. This is false on Xen, so virtio is completely broken. I wouldn't be surprised if it also becomes a problem the first time that someone sticks a physical virtio device on a 32-bit bus on an ARM SOC with more than 4G RAM. Would you accept patches to convert virtio_ring and virtio_pci to use the DMA APIs? I think that the only real catch will be that virtio_ring's approach to freeing indirect blocks is currently incompatible with the DMA API -- it assumes that knowing the bus address is enough to call kfree, and I don't think that the DMA API provides a reverse mapping like that. Hi Andy, This has long been a source of contention. virtio assumes that the hypervisor can decode guest-physical addresses. PowerPC, in particular, doesn't want to pay the cost of IOMMU manipulations, and all arguments presented so far for using an IOMMU for a virtio device are weak. And changing to use DMA APIs would break them anyway. Of course, it's Just A Matter of Code, so it's possible to create a Xen-specific variant which uses the DMA APIs. I'm not sure what that would look like in the virtio standard, however. Cheers, Rusty. For x86 as of QEMU 2.0 there's no iommu. So a reasonable thing to do for that platform might be to always use iommu *if it's there*. My understanding is this isn't the case for powerpc? Wasn't there some implementation of AMD IOMMU code on QEMU mailing list floating around? Aha: https://www.mail-archive.com/kvm@vger.kernel.org/msg40516.html ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: virtio DMA API?
On Mon, Aug 25, 2014 at 10:18:46AM -0700, Andy Lutomirski wrote:

Currently, a lot of the virtio code assumes that bus (i.e. hypervisor) addresses are the same as physical addresses. This is false on Xen, so virtio is completely broken. I wouldn't be surprised if it also becomes a problem the first time that someone sticks a physical virtio device on a 32-bit bus on an ARM SOC with more than 4G RAM.

Would you accept patches to convert virtio_ring and virtio_pci to use the DMA APIs? I think that the only real catch will be that virtio_ring's approach to freeing indirect blocks is currently incompatible with the DMA API -- it assumes that knowing the bus address is enough to call kfree, and I don't think that the DMA API provides a reverse mapping like that.

If you use dma_map/unmap_sg, all of that ends up being stored in the sg structure (sg->dma_address ends up with the DMA address, sg_phys(sg) gives you the physical address).

--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
Re: [PATCH 10/11] qspinlock: Paravirt support
On Mon, Jul 07, 2014 at 05:27:34PM +0200, Peter Zijlstra wrote: On Fri, Jun 20, 2014 at 09:46:08AM -0400, Konrad Rzeszutek Wilk wrote:

I dug in the code and I have some comments about it, but before I post them I was wondering if you have any plans to run any performance tests against the PV ticketlock with normal and over-committed scenarios?

I can barely boot a guest.. I'm not sure I can make them do anything much at all yet. All this virt crap is totally painful.

HA!

The reason I asked about that is that from a pen-and-paper view it looks suboptimal in the worst case scenario compared to the PV ticketlock. The 'worst case scenario' is when we over-commit (more virtual CPUs than there are physical CPUs) or have to delay guests (the sum of all virtual CPUs > physical CPUs and all of the guests are compiling kernels). In those cases the PV ticketlock goes to sleep and gets woken up once the ticket holder has finished. In the PV qspinlock we do wake up the first in queue, but we also wake the next one in queue so it can progress further. And so on.

Perhaps a better mechanism is to just ditch the queue part and utilize the byte part and under KVM and Xen just do bytelocking (since we have 8 bits). For the PV halt/waking we can stash in the 'struct mcs' the current lock that each CPU is waiting for. And the unlocker can iterate over all of those and wake them all up. Perhaps make the iteration random. Anyhow, that is how the old PV bytelock under Xen worked (before 3.11) and it had worked pretty well (it didn't do it randomly though - it always started with 'for_each_online_cpu').

Squashing the ticketlock concept into qspinlock for PV looks scary. And as I said - this is all pen-and-paper - so it might be that this 'wake-up-go-sleep-on-the-queue' kick is actually not that bad?

Lastly - thank you for taking a stab at this.
Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock
On Mon, Jun 23, 2014 at 06:26:22PM +0200, Peter Zijlstra wrote: On Tue, Jun 17, 2014 at 04:05:31PM -0400, Konrad Rzeszutek Wilk wrote: + * The basic principle of a queue-based spinlock can best be understood + * by studying a classic queue-based spinlock implementation called the + * MCS lock. The paper below provides a good description for this kind + * of lock. + * + * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf + * + * This queue spinlock implementation is based on the MCS lock, however to make + * it fit the 4 bytes we assume spinlock_t to be, and preserve its existing + * API, we must modify it some. + * + * In particular; where the traditional MCS lock consists of a tail pointer + * (8 bytes) and needs the next pointer (another 8 bytes) of its own node to + * unlock the next pending (next->locked), we compress both these: {tail, + * next->locked} into a single u32 value. + * + * Since a spinlock disables recursion of its own context and there is a limit + * to the contexts that can nest; namely: task, softirq, hardirq, nmi, we can + * encode the tail as an index indicating this context and a cpu number. + * + * We can further change the first spinner to spin on a bit in the lock word + * instead of its node; whereby avoiding the need to carry a node from lock to + * unlock, and preserving API. You also made changes (compared to the MCS) in that the unlock path is not spinning waiting for the successor and that the job of passing the lock is not done in the unlock path either. Instead all of that is now done in the path of the lock acquirer logic. Could you update the comment to say that please? I _think_ I know what you mean..
So that is actually implied by the last paragraph (You do :-)), but I suppose I can make it explicit; something like: * * Another way to look at it is: * * lock(tail,locked) * struct mcs_spinlock node; * mcs_spin_lock(tail, node); * test-and-set locked; * mcs_spin_unlock(tail, node); * * unlock(tail,locked) * clear locked * * Where we have compressed (tail,locked) into a single u32 word.
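For readers who have not seen it, the classic MCS lock that this thread keeps compressing into a single u32 can be sketched as stand-alone userspace C11. This is an illustrative model only, not the kernel code: the names `mcs_acquire`/`mcs_release` and the use of `<stdatomic.h>` are my own, and it keeps the traditional 8-byte tail pointer and per-node `next` pointer that the qspinlock patch is trying to get rid of.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdbool.h>

/* Illustrative stand-alone MCS lock, after Mellor-Crummey and Scott. */
struct mcs_node {
	struct mcs_node *_Atomic next;
	atomic_bool locked;		/* becomes true when our predecessor hands over */
};

struct mcs_lock {
	struct mcs_node *_Atomic tail;	/* NULL when the lock is free */
};

static void mcs_acquire(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *prev;

	atomic_store(&node->next, (struct mcs_node *)NULL);
	atomic_store(&node->locked, false);

	/* Swap ourselves in as the new tail; the old tail is our predecessor. */
	prev = atomic_exchange(&lock->tail, node);
	if (prev) {
		atomic_store(&prev->next, node);	/* link: prev->next = us */
		while (!atomic_load(&node->locked))	/* spin on OUR node only */
			;
	}
}

static void mcs_release(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *next = atomic_load(&node->next);

	if (!next) {
		/* No known successor: try to reset tail back to "free". */
		struct mcs_node *expected = node;
		if (atomic_compare_exchange_strong(&lock->tail, &expected,
						   (struct mcs_node *)NULL))
			return;
		/* A waiter is mid-enqueue; wait for it to link itself in. */
		while (!(next = atomic_load(&node->next)))
			;
	}
	atomic_store(&next->locked, true);	/* hand the lock over */
}
```

Note how the unlock path here does the successor hand-off; Konrad's comment above is exactly that the qspinlock variant moves that work into the acquirer's path instead.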
Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock
On Mon, Jun 23, 2014 at 06:12:00PM +0200, Peter Zijlstra wrote: On Tue, Jun 17, 2014 at 04:03:29PM -0400, Konrad Rzeszutek Wilk wrote: + new = tail | (val & _Q_LOCKED_MASK); + + old = atomic_cmpxchg(&lock->val, val, new); + if (old == val) + break; + + val = old; + } + + /* + * we won the trylock; forget about queueing. + */ + if (new == _Q_LOCKED_VAL) + goto release; + + /* + * if there was a previous node; link it and wait. + */ + if (old & ~_Q_LOCKED_MASK) { + prev = decode_tail(old); + ACCESS_ONCE(prev->next) = node; + + arch_mcs_spin_lock_contended(&node->locked); Could you add a comment here: /* We are spinning forever until the previous node updates locked - which it does once it has updated lock->val with our tail number. */ That's incorrect -- or at least, I understand that to be incorrect. The previous node will not have changed the tail to point to us. You always change the tail to point to yourself, seeing how you add yourself to the tail. Is the existing comment any better if I s/wait./wait for it to release us./ ? Yes! + /* + * claim the lock: + * + * n,0 -> 0,1 : lock, uncontended + * *,0 -> *,1 : lock, contended + */ + for (;;) { + new = _Q_LOCKED_VAL; + if (val != tail) + new |= val; ..snip.. Could you help a bit in explaining it in English please? After looking at the assembler code I finally figured out how we can get here. And the 'contended' part threw me off. Somehow I imagined there are two or more CPUs stampeding here and trying to update lock->val. But in reality the other CPUs are stuck in arch_mcs_spin_lock_contended spinning on their local value. Well, the lock as a whole is contended (there's >1 waiters), and the point of MCS style locks is to make sure they're not actually pounding on the same cacheline. So the whole thing is consistent. Perhaps you could add this comment. /* Once queue_spin_unlock is called (which _subtracts_ _Q_LOCKED_VAL from lock->val while still preserving the tail data), the winner gets to claim the ticket.
There's no tickets :/ s/ticket/be first in line/ ? Since we still need the other CPUs to continue and preserve the strict ordering in which they set up node->next, we: 1) update lock->val to the tail value (so tail CPU and its index) with _Q_LOCKED_VAL. We don't, we preserve the tail value, unless we're the tail, in which case we clear the tail. 2) Once we are done, we poke the other CPU (the one that linked to us) by writing to node->locked (below) so they can make progress and loop on lock->val changing from _Q_LOCKED_MASK to zero. _If_ there was another cpu, i.e. the tail didn't point to us. nods --- I don't do well with natural language comments like that; they tend to confuse me more than anything.
Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock
On Mon, Jun 23, 2014 at 05:56:50PM +0200, Peter Zijlstra wrote: On Mon, Jun 16, 2014 at 04:49:18PM -0400, Konrad Rzeszutek Wilk wrote: Index: linux-2.6/kernel/locking/mcs_spinlock.h === --- linux-2.6.orig/kernel/locking/mcs_spinlock.h +++ linux-2.6/kernel/locking/mcs_spinlock.h @@ -17,6 +17,7 @@ struct mcs_spinlock { struct mcs_spinlock *next; int locked; /* 1 if lock acquired */ + int count; This could use a comment. like so? int count; /* nesting count, see qspinlock.c */ /* nested level - in user, softirq, hard irq or nmi context. */ ? +static inline u32 encode_tail(int cpu, int idx) +{ + u32 tail; + + tail = (cpu + 1) << _Q_TAIL_CPU_OFFSET; + tail |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */ Should there be an ASSERT(idx < 4) just in case we screw up somehow (I can't figure out how, but that is partially why ASSERTs are added). #ifdef CONFIG_DEBUG_SPINLOCK BUG_ON(idx > 3); #endif might do, I suppose. nods +/** + * queue_spin_lock_slowpath - acquire the queue spinlock + * @lock: Pointer to queue spinlock structure + * @val: Current value of the queue spinlock 32-bit word + * + * (queue tail, lock bit) Except it is not a lock bit. It is a lock uint8_t. It is indeed, although that's an accident of implementation. I could do s/bit// and not mention the entire storage angle at all? I think giving as much detail as possible is good. What you said about it being an 'accident of implementation' could be woven in there? Is the queue tail at this point the composite of 'cpu|idx'? Yes, as per {en,de}code_tail() above. + * + * fast : slow : unlock + * : : + * uncontended (0,0) --:--> (0,1) :--> (*,0) + * : | ^--./ : + * : v \ | : + * uncontended : (n,x) --+--> (n,0) | : So many CPUn come in right? Is 'n' for the number of CPUs? Nope, 'n' for any one specific tail, in particular the first one to arrive.
This is the 'uncontended queue' case as per the label, so we need a named value for the first, in order to distinguish between the state to the right (same tail, but unlocked) and the state below (different tail). + * queue : | ^--' | : + * : v | : + * contended : (*,x) --+--> (*,0) --> (*,1) ---' : + * queue : ^--' : And here um, what are the '*' for? Are they the four different types of handlers that can be nested? So task, softirq, hardirq, and nmi? '*' as in wildcard, any tail, specifically not 'n'. Ah, thank you for the explanation! Would it be possible to include that in the comment please? +void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) +{ + struct mcs_spinlock *prev, *next, *node; + u32 new, old, tail; + int idx; + + BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)); + + node = this_cpu_ptr(&mcs_nodes[0]); + idx = node->count++; If this is the first time we enter this, wouldn't idx end up being 1? Nope, postfix ++ returns first and increments later. blushes Yes it does. + tail = encode_tail(smp_processor_id(), idx); + + node += idx; Meaning we end up skipping the 'mcs_nodes[0]' one altogether - even on the first 'level' (task, softirq, hardirq, nmi)? Won't that cause us to blow past the array when we are nested at the nmi handler? Seeing how it's all static storage, which is automagically initialized to 0, combined with the postfix ++ (as opposed to the prefix ++) we should be getting 0 here. I've no idea what I was thinking, but thank you for setting me straight. + node->locked = 0; + node->next = NULL; + + /* + * trylock || xchg(lock, node) + * + * 0,0 -> 0,1 ; trylock + * p,x -> n,x ; prev = xchg(lock, node) I looked at that for 10 seconds and I was not sure what you meant. Is this related to the MCS document you had pointed to? It would help if you mention that the comments follow the document.
(But they don't seem to) I presume what you mean is that if we are the next after the lock-holder we need only to update the 'next' (or the composite value of smp_processor_idx | idx) to point to us. As in, swap the 'L' with 'I' (looking at the doc) They are the 'tail','lock' tuples, so this composite atomic operation completes either: 0,0 -> 0,1 -- we had no tail, not locked; into: no tail, locked. OR p,x -> n,x -- tail was p
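The encode_tail()/decode_tail() pair discussed above is easy to model outside the kernel. The sketch below is illustrative: the `Q_*` offsets follow the bit layout quoted later in this archive (0-7 locked byte, 8 pending, 9-10 tail index, 11-31 tail cpu +1), and the helper bodies mirror the patch rather than being the kernel code itself.

```c
#include <stdint.h>

/* Bit layout from the quoted patch (with the pending bit):
 * 0- 7: locked byte, 8: pending, 9-10: tail index, 11-31: tail cpu (+1) */
#define Q_TAIL_IDX_OFFSET	9
#define Q_TAIL_CPU_OFFSET	11

static uint32_t encode_tail(int cpu, int idx)
{
	uint32_t tail;

	/* cpu is stored +1 so that tail == 0 can mean "no queue at all" */
	tail  = (uint32_t)(cpu + 1) << Q_TAIL_CPU_OFFSET;
	tail |= (uint32_t)idx << Q_TAIL_IDX_OFFSET;	/* nesting level, 0..3 */
	return tail;
}

static void decode_tail(uint32_t tail, int *cpu, int *idx)
{
	*cpu = (int)(tail >> Q_TAIL_CPU_OFFSET) - 1;
	*idx = (int)((tail >> Q_TAIL_IDX_OFFSET) & 3);
}
```

The +1 on the cpu number is the detail that lets the all-zeroes lock word mean "unlocked and empty queue" even for cpu 0 at nesting level 0.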
Re: [PATCH 10/11] qspinlock: Paravirt support
On Sun, Jun 15, 2014 at 02:47:07PM +0200, Peter Zijlstra wrote: Add minimal paravirt support. The code aims for minimal impact on the native case. Woot! On the lock side we add one jump label (asm_goto) and 4 paravirt callee saved calls that default to NOPs. The only effects are the extra NOPs and some pointless MOVs to accomodate the calling convention. No register spills happen because of this (x86_64). On the unlock side we have one paravirt callee saved call, which defaults to the actual unlock sequence: movb $0, (%rdi) and a NOP. The actual paravirt code comes in 3 parts; - init_node; this initializes the extra data members required for PV state. PV state data is kept 1 cacheline ahead of the regular data. - link_and_wait_node/kick_node; these are paired with the regular MCS queueing and are placed resp. before/after the paired MCS ops. - wait_head/queue_unlock; the interesting part here is finding the head node to kick. Tracking the head is done in two parts, firstly the pv_wait_head will store its cpu number in whichever node is pointed to by the tail part of the lock word. Secondly, pv_link_and_wait_node() will propagate the existing head from the old to the new tail node. I dug in the code and I have some comments about it, but before I post them I was wondering if you have any plans to run any performance tests against the PV ticketlock with normal and over-committed scenarios? Looking at this with a pen and paper I see that compared to PV ticketlock for the CPUs that are contending on the queue (so they go to pv_wait_head_and_link, then progress to pv_wait_head), they go sleep twice and get woken up twice. In PV ticketlock the contending CPUs would only go to sleep once and woken up once it was their turn. That of course is the worst case scenario - where the CPU that has the lock is taking forever to do its job and the host is quite overcommitted. Thanks! 
Re: [PATCH 03/11] qspinlock: Add pending bit
On Wed, Jun 18, 2014 at 01:29:48PM +0200, Paolo Bonzini wrote: Il 17/06/2014 22:36, Konrad Rzeszutek Wilk ha scritto: +/* One more attempt - but if we fail mark it as pending. */ +if (val == _Q_LOCKED_VAL) { +new = Q_LOCKED_VAL |_Q_PENDING_VAL; + +old = atomic_cmpxchg(lock-val, val, new); +if (old == _Q_LOCKED_VAL) /* YEEY! */ +return; +val = old; +} Note that Peter's code is in a for(;;) loop: + for (;;) { + /* + * If we observe any contention; queue. + */ + if (val ~_Q_LOCKED_MASK) + goto queue; + + new = _Q_LOCKED_VAL; + if (val == new) + new |= _Q_PENDING_VAL; + + old = atomic_cmpxchg(lock-val, val, new); + if (old == val) + break; + + val = old; + } + + /* + * we won the trylock + */ + if (new == _Q_LOCKED_VAL) + return; So what you'd have is basically: /* * One more attempt if no one is already in queue. Perhaps * they have unlocked the spinlock already. */ if (val == _Q_LOCKED_VAL atomic_read(lock-val) == 0) { old = atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL); if (old == 0) /* YEEY! */ return; val = old; } But I agree with Waiman that this is unlikely to trigger often enough. It does have to be handled in the slowpath for correctness, but the most likely path is (0,0,1)-(0,1,1). nods Paolo ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
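Paolo's point about the loop structure can be seen in a compilable model. This is a sketch under stated assumptions, not the kernel code: the lock word is a bare `atomic_uint`, the `Q_*` constants mirror qspinlock_types.h, and `trylock_or_pend` is my own name for the "one more attempt, else mark pending, else queue" transition being debated above.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

#define Q_LOCKED_VAL	(1u << 0)
#define Q_PENDING_VAL	(1u << 8)
#define Q_LOCKED_MASK	0xffu

/* Model of the quoted loop: take the lock if free, else claim the pending
 * slot if only the locked byte is set; bail out (caller queues) otherwise.
 * On success, *valp holds the value we installed in the lock word. */
static bool trylock_or_pend(atomic_uint *lock, uint32_t *valp)
{
	uint32_t val = *valp, old, new;

	for (;;) {
		if (val & ~Q_LOCKED_MASK)	/* tail or pending set: queue */
			return false;

		new = Q_LOCKED_VAL;
		if (val == new)			/* already locked: go pending */
			new |= Q_PENDING_VAL;

		old = val;
		if (atomic_compare_exchange_strong(lock, &old, new))
			break;
		val = old;			/* raced; retry with fresh value */
	}
	*valp = new;
	return true;
}
```

This makes Paolo's observation concrete: the common contended path is the (0,0,1) to (0,1,1) transition taken inside the loop, which is why pulling one extra trylock out of the slowpath buys little.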
Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word
On Wed, Jun 18, 2014 at 01:37:45PM +0200, Paolo Bonzini wrote: Il 17/06/2014 22:55, Konrad Rzeszutek Wilk ha scritto: On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote: From: Waiman Long waiman.l...@hp.com This patch extracts the logic for the exchange of new and previous tail code words into a new xchg_tail() function which can be optimized in a later patch. And also adds a third try on acquiring the lock. That I think should be a seperate patch. It doesn't really add a new try, the old code is: - for (;;) { - new = _Q_LOCKED_VAL; - if (val) - new = tail | (val _Q_LOCKED_PENDING_MASK); - - old = atomic_cmpxchg(lock-val, val, new); - if (old == val) - break; - - val = old; - } /* - * we won the trylock; forget about queueing. */ - if (new == _Q_LOCKED_VAL) - goto release; The trylock happens if the if (val) hits the else branch. What the patch does is change it from attempting two transition with a single cmpxchg: - * 0,0,0 - 0,0,1 ; trylock - * p,y,x - n,y,x ; prev = xchg(lock, node) to first doing the trylock, then the xchg. If the trylock passes and the xchg returns prev=0,0,0, the next step of the algorithm goes to the locked/uncontended state + /* + * claim the lock: + * + * n,0 - 0,1 : lock, uncontended Similar to your suggestion of patch 3, it's expected that the xchg will *not* return prev=0,0,0 after a failed trylock. I do like your explanation. I hope that Peter will put it in the description as it explains the change quite well. However, I *do* agree with you that it's simpler to just squash this patch into 01/11. Uh, did I say that? Oh I said why don't make it right the first time! I meant in terms of seperating the slowpath (aka the bytelock on the pending bit) from the queue (MCS code). Or renaming the function to be called 'complex' instead of 'slowpath' as it is getting quite hairy. 
The #1 patch is nice by itself - as it lays out the foundation of the MCS-similar code - and if Ingo decides he does not want this pending byte-lock bit business - it can be easily reverted or dropped. In terms of squashing this in #1 - I would advocate against that. Thanks!
Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word
However, I *do* agree with you that it's simpler to just squash this patch into 01/11. Uh, did I say that? Oh I said why don't make it right the first time! I meant in terms of separating the slowpath (aka the bytelock on the pending bit) from the queue (MCS code). Or renaming the function to be called 'complex' instead of 'slowpath' as it is getting quite hairy. The #1 patch is nice by itself - as it lays out the foundation of the MCS-similar code - and if Ingo decides he does not want this pending byte-lock bit business - it can be easily reverted or dropped. The pending bit code is needed for performance parity with ticket spinlock at light load. My own measurement indicates that the queuing overhead will cause the queue spinlock to be slower than ticket spinlock with 2-4 contending tasks. The pending bit solves the performance problem with 2 contending tasks (Aha!), leaving only the 3-4 task cases being a bit slower than the ticket spinlock, which should be more than compensated for by its superior performance with heavy contention and slightly better performance with no contention. That should be mentioned in the commit description as the rationale for the patch qspinlock: Add pending bit and also in the code. Thank you!
Re: [PATCH 09/11] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled
On Sun, Jun 15, 2014 at 02:47:06PM +0200, Peter Zijlstra wrote: From: Waiman Long waiman.l...@hp.com This patch renames the paravirt_ticketlocks_enabled static key to a more generic paravirt_spinlocks_enabled name. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org Acked-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com --- arch/x86/include/asm/spinlock.h |4 ++-- arch/x86/kernel/kvm.c|2 +- arch/x86/kernel/paravirt-spinlocks.c |4 ++-- arch/x86/xen/spinlock.c |2 +- 4 files changed, 6 insertions(+), 6 deletions(-) --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -39,7 +39,7 @@ /* How long a lock should spin before we consider blocking */ #define SPIN_THRESHOLD (1 15) -extern struct static_key paravirt_ticketlocks_enabled; +extern struct static_key paravirt_spinlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); #ifdef CONFIG_QUEUE_SPINLOCK @@ -150,7 +150,7 @@ static inline void __ticket_unlock_slowp static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) { if (TICKET_SLOWPATH_FLAG - static_key_false(paravirt_ticketlocks_enabled)) { + static_key_false(paravirt_spinlocks_enabled)) { arch_spinlock_t prev; prev = *lock; --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -819,7 +819,7 @@ static __init int kvm_spinlock_init_jump if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) return 0; - static_key_slow_inc(paravirt_ticketlocks_enabled); + static_key_slow_inc(paravirt_spinlocks_enabled); printk(KERN_INFO KVM setup paravirtual spinlock\n); return 0; --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -16,5 +16,5 @@ struct pv_lock_ops pv_lock_ops = { }; EXPORT_SYMBOL(pv_lock_ops); -struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE; -EXPORT_SYMBOL(paravirt_ticketlocks_enabled); +struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE; +EXPORT_SYMBOL(paravirt_spinlocks_enabled); 
--- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -293,7 +293,7 @@ static __init int xen_init_spinlocks_jum if (!xen_domain()) return 0; - static_key_slow_inc(paravirt_ticketlocks_enabled); + static_key_slow_inc(paravirt_spinlocks_enabled); return 0; } early_initcall(xen_init_spinlocks_jump);
Re: [PATCH 08/11] qspinlock: Revert to test-and-set on hypervisors
On Sun, Jun 15, 2014 at 02:47:05PM +0200, Peter Zijlstra wrote: When we detect a hypervisor (!paravirt, see later patches), revert to Please spell out the name of the patches. a simple test-and-set lock to avoid the horrors of queue preemption. Heheh. Signed-off-by: Peter Zijlstra pet...@infradead.org --- arch/x86/include/asm/qspinlock.h | 14 ++ include/asm-generic/qspinlock.h |7 +++ kernel/locking/qspinlock.c |3 +++ 3 files changed, 24 insertions(+) --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -1,6 +1,7 @@ #ifndef _ASM_X86_QSPINLOCK_H #define _ASM_X86_QSPINLOCK_H +#include asm/cpufeature.h #include asm-generic/qspinlock_types.h #if !defined(CONFIG_X86_OOSTORE) !defined(CONFIG_X86_PPRO_FENCE) @@ -20,6 +21,19 @@ static inline void queue_spin_unlock(str #endif /* !CONFIG_X86_OOSTORE !CONFIG_X86_PPRO_FENCE */ +#define virt_queue_spin_lock virt_queue_spin_lock + +static inline bool virt_queue_spin_lock(struct qspinlock *lock) +{ + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) + return false; + + while (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) != 0) + cpu_relax(); + + return true; +} + #include asm-generic/qspinlock.h #endif /* _ASM_X86_QSPINLOCK_H */ --- a/include/asm-generic/qspinlock.h +++ b/include/asm-generic/qspinlock.h @@ -98,6 +98,13 @@ static __always_inline void queue_spin_u } #endif +#ifndef virt_queue_spin_lock +static __always_inline bool virt_queue_spin_lock(struct qspinlock *lock) +{ + return false; +} +#endif + /* * Initializier */ --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -247,6 +247,9 @@ void queue_spin_lock_slowpath(struct qsp BUILD_BUG_ON(CONFIG_NR_CPUS = (1U _Q_TAIL_CPU_BITS)); + if (virt_queue_spin_lock(lock)) + return; + /* * wait for in-progress pending-locked hand-overs * ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
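The hypervisor fallback above is straightforward to model in userspace. In this sketch, the boolean `on_hypervisor` parameter stands in for `static_cpu_has(X86_FEATURE_HYPERVISOR)`, and C11 atomics replace `atomic_cmpxchg()`/`cpu_relax()`; it is an illustration of the same degrade-to-test-and-set idea, not the kernel function.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define Q_LOCKED_VAL 1u

/* Userspace model of virt_queue_spin_lock(): on a hypervisor, skip the
 * queueing entirely and degrade to a plain test-and-set spin, avoiding
 * the lock-holder/queue-preemption problems the patch describes. */
static bool virt_spin_lock(atomic_uint *lock, bool on_hypervisor)
{
	unsigned int expected;

	if (!on_hypervisor)
		return false;	/* caller falls through to the queued path */

	for (;;) {
		expected = 0;
		if (atomic_compare_exchange_strong(lock, &expected,
						   Q_LOCKED_VAL))
			return true;
		/* the kernel would cpu_relax() here */
	}
}
```

Returning false on bare metal mirrors how the kernel's virt_queue_spin_lock() lets queue_spin_lock_slowpath() continue with the normal queued algorithm.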
Re: [PATCH 07/11] qspinlock: Use a simple write to grab the lock, if applicable
On Sun, Jun 15, 2014 at 02:47:04PM +0200, Peter Zijlstra wrote: From: Waiman Long waiman.l...@hp.com Currently, atomic_cmpxchg() is used to get the lock. However, this is not really necessary if there is more than one task in the queue and the queue head don't need to reset the queue code word. For that case, s/queue code word/tail {number,value}/ ? a simple write to set the lock bit is enough as the queue head will be the only one eligible to get the lock as long as it checks that both the lock and pending bits are not set. The current pending bit waiting code will ensure that the bit will not be set as soon as the queue code word (tail) in the lock is set. Just use the same word as above. With that change, the are some slight improvement in the performance of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket Westere-EX machine as shown in the tables below. [Standalone/Embedded - same node] # of tasks Before patchAfter patch %Change -- --- -- --- 3 2324/2321 2248/2265-3%/-2% 4 2890/2896 2819/2831-2%/-2% 5 3611/3595 3522/3512-2%/-2% 6 4281/4276 4173/4160-3%/-3% 7 5018/5001 4875/4861-3%/-3% 8 5759/5750 5563/5568-3%/-3% [Standalone/Embedded - different nodes] # of tasks Before patchAfter patch %Change -- --- -- --- 3 12242/12237 12087/12093 -1%/-1% 4 10688/10696 10507/10521 -2%/-2% It was also found that this change produced a much bigger performance improvement in the newer IvyBridge-EX chip and was essentially to close the performance gap between the ticket spinlock and queue spinlock. The disk workload of the AIM7 benchmark was run on a 4-socket Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users on a 3.14 based kernel. 
The results of the test runs were: AIM7 XFS Disk Test kernel JPMReal Time Sys TimeUsr Time - ---- ticketlock56782333.17 96.61 5.81 qspinlock 57507993.13 94.83 5.97 AIM7 EXT4 Disk Test kernel JPMReal Time Sys TimeUsr Time - ---- ticketlock1114551 16.15 509.72 7.11 qspinlock 21844668.24 232.99 6.01 The ext4 filesystem run had a much higher spinlock contention than the xfs filesystem run. The ebizzy -m test was also run with the following results: kernel records/s Real Time Sys TimeUsr Time -- - ticketlock 2075 10.00 216.35 3.49 qspinlock 3023 10.00 198.20 4.80 Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org --- kernel/locking/qspinlock.c | 59 - 1 file changed, 43 insertions(+), 16 deletions(-) --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -93,24 +93,33 @@ static inline struct mcs_spinlock *decod * By using the whole 2nd least significant byte for the pending bit, we * can allow better optimization of the lock acquisition for the pending * bit holder. + * + * This internal structure is also used by the set_locked function which + * is not restricted to _Q_PENDING_BITS == 8. */ -#if _Q_PENDING_BITS == 8 - struct __qspinlock { union { atomic_t val; - struct { #ifdef __LITTLE_ENDIAN + u8 locked; + struct { u16 locked_pending; u16 tail; + }; #else + struct { u16 tail; u16 locked_pending; -#endif }; + struct { + u8 reserved[3]; + u8 locked; + }; +#endif }; }; +#if _Q_PENDING_BITS == 8 /** * clear_pending_set_locked - take ownership and clear the pending bit. * @lock: Pointer to queue spinlock structure @@ -197,6 +206,19 @@ static __always_inline u32 xchg_tail(str #endif /* _Q_PENDING_BITS == 8 */ /** + * set_locked - Set the lock bit and own the lock Full stop missing. + * @lock: Pointer to queue spinlock structure Ditto. 
+ * + * *,*,0 - *,0,1 + */ +static __always_inline void set_locked(struct qspinlock *lock) +{ + struct __qspinlock *l = (void *)lock; + + ACCESS_ONCE(l-locked) = _Q_LOCKED_VAL; +} + +/** *
Re: [PATCH 05/11] qspinlock: Optimize for smaller NR_CPUS
On Sun, Jun 15, 2014 at 02:47:02PM +0200, Peter Zijlstra wrote: From: Peter Zijlstra pet...@infradead.org When we allow for a max NR_CPUS 2^14 we can optimize the pending wait-acquire and the xchg_tail() operations. By growing the pending bit to a byte, we reduce the tail to 16bit. This means we can use xchg16 for the tail part and do away with all the repeated compxchg() operations. This in turn allows us to unconditionally acquire; the locked state as observed by the wait loops cannot change. And because both locked and pending are now a full byte we can use simple stores for the state transition, obviating one atomic operation entirely. I have to ask - how much more performance do you get from this? Is this extra atomic operation hurting that much? All this is horribly broken on Alpha pre EV56 (and any other arch that cannot do single-copy atomic byte stores). Signed-off-by: Peter Zijlstra pet...@infradead.org --- include/asm-generic/qspinlock_types.h | 13 kernel/locking/qspinlock.c| 103 ++ 2 files changed, 106 insertions(+), 10 deletions(-) --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -38,6 +38,14 @@ typedef struct qspinlock { /* * Bitfields in the atomic value: * + * When NR_CPUS 16K + * 0- 7: locked byte + * 8: pending + * 9-15: not used + * 16-17: tail index + * 18-31: tail cpu (+1) + * + * When NR_CPUS = 16K * 0- 7: locked byte * 8: pending * 9-10: tail index @@ -50,7 +58,11 @@ typedef struct qspinlock { #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) #define _Q_PENDING_OFFSET(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#if CONFIG_NR_CPUS (1U 14) +#define _Q_PENDING_BITS 8 +#else #define _Q_PENDING_BITS 1 +#endif #define _Q_PENDING_MASK _Q_SET_MASK(PENDING) #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) @@ -61,6 +73,7 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET #define _Q_TAIL_MASK 
(_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) #define _Q_LOCKED_VAL(1U _Q_LOCKED_OFFSET) --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -22,6 +22,7 @@ #include linux/percpu.h #include linux/hardirq.h #include linux/mutex.h +#include asm/byteorder.h #include asm/qspinlock.h /* @@ -48,6 +49,9 @@ * We can further change the first spinner to spin on a bit in the lock word * instead of its node; whereby avoiding the need to carry a node from lock to * unlock, and preserving API. + * + * N.B. The current implementation only supports architectures that allow + * atomic operations on smaller 8-bit and 16-bit data types. */ #include mcs_spinlock.h @@ -85,6 +89,87 @@ static inline struct mcs_spinlock *decod #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) +/* + * By using the whole 2nd least significant byte for the pending bit, we + * can allow better optimization of the lock acquisition for the pending + * bit holder. + */ +#if _Q_PENDING_BITS == 8 + +struct __qspinlock { + union { + atomic_t val; + struct { +#ifdef __LITTLE_ENDIAN + u16 locked_pending; + u16 tail; +#else + u16 tail; + u16 locked_pending; +#endif + }; + }; +}; + +/** + * clear_pending_set_locked - take ownership and clear the pending bit. + * @lock: Pointer to queue spinlock structure + * @val : Current value of the queue spinlock 32-bit word + * + * *,1,0 - *,0,1 + * + * Lock stealing is not allowed if this function is used. + */ +static __always_inline void +clear_pending_set_locked(struct qspinlock *lock, u32 val) +{ + struct __qspinlock *l = (void *)lock; + + ACCESS_ONCE(l-locked_pending) = _Q_LOCKED_VAL; +} + +/* + * xchg_tail - Put in the new queue tail code word retrieve previous one Missing full stop. 
+ * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* - n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) +{ + struct __qspinlock *l = (void *)lock; + + return (u32)xchg(l-tail, tail _Q_TAIL_OFFSET) _Q_TAIL_OFFSET; +} + +#else /* _Q_PENDING_BITS == 8 */ + +/** + * clear_pending_set_locked - take ownership and clear the pending bit. + * @lock: Pointer to queue spinlock structure + * @val : Current value of the queue spinlock 32-bit word + * + * *,1,0 - *,0,1 + */ +static __always_inline void +clear_pending_set_locked(struct
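The 16-bit xchg_tail() trick from this patch can be sketched with C11 atomics. Assumptions worth flagging: the struct below models only the little-endian (`__LITTLE_ENDIAN`) branch of the quoted `__qspinlock` overlay, the `Q_TAIL_OFFSET` of 16 corresponds to the 8-bit-pending layout, and the type names are illustrative rather than the kernel's.

```c
#include <stdatomic.h>
#include <stdint.h>

#define Q_TAIL_OFFSET 16	/* with an 8-bit pending field, the tail is the high u16 */

/* Little-endian model of the quoted __qspinlock overlay: the 32-bit lock
 * word is split so the tail occupies its own naturally aligned 16-bit half. */
struct qsl {
	union {
		atomic_uint val;
		struct {
			uint16_t locked_pending;
			_Atomic uint16_t tail;
		};
	};
};

/* Because the tail no longer shares bits with locked/pending, one xchg on
 * that half replaces the whole cmpxchg retry loop of the generic code. */
static uint32_t xchg_tail(struct qsl *lock, uint32_t tail)
{
	uint16_t old = atomic_exchange(&lock->tail,
				       (uint16_t)(tail >> Q_TAIL_OFFSET));
	return (uint32_t)old << Q_TAIL_OFFSET;
}
```

This is the concrete answer to the "how much does it buy" question: the unconditional 16-bit exchange cannot fail, so the enqueue path loses its retry loop entirely on architectures with single-copy-atomic 16-bit stores.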
Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock
+ new = tail | (val & _Q_LOCKED_MASK); + + old = atomic_cmpxchg(&lock->val, val, new); + if (old == val) + break; + + val = old; + } + + /* + * we won the trylock; forget about queueing. + */ + if (new == _Q_LOCKED_VAL) + goto release; + + /* + * if there was a previous node; link it and wait. + */ + if (old & ~_Q_LOCKED_MASK) { + prev = decode_tail(old); + ACCESS_ONCE(prev->next) = node; + + arch_mcs_spin_lock_contended(&node->locked); Could you add a comment here: /* We are spinning forever until the previous node updates locked - which it does once it has updated lock->val with our tail number. */ + } + + /* + * we're at the head of the waitqueue, wait for the owner to go away. + * + * *,x -> *,0 + */ + while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK) + cpu_relax(); + + /* + * claim the lock: + * + * n,0 -> 0,1 : lock, uncontended + * *,0 -> *,1 : lock, contended + */ + for (;;) { + new = _Q_LOCKED_VAL; + if (val != tail) + new |= val; ..snip.. Could you help a bit in explaining it in English please? After looking at the assembler code I finally figured out how we can get here. And the 'contended' part threw me off. Somehow I imagined there are two or more CPUs stampeding here and trying to update lock->val. But in reality the other CPUs are stuck in arch_mcs_spin_lock_contended spinning on their local value. Perhaps you could add this comment. /* Once queue_spin_unlock is called (which _subtracts_ _Q_LOCKED_VAL from lock->val while still preserving the tail data), the winner gets to claim the ticket. Since we still need the other CPUs to continue and preserve the strict ordering in which they set up node->next, we: 1) update lock->val to the tail value (so tail CPU and its index) with _Q_LOCKED_VAL. 2) Once we are done, we poke the other CPU (the one that linked to us) by writing to node->locked (below) so they can make progress and loop on lock->val changing from _Q_LOCKED_MASK to zero).
*/
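The claim loop being discussed can be modeled as plain C. This is a single-threaded sketch under stated assumptions: the lock word is a bare `atomic_uint`, the `Q_*` constants mirror qspinlock_types.h, `claim_lock` is my own name, and the busy-waits elide `cpu_relax()`.

```c
#include <stdatomic.h>
#include <stdint.h>

#define Q_LOCKED_VAL	1u
#define Q_LOCKED_MASK	0xffu

/* Model of the head-of-queue claim: wait for the current owner to drop the
 * locked byte, then either take over cleanly (the tail still points at us,
 * so we clear it: n,0 -> 0,1) or take the lock while preserving the tail
 * encoding of later arrivals (*,0 -> *,1). */
static void claim_lock(atomic_uint *lock, uint32_t tail)
{
	uint32_t val, old, new;

	/* *,x -> *,0 : wait for the owner to go away */
	while ((val = atomic_load(lock)) & Q_LOCKED_MASK)
		;	/* the kernel would cpu_relax() here */

	for (;;) {
		new = Q_LOCKED_VAL;
		if (val != tail)	/* someone queued behind us */
			new |= val;	/* keep their tail bits intact */

		old = val;
		if (atomic_compare_exchange_strong(lock, &old, new))
			break;
		val = old;	/* a new waiter changed the tail; retry */
	}
}
```

The cmpxchg can only fail here because a new arrival swapped in a fresh tail, which is exactly the "contended" arc in the state diagram; the CPUs already queued never touch the lock word at this point.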
Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock
+ * The basic principle of a queue-based spinlock can best be understood + * by studying a classic queue-based spinlock implementation called the + * MCS lock. The paper below provides a good description for this kind + * of lock. + * + * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf + * + * This queue spinlock implementation is based on the MCS lock, however to make + * it fit the 4 bytes we assume spinlock_t to be, and preserve its existing + * API, we must modify it some. + * + * In particular; where the traditional MCS lock consists of a tail pointer + * (8 bytes) and needs the next pointer (another 8 bytes) of its own node to + * unlock the next pending (next-locked), we compress both these: {tail, + * next-locked} into a single u32 value. + * + * Since a spinlock disables recursion of its own context and there is a limit + * to the contexts that can nest; namely: task, softirq, hardirq, nmi, we can + * encode the tail as and index indicating this context and a cpu number. + * + * We can further change the first spinner to spin on a bit in the lock word + * instead of its node; whereby avoiding the need to carry a node from lock to + * unlock, and preserving API. You also made changes (compared to the MCS) in that the unlock path is not spinning waiting for the successor and that the job of passing the lock is not done in the unlock path either. Instead all of that is now done in the path of the lock acquirer logic. Could you update the comment to say that please? Thanks. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 03/11] qspinlock: Add pending bit
On Sun, Jun 15, 2014 at 02:47:00PM +0200, Peter Zijlstra wrote: Because the qspinlock needs to touch a second cacheline; add a pending bit and allow a single in-word spinner before we punt to the second cacheline. Could you add this in the description please: And by second cacheline we mean the local 'node'. That is the: mcs_nodes[0] and mcs_nodes[idx] Perhaps it might be better then to split this in the header file as this is trying to not be a slowpath code - but rather - a pre-slow-path-lets-try-if-we can do another cmpxchg in case the unlocker has just unlocked itself. So something like: diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h index e8a7ae8..29cc9c7 100644 --- a/include/asm-generic/qspinlock.h +++ b/include/asm-generic/qspinlock.h @@ -75,11 +75,21 @@ extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val); */ static __always_inline void queue_spin_lock(struct qspinlock *lock) { - u32 val; + u32 val, new; val = atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL); if (likely(val == 0)) return; + + /* One more attempt - but if we fail mark it as pending. */ + if (val == _Q_LOCKED_VAL) { + new = Q_LOCKED_VAL |_Q_PENDING_VAL; + + old = atomic_cmpxchg(lock-val, val, new); + if (old == _Q_LOCKED_VAL) /* YEEY! */ + return; + val = old; + } queue_spin_lock_slowpath(lock, val); } and then the slowpath preserves most of the old logic path (with the pending bit stuff)? 
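The pending-bit handoff being debated above can be sketched in userspace with C11 atomics. Note the caveat the follow-ups dwell on: the unlock path only drops the lock byte, so the pending waiter itself must clear pending when it finally takes the lock. Constants follow the quoted patch; the store in `pending_acquire` assumes no queue tail has formed, as in the real pending fast path. A sketch, not the kernel code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define _Q_LOCKED_VAL	(1U << 0)
#define _Q_PENDING_VAL	(1U << 8)

/* (0,0,1) -> (0,1,1): try to become the single in-word spinner. */
static bool set_pending(_Atomic uint32_t *lockval)
{
	uint32_t expect = _Q_LOCKED_VAL;

	return atomic_compare_exchange_strong(lockval, &expect,
					      _Q_LOCKED_VAL | _Q_PENDING_VAL);
}

/* (0,1,0) -> (0,0,1): the owner left; clear pending AND set locked. */
static void pending_acquire(_Atomic uint32_t *lockval)
{
	while (atomic_load(lockval) & _Q_LOCKED_VAL)
		;	/* spin until the owner drops the lock byte */
	atomic_store(lockval, _Q_LOCKED_VAL);
}
```

The wait loop inside `pending_acquire` is exactly why this step cannot simply move into the inlined fast path, which is the objection raised later in the thread.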
Signed-off-by: Peter Zijlstra pet...@infradead.org --- include/asm-generic/qspinlock_types.h | 12 ++- kernel/locking/qspinlock.c| 109 +++--- 2 files changed, 97 insertions(+), 24 deletions(-) --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -39,8 +39,9 @@ typedef struct qspinlock { * Bitfields in the atomic value: * * 0- 7: locked byte - * 8- 9: tail index - * 10-31: tail cpu (+1) + * 8: pending + * 9-10: tail index + * 11-31: tail cpu (+1) */ #define _Q_SET_MASK(type) (((1U _Q_ ## type ## _BITS) - 1)\ _Q_ ## type ## _OFFSET) @@ -48,7 +49,11 @@ typedef struct qspinlock { #define _Q_LOCKED_BITS 8 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) -#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_OFFSET(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_BITS 1 +#define _Q_PENDING_MASK _Q_SET_MASK(PENDING) + +#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) #define _Q_TAIL_IDX_BITS 2 #define _Q_TAIL_IDX_MASK _Q_SET_MASK(TAIL_IDX) @@ -57,5 +62,6 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) #define _Q_LOCKED_VAL(1U _Q_LOCKED_OFFSET) +#define _Q_PENDING_VAL (1U _Q_PENDING_OFFSET) #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */ --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -83,24 +83,28 @@ static inline struct mcs_spinlock *decod return per_cpu_ptr(mcs_nodes[idx], cpu); } +#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) + /** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: Current value of the queue spinlock 32-bit word * - * (queue tail, lock bit) - * - * fast :slow : unlock - *: : - * uncontended (0,0) --:-- (0,1) :-- (*,0) - *: | ^./ : - *: v \ | : - * uncontended:(n,x) --+-- (n,0) | : - * queue: | ^--' | : - *: v | : - * contended :(*,x) --+-- (*,0) - (*,1) ---' : - * queue: ^--' : + * (queue tail, pending bit, lock bit) * + * fast :slow : unlock + * : 
:                                          :
+ * uncontended  (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
+ *                       :       | ^--------.------.             /  :
+ *                       :       v           \      \            |  :
+ * pending               :    (0,1,1) +--> (0,1,0)   \           |  :
+ *                       :       | ^--'              |           |  :
+ *                       :       v                   |           |  :
+ *
Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word
On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote: From: Waiman Long waiman.l...@hp.com This patch extracts the logic for the exchange of new and previous tail code words into a new xchg_tail() function which can be optimized in a later patch. And also adds a third try on acquiring the lock. That I think should be a seperate patch. And instead of saying 'later patch' you should spell out the name of the patch. Especially as this might not be obvious from somebody doing git bisection. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org --- include/asm-generic/qspinlock_types.h |2 + kernel/locking/qspinlock.c| 58 +- 2 files changed, 38 insertions(+), 22 deletions(-) --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -61,6 +61,8 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) + #define _Q_LOCKED_VAL(1U _Q_LOCKED_OFFSET) #define _Q_PENDING_VAL (1U _Q_PENDING_OFFSET) --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -86,6 +86,31 @@ static inline struct mcs_spinlock *decod #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) /** + * xchg_tail - Put in the new queue tail code word retrieve previous one + * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* - n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) +{ + u32 old, new, val = atomic_read(lock-val); + + for (;;) { + new = (val _Q_LOCKED_PENDING_MASK) | tail; + old = atomic_cmpxchg(lock-val, val, new); + if (old == val) + break; + + val = old; + } + return old; +} + +/** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: 
Current value of the queue spinlock 32-bit word @@ -182,36 +207,25 @@ void queue_spin_lock_slowpath(struct qsp node-next = NULL; /* - * we already touched the queueing cacheline; don't bother with pending - * stuff. - * - * trylock || xchg(lock, node) - * - * 0,0,0 - 0,0,1 ; trylock - * p,y,x - n,y,x ; prev = xchg(lock, node) + * We touched a (possibly) cold cacheline in the per-cpu queue node; + * attempt the trylock once more in the hope someone let go while we + * weren't watching. */ - for (;;) { - new = _Q_LOCKED_VAL; - if (val) - new = tail | (val _Q_LOCKED_PENDING_MASK); - - old = atomic_cmpxchg(lock-val, val, new); - if (old == val) - break; - - val = old; - } + if (queue_spin_trylock(lock)) + goto release; So now are three of them? One in queue_spin_lock, then at the start of this function when checking for the pending bit, and the once more here. And that is because the local cache line might be cold for the 'mcs_index' struct? That all seems to be a bit of experimental. But then we are already in the slowpath so we could as well do: for (i = 0; i 10; i++) if (queue_spin_trylock(lock)) goto release; And would have the same effect. /* - * we won the trylock; forget about queueing. + * we already touched the queueing cacheline; don't bother with pending + * stuff. I guess we could also just erase the pending bit if we wanted too. The optimistic spinning will still hit go to the queue label as lock-val will have the tail value. + * + * p,*,* - n,*,* */ - if (new == _Q_LOCKED_VAL) - goto release; + old = xchg_tail(lock, tail); /* * if there was a previous node; link it and wait. */ - if (old ~_Q_LOCKED_PENDING_MASK) { + if (old _Q_TAIL_MASK) { prev = decode_tail(old); ACCESS_ONCE(prev-next) = node; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
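The extracted xchg_tail() from the patch above can be modelled with C11 atomics. This sketch uses the open-coded cmpxchg-loop form shown in the quoted hunk (before the single-instruction optimization the changelog alludes to) and is not the kernel code itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define _Q_LOCKED_VAL		1U
#define _Q_LOCKED_PENDING_MASK	0x1ffU		/* locked byte + pending bit */
#define _Q_TAIL_MASK		(~_Q_LOCKED_PENDING_MASK)

/*
 * Swap in a new tail while preserving the locked+pending bits; returns
 * the full previous value.  The caller masks it with _Q_TAIL_MASK to see
 * whether there was a predecessor to link behind, as the slowpath does.
 */
static uint32_t xchg_tail(_Atomic uint32_t *lockval, uint32_t tail)
{
	uint32_t val = atomic_load(lockval);
	uint32_t old, new;

	for (;;) {
		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
		old = val;
		if (atomic_compare_exchange_strong(lockval, &old, new))
			return old;
		val = old;
	}
}
```

Because only the tail bits change, the locked and pending state visible to the fast path is untouched no matter how many queuers race through here.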
Re: [PATCH v11 14/16] pvqspinlock: Add qspinlock para-virtualization support
On Sun, Jun 15, 2014 at 03:16:54PM +0200, Peter Zijlstra wrote:

 On Thu, Jun 12, 2014 at 04:48:41PM -0400, Waiman Long wrote:
  I don't have a good understanding of the kernel alternatives mechanism.

 I didn't either; I do now, cost me a whole day reading up on
 alternative/paravirt code patching.

 See the patches I just send out; I got the 'native' case with paravirt
 enabled to be one NOP worse than the native case without paravirt -- for
 queue_spin_unlock. The lock slowpath is several nops and some pointless
 movs more expensive.

You could use asm goto, which would optimize the fast path to be the
'native' case. That way you wouldn't have the nops and movs in the path.
(And asm goto also uses the alternative_asm macros).
Re: [PATCH 03/11] qspinlock: Add pending bit
On Tue, Jun 17, 2014 at 04:51:57PM -0400, Waiman Long wrote: On 06/17/2014 04:36 PM, Konrad Rzeszutek Wilk wrote: On Sun, Jun 15, 2014 at 02:47:00PM +0200, Peter Zijlstra wrote: Because the qspinlock needs to touch a second cacheline; add a pending bit and allow a single in-word spinner before we punt to the second cacheline. Could you add this in the description please: And by second cacheline we mean the local 'node'. That is the: mcs_nodes[0] and mcs_nodes[idx] Perhaps it might be better then to split this in the header file as this is trying to not be a slowpath code - but rather - a pre-slow-path-lets-try-if-we can do another cmpxchg in case the unlocker has just unlocked itself. So something like: diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h index e8a7ae8..29cc9c7 100644 --- a/include/asm-generic/qspinlock.h +++ b/include/asm-generic/qspinlock.h @@ -75,11 +75,21 @@ extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val); */ static __always_inline void queue_spin_lock(struct qspinlock *lock) { -u32 val; +u32 val, new; val = atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL); if (likely(val == 0)) return; + +/* One more attempt - but if we fail mark it as pending. */ +if (val == _Q_LOCKED_VAL) { +new = Q_LOCKED_VAL |_Q_PENDING_VAL; + +old = atomic_cmpxchg(lock-val, val, new); +if (old == _Q_LOCKED_VAL) /* YEEY! */ +return; No, it can leave like that. The unlock path will not clear the pending bit. Err, you are right. It needs to go back in the slowpath. We are trying to make the fastpath as simple as possible as it may be inlined. The complexity of the queue spinlock is in the slowpath. Sure, but then it shouldn't be called slowpath anymore as it is not slow. It is a combination of fast path (the potential chance of grabbing the lock and setting the pending lock) and the real slow path (the queuing). Perhaps it should be called 'queue_spinlock_complex' ? 
 Moreover, a cmpxchg immediately followed by another cmpxchg will just
 increase the level of memory contention when a lock is fairly contended.
 The chance of the second cmpxchg() succeeding will be pretty low.

Then why even do the pending bit - which is what the slowpath does for the
first time. And if it grabs it (and sets the pending bit) it immediately
exits. Why not percolate that piece of code into this header, and leave all
that slow code (queuing, mcs_lock access, etc) in the slowpath?

 -Longman
Re: [PATCH 03/11] qspinlock: Add pending bit
On Tue, Jun 17, 2014 at 05:07:29PM -0400, Konrad Rzeszutek Wilk wrote: On Tue, Jun 17, 2014 at 04:51:57PM -0400, Waiman Long wrote: On 06/17/2014 04:36 PM, Konrad Rzeszutek Wilk wrote: On Sun, Jun 15, 2014 at 02:47:00PM +0200, Peter Zijlstra wrote: Because the qspinlock needs to touch a second cacheline; add a pending bit and allow a single in-word spinner before we punt to the second cacheline. Could you add this in the description please: And by second cacheline we mean the local 'node'. That is the: mcs_nodes[0] and mcs_nodes[idx] Perhaps it might be better then to split this in the header file as this is trying to not be a slowpath code - but rather - a pre-slow-path-lets-try-if-we can do another cmpxchg in case the unlocker has just unlocked itself. So something like: diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h index e8a7ae8..29cc9c7 100644 --- a/include/asm-generic/qspinlock.h +++ b/include/asm-generic/qspinlock.h @@ -75,11 +75,21 @@ extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val); */ static __always_inline void queue_spin_lock(struct qspinlock *lock) { - u32 val; + u32 val, new; val = atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL); if (likely(val == 0)) return; + + /* One more attempt - but if we fail mark it as pending. */ + if (val == _Q_LOCKED_VAL) { + new = Q_LOCKED_VAL |_Q_PENDING_VAL; + + old = atomic_cmpxchg(lock-val, val, new); + if (old == _Q_LOCKED_VAL) /* YEEY! */ + return; No, it can leave like that. The unlock path will not clear the pending bit. Err, you are right. It needs to go back in the slowpath. What I should have wrote is: if (old == 0) /* YEEY */ return; As that would the same thing as this patch does on the pending bit - that is if we can on the second compare and exchange set the pending bit (and the lock) and the lock has been released - we are good. And it is a quick path. We are trying to make the fastpath as simple as possible as it may be inlined. 
The complexity of the queue spinlock is in the slowpath. Sure, but then it shouldn't be called slowpath anymore as it is not slow. It is a combination of fast path (the potential chance of grabbing the lock and setting the pending lock) and the real slow path (the queuing). Perhaps it should be called 'queue_spinlock_complex' ? I forgot to mention - that was the crux of my comments - just change the slowpath to complex name at that point to better reflect what it does. Moreover, an cmpxchg followed immediately followed by another cmpxchg will just increase the level of memory contention when a lock is fairly contended. The chance of second cmpxchg() succeeding will be pretty low. Then why even do the pending bit - which is what the slowpath does for the first time. And if it grabs it (And sets the pending bit) it immediately exits. Why not perculate that piece of code in-to this header. And the leave all that slow code (queing, mcs_lock access, etc) in the slowpath. -Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 03/11] qspinlock: Add pending bit
On Jun 17, 2014 6:25 PM, Waiman Long waiman.l...@hp.com wrote: On 06/17/2014 05:10 PM, Konrad Rzeszutek Wilk wrote: On Tue, Jun 17, 2014 at 05:07:29PM -0400, Konrad Rzeszutek Wilk wrote: On Tue, Jun 17, 2014 at 04:51:57PM -0400, Waiman Long wrote: On 06/17/2014 04:36 PM, Konrad Rzeszutek Wilk wrote: On Sun, Jun 15, 2014 at 02:47:00PM +0200, Peter Zijlstra wrote: Because the qspinlock needs to touch a second cacheline; add a pending bit and allow a single in-word spinner before we punt to the second cacheline. Could you add this in the description please: And by second cacheline we mean the local 'node'. That is the: mcs_nodes[0] and mcs_nodes[idx] Perhaps it might be better then to split this in the header file as this is trying to not be a slowpath code - but rather - a pre-slow-path-lets-try-if-we can do another cmpxchg in case the unlocker has just unlocked itself. So something like: diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h index e8a7ae8..29cc9c7 100644 --- a/include/asm-generic/qspinlock.h +++ b/include/asm-generic/qspinlock.h @@ -75,11 +75,21 @@ extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val); */ static __always_inline void queue_spin_lock(struct qspinlock *lock) { - u32 val; + u32 val, new; val = atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL); if (likely(val == 0)) return; + + /* One more attempt - but if we fail mark it as pending. */ + if (val == _Q_LOCKED_VAL) { + new = Q_LOCKED_VAL |_Q_PENDING_VAL; + + old = atomic_cmpxchg(lock-val, val, new); + if (old == _Q_LOCKED_VAL) /* YEEY! */ + return; No, it can leave like that. The unlock path will not clear the pending bit. Err, you are right. It needs to go back in the slowpath. What I should have wrote is: if (old == 0) /* YEEY */ return; Unfortunately, that still doesn't work. If old is 0, it just meant the cmpxchg failed. It still haven't got the lock. 
As that would the same thing as this patch does on the pending bit - that is if we can on the second compare and exchange set the pending bit (and the lock) and the lock has been released - we are good. That is not true. When the lock is freed, the pending bit holder will still have to clear the pending bit and set the lock bit as is done in the slowpath. We cannot skip the step here. The problem of moving the pending code here is that it includes a wait loop which we don't want to put in the fastpath. And it is a quick path. We are trying to make the fastpath as simple as possible as it may be inlined. The complexity of the queue spinlock is in the slowpath. Sure, but then it shouldn't be called slowpath anymore as it is not slow. It is a combination of fast path (the potential chance of grabbing the lock and setting the pending lock) and the real slow path (the queuing). Perhaps it should be called 'queue_spinlock_complex' ? I forgot to mention - that was the crux of my comments - just change the slowpath to complex name at that point to better reflect what it does. Actually in my v11 patch, I subdivided the slowpath into a slowpath for the pending code and slowerpath for actual queuing. Perhaps, we could use quickpath and slowpath instead. Anyway, it is a minor detail that we can discuss after the core code get merged. -Longman Why not do it the right way the first time around? That aside - these optimization - seem to make the code harder to read. And they do remind me of the scheduler code in 2.6.x which was based on heuristics - and eventually ripped out. So are these optimizations based on turning off certain hardware features? Say hardware prefetching? What I am getting at - can the hardware do this at some point (or perhaps already does on IvyBridge-EX?) - that is prefetch the per-cpu areas so they are always hot? And rendering this optimization not needed? Thanks! 
Re: [PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock
On Sun, Jun 15, 2014 at 02:46:58PM +0200, Peter Zijlstra wrote: From: Waiman Long waiman.l...@hp.com This patch introduces a new generic queue spinlock implementation that can serve as an alternative to the default ticket spinlock. Compared with the ticket spinlock, this queue spinlock should be almost as fair as the ticket spinlock. It has about the same speed in single-thread and it can be much faster in high contention situations especially when the spinlock is embedded within the data structure to be protected. Only in light to moderate contention where the average queue depth is around 1-3 will this queue spinlock be potentially a bit slower due to the higher slowpath overhead. This queue spinlock is especially suit to NUMA machines with a large number of cores as the chance of spinlock contention is much higher in those machines. The cost of contention is also higher because of slower inter-node memory traffic. Due to the fact that spinlocks are acquired with preemption disabled, the process will not be migrated to another CPU while it is trying to get a spinlock. Ignoring interrupt handling, a CPU can only be contending in one spinlock at any one time. Counting soft IRQ, hard IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting activities. By allocating a set of per-cpu queue nodes and used them to form a waiting queue, we can encode the queue node address into a much smaller 24-bit size (including CPU number and queue node index) leaving one byte for the lock. Please note that the queue node is only needed when waiting for the lock. Once the lock is acquired, the queue node can be released to be used later. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org Thank you for the repost. I have some questions about the implementation that hopefully will be easy to answer and said answers I hope can be added in the code to enlighten other folks. See below. .. snip.. 
Index: linux-2.6/kernel/locking/mcs_spinlock.h === --- linux-2.6.orig/kernel/locking/mcs_spinlock.h +++ linux-2.6/kernel/locking/mcs_spinlock.h @@ -17,6 +17,7 @@ struct mcs_spinlock { struct mcs_spinlock *next; int locked; /* 1 if lock acquired */ + int count; This could use a comment. }; #ifndef arch_mcs_spin_lock_contended Index: linux-2.6/kernel/locking/qspinlock.c === --- /dev/null +++ linux-2.6/kernel/locking/qspinlock.c @@ -0,0 +1,197 @@ +/* + * Queue spinlock + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P. + * + * Authors: Waiman Long waiman.l...@hp.com + * Peter Zijlstra pzijl...@redhat.com + */ +#include linux/smp.h +#include linux/bug.h +#include linux/cpumask.h +#include linux/percpu.h +#include linux/hardirq.h +#include linux/mutex.h +#include asm/qspinlock.h + +/* + * The basic principle of a queue-based spinlock can best be understood + * by studying a classic queue-based spinlock implementation called the + * MCS lock. The paper below provides a good description for this kind + * of lock. + * + * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf + * + * This queue spinlock implementation is based on the MCS lock, however to make + * it fit the 4 bytes we assume spinlock_t to be, and preserve its existing + * API, we must modify it some. 
+ *
+ * In particular; where the traditional MCS lock consists of a tail pointer
+ * (8 bytes) and needs the next pointer (another 8 bytes) of its own node to
+ * unlock the next pending (next->locked), we compress both these: {tail,
+ * next->locked} into a single u32 value.
+ *
+ * Since a spinlock disables recursion of its own context and there is a limit
+ * to the contexts that can nest; namely: task, softirq, hardirq, nmi, we can
+ * encode the tail as an index indicating this context and a cpu number.
+ *
+ * We can further change the first spinner to spin on a bit in the lock word
+ * instead of its node; whereby avoiding the need to carry a node from lock to
+ * unlock, and preserving API.
+ */
+
+#include "mcs_spinlock.h"
+
+/*
+ * Per-CPU queue node structures; we can never have more than 4 nested
+ * contexts: task, softirq, hardirq, nmi.
+ *
+ * Exactly fits one cacheline.
+ */
+static
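The per-CPU node scheme the comment describes - four MCS nodes per CPU, indexed by nesting level, with the in-use count kept in node 0 - can be sketched like this. A single "CPU" array suffices for illustration; the kernel uses a per-cpu array and `this_cpu_ptr`, and the field names here follow the quoted structure.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4	/* task, softirq, hardirq, nmi */

struct mcs_spinlock {
	struct mcs_spinlock *next;
	int locked;	/* 1 if lock acquired */
	int count;	/* nesting count: only meaningful in nodes[0] */
};

/* In the kernel this is a per-cpu array; one array suffices for a sketch. */
static struct mcs_spinlock nodes[MAX_NODES];

/* Pick the node for the current nesting level on the lock side. */
static struct mcs_spinlock *grab_node(void)
{
	int idx = nodes[0].count++;	/* 0..3; spinlocks nest at most 4 deep */

	return &nodes[idx];
}

/* Return the node once the lock is acquired, mirroring the release side. */
static void release_node(void)
{
	nodes[0].count--;
}
```

Since preemption is disabled while spinning, the count can only be changed by the CPU that owns the array, which is why no atomics are needed for it.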
Re: [PATCH 00/11] qspinlock with paravirt support
On Sun, Jun 15, 2014 at 02:46:57PM +0200, Peter Zijlstra wrote: Since Waiman seems incapable of doing simple things; here's my take on the paravirt crap. The first few patches are taken from Waiman's latest series, but the virt support is completely new. Its primary aim is to not mess up the native code. OK. I finally cleared some time to look over this and are reading the code in details to make sure I have it clear in mind. I will most likely ask some questions that are naive - hopefully they will lead to the code being self-explanatory for anybody else taking a stab at understanding them when bugs appear. I've not stress tested it, but the virt and paravirt (kvm) cases boot on simple smp guests. I've not done Xen, but the patch should be simple and similar. Looking forward to seeing it. Glancing over the KVM one and comparing it to the original version that Waiman posted it should be fairly simple. Perhaps even some of the code could be shared? I ripped out all the unfair nonsense as its not at all required for paravirt and optimizations that make paravirt better at the cost of code clarity and/or native performance are just not worth it. Also; if we were to ever add some of that unfair nonsense you do so _after_ you got the simple things working. The thing I'm least sure about is the head tracking, I chose to do something different from what Waiman did, because his is O(nr_cpus) and had the assumption that guests have small nr_cpus. AFAIK this is not at all true. The biggest problem I have with what I did is that it contains wait loops itself. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v10 00/19] qspinlock: a 4-byte queue spinlock with PV support
On Wed, May 07, 2014 at 11:01:28AM -0400, Waiman Long wrote:

 v9->v10:
  - Make some minor changes to qspinlock.c to accommodate review feedback.
  - Change author to PeterZ for 2 of the patches.
  - Include Raghavendra KT's test results in patch 18.

Any chance you can post these on a git tree?

Thanks.
Re: [PATCH v10 18/19] pvqspinlock, x86: Enable PV qspinlock PV for KVM
Raghavendra KT had done some performance testing on this patch with the
following results:

 Overall we are seeing good improvement for pv-unfair version.

 System: 32 cpu sandybridge with HT on (4 node with 32 GB each)
 Guest : 8GB with 16 vcpu/VM.
 Average was taken over 8-10 data points.

 Base = 3.15-rc2 with PRAVIRT_SPINLOCK = y

 A = 3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y
     PRAVIRT_SPINLOCK = y PARAVIRT_UNFAIR_LOCKS = y (unfair lock)

 B = 3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y
     PRAVIRT_SPINLOCK = n PARAVIRT_UNFAIR_LOCKS = n
     (queue spinlock without paravirt)

 C = 3.15-rc2 + qspinlock v9 patch with QUEUE_SPINLOCK = y
     PRAVIRT_SPINLOCK = y PARAVIRT_UNFAIR_LOCKS = n
     (queue spinlock with paravirt)

Could you do s/PRAVIRT/PARAVIRT/ please?

 Ebizzy %improvements
 overcommit        A          B          C
 0.5x           4.4265     2.0611     1.5824
 1.0x           0.9015    -7.7828     4.5443
 1.5x          46.1162    -2.9845    -3.5046
 2.0x          99.8150    -2.7116     4.7461

Considering B sucks

 Dbench %improvements
 overcommit        A          B          C
 0.5x           3.2617     3.5436     2.5676
 1.0x           0.6302     2.2342     5.2201
 1.5x           5.0027     4.8275     3.8375
 2.0x          23.8242     4.5782    12.6067

 Absolute values of base results: (overcommit, value, stdev)

 Ebizzy (records/sec with 120 sec run)
 0.5x  20941.8750 (2%)
 1.0x  17623.8750 (5%)
 1.5x   5874.7778 (15%)
 2.0x   3581.8750 (7%)

 Dbench (throughput in MB/sec)
 0.5x  10009.6610 (5%)
 1.0x   6583.0538 (1%)
 1.5x   3991.9622 (4%)
 2.0x   2527.0613 (2.5%)

 Signed-off-by: Waiman Long waiman.l...@hp.com
 Tested-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 ---
  arch/x86/kernel/kvm.c | 135 +
  kernel/Kconfig.locks  |   2 +-
  2 files changed, 136 insertions(+), 1 deletions(-)

 diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
 index 7ab8ab3..eef427b 100644
 --- a/arch/x86/kernel/kvm.c
 +++ b/arch/x86/kernel/kvm.c
 @@ -567,6 +567,7 @@ static void kvm_kick_cpu(int cpu)
  	kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
  }

 +#ifndef CONFIG_QUEUE_SPINLOCK
  enum kvm_contention_stat {
  	TAKEN_SLOW,
  	TAKEN_SLOW_PICKUP,
 @@ -794,6 +795,134 @@ static void kvm_unlock_kick(struct arch_spinlock *lock,
__ticket_t ticket) } } } +#else /* !CONFIG_QUEUE_SPINLOCK */ + +#ifdef CONFIG_KVM_DEBUG_FS +static struct dentry *d_spin_debug; +static struct dentry *d_kvm_debug; +static u32 kick_nohlt_stats; /* Kick but not halt count */ +static u32 halt_qhead_stats; /* Queue head halting count */ +static u32 halt_qnode_stats; /* Queue node halting count */ +static u32 halt_abort_stats; /* Halting abort count */ +static u32 wake_kick_stats; /* Wakeup by kicking count */ +static u32 wake_spur_stats; /* Spurious wakeup count*/ +static u64 time_blocked; /* Total blocking time */ + +static int __init kvm_spinlock_debugfs(void) +{ + d_kvm_debug = debugfs_create_dir(kvm-guest, NULL); + if (!d_kvm_debug) { + printk(KERN_WARNING +Could not create 'kvm' debugfs directory\n); + return -ENOMEM; + } + d_spin_debug = debugfs_create_dir(spinlocks, d_kvm_debug); + + debugfs_create_u32(kick_nohlt_stats, +0644, d_spin_debug, kick_nohlt_stats); + debugfs_create_u32(halt_qhead_stats, +0644, d_spin_debug, halt_qhead_stats); + debugfs_create_u32(halt_qnode_stats, +0644, d_spin_debug, halt_qnode_stats); + debugfs_create_u32(halt_abort_stats, +0644, d_spin_debug, halt_abort_stats); + debugfs_create_u32(wake_kick_stats, +0644, d_spin_debug, wake_kick_stats); + debugfs_create_u32(wake_spur_stats, +0644, d_spin_debug, wake_spur_stats); + debugfs_create_u64(time_blocked, +0644, d_spin_debug, time_blocked); + return 0; +} + +static inline void kvm_halt_stats(enum pv_lock_stats type) +{ + if (type == PV_HALT_QHEAD) + add_smp(halt_qhead_stats, 1); + else if (type == PV_HALT_QNODE) + add_smp(halt_qnode_stats, 1); + else /* type == PV_HALT_ABORT */ + add_smp(halt_abort_stats, 1); +} + +static inline void kvm_lock_stats(enum pv_lock_stats type) +{ + if (type == PV_WAKE_KICKED) + add_smp(wake_kick_stats, 1); + else if (type == PV_WAKE_SPURIOUS) + add_smp(wake_spur_stats, 1); + else /* type == PV_KICK_NOHALT */ + add_smp(kick_nohlt_stats, 1); +} + +static inline u64 spin_time_start(void) +{ + return 
sched_clock(); +} + +static inline void spin_time_accum_blocked(u64 start) +{ + u64 delta; + +
Re: [PATCH v9 05/19] qspinlock: Optimize for smaller NR_CPUS
On Wed, Apr 23, 2014 at 10:23:43AM -0400, Waiman Long wrote: On 04/18/2014 05:40 PM, Waiman Long wrote: On 04/18/2014 03:05 PM, Peter Zijlstra wrote: On Fri, Apr 18, 2014 at 01:52:50PM -0400, Waiman Long wrote: I am confused by your notation. Nah, I think I was confused :-) Make the 1 _Q_LOCKED_VAL though, as that's the proper constant to use. Everyone gets confused once in a while :-) I have plenty of that myself. I will change 1 to _Q_LOCKED_VAL as suggested. -Longman

The attached patch file contains the additional changes that I had made to the qspinlock.c file so far. Please let me know if you or others have any additional feedback or changes that will need to go into the next version of the patch series. I am going to take vacation starting from tomorrow and will be back on 5/5 (Mon), so I will not be able to respond to emails within this period. BTW, is there any chance that this patch can be merged into 3.16?

Um, it needs to have Acks from the KVM and Xen maintainers, who have not done so. Also Peter needs to chime in. (BTW, please CC xen-de...@lists.xenproject.org next time you post so that David and Boris can take a peek at it.) I would strongly recommend you put all your patches on github (free git service) so that we can test and poke at them during your vacation (and even after). -Longman

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index be2adca..2e184b8 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -25,10 +25,6 @@
 #include <asm/byteorder.h>
 #include <asm/qspinlock.h>

-#if !defined(__LITTLE_ENDIAN) && !defined(__BIG_ENDIAN)
-#error "Missing either LITTLE_ENDIAN or BIG_ENDIAN definition."
-#endif
-
 /*
  * The basic principle of a queue-based spinlock can best be understood
  * by studying a classic queue-based spinlock implementation called the
@@ -200,7 +196,7 @@ clear_pending_set_locked(struct qspinlock *lock, u32 val)
 {
 	struct __qspinlock *l = (void *)lock;

-	ACCESS_ONCE(l->locked_pending) = 1;
+	ACCESS_ONCE(l->locked_pending) = _Q_LOCKED_VAL;
 }

 /*
@@ -567,16 +563,16 @@ static __always_inline int get_qlock(struct qspinlock *lock)
 /**
  * trylock_pending - try to acquire queue spinlock using the pending bit
  * @lock : Pointer to queue spinlock structure
- * @pval : Pointer to value of the queue spinlock 32-bit word
+ * @val  : Current value of the queue spinlock 32-bit word
  * Return: 1 if lock acquired, 0 otherwise
  *
  * The pending bit won't be set as soon as one or more tasks queue up.
  * This function should only be called when lock stealing will not happen.
  * Otherwise, it has to be disabled.
  */
-static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
+static inline int trylock_pending(struct qspinlock *lock, u32 val)
 {
-	u32 old, new, val = *pval;
+	u32 old, new;
 	int retry = 1;

 	/*
@@ -593,8 +589,7 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
 	if (val & _Q_TAIL_MASK)
 		return 0;

-	if ((val & _Q_LOCKED_PENDING_MASK) ==
-	    (_Q_LOCKED_VAL|_Q_PENDING_VAL)) {
+	if (val == (_Q_LOCKED_VAL|_Q_PENDING_VAL)) {
 		/*
 		 * If both the lock and pending bits are set, we wait
 		 * a while to see if that either bit will be cleared.
 		 */
@@ -605,9 +600,9 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
 			retry--;
 			cpu_relax();
 			cpu_relax();
-			*pval = val = atomic_read(&lock->val);
+			val = atomic_read(&lock->val);
 			continue;
-		} else if ((val & _Q_LOCKED_PENDING_MASK) == _Q_PENDING_VAL) {
+		} else if (val == _Q_PENDING_VAL) {
 			/*
 			 * Pending bit is set, but not the lock bit.
 			 * Assuming that the pending bit holder is going to
@@ -615,7 +610,7 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
 			 * it is better to wait than to exit at this point.
 			 */
 			cpu_relax();
-			*pval = val = atomic_read(&lock->val);
+			val = atomic_read(&lock->val);
 			continue;
 		}
@@ -627,7 +622,7 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
 		if (old == val)
 			break;

-		*pval = val = old;
+		val = old;
 	}

 	/*
@@ -643,7 +638,7 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
 	 * this wait loop must be a load-acquire such that we match the
 	 * store-release that clears the locked bit and create lock
-	 * sequentiality; this because
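For readers following the patch outside the kernel tree, the pending-bit fast path that trylock_pending() implements can be modeled in userspace with C11 atomics. This is an illustrative sketch under simplified assumptions (a made-up bit layout, a single pending waiter, and no queue-tail handling), not the kernel's actual code:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <assert.h>

/* Illustrative bit layout (not the kernel's): bit 0 = locked,
 * bit 8 = pending, bits 16+ = queue tail. */
#define Q_LOCKED_VAL  (1u << 0)
#define Q_PENDING_VAL (1u << 8)
#define Q_TAIL_MASK   (~0u << 16)

/* Model of the pending-bit fast path: claim the pending slot with a
 * cmpxchg, wait for the current holder to release, then take the lock.
 * Returns true if the lock was acquired, false if the caller must queue. */
static bool model_trylock_pending(atomic_uint *lock, unsigned int val)
{
    for (;;) {
        if (val & Q_TAIL_MASK)      /* queue in use: go to the slowpath */
            return false;
        if (val & Q_PENDING_VAL)    /* pending slot already taken       */
            return false;           /* (the real code waits a bit here) */

        unsigned int new = val | Q_PENDING_VAL;
        if (atomic_compare_exchange_weak(lock, &val, new))
            break;                  /* the pending bit is now ours      */
        /* a failed cmpxchg reloads val into our copy; just retry */
    }

    /* Spin until the holder drops the lock bit (cpu_relax() in-kernel),
     * then become the owner: set locked, clear pending. */
    while (atomic_load(lock) & Q_LOCKED_VAL)
        ;
    atomic_store(lock, Q_LOCKED_VAL);  /* simplification: assumes empty tail */
    return true;
}
```

The key property mirrored from the patch above is that `val` is passed by value: a stale snapshot is simply refreshed by the cmpxchg failure path rather than written back through a pointer.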
Re: [PATCH v9 05/19] qspinlock: Optimize for smaller NR_CPUS
On Wed, Apr 23, 2014 at 01:43:58PM -0400, Waiman Long wrote: On 04/23/2014 10:56 AM, Konrad Rzeszutek Wilk wrote: On Wed, Apr 23, 2014 at 10:23:43AM -0400, Waiman Long wrote: On 04/18/2014 05:40 PM, Waiman Long wrote: On 04/18/2014 03:05 PM, Peter Zijlstra wrote: On Fri, Apr 18, 2014 at 01:52:50PM -0400, Waiman Long wrote: I am confused by your notation. Nah, I think I was confused :-) Make the 1 _Q_LOCKED_VAL though, as that's the proper constant to use. Everyone gets confused once in a while :-) I have plenty of that myself. I will change 1 to _Q_LOCKED_VAL as suggested. -Longman

The attached patch file contains the additional changes that I had made to the qspinlock.c file so far. Please let me know if you or others have any additional feedback or changes that will need to go into the next version of the patch series. I am going to take vacation starting from tomorrow and will be back on 5/5 (Mon), so I will not be able to respond to emails within this period. BTW, is there any chance that this patch can be merged into 3.16?

Um, it needs to have Acks from the KVM and Xen maintainers, who have not done so. Also Peter needs to chime in. (BTW, please CC xen-de...@lists.xenproject.org next time you post so that David and Boris can take a peek at it.)

I will cc xen-de...@lists.xenproject.org when I send out the next patch.

I would strongly recommend you put all your patches on github (free git service) so that we can test and poke at them during your vacation (and even after).

I am not used to setting up a public repo on github. If I create a repo there, should I put a snapshot of the whole kernel source tree or just a portion of the relevant files as the base? With the latter, it won't be buildable.

You just push your local branch. It should look like a normal Linux tree with your commits on top. -Longman
___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v9 00/19] qspinlock: a 4-byte queue spinlock with PV support
On Thu, Apr 17, 2014 at 09:48:36PM -0400, Waiman Long wrote: On 04/17/2014 01:23 PM, Konrad Rzeszutek Wilk wrote: On Thu, Apr 17, 2014 at 11:03:52AM -0400, Waiman Long wrote: v8-v9: - Integrate PeterZ's version of the queue spinlock patch with some modification: http://lkml.kernel.org/r/20140310154236.038181...@infradead.org - Break the more complex patches into smaller ones to ease review effort. - Fix a racing condition in the PV qspinlock code. I am not seeing anything mentioning that the overcommit scenario for KVM and Xen had been fixed. Or was the 'racing condition' said issue? Thanks. The hanging is caused by a racing condition which should be fixed in the v9 patch. Please let me know if you are still seeing it. OK, is there a git tree with these patches to easily slurp them up? Thanks! -Longman
Re: [PATCH v9 03/19] qspinlock: Add pending bit
On Fri, Apr 18, 2014 at 12:23:29PM -0400, Waiman Long wrote: On 04/18/2014 03:42 AM, Ingo Molnar wrote: * Waiman Longwaiman.l...@hp.com wrote: Because the qspinlock needs to touch a second cacheline; add a pending bit and allow a single in-word spinner before we punt to the second cacheline. Signed-off-by: Peter Zijlstrapet...@infradead.org Signed-off-by: Waiman Longwaiman.l...@hp.com This patch should have a From: Peter in it as well, right? Thanks, Ingo Do you mean a From: line in the mail header? It will be a bit hard to have different From: header in the same patch series. I can certainly do that if there is an easy way to do it. It is pretty easy. Just do 'git commit --amend --author The Right Author' and when you send the patches (git send-email) it will include that. -Longman
Re: [PATCH v9 00/19] qspinlock: a 4-byte queue spinlock with PV support
On Thu, Apr 17, 2014 at 11:03:52AM -0400, Waiman Long wrote: v8-v9: - Integrate PeterZ's version of the queue spinlock patch with some modification: http://lkml.kernel.org/r/20140310154236.038181...@infradead.org - Break the more complex patches into smaller ones to ease review effort. - Fix a racing condition in the PV qspinlock code. I am not seeing anything mentioning that the overcommit scenario for KVM and Xen had been fixed. Or was the 'racing condition' said issue? Thanks.
Re: [PATCH v8 01/10] qspinlock: A generic 4-byte queue spinlock implementation
On Mon, Apr 07, 2014 at 04:12:58PM +0200, Peter Zijlstra wrote: On Fri, Apr 04, 2014 at 12:57:27PM -0400, Konrad Rzeszutek Wilk wrote: On Fri, Apr 04, 2014 at 03:00:12PM +0200, Peter Zijlstra wrote: So I'm just not ever going to pick up this patch; I spent a week trying to reverse engineer this; I posted a 7 patch series creating the equivalent, but in a gradual and readable fashion: http://lkml.kernel.org/r/20140310154236.038181...@infradead.org You keep on ignoring that; I'll keep on ignoring your patches. I might at some point rewrite some of your pv stuff on top to get this moving again, but I'm not really motivated to work with you atm. Uh? Did you also CC xen-de...@lists.xenproject.org on your patches, Peter? I hadn't had a chance to see or comment on them :-( No of course not :-) Also as noted elsewhere, I didn't actually do any PV muck yet. I spent the time trying to get my head around patch 1; all the while Waiman kept piling more and more on top. Ah, I see. Looking forward to seeing your 'muck' code then :-) Thanks!
Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support
On Wed, Apr 02, 2014 at 10:32:01AM -0400, Konrad Rzeszutek Wilk wrote: On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote: N.B. Sorry for the duplicate. This patch series was resent as the original one was rejected by the vger.kernel.org list server due to a long header. There is no change in content.

v7->v8:
- Remove one unneeded atomic operation from the slowpath, thus improving performance.
- Simplify some of the code and add more comments.
- Test for the X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable unfair lock.
- Reduce unfair lock slowpath lock stealing frequency depending on its distance from the queue head.
- Add performance data for IvyBridge-EX CPU.

FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an HVM guest under Xen after a while stops working. The workload is doing 'make -j32' on the Linux kernel. Completely unresponsive. Thoughts? Each VCPU seems to be stuck with this stack trace:

rip: 810013a8 xen_hypercall_sched_op+0x8 flags: 0002 nz rsp: 88029f13fb98 rax: rcx: fffa rdx: rbx: rsi: 88029f13fba8 rdi: 0003 rbp: 88029f13fbd0 r8: 8807ee65a1c0 r9: 88080d800b10 r10: 48cb r11: r12: 0013 r13: 0004 r14: 0001 r15: ea00076a8cd0 cs: 0010 ss: ds: es: fs: @ 2b24c3e7e380 gs: @ 88080e20/

Code (instr addr 810013a8): cc cc cc cc cc cc cc cc cc cc cc cc cc b8 1d 00 00 00 0f 01 c1 c3 cc cc cc cc cc cc cc cc cc cc

Stack: 81352d9e 00299f13fbb0 88029f13fba4 88020001 88029f13fbd0 0045 88029f13fbe0 81354240 88029f13fc00 81012cb6 88080f4da200 88080e214b00 88029f13fc48 815e4631

Call Trace:
[810013a8] xen_hypercall_sched_op+0x8 --
[81352d9e] xen_poll_irq_timeout+0x3e
[81354240] xen_poll_irq+0x10
[81012cb6] xen_hibernate+0x46
[815e4631] queue_spin_lock_slowerpath+0x84
[810ab96e] queue_spin_lock_slowpath+0xee
[815eff8f] _raw_spin_lock_irqsave+0x3f
[81144e4d] pagevec_lru_move_fn+0x8d
[81144780] __pagevec_lru_add_fn
[81144ed7] __pagevec_lru_add+0x17
[81145540] __lru_cache_add+0x60
[8114590e] lru_cache_add+0xe
[8116d4ba] page_add_new_anon_rmap+0xda
[81162ab1] handle_mm_fault+0xaa1
[81169d42] mmap_region+0x2c2
[815f3c4d] __do_page_fault+0x18d
[811544e1] vm_mmap_pgoff+0xb1
[815f3fdb] do_page_fault+0x2b
[815f06c8] page_fault+0x28
rip: 810013a8 xen_hypercall_sched_op+0x8

(CC'ing Marcos who had run the test)

v6->v7:
- Remove an atomic operation from the 2-task contending code.
- Shorten the names of some macros.
- Make the queue waiter attempt to steal the lock when unfair lock is enabled.
- Remove lock holder kick from the PV code and fix a race condition.
- Run the unfair lock PV code on overcommitted KVM guests to collect performance data.

v5->v6:
- Change the optimized 2-task contending code to make it fairer at the expense of a bit of performance.
- Add a patch to support unfair queue spinlock for Xen.
- Modify the PV qspinlock code to follow what was done in the PV ticketlock.
- Add performance data for the unfair lock as well as the PV support code.

v4->v5:
- Move the optimized 2-task contending code to the generic file to enable more architectures to use it without code duplication.
- Address some of the style-related comments by PeterZ.
- Allow the use of unfair queue spinlock in a real para-virtualized execution environment.
- Add para-virtualization support to the qspinlock code by ensuring that the lock holder and queue head stay alive as much as possible.

v3->v4:
- Remove debugging code and fix a configuration error.
- Simplify the qspinlock structure and streamline the code to make it perform a bit better.
- Add an x86 version of asm/qspinlock.h for holding x86 specific optimization.
- Add an optimized x86 code path for 2 contending tasks to improve low contention performance.

v2->v3:
- Simplify the code by using numerous mode only without an unfair option.
- Use the latest smp_load_acquire()/smp_store_release() barriers.
- Move the queue spinlock code to kernel/locking.
- Make the use of queue spinlock the default for x86-64 without user configuration.
- Additional performance tuning.

v1->v2:
- Add some more comments to document what the code does.
- Add a numerous CPU mode to support >= 16K CPUs
- Add a configuration option to allow lock stealing which
Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support
On Thu, Apr 03, 2014 at 10:57:18PM -0400, Waiman Long wrote: On 04/03/2014 01:23 PM, Konrad Rzeszutek Wilk wrote: On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote: On 04/02/2014 04:35 PM, Waiman Long wrote: On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote: On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote: N.B. Sorry for the duplicate. This patch series were resent as the original one was rejected by the vger.kernel.org list server due to long header. There is no change in content. v7-v8: - Remove one unneeded atomic operation from the slowpath, thus improving performance. - Simplify some of the codes and add more comments. - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable unfair lock. - Reduce unfair lock slowpath lock stealing frequency depending on its distance from the queue head. - Add performance data for IvyBridge-EX CPU. FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an HVM guest under Xen after a while stops working. The workload is doing 'make -j32' on the Linux kernel. Completely unresponsive. Thoughts? Thank for reporting that. I haven't done that much testing on Xen. My focus was in KVM. I will perform more test on Xen to see if I can reproduce the problem. BTW, does the halting and sending IPI mechanism work in HVM? I saw Yes. that in RHEL7, PV spinlock was explicitly disabled when in HVM mode. However, this piece of code isn't in upstream code. So I wonder if there is problem with that. The PV ticketlock fixed it for HVM. It was disabled before because the PV guests were using bytelocks while the HVM were using ticketlocks and you couldnt' swap in PV bytelocks for ticketlocks during startup. The RHEL7 code has used PV ticketlock already. RHEL7 uses a single kernel for all configurations. So PV ticketlock as well as Xen and KVM support was compiled in. 
I think booting the kernel on bare metal will cause the Xen code to work in HVM mode thus activating the PV spinlock code which has a negative impact on performance. Huh? -EPARSE That may be why it was disabled so that the bare metal performance will not be impacted. I am not following you. BTW, could you send me more information about the configuration of the machine, like the .config file that you used? Marcos, could you please send that information to Peter. Thanks! -Longman
Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support
On Fri, Apr 04, 2014 at 01:13:17PM -0400, Waiman Long wrote: On 04/04/2014 12:55 PM, Konrad Rzeszutek Wilk wrote: On Thu, Apr 03, 2014 at 10:57:18PM -0400, Waiman Long wrote: On 04/03/2014 01:23 PM, Konrad Rzeszutek Wilk wrote: On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote: On 04/02/2014 04:35 PM, Waiman Long wrote: On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote: On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote: N.B. Sorry for the duplicate. This patch series were resent as the original one was rejected by the vger.kernel.org list server due to long header. There is no change in content. v7-v8: - Remove one unneeded atomic operation from the slowpath, thus improving performance. - Simplify some of the codes and add more comments. - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable unfair lock. - Reduce unfair lock slowpath lock stealing frequency depending on its distance from the queue head. - Add performance data for IvyBridge-EX CPU. FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an HVM guest under Xen after a while stops working. The workload is doing 'make -j32' on the Linux kernel. Completely unresponsive. Thoughts? Thank for reporting that. I haven't done that much testing on Xen. My focus was in KVM. I will perform more test on Xen to see if I can reproduce the problem. BTW, does the halting and sending IPI mechanism work in HVM? I saw Yes. that in RHEL7, PV spinlock was explicitly disabled when in HVM mode. However, this piece of code isn't in upstream code. So I wonder if there is problem with that. The PV ticketlock fixed it for HVM. It was disabled before because the PV guests were using bytelocks while the HVM were using ticketlocks and you couldnt' swap in PV bytelocks for ticketlocks during startup. The RHEL7 code has used PV ticketlock already. RHEL7 uses a single kernel for all configurations. So PV ticketlock as well as Xen and KVM support was compiled in. 
I think booting the kernel on bare metal will cause the Xen code to work in HVM mode thus activating the PV spinlock code which has a negative impact on performance. Huh? -EPARSE That may be why it was disabled so that the bare metal performance will not be impacted. I am not following you. What I am saying is that when XEN and PV spinlock support is compiled into the current upstream kernel, the PV spinlock jump label is turned on when booted on bare metal. In other words, the PV spinlock code is How does it turn it on? I see that the jump labels are only turned on when it detects that it is running under Xen or KVM. It won't turn it on under bare metal. active even when it is not needed and actually slows things down in that situation. This is a problem and we need to find a way to make sure that the PV spinlock code won't be activated on bare metal. Could you explain to me which piece of code enables the jump labels on bare metal, please? -Longman
Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support
On Fri, Apr 04, 2014 at 01:58:15PM -0400, Konrad Rzeszutek Wilk wrote: On Fri, Apr 04, 2014 at 01:13:17PM -0400, Waiman Long wrote: On 04/04/2014 12:55 PM, Konrad Rzeszutek Wilk wrote: On Thu, Apr 03, 2014 at 10:57:18PM -0400, Waiman Long wrote: On 04/03/2014 01:23 PM, Konrad Rzeszutek Wilk wrote: On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote: On 04/02/2014 04:35 PM, Waiman Long wrote: On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote: On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote: N.B. Sorry for the duplicate. This patch series were resent as the original one was rejected by the vger.kernel.org list server due to long header. There is no change in content. v7-v8: - Remove one unneeded atomic operation from the slowpath, thus improving performance. - Simplify some of the codes and add more comments. - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable unfair lock. - Reduce unfair lock slowpath lock stealing frequency depending on its distance from the queue head. - Add performance data for IvyBridge-EX CPU. FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an HVM guest under Xen after a while stops working. The workload is doing 'make -j32' on the Linux kernel. Completely unresponsive. Thoughts? Thank for reporting that. I haven't done that much testing on Xen. My focus was in KVM. I will perform more test on Xen to see if I can reproduce the problem. BTW, does the halting and sending IPI mechanism work in HVM? I saw Yes. that in RHEL7, PV spinlock was explicitly disabled when in HVM mode. However, this piece of code isn't in upstream code. So I wonder if there is problem with that. The PV ticketlock fixed it for HVM. It was disabled before because the PV guests were using bytelocks while the HVM were using ticketlocks and you couldnt' swap in PV bytelocks for ticketlocks during startup. The RHEL7 code has used PV ticketlock already. RHEL7 uses a single kernel for all configurations. 
So PV ticketlock as well as Xen and KVM support was compiled in. I think booting the kernel on bare metal will cause the Xen code to work in HVM mode thus activating the PV spinlock code which has a negative impact on performance. Huh? -EPARSE That may be why it was disabled so that the bare metal performance will not be impacted. I am not following you. What I am saying is that when XEN and PV spinlock support is compiled into the current upstream kernel, the PV spinlock jump label is turned on when booted on bare metal. In other words, the PV spinlock code is How does it turn it on? I see that the jump labels are only turned on when it detects that it is running under Xen or KVM. It won't turn it on under bare metal. Well, it seems that it does turn it on on bare metal, which is a stupid mistake. Sending a patch shortly. active even when it is not needed and actually slows things down in that situation. This is a problem and we need to find a way to make sure that the PV spinlock code won't be activated on bare metal. Could you explain to me which piece of code enables the jump labels on bare metal, please? -Longman
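The bug being chased here — the PV spinlock static key getting flipped on bare metal — comes down to when the key-increment call runs. A hedged userspace analogue of the static_key pattern, with a plain function pointer standing in for the patched branch (all names here are illustrative, not the kernel's):

```c
#include <stdbool.h>
#include <assert.h>

/* Userspace analogue of the kernel's static_key pattern: the paravirt
 * slow path is compiled in unconditionally, but should only be switched
 * on when a hypervisor is positively detected at boot. */

static int native_lock_path(void)   { return 0; }  /* bare-metal fast path */
static int paravirt_lock_path(void) { return 1; }  /* halt/kick slow path  */

/* Stands in for a static_key-patched call site; defaults to native. */
static int (*lock_path)(void) = native_lock_path;

/* Mirrors the intent of the Xen/KVM spinlock init code: flip the key
 * only on positive hypervisor detection -- never by default, which is
 * exactly the bare-metal regression being discussed in this thread. */
static void model_spinlock_init(bool running_on_hypervisor)
{
    if (!running_on_hypervisor)
        return;                 /* bare metal keeps the native path */
    lock_path = paravirt_lock_path;
}
```

The kernel version is one-way (a static key, once incremented, stays patched in), so the guard must run before the increment — the same ordering bug Konrad spotted.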
Re: [PATCH v8 00/10] qspinlock: a 4-byte queue spinlock with PV support
On Wed, Apr 02, 2014 at 10:10:17PM -0400, Waiman Long wrote: On 04/02/2014 04:35 PM, Waiman Long wrote: On 04/02/2014 10:32 AM, Konrad Rzeszutek Wilk wrote: On Wed, Apr 02, 2014 at 09:27:29AM -0400, Waiman Long wrote: N.B. Sorry for the duplicate. This patch series was resent as the original one was rejected by the vger.kernel.org list server due to a long header. There is no change in content. v7->v8: - Remove one unneeded atomic operation from the slowpath, thus improving performance. - Simplify some of the code and add more comments. - Test for the X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable unfair lock. - Reduce unfair lock slowpath lock stealing frequency depending on its distance from the queue head. - Add performance data for IvyBridge-EX CPU. FYI, your v7 patch with 32 VCPUs (on a 32 cpu socket machine) on an HVM guest under Xen after a while stops working. The workload is doing 'make -j32' on the Linux kernel. Completely unresponsive. Thoughts? Thanks for reporting that. I haven't done that much testing on Xen. My focus was on KVM. I will perform more tests on Xen to see if I can reproduce the problem. BTW, does the halting and sending IPI mechanism work in HVM? I saw Yes. that in RHEL7, PV spinlock was explicitly disabled when in HVM mode. However, this piece of code isn't in the upstream code. So I wonder if there is a problem with that. The PV ticketlock fixed it for HVM. It was disabled before because the PV guests were using bytelocks while the HVM guests were using ticketlocks and you couldn't swap in PV bytelocks for ticketlocks during startup. -Longman
Re: [PATCH v8 10/10] pvqspinlock, x86: Enable qspinlock PV support for XEN
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index a70fdeb..451e392 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -229,4 +229,4 @@ config ARCH_USE_QUEUE_SPINLOCK

 config QUEUE_SPINLOCK
 	def_bool y if ARCH_USE_QUEUE_SPINLOCK
-	depends on SMP && (!PARAVIRT_SPINLOCKS || !XEN)
+	depends on SMP

If I read this correctly, that means you can no longer select the old ticketlocks? As in, if you select CONFIG_PARAVIRT on X86 it will automatically select ARCH_USE_QUEUE_SPINLOCK, which will then enable this by default? Should the 'def_bool' be selectable?
Re: [PATCH v7 07/11] pvqspinlock, x86: Allow unfair queue spinlock in a XEN guest
On Mar 20, 2014 11:40 PM, Waiman Long waiman.l...@hp.com wrote: On 03/19/2014 04:28 PM, Konrad Rzeszutek Wilk wrote: On Wed, Mar 19, 2014 at 04:14:05PM -0400, Waiman Long wrote: This patch adds a XEN init function to activate the unfair queue spinlock in a XEN guest when the PARAVIRT_UNFAIR_LOCKS kernel config option is selected. Signed-off-by: Waiman Longwaiman.l...@hp.com --- arch/x86/xen/setup.c | 19 +++ 1 files changed, 19 insertions(+), 0 deletions(-) diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c index 0982233..66bb6f5 100644 --- a/arch/x86/xen/setup.c +++ b/arch/x86/xen/setup.c @@ -625,3 +625,22 @@ void __init xen_arch_setup(void) numa_off = 1; #endif } + +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS +/* + * Enable unfair lock if running in a Xen guest + */ +static __init int xen_unfair_locks_init_jump(void) +{ + /* + * Disable unfair lock if not running in a PV domain + */ + if (!xen_pv_domain()) + return 0; I would just make this 'xen_domain'. Not sure why you need to have it only for PV while the PVHVM guests can also use it? The compilation of the setup.c file should have implied xen_domain already (at least HVM). The check is added to make sure that unfair lock won't be enabled on bare metal. As for PVHVM, is there a way to detect it is running as such which is distinct from HVM? Xen_domain() should cover PVHVM and PV. Would it also make sense to use the same printk statement that the KVM has? Yes, I can add a printk statement like KVM. -Longman
Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
On Tue, Mar 18, 2014 at 11:11:43PM -0400, Waiman Long wrote: On 03/17/2014 03:10 PM, Konrad Rzeszutek Wilk wrote: On Mon, Mar 17, 2014 at 01:44:34PM -0400, Waiman Long wrote: On 03/14/2014 04:30 AM, Peter Zijlstra wrote: On Thu, Mar 13, 2014 at 04:05:19PM -0400, Waiman Long wrote: On 03/13/2014 11:15 AM, Peter Zijlstra wrote: On Wed, Mar 12, 2014 at 02:54:52PM -0400, Waiman Long wrote: +static inline void arch_spin_lock(struct qspinlock *lock) +{ + if (static_key_false(paravirt_unfairlocks_enabled)) + queue_spin_lock_unfair(lock); + else + queue_spin_lock(lock); +} So I would have expected something like: if (static_key_false(paravirt_spinlock)) { while (!queue_spin_trylock(lock)) cpu_relax(); return; } At the top of queue_spin_lock_slowpath(). I don't like the idea of constantly spinning on the lock. That can cause all sort of performance issues. Its bloody virt; _that_ is a performance issue to begin with. Anybody half sane stops using virt (esp. if they care about performance). My version of the unfair lock tries to grab the lock ignoring if there are others waiting in the queue or not. So instead of the doing a cmpxchg of the whole 32-bit word, I just do a cmpxchg of the lock byte in the unfair version. A CPU has only one chance to steal the lock. If it can't, it will be lined up in the queue just like the fair version. It is not as unfair as the other unfair locking schemes that spins on the lock repetitively. So lock starvation should be less a problem. On the other hand, it may not perform as well as the other unfair locking schemes. It is a compromise to provide some lock unfairness without sacrificing the good cacheline behavior of the queue spinlock. But but but,.. any kind of queueing gets you into a world of hurt with virt. The simple test-and-set lock (as per the above) still sucks due to lock holder preemption, but at least the suckage doesn't queue. 
Because with queueing you not only have to worry about the lock holder getting preemption, but also the waiter(s). Take the situation of 3 (v)CPUs where cpu0 holds the lock but is preempted. cpu1 queues, cpu2 queues. Then cpu1 gets preempted, after which cpu0 gets back online. The simple test-and-set lock will now let cpu2 acquire. Your queue however will just sit there spinning, waiting for cpu1 to come back from holiday. I think you're way over engineering this. Just do the simple test-and-set lock for virt !paravirt (as I think Paolo Bonzini suggested RHEL6 already does). The PV ticketlock code was designed to handle lock holder preemption by redirecting CPU resources in a preempted guest to another guest that can better use it and then return the preempted CPU back sooner. Using a simple test-and-set lock will not allow us to enable this PV spinlock functionality as there is no structure to decide who does what. I can extend the current unfair lock code to allow those And what would be needed to do 'decide who does what'? Sorry for not very clear in my previous mail. The current PV code will halt the CPUs if the lock isn't acquired in a certain time. When the locker holder come back, it will wake up the next CPU in the ticket lock queue. With a simple set and test lock, there is no structure to decide which one to wake up. So you still need to have some sort of queue to do that, you just can't wake up all of them and let them fight for the lock. Why not? That is what bytelocks did and it worked OK in the virtualization environment. The reason it was OK was because the hypervisor decided which VCPU to schedule and that one would get the lock. Granted it did have a per-cpu-what-lock-am-I-waiting-for structure to aid in this. In some form, the hypervisor serializes who is going to get the lock as it ultimately decides which of the VCPUs that are kicked to wake up - and if the lock is not FIFO - it doesn't matter which ones gets it. 
With the 'per-cpu-what-lock-I-am-waiting-for' you can also have a round-robin of which vCPU you kick to keep it fair. You might want to take a look at the PV bytelock that existed in the Xen code prior to PV ticketlock (so 3.10) to see how it did that.
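The virt fallback Peter suggested earlier in this thread — spin on trylock with no queueing, so a preempted waiter never blocks another vCPU from taking the lock — is essentially a test-and-set lock. A minimal userspace sketch with C11 atomics (illustrative only, not the kernel implementation, which would live behind the arch_spin_lock hooks):

```c
#include <stdatomic.h>
#include <assert.h>

/* Minimal test-and-set spinlock: unqueued, so whichever runnable vCPU
 * reaches the cmpxchg first wins -- unfair, but immune to the
 * waiter-preemption problem that queued locks have under overcommit. */
typedef struct { atomic_flag f; } tas_lock_t;

static void tas_lock(tas_lock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
        ;   /* spin; the kernel would cpu_relax() here */
}

static int tas_trylock(tas_lock_t *l)
{
    /* returns 1 if the lock was free and is now held by us */
    return !atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire);
}

static void tas_unlock(tas_lock_t *l)
{
    atomic_flag_clear_explicit(&l->f, memory_order_release);
}
```

The trade-off debated in the thread is visible in the structure: there is no per-waiter state at all, which is what makes it preemption-tolerant, but also what makes starvation possible and makes a PV halt/kick scheme hard to bolt on (nothing records who is waiting).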
Re: [PATCH v7 02/11] qspinlock, x86: Enable x86-64 to use queue spinlock
On Wed, Mar 19, 2014 at 04:14:00PM -0400, Waiman Long wrote: This patch makes the necessary changes at the x86 architecture specific layer to enable the use of queue spinlock for x86-64. As x86-32 machines are typically not multi-socket. The benefit of queue spinlock may not be apparent. So queue spinlock is not enabled. Currently, there is some incompatibilities between the para-virtualized spinlock code (which hard-codes the use of ticket spinlock) and the queue spinlock. Therefore, the use of queue spinlock is disabled when the para-virtualized spinlock is enabled. And how does this patch do that? I think that comment is obsolete? The arch/x86/include/asm/qspinlock.h header file includes some x86 specific optimization which will make the queue spinlock code perform better than the generic implementation. Signed-off-by: Waiman Long waiman.l...@hp.com Acked-by: Rik van Riel r...@redhat.com --- arch/x86/Kconfig |1 + arch/x86/include/asm/qspinlock.h | 41 + arch/x86/include/asm/spinlock.h |5 arch/x86/include/asm/spinlock_types.h |4 +++ 4 files changed, 51 insertions(+), 0 deletions(-) create mode 100644 arch/x86/include/asm/qspinlock.h diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 0af5250..de573f9 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -17,6 +17,7 @@ config X86_64 depends on 64BIT select X86_DEV_DMA_OPS select ARCH_USE_CMPXCHG_LOCKREF + select ARCH_USE_QUEUE_SPINLOCK ### Arch settings config X86 diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h new file mode 100644 index 000..44cefee --- /dev/null +++ b/arch/x86/include/asm/qspinlock.h @@ -0,0 +1,41 @@ +#ifndef _ASM_X86_QSPINLOCK_H +#define _ASM_X86_QSPINLOCK_H + +#include asm-generic/qspinlock_types.h + +#if !defined(CONFIG_X86_OOSTORE) !defined(CONFIG_X86_PPRO_FENCE) + +#define _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS + +/* + * x86-64 specific queue spinlock union structure + */ +union arch_qspinlock { + struct qspinlock slock; + u8 lock; /* Lock bit */ +}; + 
+#define queue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * No special memory barrier other than a compiler one is needed for the
+ * x86 architecture. A compiler barrier is added at the end to make sure
+ * that the clearing of the lock bit is done ASAP without artificial delay
+ * due to compiler optimization.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+	barrier();
+	ACCESS_ONCE(qlock->lock) = 0;
+	barrier();
+}
+
+#endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 0f62f54..958d20f 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -42,6 +42,10 @@ extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm/qspinlock.h>
+#else
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -180,6 +184,7 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
 {
 	arch_spin_lock(lock);
 }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..7960268 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -23,6 +23,9 @@ typedef u32 __ticketpair_t;
 
 #define TICKET_SHIFT	(sizeof(__ticket_t) * 8)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm-generic/qspinlock_types.h>
+#else
 typedef struct arch_spinlock {
 	union {
 		__ticketpair_t head_tail;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 #include <asm/rwlock.h>
-- 
1.7.1
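The comment in the hunk above notes that on x86 only a compiler barrier is needed when releasing the lock. A minimal user-space analogue of that unlock, expressed with portable C11 atomics instead of the kernel's `ACCESS_ONCE()`/`barrier()` idiom (the function name here is illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdatomic.h>

/* User-space sketch of the unlock above: x86's TSO memory model makes a
 * plain store sufficient to release a lock, so only compiler reordering
 * has to be prevented.  In portable C11 that maps to a release store. */
static inline void toy_queue_spin_unlock(atomic_uchar *lock_byte)
{
	atomic_store_explicit(lock_byte, 0, memory_order_release);
}
```

On x86 this release store compiles to an ordinary `mov`, which is why the kernel version needs nothing stronger than `barrier()`.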
Re: [PATCH v7 07/11] pvqspinlock, x86: Allow unfair queue spinlock in a XEN guest
On Wed, Mar 19, 2014 at 04:14:05PM -0400, Waiman Long wrote:
> This patch adds a XEN init function to activate the unfair queue
> spinlock in a XEN guest when the PARAVIRT_UNFAIR_LOCKS kernel config
> option is selected.
> 
> Signed-off-by: Waiman Long waiman.l...@hp.com
> ---
>  arch/x86/xen/setup.c | 19 +++
>  1 files changed, 19 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> index 0982233..66bb6f5 100644
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -625,3 +625,22 @@ void __init xen_arch_setup(void)
>  	numa_off = 1;
>  #endif
>  }
> +
> +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS
> +/*
> + * Enable unfair lock if running in a Xen guest
> + */
> +static __init int xen_unfair_locks_init_jump(void)
> +{
> +	/*
> +	 * Disable unfair lock if not running in a PV domain
> +	 */
> +	if (!xen_pv_domain())
> +		return 0;

I would just make this 'xen_domain'. Not sure why you need to have it
only for PV while the PVHVM guests can also use it?

Would it also make sense to use the same printk statement that KVM has?

> +
> +	static_key_slow_inc(&paravirt_unfairlocks_enabled);
> +
> +	return 0;
> +}
> +early_initcall(xen_unfair_locks_init_jump);
> +#endif
-- 
1.7.1
Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
On Thu, Mar 13, 2014 at 02:16:06PM +0100, Paolo Bonzini wrote:
> On 13/03/2014 11:54, David Vrabel wrote:
> > On 12/03/14 18:54, Waiman Long wrote:
> > > Locking is always an issue in a virtualized environment as the
> > > virtual CPU that is waiting on a lock may get scheduled out and
> > > hence block any progress in lock acquisition even when the lock
> > > has been freed.
> > > 
> > > One solution to this problem is to allow unfair lock in a
> > > para-virtualized environment. In this case, a new lock acquirer
> > > can come and steal the lock if the next-in-line CPU to get the
> > > lock is scheduled out. Unfair lock in a native environment is
> > > generally not a good idea as there is a possibility of lock
> > > starvation for a heavily contended lock.
> > 
> > I do not think this is a good idea -- the problems with unfair locks
> > are worse in a virtualized guest.  If a waiting VCPU deschedules and
> > has to be kicked to grab a lock then it is very likely to lose a
> > race with another running VCPU trying to take a lock (since it takes
> > time for the VCPU to be rescheduled).
> 
> Actually, I think the unfair version should be automatically selected
> if running on a hypervisor.  Per-hypervisor pvops can choose to enable
> the fair one.
> 
> Lock unfairness may be particularly evident on a virtualized guest
> when the host is overcommitted, but problems with fair locks are even
> worse.
> 
> In fact, RHEL/CentOS 6 already uses unfair locks if
> X86_FEATURE_HYPERVISOR is set.  The patch was rejected upstream in
> favor of pv ticketlocks, but pv ticketlocks do not cover all
> hypervisors so perhaps we could revisit that choice.
> 
> Measurements were done by Gleb for two guests running 2.6.32 with 16
> vcpus each, on a 16-core system.  One guest ran with unfair locks,
> one guest ran with fair locks.

And when you say "fair locks" are you saying PV ticketlocks or generic
ticketlocks?

> Two kernel compilations ("time make -j 16 all") were started at the
> same time on both guests, and times were as follows:
> 
> 	unfair:			fair:
> 	real 13m34.674s		real 19m35.827s
> 	user 96m2.638s		user 102m38.665s
> 	sys 56m14.991s		sys 158m22.470s
> 
> 	real 13m3.768s		real 19m4.375s
> 	user 95m34.509s		user 111m9.903s
> 	sys 53m40.550s		sys 141m59.370s
> 
> Actually, interpreting the numbers shows an even worse slowdown.
> 
> Compilation took ~6.5 minutes in a guest when the host was not
> overcommitted, and with unfair locks everything scaled just fine.

You should see the same values with the PV ticketlock. It is not clear
to me if this testing did include that variant of locks?

> Ticketlocks fell completely apart; during the first 13 minutes they
> were allotted 16*6.5=104 minutes of CPU time, and they spent almost
> all of it spinning in the kernel (102 minutes in the first run).

Right, the non-PV variant of them does fall apart. That is why PV
ticketlocks are so nice.

> They did perhaps 30 seconds worth of work because, as soon as the
> unfair-lock guest finished and the host was no longer overcommitted,
> compilation finished in 6 minutes.
> 
> So that's approximately 12x slowdown from using non-pv fair locks
> (vs. unfair locks) on a 200%-overcommitted host.

Ah, so it was non-PV. I am curious if the test was any different if you
tested PV ticketlocks vs the Red Hat variant of unfair locks.

> Paolo

> > > With the unfair locking activated on bare metal 4-socket
> > > Westmere-EX box, the execution times (in ms) of a spinlock
> > > micro-benchmark were as follows:
> > > 
> > > 	# of	Ticket	Fair		Unfair
> > > 	tasks	lock	queue lock	queue lock
> > > 	------	------	----------	----------
> > > 	  1	  135	   135		  137
> > > 	  2	 1045	  1120		  747
> > > 	  3	 1827	  2345		 1084
> > > 	  4	 2689	  2934		 1438
> > > 	  5	 3736	  3658		 1722
> > > 	  6	 4942	  4434		 2092
> > > 	  7	 6304	  5176		 2245
> > > 	  8	 7736	  5955		 2388

Are these figures with or without the later PV support patches?
Re: [PATCH v6 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
On Mon, Mar 17, 2014 at 01:44:34PM -0400, Waiman Long wrote:
> On 03/14/2014 04:30 AM, Peter Zijlstra wrote:
> > On Thu, Mar 13, 2014 at 04:05:19PM -0400, Waiman Long wrote:
> > > On 03/13/2014 11:15 AM, Peter Zijlstra wrote:
> > > > On Wed, Mar 12, 2014 at 02:54:52PM -0400, Waiman Long wrote:
> > > > > +static inline void arch_spin_lock(struct qspinlock *lock)
> > > > > +{
> > > > > +	if (static_key_false(&paravirt_unfairlocks_enabled))
> > > > > +		queue_spin_lock_unfair(lock);
> > > > > +	else
> > > > > +		queue_spin_lock(lock);
> > > > > +}
> > > > 
> > > > So I would have expected something like:
> > > > 
> > > > 	if (static_key_false(&paravirt_spinlock)) {
> > > > 		while (!queue_spin_trylock(lock))
> > > > 			cpu_relax();
> > > > 		return;
> > > > 	}
> > > > 
> > > > At the top of queue_spin_lock_slowpath().
> > > 
> > > I don't like the idea of constantly spinning on the lock. That can
> > > cause all sorts of performance issues.
> > 
> > Its bloody virt; _that_ is a performance issue to begin with.
> > 
> > Anybody half sane stops using virt (esp. if they care about
> > performance).
> > 
> > > My version of the unfair lock tries to grab the lock ignoring if
> > > there are others waiting in the queue or not. So instead of doing
> > > a cmpxchg of the whole 32-bit word, I just do a cmpxchg of the
> > > lock byte in the unfair version. A CPU has only one chance to
> > > steal the lock. If it can't, it will be lined up in the queue just
> > > like the fair version. It is not as unfair as the other unfair
> > > locking schemes that spin on the lock repetitively. So lock
> > > starvation should be less of a problem.
> > > 
> > > On the other hand, it may not perform as well as the other unfair
> > > locking schemes. It is a compromise to provide some lock
> > > unfairness without sacrificing the good cacheline behavior of the
> > > queue spinlock.
> > 
> > But but but,.. any kind of queueing gets you into a world of hurt
> > with virt.
> > 
> > The simple test-and-set lock (as per the above) still sucks due to
> > lock holder preemption, but at least the suckage doesn't queue.
> > Because with queueing you not only have to worry about the lock
> > holder getting preempted, but also the waiter(s).
> > 
> > Take the situation of 3 (v)CPUs where cpu0 holds the lock but is
> > preempted. cpu1 queues, cpu2 queues. Then cpu1 gets preempted,
> > after which cpu0 gets back online.
> > 
> > The simple test-and-set lock will now let cpu2 acquire. Your queue
> > however will just sit there spinning, waiting for cpu1 to come back
> > from holiday.
> > 
> > I think you're way over engineering this. Just do the simple
> > test-and-set lock for virt && !paravirt (as I think Paolo Bonzini
> > suggested RHEL6 already does).
> 
> The PV ticketlock code was designed to handle lock holder preemption
> by redirecting CPU resources in a preempted guest to another guest
> that can better use it and then return the preempted CPU back sooner.
> Using a simple test-and-set lock will not allow us to enable this PV
> spinlock functionality as there is no structure to decide who does
> what.

And what would be needed to do 'decide who does what'?

> I can extend the current unfair lock code to allow those waiting in
> the queue to also attempt to steal the lock, though at a lesser
> frequency so that the queue head has a higher chance of getting the
> lock. This will solve the lock waiter preemption problem that you
> worry about.
> 
> This does make the code a bit more complex, but it allows us to
> enable both the unfair lock and the PV spinlock code together to
> solve the lock waiter and lock holder preemption problems.
> 
> -Longman
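The single-attempt stealing Waiman describes — one cmpxchg on the lock byte, then fall back to queueing like the fair path — can be modelled in user space roughly as follows. This is a toy sketch with made-up names, not the kernel patch itself:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of the "one chance to steal" policy described above: a new
 * acquirer makes exactly one cmpxchg attempt on the lock byte; on
 * failure it would join the MCS-style queue just like the fair path.
 * All names here are illustrative, not the kernel's. */
struct toy_qspinlock {
	atomic_uchar locked;	/* the lock byte only, not the full word */
};

/* One cmpxchg attempt; returns true if the lock was stolen. */
static bool toy_try_steal(struct toy_qspinlock *lock)
{
	unsigned char expected = 0;
	return atomic_compare_exchange_strong(&lock->locked, &expected, 1);
}

static void toy_unlock(struct toy_qspinlock *lock)
{
	atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```

Because the cmpxchg touches only the lock byte rather than the whole 32-bit word, the steal attempt does not disturb the queue-tail portion of the lock, which is what preserves the queue's cacheline behavior.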
Re: [PATCH] virtio-blk: make the queue depth the max supportable by the hypervisor
On March 14, 2014 11:34:31 PM EDT, Theodore Ts'o ty...@mit.edu wrote:
> The current virtio block sets a queue depth of 64, which is
> insufficient for very fast devices. It has been demonstrated that
> with a high IOPS device, using a queue depth of 256 can double the
> IOPS which can be sustained.
> 
> As suggested by Venkatesh Srinivas, set the queue depth by default to
> be one half the device's virtqueue, which is the maximum queue depth
> that can be supported by the channel to the host OS (each I/O request
> requires at least two VQ entries).
> 
> Also allow the queue depth to be something which can be set at module
> load time or via a kernel boot-time parameter, for
> testing/benchmarking purposes.
> 
> Signed-off-by: Theodore Ts'o ty...@mit.edu
> Signed-off-by: Venkatesh Srinivas venkate...@google.com
> Cc: Rusty Russell ru...@rustcorp.com.au
> Cc: Michael S. Tsirkin m...@redhat.com
> Cc: virtio-...@lists.oasis-open.org
> Cc: virtualization@lists.linux-foundation.org
> Cc: Frank Swiderski f...@google.com
> ---
> This is a combination of my patch and Venkatesh's patch. I agree that
> setting the default automatically is better than requiring the user
> to set the value by hand.
> 
>  drivers/block/virtio_blk.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 6a680d4..0f70c01 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -481,6 +481,9 @@ static struct blk_mq_ops virtio_mq_ops = {
>  	.free_hctx	= blk_mq_free_single_hw_queue,
>  };
>  
> +static int queue_depth = -1;
> +module_param(queue_depth, int, 0444);

?

> +
>  static struct blk_mq_reg virtio_mq_reg = {
>  	.ops		= virtio_mq_ops,
>  	.nr_hw_queues	= 1,
> @@ -551,9 +554,14 @@ static int virtblk_probe(struct virtio_device *vdev)
>  		goto out_free_vq;
>  	}
>  
> +	virtio_mq_reg.queue_depth = queue_depth > 0 ? queue_depth :
> +		(vblk->vq->num_free / 2);
>  	virtio_mq_reg.cmd_size =
>  		sizeof(struct virtblk_req) +
>  		sizeof(struct scatterlist) * sg_elems;
> +	virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
> +	pr_info("%s: using queue depth %d\n", vblk->disk->disk_name,
> +		virtio_mq_reg.queue_depth);

Isn't that visible from sysfs?

>  	q = vblk->disk->queue = blk_mq_init_queue(&virtio_mq_reg, vblk);
>  	if (!q) {
> @@ -565,8 +573,6 @@ static int virtblk_probe(struct virtio_device *vdev)
>  
>  	q->queuedata = vblk;
>  
> -	virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
> -
>  	vblk->disk->major = major;
>  	vblk->disk->first_minor = index_to_minor(index);
>  	vblk->disk->private_data = vblk;
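The default the patch picks reduces to one line of arithmetic: each request consumes at least two virtqueue entries, so at most `num_free / 2` requests can be outstanding, and a positive module parameter overrides that. A sketch of the selection logic (the helper name is mine; the driver computes this inline):

```c
#include <assert.h>

/* Sketch of the default chosen in the patch above: a queue_depth
 * module parameter <= 0 means "auto", i.e. half the virtqueue's free
 * entries, since each I/O request needs at least two VQ entries.
 * The function name is illustrative, not the driver's. */
static int virtblk_pick_queue_depth(int queue_depth_param, int vq_num_free)
{
	return queue_depth_param > 0 ? queue_depth_param
				     : vq_num_free / 2;
}
```

With a 128-entry virtqueue this yields a depth of 64; an explicit override such as `queue_depth=256` on the modprobe command line (or, per the usual kernel convention, `virtio_blk.queue_depth=256` on the boot command line) takes precedence.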