Re: [PATCH 1/2] hyperv: implement hv_get_tsc_page()
On Thu, 9 Feb 2017 21:14:25 +0100 (CET) Thomas Gleixner wrote: > On Thu, 9 Feb 2017, Stephen Hemminger wrote: > > > The actual code looks fine, but the style police will not like you. > > { should be at start of line on functions. > > And #else should be at start of line, > > > > But maybe this was just more of exchange mangling the mail. > > Looks like. > > > +struct ms_hyperv_tsc_page *hv_get_tsc_page(void) { > > + return tsc_pg; > > +} > > + > > That's how it reads in a proper mail client connected to a proper mail > server: > > > +struct ms_hyperv_tsc_page *hv_get_tsc_page(void) > > +{ > > + return tsc_pg; > > +} > > :) Yup. it looks like the mail server is trying to be "helpful" by eliminating extra white space. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method
On Thu, 9 Feb 2017 14:55:50 -0800 Andy Lutomirski wrote: > On Thu, Feb 9, 2017 at 12:45 PM, KY Srinivasan wrote: > > > > > >> -Original Message- > >> From: Thomas Gleixner [mailto:t...@linutronix.de] > >> Sent: Thursday, February 9, 2017 9:08 AM > >> To: Vitaly Kuznetsov > >> Cc: x...@kernel.org; Andy Lutomirski ; Ingo Molnar > >> ; H. Peter Anvin ; KY Srinivasan > >> ; Haiyang Zhang ; Stephen > >> Hemminger ; Dexuan Cui > >> ; linux-ker...@vger.kernel.org; > >> de...@linuxdriverproject.org; virtualization@lists.linux-foundation.org > >> Subject: Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read > >> method > >> > >> On Thu, 9 Feb 2017, Vitaly Kuznetsov wrote: > >> > +#ifdef CONFIG_HYPERV_TSCPAGE > >> > +static notrace u64 vread_hvclock(int *mode) > >> > +{ > >> > + const struct ms_hyperv_tsc_page *tsc_pg = > >> > + (const struct ms_hyperv_tsc_page *)&hvclock_page; > >> > + u64 sequence, scale, offset, current_tick, cur_tsc; > >> > + > >> > + while (1) { > >> > + sequence = READ_ONCE(tsc_pg->tsc_sequence); > >> > + if (!sequence) > >> > + break; > >> > + > >> > + scale = READ_ONCE(tsc_pg->tsc_scale); > >> > + offset = READ_ONCE(tsc_pg->tsc_offset); > >> > + rdtscll(cur_tsc); > >> > + > >> > + current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset; > >> > + > >> > + if (READ_ONCE(tsc_pg->tsc_sequence) == sequence) > >> > + return current_tick; > >> > >> That sequence stuff lacks still a sensible explanation. It's fundamentally > >> different from the sequence counting we do in the kernel, so documentation > >> for it is really required. > > > > The host is updating multiple fields in this shared TSC page and the > > sequence number is > > used to ensure that the guest sees a consistent set values published. If I > > remember > > correctly, Xen has a similar mechanism. > > So what's the actual protocol? When the hypervisor updates the page, > does it freeze all guest cpus? If not, how does it maintain > atomicity? The protocol looks a lot like Linux seqlock, but it has an extra protection which is missing here. The host needs to update sequence number twice in order to guarantee ordering. Otherwise it is possible that Host and guest can race. Host Write offset Write scale Set tsc_sequence = N Guest read sequence = N Read scale Write scale Write offset Read Offset Check sequence == N Set tsc_sequence = N +1 Look like the current host side protocol is wrong. The solution that Andi Kleen invented, and I used in seqlock was for the writer to update sequence at start and end of transaction. If sequence number is odd, then the reader knows it is looking at stale data. Host Write offset Write scale Set tsc_sequence = N (end of transaction) Guest read sequence = N Spin until sequence is even (N is even) Read scale Set tsc_sequence += 1 Write scale Write offset Read Offset Check sequence == N? (fails is N + 1) Set tsc_sequence += 1 (end of transaction) read sequence = N+2 Spin until sequence is even (ie N +2) Read scale Read Offset Check sequence == N +2? (yes ok). Also it is faster to just read scale and offset with this loop and save the reading of TSC and doing multiply until after scale/offset has been acquired. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method
On Thu, Feb 9, 2017 at 12:45 PM, KY Srinivasan wrote: > > >> -Original Message- >> From: Thomas Gleixner [mailto:t...@linutronix.de] >> Sent: Thursday, February 9, 2017 9:08 AM >> To: Vitaly Kuznetsov >> Cc: x...@kernel.org; Andy Lutomirski ; Ingo Molnar >> ; H. Peter Anvin ; KY Srinivasan >> ; Haiyang Zhang ; Stephen >> Hemminger ; Dexuan Cui >> ; linux-ker...@vger.kernel.org; >> de...@linuxdriverproject.org; virtualization@lists.linux-foundation.org >> Subject: Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read >> method >> >> On Thu, 9 Feb 2017, Vitaly Kuznetsov wrote: >> > +#ifdef CONFIG_HYPERV_TSCPAGE >> > +static notrace u64 vread_hvclock(int *mode) >> > +{ >> > + const struct ms_hyperv_tsc_page *tsc_pg = >> > + (const struct ms_hyperv_tsc_page *)&hvclock_page; >> > + u64 sequence, scale, offset, current_tick, cur_tsc; >> > + >> > + while (1) { >> > + sequence = READ_ONCE(tsc_pg->tsc_sequence); >> > + if (!sequence) >> > + break; >> > + >> > + scale = READ_ONCE(tsc_pg->tsc_scale); >> > + offset = READ_ONCE(tsc_pg->tsc_offset); >> > + rdtscll(cur_tsc); >> > + >> > + current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset; >> > + >> > + if (READ_ONCE(tsc_pg->tsc_sequence) == sequence) >> > + return current_tick; >> >> That sequence stuff lacks still a sensible explanation. It's fundamentally >> different from the sequence counting we do in the kernel, so documentation >> for it is really required. > > The host is updating multiple fields in this shared TSC page and the sequence > number is > used to ensure that the guest sees a consistent set values published. If I > remember > correctly, Xen has a similar mechanism. So what's the actual protocol? When the hypervisor updates the page, does it freeze all guest cpus? If not, how does it maintain atomicity? --Andy ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio: Try to untangle DMA coherency
On 09/02/2017 21:57, Michael S. Tsirkin wrote: On Wed, Feb 08, 2017 at 01:58:10PM +0100, Alexander Graf wrote: On 02/01/2017 08:19 PM, Michael S. Tsirkin wrote: On Wed, Feb 01, 2017 at 06:27:09PM +, Will Deacon wrote: On Wed, Feb 01, 2017 at 08:09:21PM +0200, Michael S. Tsirkin wrote: On Wed, Feb 01, 2017 at 12:25:57PM +, Robin Murphy wrote: diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 7e38ed79c3fc..961af25b385c 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -160,10 +161,14 @@ static bool vring_use_dma_api(struct virtio_device *vdev) return true; /* -* On ARM-based machines, the DMA ops will do the right thing, -* so always use them with legacy devices. +* On ARM-based machines, the coherent DMA ops will do the right +* thing, so always use them with legacy devices. However, using +* non-coherent DMA when the host *is* actually coherent, but has +* forgotten to tell us, is going to break badly; since this situation +* already exists in the wild, maintain the old behaviour there. */ - if (IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_ARM64)) + if ((IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_ARM64)) && + device_get_dma_attr(&vdev->dev) == DEV_DMA_COHERENT) return !virtio_has_feature(vdev, VIRTIO_F_VERSION_1); return false; This is exactly what I feared. Yes, sorry about this. It works fine for virtio-pci (where "dma-coherent" is used) and it also works on the fastmodel if you disable cache-modelling (which is needed to make the thing run at a usable pace) so we didn't spot this in testing. Could we identify fastboot and do the special dance just for it? [assuming you mean fastmodel instead of fastboot] I'd like to do that instead. It's fastboot doing the unreasonable thing here and deviating from what every other legacy device without exception did for years. If this means fastboot will need to update to virtio 1, all the better. The problem still exists with virtio 1, unless we require that the "dma-coherent" property is set/unset correctly when VIRTIO_F_IOMMU_PLATFORM is advertised by the device (which is what I suggested in my reply). I'm not ignoring that, but I need to understand that part a bit better. I'll reply to that patch in a day or two after looking at how _CCA is supposed to work. We can't detect the fastmodel, Surely, it puts a hardware id somewhere? I think you mean fastmodel isn't always affected, right? but we could implicitly treat virtio-mmio devices as cache-coherent regardless of the "dma-coherent" flag. I already prototyped this, but I suspect the devicetree people will push back (and there's a similar patch needed for ACPI). See below. Do you prefer this approach? Will --->8 I'd like to see basically if (fastmodel) a pile of special work-arounds else not less hacky but more common virtio work-arounds :) And then I can apply whatever comes from @arm.com and not worry about breaking actual hardware. I'm actually seeing the exact same breakage in QEMU right now, so it's not fast model related at all. In QEMU we also don't properly set the dma-coherent flag, so we run into cache coherency problems. Alex But with latest revert QEMU should now work, right? Yes, with the revert QEMU works fine. But please keep in mind that it's not just the fast model that's broken here :). Alex ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio: Try to untangle DMA coherency
On Wed, Feb 08, 2017 at 01:58:10PM +0100, Alexander Graf wrote: > On 02/01/2017 08:19 PM, Michael S. Tsirkin wrote: > > On Wed, Feb 01, 2017 at 06:27:09PM +, Will Deacon wrote: > > > On Wed, Feb 01, 2017 at 08:09:21PM +0200, Michael S. Tsirkin wrote: > > > > On Wed, Feb 01, 2017 at 12:25:57PM +, Robin Murphy wrote: > > > > > diff --git a/drivers/virtio/virtio_ring.c > > > > > b/drivers/virtio/virtio_ring.c > > > > > index 7e38ed79c3fc..961af25b385c 100644 > > > > > --- a/drivers/virtio/virtio_ring.c > > > > > +++ b/drivers/virtio/virtio_ring.c > > > > > @@ -20,6 +20,7 @@ > > > > > #include > > > > > #include > > > > > #include > > > > > +#include > > > > > #include > > > > > #include > > > > > #include > > > > > @@ -160,10 +161,14 @@ static bool vring_use_dma_api(struct > > > > > virtio_device *vdev) > > > > > return true; > > > > > > > > > > /* > > > > > - * On ARM-based machines, the DMA ops will do the right thing, > > > > > - * so always use them with legacy devices. > > > > > + * On ARM-based machines, the coherent DMA ops will do the right > > > > > + * thing, so always use them with legacy devices. However, using > > > > > + * non-coherent DMA when the host *is* actually coherent, but > > > > > has > > > > > + * forgotten to tell us, is going to break badly; since this > > > > > situation > > > > > + * already exists in the wild, maintain the old behaviour there. > > > > >*/ > > > > > - if (IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_ARM64)) > > > > > + if ((IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_ARM64)) && > > > > > + device_get_dma_attr(&vdev->dev) == DEV_DMA_COHERENT) > > > > > return !virtio_has_feature(vdev, VIRTIO_F_VERSION_1); > > > > > > > > > > return false; > > > > > > > > This is exactly what I feared. > > > > > > Yes, sorry about this. It works fine for virtio-pci (where "dma-coherent" > > > is used) and it also works on the fastmodel if you disable cache-modelling > > > (which is needed to make the thing run at a usable pace) so we didn't spot > > > this in testing. > > > > > > > Could we identify fastboot and do the special dance just for it? > > > > > > [assuming you mean fastmodel instead of fastboot] > > > > > > > I'd like to do that instead. It's fastboot doing the unreasonable thing > > > > here and deviating from what every other legacy device without exception > > > > did for years. If this means fastboot will need to update to virtio 1, > > > > all the better. > > > > > > The problem still exists with virtio 1, unless we require that the > > > "dma-coherent" property is set/unset correctly when > > > VIRTIO_F_IOMMU_PLATFORM > > > is advertised by the device (which is what I suggested in my reply). > > > > I'm not ignoring that, but I need to understand that part a bit better. > > I'll reply to that patch in a day or two after looking at how _CCA is > > supposed to work. > > > > > We can't detect the fastmodel, > > > > Surely, it puts a hardware id somewhere? I think you mean > > fastmodel isn't always affected, right? > > > > > but we could implicitly treat virtio-mmio > > > devices as cache-coherent regardless of the "dma-coherent" flag. I already > > > prototyped this, but I suspect the devicetree people will push back (and > > > there's a similar patch needed for ACPI). > > > > > > See below. Do you prefer this approach? > > > > > > Will > > > > > > --->8 > > > > I'd like to see basically > > > > if (fastmodel) > > a pile of special work-arounds > > else > > not less hacky but more common virtio work-arounds > > > > :) > > > > And then I can apply whatever comes from @arm.com and not > > worry about breaking actual hardware. > > I'm actually seeing the exact same breakage in QEMU right now, so it's not > fast model related at all. In QEMU we also don't properly set the > dma-coherent flag, so we run into cache coherency problems. > > > Alex But with latest revert QEMU should now work, right? -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
RE: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method
> -Original Message- > From: Thomas Gleixner [mailto:t...@linutronix.de] > Sent: Thursday, February 9, 2017 9:08 AM > To: Vitaly Kuznetsov > Cc: x...@kernel.org; Andy Lutomirski ; Ingo Molnar > ; H. Peter Anvin ; KY Srinivasan > ; Haiyang Zhang ; Stephen > Hemminger ; Dexuan Cui > ; linux-ker...@vger.kernel.org; > de...@linuxdriverproject.org; virtualization@lists.linux-foundation.org > Subject: Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read > method > > On Thu, 9 Feb 2017, Vitaly Kuznetsov wrote: > > +#ifdef CONFIG_HYPERV_TSCPAGE > > +static notrace u64 vread_hvclock(int *mode) > > +{ > > + const struct ms_hyperv_tsc_page *tsc_pg = > > + (const struct ms_hyperv_tsc_page *)&hvclock_page; > > + u64 sequence, scale, offset, current_tick, cur_tsc; > > + > > + while (1) { > > + sequence = READ_ONCE(tsc_pg->tsc_sequence); > > + if (!sequence) > > + break; > > + > > + scale = READ_ONCE(tsc_pg->tsc_scale); > > + offset = READ_ONCE(tsc_pg->tsc_offset); > > + rdtscll(cur_tsc); > > + > > + current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset; > > + > > + if (READ_ONCE(tsc_pg->tsc_sequence) == sequence) > > + return current_tick; > > That sequence stuff lacks still a sensible explanation. It's fundamentally > different from the sequence counting we do in the kernel, so documentation > for it is really required. The host is updating multiple fields in this shared TSC page and the sequence number is used to ensure that the guest sees a consistent set values published. If I remember correctly, Xen has a similar mechanism. K. Y > > Thanks, > > tglx ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
RE: [PATCH 1/2] hyperv: implement hv_get_tsc_page()
On Thu, 9 Feb 2017, Stephen Hemminger wrote: > The actual code looks fine, but the style police will not like you. > { should be at start of line on functions. > And #else should be at start of line, > > But maybe this was just more of exchange mangling the mail. Looks like. > +struct ms_hyperv_tsc_page *hv_get_tsc_page(void) { > + return tsc_pg; > +} > + That's how it reads in a proper mail client connected to a proper mail server: > +struct ms_hyperv_tsc_page *hv_get_tsc_page(void) > +{ > + return tsc_pg; > +} :) ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio: Try to untangle DMA coherency
On 9 February 2017 at 18:49, Michael S. Tsirkin wrote: > On Thu, Feb 09, 2017 at 06:31:18PM +, Will Deacon wrote: >> On Thu, Feb 09, 2017 at 08:17:16PM +0200, Michael S. Tsirkin wrote: >> > On Thu, Feb 02, 2017 at 04:40:49PM +, Will Deacon wrote: >> > > On Thu, Feb 02, 2017 at 06:30:28PM +0200, Michael S. Tsirkin wrote: >> > > > I am inclined to say, for 4.10 let's revert >> > > > c7070619f3408d9a0dffbed9149e6f00479cf43b since what it fixes is not a >> > > > regression in 4.10. >> > > >> > > No complaints there, as long as we can keep working to fix this for 4.11 >> > > and onwards. You'll also need to cc stable on the revert. >> > > >> > > > So I think we can defer the fix to 4.11. >> > > > I think we still want f7f6634d23830ff74335734fbdb28ea109c1f349 >> > > > for hosts with virtio 1 support. >> > > > >> > > > All this will hopefully push hosts to just implement virtio 1. >> > > > For mmio the changes are very small: several new registers, >> > > > that's all. You want this for proper 64 bit dma mask anyway. >> > > >> > > As I've said, virtio 1 will have exactly the same issue unless we start >> > > requiring firmware to advertise dma-coherent/_CCA for virtio-mmio >> > > devices correctly. >> > > >> > >> > OK I read up on _CCA in ACPI spec. It says: >> > The _CCA object returns whether or not a bus-master device supports >> > hardware managed cache coherency. Expected values are 0 to indicate it >> > is not supported, and 1 to indicate that it is supported. >> > >> > So if host is cache coherent, and guest thinks it isn't, we incur >> > unnecessary overhead by wasting coherent memory. >> > I get that but you said it actually breaks - why does it? >> >> It breaks because QEMU doesn't set _CCA for virtio-mmio devices, and that >> only becomes a problem when we use the DMA API, because that results in the >> guest taking out a non-cacheable mapping. On ARM (and other archs such as >> Power), having a mismatch between a cacheable and a non-cacheable mapping >> can result in a loss of coherency between the two (for example, if the >> non-cacheable gues accesses bypass the cache, but the cacheable host >> accesses allocate in the cache). >> >> Will > > I see. And I guess using a cacheable mapping is significantly faster. > I would say we want to typically use cacheable for virtio then, > whether we bypass the IOMMU or not. I guess this is why we always set > _CCA/DT correctly, right? > The point is that you don't get to choose: if the hardware is DMA coherent, the CPU should use cacheable mappings, or you may lose data. If the hardware is non-coherent, the CPU should use non-cacheable mappings for the same reason. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio: Try to untangle DMA coherency
On Thu, Feb 09, 2017 at 08:49:41PM +0200, Michael S. Tsirkin wrote: > On Thu, Feb 09, 2017 at 06:31:18PM +, Will Deacon wrote: > > On Thu, Feb 09, 2017 at 08:17:16PM +0200, Michael S. Tsirkin wrote: > > > On Thu, Feb 02, 2017 at 04:40:49PM +, Will Deacon wrote: > > > > On Thu, Feb 02, 2017 at 06:30:28PM +0200, Michael S. Tsirkin wrote: > > > > > I am inclined to say, for 4.10 let's revert > > > > > c7070619f3408d9a0dffbed9149e6f00479cf43b since what it fixes is not a > > > > > regression in 4.10. > > > > > > > > No complaints there, as long as we can keep working to fix this for 4.11 > > > > and onwards. You'll also need to cc stable on the revert. > > > > > > > > > So I think we can defer the fix to 4.11. > > > > > I think we still want f7f6634d23830ff74335734fbdb28ea109c1f349 > > > > > for hosts with virtio 1 support. > > > > > > > > > > All this will hopefully push hosts to just implement virtio 1. > > > > > For mmio the changes are very small: several new registers, > > > > > that's all. You want this for proper 64 bit dma mask anyway. > > > > > > > > As I've said, virtio 1 will have exactly the same issue unless we start > > > > requiring firmware to advertise dma-coherent/_CCA for virtio-mmio > > > > devices correctly. > > > > > > > > > > OK I read up on _CCA in ACPI spec. It says: > > > The _CCA object returns whether or not a bus-master device supports > > > hardware managed cache coherency. Expected values are 0 to indicate it > > > is not supported, and 1 to indicate that it is supported. > > > > > > So if host is cache coherent, and guest thinks it isn't, we incur > > > unnecessary overhead by wasting coherent memory. > > > I get that but you said it actually breaks - why does it? > > > > It breaks because QEMU doesn't set _CCA for virtio-mmio devices, and that > > only becomes a problem when we use the DMA API, because that results in the > > guest taking out a non-cacheable mapping. On ARM (and other archs such as > > Power), having a mismatch between a cacheable and a non-cacheable mapping > > can result in a loss of coherency between the two (for example, if the > > non-cacheable gues accesses bypass the cache, but the cacheable host > > accesses allocate in the cache). > > > I see. And I guess using a cacheable mapping is significantly faster. > I would say we want to typically use cacheable for virtio then, > whether we bypass the IOMMU or not. I guess this is why we always set > _CCA/DT correctly, right? At the moment, _CCA/DT is pretty much never set correctly for virtio-mmio (that is, it isn't set even though the device is cache coherent). If it *was* set correctly, then we wouldn't have needed to revert my patch. Robin's patch to only use the DMA API if _CCA/DT is present would work (although the thing that he posted was buggy iirc). Will ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio: Try to untangle DMA coherency
On Thu, Feb 09, 2017 at 06:31:18PM +, Will Deacon wrote: > On Thu, Feb 09, 2017 at 08:17:16PM +0200, Michael S. Tsirkin wrote: > > On Thu, Feb 02, 2017 at 04:40:49PM +, Will Deacon wrote: > > > On Thu, Feb 02, 2017 at 06:30:28PM +0200, Michael S. Tsirkin wrote: > > > > I am inclined to say, for 4.10 let's revert > > > > c7070619f3408d9a0dffbed9149e6f00479cf43b since what it fixes is not a > > > > regression in 4.10. > > > > > > No complaints there, as long as we can keep working to fix this for 4.11 > > > and onwards. You'll also need to cc stable on the revert. > > > > > > > So I think we can defer the fix to 4.11. > > > > I think we still want f7f6634d23830ff74335734fbdb28ea109c1f349 > > > > for hosts with virtio 1 support. > > > > > > > > All this will hopefully push hosts to just implement virtio 1. > > > > For mmio the changes are very small: several new registers, > > > > that's all. You want this for proper 64 bit dma mask anyway. > > > > > > As I've said, virtio 1 will have exactly the same issue unless we start > > > requiring firmware to advertise dma-coherent/_CCA for virtio-mmio > > > devices correctly. > > > > > > > OK I read up on _CCA in ACPI spec. It says: > > The _CCA object returns whether or not a bus-master device supports > > hardware managed cache coherency. Expected values are 0 to indicate it > > is not supported, and 1 to indicate that it is supported. > > > > So if host is cache coherent, and guest thinks it isn't, we incur > > unnecessary overhead by wasting coherent memory. > > I get that but you said it actually breaks - why does it? > > It breaks because QEMU doesn't set _CCA for virtio-mmio devices, and that > only becomes a problem when we use the DMA API, because that results in the > guest taking out a non-cacheable mapping. On ARM (and other archs such as > Power), having a mismatch between a cacheable and a non-cacheable mapping > can result in a loss of coherency between the two (for example, if the > non-cacheable gues accesses bypass the cache, but the cacheable host > accesses allocate in the cache). > > Will I see. And I guess using a cacheable mapping is significantly faster. I would say we want to typically use cacheable for virtio then, whether we bypass the IOMMU or not. I guess this is why we always set _CCA/DT correctly, right? -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio: Try to untangle DMA coherency
On Thu, Feb 02, 2017 at 04:40:49PM +, Will Deacon wrote: > On Thu, Feb 02, 2017 at 06:30:28PM +0200, Michael S. Tsirkin wrote: > > I am inclined to say, for 4.10 let's revert > > c7070619f3408d9a0dffbed9149e6f00479cf43b since what it fixes is not a > > regression in 4.10. > > No complaints there, as long as we can keep working to fix this for 4.11 > and onwards. You'll also need to cc stable on the revert. > > > So I think we can defer the fix to 4.11. > > I think we still want f7f6634d23830ff74335734fbdb28ea109c1f349 > > for hosts with virtio 1 support. > > > > All this will hopefully push hosts to just implement virtio 1. > > For mmio the changes are very small: several new registers, > > that's all. You want this for proper 64 bit dma mask anyway. > > As I've said, virtio 1 will have exactly the same issue unless we start > requiring firmware to advertise dma-coherent/_CCA for virtio-mmio > devices correctly. > > Will OK I read up on _CCA in ACPI spec. It says: The _CCA object returns whether or not a bus-master device supports hardware managed cache coherency. Expected values are 0 to indicate it is not supported, and 1 to indicate that it is supported. So if host is cache coherent, and guest thinks it isn't, we incur unnecessary overhead by wasting coherent memory. I get that but you said it actually breaks - why does it? -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio: Try to untangle DMA coherency
On Thu, Feb 09, 2017 at 08:17:16PM +0200, Michael S. Tsirkin wrote: > On Thu, Feb 02, 2017 at 04:40:49PM +, Will Deacon wrote: > > On Thu, Feb 02, 2017 at 06:30:28PM +0200, Michael S. Tsirkin wrote: > > > I am inclined to say, for 4.10 let's revert > > > c7070619f3408d9a0dffbed9149e6f00479cf43b since what it fixes is not a > > > regression in 4.10. > > > > No complaints there, as long as we can keep working to fix this for 4.11 > > and onwards. You'll also need to cc stable on the revert. > > > > > So I think we can defer the fix to 4.11. > > > I think we still want f7f6634d23830ff74335734fbdb28ea109c1f349 > > > for hosts with virtio 1 support. > > > > > > All this will hopefully push hosts to just implement virtio 1. > > > For mmio the changes are very small: several new registers, > > > that's all. You want this for proper 64 bit dma mask anyway. > > > > As I've said, virtio 1 will have exactly the same issue unless we start > > requiring firmware to advertise dma-coherent/_CCA for virtio-mmio > > devices correctly. > > > > OK I read up on _CCA in ACPI spec. It says: > The _CCA object returns whether or not a bus-master device supports > hardware managed cache coherency. Expected values are 0 to indicate it > is not supported, and 1 to indicate that it is supported. > > So if host is cache coherent, and guest thinks it isn't, we incur > unnecessary overhead by wasting coherent memory. > I get that but you said it actually breaks - why does it? It breaks because QEMU doesn't set _CCA for virtio-mmio devices, and that only becomes a problem when we use the DMA API, because that results in the guest taking out a non-cacheable mapping. On ARM (and other archs such as Power), having a mismatch between a cacheable and a non-cacheable mapping can result in a loss of coherency between the two (for example, if the non-cacheable gues accesses bypass the cache, but the cacheable host accesses allocate in the cache). Will ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: packed ring layout proposal v2
On 02/09/2017 06:43 PM, Michael S. Tsirkin wrote: [...] >> [...] >>> Note: should this proposal be accepted and approved, one or more >>> claims disclosed to the TC admin and listed on the Virtio TC >>> IPR page https://www.oasis-open.org/committees/virtio/ipr.php >>> might become Essential Claims. >>> >> >> I have trouble parsing that legal stuff. Do I read that right, that >> these claims can be implemented as part of a virtio implementation without >> any worries (e.g. non open source HW implementation or non open source >> hypervisor)? > > By that legal stuff do you mean the IPR or the Note? Both in combination. > Not representing Red Hat here, and definitely not legal advice, below is > just my personal understanding of the IPR requirements. > > Virtio TC operates under the Non-Assertion Mode of the OASIS IPR Policy: > https://www.oasis-open.org/policies-guidelines/ipr#Non-Assertion-Mode > > As far as I can tell, the hardware and software question is covered > by that policy, since it states: > > Covered Product - includes only those specific portions of a > product (hardware, software or combinations thereof) > > Also as far as I can tell IPR does not mention source at all, so I think > virtio IPR would apply to open and closed source software equally. > > The Note is included to satisfy the disclosure requirements. > > Does this answer the question? Yes, thanks. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
RE: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method
Why not use existing seqlock's? -Original Message- From: Thomas Gleixner [mailto:t...@linutronix.de] Sent: Thursday, February 9, 2017 9:08 AM To: Vitaly Kuznetsov Cc: x...@kernel.org; Andy Lutomirski ; Ingo Molnar ; H. Peter Anvin ; KY Srinivasan ; Haiyang Zhang ; Stephen Hemminger ; Dexuan Cui ; linux-ker...@vger.kernel.org; de...@linuxdriverproject.org; virtualization@lists.linux-foundation.org Subject: Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method On Thu, 9 Feb 2017, Vitaly Kuznetsov wrote: > +#ifdef CONFIG_HYPERV_TSCPAGE > +static notrace u64 vread_hvclock(int *mode) { > + const struct ms_hyperv_tsc_page *tsc_pg = > + (const struct ms_hyperv_tsc_page *)&hvclock_page; > + u64 sequence, scale, offset, current_tick, cur_tsc; > + > + while (1) { > + sequence = READ_ONCE(tsc_pg->tsc_sequence); > + if (!sequence) > + break; > + > + scale = READ_ONCE(tsc_pg->tsc_scale); > + offset = READ_ONCE(tsc_pg->tsc_offset); > + rdtscll(cur_tsc); > + > + current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset; > + > + if (READ_ONCE(tsc_pg->tsc_sequence) == sequence) > + return current_tick; That sequence stuff lacks still a sensible explanation. It's fundamentally different from the sequence counting we do in the kernel, so documentation for it is really required. Thanks, tglx ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
RE: [PATCH 1/2] hyperv: implement hv_get_tsc_page()
The actual code looks fine, but the style police will not like you. { should be at start of line on functions. And #else should be at start of line, But maybe this was just more of exchange mangling the mail. -Original Message- From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] Sent: Thursday, February 9, 2017 6:11 AM To: x...@kernel.org; Andy Lutomirski Cc: Thomas Gleixner ; Ingo Molnar ; H. Peter Anvin ; KY Srinivasan ; Haiyang Zhang ; Stephen Hemminger ; Dexuan Cui ; linux-ker...@vger.kernel.org; de...@linuxdriverproject.org; virtualization@lists.linux-foundation.org Subject: [PATCH 1/2] hyperv: implement hv_get_tsc_page() To use Hyper-V TSC page clocksource from vDSO we need to make tsc_pg available. Implement hv_get_tsc_page() and add CONFIG_HYPERV_TSCPAGE to make #ifdef-s simple. Signed-off-by: Vitaly Kuznetsov --- arch/x86/hyperv/hv_init.c | 9 +++-- arch/x86/include/asm/mshyperv.h | 8 drivers/hv/Kconfig | 3 +++ 3 files changed, 18 insertions(+), 2 deletions(-) diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index b371d0e..0ce8485 100644 --- a/arch/x86/hyperv/hv_init.c +++ b/arch/x86/hyperv/hv_init.c @@ -27,10 +27,15 @@ #include -#ifdef CONFIG_X86_64 +#ifdef CONFIG_HYPERV_TSCPAGE static struct ms_hyperv_tsc_page *tsc_pg; +struct ms_hyperv_tsc_page *hv_get_tsc_page(void) { + return tsc_pg; +} + static u64 read_hv_clock_tsc(struct clocksource *arg) { u64 current_tick; @@ -136,7 +141,7 @@ void hyperv_init(void) /* * Register Hyper-V specific clocksource. */ -#ifdef CONFIG_X86_64 +#ifdef CONFIG_HYPERV_TSCPAGE if (ms_hyperv.features & HV_X64_MSR_REFERENCE_TSC_AVAILABLE) { union hv_x64_msr_hypercall_contents tsc_msr; diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h index f8dc370..14dd92c 100644 --- a/arch/x86/include/asm/mshyperv.h +++ b/arch/x86/include/asm/mshyperv.h @@ -173,4 +173,12 @@ void hyperv_report_panic(struct pt_regs *regs); bool hv_is_hypercall_page_setup(void); void hyperv_cleanup(void); #endif +#ifdef CONFIG_HYPERV_TSCPAGE +struct ms_hyperv_tsc_page *hv_get_tsc_page(void); #else static inline +struct ms_hyperv_tsc_page *hv_get_tsc_page(void) { + return NULL; +} +#endif #endif diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig index 0403b51..c29cd53 100644 --- a/drivers/hv/Kconfig +++ b/drivers/hv/Kconfig @@ -7,6 +7,9 @@ config HYPERV Select this option to run Linux as a Hyper-V client operating system. +config HYPERV_TSCPAGE + def_bool HYPERV && X86_64 + config HYPERV_UTILS tristate "Microsoft Hyper-V Utilities driver" depends on HYPERV && CONNECTOR && NLS -- 2.9.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [virtio-dev] packed ring layout proposal v2
On Thu, Feb 09, 2017 at 04:48:53PM +0100, Paolo Bonzini wrote: > > > On 08/02/2017 20:59, Michael S. Tsirkin wrote: > > We couldn't decide what's better for everyone in 1.0 days and I doubt > > we'll be able to now, but yes, benchmarking is needed to make > > sire it's required. Very easy to remove or not to use/support in > > drivers/devices though. > > Fair enough, but of course then we must specify that devices MUST > support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers > SHOULD support both (or use neither). > > The drivers' part adds to the implementation burden, which is why I > wanted to remove it. Alternatively we can say that indirect is > mandatory for both devices and drivers (and save a feature bit), while > VRING_DESC_F_NEXT is optional. > > >>> * Non power-of-2 ring sizes > >>> > >>> As the ring simply wraps around, there's no reason to > >>> require ring size to be power of two. > >>> It can be made a separate feature though. > >> > >> Power of 2 ring sizes are required in order to ignore the high bits of > >> the indices. With non-power-of-2 sizes you are forced to keep the > >> indices less than the ring size. > > > > Right. So > > > > if (unlikely(idx++ > size)) > > idx = 0; > > > > OTOH ring size that's twice larger than necessary > > because of power of two requirements wastes cache. > > I don't know. Power of 2 ring size is pretty standard, I'd rather avoid > the complication and the gratuitous difference with 1.0. I thought originally there's a reason 1.0 rings had to be powers of two but now I don't see why. OK, we can make it a feature flag later if we want to. > If batching is mostly advisory (with exceptions such as mrg-rxbuf) I > don't have any problem with it. > > Paolo ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: packed ring layout proposal v2
On Wed, Feb 08, 2017 at 02:37:57PM +0100, Christian Borntraeger wrote: > On 02/08/2017 04:20 AM, Michael S. Tsirkin wrote: > [...] > > * Prototype > > > > A partial prototype can be found under > > tools/virtio/ringtest/ring.c > > > > Test run: > > [mst@tuck ringtest]$ time ./ring > > real0m0.399s > > user0m0.791s > > sys 0m0.000s > > [mst@tuck ringtest]$ time ./virtio_ring_0_9 > > real0m0.503s > > user0m0.999s > > sys 0m0.000s > > I see similar improvements on s390, so I think this would be a very nice > improvement. > > [...] > > Note: should this proposal be accepted and approved, one or more > > claims disclosed to the TC admin and listed on the Virtio TC > > IPR page https://www.oasis-open.org/committees/virtio/ipr.php > > might become Essential Claims. > > > > I have trouble parsing that legal stuff. Do I read that right, that > these claims can be implemented as part of a virtio implementation without > any worries (e.g. non open source HW implementation or non open source > hypervisor)? By that legal stuff do you mean the IPR or the Note? Not representing Red Hat here, and definitely not legal advice, below is just my personal understanding of the IPR requirements. Virtio TC operates under the Non-Assertion Mode of the OASIS IPR Policy: https://www.oasis-open.org/policies-guidelines/ipr#Non-Assertion-Mode As far as I can tell, the hardware and software question is covered by that policy, since it states: Covered Product - includes only those specific portions of a product (hardware, software or combinations thereof) Also as far as I can tell IPR does not mention source at all, so I think virtio IPR would apply to open and closed source software equally. The Note is included to satisfy the disclosure requirements. Does this answer the question? -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method
On Thu, 9 Feb 2017, Vitaly Kuznetsov wrote: > +#ifdef CONFIG_HYPERV_TSCPAGE > +static notrace u64 vread_hvclock(int *mode) > +{ > + const struct ms_hyperv_tsc_page *tsc_pg = > + (const struct ms_hyperv_tsc_page *)&hvclock_page; > + u64 sequence, scale, offset, current_tick, cur_tsc; > + > + while (1) { > + sequence = READ_ONCE(tsc_pg->tsc_sequence); > + if (!sequence) > + break; > + > + scale = READ_ONCE(tsc_pg->tsc_scale); > + offset = READ_ONCE(tsc_pg->tsc_offset); > + rdtscll(cur_tsc); > + > + current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset; > + > + if (READ_ONCE(tsc_pg->tsc_sequence) == sequence) > + return current_tick; That sequence stuff lacks still a sensible explanation. It's fundamentally different from the sequence counting we do in the kernel, so documentation for it is really required. Thanks, tglx ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [virtio-dev] packed ring layout proposal v2
On Thu, 9 Feb 2017 16:48:53 +0100 Paolo Bonzini wrote: > On 08/02/2017 20:59, Michael S. Tsirkin wrote: > > We couldn't decide what's better for everyone in 1.0 days and I doubt > > we'll be able to now, but yes, benchmarking is needed to make > > sire it's required. Very easy to remove or not to use/support in > > drivers/devices though. > > Fair enough, but of course then we must specify that devices MUST > support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers > SHOULD support both (or use neither). > > The drivers' part adds to the implementation burden, which is why I > wanted to remove it. Alternatively we can say that indirect is > mandatory for both devices and drivers (and save a feature bit), while > VRING_DESC_F_NEXT is optional. I think this (INDIRECT mandatory, NEXT optional) makes sense. But we really need some benchmarking. > > >>> * Non power-of-2 ring sizes > >>> > >>> As the ring simply wraps around, there's no reason to > >>> require ring size to be power of two. > >>> It can be made a separate feature though. > >> > >> Power of 2 ring sizes are required in order to ignore the high bits of > >> the indices. With non-power-of-2 sizes you are forced to keep the > >> indices less than the ring size. > > > > Right. So > > > > if (unlikely(idx++ > size)) > > idx = 0; > > > > OTOH ring size that's twice larger than necessary > > because of power of two requirements wastes cache. > > I don't know. Power of 2 ring size is pretty standard, I'd rather avoid > the complication and the gratuitous difference with 1.0. I agree. I don't think dropping the power of 2 requirement buys us so much that it makes up for the added complexity. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: automatic IRQ affinity for virtio V3
On Thu, Feb 09, 2017 at 05:50:31AM -0800, Christoph Hellwig wrote: > On Sun, Feb 05, 2017 at 06:15:17PM +0100, Christoph Hellwig wrote: > > Hi Michael, hi Jason, > > > > This patches applies a few cleanups to the virtio PCI interrupt handling > > code, and then converts the virtio PCI code to use the automatic MSI-X > > vectors spreading, as well as using the information in virtio-blk > > and virtio-scsi to automatically align the blk-mq queues to the MSI-X > > vectors. > > Any chance to get this in for 4.11 after I got reviews from Jason > for most of the patches? Absolutely, I intend to merge it. -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [virtio-dev] packed ring layout proposal v2
On 08/02/2017 20:59, Michael S. Tsirkin wrote: > We couldn't decide what's better for everyone in 1.0 days and I doubt > we'll be able to now, but yes, benchmarking is needed to make > sire it's required. Very easy to remove or not to use/support in > drivers/devices though. Fair enough, but of course then we must specify that devices MUST support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers SHOULD support both (or use neither). The drivers' part adds to the implementation burden, which is why I wanted to remove it. Alternatively we can say that indirect is mandatory for both devices and drivers (and save a feature bit), while VRING_DESC_F_NEXT is optional. >>> * Non power-of-2 ring sizes >>> >>> As the ring simply wraps around, there's no reason to >>> require ring size to be power of two. >>> It can be made a separate feature though. >> >> Power of 2 ring sizes are required in order to ignore the high bits of >> the indices. With non-power-of-2 sizes you are forced to keep the >> indices less than the ring size. > > Right. So > > if (unlikely(idx++ > size)) > idx = 0; > > OTOH ring size that's twice larger than necessary > because of power of two requirements wastes cache. I don't know. Power of 2 ring size is pretty standard, I'd rather avoid the complication and the gratuitous difference with 1.0. If batching is mostly advisory (with exceptions such as mrg-rxbuf) I don't have any problem with it. Paolo ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method
Hyper-V TSC page clocksource is suitable for vDSO, however, the protocol defined by the hypervisor is different from VCLOCK_PVCLOCK. Implement the required support by adding hvclock_page VVAR. Signed-off-by: Vitaly Kuznetsov --- arch/x86/entry/vdso/vclock_gettime.c | 36 +++ arch/x86/entry/vdso/vdso-layout.lds.S | 3 ++- arch/x86/entry/vdso/vdso2c.c | 3 +++ arch/x86/entry/vdso/vma.c | 7 +++ arch/x86/hyperv/hv_init.c | 3 +++ arch/x86/include/asm/clocksource.h| 3 ++- arch/x86/include/asm/vdso.h | 1 + 7 files changed, 54 insertions(+), 2 deletions(-) diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index 9d4d6e1..4af10b4 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -32,6 +33,11 @@ extern u8 pvclock_page __attribute__((visibility("hidden"))); #endif +#ifdef CONFIG_HYPERV_TSCPAGE +extern u8 hvclock_page + __attribute__((visibility("hidden"))); +#endif + #ifndef BUILD_VDSO32 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts) @@ -141,6 +147,32 @@ static notrace u64 vread_pvclock(int *mode) return last; } #endif +#ifdef CONFIG_HYPERV_TSCPAGE +static notrace u64 vread_hvclock(int *mode) +{ + const struct ms_hyperv_tsc_page *tsc_pg = + (const struct ms_hyperv_tsc_page *)&hvclock_page; + u64 sequence, scale, offset, current_tick, cur_tsc; + + while (1) { + sequence = READ_ONCE(tsc_pg->tsc_sequence); + if (!sequence) + break; + + scale = READ_ONCE(tsc_pg->tsc_scale); + offset = READ_ONCE(tsc_pg->tsc_offset); + rdtscll(cur_tsc); + + current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset; + + if (READ_ONCE(tsc_pg->tsc_sequence) == sequence) + return current_tick; + } + + *mode = VCLOCK_NONE; + return 0; +} +#endif notrace static u64 vread_tsc(void) { @@ -173,6 +205,10 @@ notrace static inline u64 vgetsns(int *mode) else if (gtod->vclock_mode == VCLOCK_PVCLOCK) cycles = vread_pvclock(mode); #endif +#ifdef CONFIG_HYPERV_TSCPAGE + else if (gtod->vclock_mode == VCLOCK_HVCLOCK) + cycles = vread_hvclock(mode); +#endif else return 0; v = (cycles - gtod->cycle_last) & gtod->mask; diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S index a708aa9..8ebb4b6 100644 --- a/arch/x86/entry/vdso/vdso-layout.lds.S +++ b/arch/x86/entry/vdso/vdso-layout.lds.S @@ -25,7 +25,7 @@ SECTIONS * segment. */ - vvar_start = . - 2 * PAGE_SIZE; + vvar_start = . - 3 * PAGE_SIZE; vvar_page = vvar_start; /* Place all vvars at the offsets in asm/vvar.h. */ @@ -36,6 +36,7 @@ SECTIONS #undef EMIT_VVAR pvclock_page = vvar_start + PAGE_SIZE; + hvclock_page = vvar_start + 2 * PAGE_SIZE; . = SIZEOF_HEADERS; diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c index 491020b..0780a44 100644 --- a/arch/x86/entry/vdso/vdso2c.c +++ b/arch/x86/entry/vdso/vdso2c.c @@ -74,6 +74,7 @@ enum { sym_vvar_page, sym_hpet_page, sym_pvclock_page, + sym_hvclock_page, sym_VDSO_FAKE_SECTION_TABLE_START, sym_VDSO_FAKE_SECTION_TABLE_END, }; @@ -82,6 +83,7 @@ const int special_pages[] = { sym_vvar_page, sym_hpet_page, sym_pvclock_page, + sym_hvclock_page, }; struct vdso_sym { @@ -94,6 +96,7 @@ struct vdso_sym required_syms[] = { [sym_vvar_page] = {"vvar_page", true}, [sym_hpet_page] = {"hpet_page", true}, [sym_pvclock_page] = {"pvclock_page", true}, + [sym_hvclock_page] = {"hvclock_page", true}, [sym_VDSO_FAKE_SECTION_TABLE_START] = { "VDSO_FAKE_SECTION_TABLE_START", false }, diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index 10820f6..b256a3b 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -21,6 +21,7 @@ #include #include #include +#include #if defined(CONFIG_X86_64) unsigned int __read_mostly vdso64_enabled = 1; @@ -120,6 +121,12 @@ static int vvar_fault(const struct vm_special_mapping *sm, vmf->address, __pa(pvti) >> PAGE_SHIFT); } + } else if (sym_offset == image->sym_hvclock_page) { + struct ms_hyperv_tsc_page *tsc_pg = hv_get_tsc_page(); + + if (tsc_pg && vclock_was_used(VCLOCK_HVCLOCK)) + ret = vm_insert_pfn(vma, vmf->address, + vmalloc_to_pfn(tsc_pg)); }
[PATCH 1/2] hyperv: implement hv_get_tsc_page()
To use Hyper-V TSC page clocksource from vDSO we need to make tsc_pg available. Implement hv_get_tsc_page() and add CONFIG_HYPERV_TSCPAGE to make #ifdef-s simple. Signed-off-by: Vitaly Kuznetsov --- arch/x86/hyperv/hv_init.c | 9 +++-- arch/x86/include/asm/mshyperv.h | 8 drivers/hv/Kconfig | 3 +++ 3 files changed, 18 insertions(+), 2 deletions(-) diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index b371d0e..0ce8485 100644 --- a/arch/x86/hyperv/hv_init.c +++ b/arch/x86/hyperv/hv_init.c @@ -27,10 +27,15 @@ #include -#ifdef CONFIG_X86_64 +#ifdef CONFIG_HYPERV_TSCPAGE static struct ms_hyperv_tsc_page *tsc_pg; +struct ms_hyperv_tsc_page *hv_get_tsc_page(void) +{ + return tsc_pg; +} + static u64 read_hv_clock_tsc(struct clocksource *arg) { u64 current_tick; @@ -136,7 +141,7 @@ void hyperv_init(void) /* * Register Hyper-V specific clocksource. */ -#ifdef CONFIG_X86_64 +#ifdef CONFIG_HYPERV_TSCPAGE if (ms_hyperv.features & HV_X64_MSR_REFERENCE_TSC_AVAILABLE) { union hv_x64_msr_hypercall_contents tsc_msr; diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h index f8dc370..14dd92c 100644 --- a/arch/x86/include/asm/mshyperv.h +++ b/arch/x86/include/asm/mshyperv.h @@ -173,4 +173,12 @@ void hyperv_report_panic(struct pt_regs *regs); bool hv_is_hypercall_page_setup(void); void hyperv_cleanup(void); #endif +#ifdef CONFIG_HYPERV_TSCPAGE +struct ms_hyperv_tsc_page *hv_get_tsc_page(void); +#else +static inline struct ms_hyperv_tsc_page *hv_get_tsc_page(void) +{ + return NULL; +} +#endif #endif diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig index 0403b51..c29cd53 100644 --- a/drivers/hv/Kconfig +++ b/drivers/hv/Kconfig @@ -7,6 +7,9 @@ config HYPERV Select this option to run Linux as a Hyper-V client operating system. +config HYPERV_TSCPAGE + def_bool HYPERV && X86_64 + config HYPERV_UTILS tristate "Microsoft Hyper-V Utilities driver" depends on HYPERV && CONNECTOR && NLS -- 2.9.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 0/2] x86/vdso: Add Hyper-V TSC page clocksource support
Hi, Hyper-V TSC page clocksource is suitable for vDSO, however, the protocol defined by the hypervisor is different from VCLOCK_PVCLOCK. Implemented the required support. Simple sysbench test shows the following results: Before: # time sysbench --test=memory --max-requests=50 run ... real1m22.618s user0m50.193s sys 0m32.268s After: # time sysbench --test=memory --max-requests=50 run ... real0m47.241s user0m47.117s sys 0m0.008s So it seems it is worth it. As nobody seems to be strongly offended by my RFC I'm sending first non-RFC version. Patch 1 is made on top of K. Y.'s code refactoring which moved tsc page clocksource to arch/x86/hyperv/hv_init.c, this is currently present in Greg's char-misc-next tree. Changes since RFC: - Use mul_u64_u64_shr() instead of an open coded implementation [Andy Lutomirski, Thomas Gleixner] - Don't use the same pvclock_page for both VCLOCK_PVCLOCK and VCLOCK_HVCLOCK, create another one [Andy Lutomirski] - Fix issues reported by kbuild test robot. - Rename HYPERV_CLOCK -> HYPERV_TSCPAGE to avoid the ambiguity. I'm also going to try to optimize mul_u64_u64_shr() for 32bit but this can be split from this series I guess. Vitaly Kuznetsov (2): hyperv: implement hv_get_tsc_page() x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method arch/x86/entry/vdso/vclock_gettime.c | 36 +++ arch/x86/entry/vdso/vdso-layout.lds.S | 3 ++- arch/x86/entry/vdso/vdso2c.c | 3 +++ arch/x86/entry/vdso/vma.c | 7 +++ arch/x86/hyperv/hv_init.c | 12 ++-- arch/x86/include/asm/clocksource.h| 3 ++- arch/x86/include/asm/mshyperv.h | 8 arch/x86/include/asm/vdso.h | 1 + drivers/hv/Kconfig| 3 +++ 9 files changed, 72 insertions(+), 4 deletions(-) -- 2.9.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: automatic IRQ affinity for virtio V3
On Sun, Feb 05, 2017 at 06:15:17PM +0100, Christoph Hellwig wrote: > Hi Michael, hi Jason, > > This patches applies a few cleanups to the virtio PCI interrupt handling > code, and then converts the virtio PCI code to use the automatic MSI-X > vectors spreading, as well as using the information in virtio-blk > and virtio-scsi to automatically align the blk-mq queues to the MSI-X > vectors. Any chance to get this in for 4.11 after I got reviews from Jason for most of the patches? ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization