Re: [PATCH 1/2] hyperv: implement hv_get_tsc_page()

2017-02-09 Thread Stephen Hemminger
On Thu, 9 Feb 2017 21:14:25 +0100 (CET)
Thomas Gleixner  wrote:

> On Thu, 9 Feb 2017, Stephen Hemminger wrote:
> 
> > The actual code looks fine, but the style police will not like you.
> > { should be at start of line on functions.
> > And #else should be at start of line,
> > 
> > But maybe this was just more of exchange mangling the mail.  
> 
> Looks like.
> 
> > +struct ms_hyperv_tsc_page *hv_get_tsc_page(void) {
> > +   return tsc_pg;
> > +}
> > +  
> 
> That's how it reads in a proper mail client connected to a proper mail
> server:
> 
> > +struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
> > +{
> > +   return tsc_pg;
> > +}  
> 
> :)


Yup. it looks like the mail server is trying to be "helpful" by eliminating 
extra white space.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

2017-02-09 Thread Stephen Hemminger
On Thu, 9 Feb 2017 14:55:50 -0800
Andy Lutomirski  wrote:

> On Thu, Feb 9, 2017 at 12:45 PM, KY Srinivasan  wrote:
> >
> >  
> >> -Original Message-
> >> From: Thomas Gleixner [mailto:t...@linutronix.de]
> >> Sent: Thursday, February 9, 2017 9:08 AM
> >> To: Vitaly Kuznetsov 
> >> Cc: x...@kernel.org; Andy Lutomirski ; Ingo Molnar
> >> ; H. Peter Anvin ; KY Srinivasan
> >> ; Haiyang Zhang ; Stephen
> >> Hemminger ; Dexuan Cui
> >> ; linux-ker...@vger.kernel.org;
> >> de...@linuxdriverproject.org; virtualization@lists.linux-foundation.org
> >> Subject: Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read
> >> method
> >>
> >> On Thu, 9 Feb 2017, Vitaly Kuznetsov wrote:  
> >> > +#ifdef CONFIG_HYPERV_TSCPAGE
> >> > +static notrace u64 vread_hvclock(int *mode)
> >> > +{
> >> > +   const struct ms_hyperv_tsc_page *tsc_pg =
> >> > +   (const struct ms_hyperv_tsc_page *)&hvclock_page;
> >> > +   u64 sequence, scale, offset, current_tick, cur_tsc;
> >> > +
> >> > +   while (1) {
> >> > +   sequence = READ_ONCE(tsc_pg->tsc_sequence);
> >> > +   if (!sequence)
> >> > +   break;
> >> > +
> >> > +   scale = READ_ONCE(tsc_pg->tsc_scale);
> >> > +   offset = READ_ONCE(tsc_pg->tsc_offset);
> >> > +   rdtscll(cur_tsc);
> >> > +
> >> > +   current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
> >> > +
> >> > +   if (READ_ONCE(tsc_pg->tsc_sequence) == sequence)
> >> > +   return current_tick;  
> >>
> >> That sequence stuff lacks still a sensible explanation. It's fundamentally
> >> different from the sequence counting we do in the kernel, so documentation
> >> for it is really required.  
> >
> > The host is updating multiple fields in this shared TSC page and the 
> > sequence number is
> > used to ensure that the guest sees a consistent set values published. If I 
> > remember
> > correctly, Xen has a similar mechanism.  
> 
> So what's the actual protocol?  When the hypervisor updates the page,
> does it freeze all guest cpus?  If not, how does it maintain
> atomicity?

The protocol looks a lot like Linux seqlock, but it has an extra protection
which is missing here.

The host needs to update sequence number twice in order to guarantee ordering.
Otherwise it is possible that Host and guest can race.

Host
Write offset
Write scale
Set tsc_sequence = N
  Guest
read sequence = N
Read scale
Write scale
Write offset

Read Offset
Check sequence == N
Set tsc_sequence = N +1

Look like the current host side protocol is wrong.

The solution that Andi Kleen invented, and I used in seqlock was for the writer 
to update
sequence at start and end of transaction. If sequence number is odd, then the 
reader knows
it is looking at stale data.
Host
Write offset
Write scale
Set tsc_sequence = N (end of 
transaction)
  Guest
read sequence = N
Spin until sequence is even (N is even)
Read scale
Set tsc_sequence += 1
Write scale
Write offset

Read Offset
Check sequence == N? (fails is N + 1)
Set tsc_sequence += 1 (end of 
transaction)
read sequence = N+2
Spin until sequence is even (ie N +2)
Read scale  
Read Offset
Check sequence == N +2? (yes ok).

Also it is faster to just read scale and offset with this loop and save
the reading of TSC and doing multiply until after scale/offset has been 
acquired.




___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

2017-02-09 Thread Andy Lutomirski
On Thu, Feb 9, 2017 at 12:45 PM, KY Srinivasan  wrote:
>
>
>> -Original Message-
>> From: Thomas Gleixner [mailto:t...@linutronix.de]
>> Sent: Thursday, February 9, 2017 9:08 AM
>> To: Vitaly Kuznetsov 
>> Cc: x...@kernel.org; Andy Lutomirski ; Ingo Molnar
>> ; H. Peter Anvin ; KY Srinivasan
>> ; Haiyang Zhang ; Stephen
>> Hemminger ; Dexuan Cui
>> ; linux-ker...@vger.kernel.org;
>> de...@linuxdriverproject.org; virtualization@lists.linux-foundation.org
>> Subject: Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read
>> method
>>
>> On Thu, 9 Feb 2017, Vitaly Kuznetsov wrote:
>> > +#ifdef CONFIG_HYPERV_TSCPAGE
>> > +static notrace u64 vread_hvclock(int *mode)
>> > +{
>> > +   const struct ms_hyperv_tsc_page *tsc_pg =
>> > +   (const struct ms_hyperv_tsc_page *)&hvclock_page;
>> > +   u64 sequence, scale, offset, current_tick, cur_tsc;
>> > +
>> > +   while (1) {
>> > +   sequence = READ_ONCE(tsc_pg->tsc_sequence);
>> > +   if (!sequence)
>> > +   break;
>> > +
>> > +   scale = READ_ONCE(tsc_pg->tsc_scale);
>> > +   offset = READ_ONCE(tsc_pg->tsc_offset);
>> > +   rdtscll(cur_tsc);
>> > +
>> > +   current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
>> > +
>> > +   if (READ_ONCE(tsc_pg->tsc_sequence) == sequence)
>> > +   return current_tick;
>>
>> That sequence stuff lacks still a sensible explanation. It's fundamentally
>> different from the sequence counting we do in the kernel, so documentation
>> for it is really required.
>
> The host is updating multiple fields in this shared TSC page and the sequence 
> number is
> used to ensure that the guest sees a consistent set values published. If I 
> remember
> correctly, Xen has a similar mechanism.

So what's the actual protocol?  When the hypervisor updates the page,
does it freeze all guest cpus?  If not, how does it maintain
atomicity?

--Andy
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio: Try to untangle DMA coherency

2017-02-09 Thread Alexander Graf



On 09/02/2017 21:57, Michael S. Tsirkin wrote:

On Wed, Feb 08, 2017 at 01:58:10PM +0100, Alexander Graf wrote:

On 02/01/2017 08:19 PM, Michael S. Tsirkin wrote:

On Wed, Feb 01, 2017 at 06:27:09PM +, Will Deacon wrote:

On Wed, Feb 01, 2017 at 08:09:21PM +0200, Michael S. Tsirkin wrote:

On Wed, Feb 01, 2017 at 12:25:57PM +, Robin Murphy wrote:

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 7e38ed79c3fc..961af25b385c 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -160,10 +161,14 @@ static bool vring_use_dma_api(struct virtio_device *vdev)
return true;

/*
-* On ARM-based machines, the DMA ops will do the right thing,
-* so always use them with legacy devices.
+* On ARM-based machines, the coherent DMA ops will do the right
+* thing, so always use them with legacy devices. However, using
+* non-coherent DMA when the host *is* actually coherent, but has
+* forgotten to tell us, is going to break badly; since this situation
+* already exists in the wild, maintain the old behaviour there.
 */
-   if (IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_ARM64))
+   if ((IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_ARM64)) &&
+   device_get_dma_attr(&vdev->dev) == DEV_DMA_COHERENT)
return !virtio_has_feature(vdev, VIRTIO_F_VERSION_1);

return false;


This is exactly what I feared.


Yes, sorry about this. It works fine for virtio-pci (where "dma-coherent"
is used) and it also works on the fastmodel if you disable cache-modelling
(which is needed to make the thing run at a usable pace) so we didn't spot
this in testing.


Could we identify fastboot and do the special dance just for it?


[assuming you mean fastmodel instead of fastboot]


I'd like to do that instead. It's fastboot doing the unreasonable thing
here and deviating from what every other legacy device without exception
did for years. If this means fastboot will need to update to virtio 1,
all the better.


The problem still exists with virtio 1, unless we require that the
"dma-coherent" property is set/unset correctly when VIRTIO_F_IOMMU_PLATFORM
is advertised by the device (which is what I suggested in my reply).


I'm not ignoring that, but I need to understand that part a bit better.
I'll reply to that patch in a day or two after looking at how _CCA is
supposed to work.


We can't detect the fastmodel,


Surely, it puts a hardware id somewhere? I think you mean
fastmodel isn't always affected, right?


but we could implicitly treat virtio-mmio
devices as cache-coherent regardless of the "dma-coherent" flag. I already
prototyped this, but I suspect the devicetree people will push back (and
there's a similar patch needed for ACPI).

See below. Do you prefer this approach?

Will

--->8


I'd like to see basically

if (fastmodel)
a pile of special work-arounds
else
not less hacky but more common virtio work-arounds

:)

And then I can apply whatever comes from @arm.com and not
worry about breaking actual hardware.


I'm actually seeing the exact same breakage in QEMU right now, so it's not
fast model related at all. In QEMU we also don't properly set the
dma-coherent flag, so we run into cache coherency problems.


Alex


But with latest revert QEMU should now work, right?


Yes, with the revert QEMU works fine. But please keep in mind that it's 
not just the fast model that's broken here :).



Alex
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio: Try to untangle DMA coherency

2017-02-09 Thread Michael S. Tsirkin
On Wed, Feb 08, 2017 at 01:58:10PM +0100, Alexander Graf wrote:
> On 02/01/2017 08:19 PM, Michael S. Tsirkin wrote:
> > On Wed, Feb 01, 2017 at 06:27:09PM +, Will Deacon wrote:
> > > On Wed, Feb 01, 2017 at 08:09:21PM +0200, Michael S. Tsirkin wrote:
> > > > On Wed, Feb 01, 2017 at 12:25:57PM +, Robin Murphy wrote:
> > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > b/drivers/virtio/virtio_ring.c
> > > > > index 7e38ed79c3fc..961af25b385c 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -20,6 +20,7 @@
> > > > >  #include 
> > > > >  #include 
> > > > >  #include 
> > > > > +#include 
> > > > >  #include 
> > > > >  #include 
> > > > >  #include 
> > > > > @@ -160,10 +161,14 @@ static bool vring_use_dma_api(struct 
> > > > > virtio_device *vdev)
> > > > >   return true;
> > > > > 
> > > > >   /*
> > > > > -  * On ARM-based machines, the DMA ops will do the right thing,
> > > > > -  * so always use them with legacy devices.
> > > > > +  * On ARM-based machines, the coherent DMA ops will do the right
> > > > > +  * thing, so always use them with legacy devices. However, using
> > > > > +  * non-coherent DMA when the host *is* actually coherent, but 
> > > > > has
> > > > > +  * forgotten to tell us, is going to break badly; since this 
> > > > > situation
> > > > > +  * already exists in the wild, maintain the old behaviour there.
> > > > >*/
> > > > > - if (IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_ARM64))
> > > > > + if ((IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_ARM64)) &&
> > > > > + device_get_dma_attr(&vdev->dev) == DEV_DMA_COHERENT)
> > > > >   return !virtio_has_feature(vdev, VIRTIO_F_VERSION_1);
> > > > > 
> > > > >   return false;
> > > > 
> > > > This is exactly what I feared.
> > > 
> > > Yes, sorry about this. It works fine for virtio-pci (where "dma-coherent"
> > > is used) and it also works on the fastmodel if you disable cache-modelling
> > > (which is needed to make the thing run at a usable pace) so we didn't spot
> > > this in testing.
> > > 
> > > > Could we identify fastboot and do the special dance just for it?
> > > 
> > > [assuming you mean fastmodel instead of fastboot]
> > > 
> > > > I'd like to do that instead. It's fastboot doing the unreasonable thing
> > > > here and deviating from what every other legacy device without exception
> > > > did for years. If this means fastboot will need to update to virtio 1,
> > > > all the better.
> > > 
> > > The problem still exists with virtio 1, unless we require that the
> > > "dma-coherent" property is set/unset correctly when 
> > > VIRTIO_F_IOMMU_PLATFORM
> > > is advertised by the device (which is what I suggested in my reply).
> > 
> > I'm not ignoring that, but I need to understand that part a bit better.
> > I'll reply to that patch in a day or two after looking at how _CCA is
> > supposed to work.
> > 
> > > We can't detect the fastmodel,
> > 
> > Surely, it puts a hardware id somewhere? I think you mean
> > fastmodel isn't always affected, right?
> > 
> > > but we could implicitly treat virtio-mmio
> > > devices as cache-coherent regardless of the "dma-coherent" flag. I already
> > > prototyped this, but I suspect the devicetree people will push back (and
> > > there's a similar patch needed for ACPI).
> > > 
> > > See below. Do you prefer this approach?
> > > 
> > > Will
> > > 
> > > --->8
> > 
> > I'd like to see basically
> > 
> > if (fastmodel)
> > a pile of special work-arounds
> > else
> > not less hacky but more common virtio work-arounds
> > 
> > :)
> > 
> > And then I can apply whatever comes from @arm.com and not
> > worry about breaking actual hardware.
> 
> I'm actually seeing the exact same breakage in QEMU right now, so it's not
> fast model related at all. In QEMU we also don't properly set the
> dma-coherent flag, so we run into cache coherency problems.
> 
> 
> Alex

But with latest revert QEMU should now work, right?

-- 
MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


RE: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

2017-02-09 Thread KY Srinivasan via Virtualization


> -Original Message-
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> Sent: Thursday, February 9, 2017 9:08 AM
> To: Vitaly Kuznetsov 
> Cc: x...@kernel.org; Andy Lutomirski ; Ingo Molnar
> ; H. Peter Anvin ; KY Srinivasan
> ; Haiyang Zhang ; Stephen
> Hemminger ; Dexuan Cui
> ; linux-ker...@vger.kernel.org;
> de...@linuxdriverproject.org; virtualization@lists.linux-foundation.org
> Subject: Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read
> method
> 
> On Thu, 9 Feb 2017, Vitaly Kuznetsov wrote:
> > +#ifdef CONFIG_HYPERV_TSCPAGE
> > +static notrace u64 vread_hvclock(int *mode)
> > +{
> > +   const struct ms_hyperv_tsc_page *tsc_pg =
> > +   (const struct ms_hyperv_tsc_page *)&hvclock_page;
> > +   u64 sequence, scale, offset, current_tick, cur_tsc;
> > +
> > +   while (1) {
> > +   sequence = READ_ONCE(tsc_pg->tsc_sequence);
> > +   if (!sequence)
> > +   break;
> > +
> > +   scale = READ_ONCE(tsc_pg->tsc_scale);
> > +   offset = READ_ONCE(tsc_pg->tsc_offset);
> > +   rdtscll(cur_tsc);
> > +
> > +   current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
> > +
> > +   if (READ_ONCE(tsc_pg->tsc_sequence) == sequence)
> > +   return current_tick;
> 
> That sequence stuff lacks still a sensible explanation. It's fundamentally
> different from the sequence counting we do in the kernel, so documentation
> for it is really required.

The host is updating multiple fields in this shared TSC page and the sequence 
number is
used to ensure that the guest sees a consistent set values published. If I 
remember
correctly, Xen has a similar mechanism.

K. Y
> 
> Thanks,
> 
>   tglx
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


RE: [PATCH 1/2] hyperv: implement hv_get_tsc_page()

2017-02-09 Thread Thomas Gleixner
On Thu, 9 Feb 2017, Stephen Hemminger wrote:

> The actual code looks fine, but the style police will not like you.
> { should be at start of line on functions.
> And #else should be at start of line,
> 
> But maybe this was just more of exchange mangling the mail.

Looks like.

> +struct ms_hyperv_tsc_page *hv_get_tsc_page(void) {
> + return tsc_pg;
> +}
> +

That's how it reads in a proper mail client connected to a proper mail
server:

> +struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
> +{
> +   return tsc_pg;
> +}

:)

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio: Try to untangle DMA coherency

2017-02-09 Thread Ard Biesheuvel
On 9 February 2017 at 18:49, Michael S. Tsirkin  wrote:
> On Thu, Feb 09, 2017 at 06:31:18PM +, Will Deacon wrote:
>> On Thu, Feb 09, 2017 at 08:17:16PM +0200, Michael S. Tsirkin wrote:
>> > On Thu, Feb 02, 2017 at 04:40:49PM +, Will Deacon wrote:
>> > > On Thu, Feb 02, 2017 at 06:30:28PM +0200, Michael S. Tsirkin wrote:
>> > > > I am inclined to say, for 4.10 let's revert
>> > > > c7070619f3408d9a0dffbed9149e6f00479cf43b since what it fixes is not a
>> > > > regression in 4.10.
>> > >
>> > > No complaints there, as long as we can keep working to fix this for 4.11
>> > > and onwards. You'll also need to cc stable on the revert.
>> > >
>> > > > So I think we can defer the fix to 4.11.
>> > > > I think we still want f7f6634d23830ff74335734fbdb28ea109c1f349
>> > > > for hosts with virtio 1 support.
>> > > >
>> > > > All this will hopefully push hosts to just implement virtio 1.
>> > > > For mmio the changes are very small: several new registers,
>> > > > that's all. You want this for proper 64 bit dma mask anyway.
>> > >
>> > > As I've said, virtio 1 will have exactly the same issue unless we start
>> > > requiring firmware to advertise dma-coherent/_CCA for virtio-mmio
>> > > devices correctly.
>> > >
>> >
>> > OK I read up on _CCA in ACPI spec. It says:
>> > The _CCA object returns whether or not a bus-master device supports
>> > hardware managed cache coherency. Expected values are 0 to indicate it
>> > is not supported, and 1 to indicate that it is supported.
>> >
>> > So if host is cache coherent, and guest thinks it isn't, we incur
>> > unnecessary overhead by wasting coherent memory.
>> > I get that but you said it actually breaks - why does it?
>>
>> It breaks because QEMU doesn't set _CCA for virtio-mmio devices, and that
>> only becomes a problem when we use the DMA API, because that results in the
>> guest taking out a non-cacheable mapping. On ARM (and other archs such as
>> Power), having a mismatch between a cacheable and a non-cacheable mapping
>> can result in a loss of coherency between the two (for example, if the
>> non-cacheable gues accesses bypass the cache, but the cacheable host
>> accesses allocate in the cache).
>>
>> Will
>
> I see. And I guess using a cacheable mapping is significantly faster.
> I would say we want to typically use cacheable for virtio then,
> whether we bypass the IOMMU or not. I guess this is why we always set
> _CCA/DT correctly, right?
>

The point is that you don't get to choose: if the hardware is DMA
coherent, the CPU should use cacheable mappings, or you may lose data.
If the hardware is non-coherent, the CPU should use non-cacheable
mappings for the same reason.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio: Try to untangle DMA coherency

2017-02-09 Thread Will Deacon
On Thu, Feb 09, 2017 at 08:49:41PM +0200, Michael S. Tsirkin wrote:
> On Thu, Feb 09, 2017 at 06:31:18PM +, Will Deacon wrote:
> > On Thu, Feb 09, 2017 at 08:17:16PM +0200, Michael S. Tsirkin wrote:
> > > On Thu, Feb 02, 2017 at 04:40:49PM +, Will Deacon wrote:
> > > > On Thu, Feb 02, 2017 at 06:30:28PM +0200, Michael S. Tsirkin wrote:
> > > > > I am inclined to say, for 4.10 let's revert
> > > > > c7070619f3408d9a0dffbed9149e6f00479cf43b since what it fixes is not a
> > > > > regression in 4.10.
> > > > 
> > > > No complaints there, as long as we can keep working to fix this for 4.11
> > > > and onwards. You'll also need to cc stable on the revert.
> > > > 
> > > > > So I think we can defer the fix to 4.11.
> > > > > I think we still want f7f6634d23830ff74335734fbdb28ea109c1f349
> > > > > for hosts with virtio 1 support.
> > > > > 
> > > > > All this will hopefully push hosts to just implement virtio 1.
> > > > > For mmio the changes are very small: several new registers,
> > > > > that's all. You want this for proper 64 bit dma mask anyway.
> > > > 
> > > > As I've said, virtio 1 will have exactly the same issue unless we start
> > > > requiring firmware to advertise dma-coherent/_CCA for virtio-mmio
> > > > devices correctly.
> > > > 
> > > 
> > > OK I read up on _CCA in ACPI spec. It says:
> > > The _CCA object returns whether or not a bus-master device supports
> > > hardware managed cache coherency. Expected values are 0 to indicate it
> > > is not supported, and 1 to indicate that it is supported.
> > > 
> > > So if host is cache coherent, and guest thinks it isn't, we incur
> > > unnecessary overhead by wasting coherent memory.
> > > I get that but you said it actually breaks - why does it?
> > 
> > It breaks because QEMU doesn't set _CCA for virtio-mmio devices, and that
> > only becomes a problem when we use the DMA API, because that results in the
> > guest taking out a non-cacheable mapping. On ARM (and other archs such as
> > Power), having a mismatch between a cacheable and a non-cacheable mapping
> > can result in a loss of coherency between the two (for example, if the
> > non-cacheable gues accesses bypass the cache, but the cacheable host
> > accesses allocate in the cache).
> > 
> I see. And I guess using a cacheable mapping is significantly faster.
> I would say we want to typically use cacheable for virtio then,
> whether we bypass the IOMMU or not. I guess this is why we always set
> _CCA/DT correctly, right?

At the moment, _CCA/DT is pretty much never set correctly for virtio-mmio
(that is, it isn't set even though the device is cache coherent). If it
*was* set correctly, then we wouldn't have needed to revert my patch.

Robin's patch to only use the DMA API if _CCA/DT is present would work
(although the thing that he posted was buggy iirc).

Will
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio: Try to untangle DMA coherency

2017-02-09 Thread Michael S. Tsirkin
On Thu, Feb 09, 2017 at 06:31:18PM +, Will Deacon wrote:
> On Thu, Feb 09, 2017 at 08:17:16PM +0200, Michael S. Tsirkin wrote:
> > On Thu, Feb 02, 2017 at 04:40:49PM +, Will Deacon wrote:
> > > On Thu, Feb 02, 2017 at 06:30:28PM +0200, Michael S. Tsirkin wrote:
> > > > I am inclined to say, for 4.10 let's revert
> > > > c7070619f3408d9a0dffbed9149e6f00479cf43b since what it fixes is not a
> > > > regression in 4.10.
> > > 
> > > No complaints there, as long as we can keep working to fix this for 4.11
> > > and onwards. You'll also need to cc stable on the revert.
> > > 
> > > > So I think we can defer the fix to 4.11.
> > > > I think we still want f7f6634d23830ff74335734fbdb28ea109c1f349
> > > > for hosts with virtio 1 support.
> > > > 
> > > > All this will hopefully push hosts to just implement virtio 1.
> > > > For mmio the changes are very small: several new registers,
> > > > that's all. You want this for proper 64 bit dma mask anyway.
> > > 
> > > As I've said, virtio 1 will have exactly the same issue unless we start
> > > requiring firmware to advertise dma-coherent/_CCA for virtio-mmio
> > > devices correctly.
> > > 
> > 
> > OK I read up on _CCA in ACPI spec. It says:
> > The _CCA object returns whether or not a bus-master device supports
> > hardware managed cache coherency. Expected values are 0 to indicate it
> > is not supported, and 1 to indicate that it is supported.
> > 
> > So if host is cache coherent, and guest thinks it isn't, we incur
> > unnecessary overhead by wasting coherent memory.
> > I get that but you said it actually breaks - why does it?
> 
> It breaks because QEMU doesn't set _CCA for virtio-mmio devices, and that
> only becomes a problem when we use the DMA API, because that results in the
> guest taking out a non-cacheable mapping. On ARM (and other archs such as
> Power), having a mismatch between a cacheable and a non-cacheable mapping
> can result in a loss of coherency between the two (for example, if the
> non-cacheable gues accesses bypass the cache, but the cacheable host
> accesses allocate in the cache).
> 
> Will

I see. And I guess using a cacheable mapping is significantly faster.
I would say we want to typically use cacheable for virtio then,
whether we bypass the IOMMU or not. I guess this is why we always set
_CCA/DT correctly, right?

-- 
MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio: Try to untangle DMA coherency

2017-02-09 Thread Michael S. Tsirkin
On Thu, Feb 02, 2017 at 04:40:49PM +, Will Deacon wrote:
> On Thu, Feb 02, 2017 at 06:30:28PM +0200, Michael S. Tsirkin wrote:
> > I am inclined to say, for 4.10 let's revert
> > c7070619f3408d9a0dffbed9149e6f00479cf43b since what it fixes is not a
> > regression in 4.10.
> 
> No complaints there, as long as we can keep working to fix this for 4.11
> and onwards. You'll also need to cc stable on the revert.
> 
> > So I think we can defer the fix to 4.11.
> > I think we still want f7f6634d23830ff74335734fbdb28ea109c1f349
> > for hosts with virtio 1 support.
> > 
> > All this will hopefully push hosts to just implement virtio 1.
> > For mmio the changes are very small: several new registers,
> > that's all. You want this for proper 64 bit dma mask anyway.
> 
> As I've said, virtio 1 will have exactly the same issue unless we start
> requiring firmware to advertise dma-coherent/_CCA for virtio-mmio
> devices correctly.
> 
> Will

OK I read up on _CCA in ACPI spec. It says:
The _CCA object returns whether or not a bus-master device supports
hardware managed cache coherency. Expected values are 0 to indicate it
is not supported, and 1 to indicate that it is supported.

So if host is cache coherent, and guest thinks it isn't, we incur
unnecessary overhead by wasting coherent memory.
I get that but you said it actually breaks - why does it?

-- 
MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio: Try to untangle DMA coherency

2017-02-09 Thread Will Deacon
On Thu, Feb 09, 2017 at 08:17:16PM +0200, Michael S. Tsirkin wrote:
> On Thu, Feb 02, 2017 at 04:40:49PM +, Will Deacon wrote:
> > On Thu, Feb 02, 2017 at 06:30:28PM +0200, Michael S. Tsirkin wrote:
> > > I am inclined to say, for 4.10 let's revert
> > > c7070619f3408d9a0dffbed9149e6f00479cf43b since what it fixes is not a
> > > regression in 4.10.
> > 
> > No complaints there, as long as we can keep working to fix this for 4.11
> > and onwards. You'll also need to cc stable on the revert.
> > 
> > > So I think we can defer the fix to 4.11.
> > > I think we still want f7f6634d23830ff74335734fbdb28ea109c1f349
> > > for hosts with virtio 1 support.
> > > 
> > > All this will hopefully push hosts to just implement virtio 1.
> > > For mmio the changes are very small: several new registers,
> > > that's all. You want this for proper 64 bit dma mask anyway.
> > 
> > As I've said, virtio 1 will have exactly the same issue unless we start
> > requiring firmware to advertise dma-coherent/_CCA for virtio-mmio
> > devices correctly.
> > 
> 
> OK I read up on _CCA in ACPI spec. It says:
> The _CCA object returns whether or not a bus-master device supports
> hardware managed cache coherency. Expected values are 0 to indicate it
> is not supported, and 1 to indicate that it is supported.
> 
> So if host is cache coherent, and guest thinks it isn't, we incur
> unnecessary overhead by wasting coherent memory.
> I get that but you said it actually breaks - why does it?

It breaks because QEMU doesn't set _CCA for virtio-mmio devices, and that
only becomes a problem when we use the DMA API, because that results in the
guest taking out a non-cacheable mapping. On ARM (and other archs such as
Power), having a mismatch between a cacheable and a non-cacheable mapping
can result in a loss of coherency between the two (for example, if the
non-cacheable gues accesses bypass the cache, but the cacheable host
accesses allocate in the cache).

Will
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: packed ring layout proposal v2

2017-02-09 Thread Christian Borntraeger
On 02/09/2017 06:43 PM, Michael S. Tsirkin wrote:
[...]
>> [...] 
>>> Note: should this proposal be accepted and approved, one or more
>>>   claims disclosed to the TC admin and listed on the Virtio TC
>>>   IPR page https://www.oasis-open.org/committees/virtio/ipr.php
>>>   might become Essential Claims.
>>>
>>
>> I have trouble parsing that legal stuff. Do I read that right, that
>> these claims can be implemented as part of a virtio implementation without
>> any worries (e.g. non open source HW implementation or non open source
>> hypervisor)?
> 
> By that legal stuff do you mean the IPR or the Note?

Both in combination.
 
> Not representing Red Hat here, and definitely not legal advice, below is
> just my personal understanding of the IPR requirements.
> 
> Virtio TC operates under the Non-Assertion Mode of the OASIS IPR Policy:
> https://www.oasis-open.org/policies-guidelines/ipr#Non-Assertion-Mode
> 
> As far as I can tell, the hardware and software question is covered
> by that policy, since it states:
> 
>   Covered Product - includes only those specific portions of a
>   product (hardware, software or combinations thereof)
> 
> Also as far as I can tell IPR does not mention source at all, so I think
> virtio IPR would apply to open and closed source software equally.
> 
> The Note is included to satisfy the disclosure requirements.
> 
> Does this answer the question?

Yes, thanks.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


RE: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

2017-02-09 Thread Stephen Hemminger via Virtualization
Why not use existing seqlock's?

-Original Message-
From: Thomas Gleixner [mailto:t...@linutronix.de] 
Sent: Thursday, February 9, 2017 9:08 AM
To: Vitaly Kuznetsov 
Cc: x...@kernel.org; Andy Lutomirski ; Ingo Molnar 
; H. Peter Anvin ; KY Srinivasan 
; Haiyang Zhang ; Stephen Hemminger 
; Dexuan Cui ; 
linux-ker...@vger.kernel.org; de...@linuxdriverproject.org; 
virtualization@lists.linux-foundation.org
Subject: Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

On Thu, 9 Feb 2017, Vitaly Kuznetsov wrote:
> +#ifdef CONFIG_HYPERV_TSCPAGE
> +static notrace u64 vread_hvclock(int *mode) {
> + const struct ms_hyperv_tsc_page *tsc_pg =
> + (const struct ms_hyperv_tsc_page *)&hvclock_page;
> + u64 sequence, scale, offset, current_tick, cur_tsc;
> +
> + while (1) {
> + sequence = READ_ONCE(tsc_pg->tsc_sequence);
> + if (!sequence)
> + break;
> +
> + scale = READ_ONCE(tsc_pg->tsc_scale);
> + offset = READ_ONCE(tsc_pg->tsc_offset);
> + rdtscll(cur_tsc);
> +
> + current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
> +
> + if (READ_ONCE(tsc_pg->tsc_sequence) == sequence)
> + return current_tick;

That sequence stuff lacks still a sensible explanation. It's fundamentally 
different from the sequence counting we do in the kernel, so documentation for 
it is really required.

Thanks,

tglx
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


RE: [PATCH 1/2] hyperv: implement hv_get_tsc_page()

2017-02-09 Thread Stephen Hemminger via Virtualization
The actual code looks fine, but the style police will not like you.
{ should be at start of line on functions.
And #else should be at start of line,

But maybe this was just more of exchange mangling the mail.

-Original Message-
From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] 
Sent: Thursday, February 9, 2017 6:11 AM
To: x...@kernel.org; Andy Lutomirski 
Cc: Thomas Gleixner ; Ingo Molnar ; H. 
Peter Anvin ; KY Srinivasan ; Haiyang Zhang 
; Stephen Hemminger ; Dexuan 
Cui ; linux-ker...@vger.kernel.org; 
de...@linuxdriverproject.org; virtualization@lists.linux-foundation.org
Subject: [PATCH 1/2] hyperv: implement hv_get_tsc_page()

To use Hyper-V TSC page clocksource from vDSO we need to make tsc_pg available. 
Implement hv_get_tsc_page() and add CONFIG_HYPERV_TSCPAGE to make #ifdef-s 
simple.

Signed-off-by: Vitaly Kuznetsov 
---
 arch/x86/hyperv/hv_init.c   | 9 +++--
 arch/x86/include/asm/mshyperv.h | 8 
 drivers/hv/Kconfig  | 3 +++
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index 
b371d0e..0ce8485 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -27,10 +27,15 @@
 #include 
 
 
-#ifdef CONFIG_X86_64
+#ifdef CONFIG_HYPERV_TSCPAGE
 
 static struct ms_hyperv_tsc_page *tsc_pg;
 
+struct ms_hyperv_tsc_page *hv_get_tsc_page(void) {
+   return tsc_pg;
+}
+
 static u64 read_hv_clock_tsc(struct clocksource *arg)  {
u64 current_tick;
@@ -136,7 +141,7 @@ void hyperv_init(void)
/*
 * Register Hyper-V specific clocksource.
 */
-#ifdef CONFIG_X86_64
+#ifdef CONFIG_HYPERV_TSCPAGE
if (ms_hyperv.features & HV_X64_MSR_REFERENCE_TSC_AVAILABLE) {
union hv_x64_msr_hypercall_contents tsc_msr;
 
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h 
index f8dc370..14dd92c 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -173,4 +173,12 @@ void hyperv_report_panic(struct pt_regs *regs);  bool 
hv_is_hypercall_page_setup(void);  void hyperv_cleanup(void);  #endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+struct ms_hyperv_tsc_page *hv_get_tsc_page(void); #else static inline 
+struct ms_hyperv_tsc_page *hv_get_tsc_page(void) {
+   return NULL;
+}
+#endif
 #endif
diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig index 0403b51..c29cd53 
100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -7,6 +7,9 @@ config HYPERV
  Select this option to run Linux as a Hyper-V client operating
  system.
 
+config HYPERV_TSCPAGE
+   def_bool HYPERV && X86_64
+
 config HYPERV_UTILS
tristate "Microsoft Hyper-V Utilities driver"
depends on HYPERV && CONNECTOR && NLS
--
2.9.3

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [virtio-dev] packed ring layout proposal v2

2017-02-09 Thread Michael S. Tsirkin
On Thu, Feb 09, 2017 at 04:48:53PM +0100, Paolo Bonzini wrote:
> 
> 
> On 08/02/2017 20:59, Michael S. Tsirkin wrote:
> > We couldn't decide what's better for everyone in 1.0 days and I doubt
> > we'll be able to now, but yes, benchmarking is needed to make
> > sire it's required. Very easy to remove or not to use/support in
> > drivers/devices though.
> 
> Fair enough, but of course then we must specify that devices MUST
> support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers
> SHOULD support both (or use neither).
> 
> The drivers' part adds to the implementation burden, which is why I
> wanted to remove it.  Alternatively we can say that indirect is
> mandatory for both devices and drivers (and save a feature bit), while
> VRING_DESC_F_NEXT is optional.
> 
> >>> * Non power-of-2 ring sizes
> >>>
> >>> As the ring simply wraps around, there's no reason to
> >>> require ring size to be power of two.
> >>> It can be made a separate feature though.
> >>
> >> Power of 2 ring sizes are required in order to ignore the high bits of
> >> the indices.  With non-power-of-2 sizes you are forced to keep the
> >> indices less than the ring size.
> > 
> > Right. So
> > 
> > if (unlikely(idx++ > size))
> > idx = 0;
> > 
> > OTOH ring size that's twice larger than necessary
> > because of power of two requirements wastes cache.
> 
> I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
> the complication and the gratuitous difference with 1.0.

I thought originally there's a reason 1.0 rings had to be powers of two
but now I don't see why. OK, we can make it a feature flag later if we
want to.

> If batching is mostly advisory (with exceptions such as mrg-rxbuf) I
> don't have any problem with it.
> 
> Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: packed ring layout proposal v2

2017-02-09 Thread Michael S. Tsirkin
On Wed, Feb 08, 2017 at 02:37:57PM +0100, Christian Borntraeger wrote:
> On 02/08/2017 04:20 AM, Michael S. Tsirkin wrote:
> [...]
> > * Prototype
> > 
> > A partial prototype can be found under
> > tools/virtio/ringtest/ring.c
> > 
> > Test run:
> > [mst@tuck ringtest]$ time ./ring 
> > real0m0.399s
> > user0m0.791s
> > sys 0m0.000s
> > [mst@tuck ringtest]$ time ./virtio_ring_0_9
> > real0m0.503s
> > user0m0.999s
> > sys 0m0.000s
> 
> I see similar improvements on s390, so I think this would be a very nice 
> improvement.
> 
> [...] 
> > Note: should this proposal be accepted and approved, one or more
> >   claims disclosed to the TC admin and listed on the Virtio TC
> >   IPR page https://www.oasis-open.org/committees/virtio/ipr.php
> >   might become Essential Claims.
> > 
> 
> I have trouble parsing that legal stuff. Do I read that right, that
> these claims can be implemented as part of a virtio implementation without
> any worries (e.g. non open source HW implementation or non open source
> hypervisor)?

By that legal stuff do you mean the IPR or the Note?

Not representing Red Hat here, and definitely not legal advice, below is
just my personal understanding of the IPR requirements.

Virtio TC operates under the Non-Assertion Mode of the OASIS IPR Policy:
https://www.oasis-open.org/policies-guidelines/ipr#Non-Assertion-Mode

As far as I can tell, the hardware and software question is covered
by that policy, since it states:

Covered Product - includes only those specific portions of a
product (hardware, software or combinations thereof)

Also as far as I can tell IPR does not mention source at all, so I think
virtio IPR would apply to open and closed source software equally.

The Note is included to satisfy the disclosure requirements.

Does this answer the question?

-- 
MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

2017-02-09 Thread Thomas Gleixner
On Thu, 9 Feb 2017, Vitaly Kuznetsov wrote:
> +#ifdef CONFIG_HYPERV_TSCPAGE
> +static notrace u64 vread_hvclock(int *mode)
> +{
> + const struct ms_hyperv_tsc_page *tsc_pg =
> + (const struct ms_hyperv_tsc_page *)&hvclock_page;
> + u64 sequence, scale, offset, current_tick, cur_tsc;
> +
> + while (1) {
> + sequence = READ_ONCE(tsc_pg->tsc_sequence);
> + if (!sequence)
> + break;
> +
> + scale = READ_ONCE(tsc_pg->tsc_scale);
> + offset = READ_ONCE(tsc_pg->tsc_offset);
> + rdtscll(cur_tsc);
> +
> + current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
> +
> + if (READ_ONCE(tsc_pg->tsc_sequence) == sequence)
> + return current_tick;

That sequence stuff lacks still a sensible explanation. It's fundamentally
different from the sequence counting we do in the kernel, so documentation
for it is really required.

Thanks,

tglx
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [virtio-dev] packed ring layout proposal v2

2017-02-09 Thread Cornelia Huck
On Thu, 9 Feb 2017 16:48:53 +0100
Paolo Bonzini  wrote:

> On 08/02/2017 20:59, Michael S. Tsirkin wrote:
> > We couldn't decide what's better for everyone in 1.0 days and I doubt
> > we'll be able to now, but yes, benchmarking is needed to make
> > sire it's required. Very easy to remove or not to use/support in
> > drivers/devices though.
> 
> Fair enough, but of course then we must specify that devices MUST
> support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers
> SHOULD support both (or use neither).
> 
> The drivers' part adds to the implementation burden, which is why I
> wanted to remove it.  Alternatively we can say that indirect is
> mandatory for both devices and drivers (and save a feature bit), while
> VRING_DESC_F_NEXT is optional.

I think this (INDIRECT mandatory, NEXT optional) makes sense. But we
really need some benchmarking.

> 
> >>> * Non power-of-2 ring sizes
> >>>
> >>> As the ring simply wraps around, there's no reason to
> >>> require ring size to be power of two.
> >>> It can be made a separate feature though.
> >>
> >> Power of 2 ring sizes are required in order to ignore the high bits of
> >> the indices.  With non-power-of-2 sizes you are forced to keep the
> >> indices less than the ring size.
> > 
> > Right. So
> > 
> > if (unlikely(idx++ > size))
> > idx = 0;
> > 
> > OTOH ring size that's twice larger than necessary
> > because of power of two requirements wastes cache.
> 
> I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
> the complication and the gratuitous difference with 1.0.

I agree. I don't think dropping the power of 2 requirement buys us so
much that it makes up for the added complexity.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: automatic IRQ affinity for virtio V3

2017-02-09 Thread Michael S. Tsirkin
On Thu, Feb 09, 2017 at 05:50:31AM -0800, Christoph Hellwig wrote:
> On Sun, Feb 05, 2017 at 06:15:17PM +0100, Christoph Hellwig wrote:
> > Hi Michael, hi Jason,
> > 
> > This patches applies a few cleanups to the virtio PCI interrupt handling
> > code, and then converts the virtio PCI code to use the automatic MSI-X
> > vectors spreading, as well as using the information in virtio-blk
> > and virtio-scsi to automatically align the blk-mq queues to the MSI-X
> > vectors.
> 
> Any chance to get this in for 4.11 after I got reviews from Jason
> for most of the patches?

Absolutely, I intend to merge it.

-- 
MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [virtio-dev] packed ring layout proposal v2

2017-02-09 Thread Paolo Bonzini


On 08/02/2017 20:59, Michael S. Tsirkin wrote:
> We couldn't decide what's better for everyone in 1.0 days and I doubt
> we'll be able to now, but yes, benchmarking is needed to make
> sire it's required. Very easy to remove or not to use/support in
> drivers/devices though.

Fair enough, but of course then we must specify that devices MUST
support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers
SHOULD support both (or use neither).

The drivers' part adds to the implementation burden, which is why I
wanted to remove it.  Alternatively we can say that indirect is
mandatory for both devices and drivers (and save a feature bit), while
VRING_DESC_F_NEXT is optional.

>>> * Non power-of-2 ring sizes
>>>
>>> As the ring simply wraps around, there's no reason to
>>> require ring size to be power of two.
>>> It can be made a separate feature though.
>>
>> Power of 2 ring sizes are required in order to ignore the high bits of
>> the indices.  With non-power-of-2 sizes you are forced to keep the
>> indices less than the ring size.
> 
> Right. So
> 
>   if (unlikely(idx++ > size))
>   idx = 0;
> 
> OTOH ring size that's twice larger than necessary
> because of power of two requirements wastes cache.

I don't know.  Power of 2 ring size is pretty standard, I'd rather avoid
the complication and the gratuitous difference with 1.0.

If batching is mostly advisory (with exceptions such as mrg-rxbuf) I
don't have any problem with it.

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

2017-02-09 Thread Vitaly Kuznetsov
Hyper-V TSC page clocksource is suitable for vDSO, however, the protocol
defined by the hypervisor is different from VCLOCK_PVCLOCK. Implement the
required support by adding hvclock_page VVAR.

Signed-off-by: Vitaly Kuznetsov 
---
 arch/x86/entry/vdso/vclock_gettime.c  | 36 +++
 arch/x86/entry/vdso/vdso-layout.lds.S |  3 ++-
 arch/x86/entry/vdso/vdso2c.c  |  3 +++
 arch/x86/entry/vdso/vma.c |  7 +++
 arch/x86/hyperv/hv_init.c |  3 +++
 arch/x86/include/asm/clocksource.h|  3 ++-
 arch/x86/include/asm/vdso.h   |  1 +
 7 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index 9d4d6e1..4af10b4 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -32,6 +33,11 @@ extern u8 pvclock_page
__attribute__((visibility("hidden")));
 #endif
 
+#ifdef CONFIG_HYPERV_TSCPAGE
+extern u8 hvclock_page
+   __attribute__((visibility("hidden")));
+#endif
+
 #ifndef BUILD_VDSO32
 
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
@@ -141,6 +147,32 @@ static notrace u64 vread_pvclock(int *mode)
return last;
 }
 #endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+static notrace u64 vread_hvclock(int *mode)
+{
+   const struct ms_hyperv_tsc_page *tsc_pg =
+   (const struct ms_hyperv_tsc_page *)&hvclock_page;
+   u64 sequence, scale, offset, current_tick, cur_tsc;
+
+   while (1) {
+   sequence = READ_ONCE(tsc_pg->tsc_sequence);
+   if (!sequence)
+   break;
+
+   scale = READ_ONCE(tsc_pg->tsc_scale);
+   offset = READ_ONCE(tsc_pg->tsc_offset);
+   rdtscll(cur_tsc);
+
+   current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
+
+   if (READ_ONCE(tsc_pg->tsc_sequence) == sequence)
+   return current_tick;
+   }
+
+   *mode = VCLOCK_NONE;
+   return 0;
+}
+#endif
 
 notrace static u64 vread_tsc(void)
 {
@@ -173,6 +205,10 @@ notrace static inline u64 vgetsns(int *mode)
else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
cycles = vread_pvclock(mode);
 #endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+   else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+   cycles = vread_hvclock(mode);
+#endif
else
return 0;
v = (cycles - gtod->cycle_last) & gtod->mask;
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S 
b/arch/x86/entry/vdso/vdso-layout.lds.S
index a708aa9..8ebb4b6 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -25,7 +25,7 @@ SECTIONS
 * segment.
 */
 
-   vvar_start = . - 2 * PAGE_SIZE;
+   vvar_start = . - 3 * PAGE_SIZE;
vvar_page = vvar_start;
 
/* Place all vvars at the offsets in asm/vvar.h. */
@@ -36,6 +36,7 @@ SECTIONS
 #undef EMIT_VVAR
 
pvclock_page = vvar_start + PAGE_SIZE;
+   hvclock_page = vvar_start + 2 * PAGE_SIZE;
 
. = SIZEOF_HEADERS;
 
diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 491020b..0780a44 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -74,6 +74,7 @@ enum {
sym_vvar_page,
sym_hpet_page,
sym_pvclock_page,
+   sym_hvclock_page,
sym_VDSO_FAKE_SECTION_TABLE_START,
sym_VDSO_FAKE_SECTION_TABLE_END,
 };
@@ -82,6 +83,7 @@ const int special_pages[] = {
sym_vvar_page,
sym_hpet_page,
sym_pvclock_page,
+   sym_hvclock_page,
 };
 
 struct vdso_sym {
@@ -94,6 +96,7 @@ struct vdso_sym required_syms[] = {
[sym_vvar_page] = {"vvar_page", true},
[sym_hpet_page] = {"hpet_page", true},
[sym_pvclock_page] = {"pvclock_page", true},
+   [sym_hvclock_page] = {"hvclock_page", true},
[sym_VDSO_FAKE_SECTION_TABLE_START] = {
"VDSO_FAKE_SECTION_TABLE_START", false
},
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 10820f6..b256a3b 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #if defined(CONFIG_X86_64)
 unsigned int __read_mostly vdso64_enabled = 1;
@@ -120,6 +121,12 @@ static int vvar_fault(const struct vm_special_mapping *sm,
vmf->address,
__pa(pvti) >> PAGE_SHIFT);
}
+   } else if (sym_offset == image->sym_hvclock_page) {
+   struct ms_hyperv_tsc_page *tsc_pg = hv_get_tsc_page();
+
+   if (tsc_pg && vclock_was_used(VCLOCK_HVCLOCK))
+   ret = vm_insert_pfn(vma, vmf->address,
+   vmalloc_to_pfn(tsc_pg));
}

[PATCH 1/2] hyperv: implement hv_get_tsc_page()

2017-02-09 Thread Vitaly Kuznetsov
To use Hyper-V TSC page clocksource from vDSO we need to make tsc_pg
available. Implement hv_get_tsc_page() and add CONFIG_HYPERV_TSCPAGE to
make #ifdef-s simple.

Signed-off-by: Vitaly Kuznetsov 
---
 arch/x86/hyperv/hv_init.c   | 9 +++--
 arch/x86/include/asm/mshyperv.h | 8 
 drivers/hv/Kconfig  | 3 +++
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index b371d0e..0ce8485 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -27,10 +27,15 @@
 #include 
 
 
-#ifdef CONFIG_X86_64
+#ifdef CONFIG_HYPERV_TSCPAGE
 
 static struct ms_hyperv_tsc_page *tsc_pg;
 
+struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
+{
+   return tsc_pg;
+}
+
 static u64 read_hv_clock_tsc(struct clocksource *arg)
 {
u64 current_tick;
@@ -136,7 +141,7 @@ void hyperv_init(void)
/*
 * Register Hyper-V specific clocksource.
 */
-#ifdef CONFIG_X86_64
+#ifdef CONFIG_HYPERV_TSCPAGE
if (ms_hyperv.features & HV_X64_MSR_REFERENCE_TSC_AVAILABLE) {
union hv_x64_msr_hypercall_contents tsc_msr;
 
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index f8dc370..14dd92c 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -173,4 +173,12 @@ void hyperv_report_panic(struct pt_regs *regs);
 bool hv_is_hypercall_page_setup(void);
 void hyperv_cleanup(void);
 #endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
+#else
+static inline struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
+{
+   return NULL;
+}
+#endif
 #endif
diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 0403b51..c29cd53 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -7,6 +7,9 @@ config HYPERV
  Select this option to run Linux as a Hyper-V client operating
  system.
 
+config HYPERV_TSCPAGE
+   def_bool HYPERV && X86_64
+
 config HYPERV_UTILS
tristate "Microsoft Hyper-V Utilities driver"
depends on HYPERV && CONNECTOR && NLS
-- 
2.9.3

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH 0/2] x86/vdso: Add Hyper-V TSC page clocksource support

2017-02-09 Thread Vitaly Kuznetsov
Hi,

Hyper-V TSC page clocksource is suitable for vDSO, however, the protocol
defined by the hypervisor is different from VCLOCK_PVCLOCK. Implemented the
required support. Simple sysbench test shows the following results:

Before:
# time sysbench --test=memory --max-requests=50 run
...
real1m22.618s
user0m50.193s
sys 0m32.268s

After:
# time sysbench --test=memory --max-requests=50 run
...
real0m47.241s
user0m47.117s
sys 0m0.008s

So it seems it is worth it. As nobody seems to be strongly offended by my RFC
I'm sending first non-RFC version. Patch 1 is made on top of K. Y.'s code
refactoring which moved tsc page clocksource to arch/x86/hyperv/hv_init.c,
this is currently present in Greg's char-misc-next tree.

Changes since RFC:
- Use mul_u64_u64_shr() instead of an open coded implementation
  [Andy Lutomirski, Thomas Gleixner]
- Don't use the same pvclock_page for both VCLOCK_PVCLOCK and
  VCLOCK_HVCLOCK, create another one [Andy Lutomirski]
- Fix issues reported by kbuild test robot.
- Rename HYPERV_CLOCK -> HYPERV_TSCPAGE to avoid the ambiguity.

I'm also going to try to optimize mul_u64_u64_shr() for 32bit but this can
be split from this series I guess.

Vitaly Kuznetsov (2):
  hyperv: implement hv_get_tsc_page()
  x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

 arch/x86/entry/vdso/vclock_gettime.c  | 36 +++
 arch/x86/entry/vdso/vdso-layout.lds.S |  3 ++-
 arch/x86/entry/vdso/vdso2c.c  |  3 +++
 arch/x86/entry/vdso/vma.c |  7 +++
 arch/x86/hyperv/hv_init.c | 12 ++--
 arch/x86/include/asm/clocksource.h|  3 ++-
 arch/x86/include/asm/mshyperv.h   |  8 
 arch/x86/include/asm/vdso.h   |  1 +
 drivers/hv/Kconfig|  3 +++
 9 files changed, 72 insertions(+), 4 deletions(-)

-- 
2.9.3

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: automatic IRQ affinity for virtio V3

2017-02-09 Thread Christoph Hellwig
On Sun, Feb 05, 2017 at 06:15:17PM +0100, Christoph Hellwig wrote:
> Hi Michael, hi Jason,
> 
> This patches applies a few cleanups to the virtio PCI interrupt handling
> code, and then converts the virtio PCI code to use the automatic MSI-X
> vectors spreading, as well as using the information in virtio-blk
> and virtio-scsi to automatically align the blk-mq queues to the MSI-X
> vectors.

Any chance to get this in for 4.11 after I got reviews from Jason
for most of the patches?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization