Re: [Xen-devel] [PATCH net-next v1] xen-netback: make copy batch size configurable

2017-11-13 Thread Joao Martins
On Mon, Nov 13, 2017 at 04:39:09PM +0000, Paul Durrant wrote:
> > -----Original Message-----
> > From: Joao Martins [mailto:joao.m.mart...@oracle.com]
> > Sent: 13 November 2017 16:34
> > To: Paul Durrant <paul.durr...@citrix.com>
> > Cc: net...@vger.kernel.org; Wei Liu <wei.l...@citrix.com>; xen-
> > de...@lists.xenproject.org
> > Subject: Re: [PATCH net-next v1] xen-netback: make copy batch size
> > configurable
> > 
> > On Mon, Nov 13, 2017 at 11:58:03AM +0000, Paul Durrant wrote:
> > > On Mon, Nov 13, 2017 at 11:54:00AM +0000, Joao Martins wrote:
> > > > On 11/13/2017 10:33 AM, Paul Durrant wrote:
> > > > > On 11/10/2017 19:35 PM, Joao Martins wrote:
> > 
> > [snip]
> > 
> > > > >> diff --git a/drivers/net/xen-netback/rx.c 
> > > > >> b/drivers/net/xen-netback/rx.c
> > > > >> index b1cf7c6f407a..793a85f61f9d 100644
> > > > >> --- a/drivers/net/xen-netback/rx.c
> > > > >> +++ b/drivers/net/xen-netback/rx.c
> > > > >> @@ -168,11 +168,14 @@ static void xenvif_rx_copy_add(struct 
> > > > >> xenvif_queue *queue,
> > > > >> struct xen_netif_rx_request *req,
> > > > >> unsigned int offset, void *data, size_t 
> > > > >> len)
> > > > >>  {
> > > > >> +unsigned int batch_size;
> > > > >>  struct gnttab_copy *op;
> > > > >>  struct page *page;
> > > > >>  struct xen_page_foreign *foreign;
> > > > >>
> > > > >> -if (queue->rx_copy.num == COPY_BATCH_SIZE)
> > > > >> +batch_size = min(xenvif_copy_batch_size, queue->rx_copy.size);
> > > > >
> > > > > Surely queue->rx_copy.size and xenvif_copy_batch_size are always
> > > > > identical? Why do you need this statement (and hence stack variable)?
> > > > >
> > > > This statement was to allow to be changed dynamically and would
> > > > affect all newly created guests or running guests if value happened
> > > > to be smaller than initially allocated. But I suppose I should make
> > > > behaviour more consistent with the other params we have right now
> > > > and just look at initially allocated one `queue->rx_copy.batch_size` ?
> > >
> > > Yes, that would certainly be consistent but I can see value in
> > > allowing it to be dynamically tuned, so perhaps adding some re-allocation
> > > code to allow the batch to be grown as well as shrunk might be nice.
> > 
> > When shrinking we potentially risk losing data, so we need to gate the
> > reallocation so that it only happens when `rx_copy.num` does not exceed
> > the newly requested batch size. In the worst case guestrx_thread simply
> > keeps using the initially allocated value.
> 
> Can't you just re-alloc immediately after the flush (when num is
> guaranteed to be zero)?

/facepalm

Yes, after the flush should make things much simpler.

Joao
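
For reference, the "re-alloc right after the flush" idea could look roughly like the sketch below. It is a user-space model, not the driver code: `copy_state`, `copy_flush()` and `copy_resize_after_flush()` are hypothetical names, and `calloc()`/`free()` stand in for `vzalloc()`/`vfree()`. Because the counter is zero after a flush, nothing has to be carried over to the new arrays.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins for the kernel types in the patch. */
struct copy_op {
    unsigned long src, dst;
    unsigned int len;
};

struct copy_state {
    struct copy_op *op;      /* pending copy operations */
    unsigned int *idx;       /* ring index per operation */
    unsigned int batch_size; /* allocated capacity */
    unsigned int num;        /* operations currently queued */
};

/* Stand-in for the flush step: the batch is issued, then the counter resets. */
void copy_flush(struct copy_state *s)
{
    /* A real driver would submit s->op[0..num) to the hypervisor here. */
    s->num = 0;
}

/* Resize the batch right after a flush, so no queued entry needs copying. */
int copy_resize_after_flush(struct copy_state *s, unsigned int size)
{
    struct copy_op *op;
    unsigned int *idx;

    if (size == 0)
        return -1;

    copy_flush(s);
    assert(s->num == 0);

    op = calloc(size, sizeof(*op));
    idx = calloc(size, sizeof(*idx));
    if (!op || !idx) {
        /* Allocation failed: keep using the old arrays and old size. */
        free(op);
        free(idx);
        return -1;
    }

    free(s->op);
    free(s->idx);
    s->op = op;
    s->idx = idx;
    s->batch_size = size;
    return 0;
}
```

With this ordering both growing and shrinking go through the same path, and a failed allocation leaves the previous batch size in place.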

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH net-next v1] xen-netback: make copy batch size configurable

2017-11-13 Thread Joao Martins
On Mon, Nov 13, 2017 at 11:58:03AM +0000, Paul Durrant wrote:
> On Mon, Nov 13, 2017 at 11:54:00AM +0000, Joao Martins wrote:
> > On 11/13/2017 10:33 AM, Paul Durrant wrote:
> > > On 11/10/2017 19:35 PM, Joao Martins wrote:

[snip]

> > >> diff --git a/drivers/net/xen-netback/rx.c b/drivers/net/xen-netback/rx.c
> > >> index b1cf7c6f407a..793a85f61f9d 100644
> > >> --- a/drivers/net/xen-netback/rx.c
> > >> +++ b/drivers/net/xen-netback/rx.c
> > >> @@ -168,11 +168,14 @@ static void xenvif_rx_copy_add(struct
> > >> xenvif_queue *queue,
> > >> struct xen_netif_rx_request *req,
> > >> unsigned int offset, void *data, size_t 
> > >> len)
> > >>  {
> > >> +unsigned int batch_size;
> > >>  struct gnttab_copy *op;
> > >>  struct page *page;
> > >>  struct xen_page_foreign *foreign;
> > >>
> > >> -if (queue->rx_copy.num == COPY_BATCH_SIZE)
> > >> +batch_size = min(xenvif_copy_batch_size, queue->rx_copy.size);
> > >
> > > Surely queue->rx_copy.size and xenvif_copy_batch_size are always
> > > identical? Why do you need this statement (and hence stack variable)?
> > >
> > This statement was to allow to be changed dynamically and would
> > affect all newly created guests or running guests if value happened
> > to be smaller than initially allocated. But I suppose I should make
> > behaviour more consistent with the other params we have right now
> > and just look at initially allocated one `queue->rx_copy.batch_size` ?
> 
> Yes, that would certainly be consistent but I can see value in
> allowing it to be dynamically tuned, so perhaps adding some re-allocation
> code to allow the batch to be grown as well as shrunk might be nice.

When shrinking we potentially risk losing data, so we need to gate the
reallocation so that it only happens when `rx_copy.num` does not exceed
the newly requested batch size. In the worst case guestrx_thread simply
keeps using the initially allocated value.

Anyhow, something like the below scissored diff (on top of your comments):

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index a165a4123396..8e4eaf3a507d 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -359,6 +359,7 @@ irqreturn_t xenvif_ctrl_irq_fn(int irq, void *data);
 
 void xenvif_rx_action(struct xenvif_queue *queue);
 void xenvif_rx_queue_tail(struct xenvif_queue *queue, struct sk_buff *skb);
+int xenvif_rx_copy_realloc(struct xenvif_queue *queue, unsigned int size);
 
 void xenvif_carrier_on(struct xenvif *vif);
 
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 1892bf9327e4..14613b5fcccb 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -516,20 +516,13 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
 
 int xenvif_init_queue(struct xenvif_queue *queue)
 {
-   unsigned int size = xenvif_copy_batch_size;
    int err, i;
-   void *addr;
-
-   addr = vzalloc(size * sizeof(struct gnttab_copy));
-   if (!addr)
-       goto err;
-   queue->rx_copy.op = addr;
 
-   addr = vzalloc(size * sizeof(RING_IDX));
-   if (!addr)
+   err = xenvif_rx_copy_realloc(queue, xenvif_copy_batch_size);
+   if (err) {
+       netdev_err(queue->vif->dev, "Could not alloc rx_copy\n");
        goto err;
-   queue->rx_copy.idx = addr;
-   queue->rx_copy.batch_size = size;
+   }
 
    queue->credit_bytes = queue->remaining_credit = ~0UL;
    queue->credit_usec  = 0UL;
diff --git a/drivers/net/xen-netback/rx.c b/drivers/net/xen-netback/rx.c
index be3946cdaaf6..f54bfe72188c 100644
--- a/drivers/net/xen-netback/rx.c
+++ b/drivers/net/xen-netback/rx.c
@@ -130,6 +130,51 @@ static void xenvif_rx_queue_drop_expired(struct xenvif_queue *queue)
    }
 }
 
+int xenvif_rx_copy_realloc(struct xenvif_queue *queue, unsigned int size)
+{
+   void *op = NULL, *idx = NULL;
+
+   /* No reallocation if new size doesn't fit ongoing requests */
+   if (!size || queue->rx_copy.num > size)
+       return -EINVAL;
+
+   op = vzalloc(size * sizeof(struct gnttab_copy));
+   if (!op)
+       goto err;
+
+   idx = vzalloc(size * sizeof(RING_IDX));
+   if (!idx)
+       goto err;
+
+   /* Ongoing requests need copying */
+   if (queue->rx_copy.num) {
+       unsigned int tmp;
+
+       tmp = queue->rx_copy.num * sizeof(struct gnttab_copy);
+       memcpy(op, queue->rx_copy.op, tmp);
+
+       tmp = qu
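
The archive cuts the diff off at this point. The sketch below is not a reconstruction of the missing lines; it is a separate user-space model of the gating idea described above (refuse a size that cannot hold the queued entries, otherwise copy them into the new arrays). `batch_state` and `batch_realloc()` are hypothetical names, and `calloc()`/`free()` stand in for `vzalloc()`/`vfree()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space stand-ins for the kernel types in the patch. */
struct batch_op {
    unsigned long src, dst;
    unsigned int len;
};

struct batch_state {
    struct batch_op *op;     /* pending copy operations */
    unsigned int *idx;       /* ring index per operation */
    unsigned int batch_size; /* allocated capacity */
    unsigned int num;        /* operations currently queued */
};

int batch_realloc(struct batch_state *s, unsigned int size)
{
    struct batch_op *op;
    unsigned int *idx;

    /* Refuse a size that cannot hold the entries already queued. */
    if (size == 0 || s->num > size)
        return -EINVAL;
    assert(s->num <= size);

    op = calloc(size, sizeof(*op));
    idx = calloc(size, sizeof(*idx));
    if (!op || !idx) {
        free(op);
        free(idx);
        return -ENOMEM;
    }

    /* Carry the queued entries over to the new arrays. */
    if (s->num) {
        memcpy(op, s->op, s->num * sizeof(*op));
        memcpy(idx, s->idx, s->num * sizeof(*idx));
    }

    free(s->op);
    free(s->idx);
    s->op = op;
    s->idx = idx;
    s->batch_size = size;
    return 0;
}
```

On either failure path the old arrays and the old batch size stay untouched, which matches the "worst case keeps the initial allocation" behaviour described in the mail.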

Re: [Xen-devel] [PATCH net-next v1] xen-netback: make copy batch size configurable

2017-11-13 Thread Joao Martins
On 11/13/2017 10:33 AM, Paul Durrant wrote:
>> -----Original Message-----
>> From: Joao Martins [mailto:joao.m.mart...@oracle.com]
>> Sent: 10 November 2017 19:35
>> To: net...@vger.kernel.org
>> Cc: Joao Martins <joao.m.mart...@oracle.com>; Wei Liu
>> <wei.l...@citrix.com>; Paul Durrant <paul.durr...@citrix.com>; xen-
>> de...@lists.xenproject.org
>> Subject: [PATCH net-next v1] xen-netback: make copy batch size
>> configurable
>>
>> Commit eb1723a29b9a ("xen-netback: refactor guest rx") refactored Rx
>> handling and as a result decreased max grant copy ops from 4352 to 64.
>> Before this commit it would drain the rx_queue (while there are
>> enough slots in the ring to put packets) then copy to all pages and write
>> responses on the ring. With the refactor we do almost the same albeit
>> the last two steps are done every COPY_BATCH_SIZE (64) copies.
>>
>> For big packets, the value of 64 means copying 3 packets best case scenario
>> (17 copies) and worst-case only 1 packet (34 copies, i.e. if all frags
>> plus head cross the 4k grant boundary) which could be the case when
>> packets go from local backend process.
>>
>> Instead of making it static to 64 grant copies, let's allow the user to
>> select its value (while keeping the current as default) by introducing
>> the `copy_batch_size` module parameter. This allows users to select
>> the higher batches (i.e. for better throughput with big packets) as it
>> was prior to the above mentioned commit.
>>
>> Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
>> ---
>>  drivers/net/xen-netback/common.h|  6 --
>>  drivers/net/xen-netback/interface.c | 25 -
>>  drivers/net/xen-netback/netback.c   |  5 +
>>  drivers/net/xen-netback/rx.c|  5 -
>>  4 files changed, 37 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-
>> netback/common.h
>> index a46a1e94505d..a5fe36e098a7 100644
>> --- a/drivers/net/xen-netback/common.h
>> +++ b/drivers/net/xen-netback/common.h
>> @@ -129,8 +129,9 @@ struct xenvif_stats {
>>  #define COPY_BATCH_SIZE 64
>>
>>  struct xenvif_copy_state {
>> -struct gnttab_copy op[COPY_BATCH_SIZE];
>> -RING_IDX idx[COPY_BATCH_SIZE];
>> +struct gnttab_copy *op;
>> +RING_IDX *idx;
>> +unsigned int size;
> 
> Could you name this batch_size, or something like that to make it clear what 
> it means?
>
Yeap, will change it.

>>  unsigned int num;
>>  struct sk_buff_head *completed;
>>  };
>> @@ -381,6 +382,7 @@ extern unsigned int rx_drain_timeout_msecs;
>>  extern unsigned int rx_stall_timeout_msecs;
>>  extern unsigned int xenvif_max_queues;
>>  extern unsigned int xenvif_hash_cache_size;
>> +extern unsigned int xenvif_copy_batch_size;
>>
>>  #ifdef CONFIG_DEBUG_FS
>>  extern struct dentry *xen_netback_dbg_root;
>> diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-
>> netback/interface.c
>> index d6dff347f896..a558868a883f 100644
>> --- a/drivers/net/xen-netback/interface.c
>> +++ b/drivers/net/xen-netback/interface.c
>> @@ -516,7 +516,20 @@ struct xenvif *xenvif_alloc(struct device *parent,
>> domid_t domid,
>>
>>  int xenvif_init_queue(struct xenvif_queue *queue)
>>  {
>> +int size = xenvif_copy_batch_size;
> 
> unsigned int
>>  int err, i;
>> +void *addr;
>> +
>> +addr = vzalloc(size * sizeof(struct gnttab_copy));
> 
> Does the memory need to be zeroed?
>
It doesn't need to be, but given that xenvif_queue is zeroed (which previously
included this region), I thought I would leave it the same way.

>> +if (!addr)
>> +goto err;
>> +queue->rx_copy.op = addr;
>> +
>> +addr = vzalloc(size * sizeof(RING_IDX));
> 
> Likewise.
> 
>> +if (!addr)
>> +goto err;
>> +queue->rx_copy.idx = addr;
>> +queue->rx_copy.size = size;
>>
>>  queue->credit_bytes = queue->remaining_credit = ~0UL;
>>  queue->credit_usec  = 0UL;
>> @@ -544,7 +557,7 @@ int xenvif_init_queue(struct xenvif_queue *queue)
>>   queue->mmap_pages);
>>  if (err) {
>>  netdev_err(queue->vif->dev, "Could not reserve
>> mmap_pages\n");
>> -return -ENOMEM;
>> +goto err;
>>  }
>>
>>  for (i = 0; i < MAX_PENDING_REQS; i++) {
>> @@ -5

[Xen-devel] [PATCH net-next v1] xen-netback: make copy batch size configurable

2017-11-10 Thread Joao Martins
Commit eb1723a29b9a ("xen-netback: refactor guest rx") refactored Rx
handling and as a result decreased max grant copy ops from 4352 to 64.
Before this commit it would drain the rx_queue (while there are
enough slots in the ring to put packets) then copy to all pages and write
responses on the ring. With the refactor we do almost the same albeit
the last two steps are done every COPY_BATCH_SIZE (64) copies.

For big packets, the value of 64 means copying 3 packets best case scenario
(17 copies) and worst-case only 1 packet (34 copies, i.e. if all frags
plus head cross the 4k grant boundary) which could be the case when
packets go from local backend process.

Instead of making it static to 64 grant copies, let's allow the user to
select its value (while keeping the current as default) by introducing
the `copy_batch_size` module parameter. This allows users to select
the higher batches (i.e. for better throughput with big packets) as it
was prior to the above mentioned commit.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
 drivers/net/xen-netback/common.h|  6 --
 drivers/net/xen-netback/interface.c | 25 -
 drivers/net/xen-netback/netback.c   |  5 +
 drivers/net/xen-netback/rx.c|  5 -
 4 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index a46a1e94505d..a5fe36e098a7 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -129,8 +129,9 @@ struct xenvif_stats {
 #define COPY_BATCH_SIZE 64
 
 struct xenvif_copy_state {
-   struct gnttab_copy op[COPY_BATCH_SIZE];
-   RING_IDX idx[COPY_BATCH_SIZE];
+   struct gnttab_copy *op;
+   RING_IDX *idx;
+   unsigned int size;
    unsigned int num;
    struct sk_buff_head *completed;
 };
@@ -381,6 +382,7 @@ extern unsigned int rx_drain_timeout_msecs;
 extern unsigned int rx_stall_timeout_msecs;
 extern unsigned int xenvif_max_queues;
 extern unsigned int xenvif_hash_cache_size;
+extern unsigned int xenvif_copy_batch_size;
 
 #ifdef CONFIG_DEBUG_FS
 extern struct dentry *xen_netback_dbg_root;
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index d6dff347f896..a558868a883f 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -516,7 +516,20 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
 
 int xenvif_init_queue(struct xenvif_queue *queue)
 {
+   int size = xenvif_copy_batch_size;
    int err, i;
+   void *addr;
+
+   addr = vzalloc(size * sizeof(struct gnttab_copy));
+   if (!addr)
+       goto err;
+   queue->rx_copy.op = addr;
+
+   addr = vzalloc(size * sizeof(RING_IDX));
+   if (!addr)
+       goto err;
+   queue->rx_copy.idx = addr;
+   queue->rx_copy.size = size;
 
    queue->credit_bytes = queue->remaining_credit = ~0UL;
    queue->credit_usec  = 0UL;
@@ -544,7 +557,7 @@ int xenvif_init_queue(struct xenvif_queue *queue)
                 queue->mmap_pages);
    if (err) {
        netdev_err(queue->vif->dev, "Could not reserve mmap_pages\n");
-       return -ENOMEM;
+       goto err;
    }
 
    for (i = 0; i < MAX_PENDING_REQS; i++) {
@@ -556,6 +569,13 @@ int xenvif_init_queue(struct xenvif_queue *queue)
    }
 
    return 0;
+
+err:
+   if (queue->rx_copy.op)
+       vfree(queue->rx_copy.op);
+   if (queue->rx_copy.idx)
+       vfree(queue->rx_copy.idx);
+   return -ENOMEM;
 }
 
 void xenvif_carrier_on(struct xenvif *vif)
@@ -788,6 +808,9 @@ void xenvif_disconnect_ctrl(struct xenvif *vif)
  */
 void xenvif_deinit_queue(struct xenvif_queue *queue)
 {
+   vfree(queue->rx_copy.op);
+   vfree(queue->rx_copy.idx);
+   queue->rx_copy.size = 0;
    gnttab_free_pages(MAX_PENDING_REQS, queue->mmap_pages);
 }
 
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index a27daa23c9dc..3a5e1d7ac2f4 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -96,6 +96,11 @@ unsigned int xenvif_hash_cache_size = XENVIF_HASH_CACHE_SIZE_DEFAULT;
 module_param_named(hash_cache_size, xenvif_hash_cache_size, uint, 0644);
 MODULE_PARM_DESC(hash_cache_size, "Number of flows in the hash cache");
 
+/* This is the maximum batch of grant copies on Rx */
+unsigned int xenvif_copy_batch_size = COPY_BATCH_SIZE;
+module_param_named(copy_batch_size, xenvif_copy_batch_size, uint, 0644);
+MODULE_PARM_DESC(copy_batch_size, "Maximum batch of grant copies on Rx");
+
 static void xenvif_idx_release(struct xenvif_queue *queue, u16 pending_idx,
   u8 status);
 
diff --git a/drivers/net/xen-netback/rx.c b/drivers/net/xen-netback/rx.c
index b1c
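
As a worked check of the batch arithmetic in the commit message of this patch: with 17 grant copies per big packet (best case) or 34 (worst case), a batch of 64 holds 3 or 1 whole packets, while the pre-refactor limit of 4352 held 256 or 128. The helper below is a hypothetical illustration only and is not part of the patch.

```c
#include <assert.h>

/* Whole packets that fit in one batch of grant copies. */
unsigned int packets_per_batch(unsigned int batch_size,
                               unsigned int copies_per_packet)
{
    assert(copies_per_packet != 0);
    return batch_size / copies_per_packet;
}
```

For example, `packets_per_batch(64, 17)` gives 3 and `packets_per_batch(64, 34)` gives 1, which is why a larger `copy_batch_size` can help throughput with big packets.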

[Xen-devel] [PATCH v8 4/5] x86/xen/time: setup vcpu 0 time info page

2017-11-08 Thread Joao Martins
In order to support pvclock vdso on Xen we need to set up the time
info page for vcpu 0 and register the page with Xen using the
VCPUOP_register_vcpu_time_memory_area hypercall. This hypercall
will also forcefully update the pvti, which will set some of the
flags necessary for vdso. Afterwards we check whether it supports the
PVCLOCK_TSC_STABLE_BIT flag, which is mandatory for having
vdso/vsyscall support. If so, we set the cpu 0 pvti that
will later be used when mapping the vdso image.

The xen headers are also updated to include the new hypercall for
registering the secondary vcpu_time_info struct.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Reviewed-by: Juergen Gross <jgr...@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
Changes since v5:
 * Move xen_setup_vsyscall_time_info within the PVCLOCK_TSC_STABLE_BIT
 clause added in predecessor patch.

Changes since v4:
 * Remove pvclock_set_flags since predecessor patch will set in
 xen_time_init. Consequently pvti local variable is not so useful
 and doesn't make things more clear - therefore remove it.
 * Adjust comment on xen_setup_vsyscall_time_info()
 * Add Juergen's Reviewed-by (Retained as there wasn't functional
 changes)

Changes since v3:
 (Comments from Juergen)
 * Remove _t added suffix from *GUEST_HANDLE* when sync vcpu.h
 with the latest

Changes since v2:
 (Comments from Juergen)
 * Omit the blank after the cast on all 3 occurrences.
 * Change last VCLOCK_PVCLOCK message to be more descriptive
 * Sync the complete vcpu.h header instead of just adding the
 needed one. (IOW adding VCPUOP_get_physid)

Changes since v1:
 * Check flags ahead to see if the  primary clock can use
 PVCLOCK_TSC_STABLE_BIT even if secondary registration fails.
 (Comments from Boris)
 * Remove addr, addr variables;
 * Change first pr_debug to pr_warn;
 * Change last pr_debug to pr_notice;
 * Add routine to solely register secondary time info.
 * Move xen_clock to outside xen_setup_vsyscall_time_info to allow
 restore path to simply re-register secondary time info. Let us
 handle the restore path more gracefully without re-allocating a
 page.
 * Removed cpu argument from xen_setup_vsyscall_time_info()
 * Adjustment failed registration error messages/loglevel to be the same
 * Also teardown secondary time info on suspend

Changes since RFC:
 (Comments from Boris and David)
 * Remove Kconfig option
 * Use get_zeroed_page/free_page
 * Remove the hypercall availability check
 * Unregister pvti with arg.addr.v = NULL if stable bit isn't supported.
 (New)
 * Set secondary copy on restore such that it works on migration.
 * Drop global xen_clock variable and stash it locally on
 xen_setup_vsyscall_time_info.
 * WARN_ON(ret) if we fail to unregister the pvti.
---
 arch/x86/xen/suspend.c   |  4 ++
 arch/x86/xen/time.c  | 90 +++-
 arch/x86/xen/xen-ops.h   |  2 +
 include/xen/interface/vcpu.h | 42 +
 4 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index d6b1680693a9..800ed36ecfba 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -16,6 +16,8 @@
 
 void xen_arch_pre_suspend(void)
 {
+   xen_save_time_memory_area();
+
    if (xen_pv_domain())
        xen_pv_pre_suspend();
 }
@@ -26,6 +28,8 @@ void xen_arch_post_suspend(int cancelled)
        xen_pv_post_suspend(cancelled);
    else
        xen_hvm_post_suspend(cancelled);
+
+   xen_restore_time_memory_area();
 }
 
 static void xen_vcpu_notify_restore(void *data)
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index fc0148d3a70d..dec966fbe888 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -370,6 +370,92 @@ static const struct pv_time_ops xen_time_ops __initconst = {
    .steal_clock = xen_steal_clock,
 };
 
+static struct pvclock_vsyscall_time_info *xen_clock __read_mostly;
+
+void xen_save_time_memory_area(void)
+{
+   struct vcpu_register_time_memory_area t;
+   int ret;
+
+   if (!xen_clock)
+       return;
+
+   t.addr.v = NULL;
+
+   ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
+   if (ret != 0)
+       pr_notice("Cannot save secondary vcpu_time_info (err %d)",
+                 ret);
+   else
+       clear_page(xen_clock);
+}
+
+void xen_restore_time_memory_area(void)
+{
+   struct vcpu_register_time_memory_area t;
+   int ret;
+
+   if (!xen_clock)
+       return;
+
+   t.addr.v = &xen_clock->pvti;
+
+   ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
+
+   /*
+* We don't disable VCLOCK_PVCLOCK entirely if it fails to register the
+* secondary time info with Xen or if we migrated to a host without the
+* necessary flags. On both of these cases what happens is either
+* p

[Xen-devel] [PATCH v8 2/5] x86/pvclock: add setter for pvclock_pvti_cpu0_va

2017-11-08 Thread Joao Martins
Right now there is only a pvclock_pvti_cpu0_va() which is defined
on kvmclock since:

commit dac16fba6fc5
("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")

The only user of this interface so far is kvm. This commit adds a
setter function for the pvti page and moves pvclock_pvti_cpu0_va
to pvclock, which is a more generic place to have it; and would
allow other PV clocksources to use it, such as Xen.

While moving pvclock_pvti_cpu0_va into pvclock, rename also this
function to pvclock_get_pvti_cpu0_va (including its call sites)
to be symmetric with the setter (pvclock_set_pvti_cpu0_va).

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Acked-by: Andy Lutomirski <l...@kernel.org>
Acked-by: Paolo Bonzini <pbonz...@redhat.com>
Acked-by: Thomas Gleixner <t...@linutronix.de>
---
Changes since v7:
 * Add Paolo Acked-by
 (Comments from Thomas Gleixner)
 * Rename getter to pvclock_get_pvti_cpu0_va
 and fixup its callsites (vdso/vma.c and ptp/ptp_kvm.c)
 * Add Thomas Acked-by

Changes since v1:
 * Rebased: the only conflict was that I had to move the export of the
 pvclock_pvti_cpu0_va() symbol, as it is used by the kvm PTP driver.
 * Do not initialize pvti_cpu0_va to NULL (checkpatch error)
 ( Comments from Andy Lutomirski )
 * Removed asm/pvclock.h 'pvclock_set_pvti_cpu0_va' definition
 for non !PARAVIRT_CLOCK to better track screwed Kconfig stuff.
 * Add his Acked-by (provided the previous adjustment was made)

Changes since RFC:
 (Comments from Andy Lutomirski)
 * Add __init to pvclock_set_pvti_cpu0_va
 * Add WARN_ON(vclock_was_used(VCLOCK_PVCLOCK)) to
 pvclock_set_pvti_cpu0_va
---
 arch/x86/entry/vdso/vma.c  |  2 +-
 arch/x86/include/asm/pvclock.h | 19 ++-
 arch/x86/kernel/kvmclock.c |  7 +--
 arch/x86/kernel/pvclock.c  | 14 ++
 drivers/ptp/ptp_kvm.c  |  2 +-
 5 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 1911310959f8..a77fd3c8d824 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -112,7 +112,7 @@ static int vvar_fault(const struct vm_special_mapping *sm,
                __pa_symbol(&__vvar_page) >> PAGE_SHIFT);
    } else if (sym_offset == image->sym_pvclock_page) {
        struct pvclock_vsyscall_time_info *pvti =
-           pvclock_pvti_cpu0_va();
+           pvclock_get_pvti_cpu0_va();
        if (pvti && vclock_was_used(VCLOCK_PVCLOCK)) {
            ret = vm_insert_pfn(
                vma,
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 448cfe1b48cf..55325f934d71 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -4,15 +4,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_KVM_GUEST
-extern struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
-#else
-static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
-{
-   return NULL;
-}
-#endif
-
 /* some helper functions for xen and kvm pv clock sources */
 u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
 u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
@@ -101,4 +92,14 @@ struct pvclock_vsyscall_time_info {
 
 #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti);
+struct pvclock_vsyscall_time_info *pvclock_get_pvti_cpu0_va(void);
+#else
+static inline struct pvclock_vsyscall_time_info *pvclock_get_pvti_cpu0_va(void)
+{
+   return NULL;
+}
+#endif
+
 #endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index d88967659098..538738047ff5 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -47,12 +47,6 @@ early_param("no-kvmclock", parse_no_kvmclock);
 static struct pvclock_vsyscall_time_info *hv_clock;
 static struct pvclock_wall_clock wall_clock;
 
-struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
-{
-   return hv_clock;
-}
-EXPORT_SYMBOL_GPL(pvclock_pvti_cpu0_va);
-
 /*
  * The wallclock is the time of day when we booted. Since then, some time may
  * have elapsed since the hypervisor wrote the data. So we try to account for
@@ -334,6 +328,7 @@ int __init kvm_setup_vsyscall_timeinfo(void)
        return 1;
    }
 
+   pvclock_set_pvti_cpu0_va(hv_clock);
    put_cpu();
 
    kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 5c3f6d6a5078..761f6af6efa5 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -25,8 +25,10 @@
 
 #include 
 #include 
+#include 
 
 static u8 valid_flags __read_mostly = 0;
+static struct pvclock_vsyscall_time_info *pvti_cpu0_va __read_mostly;
 
 void pvclock_set_flags(u8 

[Xen-devel] [PATCH v8 1/5] ptp_kvm: probe for kvm guest availability

2017-11-08 Thread Joao Martins
In the event of moving pvclock_pvti_cpu0_va() definition to common
pvclock code, this function would return a value on non KVM guests.
Later on this would fail with a GPF on ptp_kvm_init when running on a
Xen guest. Therefore, ptp_kvm_init() should check whether it is running
in a KVM guest.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Acked-by: Radim Krčmář <rkrc...@redhat.com>
---
Changes since v7:
 * Add Radim's Acked-by
---
 drivers/ptp/ptp_kvm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/ptp/ptp_kvm.c b/drivers/ptp/ptp_kvm.c
index 2b1b212c219e..e04d7b2ecb3a 100644
--- a/drivers/ptp/ptp_kvm.c
+++ b/drivers/ptp/ptp_kvm.c
@@ -178,6 +178,9 @@ static int __init ptp_kvm_init(void)
 {
    long ret;
 
+   if (!kvm_para_available())
+       return -ENODEV;
+
    clock_pair_gpa = slow_virt_to_phys(&clock_pair);
    hv_clock = pvclock_pvti_cpu0_va();
 
-- 
2.11.0




[Xen-devel] [PATCH v8 3/5] x86/xen/time: set pvclock flags on xen_time_init()

2017-11-08 Thread Joao Martins
Specifically check for PVCLOCK_TSC_STABLE_BIT and, if this bit is set,
set it in the pvclock flags too. This allows the Xen clocksource to use it,
thus speeding up xen_clocksource_read() callers (e.g. sched_clock()).

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
Changes since v5:
 * Add Boris RoB

New in v5
---
 arch/x86/xen/time.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 1ecb05db3632..fc0148d3a70d 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -372,6 +372,7 @@ static const struct pv_time_ops xen_time_ops __initconst = {
 
 static void __init xen_time_init(void)
 {
+   struct pvclock_vcpu_time_info *pvti;
    int cpu = smp_processor_id();
    struct timespec tp;
 
@@ -395,6 +396,14 @@ static void __init xen_time_init(void)
 
    setup_force_cpu_cap(X86_FEATURE_TSC);
 
+   /*
+    * We check ahead on the primary time info if this
+    * bit is supported hence speeding up Xen clocksource.
+    */
+   pvti = &__this_cpu_read(xen_vcpu)->time;
+   if (pvti->flags & PVCLOCK_TSC_STABLE_BIT)
+       pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
+
    xen_setup_runstate_info(cpu);
    xen_setup_timer(cpu);
    xen_setup_cpu_clockevents();
-- 
2.11.0
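
To illustrate the flag check this patch adds, here is a minimal user-space model. All `model_*` names are hypothetical stand-ins, and the masking done in `model_read_flags()` is an assumption about how a reader would honour only the flags marked valid; the bit value is the one defined in the kernel's pvclock-abi.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PVCLOCK_TSC_STABLE_BIT (1 << 0) /* as defined in pvclock-abi.h */

/* Hypothetical stand-in for the per-vCPU time info shared by the hypervisor. */
struct model_time_info {
    uint8_t flags;
};

/* Flags the guest has decided to honour; nothing is trusted by default. */
static uint8_t model_valid_flags;

void model_set_flags(uint8_t flags)
{
    model_valid_flags = flags;
}

/* Assumed reader behaviour: only flags marked valid are reported. */
uint8_t model_read_flags(const struct model_time_info *ti)
{
    return ti->flags & model_valid_flags;
}

/* Mirrors the check added to xen_time_init() in the patch. */
void model_time_init(const struct model_time_info *primary)
{
    assert(primary != NULL);
    if (primary->flags & PVCLOCK_TSC_STABLE_BIT)
        model_set_flags(PVCLOCK_TSC_STABLE_BIT);
}
```

Until the init step sees the stable bit on the primary time info, a reader in this model reports no flags at all, so the faster path stays disabled.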




[Xen-devel] [PATCH v8 5/5] MAINTAINERS: xen, kvm: track pvclock-abi.h changes

2017-11-08 Thread Joao Martins
This file defines an ABI shared between guest and hypervisor(s)
(KVM, Xen) and as such there should be a corresponding entry in the
MAINTAINERS file. Notice that there's already a text notice at the
top of the header file, hence this commit simply enforces it more
explicitly and has both peers notified when such changes happen.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Acked-by: Juergen Gross <jgr...@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
Acked-by: Paolo Bonzini <pbonz...@redhat.com>
---
Changes since v4:
 * Add Paolo's Acked-by
 * Add Konrad's Reviewed-by

Changes since v1:
 * Add Juergen Gross's Acked-by.
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index af0cb69f6a3e..ff93f4a44d2e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7604,6 +7604,7 @@ S:Supported
 F: arch/x86/kvm/
 F: arch/x86/include/uapi/asm/kvm*
 F: arch/x86/include/asm/kvm*
+F: arch/x86/include/asm/pvclock-abi.h
 F: arch/x86/kernel/kvm.c
 F: arch/x86/kernel/kvmclock.c
 
@@ -14731,6 +14732,7 @@ F:  arch/x86/xen/
 F: drivers/*/xen-*front.c
 F: drivers/xen/
 F: arch/x86/include/asm/xen/
+F: arch/x86/include/asm/pvclock-abi.h
 F: include/xen/
 F: include/uapi/xen/
 F: Documentation/ABI/stable/sysfs-hypervisor-xen
-- 
2.11.0




[Xen-devel] [PATCH v8 0/5] x86/xen: pvclock vdso support

2017-11-08 Thread Joao Martins
Hey,

This is take 8 of vdso support for Xen. PVCLOCK_TSC_STABLE_BIT, which is
required for vdso time-related calls, can be set starting with Xen 4.8. In
order to have it on, the hypervisor clocksource needs to be TSC, e.g. with the
following boot params: "clocksource=tsc tsc=stable:socket".

The series is structured as follows:

Patch 1 probes for a KVM guest in ptp_kvm, in preparation for having
pvclock_pvti_cpu0_va() moved to common pvclock code (in the next patch)
Patch 2 streamlines pvti page get/set in pvclock for both of its users
Patches 3 and 4 register the pvti page on Xen and set it in pvclock accordingly
Patch 5 adds a file to KVM/Xen maintainers for tracking pvclock ABI changes.

All patches appear to be Acked by their respective maintainers.

The difference to v7 is adding the Acks on patches 1 and 2 plus the adjustment
from Thomas to rename the getter function. (Changelog in individual patches)

Thanks,
Joao

Joao Martins (5):
  ptp_kvm: probe for kvm guest availability
  x86/pvclock: add setter for pvclock_pvti_cpu0_va
  x86/xen/time: set pvclock flags on xen_time_init()
  x86/xen/time: setup vcpu 0 time info page
  MAINTAINERS: xen, kvm: track pvclock-abi.h changes

 MAINTAINERS|  2 +
 arch/x86/entry/vdso/vma.c  |  2 +-
 arch/x86/include/asm/pvclock.h | 19 +
 arch/x86/kernel/kvmclock.c |  7 +--
 arch/x86/kernel/pvclock.c  | 14 ++
 arch/x86/xen/suspend.c |  4 ++
 arch/x86/xen/time.c| 97 ++
 arch/x86/xen/xen-ops.h |  2 +
 drivers/ptp/ptp_kvm.c  |  5 ++-
 include/xen/interface/vcpu.h   | 42 ++
 10 files changed, 177 insertions(+), 17 deletions(-)

-- 
2.11.0




Re: [Xen-devel] [PATCH v7 2/5] x86/pvclock: add setter for pvclock_pvti_cpu0_va

2017-11-08 Thread Joao Martins
On 11/08/2017 11:06 AM, Thomas Gleixner wrote:
> On Tue, 7 Nov 2017, Joao Martins wrote:
>> On 11/06/2017 04:09 PM, Paolo Bonzini wrote:
>>> On 19/10/2017 15:39, Joao Martins wrote:
>>>> Right now there is only a pvclock_pvti_cpu0_va() which is defined
>>>> on kvmclock since:
>>>>
>>>> commit dac16fba6fc5
>>>> ("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")
>>>>
>>>> The only user of this interface so far is kvm. This commit adds a
>>>> setter function for the pvti page and moves pvclock_pvti_cpu0_va
>>>> to pvclock, which is a more generic place to have it; and would
>>>> allow other PV clocksources to use it, such as Xen.
>>>>
>>>> Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
>>>> Acked-by: Andy Lutomirski <l...@kernel.org>
>>>
>>> Acked-by: Paolo Bonzini <pbonz...@redhat.com>
>>>
>>> IOW, the Xen folks are free to pick up the whole series. :)
>>>
>> Thank you!
>>
>> I guess only x86 maintainers Ack is left - any comments?
> 
> The only nit-pick I have are the convoluted function names:
> 
> pvclock_set_pvti_cpu0_va() pvclock_pvti_cpu0_va()
> 
> What on earth does that mean?
>
Those two functions respectively set and get, in pvclock common code, the
address of a page for vCPU 0 containing time info (pvti, which is periodically
updated by the hypervisor). This region is guest memory: the guest PV
clocksource registers it with the hypervisor and sets it in pvclock if certain
conditions are met (i.e. PVCLOCK_TSC_STABLE_BIT is supported by the
hypervisor). The getter is afterwards used by the vdso and ptp_kvm.

FWIW I merely followed the style of the existing function, but there could be
a better name such as "pvclock_set_data() pvclock_get_data()". That said, the
current names are more explicit about what we should expect to set or get from
the functions.
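
To illustrate the set/get pair described above, here is a minimal user-space
sketch. It is only an illustration under assumptions: `struct pvti_page` is a
simplified stand-in rather than the kernel's `struct pvclock_vsyscall_time_info`,
and the `pvti_set_cpu0_va()`/`pvti_get_cpu0_va()` names are hypothetical, not
the ones used in the series.

```c
#include <stddef.h>

/* Hypothetical stand-in for the vCPU 0 time info page (not the kernel struct). */
struct pvti_page {
    unsigned int version;
    unsigned char flags;
};

/* Address registered by the PV clocksource; stays NULL until one registers. */
static struct pvti_page *pvti_cpu0_va;

/* Called by the PV clocksource once it has registered the page. */
void pvti_set_cpu0_va(struct pvti_page *pvti)
{
    pvti_cpu0_va = pvti;
}

/* Called by consumers; NULL means no page was registered. */
struct pvti_page *pvti_get_cpu0_va(void)
{
    return pvti_cpu0_va;
}
```

A consumer would be expected to treat a NULL return as "no pvti page
available", which matches what the stub in the patch returns when
CONFIG_PARAVIRT_CLOCK is not set.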

> Aside of that can you please make it at least symmetric, i.e. _set_ and
> _get_ ?
> 
OK - Given that this changes an exported symbol (pvclock_pvti_cpu0_va, in use
by ptp_kvm) and is a non-functional change, would you want me to address it in
a separate patch, or is it OK to have it in this one?

> Other than that:
> 
>   Acked-by: Thomas Gleixner <t...@linutronix.de>
> 
Thanks!



Re: [Xen-devel] [PATCH v7 2/5] x86/pvclock: add setter for pvclock_pvti_cpu0_va

2017-11-07 Thread Joao Martins
On 11/06/2017 04:09 PM, Paolo Bonzini wrote:
> On 19/10/2017 15:39, Joao Martins wrote:
>> Right now there is only a pvclock_pvti_cpu0_va() which is defined
>> on kvmclock since:
>>
>> commit dac16fba6fc5
>> ("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")
>>
>> The only user of this interface so far is kvm. This commit adds a
>> setter function for the pvti page and moves pvclock_pvti_cpu0_va
>> to pvclock, which is a more generic place to have it; and would
>> allow other PV clocksources to use it, such as Xen.
>>
>> Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
>> Acked-by: Andy Lutomirski <l...@kernel.org>
> 
> Acked-by: Paolo Bonzini <pbonz...@redhat.com>
> 
> IOW, the Xen folks are free to pick up the whole series. :)
> 
Thank you!

I guess only x86 maintainers Ack is left - any comments?

Joao

> Paolo
> 
>> ---
>> Changes since v1:
>>  * Rebased: the only conflict was that I had move the export
>>  pvclock_pvti_cpu0_va() symbol as it is used by kvm PTP driver.
>>  * Do not initialize pvti_cpu0_va to NULL (checkpatch error)
>>  ( Comments from Andy Lutomirski )
>>  * Removed asm/pvclock.h 'pvclock_set_pvti_cpu0_va' definition
>>  for non !PARAVIRT_CLOCK to better track screwed Kconfig stuff.
>>  * Add his Acked-by (provided the previous adjustment was made)
>>
>> Changes since RFC:
>>  (Comments from Andy Lutomirski)
>>  * Add __init to pvclock_set_pvti_cpu0_va
>>  * Add WARN_ON(vclock_was_used(VCLOCK_PVCLOCK)) to
>>  pvclock_set_pvti_cpu0_va
>> ---
>>  arch/x86/include/asm/pvclock.h | 19 ++-
>>  arch/x86/kernel/kvmclock.c |  7 +--
>>  arch/x86/kernel/pvclock.c  | 14 ++
>>  3 files changed, 25 insertions(+), 15 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
>> index 448cfe1b48cf..6f228f90cdd7 100644
>> --- a/arch/x86/include/asm/pvclock.h
>> +++ b/arch/x86/include/asm/pvclock.h
>> @@ -4,15 +4,6 @@
>>  #include 
>>  #include 
>>  
>> -#ifdef CONFIG_KVM_GUEST
>> -extern struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
>> -#else
>> -static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
>> -{
>> -return NULL;
>> -}
>> -#endif
>> -
>>  /* some helper functions for xen and kvm pv clock sources */
>>  u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
>>  u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
>> @@ -101,4 +92,14 @@ struct pvclock_vsyscall_time_info {
>>  
>>  #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
>>  
>> +#ifdef CONFIG_PARAVIRT_CLOCK
>> +void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti);
>> +struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
>> +#else
>> +static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
>> +{
>> +return NULL;
>> +}
>> +#endif
>> +
>>  #endif /* _ASM_X86_PVCLOCK_H */
>> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
>> index d88967659098..538738047ff5 100644
>> --- a/arch/x86/kernel/kvmclock.c
>> +++ b/arch/x86/kernel/kvmclock.c
>> @@ -47,12 +47,6 @@ early_param("no-kvmclock", parse_no_kvmclock);
>>  static struct pvclock_vsyscall_time_info *hv_clock;
>>  static struct pvclock_wall_clock wall_clock;
>>  
>> -struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
>> -{
>> -return hv_clock;
>> -}
>> -EXPORT_SYMBOL_GPL(pvclock_pvti_cpu0_va);
>> -
>>  /*
>>   * The wallclock is the time of day when we booted. Since then, some time may
>>   * have elapsed since the hypervisor wrote the data. So we try to account for
>> @@ -334,6 +328,7 @@ int __init kvm_setup_vsyscall_timeinfo(void)
>>  return 1;
>>  }
>>  
>> +pvclock_set_pvti_cpu0_va(hv_clock);
>>  put_cpu();
>>  
>>  kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
>> diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
>> index 5c3f6d6a5078..cb7d6d9c9c2d 100644
>> --- a/arch/x86/kernel/pvclock.c
>> +++ b/arch/x86/kernel/pvclock.c
>> @@ -25,8 +25,10 @@
>>  
>>  #include 
>>  #include 
>> +#include 
>>  
>>  static u8 valid_flags __read_mostly = 0;
>> +static struct pvclock_vsyscall_time_info *pvti_cpu0_va __read_mostly;
>>  
>>  void pvclock_set_flags(u8 flags)
>>  {
>> @@ -144,3 +146,15 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
>>  
>>  set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
>>  }
>> +
>> +void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
>> +{
>> +WARN_ON(vclock_was_used(VCLOCK_PVCLOCK));
>> +pvti_cpu0_va = pvti;
>> +}
>> +
>> +struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
>> +{
>> +return pvti_cpu0_va;
>> +}
>> +EXPORT_SYMBOL_GPL(pvclock_pvti_cpu0_va);
>>
> 



Re: [Xen-devel] [PATCH RFC 3/8] libxl: add backend_features to libxl_device_disk

2017-11-07 Thread Joao Martins


On 11/07/2017 11:28 AM, Oleksandr Grytsov wrote:
> On Thu, Nov 2, 2017 at 8:06 PM, Joao Martins <joao.m.mart...@oracle.com
> <mailto:joao.m.mart...@oracle.com>> wrote:
> 
> The function libxl__device_generic_add gains an additional
> argument through which it adds a second set of entries visible to the
> backend only. These entries will then be used for devices,
> thus overriding the backend maximum feature set with these user-defined ones.
> 
> libxl_device_disk.backend_features are a key value store storing:
>   = 
> 
> xl|libxl are stateless with respect to feature names, therefore it is up to
> the admin to carefully select those. If the backend doesn't support this,
> the features won't be overwritten.
> 
> Signed-off-by: Joao Martins <joao.m.mart...@oracle.com
> <mailto:joao.m.mart...@oracle.com>>
> ---
>  tools/libxl/libxl.h          |  8 
>  tools/libxl/libxl_console.c  |  5 +++--
>  tools/libxl/libxl_device.c   | 37 +
>  tools/libxl/libxl_disk.c     | 17 +++--
>  tools/libxl/libxl_internal.h |  4 +++-
>  tools/libxl/libxl_pci.c      |  2 +-
>  tools/libxl/libxl_types.idl  |  1 +
>  tools/libxl/libxl_usb.c      |  2 +-
>  8 files changed, 65 insertions(+), 11 deletions(-)
> 
> 
> No need to extend libxl__device_generic_add with additional parameter (brents).
> You can add nested entry in libxl__set_xenstore_ as following:
> 
> flexarray_append(back, "require/feature-persistent", "0");

Right, although entries in the "back" array will have read-only permission for
the frontend. The newly added "require" directory in this RFC was meant to be
visible only to the backend, hence it only has XS_PERM_NONE permission set.

Joao



Re: [Xen-devel] [PATCH RFC 2/8] public/io/netif: add directory for backend parameters

2017-11-06 Thread Joao Martins
On Mon, Nov 06, 2017 at 10:33:59AM +, Paul Durrant wrote:
> > -Original Message-
> > From: Joao Martins [mailto:joao.m.mart...@oracle.com]
> > Sent: 02 November 2017 18:06
> > To: Xen Development List <xen-devel@lists.xen.org>
> > Cc: Joao Martins <joao.m.mart...@oracle.com>; Konrad Rzeszutek Wilk
> > <konrad.w...@oracle.com>; Paul Durrant <paul.durr...@citrix.com>; Wei Liu
> > <wei.l...@citrix.com>
> > Subject: [PATCH RFC 2/8] public/io/netif: add directory for backend
> > parameters
> > 
> > The proposed directory provides a mechanism for tools to control the
> > maximum feature set of the device being provisioned by backend.
> > The parameters/features include offloading features, number of
> > queues etc.
> > 
> > Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
> > ---
> >  xen/include/public/io/netif.h | 16 
> >  1 file changed, 16 insertions(+)
> > 
> > diff --git a/xen/include/public/io/netif.h b/xen/include/public/io/netif.h
> > index 2454448baa..a412e4771d 100644
> > --- a/xen/include/public/io/netif.h
> > +++ b/xen/include/public/io/netif.h
> > @@ -161,6 +161,22 @@
> >   */
> > 
> >  /*
> > > + * The directory "require" may be created in backend path by tools
> > > + * domain to override the maximum feature set that backend provides to the
> > > + * frontend. The children entries within this directory are features names
> > > + * and the correspondent values that should be used by backend as defaults e.g.:
> > > + *
> > > + * /local/domain/X/backend///require
> > > + * /local/domain/X/backend///require/multi-queue-max-queues = "2"
> > > + * /local/domain/X/backend///require/feature-no-csum-offload = "1"
> > > + *
> > > + * In the example above, network backend will negotiate up to a maximum of
> > > + * two queues with frontend plus disabling IPv4 checksum offloading.
> > > + *
> > > + * This directory and its children entries shall only be visible to the backend.
> > + */
> > +
> 
> What should happen if the toolstack sets something in 'require' that
> the backend cannot provide? I don't see anything in your RFC patches
> to check that the backend has responded appropriately to the keys.

Hmm, you're right that this RFC doesn't handle that properly. For the features
the backend does provide, I had suggested back in the other thread (albeit not
implemented here) that we could compare the value of each feature in "require"
with the one announced to the frontend. But this wouldn't cover the
non-provided ones, and would possibly come across as a bit of a hack.

I could change the format of the entries within the "require"
directory to be e.g. "- = " and the
acknowledgement entry would come in the form "-status
= ". Consequently the lack of a "-status" entry would
have a stronger semantic, i.e. unsupported and ignored. The toolstack would
then have the means to check whether the feature was really successfully set as
desired or not. But then one question comes to mind: should the backend be
prevented from initializing in the event that the requested features fail to be
set? In that case the uevent (on Linux) isn't triggered, the xenbus state
doesn't get changed, and the toolstack would fail with a timeout later on.
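
A rough sketch of the toolstack-side check described above, with the xenstore
directory modelled as a plain in-memory array so it can be compiled on its own.
Only the "-status" suffix comes from the proposal; the `xs_entry` type, the
`check_feature()` name and the convention that "1" means success are
assumptions made purely for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a xenstore directory listing. */
struct xs_entry {
    const char *key;
    const char *val;
};

enum feature_result {
    FEATURE_UNSUPPORTED, /* no "<feature>-status" entry: backend ignored it */
    FEATURE_APPLIED,     /* status entry present and reports success */
    FEATURE_FAILED       /* status entry present but reports failure */
};

/* Look up a key in the listing; returns NULL when it is absent. */
static const char *xs_lookup(const struct xs_entry *dir, size_t n,
                             const char *key)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(dir[i].key, key) == 0)
            return dir[i].val;
    return NULL;
}

/* Assumption: the backend writes "1" on success, anything else on failure. */
enum feature_result check_feature(const struct xs_entry *dir, size_t n,
                                  const char *feature)
{
    char status_key[128];
    const char *status;

    snprintf(status_key, sizeof(status_key), "%s-status", feature);
    status = xs_lookup(dir, n, status_key);
    if (!status)
        return FEATURE_UNSUPPORTED;
    return strcmp(status, "1") == 0 ? FEATURE_APPLIED : FEATURE_FAILED;
}
```

The three-way result is the point of the sketch: a missing status entry is
distinguishable from an explicit failure, so the toolstack could decide
separately how to react to each.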

Also, a nice thing about this is that we could also use it to set
backend implementation-specific parameters that are not
described or relevant in the I/O specs. But then I start to wonder where the
correct place would be for backends to specify their maximum feature set of
changeable entries. Maybe:

/local/domain/X/backend/vif/features/
/local/domain/X/backend/vif/features/-desc = "Description
of "

Cheers,
Joao



[Xen-devel] [PATCH RFC 8/8] xen-netback: frontend feature control

2017-11-02 Thread Joao Martins
Toolstack may write values to the "require" subdirectory in the backend
main xenstore directory (e.g. backend/vif/X/Y/). Read these values and
use them when announcing those to the frontend. When backend scans
frontend features the values set in the require directory take
precedence, hence making no significant changes in feature parsing.

This is achieved by using the newly introduced helper
(xenbus_printf_feature()), which reads from the require subdirectory and
prints that value, or otherwise prints default_val in the entry. We
then replace all instances of xenbus_printf with this new helper. A
backend_features struct is introduced and all values set there are used
in place of the module parameters.

Note, however, that feature-rx-copy and feature-rx-flip aren't probed
because the full set of possibilities isn't implemented for them.
Additionally, probe for 'feature-no-csum-offload' to allow the toolstack
to control per-device checksum offloading.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
 drivers/net/xen-netback/xenbus.c | 122 +++
 1 file changed, 99 insertions(+), 23 deletions(-)

diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
index a56d3eab35dd..391f1f2e1af2 100644
--- a/drivers/net/xen-netback/xenbus.c
+++ b/drivers/net/xen-netback/xenbus.c
@@ -22,9 +22,25 @@
 #include 
 #include 
 
+#define REQUIRE_PATH_LEN (256)
+
+struct backend_features {
+   unsigned int max_queues;
+   unsigned int split_evtchn:1;
+   unsigned int ctrl_ring:1;
+   unsigned int can_sg:1;
+   unsigned int gso_v4:1;
+   unsigned int gso_v6:1;
+   unsigned int mcast_ctrl:1;
+   unsigned int dyn_mcast_ctrl:1;
+   unsigned int ip_no_csum:1;
+   unsigned int ipv6_csum:1;
+};
+
 struct backend_info {
struct xenbus_device *dev;
struct xenvif *vif;
+   struct backend_features features;
 
/* This is the state that will be reflected in xenstore when any
 * active hotplug script completes.
@@ -48,6 +64,17 @@ static void xen_unregister_watchers(struct xenvif *vif);
 static void set_backend_state(struct backend_info *be,
  enum xenbus_state state);
 
+static int xenbus_read_feature(const char *dir, const char *node,
+  unsigned int default_val)
+{
+   char reqnode[REQUIRE_PATH_LEN];
+   unsigned int val;
+
+   snprintf(reqnode, REQUIRE_PATH_LEN, "%s/require", dir);
+   val = xenbus_read_unsigned(reqnode, node, default_val);
+   return val;
+}
+
 #ifdef CONFIG_DEBUG_FS
 struct dentry *xen_netback_dbg_root = NULL;
 
@@ -280,6 +307,32 @@ static int netback_remove(struct xenbus_device *dev)
return 0;
 }
 
+static void netback_probe_features(struct xenbus_device *dev,
+  struct backend_info *be)
+{
+   struct backend_features *ft = &be->features;
+
+   ft->can_sg = xenbus_read_feature(dev->nodename, "feature-sg", 1);
+   ft->gso_v4 = xenbus_read_feature(dev->nodename, "feature-gso-v4", 1);
+   ft->gso_v6 = xenbus_read_feature(dev->nodename, "feature-gso-v6", 1);
+   ft->gso_v6 = xenbus_read_feature(dev->nodename, "feature-gso-v6", 1);
+   ft->ipv6_csum = xenbus_read_feature(dev->nodename,
+   "feature-ipv6-csum-offload", 1);
+   ft->ip_no_csum = xenbus_read_feature(dev->nodename,
+   "feature-no-csum-offload", 0);
+   ft->mcast_ctrl = xenbus_read_feature(dev->nodename,
+"feature-multicast-control", 1);
+   ft->dyn_mcast_ctrl = xenbus_read_feature(dev->nodename,
+   "feature-dynamic-multicast-control", 1);
+   ft->split_evtchn = xenbus_read_feature(dev->nodename,
+  "feature-split-event-channels",
+  separate_tx_rx_irq);
+   ft->max_queues = xenbus_read_feature(dev->nodename,
+"multi-queue-max-queues",
+xenvif_max_queues);
+   ft->ctrl_ring = xenbus_read_feature(dev->nodename, "feature-ctrl-ring",
+   1);
+}
 
 /**
  * Entry point to this code when a new device is created.  Allocate the basic
@@ -291,8 +344,8 @@ static int netback_probe(struct xenbus_device *dev,
const char *message;
struct xenbus_transaction xbt;
int err;
-   int sg;
const char *script;
+   struct backend_features *ft;
struct backend_info *be = kzalloc(sizeof(struct backend_info),
  GFP_KERNEL);

[Xen-devel] [PATCH RFC 7/8] xen-blkback: frontend feature control

2017-11-02 Thread Joao Martins
Toolstack may write values to the "require" subdirectory in the
backend main directory (e.g. backend/vbd/X/Y/). Read these values
and use them when announcing those to the frontend. When backend
scans frontend features the values set in the require directory
take precedence, hence making no significant changes in feature
parsing.

xenbus_read_feature() reads the value from the require subdirectory,
falling back to default_val when the entry is absent. We then replace
all instances of xenbus_printf to use these previously seeded features.
A backend_features struct is introduced and all values set there are
used in place of the module parameters.

Note, however, that feature-barrier, feature-flush-support and
feature-discard aren't probed because the first two are physical
device dependent and feature-discard already has tunables to
adjust.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
 drivers/block/xen-blkback/blkback.c |  2 +-
 drivers/block/xen-blkback/common.h  |  1 +
 drivers/block/xen-blkback/xenbus.c  | 66 -
 3 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index c90e90b6..05b3f124c871 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -1271,7 +1271,7 @@ static int dispatch_rw_block_io(struct xen_blkif_ring *ring,
unlikely((req->operation != BLKIF_OP_INDIRECT) &&
 (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST)) ||
unlikely((req->operation == BLKIF_OP_INDIRECT) &&
-(nseg > MAX_INDIRECT_SEGMENTS))) {
+(nseg > ring->blkif->vbd.max_indirect_segs))) {
pr_debug("Bad number of segments in request (%d)\n", nseg);
/* Haven't submitted any bio's yet. */
goto fail_response;
diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index a7832428e0da..ff12f2d883b9 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -229,6 +229,7 @@ struct xen_vbd {
    unsigned int discard_secure:1;
    unsigned int feature_gnt_persistent:1;
    unsigned int overflow_max_grants:1;
+   unsigned int max_indirect_segs;
 };
 
 struct backend_info;
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 48d796ea3626..31683f29d5fb 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -25,11 +25,19 @@
 
 /* On the XenBus the max length of 'ring-ref%u'. */
 #define RINGREF_NAME_LEN (20)
+#define REQUIRE_PATH_LEN (256)
+
+struct backend_features {
+   unsigned max_queues;
+   unsigned max_ring_order;
+   unsigned pers_grants;
+};
 
 struct backend_info {
    struct xenbus_device *dev;
    struct xen_blkif *blkif;
    struct xenbus_watch backend_watch;
+   struct backend_features features;
    unsigned major;
    unsigned minor;
    char *mode;
@@ -602,6 +610,40 @@ int xen_blkbk_barrier(struct xenbus_transaction xbt,
return err;
 }
 
+static int xenbus_read_feature(const char *dir, const char *node,
+  unsigned int default_val)
+{
+   char reqnode[REQUIRE_PATH_LEN];
+   unsigned int val;
+
+   snprintf(reqnode, REQUIRE_PATH_LEN, "%s/require", dir);
+   val = xenbus_read_unsigned(reqnode, node, default_val);
+   return val;
+}
+
+static void xen_blkbk_probe_features(struct xenbus_device *dev,
+struct backend_info *be)
+{
+   struct backend_features *ft = &be->features;
+   struct xen_vbd *vbd = &be->blkif->vbd;
+
+   vbd->max_indirect_segs = xenbus_read_feature(dev->nodename,
+   "feature-max-indirect-segments",
+   MAX_INDIRECT_SEGMENTS);
+
+   ft->max_queues = xenbus_read_feature(dev->nodename,
+"multi-queue-max-queues",
+xenblk_max_queues);
+
+   ft->max_ring_order = xenbus_read_feature(dev->nodename,
+"max-ring-page-order",
+xen_blkif_max_ring_order);
+
+   ft->pers_grants = xenbus_read_feature(dev->nodename,
+ "feature-persistent",
+ 1);
+}
+
 /*
  * Entry point to this code when a new device is created.  Allocate the basic
  * structures, and watch the store waiting for the hotplug scripts to tell us
@@ -613,6 +65

[Xen-devel] [PATCH RFC 0/8] libxl, xl, public/io: PV backends feature control

2017-11-02 Thread Joao Martins
Hey folks,

Presented herewith is an attempt to implement PV backends feature control
as discussed in the list 
(https://lists.xen.org/archives/html/xen-devel/2017-09/msg00766.html)

Given that this is a simple proposal, I thought to include all the changes
involved in the same patchset so that everyone sees all of them and has a
better estimate (but restricted to xen-devel just for RFC purposes).

The motivation here is to allow system administrators more fine grained
control of the device features being used by guest.

The only change I made compared to the proposal discussed above was to use
"require" instead of "request" as the prefix, because there is a feature which
has "request" in it. But if "request" is still preferred as a prefix I can
change it.

The scheme proposed is quite simple:

* The directory "require" is created (inside the backend path) and within that
directory the features/capabilities names and values are written.

* Toolstack constructs a key value store of features, which the user specifies
through special entry names also prefixed with "require". Toolstack is
stateless, thus the sysadmin has full control over what to pass to the backend.
In other words, it doesn't look at particular feature names/values.

* The backend will then use that for seeding its maximum feature set to the
frontend.
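
The backend side of the scheme above can be sketched roughly as follows. This
is a user-space illustration only: the "require" directory is modelled as an
array, and `require_entry`, `read_feature()` and `probe_features()` are
hypothetical names rather than the helpers from the patches. The entry names
are the ones used elsewhere in this series.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory stand-in for the entries under ".../require". */
struct require_entry {
    const char *name;
    const char *value;
};

/* Maximum feature set the backend will announce (illustrative subset). */
struct backend_features {
    unsigned int max_queues;
    unsigned int persistent_grants;
};

/*
 * Return the numeric override for `name` if the toolstack wrote one,
 * otherwise (entry absent or not a number) fall back to `default_val`.
 */
unsigned int read_feature(const struct require_entry *req, size_t n,
                          const char *name, unsigned int default_val)
{
    for (size_t i = 0; i < n; i++) {
        char *end;
        unsigned long v;

        if (strcmp(req[i].name, name) != 0)
            continue;
        v = strtoul(req[i].value, &end, 10);
        if (end == req[i].value || *end != '\0')
            return default_val;
        return (unsigned int)v;
    }
    return default_val;
}

/* Seed the feature set once at probe time; defaults act as module params. */
void probe_features(const struct require_entry *req, size_t n,
                    struct backend_features *ft,
                    unsigned int module_max_queues)
{
    ft->max_queues = read_feature(req, n, "multi-queue-max-queues",
                                  module_max_queues);
    ft->persistent_grants = read_feature(req, n, "feature-persistent", 1);
}
```

Seeding everything once into a struct keeps the later feature announcement
code free of xenstore lookups; whether unparsable values should fall back to
the default or be treated as an error is left open here.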

An example domain config would look like this:

vif = ["bridge=br0,require-multi-queue-max-queues=2"]
disk = [ "phy:/path/to/disk,hda,w,require-feature-persistent=0" ]

And if backend supports it, it would create a vif with a maximum of 2 queues,
and a vbd with persistent grants disabled.
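
For illustration, splitting one such "require-" option into its feature name
and value could look like the sketch below. `parse_require_option()` is a
hypothetical helper written for this example, not the parser used in the
xl/libxlu patches.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Split "require-<name>=<value>" into freshly allocated name and value.
 * Returns 0 on success, -1 if the token is not a well-formed require
 * option or if allocation fails. The caller frees *name and *value.
 */
int parse_require_option(const char *token, char **name, char **value)
{
    static const char prefix[] = "require-";
    const size_t plen = sizeof(prefix) - 1;
    const char *eq;
    size_t nlen;

    if (strncmp(token, prefix, plen) != 0)
        return -1;
    token += plen;

    /* Both the name and the value must be non-empty. */
    eq = strchr(token, '=');
    if (!eq || eq == token || eq[1] == '\0')
        return -1;

    nlen = (size_t)(eq - token);
    *name = malloc(nlen + 1);
    *value = malloc(strlen(eq + 1) + 1);
    if (!*name || !*value) {
        free(*name);
        free(*value);
        *name = *value = NULL;
        return -1;
    }
    memcpy(*name, token, nlen);
    (*name)[nlen] = '\0';
    strcpy(*value, eq + 1);
    return 0;
}
```

Note that the helper does not validate the feature name itself, in line with
the toolstack being stateless about feature names and values.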

I only implemented this for blkback and netback, but there is nothing really
specific in how it's done, and it could possibly be implemented in other PV
interfaces. There wasn't a protocol-agnostic file to put all this in, so I went
ahead and did it for the two individual I/O types (block and netif) I am most
interested in.

Any comments appreciated :)

Thanks!
Joao

For Linux the diffstat/changeset is: (the last two patches)

Joao Martins (2):
  xen-blkback: frontend feature control
  xen-netback: frontend feature control

 drivers/block/xen-blkback/blkback.c |   2 +-
 drivers/block/xen-blkback/common.h  |   1 +
 drivers/block/xen-blkback/xenbus.c  |  66 ---
 drivers/net/xen-netback/xenbus.c    | 122 +---
 4 files changed, 159 insertions(+), 32 deletions(-)

And for Xen the diffstat/changeset is:

Joao Martins (6):
  public/io/blkif: add directory for backend parameters
  public/io/netif: add directory for backend parameters
  libxl: add backend_features to libxl_device_disk
  libxl: add backend_features to libxl_device_nic
  libxlu: parse disk backend features parameters
  xl: parse vif backend features parameters

 tools/libxl/libxl.h   | 16 +++
 tools/libxl/libxl_9pfs.c  |  2 +-
 tools/libxl/libxl_console.c   |  7 ---
 tools/libxl/libxl_device.c| 47 +++
 tools/libxl/libxl_disk.c  | 17 ++--
 tools/libxl/libxl_internal.h  |  6 --
 tools/libxl/libxl_nic.c   | 13 +++-
 tools/libxl/libxl_pci.c   |  2 +-
 tools/libxl/libxl_types.idl   |  2 ++
 tools/libxl/libxl_usb.c   |  2 +-
 tools/libxl/libxl_vdispl.c|  3 ++-
 tools/libxl/libxl_vtpm.c  |  2 +-
 tools/libxl/libxlu_disk_l.l   | 42 ++
 tools/xl/xl_parse.c   | 37 ++
 tools/xl/xl_parse.h   |  2 ++
 xen/include/public/io/blkif.h | 14 +
 xen/include/public/io/netif.h | 16 +++
 17 files changed, 209 insertions(+), 21 deletions(-)

-- 
2.11.0




[Xen-devel] [PATCH RFC 5/8] libxlu: parse disk backend features parameters

2017-11-02 Thread Joao Martins
Any option name preceded by "require-" means a backend feature
to be set. This is stored in key value structure which libxl will parse
and tell blkback to override the specified features.

An example would be a config containing:

...
vcpus = 8
disk = [ "phy:/path/to/disk,hda,w,require-multi-queue-max-queues=1" ]
...

Which would set the number of queues to 1 as opposed to e.g. the global
blkback defined xen_blkback.max_queues parameter.
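
As an aside, appending to a flat, NULL-terminated key/value array of the kind
described above can be sketched as below. This is not the code from the patch:
it grows the array in place with realloc() instead of copying and disposing
the old list, and `kvlist_pairs()`, `kvlist_append()` and `dup_str()` are
hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

/* Count key/value pairs in a NULL-terminated flat list (NULL means empty). */
size_t kvlist_pairs(char **list)
{
    size_t n = 0;

    if (!list)
        return 0;
    while (list[n])
        n++;
    return n / 2;
}

/* Small portable replacement for strdup(). */
static char *dup_str(const char *s)
{
    size_t len = strlen(s) + 1;
    char *p = malloc(len);

    if (p)
        memcpy(p, s, len);
    return p;
}

/*
 * Append one key/value pair, keeping the layout
 * { key0, val0, key1, val1, ..., NULL }.
 * Returns 0 on success, -1 on allocation failure (list left untouched).
 */
int kvlist_append(char ***list, const char *key, const char *val)
{
    size_t pairs = kvlist_pairs(*list);
    char *k = dup_str(key);
    char *v = dup_str(val);
    char **grown;

    if (!k || !v)
        goto fail;
    /* Room for the existing pairs, the new pair and the terminator. */
    grown = realloc(*list, ((pairs + 1) * 2 + 1) * sizeof(char *));
    if (!grown)
        goto fail;
    grown[pairs * 2] = k;
    grown[pairs * 2 + 1] = v;
    grown[pairs * 2 + 2] = NULL;
    *list = grown;
    return 0;
fail:
    free(k);
    free(v);
    return -1;
}
```

Growing in place avoids duplicating every existing string on each append, at
the cost of assuming the list was allocated with malloc()/realloc().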

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
 tools/libxl/libxlu_disk_l.l | 42 ++
 1 file changed, 42 insertions(+)

diff --git a/tools/libxl/libxlu_disk_l.l b/tools/libxl/libxlu_disk_l.l
index 97039a2800..4530c6c4fc 100644
--- a/tools/libxl/libxlu_disk_l.l
+++ b/tools/libxl/libxlu_disk_l.l
@@ -62,6 +62,9 @@ void xlu__disk_yyset_column(int  column_no, yyscan_t yyscanner);
 /* For actions whose patterns contain '=', finds the start of the value */
 #define FROMEQUALS (strchr(yytext,'=')+1)
 
+/* For actions whose patterns contain '-', finds the start of the value */
+#define FROMMINUS (strchr(yytext,'-')+1)
+
 /* Chops the delimiter off, modifying yytext and yyleng. */
 #define STRIP(delim) do{\
if (yyleng>0 && yytext[yyleng-1]==(delim))  \
@@ -114,6 +117,37 @@ static void setbackendtype(DiskParseContext *dpc, const char *str) {
 else xlu__disk_err(dpc,str,"unknown value for backendtype");
 }
 
+static int addbackendfeature(DiskParseContext *dpc, const char *key)
+{
+    libxl_key_value_list *sl = &dpc->disk->backend_features;
+size_t count = libxl_key_value_list_length(sl);
+libxl_key_value_list array = *sl;
+char *eql = strchr(key,'=');
+char *val = eql + 1;
+int i;
+
+array = calloc((count+1) * 2 + 1, sizeof(char*));
+if (!array)
+return -ENOMEM;
+
+for (i = 0; i < count * 2; i++) {
+if ((*sl)[i])
+array[i] = strdup((*sl)[i]);
+}
+array[i] = NULL;
+libxl_key_value_list_dispose(sl);
+
+*eql = 0;
+count *= 2;
+array[count++] = strdup(key);
+array[count++] = strdup(val);
+array[count] = NULL;
+*eql = '=';
+
+*sl = array;
+return 0;
+}
+
 /* Sets ->colo-port from the string.  COLO need this. */
 static void setcoloport(DiskParseContext *dpc, const char *str) {
 int port = atoi(str);
@@ -187,6 +221,14 @@ script=[^,]*,? { STRIP(','); SAVESTRING("script", script, FROMEQUALS); }
 direct-io-safe,? { DPC->disk->direct_io_safe = 1; }
 discard,?  { libxl_defbool_set(&DPC->disk->discard_enable, true); }
 no-discard,?   { libxl_defbool_set(&DPC->disk->discard_enable, false); }
+require-[a-z][-a-z0-9]*=[^,]*,? {
+   STRIP(',');
+   if (addbackendfeature(DPC, FROMMINUS)) {
+   xlu__disk_err(DPC,yytext,"unable to parse feature");
+   return 0;
+   }
+}
+
  /* Note that the COLO configuration settings should be considered unstable.
   * They may change incompatibly in future versions of Xen. */
 colo,? { libxl_defbool_set(&DPC->disk->colo_enable, true); }
-- 
2.11.0




[Xen-devel] [PATCH RFC 6/8] xl: parse vif backend features parameters

2017-11-02 Thread Joao Martins
Any option name preceded by "require-" means a backend feature to be {un,}set.
This is stored in key value structure which libxl will parse and inform netback
to override the specified features.

An example would be a config containing:

...
vcpus = 8
vif = ["bridge=br0,require-multi-queue-max-queues=2"]
...

Which would set the number of queues to 2 as opposed to e.g. the global
netback defined xen_netback.max_queues parameter.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
 tools/xl/xl_parse.c | 37 +
 tools/xl/xl_parse.h |  2 ++
 2 files changed, 39 insertions(+)

diff --git a/tools/xl/xl_parse.c b/tools/xl/xl_parse.c
index 9a692d5ae6..007df694d8 100644
--- a/tools/xl/xl_parse.c
+++ b/tools/xl/xl_parse.c
@@ -401,6 +401,29 @@ void replace_string(char **str, const char *val)
 *str = xstrdup(val);
 }
 
+static void add_to_kvlist(libxl_key_value_list *sl, char *key, char *val)
+{
+size_t count = libxl_key_value_list_length(sl);
+libxl_key_value_list array = *sl;
+int i;
+
+array = xcalloc((count+1) * 2 + 1, sizeof(char*));
+
+for (i = 0; i < count * 2; i++) {
+if ((*sl)[i])
+array[i] = xstrdup((*sl)[i]);
+}
+array[i] = NULL;
+libxl_key_value_list_dispose(sl);
+
+count *= 2;
+array[count++] = xstrdup(key);
+array[count++] = xstrdup(val);
+array[count] = NULL;
+
+*sl = array;
+}
+
 int match_option_size(const char *prefix, size_t len,
   char *arg, char **argopt)
 {
@@ -559,6 +582,20 @@ int parse_nic_config(libxl_device_nic *nic, XLU_Config **config, char *token)
         fprintf(stderr, "the accel parameter for vifs is currently not supported\n");
 } else if (MATCH_OPTION("devid", token, oparg)) {
 nic->devid = parse_ulong(oparg);
+} else if (MATCH_FEATURE("require", token, oparg)) {
+char *key = NULL, *value = NULL;
+int rc;
+
+        rc = split_string_into_pair(oparg, "=", &key, &value);
+if (rc != 0) {
+fprintf(stderr, "failed to parse vif backend feature %s", oparg);
+return 1;
+}
+
+        add_to_kvlist(&nic->backend_features, key, value);
+
+free(key);
+free(value);
 } else {
 fprintf(stderr, "unrecognized argument `%s'\n", token);
 return 1;
diff --git a/tools/xl/xl_parse.h b/tools/xl/xl_parse.h
index cc459fb43f..aea07394cc 100644
--- a/tools/xl/xl_parse.h
+++ b/tools/xl/xl_parse.h
@@ -40,6 +40,8 @@ int match_option_size(const char *prefix, size_t len,
 #define MATCH_OPTION(prefix, arg, oparg) \
 match_option_size((prefix "="), sizeof((prefix)), (arg), &(oparg))
 
+#define MATCH_FEATURE(prefix, arg, oparg) \
+match_option_size((prefix "-"), sizeof((prefix)), (arg), &(oparg))
 
 void split_string_into_string_list(const char *str, const char *delim,
libxl_string_list *psl);
-- 
2.11.0




[Xen-devel] [PATCH RFC 3/8] libxl: add backend_features to libxl_device_disk

2017-11-02 Thread Joao Martins
The function libxl__device_generic_add gains an additional
argument through which it adds a second set of entries visible to the
backend only. These entries will then be used for devices,
thus overriding the backend maximum feature set with these user-defined ones.

libxl_device_disk.backend_features are a key value store storing:
  = 

xl|libxl are stateless with respect to feature names, therefore it is up to
the admin to carefully select those. If the backend doesn't support this,
the features won't be overwritten.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
 tools/libxl/libxl.h  |  8 
 tools/libxl/libxl_console.c  |  5 +++--
 tools/libxl/libxl_device.c   | 37 +
 tools/libxl/libxl_disk.c | 17 +++--
 tools/libxl/libxl_internal.h |  4 +++-
 tools/libxl/libxl_pci.c  |  2 +-
 tools/libxl/libxl_types.idl  |  1 +
 tools/libxl/libxl_usb.c  |  2 +-
 8 files changed, 65 insertions(+), 11 deletions(-)

diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 5e9aed739d..82990089ef 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -1101,6 +1101,14 @@ void libxl_mac_copy(libxl_ctx *ctx, libxl_mac *dst, const libxl_mac *src);
  */
 #define LIBXL_HAVE_SET_PARAMETERS 1
 
+/*
+ * LIBXL_HAVE_DISK_BACKEND_FEATURES
+ *
+ * libxl_device_disk contains backend_features which can be used to control
+ * what features are exposed to guest vbds.
+ */
+#define LIBXL_HAVE_DISK_BACKEND_FEATURES 1
+
 typedef char **libxl_string_list;
 void libxl_string_list_dispose(libxl_string_list *sl);
 int libxl_string_list_length(const libxl_string_list *sl);
diff --git a/tools/libxl/libxl_console.c b/tools/libxl/libxl_console.c
index c05dc28b99..f40def1276 100644
--- a/tools/libxl/libxl_console.c
+++ b/tools/libxl/libxl_console.c
@@ -339,7 +339,7 @@ int libxl__device_console_add(libxl__gc *gc, uint32_t domid,
 libxl__device_generic_add(gc, XBT_NULL, device,
   libxl__xs_kvs_of_flexarray(gc, back),
   libxl__xs_kvs_of_flexarray(gc, front),
-  libxl__xs_kvs_of_flexarray(gc, ro_front));
+  libxl__xs_kvs_of_flexarray(gc, ro_front), NULL);
 rc = 0;
 out:
 return rc;
@@ -385,7 +385,8 @@ int libxl__device_vuart_add(libxl__gc *gc, uint32_t domid,
 rc = libxl__device_generic_add(gc, XBT_NULL, ,
libxl__xs_kvs_of_flexarray(gc, back),
NULL,
-   libxl__xs_kvs_of_flexarray(gc, ro_front));
+   libxl__xs_kvs_of_flexarray(gc, ro_front),
+   NULL);
 return rc;
 }
 
diff --git a/tools/libxl/libxl_device.c b/tools/libxl/libxl_device.c
index 5438577c3c..05178fb480 100644
--- a/tools/libxl/libxl_device.c
+++ b/tools/libxl/libxl_device.c
@@ -43,6 +43,15 @@ char *libxl__device_backend_path(libxl__gc *gc, 
libxl__device *device)
  device->domid, device->devid);
 }
 
+char *libxl__device_require_path(libxl__gc *gc, libxl__device *device)
+{
+char *dom_path = libxl__xs_get_dompath(gc, device->backend_domid);
+
+return GCSPRINTF("%s/backend/%s/%u/%d/require", dom_path,
+ libxl__device_kind_to_string(device->backend_kind),
+ device->domid, device->devid);
+}
+
 char *libxl__device_libxl_path(libxl__gc *gc, libxl__device *device)
 {
 char *libxl_dom_path = libxl__xs_libxl_path(gc, device->domid);
@@ -114,13 +123,16 @@ out:
 }
 
 int libxl__device_generic_add(libxl__gc *gc, xs_transaction_t t,
-libxl__device *device, char **bents, char **fents, char **ro_fents)
+libxl__device *device, char **bents, char **fents, char **ro_fents,
+char **brents)
 {
 libxl_ctx *ctx = libxl__gc_owner(gc);
-char *frontend_path = NULL, *backend_path = NULL, *libxl_path;
+char *frontend_path = NULL, *backend_path = NULL, *require_path = NULL,
+ *libxl_path;
 struct xs_permissions frontend_perms[2];
 struct xs_permissions ro_frontend_perms[2];
 struct xs_permissions backend_perms[2];
+struct xs_permissions require_perms[1];
 int create_transaction = t == XBT_NULL;
 int libxl_only = device->backend_kind == LIBXL__DEVICE_KIND_NONE;
 int rc;
@@ -131,6 +143,7 @@ int libxl__device_generic_add(libxl__gc *gc, 
xs_transaction_t t,
 } else {
 frontend_path = libxl__device_frontend_path(gc, device);
 backend_path = libxl__device_backend_path(gc, device);
+require_path = libxl__device_require_path(gc, device);
 }
 libxl_path = libxl__device_libxl_path(gc, device);
 
@@ -144,6 +157,9 @@ int libxl__device_generic_add(libxl__gc *gc, 
xs_transaction_t t,
 ro_frontend_perms[1].id = backend_perms[1].id = device->domid;
 ro_frontend_perms[1].perms = backend_perms[1].perms = XS_PERM_READ;
 
+

[Xen-devel] [PATCH RFC 2/8] public/io/netif: add directory for backend parameters

2017-11-02 Thread Joao Martins
The proposed directory provides a mechanism for tools to control the
maximum feature set of the device being provisioned by backend.
The parameters/features include offloading features, number of
queues etc.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
 xen/include/public/io/netif.h | 16 
 1 file changed, 16 insertions(+)

diff --git a/xen/include/public/io/netif.h b/xen/include/public/io/netif.h
index 2454448baa..a412e4771d 100644
--- a/xen/include/public/io/netif.h
+++ b/xen/include/public/io/netif.h
@@ -161,6 +161,22 @@
  */
 
 /*
+ * The directory "require" may be created in the backend path by the tools
+ * domain to override the maximum feature set that the backend provides to
+ * the frontend. The child entries within this directory are feature names
+ * and the corresponding values that the backend should use as defaults, e.g.:
+ *
+ * /local/domain/X/backend///require
+ * /local/domain/X/backend///require/multi-queue-max-queues = "2"
+ * /local/domain/X/backend///require/feature-no-csum-offload = "1"
+ *
+ * In the example above, the network backend will negotiate up to a maximum
+ * of two queues with the frontend and disable IPv4 checksum offloading.
+ *
+ * This directory and its child entries shall only be visible to the backend.
+ */
+
+/*
  * Control ring
  * 
  *
-- 
2.11.0


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH RFC 4/8] libxl: add backend_features to libxl_device_nic

2017-11-02 Thread Joao Martins
Adds "backend_features" to the libxl_device_nic structure to
represent a set of features to be set on the device by the admin.
backend_features is a key-value store representing
an array of  = , which is then
translated into xenstore entries (with backend-only permissions) in
the form of:

/local/domain//backend/vif///require
/local/domain/[...]/require/ = 

Entries get stored under the require directory within the backend
path.

Adjust libxl__device_add and libxl__device_add_async to pass the
third argument as the backend-only entries to be written to backend_path.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
 tools/libxl/libxl.h  |  8 
 tools/libxl/libxl_9pfs.c |  2 +-
 tools/libxl/libxl_console.c  |  2 +-
 tools/libxl/libxl_device.c   | 14 --
 tools/libxl/libxl_internal.h |  2 +-
 tools/libxl/libxl_nic.c  | 13 -
 tools/libxl/libxl_types.idl  |  1 +
 tools/libxl/libxl_vdispl.c   |  3 ++-
 tools/libxl/libxl_vtpm.c |  2 +-
 9 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 82990089ef..5b4fbebf7b 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -1109,6 +1109,14 @@ void libxl_mac_copy(libxl_ctx *ctx, libxl_mac *dst, 
const libxl_mac *src);
  */
 #define LIBXL_HAVE_DISK_BACKEND_FEATURES 1
 
+/*
+ * LIBXL_HAVE_VIF_BACKEND_FEATURES
+ *
+ * libxl_device_nic contains backend_features which can be used to control
+ * what features are exposed to guest vifs.
+ */
+#define LIBXL_HAVE_VIF_BACKEND_FEATURES 1
+
 typedef char **libxl_string_list;
 void libxl_string_list_dispose(libxl_string_list *sl);
 int libxl_string_list_length(const libxl_string_list *sl);
diff --git a/tools/libxl/libxl_9pfs.c b/tools/libxl/libxl_9pfs.c
index 9db887b5d8..3b80b358f4 100644
--- a/tools/libxl/libxl_9pfs.c
+++ b/tools/libxl/libxl_9pfs.c
@@ -42,7 +42,7 @@ static LIBXL_DEFINE_UPDATE_DEVID(p9, "9pfs")
 static int libxl__set_xenstore_p9(libxl__gc *gc, uint32_t domid,
   libxl_device_p9 *p9,
   flexarray_t *back, flexarray_t *front,
-  flexarray_t *ro_front)
+  flexarray_t *ro_front, flexarray_t *require)
 {
 flexarray_append_pair(back, "path", p9->path);
 flexarray_append_pair(back, "security_model", p9->security_model);
diff --git a/tools/libxl/libxl_console.c b/tools/libxl/libxl_console.c
index f40def1276..1c5a298750 100644
--- a/tools/libxl/libxl_console.c
+++ b/tools/libxl/libxl_console.c
@@ -730,7 +730,7 @@ static LIBXL_DEFINE_UPDATE_DEVID(vfb, "vfb")
 static int libxl__set_xenstore_vfb(libxl__gc *gc, uint32_t domid,
libxl_device_vfb *vfb,
   flexarray_t *back, flexarray_t *front,
-  flexarray_t *ro_front)
+  flexarray_t *ro_front, flexarray_t *require)
 {
 flexarray_append_pair(back, "vnc",
   libxl_defbool_val(vfb->vnc.enable) ? "1" : "0");
diff --git a/tools/libxl/libxl_device.c b/tools/libxl/libxl_device.c
index 05178fb480..87983e2ef9 100644
--- a/tools/libxl/libxl_device.c
+++ b/tools/libxl/libxl_device.c
@@ -1860,7 +1860,7 @@ void libxl__device_add_async(libxl__egc *egc, uint32_t 
domid,
  libxl__ao_device *aodev)
 {
 STATE_AO_GC(aodev->ao);
-flexarray_t *back;
+flexarray_t *back, *require;
 flexarray_t *front, *ro_front;
 libxl__device *device;
 xs_transaction_t t = XBT_NULL;
@@ -1912,6 +1912,7 @@ void libxl__device_add_async(libxl__egc *egc, uint32_t 
domid,
 back = flexarray_make(gc, 16, 1);
 front = flexarray_make(gc, 16, 1);
 ro_front = flexarray_make(gc, 16, 1);
+require = flexarray_make(gc, 16, 1);
 
 flexarray_append_pair(back, "frontend-id", GCSPRINTF("%d", domid));
 flexarray_append_pair(back, "online", "1");
@@ -1924,7 +1925,7 @@ void libxl__device_add_async(libxl__egc *egc, uint32_t 
domid,
   GCSPRINTF("%d", XenbusStateInitialising));
 
 if (dt->set_xenstore_config)
-dt->set_xenstore_config(gc, domid, type, back, front, ro_front);
+dt->set_xenstore_config(gc, domid, type, back, front, ro_front, 
require);
 
 for (;;) {
 rc = libxl__xs_transaction_start(gc, &t);
@@ -1948,7 +1949,7 @@ void libxl__device_add_async(libxl__egc *egc, uint32_t 
domid,
   libxl__xs_kvs_of_flexarray(gc, back),
   libxl__xs_kvs_of_flexarray(gc, front),
   libxl__xs_kvs_of_flexarray(gc, ro_front),
-  NULL);
+  libxl__xs_kvs_of_flexarray(gc, require));

[Xen-devel] [PATCH RFC 1/8] public/io/blkif: add directory for backend parameters

2017-11-02 Thread Joao Martins
The proposed directory provides a mechanism for tools to control the
maximum feature set of the device being provisioned by backends.
Examples include max ring page order, persistent grants, number of
queues etc.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
 xen/include/public/io/blkif.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
index 15a71e3fea..4c0a93a2bf 100644
--- a/xen/include/public/io/blkif.h
+++ b/xen/include/public/io/blkif.h
@@ -133,6 +133,20 @@
  *  This option doesn't require a backend to use O_DIRECT, so it
  *  should not be used to try to control the caching behaviour.
  *
+ * require
+ *
+ *  The directory "require" may be created by the tools domain to
+ *  override the maximum feature set that the backend provides to the
+ *  frontend. The child entries within this directory are feature
+ *  names and their corresponding values, e.g.:
+ *
+ *  /local/domain/X/backend/vbd///require
+ *  /local/domain/X/backend/vbd///require/multi-queue-max-queues = "2"
+ *  /local/domain/X/backend/vbd///require/feature-persistent = "0"
+ *
+ *  In the example above, the block backend will negotiate up to a maximum
+ *  of two queues with the frontend and disable persistent grants.
+ *
  *- Features -
  *
  * feature-barrier
-- 
2.11.0




Re: [Xen-devel] [PATCH v6 1/4] x86/pvclock: add setter for pvclock_pvti_cpu0_va

2017-10-19 Thread Joao Martins
On 10/17/2017 04:34 PM, Joao Martins wrote:
> On 10/03/2017 12:55 PM, Joao Martins wrote:
>> Right now there is only a pvclock_pvti_cpu0_va() which is defined
>> on kvmclock since:
>>
>> commit dac16fba6fc5
>> ("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")
>>
>> The only user of this interface so far is kvm. This commit adds a
>> setter function for the pvti page and moves pvclock_pvti_cpu0_va
>> to pvclock, which is a more generic place to have it; and would
>> allow other PV clocksources to use it, such as Xen.
>>
>> Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
>> Acked-by: Andy Lutomirski <l...@kernel.org>
> 
> Ping?
> 
> While the rest of series has been acked, I think that this patch (per
> maintainers file) still misses x86 and (or?) kvm ack/review.

I found an issue with ptp_kvm modinit (when the module is loaded) under Xen
related to this series, so I have resent it with that fixed. Hopefully things can be
taken from there - Sorry for the noise.

Thanks,
Joao



[Xen-devel] [PATCH v7 4/5] x86/xen/time: setup vcpu 0 time info page

2017-10-19 Thread Joao Martins
In order to support pvclock vdso on xen we need to set up the time
info page for vcpu 0 and register the page with Xen using the
VCPUOP_register_vcpu_time_memory_area hypercall. This hypercall
will also forcefully update the pvti, which will set some of the
necessary flags for vdso. Afterwards we check if it supports the
PVCLOCK_TSC_STABLE_BIT flag, which is mandatory for having
vdso/vsyscall support. If so, we set the cpu 0 pvti that
will later be used when mapping the vdso image.

The xen headers are also updated to include the new hypercall for
registering the secondary vcpu_time_info struct.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Reviewed-by: Juergen Gross <jgr...@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
Changes since v6:
 * Add Boris RoB

Changes since v5:
 * Move xen_setup_vsyscall_time_info within the PVCLOCK_TSC_STABLE_BIT
 clause added in predecessor patch.

Changes since v4:
 * Remove pvclock_set_flags since predecessor patch will set in
 xen_time_init. Consequently pvti local variable is not so useful
 and doesn't make things more clear - therefore remove it.
 * Adjust comment on xen_setup_vsyscall_time_info()
 * Add Juergen's Reviewed-by (Retained as there wasn't functional
 changes)

Changes since v3:
 (Comments from Juergen)
 * Remove _t added suffix from *GUEST_HANDLE* when sync vcpu.h
 with the latest

Changes since v2:
 (Comments from Juergen)
 * Omit the blank after the cast on all 3 occurrences.
 * Change last VCLOCK_PVCLOCK message to be more descriptive
 * Sync the complete vcpu.h header instead of just adding the
 needed one. (IOW adding VCPUOP_get_physid)

Changes since v1:
 * Check flags ahead to see if the primary clock can use
 PVCLOCK_TSC_STABLE_BIT even if secondary registration fails.
 (Comments from Boris)
 * Remove addr, addr variables;
 * Change first pr_debug to pr_warn;
 * Change last pr_debug to pr_notice;
 * Add routine to solely register secondary time info.
 * Move xen_clock to outside xen_setup_vsyscall_time_info to allow
 restore path to simply re-register secondary time info. Let us
 handle the restore path more gracefully without re-allocating a
 page.
 * Removed cpu argument from xen_setup_vsyscall_time_info()
 * Adjust failed registration error messages/loglevel to be the same
 * Also teardown secondary time info on suspend

Changes since RFC:
 (Comments from Boris and David)
 * Remove Kconfig option
 * Use get_zeroed_page/free/page
 * Remove the hypercall availability check
 * Unregister pvti with arg.addr.v = NULL if stable bit isn't supported.
 (New)
 * Set secondary copy on restore such that it works on migration.
 * Drop global xen_clock variable and stash it locally on
 xen_setup_vsyscall_time_info.
 * WARN_ON(ret) if we fail to unregister the pvti.
---
 arch/x86/xen/suspend.c   |  4 ++
 arch/x86/xen/time.c  | 90 +++-
 arch/x86/xen/xen-ops.h   |  2 +
 include/xen/interface/vcpu.h | 42 +
 4 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index d6b1680693a9..800ed36ecfba 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -16,6 +16,8 @@
 
 void xen_arch_pre_suspend(void)
 {
+   xen_save_time_memory_area();
+
if (xen_pv_domain())
xen_pv_pre_suspend();
 }
@@ -26,6 +28,8 @@ void xen_arch_post_suspend(int cancelled)
xen_pv_post_suspend(cancelled);
else
xen_hvm_post_suspend(cancelled);
+
+   xen_restore_time_memory_area();
 }
 
 static void xen_vcpu_notify_restore(void *data)
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index fc0148d3a70d..dec966fbe888 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -370,6 +370,92 @@ static const struct pv_time_ops xen_time_ops __initconst = 
{
.steal_clock = xen_steal_clock,
 };
 
+static struct pvclock_vsyscall_time_info *xen_clock __read_mostly;
+
+void xen_save_time_memory_area(void)
+{
+   struct vcpu_register_time_memory_area t;
+   int ret;
+
+   if (!xen_clock)
+   return;
+
+   t.addr.v = NULL;
+
+   ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
+   if (ret != 0)
+   pr_notice("Cannot save secondary vcpu_time_info (err %d)",
+ ret);
+   else
+   clear_page(xen_clock);
+}
+
+void xen_restore_time_memory_area(void)
+{
+   struct vcpu_register_time_memory_area t;
+   int ret;
+
+   if (!xen_clock)
+   return;
+
+   t.addr.v = &xen_clock->pvti;
+
+   ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
+
+   /*
+* We don't disable VCLOCK_PVCLOCK entirely if it fails to register the
+* secondary time info with Xen or if we migrated to a host without the
+* necessary flags. On both of the

[Xen-devel] [PATCH v7 2/5] x86/pvclock: add setter for pvclock_pvti_cpu0_va

2017-10-19 Thread Joao Martins
Right now there is only a pvclock_pvti_cpu0_va() which is defined
on kvmclock since:

commit dac16fba6fc5
("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")

The only user of this interface so far is kvm. This commit adds a
setter function for the pvti page and moves pvclock_pvti_cpu0_va
to pvclock, which is a more generic place to have it; and would
allow other PV clocksources to use it, such as Xen.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Acked-by: Andy Lutomirski <l...@kernel.org>
---
Changes since v1:
 * Rebased: the only conflict was that I had to move the exported
 pvclock_pvti_cpu0_va() symbol as it is used by the kvm PTP driver.
 * Do not initialize pvti_cpu0_va to NULL (checkpatch error)
 ( Comments from Andy Lutomirski )
 * Removed asm/pvclock.h 'pvclock_set_pvti_cpu0_va' definition
 for non !PARAVIRT_CLOCK to better track screwed Kconfig stuff.
 * Add his Acked-by (provided the previous adjustment was made)

Changes since RFC:
 (Comments from Andy Lutomirski)
 * Add __init to pvclock_set_pvti_cpu0_va
 * Add WARN_ON(vclock_was_used(VCLOCK_PVCLOCK)) to
 pvclock_set_pvti_cpu0_va
---
 arch/x86/include/asm/pvclock.h | 19 ++-
 arch/x86/kernel/kvmclock.c |  7 +--
 arch/x86/kernel/pvclock.c  | 14 ++
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 448cfe1b48cf..6f228f90cdd7 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -4,15 +4,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_KVM_GUEST
-extern struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
-#else
-static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
-{
-   return NULL;
-}
-#endif
-
 /* some helper functions for xen and kvm pv clock sources */
 u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
 u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
@@ -101,4 +92,14 @@ struct pvclock_vsyscall_time_info {
 
 #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti);
+struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
+#else
+static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+   return NULL;
+}
+#endif
+
 #endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index d88967659098..538738047ff5 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -47,12 +47,6 @@ early_param("no-kvmclock", parse_no_kvmclock);
 static struct pvclock_vsyscall_time_info *hv_clock;
 static struct pvclock_wall_clock wall_clock;
 
-struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
-{
-   return hv_clock;
-}
-EXPORT_SYMBOL_GPL(pvclock_pvti_cpu0_va);
-
 /*
  * The wallclock is the time of day when we booted. Since then, some time may
  * have elapsed since the hypervisor wrote the data. So we try to account for
@@ -334,6 +328,7 @@ int __init kvm_setup_vsyscall_timeinfo(void)
return 1;
}
 
+   pvclock_set_pvti_cpu0_va(hv_clock);
put_cpu();
 
kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 5c3f6d6a5078..cb7d6d9c9c2d 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -25,8 +25,10 @@
 
 #include 
 #include 
+#include 
 
 static u8 valid_flags __read_mostly = 0;
+static struct pvclock_vsyscall_time_info *pvti_cpu0_va __read_mostly;
 
 void pvclock_set_flags(u8 flags)
 {
@@ -144,3 +146,15 @@ void pvclock_read_wallclock(struct pvclock_wall_clock 
*wall_clock,
 
set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
+
+void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
+{
+   WARN_ON(vclock_was_used(VCLOCK_PVCLOCK));
+   pvti_cpu0_va = pvti;
+}
+
+struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+   return pvti_cpu0_va;
+}
+EXPORT_SYMBOL_GPL(pvclock_pvti_cpu0_va);
-- 
2.11.0




[Xen-devel] [PATCH v7 0/5] x86/xen: pvclock vdso support

2017-10-19 Thread Joao Martins
Hey,

[ I found an issue with ptp_kvm modinit with my series, so resending with that
  fixed. ]

This is take 7 of vdso support for Xen. PVCLOCK_TSC_STABLE_BIT, which is
required for vdso time-related calls, can be set starting with Xen 4.8. In
order to have it on, the hypervisor clocksource needs to be TSC, e.g. with the
following boot params: "clocksource=tsc tsc=stable:socket".

The series is structured as follows:

Patch 1 probes for a kvm guest in ptp_kvm, needed once
pvclock_pvti_cpu0_va() is moved to common pvclock (in the next patch)
Patch 2 streamlines pvti page get/set in pvclock for both of its users
Patches 3,4 register the pvti page on Xen and set it in pvclock accordingly
Patch 5 adds a file to KVM/Xen maintainers for tracking pvclock ABI changes.

[ Only patches 1 and 2 require ack/review - the rest is acked/reviewed ]

Changelog is in individual patches.

Thanks,
Joao

Joao Martins (5):
  ptp_kvm: probe for kvm guest availability
  x86/pvclock: add setter for pvclock_pvti_cpu0_va
  x86/xen/time: set pvclock flags on xen_time_init()
  x86/xen/time: setup vcpu 0 time info page
  MAINTAINERS: xen, kvm: track pvclock-abi.h changes

 MAINTAINERS|  2 +
 arch/x86/include/asm/pvclock.h | 19 +
 arch/x86/kernel/kvmclock.c |  7 +--
 arch/x86/kernel/pvclock.c  | 14 ++
 arch/x86/xen/suspend.c |  4 ++
 arch/x86/xen/time.c| 97 ++
 arch/x86/xen/xen-ops.h |  2 +
 drivers/ptp/ptp_kvm.c  |  3 ++
 include/xen/interface/vcpu.h   | 42 ++
 9 files changed, 175 insertions(+), 15 deletions(-)

-- 
2.11.0




[Xen-devel] [PATCH v7 3/5] x86/xen/time: set pvclock flags on xen_time_init()

2017-10-19 Thread Joao Martins
Specifically check for PVCLOCK_TSC_STABLE_BIT and, if this bit is set,
set it in the pvclock flags too. This allows the Xen clocksource to use it,
thus speeding up xen_clocksource_read() callers (i.e. sched_clock()).

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
Changes since v5:
 * Add Boris RoB

New in v5
---
 arch/x86/xen/time.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 1ecb05db3632..fc0148d3a70d 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -372,6 +372,7 @@ static const struct pv_time_ops xen_time_ops __initconst = {
 
 static void __init xen_time_init(void)
 {
+   struct pvclock_vcpu_time_info *pvti;
int cpu = smp_processor_id();
struct timespec tp;
 
@@ -395,6 +396,14 @@ static void __init xen_time_init(void)
 
setup_force_cpu_cap(X86_FEATURE_TSC);
 
+   /*
+* We check ahead on the primary time info if this
+* bit is supported hence speeding up Xen clocksource.
+*/
+   pvti = &__this_cpu_read(xen_vcpu)->time;
+   if (pvti->flags & PVCLOCK_TSC_STABLE_BIT)
+   pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
+
xen_setup_runstate_info(cpu);
xen_setup_timer(cpu);
xen_setup_cpu_clockevents();
-- 
2.11.0




[Xen-devel] [PATCH v7 5/5] MAINTAINERS: xen, kvm: track pvclock-abi.h changes

2017-10-19 Thread Joao Martins
This file defines an ABI shared between guest and hypervisor(s)
(KVM, Xen) and as such there should be a corresponding entry in
the MAINTAINERS file. Notice that there's already a text notice at the
top of the header file, hence this commit simply enforces it more
explicitly and has both peers notified when such changes happen.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Acked-by: Juergen Gross <jgr...@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
Acked-by: Paolo Bonzini <pbonz...@redhat.com>
---
Changes since v4:
 * Add Paolo's Acked-by
 * Add Konrad's Reviewed-by

Changes since v1:
 * Add Juergen Gross's Acked-by.
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index a74227ad082e..09de17b955ea 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7604,6 +7604,7 @@ S:Supported
 F: arch/x86/kvm/
 F: arch/x86/include/uapi/asm/kvm*
 F: arch/x86/include/asm/kvm*
+F: arch/x86/include/asm/pvclock-abi.h
 F: arch/x86/kernel/kvm.c
 F: arch/x86/kernel/kvmclock.c
 
@@ -14731,6 +14732,7 @@ F:  arch/x86/xen/
 F: drivers/*/xen-*front.c
 F: drivers/xen/
 F: arch/x86/include/asm/xen/
+F: arch/x86/include/asm/pvclock-abi.h
 F: include/xen/
 F: include/uapi/xen/
 F: Documentation/ABI/stable/sysfs-hypervisor-xen
-- 
2.11.0




[Xen-devel] [PATCH v7 1/5] ptp_kvm: probe for kvm guest availability

2017-10-19 Thread Joao Martins
Once the pvclock_pvti_cpu0_va() definition moves to common pvclock
code, this function can return a non-NULL value on non-KVM guests.
If a user then tried to load the module (or had it built in), it would
fail with a GPF in ptp_kvm_init() when running on a Xen guest. Therefore,
ptp_kvm_init() should check whether it is running in a KVM guest.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
New in v7;
---
 drivers/ptp/ptp_kvm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/ptp/ptp_kvm.c b/drivers/ptp/ptp_kvm.c
index 2b1b212c219e..e04d7b2ecb3a 100644
--- a/drivers/ptp/ptp_kvm.c
+++ b/drivers/ptp/ptp_kvm.c
@@ -178,6 +178,9 @@ static int __init ptp_kvm_init(void)
 {
long ret;
 
+   if (!kvm_para_available())
+   return -ENODEV;
+
clock_pair_gpa = slow_virt_to_phys(&clock_pair);
hv_clock = pvclock_pvti_cpu0_va();
 
-- 
2.11.0




Re: [Xen-devel] [PATCH v6 1/4] x86/pvclock: add setter for pvclock_pvti_cpu0_va

2017-10-17 Thread Joao Martins
On 10/03/2017 12:55 PM, Joao Martins wrote:
> Right now there is only a pvclock_pvti_cpu0_va() which is defined
> on kvmclock since:
> 
> commit dac16fba6fc5
> ("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")
> 
> The only user of this interface so far is kvm. This commit adds a
> setter function for the pvti page and moves pvclock_pvti_cpu0_va
> to pvclock, which is a more generic place to have it; and would
> allow other PV clocksources to use it, such as Xen.
> 
> Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
> Acked-by: Andy Lutomirski <l...@kernel.org>

Ping?

While the rest of series has been acked, I think that this patch (per
maintainers file) still misses x86 and (or?) kvm ack/review.

Joao

> ---
> Changes since v1:
>  * Rebased: the only conflict was that I had move the export
>  pvclock_pvti_cpu0_va() symbol as it is used by kvm PTP driver.
>  * Do not initialize pvti_cpu0_va to NULL (checkpatch error)
>  ( Comments from Andy Lutomirski )
>  * Removed asm/pvclock.h 'pvclock_set_pvti_cpu0_va' definition
>  for non !PARAVIRT_CLOCK to better track screwed Kconfig stuff.
>  * Add his Acked-by (provided the previous adjustment was made)
> 
> Changes since RFC:
>  (Comments from Andy Lutomirski)
>  * Add __init to pvclock_set_pvti_cpu0_va
>  * Add WARN_ON(vclock_was_used(VCLOCK_PVCLOCK)) to
>  pvclock_set_pvti_cpu0_va
> ---
>  arch/x86/include/asm/pvclock.h | 19 ++-
>  arch/x86/kernel/kvmclock.c |  7 +--
>  arch/x86/kernel/pvclock.c  | 14 ++
>  3 files changed, 25 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
> index 448cfe1b48cf..6f228f90cdd7 100644
> --- a/arch/x86/include/asm/pvclock.h
> +++ b/arch/x86/include/asm/pvclock.h
> @@ -4,15 +4,6 @@
>  #include 
>  #include 
>  
> -#ifdef CONFIG_KVM_GUEST
> -extern struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
> -#else
> -static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
> -{
> - return NULL;
> -}
> -#endif
> -
>  /* some helper functions for xen and kvm pv clock sources */
>  u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
>  u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
> @@ -101,4 +92,14 @@ struct pvclock_vsyscall_time_info {
>  
>  #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
>  
> +#ifdef CONFIG_PARAVIRT_CLOCK
> +void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti);
> +struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
> +#else
> +static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
> +{
> + return NULL;
> +}
> +#endif
> +
>  #endif /* _ASM_X86_PVCLOCK_H */
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index d88967659098..538738047ff5 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -47,12 +47,6 @@ early_param("no-kvmclock", parse_no_kvmclock);
>  static struct pvclock_vsyscall_time_info *hv_clock;
>  static struct pvclock_wall_clock wall_clock;
>  
> -struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
> -{
> - return hv_clock;
> -}
> -EXPORT_SYMBOL_GPL(pvclock_pvti_cpu0_va);
> -
>  /*
>   * The wallclock is the time of day when we booted. Since then, some time may
>   * have elapsed since the hypervisor wrote the data. So we try to account for
> @@ -334,6 +328,7 @@ int __init kvm_setup_vsyscall_timeinfo(void)
>   return 1;
>   }
>  
> + pvclock_set_pvti_cpu0_va(hv_clock);
>   put_cpu();
>  
>   kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
> diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
> index 5c3f6d6a5078..cb7d6d9c9c2d 100644
> --- a/arch/x86/kernel/pvclock.c
> +++ b/arch/x86/kernel/pvclock.c
> @@ -25,8 +25,10 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  static u8 valid_flags __read_mostly = 0;
> +static struct pvclock_vsyscall_time_info *pvti_cpu0_va __read_mostly;
>  
>  void pvclock_set_flags(u8 flags)
>  {
> @@ -144,3 +146,15 @@ void pvclock_read_wallclock(struct pvclock_wall_clock 
> *wall_clock,
>  
>   set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
>  }
> +
> +void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
> +{
> + WARN_ON(vclock_was_used(VCLOCK_PVCLOCK));
> + pvti_cpu0_va = pvti;
> +}
> +
> +struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
> +{
> + return pvti_cpu0_va;
> +}
> +EXPORT_SYMBOL_GPL(pvclock_pvti_cpu0_va);
> 



[Xen-devel] [PATCH v5 1/2] public/io/netif.h: add gref mapping control messages

2017-10-03 Thread Joao Martins
Adds 3 messages to allow a guest to let the backend keep grants mapped,
such that 1) guests allowing fast recycling of pages can avoid doing
grant ops for those cases, or otherwise 2) prefer copies over
grants and 3) always use a fixed set of pages for network I/O.

The three control ring messages added are:
 - Add grefs to be mapped by backend
 - Remove grefs mappings (If they are not in use)
 - Get maximum amount of grefs kept mapped.
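
For illustration, the per-entry layout this patch introduces (a grant reference, 16-bit flags and a 16-bit status, 8 octets in total) could be filled on the frontend side roughly as follows. This is a sketch under assumptions: gref_entry, GREF_FLAG_READONLY and fill_gref_table() are names made up for the example, and grant_ref_t is assumed to be a 32-bit unsigned integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t grant_ref_t; /* assumption: 32-bit grant reference */

#define GREF_FLAG_READONLY (1U << 0) /* mirrors XEN_NETIF_CTRLF_GREF_readonly */

/* Same field order as struct xen_netif_gref in the patch. */
struct gref_entry {
    grant_ref_t ref;  /* IN: grant reference */
    uint16_t flags;   /* IN: flags describing the operation */
    uint16_t status;  /* OUT: written by the backend */
};

_Static_assert(sizeof(struct gref_entry) == 8, "entry must be 8 octets");

/*
 * Hypothetical helper: fill n table entries from an array of grant references.
 * The patch says status need not be zeroed; it is cleared here only to keep
 * the example deterministic.
 */
void fill_gref_table(struct gref_entry *table, const grant_ref_t *refs,
                     size_t n, int readonly)
{
    size_t i;

    assert(table != NULL && refs != NULL);

    for (i = 0; i < n; i++) {
        table[i].ref = refs[i];
        table[i].flags = readonly ? GREF_FLAG_READONLY : 0;
        table[i].status = 0;
    }
}
```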

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Reviewed-by: Paul Durrant <paul.durr...@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
---
v5:
* Added RoB from Paul and Konrad

v4:
* Declare xen_netif_gref parameters are input or output.
* Clarify status field and that it doesn't require to be set to zero
prior to its usage.
* Clarify on ADD_GREF_MAPPING is 'all or nothing'
* Improve last paragraph of DEL_GREF_MAPPING

v3:
* Use DEL for unmapping grefs instead of PUT
* Rename from xen_netif_gref_alloc to xen_netif_gref
* Add 'status' field on xen_netif_gref
* Clarify what 'inflight' means
* Use "beginning of the page" instead of "beginning of the grant"
* Mention that page needs to be r/w (as it will have to modify .status)
---
 xen/include/public/io/netif.h | 123 ++
 1 file changed, 123 insertions(+)

diff --git a/xen/include/public/io/netif.h b/xen/include/public/io/netif.h
index ca0061410d..2454448baa 100644
--- a/xen/include/public/io/netif.h
+++ b/xen/include/public/io/netif.h
@@ -353,6 +353,9 @@ struct xen_netif_ctrl_request {
 #define XEN_NETIF_CTRL_TYPE_SET_HASH_MAPPING_SIZE 5
 #define XEN_NETIF_CTRL_TYPE_SET_HASH_MAPPING  6
 #define XEN_NETIF_CTRL_TYPE_SET_HASH_ALGORITHM 7
+#define XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE 8
+#define XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING  9
+#define XEN_NETIF_CTRL_TYPE_DEL_GREF_MAPPING 10
 
 uint32_t data[3];
 };
@@ -391,6 +394,44 @@ struct xen_netif_ctrl_response {
 };
 
 /*
+ * Static Grants (struct xen_netif_gref)
+ * =====================================
+ *
+ * A frontend may provide a fixed set of grant references to be mapped on
+ * the backend. Sending a message of type XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING
+ * before the grant references are used in the command ring creates these
+ * mappings. The backend will maintain a fixed number of these mappings.
+ *
+ * XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE lets a frontend query how many
+ * of these mappings can be kept.
+ *
+ * Each entry in the XEN_NETIF_CTRL_TYPE_{ADD,DEL}_GREF_MAPPING input table has
+ * the following format:
+ *
+ *    0     1     2     3     4     5     6     7  octet
+ * +-----+-----+-----+-----+-----+-----+-----+-----+
+ * | grant ref             |  flags    |  status   |
+ * +-----+-----+-----+-----+-----+-----+-----+-----+
+ *
+ * grant ref: grant reference (IN)
+ * flags: flags describing the control operation (IN)
+ * status: XEN_NETIF_CTRL_STATUS_* (OUT)
+ *
+ * 'status' is an output parameter which does not need to be set to zero
+ * before it is used in the corresponding control messages.
+ */
+
+struct xen_netif_gref {
+   grant_ref_t ref;
+   uint16_t flags;
+
+#define _XEN_NETIF_CTRLF_GREF_readonly    0
+#define XEN_NETIF_CTRLF_GREF_readonly    (1U<<_XEN_NETIF_CTRLF_GREF_readonly)
+
+   uint16_t status;
+};
+
+/*
  * Control messages
  * ================
  *
@@ -609,6 +650,88 @@ struct xen_netif_ctrl_response {
  *   invalidate any table data outside that range.
  *   The grant reference may be read-only and must remain valid until
  *   the response has been processed.
+ *
+ * XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE
+ * -----------------------------------------
+ *
+ * This is sent by the frontend to fetch the number of grefs that can be kept
+ * mapped in the backend.
+ *
+ * Request:
+ *
+ *  type    = XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE
+ *  data[0] = queue index (assumed 0 for single queue)
+ *  data[1] = 0
+ *  data[2] = 0
+ *
+ * Response:
+ *
+ *  status = XEN_NETIF_CTRL_STATUS_NOT_SUPPORTED     - Operation not
+ *                                                     supported
+ *           XEN_NETIF_CTRL_STATUS_INVALID_PARAMETER - The queue index is
+ *                                                     out of range
+ *           XEN_NETIF_CTRL_STATUS_SUCCESS           - Operation successful
+ *  data   = maximum number of entries allowed in the gref mapping table
+ *           (if operation was successful) or zero if it is not supported.
+ *
+ * XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING
+ * ------------------------------------
+ *
+ * This is sent by the frontend to ask the backend to map a list of grant
+ * references.
+ *
+ * Request:
+ *
+ *  type    = XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING
+ *  data[0] = queue index
+ *  data[1] = grant reference of page containing the mapping list
+ *            (r/w and assumed to start at beginning of page)
+ *  data[2] = size of list 
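
[Editorial sketch, not part of the patch.] To make the ADD_GREF_MAPPING request
layout above concrete, here is a minimal sketch of how a frontend might fill
the mapping list and the control request. It is not taken from the patch or
from the reference implementation: `ctrl_request` is a cut-down stand-in for
struct xen_netif_ctrl_request, `build_add_gref_request()` is a hypothetical
helper, and treating data[2] as a count of list entries is an assumption, since
the description of that field is cut off in this archive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins so the sketch builds outside the Xen tree. */
typedef uint32_t grant_ref_t;

#define XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING 9

/* Same field order as the struct added by the patch: 8 bytes per entry. */
struct xen_netif_gref {
    grant_ref_t ref;   /* IN */
    uint16_t flags;    /* IN */
    uint16_t status;   /* OUT, written by the backend */
};

/* Cut-down stand-in for struct xen_netif_ctrl_request (assumption). */
struct ctrl_request {
    uint16_t id;
    uint16_t type;
    uint32_t data[3];
};

/*
 * Hypothetical helper: fill the mapping list (which lives in the page
 * granted via list_gref) and the control request that points at it.
 */
static void build_add_gref_request(struct ctrl_request *req, uint16_t id,
                                   uint32_t queue, grant_ref_t list_gref,
                                   struct xen_netif_gref *list,
                                   const grant_ref_t *refs, uint32_t count)
{
    uint32_t i;

    assert(req != NULL && list != NULL && refs != NULL);

    for (i = 0; i < count; i++) {
        list[i].ref = refs[i];
        list[i].flags = 0;   /* read-write mapping */
        list[i].status = 0;  /* not required, the backend overwrites it */
    }

    req->id = id;
    req->type = XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING;
    req->data[0] = queue;      /* queue index */
    req->data[1] = list_gref;  /* gref of the page holding the list */
    req->data[2] = count;      /* assumed: number of list entries */
}
```

Because 'status' is an output field, a frontend written along these lines would
read list[i].status back only after the control response has arrived.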

[Xen-devel] [PATCH v5 0/2] netif: staging grants for I/O requests

2017-10-03 Thread Joao Martins
Hey,

This is v5 of the netif series. The new thing (besides the tags being added) is
the specification (previously written in the cover letter) being added to docs
as requested by Konrad. All patches now seem to carry a Reviewed-by.

Reference implementation also here (on top of net-next):

https://github.com/jpemartins/linux.git xen-net-stg-gnts-v3

Thanks!

Joao Martins (2):
  public/io/netif.h: add gref mapping control messages
  docs/misc: add netif staging grants design document

 docs/misc/netif-staging-grants.pandoc | 587 ++
 xen/include/public/io/netif.h | 123 +++
 2 files changed, 710 insertions(+)
 create mode 100644 docs/misc/netif-staging-grants.pandoc

-- 
2.11.0




[Xen-devel] [PATCH v5 2/2] docs/misc: add netif staging grants design document

2017-10-03 Thread Joao Martins
Add a document outlining how the guest can map a set of grants
on the backend through the control ring.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
---
New in v5
---
 docs/misc/netif-staging-grants.pandoc | 587 ++
 1 file changed, 587 insertions(+)
 create mode 100644 docs/misc/netif-staging-grants.pandoc

diff --git a/docs/misc/netif-staging-grants.pandoc b/docs/misc/netif-staging-grants.pandoc
new file mode 100644
index 0000000000..b26a6e0915
--- /dev/null
+++ b/docs/misc/netif-staging-grants.pandoc
@@ -0,0 +1,587 @@
+% Staging grants for network I/O requests
+% Revision 4
+
+\clearpage
+
+
+Architecture(s): Any
+
+
+# Background and Motivation
+
+At the Xen hackathon '16 networking session, we spoke about having a permanently
+mapped region to describe the header/linear region of packet buffers. This document
+outlines the proposal, covering its motivation and applicability to other
+use-cases, alongside the necessary changes.
+
+The motivation of this work is to eliminate grant ops for packet I/O intensive
+workloads such as those observed with smaller request sizes (i.e. <= 256 bytes
+or <= MTU). Currently on Xen, only bulk transfers (e.g. 32K..64K packets)
+perform really well (up to 80 Gbit/s on a few CPUs), usually
+backing end-hosts and server appliances. Anything that involves higher packet
+rates (<= 1500 MTU) or lacks sg performs badly, with throughput close to
+1 Gbit/s.
+
+# Proposal
+
+The proposal is to leverage the already implicit copy from and to packet linear
+data on netfront and netback, to be done instead from a permanently mapped
+region. In some (physical) NICs this is known as header/data split.
+
+Specifically for some workloads (e.g. NFV), it would provide a big increase in
+throughput when we switch to (zero)copying in the backend/frontend instead of
+the grant hypercalls. Thus this extension aims at futureproofing the netif
+protocol by adding the possibility of guests setting up a list of grants that
+are set up at device creation and revoked at device freeing - without taking
+too many grant entries into account for the general case (i.e. to cover only
+the header region <= 256 bytes, 16 grants per ring) while configurable by kernel
+when one wants to resort to a copy-based approach as opposed to grant copy/map.
+
+\clearpage
+
+# General Operation
+
+Here we describe how netback and netfront generally operate, and where the proposed
+solution will fit. The security mechanism currently involves grant references
+which in essence are round-robin recycled 'tickets' stamped with the GPFNs,
+permission attributes, and the authorized domain:
+
+(This is an in-memory view of struct grant_entry_v1):
+
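+     0     1     2     3     4     5     6     7 octet
+    +------------+-----------+------------------------+
+    | flags      | domain id | frame                  |
+    +------------+-----------+------------------------+
+
+Where there are N grant entries in a grant table, for example:
+
+    @0:
+    +------------+-----------+------------------------+
+    | rw         | 0         | 0xABCDEF               |
+    +------------+-----------+------------------------+
+    | rw         | 0         | 0xFA124                |
+    +------------+-----------+------------------------+
+    | ro         | 1         | 0xBEEF                 |
+    +------------+-----------+------------------------+
+
+      .
+    @N:
+    +------------+-----------+------------------------+
+    | rw         | 0         | 0x9923A                |
+    +------------+-----------+------------------------+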
+
+Each entry consumes 8 bytes, therefore 512 entries can fit on one page.
+`gnttab_max_frames` defaults to 32 pages, hence 16,384
+grants. The ParaVirtualized (PV) drivers will use the grant reference (index
+in the grant table - 0 .. N) in their command ring.
+
+\clearpage
+
+## Guest Transmit
+
+The view of the shared transmit ring is the following:
+
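+     0     1     2     3     4     5     6     7 octet
+    +------------------------+------------------------+
+    | req_prod               | req_event              |
+    +------------------------+------------------------+
+    | rsp_prod               | rsp_event              |
+    +------------------------+------------------------+
+    | pvt                    | pad[44]                |
+    +------------------------+                        |
+    |                                                 | [64bytes]
+    +------------------------+------------------------+-\
+    | gref                   | offset    | flags      | |
+    +------------+-----------+-----------+------------+ +-'struct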
+| id | size  | id| status 
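
[Editorial sketch, not part of the patch.] The grant table arithmetic in the
"General Operation" section (8-byte entries, 512 per page, 16,384 grants for 32
frames) can be checked with a few lines of C. This is a sketch under stated
assumptions: 4 KiB pages, a struct that mirrors the grant_entry_v1 layout shown
in the diagram, and helper names that are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* Mirrors the in-memory view above: flags, domain id, frame. */
struct grant_entry_v1_sketch {
    uint16_t flags;
    uint16_t domid;
    uint32_t frame;
};

/* Each entry must take exactly 8 bytes for the arithmetic to hold. */
static_assert(sizeof(struct grant_entry_v1_sketch) == 8,
              "grant entry is expected to be 8 bytes");

/* Number of grant entries that fit in one grant table page. */
static unsigned int grants_per_page(void)
{
    return SKETCH_PAGE_SIZE / sizeof(struct grant_entry_v1_sketch);
}

/* Total grants available for a given number of grant table frames. */
static unsigned int max_grants(unsigned int frames)
{
    return frames * grants_per_page();
}
```

With these helpers, max_grants(32) gives the 16,384 figure quoted in the text,
and the "16 grants per ring" budget from the proposal is a small fraction of it.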

[Xen-devel] [PATCH v6 2/4] x86/xen/time: set pvclock flags on xen_time_init()

2017-10-03 Thread Joao Martins
Specifically check for PVCLOCK_TSC_STABLE_BIT and if this bit is set,
then set it too on pvclock flags. This allows the Xen clocksource to use it,
thus speeding up xen_clocksource_read() callers (i.e. sched_clock()).

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
---
Changes since v5:
 * Add Boris' Reviewed-by

New in v5
---
 arch/x86/xen/time.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 1ecb05db3632..fc0148d3a70d 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -372,6 +372,7 @@ static const struct pv_time_ops xen_time_ops __initconst = {
 
 static void __init xen_time_init(void)
 {
+	struct pvclock_vcpu_time_info *pvti;
 	int cpu = smp_processor_id();
 	struct timespec tp;
 
@@ -395,6 +396,14 @@ static void __init xen_time_init(void)
 
 	setup_force_cpu_cap(X86_FEATURE_TSC);
 
+	/*
+	 * We check ahead on the primary time info if this
+	 * bit is supported hence speeding up Xen clocksource.
+	 */
+	pvti = &__this_cpu_read(xen_vcpu)->time;
+	if (pvti->flags & PVCLOCK_TSC_STABLE_BIT)
+		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
+
 	xen_setup_runstate_info(cpu);
 	xen_setup_timer(cpu);
 	xen_setup_cpu_clockevents();
-- 
2.11.0
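
[Editorial sketch, not part of the patch.] The reason the patch calls
pvclock_set_flags() is that pvclock only honours flags that were declared
valid beforehand. The userspace mock below illustrates that masking. All
`mock_` names are hypothetical, and the real pvclock_read_flags() also does a
versioned read of the time info, which is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PVCLOCK_TSC_STABLE_BIT (1 << 0)

/* Stand-in for the flags field of struct pvclock_vcpu_time_info. */
struct mock_pvti {
    uint8_t flags;
};

/* Flags the guest is willing to honour; nothing is valid by default. */
static uint8_t valid_flags;

static void mock_pvclock_set_flags(uint8_t flags)
{
    valid_flags = flags;
}

/* Flags reported by the hypervisor, masked by what was declared valid. */
static uint8_t mock_pvclock_read_flags(const struct mock_pvti *src)
{
    assert(src != NULL);
    return src->flags & valid_flags;
}

/* Mirrors the check added to xen_time_init() in this patch. */
static void mock_time_init(const struct mock_pvti *primary)
{
    assert(primary != NULL);
    if (primary->flags & PVCLOCK_TSC_STABLE_BIT)
        mock_pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
}
```

Until mock_time_init() has seen the stable bit on the primary time info, the
masked read hides the bit even when the hypervisor reports it.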




[Xen-devel] [PATCH v6 3/4] x86/xen/time: setup vcpu 0 time info page

2017-10-03 Thread Joao Martins
In order to support pvclock vdso on Xen we need to set up the time
info page for vcpu 0 and register the page with Xen using the
VCPUOP_register_vcpu_time_memory_area hypercall. This hypercall
will also forcefully update the pvti which will set some of the
necessary flags for vdso. Afterwards we check if it supports the
PVCLOCK_TSC_STABLE_BIT flag which is mandatory for having
vdso/vsyscall support. If so, we set the cpu 0 pvti that
will later be used when mapping the vdso image.

The xen headers are also updated to include the new hypercall for
registering the secondary vcpu_time_info struct.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Reviewed-by: Juergen Gross <jgr...@suse.com>
---
Changes since v5:
 * Move xen_setup_vsyscall_time_info within the PVCLOCK_TSC_STABLE_BIT
 clause added in the previous patch.

Changes since v4:
 * Remove pvclock_set_flags since predecessor patch will set in
 xen_time_init. Consequently pvti local variable is not so useful
 and doesn't make things more clear - therefore remove it.
 * Adjust comment on xen_setup_vsyscall_time_info()
 * Add Juergen's Reviewed-by (Retained as there wasn't functional
 changes)

Changes since v3:
 (Comments from Juergen)
 * Remove _t added suffix from *GUEST_HANDLE* when sync vcpu.h
 with the latest

Changes since v2:
 (Comments from Juergen)
 * Omit the blank after the cast on all 3 occurrences.
 * Change last VCLOCK_PVCLOCK message to be more descriptive
 * Sync the complete vcpu.h header instead of just adding the
 needed one. (IOW adding VCPUOP_get_physid)

Changes since v1:
 * Check flags ahead to see if the  primary clock can use
 PVCLOCK_TSC_STABLE_BIT even if secondary registration fails.
 (Comments from Boris)
 * Remove addr, addr variables;
 * Change first pr_debug to pr_warn;
 * Change last pr_debug to pr_notice;
 * Add routine to solely register secondary time info.
 * Move xen_clock to outside xen_setup_vsyscall_time_info to allow
 restore path to simply re-register secondary time info. Let us
 handle the restore path more gracefully without re-allocating a
 page.
 * Removed cpu argument from xen_setup_vsyscall_time_info()
 * Adjust failed registration error messages/loglevel to be the same
 * Also teardown secondary time info on suspend

Changes since RFC:
 (Comments from Boris and David)
 * Remove Kconfig option
 * Use get_zeroed_page/free_page
 * Remove the hypercall availability check
 * Unregister pvti with arg.addr.v = NULL if stable bit isn't supported.
 (New)
 * Set secondary copy on restore such that it works on migration.
 * Drop global xen_clock variable and stash it locally on
 xen_setup_vsyscall_time_info.
 * WARN_ON(ret) if we fail to unregister the pvti.
---
 arch/x86/xen/suspend.c   |  4 ++
 arch/x86/xen/time.c  | 90 +++-
 arch/x86/xen/xen-ops.h   |  2 +
 include/xen/interface/vcpu.h | 42 +
 4 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index d6b1680693a9..800ed36ecfba 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -16,6 +16,8 @@
 
 void xen_arch_pre_suspend(void)
 {
+	xen_save_time_memory_area();
+
 	if (xen_pv_domain())
 		xen_pv_pre_suspend();
 }
@@ -26,6 +28,8 @@ void xen_arch_post_suspend(int cancelled)
 		xen_pv_post_suspend(cancelled);
 	else
 		xen_hvm_post_suspend(cancelled);
+
+	xen_restore_time_memory_area();
 }
 
 static void xen_vcpu_notify_restore(void *data)
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index fc0148d3a70d..dec966fbe888 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -370,6 +370,92 @@ static const struct pv_time_ops xen_time_ops __initconst = {
 	.steal_clock = xen_steal_clock,
 };
 
+static struct pvclock_vsyscall_time_info *xen_clock __read_mostly;
+
+void xen_save_time_memory_area(void)
+{
+	struct vcpu_register_time_memory_area t;
+	int ret;
+
+	if (!xen_clock)
+		return;
+
+	t.addr.v = NULL;
+
+	ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
+	if (ret != 0)
+		pr_notice("Cannot save secondary vcpu_time_info (err %d)",
+			  ret);
+	else
+		clear_page(xen_clock);
+}
+
+void xen_restore_time_memory_area(void)
+{
+	struct vcpu_register_time_memory_area t;
+	int ret;
+
+	if (!xen_clock)
+		return;
+
+	t.addr.v = &xen_clock->pvti;
+
+	ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
+
+   /*
+* We don't disable VCLOCK_PVCLOCK entirely if it fails to register the
+* secondary time info with Xen or if we migrated to a host without the
+* necessary flags. On both of these cases what happens is either
+* process seeing a zeroed out pvti or seeing no PVCLOCK_TSC_STAB
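
[Editorial sketch, not part of the patch.] The save/restore pair above follows
a simple pattern: unregister the secondary time info by registering a NULL
address and zero the page, then re-register the same page on resume. The mock
below replays that pattern in userspace. The `mock_` names, the 64-byte area
and the always-successful "hypercall" are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the page holding the secondary time info. */
struct mock_time_area {
    unsigned char bytes[64];
};

static struct mock_time_area *registered;  /* what the mock hypervisor sees */
static struct mock_time_area page;
static struct mock_time_area *clock_page;  /* analogue of xen_clock */

/* Mock of the register hypercall: always succeeds in this sketch. */
static int mock_register_time_area(struct mock_time_area *addr)
{
    registered = addr;
    return 0;
}

/* Suspend path: unregister with NULL, then zero the page. */
static void mock_save_time_area(void)
{
    if (!clock_page)
        return;
    if (mock_register_time_area(NULL) == 0)
        memset(clock_page, 0, sizeof(*clock_page));
}

/* Resume path: re-register the same page, no reallocation needed. */
static void mock_restore_time_area(void)
{
    if (!clock_page)
        return;
    mock_register_time_area(clock_page);
}

/* Walk through one suspend/resume cycle and check the invariants. */
static void mock_suspend_resume_cycle(void)
{
    mock_save_time_area();
    assert(registered == NULL);
    mock_restore_time_area();
    assert(registered == clock_page);
}
```

Zeroing the page on save means a reader never sees stale time info while the
area is unregistered, which matches the "zeroed out pvti" case the truncated
comment above refers to.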

[Xen-devel] [PATCH v6 1/4] x86/pvclock: add setter for pvclock_pvti_cpu0_va

2017-10-03 Thread Joao Martins
Right now there is only a pvclock_pvti_cpu0_va() which is defined
on kvmclock since:

commit dac16fba6fc5
("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")

The only user of this interface so far is kvm. This commit adds a
setter function for the pvti page and moves pvclock_pvti_cpu0_va
to pvclock, which is a more generic place to have it; and would
allow other PV clocksources to use it, such as Xen.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Acked-by: Andy Lutomirski <l...@kernel.org>
---
Changes since v1:
 * Rebased: the only conflict was that I had to move the exported
 pvclock_pvti_cpu0_va() symbol as it is used by kvm PTP driver.
 * Do not initialize pvti_cpu0_va to NULL (checkpatch error)
 ( Comments from Andy Lutomirski )
 * Removed asm/pvclock.h 'pvclock_set_pvti_cpu0_va' definition
 for !PARAVIRT_CLOCK to better track screwed Kconfig stuff.
 * Add his Acked-by (provided the previous adjustment was made)

Changes since RFC:
 (Comments from Andy Lutomirski)
 * Add __init to pvclock_set_pvti_cpu0_va
 * Add WARN_ON(vclock_was_used(VCLOCK_PVCLOCK)) to
 pvclock_set_pvti_cpu0_va
---
 arch/x86/include/asm/pvclock.h | 19 ++-
 arch/x86/kernel/kvmclock.c |  7 +--
 arch/x86/kernel/pvclock.c  | 14 ++
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 448cfe1b48cf..6f228f90cdd7 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -4,15 +4,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_KVM_GUEST
-extern struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
-#else
-static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
-{
-	return NULL;
-}
-#endif
-
 /* some helper functions for xen and kvm pv clock sources */
 u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
 u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
@@ -101,4 +92,14 @@ struct pvclock_vsyscall_time_info {
 
 #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti);
+struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
+#else
+static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+	return NULL;
+}
+#endif
+
 #endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index d88967659098..538738047ff5 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -47,12 +47,6 @@ early_param("no-kvmclock", parse_no_kvmclock);
 static struct pvclock_vsyscall_time_info *hv_clock;
 static struct pvclock_wall_clock wall_clock;
 
-struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
-{
-	return hv_clock;
-}
-EXPORT_SYMBOL_GPL(pvclock_pvti_cpu0_va);
-
 /*
  * The wallclock is the time of day when we booted. Since then, some time may
  * have elapsed since the hypervisor wrote the data. So we try to account for
@@ -334,6 +328,7 @@ int __init kvm_setup_vsyscall_timeinfo(void)
 		return 1;
 	}
 
+	pvclock_set_pvti_cpu0_va(hv_clock);
 	put_cpu();
 
 	kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 5c3f6d6a5078..cb7d6d9c9c2d 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -25,8 +25,10 @@
 
 #include 
 #include 
+#include 
 
 static u8 valid_flags __read_mostly = 0;
+static struct pvclock_vsyscall_time_info *pvti_cpu0_va __read_mostly;
 
 void pvclock_set_flags(u8 flags)
 {
@@ -144,3 +146,15 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
 
 	set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
+
+void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
+{
+	WARN_ON(vclock_was_used(VCLOCK_PVCLOCK));
+	pvti_cpu0_va = pvti;
+}
+
+struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+	return pvti_cpu0_va;
+}
+EXPORT_SYMBOL_GPL(pvclock_pvti_cpu0_va);
-- 
2.11.0
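
[Editorial sketch, not part of the patch.] The point of the setter is that any
PV clocksource can publish its cpu 0 time info page, and a consumer such as the
vdso code only needs the getter. The userspace mock below shows that split.
The `mock_` names and the consumer function are hypothetical, and the
WARN_ON(vclock_was_used(...)) check of the real setter is replaced by a plain
NULL assertion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Opaque stand-in for struct pvclock_vsyscall_time_info. */
struct mock_pvti_page {
    unsigned char raw[64];
};

/* NULL until some PV clocksource registers its page. */
static struct mock_pvti_page *pvti_cpu0_va;

/* Setter: called once by the clocksource that owns the page. */
static void mock_set_pvti_cpu0_va(struct mock_pvti_page *pvti)
{
    assert(pvti != NULL);
    pvti_cpu0_va = pvti;
}

/* Getter: the only thing a consumer needs to know about. */
static struct mock_pvti_page *mock_pvti_cpu0_va(void)
{
    return pvti_cpu0_va;
}

/* Hypothetical consumer: expose the page only if one was registered. */
static bool mock_can_map_pvclock_page(void)
{
    return mock_pvti_cpu0_va() != NULL;
}
```

In this arrangement the consumer does not care whether KVM or Xen called the
setter, which is the generalisation the commit message describes.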




[Xen-devel] [PATCH v6 0/4] x86/xen: pvclock vdso support

2017-10-03 Thread Joao Martins
Hey,

This is take 6 of vdso support for Xen. PVCLOCK_TSC_STABLE_BIT, which is
required for vdso time-related calls, can be set starting with Xen 4.8. In order to have it on, you
need to have the hypervisor clocksource be TSC e.g. with the following boot
params "clocksource=tsc tsc=stable:socket".

The series is structured as follows:

Patch 1 streamlines pvti page get/set in pvclock for both of its users
Patch 2,3 registers the pvti page on Xen and sets it in pvclock accordingly
Patch 4 adds a file to KVM/Xen maintainers for tracking pvclock ABI changes.

[ Patch 2 and 4 are acked. ]

Changelog is in individual patches.

Thanks,
Joao

Joao Martins (4):
  x86/pvclock: add setter for pvclock_pvti_cpu0_va
  x86/xen/time: set pvclock flags on xen_time_init()
  x86/xen/time: setup vcpu 0 time info page
  MAINTAINERS: xen, kvm: track pvclock-abi.h changes

 MAINTAINERS|  2 +
 arch/x86/include/asm/pvclock.h | 19 +
 arch/x86/kernel/kvmclock.c |  7 +--
 arch/x86/kernel/pvclock.c  | 14 ++
 arch/x86/xen/suspend.c |  4 ++
 arch/x86/xen/time.c| 97 ++
 arch/x86/xen/xen-ops.h |  2 +
 include/xen/interface/vcpu.h   | 42 ++
 8 files changed, 172 insertions(+), 15 deletions(-)

-- 
2.11.0




[Xen-devel] [PATCH v6 4/4] MAINTAINERS: xen, kvm: track pvclock-abi.h changes

2017-10-03 Thread Joao Martins
This file defines an ABI shared between guest and hypervisor(s)
(KVM, Xen) and as such there should be a corresponding entry in the
MAINTAINERS file. Note that there's already a text notice at the
top of the header file, hence this commit simply enforces it more
explicitly and has both peers notified when such changes happen.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Acked-by: Juergen Gross <jgr...@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
Acked-by: Paolo Bonzini <pbonz...@redhat.com>
---
Changes since v4:
 * Add Paolo's Acked-by
 * Add Konrad's Reviewed-by

Changes since v1:
 * Add Juergen Gross's Acked-by.
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 6671f375f7fc..a4834c3c377a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7603,6 +7603,7 @@ S:Supported
 F: arch/x86/kvm/
 F: arch/x86/include/uapi/asm/kvm*
 F: arch/x86/include/asm/kvm*
+F: arch/x86/include/asm/pvclock-abi.h
 F: arch/x86/kernel/kvm.c
 F: arch/x86/kernel/kvmclock.c
 
@@ -14718,6 +14719,7 @@ F:  arch/x86/xen/
 F: drivers/*/xen-*front.c
 F: drivers/xen/
 F: arch/x86/include/asm/xen/
+F: arch/x86/include/asm/pvclock-abi.h
 F: include/xen/
 F: include/uapi/xen/
 F: Documentation/ABI/stable/sysfs-hypervisor-xen
-- 
2.11.0




Re: [Xen-devel] [PATCH v5 3/4] x86/xen/time: setup vcpu 0 time info page

2017-10-02 Thread Joao Martins


On 10/02/2017 07:44 PM, Boris Ostrovsky wrote:
> 
>> +
>> +static void xen_setup_vsyscall_time_info(void)
>> +{
>> +	struct vcpu_register_time_memory_area t;
>> +	struct pvclock_vsyscall_time_info *ti;
>> +	int ret;
> 
> 
> In the previous version you'd return immediately if
> PVCLOCK_TSC_STABLE_BIT was not set. Don't you still need to check this?
> Especially given...
> 
Yes, my mistake.

When moving the primary info check I changed the comment below, but should have
moved the call to xen_setup_vsyscall_time_info() into the if () clause added
in the previous patch. Let me move that inside the conditional and
respin in v6.

Joao

> 
>> +
>> +	ti = (struct pvclock_vsyscall_time_info *)get_zeroed_page(GFP_KERNEL);
>> +	if (!ti)
>> +		return;
>> +
>> +	t.addr.v = &ti->pvti;
>> +
>> +	ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
>> +	if (ret) {
>> +		pr_notice("xen: VCLOCK_PVCLOCK not supported (err %d)\n", ret);
>> +		free_page((unsigned long)ti);
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * If primary time info had this bit set, secondary should too since
> 
> ... this comment?
> 
> -boris
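
[Editorial sketch, not part of the thread.] The fix agreed on above is about
call ordering: the vsyscall time info should only be set up when the primary
time info reports PVCLOCK_TSC_STABLE_BIT. The mock below shows that control
flow in userspace. All `mock_` names and the call counter are made up for
illustration; the real functions live in arch/x86/xen/time.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PVCLOCK_TSC_STABLE_BIT (1 << 0)

/* Stand-in for the flags field of the primary vcpu time info. */
struct mock_pvti {
    uint8_t flags;
};

static uint8_t valid_flags;       /* what pvclock_set_flags() would record */
static int vsyscall_setup_calls;  /* how often the secondary setup ran */

/* Stand-in for xen_setup_vsyscall_time_info(). */
static void mock_setup_vsyscall_time_info(void)
{
    vsyscall_setup_calls++;
}

/* The secondary setup sits inside the stable-bit clause, as planned for v6. */
static void mock_time_init(const struct mock_pvti *primary)
{
    assert(primary != NULL);
    if (primary->flags & PVCLOCK_TSC_STABLE_BIT) {
        valid_flags = PVCLOCK_TSC_STABLE_BIT;
        mock_setup_vsyscall_time_info();
    }
}
```

With this structure, a host whose primary time info lacks the stable bit never
allocates or registers the secondary page at all.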



[Xen-devel] [PATCH v5 1/4] x86/pvclock: add setter for pvclock_pvti_cpu0_va

2017-10-02 Thread Joao Martins
Right now there is only a pvclock_pvti_cpu0_va() which is defined
on kvmclock since:

commit dac16fba6fc5
("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")

The only user of this interface so far is kvm. This commit adds a
setter function for the pvti page and moves pvclock_pvti_cpu0_va
to pvclock, which is a more generic place to have it; and would
allow other PV clocksources to use it, such as Xen.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Acked-by: Andy Lutomirski <l...@kernel.org>
---
Changes since v1:
 * Rebased: the only conflict was that I had to move the exported
 pvclock_pvti_cpu0_va() symbol as it is used by kvm PTP driver.
 * Do not initialize pvti_cpu0_va to NULL (checkpatch error)
 ( Comments from Andy Lutomirski )
 * Removed asm/pvclock.h 'pvclock_set_pvti_cpu0_va' definition
 for !PARAVIRT_CLOCK to better track screwed Kconfig stuff.
 * Add his Acked-by (provided the previous adjustment was made)

Changes since RFC:
 (Comments from Andy Lutomirski)
 * Add __init to pvclock_set_pvti_cpu0_va
 * Add WARN_ON(vclock_was_used(VCLOCK_PVCLOCK)) to
 pvclock_set_pvti_cpu0_va
---
 arch/x86/include/asm/pvclock.h | 19 ++-
 arch/x86/kernel/kvmclock.c |  7 +--
 arch/x86/kernel/pvclock.c  | 14 ++
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 448cfe1b48cf..6f228f90cdd7 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -4,15 +4,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_KVM_GUEST
-extern struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
-#else
-static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
-{
-	return NULL;
-}
-#endif
-
 /* some helper functions for xen and kvm pv clock sources */
 u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
 u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
@@ -101,4 +92,14 @@ struct pvclock_vsyscall_time_info {
 
 #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti);
+struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
+#else
+static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+	return NULL;
+}
+#endif
+
 #endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index d88967659098..538738047ff5 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -47,12 +47,6 @@ early_param("no-kvmclock", parse_no_kvmclock);
 static struct pvclock_vsyscall_time_info *hv_clock;
 static struct pvclock_wall_clock wall_clock;
 
-struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
-{
-	return hv_clock;
-}
-EXPORT_SYMBOL_GPL(pvclock_pvti_cpu0_va);
-
 /*
  * The wallclock is the time of day when we booted. Since then, some time may
  * have elapsed since the hypervisor wrote the data. So we try to account for
@@ -334,6 +328,7 @@ int __init kvm_setup_vsyscall_timeinfo(void)
 		return 1;
 	}
 
+	pvclock_set_pvti_cpu0_va(hv_clock);
 	put_cpu();
 
 	kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 5c3f6d6a5078..cb7d6d9c9c2d 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -25,8 +25,10 @@
 
 #include 
 #include 
+#include 
 
 static u8 valid_flags __read_mostly = 0;
+static struct pvclock_vsyscall_time_info *pvti_cpu0_va __read_mostly;
 
 void pvclock_set_flags(u8 flags)
 {
@@ -144,3 +146,15 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
 
 	set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
+
+void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
+{
+	WARN_ON(vclock_was_used(VCLOCK_PVCLOCK));
+	pvti_cpu0_va = pvti;
+}
+
+struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+	return pvti_cpu0_va;
+}
+EXPORT_SYMBOL_GPL(pvclock_pvti_cpu0_va);
-- 
2.11.0




[Xen-devel] [PATCH v5 4/4] MAINTAINERS: xen, kvm: track pvclock-abi.h changes

2017-10-02 Thread Joao Martins
This file defines an ABI shared between guest and hypervisor(s)
(KVM, Xen) and as such there should be a corresponding entry in the
MAINTAINERS file. Note that there's already a text notice at the
top of the header file, hence this commit simply enforces it more
explicitly and has both peers notified when such changes happen.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Acked-by: Juergen Gross <jgr...@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
Acked-by: Paolo Bonzini <pbonz...@redhat.com>
---
Changes since v4:
 * Add Paolo's Acked-by
 * Add Konrad's Reviewed-by

Changes since v1:
 * Add Juergen Gross's Acked-by.
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 6671f375f7fc..a4834c3c377a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7603,6 +7603,7 @@ S:Supported
 F: arch/x86/kvm/
 F: arch/x86/include/uapi/asm/kvm*
 F: arch/x86/include/asm/kvm*
+F: arch/x86/include/asm/pvclock-abi.h
 F: arch/x86/kernel/kvm.c
 F: arch/x86/kernel/kvmclock.c
 
@@ -14718,6 +14719,7 @@ F:  arch/x86/xen/
 F: drivers/*/xen-*front.c
 F: drivers/xen/
 F: arch/x86/include/asm/xen/
+F: arch/x86/include/asm/pvclock-abi.h
 F: include/xen/
 F: include/uapi/xen/
 F: Documentation/ABI/stable/sysfs-hypervisor-xen
-- 
2.11.0




[Xen-devel] [PATCH v5 3/4] x86/xen/time: setup vcpu 0 time info page

2017-10-02 Thread Joao Martins
In order to support pvclock vdso on Xen we need to set up the time
info page for vcpu 0 and register the page with Xen using the
VCPUOP_register_vcpu_time_memory_area hypercall. This hypercall
will also forcefully update the pvti which will set some of the
necessary flags for vdso. Afterwards we check if it supports the
PVCLOCK_TSC_STABLE_BIT flag which is mandatory for having
vdso/vsyscall support. If so, we set the cpu 0 pvti that
will later be used when mapping the vdso image.

The xen headers are also updated to include the new hypercall for
registering the secondary vcpu_time_info struct.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Reviewed-by: Juergen Gross <jgr...@suse.com>
---
Changes since v4:
 * Remove pvclock_set_flags since predecessor patch will set in
 xen_time_init. Consequently pvti local variable is not so useful
 and doesn't make things more clear - therefore remove it.
 * Adjust comment on xen_setup_vsyscall_time_info()
 * Add Juergen's Reviewed-by (Retained as there wasn't functional
 changes)

Changes since v3:
 (Comments from Juergen)
 * Remove _t added suffix from *GUEST_HANDLE* when sync vcpu.h
 with the latest

Changes since v2:
 (Comments from Juergen)
 * Omit the blank after the cast on all 3 occurrences.
 * Change last VCLOCK_PVCLOCK message to be more descriptive
 * Sync the complete vcpu.h header instead of just adding the
 needed one. (IOW adding VCPUOP_get_physid)

Changes since v1:
 * Check flags ahead to see if the  primary clock can use
 PVCLOCK_TSC_STABLE_BIT even if secondary registration fails.
 (Comments from Boris)
 * Remove addr, addr variables;
 * Change first pr_debug to pr_warn;
 * Change last pr_debug to pr_notice;
 * Add routine to solely register secondary time info.
 * Move xen_clock to outside xen_setup_vsyscall_time_info to allow
 restore path to simply re-register secondary time info. Let us
 handle the restore path more gracefully without re-allocating a
 page.
 * Removed cpu argument from xen_setup_vsyscall_time_info()
 * Adjust failed registration error messages/loglevel to be the same
 * Also teardown secondary time info on suspend

Changes since RFC:
 (Comments from Boris and David)
 * Remove Kconfig option
 * Use get_zeroed_page/free_page
 * Remove the hypercall availability check
 * Unregister pvti with arg.addr.v = NULL if stable bit isn't supported.
 (New)
 * Set secondary copy on restore such that it works on migration.
 * Drop global xen_clock variable and stash it locally on
 xen_setup_vsyscall_time_info.
 * WARN_ON(ret) if we fail to unregister the pvti.
---
 arch/x86/xen/suspend.c   |  4 ++
 arch/x86/xen/time.c  | 87 
 arch/x86/xen/xen-ops.h   |  2 +
 include/xen/interface/vcpu.h | 42 +
 4 files changed, 135 insertions(+)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index d6b1680693a9..800ed36ecfba 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -16,6 +16,8 @@
 
 void xen_arch_pre_suspend(void)
 {
+	xen_save_time_memory_area();
+
 	if (xen_pv_domain())
 		xen_pv_pre_suspend();
 }
@@ -26,6 +28,8 @@ void xen_arch_post_suspend(int cancelled)
 		xen_pv_post_suspend(cancelled);
 	else
 		xen_hvm_post_suspend(cancelled);
+
+	xen_restore_time_memory_area();
 }
 
 static void xen_vcpu_notify_restore(void *data)
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index fc0148d3a70d..aa8bb87601f3 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -370,6 +370,92 @@ static const struct pv_time_ops xen_time_ops __initconst = {
 	.steal_clock = xen_steal_clock,
 };
 
+static struct pvclock_vsyscall_time_info *xen_clock __read_mostly;
+
+void xen_save_time_memory_area(void)
+{
+	struct vcpu_register_time_memory_area t;
+	int ret;
+
+	if (!xen_clock)
+		return;
+
+	t.addr.v = NULL;
+
+	ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
+	if (ret != 0)
+		pr_notice("Cannot save secondary vcpu_time_info (err %d)",
+			  ret);
+	else
+		clear_page(xen_clock);
+}
+
+void xen_restore_time_memory_area(void)
+{
+	struct vcpu_register_time_memory_area t;
+	int ret;
+
+	if (!xen_clock)
+		return;
+
+	t.addr.v = &xen_clock->pvti;
+
+	ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
+
+   /*
+* We don't disable VCLOCK_PVCLOCK entirely if it fails to register the
+* secondary time info with Xen or if we migrated to a host without the
+* necessary flags. On both of these cases what happens is either
+* process seeing a zeroed out pvti or seeing no PVCLOCK_TSC_STABLE_BIT
+* bit set. Userspace checks the latter and if 0, it discards the data
+* in pvti and fallbacks to a system call

[Xen-devel] [PATCH v5 2/4] x86/xen/time: set pvclock flags on xen_time_init()

2017-10-02 Thread Joao Martins
Specifically check for PVCLOCK_TSC_STABLE_BIT and, if this bit is set,
set it in the pvclock flags too. This allows the Xen clocksource to use
it, speeding up xen_clocksource_read() callers (i.e. sched_clock()).

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
New in v5
---
 arch/x86/xen/time.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 1ecb05db3632..fc0148d3a70d 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -372,6 +372,7 @@ static const struct pv_time_ops xen_time_ops __initconst = {
 
 static void __init xen_time_init(void)
 {
+	struct pvclock_vcpu_time_info *pvti;
 	int cpu = smp_processor_id();
 	struct timespec tp;
 
@@ -395,6 +396,14 @@ static void __init xen_time_init(void)
 
 	setup_force_cpu_cap(X86_FEATURE_TSC);
 
+	/*
+	 * We check ahead on the primary time info if this
+	 * bit is supported hence speeding up Xen clocksource.
+	 */
+	pvti = &__this_cpu_read(xen_vcpu)->time;
+	if (pvti->flags & PVCLOCK_TSC_STABLE_BIT)
+		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
+
 	xen_setup_runstate_info(cpu);
 	xen_setup_timer(cpu);
 	xen_setup_cpu_clockevents();
-- 
2.11.0
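
For readers wondering why setting the flag speeds up the clocksource read, here is
a minimal userspace sketch of the idea, not the kernel code itself. It assumes the
read path only skips its global "never go backwards" clamp when both the
registered flags and the per-read flags carry the stable bit. The names
set_flags_sketch(), clock_read_sketch() and STABLE_BIT are made up for this
illustration; C11 atomics stand in for the kernel's atomic64 helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define STABLE_BIT 0x1u			/* stands in for PVCLOCK_TSC_STABLE_BIT */

static uint8_t valid_flags;		/* flags accepted at init time */
static _Atomic uint64_t last_value;	/* global high-water mark */

/* Hypothetical counterpart of a "set flags" call made once during init. */
void set_flags_sketch(uint8_t flags)
{
	valid_flags = flags;
}

/*
 * raw_ns is the time already computed from the time info page,
 * src_flags are the flags read from that same page.
 */
uint64_t clock_read_sketch(uint64_t raw_ns, uint8_t src_flags)
{
	uint64_t last;

	/* Fast path: a stable TSC needs no cross-CPU clamping. */
	if ((valid_flags & STABLE_BIT) && (src_flags & STABLE_BIT))
		return raw_ns;

	/* Slow path: never return less than what any CPU returned before. */
	last = atomic_load(&last_value);
	do {
		if (raw_ns < last)
			return last;
	} while (!atomic_compare_exchange_weak(&last_value, &last, raw_ns));

	return raw_ns;
}
```

The slow path touches a shared cache line on every read, which is the cost the
fast path avoids once the stable bit is accepted.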


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH v5 0/4] x86/xen: pvclock vdso support

2017-10-02 Thread Joao Martins
Hey,

This is take 5 of vdso support for Xen. PVCLOCK_TSC_STABLE_BIT, which is
required for vdso time-related calls, can be set starting with Xen 4.8. To
have it on, the hypervisor clocksource needs to be TSC, e.g. with the
following boot params: "clocksource=tsc tsc=stable:socket".

The series is structured as follows:

Patch 1 streamlines pvti page get/set in pvclock for both of its users
Patches 2 and 3 register the pvti page with Xen and set it in pvclock accordingly
Patch 4 adds a file to KVM/Xen maintainers for tracking pvclock ABI changes.
[ The last one is already Acked. ]

Changelog is in individual patches.

Thanks,
Joao

Joao Martins (4):
  x86/pvclock: add setter for pvclock_pvti_cpu0_va
  x86/xen/time: set pvclock flags on xen_time_init()
  x86/xen/time: setup vcpu 0 time info page
  MAINTAINERS: xen, kvm: track pvclock-abi.h changes

 MAINTAINERS|  2 +
 arch/x86/include/asm/pvclock.h | 19 +
 arch/x86/kernel/kvmclock.c |  7 +--
 arch/x86/kernel/pvclock.c  | 14 ++
 arch/x86/xen/suspend.c |  4 ++
 arch/x86/xen/time.c| 96 ++
 arch/x86/xen/xen-ops.h |  2 +
 include/xen/interface/vcpu.h   | 42 ++
 8 files changed, 171 insertions(+), 15 deletions(-)

-- 
2.11.0
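
As background on why the stable bit is mandatory for the vdso path, the following
is a rough sketch of how a userspace reader could consume a time info page: retry
while the version changes, bail out to a system call when the stable bit is
absent, otherwise scale the TSC delta. It is an illustration under assumptions,
not the vdso code. struct pvti_sketch mirrors the field layout of
pvclock_vcpu_time_info as I understand it, the TSC value is passed in as a
parameter so the example stays deterministic, and pvti_read_ns() and
scale_delta() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define STABLE_BIT 0x1u			/* stands in for PVCLOCK_TSC_STABLE_BIT */

struct pvti_sketch {
	uint32_t version;		/* odd while the hypervisor updates */
	uint32_t pad0;
	uint64_t tsc_timestamp;
	uint64_t system_time;		/* ns at tsc_timestamp */
	uint32_t tsc_to_system_mul;	/* 32.32 fixed-point multiplier */
	int8_t   tsc_shift;
	uint8_t  flags;
	uint8_t  pad[2];
};

/* (delta << shift) * mul >> 32, split to avoid a 128-bit multiply. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Returns 1 and fills *ns, or 0 when the caller must use a system call. */
int pvti_read_ns(const volatile struct pvti_sketch *p, uint64_t tsc,
		 uint64_t *ns)
{
	uint32_t version;
	uint64_t result;

	assert(p && ns);

	do {
		version = p->version & ~1u;	/* an odd version forces a retry */
		atomic_thread_fence(memory_order_acquire);

		if (!(p->flags & STABLE_BIT))
			return 0;

		result = p->system_time +
			 scale_delta(tsc - p->tsc_timestamp,
				     p->tsc_to_system_mul, p->tsc_shift);

		atomic_thread_fence(memory_order_acquire);
	} while (p->version != version);

	*ns = result;
	return 1;
}
```

A caller would pass the current TSC reading and, on a 0 return, fall back to the
regular clock_gettime() system call.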




Re: [Xen-devel] [PATCH v4 2/3] x86/xen/time: setup vcpu 0 time info page

2017-09-28 Thread Joao Martins
On 09/28/2017 12:46 AM, Joao Martins wrote:
> On 09/27/2017 11:44 PM, Boris Ostrovsky wrote:
>> On 09/27/2017 04:57 PM, Joao Martins wrote:
>>> On 09/27/2017 09:22 PM, Boris Ostrovsky wrote:
>>>> On 09/27/2017 11:26 AM, Joao Martins wrote:
>>>>> On 09/27/2017 03:40 PM, Boris Ostrovsky wrote:
>>>>>>> +static void xen_setup_vsyscall_time_info(void)
>>>>>>> +{
>>>>>>> +   struct vcpu_register_time_memory_area t;
>>>>>>> +   struct pvclock_vsyscall_time_info *ti;
>>>>>>> +   struct pvclock_vcpu_time_info *pvti;
>>>>>>> +   int ret;
>>>>>>> +
>>>>>>> +   pvti = &__this_cpu_read(xen_vcpu)->time;
>>>>>>> +
>>>>>>> +   /*
>>>>>>> +* We check ahead on the primary time info if this
>>>>>>> +* bit is supported hence speeding up Xen clocksource.
>>>>>>> +*/
>>>>>>> +   if (!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))
>>>>>>> +   return;
>>>>>>> +
>>>>>>> +   pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
>>>>>> Is it OK to have this flag set if anything below fails?
>>>>>>
>>>>> Yes - if anything below fails it will only affect userspace mapped page.
>>>> Then should it be set somewhere else, like in xen_time_init()?
>>>>
>>> Hm, I could move it if you think it's better - but given the importance of the
>>> bit we are checking and its direct correlation to whether or not we can setup
>>> VCLOCK_PVCLOCK then I find it cleaner to have it here in the same routine. One
>>> thing I failed to mention before is that checking ahead like above, let us also
>>> avoid allocating a page plus an hypercall to register the pvti just to check the
>>> one bit of info we need for using VCLOCK_PVCLOCK.
>>>
>>> It is very unlikely with current Xen code that 1) the secondary copy register
>>> below fails, or 2) master and secondary don't have the same bits set. So in case
>>> you're reconsidering the "shortcut" check above I can move it like we had in v1
>>> and have pvclock_set_flags right before pvclock_set_pvti_cpu0_va().
>>
>> I think it would be more logical to move it to the end like in v1.
>>
>> But can you explain again why this flag should not be set in
>> xen_time_init()?
> 
> I didn't say we shouldn't have this flag there - I was just pointing out a
> matter of taste on whether to put on xen_time_init() or in
> xen_setup_vsyscall_time_info() (which is called from xen_time_init btw) so
> there's no functional change.
> 
To be clear, in this paragraph when I say on xen_setup_vsyscall_time_info() I
mean like it is described in this patch i.e. in the beginning of the routine.

>> It seems to me that it would be useful not just for
>> vDSO but for xen_clocksource_read()->pvclock_clocksource_read() as well.
> 
> Right - That's what I mentioned by "allowing xen clocksource to use/check that
> bit (consequently speeding up sched_clock)". The above chunk is really focused
> on enabling the flag on pvclock_clocksource_read().
> 
>>>
>>>>>  What I
>>>>> do above is just allowing xen clocksource to use/check that bit (consequently
>>>>> speeding up sched_clock) given the necessary support is there in the master
>>>>> copy. The secondary copy (i.e. what's being set up below, mapped/used in vdso)
>>>>> has the same data from the master copy, just separate memory regions. The checks
>>>>> below are just for the unlikely cases of failing to register the secondary copy
>>>>> or if its content were to differ from master copy in future releases - and
>>>>> therefore we handle those more gracefully.
>>>>>
>>>>>> (I can see in the changelog that apparently at some point I've asked
>>>>>> about this at v1 but I can't remember/find what exactly it was)
>>>>>>
>>>>>>> +
>>>>>>> +   ti = (struct pvclock_vsyscall_time_info *)get_zeroed_page(GFP_KERNEL);
>>>>>>> +   if (!ti)

Re: [Xen-devel] [PATCH v4 2/3] x86/xen/time: setup vcpu 0 time info page

2017-09-27 Thread Joao Martins
On 09/27/2017 11:44 PM, Boris Ostrovsky wrote:
> On 09/27/2017 04:57 PM, Joao Martins wrote:
>> On 09/27/2017 09:22 PM, Boris Ostrovsky wrote:
>>> On 09/27/2017 11:26 AM, Joao Martins wrote:
>>>> On 09/27/2017 03:40 PM, Boris Ostrovsky wrote:
>>>>>> +static void xen_setup_vsyscall_time_info(void)
>>>>>> +{
>>>>>> +struct vcpu_register_time_memory_area t;
>>>>>> +struct pvclock_vsyscall_time_info *ti;
>>>>>> +struct pvclock_vcpu_time_info *pvti;
>>>>>> +int ret;
>>>>>> +
>>>>>> +pvti = &__this_cpu_read(xen_vcpu)->time;
>>>>>> +
>>>>>> +/*
>>>>>> + * We check ahead on the primary time info if this
>>>>>> + * bit is supported hence speeding up Xen clocksource.
>>>>>> + */
>>>>>> +if (!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))
>>>>>> +return;
>>>>>> +
>>>>>> +pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
>>>>> Is it OK to have this flag set if anything below fails?
>>>>>
>>>> Yes - if anything below fails it will only affect userspace mapped page.
>>> Then should it be set somewhere else, like in xen_time_init()?
>>>
>> Hm, I could move it if you think it's better - but given the importance of the
>> bit we are checking and its direct correlation to whether or not we can setup
>> VCLOCK_PVCLOCK then I find it cleaner to have it here in the same routine. One
>> thing I failed to mention before is that checking ahead like above, let us also
>> avoid allocating a page plus an hypercall to register the pvti just to check the
>> one bit of info we need for using VCLOCK_PVCLOCK.
>>
>> It is very unlikely with current Xen code that 1) the secondary copy register
>> below fails, or 2) master and secondary don't have the same bits set. So in case
>> you're reconsidering the "shortcut" check above I can move it like we had in v1
>> and have pvclock_set_flags right before pvclock_set_pvti_cpu0_va().
> 
> I think it would be more logical to move it to the end like in v1.
> 
> But can you explain again why this flag should not be set in
> xen_time_init()?

I didn't say we shouldn't have this flag there - I was just pointing out a
matter of taste on whether to put on xen_time_init() or in
xen_setup_vsyscall_time_info() (which is called from xen_time_init btw) so
there's no functional change.

> It seems to me that it would be useful not just for
> vDSO but for xen_clocksource_read()->pvclock_clocksource_read() as well.

Right - That's what I mentioned by "allowing xen clocksource to use/check that
bit (consequently speeding up sched_clock)". The above chunk is really focused
on enabling the flag on pvclock_clocksource_read().

>>
>>>>  What I
>>>> do above is just allowing xen clocksource to use/check that bit (consequently
>>>> speeding up sched_clock) given the necessary support is there in the master
>>>> copy. The secondary copy (i.e. what's being set up below, mapped/used in vdso)
>>>> has the same data from the master copy, just separate memory regions. The checks
>>>> below are just for the unlikely cases of failing to register the secondary copy
>>>> or if its content were to differ from master copy in future releases - and
>>>> therefore we handle those more gracefully.
>>>>
>>>>> (I can see in the changelog that apparently at some point I've asked
>>>>> about this at v1 but I can't remember/find what exactly it was)
>>>>>
>>>>>> +
>>>>>> +	ti = (struct pvclock_vsyscall_time_info *)get_zeroed_page(GFP_KERNEL);
>>>>>> +	if (!ti)
>>>>>> +		return;
>>>>>> +
>>>>>> +	t.addr.v = &ti->pvti;
>>>>>> +
>>>>>> +	ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
>>>>>> +	if (ret) {
>>>>>> +		pr_notice("xen: VCLOCK_PVCLOCK not supported (err %d)\n", ret);
>>>>>> +		free_page((unsigned long)ti);
>>>>>> +		return;
>>

Re: [Xen-devel] [PATCH v4 2/3] x86/xen/time: setup vcpu 0 time info page

2017-09-27 Thread Joao Martins
On 09/27/2017 09:22 PM, Boris Ostrovsky wrote:
> On 09/27/2017 11:26 AM, Joao Martins wrote:
>> On 09/27/2017 03:40 PM, Boris Ostrovsky wrote:
>>>> +static void xen_setup_vsyscall_time_info(void)
>>>> +{
>>>> +  struct vcpu_register_time_memory_area t;
>>>> +  struct pvclock_vsyscall_time_info *ti;
>>>> +  struct pvclock_vcpu_time_info *pvti;
>>>> +  int ret;
>>>> +
>>>> +  pvti = &__this_cpu_read(xen_vcpu)->time;
>>>> +
>>>> +  /*
>>>> +   * We check ahead on the primary time info if this
>>>> +   * bit is supported hence speeding up Xen clocksource.
>>>> +   */
>>>> +  if (!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))
>>>> +  return;
>>>> +
>>>> +  pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
>>> Is it OK to have this flag set if anything below fails?
>>>
>> Yes - if anything below fails it will only affect userspace mapped page.
> 
> Then should it be set somewhere else, like in xen_time_init()?
>
Hm, I could move it if you think it's better - but given the importance of the
bit we are checking and its direct correlation to whether or not we can setup
VCLOCK_PVCLOCK then I find it cleaner to have it here in the same routine. One
thing I failed to mention before is that checking ahead like above, let us also
avoid allocating a page plus an hypercall to register the pvti just to check the
one bit of info we need for using VCLOCK_PVCLOCK.

It is very unlikely with current Xen code that 1) the secondary copy register
below fails, or 2) master and secondary don't have the same bits set. So in case
you're reconsidering the "shortcut" check above I can move it like we had in v1
and have pvclock_set_flags right before pvclock_set_pvti_cpu0_va().

>>  What I
>> do above is just allowing xen clocksource to use/check that bit (consequently
>> speeding up sched_clock) given the necessary support is there in the master
>> copy. The secondary copy (i.e. what's being set up below, mapped/used in vdso)
>> has the same data from the master copy, just separate memory regions. The checks
>> below are just for the unlikely cases of failing to register the secondary copy
>> or if its content were to differ from master copy in future releases - and
>> therefore we handle those more gracefully.
>> therefore we handle those more gracefully.
>>
>>> (I can see in the changelog that apparently at some point I've asked
>>> about this at v1 but I can't remember/find what exactly it was)
>>>
>>>> +
>>>> +  ti = (struct pvclock_vsyscall_time_info *)get_zeroed_page(GFP_KERNEL);
>>>> +  if (!ti)
>>>> +  return;
>>>> +
>>>> +  t.addr.v = &ti->pvti;
>>>> +
>>>> +  ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
>>>> +  if (ret) {
>>>> +  pr_notice("xen: VCLOCK_PVCLOCK not supported (err %d)\n", ret);
>>>> +  free_page((unsigned long)ti);
>>>> +  return;
>>>> +  }
>>>> +
>>>> +  /*
>>>> +   * If the check above succeeded this one should too since it's the
>>>> +   * same data on both primary and secondary time infos just different
>>>> +   * memory regions. But we still check it in case hypervisor is buggy.
>>>> +   */
>>>> +  pvti = &ti->pvti;
>>>> +  if (!(pvti->flags & PVCLOCK_TSC_STABLE_BIT)) {
>>>> +  t.addr.v = NULL;
>>>> +  ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area,
>>>> +   0, &t);
>>>> +  if (!ret)
>>>> +  free_page((unsigned long)ti);
>>>> +
>>>> +  pr_notice("xen: VCLOCK_PVCLOCK not supported (tsc unstable)\n");
>>>> +  return;
>>>> +  }
>>>> +
>>>> +  xen_clock = ti;
>>>> +  pvclock_set_pvti_cpu0_va(xen_clock);
>>>> +
>>>> +  xen_clocksource.archdata.vclock_mode = VCLOCK_PVCLOCK;
>>>> +}
>>>> +
> 



Re: [Xen-devel] [PATCH v4 2/3] x86/xen/time: setup vcpu 0 time info page

2017-09-27 Thread Joao Martins
On 09/27/2017 03:40 PM, Boris Ostrovsky wrote:
>> +static void xen_setup_vsyscall_time_info(void)
>> +{
>> +	struct vcpu_register_time_memory_area t;
>> +	struct pvclock_vsyscall_time_info *ti;
>> +	struct pvclock_vcpu_time_info *pvti;
>> +	int ret;
>> +
>> +	pvti = &__this_cpu_read(xen_vcpu)->time;
>> +
>> +	/*
>> +	 * We check ahead on the primary time info if this
>> +	 * bit is supported hence speeding up Xen clocksource.
>> +	 */
>> +	if (!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))
>> +		return;
>> +
>> +	pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
> 
> Is it OK to have this flag set if anything below fails?
> 
Yes - if anything below fails it will only affect userspace mapped page. What I
do above is just allowing xen clocksource to use/check that bit (consequently
speeding up sched_clock) given the necessary support is there in the master
copy. The secondary copy (i.e. what's being set up below, mapped/used in vdso)
has the same data from the master copy, just separate memory regions. The checks
below are just for the unlikely cases of failing to register the secondary copy
or if its content were to differ from master copy in future releases - and
therefore we handle those more gracefully.
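
To make the outcomes described above concrete, here is a small sketch of the
decision flow as a pure function. It only models which features end up enabled;
it does not allocate pages or issue hypercalls. The struct, the function name and
the parameters (the primary flags, the hypothetical return value of the
registration hypercall, and the flags later seen in the secondary copy) are
invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STABLE_BIT 0x1u			/* stands in for PVCLOCK_TSC_STABLE_BIT */

struct setup_outcome {
	bool kernel_fast_path;	/* stable bit accepted for the in-kernel clocksource */
	bool vdso_enabled;	/* userspace may read the secondary copy directly */
};

struct setup_outcome setup_sketch(uint8_t primary_flags, int register_ret,
				  uint8_t secondary_flags)
{
	struct setup_outcome out = { false, false };

	/* Sketch-only sanity check: 0 on success, negative errno otherwise. */
	assert(register_ret <= 0);

	/* No stable bit in the primary copy: nothing gets enabled. */
	if (!(primary_flags & STABLE_BIT))
		return out;

	/* Decided from the primary copy alone, before any registration. */
	out.kernel_fast_path = true;

	/* Registration failed: only the userspace-mapped page is affected. */
	if (register_ret != 0)
		return out;

	/* Secondary copy unexpectedly lacks the bit: keep vdso disabled. */
	if (!(secondary_flags & STABLE_BIT))
		return out;

	out.vdso_enabled = true;
	return out;
}
```

In this model a failed registration leaves kernel_fast_path set and vdso_enabled
clear, which matches the "it will only affect userspace mapped page" answer.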

> (I can see in the changelog that apparently at some point I've asked
> about this at v1 but I can't remember/find what exactly it was)
> 
>> +
>> +	ti = (struct pvclock_vsyscall_time_info *)get_zeroed_page(GFP_KERNEL);
>> +	if (!ti)
>> +		return;
>> +
>> +	t.addr.v = &ti->pvti;
>> +
>> +	ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
>> +	if (ret) {
>> +		pr_notice("xen: VCLOCK_PVCLOCK not supported (err %d)\n", ret);
>> +		free_page((unsigned long)ti);
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * If the check above succeeded this one should too since it's the
>> +	 * same data on both primary and secondary time infos just different
>> +	 * memory regions. But we still check it in case hypervisor is buggy.
>> +	 */
>> +	pvti = &ti->pvti;
>> +	if (!(pvti->flags & PVCLOCK_TSC_STABLE_BIT)) {
>> +		t.addr.v = NULL;
>> +		ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area,
>> +					 0, &t);
>> +		if (!ret)
>> +			free_page((unsigned long)ti);
>> +
>> +		pr_notice("xen: VCLOCK_PVCLOCK not supported (tsc unstable)\n");
>> +		return;
>> +	}
>> +
>> +	xen_clock = ti;
>> +	pvclock_set_pvti_cpu0_va(xen_clock);
>> +
>> +	xen_clocksource.archdata.vclock_mode = VCLOCK_PVCLOCK;
>> +}
>> +
> 



Re: [Xen-devel] [PATCH v3 2/3] x86/xen/time: setup vcpu 0 time info page

2017-09-27 Thread Joao Martins
On 09/27/2017 01:14 PM, Juergen Gross wrote:
> On 27/09/17 14:00, Joao Martins wrote:

[...]

>> diff --git a/include/xen/interface/vcpu.h b/include/xen/interface/vcpu.h
>> index 98188c87f5c1..b4a1eabcf1c4 100644
>> --- a/include/xen/interface/vcpu.h
>> +++ b/include/xen/interface/vcpu.h
>> @@ -178,4 +178,46 @@ DEFINE_GUEST_HANDLE_STRUCT(vcpu_register_vcpu_info);
>>  
>>  /* Send an NMI to the specified VCPU. @extra_arg == NULL. */
>>  #define VCPUOP_send_nmi 11
>> +
>> +/*
>> + * Get the physical ID information for a pinned vcpu's underlying physical
>> + * processor.  The physical ID information is architecture-specific.
>> + * On x86: id[31:0]=apic_id, id[63:32]=acpi_id.
>> + * This command returns -EINVAL if it is not a valid operation for this VCPU.
>> + */
>> +#define VCPUOP_get_physid   12 /* arg == vcpu_get_physid_t */
>> +struct vcpu_get_physid {
>> +	uint64_t phys_id;
>> +};
>> +DEFINE_GUEST_HANDLE_STRUCT(vcpu_get_physid_t);
> 
> DEFINE_GUEST_HANDLE_STRUCT(vcpu_get_physid);
> 
>> +#define xen_vcpu_physid_to_x86_apicid(physid) ((uint32_t)(physid))
>> +#define xen_vcpu_physid_to_x86_acpiid(physid) ((uint32_t)((physid) >> 32))
>> +
>> +/*
>> + * Register a memory location to get a secondary copy of the vcpu time
>> + * parameters.  The master copy still exists as part of the vcpu shared
>> + * memory area, and this secondary copy is updated whenever the master copy
>> + * is updated (and using the same versioning scheme for synchronisation).
>> + *
>> + * The intent is that this copy may be mapped (RO) into userspace so
>> + * that usermode can compute system time using the time info and the
>> + * tsc.  Usermode will see an array of vcpu_time_info structures, one
>> + * for each vcpu, and choose the right one by an existing mechanism
>> + * which allows it to get the current vcpu number (such as via a
>> + * segment limit).  It can then apply the normal algorithm to compute
>> + * system time from the tsc.
>> + *
>> + * @extra_arg == pointer to vcpu_register_time_info_memory_area structure.
>> + */
>> +#define VCPUOP_register_vcpu_time_memory_area   13
>> +DEFINE_GUEST_HANDLE_STRUCT(vcpu_time_info_t);
> 
> DEFINE_GUEST_HANDLE_STRUCT(vcpu_time_info);
> 
>> +struct vcpu_register_time_memory_area {
>> +	union {
>> +		GUEST_HANDLE(vcpu_time_info_t) h;
> 
> GUEST_HANDLE(vcpu_time_info) h;
> 
>> +		struct pvclock_vcpu_time_info *v;
>> +		uint64_t p;
>> +	} addr;
>> +};
>> +DEFINE_GUEST_HANDLE_STRUCT(vcpu_register_time_memory_area_t);
> 
> DEFINE_GUEST_HANDLE_STRUCT(vcpu_register_time_memory_area);

Oh sorry - I forgot to remove the suffix. In the meantime I sent over v4
addressing the above.

Joao



[Xen-devel] [PATCH v4 2/3] x86/xen/time: setup vcpu 0 time info page

2017-09-27 Thread Joao Martins
In order to support pvclock vdso on Xen we need to set up the time
info page for vcpu 0 and register the page with Xen using the
VCPUOP_register_vcpu_time_memory_area hypercall. This hypercall
will also forcefully update the pvti, which will set some of the
necessary flags for vdso. Afterwards we check if it supports the
PVCLOCK_TSC_STABLE_BIT flag, which is mandatory for having
vdso/vsyscall support. If so, we set the cpu 0 pvti that will
later be used when mapping the vdso image.

The xen headers are also updated to include the new hypercall for
registering the secondary vcpu_time_info struct.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
Changes since v3:
 (Comments from Juergen)
 * Remove the added _t suffix from *GUEST_HANDLE* when syncing vcpu.h
 with the latest

Changes since v2:
 (Comments from Juergen)
 * Omit the blank after the cast on all 3 occurrences.
 * Change last VCLOCK_PVCLOCK message to be more descriptive
 * Sync the complete vcpu.h header instead of just adding the
 needed one. (IOW adding VCPUOP_get_physid)

Changes since v1:
 * Check flags ahead to see if the primary clock can use
 PVCLOCK_TSC_STABLE_BIT even if secondary registration fails.
 (Comments from Boris)
 * Remove addr, addr variables;
 * Change first pr_debug to pr_warn;
 * Change last pr_debug to pr_notice;
 * Add routine to solely register secondary time info.
 * Move xen_clock to outside xen_setup_vsyscall_time_info to allow
 restore path to simply re-register secondary time info. Let us
 handle the restore path more gracefully without re-allocating a
 page.
 * Removed cpu argument from xen_setup_vsyscall_time_info()
 * Adjust failed registration error messages/loglevel to be the same
 * Also teardown secondary time info on suspend

Changes since RFC:
 (Comments from Boris and David)
 * Remove Kconfig option
 * Use get_zeroed_page/free_page
 * Remove the hypercall availability check
 * Unregister pvti with arg.addr.v = NULL if stable bit isn't supported.
 (New)
 * Set secondary copy on restore such that it works on migration.
 * Drop global xen_clock variable and stash it locally on
 xen_setup_vsyscall_time_info.
 * WARN_ON(ret) if we fail to unregister the pvti.
---
 arch/x86/xen/suspend.c   |   4 ++
 arch/x86/xen/time.c  | 100 +++
 arch/x86/xen/xen-ops.h   |   2 +
 include/xen/interface/vcpu.h |  42 ++
 4 files changed, 148 insertions(+)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index d6b1680693a9..800ed36ecfba 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -16,6 +16,8 @@
 
 void xen_arch_pre_suspend(void)
 {
+	xen_save_time_memory_area();
+
 	if (xen_pv_domain())
 		xen_pv_pre_suspend();
 }
@@ -26,6 +28,8 @@ void xen_arch_post_suspend(int cancelled)
 		xen_pv_post_suspend(cancelled);
 	else
 		xen_hvm_post_suspend(cancelled);
+
+	xen_restore_time_memory_area();
 }
 
 static void xen_vcpu_notify_restore(void *data)
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 1ecb05db3632..3bf72b933825 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -370,6 +370,105 @@ static const struct pv_time_ops xen_time_ops __initconst = {
 	.steal_clock = xen_steal_clock,
 };
 
+static struct pvclock_vsyscall_time_info *xen_clock __read_mostly;
+
+void xen_save_time_memory_area(void)
+{
+	struct vcpu_register_time_memory_area t;
+	int ret;
+
+	if (!xen_clock)
+		return;
+
+	t.addr.v = NULL;
+
+	ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
+	if (ret != 0)
+		pr_notice("Cannot save secondary vcpu_time_info (err %d)",
+			  ret);
+	else
+		clear_page(xen_clock);
+}
+
+void xen_restore_time_memory_area(void)
+{
+	struct vcpu_register_time_memory_area t;
+	int ret;
+
+	if (!xen_clock)
+		return;
+
+	t.addr.v = &xen_clock->pvti;
+
+	ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
+
+	/*
+	 * We don't disable VCLOCK_PVCLOCK entirely if it fails to register the
+	 * secondary time info with Xen or if we migrated to a host without the
+	 * necessary flags. On both of these cases what happens is either
+	 * process seeing a zeroed out pvti or seeing no PVCLOCK_TSC_STABLE_BIT
+	 * bit set. Userspace checks the latter and if 0, it discards the data
+	 * in pvti and falls back to a system call for a reliable timestamp.
+	 */
+	if (ret != 0)
+		pr_notice("Cannot restore secondary vcpu_time_info (err %d)",
+			  ret);
+}
+
+static void xen_setup_vsyscall_time_info(void)
+{
+	struct vcpu_register_time_memory_area t;
+	struct pvclock_vsyscall_time_info *ti;
+	struct pvclock_vcpu_time_info *pvti;
+	int ret;
+
+ 

[Xen-devel] [PATCH v4 1/3] x86/pvclock: add setter for pvclock_pvti_cpu0_va

2017-09-27 Thread Joao Martins
Right now there is only a pvclock_pvti_cpu0_va() which is defined
on kvmclock since:

commit dac16fba6fc5
("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")

The only user of this interface so far is kvm. This commit adds a
setter function for the pvti page and moves pvclock_pvti_cpu0_va
to pvclock, which is a more generic place to have it; and would
allow other PV clocksources to use it, such as Xen.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Acked-by: Andy Lutomirski <l...@kernel.org>
---
Changes since v1:
 * Rebased: the only conflict was that I had to move the export of the
 pvclock_pvti_cpu0_va() symbol as it is used by the kvm PTP driver.
 * Do not initialize pvti_cpu0_va to NULL (checkpatch error)
 ( Comments from Andy Lutomirski )
 * Removed asm/pvclock.h 'pvclock_set_pvti_cpu0_va' definition
 for non !PARAVIRT_CLOCK to better track screwed Kconfig stuff.
 * Add his Acked-by (provided the previous adjustment was made)

Changes since RFC:
 (Comments from Andy Lutomirski)
 * Add __init to pvclock_set_pvti_cpu0_va
 * Add WARN_ON(vclock_was_used(VCLOCK_PVCLOCK)) to
 pvclock_set_pvti_cpu0_va
---
 arch/x86/include/asm/pvclock.h | 19 ++-
 arch/x86/kernel/kvmclock.c |  7 +--
 arch/x86/kernel/pvclock.c  | 14 ++
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 448cfe1b48cf..6f228f90cdd7 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -4,15 +4,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_KVM_GUEST
-extern struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
-#else
-static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
-{
-   return NULL;
-}
-#endif
-
 /* some helper functions for xen and kvm pv clock sources */
 u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
 u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
@@ -101,4 +92,14 @@ struct pvclock_vsyscall_time_info {
 
 #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti);
+struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
+#else
+static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+   return NULL;
+}
+#endif
+
 #endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index d88967659098..538738047ff5 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -47,12 +47,6 @@ early_param("no-kvmclock", parse_no_kvmclock);
 static struct pvclock_vsyscall_time_info *hv_clock;
 static struct pvclock_wall_clock wall_clock;
 
-struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
-{
-   return hv_clock;
-}
-EXPORT_SYMBOL_GPL(pvclock_pvti_cpu0_va);
-
 /*
  * The wallclock is the time of day when we booted. Since then, some time may
  * have elapsed since the hypervisor wrote the data. So we try to account for
@@ -334,6 +328,7 @@ int __init kvm_setup_vsyscall_timeinfo(void)
 		return 1;
 	}
 
+	pvclock_set_pvti_cpu0_va(hv_clock);
 	put_cpu();
 
 	kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 5c3f6d6a5078..cb7d6d9c9c2d 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -25,8 +25,10 @@
 
 #include 
 #include 
+#include 
 
 static u8 valid_flags __read_mostly = 0;
+static struct pvclock_vsyscall_time_info *pvti_cpu0_va __read_mostly;
 
 void pvclock_set_flags(u8 flags)
 {
@@ -144,3 +146,15 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
 
 	set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
+
+void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
+{
+   WARN_ON(vclock_was_used(VCLOCK_PVCLOCK));
+   pvti_cpu0_va = pvti;
+}
+
+struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+   return pvti_cpu0_va;
+}
+EXPORT_SYMBOL_GPL(pvclock_pvti_cpu0_va);
-- 
2.11.0
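
As a standalone illustration of the getter/setter arrangement this patch moves
into the common pvclock code, the sketch below keeps one pointer that any PV
clocksource can publish during init and that the vdso mapping code can read
later. Everything here is hypothetical naming for the example (time_page_sketch,
set_cpu0_page(), get_cpu0_page(), mark_vclock_used()); the boolean return value
stands in for the WARN_ON in the patch, and like the patch the pointer is still
stored when the warning condition holds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct time_page_sketch {
	unsigned char bytes[64];	/* placeholder for the time info page */
};

static struct time_page_sketch *cpu0_page;	/* NULL until a clocksource sets it */
static bool vclock_used;			/* set once userspace consumed the clock */

/* Hypothetical hook marking that the vdso clock mode was already used. */
void mark_vclock_used(void)
{
	vclock_used = true;
}

/* Returns true if the call came too late, i.e. the warning would fire. */
bool set_cpu0_page(struct time_page_sketch *page)
{
	bool too_late = vclock_used;

	assert(page != NULL);	/* sketch-only sanity check */
	cpu0_page = page;
	return too_late;
}

struct time_page_sketch *get_cpu0_page(void)
{
	return cpu0_page;
}
```

With this shape, either KVM or Xen init code calls the setter once, and the
consumer only ever sees NULL or a fully registered page.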




[Xen-devel] [PATCH v4 3/3] MAINTAINERS: xen, kvm: track pvclock-abi.h changes

2017-09-27 Thread Joao Martins
This file defines an ABI shared between guest and hypervisor(s)
(KVM, Xen) and as such there should be a corresponding entry in the
MAINTAINERS file. Notice that there's already a text notice at the
top of the header file, hence this commit simply enforces it more
explicitly and gets both peers notified when such changes happen.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Acked-by: Juergen Gross <jgr...@suse.com>
---
In the end, I chose the originally posted version because this is so far the
only ABI shared between Xen/KVM. Therefore whenever we have more things
shared it would deserve its own place in MAINTAINERS file. If the
thinking is wrong, I can switch to the alternative with a
"PARAVIRT ABIS" section.

Changes since v1:
 * Add Juergen Gross Acked-by.
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 6671f375f7fc..a4834c3c377a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7603,6 +7603,7 @@ S:Supported
 F: arch/x86/kvm/
 F: arch/x86/include/uapi/asm/kvm*
 F: arch/x86/include/asm/kvm*
+F: arch/x86/include/asm/pvclock-abi.h
 F: arch/x86/kernel/kvm.c
 F: arch/x86/kernel/kvmclock.c
 
@@ -14718,6 +14719,7 @@ F:  arch/x86/xen/
 F: drivers/*/xen-*front.c
 F: drivers/xen/
 F: arch/x86/include/asm/xen/
+F: arch/x86/include/asm/pvclock-abi.h
 F: include/xen/
 F: include/uapi/xen/
 F: Documentation/ABI/stable/sysfs-hypervisor-xen
-- 
2.11.0




[Xen-devel] [PATCH v4 0/3] x86/xen: pvclock vdso support

2017-09-27 Thread Joao Martins
Hey,

This is take 4 of vdso support for Xen. PVCLOCK_TSC_STABLE_BIT, which is
required for vdso time-related calls, can be set starting with Xen 4.8. To
have it on, the hypervisor clocksource needs to be TSC, e.g. with the
following boot params: "clocksource=tsc tsc=stable:socket".

The series is structured as follows:

Patch 1 streamlines pvti page get/set in pvclock for both of its users
Patch 2 registers the pvti page with Xen and sets it in pvclock accordingly
Patch 3 adds a file to KVM/Xen maintainers for tracking pvclock ABI changes.

Changelog is included in individual patches.
(only patch 2 changed in this version)

Thanks,
Joao

Joao Martins (3):
  x86/pvclock: add setter for pvclock_pvti_cpu0_va
  x86/xen/time: setup vcpu 0 time info page
  MAINTAINERS: xen, kvm: track pvclock-abi.h changes

 MAINTAINERS|   2 +
 arch/x86/include/asm/pvclock.h |  19 
 arch/x86/kernel/kvmclock.c |   7 +--
 arch/x86/kernel/pvclock.c  |  14 ++
 arch/x86/xen/suspend.c |   4 ++
 arch/x86/xen/time.c| 100 +
 arch/x86/xen/xen-ops.h |   2 +
 include/xen/interface/vcpu.h   |  42 +
 8 files changed, 175 insertions(+), 15 deletions(-)

-- 
2.11.0




[Xen-devel] [PATCH v3 3/3] MAINTAINERS: xen, kvm: track pvclock-abi.h changes

2017-09-27 Thread Joao Martins
This file defines an ABI shared between guest and hypervisor(s)
(KVM, Xen) and as such there should be a corresponding entry in the
MAINTAINERS file. Notice that there's already a text notice at the
top of the header file, hence this commit simply enforces it more
explicitly and gets both peers notified when such changes happen.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Acked-by: Juergen Gross <jgr...@suse.com>
---
In the end, I chose the originally posted version because this is so far the
only ABI shared between Xen/KVM. Whenever we have more things shared,
they would deserve their own place in the MAINTAINERS file. If this
thinking is wrong, I can switch to the alternative with a
"PARAVIRT ABIS" section.

Changes since v1:
 * Add Juergen Gross Acked-by.
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 6671f375f7fc..a4834c3c377a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7603,6 +7603,7 @@ S:Supported
 F: arch/x86/kvm/
 F: arch/x86/include/uapi/asm/kvm*
 F: arch/x86/include/asm/kvm*
+F: arch/x86/include/asm/pvclock-abi.h
 F: arch/x86/kernel/kvm.c
 F: arch/x86/kernel/kvmclock.c
 
@@ -14718,6 +14719,7 @@ F:  arch/x86/xen/
 F: drivers/*/xen-*front.c
 F: drivers/xen/
 F: arch/x86/include/asm/xen/
+F: arch/x86/include/asm/pvclock-abi.h
 F: include/xen/
 F: include/uapi/xen/
 F: Documentation/ABI/stable/sysfs-hypervisor-xen
-- 
2.11.0




[Xen-devel] [PATCH v3 0/3] x86/xen: pvclock vdso support

2017-09-27 Thread Joao Martins
Hey,

This is take 3 of vdso support for Xen. PVCLOCK_TSC_STABLE_BIT, which is
required for vdso time related calls, can be set starting with Xen 4.8. In
order to have it on, you need to have the hypervisor clocksource be TSC, e.g.
with the following boot params "clocksource=tsc tsc=stable:socket".

The series is structured as follows:

Patch 1 streamlines pvti page get/set in pvclock for both of its users
Patch 2 registers the pvti page on Xen and sets it in pvclock accordingly
Patch 3 adds a file to KVM/Xen maintainers for tracking pvclock ABI changes.

Changelog is included in individual patches.
(only patch 2 changed in this version)

Thanks,
Joao

Joao Martins (3):
  x86/pvclock: add setter for pvclock_pvti_cpu0_va
  x86/xen/time: setup vcpu 0 time info page
  MAINTAINERS: xen, kvm: track pvclock-abi.h changes

 MAINTAINERS                    |   2 +
 arch/x86/include/asm/pvclock.h |  19 
 arch/x86/kernel/kvmclock.c     |   7 +--
 arch/x86/kernel/pvclock.c      |  14 ++
 arch/x86/xen/suspend.c         |   4 ++
 arch/x86/xen/time.c            | 100 +
 arch/x86/xen/xen-ops.h         |   2 +
 include/xen/interface/vcpu.h   |  42 +
 8 files changed, 175 insertions(+), 15 deletions(-)

-- 
2.11.0




[Xen-devel] [PATCH v3 2/3] x86/xen/time: setup vcpu 0 time info page

2017-09-27 Thread Joao Martins
In order to support pvclock vdso on Xen we need to set up the time
info page for vcpu 0 and register the page with Xen using the
VCPUOP_register_vcpu_time_memory_area hypercall. This hypercall
will also forcefully update the pvti, which will set some of the
necessary flags for vdso. Afterwards we check if it supports the
PVCLOCK_TSC_STABLE_BIT flag, which is mandatory for having
vdso/vsyscall support. If so, we set the cpu 0 pvti that
will later be used when mapping the vdso image.

The xen headers are also updated to include the new hypercall for
registering the secondary vcpu_time_info struct.
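
As a rough illustration of the lifecycle this patch introduces (register at
setup, unregister and clear on suspend, re-register the same page on restore),
here is a userspace simulation. It is only a sketch: mock_register_time_area()
is a hypothetical stand-in for the VCPUOP_register_vcpu_time_memory_area
hypercall, sketch_clock plays the role of xen_clock, and a static buffer
replaces the page allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

static unsigned char sketch_page[SKETCH_PAGE_SIZE];

/* What the (mock) hypervisor currently has registered; NULL means none. */
static void *mock_registered_area;

/* Plays the role of xen_clock: non-NULL once setup has succeeded. */
static void *sketch_clock;

/* Hypothetical mock of the hypercall; a NULL address unregisters. */
static int mock_register_time_area(void *addr)
{
	/* This sketch only ever registers its single page. */
	assert(addr == NULL || addr == sketch_page);
	mock_registered_area = addr;
	return 0;
}

static int sketch_setup(void)
{
	int ret = mock_register_time_area(sketch_page);

	if (ret)
		return ret;

	sketch_clock = sketch_page;
	return 0;
}

/* Suspend path: unregister, then wipe the stale time info. */
static void sketch_save(void)
{
	if (!sketch_clock)
		return;

	if (mock_register_time_area(NULL) == 0)
		memset(sketch_clock, 0, SKETCH_PAGE_SIZE);
}

/* Resume path: re-register the same page, no re-allocation needed. */
static void sketch_restore(void)
{
	if (!sketch_clock)
		return;

	(void)mock_register_time_area(sketch_clock);
}
```

Keeping the page pointer outside the setup routine is what lets the restore
path re-register without allocating again, which matches the v1 changelog
entry below.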

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
Changes since v2:
 (Comments from Juergen)
 * Omit the blank after the cast on all 3 occurrences.
 * Change last VCLOCK_PVCLOCK message to be more descriptive
 * Sync the complete vcpu.h header instead of just adding the
 needed one. (IOW adding VCPUOP_get_physid)

Changes since v1:
 * Check flags ahead to see if the  primary clock can use
 PVCLOCK_TSC_STABLE_BIT even if secondary registration fails.
 (Comments from Boris)
 * Remove addr, addr variables;
 * Change first pr_debug to pr_warn;
 * Change last pr_debug to pr_notice;
 * Add routine to solely register secondary time info.
 * Move xen_clock to outside xen_setup_vsyscall_time_info to allow
 restore path to simply re-register secondary time info. Let us
 handle the restore path more gracefully without re-allocating a
 page.
 * Removed cpu argument from xen_setup_vsyscall_time_info()
 * Adjust failed registration error messages/loglevel to be the same
 * Also teardown secondary time info on suspend

Changes since RFC:
 (Comments from Boris and David)
 * Remove Kconfig option
 * Use get_zeroed_page/free_page
 * Remove the hypercall availability check
 * Unregister pvti with arg.addr.v = NULL if stable bit isn't supported.
 (New)
 * Set secondary copy on restore such that it works on migration.
 * Drop global xen_clock variable and stash it locally on
 xen_setup_vsyscall_time_info.
 * WARN_ON(ret) if we fail to unregister the pvti.
---
 arch/x86/xen/suspend.c   |   4 ++
 arch/x86/xen/time.c  | 100 +++
 arch/x86/xen/xen-ops.h   |   2 +
 include/xen/interface/vcpu.h |  42 ++
 4 files changed, 148 insertions(+)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index d6b1680693a9..800ed36ecfba 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -16,6 +16,8 @@
 
 void xen_arch_pre_suspend(void)
 {
+   xen_save_time_memory_area();
+
if (xen_pv_domain())
xen_pv_pre_suspend();
 }
@@ -26,6 +28,8 @@ void xen_arch_post_suspend(int cancelled)
xen_pv_post_suspend(cancelled);
else
xen_hvm_post_suspend(cancelled);
+
+   xen_restore_time_memory_area();
 }
 
 static void xen_vcpu_notify_restore(void *data)
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 1ecb05db3632..3bf72b933825 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -370,6 +370,105 @@ static const struct pv_time_ops xen_time_ops __initconst = {
.steal_clock = xen_steal_clock,
 };
 
+static struct pvclock_vsyscall_time_info *xen_clock __read_mostly;
+
+void xen_save_time_memory_area(void)
+{
+   struct vcpu_register_time_memory_area t;
+   int ret;
+
+   if (!xen_clock)
+   return;
+
+   t.addr.v = NULL;
+
+   ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
+   if (ret != 0)
+   pr_notice("Cannot save secondary vcpu_time_info (err %d)",
+ ret);
+   else
+   clear_page(xen_clock);
+}
+
+void xen_restore_time_memory_area(void)
+{
+   struct vcpu_register_time_memory_area t;
+   int ret;
+
+   if (!xen_clock)
+   return;
+
+   t.addr.v = &xen_clock->pvti;
+
+   ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
+
+   /*
+* We don't disable VCLOCK_PVCLOCK entirely if it fails to register the
+* secondary time info with Xen or if we migrated to a host without the
+* necessary flags. On both of these cases what happens is either
+* process seeing a zeroed out pvti or seeing no PVCLOCK_TSC_STABLE_BIT
+* bit set. Userspace checks the latter and if 0, it discards the data
+* in pvti and fallbacks to a system call for a reliable timestamp.
+*/
+   if (ret != 0)
+   pr_notice("Cannot restore secondary vcpu_time_info (err %d)",
+ ret);
+}
+
+static void xen_setup_vsyscall_time_info(void)
+{
+   struct vcpu_register_time_memory_area t;
+   struct pvclock_vsyscall_time_info *ti;
+   struct pvclock_vcpu_time_info *pvti;
+   int ret;
+
+   pvti = &__this_cpu_read(xen_vcpu)->time;
+
+   /*
+* We check ahead on the primary time info if th

[Xen-devel] [PATCH v3 1/3] x86/pvclock: add setter for pvclock_pvti_cpu0_va

2017-09-27 Thread Joao Martins
Right now there is only a pvclock_pvti_cpu0_va() which is defined
on kvmclock since:

commit dac16fba6fc5
("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")

The only user of this interface so far is kvm. This commit adds a
setter function for the pvti page and moves pvclock_pvti_cpu0_va
to pvclock, which is a more generic place to have it; and would
allow other PV clocksources to use it, such as Xen.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Acked-by: Andy Lutomirski <l...@kernel.org>
---
Changes since v1:
 * Rebased: the only conflict was that I had to move the exported
 pvclock_pvti_cpu0_va() symbol, as it is used by the kvm PTP driver.
 * Do not initialize pvti_cpu0_va to NULL (checkpatch error)
 ( Comments from Andy Lutomirski )
 * Removed asm/pvclock.h 'pvclock_set_pvti_cpu0_va' definition
 for !PARAVIRT_CLOCK to better track screwed Kconfig stuff.
 * Add his Acked-by (provided the previous adjustment was made)

Changes since RFC:
 (Comments from Andy Lutomirski)
 * Add __init to pvclock_set_pvti_cpu0_va
 * Add WARN_ON(vclock_was_used(VCLOCK_PVCLOCK)) to
 pvclock_set_pvti_cpu0_va
---
 arch/x86/include/asm/pvclock.h | 19 ++-
 arch/x86/kernel/kvmclock.c |  7 +--
 arch/x86/kernel/pvclock.c  | 14 ++
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 448cfe1b48cf..6f228f90cdd7 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -4,15 +4,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_KVM_GUEST
-extern struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
-#else
-static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
-{
-   return NULL;
-}
-#endif
-
 /* some helper functions for xen and kvm pv clock sources */
 u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
 u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
@@ -101,4 +92,14 @@ struct pvclock_vsyscall_time_info {
 
 #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti);
+struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
+#else
+static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+   return NULL;
+}
+#endif
+
 #endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index d88967659098..538738047ff5 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -47,12 +47,6 @@ early_param("no-kvmclock", parse_no_kvmclock);
 static struct pvclock_vsyscall_time_info *hv_clock;
 static struct pvclock_wall_clock wall_clock;
 
-struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
-{
-   return hv_clock;
-}
-EXPORT_SYMBOL_GPL(pvclock_pvti_cpu0_va);
-
 /*
  * The wallclock is the time of day when we booted. Since then, some time may
  * have elapsed since the hypervisor wrote the data. So we try to account for
@@ -334,6 +328,7 @@ int __init kvm_setup_vsyscall_timeinfo(void)
return 1;
}
 
+   pvclock_set_pvti_cpu0_va(hv_clock);
put_cpu();
 
kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 5c3f6d6a5078..cb7d6d9c9c2d 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -25,8 +25,10 @@
 
 #include 
 #include 
+#include 
 
 static u8 valid_flags __read_mostly = 0;
+static struct pvclock_vsyscall_time_info *pvti_cpu0_va __read_mostly;
 
 void pvclock_set_flags(u8 flags)
 {
@@ -144,3 +146,15 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
 
set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
+
+void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
+{
+   WARN_ON(vclock_was_used(VCLOCK_PVCLOCK));
+   pvti_cpu0_va = pvti;
+}
+
+struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+   return pvti_cpu0_va;
+}
+EXPORT_SYMBOL_GPL(pvclock_pvti_cpu0_va);
-- 
2.11.0




Re: [Xen-devel] [PATCH v2 2/3] x86/xen/time: setup vcpu 0 time info page

2017-09-26 Thread Joao Martins
On 09/26/2017 10:32 AM, Juergen Gross wrote:
> On 22/09/17 18:25, Joao Martins wrote:
[snip]
>> +static void xen_setup_vsyscall_time_info(void)
>> +{
>> +struct vcpu_register_time_memory_area t;
>> +struct pvclock_vsyscall_time_info *ti;
>> +struct pvclock_vcpu_time_info *pvti;
>> +int ret;
>> +
>> +pvti = &__this_cpu_read(xen_vcpu)->time;
>> +
>> +/*
>> + * We check ahead on the primary time info if this
>> + * bit is supported hence speeding up Xen clocksource.
>> + */
>> +if (!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))
>> +return;
>> +
>> +pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
>> +
>> +ti = (struct pvclock_vsyscall_time_info *) get_zeroed_page(GFP_KERNEL);
> 
> Coding style: omit the blank after the cast.
> 
OK.

>> +if (!ti)
>> +return;
>> +
>> +t.addr.v = >pvti;
>> +
>> +ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, );
>> +if (ret) {
>> +pr_notice("xen: VCLOCK_PVCLOCK not supported (err %d)\n", ret);
>> +free_page((unsigned long) ti);
> 
> Coding style again, once more below.
> 
OK.

>> +return;
>> +}
>> +
>> +/*
>> + * If the check above succedded this one should too since it's the
>> + * same data on both primary and secondary time infos just different
>> + * memory regions. But we still check it in case hypervisor is buggy.
>> + */
>> +pvti = >pvti;
>> +if (!(pvti->flags & PVCLOCK_TSC_STABLE_BIT)) {
>> +t.addr.v = NULL;
>> +ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area,
>> + 0, );
>> +if (!ret)
>> +free_page((unsigned long) ti);
>> +
>> +pr_notice("xen: VCLOCK_PVCLOCK not supported (err %d)\n", ret);
> 
> Mind making the message more descriptive? E.g. instead of reporting
> "(err 0)" just telling "(tsc unstable)"?
> 
Got it.

>> +return;
>> +}
>> +
>> +xen_clock = ti;
>> +pvclock_set_pvti_cpu0_va(xen_clock);
>> +
>> +xen_clocksource.archdata.vclock_mode = VCLOCK_PVCLOCK;
>> +}
>> +
>>  static void __init xen_time_init(void)
>>  {
>>  int cpu = smp_processor_id();
>> @@ -396,6 +495,7 @@ static void __init xen_time_init(void)
>>  setup_force_cpu_cap(X86_FEATURE_TSC);
>>  
>>  xen_setup_runstate_info(cpu);
>> +xen_setup_vsyscall_time_info();
>>  xen_setup_timer(cpu);
>>  xen_setup_cpu_clockevents();
>>  
>> diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
>> index c8a6d224f7ed..f96dbedb33d4 100644
>> --- a/arch/x86/xen/xen-ops.h
>> +++ b/arch/x86/xen/xen-ops.h
>> @@ -69,6 +69,8 @@ void xen_setup_runstate_info(int cpu);
>>  void xen_teardown_timer(int cpu);
>>  u64 xen_clocksource_read(void);
>>  void xen_setup_cpu_clockevents(void);
>> +void xen_save_time_memory_area(void);
>> +void xen_restore_time_memory_area(void);
>>  void __init xen_init_time_ops(void);
>>  void __init xen_hvm_init_time_ops(void);
>>  
>> diff --git a/include/xen/interface/vcpu.h b/include/xen/interface/vcpu.h
>> index 98188c87f5c1..8da788c5bd4f 100644
>> --- a/include/xen/interface/vcpu.h
>> +++ b/include/xen/interface/vcpu.h
>> @@ -178,4 +178,32 @@ DEFINE_GUEST_HANDLE_STRUCT(vcpu_register_vcpu_info);
>>  
>>  /* Send an NMI to the specified VCPU. @extra_arg == NULL. */
>>  #define VCPUOP_send_nmi 11
>> +
>> +/*
>> + * Register a memory location to get a secondary copy of the vcpu time
>> + * parameters.  The master copy still exists as part of the vcpu shared
>> + * memory area, and this secondary copy is updated whenever the master copy
>> + * is updated (and using the same versioning scheme for synchronisation).
>> + *
>> + * The intent is that this copy may be mapped (RO) into userspace so
>> + * that usermode can compute system time using the time info and the
>> + * tsc.  Usermode will see an array of vcpu_time_info structures, one
>> + * for each vcpu, and choose the right one by an existing mechanism
>> + * which allows it to get the current vcpu number (such as via a
>> + * segment limit).  It can then apply the normal algorithm to compute
>> + * system time from the tsc.
>> + *
>> + * @extra_arg == pointer to vcpu_register_time_info_memory_area stru

[Xen-devel] [PATCH v2 1/3] x86/pvclock: add setter for pvclock_pvti_cpu0_va

2017-09-22 Thread Joao Martins
Right now there is only a pvclock_pvti_cpu0_va() which is defined
on kvmclock since:

commit dac16fba6fc5
("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")

The only user of this interface so far is kvm. This commit adds a
setter function for the pvti page and moves pvclock_pvti_cpu0_va
to pvclock, which is a more generic place to have it; and would
allow other PV clocksources to use it, such as Xen.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Acked-by: Andy Lutomirski <l...@kernel.org>
---
Changes since v1:
 * Rebased: the only conflict was that I had to move the exported
 pvclock_pvti_cpu0_va() symbol, as it is used by the kvm PTP driver.
 * Do not initialize pvti_cpu0_va to NULL (checkpatch error)

 ( Comments from Andy Lutomirski )
 * Removed asm/pvclock.h 'pvclock_set_pvti_cpu0_va' definition
 for !PARAVIRT_CLOCK to better track screwed Kconfig stuff.
 * Add his Acked-by (provided the previous adjustment was made)

Changes since RFC:
 (Comments from Andy Lutomirski)
 * Add __init to pvclock_set_pvti_cpu0_va
 * Add WARN_ON(vclock_was_used(VCLOCK_PVCLOCK)) to
 pvclock_set_pvti_cpu0_va
---
 arch/x86/include/asm/pvclock.h | 19 ++-
 arch/x86/kernel/kvmclock.c |  7 +--
 arch/x86/kernel/pvclock.c  | 14 ++
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 448cfe1b48cf..6f228f90cdd7 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -4,15 +4,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_KVM_GUEST
-extern struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
-#else
-static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
-{
-   return NULL;
-}
-#endif
-
 /* some helper functions for xen and kvm pv clock sources */
 u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
 u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
@@ -101,4 +92,14 @@ struct pvclock_vsyscall_time_info {
 
 #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti);
+struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
+#else
+static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+   return NULL;
+}
+#endif
+
 #endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index d88967659098..538738047ff5 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -47,12 +47,6 @@ early_param("no-kvmclock", parse_no_kvmclock);
 static struct pvclock_vsyscall_time_info *hv_clock;
 static struct pvclock_wall_clock wall_clock;
 
-struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
-{
-   return hv_clock;
-}
-EXPORT_SYMBOL_GPL(pvclock_pvti_cpu0_va);
-
 /*
  * The wallclock is the time of day when we booted. Since then, some time may
  * have elapsed since the hypervisor wrote the data. So we try to account for
@@ -334,6 +328,7 @@ int __init kvm_setup_vsyscall_timeinfo(void)
return 1;
}
 
+   pvclock_set_pvti_cpu0_va(hv_clock);
put_cpu();
 
kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 5c3f6d6a5078..cb7d6d9c9c2d 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -25,8 +25,10 @@
 
 #include 
 #include 
+#include 
 
 static u8 valid_flags __read_mostly = 0;
+static struct pvclock_vsyscall_time_info *pvti_cpu0_va __read_mostly;
 
 void pvclock_set_flags(u8 flags)
 {
@@ -144,3 +146,15 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
 
set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
+
+void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
+{
+   WARN_ON(vclock_was_used(VCLOCK_PVCLOCK));
+   pvti_cpu0_va = pvti;
+}
+
+struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+   return pvti_cpu0_va;
+}
+EXPORT_SYMBOL_GPL(pvclock_pvti_cpu0_va);
-- 
2.11.0




[Xen-devel] [PATCH v2 2/3] x86/xen/time: setup vcpu 0 time info page

2017-09-22 Thread Joao Martins
In order to support pvclock vdso on Xen we need to set up the time
info page for vcpu 0 and register the page with Xen using the
VCPUOP_register_vcpu_time_memory_area hypercall. This hypercall
will also forcefully update the pvti, which will set some of the
necessary flags for vdso. Afterwards we check if it supports the
PVCLOCK_TSC_STABLE_BIT flag, which is mandatory for having
vdso/vsyscall support. If so, we set the cpu 0 pvti that
will later be used when mapping the vdso image.

The xen headers are also updated to include the new hypercall for
registering the secondary vcpu_time_info struct.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
Changes since v1:
 * Check flags ahead to see if the  primary clock can use
 PVCLOCK_TSC_STABLE_BIT even if secondary registration fails.

 (Comments from Boris)
 * Remove addr, addr variables;
 * Change first pr_debug to pr_warn;
 * Change last pr_debug to pr_notice;
 * Add routine to solely register secondary time info.
 * Move xen_clock to outside xen_setup_vsyscall_time_info to allow
 restore path to simply re-register secondary time info. Let us
 handle the restore path more gracefully without re-allocating a
 page.
 * Removed cpu argument from xen_setup_vsyscall_time_info()
 * Adjust failed registration error messages/loglevel to be the same
 * Also teardown secondary time info on suspend

Changes since RFC:
 (Comments from Boris and David)
 * Remove Kconfig option
 * Use get_zeroed_page/free_page
 * Remove the hypercall availability check
 * Unregister pvti with arg.addr.v = NULL if stable bit isn't supported.
 (New)
 * Set secondary copy on restore such that it works on migration.
 * Drop global xen_clock variable and stash it locally on
 xen_setup_vsyscall_time_info.
 * WARN_ON(ret) if we fail to unregister the pvti.
---
 arch/x86/xen/suspend.c   |   4 ++
 arch/x86/xen/time.c  | 100 +++
 arch/x86/xen/xen-ops.h   |   2 +
 include/xen/interface/vcpu.h |  28 
 4 files changed, 134 insertions(+)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index d6b1680693a9..800ed36ecfba 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -16,6 +16,8 @@
 
 void xen_arch_pre_suspend(void)
 {
+   xen_save_time_memory_area();
+
if (xen_pv_domain())
xen_pv_pre_suspend();
 }
@@ -26,6 +28,8 @@ void xen_arch_post_suspend(int cancelled)
xen_pv_post_suspend(cancelled);
else
xen_hvm_post_suspend(cancelled);
+
+   xen_restore_time_memory_area();
 }
 
 static void xen_vcpu_notify_restore(void *data)
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 1ecb05db3632..2924b97691c6 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -370,6 +370,105 @@ static const struct pv_time_ops xen_time_ops __initconst = {
.steal_clock = xen_steal_clock,
 };
 
+static struct pvclock_vsyscall_time_info *xen_clock __read_mostly;
+
+void xen_save_time_memory_area(void)
+{
+   struct vcpu_register_time_memory_area t;
+   int ret;
+
+   if (!xen_clock)
+   return;
+
+   t.addr.v = NULL;
+
+   ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
+   if (ret != 0)
+   pr_notice("Cannot save secondary vcpu_time_info (err %d)",
+ ret);
+   else
+   clear_page(xen_clock);
+}
+
+void xen_restore_time_memory_area(void)
+{
+   struct vcpu_register_time_memory_area t;
+   int ret;
+
+   if (!xen_clock)
+   return;
+
+   t.addr.v = &xen_clock->pvti;
+
+   ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area, 0, &t);
+
+   /*
+* We don't disable VCLOCK_PVCLOCK entirely if it fails to register the
+* secondary time info with Xen or if we migrated to a host without the
+* necessary flags. On both of these cases what happens is either
+* process seeing a zeroed out pvti or seeing no PVCLOCK_TSC_STABLE_BIT
+* bit set. Userspace checks the latter and if 0, it discards the data
+* in pvti and fallbacks to a system call for a reliable timestamp.
+*/
+   if (ret != 0)
+   pr_notice("Cannot restore secondary vcpu_time_info (err %d)",
+ ret);
+}
+
+static void xen_setup_vsyscall_time_info(void)
+{
+   struct vcpu_register_time_memory_area t;
+   struct pvclock_vsyscall_time_info *ti;
+   struct pvclock_vcpu_time_info *pvti;
+   int ret;
+
+   pvti = &__this_cpu_read(xen_vcpu)->time;
+
+   /*
+* We check ahead on the primary time info if this
+* bit is supported hence speeding up Xen clocksource.
+*/
+   if (!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))
+   return;
+
+   pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
+
+   ti = (struct pvclock_vsyscall_time_info *) g

[Xen-devel] [PATCH v2 0/3] x86/xen: pvclock vdso support

2017-09-22 Thread Joao Martins
Hey,

Sorry for the huge delay in following up this series.

This is take 2 of vdso support for Xen. PVCLOCK_TSC_STABLE_BIT, which is
required for vdso time related calls, can be set starting with Xen 4.8. In
order to have it on, you need to have the hypervisor clocksource be TSC, e.g.
with the following boot params "clocksource=tsc tsc=stable:socket".

The series is structured as follows:

Patch 1 streamlines pvti page get/set in pvclock for both of its users
Patch 2 registers the pvti page on Xen and sets it in pvclock accordingly
Patch 3 adds a file to KVM/Xen maintainers for tracking pvclock ABI changes.

Changelog since v1 is included in individual patches.

Any comments/suggestions are welcome.

Thanks,
Joao


Joao Martins (3):
  x86/pvclock: add setter for pvclock_pvti_cpu0_va
  x86/xen/time: setup vcpu 0 time info page
  MAINTAINERS: xen, kvm: track pvclock-abi.h changes

 MAINTAINERS                    |   2 +
 arch/x86/include/asm/pvclock.h |  19 
 arch/x86/kernel/kvmclock.c     |   7 +--
 arch/x86/kernel/pvclock.c      |  14 ++
 arch/x86/xen/suspend.c         |   4 ++
 arch/x86/xen/time.c            | 100 +
 arch/x86/xen/xen-ops.h         |   2 +
 include/xen/interface/vcpu.h   |  28 
 8 files changed, 161 insertions(+), 15 deletions(-)

-- 
2.11.0




[Xen-devel] [PATCH v2 3/3] MAINTAINERS: xen, kvm: track pvclock-abi.h changes

2017-09-22 Thread Joao Martins
This file defines an ABI shared between guest and hypervisor(s)
(KVM, Xen) and as such there should be a corresponding entry in the
MAINTAINERS file. Note that there is already a text notice at the
top of the header file, hence this commit simply enforces it more
explicitly and has both peers notified when such changes happen.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Acked-by: Juergen Gross <jgr...@suse.com>
---
Out of the two options (and provided I was given a choice) I chose the
originally posted one because this is so far the only ABI shared between Xen/KVM.
Whenever we have more things shared, it would probably deserve moving into its
own section in the MAINTAINERS file. If my thinking is wrong, I can switch to the
alternative, i.e. a "PARAVIRT ABIS" section.

Changes since v1:
 * Add Juergen Gross Acked-by.
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 2281af4b41b6..5a6c26c298b1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7592,6 +7592,7 @@ S:Supported
 F: arch/x86/kvm/
 F: arch/x86/include/uapi/asm/kvm*
 F: arch/x86/include/asm/kvm*
+F: arch/x86/include/asm/pvclock-abi.h
 F: arch/x86/kernel/kvm.c
 F: arch/x86/kernel/kvmclock.c
 
@@ -14708,6 +14709,7 @@ F:  arch/x86/xen/
 F: drivers/*/xen-*front.c
 F: drivers/xen/
 F: arch/x86/include/asm/xen/
+F: arch/x86/include/asm/pvclock-abi.h
 F: include/xen/
 F: include/uapi/xen/
 F: Documentation/ABI/stable/sysfs-hypervisor-xen
-- 
2.11.0




[Xen-devel] [PATCH v4 0/1] netif: staging grants for I/O requests

2017-09-19 Thread Joao Martins
Hey,

This is v4 taking into consideration all comments received from v3 (changelog
in the first patch). The specification is right after the diffstat.

Reference implementation also here (on top of net-next):

https://github.com/jpemartins/linux.git xen-net-stg-gnts-v3

Cheers,

Joao Martins (1):
  public/io/netif.h: add gref mapping control messages

 xen/include/public/io/netif.h | 115 ++
 1 file changed, 115 insertions(+)
---
% Staging grants for network I/O requests
% Joao Martins <<joao.m.mart...@oracle.com>>
% Revision 4

\clearpage


Architecture(s): Any


# Background and Motivation

At the Xen hackathon '16 networking session, we spoke about having a permanently
mapped region to describe the header/linear region of packet buffers. This document
outlines the proposal, covering its motivation and its applicability to other
use-cases, alongside the necessary changes.

The motivation of this work is to eliminate grant ops for packet I/O intensive
workloads such as those observed with smaller request sizes (i.e. <= 256 bytes
or <= MTU). Currently on Xen, only bulk transfers (e.g. 32K..64K packets)
perform really well (up to 80 Gbit/s on a few CPUs), usually
backing end-hosts and server appliances. Anything that involves higher packet
rates (<= 1500 MTU) or runs without sg performs badly, with throughput close
to 1 Gbit/s.

# Proposal

The proposal is to leverage the already implicit copy from and to packet linear
data on netfront and netback, and have it done instead from a permanently mapped
region. In some (physical) NICs this is known as header/data split.

Specifically, for some workloads (e.g. NFV) it would provide a big increase in
throughput when we switch to (zero)copying in the backend/frontend instead of
the grant hypercalls. Thus this extension aims at futureproofing the netif
protocol by adding the possibility of guests setting up a list of grants that
are set up at device creation and revoked at device freeing, without taking
too many grant entries into account for the general case (i.e. to cover only the
header region <= 256 bytes, 16 grants per ring), while remaining configurable by
the kernel when one wants to resort to a copy-based approach as opposed to
grant copy/map.

\clearpage

# General Operation

Here we describe how netback and netfront generally operate, and where the
proposed solution will fit. The security mechanism currently involves grant
references, which in essence are round-robin recycled 'tickets' stamped with
the GPFNs, permission attributes, and the authorized domain:

(This is an in-memory view of struct grant_entry_v1):

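 0     1     2     3     4     5     6     7 octet
+------------+-----------+------------------------+
| flags      | domain id | frame                  |
+------------+-----------+------------------------+

Where there are N grant entries in a grant table, for example:

@0:
+------------+-----------+------------------------+
| rw         | 0         | 0xABCDEF               |
+------------+-----------+------------------------+
| rw         | 0         | 0xFA124                |
+------------+-----------+------------------------+
| ro         | 1         | 0xBEEF                 |
+------------+-----------+------------------------+

  .
@N:
+------------+-----------+------------------------+
| rw         | 0         | 0x9923A                |
+------------+-----------+------------------------+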
Each entry consumes 8 bytes, therefore 512 entries can fit on one page.
`gnttab_max_frames` defaults to 32 pages, hence 16,384
grants. The ParaVirtualized (PV) drivers will use the grant reference (index
in the grant table - 0 .. N) in their command ring.
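
The arithmetic above can be checked with a small sketch. The struct below is a
hypothetical copy of the grant_entry_v1 layout shown in the diagram (2-octet
flags, 2-octet domain id, 4-octet frame); a 4096-byte page is assumed, and the
helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE         4096u /* assumed page size */
#define SKETCH_GNTTAB_MAX_FRAMES 32u   /* default quoted in the text */

/* Hypothetical copy of the 8-octet v1 grant entry layout. */
struct grant_entry_v1_sketch {
	uint16_t flags;
	uint16_t domid;
	uint32_t frame;
};

_Static_assert(sizeof(struct grant_entry_v1_sketch) == 8,
	       "a v1 grant entry is expected to be 8 octets");

/* Number of grant entries that fit in one grant table frame (page). */
static size_t grants_per_frame(void)
{
	return SKETCH_PAGE_SIZE / sizeof(struct grant_entry_v1_sketch);
}

/* Total number of grants available with a given number of frames. */
static size_t max_grants(size_t frames)
{
	return frames * grants_per_frame();
}

/* Map a grant reference (table index) to its frame and slot in it. */
static void gref_locate(uint32_t gref, size_t *frame, size_t *slot)
{
	assert(gref < max_grants(SKETCH_GNTTAB_MAX_FRAMES));
	*frame = gref / grants_per_frame();
	*slot = gref % grants_per_frame();
}
```

With these assumptions grants_per_frame() evaluates to 512 and
max_grants(SKETCH_GNTTAB_MAX_FRAMES) to 16384, matching the figures in the
text.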

\clearpage

## Guest Transmit

The view of the shared transmit ring is the following:

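 0     1     2     3     4     5     6     7 octet
+------------------------+------------------------+
| req_prod               | req_event              |
+------------------------+------------------------+
| rsp_prod               | rsp_event              |
+------------------------+------------------------+
| pvt                    | pad[44]                |
+------------------------+                        |
|                                                 | [64bytes]
+------------------------+------------------------+-\
| gref                   | offset    | flags      | |
+------------+-----------+------------------------+ +-'struct
| id         | size      | id        | status     | | netif_tx_sring_entry'
+-------------------------------------------------+-/
|/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/| .. N
+-------------------------------------------------+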
 
Each entry consumes 16 octe

[Xen-devel] [PATCH v4 1/1] public/io/netif.h: add gref mapping control messages

2017-09-19 Thread Joao Martins
Adds 3 messages to allow a guest to let the backend keep grants mapped,
such that 1) guests allowing fast recycling of pages can avoid doing
grant ops for those cases, 2) guests can prefer copies over
grants, and 3) guests can always use a fixed set of pages for network I/O.

The three control ring messages added are:
 - Add grefs to be mapped by backend
 - Remove gref mappings (if they are not in use)
 - Get maximum amount of grefs kept mapped.
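
To make the entry format concrete, here is a hedged sketch of how a frontend
might fill the gref table before sending an ADD request. The message type
values and the 8-octet entry layout (4-octet grant ref, 2-octet flags, 2-octet
status) come from this patch; everything else is an assumption for
illustration: the names xen_netif_gref_sketch and fill_gref_table, the
4096-byte page, and the idea that the limit passed in was obtained from a
GET_GREF_MAPPING_SIZE query. How the table is handed to the backend through
the control request is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t grant_ref_t;

#define XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE 8
#define XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING      9
#define XEN_NETIF_CTRL_TYPE_DEL_GREF_MAPPING     10

/* Hypothetical copy of the entry layout described in the header comment. */
struct xen_netif_gref_sketch {
	grant_ref_t ref;    /* IN:  grant reference */
	uint16_t    flags;  /* IN:  flags describing the operation */
	uint16_t    status; /* OUT: written by the backend */
};

_Static_assert(sizeof(struct xen_netif_gref_sketch) == 8,
	       "each table entry is expected to be 8 octets");

/* Entries that fit in one (assumed 4096-byte) page holding the table. */
#define SKETCH_ENTRIES_PER_PAGE (4096u / sizeof(struct xen_netif_gref_sketch))

/*
 * Fill 'table' with one entry per grant reference. Since ADD is 'all or
 * nothing', refuse up front (return 0) if the request exceeds either the
 * backend's advertised limit or the table page; otherwise return nrefs.
 */
static size_t fill_gref_table(struct xen_netif_gref_sketch *table,
			      const grant_ref_t *refs, size_t nrefs,
			      uint16_t flags, size_t backend_max)
{
	size_t i;

	assert(table != NULL && refs != NULL);

	if (nrefs > backend_max || nrefs > SKETCH_ENTRIES_PER_PAGE)
		return 0;

	for (i = 0; i < nrefs; i++) {
		table[i].ref = refs[i];
		table[i].flags = flags;
		/* 'status' is output only and need not be zeroed here. */
	}

	return nrefs;
}
```

After the backend processes the request, the frontend would inspect each
entry's status field to see which mappings were accepted.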

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
v4:
* Declare xen_netif_gref parameters as input or output.
* Clarify the status field and that it does not need to be set to zero
prior to its usage.
* Clarify that ADD_GREF_MAPPING is 'all or nothing'
* Improve last paragraph of DEL_GREF_MAPPING

v3:
* Use DEL for unmapping grefs instead of PUT
* Rname from xen_netif_gref_alloc to xen_netif_gref
* Add 'status' field on xen_netif_gref
* Clarify what 'inflight' means
* Use "beginning of the page" instead of "beginning of the grant"
* Mention that page needs to be r/w (as it will have to modify \.status)
---
 xen/include/public/io/netif.h | 123 ++
 1 file changed, 123 insertions(+)

diff --git a/xen/include/public/io/netif.h b/xen/include/public/io/netif.h
index ca0061410d..2454448baa 100644
--- a/xen/include/public/io/netif.h
+++ b/xen/include/public/io/netif.h
@@ -353,6 +353,9 @@ struct xen_netif_ctrl_request {
 #define XEN_NETIF_CTRL_TYPE_SET_HASH_MAPPING_SIZE 5
 #define XEN_NETIF_CTRL_TYPE_SET_HASH_MAPPING      6
 #define XEN_NETIF_CTRL_TYPE_SET_HASH_ALGORITHM    7
+#define XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE 8
+#define XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING      9
+#define XEN_NETIF_CTRL_TYPE_DEL_GREF_MAPPING     10
 
 uint32_t data[3];
 };
@@ -391,6 +394,44 @@ struct xen_netif_ctrl_response {
 };
 
 /*
+ * Static Grants (struct xen_netif_gref)
+ * =====================================
+ *
+ * A frontend may provide a fixed set of grant references to be mapped on
+ * the backend. The message of type XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING
+ * prior its usage in the command ring allows for creation of these mappings.
+ * The backend will maintain a fixed amount of these mappings.
+ *
+ * XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE lets a frontend query how many
+ * of these mappings can be kept.
+ *
+ * Each entry in the XEN_NETIF_CTRL_TYPE_{ADD,DEL}_GREF_MAPPING input table has
+ * the following format:
+ *
+ *    0     1     2     3     4     5     6     7  octet
+ * +-----+-----+-----+-----+-----+-----+-----+-----+
+ * | grant ref             |  flags    |  status   |
+ * +-----+-----+-----+-----+-----+-----+-----+-----+
+ *
+ * grant ref: grant reference (IN)
+ * flags: flags describing the control operation (IN)
+ * status: XEN_NETIF_CTRL_STATUS_* (OUT)
+ *
+ * 'status' is an output parameter which does not require to be set to zero
+ * prior to its usage in the corresponding control messages.
+ */
+
+struct xen_netif_gref {
+   grant_ref_t ref;
+   uint16_t flags;
+
+#define _XEN_NETIF_CTRLF_GREF_readonly    0
+#define XEN_NETIF_CTRLF_GREF_readonly    (1U<<_XEN_NETIF_CTRLF_GREF_readonly)
+
+   uint16_t status;
+};
+
+/*
  * Control messages
  * 
  *
@@ -609,6 +650,88 @@ struct xen_netif_ctrl_response {
  *   invalidate any table data outside that range.
  *   The grant reference may be read-only and must remain valid until
  *   the response has been processed.
+ *
+ * XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE
+ * -----------------------------------------
+ *
+ * This is sent by the frontend to fetch the number of grefs that can be kept
+ * mapped in the backend.
+ *
+ * Request:
+ *
+ *  type    = XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE
+ *  data[0] = queue index (assumed 0 for single queue)
+ *  data[1] = 0
+ *  data[2] = 0
+ *
+ * Response:
+ *
+ *  status = XEN_NETIF_CTRL_STATUS_NOT_SUPPORTED     - Operation not
+ *                                                     supported
+ *           XEN_NETIF_CTRL_STATUS_INVALID_PARAMETER - The queue index is
+ *                                                     out of range
+ *           XEN_NETIF_CTRL_STATUS_SUCCESS           - Operation successful
+ *  data   = maximum number of entries allowed in the gref mapping table
+ *           (if operation was successful) or zero if it is not supported.
+ *
+ * XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING
+ * ------------------------------------
+ *
+ * This is sent by the frontend for backend to map a list of grant
+ * references.
+ *
+ * Request:
+ *
+ *  type    = XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING
+ *  data[0] = queue index
+ *  data[1] = grant reference of page containing the mapping list
+ *            (r/w and assumed to start at beginning of page)
+ *  data[2] = size of list in entries
+ *
+ * Response:
+ *
+ *  status = XEN_NETIF_CTRL_STATUS_NOT_SUPPORTED - Operation not
+ *  

Re: [Xen-devel] Feature control on PV devices

2017-09-19 Thread Joao Martins
On 09/18/2017 08:59 PM, Konrad Rzeszutek Wilk wrote:
> On Thu, Sep 14, 2017 at 05:08:18PM +0100, Joao Martins wrote:
>> [ Realized that I didn't CC the maintainers,
>>   so doing that now, +Linux folks +PV interfaces czar
>>   Sorry for the noise! ]
>>
>> On 09/08/2017 09:49 AM, Joao Martins wrote:
>>> [Forgot two important details regarding Xenbus states]
>>> On 09/07/2017 05:53 PM, Joao Martins wrote:
>>>> Hey!
>>>>
>>>> We wanted to brought up this small proposal regarding the lack of
>>>> parameterization on PV devices on Xen.
>>>>
>>>> Currently users don't have a way for enforce and control what
>>>> features/queues/etc the backend provides. So far there's only global 
>>>> parameters
>>>> on backends, and specs do not mention anything in this regard.
> 
> How would this scale with say FreeBSD backends?
>
This is per-device parameter configuration support, based on xenstore entries.
All the backend needs to understand is that the request/XXX xenstore entries
supersede whatever global defaults the backend defined (after validation).
So what I am proposing here makes no OS assumptions and should work for FreeBSD
or any other.
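
To make the "supersede after validation" rule concrete, here is a minimal sketch in plain C with no xenstore calls. The helper name, its parameters and the range check used as validation are all hypothetical; a real backend would apply whatever validation fits the feature in question.

```c
#include <assert.h>

/*
 * Hypothetical helper: pick the value a backend would advertise for one
 * numeric feature. `requested` is the toolstack's request/XXX value and is
 * only meaningful when has_request is non-zero; `global_default` is the
 * backend-wide setting; [backend_min, backend_max] is the assumed valid range.
 */
static unsigned int example_effective_value(int has_request,
                                            unsigned int requested,
                                            unsigned int global_default,
                                            unsigned int backend_min,
                                            unsigned int backend_max)
{
    /* No per-device request: keep the global default. */
    if (!has_request)
        return global_default;

    /* Validation (assumed rule): ignore requests outside the valid range. */
    if (requested < backend_min || requested > backend_max)
        return global_default;

    /* A valid per-device request supersedes the global default. */
    return requested;
}
```

Nothing here is OS specific, which is the point: the backend only needs to read one extra entry and decide whether to honour it.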

> And I am assuming you are
> also thinking about device driver backends - where you can't easily
> get access to the backend and change the SysFS parameters (if they have
> it all)?
> 
Yeah - provided that the xenstore entries are created with permissions for the
toolstack domain and the backend domain, backends other than Dom0 should
work too. Note that this happens at device setup (e.g. domain create time), i.e.
it is the configuration of what the frontend is allowed to see/use.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v3 1/1] public/io/netif.h: add gref mapping control messages

2017-09-18 Thread Joao Martins
On Mon, Sep 18, 2017 at 12:11:04PM +0000, Paul Durrant wrote:
> > -Original Message-
> > From: Joao Martins [mailto:joao.m.mart...@oracle.com]
> > Sent: 18 September 2017 12:56
> > To: Paul Durrant <paul.durr...@citrix.com>
> > Cc: Xen-devel <xen-devel@lists.xen.org>; Wei Liu <wei.l...@citrix.com>;
> > Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
> > Subject: Re: [PATCH v3 1/1] public/io/netif.h: add gref mapping control
> > messages
> > 
> > On Mon, Sep 18, 2017 at 09:53:18AM +0000, Paul Durrant wrote:
> > > > -Original Message-
> > > > From: Joao Martins [mailto:joao.m.mart...@oracle.com]
> > > > Sent: 13 September 2017 19:11
> > > > To: Xen-devel <xen-devel@lists.xen.org>
> > > > Cc: Wei Liu <wei.l...@citrix.com>; Paul Durrant
> > <paul.durr...@citrix.com>;
> > > > Konrad Rzeszutek Wilk <konrad.w...@oracle.com>; Joao Martins
> > > > <joao.m.mart...@oracle.com>
> > > > Subject: [PATCH v3 1/1] public/io/netif.h: add gref mapping control
> > messages

[snip]

> > > > + * XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING
> > > > + * 
> > > > + *
> > > > + * This is sent by the frontend for backend to map a list of grant
> > > > + * references.
> > > > + *
> > > > + * Request:
> > > > + *
> > > > + *  type= XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING
> > > > + *  data[0] = queue index
> > > > + *  data[1] = grant reference of page containing the mapping list
> > > > + *(r/w and assumed to start at beginning of page)
> > > > + *  data[2] = size of list in entries
> > > > + *
> > > > + * Response:
> > > > + *
> > > > + *  status = XEN_NETIF_CTRL_STATUS_NOT_SUPPORTED - Operation
> > not
> > > > + * supported
> > > > + *   XEN_NETIF_CTRL_STATUS_INVALID_PARAMETER - Operation
> > failed
> > > > + *   XEN_NETIF_CTRL_STATUS_SUCCESS   - Operation 
> > > > successful
> > > > + *  data   = number of entries that were mapped
> > > > + *
> > > > + * NOTE: Each entry in the input table has the format outlined
> > > > + *   in struct xen_netif_gref.
> > >
> > > You may want to put words here about the 'all or nothing' semantics of
> > > this operation vs. the semantics of the 'del' operation below.
> > >
> > Good point I'll add a paragraph about that.
> > 
> > For the unmap it is clear that status should be per-entry for reasons
> > discussed on v2. Do you think ADD 'all or nothing' like I had on v2 ?
> > If so I should remove the 'data' return part since it is not really
> > useful here.
> 
> The 'all or nothing' semantic is easier for the frontend to deal with,
> so I think that's the way to go. Otherwise you'd need the per-entry
> status, as you say. Either way, I don't think the data return is
> particularly useful.
> 
Yeap.

The 'data' return was meant to allow both cases while leaving the decision to
implementors: if the number of mapped entries matched the input size
(data[2]), the frontend wouldn't need to check every entry.
But it would still need to unmap on partial success, as that
is not guaranteed by design. With 'all or nothing' semantics, 'data' doesn't
really have any meaning, and those semantics definitely make life easier for
the frontend.
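
A rough sketch of what the two semantics mean for frontend code, in C. The struct is a local mirror of the 8-octet entry from the patch (assuming grant_ref_t is a 32-bit integer), and the `example_` names and the success constant are stand-ins for illustration rather than the real netif.h definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for illustration; the real value comes from netif.h. */
#define EXAMPLE_CTRL_STATUS_SUCCESS 0

/* Mirrors the 8-octet entry layout from the patch (ref, flags, status). */
struct example_netif_gref {
    uint32_t ref;     /* IN:  grant reference */
    uint16_t flags;   /* IN:  e.g. read-only */
    uint16_t status;  /* OUT: written by the backend */
};

/* ADD is 'all or nothing': the response status alone decides. */
static int example_add_ok(uint32_t response_status)
{
    return response_status == EXAMPLE_CTRL_STATUS_SUCCESS;
}

/* DEL reports per entry: walk the list and count entries not unmapped. */
static unsigned int example_del_failed(const struct example_netif_gref *list,
                                       unsigned int count)
{
    unsigned int i, failed = 0;

    for (i = 0; i < count; i++)
        if (list[i].status != EXAMPLE_CTRL_STATUS_SUCCESS)
            failed++;
    return failed;
}
```

With 'all or nothing' the ADD path needs no per-entry walk and no cleanup of partially mapped lists, which is why a 'data' count adds nothing there.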

> > 
> > > > + *
> > > > + * XEN_NETIF_CTRL_TYPE_DEL_GREF_MAPPING
> > > > + * 
> > > > + *
> > > > + * This is sent by the frontend for backend to unmap a list of grant
> > > > + * references.
> > > > + *
> > > > + * Request:
> > > > + *
> > > > + *  type= XEN_NETIF_CTRL_TYPE_DEL_GREF_MAPPING
> > > > + *  data[0] = queue index
> > > > + *  data[1] = grant reference of page containing the mapping list
> > > > + *(r/w and assumed to start at beginning of page)
> > > > + *  data[2] = size of list in entries
> > > > + *
> > > > + * Response:
> > > > + *
> > > > + *  status = XEN_NETIF_CTRL_STATUS_NOT_SUPPORTED - Operation
> > not
> > > > + * supported
> > > > + *   XEN_NETIF_CTRL_STATUS_INVALID_PARAMETER - Operation
> > failed
> > > > + *   XEN_NETIF_CTRL_STATUS_SUC

Re: [Xen-devel] [PATCH v3 1/1] public/io/netif.h: add gref mapping control messages

2017-09-18 Thread Joao Martins
On Mon, Sep 18, 2017 at 09:53:18AM +0000, Paul Durrant wrote:
> > -Original Message-
> > From: Joao Martins [mailto:joao.m.mart...@oracle.com]
> > Sent: 13 September 2017 19:11
> > To: Xen-devel <xen-devel@lists.xen.org>
> > Cc: Wei Liu <wei.l...@citrix.com>; Paul Durrant <paul.durr...@citrix.com>;
> > Konrad Rzeszutek Wilk <konrad.w...@oracle.com>; Joao Martins
> > <joao.m.mart...@oracle.com>
> > Subject: [PATCH v3 1/1] public/io/netif.h: add gref mapping control messages
> > 
> > Adds 3 messages to allow guest to let backend keep grants mapped,
> > such that 1) guests allowing fast recycling of pages can avoid doing
> > grant ops for those cases, or otherwise 2) preferring copies over
> > grants and 3) always using a fixed set of pages for network I/O.
> > 
> > The three control ring messages added are:
> >  - Add grefs to be mapped by backend
> >  - Remove grefs mappings (If they are not in use)
> >  - Get maximum amount of grefs kept mapped.
> > 
> > Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
> > ---
> > v3:
> > * Use DEL for unmapping grefs instead of PUT
> > * Rname from xen_netif_gref_alloc to xen_netif_gref
> > * Add 'status' field on xen_netif_gref
> > * Clarify what 'inflight' means
> > * Use "beginning of the page" instead of "beginning of the grant"
> > * Mention that page needs to be r/w (as it will have to modify \.status)
> > * `data` on ADD|PUT returns number of entries mapped/unmapped.
> > ---
> >  xen/include/public/io/netif.h | 115
> > ++
> >  1 file changed, 115 insertions(+)
> > 
> > diff --git a/xen/include/public/io/netif.h b/xen/include/public/io/netif.h
> > index ca0061410d..0080a260fd 100644
> > --- a/xen/include/public/io/netif.h
> > +++ b/xen/include/public/io/netif.h
> > @@ -353,6 +353,9 @@ struct xen_netif_ctrl_request {
> >  #define XEN_NETIF_CTRL_TYPE_SET_HASH_MAPPING_SIZE 5
> >  #define XEN_NETIF_CTRL_TYPE_SET_HASH_MAPPING      6
> >  #define XEN_NETIF_CTRL_TYPE_SET_HASH_ALGORITHM    7
> > +#define XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE 8
> > +#define XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING      9
> > +#define XEN_NETIF_CTRL_TYPE_DEL_GREF_MAPPING     10
> > 
> >  uint32_t data[3];
> >  };
> > @@ -391,6 +394,41 @@ struct xen_netif_ctrl_response {
> >  };
> > 
> >  /*
> > + * Static Grants (struct xen_netif_gref)
> > + * =
> > + *
> > + * A frontend may provide a fixed set of grant references to be mapped on
> > + * the backend. The message of type
> > XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING
> > + * prior its usage in the command ring allows for creation of these 
> > mappings.
> > + * The backend will maintain a fixed amount of these mappings.
> > + *
> > + * XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE lets a frontend
> > query how many
> > + * of these mappings can be kept.
> > + *
> > + * Each entry in the XEN_NETIF_CTRL_TYPE_{ADD,DEL}_GREF_MAPPING
> > input table has
> > + * the following format:
> > + *
> > + *    0     1     2     3     4     5     6     7  octet
> > + * +-----+-----+-----+-----+-----+-----+-----+-----+
> > + * | grant ref             |  flags    |  status   |
> > + * +-----+-----+-----+-----+-----+-----+-----+-----+
> > + *
> > + * grant ref: grant reference
> > + * flags: flags describing the control operation
> > + * status: XEN_NETIF_CTRL_STATUS_*
> > + */
> 
> You may want to add some words here pointing out that the status is an
> 'out' field, and also whether it should be initialized to zero or not.
> 
OK.

> > +
> > +struct xen_netif_gref {
> > +   grant_ref_t ref;
> > +   uint16_t flags;
> > +
> > +#define _XEN_NETIF_CTRLF_GREF_readonly    0
> > +#define XEN_NETIF_CTRLF_GREF_readonly    (1U<<_XEN_NETIF_CTRLF_GREF_readonly)
> > +
> > +   uint16_t status;
> > +};
> > +
> > +/*
> >   * Control messages
> >   * 
> >   *
> > @@ -609,6 +647,83 @@ struct xen_netif_ctrl_response {
> >   *   invalidate any table data outside that range.
> >   *   The grant reference may be read-only and must remain valid until
> >   *   the response has been processed.
> > + *
> > + * XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE
> > + * -
> > + *
> > + * This is sent by the frontend to fetch the number of gref

Re: [Xen-devel] [PATCH v3 0/1] netif: staging grants for I/O requests

2017-09-18 Thread Joao Martins
On Mon, Sep 18, 2017 at 09:45:06AM +0000, Paul Durrant wrote:
> > -Original Message-
> > From: Joao Martins [mailto:joao.m.mart...@oracle.com]
> > Sent: 13 September 2017 19:11
> > To: Xen-devel <xen-devel@lists.xen.org>
> > Cc: Wei Liu <wei.l...@citrix.com>; Paul Durrant <paul.durr...@citrix.com>;
> > Konrad Rzeszutek Wilk <konrad.w...@oracle.com>; Joao Martins
> > <joao.m.mart...@oracle.com>
> > Subject: [PATCH v3 0/1] netif: staging grants for I/O requests
> > 
> > Hey,
> > 
> > This is v3 taking into consideration all comments received from v2 
> > (changelog
> > in the first patch). The specification is right after the diffstat.
> > 
> > Reference implementation also here (on top of net-next):
> > 
> > https://github.com/jpemartins/linux.git xen-net-stg-gnts-v3
> > 
> > Although I am satisfied with how things are being done above, I wanted
> > to request some advise/input on whether there could be a simpler way of
> > achieving the same. Specifically because these control messages
> > adds up significant code on the frontend to pregrant, and in other cases the
> > control message might be limitative if frontend tries to keep a dinamically
> > changed buffer pool in different queues. *Maybe* it could be simpler to
> > adjust
> > the TX/RX ring ABI in a compatible matter (Disclaimer: I haven't implemented
> > this just yet):
> 
> But the whole point of pre-granting is to separate the grant/ungrant
> operations from the rx/tx operations, right?

/nods

> So, why would the extra
> control messages really be an overhead?

It's not that it's an overhead, but rather the larger amount of code needed
to pregrant once ... and so I was trying to figure out if there was some
simplification/flexibility that could be made; in the meantime I was
experimenting a bit and it looks like it probably won't make much
difference implementation-wise, while implying higher complexity on the
datapath and also weaker semantics.

With things like AF_PACKET v4 (pre-mapping buffers) appearing in Linux in the
mid term, it will require stronger semantics like those provided by the
control ring ops rather than the flags I was suggesting below.

The advantage of the flags, though, is that add/del mappings would be
(by design) in the context of the queue rather than in the control
ring thread handling it. But maybe this can be considered implementation
specific behaviour too, and we could find ways to handle that better if it
ever becomes a problem, e.g. doing the pre{un,}maps in the dealloc thread context.

Joao

> > 
> >  1) Add a flag `NETTXF_persist` to `netif_tx_request`
> > 
> >  2) Replace RX `netif_rx_request` padding with `flags` and adda
> >  `NETRXF_persist` with the same purpose as 1).
> > 
> >  3) This remains backwards compatible as backends not supporting this
> > wouldn't
> >  act on this new flag, and given we replace padding with flags means
> > unsupported
> >  backends won't simply be aware of RX *request* `flags`. This is under the
> >  assumption that there's no requirement that padding must be zero
> > throughout
> >  the netif.h specification.
> > 
> >  4) Keeping `GET_GREF_MAPPING_SIZE` ctrl msg for frontend to do better
> >  decisions?
> > 
> >  5) Semantics are simple: slots with flags marked as NET{RX,TX}F_persist
> >  represent a permanent mapped ref and therefore mapped if non-existent.
> >  *future* omissions of the flag signals the mapping should be removed.
> > 
> > This would allow guests which reuse buffers (apparently Windows :)) to scale
> > better as unmaps would be done on the individual queue context  plus
> > allowing
> > frontend to remain a more simple in the management of "permanent"
> > buffers. The
> > drawback seems to be the added complexity (and somewhat racy behaviour)
> > on the
> > datapath, to map or unmap accordingly. Because now we would have to
> > differentiate between long vs short lived map/unmap ops in addition to
> > looking
> > up on our mappings table. Thoughts, or perhaps people may prefer the one
> > already described in the series?
> > 
> > Cheers,
> > 
> > Joao Martins (1):
> >   public/io/netif.h: add gref mapping control messages
> > 
> >  xen/include/public/io/netif.h | 115
> > ++
> >  1 file changed, 115 insertions(+)
> > ---
> > % Staging grants for network I/O requests
> > % Joao Martins <<joao.m.mart...@oracle.com>>
> > % Revision 3
> > 
> > \clearpage
> > 
> > ---

Re: [Xen-devel] Feature control on PV devices

2017-09-15 Thread Joao Martins
On 09/15/2017 12:34 PM, Juergen Gross wrote:
> On 15/09/17 13:19, Wei Liu wrote:
>> On Thu, Sep 14, 2017 at 05:18:44PM +0100, Joao Martins wrote:
>>> On 09/14/2017 05:10 PM, Wei Liu wrote:
>>>> On Thu, Sep 07, 2017 at 05:53:54PM +0100, Joao Martins wrote:
>>>>> Hey!
>>>>>
>>>>> We wanted to brought up this small proposal regarding the lack of
>>>>> parameterization on PV devices on Xen.
>>>>>
>>>>> Currently users don't have a way for enforce and control what
>>>>> features/queues/etc the backend provides. So far there's only global 
>>>>> parameters
>>>>> on backends, and specs do not mention anything in this regard.
>>>>>
>>>>> The most obvious example is netback/blkback max_queues module parameter 
>>>>> where it
>>>>> sets the limit the maximum queues for all devices which is not that 
>>>>> flexible.
>>>>> Other examples include controlling offloads visible by the NIC (e.g. 
>>>>> disabling
>>>>> checksum offload, disabling scather-gather), others more about I/O path 
>>>>> (e.g.
>>>>> disable blkif indirect descriptors, limit number of pages for the ring), 
>>>>> or less
>>>>> grant usage by minimizing number of queues/descriptors.
>>>>>
>>>>> Of course there could be more examples, as this seems to be ortoghonal to 
>>>>> the
>>>>> kinds of PV backends we have. And seems like all features appear to be 
>>>>> published
>>>>> on the same xenbus state?
>>>>>
>>>>> The idea to address this would be very simple:
>>>>>
>>>>> - Toolstack when initializing device paths, writes additional entries in 
>>>>> the
>>>>> form of 'request-<feature-name>' = <feature-value>. These entries are only
>>>>> visible by the backend and toolstack;
>>>>>
>>>>> - Backend reads these entries and uses <feature-value> as the value of
>>>>> <feature-name>, which will then be visible on the frontend.
>>>>>
>>>>> [ Removal of the 'request-*' xenstore entries could represent a feedback 
>>>>> look
>>>>>   that the backend indeed read and used the value. Or else it could 
>>>>> simply be
>>>>>   ignored. ]
>>>>>
>>>>> And that's it.
>>>>>
>>>>> In pratice user would do: E.g.
>>>>>
>>>>> domain.cfg:
>>>>> ...
>>>>> name = "guest"
>>>>> kernel = "bzImage"
>>>>> vif = ["bridge=br0,queues=2"]
>>>>> disk = [
>>>>> "format=raw,vdev=hda,access=rw,backendtype=phy,target=/dev/HostVG/XenGuest2,queues=1,max-ring-page-order=0"
>>>>
>>>> There needs to be a way to distinguish parameters consumed by toolstack
>>>> vs the ones passed on to backends. The parameters passed to backends
>>>> should start with a predefined prefix.
>>>>
>>> Hmm, which seems to be inline with the "request" prefix when controlling 
>>> certain
>>> features enabled/disabled? Oh wait, perhaps you mean wrt to the 
>>> UI/config-format
>>> rather than xenstore entries and such? If it's the latter, see below,
>>
>> I was thinking about xl config syntax.
>>
>>>
>>>>> ]
>>>>> ...
>>>>>
>>>>> Toolstack writes:
>>>>>
>>>>> /local/domain/0/backend/vif/8/0/request-multi-queue-max-queues = 2
>>>>> /local/domain/0/backend/vbd/8/51713/request-multi-queue-max-queues = 2
>>>>> /local/domain/0/backend/vbd/8/51713/request-max-ring-page-order = 0
> 
> I'd rather use a specific directory, e.g.:
> 
> /local/domain/0/backend/vif/8/0/request/multi-queue-max-queues = 2
> /local/domain/0/backend/vbd/8/51713/request/multi-queue-max-queues = 2
> /local/domain/0/backend/vbd/8/51713/request/max-ring-page-order = 0
> 
> This will enable the backend to just look for all entries in
> .../request/ instead of having to try all possible features.
> 
Yeap, sounds better and cleaner indeed.

And the backend can simply remove the whole directory when it's done consuming
the parameters, as a signal to the toolstack? Or maybe it would be enough to
simply detect that the request/XXX and XXX xenstore entries have the same value.

>>>>> Backends reads and seeds with (and assuming it passes backend va

Re: [Xen-devel] Feature control on PV devices

2017-09-14 Thread Joao Martins
On 09/14/2017 05:10 PM, Wei Liu wrote:
> On Thu, Sep 07, 2017 at 05:53:54PM +0100, Joao Martins wrote:
>> Hey!
>>
>> We wanted to brought up this small proposal regarding the lack of
>> parameterization on PV devices on Xen.
>>
>> Currently users don't have a way for enforce and control what
>> features/queues/etc the backend provides. So far there's only global 
>> parameters
>> on backends, and specs do not mention anything in this regard.
>>
>> The most obvious example is netback/blkback max_queues module parameter 
>> where it
>> sets the limit the maximum queues for all devices which is not that flexible.
>> Other examples include controlling offloads visible by the NIC (e.g. 
>> disabling
>> checksum offload, disabling scather-gather), others more about I/O path (e.g.
>> disable blkif indirect descriptors, limit number of pages for the ring), or 
>> less
>> grant usage by minimizing number of queues/descriptors.
>>
>> Of course there could be more examples, as this seems to be ortoghonal to the
>> kinds of PV backends we have. And seems like all features appear to be 
>> published
>> on the same xenbus state?
>>
>> The idea to address this would be very simple:
>>
>> - Toolstack when initializing device paths, writes additional entries in the
>> form of 'request-<feature-name>' = <feature-value>. These entries are only
>> visible by the backend and toolstack;
>>
>> - Backend reads these entries and uses <feature-value> as the value of
>> <feature-name>, which will then be visible on the frontend.
>>
>> [ Removal of the 'request-*' xenstore entries could represent a feedback look
>>   that the backend indeed read and used the value. Or else it could simply be
>>   ignored. ]
>>
>> And that's it.
>>
>> In pratice user would do: E.g.
>>
>> domain.cfg:
>> ...
>> name = "guest"
>> kernel = "bzImage"
>> vif = ["bridge=br0,queues=2"]
>> disk = [
>> "format=raw,vdev=hda,access=rw,backendtype=phy,target=/dev/HostVG/XenGuest2,queues=1,max-ring-page-order=0"
> 
> There needs to be a way to distinguish parameters consumed by toolstack
> vs the ones passed on to backends. The parameters passed to backends
> should start with a predefined prefix.
> 
Hmm, which seems to be in line with the "request" prefix when controlling whether
certain features are enabled/disabled? Oh wait, perhaps you mean with respect to
the UI/config format rather than xenstore entries and such? If it's the config
format, see below,

>> ]
>> ...
>>
>> Toolstack writes:
>>
>> /local/domain/0/backend/vif/8/0/request-multi-queue-max-queues = 2
>> /local/domain/0/backend/vbd/8/51713/request-multi-queue-max-queues = 2
>> /local/domain/0/backend/vbd/8/51713/request-max-ring-page-order = 0
>>
>> Backends reads and seeds with (and assuming it passes backend validation 
>> ofc):
>>
>> /local/domain/0/backend/vif/8/0/multi-queue-max-queues = 2
>> /local/domain/0/backend/vbd/8/51713/multi-queue-max-queues = 2
>> /local/domain/0/backend/vbd/8/51713/max-ring-page-order = 0
>>
>> The XL configuration entry for controlling these tunable are just examples 
>> it's
>> not clear the general preference for this. An alternative could be:
>>
>> vif = ["bridge=br0,features=queues:2\\;max-ring-page-order:0"]
>>
>> Which lets us have more generic feature control, without sticking to 
>> particular
>> features names.
>>
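
The alternative `features=` syntax quoted above could be split into per-feature entries roughly as follows. This is only a sketch under stated assumptions: the function name is hypothetical, the separator is taken to be a plain `;` once the config-file escaping has been removed, and feature names are passed through verbatim with the `request-` prefix rather than mapped to any real xenstore key.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Split a "name:value;name:value" feature string and write one
 * "request-<name> = <value>" line per pair into `out`.
 * Returns the number of pairs emitted, or -1 on malformed input
 * or if `out` is too small.
 */
static int example_features_to_requests(const char *features,
                                        char *out, size_t outlen)
{
    size_t used = 0;
    int pairs = 0;

    if (outlen == 0)
        return -1;
    out[0] = '\0';

    while (*features != '\0') {
        size_t pair_len = strcspn(features, ";");
        const char *colon = memchr(features, ':', pair_len);
        int n;

        /* Require a non-empty name and a non-empty value. */
        if (colon == NULL || colon == features ||
            colon + 1 == features + pair_len)
            return -1;

        n = snprintf(out + used, outlen - used, "request-%.*s = %.*s\n",
                     (int)(colon - features), features,
                     (int)(features + pair_len - colon - 1), colon + 1);
        if (n < 0 || (size_t)n >= outlen - used)
            return -1;
        used += (size_t)n;
        pairs++;

        /* Step over this pair and its trailing separator, if any. */
        features += pair_len;
        if (*features == ';')
            features++;
    }
    return pairs;
}
```

Because the parser never looks at the feature names themselves, the same code would serve any backend, which is the appeal of the generic form.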

In case the above was about the config format, the one suggested above sounds more
general and easy to reuse across backends. Maybe instead of "features" it could
be "backend_features", since most PV backends declare a "backend" and a
"backend_id" as per the libxl IDL.

>> Naturally libvirt could be a consumer of this (as it already has the 'queues'
>> and host 'tso4', 'tso6', etc in their XML schemas)
>>
>> Thoughts? Do folks think the correct way of handling this?
>>
> 
> I think having a way to control backend features in xl/libxl is a good
> thing.

Thanks!

> 
>> Cheers,
>> Joao
>>
>> [0] https://github.com/qemu/qemu/blob/master/hw/net/virtio-net.c#L2102



Re: [Xen-devel] Feature control on PV devices

2017-09-14 Thread Joao Martins
[ Realized that I didn't CC the maintainers,
  so doing that now, +Linux folks +PV interfaces czar
  Sorry for the noise! ]

On 09/08/2017 09:49 AM, Joao Martins wrote:
> [Forgot two important details regarding Xenbus states]
> On 09/07/2017 05:53 PM, Joao Martins wrote:
>> Hey!
>>
>> We wanted to brought up this small proposal regarding the lack of
>> parameterization on PV devices on Xen.
>>
>> Currently users don't have a way for enforce and control what
>> features/queues/etc the backend provides. So far there's only global 
>> parameters
>> on backends, and specs do not mention anything in this regard.
>>
>> The most obvious example is netback/blkback max_queues module parameter 
>> where it
>> sets the limit the maximum queues for all devices which is not that flexible.
>> Other examples include controlling offloads visible by the NIC (e.g. 
>> disabling
>> checksum offload, disabling scather-gather), others more about I/O path (e.g.
>> disable blkif indirect descriptors, limit number of pages for the ring), or 
>> less
>> grant usage by minimizing number of queues/descriptors.
>>
>> Of course there could be more examples, as this seems to be ortoghonal to the
>> kinds of PV backends we have. And seems like all features appear to be 
>> published
>> on the same xenbus state?
>>
>> The idea to address this would be very simple:
>>
>> - Toolstack when initializing device paths, writes additional entries in the
>> form of 'request-<feature-name>' = <feature-value>. These entries are only
>> visible by the backend and toolstack;
>>
> And after that we switch the device state to XenbusStateInitialising as usual.
> 
>>
>> - Backend reads these entries and uses <feature-value> as the value of
>> <feature-name>, which will then be visible on the frontend.
>>
> And after that we switch state to XenbusStateInitWait as usual. No changes are
> involved in xenbus state changes other than reading what the toolstack had
> written in "request-*" and seed accordingly. Backends without support would
> simply ignore these new entries.
> 
>> [ Removal of the 'request-*' xenstore entries could represent a feedback look
>>   that the backend indeed read and used the value. Or else it could simply be
>>   ignored. ]
>>
>> And that's it.
>>
>> In pratice user would do: E.g.
>>
>> domain.cfg:
>> ...
>> name = "guest"
>> kernel = "bzImage"
>> vif = ["bridge=br0,queues=2"]
>> disk = [
>> "format=raw,vdev=hda,access=rw,backendtype=phy,target=/dev/HostVG/XenGuest2,queues=1,max-ring-page-order=0"
>> ]
>> ...
>>
>> Toolstack writes:
>>
>> /local/domain/0/backend/vif/8/0/request-multi-queue-max-queues = 2
>> /local/domain/0/backend/vbd/8/51713/request-multi-queue-max-queues = 2
>> /local/domain/0/backend/vbd/8/51713/request-max-ring-page-order = 0
> 
> /local/domain/0/backend/vbd/8/51713/state = 1 (XenbusStateInitialising)
> 
>>
>> Backends reads and seeds with (and assuming it passes backend validation 
>> ofc):
>>
>> /local/domain/0/backend/vif/8/0/multi-queue-max-queues = 2
>> /local/domain/0/backend/vbd/8/51713/multi-queue-max-queues = 2
>> /local/domain/0/backend/vbd/8/51713/max-ring-page-order = 0
>>
> /local/domain/0/backend/vbd/8/51713/state = 2 (XenbusStateInitWait)
> 
>> The XL configuration entry for controlling these tunable are just examples 
>> it's
>> not clear the general preference for this. An alternative could be:
>>
>> vif = ["bridge=br0,features=queues:2\\;max-ring-page-order:0"]
>>
>> Which lets us have more generic feature control, without sticking to 
>> particular
>> features names.
>>
>> Naturally libvirt could be a consumer of this (as it already has the 'queues'
>> and host 'tso4', 'tso6', etc in their XML schemas)
>>
>> Thoughts? Do folks think the correct way of handling this?
>>
>> Cheers,
>> Joao
>>
>> [0] https://github.com/qemu/qemu/blob/master/hw/net/virtio-net.c#L2102
>>



[Xen-devel] [PATCH v3 0/1] netif: staging grants for I/O requests

2017-09-13 Thread Joao Martins
Hey,

This is v3 taking into consideration all comments received from v2 (changelog
in the first patch). The specification is right after the diffstat.

Reference implementation also here (on top of net-next):

https://github.com/jpemartins/linux.git xen-net-stg-gnts-v3

Although I am satisfied with how things are being done above, I wanted
to request some advice/input on whether there could be a simpler way of
achieving the same. Specifically because these control messages
add significant code on the frontend to pregrant, and in other cases the
control message might be limiting if the frontend tries to keep a dynamically
changing buffer pool in different queues. *Maybe* it could be simpler to adjust
the TX/RX ring ABI in a compatible manner (Disclaimer: I haven't implemented
this just yet):

 1) Add a flag `NETTXF_persist` to `netif_tx_request`

 2) Replace the RX `netif_rx_request` padding with `flags` and add a
 `NETRXF_persist` with the same purpose as 1).

 3) This remains backwards compatible, as backends not supporting this wouldn't
 act on this new flag, and given that we replace padding with flags, unsupported
 backends simply won't be aware of RX *request* `flags`. This is under the
 assumption that there's no requirement that padding must be zero throughout
 the netif.h specification.

 4) Keep the `GET_GREF_MAPPING_SIZE` ctrl msg so the frontend can make better
 decisions?

 5) Semantics are simple: slots with flags marked as NET{RX,TX}F_persist
 represent a permanently mapped ref, which is therefore mapped if non-existent.
 *Future* omission of the flag signals that the mapping should be removed.

This would allow guests which reuse buffers (apparently Windows :)) to scale
better, as unmaps would be done in the individual queue context, plus allowing
the frontend to remain simpler in the management of "permanent" buffers. The
drawback seems to be the added complexity (and somewhat racy behaviour) on the
datapath, to map or unmap accordingly, because now we would have to
differentiate between long- vs short-lived map/unmap ops in addition to looking
up our mappings table. Thoughts, or perhaps people may prefer the one
already described in the series?

Cheers,

Joao Martins (1):
  public/io/netif.h: add gref mapping control messages

 xen/include/public/io/netif.h | 115 ++
 1 file changed, 115 insertions(+)
---
% Staging grants for network I/O requests
% Joao Martins <<joao.m.mart...@oracle.com>>
% Revision 3

\clearpage


Architecture(s): Any


# Background and Motivation

At the Xen hackathon '16 networking session, we spoke about having a permanently
mapped region to describe header/linear region of packet buffers. This document
outlines the proposal covering motivation of this and applicability for other
use-cases alongside the necessary changes. This proposal is an RFC and also
includes alternative solutions.

The motivation of this work is to eliminate grant ops for packet I/O intensive
workloads such as those observed with smaller request sizes (i.e. <= 256 bytes
or <= MTU). Currently on Xen, bulk transfers (e.g. 32K..64K packets) are the
only ones performing really well (up to 80 Gbit/s in a few CPUs), usually
backing end-hosts and server appliances. Anything that involves higher packet
rates (<= 1500 MTU) or runs without sg performs badly, close to 1 Gbit/s
throughput.

# Proposal

The proposal is to leverage the already implicit copy from and to packet linear
data on netfront and netback, to be done instead from a permanently mapped
region. In some (physical) NICs this is known as header/data split.

Specifically for some workloads (e.g. NFV) it would provide a big increase in
throughput when we switch to (zero)copying in the backend/frontend, instead of
the grant hypercalls. Thus this extension aims at futureproofing the netif
protocol by adding the possibility of guests setting up a list of grants that
are set up at device creation and revoked at device freeing - without taking
up too many grant entries in the general case (i.e. to cover only the
header region <= 256 bytes, 16 grants per ring) while remaining configurable by
the kernel when one wants to resort to a copy-based approach as opposed to
grant copy/map.
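
One way to arrive at the "16 grants per ring" figure is sketched below. It
assumes 4 KiB pages and a 256-slot ring; neither number is stated in this
section, so treat both constants (and the helper names) as assumptions of the
example rather than part of the proposal.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES   4096u  /* assumed 4 KiB granted page */
#define HEADER_SLOT_BYTES 256u   /* header region size from the text */
#define RING_SLOTS        256u   /* assumed number of slots per ring */

/* Number of 256-byte header regions that fit in one granted page. */
static inline uint32_t headers_per_page(void)
{
    return PAGE_SIZE_BYTES / HEADER_SLOT_BYTES;
}

/* Grants needed so that every ring slot has its own header region. */
static inline uint32_t grants_per_ring(void)
{
    uint32_t bytes = RING_SLOTS * HEADER_SLOT_BYTES;

    return (bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
}
```

With these assumptions each page holds 16 header regions, and 256 slots need
64 KiB, i.e. 16 pages and therefore 16 grants.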

\clearpage

# General Operation

Here we describe how netback and netfront generally operate, and where the
proposed solution will fit. The security mechanism currently involves grant
references, which in essence are round-robin recycled 'tickets' stamped with the
GPFNs, permission attributes, and the authorized domain:

(This is an in-memory view of struct grant_entry_v1):

 0     1     2     3     4     5     6     7 octet
+------------+-----------+------------------------+
| flags      | domain id | frame                  |
+------------+-----------+------------------------+

Where th

[Xen-devel] [PATCH v3 1/1] public/io/netif.h: add gref mapping control messages

2017-09-13 Thread Joao Martins
Adds 3 messages to allow guest to let backend keep grants mapped,
such that 1) guests allowing fast recycling of pages can avoid doing
grant ops for those cases, or otherwise 2) preferring copies over
grants and 3) always using a fixed set of pages for network I/O.

The three control ring messages added are:
 - Add grefs to be mapped by backend
 - Remove grefs mappings (If they are not in use)
 - Get maximum amount of grefs kept mapped.
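
For illustration only, a frontend could build the "add grefs" request along
the lines below. The layouts and constants are copied from the patch that
follows; the helper `build_add_gref_request()`, the `ctrl_request_sketch`
type, the 4 KiB page size and the rule that the list must fit in a single
page are assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t grant_ref_t;

#define XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING 9

/* Entry format from the patch: 4 + 2 + 2 = 8 bytes. */
struct xen_netif_gref {
    grant_ref_t ref;
    uint16_t flags;
    uint16_t status;
};

/* Local stand-in for struct xen_netif_ctrl_request. */
struct ctrl_request_sketch {
    uint16_t id;
    uint16_t type;
    uint32_t data[3];
};

/* Assumption: the mapping list lives in one 4 KiB page. */
#define GREFS_PER_PAGE (4096 / sizeof(struct xen_netif_gref))

/*
 * Fill the shared table with 'count' references and describe it in 'req'.
 * 'table_gref' is the grant reference of the (r/w) page holding 'table'.
 * Returns 0 on success, -1 if the list is empty or does not fit.
 */
static int build_add_gref_request(struct ctrl_request_sketch *req,
                                  struct xen_netif_gref *table,
                                  const grant_ref_t *refs, size_t count,
                                  uint32_t queue, grant_ref_t table_gref)
{
    if (count == 0 || count > GREFS_PER_PAGE)
        return -1;

    for (size_t i = 0; i < count; i++) {
        table[i].ref = refs[i];
        table[i].flags = 0;    /* read/write mapping */
        table[i].status = 0;   /* written back by the backend */
    }

    req->type = XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING;
    req->data[0] = queue;
    req->data[1] = table_gref;
    req->data[2] = (uint32_t)count;
    return 0;
}
```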

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
v3:
* Use DEL for unmapping grefs instead of PUT
* Rename from xen_netif_gref_alloc to xen_netif_gref
* Add 'status' field on xen_netif_gref
* Clarify what 'inflight' means
* Use "beginning of the page" instead of "beginning of the grant"
* Mention that page needs to be r/w (as it will have to modify .status)
* `data` on ADD|PUT returns number of entries mapped/unmapped.
---
 xen/include/public/io/netif.h | 115 ++
 1 file changed, 115 insertions(+)

diff --git a/xen/include/public/io/netif.h b/xen/include/public/io/netif.h
index ca0061410d..0080a260fd 100644
--- a/xen/include/public/io/netif.h
+++ b/xen/include/public/io/netif.h
@@ -353,6 +353,9 @@ struct xen_netif_ctrl_request {
 #define XEN_NETIF_CTRL_TYPE_SET_HASH_MAPPING_SIZE 5
 #define XEN_NETIF_CTRL_TYPE_SET_HASH_MAPPING      6
 #define XEN_NETIF_CTRL_TYPE_SET_HASH_ALGORITHM    7
+#define XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE 8
+#define XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING      9
+#define XEN_NETIF_CTRL_TYPE_DEL_GREF_MAPPING     10
 
 uint32_t data[3];
 };
@@ -391,6 +394,41 @@ struct xen_netif_ctrl_response {
 };
 
 /*
+ * Static Grants (struct xen_netif_gref)
+ * =====================================
+ *
+ * A frontend may provide a fixed set of grant references to be mapped on
+ * the backend. The message of type XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING
+ * prior its usage in the command ring allows for creation of these mappings.
+ * The backend will maintain a fixed amount of these mappings.
+ *
+ * XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE lets a frontend query how many
+ * of these mappings can be kept.
+ *
+ * Each entry in the XEN_NETIF_CTRL_TYPE_{ADD,DEL}_GREF_MAPPING input table has
+ * the following format:
+ *
+ *    0     1     2     3     4     5     6     7  octet
+ * +-----+-----+-----+-----+-----+-----+-----+-----+
+ * | grant ref             |  flags    |  status   |
+ * +-----+-----+-----+-----+-----+-----+-----+-----+
+ *
+ * grant ref: grant reference
+ * flags: flags describing the control operation
+ * status: XEN_NETIF_CTRL_STATUS_*
+ */
+
+struct xen_netif_gref {
+   grant_ref_t ref;
+   uint16_t flags;
+
+#define _XEN_NETIF_CTRLF_GREF_readonly    0
+#define XEN_NETIF_CTRLF_GREF_readonly    (1U<<_XEN_NETIF_CTRLF_GREF_readonly)
+
+   uint16_t status;
+};
+
+/*
  * Control messages
  * ================
  *
@@ -609,6 +647,83 @@ struct xen_netif_ctrl_response {
  *   invalidate any table data outside that range.
  *   The grant reference may be read-only and must remain valid until
  *   the response has been processed.
+ *
+ * XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE
+ * -----------------------------------------
+ *
+ * This is sent by the frontend to fetch the number of grefs that can be kept
+ * mapped in the backend.
+ *
+ * Request:
+ *
+ *  type    = XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE
+ *  data[0] = queue index (assumed 0 for single queue)
+ *  data[1] = 0
+ *  data[2] = 0
+ *
+ * Response:
+ *
+ *  status = XEN_NETIF_CTRL_STATUS_NOT_SUPPORTED     - Operation not
+ *                                                     supported
+ *           XEN_NETIF_CTRL_STATUS_INVALID_PARAMETER - The queue index is
+ *                                                     out of range
+ *           XEN_NETIF_CTRL_STATUS_SUCCESS           - Operation successful
+ *  data   = maximum number of entries allowed in the gref mapping table
+ *           (if operation was successful) or zero if it is not supported.
+ *
+ * XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING
+ * ------------------------------------
+ *
+ * This is sent by the frontend for backend to map a list of grant
+ * references.
+ *
+ * Request:
+ *
+ *  type    = XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING
+ *  data[0] = queue index
+ *  data[1] = grant reference of page containing the mapping list
+ *            (r/w and assumed to start at beginning of page)
+ *  data[2] = size of list in entries
+ *
+ * Response:
+ *
+ *  status = XEN_NETIF_CTRL_STATUS_NOT_SUPPORTED     - Operation not
+ *                                                     supported
+ *           XEN_NETIF_CTRL_STATUS_INVALID_PARAMETER - Operation failed
+ *           XEN_NETIF_CTRL_STATUS_SUCCESS           - Operation successful
+ *  data   = number of entries that were mapped
+ *
+ * NOTE: Each entry in the input table has the format outlined
+ *       in struct xen_netif_gref.
+ *
+ * XEN_

Re: [Xen-devel] Feature control on PV devices

2017-09-08 Thread Joao Martins
[Forgot two important details regarding Xenbus states]

On 09/07/2017 05:53 PM, Joao Martins wrote:
> Hey!
> 
> We wanted to bring up this small proposal regarding the lack of
> parameterization on PV devices on Xen.
> 
> Currently users don't have a way to enforce and control what
> features/queues/etc the backend provides. So far there are only global
> parameters on backends, and specs do not mention anything in this regard.
> 
> The most obvious example is the netback/blkback max_queues module parameter,
> which sets the maximum number of queues for all devices and is not that
> flexible. Other examples include controlling offloads visible by the NIC (e.g.
> disabling checksum offload, disabling scatter-gather), others more about the
> I/O path (e.g. disable blkif indirect descriptors, limit number of pages for
> the ring), or less grant usage by minimizing number of queues/descriptors.
> 
> Of course there could be more examples, as this seems to be orthogonal to the
> kinds of PV backends we have. And it seems like all features appear to be
> published on the same xenbus state?
> 
> The idea to address this would be very simple:
> 
> - Toolstack, when initializing device paths, writes additional entries in the
> form of 'request-<feature-name>' = <feature-value>. These entries are only
> visible by the backend and toolstack;
>
And after that we switch the device state to XenbusStateInitialising as usual.

> 
> - Backend reads these entries and uses <feature-value> as the value of
> <feature-name>, which will then be visible on the frontend.
> 
And after that we switch state to XenbusStateInitWait as usual. No changes are
involved in xenbus state changes other than reading what the toolstack had
written in "request-*" and seeding the features accordingly. Backends without
support would simply ignore these new entries.

> [ Removal of the 'request-*' xenstore entries could represent a feedback loop
>   that the backend indeed read and used the value. Or else it could simply be
>   ignored. ]
> 
> And that's it.
> 
> In practice the user would do, e.g.:
> 
> domain.cfg:
> ...
> name = "guest"
> kernel = "bzImage"
> vif = ["bridge=br0,queues=2"]
> disk = [
> "format=raw,vdev=hda,access=rw,backendtype=phy,target=/dev/HostVG/XenGuest2,queues=1,max-ring-page-order=0"
> ]
> ...
> 
> Toolstack writes:
> 
> /local/domain/0/backend/vif/8/0/request-multi-queue-max-queues = 2
> /local/domain/0/backend/vbd/8/51713/request-multi-queue-max-queues = 2
> /local/domain/0/backend/vbd/8/51713/request-max-ring-page-order = 0

/local/domain/0/backend/vbd/8/51713/state = 1 (XenbusStateInitialising)

> 
> Backend reads and seeds with (assuming it passes backend validation ofc):
> 
> /local/domain/0/backend/vif/8/0/multi-queue-max-queues = 2
> /local/domain/0/backend/vbd/8/51713/multi-queue-max-queues = 2
> /local/domain/0/backend/vbd/8/51713/max-ring-page-order = 0
> 
/local/domain/0/backend/vbd/8/51713/state = 2 (XenbusStateInitWait)
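
A minimal sketch of the backend-side seeding step, i.e. what happens between
state 1 and state 2 above. Everything here is hypothetical: the
`feature_request` struct, its fields and `seed_feature()` are invented for the
example, and falling back to the backend maximum when a request fails
validation is just one possible policy.

```c
#include <stdbool.h>
#include <stdint.h>

/* One tunable: what the toolstack asked for and what the backend allows. */
struct feature_request {
    bool     requested;    /* was a 'request-*' key present? */
    uint32_t value;        /* value written by the toolstack */
    uint32_t backend_min;  /* lowest value the backend accepts */
    uint32_t backend_max;  /* backend limit, e.g. a module parameter */
};

/* Value the backend would publish under the plain feature key. */
static uint32_t seed_feature(const struct feature_request *f)
{
    if (!f->requested)
        return f->backend_max;   /* no request: keep the global default */

    if (f->value < f->backend_min || f->value > f->backend_max)
        return f->backend_max;   /* assumed policy for invalid requests */

    return f->value;
}
```

For the vif example above, a request of 2 queues against a backend limit of 8
would be seeded as `multi-queue-max-queues = 2`.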

> The XL configuration entries for controlling these tunables are just examples;
> it's not clear what the general preference for this is. An alternative could be:
> 
> vif = ["bridge=br0,features=queues:2\\;max-ring-page-order:0"]
> 
> Which lets us have more generic feature control, without sticking to
> particular feature names.
> 
> Naturally libvirt could be a consumer of this (as it already has the 'queues'
> and host 'tso4', 'tso6', etc in their XML schemas)
> 
> Thoughts? Do folks think this is the correct way of handling this?
> 
> Cheers,
> Joao
> 
> [0] https://github.com/qemu/qemu/blob/master/hw/net/virtio-net.c#L2102
> 

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] Feature control on PV devices

2017-09-07 Thread Joao Martins
Hey!

We wanted to bring up this small proposal regarding the lack of
parameterization on PV devices on Xen.

Currently users don't have a way to enforce and control what
features/queues/etc the backend provides. So far there are only global
parameters on backends, and specs do not mention anything in this regard.

The most obvious example is the netback/blkback max_queues module parameter,
which sets the maximum number of queues for all devices and is not that
flexible. Other examples include controlling offloads visible by the NIC (e.g.
disabling checksum offload, disabling scatter-gather), others more about the
I/O path (e.g. disable blkif indirect descriptors, limit number of pages for
the ring), or less grant usage by minimizing number of queues/descriptors.

Of course there could be more examples, as this seems to be orthogonal to the
kinds of PV backends we have. And it seems like all features appear to be
published on the same xenbus state?

The idea to address this would be very simple:

- Toolstack, when initializing device paths, writes additional entries in the
form of 'request-<feature-name>' = <feature-value>. These entries are only
visible by the backend and toolstack;

- Backend reads these entries and uses <feature-value> as the value of
<feature-name>, which will then be visible on the frontend.

[ Removal of the 'request-*' xenstore entries could represent a feedback loop
  that the backend indeed read and used the value. Or else it could simply be
  ignored. ]

And that's it.

In practice the user would do, e.g.:

domain.cfg:
...
name = "guest"
kernel = "bzImage"
vif = ["bridge=br0,queues=2"]
disk = [
"format=raw,vdev=hda,access=rw,backendtype=phy,target=/dev/HostVG/XenGuest2,queues=1,max-ring-page-order=0"
]
...

Toolstack writes:

/local/domain/0/backend/vif/8/0/request-multi-queue-max-queues = 2
/local/domain/0/backend/vbd/8/51713/request-multi-queue-max-queues = 2
/local/domain/0/backend/vbd/8/51713/request-max-ring-page-order = 0

Backend reads and seeds with (assuming it passes backend validation ofc):

/local/domain/0/backend/vif/8/0/multi-queue-max-queues = 2
/local/domain/0/backend/vbd/8/51713/multi-queue-max-queues = 2
/local/domain/0/backend/vbd/8/51713/max-ring-page-order = 0

The XL configuration entries for controlling these tunables are just examples;
it's not clear what the general preference for this is. An alternative could be:

vif = ["bridge=br0,features=queues:2\\;max-ring-page-order:0"]

Which lets us have more generic feature control, without sticking to particular
feature names.
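
To make the generic form a bit more concrete, here is a sketch of how a
toolstack could split such a `features=` value into name/value pairs. It is
an assumption-laden example: `parse_features()` and `struct feature_kv` are
made-up names, values are taken to be unsigned decimal numbers, and the input
is assumed to have the xl escaping already removed (i.e. a plain `;`
separator).

```c
#include <stdlib.h>
#include <string.h>

struct feature_kv {
    char name[32];
    unsigned long value;
};

/*
 * Parse "name:value;name:value" into out[]. Returns the number of pairs
 * parsed, or -1 on malformed input or when more than 'max' pairs are given.
 */
static int parse_features(const char *spec, struct feature_kv *out, int max)
{
    int n = 0;

    while (*spec != '\0') {
        const char *end = strchr(spec, ';');
        size_t len = end ? (size_t)(end - spec) : strlen(spec);
        const char *colon = memchr(spec, ':', len);
        char *num_end;
        size_t name_len;

        /* Need room in out[], a ':' separator and a non-empty name. */
        if (n == max || colon == NULL || colon == spec)
            return -1;

        name_len = (size_t)(colon - spec);
        if (name_len >= sizeof(out[n].name))
            return -1;
        memcpy(out[n].name, spec, name_len);
        out[n].name[name_len] = '\0';

        /* The number must be present and run up to the end of the pair. */
        out[n].value = strtoul(colon + 1, &num_end, 10);
        if (num_end == colon + 1 || num_end != spec + len)
            return -1;

        n++;
        spec += len;
        if (*spec == ';')
            spec++;
    }
    return n;
}
```

Each parsed pair would then map onto one 'request-*' key, so the toolstack
does not need to know the individual feature names in advance.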

Naturally libvirt could be a consumer of this (as it already has the 'queues'
and host 'tso4', 'tso6', etc in their XML schemas)

Thoughts? Do folks think this is the correct way of handling this?

Cheers,
Joao

[0] https://github.com/qemu/qemu/blob/master/hw/net/virtio-net.c#L2102



Re: [Xen-devel] [PATCH v2 1/1] public/io/netif.h: add gref mapping control messages

2017-09-06 Thread Joao Martins
On 09/06/2017 02:49 PM, Paul Durrant wrote:
>> -----Original Message-----
>> From: Joao Martins [mailto:joao.m.mart...@oracle.com]
>> Sent: 01 September 2017 15:51
>> To: Xen-devel <xen-devel@lists.xen.org>
>> Cc: Wei Liu <wei.l...@citrix.com>; Paul Durrant <paul.durr...@citrix.com>;
>> Konrad Rzeszutek Wilk <konrad.w...@oracle.com>; Joao Martins
>> <joao.m.mart...@oracle.com>
>> Subject: [PATCH v2 1/1] public/io/netif.h: add gref mapping control messages
>>
>> Adds 3 messages to allow guest to let backend keep grants mapped,
>> such that 1) guests allowing fast recycling of pages can avoid doing
>> grant ops for those cases, or otherwise 2) preferring copies over
>> grants and 3) always using a fixed set of pages for network I/O.
>>
>> The three control ring messages added are:
>>  - Add grefs to be mapped by backend
>>  - Remove grefs mappings (If they are not in use)
>>  - Get maximum amount of grefs kept mapped.
>>
>> Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
>> ---
>>  xen/include/public/io/netif.h | 114
>> ++
>>  1 file changed, 114 insertions(+)
>>
>> diff --git a/xen/include/public/io/netif.h b/xen/include/public/io/netif.h
>> index ca0061410d..264c317471 100644
>> --- a/xen/include/public/io/netif.h
>> +++ b/xen/include/public/io/netif.h
>> @@ -353,6 +353,9 @@ struct xen_netif_ctrl_request {
>>  #define XEN_NETIF_CTRL_TYPE_SET_HASH_MAPPING_SIZE 5
>>  #define XEN_NETIF_CTRL_TYPE_SET_HASH_MAPPING      6
>>  #define XEN_NETIF_CTRL_TYPE_SET_HASH_ALGORITHM    7
>> +#define XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE 8
>> +#define XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING      9
>> +#define XEN_NETIF_CTRL_TYPE_PUT_GREF_MAPPING     10
>>
>>  uint32_t data[3];
>>  };
>> @@ -391,6 +394,41 @@ struct xen_netif_ctrl_response {
>>  };
>>
>>  /*
>> + * Static Grants (struct xen_netif_gref_alloc)
>> + * ===
>> + *
>> + * A frontend may provide a fixed set of grant references to be mapped on
>> + * the backend. The message of type
>> XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING
>> + * prior its usage in the command ring allows for creation of these 
>> mappings.
>> + * The backend will maintain a fixed amount of these mappings.
>> + *
>> + * XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE lets a frontend
>> query how many
>> + * of these mappings can be kept.
>> + *
>> + * Each entry in the XEN_NETIF_CTRL_TYPE_{ADD,PUT}_GREF_MAPPING
>> input table has
> 
> ADD and PUT are slightly odd choices for opposites. Normally you'd have 'get'
> and 'put' or 'add' and 'remove' (or 'delete').
> 
That's true - I was probably too obsessed with fitting in 3 characters to avoid
realigning the earlier chunk listing all ctrl message types. ADD, DEL is probably
a better one (GET would sound a bit strange for these ops).

>> + * the following format:
>> + *
>> + *    0     1     2     3     4     5     6     7  octet
>> + * +-----+-----+-----+-----+-----+-----+-----+-----+
>> + * | grant ref             |  flags    |  padding  |
>> + * +-----+-----+-----+-----+-----+-----+-----+-----+
>> + *
>> + * grant ref: grant reference
>> + * flags: flags describing the control operation
>> + *
>> + */
>> +
>> +struct xen_netif_gref_alloc {
> 
> Is 'alloc' really desirable here? What's being allocated?
> 
Probably not my best choice of naming, but given that we aren't actually mapping
on the frontend but rather on the backend, I chose 'alloc'. But as you hint,
it might be misleading. Would 'map' or 'mapping' be better candidates?

>> +   grant_ref_t ref;
>> +   uint16_t flags;
>> +
>> +#define _XEN_NETIF_CTRLF_GREF_readonly    0
>> +#define XEN_NETIF_CTRLF_GREF_readonly
>> (1U<<_XEN_NETIF_CTRLF_GREF_readonly)
>> +
>> +   uint8_t pad[2];
>> +};
>> +
>> +/*
>>   * Control messages
>>   * 
>>   *
>> @@ -609,6 +647,82 @@ struct xen_netif_ctrl_response {
>>   *   invalidate any table data outside that range.
>>   *   The grant reference may be read-only and must remain valid until
>>   *   the response has been processed.
>> + *
>> + * XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE
>> + * -
>> + *
>> + * This is sent by the frontend to fetch the number of grefs that can be 
>> kept
>> + * mapped in the backend.
>> + *
>> + * 

[Xen-devel] [PATCH v2 1/1] public/io/netif.h: add gref mapping control messages

2017-09-01 Thread Joao Martins
Adds 3 messages to allow guest to let backend keep grants mapped,
such that 1) guests allowing fast recycling of pages can avoid doing
grant ops for those cases, or otherwise 2) preferring copies over
grants and 3) always using a fixed set of pages for network I/O.

The three control ring messages added are:
 - Add grefs to be mapped by backend
 - Remove grefs mappings (If they are not in use)
 - Get maximum amount of grefs kept mapped.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
 xen/include/public/io/netif.h | 114 ++
 1 file changed, 114 insertions(+)

diff --git a/xen/include/public/io/netif.h b/xen/include/public/io/netif.h
index ca0061410d..264c317471 100644
--- a/xen/include/public/io/netif.h
+++ b/xen/include/public/io/netif.h
@@ -353,6 +353,9 @@ struct xen_netif_ctrl_request {
 #define XEN_NETIF_CTRL_TYPE_SET_HASH_MAPPING_SIZE 5
 #define XEN_NETIF_CTRL_TYPE_SET_HASH_MAPPING      6
 #define XEN_NETIF_CTRL_TYPE_SET_HASH_ALGORITHM    7
+#define XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE 8
+#define XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING      9
+#define XEN_NETIF_CTRL_TYPE_PUT_GREF_MAPPING     10
 
 uint32_t data[3];
 };
@@ -391,6 +394,41 @@ struct xen_netif_ctrl_response {
 };
 
 /*
+ * Static Grants (struct xen_netif_gref_alloc)
+ * ===========================================
+ *
+ * A frontend may provide a fixed set of grant references to be mapped on
+ * the backend. The message of type XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING
+ * prior its usage in the command ring allows for creation of these mappings.
+ * The backend will maintain a fixed amount of these mappings.
+ *
+ * XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE lets a frontend query how many
+ * of these mappings can be kept.
+ *
+ * Each entry in the XEN_NETIF_CTRL_TYPE_{ADD,PUT}_GREF_MAPPING input table has
+ * the following format:
+ *
+ *    0     1     2     3     4     5     6     7  octet
+ * +-----+-----+-----+-----+-----+-----+-----+-----+
+ * | grant ref             |  flags    |  padding  |
+ * +-----+-----+-----+-----+-----+-----+-----+-----+
+ *
+ * grant ref: grant reference
+ * flags: flags describing the control operation
+ *
+ */
+
+struct xen_netif_gref_alloc {
+   grant_ref_t ref;
+   uint16_t flags;
+
+#define _XEN_NETIF_CTRLF_GREF_readonly    0
+#define XEN_NETIF_CTRLF_GREF_readonly    (1U<<_XEN_NETIF_CTRLF_GREF_readonly)
+
+   uint8_t pad[2];
+};
+
+/*
  * Control messages
  * 
  *
@@ -609,6 +647,82 @@ struct xen_netif_ctrl_response {
  *   invalidate any table data outside that range.
  *   The grant reference may be read-only and must remain valid until
  *   the response has been processed.
+ *
+ * XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE
+ * -----------------------------------------
+ *
+ * This is sent by the frontend to fetch the number of grefs that can be kept
+ * mapped in the backend.
+ *
+ * Request:
+ *
+ *  type    = XEN_NETIF_CTRL_TYPE_GET_GREF_MAPPING_SIZE
+ *  data[0] = queue index (assumed 0 for single queue)
+ *  data[1] = 0
+ *  data[2] = 0
+ *
+ * Response:
+ *
+ *  status = XEN_NETIF_CTRL_STATUS_NOT_SUPPORTED     - Operation not
+ *                                                     supported
+ *           XEN_NETIF_CTRL_STATUS_INVALID_PARAMETER - The queue index is
+ *                                                     out of range
+ *           XEN_NETIF_CTRL_STATUS_SUCCESS           - Operation successful
+ *  data   = maximum number of entries allowed in the gref mapping table
+ *           (if operation was successful) or zero if a mapping table is
+ *           not supported (i.e. hash mapping is done only by modular
+ *           arithmetic).
+ *
+ * XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING
+ * ------------------------------------
+ *
+ * This is sent by the frontend for backend to map a list of grant
+ * references.
+ *
+ * Request:
+ *
+ *  type    = XEN_NETIF_CTRL_TYPE_ADD_GREF_MAPPING
+ *  data[0] = queue index
+ *  data[1] = grant reference of page containing the mapping list
+ *            (assumed to start at beginning of grant)
+ *  data[2] = size of list in entries
+ *
+ * Response:
+ *
+ *  status = XEN_NETIF_CTRL_STATUS_NOT_SUPPORTED     - Operation not
+ *                                                     supported
+ *           XEN_NETIF_CTRL_STATUS_INVALID_PARAMETER - Operation failed
+ *           XEN_NETIF_CTRL_STATUS_SUCCESS           - Operation successful
+ *
+ * NOTE: Each entry in the input table has the format outlined
+ *       in struct xen_netif_gref_alloc.
+ *
+ * XEN_NETIF_CTRL_TYPE_PUT_GREF_MAPPING
+ * ------------------------------------
+ *
+ * This is sent by the frontend for backend to unmap a list of grant
+ * references.
+ *
+ * Request:
+ *
+ *  type    = XEN_NETIF_CTRL_TYPE_PUT_GREF_MAPPING
+ *  data[0] = queue index
+ *  data[1] = grant reference of page containing the mapping list
+ *            (assumed to start at beginning of page)
+ *  da

[Xen-devel] [PATCH v2 0/1] netif: staging grants for I/O requests

2017-09-01 Thread Joao Martins
Hey,

This is v2 taking into consideration all comments received from RFC/v1.
The approach has significantly changed and we use control ring messages to
manage the permanent gref mappings - as also presented at XDDS17. The
specification is right after the diffstat. I kept the same name "staging
grants" but feel free to suggest an alternative.

Reference implementation also here (on top of net-next):

https://github.com/jpemartins/linux.git xen-net-stg-gnts-v2

I would also like to note that this lays some groundwork for netback to
work with a known set of pages, hence there will be less effort involved
in making netback work with something like zerogrant.

Cheers,
Joao

Joao Martins (1):
  public/io/netif.h: add gref mapping control messages

 xen/include/public/io/netif.h | 114 ++
 1 file changed, 114 insertions(+)
---
% Staging grants for network I/O requests
% Joao Martins <<joao.m.mart...@oracle.com>>
% Revision 2

\clearpage


Architecture(s): Any


# Background and Motivation

At the Xen hackathon '16 networking session, we spoke about having a permanently
mapped region to describe header/linear region of packet buffers. This document
outlines the proposal covering motivation of this and applicability for other
use-cases alongside the necessary changes. This proposal is an RFC and also
includes alternative solutions.

The motivation of this work is to eliminate grant ops for packet I/O intensive
workloads such as those observed with smaller request sizes (i.e. <= 256 bytes
or <= MTU). Currently on Xen, bulk transfers (e.g. 32K..64K packets) are the
only ones performing really well (up to 80 Gbit/s in a few CPUs), usually
backing end-hosts and server appliances. Anything that involves higher packet
rates (<= 1500 MTU) or runs without sg performs badly, close to 1 Gbit/s
throughput.

# Proposal

The proposal is to leverage the already implicit copy from and to packet linear
data on netfront and netback, to be done instead from a permanently mapped
region. In some (physical) NICs this is known as header/data split.

Specifically for some workloads (e.g. NFV) it would provide a big increase in
throughput when we switch to (zero)copying in the backend/frontend, instead of
the grant hypercalls. Thus this extension aims at futureproofing the netif
protocol by adding the possibility of guests setting up a list of grants that
are set up at device creation and revoked at device freeing - without taking
up too many grant entries in the general case (i.e. to cover only the
header region <= 256 bytes, 16 grants per ring) while remaining configurable by
the kernel when one wants to resort to a copy-based approach as opposed to
grant copy/map.

\clearpage

# General Operation

Here we describe how netback and netfront generally operate, and where the
proposed solution will fit. The security mechanism currently involves grant
references, which in essence are round-robin recycled 'tickets' stamped with the
GPFNs, permission attributes, and the authorized domain:

(This is an in-memory view of struct grant_entry_v1):

 0     1     2     3     4     5     6     7 octet
+------------+-----------+------------------------+
| flags      | domain id | frame                  |
+------------+-----------+------------------------+

Where there are N grant entries in a grant table, for example:

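@0:
+------------+-----------+------------------------+
| rw         | 0         | 0xABCDEF               |
+------------+-----------+------------------------+
| rw         | 0         | 0xFA124                |
+------------+-----------+------------------------+
| ro         | 1         | 0xBEEF                 |
+------------+-----------+------------------------+

  .
@N:
+------------+-----------+------------------------+
| rw         | 0         | 0x9923A                |
+------------+-----------+------------------------+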

Each entry consumes 8 bytes, therefore 512 entries can fit on one page.
`gnttab_max_frames` defaults to 32 pages, hence 16,384 grants in total.
The ParaVirtualized (PV) drivers will use the grant reference (index
in the grant table - 0 .. N) in their command ring.
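
The numbers above follow directly from the entry layout. The sketch below
restates the version 1 entry as a C struct and redoes the arithmetic; the
4 KiB page size and the macro/helper names are assumptions of the example,
while the default of 32 frames is the value quoted in the text.

```c
#include <stdint.h>

typedef uint16_t domid_t;

/* Version 1 grant entry as drawn above: 2 + 2 + 4 = 8 bytes. */
struct grant_entry_v1 {
    uint16_t flags;   /* permission/type bits (e.g. read-only) */
    domid_t  domid;   /* domain authorized to use the grant */
    uint32_t frame;   /* frame being shared */
};

#define XEN_PAGE_SIZE_BYTES       4096u  /* assumed 4 KiB page */
#define GNTTAB_MAX_FRAMES_DEFAULT 32u    /* default quoted in the text */

/* Grant entries that fit in one grant table page. */
static inline uint32_t grants_per_frame(void)
{
    return XEN_PAGE_SIZE_BYTES / (uint32_t)sizeof(struct grant_entry_v1);
}

/* Total grants available for a given number of grant table pages. */
static inline uint32_t max_grants(uint32_t frames)
{
    return frames * grants_per_frame();
}
```

`max_grants(GNTTAB_MAX_FRAMES_DEFAULT)` gives 32 * 512 = 16,384, which is the
budget every PV device of a guest has to share.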

\clearpage

## Guest Transmit

The view of the shared transmit ring is the following:

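 0     1     2     3     4     5     6     7 octet
+------------------------+------------------------+
| req_prod               | req_event              |
+------------------------+------------------------+
| rsp_prod               | rsp_event              |
+------------------------+------------------------+
| pvt                    | pad[44]                |
+------------------------+                        |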
| 

Re: [Xen-devel] DESIGN v2: CPUID part 3

2017-08-02 Thread Joao Martins
On 08/01/2017 07:34 PM, Andrew Cooper wrote:
> On 31/07/2017 20:49, Konrad Rzeszutek Wilk wrote:
>> On Wed, Jul 05, 2017 at 02:22:00PM +0100, Joao Martins wrote:
>>> On 07/05/2017 12:16 PM, Andrew Cooper wrote:
>>>> On 05/07/17 10:46, Joao Martins wrote:
>>>>> Hey Andrew,
>>>>>
>>>>> On 07/04/2017 03:55 PM, Andrew Cooper wrote:
>>>>>
>>>>>> (RFC: Decide exactly where to fit this.  _XEN\_DOMCTL\_max\_vcpus_ perhaps?)
>>>>>> The toolstack shall also have a mechanism to explicitly select topology
>>>>>> configuration for the guest, which primarily affects the virtual APIC ID
>>>>>> layout, and has a knock on effect for the APIC ID of the virtual IO-APIC.
>>>>>> Xen's auditing shall ensure that guests observe values consistent with
>>>>>> the guarantees made by the vendor manuals.
>>>>>>
>>>>> Why choose max_vcpus domctl?
>>>> Despite its name, the max_vcpus hypercall is the one which allocates all
>>>> the vcpus in the hypervisor.  I don't want there to be any opportunity
>>>> for vcpus to exist but no topology information to have been provided.
>>>>
>>> /nods
>>>
>>> So then doing this at vcpus allocation we would need to pass an additional
>>> CPU topology argument on the max_vcpus hypercall? Otherwise it's sort of
>>> guess work wrt sockets, cores, threads ... no?
>> Andrew, thoughts on this and the one below?
> 
> Urgh sorry.  I've been distracted with some high priority interrupts (of
> the non-maskable variety).
> 
> So, bad news is that the CPUID and MSR policy handling has become
> substantially more complicated and entwined than I had first planned.  A
> change in either of the data alters the auditing of the other, so I am
> leaning towards implementing everything with a single set hypercall (as
> this is the only way to get a plausibly-consistent set of data).
> 
> The good news is that I don't think we actually need any changes to the
> XEN_DOMCTL_max_vcpus.  I now think there is sufficient expressibility in
> the static cpuid policy to work.
> 
Awesome!

>>> There could be other uses too on passing this info to Xen, say e.g. the
>>> scheduler knowing the guest CPU topology it would allow better selection of
>>> core+sibling pair such that it could match cache/cpu topology passed on the
>>> guest (for unpinned SMT guests).
> 
> I remain to be convinced (i.e. with some real performance numbers) that
> the added complexity in the scheduler for that logic is a benefit in the
> general case.
> 
The suggestion above was a simple extension to struct domain (e.g. cores/threads
or struct cpu_topology field) - nothing too disruptive I think.
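
Purely to illustrate what such a field could look like (none of this exists
in Xen; the struct layout, the thread-first vcpu numbering and both helper
names are assumptions of the sketch):

```c
#include <stdbool.h>

/* Hypothetical per-domain topology description. */
struct cpu_topology {
    unsigned int cores_per_socket;
    unsigned int threads_per_core;   /* must be non-zero */
};

/* Core index of a vcpu, assuming vcpus are numbered thread-first. */
static unsigned int vcpu_core(const struct cpu_topology *t, unsigned int vcpu)
{
    return vcpu / t->threads_per_core;
}

/* True if two distinct vcpus are SMT siblings of the same virtual core. */
static bool vcpus_are_siblings(const struct cpu_topology *t,
                               unsigned int a, unsigned int b)
{
    return a != b && vcpu_core(t, a) == vcpu_core(t, b);
}
```

A scheduler could use a predicate like `vcpus_are_siblings()` to decide which
vcpus to place on a physical core+sibling pair.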

But I cannot really argue on this as this was just an idea that I found
interesting (no numbers to support it entirely). We just happened to see it
under-perform when a simple range of cpus was used for affinity, and some
vcpus ended up being scheduled on the same core+sibling pair IIRC; hence I
(perhaps naively) imagined that there could be value in further scheduler
enlightenment e.g. "gang-scheduling" where we always schedule core+sibling
together. I was speaking to Dario (CC'ed) at the summit about whether CPU
topology could have value - and there might be, but it remains to be explored
once we're able to pass a cpu topology to the guest. (In the past he seemed
enthusiastic about the idea of the topology[0], hence I assumed it to be in
the context of schedulers)

[0] https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg03850.html

> In practice, customers are either running very specific and dedicated
> workloads (at which point pinning is used and there is no
> oversubscription, and exposing the actual SMT topology is a good thing),
>
/nods

> or customers are running general workloads with no pinning (or perhaps
> cpupool-numa-split) with a moderate amount of oversubscription (at which
> point exposing SMT is a bad move).
> 
Given the scale you folks invest in over-subscription (1000 VMs), I wonder what
moderate means here :P

> Counterintuitively, exposing NUMA in general oversubscribed scenarios is
> terrible for net system performance.  What happens in practice is that
> VMs which see NUMA spend their idle cycles trying to balance their own
> userspace processes, rather than yielding to the hypervisor so another
> guest can get a go.
> 
Interesting to know - vNUMA is perhaps only better placed for performance cases
where both (or either) I/O topology and memory locality matter - or when going
for bigger guests, provided that the corresponding CPU topology is provided.

Joao



Re: [Xen-devel] PV drivers and zero copying

2017-07-31 Thread Joao Martins
On 07/31/2017 12:41 PM, Oleksandr Andrushchenko wrote:
> Hi, Joao!
> 
> On 07/31/2017 02:03 PM, Joao Martins wrote:
>> Hey Oleksandr,
>>
>> On 07/31/2017 09:34 AM, Oleksandr Andrushchenko wrote:
>>> Hi, all!
>>>
>> [snip]
>>> Comparison for display use-case
>>> ===
>>>
>>> 1 Number of grant references used
>>> 1-1 grant references: nr_pages
>>> 1-2 GNTTABOP_transfer: nr_pages
>>> 1-3 XENMEM_exchange: not an option
>>>
>>> 2 Effect of DomU crash on Dom0 (its mapped pages)
>>> 2-1 grant references: pages can be unmapped by Dom0, Dom0 is fully
>>> recovered
>>> 2-2 GNTTABOP_transfer: pages will be returned to the Hypervisor, lost
>>> for Dom0
>>> 2-3 XENMEM_exchange: not an option
>>>
>>> 3 Security issues from sharing Dom0 pages to DomU
>>> 3-1 grant references: none
>>> 3-2 GNTTABOP_transfer: none
>>> 3-3 XENMEM_exchange: not an option
>>>
>>> At the moment approach 1 with granted references seems to be a winner for
>>> sharing buffers both ways, e.g. Dom0 -> DomU and DomU -> Dom0.
>>>
>>> Conclusion
>>> ==
>>>
>>> I would like to get some feedback from the community on which approach
>>> is more
>>> suitable for sharing large buffers and to have a clear vision on cons
>>> and pros
>>> of each one: please feel free to add other metrics I missed and correct
>>> the ones
>>> I commented on.  I would appreciate help on comparing approaches 2 and 3
>>> as I
>>> have little knowledge of these APIs (2 seems to be addressed by
>>> Christopher, and
>>> 3 seems to be relevant to what Konrad/Stefano do WRT SWIOTLB).
>> Depending on your performance/memory requirements - there could be another
>> option which is to keep the guest mapped on Domain-0 (what was discussed with
>> Zerogrant session[0][1] that will be formally proposed in the next month or 
>> so).
> Unfortunately I missed that session during the Summit
> due to overlapping sessions

Hmm - Zerocopy Rx (Dom0 -> DomU) would indeed be an interesting topic to bring up.

>> But that would only solve the grant maps/unmaps/copies done on Domain-0 
>> (given
>> the numbers you pasted a bit ago, you might not really need to go to such 
>> extents)
>>
>> [0]
>> http://schd.ws/hosted_files/xendeveloperanddesignsummit2017/05/zerogrant_spec.pdf
>> [1]
>> http://schd.ws/hosted_files/xendeveloperanddesignsummit2017/a8/zerogrant_slides.pdf
> I will read these, thank you for the links
>> For the buffers allocated on Dom0 and safely grant buffers from Dom0 to DomU
>> (which I am not so sure it is possible today :()
> We have this working in our setup for display (we have implemented
> z-copy with grant references already)

Allow me to clarify :) I meant "possible to do it in a safe manner", IOW,
with respect to what I mention in the following paragraphs. But your answer
below clarifies that aspect.

>> , maybe a "contract" from DomU
>> provide a set of transferable pages that Dom0 holds on for each Dom-0 gref
>> provided to the guest (and assuming this is only a handful couple of guests 
>> as
>> grant table is not that big).
> It is an option
>>
>>   IIUC, From what you pasted above on "Buffer
>> allocated @Dom0" sounds like Domain-0 could quickly ran out of pages/OOM (and
>> grants), if you're guest is misbehaving/buggy or malicious; *also* domain-0
>> grant table is a rather finite/small resource (even though you can override 
>> the
>> number of frames in the arguments).
> Well, you are right. But, we are focusing on embedded appliances,
> so those systems we use are not that "dynamic" with that respect.
> Namely: we have fixed number of domains and their functionality
> is well known, so we can do rather precise assumption on resource
> usage.

Interesting! So here I presume the backend trusts the frontend.

Cheers,
Joao



Re: [Xen-devel] PV drivers and zero copying

2017-07-31 Thread Joao Martins
On 07/31/2017 11:37 AM, Oleksandr Andrushchenko wrote:
> On 07/31/2017 01:04 PM, Julien Grall wrote:
>> On 31/07/17 10:52, Oleksandr Andrushchenko wrote:
>>> On 07/31/2017 12:47 PM, Julien Grall wrote:
>>>> On 31/07/17 10:46, Oleksandr Andrushchenko wrote:
>>>> Do you have any example of hardware? What are the performance you
>>>> require with them?

>>> Currently our target is Renesas R-Car Gen3
>>> At the moment I don't have clean requirements, but
>>> ideally, PV driver introduces 0% performance drop
>>> Some time soon I will have numbers on running display/GPU
>>> with and without zero-copy - will keep updated
>>
>> PV driver with 0% performance drop sounds a stretch target.
> It is, but we should always get as close as possible
>> But this is does not answer to my question. Do you have any hardware 
>> that does not support scatter/gather
> AFAIK display driver which doesn't support scatter-gather on our platform
> (BSP based on 4.9 kernel, rcar_du uses DRM CMA - DRM contiguous memory 
> allocator)
> Anyways, for pages above 4GB even scatter-gather will not help
> devices with 32-bit DMA
>> or not protected by an IOMMU that will be interfaced with PV drivers?
>>
> As per my understanding, IOMMU is solely owned by the hypervisor now
> and there is no API to tell Xen from Dom0 to setup IOMMU for such
> a buffer (pages), so display HW can do DMA with that buffer.
> Thus, Dom0 has no means to do that work and make PV driver produce
> buffers which can be used by the real HW driver without bounce buffering.

Sounds like this is addressed by PV-IOMMU[0], which I think will be resurrected
in the coming months as per the design session at the last hackathon.

[0] https://schd.ws/hosted_files/xendeveloperanddesignsummit2017/91/PV-IOMMU.txt



Re: [Xen-devel] PV drivers and zero copying

2017-07-31 Thread Joao Martins
Hey Oleksandr,

On 07/31/2017 09:34 AM, Oleksandr Andrushchenko wrote:
> Hi, all!
> 
[snip]
> 
> Comparison for display use-case
> ===
> 
> 1 Number of grant references used
> 1-1 grant references: nr_pages
> 1-2 GNTTABOP_transfer: nr_pages
> 1-3 XENMEM_exchange: not an option
> 
> 2 Effect of DomU crash on Dom0 (its mapped pages)
> 2-1 grant references: pages can be unmapped by Dom0, Dom0 is fully 
> recovered
> 2-2 GNTTABOP_transfer: pages will be returned to the Hypervisor, lost 
> for Dom0
> 2-3 XENMEM_exchange: not an option
> 
> 3 Security issues from sharing Dom0 pages to DomU
> 3-1 grant references: none
> 3-2 GNTTABOP_transfer: none
> 3-3 XENMEM_exchange: not an option
> 
> At the moment approach 1 with granted references seems to be a winner for
> sharing buffers both ways, e.g. Dom0 -> DomU and DomU -> Dom0.
> 
> Conclusion
> ==
> 
> I would like to get some feedback from the community on which approach 
> is more
> suitable for sharing large buffers and to have a clear vision on cons 
> and pros
> of each one: please feel free to add other metrics I missed and correct 
> the ones
> I commented on.  I would appreciate help on comparing approaches 2 and 3 
> as I
> have little knowledge of these APIs (2 seems to be addressed by 
> Christopher, and
> 3 seems to be relevant to what Konrad/Stefano do WRT SWIOTLB).

Depending on your performance/memory requirements, there could be another
option, which is to keep the guest mapped on Domain-0 (this was discussed in the
Zerogrant session[0][1] and will be formally proposed in the next month or so).
But that would only address the grant maps/unmaps/copies done on Domain-0 (given
the numbers you pasted a while ago, you might not really need to go to such
lengths).

[0]
http://schd.ws/hosted_files/xendeveloperanddesignsummit2017/05/zerogrant_spec.pdf
[1]
http://schd.ws/hosted_files/xendeveloperanddesignsummit2017/a8/zerogrant_slides.pdf

As for buffers allocated on Dom0 and safely granted from Dom0 to DomU
(which I am not so sure is possible today :(), maybe there could be a "contract"
where DomU provides a set of transferable pages that Dom0 holds on to for each
Dom0 gref provided to the guest (assuming this is only for a handful of guests,
as the grant table is not that big). IIUC, from what you pasted above on "Buffer
allocated @Dom0", it sounds like Domain-0 could quickly run out of pages/OOM (and
grants) if your guest is misbehaving/buggy or malicious; *also*, the Domain-0
grant table is a rather finite/small resource (even though you can override the
number of frames in the arguments).
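
To put a rough number on "not that big", here is a back-of-the-envelope
sketch. It assumes 4 KiB pages and v1 grant entries (8 bytes each: 16-bit
flags, 16-bit domid, 32-bit frame); the helper names, the 32-frame table and
the 1920x1080 32bpp buffer are only illustrative assumptions, not values taken
from your setup.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative assumptions: 4 KiB pages, 8-byte v1 grant entries. */
#define PAGE_BYTES     4096UL
#define GRANT_V1_BYTES 8UL

/* Grant entries available in a table made of `frames` pages. */
unsigned long grants_in_table(unsigned long frames)
{
    return frames * (PAGE_BYTES / GRANT_V1_BYTES);
}

/* Pages (and hence grefs) needed to share a buffer page by page. */
unsigned long grefs_for_buffer(unsigned long bytes)
{
    return (bytes + PAGE_BYTES - 1) / PAGE_BYTES;
}

/* Whole buffers of `bytes` each that fit before the table is exhausted. */
unsigned long buffers_that_fit(unsigned long frames, unsigned long bytes)
{
    assert(bytes > 0);
    return grants_in_table(frames) / grefs_for_buffer(bytes);
}
```

Under those assumptions a 32-frame table holds 16384 grants, one 1920x1080
buffer at 4 bytes per pixel needs 2025 grefs, and so only 8 such buffers fit,
before counting whatever the other PV devices already use.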

Joao



Re: [Xen-devel] DESIGN v2: CPUID part 3

2017-07-05 Thread Joao Martins
On 07/05/2017 12:16 PM, Andrew Cooper wrote:
> On 05/07/17 10:46, Joao Martins wrote:
>> Hey Andrew,
>>
>> On 07/04/2017 03:55 PM, Andrew Cooper wrote:
>>> Presented herewith is the a plan for the final part of CPUID work, which
>>> primarily covers better Xen/Toolstack interaction for configuring the guests
>>> CPUID policy.
>>>
>> Really nice write up, a few comments below.
>>
>>> A PDF version of this document is available from:
>>>
>>> http://xenbits.xen.org/people/andrewcoop/cpuid-part-3-rev2.pdf
>>>
>>> Changes from v1:
>>>  * Clarification of the interaction of emulated features
>>>  * More information about the difference between max and default 
>>> featuresets.
>>>
>>> ~Andrew
>>>
>>> -8<-
>>> % CPUID Handling (part 3)
>>> % Revision 2
>>>

[snip]

>>> # Proposal
>>>
>>> First and foremost, split the current **max\_policy** notion into separate
>>> **max** and **default** policies.  This allows for the provision of features
>>> which are unused by default, but may be opted in to, both at the hypervisor
>>> level and the toolstack level.
>>>
>>> At the hypervisor level, **max** constitutes all the features Xen can use on
>>> the current hardware, while **default** is the subset thereof which are
>>> supported features, the features which the user has explicitly opted in to,
>>> and excluding any features the user has explicitly opted out of.
>>>
>>> A new `cpuid=` command line option shall be introduced, whose internals are
>>> generated automatically from the featureset ABI.  This means that all 
>>> features
>>> added to `include/public/arch-x86/cpufeatureset.h` automatically gain 
>>> command
>>> line control.  (RFC: The same top level option can probably be used for
>>> non-feature CPUID data control, although I can't currently think of any 
>>> cases
>>> where this would be used Also find a sensible way to express 'available but
>>> not to be used by Xen', as per the current `smep` and `smap` options.)
>>>
>>>
>>> At the guest level, the **max** policy is conceptually unchanged.  It
>>> constitutes all the features Xen is willing to offer to each type of guest 
>>> on
>>> the current hardware (including emulated features).  However, it shall 
>>> instead
>>> be derived from Xen's **default** host policy.  This is to ensure that
>>> experimental hypervisor features must be opted in to at the Xen level before
>>> they can be opted in to at the toolstack level.
>>>
>>> The guests **default** policy is then derived from its **max**.  This is
>>> because there are some features which should always be explicitly opted in 
>>> to
>>> by the toolstack, such as emulated features which come with a security
>>> trade-off, or for non-architectural features which may differ in
>>> implementation in heterogeneous environments.
>>>
>>> All global policies (Xen and guest, max and default) shall be made available
>>> to the toolstack, in a manner similar to the existing
>>> _XEN\_SYSCTL\_get\_cpu\_featureset_ mechanism.  This allows decisions to be
>>> taken which include all CPUID data, not just the feature bitmaps.
>>>
>>> New _XEN\_DOMCTL\_{get,set}\_cpuid\_policy_ hypercalls will be introduced,
>>> which allows the toolstack to query and set the cpuid policy for a specific
>>> domain.  It shall supersede _XEN\_DOMCTL\_set\_cpuid_, and shall fail if Xen
>>> is unhappy with any aspect of the policy during auditing.  This provides
>>> feedback to the user that a chosen combination will not work, rather than 
>>> the
>>> guest booting in an unexpected state.
>>>
>>> When a domain is initially created, the appropriate guests **default** 
>>> policy
>>> is duplicated for use.  When auditing, Xen shall audit the toolstacks
>>> requested policy against the guests **max** policy.  This allows 
>>> experimental
>>> features or non-migration-safe features to be opted in to, without those
>>> features being imposed upon all guests automatically.
>>>
>>> A guests CPUID policy shall be immutable after construction.  This better
>>> matches real hardware, and simplifies the logic in Xen to translate policy
>>> alterations into configuration changes.
>>>
>> This appears to be a suitable abstraction even for higher level too

Re: [Xen-devel] DESIGN v2: CPUID part 3

2017-07-05 Thread Joao Martins
On 07/05/2017 10:46 AM, Joao Martins wrote:
> Hey Andrew,
> 
> On 07/04/2017 03:55 PM, Andrew Cooper wrote:
>> Presented herewith is the a plan for the final part of CPUID work, which
>> primarily covers better Xen/Toolstack interaction for configuring the guests
>> CPUID policy.
>>
> Really nice write up, a few comments below.
> 
>> A PDF version of this document is available from:
>>
>> http://xenbits.xen.org/people/andrewcoop/cpuid-part-3-rev2.pdf
>>
>> Changes from v1:
>>  * Clarification of the interaction of emulated features
>>  * More information about the difference between max and default featuresets.
>>
>> ~Andrew
>>
>> -8<-

[snip]

>> # Proposal
>>
>> First and foremost, split the current **max\_policy** notion into separate
>> **max** and **default** policies.  This allows for the provision of features
>> which are unused by default, but may be opted in to, both at the hypervisor
>> level and the toolstack level.
>>
>> At the hypervisor level, **max** constitutes all the features Xen can use on
>> the current hardware, while **default** is the subset thereof which are
>> supported features, the features which the user has explicitly opted in to,
>> and excluding any features the user has explicitly opted out of.
>>
>> A new `cpuid=` command line option shall be introduced, whose internals are
>> generated automatically from the featureset ABI.  This means that all 
>> features
>> added to `include/public/arch-x86/cpufeatureset.h` automatically gain command
>> line control.  (RFC: The same top level option can probably be used for
>> non-feature CPUID data control, although I can't currently think of any cases
>> where this would be used Also find a sensible way to express 'available but
>> not to be used by Xen', as per the current `smep` and `smap` options.)
>>
>>
>> At the guest level, the **max** policy is conceptually unchanged.  It
>> constitutes all the features Xen is willing to offer to each type of guest on
>> the current hardware (including emulated features).  However, it shall 
>> instead
>> be derived from Xen's **default** host policy.  This is to ensure that
>> experimental hypervisor features must be opted in to at the Xen level before
>> they can be opted in to at the toolstack level.
>>
>> The guests **default** policy is then derived from its **max**.  This is
>> because there are some features which should always be explicitly opted in to
>> by the toolstack, such as emulated features which come with a security
>> trade-off, or for non-architectural features which may differ in
>> implementation in heterogeneous environments.
>>
>> All global policies (Xen and guest, max and default) shall be made available
>> to the toolstack, in a manner similar to the existing
>> _XEN\_SYSCTL\_get\_cpu\_featureset_ mechanism.  This allows decisions to be
>> taken which include all CPUID data, not just the feature bitmaps.
>>
>> New _XEN\_DOMCTL\_{get,set}\_cpuid\_policy_ hypercalls will be introduced,
>> which allows the toolstack to query and set the cpuid policy for a specific
>> domain.  It shall supersede _XEN\_DOMCTL\_set\_cpuid_, and shall fail if Xen
>> is unhappy with any aspect of the policy during auditing.  This provides
>> feedback to the user that a chosen combination will not work, rather than the
>> guest booting in an unexpected state.
>>
>> When a domain is initially created, the appropriate guests **default** policy
>> is duplicated for use.  When auditing, Xen shall audit the toolstacks
>> requested policy against the guests **max** policy.  This allows experimental
>> features or non-migration-safe features to be opted in to, without those
>> features being imposed upon all guests automatically.
>>
>> A guests CPUID policy shall be immutable after construction.  This better
>> matches real hardware, and simplifies the logic in Xen to translate policy
>> alterations into configuration changes.
>>
> 
> This appears to be a suitable abstraction even for higher level toolstacks
> (libxl). At least I can imagine libvirt fetching the PV/HVM max policy, and
> compare them between different servers when user computes the guest cpu config
> (the normalized one) and use the common denominator as the guest policy.
> Probably higher level toolstack could even use these said policies constructs
> and built the idea of models such that the user could easily choose one for a
> pool of hosts with different families. But the discussion here is more focused
> on xc <-> Xen so I won't clobbe

Re: [Xen-devel] DESIGN v2: CPUID part 3

2017-07-05 Thread Joao Martins
Hey Andrew,

On 07/04/2017 03:55 PM, Andrew Cooper wrote:
> Presented herewith is the a plan for the final part of CPUID work, which
> primarily covers better Xen/Toolstack interaction for configuring the guests
> CPUID policy.
> 
Really nice write up, a few comments below.

> A PDF version of this document is available from:
> 
> http://xenbits.xen.org/people/andrewcoop/cpuid-part-3-rev2.pdf
> 
> Changes from v1:
>  * Clarification of the interaction of emulated features
>  * More information about the difference between max and default featuresets.
> 
> ~Andrew
> 
> -8<-
> % CPUID Handling (part 3)
> % Revision 2
> 
> # Current state
> 
> At early boot, Xen enumerates the features it can see, takes into account
> errata checks and command line arguments, and stores this information in the
> `boot_cpu_data.x86_capability[]` bitmap.  This gets adjusted as APs boot up,
> and is sanitised to disable all dependent leaf features.
> 
> At mid/late boot (before dom0 is constructed), Xen performs the necessary
> calculations for guest cpuid handling.  Data are contained within the `struct
> cpuid_policy` object, which is a representation of the architectural CPUID
> information as specified by the Intel and AMD manuals.
> 
> There are a few global `cpuid_policy` objects.  First is the **raw_policy**
> which is filled in from native `CPUID` instructions.  This represents what the
> hardware is capable of, in its current firmware/microcode configuration.
> 
> The next global object is **host_policy**, which is derived from the
> **raw_policy** and `boot_cpu_data.x86_capability[]`. It represents the
> features which Xen knows about and is using.  The **host_policy** is
> necessarily a subset of **raw_policy**.
> 
> The **pv_max_policy** and **hvm_max_policy** are derived from the
> **host_policy**, and represent the upper bounds available to guests.
> Generally speaking, the guest policies are less featurefull than the
> **host_policy** because there are features which Xen doesn't or cannot safely
> provide to guests.  However, they are not subsets.  There are some features
> (the HYPERVISOR bit for all guests, and X2APIC mode for HVM guests) which are
> emulated in the absence of real hardware support.
> 
> The toolstack may query for the **{raw,host,pv,hvm}\_featureset** information
> using _XEN\_SYSCTL\_get\_cpu\_featureset_.  This is bitmap form of the feature
> leaves only.
> 
> When a new domain is created, the appropriate **{pv,hvm}\_max_policy** is
> duplicated as a starting point, and can be subsequently mutated indirectly by
> some hypercalls
> (_XEN\_DOMCTL\_{set\_address\_size,disable\_migrate,settscinfo}_) or directly
> by _XEN\_DOMCTL\_set\_cpuid_.
> 
> 
> # Issues with the existing hypercalls
> 
> _XEN\_DOMCTL\_set\_cpuid_ doesn't have a return value which the domain builder
> pays attention to.  This is because, before CPUID part 2, there were no
> failure conditions, as Xen would accept all toolstack-provided data, and
> attempt to audit it at the time it was requested by the guest.  To simplify
> the part 2 work, this behaviour was maintained, although Xen was altered to
> audit the data at hypercall time, typically zeroing out areas which failed the
> audit.
> 
> There is no mechanism for the toolstack to query the CPUID configuration for a
> specific domain.  Originally, the domain builder constructed a guests CPUID
> policy from first principles, using native `CPUID` instructions in the control
> domain.  This functioned to an extent, but was subject to masking problems,
> and is fundamentally incompatible with HVM control domains or the use of
> _CPUID Faulting_ in newer Intel processors.
> 
> CPUID phase 1 introduced the featureset information, which provided an
> architecturally sound mechanism for the toolstack to identify which features
> are usable for guests.  However, the rest of the CPUID policy is still
> generated from native `CPUID` instructions.
> 
> The `cpuid_policy` is per-domain information.  Most CPUID data is identical
> across all CPUs.  Some data are dynamic, based on other control settings
> (APIC, OSXSAVE, OSPKE, OSLWP), and Xen substitutes these appropriately when
> the information is requested..  Other areas however are topology information,
> including thread/core/socket layout, cache and TLB hierarchy.  These data are
> inherited from whichever physical CPU the domain builder happened to be
> running on when it was making calculations.  As a result, it is inappropriate
> for the guest under construction, and usually entirely bogus when considered
> alongside other data.
> 
> 
> # Other problems
> 
> There is no easy provision for features at different code maturity levels,
> both in the hypervisor, and in the toolstack.
> 
> Some CPUID features have top-level command line options on the Xen command
> line, but most do not.  On some hardware, some features can be hidden
> indirectly by altering the `cpuid_mask_*` parameters.  This is a problem for
> developing 

Re: [Xen-devel] [dpdk-dev] [PATCH] maintainers: claim responsability for xen

2017-02-20 Thread Joao Martins
On 02/20/2017 09:56 AM, Jan Blunck wrote:
> On Fri, Feb 17, 2017 at 5:07 PM, Konrad Rzeszutek Wilk wrote:
>> On Thu, Feb 16, 2017 at 10:51:44PM +0100, Vincent JARDIN wrote:
>>> On 16/02/2017 at 14:36, Konrad Rzeszutek Wilk wrote:
>>>>> Is it time now to officially remove Dom0 support?
>>>> So we do have an prototype implementation of netback but it is waiting
>>>> for review of xen-devel to the spec.
>>>>
>>>> And I believe the implementation does utilize some of the dom0
>>>> parts of code in DPDK.
>>>
>>> Please, do you have URLs/pointers about it? It would be interesting to share
>>> it with DPDK community too.
>>
>> Joao, would it be possible to include an tarball of the patches? I know
>> they are no in the right state with the review of the staging
>> grants API - they are incompatible, but it may help folks to get
>> a feel for what DPDK APIs you used?
>>
>> Staging grants API:
>> https://lists.xenproject.org/archives/html/xen-devel/2016-12/msg01878.html
> 
> The topic of the grants API is unrelated to the dom0 memory pool. The
> memory pool which uses xen_create_contiguous_region() is used in cases
> we know that there are no hugepages available.
Correct, I think what Konrad was trying to say is that xen-netback normally
lives in a PV domain, which doesn't have superpages, so such a driver would
need that memory pool part in order to work. The mentioned spec consists of
additions to the Xen netif ABI that let the backend safely map a fixed set of
grant references (provided by the frontend and recycled over time) with the
purpose of avoiding grant ops - DPDK would be one of the users.

> Joao and I met in Dublin and I whined about not being able to call
> into the grants API from userspace and instead need to kick a kernel
> driver to do the work for every burst. It would be great if that could
> change in the future.
Hm, I recall that discussion. AFAIK you can do both grant alloc/revoke of
pages through the xengntshr_share_pages(...) and xengntshr_unshare(...) APIs
provided by libxengnttab[0] starting with 4.7, or through libxc on older
versions with xc_gntshr_share_pages/xc_gntshr_munmap[2]. For the notifications
(or kicks) you can allocate the event channel in the guest with libevtchn[1]
starting with 4.7, using xenevtchn_bind_unbound_port(...), or with libxc on
older versions, using xc_evtchn_bind_unbound_port(...)[2]. You then kick the
guest with xenevtchn_notify(...), or xc_evtchn_notify(...) on older versions.
In short, these APIs are ioctls to /dev/gntdev and /dev/evtchn. xenstore
operations can also be done in userspace with libxenstore[3].

To get behavior similar to VRING_AVAIL_F_NO_INTERRUPT (i.e. avoiding the
kicks) you "just" don't set rsp_event in the ring (e.g. no calls to
RING_FINAL_CHECK_FOR_RESPONSES) and keep checking for unconsumed Rx/Tx
responses. For guest request notifications (to wake up the backend for new Tx/Rx
requests), you depend on whether the backend requests them, since it's the one
setting the req_event index. If it does set it, then you have to use the evtchn
notify that I described in the previous paragraph.
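
To make the event-index condition a bit more concrete, here is a minimal,
self-contained model of the check. As far as I recall it mirrors the arithmetic
done by the RING_PUSH_*_AND_CHECK_NOTIFY macros in Xen's public io/ring.h, but
the function names and the bare uint32_t indices are hypothetical and only
there for illustration; the real macros operate on the shared ring structure
and add memory barriers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper modelling the check done after a producer index is
 * advanced from old_prod to new_prod: notify only if the event index the
 * consumer asked for lies in (old_prod, new_prod]. Unsigned arithmetic
 * keeps the comparison correct across index wraparound.
 */
int ring_needs_notify(uint32_t old_prod, uint32_t new_prod, uint32_t event)
{
    return (uint32_t)(new_prod - event) < (uint32_t)(new_prod - old_prod);
}

void ring_notify_examples(void)
{
    /* Consumer armed event = 7; producing entries 6, 7, 8 crosses it. */
    assert(ring_needs_notify(5, 8, 7));
    /* Consumer never re-arms (event stays at 1): later pushes are silent. */
    assert(!ring_needs_notify(5, 6, 1));
}
```

So a consumer that stops updating its event index simply stops being kicked
and has to poll, which is the behavior described above.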

Hope that helps!

Joao

[0]
https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=tools/libs/gnttab/include/xengnttab.h;hb=HEAD
[1]
https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=tools/libs/evtchn/include/xenevtchn.h;hb=HEAD
[2]
https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=tools/libxc/include/xenctrl_compat.h;hb=HEAD
[3]
https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=tools/xenstore/include/xenstore.h;hb=HEAD



Re: [Xen-devel] [dpdk-dev] [PATCH] maintainers: claim responsability for xen

2017-02-20 Thread Joao Martins
On 02/17/2017 04:07 PM, Konrad Rzeszutek Wilk wrote:
> On Thu, Feb 16, 2017 at 10:51:44PM +0100, Vincent JARDIN wrote:
>> On 16/02/2017 at 14:36, Konrad Rzeszutek Wilk wrote:
>>>> Is it time now to officially remove Dom0 support?
>>> So we do have an prototype implementation of netback but it is waiting
>>> for review of xen-devel to the spec.
>>>
>>> And I believe the implementation does utilize some of the dom0
>>> parts of code in DPDK.
>>
>> Please, do you have URLs/pointers about it? It would be interesting to share
>> it with DPDK community too.
> 
> Joao, would it be possible to include an tarball of the patches? I know
> they are no in the right state with the review of the staging
> grants API - they are incompatible, but it may help folks to get
> a feel for what DPDK APIs you used?
OK, see attached - I should note that it's a WIP, as Konrad mentioned, but once
the staging grants work is finished, the code will be improved to get it into
better shape (as well as to feature parity) for a proper RFC [and to adhere to
the project coding style].

Joao
From 3bced1452e1e619e7f4701cf67ba88c2627aa376 Mon Sep 17 00:00:00 2001
From: Joao Martins <joao.m.mart...@oracle.com>
Date: Mon, 20 Feb 2017 13:33:34 +
Subject: [PATCH WIP 1/2] drivers/net: add xen-netback PMD

Introduce Xen network backend support, namely xen-netback.
This mostly means adding a boilerplate driver with an initially
reduced set of features (i.e. without feature-sg and without multi-queue).
It handles grant operations and notifications correctly, and almost
all of the state machine. Additionally, it supports one early version of
staging grants (hereafter feature-persistent=1) to allow DPDK to
have a set of premapped grants and hence avoid the grant copy
(slow) paths. This driver is implemented using Xen-provided libraries
for event channel, gnttab and xenstore operations.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
 drivers/net/Makefile   |   1 +
 drivers/net/xen-netback/Makefile   |  68 ++
 .../xen-netback/rte_pmd_xen-netback_version.map|   3 +
 drivers/net/xen-netback/xnb.h  | 159 
 drivers/net/xen-netback/xnb_ethdev.c   | 701 +++
 drivers/net/xen-netback/xnb_ethdev.h   |  34 +
 drivers/net/xen-netback/xnb_ring.c | 240 +
 drivers/net/xen-netback/xnb_rxtx.c | 683 +++
 drivers/net/xen-netback/xnb_xenbus.c   | 975 +
 mk/rte.app.mk  |   1 +
 10 files changed, 2865 insertions(+)
 create mode 100644 drivers/net/xen-netback/Makefile
 create mode 100644 drivers/net/xen-netback/rte_pmd_xen-netback_version.map
 create mode 100644 drivers/net/xen-netback/xnb.h
 create mode 100644 drivers/net/xen-netback/xnb_ethdev.c
 create mode 100644 drivers/net/xen-netback/xnb_ethdev.h
 create mode 100644 drivers/net/xen-netback/xnb_ring.c
 create mode 100644 drivers/net/xen-netback/xnb_rxtx.c
 create mode 100644 drivers/net/xen-netback/xnb_xenbus.c

diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index bc93230..a4bf7cb 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -55,6 +55,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_THUNDERX_NICVF_PMD) += thunderx
 DIRS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio
 DIRS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD) += vmxnet3
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_XENVIRT) += xenvirt
+DIRS-$(CONFIG_RTE_LIBRTE_PMD_XEN_NETBACK) += xen-netback
 
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += vhost
diff --git a/drivers/net/xen-netback/Makefile b/drivers/net/xen-netback/Makefile
new file mode 100644
index 000..c6299b0
--- /dev/null
+++ b/drivers/net/xen-netback/Makefile
@@ -0,0 +1,68 @@
+# BSD LICENSE
+#
+# Copyright(c) 2016, Oracle and/or its affiliates. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#
+#   * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#   * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in
+# the documentation and/or other materials provided with the
+# distribution.
+#   * Neither the name of Intel Corporation nor the names of its
+# contributors may be used to endorse or promote products derived
+# from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN

Re: [Xen-devel] [PATCH] x86/time: tsc_check_writability() may need to be run a second time

2017-02-10 Thread Joao Martins
On 02/10/2017 11:17 AM, Andrew Cooper wrote:
> On 10/02/17 11:11, Joao Martins wrote:
>> On 02/10/2017 11:03 AM, Jan Beulich wrote:
>>> While we shouldn't remove its current invocation, we need to re-run it
>>> for the case that the X86_FEATURE_TSC_RELIABLE feature flag has been
>>> cleared, in order to avoid using the TSC rendezvous function in case
>>> the TSC can't be written.
>>>
>>> Signed-off-by: Jan Beulich <jbeul...@suse.com>
>> FWIW,
> 
> Independent reviews are always worth it.  Please continue!

Nice, Thanks!

Joao



Re: [Xen-devel] [PATCH] x86/time: tsc_check_writability() may need to be run a second time

2017-02-10 Thread Joao Martins
On 02/10/2017 11:03 AM, Jan Beulich wrote:
> While we shouldn't remove its current invocation, we need to re-run it
> for the case that the X86_FEATURE_TSC_RELIABLE feature flag has been
> cleared, in order to avoid using the TSC rendezvous function in case
> the TSC can't be written.
> 
> Signed-off-by: Jan Beulich <jbeul...@suse.com>

FWIW,

Reviewed-by: Joao Martins <joao.m.mart...@oracle.com>



Re: [Xen-devel] [libvirt] [PATCH 2/2] libxl: fix dom0 maximum memory setting

2017-02-08 Thread Joao Martins
On 02/08/2017 04:06 PM, Jim Fehlig wrote:
> Joao Martins wrote:
>> On 02/02/2017 10:31 PM, Jim Fehlig wrote:
>>> When the libxl driver is initialized, it creates a virDomainDef
>>> object for dom0 and adds it to the list of domains. Total memory
>>> for dom0 was being set from the max_memkb field of libxl_dominfo
>>> struct retrieved from libxl, but this field can be set to
>>> LIBXL_MEMKB_DEFAULT (~0ULL) if dom0 maximum memory has not been
>>> explicitly set by the user.
>>>
>>> This patch adds some simple parsing of the Xen commandline,
>>> looking for a dom0_mem parameter that also specifies a 'max' value.
>>> If not specified, dom0 maximum memory is effectively all physical
>>> host memory.
>>>
>>> Signed-off-by: Jim Fehlig <jfeh...@suse.com>
>>> ---
>>>  src/libxl/libxl_conf.c   | 75 
>>> 
>>>  src/libxl/libxl_conf.h   |  3 ++
>>>  src/libxl/libxl_driver.c |  2 +-
>>>  3 files changed, 79 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
>>> index b5186f2..bfe0e92 100644
>>> --- a/src/libxl/libxl_conf.c
>>> +++ b/src/libxl/libxl_conf.c
>>> @@ -34,6 +34,7 @@
>>>  #include "internal.h"
>>>  #include "virlog.h"
>>>  #include "virerror.h"
>>> +#include "c-ctype.h"
>>>  #include "datatypes.h"
>>>  #include "virconf.h"
>>>  #include "virfile.h"
>>> @@ -1530,6 +1531,80 @@ int libxlDriverConfigLoadFile(libxlDriverConfigPtr 
>>> cfg,
>>>  
>>>  }
>>>  
>>> +/*
>>> + * dom0's maximum memory can be controlled by the user with the 'dom0_mem' Xen
>>> + * command line parameter. E.g. to set dom0's initial memory to 4G and max
>>> + * memory to 8G: dom0_mem=4G,max:8G
>>> + *
>>> + * If not constrained by the user, dom0 can effectively use all host memory.
>>> + * This function returns the configured maximum memory for dom0 in kilobytes,
>>> + * either the user-specified value or total physical memory as a default.
>>> + */
>>> +unsigned long long
>>> +libxlDriverGetDom0MaxmemConf(libxlDriverConfigPtr cfg)
>>> +{
>>> +char **cmd_tokens = NULL;
>>> +char **mem_tokens = NULL;
>>> +size_t i;
>>> +size_t j;
>>> +unsigned long long ret;
>>> +libxl_physinfo physinfo;
>>> +
>>> +if (cfg->verInfo->commandline == NULL ||
>>> +!(cmd_tokens = virStringSplit(cfg->verInfo->commandline, " ", 0)))
>>> +goto physmem;
>>> +
>>> +for (i = 0; cmd_tokens[i] != NULL; i++) {
>>> +if (!STRPREFIX(cmd_tokens[i], "dom0_mem="))
>>> +continue;
>>> +
>>> +if (!(mem_tokens = virStringSplit(cmd_tokens[i], ",", 0)))
>>> +break;
>>> +for (j = 0; mem_tokens[j] != NULL; j++) {
>>> +if (STRPREFIX(mem_tokens[j], "max:")) {
>>> +char *p = mem_tokens[j] + 4;
>>> +unsigned long long multiplier = 1;
>>> +
>>> +while (c_isdigit(*p))
>>> +p++;
>>> +if (virStrToLong_ull(mem_tokens[j] + 4, &p, 10, &ret) < 0)
>>> +break;
>>> +if (*p) {
>>> +switch (*p) {
>>> +case 'k':
>>> +case 'K':
>>> +multiplier = 1024;
>>> +break;
>>> +case 'm':
>>> +case 'M':
>>> +multiplier = 1024 * 1024;
>>> +break;
>>> +case 'g':
>>> +case 'G':
>>> +multiplier = 1024 * 1024 * 1024;
>>> +break;
>>> +}
>>> +}
>>> +ret = (ret * multiplier) / 1024;
>>> +goto cleanup;
>>> +}
>>> +}
>>> +}
>>> +
>>> + physmem:
>>> +/* No 'max' specified in dom0_mem, so dom0 can use all physical memory */
>>> +libxl_physinfo_init(&physinfo);
>>&

Re: [Xen-devel] [libvirt] [PATCH 2/2] libxl: fix dom0 maximum memory setting

2017-02-08 Thread Joao Martins
On 02/02/2017 10:31 PM, Jim Fehlig wrote:
> When the libxl driver is initialized, it creates a virDomainDef
> object for dom0 and adds it to the list of domains. Total memory
> for dom0 was being set from the max_memkb field of libxl_dominfo
> struct retrieved from libxl, but this field can be set to
> LIBXL_MEMKB_DEFAULT (~0ULL) if dom0 maximum memory has not been
> explicitly set by the user.
> 
> This patch adds some simple parsing of the Xen commandline,
> looking for a dom0_mem parameter that also specifies a 'max' value.
> If not specified, dom0 maximum memory is effectively all physical
> host memory.
> 
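For reference, the parsing described above boils down to roughly the following. This is a standalone sketch using plain libc rather than the libvirt helpers; the function name parse_dom0_max_kib is made up for illustration and this is not the code from the patch. Like the patch, it treats a bare number (no k/m/g suffix) as bytes.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): scan a Xen command line for
 * "dom0_mem=...,max:<size>[kKmMgG]" and return the 'max' value in KiB.
 * Returns 0 if no usable 'max:' value is present.
 */
unsigned long long parse_dom0_max_kib(const char *cmdline)
{
    const char *tok = cmdline;

    while ((tok = strstr(tok, "dom0_mem=")) != NULL) {
        /* Only accept a match that starts a space-separated token. */
        if (tok != cmdline && tok[-1] != ' ') {
            tok++;
            continue;
        }

        const char *end = strchr(tok, ' ');
        size_t len = end ? (size_t)(end - tok) : strlen(tok);
        const char *max = strstr(tok, "max:");

        /* "max:" must sit inside this token, right after '=' or ','. */
        if (max && max < tok + len && (max[-1] == '=' || max[-1] == ',')) {
            char *suffix;
            unsigned long long val = strtoull(max + 4, &suffix, 10);
            unsigned long long mult = 1;

            if (suffix == max + 4)
                return 0;               /* no digits after "max:" */

            switch (*suffix) {
            case 'k': case 'K': mult = 1024ULL; break;
            case 'm': case 'M': mult = 1024ULL * 1024; break;
            case 'g': case 'G': mult = 1024ULL * 1024 * 1024; break;
            }
            return (val * mult) / 1024; /* bytes -> KiB */
        }
        tok += len;
    }
    return 0;
}
```

A caller would fall back to the physical memory size whenever this returns 0.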
> Signed-off-by: Jim Fehlig 
> ---
>  src/libxl/libxl_conf.c   | 75 
> 
>  src/libxl/libxl_conf.h   |  3 ++
>  src/libxl/libxl_driver.c |  2 +-
>  3 files changed, 79 insertions(+), 1 deletion(-)
> 
> diff --git a/src/libxl/libxl_conf.c b/src/libxl/libxl_conf.c
> index b5186f2..bfe0e92 100644
> --- a/src/libxl/libxl_conf.c
> +++ b/src/libxl/libxl_conf.c
> @@ -34,6 +34,7 @@
>  #include "internal.h"
>  #include "virlog.h"
>  #include "virerror.h"
> +#include "c-ctype.h"
>  #include "datatypes.h"
>  #include "virconf.h"
>  #include "virfile.h"
> @@ -1530,6 +1531,80 @@ int libxlDriverConfigLoadFile(libxlDriverConfigPtr cfg,
>  
>  }
>  
> +/*
> + * dom0's maximum memory can be controlled by the user with the 'dom0_mem' Xen
> + * command line parameter. E.g. to set dom0's initial memory to 4G and max
> + * memory to 8G: dom0_mem=4G,max:8G
> + *
> + * If not constrained by the user, dom0 can effectively use all host memory.
> + * This function returns the configured maximum memory for dom0 in kilobytes,
> + * either the user-specified value or total physical memory as a default.
> + */
> +unsigned long long
> +libxlDriverGetDom0MaxmemConf(libxlDriverConfigPtr cfg)
> +{
> +char **cmd_tokens = NULL;
> +char **mem_tokens = NULL;
> +size_t i;
> +size_t j;
> +unsigned long long ret;
> +libxl_physinfo physinfo;
> +
> +if (cfg->verInfo->commandline == NULL ||
> +!(cmd_tokens = virStringSplit(cfg->verInfo->commandline, " ", 0)))
> +goto physmem;
> +
> +for (i = 0; cmd_tokens[i] != NULL; i++) {
> +if (!STRPREFIX(cmd_tokens[i], "dom0_mem="))
> +continue;
> +
> +if (!(mem_tokens = virStringSplit(cmd_tokens[i], ",", 0)))
> +break;
> +for (j = 0; mem_tokens[j] != NULL; j++) {
> +if (STRPREFIX(mem_tokens[j], "max:")) {
> +char *p = mem_tokens[j] + 4;
> +unsigned long long multiplier = 1;
> +
> +while (c_isdigit(*p))
> +p++;
> +if (virStrToLong_ull(mem_tokens[j] + 4, &p, 10, &ret) < 0)
> +break;
> +if (*p) {
> +switch (*p) {
> +case 'k':
> +case 'K':
> +multiplier = 1024;
> +break;
> +case 'm':
> +case 'M':
> +multiplier = 1024 * 1024;
> +break;
> +case 'g':
> +case 'G':
> +multiplier = 1024 * 1024 * 1024;
> +break;
> +}
> +}
> +ret = (ret * multiplier) / 1024;
> +goto cleanup;
> +}
> +}
> +}
> +
> + physmem:
> +/* No 'max' specified in dom0_mem, so dom0 can use all physical memory */
> +libxl_physinfo_init(&physinfo);
> +libxl_get_physinfo(cfg->ctx, &physinfo);
Although it is unlikely, libxl_get_physinfo can fail, so I think you need to
check its return value here.
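Something along these lines is what I have in mind. It is only a sketch: the libxl type and call are replaced by minimal stand-ins (fake_physinfo, fake_get_physinfo) so that it compiles on its own, and get_host_mem_kib is a made-up name. Only the control flow matters.

```c
#include <stdint.h>

/* Stand-ins for libxl_physinfo / libxl_get_physinfo(), illustration only. */
struct fake_physinfo {
    uint64_t total_pages;
};

static int fake_get_physinfo(int should_fail, struct fake_physinfo *info)
{
    if (should_fail)
        return -1;                  /* libxl signals errors with non-zero */
    info->total_pages = 1048576;    /* pretend: 4 GiB worth of 4 KiB pages */
    return 0;
}

/*
 * Returns 0 and stores the host memory in KiB in *maxmem on success.
 * Returns -1 if the physinfo query failed; *maxmem is left untouched so
 * the caller never consumes a value computed from uninitialised data.
 */
int get_host_mem_kib(int should_fail, unsigned long pagesize,
                     unsigned long long *maxmem)
{
    struct fake_physinfo info = { 0 };

    if (fake_get_physinfo(should_fail, &info) != 0)
        return -1;

    *maxmem = (info.total_pages * pagesize) / 1024;
    return 0;
}
```

How the failure is reported to the caller (error return, or a logged fallback) is a separate choice; the point is just not to use physinfo when the call failed.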

> +ret = (physinfo.total_pages * cfg->verInfo->pagesize) / 1024;
> +libxl_physinfo_dispose(&physinfo);
> +
> + cleanup:
> +virStringListFree(cmd_tokens);
> +virStringListFree(mem_tokens);
> +return ret;
> +}
> +
> +
>  #ifdef LIBXL_HAVE_DEVICE_CHANNEL
>  static int
>  libxlPrepareChannel(virDomainChrDefPtr channel,
> diff --git a/src/libxl/libxl_conf.h b/src/libxl/libxl_conf.h
> index 69d7885..c4ddbfe 100644
> --- a/src/libxl/libxl_conf.h
> +++ b/src/libxl/libxl_conf.h
> @@ -173,6 +173,9 @@ libxlDriverNodeGetInfo(libxlDriverPrivatePtr driver,
>  int libxlDriverConfigLoadFile(libxlDriverConfigPtr cfg,
>const char *filename);
>  
> +unsigned long long
> +libxlDriverGetDom0MaxmemConf(libxlDriverConfigPtr cfg);
> +
>  int
>  libxlMakeDisk(virDomainDiskDefPtr l_dev, libxl_device_disk *x_dev);
>  int
> diff --git a/src/libxl/libxl_driver.c b/src/libxl/libxl_driver.c
> index 921cc93..e54b3b7 100644
> --- a/src/libxl/libxl_driver.c
> +++ b/src/libxl/libxl_driver.c
> @@ -615,7 +615,7 @@ libxlAddDom0(libxlDriverPrivatePtr driver)
>  if (virDomainDefSetVcpus(vm->def, d_info.vcpu_online) < 0)
>  goto cleanup;
>  vm->def->mem.cur_balloon = 

Re: [Xen-devel] [PATCH v1 3/3] MAINTAINERS: xen, kvm: track pvclock-abi.h changes

2017-01-26 Thread Joao Martins
On 01/26/2017 05:25 PM, Andy Lutomirski wrote:
> On Wed, Jan 25, 2017 at 9:33 AM, Joao Martins <joao.m.mart...@oracle.com> 
> wrote:
>> This file defines an ABI shared between guest and hypervisor(s)
>> (KVM, Xen) and as such there should be a corresponding entry in the
>> MAINTAINERS file. Notice that there's already a text notice at the
>> top of the header file, hence this commit simply enforces it more
>> explicitly and has both peers notified when such changes happen.
>>
>> Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
>> ---
>> This was suggested by folks at xen-devel as we missed some of the
>> ABI additions (e.g. flags field in pvti, TSC stable bit) - so this
>> patch is to help prevent that from happening. Alternatively I
>> could instead add a "PVCLOCK ABI" section in this file with the
>> two mailing lists.
> 
> If you do the latter, please add me as an R:.
OK, Thanks.

Since the ABI is used on both hypervisors, I'll wait for the maintainers to
voice their preference.

Joao



Re: [Xen-devel] [PATCH v1 1/3] x86/pvclock: add setter for pvclock_pvti_cpu0_va

2017-01-26 Thread Joao Martins
On 01/26/2017 05:25 PM, Andy Lutomirski wrote:
> On Wed, Jan 25, 2017 at 9:33 AM, Joao Martins <joao.m.mart...@oracle.com> 
> wrote:
>> Right now there is only a pvclock_pvti_cpu0_va() which is defined
>> on kvmclock since:
>>
>> commit dac16fba6fc5
>> ("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")
>>
>> The only user of this interface was kvm. This commit moves
>> pvclock_pvti_cpu0_va to pvclock which is a more generic place to have it
>> and adds the corresponding setter routine for it. This allows other
>> pvclock-based clocksources to use it, such as Xen.
> 
> With a minor nit:
> 
> Acked-by: Andy Lutomirski <l...@kernel.org>
> 
>> +#else
>> +static inline void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
>> +{
>> +}
> 
> How about just not providing pvclock_set_pvti_cpu0_va() in this case?
> It'll save three lines of code, and, more importantly, it will force
> us to notice if we screw up the Kconfig stuff.
Sounds good, will remove this then. Thanks!

Joao



Re: [Xen-devel] [PATCH v1 2/3] x86/xen/time: setup vcpu 0 time info page

2017-01-26 Thread Joao Martins
On 01/25/2017 07:26 PM, Boris Ostrovsky wrote:
> On 01/25/2017 12:33 PM, Joao Martins wrote:
>> In order to support pvclock vdso on xen we need to setup the time
>> info page for vcpu 0 and register the page with Xen using the
>> VCPUOP_register_vcpu_time_memory_area hypercall. This hypercall
>> will also forcefully update the pvti which will set some of the
>> necessary flags for vdso. Afterwards we check if it supports the
>> PVCLOCK_TSC_STABLE_BIT flag which is mandatory for having
>> vdso/vsyscall support. And if so, it will set the cpu 0 pvti that
>> will be later on used when mapping the vdso image.
>>
>> The xen headers are also updated to include the new hypercall for
>> registering the secondary vcpu_time_info struct.
>>
>> Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
>> ---
>> Changes since RFC:
>>  (Comments from Boris and David)
>>  * Remove Kconfig option
>>  * Use get_zeroed_page/free_page
>>  * Remove the hypercall availability check
>>  * Unregister pvti with arg.addr.v = NULL if stable bit isn't supported.
>>  (New)
>>  * Set secondary copy on restore such that it works on migration.
>>  * Drop global xen_clock variable and stash it locally on
>>  xen_setup_vsyscall_time_info.
>>  * WARN_ON(ret) if we fail to unregister the pvti.
>> ---
>>  arch/x86/xen/enlighten.c |  2 ++
>>  arch/x86/xen/time.c  | 51 
>> 
>>  arch/x86/xen/xen-ops.h   |  1 +
>>  include/xen/interface/vcpu.h | 28 
>>  4 files changed, 82 insertions(+)
>>
>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>> index 51ef952..15d271d 100644
>> --- a/arch/x86/xen/enlighten.c
>> +++ b/arch/x86/xen/enlighten.c
>> @@ -270,6 +270,8 @@ void xen_vcpu_restore(void)
> 
> This is called for PV only. What about HVM?
The call is missing from xen_hvm_post_suspend(...), I will add it.

> 
>>  HYPERVISOR_vcpu_op(VCPUOP_up, xen_vcpu_nr(cpu), NULL))
>>  BUG();
>>  }
>> +
>> +xen_setup_vsyscall_time_info(0);
> 
> Do we need to tear down time memory area on VCPU suspend?
Yes, I missed that too; otherwise I am leaking a page.

I could also rework this patch such that the initially allocated xen_clock page
is reused, and simply register/unregister it in the save/restore paths. This
would probably mean adding one extra helper to register the vcpu_time info, and
would perhaps make xen_setup_vsyscall_time_info a bit simpler.
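Roughly, the rework would have the following shape. This is a userspace mock with made-up names (mock_register_time_area stands in for the VCPUOP_register_vcpu_time_memory_area hypercall, and calloc for get_zeroed_page), so take it as a sketch of the lifecycle rather than the actual patch:

```c
#include <stdlib.h>

/* Mock state; every name in this block is hypothetical. */
static void *registered_area;   /* what the "hypervisor" has registered */
static void *clock_page;        /* allocated once, reused across suspend */

/* Stand-in for the register hypercall; a NULL address unregisters. */
static int mock_register_time_area(void *addr)
{
    registered_area = addr;
    return 0;
}

/* Boot path: allocate the page only once, then register it. */
int time_area_init(void)
{
    if (!clock_page) {
        clock_page = calloc(1, 4096);
        if (!clock_page)
            return -1;
    }
    return mock_register_time_area(clock_page);
}

/* Suspend path: unregister, but keep the page around. */
int time_area_save(void)
{
    return mock_register_time_area(NULL);
}

/* Resume path: re-register the same page, so nothing is leaked. */
int time_area_restore(void)
{
    return mock_register_time_area(clock_page);
}

/* Accessors so the state can be inspected from outside. */
void *time_area_registered(void) { return registered_area; }
void *time_area_page(void) { return clock_page; }
```

With this split, the restore path never allocates, which is what removes the leak.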

>>  }
>>  
>>  static void __init xen_banner(void)
>> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
>> index 1e69956..e90f703 100644
>> --- a/arch/x86/xen/time.c
>> +++ b/arch/x86/xen/time.c
>> @@ -367,6 +367,56 @@ static const struct pv_time_ops xen_time_ops __initconst = {
>>  .steal_clock = xen_steal_clock,
>>  };
>>  
>> +int xen_setup_vsyscall_time_info(int cpu)
>> +{
>> +struct pvclock_vsyscall_time_info *xen_clock;
>> +struct vcpu_register_time_memory_area t;
>> +struct pvclock_vcpu_time_info *pvti;
>> +unsigned long addr;
>> +u8 flags;
>> +int ret;
>> +
>> +addr = get_zeroed_page(GFP_KERNEL);
>> +if (!addr)
>> +return -ENOMEM;
>> +
>> +xen_clock = (struct pvclock_vsyscall_time_info *) addr;
>> +memset(xen_clock, 0, PAGE_SIZE);
> 
> You don't really need addr 
The reason I had this was to save one cast to unsigned long (on the free_page
paths). But maybe it's not worth it, and looking at the rest of the x86/xen
code, this doesn't seem to be the usual pattern. I will remove it.

> and there is no reason to memset the page to
> zero, given that you got it with get_zeroed_page().
Yep, fixed.

> 
>> +
>> +t.addr.v = &xen_clock->pvti;
>> +
>> +ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area,
>> + cpu, &t);
>> +
>> +if (ret) {
>> +pr_debug("xen: cannot register vcpu_time_info err %d\n", ret);
> 
> pr_warn() would be more appropriate I think. You also have blank line
> before 'if'.
Fixed.

>> +free_page(addr);
>> +return ret;
>> +}
>> +
>> +pvti = &xen_clock->pvti;
>> +flags = pvti->flags;
> 
> I don't think you need these, given that you only reference flags once
> below.
> 
Indeed, fixed as well.

>> +
>> +if (!(flags & PVCLOCK_TSC_STABLE_BIT)) {
>> +t.addr.v = NULL;
>> +ret = HYPERVISOR_vcpu_op(VCPUOP_re

[Xen-devel] [PATCH v1 2/3] x86/xen/time: setup vcpu 0 time info page

2017-01-25 Thread Joao Martins
In order to support pvclock vdso on xen we need to setup the time
info page for vcpu 0 and register the page with Xen using the
VCPUOP_register_vcpu_time_memory_area hypercall. This hypercall
will also forcefully update the pvti which will set some of the
necessary flags for vdso. Afterwards we check if it supports the
PVCLOCK_TSC_STABLE_BIT flag which is mandatory for having
vdso/vsyscall support. And if so, it will set the cpu 0 pvti that
will be later on used when mapping the vdso image.

The xen headers are also updated to include the new hypercall for
registering the secondary vcpu_time_info struct.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
Changes since RFC:
 (Comments from Boris and David)
 * Remove Kconfig option
 * Use get_zeroed_page/free_page
 * Remove the hypercall availability check
 * Unregister pvti with arg.addr.v = NULL if stable bit isn't supported.
 (New)
 * Set secondary copy on restore such that it works on migration.
 * Drop global xen_clock variable and stash it locally on
 xen_setup_vsyscall_time_info.
 * WARN_ON(ret) if we fail to unregister the pvti.
---
 arch/x86/xen/enlighten.c |  2 ++
 arch/x86/xen/time.c  | 51 
 arch/x86/xen/xen-ops.h   |  1 +
 include/xen/interface/vcpu.h | 28 
 4 files changed, 82 insertions(+)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 51ef952..15d271d 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -270,6 +270,8 @@ void xen_vcpu_restore(void)
HYPERVISOR_vcpu_op(VCPUOP_up, xen_vcpu_nr(cpu), NULL))
BUG();
}
+
+   xen_setup_vsyscall_time_info(0);
 }
 
 static void __init xen_banner(void)
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 1e69956..e90f703 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -367,6 +367,56 @@ static const struct pv_time_ops xen_time_ops __initconst = {
.steal_clock = xen_steal_clock,
 };
 
+int xen_setup_vsyscall_time_info(int cpu)
+{
+   struct pvclock_vsyscall_time_info *xen_clock;
+   struct vcpu_register_time_memory_area t;
+   struct pvclock_vcpu_time_info *pvti;
+   unsigned long addr;
+   u8 flags;
+   int ret;
+
+   addr = get_zeroed_page(GFP_KERNEL);
+   if (!addr)
+   return -ENOMEM;
+
+   xen_clock = (struct pvclock_vsyscall_time_info *) addr;
+   memset(xen_clock, 0, PAGE_SIZE);
+
+   t.addr.v = &xen_clock->pvti;
+
+   ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area,
+cpu, &t);
+
+   if (ret) {
+   pr_debug("xen: cannot register vcpu_time_info err %d\n", ret);
+   free_page(addr);
+   return ret;
+   }
+
+   pvti = &xen_clock->pvti;
+   flags = pvti->flags;
+
+   if (!(flags & PVCLOCK_TSC_STABLE_BIT)) {
+   t.addr.v = NULL;
+   ret = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_time_memory_area,
+cpu, &t);
+   if (!ret)
+   free_page(addr);
+
+   WARN_ON(ret);
+   pr_debug("xen: VCLOCK_PVCLOCK not supported\n");
+   return -ENOTSUPP;
+   }
+
+   pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
+   pvclock_set_pvti_cpu0_va(xen_clock);
+
+   xen_clocksource.archdata.vclock_mode = VCLOCK_PVCLOCK;
+
+   return 0;
+}
+
 static void __init xen_time_init(void)
 {
int cpu = smp_processor_id();
@@ -393,6 +443,7 @@ static void __init xen_time_init(void)
setup_force_cpu_cap(X86_FEATURE_TSC);
 
xen_setup_runstate_info(cpu);
+   xen_setup_vsyscall_time_info(cpu);
xen_setup_timer(cpu);
xen_setup_cpu_clockevents();
 
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index ac0a2b0..4036d15 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -66,6 +66,7 @@ void __init xen_vmalloc_p2m_tree(void);
 void xen_init_irq_ops(void);
 void xen_setup_timer(int cpu);
 void xen_setup_runstate_info(int cpu);
+int xen_setup_vsyscall_time_info(int cpu);
 void xen_teardown_timer(int cpu);
 u64 xen_clocksource_read(void);
 void xen_setup_cpu_clockevents(void);
diff --git a/include/xen/interface/vcpu.h b/include/xen/interface/vcpu.h
index 98188c8..8da788c 100644
--- a/include/xen/interface/vcpu.h
+++ b/include/xen/interface/vcpu.h
@@ -178,4 +178,32 @@ DEFINE_GUEST_HANDLE_STRUCT(vcpu_register_vcpu_info);
 
 /* Send an NMI to the specified VCPU. @extra_arg == NULL. */
 #define VCPUOP_send_nmi 11
+
+/*
+ * Register a memory location to get a secondary copy of the vcpu time
+ * parameters.  The master copy still exists as part of the vcpu shared
+ * memory area, and this secondary copy is updated whenever the master copy
+ * is updated (and using the same versioning scheme for synchronisation).
+ *
+ * The int

[Xen-devel] [PATCH v1 1/3] x86/pvclock: add setter for pvclock_pvti_cpu0_va

2017-01-25 Thread Joao Martins
Right now there is only a pvclock_pvti_cpu0_va() which is defined
on kvmclock since:

commit dac16fba6fc5
("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")

The only user of this interface was kvm. This commit moves
pvclock_pvti_cpu0_va to pvclock which is a more generic place to have it
and adds the corresponding setter routine for it. This allows other
pvclock-based clocksources to use it, such as Xen.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
Changes since RFC:
 (Comments from Andy Lutomirski)
 * Add WARN_ON(vclock_was_used(VCLOCK_PVCLOCK)) to
 pvclock_set_pvti_cpu0_va
---
 arch/x86/include/asm/pvclock.h | 22 +-
 arch/x86/kernel/kvmclock.c |  6 +-
 arch/x86/kernel/pvclock.c  | 13 +
 3 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 448cfe1..58399e1 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -4,15 +4,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_KVM_GUEST
-extern struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
-#else
-static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
-{
-   return NULL;
-}
-#endif
-
 /* some helper functions for xen and kvm pv clock sources */
 u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
 u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
@@ -101,4 +92,17 @@ struct pvclock_vsyscall_time_info {
 
 #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti);
+struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
+#else
+static inline void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
+{
+}
+static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+   return NULL;
+}
+#endif
+
 #endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 2a5cafd..9dfbb79 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -45,11 +45,6 @@ early_param("no-kvmclock", parse_no_kvmclock);
 static struct pvclock_vsyscall_time_info *hv_clock;
 static struct pvclock_wall_clock wall_clock;
 
-struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
-{
-   return hv_clock;
-}
-
 /*
  * The wallclock is the time of day when we booted. Since then, some time may
  * have elapsed since the hypervisor wrote the data. So we try to account for
@@ -330,6 +325,7 @@ int __init kvm_setup_vsyscall_timeinfo(void)
return 1;
}
 
+   pvclock_set_pvti_cpu0_va(hv_clock);
put_cpu();
 
kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 9e93fe5..b281060 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -23,8 +23,10 @@
 #include 
 #include 
 #include 
+#include 
 
 static u8 valid_flags __read_mostly = 0;
+static struct pvclock_vsyscall_time_info *pvti_cpu0_va __read_mostly = NULL;
 
 void pvclock_set_flags(u8 flags)
 {
@@ -142,3 +144,14 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
 
set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
+
+void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
+{
+   WARN_ON(vclock_was_used(VCLOCK_PVCLOCK));
+   pvti_cpu0_va = pvti;
+}
+
+struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+   return pvti_cpu0_va;
+}
-- 
2.1.4




[Xen-devel] [PATCH v1 0/3] x86/xen: pvclock vdso support

2017-01-25 Thread Joao Martins
Hey,

This small series adds vDSO support for Xen domains.
PVCLOCK_TSC_STABLE_BIT, which is required for vdso time related calls, can
be set starting with Xen 4.8. In order to have it on, the hypervisor
clocksource needs to be TSC, e.g. with the following boot params:
"clocksource=tsc tsc=stable:socket".
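
For context, this is roughly how a pvclock consumer such as the vDSO turns a TSC reading into nanoseconds, and where the stable bit comes in. The sketch below is a simplified userspace model: pvti_read_ns and scale_delta are illustrative names, the struct mirrors the pvclock ABI record layout, and the memory barriers a real reader needs around the version checks are left out.

```c
#include <stdint.h>

/* Per-vCPU time record, laid out as in the pvclock ABI. */
struct pvti {
    uint32_t version;           /* odd while the hypervisor is updating */
    uint32_t pad0;
    uint64_t tsc_timestamp;     /* TSC value at the last update */
    uint64_t system_time;       /* nanoseconds at the last update */
    uint32_t tsc_to_system_mul; /* 32.32 fixed-point ns per (shifted) tick */
    int8_t   tsc_shift;
    uint8_t  flags;
    uint8_t  pad[2];
};

#define TSC_STABLE_BIT (1 << 0) /* value of PVCLOCK_TSC_STABLE_BIT */

/* Scale a TSC delta to ns: ((delta << shift) * mul) >> 32, without overflow
 * in the low half of the multiplication. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;
    return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/*
 * Returns 0 and the time in ns in *ns, or -1 if the record does not carry
 * the stable bit (in which case a vDSO would fall back to a syscall).
 */
int pvti_read_ns(const volatile struct pvti *p, uint64_t tsc, uint64_t *ns)
{
    uint32_t version;
    uint64_t result;
    uint8_t flags;

    do {
        version = p->version;   /* real code: read barrier after this */
        result = p->system_time +
                 scale_delta(tsc - p->tsc_timestamp,
                             p->tsc_to_system_mul, p->tsc_shift);
        flags = p->flags;       /* real code: read barrier after this */
    } while ((version & 1) || version != p->version);

    if (!(flags & TSC_STABLE_BIT))
        return -1;

    *ns = result;
    return 0;
}
```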

The series is structured as follows: Patch 1 streamlines pvti page get/set in
pvclock for both of its users; Patch 2 registers the pvti page on Xen and
sets it in pvclock accordingly; and Patch 3 (new in this version) adds an
entry to MAINTAINERS for tracking pvclock ABI changes. The changelog since the
RFC is included in the individual patches.

Any comments/suggestions are welcome.

Thanks,
Joao

Joao Martins (3):
  x86/pvclock: add setter for pvclock_pvti_cpu0_va
  x86/xen/time: setup vcpu 0 time info page
  MAINTAINERS: xen, kvm: track pvclock-abi.h changes

 MAINTAINERS|  2 ++
 arch/x86/include/asm/pvclock.h | 22 ++
 arch/x86/kernel/kvmclock.c |  6 +
 arch/x86/kernel/pvclock.c  | 13 +++
 arch/x86/xen/enlighten.c   |  2 ++
 arch/x86/xen/time.c| 51 ++
 arch/x86/xen/xen-ops.h |  1 +
 include/xen/interface/vcpu.h   | 28 +++
 8 files changed, 111 insertions(+), 14 deletions(-)

-- 
2.1.4




[Xen-devel] [PATCH v1 3/3] MAINTAINERS: xen, kvm: track pvclock-abi.h changes

2017-01-25 Thread Joao Martins
This file defines an ABI shared between guest and hypervisor(s)
(KVM, Xen) and as such there should be a corresponding entry in the
MAINTAINERS file. Notice that there's already a text notice at the
top of the header file, hence this commit simply enforces it more
explicitly and has both peers notified when such changes happen.

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
---
This was suggested by folks at xen-devel as we missed some of the
ABI additions (e.g. flags field in pvti, TSC stable bit) - so this
patch is to help prevent that from happening. Alternatively I
could instead add a "PVCLOCK ABI" section in this file with the
two mailing lists.
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 26edd83..c4315d1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7041,6 +7041,7 @@ F:Documentation/virtual/kvm/
 F: arch/*/kvm/
 F: arch/x86/kernel/kvm.c
 F: arch/x86/kernel/kvmclock.c
+F: arch/x86/include/asm/pvclock-abi.h
 F: arch/*/include/asm/kvm*
 F: include/linux/kvm*
 F: include/uapi/linux/kvm*
@@ -13483,6 +13484,7 @@ M:  Juergen Gross <jgr...@suse.com>
 L: xen-de...@lists.xenproject.org (moderated for non-subscribers)
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git
 S: Supported
+F: arch/x86/include/asm/pvclock-abi.h
 F: arch/x86/xen/
 F: drivers/*/xen-*front.c
 F: drivers/xen/
-- 
2.1.4




[Xen-devel] [PATCH v2] x86/hvm: do not set msr_tsc_adjust on hvm_set_guest_tsc_fixed

2017-01-24 Thread Joao Martins
Commit 6e03363 ("x86: Implement TSC adjust feature for HVM guest")
implemented the TSC_ADJUST MSR for HVM guests. However, while booting
an HVM guest the boot CPU would have a value set to delta_tsc -
guest tsc while secondary CPUs would have 0. For example one can
observe:
 $ xen-hvmctx 17 | grep tsc_adjust
 TSC_ADJUST: tsc_adjust ff9377dfef47fe66
 TSC_ADJUST: tsc_adjust 0
 TSC_ADJUST: tsc_adjust 0
 TSC_ADJUST: tsc_adjust 0

Upcoming Linux 4.10 now validates whether this MSR is correct and
adjusts it accordingly under the following conditions: values of < 0
(our case for CPU 0) or != 0 or values > 7FFF. In these conditions it
will force the value to 0, and it does the same for CPUs whose values
don't match the others. If this MSR is not correct we would see messages
such as:

[Firmware Bug]: TSC ADJUST: CPU0: -30517044286984129 force to 0

And HVM guests supporting TSC_ADJUST (which requires at least Intel
Haswell) won't boot.

Our current vCPU 0 values are incorrect: the Intel SDM, in section
"Time-Stamp Counter Adjustment", states that "On RESET, the value
of the IA32_TSC_ADJUST MSR is 0.", hence we should set it to 0 and be
consistent across multiple vCPUs. Perhaps this MSR should only be
changed by the guest, which already happens through the
hvm_set_guest_tsc_adjust(..) routines (see below). After this patch
guests running Linux 4.10 will see a valid IA32_TSC_ADJUST MSR value
of 0 for all CPUs and are able to boot.

On the same section of the spec ("Time-Stamp Counter Adjustment") it is
also stated:
"If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR
 adds (or subtracts) value X from the TSC, the logical processor also
 adds (or subtracts) value X from the IA32_TSC_ADJUST MSR.

 Unlike the TSC, the value of the IA32_TSC_ADJUST MSR changes only in
 response to WRMSR (either to the MSR itself, or to the
 IA32_TIME_STAMP_COUNTER MSR). Its value does not otherwise change as
 time elapses. Software seeking to adjust the TSC can do so by using
 WRMSR to write the same value to the IA32_TSC_ADJUST MSR on each logical
 processor."

This suggests these MSR values should only be changed by the guest, i.e.
through the MSR write intercepts. We keep the IA32_TSC MSR logic such that
writes accommodate adjustments to TSC_ADJUST, hence there is no functional
change in msr_tsc_adjust for the IA32_TSC MSR. However, we now do that in a
separate routine, namely hvm_set_guest_tsc_msr, instead of through
hvm_set_guest_tsc(...).
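
To illustrate the intended semantics, here is a toy model of one vCPU. It is not Xen code: the struct and function names are made up, and the "guest TSC = host TSC + offset" relation is the only assumption. It shows that an internal TSC set leaves TSC_ADJUST alone, while guest writes to either MSR keep the two in step as the SDM text describes.

```c
#include <stdint.h>

/* Toy model of one vCPU's TSC state; field names are illustrative. */
struct vcpu_tsc {
    uint64_t tsc_offset;    /* guest TSC = host TSC + tsc_offset */
    uint64_t tsc_adjust;    /* IA32_TSC_ADJUST as seen by the guest */
};

/* Hypervisor-internal TSC set (e.g. at vCPU creation): TSC_ADJUST untouched. */
void set_guest_tsc(struct vcpu_tsc *v, uint64_t host_tsc, uint64_t guest_tsc)
{
    v->tsc_offset = guest_tsc - host_tsc;
}

/* Guest WRMSR to IA32_TSC: the change in offset is mirrored in TSC_ADJUST. */
void guest_write_tsc(struct vcpu_tsc *v, uint64_t host_tsc, uint64_t guest_tsc)
{
    uint64_t old_offset = v->tsc_offset;

    set_guest_tsc(v, host_tsc, guest_tsc);
    v->tsc_adjust += v->tsc_offset - old_offset;
}

/* Guest WRMSR to IA32_TSC_ADJUST: the TSC moves by the same delta. */
void guest_write_tsc_adjust(struct vcpu_tsc *v, uint64_t tsc_adjust)
{
    v->tsc_offset += tsc_adjust - v->tsc_adjust;
    v->tsc_adjust = tsc_adjust;
}
```

All arithmetic is modulo 2^64, so negative offsets wrap the same way the MSR values do.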

Signed-off-by: Joao Martins <joao.m.mart...@oracle.com>
Reviewed-by: Jan Beulich <jbeul...@suse.com>
---
Since v1:
 * Change from section numbers to section titles
 * Add citation of Intel SDM mentioning TSC_ADJUST being changed only through
 WRMSR
---
 xen/arch/x86/hvm/hvm.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 6ab60d2..e934aaa 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -387,13 +387,20 @@ void hvm_set_guest_tsc_fixed(struct vcpu *v, u64 guest_tsc, u64 at_tsc)
 }
 
 delta_tsc = guest_tsc - tsc;
-v->arch.hvm_vcpu.msr_tsc_adjust += delta_tsc
-  - v->arch.hvm_vcpu.cache_tsc_offset;
 v->arch.hvm_vcpu.cache_tsc_offset = delta_tsc;
 
 hvm_funcs.set_tsc_offset(v, v->arch.hvm_vcpu.cache_tsc_offset, at_tsc);
 }
 
+static void hvm_set_guest_tsc_msr(struct vcpu *v, u64 guest_tsc)
+{
+uint64_t tsc_offset = v->arch.hvm_vcpu.cache_tsc_offset;
+
+hvm_set_guest_tsc(v, guest_tsc);
+v->arch.hvm_vcpu.msr_tsc_adjust += v->arch.hvm_vcpu.cache_tsc_offset
+  - tsc_offset;
+}
+
 void hvm_set_guest_tsc_adjust(struct vcpu *v, u64 tsc_adjust)
 {
 v->arch.hvm_vcpu.cache_tsc_offset += tsc_adjust
@@ -3491,7 +3498,7 @@ int hvm_msr_write_intercept(unsigned int msr, uint64_t msr_content,
 break;
 
 case MSR_IA32_TSC:
-hvm_set_guest_tsc(v, msr_content);
+hvm_set_guest_tsc_msr(v, msr_content);
 break;
 
 case MSR_IA32_TSC_ADJUST:
-- 
2.1.4



