[Xen-devel] [PATCH 1/1] tools/livepatch: cleanup unnecessary "j = ARRAY_SIZE(action_options); "

2016-06-09 Thread Dongli Zhang
Local variable "j" would be used only when "i == ARRAY_SIZE(main_options)"
is true. Thus, it is not necessary to update "j" when "i ==
ARRAY_SIZE(main_options)" is false.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
---
 tools/misc/xen-livepatch.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/misc/xen-livepatch.c b/tools/misc/xen-livepatch.c
index 28f339a..3162489 100644
--- a/tools/misc/xen-livepatch.c
+++ b/tools/misc/xen-livepatch.c
@@ -435,8 +435,7 @@ int main(int argc, char *argv[])
"'xen-livepatch help'\n", argv[1]);
 return 1;
 }
-} else
-j = ARRAY_SIZE(action_options);
+}
 
 xch = xc_interface_open(0,0,0);
 if ( !xch )
-- 
1.9.1
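
For readers without the full file at hand, the control flow around the
removed assignment looks roughly like this (a minimal sketch, not the exact
xen-livepatch.c code; main_options/action_options stand in for the real
option tables):

int i, j = 0;

for ( i = 0; i < ARRAY_SIZE(main_options); i++ )
    if ( !strcmp(main_options[i].name, argv[1]) )
        break;

if ( i == ARRAY_SIZE(main_options) )
{
    /* argv[1] is not a main option: search the action table.  This is
     * the only path on which j is read afterwards. */
    for ( j = 0; j < ARRAY_SIZE(action_options); j++ )
        if ( !strcmp(action_options[j].name, argv[1]) )
            break;
}
/* The removed "else j = ARRAY_SIZE(action_options);" ran exactly when
 * i != ARRAY_SIZE(main_options), i.e. when j is never read again. */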




[Xen-devel] RFC: how to differentiate livepatched symbol and original symbol in Xen hypervisor

2016-06-06 Thread Dongli Zhang
Hi,

About the livepatch TODO: "Make XENPF_get_symbol also include Live Patch
symbols" mentioned at http://wiki.xenproject.org/wiki/XSplice, I am wondering
how the patched function would be dumped.

For instance, if function "gnttab_usage_print_all" is livepatched, it would
show up as a symbol in both the Xen hypervisor and the applied livepatch.
How are we going to differentiate the old and new symbols, which share the
same symbol name but have different addresses? One address is the original
and the other is the one pointed to by the "e9" (jmp) instruction.

Here is a sample from my test machine. The following is my own customized
Xen debug output from "xl debug-keys x". I am patching my own function
"my_old_func" in the Xen hypervisor.

(XEN) name=my_global_domain, value=0x82d080409054, size=28, new=1
(XEN) name=my_old_func, value=0x82d080409070, size=89, new=0
(XEN) name=mg_data, value=0x82d08040a000, size=4, new=1

The following is the current result of XENPF_get_symbol on Dom0:

root@vm:/soft/img# cat /proc/xen/xensyms | grep my_old_func
82d0802465a4 T my_old_func
82d0802465a4 t .text.my_old_func

In this example, I livepatched "my_old_func", so we now have two symbols
referring to the same name but different addresses (82d0802465a4 and
82d080409070).

Are we going to use a new nm symbol flag, append an extra string to the
symbol name (e.g., my_old_func#livepatch), or does this not matter at all?
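
One possible shape of the "extra string" option, as a hypothetical sketch
(format_symbol and the "#livepatch" suffix are illustrative, not existing
Xen code): with it, the two entries above would render as "my_old_func"
and "my_old_func#livepatch".

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical: render one xensyms line, tagging live-patch symbols so
 * that two entries with the same name remain distinguishable. */
static int format_symbol(char *buf, size_t len, unsigned long addr,
                         char type, const char *name, bool from_livepatch)
{
    return snprintf(buf, len, "%016lx %c %s%s", addr, type, name,
                    from_livepatch ? "#livepatch" : "");
}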

Thank you very much!

Best,

Dongli Zhang



[Xen-devel] [PATCH 1/1] xsplice-build-tools: replace realpath with readlink in xsplice-build

2016-05-27 Thread Dongli Zhang
Replace realpath with readlink, since the '-m' option is not supported by
some versions of realpath.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
 xsplice-build | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/xsplice-build b/xsplice-build
index 2eb7993..5852186 100755
--- a/xsplice-build
+++ b/xsplice-build
@@ -213,7 +213,7 @@ while [[ $# -gt 0 ]]; do
 ;;
 --xen-syms)
 shift
-XENSYMS="$(realpath -m -- "$1")"
+XENSYMS="$(readlink -m -- "$1")"
 [ -f "$XENSYMS" ] || die "xen-syms file does not exist"
 shift
 ;;
@@ -238,9 +238,9 @@ done
 [ -z "$outputarg" ] && die "Output directory not given"
 [ -z "$DEPENDS" ] && die "Build-id dependency not given"
 
-SRCDIR="$(realpath -m -- "$srcarg")"
-PATCHFILE="$(realpath -m -- "$patcharg")"
-OUTPUT="$(realpath -m -- "$outputarg")"
+SRCDIR="$(readlink -m -- "$srcarg")"
+PATCHFILE="$(readlink -m -- "$patcharg")"
+OUTPUT="$(readlink -m -- "$outputarg")"
 
 [ -d "${SRCDIR}" ] || die "Xen directory does not exist"
 [ -f "${PATCHFILE}" ] || die "Patchfile does not exist"
-- 
1.9.1




[Xen-devel] xsplice: should we use realpath or readlink in xsplice-build?

2016-05-27 Thread Dongli Zhang
Hi Ross and Konrad,

I played xsplice with a sample patch I wrote. It works and really fantastic
work! 

I use the utility from
git://xenbits.xen.org/people/konradwilk/xsplice-build-tools.git

However, the 'realpath -m' command in the file 'xsplice-build' works on
neither my Ubuntu nor my Oracle Linux (no -m option). It works well after I
replace all 'realpath' with 'readlink'. The patch ELF was generated
successfully and could be uploaded and applied.
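
The failure mode is the same one libc's realpath(3) has: without '-m',
every path component must already exist. A minimal, self-contained
demonstration (the path is made up):

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char buf[PATH_MAX];

    /* Fails with ENOENT because a component does not exist -- exactly
     * the case that "readlink -m" (and "realpath -m", where supported)
     * canonicalizes anyway. */
    if ( !realpath("/nonexistent/dir/file", buf) )
        perror("realpath");
    return 0;
}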

I am not sure whether this is an issue with my OS configuration or whether
we should rather use 'readlink'. Since this is a new git repo, I am not
sure whether I can submit this patch to xen-devel, if a patch is helpful.

Thank you very much!

Best,

Dongli Zhang





[Xen-devel] How about use different "max_grant_frames" for different domains?

2016-05-26 Thread Dongli Zhang
Hi,

Because of the growing CPU and memory resources on servers, Xen users are
allowed to, and actually do, create lots of vdisks and vnics, e.g., 6
vnics, 20 vdisks and 24 vcpus (the number of vnic queues is proportional to
the number of vcpus assigned to the guest), and this requires the
administrator to increase the value of "max_grant_frames" by changing the
cmdline param or DEFAULT_MAX_NR_GRANT_FRAMES in the Xen hypervisor.
Currently, all Xen guests share the same "max_grant_frames" in the Xen
hypervisor; that is, if we increase the grant frame limit because of one
VM, all other VMs' grant frame limits will increase as well.

Therefore, please let me know whether using an individual
"max_grant_frames" instead of a global variable is reasonable. How about
relocating this variable to "struct grant_table" or "struct domain"?

This would require changes in both the Xen hypervisor and the Xen tools,
e.g.:
1. A new "do_grant_table_op cmd" or "do_domctl cmd" to allow Dom0 to
increase the grant frame number for a specific domain. Dom0 would be able
to use this interface to change the grant frame number for a VM initially
in "libxl__build_pre" and libxc, or later via a new xl command.
2. A change in the xen tools' "parse_config_data()" to allow the
administrator to add a new param to vm.cfg, e.g., "max_nr_frame=128".
3. A lot of changes in the Xen hypervisor, so that all references to the
global "max_grant_frames" are redirected to a domain-specific variable
(e.g., domain->max_grant_frames); a sketch follows this list.
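
A sketch of what item 3 could look like; the field name and placement are
assumptions for illustration, not a final interface:

/* Hypothetical: a per-domain limit replacing the global one. */
struct grant_table {
    /* ... existing fields ... */
    unsigned int max_grant_frames;    /* per-domain grant frame limit */
};

static inline unsigned int max_grant_frames_of(const struct domain *d)
{
    return d->grant_table->max_grant_frames;
}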

Please let me know your feedback. If this is reasonable, I will work on
patches.

Thank you very much!

Dongli Zhang





[Xen-devel] [PATCH 1/1] tools/xsplice: cleanup unnecessary "j = ARRAY_SIZE(action_options); "

2016-05-29 Thread Dongli Zhang
Local variable "j" would be used only when "i == ARRAY_SIZE(main_options)"
is true. Thus, it is not necessary to update "j" when "i ==
ARRAY_SIZE(main_options)" is false.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
 tools/misc/xen-xsplice.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/misc/xen-xsplice.c b/tools/misc/xen-xsplice.c
index ddaa079..811dc57 100644
--- a/tools/misc/xen-xsplice.c
+++ b/tools/misc/xen-xsplice.c
@@ -435,8 +435,7 @@ int main(int argc, char *argv[])
"'xen-xsplice help'\n", argv[1]);
 return 1;
 }
-} else
-j = ARRAY_SIZE(action_options);
+}
 
 xch = xc_interface_open(0,0,0);
 if ( !xch )
-- 
1.9.1




[Xen-devel] [PATCH 1/1] tools/livepatch: initialise j to 0 to make some versions of gcc happy

2016-06-15 Thread Dongli Zhang
Initialise j to 0 to make some versions of gcc (e.g., gcc 4.5/4.3) happy
and avoid the compilation error introduced by commit
beba3693f7243e68bbe31fe3794da91068eeea5b.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
 tools/misc/xen-livepatch.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/misc/xen-livepatch.c b/tools/misc/xen-livepatch.c
index 3162489..62c072e 100644
--- a/tools/misc/xen-livepatch.c
+++ b/tools/misc/xen-livepatch.c
@@ -412,7 +412,7 @@ struct {
 
 int main(int argc, char *argv[])
 {
-int i, j, ret;
+int i, j = 0, ret;
 
 if ( argc  <= 1 )
 {
-- 
1.9.1




[Xen-devel] [PATCH v2 1/1] tools/livepatch: cleanup unnecessary "j = ARRAY_SIZE(action_options); "

2016-06-14 Thread Dongli Zhang
Local variable "j" would be used only when "i == ARRAY_SIZE(main_options)"
is true. Thus, it is not necessary to update "j" when "i ==
ARRAY_SIZE(main_options)" is false.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
 tools/misc/xen-livepatch.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/tools/misc/xen-livepatch.c b/tools/misc/xen-livepatch.c
index 28f339a..62c072e 100644
--- a/tools/misc/xen-livepatch.c
+++ b/tools/misc/xen-livepatch.c
@@ -412,7 +412,7 @@ struct {
 
 int main(int argc, char *argv[])
 {
-int i, j, ret;
+int i, j = 0, ret;
 
 if ( argc  <= 1 )
 {
@@ -435,8 +435,7 @@ int main(int argc, char *argv[])
"'xen-livepatch help'\n", argv[1]);
 return 1;
 }
-} else
-j = ARRAY_SIZE(action_options);
+}
 
 xch = xc_interface_open(0,0,0);
 if ( !xch )
-- 
1.9.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 1/1] tools/livepatch: initialise j to 0 to make some versions of gcc happy

2016-06-16 Thread Dongli Zhang

> I suggest pasting in Olaf's exact error message here.
> 
> To avoid extra round trip, I propose updating the commit message as
> followed:
> 
> Initialise j to 0 to make some versions of gcc (e.g., gcc4.5/4.3)
> happy to
> avoid compilation error by commit
> beba3693f7243e68bbe31fe3794da91068eeea5b.
> 
> Failure manifests with gcc 4.5 as:
> 
> [  153s] cc1: warnings being treated as errors
> [  153s] xen-livepatch.c: In function 'main':
> [  153s] xen-livepatch.c:415:12: error: 'j' may be used
> uninitialized in this function
> [  153s] make[3]: *** [xen-livepatch.o] Error 1
> 
> If you agree with this I will handle the updating while doing my next
> sweep.


Sure. Feel free to update the commit message.



[Xen-devel] [PATCH 1/1] xen-netback: uninitialized fields in xenvif_rx_action

2016-02-01 Thread Dongli Zhang
While npo.copy and npo.meta are initialized in xenvif_rx_action, fields
such as npo.meta_prod are directly used later in xenvif_gop_skb without
being initialized first. Although the output of xenvif_rx_action is based
on the difference between new npo->meta_prod and old npo->meta_prod, it is
better to initialize them to 0 at the beginning.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
 drivers/net/xen-netback/netback.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 61b97c3..32f0fbd 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -560,6 +560,10 @@ static void xenvif_rx_action(struct xenvif_queue *queue)
bool need_to_notify = false;
 
struct netrx_pending_operations npo = {
+   .copy_prod = 0,
+   .copy_cons = 0,
+   .meta_prod = 0,
+   .meta_cons = 0,
.copy  = queue->grant_copy_op,
.meta  = queue->meta,
};
-- 
1.9.1
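
One note on the hunk above: it is about explicitness rather than a
behavioural change, since with a designated initializer C already
guarantees that members not named in the list are zero-initialized (C99
6.7.8p19). A self-contained demonstration:

#include <assert.h>

struct demo { int a, b, c; };

int main(void)
{
    /* Only .b is named; .a and .c are guaranteed to start at 0. */
    struct demo d = { .b = 7 };
    assert(d.a == 0 && d.c == 0);
    return 0;
}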








[Xen-devel] Is "feature-multicast-control" supported by xen-netfront?

2016-01-26 Thread Dongli Zhang
Hi,

There was a patch
(http://lists.xenproject.org/archives/html/xen-devel/2015-09/msg00285.html) in
2015 to support multicast packet filtering in Xen netback. I found that this
feature, "feature-multicast-control", is currently not supported by Xen
netfront in mainline Linux.

Is there a patch for xen-netfront to support feature-multicast-control?
Otherwise, I might need to write it myself.

Thank you very much!

Best,


Dongli Zhang
http://finallyjustice.github.io   


Re: [Xen-devel] Question about the best practice to install two versions of Xen toolstack on the same machine

2016-05-24 Thread Dongli Zhang
Hi Meng,

I have never installed two Xen (actually xen-tools) builds on the same
server. However, I always hate to install xen-tools at the default
locations, since that will mess up the OS. Here is how I install xen-tools
without overriding privileged folders like /etc or /usr. Hope this can
help.

ENV: xen-4.6.1 and Ubuntu 14.04.3.

You can easily install xen.gz by just modifying grub. Let's skip this.

About xen-tools,

E.g., my username is "zhang" and I want to install the tools at /soft/xen.
I always run the following with a script.

1. sudo mkdir /soft/xen && sudo chown zhang /soft/xen

2. sudo mkdir /var/lib/xen && sudo chown zhang /var/lib/xen

3. sudo mkdir /var/lib/xenstored && sudo chown zhang /var/lib/xenstored

4. ./configure --prefix=/soft/xen --sysconfdir=/soft/xen

5. make tools

6. make install-tools (do not use "sudo" or "su" in this step)

7. export LD_LIBRARY_PATH=/soft/xen/lib

8. PATH=$PATH:/soft/xen/bin:/soft/xen/sbin

9. export PYTHONPATH=/soft/xen/lib/python2.7/site-packages (this is for
pygrub)

10. Go to /soft/xen and run "./init.d/xencommons"

11. Boot guest with xl for either pv, pvhvm or hvm, never tried pvh :)

Hopefully, you can install different xen-tools at different locations and
let them share the same /var/lib/xen and /var/lib/xenstored.

2016-05-23 15:02 GMT+08:00 Jan Beulich <jbeul...@suse.com>:

> >>> On 20.05.16 at 19:56, <men...@cis.upenn.edu> wrote:
> > On Fri, May 20, 2016 at 6:20 AM, Jan Beulich <jbeul...@suse.com> wrote:
> >>>>> On 19.05.16 at 20:40, <men...@cis.upenn.edu> wrote:
> >>> Does anyone try to install two version of Xen toolstack on the same
> machine?
> >>> Is there any documentation about the best practice to install two
> >>> versions of Xen onto the same machine?
> >>
> >> Or, as an alternative to Olaf's reply, don't install the tools at all,
> but
> >> instead run everything right out of the build trees. That requires some
> >> script wrappers to get things like the library search path set up
> >> correctly, but with that in place it has been working fine for me for
> >> years.
> >>
> >
> > Thank you so much for your suggestions!  I tried to add the library
> > and bins in xen/dist/install into the PATH and LD_LIBRARY_PATH, but
> > failed to have the system work properly.
> >
> > I'm wondering if it's convenient for you, would you mind sharing your
> > script wrapper? I can learn from it and customize it for my machine.
>
> Sure, here you go. For commands other then "xl" it ought to be
> hard linked to respective other names. I cannot easily tell whether
> it makes assumptions on settings done elsewhere in my systems,
> so don't put too high hopes into being able to use it as is.
>
> Jan
>
>


-- 
Dongli Zhang (张东立)
finallyjustice.github.io


[Xen-devel] [PATCH 1/1] xen: move TLB-flush filtering out into populate_physmap

2016-09-05 Thread Dongli Zhang
This patch implements part of the TODO left in commit
a902c12ee45fc9389eb8fe54eeddaf267a555c58. It moves TLB-flush filtering out
into populate_physmap.

Because of the TLB-flush in alloc_heap_pages, it is very slow to create a
guest with a memory size of more than 100GB on a host with 100+ cpus.

This patch introduces a "MEMF_no_tlbflush" bit in memflags to indicate
whether the TLB-flush should be done in alloc_heap_pages or in its caller
populate_physmap. Once this bit is set in memflags, alloc_heap_pages will
skip the TLB-flush.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
 xen/common/memory.c     | 26 ++++++++++++++++++++++++++
 xen/common/page_alloc.c |  3 ++-
 xen/include/xen/mm.h    |  2 ++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/xen/common/memory.c b/xen/common/memory.c
index f34dd56..14ec5fa 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -141,6 +141,8 @@ static void populate_physmap(struct memop_args *a)
 unsigned int i, j;
 xen_pfn_t gpfn, mfn;
 struct domain *d = a->domain, *curr_d = current->domain;
+   bool_t need_tlbflush = 0;
+   uint32_t tlbflush_timestamp = 0;
 
 if ( !guest_handle_subrange_okay(a->extent_list, a->nr_done,
  a->nr_extents-1) )
@@ -149,6 +151,8 @@ static void populate_physmap(struct memop_args *a)
 if ( a->extent_order > (a->memflags & MEMF_populate_on_demand ? MAX_ORDER :
 max_order(curr_d)) )
 return;
+   
+   a->memflags |= MEMF_no_tlbflush;
 
 for ( i = a->nr_done; i < a->nr_extents; i++ )
 {
@@ -213,6 +217,18 @@ static void populate_physmap(struct memop_args *a)
  i, a->nr_extents);
 goto out;
 }
+   
+   for ( j = 0; j < (1U << a->extent_order); j++ )
+   {
+   if ( page[j].u.free.need_tlbflush &&
+(page[j].tlbflush_timestamp <= tlbflush_current_time()) &&
+(!need_tlbflush ||
+ (page[j].tlbflush_timestamp > tlbflush_timestamp)) )
+   {
+   need_tlbflush = 1;
+   tlbflush_timestamp = page[j].tlbflush_timestamp;
+   }
+   }
 
 mfn = page_to_mfn(page);
 }
@@ -232,6 +248,16 @@ static void populate_physmap(struct memop_args *a)
 }
 
 out:
+   if ( need_tlbflush )
+   {
+   cpumask_t mask = cpu_online_map;
+   tlbflush_filter(mask, tlbflush_timestamp);
+   if ( !cpumask_empty(&mask) )
+   {
+   perfc_incr(need_flush_tlb_flush);
+   flush_tlb_mask(&mask);
+   }
+   }
 a->nr_done = i;
 }
 
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 18ff6cf..79f633b 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -827,7 +827,8 @@ static struct page_info *alloc_heap_pages(
 BUG_ON(pg[i].count_info != PGC_state_free);
 pg[i].count_info = PGC_state_inuse;
 
-if ( pg[i].u.free.need_tlbflush &&
+if ( !(memflags & MEMF_no_tlbflush) &&
+pg[i].u.free.need_tlbflush &&
  (pg[i].tlbflush_timestamp <= tlbflush_current_time()) &&
  (!need_tlbflush ||
   (pg[i].tlbflush_timestamp > tlbflush_timestamp)) )
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 58bc0b8..880ca88 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -221,6 +221,8 @@ struct npfec {
 #define  MEMF_exact_node  (1U<<_MEMF_exact_node)
 #define _MEMF_no_owner5
 #define  MEMF_no_owner(1U<<_MEMF_no_owner)
+#define _MEMF_no_tlbflush 6
+#define  MEMF_no_tlbflush (1U<<_MEMF_no_tlbflush)
 #define _MEMF_node8
 #define  MEMF_node_mask   ((1U << (8 * sizeof(nodeid_t))) - 1)
 #define  MEMF_node(n) ((((n) + 1) & MEMF_node_mask) << _MEMF_node)
-- 
1.9.1
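
In caller-side form, the idea of the patch above is: while handing pages
to the new guest, only remember the newest timestamp of any page that
still needs a flush, then issue a single filtered flush at the end. A toy,
self-contained model of that pattern (names are illustrative, not Xen's):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct page { bool need_tlbflush; uint32_t tlbflush_timestamp; };

/* Stand-in for the filtered flush: one flush covering every CPU whose
 * TLB may predate `stamp`, instead of one flush per allocation chunk. */
static void filtered_flush(uint32_t stamp)
{
    printf("flush @%u\n", (unsigned)stamp);
}

static void populate(const struct page *pg, unsigned int n)
{
    bool need_flush = false;
    uint32_t stamp = 0;

    for ( unsigned int i = 0; i < n; i++ )
        if ( pg[i].need_tlbflush &&
             (!need_flush || pg[i].tlbflush_timestamp > stamp) )
        {
            need_flush = true;
            stamp = pg[i].tlbflush_timestamp;
        }

    if ( need_flush )
        filtered_flush(stamp);    /* once, not per chunk */
}

int main(void)
{
    struct page pg[3] = { { true, 5 }, { false, 0 }, { true, 9 } };

    populate(pg, 3);
    return 0;
}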




[Xen-devel] [PATCH v2 1/1] xen: move TLB-flush filtering out into populate_physmap

2016-09-06 Thread Dongli Zhang
This patch implements part of the TODO left in commit
a902c12ee45fc9389eb8fe54eeddaf267a555c58. It moves TLB-flush filtering out
into populate_physmap.

Because of the TLB-flush in alloc_heap_pages, it is very slow to create a
guest with a memory size of more than 100GB on a host with 100+ cpus.

This patch introduces a "MEMF_no_tlbflush" bit in memflags to indicate
whether the TLB-flush should be done in alloc_heap_pages or in its caller
populate_physmap. Once this bit is set in memflags, alloc_heap_pages will
skip the TLB-flush.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
 xen/common/memory.c     | 26 ++++++++++++++++++++++++++
 xen/common/page_alloc.c |  3 ++-
 xen/include/xen/mm.h    |  2 ++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/xen/common/memory.c b/xen/common/memory.c
index f34dd56..50c1a07 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -141,6 +141,8 @@ static void populate_physmap(struct memop_args *a)
 unsigned int i, j;
 xen_pfn_t gpfn, mfn;
 struct domain *d = a->domain, *curr_d = current->domain;
+bool_t need_tlbflush = 0;
+uint32_t tlbflush_timestamp = 0;
 
 if ( !guest_handle_subrange_okay(a->extent_list, a->nr_done,
  a->nr_extents-1) )
@@ -150,6 +152,8 @@ static void populate_physmap(struct memop_args *a)
 max_order(curr_d)) )
 return;
 
+a->memflags |= MEMF_no_tlbflush;
+
 for ( i = a->nr_done; i < a->nr_extents; i++ )
 {
 if ( i != a->nr_done && hypercall_preempt_check() )
@@ -214,6 +218,18 @@ static void populate_physmap(struct memop_args *a)
 goto out;
 }
 
+for ( j = 0; j < (1U << a->extent_order); j++ )
+{
+if ( page[j].u.free.need_tlbflush &&
+ (page[j].tlbflush_timestamp <= tlbflush_current_time()) &&
+ (!need_tlbflush ||
+ (page[j].tlbflush_timestamp > tlbflush_timestamp)) )
+{
+need_tlbflush = 1;
+tlbflush_timestamp = page[j].tlbflush_timestamp;
+}
+}
+
 mfn = page_to_mfn(page);
 }
 
@@ -232,6 +248,16 @@ static void populate_physmap(struct memop_args *a)
 }
 
 out:
+if ( need_tlbflush )
+{
+cpumask_t mask = cpu_online_map;
+tlbflush_filter(mask, tlbflush_timestamp);
+if ( !cpumask_empty(&mask) )
+{
+perfc_incr(need_flush_tlb_flush);
+flush_tlb_mask(&mask);
+}
+}
 a->nr_done = i;
 }
 
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 18ff6cf..e0283fc 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -827,7 +827,8 @@ static struct page_info *alloc_heap_pages(
 BUG_ON(pg[i].count_info != PGC_state_free);
 pg[i].count_info = PGC_state_inuse;
 
-if ( pg[i].u.free.need_tlbflush &&
+if ( !(memflags & MEMF_no_tlbflush) &&
+ pg[i].u.free.need_tlbflush &&
  (pg[i].tlbflush_timestamp <= tlbflush_current_time()) &&
  (!need_tlbflush ||
   (pg[i].tlbflush_timestamp > tlbflush_timestamp)) )
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 58bc0b8..880ca88 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -221,6 +221,8 @@ struct npfec {
 #define  MEMF_exact_node  (1U<<_MEMF_exact_node)
 #define _MEMF_no_owner5
 #define  MEMF_no_owner(1U<<_MEMF_no_owner)
+#define _MEMF_no_tlbflush 6
+#define  MEMF_no_tlbflush (1U<<_MEMF_no_tlbflush)
 #define _MEMF_node8
 #define  MEMF_node_mask   ((1U << (8 * sizeof(nodeid_t))) - 1)
 #define  MEMF_node(n) ((((n) + 1) & MEMF_node_mask) << _MEMF_node)
-- 
1.9.1




[Xen-devel] [PATCH v3 1/1] xen: move TLB-flush filtering out into populate_physmap during vm creation

2016-09-07 Thread Dongli Zhang
This patch implements part of the TODO left in commit
a902c12ee45fc9389eb8fe54eeddaf267a555c58. It moves TLB-flush filtering out
into populate_physmap. Because of the TLB-flush in alloc_heap_pages, it is
very slow to create a guest with a memory size of more than 100GB on a
host with 100+ cpus.

This patch introduces a "MEMF_no_tlbflush" bit in memflags to indicate
whether the TLB-flush should be done in alloc_heap_pages or in its caller
populate_physmap. Once this bit is set in memflags, alloc_heap_pages will
skip the TLB-flush. Using this bit after the vm is created might lead to a
security issue: it would make pages accessible to guest B while guest A
may still have a cached mapping to them.

Therefore, this patch also introduces an "already_scheduled" field in
struct domain to indicate whether this domain has ever been scheduled by
the hypervisor. MEMF_no_tlbflush can be set only during the vm creation
phase, while already_scheduled is still 0 before the domain gets scheduled
for the first time.

TODO: ballooning a very large amount of memory cannot benefit from this
patch and might still be slow.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>

---
Changed since v2:
  * Limit this optimization to domain creation time.

---
 xen/common/domain.c     |  2 ++
 xen/common/memory.c     | 33 +++++++++++++++++++++++++++++++++
 xen/common/page_alloc.c |  3 ++-
 xen/common/schedule.c   |  5 +++++
 xen/include/xen/mm.h    |  2 ++
 xen/include/xen/sched.h |  3 +++
 6 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index a8804e4..611a471 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -303,6 +303,8 @@ struct domain *domain_create(domid_t domid, unsigned int domcr_flags,
 if ( !zalloc_cpumask_var(&d->domain_dirty_cpumask) )
 goto fail;
 
+d->already_scheduled = 0;
+
 if ( domcr_flags & DOMCRF_hvm )
 d->guest_type = guest_type_hvm;
 else if ( domcr_flags & DOMCRF_pvh )
diff --git a/xen/common/memory.c b/xen/common/memory.c
index f34dd56..3641469 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -141,6 +141,8 @@ static void populate_physmap(struct memop_args *a)
 unsigned int i, j;
 xen_pfn_t gpfn, mfn;
 struct domain *d = a->domain, *curr_d = current->domain;
+bool_t need_tlbflush = 0;
+uint32_t tlbflush_timestamp = 0;
 
 if ( !guest_handle_subrange_okay(a->extent_list, a->nr_done,
  a->nr_extents-1) )
@@ -150,6 +152,12 @@ static void populate_physmap(struct memop_args *a)
 max_order(curr_d)) )
 return;
 
+/* MEMF_no_tlbflush can be set only during vm creation phase when
+ * already_scheduled is still 0 before this domain gets scheduled for
+ * the first time. */
+if ( d->already_scheduled == 0 )
+a->memflags |= MEMF_no_tlbflush;
+
 for ( i = a->nr_done; i < a->nr_extents; i++ )
 {
 if ( i != a->nr_done && hypercall_preempt_check() )
@@ -214,6 +222,21 @@ static void populate_physmap(struct memop_args *a)
 goto out;
 }
 
+if ( d->already_scheduled == 0 )
+{
+for ( j = 0; j < (1U << a->extent_order); j++ )
+{
+if ( page[j].u.free.need_tlbflush &&
+(page[j].tlbflush_timestamp <= tlbflush_current_time()) &&
+(!need_tlbflush ||
+ (page[j].tlbflush_timestamp > tlbflush_timestamp)) )
+{
+need_tlbflush = 1;
+tlbflush_timestamp = page[j].tlbflush_timestamp;
+}
+}
+}
+
 mfn = page_to_mfn(page);
 }
 
@@ -232,6 +255,16 @@ static void populate_physmap(struct memop_args *a)
 }
 
 out:
+if ( need_tlbflush )
+{
+cpumask_t mask = cpu_online_map;
+tlbflush_filter(mask, tlbflush_timestamp);
+if ( !cpumask_empty(&mask) )
+{
+perfc_incr(need_flush_tlb_flush);
+flush_tlb_mask(&mask);
+}
+}
 a->nr_done = i;
 }
 
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 18ff6cf..e0283fc 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -827,7 +827,8 @@ static struct page_info *alloc_heap_pages(
 BUG_ON(pg[i].count_info != PGC_state_free);
 pg[i].count_info = PGC_state_inuse;
 
-if ( pg[i].u.free.need_tlbflush &&
+if ( !(memflags & MEMF_no_tlbflush) &&
+ pg[i].u.free.need_tlbflush &&
  (pg[i].tlbflush_timestamp <= tlbflush_current_time()) &&
  (!need_tlbflush ||
(pg[i].tlbflush_timestamp > tlbflush_timestamp)) )

[Xen-devel] [PATCH v4 1/2] xen: replace complicated tlbflush check with an inline function

2016-09-12 Thread Dongli Zhang
This patch cleans up the code by replacing the complicated tlbflush check
with an inline function. We should use this inline function to avoid the
long and hard-to-read tlbflush check when implementing the TODOs left in
commit a902c12ee45fc9389eb8fe54eeddaf267a555c58.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
Changed since v3:
  * Wrap the complicated tlbflush condition check as inline function
(suggested by Dario).

---
 xen/common/page_alloc.c |  7 +++----
 xen/include/xen/mm.h    | 11 +++++++++++
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 18ff6cf..5b93a01 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -827,10 +827,9 @@ static struct page_info *alloc_heap_pages(
 BUG_ON(pg[i].count_info != PGC_state_free);
 pg[i].count_info = PGC_state_inuse;
 
-if ( pg[i].u.free.need_tlbflush &&
- (pg[i].tlbflush_timestamp <= tlbflush_current_time()) &&
- (!need_tlbflush ||
-  (pg[i].tlbflush_timestamp > tlbflush_timestamp)) )
+if ( page_needs_tlbflush(&pg[i], need_tlbflush,
+ tlbflush_timestamp,
+ tlbflush_current_time()) )
 {
 need_tlbflush = 1;
 tlbflush_timestamp = pg[i].tlbflush_timestamp;
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 58bc0b8..766559d 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -567,4 +567,15 @@ int prepare_ring_for_helper(struct domain *d, unsigned long gmfn,
 struct page_info **_page, void **_va);
 void destroy_ring_for_helper(void **_va, struct page_info *page);
 
+static inline int page_needs_tlbflush(struct page_info *page,
+  bool_t need_tlbflush,
+  uint32_t tlbflush_timestamp,
+  uint32_t tlbflush_current_time)
+{
+return page->u.free.need_tlbflush &&
+   page->tlbflush_timestamp <= tlbflush_current_time &&
+   (!need_tlbflush ||
+page->tlbflush_timestamp > tlbflush_timestamp);
+}
+
 #endif /* __XEN_MM_H__ */
-- 
1.9.1




[Xen-devel] [PATCH v4 2/2] xen: move TLB-flush filtering out into populate_physmap during vm creation

2016-09-12 Thread Dongli Zhang
This patch implements part of the TODO left in commit
a902c12ee45fc9389eb8fe54eeddaf267a555c58. It moves TLB-flush filtering out
into populate_physmap. Because of the TLB-flush in alloc_heap_pages, it is
very slow to create a guest with a memory size of more than 100GB on a
host with 100+ cpus.

This patch introduces a "MEMF_no_tlbflush" bit in memflags to indicate
whether the TLB-flush should be done in alloc_heap_pages or in its caller
populate_physmap. Once this bit is set in memflags, alloc_heap_pages will
skip the TLB-flush. Using this bit after the vm is created might lead to a
security issue: it would make pages accessible to guest B while guest A
may still have a cached mapping to them.

Therefore, this patch also introduces an "is_ever_unpaused" field in
struct domain to indicate whether this domain has ever been unpaused by
the hypervisor. MEMF_no_tlbflush can be set only during the vm creation
phase, while is_ever_unpaused is still false before the domain gets
unpaused for the first time.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
Changed since v3:
  * Set the flag to true in domain_unpause_by_systemcontroller when
unpausing the guest domain for the first time.
  * Use true/false for all boot_t variables.
  * Add unlikely to optimize "if statement".
  * Correct comment style.

Changed since v2:
  * Limit this optimization to domain creation time.

---
 xen/common/domain.c     | 11 +++++++++++
 xen/common/memory.c     | 34 ++++++++++++++++++++++++++++++++++
 xen/common/page_alloc.c |  3 ++-
 xen/include/xen/mm.h    |  2 ++
 xen/include/xen/sched.h |  3 +++
 5 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index a8804e4..7be1bee 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -303,6 +303,8 @@ struct domain *domain_create(domid_t domid, unsigned int domcr_flags,
 if ( !zalloc_cpumask_var(&d->domain_dirty_cpumask) )
 goto fail;
 
+d->is_ever_unpaused = false;
+
 if ( domcr_flags & DOMCRF_hvm )
 d->guest_type = guest_type_hvm;
 else if ( domcr_flags & DOMCRF_pvh )
@@ -1004,6 +1006,15 @@ int domain_unpause_by_systemcontroller(struct domain *d)
 {
 int old, new, prev = d->controller_pause_count;
 
+/*
+ * Set is_ever_unpaused to true when this domain gets unpaused for the
+ * first time. We record this information here to help populate_physmap
+ * verify whether the domain has ever been unpaused. MEMF_no_tlbflush
+ * is allowed to be set by populate_physmap only during vm creation.
+ */
+if ( unlikely(!d->is_ever_unpaused) )
+d->is_ever_unpaused = true;
+
 do
 {
 old = prev;
diff --git a/xen/common/memory.c b/xen/common/memory.c
index cc0f69e..f3a733b 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -141,6 +141,8 @@ static void populate_physmap(struct memop_args *a)
 unsigned int i, j;
 xen_pfn_t gpfn, mfn;
 struct domain *d = a->domain, *curr_d = current->domain;
+bool_t need_tlbflush = false;
+uint32_t tlbflush_timestamp = 0;
 
 if ( !guest_handle_subrange_okay(a->extent_list, a->nr_done,
  a->nr_extents-1) )
@@ -150,6 +152,14 @@ static void populate_physmap(struct memop_args *a)
 max_order(curr_d)) )
 return;
 
+/*
+ * MEMF_no_tlbflush can be set only during vm creation phase when
+ * is_ever_unpaused is still false before this domain gets unpaused for
+ * the first time.
+ */
+if ( unlikely(!d->is_ever_unpaused) )
+a->memflags |= MEMF_no_tlbflush;
+
 for ( i = a->nr_done; i < a->nr_extents; i++ )
 {
 if ( i != a->nr_done && hypercall_preempt_check() )
@@ -214,6 +224,20 @@ static void populate_physmap(struct memop_args *a)
 goto out;
 }
 
+if ( unlikely(!d->is_ever_unpaused) )
+{
+for ( j = 0; j < (1U << a->extent_order); j++ )
+{
+if ( page_needs_tlbflush(&page[j], need_tlbflush,
+ tlbflush_timestamp,
+ tlbflush_current_time()) )
+{
+need_tlbflush = true;
+tlbflush_timestamp = page[j].tlbflush_timestamp;
+}
+}
+}
+
 mfn = page_to_mfn(page);
 }
 
@@ -232,6 +256,16 @@ static void populate_physmap(struct memop_args *a)
 }
 
 out:
+if ( need_tlbflush )
+{
+cpumask_t mask = cpu_online_map;
+tlbflush_filter(mask, tlbflush_timestamp);
+if ( !cpumask_empty(&mask) )
+{
+perfc_incr(need_flush_tlb_flush);
+flush_tlb_mask(&mask);
+}

Re: [Xen-devel] [PATCH v3 1/1] xen: move TLB-flush filtering out into populate_physmap during vm creation

2016-09-08 Thread Dongli Zhang
Thank you all very much for the feedback. I will address them in the next
version.

> > Did you go through and check that there is nothing this information
> > can already get derived from? I can't immediately point you at
> > anything, but it feels like there should. 
> >
> Indeed. And if there isn't and we need to do add our own flagging,
> isn't there a better way and place where to put it (e.g., what Juergen
> and Andrew are hinting at)?

I would prefer that deriving whether guest boot has finished be neither
limited to a specific domain type (pv, hvm, pvhvm or pvh) nor to a specific
arch (i386, x86_64, arm32, arm64 or more in the future). Ideally, I would
use one field, or a combination of fields, belonging to "struct domain",
but there isn't one.

Can we use a field of vcpu[0]? How about domain->vcpu[0]->last_run_time?
The only line of code updating the field is in function "schedule":

    prev->last_run_time = now;

I am afraid there might be an existing interface (I did not find one) or a
future one that resets the entire vcpu[0] to 0.

Dongli Zhang



Re: [Xen-devel] [PATCH v4 1/2] xen: replace complicated tlbflush check with an inline function

2016-09-14 Thread Dongli Zhang
> I don't think you should pass this into the function ...
> 
> > +{
> > +return page->u.free.need_tlbflush &&
> > +   page->tlbflush_timestamp <= tlbflush_current_time &&
> 
> ... and use tlbflush_current_time() here instead.

I rewrote the inline function in xen/include/xen/mm.h to:

+#include <asm/flushtlb.h>
+
+static inline bool accumulate_tlbflush(bool need_tlbflush,
+   const struct page_info *page,
+   uint32_t tlbflush_timestamp)
+{
+return page->u.free.need_tlbflush &&
+   page->tlbflush_timestamp <= tlbflush_current_time() &&
+   (!need_tlbflush ||
+page->tlbflush_timestamp > tlbflush_timestamp);
+}

However, using tlbflush_current_time() and "asm/flushtlb.h" leads to the
following compile error:

>>>>>>>>>>>>>>>>>>>>>>>>>>>
In file included from 
/home/zhang/test/mainline-xen/xen/include/asm/flushtlb.h:14:0,
 from suspend.c:13:
/home/zhang/test/mainline-xen/xen/include/xen/mm.h: In function 
‘accumulate_tlbflush’:
/home/zhang/test/mainline-xen/xen/include/xen/mm.h:577:12: error: implicit 
declaration of function ‘tlbflush_current_time’ 
[-Werror=implicit-function-declaration]
page->tlbflush_timestamp <= tlbflush_current_time() &&
^
/home/zhang/test/mainline-xen/xen/include/xen/mm.h:577:12: error: nested extern 
declaration of ‘tlbflush_current_time’ [-Werror=nested-externs]
cc1: all warnings being treated as errors
make[4]: *** [suspend.o] Error 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>

I can work around the issue by removing "#include <asm/flushtlb.h>" from
xen/arch/x86/acpi/suspend.c, and then everything works fine.


Can I just rewrite the inline function as a #define macro? This minimizes
the changes to the code.

+#define accumulate_tlbflush(need_tlbflush, page, tlbflush_timestamp) \
+(page)->u.free.need_tlbflush &&  \
+(page)->tlbflush_timestamp <= tlbflush_current_time() && \
+(!need_tlbflush ||   \
+ (page)->tlbflush_timestamp > tlbflush_timestamp)

Thank you very much!

Dongli Zhang



Re: [Xen-devel] [PATCH v4 2/2] xen: move TLB-flush filtering out into populate_physmap during vm creation

2016-09-16 Thread Dongli Zhang
> > +/*
> > + * MEMF_no_tlbflush can be set only during vm creation phase when
> > + * is_ever_unpaused is still false before this domain gets unpaused for
> > + * the first time.
> > + */
> > +if ( unlikely(!d->is_ever_unpaused) )
> > +a->memflags |= MEMF_no_tlbflush;
> 
> So you no longer mean to expose this to the caller?

Hmm, I would prefer to expose this to the toolstack if that is OK with the
maintainers.

I copy and paste Wei's comments below:

==

> Rule 1. It is toolstack's responsibility to set the "MEMF_no_tlbflush" bit
> in memflags. The toolstack developers should be careful that
> "MEMF_no_tlbflush" should never be used after vm creation is finished.
> 

Is it possible to have a safety catch for this in the hypervisor? In
general IMHO we should avoid providing an interface that is possible to
create a security problem.

==

Hi Wei, since it is now possible to have a safety catch in the hypervisor
(the bit is allowed only before VM creation is finished), is it OK with you
to expose the MEMF_no_tlbflush bit to the toolstack?

Thank you very much!

Dongli Zhang



Re: [Xen-devel] [PATCH v4 1/2] xen: replace complicated tlbflush check with an inline function

2016-09-15 Thread Dongli Zhang
> > I rewrite the inline function in xen/include/xen/mm.h to:
> > 
> > +#include <asm/flushtlb.h>
> > +
> > +static inline bool accumulate_tlbflush(bool need_tlbflush,
> > +   const struct page_info *page,
> > +   uint32_t tlbflush_timestamp)
> > +{
> > +return page->u.free.need_tlbflush &&
> > +   page->tlbflush_timestamp <= tlbflush_current_time() &&
> > +   (!need_tlbflush ||
> > +page->tlbflush_timestamp > tlbflush_timestamp);
> > +}
> > 
> > However, to use tlbflush_current_time and "asm/flushtlb.h" would lead
> > to the following compiling error:
> > 
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > In file included from 
> > /home/zhang/test/mainline-xen/xen/include/asm/flushtlb.h:14:0,
> >  from suspend.c:13:
> > /home/zhang/test/mainline-xen/xen/include/xen/mm.h: In function 
> > ‘accumulate_tlbflush’:
> > /home/zhang/test/mainline-xen/xen/include/xen/mm.h:577:12: error: implicit 
> > declaration of function ‘tlbflush_current_time’ 
> > [-Werror=implicit-function-declaration]
> > page->tlbflush_timestamp <= tlbflush_current_time() &&
> > ^
> > /home/zhang/test/mainline-xen/xen/include/xen/mm.h:577:12: error: nested 
> > extern declaration of ‘tlbflush_current_time’ [-Werror=nested-externs]
> > cc1: all warnings being treated as errors
> > make[4]: *** [suspend.o] Error 1
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > 
> > I can workaround the issue by removing "#include " from
> > xen/arch/x86/acpi/suspend.c and then everything works fine.

> Removing? You should _add_ asm/tlbflush.h inclusion to xen/mm.h.

As mentioned in the previous email, there was a compile error from
suspend.c:13 even after asm/flushtlb.h was added to xen/mm.h.

The compile error is bypassed if I remove the include of asm/flushtlb.h
from suspend.c.

The following patch makes the inline function work while avoiding the
compile error. Since asm/flushtlb.h depends on get_order_from_bytes in
xen/mm.h, it is included after get_order_from_bytes in xen/mm.h.

If removing asm/flushtlb.h from suspend.c is not preferred, the other two
options I have are (1) use a #define macro or (2) pass
tlbflush_current_time() as an arg to accumulate_tlbflush.

diff --git a/xen/arch/x86/acpi/suspend.c b/xen/arch/x86/acpi/suspend.c
index 1d8344c..d5c67ee 100644
--- a/xen/arch/x86/acpi/suspend.c
+++ b/xen/arch/x86/acpi/suspend.c
@@ -10,7 +10,6 @@
 #include 
 #include 
 #include 
-#include <asm/flushtlb.h>
 #include 
 #include 
 #include 
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 58bc0b8..03bcbc3 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -567,4 +567,16 @@ int prepare_ring_for_helper(struct domain *d, unsigned long gmfn,
 struct page_info **_page, void **_va);
 void destroy_ring_for_helper(void **_va, struct page_info *page);
 
+#include <asm/flushtlb.h>
+
+static inline bool accumulate_tlbflush(bool need_tlbflush,
+   const struct page_info *page,
+   uint32_t tlbflush_timestamp)
+{
+return page->u.free.need_tlbflush &&
+   page->tlbflush_timestamp <= tlbflush_current_time() &&
+   (!need_tlbflush ||
+page->tlbflush_timestamp > tlbflush_timestamp);
+}
+
 #endif /* __XEN_MM_H__ */


Thank you very much!

Dongli Zhang



[Xen-devel] [PATCH v5 1/2] xen: replace tlbflush check and operation with inline functions

2016-09-18 Thread Dongli Zhang
This patch cleans up the code by replacing the complicated tlbflush check
and operation with inline functions. We should use those inline functions
to avoid the complicated tlbflush check and tlbflush operations when
implementing the TODOs left in commit
a902c12ee45fc9389eb8fe54eeddaf267a555c58 (More efficient TLB-flush
filtering in alloc_heap_pages()).

"#include <asm/flushtlb.h>" is removed from xen/arch/x86/acpi/suspend.c to
avoid the compile error after we include <asm/flushtlb.h> in
xen/include/xen/mm.h.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
Changed since v4:
  * Wrap the filtered tlbflush mask operation as inline function (suggested
by Jan).
  * Remove asm/flushtlb.h from suspend.c to avoid compiling error.

Changed since v3:
  * Wrap the complicated tlbflush condition check as inline function
(suggested by Dario).

---
 xen/arch/x86/acpi/suspend.c |  1 -
 xen/common/page_alloc.c     | 16 +++-------------
 xen/include/xen/mm.h        | 25 +++++++++++++++++++++++++
 3 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/acpi/suspend.c b/xen/arch/x86/acpi/suspend.c
index 1d8344c..d5c67ee 100644
--- a/xen/arch/x86/acpi/suspend.c
+++ b/xen/arch/x86/acpi/suspend.c
@@ -10,7 +10,6 @@
 #include 
 #include 
 #include 
-#include <asm/flushtlb.h>
 #include 
 #include 
 #include 
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 18ff6cf..173c10d 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -827,10 +827,8 @@ static struct page_info *alloc_heap_pages(
 BUG_ON(pg[i].count_info != PGC_state_free);
 pg[i].count_info = PGC_state_inuse;
 
-if ( pg[i].u.free.need_tlbflush &&
- (pg[i].tlbflush_timestamp <= tlbflush_current_time()) &&
- (!need_tlbflush ||
-  (pg[i].tlbflush_timestamp > tlbflush_timestamp)) )
+if ( accumulate_tlbflush(need_tlbflush, &pg[i],
+ tlbflush_timestamp) )
 {
 need_tlbflush = 1;
 tlbflush_timestamp = pg[i].tlbflush_timestamp;
@@ -849,15 +847,7 @@ static struct page_info *alloc_heap_pages(
 spin_unlock(&heap_lock);
 
 if ( need_tlbflush )
-{
-cpumask_t mask = cpu_online_map;
-tlbflush_filter(mask, tlbflush_timestamp);
-if ( !cpumask_empty(&mask) )
-{
-perfc_incr(need_flush_tlb_flush);
-flush_tlb_mask(&mask);
-}
-}
+filtered_flush_tlb_mask(tlbflush_timestamp);
 
 return pg;
 }
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index f470e49..85848e3 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -51,6 +51,7 @@
 #include 
 #include 
 #include 
+#include <xen/perfc.h>
 #include 
 
 TYPE_SAFE(unsigned long, mfn);
@@ -567,4 +568,28 @@ int prepare_ring_for_helper(struct domain *d, unsigned long gmfn,
 struct page_info **_page, void **_va);
 void destroy_ring_for_helper(void **_va, struct page_info *page);
 
+#include <asm/flushtlb.h>
+
+static inline bool accumulate_tlbflush(bool need_tlbflush,
+   const struct page_info *page,
+   uint32_t tlbflush_timestamp)
+{
+return page->u.free.need_tlbflush &&
+   page->tlbflush_timestamp <= tlbflush_current_time() &&
+   (!need_tlbflush ||
+page->tlbflush_timestamp > tlbflush_timestamp);
+}
+
+static inline void filtered_flush_tlb_mask(uint32_t tlbflush_timestamp)
+{
+cpumask_t mask = cpu_online_map;
+
+tlbflush_filter(mask, tlbflush_timestamp);
+if ( !cpumask_empty(&mask) )
+{
+perfc_incr(need_flush_tlb_flush);
+flush_tlb_mask(&mask);
+}
+}
+
 #endif /* __XEN_MM_H__ */
-- 
1.9.1




[Xen-devel] [PATCH v5 2/2] xen: move TLB-flush filtering out into populate_physmap during vm creation

2016-09-18 Thread Dongli Zhang
This patch implements part of the TODO left in commit
a902c12ee45fc9389eb8fe54eeddaf267a555c58 (More efficient TLB-flush
filtering in alloc_heap_pages()). It moves TLB-flush filtering out into
populate_physmap. Because of the TLB-flush in alloc_heap_pages, it is very
slow to create a guest with a memory size of more than 100GB on a host
with 100+ cpus.

This patch introduces a "MEMF_no_tlbflush" bit in memflags to indicate
whether the TLB-flush should be done in alloc_heap_pages or in its caller
populate_physmap. Once this bit is set in memflags, alloc_heap_pages will
skip the TLB-flush. Using this bit after the vm is created might lead to a
security issue: it would make pages accessible to guest B while guest A
may still have a cached mapping to them.

Therefore, this patch also introduces a "creation_finished" field in
struct domain to indicate whether this domain has ever been unpaused by
the hypervisor. MEMF_no_tlbflush can be set only during the vm creation
phase, while creation_finished is still false before the domain gets
unpaused for the first time.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
Changed since v4:
  * Rename is_ever_unpaused to creation_finished.
  * Change bool_t to bool.
  * Polish comments.

Changed since v3:
  * Set the flag to true in domain_unpause_by_systemcontroller when
unpausing the guest domain for the first time.
  * Use true/false for all boot_t variables.
  * Add unlikely to optimize "if statement".
  * Correct comment style.

Changed since v2:
  * Limit this optimization to domain creation time.

---
 xen/common/domain.c     |  8 ++++++++
 xen/common/memory.c     | 28 ++++++++++++++++++++++++++++
 xen/common/page_alloc.c |  3 ++-
 xen/include/xen/mm.h    |  2 ++
 xen/include/xen/sched.h |  6 ++++++
 5 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index a8804e4..c170c69 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1004,6 +1004,14 @@ int domain_unpause_by_systemcontroller(struct domain *d)
 {
 int old, new, prev = d->controller_pause_count;
 
+/*
+ * We record this information here for populate_physmap to figure out
+ * that the domain has finished being created. In fact, we're only
+ * allowed to set the MEMF_no_tlbflush flag during VM creation.
+ */
+if ( unlikely(!d->creation_finished) )
+d->creation_finished = true;
+
 do
 {
 old = prev;
diff --git a/xen/common/memory.c b/xen/common/memory.c
index cc0f69e..27d1f2a 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -141,6 +141,8 @@ static void populate_physmap(struct memop_args *a)
 unsigned int i, j;
 xen_pfn_t gpfn, mfn;
 struct domain *d = a->domain, *curr_d = current->domain;
+bool need_tlbflush = false;
+uint32_t tlbflush_timestamp = 0;
 
 if ( !guest_handle_subrange_okay(a->extent_list, a->nr_done,
  a->nr_extents-1) )
@@ -150,6 +152,17 @@ static void populate_physmap(struct memop_args *a)
 max_order(curr_d)) )
 return;
 
+/*
+ * With MEMF_no_tlbflush set, alloc_heap_pages() will ignore
+ * TLB-flushes. After VM creation, this is a security issue (it can
+ * make pages accessible to guest B, when guest A may still have a
+ * cached mapping to them). So we only do this only during domain
+ * creation, when the domain itself has not yet been unpaused for the
+ * first time.
+ */
+if ( unlikely(!d->creation_finished) )
+a->memflags |= MEMF_no_tlbflush;
+
 for ( i = a->nr_done; i < a->nr_extents; i++ )
 {
 if ( i != a->nr_done && hypercall_preempt_check() )
@@ -214,6 +227,19 @@ static void populate_physmap(struct memop_args *a)
 goto out;
 }
 
+if ( unlikely(a->memflags & MEMF_no_tlbflush) )
+{
+for ( j = 0; j < (1U << a->extent_order); j++ )
+{
+if ( accumulate_tlbflush(need_tlbflush, &page[j],
+ tlbflush_timestamp) )
+{
+need_tlbflush = true;
+tlbflush_timestamp = page[j].tlbflush_timestamp;
+}
+}
+}
+
 mfn = page_to_mfn(page);
 }
 
@@ -232,6 +258,8 @@ static void populate_physmap(struct memop_args *a)
 }
 
 out:
+if ( need_tlbflush )
+filtered_flush_tlb_mask(tlbflush_timestamp);
 a->nr_done = i;
 }
 
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 173c10d..a67f49b 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -827,7 +827,8 @@ static struct page_info *alloc_heap_pages(
BUG_ON(pg[i].count_info != PGC_state_free);

[Xen-devel] [PATCH v6 2/2] xen: move TLB-flush filtering out into populate_physmap during vm creation

2016-09-19 Thread Dongli Zhang
This patch implements part of the TODO left in commit
a902c12ee45fc9389eb8fe54eeddaf267a555c58 (More efficient TLB-flush
filtering in alloc_heap_pages()). It moves TLB-flush filtering out into
populate_physmap. Because of the TLB-flush in alloc_heap_pages, it is very
slow to create a guest with a memory size of more than 100GB on a host
with 100+ cpus.

This patch introduces a "MEMF_no_tlbflush" bit in memflags to indicate
whether the TLB-flush should be done in alloc_heap_pages or in its caller
populate_physmap. Once this bit is set in memflags, alloc_heap_pages will
skip the TLB-flush. Using this bit after the vm is created might lead to a
security issue: it would make pages accessible to guest B while guest A
may still have a cached mapping to them.

Therefore, this patch also introduces a "creation_finished" field in
struct domain to indicate whether this domain has ever been unpaused by
the hypervisor. MEMF_no_tlbflush can be set only during the vm creation
phase, while creation_finished is still false before the domain gets
unpaused for the first time.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
Changed since v5:
  * Remove conditional check before "d->creation_finished = true;".
  * Place "bool creation_finished;" next to the other group of booleans.
  * Remove duplicate "only" in comments.

Changed since v4:
  * Rename is_ever_unpaused to creation_finished.
  * Change bool_t to bool.
  * Polish comments.

Changed since v3:
  * Set the flag to true in domain_unpause_by_systemcontroller when
unpausing the guest domain for the first time.
  * Use true/false for all boot_t variables.
  * Add unlikely to optimize "if statement".
  * Correct comment style.

Changed since v2:
  * Limit this optimization to domain creation time.

---
 xen/common/domain.c     |  7 +++++++
 xen/common/memory.c     | 22 ++++++++++++++++++++++
 xen/common/page_alloc.c |  4 +++-
 xen/include/xen/mm.h    |  2 ++
 xen/include/xen/sched.h |  6 ++++++
 5 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index a8804e4..3abaca9 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1004,6 +1004,13 @@ int domain_unpause_by_systemcontroller(struct domain *d)
 {
 int old, new, prev = d->controller_pause_count;
 
+/*
+ * We record this information here for populate_physmap to figure out
+ * that the domain has finished being created. In fact, we're only
+ * allowed to set the MEMF_no_tlbflush flag during VM creation.
+ */
+d->creation_finished = true;
+
 do
 {
 old = prev;
diff --git a/xen/common/memory.c b/xen/common/memory.c
index cc0f69e..21797ca 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -141,6 +141,8 @@ static void populate_physmap(struct memop_args *a)
 unsigned int i, j;
 xen_pfn_t gpfn, mfn;
 struct domain *d = a->domain, *curr_d = current->domain;
+bool need_tlbflush = false;
+uint32_t tlbflush_timestamp = 0;
 
 if ( !guest_handle_subrange_okay(a->extent_list, a->nr_done,
  a->nr_extents-1) )
@@ -150,6 +152,17 @@ static void populate_physmap(struct memop_args *a)
 max_order(curr_d)) )
 return;
 
+/*
+ * With MEMF_no_tlbflush set, alloc_heap_pages() will ignore
+ * TLB-flushes. After VM creation, this is a security issue (it can
+ * make pages accessible to guest B, when guest A may still have a
+ * cached mapping to them). So we do this only during domain creation,
+ * when the domain itself has not yet been unpaused for the first
+ * time.
+ */
+if ( unlikely(!d->creation_finished) )
+a->memflags |= MEMF_no_tlbflush;
+
 for ( i = a->nr_done; i < a->nr_extents; i++ )
 {
 if ( i != a->nr_done && hypercall_preempt_check() )
@@ -214,6 +227,13 @@ static void populate_physmap(struct memop_args *a)
 goto out;
 }
 
+if ( unlikely(a->memflags & MEMF_no_tlbflush) )
+{
+for ( j = 0; j < (1U << a->extent_order); j++ )
+accumulate_tlbflush(&need_tlbflush, &page[j],
+ &tlbflush_timestamp);
+}
+
 mfn = page_to_mfn(page);
 }
 
@@ -232,6 +252,8 @@ static void populate_physmap(struct memop_args *a)
 }
 
 out:
+if ( need_tlbflush )
+filtered_flush_tlb_mask(tlbflush_timestamp);
 a->nr_done = i;
 }
 
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index d7ca3a0..ae2476d 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -827,7 +827,9 @@ static struct page_info *alloc_heap_pages(
 BUG_ON(pg[i].count_info != PGC_state_free);
 pg[i].count_info = PGC_state_inuse;
 
-accumulate_tlbflush(&need_tlbflush, &pg[i], &tlbflush_timestamp);

[Xen-devel] [PATCH v6 1/2] xen: replace tlbflush check and operation with inline functions

2016-09-19 Thread Dongli Zhang
This patch cleans up the code by replacing the complicated tlbflush check
and operation with inline functions. We should use those inline functions
to avoid the complicated tlbflush check and tlbflush operations when
implementing the TODOs left in commit
a902c12ee45fc9389eb8fe54eeddaf267a555c58 (More efficient TLB-flush
filtering in alloc_heap_pages()).

"#include <asm/flushtlb.h>" is removed from xen/arch/x86/acpi/suspend.c to
avoid the compile error after we include <asm/flushtlb.h> in
xen/include/xen/mm.h.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
Changed since v5:
  * Move the if() and its body of tlbflush check into inline function.

Changed since v4:
  * Wrap the filtered tlbflush mask operation as inline function (suggested
by Jan).
  * Remove asm/flushtlb.h from suspend.c to avoid compiling error.

Changed since v3:
  * Wrap the complicated tlbflush condition check as inline function
(suggested by Dario).

---
 xen/arch/x86/acpi/suspend.c |  1 -
 xen/common/page_alloc.c     | 19 ++-----------------
 xen/include/xen/mm.h        | 29 +++++++++++++++++++++++++
 3 files changed, 31 insertions(+), 18 deletions(-)

diff --git a/xen/arch/x86/acpi/suspend.c b/xen/arch/x86/acpi/suspend.c
index 1d8344c..d5c67ee 100644
--- a/xen/arch/x86/acpi/suspend.c
+++ b/xen/arch/x86/acpi/suspend.c
@@ -10,7 +10,6 @@
 #include 
 #include 
 #include 
-#include <asm/flushtlb.h>
 #include 
 #include 
 #include 
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 18ff6cf..d7ca3a0 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -827,14 +827,7 @@ static struct page_info *alloc_heap_pages(
 BUG_ON(pg[i].count_info != PGC_state_free);
 pg[i].count_info = PGC_state_inuse;
 
-if ( pg[i].u.free.need_tlbflush &&
- (pg[i].tlbflush_timestamp <= tlbflush_current_time()) &&
- (!need_tlbflush ||
-  (pg[i].tlbflush_timestamp > tlbflush_timestamp)) )
-{
-need_tlbflush = 1;
-tlbflush_timestamp = pg[i].tlbflush_timestamp;
-}
+accumulate_tlbflush(&need_tlbflush, &pg[i], &tlbflush_timestamp);
 
 /* Initialise fields which have other uses for free pages. */
 pg[i].u.inuse.type_info = 0;
@@ -849,15 +842,7 @@ static struct page_info *alloc_heap_pages(
 spin_unlock(&heap_lock);
 
 if ( need_tlbflush )
-{
-cpumask_t mask = cpu_online_map;
-tlbflush_filter(mask, tlbflush_timestamp);
-if ( !cpumask_empty(&mask) )
-{
-perfc_incr(need_flush_tlb_flush);
-flush_tlb_mask(&mask);
-}
-}
+filtered_flush_tlb_mask(tlbflush_timestamp);
 
 return pg;
 }
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index f470e49..50db01d 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -51,6 +51,7 @@
 #include 
 #include 
 #include 
+#include <xen/perfc.h>
 #include 
 
 TYPE_SAFE(unsigned long, mfn);
@@ -567,4 +568,32 @@ int prepare_ring_for_helper(struct domain *d, unsigned long gmfn,
 struct page_info **_page, void **_va);
 void destroy_ring_for_helper(void **_va, struct page_info *page);
 
+#include <asm/flushtlb.h>
+
+static inline void accumulate_tlbflush(bool *need_tlbflush,
+   const struct page_info *page,
+   uint32_t *tlbflush_timestamp)
+{
+if ( page->u.free.need_tlbflush &&
+ page->tlbflush_timestamp <= tlbflush_current_time() &&
+ (!*need_tlbflush ||
+  page->tlbflush_timestamp > *tlbflush_timestamp) )
+{
+*need_tlbflush = true;
+*tlbflush_timestamp = page->tlbflush_timestamp;
+}
+}
+
+static inline void filtered_flush_tlb_mask(uint32_t tlbflush_timestamp)
+{
+cpumask_t mask = cpu_online_map;
+
+tlbflush_filter(mask, tlbflush_timestamp);
+if ( !cpumask_empty(&mask) )
+{
+perfc_incr(need_flush_tlb_flush);
+flush_tlb_mask(&mask);
+}
+}
+
 #endif /* __XEN_MM_H__ */
-- 
1.9.1




Re: [Xen-devel] [PATCH v6 2/2] xen: move TLB-flush filtering out into populate_physmap during vm creation

2016-09-20 Thread Dongli Zhang
> > This patch implemented parts of TODO left in commit id
> > a902c12ee45fc9389eb8fe54eeddaf267a555c58 (More efficient TLB-flush
> > filtering in alloc_heap_pages()). It moved TLB-flush filtering out into
> > populate_physmap. Because of TLB-flush in alloc_heap_pages, it's very slow
> > to create a guest with memory size of more than 100GB on host with 100+
> > cpus.
> >
> >
> Do you have some actual numbers on how much faster after applying this
> patch?
> 
> This is mostly for writing release note etc, so it is fine if you don't
> have numbers at hand.

I do not have numbers for the upstream version at hand right now. Before
sending the patchset out for review, I always tested its performance by
backporting it to Oracle VM, which is based on a very old Xen version. The
backport just involves: (1) copying & pasting code, (2) changing bool to
bool_t and (3) changing true/false to 1/0.

The test machine has 8 NUMA nodes, 2048GB of memory and 128 CPUs. With this
patchset applied, the time to re-create a VM (with 135GB of memory and 12
vcpus) is reduced from 5min to 20s.



Re: [Xen-devel] [PATCH v4 2/2] xen: move TLB-flush filtering out into populate_physmap during vm creation

2016-09-16 Thread Dongli Zhang
> What is the scenario that you would want toolstack to set such flag?
> 
> Shouldn't hypervisor always set the flag when the guest is never
> unpaused and always clear / ignore that flag if the guest is ever
> unpaused? If that's all is needed, why does toolstack need to get
> involved?

You are right. I will not expose the flag to toolstack.

- Original Message -
From: wei.l...@citrix.com
To: dongli.zh...@oracle.com
Cc: jbeul...@suse.com, wei.l...@citrix.com, konrad.w...@oracle.com, 
sstabell...@kernel.org, t...@xen.org, dario.faggi...@citrix.com, 
ian.jack...@eu.citrix.com, george.dun...@eu.citrix.com, 
david.vra...@citrix.com, xen-devel@lists.xen.org, andrew.coop...@citrix.com
Sent: Friday, September 16, 2016 6:55:33 PM GMT +08:00 Beijing / Chongqing / 
Hong Kong / Urumqi
Subject: Re: [PATCH v4 2/2] xen: move TLB-flush filtering out into 
populate_physmap during vm creation

On Fri, Sep 16, 2016 at 03:47:23AM -0700, Dongli Zhang wrote:
> > > +/*
> > > + * MEMF_no_tlbflush can be set only during vm creation phase when
> > > + * is_ever_unpaused is still false before this domain gets unpaused for
> > > + * the first time.
> > > + */
> > > +if ( unlikely(!d->is_ever_unpaused) )
> > > +a->memflags |= MEMF_no_tlbflush;
> > 
> > So you no longer mean to expose this to the caller?
> 
> hmmm I would prefer to expose this to the toolstack if it is OK with the
> maintainers.
> 
> I copy and paste Wei's comments below:
> 
> ==
> 
> > Rule 1. It is toolstack's responsibility to set the "MEMF_no_tlbflush" bit
> > in memflags. The toolstack developers should be careful that
> > "MEMF_no_tlbflush" should never be used after vm creation is finished.
> > 
> 
> Is it possible to have a safety catch for this in the hypervisor? In
> general IMHO we should avoid providing an interface that is possible to
> create a security problem.
> 
> ==
> 
> Hi Wei, since it is possible to have a safety catch now in the hypervisor
> (the bit is allowed only before VM creation is finished), is it OK for you
> to expose the MEMF_no_tlbflush bit to the toolstack?
> MEMF_no_tlbflush bit to toolstack?
> 

What is the scenario that you would want toolstack to set such flag?

Shouldn't hypervisor always set the flag when the guest is never
unpaused and always clear / ignore that flag if the guest is ever
unpaused? If that's all is needed, why does toolstack need to get
involved?

Do I miss something here?

Wei.


> Thank you very much!
> 
> Dongli Zhang



Re: [Xen-devel] [PATCH v2 1/1] xen: move TLB-flush filtering out into populate_physmap

2016-09-07 Thread Dongli Zhang
> But since what Dongli cares about at the moment is domain creation, it
> certainly won't hurt to limit this optimization to domain creation time;
> then we can discuss enabling it for ballooning when someone finds it to
> be an issue.

Thank you all very much for the feedback. The current limitation only
impacts vm creation time, unless someone balloons 100+GB of memory.

To limit this optimization to domain creation time, what do you think about
adding the following rules to xen and the toolstack? Are they reasonable?

Rule 1. It is the toolstack's responsibility to set the "MEMF_no_tlbflush"
bit in memflags. The toolstack developers should be careful that
"MEMF_no_tlbflush" is never used after vm creation is finished.

Rule 2. xen should check at the hypervisor level that MEMF_no_tlbflush is
allowed to be set only when the current populate_physmap operation is
initiated by dom0. Otherwise, MEMF_no_tlbflush should be masked out of
memflags if ( d == curr_d && d->domain_id != 0 ). Therefore, this patch
would not impact the security of memory ballooning operations.

Would you please let me know if the rules above are reasonable? Are there
any suggestions for better solutions? xc_dom_boot_mem_init is very close to
the end of vm creation, and I could not find a better solution unless we add
a new XEN_DOMCTL operation that intentionally notifies xen when
xc_dom_boot_mem_init is finished, so that MEMF_no_tlbflush is no longer
used afterwards.

I have copied the patches related to the above rules into this email. I
will send them out for review if the above solution works.


@@ -150,6 +152,11 @@ static void populate_physmap(struct memop_args *a)
 max_order(curr_d)) )
 return;
 
+/* MEMF_no_tlbflush is masked out if current populate_physmap operation is
+ * not initiated by dom0 */
+if ( d == curr_d && d->domain_id != 0 )
+a->memflags &= ~MEMF_no_tlbflush;
+
 for ( i = a->nr_done; i < a->nr_extents; i++ )
 {
 if ( i != a->nr_done && hypercall_preempt_check() )


@@ -1185,7 +1185,7 @@ static int meminit_pv(struct xc_dom_image *dom)
 xen_pfn_t extents[SUPERPAGE_BATCH_SIZE];
 xen_pfn_t pfn_base_idx;
 
-memflags = 0;
+memflags = MEMF_no_tlbflush;
 if ( pnode != XC_NUMA_NO_NODE )
 memflags |= XENMEMF_exact_node(pnode);
 
@@ -1273,7 +1273,7 @@ static int meminit_hvm(struct xc_dom_image *dom)
 int rc;
 unsigned long stat_normal_pages = 0, stat_2mb_pages = 0, 
 stat_1gb_pages = 0;
-unsigned int memflags = 0;
+unsigned int memflags = MEMF_no_tlbflush;
 int claim_enabled = dom->claim_enabled;
 uint64_t total_pages;
 xen_vmemrange_t dummy_vmemrange[2];


Dongli Zhang



Re: [Xen-devel] [RFD] OP-TEE (and probably other TEEs) support

2016-11-28 Thread Dongli Zhang
> ... original TrustZone design.
>
> Personally I prefer second approach. I, even, did small PoC that
> allows different guests to work with OP-TEE (but not simultaneously!).
> You can find patches at [1] if you are interested.
> During working on this PoC I have identified main questions that
> should be answered:
>
> On XEN side:
> 1. SMC handling in XEN. There are many different SMCs and only portion
> of them belong to TEE. We need some SMC dispatcher that will route
> calls to different subsystems. Like PSCI calls to PSCI subsystem, TEE
> calls to TEE subsystem.
>
> 2. Support for different TEEs. There are OP-TEE, Google Trusty, TI
> M-Shield... They all work thru SMC, but have different protocols.
> Currently, we are aimed only to OP-TEE. But we need some generic API
> in XEN, so support for new TEE can be easily added.
>
> 3. TEE services. Hypervisor should inform TEE when new guest is
> created or destroyed, it should tag SMCs to TEE with GuestID, so TEE
> can isolate guest data on its side.

So you assume the xen hypervisor is in the TCB?

>
> 4. SMC mangling. RichOS communicates with TEE using shared buffers, by
> providing physical memory addresses. Hypervisor should convert IPAs to
> PAs.
> Currently I'm rewriting parts of OP-TEE to make it support arbitrary
> buffers originated from RichOS.
>
> 5. Events from TEE. This is hard topic. Sometimes OP-TEE needs some
> services from RichOS. For example it wants Linux to service pending
> IRQ request, or allocate portion of shared memory, or lock calling
> thread, etc. This is called "RPC request". To do RPC request OP-TEE
> initiates return to Normal World, but it sets special return code to
> indicate that Linux should do some job for OP-TEE. When Linux finishes
> work, it initiates another SMC with code like "I have finished RPC
> request" and OP-TEE resumes its execution.
> OP-TEE mutexes create problem there. We don't  want to sleep in secure
> state, so when OP-TEE thread gets blocked on a mutex, it issues RPC
> request that asks calling thread to wait on wait queue. When mutex
> owner unlocks it, that another thread also issues RPC to wake up first
> thread.
> This works perfectly when there are one OS (or one guest). But when
> there are many, it is possible that request from one guest blocks
> another guest. That another guest will wait on wait queue, but there
> will be no one, who can wake it up. So we need another mechanism to
> wake up sleeping threads. Obvious candidate is IPI. There are 8
> non-secure IPIs that are all used by the linux kernel. But there are also 8
> secure IPIs. I think OP-TEE can use one of those to deliver events to
> Normal World. But this will require changes to OP-TEE, XEN and Linux
> kernel.
>
> 6. Some mechanism to control which guests can work with TEE. At this
> time I have no idea how this should work.
>
> On OP-TEE side:
> 1. Shared memory redesign which is almost complete.
> 2. Implement new SMCs from hypervisor to TEE to track VM lifecycle.
> 3. Track VM IDs to isolated VM data.
> 4. RPCs to sleeping guests.
>
> That is basically all what I currently have on my mind. It will be
> great if all interested sides will share their opinions, requirements,
> thoughts on this.
>
> [1] https://github.com/lorc/xen/commits/staging-4.7 - Xen + OP-TEE PoC

On which environment (or dev board) is this code supported?

>
> --
> WBR Volodymyr Babchuk aka lorc [+380976646013]
> mailto: vlad.babc...@gmail.com
>



-- 
Dongli Zhang
finallyjustice.github.io



Re: [Xen-devel] [PATCH 1/1] xen-netfront: do not cast grant table reference to signed short

2016-10-31 Thread Dongli Zhang
Hi David and Jan,

I did more testing on the code. Casting to either (long) or (unsigned long)
would be fine.

However, there is still an issue: ref is of type uint32_t, so
IS_ERR_VALUE((unsigned long)ref) would not return true when ref = -ENOSPC
(or another error code).

IS_ERR_VALUE((long)ref) would return false as well.

The solution is to cast ref to (int) first as follows:

-   BUG_ON((signed short)ref < 0);
+   WARN_ON_ONCE(IS_ERR_VALUE((unsigned long)(int)ref));
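
For reference, here is a tiny standalone userspace demo (with IS_ERR_VALUE
reduced to a simplified form of its kernel definition, and -28 standing in
for -ENOSPC) of why the intermediate (int) cast matters on 64-bit: a
uint32_t zero-extends when converted to unsigned long, while an int
sign-extends.

#include <stdio.h>
#include <stdint.h>

#define MAX_ERRNO	4095
#define IS_ERR_VALUE(x)	((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

int main(void)
{
	uint32_t ref = (uint32_t)-28;	/* an error code stored in a u32 */

	/* Zero-extends to 0x00000000ffffffe4, outside the error range. */
	printf("%d\n", IS_ERR_VALUE((unsigned long)ref));	/* prints 0 */

	/* (int) reinterprets the bits as -28, which then sign-extends
	 * to 0xffffffffffffffe4 and is caught. */
	printf("%d\n", IS_ERR_VALUE((unsigned long)(int)ref));	/* prints 1 */

	return 0;
}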


David, I am very sorry for this error and I will be more careful next time.
Would you please let me know whether I should resend a new patch or send an
incremental one based on the previous patch at
https://git.kernel.org/cgit/linux/kernel/git/davem/net.git?

Thank you very much!

Dongli Zhang


- Original Message -
From: da...@davemloft.net
To: dongli.zh...@oracle.com
Cc: linux-ker...@vger.kernel.org, xen-de...@lists.xenproject.org, 
net...@vger.kernel.org, boris.ostrov...@oracle.com, david.vra...@citrix.com, 
jgr...@suse.com
Sent: Tuesday, November 1, 2016 4:06:27 AM GMT +08:00 Beijing / Chongqing / 
Hong Kong / Urumqi
Subject: Re: [PATCH 1/1] xen-netfront: do not cast grant table reference to 
signed short

From: Dongli Zhang <dongli.zh...@oracle.com>
Date: Mon, 31 Oct 2016 13:38:29 +0800

> While grant reference is of type uint32_t, xen-netfront erroneously casts
> it to signed short in BUG_ON().
> 
> This would lead to the xen domU panic during boot-up or migration when it
> is attached with lots of paravirtual devices.
> 
> Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>

Since this is consistent with how the macros in linux/err.h handle "is
this an error" checks, this change looks good to me.

Applied, thanks.



[Xen-devel] Why cast to "signed short" before checking the return value of gnttab_claim_grant_reference() in xen-netfront.c?

2016-10-28 Thread Dongli Zhang
Hi,

Would anyone please help with a question on xen-netfront.c code?

In xen-netfront.c, the return value of gnttab_claim_grant_reference() is
checked with "BUG_ON((signed short)ref < 0);". Why do we use signed short
here while the return value is of type uint32_t?

Am I missing anything, or can I send a patch to fix this issue? In
xen-blkfront.c and xen-scsifront.c, "BUG_ON(gnt_list_entry->gref == -ENOSPC);"
is used to check the return value.

The guest kernel would panic once grant refs reach a very large number (when
the guest is attached with a large number of devices and live migrated).

Thank you very much!

Dongli Zhang



Re: [Xen-devel] [PATCH 1/1] xen-netfront: do not cast grant table reference to signed short

2016-10-31 Thread Dongli Zhang
> >  ref = gnttab_claim_grant_reference(&queue->gref_rx_head);
> > -BUG_ON((signed short)ref < 0);
> > +WARN_ON_ONCE(IS_ERR_VALUE((unsigned long)ref));

> You really need to cast to plain (or signed) long here - casting to
> unsigned long will work only in 32-bit configurations, as otherwise
> you lose the sign of the value.

It seems the sign of the value would not be lost, since casting to unsigned
long uses sign extension. Is my understanding wrong, or did I miss anything?
I have verified this with both a sample userspace hello-world program and a
kernel module.

> And then just issuing a warning here is insufficient, I think: Either
> you follow David's line of thought assuming that no failure here is

Keeping this can help debug xen-netfront.

> possible at all (in which case the BUG_ON() can be ditched without
> replacement), or you follow your original one (which matches mine)
> that we can't exclude an error here because of a bug elsewhere,
> in which case this either needs to stay BUG_ON() or should be
> followed by some form of bailing out (so that the bad ref won't get
> stored, preventing its later use from causing further damage).

The reason I use a warning instead of BUG_ON() is that Linus suggested using
WARN_ON_ONCE() in a previous email:
http://lkml.iu.edu/hypermail/linux/kernel/1610.0/00878.html

I will change it to BUG_ON() if that is OK with you.

Thank you very much!

Dongli Zhang 


- Original Message -
From: jbeul...@suse.com
To: dongli.zh...@oracle.com
Cc: david.vra...@citrix.com, xen-de...@lists.xenproject.org, 
boris.ostrov...@oracle.com, jgr...@suse.com, linux-ker...@vger.kernel.org, 
net...@vger.kernel.org
Sent: Monday, October 31, 2016 3:48:03 PM GMT +08:00 Beijing / Chongqing / Hong 
Kong / Urumqi
Subject: Re: [Xen-devel] [PATCH 1/1] xen-netfront: do not cast grant table 
reference to signed short

>>> On 31.10.16 at 06:38, <dongli.zh...@oracle.com> wrote:
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -304,7 +304,7 @@ static void xennet_alloc_rx_buffers(struct netfront_queue *queue)
>  	queue->rx_skbs[id] = skb;
>  
>  	ref = gnttab_claim_grant_reference(&queue->gref_rx_head);
> - BUG_ON((signed short)ref < 0);
> + WARN_ON_ONCE(IS_ERR_VALUE((unsigned long)ref));

You really need to cast to plain (or signed) long here - casting to
unsigned long will work only in 32-bit configurations, as otherwise
you lose the sign of the value.

And then just issuing a warning here is insufficient, I think: Either
you follow David's line of thought assuming that no failure here is
possible at all (in which case the BUG_ON() can be ditched without
replacement), or you follow your original one (which matches mine)
that we can't exclude an error here because of a bug elsewhere,
in which case this either needs to stay BUG_ON() or should be
followed by some form of bailing out (so that the bad ref won't get
stored, preventing its later use from causing further damage).

Jan



[Xen-devel] [PATCH 1/1] xen-netfront: do not cast grant table reference to signed short

2016-10-30 Thread Dongli Zhang
While grant reference is of type uint32_t, xen-netfront erroneously casts
it to signed short in BUG_ON().

This would lead to the xen domU panic during boot-up or migration when it
is attached with lots of paravirtual devices.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
 drivers/net/xen-netfront.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index e17879d..189a28d 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -304,7 +304,7 @@ static void xennet_alloc_rx_buffers(struct netfront_queue *queue)
 	queue->rx_skbs[id] = skb;
 
 	ref = gnttab_claim_grant_reference(&queue->gref_rx_head);
-	BUG_ON((signed short)ref < 0);
+	WARN_ON_ONCE(IS_ERR_VALUE((unsigned long)ref));
 	queue->grant_rx_ref[id] = ref;
 
 	page = skb_frag_page(&skb_shinfo(skb)->frags[0]);
@@ -428,7 +428,7 @@ static void xennet_tx_setup_grant(unsigned long gfn, unsigned int offset,
 	id = get_id_from_freelist(&queue->tx_skb_freelist, queue->tx_skbs);
 	tx = RING_GET_REQUEST(&queue->tx, queue->tx.req_prod_pvt++);
 	ref = gnttab_claim_grant_reference(&queue->gref_tx_head);
-	BUG_ON((signed short)ref < 0);
+	WARN_ON_ONCE(IS_ERR_VALUE((unsigned long)ref));
 
 	gnttab_grant_foreign_access_ref(ref, queue->info->xbdev->otherend_id,
 					gfn, GNTMAP_readonly);
-- 
2.7.4




[Xen-devel] [PATCH 1/1] xen-netfront: cast grant table reference first to type int

2016-11-01 Thread Dongli Zhang
IS_ERR_VALUE() in commit 87557efc27f6a50140fb20df06a917f368ce3c66
("xen-netfront: do not cast grant table reference to signed short") would
not return true for error code unless we cast ref first to type int.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
 drivers/net/xen-netfront.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 189a28d..bf2744e 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -304,7 +304,7 @@ static void xennet_alloc_rx_buffers(struct netfront_queue *queue)
 	queue->rx_skbs[id] = skb;
 
 	ref = gnttab_claim_grant_reference(&queue->gref_rx_head);
-	WARN_ON_ONCE(IS_ERR_VALUE((unsigned long)ref));
+	WARN_ON_ONCE(IS_ERR_VALUE((unsigned long)(int)ref));
 	queue->grant_rx_ref[id] = ref;
 
 	page = skb_frag_page(&skb_shinfo(skb)->frags[0]);
@@ -428,7 +428,7 @@ static void xennet_tx_setup_grant(unsigned long gfn, unsigned int offset,
 	id = get_id_from_freelist(&queue->tx_skb_freelist, queue->tx_skbs);
 	tx = RING_GET_REQUEST(&queue->tx, queue->tx.req_prod_pvt++);
 	ref = gnttab_claim_grant_reference(&queue->gref_tx_head);
-	WARN_ON_ONCE(IS_ERR_VALUE((unsigned long)ref));
+	WARN_ON_ONCE(IS_ERR_VALUE((unsigned long)(int)ref));
 
 	gnttab_grant_foreign_access_ref(ref, queue->info->xbdev->otherend_id,
 					gfn, GNTMAP_readonly);
-- 
2.7.4




Re: [Xen-devel] xen-blkfront hang

2017-08-04 Thread Dongli Zhang


On 08/04/2017 05:13 PM, Roger Pau Monné wrote:
> On Fri, Aug 04, 2017 at 10:00:09AM +0200, Valentin Vidic wrote:
>> On Mon, Jul 31, 2017 at 09:09:19AM +0800, Dongli Zhang wrote:
>>> To verify whether the above patch would help, please check the
>>> nr_grant_frames value in guest domU. If this value is exactly the same as
>>> the maximum grant frames (by default, xen mainline uses 32) and the number
>>> of free grant references is very small, the above patch might help.
>>
>> You are right, this is what I get after the machine hangs:
>>
>>   crash> print nr_grant_frames 
>>   $1 = 32
>>   crash> print gnttab_free_count 
>>   $2 = 9
>>
>>> The best way is to increase gnttab_max_frames to a larger value (e.g., 256)
>>> in the dom0 xen.gz grub entry.
>>
>> Thank you, this seems to help.  The test machine does not hang now and
>> the numbers are looking better now:
>>
>>   crash> print nr_grant_frames 
>>   $1 = 59
>>   crash> print gnttab_free_count 
>>   $2 = 356
> 
> At some point I've already expressed my opinion that having a
> per-queue list of persistent grants was not a good idea. Now I think
> the only solution is to remove persistent grants, or to lower the
> default number of per-queue persistent grants to a ridiculously low
> value.

It would be more efficient if (1) persistent grants were shared by all
queues and (2) there were a new mechanism to allow the frontend to actively
ask the backend to unmap existing persistent grants. So far, it is the
backend's responsibility to decide when to unmap persistent grants.

Dongli Zhang

> 
> Roger.
> 



Re: [Xen-devel] [PATCH v2 1/1] xen/blkfront: always allocate grants first from per-queue persistent grants

2017-07-19 Thread Dongli Zhang
Hi Konrad,

In addition to Junxiao's patch on xen-blkfront, would you please help merge this
to mainline as well?

Thank you very much!

Dongli Zhang

On 06/28/2017 11:34 PM, Roger Pau Monné wrote:
> On Wed, Jun 28, 2017 at 08:57:28PM +0800, Dongli Zhang wrote:
>> This patch partially reverts 3df0e50 ("xen/blkfront: pseudo support for
>> multi hardware queues/rings"). The xen-blkfront queue/ring might hang due
>> to grants allocation failure in the situation when gnttab_free_head is
>> almost empty while many persistent grants are reserved for this queue/ring.
>>
>> As persistent grants management was per-queue since 73716df ("xen/blkfront:
>> make persistent grants pool per-queue"), we should always allocate from
>> persistent grants first.
>>
>> Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
> 
> Acked-by: Roger Pau Monné <roger@citrix.com>
> 
> Thanks.
> 



Re: [Xen-devel] Paravirtualization drivers query

2017-07-21 Thread Dongli Zhang


On 07/21/2017 03:12 PM, shishir tiwari wrote:
> 
> > Hi
> >
> > I am trying understand Xen Pv drivers and i writing my own pv fronend 
> and
> > backend driver.
> >
> > 1. For driver internal communication how do i create/write node in 
> backend
> > driver and how to read in fronted drivers.
> > 2.how do i create one shared page in backend driver to write/read data 
> in
> > frontend driver.
> >
> 
> >Depending on where your driver lives >(kernel or userspace), the APIs are
> >going to be different.
> 
> 
> My driver will be in kernel space. I want to create some Node and shared page 
> or
> queue. So I can transfer data for domu to dom0. 

Hi Shishir,

I suggest you read drivers/net/xen-netfront.c and drivers/net/xen-netback/*
in the linux kernel.

You would rely on xenbus/xenstore to read/write nodes in the frontend or
backend; the corresponding APIs are xenbus_write, xenbus_printf,
xenbus_read, etc.

To create a shared page between frontend and backend, you can use the grant
table mechanism. Generally, each grant table reference indicates one shared
memory page. The frontend allocates a page to share and binds this page to
a grant table reference. This reference is then passed to the backend via
xenstore or a ring buffer. The backend maps the shared page with this
reference via APIs like gnttab_map_refs().
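
As a heavily simplified kernel-side illustration (the function name, the
"ring-ref" key and the error handling are made up for this sketch; "dev" is
assumed to be the frontend's bound struct xenbus_device), the frontend side
could look roughly like this:

#include <linux/errno.h>
#include <linux/gfp.h>
#include <xen/grant_table.h>
#include <xen/page.h>
#include <xen/xenbus.h>

static int share_page_with_backend(struct xenbus_device *dev)
{
	void *page = (void *)get_zeroed_page(GFP_KERNEL);
	int gref;

	if (!page)
		return -ENOMEM;

	/* Grant the backend domain read/write access to this frame. */
	gref = gnttab_grant_foreign_access(dev->otherend_id,
					   virt_to_gfn(page), 0);
	if (gref < 0)
		return gref;

	/* Publish the reference so the backend can look it up and map it. */
	return xenbus_printf(XBT_NIL, dev->nodename, "ring-ref", "%u", gref);
}

The backend would then read "ring-ref" from xenstore and map the page with
the grant mapping APIs mentioned above.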

I also suggest reading the book "The Definitive Guide to the Xen Hypervisor"
and studying the linux paravirtual driver code.

Dongli Zhang

> 
> 
> > I have gone through the kernel code, but it is a little bit confusing to me.
> 
> >>Which bits do you find confusing?
> 
> 
> I am not able to understand who first creates the node and shared memory:
> xen, dom0, or maybe the frontend drivers?
> 
> 
> 
> 


[Xen-devel] [PATCH 2/2] tools: utility to dump guest grant table info

2017-06-30 Thread Dongli Zhang
As both xen-netfront and xen-blkfront support multi-queue, they would
consume a lot of grant table references when many paravirtual devices and
vcpus are assigned to the guest. The guest domU might panic or hang due to
grant allocation failure when nr_grant_frames in the guest has reached its
max value.

This utility helps administrators monitor guest grant table frame usage
from the dom0 side, so that debugging on the guest kernel side is no longer
required for crash/hang analysis.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
 tools/misc/Makefile   |  4 
 tools/misc/xen-gnttab-query.c | 45 +++
 2 files changed, 49 insertions(+)
 create mode 100644 tools/misc/xen-gnttab-query.c

diff --git a/tools/misc/Makefile b/tools/misc/Makefile
index 8152f7b..d081b4b 100644
--- a/tools/misc/Makefile
+++ b/tools/misc/Makefile
@@ -31,6 +31,7 @@ INSTALL_SBIN   += xenperf
 INSTALL_SBIN   += xenpm
 INSTALL_SBIN   += xenwatchdogd
 INSTALL_SBIN   += xen-livepatch
+INSTALL_SBIN   += xen-gnttab-query
 INSTALL_SBIN += $(INSTALL_SBIN-y)
 
 # Everything to be installed in a private bin/
@@ -108,4 +109,7 @@ xen-lowmemd: xen-lowmemd.o
 xencov: xencov.o
$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS)
 
+xen-gnttab-query: xen-gnttab-query.o
+   $(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS)
+
 -include $(DEPS)
diff --git a/tools/misc/xen-gnttab-query.c b/tools/misc/xen-gnttab-query.c
new file mode 100644
index 000..3f93a6c
--- /dev/null
+++ b/tools/misc/xen-gnttab-query.c
@@ -0,0 +1,45 @@
+#include 
+#include 
+#include 
+#include 
+
+void show_help(void)
+{
+fprintf(stderr,
+"xen-gnttab-query: query grant table info\n"
+"Usage: xen-gnttab-query [domid (default 0)]\n");
+}
+
+int main(int argc, char *argv[])
+{
+xc_interface *xch;
+int domid, rc, c;
+struct gnttab_query_size query;
+
+while ( (c = getopt(argc, argv, "h")) != -1 )
+{
+switch ( c )
+{
+case 'h':
+show_help();
+return 0;
+}
+}
+
+domid = (argc > 1) ? strtol(argv[1], NULL, 10) : 0;
+
+xch = xc_interface_open(0, 0, 0);
+if ( !xch )
+errx(1, "failed to open control interface");
+
+query.dom = domid;
+rc = xc_gnttab_query_size(xch, &query);
+
+if ( rc == 0 && (query.status == GNTST_okay) )
+printf("domid=%d: nr_frames=%d, max_nr_frames=%d\n",
+   query.dom, query.nr_frames, query.max_nr_frames);
+
+xc_interface_close(xch);
+
+return 0;
+}
-- 
2.7.4




[Xen-devel] [PATCH 1/2] tools/libxc: add interface for GNTTABOP_query_size

2017-06-30 Thread Dongli Zhang
This patch adds new interface for GNTTABOP_query_size in libxc to help
query the current grant table frames and maximum grant table frames for a
specific domain.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
 tools/libxc/include/xenctrl.h |  1 +
 tools/libxc/xc_gnttab.c   | 12 
 2 files changed, 13 insertions(+)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index 1629f41..155c69e 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -1597,6 +1597,7 @@ int xc_gnttab_op(xc_interface *xch, int cmd,
  void * op, int op_size, int count);
 /* Logs iff hypercall bounce fails, otherwise doesn't. */
 
+int xc_gnttab_query_size(xc_interface *xch, struct gnttab_query_size *query);
 int xc_gnttab_get_version(xc_interface *xch, int domid); /* Never logs */
 grant_entry_v1_t *xc_gnttab_map_table_v1(xc_interface *xch, int domid, int *gnt_num);
 grant_entry_v2_t *xc_gnttab_map_table_v2(xc_interface *xch, int domid, int *gnt_num);
diff --git a/tools/libxc/xc_gnttab.c b/tools/libxc/xc_gnttab.c
index af53fac..81a89fe 100644
--- a/tools/libxc/xc_gnttab.c
+++ b/tools/libxc/xc_gnttab.c
@@ -38,6 +38,18 @@ int xc_gnttab_op(xc_interface *xch, int cmd, void * op, int op_size, int count)
 return ret;
 }
 
+int xc_gnttab_query_size(xc_interface *xch, struct gnttab_query_size *query)
+{
+int rc;
+
+rc = xc_gnttab_op(xch, GNTTABOP_query_size, query, sizeof(*query), 1);
+
+if ( rc || (query->status != GNTST_okay) )
+ERROR("Could not query dom %d's grant size\n", query->dom);
+
+return rc;
+}
+
 int xc_gnttab_get_version(xc_interface *xch, int domid)
 {
 struct gnttab_get_version query;
-- 
2.7.4




Re: [Xen-devel] [PATCH v2 2/2] tools: utility to dump guest grant table info

2017-07-04 Thread Dongli Zhang
Hi Wei,

I will send the patch based on staging.

Dongli Zhang

On 07/04/2017 10:10 PM, Wei Liu wrote:
> I pushed your two patches and discovered you also need to patch
> .gitignore. Could you please send a patch for that? thanks
> 


[Xen-devel] [PATCH 1/1] gitignore: add tools/misc/xen-diag to .gitignore

2017-07-04 Thread Dongli Zhang
Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
 .gitignore | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.gitignore b/.gitignore
index 068f430..e1009e0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -211,6 +211,7 @@ tools/misc/xenperf
 tools/misc/xenpm
 tools/misc/xen-hvmctx
 tools/misc/xenlockprof
+tools/misc/xen-diag
 tools/misc/lowmemd
 tools/misc/xencov
 tools/pkg-config/*
-- 
2.7.4




[Xen-devel] [PATCH v2 1/2] tools/libxc: add interface for GNTTABOP_query_size

2017-07-02 Thread Dongli Zhang
This patch adds new interface for GNTTABOP_query_size in libxc to help
query the current grant table frames and maximum grant table frames for a
specific domain.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
Changed since v1:
  * Change %d to %u in ERROR()

---
 tools/libxc/include/xenctrl.h |  1 +
 tools/libxc/xc_gnttab.c   | 12 
 2 files changed, 13 insertions(+)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index 1629f41..155c69e 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -1597,6 +1597,7 @@ int xc_gnttab_op(xc_interface *xch, int cmd,
  void * op, int op_size, int count);
 /* Logs iff hypercall bounce fails, otherwise doesn't. */
 
+int xc_gnttab_query_size(xc_interface *xch, struct gnttab_query_size *query);
 int xc_gnttab_get_version(xc_interface *xch, int domid); /* Never logs */
 grant_entry_v1_t *xc_gnttab_map_table_v1(xc_interface *xch, int domid, int *gnt_num);
 grant_entry_v2_t *xc_gnttab_map_table_v2(xc_interface *xch, int domid, int *gnt_num);
diff --git a/tools/libxc/xc_gnttab.c b/tools/libxc/xc_gnttab.c
index af53fac..9e6f1fb 100644
--- a/tools/libxc/xc_gnttab.c
+++ b/tools/libxc/xc_gnttab.c
@@ -38,6 +38,18 @@ int xc_gnttab_op(xc_interface *xch, int cmd, void * op, int op_size, int count)
 return ret;
 }
 
+int xc_gnttab_query_size(xc_interface *xch, struct gnttab_query_size *query)
+{
+int rc;
+
+rc = xc_gnttab_op(xch, GNTTABOP_query_size, query, sizeof(*query), 1);
+
+if ( rc || (query->status != GNTST_okay) )
+ERROR("Could not query dom %u's grant size\n", query->dom);
+
+return rc;
+}
+
 int xc_gnttab_get_version(xc_interface *xch, int domid)
 {
 struct gnttab_get_version query;
-- 
2.7.4




[Xen-devel] [PATCH v2 2/2] tools: utility to dump guest grant table info

2017-07-02 Thread Dongli Zhang
As both xen-netfront and xen-blkfront support multi-queue, they would
consume a lot of grant table references when many paravirtual devices and
vcpus are assigned to the guest. The guest domU might panic or hang due to
grant allocation failure when nr_grant_frames in the guest has reached its
max value.

This utility helps administrators diagnose xen issues. There is only one
command, gnttab_query_size, so far; it monitors guest grant table frame
usage from the dom0 side, so that debugging on the guest kernel side is no
longer required for crash/hang analysis.

It is extensible: new commands can be added for more diagnostic functions.
The framework of xen-diag.c is borrowed from xen-livepatch.c.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
Changed since v1:
  * rewrite xen-gnttab-query.c to xen-diag.c based on livepatch.c framework

---
 tools/misc/Makefile   |   4 ++
 tools/misc/xen-diag.c | 129 ++
 2 files changed, 133 insertions(+)
 create mode 100644 tools/misc/xen-diag.c

diff --git a/tools/misc/Makefile b/tools/misc/Makefile
index 8152f7b..c1113b9 100644
--- a/tools/misc/Makefile
+++ b/tools/misc/Makefile
@@ -31,6 +31,7 @@ INSTALL_SBIN   += xenperf
 INSTALL_SBIN   += xenpm
 INSTALL_SBIN   += xenwatchdogd
 INSTALL_SBIN   += xen-livepatch
+INSTALL_SBIN   += xen-diag
 INSTALL_SBIN += $(INSTALL_SBIN-y)
 
 # Everything to be installed in a private bin/
@@ -102,6 +103,9 @@ xenwatchdogd: xenwatchdogd.o
 xen-livepatch: xen-livepatch.o
$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS)
 
+xen-diag: xen-diag.o
+   $(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS)
+
 xen-lowmemd: xen-lowmemd.o
	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenevtchn) $(LDLIBS_libxenctrl) $(LDLIBS_libxenstore) $(APPEND_LDFLAGS)
 
diff --git a/tools/misc/xen-diag.c b/tools/misc/xen-diag.c
new file mode 100644
index 000..1da50e1
--- /dev/null
+++ b/tools/misc/xen-diag.c
@@ -0,0 +1,129 @@
+/*
+ * Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+static xc_interface *xch;
+
+#define ARRAY_SIZE(a) (sizeof (a) / sizeof ((a)[0]))
+
+void show_help(void)
+{
+fprintf(stderr,
+"xen-diag: xen diagnostic utility\n"
+"Usage: xen-diag command [args]\n"
+"Commands:\n"
+"  help   display this help\n"
+"  gnttab_query_size   dump the current and max grant 
frames for \n");
+}
+
+/* wrapper function */
+static int help_func(int argc, char *argv[])
+{
+show_help();
+return 0;
+}
+
+static int gnttab_query_size_func(int argc, char *argv[])
+{
+int domid, rc = 1;
+struct gnttab_query_size query;
+
+if ( argc != 1 )
+{
+show_help();
+return rc;
+}
+
+domid = strtol(argv[0], NULL, 10);
+query.dom = domid;
+rc = xc_gnttab_query_size(xch, &query);
+
+if ( rc == 0 && (query.status == GNTST_okay) )
+printf("domid=%d: nr_frames=%d, max_nr_frames=%d\n",
+   query.dom, query.nr_frames, query.max_nr_frames);
+
+return rc == 0 && (query.status == GNTST_okay) ? 0 : 1;
+}
+
+struct {
+const char *name;
+int (*function)(int argc, char *argv[]);
+} main_options[] = {
+{ "help", help_func },
+{ "gnttab_query_size", gnttab_query_size_func},
+};
+
+int main(int argc, char *argv[])
+{
+int ret, i;
+
+/*
+ * Set stdout to be unbuffered to avoid having to fflush when
+ * printing without a newline.
+ */
+setvbuf(stdout, NULL, _IONBF, 0);
+
+if ( argc <= 1 )
+{
+show_help();
+return 0;
+}
+
+for ( i = 0; i < ARRAY_SIZE(main_options); i++ )
+if ( !strncmp(main_options[i].name, argv[1], strlen(argv[1])) )
+break;
+
+if ( i == ARRAY_SIZE(main_options) )
+{
+show_help();
+return 0;
+}
+else
+{
+xch = xc_interface_open(0, 0, 0);
+if ( !xch )
+{
+fprintf(stderr, "failed to get the handler\n");
+return 0;
+}
+
+ret = main_options[i].function(argc - 2, argv + 2);
+
+xc_interface_close(xch);
+}
+
+/*
+ * Exitcode 0 for success.
+ * Exitcode 1 for an error.
+ * Exitcode 2 if the operation should be retried for any reason (e.g. a
+ * timeout or because another operation was in progress).
+ */
+
+#define EXIT_TIMEOUT (EXIT_FAILURE + 1)
+
+BUILD_BUG_ON(EXIT_SUCCESS != 0);
+BUILD_BUG_ON(EXIT_FAILURE != 1);
+BUILD_BUG_ON(EXIT_TIMEOUT != 2);
+
+switch ( ret )
+{
+case 0:
+return EXIT_SUCCESS;
+

[Xen-devel] [PATCH v2 1/1] xen/blkfront: always allocate grants first from per-queue persistent grants

2017-06-28 Thread Dongli Zhang
This patch partially reverts 3df0e50 ("xen/blkfront: pseudo support for
multi hardware queues/rings"). The xen-blkfront queue/ring might hang due
to grant allocation failure in the situation where gnttab_free_head is
almost empty while many persistent grants are reserved for this queue/ring.

As persistent grant management has been per-queue since 73716df
("xen/blkfront: make persistent grants pool per-queue"), we should always
allocate from persistent grants first.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>

---
Changed since v1:
  * use "max_grefs - rinfo->persistent_gnts_c" as callback argument

---
 drivers/block/xen-blkfront.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 3945963..4544a1c 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -713,6 +713,7 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
 * existing persistent grants, or if we have to get new grants,
 * as there are not sufficiently many free.
 */
+   bool new_persistent_gnts = false;
struct scatterlist *sg;
int num_sg, max_grefs, num_grant;
 
@@ -724,19 +725,21 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
 */
max_grefs += INDIRECT_GREFS(max_grefs);
 
-   /*
-* We have to reserve 'max_grefs' grants because persistent
-* grants are shared by all rings.
-*/
-   if (max_grefs > 0)
-		if (gnttab_alloc_grant_references(max_grefs, &setup.gref_head) < 0) {
+	/* Check if we have enough persistent grants to allocate a request */
+	if (rinfo->persistent_gnts_c < max_grefs) {
+		new_persistent_gnts = true;
+
+		if (gnttab_alloc_grant_references(
+				max_grefs - rinfo->persistent_gnts_c,
+				&setup.gref_head) < 0) {
 			gnttab_request_free_callback(
 				&rinfo->callback,
 				blkif_restart_queue_callback,
 				rinfo,
-				max_grefs);
+				max_grefs - rinfo->persistent_gnts_c);
return 1;
}
+   }
 
/* Fill out a communications ring structure. */
 	id = blkif_ring_get_request(rinfo, req, &ring_req);
@@ -837,7 +840,7 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
if (unlikely(require_extra_req))
rinfo->shadow[extra_id].req = *extra_ring_req;
 
-   if (max_grefs > 0)
+   if (new_persistent_gnts)
gnttab_free_grant_references(setup.gref_head);
 
return 0;
-- 
2.7.4




[Xen-devel] [PATCH 1/1] xen/blkfront: always allocate grants first from per-queue persistent grants

2017-06-27 Thread Dongli Zhang
This patch partially reverts 3df0e50 ("xen/blkfront: pseudo support for
multi hardware queues/rings"). The xen-blkfront queue/ring might hang due
to grant allocation failure in the situation where gnttab_free_head is
almost empty while many persistent grants are reserved for this queue/ring.

As persistent grant management has been per-queue since 73716df
("xen/blkfront: make persistent grants pool per-queue"), we should always
allocate from persistent grants first.

Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
 drivers/block/xen-blkfront.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 3945963..d2b759f 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -713,6 +713,7 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
 * existing persistent grants, or if we have to get new grants,
 * as there are not sufficiently many free.
 */
+   bool new_persistent_gnts = false;
struct scatterlist *sg;
int num_sg, max_grefs, num_grant;
 
@@ -724,12 +725,13 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
 */
max_grefs += INDIRECT_GREFS(max_grefs);
 
-   /*
-* We have to reserve 'max_grefs' grants because persistent
-* grants are shared by all rings.
-*/
-   if (max_grefs > 0)
-		if (gnttab_alloc_grant_references(max_grefs, &setup.gref_head) < 0) {
+	/* Check if we have enough persistent grants to allocate a request */
+	if (rinfo->persistent_gnts_c < max_grefs) {
+		new_persistent_gnts = true;
+
+		if (gnttab_alloc_grant_references(
+				max_grefs - rinfo->persistent_gnts_c,
+				&setup.gref_head) < 0) {
 			gnttab_request_free_callback(
 				&rinfo->callback,
 				blkif_restart_queue_callback,
@@ -737,6 +739,7 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
max_grefs);
return 1;
}
+   }
 
/* Fill out a communications ring structure. */
 	id = blkif_ring_get_request(rinfo, req, &ring_req);
@@ -837,7 +840,7 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
if (unlikely(require_extra_req))
rinfo->shadow[extra_id].req = *extra_ring_req;
 
-   if (max_grefs > 0)
+   if (new_persistent_gnts)
gnttab_free_grant_references(setup.gref_head);
 
return 0;
-- 
2.7.4




Re: [Xen-devel] xen-blkfront hang

2017-07-30 Thread Dongli Zhang
CCed xen-devel so that more people would be able to help.

Dongli Zhang

On 07/31/2017 09:09 AM, Dongli Zhang wrote:
> Hi Valentin,
> 
> On 07/30/2017 03:42 PM, Valentin Vidic wrote:
>> I'm having a problem with a domU hang in disk IO, described here:
>>
>> https://lists.xen.org/archives/html/xen-users/2017-07/msg00057.html
>>
>> Do you think this is a multi-queue issue and applying one of these
>> latest changes would help?
>>
>> xen/blkfront: always allocate grants first from per-queue persistent grants
>> https://github.com/torvalds/linux/commit/bd912ef3e46b6edb51bb8af4b73fd2be7817e305
> 
> This patch is not able to fix the lack-of-grants issue permanently. It is
> used to optimize the utilization of grant table entries.
> 
> To verify whether the above patch would help, please check the nr_grant_frames
> value in guest domU. If this value is exactly the same as the maximum grant
> frames (by default, xen mainline uses 32) and the number of free grant
> references is very small, the above patch might help.
> 
> The best way is to increase gnttab_max_frames to a larger value (e.g., 256)
> in the dom0 xen.gz grub entry.
> 
> Dongli Zhang
> 
>>
>> xen-blkfront: fix mq start/stop race
>> https://github.com/torvalds/linux/commit/4b422cb99836de3d261faec20a0329385bdec43d
>>



Re: [Xen-devel] xen-blkfront hang

2017-07-31 Thread Dongli Zhang
Here are the options:

1. Dump a vmcore of the guest and print nr_grant_frames with the crash utility.

2. Implement a kernel module in the guest to dump nr_grant_frames, if you still
have access to your hung guest domU.

3. There is a new utility in the xen toolstack at tools/misc/xen-diag.c to dump
grant table usage for an arbitrary guest domU (including dom0):

./xen-diag gnttab_query_size [domid]

4. If your host's xen toolstack does not have xen-diag, feel free to implement
one yourself via the GNTTABOP_query_size hypercall and compile with -lxenctrl,
roughly as in the sketch below.
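
A minimal standalone sketch of option 4 (the file name and output format are
illustrative; it uses the generic xc_gnttab_op() wrapper from xenctrl.h and
can be built with "gcc xen-gnttab-query.c -o xen-gnttab-query -lxenctrl"):

#include <stdio.h>
#include <stdlib.h>
#include <xenctrl.h>

int main(int argc, char *argv[])
{
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    struct gnttab_query_size query = {
        .dom = (argc > 1) ? atoi(argv[1]) : 0,
    };

    if ( !xch )
        return 1;

    /* Issue one GNTTABOP_query_size operation against the target domain. */
    if ( xc_gnttab_op(xch, GNTTABOP_query_size, &query,
                      sizeof(query), 1) == 0 &&
         query.status == GNTST_okay )
        printf("domid=%d: nr_frames=%d, max_nr_frames=%d\n",
               query.dom, query.nr_frames, query.max_nr_frames);

    xc_interface_close(xch);
    return 0;
}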

Dongli Zhang


On 07/31/2017 02:30 PM, Valentin Vidic wrote:
> On Mon, Jul 31, 2017 at 09:09:19AM +0800, Dongli Zhang wrote:
>> This patch is not able to fix the lack-of-grants issue permanently. It is
>> used to optimize the utilization of grant table entries.
>>
>> To verify whether the above patch would help, please check the
>> nr_grant_frames value in guest domU. If this value is exactly the same as
>> the maximum grant frames (by default, xen mainline uses 32) and the number
>> of free grant references is very small, the above patch might help.
> 
> I can try that, but how do I get the nr_grant_frames value in domU?
> 



Re: [Xen-devel] high CPU stolen time after live migrate

2017-10-07 Thread Dongli Zhang
Hi Dario and Olivier,

I have encountered this issue in the past. While the fix mentioned in the
link is effective, I assume the fix was derived from upstream linux and it
will introduce a new error, as mentioned below.

While there is a kernel bug in the guest kernel, I think the root cause is
on the hypervisor side.

From my own test, the issue is reproducible even when migrating a VM locally
within the same dom0. From the test, once the guest VM is migrated, the
RUNSTATE_offline time looks normal, while the RUNSTATE_runnable time moves
backward and decreases. Therefore, the value returned by
paravirt_steal_clock() (actually xen_steal_clock()), which is the sum of
RUNSTATE_offline and RUNSTATE_runnable, decreases as well. However, a kernel
such as 4.8 cannot handle this special situation correctly, as the code in
cputime.c is not written specifically for the xen hypervisor.

For kernel like v4.8-rc8, would something as below would be better?

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a846cf8..3546e21 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -274,11 +274,17 @@ static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 	if (static_key_false(&paravirt_steal_enabled)) {
cputime_t steal_cputime;
u64 steal;
+   s64 steal_diff;
 
steal = paravirt_steal_clock(smp_processor_id());
-   steal -= this_rq()->prev_steal_time;
+   steal_diff = steal - this_rq()->prev_steal_time;
 
-   steal_cputime = min(nsecs_to_cputime(steal), maxtime);
+   if (steal_diff < 0) {
+   this_rq()->prev_steal_time = steal;
+   return 0;
+   }
+
+   steal_cputime = min(nsecs_to_cputime(steal_diff), maxtime);
account_steal_time(steal_cputime);
this_rq()->prev_steal_time += cputime_to_nsecs(steal_cputime);


This issue does not seem to be totally fixed by the most up-to-date upstream
linux (I have tested with 4.12.0-rc7). The issue in 4.12.0-rc7 is different.
After live migration, although the steal clock counter does not overflow
(become a very large unsigned number), the steal clock counter in /proc/stat
moves backward and decreases (e.g., from 329 to 311).

test@vm:~$ cat /proc/stat 
cpu  248 0 240 31197 893 0 1 329 0 0 
cpu0 248 0 240 31197 893 0 1 329 0 0 
intr 39051 16307 0 0 0 0 0 990 127 592 1004 1360 40 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 
ctxt 59400 
btime 1506731352 
processes 1877 
procs_running 1 
procs_blocked 0 
softirq 38903 0 15524 1227 6904 0 0 6 0 0 15242 

After live migration, the steal counter in an ubuntu guest running
4.12.0-rc7 decreased to 311.

test@vm:~$ cat /proc/stat 
cpu  251 0 242 31245 893 0 1 311 0 0 
cpu0 251 0 242 31245 893 0 1 311 0 0 
intr 39734 16404 0 0 0 0 0 1440 128 0 8 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 
ctxt 60880 
btime 1506731352 
processes 1882 
procs_running 3 
procs_blocked 0 
softirq 39195 0 15618 1286 6958 0 0 7 0 0 15326

I assume this is not expected behavior. A different patch (similar to the
one mentioned above) for upstream linux would fix this issue.

-

Whatever fix is applied on the guest kernel side, I think the root cause is
that the xen hypervisor returns a RUNSTATE_runnable time less than the one
from before live migration.

As I am not familiar enough with xen scheduling, I do not understand why the
RUNSTATE_runnable cputime decreases after live migration.

Dongli Zhang 



- Original Message -
From: dario.faggi...@citrix.com
To: xen.l...@daevel.fr, xen-us...@lists.xensource.com
Cc: xen-devel@lists.xen.org
Sent: Tuesday, October 3, 2017 5:24:49 PM GMT +08:00 Beijing / Chongqing / Hong 
Kong / Urumqi
Subject: Re: [Xen-devel] high CPU stolen time after live migrate

On Mon, 2017-10-02 at 18:37 +0200, Olivier Bonvalet wrote:
> root! laussor:/proc# cat /proc/uptime 
> 652005.23 2631328.82
> 
> 
> Values for "stolen time" in /proc/stat seems impossible with only 7
> days of uptime.
> 
I think it can be this:
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-parav
irtualized-xen-guest/

What's the version of your guest kernel?

Dario
-- 
<> (Raistlin Majere)
-
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R Ltd., Cambridge (UK)


[Xen-devel] [PATCH 1/1] sched/cputime: do not decrease steal time after live migration on xen

2017-10-10 Thread Dongli Zhang
After guest live migration on xen, steal time in /proc/stat
(cpustat[CPUTIME_STEAL]) might decrease because steal returned by
paravirt_steal_clock() might be less than this_rq()->prev_steal_time.

For instance, steal time of each vcpu is 335 before live migration.

cpu  198 0 368 200064 1962 0 0 1340 0 0
cpu0 38 0 81 50063 492 0 0 335 0 0
cpu1 65 0 97 49763 634 0 0 335 0 0
cpu2 38 0 81 50098 462 0 0 335 0 0
cpu3 56 0 107 50138 374 0 0 335 0 0

After live migration, steal time is reduced to 312.

cpu  200 0 370 200330 1971 0 0 1248 0 0
cpu0 38 0 82 50123 500 0 0 312 0 0
cpu1 65 0 97 49832 634 0 0 312 0 0
cpu2 39 0 82 50167 462 0 0 312 0 0
cpu3 56 0 107 50207 374 0 0 312 0 0

The code in this patch is borrowed from do_stolen_accounting() which has
already been removed from linux source code since commit ecb23dc6 ("xen:
add steal_clock support on x86").

A similar and more severe issue impacts linux 4.8-4.10, as discussed by
Michael Las at
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest.
Unlike the issue discussed by Michael Las, which would overflow steal time
and lead to 100% st usage in the top command for linux 4.8-4.10, the issue
for linux 4.11+ would only decrease but not overflow steal time after live
migration.

References: 
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
 kernel/sched/cputime.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 14d2dbf..57d09cab 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -238,10 +238,17 @@ static __always_inline u64 steal_account_process_time(u64 maxtime)
 {
 #ifdef CONFIG_PARAVIRT
 	if (static_key_false(&paravirt_steal_enabled)) {
-   u64 steal;
+   u64 steal, steal_time;
+   s64 steal_delta;
+
+   steal_time = paravirt_steal_clock(smp_processor_id());
+   steal = steal_delta = steal_time - this_rq()->prev_steal_time;
+
+   if (unlikely(steal_delta < 0)) {
+   this_rq()->prev_steal_time = steal_time;
+   return 0;
+   }
 
-   steal = paravirt_steal_clock(smp_processor_id());
-   steal -= this_rq()->prev_steal_time;
steal = min(steal, maxtime);
account_steal_time(steal);
this_rq()->prev_steal_time += steal;
-- 
2.7.4




Re: [Xen-devel] [PATCH 1/1] sched/cputime: do not decrease steal time after live migration on xen

2017-10-11 Thread Dongli Zhang
Hi Stanislaw and Peter,

On 10/10/2017 08:42 PM, Stanislaw Gruszka wrote:
> On Tue, Oct 10, 2017 at 12:59:26PM +0200, Ingo Molnar wrote:
>>
>> (Cc:-ed more gents involved in kernel/sched/cputime.c work. Full patch 
>> quoted 
>> below.)
>>
>> * Dongli Zhang <dongli.zh...@oracle.com> wrote:
>>
>>> After guest live migration on xen, steal time in /proc/stat
>>> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
>>> paravirt_steal_clock() might be less than this_rq()->prev_steal_time.
>>>
>>> For instance, steal time of each vcpu is 335 before live migration.
>>>
>>> cpu  198 0 368 200064 1962 0 0 1340 0 0
>>> cpu0 38 0 81 50063 492 0 0 335 0 0
>>> cpu1 65 0 97 49763 634 0 0 335 0 0
>>> cpu2 38 0 81 50098 462 0 0 335 0 0
>>> cpu3 56 0 107 50138 374 0 0 335 0 0
>>>
>>> After live migration, steal time is reduced to 312.
>>>
>>> cpu  200 0 370 200330 1971 0 0 1248 0 0
>>> cpu0 38 0 82 50123 500 0 0 312 0 0
>>> cpu1 65 0 97 49832 634 0 0 312 0 0
>>> cpu2 39 0 82 50167 462 0 0 312 0 0
>>> cpu3 56 0 107 50207 374 0 0 312 0 0
>>>
>>> The code in this patch is borrowed from do_stolen_accounting() which has
>>> already been removed from linux source code since commit ecb23dc6 ("xen:
>>> add steal_clock support on x86").
>>>
>>> Similar and more severe issue would impact prior linux 4.8-4.10 as
>>> discussed by Michael Las at
>>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest.
>>> Unlike the issue discussed by Michael Las which would overflow steal time
>>> and lead to 100% st usage in top command for linux 4.8-4.10, the issue for
>>> linux 4.11+ would only decrease but not overflow steal time after live
>>> migration.
>>>
>>> References: 
>>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
>>> Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
>>> ---
>>>  kernel/sched/cputime.c | 13 ++---
>>>  1 file changed, 10 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
>>> index 14d2dbf..57d09cab 100644
>>> --- a/kernel/sched/cputime.c
>>> +++ b/kernel/sched/cputime.c
>>> @@ -238,10 +238,17 @@ static __always_inline u64 steal_account_process_time(u64 maxtime)
>>>  {
>>>  #ifdef CONFIG_PARAVIRT
>>> 	if (static_key_false(&paravirt_steal_enabled)) {
>>> -   u64 steal;
>>> +   u64 steal, steal_time;
>>> +   s64 steal_delta;
>>> +
>>> +   steal_time = paravirt_steal_clock(smp_processor_id());
>>> +   steal = steal_delta = steal_time - this_rq()->prev_steal_time;
>>> +
>>> +   if (unlikely(steal_delta < 0)) {
>>> +   this_rq()->prev_steal_time = steal_time;
> 
> I don't think setting prev_steal_time to smaller value is right
> thing to do.

If we do not set prev_steal_time to the smaller steal value (obtained from
paravirt_steal_clock()), it will take a while for the new steal value to
catch up with this_rq()->prev_steal_time, and cpustat[CPUTIME_STEAL] will
stay unchanged until steal exceeds this_rq()->prev_steal_time again (e.g.,
with prev_steal_time at 335 and a post-migration steal of 312, the reported
steal time would appear frozen until the new value grows past 335). Do you
think that is fine?

If it is fine, I will try to limit the fix to xen-specific code in
drivers/xen/time.c so that we do not touch kernel/sched/cputime.c, as Peter
has asked why we don't just fix up paravirt_steal_time() on migration.

Thank you very much!

Dongli Zhang

> 
> Beside, I don't think we need to check for overflow condition for
> cputime variables (it will happen after 279 years :-). So instead
> of introducing signed steal_delta variable I would just add
> below check, which should be sufficient to fix the problem:
> 
>   if (unlikely(steal <= this_rq()->prev_steal_time))
>   return 0;
> 
> Thanks
> Stanislaw
> 



Re: [Xen-devel] [PATCH 1/1] sched/cputime: do not decrease steal time after live migration on xen

2017-10-11 Thread Dongli Zhang
Hi Rik,

On 10/10/2017 10:01 PM, Rik van Riel wrote:
> On Tue, 2017-10-10 at 14:48 +0200, Peter Zijlstra wrote:
>> On Tue, Oct 10, 2017 at 02:42:01PM +0200, Stanislaw Gruszka wrote:
>>>>> +	u64 steal, steal_time;
>>>>> +	s64 steal_delta;
>>>>> +
>>>>> +	steal_time = paravirt_steal_clock(smp_processor_id());
>>>>> +	steal = steal_delta = steal_time - this_rq()->prev_steal_time;
>>>>> +
>>>>> +	if (unlikely(steal_delta < 0)) {
>>>>> +		this_rq()->prev_steal_time = steal_time;
>>>
>>> I don't think setting prev_steal_time to smaller value is right
>>> thing to do. 
>>>
>>> Beside, I don't think we need to check for overflow condition for
>>> cputime variables (it will happen after 279 years :-). So instead
>>> of introducing signed steal_delta variable I would just add
>>> below check, which should be sufficient to fix the problem:
>>>
>>> if (unlikely(steal <= this_rq()->prev_steal_time))
>>> return 0;
>>
>> How about you just fix up paravirt_steal_time() on migration and not
>> muck with the users ?
> 
> Not just migration, either. CPU hotplug is another time to fix up
> the steal time.

I think this issue might be hit when we add and online a vcpu a very long
time after boot (or after the vcpu was last offlined). Please correct me if
I am wrong.

Thank you very much!

Dongli Zhang

> 



[Xen-devel] [PATCH 1/1] xen/time: do not decrease steal time after live migration on xen

2017-10-19 Thread Dongli Zhang
After guest live migration on xen, steal time in /proc/stat
(cpustat[CPUTIME_STEAL]) might decrease because steal returned by
xen_steal_clock() might be less than this_rq()->prev_steal_time, which is
derived from the previous return value of xen_steal_clock().

For instance, steal time of each vcpu is 335 before live migration.

cpu  198 0 368 200064 1962 0 0 1340 0 0
cpu0 38 0 81 50063 492 0 0 335 0 0
cpu1 65 0 97 49763 634 0 0 335 0 0
cpu2 38 0 81 50098 462 0 0 335 0 0
cpu3 56 0 107 50138 374 0 0 335 0 0

After live migration, steal time is reduced to 312.

cpu  200 0 370 200330 1971 0 0 1248 0 0
cpu0 38 0 82 50123 500 0 0 312 0 0
cpu1 65 0 97 49832 634 0 0 312 0 0
cpu2 39 0 82 50167 462 0 0 312 0 0
cpu3 56 0 107 50207 374 0 0 312 0 0

The code in this patch is borrowed from do_stolen_accounting(), which was
removed from the linux source code by commit ecb23dc6f2ef
("xen: add steal_clock support on x86"). The core idea of both
do_stolen_accounting() and this patch is to avoid accounting the new steal
clock if it is smaller than the previous one.

A similar and more severe issue impacts Linux 4.8-4.10, as discussed by
Michael Las at
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest:
there the steal time can overflow and lead to 100% st usage in the top
command. A backport of this patch would fix that issue as well.
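A minimal userspace demo of the u64 underflow behind that overflow, with
illustrative values only (this is not kernel code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* steal time in ns accumulated before live migration */
	uint64_t prev_steal_time = 335000000000ULL;
	/* runstate times were cleared by the hypervisor, so smaller */
	uint64_t steal = 312000000000ULL;

	/* unsigned subtraction wraps to a value near 2^64, which the
	 * 4.8-4.10 code then accounted as steal time */
	uint64_t delta = steal - prev_steal_time;

	printf("delta = %llu ns\n", (unsigned long long)delta);
	return 0;
}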

References: 
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
---
 drivers/xen/time.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/time.c b/drivers/xen/time.c
index ac5f23f..2b3a996 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -19,6 +19,8 @@
 /* runstate info updated by Xen */
 static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
 
+static DEFINE_PER_CPU(u64, xen_old_steal);
+
 /* return an consistent snapshot of 64-bit time/counter value */
 static u64 get64(const u64 *p)
 {
@@ -83,9 +85,20 @@ bool xen_vcpu_stolen(int vcpu)
 u64 xen_steal_clock(int cpu)
 {
struct vcpu_runstate_info state;
+   u64 xen_new_steal;
+   s64 steal_delta;
 
xen_get_runstate_snapshot_cpu(, cpu);
-   return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
+   xen_new_steal = state.time[RUNSTATE_runnable]
+   + state.time[RUNSTATE_offline];
+   steal_delta = xen_new_steal - per_cpu(xen_old_steal, cpu);
+
+   if (steal_delta < 0)
+   xen_new_steal = per_cpu(xen_old_steal, cpu);
+   else
+   per_cpu(xen_old_steal, cpu) = xen_new_steal;
+
+   return xen_new_steal;
 }
 
 void xen_setup_runstate_info(int cpu)
-- 
2.7.4




Re: [Xen-devel] [PATCH v6 1/1] xen/time: do not decrease steal time after live migration on xen

2017-11-01 Thread Dongli Zhang
Hi Boris,

I have received a report from l...@intel.com that the prior version of the
patch hits an issue during compilation with aarch64-linux-gnu-gcc. I think
this patch, which you reviewed, would hit the same compile issue on arm64
(there is no issue with x86_64).

-

1st issue:

Without including the header <linux/slab.h> in drivers/xen/time.c,
compilation on x86_64 works well (without any warning or error) but arm64
would hit the following error:

drivers/xen/time.c: In function ‘xen_manage_runstate_time’:
drivers/xen/time.c:94:20: error: implicit declaration of function
‘kmalloc_array’ [-Werror=implicit-function-declaration]
   runstate_delta = kmalloc_array(num_possible_cpus(),
^

drivers/xen/time.c:131:3: error: implicit declaration of function ‘kfree’
[-Werror=implicit-function-declaration]
   kfree(runstate_delta);
   ^
cc1: some warnings being treated as errors

About the 1st issue, should I submit a new patch that includes
<linux/slab.h>, or just an incremental patch based on the previous patch
merged into your own branch/tree?

-

2nd issue:

aarch64-linux-gnu-gcc expects a cast for kmalloc_array(). Is this really
necessary, given that I did find people casting the return type of
kmalloc/kcalloc/kmalloc_array in the linux source code (e.g.,
drivers/block/virtio_blk.c)? Can we just ignore this warning?

drivers/xen/time.c:94:18: warning: assignment makes pointer from integer without
a cast [-Wint-conversion]
   runstate_delta = kmalloc_array(num_possible_cpus(),
  ^
-
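For what it is worth, I would expect the minimal fix for both issues to be
a single include near the top of drivers/xen/time.c, which declares
kmalloc_array() and kfree() (a sketch; the exact placement among the
existing includes is an assumption):

#include <linux/slab.h>	/* declares kmalloc_array() and kfree() */

With the declaration visible, the implicit-declaration error goes away, and
so should the -Wint-conversion warning, since the compiler no longer
assumes an implicit int return value.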


Thank you very much!

Dongli Zhang


On 11/02/2017 03:19 AM, Boris Ostrovsky wrote:
> On 10/31/2017 09:46 PM, Dongli Zhang wrote:
>> After guest live migration on xen, steal time in /proc/stat
>> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
>> xen_steal_clock() might be less than this_rq()->prev_steal_time which is
>> derived from previous return value of xen_steal_clock().
>>
>> For instance, steal time of each vcpu is 335 before live migration.
>>
>> cpu  198 0 368 200064 1962 0 0 1340 0 0
>> cpu0 38 0 81 50063 492 0 0 335 0 0
>> cpu1 65 0 97 49763 634 0 0 335 0 0
>> cpu2 38 0 81 50098 462 0 0 335 0 0
>> cpu3 56 0 107 50138 374 0 0 335 0 0
>>
>> After live migration, steal time is reduced to 312.
>>
>> cpu  200 0 370 200330 1971 0 0 1248 0 0
>> cpu0 38 0 82 50123 500 0 0 312 0 0
>> cpu1 65 0 97 49832 634 0 0 312 0 0
>> cpu2 39 0 82 50167 462 0 0 312 0 0
>> cpu3 56 0 107 50207 374 0 0 312 0 0
>>
>> Since runstate times are cumulative and cleared during xen live migration
>> by xen hypervisor, the idea of this patch is to accumulate runstate times
>> to global percpu variables before live migration suspend. Once guest VM is
>> resumed, xen_get_runstate_snapshot_cpu() would always return the sum of new
>> runstate times and previously accumulated times stored in global percpu
>> variables. Comments before the call of HYPERVISOR_suspend() has been
>> removed as it is inaccurate. The call can return an error code (e.g.,
>> possibly -EPERM in the future).
> 
> I'd like to split the comment-removal bit into a separate paragraph. I can do
> this when committing if you don't mind.
> 
>>
>> Similar and more severe issue would impact prior linux 4.8-4.10 as
>> discussed by Michael Las at
>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
>> which would overflow steal time and lead to 100% st usage in top command
>> for linux 4.8-4.10. A backport of this patch would fix that issue.
>>
>> References: 
>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
>> Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
>>
>> ---
> 
> Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
> 
> 



Re: [Xen-devel] [PATCH v6 1/1] xen/time: do not decrease steal time after live migration on xen

2017-11-02 Thread Dongli Zhang
Hi Boris,

On 11/03/2017 04:28 AM, Boris Ostrovsky wrote:
> On 11/01/2017 09:19 PM, Dongli Zhang wrote:
>> Hi Boris,
>>
>> I have received a report from l...@intel.com that the prior version of the
>> patch hits an issue during compilation with aarch64-linux-gnu-gcc. I think
>> this patch, which you reviewed, would hit the same compile issue on arm64
>> (there is no issue with x86_64).
>>
>> -
>>
>> 1st issue:
>>
>> Without including the header <linux/slab.h> in drivers/xen/time.c, compilation on
>> x86_64 works well (without any warning or error) but arm64 would hit the
>> following error:
>>
>> drivers/xen/time.c: In function ‘xen_manage_runstate_time’:
>> drivers/xen/time.c:94:20: error: implicit declaration of function
>> ‘kmalloc_array’ [-Werror=implicit-function-declaration]
>>runstate_delta = kmalloc_array(num_possible_cpus(),
>> ^
>>
>> drivers/xen/time.c:131:3: error: implicit declaration of function ‘kfree’
>> [-Werror=implicit-function-declaration]
>>kfree(runstate_delta);
>>^
>> cc1: some warnings being treated as errors
>>
>> About the 1st issue, should I submit a new patch that includes
>> <linux/slab.h>, or just an incremental patch based on the previous patch
>> merged into your own branch/tree?
>>
>> -
>>
>> 2nd issue:
>>
>> aarch64-linux-gnu-gcc expects a cast for kmalloc_array(). Is this really
>> necessary, given that I did find people casting the return type of
>> kmalloc/kcalloc/kmalloc_array in the linux source code (e.g.,
>> drivers/block/virtio_blk.c)? Can we just ignore this warning?
>>
>> drivers/xen/time.c:94:18: warning: assignment makes pointer from integer 
>> without
>> a cast [-Wint-conversion]
>>runstate_delta = kmalloc_array(num_possible_cpus(),
>>   ^
>> -
> 
> That's because you need to declare kmalloc_array(), otherwise the
> compiler by default assumes that it returns an int. So including
> linux/slab.h should take care of both warnings.
> 
> I can add it while committing.

Please help add it while committing. Thank you very much for your help!

> 
> 
> -boris

Dongli Zhang



[Xen-devel] [PATCH v4 1/1] xen/time: do not decrease steal time after live migration on xen

2017-10-29 Thread Dongli Zhang
After guest live migration on xen, steal time in /proc/stat
(cpustat[CPUTIME_STEAL]) might decrease because steal returned by
xen_steal_clock() might be less than this_rq()->prev_steal_time which is
derived from previous return value of xen_steal_clock().

For instance, steal time of each vcpu is 335 before live migration.

cpu  198 0 368 200064 1962 0 0 1340 0 0
cpu0 38 0 81 50063 492 0 0 335 0 0
cpu1 65 0 97 49763 634 0 0 335 0 0
cpu2 38 0 81 50098 462 0 0 335 0 0
cpu3 56 0 107 50138 374 0 0 335 0 0

After live migration, steal time is reduced to 312.

cpu  200 0 370 200330 1971 0 0 1248 0 0
cpu0 38 0 82 50123 500 0 0 312 0 0
cpu1 65 0 97 49832 634 0 0 312 0 0
cpu2 39 0 82 50167 462 0 0 312 0 0
cpu3 56 0 107 50207 374 0 0 312 0 0

Since runstate times are cumulative and are cleared by the xen hypervisor
during live migration, the idea of this patch is to accumulate the runstate
times into global percpu variables before the live migration suspend. Once
the guest VM is resumed, xen_get_runstate_snapshot_cpu() will always return
the sum of the new runstate times and the previously accumulated times
stored in the global percpu variables.

A similar and more severe issue impacts Linux 4.8-4.10, as discussed by
Michael Las at
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest:
there the steal time can overflow and lead to 100% st usage in the top
command. A backport of this patch would fix that issue as well.

References: 
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>

---
Changed since v1:
  * relocate modification to xen_get_runstate_snapshot_cpu

Changed since v2:
  * accumulate runstate times before live migration

Changed since v3:
  * do not accumulate times in the case of guest checkpointing

---
 drivers/xen/manage.c |  2 ++
 drivers/xen/time.c   | 83 ++--
 include/xen/interface/vcpu.h |  2 ++
 include/xen/xen-ops.h|  1 +
 4 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index c425d03..3dc085d 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -72,6 +72,7 @@ static int xen_suspend(void *data)
}
 
gnttab_suspend();
+   xen_accumulate_runstate_time(-1);
xen_arch_pre_suspend();
 
/*
@@ -84,6 +85,7 @@ static int xen_suspend(void *data)
: 0);
 
xen_arch_post_suspend(si->cancelled);
+   xen_accumulate_runstate_time(si->cancelled);
gnttab_resume();
 
if (!si->cancelled) {
diff --git a/drivers/xen/time.c b/drivers/xen/time.c
index ac5f23f..18e2b76 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -19,6 +19,9 @@
 /* runstate info updated by Xen */
 static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
 
+static DEFINE_PER_CPU(u64[RUNSTATE_max], old_runstate_time);
+static u64 **runstate_time_delta;
+
 /* return an consistent snapshot of 64-bit time/counter value */
 static u64 get64(const u64 *p)
 {
@@ -47,8 +50,8 @@ static u64 get64(const u64 *p)
return ret;
 }
 
-static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
- unsigned int cpu)
+static void xen_get_runstate_snapshot_cpu_delta(
+   struct vcpu_runstate_info *res, unsigned int cpu)
 {
u64 state_time;
struct vcpu_runstate_info *state;
@@ -66,6 +69,82 @@ static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
 (state_time & XEN_RUNSTATE_UPDATE));
 }
 
+static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
+ unsigned int cpu)
+{
+   int i;
+
+   xen_get_runstate_snapshot_cpu_delta(res, cpu);
+
+   for (i = 0; i < RUNSTATE_max; i++)
+   res->time[i] += per_cpu(old_runstate_time, cpu)[i];
+}
+
+void xen_accumulate_runstate_time(int action)
+{
+   struct vcpu_runstate_info state;
+   int cpu, i;
+
+   switch (action) {
+   case -1: /* backup runstate time before suspend */
+   WARN_ON_ONCE(unlikely(runstate_time_delta));
+
+   runstate_time_delta = kcalloc(num_possible_cpus(),
+ sizeof(*runstate_time_delta),
+ GFP_KERNEL);
+   if (unlikely(!runstate_time_delta)) {
+   pr_alert("%s: failed to allocate runstate_time_delta\n",
+   __func__);
+   return;
+   }
+
+   for_each_possible_cpu(cpu) {
+   runstate_time_delta[cpu] = kmalloc_array(RUNSTATE_max,
+ sizeof(**runstate_time_delta),
+ 

[Xen-devel] [PATCH v5 1/1] xen/time: do not decrease steal time after live migration on xen

2017-10-30 Thread Dongli Zhang
After guest live migration on xen, steal time in /proc/stat
(cpustat[CPUTIME_STEAL]) might decrease because steal returned by
xen_steal_clock() might be less than this_rq()->prev_steal_time which is
derived from previous return value of xen_steal_clock().

For instance, steal time of each vcpu is 335 before live migration.

cpu  198 0 368 200064 1962 0 0 1340 0 0
cpu0 38 0 81 50063 492 0 0 335 0 0
cpu1 65 0 97 49763 634 0 0 335 0 0
cpu2 38 0 81 50098 462 0 0 335 0 0
cpu3 56 0 107 50138 374 0 0 335 0 0

After live migration, steal time is reduced to 312.

cpu  200 0 370 200330 1971 0 0 1248 0 0
cpu0 38 0 82 50123 500 0 0 312 0 0
cpu1 65 0 97 49832 634 0 0 312 0 0
cpu2 39 0 82 50167 462 0 0 312 0 0
cpu3 56 0 107 50207 374 0 0 312 0 0

Since runstate times are cumulative and are cleared by the xen hypervisor
during live migration, the idea of this patch is to accumulate the runstate
times into global percpu variables before the live migration suspend. Once
the guest VM is resumed, xen_get_runstate_snapshot_cpu() will always return
the sum of the new runstate times and the previously accumulated times
stored in the global percpu variables.

A similar and more severe issue impacts Linux 4.8-4.10, as discussed by
Michael Las at
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest:
there the steal time can overflow and lead to 100% st usage in the top
command. A backport of this patch would fix that issue as well.

References: 
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>

---
Changed since v1:
  * relocate modification to xen_get_runstate_snapshot_cpu

Changed since v2:
  * accumulate runstate times before live migration

Changed since v3:
  * do not accumulate times in the case of guest checkpointing

Changed since v4:
  * allocate array of vcpu_runstate_info to reduce number of memory allocation

---
 drivers/xen/manage.c |  2 ++
 drivers/xen/time.c   | 68 ++--
 include/xen/interface/vcpu.h |  2 ++
 include/xen/xen-ops.h|  1 +
 4 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index c425d03..3dc085d 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -72,6 +72,7 @@ static int xen_suspend(void *data)
}
 
gnttab_suspend();
+   xen_accumulate_runstate_time(-1);
xen_arch_pre_suspend();
 
/*
@@ -84,6 +85,7 @@ static int xen_suspend(void *data)
: 0);
 
xen_arch_post_suspend(si->cancelled);
+   xen_accumulate_runstate_time(si->cancelled);
gnttab_resume();
 
if (!si->cancelled) {
diff --git a/drivers/xen/time.c b/drivers/xen/time.c
index ac5f23f..cf3afb9 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -19,6 +19,9 @@
 /* runstate info updated by Xen */
 static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
 
+static DEFINE_PER_CPU(u64[RUNSTATE_max], old_runstate_time);
+static struct vcpu_runstate_info *runstate_delta;
+
 /* return an consistent snapshot of 64-bit time/counter value */
 static u64 get64(const u64 *p)
 {
@@ -47,8 +50,8 @@ static u64 get64(const u64 *p)
return ret;
 }
 
-static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
- unsigned int cpu)
+static void xen_get_runstate_snapshot_cpu_delta(
+   struct vcpu_runstate_info *res, unsigned int cpu)
 {
u64 state_time;
struct vcpu_runstate_info *state;
@@ -66,6 +69,67 @@ static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
 (state_time & XEN_RUNSTATE_UPDATE));
 }
 
+static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
+ unsigned int cpu)
+{
+   int i;
+
+   xen_get_runstate_snapshot_cpu_delta(res, cpu);
+
+   for (i = 0; i < RUNSTATE_max; i++)
+   res->time[i] += per_cpu(old_runstate_time, cpu)[i];
+}
+
+void xen_accumulate_runstate_time(int action)
+{
+   struct vcpu_runstate_info state;
+   int cpu, i;
+
+   switch (action) {
+   case -1: /* backup runstate time before suspend */
+   WARN_ON_ONCE(unlikely(runstate_delta));
+
+   runstate_delta = kcalloc(num_possible_cpus(),
+sizeof(*runstate_delta),
+GFP_KERNEL);
+   if (unlikely(!runstate_delta)) {
+   pr_alert("%s: failed to allocate runstate_delta\n",
+   __func__);
+   return;
+   }
+
+   for_each_possible_cpu(cpu) {
+   xen_get_runstate_snapshot_cpu_delta(, cpu);
+   memcp

[Xen-devel] [PATCH v3 1/1] xen/time: do not decrease steal time after live migration on xen

2017-10-25 Thread Dongli Zhang
After guest live migration on xen, steal time in /proc/stat
(cpustat[CPUTIME_STEAL]) might decrease because steal returned by
xen_steal_clock() might be less than this_rq()->prev_steal_time which is
derived from previous return value of xen_steal_clock().

For instance, steal time of each vcpu is 335 before live migration.

cpu  198 0 368 200064 1962 0 0 1340 0 0
cpu0 38 0 81 50063 492 0 0 335 0 0
cpu1 65 0 97 49763 634 0 0 335 0 0
cpu2 38 0 81 50098 462 0 0 335 0 0
cpu3 56 0 107 50138 374 0 0 335 0 0

After live migration, steal time is reduced to 312.

cpu  200 0 370 200330 1971 0 0 1248 0 0
cpu0 38 0 82 50123 500 0 0 312 0 0
cpu1 65 0 97 49832 634 0 0 312 0 0
cpu2 39 0 82 50167 462 0 0 312 0 0
cpu3 56 0 107 50207 374 0 0 312 0 0

Since runstate times are cumulative and are cleared by the xen hypervisor
during live migration, the idea of this patch is to accumulate the runstate
times into global percpu variables before the live migration suspend. Once
the guest VM is resumed, xen_get_runstate_snapshot_cpu() will always return
the sum of the new runstate times and the previously accumulated times
stored in the global percpu variables.

A similar and more severe issue impacts Linux 4.8-4.10, as discussed by
Michael Las at
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest:
there the steal time can overflow and lead to 100% st usage in the top
command. A backport of this patch would fix that issue as well.

References: 
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>

---
Changed since v1:
  * relocate modification to xen_get_runstate_snapshot_cpu

Changed since v2:
  * accumulate runstate times before live migration

---
 drivers/xen/manage.c  |  1 +
 drivers/xen/time.c| 19 +++
 include/xen/xen-ops.h |  1 +
 3 files changed, 21 insertions(+)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index c425d03..9aa2955 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -72,6 +72,7 @@ static int xen_suspend(void *data)
}
 
gnttab_suspend();
+   xen_accumulate_runstate_time();
xen_arch_pre_suspend();
 
/*
diff --git a/drivers/xen/time.c b/drivers/xen/time.c
index ac5f23f..6df3f82 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -19,6 +19,8 @@
 /* runstate info updated by Xen */
 static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
 
+static DEFINE_PER_CPU(u64[4], old_runstate_time);
+
 /* return an consistent snapshot of 64-bit time/counter value */
 static u64 get64(const u64 *p)
 {
@@ -52,6 +54,7 @@ static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
 {
u64 state_time;
struct vcpu_runstate_info *state;
+   int i;
 
BUG_ON(preemptible());
 
@@ -64,6 +67,22 @@ static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
rmb();  /* Hypervisor might update data. */
} while (get64(>state_entry_time) != state_time ||
 (state_time & XEN_RUNSTATE_UPDATE));
+
+   for (i = 0; i < 4; i++)
+   res->time[i] += per_cpu(old_runstate_time, cpu)[i];
+}
+
+void xen_accumulate_runstate_time(void)
+{
+   struct vcpu_runstate_info state;
+   int cpu;
+
+   for_each_possible_cpu(cpu) {
+   xen_get_runstate_snapshot_cpu(, cpu);
+   memcpy(per_cpu(old_runstate_time, cpu),
+   state.time,
+   4 * sizeof(u64));
+   }
 }
 
 /*
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 218e6aa..5680059 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -32,6 +32,7 @@ void xen_resume_notifier_unregister(struct notifier_block 
*nb);
 bool xen_vcpu_stolen(int vcpu);
 void xen_setup_runstate_info(int cpu);
 void xen_time_setup_guest(void);
+void xen_accumulate_runstate_time(void);
 void xen_get_runstate_snapshot(struct vcpu_runstate_info *res);
 u64 xen_steal_clock(int cpu);
 
-- 
2.7.4




Re: [Xen-devel] [PATCH v5 1/1] xen/time: do not decrease steal time after live migration on xen

2017-10-30 Thread Dongli Zhang
Hi Boris,

On 10/31/2017 08:58 AM, Boris Ostrovsky wrote:
> 
> 
> On 10/30/2017 08:14 PM, Dongli Zhang wrote:
>> Hi Boris,
>>
>> On 10/30/2017 09:34 PM, Boris Ostrovsky wrote:
>>> On 10/30/2017 04:03 AM, Dongli Zhang wrote:
>>>> After guest live migration on xen, steal time in /proc/stat
>>>> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
>>>> xen_steal_clock() might be less than this_rq()->prev_steal_time which is
>>>> derived from previous return value of xen_steal_clock().
>>>>
>>>> For instance, steal time of each vcpu is 335 before live migration.
>>>>
>>>> cpu  198 0 368 200064 1962 0 0 1340 0 0
>>>> cpu0 38 0 81 50063 492 0 0 335 0 0
>>>> cpu1 65 0 97 49763 634 0 0 335 0 0
>>>> cpu2 38 0 81 50098 462 0 0 335 0 0
>>>> cpu3 56 0 107 50138 374 0 0 335 0 0
>>>>
>>>> After live migration, steal time is reduced to 312.
>>>>
>>>> cpu  200 0 370 200330 1971 0 0 1248 0 0
>>>> cpu0 38 0 82 50123 500 0 0 312 0 0
>>>> cpu1 65 0 97 49832 634 0 0 312 0 0
>>>> cpu2 39 0 82 50167 462 0 0 312 0 0
>>>> cpu3 56 0 107 50207 374 0 0 312 0 0
>>>>
>>>> Since runstate times are cumulative and cleared during xen live migration
>>>> by xen hypervisor, the idea of this patch is to accumulate runstate times
>>>> to global percpu variables before live migration suspend. Once guest VM is
>>>> resumed, xen_get_runstate_snapshot_cpu() would always return the sum of new
>>>> runstate times and previously accumulated times stored in global percpu
>>>> variables.
>>>>
>>>> Similar and more severe issue would impact prior linux 4.8-4.10 as
>>>> discussed by Michael Las at
>>>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
>>>>
>>>> which would overflow steal time and lead to 100% st usage in top command
>>>> for linux 4.8-4.10. A backport of this patch would fix that issue.
>>>>
>>>> References:
>>>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
>>>>
>>>> Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
>>>>
>>>> ---
>>>> Changed since v1:
>>>>* relocate modification to xen_get_runstate_snapshot_cpu
>>>>
>>>> Changed since v2:
>>>>* accumulate runstate times before live migration
>>>>
>>>> Changed since v3:
>>>>* do not accumulate times in the case of guest checkpointing
>>>>
>>>> Changed since v4:
>>>>* allocate array of vcpu_runstate_info to reduce number of memory 
>>>> allocation
>>>>
>>>> ---
>>>>   drivers/xen/manage.c |  2 ++
>>>>   drivers/xen/time.c   | 68
>>>> ++--
>>>>   include/xen/interface/vcpu.h |  2 ++
>>>>   include/xen/xen-ops.h|  1 +
>>>>   4 files changed, 71 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
>>>> index c425d03..3dc085d 100644
>>>> --- a/drivers/xen/manage.c
>>>> +++ b/drivers/xen/manage.c
>>>> @@ -72,6 +72,7 @@ static int xen_suspend(void *data)
>>>>   }
>>>> gnttab_suspend();
>>>> +xen_accumulate_runstate_time(-1);
>>>>   xen_arch_pre_suspend();
>>>> /*
>>>> @@ -84,6 +85,7 @@ static int xen_suspend(void *data)
>>>>  : 0);
>>>> xen_arch_post_suspend(si->cancelled);
>>>> +xen_accumulate_runstate_time(si->cancelled);
>>>
>>> I am not convinced that the comment above HYPERVISOR_suspend() is
>>> correct. The call can return an error code and so if it returns -EPERM
>>> (which AFAICS it can't now but might in the future) then
>>> xen_accumulate_runstate_time() will do wrong thing.
>>
>> I would split xen_accumulate_runstate_time() into two functions to avoid the
>> -EPERM issue, as one is for saving and another is for accumulation, 
>> respectively.
>>
>> Otherwise, can you use xen_accumulate_runstate_time(2) for saving before 
>> suspend
>> and xen_accumulate_runstate_time(si->cancelled) after resume?
> 
> 
> I'd probably just say

Re: [Xen-devel] [PATCH v5 1/1] xen/time: do not decrease steal time after live migration on xen

2017-10-30 Thread Dongli Zhang
Hi Boris,

On 10/30/2017 09:34 PM, Boris Ostrovsky wrote:
> On 10/30/2017 04:03 AM, Dongli Zhang wrote:
>> After guest live migration on xen, steal time in /proc/stat
>> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
>> xen_steal_clock() might be less than this_rq()->prev_steal_time which is
>> derived from previous return value of xen_steal_clock().
>>
>> For instance, steal time of each vcpu is 335 before live migration.
>>
>> cpu  198 0 368 200064 1962 0 0 1340 0 0
>> cpu0 38 0 81 50063 492 0 0 335 0 0
>> cpu1 65 0 97 49763 634 0 0 335 0 0
>> cpu2 38 0 81 50098 462 0 0 335 0 0
>> cpu3 56 0 107 50138 374 0 0 335 0 0
>>
>> After live migration, steal time is reduced to 312.
>>
>> cpu  200 0 370 200330 1971 0 0 1248 0 0
>> cpu0 38 0 82 50123 500 0 0 312 0 0
>> cpu1 65 0 97 49832 634 0 0 312 0 0
>> cpu2 39 0 82 50167 462 0 0 312 0 0
>> cpu3 56 0 107 50207 374 0 0 312 0 0
>>
>> Since runstate times are cumulative and cleared during xen live migration
>> by xen hypervisor, the idea of this patch is to accumulate runstate times
>> to global percpu variables before live migration suspend. Once guest VM is
>> resumed, xen_get_runstate_snapshot_cpu() would always return the sum of new
>> runstate times and previously accumulated times stored in global percpu
>> variables.
>>
>> Similar and more severe issue would impact prior linux 4.8-4.10 as
>> discussed by Michael Las at
>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
>> which would overflow steal time and lead to 100% st usage in top command
>> for linux 4.8-4.10. A backport of this patch would fix that issue.
>>
>> References: 
>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
>> Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
>>
>> ---
>> Changed since v1:
>>   * relocate modification to xen_get_runstate_snapshot_cpu
>>
>> Changed since v2:
>>   * accumulate runstate times before live migration
>>
>> Changed since v3:
>>   * do not accumulate times in the case of guest checkpointing
>>
>> Changed since v4:
>>   * allocate array of vcpu_runstate_info to reduce number of memory 
>> allocation
>>
>> ---
>>  drivers/xen/manage.c |  2 ++
>>  drivers/xen/time.c   | 68 
>> ++--
>>  include/xen/interface/vcpu.h |  2 ++
>>  include/xen/xen-ops.h|  1 +
>>  4 files changed, 71 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
>> index c425d03..3dc085d 100644
>> --- a/drivers/xen/manage.c
>> +++ b/drivers/xen/manage.c
>> @@ -72,6 +72,7 @@ static int xen_suspend(void *data)
>>  }
>>  
>>  gnttab_suspend();
>> +xen_accumulate_runstate_time(-1);
>>  xen_arch_pre_suspend();
>>  
>>  /*
>> @@ -84,6 +85,7 @@ static int xen_suspend(void *data)
>> : 0);
>>  
>>  xen_arch_post_suspend(si->cancelled);
>> +xen_accumulate_runstate_time(si->cancelled);
> 
> I am not convinced that the comment above HYPERVISOR_suspend() is
> correct. The call can return an error code and so if it returns -EPERM
> (which AFAICS it can't now but might in the future) then
> xen_accumulate_runstate_time() will do wrong thing.

I would split xen_accumulate_runstate_time() into two functions to avoid
the -EPERM issue: one for saving and the other for accumulating.

Otherwise, can you use xen_accumulate_runstate_time(2) for saving before suspend
and xen_accumulate_runstate_time(si->cancelled) after resume?

> 
> 
>>  gnttab_resume();
>>  
>>  if (!si->cancelled) {
>> diff --git a/drivers/xen/time.c b/drivers/xen/time.c
>> index ac5f23f..cf3afb9 100644
>> --- a/drivers/xen/time.c
>> +++ b/drivers/xen/time.c
>> @@ -19,6 +19,9 @@
>>  /* runstate info updated by Xen */
>>  static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
>>  
>> +static DEFINE_PER_CPU(u64[RUNSTATE_max], old_runstate_time);
>> +static struct vcpu_runstate_info *runstate_delta;
> 
> I'd move this inside xen_accumulate_runstate_time() since that's the

If we split xen_accumulate_runstate_time() into two functions, we would
keep runstate_delta as a global static.

> only function that uses it. And why does it need to be
> vcpu_runstate_info and not u64[4]?

This was suggested by Juergen to avoid the allocation a

[Xen-devel] [PATCH v6 1/1] xen/time: do not decrease steal time after live migration on xen

2017-10-31 Thread Dongli Zhang
After guest live migration on xen, steal time in /proc/stat
(cpustat[CPUTIME_STEAL]) might decrease because steal returned by
xen_steal_clock() might be less than this_rq()->prev_steal_time which is
derived from previous return value of xen_steal_clock().

For instance, steal time of each vcpu is 335 before live migration.

cpu  198 0 368 200064 1962 0 0 1340 0 0
cpu0 38 0 81 50063 492 0 0 335 0 0
cpu1 65 0 97 49763 634 0 0 335 0 0
cpu2 38 0 81 50098 462 0 0 335 0 0
cpu3 56 0 107 50138 374 0 0 335 0 0

After live migration, steal time is reduced to 312.

cpu  200 0 370 200330 1971 0 0 1248 0 0
cpu0 38 0 82 50123 500 0 0 312 0 0
cpu1 65 0 97 49832 634 0 0 312 0 0
cpu2 39 0 82 50167 462 0 0 312 0 0
cpu3 56 0 107 50207 374 0 0 312 0 0

Since runstate times are cumulative and are cleared by the xen hypervisor
during live migration, the idea of this patch is to accumulate the runstate
times into global percpu variables before the live migration suspend. Once
the guest VM is resumed, xen_get_runstate_snapshot_cpu() will always return
the sum of the new runstate times and the previously accumulated times
stored in the global percpu variables. The comment before the call to
HYPERVISOR_suspend() has been removed as it was inaccurate: the call can
return an error code (e.g., possibly -EPERM in the future).

A similar and more severe issue impacts Linux 4.8-4.10, as discussed by
Michael Las at
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest:
there the steal time can overflow and lead to 100% st usage in the top
command. A backport of this patch would fix that issue as well.

References: 
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>

---
Changed since v1:
  * relocate modification to xen_get_runstate_snapshot_cpu

Changed since v2:
  * accumulate runstate times before live migration

Changed since v3:
  * do not accumulate times in the case of guest checkpointing

Changed since v4:
  * allocate array of vcpu_runstate_info to reduce number of memory allocation

Changed since v5:
  * remove old incorrect comments above hypercall and mention in commit message
  * rename xen_accumulate_runstate_time() to xen_manage_runstate_time()
  * move global static pointer into xen_manage_runstate_time
  * change warn and alert to pr_warn_once() or pr_warn()
  * change kcalloc to kmalloc_array
  * do not add RUNSTATE_max to change Xen ABI and use 4 in the code instead

---
 drivers/xen/manage.c  |  7 ++---
 drivers/xen/time.c| 71 +--
 include/xen/xen-ops.h |  1 +
 3 files changed, 72 insertions(+), 7 deletions(-)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index c425d03..8835065 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -72,18 +72,15 @@ static int xen_suspend(void *data)
}
 
gnttab_suspend();
+   xen_manage_runstate_time(-1);
xen_arch_pre_suspend();
 
-   /*
-* This hypercall returns 1 if suspend was cancelled
-* or the domain was merely checkpointed, and 0 if it
-* is resuming in a new domain.
-*/
si->cancelled = HYPERVISOR_suspend(xen_pv_domain()
? virt_to_gfn(xen_start_info)
: 0);
 
xen_arch_post_suspend(si->cancelled);
+   xen_manage_runstate_time(si->cancelled ? 1 : 0);
gnttab_resume();
 
if (!si->cancelled) {
diff --git a/drivers/xen/time.c b/drivers/xen/time.c
index ac5f23f..65a0b25 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -19,6 +19,8 @@
 /* runstate info updated by Xen */
 static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
 
+static DEFINE_PER_CPU(u64[4], old_runstate_time);
+
 /* return an consistent snapshot of 64-bit time/counter value */
 static u64 get64(const u64 *p)
 {
@@ -47,8 +49,8 @@ static u64 get64(const u64 *p)
return ret;
 }
 
-static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
- unsigned int cpu)
+static void xen_get_runstate_snapshot_cpu_delta(
+ struct vcpu_runstate_info *res, unsigned int cpu)
 {
u64 state_time;
struct vcpu_runstate_info *state;
@@ -66,6 +68,71 @@ static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
 (state_time & XEN_RUNSTATE_UPDATE));
 }
 
+static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
+ unsigned int cpu)
+{
+   int i;
+
+   xen_get_runstate_snapshot_cpu_delta(res, cpu);
+
+   for (i = 0; i < 4; i++)
+   res->time[i] += per_cpu(old_runstate_time, cpu)[i];
+}
+
+void xen_manage_runstate_time(int action)
+{
+   static struct vcpu_runstate_info *runstate_delta;
+   struct vcpu_runstate

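[The posted hunk is truncated here in the archive. A sketch of how the
rest of xen_manage_runstate_time() plausibly continues, reconstructed from
the v6 changelog and the manage.c call sites above; the details are
assumptions, not a verbatim quote of the patch:]

void xen_manage_runstate_time(int action)
{
	static struct vcpu_runstate_info *runstate_delta;
	struct vcpu_runstate_info state;
	int cpu, i;

	switch (action) {
	case -1: /* backup runstate time before suspend */
		runstate_delta = kmalloc_array(num_possible_cpus(),
					       sizeof(*runstate_delta),
					       GFP_ATOMIC);
		if (!runstate_delta) {
			pr_warn_once("%s: failed to allocate runstate_delta\n",
				     __func__);
			return;
		}

		for_each_possible_cpu(cpu) {
			xen_get_runstate_snapshot_cpu_delta(&state, cpu);
			memcpy(runstate_delta[cpu].time, state.time,
			       sizeof(runstate_delta[cpu].time));
		}
		break;

	case 0: /* real suspend: the hypervisor cleared the runstate
		 * area, so fold the stashed times into the per-cpu
		 * old_runstate_time accumulators */
		if (!runstate_delta)
			return;

		for_each_possible_cpu(cpu) {
			for (i = 0; i < 4; i++)
				per_cpu(old_runstate_time, cpu)[i] +=
					runstate_delta[cpu].time[i];
		}
		break;

	default: /* cancelled suspend or checkpoint: nothing was
		  * cleared, so accumulating would double-count */
		break;
	}

	if (action != -1) {
		kfree(runstate_delta);
		runstate_delta = NULL;
	}
}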
Re: [Xen-devel] [PATCH v3 1/1] xen/time: do not decrease steal time after live migration on xen

2017-10-27 Thread Dongli Zhang
Hi Boris,

On 10/25/2017 11:12 PM, Boris Ostrovsky wrote:
> On 10/25/2017 02:45 AM, Dongli Zhang wrote:
>> After guest live migration on xen, steal time in /proc/stat
>> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
>> xen_steal_clock() might be less than this_rq()->prev_steal_time which is
>> derived from previous return value of xen_steal_clock().
>>
>> For instance, steal time of each vcpu is 335 before live migration.
>>
>> cpu  198 0 368 200064 1962 0 0 1340 0 0
>> cpu0 38 0 81 50063 492 0 0 335 0 0
>> cpu1 65 0 97 49763 634 0 0 335 0 0
>> cpu2 38 0 81 50098 462 0 0 335 0 0
>> cpu3 56 0 107 50138 374 0 0 335 0 0
>>
>> After live migration, steal time is reduced to 312.
>>
>> cpu  200 0 370 200330 1971 0 0 1248 0 0
>> cpu0 38 0 82 50123 500 0 0 312 0 0
>> cpu1 65 0 97 49832 634 0 0 312 0 0
>> cpu2 39 0 82 50167 462 0 0 312 0 0
>> cpu3 56 0 107 50207 374 0 0 312 0 0
>>
>> Since runstate times are cumulative and cleared during xen live migration
>> by xen hypervisor, the idea of this patch is to accumulate runstate times
>> to global percpu variables before live migration suspend. Once guest VM is
>> resumed, xen_get_runstate_snapshot_cpu() would always return the sum of new
>> runstate times and previously accumulated times stored in global percpu
>> variables.
>>
>> Similar and more severe issue would impact prior linux 4.8-4.10 as
>> discussed by Michael Las at
>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
>> which would overflow steal time and lead to 100% st usage in top command
>> for linux 4.8-4.10. A backport of this patch would fix that issue.
>>
>> References: 
>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
>> Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
>>
>> ---
>> Changed since v1:
>>   * relocate modification to xen_get_runstate_snapshot_cpu
>>
>> Changed since v2:
>>   * accumulate runstate times before live migration
>>
>> ---
>>  drivers/xen/manage.c  |  1 +
>>  drivers/xen/time.c| 19 +++
>>  include/xen/xen-ops.h |  1 +
>>  3 files changed, 21 insertions(+)
>>
>> diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
>> index c425d03..9aa2955 100644
>> --- a/drivers/xen/manage.c
>> +++ b/drivers/xen/manage.c
>> @@ -72,6 +72,7 @@ static int xen_suspend(void *data)
>>  }
>>  
>>  gnttab_suspend();
>> +xen_accumulate_runstate_time();
>>  xen_arch_pre_suspend();
>>  
>>  /*
>> diff --git a/drivers/xen/time.c b/drivers/xen/time.c
>> index ac5f23f..6df3f82 100644
>> --- a/drivers/xen/time.c
>> +++ b/drivers/xen/time.c
>> @@ -19,6 +19,8 @@
>>  /* runstate info updated by Xen */
>>  static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
>>  
>> +static DEFINE_PER_CPU(u64[4], old_runstate_time);
>> +
>>  /* return an consistent snapshot of 64-bit time/counter value */
>>  static u64 get64(const u64 *p)
>>  {
>> @@ -52,6 +54,7 @@ static void xen_get_runstate_snapshot_cpu(struct 
>> vcpu_runstate_info *res,
>>  {
>>  u64 state_time;
>>  struct vcpu_runstate_info *state;
>> +int i;
>>  
>>  BUG_ON(preemptible());
>>  
>> @@ -64,6 +67,22 @@ static void xen_get_runstate_snapshot_cpu(struct 
>> vcpu_runstate_info *res,
>>  rmb();  /* Hypervisor might update data. */
>>  } while (get64(>state_entry_time) != state_time ||
>>   (state_time & XEN_RUNSTATE_UPDATE));
>> +
>> +for (i = 0; i < 4; i++)
>> +res->time[i] += per_cpu(old_runstate_time, cpu)[i];
>> +}
>> +
>> +void xen_accumulate_runstate_time(void)
>> +{
>> +struct vcpu_runstate_info state;
>> +int cpu;
>> +
>> +for_each_possible_cpu(cpu) {
>> +xen_get_runstate_snapshot_cpu(, cpu);
>> +memcpy(per_cpu(old_runstate_time, cpu),
>> +state.time,
>> +4 * sizeof(u64));
> 
> sizeof(old_runstate_time). (I think this should work for per_cpu variables)
> 
>> +}
> 
> Hmm.. This may not perform as intended if we are merely checkpointing
> (or pausing) the guest (i.e. if HYPERVISOR_suspend() returns 1). We will
> double-account for the last interval that the guest has run.
> 
> I'd rather not have yet another per-cpu variable but I can't think of

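[The message is truncated here in the archive. To make the
double-accounting concern concrete, here is a toy userspace illustration
with made-up numbers; this is not kernel code:]

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t runstate = 1000;  /* cumulative steal reported by Xen */
	uint64_t old_accum = 0;    /* plays the role of old_runstate_time */

	/* v3 stashes the snapshot unconditionally before suspend */
	old_accum += runstate;

	/* checkpoint: the suspend is cancelled and the hypervisor does
	 * NOT clear the runstate area, so the next snapshot still reads
	 * about 1000, and the sum double-counts the interval */
	uint64_t reported = old_accum + runstate;

	printf("reported steal = %llu (expected ~1000)\n",
	       (unsigned long long)reported);
	return 0;
}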
Re: [Xen-devel] [PATCH v3 1/1] xen/time: do not decrease steal time after live migration on xen

2017-10-27 Thread Dongli Zhang
Hi Juergen,

On 10/27/2017 03:31 PM, Juergen Gross wrote:
> On 27/10/17 09:16, Dongli Zhang wrote:
>> Hi Boris,
>>
>> On 10/25/2017 11:12 PM, Boris Ostrovsky wrote:
>>> On 10/25/2017 02:45 AM, Dongli Zhang wrote:
>>>> After guest live migration on xen, steal time in /proc/stat
>>>> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
>>>> xen_steal_lock() might be less than this_rq()->prev_steal_time which is
>>>> derived from previous return value of xen_steal_clock().
>>>>
>>>> For instance, steal time of each vcpu is 335 before live migration.
>>>>
>>>> cpu  198 0 368 200064 1962 0 0 1340 0 0
>>>> cpu0 38 0 81 50063 492 0 0 335 0 0
>>>> cpu1 65 0 97 49763 634 0 0 335 0 0
>>>> cpu2 38 0 81 50098 462 0 0 335 0 0
>>>> cpu3 56 0 107 50138 374 0 0 335 0 0
>>>>
>>>> After live migration, steal time is reduced to 312.
>>>>
>>>> cpu  200 0 370 200330 1971 0 0 1248 0 0
>>>> cpu0 38 0 82 50123 500 0 0 312 0 0
>>>> cpu1 65 0 97 49832 634 0 0 312 0 0
>>>> cpu2 39 0 82 50167 462 0 0 312 0 0
>>>> cpu3 56 0 107 50207 374 0 0 312 0 0
>>>>
>>>> Since runstate times are cumulative and cleared during xen live migration
>>>> by xen hypervisor, the idea of this patch is to accumulate runstate times
>>>> to global percpu variables before live migration suspend. Once guest VM is
>>>> resumed, xen_get_runstate_snapshot_cpu() would always return the sum of new
>>>> runstate times and previously accumulated times stored in global percpu
>>>> variables.
>>>>
>>>> Similar and more severe issue would impact prior linux 4.8-4.10 as
>>>> discussed by Michael Las at
>>>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
>>>> which would overflow steal time and lead to 100% st usage in top command
>>>> for linux 4.8-4.10. A backport of this patch would fix that issue.
>>>>
>>>> References: 
>>>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
>>>> Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
>>>>
>>>> ---
>>>> Changed since v1:
>>>>   * relocate modification to xen_get_runstate_snapshot_cpu
>>>>
>>>> Changed since v2:
>>>>   * accumulate runstate times before live migration
>>>>
>>>> ---
>>>>  drivers/xen/manage.c  |  1 +
>>>>  drivers/xen/time.c| 19 +++
>>>>  include/xen/xen-ops.h |  1 +
>>>>  3 files changed, 21 insertions(+)
>>>>
>>>> diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
>>>> index c425d03..9aa2955 100644
>>>> --- a/drivers/xen/manage.c
>>>> +++ b/drivers/xen/manage.c
>>>> @@ -72,6 +72,7 @@ static int xen_suspend(void *data)
>>>>}
>>>>  
>>>>gnttab_suspend();
>>>> +  xen_accumulate_runstate_time();
>>>>xen_arch_pre_suspend();
>>>>  
>>>>/*
>>>> diff --git a/drivers/xen/time.c b/drivers/xen/time.c
>>>> index ac5f23f..6df3f82 100644
>>>> --- a/drivers/xen/time.c
>>>> +++ b/drivers/xen/time.c
>>>> @@ -19,6 +19,8 @@
>>>>  /* runstate info updated by Xen */
>>>>  static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
>>>>  
>>>> +static DEFINE_PER_CPU(u64[4], old_runstate_time);
>>>> +
>>>>  /* return an consistent snapshot of 64-bit time/counter value */
>>>>  static u64 get64(const u64 *p)
>>>>  {
>>>> @@ -52,6 +54,7 @@ static void xen_get_runstate_snapshot_cpu(struct 
>>>> vcpu_runstate_info *res,
>>>>  {
>>>>u64 state_time;
>>>>struct vcpu_runstate_info *state;
>>>> +  int i;
>>>>  
>>>>BUG_ON(preemptible());
>>>>  
>>>> @@ -64,6 +67,22 @@ static void xen_get_runstate_snapshot_cpu(struct 
>>>> vcpu_runstate_info *res,
>>>>rmb();  /* Hypervisor might update data. */
>>>>} while (get64(>state_entry_time) != state_time ||
>>>> (state_time & XEN_RUNSTATE_UPDATE));
>>>> +
>>>> +  for (i = 0; i < 4; i++)
>>>> +  res->time[i] += per_cpu(old_runstate_time, cpu)[i];

Re: [Xen-devel] [PATCH 1/1] xen/time: do not decrease steal time after live migration on xen

2017-10-19 Thread Dongli Zhang
Hi Boris,

- boris.ostrov...@oracle.com wrote:

> On 10/19/2017 04:02 AM, Dongli Zhang wrote:
> > After guest live migration on xen, steal time in /proc/stat
> > (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
> > xen_steal_clock() might be less than this_rq()->prev_steal_time which
> is
> > derived from previous return value of xen_steal_clock().
> >
> > For instance, steal time of each vcpu is 335 before live migration.
> >
> > cpu  198 0 368 200064 1962 0 0 1340 0 0
> > cpu0 38 0 81 50063 492 0 0 335 0 0
> > cpu1 65 0 97 49763 634 0 0 335 0 0
> > cpu2 38 0 81 50098 462 0 0 335 0 0
> > cpu3 56 0 107 50138 374 0 0 335 0 0
> >
> > After live migration, steal time is reduced to 312.
> >
> > cpu  200 0 370 200330 1971 0 0 1248 0 0
> > cpu0 38 0 82 50123 500 0 0 312 0 0
> > cpu1 65 0 97 49832 634 0 0 312 0 0
> > cpu2 39 0 82 50167 462 0 0 312 0 0
> > cpu3 56 0 107 50207 374 0 0 312 0 0
> >
> > The code in this patch is borrowed from do_stolen_accounting() which
> has
> > already been removed from linux source code since commit
> ecb23dc6f2ef
> > ("xen: add steal_clock support on x86"). The core idea of both
> > do_stolen_accounting() and this patch is to avoid accounting new
> steal
> > clock if it is smaller than previous old steal clock.
> >
> > Similar and more severe issue would impact prior linux 4.8-4.10 as
> > discussed by Michael Las at
> >
> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
> > which would overflow steal time and lead to 100% st usage in top
> command
> > for linux 4.8-4.10. A backport of this patch would fix that issue.
> >
> > References:
> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
> > Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
> > ---
> >  drivers/xen/time.c | 15 ++-
> >  1 file changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/xen/time.c b/drivers/xen/time.c
> > index ac5f23f..2b3a996 100644
> > --- a/drivers/xen/time.c
> > +++ b/drivers/xen/time.c
> > @@ -19,6 +19,8 @@
> >  /* runstate info updated by Xen */
> >  static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
> >
> > +static DEFINE_PER_CPU(u64, xen_old_steal);
> > +
> >  /* return an consistent snapshot of 64-bit time/counter value */
> >  static u64 get64(const u64 *p)
> >  {
> > @@ -83,9 +85,20 @@ bool xen_vcpu_stolen(int vcpu)
> >  u64 xen_steal_clock(int cpu)
> >  {
> > struct vcpu_runstate_info state;
> > +   u64 xen_new_steal;
> > +   s64 steal_delta;
> >
> > xen_get_runstate_snapshot_cpu(, cpu);
> > -   return state.time[RUNSTATE_runnable] +
> state.time[RUNSTATE_offline];
> > +   xen_new_steal = state.time[RUNSTATE_runnable]
> > +   + state.time[RUNSTATE_offline];
> > +   steal_delta = xen_new_steal - per_cpu(xen_old_steal, cpu);
> > +
> > +   if (steal_delta < 0)
> > +   xen_new_steal = per_cpu(xen_old_steal, cpu);
> > +   else
> > +   per_cpu(xen_old_steal, cpu) = xen_new_steal;
> > +
> > +   return xen_new_steal;
> >  }
> >
> >  void xen_setup_runstate_info(int cpu)
> 
> Can we stash state.time[] during suspend and then add stashed values
> inside xen_get_runstate_snapshot_cpu()?


Would you like to stash state.time[] during do_suspend() (or xen_suspend()) or
code below is expected:

-

--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -19,6 +19,8 @@
 /* runstate info updated by Xen */
 static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
 
+static DEFINE_PER_CPU(u64[4], old_runstate_time);
+
 /* return an consistent snapshot of 64-bit time/counter value */
 static u64 get64(const u64 *p)
 {
@@ -52,6 +54,8 @@ static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
 {
u64 state_time;
struct vcpu_runstate_info *state;
+   int i;
+   s64 time_delta;
 
BUG_ON(preemptible());
 
@@ -64,6 +68,17 @@ static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
rmb();  /* Hypervisor might update data. */
} while (get64(>state_entry_time) != state_time ||
 (state_time & XEN_RUNSTATE_UPDATE));
+
+   for (i = 0; i < 4; i++) {
+       if (i == RUNSTATE_runnable || i == RUNSTATE_offline) {
+   time_delta = res->time[i] - per_cpu(old_runstate_time, cpu)[i];
+
+   if (unlikely(time_de