Re: Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)

2014-06-06 Thread Vitaly Kuznetsov
Jiri Slaby jsl...@suse.cz writes:

 On 06/04/2014 07:48 AM, Greg KH wrote:
 On Wed, May 14, 2014 at 03:11:22PM -0400, Konrad Rzeszutek Wilk wrote:
 Hey Greg

 This email is in regards to backporting two patches to stable that
 fall under the 'performance' rule:

  bfe11d6de1c416cea4f3f0f35f864162063ce3fa
  fbe363c476afe8ec992d3baf682670a4bd1b6ce6
 
 Now queued up, thanks.

 AFAIU, they introduce a performance regression.

 Vitaly?

I'm aware of a performance regression in a 'very special' case where
ramdisks or files on tmpfs are used as storage. I posted my results a
while ago:
https://lkml.org/lkml/2014/5/22/164
I'm not sure whether that 'special' case requires investigation and/or
should prevent us from doing the stable backport, but it would be nice if
someone at least tried to reproduce it.

I'm going to run a bunch of tests with FusionIO drives and sequential
reads to replicate the test Felipe did. I'll report as soon as I have
data (hopefully at the beginning of next week).

-- 
  Vitaly
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Xen-devel] Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)

2014-06-10 Thread Vitaly Kuznetsov
Vitaly Kuznetsov vkuzn...@redhat.com writes:

 I'm going to run a bunch of tests with FusionIO drives and sequential
 reads to replicate the test Felipe did. I'll report as soon as I have
 data (hopefully at the beginning of next week).

Turns out the regression I'm observing with these patches is not
restricted to tmpfs/ramdisk usage.

I was doing tests with a Fusion-io ioDrive Duo 320GB (Dual Adapter) on an
HP ProLiant DL380 G6 (2xE5540, 8G RAM). Hyperthreading is disabled and
Dom0 is pinned to CPU0 (cores 0,1,2,3). I run up to 8 guests with 1 vCPU
each; they are pinned to CPU1 (cores 4,5,6,7,4,5,6,7). I also tried a
different pinning (Dom0 to 0,1,4,5, DomUs to 2,3,6,7,2,3,6,7 to balance
NUMA), but that doesn't make any difference to the results. I was testing
on top of Xen-4.3.2.

I was testing two storage configurations:
1) Plain 10G partitions from one Fusion drive (/dev/fioa) are attached
to guests
2) LVM group is created on top of both drives (/dev/fioa, /dev/fiob),
10G logical volumes are created with striping (lvcreate -i2 ...)

The test is a simultaneous fio run in the guests (rw=read, direct=1) for
10 seconds. Each test was performed 3 times and the average was taken.
The kernels I compare are:
1) v3.15-rc5-157-g60b5f90 unmodified
2) v3.15-rc5-157-g60b5f90 with 427bfe07e6744c058ce6fc4aa187cda96b635539,
   bfe11d6de1c416cea4f3f0f35f864162063ce3fa, and
   fbe363c476afe8ec992d3baf682670a4bd1b6ce6 reverted.

First test was done with Dom0 with persistent grant support (Fedora's
3.14.4-200.fc20.x86_64):
1) Partitions:
http://hadoop.ru/pubfiles/bug1096909/fusion/315_pgrants_partitions.png
(same markers mean same bs, we get 860 MB/s here, patches make no
difference, result matches expectation)

2) LVM Stripe:
http://hadoop.ru/pubfiles/bug1096909/fusion/315_pgrants_stripe.png
(1715 MB/s, patches make no difference, result matches expectation)

Second test was performed with Dom0 without persistent grants support
(Fedora's 3.7.9-205.fc18.x86_64)
1) Partitions:
http://hadoop.ru/pubfiles/bug1096909/fusion/315_nopgrants_partitions.png
(860 MB/sec again, the patches slightly worsen overall throughput with
1-3 clients)

2) LVM Stripe:
http://hadoop.ru/pubfiles/bug1096909/fusion/315_nopgrants_stripe.png
(Here we see the same regression I observed with ramdisks and tmpfs
files, unmodified kernel: 1550MB/s, with patches reverted: 1715MB/s).
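
For scale, the gap between the two stripe results can be expressed as a
relative loss. A minimal sketch in Python, using only the two figures quoted
above; the function and constant names are made up for illustration:

```python
# Throughput figures quoted in the message (MB/s, Dom0 without persistent
# grants, striped LVM). Names below are illustrative only.
UNMODIFIED_MBPS = 1550.0  # unmodified kernel (patches present)
REVERTED_MBPS = 1715.0    # same kernel with the three commits reverted


def relative_loss(baseline: float, measured: float) -> float:
    """Return the fraction of the baseline throughput that is lost."""
    return (baseline - measured) / baseline


loss = relative_loss(REVERTED_MBPS, UNMODIFIED_MBPS)
print(f"throughput loss with patches: {loss:.1%}")
```

The reverted kernel is taken as the baseline because it is the faster of the
two in this configuration.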

The only major difference with Felipe's test is that he was using
blktap3 with XenServer and I'm using standard blktap2.

-- 
  Vitaly


Re: [Xen-devel] Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)

2014-06-12 Thread Vitaly Kuznetsov
Roger Pau Monné roger@citrix.com writes:

 On 10/06/14 15:19, Vitaly Kuznetsov wrote:
 The only major difference with Felipe's test is that he was using
 blktap3 with XenServer and I'm using standard blktap2.

 Hello,

 I don't think you are using blktap2, I guess you are using blkback.

Right, sorry for the confusion.

 Also, running the test only for 10s and 3 repetitions seems too low, I
 would probably try to run the tests for a longer time and do more
 repetitions, and include the standard deviation also.

 Could you try to revert the patches independently to see if it's a
 specific commit that introduces the regression?

I did additional test runs. Now I'm comparing 3 kernels:
1) Unmodified v3.15-rc5-157-g60b5f90 - green color on chart

2) v3.15-rc5-157-g60b5f90 with bfe11d6de1c416cea4f3f0f35f864162063ce3fa
and 427bfe07e6744c058ce6fc4aa187cda96b635539 reverted (so only
fbe363c476afe8ec992d3baf682670a4bd1b6ce6 "xen-blkfront: revoke foreign
access for grants not mapped by the backend" is left) - blue color on chart

3) v3.15-rc5-157-g60b5f90 with all
(bfe11d6de1c416cea4f3f0f35f864162063ce3fa,
427bfe07e6744c058ce6fc4aa187cda96b635539,
fbe363c476afe8ec992d3baf682670a4bd1b6ce6) patches reverted - red color
on chart.

I test on top of striped LVM on 2 FusionIO drives and do 3 repetitions of
30 seconds each.
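
Roger's request to report the standard deviation alongside the average can
be sketched in a few lines of Python. The per-run values below are
placeholders chosen to make the arithmetic obvious; they are not
measurements from this thread:

```python
import statistics

# Placeholder per-run throughputs in MB/s (NOT real results from the thread).
EXAMPLE_RUNS_MBPS = [100.0, 110.0, 120.0]


def summarize(runs):
    """Return (mean, sample standard deviation) for a list of run results."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate the deviation")
    return statistics.mean(runs), statistics.stdev(runs)


mean_mbps, stdev_mbps = summarize(EXAMPLE_RUNS_MBPS)
print(f"mean={mean_mbps:.1f} MB/s, stdev={stdev_mbps:.1f} MB/s")
```

With only 3 repetitions the sample deviation is a rough estimate, which is
part of the reason for asking for more runs.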

The result is here:
http://hadoop.ru/pubfiles/bug1096909/fusion/315_nopgrants_20140612.png

It is consistent with what I've measured with ramdrives and tmpfs files:

1) fbe363c476afe8ec992d3baf682670a4bd1b6ce6 "xen-blkfront: revoke
foreign access for grants not mapped by the backend" brings us the
regression. The bigger the block size, the bigger the difference, but the
regression is observed with all block sizes  8k.

2) bfe11d6de1c416cea4f3f0f35f864162063ce3fa xen-blkfront

Re: [Xen-devel] Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)

2014-06-12 Thread Vitaly Kuznetsov
Felipe Franciosi felipe.franci...@citrix.com writes:

 Hi Vitaly,

 Are you able to test a 3.10 guest with and without the backport that
 Roger sent? This patch is attached to an e-mail Roger sent on 22 May
 2014 13:54.

Sure,

Now I'm comparing d642daf637d02dacf216d7fd9da7532a4681cfd3 and
46c0326164c98e556c35c3eb240273595d43425d commits from
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
(with and without two commits in question). The test is exactly the same
as described before.

The result is here:
http://hadoop.ru/pubfiles/bug1096909/fusion/310_nopgrants_stripe.png

as you can see 46c03261 (without patches) wins everywhere.


 Because your results are contradicting with what these patches are
 meant to do, I would like to make sure that this isn't related to
 something else that happened after 3.10.

I still think the Dom0 kernel and blktap/blktap3 are what make the
difference between our test environments.


 You could also test Ubuntu Saucy guests with and without the patched kernels 
 provided by Joseph Salisbury on launchpad: 
 https://bugs.launchpad.net/bugs/1319003

 Thanks,
 Felipe


[PATCH] xenpv: don't BUG when failing to setup NMI callback

2014-06-13 Thread Vitaly Kuznetsov
some old Xen hypervisors (prior to 3.2) forbid DomUs to register
NMI callbacks. E.g. we have the following code in xen-3.1:

    if ( (d->domain_id != 0) || (v->vcpu_id != 0) )
        return -EINVAL;

Commit 6efa20e49b9cb1db1ab66870cc37323474a75a13 introduced kernel
crash in case PV guest fails to register NMI callback. All x86_64
PV guests will fail to boot on top of such hypervisors (RHEL5
example):

(XEN) traps.c:405:d7 Unhandled invalid opcode fault/trap [#6] in domain 7 on 
VCPU 0 [ec=]
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 7 (vcpu#0) crashed on cpu#3:
(XEN) [ Xen-3.1.2-389.el5  x86_64  debug=n  Not tainted ]
(XEN) CPU:3
(XEN) RIP:e033:[81004d96]
(XEN) RFLAGS: 0282   CONTEXT: guest
(XEN) rax: ffea   rbx:    rcx: 0002
(XEN) rdx: 0001   rsi: 81b0fe28   rdi: 
(XEN) rbp: 81b0fe40   rsp: 81b0fde8   r8:  
(XEN) r9:  81b0fdd0   r10: 7ff0   r11: 
(XEN) r12: 81d65900   r13:    r14: 
(XEN) r15:    cr0: 80050033   cr4: 26b0
(XEN) cr3: 00013a263000   cr2: 
(XEN) ds:    es:    fs:    gs:    ss: e02b   cs: e033
...

However, it is possible to proceed without the NMI callback registered.
Replace BUG() with a warning in case of -EINVAL.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/xen/setup.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 821a11a..5b8b180 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -593,8 +593,17 @@ void xen_enable_syscall(void)
 void xen_enable_nmi(void)
 {
 #ifdef CONFIG_X86_64
-	if (register_callback(CALLBACKTYPE_nmi, (char *)nmi))
+	int ret;
+
+	ret = register_callback(CALLBACKTYPE_nmi, (char *)nmi);
+	if (ret == -EINVAL) {
+		/* Hypervisor probably forbids us to register NMI callback,
+		   that is expected when running on top of Xen-3.1 and older */
+		pr_warn("xen: failed to register NMI callback\n");
+	} else if (ret != 0) {
+		/* Other hypervisor failure */
 		BUG();
+	}
 #endif
 }
 void __init xen_pvmmu_arch_setup(void)
-- 
1.9.3



Re: [PATCH] xenpv: don't BUG when failing to setup NMI callback

2014-06-13 Thread Vitaly Kuznetsov
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

 On Fri, Jun 13, 2014 at 01:26:28PM +0200, Vitaly Kuznetsov wrote:
 However it is possible to proceed without NMI callback registered.
 Change BUG() with warning in case of -EINVAL.
 
 Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com

 Oh, we had a similar patch - somebody reported it earlier - and we
 just checked the version of Xen:

 http://lists.xenproject.org/archives/html/xen-devel/2014-05/msg01474.html

 But I can't remember why I didn't post it.

 However I do like your path of checking the 'ret'.

 Vitaly, could you expand your patch to also do a check in cvt_gate_to_trap
 so that it won't enable the NMI handler, and then let's pick your
 patch?

Sure,

I also suggest we lower the required xen version to 3.2 instead of 4.0
as 3.2 has the following:

commit a2308fa704a40f23916a176d9e06bbc0e3469caf
Author: Keir Fraser k...@xensource.com
Date:   Mon Oct 22 13:04:32 2007 +0100

x86: Allow NMI callback CS to be specified via set_trap_table()
hypercall.
Based on a patch by Jan Beulich.
Signed-off-by: Keir Fraser k...@xensource.com

I'll update my patch and re-send it.
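
The effect of the proposed threshold can be sketched as a plain version
comparison. The Python model below is only an illustration: versions are
written as (major, minor) tuples, both helper names are hypothetical, and it
says nothing about how the kernel encodes or queries the hypervisor version.

```python
# Minimal model of a "running on version X.Y or later" gate.
# The tuple representation and helper names are illustrative assumptions.

def running_on_version_or_later(running, major, minor):
    """True when the running hypervisor version is >= major.minor."""
    return tuple(running) >= (major, minor)


def nmi_callback_expected(running, required=(3, 2)):
    """Whether a guest should try to set up the NMI callback at all."""
    return running_on_version_or_later(running, *required)


for version in [(3, 1), (3, 2), (3, 4), (4, 0), (4, 3)]:
    with_3_2 = nmi_callback_expected(version, required=(3, 2))
    with_4_0 = nmi_callback_expected(version, required=(4, 0))
    print(version, "threshold 3.2:", with_3_2, "threshold 4.0:", with_4_0)
```

In this model the two thresholds differ only for hypervisors from 3.2 up to
(but not including) 4.0, which is the range the lower requirement is meant
to cover.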



-- 
  Vitaly


Re: [Xen-devel] Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)

2014-05-20 Thread Vitaly Kuznetsov
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

 Hey Greg

 This email is in regards to backporting two patches to stable that
 fall under the 'performance' rule:

  bfe11d6de1c416cea4f3f0f35f864162063ce3fa
  fbe363c476afe8ec992d3baf682670a4bd1b6ce6

 I've copied Jerry - the maintainer of the Oracle's kernel. I don't have
 the emails of the other distros maintainers but the bugs associated with it 
 are:

 https://bugzilla.redhat.com/show_bug.cgi?id=1096909
 (RHEL7)

I ran tests with the RHEL7 kernel and these patches, and unfortunately
I see a huge performance degradation in some workloads.

I'm in the middle of my testing now but here are some intermediate
results. 
Test environment:
Fedora-20, xen-4.3.2-2.fc20.x86_64, 3.11.10-301.fc20.x86_64
I do testing with 1-9 RHEL7 PVHVM guests with:
1) Unmodified RHEL7 kernel
2) Only fbe363c476afe8ec992d3baf682670a4bd1b6ce6 applied (revoke foreign
access)
3) Both fbe363c476afe8ec992d3baf682670a4bd1b6ce6 and
bfe11d6de1c416cea4f3f0f35f864162063ce3fa
(actually 427bfe07e6744c058ce6fc4aa187cda96b635539 is required as well
to make build happy, I suggest we backport that to stable as well)

Storage devices are:
1) ramdisks (/dev/ram*) (persistent grants and indirect descriptors disabled)
2) /tmp/img*.img on tmpfs (persistent grants and indirect descriptors disabled)

Test itself: direct random read with bs=2048k (using fio). (Actually
'dd', 'read/write access', ... show same results)

fio test file:
[fio_read]
ioengine=libaio
blocksize=2048k
rw=randread
filename=/dev/xvdc
randrepeat=1
fallocate=none
direct=1
invalidate=0
runtime=20
time_based

I run fio simultaneously and sum up the result. So, results are:
1) ramdisks: http://hadoop.ru/pubfiles/b1096909_3.11.10_ramdisk.png
2) tmpfiles: http://hadoop.ru/pubfiles/b1096909_3.11.10_tmpfile.png
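
Summing the per-guest results can be done mechanically from fio's summary
lines. The sketch below is one possible way to do it in Python, not the
script used for the charts; the two sample lines are copied from fio output
quoted elsewhere in this thread and serve only to show the format. It
assumes the aggregate bandwidth is reported in KB/s, as in those lines, and
the helper name is made up.

```python
import re

# Matches the aggregate bandwidth field of a fio summary line, e.g.
# "aggrb=339661KB/s". Assumes KB/s units, as in the quoted output.
AGGRB_RE = re.compile(r"aggrb=(\d+(?:\.\d+)?)KB/s")

# Sample summary lines, one per guest (format illustration only).
SAMPLE_LINES = [
    "   READ: io=6636.0MB, aggrb=339661KB/s, minb=339661KB/s, "
    "maxb=339661KB/s, mint=20006msec, maxt=20006msec",
    "   READ: io=6594.0MB, aggrb=337595KB/s, minb=337595KB/s, "
    "maxb=337595KB/s, mint=20001msec, maxt=20001msec",
]


def total_aggrb_kbps(lines):
    """Sum the aggrb values (KB/s) over all lines that carry one."""
    total = 0.0
    for line in lines:
        match = AGGRB_RE.search(line)
        if match:
            total += float(match.group(1))
    return total


total = total_aggrb_kbps(SAMPLE_LINES)
print(f"combined throughput: {total:.0f} KB/s")
```

Lines without an aggrb field are skipped silently, so the full fio output of
each guest can be passed in unfiltered.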

In a few words: the patch series has (almost) no effect when persistent
grants are enabled (that was expected) and gives me a performance
regression when persistent grants are disabled (that wasn't expected).

My thoughts are: it seems fbe363c476afe8ec992d3baf682670a4bd1b6ce6
brings a performance regression in some cases (at least when persistent
grants are disabled). My guess at the moment is that
gnttab_end_foreign_access() (gnttab_end_foreign_access_ref_v1() is being
used here) is guilty; for some reason it is looping for some time.
bfe11d6de1c416cea4f3f0f35f864162063ce3fa really does bring a performance
improvement over fbe363c476afe8ec992d3baf682670a4bd1b6ce6, but the whole
series still brings a regression.

I would be glad to hear what could be wrong with my testing in case I'm
the only one who sees such behavior. Any other pointers are more than
welcome and please feel free to ask for any additional
info/testing/whatever from me.

 https://bugs.launchpad.net/ubuntu/+bug/1319003
 (Ubuntu 13.10)

 The following distros are affected:

 (x) Ubuntu 13.04 and derivatives (3.8)
 (v) Ubuntu 13.10 and derivatives (3.11), supported until 2014-07
 (x) Fedora 17 (3.8 and 3.9 in updates)
 (x) Fedora 18 (3.8, 3.9, 3.10, 3.11 in updates)
 (v) Fedora 19 (3.9; 3.10, 3.11, 3.12 in updates; fixed with latest update to 
 3.13), supported until TBA
 (v) Fedora 20 (3.11; 3.12 in updates; fixed with latest update to 3.13), 
 supported until TBA
 (v) RHEL 7 and derivatives (3.10), expected to be supported until about 2025
 (v) openSUSE 13.1 (3.11), expected to be supported until at least 2016-08
 (v) SLES 12 (3.12), expected to be supported until about 2024
 (v) Mageia 3 (3.8), supported until 2014-11-19
 (v) Mageia 4 (3.12), supported until 2015-08-01
 (v) Oracle Enterprise Linux with Unbreakable Enterprise Kernel Release 3 
 (3.8), supported until TBA

 Here is the analysis of the problem and what was put in the RHEL7 bug.
 The Oracle bug does not exist (as I just backport them in the kernel and
 send a GIT PULL to Jerry) - but if you would like I can certainly furnish
 you with one (it would be identical to what is mentioned below).

 If you are OK with the backport, I am volunteering Roger and Felipe to assist
 in jamming^H^H^H^Hbackporting the patches into earlier kernels.

 Summary:
 Storage performance regression when Xen backend lacks persistent-grants 
 support

 Description of problem:
 When used as a Xen guest, RHEL 7 will be slower than older releases in terms
 of storage performance. This is due to the persistent-grants feature
 introduced in xen-blkfront on the Linux Kernel 3.8 series. From 3.8 to 3.12
 (inclusive), xen-blkfront will add an extra set of memcpy() operations
 regardless of persistent-grants support in the backend (i.e. xen-blkback,
 qemu, tapdisk). This has been identified and fixed in the 3.13 kernel series,
 but was not backported to previous LTS kernels due to the nature of the bug
 (performance only).

 While persistent grants reduce the stress on the Xen grant table and allow
 for much better aggregate throughput (at the cost of an extra set of memcpy
 operations), adding the copy overhead when the feature is unsupported on
 the backend 

Re: [Xen-devel] Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)

2014-05-20 Thread Vitaly Kuznetsov
Vitaly Kuznetsov vkuzn...@redhat.com writes:

 1) ramdisks (/dev/ram*) (persistent grants and indirect descriptors
 disabled)

sorry, there was a typo. persistent grants and indirect descriptors are
enabled with ramdisks, otherwise such testing won't make any sense.

 2) /tmp/img*.img on tmpfs (persistent grants and indirect descriptors 
 disabled)


-- 
  Vitaly Kuznetsov


Re: [Xen-devel] Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)

2014-05-20 Thread Vitaly Kuznetsov
Roger Pau Monné roger@citrix.com writes:

 On 20/05/14 11:54, Vitaly Kuznetsov wrote:
 Vitaly Kuznetsov vkuzn...@redhat.com writes:
 
 1) ramdisks (/dev/ram*) (persistent grants and indirect descriptors
 disabled)
 
 sorry, there was a typo. persistent grants and indirect descriptors are
 enabled with ramdisks, otherwise such testing won't make any sense.

 I'm not sure how is that possible, from your description I get that you
 are using 3.11 on the Dom0, which means blkback has support for
 persistent grants and indirect descriptors, but the guest is RHEL7,
 that's using the 3.10 kernel AFAICT, and this kernel only has persistent
 grants implemented.

RHEL7 kernel is mostly merged with 3.11 in its Xen part, we have
indirect descriptors backported.

Actually I tried my tests with upstream (Fedora) kernel and results were
similar. I can try comparing e.g. 3.11.10 with 3.12.0 and provide exact
measurements.

-- 
  Vitaly


Re: [Xen-devel] Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)

2014-05-22 Thread Vitaly Kuznetsov
, maxb=334814KB/s, 
mint=20002msec, maxt=20002msec
   READ: io=6636.0MB, aggrb=339661KB/s, minb=339661KB/s, maxb=339661KB/s, mint=20006msec, maxt=20006msec
   READ: io=6594.0MB, aggrb=337595KB/s, minb=337595KB/s, maxb=337595KB/s, mint=20001msec, maxt=20001msec

Dumb 'dd' test shows the same:
revertall_upstream client:
# time for ntry in `seq 1 100`; do dd if=/dev/xvdc of=/dev/null bs=2048k 2> /dev/null; done

real    0m16.262s
user    0m0.189s
sys     0m7.021s

unpatched_upstream
# time for ntry in `seq 1 100`; do dd if=/dev/xvdc of=/dev/null bs=2048k 2> /dev/null; done

real    0m19.938s
user    0m0.174s
sys     0m9.489s
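
To compare the two dd runs, the values reported by `time` can be converted
to seconds first. A small Python sketch using only the figures above; the
helper names are made up for illustration:

```python
import re

# Matches the "<minutes>m<seconds>s" form printed by the shell's `time`.
_TIME_RE = re.compile(r"^(\d+)m(\d+(?:\.\d+)?)s$")


def parse_time(value):
    """Convert a string such as '0m16.262s' to seconds."""
    match = _TIME_RE.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def slowdown(before, after):
    """Relative increase of 'after' over 'before'."""
    return (after - before) / before


# Figures from the two dd runs quoted above.
reverted = {"real": parse_time("0m16.262s"), "sys": parse_time("0m7.021s")}
unpatched = {"real": parse_time("0m19.938s"), "sys": parse_time("0m9.489s")}

real_slowdown = slowdown(reverted["real"], unpatched["real"])
sys_slowdown = slowdown(reverted["sys"], unpatched["sys"])
print(f"real: +{real_slowdown:.1%}, sys: +{sys_slowdown:.1%}")
```

System time grows proportionally more than wall-clock time in these two
runs.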

I tried running newer Dom0 (3.14.4-200.fc20.x86_64) but that makes no
difference.

P.P.S. I understand this test differs a lot from what these patches were
supposed to fix and I'm not trying to say 'no' to the stable backport, but
I think this test data can be interesting as well.

And thanks, Felipe, for all your hardware hints!


 In the meantime, I stand behind that the patches need to be backported and 
 there is a regression if we don't do that.
 Ubuntu has already provided a test kernel with the patches pulled in.  I will 
 test those as soon as I get the chance (hopefully by the end of the week).
 See: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1319003

 Felipe


-- 
  Vitaly


[PATCH v2] xenpv: don't BUG when failing to setup NMI callback

2014-06-16 Thread Vitaly Kuznetsov
some old Xen hypervisors (prior to 3.2) forbid DomUs to register
NMI callbacks. E.g. we have the following code in xen-3.1:

    if ( (d->domain_id != 0) || (v->vcpu_id != 0) )
        return -EINVAL;

Commit 6efa20e49b9cb1db1ab66870cc37323474a75a13 introduced kernel
crash in case PV guest fails to register NMI callback. All x86_64
PV guests will fail to boot on top of such hypervisors (RHEL5
example):

(XEN) traps.c:405:d7 Unhandled invalid opcode fault/trap [#6] in domain 7 on VCPU 0 [ec=]
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 7 (vcpu#0) crashed on cpu#3:
(XEN) [ Xen-3.1.2-389.el5  x86_64  debug=n  Not tainted ]
(XEN) CPU:3
(XEN) RIP:e033:[81004d96]
(XEN) RFLAGS: 0282   CONTEXT: guest
(XEN) rax: ffea   rbx:    rcx: 0002
(XEN) rdx: 0001   rsi: 81b0fe28   rdi: 
(XEN) rbp: 81b0fe40   rsp: 81b0fde8   r8:  
(XEN) r9:  81b0fdd0   r10: 7ff0   r11: 
(XEN) r12: 81d65900   r13:    r14: 
(XEN) r15:    cr0: 80050033   cr4: 26b0
(XEN) cr3: 00013a263000   cr2: 
(XEN) ds:    es:    fs:    gs:    ss: e02b   cs: e033
...

However it is possible to proceed without NMI callback registered.

Changes in v2:
- skip nmi in cvt_gate_to_trap() for Xen < 3.2
- do not call xen_enable_nmi() when running under Xen < 3.2
- never BUG() in xen_enable_nmi()

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/xen/enlighten.c |  9 +
 arch/x86/xen/setup.c | 17 ++---
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index f17b292..60ec5ef 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -746,12 +746,13 @@ static int cvt_gate_to_trap(int vector, const gate_desc *val,
 */
;
 #endif
-   } else if (addr == (unsigned long)nmi)
+   } else if (addr == (unsigned long)nmi) {
/*
-* Use the native version as well.
+* Use the native version as well but require Xen >= 3.2
 */
-   ;
-   else {
+   if (!xen_running_on_version_or_later(3, 2))
+   return 0;
+   } else {
/* Some other trap using IST? */
if (WARN_ON(val->ist != 0))
return 0;
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 821a11a..fdc73d3 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -593,8 +593,14 @@ void xen_enable_syscall(void)
 void xen_enable_nmi(void)
 {
 #ifdef CONFIG_X86_64
-   if (register_callback(CALLBACKTYPE_nmi, (char *)nmi))
-   BUG();
+   int ret;
+
+   ret = register_callback(CALLBACKTYPE_nmi, (char *)nmi);
+   if (ret != 0) {
+   /* Hypervisor probably forbids us to register NMI callback or
+  some other error happened */
+   pr_warn("xen: failed to register NMI callback: %d\n", ret);
+   }
 #endif
 }
 void __init xen_pvmmu_arch_setup(void)
@@ -611,7 +617,12 @@ void __init xen_pvmmu_arch_setup(void)
 
xen_enable_sysenter();
xen_enable_syscall();
-   xen_enable_nmi();
+
+   /* Xen versions prior to 3.2 forbid DomUs to register NMI callbacks */
+   if (xen_running_on_version_or_later(3, 2))
+   xen_enable_nmi();
+   else
+   pr_warn("xen: skipping NMI callback registration for Xen < 3.2");
 }
 
 /* This function is not called for HVM domains */
-- 
1.9.3
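
For illustration only, the control flow this patch aims for can be modelled in
user space. The sketch below is not the kernel code: struct hv_version,
version_or_later(), fake_register_callback() and enable_nmi() are hypothetical
stand-ins for the hypervisor version check and the callback registration, and
the toy hypervisor is simply assumed to refuse registration before 3.2.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the hypervisor version the guest runs on. */
struct hv_version {
	unsigned int major;
	unsigned int minor;
};

/* True if 'v' is at least major.minor. */
static bool version_or_later(struct hv_version v, unsigned int major,
			     unsigned int minor)
{
	return v.major > major || (v.major == major && v.minor >= minor);
}

/* Hypothetical model of a hypervisor that rejects the request before 3.2. */
static int fake_register_callback(struct hv_version v)
{
	return version_or_later(v, 3, 2) ? 0 : -EINVAL;
}

/*
 * Returns true if the NMI callback ended up registered. It never aborts:
 * a refusal is logged and the caller carries on without the callback.
 */
static bool enable_nmi(struct hv_version v)
{
	int ret;

	if (!version_or_later(v, 3, 2)) {
		fprintf(stderr, "skipping NMI callback registration\n");
		return false;
	}

	ret = fake_register_callback(v);
	if (ret != 0) {
		fprintf(stderr, "failed to register NMI callback: %d\n", ret);
		return false;
	}
	return true;
}
```

The point of the model is that both the version gate and the error path end in
a warning rather than a crash, so boot can continue on old hypervisors.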



Re: [Xen-devel] [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-09 Thread Vitaly Kuznetsov
David Vrabel david.vra...@citrix.com writes:

 On 07/07/14 21:33, Andrew Morton wrote:
 On Mon,  7 Jul 2014 17:05:49 +0200 Vitaly Kuznetsov vkuzn...@redhat.com 
 wrote:
 
 we have a special check in read_vmcore() handler to check if the page was
 reported as ram or not by the hypervisor (pfn_is_ram()). However, when
 vmcore is read with mmap() no such check is performed. That can lead to
 unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
 mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
 enormous load in both DomU and Dom0.

 Does it make forward progress though?  Or does it end up repeatedly
 retrying the same instruction?

If memcpy is using the SSE2 optimization, the 16-byte 'movdqu' instruction
never finishes (it keeps retrying to issue two 8-byte requests to
qemu-dm). qemu-dm decides that it's hitting 'Neither RAM nor known MMIO
space' and returns 8 0xff bytes for both of these requests (I was testing
with qemu-traditional).


 Is it failing on a ballooned page in a RAM region? Or is it mapping non-RAM
 regions as well?

I wasn't using ballooning, it happens that oldmem has several (two in my
test) pages which are HVMMEM_mmio_dm but qemu-dm considers them being
neither ram nor mmio.


 Fix the issue by mapping each non-ram page to the zero page. Keep direct
 path with remap_oldmem_pfn_range() to avoid looping through all pages on
 bare metal.

 The issue can also be solved by overriding remap_oldmem_pfn_range() in
 xen-specific code, as remap_oldmem_pfn_range() was designed for.
 That, however, would involve non-obvious xen code path for all x86 builds
 with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
 code on x86 arch from doing the same override.

 The oldmem_pfn_is_ram() is Xen-specific but this problem (ballooned
 pages) must be common to KVM.  How does KVM handle this?

As far as I'm concerned the issue was never hit with KVM. I *think* the
issue has something to do with the conjunction of 16-byte 'movdqu'
emulation for io pages in the Xen hypervisor, 8-byte event channel requests
and qemu-traditional. But even if it gets fixed on the hypervisor side I
believe fixing the issue kernel-side is still worth it as there are
non-fixed hypervisors out there (e.g. AWS EC2).


 David

-- 
  Vitaly


Re: [Xen-devel] [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-09 Thread Vitaly Kuznetsov
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

 On Mon, Jul 07, 2014 at 05:05:49PM +0200, Vitaly Kuznetsov wrote:
 we have a special check in read_vmcore() handler to check if the page was
 reported as ram or not by the hypervisor (pfn_is_ram()). However, when
 vmcore is read with mmap() no such check is performed. That can lead to
 unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
 mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
 enormous load in both DomU and Dom0.
 
 Fix the issue by mapping each non-ram page to the zero page. Keep direct
 path with remap_oldmem_pfn_range() to avoid looping through all pages on
 bare metal.
 
 The issue can also be solved by overriding remap_oldmem_pfn_range() in
 xen-specific code, as remap_oldmem_pfn_range() was designed for.
 That, however, would involve non-obvious xen code path for all x86 builds
 with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
 code on x86 arch from doing the same override.

 Could the 'remap_oldmem_pfn_range' become an function ops? I see there
 is an 'register_oldmem_pfn_is_ram' - so could there be similar one for
 'pfn_range'?

yes, it is possible to replace '__weak remap_oldmem_pfn_range' with a
registration hook in the style of 'register_oldmem_pfn_is_ram'. However, the
s390 arch overrides this function in arch/s390/kernel/crash_dump.c, so we'll
have to make some changes there as well.
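
A rough user-space sketch of what such a registration hook could look like is
below. All names (remap_fn_t, register_remap_hook(), unregister_remap_hook(),
do_remap(), refusing_remap()) are hypothetical and only mirror the general
shape of a register/unregister pair; the -EBUSY return for a second
registration is an assumption of this model.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type for remapping a range of old-memory pages. */
typedef int (*remap_fn_t)(unsigned long from, unsigned long pfn,
			  unsigned long size);

/* Default path, used when no platform code registered an override. */
static int default_remap(unsigned long from, unsigned long pfn,
			 unsigned long size)
{
	(void)from;
	(void)pfn;
	(void)size;
	return 0;
}

/* Example platform hook for this sketch: it refuses every request. */
static int refusing_remap(unsigned long from, unsigned long pfn,
			  unsigned long size)
{
	(void)from;
	(void)pfn;
	(void)size;
	return -EAGAIN;
}

static remap_fn_t remap_hook; /* NULL until somebody registers a hook */

/* Only one hook at a time: a second registration is rejected. */
static int register_remap_hook(remap_fn_t fn)
{
	if (remap_hook)
		return -EBUSY;
	remap_hook = fn;
	return 0;
}

static void unregister_remap_hook(void)
{
	remap_hook = NULL;
}

/* Dispatch to the registered hook, or fall back to the default path. */
static int do_remap(unsigned long from, unsigned long pfn, unsigned long size)
{
	remap_fn_t fn = remap_hook;

	return fn ? fn(from, pfn, size) : default_remap(from, pfn, size);
}
```

Compared with a __weak symbol, the override is chosen at run time, so code that
is compiled in but not active never replaces the default path.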


 
 Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
 ---
  fs/proc/vmcore.c | 68 
 +++-
  1 file changed, 62 insertions(+), 6 deletions(-)
 
 diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
 index 382aa89..2716e19 100644
 --- a/fs/proc/vmcore.c
 +++ b/fs/proc/vmcore.c
 @@ -328,6 +328,46 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
   * virtually contiguous user-space in ELF layout.
   */
  #ifdef CONFIG_MMU
 +static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, u64 len,
 +unsigned long pfn, unsigned long page_count)
 +{
 +unsigned long pos;
 +size_t size;
 +unsigned long vma_addr;
 +unsigned long emptypage_pfn = __pa(empty_zero_page) >> PAGE_SHIFT;
 +
 +for (pos = pfn; (pos - pfn) <= page_count; pos++) {
 +if (!pfn_is_ram(pos) || (pos - pfn) == page_count) {
 +/* we hit a page which is not ram or reached the end */
 +if (pos - pfn > 0) {
 +/* remapping continuous region */
 +size = (pos - pfn) << PAGE_SHIFT;
 +vma_addr = vma->vm_start + len;
 +if (remap_oldmem_pfn_range(vma, vma_addr,
 +   pfn, size,
 +   vma->vm_page_prot))
 +return len;
 +len += size;
 +page_count -= (pos - pfn);
 +}
 +if (page_count > 0) {
 +/* we hit a page which is not ram, replacing
 +   with an empty one */
 +vma_addr = vma->vm_start + len;
 +if (remap_oldmem_pfn_range(vma, vma_addr,
 +   emptypage_pfn,
 +   PAGE_SIZE,
 +   vma->vm_page_prot))
 +return len;
 +len += PAGE_SIZE;
 +pfn = pos + 1;
 +page_count--;
 +}
 +}
 +}
 +return len;
 +}
 +
  static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
  {
  size_t size = vma->vm_end - vma->vm_start;
 @@ -383,17 +423,33 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
  
  list_for_each_entry(m, &vmcore_list, list) {
  if (start < m->offset + m->size) {
 -u64 paddr = 0;
 +u64 paddr = 0, original_len;
 +unsigned long pfn, page_count;
  
  tsz = min_t(size_t, m->offset + m->size - start, size);
  paddr = m->paddr + start - m->offset;
 -if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
 -   paddr >> PAGE_SHIFT, tsz,
 -   vma->vm_page_prot))
 -goto fail;
 +
 +/* check if oldmem_pfn_is_ram was registered to avoid
 +   looping over all pages without a reason */
 +if (oldmem_pfn_is_ram) {
 +pfn = paddr >> PAGE_SHIFT;
 +page_count = tsz >> PAGE_SHIFT

Re: [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-09 Thread Vitaly Kuznetsov
Vivek Goyal vgo...@redhat.com writes:

 On Mon, Jul 07, 2014 at 05:05:49PM +0200, Vitaly Kuznetsov wrote:
 we have a special check in read_vmcore() handler to check if the page was
 reported as ram or not by the hypervisor (pfn_is_ram()).

 I am wondering if this name pfn_is_ram() is appropriate for what we are
 doing. So IIUC, ballooned memory is also RAM, just that it has not
 been allocated yet. That means we can safely assume that there is no
 data and can safely fill it with zeros?

For Xen pfn_is_ram() returns 0 in case the page is an mmio. Ballooned
pages are also considered being mmio (HVMOP_get_mem_type returns
HVMMEM_mmio_dm).
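
As a toy model of that classification (not the Xen code itself), the check can
be thought of as a lookup of the page type followed by a yes/no answer. The
enum values, the toy_types[] table and toy_pfn_is_ram() below are hypothetical;
only the rule "mmio_dm means not RAM, everything else is RAM" is taken from the
description above.

```c
#include <stddef.h>

/* Hypothetical page types; the names only echo HVMMEM_mmio_dm from the thread. */
enum toy_mem_type { TOY_RAM_RW, TOY_RAM_RO, TOY_MMIO_DM };

/* Toy "hypervisor" view of four pages: pfn 2 is MMIO or ballooned out. */
static const enum toy_mem_type toy_types[] = {
	TOY_RAM_RW, TOY_RAM_RO, TOY_MMIO_DM, TOY_RAM_RW,
};

#define TOY_PAGES (sizeof(toy_types) / sizeof(toy_types[0]))

/* Returns 1 if the page should be dumped as RAM, 0 if it should be skipped. */
static int toy_pfn_is_ram(unsigned long pfn)
{
	if (pfn >= TOY_PAGES)
		return 1; /* pages the model knows nothing about count as RAM */
	return toy_types[pfn] != TOY_MMIO_DM;
}
```

Note that a 0 answer says nothing about the page contents, which is why a name
suggesting "zero filled" would be misleading.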


 If yes, then page_is_zero_filled() might be a more appropriate name.


It's not, as an mmio page is not always zero-filled. We just don't need
these pages in vmcore.

 Also I am wondering why it was not done as part of copy_oldmem_page()
 so that respective arch could hide all the details.


Afaiac that wouldn't solve the mmap issue I'm trying to address but we
can ask Olaf why he preferred pfn_is_ram() path.

 However, when
 vmcore is read with mmap() no such check is performed. That can lead to
 unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
 mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
 enormous load in both DomU and Dom0.
 
 Fix the issue by mapping each non-ram page to the zero page. Keep direct
 path with remap_oldmem_pfn_range() to avoid looping through all pages on
 bare metal.
 
 The issue can also be solved by overriding remap_oldmem_pfn_range() in
 xen-specific code, as remap_oldmem_pfn_range() was designed for.
 That, however, would involve non-obvious xen code path for all x86 builds
 with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
 code on x86 arch from doing the same override.

 I am not sure I understand this part. So what is all the other
 hypervisor-specific code which would like to do this? And will that code be
 compiled at the same time as CONFIG_XEN_PVHVM?


I meant to say that we have many hypervisors supported on x86. If I
override __weak remap_oldmem_pfn_range() in xen-specific code it will
*always* get executed whenever this code is compiled in. If we have to
do a similar override in e.g. Hyperv or KVM code in future we'll have a
mess (in which order do we need to execute these overrides?).

In a few words, Xen-PVHVM is not an architecture, so I'm not following the
"Architectures may override this function to map oldmem" path.

 
 Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
 ---
  fs/proc/vmcore.c | 68 
 +++-
  1 file changed, 62 insertions(+), 6 deletions(-)
 
 diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
 index 382aa89..2716e19 100644
 --- a/fs/proc/vmcore.c
 +++ b/fs/proc/vmcore.c
 @@ -328,6 +328,46 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
   * virtually contiguous user-space in ELF layout.
   */
  #ifdef CONFIG_MMU
 +static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, u64 len,
 +unsigned long pfn, unsigned long page_count)
 +{
 +unsigned long pos;
 +size_t size;
 +unsigned long vma_addr;
 +unsigned long emptypage_pfn = __pa(empty_zero_page) >> PAGE_SHIFT;
 +
 +for (pos = pfn; (pos - pfn) <= page_count; pos++) {
 +if (!pfn_is_ram(pos) || (pos - pfn) == page_count) {
 +/* we hit a page which is not ram or reached the end */
 +if (pos - pfn > 0) {
 +/* remapping continuous region */
 +size = (pos - pfn) << PAGE_SHIFT;
 +vma_addr = vma->vm_start + len;
 +if (remap_oldmem_pfn_range(vma, vma_addr,
 +   pfn, size,
 +   vma->vm_page_prot))
 +return len;
 +len += size;
 +page_count -= (pos - pfn);
 +}
 +if (page_count > 0) {
 +/* we hit a page which is not ram, replacing
 +   with an empty one */
 +vma_addr = vma->vm_start + len;
 +if (remap_oldmem_pfn_range(vma, vma_addr,
 +   emptypage_pfn,
 +   PAGE_SIZE,
 +   vma->vm_page_prot))
 +return len;
 +len += PAGE_SIZE;
 +pfn = pos + 1;
 +page_count--;
 +}
 +}
 +}
 +return len;
 +}
 +
  static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
  {
  size_t size = vma->vm_end

[PATCH v2] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-09 Thread Vitaly Kuznetsov
We have a special check in read_vmcore() handler to check if the page was
reported as ram or not by the hypervisor (pfn_is_ram()). However, when
vmcore is read with mmap() no such check is performed. That can lead to
unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
enormous load in both DomU and Dom0.

Fix the issue by mapping each non-ram page to the zero page. Keep direct
path with remap_oldmem_pfn_range() to avoid looping through all pages on
bare metal.

The issue can also be solved by overriding remap_oldmem_pfn_range() in
xen-specific code, as remap_oldmem_pfn_range() was designed for.
That, however, would involve non-obvious xen code path for all x86 builds
with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
code on x86 arch from doing the same override.

Changes from v1:
- comment style changes
- change remap_oldmem_pfn_checked() interface to closer match the
  remap_oldmem_pfn() interface
- preserve formal parameters within the loop, make the loop conditions
  easier to understand
- use my_zero_pfn() for the zero page
- return remapped length instead of new offset

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
Reviewed-by: Andrew Jones drjo...@redhat.com
---
 fs/proc/vmcore.c | 89 
 1 file changed, 84 insertions(+), 5 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 382aa89..5cd13f8 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -328,6 +328,67 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
  * virtually contiguous user-space in ELF layout.
  */
 #ifdef CONFIG_MMU
+/*
+ * remap_oldmem_pfn_checked - do remap_oldmem_pfn replacing all pages reported
+ * as not being ram with the zero page.
+ *
+ * @vma: vm_area_struct describing requested mapping
+ * @vma_addr: start remapping from
+ * @pfn: page frame number to start remapping to
+ * @size: remapping size
+ *
+ * Returns the remapped length. If no errors were hit during the remapping it
+ * should be equal to size.
+ */
+static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma,
+   unsigned long vma_addr, unsigned long pfn,
+   unsigned long size)
+{
+   size_t map_size;
+   unsigned long pos_start, pos_end, pos;
+   unsigned long zeropage_pfn = my_zero_pfn(0);
+   u64 len = 0;
+
+   pos_start = pfn;
+   pos_end = pfn + (size >> PAGE_SHIFT);
+
+   for (pos = pos_start; pos < pos_end; ++pos) {
+   if (!pfn_is_ram(pos)) {
+   /* We hit a page which is not ram. Remap the continuous
+* region between pos_start and pos-1 and replace
+* the non-ram page at pos with the zero page.
+*/
+   if (pos > pos_start) {
+   /* Remap continuous region */
+   map_size = (pos - pos_start) << PAGE_SHIFT;
+   if (remap_oldmem_pfn_range(vma, vma_addr + len,
+  pos_start, map_size,
+  vma->vm_page_prot))
+   return len;
+   len += map_size;
+   }
+   /* Remap the zero page */
+   if (remap_oldmem_pfn_range(vma, vma_addr + len,
+  zeropage_pfn,
+  PAGE_SIZE,
+  vma->vm_page_prot))
+   return len;
+   len += PAGE_SIZE;
+   pos_start = pos + 1;
+   }
+   }
+   if (pos > pos_start) {
+   /* Remap the rest */
+   map_size = (pos - pos_start) << PAGE_SHIFT;
+   if (remap_oldmem_pfn_range(vma, vma_addr + len, pos_start,
+  map_size,
+  vma->vm_page_prot))
+   return len;
+   len += map_size;
+   }
+   return len;
+}
+
 static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 {
size_t size = vma->vm_end - vma->vm_start;
@@ -387,13 +448,31 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 
tsz = min_t(size_t, m->offset + m->size - start, size);
paddr = m->paddr + start - m->offset;
-   if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
-  paddr >> PAGE_SHIFT, tsz,
-  vma->vm_page_prot))
-   goto fail

[PATCH v3] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-10 Thread Vitaly Kuznetsov
We have a special check in read_vmcore() handler to check if the page was
reported as ram or not by the hypervisor (pfn_is_ram()). However, when
vmcore is read with mmap() no such check is performed. That can lead to
unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
enormous load in both DomU and Dom0.

Fix the issue by mapping each non-ram page to the zero page. Keep direct
path with remap_oldmem_pfn_range() to avoid looping through all pages on
bare metal.

The issue can also be solved by overriding remap_oldmem_pfn_range() in
xen-specific code, as remap_oldmem_pfn_range() was designed for.
That, however, would involve non-obvious xen code path for all x86 builds
with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
code on x86 arch from doing the same override.

Changes from v2:
- make remap_oldmem_pfn_checked() interface exactly match
  remap_oldmem_pfn_range()
- unmap mapped part inside remap_oldmem_pfn_checked() in case of failure so
  we don't need to take care of it in mmap_vmcore()
- create vmcore_remap_oldmem_pfn() wrapper

Changes from v1:
- comment style changes
- change remap_oldmem_pfn_checked() interface to closer match the
  remap_oldmem_pfn() interface
- preserve formal parameters within the loop, make the loop conditions
  easier to understand
- use my_zero_pfn() for the zero page
- return remapped length instead of new offset

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
Reviewed-by: Andrew Jones drjo...@redhat.com
---
 fs/proc/vmcore.c | 82 +---
 1 file changed, 79 insertions(+), 3 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 382aa89..66dac43 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -328,6 +328,82 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
  * virtually contiguous user-space in ELF layout.
  */
 #ifdef CONFIG_MMU
+/*
+ * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages
+ * reported as not being ram with the zero page.
+ *
+ * @vma: vm_area_struct describing requested mapping
+ * @from: start remapping from
+ * @pfn: page frame number to start remapping to
+ * @size: remapping size
+ * @prot: protection bits
+ *
+ * Returns zero on success, -EAGAIN on failure.
+ */
+int remap_oldmem_pfn_checked(struct vm_area_struct *vma, unsigned long from,
+unsigned long pfn, unsigned long size,
+pgprot_t prot)
+{
+   size_t map_size;
+   unsigned long pos_start, pos_end, pos;
+   unsigned long zeropage_pfn = my_zero_pfn(0);
+   u64 len = 0;
+
+   pos_start = pfn;
+   pos_end = pfn + (size >> PAGE_SHIFT);
+
+   for (pos = pos_start; pos < pos_end; ++pos) {
+   if (!pfn_is_ram(pos)) {
+   /* We hit a page which is not ram. Remap the continuous
+* region between pos_start and pos-1 and replace
+* the non-ram page at pos with the zero page.
+*/
+   if (pos > pos_start) {
+   /* Remap continuous region */
+   map_size = (pos - pos_start) << PAGE_SHIFT;
+   if (remap_oldmem_pfn_range(vma, from + len,
+  pos_start, map_size,
+  prot))
+   goto fail;
+   len += map_size;
+   }
+   /* Remap the zero page */
+   if (remap_oldmem_pfn_range(vma, from + len,
+  zeropage_pfn,
+  PAGE_SIZE,
+  prot))
+   goto fail;
+   len += PAGE_SIZE;
+   pos_start = pos + 1;
+   }
+   }
+   if (pos > pos_start) {
+   /* Remap the rest */
+   map_size = (pos - pos_start) << PAGE_SHIFT;
+   if (remap_oldmem_pfn_range(vma, from + len, pos_start,
+  map_size,
+  vma->vm_page_prot))
+   goto fail;
+   len += map_size;
+   }
+   return 0;
+fail:
+   do_munmap(vma->vm_mm, from, len);
+   return -EAGAIN;
+}
+
+int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma,
+   unsigned long from, unsigned long pfn,
+   unsigned long size, pgprot_t prot)
+{
+   /* Check if oldmem_pfn_is_ram was registered to avoid
+  looping over all pages without a reason. */
+   if (oldmem_pfn_is_ram)
+   return remap_oldmem_pfn_checked

[PATCH v4] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-11 Thread Vitaly Kuznetsov
We have a special check in read_vmcore() handler to check if the page was
reported as ram or not by the hypervisor (pfn_is_ram()). However, when
vmcore is read with mmap() no such check is performed. That can lead to
unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
enormous load in both DomU and Dom0.

Fix the issue by mapping each non-ram page to the zero page. Keep direct
path with remap_oldmem_pfn_range() to avoid looping through all pages on
bare metal.

The issue can also be solved by overriding remap_oldmem_pfn_range() in
xen-specific code, as remap_oldmem_pfn_range() was designed for.
That, however, would involve non-obvious xen code path for all x86 builds
with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
code on x86 arch from doing the same override.

Changes from v3:
- multi line comment style changes
- minor code style changes

Changes from v2:
- make remap_oldmem_pfn_checked() interface exactly match
  remap_oldmem_pfn_range()
- unmap mapped part inside remap_oldmem_pfn_checked() in case of failure so
  we don't need to take care of it in mmap_vmcore()
- create vmcore_remap_oldmem_pfn() wrapper

Changes from v1:
- comment style changes
- change remap_oldmem_pfn_checked() interface to closer match the
  remap_oldmem_pfn() interface
- preserve formal parameters within the loop, make the loop conditions
  easier to understand
- use my_zero_pfn() for the zero page
- return remapped length instead of new offset

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
Reviewed-by: Andrew Jones drjo...@redhat.com
---
 fs/proc/vmcore.c | 83 ++--
 1 file changed, 80 insertions(+), 3 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 382aa89..405a409 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -328,6 +328,83 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
  * virtually contiguous user-space in ELF layout.
  */
 #ifdef CONFIG_MMU
+/*
+ * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages
+ * reported as not being ram with the zero page.
+ *
+ * @vma: vm_area_struct describing requested mapping
+ * @from: start remapping from
+ * @pfn: page frame number to start remapping to
+ * @size: remapping size
+ * @prot: protection bits
+ *
+ * Returns zero on success, -EAGAIN on failure.
+ */
+int remap_oldmem_pfn_checked(struct vm_area_struct *vma, unsigned long from,
+unsigned long pfn, unsigned long size,
+pgprot_t prot)
+{
+   size_t map_size;
+   unsigned long pos_start, pos_end, pos;
+   unsigned long zeropage_pfn = my_zero_pfn(0);
+   u64 len = 0;
+
+   pos_start = pfn;
+   pos_end = pfn + (size >> PAGE_SHIFT);
+
+   for (pos = pos_start; pos < pos_end; ++pos) {
+   if (!pfn_is_ram(pos)) {
+   /*
+* We hit a page which is not ram. Remap the continuous
+* region between pos_start and pos-1 and replace
+* the non-ram page at pos with the zero page.
+*/
+   if (pos > pos_start) {
+   /* Remap continuous region */
+   map_size = (pos - pos_start) << PAGE_SHIFT;
+   if (remap_oldmem_pfn_range(vma, from + len,
+  pos_start, map_size,
+  prot))
+   goto fail;
+   len += map_size;
+   }
+   /* Remap the zero page */
+   if (remap_oldmem_pfn_range(vma, from + len,
+  zeropage_pfn,
+  PAGE_SIZE, prot))
+   goto fail;
+   len += PAGE_SIZE;
+   pos_start = pos + 1;
+   }
+   }
+   if (pos > pos_start) {
+   /* Remap the rest */
+   map_size = (pos - pos_start) << PAGE_SHIFT;
+   if (remap_oldmem_pfn_range(vma, from + len, pos_start,
+  map_size, vma->vm_page_prot))
+   goto fail;
+   len += map_size;
+   }
+   return 0;
+fail:
+   do_munmap(vma->vm_mm, from, len);
+   return -EAGAIN;
+}
+
+int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma,
+   unsigned long from, unsigned long pfn,
+   unsigned long size, pgprot_t prot)
+{
+   /*
+* Check if oldmem_pfn_is_ram was registered to avoid
+* looping over all pages without a reason.
+*/
+   if (oldmem_pfn_is_ram

[PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-07 Thread Vitaly Kuznetsov
we have a special check in read_vmcore() handler to check if the page was
reported as ram or not by the hypervisor (pfn_is_ram()). However, when
vmcore is read with mmap() no such check is performed. That can lead to
unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
enormous load in both DomU and Dom0.

Fix the issue by mapping each non-ram page to the zero page. Keep direct
path with remap_oldmem_pfn_range() to avoid looping through all pages on
bare metal.

The issue can also be solved by overriding remap_oldmem_pfn_range() in
xen-specific code, as remap_oldmem_pfn_range() was designed for.
That, however, would involve non-obvious xen code path for all x86 builds
with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
code on x86 arch from doing the same override.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 fs/proc/vmcore.c | 68 +++-
 1 file changed, 62 insertions(+), 6 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 382aa89..2716e19 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -328,6 +328,46 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
  * virtually contiguous user-space in ELF layout.
  */
 #ifdef CONFIG_MMU
+static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, u64 len,
+   unsigned long pfn, unsigned long page_count)
+{
+   unsigned long pos;
+   size_t size;
+   unsigned long vma_addr;
+   unsigned long emptypage_pfn = __pa(empty_zero_page) >> PAGE_SHIFT;
+
+   for (pos = pfn; (pos - pfn) <= page_count; pos++) {
+   if (!pfn_is_ram(pos) || (pos - pfn) == page_count) {
+   /* we hit a page which is not ram or reached the end */
+   if (pos - pfn > 0) {
+   /* remapping continuous region */
+   size = (pos - pfn) << PAGE_SHIFT;
+   vma_addr = vma->vm_start + len;
+   if (remap_oldmem_pfn_range(vma, vma_addr,
+  pfn, size,
+  vma->vm_page_prot))
+   return len;
+   len += size;
+   page_count -= (pos - pfn);
+   }
+   if (page_count > 0) {
+   /* we hit a page which is not ram, replacing
+  with an empty one */
+   vma_addr = vma->vm_start + len;
+   if (remap_oldmem_pfn_range(vma, vma_addr,
+  emptypage_pfn,
+  PAGE_SIZE,
+  vma->vm_page_prot))
+   return len;
+   len += PAGE_SIZE;
+   pfn = pos + 1;
+   page_count--;
+   }
+   }
+   }
+   return len;
+}
+
 static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 {
size_t size = vma->vm_end - vma->vm_start;
@@ -383,17 +423,33 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 
list_for_each_entry(m, &vmcore_list, list) {
if (start < m->offset + m->size) {
-   u64 paddr = 0;
+   u64 paddr = 0, original_len;
+   unsigned long pfn, page_count;
 
tsz = min_t(size_t, m->offset + m->size - start, size);
paddr = m->paddr + start - m->offset;
-   if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
-  paddr >> PAGE_SHIFT, tsz,
-  vma->vm_page_prot))
-   goto fail;
+
+   /* check if oldmem_pfn_is_ram was registered to avoid
+  looping over all pages without a reason */
+   if (oldmem_pfn_is_ram) {
+   pfn = paddr >> PAGE_SHIFT;
+   page_count = tsz >> PAGE_SHIFT;
+   original_len = len;
+   len = remap_oldmem_pfn_checked(vma, len, pfn,
+  page_count);
+   if (len != original_len + tsz)
+   goto fail;
+   } else {
+   if (remap_oldmem_pfn_range(vma,
+  vma

Re: [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-08 Thread Vitaly Kuznetsov
Andrew Morton a...@linux-foundation.org writes:

 On Mon,  7 Jul 2014 17:05:49 +0200 Vitaly Kuznetsov vkuzn...@redhat.com 
 wrote:

 we have a special check in read_vmcore() handler to check if the page was
 reported as ram or not by the hypervisor (pfn_is_ram()). However, when
 vmcore is read with mmap() no such check is performed. That can lead to
 unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
 mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
 enormous load in both DomU and Dom0.
 
 Fix the issue by mapping each non-ram page to the zero page. Keep direct
 path with remap_oldmem_pfn_range() to avoid looping through all pages on
 bare metal.
 
 The issue can also be solved by overriding remap_oldmem_pfn_range() in
 xen-specific code, as remap_oldmem_pfn_range() was designed for.
 That, however, would involve non-obvious xen code path for all x86 builds
 with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
 code on x86 arch from doing the same override.

 I'd like to get some reviewed-by's and tested-by's on this one please.


This patch can be tested with Xen PVHVM guest only as it is the only
platform which registers oldmem_pfn_is_ram atm.

 --- a/fs/proc/vmcore.c
 +++ b/fs/proc/vmcore.c
 @@ -328,6 +328,46 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
   * virtually contiguous user-space in ELF layout.
   */
  #ifdef CONFIG_MMU
 +static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, u64 len,
 +unsigned long pfn, unsigned long page_count)
 +{
 +unsigned long pos;
 +size_t size;
 +unsigned long vma_addr;
 +unsigned long emptypage_pfn = __pa(empty_zero_page) >> PAGE_SHIFT;

 That's old-school.  Can we use my_zero_pfn() here?

 Also, zeropage_pfn is a better name - let's not introduce the
 hitherto unknown concept of an empty page.


Sure!

 +for (pos = pfn; (pos - pfn) <= page_count; pos++) {
 +if (!pfn_is_ram(pos) || (pos - pfn) == page_count) {
 +/* we hit a page which is not ram or reached the end */
 +if (pos - pfn > 0) {
 +/* remapping continuous region */
 +size = (pos - pfn) << PAGE_SHIFT;
 +vma_addr = vma->vm_start + len;
 +if (remap_oldmem_pfn_range(vma, vma_addr,
 +   pfn, size,
 +   vma->vm_page_prot))
 +return len;
 +len += size;
 +page_count -= (pos - pfn);
 +}
 +if (page_count > 0) {
 +/* we hit a page which is not ram, replacing
 +   with an empty one */

 I suggest

   /*
* We hit a page which is not ram.  Replace it
* with the zero page.
*/


:-)

 +vma_addr = vma->vm_start + len;
 +if (remap_oldmem_pfn_range(vma, vma_addr,
 +   emptypage_pfn,
 +   PAGE_SIZE,
 +   vma->vm_page_prot))
 +return len;
 +len += PAGE_SIZE;
 +pfn = pos + 1;
 +page_count--;
 +}
 +}
 +}
 +return len;
 +}

 Also, this loop seems unnecessarily hard to follow.  It *looks* like the
 `for' statement has an off-by-one because of the <=, but page_count
 is modified inside the loop!  Despite it being an incoming formal
 argument.

There is no off-by-one error here (I believe) as I'm checking two
possible conditions to remap the continuous region:
1) We hit a non-ram page
2) We reached the end
so we exclude the page pos is pointing us to in both cases.

I tried to avoid code duplication e.g. having one 'remap continuous
region' inside the loop to do the remapping when we hit a non-ram page
and having the other one outside the loop to remap the tail.


 None of this is made any easier by the function's lack of
 documentation.  Some description of the incoming args would help, along
 with an overall description of the function's responsibilities.

 That being said, can't we just do something nice and simple like

   pos = pfn;
   while (pos < pfn + page_count) {
   stuff which advances `pos'
   }

 ?


I completely agree it's possible to make this code easier to
understand, will do.
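
The loop shape suggested above can be tried outside the kernel. The following is only a user-space sketch: demo_pfn_is_ram(), struct demo_chunk and demo_split_range() are hypothetical names invented for illustration, and the "every fourth pfn is not ram" rule is an arbitrary stand-in for what a hypervisor would report. It walks the range with a single advancing `pos`, never modifies its arguments, and records the chunks a caller would pass to the remap helper.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for pfn_is_ram(): every pfn divisible by 4 is "not ram". */
static bool demo_pfn_is_ram(unsigned long pfn)
{
	return (pfn % 4) != 0;
}

/* One chunk a caller would hand to the remap helper. */
struct demo_chunk {
	unsigned long start_pfn; /* first pfn covered by the chunk */
	unsigned long nr_pages;  /* number of pages in the chunk */
	bool zero_page;          /* true: map the zero page instead */
};

/*
 * Split [pfn, pfn + page_count) into runs of ram pages and single
 * non-ram pages. The arguments are left untouched; only `pos` advances.
 * Returns the number of chunks written to `out` (at most `max`).
 */
static size_t demo_split_range(unsigned long pfn, unsigned long page_count,
			       struct demo_chunk *out, size_t max)
{
	unsigned long pos = pfn;
	unsigned long end = pfn + page_count;
	size_t n = 0;

	while (pos < end && n < max) {
		unsigned long run = pos;

		if (!demo_pfn_is_ram(pos)) {
			/* Non-ram page: one zero-page chunk. */
			out[n++] = (struct demo_chunk){ pos, 1, true };
			pos++;
			continue;
		}
		/* Extend the run while pages are ram and inside the range. */
		while (run < end && demo_pfn_is_ram(run))
			run++;
		out[n++] = (struct demo_chunk){ pos, run - pos, false };
		pos = run;
	}
	return n;
}
```

With this shape the tail of the range needs no special case: the inner loop stops at `end`, so the last ram run is emitted by the same code path as any other run.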

  static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
  {
  size_t size = vma->vm_end - vma->vm_start;
 @@ -383,17 +423,33 @@ static

[PATCH v5] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-14 Thread Vitaly Kuznetsov
We have a special check in read_vmcore() handler to check if the page was
reported as ram or not by the hypervisor (pfn_is_ram()). However, when
vmcore is read with mmap() no such check is performed. That can lead to
unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
enormous load in both DomU and Dom0.

Fix the issue by mapping each non-ram page to the zero page. Keep direct
path with remap_oldmem_pfn_range() to avoid looping through all pages on
bare metal.

The issue can also be solved by overriding remap_oldmem_pfn_range() in
xen-specific code, as remap_oldmem_pfn_range() was designed for.
That, however, would involve non-obvious xen code path for all x86 builds
with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
code on x86 arch from doing the same override.

Changes from v4:
- change map_size type size_t -> unsigned long
- use prot instead of vma->vm_page_prot inside remap_oldmem_pfn_checked()

Changes from v3:
- multi line comment style changes
- minor code style changes

Changes from v2:
- make remap_oldmem_pfn_checked() interface exactly match
  remap_oldmem_pfn_range()
- unmap mapped part inside remap_oldmem_pfn_checked() in case of failure so
  we don't need to take care of it in mmap_vmcore()
- create vmcore_remap_oldmem_pfn() wrapper

Changes from v1:
- comment style changes
- change remap_oldmem_pfn_checked() interface to closer match the
  remap_oldmem_pfn() interface
- preserve formal parameters within the loop, make the loop conditions
  easier to understand
- use my_zero_pfn() for the zero page
- return remapped length instead of new offset

Reviewed-by: Andrew Jones drjo...@redhat.com
Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 fs/proc/vmcore.c | 83 ++--
 1 file changed, 80 insertions(+), 3 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 382aa89..1f77f35 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -328,6 +328,83 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
  * virtually contiguous user-space in ELF layout.
  */
 #ifdef CONFIG_MMU
+/*
+ * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages
+ * reported as not being ram with the zero page.
+ *
+ * @vma: vm_area_struct describing requested mapping
+ * @from: start remapping from
+ * @pfn: page frame number to start remapping to
+ * @size: remapping size
+ * @prot: protection bits
+ *
+ * Returns zero on success, -EAGAIN on failure.
+ */
+int remap_oldmem_pfn_checked(struct vm_area_struct *vma, unsigned long from,
+unsigned long pfn, unsigned long size,
+pgprot_t prot)
+{
+   unsigned long map_size;
+   unsigned long pos_start, pos_end, pos;
+   unsigned long zeropage_pfn = my_zero_pfn(0);
+   u64 len = 0;
+
+   pos_start = pfn;
+   pos_end = pfn + (size >> PAGE_SHIFT);
+
+   for (pos = pos_start; pos < pos_end; ++pos) {
+   if (!pfn_is_ram(pos)) {
+   /*
+* We hit a page which is not ram. Remap the continuous
+* region between pos_start and pos-1 and replace
+* the non-ram page at pos with the zero page.
+*/
+   if (pos > pos_start) {
+   /* Remap continuous region */
+   map_size = (pos - pos_start) << PAGE_SHIFT;
+   if (remap_oldmem_pfn_range(vma, from + len,
+  pos_start, map_size,
+  prot))
+   goto fail;
+   len += map_size;
+   }
+   /* Remap the zero page */
+   if (remap_oldmem_pfn_range(vma, from + len,
+  zeropage_pfn,
+  PAGE_SIZE, prot))
+   goto fail;
+   len += PAGE_SIZE;
+   pos_start = pos + 1;
+   }
+   }
+   if (pos > pos_start) {
+   /* Remap the rest */
+   map_size = (pos - pos_start) << PAGE_SHIFT;
+   if (remap_oldmem_pfn_range(vma, from + len, pos_start,
+  map_size, prot))
+   goto fail;
+   len += map_size;
+   }
+   return 0;
+fail:
+   do_munmap(vma->vm_mm, from, len);
+   return -EAGAIN;
+}
+
+int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma,
+   unsigned long from, unsigned long pfn,
+   unsigned long size, pgprot_t prot)
+{
+   /*
+* Check if oldmem_pfn_is_ram

Re: [PATCH v5] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-14 Thread Vitaly Kuznetsov
HATAYAMA, Daisuke d.hatay...@jp.fujitsu.com writes:

 (2014/07/14 18:16), Vitaly Kuznetsov wrote:
 We have a special check in read_vmcore() handler to check if the page was
 reported as ram or not by the hypervisor (pfn_is_ram()). However, when
 vmcore is read with mmap() no such check is performed. That can lead to
 unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
 mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
 enormous load in both DomU and Dom0.
 
 Fix the issue by mapping each non-ram page to the zero page. Keep direct
 path with remap_oldmem_pfn_range() to avoid looping through all pages on
 bare metal.
 
 The issue can also be solved by overriding remap_oldmem_pfn_range() in
 xen-specific code, as remap_oldmem_pfn_range() was designed for.
 That, however, would involve non-obvious xen code path for all x86 builds
 with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
 code on x86 arch from doing the same override.
 
 Changes from v4:
 - change map_size type size_t -> unsigned long
 - use prot instead of vma->vm_page_prot inside remap_oldmem_pfn_checked()
 
 Changes from v3:
 - multi line comment style changes
 - minor code style changes
 
 Changes from v2:
 - make remap_oldmem_pfn_checked() interface exactly match
remap_oldmem_pfn_range()
 - unmap mapped part inside remap_oldmem_pfn_checked() in case of failure so
we don't need to take care of it in mmap_vmcore()
 - create vmcore_remap_oldmem_pfn() wrapper
 
 Changes from v1:
 - comment style changes
 - change remap_oldmem_pfn_checked() interface to closer match the
remap_oldmem_pfn() interface
 - preserve formal parameters within the loop, make the loop conditions
easier to understand
 - use my_zero_pfn() for the zero page
 - return remapped length instead of new offset
 
 Reviewed-by: Andrew Jones drjo...@redhat.com
 Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
 ---
   fs/proc/vmcore.c | 83 
 ++--
   1 file changed, 80 insertions(+), 3 deletions(-)
 
 diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
 index 382aa89..1f77f35 100644
 --- a/fs/proc/vmcore.c
 +++ b/fs/proc/vmcore.c
 @@ -328,6 +328,83 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
* virtually contiguous user-space in ELF layout.
*/
   #ifdef CONFIG_MMU
 +/*
 + * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages
 + * reported as not being ram with the zero page.
 + *
 + * @vma: vm_area_struct describing requested mapping
 + * @from: start remapping from
 + * @pfn: page frame number to start remapping to
 + * @size: remapping size
 + * @prot: protection bits
 + *
 + * Returns zero on success, -EAGAIN on failure.
 + */
 +int remap_oldmem_pfn_checked(struct vm_area_struct *vma, unsigned long from,
 + unsigned long pfn, unsigned long size,
 + pgprot_t prot)
 +{
 +unsigned long map_size;
 +unsigned long pos_start, pos_end, pos;
 +unsigned long zeropage_pfn = my_zero_pfn(0);
 +u64 len = 0;

 Sorry, I missed this yesterday. 

Thanks for your review!

 This should also be fixed as size_t or unsigned long.
 Does 32-bit compiler warn about this at the call of do_munmap() below
 due to difference of bit length of the two types?

Mine doesn't. But you're right, it makes sense to make it match
do_munmap()'s interface and len there is size_t. I'll send v6 with this change.
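
As an aside on why no warning showed up: converting a 64-bit unsigned value to a narrower unsigned type is well defined in C, and GCC and Clang only report it with -Wconversion, which -Wall does not enable. The helper below is a user-space sketch, not kernel code; demo_len_fits_size_t() is a hypothetical name and uint64_t stands in for u64. It checks whether a 64-bit length would survive being passed through a size_t parameter unchanged.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns true when `len` keeps its value after conversion to size_t.
 * On a build where size_t is 64 bits wide this is always true; where
 * size_t is 32 bits wide, anything above SIZE_MAX is silently truncated
 * by the conversion and the check returns false.
 */
static bool demo_len_fits_size_t(uint64_t len)
{
	return (uint64_t)(size_t)len == len;
}
```

Declaring the counter with the same type as the callee's parameter, as proposed above, removes the question altogether.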


 +
 +pos_start = pfn;
 +pos_end = pfn + (size >> PAGE_SHIFT);
 +
 +for (pos = pos_start; pos < pos_end; ++pos) {
 +if (!pfn_is_ram(pos)) {
 +/*
 + * We hit a page which is not ram. Remap the continuous
 + * region between pos_start and pos-1 and replace
 + * the non-ram page at pos with the zero page.
 + */
 +if (pos > pos_start) {
 +/* Remap continuous region */
 +map_size = (pos - pos_start) << PAGE_SHIFT;
 +if (remap_oldmem_pfn_range(vma, from + len,
 +   pos_start, map_size,
 +   prot))
 +goto fail;
 +len += map_size;
 +}
 +/* Remap the zero page */
 +if (remap_oldmem_pfn_range(vma, from + len,
 +   zeropage_pfn,
 +   PAGE_SIZE, prot))
 +goto fail;
 +len += PAGE_SIZE;
 +pos_start = pos + 1;
 +}
 +}
 +if (pos > pos_start) {
 +/* Remap the rest */
 +map_size = (pos - pos_start) << PAGE_SHIFT;
 +if (remap_oldmem_pfn_range(vma, from

[PATCH v6] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-14 Thread Vitaly Kuznetsov
We have a special check in read_vmcore() handler to check if the page was
reported as ram or not by the hypervisor (pfn_is_ram()). However, when
vmcore is read with mmap() no such check is performed. That can lead to
unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
enormous load in both DomU and Dom0.

Fix the issue by mapping each non-ram page to the zero page. Keep direct
path with remap_oldmem_pfn_range() to avoid looping through all pages on
bare metal.

The issue can also be solved by overriding remap_oldmem_pfn_range() in
xen-specific code, as remap_oldmem_pfn_range() was designed for.
That, however, would involve non-obvious xen code path for all x86 builds
with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
code on x86 arch from doing the same override.

Changes from v5:
- make len size_t to match do_munmap() interface

Changes from v4:
- change map_size type size_t -> unsigned long
- use prot instead of vma->vm_page_prot inside remap_oldmem_pfn_checked()

Changes from v3:
- multi line comment style changes
- minor code style changes

Changes from v2:
- make remap_oldmem_pfn_checked() interface exactly match
  remap_oldmem_pfn_range()
- unmap mapped part inside remap_oldmem_pfn_checked() in case of failure so
  we don't need to take care of it in mmap_vmcore()
- create vmcore_remap_oldmem_pfn() wrapper

Changes from v1:
- comment style changes
- change remap_oldmem_pfn_checked() interface to closer match the
  remap_oldmem_pfn() interface
- preserve formal parameters within the loop, make the loop conditions
  easier to understand
- use my_zero_pfn() for the zero page
- return remapped length instead of new offset

Reviewed-by: Andrew Jones drjo...@redhat.com
Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 fs/proc/vmcore.c | 83 ++--
 1 file changed, 80 insertions(+), 3 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 382aa89..fa45923 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -328,6 +328,83 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
  * virtually contiguous user-space in ELF layout.
  */
 #ifdef CONFIG_MMU
+/*
+ * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages
+ * reported as not being ram with the zero page.
+ *
+ * @vma: vm_area_struct describing requested mapping
+ * @from: start remapping from
+ * @pfn: page frame number to start remapping to
+ * @size: remapping size
+ * @prot: protection bits
+ *
+ * Returns zero on success, -EAGAIN on failure.
+ */
+int remap_oldmem_pfn_checked(struct vm_area_struct *vma, unsigned long from,
+unsigned long pfn, unsigned long size,
+pgprot_t prot)
+{
+   unsigned long map_size;
+   unsigned long pos_start, pos_end, pos;
+   unsigned long zeropage_pfn = my_zero_pfn(0);
+   size_t len = 0;
+
+   pos_start = pfn;
+   pos_end = pfn + (size >> PAGE_SHIFT);
+
+   for (pos = pos_start; pos < pos_end; ++pos) {
+   if (!pfn_is_ram(pos)) {
+   /*
+* We hit a page which is not ram. Remap the continuous
+* region between pos_start and pos-1 and replace
+* the non-ram page at pos with the zero page.
+*/
+   if (pos > pos_start) {
+   /* Remap continuous region */
+   map_size = (pos - pos_start) << PAGE_SHIFT;
+   if (remap_oldmem_pfn_range(vma, from + len,
+  pos_start, map_size,
+  prot))
+   goto fail;
+   len += map_size;
+   }
+   /* Remap the zero page */
+   if (remap_oldmem_pfn_range(vma, from + len,
+  zeropage_pfn,
+  PAGE_SIZE, prot))
+   goto fail;
+   len += PAGE_SIZE;
+   pos_start = pos + 1;
+   }
+   }
+   if (pos > pos_start) {
+   /* Remap the rest */
+   map_size = (pos - pos_start) << PAGE_SHIFT;
+   if (remap_oldmem_pfn_range(vma, from + len, pos_start,
+  map_size, prot))
+   goto fail;
+   len += map_size;
+   }
+   return 0;
+fail:
+   do_munmap(vma->vm_mm, from, len);
+   return -EAGAIN;
+}
+
+int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma,
+   unsigned long from, unsigned long pfn,
+   unsigned long size

[PATCH RFC 4/4] xen/pvhvm: Make MSI IRQs work after kexec

2014-07-15 Thread Vitaly Kuznetsov
When kexec was performed, MSI IRQs for passed-through devices were already
mapped and we see a non-zero pirq extracted from the MSI msg. xen_irq_from_pirq()
fails as we have no IRQ mapping information for that. Requesting a new
mapping with __write_msi_msg() does not result in the MSI IRQ being remapped, so
we don't receive these IRQs.

RFC: I wasn't able to understand why commit af42b8d1, which introduced the
xen_irq_from_pirq() check in xen_hvm_setup_msi_irqs(), is checking that instead
of checking pirq > 0, as if the mapping was already done (and we have pirq > 0
here) we don't need to request a new pirq. We're losing the existing PIRQ and
I'm also not sure when __write_msi_msg() with a new PIRQ will result in a new
mapping.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/pci/xen.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 905956f..685e8f1 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -231,8 +231,7 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
__read_msi_msg(msidesc, &msg);
pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
-   if (msg.data != XEN_PIRQ_MSI_DATA ||
-   xen_irq_from_pirq(pirq) < 0) {
+   if (msg.data != XEN_PIRQ_MSI_DATA || pirq <= 0) {
pirq = xen_allocate_pirq_msi(dev, msidesc);
if (pirq < 0) {
irq = -ENODEV;
-- 
1.9.3
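
For readers following the pirq discussion in this patch: the pirq is carried inside the MSI address, with its upper bits in address_hi and its low 8 bits in a field of address_lo. The sketch below is a user-space illustration only. All DEMO_* constants, struct demo_msi_msg and both functions are hypothetical names; the shift, mask and base values are assumptions modeled on the x86 MSI address layout rather than copied from a particular kernel tree.

```c
#include <stdint.h>

/* Assumed layout constants; illustrative stand-ins only. */
#define DEMO_MSI_ADDR_BASE_LO        0xfee00000u
#define DEMO_MSI_ADDR_DEST_ID_SHIFT  12
#define DEMO_MSI_ADDR_DEST_ID_MASK   0x000ff000u
#define DEMO_MSI_ADDR_EXT_DEST_ID(x) ((x) & 0xffffff00u)

struct demo_msi_msg {
	uint32_t address_lo;
	uint32_t address_hi;
};

/* Pack a pirq into the two address words. */
static struct demo_msi_msg demo_encode_pirq(uint32_t pirq)
{
	struct demo_msi_msg msg;

	msg.address_hi = DEMO_MSI_ADDR_EXT_DEST_ID(pirq);
	msg.address_lo = DEMO_MSI_ADDR_BASE_LO |
		((pirq << DEMO_MSI_ADDR_DEST_ID_SHIFT) &
		 DEMO_MSI_ADDR_DEST_ID_MASK);
	return msg;
}

/* Recover the pirq: upper bits from address_hi, low 8 bits from address_lo. */
static uint32_t demo_decode_pirq(const struct demo_msi_msg *msg)
{
	return DEMO_MSI_ADDR_EXT_DEST_ID(msg->address_hi) |
	       ((msg->address_lo >> DEMO_MSI_ADDR_DEST_ID_SHIFT) & 0xff);
}
```

A message left over from the previous kernel therefore decodes to the old non-zero pirq, which is the value the RFC proposes to test directly instead of looking it up in the (now empty) IRQ mapping.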



[PATCH RFC 0/4] xen/pvhvm: fix shared_info and pirq issues with kexec

2014-07-15 Thread Vitaly Kuznetsov
With this patch series I'm trying to address several issues with kexec on pvhvm:
- shared_info issue (1st patch, just sending Olaf's work with Konrad's fix)
- create specific pvhvm shutdown handler for kexec (2nd patch)
- GSI PIRQ issue (3rd patch, I'm pretty confident that it does the right thing)
- MSI PIRQ issue (4th patch, and I'm not sure it doesn't break anything - RFC)

This patch series can be tested on single vCPU guest. We still have SMP issues
with pvhvm guests and kexec which require additional fixes.

Olaf Hering (1):
  xen PVonHVM: use E820_Reserved area for shared_info

Vitaly Kuznetsov (3):
  xen/pvhvm: Introduce xen_pvhvm_kexec_shutdown()
  xen/pvhvm: Unmap all PIRQs on startup and shutdown
  xen/pvhvm: Make MSI IRQs work after kexec

 arch/x86/pci/xen.c   |  3 +-
 arch/x86/xen/enlighten.c | 83 +++-
 arch/x86/xen/smp.c   | 10 +
 arch/x86/xen/smp.h   |  1 +
 arch/x86/xen/suspend.c   |  2 +-
 arch/x86/xen/xen-ops.h   |  2 +-
 drivers/xen/events/events_base.c | 76 
 include/xen/events.h |  3 ++
 8 files changed, 158 insertions(+), 22 deletions(-)

-- 
1.9.3



[PATCH RFC 1/4] xen PVonHVM: use E820_Reserved area for shared_info

2014-07-15 Thread Vitaly Kuznetsov
From: Olaf Hering o...@aepfle.de

This is a respin of 00e37bdb0113a98408de42db85be002f21dbffd3
(xen PVonHVM: move shared_info to MMIO before kexec).

Currently kexec in a PVonHVM guest fails with a triple fault because the
new kernel overwrites the shared info page. The exact failure depends on
the size of the kernel image. This patch moves the pfn from RAM into an
E820 reserved memory area.

The pfn containing the shared_info is located somewhere in RAM. This will
cause trouble if the current kernel is doing a kexec boot into a new
kernel. The new kernel (and its startup code) can not know where the pfn
is, so it can not reserve the page. The hypervisor will continue to update
the pfn, and as a result memory corruption occurs in the new kernel.

The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the
E820 map. Within that range newer toolstacks (4.3+) will keep 1MB
starting from FE700000 as reserved for guest use. Older Xen4 toolstacks
will usually not allocate areas up to FE700000, so FE700000 is expected
to work also with older toolstacks.

In Xen3 there is no reserved area at a fixed location. If the guest is
started on such old hosts the shared_info page will be placed in RAM. As
a result kexec can not be used.

Signed-off-by: Olaf Hering o...@aepfle.de
Signed-off-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
(cherry picked from commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f)

[On resume we need to reset the xen_vcpu_info, which the original
patch did not do]

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/xen/enlighten.c | 74 
 arch/x86/xen/suspend.c   |  2 +-
 arch/x86/xen/xen-ops.h   |  2 +-
 3 files changed, 58 insertions(+), 20 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index ffb101e..a11af62 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1726,23 +1726,29 @@ asmlinkage __visible void __init xen_start_kernel(void)
 #endif
 }
 
-void __ref xen_hvm_init_shared_info(void)
+#ifdef CONFIG_XEN_PVHVM
+#define HVM_SHARED_INFO_ADDR 0xFE700000UL
+static struct shared_info *xen_hvm_shared_info;
+static unsigned long xen_hvm_sip_phys;
+static int xen_major, xen_minor;
+
+static void xen_hvm_connect_shared_info(unsigned long pfn)
 {
-   int cpu;
struct xen_add_to_physmap xatp;
-   static struct shared_info *shared_info_page = 0;
 
-   if (!shared_info_page)
-   shared_info_page = (struct shared_info *)
-   extend_brk(PAGE_SIZE, PAGE_SIZE);
xatp.domid = DOMID_SELF;
xatp.idx = 0;
xatp.space = XENMAPSPACE_shared_info;
-   xatp.gpfn = __pa(shared_info_page) >> PAGE_SHIFT;
+   xatp.gpfn = pfn;
if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp))
BUG();
 
-   HYPERVISOR_shared_info = (struct shared_info *)shared_info_page;
+}
+static void __init xen_hvm_set_shared_info(struct shared_info *sip)
+{
+   int cpu;
+
+   HYPERVISOR_shared_info = sip;
 
/* xen_vcpu is a pointer to the vcpu_info struct in the shared_info
 * page, we use it in the event channel upcall and in some pvclock
@@ -1760,20 +1766,39 @@ void __ref xen_hvm_init_shared_info(void)
}
 }
 
-#ifdef CONFIG_XEN_PVHVM
+/* Reconnect the shared_info pfn to a (new) mfn */
+void xen_hvm_resume_shared_info(void)
+{
+   xen_hvm_connect_shared_info(xen_hvm_sip_phys >> PAGE_SHIFT);
+   xen_hvm_set_shared_info(xen_hvm_shared_info);
+}
+
+/* Xen tools prior to Xen 4 do not provide a E820_Reserved area for guest usage.
+ * On these old tools the shared info page will be placed in E820_Ram.
+ * Xen 4 provides a E820_Reserved area at 0xFC000000, and this code expects
+ * that nothing is mapped up to HVM_SHARED_INFO_ADDR.
+ * Xen 4.3+ provides an explicit 1MB area at HVM_SHARED_INFO_ADDR which is used
+ * here for the shared info page. */
+static void __init xen_hvm_init_shared_info(void)
+{
+   if (xen_major < 4) {
+   xen_hvm_shared_info = extend_brk(PAGE_SIZE, PAGE_SIZE);
+   xen_hvm_sip_phys = __pa(xen_hvm_shared_info);
+   } else {
+   xen_hvm_sip_phys = HVM_SHARED_INFO_ADDR;
+   set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_hvm_sip_phys);
+   xen_hvm_shared_info =
+   (struct shared_info *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
+   }
+   xen_hvm_resume_shared_info();
+}
+
 static void __init init_hvm_pv_info(void)
 {
-   int major, minor;
-   uint32_t eax, ebx, ecx, edx, pages, msr, base;
+   uint32_t  ecx, edx, pages, msr, base;
u64 pfn;
 
base = xen_cpuid_base();
-   cpuid(base + 1, &eax, &ebx, &ecx, &edx);
-
-   major = eax >> 16;
-   minor = eax & 0xffff;
-   printk(KERN_INFO "Xen version %d.%d.\n", major, minor);
-
cpuid(base + 2, &pages, &msr, &ecx, &edx);
 
pfn = __pa(hypercall_page);
@@ -1828,10 +1853,23 @@ static void __init

[PATCH RFC 2/4] xen/pvhvm: Introduce xen_pvhvm_kexec_shutdown()

2014-07-15 Thread Vitaly Kuznetsov
PVHVM guest requires special actions before kexec. Register specific
xen_pvhvm_kexec_shutdown() handler for machine_ops.shutdown().

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/xen/enlighten.c | 9 +
 arch/x86/xen/smp.c   | 9 +
 arch/x86/xen/smp.h   | 1 +
 3 files changed, 19 insertions(+)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index a11af62..8074e4a 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1833,6 +1833,12 @@ static struct notifier_block xen_hvm_cpu_notifier = {
.notifier_call  = xen_hvm_cpu_notify,
 };
 
+static void xen_pvhvm_kexec_shutdown(void)
+{
+   xen_kexec_shutdown();
+   native_machine_shutdown();
+}
+
 static void __init xen_hvm_guest_init(void)
 {
init_hvm_pv_info();
@@ -1849,6 +1855,9 @@ static void __init xen_hvm_guest_init(void)
x86_init.irqs.intr_init = xen_init_IRQ;
xen_hvm_init_time_ops();
xen_hvm_init_mmu_ops();
+#ifdef CONFIG_KEXEC
+   machine_ops.shutdown = xen_pvhvm_kexec_shutdown;
+#endif
 }
 
 static uint32_t __init xen_hvm_platform(void)
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 7005974..35dcf39 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -18,6 +18,7 @@
 #include <linux/smp.h>
 #include <linux/irq_work.h>
 #include <linux/tick.h>
+#include <linux/kexec.h>
 
 #include <asm/paravirt.h>
 #include <asm/desc.h>
@@ -762,6 +763,14 @@ static void xen_hvm_cpu_die(unsigned int cpu)
native_cpu_die(cpu);
 }
 
+void xen_kexec_shutdown(void)
+{
+#ifdef CONFIG_KEXEC
+   if (!kexec_in_progress)
+   return;
+#endif
+}
+
 void __init xen_hvm_smp_init(void)
 {
if (!xen_have_vector_callback)
diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h
index c7c2d89..1af0493 100644
--- a/arch/x86/xen/smp.h
+++ b/arch/x86/xen/smp.h
@@ -8,4 +8,5 @@ extern void xen_send_IPI_allbutself(int vector);
 extern void xen_send_IPI_all(int vector);
 extern void xen_send_IPI_self(int vector);
 
+extern void xen_kexec_shutdown(void);
 #endif
-- 
1.9.3
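
The pattern used by this patch, replacing one entry in an ops table with a wrapper that does extra work only when kexec is in progress and then falls through to the native handler, can be sketched in user space. Everything below is hypothetical and for illustration only: struct demo_machine_ops, the demo_* functions and the call counters are invented names, not kernel interfaces.

```c
#include <stdbool.h>

/* Hypothetical miniature of an ops table with a shutdown hook. */
struct demo_machine_ops {
	void (*shutdown)(void);
};

static int demo_native_calls;        /* how often the native path ran */
static int demo_xen_calls;           /* how often the kexec-only path ran */
static bool demo_kexec_in_progress;  /* stand-in for the kexec flag */

static void demo_native_shutdown(void)
{
	demo_native_calls++;
}

/* Does its work only when a kexec is in progress, like the stub above. */
static void demo_xen_kexec_shutdown(void)
{
	if (!demo_kexec_in_progress)
		return;
	demo_xen_calls++;
}

/* Wrapper: hypervisor-specific step first, then the native handler. */
static void demo_pvhvm_shutdown(void)
{
	demo_xen_kexec_shutdown();
	demo_native_shutdown();
}

static struct demo_machine_ops demo_ops = {
	.shutdown = demo_native_shutdown,
};

/* Guest init installs the wrapper in place of the native handler. */
static void demo_guest_init(void)
{
	demo_ops.shutdown = demo_pvhvm_shutdown;
}
```

Because the wrapper always ends in the native handler, an ordinary shutdown behaves as before and only the kexec path gains the extra step.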



[PATCH RFC 3/4] xen/pvhvm: Unmap all PIRQs on startup and shutdown

2014-07-15 Thread Vitaly Kuznetsov
When kexec is being run PIRQs from Qemu-emulated devices are still
mapped to old event channels and new kernel has no information about
that. Trying to map them twice results in the following in Xen's dmesg:

 (XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
 (XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
 (XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
 ...

 and the following in new kernel's dmesg:

 [   92.286796] xen:events: Failed to obtain physical IRQ 4

The result is that the new kernel doesn't receive IRQs for Qemu-emulated
devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown
when kexec was requested and on every kernel startup. We need to do this
twice to deal with the following issues:
- startup-time unmapping is required to make kdump work;
- shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
- shutdown-time unmapping is required to make Qemu-emulated NICs work after
  kexec (event channel is being closed on shutdown but no PHYSDEVOP_unmap_pirq
  is being performed).

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/xen/smp.c   |  1 +
 drivers/xen/events/events_base.c | 76 
 include/xen/events.h |  3 ++
 3 files changed, 80 insertions(+)

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 35dcf39..e2b4deb 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -768,6 +768,7 @@ void xen_kexec_shutdown(void)
 #ifdef CONFIG_KEXEC
if (!kexec_in_progress)
return;
+   xen_unmap_all_pirqs();
 #endif
 }
 
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index c919d3d..7701c7f 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1643,6 +1643,80 @@ void xen_callback_vector(void) {}
 static bool fifo_events = true;
 module_param(fifo_events, bool, 0);
 
+void xen_unmap_all_pirqs(void)
+{
+   int pirq, rc, gsi, irq, evtchn;
+   struct physdev_unmap_pirq unmap_irq;
+   struct irq_info *info;
+   struct evtchn_close close;
+
+   mutex_lock(&irq_mapping_update_lock);
+
+   list_for_each_entry(info, &xen_irq_list_head, list) {
+   if (info->type != IRQT_PIRQ)
+   continue;
+
+   pirq = info->u.pirq.pirq;
+   gsi = info->u.pirq.gsi;
+   evtchn = info->evtchn;
+   irq = info->irq;
+
+   pr_debug("unmapping pirq gsi=%d pirq=%d irq=%d evtchn=%d\n",
+   gsi, pirq, irq, evtchn);
+
+   if (evtchn > 0) {
+   close.port = evtchn;
+   if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
+   &close) != 0)
+   pr_warn("close evtchn %d failed\n", evtchn);
+   }
+
+   unmap_irq.pirq = pirq;
+   unmap_irq.domid = DOMID_SELF;
+
+   rc = HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
+   if (rc)
+   pr_warn("unmap pirq failed gsi=%d pirq=%d irq=%d rc=%d\n",
+   gsi, pirq, irq, rc);
+   }
+
+   mutex_unlock(&irq_mapping_update_lock);
+}
+EXPORT_SYMBOL_GPL(xen_unmap_all_pirqs);
+
+static void xen_startup_unmap_pirqs(void)
+{
+   struct evtchn_status status;
+   int port, rc = -ENOENT;
+   struct physdev_unmap_pirq unmap_irq;
+   struct evtchn_close close;
+
+   memset(&status, 0, sizeof(status));
+   for (port = 0; port < xen_evtchn_max_channels(); port++) {
+   status.dom = DOMID_SELF;
+   status.port = port;
+   rc = HYPERVISOR_event_channel_op(EVTCHNOP_status, &status);
+   if (rc < 0)
+   continue;
+   if (status.status == EVTCHNSTAT_pirq) {
+   close.port = port;
+   if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
+   &close) != 0)
+   pr_warn("xen: failed to close evtchn %d\n",
+   port);
+   unmap_irq.pirq = status.u.pirq;
+   unmap_irq.domid = DOMID_SELF;
+   pr_warn("xen: unmapping previously mapped pirq %d\n",
+   unmap_irq.pirq);
+   if (HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq,
+ &unmap_irq) != 0)
+   pr_warn("xen: failed to unmap pirq %d\n",
+   unmap_irq.pirq);
+   }
+   }
+}
+
+
 void __init xen_init_IRQ(void)
 {
int ret = -EINVAL;
@@ -1671,6 +1745,8 @@ void __init xen_init_IRQ(void)
xen_callback_vector();
 
if (xen_hvm_domain()) {
+   xen_startup_unmap_pirqs

[PATCH v7] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-15 Thread Vitaly Kuznetsov
We have a special check in read_vmcore() handler to check if the page was
reported as ram or not by the hypervisor (pfn_is_ram()). However, when
vmcore is read with mmap() no such check is performed. That can lead to
unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
enormous load in both DomU and Dom0.

Fix the issue by mapping each non-ram page to the zero page. Keep direct
path with remap_oldmem_pfn_range() to avoid looping through all pages on
bare metal.

The issue can also be solved by overriding remap_oldmem_pfn_range() in
xen-specific code, as remap_oldmem_pfn_range() was designed for.
That, however, would involve non-obvious xen code path for all x86 builds
with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
code on x86 arch from doing the same override.

Changes from v6:
- remove useless len increment when remapping the rest of the region

Changes from v5:
- make len size_t to match do_munmap() interface

Changes from v4:
- change map_size type size_t -> unsigned long
- use prot instead of vma->vm_page_prot inside remap_oldmem_pfn_checked()

Changes from v3:
- multi line comment style changes
- minor code style changes

Changes from v2:
- make remap_oldmem_pfn_checked() interface exactly match
  remap_oldmem_pfn_range()
- unmap mapped part inside remap_oldmem_pfn_checked() in case of failure so
  we don't need to take care of it in mmap_vmcore()
- create vmcore_remap_oldmem_pfn() wrapper

Changes from v1:
- comment style changes
- change remap_oldmem_pfn_checked() interface to closer match the
  remap_oldmem_pfn() interface
- preserve formal parameters within the loop, make the loop conditions
  easier to understand
- use my_zero_pfn() for the zero page
- return remapped length instead of new offset

Reviewed-by: Andrew Jones drjo...@redhat.com
Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 fs/proc/vmcore.c | 82 +---
 1 file changed, 79 insertions(+), 3 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 382aa89..a18e651 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -328,6 +328,82 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
  * virtually contiguous user-space in ELF layout.
  */
 #ifdef CONFIG_MMU
+/*
+ * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages
+ * reported as not being ram with the zero page.
+ *
+ * @vma: vm_area_struct describing requested mapping
+ * @from: start remapping from
+ * @pfn: page frame number to start remapping to
+ * @size: remapping size
+ * @prot: protection bits
+ *
+ * Returns zero on success, -EAGAIN on failure.
+ */
+int remap_oldmem_pfn_checked(struct vm_area_struct *vma, unsigned long from,
+unsigned long pfn, unsigned long size,
+pgprot_t prot)
+{
+   unsigned long map_size;
+   unsigned long pos_start, pos_end, pos;
+   unsigned long zeropage_pfn = my_zero_pfn(0);
+   size_t len = 0;
+
+   pos_start = pfn;
+   pos_end = pfn + (size >> PAGE_SHIFT);
+
+   for (pos = pos_start; pos < pos_end; ++pos) {
+   if (!pfn_is_ram(pos)) {
+   /*
+* We hit a page which is not ram. Remap the continuous
+* region between pos_start and pos-1 and replace
+* the non-ram page at pos with the zero page.
+*/
+   if (pos > pos_start) {
+   /* Remap continuous region */
+   map_size = (pos - pos_start) << PAGE_SHIFT;
+   if (remap_oldmem_pfn_range(vma, from + len,
+  pos_start, map_size,
+  prot))
+   goto fail;
+   len += map_size;
+   }
+   /* Remap the zero page */
+   if (remap_oldmem_pfn_range(vma, from + len,
+  zeropage_pfn,
+  PAGE_SIZE, prot))
+   goto fail;
+   len += PAGE_SIZE;
+   pos_start = pos + 1;
+   }
+   }
+   if (pos > pos_start) {
+   /* Remap the rest */
+   map_size = (pos - pos_start) << PAGE_SHIFT;
+   if (remap_oldmem_pfn_range(vma, from + len, pos_start,
+  map_size, prot))
+   goto fail;
+   }
+   return 0;
+fail:
+   do_munmap(vma->vm_mm, from, len);
+   return -EAGAIN;
+}
+
+int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma,
+   unsigned long from, unsigned long
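
The run-splitting idea in remap_oldmem_pfn_checked() can be tried outside the kernel. The following is only a user-space sketch, not the patch itself: struct segment, split_by_ram() and example_is_ram() are hypothetical names introduced for illustration, and instead of calling remap_oldmem_pfn_range() the sketch just records which ranges would be remapped and which single pages would be replaced by the zero page.

```c
#include <stddef.h>
#include <stdbool.h>

/* One mapping request: 'count' pages starting at 'pfn'. */
struct segment {
	unsigned long pfn;
	unsigned long count;
	bool zero;	/* true if this entry stands for the shared zero page */
};

/*
 * Walk [pfn, pfn + npages) and split it into runs of RAM pages. Every
 * page for which is_ram() returns false becomes a one-page "zero page"
 * entry. Returns the number of entries written to 'out', or -1 if
 * 'max' entries are not enough.
 */
static int split_by_ram(unsigned long pfn, unsigned long npages,
			bool (*is_ram)(unsigned long),
			struct segment *out, int max)
{
	unsigned long start = pfn, end = pfn + npages, pos;
	int n = 0;

	for (pos = start; pos < end; ++pos) {
		if (is_ram(pos))
			continue;
		if (pos > start) {
			/* Flush the contiguous RAM run before the hole. */
			if (n == max)
				return -1;
			out[n++] = (struct segment){ start, pos - start, false };
		}
		/* Replace the non-RAM page with the zero page. */
		if (n == max)
			return -1;
		out[n++] = (struct segment){ 0, 1, true };
		start = pos + 1;
	}
	if (pos > start) {
		/* Flush the trailing RAM run. */
		if (n == max)
			return -1;
		out[n++] = (struct segment){ start, pos - start, false };
	}
	return n;
}

/* Hypothetical predicate for the example: pfns 102 and 105 are holes. */
static bool example_is_ram(unsigned long pfn)
{
	return pfn != 102 && pfn != 105;
}
```

Calling split_by_ram(100, 8, example_is_ram, segs, 8) with this predicate yields three RAM runs separated by two zero-page entries, which mirrors the order in which the patch issues its remap calls.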

Re: [PATCH RFC 1/4] xen PVonHVM: use E820_Reserved area for shared_info

2014-07-15 Thread Vitaly Kuznetsov
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

 On Tue, Jul 15, 2014 at 03:40:37PM +0200, Vitaly Kuznetsov wrote:
 From: Olaf Hering o...@aepfle.de
 
 This is a respin of 00e37bdb0113a98408de42db85be002f21dbffd3
 (xen PVonHVM: move shared_info to MMIO before kexec).
 
 Currently kexec in a PVonHVM guest fails with a triple fault because the
 new kernel overwrites the shared info page. The exact failure depends on
 the size of the kernel image. This patch moves the pfn from RAM into an
 E820 reserved memory area.
 
 The pfn containing the shared_info is located somewhere in RAM. This will
 cause trouble if the current kernel is doing a kexec boot into a new
 kernel. The new kernel (and its startup code) can not know where the pfn
 is, so it can not reserve the page. The hypervisor will continue to update
 the pfn, and as a result memory corruption occurs in the new kernel.
 
 The toolstack marks the memory area FC00- as reserved in the
 E820 map. Within that range newer toolstacks (4.3+) will keep 1MB
 starting from FE70 as reserved for guest use. Older Xen4 toolstacks
 will usually not allocate areas up to FE70, so FE70 is expected
 to work also with older toolstacks.
 
 In Xen3 there is no reserved area at a fixed location. If the guest is
 started on such old hosts the shared_info page will be placed in RAM. As
 a result kexec can not be used.

 So this looks right, the one thing that we really need to check 
 is e9daff24a266307943457086533041bd971d0ef9

This reverts commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f.

 We are doing this b/c on 32-bit PVonHVM with older hypervisors
 (Xen 4.1) it ends up botching up the start_info. This is bad b/c
 we use it for the time keeping, and the timekeeping code loops
 forever - as the version field never changes. Olaf says to
 revert it, so let's do that.

 Could you kindly test that the migration on 32-bit PVHVM guests
 on older hypervisors works?


Sure, will do! Was there anything special about the setup, or would any
32-bit PVHVM guest migration (on a 64-bit hypervisor, I suppose) fail? I can
try checking both current and old versions to make sure the issue was
actually fixed.

 
 Signed-off-by: Olaf Hering o...@aepfle.de
 Signed-off-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
 (cherry picked from commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f)
 
 [On resume we need to reset the xen_vcpu_info, which the original
 patch did not do]
 
 Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
 ---
  arch/x86/xen/enlighten.c | 74 
 
  arch/x86/xen/suspend.c   |  2 +-
  arch/x86/xen/xen-ops.h   |  2 +-
  3 files changed, 58 insertions(+), 20 deletions(-)
 
 diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
 index ffb101e..a11af62 100644
 --- a/arch/x86/xen/enlighten.c
 +++ b/arch/x86/xen/enlighten.c
 @@ -1726,23 +1726,29 @@ asmlinkage __visible void __init xen_start_kernel(void)
  #endif
  }
  
 -void __ref xen_hvm_init_shared_info(void)
 +#ifdef CONFIG_XEN_PVHVM
 +#define HVM_SHARED_INFO_ADDR 0xFE70UL
 +static struct shared_info *xen_hvm_shared_info;
 +static unsigned long xen_hvm_sip_phys;
 +static int xen_major, xen_minor;
 +
 +static void xen_hvm_connect_shared_info(unsigned long pfn)
  {
 -int cpu;
  struct xen_add_to_physmap xatp;
 -static struct shared_info *shared_info_page = 0;
  
 -if (!shared_info_page)
 -shared_info_page = (struct shared_info *)
 -extend_brk(PAGE_SIZE, PAGE_SIZE);
  xatp.domid = DOMID_SELF;
  xatp.idx = 0;
  xatp.space = XENMAPSPACE_shared_info;
 -xatp.gpfn = __pa(shared_info_page) >> PAGE_SHIFT;
 +xatp.gpfn = pfn;
  if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp))
  BUG();
  
 -HYPERVISOR_shared_info = (struct shared_info *)shared_info_page;
 +}
 +static void __init xen_hvm_set_shared_info(struct shared_info *sip)
 +{
 +int cpu;
 +
 +HYPERVISOR_shared_info = sip;
  
  /* xen_vcpu is a pointer to the vcpu_info struct in the shared_info
   * page, we use it in the event channel upcall and in some pvclock
 @@ -1760,20 +1766,39 @@ void __ref xen_hvm_init_shared_info(void)
  }
  }
  
 -#ifdef CONFIG_XEN_PVHVM
 +/* Reconnect the shared_info pfn to a (new) mfn */
 +void xen_hvm_resume_shared_info(void)
 +{
 +xen_hvm_connect_shared_info(xen_hvm_sip_phys >> PAGE_SHIFT);
 +xen_hvm_set_shared_info(xen_hvm_shared_info);
 +}
 +
 +/* Xen tools prior to Xen 4 do not provide a E820_Reserved area for guest usage.
 + * On these old tools the shared info page will be placed in E820_Ram.
 + * Xen 4 provides a E820_Reserved area at 0xFC00, and this code expects
 + * that nothing is mapped up to HVM_SHARED_INFO_ADDR.
 + * Xen 4.3+ provides an explicit 1MB area at HVM_SHARED_INFO_ADDR which is used
 + * here for the shared info page. */
 +static void __init xen_hvm_init_shared_info(void)
 +{
 +if (xen_major < 4

Re: [PATCH RFC 2/4] xen/pvhvm: Introduce xen_pvhvm_kexec_shutdown()

2014-07-15 Thread Vitaly Kuznetsov
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

 On Tue, Jul 15, 2014 at 03:40:38PM +0200, Vitaly Kuznetsov wrote:
 PVHVM guest requires special actions before kexec. Register specific
 xen_pvhvm_kexec_shutdown() handler for machine_ops.shutdown().
 

 This looks close to what I had sent as an RFC to you?


Yes, I stole that part from your RFC: VCPU_reset_cpu_info patch to
call xen_unmap_all_pirqs() on shutdown. I'm looking at SMP issues your
patches address as well.
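
The hook pattern in this patch (replace machine_ops.shutdown with a wrapper that does the guest-specific cleanup first and then falls through to the native shutdown) can be modelled in user space. This is a sketch under stated assumptions: every name ending in _model is hypothetical, and the pirq_unmap_calls counter merely stands in for the xen_unmap_all_pirqs() call that patch 3/4 adds to xen_kexec_shutdown().

```c
#include <stdbool.h>

/* Toy stand-in for the kernel's machine_ops structure. */
struct machine_ops_model {
	void (*shutdown)(void);
};

static bool kexec_in_progress_model;
static int pirq_unmap_calls;		/* counts guest-specific cleanup */
static int native_shutdown_calls;	/* counts native shutdown calls */

static void native_machine_shutdown_model(void)
{
	native_shutdown_calls++;
}

/* Cleanup runs only when a kexec was actually requested. */
static void xen_kexec_shutdown_model(void)
{
	if (!kexec_in_progress_model)
		return;
	pirq_unmap_calls++;
}

/* Wrapper: guest-specific work first, then the native path. */
static void xen_pvhvm_kexec_shutdown_model(void)
{
	xen_kexec_shutdown_model();
	native_machine_shutdown_model();
}

static struct machine_ops_model machine_ops = {
	.shutdown = native_machine_shutdown_model,
};

/* Models the assignment done in xen_hvm_guest_init(). */
static void xen_hvm_guest_init_model(void)
{
	machine_ops.shutdown = xen_pvhvm_kexec_shutdown_model;
}
```

With this arrangement an ordinary reboot still reaches the native shutdown unchanged, and only a shutdown with kexec in progress performs the extra cleanup.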

 Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
 ---
  arch/x86/xen/enlighten.c | 9 +
  arch/x86/xen/smp.c   | 9 +
  arch/x86/xen/smp.h   | 1 +
  3 files changed, 19 insertions(+)
 
 diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
 index a11af62..8074e4a 100644
 --- a/arch/x86/xen/enlighten.c
 +++ b/arch/x86/xen/enlighten.c
 @@ -1833,6 +1833,12 @@ static struct notifier_block xen_hvm_cpu_notifier = {
  .notifier_call  = xen_hvm_cpu_notify,
  };
  
 +static void xen_pvhvm_kexec_shutdown(void)
 +{
 +xen_kexec_shutdown();
 +native_machine_shutdown();
 +}
 +
  static void __init xen_hvm_guest_init(void)
  {
  init_hvm_pv_info();
 @@ -1849,6 +1855,9 @@ static void __init xen_hvm_guest_init(void)
  x86_init.irqs.intr_init = xen_init_IRQ;
  xen_hvm_init_time_ops();
  xen_hvm_init_mmu_ops();
 +#ifdef CONFIG_KEXEC
 +machine_ops.shutdown = xen_pvhvm_kexec_shutdown;
 +#endif
  }
  
  static uint32_t __init xen_hvm_platform(void)
 diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
 index 7005974..35dcf39 100644
 --- a/arch/x86/xen/smp.c
 +++ b/arch/x86/xen/smp.c
 @@ -18,6 +18,7 @@
  #include <linux/smp.h>
  #include <linux/irq_work.h>
  #include <linux/tick.h>
 +#include <linux/kexec.h>
  
  #include <asm/paravirt.h>
  #include <asm/desc.h>
 @@ -762,6 +763,14 @@ static void xen_hvm_cpu_die(unsigned int cpu)
  native_cpu_die(cpu);
  }
  
 +void xen_kexec_shutdown(void)
 +{
 +#ifdef CONFIG_KEXEC
 +if (!kexec_in_progress)
 +return;
 +#endif
 +}
 +
  void __init xen_hvm_smp_init(void)
  {
  if (!xen_have_vector_callback)
 diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h
 index c7c2d89..1af0493 100644
 --- a/arch/x86/xen/smp.h
 +++ b/arch/x86/xen/smp.h
 @@ -8,4 +8,5 @@ extern void xen_send_IPI_allbutself(int vector);
  extern void xen_send_IPI_all(int vector);
  extern void xen_send_IPI_self(int vector);
  
 +extern void xen_kexec_shutdown(void);
  #endif
 -- 
 1.9.3
 

-- 
  Vitaly


Re: [PATCH RFC 3/4] xen/pvhvm: Unmap all PIRQs on startup and shutdown

2014-07-29 Thread Vitaly Kuznetsov
David Vrabel david.vra...@citrix.com writes:

 On 15/07/14 14:40, Vitaly Kuznetsov wrote:
 When kexec is being run PIRQs from Qemu-emulated devices are still
 mapped to old event channels and new kernel has no information about
 that. Trying to map them twice results in the following in Xen's dmesg:
 
  (XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
  (XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
  (XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
  ...
 
  and the following in new kernel's dmesg:
 
  [   92.286796] xen:events: Failed to obtain physical IRQ 4
 
 The result is that the new kernel doesn't receive IRQs for Qemu-emulated
 devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown
 when kexec was requested and on every kernel startup. We need to do this
 twice to deal with the following issues:
 - startup-time unmapping is required to make kdump work;
 - shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
 - shutdown-time unmapping is required to make Qemu-emulated NICs work after
   kexec (event channel is being closed on shutdown but no
   PHYSDEVOP_unmap_pirq is being performed).

 I think this should be done only in one place -- just prior to exec'ing
 the new kernel (including kdump kernels).


Thank you for your comments!

The problem I'm fighting with atm is: with FIFO-based event channels we
need to call evtchn_fifo_destroy() so the next EVTCHNOP_init_control won't
fail. I intended to put evtchn_fifo_destroy() in
EVTCHNOP_reset. That introduces a problem: we need to deal with
store/console channels. It is possible to remap those from the guest with
EVTCHNOP_bind_interdomain (if we remember where they were mapped before)
but we can't do it after we did evtchn_fifo_destroy() and we can't
rebind them after kexec and performing EVTCHNOP_init_control as
we can't remember where these channels were mapped to after kexec/kdump.

I see the following possible solutions:
1) We put evtchn_fifo_destroy() in EVTCHNOP_init_control so
EVTCHNOP_init_control can be called twice. No EVTCHNOP_reset is required
in that case.

2) Introduce a special (e.g. 'EVTCHNOP_fifo_destroy') hypercall to do
evtchn_fifo_destroy() without closing all channels. Alternatively we can
avoid closing all channels in EVTCHNOP_reset when called with DOMID_SELF
(as this mode is not being used atm), though that would be non-obvious.

3) Keep evtchn_fifo_destroy() in EVTCHNOP_reset but keep console/store
channels -- I saw your concerns it is not safe, some sort of additional
blocking will be required.

4) Do the remapping at boot time (query for store/console channels -
perform EVTCHNOP_reset - rebind with EVTCHNOP_bind_interdomain).

There is an additional problem: EVTCHNOP_bind_interdomain operation has
local port as OUT parameter so we can't guarantee that remapping
store/console channels will remap them to the same local channel they
were mapped before EVTCHNOP_reset (and we have this information in hvm
info: HVM_PARAM_CONSOLE_EVTCHN/HVM_PARAM_STORE_EVTCHN, ...). Not sure
how to deal with that in case we go with remapping.

Your thoughts would be very much appreciated. Thank you again,

 --- a/arch/x86/xen/smp.c
 +++ b/arch/x86/xen/smp.c
 @@ -768,6 +768,7 @@ void xen_kexec_shutdown(void)
  #ifdef CONFIG_KEXEC
  if (!kexec_in_progress)
  return;
 +xen_unmap_all_pirqs();
  #endif
  }
  
 diff --git a/drivers/xen/events/events_base.c 
 b/drivers/xen/events/events_base.c
 index c919d3d..7701c7f 100644
 --- a/drivers/xen/events/events_base.c
 +++ b/drivers/xen/events/events_base.c
 @@ -1643,6 +1643,80 @@ void xen_callback_vector(void) {}
  static bool fifo_events = true;
  module_param(fifo_events, bool, 0);
  
 +void xen_unmap_all_pirqs(void)
 +{
 +int pirq, rc, gsi, irq, evtchn;
 +struct physdev_unmap_pirq unmap_irq;
 +struct irq_info *info;
 +struct evtchn_close close;
 +
 +mutex_lock(&irq_mapping_update_lock);
 +
 +list_for_each_entry(info, &xen_irq_list_head, list) {
 +if (info->type != IRQT_PIRQ)
 +continue;

 I think you need to do this by querying Xen state rather than relying on
 potentially bad guest state.  Particularly since you may crash while
 holding irq_mapping_update_lock.

 EVTCHNOP_status gets you the info you need I think.

 David

-- 
  Vitaly


Re: [Xen-devel] [PATCH RFC 3/4] xen/pvhvm: Unmap all PIRQs on startup and shutdown

2014-07-29 Thread Vitaly Kuznetsov
David Vrabel david.vra...@citrix.com writes:

 On 29/07/14 14:50, Vitaly Kuznetsov wrote:
 David Vrabel david.vra...@citrix.com writes:
 
 On 15/07/14 14:40, Vitaly Kuznetsov wrote:
 When kexec is being run PIRQs from Qemu-emulated devices are still
 mapped to old event channels and new kernel has no information about
 that. Trying to map them twice results in the following in Xen's dmesg:

  (XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
  (XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
  (XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
  ...

  and the following in new kernel's dmesg:

  [   92.286796] xen:events: Failed to obtain physical IRQ 4

 The result is that the new kernel doesn't receive IRQs for Qemu-emulated
 devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown
 when kexec was requested and on every kernel startup. We need to do this
 twice to deal with the following issues:
 - startup-time unmapping is required to make kdump work;
 - shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
 - shutdown-time unmapping is required to make Qemu-emulated NICs work after
   kexec (event channel is being closed on shutdown but no
   PHYSDEVOP_unmap_pirq is being performed).

 I think this should be done only in one place -- just prior to exec'ing
 the new kernel (including kdump kernels).

 
 Thank you for your comments!
 
 The problem I'm fighting with atm is: with FIFO-based event channels we
 need to call evtchn_fifo_destroy() so the next EVTCHNOP_init_control won't
 fail. I intended to put evtchn_fifo_destroy() in
 EVTCHNOP_reset. That introduces a problem: we need to deal with
 store/console channels. It is possible to remap those from guest with
 EVTCHNOP_bind_interdomain (if we remember where they were mapped before)
 but we can't do it after we did evtchn_fifo_destroy() and we can't
 rebind them after kexec and performing EVTCHNOP_init_control as
 we can't remember where these channels were mapped to after kexec/kdump.
 
 I see the following possible solutions:
 1) We put evtchn_fifo_destroy() in EVTCHNOP_init_control so
 EVTCHNOP_init_control can be called twice. No EVTCHNOP_reset is required
 in that case.

 EVTCHNOP_init_control is called for each VCPU so I can't see how this
 would work.

Right, forgot about that...


 2) Introduce special (e.g. 'EVTCHNOP_fifo_destroy') hypercall to do
 evtchn_fifo_destroy() without closing all channels. Alternatively we can
 avoid closing all channels in EVTCHNOP_reset when called with DOMID_SELF
 (as this mode is not being used atm) -- but that would look unobvious.

 I would try this.  The guest prior to kexec would then:

 1. Use EVTCHNOP_status to query remote end of console and xenstore event
 channels.

 2. Loop for all event channels:

 a. unmap pirq (if required)
 b. EVTCHNOP_close

 3. EVTCHNOP_fifo_destroy (the implementation of which must verify that
 no channels are bound).

 4. EVTCHNOP_bind_interdomain to rebind the console and xenstore channels.
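
The four steps above can be walked through on a toy model. This is a sketch under stated assumptions, not Xen code: the toy_* names, the fixed-size port table and the "lowest free port" allocation policy are all invented for illustration, and the model does not capture the requirement, raised in the reply that follows, that EVTCHNOP_init_control must run before channels can be bound again. What it does show is that the local port returned by the rebind is chosen by the hypervisor side and need not match the port the channel had before.

```c
#include <stdbool.h>

#define TOY_NR_PORTS 8

enum toy_state { TOY_FREE, TOY_INTERDOMAIN, TOY_PIRQ };

struct toy_chan {
	enum toy_state state;
	int remote_dom, remote_port;	/* valid for TOY_INTERDOMAIN */
	int pirq;			/* valid for TOY_PIRQ */
};

static struct toy_chan toy_ports[TOY_NR_PORTS];
static bool toy_pirq_mapped[32];

/* Step 2: unmap the pirq (if any) and close the channel. */
static void toy_close(int port)
{
	if (toy_ports[port].state == TOY_PIRQ)
		toy_pirq_mapped[toy_ports[port].pirq] = false;
	toy_ports[port].state = TOY_FREE;
}

/* Step 3: refuse to tear down while any channel is still bound. */
static int toy_fifo_destroy(void)
{
	for (int p = 0; p < TOY_NR_PORTS; p++)
		if (toy_ports[p].state != TOY_FREE)
			return -1;
	return 0;
}

/* Step 4: the local port is picked here (lowest free port above 0),
 * not by the caller. Returns the port, or -1 if none is free. */
static int toy_bind_interdomain(int remote_dom, int remote_port)
{
	for (int p = 1; p < TOY_NR_PORTS; p++) {
		if (toy_ports[p].state == TOY_FREE) {
			toy_ports[p] = (struct toy_chan){
				.state = TOY_INTERDOMAIN,
				.remote_dom = remote_dom,
				.remote_port = remote_port,
			};
			return p;
		}
	}
	return -1;
}

/* Runs steps 1-4, keeping only the store and console channels. */
static int toy_reset_keep(int store_port, int console_port,
			  int *new_store, int *new_console)
{
	/* Step 1: remember the remote ends before closing anything. */
	struct toy_chan store = toy_ports[store_port];
	struct toy_chan console = toy_ports[console_port];

	for (int p = 0; p < TOY_NR_PORTS; p++)
		toy_close(p);
	if (toy_fifo_destroy())
		return -1;
	*new_store = toy_bind_interdomain(store.remote_dom, store.remote_port);
	*new_console = toy_bind_interdomain(console.remote_dom,
					    console.remote_port);
	return 0;
}
```

In this model a store channel that started on local port 3 comes back on port 1, so any recorded port number (such as the HVM_PARAM_* values mentioned in the thread) would be stale after the rebind.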


Yea, that's what I have now when I put evtchn_fifo_destroy() in
EVTCHNOP_reset. The problem here is: we can't do
EVTCHNOP_bind_interdomain after we did evtchn_fifo_destroy(), we need to
call EVTCHNOP_init_control first. And we'll do that only after kexec so
we won't remember what we need to remap. The second issue is the fact
that EVTCHNOP_bind_interdomain will remap store/console channels to
*some* local ports, not necessarily matching hvm info
(HVM_PARAM_CONSOLE_EVTCHN/HVM_PARAM_STORE_EVTCHN).

Would it be safe if, instead of closing interdomain channels on
EVTCHNOP_fifo_destroy, we switch evtchn_port_ops to evtchn_port_ops_2l
(so on EVTCHNOP_init_control after kexec we switch back)? I'll try
prototyping this.

Thank you for your comments,

 David

-- 
  Vitaly
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 4/4] xen/pvhvm: Make MSI IRQs work after kexec

2014-07-16 Thread Vitaly Kuznetsov
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

 On Tue, Jul 15, 2014 at 03:40:40PM +0200, Vitaly Kuznetsov wrote:
 When kexec was performed MSI IRQs for passthrough-ed devices were already
 mapped and we see non-zero pirq extracted from MSI msg. xen_irq_from_pirq()
 fails as we have no IRQ mapping information for that. Requesting for new
 mapping with __write_msi_msg() does not result in MSI IRQ being remapped so
 we don't recieve these IRQs.

 receive


Thanks for your comments!

 How come '__write_msi_msg' does not result in new MSI IRQs?


Actually that was the hidden question in my RFC :-)

Let me describe what I see. When normal boot is performed we have the
following in xen_hvm_setup_msi_irqs():

__read_msi_msg()
 pirq -> 0

then we allocate new pirq with
 pirq = xen_allocate_pirq_msi()
 pirq -> 54

and we have the following mapping:
xen: msi --> pirq=54 --> irq=72

in 'xl debug-keys i':
(XEN)IRQ:  29 affinity:04 vec:b9 type=PCI-MSI status=0030 in-flight=0 domain-list=7: 54(),

After kexec we see the following:
__read_msi_msg()
 pirq -> 54

but as xen_irq_from_pirq() fails we follow the same path allocating new pirq:
 pirq = xen_allocate_pirq_msi()
 pirq -> 55

and we have the following mapping:
xen: msi --> pirq=55 --> irq=75

However (afaict) mapping in xen wasn't updated:

in 'xl debug-keys i':
(XEN)IRQ:  29 affinity:02 vec:b9 type=PCI-MSI status=0030 in-flight=0 domain-list=7: 54(--M-),

 Is it fair to state that your code ends up reading the MSI IRQ (PIRQ)
 from the device and updating the internal PIRQ-IRQ code to match
 with the reality?


Yea, 'always trust the device'.

 
 RFC: I wasn't able to understand why commit af42b8d1 which introduced
 xen_irq_from_pirq() check in xen_hvm_setup_msi_irqs() is checking that
 instead of checking pirq > 0 as if the mapping was already done (and we have
 pirq > 0 here) we don't need to request for a new pirq. We're losing the
 existing PIRQ and I'm also not sure when __write_msi_msg() with new PIRQ will
 result in new mapping.

 We don't request a new pirq. We end up returning before we call 
 xen_allocate_pirq_msi.
 At least that is how the commit you mentioned worked.


I meant to say that in case we have pirq > 0 from __read_msi_msg() but
xen_irq_from_pirq(pirq) fails (kexec-only case?) we always do
xen_allocate_pirq_msi() which brings us new pirq.
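
The difference between the two checks can be shown with a toy model. This is a sketch under stated assumptions: the shift of 12 and the 0xffffff00 mask are assumed to mirror the x86 MSI address layout behind MSI_ADDR_DEST_ID_SHIFT and MSI_ADDR_EXT_DEST_ID, the toy_* and pick_pirq_* helpers and the allocator state are hypothetical, and the msg.data != XEN_PIRQ_MSI_DATA half of the condition is left out for brevity.

```c
#include <stdint.h>

/* Assumed to mirror the x86 MSI address layout used by the patch. */
#define DEST_ID_SHIFT	12
#define EXT_DEST_ID(hi)	((hi) & 0xffffff00u)

/* Recover the pirq that was encoded into the MSI address words. */
static int decode_pirq(uint32_t address_hi, uint32_t address_lo)
{
	return (int)(EXT_DEST_ID(address_hi) |
		     ((address_lo >> DEST_ID_SHIFT) & 0xff));
}

/* Toy guest-side pirq -> irq table; it is empty right after kexec. */
#define MAX_PIRQ 64
static int irq_of_pirq[MAX_PIRQ];	/* 0 means "no mapping known" */
static int next_free_pirq = 55;		/* hypothetical allocator state */

static int toy_irq_from_pirq(int pirq)
{
	if (pirq <= 0 || pirq >= MAX_PIRQ || irq_of_pirq[pirq] == 0)
		return -1;
	return irq_of_pirq[pirq];
}

static int toy_allocate_pirq(void)
{
	return next_free_pirq++;
}

/* Existing check: allocate unless the guest already knows this pirq. */
static int pick_pirq_old(int pirq_from_device)
{
	if (toy_irq_from_pirq(pirq_from_device) < 0)
		return toy_allocate_pirq();
	return pirq_from_device;
}

/* Proposed check: reuse any positive pirq read back from the device. */
static int pick_pirq_new(int pirq_from_device)
{
	if (pirq_from_device <= 0)
		return toy_allocate_pirq();
	return pirq_from_device;
}
```

With an empty guest table and a device that still reports pirq 54, the old check hands out a fresh pirq while the proposed one keeps 54, which matches the behaviour described above for the kexec case.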

 In regards to why using 'xen_irq_from_pirq' instead of just checking the
 PIRQ - is that we might be called twice by a buggy driver. As such we want
 to check our PIRQ-IRQ to figure this out.

But if we're called twice we'll see the same pirq, right? Or there are
some cases when we see 'crap' instead of pirq here?

I think it would be nice to use the same pirq after kexec instead of
allocating a new one even in case we can make remapping work.

Thanks for your comments again!

 
 Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
 ---
  arch/x86/pci/xen.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)
 
 diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
 index 905956f..685e8f1 100644
 --- a/arch/x86/pci/xen.c
 +++ b/arch/x86/pci/xen.c
 @@ -231,8 +231,7 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
  __read_msi_msg(msidesc, &msg);
  pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
  ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
 -if (msg.data != XEN_PIRQ_MSI_DATA ||
 -xen_irq_from_pirq(pirq) < 0) {
 +if (msg.data != XEN_PIRQ_MSI_DATA || pirq <= 0) {
  pirq = xen_allocate_pirq_msi(dev, msidesc);
  if (pirq < 0) {
  irq = -ENODEV;
 -- 
 1.9.3
 

-- 
  Vitaly


Re: [PATCH RFC 3/4] xen/pvhvm: Unmap all PIRQs on startup and shutdown

2014-07-16 Thread Vitaly Kuznetsov
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

 On Tue, Jul 15, 2014 at 03:40:39PM +0200, Vitaly Kuznetsov wrote:
 When kexec is being run PIRQs from Qemu-emulated devices are still
 mapped to old event channels and new kernel has no information about
 that. Trying to map them twice results in the following in Xen's dmesg:
 
  (XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
  (XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
  (XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
  ...
 
  and the following in new kernel's dmesg:
 
  [   92.286796] xen:events: Failed to obtain physical IRQ 4
 
 The result is that the new kernel doesn't receive IRQs for Qemu-emulated
 devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown
 when kexec was requested and on every kernel startup. We need to do this
 twice to deal with the following issues:
 - startup-time unmapping is required to make kdump work;
 - shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
 - shutdown-time unmapping is required to make Qemu-emulated NICs work after
   kexec (event channel is being closed on shutdown but no
   PHYSDEVOP_unmap_pirq is being performed).

 How does this work when you boot the guest under Xen 4.4 where the FIFO events
 are used? Does it still work correctly?

Thanks for pointing that out! I've checked and it doesn't. However
patches make no difference - guest kernel gets stuck on boot with and
without them. Will try to investigate...


 Thanks.
 
 Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
 ---
  arch/x86/xen/smp.c   |  1 +
  drivers/xen/events/events_base.c | 76 
 
  include/xen/events.h |  3 ++
  3 files changed, 80 insertions(+)
 
 diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
 index 35dcf39..e2b4deb 100644
 --- a/arch/x86/xen/smp.c
 +++ b/arch/x86/xen/smp.c
 @@ -768,6 +768,7 @@ void xen_kexec_shutdown(void)
  #ifdef CONFIG_KEXEC
  if (!kexec_in_progress)
  return;
 +xen_unmap_all_pirqs();
  #endif
  }
  
 diff --git a/drivers/xen/events/events_base.c 
 b/drivers/xen/events/events_base.c
 index c919d3d..7701c7f 100644
 --- a/drivers/xen/events/events_base.c
 +++ b/drivers/xen/events/events_base.c
 @@ -1643,6 +1643,80 @@ void xen_callback_vector(void) {}
  static bool fifo_events = true;
  module_param(fifo_events, bool, 0);
  
 +void xen_unmap_all_pirqs(void)
 +{
 +int pirq, rc, gsi, irq, evtchn;
 +struct physdev_unmap_pirq unmap_irq;
 +struct irq_info *info;
 +struct evtchn_close close;
 +
 +mutex_lock(&irq_mapping_update_lock);
 +
 +list_for_each_entry(info, &xen_irq_list_head, list) {
 +if (info->type != IRQT_PIRQ)
 +continue;
 +
 +pirq = info->u.pirq.pirq;
 +gsi = info->u.pirq.gsi;
 +evtchn = info->evtchn;
 +irq = info->irq;
 +
 +pr_debug("unmapping pirq gsi=%d pirq=%d irq=%d evtchn=%d\n",
 +gsi, pirq, irq, evtchn);
 +
 +if (evtchn > 0) {
 +close.port = evtchn;
 +if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
 +&close) != 0)
 +pr_warn("close evtchn %d failed\n", evtchn);
 +}
 +
 +unmap_irq.pirq = pirq;
 +unmap_irq.domid = DOMID_SELF;
 +
 +rc = HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
 +if (rc)
 +pr_warn("unmap pirq failed gsi=%d pirq=%d irq=%d rc=%d\n",
 +gsi, pirq, irq, rc);
 +}
 +
 +mutex_unlock(&irq_mapping_update_lock);
 +}
 +EXPORT_SYMBOL_GPL(xen_unmap_all_pirqs);

 Why the EXPORT? Is this used by modules?
 +
 +static void xen_startup_unmap_pirqs(void)
 +{
 +struct evtchn_status status;
 +int port, rc = -ENOENT;
 +struct physdev_unmap_pirq unmap_irq;
 +struct evtchn_close close;
 +
 +memset(&status, 0, sizeof(status));
 +for (port = 0; port < xen_evtchn_max_channels(); port++) {
 +status.dom = DOMID_SELF;
 +status.port = port;
 +rc = HYPERVISOR_event_channel_op(EVTCHNOP_status, &status);
 +if (rc < 0)
 +continue;
 +if (status.status == EVTCHNSTAT_pirq) {
 +close.port = port;
 +if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
 +&close) != 0)
 +pr_warn("xen: failed to close evtchn %d\n",
 +port);
 +unmap_irq.pirq = status.u.pirq;
 +unmap_irq.domid = DOMID_SELF;
 +pr_warn("xen: unmapping previously mapped pirq %d\n",
 +unmap_irq.pirq);
 +if (HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq

Re: [PATCH RFC 3/4] xen/pvhvm: Unmap all PIRQs on startup and shutdown

2014-07-16 Thread Vitaly Kuznetsov
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

 On Wed, Jul 16, 2014 at 11:37:10AM +0200, Vitaly Kuznetsov wrote:
 Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:
 
  On Tue, Jul 15, 2014 at 03:40:39PM +0200, Vitaly Kuznetsov wrote:
  When kexec is being run PIRQs from Qemu-emulated devices are still
  mapped to old event channels and new kernel has no information about
  that. Trying to map them twice results in the following in Xen's dmesg:
  
   (XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
   (XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
   (XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
   ...
  
   and the following in new kernel's dmesg:
  
   [   92.286796] xen:events: Failed to obtain physical IRQ 4
  
  The result is that the new kernel doesn't receive IRQs for Qemu-emulated
  devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown
  when kexec was requested and on every kernel startup. We need to do this
  twice to deal with the following issues:
  - startup-time unmapping is required to make kdump work;
  - shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
  - shutdown-time unmapping is required to make Qemu-emulated NICs work after
    kexec (event channel is being closed on shutdown but no
    PHYSDEVOP_unmap_pirq is being performed).
 
  How does this work when you boot the guest under Xen 4.4 where the FIFO 
  events
  are used? Does it still work correctly?
 
 Thanks for pointing that out! I've checked and it doesn't. However
 patches make no difference - guest kernel gets stuck on boot with and
 without them. Will try to investigate...

 I think for FIFO events we can't do much right now - it would need some
 new hypercalls to de-allocate or such.


Yeah, you're probably right. I tried wrapping evtchn_fifo_destroy() into
'EVTCHNOP_fifo_destroy' hypercall but it seems some other actions are
required as well..

 But I was thinking that your code logic could just return out when
 it detects that it is running with FIFO events (with a TODO comment)
 - and also spit out some information to this effect?

Sure, having TODO here is a good idea.


 Say: Use xen.fifo=0 in your launching kernel

s,xen.fifo,xen.fifo_events,


 (don't know the right name for the kernel in which you do 'kexec -e' in ?
 Is that launching? Original? Bootstrap kernel?)

Yes, if under Xen-4.4 I boot original kernel with xen.fifo_events=0
I'm able to do kexec with xen.fifo_events=0 and even without it (but
only once :-). Once kernel is booted with FIFO-based event channels
enabled no kexec is possible, new kernel gets stuck (I guess vcpuop
timer doesn't work..). My patch series brings no difference here..
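
The suggestion discussed above (bail out early when FIFO event channels are in use, leave a TODO and tell the user what to do) combined with the startup-time scan can be sketched on a toy port table. This is only an illustration under stated assumptions: struct toy_port, enum toy_status and toy_startup_unmap_pirqs() are hypothetical, and the two assignments in the loop merely stand in for the EVTCHNOP_close and PHYSDEVOP_unmap_pirq hypercalls.

```c
#include <stdbool.h>
#include <stdio.h>

enum toy_status { TOY_CLOSED, TOY_BOUND_PIRQ, TOY_BOUND_OTHER };

struct toy_port {
	enum toy_status status;
	int pirq;
};

/*
 * Close and unmap every pirq-bound port unless FIFO event channels are
 * in use. Returns the number of pirqs unmapped, or -1 if the cleanup
 * was skipped.
 */
static int toy_startup_unmap_pirqs(struct toy_port *ports, int nr,
				   bool fifo_in_use)
{
	int unmapped = 0;

	if (fifo_in_use) {
		/* TODO: needs hypervisor support to tear down FIFO state. */
		fprintf(stderr, "kexec cleanup skipped: boot the original "
				"kernel with xen.fifo_events=0\n");
		return -1;
	}
	for (int port = 0; port < nr; port++) {
		if (ports[port].status != TOY_BOUND_PIRQ)
			continue;
		ports[port].status = TOY_CLOSED;	/* models EVTCHNOP_close */
		ports[port].pirq = 0;	/* models PHYSDEVOP_unmap_pirq */
		unmapped++;
	}
	return unmapped;
}
```

Returning a distinct value for the skipped case lets a caller tell "nothing to unmap" apart from "cleanup not attempted".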

Thanks,


 
 
  Thanks.
  
  Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
  ---
   arch/x86/xen/smp.c   |  1 +
   drivers/xen/events/events_base.c | 76 
  
   include/xen/events.h |  3 ++
   3 files changed, 80 insertions(+)
  
  diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
  index 35dcf39..e2b4deb 100644
  --- a/arch/x86/xen/smp.c
  +++ b/arch/x86/xen/smp.c
  @@ -768,6 +768,7 @@ void xen_kexec_shutdown(void)
   #ifdef CONFIG_KEXEC
if (!kexec_in_progress)
return;
  + xen_unmap_all_pirqs();
   #endif
   }
   
  diff --git a/drivers/xen/events/events_base.c 
  b/drivers/xen/events/events_base.c
  index c919d3d..7701c7f 100644
  --- a/drivers/xen/events/events_base.c
  +++ b/drivers/xen/events/events_base.c
  @@ -1643,6 +1643,80 @@ void xen_callback_vector(void) {}
   static bool fifo_events = true;
   module_param(fifo_events, bool, 0);
   
  +void xen_unmap_all_pirqs(void)
  +{
  + int pirq, rc, gsi, irq, evtchn;
  + struct physdev_unmap_pirq unmap_irq;
  + struct irq_info *info;
  + struct evtchn_close close;
  +
  + mutex_lock(&irq_mapping_update_lock);
  +
  + list_for_each_entry(info, &xen_irq_list_head, list) {
  + if (info->type != IRQT_PIRQ)
  + continue;
  +
  + pirq = info->u.pirq.pirq;
  + gsi = info->u.pirq.gsi;
  + evtchn = info->evtchn;
  + irq = info->irq;
  +
  + pr_debug("unmapping pirq gsi=%d pirq=%d irq=%d evtchn=%d\n",
  + gsi, pirq, irq, evtchn);
  +
  + if (evtchn > 0) {
  + close.port = evtchn;
  + if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
  + &close) != 0)
  + pr_warn("close evtchn %d failed\n", evtchn);
  + }
  +
  + unmap_irq.pirq = pirq;
  + unmap_irq.domid = DOMID_SELF;
  +
  + rc = HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
  + if (rc)
  + pr_warn("unmap pirq failed gsi=%d pirq=%d irq=%d rc=%d\n",
  + gsi, pirq, irq, rc);
  + }
  +
  + mutex_unlock(&irq_mapping_update_lock);
  +}
  +EXPORT_SYMBOL_GPL(xen_unmap_all_pirqs

Re: [PATCH RFC 4/4] xen/pvhvm: Make MSI IRQs work after kexec

2014-07-16 Thread Vitaly Kuznetsov
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

 On Wed, Jul 16, 2014 at 11:01:55AM +0200, Vitaly Kuznetsov wrote:
 Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:
 
  On Tue, Jul 15, 2014 at 03:40:40PM +0200, Vitaly Kuznetsov wrote:
  When kexec was peformed MSI IRQs for passthrough-ed devices were already
  mapped and we see non-zero pirq extracted from MSI msg. 
  xen_irq_from_pirq()
  fails as we have no IRQ mapping information for that. Requesting for new
  mapping with __write_msi_msg() does not result in MSI IRQ being remapped 
  so
  we don't recieve these IRQs.
 
  receive
 
 
 Thanks for your comments!

 Thank you for quick turnaround with the answers!
 
  How come '__write_msi_msg' does not result in new MSI IRQs?
 
 
 Actually that was the hidden question in my RFC :-)
 
 Let me describe what I see. When normal boot is performed we have the
 following in xen_hvm_setup_msi_irqs():
 
 __read_msi_msg()
  pirq -> 0
 
 then we allocate new pirq with
  pirq = xen_allocate_pirq_msi()
  pirq -> 54
 
 and we have the following mapping:
 xen: msi --> pirq=54 --> irq=72
 
 in 'xl debug-keys i':
 (XEN)IRQ:  29 affinity:04 vec:b9 type=PCI-MSI status=0030 in-flight=0 domain-list=7: 54(),
 
 After kexec we see the following:
 __read_msi_msg()
  pirq -> 54
 
 but as xen_irq_from_pirq() fails we follow the same path allocating new pirq:
  pirq = xen_allocate_pirq_msi()
  pirq -> 55
 
 and we have the following mapping:
 xen: msi --> pirq=55 --> irq=75
 
 However (afaict) mapping in xen wasn't updated:
 
 in 'xl debug-keys i':
 (XEN)IRQ:  29 affinity:02 vec:b9 type=PCI-MSI status=0030 in-flight=0 domain-list=7: 54(--M-),

 I am wondering if that is related to in QEMU traditional:

 qemu-xen-trad: free all the pirqs for msi/msix when driver unloads

 (which in the upstream QEMU is 1d4fd4f0e2fc5dcae0c60e00cc9af95f52988050)

 If you have that patch in, is the PIRQ value correctly updated?


Thanks, that really works! I tested both kexec -e / kdump cases. I'm
wondering if we also need my commit to work around non-fixed qemus?

 
  Is it fair to state that your code ends up reading the MSI IRQ (PIRQ)
  from the device and updating the internal PIRQ-IRQ code to match
  with the reality?
 
 
 Yea, 'always trust the device'.
 
  
  RFC: I wasn't able to understand why commit af42b8d1 which introduced
  xen_irq_from_pirq() check in xen_hvm_setup_msi_irqs() is checking that
  instead of checking pirq > 0 as if the mapping was already done (and we have
  pirq > 0 here) we don't need to request for a new pirq. We're losing the
  existing PIRQ and I'm also not sure when __write_msi_msg() with new PIRQ will
  result in new mapping.
 
  We don't request a new pirq. We end up returning before we call 
  xen_allocate_pirq_msi.
  At least that is how the commit you mentioned worked.
 
 
 I meant to say that in case we have pirq > 0 from __read_msi_msg() but
 xen_irq_from_pirq(pirq) fails (kexec-only case?) we always do
 xen_allocate_pirq_msi() which brings us new pirq.
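
 To make the difference between the two checks concrete, here is a minimal
 userspace sketch. It is not the kernel code: all sketch_* names are
 hypothetical, the mask and shift are assumed to mirror
 MSI_ADDR_EXT_DEST_ID() and MSI_ADDR_DEST_ID_SHIFT, the data constant is
 only a placeholder for XEN_PIRQ_MSI_DATA, and 'pirq_known' stands in for
 "xen_irq_from_pirq(pirq) succeeded".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define SKETCH_DEST_ID_SHIFT     12
#define SKETCH_XEN_PIRQ_MSI_DATA 0x4300u /* placeholder for XEN_PIRQ_MSI_DATA */

/* Recover the pirq that was encoded into the MSI address words. */
int sketch_decode_pirq(uint32_t address_hi, uint32_t address_lo)
{
	return (int)((address_hi & 0xffffff00u) |
		     ((address_lo >> SKETCH_DEST_ID_SHIFT) & 0xffu));
}

/*
 * Old check: allocate a new pirq unless the MSI data matches AND the
 * running kernel already has an irq mapping for this pirq.
 */
bool sketch_needs_alloc_old(uint32_t data, int pirq, bool pirq_known)
{
	/* A mapping can only be known for a valid (positive) pirq. */
	assert(!pirq_known || pirq > 0);
	return data != SKETCH_XEN_PIRQ_MSI_DATA || !pirq_known;
}

/*
 * Proposed check: trust a positive pirq read back from the device,
 * whether or not the running kernel has seen it before.
 */
bool sketch_needs_alloc_proposed(uint32_t data, int pirq)
{
	return data != SKETCH_XEN_PIRQ_MSI_DATA || pirq <= 0;
}
```

 With pirq 54 read back after kexec and no mapping in the new kernel, the
 old check asks for a new pirq while the proposed one keeps 54. On a fresh
 boot (pirq 0) both checks allocate.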
 
  In regards to why using 'xen_irq_from_pirq' instead of just checking the 
  PIRQ - is
  that we might be called twice by a buggy driver. As such we want to check
  our PIRQ-IRQ to figure this out.
 
 But if we're called twice we'll see the same pirq, right? Or there are

 Good point.
 some cases when we see 'crap' instead of pirq here?

 For PCI passthrough devices they will be zero until they are enabled.
 But I am not sure about the emulated devices, such as e1000 or such, which
 would also go through this path (I think - do we have MSI devices that
 we emulate in QEMU?)

AFAICT emulated e1000 doesn't use MSI (at least with qemu-traditional)
and with my patch series it works after kexec.


 
 I think it would be nice to use the same pirq after kexec instead of
 allocating a new one even in case we can make remapping work.

 I concur.

 Stefano, do you recall why you used xen_irq_from_pirq instead of just
 trusting the 'pirq' value? Was it to workaround broken QEMU?

 
 Thanks for your comments again!
 
  
  Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
  ---
   arch/x86/pci/xen.c | 3 +--
   1 file changed, 1 insertion(+), 2 deletions(-)
  
  diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
  index 905956f..685e8f1 100644
  --- a/arch/x86/pci/xen.c
  +++ b/arch/x86/pci/xen.c
  @@ -231,8 +231,7 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
   		__read_msi_msg(msidesc, &msg);
   		pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
   			((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
  -		if (msg.data != XEN_PIRQ_MSI_DATA ||
  -		    xen_irq_from_pirq(pirq) < 0) {
  +		if (msg.data != XEN_PIRQ_MSI_DATA || pirq <= 0) {
   			pirq = xen_allocate_pirq_msi(dev, msidesc);
   			if (pirq < 0) {
   				irq = -ENODEV;
  -- 
  1.9.3
  
 
 -- 
   Vitaly

-- 
  Vitaly

Re: [PATCH RFC 4/4] xen/pvhvm: Make MSI IRQs work after kexec

2014-07-17 Thread Vitaly Kuznetsov
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

 On Wed, Jul 16, 2014 at 07:20:39PM +0200, Vitaly Kuznetsov wrote:
 Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:
 
  On Wed, Jul 16, 2014 at 11:01:55AM +0200, Vitaly Kuznetsov wrote:
  Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:
  
   On Tue, Jul 15, 2014 at 03:40:40PM +0200, Vitaly Kuznetsov wrote:
   When kexec was peformed MSI IRQs for passthrough-ed devices were 
   already
   mapped and we see non-zero pirq extracted from MSI msg. 
   xen_irq_from_pirq()
   fails as we have no IRQ mapping information for that. Requesting for 
   new
   mapping with __write_msi_msg() does not result in MSI IRQ being 
   remapped so
   we don't recieve these IRQs.
  
   receive
  
  
  Thanks for your comments!
 
  Thank you for quick turnaround with the answers!
  
   How come '__write_msi_msg' does not result in new MSI IRQs?
  
  
  Actually that was the hidden question in my RFC :-)
  
  Let me describe what I see. When normal boot is performed we have the
  following in xen_hvm_setup_msi_irqs():
  
  __read_msi_msg()
   pirq -> 0
  
  then we allocate new pirq with
   pirq = xen_allocate_pirq_msi()
   pirq -> 54
  
  and we have the following mapping:
  xen: msi --> pirq=54 --> irq=72
  
  in 'xl debug-keys i':
  (XEN)IRQ:  29 affinity:04 vec:b9 type=PCI-MSI status=0030 in-flight=0 domain-list=7: 54(----),
  
  After kexec we see the following:
  __read_msi_msg()
   pirq -> 54
  
  but as xen_irq_from_pirq() fails we follow the same path allocating new pirq:
   pirq = xen_allocate_pirq_msi()
   pirq -> 55
  
  and we have the following mapping:
  xen: msi --> pirq=55 --> irq=75
  
  However (afaict) mapping in xen wasn't updated:
  
  in 'xl debug-keys i':
  (XEN)IRQ:  29 affinity:02 vec:b9 type=PCI-MSI status=0030 in-flight=0 domain-list=7: 54(--M-),
 
  I am wondering if that is related to in QEMU traditional:
 
  qemu-xen-trad: free all the pirqs for msi/msix when driver unloads
 
  (which in the upstream QEMU is 1d4fd4f0e2fc5dcae0c60e00cc9af95f52988050)
 
  If you have that patch in, is the PIRQ value correctly updated?
 
 
 Thanks, that really works! I tested both kexec -e / kdump cases. I'm
 wondering if we also need my commit to work around non-fixed QEMUs?

 Without your patch on older QEMU's with PCI passthrough we won't get
 any more interrupts after we kexec in the guest right?


Correct.

 As in, this issue happens _only_ with PCI passthrough devices that use
 MSI or MSI-X?

I haven't tested MSI-X but in theory yes, only MSI and MSI-X
passthrough-ed devices are affected.


 Still need to get Stefano's view on this.


Sure, thanks!

 
  
   Is it fair to state that your code ends up reading the MSI IRQ (PIRQ)
   from the device and updating the internal PIRQ-IRQ code to match
   with the reality?
  
  
  Yea, 'always trust the device'.
  
   
   RFC: I wasn't able to understand why commit af42b8d1, which introduced the
   xen_irq_from_pirq() check in xen_hvm_setup_msi_irqs(), checks that instead
   of checking pirq > 0: if the mapping was already done (and we have pirq > 0
   here) we don't need to request a new pirq. We're losing the existing PIRQ,
   and I'm also not sure when __write_msi_msg() with the new PIRQ will result
   in a new mapping.
  
   We don't request a new pirq. We end up returning before we call 
   xen_allocate_pirq_msi.
   At least that is how the commit you mentioned worked.
  
  
  I meant to say that in case we have pirq > 0 from __read_msi_msg() but
  xen_irq_from_pirq(pirq) fails (kexec-only case?) we always do
  xen_allocate_pirq_msi() which brings us new pirq.
  
   In regards to why using 'xen_irq_from_pirq' instead of just checking 
   the PIRQ - is
   that we might be called twice by a buggy driver. As such we want to 
   check
   our PIRQ-IRQ to figure this out.
  
  But if we're called twice we'll see the same pirq, right? Or there are
 
  Good point.
  some cases when we see 'crap' instead of pirq here?
 
  For PCI passthrough devices they will be zero until they are enabled.
  But I am not sure about the emulated devices, such as e1000 or such, which
  would also go through this path (I think - do we have MSI devices that
  we emulate in QEMU?)
 
 AFAICT emulated e1000 doesn't use MSI (at least with qemu-traditional)
 and with my patch series it works after kexec.
 
 
  
  I think it would be nice to use the same pirq after kexec instead of
  allocating a new one even in case we can make remapping work.
 
  I concur.
 
  Stefano, do you recall why you used xen_irq_from_pirq instead of just
  trusting the 'pirq' value? Was it to workaround broken QEMU?
 
  
  Thanks for your comments again!
  
   
   Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
   ---
arch/x86/pci/xen.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
   
   diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
   index 905956f..685e8f1 100644
   --- a/arch/x86/pci/xen.c
   +++ b/arch

[PATCH v8] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-17 Thread Vitaly Kuznetsov
We have a special check in read_vmcore() handler to check if the page was
reported as ram or not by the hypervisor (pfn_is_ram()). However, when
vmcore is read with mmap() no such check is performed. That can lead to
unpredictable results: e.g. in a Xen PVHVM guest, memcpy() after mmap() on
/proc/vmcore hangs while processing HVMMEM_mmio_dm pages, creating
enormous load in both DomU and Dom0.

Fix the issue by mapping each non-ram page to the zero page. Keep direct
path with remap_oldmem_pfn_range() to avoid looping through all pages on
bare metal.

The issue can also be solved by overriding remap_oldmem_pfn_range() in
xen-specific code, which is what remap_oldmem_pfn_range() was designed for.
That, however, would involve non-obvious xen code path for all x86 builds
with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
code on x86 arch from doing the same override.
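
The splitting logic can be tried outside the kernel. The following is only a
userspace sketch under stated assumptions: struct map_segment and
split_non_ram() are hypothetical names, the is_ram[] array stands in for
pfn_is_ram(), and instead of calling remap_oldmem_pfn_range() it just records
which segments would be remapped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One contiguous piece of the resulting mapping (hypothetical type). */
struct map_segment {
	unsigned long first_pfn; /* first source pfn covered by the segment */
	unsigned long npages;    /* number of pages in the segment */
	bool zero_page;          /* true: page is replaced by the zero page */
};

/*
 * Split [pfn, pfn + npages) into maximal RAM runs and single zero-page
 * substitutions. is_ram[i] describes pfn + i.
 * Returns the number of segments written, or (size_t)-1 if out[] is full.
 */
size_t split_non_ram(unsigned long pfn, const bool *is_ram,
		     unsigned long npages,
		     struct map_segment *out, size_t max_out)
{
	unsigned long pos, run_start = pfn, end = pfn + npages;
	size_t n = 0;

	assert(is_ram != NULL && out != NULL);

	for (pos = pfn; pos < end; ++pos) {
		if (is_ram[pos - pfn])
			continue;
		/* Flush the RAM run that ends just before this page. */
		if (pos > run_start) {
			if (n == max_out)
				return (size_t)-1;
			out[n++] = (struct map_segment){ run_start,
							 pos - run_start, false };
		}
		/* The non-ram page itself becomes a zero-page mapping. */
		if (n == max_out)
			return (size_t)-1;
		out[n++] = (struct map_segment){ pos, 1, true };
		run_start = pos + 1;
	}
	/* Flush the trailing RAM run, if any. */
	if (pos > run_start) {
		if (n == max_out)
			return (size_t)-1;
		out[n++] = (struct map_segment){ run_start,
						 pos - run_start, false };
	}
	return n;
}
```

When every page is RAM the function produces a single segment, which
corresponds to the direct path kept for bare metal.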

Changes from v7:
- address kbuild test robot's warnings by making remap_oldmem_pfn_checked()
  and vmcore_remap_oldmem_pfn() static (Fengguang Wu's patch)

Changes from v6:
- remove useless len increment when remapping the rest of the region

Changes from v5:
- make len size_t to match do_unmap() interface

Changes from v4:
- change map_size type size_t -> unsigned long
- use prot instead of vma->vm_page_prot inside remap_oldmem_pfn_checked()

Changes from v3:
- multi line comment style changes
- minor code style changes

Changes from v2:
- make remap_oldmem_pfn_checked() interface exactly match
  remap_oldmem_pfn_range()
- unmap mapped part inside remap_oldmem_pfn_checked() in case of failure so
  we don't need to take care of it in mmap_vmcore()
- create vmcore_remap_oldmem_pfn() wrapper

Changes from v1:
- comment style changes
- change remap_oldmem_pfn_checked() interface to closer match the
  remap_oldmem_pfn() interface
- preserve formal parameters within the loop, make the loop conditions
  easier to understand
- use my_zero_pfn() for the zero page
- return remapped length instead of new offset

Reviewed-by: Andrew Jones drjo...@redhat.com
Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 fs/proc/vmcore.c | 82 +---
 1 file changed, 79 insertions(+), 3 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 382aa89..a90d6d35 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -328,6 +328,82 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
  * virtually contiguous user-space in ELF layout.
  */
 #ifdef CONFIG_MMU
+/*
+ * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages
+ * reported as not being ram with the zero page.
+ *
+ * @vma: vm_area_struct describing requested mapping
+ * @from: start remapping from
+ * @pfn: page frame number to start remapping to
+ * @size: remapping size
+ * @prot: protection bits
+ *
+ * Returns zero on success, -EAGAIN on failure.
+ */
+static int remap_oldmem_pfn_checked(struct vm_area_struct *vma,
+				    unsigned long from, unsigned long pfn,
+				    unsigned long size, pgprot_t prot)
+{
+	unsigned long map_size;
+	unsigned long pos_start, pos_end, pos;
+	unsigned long zeropage_pfn = my_zero_pfn(0);
+	size_t len = 0;
+
+	pos_start = pfn;
+	pos_end = pfn + (size >> PAGE_SHIFT);
+
+	for (pos = pos_start; pos < pos_end; ++pos) {
+		if (!pfn_is_ram(pos)) {
+			/*
+			 * We hit a page which is not ram. Remap the continuous
+			 * region between pos_start and pos-1 and replace
+			 * the non-ram page at pos with the zero page.
+			 */
+			if (pos > pos_start) {
+				/* Remap continuous region */
+				map_size = (pos - pos_start) << PAGE_SHIFT;
+				if (remap_oldmem_pfn_range(vma, from + len,
+							   pos_start, map_size,
+							   prot))
+					goto fail;
+				len += map_size;
+			}
+			/* Remap the zero page */
+			if (remap_oldmem_pfn_range(vma, from + len,
+						   zeropage_pfn,
+						   PAGE_SIZE, prot))
+				goto fail;
+			len += PAGE_SIZE;
+			pos_start = pos + 1;
+		}
+	}
+	if (pos > pos_start) {
+		/* Remap the rest */
+		map_size = (pos - pos_start) << PAGE_SHIFT;
+		if (remap_oldmem_pfn_range(vma, from + len, pos_start,
+					   map_size, prot))
+			goto fail;
+	}
+	return 0;
+fail

Re: [PATCH RFC 1/4] xen PVonHVM: use E820_Reserved area for shared_info

2014-07-18 Thread Vitaly Kuznetsov
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

 On Tue, Jul 15, 2014 at 05:43:17PM +0200, Vitaly Kuznetsov wrote:
 Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:
 
  On Tue, Jul 15, 2014 at 03:40:37PM +0200, Vitaly Kuznetsov wrote:
  From: Olaf Hering o...@aepfle.de
  
  This is a respin of 00e37bdb0113a98408de42db85be002f21dbffd3
  (xen PVonHVM: move shared_info to MMIO before kexec).
  
  Currently kexec in a PVonHVM guest fails with a triple fault because the
  new kernel overwrites the shared info page. The exact failure depends on
  the size of the kernel image. This patch moves the pfn from RAM into an
  E820 reserved memory area.
  
  The pfn containing the shared_info is located somewhere in RAM. This will
  cause trouble if the current kernel is doing a kexec boot into a new
  kernel. The new kernel (and its startup code) can not know where the pfn
  is, so it can not reserve the page. The hypervisor will continue to update
  the pfn, and as a result memory corruption occurs in the new kernel.
  
  The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the
  E820 map. Within that range newer toolstacks (4.3+) will keep 1MB
  starting from FE700000 as reserved for guest use. Older Xen4 toolstacks
  will usually not allocate areas up to FE700000, so FE700000 is expected
  to work also with older toolstacks.
  
  In Xen3 there is no reserved area at a fixed location. If the guest is
  started on such old hosts the shared_info page will be placed in RAM. As
  a result kexec can not be used.
 
  So this looks right, the one thing that we really need to check 
  is e9daff24a266307943457086533041bd971d0ef9
 
 This reverts commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f.
 
  We are doing this b/c on 32-bit PVonHVM with older hypervisors
  (Xen 4.1) it ends up bothing up the start_info. This is bad b/c
  we use it for the time keeping, and the timekeeping code loops
  forever - as the version field never changes. Olaf says to
  revert it, so lets do that.
 
  Could you kindly test that the migration on 32-bit PVHVM guests
  on older hypervisors works?
 
 
 Sure, will do! Was there anything special about the setup or any 32-bit
 pvhvm guest migration (on 64-bit hypervisor I suppose) would fail? I can
 try checking both current and old versions to make sure the issue was
 actually fixed.

 Nothing fancy (well, it was SMP, so 4 CPUs). I did the 'save'/'restore' and 
 the
 guest would not restore properly.


The symptoms you saw were: after the resume guest appears to be frozen,
all vcpus except for the first one spin at 100%? I was able to reproduce
that on old patch version and everything works fine with your fix
(calling xen_hvm_set_shared_info() in addition to
xen_hvm_connect_shared_info() on resume). We're probably safe to apply
it now, thanks!

However I'd like to suggest we remove '__init' from
xen_hvm_set_shared_info() as now we call it on resume.

 Thank you!
 
  
  Signed-off-by: Olaf Hering o...@aepfle.de
  Signed-off-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
  (cherry picked from commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f)
  
  [On resume we need to reset the xen_vcpu_info, which the original
  patch did not do]
  
  Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
  ---
   arch/x86/xen/enlighten.c | 74 
  
   arch/x86/xen/suspend.c   |  2 +-
   arch/x86/xen/xen-ops.h   |  2 +-
   3 files changed, 58 insertions(+), 20 deletions(-)
  
  diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
  index ffb101e..a11af62 100644
  --- a/arch/x86/xen/enlighten.c
  +++ b/arch/x86/xen/enlighten.c
  @@ -1726,23 +1726,29 @@ asmlinkage __visible void __init 
  xen_start_kernel(void)
   #endif
   }
   
  -void __ref xen_hvm_init_shared_info(void)
  +#ifdef CONFIG_XEN_PVHVM
  +#define HVM_SHARED_INFO_ADDR 0xFE700000UL
  +static struct shared_info *xen_hvm_shared_info;
  +static unsigned long xen_hvm_sip_phys;
  +static int xen_major, xen_minor;
  +
  +static void xen_hvm_connect_shared_info(unsigned long pfn)
   {
  - int cpu;
struct xen_add_to_physmap xatp;
  - static struct shared_info *shared_info_page = 0;
   
  - if (!shared_info_page)
  - shared_info_page = (struct shared_info *)
  - extend_brk(PAGE_SIZE, PAGE_SIZE);
xatp.domid = DOMID_SELF;
xatp.idx = 0;
xatp.space = XENMAPSPACE_shared_info;
  - xatp.gpfn = __pa(shared_info_page) >> PAGE_SHIFT;
  + xatp.gpfn = pfn;
    if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp))
BUG();
   
  - HYPERVISOR_shared_info = (struct shared_info *)shared_info_page;
  +}
  +static void __init xen_hvm_set_shared_info(struct shared_info *sip)
  +{
  + int cpu;
  +
  + HYPERVISOR_shared_info = sip;
   
/* xen_vcpu is a pointer to the vcpu_info struct in the shared_info
 * page, we use it in the event channel upcall and in some pvclock
  @@ -1760,20 +1766,39 @@ void __ref

Re: [PATCH RFC 1/4] xen PVonHVM: use E820_Reserved area for shared_info

2014-07-18 Thread Vitaly Kuznetsov
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

 On Fri, Jul 18, 2014 at 01:05:46PM +0200, Vitaly Kuznetsov wrote:
 Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:
 
  On Tue, Jul 15, 2014 at 05:43:17PM +0200, Vitaly Kuznetsov wrote:
  Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:
  
   On Tue, Jul 15, 2014 at 03:40:37PM +0200, Vitaly Kuznetsov wrote:
   From: Olaf Hering o...@aepfle.de
   
   This is a respin of 00e37bdb0113a98408de42db85be002f21dbffd3
   (xen PVonHVM: move shared_info to MMIO before kexec).
   
   Currently kexec in a PVonHVM guest fails with a triple fault because 
   the
   new kernel overwrites the shared info page. The exact failure depends 
   on
   the size of the kernel image. This patch moves the pfn from RAM into an
   E820 reserved memory area.
   
   The pfn containing the shared_info is located somewhere in RAM. This 
   will
   cause trouble if the current kernel is doing a kexec boot into a new
   kernel. The new kernel (and its startup code) can not know where the 
   pfn
   is, so it can not reserve the page. The hypervisor will continue to 
   update
   the pfn, and as a result memory corruption occurs in the new kernel.
   
   The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the
   E820 map. Within that range newer toolstacks (4.3+) will keep 1MB
   starting from FE700000 as reserved for guest use. Older Xen4 toolstacks
   will usually not allocate areas up to FE700000, so FE700000 is expected
   to work also with older toolstacks.
   
   In Xen3 there is no reserved area at a fixed location. If the guest is
   started on such old hosts the shared_info page will be placed in RAM. 
   As
   a result kexec can not be used.
  
   So this looks right, the one thing that we really need to check 
   is e9daff24a266307943457086533041bd971d0ef9
  
  This reverts commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f.
  
   We are doing this b/c on 32-bit PVonHVM with older hypervisors
   (Xen 4.1) it ends up bothing up the start_info. This is bad b/c
   we use it for the time keeping, and the timekeeping code loops
   forever - as the version field never changes. Olaf says to
   revert it, so lets do that.
  
   Could you kindly test that the migration on 32-bit PVHVM guests
   on older hypervisors works?
  
  
  Sure, will do! Was there anything special about the setup or any 32-bit
  pvhvm guest migration (on 64-bit hypervisor I suppose) would fail? I can
  try checking both current and old versions to make sure the issue was
  actually fixed.
 
  Nothing fancy (well, it was SMP, so 4 CPUs). I did the 'save'/'restore' 
  and the
  guest would not restore properly.
 
 
 The symptoms you saw were: after the resume guest appears to be frozen,
 all vcpus except for the first one spin at 100%? I was able to reproduce

 Yes, that is it.
 that on old patch version and everything works fine with your fix
 (calling xen_hvm_set_shared_info() in addition to
 xen_hvm_connect_shared_info() on resume). We're probably safe to apply
 it now, thanks!

 Woot!  Could you include that tidbit of information in
 the commit please?


Sure,

 
 However I'd like to suggest we remove '__init' from
 xen_hvm_set_shared_info() as now we call it on resume.

 Good idea.

 Lets wait until Stefano responds (for the MSI PIRQ one), and
 if he does not have anything special to say, then repost the
 whole patchset including this tiny __init fix and the updated
 comment?

Deal :-) Please take a look at my '[PATCH RFC] evtchn: introduce
EVTCHNOP_fifo_destroy hypercall'. In case that works we can fix FIFO
case at the same time, no TODO required.

I'll be able to return to this work at the end of next week.

-- 
  Vitaly


Re: [PATCH RFC 0/4] xen/pvhvm: fix shared_info and pirq issues with kexec

2014-08-01 Thread Vitaly Kuznetsov
David Vrabel david.vra...@citrix.com writes:

 On 15/07/14 14:40, Vitaly Kuznetsov wrote:
 With this patch series I'm trying to address several issues with kexec on 
 pvhvm:
 - shared_info issue (1st patch, just sending Olaf's work with Konrad's fix)
 - create specific pvhvm shutdown handler for kexec (2nd patch)
 - GSI PIRQ issue (3rd patch, I'm pretty confident that it does the right 
 thing)
 - MSI PIRQ issue (4th patch, and I'm not sure it doesn't break anything - 
 RFC)
 
 This patch series can be tested on single vCPU guest. We still have SMP 
 issues with
 pvhvm guests and kexec which require additional fixes.

 In addition to the fixes for multi-VCPU guests, what else remains?


I'm aware of grants and ballooned out pages.

 What's the plan for handling granted pages?


(if I got the design right) we have two issues:

1) Pages we grant access to other domains. We have the list so we can
try doing gnttab_end_foreign_access for all unmapped grants but there is
nothing we can do with mapped ones from guest. We can either assume that
all such usages are short-term and try waiting for them to finish or we
need to do something like force-unmap from hypervisor side.

2) Pages we mapped from other domains. There is no easy way to collect
all grant handles from different places in kernel atm so I can see two
possible solutions:
- we keep track of all handles with new kernel structure in guest and
unmap them all on kexec/kdump.
- we introduce new GNTTABOP_reset which does something similar to
gnttab_release_mappings().

There is nothing we need to do with transferred grants (and I don't see
transfer usages in kernel).

Please correct me if I'm wrong.
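
The first option under 2) (tracking all handles in the guest) could look
roughly like the sketch below. This is a userspace illustration only: every
sketch_* name is hypothetical, the fixed-size array is an arbitrary
simplification, and sketch_fake_unmap() merely stands in for whatever would
issue the real unmap hypercall.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_HANDLES 64

typedef uint32_t sketch_handle_t;

struct sketch_registry {
	sketch_handle_t handles[SKETCH_MAX_HANDLES];
	size_t count;
};

/* Record a handle right after a successful map. Returns 0, or -1 if full. */
int sketch_track(struct sketch_registry *r, sketch_handle_t h)
{
	if (r->count == SKETCH_MAX_HANDLES)
		return -1;
	r->handles[r->count++] = h;
	return 0;
}

/* Forget a handle after a normal unmap. Returns 0, or -1 if not tracked. */
int sketch_untrack(struct sketch_registry *r, sketch_handle_t h)
{
	for (size_t i = 0; i < r->count; i++) {
		if (r->handles[i] == h) {
			/* Order does not matter: move the last entry here. */
			r->handles[i] = r->handles[--r->count];
			return 0;
		}
	}
	return -1;
}

/* Stand-in for the real unmap operation; always succeeds in this sketch. */
int sketch_fake_unmap(sketch_handle_t h)
{
	(void)h;
	return 0;
}

/*
 * On kexec/kdump: unmap everything that is still tracked.
 * Returns the number of handles for which unmap() reported failure.
 */
size_t sketch_unmap_all(struct sketch_registry *r,
			int (*unmap)(sketch_handle_t))
{
	size_t failed = 0;

	assert(unmap != NULL);
	while (r->count > 0) {
		if (unmap(r->handles[--r->count]) != 0)
			failed++;
	}
	return failed;
}
```

The cost of this approach is that every place in the kernel that maps a
grant has to call the tracking helpers, which is what a hypervisor-side
reset operation would avoid.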

 I don't think we want to accept a partial solution unless the known
 non-working configurations fail fast on kexec load.

*I think* we can leave ballooned out pages out of scope for now.

Thanks,


 David

-- 
  Vitaly


Re: [PATCH RFC 0/4] xen/pvhvm: fix shared_info and pirq issues with kexec

2014-08-04 Thread Vitaly Kuznetsov
David Vrabel david.vra...@citrix.com writes:

 On 01/08/14 13:21, Vitaly Kuznetsov wrote:
 David Vrabel david.vra...@citrix.com writes:
 
 On 15/07/14 14:40, Vitaly Kuznetsov wrote:
 With this patch series I'm trying to address several issues with kexec on 
 pvhvm:
 - shared_info issue (1st patch, just sending Olaf's work with Konrad's fix)
 - create specific pvhvm shutdown handler for kexec (2nd patch)
 - GSI PIRQ issue (3rd patch, I'm pretty confident that it does the right 
 thing)
 - MSI PIRQ issue (4th patch, and I'm not sure it doesn't break anything - 
 RFC)

 This patch series can be tested on single vCPU guest. We still have SMP 
 issues with
 pvhvm guests and kexec which require additional fixes.

 In addition to the fixes for multi-VCPU guests, what else remains?

 
 I'm aware of grants and ballooned out pages.
 
 What's the plan for handling granted pages?

 
 (if I got the design right) we have two issues:
 
 1) Pages we grant access to other domains. We have the list so we can
 try doing gnttab_end_foreign_access for all unmapped grants but there is
 nothing we can do with mapped ones from guest. We can either assume that
 all such usages are short-term and try waiting for them to finish or we
 need to do something like force-unmap from hypervisor side.

 Shared rings and persistent grants (used in blkfront) remain mapped for
 long periods so just waiting won't work.

 Force unmap by the hypervisor might be a possibility but the hypervisor
 needs to atomically replace the grant mapping with a different valid
 mapping, or the force unmap would cause the backend that was accessing
 the pages to fault.

 Every writable mapping would have to be replaced with a mapping to a
 unique page (to prevent information leaking between different granted
 pages).  Read-only mappings could be replaced with a read-only mapping
 to a shared zero page safely.

 The only way I can see how to do this requires co-operation from the
 backend kernel -- it would need to provide replacement frames for every
 grant map.  Xen would then use this frame when force-unmapping
 (revoking) the mapping.

We can introduce something like a GNTTABOP_safe_map_grant_ref op with such
replacement frames and use them in 'force unmap', but to be honest I'd
like to avoid that. What if Xen could allocate and provide replacement
pages on its own? I understand we need to take some precautions against
malicious domains hogging all memory with such an approach.

Actually if we narrow down our case to backend/frontend usage only we
can think that grants do not introduce any issue for kexec/kdump because:
- frontends are being closed on kexec so backends unmap corresponding
grants. That's not true for the persistent grant case as persistent grants
are being unmapped only in xen_blkif_free() but not in
xen_blkif_disconnect() atm. The following patch helps:

diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 3a8b810..54f4089 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -270,6 +270,9 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif)
 		blkif->blk_rings.common.sring = NULL;
 	}
 
+	/* Remove all persistent grants and the cache of ballooned pages. */
+	xen_blkbk_free_caches(blkif);
+
 	return 0;
 }
 
@@ -281,9 +284,6 @@ static void xen_blkif_free(struct xen_blkif *blkif)
 	xen_blkif_disconnect(blkif);
 	xen_vbd_free(&blkif->vbd);
 
-	/* Remove all persistent grants and the cache of ballooned pages. */
-	xen_blkbk_free_caches(blkif);
-
 	/* Make sure everything is drained before shutting down */
 	BUG_ON(blkif->persistent_gnt_c != 0);
 	BUG_ON(atomic_read(&blkif->persistent_gnt_in_use) != 0);


- in kdump case our new kernel lives in its own space so we won't hit
granted pages (we can read some crap when collecting memory dump but
it's not any better in case these pages were unmapped)

- frontends are being reinitialized on new kernel startup, that also
causes the unmap (with the patch from above applied).

So I think we need to do something with grants in two cases: they were
used outside frontends/backends and we have no idea if they are going to
be unmapped (can be solved with kexec_is_safe flag) and when backend got
stuck and refused to close/unmap (should be rare).


 2) Pages we mapped from other domains. There is no easy way to collect
 all grant handles from different places in kernel atm so I can see two
 possible solutions:
 - we keep track of all handles with new kernel structure in guest and
 unmap them all on kexec/kdump.
 - we introduce new GNTTABOP_reset which does something similar to
 gnttab_release_mappings().

 I think you can ignore this for now -- frontend drivers do not grant
 map, but see suggestion about kexec_is_safe below.

 There is nothing we need to do with transferred grants (and I don't see
 transfer usages in kernel).

 Agreed.

 I don't think we want

[PATCH] Drivers: hv: util: make struct hv_do_fcopy match Hyper-V host messages

2014-10-24 Thread Vitaly Kuznetsov
An attempt to fix fcopy on i586 (bc5a5b0 Drivers: hv: util: Properly pack the
data for file copy functionality) led to a regression on x86_64 (and actually
didn't fix i586 breakage). Fcopy messages from Hyper-V host come in the
following format:

struct do_fcopy_hdr   |   36 bytes
0000 (padding)        |    4 bytes
offset                |    8 bytes
size                  |    4 bytes
data                  | 6144 bytes

On x86_64 struct hv_do_fcopy matched this format without
'__attribute__((packed))', and on i586 adding '__attribute__((packed))' to it
doesn't change anything. Keep the structure packed and add padding to match
reality. Tested both i586 and x86_64 on Hyper-V Server 2012 R2.
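
The layout argument can be checked at compile time. The sketch below is not
the uapi header: the sketch_* types are hypothetical stand-ins, the header is
modelled as an opaque 36-byte blob, and '__attribute__((packed))' is the
GCC/Clang spelling. It shows that a packed struct without the pad puts
'offset' at byte 36, while adding the 4-byte pad moves it to byte 40, as in
the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DATA_FRAGMENT (6 * 1024) /* 6144 bytes, as in the table */

/* Stand-in for struct hv_fcopy_hdr: only its 36-byte size matters here. */
struct sketch_fcopy_hdr {
	uint8_t bytes[36];
} __attribute__((packed));

/* Packed, no explicit padding: 'offset' directly follows the header. */
struct sketch_do_fcopy_nopad {
	struct sketch_fcopy_hdr hdr;
	uint64_t offset;
	uint32_t size;
	uint8_t data[SKETCH_DATA_FRAGMENT];
} __attribute__((packed));

/* Packed, with explicit padding: matches the message format. */
struct sketch_do_fcopy {
	struct sketch_fcopy_hdr hdr;
	uint32_t pad;
	uint64_t offset;
	uint32_t size;
	uint8_t data[SKETCH_DATA_FRAGMENT];
} __attribute__((packed));

static_assert(offsetof(struct sketch_do_fcopy_nopad, offset) == 36,
	      "without pad, offset starts right after the header");
static_assert(offsetof(struct sketch_do_fcopy, offset) == 40,
	      "with pad, offset starts at byte 40");
static_assert(offsetof(struct sketch_do_fcopy, size) == 48,
	      "size follows the 8-byte offset");
static_assert(offsetof(struct sketch_do_fcopy, data) == 52,
	      "data follows the 4-byte size");
```

Because the struct is packed and the padding is explicit, the offsets no
longer depend on how a given ABI aligns 64-bit members.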

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 include/uapi/linux/hyperv.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/hyperv.h b/include/uapi/linux/hyperv.h
index 0a8e6ba..bb1cb73 100644
--- a/include/uapi/linux/hyperv.h
+++ b/include/uapi/linux/hyperv.h
@@ -134,6 +134,7 @@ struct hv_start_fcopy {
 
 struct hv_do_fcopy {
struct hv_fcopy_hdr hdr;
+   __u32   pad;
__u64   offset;
__u32   size;
__u8data[DATA_FRAGMENT];
-- 
1.9.3



[PATCH] tools: hv: introduce -n/--no-daemon option

2014-10-22 Thread Vitaly Kuznetsov
All tools/hv daemons do a mandatory daemon() on startup. However, no pidfile
is created, which makes it difficult for an init system to track such daemons.
Modern Linux distros use systemd as their init system. It can handle the
daemonizing by itself; however, it requires a daemon to stay in the foreground
for that. Some distros already carry a distro-specific patch for hv tools
which switches off daemon().

Introduce -n/--no-daemon option for all 3 daemons in tools/hv. Parse options
with getopt() to make this part easily expandable.
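
For reference, the option handling can be exercised in isolation. This is a
sketch rather than the patch itself: parse_daemonize() is a hypothetical
helper, it returns a value instead of calling exit(), and it resets optind so
it can be called more than once in a test.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

/*
 * Returns 1 if the daemon should call daemon(), 0 if -n/--no-daemon was
 * given, and -1 for -h/--help or an unrecognized option.
 */
int parse_daemonize(int argc, char *argv[])
{
	static const struct option long_options[] = {
		{ "help",      no_argument, NULL, 'h' },
		{ "no-daemon", no_argument, NULL, 'n' },
		{ NULL,        0,           NULL, 0   }
	};
	int daemonize = 1, opt;

	assert(argc >= 1 && argv != NULL);

	optind = 1; /* restart scanning so repeated calls work */
	while ((opt = getopt_long(argc, argv, "hn", long_options,
				  NULL)) != -1) {
		switch (opt) {
		case 'n':
			daemonize = 0;
			break;
		case 'h':
		default:
			return -1;
		}
	}
	return daemonize;
}
```

A caller would print the usage text and exit on -1, and skip daemon() when
the result is 0, which is what a systemd unit running the daemon in the
foreground needs.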

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 tools/hv/hv_fcopy_daemon.c | 33 +++--
 tools/hv/hv_kvp_daemon.c   | 34 --
 tools/hv/hv_vss_daemon.c   | 33 +++--
 3 files changed, 94 insertions(+), 6 deletions(-)

diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c
index 8f96b3e..f437d73 100644
--- a/tools/hv/hv_fcopy_daemon.c
+++ b/tools/hv/hv_fcopy_daemon.c
@@ -33,6 +33,7 @@
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <dirent.h>
+#include <getopt.h>
 
 static int target_fd;
 static char target_fname[W_MAX_PATH];
@@ -126,15 +127,43 @@ static int hv_copy_cancel(void)
 
 }
 
-int main(void)
+void print_usage(char *argv[])
+{
+   fprintf(stderr, "Usage: %s [options]\n"
+   "Options are:\n"
+   "  -n, --no-daemon        stay in foreground, don't daemonize\n"
+   "  -h, --help             print this help\n", argv[0]);
+}
+
+int main(int argc, char *argv[])
 {
int fd, fcopy_fd, len;
int error;
+   int daemonize = 1, long_index = 0, opt;
int version = FCOPY_CURRENT_VERSION;
char *buffer[4096 * 2];
struct hv_fcopy_hdr *in_msg;
 
-   if (daemon(1, 0)) {
+   static struct option long_options[] = {
+   {"help",      no_argument,   0,  'h' },
+   {"no-daemon", no_argument,   0,  'n' },
+   {0,           0,             0,  0   }
+   };
+
+   while ((opt = getopt_long(argc, argv, "hn", long_options,
+ &long_index)) != -1) {
+   switch (opt) {
+   case 'n':
+   daemonize = 0;
+   break;
+   case 'h':
+   default:
+   print_usage(argv);
+   exit(EXIT_FAILURE);
+   }
+   }
+
+   if (daemonize && daemon(1, 0)) {
 syslog(LOG_ERR, "daemon() failed; error: %s", strerror(errno));
exit(EXIT_FAILURE);
}
diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
index 4088b81..22b0764 100644
--- a/tools/hv/hv_kvp_daemon.c
+++ b/tools/hv/hv_kvp_daemon.c
@@ -43,6 +43,7 @@
 #include <fcntl.h>
 #include <dirent.h>
 #include <net/if.h>
+#include <getopt.h>
 
 /*
  * KVP protocol: The user mode component first registers with the
@@ -1417,7 +1418,15 @@ netlink_send(int fd, struct cn_msg *msg)
 return sendmsg(fd, &message, 0);
 }
 
-int main(void)
+void print_usage(char *argv[])
+{
+   fprintf(stderr, "Usage: %s [options]\n"
+   "Options are:\n"
+   "  -n, --no-daemon        stay in foreground, don't daemonize\n"
+   "  -h, --help             print this help\n", argv[0]);
+}
+
+int main(int argc, char *argv[])
 {
int fd, len, nl_group;
int error;
@@ -1435,9 +1444,30 @@ int main(void)
struct hv_kvp_ipaddr_value *kvp_ip_val;
char *kvp_recv_buffer;
size_t kvp_recv_buffer_len;
+   int daemonize = 1, long_index = 0, opt;
+
+   static struct option long_options[] = {
+   {"help",      no_argument,   0,  'h' },
+   {"no-daemon", no_argument,   0,  'n' },
+   {0, 0, 0,  0   }
+   };
+
+   while ((opt = getopt_long(argc, argv, "hn", long_options,
+ &long_index)) != -1) {
+   switch (opt) {
+   case 'n':
+   daemonize = 0;
+   break;
+   case 'h':
+   default:
+   print_usage(argv);
+   exit(EXIT_FAILURE);
+   }
+   }
 
-   if (daemon(1, 0))
+   if (daemonize && daemon(1, 0))
return 1;
+
 openlog("KVP", 0, LOG_USER);
 syslog(LOG_INFO, "KVP starting; pid is:%d", getpid());
 
diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index 6a213b8..9ae2b6e 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -36,6 +36,7 @@
 #include <linux/hyperv.h>
 #include <linux/netlink.h>
 #include <syslog.h>
+#include <getopt.h>
 
 static struct sockaddr_nl addr;
 
@@ -131,7 +132,15 @@ static int netlink_send(int fd, struct cn_msg *msg)
 return sendmsg(fd, &message, 0);
 }
 
-int main(void)
+void print_usage(char *argv[])
+{
+   fprintf(stderr, Usage: %s [options]\n
+   Options

[PATCH] xen/blkback: unmap all persistent grants when frontend gets disconnected

2014-09-08 Thread Vitaly Kuznetsov
blkback does not unmap persistent grants when frontend goes to Closed
state (e.g. when blkfront module is being removed). This leads to the
following in guest's dmesg:

[  343.243825] xen:grant_table: WARNING: g.e. 0x445 still in use!
[  343.243825] xen:grant_table: WARNING: g.e. 0x42a still in use!
...

When the load module -> use device -> unload module sequence is performed
multiple times, it is possible to hit a BUG() condition in the blkfront module:

[  343.243825] kernel BUG at drivers/block/xen-blkfront.c:954!
[  343.243825] invalid opcode:  [#1] SMP
[  343.243825] Modules linked in: xen_blkfront(-) ata_generic pata_acpi [last unloaded: xen_blkfront]
...
[  343.243825] Call Trace:
[  343.243825]  [814111ef] ? unregister_xenbus_watch+0x16f/0x1e0
[  343.243825]  [a0016fbf] blkfront_remove+0x3f/0x140 [xen_blkfront]
...
[  343.243825] RIP  [a0016aae] blkif_free+0x34e/0x360 [xen_blkfront]
[  343.243825]  RSP 88001eb8fdc0

We don't need to keep these grants if we're disconnecting, as the frontend might
have already forgotten about them. Solve the issue by moving the
xen_blkbk_free_caches() call from xen_blkif_free() to xen_blkif_disconnect().

Now we can see the following:
[  928.590893] xen:grant_table: WARNING: g.e. 0x587 still in use!
[  928.591861] xen:grant_table: WARNING: g.e. 0x372 still in use!
...
[  929.592146] xen:grant_table: freeing g.e. 0x587
[  929.597174] xen:grant_table: freeing g.e. 0x372
...

Backend does not keep persistent grants any more, reconnect works fine.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 drivers/block/xen-blkback/xenbus.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 3a8b810..54f4089 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -270,6 +270,9 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif)
 blkif->blk_rings.common.sring = NULL;
}
 
+   /* Remove all persistent grants and the cache of ballooned pages. */
+   xen_blkbk_free_caches(blkif);
+
return 0;
 }
 
@@ -281,9 +284,6 @@ static void xen_blkif_free(struct xen_blkif *blkif)
xen_blkif_disconnect(blkif);
 xen_vbd_free(&blkif->vbd);
 
-   /* Remove all persistent grants and the cache of ballooned pages. */
-   xen_blkbk_free_caches(blkif);
-
/* Make sure everything is drained before shutting down */
 BUG_ON(blkif->persistent_gnt_c != 0);
 BUG_ON(atomic_read(&blkif->persistent_gnt_in_use) != 0);
-- 
1.9.3



[PATCH] xen/blkfront: improve protection against issuing unsupported REQ_FUA

2014-10-27 Thread Vitaly Kuznetsov
Guard against issuing unsupported REQ_FUA and REQ_FLUSH was introduced
in d11e61583 and was factored out into blkif_request_flush_valid() in
0f1ca65ee. However:
1) This check is incomplete. If we negotiated feature_flush = REQ_FLUSH
   and flush_op = BLKIF_OP_FLUSH_DISKCACHE (so FUA is unsupported), a FUA
   request will still pass the check.
2) blkif_request_flush_valid() is misnamed. It is bool but returns true when
   the request is invalid.
3) When blkif_request_flush_valid() fails -EIO is being returned. It seems that
   -EOPNOTSUPP is more appropriate here.
Fix all of the above issues.
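
To make issue 1) concrete, here is a minimal user-space sketch of the two
checks side by side. The DEMO_* flag values and the demo_* function names are
assumptions made for illustration only; they are not the kernel's definitions,
and the REQ_TYPE_FS test is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for illustration; the kernel defines its own. */
#define DEMO_REQ_FLUSH (1u << 0)
#define DEMO_REQ_FUA   (1u << 1)

/*
 * Per-flag check: a request is invalid when it asks for a capability
 * (flush or FUA) that is not present in the negotiated feature mask.
 */
bool demo_flush_invalid(unsigned int cmd_flags, unsigned int feature_flush)
{
	/* The feature mask may only contain the two known bits. */
	assert((feature_flush & ~(DEMO_REQ_FLUSH | DEMO_REQ_FUA)) == 0);

	return ((cmd_flags & DEMO_REQ_FLUSH) &&
		!(feature_flush & DEMO_REQ_FLUSH)) ||
	       ((cmd_flags & DEMO_REQ_FUA) &&
		!(feature_flush & DEMO_REQ_FUA));
}

/*
 * Combined check modelled on the old logic: any non-zero flush_op lets
 * both flags through, so a FUA request passes even if only flush was
 * negotiated.
 */
bool demo_flush_invalid_old(unsigned int cmd_flags, unsigned int flush_op)
{
	return (cmd_flags & (DEMO_REQ_FLUSH | DEMO_REQ_FUA)) && !flush_op;
}
```

With only flush negotiated, demo_flush_invalid(DEMO_REQ_FUA, DEMO_REQ_FLUSH)
rejects the request, while demo_flush_invalid_old(DEMO_REQ_FUA, 1) lets it
through.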

This patch is based on the original patch by Laszlo Ersek and a comment by
Jeff Moyer.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 drivers/block/xen-blkfront.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 5ac312f..2e6c103 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -582,12 +582,14 @@ static inline void flush_requests(struct blkfront_info *info)
 notify_remote_via_irq(info->irq);
 }
 
-static inline bool blkif_request_flush_valid(struct request *req,
-struct blkfront_info *info)
+static inline bool blkif_request_flush_invalid(struct request *req,
+  struct blkfront_info *info)
 {
 return ((req->cmd_type != REQ_TYPE_FS) ||
-   ((req->cmd_flags & (REQ_FLUSH | REQ_FUA)) &&
-   !info->flush_op));
+   ((req->cmd_flags & REQ_FLUSH) &&
+!(info->feature_flush & REQ_FLUSH)) ||
+   ((req->cmd_flags & REQ_FUA) &&
+!(info->feature_flush & REQ_FUA)));
 }
 
 /*
@@ -612,8 +614,8 @@ static void do_blkif_request(struct request_queue *rq)
 
blk_start_request(req);
 
-   if (blkif_request_flush_valid(req, info)) {
-   __blk_end_request_all(req, -EIO);
+   if (blkif_request_flush_invalid(req, info)) {
+   __blk_end_request_all(req, -EOPNOTSUPP);
continue;
}
 
-- 
1.9.3



Re: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon

2014-11-19 Thread Vitaly Kuznetsov
Dexuan Cui de...@microsoft.com writes:

 Under high memory pressure and very high KVP R/W test pressure, the netlink
 recvfrom() may transiently return ENOBUFS to the daemon -- we found this
 during a 2-week stress test.

 We'd better not terminate the daemon on this failure, because a typical KVP
 user can re-try the R/W and hopefully it will succeed next time.

 Cc: K. Y. Srinivasan k...@microsoft.com
 Signed-off-by: Dexuan Cui de...@microsoft.com
 ---
  tools/hv/hv_kvp_daemon.c | 7 +++
  1 file changed, 7 insertions(+)

 diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
 index 22b0764..9f4b303 100644
 --- a/tools/hv/hv_kvp_daemon.c
 +++ b/tools/hv/hv_kvp_daemon.c
 @@ -1559,8 +1559,15 @@ int main(int argc, char *argv[])
   addr_p, addr_l);

   if (len  0) {
 + int saved_errno = errno;
   syslog(LOG_ERR, recvfrom failed; pid:%u error:%d %s,
   addr.nl_pid, errno, strerror(errno));
 +
 + if (saved_errno == ENOBUFS) {

is it possible to meet EAGAIN (or EWOULDBLOCK) here as well? I'd suggest
we ignore these as well in such case. Ignoring ENOMEM here is doubtful,
I think. But possible.

 + syslog(LOG_ERR, error = ENOBUFS: ignored);
 + continue;
 + }
 +
   close(fd);
   return -1;
   }

-- 
  Vitaly


Re: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon

2014-11-19 Thread Vitaly Kuznetsov
Dexuan Cui de...@microsoft.com writes:

 -Original Message-
 From: Vitaly Kuznetsov
 Sent: Wednesday, November 19, 2014 18:50 PM
 To: Dexuan Cui
 Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
 de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
 jasow...@redhat.com; Haiyang Zhang
 Subject: Re: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon
 
 Dexuan Cui  writes:
 
  Under high memory pressure and very high KVP R/W test pressure, the netlink
  recvfrom() may transiently return ENOBUFS to the daemon -- we found this
  during a 2-week stress test.
 
  We'd better not terminate the daemon on this failure, because a typical KVP
  user can re-try the R/W and hopefully it will succeed next time.
 
  diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
  index 22b0764..9f4b303 100644
  --- a/tools/hv/hv_kvp_daemon.c
  +++ b/tools/hv/hv_kvp_daemon.c
  @@ -1559,8 +1559,15 @@ int main(int argc, char *argv[])
 addr_p, addr_l);
 
 if (len  0) {
  +  int saved_errno = errno;
 syslog(LOG_ERR, recvfrom failed; pid:%u error:%d %s,
 addr.nl_pid, errno, strerror(errno));
  +
  +  if (saved_errno == ENOBUFS) {
 
 is it possible to meet EAGAIN (or EWOULDBLOCK) here as well? I'd suggest
 we ignore these as well in such case. Ignoring ENOMEM here is doubtful,
 I think. But possible.
 
   Vitaly

 I don't think EAGAIN is possible  because man recvfrom says
If  no messages are available at the socket, the receive calls wait for a
  message to arrive, unless the socket is nonblocking (see fcntl(2)), in 
 which
  case the value -1 is returned and  the  external variable  errno is set 
 to
 EAGAIN or EWOULDBLOCK.

 The same man page mention ENOMEM for recvmsg(), but not recvfrom().

Ah, sorry, I thought your patch touched the other place: the call to
netlink_send() which does sendmsg() (and my EAGAIN/EWOULDBLOCK/ENOMEM
comment was about it). It could also make sense to patch them both as I
think it is possible to hit these as well.


 -- Dexuan

-- 
  Vitaly


Re: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon

2014-11-19 Thread Vitaly Kuznetsov
Dexuan Cui de...@microsoft.com writes:

 -Original Message-
 From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
 Sent: Wednesday, November 19, 2014 20:41 PM
 To: Dexuan Cui
 Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
 de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
 jasow...@redhat.com; Haiyang Zhang
 Subject: Re: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon
 
 Dexuan Cui de...@microsoft.com writes:
 
  -Original Message-
  From: Vitaly Kuznetsov
  Sent: Wednesday, November 19, 2014 18:50 PM
  To: Dexuan Cui
  Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
 driverdev-
  de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
  jasow...@redhat.com; Haiyang Zhang
  Subject: Re: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon
 
  Dexuan Cui  writes:
 
   Under high memory pressure and very high KVP R/W test pressure,
 the netlink
   recvfrom() may transiently return ENOBUFS to the daemon -- we found
 this
   during a 2-week stress test.
  
   We'd better not terminate the daemon on this failure, because a
 typical KVP
   user can re-try the R/W and hopefully it will succeed next time.
  
   diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
   index 22b0764..9f4b303 100644
   --- a/tools/hv/hv_kvp_daemon.c
   +++ b/tools/hv/hv_kvp_daemon.c
   @@ -1559,8 +1559,15 @@ int main(int argc, char *argv[])
   addr_p, addr_l);
  
   if (len  0) {
   +   int saved_errno = errno;
   syslog(LOG_ERR, recvfrom failed; pid:%u
 error:%d %s,
   addr.nl_pid, errno, 
   strerror(errno));
   +
   +   if (saved_errno == ENOBUFS) {
 
  is it possible to meet EAGAIN (or EWOULDBLOCK) here as well? I'd
 suggest
  we ignore these as well in such case. Ignoring ENOMEM here is doubtful,
  I think. But possible.
 
Vitaly
 
  I don't think EAGAIN is possible  because man recvfrom says
 If  no messages are available at the socket, the receive calls wait 
  for a
   message to arrive, unless the socket is nonblocking (see fcntl(2)), in
 which
   case the value -1 is returned and  the  external variable  errno is 
  set to
  EAGAIN or EWOULDBLOCK.
 
  The same man page mention ENOMEM for recvmsg(), but not recvfrom().
 
 Ah, sorry, I though your patch patches the other place: call to
 netlink_send() which does sendmsg() (and my
 EAGAIN/EWOULDBLOCK/ENOMEM
 comment was about it). It could also make sense to patch them both as I
 think it is possible to hit these as well.
 
  -- Dexuan
 --
   Vitaly

 OK, I can add this new check:
 (I'll send out the v2 tomorrow in case  people have new comments)


Thanks!

 --- a/tools/hv/hv_kvp_daemon.c
 +++ b/tools/hv/hv_kvp_daemon.c
 @@ -1770,8 +1770,15 @@ kvp_done:

 len = netlink_send(fd, incoming_cn_msg);
 if (len  0) {
 +   int saved_errno = errno;
 syslog(LOG_ERR, net_link send failed; error: %d %s, 
 errno,
 strerror(errno));
 +
 +   if (saved_errno == ENOMEM || saved_errno ==  EAGAIN) {

Sorry for being pushy, but it seems ENOBUFS is also possible here (at
least man sendmsg mentions it).

 +   syslog(LOG_ERR, send error: ignored);
 +   continue;
 +   }
 +
 exit(EXIT_FAILURE);
 }
 }

 Thanks,
 -- Dexuan

-- 
  Vitaly


Re: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon

2014-11-19 Thread Vitaly Kuznetsov
Dexuan Cui de...@microsoft.com writes:

 -Original Message-
 From: Vitaly Kuznetsov 
  --
Vitaly
 
  OK, I can add this new check:
  (I'll send out the v2 tomorrow in case  people have new comments)
 
 
 Thanks!
 
  --- a/tools/hv/hv_kvp_daemon.c
  +++ b/tools/hv/hv_kvp_daemon.c
  @@ -1770,8 +1770,15 @@ kvp_done:
 
  len = netlink_send(fd, incoming_cn_msg);
  if (len  0) {
  +   int saved_errno = errno;
  syslog(LOG_ERR, net_link send failed; error: %d 
  %s, errno,
  strerror(errno));
  +
  +   if (saved_errno == ENOMEM || saved_errno ==  
  EAGAIN) {
 
 Sorry for being pushy, but it seems ENOBUFS is also possible here (at
 least man sendmsg mentions it).
 OK, I'll add this too. :-)

 BTW, I realized sendmsg() can't return EAGAIN here as that's for non-blocking
 socket.

 Here I simply ignore the error, hoping the other end will re-try.


I agree, it's sufficient to ignore ENOBUFS on the receive path and both
ENOMEM/ENOBUFS on send.
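
To spell out that conclusion, here is a small sketch of the classification
we converged on. The enum and function names are hypothetical and do not
exist in hv_kvp_daemon.c; the daemon open-codes the errno comparisons at the
two call sites.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Which side of the netlink exchange reported the error (hypothetical). */
enum demo_dir { DEMO_RECV, DEMO_SEND };

/*
 * Returns true when the daemon loop may log the error and carry on
 * instead of exiting: ENOBUFS on either path, ENOMEM on send only.
 */
bool demo_is_transient(enum demo_dir dir, int err)
{
	assert(err > 0);	/* expects a positive errno value */

	if (err == ENOBUFS)
		return true;
	return dir == DEMO_SEND && err == ENOMEM;
}
```

Anything else, e.g. EINVAL, still takes the existing failure path.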

Thanks!

 
  +   syslog(LOG_ERR, send error: ignored);
  +   continue;
  +   }
  +
  exit(EXIT_FAILURE);
  }
  }
 
  Thanks,
  -- Dexuan
 
   Vitaly

 -- Dexuan

-- 
  Vitaly


Re: [PATCH v2] tools: hv: ignore ENOBUFS and ENOMEM in the KVP daemon

2014-11-20 Thread Vitaly Kuznetsov
Dexuan Cui de...@microsoft.com writes:

 Under high memory pressure and very high KVP R/W test pressure, the netlink
 recvfrom() may transiently return ENOBUFS to the daemon -- we found this
 during a 2-week stress test.

 We'd better not terminate the daemon on the failure, because a typical KVP
 user will re-try the R/W and hopefully it will succeed next time.

 We can also ignore the errors on sending.

 Cc: Vitaly Kuznetsov vkuzn...@redhat.com
 Cc: K. Y. Srinivasan k...@microsoft.com
 Signed-off-by: Dexuan Cui de...@microsoft.com
 ---

 v2: I also ignore the errors on sending, as Vitaly suggested.

Thanks,

Reviewed-by: Vitaly Kuznetsov vkuzn...@redhat.com


  tools/hv/hv_kvp_daemon.c | 14 ++
  1 file changed, 14 insertions(+)

 diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
 index 22b0764..6a6432a 100644
 --- a/tools/hv/hv_kvp_daemon.c
 +++ b/tools/hv/hv_kvp_daemon.c
 @@ -1559,8 +1559,15 @@ int main(int argc, char *argv[])
   addr_p, addr_l);

   if (len  0) {
 + int saved_errno = errno;
   syslog(LOG_ERR, recvfrom failed; pid:%u error:%d %s,
   addr.nl_pid, errno, strerror(errno));
 +
 + if (saved_errno == ENOBUFS) {
 + syslog(LOG_ERR, receive error: ignored);
 + continue;
 + }
 +
   close(fd);
   return -1;
   }
 @@ -1763,8 +1770,15 @@ kvp_done:

   len = netlink_send(fd, incoming_cn_msg);
   if (len  0) {
 + int saved_errno = errno;
   syslog(LOG_ERR, net_link send failed; error: %d %s, 
 errno,
   strerror(errno));
 +
 + if (saved_errno == ENOMEM || saved_errno == ENOBUFS) {
 + syslog(LOG_ERR, send error: ignored);
 + continue;
 + }
 +
   exit(EXIT_FAILURE);
   }
   }

-- 
  Vitaly


Re: [PATCH 0/3] Tools: hv: vssdaemon: freeze/thaw logic improvement for the failure case

2014-11-10 Thread Vitaly Kuznetsov
Dexuan Cui de...@microsoft.com writes:

 -Original Message-
 From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
 Sent: Saturday, November 8, 2014 1:09 AM
 To: KY Srinivasan; Haiyang Zhang; Greg Kroah-Hartman
 Cc: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org; Dexuan Cui
 Subject: [PATCH 0/3] Tools: hv: vssdaemon: freeze/thaw logic improvement
 for the failure case
 
 This patch series addresses the following issues:
 - Wrong error reporting for multiple filesystems case.
 - Skip all readonly-mounted filesystems instead of skipping iso9660.
 - Thaw all filesystems after an unsuccessful freeze attempt.
 
 Vitaly Kuznetsov (3):
   Tools: hv: vssdaemon: consult with errno in case of failure only
   Tools: hv: vssdaemon: skip all filesystems mounted readonly
   Tools: hv: vssdaemon: thaw everything in case of freeze failure
 
  tools/hv/hv_vss_daemon.c | 14 --
  1 file changed, 12 insertions(+), 2 deletions(-)

 Hi Vitaly,
 Thanks for your patchset!

 FYI: Greg checked in a patch of mine several hours ago -- my patch
 implemented thawing all filesystems on a freeze failure too. :-)

Ah, sorry for stepping on your toes :-)


 Please see my patch in Greg's char-misc-next tree:
 https://git.kernel.org/cgit/linux/kernel/git/gregkh/char-misc.git/commit/?h=char-misc-next&id=4f689190bb55d171d2f6614f8a6cbd4b868e48bd

 Can you please rebase  your patch(es) on Greg's tree?

Sure, I'll throw away my patch#3, rebase, and repost.


 Thanks,
 -- Dexuan

-- 
  Vitaly


[PATCH v2 0/2] Tools: hv: vssdaemon: freeze/thaw logic improvement for the failure case

2014-11-10 Thread Vitaly Kuznetsov
This patch series addresses the following issues:
- Verbosely report errors during freeze (mountpoint, exact error);
- Skip all readonly-mounted filesystems instead of skipping iso9660 only.

Changes since v1:
- Rebase on top of 'char-misc-next';
- "Tools: hv: vssdaemon: thaw everything in case of freeze failure" was
  dropped, as Dexuan's "Tools: hv: vssdaemon: ignore the EBUSY on multiple
  freezing the same partition" contains the same change;
- "Tools: hv: vssdaemon: consult with errno in case of failure only" was
  replaced with "Tools: hv: vssdaemon: report freeze errors".

Vitaly Kuznetsov (2):
  Tools: hv: vssdaemon: report freeze errors
  Tools: hv: vssdaemon: skip all filesystems mounted readonly

 tools/hv/hv_vss_daemon.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

-- 
1.9.3



[PATCH v2 2/2] Tools: hv: vssdaemon: skip all filesystems mounted readonly

2014-11-10 Thread Vitaly Kuznetsov
Instead of growing the list of read-only filesystem exceptions beyond the
iso9660 entry we already have, it is better to skip the freeze
operation for all readonly-mounted filesystems.
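
As a standalone illustration of the check, the sketch below wraps
hasmntopt() from <mntent.h> in a helper. The helper name demo_skip_readonly
is hypothetical; the daemon itself open-codes the test inside its
getmntent() loop.

```c
#include <assert.h>
#include <mntent.h>
#include <stddef.h>

/*
 * Returns 1 if the mount entry carries the "ro" option and should be
 * skipped by a freeze pass, 0 otherwise. The filesystem type is not
 * consulted at all.
 */
int demo_skip_readonly(const struct mntent *ent)
{
	assert(ent != NULL);
	return hasmntopt(ent, MNTOPT_RO) != NULL;
}
```

An entry with mnt_opts "ro,relatime" is skipped whatever its mnt_type is,
while one with "rw,relatime" is still frozen.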

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 tools/hv/hv_vss_daemon.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index ee44f0d..5e63f70 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -102,7 +102,7 @@ static int vss_operate(int operation)
while ((ent = getmntent(mounts))) {
 if (strncmp(ent->mnt_fsname, match, strlen(match)))
 continue;
-   if (strcmp(ent->mnt_type, "iso9660") == 0)
+   if (hasmntopt(ent, MNTOPT_RO) != NULL)
 continue;
 if (strcmp(ent->mnt_type, "vfat") == 0)
continue;
-- 
1.9.3



[PATCH v2 1/2] Tools: hv: vssdaemon: report freeze errors

2014-11-10 Thread Vitaly Kuznetsov
When ioctl(fd, FIFREEZE, 0) results in an error we cannot report it
to syslog instantly since that can cause write to a frozen disk.
However, the name of the filesystem which caused the error and errno
are valuable and we would like to get a nice human-readable message
in the log. Save errno before calling vss_operate(VSS_OP_THAW) and
report the error right after.

Unfortunately, FITHAW errors cannot be reported the same way as we
need to finish thawing all filesystems before calling syslog().
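
The ordering can be sketched in isolation as follows. This is a simplified
model, not the daemon code: the function names are hypothetical, the
"cleanup" callback stands in for the thaw pass, and the message goes to a
buffer instead of syslog().

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for a cleanup pass that happens to overwrite errno. */
void demo_clobber_errno(void)
{
	errno = 0;
}

/*
 * Captures errno first, runs the cleanup step, and only then formats
 * the message from the saved value. Returns what snprintf() returns.
 */
int demo_report_after_cleanup(char *buf, size_t len, const char *mnt_dir,
			      void (*cleanup)(void))
{
	int save_errno = errno;	/* capture before anything else runs */

	assert(buf != NULL && mnt_dir != NULL);
	if (cleanup)
		cleanup();	/* may change errno */

	return snprintf(buf, len, "FREEZE of %s failed; error:%d %s",
			mnt_dir, save_errno, strerror(save_errno));
}
```

Reading errno only after the cleanup step would report whatever the cleanup
left behind rather than the original failure.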

We should also avoid calling endmntent() a second time when we
encounter an error while freezing '/', as it usually results in
SIGSEGV.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 tools/hv/hv_vss_daemon.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index b720d8f..ee44f0d 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -82,7 +82,7 @@ static int vss_operate(int operation)
FILE *mounts;
struct mntent *ent;
unsigned int cmd;
-   int error = 0, root_seen = 0;
+   int error = 0, root_seen = 0, save_errno = 0;
 
switch (operation) {
case VSS_OP_FREEZE:
@@ -114,7 +114,6 @@ static int vss_operate(int operation)
 if (error && operation == VSS_OP_FREEZE)
goto err;
}
-   endmntent(mounts);
 
if (root_seen) {
 error |= vss_do_freeze("/", cmd);
@@ -122,10 +121,19 @@ static int vss_operate(int operation)
goto err;
}
 
-   return error;
+   goto out;
 err:
-   endmntent(mounts);
+   save_errno = errno;
vss_operate(VSS_OP_THAW);
+   /* Call syslog after we thaw all filesystems */
+   if (ent)
+   syslog(LOG_ERR, "FREEZE of %s failed; error:%d %s",
+  ent->mnt_dir, save_errno, strerror(save_errno));
+   else
+   syslog(LOG_ERR, "FREEZE of / failed; error:%d %s", save_errno,
+  strerror(save_errno));
+out:
+   endmntent(mounts);
return error;
 }
 
-- 
1.9.3



Re: [PATCH] hv: hv_fcopy: drop the obsolete message on transfer failure

2014-11-12 Thread Vitaly Kuznetsov
Dexuan Cui de...@microsoft.com writes:

 In the case the user-space daemon crashes, hangs or is killed, we
 need to down the semaphore, otherwise, after the daemon starts next
 time, the obsolete data in fcopy_transaction.message or
 fcopy_transaction.fcopy_msg will be used immediately.

 Cc: K. Y. Srinivasan k...@microsoft.com
 Signed-off-by: Dexuan Cui de...@microsoft.com

Reviewed-by: Vitaly Kuznetsov vkuzn...@redhat.com

 ---
  drivers/hv/hv_fcopy.c | 9 +
  1 file changed, 9 insertions(+)

 diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
 index 23b2ce2..177122a 100644
 --- a/drivers/hv/hv_fcopy.c
 +++ b/drivers/hv/hv_fcopy.c
 @@ -86,6 +86,15 @@ static void fcopy_work_func(struct work_struct *dummy)
* process the pending transaction.
*/
   fcopy_respond_to_host(HV_E_FAIL);
 +
 + /* In the case the user-space daemon crashes, hangs or is killed, we
 +  * need to down the semaphore, otherwise, after the daemon starts next
 +  * time, the obsolete data in fcopy_transaction.message or
 +  * fcopy_transaction.fcopy_msg will be used immediately.
 +  */
 + if (down_trylock(&fcopy_transaction.read_sema))
 + pr_debug("FCP: failed to acquire the semaphore\n");
 +
  }

  static int fcopy_handle_handshake(u32 version)

-- 
  Vitaly
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] xen/blkfront: improve protection against issuing unsupported REQ_FUA

2014-11-03 Thread Vitaly Kuznetsov
Boris Ostrovsky boris.ostrov...@oracle.com writes:

 On 11/03/2014 07:22 AM, Laszlo Ersek wrote:
 On 10/27/14 14:44, Vitaly Kuznetsov wrote:
 Guard against issuing unsupported REQ_FUA and REQ_FLUSH was introduced
 in d11e61583 and was factored out into blkif_request_flush_valid() in
 0f1ca65ee. However:
 1) This check in incomplete. In case we negotiated to feature_flush = 
 REQ_FLUSH
 and flush_op = BLKIF_OP_FLUSH_DISKCACHE (so FUA is unsupported) FUA 
 request
 will still pass the check.
 2) blkif_request_flush_valid() is misnamed. It is bool but returns true when
 the request is invalid.
 3) When blkif_request_flush_valid() fails -EIO is being returned. It seems 
 that
 -EOPNOTSUPP is more appropriate here.
 Fix all of the above issues.

 This patch is based on the original patch by Laszlo Ersek and a comment by
 Jeff Moyer.

 Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
 ---
   drivers/block/xen-blkfront.c | 14 --
   1 file changed, 8 insertions(+), 6 deletions(-)

 diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
 index 5ac312f..2e6c103 100644
 --- a/drivers/block/xen-blkfront.c
 +++ b/drivers/block/xen-blkfront.c
 @@ -582,12 +582,14 @@ static inline void flush_requests(struct 
 blkfront_info *info)
 notify_remote_via_irq(info-irq);
   }
   -static inline bool blkif_request_flush_valid(struct request
 *req,
 -struct blkfront_info *info)
 +static inline bool blkif_request_flush_invalid(struct request *req,
 +  struct blkfront_info *info)
   {
 return ((req-cmd_type != REQ_TYPE_FS) ||
 -   ((req-cmd_flags  (REQ_FLUSH | REQ_FUA)) 
 -   !info-flush_op));
 +   ((req-cmd_flags  REQ_FLUSH) 
 +!(info-feature_flush  REQ_FLUSH)) ||
 +   ((req-cmd_flags  REQ_FUA) 
 +!(info-feature_flush  REQ_FUA)));

 Somewhat unrelated to the patch, but I am wondering whether we
 actually need flush_op field at all as it seems that it is
 unambiguously defined by REQ_FLUSH/REQ_FUA.

I was under the impression it was added for readability's sake, but we
definitely can remove it. If no one objects I'll send a separate cleanup
patch (don't want to mix these two).


 -boris

   }
 /*
 @@ -612,8 +614,8 @@ static void do_blkif_request(struct request_queue *rq)
 blk_start_request(req);
   - if (blkif_request_flush_valid(req, info)) {
 -   __blk_end_request_all(req, -EIO);
 +   if (blkif_request_flush_invalid(req, info)) {
 +   __blk_end_request_all(req, -EOPNOTSUPP);
 continue;
 }
   

 Not sure if there has been some feedback yet (I can't see anything
 threaded with this message in my inbox).

 FWIW I consulted Documentation/block/writeback_cache_control.txt for
 this review. Apparently, REQ_FLUSH forces out previously completed
 write requests, whereas REQ_FUA delays the IO completion signal for
 *this* request until the data has been committed to non-volatile
 storage. So, indeed, support for REQ_FLUSH only does not guarantee that
 REQ_FUA can be served.

 Reviewed-by: Laszlo Ersek ler...@redhat.com

 Thanks
 Laszlo

-- 
  Vitaly


[PATCH] Drivers: hv: vmbus: prevent cpu offlining on newer hypervisors

2014-11-26 Thread Vitaly Kuznetsov
When an SMP Hyper-V guest is running on top of 2012R2 Server and secondary
cpus are sent offline (with echo 0 > /sys/devices/system/cpu/cpu$cpu/online)
a system freeze is observed. This happens due to the fact that on newer
hypervisors (Win8, WS2012R2, ...) vmbus channel handlers are distributed
across all cpus (see init_vp_index() function in drivers/hv/channel_mgmt.c)
and on cpu offlining nobody reassigns them to CPU0. Prevent cpu offlining
when vmbus is loaded until the issue is fixed host-side.
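
The hook swap can be modelled in user space roughly as below. Everything
named demo_* is hypothetical: the struct is a stand-in for an ops table
with a cpu_disable hook, and the enum values are placeholders rather than
the driver's protocol constants. The decision is kept in one function so
there is a single place that picks the hook.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical protocol identifiers; the driver uses its own constants. */
enum demo_proto { DEMO_WS2008, DEMO_WIN7, DEMO_WIN8, DEMO_WIN8_1 };

/* Minimal stand-in for an ops table with a cpu_disable hook. */
struct demo_smp_ops {
	int (*cpu_disable)(void);
};

int demo_native_cpu_disable(void) { return 0; }   /* offlining allowed */
int demo_hv_cpu_disable(void)     { return -1; }  /* offlining refused */

/*
 * Picks the active hook. 'loaded' is nonzero while the bus driver is
 * loaded; on the two older protocol versions the hook is left alone.
 */
void demo_cpu_hotplug_quirk(struct demo_smp_ops *ops, enum demo_proto proto,
			    int loaded)
{
	assert(ops != NULL);

	if (proto == DEMO_WS2008 || proto == DEMO_WIN7)
		return;

	ops->cpu_disable = loaded ? demo_hv_cpu_disable
				  : demo_native_cpu_disable;
}
```

Calling the helper with loaded = 1 at init time and loaded = 0 at exit time
mirrors installing the refusing hook and later restoring the native one.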

This patch also disables hibernation but it is OK as it is also broken (MCE
error is hit on resume). Suspend still works.

Tested with WS2008R2 and WS2012R2.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 drivers/hv/vmbus_drv.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 4d6b269..9a82249 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -32,6 +32,7 @@
 #include <linux/completion.h>
 #include <linux/hyperv.h>
 #include <linux/kernel_stat.h>
+#include <linux/cpu.h>
 #include <asm/hyperv.h>
 #include <asm/hypervisor.h>
 #include <asm/mshyperv.h>
@@ -671,6 +672,13 @@ static void vmbus_isr(void)
 tasklet_schedule(&msg_dpc);
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
+static int hyperv_cpu_disable(void)
+{
+   return -1;
+}
+#endif
+
 /*
  * vmbus_bus_init -Main vmbus driver initialization routine.
  *
@@ -711,6 +719,12 @@ static int vmbus_bus_init(int irq)
if (ret)
goto err_alloc;
 
+#ifdef CONFIG_HOTPLUG_CPU
+   if ((vmbus_proto_version != VERSION_WS2008) &&
+   (vmbus_proto_version != VERSION_WIN7))
+   smp_ops.cpu_disable = hyperv_cpu_disable;
+#endif
+
vmbus_request_offers();
 
return 0;
@@ -964,6 +978,11 @@ static void __exit vmbus_exit(void)
 bus_unregister(&hv_bus);
 hv_cleanup();
 acpi_bus_unregister_driver(&vmbus_acpi_driver);
+#ifdef CONFIG_HOTPLUG_CPU
+   if ((vmbus_proto_version != VERSION_WS2008) &&
+   (vmbus_proto_version != VERSION_WIN7))
+   smp_ops.cpu_disable = native_cpu_disable;
+#endif
 }
 
 
-- 
1.9.3



Re: [PATCH] Drivers: hv: vmbus: prevent cpu offlining on newer hypervisors

2014-11-27 Thread Vitaly Kuznetsov
Dexuan Cui de...@microsoft.com writes:

 -Original Message-
 From: devel [mailto:driverdev-devel-boun...@linuxdriverproject.org] On
 Behalf Of Greg Kroah-Hartman
 Sent: Thursday, November 27, 2014 11:03 AM
 To: Vitaly Kuznetsov
 Cc: de...@linuxdriverproject.org; Haiyang Zhang; linux-
 ker...@vger.kernel.org
 Subject: Re: [PATCH] Drivers: hv: vmbus: prevent cpu offlining on newer
 hypervisors
 
 On Wed, Nov 26, 2014 at 02:52:22PM +0100, Vitaly Kuznetsov wrote:
  When an SMP Hyper-V guest is running on top of 2012R2 Server and secondary
  cpus are sent offline (with echo 0 > /sys/devices/system/cpu/cpu$cpu/online)
  the system freeze is observed. This happens due to the fact that on newer
  hypervisors (Win8, WS2012R2, ...) vmbus channel handlers are distributed
  across all cpus (see init_vp_index() function in drivers/hv/channel_mgmt.c)
  and on cpu offlining nobody reassigns them to CPU0. Prevent cpu offlining
  when vmbus is loaded until the issue is fixed host-side.
 
  This patch also disables hibernation but it is OK as it is also broken (MCE
  error is hit on resume). Suspend still works.
 
  Tested with WS2008R2 and WS2012R2.
 
  Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
  ---
   drivers/hv/vmbus_drv.c | 19 +++
   1 file changed, 19 insertions(+)
 
  diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
  index 4d6b269..9a82249 100644
  --- a/drivers/hv/vmbus_drv.c
  +++ b/drivers/hv/vmbus_drv.c
  @@ -32,6 +32,7 @@
   #include <linux/completion.h>
   #include <linux/hyperv.h>
   #include <linux/kernel_stat.h>
  +#include <linux/cpu.h>
   #include <asm/hyperv.h>
   #include <asm/hypervisor.h>
   #include <asm/mshyperv.h>
  @@ -671,6 +672,13 @@ static void vmbus_isr(void)
   		tasklet_schedule(&msg_dpc);
   }
 
  +#ifdef CONFIG_HOTPLUG_CPU
  +static int hyperv_cpu_disable(void)
  +{
  +  return -1;
  +}
  +#endif
  +
   /*
* vmbus_bus_init -Main vmbus driver initialization routine.
*
  @@ -711,6 +719,12 @@ static int vmbus_bus_init(int irq)
 if (ret)
 goto err_alloc;
 
  +#ifdef CONFIG_HOTPLUG_CPU
  +	if ((vmbus_proto_version != VERSION_WS2008) &&
  +	    (vmbus_proto_version != VERSION_WIN7))
  +		smp_ops.cpu_disable = hyperv_cpu_disable;
  +#endif
  +
 vmbus_request_offers();
 
 return 0;
  @@ -964,6 +978,11 @@ static void __exit vmbus_exit(void)
   	bus_unregister(&hv_bus);
   	hv_cleanup();
   	acpi_bus_unregister_driver(&vmbus_acpi_driver);
  +#ifdef CONFIG_HOTPLUG_CPU
  +	if ((vmbus_proto_version != VERSION_WS2008) &&
  +	    (vmbus_proto_version != VERSION_WIN7))
  +		smp_ops.cpu_disable = native_cpu_disable;
  +#endif
   }
 
 #ifdef in a .c file is not a good idea to do if at all possible, please
 only put this in one place, using a function call to hide the mess.
 
 greg k-h

 Hi Vitaly,
 The idea of the patch is good to me.

 I agree with Greg.
 BTW, maybe hv_cpu_hotplug_quirk() is a better name?

My idea was that eventually this function will start doing something
real (e.g. switching channels to cpu0 if it doesn't happen fully
host-side), so I gave it the general name 'hyperv_cpu_disable'.

I'll try addressing your and Greg's comments in v2, thanks!
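
To make Greg's request concrete, here is a minimal user-space sketch of
hiding an #ifdef behind a single helper so that call sites stay free of
preprocessor conditionals. Everything named demo_* (including the
DEMO_HOTPLUG_CPU switch and struct demo_smp_ops) is a hypothetical stand-in
for illustration only; it is not the kernel's smp_ops and not the code that
ended up in v2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical config switch; remove it to build the empty stub instead. */
#define DEMO_HOTPLUG_CPU 1

/* Stand-in for a table of CPU operations (assumption, not the kernel type). */
struct demo_smp_ops {
	int (*cpu_disable)(void);
};

static int demo_native_cpu_disable(void)
{
	return 0;	/* offlining allowed */
}

static struct demo_smp_ops demo_ops = {
	.cpu_disable = demo_native_cpu_disable,
};

#ifdef DEMO_HOTPLUG_CPU
static int demo_hv_cpu_disable(void)
{
	return -1;	/* offlining refused */
}

/* The only place that knows about the config option. */
static void demo_cpu_hotplug_quirk(bool loaded)
{
	assert(demo_ops.cpu_disable != NULL);
	demo_ops.cpu_disable = loaded ? demo_hv_cpu_disable
				      : demo_native_cpu_disable;
}
#else
/* Empty stub: callers compile unchanged when the option is off. */
static void demo_cpu_hotplug_quirk(bool loaded)
{
	(void)loaded;
}
#endif
```

Init and exit paths then call demo_cpu_hotplug_quirk(true) and
demo_cpu_hotplug_quirk(false) unconditionally.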


 Thanks,
 -- Dexuan

-- 
  Vitaly


[PATCH 1/2] Drivers: hv: vss: Introduce timeout for communication with userspace

2014-11-06 Thread Vitaly Kuznetsov
In contrast with KVP, there is no timeout when communicating with the
userspace VSS daemon. If the daemon gets stuck performing a freeze/thaw
operation, no message is sent to the host, so it takes very long
(around 10 minutes) before the backup fails. Introduce a 10 second
timeout using schedule_delayed_work().
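
The race between the daemon's reply and the timeout can be modelled in user
space. This is only a sketch under stated assumptions: struct demo_txn,
DEMO_E_FAIL and the demo_* functions are hypothetical, and a boolean flag
stands in for the delayed work item, so no locking or real workqueue
behaviour is modelled. The point it illustrates is that whichever path
clears the pending timeout first is the only one that answers the host.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_E_FAIL (-1)	/* placeholder for HV_E_FAIL */

struct demo_txn {
	bool timeout_pending;	/* models a queued delayed work item */
	int  responses_sent;	/* how many times the host was answered */
	int  last_status;
};

static void demo_respond_to_host(struct demo_txn *t, int status)
{
	t->responses_sent++;
	t->last_status = status;
}

/* Models arming the timeout when a request is forwarded to the daemon. */
static void demo_forward_request(struct demo_txn *t)
{
	assert(!t->timeout_pending);
	t->timeout_pending = true;
}

/* Models a cancel that reports whether the timeout was still pending. */
static bool demo_cancel_timeout(struct demo_txn *t)
{
	bool was_pending = t->timeout_pending;

	t->timeout_pending = false;
	return was_pending;
}

/* Timeout handler: the daemon never answered in time. */
static void demo_timeout_fired(struct demo_txn *t)
{
	if (!t->timeout_pending)
		return;
	t->timeout_pending = false;
	demo_respond_to_host(t, DEMO_E_FAIL);
}

/* Reply path: answer only if the timeout has not already done so. */
static void demo_daemon_replied(struct demo_txn *t, int status)
{
	if (demo_cancel_timeout(t))
		demo_respond_to_host(t, status);
}
```

In both orderings (reply first or timeout first) the host sees exactly one
response in this model.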

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 drivers/hv/hv_snapshot.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/hv_snapshot.c b/drivers/hv/hv_snapshot.c
index 34f14fd..21e51be 100644
--- a/drivers/hv/hv_snapshot.c
+++ b/drivers/hv/hv_snapshot.c
@@ -28,7 +28,7 @@
 #define VSS_MINOR  0
 #define VSS_VERSION	(VSS_MAJOR << 16 | VSS_MINOR)
 
-
+#define VSS_USERSPACE_TIMEOUT (msecs_to_jiffies(10 * 1000))
 
 /*
  * Global state maintained for transaction that is being processed.
@@ -55,12 +55,24 @@ static const char vss_name[] = "vss_kernel_module";
 static __u8 *recv_buffer;
 
 static void vss_send_op(struct work_struct *dummy);
+static void vss_timeout_func(struct work_struct *dummy);
+
+static DECLARE_DELAYED_WORK(vss_timeout_work, vss_timeout_func);
 static DECLARE_WORK(vss_send_op_work, vss_send_op);
 
 /*
  * Callback when data is received from user mode.
  */
 
+static void vss_timeout_func(struct work_struct *dummy)
+{
+	/*
+	 * Timeout waiting for userspace component to reply happened.
+	 */
+	pr_warn("VSS: timeout waiting for daemon to reply\n");
+	vss_respond_to_host(HV_E_FAIL);
+}
+
 static void
 vss_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
 {
@@ -76,7 +88,8 @@ vss_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
 		return;
 
 	}
-	vss_respond_to_host(vss_msg->error);
+	if (cancel_delayed_work_sync(&vss_timeout_work))
+		vss_respond_to_host(vss_msg->error);
 }
 
 
@@ -223,6 +236,8 @@ void hv_vss_onchannelcallback(void *context)
case VSS_OP_FREEZE:
case VSS_OP_THAW:
 			schedule_work(&vss_send_op_work);
+			schedule_delayed_work(&vss_timeout_work,
+					      VSS_USERSPACE_TIMEOUT);
return;
 
case VSS_OP_HOT_BACKUP:
@@ -277,5 +292,6 @@ hv_vss_init(struct hv_util_service *srv)
 void hv_vss_deinit(void)
 {
 	cn_del_callback(&vss_id);
+	cancel_delayed_work_sync(&vss_timeout_work);
 	cancel_work_sync(&vss_send_op_work);
 }
-- 
1.9.3



[PATCH 0/2] Drivers: hv: kvp,vss: improve kernel-userspace communication in failure case

2014-11-06 Thread Vitaly Kuznetsov
This series addresses two issues:
- There is no timeout when communicating with userspace VSS daemon.
- In case we fail to send a message to VSS or KVP userspace daemons we
  can report the failure to the host right away avoiding the timeout.

The newly introduced 10 second timeout is worth discussing. In theory,
freeze/thaw ioctls should be fast. If someone thinks 10 seconds is not enough,
we can easily increase it, since the second patch of this series already covers
the most common failure scenario (the daemon was stopped).

Vitaly Kuznetsov (2):
  Drivers: hv: vss: Introduce timeout for communication with userspace
  Drivers: hv: kvp,vss: Fast propagation of userspace communication
failure

 drivers/hv/hv_kvp.c  |  9 -
 drivers/hv/hv_snapshot.c | 28 +---
 2 files changed, 33 insertions(+), 4 deletions(-)

-- 
1.9.3



[PATCH 2/2] Drivers: hv: kvp,vss: Fast propagation of userspace communication failure

2014-11-06 Thread Vitaly Kuznetsov
If cn_netlink_send() fails to deliver a message to the userspace daemon,
there is no need to wait for a reply, as it is not going to arrive. This
happens when the kvp or vss daemon is stopped after a successful
handshake. Report HV_E_FAIL immediately and cancel the timeout job so the
host won't receive two failures.
pr_warn() is used for VSS and pr_debug() for KVP deliberately: VSS requests
are rare and a failure results in a failed backup, while KVP requests are
much more frequent after a successful handshake, so avoid flooding the logs.
It would be nice to be able to de-negotiate with the host when the userspace
daemon gets disconnected, so that we won't receive new requests, but I'm not
sure it is possible.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 drivers/hv/hv_kvp.c  | 9 -
 drivers/hv/hv_snapshot.c | 8 +++-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/hv_kvp.c b/drivers/hv/hv_kvp.c
index 521c146..beb8105 100644
--- a/drivers/hv/hv_kvp.c
+++ b/drivers/hv/hv_kvp.c
@@ -350,6 +350,7 @@ kvp_send_key(struct work_struct *dummy)
 	__u8 pool = kvp_transaction.kvp_msg->kvp_hdr.pool;
__u32 val32;
__u64 val64;
+   int rc;
 
msg = kzalloc(sizeof(*msg) + sizeof(struct hv_kvp_msg) , GFP_ATOMIC);
if (!msg)
@@ -446,7 +447,13 @@ kvp_send_key(struct work_struct *dummy)
}
 
 	msg->len = sizeof(struct hv_kvp_msg);
-	cn_netlink_send(msg, 0, 0, GFP_ATOMIC);
+	rc = cn_netlink_send(msg, 0, 0, GFP_ATOMIC);
+	if (rc) {
+		pr_debug("KVP: failed to communicate to the daemon: %d\n", rc);
+		if (cancel_delayed_work_sync(&kvp_work))
+			kvp_respond_to_host(message, HV_E_FAIL);
+	}
+
kfree(msg);
 
return;
diff --git a/drivers/hv/hv_snapshot.c b/drivers/hv/hv_snapshot.c
index 21e51be..9d5e0d1 100644
--- a/drivers/hv/hv_snapshot.c
+++ b/drivers/hv/hv_snapshot.c
@@ -96,6 +96,7 @@ vss_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
 static void vss_send_op(struct work_struct *dummy)
 {
 	int op = vss_transaction.msg->vss_hdr.operation;
+   int rc;
struct cn_msg *msg;
struct hv_vss_msg *vss_msg;
 
@@ -111,7 +112,12 @@ static void vss_send_op(struct work_struct *dummy)
 	vss_msg->vss_hdr.operation = op;
 	msg->len = sizeof(struct hv_vss_msg);
 
-	cn_netlink_send(msg, 0, 0, GFP_ATOMIC);
+	rc = cn_netlink_send(msg, 0, 0, GFP_ATOMIC);
+	if (rc) {
+		pr_warn("VSS: failed to communicate to the daemon: %d\n", rc);
+		if (cancel_delayed_work_sync(&vss_timeout_work))
+			vss_respond_to_host(HV_E_FAIL);
+	}
kfree(msg);
 
return;
-- 
1.9.3



[PATCH 3/3] Tools: hv: vssdaemon: thaw everything in case of freeze failure

2014-11-07 Thread Vitaly Kuznetsov
If one or more filesystems fail to freeze we need to thaw everything, as the
host doing the backup won't issue a THAW request after we return HV_E_FAIL and
our system would remain with frozen filesystems forever.

The daemon does not keep track of the filesystems it freezes, so if some
external tool issues freeze/thaw requests at the same time they will collide
with the vss daemon. This issue can be addressed by introducing a freeze/thaw
transaction and keeping track of what was actually frozen.
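
The rollback idea can be sketched in user space as follows. This is an
illustration only: struct demo_fs and the demo_* helpers are hypothetical,
a boolean flag replaces the real freeze ioctl, and the loop assumes (as the
description above implies) that every filesystem is attempted before the
combined result is examined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_fs {
	const char *dir;
	bool can_freeze;	/* false simulates a failing freeze request */
	bool frozen;
};

/* Returns 0 on success, non-zero on failure. */
static int demo_freeze_one(struct demo_fs *fs)
{
	if (!fs->can_freeze)
		return 1;
	fs->frozen = true;
	return 0;
}

/* Thaws everything, whether or not this run froze it. */
static void demo_thaw_all(struct demo_fs *fs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		fs[i].frozen = false;
}

/* Freeze all entries; on any failure, undo the partial freeze. */
static int demo_freeze_all(struct demo_fs *fs, size_t n)
{
	int error = 0;

	assert(fs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		error |= demo_freeze_one(&fs[i]);
	if (error)
		demo_thaw_all(fs, n);
	return error;
}
```

Because demo_thaw_all() thaws unconditionally, it shares the limitation
described above: a filesystem frozen by another tool would be thawed too.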

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 tools/hv/hv_vss_daemon.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index 7be999a..e98c638 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -284,6 +284,12 @@ int main(int argc, char *argv[])
 			error = vss_operate(op);
 			if (error)
 				error = HV_E_FAIL;
+			if (error && op == VSS_OP_FREEZE) {
+				/* Need to thaw all frozen filesystems */
+				syslog(LOG_ERR,
+				       "Freeze failed, thaw everything");
+				vss_operate(VSS_OP_THAW);
+			}
 			break;
 		default:
 			syslog(LOG_ERR, "Illegal op:%d\n", op);
-- 
1.9.3



[PATCH 0/3] Tools: hv: vssdaemon: freeze/thaw logic improvement for the failure case

2014-11-07 Thread Vitaly Kuznetsov
This patch series addresses the following issues:
- Wrong error reporting for multiple filesystems case.
- Skip all readonly-mounted filesystems instead of skipping iso9660.
- Thaw all filesystems after an unsuccessful freeze attempt.

Vitaly Kuznetsov (3):
  Tools: hv: vssdaemon: consult with errno in case of failure only
  Tools: hv: vssdaemon: skip all filesystems mounted readonly
  Tools: hv: vssdaemon: thaw everything in case of freeze failure

 tools/hv/hv_vss_daemon.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

-- 
1.9.3



[PATCH 1/3] Tools: hv: vssdaemon: consult with errno in case of failure only

2014-11-07 Thread Vitaly Kuznetsov
If ioctl() returns 0 there is no point in examining errno, and doing so
can actually produce misleading output. When there was no failure,
errno still contains the error code of a previous failure,
so the user will see the following in the log:
 Hyper-V VSS: VSS: freeze of /mnt/udf: Operation not supported
 Hyper-V VSS: VSS: freeze of /: Operation not supported

We should also log errors with LOG_ERR instead of LOG_INFO.
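
A minimal sketch of the rule "look at errno only after a failure". The
helper name demo_format_result() and its parameters are hypothetical; it
formats into a buffer instead of calling syslog() so that the result can be
inspected.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Formats a log line for the outcome of an operation.
 * saved_errno is consulted only when ret signals failure.
 */
static int demo_format_result(char *buf, size_t len, const char *op,
			      const char *dir, int ret, int saved_errno)
{
	assert(buf != NULL && len > 0);
	if (ret)
		return snprintf(buf, len, "VSS: %s of %s: %s", op, dir,
				strerror(saved_errno));
	return snprintf(buf, len, "VSS: %s of %s succeeded", op, dir);
}
```

A caller would pass the ioctl() return value together with errno, for
example demo_format_result(line, sizeof(line), "freeze", dir, ret, errno).
A stale errno left over from an earlier failure is then ignored whenever
ret is 0.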

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 tools/hv/hv_vss_daemon.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index 9ae2b6e..5f67858 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -52,7 +52,11 @@ static int vss_do_freeze(char *dir, unsigned int cmd, char *fs_op)
 	if (fd < 0)
 		return 1;
 	ret = ioctl(fd, cmd, 0);
-	syslog(LOG_INFO, "VSS: %s of %s: %s\n", fs_op, dir, strerror(errno));
+	if (ret)
+		syslog(LOG_ERR, "VSS: %s of %s: %s\n", fs_op, dir,
+		       strerror(errno));
+	else
+		syslog(LOG_INFO, "VSS: %s of %s succeeded\n", fs_op, dir);
 	close(fd);
 	return !!ret;
 }
-- 
1.9.3



[PATCH 2/3] Tools: hv: vssdaemon: skip all filesystems mounted readonly

2014-11-07 Thread Vitaly Kuznetsov
Instead of growing the list of exceptions for read-only filesystems
beyond the existing iso9660 entry, it is better to skip the freeze
operation for all readonly-mounted filesystems.
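
For reference, a small sketch of the check using hasmntopt() and MNTOPT_RO
from <mntent.h>. The wrapper name demo_is_readonly() is hypothetical; a real
caller would obtain the entry from getmntent() rather than build it by hand.

```c
#include <assert.h>
#include <mntent.h>
#include <stdbool.h>
#include <stddef.h>

/* True when the mount entry carries the "ro" option. */
static bool demo_is_readonly(const struct mntent *ent)
{
	assert(ent != NULL && ent->mnt_opts != NULL);
	return hasmntopt(ent, MNTOPT_RO) != NULL;
}
```

The test depends on the mount options rather than on the filesystem type,
so any filesystem mounted read-only is covered without a per-type list.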

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 tools/hv/hv_vss_daemon.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index 5f67858..7be999a 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -90,7 +90,7 @@ static int vss_operate(int operation)
 	while ((ent = getmntent(mounts))) {
 		if (strncmp(ent->mnt_fsname, match, strlen(match)))
 			continue;
-		if (strcmp(ent->mnt_type, "iso9660") == 0)
+		if (hasmntopt(ent, MNTOPT_RO) != NULL)
 			continue;
 		if (strcmp(ent->mnt_type, "vfat") == 0)
 			continue;
-- 
1.9.3



Re: [PATCH RESEND] xen/blkfront: improve protection against issuing unsupported REQ_FUA

2014-12-03 Thread Vitaly Kuznetsov
Boris Ostrovsky boris.ostrov...@oracle.com writes:

 On 12/01/2014 08:01 AM, Vitaly Kuznetsov wrote:
 Guard against issuing unsupported REQ_FUA and REQ_FLUSH was introduced
 in d11e61583 and was factored out into blkif_request_flush_valid() in
 0f1ca65ee. However:
 1) This check is incomplete. In case we negotiated to feature_flush = REQ_FLUSH
 and flush_op = BLKIF_OP_FLUSH_DISKCACHE (so FUA is unsupported) FUA request
 will still pass the check.
 2) blkif_request_flush_valid() is misnamed. It is bool but returns true when
 the request is invalid.
 3) When blkif_request_flush_valid() fails -EIO is being returned. It seems that
 -EOPNOTSUPP is more appropriate here.
 Fix all of the above issues.

 This patch is based on the original patch by Laszlo Ersek and a comment by
 Jeff Moyer.

 Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
 Reviewed-by: Laszlo Ersek ler...@redhat.com

 Reviewed-by: Boris Ostrovsky boris.ostrov...@oracle.com

 (although, as I mentioned last time, a companion patch to remove
 flush_op would be a good thing to have)


Thanks, it is on my todo list but I'm trying to separate this
(potential) bugfix from straight cleanup.

 -boris

 ---
   drivers/block/xen-blkfront.c | 14 --
   1 file changed, 8 insertions(+), 6 deletions(-)

 diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
 index 5ac312f..2e6c103 100644
 --- a/drivers/block/xen-blkfront.c
 +++ b/drivers/block/xen-blkfront.c
 @@ -582,12 +582,14 @@ static inline void flush_requests(struct blkfront_info *info)
  		notify_remote_via_irq(info->irq);
  }
  
 -static inline bool blkif_request_flush_valid(struct request *req,
 -					     struct blkfront_info *info)
 +static inline bool blkif_request_flush_invalid(struct request *req,
 +					       struct blkfront_info *info)
  {
  	return ((req->cmd_type != REQ_TYPE_FS) ||
 -		((req->cmd_flags & (REQ_FLUSH | REQ_FUA)) &&
 -		!info->flush_op));
 +		((req->cmd_flags & REQ_FLUSH) &&
 +		 !(info->feature_flush & REQ_FLUSH)) ||
 +		((req->cmd_flags & REQ_FUA) &&
 +		 !(info->feature_flush & REQ_FUA)));
  }
  
  /*
 @@ -612,8 +614,8 @@ static void do_blkif_request(struct request_queue *rq)
  
  		blk_start_request(req);
  
 -		if (blkif_request_flush_valid(req, info)) {
 -			__blk_end_request_all(req, -EIO);
 +		if (blkif_request_flush_invalid(req, info)) {
 +			__blk_end_request_all(req, -EOPNOTSUPP);
  			continue;
  		}
   

-- 
  Vitaly


[PATCH] Drivers: hv: vmbus: Fix a race condition when unregistering a device

2014-11-04 Thread Vitaly Kuznetsov
When built with debugging enabled, the following crash is sometimes observed:
Call Trace:
 [812b9600] string+0x40/0x100
 [812bb038] vsnprintf+0x218/0x5e0
 [810baf7d] ? trace_hardirqs_off+0xd/0x10
 [812bb4c1] vscnprintf+0x11/0x30
 [8107a2f0] vprintk+0xd0/0x5c0
 [a0051ea0] ? vmbus_process_rescind_offer+0x0/0x110 [hv_vmbus]
 [8155c71c] printk+0x41/0x45
 [a004ebac] vmbus_device_unregister+0x2c/0x40 [hv_vmbus]
 [a0051ecb] vmbus_process_rescind_offer+0x2b/0x110 [hv_vmbus]
...

This happens due to the following race: between the 'if (channel->device_obj)'
check in vmbus_process_rescind_offer() and pr_debug() in
vmbus_device_unregister() the device can disappear. Fix the issue by taking an
additional reference to the device before proceeding to
vmbus_device_unregister().
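
The effect of the extra reference can be shown with a single-threaded
user-space model. It is a sketch only: struct demo_device and the demo_*
functions are hypothetical stand-ins for the kernel's reference-counted
device object, and no real concurrency is simulated. It shows that an
object whose registration reference is dropped stays alive until the
caller's own reference is released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_device {
	int   refcount;
	bool *released;	/* set to true when the last reference is dropped */
};

/* Takes a reference; returns NULL if there is no device. */
static struct demo_device *demo_get(struct demo_device *dev)
{
	if (dev)
		dev->refcount++;
	return dev;
}

/* Drops a reference; the object is "released" when the count hits zero. */
static void demo_put(struct demo_device *dev)
{
	assert(dev != NULL && dev->refcount > 0);
	if (--dev->refcount == 0)
		*dev->released = true;
}

/* Models unregistration dropping the registration reference. */
static void demo_unregister(struct demo_device *dev)
{
	demo_put(dev);
}

/* Hold a reference across unregistration, then let go. */
static void demo_rescind(struct demo_device *dev)
{
	struct demo_device *held = demo_get(dev);

	if (held) {
		demo_unregister(dev);	/* object must stay valid in here */
		demo_put(held);		/* last reference may go away now */
	}
}
```

Without the demo_get() call in demo_rescind(), the release would happen
inside demo_unregister(), which is the situation the patch avoids.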

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 drivers/hv/channel_mgmt.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index a2d1a96..d36ce68 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -216,9 +216,16 @@ static void vmbus_process_rescind_offer(struct work_struct *work)
 	unsigned long flags;
 	struct vmbus_channel *primary_channel;
 	struct vmbus_channel_relid_released msg;
+	struct device *dev;
+
+	if (channel->device_obj) {
+		dev = get_device(&channel->device_obj->device);
+		if (dev) {
+			vmbus_device_unregister(channel->device_obj);
+			put_device(dev);
+		}
+	}
 
-	if (channel->device_obj)
-		vmbus_device_unregister(channel->device_obj);
 	memset(&msg, 0, sizeof(struct vmbus_channel_relid_released));
 	msg.child_relid = channel->offermsg.child_relid;
 	msg.header.msgtype = CHANNELMSG_RELID_RELEASED;
-- 
1.9.3



Re: [PATCH] tools: hv: introduce -n/--no-daemon option

2014-11-04 Thread Vitaly Kuznetsov
KY Srinivasan k...@microsoft.com writes:

 -Original Message-
 From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
 Sent: Wednesday, October 22, 2014 9:07 AM
 To: KY Srinivasan; Haiyang Zhang; de...@linuxdriverproject.org
 Cc: linux-kernel@vger.kernel.org
 Subject: [PATCH] tools: hv: introduce -n/--no-daemon option
 
 All tools/hv daemons do mandatory daemon() on startup. However, no
 pidfile is created, which makes it difficult for an init system to track such
 daemons.
 Modern Linux distros use systemd as their init system. It can handle the
 daemonizing by itself; however, it requires a daemon to stay in the foreground
 for that. Some distros already carry a distro-specific patch for hv tools which
 switches off daemon().
 
 Introduce a -n/--no-daemon option for all 3 daemons in tools/hv. Parse
 options with getopt() to make this part easily expandable.
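
A compact sketch of the option handling with getopt_long(). The function
name demo_parse_daemonize() and its return convention are hypothetical; the
daemons in the patch parse options inline in main() and exit after printing
usage instead of returning a code.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

/*
 * Returns 1 to daemonize, 0 to stay in the foreground,
 * and -1 when usage should be printed (-h or an unknown option).
 */
static int demo_parse_daemonize(int argc, char *argv[])
{
	static const struct option long_options[] = {
		{ "help",      no_argument, NULL, 'h' },
		{ "no-daemon", no_argument, NULL, 'n' },
		{ NULL,        0,           NULL, 0   }
	};
	int daemonize = 1;
	int opt;

	assert(argc >= 1 && argv != NULL);
	optind = 1;	/* restart scanning so the function can be reused */
	opterr = 0;	/* keep getopt quiet; the caller prints usage */

	while ((opt = getopt_long(argc, argv, "hn", long_options, NULL)) != -1) {
		switch (opt) {
		case 'n':
			daemonize = 0;
			break;
		case 'h':
		default:
			return -1;
		}
	}
	return daemonize;
}
```

Adding another flag later only needs one more entry in long_options, one
more letter in the short option string and one more case label.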
 
 Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
 You may want to include Greg KH in the to list.

For some reason he's missing on the get_maintainer.pl output for all
Hyper-V parts.

Greg, will you pick this up or do I need to resend?

 Signed-off-by:  K. Y. Srinivasan k...@microsoft.com


Thanks!

 ---
  tools/hv/hv_fcopy_daemon.c | 33 +++-
 -
  tools/hv/hv_kvp_daemon.c   | 34
 --
  tools/hv/hv_vss_daemon.c   | 33 +++--
  3 files changed, 94 insertions(+), 6 deletions(-)
 
 diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c
 index 8f96b3e..f437d73 100644
 --- a/tools/hv/hv_fcopy_daemon.c
 +++ b/tools/hv/hv_fcopy_daemon.c
 @@ -33,6 +33,7 @@
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <dirent.h>
 +#include <getopt.h>
 
  static int target_fd;
  static char target_fname[W_MAX_PATH];
 @@ -126,15 +127,43 @@ static int hv_copy_cancel(void)
 
  }
 
 -int main(void)
 +void print_usage(char *argv[])
 +{
 +	fprintf(stderr, "Usage: %s [options]\n"
 +		"Options are:\n"
 +		"  -n, --no-daemon        stay in foreground, don't daemonize\n"
 +		"  -h, --help             print this help\n", argv[0]);
 +}
 +
 +int main(int argc, char *argv[])
  {
  int fd, fcopy_fd, len;
  int error;
 +int daemonize = 1, long_index = 0, opt;
  int version = FCOPY_CURRENT_VERSION;
  char *buffer[4096 * 2];
  struct hv_fcopy_hdr *in_msg;
 
 -if (daemon(1, 0)) {
 +	static struct option long_options[] = {
 +		{"help",	no_argument,	   0,  'h' },
 +		{"no-daemon",	no_argument,	   0,  'n' },
 +		{0,		0,		   0,  0   }
 +	};
 +
 +	while ((opt = getopt_long(argc, argv, "hn", long_options,
 +				  &long_index)) != -1) {
 +switch (opt) {
 +case 'n':
 +daemonize = 0;
 +break;
 +case 'h':
 +default:
 +print_usage(argv);
 +exit(EXIT_FAILURE);
 +}
 +}
 +
 +	if (daemonize && daemon(1, 0)) {
  		syslog(LOG_ERR, "daemon() failed; error: %s",
  		       strerror(errno));
  exit(EXIT_FAILURE);
  }
 diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
 index 4088b81..22b0764 100644
 --- a/tools/hv/hv_kvp_daemon.c
 +++ b/tools/hv/hv_kvp_daemon.c
 @@ -43,6 +43,7 @@
  #include <fcntl.h>
  #include <dirent.h>
  #include <net/if.h>
 +#include <getopt.h>
 
  /*
   * KVP protocol: The user mode component first registers with the
 @@ -1417,7 +1418,15 @@ netlink_send(int fd, struct cn_msg *msg)
  	return sendmsg(fd, &message, 0);
  }
 
 -int main(void)
 +void print_usage(char *argv[])
 +{
 +	fprintf(stderr, "Usage: %s [options]\n"
 +		"Options are:\n"
 +		"  -n, --no-daemon        stay in foreground, don't daemonize\n"
 +		"  -h, --help             print this help\n", argv[0]);
 +}
 +
 +int main(int argc, char *argv[])
  {
  int fd, len, nl_group;
  int error;
 @@ -1435,9 +1444,30 @@ int main(void)
  struct hv_kvp_ipaddr_value *kvp_ip_val;
  char *kvp_recv_buffer;
  size_t kvp_recv_buffer_len;
 +int daemonize = 1, long_index = 0, opt;
 +
 +	static struct option long_options[] = {
 +		{"help",	no_argument,	   0,  'h' },
 +		{"no-daemon",	no_argument,	   0,  'n' },
 +		{0,		0,		   0,  0   }
 +	};
 +
 +	while ((opt = getopt_long(argc, argv, "hn", long_options,
 +				  &long_index)) != -1) {
 +switch (opt) {
 +case 'n':
 +daemonize = 0;
 +break;
 +case 'h':
 +default:
 +print_usage(argv);
 +exit(EXIT_FAILURE);
 +}
 +}
 
 -	if (daemon(1, 0))
 +	if (daemonize && daemon(1, 0))
  		return 1;
 +
  	openlog("KVP", 0, LOG_USER);
  	syslog(LOG_INFO, "KVP starting; pid is:%d", getpid());
 
 diff --git a/tools

Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure

2014-12-01 Thread Vitaly Kuznetsov
Dexuan Cui de...@microsoft.com writes:

 -Original Message-
 From: Jason Wang [mailto:jasow...@redhat.com]
 Sent: Monday, December 1, 2014 16:23 PM
 To: Dexuan Cui
 Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
 de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; KY
 Srinivasan; vkuzn...@redhat.com; Haiyang Zhang
 Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
 failure
 On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui de...@microsoft.com wrote:
   -Original Message-
   From: Jason Wang [mailto:jasow...@redhat.com]
   Sent: Friday, November 28, 2014 18:13 PM
   To: Dexuan Cui
   Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
  driverdev-
   de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; KY
   Srinivasan; vkuzn...@redhat.com; Haiyang Zhang
   Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on
  transfer
   failure
   On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui de...@microsoft.com
  wrote:
 -Original Message-
 From: Jason Wang [mailto:jasow...@redhat.com]
 Sent: Friday, November 28, 2014 14:47 PM
 To: Dexuan Cui
 Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
driverdev-
 de...@linuxdriverproject.org; o...@aepfle.de;
  a...@canonical.com; KY
 Srinivasan; vkuzn...@redhat.com; Haiyang Zhang
 Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message
  on
transfer
 failure
 On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui
  de...@microsoft.com
wrote:
  In the case the user-space daemon crashes, hangs or is
  killed, we
  need to down the semaphore, otherwise, after the daemon starts
next
  time, the obsolete data in fcopy_transaction.message or
  fcopy_transaction.fcopy_msg will be used immediately.
 
  Cc: Jason Wang jasow...@redhat.com
  Cc: Vitaly Kuznetsov vkuzn...@redhat.com
  Cc: K. Y. Srinivasan k...@microsoft.com
  Signed-off-by: Dexuan Cui de...@microsoft.com
  ---
 
  v2: I removed the FCP prefix as Greg asked.
 
  I also updated the output message a little:
  FCP: failed to acquire the semaphore --
  can not acquire the semaphore: it is benign
 
  v3: I added the code in fcopy_release() as Jason Wang
  suggested.
  I removed the pr_debug (it isn't so meaningful)and added a
  comment instead.
 
   drivers/hv/hv_fcopy.c | 19 +++
   1 file changed, 19 insertions(+)
 
  diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
  index 23b2ce2..faa6ba6 100644
  --- a/drivers/hv/hv_fcopy.c
  +++ b/drivers/hv/hv_fcopy.c
  @@ -86,6 +86,18 @@ static void fcopy_work_func(struct
  work_struct
  *dummy)
  * process the pending transaction.
  */
 fcopy_respond_to_host(HV_E_FAIL);
  +
  +  /* In the case the user-space daemon crashes, hangs or is
killed, we
  +   * need to down the semaphore, otherwise, after the daemon
starts
  next
  +   * time, the obsolete data in fcopy_transaction.message or
  +   * fcopy_transaction.fcopy_msg will be used immediately.
  +   *
  +   * NOTE: fcopy_read() happens to get the semaphore (very
  rare)?
  We're
  +   * still OK, because we've reported the failure to the host.
  +   */
  +  if (down_trylock(&fcopy_transaction.read_sema))
  +  ;
   
 Sorry, I don't quite understand how if () ; can help here.
   
 Btw, a question not related to this patch.
   
 What happens if a daemon is resumed from SIGSTOP and expires the check
 here?
Hi Jason,
My idea is: here we need down_trylock(), but in case we can't get
  the
semaphore, it's OK anyway:
   
Scenario 1):
1.1: when the daemon is blocked on the pread(), the daemon
  receives
SIGSTOP;
1.2: the host user runs the PowerShell Copy-VMFile command;
1.3.1: the driver reports the failure to the host user in 5s and
1.3.2: the driver down()-es the semaphore;
1.4: the daemon receives SIGCONT and it will be still blocked on
  the
pread().
Without the down_trylock(), in 1.4, the daemon can receive an
obsolete message.
NOTE: in this scenario, the daemon is not killed.
   
Scenario 2):
In scenario 1), if the daemon receives SIGCONT between 1.3.1 and
  1.3.2
and
do down() in fcopy_read(), it will receive the message but: the
driver has
reported the failure to the host user and the driver's 1.3.2 can't
get the
semaphore -- IMO this is acceptably OK, though in the VM, an
incomplete
file will be left there.
BTW, I think in the daemon's hv_start_fcopy() we should add a
close(target_fd) before open()-ing a new one.
 
   Right, but how about the case when resuming from SIGSTOP but no
  timeout?
  Sorry, I don't understand this:
  if no timeout, fcopy_read() will get the semaphore

[PATCH v2] Drivers: hv: vmbus: prevent cpu offlining on newer hypervisors

2014-12-01 Thread Vitaly Kuznetsov
When an SMP Hyper-V guest is running on top of 2012R2 Server and secondary
cpus are sent offline (with echo 0 > /sys/devices/system/cpu/cpu$cpu/online)
the system freeze is observed. This happens due to the fact that on newer
hypervisors (Win8, WS2012R2, ...) vmbus channel handlers are distributed
across all cpus (see init_vp_index() function in drivers/hv/channel_mgmt.c)
and on cpu offlining nobody reassigns them to CPU0. Prevent cpu offlining
when vmbus is loaded until the issue is fixed host-side.

This patch also disables hibernation but it is OK as it is also broken (MCE
error is hit on resume). Suspend still works.

Tested with WS2008R2 and WS2012R2.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com

---
Changes since v1:
- introduce hv_cpu_hotplug_quirk() function to not spread #ifdefs [Greg KH]
- add pr_notice() message "hv_vmbus: CPU offlining is not supported by
  hypervisor"
---
 drivers/hv/vmbus_drv.c | 33 +
 1 file changed, 33 insertions(+)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 4d6b269..2e6b38e 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -32,6 +32,7 @@
 #include <linux/completion.h>
 #include <linux/hyperv.h>
 #include <linux/kernel_stat.h>
+#include <linux/cpu.h>
 #include <asm/hyperv.h>
 #include <asm/hypervisor.h>
 #include <asm/mshyperv.h>
@@ -671,6 +672,36 @@ static void vmbus_isr(void)
 		tasklet_schedule(&msg_dpc);
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
+static int hyperv_cpu_disable(void)
+{
+   return -1;
+}
+
+static void hv_cpu_hotplug_quirk(bool vmbus_loaded)
+{
+   /*
+* Offlining a CPU when running on newer hypervisors (WS2012R2, Win8,
+* ...) is not supported at this moment as channel interrupts are
+* distributed across all of them.
+*/
+
+   if ((vmbus_proto_version == VERSION_WS2008) ||
+   (vmbus_proto_version == VERSION_WIN7))
+   return;
+
+	if (vmbus_loaded) {
+		smp_ops.cpu_disable = hyperv_cpu_disable;
+		pr_notice("CPU offlining is not supported by hypervisor");
+	} else
+		smp_ops.cpu_disable = native_cpu_disable;
+}
+#else
+static void hv_cpu_hotplug_quirk(bool vmbus_loaded)
+{
+}
+#endif
+
 /*
  * vmbus_bus_init -Main vmbus driver initialization routine.
  *
@@ -711,6 +742,7 @@ static int vmbus_bus_init(int irq)
if (ret)
goto err_alloc;
 
+   hv_cpu_hotplug_quirk(true);
vmbus_request_offers();
 
return 0;
@@ -964,6 +996,7 @@ static void __exit vmbus_exit(void)
 	bus_unregister(&hv_bus);
 	hv_cleanup();
 	acpi_bus_unregister_driver(&vmbus_acpi_driver);
+   hv_cpu_hotplug_quirk(false);
 }
 
 
-- 
1.9.3



[PATCH RESEND] xen/blkfront: improve protection against issuing unsupported REQ_FUA

2014-12-01 Thread Vitaly Kuznetsov
Guard against issuing unsupported REQ_FUA and REQ_FLUSH was introduced
in d11e61583 and was factored out into blkif_request_flush_valid() in
0f1ca65ee. However:
1) This check is incomplete. In case we negotiated to feature_flush = REQ_FLUSH
   and flush_op = BLKIF_OP_FLUSH_DISKCACHE (so FUA is unsupported) FUA request
   will still pass the check.
2) blkif_request_flush_valid() is misnamed. It is bool but returns true when
   the request is invalid.
3) When blkif_request_flush_valid() fails -EIO is being returned. It seems that
   -EOPNOTSUPP is more appropriate here.
Fix all of the above issues.

This patch is based on the original patch by Laszlo Ersek and a comment by
Jeff Moyer.
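
To see why the old check lets an unsupported FUA request through, here is
a user-space model of both predicates. It is a sketch under assumptions:
the DEMO_REQ_* bit values are made up (the kernel's REQ_* values differ),
the demo_* names are hypothetical, and the cmd_type test is left out to
keep the focus on the flag logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits only; not the kernel's REQ_* values. */
#define DEMO_REQ_FLUSH (1u << 0)
#define DEMO_REQ_FUA   (1u << 1)

/* New check: every requested flag must be present in feature_flush. */
static bool demo_flush_invalid(unsigned int cmd_flags,
			       unsigned int feature_flush)
{
	assert((feature_flush & ~(DEMO_REQ_FLUSH | DEMO_REQ_FUA)) == 0);
	return ((cmd_flags & DEMO_REQ_FLUSH) &&
		!(feature_flush & DEMO_REQ_FLUSH)) ||
	       ((cmd_flags & DEMO_REQ_FUA) &&
		!(feature_flush & DEMO_REQ_FUA));
}

/* Old check: only asks whether any flush operation was negotiated. */
static bool demo_flush_invalid_old(unsigned int cmd_flags,
				   unsigned int flush_op)
{
	return (cmd_flags & (DEMO_REQ_FLUSH | DEMO_REQ_FUA)) && !flush_op;
}
```

With only FLUSH negotiated (a non-zero flush_op), the old predicate accepts
a FUA request, while the new one rejects it because the FUA bit is missing
from feature_flush.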

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
Reviewed-by: Laszlo Ersek ler...@redhat.com
---
 drivers/block/xen-blkfront.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 5ac312f..2e6c103 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -582,12 +582,14 @@ static inline void flush_requests(struct blkfront_info *info)
 		notify_remote_via_irq(info->irq);
 }
 
-static inline bool blkif_request_flush_valid(struct request *req,
-					     struct blkfront_info *info)
+static inline bool blkif_request_flush_invalid(struct request *req,
+					       struct blkfront_info *info)
 {
 	return ((req->cmd_type != REQ_TYPE_FS) ||
-		((req->cmd_flags & (REQ_FLUSH | REQ_FUA)) &&
-		!info->flush_op));
+		((req->cmd_flags & REQ_FLUSH) &&
+		 !(info->feature_flush & REQ_FLUSH)) ||
+		((req->cmd_flags & REQ_FUA) &&
+		 !(info->feature_flush & REQ_FUA)));
 }
 
 /*
@@ -612,8 +614,8 @@ static void do_blkif_request(struct request_queue *rq)
 
blk_start_request(req);
 
-   if (blkif_request_flush_valid(req, info)) {
-   __blk_end_request_all(req, -EIO);
+   if (blkif_request_flush_invalid(req, info)) {
+   __blk_end_request_all(req, -EOPNOTSUPP);
continue;
}
 
-- 
1.9.3



[PATCH] xen/blkfront: remove redundant flush_op

2014-12-08 Thread Vitaly Kuznetsov
flush_op is unambiguously defined by feature_flush:
REQ_FUA | REQ_FLUSH -> BLKIF_OP_WRITE_BARRIER
REQ_FLUSH -> BLKIF_OP_FLUSH_DISKCACHE
0 -> 0
and thus can be removed. This is just a cleanup.

The patch was suggested by Boris Ostrovsky.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
The patch is supposed to be applied after xen/blkfront: improve protection
against issuing unsupported REQ_FUA.
---
 drivers/block/xen-blkfront.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 2e6c103..d1ee233 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -126,7 +126,6 @@ struct blkfront_info
unsigned int persistent_gnts_c;
unsigned long shadow_free;
unsigned int feature_flush;
-   unsigned int flush_op;
unsigned int feature_discard:1;
unsigned int feature_secdiscard:1;
unsigned int discard_granularity;
@@ -479,7 +478,14 @@ static int blkif_queue_request(struct request *req)
 				 * way.  (It's also a FLUSH+FUA, since it is
 				 * guaranteed ordered WRT previous writes.)
 				 */
-				ring_req->operation = info->flush_op;
+				if (unlikely(info->feature_flush & REQ_FUA))
+					ring_req->operation =
+						BLKIF_OP_WRITE_BARRIER;
+				else if (likely(info->feature_flush))
+					ring_req->operation =
+						BLKIF_OP_FLUSH_DISKCACHE;
+				else
+					ring_req->operation = 0;
 			}
 			ring_req->u.rw.nr_segments = nseg;
 		}
@@ -691,8 +697,8 @@ static void xlvbd_flush(struct blkfront_info *info)
 	blk_queue_flush(info->rq, info->feature_flush);
 	printk(KERN_INFO "blkfront: %s: %s: %s %s %s %s %s\n",
 	       info->gd->disk_name,
-	       info->flush_op == BLKIF_OP_WRITE_BARRIER ?
-		"barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
+	       info->feature_flush == (REQ_FLUSH | REQ_FUA) ?
+		"barrier" : (info->feature_flush == REQ_FLUSH ?
 		"flush diskcache" : "barrier or flush"),
 	       info->feature_flush ? "enabled;" : "disabled;",
 	       "persistent grants:",
@@ -1190,7 +1196,6 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 				if (error == -EOPNOTSUPP)
 					error = 0;
 				info->feature_flush = 0;
-				info->flush_op = 0;
 				xlvbd_flush(info);
 			}
 			/* fall through */
@@ -1810,7 +1815,6 @@ static void blkfront_connect(struct blkfront_info *info)
 		physical_sector_size = sector_size;
 
 	info->feature_flush = 0;
-	info->flush_op = 0;
 
 	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
 			    "feature-barrier", "%d", &barrier,
@@ -1823,10 +1827,8 @@ static void blkfront_connect(struct blkfront_info *info)
 	 *
 	 * If there are barriers, then we use flush.
 	 */
-	if (!err && barrier) {
+	if (!err && barrier)
 		info->feature_flush = REQ_FLUSH | REQ_FUA;
-		info->flush_op = BLKIF_OP_WRITE_BARRIER;
-	}
 	/*
 	 * And if there is feature-flush-cache use that above
 	 * barriers.
@@ -1835,10 +1837,8 @@ static void blkfront_connect(struct blkfront_info *info)
 			    "feature-flush-cache", "%d", &flush,
 			    NULL);
 
-	if (!err && flush) {
+	if (!err && flush)
 		info->feature_flush = REQ_FLUSH;
-		info->flush_op = BLKIF_OP_FLUSH_DISKCACHE;
-	}
 
 	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
 			    "feature-discard", "%d", &discard,
-- 
1.9.3
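
For illustration, the mapping from the commit message can be written as a
small pure function. This is only a user-space sketch: the flag and opcode
values are placeholders chosen for the example, and flush_op_for() is a
hypothetical name that does not exist in the driver.

```c
/* Placeholder values; the kernel headers define the real ones. */
#define REQ_FLUSH (1u << 0)
#define REQ_FUA   (1u << 1)
#define BLKIF_OP_WRITE_BARRIER   2u
#define BLKIF_OP_FLUSH_DISKCACHE 3u

/* Derive the ring operation from the negotiated feature_flush flags. */
static unsigned int flush_op_for(unsigned int feature_flush)
{
	/* Mask first so that unrelated flag bits cannot change the result. */
	switch (feature_flush & (REQ_FLUSH | REQ_FUA)) {
	case REQ_FLUSH | REQ_FUA:
		return BLKIF_OP_WRITE_BARRIER;
	case REQ_FLUSH:
		return BLKIF_OP_FLUSH_DISKCACHE;
	default:
		return 0;
	}
}
```

Note that this sketch maps REQ_FUA without REQ_FLUSH to 0, a combination
the driver never negotiates.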



Re: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work

2014-12-09 Thread Vitaly Kuznetsov
Dexuan Cui de...@microsoft.com writes:

 Currently IPv6-only-injection doesn't work because the daemon doesn't parse
 any IPv6 information at all once it finds the dhcp_enabled flag is true.

 But according to the Hyper-v host team, the flag is only for IPv4.
 In the case the host only injects 1 IPv6 address, the dhcp flag is true, but
 we shouldn't ignore the IPv6 address and we should pass BOOTPROTO=none
 to the distro-specific script hv_set_ipconfig.

 Tested in Ubuntu 14.10 and RHEL7.

 Cc: K. Y. Srinivasan k...@microsoft.com
 Signed-off-by: Dexuan Cui de...@microsoft.com
 ---
  tools/hv/hv_kvp_daemon.c | 47 +++
  1 file changed, 31 insertions(+), 16 deletions(-)

 diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
 index 6a6432a..6ef6c04 100644
 --- a/tools/hv/hv_kvp_daemon.c
 +++ b/tools/hv/hv_kvp_daemon.c
 @@ -1145,6 +1145,9 @@ static int kvp_write_file(FILE *f, char *s1, char *s2, 
 char *s3)
  }

 +/* How many ipv6 addresses the host is trying to inject? */
 +static int num_ipv6_injected;
 +
  static int process_ip_string(FILE *f, char *ip_string, int type)
  {
   int error = 0;
 @@ -1190,6 +1193,7 @@ static int process_ip_string(FILE *f, char *ip_string, 
 int type)
   switch (type) {
   case IPADDR:
   snprintf(str, sizeof(str), %s, IPV6ADDR);
 + num_ipv6_injected++;
   break;
   case NETMASK:
   snprintf(str, sizeof(str), %s, IPV6NETMASK);
 @@ -1308,27 +1312,12 @@ static int kvp_set_ip_info(char *if_name, struct 
 hv_kvp_ipaddr_value *new_val)
   if (error)
   goto setval_error;

 - if (new_val-dhcp_enabled) {
 - error = kvp_write_file(file, BOOTPROTO, , dhcp);
 - if (error)
 - goto setval_error;
 -
 - /*
 -  * We are done!.
 -  */
 - goto setval_done;
 -
 - } else {
 - error = kvp_write_file(file, BOOTPROTO, , none);
 - if (error)
 - goto setval_error;
 - }
 -
   /*
* Write the configuration for ipaddress, netmask, gateway and
* name servers.
*/

 + num_ipv6_injected = 0;
   error = process_ip_string(file, (char *)new_val-ip_addr, IPADDR);
   if (error)
   goto setval_error;
 @@ -1345,6 +1334,32 @@ static int kvp_set_ip_info(char *if_name, struct 
 hv_kvp_ipaddr_value *new_val)
   if (error)
   goto setval_error;

 + /*
 +  * Here dhcp_enabled is only for IPv4 according to Hyper-V host team.
 +  *
 +  * In the case the host only injects 1 IPv6 address:
 +  * new_val-dhcp_enabled is true, but we can't pass BOOTPROTO=dhcp to
 +  * the script hv_set_ifconfig, because in some distros (like RHEL7)
 +  * BOOTPROTO=dhcp has a special meaning in the config file (e.g.,
 +  * /etc/sysconfig/network-scripts/ifcfg-eth0): the network init program
 +  * ignores any static IP addr information once there is
 +  * BOOTPROTO=dhcp; as a result, IPv6-only injection can't work.
 +  *
 +  * In the case of IPv6-only injection, BOOTPROTO=dhcp doesn't affect
 +  * Ubuntu because it's ignored by the Ubuntu version of
 +  * hv_set_ifconfig and it doesn't seem to have special meaning in
 +  * Ubuntu.
 +  */

I just checked and adding IPV6ADDR=something when BOOTPROTO=dhcp
works for me with both RHEL7 and Fedora21.

Other than that, I think bringing distribution specifics into kernel.git
is not a good idea. The /etc/sysconfig/network-scripts/ifcfg-* format is
distro-specific and not all Linux distros support it. Moreover,
different distros can treat the settings differently. I think it was wrong
to stick to this format in the kvp daemon from the very beginning.

As a solution I would suggest the following: the kvp daemon writes all
received request details in a distro-agnostic format to some temporary
place and then calls a distro-specific script to set things up. Actually,
we already have such a script: tools/hv/hv_set_ifconfig.sh

As for this bug, I propose the following: stop skipping all the
IPADDR/MASK/... settings in case of BOOTPROTO=dhcp and let the
distro-specific script deal with the rest.


 + if (new_val-dhcp_enabled  num_ipv6_injected == 0) {
 + error = kvp_write_file(file, BOOTPROTO, , dhcp);
 + if (error)
 + goto setval_error;
 + } else {
 + error = kvp_write_file(file, BOOTPROTO, , none);
 + if (error)
 + goto setval_error;
 + }
 +
  setval_done:
   fclose(file);

-- 
  Vitaly

[PATCH v2] xen/blkfront: remove redundant flush_op

2014-12-09 Thread Vitaly Kuznetsov
flush_op is unambiguously defined by feature_flush:
REQ_FUA | REQ_FLUSH -> BLKIF_OP_WRITE_BARRIER
REQ_FLUSH -> BLKIF_OP_FLUSH_DISKCACHE
0 -> 0
and thus can be removed. This is just a cleanup.

The patch was suggested by Boris Ostrovsky.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
Changes from v1:
   Future-proof feature_flush against new flags [Boris Ostrovsky].

The patch is supposed to be applied after xen/blkfront: improve protection
against issuing unsupported REQ_FUA.
---
 drivers/block/xen-blkfront.c | 51 +++++++++++++++++++++++++++++++--------------------
 1 file changed, 31 insertions(+), 20 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 2e6c103..2236c6f 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -126,7 +126,6 @@ struct blkfront_info
unsigned int persistent_gnts_c;
unsigned long shadow_free;
unsigned int feature_flush;
-   unsigned int flush_op;
unsigned int feature_discard:1;
unsigned int feature_secdiscard:1;
unsigned int discard_granularity;
@@ -479,7 +478,19 @@ static int blkif_queue_request(struct request *req)
 				 * way.  (It's also a FLUSH+FUA, since it is
 				 * guaranteed ordered WRT previous writes.)
 				 */
-				ring_req->operation = info->flush_op;
+				switch (info->feature_flush &
+					((REQ_FLUSH|REQ_FUA))) {
+				case REQ_FLUSH|REQ_FUA:
+					ring_req->operation =
+						BLKIF_OP_WRITE_BARRIER;
+					break;
+				case REQ_FLUSH:
+					ring_req->operation =
+						BLKIF_OP_FLUSH_DISKCACHE;
+					break;
+				default:
+					ring_req->operation = 0;
+				}
 			}
 			ring_req->u.rw.nr_segments = nseg;
 		}
@@ -685,20 +696,26 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
 	return 0;
 }
 
+static const char *flush_info(unsigned int feature_flush)
+{
+	switch (feature_flush & ((REQ_FLUSH | REQ_FUA))) {
+	case REQ_FLUSH|REQ_FUA:
+		return "barrier: enabled;";
+	case REQ_FLUSH:
+		return "flush diskcache: enabled;";
+	default:
+		return "barrier or flush: disabled;";
+	}
+}
 
 static void xlvbd_flush(struct blkfront_info *info)
 {
 	blk_queue_flush(info->rq, info->feature_flush);
-	printk(KERN_INFO "blkfront: %s: %s: %s %s %s %s %s\n",
-	       info->gd->disk_name,
-	       info->flush_op == BLKIF_OP_WRITE_BARRIER ?
-		"barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
-		"flush diskcache" : "barrier or flush"),
-	       info->feature_flush ? "enabled;" : "disabled;",
-	       "persistent grants:",
-	       info->feature_persistent ? "enabled;" : "disabled;",
-	       "indirect descriptors:",
-	       info->max_indirect_segments ? "enabled;" : "disabled;");
+	pr_info("blkfront: %s: %s %s %s %s %s\n",
+		info->gd->disk_name, flush_info(info->feature_flush),
+		"persistent grants:", info->feature_persistent ?
+		"enabled;" : "disabled;", "indirect descriptors:",
+		info->max_indirect_segments ? "enabled;" : "disabled;");
 }
 
 static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
@@ -1190,7 +1207,6 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 				if (error == -EOPNOTSUPP)
 					error = 0;
 				info->feature_flush = 0;
-				info->flush_op = 0;
 				xlvbd_flush(info);
 			}
 			/* fall through */
@@ -1810,7 +1826,6 @@ static void blkfront_connect(struct blkfront_info *info)
 		physical_sector_size = sector_size;
 
 	info->feature_flush = 0;
-	info->flush_op = 0;
 
 	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
 			    "feature-barrier", "%d", &barrier,
@@ -1823,10 +1838,8 @@ static void blkfront_connect(struct blkfront_info *info)
 	 *
 	 * If there are barriers, then we use flush.
 	 */
-	if (!err && barrier) {
+	if (!err && barrier)
 		info->feature_flush = REQ_FLUSH | REQ_FUA;
-		info->flush_op = BLKIF_OP_WRITE_BARRIER;
-	}
 	/*
 	 * And if there is feature-flush-cache use that above
 	 * barriers.
@@ -1835,10 +1848,8 @@ static void blkfront_connect(struct

[PATCH 0/5] Tools: hv: fix compiler warnings and do minor cleanup

2014-12-09 Thread Vitaly Kuznetsov
Running 'make' in tools/hv/ reveals two issues:
- hv_fcopy_daemon is not being built;
- lots of compiler warnings.

This is just a cleanup. Compile-tested by myself on top of linux-next/master.

Also piggyback on this series to send [PATCH 5/5] Tools: hv: do not add
redundant '/' in hv_start_fcopy().

Vitaly Kuznetsov (5):
  Tools: hv: add missing fcopyd to the Makefile
  Tools: hv: remove unused bytes_written from kvp_update_file()
  Tools: hv: address compiler warnings for hv_kvp_daemon.c
  Tools: hv: address compiler warnings for hv_fcopy_daemon.c
  Tools: hv: do not add redundant '/' in hv_start_fcopy()

 tools/hv/Makefile          |  4 ++--
 tools/hv/hv_fcopy_daemon.c | 10 ++--------
 tools/hv/hv_kvp_daemon.c   | 29 +++++++++++++----------------
 3 files changed, 17 insertions(+), 26 deletions(-)

-- 
1.9.3



[PATCH 2/5] Tools: hv: remove unused bytes_written from kvp_update_file()

2014-12-09 Thread Vitaly Kuznetsov
fwrite() returns the number of items written, not the number of bytes, and
the value is ignored anyway: ferror() is called to check for an error. As we
assign to this variable and never use it, we get the following compile-time
warning:
hv_kvp_daemon.c:149:9: warning: variable ‘bytes_written’ set but not used [-Wunused-but-set-variable]

Remove bytes_written completely.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 tools/hv/hv_kvp_daemon.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
index 6a6432a..5a274ca 100644
--- a/tools/hv/hv_kvp_daemon.c
+++ b/tools/hv/hv_kvp_daemon.c
@@ -147,7 +147,6 @@ static void kvp_release_lock(int pool)
 static void kvp_update_file(int pool)
 {
FILE *filep;
-   size_t bytes_written;
 
/*
 * We are going to write our in-memory registry out to
@@ -163,8 +162,7 @@ static void kvp_update_file(int pool)
exit(EXIT_FAILURE);
}
 
-   bytes_written = fwrite(kvp_file_info[pool].records,
-   sizeof(struct kvp_record),
+   fwrite(kvp_file_info[pool].records, sizeof(struct kvp_record),
kvp_file_info[pool].num_records, filep);
 
if (ferror(filep) || fclose(filep)) {
-- 
1.9.3
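
For illustration, a minimal user-space sketch of checking an fwrite()
result against the item count. struct record and write_records() are
hypothetical names for this example; this is not the daemon's code, which
relies on ferror() alone.

```c
#include <stdio.h>

/* Hypothetical record type standing in for struct kvp_record. */
struct record {
	char key[8];
	char value[8];
};

/*
 * Write n records to f. Returns 0 on success and -1 on a short write or
 * a stream error. fwrite() reports how many whole items it wrote, so the
 * result is compared against n rather than against a byte count.
 */
static int write_records(FILE *f, const struct record *recs, size_t n)
{
	size_t items = fwrite(recs, sizeof(*recs), n, f);

	if (items != n || ferror(f))
		return -1;
	return 0;
}
```

Writing two 16-byte records makes fwrite() return 2, not 32.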



[PATCH 5/5] Tools: hv: do not add redundant '/' in hv_start_fcopy()

2014-12-09 Thread Vitaly Kuznetsov
We don't need to add an additional '/' to smsg->path_name, as
snprintf("%s/%s") already inserts the separator. Without the patch we get a
doubled '//' in the log message.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 tools/hv/hv_fcopy_daemon.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c
index 1a23872..9445d8f 100644
--- a/tools/hv/hv_fcopy_daemon.c
+++ b/tools/hv/hv_fcopy_daemon.c
@@ -43,12 +43,6 @@ static int hv_start_fcopy(struct hv_start_fcopy *smsg)
 	int error = HV_E_FAIL;
 	char *q, *p;
 
-	/*
-	 * If possile append a path seperator to the path.
-	 */
-	if (strlen((char *)smsg->path_name) < (W_MAX_PATH - 2))
-		strcat((char *)smsg->path_name, "/");
-
 	p = (char *)smsg->path_name;
 	snprintf(target_fname, sizeof(target_fname), "%s/%s",
 		 (char *)smsg->path_name, (char *)smsg->file_name);
-- 
1.9.3
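
For illustration, a user-space sketch of joining a directory and a file
name with exactly one separator. join_path() is a hypothetical helper
written for this example; the patch itself only drops the strcat() and
does not normalize a path that already ends in '/'.

```c
#include <stdio.h>
#include <string.h>

/*
 * Join dir and name with a single '/'. Trailing separators in dir are
 * dropped first, so "/tmp" and "/tmp/" give the same result (and "/"
 * alone yields "/name"). Returns 0 on success, -1 if out is too small.
 */
static int join_path(char *out, size_t out_len,
		     const char *dir, const char *name)
{
	size_t dlen = strlen(dir);
	int n;

	while (dlen > 0 && dir[dlen - 1] == '/')
		dlen--;

	n = snprintf(out, out_len, "%.*s/%s", (int)dlen, dir, name);
	if (n < 0 || (size_t)n >= out_len)
		return -1;
	return 0;
}
```

Appending '/' before formatting with "%s/%s" is what produced the doubled
separator in the log message.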



[PATCH 1/5] Tools: hv: add missing fcopyd to the Makefile

2014-12-09 Thread Vitaly Kuznetsov
fcopyd is missing from the Makefile, add it there.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 tools/hv/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/hv/Makefile b/tools/hv/Makefile
index bd22f78..99ffe61 100644
--- a/tools/hv/Makefile
+++ b/tools/hv/Makefile
@@ -5,9 +5,9 @@ PTHREAD_LIBS = -lpthread
 WARNINGS = -Wall -Wextra
 CFLAGS = $(WARNINGS) -g $(PTHREAD_LIBS)
 
-all: hv_kvp_daemon hv_vss_daemon
+all: hv_kvp_daemon hv_vss_daemon hv_fcopy_daemon
 %: %.c
$(CC) $(CFLAGS) -o $@ $^
 
 clean:
-   $(RM) hv_kvp_daemon hv_vss_daemon
+   $(RM) hv_kvp_daemon hv_vss_daemon hv_fcopy_daemon
-- 
1.9.3



[PATCH 3/5] Tools: hv: address compiler warnings for hv_kvp_daemon.c

2014-12-09 Thread Vitaly Kuznetsov
This patch addresses two types of compiler warnings:
... warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
and
... warning: pointer targets in passing argument N of ‘kvp_...’ differ in signedness [-Wpointer-sign]

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 tools/hv/hv_kvp_daemon.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
index 5a274ca..48a95f9 100644
--- a/tools/hv/hv_kvp_daemon.c
+++ b/tools/hv/hv_kvp_daemon.c
@@ -308,7 +308,7 @@ static int kvp_file_init(void)
return 0;
 }
 
-static int kvp_key_delete(int pool, const char *key, int key_size)
+static int kvp_key_delete(int pool, const __u8 *key, int key_size)
 {
int i;
int j, k;
@@ -351,8 +351,8 @@ static int kvp_key_delete(int pool, const char *key, int 
key_size)
return 1;
 }
 
-static int kvp_key_add_or_modify(int pool, const char *key, int key_size, 
const char *value,
-   int value_size)
+static int kvp_key_add_or_modify(int pool, const __u8 *key, int key_size,
+const __u8 *value, int value_size)
 {
int i;
int num_records;
@@ -405,7 +405,7 @@ static int kvp_key_add_or_modify(int pool, const char *key, 
int key_size, const
return 0;
 }
 
-static int kvp_get_value(int pool, const char *key, int key_size, char *value,
+static int kvp_get_value(int pool, const __u8 *key, int key_size, __u8 *value,
int value_size)
 {
int i;
@@ -437,8 +437,8 @@ static int kvp_get_value(int pool, const char *key, int 
key_size, char *value,
return 1;
 }
 
-static int kvp_pool_enumerate(int pool, int index, char *key, int key_size,
-   char *value, int value_size)
+static int kvp_pool_enumerate(int pool, int index, __u8 *key, int key_size,
+   __u8 *value, int value_size)
 {
struct kvp_record *record;
 
@@ -659,7 +659,7 @@ static char *kvp_if_name_to_mac(char *if_name)
char*p, *x;
charbuf[256];
char addr_file[256];
-   int i;
+   unsigned int i;
char *mac_addr = NULL;
 
snprintf(addr_file, sizeof(addr_file), %s%s%s, /sys/class/net/,
@@ -698,7 +698,7 @@ static char *kvp_mac_to_if_name(char *mac)
charbuf[256];
char *kvp_net_dir = /sys/class/net/;
char dev_id[256];
-   int i;
+   unsigned int i;
 
dir = opendir(kvp_net_dir);
if (dir == NULL)
@@ -748,7 +748,7 @@ static char *kvp_mac_to_if_name(char *mac)
 
 
 static void kvp_process_ipconfig_file(char *cmd,
-   char *config_buf, int len,
+   char *config_buf, unsigned int len,
int element_size, int offset)
 {
char buf[256];
@@ -766,7 +766,7 @@ static void kvp_process_ipconfig_file(char *cmd,
 	if (offset == 0)
 		memset(config_buf, 0, len);
 	while ((p = fgets(buf, sizeof(buf), file)) != NULL) {
-		if ((len - strlen(config_buf)) < (element_size + 1))
+		if (len < strlen(config_buf) + element_size + 1)
 			break;
 
 		x = strchr(p, '\n');
@@ -914,7 +914,7 @@ static int kvp_process_ip_address(void *addrp,
 
 static int
 kvp_get_ip_info(int family, char *if_name, int op,
-void  *out_buffer, int length)
+void  *out_buffer, unsigned int length)
 {
struct ifaddrs *ifap;
struct ifaddrs *curp;
@@ -1017,8 +1017,7 @@ kvp_get_ip_info(int family, char *if_name, int op,
weight += hweight32(w[i]);
 
 				sprintf(cidr_mask, "/%d", weight);
-				if ((length - sn_offset) <
-					(strlen(cidr_mask) + 1))
+				if (length < sn_offset + strlen(cidr_mask) + 1)
 					goto gather_ipaddr;
 
if (sn_offset == 0)
-- 
1.9.3
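
For illustration, a user-space sketch of why the space checks read better
in the additive form. The two helper names are hypothetical, and whether
the buffer can really hold more than len characters in the daemon is not
established here; the example only shows how the two expressions differ
if it does.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Subtraction form: len - strlen(buf) is evaluated as size_t, so it wraps
 * around to a huge value when the string is longer than len, and the
 * "no room" test then evaluates to false.
 */
static bool no_room_subtract(unsigned int len, const char *buf,
			     unsigned int element_size)
{
	return (len - strlen(buf)) < (element_size + 1);
}

/* Additive form: no subtraction, so nothing can wrap below zero. */
static bool no_room_add(unsigned int len, const char *buf,
			unsigned int element_size)
{
	return len < strlen(buf) + element_size + 1;
}
```

Both forms agree whenever the string fits into len; they only differ once
it is longer.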



[PATCH 4/5] Tools: hv: address compiler warnings for hv_fcopy_daemon.c

2014-12-09 Thread Vitaly Kuznetsov
This patch addresses two types of compiler warnings:
... warning: unused variable ‘fd’ [-Wunused-variable]
and
... warning: format ‘%s’ expects argument of type ‘char *’, but argument 5 has type ‘__u16 *’ [-Wformat=]

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 tools/hv/hv_fcopy_daemon.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c
index f437d73..1a23872 100644
--- a/tools/hv/hv_fcopy_daemon.c
+++ b/tools/hv/hv_fcopy_daemon.c
@@ -51,7 +51,7 @@ static int hv_start_fcopy(struct hv_start_fcopy *smsg)
 
 	p = (char *)smsg->path_name;
 	snprintf(target_fname, sizeof(target_fname), "%s/%s",
-		(char *)smsg->path_name, smsg->file_name);
+		 (char *)smsg->path_name, (char *)smsg->file_name);
 
 	syslog(LOG_INFO, "Target file name: %s", target_fname);
/*
@@ -137,7 +137,7 @@ void print_usage(char *argv[])
 
 int main(int argc, char *argv[])
 {
-   int fd, fcopy_fd, len;
+   int fcopy_fd, len;
int error;
int daemonize = 1, long_index = 0, opt;
int version = FCOPY_CURRENT_VERSION;
-- 
1.9.3



Re: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work

2014-12-10 Thread Vitaly Kuznetsov
Dexuan Cui de...@microsoft.com writes:

 Thanks,
 -- Dexuan

 -Original Message-
 From: Dexuan Cui
 Sent: Wednesday, December 10, 2014 15:34 PM
 To: 'Vitaly Kuznetsov'
 Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
 de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
 jasow...@redhat.com; KY Srinivasan; Haiyang Zhang
 Subject: RE: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work
 
  -Original Message-
  From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
  Sent: Tuesday, December 9, 2014 21:06 PM
  To: Dexuan Cui
  Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
  de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
  jasow...@redhat.com; KY Srinivasan; Haiyang Zhang
  Subject: Re: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work
   ..
   + * Here dhcp_enabled is only for IPv4 according to Hyper-V 
   host
  team.
   + *
   + * In the case the host only injects 1 IPv6 address:
   + * new_val-dhcp_enabled is true, but we can't pass
  BOOTPROTO=dhcp to
   + * the script hv_set_ifconfig, because in some distros (like 
   RHEL7)
   + * BOOTPROTO=dhcp has a special meaning in the config file 
   (e.g.,
   + * /etc/sysconfig/network-scripts/ifcfg-eth0): the network init
  program
   + * ignores any static IP addr information once there is
   + * BOOTPROTO=dhcp; as a result, IPv6-only injection can't work.
   + *
   + * In the case of IPv6-only injection, BOOTPROTO=dhcp doesn't 
   affect
   + * Ubuntu because it's ignored by the Ubuntu version of
   + * hv_set_ifconfig and it doesn't seem to have special meaning 
   in
   + * Ubuntu.
   + */
 
  I just checked and adding IPV6ADDR=something when
 BOOTPROTO=dhcp
  works for me with both RHEL7 and Fedora21.
 It doesn't work in my side. :-(
 Running 'ifup eth0' shows some errors(I use set -x)
 ...
 + /sbin/dhclient -H localhost -1 -q -lf 
 /var/lib/dhclient/dhclient--eth0.lease -pf
 /var/run/dhclient-eth0.pid eth0
 grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied.
 BTW, I run with root, and
 'chown 777 /etc/sysconfig/network-scripts/ifcfg-eth0 doesn't help.


s,chown,chmod, :-) But it won't help in case of SELinux mislabeling.

 Thanks,
 -- Dexuan

-- 
  Vitaly


Re: [PATCH 0/5] Tools: hv: fix compiler warnings and do minor cleanup

2014-12-10 Thread Vitaly Kuznetsov
Dexuan Cui de...@microsoft.com writes:

 -Original Message-
 From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
 Sent: Tuesday, December 9, 2014 23:48 PM
 To: KY Srinivasan
 Cc: Haiyang Zhang; de...@linuxdriverproject.org; linux-
 ker...@vger.kernel.org; Dexuan Cui
 Subject: [PATCH 0/5] Tools: hv: fix compiler warnings and do minor cleanup
 
 When someone does 'make' in tools/hv/ issues appear:
 - hv_fcopy_daemon is not being built;
 - lots of compiler warnings.
 
 This is just a cleanup. Compile-tested by myself on top of linux-next/master.
 
 Piggyback this series and send [PATCH 5/5] Tools: hv: do not add redundant
 '/'
  in hv_start_fcopy()
 
 Vitaly Kuznetsov (5):
   Tools: hv: add mising fcopyd to the Makefile
   Tools: hv: remove unused bytes_written from kvp_update_file()
   Tools: hv: address compiler warnings for hv_kvp_daemon.c
   Tools: hv: address compiler warnings for hv_fcopy_daemon.c
   Tools: hv: do not add redundant '/' in hv_start_fcopy()
 
  tools/hv/Makefile  |  4 ++--
  tools/hv/hv_fcopy_daemon.c | 10 ++
  tools/hv/hv_kvp_daemon.c   | 29 +
  3 files changed, 17 insertions(+), 26 deletions(-)
 
 --
 1.9.3

 Hi Vitaly,
 Thanks for the patchset!

 Acked-by: Dexuan Cui de...@microsoft.com

 PS, I added Greg into the TO list.
 The hv code in drivers/hv/ and tools/hv/ usually has to go into
 Greg's tree first.

Well, I don't mind spamming Greg, but he's not in the
scripts/get_maintainer.pl output. In case he's not monitoring the list
for patches with some other tool (patchwork?), a patch adding him to
MAINTAINERS would do the job.

Greg, do you want to become an official Hyper-V maintainer in
MAINTAINERS? I can send a patch then :-)

-- 
  Vitaly


Re: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work

2014-12-10 Thread Vitaly Kuznetsov
Dexuan Cui de...@microsoft.com writes:

 -Original Message-
 From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
 Sent: Tuesday, December 9, 2014 21:06 PM
 To: Dexuan Cui
 Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
 de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
 jasow...@redhat.com; KY Srinivasan; Haiyang Zhang
 Subject: Re: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work
  ..
  +   * Here dhcp_enabled is only for IPv4 according to Hyper-V host
 team.
  +   *
  +   * In the case the host only injects 1 IPv6 address:
  +   * new_val-dhcp_enabled is true, but we can't pass
 BOOTPROTO=dhcp to
  +   * the script hv_set_ifconfig, because in some distros (like RHEL7)
  +   * BOOTPROTO=dhcp has a special meaning in the config file (e.g.,
  +   * /etc/sysconfig/network-scripts/ifcfg-eth0): the network init
 program
  +   * ignores any static IP addr information once there is
  +   * BOOTPROTO=dhcp; as a result, IPv6-only injection can't work.
  +   *
  +   * In the case of IPv6-only injection, BOOTPROTO=dhcp doesn't affect
  +   * Ubuntu because it's ignored by the Ubuntu version of
  +   * hv_set_ifconfig and it doesn't seem to have special meaning in
  +   * Ubuntu.
  +   */
 
 I just checked and adding IPV6ADDR=something when BOOTPROTO=dhcp
 works for me with both RHEL7 and Fedora21.
 It doesn't work in my side. :-(
 Running 'ifup eth0' shows some errors(I use set -x)
 ...
 + /sbin/dhclient -H localhost -1 -q -lf 
 /var/lib/dhclient/dhclient--eth0.lease -pf /var/run/dhclient-eth0.pid eth0
 grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied.
 grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied.
 grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied.
 grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied.
 grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied.
 grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied.
 done.

 I'm trying to find out the cause.

SELinux? You can try 'setenforce 0' to figure this out.


 Other than that I think bringing distribution specifics into kernel.git
 is not a good idea. /etc/sysconfig/network-scripts/ifcfg-* format is
 distro-specific and not all Linux distros support it. Moreover,
 I agree.

 different distros can treat setting differently. I think it was wrong to
 stick to this format in kvp daemon from very beginning.
 We can also think the current format used in kvp daemon is already
 distro-agnostic -- it just happens to look like the style of network config 
 file
 used in RHEL :-)

Yes, it is already there and I don't see any point in changing it.


 
 As a solution I would suggest doing the following: kvp daemon writes all
 received request details in distro-agnostic format in some temporary
 place and then calls distro-specific script to set things up. Actually,
 we already have such script: tools/hv/hv_set_ifconfig.sh
 Yeah, this is exactly what we already have today.

 As for this bug I propose the following: remove skipping all
 IPADDR/MASK/... settings in case of BOOTPROTO=dhcp and let
 distro-specific script deal with the rest.
 --
   Vitaly
 OK, so the patch would be 1-line only:

 diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
 index 22b0764..53fdaad 100644
 --- a/tools/hv/hv_kvp_daemon.c
 +++ b/tools/hv/hv_kvp_daemon.c
 @@ -1314,10 +1314,8 @@ static int kvp_set_ip_info(char *if_name, struct 
 hv_kvp_ipaddr_value *new_val)
 goto setval_error;

 /*
 -* We are done!.
 +* We are not done... TODO: add comment here.
  */
 -   goto setval_done;
 -
 } else {
 error = kvp_write_file(file, BOOTPROTO, , none);
 if (error)

 I'll send out a v2 after I resolve the grep ... Permission dinied issue.

 Thanks,
 -- Dexuan

-- 
  Vitaly


Re: [PATCH v2] tools: hv: kvp_daemon: make IPv6-only-injection work

2015-01-02 Thread Vitaly Kuznetsov
Dexuan Cui de...@microsoft.com writes:

 -Original Message-
 From: devel [mailto:driverdev-devel-boun...@linuxdriverproject.org] On
 Behalf Of Dexuan Cui
 Sent: Wednesday, December 10, 2014 19:33 PM
 To: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
 de...@linuxdriverproject.org; vkuzn...@redhat.com; o...@aepfle.de;
 a...@canonical.com; jasow...@redhat.com; KY Srinivasan
 Cc: Haiyang Zhang
 Subject: [PATCH v2] tools: hv: kvp_daemon: make IPv6-only-injection work
 
 In the case the host only injects an IPv6 address, the dhcp_enabled flag is
 true (it's only for IPv4 according to Hyper-V host team), but we still need
 to proceed to parse the IPv6 information.
 
 Cc: Vitaly Kuznetsov vkuzn...@redhat.com
 Cc: K. Y. Srinivasan k...@microsoft.com
 Signed-off-by: Dexuan Cui de...@microsoft.com
 ---
 
 v2: removed the distro-specific logic as Vitaly suggested.
 
  tools/hv/hv_kvp_daemon.c | 12 ++--
  1 file changed, 6 insertions(+), 6 deletions(-)
 
 diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
 index 6a6432a..4b3ee35 100644
 --- a/tools/hv/hv_kvp_daemon.c
 +++ b/tools/hv/hv_kvp_daemon.c
 @@ -1308,16 +1308,17 @@ static int kvp_set_ip_info(char *if_name, struct hv_kvp_ipaddr_value *new_val)
  if (error)
  goto setval_error;
 
 +/*
 + * The dhcp_enabled flag is only for IPv4. In the case the host only
 + * injects an IPv6 address, the flag is true, but we still need to
 + * proceed to parse and pass the IPv6 information to the
 + * distro-specific script hv_set_ifconfig.
 + */

Actually we just relay what was received from the host, and it's now up to
the distro-specific script how to interpret BOOTPROTO=dhcp. Additional
IPv4 addresses (in case we receive them from our host) are no longer
skipped either.

  if (new_val->dhcp_enabled) {
  error = kvp_write_file(file, "BOOTPROTO", "", "dhcp");
  if (error)
  goto setval_error;
 
 -/*
 - * We are done!.
 - */
 -goto setval_done;
 -
  } else {
  error = kvp_write_file(file, "BOOTPROTO", "", "none");
  if (error)
 @@ -1345,7 +1346,6 @@ static int kvp_set_ip_info(char *if_name, struct hv_kvp_ipaddr_value *new_val)
  if (error)
  goto setval_error;
 
 -setval_done:
  fclose(file);
 
  /*
 --
 1.9.1
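
The effect of the v2 change can be sketched in plain C. This is only an
illustration and not the daemon's code: struct ip_info, write_ifcfg() and the
IPV6ADDR key are hypothetical, and the output goes to a buffer instead of the
real interface file. The point is that the address keys are still emitted
when DHCP is enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the host-provided settings. */
struct ip_info {
	bool dhcp_enabled;      /* per the thread, meaningful for IPv4 only */
	const char *ipv6_addr;  /* NULL when the host injected none */
};

/*
 * Format the settings as KEY=value lines into buf.
 * Returns the number of bytes written, or -1 if buf is too small.
 */
int write_ifcfg(char *buf, size_t len, const struct ip_info *info)
{
	int n, used = 0;

	assert(buf && info);

	n = snprintf(buf, len, "BOOTPROTO=%s\n",
		     info->dhcp_enabled ? "dhcp" : "none");
	if (n < 0 || (size_t)n >= len)
		return -1;
	used += n;

	/* No early exit for DHCP: the remaining keys are still relayed. */
	if (info->ipv6_addr) {
		n = snprintf(buf + used, len - used, "IPV6ADDR=%s\n",
			     info->ipv6_addr);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += n;
	}
	return used;
}
```

How BOOTPROTO=dhcp combined with explicit addresses is interpreted is then
left to whatever consumes the file, which matches the intent discussed above.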

 Hi Vitaly,
 Can you please ACK the v2 patch?

Sorry it took me so long to reply; I was on vacation for the last 3 weeks. I'm
not particularly sure I'm in charge here to give an ACK :-), but

Reviewed-By: Vitaly Kuznetsov vkuzn...@redhat.com

 Or, please let me know if you have new comments.

 Thanks,
 -- Dexuan

-- 
  Vitaly


Re: [PATCH 0/3] Drivers: hv: vmbus: fix crashes on hv_vmbus load/unload path

2015-01-26 Thread Vitaly Kuznetsov
Vitaly Kuznetsov vkuzn...@redhat.com writes:

 It is possible (since 93e5bd06a953: "Drivers: hv: Make the vmbus driver
 unloadable") to unload the hv_vmbus driver if no other devices are connected.
 1aec169673d7: "x86: Hyperv: Cleanup the irq mess" fixed the double interrupt
 gate setup. However, if we try to unload hv_vmbus and then load it back,
 crashes occur in different places of the vmbus driver on both the unload and
 second load paths. Address those I saw in my testing.

It seems that the newly introduced clockevent device ("Drivers: hv: vmbus:
Implement a clockevent device") makes it impossible to unload the hv_vmbus
module:

# rmmod hv_vmbus
rmmod hv_vmbus
rmmod: ERROR: Module hv_vmbus is in use

I'll try investigating before sending v2 without PATCH 2/3.


 Not everything is fixed though. MCE was hit once on Generation2 instance and
 I neither understand what caused it nor do I know the way to reproduce it.
 Anyway, here is the log:

 [  204.846255] mce: [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 0: b200c0020001
 [  204.846675] mce: [Hardware Error]: TSC 6b5cd64bc8
 [  204.846675] mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1421944123 SOCKET 0 APIC 0 microcode
 [  204.846675] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
 [  204.846675] mce: [Hardware Error]: Machine check: Processor context corrupt
 [  204.846675] Kernel panic - not syncing: Fatal Machine check
 [  204.846675] Kernel Offset: 0x0 from 0x8100 (relocation range: 0x8000-0x9fff)
 [  204.846675] Rebooting in 30 seconds..
 [  204.846675] ACPI MEMORY or I/O RESET_REG.

 Vitaly Kuznetsov (3):
   Drivers: hv: vmbus: avoid double kfree for device_obj
   Drivers: hv: vmbus: introduce vmbus_acpi_remove
   Drivers: hv: vmbus: teardown hv_vmbus_con workqueue and
 vmbus_connection pages on shutdown

  drivers/hv/channel_mgmt.c |  1 -
  drivers/hv/connection.c   | 17 -
  drivers/hv/hyperv_vmbus.h |  1 +
  drivers/hv/vmbus_drv.c| 16 
  4 files changed, 29 insertions(+), 6 deletions(-)

-- 
  Vitaly


Re: [PATCH 1/4] Drivers: hv: vmbus: implement get/put usage workflow for vmbus channels

2015-02-04 Thread Vitaly Kuznetsov
Jason Wang jasow...@redhat.com writes:

 On Wed, Feb 4, 2015 at 1:00 AM, Vitaly Kuznetsov vkuzn...@redhat.com
 wrote:
 free_channel() function frees the channel unconditionally so we need to make
 sure nobody has any link to it. This is not trivial and there are several
 examples of races we have:

 1) In vmbus_onoffer_rescind() we check for channel existence with
    relid2channel() and then use it. This can go wrong if we're in the middle
    of channel removal (free_channel() was already called).

 2) In process_chn_event() we check for channel existence with
    pcpu_relid2channel() and then use it. This can also go wrong.

 3) vmbus_free_channels() just frees all channels, in case we're in the middle
    of vmbus_process_rescind_offer() crash is possible.

 The issue can be solved by holding vmbus_connection.channel_lock everywhere,
 however, it looks like a way to deadlocks and performance degradation. Get/put
 workflow fits here the best.

 Implement vmbus_get_channel()/vmbus_put_channel() pair instead of
 free_channel().

 Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
 ---
  drivers/hv/channel_mgmt.c | 45
 ++---
  drivers/hv/connection.c   |  7 +--
  drivers/hv/hyperv_vmbus.h |  4 
  include/linux/hyperv.h| 13 +
  4 files changed, 60 insertions(+), 9 deletions(-)

 diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
 index 36bacc7..eb9ce94 100644
 --- a/drivers/hv/channel_mgmt.c
 +++ b/drivers/hv/channel_mgmt.c
 @@ -147,6 +147,8 @@ static struct vmbus_channel *alloc_channel(void)
  return NULL;
  channel-id = atomic_inc_return(chan_num);
 +atomic_set(channel-count, 1);
 +
  spin_lock_init(channel-inbound_lock);
  spin_lock_init(channel-lock);
  @@ -178,19 +180,47 @@ static void release_channel(struct
 work_struct *work)
  }
   /*
 - * free_channel - Release the resources used by the vmbus channel
 object
 + * vmbus_put_channel - Decrease the channel usage counter and
 release the
 + * resources when this counter reaches zero.
   */
 -static void free_channel(struct vmbus_channel *channel)
 +void vmbus_put_channel(struct vmbus_channel *channel)
  {
 +unsigned long flags;
  /*
   * We have to release the channel's workqueue/thread in the vmbus's
   * workqueue/thread context
   * ie we can't destroy ourselves.
   */
 -INIT_WORK(channel-work, release_channel);
 -queue_work(vmbus_connection.work_queue, channel-work);
 +spin_lock_irqsave(channel-lock, flags);
 +if (atomic_dec_and_test(channel-count)) {
 +channel-dying = true;
 +INIT_WORK(channel-work, release_channel);
 +spin_unlock_irqrestore(channel-lock, flags);
 +queue_work(vmbus_connection.work_queue, channel-work);
 +} else
 +spin_unlock_irqrestore(channel-lock, flags);
 +}
 +EXPORT_SYMBOL_GPL(vmbus_put_channel);
 +
 +/* vmbus_get_channel - Get additional reference to the channel */
 +struct vmbus_channel *vmbus_get_channel(struct vmbus_channel
 *channel)
 +{
 +unsigned long flags;
 +struct vmbus_channel *ret = NULL;
 +
 +if (!channel)
 +return NULL;
 +
 +spin_lock_irqsave(channel-lock, flags);
 +if (!channel-dying) {
 +atomic_inc(channel-count);
 +ret = channel;
 +}
 +spin_unlock_irqrestore(channel-lock, flags);

 Looks like we can use atomic_inc_return_safe() here to avoid extra
 dying. And then there's also no need for the spinlock.

 if (atomic_inc_return_safe(&channel->count) > 0)
   return channel;
 else
   return NULL;

Good idea, thanks! I'll try.
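
A lock-free "take a reference unless the count already dropped to zero" can
be sketched with C11 atomics. This is a userspace illustration only: struct
chan, chan_get() and chan_put() are hypothetical names, and the helper name
in the suggestion above is quoted as written. In the kernel,
atomic_inc_not_zero() provides the semantics shown here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical channel with a usage counter; 0 means "being released". */
struct chan {
	atomic_int count;
	bool released;   /* set once by the final put */
};

/* Take a reference only if the counter is still positive. */
struct chan *chan_get(struct chan *c)
{
	int old;

	if (!c)
		return NULL;

	old = atomic_load(&c->count);
	while (old > 0) {
		/* On failure 'old' is reloaded and the check repeats. */
		if (atomic_compare_exchange_weak(&c->count, &old, old + 1))
			return c;
	}
	return NULL;
}

/* Drop a reference; the caller that reaches zero does the release. */
void chan_put(struct chan *c)
{
	int prev = atomic_fetch_sub(&c->count, 1);

	assert(prev > 0);
	if (prev == 1)
		c->released = true;   /* stands in for queueing the real release */
}
```

With this shape neither a separate dying flag nor the spinlock is needed in
the get path, since a counter that has reached zero can never be raised again.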


 +return ret;
  }
 +EXPORT_SYMBOL_GPL(vmbus_get_channel);
   static void percpu_channel_enq(void *arg)
  {
 @@ -253,7 +283,7 @@ static void vmbus_process_rescind_offer(struct
 work_struct *work)
  list_del(channel-sc_list);
  spin_unlock_irqrestore(primary_channel-lock, flags);
  }
 -free_channel(channel);
 +vmbus_put_channel(channel);
  }
   void vmbus_free_channels(void)
 @@ -262,7 +292,7 @@ void vmbus_free_channels(void)
  list_for_each_entry(channel, vmbus_connection.chn_list,
 listentry) {
  vmbus_device_unregister(channel-device_obj);
 -free_channel(channel);
 +vmbus_put_channel(channel);
  }
  }
  @@ -391,7 +421,7 @@ done_init_rescind:
  spin_unlock_irqrestore(newchannel-lock, flags);
  return;
  err_free_chan:
 -free_channel(newchannel);
 +vmbus_put_channel(newchannel);
  }
   enum {
 @@ -549,6 +579,7 @@ static void vmbus_onoffer_rescind(struct
 vmbus_channel_message_header *hdr)
  queue_work(channel-controlwq, channel-work);
  spin_unlock_irqrestore(channel-lock, flags);
 +vmbus_put_channel(channel);
  }
   /*
 diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
 index c4acd1c..d1ce134 100644
 --- a/drivers/hv/connection.c
 +++ b/drivers/hv/connection.c
 @@ -247,7

Re: [PATCH 1/4] Drivers: hv: vmbus: implement get/put usage workflow for vmbus channels

2015-02-04 Thread Vitaly Kuznetsov
Dexuan Cui de...@microsoft.com writes:

 -Original Message-
 From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
 Sent: Wednesday, February 4, 2015 1:01 AM
 To: KY Srinivasan; de...@linuxdriverproject.org
 Cc: Haiyang Zhang; linux-kernel@vger.kernel.org; Dexuan Cui; Jason Wang
 Subject: [PATCH 1/4] Drivers: hv: vmbus: implement get/put usage workflow for
 vmbus channels
 
 free_channel() function frees the channel unconditionally so we need to make
 sure nobody has any link to it. This is not trivial and there are several
 examples of races we have:
 
 1) In vmbus_onoffer_rescind() we check for channel existence with
relid2channel() and then use it. This can go wrong if we're in the middle
of channel removal (free_channel() was already called).
 
 2) In process_chn_event() we check for channel existence with
pcpu_relid2channel() and then use it. This can also go wrong.
 
 3) vmbus_free_channels() just frees all channels, in case we're in the middle
of vmbus_process_rescind_offer() crash is possible.
 
 The issue can be solved by holding vmbus_connection.channel_lock everywhere,
 however, it looks like a way to deadlocks and performance degradation. 
 Get/put
 workflow fits here the best.
 
 Implement vmbus_get_channel()/vmbus_put_channel() pair instead of
 free_channel().
 
 Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
 ---
  drivers/hv/channel_mgmt.c | 45
 ++---
  drivers/hv/connection.c   |  7 +--
  drivers/hv/hyperv_vmbus.h |  4 
  include/linux/hyperv.h| 13 +
  4 files changed, 60 insertions(+), 9 deletions(-)
 
 diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
 index 36bacc7..eb9ce94 100644
 --- a/drivers/hv/channel_mgmt.c
 +++ b/drivers/hv/channel_mgmt.c
 @@ -147,6 +147,8 @@ static struct vmbus_channel *alloc_channel(void)
  return NULL;
 
  channel-id = atomic_inc_return(chan_num);
 +atomic_set(channel-count, 1);
 +
  spin_lock_init(channel-inbound_lock);
  spin_lock_init(channel-lock);
 
 @@ -178,19 +180,47 @@ static void release_channel(struct work_struct *work)
  }
 
  /*
 - * free_channel - Release the resources used by the vmbus channel object
 + * vmbus_put_channel - Decrease the channel usage counter and release the
 + * resources when this counter reaches zero.
   */
 -static void free_channel(struct vmbus_channel *channel)
 +void vmbus_put_channel(struct vmbus_channel *channel)
  {
 +unsigned long flags;
 
  /*
   * We have to release the channel's workqueue/thread in the vmbus's
   * workqueue/thread context
   * ie we can't destroy ourselves.
   */
 -INIT_WORK(channel-work, release_channel);
 -queue_work(vmbus_connection.work_queue, channel-work);
 +spin_lock_irqsave(channel-lock, flags);
 +if (atomic_dec_and_test(channel-count)) {
 +channel-dying = true;
 +INIT_WORK(channel-work, release_channel);
 +spin_unlock_irqrestore(channel-lock, flags);
 +queue_work(vmbus_connection.work_queue, channel-work);
 +} else
 +spin_unlock_irqrestore(channel-lock, flags);
 +}
 +EXPORT_SYMBOL_GPL(vmbus_put_channel);
 +
 +/* vmbus_get_channel - Get additional reference to the channel */
 +struct vmbus_channel *vmbus_get_channel(struct vmbus_channel *channel)
 +{
 +unsigned long flags;
 +struct vmbus_channel *ret = NULL;
 +
 +if (!channel)
 +return NULL;
 +
 +spin_lock_irqsave(channel-lock, flags);
 +if (!channel-dying) {
 +atomic_inc(channel-count);
 +ret = channel;
 +}
 +spin_unlock_irqrestore(channel-lock, flags);
 +return ret;
  }
 +EXPORT_SYMBOL_GPL(vmbus_get_channel);
 
  static void percpu_channel_enq(void *arg)
  {
 @@ -253,7 +283,7 @@ static void vmbus_process_rescind_offer(struct
 work_struct *work)
  list_del(channel-sc_list);
  spin_unlock_irqrestore(primary_channel-lock, flags);
  }
 -free_channel(channel);
 +vmbus_put_channel(channel);
  }
 
  void vmbus_free_channels(void)
 @@ -262,7 +292,7 @@ void vmbus_free_channels(void)
 
  list_for_each_entry(channel, vmbus_connection.chn_list, listentry) {
  vmbus_device_unregister(channel-device_obj);
 -free_channel(channel);
 +vmbus_put_channel(channel);
  }
  }
 
 @@ -391,7 +421,7 @@ done_init_rescind:
  spin_unlock_irqrestore(newchannel-lock, flags);
  return;
  err_free_chan:
 -free_channel(newchannel);
 +vmbus_put_channel(newchannel);
  }
 
  enum {
 @@ -549,6 +579,7 @@ static void vmbus_onoffer_rescind(struct
 vmbus_channel_message_header *hdr)
  queue_work(channel-controlwq, channel-work);
 
  spin_unlock_irqrestore(channel-lock, flags);
 +vmbus_put_channel(channel);
  }
 
  /*
 diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
 index c4acd1c..d1ce134 100644
 --- a/drivers/hv/connection.c
 +++ b/drivers/hv

Re: [PATCH 2/3] hv: vmbus_post_msg: retry the hypercall on HV_STATUS_INVALID_CONNECTION_ID

2015-01-30 Thread Vitaly Kuznetsov
Dexuan Cui de...@microsoft.com writes:

 -Original Message-
 From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
 Sent: Thursday, January 29, 2015 21:31 PM
 To: Dexuan Cui
 Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
 de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
 jasow...@redhat.com; KY Srinivasan; Haiyang Zhang
 Subject: Re: [PATCH 2/3] hv: vmbus_post_msg: retry the hypercall on
 HV_STATUS_INVALID_CONNECTION_ID
 
 Dexuan Cui de...@microsoft.com writes:
 
  I got the hypercall error code on Hyper-V 2008 R2 when keeping running
  rmmod hv_netvsc; modprobe hv_netvsc; rmmod hv_utils; modprobe hv_utils
  in a Linux guest.
 
  Without the patch, the driver can occasionally fail to load.
 
  CC: K. Y. Srinivasan k...@microsoft.com
  Signed-off-by: Dexuan Cui de...@microsoft.com
  ---
   arch/x86/include/uapi/asm/hyperv.h | 1 +
   drivers/hv/connection.c| 9 +
   2 files changed, 10 insertions(+)
 
  diff --git a/arch/x86/include/uapi/asm/hyperv.h
 b/arch/x86/include/uapi/asm/hyperv.h
  index 90c458e..b9daffb 100644
  --- a/arch/x86/include/uapi/asm/hyperv.h
  +++ b/arch/x86/include/uapi/asm/hyperv.h
  @@ -225,6 +225,7 @@
   #define HV_STATUS_INVALID_HYPERCALL_CODE  2
   #define HV_STATUS_INVALID_HYPERCALL_INPUT 3
   #define HV_STATUS_INVALID_ALIGNMENT   4
  +#define HV_STATUS_INVALID_CONNECTION_ID   18
   #define HV_STATUS_INSUFFICIENT_BUFFERS19
 
 The gap between 4 and 18 tells me there are other codes here ;-) Are they
 all 'permanent failures'?
 It looks we only need to care about these error codes here.

 BTW, you can get all the hypercall error codes in the top level functional
 spec:
 http://blogs.msdn.com/b/virtual_pc_guy/archive/2014/02/17/updated-hypervisor-top-level-functional-specification.aspx
 For this hypercall (0x005c), see 14.9.7 HvPostMessage.

Thanks, interesting!

Btw, HV_STATUS_INSUFFICIENT_MEMORY looks suspicious, looks like we can
hit it as well...

I suggest we split all failures here in 2 classes:
1) permanent
2) worth retrying

and treat them accordingly (no big changes, just maybe group them within
hv_post_message() together as it is the only place where these codes are
being used).
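
The split proposed above might look like the following sketch. The status
values are the ones listed in the quoted header hunk (plus 0 for success);
classify_post_status(), post_with_retry() and fake_post() are hypothetical
names, and which codes count as retryable is an assumption taken from this
discussion, not from the specification.

```c
#include <assert.h>
#include <stddef.h>

/* 0 is success; the other values are as listed in the quoted header hunk. */
#define HV_STATUS_SUCCESS                  0
#define HV_STATUS_INVALID_HYPERCALL_CODE   2
#define HV_STATUS_INVALID_HYPERCALL_INPUT  3
#define HV_STATUS_INVALID_ALIGNMENT        4
#define HV_STATUS_INVALID_CONNECTION_ID    18
#define HV_STATUS_INSUFFICIENT_BUFFERS     19

enum post_action { POST_DONE, POST_RETRY, POST_FAIL };

/* Hypothetical helper: decide what the caller should do with a status. */
enum post_action classify_post_status(unsigned int status)
{
	switch (status) {
	case HV_STATUS_SUCCESS:
		return POST_DONE;
	case HV_STATUS_INVALID_CONNECTION_ID:
	case HV_STATUS_INSUFFICIENT_BUFFERS:
		/* Treated as transient here: sleep, then try again. */
		return POST_RETRY;
	default:
		/* Everything else is treated as permanent in this sketch. */
		return POST_FAIL;
	}
}

/*
 * Hypothetical retry loop. 'post' stands in for the hypercall wrapper and
 * 'max_tries' for the retry limit; sleeping between tries is omitted.
 * Returns the last status seen.
 */
unsigned int post_with_retry(unsigned int (*post)(void), int max_tries)
{
	unsigned int status = HV_STATUS_SUCCESS;
	int i;

	assert(post != NULL && max_tries > 0);
	for (i = 0; i < max_tries; i++) {
		status = post();
		if (classify_post_status(status) != POST_RETRY)
			break;
	}
	return status;
}

/* Fake hypercall for demonstration: fails twice with 18, then succeeds. */
int fake_calls;

unsigned int fake_post(void)
{
	return ++fake_calls <= 2 ? HV_STATUS_INVALID_CONNECTION_ID
				 : HV_STATUS_SUCCESS;
}
```

Keeping the classification in one function means a newly discovered
retryable code only needs one extra case label.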


 
   typedef struct _HV_REFERENCE_TSC_PAGE {
  diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
  index c4acd1c..8bd05f3 100644
  --- a/drivers/hv/connection.c
  +++ b/drivers/hv/connection.c
  @@ -440,6 +440,15 @@ int vmbus_post_msg(void *buffer, size_t buflen)
 ret = hv_post_message(conn_id, 1, buffer, buflen);
 
 switch (ret) {
  +  case HV_STATUS_INVALID_CONNECTION_ID:
  +  /*
  +   * We could get this if we send messages too
  +   * frequently or the host is under low resource
  +   * conditions: let's wait 1 more second before
  +   * retrying the hypercall.
  +   */
  +  msleep(1000);
  +  break;
 
 In case it is our last try (No. 10) we will return '18' from the
 function. I suggest we set ret = -ENOMEM here as well.
 Thanks for the suggestion!

 I think it would be better to add this to the case
 HV_STATUS_INVALID_CONNECTION_ID:
  ret = -EAGAIN;
 ?

Yes, like fallthrough


 case HV_STATUS_INSUFFICIENT_BUFFERS:
 ret = -ENOMEM;
 
 Or should we treat these two equally? There is a smaller (100ms) sleep
 between tries already, we can consider changing it instead.
 
 case -ENOMEM:
 
 --
   Vitaly
 In my experiments, in the HV_STATUS_INVALID_CONNECTION_ID case,
 waiting 100ms is not enough sometimes, so I'd like to wait more time.
 I agree with you both cases can wait 1000ms. I'll update my patch.

 BTW, the case -ENOMEM: is not reachable (the hypervisor itself doesn't
 return -ENOMEM), I think. I can remove it.

hv_post_message() can return -EMSGSIZE or do_hypercall() return value
(which becomes u16 in hv_post_message()). So yes, I agree, -ENOMEM is
not possible.
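
A small model of that return path shows why the -ENOMEM case cannot trigger.
It is a sketch under assumptions: post_message_result() and MAX_PAYLOAD are
hypothetical, and the only behaviour taken from the discussion is "either
-EMSGSIZE or the hypercall result narrowed to 16 bits".

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative size limit, not the real constant. */
#define MAX_PAYLOAD 240

/*
 * Hypothetical model: a size check that yields -EMSGSIZE, otherwise the
 * low 16 bits of the raw hypercall result.
 */
int post_message_result(uint64_t raw_hypercall_result, unsigned int payload_len)
{
	if (payload_len > MAX_PAYLOAD)
		return -EMSGSIZE;

	/* A 16-bit value widened to int is always in 0..65535. */
	return (uint16_t)(raw_hypercall_result & 0xFFFF);
}

/* Self-check: even a raw result with the bit pattern of -ENOMEM is masked. */
void check_no_enomem(void)
{
	uint64_t raw = (uint64_t)(int64_t)-ENOMEM;

	assert(post_message_result(raw, 8) != -ENOMEM);
	assert(post_message_result(raw, 8) >= 0);
}
```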


 Thanks,
 -- Dexuan

-- 
  Vitaly


[PATCH 3/4] Drivers: hv: vmbus: protect vmbus_get_outgoing_channel() against channel removal

2015-02-03 Thread Vitaly Kuznetsov
The list_for_each_safe() we have in vmbus_get_outgoing_channel() works, however,
we are not protected against the channel being removed (e.g. after receiving a
rescind offer). Users of this function (storvsc_do_io() is the only one at this
moment) can get a link to an already freed channel. Make
vmbus_get_outgoing_channel() search while holding primary->lock, as child
channels are not freed unless they're removed from the parent's list.
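
The idea of "search under the parent's lock and take the reference before
dropping it" can be sketched in userspace C. Everything here is hypothetical
(struct sub, struct primary, pick_opened(), the atomic_flag spinlock), and the
sketch assumes that whoever unlinks an entry takes the same lock and frees
the entry only after its last reference is gone.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sub-channel: linked from its primary, refcounted. */
struct sub {
	struct sub *next;
	bool opened;
	int refs;          /* protected by the primary's lock in this sketch */
};

struct primary {
	atomic_flag lock;  /* minimal spinlock standing in for primary->lock */
	struct sub *subs;  /* singly linked list of sub-channels */
};

static void primary_lock(struct primary *p)
{
	while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
		;  /* spin */
}

static void primary_unlock(struct primary *p)
{
	atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/*
 * Return the first opened sub-channel with a reference already taken,
 * or NULL. The reference is taken before the lock is dropped, so a
 * remover that honours the lock cannot free the entry in between.
 */
struct sub *pick_opened(struct primary *p)
{
	struct sub *s, *found = NULL;

	assert(p != NULL);
	primary_lock(p);
	for (s = p->subs; s; s = s->next) {
		if (!s->opened)
			continue;
		s->refs++;
		found = s;
		break;
	}
	primary_unlock(p);
	return found;
}
```

The caller owns the returned reference and has to drop it when done, which is
what the storvsc_do_io() hunk below does with vmbus_put_channel().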

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 drivers/hv/channel_mgmt.c  | 10 +++---
 drivers/scsi/storvsc_drv.c |  2 ++
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index fdccd16..af6243c 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -881,18 +881,20 @@ cleanup:
  */
 struct vmbus_channel *vmbus_get_outgoing_channel(struct vmbus_channel *primary)
 {
-   struct list_head *cur, *tmp;
+   struct list_head *cur;
int cur_cpu;
struct vmbus_channel *cur_channel;
struct vmbus_channel *outgoing_channel = primary;
int cpu_distance, new_cpu_distance;
+   unsigned long flags;
 
if (list_empty(&primary->sc_list))
-   return outgoing_channel;
+   return vmbus_get_channel(outgoing_channel);
 
cur_cpu = hv_context.vp_index[get_cpu()];
put_cpu();
-   list_for_each_safe(cur, tmp, &primary->sc_list) {
+   spin_lock_irqsave(&primary->lock, flags);
+   list_for_each(cur, &primary->sc_list) {
cur_channel = list_entry(cur, struct vmbus_channel, sc_list);
if (cur_channel->state != CHANNEL_OPENED_STATE)
continue;
@@ -913,6 +915,8 @@ struct vmbus_channel *vmbus_get_outgoing_channel(struct vmbus_channel *primary)
 
outgoing_channel = cur_channel;
}
+   outgoing_channel = vmbus_get_channel(outgoing_channel);
+   spin_unlock_irqrestore(&primary->lock, flags);
 
return outgoing_channel;
 }
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 4cff0dd..3b9b851 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1370,6 +1370,8 @@ static int storvsc_do_io(struct hv_device *device,
   VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
}
 
+   vmbus_put_channel(outgoing_channel);
+
if (ret != 0)
return ret;
 
-- 
1.9.3



[PATCH 4/4] hyperv: netvsc: improve protection against rescind offer

2015-02-03 Thread Vitaly Kuznetsov
The check added in commit c3582a2c4d0b ("hyperv: Add support for vNIC hot
removal") is incomplete as there is no synchronization between
vmbus_onoffer_rescind() and netvsc_send(). In case we get the rescind offer
after we checked out_channel->rescind and before netvsc_send() finishes its job,
we can get a crash as we'll be dealing with an already freed channel.

Make netvsc_send() take additional reference to the channel with newly
introduced vmbus_get_channel(), this guarantees we won't lose the channel. We
can still get rescind while we're processing but this won't cause a crash.
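
The calling pattern described above (hold a reference for the whole send and
drop it on every exit path) can be sketched as follows. This is a
single-threaded illustration: struct chan, chan_get(), chan_put(),
send_packet() and send_ok() are hypothetical, and real code needs the locking
or atomics that the get/put helpers provide.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical channel: 'refs' counts users, 'rescind' marks removal. */
struct chan {
	int refs;
	bool dying;
	bool rescind;
};

/* Simplified, single-threaded get/put. */
static struct chan *chan_get(struct chan *c)
{
	if (!c || c->dying)
		return NULL;
	c->refs++;
	return c;
}

static void chan_put(struct chan *c)
{
	assert(c->refs > 0);
	if (--c->refs == 0)
		c->dying = true;
}

/*
 * Send path: hold a reference for the whole operation and drop it on
 * every exit. 'do_send' is a placeholder for the actual transmit call.
 */
int send_packet(struct chan *out, int (*do_send)(struct chan *))
{
	struct chan *c = chan_get(out);
	int ret;

	if (!c)
		return -ENODEV;

	if (c->rescind) {
		chan_put(c);
		return -ENODEV;
	}

	ret = do_send(c);
	chan_put(c);
	return ret;
}

/* Trivial transmit stub used for demonstration. */
int send_ok(struct chan *c)
{
	(void)c;
	return 0;
}
```

A rescind that arrives while do_send() runs may still make the send fail, but
the channel memory stays valid until the matching put.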

Reported-by: Jason Wang jasow...@redhat.com
Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 drivers/net/hyperv/netvsc.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 9f49c01..d9b13a1 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -763,11 +763,16 @@ int netvsc_send(struct hv_device *device,
out_channel = net_device->chn_table[packet->q_idx];
if (out_channel == NULL)
out_channel = device->channel;
-   packet->channel = out_channel;
+   packet->channel = vmbus_get_channel(out_channel);
 
-   if (out_channel->rescind)
+   if (!packet->channel)
return -ENODEV;
 
+   if (out_channel->rescind) {
+   vmbus_put_channel(out_channel);
+   return -ENODEV;
+   }
+
if (packet->page_buf_cnt) {
ret = vmbus_sendpacket_pagebuffer(out_channel,
  packet->page_buf,
@@ -810,6 +815,7 @@ int netvsc_send(struct hv_device *device,
   packet, ret);
}
 
+   vmbus_put_channel(packet->channel);
return ret;
 }
 
-- 
1.9.3



[PATCH 1/4] Drivers: hv: vmbus: implement get/put usage workflow for vmbus channels

2015-02-03 Thread Vitaly Kuznetsov
free_channel() function frees the channel unconditionally so we need to make
sure nobody has any link to it. This is not trivial and there are several
examples of races we have:

1) In vmbus_onoffer_rescind() we check for channel existence with
   relid2channel() and then use it. This can go wrong if we're in the middle
   of channel removal (free_channel() was already called).

2) In process_chn_event() we check for channel existence with
   pcpu_relid2channel() and then use it. This can also go wrong.

3) vmbus_free_channels() just frees all channels, in case we're in the middle
   of vmbus_process_rescind_offer() crash is possible.

The issue can be solved by holding vmbus_connection.channel_lock everywhere,
however, it looks like a way to deadlocks and performance degradation. Get/put
workflow fits here the best.

Implement vmbus_get_channel()/vmbus_put_channel() pair instead of
free_channel().
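
The get/put workflow with a dying flag can be modelled in userspace C as
below. It is a sketch, not the patch itself: struct chan and its fields are
hypothetical, an atomic_flag spinlock stands in for channel->lock, and a
counter stands in for queueing the release work.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a refcounted channel. */
struct chan {
	atomic_flag lock;   /* stands in for channel->lock */
	int count;          /* starts at 1, owned by the creator */
	bool dying;         /* set when the last reference is dropped */
	int releases;       /* counts how often release was scheduled */
};

static void chan_lock(struct chan *c)
{
	while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire))
		;  /* spin */
}

static void chan_unlock(struct chan *c)
{
	atomic_flag_clear_explicit(&c->lock, memory_order_release);
}

/* Get: fails once the channel is marked dying. */
struct chan *chan_get(struct chan *c)
{
	struct chan *ret = NULL;

	if (!c)
		return NULL;

	chan_lock(c);
	if (!c->dying) {
		c->count++;
		ret = c;
	}
	chan_unlock(c);
	return ret;
}

/* Put: the caller that drops the last reference schedules the release. */
void chan_put(struct chan *c)
{
	bool last;

	chan_lock(c);
	assert(c->count > 0);
	last = (--c->count == 0);
	if (last)
		c->dying = true;
	chan_unlock(c);

	if (last)
		c->releases++;  /* placeholder for queueing release work */
}
```

A channel would be set up as `struct chan c = { ATOMIC_FLAG_INIT, 1, false, 0 };`
so that the creator holds the initial reference and the final chan_put()
replaces the old unconditional free.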

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 drivers/hv/channel_mgmt.c | 45 ++---
 drivers/hv/connection.c   |  7 +--
 drivers/hv/hyperv_vmbus.h |  4 
 include/linux/hyperv.h| 13 +
 4 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 36bacc7..eb9ce94 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -147,6 +147,8 @@ static struct vmbus_channel *alloc_channel(void)
return NULL;
 
channel->id = atomic_inc_return(&chan_num);
+   atomic_set(&channel->count, 1);
+
spin_lock_init(&channel->inbound_lock);
spin_lock_init(&channel->lock);
 
@@ -178,19 +180,47 @@ static void release_channel(struct work_struct *work)
 }
 
 /*
- * free_channel - Release the resources used by the vmbus channel object
+ * vmbus_put_channel - Decrease the channel usage counter and release the
+ * resources when this counter reaches zero.
  */
-static void free_channel(struct vmbus_channel *channel)
+void vmbus_put_channel(struct vmbus_channel *channel)
 {
+   unsigned long flags;
 
/*
 * We have to release the channel's workqueue/thread in the vmbus's
 * workqueue/thread context
 * ie we can't destroy ourselves.
 */
-   INIT_WORK(&channel->work, release_channel);
-   queue_work(vmbus_connection.work_queue, &channel->work);
+   spin_lock_irqsave(&channel->lock, flags);
+   if (atomic_dec_and_test(&channel->count)) {
+   channel->dying = true;
+   INIT_WORK(&channel->work, release_channel);
+   spin_unlock_irqrestore(&channel->lock, flags);
+   queue_work(vmbus_connection.work_queue, &channel->work);
+   } else
+   spin_unlock_irqrestore(&channel->lock, flags);
+}
+EXPORT_SYMBOL_GPL(vmbus_put_channel);
+
+/* vmbus_get_channel - Get additional reference to the channel */
+struct vmbus_channel *vmbus_get_channel(struct vmbus_channel *channel)
+{
+   unsigned long flags;
+   struct vmbus_channel *ret = NULL;
+
+   if (!channel)
+   return NULL;
+
+   spin_lock_irqsave(&channel->lock, flags);
+   if (!channel->dying) {
+   atomic_inc(&channel->count);
+   ret = channel;
+   }
+   spin_unlock_irqrestore(&channel->lock, flags);
+   return ret;
 }
+EXPORT_SYMBOL_GPL(vmbus_get_channel);
 
 static void percpu_channel_enq(void *arg)
 {
@@ -253,7 +283,7 @@ static void vmbus_process_rescind_offer(struct work_struct *work)
list_del(&channel->sc_list);
spin_unlock_irqrestore(&primary_channel->lock, flags);
}
-   free_channel(channel);
+   vmbus_put_channel(channel);
 }
 
 void vmbus_free_channels(void)
@@ -262,7 +292,7 @@ void vmbus_free_channels(void)
 
list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
vmbus_device_unregister(channel->device_obj);
-   free_channel(channel);
+   vmbus_put_channel(channel);
}
 }
 
@@ -391,7 +421,7 @@ done_init_rescind:
spin_unlock_irqrestore(&newchannel->lock, flags);
return;
 err_free_chan:
-   free_channel(newchannel);
+   vmbus_put_channel(newchannel);
 }
 
 enum {
@@ -549,6 +579,7 @@ static void vmbus_onoffer_rescind(struct vmbus_channel_message_header *hdr)
queue_work(channel->controlwq, &channel->work);
 
spin_unlock_irqrestore(&channel->lock, flags);
+   vmbus_put_channel(channel);
 }
 
 /*
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index c4acd1c..d1ce134 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -247,7 +247,8 @@ void vmbus_disconnect(void)
  * Map the given relid to the corresponding channel based on the
  * per-cpu list of channels that have been affinitized to this CPU.
  * This will be used in the channel callback path as we can do this
- * mapping in a lock-free fashion.
+ * mapping in a lock-free fashion. Takes additional reference

[PATCH 0/4] Drivers: hv: Further protection for the rescind path

2015-02-03 Thread Vitaly Kuznetsov
This series is a continuation of "Drivers: hv: vmbus: serialize Offer and
Rescind offer". I'm trying to address a number of theoretically possible issues
with rescind offer handling. All these complications come from the fact that
a rescind offer results in the vmbus channel being freed and we must ensure
nobody still uses it. Instead of introducing new locks I suggest we switch
channel usage to the get/put workflow.

The main part of the series is [PATCH 1/4] which introduces the workflow for
vmbus channels, all other patches fix different corner cases using this
workflow. I'm not sure all such cases are covered with this series (probably
not), but in case protection is required in some other places it should become
relatively easy to add one.

I did some sanity testing with CONFIG_DEBUG_LOCKDEP=y and nothing popped out,
however, additional testing would be much appreciated.

K.Y., Haiyang, I'm not sending this series to netdev@ and linux-scsi@ as it is
supposed to be applied as a whole, please resend these patches with your
sign-offs when (and if) we're done with reviews. Thanks!

Vitaly Kuznetsov (4):
  Drivers: hv: vmbus: implement get/put usage workflow for vmbus
channels
  Drivers: hv: vmbus: do not lose rescind offer on failure in
vmbus_process_offer()
  Drivers: hv: vmbus: protect vmbus_get_outgoing_channel() against
channel removal
  hyperv: netvsc: improve protection against rescind offer

 drivers/hv/channel_mgmt.c   | 75 +
 drivers/hv/connection.c |  7 +++--
 drivers/hv/hyperv_vmbus.h   |  4 +++
 drivers/net/hyperv/netvsc.c | 10 --
 drivers/scsi/storvsc_drv.c  |  2 ++
 include/linux/hyperv.h  | 13 
 6 files changed, 95 insertions(+), 16 deletions(-)

-- 
1.9.3



[PATCH 2/4] Drivers: hv: vmbus: do not lose rescind offer on failure in vmbus_process_offer()

2015-02-03 Thread Vitaly Kuznetsov
In case we hit a failure condition in vmbus_process_offer() and a rescind offer
was pending for the channel, we just do free_channel(), so
CHANNELMSG_RELID_RELEASED will never be sent to the host. We have to follow the
vmbus_process_rescind_offer() path anyway.

To support the change we need to protect list_del in
vmbus_process_rescind_offer() from hitting an uninitialized list.
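
Why initializing the list entry makes a later removal safe can be shown with
a minimal list modelled on the kernel's list_head. The functions below are
simplified userspace stand-ins (list_del_reinit() is a hypothetical name; the
kernel's list_del_init() behaves similarly), not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* An initialized node points at itself, i.e. it is an empty list. */
void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

void list_add(struct list_head *node, struct list_head *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/*
 * Unlink and re-initialize the node. On a self-pointing node this only
 * rewrites its own pointers, so a repeated removal is harmless. A
 * zero-filled node that was never initialized would dereference NULL.
 */
void list_del_reinit(struct list_head *node)
{
	assert(node->next != NULL && node->prev != NULL);
	node->prev->next = node->next;
	node->next->prev = node->prev;
	init_list_head(node);
}

bool list_empty(const struct list_head *h)
{
	return h->next == h;
}
```

This is the property the patch relies on: after the entry is initialized at
allocation time and re-initialized after the first removal, a second removal
on the rescind path does not touch stale pointers.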

Reported-by: Dexuan Cui de...@microsoft.com
Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 drivers/hv/channel_mgmt.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index eb9ce94..fdccd16 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -152,6 +152,7 @@ static struct vmbus_channel *alloc_channel(void)
spin_lock_init(&channel->inbound_lock);
spin_lock_init(&channel->lock);
 
+   INIT_LIST_HEAD(&channel->listentry);
INIT_LIST_HEAD(&channel->sc_list);
INIT_LIST_HEAD(&channel->percpu_list);
 
@@ -308,6 +309,7 @@ static void vmbus_process_offer(struct work_struct *work)
struct vmbus_channel *channel;
bool fnew = true;
bool enq = false;
+   bool failure = false;
int ret;
unsigned long flags;
 
@@ -408,19 +410,33 @@ static void vmbus_process_offer(struct work_struct *work)
spin_lock_irqsave(&vmbus_connection.channel_lock, flags);
list_del(&newchannel->listentry);
spin_unlock_irqrestore(&vmbus_connection.channel_lock, flags);
+   /*
+* Init listentry again as vmbus_process_rescind_offer can try
+* doing list_del again.
+*/
+   INIT_LIST_HEAD(&channel->listentry);
kfree(newchannel->device_obj);
+   newchannel->device_obj = NULL;
goto err_free_chan;
}
+   goto done_init_rescind;
+err_free_chan:
+   failure = true;
 done_init_rescind:
+   /*
+* Get additional reference as vmbus_put_channel() can be called
+* either directly or through vmbus_process_rescind_offer().
+*/
+   vmbus_get_channel(newchannel);
spin_lock_irqsave(&newchannel->lock, flags);
/* The next possible work is rescind handling */
INIT_WORK(&newchannel->work, vmbus_process_rescind_offer);
/* Check if rescind offer was already received */
if (newchannel->rescind)
queue_work(newchannel->controlwq, &newchannel->work);
+   else if (failure)
+   vmbus_put_channel(newchannel);
spin_unlock_irqrestore(&newchannel->lock, flags);
-   return;
-err_free_chan:
vmbus_put_channel(newchannel);
 }
 
-- 
1.9.3



Re: [PATCH 0/4] Drivers: hv: Further protection for the rescind path

2015-02-05 Thread Vitaly Kuznetsov
KY Srinivasan k...@microsoft.com writes:

 -Original Message-
 From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
 Sent: Tuesday, February 3, 2015 9:01 AM
 To: KY Srinivasan; de...@linuxdriverproject.org
 Cc: Haiyang Zhang; linux-kernel@vger.kernel.org; Dexuan Cui; Jason Wang
 Subject: [PATCH 0/4] Drivers: hv: Further protection for the rescind path
 
 This series is a continuation of the Drivers: hv: vmbus: serialize Offer and
 Rescind offer. I'm trying to address a number of theoretically possible 
 issues
 with rescind offer handling. All these complications come from the fact that 
 a
 rescind offer results in vmbus channel being freed and we must ensure
 nobody still uses it. Instead of introducing new locks I suggest we switch
 channels usage to the get/put workflow.
 
 The main part of the series is [PATCH 1/4] which introduces the workflow for
 vmbus channels, all other patches fix different corner cases using this
 workflow. I'm not sure all such cases are covered with this series (probably
 not), but in case protection is required in some other places it should 
 become
 relatively easy to add one.
 
 I did some sanity testing with CONFIG_DEBUG_LOCKDEP=y and nothing
 popped out, however, additional testing would be much appreciated.
 
 K.Y., Haiyang, I'm not sending this series to netdev@ and linux-scsi@ as it 
 is
 supposed to be applied as a whole, please resend these patches with your
 sign-offs when (and if) we're done with reviews. Thanks!

 Vitaly,

 Thanks for looking into this issue. While today, rescind offer results in the 
 freeing of the channel, I don't think
 that is required. By not freeing up the channel in the rescind path, we can 
 have a safe way to access the channel and
 that does not have to involve taking a reference on the channel every time 
 you access it - the get/put workflow in your
 patch set. As part of the network performance improvement work, I had 
 eliminated all locks in the receive path by setting
 up per-cpu data structures for mapping the relid to channel etc. This set of 
 patches introduces locking/atomic operations
 in performance critical code paths to deal with an event that is truly
 rare - the channel getting rescinded.

It is possible to eliminate all locks/atomic operations from the
performance-critical path in my patch series by following Dexuan's
suggestion: we'll get the channel in vmbus_open and put it in vmbus_close
(and on processing offer/rescind offer), so this won't affect performance.
I'm in the middle of testing this approach.
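
To make the open/close idea concrete, here is a minimal userspace sketch, assuming C11 atomics in place of the kernel primitives. struct chan, chan_get(), chan_put(), chan_open() and chan_close() are hypothetical names for illustration only, not the functions from the series:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct vmbus_channel; only the refcount matters. */
struct chan {
	atomic_int refs;	/* number of holders; freed when it drops to 0 */
	bool opened;
};

static int chan_frees;		/* counts frees so the sketch can be checked */

static struct chan *chan_alloc(void)
{
	struct chan *c = calloc(1, sizeof(*c));

	if (c)
		atomic_init(&c->refs, 1);	/* the offer path holds the first reference */
	return c;
}

static void chan_get(struct chan *c)
{
	atomic_fetch_add(&c->refs, 1);
}

static void chan_put(struct chan *c)
{
	/* fetch_sub returns the previous value: 1 means we were the last holder */
	if (atomic_fetch_sub(&c->refs, 1) == 1) {
		chan_frees++;
		free(c);
	}
}

/* Get once on open and put once on close: no atomics per packet. */
static void chan_open(struct chan *c)
{
	chan_get(c);
	c->opened = true;
}

static void chan_close(struct chan *c)
{
	c->opened = false;
	chan_put(c);
}
```

With this shape a rescind that arrives while the channel is open only drops the offer's reference; the memory goes away when the driver finally calls the close side.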


 All channel messages are handled in a single work context:

 vmbus_on_msg_dpc() -> vmbus_onmessage_work() -> Various channel messages 
 [offer, rescind etc.]

 So, the rescind message cannot be processed while we are processing the offer 
 message and since an offer
 cannot be rescinded before it is offered, offer and rescind are naturally 
 serialized (I think I have patchset in my queue
 from you that is trying to solve the concurrent execution of offer and 
 rescind and looking at the code I cannot see how
 this can occur).

 As part of handling the rescind message, we will just set the channel state 
 to indicate that the offer is rescinded (we can add
 the rescind state to the channel states already defined and this will be done 
 under the protection of the channel lock).
 The cleanup of the channel and sending of the RELID release message  will 
 only be done in the context of the driver as part of 
 driver remove function. I think this should be doable in a way that does not 
 penalize the normal path. If it is ok with you, I will
 try to put together a patch along the lines I have described here.


Yes, if we consider a rescind event to be very rare we can avoid
freeing channels, but if (under some conditions) it happens frequently
we'll have a significant memory leak.

We can also free them with something like schedule_delayed_work() with
e.g. a 10 second delay after removing them from all lists, so the
probability of hitting a crash will be very low; I seriously doubt we
will ever hit it.

Please let me know which you think is better. In case we follow the 'never
free' or 'delayed free' approach I'll extract PATCH 2/4 from my series and
send it separately to address the 'losing rescind offer' issue pointed out
by Dexuan.

Thanks,

 Regards,

 K. Y

 
 Vitaly Kuznetsov (4):
   Drivers: hv: vmbus: implement get/put usage workflow for vmbus
 channels
   Drivers: hv: vmbus: do not lose rescind offer on failure in
 vmbus_process_offer()
   Drivers: hv: vmbus: protect vmbus_get_outgoing_channel() against
 channel removal
   hyperv: netvsc: improve protection against rescind offer
 
  drivers/hv/channel_mgmt.c   | 75 +
  drivers/hv/connection.c     |  7 +++--
  drivers/hv/hyperv_vmbus.h   |  4 +++
  drivers/net/hyperv/netvsc.c | 10 --
  drivers/scsi/storvsc_drv.c  |  2 ++
  include/linux/hyperv.h      | 13 
  6 files changed, 95 insertions(+), 16 deletions(-)

Re: [PATCH v2 1/1] drivers:hv:vmbus drivers:hv:vmbus Allow for more than one MMIO range for children

2015-02-06 Thread Vitaly Kuznetsov
Jake Oshins ja...@microsoft.com writes:

 This set of changes finds the _CRS object in the ACPI namespace
 that contains memory address space descriptors, intended to convey
 to VMBus which ranges of memory-mapped I/O space are available for
 child devices, and then builds a resource list that contains all
 those ranges.  Without this change, only some of the memory-mapped
 I/O space will be available for child devices, and only in some
 virtual BIOS configurations (Generation 2 VMs).

 This patch has been updated with feedback from Vitaly Kuznetsov.
 Cleanup is now driven by the acpi remove callback function.

Sorry for being late with this message, but I'm seeing issues with this
commit. I added some debug output to figure out what's going on and here
is what I see:

With Gen1 VM we end up doing request_resource for two ranges:
f800 - fffb
fe000 - fffef

request_resource() fails (as we already have PCI device at f800 I
suppose?) but we don't check the return value. release_resource on
module unload crashes the kernel:
[   78.314344] BUG: unable to handle kernel NULL pointer dereference at
0030
[   78.315021] IP: [8107fac5] release_resource+0x25/0x90
[   78.315021] PGD 78c67067 PUD 78c5a067 PMD 0 
[   78.315021] Oops:  [#1] SMP DEBUG_PAGEALLOC
[   78.315021] Modules linked in: hv_vmbus(-)
...
If I'm not mistaken, before the change we didn't do any
request_resource() for Gen1 VMs at all.
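
The failure mode above (claim fails, return value ignored, release crashes later) can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: struct range, claim_range() and release_range() are hypothetical names that loosely imitate request_resource()/release_resource(); they are not the kernel API or the eventual fix:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a claimed address range. */
struct range {
	unsigned long start, end;	/* inclusive bounds */
	bool claimed;			/* set only when the claim succeeded */
};

#define MAX_CLAIMED 8
static struct range *claimed[MAX_CLAIMED];
static size_t nr_claimed;

/* Models request_resource(): fails with -EBUSY on overlap with a claimed range. */
static int claim_range(struct range *r)
{
	for (size_t i = 0; i < nr_claimed; i++)
		if (r->start <= claimed[i]->end && claimed[i]->start <= r->end)
			return -EBUSY;
	if (nr_claimed == MAX_CLAIMED)
		return -ENOMEM;
	claimed[nr_claimed++] = r;
	r->claimed = true;
	return 0;
}

/* Models release_resource(), but is a no-op for ranges that were never claimed. */
static void release_range(struct range *r)
{
	if (!r->claimed)
		return;
	for (size_t i = 0; i < nr_claimed; i++) {
		if (claimed[i] == r) {
			claimed[i] = claimed[--nr_claimed];
			break;
		}
	}
	r->claimed = false;
}
```

The point of the sketch is only that the unload path has to know whether the claim succeeded; either checking the return value at claim time or recording success per range would avoid releasing something that was never inserted.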

With Gen2 VM we do request_resource for the fe000 - f range
only, which means this commit doesn't change anything there.

Can you please take a look? I'd like to help but I don't completely
understand the essence of the change wrt Gen1 VMs with PCI devices.

Thanks,


 Signed-off-by: Jake Oshins ja...@microsoft.com
 ---
  drivers/hv/vmbus_drv.c  |   99 
 +--
  drivers/video/fbdev/hyperv_fb.c |2 +-
  include/linux/hyperv.h  |2 +-
  3 files changed, 86 insertions(+), 17 deletions(-)

 diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
 index 4d6b269..ed618ac 100644
 --- a/drivers/hv/vmbus_drv.c
 +++ b/drivers/hv/vmbus_drv.c
 @@ -43,10 +43,7 @@ static struct tasklet_struct msg_dpc;
  static struct completion probe_event;
  static int irq;

 -struct resource hyperv_mmio = {
 - .name  = "hyperv mmio",
 - .flags = IORESOURCE_MEM,
 -};
 +struct resource *hyperv_mmio;
  EXPORT_SYMBOL_GPL(hyperv_mmio);

  static int vmbus_exists(void)
 @@ -849,30 +846,98 @@ void vmbus_device_unregister(struct hv_device 
 *device_obj)

  /*
 - * VMBUS is an acpi enumerated device. Get the the information we
 - * need from DSDT.
 + * VMBUS is an acpi enumerated device. Get the
 + * information we need from DSDT.
   */

  static acpi_status vmbus_walk_resources(struct acpi_resource *res, void *ctx)
  {
 + resource_size_t start = 0;
 + resource_size_t end = 0;
 + struct resource *new_res;
 + struct resource **old_res = &hyperv_mmio;
 +
   switch (res->type) {
   case ACPI_RESOURCE_TYPE_IRQ:
   irq = res->data.irq.interrupts[0];
 + return AE_OK;
 +
 + /*
 +  * Address descriptors are for bus windows. Ignore
 +  * memory descriptors, which are for registers on
 +  * devices.
 +  */
 + case ACPI_RESOURCE_TYPE_ADDRESS32:
 + start = res->data.address32.minimum;
 + end = res->data.address32.maximum;
   break;

   case ACPI_RESOURCE_TYPE_ADDRESS64:
 - hyperv_mmio.start = res->data.address64.minimum;
 - hyperv_mmio.end = res->data.address64.maximum;
 + start = res->data.address64.minimum;
 + end = res->data.address64.maximum;
   break;
 +
 + default:
 + /* Unused resource type */
 + return AE_OK;
   }

 + /*
 +  * Ignore ranges that are below 1MB, as they're not
 +  * necessary or useful here.
 +  */
 + if (end < 0x100000)
 + return AE_OK;
 +
 + new_res = kzalloc(sizeof(*new_res), GFP_ATOMIC);
 + if (!new_res)
 + return AE_NO_MEMORY;
 +
 + new_res->name = "hyperv mmio";
 + new_res->flags = IORESOURCE_MEM;
 + new_res->start = start;
 + new_res->end = end;
 +
 + do {
 + if (!*old_res) {
 + *old_res = new_res;
 + break;
 + }
 +
 + if ((*old_res)->start > new_res->end) {
 + new_res->sibling = *old_res;
 + *old_res = new_res;
 + break;
 + }
 +
 + old_res = &(*old_res)->sibling;
 +
 + } while (1);
 +
   return AE_OK;
  }

 +static int vmbus_acpi_remove(struct acpi_device *device)
 +{
 + struct resource *cur_res;
 + struct resource *next_res;
 +
 + if (hyperv_mmio) {
 + release_resource(hyperv_mmio);
 + for (cur_res = hyperv_mmio; cur_res; cur_res = next_res) {
 + next_res = cur_res

Re: [PATCH v2 1/3] Drivers: hv: check vmbus_device_create() return value in vmbus_process_offer()

2015-01-20 Thread Vitaly Kuznetsov
Dan Carpenter dan.carpen...@oracle.com writes:

 On Mon, Jan 19, 2015 at 05:56:11PM +0100, Vitaly Kuznetsov wrote:
 vmbus_device_create() result is not being checked in vmbus_process_offer() 
 and
 it can fail if kzalloc() fails. Add the check and do minor cleanup to avoid
 additional duplication of free_channel(); return; block.
 
 Reported-by: Jason Wang jasow...@redhat.com
 Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com

 out is always a bad name for a label.  It's too vague.  It implies
 that the code uses One Err style error handling which is bug prone and
 I've ranted about that in the past so I won't here.  This kind of coding
 is buggier than direct returns.  But recently I've been looking at bugs
 where we return zero where the code should return a negative error code
 and, wow, do I hate out labels!

   if (function_whatever(xxx))
   goto out;

 [ thousands of lines removed. ]

 out:
   return ret;

 Oh crap...  Did the coder mean to return success or not???

 If you use a direct return then the code looks like:

   if (function_whatever(xxx))
   return 0;

 In that case, you can immediately see that the coder typed 0
 deliberately.  Direct returns are best.  I guess that's not directly
 related to this code.  But I didn't know that until I read to the bottom
 of the patch and I already had this rant prepared in my head ready to
 go...

Thank you for your rant, Dan! It contains an explanation _why_ and so is
useful. However ... :-)

1) vmbus_process_offer() returns void, so there is no return code we
could forget to set.
2) This patch is a preparation for PATCH 3/3, where the label is
used to do some useful (non-trivial) work. The direct-returns
approach would require us to duplicate the code or move it to a function
and call it from every return site. I consider adding an 'out' label
the lesser evil.

Anyway, I can rename it to something less provocative in PATCH 3/3,
e.g. init_rescind.


 error is a crap label name because it doesn't tell you what the code
 does.  A better name is err_free_chan or something which talks about
 freeing the channel.

And here I have to completely agree with you, I'll rename it in v3.
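
For illustration, a small userspace sketch of the convention being agreed on here: a direct return where nothing needs undoing, and a label named after what it does for the shared cleanup. struct chan, free_chan() and process_offer() are hypothetical names, and unlike vmbus_process_offer() this version returns a pointer purely so the outcome can be checked:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a channel that owns a device object. */
struct chan {
	void *device_obj;
};

static int chan_frees;	/* counts error-path frees so the sketch can be checked */

static void free_chan(struct chan *c)
{
	free(c);
	chan_frees++;
}

static struct chan *process_offer(bool create_fails, bool register_fails)
{
	struct chan *c = calloc(1, sizeof(*c));

	if (!c)
		return NULL;	/* nothing to undo yet: a direct return is clearest */

	c->device_obj = create_fails ? NULL : malloc(16);
	if (!c->device_obj)
		goto err_free_chan;

	if (register_fails) {
		free(c->device_obj);
		c->device_obj = NULL;
		goto err_free_chan;
	}

	return c;

err_free_chan:	/* the label says what happens, not just that we leave */
	free_chan(c);
	return NULL;
}
```

Both failure sites share one cleanup block, and a reader who sees `goto err_free_chan` knows what will be released without scrolling to the bottom of the function.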


 regards,
 dan carpenter

-- 
  Vitaly


[PATCH v3 1/3] Drivers: hv: check vmbus_device_create() return value in vmbus_process_offer()

2015-01-20 Thread Vitaly Kuznetsov
vmbus_device_create() result is not being checked in vmbus_process_offer() and
it can fail if kzalloc() fails. Add the check and do minor cleanup to avoid
additional duplication of free_channel(); return; block.

Reported-by: Jason Wang jasow...@redhat.com
Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 drivers/hv/channel_mgmt.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 2c59f03..01f2c2b 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -341,11 +341,10 @@ static void vmbus_process_offer(struct work_struct *work)
if (channel->sc_creation_callback != NULL)
channel->sc_creation_callback(newchannel);
 
-   return;
+   goto out;
}
 
-   free_channel(newchannel);
-   return;
+   goto err_free_chan;
}
 
/*
@@ -364,6 +363,8 @@ static void vmbus_process_offer(struct work_struct *work)
&newchannel->offermsg.offer.if_type,
&newchannel->offermsg.offer.if_instance,
newchannel);
+   if (!newchannel->device_obj)
+   goto err_free_chan;
 
/*
 * Add the new device to the bus. This will kick off device-driver
@@ -379,9 +380,12 @@ static void vmbus_process_offer(struct work_struct *work)
list_del(&newchannel->listentry);
spin_unlock_irqrestore(&vmbus_connection.channel_lock, flags);
kfree(newchannel->device_obj);
-
-   free_channel(newchannel);
+   goto err_free_chan;
}
+out:
+   return;
+err_free_chan:
+   free_channel(newchannel);
 }
 
 enum {
-- 
1.9.3



[PATCH v3 2/3] Drivers: hv: rename sc_lock to the more generic lock

2015-01-20 Thread Vitaly Kuznetsov
The sc_lock spinlock in struct vmbus_channel is used for more than
protecting the sc_list field; e.g. vmbus_open() uses it to implement
test-and-set access to the state field. Rename it to the more generic 'lock'
and add a description.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 drivers/hv/channel.c  |  6 +++---
 drivers/hv/channel_mgmt.c | 10 +-
 include/linux/hyperv.h|  7 ++-
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 433f72a..8608ed1 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -73,14 +73,14 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size,
unsigned long flags;
int ret, t, err = 0;
 
-   spin_lock_irqsave(&newchannel->sc_lock, flags);
+   spin_lock_irqsave(&newchannel->lock, flags);
if (newchannel->state == CHANNEL_OPEN_STATE) {
newchannel->state = CHANNEL_OPENING_STATE;
} else {
-   spin_unlock_irqrestore(&newchannel->sc_lock, flags);
+   spin_unlock_irqrestore(&newchannel->lock, flags);
return -EINVAL;
}
-   spin_unlock_irqrestore(&newchannel->sc_lock, flags);
+   spin_unlock_irqrestore(&newchannel->lock, flags);
 
newchannel->onchannel_callback = onchannelcallback;
newchannel->channel_callback_context = context;
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 01f2c2b..c6fdd74 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -146,7 +146,7 @@ static struct vmbus_channel *alloc_channel(void)
return NULL;
 
spin_lock_init(&channel->inbound_lock);
-   spin_lock_init(&channel->sc_lock);
+   spin_lock_init(&channel->lock);
 
INIT_LIST_HEAD(&channel->sc_list);
INIT_LIST_HEAD(&channel->percpu_list);
@@ -246,9 +246,9 @@ static void vmbus_process_rescind_offer(struct work_struct *work)
spin_unlock_irqrestore(&vmbus_connection.channel_lock, flags);
} else {
primary_channel = channel->primary_channel;
-   spin_lock_irqsave(&primary_channel->sc_lock, flags);
+   spin_lock_irqsave(&primary_channel->lock, flags);
list_del(&channel->sc_list);
-   spin_unlock_irqrestore(&primary_channel->sc_lock, flags);
+   spin_unlock_irqrestore(&primary_channel->lock, flags);
}
}
free_channel(channel);
 }
@@ -323,9 +323,9 @@ static void vmbus_process_offer(struct work_struct *work)
 * Process the sub-channel.
 */
newchannel->primary_channel = channel;
-   spin_lock_irqsave(&channel->sc_lock, flags);
+   spin_lock_irqsave(&channel->lock, flags);
list_add_tail(&newchannel->sc_list, &channel->sc_list);
-   spin_unlock_irqrestore(&channel->sc_lock, flags);
+   spin_unlock_irqrestore(&channel->lock, flags);
 
if (newchannel->target_cpu != get_cpu()) {
put_cpu();
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 476c685..02dd978 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -722,7 +722,12 @@ struct vmbus_channel {
 */
void (*sc_creation_callback)(struct vmbus_channel *new_sc);
 
-   spinlock_t sc_lock;
+   /*
+* The spinlock to protect the structure. It is being used to protect
+* test-and-set access to various attributes of the structure as well
+* as all sc_list operations.
+*/
+   spinlock_t lock;
/*
 * All Sub-channels of a primary channel are linked here.
 */
-- 
1.9.3



[PATCH v3 0/3] Drivers: hv: vmbus: protect Offer/Rescind offer processing

2015-01-20 Thread Vitaly Kuznetsov
This patch series is a renamed successor of [PATCH] Drivers: hv: vmbus:
serialize Offer and Rescind offer processing.

Changes from v2:
- Rename labels in vmbus_process_offer() (out -> done_init_rescind in
  PATCH 3/3, error -> err_free_chan in PATCH 1/3) [Dan Carpenter]
- Invert condition, update comment, and remove out label in
  vmbus_onoffer_rescind()

Changes from v1:
- Separate vmbus_device_create() return value check [K. Y. Srinivasan]
- Do not lose a rescind offer received during offer processing. Use the
  renamed (in [PATCH v2 2/3]) spinlock to protect the simultaneous
  test-and-set workflow for the rescind and work fields. [K. Y. Srinivasan]

Vitaly Kuznetsov (3):
  Drivers: hv: check vmbus_device_create() return value in
vmbus_process_offer()
  Drivers: hv: rename sc_lock to the more generic lock
  Drivers: hv: vmbus: serialize Offer and Rescind offer

 drivers/hv/channel.c  |  6 +++---
 drivers/hv/channel_mgmt.c | 50 ---
 include/linux/hyperv.h|  7 ++-
 3 files changed, 43 insertions(+), 20 deletions(-)

-- 
1.9.3


