Re: Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)
Jiri Slaby jsl...@suse.cz writes:

> On 06/04/2014 07:48 AM, Greg KH wrote:
>> On Wed, May 14, 2014 at 03:11:22PM -0400, Konrad Rzeszutek Wilk wrote:
>>> Hey Greg
>>>
>>> This email is in regards to backporting two patches to stable that
>>> fall under the 'performance' rule:
>>>
>>> bfe11d6de1c416cea4f3f0f35f864162063ce3fa
>>> fbe363c476afe8ec992d3baf682670a4bd1b6ce6
>>
>> Now queued up, thanks.
>
> AFAIU, they introduce a performance regression.
>
> Vitaly?

I'm aware of a performance regression in a 'very special' case when ramdisks or files on tmpfs are being used as storage. I posted my results a while ago: https://lkml.org/lkml/2014/5/22/164

I'm not sure if that 'special' case requires investigation and/or should prevent us from doing the stable backport, but it would be nice if someone tried to reproduce it at least.

I'm going to run a bunch of tests with FusionIO drives and sequential reads to replicate the same test Felipe did. I'll report as soon as I have data (beginning of next week, hopefully).

--
Vitaly

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [Xen-devel] Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)
Vitaly Kuznetsov vkuzn...@redhat.com writes:

> I'm going to make a bunch of tests with FusionIO drives and sequential
> read to replicate same test Felipe did, I'll report as soon as I have
> data (beginning of next week hopefully).

Turns out the regression I'm observing with these patches is not restricted to tmpfs/ramdisk usage. I was doing tests with a Fusion-io ioDrive Duo 320GB (Dual Adapter) on an HP ProLiant DL380 G6 (2xE5540, 8G RAM). Hyperthreading is disabled, Dom0 is pinned to CPU0 (cores 0,1,2,3). I run up to 8 guests with 1 vCPU each; they are pinned to CPU1 (cores 4,5,6,7,4,5,6,7). I also tried different pinning (Dom0 to 0,1,4,5, DomUs to 2,3,6,7,2,3,6,7 to balance NUMA); that doesn't make any difference to the results. I was testing on top of Xen-4.3.2.

I was testing two storage configurations:
1) Plain 10G partitions from one Fusion drive (/dev/fioa) are attached to guests
2) An LVM group is created on top of both drives (/dev/fioa, /dev/fiob), and 10G logical volumes are created with striping (lvcreate -i2 ...)

The test is done by simultaneous fio runs in the guests (rw=read, direct=1) for 10 seconds. Each test was performed 3 times and the average was taken.

Kernels I compare are:
1) v3.15-rc5-157-g60b5f90 unmodified
2) v3.15-rc5-157-g60b5f90 with 427bfe07e6744c058ce6fc4aa187cda96b635539, bfe11d6de1c416cea4f3f0f35f864162063ce3fa, and fbe363c476afe8ec992d3baf682670a4bd1b6ce6 reverted.

First test was done with Dom0 with persistent grant support (Fedora's 3.14.4-200.fc20.x86_64):
1) Partitions: http://hadoop.ru/pubfiles/bug1096909/fusion/315_pgrants_partitions.png
(same markers mean same bs, we get 860 MB/s here, patches make no difference, result matches expectation)
2) LVM Stripe: http://hadoop.ru/pubfiles/bug1096909/fusion/315_pgrants_stripe.png
(1715 MB/s, patches make no difference, result matches expectation)

Second test was performed with Dom0 without persistent grants support (Fedora's 3.7.9-205.fc18.x86_64):
1) Partitions: http://hadoop.ru/pubfiles/bug1096909/fusion/315_nopgrants_partitions.png
(860 MB/s again, patches slightly worsen overall throughput with 1-3 clients)
2) LVM Stripe: http://hadoop.ru/pubfiles/bug1096909/fusion/315_nopgrants_stripe.png
(Here we see the same regression I observed with ramdisks and tmpfs files, unmodified kernel: 1550 MB/s, with patches reverted: 1715 MB/s).

The only major difference with Felipe's test is that he was using blktap3 with XenServer and I'm using standard blktap2.

--
Vitaly
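The gap in the striped LVM case without persistent grants (1550 MB/s unmodified versus 1715 MB/s with the patches reverted) works out to (1715 - 1550) / 1715, or roughly 9.6%. A minimal sketch of that arithmetic in C follows; the function names `mean` and `relative_drop_percent` are illustrative helpers, not part of any tool used in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples, e.g. the 3 repetitions of one fio run. */
double mean(const double *samples, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += samples[i];
	return sum / (double)n;
}

/* Throughput lost relative to the baseline, in percent. */
double relative_drop_percent(double baseline, double measured)
{
	assert(baseline > 0.0);
	return (baseline - measured) / baseline * 100.0;
}
```

Calling `relative_drop_percent(1715.0, 1550.0)` gives the drop of the unmodified kernel relative to the reverted one.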
Re: [Xen-devel] Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)
Roger Pau Monné roger@citrix.com writes:

> On 10/06/14 15:19, Vitaly Kuznetsov wrote:
>> The only major difference with Felipe's test is that he was using
>> blktap3 with XenServer and I'm using standard blktap2.
>
> Hello,
>
> I don't think you are using blktap2, I guess you are using blkback.

Right, sorry for the confusion.

> Also, running the test only for 10s and 3 repetitions seems too low, I
> would probably try to run the tests for a longer time and do more
> repetitions, and include the standard deviation also.
>
> Could you try to revert the patches independently to see if it's a
> specific commit that introduces the regression?

I did additional test runs. Now I'm comparing 3 kernels:
1) Unmodified v3.15-rc5-157-g60b5f90 - green color on chart
2) v3.15-rc5-157-g60b5f90 with bfe11d6de1c416cea4f3f0f35f864162063ce3fa and 427bfe07e6744c058ce6fc4aa187cda96b635539 reverted (so only fbe363c476afe8ec992d3baf682670a4bd1b6ce6 "xen-blkfront: revoke foreign access for grants not mapped by the backend" left) - blue color on chart
3) v3.15-rc5-157-g60b5f90 with all (bfe11d6de1c416cea4f3f0f35f864162063ce3fa, 427bfe07e6744c058ce6fc4aa187cda96b635539, fbe363c476afe8ec992d3baf682670a4bd1b6ce6) patches reverted - red color on chart.

I test on top of striped LVM on 2 FusionIO drives, and I do 3 repetitions of 30 seconds each.

The result is here: http://hadoop.ru/pubfiles/bug1096909/fusion/315_nopgrants_20140612.png

It is consistent with what I've measured with ramdrives and tmpfs files:
1) fbe363c476afe8ec992d3baf682670a4bd1b6ce6 "xen-blkfront: revoke foreign access for grants not mapped by the backend" brings us the regression. The bigger the block size, the bigger the difference, but the regression is observed with all block sizes 8k.
2) bfe11d6de1c416cea4f3f0f35f864162063ce3fa xen-blkfront
Re: [Xen-devel] Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)
Felipe Franciosi felipe.franci...@citrix.com writes:

> Hi Vitaly,
>
> Are you able to test a 3.10 guest with and without the backport that
> Roger sent? This patch is attached to an e-mail Roger sent on 22 May
> 2014 13:54.

Sure,

Now I'm comparing the d642daf637d02dacf216d7fd9da7532a4681cfd3 and 46c0326164c98e556c35c3eb240273595d43425d commits from git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git (with and without the two commits in question).

The test is exactly the same as described before.

The result is here: http://hadoop.ru/pubfiles/bug1096909/fusion/310_nopgrants_stripe.png

As you can see, 46c03261 (without patches) wins everywhere.

> Because your results are contradicting with what these patches are
> meant to do, I would like to make sure that this isn't related to
> something else that happened after 3.10.

I still think the Dom0 kernel and blktap/blktap3 are what make the difference between our test environments.

> You could also test Ubuntu Saucy guests with and without the patched
> kernels provided by Joseph Salisbury on launchpad:
> https://bugs.launchpad.net/bugs/1319003
>
> Thanks,
> Felipe
>
> -----Original Message-----
> From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
> Sent: 12 June 2014 13:01
> To: Roger Pau Monne
> Cc: xen-de...@lists.xenproject.org; ax...@kernel.dk; Felipe Franciosi; Greg KH; linux-kernel@vger.kernel.org; sta...@vger.kernel.org; jerry.snitsel...@oracle.com; Jiri Slaby; Ronen Hod; Andrew Jones
> Subject: Re: [Xen-devel] Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)
[PATCH] xenpv: don't BUG when failing to setup NMI callback
some old Xen hypervisors (prior to 3.2) forbid DomUs to register NMI callbacks. E.g. we have the following code in xen-3.1:

    if ( (d->domain_id != 0) || (v->vcpu_id != 0) )
        return -EINVAL;

Commit 6efa20e49b9cb1db1ab66870cc37323474a75a13 introduced kernel crash in case PV guest fails to register NMI callback. All x86_64 PV guests will fail to boot on top of such hypervisors (RHEL5 example):

(XEN) traps.c:405:d7 Unhandled invalid opcode fault/trap [#6] in domain 7 on VCPU 0 [ec=]
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 7 (vcpu#0) crashed on cpu#3:
(XEN) [ Xen-3.1.2-389.el5 x86_64 debug=n Not tainted ]
(XEN) CPU:3
(XEN) RIP:e033:[81004d96]
(XEN) RFLAGS: 0282 CONTEXT: guest
(XEN) rax: ffea rbx: rcx: 0002
(XEN) rdx: 0001 rsi: 81b0fe28 rdi:
(XEN) rbp: 81b0fe40 rsp: 81b0fde8 r8:
(XEN) r9: 81b0fdd0 r10: 7ff0 r11:
(XEN) r12: 81d65900 r13: r14:
(XEN) r15: cr0: 80050033 cr4: 26b0
(XEN) cr3: 00013a263000 cr2:
(XEN) ds: es: fs: gs: ss: e02b cs: e033
...

However it is possible to proceed without NMI callback registered. Change BUG() with warning in case of -EINVAL.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/xen/setup.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 821a11a..5b8b180 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -593,8 +593,17 @@ void xen_enable_syscall(void)
 void xen_enable_nmi(void)
 {
 #ifdef CONFIG_X86_64
-	if (register_callback(CALLBACKTYPE_nmi, (char *)nmi))
+	int ret;
+
+	ret = register_callback(CALLBACKTYPE_nmi, (char *)nmi);
+	if (ret == -EINVAL) {
+		/* Hypervisor probably forbids us to register NMI callback,
+		   that is expected when running on top of Xen-3.1 and older */
+		pr_warn("xen: failed to register NMI callback\n");
+	} else if (ret != 0) {
+		/* Other hypervisor failure */
 		BUG();
+	}
 #endif
 }
 void __init xen_pvmmu_arch_setup(void)
--
1.9.3
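The three-way decision this patch makes on the return value of register_callback() can be tried out in user space. The following is only a sketch of that policy under stated assumptions: the enum and the function `classify_register_result` are hypothetical names introduced for illustration, and no hypercall is made.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical labels for the three outcomes handled by the patch. */
enum nmi_setup_action {
	NMI_SETUP_OK,    /* callback registered, nothing to do */
	NMI_SETUP_WARN,  /* -EINVAL: old hypervisor refused, warn and go on */
	NMI_SETUP_FATAL  /* any other error: the patch still calls BUG() */
};

/* Map a register_callback()-style result (0 or negative errno) to an action. */
enum nmi_setup_action classify_register_result(int ret)
{
	assert(ret <= 0);
	if (ret == 0)
		return NMI_SETUP_OK;
	if (ret == -EINVAL)
		return NMI_SETUP_WARN;
	return NMI_SETUP_FATAL;
}
```

Only -EINVAL is downgraded to a warning here; every other failure keeps the original fatal behaviour.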
Re: [PATCH] xenpv: don't BUG when failing to setup NMI callback
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

> On Fri, Jun 13, 2014 at 01:26:28PM +0200, Vitaly Kuznetsov wrote:
>> However it is possible to proceed without NMI callback registered.
>> Change BUG() with warning in case of -EINVAL.
>>
>> Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
>
> Oh, we had a similar patch - somebody reported it earlier - and we
> just checked the version of Xen:
>
> http://lists.xenproject.org/archives/html/xen-devel/2014-05/msg01474.html
>
> But I can't remember why I didn't post it. However I do like your
> path of checking the 'ret'.
>
> Vitaly, could you expand your patch to also do a check in
> cvt_gate_to_trap so that it won't enable the NMI handler and then
> let's pick your patch?

Sure, I also suggest we lower the required Xen version to 3.2 instead of 4.0, as 3.2 has the following:

commit a2308fa704a40f23916a176d9e06bbc0e3469caf
Author: Keir Fraser k...@xensource.com
Date:   Mon Oct 22 13:04:32 2007 +0100

    x86: Allow NMI callback CS to be specified via set_trap_table()
    hypercall.
    Based on a patch by Jan Beulich.
    Signed-off-by: Keir Fraser k...@xensource.com

I'll update my patch and re-send it.

--
Vitaly
Re: [Xen-devel] Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

> Hey Greg
>
> This email is in regards to backporting two patches to stable that
> fall under the 'performance' rule:
>
> bfe11d6de1c416cea4f3f0f35f864162063ce3fa
> fbe363c476afe8ec992d3baf682670a4bd1b6ce6
>
> I've copied Jerry - the maintainer of the Oracle's kernel. I don't have
> the emails of the other distros maintainers but the bugs associated
> with it are:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1096909 (RHEL7)

I was doing tests with the RHEL7 kernel and these patches, and unfortunately I see huge performance degradation in some workloads. I'm in the middle of my testing now but here are some intermediate results.

Test environment:
Fedora-20, xen-4.3.2-2.fc20.x86_64, 3.11.10-301.fc20.x86_64

I do testing with 1-9 RHEL7 PVHVM guests with:
1) Unmodified RHEL7 kernel
2) Only fbe363c476afe8ec992d3baf682670a4bd1b6ce6 applied (revoke foreign access)
3) Both fbe363c476afe8ec992d3baf682670a4bd1b6ce6 and bfe11d6de1c416cea4f3f0f35f864162063ce3fa (actually 427bfe07e6744c058ce6fc4aa187cda96b635539 is required as well to make the build happy, I suggest we backport that to stable as well)

Storage devices are:
1) ramdisks (/dev/ram*) (persistent grants and indirect descriptors disabled)
2) /tmp/img*.img on tmpfs (persistent grants and indirect descriptors disabled)

Test itself: direct random read with bs=2048k (using fio). (Actually 'dd', 'read/write access', ... show the same results)

fio test file:
[fio_read]
ioengine=libaio
blocksize=2048k
rw=randread
filename=/dev/xvdc
randrepeat=1
fallocate=none
direct=1
invalidate=0
runtime=20
time_based

I run fio simultaneously in all guests and sum up the results.

So, results are:
1) ramdisks: http://hadoop.ru/pubfiles/b1096909_3.11.10_ramdisk.png
2) tmpfiles: http://hadoop.ru/pubfiles/b1096909_3.11.10_tmpfile.png

In a few words: the patch series has (almost) no effect when persistent grants are enabled (that was expected) and gives me a performance regression when persistent grants are disabled (that wasn't expected).
My thoughts are: it seems fbe363c476afe8ec992d3baf682670a4bd1b6ce6 brings a performance regression in some cases (at least when persistent grants are disabled). My guess atm is that gnttab_end_foreign_access() (gnttab_end_foreign_access_ref_v1() is being used here) is guilty; for some reason it is looping for some time. bfe11d6de1c416cea4f3f0f35f864162063ce3fa really brings a performance improvement over fbe363c476afe8ec992d3baf682670a4bd1b6ce6, but the whole series still brings a regression.

I would be glad to hear what could be wrong with my testing in case I'm the only one who sees such behavior. Any other pointers are more than welcome, and please feel free to ask for any additional info/testing/whatever from me.

> https://bugs.launchpad.net/ubuntu/+bug/1319003 (Ubuntu 13.10)
>
> The following distros are affected:
>
> (x) Ubuntu 13.04 and derivatives (3.8)
> (v) Ubuntu 13.10 and derivatives (3.11), supported until 2014-07
> (x) Fedora 17 (3.8 and 3.9 in updates)
> (x) Fedora 18 (3.8, 3.9, 3.10, 3.11 in updates)
> (v) Fedora 19 (3.9; 3.10, 3.11, 3.12 in updates; fixed with latest update to 3.13), supported until TBA
> (v) Fedora 20 (3.11; 3.12 in updates; fixed with latest update to 3.13), supported until TBA
> (v) RHEL 7 and derivatives (3.10), expected to be supported until about 2025
> (v) openSUSE 13.1 (3.11), expected to be supported until at least 2016-08
> (v) SLES 12 (3.12), expected to be supported until about 2024
> (v) Mageia 3 (3.8), supported until 2014-11-19
> (v) Mageia 4 (3.12), supported until 2015-08-01
> (v) Oracle Enterprise Linux with Unbreakable Enterprise Kernel Release 3 (3.8), supported until TBA
>
> Here is the analysis of the problem and what was put in the RHEL7 bug.
> The Oracle bug does not exist (as I just backport them in the kernel and
> send a GIT PULL to Jerry) - but if you would like I can certainly furnish
> you with one (it would be identical to what is mentioned below).
> If you are OK with the backport, I am volunteering Roger and Felipe to
> assist in jamming^H^H^H^Hbackporting the patches into earlier kernels.
>
> Summary:
> Storage performance regression when Xen backend lacks persistent-grants
> support
>
> Description of problem:
> When used as a Xen guest, RHEL 7 will be slower than older releases in
> terms of storage performance. This is due to the persistent-grants
> feature introduced in xen-blkfront on the Linux Kernel 3.8 series. From
> 3.8 to 3.12 (inclusive), xen-blkfront will add an extra set of memcpy()
> operations regardless of persistent-grants support in the backend (i.e.
> xen-blkback, qemu, tapdisk). This has been identified and fixed in the
> 3.13 kernel series, but was not backported to previous LTS kernels due
> to the nature of the bug (performance only).
>
> While persistent grants reduce the stress on the Xen grant table and
> allow for much better aggregate throughput (at the cost of an extra set
> of memcpy operations), adding the copy overhead when the feature is
> unsupported on the backend
Re: [Xen-devel] Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)
Vitaly Kuznetsov vkuzn...@redhat.com writes:

> 1) ramdisks (/dev/ram*) (persistent grants and indirect descriptors
> disabled)

sorry, there was a typo. persistent grants and indirect descriptors are enabled with ramdisks, otherwise such testing won't make any sense.

> 2) /tmp/img*.img on tmpfs (persistent grants and indirect descriptors
> disabled)

--
Vitaly Kuznetsov
Re: [Xen-devel] Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)
Roger Pau Monné roger@citrix.com writes:

> On 20/05/14 11:54, Vitaly Kuznetsov wrote:
>> Vitaly Kuznetsov vkuzn...@redhat.com writes:
>>
>>> 1) ramdisks (/dev/ram*) (persistent grants and indirect descriptors
>>> disabled)
>>
>> sorry, there was a typo. persistent grants and indirect descriptors are
>> enabled with ramdisks, otherwise such testing won't make any sense.
>
> I'm not sure how is that possible, from your description I get that you
> are using 3.11 on the Dom0, which means blkback has support for
> persistent grants and indirect descriptors, but the guest is RHEL7,
> that's using the 3.10 kernel AFAICT, and this kernel only has persistent
> grants implemented.

The RHEL7 kernel is mostly merged with 3.11 in its Xen part; we have indirect descriptors backported.

Actually I tried my tests with the upstream (Fedora) kernel and the results were similar. I can try comparing e.g. 3.11.10 with 3.12.0 and provide exact measurements.

--
Vitaly
Re: [Xen-devel] Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)
, maxb=334814KB/s, mint=20002msec, maxt=20002msec
READ: io=6636.0MB, aggrb=339661KB/s, minb=339661KB/s, maxb=339661KB/s, mint=20006msec, maxt=20006msec
READ: io=6594.0MB, aggrb=337595KB/s, minb=337595KB/s, maxb=337595KB/s, mint=20001msec, maxt=20001msec

Dumb 'dd' test shows the same:

revertall_upstream client:
# time for ntry in `seq 1 100`; do dd if=/dev/xvdc of=/dev/null bs=2048k 2> /dev/null; done

real    0m16.262s
user    0m0.189s
sys     0m7.021s

unpatched_upstream
# time for ntry in `seq 1 100`; do dd if=/dev/xvdc of=/dev/null bs=2048k 2> /dev/null; done

real    0m19.938s
user    0m0.174s
sys     0m9.489s

I tried running a newer Dom0 (3.14.4-200.fc20.x86_64) but that makes no difference.

P.P.S. I understand this test differs a lot from what these patches were supposed to fix and I'm not trying to say 'no' to the stable backport, but I also think this test data can be interesting.

And thanks, Felipe, for all your hardware hints!

In the meantime, I stand by my position that the patches need to be backported and that there is a regression if we don't do that.

Ubuntu has already provided a test kernel with the patches pulled in. I will test those as soon as I get the chance (hopefully by the end of the week).

See: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1319003

Felipe

-----Original Message-----
From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
Sent: 20 May 2014 12:41
To: Roger Pau Monne
Cc: Konrad Rzeszutek Wilk; ax...@kernel.dk; Felipe Franciosi; gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; sta...@vger.kernel.org; jerry.snitsel...@oracle.com; xen-de...@lists.xenproject.org
Subject: Re: [Xen-devel] Backport request to stable of two performance related fixes for xen-blkfront (3.13 fixes to earlier trees)
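The READ summary lines in this message are internally consistent: the aggregate bandwidth equals the data transferred divided by the runtime. A small sketch in C of that check, and of summing per-guest results as done for the charts, follows. It assumes that the MB in these lines means 1024 KB and that the printed `io=` value is exact; the function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bandwidth in KB/s from the amount of data read (in KB) and the runtime
 * in milliseconds. Integer division rounds down.
 */
uint64_t aggregate_kb_per_sec(uint64_t io_kb, uint64_t runtime_msec)
{
	assert(runtime_msec > 0);
	return io_kb * 1000 / runtime_msec;
}

/* Sum of the per-guest bandwidth figures from simultaneous runs. */
uint64_t total_kb_per_sec(const uint64_t *per_guest, size_t n)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += per_guest[i];
	return total;
}
```

For the second line shown, 6636.0 MB is 6795264 KB, and 6795264 KB over 20006 msec gives 339661 KB/s, matching the reported aggrb value.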
[PATCH v2] xenpv: don't BUG when failing to setup NMI callback
Some old Xen hypervisors (prior to 3.2) forbid DomUs to register NMI callbacks. E.g. we have the following code in xen-3.1:

    if ( (d->domain_id != 0) || (v->vcpu_id != 0) )
        return -EINVAL;

Commit 6efa20e49b9cb1db1ab66870cc37323474a75a13 introduced a kernel crash in case a PV guest fails to register the NMI callback. All x86_64 PV guests will fail to boot on top of such hypervisors (RHEL5 example):

(XEN) traps.c:405:d7 Unhandled invalid opcode fault/trap [#6] in domain 7 on VCPU 0 [ec=]
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 7 (vcpu#0) crashed on cpu#3:
(XEN) [ Xen-3.1.2-389.el5 x86_64 debug=n Not tainted ]
(XEN) CPU:3
(XEN) RIP:e033:[81004d96]
(XEN) RFLAGS: 0282 CONTEXT: guest
(XEN) rax: ffea rbx: rcx: 0002
(XEN) rdx: 0001 rsi: 81b0fe28 rdi:
(XEN) rbp: 81b0fe40 rsp: 81b0fde8 r8:
(XEN) r9: 81b0fdd0 r10: 7ff0 r11:
(XEN) r12: 81d65900 r13: r14:
(XEN) r15: cr0: 80050033 cr4: 26b0
(XEN) cr3: 00013a263000 cr2:
(XEN) ds: es: fs: gs: ss: e02b cs: e033
...

However it is possible to proceed without the NMI callback registered.

Changes in v2:
- skip nmi in cvt_gate_to_trap() for Xen < 3.2
- do not call xen_enable_nmi() when running under Xen < 3.2
- never BUG() in xen_enable_nmi()

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/xen/enlighten.c | 9 +
 arch/x86/xen/setup.c | 17 ++---
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index f17b292..60ec5ef 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -746,12 +746,13 @@ static int cvt_gate_to_trap(int vector, const gate_desc *val,
 		 */
 		;
 #endif
-	} else if (addr == (unsigned long)nmi)
+	} else if (addr == (unsigned long)nmi) {
 		/*
-		 * Use the native version as well.
+		 * Use the native version as well but require Xen >= 3.2
 		 */
-		;
-	else {
+		if (!xen_running_on_version_or_later(3, 2))
+			return 0;
+	} else {
 		/* Some other trap using IST? */
 		if (WARN_ON(val->ist != 0))
 			return 0;
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 821a11a..fdc73d3 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -593,8 +593,14 @@ void xen_enable_syscall(void)
 void xen_enable_nmi(void)
 {
 #ifdef CONFIG_X86_64
-	if (register_callback(CALLBACKTYPE_nmi, (char *)nmi))
-		BUG();
+	int ret;
+
+	ret = register_callback(CALLBACKTYPE_nmi, (char *)nmi);
+	if (ret != 0) {
+		/* Hypervisor probably forbids us to register NMI callback or
+		   some other error happened */
+		pr_warn("xen: failed to register NMI callback: %d\n", ret);
+	}
 #endif
 }
 void __init xen_pvmmu_arch_setup(void)
@@ -611,7 +617,12 @@ void __init xen_pvmmu_arch_setup(void)
 
 	xen_enable_sysenter();
 	xen_enable_syscall();
-	xen_enable_nmi();
+
+	/* Xen versions prior to 3.2 forbid DomUs to register NMI callbacks */
+	if (xen_running_on_version_or_later(3, 2))
+		xen_enable_nmi();
+	else
+		pr_warn("xen: skipping NMI callback registration for Xen < 3.2");
 }
 
 /* This function is not called for HVM domains */
-- 
1.9.3
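For readers who want to try the control flow outside the kernel, here is a minimal userspace sketch of the "warn and continue instead of BUG()" idea from the patch above. Everything in it is a stand-in invented for illustration: struct hv_version, version_or_later(), mock_register_nmi() and enable_nmi() are hypothetical names, not kernel or Xen symbols, and the mock hypervisor simply rejects the request on versions older than 3.2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the hypervisor version (not a kernel type). */
struct hv_version {
	int major;
	int minor;
};

/* True when v is at least major.minor. */
static bool version_or_later(struct hv_version v, int major, int minor)
{
	assert(major >= 0 && minor >= 0);
	return v.major > major || (v.major == major && v.minor >= minor);
}

/* Mock hypercall: pretend hypervisors older than 3.2 return -EINVAL (-22). */
static int mock_register_nmi(struct hv_version v)
{
	return version_or_later(v, 3, 2) ? 0 : -22;
}

/*
 * Try to enable the NMI callback. Returns true when it is registered.
 * On failure this only logs a warning and lets the caller carry on,
 * which is the behaviour the patch argues for.
 */
static bool enable_nmi(struct hv_version v)
{
	int ret;

	if (!version_or_later(v, 3, 2)) {
		fprintf(stderr, "skipping NMI callback registration for Xen < 3.2\n");
		return false;
	}

	ret = mock_register_nmi(v);
	if (ret != 0) {
		fprintf(stderr, "failed to register NMI callback: %d\n", ret);
		return false;
	}
	return true;
}
```

Calling enable_nmi() with {3, 1} only prints a warning and returns false, while {3, 2} or {4, 3} returns true; in neither case does the caller abort.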
Re: [Xen-devel] [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors
David Vrabel david.vra...@citrix.com writes:

On 07/07/14 21:33, Andrew Morton wrote:
On Mon, 7 Jul 2014 17:05:49 +0200 Vitaly Kuznetsov vkuzn...@redhat.com wrote:

we have a special check in read_vmcore() handler to check if the page was reported as ram or not by the hypervisor (pfn_is_ram()). However, when vmcore is read with mmap() no such check is performed. That can lead to unpredictable results, e.g. when running Xen PVHVM guest memcpy() after mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating enormous load in both DomU and Dom0.

Does it make forward progress though? Or is it ending up repeatedly retrying the same instruction?

If memcpy is using the SSE2 optimization, the 16-byte 'movdqu' instruction never finishes (it repeatedly retries issuing two 8-byte requests to qemu-dm). qemu-dm decides that it's hitting 'Neither RAM nor known MMIO space' and returns 8 0xff bytes for both of these requests (I was testing with qemu-traditional).

Is it failing on a ballooned page in a RAM region? Or is it mapping non-RAM regions as well?

I wasn't using ballooning; it happens that oldmem has several (two in my test) pages which are HVMMEM_mmio_dm, but qemu-dm considers them to be neither ram nor mmio.

Fix the issue by mapping each non-ram page to the zero page. Keep direct path with remap_oldmem_pfn_range() to avoid looping through all pages on bare metal. The issue can also be solved by overriding remap_oldmem_pfn_range() in xen-specific code, as remap_oldmem_pfn_range() was designed for. That, however, would involve non-obvious xen code path for all x86 builds with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific code on x86 arch from doing the same override.

The oldmem_pfn_is_ram() is Xen-specific but this problem (ballooned pages) must be common to KVM. How does KVM handle this?

As far as I'm concerned the issue was never hit with KVM. I *think* the issue has something to do with the conjunction of 16-byte 'movdqu' emulation for io pages in the Xen hypervisor, 8-byte event channel requests and qemu-traditional. But even if it gets fixed on the hypervisor side, I believe fixing the issue kernel-side is still worth it as there are non-fixed hypervisors out there (e.g. AWS EC2).

David

-- Vitaly
Re: [Xen-devel] [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes: On Mon, Jul 07, 2014 at 05:05:49PM +0200, Vitaly Kuznetsov wrote: we have a special check in read_vmcore() handler to check if the page was reported as ram or not by the hypervisor (pfn_is_ram()). However, when vmcore is read with mmap() no such check is performed. That can lead to unpredictable results, e.g. when running Xen PVHVM guest memcpy() after mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating enormous load in both DomU and Dom0. Fix the issue by mapping each non-ram page to the zero page. Keep direct path with remap_oldmem_pfn_range() to avoid looping through all pages on bare metal. The issue can also be solved by overriding remap_oldmem_pfn_range() in xen-specific code, as remap_oldmem_pfn_range() was been designed for. That, however, would involve non-obvious xen code path for all x86 builds with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific code on x86 arch from doing the same override. Could the 'remap_oldmem_pfn_range' become an function ops? I see there is an 'register_oldmem_pfn_is_ram' - so could there be similar one for 'pfn_range'? yes, it is possible to replace '__weak remap_oldmem_pfn_range' with 'register_oldmem_pfn_is_ram'. However s390 arch overrides this function in arch/s390/kernel/crash_dump.c so we'll have to make some changes there as well. Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com --- fs/proc/vmcore.c | 68 +++- 1 file changed, 62 insertions(+), 6 deletions(-) diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c index 382aa89..2716e19 100644 --- a/fs/proc/vmcore.c +++ b/fs/proc/vmcore.c @@ -328,6 +328,46 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz) * virtually contiguous user-space in ELF layout. 
*/ #ifdef CONFIG_MMU +static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, u64 len, +unsigned long pfn, unsigned long page_count) +{ +unsigned long pos; +size_t size; +unsigned long vma_addr; +unsigned long emptypage_pfn = __pa(empty_zero_page) PAGE_SHIFT; + +for (pos = pfn; (pos - pfn) = page_count; pos++) { +if (!pfn_is_ram(pos) || (pos - pfn) == page_count) { +/* we hit a page which is not ram or reached the end */ +if (pos - pfn 0) { +/* remapping continuous region */ +size = (pos - pfn) PAGE_SHIFT; +vma_addr = vma-vm_start + len; +if (remap_oldmem_pfn_range(vma, vma_addr, + pfn, size, + vma-vm_page_prot)) +return len; +len += size; +page_count -= (pos - pfn); +} +if (page_count 0) { +/* we hit a page which is not ram, replacing + with an empty one */ +vma_addr = vma-vm_start + len; +if (remap_oldmem_pfn_range(vma, vma_addr, + emptypage_pfn, + PAGE_SIZE, + vma-vm_page_prot)) +return len; +len += PAGE_SIZE; +pfn = pos + 1; +page_count--; +} +} +} +return len; +} + static int mmap_vmcore(struct file *file, struct vm_area_struct *vma) { size_t size = vma-vm_end - vma-vm_start; @@ -383,17 +423,33 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma) list_for_each_entry(m, vmcore_list, list) { if (start m-offset + m-size) { -u64 paddr = 0; +u64 paddr = 0, original_len; +unsigned long pfn, page_count; tsz = min_t(size_t, m-offset + m-size - start, size); paddr = m-paddr + start - m-offset; -if (remap_oldmem_pfn_range(vma, vma-vm_start + len, - paddr PAGE_SHIFT, tsz, - vma-vm_page_prot)) -goto fail; + +/* check if oldmem_pfn_is_ram was registered to avoid + looping over all pages without a reason */ +if (oldmem_pfn_is_ram) { +pfn = paddr PAGE_SHIFT; +page_count = tsz PAGE_SHIFT
Re: [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors
Vivek Goyal vgo...@redhat.com writes:

On Mon, Jul 07, 2014 at 05:05:49PM +0200, Vitaly Kuznetsov wrote:

we have a special check in read_vmcore() handler to check if the page was reported as ram or not by the hypervisor (pfn_is_ram()).

I am wondering if this name pfn_is_ram() is appropriate for what we are doing. So IIUC, ballooned memory is also RAM, just that it has not been allocated yet. That means we can safely assume that there is no data and can safely fill it with zeros?

For Xen pfn_is_ram() returns 0 in case the page is an mmio. Ballooned pages are also considered to be mmio (HVMOP_get_mem_type returns HVMMEM_mmio_dm).

If yes, then page_is_zero_filled() might be a more appropriate name.

It's not, as an mmio page is not always zero-filled. We just don't need these pages in vmcore.

Also I am wondering why it was not done as part of copy_oldmem_page() so that respective arch could hide all the details.

Afaiac that wouldn't solve the mmap issue I'm trying to address but we can ask Olaf why he preferred pfn_is_ram() path.

However, when vmcore is read with mmap() no such check is performed. That can lead to unpredictable results, e.g. when running Xen PVHVM guest memcpy() after mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating enormous load in both DomU and Dom0. Fix the issue by mapping each non-ram page to the zero page. Keep direct path with remap_oldmem_pfn_range() to avoid looping through all pages on bare metal. The issue can also be solved by overriding remap_oldmem_pfn_range() in xen-specific code, as remap_oldmem_pfn_range() was designed for. That, however, would involve non-obvious xen code path for all x86 builds with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific code on x86 arch from doing the same override.

I am not sure I understand this part. So what is all the other hypervisor-specific code which would like to do this? And will that code be compiled at the same time as CONFIG_XEN_PVHVM?
I meant to say that we have many hypervisors for x86 supported. In case I override __weak remap_oldmem_pfn_range() in xen-specific code it will *always* get executed when this code was compiled. In case we'll have to do similar override in e.g. Hyperv or KVM code in future we'll have a mess (in which order do we need to execute these overrides?). In few words, Xen-PVHVM is not an architecture so I'm not following Architectures may override this function to map oldmem path. Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com --- fs/proc/vmcore.c | 68 +++- 1 file changed, 62 insertions(+), 6 deletions(-) diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c index 382aa89..2716e19 100644 --- a/fs/proc/vmcore.c +++ b/fs/proc/vmcore.c @@ -328,6 +328,46 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz) * virtually contiguous user-space in ELF layout. */ #ifdef CONFIG_MMU +static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, u64 len, +unsigned long pfn, unsigned long page_count) +{ +unsigned long pos; +size_t size; +unsigned long vma_addr; +unsigned long emptypage_pfn = __pa(empty_zero_page) PAGE_SHIFT; + +for (pos = pfn; (pos - pfn) = page_count; pos++) { +if (!pfn_is_ram(pos) || (pos - pfn) == page_count) { +/* we hit a page which is not ram or reached the end */ +if (pos - pfn 0) { +/* remapping continuous region */ +size = (pos - pfn) PAGE_SHIFT; +vma_addr = vma-vm_start + len; +if (remap_oldmem_pfn_range(vma, vma_addr, + pfn, size, + vma-vm_page_prot)) +return len; +len += size; +page_count -= (pos - pfn); +} +if (page_count 0) { +/* we hit a page which is not ram, replacing + with an empty one */ +vma_addr = vma-vm_start + len; +if (remap_oldmem_pfn_range(vma, vma_addr, + emptypage_pfn, + PAGE_SIZE, + vma-vm_page_prot)) +return len; +len += PAGE_SIZE; +pfn = pos + 1; +page_count--; +} +} +} +return len; +} + static int mmap_vmcore(struct file *file, struct vm_area_struct *vma) { size_t size = vma-vm_end
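To make the build-time versus run-time point in the reply above a bit more concrete: a link-time override is active whenever it is compiled in, whereas a registered callback is active only once a platform has detected itself and installed it. Below is a small userspace sketch of the callback style, loosely in the spirit of the register_oldmem_pfn_is_ram idea mentioned in this thread. All names here (pfn_is_ram_fn, register_pfn_hook(), unregister_pfn_hook(), page_is_ram(), demo_hook()) are hypothetical and chosen for illustration; this is not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook type: returns nonzero when pfn is ordinary RAM. */
typedef int (*pfn_is_ram_fn)(unsigned long pfn);

/* NULL until some platform code registers a hook at run time. */
static pfn_is_ram_fn pfn_hook;

/* Only one hook may be active; a second registration is refused. */
static int register_pfn_hook(pfn_is_ram_fn fn)
{
	if (pfn_hook)
		return -1;	/* stands in for a "busy" error code */
	pfn_hook = fn;
	return 0;
}

static void unregister_pfn_hook(void)
{
	pfn_hook = NULL;
}

/* With no hook registered (e.g. bare metal) every page counts as RAM. */
static int page_is_ram(unsigned long pfn)
{
	return pfn_hook ? pfn_hook(pfn) : 1;
}

/* Example platform hook for the demo: pfns 5 and 9 are not RAM. */
static int demo_hook(unsigned long pfn)
{
	return pfn != 5 && pfn != 9;
}
```

With this shape, code built for several hypervisors can coexist in one binary: only the platform that actually registers its hook changes the answer, and a conflicting second registration is visible as an error instead of silently depending on link order.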
[PATCH v2] mmap_vmcore: skip non-ram pages reported by hypervisors
We have a special check in read_vmcore() handler to check if the page was reported as ram or not by the hypervisor (pfn_is_ram()). However, when vmcore is read with mmap() no such check is performed. That can lead to unpredictable results, e.g. when running Xen PVHVM guest memcpy() after mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating enormous load in both DomU and Dom0. Fix the issue by mapping each non-ram page to the zero page. Keep direct path with remap_oldmem_pfn_range() to avoid looping through all pages on bare metal. The issue can also be solved by overriding remap_oldmem_pfn_range() in xen-specific code, as remap_oldmem_pfn_range() was been designed for. That, however, would involve non-obvious xen code path for all x86 builds with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific code on x86 arch from doing the same override. Changes from v1: - comment style changes - change remap_oldmem_pfn_checked() interface to closer match the remap_oldmem_pfn() interface - preserve formal parameters within the loop, make the loop conditions easier to understand - use my_zero_pfn() for the zero page - return remapped length instead of new offset Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com Reviewed-by: Andrew Jones drjo...@redhat.com --- fs/proc/vmcore.c | 89 1 file changed, 84 insertions(+), 5 deletions(-) diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c index 382aa89..5cd13f8 100644 --- a/fs/proc/vmcore.c +++ b/fs/proc/vmcore.c @@ -328,6 +328,67 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz) * virtually contiguous user-space in ELF layout. */ #ifdef CONFIG_MMU +/* + * remap_oldmem_pfn_checked - do remap_oldmem_pfn replacing all pages reported + * as not being ram with the zero page. + * + * @vma: vm_area_struct describing requested mapping + * @vma_addr: start remapping from + * @pfn: page frame number to start remapping to + * @size: remapping size + * + * Returns the remapped length. 
If no errors were hit during the remapping it + * should be equal to size. + */ +static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, + unsigned long vma_addr, unsigned long pfn, + unsigned long size) +{ + size_t map_size; + unsigned long pos_start, pos_end, pos; + unsigned long zeropage_pfn = my_zero_pfn(0); + u64 len = 0; + + pos_start = pfn; + pos_end = pfn + (size PAGE_SHIFT); + + for (pos = pos_start; pos pos_end; ++pos) { + if (!pfn_is_ram(pos)) { + /* We hit a page which is not ram. Remap the continuous +* region between pos_start and pos-1 and replace +* the non-ram page at pos with the zero page. +*/ + if (pos pos_start) { + /* Remap continuous region */ + map_size = (pos - pos_start) PAGE_SHIFT; + if (remap_oldmem_pfn_range(vma, vma_addr + len, + pos_start, map_size, + vma-vm_page_prot)) + return len; + len += map_size; + } + /* Remap the zero page */ + if (remap_oldmem_pfn_range(vma, vma_addr + len, + zeropage_pfn, + PAGE_SIZE, + vma-vm_page_prot)) + return len; + len += PAGE_SIZE; + pos_start = pos + 1; + } + } + if (pos pos_start) { + /* Remap the rest */ + map_size = (pos - pos_start) PAGE_SHIFT; + if (remap_oldmem_pfn_range(vma, vma_addr + len, pos_start, + map_size, + vma-vm_page_prot)) + return len; + len += map_size; + } + return len; +} + static int mmap_vmcore(struct file *file, struct vm_area_struct *vma) { size_t size = vma-vm_end - vma-vm_start; @@ -387,13 +448,31 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma) tsz = min_t(size_t, m-offset + m-size - start, size); paddr = m-paddr + start - m-offset; - if (remap_oldmem_pfn_range(vma, vma-vm_start + len, - paddr PAGE_SHIFT, tsz, - vma-vm_page_prot)) - goto fail
[PATCH v3] mmap_vmcore: skip non-ram pages reported by hypervisors
We have a special check in read_vmcore() handler to check if the page was reported as ram or not by the hypervisor (pfn_is_ram()). However, when vmcore is read with mmap() no such check is performed. That can lead to unpredictable results, e.g. when running Xen PVHVM guest memcpy() after mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating enormous load in both DomU and Dom0. Fix the issue by mapping each non-ram page to the zero page. Keep direct path with remap_oldmem_pfn_range() to avoid looping through all pages on bare metal. The issue can also be solved by overriding remap_oldmem_pfn_range() in xen-specific code, as remap_oldmem_pfn_range() was been designed for. That, however, would involve non-obvious xen code path for all x86 builds with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific code on x86 arch from doing the same override. Changes from v2: - make remap_oldmem_pfn_checked() interface exactly match remap_oldmem_pfn_range() - unmap mapped part inside remap_oldmem_pfn_checked() in case of failure so we don't need to take care of it in mmap_vmcore() - create vmcore_remap_oldmem_pfn() wrapper Changes from v1: - comment style changes - change remap_oldmem_pfn_checked() interface to closer match the remap_oldmem_pfn() interface - preserve formal parameters within the loop, make the loop conditions easier to understand - use my_zero_pfn() for the zero page - return remapped length instead of new offset Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com Reviewed-by: Andrew Jones drjo...@redhat.com --- fs/proc/vmcore.c | 82 +--- 1 file changed, 79 insertions(+), 3 deletions(-) diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c index 382aa89..66dac43 100644 --- a/fs/proc/vmcore.c +++ b/fs/proc/vmcore.c @@ -328,6 +328,82 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz) * virtually contiguous user-space in ELF layout. 
*/ #ifdef CONFIG_MMU +/* + * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages + * reported as not being ram with the zero page. + * + * @vma: vm_area_struct describing requested mapping + * @from: start remapping from + * @pfn: page frame number to start remapping to + * @size: remapping size + * @prot: protection bits + * + * Returns zero on success, -EAGAIN on failure. + */ +int remap_oldmem_pfn_checked(struct vm_area_struct *vma, unsigned long from, +unsigned long pfn, unsigned long size, +pgprot_t prot) +{ + size_t map_size; + unsigned long pos_start, pos_end, pos; + unsigned long zeropage_pfn = my_zero_pfn(0); + u64 len = 0; + + pos_start = pfn; + pos_end = pfn + (size PAGE_SHIFT); + + for (pos = pos_start; pos pos_end; ++pos) { + if (!pfn_is_ram(pos)) { + /* We hit a page which is not ram. Remap the continuous +* region between pos_start and pos-1 and replace +* the non-ram page at pos with the zero page. +*/ + if (pos pos_start) { + /* Remap continuous region */ + map_size = (pos - pos_start) PAGE_SHIFT; + if (remap_oldmem_pfn_range(vma, from + len, + pos_start, map_size, + prot)) + goto fail; + len += map_size; + } + /* Remap the zero page */ + if (remap_oldmem_pfn_range(vma, from + len, + zeropage_pfn, + PAGE_SIZE, + prot)) + goto fail; + len += PAGE_SIZE; + pos_start = pos + 1; + } + } + if (pos pos_start) { + /* Remap the rest */ + map_size = (pos - pos_start) PAGE_SHIFT; + if (remap_oldmem_pfn_range(vma, from + len, pos_start, + map_size, + vma-vm_page_prot)) + goto fail; + len += map_size; + } + return 0; +fail: + do_munmap(vma-vm_mm, from, len); + return -EAGAIN; +} + +int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma, + unsigned long from, unsigned long pfn, + unsigned long size, pgprot_t prot) +{ + /* Check if oldmem_pfn_is_ram was registered to avoid + looping over all pages without a reason. */ + if (oldmem_pfn_is_ram) + return remap_oldmem_pfn_checked
[PATCH v4] mmap_vmcore: skip non-ram pages reported by hypervisors
We have a special check in read_vmcore() handler to check if the page was reported as ram or not by the hypervisor (pfn_is_ram()). However, when vmcore is read with mmap() no such check is performed. That can lead to unpredictable results, e.g. when running Xen PVHVM guest memcpy() after mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating enormous load in both DomU and Dom0. Fix the issue by mapping each non-ram page to the zero page. Keep direct path with remap_oldmem_pfn_range() to avoid looping through all pages on bare metal. The issue can also be solved by overriding remap_oldmem_pfn_range() in xen-specific code, as remap_oldmem_pfn_range() was been designed for. That, however, would involve non-obvious xen code path for all x86 builds with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific code on x86 arch from doing the same override. Changes from v3: - multi line comment style changes - minor code style changes Changes from v2: - make remap_oldmem_pfn_checked() interface exactly match remap_oldmem_pfn_range() - unmap mapped part inside remap_oldmem_pfn_checked() in case of failure so we don't need to take care of it in mmap_vmcore() - create vmcore_remap_oldmem_pfn() wrapper Changes from v1: - comment style changes - change remap_oldmem_pfn_checked() interface to closer match the remap_oldmem_pfn() interface - preserve formal parameters within the loop, make the loop conditions easier to understand - use my_zero_pfn() for the zero page - return remapped length instead of new offset Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com Reviewed-by: Andrew Jones drjo...@redhat.com --- fs/proc/vmcore.c | 83 ++-- 1 file changed, 80 insertions(+), 3 deletions(-) diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c index 382aa89..405a409 100644 --- a/fs/proc/vmcore.c +++ b/fs/proc/vmcore.c @@ -328,6 +328,83 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz) * virtually contiguous user-space in ELF layout. 
*/ #ifdef CONFIG_MMU +/* + * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages + * reported as not being ram with the zero page. + * + * @vma: vm_area_struct describing requested mapping + * @from: start remapping from + * @pfn: page frame number to start remapping to + * @size: remapping size + * @prot: protection bits + * + * Returns zero on success, -EAGAIN on failure. + */ +int remap_oldmem_pfn_checked(struct vm_area_struct *vma, unsigned long from, +unsigned long pfn, unsigned long size, +pgprot_t prot) +{ + size_t map_size; + unsigned long pos_start, pos_end, pos; + unsigned long zeropage_pfn = my_zero_pfn(0); + u64 len = 0; + + pos_start = pfn; + pos_end = pfn + (size PAGE_SHIFT); + + for (pos = pos_start; pos pos_end; ++pos) { + if (!pfn_is_ram(pos)) { + /* +* We hit a page which is not ram. Remap the continuous +* region between pos_start and pos-1 and replace +* the non-ram page at pos with the zero page. +*/ + if (pos pos_start) { + /* Remap continuous region */ + map_size = (pos - pos_start) PAGE_SHIFT; + if (remap_oldmem_pfn_range(vma, from + len, + pos_start, map_size, + prot)) + goto fail; + len += map_size; + } + /* Remap the zero page */ + if (remap_oldmem_pfn_range(vma, from + len, + zeropage_pfn, + PAGE_SIZE, prot)) + goto fail; + len += PAGE_SIZE; + pos_start = pos + 1; + } + } + if (pos pos_start) { + /* Remap the rest */ + map_size = (pos - pos_start) PAGE_SHIFT; + if (remap_oldmem_pfn_range(vma, from + len, pos_start, + map_size, vma-vm_page_prot)) + goto fail; + len += map_size; + } + return 0; +fail: + do_munmap(vma-vm_mm, from, len); + return -EAGAIN; +} + +int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma, + unsigned long from, unsigned long pfn, + unsigned long size, pgprot_t prot) +{ + /* +* Check if oldmem_pfn_is_ram was registered to avoid +* looping over all pages without a reason. +*/ + if (oldmem_pfn_is_ram
[PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors
we have a special check in read_vmcore() handler to check if the page was reported as ram or not by the hypervisor (pfn_is_ram()). However, when vmcore is read with mmap() no such check is performed. That can lead to unpredictable results, e.g. when running Xen PVHVM guest memcpy() after mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating enormous load in both DomU and Dom0. Fix the issue by mapping each non-ram page to the zero page. Keep direct path with remap_oldmem_pfn_range() to avoid looping through all pages on bare metal. The issue can also be solved by overriding remap_oldmem_pfn_range() in xen-specific code, as remap_oldmem_pfn_range() was been designed for. That, however, would involve non-obvious xen code path for all x86 builds with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific code on x86 arch from doing the same override. Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com --- fs/proc/vmcore.c | 68 +++- 1 file changed, 62 insertions(+), 6 deletions(-) diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c index 382aa89..2716e19 100644 --- a/fs/proc/vmcore.c +++ b/fs/proc/vmcore.c @@ -328,6 +328,46 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz) * virtually contiguous user-space in ELF layout. 
*/ #ifdef CONFIG_MMU +static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, u64 len, + unsigned long pfn, unsigned long page_count) +{ + unsigned long pos; + size_t size; + unsigned long vma_addr; + unsigned long emptypage_pfn = __pa(empty_zero_page) PAGE_SHIFT; + + for (pos = pfn; (pos - pfn) = page_count; pos++) { + if (!pfn_is_ram(pos) || (pos - pfn) == page_count) { + /* we hit a page which is not ram or reached the end */ + if (pos - pfn 0) { + /* remapping continuous region */ + size = (pos - pfn) PAGE_SHIFT; + vma_addr = vma-vm_start + len; + if (remap_oldmem_pfn_range(vma, vma_addr, + pfn, size, + vma-vm_page_prot)) + return len; + len += size; + page_count -= (pos - pfn); + } + if (page_count 0) { + /* we hit a page which is not ram, replacing + with an empty one */ + vma_addr = vma-vm_start + len; + if (remap_oldmem_pfn_range(vma, vma_addr, + emptypage_pfn, + PAGE_SIZE, + vma-vm_page_prot)) + return len; + len += PAGE_SIZE; + pfn = pos + 1; + page_count--; + } + } + } + return len; +} + static int mmap_vmcore(struct file *file, struct vm_area_struct *vma) { size_t size = vma-vm_end - vma-vm_start; @@ -383,17 +423,33 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma) list_for_each_entry(m, vmcore_list, list) { if (start m-offset + m-size) { - u64 paddr = 0; + u64 paddr = 0, original_len; + unsigned long pfn, page_count; tsz = min_t(size_t, m-offset + m-size - start, size); paddr = m-paddr + start - m-offset; - if (remap_oldmem_pfn_range(vma, vma-vm_start + len, - paddr PAGE_SHIFT, tsz, - vma-vm_page_prot)) - goto fail; + + /* check if oldmem_pfn_is_ram was registered to avoid + looping over all pages without a reason */ + if (oldmem_pfn_is_ram) { + pfn = paddr PAGE_SHIFT; + page_count = tsz PAGE_SHIFT; + original_len = len; + len = remap_oldmem_pfn_checked(vma, len, pfn, + page_count); + if (len != original_len + tsz) + goto fail; + } else { + if (remap_oldmem_pfn_range(vma, + vma
Re: [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors
Andrew Morton a...@linux-foundation.org writes:

On Mon, 7 Jul 2014 17:05:49 +0200 Vitaly Kuznetsov vkuzn...@redhat.com wrote:

we have a special check in read_vmcore() handler to check if the page was reported as ram or not by the hypervisor (pfn_is_ram()). However, when vmcore is read with mmap() no such check is performed. That can lead to unpredictable results, e.g. when running Xen PVHVM guest memcpy() after mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating enormous load in both DomU and Dom0. Fix the issue by mapping each non-ram page to the zero page. Keep direct path with remap_oldmem_pfn_range() to avoid looping through all pages on bare metal. The issue can also be solved by overriding remap_oldmem_pfn_range() in xen-specific code, as remap_oldmem_pfn_range() was designed for. That, however, would involve non-obvious xen code path for all x86 builds with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific code on x86 arch from doing the same override.

I'd like to get some reviewed-by's and tested-by's on this one please.

This patch can be tested with Xen PVHVM guest only as it is the only platform which registers oldmem_pfn_is_ram atm.

--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -328,6 +328,46 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
  * virtually contiguous user-space in ELF layout.
  */
 #ifdef CONFIG_MMU
+static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, u64 len,
+				    unsigned long pfn, unsigned long page_count)
+{
+	unsigned long pos;
+	size_t size;
+	unsigned long vma_addr;
+	unsigned long emptypage_pfn = __pa(empty_zero_page) >> PAGE_SHIFT;

That's old-school. Can we use my_zero_pfn() here? Also, zeropage_pfn is a better name - let's not introduce the hitherto unknown concept of an empty page.

Sure!

+	for (pos = pfn; (pos - pfn) <= page_count; pos++) {
+		if (!pfn_is_ram(pos) || (pos - pfn) == page_count) {
+			/* we hit a page which is not ram or reached the end */
+			if (pos - pfn > 0) {
+				/* remapping continuous region */
+				size = (pos - pfn) << PAGE_SHIFT;
+				vma_addr = vma->vm_start + len;
+				if (remap_oldmem_pfn_range(vma, vma_addr,
+							   pfn, size,
+							   vma->vm_page_prot))
+					return len;
+				len += size;
+				page_count -= (pos - pfn);
+			}
+			if (page_count > 0) {
+				/* we hit a page which is not ram, replacing
+				   with an empty one */

I suggest

	/*
	 * We hit a page which is not ram. Replace it
	 * with the zero page.
	 */

:-)

+				vma_addr = vma->vm_start + len;
+				if (remap_oldmem_pfn_range(vma, vma_addr,
+							   emptypage_pfn,
+							   PAGE_SIZE,
+							   vma->vm_page_prot))
+					return len;
+				len += PAGE_SIZE;
+				pfn = pos + 1;
+				page_count--;
+			}
+		}
+	}
+	return len;
+}

Also, this loop seems unnecessarily hard to follow. It *looks* like the `for' statement has an off-by-one because of the <=, but page_count is modified inside the loop! Despite it being an incoming formal argument.

There is no off-by-one error here (I believe) as I'm checking two possible conditions to remap the continuous region:
1) We hit a non-ram page
2) We reached the end
so we exclude the page pos is pointing us to in both cases. I tried to avoid code duplication, e.g. having one 'remap continuous region' inside the loop to do the remapping when we hit a non-ram page and having the other one outside the loop to remap the tail.

None of this is made any easier by the function's lack of documentation. Some description of the incoming args would help, along with an overall description of the function's responsibilities.

That being said, can't we just do something nice and simple like

	pos = pfn;
	while (pos < pfn + page_count) {
		stuff which advances `pos'
	}

?

I completely agree it's possible to make this code easier to understand, will do.

 static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 {
 	size_t size = vma->vm_end - vma->vm_start;
@@ -383,17 +423,33 @@ static
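The simpler loop shape asked for in the review above can be tried in userspace. The following is only a sketch under stated assumptions: it plans the mappings instead of performing them, and struct seg, demo_is_ram() and plan_remap() are hypothetical names made up for this example. It walks the pfn range once, never modifies its arguments, emits one segment per continuous RAM run, and emits a single zero-page substitution for each non-RAM page.

```c
#include <assert.h>
#include <stddef.h>

/* One planned mapping: `count` pages starting at `pfn`, or one zero page. */
struct seg {
	unsigned long pfn;
	unsigned long count;
	int zero;		/* 1: map the zero page in place of pfn */
};

/* Hypothetical predicate for the demo: pfns 12 and 15 are not RAM. */
static int demo_is_ram(unsigned long pfn)
{
	return pfn != 12 && pfn != 15;
}

/*
 * Split [pfn, pfn + page_count) into RAM runs and single zero-page
 * substitutions. Returns the number of segments written to `out`,
 * or -1 if `out` (capacity `max`) is too small.
 */
static int plan_remap(unsigned long pfn, unsigned long page_count,
		      int (*is_ram)(unsigned long),
		      struct seg *out, size_t max)
{
	unsigned long start = pfn;
	unsigned long end = pfn + page_count;
	unsigned long pos;
	size_t n = 0;

	for (pos = start; pos < end; ++pos) {
		if (is_ram(pos))
			continue;
		if (pos > start) {
			/* Flush the RAM run [start, pos). */
			if (n == max)
				return -1;
			out[n++] = (struct seg){ start, pos - start, 0 };
		}
		/* Substitute the zero page for the non-RAM page at pos. */
		if (n == max)
			return -1;
		out[n++] = (struct seg){ pos, 1, 1 };
		start = pos + 1;
	}
	if (pos > start) {
		/* Flush the tail run. */
		if (n == max)
			return -1;
		out[n++] = (struct seg){ start, pos - start, 0 };
	}
	return (int)n;
}
```

For the range starting at pfn 10 with 8 pages and the demo predicate, the plan is: RAM run 10-11, zero page for 12, RAM run 13-14, zero page for 15, RAM run 16-17. The cost of this shape is one duplicated "flush the run" block after the loop, which is the trade-off discussed in the exchange above.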
[PATCH v5] mmap_vmcore: skip non-ram pages reported by hypervisors
We have a special check in the read_vmcore() handler to check whether the page
was reported as ram or not by the hypervisor (pfn_is_ram()). However, when
vmcore is read with mmap() no such check is performed. That can lead to
unpredictable results, e.g. in a Xen PVHVM guest, memcpy() after mmap() on
/proc/vmcore will hang processing HVMMEM_mmio_dm pages, creating enormous load
in both DomU and Dom0.

Fix the issue by mapping each non-ram page to the zero page. Keep the direct
path with remap_oldmem_pfn_range() to avoid looping through all pages on bare
metal.

The issue can also be solved by overriding remap_oldmem_pfn_range() in
xen-specific code, which is what remap_oldmem_pfn_range() was designed for.
That, however, would involve a non-obvious xen code path for all x86 builds
with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific code
on the x86 arch from doing the same override.

Changes from v4:
- change map_size type size_t -> unsigned long
- use prot instead of vma->vm_page_prot inside remap_oldmem_pfn_checked()

Changes from v3:
- multi line comment style changes
- minor code style changes

Changes from v2:
- make remap_oldmem_pfn_checked() interface exactly match
  remap_oldmem_pfn_range()
- unmap mapped part inside remap_oldmem_pfn_checked() in case of failure
  so we don't need to take care of it in mmap_vmcore()
- create vmcore_remap_oldmem_pfn() wrapper

Changes from v1:
- comment style changes
- change remap_oldmem_pfn_checked() interface to more closely match the
  remap_oldmem_pfn() interface
- preserve formal parameters within the loop, make the loop conditions
  easier to understand
- use my_zero_pfn() for the zero page
- return remapped length instead of new offset

Reviewed-by: Andrew Jones drjo...@redhat.com
Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 fs/proc/vmcore.c | 83 ++--
 1 file changed, 80 insertions(+), 3 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 382aa89..1f77f35 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -328,6 +328,83 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
  * virtually contiguous user-space in ELF layout.
  */
 #ifdef CONFIG_MMU
+/*
+ * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages
+ * reported as not being ram with the zero page.
+ *
+ * @vma: vm_area_struct describing requested mapping
+ * @from: start remapping from
+ * @pfn: page frame number to start remapping to
+ * @size: remapping size
+ * @prot: protection bits
+ *
+ * Returns zero on success, -EAGAIN on failure.
+ */
+int remap_oldmem_pfn_checked(struct vm_area_struct *vma, unsigned long from,
+			     unsigned long pfn, unsigned long size,
+			     pgprot_t prot)
+{
+	unsigned long map_size;
+	unsigned long pos_start, pos_end, pos;
+	unsigned long zeropage_pfn = my_zero_pfn(0);
+	u64 len = 0;
+
+	pos_start = pfn;
+	pos_end = pfn + (size >> PAGE_SHIFT);
+
+	for (pos = pos_start; pos < pos_end; ++pos) {
+		if (!pfn_is_ram(pos)) {
+			/*
+			 * We hit a page which is not ram. Remap the continuous
+			 * region between pos_start and pos-1 and replace
+			 * the non-ram page at pos with the zero page.
+			 */
+			if (pos > pos_start) {
+				/* Remap continuous region */
+				map_size = (pos - pos_start) << PAGE_SHIFT;
+				if (remap_oldmem_pfn_range(vma, from + len,
+							   pos_start, map_size,
+							   prot))
+					goto fail;
+				len += map_size;
+			}
+			/* Remap the zero page */
+			if (remap_oldmem_pfn_range(vma, from + len,
+						   zeropage_pfn,
+						   PAGE_SIZE, prot))
+				goto fail;
+			len += PAGE_SIZE;
+			pos_start = pos + 1;
+		}
+	}
+	if (pos > pos_start) {
+		/* Remap the rest */
+		map_size = (pos - pos_start) << PAGE_SHIFT;
+		if (remap_oldmem_pfn_range(vma, from + len, pos_start,
+					   map_size, prot))
+			goto fail;
+		len += map_size;
+	}
+	return 0;
+fail:
+	do_munmap(vma->vm_mm, from, len);
+	return -EAGAIN;
+}
+
+int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma,
+			    unsigned long from, unsigned long pfn,
+			    unsigned long size, pgprot_t prot)
+{
+	/*
+	 * Check if oldmem_pfn_is_ram
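The way the loop splits a request into runs of ram pages and single zero-page mappings can be tried outside the kernel. The following is only a user-space sketch of that splitting logic, not the kernel code: every `sketch_` name, the 4 KiB page size, the fixed "pfns 102 and 103 are not ram" predicate and the recording stub that stands in for remap_oldmem_pfn_range() are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12UL                 /* assumed 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_ZERO_PFN   0UL                  /* hypothetical stand-in for my_zero_pfn(0) */
#define SKETCH_MAX_CALLS  64

/* One recorded remap request: where it goes, which pfn, how many bytes. */
struct sketch_remap_call {
	unsigned long from, pfn, size;
};

static struct sketch_remap_call sketch_calls[SKETCH_MAX_CALLS];
static size_t sketch_ncalls;

/* Hypothetical predicate: pretend the hypervisor reports 102 and 103 as non-ram. */
bool sketch_pfn_is_ram(unsigned long pfn)
{
	return pfn != 102 && pfn != 103;
}

/* Stand-in for remap_oldmem_pfn_range(): it only records the request. */
int sketch_remap(unsigned long from, unsigned long pfn, unsigned long size)
{
	if (sketch_ncalls == SKETCH_MAX_CALLS)
		return -1;
	sketch_calls[sketch_ncalls++] = (struct sketch_remap_call){ from, pfn, size };
	return 0;
}

/* Same control flow as the patch: ram runs are remapped as-is, holes get the zero page. */
int sketch_remap_checked(unsigned long from, unsigned long pfn, unsigned long size)
{
	unsigned long pos_start = pfn;
	unsigned long pos_end = pfn + (size >> SKETCH_PAGE_SHIFT);
	unsigned long pos, map_size;
	size_t len = 0;

	sketch_ncalls = 0;
	for (pos = pos_start; pos < pos_end; ++pos) {
		if (sketch_pfn_is_ram(pos))
			continue;
		if (pos > pos_start) {
			/* Flush the ram run that ended just before pos. */
			map_size = (pos - pos_start) << SKETCH_PAGE_SHIFT;
			if (sketch_remap(from + len, pos_start, map_size))
				return -1;
			len += map_size;
		}
		/* Replace the non-ram page with the zero page. */
		if (sketch_remap(from + len, SKETCH_ZERO_PFN, SKETCH_PAGE_SIZE))
			return -1;
		len += SKETCH_PAGE_SIZE;
		pos_start = pos + 1;
	}
	if (pos > pos_start) {
		/* Remap whatever ram run is left at the tail. */
		map_size = (pos - pos_start) << SKETCH_PAGE_SHIFT;
		if (sketch_remap(from + len, pos_start, map_size))
			return -1;
	}
	return 0;
}
```

With a six-page request starting at pfn 100, the sketch issues four requests: one for pfns 100-101, two single zero-page mappings, and one for pfns 104-105. On bare metal, where every page is ram, it would degenerate to a single request, which is why the patch keeps the direct path there.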
Re: [PATCH v5] mmap_vmcore: skip non-ram pages reported by hypervisors
HATAYAMA, Daisuke d.hatay...@jp.fujitsu.com writes:

> (2014/07/14 18:16), Vitaly Kuznetsov wrote:
>> We have a special check in the read_vmcore() handler to check whether the
>> page was reported as ram or not by the hypervisor (pfn_is_ram()). However,
>> when vmcore is read with mmap() no such check is performed. That can lead
>> to unpredictable results, e.g. in a Xen PVHVM guest, memcpy() after mmap()
>> on /proc/vmcore will hang processing HVMMEM_mmio_dm pages, creating
>> enormous load in both DomU and Dom0.
>>
>> Fix the issue by mapping each non-ram page to the zero page. Keep the
>> direct path with remap_oldmem_pfn_range() to avoid looping through all
>> pages on bare metal.
>>
>> The issue can also be solved by overriding remap_oldmem_pfn_range() in
>> xen-specific code, which is what remap_oldmem_pfn_range() was designed
>> for. That, however, would involve a non-obvious xen code path for all x86
>> builds with CONFIG_XEN_PVHVM=y and would prevent all other
>> hypervisor-specific code on the x86 arch from doing the same override.
>>
>> Changes from v4:
>> - change map_size type size_t -> unsigned long
>> - use prot instead of vma->vm_page_prot inside remap_oldmem_pfn_checked()
>>
>> Changes from v3:
>> - multi line comment style changes
>> - minor code style changes
>>
>> Changes from v2:
>> - make remap_oldmem_pfn_checked() interface exactly match
>>   remap_oldmem_pfn_range()
>> - unmap mapped part inside remap_oldmem_pfn_checked() in case of failure
>>   so we don't need to take care of it in mmap_vmcore()
>> - create vmcore_remap_oldmem_pfn() wrapper
>>
>> Changes from v1:
>> - comment style changes
>> - change remap_oldmem_pfn_checked() interface to more closely match the
>>   remap_oldmem_pfn() interface
>> - preserve formal parameters within the loop, make the loop conditions
>>   easier to understand
>> - use my_zero_pfn() for the zero page
>> - return remapped length instead of new offset
>>
>> Reviewed-by: Andrew Jones drjo...@redhat.com
>> Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
>> ---
>>  fs/proc/vmcore.c | 83 ++--
>>  1 file changed, 80 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
>> index 382aa89..1f77f35 100644
>> --- a/fs/proc/vmcore.c
>> +++ b/fs/proc/vmcore.c
>> @@ -328,6 +328,83 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
>>   * virtually contiguous user-space in ELF layout.
>>   */
>>  #ifdef CONFIG_MMU
>> +/*
>> + * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages
>> + * reported as not being ram with the zero page.
>> + *
>> + * @vma: vm_area_struct describing requested mapping
>> + * @from: start remapping from
>> + * @pfn: page frame number to start remapping to
>> + * @size: remapping size
>> + * @prot: protection bits
>> + *
>> + * Returns zero on success, -EAGAIN on failure.
>> + */
>> +int remap_oldmem_pfn_checked(struct vm_area_struct *vma, unsigned long from,
>> +			     unsigned long pfn, unsigned long size,
>> +			     pgprot_t prot)
>> +{
>> +	unsigned long map_size;
>> +	unsigned long pos_start, pos_end, pos;
>> +	unsigned long zeropage_pfn = my_zero_pfn(0);
>> +	u64 len = 0;
>
> Sorry, I missed this yesterday.

Thanks for your review!

> This should also be fixed as size_t or unsigned long. Does a 32-bit
> compiler warn about this at the call of do_munmap() below due to the
> difference in bit length of the two types?

Mine doesn't. But you're right, it makes sense to make it match
do_munmap()'s interface, and len there is size_t. I'll send v6 with this
change.

>> +
>> +	pos_start = pfn;
>> +	pos_end = pfn + (size >> PAGE_SHIFT);
>> +
>> +	for (pos = pos_start; pos < pos_end; ++pos) {
>> +		if (!pfn_is_ram(pos)) {
>> +			/*
>> +			 * We hit a page which is not ram. Remap the continuous
>> +			 * region between pos_start and pos-1 and replace
>> +			 * the non-ram page at pos with the zero page.
>> +			 */
>> +			if (pos > pos_start) {
>> +				/* Remap continuous region */
>> +				map_size = (pos - pos_start) << PAGE_SHIFT;
>> +				if (remap_oldmem_pfn_range(vma, from + len,
>> +							   pos_start, map_size,
>> +							   prot))
>> +					goto fail;
>> +				len += map_size;
>> +			}
>> +			/* Remap the zero page */
>> +			if (remap_oldmem_pfn_range(vma, from + len,
>> +						   zeropage_pfn,
>> +						   PAGE_SIZE, prot))
>> +				goto fail;
>> +			len += PAGE_SIZE;
>> +			pos_start = pos + 1;
>> +		}
>> +	}
>> +	if (pos > pos_start) {
>> +		/* Remap the rest */
>> +		map_size = (pos - pos_start) << PAGE_SHIFT;
>> +		if (remap_oldmem_pfn_range(vma, from
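The type question raised in the review can be illustrated without a 32-bit toolchain. The sketch below is an assumption-laden model, not kernel code: `sketch_size32_t` is a hypothetical stand-in for size_t on a 32-bit build, and the callee merely mimics a function such as do_munmap() whose length parameter is size_t. C permits the implicit narrowing, and compilers typically stay silent about it unless a conversion warning such as -Wconversion is requested, which would be consistent with the "Mine doesn't" above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for size_t as it would be on a 32-bit build. */
typedef uint32_t sketch_size32_t;

/* Mimics a callee whose length parameter is size_t; returns what it received. */
sketch_size32_t sketch_len_seen_by_callee(sketch_size32_t len)
{
	return len;
}

/* Passes a 64-bit length straight through; the narrowing here is implicit. */
sketch_size32_t sketch_pass_u64_len(uint64_t len)
{
	return sketch_len_seen_by_callee(len);
}
```

A length that fits in 32 bits arrives unchanged, while anything above 4 GiB is silently reduced modulo 2^32. Declaring len with the callee's own type, as v6 does, removes the mismatch at the source.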
[PATCH v6] mmap_vmcore: skip non-ram pages reported by hypervisors
We have a special check in the read_vmcore() handler to check whether the page
was reported as ram or not by the hypervisor (pfn_is_ram()). However, when
vmcore is read with mmap() no such check is performed. That can lead to
unpredictable results, e.g. in a Xen PVHVM guest, memcpy() after mmap() on
/proc/vmcore will hang processing HVMMEM_mmio_dm pages, creating enormous load
in both DomU and Dom0.

Fix the issue by mapping each non-ram page to the zero page. Keep the direct
path with remap_oldmem_pfn_range() to avoid looping through all pages on bare
metal.

The issue can also be solved by overriding remap_oldmem_pfn_range() in
xen-specific code, which is what remap_oldmem_pfn_range() was designed for.
That, however, would involve a non-obvious xen code path for all x86 builds
with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific code
on the x86 arch from doing the same override.

Changes from v5:
- make len size_t to match do_munmap() interface

Changes from v4:
- change map_size type size_t -> unsigned long
- use prot instead of vma->vm_page_prot inside remap_oldmem_pfn_checked()

Changes from v3:
- multi line comment style changes
- minor code style changes

Changes from v2:
- make remap_oldmem_pfn_checked() interface exactly match
  remap_oldmem_pfn_range()
- unmap mapped part inside remap_oldmem_pfn_checked() in case of failure
  so we don't need to take care of it in mmap_vmcore()
- create vmcore_remap_oldmem_pfn() wrapper

Changes from v1:
- comment style changes
- change remap_oldmem_pfn_checked() interface to more closely match the
  remap_oldmem_pfn() interface
- preserve formal parameters within the loop, make the loop conditions
  easier to understand
- use my_zero_pfn() for the zero page
- return remapped length instead of new offset

Reviewed-by: Andrew Jones drjo...@redhat.com
Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 fs/proc/vmcore.c | 83 ++--
 1 file changed, 80 insertions(+), 3 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 382aa89..fa45923 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -328,6 +328,83 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
  * virtually contiguous user-space in ELF layout.
  */
 #ifdef CONFIG_MMU
+/*
+ * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages
+ * reported as not being ram with the zero page.
+ *
+ * @vma: vm_area_struct describing requested mapping
+ * @from: start remapping from
+ * @pfn: page frame number to start remapping to
+ * @size: remapping size
+ * @prot: protection bits
+ *
+ * Returns zero on success, -EAGAIN on failure.
+ */
+int remap_oldmem_pfn_checked(struct vm_area_struct *vma, unsigned long from,
+			     unsigned long pfn, unsigned long size,
+			     pgprot_t prot)
+{
+	unsigned long map_size;
+	unsigned long pos_start, pos_end, pos;
+	unsigned long zeropage_pfn = my_zero_pfn(0);
+	size_t len = 0;
+
+	pos_start = pfn;
+	pos_end = pfn + (size >> PAGE_SHIFT);
+
+	for (pos = pos_start; pos < pos_end; ++pos) {
+		if (!pfn_is_ram(pos)) {
+			/*
+			 * We hit a page which is not ram. Remap the continuous
+			 * region between pos_start and pos-1 and replace
+			 * the non-ram page at pos with the zero page.
+			 */
+			if (pos > pos_start) {
+				/* Remap continuous region */
+				map_size = (pos - pos_start) << PAGE_SHIFT;
+				if (remap_oldmem_pfn_range(vma, from + len,
+							   pos_start, map_size,
+							   prot))
+					goto fail;
+				len += map_size;
+			}
+			/* Remap the zero page */
+			if (remap_oldmem_pfn_range(vma, from + len,
+						   zeropage_pfn,
+						   PAGE_SIZE, prot))
+				goto fail;
+			len += PAGE_SIZE;
+			pos_start = pos + 1;
+		}
+	}
+	if (pos > pos_start) {
+		/* Remap the rest */
+		map_size = (pos - pos_start) << PAGE_SHIFT;
+		if (remap_oldmem_pfn_range(vma, from + len, pos_start,
+					   map_size, prot))
+			goto fail;
+		len += map_size;
+	}
+	return 0;
+fail:
+	do_munmap(vma->vm_mm, from, len);
+	return -EAGAIN;
+}
+
+int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma,
+			    unsigned long from, unsigned long pfn,
+			    unsigned long size
[PATCH RFC 4/4] xen/pvhvm: Make MSI IRQs work after kexec
When kexec is performed, MSI IRQs for passed-through devices are already
mapped, and we see a non-zero pirq extracted from the MSI msg.
xen_irq_from_pirq() fails because we have no IRQ mapping information for it.
Requesting a new mapping with __write_msi_msg() does not result in the MSI IRQ
being remapped, so we don't receive these IRQs.

RFC: I wasn't able to understand why commit af42b8d1, which introduced the
xen_irq_from_pirq() check in xen_hvm_setup_msi_irqs(), checks that instead of
checking pirq > 0: if the mapping was already done (and we have pirq > 0
here), we don't need to request a new pirq. We're losing the existing PIRQ,
and I'm also not sure when __write_msi_msg() with a new PIRQ will result in a
new mapping.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/pci/xen.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 905956f..685e8f1 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -231,8 +231,7 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 		__read_msi_msg(msidesc, &msg);
 		pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
 			((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
-		if (msg.data != XEN_PIRQ_MSI_DATA ||
-		    xen_irq_from_pirq(pirq) < 0) {
+		if (msg.data != XEN_PIRQ_MSI_DATA || pirq <= 0) {
 			pirq = xen_allocate_pirq_msi(dev, msidesc);
 			if (pirq < 0) {
 				irq = -ENODEV;
-- 
1.9.3
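To make the pirq extraction and the proposed condition easier to follow, here is a small user-space sketch. It is not the kernel code: the shift of 12 and the 0xffffff00 mask are assumptions modelled on the x86 MSI address layout, the encode helpers are hypothetical inverses written only so the decode has something to work on, and the comparison with XEN_PIRQ_MSI_DATA is reduced to a boolean parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: low 8 bits of the pirq sit in address_lo, the rest in address_hi. */
#define SKETCH_DEST_ID_SHIFT 12
#define SKETCH_EXT_DEST_ID(x) ((x) & 0xffffff00u)

/* Hypothetical encoders, the inverse of the decode used in the patch context. */
uint32_t sketch_encode_lo(uint32_t pirq)
{
	return (pirq & 0xffu) << SKETCH_DEST_ID_SHIFT;
}

uint32_t sketch_encode_hi(uint32_t pirq)
{
	return SKETCH_EXT_DEST_ID(pirq);
}

/* Same expression shape as the pirq computation in xen_hvm_setup_msi_irqs(). */
uint32_t sketch_decode_pirq(uint32_t address_hi, uint32_t address_lo)
{
	return SKETCH_EXT_DEST_ID(address_hi) |
	       ((address_lo >> SKETCH_DEST_ID_SHIFT) & 0xffu);
}

/* Condition proposed by the RFC: allocate only if the message carries no usable pirq. */
bool sketch_needs_new_pirq(bool data_is_xen_pirq_marker, int pirq)
{
	return !data_is_xen_pirq_marker || pirq <= 0;
}
```

Under these assumptions a message left over from the previous kernel decodes to its old non-zero pirq, and the proposed condition reuses it instead of allocating a fresh one. Whether reusing it is always safe is exactly the open question of this RFC.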
[PATCH RFC 0/4] xen/pvhvm: fix shared_info and pirq issues with kexec
With this patch series I'm trying to address several issues with kexec on
pvhvm:
- shared_info issue (1st patch, just sending Olaf's work with Konrad's fix)
- create specific pvhvm shutdown handler for kexec (2nd patch)
- GSI PIRQ issue (3rd patch, I'm pretty confident that it does the right
  thing)
- MSI PIRQ issue (4th patch, and I'm not sure it doesn't break anything - RFC)

This patch series can be tested on a single vCPU guest. We still have SMP
issues with pvhvm guests and kexec which require additional fixes.

Olaf Hering (1):
  xen PVonHVM: use E820_Reserved area for shared_info

Vitaly Kuznetsov (3):
  xen/pvhvm: Introduce xen_pvhvm_kexec_shutdown()
  xen/pvhvm: Unmap all PIRQs on startup and shutdown
  xen/pvhvm: Make MSI IRQs work after kexec

 arch/x86/pci/xen.c               |  3 +-
 arch/x86/xen/enlighten.c         | 83 +++-
 arch/x86/xen/smp.c               | 10 +
 arch/x86/xen/smp.h               |  1 +
 arch/x86/xen/suspend.c           |  2 +-
 arch/x86/xen/xen-ops.h           |  2 +-
 drivers/xen/events/events_base.c | 76
 include/xen/events.h             |  3 ++
 8 files changed, 158 insertions(+), 22 deletions(-)

-- 
1.9.3
[PATCH RFC 1/4] xen PVonHVM: use E820_Reserved area for shared_info
From: Olaf Hering o...@aepfle.de

This is a respin of 00e37bdb0113a98408de42db85be002f21dbffd3
(xen PVonHVM: move shared_info to MMIO before kexec).

Currently kexec in a PVonHVM guest fails with a triple fault because the new
kernel overwrites the shared info page. The exact failure depends on the size
of the kernel image. This patch moves the pfn from RAM into an E820 reserved
memory area.

The pfn containing the shared_info is located somewhere in RAM. This will
cause trouble if the current kernel is doing a kexec boot into a new kernel.
The new kernel (and its startup code) can not know where the pfn is, so it can
not reserve the page. The hypervisor will continue to update the pfn, and as a
result memory corruption occurs in the new kernel.

The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the E820
map. Within that range newer toolstacks (4.3+) will keep 1MB starting from
FE700000 as reserved for guest use. Older Xen4 toolstacks will usually not
allocate areas up to FE700000, so FE700000 is expected to work also with older
toolstacks.

In Xen3 there is no reserved area at a fixed location. If the guest is started
on such old hosts the shared_info page will be placed in RAM. As a result
kexec can not be used.

Signed-off-by: Olaf Hering o...@aepfle.de
Signed-off-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
(cherry picked from commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f)
[On resume we need to reset the xen_vcpu_info, which the original patch
did not do]
Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/xen/enlighten.c | 74
 arch/x86/xen/suspend.c   |  2 +-
 arch/x86/xen/xen-ops.h   |  2 +-
 3 files changed, 58 insertions(+), 20 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index ffb101e..a11af62 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1726,23 +1726,29 @@ asmlinkage __visible void __init xen_start_kernel(void)
 #endif
 }
 
-void __ref xen_hvm_init_shared_info(void)
+#ifdef CONFIG_XEN_PVHVM
+#define HVM_SHARED_INFO_ADDR 0xFE700000UL
+static struct shared_info *xen_hvm_shared_info;
+static unsigned long xen_hvm_sip_phys;
+static int xen_major, xen_minor;
+
+static void xen_hvm_connect_shared_info(unsigned long pfn)
 {
-	int cpu;
 	struct xen_add_to_physmap xatp;
-	static struct shared_info *shared_info_page = 0;
 
-	if (!shared_info_page)
-		shared_info_page = (struct shared_info *)
-			extend_brk(PAGE_SIZE, PAGE_SIZE);
 	xatp.domid = DOMID_SELF;
 	xatp.idx = 0;
 	xatp.space = XENMAPSPACE_shared_info;
-	xatp.gpfn = __pa(shared_info_page) >> PAGE_SHIFT;
+	xatp.gpfn = pfn;
 	if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp))
 		BUG();
 
-	HYPERVISOR_shared_info = (struct shared_info *)shared_info_page;
+}
+static void __init xen_hvm_set_shared_info(struct shared_info *sip)
+{
+	int cpu;
+
+	HYPERVISOR_shared_info = sip;
 
 	/* xen_vcpu is a pointer to the vcpu_info struct in the shared_info
 	 * page, we use it in the event channel upcall and in some pvclock
@@ -1760,20 +1766,39 @@ void __ref xen_hvm_init_shared_info(void)
 	}
 }
 
-#ifdef CONFIG_XEN_PVHVM
+/* Reconnect the shared_info pfn to a (new) mfn */
+void xen_hvm_resume_shared_info(void)
+{
+	xen_hvm_connect_shared_info(xen_hvm_sip_phys >> PAGE_SHIFT);
+	xen_hvm_set_shared_info(xen_hvm_shared_info);
+}
+
+/* Xen tools prior to Xen 4 do not provide a E820_Reserved area for guest usage.
+ * On these old tools the shared info page will be placed in E820_Ram.
+ * Xen 4 provides a E820_Reserved area at 0xFC000000, and this code expects
+ * that nothing is mapped up to HVM_SHARED_INFO_ADDR.
+ * Xen 4.3+ provides an explicit 1MB area at HVM_SHARED_INFO_ADDR which is used
+ * here for the shared info page. */
+static void __init xen_hvm_init_shared_info(void)
+{
+	if (xen_major < 4) {
+		xen_hvm_shared_info = extend_brk(PAGE_SIZE, PAGE_SIZE);
+		xen_hvm_sip_phys = __pa(xen_hvm_shared_info);
+	} else {
+		xen_hvm_sip_phys = HVM_SHARED_INFO_ADDR;
+		set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_hvm_sip_phys);
+		xen_hvm_shared_info =
+		(struct shared_info *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
+	}
+	xen_hvm_resume_shared_info();
+}
+
 static void __init init_hvm_pv_info(void)
 {
-	int major, minor;
-	uint32_t eax, ebx, ecx, edx, pages, msr, base;
+	uint32_t ecx, edx, pages, msr, base;
 	u64 pfn;
 
 	base = xen_cpuid_base();
-	cpuid(base + 1, &eax, &ebx, &ecx, &edx);
-
-	major = eax >> 16;
-	minor = eax & 0xffff;
-	printk(KERN_INFO "Xen version %d.%d.\n", major, minor);
-
 	cpuid(base + 2, &pages, &msr, &ecx, &edx);
 
 	pfn = __pa(hypercall_page);
@@ -1828,10 +1853,23 @@ static void __init
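The placement decision in this patch depends only on the Xen major version, which comes out of a CPUID register as two 16-bit halves. The following user-space sketch shows that decision in isolation. All `sketch_` names are hypothetical, and both candidate addresses are passed in as parameters rather than hard-coded, so nothing here should be read as the actual kernel constants.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_xen_version {
	unsigned int major;
	unsigned int minor;
};

/* Split the version register: major in the upper 16 bits, minor in the lower 16. */
struct sketch_xen_version sketch_decode_version(uint32_t eax)
{
	struct sketch_xen_version v = { eax >> 16, eax & 0xffffu };
	return v;
}

/*
 * Pick where the shared info page lives: pre-Xen-4 toolstacks offer no
 * reserved area, so the page stays in RAM; otherwise use the reserved address.
 */
unsigned long sketch_shared_info_phys(unsigned int major,
				      unsigned long ram_phys,
				      unsigned long reserved_phys)
{
	return major < 4 ? ram_phys : reserved_phys;
}
```

Only the second branch gives a location that a kexec'ed kernel can rely on without being told about it, which is why the commit message says kexec can not be used on Xen 3 hosts.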
[PATCH RFC 2/4] xen/pvhvm: Introduce xen_pvhvm_kexec_shutdown()
A PVHVM guest requires special actions before kexec. Register a specific
xen_pvhvm_kexec_shutdown() handler for machine_ops.shutdown().

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/xen/enlighten.c | 9 +
 arch/x86/xen/smp.c       | 9 +
 arch/x86/xen/smp.h       | 1 +
 3 files changed, 19 insertions(+)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index a11af62..8074e4a 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1833,6 +1833,12 @@ static struct notifier_block xen_hvm_cpu_notifier = {
 	.notifier_call = xen_hvm_cpu_notify,
 };
 
+static void xen_pvhvm_kexec_shutdown(void)
+{
+	xen_kexec_shutdown();
+	native_machine_shutdown();
+}
+
 static void __init xen_hvm_guest_init(void)
 {
 	init_hvm_pv_info();
@@ -1849,6 +1855,9 @@ static void __init xen_hvm_guest_init(void)
 	x86_init.irqs.intr_init = xen_init_IRQ;
 	xen_hvm_init_time_ops();
 	xen_hvm_init_mmu_ops();
+#ifdef CONFIG_KEXEC
+	machine_ops.shutdown = xen_pvhvm_kexec_shutdown;
+#endif
 }
 
 static uint32_t __init xen_hvm_platform(void)
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 7005974..35dcf39 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -18,6 +18,7 @@
 #include <linux/smp.h>
 #include <linux/irq_work.h>
 #include <linux/tick.h>
+#include <linux/kexec.h>
 
 #include <asm/paravirt.h>
 #include <asm/desc.h>
@@ -762,6 +763,14 @@ static void xen_hvm_cpu_die(unsigned int cpu)
 	native_cpu_die(cpu);
 }
 
+void xen_kexec_shutdown(void)
+{
+#ifdef CONFIG_KEXEC
+	if (!kexec_in_progress)
+		return;
+#endif
+}
+
 void __init xen_hvm_smp_init(void)
 {
 	if (!xen_have_vector_callback)
diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h
index c7c2d89..1af0493 100644
--- a/arch/x86/xen/smp.h
+++ b/arch/x86/xen/smp.h
@@ -8,4 +8,5 @@ extern void xen_send_IPI_allbutself(int vector);
 extern void xen_send_IPI_all(int vector);
 extern void xen_send_IPI_self(int vector);
 
+extern void xen_kexec_shutdown(void);
 #endif
-- 
1.9.3
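The hook mechanism used here is an ordinary function-pointer override: the guest init code swaps the shutdown callback for a wrapper that does the Xen-specific work first and then falls through to the native handler. A user-space sketch of that pattern follows. The ops structure, the counters and every `sketch_` name are hypothetical stand-ins for machine_ops, kexec_in_progress and the real handlers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature of an ops table with one overridable hook. */
struct sketch_machine_ops {
	void (*shutdown)(void);
};

static int sketch_native_calls;      /* how often the native handler ran */
static int sketch_xen_cleanups;      /* how often the kexec-only cleanup ran */
static bool sketch_kexec_in_progress;

void sketch_native_shutdown(void)
{
	sketch_native_calls++;
}

/* Does its work only when a kexec is actually under way. */
void sketch_xen_kexec_shutdown(void)
{
	if (!sketch_kexec_in_progress)
		return;
	sketch_xen_cleanups++;
}

/* Wrapper in the spirit of xen_pvhvm_kexec_shutdown(): Xen part first, then native. */
void sketch_pvhvm_shutdown(void)
{
	sketch_xen_kexec_shutdown();
	sketch_native_shutdown();
}

struct sketch_machine_ops sketch_ops = { .shutdown = sketch_native_shutdown };
```

Because the wrapper always ends in the native handler, a plain reboot or halt behaves as before, and the extra work is confined to the kexec path.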
[PATCH RFC 3/4] xen/pvhvm: Unmap all PIRQs on startup and shutdown
When kexec is being run, PIRQs from Qemu-emulated devices are still mapped to
old event channels and the new kernel has no information about that. Trying to
map them twice results in the following in Xen's dmesg:

 (XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
 (XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
 (XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
 ...

and the following in the new kernel's dmesg:

 [ 92.286796] xen:events: Failed to obtain physical IRQ 4

The result is that the new kernel doesn't receive IRQs for Qemu-emulated
devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown
when kexec was requested and on every kernel startup. We need to do this twice
to deal with the following issues:
- startup-time unmapping is required to make kdump work;
- shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
- shutdown-time unmapping is required to make Qemu-emulated NICs work after
  kexec (event channel is being closed on shutdown but no
  PHYSDEVOP_unmap_pirq is being performed).

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/xen/smp.c               |  1 +
 drivers/xen/events/events_base.c | 76
 include/xen/events.h             |  3 ++
 3 files changed, 80 insertions(+)

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 35dcf39..e2b4deb 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -768,6 +768,7 @@ void xen_kexec_shutdown(void)
 #ifdef CONFIG_KEXEC
 	if (!kexec_in_progress)
 		return;
+	xen_unmap_all_pirqs();
 #endif
 }
 
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index c919d3d..7701c7f 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1643,6 +1643,80 @@ void xen_callback_vector(void) {}
 static bool fifo_events = true;
 module_param(fifo_events, bool, 0);
 
+void xen_unmap_all_pirqs(void)
+{
+	int pirq, rc, gsi, irq, evtchn;
+	struct physdev_unmap_pirq unmap_irq;
+	struct irq_info *info;
+	struct evtchn_close close;
+
+	mutex_lock(&irq_mapping_update_lock);
+
+	list_for_each_entry(info, &xen_irq_list_head, list) {
+		if (info->type != IRQT_PIRQ)
+			continue;
+
+		pirq = info->u.pirq.pirq;
+		gsi = info->u.pirq.gsi;
+		evtchn = info->evtchn;
+		irq = info->irq;
+
+		pr_debug("unmapping pirq gsi=%d pirq=%d irq=%d evtchn=%d\n",
+			gsi, pirq, irq, evtchn);
+
+		if (evtchn > 0) {
+			close.port = evtchn;
+			if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
+							&close) != 0)
+				pr_warn("close evtchn %d failed\n", evtchn);
+		}
+
+		unmap_irq.pirq = pirq;
+		unmap_irq.domid = DOMID_SELF;
+
+		rc = HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
+		if (rc)
+			pr_warn("unmap pirq failed gsi=%d pirq=%d irq=%d rc=%d\n",
+				gsi, pirq, irq, rc);
+	}
+
+	mutex_unlock(&irq_mapping_update_lock);
+}
+EXPORT_SYMBOL_GPL(xen_unmap_all_pirqs);
+
+static void xen_startup_unmap_pirqs(void)
+{
+	struct evtchn_status status;
+	int port, rc = -ENOENT;
+	struct physdev_unmap_pirq unmap_irq;
+	struct evtchn_close close;
+
+	memset(&status, 0, sizeof(status));
+	for (port = 0; port < xen_evtchn_max_channels(); port++) {
+		status.dom = DOMID_SELF;
+		status.port = port;
+		rc = HYPERVISOR_event_channel_op(EVTCHNOP_status, &status);
+		if (rc < 0)
+			continue;
+		if (status.status == EVTCHNSTAT_pirq) {
+			close.port = port;
+			if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
+							&close) != 0)
+				pr_warn("xen: failed to close evtchn %d\n",
+					port);
+			unmap_irq.pirq = status.u.pirq;
+			unmap_irq.domid = DOMID_SELF;
+			pr_warn("xen: unmapping previously mapped pirq %d\n",
+				unmap_irq.pirq);
+			if (HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq,
+						  &unmap_irq) != 0)
+				pr_warn("xen: failed to unmap pirq %d\n",
+					unmap_irq.pirq);
+		}
+	}
+}
+
+
 void __init xen_init_IRQ(void)
 {
 	int ret = -EINVAL;
@@ -1671,6 +1745,8 @@ void __init xen_init_IRQ(void)
 	xen_callback_vector();
 
 	if (xen_hvm_domain()) {
+		xen_startup_unmap_pirqs
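The startup-time half of this patch is a linear scan: ask for the status of every event channel port, and for each port still bound to a PIRQ, close the port and unmap the PIRQ. The sketch below models that scan in user space against an in-memory table. The table, the status enum and all `sketch_` names are hypothetical; the two hypercalls are represented only by comments and by the state changes they are expected to cause.

```c
#include <assert.h>

/* Hypothetical port states, loosely modelled on event channel status values. */
enum sketch_port_status {
	SKETCH_CLOSED,
	SKETCH_INTERDOMAIN,
	SKETCH_PIRQ,
};

struct sketch_port {
	enum sketch_port_status status;
	int pirq; /* meaningful only when status == SKETCH_PIRQ */
};

/*
 * Walk all ports; for every port left bound to a pirq by the previous kernel,
 * close it and note the pirq that would be unmapped. Returns how many pirqs
 * were recorded in 'unmapped' (at most 'max').
 */
int sketch_unmap_stale_pirqs(struct sketch_port *ports, int nports,
			     int *unmapped, int max)
{
	int port, n = 0;

	for (port = 0; port < nports; port++) {
		if (ports[port].status != SKETCH_PIRQ)
			continue;
		/* Stands in for the EVTCHNOP_close hypercall. */
		ports[port].status = SKETCH_CLOSED;
		/* Stands in for the PHYSDEVOP_unmap_pirq hypercall. */
		if (n < max)
			unmapped[n++] = ports[port].pirq;
	}
	return n;
}
```

Ports of other kinds are left alone, so only the bindings inherited from the previous kernel are torn down before the new kernel sets up its own.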
[PATCH v7] mmap_vmcore: skip non-ram pages reported by hypervisors
We have a special check in the read_vmcore() handler to check whether the page
was reported as ram or not by the hypervisor (pfn_is_ram()). However, when
vmcore is read with mmap() no such check is performed. That can lead to
unpredictable results, e.g. in a Xen PVHVM guest, memcpy() after mmap() on
/proc/vmcore will hang processing HVMMEM_mmio_dm pages, creating enormous load
in both DomU and Dom0.

Fix the issue by mapping each non-ram page to the zero page. Keep the direct
path with remap_oldmem_pfn_range() to avoid looping through all pages on bare
metal.

The issue can also be solved by overriding remap_oldmem_pfn_range() in
xen-specific code, which is what remap_oldmem_pfn_range() was designed for.
That, however, would involve a non-obvious xen code path for all x86 builds
with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific code
on the x86 arch from doing the same override.

Changes from v6:
- remove useless len increment when remapping the rest of the region

Changes from v5:
- make len size_t to match do_munmap() interface

Changes from v4:
- change map_size type size_t -> unsigned long
- use prot instead of vma->vm_page_prot inside remap_oldmem_pfn_checked()

Changes from v3:
- multi line comment style changes
- minor code style changes

Changes from v2:
- make remap_oldmem_pfn_checked() interface exactly match
  remap_oldmem_pfn_range()
- unmap mapped part inside remap_oldmem_pfn_checked() in case of failure
  so we don't need to take care of it in mmap_vmcore()
- create vmcore_remap_oldmem_pfn() wrapper

Changes from v1:
- comment style changes
- change remap_oldmem_pfn_checked() interface to more closely match the
  remap_oldmem_pfn() interface
- preserve formal parameters within the loop, make the loop conditions
  easier to understand
- use my_zero_pfn() for the zero page
- return remapped length instead of new offset

Reviewed-by: Andrew Jones drjo...@redhat.com
Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 fs/proc/vmcore.c | 82 +---
 1 file changed, 79 insertions(+), 3 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 382aa89..a18e651 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -328,6 +328,82 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
  * virtually contiguous user-space in ELF layout.
  */
 #ifdef CONFIG_MMU
+/*
+ * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages
+ * reported as not being ram with the zero page.
+ *
+ * @vma: vm_area_struct describing requested mapping
+ * @from: start remapping from
+ * @pfn: page frame number to start remapping to
+ * @size: remapping size
+ * @prot: protection bits
+ *
+ * Returns zero on success, -EAGAIN on failure.
+ */
+int remap_oldmem_pfn_checked(struct vm_area_struct *vma, unsigned long from,
+			     unsigned long pfn, unsigned long size,
+			     pgprot_t prot)
+{
+	unsigned long map_size;
+	unsigned long pos_start, pos_end, pos;
+	unsigned long zeropage_pfn = my_zero_pfn(0);
+	size_t len = 0;
+
+	pos_start = pfn;
+	pos_end = pfn + (size >> PAGE_SHIFT);
+
+	for (pos = pos_start; pos < pos_end; ++pos) {
+		if (!pfn_is_ram(pos)) {
+			/*
+			 * We hit a page which is not ram. Remap the continuous
+			 * region between pos_start and pos-1 and replace
+			 * the non-ram page at pos with the zero page.
+			 */
+			if (pos > pos_start) {
+				/* Remap continuous region */
+				map_size = (pos - pos_start) << PAGE_SHIFT;
+				if (remap_oldmem_pfn_range(vma, from + len,
+							   pos_start, map_size,
+							   prot))
+					goto fail;
+				len += map_size;
+			}
+			/* Remap the zero page */
+			if (remap_oldmem_pfn_range(vma, from + len,
+						   zeropage_pfn,
+						   PAGE_SIZE, prot))
+				goto fail;
+			len += PAGE_SIZE;
+			pos_start = pos + 1;
+		}
+	}
+	if (pos > pos_start) {
+		/* Remap the rest */
+		map_size = (pos - pos_start) << PAGE_SHIFT;
+		if (remap_oldmem_pfn_range(vma, from + len, pos_start,
+					   map_size, prot))
+			goto fail;
+	}
+	return 0;
+fail:
+	do_munmap(vma->vm_mm, from, len);
+	return -EAGAIN;
+}
+
+int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma,
+			    unsigned long from, unsigned long
Re: [PATCH RFC 1/4] xen PVonHVM: use E820_Reserved area for shared_info
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

> On Tue, Jul 15, 2014 at 03:40:37PM +0200, Vitaly Kuznetsov wrote:
>> From: Olaf Hering o...@aepfle.de
>>
>> This is a respin of 00e37bdb0113a98408de42db85be002f21dbffd3
>> (xen PVonHVM: move shared_info to MMIO before kexec).
>>
>> Currently kexec in a PVonHVM guest fails with a triple fault because the
>> new kernel overwrites the shared info page. The exact failure depends on
>> the size of the kernel image. This patch moves the pfn from RAM into an
>> E820 reserved memory area.
>>
>> The pfn containing the shared_info is located somewhere in RAM. This will
>> cause trouble if the current kernel is doing a kexec boot into a new
>> kernel. The new kernel (and its startup code) can not know where the pfn
>> is, so it can not reserve the page. The hypervisor will continue to update
>> the pfn, and as a result memory corruption occurs in the new kernel.
>>
>> The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the
>> E820 map. Within that range newer toolstacks (4.3+) will keep 1MB starting
>> from FE700000 as reserved for guest use. Older Xen4 toolstacks will
>> usually not allocate areas up to FE700000, so FE700000 is expected to work
>> also with older toolstacks.
>>
>> In Xen3 there is no reserved area at a fixed location. If the guest is
>> started on such old hosts the shared_info page will be placed in RAM. As a
>> result kexec can not be used.
>
> So this looks right, the one thing that we really need to check is
> e9daff24a266307943457086533041bd971d0ef9:
>
>     This reverts commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f.
>
>     We are doing this b/c on 32-bit PVonHVM with older hypervisors
>     (Xen 4.1) it ends up botching up the start_info. This is bad b/c
>     we use it for the time keeping, and the timekeeping code loops
>     forever - as the version field never changes. Olaf says to
>     revert it, so lets do that.
>
> Could you kindly test that the migration on 32-bit PVHVM guests
> on older hypervisors works?

Sure, will do! Was there anything special about the setup, or would any
32-bit pvhvm guest migration (on a 64-bit hypervisor, I suppose) fail? I can
try checking both current and old versions to make sure the issue was
actually fixed.

>> Signed-off-by: Olaf Hering o...@aepfle.de
>> Signed-off-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
>> (cherry picked from commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f)
>> [On resume we need to reset the xen_vcpu_info, which the original patch
>> did not do]
>> Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
>> ---
>>  arch/x86/xen/enlighten.c | 74
>>  arch/x86/xen/suspend.c   |  2 +-
>>  arch/x86/xen/xen-ops.h   |  2 +-
>>  3 files changed, 58 insertions(+), 20 deletions(-)
>>
>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>> index ffb101e..a11af62 100644
>> --- a/arch/x86/xen/enlighten.c
>> +++ b/arch/x86/xen/enlighten.c
>> @@ -1726,23 +1726,29 @@ asmlinkage __visible void __init xen_start_kernel(void)
>>  #endif
>>  }
>>
>> -void __ref xen_hvm_init_shared_info(void)
>> +#ifdef CONFIG_XEN_PVHVM
>> +#define HVM_SHARED_INFO_ADDR 0xFE700000UL
>> +static struct shared_info *xen_hvm_shared_info;
>> +static unsigned long xen_hvm_sip_phys;
>> +static int xen_major, xen_minor;
>> +
>> +static void xen_hvm_connect_shared_info(unsigned long pfn)
>>  {
>> -	int cpu;
>>  	struct xen_add_to_physmap xatp;
>> -	static struct shared_info *shared_info_page = 0;
>>
>> -	if (!shared_info_page)
>> -		shared_info_page = (struct shared_info *)
>> -			extend_brk(PAGE_SIZE, PAGE_SIZE);
>>  	xatp.domid = DOMID_SELF;
>>  	xatp.idx = 0;
>>  	xatp.space = XENMAPSPACE_shared_info;
>> -	xatp.gpfn = __pa(shared_info_page) >> PAGE_SHIFT;
>> +	xatp.gpfn = pfn;
>>  	if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp))
>>  		BUG();
>>
>> -	HYPERVISOR_shared_info = (struct shared_info *)shared_info_page;
>> +}
>> +static void __init xen_hvm_set_shared_info(struct shared_info *sip)
>> +{
>> +	int cpu;
>> +
>> +	HYPERVISOR_shared_info = sip;
>>
>>  	/* xen_vcpu is a pointer to the vcpu_info struct in the shared_info
>>  	 * page, we use it in the event channel upcall and in some pvclock
>> @@ -1760,20 +1766,39 @@ void __ref xen_hvm_init_shared_info(void)
>>  	}
>>  }
>>
>> -#ifdef CONFIG_XEN_PVHVM
>> +/* Reconnect the shared_info pfn to a (new) mfn */
>> +void xen_hvm_resume_shared_info(void)
>> +{
>> +	xen_hvm_connect_shared_info(xen_hvm_sip_phys >> PAGE_SHIFT);
>> +	xen_hvm_set_shared_info(xen_hvm_shared_info);
>> +}
>> +
>> +/* Xen tools prior to Xen 4 do not provide a E820_Reserved area for guest usage.
>> + * On these old tools the shared info page will be placed in E820_Ram.
>> + * Xen 4 provides a E820_Reserved area at 0xFC000000, and this code expects
>> + * that nothing is mapped up to HVM_SHARED_INFO_ADDR.
>> + * Xen 4.3+ provides an explicit 1MB area at HVM_SHARED_INFO_ADDR which is used
>> + * here for the shared info page. */
>> +static void __init xen_hvm_init_shared_info(void)
>> +{
>> +	if (xen_major < 4
Re: [PATCH RFC 2/4] xen/pvhvm: Introduce xen_pvhvm_kexec_shutdown()
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

On Tue, Jul 15, 2014 at 03:40:38PM +0200, Vitaly Kuznetsov wrote:

PVHVM guest requires special actions before kexec. Register specific xen_pvhvm_kexec_shutdown() handler for machine_ops.shutdown().

This looks close to what I had sent as an RFC to you?

Yes, I stole that part from your RFC: VCPU_reset_cpu_info patch to call xen_unmap_all_pirqs() on shutdown. I'm looking at SMP issues your patches address as well.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/xen/enlighten.c | 9 +
 arch/x86/xen/smp.c       | 9 +
 arch/x86/xen/smp.h       | 1 +
 3 files changed, 19 insertions(+)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index a11af62..8074e4a 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1833,6 +1833,12 @@ static struct notifier_block xen_hvm_cpu_notifier = {
 	.notifier_call = xen_hvm_cpu_notify,
 };
 
+static void xen_pvhvm_kexec_shutdown(void)
+{
+	xen_kexec_shutdown();
+	native_machine_shutdown();
+}
+
 static void __init xen_hvm_guest_init(void)
 {
 	init_hvm_pv_info();
@@ -1849,6 +1855,9 @@ static void __init xen_hvm_guest_init(void)
 	x86_init.irqs.intr_init = xen_init_IRQ;
 	xen_hvm_init_time_ops();
 	xen_hvm_init_mmu_ops();
+#ifdef CONFIG_KEXEC
+	machine_ops.shutdown = xen_pvhvm_kexec_shutdown;
+#endif
 }
 
 static uint32_t __init xen_hvm_platform(void)
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 7005974..35dcf39 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -18,6 +18,7 @@
 #include <linux/smp.h>
 #include <linux/irq_work.h>
 #include <linux/tick.h>
+#include <linux/kexec.h>
 
 #include <asm/paravirt.h>
 #include <asm/desc.h>
@@ -762,6 +763,14 @@ static void xen_hvm_cpu_die(unsigned int cpu)
 	native_cpu_die(cpu);
 }
 
+void xen_kexec_shutdown(void)
+{
+#ifdef CONFIG_KEXEC
+	if (!kexec_in_progress)
+		return;
+#endif
+}
+
 void __init xen_hvm_smp_init(void)
 {
 	if (!xen_have_vector_callback)
diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h
index c7c2d89..1af0493 100644
--- a/arch/x86/xen/smp.h
+++ b/arch/x86/xen/smp.h
@@ -8,4 +8,5 @@ extern void xen_send_IPI_allbutself(int vector);
 extern void xen_send_IPI_all(int vector);
 extern void xen_send_IPI_self(int vector);
 
+extern void xen_kexec_shutdown(void);
 #endif
-- 
1.9.3

-- 
Vitaly
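The hook replacement in this patch can be tried out in userspace. What follows is only a rough sketch, not kernel code: every name with a `_sketch` suffix and both call counters are hypothetical stand-ins. It shows the ordering the patch sets up, with Xen-specific cleanup first (a no-op unless kexec was requested) and the native shutdown always second.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's machine_ops table. */
struct machine_ops_sketch {
	void (*shutdown)(void);
};

static bool kexec_in_progress_sketch; /* plays the role of kexec_in_progress */
static int xen_cleanup_calls;         /* how often Xen-specific cleanup ran */
static int native_shutdown_calls;     /* how often the native shutdown ran */

static void native_machine_shutdown_sketch(void)
{
	native_shutdown_calls++;
}

/* Xen-specific cleanup: does nothing unless a kexec has been requested. */
static void xen_kexec_shutdown_sketch(void)
{
	if (!kexec_in_progress_sketch)
		return;
	xen_cleanup_calls++;
}

/* Wrapper installed as the shutdown hook: Xen cleanup, then native path. */
static void xen_pvhvm_kexec_shutdown_sketch(void)
{
	xen_kexec_shutdown_sketch();
	native_machine_shutdown_sketch();
}

static struct machine_ops_sketch machine_ops_sketch = {
	.shutdown = native_machine_shutdown_sketch,
};

/* Guest init swaps in the wrapper, as the patch does under CONFIG_KEXEC. */
static void xen_hvm_guest_init_sketch(void)
{
	machine_ops_sketch.shutdown = xen_pvhvm_kexec_shutdown_sketch;
}
```

On a plain shutdown only the native counter moves; once the kexec flag is set, both do.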
Re: [PATCH RFC 3/4] xen/pvhvm: Unmap all PIRQs on startup and shutdown
David Vrabel david.vra...@citrix.com writes:

On 15/07/14 14:40, Vitaly Kuznetsov wrote:

When kexec is being run PIRQs from Qemu-emulated devices are still mapped to old event channels and new kernel has no information about that. Trying to map them twice results in the following in Xen's dmesg:

(XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
(XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
(XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
...

and the following in new kernel's dmesg:

[ 92.286796] xen:events: Failed to obtain physical IRQ 4

The result is that the new kernel doesn't receive IRQs for Qemu-emulated devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown when kexec was requested and on every kernel startup. We need to do this twice to deal with the following issues:
- startup-time unmapping is required to make kdump work;
- shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
- shutdown-time unmapping is required to make Qemu-emulated NICs work after kexec (event channel is being closed on shutdown but no PHYSDEVOP_unmap_pirq is being performed).

I think this should be done only in one place -- just prior to exec'ing the new kernel (including kdump kernels).

Thank you for your comments! The problem I'm fighting with atm is: with FIFO-based event channels we need to call evtchn_fifo_destroy() so the next EVTCHNOP_init_control won't fail. I intended to put evtchn_fifo_destroy() in EVTCHNOP_reset. That introduces a problem: we need to deal with store/console channels. It is possible to remap those from the guest with EVTCHNOP_bind_interdomain (if we remember where they were mapped before), but we can't do it after we did evtchn_fifo_destroy(), and we can't rebind them after kexec and performing EVTCHNOP_init_control as we can't remember where these channels were mapped to after kexec/kdump.
I see the following possible solutions:

1) We put evtchn_fifo_destroy() in EVTCHNOP_init_control so EVTCHNOP_init_control can be called twice. No EVTCHNOP_reset is required in that case.

2) Introduce a special hypercall (e.g. 'EVTCHNOP_fifo_destroy') to do evtchn_fifo_destroy() without closing all channels. Alternatively we can avoid closing all channels in EVTCHNOP_reset when called with DOMID_SELF (as this mode is not being used atm) -- but that would be non-obvious.

3) Keep evtchn_fifo_destroy() in EVTCHNOP_reset but keep console/store channels -- I saw your concerns that it is not safe; some sort of additional blocking will be required.

4) Do the remapping at boot time (query for store/console channels -> perform EVTCHNOP_reset -> rebind with EVTCHNOP_bind_interdomain).

There is an additional problem: the EVTCHNOP_bind_interdomain operation has the local port as an OUT parameter, so we can't guarantee that remapping store/console channels will remap them to the same local channel they were mapped to before EVTCHNOP_reset (and we have this information in hvm info: HVM_PARAM_CONSOLE_EVTCHN/HVM_PARAM_STORE_EVTCHN, ...). Not sure how to deal with that in case we go with remapping.

Your thoughts would be very much appreciated.
Thank you again,

--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -768,6 +768,7 @@ void xen_kexec_shutdown(void)
 #ifdef CONFIG_KEXEC
 	if (!kexec_in_progress)
 		return;
+	xen_unmap_all_pirqs();
 #endif
 }
 
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index c919d3d..7701c7f 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1643,6 +1643,80 @@ void xen_callback_vector(void) {}
 static bool fifo_events = true;
 module_param(fifo_events, bool, 0);
 
+void xen_unmap_all_pirqs(void)
+{
+	int pirq, rc, gsi, irq, evtchn;
+	struct physdev_unmap_pirq unmap_irq;
+	struct irq_info *info;
+	struct evtchn_close close;
+
+	mutex_lock(&irq_mapping_update_lock);
+
+	list_for_each_entry(info, &xen_irq_list_head, list) {
+		if (info->type != IRQT_PIRQ)
+			continue;

I think you need to do this by querying Xen state rather than relying on potentially bad guest state. Particularly since you may crash while holding irq_mapping_update_lock. EVTCHNOP_status gets you the info you need I think.

David

-- 
Vitaly
Re: [Xen-devel] [PATCH RFC 3/4] xen/pvhvm: Unmap all PIRQs on startup and shutdown
David Vrabel david.vra...@citrix.com writes:

On 29/07/14 14:50, Vitaly Kuznetsov wrote:

David Vrabel david.vra...@citrix.com writes:

On 15/07/14 14:40, Vitaly Kuznetsov wrote:

When kexec is being run PIRQs from Qemu-emulated devices are still mapped to old event channels and new kernel has no information about that. Trying to map them twice results in the following in Xen's dmesg:

(XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
(XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
(XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
...

and the following in new kernel's dmesg:

[ 92.286796] xen:events: Failed to obtain physical IRQ 4

The result is that the new kernel doesn't receive IRQs for Qemu-emulated devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown when kexec was requested and on every kernel startup. We need to do this twice to deal with the following issues:
- startup-time unmapping is required to make kdump work;
- shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
- shutdown-time unmapping is required to make Qemu-emulated NICs work after kexec (event channel is being closed on shutdown but no PHYSDEVOP_unmap_pirq is being performed).

I think this should be done only in one place -- just prior to exec'ing the new kernel (including kdump kernels).

Thank you for your comments! The problem I'm fighting with atm is: with FIFO-based event channels we need to call evtchn_fifo_destroy() so the next EVTCHNOP_init_control won't fail. I intended to put evtchn_fifo_destroy() in EVTCHNOP_reset. That introduces a problem: we need to deal with store/console channels. It is possible to remap those from the guest with EVTCHNOP_bind_interdomain (if we remember where they were mapped before), but we can't do it after we did evtchn_fifo_destroy(), and we can't rebind them after kexec and performing EVTCHNOP_init_control as we can't remember where these channels were mapped to after kexec/kdump.
I see the following possible solutions:

1) We put evtchn_fifo_destroy() in EVTCHNOP_init_control so EVTCHNOP_init_control can be called twice. No EVTCHNOP_reset is required in that case.

EVTCHNOP_init_control is called for each VCPU so I can't see how this would work.

Right, forgot about that...

2) Introduce a special hypercall (e.g. 'EVTCHNOP_fifo_destroy') to do evtchn_fifo_destroy() without closing all channels. Alternatively we can avoid closing all channels in EVTCHNOP_reset when called with DOMID_SELF (as this mode is not being used atm) -- but that would be non-obvious.

I would try this. The guest prior to kexec would then:

1. Use EVTCHNOP_status to query the remote end of the console and xenstore event channels.
2. Loop for all event channels:
   a. unmap pirq (if required)
   b. EVTCHNOP_close
3. EVTCHNOP_fifo_destroy (the implementation of which must verify that no channels are bound).
4. EVTCHNOP_bind_interdomain to rebind the console and xenstore channels.

Yea, that's what I have now when I put evtchn_fifo_destroy() in EVTCHNOP_reset. The problem here is: we can't do EVTCHNOP_bind_interdomain after we did evtchn_fifo_destroy(), we need to call EVTCHNOP_init_control first. And we'll do that only after kexec so we won't remember what we need to remap.. The second issue is the fact that EVTCHNOP_bind_interdomain will remap store/console channels to *some* local ports, not necessarily matching hvm info (HVM_PARAM_CONSOLE_EVTCHN/HVM_PARAM_STORE_EVTCHN)..

Would it be safe if, instead of closing interdomain channels on EVTCHNOP_fifo_destroy, we switch evtchn_port_ops to evtchn_port_ops_2l (so on EVTCHNOP_init_control after kexec we switch back)? I'll try prototyping this.

Thank you for your comments,

David

-- 
Vitaly
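The four-step sequence David proposes can be walked through with a toy model. This is only a sketch under loud assumptions: the `toy_*` functions are hypothetical and merely imitate the roles of EVTCHNOP_status, EVTCHNOP_close, the proposed EVTCHNOP_fifo_destroy and EVTCHNOP_bind_interdomain. They are not Xen's interfaces, and the model ignores the ordering constraint with EVTCHNOP_init_control raised in the reply. It does show the two properties under discussion: the destroy step refuses while anything is still bound, and rebinding hands back a local port picked by the callee.

```c
#define NR_PORTS 8

enum chan_state { CHAN_FREE, CHAN_INTERDOMAIN, CHAN_PIRQ };

struct chan {
	enum chan_state state;
	int remote_port; /* meaningful for CHAN_INTERDOMAIN */
	int pirq;        /* meaningful for CHAN_PIRQ */
};

/* Toy picture of the hypervisor-side state for one guest. */
struct toy_domain {
	struct chan ports[NR_PORTS];
	int fifo_active;
	int pirqs_unmapped;
};

/* Step 1: look up the remote end of an interdomain channel, or -1. */
static int toy_status_remote(const struct toy_domain *d, int port)
{
	if (port < 0 || port >= NR_PORTS ||
	    d->ports[port].state != CHAN_INTERDOMAIN)
		return -1;
	return d->ports[port].remote_port;
}

/* Step 2: for every channel, unmap the pirq if there is one, then close. */
static void toy_close_all(struct toy_domain *d)
{
	for (int p = 0; p < NR_PORTS; p++) {
		if (d->ports[p].state == CHAN_PIRQ)
			d->pirqs_unmapped++;
		d->ports[p].state = CHAN_FREE;
	}
}

/* Step 3: tear down FIFO state, refusing while any channel is bound. */
static int toy_fifo_destroy(struct toy_domain *d)
{
	for (int p = 0; p < NR_PORTS; p++)
		if (d->ports[p].state != CHAN_FREE)
			return -1;
	d->fifo_active = 0;
	return 0;
}

/* Step 4: rebind; the callee picks the lowest free local port above 0. */
static int toy_bind_interdomain(struct toy_domain *d, int remote_port)
{
	for (int p = 1; p < NR_PORTS; p++) {
		if (d->ports[p].state == CHAN_FREE) {
			d->ports[p].state = CHAN_INTERDOMAIN;
			d->ports[p].remote_port = remote_port;
			return p;
		}
	}
	return -1;
}
```

If the store channel started on local port 3, rebinding it in this model returns port 1, which is the mismatch with HVM_PARAM_STORE_EVTCHN that the reply worries about.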
Re: [PATCH RFC 4/4] xen/pvhvm: Make MSI IRQs work after kexec
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

On Tue, Jul 15, 2014 at 03:40:40PM +0200, Vitaly Kuznetsov wrote:

When kexec was performed MSI IRQs for passthrough-ed devices were already mapped and we see non-zero pirq extracted from MSI msg. xen_irq_from_pirq() fails as we have no IRQ mapping information for that. Requesting for new mapping with __write_msi_msg() does not result in MSI IRQ being remapped so we don't recieve these IRQs.

receive

Thanks for your comments!

How come '__write_msi_msg' does not result in new MSI IRQs?

Actually that was the hidden question in my RFC :-) Let me describe what I see.

When normal boot is performed we have the following in xen_hvm_setup_msi_irqs():

__read_msi_msg()
pirq -> 0

then we allocate new pirq with

pirq = xen_allocate_pirq_msi()
pirq -> 54

and we have the following mapping:

xen: msi --> pirq=54 --> irq=72

in 'xl debug-keys i':

(XEN)IRQ: 29 affinity:04 vec:b9 type=PCI-MSI status=0030 in-flight=0 domain-list=7: 54(),

After kexec we see the following:

__read_msi_msg()
pirq -> 54

but as xen_irq_from_pirq() fails we follow the same path allocating new pirq:

pirq = xen_allocate_pirq_msi()
pirq -> 55

and we have the following mapping:

xen: msi --> pirq=55 --> irq=75

However (afaict) mapping in xen wasn't updated:

in 'xl debug-keys i':

(XEN)IRQ: 29 affinity:02 vec:b9 type=PCI-MSI status=0030 in-flight=0 domain-list=7: 54(--M-),

Is it fair to state that your code ends up reading the MSI IRQ (PIRQ) from the device and updating the internal PIRQ-IRQ code to match with the reality?

Yea, 'always trust the device'.

RFC: I wasn't able to understand why commit af42b8d1 which introduced xen_irq_from_pirq() check in xen_hvm_setup_msi_irqs() is checking that instead of checking pirq > 0 as if the mapping was already done (and we have pirq>0 here) we don't need to request for a new pirq. We're losing existing PIRQ and I'm also not sure when __write_msi_msg() with new PIRQ will result in new mapping.

We don't request a new pirq. We end up returning before we call xen_allocate_pirq_msi. At least that is how the commit you mentioned worked.

I meant to say that in case we have pirq > 0 from __read_msi_msg() but xen_irq_from_pirq(pirq) fails (kexec-only case?) we always do xen_allocate_pirq_msi() which brings us new pirq.

In regards to why using 'xen_irq_from_pirq' instead of just checking the PIRQ - is that we might be called twice by a buggy driver. As such we want to check our PIRQ-IRQ to figure this out.

But if we're called twice we'll see the same pirq, right? Or there are some cases when we see 'crap' instead of pirq here? I think it would be nice to use the same pirq after kexec instead of allocating a new one even in case we can make remapping work.

Thanks for your comments again!

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/pci/xen.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 905956f..685e8f1 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -231,8 +231,7 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 		__read_msi_msg(msidesc, &msg);
 		pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
 			((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
-		if (msg.data != XEN_PIRQ_MSI_DATA ||
-		    xen_irq_from_pirq(pirq) < 0) {
+		if (msg.data != XEN_PIRQ_MSI_DATA || pirq <= 0) {
 			pirq = xen_allocate_pirq_msi(dev, msidesc);
 			if (pirq < 0) {
 				irq = -ENODEV;
-- 
1.9.3

-- 
Vitaly
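To make the pirq=54 case concrete, here is a userspace sketch of the decode and of the two conditions being compared. The three macro values are assumptions modelled on the x86 MSI definitions of that era and are not taken from this thread; `msi_msg_sketch` and the helper names are hypothetical. The `irq_from_pirq` argument stands in for the result of xen_irq_from_pirq(), which is negative after kexec because the new kernel has no mapping yet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define MSI_ADDR_DEST_ID_SHIFT  12
#define MSI_ADDR_EXT_DEST_ID(d) ((d) & 0xffffff00u)
#define XEN_PIRQ_MSI_DATA       0x4000u

struct msi_msg_sketch {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

/* Recover the pirq from the MSI message the way the quoted hunk does. */
static int pirq_from_msg(const struct msi_msg_sketch *msg)
{
	return (int)(MSI_ADDR_EXT_DEST_ID(msg->address_hi) |
		     ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff));
}

/* Existing check: allocate unless the guest already tracks this pirq. */
static bool needs_new_pirq_old(const struct msi_msg_sketch *msg,
			       int irq_from_pirq)
{
	return msg->data != XEN_PIRQ_MSI_DATA || irq_from_pirq < 0;
}

/* Proposed check: reuse any positive pirq found in the device. */
static bool needs_new_pirq_new(const struct msi_msg_sketch *msg)
{
	return msg->data != XEN_PIRQ_MSI_DATA || pirq_from_msg(msg) <= 0;
}
```

With a message that still carries pirq 54 after kexec, the existing check asks for a fresh pirq while the proposed one keeps 54; on a fresh boot (all fields zero) both allocate.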
Re: [PATCH RFC 3/4] xen/pvhvm: Unmap all PIRQs on startup and shutdown
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

On Tue, Jul 15, 2014 at 03:40:39PM +0200, Vitaly Kuznetsov wrote:

When kexec is being run PIRQs from Qemu-emulated devices are still mapped to old event channels and new kernel has no information about that. Trying to map them twice results in the following in Xen's dmesg:

(XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
(XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
(XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
...

and the following in new kernel's dmesg:

[ 92.286796] xen:events: Failed to obtain physical IRQ 4

The result is that the new kernel doesn't receive IRQs for Qemu-emulated devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown when kexec was requested and on every kernel startup. We need to do this twice to deal with the following issues:
- startup-time unmapping is required to make kdump work;
- shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
- shutdown-time unmapping is required to make Qemu-emulated NICs work after kexec (event channel is being closed on shutdown but no PHYSDEVOP_unmap_pirq is being performed).

How does this work when you boot the guest under Xen 4.4 where the FIFO events are used? Does it still work correctly?

Thanks for pointing that out! I've checked and it doesn't. However, the patches make no difference - the guest kernel gets stuck on boot with and without them. Will try to investigate...

Thanks.
Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/xen/smp.c               |  1 +
 drivers/xen/events/events_base.c | 76
 include/xen/events.h             |  3 ++
 3 files changed, 80 insertions(+)

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 35dcf39..e2b4deb 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -768,6 +768,7 @@ void xen_kexec_shutdown(void)
 #ifdef CONFIG_KEXEC
 	if (!kexec_in_progress)
 		return;
+	xen_unmap_all_pirqs();
 #endif
 }
 
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index c919d3d..7701c7f 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1643,6 +1643,80 @@ void xen_callback_vector(void) {}
 static bool fifo_events = true;
 module_param(fifo_events, bool, 0);
 
+void xen_unmap_all_pirqs(void)
+{
+	int pirq, rc, gsi, irq, evtchn;
+	struct physdev_unmap_pirq unmap_irq;
+	struct irq_info *info;
+	struct evtchn_close close;
+
+	mutex_lock(&irq_mapping_update_lock);
+
+	list_for_each_entry(info, &xen_irq_list_head, list) {
+		if (info->type != IRQT_PIRQ)
+			continue;
+
+		pirq = info->u.pirq.pirq;
+		gsi = info->u.pirq.gsi;
+		evtchn = info->evtchn;
+		irq = info->irq;
+
+		pr_debug("unmapping pirq gsi=%d pirq=%d irq=%d evtchn=%d\n",
+			gsi, pirq, irq, evtchn);
+
+		if (evtchn > 0) {
+			close.port = evtchn;
+			if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
+							&close) != 0)
+				pr_warn("close evtchn %d failed\n", evtchn);
+		}
+
+		unmap_irq.pirq = pirq;
+		unmap_irq.domid = DOMID_SELF;
+
+		rc = HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
+		if (rc)
+			pr_warn("unmap pirq failed gsi=%d pirq=%d irq=%d rc=%d\n",
+				gsi, pirq, irq, rc);
+	}
+
+	mutex_unlock(&irq_mapping_update_lock);
+}
+EXPORT_SYMBOL_GPL(xen_unmap_all_pirqs);

Why the EXPORT? Is this used by modules?

+
+static void xen_startup_unmap_pirqs(void)
+{
+	struct evtchn_status status;
+	int port, rc = -ENOENT;
+	struct physdev_unmap_pirq unmap_irq;
+	struct evtchn_close close;
+
+	memset(&status, 0, sizeof(status));
+	for (port = 0; port < xen_evtchn_max_channels(); port++) {
+		status.dom = DOMID_SELF;
+		status.port = port;
+		rc = HYPERVISOR_event_channel_op(EVTCHNOP_status, &status);
+		if (rc < 0)
+			continue;
+		if (status.status == EVTCHNSTAT_pirq) {
+			close.port = port;
+			if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
+							&close) != 0)
+				pr_warn("xen: failed to close evtchn %d\n",
+					port);
+			unmap_irq.pirq = status.u.pirq;
+			unmap_irq.domid = DOMID_SELF;
+			pr_warn("xen: unmapping previously mapped pirq %d\n",
+				unmap_irq.pirq);
+			if (HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq
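The startup-time pass differs from the shutdown-time one in where it gets its information: it asks the hypervisor about every port instead of walking the guest's own IRQ list, which is why it can work in a kdump kernel that inherited nothing. A toy model of that scan follows; `toy_xen`, `toy_port` and the function name are hypothetical and nothing here is a real hypercall.

```c
#define TOY_MAX_PORTS 16

enum toy_status { TOY_CLOSED, TOY_INTERDOMAIN, TOY_PIRQ };

struct toy_port {
	enum toy_status status;
	int pirq; /* meaningful only when status == TOY_PIRQ */
};

/* What the hypervisor still holds on behalf of the previous kernel. */
struct toy_xen {
	struct toy_port ports[TOY_MAX_PORTS];
};

/*
 * Walk every port; for each one still bound to a pirq, close the channel
 * and drop the pirq mapping. Other channel types are left untouched.
 * Returns the number of pirqs that were unmapped.
 */
static int toy_startup_unmap_pirqs(struct toy_xen *xen)
{
	int unmapped = 0;

	for (int port = 0; port < TOY_MAX_PORTS; port++) {
		if (xen->ports[port].status != TOY_PIRQ)
			continue;
		xen->ports[port].status = TOY_CLOSED;
		xen->ports[port].pirq = 0;
		unmapped++;
	}
	return unmapped;
}
```

Running the scan a second time finds nothing left to do, so calling it on every boot is harmless in this model.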
Re: [PATCH RFC 3/4] xen/pvhvm: Unmap all PIRQs on startup and shutdown
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

On Wed, Jul 16, 2014 at 11:37:10AM +0200, Vitaly Kuznetsov wrote:

Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

On Tue, Jul 15, 2014 at 03:40:39PM +0200, Vitaly Kuznetsov wrote:

When kexec is being run PIRQs from Qemu-emulated devices are still mapped to old event channels and new kernel has no information about that. Trying to map them twice results in the following in Xen's dmesg:

(XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
(XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
(XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
...

and the following in new kernel's dmesg:

[ 92.286796] xen:events: Failed to obtain physical IRQ 4

The result is that the new kernel doesn't receive IRQs for Qemu-emulated devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown when kexec was requested and on every kernel startup. We need to do this twice to deal with the following issues:
- startup-time unmapping is required to make kdump work;
- shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
- shutdown-time unmapping is required to make Qemu-emulated NICs work after kexec (event channel is being closed on shutdown but no PHYSDEVOP_unmap_pirq is being performed).

How does this work when you boot the guest under Xen 4.4 where the FIFO events are used? Does it still work correctly?

Thanks for pointing that out! I've checked and it doesn't. However, the patches make no difference - the guest kernel gets stuck on boot with and without them. Will try to investigate...

I think for FIFO events we can't do much right now - it would need some new hypercalls to de-allocate or such.

Yeah, you're probably right. I tried wrapping evtchn_fifo_destroy() into an 'EVTCHNOP_fifo_destroy' hypercall but it seems some other actions are required as well..
But I was thinking that your code logic could just return out when it detects that it is running with FIFO events (with a TODO comment) - and also spit out some information to this effect?

Sure, having TODO here is a good idea.

Say: Use xen.fifo=0 in your launching kernel

s,xen.fifo,xen.fifo_events,

(don't know the right name for the kernel in which you do 'kexec -e' in ? Is that launching? Original? Bootstrap kernel?)

Yes, if under Xen-4.4 I boot the original kernel with xen.fifo_events=0 I'm able to do kexec with xen.fifo_events=0 and even without it (but only once :-). Once a kernel is booted with FIFO-based event channels enabled no kexec is possible, the new kernel gets stuck (I guess vcpuop timer doesn't work..). My patch series makes no difference here..

Thanks,

Thanks.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/xen/smp.c               |  1 +
 drivers/xen/events/events_base.c | 76
 include/xen/events.h             |  3 ++
 3 files changed, 80 insertions(+)

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 35dcf39..e2b4deb 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -768,6 +768,7 @@ void xen_kexec_shutdown(void)
 #ifdef CONFIG_KEXEC
 	if (!kexec_in_progress)
 		return;
+	xen_unmap_all_pirqs();
 #endif
 }
 
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index c919d3d..7701c7f 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1643,6 +1643,80 @@ void xen_callback_vector(void) {}
 static bool fifo_events = true;
 module_param(fifo_events, bool, 0);
 
+void xen_unmap_all_pirqs(void)
+{
+	int pirq, rc, gsi, irq, evtchn;
+	struct physdev_unmap_pirq unmap_irq;
+	struct irq_info *info;
+	struct evtchn_close close;
+
+	mutex_lock(&irq_mapping_update_lock);
+
+	list_for_each_entry(info, &xen_irq_list_head, list) {
+		if (info->type != IRQT_PIRQ)
+			continue;
+
+		pirq = info->u.pirq.pirq;
+		gsi = info->u.pirq.gsi;
+		evtchn = info->evtchn;
+		irq = info->irq;
+
+		pr_debug("unmapping pirq gsi=%d pirq=%d irq=%d evtchn=%d\n",
+			gsi, pirq, irq, evtchn);
+
+		if (evtchn > 0) {
+			close.port = evtchn;
+			if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
+							&close) != 0)
+				pr_warn("close evtchn %d failed\n", evtchn);
+		}
+
+		unmap_irq.pirq = pirq;
+		unmap_irq.domid = DOMID_SELF;
+
+		rc = HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
+		if (rc)
+			pr_warn("unmap pirq failed gsi=%d pirq=%d irq=%d rc=%d\n",
+				gsi, pirq, irq, rc);
+	}
+
+	mutex_unlock(&irq_mapping_update_lock);
+}
+EXPORT_SYMBOL_GPL(xen_unmap_all_pirqs
Re: [PATCH RFC 4/4] xen/pvhvm: Make MSI IRQs work after kexec
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

On Wed, Jul 16, 2014 at 11:01:55AM +0200, Vitaly Kuznetsov wrote:

Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

On Tue, Jul 15, 2014 at 03:40:40PM +0200, Vitaly Kuznetsov wrote:

When kexec was performed MSI IRQs for passthrough-ed devices were already mapped and we see non-zero pirq extracted from MSI msg. xen_irq_from_pirq() fails as we have no IRQ mapping information for that. Requesting for new mapping with __write_msi_msg() does not result in MSI IRQ being remapped so we don't recieve these IRQs.

receive

Thanks for your comments!

Thank you for quick turnaround with the answers!

How come '__write_msi_msg' does not result in new MSI IRQs?

Actually that was the hidden question in my RFC :-) Let me describe what I see.

When normal boot is performed we have the following in xen_hvm_setup_msi_irqs():

__read_msi_msg()
pirq -> 0

then we allocate new pirq with

pirq = xen_allocate_pirq_msi()
pirq -> 54

and we have the following mapping:

xen: msi --> pirq=54 --> irq=72

in 'xl debug-keys i':

(XEN)IRQ: 29 affinity:04 vec:b9 type=PCI-MSI status=0030 in-flight=0 domain-list=7: 54(),

After kexec we see the following:

__read_msi_msg()
pirq -> 54

but as xen_irq_from_pirq() fails we follow the same path allocating new pirq:

pirq = xen_allocate_pirq_msi()
pirq -> 55

and we have the following mapping:

xen: msi --> pirq=55 --> irq=75

However (afaict) mapping in xen wasn't updated:

in 'xl debug-keys i':

(XEN)IRQ: 29 affinity:02 vec:b9 type=PCI-MSI status=0030 in-flight=0 domain-list=7: 54(--M-),

I am wondering if that is related to in QEMU traditional:

qemu-xen-trad: free all the pirqs for msi/msix when driver unloads

(which in the upstream QEMU is 1d4fd4f0e2fc5dcae0c60e00cc9af95f52988050)

If you have that patch in, is the PIRQ value correctly updated?

Thanks, that really works! I tested both kexec -e / kdump cases. I'm wondering if we still need my commit to work around non-fixed qemus?
Is it fair to state that your code ends up reading the MSI IRQ (PIRQ) from the device and updating the internal PIRQ-IRQ code to match with the reality?

Yea, 'always trust the device'.

RFC: I wasn't able to understand why commit af42b8d1 which introduced xen_irq_from_pirq() check in xen_hvm_setup_msi_irqs() is checking that instead of checking pirq > 0 as if the mapping was already done (and we have pirq>0 here) we don't need to request for a new pirq. We're losing existing PIRQ and I'm also not sure when __write_msi_msg() with new PIRQ will result in new mapping.

We don't request a new pirq. We end up returning before we call xen_allocate_pirq_msi. At least that is how the commit you mentioned worked.

I meant to say that in case we have pirq > 0 from __read_msi_msg() but xen_irq_from_pirq(pirq) fails (kexec-only case?) we always do xen_allocate_pirq_msi() which brings us new pirq.

In regards to why using 'xen_irq_from_pirq' instead of just checking the PIRQ - is that we might be called twice by a buggy driver. As such we want to check our PIRQ-IRQ to figure this out.

But if we're called twice we'll see the same pirq, right? Or there are some cases when we see 'crap' instead of pirq here?

Good point.

For PCI passthrough devices they will be zero until they are enabled. But I am not sure about the emulated devices, such as e1000 or such, which would also go through this path (I think - do we have MSI devices that we emulate in QEMU?)

AFAICT emulated e1000 doesn't use MSI (at least with qemu-traditional) and with my patch series it works after kexec.

I think it would be nice to use the same pirq after kexec instead of allocating a new one even in case we can make remapping work.

I concur. Stefano, do you recall why you used xen_irq_from_pirq instead of just trusting the 'pirq' value? Was it to workaround broken QEMU?

Thanks for your comments again!
Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/pci/xen.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 905956f..685e8f1 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -231,8 +231,7 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 		__read_msi_msg(msidesc, &msg);
 		pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
 			((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
-		if (msg.data != XEN_PIRQ_MSI_DATA ||
-		    xen_irq_from_pirq(pirq) < 0) {
+		if (msg.data != XEN_PIRQ_MSI_DATA || pirq <= 0) {
 			pirq = xen_allocate_pirq_msi(dev, msidesc);
 			if (pirq < 0) {
 				irq = -ENODEV;
-- 
1.9.3

-- 
Vitaly

-- 
Vitaly
Re: [PATCH RFC 4/4] xen/pvhvm: Make MSI IRQs work after kexec
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

On Wed, Jul 16, 2014 at 07:20:39PM +0200, Vitaly Kuznetsov wrote:

Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

On Wed, Jul 16, 2014 at 11:01:55AM +0200, Vitaly Kuznetsov wrote:

Konrad Rzeszutek Wilk konrad.w...@oracle.com writes:

On Tue, Jul 15, 2014 at 03:40:40PM +0200, Vitaly Kuznetsov wrote:

When kexec was performed MSI IRQs for passthrough-ed devices were already mapped and we see non-zero pirq extracted from MSI msg. xen_irq_from_pirq() fails as we have no IRQ mapping information for that. Requesting for new mapping with __write_msi_msg() does not result in MSI IRQ being remapped so we don't recieve these IRQs.

receive

Thanks for your comments!

Thank you for quick turnaround with the answers!

How come '__write_msi_msg' does not result in new MSI IRQs?

Actually that was the hidden question in my RFC :-) Let me describe what I see.

When normal boot is performed we have the following in xen_hvm_setup_msi_irqs():

__read_msi_msg()
pirq -> 0

then we allocate new pirq with

pirq = xen_allocate_pirq_msi()
pirq -> 54

and we have the following mapping:

xen: msi --> pirq=54 --> irq=72

in 'xl debug-keys i':

(XEN)IRQ: 29 affinity:04 vec:b9 type=PCI-MSI status=0030 in-flight=0 domain-list=7: 54(),

After kexec we see the following:

__read_msi_msg()
pirq -> 54

but as xen_irq_from_pirq() fails we follow the same path allocating new pirq:

pirq = xen_allocate_pirq_msi()
pirq -> 55

and we have the following mapping:

xen: msi --> pirq=55 --> irq=75

However (afaict) mapping in xen wasn't updated:

in 'xl debug-keys i':

(XEN)IRQ: 29 affinity:02 vec:b9 type=PCI-MSI status=0030 in-flight=0 domain-list=7: 54(--M-),

I am wondering if that is related to in QEMU traditional:

qemu-xen-trad: free all the pirqs for msi/msix when driver unloads

(which in the upstream QEMU is 1d4fd4f0e2fc5dcae0c60e00cc9af95f52988050)

If you have that patch in, is the PIRQ value correctly updated?

Thanks, that really works! I tested both kexec -e / kdump cases.
I'm wondering if we still need my commit to work around non-fixed qemus?

Without your patch on older QEMUs with PCI passthrough we won't get any more interrupts after we kexec in the guest, right?

Correct.

As in, this issue happens _only_ with PCI passthrough devices that use MSI or MSI-X?

I haven't tested MSI-X but in theory yes, only MSI and MSI-X passthrough-ed devices are affected.

Still need to get Stefano's view on this.

Sure, thanks!

Is it fair to state that your code ends up reading the MSI IRQ (PIRQ) from the device and updating the internal PIRQ-IRQ code to match with the reality?

Yea, 'always trust the device'.

RFC: I wasn't able to understand why commit af42b8d1 which introduced xen_irq_from_pirq() check in xen_hvm_setup_msi_irqs() is checking that instead of checking pirq > 0 as if the mapping was already done (and we have pirq>0 here) we don't need to request for a new pirq. We're losing existing PIRQ and I'm also not sure when __write_msi_msg() with new PIRQ will result in new mapping.

We don't request a new pirq. We end up returning before we call xen_allocate_pirq_msi. At least that is how the commit you mentioned worked.

I meant to say that in case we have pirq > 0 from __read_msi_msg() but xen_irq_from_pirq(pirq) fails (kexec-only case?) we always do xen_allocate_pirq_msi() which brings us new pirq.

In regards to why using 'xen_irq_from_pirq' instead of just checking the PIRQ - is that we might be called twice by a buggy driver. As such we want to check our PIRQ-IRQ to figure this out.

But if we're called twice we'll see the same pirq, right? Or there are some cases when we see 'crap' instead of pirq here?

Good point.

For PCI passthrough devices they will be zero until they are enabled. But I am not sure about the emulated devices, such as e1000 or such, which would also go through this path (I think - do we have MSI devices that we emulate in QEMU?)
AFAICT emulated e1000 doesn't use MSI (at least with qemu-traditional) and with my patch series it works after kexec.

I think it would be nice to use the same pirq after kexec instead of allocating a new one even in case we can make remapping work.

I concur. Stefano, do you recall why you used xen_irq_from_pirq instead of just trusting the 'pirq' value? Was it to workaround broken QEMU?

Thanks for your comments again!

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 arch/x86/pci/xen.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 905956f..685e8f1 100644
--- a/arch/x86/pci/xen.c
+++ b/arch
[PATCH v8] mmap_vmcore: skip non-ram pages reported by hypervisors
We have a special check in the read_vmcore() handler to check whether a page
was reported as ram or not by the hypervisor (pfn_is_ram()). However, when
vmcore is read with mmap() no such check is performed. That can lead to
unpredictable results, e.g. when running a Xen PVHVM guest, memcpy() after
mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages, creating
enormous load in both DomU and Dom0.

Fix the issue by mapping each non-ram page to the zero page. Keep the direct
path with remap_oldmem_pfn_range() to avoid looping through all pages on bare
metal.

The issue can also be solved by overriding remap_oldmem_pfn_range() in
xen-specific code, which is what remap_oldmem_pfn_range() was designed for.
That, however, would involve a non-obvious xen code path for all x86 builds
with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific code
on x86 arch from doing the same override.

Changes from v7:
- address kbuild test robot's warnings by making remap_oldmem_pfn_checked()
  and vmcore_remap_oldmem_pfn() static (Fengguang Wu's patch)

Changes from v6:
- remove useless len increment when remapping the rest of the region

Changes from v5:
- make len size_t to match do_unmap() interface

Changes from v4:
- change map_size type size_t -> unsigned long
- use prot instead of vma->vm_page_prot inside remap_oldmem_pfn_checked()

Changes from v3:
- multi line comment style changes
- minor code style changes

Changes from v2:
- make remap_oldmem_pfn_checked() interface exactly match
  remap_oldmem_pfn_range()
- unmap mapped part inside remap_oldmem_pfn_checked() in case of failure
  so we don't need to take care of it in mmap_vmcore()
- create vmcore_remap_oldmem_pfn() wrapper

Changes from v1:
- comment style changes
- change remap_oldmem_pfn_checked() interface to closer match the
  remap_oldmem_pfn() interface
- preserve formal parameters within the loop, make the loop conditions
  easier to understand
- use my_zero_pfn() for the zero page
- return remapped length instead of new offset

Reviewed-by: Andrew Jones <drjo...@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 fs/proc/vmcore.c | 82 +---
 1 file changed, 79 insertions(+), 3 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 382aa89..a90d6d35 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -328,6 +328,82 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
  * virtually contiguous user-space in ELF layout.
  */
 #ifdef CONFIG_MMU
+/*
+ * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages
+ * reported as not being ram with the zero page.
+ *
+ * @vma: vm_area_struct describing requested mapping
+ * @from: start remapping from
+ * @pfn: page frame number to start remapping to
+ * @size: remapping size
+ * @prot: protection bits
+ *
+ * Returns zero on success, -EAGAIN on failure.
+ */
+static int remap_oldmem_pfn_checked(struct vm_area_struct *vma,
+				    unsigned long from, unsigned long pfn,
+				    unsigned long size, pgprot_t prot)
+{
+	unsigned long map_size;
+	unsigned long pos_start, pos_end, pos;
+	unsigned long zeropage_pfn = my_zero_pfn(0);
+	size_t len = 0;
+
+	pos_start = pfn;
+	pos_end = pfn + (size >> PAGE_SHIFT);
+
+	for (pos = pos_start; pos < pos_end; ++pos) {
+		if (!pfn_is_ram(pos)) {
+			/*
+			 * We hit a page which is not ram. Remap the continuous
+			 * region between pos_start and pos-1 and replace
+			 * the non-ram page at pos with the zero page.
+			 */
+			if (pos > pos_start) {
+				/* Remap continuous region */
+				map_size = (pos - pos_start) << PAGE_SHIFT;
+				if (remap_oldmem_pfn_range(vma, from + len,
+							   pos_start, map_size,
+							   prot))
+					goto fail;
+				len += map_size;
+			}
+			/* Remap the zero page */
+			if (remap_oldmem_pfn_range(vma, from + len,
+						   zeropage_pfn,
+						   PAGE_SIZE, prot))
+				goto fail;
+			len += PAGE_SIZE;
+			pos_start = pos + 1;
+		}
+	}
+	if (pos > pos_start) {
+		/* Remap the rest */
+		map_size = (pos - pos_start) << PAGE_SHIFT;
+		if (remap_oldmem_pfn_range(vma, from + len, pos_start,
+					   map_size, prot))
+			goto fail;
+	}
+	return 0;
+fail
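
The splitting loop in remap_oldmem_pfn_checked() can be tried out in user space. What follows is only a minimal sketch under stated assumptions, not the kernel code: every sketch_* name is hypothetical, sketch_remap() merely records what remap_oldmem_pfn_range() would have been asked to map, sketch_pfn_is_ram() hard-codes PFNs 2 and 5 as non-ram, and the unmap-on-failure cleanup mentioned in the v2 changelog is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12UL
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_ZERO_PFN   0xFFFFFUL	/* hypothetical stand-in for my_zero_pfn(0) */
#define SKETCH_MAX_MAPS   16

/* One recorded request: map 'size' bytes at address 'from' to frame 'pfn'. */
struct sketch_map {
	unsigned long from;
	unsigned long pfn;
	unsigned long size;
};

static struct sketch_map sketch_maps[SKETCH_MAX_MAPS];
static size_t sketch_nr_maps;

/* Hypothetical stand-in for pfn_is_ram(): PFNs 2 and 5 are "not ram". */
static bool sketch_pfn_is_ram(unsigned long pfn)
{
	return pfn != 2 && pfn != 5;
}

/* Stand-in for remap_oldmem_pfn_range(): only records the request. */
static int sketch_remap(unsigned long from, unsigned long pfn,
			unsigned long size)
{
	if (sketch_nr_maps == SKETCH_MAX_MAPS)
		return -1;
	sketch_maps[sketch_nr_maps++] = (struct sketch_map){ from, pfn, size };
	return 0;
}

/* Same control flow as the patch: ram runs are mapped as-is, holes get the zero page. */
static int sketch_remap_checked(unsigned long from, unsigned long pfn,
				unsigned long size)
{
	unsigned long pos_start = pfn;
	unsigned long pos_end = pfn + (size >> SKETCH_PAGE_SHIFT);
	unsigned long pos, map_size;
	size_t len = 0;

	for (pos = pos_start; pos < pos_end; ++pos) {
		if (sketch_pfn_is_ram(pos))
			continue;
		if (pos > pos_start) {
			/* Flush the contiguous ram run that ends just before pos. */
			map_size = (pos - pos_start) << SKETCH_PAGE_SHIFT;
			if (sketch_remap(from + len, pos_start, map_size))
				return -1;
			len += map_size;
		}
		/* Replace the non-ram page with the zero page. */
		if (sketch_remap(from + len, SKETCH_ZERO_PFN, SKETCH_PAGE_SIZE))
			return -1;
		len += SKETCH_PAGE_SIZE;
		pos_start = pos + 1;
	}
	if (pos > pos_start) {
		/* Map whatever ram run is left at the end of the range. */
		map_size = (pos - pos_start) << SKETCH_PAGE_SHIFT;
		if (sketch_remap(from + len, pos_start, map_size))
			return -1;
	}
	return 0;
}

/* Usage: eight pages starting at PFN 0, mapped at address 0x10000. */
static void sketch_demo(void)
{
	int ret = sketch_remap_checked(0x10000UL, 0, 8 * SKETCH_PAGE_SIZE);

	assert(ret == 0);
	assert(sketch_nr_maps == 5);	/* run, zero, run, zero, run */
	assert(sketch_maps[1].pfn == SKETCH_ZERO_PFN);
	assert(sketch_maps[2].from == 0x13000UL && sketch_maps[2].pfn == 3);
}
```

On a range with no holes the loop body never maps anything and the whole range goes out in the single trailing call, which is why the cost on bare metal is limited to the pfn_is_ram() walk.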
Re: [PATCH RFC 1/4] xen PVonHVM: use E820_Reserved area for shared_info
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes: On Tue, Jul 15, 2014 at 05:43:17PM +0200, Vitaly Kuznetsov wrote: Konrad Rzeszutek Wilk konrad.w...@oracle.com writes: On Tue, Jul 15, 2014 at 03:40:37PM +0200, Vitaly Kuznetsov wrote: From: Olaf Hering o...@aepfle.de This is a respin of 00e37bdb0113a98408de42db85be002f21dbffd3 (xen PVonHVM: move shared_info to MMIO before kexec). Currently kexec in a PVonHVM guest fails with a triple fault because the new kernel overwrites the shared info page. The exact failure depends on the size of the kernel image. This patch moves the pfn from RAM into an E820 reserved memory area. The pfn containing the shared_info is located somewhere in RAM. This will cause trouble if the current kernel is doing a kexec boot into a new kernel. The new kernel (and its startup code) can not know where the pfn is, so it can not reserve the page. The hypervisor will continue to update the pfn, and as a result memory corruption occours in the new kernel. The toolstack marks the memory area FC00- as reserved in the E820 map. Within that range newer toolstacks (4.3+) will keep 1MB starting from FE70 as reserved for guest use. Older Xen4 toolstacks will usually not allocate areas up to FE70, so FE70 is expected to work also with older toolstacks. In Xen3 there is no reserved area at a fixed location. If the guest is started on such old hosts the shared_info page will be placed in RAM. As a result kexec can not be used. So this looks right, the one thing that we really need to check is e9daff24a266307943457086533041bd971d0ef9 This reverts commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f. We are doing this b/c on 32-bit PVonHVM with older hypervisors (Xen 4.1) it ends up bothing up the start_info. This is bad b/c we use it for the time keeping, and the timekeeping code loops forever - as the version field never changes. Olaf says to revert it, so lets do that. 
Could you kindly test that the migration on 32-bit PVHVM guests on older hypervisors works? Sure, will do! Was there anything special about the setup or any 32-bit pvhvm guest migration (on 64-bit hypervisor I suppose) would fail? I can try checking both current and old versions to make sure the issue was acutually fixed. Nothing fancy (well, it was SMP, so 4 CPUs). I did the 'save'/'restore' and the guest would not restore properly. The symptoms you saw were: after the resume guest appears to be frozen, all vcpus except for the first one spin at 100%? I was able to reproduce that on old patch version and everything works fine with your fix (calling xen_hvm_set_shared_info() in addition to xen_hvm_connect_shared_info() on resume). We're probably safe to apply it now, thanks! However I'd like to suggest we remove '__init' from xen_hvm_set_shared_info() as now we call it on resume. Thank you! Signed-off-by: Olaf Hering o...@aepfle.de Signed-off-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com (cherry picked from commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f) [On resume we need to reset the xen_vcpu_info, which the original patch did not do] Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com --- arch/x86/xen/enlighten.c | 74 arch/x86/xen/suspend.c | 2 +- arch/x86/xen/xen-ops.h | 2 +- 3 files changed, 58 insertions(+), 20 deletions(-) diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c index ffb101e..a11af62 100644 --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c @@ -1726,23 +1726,29 @@ asmlinkage __visible void __init xen_start_kernel(void) #endif } -void __ref xen_hvm_init_shared_info(void) +#ifdef CONFIG_XEN_PVHVM +#define HVM_SHARED_INFO_ADDR 0xFE70UL +static struct shared_info *xen_hvm_shared_info; +static unsigned long xen_hvm_sip_phys; +static int xen_major, xen_minor; + +static void xen_hvm_connect_shared_info(unsigned long pfn) { - int cpu; struct xen_add_to_physmap xatp; - static struct shared_info *shared_info_page = 0; - if 
(!shared_info_page) - shared_info_page = (struct shared_info *) - extend_brk(PAGE_SIZE, PAGE_SIZE); xatp.domid = DOMID_SELF; xatp.idx = 0; xatp.space = XENMAPSPACE_shared_info; - xatp.gpfn = __pa(shared_info_page) PAGE_SHIFT; + xatp.gpfn = pfn; if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, xatp)) BUG(); - HYPERVISOR_shared_info = (struct shared_info *)shared_info_page; +} +static void __init xen_hvm_set_shared_info(struct shared_info *sip) +{ + int cpu; + + HYPERVISOR_shared_info = sip; /* xen_vcpu is a pointer to the vcpu_info struct in the shared_info * page, we use it in the event channel upcall and in some pvclock @@ -1760,20 +1766,39 @@ void __ref
Re: [PATCH RFC 1/4] xen PVonHVM: use E820_Reserved area for shared_info
Konrad Rzeszutek Wilk konrad.w...@oracle.com writes: On Fri, Jul 18, 2014 at 01:05:46PM +0200, Vitaly Kuznetsov wrote: Konrad Rzeszutek Wilk konrad.w...@oracle.com writes: On Tue, Jul 15, 2014 at 05:43:17PM +0200, Vitaly Kuznetsov wrote: Konrad Rzeszutek Wilk konrad.w...@oracle.com writes: On Tue, Jul 15, 2014 at 03:40:37PM +0200, Vitaly Kuznetsov wrote: From: Olaf Hering o...@aepfle.de This is a respin of 00e37bdb0113a98408de42db85be002f21dbffd3 (xen PVonHVM: move shared_info to MMIO before kexec). Currently kexec in a PVonHVM guest fails with a triple fault because the new kernel overwrites the shared info page. The exact failure depends on the size of the kernel image. This patch moves the pfn from RAM into an E820 reserved memory area. The pfn containing the shared_info is located somewhere in RAM. This will cause trouble if the current kernel is doing a kexec boot into a new kernel. The new kernel (and its startup code) can not know where the pfn is, so it can not reserve the page. The hypervisor will continue to update the pfn, and as a result memory corruption occours in the new kernel. The toolstack marks the memory area FC00- as reserved in the E820 map. Within that range newer toolstacks (4.3+) will keep 1MB starting from FE70 as reserved for guest use. Older Xen4 toolstacks will usually not allocate areas up to FE70, so FE70 is expected to work also with older toolstacks. In Xen3 there is no reserved area at a fixed location. If the guest is started on such old hosts the shared_info page will be placed in RAM. As a result kexec can not be used. So this looks right, the one thing that we really need to check is e9daff24a266307943457086533041bd971d0ef9 This reverts commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f. We are doing this b/c on 32-bit PVonHVM with older hypervisors (Xen 4.1) it ends up bothing up the start_info. This is bad b/c we use it for the time keeping, and the timekeeping code loops forever - as the version field never changes. 
Olaf says to revert it, so lets do that. Could you kindly test that the migration on 32-bit PVHVM guests on older hypervisors works? Sure, will do! Was there anything special about the setup or any 32-bit pvhvm guest migration (on 64-bit hypervisor I suppose) would fail? I can try checking both current and old versions to make sure the issue was acutually fixed. Nothing fancy (well, it was SMP, so 4 CPUs). I did the 'save'/'restore' and the guest would not restore properly. The symptoms you saw were: after the resume guest appears to be frozen, all vcpus except for the first one spin at 100%? I was able to reproduce Yes, that is it. that on old patch version and everything works fine with your fix (calling xen_hvm_set_shared_info() in addition to xen_hvm_connect_shared_info() on resume). We're probably safe to apply it now, thanks! Woot! Could you include that tidbit of information in the commit please? Sure, However I'd like to suggest we remove '__init' from xen_hvm_set_shared_info() as now we call it on resume. Good idea. Lets wait until Stefano responds (for the MSI PIRQ one), and if he does not have anything special to say, then repost the whole patchset including this tiny __init fix and the updated comment? Deal :-) Please take a look at my '[PATCH RFC] evtchn: introduce EVTCHNOP_fifo_destroy hypercall'. In case that works we can fix FIFO case at the same time, no TODO required. I'll be able to return to this work at the end of next week. -- Vitaly -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/4] xen/pvhvm: fix shared_info and pirq issues with kexec
David Vrabel david.vra...@citrix.com writes: On 15/07/14 14:40, Vitaly Kuznetsov wrote: With this patch series I'm trying to address several issues with kexec on pvhvm: - shared_info issue (1st patch, just sending Olaf's work with Konrad's fix) - create specific pvhvm shutdown handler for kexec (2nd patch) - GSI PIRQ issue (3rd patch, I'm pretty confident that it does the right thing) - MSI PIRQ issue (4th patch, and I'm not sure it doesn't break anything - RFC) This patch series can be tested on single vCPU guest. We still have SMP issues with pvhvm guests and kexec which require additional fixes. In addition to the fixes for multi-VCPU guests, what else remains? I'm aware of grants and ballooned out pages. What's the plan for handling granted pages? (if I got the design right) we have two issues: 1) Pages we grant access to other domains. We have the list so we can try doing gnttab_end_foreign_access for all unmapped grants but there is nothing we can do with mapped ones from guest. We can either assume that all such usages are short-term and try waiting for them to finish or we need to do something like force-unmap from hypervisor side. 2) Pages we mapped from other domains. There is no easy way to collect all grant handles from different places in kernel atm so I can see two possible solutions: - we keep track of all handles with new kernel structure in guest and unmap them all on kexec/kdump. - we introduce new GNTTABOP_reset which does something similar to gnttab_release_mappings(). There is nothing we need to do with transferred grants (and I don't see transfer usages in kernel). Please correct me if I'm wrong. I don't think we want to accept a partial solution unless the known non-working configurations fail fast on kexec load. *I think* we can leave ballooned out pages out of scope for now. 
Thanks, David -- Vitaly
Re: [PATCH RFC 0/4] xen/pvhvm: fix shared_info and pirq issues with kexec
David Vrabel david.vra...@citrix.com writes: On 01/08/14 13:21, Vitaly Kuznetsov wrote: David Vrabel david.vra...@citrix.com writes: On 15/07/14 14:40, Vitaly Kuznetsov wrote: With this patch series I'm trying to address several issues with kexec on pvhvm: - shared_info issue (1st patch, just sending Olaf's work with Konrad's fix) - create specific pvhvm shutdown handler for kexec (2nd patch) - GSI PIRQ issue (3rd patch, I'm pretty confident that it does the right thing) - MSI PIRQ issue (4th patch, and I'm not sure it doesn't break anything - RFC) This patch series can be tested on single vCPU guest. We still have SMP issues with pvhvm guests and kexec which require additional fixes. In addition to the fixes for multi-VCPU guests, what else remains? I'm aware of grants and ballooned out pages. What's the plan for handling granted pages? (if I got the design right) we have two issues: 1) Pages we grant access to other domains. We have the list so we can try doing gnttab_end_foreign_access for all unmapped grants but there is nothing we can do with mapped ones from guest. We can either assume that all such usages are short-term and try waiting for them to finish or we need to do something like force-unmap from hypervisor side. Shared rings and persistent grants (used in blkfront) remain mapped for long periods so just waiting won't work. Force unmap by the hypervisor might be a possibility but the hypervisor needs to atomically replace the grant mapping with a different valid mapping, or the force unmap would cause the backend that was accesses the pages to fault. Every writable mapping would have to be replaced with a mapping to a unique page (to prevent information leaking between different granted pages). Read-only mappings could be replaces with a read-only mapping to shared zero page safely. The only way I can see how to do this requires co-operation from the backend kernel -- it would need to provide replacement frames for every grant map. 
Xen would then use this frame when force-unmapping (revoking) the mapping. We can introduce something like GNTTABOP_safe_map_grant_ref op with such replacement frames and use then in 'force unmap' but to be honest I'd like to avoid that.. What if xen could allocate and provide replacement pages on its own? I understand we need to take some precautions against malicious domains hogging all memory with such approach.. Actually if we narrow down our case to backend/frontend usage only we can think that grants do not introduce any issue for kexec/kdump because: - frontends are being closed on kexec so backends unmap corresponding grants. That's not true for persistent grant case as persistent grants are being unmaped only in xen_blkif_free() but not in xen_blkif_disconnect() atm. The following patch helps: diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c index 3a8b810..54f4089 100644 --- a/drivers/block/xen-blkback/xenbus.c +++ b/drivers/block/xen-blkback/xenbus.c @@ -270,6 +270,9 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif) blkif-blk_rings.common.sring = NULL; } + /* Remove all persistent grants and the cache of ballooned pages. */ + xen_blkbk_free_caches(blkif); + return 0; } @@ -281,9 +284,6 @@ static void xen_blkif_free(struct xen_blkif *blkif) xen_blkif_disconnect(blkif); xen_vbd_free(blkif-vbd); - /* Remove all persistent grants and the cache of ballooned pages. */ - xen_blkbk_free_caches(blkif); - /* Make sure everything is drained before shutting down */ BUG_ON(blkif-persistent_gnt_c != 0); BUG_ON(atomic_read(blkif-persistent_gnt_in_use) != 0); - in kdump case our new kernel lives in its own space so we won't hit granted pages (we can read some crap when collecting memory dump but it's not any better in case these pages were unmapped) - frontends are being reinitialized on new kernel startup, that also causes the unmap (with the patch from above applied). 
So I think we need to do something with grants in two cases: they were used outside frontends/backends and we have no idea if they are going to be unmapped (can be solved with kexec_is_safe flag) and when backend got stuck and refused to close/unmap (should be rare). 2) Pages we mapped from other domains. There is no easy way to collect all grant handles from different places in kernel atm so I can see two possible solutions: - we keep track of all handles with new kernel structure in guest and unmap them all on kexec/kdump. - we introduce new GNTTABOP_reset which does something similar to gnttab_release_mappings(). I think you can ignore this for now -- frontend drivers do not grant map, but see suggestion about kexec_is_safe below. There is nothing we need to do with transferred grants (and I don't see transfer usages in kernel). Agreed. I don't think we want
[PATCH] Drivers: hv: util: make struct hv_do_fcopy match Hyper-V host messages
An attempt to fix fcopy on i586 (bc5a5b0 Drivers: hv: util: Properly pack the
data for file copy functionality) led to a regression on x86_64 (and actually
didn't fix the i586 breakage). Fcopy messages from the Hyper-V host come in
the following format:

	struct do_fcopy_hdr   |   36 bytes
	(padding)             |    4 bytes
	offset                |    8 bytes
	size                  |    4 bytes
	data                  | 6144 bytes

On x86_64 struct hv_do_fcopy matched this format without
'__attribute__((packed))', and on i586 adding '__attribute__((packed))' to it
doesn't change anything. Keep the structure packed and add padding to match
reality. Tested both i586 and x86_64 on Hyper-V Server 2012 R2.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 include/uapi/linux/hyperv.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/hyperv.h b/include/uapi/linux/hyperv.h
index 0a8e6ba..bb1cb73 100644
--- a/include/uapi/linux/hyperv.h
+++ b/include/uapi/linux/hyperv.h
@@ -134,6 +134,7 @@ struct hv_start_fcopy {
 
 struct hv_do_fcopy {
 	struct hv_fcopy_hdr hdr;
+	__u32	pad;
 	__u64	offset;
 	__u32	size;
 	__u8	data[DATA_FRAGMENT];
-- 
1.9.3
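
The layout argument can be checked with offsetof(). The sketch below is an illustration only: the sketch_* structs are hypothetical stand-ins for the uapi definitions, the split of the 36-byte header into a 32-bit operation code plus two 16-byte IDs is an assumption, and __attribute__((packed)) is a GCC/Clang extension. It compares a packed struct without the pad against the packed struct with the explicit pad from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DATA_FRAGMENT (6 * 1024)	/* 6144 bytes of payload */

/* Assumed header layout: 4 + 16 + 16 = 36 bytes. */
struct sketch_fcopy_hdr {
	uint32_t operation;
	uint8_t  service_id0[16];
	uint8_t  service_id1[16];
} __attribute__((packed));

/* Packed without padding: 'offset' starts right after the header. */
struct sketch_do_fcopy_packed {
	struct sketch_fcopy_hdr hdr;
	uint64_t offset;
	uint32_t size;
	uint8_t  data[SKETCH_DATA_FRAGMENT];
} __attribute__((packed));

/* Packed with the explicit 4-byte pad: same layout on every architecture. */
struct sketch_do_fcopy_fixed {
	struct sketch_fcopy_hdr hdr;
	uint32_t pad;
	uint64_t offset;
	uint32_t size;
	uint8_t  data[SKETCH_DATA_FRAGMENT];
} __attribute__((packed));

/* Usage: compare the field offsets against the message format above. */
static void sketch_check_layout(void)
{
	assert(sizeof(struct sketch_fcopy_hdr) == 36);

	/* Without the pad, a packed struct puts 'offset' 4 bytes too early. */
	assert(offsetof(struct sketch_do_fcopy_packed, offset) == 36);

	/* With the pad, fields line up with 36 | 4 | 8 | 4 | 6144. */
	assert(offsetof(struct sketch_do_fcopy_fixed, offset) == 40);
	assert(offsetof(struct sketch_do_fcopy_fixed, size) == 48);
	assert(offsetof(struct sketch_do_fcopy_fixed, data) == 52);
	assert(sizeof(struct sketch_do_fcopy_fixed) == 52 + SKETCH_DATA_FRAGMENT);
}
```

Spelling the padding out, rather than relying on natural alignment, makes the struct independent of how a given ABI aligns 64-bit members.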
[PATCH] tools: hv: introduce -n/--no-daemon option
All tools/hv daemons do a mandatory daemon() on startup. However, no pidfile
is created, which makes it difficult for an init system to track such daemons.
Modern linux distros use systemd as their init system. It can handle the
daemonizing by itself; however, it requires a daemon to stay in foreground
for that. Some distros already carry a distro-specific patch for hv tools
which switches off daemon().

Introduce the -n/--no-daemon option for all 3 daemons in tools/hv. Parse
options with getopt() to make this part easily expandable.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 tools/hv/hv_fcopy_daemon.c | 33 +++--
 tools/hv/hv_kvp_daemon.c   | 34 --
 tools/hv/hv_vss_daemon.c   | 33 +++--
 3 files changed, 94 insertions(+), 6 deletions(-)

diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c
index 8f96b3e..f437d73 100644
--- a/tools/hv/hv_fcopy_daemon.c
+++ b/tools/hv/hv_fcopy_daemon.c
@@ -33,6 +33,7 @@
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <dirent.h>
+#include <getopt.h>
 
 static int target_fd;
 static char target_fname[W_MAX_PATH];
@@ -126,15 +127,43 @@ static int hv_copy_cancel(void)
 
 }
 
-int main(void)
+void print_usage(char *argv[])
+{
+	fprintf(stderr, "Usage: %s [options]\n"
+		"Options are:\n"
+		"  -n, --no-daemon        stay in foreground, don't daemonize\n"
+		"  -h, --help             print this help\n", argv[0]);
+}
+
+int main(int argc, char *argv[])
 {
 	int fd, fcopy_fd, len;
 	int error;
+	int daemonize = 1, long_index = 0, opt;
 	int version = FCOPY_CURRENT_VERSION;
 	char *buffer[4096 * 2];
 	struct hv_fcopy_hdr *in_msg;
 
-	if (daemon(1, 0)) {
+	static struct option long_options[] = {
+		{"help",	no_argument,	   0,  'h' },
+		{"no-daemon",	no_argument,	   0,  'n' },
+		{0,		0,		   0,  0   }
+	};
+
+	while ((opt = getopt_long(argc, argv, "hn", long_options,
+				  &long_index)) != -1) {
+		switch (opt) {
+		case 'n':
+			daemonize = 0;
+			break;
+		case 'h':
+		default:
+			print_usage(argv);
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	if (daemonize && daemon(1, 0)) {
 		syslog(LOG_ERR, "daemon() failed; error: %s", strerror(errno));
 		exit(EXIT_FAILURE);
 	}
diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
index 4088b81..22b0764 100644
--- a/tools/hv/hv_kvp_daemon.c
+++ b/tools/hv/hv_kvp_daemon.c
@@ -43,6 +43,7 @@
 #include <fcntl.h>
 #include <dirent.h>
 #include <net/if.h>
+#include <getopt.h>
 
 /*
  * KVP protocol: The user mode component first registers with the
@@ -1417,7 +1418,15 @@ netlink_send(int fd, struct cn_msg *msg)
 	return sendmsg(fd, &message, 0);
 }
 
-int main(void)
+void print_usage(char *argv[])
+{
+	fprintf(stderr, "Usage: %s [options]\n"
+		"Options are:\n"
+		"  -n, --no-daemon        stay in foreground, don't daemonize\n"
+		"  -h, --help             print this help\n", argv[0]);
+}
+
+int main(int argc, char *argv[])
 {
 	int fd, len, nl_group;
 	int error;
@@ -1435,9 +1444,30 @@ int main(void)
 	struct hv_kvp_ipaddr_value *kvp_ip_val;
 	char *kvp_recv_buffer;
 	size_t kvp_recv_buffer_len;
+	int daemonize = 1, long_index = 0, opt;
+
+	static struct option long_options[] = {
+		{"help",	no_argument,	   0,  'h' },
+		{"no-daemon",	no_argument,	   0,  'n' },
+		{0,		0,		   0,  0   }
+	};
+
+	while ((opt = getopt_long(argc, argv, "hn", long_options,
+				  &long_index)) != -1) {
+		switch (opt) {
+		case 'n':
+			daemonize = 0;
+			break;
+		case 'h':
+		default:
+			print_usage(argv);
+			exit(EXIT_FAILURE);
+		}
+	}
 
-	if (daemon(1, 0))
+	if (daemonize && daemon(1, 0))
 		return 1;
+
 	openlog("KVP", 0, LOG_USER);
 	syslog(LOG_INFO, "KVP starting; pid is:%d", getpid());
 
diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index 6a213b8..9ae2b6e 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -36,6 +36,7 @@
 #include <linux/hyperv.h>
 #include <linux/netlink.h>
 #include <syslog.h>
+#include <getopt.h>
 
 static struct sockaddr_nl addr;
 
@@ -131,7 +132,15 @@ static int netlink_send(int fd, struct cn_msg *msg)
 	return sendmsg(fd, &message, 0);
 }
 
-int main(void)
+void print_usage(char *argv[])
+{
+	fprintf(stderr, "Usage: %s [options]\n"
+		"Options
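
The option handling is the same in all three daemons, so it can be looked at in isolation. The sketch below is not taken from the patch: sketch_parse_args() and the sketch_action enum are hypothetical names, and the helper returns a decision instead of calling daemon() or exit() so that it can be exercised without forking. Only the "hn" option string and the two long options come from the diff above.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

/* What the caller should do after option parsing (hypothetical names). */
enum sketch_action {
	SKETCH_DAEMONIZE,	/* default: call daemon(1, 0) */
	SKETCH_FOREGROUND,	/* -n / --no-daemon: stay in foreground */
	SKETCH_USAGE		/* -h / --help or an unknown option */
};

static enum sketch_action sketch_parse_args(int argc, char *argv[])
{
	static const struct option long_options[] = {
		{ "help",      no_argument, NULL, 'h' },
		{ "no-daemon", no_argument, NULL, 'n' },
		{ NULL,        0,           NULL, 0   }
	};
	enum sketch_action action = SKETCH_DAEMONIZE;
	int opt;

	assert(argc >= 1 && argv != NULL);

	/* Restart scanning so the helper can be called more than once. */
	optind = 1;

	while ((opt = getopt_long(argc, argv, "hn", long_options, NULL)) != -1) {
		switch (opt) {
		case 'n':
			action = SKETCH_FOREGROUND;
			break;
		case 'h':
		default:
			return SKETCH_USAGE;
		}
	}
	return action;
}

/* Usage: the four command lines the patch distinguishes. */
static void sketch_demo(void)
{
	char *plain[]    = { "hv_daemon", NULL };
	char *fg_short[] = { "hv_daemon", "-n", NULL };
	char *fg_long[]  = { "hv_daemon", "--no-daemon", NULL };
	char *help[]     = { "hv_daemon", "-h", NULL };

	assert(sketch_parse_args(1, plain) == SKETCH_DAEMONIZE);
	assert(sketch_parse_args(2, fg_short) == SKETCH_FOREGROUND);
	assert(sketch_parse_args(2, fg_long) == SKETCH_FOREGROUND);
	assert(sketch_parse_args(2, help) == SKETCH_USAGE);
}
```

With an init system that supervises foreground processes, the unit would then start the daemon with -n and do the tracking itself; the exact unit file is distro-specific and not part of this patch.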
[PATCH] xen/blkback: unmap all persistent grants when frontend gets disconnected
blkback does not unmap persistent grants when the frontend goes to Closed
state (e.g. when the blkfront module is being removed). This leads to the
following in the guest's dmesg:

[ 343.243825] xen:grant_table: WARNING: g.e. 0x445 still in use!
[ 343.243825] xen:grant_table: WARNING: g.e. 0x42a still in use!
...

When the load module -> use device -> unload module sequence is performed
multiple times it is possible to hit a BUG() condition in the blkfront module:

[ 343.243825] kernel BUG at drivers/block/xen-blkfront.c:954!
[ 343.243825] invalid opcode: [#1] SMP
[ 343.243825] Modules linked in: xen_blkfront(-) ata_generic pata_acpi [last unloaded: xen_blkfront]
...
[ 343.243825] Call Trace:
[ 343.243825] [814111ef] ? unregister_xenbus_watch+0x16f/0x1e0
[ 343.243825] [a0016fbf] blkfront_remove+0x3f/0x140 [xen_blkfront]
...
[ 343.243825] RIP [a0016aae] blkif_free+0x34e/0x360 [xen_blkfront]
[ 343.243825] RSP 88001eb8fdc0

We don't need to keep these grants if we're disconnecting, as the frontend
might have already forgotten about them. Solve the issue by moving the
xen_blkbk_free_caches() call from xen_blkif_free() to xen_blkif_disconnect().

Now we can see the following:

[ 928.590893] xen:grant_table: WARNING: g.e. 0x587 still in use!
[ 928.591861] xen:grant_table: WARNING: g.e. 0x372 still in use!
...
[ 929.592146] xen:grant_table: freeing g.e. 0x587
[ 929.597174] xen:grant_table: freeing g.e. 0x372
...

The backend does not keep persistent grants any more, and reconnect works
fine.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 drivers/block/xen-blkback/xenbus.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 3a8b810..54f4089 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -270,6 +270,9 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif)
 		blkif->blk_rings.common.sring = NULL;
 	}
 
+	/* Remove all persistent grants and the cache of ballooned pages. */
+	xen_blkbk_free_caches(blkif);
+
 	return 0;
 }
 
@@ -281,9 +284,6 @@ static void xen_blkif_free(struct xen_blkif *blkif)
 	xen_blkif_disconnect(blkif);
 	xen_vbd_free(&blkif->vbd);
 
-	/* Remove all persistent grants and the cache of ballooned pages. */
-	xen_blkbk_free_caches(blkif);
-
 	/* Make sure everything is drained before shutting down */
 	BUG_ON(blkif->persistent_gnt_c != 0);
 	BUG_ON(atomic_read(&blkif->persistent_gnt_in_use) != 0);
-- 
1.9.3
[PATCH] xen/blkfront: improve protection against issuing unsupported REQ_FUA
A guard against issuing unsupported REQ_FUA and REQ_FLUSH was introduced in
d11e61583 and was factored out into blkif_request_flush_valid() in 0f1ca65ee.
However:

1) This check is incomplete. In case we negotiated to
   feature_flush = REQ_FLUSH and flush_op = BLKIF_OP_FLUSH_DISKCACHE (so FUA
   is unsupported), a FUA request will still pass the check.
2) blkif_request_flush_valid() is misnamed. It is bool but returns true when
   the request is invalid.
3) When blkif_request_flush_valid() fails, -EIO is returned. It seems that
   -EOPNOTSUPP is more appropriate here.

Fix all of the above issues.

This patch is based on the original patch by Laszlo Ersek and a comment by
Jeff Moyer.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 drivers/block/xen-blkfront.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 5ac312f..2e6c103 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -582,12 +582,14 @@ static inline void flush_requests(struct blkfront_info *info)
 		notify_remote_via_irq(info->irq);
 }
 
-static inline bool blkif_request_flush_valid(struct request *req,
-					     struct blkfront_info *info)
+static inline bool blkif_request_flush_invalid(struct request *req,
+					       struct blkfront_info *info)
 {
 	return ((req->cmd_type != REQ_TYPE_FS) ||
-		((req->cmd_flags & (REQ_FLUSH | REQ_FUA)) &&
-		!info->flush_op));
+		((req->cmd_flags & REQ_FLUSH) &&
+		 !(info->feature_flush & REQ_FLUSH)) ||
+		((req->cmd_flags & REQ_FUA) &&
+		 !(info->feature_flush & REQ_FUA)));
 }
 
 /*
@@ -612,8 +614,8 @@ static void do_blkif_request(struct request_queue *rq)
 
 		blk_start_request(req);
 
-		if (blkif_request_flush_valid(req, info)) {
-			__blk_end_request_all(req, -EIO);
+		if (blkif_request_flush_invalid(req, info)) {
+			__blk_end_request_all(req, -EOPNOTSUPP);
 			continue;
 		}
 
-- 
1.9.3
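
Point 1) is easiest to see with the two predicates side by side. This is a sketch under assumptions, not the driver code: the SKETCH_REQ_* bit values are made up, the cmd_type test is left out because it is unchanged by the patch, and the flush_op value 3 merely stands for "some non-zero operation was negotiated".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; only their distinctness matters here. */
#define SKETCH_REQ_FLUSH (1u << 0)
#define SKETCH_REQ_FUA   (1u << 1)

/* Old logic: reject flush/FUA only when no flush operation was negotiated. */
static bool sketch_old_check_rejects(unsigned int cmd_flags,
				     unsigned int flush_op)
{
	return (cmd_flags & (SKETCH_REQ_FLUSH | SKETCH_REQ_FUA)) && !flush_op;
}

/* New logic: each requested flag must be present in feature_flush. */
static bool sketch_new_check_rejects(unsigned int cmd_flags,
				     unsigned int feature_flush)
{
	return ((cmd_flags & SKETCH_REQ_FLUSH) &&
		!(feature_flush & SKETCH_REQ_FLUSH)) ||
	       ((cmd_flags & SKETCH_REQ_FUA) &&
		!(feature_flush & SKETCH_REQ_FUA));
}

/* Usage: backend negotiated cache flush only, so FUA is unsupported. */
static void sketch_demo(void)
{
	unsigned int feature_flush = SKETCH_REQ_FLUSH;
	unsigned int flush_op = 3;	/* stand-in for a non-zero flush op */

	/* The old check lets the unsupported FUA request through. */
	assert(!sketch_old_check_rejects(SKETCH_REQ_FUA, flush_op));

	/* The new check rejects FUA but still accepts a plain flush. */
	assert(sketch_new_check_rejects(SKETCH_REQ_FUA, feature_flush));
	assert(!sketch_new_check_rejects(SKETCH_REQ_FLUSH, feature_flush));
}
```

Testing each flag against feature_flush also covers the case where nothing was negotiated at all, since feature_flush is then zero and both flags are rejected.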
Re: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon
Dexuan Cui <de...@microsoft.com> writes:

> Under high memory pressure and very high KVP R/W test pressure, the netlink
> recvfrom() may transiently return ENOBUFS to the daemon -- we found this
> during a 2-week stress test.
>
> We'd better not terminate the daemon on this failure, because a typical KVP
> user can re-try the R/W and hopefully it will succeed next time.
>
> Cc: K. Y. Srinivasan <k...@microsoft.com>
> Signed-off-by: Dexuan Cui <de...@microsoft.com>
> ---
>  tools/hv/hv_kvp_daemon.c | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
> index 22b0764..9f4b303 100644
> --- a/tools/hv/hv_kvp_daemon.c
> +++ b/tools/hv/hv_kvp_daemon.c
> @@ -1559,8 +1559,15 @@ int main(int argc, char *argv[])
>  				addr_p, &addr_l);
>
>  		if (len < 0) {
> +			int saved_errno = errno;
>  			syslog(LOG_ERR, "recvfrom failed; pid:%u error:%d %s",
>  					addr.nl_pid, errno, strerror(errno));
> +
> +			if (saved_errno == ENOBUFS) {

Is it possible to meet EAGAIN (or EWOULDBLOCK) here as well? I'd suggest we
ignore these as well in such a case. Ignoring ENOMEM here is doubtful, I
think. But possible.

> +				syslog(LOG_ERR, "error = ENOBUFS: ignored");
> +				continue;
> +			}
> +
>  			close(fd);
>  			return -1;
>  		}

-- 
  Vitaly
Re: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon
Dexuan Cui <de...@microsoft.com> writes:

> > -----Original Message-----
> > From: Vitaly Kuznetsov
> > Sent: Wednesday, November 19, 2014 18:50 PM
> > To: Dexuan Cui
> > Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> > driverdev-de...@linuxdriverproject.org; o...@aepfle.de;
> > a...@canonical.com; jasow...@redhat.com; Haiyang Zhang
> > Subject: Re: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon
> >
> > Dexuan Cui writes:
> >
> > > Under high memory pressure and very high KVP R/W test pressure, the
> > > netlink recvfrom() may transiently return ENOBUFS to the daemon -- we
> > > found this during a 2-week stress test.
> > >
> > > We'd better not terminate the daemon on this failure, because a typical
> > > KVP user can re-try the R/W and hopefully it will succeed next time.
> > >
> > > diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
> > > index 22b0764..9f4b303 100644
> > > --- a/tools/hv/hv_kvp_daemon.c
> > > +++ b/tools/hv/hv_kvp_daemon.c
> > > @@ -1559,8 +1559,15 @@ int main(int argc, char *argv[])
> > >  				addr_p, &addr_l);
> > >
> > >  		if (len < 0) {
> > > +			int saved_errno = errno;
> > >  			syslog(LOG_ERR, "recvfrom failed; pid:%u error:%d %s",
> > >  					addr.nl_pid, errno, strerror(errno));
> > > +
> > > +			if (saved_errno == ENOBUFS) {
> >
> > is it possible to meet EAGAIN (or EWOULDBLOCK) here as well? I'd suggest
> > we ignore these as well in such case. Ignoring ENOMEM here is doubtful, I
> > think. But possible.
>
> Vitaly I don't think EAGAIN is possible because man recvfrom says "If no
> messages are available at the socket, the receive calls wait for a message
> to arrive, unless the socket is nonblocking (see fcntl(2)), in which case
> the value -1 is returned and the external variable errno is set to EAGAIN
> or EWOULDBLOCK."
>
> The same man page mentions ENOMEM for recvmsg(), but not recvfrom().

Ah, sorry, I thought your patch touched the other place: the call to
netlink_send(), which does sendmsg() (and my EAGAIN/EWOULDBLOCK/ENOMEM
comment was about it). It could also make sense to patch them both, as I
think it is possible to hit these there as well.

> -- Dexuan

-- 
  Vitaly
Re: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon
Dexuan Cui <de...@microsoft.com> writes:

> -----Original Message-----
> From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
> Sent: Wednesday, November 19, 2014 20:41 PM
> To: Dexuan Cui
> Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
>     driverdev-de...@linuxdriverproject.org; o...@aepfle.de;
>     a...@canonical.com; jasow...@redhat.com; Haiyang Zhang
> Subject: Re: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon
>
>> Dexuan Cui <de...@microsoft.com> writes:
>>
>>> -----Original Message-----
>>> From: Vitaly Kuznetsov
>>> Sent: Wednesday, November 19, 2014 18:50 PM
>>> To: Dexuan Cui
>>> Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
>>>     driverdev-de...@linuxdriverproject.org; o...@aepfle.de;
>>>     a...@canonical.com; jasow...@redhat.com; Haiyang Zhang
>>> Subject: Re: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon
>>>
>>>> Dexuan Cui writes:
>>>>
>>>>> Under high memory pressure and very high KVP R/W test pressure,
>>>>> the netlink recvfrom() may transiently return ENOBUFS to the
>>>>> daemon -- we found this during a 2-week stress test.
>>>>>
>>>>> We'd better not terminate the daemon on this failure, because a
>>>>> typical KVP user can re-try the R/W and hopefully it will succeed
>>>>> next time.
>>>>>
>>>>> diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
>>>>> index 22b0764..9f4b303 100644
>>>>> --- a/tools/hv/hv_kvp_daemon.c
>>>>> +++ b/tools/hv/hv_kvp_daemon.c
>>>>> @@ -1559,8 +1559,15 @@ int main(int argc, char *argv[])
>>>>>  				addr_p, &addr_l);
>>>>>
>>>>>  		if (len < 0) {
>>>>> +			int saved_errno = errno;
>>>>>  			syslog(LOG_ERR, "recvfrom failed; pid:%u error:%d %s",
>>>>>  					addr.nl_pid, errno, strerror(errno));
>>>>> +
>>>>> +			if (saved_errno == ENOBUFS) {
>>>>
>>>> is it possible to meet EAGAIN (or EWOULDBLOCK) here as well? I'd
>>>> suggest we ignore these as well in such case. Ignoring ENOMEM here
>>>> is doubtful, I think. But possible.
>>>>
>>>> Vitaly
>>>
>>> I don't think EAGAIN is possible because man recvfrom says
>>> "If no messages are available at the socket, the receive calls wait
>>> for a message to arrive, unless the socket is nonblocking (see
>>> fcntl(2)), in which case the value -1 is returned and the external
>>> variable errno is set to EAGAIN or EWOULDBLOCK."
>>>
>>> The same man page mentions ENOMEM for recvmsg(), but not recvfrom().
>>
>> Ah, sorry, I thought your patch patches the other place: the call to
>> netlink_send() which does sendmsg() (and my EAGAIN/EWOULDBLOCK/ENOMEM
>> comment was about it). It could also make sense to patch them both as
>> I think it is possible to hit these as well.
>>
>>> -- Dexuan
>>
>> --
>>   Vitaly
>
> OK, I can add this new check:
> (I'll send out the v2 tomorrow in case people have new comments)
>
> Thanks!
>
> --- a/tools/hv/hv_kvp_daemon.c
> +++ b/tools/hv/hv_kvp_daemon.c
> @@ -1770,8 +1770,15 @@ kvp_done:
>
>  		len = netlink_send(fd, incoming_cn_msg);
>  		if (len < 0) {
> +			int saved_errno = errno;
>  			syslog(LOG_ERR, "net_link send failed; error: %d %s", errno,
>  					strerror(errno));
> +
> +			if (saved_errno == ENOMEM || saved_errno == EAGAIN) {

Sorry for being pushy, but it seems ENOBUFS is also possible here (at
least man sendmsg mentions it).

> +				syslog(LOG_ERR, "send error: ignored");
> +				continue;
> +			}
> +
>  			exit(EXIT_FAILURE);
>  		}
>  	}
>
> Thanks,
> -- Dexuan

--
  Vitaly
Re: [PATCH] tools: hv: ignore ENOBUFS in the KVP daemon
Dexuan Cui <de...@microsoft.com> writes:

> -----Original Message-----
> From: Vitaly Kuznetsov
>
>>>> --
>>>>   Vitaly
>>>
>>> OK, I can add this new check:
>>> (I'll send out the v2 tomorrow in case people have new comments)
>>>
>>> Thanks!
>>>
>>> --- a/tools/hv/hv_kvp_daemon.c
>>> +++ b/tools/hv/hv_kvp_daemon.c
>>> @@ -1770,8 +1770,15 @@ kvp_done:
>>>
>>>  		len = netlink_send(fd, incoming_cn_msg);
>>>  		if (len < 0) {
>>> +			int saved_errno = errno;
>>>  			syslog(LOG_ERR, "net_link send failed; error: %d %s", errno,
>>>  					strerror(errno));
>>> +
>>> +			if (saved_errno == ENOMEM || saved_errno == EAGAIN) {
>>
>> Sorry for being pushy, but it seems ENOBUFS is also possible here (at
>> least man sendmsg mentions it).
>
> OK, I'll add this too. :-)
>
> BTW, I realized sendmsg() can't return EAGAIN here as that's for
> non-blocking socket.
>
> Here I simply ignore the error, hoping the other end will re-try.

I agree, it's sufficient to ignore ENOBUFS on the receive path and both
ENOMEM/ENOBUFS on send. Thanks!

>>> +				syslog(LOG_ERR, "send error: ignored");
>>> +				continue;
>>> +			}
>>> +
>>>  			exit(EXIT_FAILURE);
>>>  		}
>>>  	}
>>>
>>> Thanks,
>>> -- Dexuan
>>
>>   Vitaly
>
> -- Dexuan

--
  Vitaly
Re: [PATCH v2] tools: hv: ignore ENOBUFS and ENOMEM in the KVP daemon
Dexuan Cui <de...@microsoft.com> writes:

> Under high memory pressure and very high KVP R/W test pressure, the netlink
> recvfrom() may transiently return ENOBUFS to the daemon -- we found this
> during a 2-week stress test.
>
> We'd better not terminate the daemon on the failure, because a typical KVP
> user will re-try the R/W and hopefully it will succeed next time.
>
> We can also ignore the errors on sending.
>
> Cc: Vitaly Kuznetsov <vkuzn...@redhat.com>
> Cc: K. Y. Srinivasan <k...@microsoft.com>
> Signed-off-by: Dexuan Cui <de...@microsoft.com>
> ---
>
> v2: I also ignore the errors on sending, as Vitaly suggested.

Thanks,

Reviewed-by: Vitaly Kuznetsov <vkuzn...@redhat.com>

>  tools/hv/hv_kvp_daemon.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
> index 22b0764..6a6432a 100644
> --- a/tools/hv/hv_kvp_daemon.c
> +++ b/tools/hv/hv_kvp_daemon.c
> @@ -1559,8 +1559,15 @@ int main(int argc, char *argv[])
>  				addr_p, &addr_l);
>
>  		if (len < 0) {
> +			int saved_errno = errno;
>  			syslog(LOG_ERR, "recvfrom failed; pid:%u error:%d %s",
>  					addr.nl_pid, errno, strerror(errno));
> +
> +			if (saved_errno == ENOBUFS) {
> +				syslog(LOG_ERR, "receive error: ignored");
> +				continue;
> +			}
> +
>  			close(fd);
>  			return -1;
>  		}
> @@ -1763,8 +1770,15 @@ kvp_done:
>
>  		len = netlink_send(fd, incoming_cn_msg);
>  		if (len < 0) {
> +			int saved_errno = errno;
>  			syslog(LOG_ERR, "net_link send failed; error: %d %s", errno,
>  					strerror(errno));
> +
> +			if (saved_errno == ENOMEM || saved_errno == ENOBUFS) {
> +				syslog(LOG_ERR, "send error: ignored");
> +				continue;
> +			}
> +
>  			exit(EXIT_FAILURE);
>  		}
>  	}

--
  Vitaly
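The policy this thread settles on (ignore ENOBUFS on the receive path, ignore ENOMEM and ENOBUFS on the send path, treat anything else as fatal) can be sketched as a small standalone C helper. This is only an illustration under stated assumptions: the enum kvp_dir and the function kvp_error_is_transient() are hypothetical names made up for this example, whereas the patch itself open-codes the checks inside the daemon's main loop.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical: which half of the netlink exchange reported the error. */
enum kvp_dir { KVP_RECV, KVP_SEND };

/*
 * Hypothetical helper. Returns true when the caller should log the
 * error and keep looping, false when it should give up.
 * The errno value must be captured by the caller right after the
 * failing call, before syslog() or anything else can overwrite it.
 */
static bool kvp_error_is_transient(enum kvp_dir dir, int saved_errno)
{
	assert(dir == KVP_RECV || dir == KVP_SEND);

	switch (saved_errno) {
	case ENOBUFS:
		return true;            /* ignored on both paths */
	case ENOMEM:
		return dir == KVP_SEND; /* ignored on the send path only */
	default:
		return false;           /* everything else stays fatal */
	}
}
```

A caller would save errno into a local immediately after recvfrom() or sendmsg() fails and pass that copy in, which mirrors the saved_errno variable used in the patch.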
Re: [PATCH 0/3] Tools: hv: vssdaemon: freeze/thaw logic improvement for the failure case
Dexuan Cui <de...@microsoft.com> writes:

> -----Original Message-----
> From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
> Sent: Saturday, November 8, 2014 1:09 AM
> To: KY Srinivasan; Haiyang Zhang; Greg Kroah-Hartman
> Cc: de...@linuxdriverproject.org; linux-kernel@vger.kernel.org; Dexuan Cui
> Subject: [PATCH 0/3] Tools: hv: vssdaemon: freeze/thaw logic improvement
>          for the failure case
>
>> This patch series addresses the following issues:
>> - Wrong error reporting for multiple filesystems case.
>> - Skip all readonly-mounted filesystems instead of skipping iso9660.
>> - Thaw all filesystems after an unsuccessful freeze attempt.
>>
>> Vitaly Kuznetsov (3):
>>   Tools: hv: vssdaemon: consult with errno in case of failure only
>>   Tools: hv: vssdaemon: skip all filesystems mounted readonly
>>   Tools: hv: vssdaemon: thaw everything in case of freeze failure
>>
>>  tools/hv/hv_vss_daemon.c | 14 ++++++++++++--
>>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> Hi Vitaly,
> Thanks for your patchset!
>
> FYI: Greg checked in a patch of mine several hours ago -- my patch
> implemented thawing all filesystems on a failure of freeze too. :-)

Ah, sorry for stepping on your toes :-)

> Please see my patch in Greg's char-misc-next tree:
> https://git.kernel.org/cgit/linux/kernel/git/gregkh/char-misc.git/commit/?h=char-misc-next&id=4f689190bb55d171d2f6614f8a6cbd4b868e48bd
>
> Can you please rebase your patch(es) on Greg's tree?

Sure, I'll throw away my patch#3, rebase, and repost.

> Thanks,
> -- Dexuan

--
  Vitaly
[PATCH v2 0/2] Tools: hv: vssdaemon: freeze/thaw logic improvement for the failure case
This patch series addresses the following issues:
- Verbosely report errors during freeze (mountpoint, exact error);
- Skip all readonly-mounted filesystems instead of skipping iso9660 only.

Changes since v1:
- Rebase on top of 'char-misc-next';
- "Tools: hv: vssdaemon: thaw everything in case of freeze" was thrown
  away as Dexuan's "Tools: hv: vssdaemon: ignore the EBUSY on multiple
  freezing the same partition" contains the same change;
- "Tools: hv: vssdaemon: consult with errno in case of failure only" was
  replaced with "Tools: hv: vssdaemon: report freeze errors".

Vitaly Kuznetsov (2):
  Tools: hv: vssdaemon: report freeze errors
  Tools: hv: vssdaemon: skip all filesystems mounted readonly

 tools/hv/hv_vss_daemon.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

--
1.9.3
[PATCH v2 2/2] Tools: hv: vssdaemon: skip all filesystems mounted readonly
Instead of making a list of exceptions for readonly filesystems in addition
to iso9660 we already have it is better to skip freeze operation for all
readonly-mounted filesystems.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 tools/hv/hv_vss_daemon.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index ee44f0d..5e63f70 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -102,7 +102,7 @@ static int vss_operate(int operation)
 	while ((ent = getmntent(mounts))) {
 		if (strncmp(ent->mnt_fsname, match, strlen(match)))
 			continue;
-		if (strcmp(ent->mnt_type, "iso9660") == 0)
+		if (hasmntopt(ent, MNTOPT_RO) != NULL)
 			continue;
 		if (strcmp(ent->mnt_type, "vfat") == 0)
 			continue;
--
1.9.3
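For readers unfamiliar with hasmntopt(), the check the patch switches to can be exercised on its own. The following is a minimal sketch, assuming glibc's <mntent.h>; mount_is_readonly() is a hypothetical wrapper introduced only for this example and does not exist in the daemon.

```c
#include <assert.h>
#include <mntent.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical wrapper around the test used in the patch: a mount
 * entry counts as read-only when its option string carries "ro"
 * (MNTOPT_RO) as a separate option.
 */
static bool mount_is_readonly(const struct mntent *ent)
{
	assert(ent != NULL);
	return hasmntopt(ent, MNTOPT_RO) != NULL;
}
```

In a loop over getmntent() results, such a predicate replaces per-filesystem-type exceptions like the iso9660 comparison: any entry mounted read-only is skipped regardless of its type.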
[PATCH v2 1/2] Tools: hv: vssdaemon: report freeze errors
When ioctl(fd, FIFREEZE, 0) results in an error we cannot report it to syslog
instantly since that can cause write to a frozen disk. However, the name of
the filesystem which caused the error and errno are valuable and we would like
to get a nice human-readable message in the log. Save errno before calling
vss_operate(VSS_OP_THAW) and report the error right after. Unfortunately,
FITHAW errors cannot be reported the same way as we need to finish thawing all
filesystems before calling syslog().

We should also avoid calling endmntent() for the second time in case we
encountered an error during freezing of '/' as it usually results in SIGSEGV.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 tools/hv/hv_vss_daemon.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index b720d8f..ee44f0d 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -82,7 +82,7 @@ static int vss_operate(int operation)
 	FILE *mounts;
 	struct mntent *ent;
 	unsigned int cmd;
-	int error = 0, root_seen = 0;
+	int error = 0, root_seen = 0, save_errno = 0;

 	switch (operation) {
 	case VSS_OP_FREEZE:
@@ -114,7 +114,6 @@ static int vss_operate(int operation)
 		if (error && operation == VSS_OP_FREEZE)
 			goto err;
 	}
-	endmntent(mounts);

 	if (root_seen) {
 		error |= vss_do_freeze("/", cmd);
@@ -122,10 +121,19 @@ static int vss_operate(int operation)
 			goto err;
 	}

-	return error;
+	goto out;
 err:
-	endmntent(mounts);
+	save_errno = errno;
 	vss_operate(VSS_OP_THAW);
+	/* Call syslog after we thaw all filesystems */
+	if (ent)
+		syslog(LOG_ERR, "FREEZE of %s failed; error:%d %s",
+		       ent->mnt_dir, save_errno, strerror(save_errno));
+	else
+		syslog(LOG_ERR, "FREEZE of / failed; error:%d %s", save_errno,
+		       strerror(save_errno));
+out:
+	endmntent(mounts);
 	return error;
 }

--
1.9.3
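The ordering constraint in this patch (capture errno first, run the cleanup that may overwrite it, report only afterwards) can be shown in isolation. The sketch below is a standalone illustration, not the daemon's code: cleanup_that_clobbers_errno() and fail_then_cleanup() are hypothetical stand-ins for the thaw pass and the failing freeze.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the thaw pass: it overwrites errno. */
static void cleanup_that_clobbers_errno(void)
{
	errno = 0;
}

/*
 * Hypothetical failing operation. It fails with `err`, runs the
 * cleanup, and hands back the errno seen at the point of failure
 * through *reported so it can be logged once logging is safe again.
 */
static int fail_then_cleanup(int err, int *reported)
{
	int save_errno;

	assert(reported != NULL);

	errno = err;                   /* the operation fails here */
	save_errno = errno;            /* capture before anything else runs */
	cleanup_that_clobbers_errno(); /* errno is stale from now on */
	*reported = save_errno;        /* report the saved copy instead */
	return -1;
}
```

Reading errno after the cleanup would yield whatever the cleanup left behind, which is the same kind of misleading report the patch avoids by introducing save_errno.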
Re: [PATCH] hv: hv_fcopy: drop the obsolete message on transfer failure
Dexuan Cui <de...@microsoft.com> writes:

> In the case the user-space daemon crashes, hangs or is killed, we
> need to down the semaphore, otherwise, after the daemon starts next
> time, the obsolete data in fcopy_transaction.message or
> fcopy_transaction.fcopy_msg will be used immediately.
>
> Cc: K. Y. Srinivasan <k...@microsoft.com>
> Signed-off-by: Dexuan Cui <de...@microsoft.com>

Reviewed-by: Vitaly Kuznetsov <vkuzn...@redhat.com>

> ---
>  drivers/hv/hv_fcopy.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
> index 23b2ce2..177122a 100644
> --- a/drivers/hv/hv_fcopy.c
> +++ b/drivers/hv/hv_fcopy.c
> @@ -86,6 +86,15 @@ static void fcopy_work_func(struct work_struct *dummy)
>  	 * process the pending transaction.
>  	 */
>  	fcopy_respond_to_host(HV_E_FAIL);
> +
> +	/* In the case the user-space daemon crashes, hangs or is killed, we
> +	 * need to down the semaphore, otherwise, after the daemon starts next
> +	 * time, the obsolete data in fcopy_transaction.message or
> +	 * fcopy_transaction.fcopy_msg will be used immediately.
> +	 */
> +	if (down_trylock(&fcopy_transaction.read_sema))
> +		pr_debug("FCP: failed to acquire the semaphore\n");
> +
>  }
>
>  static int fcopy_handle_handshake(u32 version)

--
  Vitaly
Re: [PATCH] xen/blkfront: improve protection against issuing unsupported REQ_FUA
Boris Ostrovsky <boris.ostrov...@oracle.com> writes:

> On 11/03/2014 07:22 AM, Laszlo Ersek wrote:
>> On 10/27/14 14:44, Vitaly Kuznetsov wrote:
>>> Guard against issuing unsupported REQ_FUA and REQ_FLUSH was introduced
>>> in d11e61583 and was factored out into blkif_request_flush_valid() in
>>> 0f1ca65ee. However:
>>> 1) This check is incomplete. In case we negotiated to
>>>    feature_flush = REQ_FLUSH and flush_op = BLKIF_OP_FLUSH_DISKCACHE
>>>    (so FUA is unsupported) FUA request will still pass the check.
>>> 2) blkif_request_flush_valid() is misnamed. It is bool but returns
>>>    true when the request is invalid.
>>> 3) When blkif_request_flush_valid() fails -EIO is being returned. It
>>>    seems that -EOPNOTSUPP is more appropriate here.
>>> Fix all of the above issues.
>>>
>>> This patch is based on the original patch by Laszlo Ersek and a
>>> comment by Jeff Moyer.
>>>
>>> Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
>>> ---
>>>  drivers/block/xen-blkfront.c | 14 ++++++++------
>>>  1 file changed, 8 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>>> index 5ac312f..2e6c103 100644
>>> --- a/drivers/block/xen-blkfront.c
>>> +++ b/drivers/block/xen-blkfront.c
>>> @@ -582,12 +582,14 @@ static inline void flush_requests(struct blkfront_info *info)
>>>  		notify_remote_via_irq(info->irq);
>>>  }
>>>
>>> -static inline bool blkif_request_flush_valid(struct request *req,
>>> -					     struct blkfront_info *info)
>>> +static inline bool blkif_request_flush_invalid(struct request *req,
>>> +					       struct blkfront_info *info)
>>>  {
>>>  	return ((req->cmd_type != REQ_TYPE_FS) ||
>>> -		((req->cmd_flags & (REQ_FLUSH | REQ_FUA)) &&
>>> -		!info->flush_op));
>>> +		((req->cmd_flags & REQ_FLUSH) &&
>>> +		 !(info->feature_flush & REQ_FLUSH)) ||
>>> +		((req->cmd_flags & REQ_FUA) &&
>>> +		 !(info->feature_flush & REQ_FUA)));
>
> Somewhat unrelated to the patch, but I am wondering whether we actually
> need flush_op field at all as it seems that it is unambiguously defined
> by REQ_FLUSH/REQ_FUA.

I was under the impression it was added for readability's sake but we
definitely can remove it. If no one objects I'll send a separate cleanup
patch (I don't want to mix these two).

> -boris
>
>>>  }
>>>
>>>  /*
>>> @@ -612,8 +614,8 @@ static void do_blkif_request(struct request_queue *rq)
>>>
>>>  		blk_start_request(req);
>>>
>>> -		if (blkif_request_flush_valid(req, info)) {
>>> -			__blk_end_request_all(req, -EIO);
>>> +		if (blkif_request_flush_invalid(req, info)) {
>>> +			__blk_end_request_all(req, -EOPNOTSUPP);
>>>  			continue;
>>>  		}
>>>
>>
>> Not sure if there has been some feedback yet (I can't see anything
>> threaded with this message in my inbox).
>>
>> FWIW I consulted "Documentation/block/writeback_cache_control.txt" for
>> this review. Apparently, REQ_FLUSH forces out "previously completed
>> write requests", whereas REQ_FUA delays the IO completion signal for
>> *this* request until "the data has been committed to non-volatile
>> storage". So, indeed, support for REQ_FLUSH only does not guarantee
>> that REQ_FUA can be served.
>>
>> Reviewed-by: Laszlo Ersek <ler...@redhat.com>
>>
>> Thanks
>> Laszlo

--
  Vitaly
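The corrected predicate from this patch can be tried outside the kernel with stand-in flag values. This is a hedged sketch only: DEMO_REQ_FLUSH, DEMO_REQ_FUA and demo_flush_invalid() are hypothetical names and values chosen for the example, and the cmd_type check from the real function is left out because it depends on kernel types.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in flag bits for illustration; the kernel defines its own values. */
#define DEMO_REQ_FLUSH (1u << 0)
#define DEMO_REQ_FUA   (1u << 1)

/*
 * Hypothetical userspace rendering of the per-flag check: a request
 * is invalid when it asks for FLUSH or FUA and the negotiated
 * feature set does not include that particular flag.
 */
static bool demo_flush_invalid(unsigned int cmd_flags,
			       unsigned int feature_flush)
{
	/* Only the two demo bits are meaningful in this sketch. */
	assert((feature_flush & ~(DEMO_REQ_FLUSH | DEMO_REQ_FUA)) == 0);

	return ((cmd_flags & DEMO_REQ_FLUSH) &&
		!(feature_flush & DEMO_REQ_FLUSH)) ||
	       ((cmd_flags & DEMO_REQ_FUA) &&
		!(feature_flush & DEMO_REQ_FUA));
}
```

The case that motivated the fix is a FUA request against a backend that negotiated FLUSH only: the old combined test let it through, while checking each flag against feature_flush rejects it.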
[PATCH] Drivers: hv: vmbus: prevent cpu offlining on newer hypervisors
When an SMP Hyper-V guest is running on top of 2012R2 Server and secondary
cpus are sent offline (with echo 0 > /sys/devices/system/cpu/cpu$cpu/online)
the system freeze is observed. This happens due to the fact that on newer
hypervisors (Win8, WS2012R2, ...) vmbus channel handlers are distributed
across all cpus (see init_vp_index() function in drivers/hv/channel_mgmt.c)
and on cpu offlining nobody reassigns them to CPU0. Prevent cpu offlining
when vmbus is loaded until the issue is fixed host-side.

This patch also disables hibernation but it is OK as it is also broken (MCE
error is hit on resume). Suspend still works.

Tested with WS2008R2 and WS2012R2.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 drivers/hv/vmbus_drv.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 4d6b269..9a82249 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -32,6 +32,7 @@
 #include <linux/completion.h>
 #include <linux/hyperv.h>
 #include <linux/kernel_stat.h>
+#include <linux/cpu.h>
 #include <asm/hyperv.h>
 #include <asm/hypervisor.h>
 #include <asm/mshyperv.h>
@@ -671,6 +672,13 @@ static void vmbus_isr(void)
 		tasklet_schedule(&msg_dpc);
 }

+#ifdef CONFIG_HOTPLUG_CPU
+static int hyperv_cpu_disable(void)
+{
+	return -1;
+}
+#endif
+
 /*
  * vmbus_bus_init -Main vmbus driver initialization routine.
  *
@@ -711,6 +719,12 @@ static int vmbus_bus_init(int irq)
 	if (ret)
 		goto err_alloc;

+#ifdef CONFIG_HOTPLUG_CPU
+	if ((vmbus_proto_version != VERSION_WS2008) &&
+	    (vmbus_proto_version != VERSION_WIN7))
+		smp_ops.cpu_disable = hyperv_cpu_disable;
+#endif
+
 	vmbus_request_offers();

 	return 0;
@@ -964,6 +978,11 @@ static void __exit vmbus_exit(void)
 	bus_unregister(&hv_bus);
 	hv_cleanup();
 	acpi_bus_unregister_driver(&vmbus_acpi_driver);
+#ifdef CONFIG_HOTPLUG_CPU
+	if ((vmbus_proto_version != VERSION_WS2008) &&
+	    (vmbus_proto_version != VERSION_WIN7))
+		smp_ops.cpu_disable = native_cpu_disable;
+#endif
 }

--
1.9.3
Re: [PATCH] Drivers: hv: vmbus: prevent cpu offlining on newer hypervisors
Dexuan Cui <de...@microsoft.com> writes:

> -----Original Message-----
> From: devel [mailto:driverdev-devel-boun...@linuxdriverproject.org] On
>       Behalf Of Greg Kroah-Hartman
> Sent: Thursday, November 27, 2014 11:03 AM
> To: Vitaly Kuznetsov
> Cc: de...@linuxdriverproject.org; Haiyang Zhang;
>     linux-ker...@vger.kernel.org
> Subject: Re: [PATCH] Drivers: hv: vmbus: prevent cpu offlining on newer
>          hypervisors
>
>> On Wed, Nov 26, 2014 at 02:52:22PM +0100, Vitaly Kuznetsov wrote:
>>> When an SMP Hyper-V guest is running on top of 2012R2 Server and
>>> secondary cpus are sent offline (with
>>> echo 0 > /sys/devices/system/cpu/cpu$cpu/online) the system freeze is
>>> observed. This happens due to the fact that on newer hypervisors
>>> (Win8, WS2012R2, ...) vmbus channel handlers are distributed across
>>> all cpus (see init_vp_index() function in drivers/hv/channel_mgmt.c)
>>> and on cpu offlining nobody reassigns them to CPU0. Prevent cpu
>>> offlining when vmbus is loaded until the issue is fixed host-side.
>>>
>>> This patch also disables hibernation but it is OK as it is also
>>> broken (MCE error is hit on resume). Suspend still works.
>>>
>>> Tested with WS2008R2 and WS2012R2.
>>>
>>> Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
>>> ---
>>>  drivers/hv/vmbus_drv.c | 19 +++++++++++++++++++
>>>  1 file changed, 19 insertions(+)
>>>
>>> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
>>> index 4d6b269..9a82249 100644
>>> --- a/drivers/hv/vmbus_drv.c
>>> +++ b/drivers/hv/vmbus_drv.c
>>> @@ -32,6 +32,7 @@
>>>  #include <linux/completion.h>
>>>  #include <linux/hyperv.h>
>>>  #include <linux/kernel_stat.h>
>>> +#include <linux/cpu.h>
>>>  #include <asm/hyperv.h>
>>>  #include <asm/hypervisor.h>
>>>  #include <asm/mshyperv.h>
>>> @@ -671,6 +672,13 @@ static void vmbus_isr(void)
>>>  		tasklet_schedule(&msg_dpc);
>>>  }
>>>
>>> +#ifdef CONFIG_HOTPLUG_CPU
>>> +static int hyperv_cpu_disable(void)
>>> +{
>>> +	return -1;
>>> +}
>>> +#endif
>>> +
>>>  /*
>>>   * vmbus_bus_init -Main vmbus driver initialization routine.
>>>   *
>>> @@ -711,6 +719,12 @@ static int vmbus_bus_init(int irq)
>>>  	if (ret)
>>>  		goto err_alloc;
>>>
>>> +#ifdef CONFIG_HOTPLUG_CPU
>>> +	if ((vmbus_proto_version != VERSION_WS2008) &&
>>> +	    (vmbus_proto_version != VERSION_WIN7))
>>> +		smp_ops.cpu_disable = hyperv_cpu_disable;
>>> +#endif
>>> +
>>>  	vmbus_request_offers();
>>>
>>>  	return 0;
>>> @@ -964,6 +978,11 @@ static void __exit vmbus_exit(void)
>>>  	bus_unregister(&hv_bus);
>>>  	hv_cleanup();
>>>  	acpi_bus_unregister_driver(&vmbus_acpi_driver);
>>> +#ifdef CONFIG_HOTPLUG_CPU
>>> +	if ((vmbus_proto_version != VERSION_WS2008) &&
>>> +	    (vmbus_proto_version != VERSION_WIN7))
>>> +		smp_ops.cpu_disable = native_cpu_disable;
>>> +#endif
>>>  }
>>
>> #ifdef in a .c file is not a good idea to do if at all possible, please
>> only put this in one place, using a function call to hide the mess.
>>
>> greg k-h
>
> Hi Vitaly,
> The idea of the patch is good to me.
>
> I agree with Greg.
>
> BTW, maybe hv_cpu_hotplug_quirk() is a better name?

My idea was that eventually this function will start doing something real
(e.g. switching channels to cpu0 if it doesn't happen fully host-side) so
I called it with a general name 'hyperv_cpu_disable'. I'll try addressing
your and Greg's comments in v2, thanks!

> Thanks,
> -- Dexuan

--
  Vitaly
[PATCH 1/2] Drivers: hv: vss: Introduce timeout for communication with userspace
In contrast with KVP there is no timeout when communicating with userspace VSS
daemon. In case it gets stuck performing freeze/thaw operation no message will
be sent to the host so it will take very long (around 10 minutes) before
backup fails. Introduce 10 second timeout using schedule_delayed_work().

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 drivers/hv/hv_snapshot.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/hv_snapshot.c b/drivers/hv/hv_snapshot.c
index 34f14fd..21e51be 100644
--- a/drivers/hv/hv_snapshot.c
+++ b/drivers/hv/hv_snapshot.c
@@ -28,7 +28,7 @@
 #define VSS_MINOR  0
 #define VSS_VERSION    (VSS_MAJOR << 16 | VSS_MINOR)

-
+#define VSS_USERSPACE_TIMEOUT (msecs_to_jiffies(10 * 1000))

 /*
  * Global state maintained for transaction that is being processed.
@@ -55,12 +55,24 @@ static const char vss_name[] = "vss_kernel_module";
 static __u8 *recv_buffer;

 static void vss_send_op(struct work_struct *dummy);
+static void vss_timeout_func(struct work_struct *dummy);
+
+static DECLARE_DELAYED_WORK(vss_timeout_work, vss_timeout_func);
 static DECLARE_WORK(vss_send_op_work, vss_send_op);

 /*
  * Callback when data is received from user mode.
  */

+static void vss_timeout_func(struct work_struct *dummy)
+{
+	/*
+	 * Timeout waiting for userspace component to reply happened.
+	 */
+	pr_warn("VSS: timeout waiting for daemon to reply\n");
+	vss_respond_to_host(HV_E_FAIL);
+}
+
 static void
 vss_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
 {
@@ -76,7 +88,8 @@ vss_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
 		return;

 	}
-	vss_respond_to_host(vss_msg->error);
+	if (cancel_delayed_work_sync(&vss_timeout_work))
+		vss_respond_to_host(vss_msg->error);
 }


@@ -223,6 +236,8 @@ void hv_vss_onchannelcallback(void *context)
 			case VSS_OP_FREEZE:
 			case VSS_OP_THAW:
 				schedule_work(&vss_send_op_work);
+				schedule_delayed_work(&vss_timeout_work,
+						      VSS_USERSPACE_TIMEOUT);
 				return;

 			case VSS_OP_HOT_BACKUP:
@@ -277,5 +292,6 @@ hv_vss_init(struct hv_util_service *srv)
 void hv_vss_deinit(void)
 {
 	cn_del_callback(&vss_id);
+	cancel_delayed_work_sync(&vss_timeout_work);
 	cancel_work_sync(&vss_send_op_work);
 }
--
1.9.3
[PATCH 0/2] Drivers: hv: kvp,vss: improve kernel-userspace communication in failure case
This series addresses two issues:
- There is no timeout when communicating with userspace VSS daemon.
- In case we fail to send a message to VSS or KVP userspace daemons we can
  report the failure to the host right away avoiding the timeout.

Newly introduced 10 second timeout is something worth discussing. In theory
freeze/thaw ioctls should be fast. In case someone thinks 10 seconds is not
enough we can easily increase it as we cover the most common failure scenario
(when the daemon was stopped) with the second patch of this series.

Vitaly Kuznetsov (2):
  Drivers: hv: vss: Introduce timeout for communication with userspace
  Drivers: hv: kvp,vss: Fast propagation of userspace communication failure

 drivers/hv/hv_kvp.c      |  9 ++++++++-
 drivers/hv/hv_snapshot.c | 28 +++++++++++++++++++++++++---
 2 files changed, 33 insertions(+), 4 deletions(-)

--
1.9.3
[PATCH 2/2] Drivers: hv: kvp,vss: Fast propagation of userspace communication failure
If we fail to send a message to userspace daemon with cn_netlink_send()
there is no need to wait for userspace to reply as it is not going to
happen. This happens when kvp or vss daemon is stopped after a successful
handshake. Report HV_E_FAIL immediately and cancel the timeout job so host
won't receive two failures.

Use pr_warn() for VSS and pr_debug() for KVP deliberately as VSS requests
are rare and result in a failed backup. KVP requests are much more frequent
after a successful handshake so avoid flooding logs. It would be nice to
have an ability to de-negotiate with the host in case userspace daemon gets
disconnected so we won't receive new requests. But I'm not sure it is
possible.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 drivers/hv/hv_kvp.c      | 9 ++++++++-
 drivers/hv/hv_snapshot.c | 8 +++++++-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/hv_kvp.c b/drivers/hv/hv_kvp.c
index 521c146..beb8105 100644
--- a/drivers/hv/hv_kvp.c
+++ b/drivers/hv/hv_kvp.c
@@ -350,6 +350,7 @@ kvp_send_key(struct work_struct *dummy)
 	__u8 pool = kvp_transaction.kvp_msg->kvp_hdr.pool;
 	__u32 val32;
 	__u64 val64;
+	int rc;

 	msg = kzalloc(sizeof(*msg) + sizeof(struct hv_kvp_msg) , GFP_ATOMIC);
 	if (!msg)
@@ -446,7 +447,13 @@ kvp_send_key(struct work_struct *dummy)
 	}

 	msg->len = sizeof(struct hv_kvp_msg);
-	cn_netlink_send(msg, 0, 0, GFP_ATOMIC);
+	rc = cn_netlink_send(msg, 0, 0, GFP_ATOMIC);
+	if (rc) {
+		pr_debug("KVP: failed to communicate to the daemon: %d\n", rc);
+		if (cancel_delayed_work_sync(&kvp_work))
+			kvp_respond_to_host(message, HV_E_FAIL);
+	}
+
 	kfree(msg);

 	return;
diff --git a/drivers/hv/hv_snapshot.c b/drivers/hv/hv_snapshot.c
index 21e51be..9d5e0d1 100644
--- a/drivers/hv/hv_snapshot.c
+++ b/drivers/hv/hv_snapshot.c
@@ -96,6 +96,7 @@ vss_cn_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp)
 static void vss_send_op(struct work_struct *dummy)
 {
 	int op = vss_transaction.msg->vss_hdr.operation;
+	int rc;
 	struct cn_msg *msg;
 	struct hv_vss_msg *vss_msg;

@@ -111,7 +112,12 @@ static void vss_send_op(struct work_struct *dummy)
 	vss_msg->vss_hdr.operation = op;
 	msg->len = sizeof(struct hv_vss_msg);

-	cn_netlink_send(msg, 0, 0, GFP_ATOMIC);
+	rc = cn_netlink_send(msg, 0, 0, GFP_ATOMIC);
+	if (rc) {
+		pr_warn("VSS: failed to communicate to the daemon: %d\n", rc);
+		if (cancel_delayed_work_sync(&vss_timeout_work))
+			vss_respond_to_host(HV_E_FAIL);
+	}
 	kfree(msg);

 	return;
--
1.9.3
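Both patches in this series rely on the same idea: the host must get exactly one response per transaction, so whichever path (daemon reply, send failure, or timeout) manages to cancel the pending timeout work is the one that responds. A loose userspace analogy, assuming C11 atomics instead of the kernel workqueue API, is sketched below; struct demo_txn, demo_txn_start() and demo_txn_respond() are hypothetical names, and the atomic flag merely plays the role of the return value of cancel_delayed_work_sync().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transaction state: one response is owed while pending. */
struct demo_txn {
	atomic_bool pending;
	int responses; /* how many responses were actually sent */
};

/* Begin a transaction: a single response is now outstanding. */
static void demo_txn_start(struct demo_txn *t)
{
	assert(t != NULL);
	atomic_init(&t->pending, true);
	t->responses = 0;
}

/*
 * Called from every path that might answer the host. Only the caller
 * that flips pending from true to false sends the response; all later
 * callers see false and back off.
 */
static bool demo_txn_respond(struct demo_txn *t)
{
	assert(t != NULL);
	if (!atomic_exchange(&t->pending, false))
		return false;
	t->responses++;
	return true;
}
```

This is an analogy rather than a port: in the kernel code the claim is made by cancelling the delayed work item, and the timeout handler itself responds without an additional check.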
[PATCH 3/3] Tools: hv: vssdaemon: thaw everything in case of freeze failure
If one or more filesystems failed to freeze we need to thaw everything as
host doing backup won't issue THAW request after we return HV_E_FAIL and our
system will remain with frozen filesystems forever.

There is no track of filesystems we freeze so in case there is some external
tool doing freeze/thaw requests at the same time they will collide with vss
daemon. This issue can be addressed by introducing a freeze/thaw transaction
and keeping track of what was actually frozen.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 tools/hv/hv_vss_daemon.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index 7be999a..e98c638 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -284,6 +284,12 @@ int main(int argc, char *argv[])
 			error = vss_operate(op);
 			if (error)
 				error = HV_E_FAIL;
+			if (error && op == VSS_OP_FREEZE) {
+				/* Need to thaw all frozen fylesystems */
+				syslog(LOG_ERR,
+				       "Freeze failed, thaw everything");
+				vss_operate(VSS_OP_THAW);
+			}
 			break;
 		default:
 			syslog(LOG_ERR, "Illegal op:%d\n", op);
--
1.9.3
[PATCH 0/3] Tools: hv: vssdaemon: freeze/thaw logic improvement for the failure case
This patch series addresses the following issues:
- Wrong error reporting for multiple filesystems case.
- Skip all readonly-mounted filesystems instead of skipping iso9660.
- Thaw all filesystems after an unsuccessful freeze attempt.

Vitaly Kuznetsov (3):
  Tools: hv: vssdaemon: consult with errno in case of failure only
  Tools: hv: vssdaemon: skip all filesystems mounted readonly
  Tools: hv: vssdaemon: thaw everything in case of freeze failure

 tools/hv/hv_vss_daemon.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

--
1.9.3
[PATCH 1/3] Tools: hv: vssdaemon: consult with errno in case of failure only
If ioctl() returns 0 there is no point in examining errno and it can actually
produce misleading output. In case there was no failure errno will contain
the error code for the previous failure so the user will see the following in
the log:

Hyper-V VSS: VSS: freeze of /mnt/udf: Operation not supported
Hyper-V VSS: VSS: freeze of /: Operation not supported

We should also log errors with LOG_ERR instead of LOG_INFO.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 tools/hv/hv_vss_daemon.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index 9ae2b6e..5f67858 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -52,7 +52,11 @@ static int vss_do_freeze(char *dir, unsigned int cmd, char *fs_op)
 	if (fd < 0)
 		return 1;
 	ret = ioctl(fd, cmd, 0);
-	syslog(LOG_INFO, "VSS: %s of %s: %s\n", fs_op, dir, strerror(errno));
+	if (ret)
+		syslog(LOG_ERR, "VSS: %s of %s: %s\n", fs_op, dir,
+		       strerror(errno));
+	else
+		syslog(LOG_INFO, "VSS: %s of %s succeeded\n", fs_op, dir);
 	close(fd);
 	return !!ret;
 }
--
1.9.3
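The misleading log output described in this patch comes from reading errno after a call that succeeded. A small standalone sketch of the corrected pattern follows; format_op_result() is a hypothetical helper that writes into a buffer instead of calling syslog(), so the behaviour can be checked without a log daemon.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper mirroring the patched logic: errno is consulted
 * only when `ret` signals failure. On success errno may still hold a
 * stale value from an earlier call, so it is not looked at.
 */
static int format_op_result(char *buf, size_t len, const char *fs_op,
			    const char *dir, int ret)
{
	assert(buf != NULL && len > 0);

	if (ret)
		return snprintf(buf, len, "VSS: %s of %s: %s",
				fs_op, dir, strerror(errno));
	return snprintf(buf, len, "VSS: %s of %s succeeded", fs_op, dir);
}
```

With a stale errno left over from a previous failure, the success branch still produces a clean message, which is the behaviour the patch restores for the second and later filesystems in the list.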
[PATCH 2/3] Tools: hv: vssdaemon: skip all filesystems mounted readonly
Instead of maintaining a list of exceptions for readonly filesystems in addition to the iso9660 one we already have, it is better to skip the freeze operation for all readonly-mounted filesystems.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 tools/hv/hv_vss_daemon.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index 5f67858..7be999a 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -90,7 +90,7 @@ static int vss_operate(int operation)
 	while ((ent = getmntent(mounts))) {
 		if (strncmp(ent->mnt_fsname, match, strlen(match)))
 			continue;
-		if (strcmp(ent->mnt_type, "iso9660") == 0)
+		if (hasmntopt(ent, MNTOPT_RO) != NULL)
 			continue;
 		if (strcmp(ent->mnt_type, "vfat") == 0)
 			continue;
-- 
1.9.3
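As a rough illustration of the hasmntopt() test the patch switches to, the sketch below builds a struct mntent by hand instead of reading it with getmntent(). The helper name mounted_readonly() is hypothetical, and the example assumes a Linux libc that provides <mntent.h>.

```c
#include <mntent.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 when the comma-separated mount option
 * string contains the "ro" option, 0 otherwise. Only mnt_opts is filled
 * in; the remaining fields of the entry are left zeroed.
 */
static int mounted_readonly(const char *opts)
{
	struct mntent ent = { 0 };

	ent.mnt_opts = (char *)opts;
	return hasmntopt(&ent, MNTOPT_RO) != NULL;
}
```

In the daemon itself the entry comes from getmntent(), so the check applies to whatever filesystem type happens to be mounted readonly, not just iso9660.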
Re: [PATCH RESEND] xen/blkfront: improve protection against issuing unsupported REQ_FUA
Boris Ostrovsky <boris.ostrov...@oracle.com> writes:

> On 12/01/2014 08:01 AM, Vitaly Kuznetsov wrote:
>> Guard against issuing unsupported REQ_FUA and REQ_FLUSH was introduced
>> in d11e61583 and was factored out into blkif_request_flush_valid() in
>> 0f1ca65ee. However:
>> 1) This check in incomplete. In case we negotiated to feature_flush = REQ_FLUSH
>>    and flush_op = BLKIF_OP_FLUSH_DISKCACHE (so FUA is unsupported) FUA request
>>    will still pass the check.
>> 2) blkif_request_flush_valid() is misnamed. It is bool but returns true when
>>    the request is invalid.
>> 3) When blkif_request_flush_valid() fails -EIO is being returned. It seems that
>>    -EOPNOTSUPP is more appropriate here.
>> Fix all of the above issues.
>>
>> This patch is based on the original patch by Laszlo Ersek and a comment by
>> Jeff Moyer.
>>
>> Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
>> Reviewed-by: Laszlo Ersek <ler...@redhat.com>
>
> Reviewed-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
>
> (although, as I mentioned last time, a companion patch to remove
> flush_op would be a good thing to have)

Thanks, it is on my todo list but I'm trying to separate this (potential) bugfix from straight cleanup.

> -boris
>
>> ---
>>  drivers/block/xen-blkfront.c | 14 ++++++++------
>>  1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 5ac312f..2e6c103 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -582,12 +582,14 @@ static inline void flush_requests(struct blkfront_info *info)
>>  		notify_remote_via_irq(info->irq);
>>  }
>>
>> -static inline bool blkif_request_flush_valid(struct request *req,
>> -					     struct blkfront_info *info)
>> +static inline bool blkif_request_flush_invalid(struct request *req,
>> +					       struct blkfront_info *info)
>>  {
>>  	return ((req->cmd_type != REQ_TYPE_FS) ||
>> -		((req->cmd_flags & (REQ_FLUSH | REQ_FUA)) &&
>> -		!info->flush_op));
>> +		((req->cmd_flags & REQ_FLUSH) &&
>> +		 !(info->feature_flush & REQ_FLUSH)) ||
>> +		((req->cmd_flags & REQ_FUA) &&
>> +		 !(info->feature_flush & REQ_FUA)));
>>  }
>>
>>  /*
>> @@ -612,8 +614,8 @@ static void do_blkif_request(struct request_queue *rq)
>>
>>  		blk_start_request(req);
>>
>> -		if (blkif_request_flush_valid(req, info)) {
>> -			__blk_end_request_all(req, -EIO);
>> +		if (blkif_request_flush_invalid(req, info)) {
>> +			__blk_end_request_all(req, -EOPNOTSUPP);
>>  			continue;
>>  		}
>>

-- 
  Vitaly
[PATCH] Drivers: hv: vmbus: Fix a race condition when unregistering a device
When built with Debug, the following crash is sometimes observed:

Call Trace:
 [812b9600] string+0x40/0x100
 [812bb038] vsnprintf+0x218/0x5e0
 [810baf7d] ? trace_hardirqs_off+0xd/0x10
 [812bb4c1] vscnprintf+0x11/0x30
 [8107a2f0] vprintk+0xd0/0x5c0
 [a0051ea0] ? vmbus_process_rescind_offer+0x0/0x110 [hv_vmbus]
 [8155c71c] printk+0x41/0x45
 [a004ebac] vmbus_device_unregister+0x2c/0x40 [hv_vmbus]
 [a0051ecb] vmbus_process_rescind_offer+0x2b/0x110 [hv_vmbus]
 ...

This happens due to the following race: between the 'if (channel->device_obj)' check in vmbus_process_rescind_offer() and the pr_debug() in vmbus_device_unregister(), the device can disappear. Fix the issue by taking an additional reference to the device before proceeding to vmbus_device_unregister().

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 drivers/hv/channel_mgmt.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index a2d1a96..d36ce68 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -216,9 +216,16 @@ static void vmbus_process_rescind_offer(struct work_struct *work)
 	unsigned long flags;
 	struct vmbus_channel *primary_channel;
 	struct vmbus_channel_relid_released msg;
+	struct device *dev;
+
+	if (channel->device_obj) {
+		dev = get_device(&channel->device_obj->device);
+		if (dev) {
+			vmbus_device_unregister(channel->device_obj);
+			put_device(dev);
+		}
+	}
 
-	if (channel->device_obj)
-		vmbus_device_unregister(channel->device_obj);
 	memset(&msg, 0, sizeof(struct vmbus_channel_relid_released));
 	msg.child_relid = channel->offermsg.child_relid;
 	msg.header.msgtype = CHANNELMSG_RELID_RELEASED;
-- 
1.9.3
Re: [PATCH] tools: hv: introduce -n/--no-daemon option
KY Srinivasan k...@microsoft.com writes: -Original Message- From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] Sent: Wednesday, October 22, 2014 9:07 AM To: KY Srinivasan; Haiyang Zhang; de...@linuxdriverproject.org Cc: linux-kernel@vger.kernel.org Subject: [PATCH] tools: hv: introduce -n/--no-daemon option All tools/hv daemons do mandatory daemon() on startup. However, no pidfile is created, this make it difficult for an init system to track such daemons. Modern linux distros use systemd as their init system. It can handle the daemonizing by itself, however, it requires a daemon to stay in foreground for that. Some distros already carry distro-specific patch for hv tools which switches off daemon(). Introduce -n/--no-daemon option for all 3 daemons in hv/tools. Parse options with getopt() to make this part easily expandable. Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com You may want to include Greg KH in the to list. For some reason he's missing on the get_maintainer.pl output for all Hyper-V parts. Greg, will you pick this up or do I need to resend? Signed-off-by: K. Y. Srinivasan k...@microsoft.com Thanks! 
--- tools/hv/hv_fcopy_daemon.c | 33 +++- - tools/hv/hv_kvp_daemon.c | 34 -- tools/hv/hv_vss_daemon.c | 33 +++-- 3 files changed, 94 insertions(+), 6 deletions(-) diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c index 8f96b3e..f437d73 100644 --- a/tools/hv/hv_fcopy_daemon.c +++ b/tools/hv/hv_fcopy_daemon.c @@ -33,6 +33,7 @@ #include sys/stat.h #include fcntl.h #include dirent.h +#include getopt.h static int target_fd; static char target_fname[W_MAX_PATH]; @@ -126,15 +127,43 @@ static int hv_copy_cancel(void) } -int main(void) +void print_usage(char *argv[]) +{ +fprintf(stderr, Usage: %s [options]\n +Options are:\n + -n, --no-daemonstay in foreground, don't daemonize\n + -h, --help print this help\n, argv[0]); +} + +int main(int argc, char *argv[]) { int fd, fcopy_fd, len; int error; +int daemonize = 1, long_index = 0, opt; int version = FCOPY_CURRENT_VERSION; char *buffer[4096 * 2]; struct hv_fcopy_hdr *in_msg; -if (daemon(1, 0)) { +static struct option long_options[] = { +{help,no_argument, 0, 'h' }, +{no-daemon, no_argument, 0, 'n' }, +{0, 0, 0, 0 } +}; + +while ((opt = getopt_long(argc, argv, hn, long_options, + long_index)) != -1) { +switch (opt) { +case 'n': +daemonize = 0; +break; +case 'h': +default: +print_usage(argv); +exit(EXIT_FAILURE); +} +} + +if (daemonize daemon(1, 0)) { syslog(LOG_ERR, daemon() failed; error: %s, strerror(errno)); exit(EXIT_FAILURE); } diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c index 4088b81..22b0764 100644 --- a/tools/hv/hv_kvp_daemon.c +++ b/tools/hv/hv_kvp_daemon.c @@ -43,6 +43,7 @@ #include fcntl.h #include dirent.h #include net/if.h +#include getopt.h /* * KVP protocol: The user mode component first registers with the @@ - 1417,7 +1418,15 @@ netlink_send(int fd, struct cn_msg *msg) return sendmsg(fd, message, 0); } -int main(void) +void print_usage(char *argv[]) +{ +fprintf(stderr, Usage: %s [options]\n +Options are:\n + -n, --no-daemonstay in foreground, don't daemonize\n + -h, --help 
print this help\n, argv[0]); +} + +int main(int argc, char *argv[]) { int fd, len, nl_group; int error; @@ -1435,9 +1444,30 @@ int main(void) struct hv_kvp_ipaddr_value *kvp_ip_val; char *kvp_recv_buffer; size_t kvp_recv_buffer_len; +int daemonize = 1, long_index = 0, opt; + +static struct option long_options[] = { +{help,no_argument, 0, 'h' }, +{no-daemon, no_argument, 0, 'n' }, +{0, 0, 0, 0 } +}; + +while ((opt = getopt_long(argc, argv, hn, long_options, + long_index)) != -1) { +switch (opt) { +case 'n': +daemonize = 0; +break; +case 'h': +default: +print_usage(argv); +exit(EXIT_FAILURE); +} +} -if (daemon(1, 0)) +if (daemonize daemon(1, 0)) return 1; + openlog(KVP, 0, LOG_USER); syslog(LOG_INFO, KVP starting; pid is:%d, getpid()); diff --git a/tools
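The option handling added to each daemon can be sketched as a standalone helper. This is an illustration under assumptions, not the patch's code: the function name parse_daemonize() and its return convention are hypothetical, and where the patch prints usage and exits, the sketch returns -1 so the caller decides what to do.

```c
#include <getopt.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 when the program should daemonize,
 * 0 when -n/--no-daemon was given, and -1 when usage should be shown
 * (-h/--help or an unrecognised option).
 */
static int parse_daemonize(int argc, char *argv[])
{
	static const struct option long_options[] = {
		{ "help",      no_argument, NULL, 'h' },
		{ "no-daemon", no_argument, NULL, 'n' },
		{ NULL,        0,           NULL, 0   }
	};
	int daemonize = 1, opt;

	optind = 1;	/* restart scanning so the helper can be called again */
	while ((opt = getopt_long(argc, argv, "hn", long_options,
				  NULL)) != -1) {
		switch (opt) {
		case 'n':
			daemonize = 0;
			break;
		case 'h':
		default:
			return -1;
		}
	}
	return daemonize;
}
```

A caller would then run daemon(1, 0) only when the helper returns 1, which is how the foreground mode needed by systemd-style service managers falls out of the `daemonize && daemon(1, 0)` condition in the patch.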
Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure
Dexuan Cui de...@microsoft.com writes: -Original Message- From: Jason Wang [mailto:jasow...@redhat.com] Sent: Monday, December 1, 2014 16:23 PM To: Dexuan Cui Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev- de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; KY Srinivasan; vkuzn...@redhat.com; Haiyang Zhang Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure On Fri, Nov 28, 2014 at 7:54 PM, Dexuan Cui de...@microsoft.com wrote: -Original Message- From: Jason Wang [mailto:jasow...@redhat.com] Sent: Friday, November 28, 2014 18:13 PM To: Dexuan Cui Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev- de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; KY Srinivasan; vkuzn...@redhat.com; Haiyang Zhang Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui de...@microsoft.com wrote: -Original Message- From: Jason Wang [mailto:jasow...@redhat.com] Sent: Friday, November 28, 2014 14:47 PM To: Dexuan Cui Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev- de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; KY Srinivasan; vkuzn...@redhat.com; Haiyang Zhang Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui de...@microsoft.com wrote: In the case the user-space daemon crashes, hangs or is killed, we need to down the semaphore, otherwise, after the daemon starts next time, the obsolete data in fcopy_transaction.message or fcopy_transaction.fcopy_msg will be used immediately. Cc: Jason Wang jasow...@redhat.com Cc: Vitaly Kuznetsov vkuzn...@redhat.com Cc: K. Y. Srinivasan k...@microsoft.com Signed-off-by: Dexuan Cui de...@microsoft.com --- v2: I removed the FCP prefix as Greg asked. 
I also updated the output message a little: FCP: failed to acquire the semaphore -- can not acquire the semaphore: it is benign v3: I added the code in fcopy_release() as Jason Wang suggested. I removed the pr_debug (it isn't so meaningful)and added a comment instead. drivers/hv/hv_fcopy.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c index 23b2ce2..faa6ba6 100644 --- a/drivers/hv/hv_fcopy.c +++ b/drivers/hv/hv_fcopy.c @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct *dummy) * process the pending transaction. */ fcopy_respond_to_host(HV_E_FAIL); + + /* In the case the user-space daemon crashes, hangs or is killed, we + * need to down the semaphore, otherwise, after the daemon starts next + * time, the obsolete data in fcopy_transaction.message or + * fcopy_transaction.fcopy_msg will be used immediately. + * + * NOTE: fcopy_read() happens to get the semaphore (very rare)? We're + * still OK, because we've reported the failure to the host. + */ + if (down_trylock(fcopy_transaction.read_sema)) + ; Sorry, I'm not quite understand how if () ; can help here. Btw, a question not relate to this patch. What happens if a daemon is resume from SIGSTOP and expires the check here? Hi Jason, My idea is: here we need down_trylock(), but in case we can't get the semaphore, it's OK anyway: Scenario 1): 1.1: when the daemon is blocked on the pread(), the daemon receives SIGSTOP; 1.2: the host user runs the PowerShell Copy-VMFile command; 1.3.1: the driver reports the failure to the host user in 5s and 1.3.2: the driver down()-es the semaphore; 1.4: the daemon receives SIGCONT and it will be still blocked on the pread(). Without the down_trylock(), in 1.4, the daemon can receive an obsolete message. NOTE: in this scenario, the daemon is not killed. 
Scenario 2): In senario 1), if the daemon receives SIGCONT between 1.3.1 and 1.3.2 and do down() in fcopy_read(), it will receive the message but: the driver has reported the failure to the host user and the driver's 1.3.2 can't get the semaphore -- IMO this is acceptably OK, though in the VM, an incomplete file will be left there. BTW, I think in the daemon's hv_start_fcopy() we should add a close(target_fd) before open()-ing a new one. Right, but how about the case when resuming from SIGSTOP but no timeout? Sorry, I don't understand this: if no timeout, fcopy_read() will get the semaphore
[PATCH v2] Drivers: hv: vmbus: prevent cpu offlining on newer hypervisors
When an SMP Hyper-V guest is running on top of 2012R2 Server and secondary cpus are sent offline (with echo 0 > /sys/devices/system/cpu/cpu$cpu/online), a system freeze is observed. This happens because on newer hypervisors (Win8, WS2012R2, ...) vmbus channel handlers are distributed across all cpus (see the init_vp_index() function in drivers/hv/channel_mgmt.c) and on cpu offlining nobody reassigns them to CPU0. Prevent cpu offlining when vmbus is loaded until the issue is fixed host-side.

This patch also disables hibernation but it is OK as it is also broken (MCE error is hit on resume). Suspend still works.

Tested with WS2008R2 and WS2012R2.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
Changes since v1:
- introduce hv_cpu_hotplug_quirk() function to not spread #ifdefs [Greg KH]
- add pr_notice() message "hv_vmbus: CPU offlining is not supported by hypervisor"
---
 drivers/hv/vmbus_drv.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 4d6b269..2e6b38e 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -32,6 +32,7 @@
 #include <linux/completion.h>
 #include <linux/hyperv.h>
 #include <linux/kernel_stat.h>
+#include <linux/cpu.h>
 #include <asm/hyperv.h>
 #include <asm/hypervisor.h>
 #include <asm/mshyperv.h>
@@ -671,6 +672,36 @@ static void vmbus_isr(void)
 	tasklet_schedule(&msg_dpc);
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
+static int hyperv_cpu_disable(void)
+{
+	return -1;
+}
+
+static void hv_cpu_hotplug_quirk(bool vmbus_loaded)
+{
+	/*
+	 * Offlining a CPU when running on newer hypervisors (WS2012R2, Win8,
+	 * ...) is not supported at this moment as channel interrupts are
+	 * distributed across all of them.
+	 */
+
+	if ((vmbus_proto_version == VERSION_WS2008) ||
+	    (vmbus_proto_version == VERSION_WIN7))
+		return;
+
+	if (vmbus_loaded) {
+		smp_ops.cpu_disable = hyperv_cpu_disable;
+		pr_notice("CPU offlining is not supported by hypervisor");
+	} else
+		smp_ops.cpu_disable = native_cpu_disable;
+}
+#else
+static void hv_cpu_hotplug_quirk(bool vmbus_loaded)
+{
+}
+#endif
+
 /*
  * vmbus_bus_init -Main vmbus driver initialization routine.
  *
@@ -711,6 +742,7 @@ static int vmbus_bus_init(int irq)
 	if (ret)
 		goto err_alloc;
 
+	hv_cpu_hotplug_quirk(true);
 	vmbus_request_offers();
 
 	return 0;
@@ -964,6 +996,7 @@ static void __exit vmbus_exit(void)
 	bus_unregister(&hv_bus);
 	hv_cleanup();
 	acpi_bus_unregister_driver(&vmbus_acpi_driver);
+	hv_cpu_hotplug_quirk(false);
 }
-- 
1.9.3
[PATCH RESEND] xen/blkfront: improve protection against issuing unsupported REQ_FUA
Guard against issuing unsupported REQ_FUA and REQ_FLUSH was introduced in d11e61583 and was factored out into blkif_request_flush_valid() in 0f1ca65ee. However:
1) This check is incomplete. In case we negotiated to feature_flush = REQ_FLUSH and flush_op = BLKIF_OP_FLUSH_DISKCACHE (so FUA is unsupported), a FUA request will still pass the check.
2) blkif_request_flush_valid() is misnamed. It is bool but returns true when the request is invalid.
3) When blkif_request_flush_valid() fails, -EIO is returned. It seems that -EOPNOTSUPP is more appropriate here.
Fix all of the above issues.

This patch is based on the original patch by Laszlo Ersek and a comment by Jeff Moyer.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
Reviewed-by: Laszlo Ersek <ler...@redhat.com>
---
 drivers/block/xen-blkfront.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 5ac312f..2e6c103 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -582,12 +582,14 @@ static inline void flush_requests(struct blkfront_info *info)
 		notify_remote_via_irq(info->irq);
 }
 
-static inline bool blkif_request_flush_valid(struct request *req,
-					     struct blkfront_info *info)
+static inline bool blkif_request_flush_invalid(struct request *req,
+					       struct blkfront_info *info)
 {
 	return ((req->cmd_type != REQ_TYPE_FS) ||
-		((req->cmd_flags & (REQ_FLUSH | REQ_FUA)) &&
-		!info->flush_op));
+		((req->cmd_flags & REQ_FLUSH) &&
+		 !(info->feature_flush & REQ_FLUSH)) ||
+		((req->cmd_flags & REQ_FUA) &&
+		 !(info->feature_flush & REQ_FUA)));
 }
 
 /*
@@ -612,8 +614,8 @@ static void do_blkif_request(struct request_queue *rq)
 
 		blk_start_request(req);
 
-		if (blkif_request_flush_valid(req, info)) {
-			__blk_end_request_all(req, -EIO);
+		if (blkif_request_flush_invalid(req, info)) {
+			__blk_end_request_all(req, -EOPNOTSUPP);
 			continue;
 		}
 
-- 
1.9.3
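Point 1 of the commit message can be checked with a small userspace model of the two predicates. This is only a sketch: the DEMO_* flag values are stand-ins chosen for illustration (they are not the kernel's REQ_FLUSH/REQ_FUA definitions), the function names are hypothetical, and the cmd_type test from the real function is left out.

```c
#include <stdbool.h>

/* Stand-in flag values for illustration; not the kernel's definitions. */
#define DEMO_REQ_FLUSH 0x1u
#define DEMO_REQ_FUA   0x2u

/* Model of the old check: any non-zero flush_op lets both flags through. */
static bool old_check_invalid(unsigned int cmd_flags, unsigned int flush_op)
{
	return (cmd_flags & (DEMO_REQ_FLUSH | DEMO_REQ_FUA)) && !flush_op;
}

/* Model of the new check: each requested flag must have been negotiated. */
static bool new_check_invalid(unsigned int cmd_flags,
			      unsigned int feature_flush)
{
	return ((cmd_flags & DEMO_REQ_FLUSH) &&
		!(feature_flush & DEMO_REQ_FLUSH)) ||
	       ((cmd_flags & DEMO_REQ_FUA) &&
		!(feature_flush & DEMO_REQ_FUA));
}
```

With only flush negotiated (feature_flush equal to DEMO_REQ_FLUSH and some non-zero flush_op), a FUA request is accepted by the old model and rejected by the new one.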
[PATCH] xen/blkfront: remove redundant flush_op
flush_op is unambiguously defined by feature_flush:
    REQ_FUA | REQ_FLUSH -> BLKIF_OP_WRITE_BARRIER
    REQ_FLUSH -> BLKIF_OP_FLUSH_DISKCACHE
    0 -> 0
and thus can be removed. This is just a cleanup.

The patch was suggested by Boris Ostrovsky.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
The patch is supposed to be applied after "xen/blkfront: improve protection against issuing unsupported REQ_FUA".
---
 drivers/block/xen-blkfront.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 2e6c103..d1ee233 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -126,7 +126,6 @@ struct blkfront_info
 	unsigned int persistent_gnts_c;
 	unsigned long shadow_free;
 	unsigned int feature_flush;
-	unsigned int flush_op;
 	unsigned int feature_discard:1;
 	unsigned int feature_secdiscard:1;
 	unsigned int discard_granularity;
@@ -479,7 +478,14 @@ static int blkif_queue_request(struct request *req)
 				 * way.  (It's also a FLUSH+FUA, since it is
 				 * guaranteed ordered WRT previous writes.)
 				 */
-				ring_req->operation = info->flush_op;
+				if (unlikely(info->feature_flush & REQ_FUA))
+					ring_req->operation =
+						BLKIF_OP_WRITE_BARRIER;
+				else if (likely(info->feature_flush))
+					ring_req->operation =
+						BLKIF_OP_FLUSH_DISKCACHE;
+				else
+					ring_req->operation = 0;
 			}
 			ring_req->u.rw.nr_segments = nseg;
 		}
@@ -691,8 +697,8 @@ static void xlvbd_flush(struct blkfront_info *info)
 	blk_queue_flush(info->rq, info->feature_flush);
 	printk(KERN_INFO "blkfront: %s: %s: %s %s %s %s %s\n",
 	       info->gd->disk_name,
-	       info->flush_op == BLKIF_OP_WRITE_BARRIER ?
-		"barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
+	       info->feature_flush == (REQ_FLUSH | REQ_FUA) ?
+		"barrier" : (info->feature_flush == REQ_FLUSH ?
 		"flush diskcache" : "barrier or flush"),
 	       info->feature_flush ? "enabled;" : "disabled;",
 	       "persistent grants:",
@@ -1190,7 +1196,6 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 				if (error == -EOPNOTSUPP)
 					error = 0;
 				info->feature_flush = 0;
-				info->flush_op = 0;
 				xlvbd_flush(info);
 			}
 			/* fall through */
@@ -1810,7 +1815,6 @@ static void blkfront_connect(struct blkfront_info *info)
 		physical_sector_size = sector_size;
 
 	info->feature_flush = 0;
-	info->flush_op = 0;
 
 	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
 			    "feature-barrier", "%d", &barrier,
@@ -1823,10 +1827,8 @@ static void blkfront_connect(struct blkfront_info *info)
 	 *
 	 * If there are barriers, then we use flush.
 	 */
-	if (!err && barrier) {
+	if (!err && barrier)
 		info->feature_flush = REQ_FLUSH | REQ_FUA;
-		info->flush_op = BLKIF_OP_WRITE_BARRIER;
-	}
 	/*
 	 * And if there is feature-flush-cache use that above
 	 * barriers.
@@ -1835,10 +1837,8 @@ static void blkfront_connect(struct blkfront_info *info)
 			    "feature-flush-cache", "%d", &flush,
 			    NULL);
 
-	if (!err && flush) {
+	if (!err && flush)
 		info->feature_flush = REQ_FLUSH;
-		info->flush_op = BLKIF_OP_FLUSH_DISKCACHE;
-	}
 
 	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
 			    "feature-discard", "%d", &discard,
-- 
1.9.3
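The three-row mapping in the commit message is what makes flush_op redundant, and it can be written as a pure function. The sketch below is illustrative only: the DEMO_* names and values are stand-ins rather than the kernel or Xen definitions, and the function name flush_op_for() is hypothetical. It masks the feature word and switches on the result instead of using the if/else chain from this patch.

```c
/* Stand-in flag values for illustration; not the kernel's definitions. */
#define DEMO_REQ_FLUSH 0x1u
#define DEMO_REQ_FUA   0x2u

/* Stand-ins for the ring operations named in the commit message. */
enum demo_op {
	DEMO_OP_NONE = 0,
	DEMO_OP_WRITE_BARRIER,
	DEMO_OP_FLUSH_DISKCACHE
};

/* Derive the ring operation from the negotiated feature flags alone. */
static enum demo_op flush_op_for(unsigned int feature_flush)
{
	switch (feature_flush & (DEMO_REQ_FLUSH | DEMO_REQ_FUA)) {
	case DEMO_REQ_FLUSH | DEMO_REQ_FUA:
		return DEMO_OP_WRITE_BARRIER;
	case DEMO_REQ_FLUSH:
		return DEMO_OP_FLUSH_DISKCACHE;
	default:
		return DEMO_OP_NONE;
	}
}
```

Because the operation is a function of feature_flush, storing it separately only creates a second field that has to be kept in sync at every assignment, which is what the removed lines in blkfront_connect() and blkif_interrupt() were doing.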
Re: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work
Dexuan Cui de...@microsoft.com writes: Currently IPv6-only-injection doesn't work because the daemon doesn't parse any IPv6 information at all once it finds the dhcp_enabled flag is true. But according to the Hyper-v host team, the flag is only for IPv4. In the case the host only injects 1 IPv6 address, the dhcp flag is true, but we shouldn't ignore the IPv6 address and we should pass BOOTPROTO=none to the distro-specific script hv_set_ipconfig. Tested in Ubuntu 14.10 and RHEL7. Cc: K. Y. Srinivasan k...@microsoft.com Signed-off-by: Dexuan Cui de...@microsoft.com --- tools/hv/hv_kvp_daemon.c | 47 +++ 1 file changed, 31 insertions(+), 16 deletions(-) diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c index 6a6432a..6ef6c04 100644 --- a/tools/hv/hv_kvp_daemon.c +++ b/tools/hv/hv_kvp_daemon.c @@ -1145,6 +1145,9 @@ static int kvp_write_file(FILE *f, char *s1, char *s2, char *s3) } +/* How many ipv6 addresses the host is trying to inject? */ +static int num_ipv6_injected; + static int process_ip_string(FILE *f, char *ip_string, int type) { int error = 0; @@ -1190,6 +1193,7 @@ static int process_ip_string(FILE *f, char *ip_string, int type) switch (type) { case IPADDR: snprintf(str, sizeof(str), %s, IPV6ADDR); + num_ipv6_injected++; break; case NETMASK: snprintf(str, sizeof(str), %s, IPV6NETMASK); @@ -1308,27 +1312,12 @@ static int kvp_set_ip_info(char *if_name, struct hv_kvp_ipaddr_value *new_val) if (error) goto setval_error; - if (new_val-dhcp_enabled) { - error = kvp_write_file(file, BOOTPROTO, , dhcp); - if (error) - goto setval_error; - - /* - * We are done!. - */ - goto setval_done; - - } else { - error = kvp_write_file(file, BOOTPROTO, , none); - if (error) - goto setval_error; - } - /* * Write the configuration for ipaddress, netmask, gateway and * name servers. 
*/ + num_ipv6_injected = 0; error = process_ip_string(file, (char *)new_val-ip_addr, IPADDR); if (error) goto setval_error; @@ -1345,6 +1334,32 @@ static int kvp_set_ip_info(char *if_name, struct hv_kvp_ipaddr_value *new_val) if (error) goto setval_error; + /* + * Here dhcp_enabled is only for IPv4 according to Hyper-V host team. + * + * In the case the host only injects 1 IPv6 address: + * new_val-dhcp_enabled is true, but we can't pass BOOTPROTO=dhcp to + * the script hv_set_ifconfig, because in some distros (like RHEL7) + * BOOTPROTO=dhcp has a special meaning in the config file (e.g., + * /etc/sysconfig/network-scripts/ifcfg-eth0): the network init program + * ignores any static IP addr information once there is + * BOOTPROTO=dhcp; as a result, IPv6-only injection can't work. + * + * In the case of IPv6-only injection, BOOTPROTO=dhcp doesn't affect + * Ubuntu because it's ignored by the Ubuntu version of + * hv_set_ifconfig and it doesn't seem to have special meaning in + * Ubuntu. + */ I just checked and adding IPV6ADDR=something when BOOTPROTO=dhcp works for me with both RHEL7 and Fedora21. Other than that I think bringing distribution specifics into kernel.git is not a good idea. /etc/sysconfig/network-scripts/ifcfg-* format is distro-specific and not all Linux distros support it. Moreover, different distros can treat setting differently. I think it was wrong to stick to this format in kvp daemon from very beginning. As a solution I would suggest doing the following: kvp daemon writes all received request details in distro-agnostic format in some temporary place and then calls distro-specific script to set things up. Actually, we already have such script: tools/hv/hv_set_ifconfig.sh As for this bug I propose the following: remove skipping all IPADDR/MASK/... settings in case of BOOTPROTO=dhcp and let distro-specific script deal with the rest. 
+ if (new_val-dhcp_enabled num_ipv6_injected == 0) { + error = kvp_write_file(file, BOOTPROTO, , dhcp); + if (error) + goto setval_error; + } else { + error = kvp_write_file(file, BOOTPROTO, , none); + if (error) + goto setval_error; + } + setval_done: fclose(file); -- Vitaly -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at
[PATCH v2] xen/blkfront: remove redundant flush_op
flush_op is unambiguously defined by feature_flush: REQ_FUA | REQ_FLUSH - BLKIF_OP_WRITE_BARRIER REQ_FLUSH - BLKIF_OP_FLUSH_DISKCACHE 0 - 0 and thus can be removed. This is just a cleanup. The patch was suggested by Boris Ostrovsky. Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com --- Changes from v1: Future-proof feature_flush against new flags [Boris Ostrovsky]. The patch is supposed to be applied after xen/blkfront: improve protection against issuing unsupported REQ_FUA. --- drivers/block/xen-blkfront.c | 51 +++- 1 file changed, 31 insertions(+), 20 deletions(-) diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c index 2e6c103..2236c6f 100644 --- a/drivers/block/xen-blkfront.c +++ b/drivers/block/xen-blkfront.c @@ -126,7 +126,6 @@ struct blkfront_info unsigned int persistent_gnts_c; unsigned long shadow_free; unsigned int feature_flush; - unsigned int flush_op; unsigned int feature_discard:1; unsigned int feature_secdiscard:1; unsigned int discard_granularity; @@ -479,7 +478,19 @@ static int blkif_queue_request(struct request *req) * way. (It's also a FLUSH+FUA, since it is * guaranteed ordered WRT previous writes.) 
*/ - ring_req-operation = info-flush_op; + switch (info-feature_flush + ((REQ_FLUSH|REQ_FUA))) { + case REQ_FLUSH|REQ_FUA: + ring_req-operation = + BLKIF_OP_WRITE_BARRIER; + break; + case REQ_FLUSH: + ring_req-operation = + BLKIF_OP_FLUSH_DISKCACHE; + break; + default: + ring_req-operation = 0; + } } ring_req-u.rw.nr_segments = nseg; } @@ -685,20 +696,26 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size, return 0; } +static const char *flush_info(unsigned int feature_flush) +{ + switch (feature_flush ((REQ_FLUSH | REQ_FUA))) { + case REQ_FLUSH|REQ_FUA: + return barrier: enabled;; + case REQ_FLUSH: + return flush diskcache: enabled;; + default: + return barrier or flush: disabled;; + } +} static void xlvbd_flush(struct blkfront_info *info) { blk_queue_flush(info-rq, info-feature_flush); - printk(KERN_INFO blkfront: %s: %s: %s %s %s %s %s\n, - info-gd-disk_name, - info-flush_op == BLKIF_OP_WRITE_BARRIER ? - barrier : (info-flush_op == BLKIF_OP_FLUSH_DISKCACHE ? - flush diskcache : barrier or flush), - info-feature_flush ? enabled; : disabled;, - persistent grants:, - info-feature_persistent ? enabled; : disabled;, - indirect descriptors:, - info-max_indirect_segments ? enabled; : disabled;); + pr_info(blkfront: %s: %s %s %s %s %s\n, + info-gd-disk_name, flush_info(info-feature_flush), + persistent grants:, info-feature_persistent ? + enabled; : disabled;, indirect descriptors:, + info-max_indirect_segments ? 
enabled; : disabled;); } static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset) @@ -1190,7 +1207,6 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id) if (error == -EOPNOTSUPP) error = 0; info-feature_flush = 0; - info-flush_op = 0; xlvbd_flush(info); } /* fall through */ @@ -1810,7 +1826,6 @@ static void blkfront_connect(struct blkfront_info *info) physical_sector_size = sector_size; info-feature_flush = 0; - info-flush_op = 0; err = xenbus_gather(XBT_NIL, info-xbdev-otherend, feature-barrier, %d, barrier, @@ -1823,10 +1838,8 @@ static void blkfront_connect(struct blkfront_info *info) * * If there are barriers, then we use flush. */ - if (!err barrier) { + if (!err barrier) info-feature_flush = REQ_FLUSH | REQ_FUA; - info-flush_op = BLKIF_OP_WRITE_BARRIER; - } /* * And if there is feature-flush-cache use that above * barriers. @@ -1835,10 +1848,8 @@ static void blkfront_connect(struct
[PATCH 0/5] Tools: hv: fix compiler warnings and do minor cleanup
When someone does 'make' in tools/hv/, the following issues appear:
- hv_fcopy_daemon is not being built;
- lots of compiler warnings.

This is just a cleanup. Compile-tested by myself on top of linux-next/master.

Piggyback this series and send "[PATCH 5/5] Tools: hv: do not add redundant '/' in hv_start_fcopy()"

Vitaly Kuznetsov (5):
  Tools: hv: add mising fcopyd to the Makefile
  Tools: hv: remove unused bytes_written from kvp_update_file()
  Tools: hv: address compiler warnings for hv_kvp_daemon.c
  Tools: hv: address compiler warnings for hv_fcopy_daemon.c
  Tools: hv: do not add redundant '/' in hv_start_fcopy()

 tools/hv/Makefile          |  4 ++--
 tools/hv/hv_fcopy_daemon.c | 10 ++
 tools/hv/hv_kvp_daemon.c   | 29 +
 3 files changed, 17 insertions(+), 26 deletions(-)

-- 
1.9.3
[PATCH 2/5] Tools: hv: remove unused bytes_written from kvp_update_file()
fwrite() does not actually return the number of bytes written (it returns the number of items), and the value is ignored anyway: ferror() is called to check for an error. As we assign to this variable and never use it, we get the following compile-time warning:

hv_kvp_daemon.c:149:9: warning: variable ‘bytes_written’ set but not used [-Wunused-but-set-variable]

Remove bytes_written completely.

Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
---
 tools/hv/hv_kvp_daemon.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
index 6a6432a..5a274ca 100644
--- a/tools/hv/hv_kvp_daemon.c
+++ b/tools/hv/hv_kvp_daemon.c
@@ -147,7 +147,6 @@ static void kvp_release_lock(int pool)
 static void kvp_update_file(int pool)
 {
 	FILE *filep;
-	size_t bytes_written;
 
 	/*
 	 * We are going to write our in-memory registry out to
@@ -163,8 +162,7 @@ static void kvp_update_file(int pool)
 		exit(EXIT_FAILURE);
 	}
 
-	bytes_written = fwrite(kvp_file_info[pool].records,
-				sizeof(struct kvp_record),
+	fwrite(kvp_file_info[pool].records, sizeof(struct kvp_record),
 				kvp_file_info[pool].num_records, filep);
 
 	if (ferror(filep) || fclose(filep)) {
-- 
1.9.3
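The remark about fwrite()'s return value can be illustrated with a short sketch. The record type and the helper name write_records() are hypothetical stand-ins (the daemon uses struct kvp_record and a real pool file); the example writes to a temporary file obtained from tmpfile().

```c
#include <stdio.h>

/* Hypothetical 8-byte record, standing in for struct kvp_record. */
struct demo_record {
	char key[4];
	char value[4];
};

/*
 * Write n records to a temporary file and return what fwrite() reported.
 * Returns 0 if the temporary file cannot be created.
 */
static size_t write_records(const struct demo_record *recs, size_t n)
{
	FILE *f = tmpfile();
	size_t items;

	if (!f)
		return 0;
	items = fwrite(recs, sizeof(*recs), n, f);	/* counts items, not bytes */
	fclose(f);
	return items;
}
```

Writing three of these records yields a return value of 3 even though 24 bytes go to the stream, so a variable named bytes_written would have been misleading even if it had been used.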
[PATCH 5/5] Tools: hv: do not add redundant '/' in hv_start_fcopy()
We don't need to add additional '/' to smsg->path_name as snprintf("%s/%s") does the right thing. Without the patch we get doubled '//' in the log message.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 tools/hv/hv_fcopy_daemon.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c
index 1a23872..9445d8f 100644
--- a/tools/hv/hv_fcopy_daemon.c
+++ b/tools/hv/hv_fcopy_daemon.c
@@ -43,12 +43,6 @@ static int hv_start_fcopy(struct hv_start_fcopy *smsg)
 	int error = HV_E_FAIL;
 	char *q, *p;
 
-	/*
-	 * If possile append a path seperator to the path.
-	 */
-	if (strlen((char *)smsg->path_name) < (W_MAX_PATH - 2))
-		strcat((char *)smsg->path_name, "/");
-
 	p = (char *)smsg->path_name;
 	snprintf(target_fname, sizeof(target_fname), "%s/%s",
 		 (char *)smsg->path_name, (char *)smsg->file_name);
--
1.9.3
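The doubled separator is easy to reproduce outside the daemon. The sketch below is illustrative only: join_path() and has_double_slash() are hypothetical helpers, not functions from hv_fcopy_daemon.c. It joins a directory and a file name with the same "%s/%s" format and shows what happens when the directory already ends in '/'.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Join dir and file with the "%s/%s" format used in hv_start_fcopy(). */
static void join_path(char *out, size_t out_len, const char *dir, const char *file)
{
	assert(out != NULL && out_len > 0);
	snprintf(out, out_len, "%s/%s", dir, file);
}

/* Report whether a path contains a doubled separator. */
static int has_double_slash(const char *path)
{
	return strstr(path, "//") != NULL;
}
```

Joining "/tmp" and "a.txt" gives "/tmp/a.txt", while joining "/tmp/" and "a.txt" gives "/tmp//a.txt". The latter is what the removed strcat() effectively fed into snprintf().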
[PATCH 1/5] Tools: hv: add mising fcopyd to the Makefile
fcopyd is missing in the Makefile, add it there.

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 tools/hv/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/hv/Makefile b/tools/hv/Makefile
index bd22f78..99ffe61 100644
--- a/tools/hv/Makefile
+++ b/tools/hv/Makefile
@@ -5,9 +5,9 @@ PTHREAD_LIBS = -lpthread
 WARNINGS = -Wall -Wextra
 CFLAGS = $(WARNINGS) -g $(PTHREAD_LIBS)
 
-all: hv_kvp_daemon hv_vss_daemon
+all: hv_kvp_daemon hv_vss_daemon hv_fcopy_daemon
 %: %.c
 	$(CC) $(CFLAGS) -o $@ $^
 
 clean:
-	$(RM) hv_kvp_daemon hv_vss_daemon
+	$(RM) hv_kvp_daemon hv_vss_daemon hv_fcopy_daemon
--
1.9.3
[PATCH 3/5] Tools: hv: address compiler warnings for hv_kvp_daemon.c
This patch addresses two types of compiler warnings:

... warning: comparison between signed and unsigned integer expressions [-Wsign-compare]

and

... warning: pointer targets in passing argument N of ‘kvp_...’ differ in signedness [-Wpointer-sign]

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 tools/hv/hv_kvp_daemon.c | 25 -
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
index 5a274ca..48a95f9 100644
--- a/tools/hv/hv_kvp_daemon.c
+++ b/tools/hv/hv_kvp_daemon.c
@@ -308,7 +308,7 @@ static int kvp_file_init(void)
 	return 0;
 }
 
-static int kvp_key_delete(int pool, const char *key, int key_size)
+static int kvp_key_delete(int pool, const __u8 *key, int key_size)
 {
 	int i;
 	int j, k;
@@ -351,8 +351,8 @@ static int kvp_key_delete(int pool, const char *key, int key_size)
 	return 1;
 }
 
-static int kvp_key_add_or_modify(int pool, const char *key, int key_size, const char *value,
-			int value_size)
+static int kvp_key_add_or_modify(int pool, const __u8 *key, int key_size,
+				 const __u8 *value, int value_size)
 {
 	int i;
 	int num_records;
@@ -405,7 +405,7 @@ static int kvp_key_add_or_modify(int pool, const char *key, int key_size, const
 	return 0;
 }
 
-static int kvp_get_value(int pool, const char *key, int key_size, char *value,
+static int kvp_get_value(int pool, const __u8 *key, int key_size, __u8 *value,
 			int value_size)
 {
 	int i;
@@ -437,8 +437,8 @@ static int kvp_get_value(int pool, const char *key, int key_size, char *value,
 	return 1;
 }
 
-static int kvp_pool_enumerate(int pool, int index, char *key, int key_size,
-				char *value, int value_size)
+static int kvp_pool_enumerate(int pool, int index, __u8 *key, int key_size,
+				__u8 *value, int value_size)
 {
 	struct kvp_record *record;
 
@@ -659,7 +659,7 @@ static char *kvp_if_name_to_mac(char *if_name)
 	char	*p, *x;
 	char	buf[256];
 	char addr_file[256];
-	int i;
+	unsigned int i;
 	char *mac_addr = NULL;
 
 	snprintf(addr_file, sizeof(addr_file), "%s%s%s", "/sys/class/net/",
@@ -698,7 +698,7 @@ static char *kvp_mac_to_if_name(char *mac)
 	char	buf[256];
 	char *kvp_net_dir = "/sys/class/net/";
 	char dev_id[256];
-	int i;
+	unsigned int i;
 
 	dir = opendir(kvp_net_dir);
 	if (dir == NULL)
@@ -748,7 +748,7 @@ static char *kvp_mac_to_if_name(char *mac)
 
 
 static void kvp_process_ipconfig_file(char *cmd,
-					char *config_buf, int len,
+					char *config_buf, unsigned int len,
 					int element_size, int offset)
 {
 	char buf[256];
@@ -766,7 +766,7 @@ static void kvp_process_ipconfig_file(char *cmd,
 	if (offset == 0)
 		memset(config_buf, 0, len);
 	while ((p = fgets(buf, sizeof(buf), file)) != NULL) {
-		if ((len - strlen(config_buf)) < (element_size + 1))
+		if (len < strlen(config_buf) + element_size + 1)
 			break;
 
 		x = strchr(p, '\n');
@@ -914,7 +914,7 @@ static int kvp_process_ip_address(void *addrp,
 
 static int
 kvp_get_ip_info(int family, char *if_name, int op,
-		 void  *out_buffer, int length)
+		 void  *out_buffer, unsigned int length)
 {
 	struct ifaddrs *ifap;
 	struct ifaddrs *curp;
@@ -1017,8 +1017,7 @@ kvp_get_ip_info(int family, char *if_name, int op,
 					weight += hweight32(&w[i]);
 
 				sprintf(cidr_mask, "/%d", weight);
-				if ((length - sn_offset) <
-					(strlen(cidr_mask) + 1))
+				if (length < sn_offset + strlen(cidr_mask) + 1)
 					goto gather_ipaddr;
 
 				if (sn_offset == 0)
--
1.9.3
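Besides silencing -Wsign-compare, moving the subtraction to the other side of the comparison has a side benefit that a stand-alone sketch makes visible. The two helpers below are hypothetical and only mirror the shape of the old and new conditions; whether the used length can ever exceed the buffer length in the daemon is not something the patch claims, so treat that case as an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* The wrap-around argument assumes the arithmetic is carried out in size_t. */
static_assert(sizeof(size_t) >= sizeof(unsigned int),
	      "expressions below are evaluated in size_t");

/* Old shape: (len - used) < need. Wraps to a huge value when used > len. */
static int out_of_room_subtract(unsigned int len, size_t used, size_t need)
{
	return (len - used) < need;
}

/* New shape: len < used + need. No subtraction, so no wrap-around. */
static int out_of_room_add(unsigned int len, size_t used, size_t need)
{
	return len < used + need;
}
```

For len = 10, used = 8, need = 3 both forms report "out of room". For len = 10, used = 12, need = 1 the subtracting form wraps and reports that there is room, while the additive form still reports "out of room".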
[PATCH 4/5] Tools: hv: address compiler warnings for hv_fcopy_daemon.c
This patch addresses two types of compiler warnings:

... warning: unused variable ‘fd’ [-Wunused-variable]

and

... warning: format ‘%s’ expects argument of type ‘char *’, but argument 5 has type ‘__u16 *’ [-Wformat=]

Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com
---
 tools/hv/hv_fcopy_daemon.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c
index f437d73..1a23872 100644
--- a/tools/hv/hv_fcopy_daemon.c
+++ b/tools/hv/hv_fcopy_daemon.c
@@ -51,7 +51,7 @@ static int hv_start_fcopy(struct hv_start_fcopy *smsg)
 
 	p = (char *)smsg->path_name;
 	snprintf(target_fname, sizeof(target_fname), "%s/%s",
-		(char *)smsg->path_name, smsg->file_name);
+		 (char *)smsg->path_name, (char *)smsg->file_name);
 
 	syslog(LOG_INFO, "Target file name: %s", target_fname);
 	/*
@@ -137,7 +137,7 @@ void print_usage(char *argv[])
 
 int main(int argc, char *argv[])
 {
-	int fd, fcopy_fd, len;
+	int fcopy_fd, len;
 	int error;
 	int daemonize = 1, long_index = 0, opt;
 	int version = FCOPY_CURRENT_VERSION;
--
1.9.3
Re: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work
Dexuan Cui de...@microsoft.com writes: Thanks, -- Dexuan -Original Message- From: Dexuan Cui Sent: Wednesday, December 10, 2014 15:34 PM To: 'Vitaly Kuznetsov' Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev- de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; jasow...@redhat.com; KY Srinivasan; Haiyang Zhang Subject: RE: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work -Original Message- From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] Sent: Tuesday, December 9, 2014 21:06 PM To: Dexuan Cui Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev- de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; jasow...@redhat.com; KY Srinivasan; Haiyang Zhang Subject: Re: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work .. + * Here dhcp_enabled is only for IPv4 according to Hyper-V host team. + * + * In the case the host only injects 1 IPv6 address: + * new_val-dhcp_enabled is true, but we can't pass BOOTPROTO=dhcp to + * the script hv_set_ifconfig, because in some distros (like RHEL7) + * BOOTPROTO=dhcp has a special meaning in the config file (e.g., + * /etc/sysconfig/network-scripts/ifcfg-eth0): the network init program + * ignores any static IP addr information once there is + * BOOTPROTO=dhcp; as a result, IPv6-only injection can't work. + * + * In the case of IPv6-only injection, BOOTPROTO=dhcp doesn't affect + * Ubuntu because it's ignored by the Ubuntu version of + * hv_set_ifconfig and it doesn't seem to have special meaning in + * Ubuntu. + */ I just checked and adding IPV6ADDR=something when BOOTPROTO=dhcp works for me with both RHEL7 and Fedora21. It doesn't work in my side. :-( Running 'ifup eth0' shows some errors(I use set -x) ... + /sbin/dhclient -H localhost -1 -q -lf /var/lib/dhclient/dhclient--eth0.lease -pf /var/run/dhclient-eth0.pid eth0 grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied. 
BTW, I run with root, and 'chown 777 /etc/sysconfig/network-scripts/ifcfg-eth0 doesn't help. s,chown,chmod, :-) But it won't help in case of SELinux mislabeling. Thanks, -- Dexuan -- Vitaly -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Tools: hv: fix compiler warnings and do minor cleanup
Dexuan Cui de...@microsoft.com writes: -Original Message- From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] Sent: Tuesday, December 9, 2014 23:48 PM To: KY Srinivasan Cc: Haiyang Zhang; de...@linuxdriverproject.org; linux- ker...@vger.kernel.org; Dexuan Cui Subject: [PATCH 0/5] Tools: hv: fix compiler warnings and do minor cleanup When someone does 'make' in tools/hv/ issues appear: - hv_fcopy_daemon is not being built; - lots of compiler warnings. This is just a cleanup. Compile-tested by myself on top of linux-next/master. Piggyback this series and send [PATCH 5/5] Tools: hv: do not add redundant '/' in hv_start_fcopy() Vitaly Kuznetsov (5): Tools: hv: add mising fcopyd to the Makefile Tools: hv: remove unused bytes_written from kvp_update_file() Tools: hv: address compiler warnings for hv_kvp_daemon.c Tools: hv: address compiler warnings for hv_fcopy_daemon.c Tools: hv: do not add redundant '/' in hv_start_fcopy() tools/hv/Makefile | 4 ++-- tools/hv/hv_fcopy_daemon.c | 10 ++ tools/hv/hv_kvp_daemon.c | 29 + 3 files changed, 17 insertions(+), 26 deletions(-) -- 1.9.3 Hi Vitaly, Thanks for the patchset! Acked-by: Dexuan Cui de...@microsoft.com PS, I added Greg into the TO list. The hv code in drivers/hv/ and tools/hv/ usually has to go into Greg's tree first. Well, I don't mind spamming Greg but he's not on the scripts/get_maintainer.pl output. In case he's not monitoring the list for patches by some other tool (patchwork?) a patch adding him to MAINTAINERS would do the job. Greg, do you want to become an official Hyper-V maintainer in MAINTAINERS? I can send a patch then :-) -- Vitaly -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work
Dexuan Cui de...@microsoft.com writes: -Original Message- From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] Sent: Tuesday, December 9, 2014 21:06 PM To: Dexuan Cui Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev- de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; jasow...@redhat.com; KY Srinivasan; Haiyang Zhang Subject: Re: [PATCH] tools: hv: kvp_daemon: make IPv6-only-injection work .. + * Here dhcp_enabled is only for IPv4 according to Hyper-V host team. + * + * In the case the host only injects 1 IPv6 address: + * new_val-dhcp_enabled is true, but we can't pass BOOTPROTO=dhcp to + * the script hv_set_ifconfig, because in some distros (like RHEL7) + * BOOTPROTO=dhcp has a special meaning in the config file (e.g., + * /etc/sysconfig/network-scripts/ifcfg-eth0): the network init program + * ignores any static IP addr information once there is + * BOOTPROTO=dhcp; as a result, IPv6-only injection can't work. + * + * In the case of IPv6-only injection, BOOTPROTO=dhcp doesn't affect + * Ubuntu because it's ignored by the Ubuntu version of + * hv_set_ifconfig and it doesn't seem to have special meaning in + * Ubuntu. + */ I just checked and adding IPV6ADDR=something when BOOTPROTO=dhcp works for me with both RHEL7 and Fedora21. It doesn't work in my side. :-( Running 'ifup eth0' shows some errors(I use set -x) ... + /sbin/dhclient -H localhost -1 -q -lf /var/lib/dhclient/dhclient--eth0.lease -pf /var/run/dhclient-eth0.pid eth0 grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied. grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied. grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied. grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied. grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied. grep: /etc/sysconfig/network-scripts/ifcfg-eth0: Permission dinied. done. I'm trying to find out the cause. Selinux? You can try 'setenforce 0' to figure this out. 
Other than that I think bringing distribution specifics into kernel.git is not a good idea. /etc/sysconfig/network-scripts/ifcfg-* format is distro-specific and not all Linux distros support it. Moreover, I agree. different distros can treat setting differently. I think it was wrong to stick to this format in kvp daemon from very beginning. We can also think the current format used in kvp daemon is already distro-agnostic -- it just happens to look like the style of network config file used in RHEL :-) Yes, it is already there and I don't see any point in changing it. As a solution I would suggest doing the following: kvp daemon writes all received request details in distro-agnostic format in some temporary place and then calls distro-specific script to set things up. Actually, we already have such script: tools/hv/hv_set_ifconfig.sh Yeah, this is exactly what we already have today. As for this bug I propose the following: remove skipping all IPADDR/MASK/... settings in case of BOOTPROTO=dhcp and let distro-specific script deal with the rest. -- Vitaly OK, so the patch would be 1-line only: diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c index 22b0764..53fdaad 100644 --- a/tools/hv/hv_kvp_daemon.c +++ b/tools/hv/hv_kvp_daemon.c @@ -1314,10 +1314,8 @@ static int kvp_set_ip_info(char *if_name, struct hv_kvp_ipaddr_value *new_val) goto setval_error; /* -* We are done!. +* We are not done... TODO: add comment here. */ - goto setval_done; - } else { error = kvp_write_file(file, BOOTPROTO, , none); if (error) I'll send out a v2 after I resolve the grep ... Permission dinied issue. Thanks, -- Dexuan -- Vitaly -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] tools: hv: kvp_daemon: make IPv6-only-injection work
Dexuan Cui de...@microsoft.com writes: -Original Message- From: devel [mailto:driverdev-devel-boun...@linuxdriverproject.org] On Behalf Of Dexuan Cui Sent: Wednesday, December 10, 2014 19:33 PM To: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev- de...@linuxdriverproject.org; vkuzn...@redhat.com; o...@aepfle.de; a...@canonical.com; jasow...@redhat.com; KY Srinivasan Cc: Haiyang Zhang Subject: [PATCH v2] tools: hv: kvp_daemon: make IPv6-only-injection work In the case the host only injects an IPv6 address, the dhcp_enabled flag is true (it's only for IPv4 according to Hyper-V host team), but we still need to proceed to parse the IPv6 information. Cc: Vitaly Kuznetsov vkuzn...@redhat.com Cc: K. Y. Srinivasan k...@microsoft.com Signed-off-by: Dexuan Cui de...@microsoft.com --- v2: removed the distro-specific logic as Vitaly suggested. tools/hv/hv_kvp_daemon.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c index 6a6432a..4b3ee35 100644 --- a/tools/hv/hv_kvp_daemon.c +++ b/tools/hv/hv_kvp_daemon.c @@ -1308,16 +1308,17 @@ static int kvp_set_ip_info(char *if_name, struct hv_kvp_ipaddr_value *new_val) if (error) goto setval_error; +/* + * The dhcp_enabled flag is only for IPv4. In the case the host only + * injects an IPv6 address, the flag is true, but we still need to + * proceed to parse and pass the IPv6 information to the + * disto-specific script hv_set_ifconfig. + */ Actually we just relay what was recieved from the host and it's up to distro-specific script how to interpret BOOTPROTO=dhcp now. Additional IPv4 addresses (in case we receive them from our host) are not skipped now as well. if (new_val-dhcp_enabled) { error = kvp_write_file(file, BOOTPROTO, , dhcp); if (error) goto setval_error; -/* - * We are done!. 
- */ -goto setval_done; - } else { error = kvp_write_file(file, BOOTPROTO, , none); if (error) @@ -1345,7 +1346,6 @@ static int kvp_set_ip_info(char *if_name, struct hv_kvp_ipaddr_value *new_val) if (error) goto setval_error; -setval_done: fclose(file); /* -- 1.9.1 Hi Vitaly, Can you please ACK the v2 patch? Sorry it took me so long to reply, last 3 weeks I was on vacation. I'm not particulary sure I'm in charge here to give an ACK :-), but Reviewed-By: Vitaly Kuznetsov vkuzn...@redhat.com Or, please let me know if you have new comments. Thanks, -- Dexuan -- Vitaly -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3] Drivers: hv: vmbus: fix crashes on hv_vmbus load/unload path
Vitaly Kuznetsov vkuzn...@redhat.com writes:

> It is possible (since 93e5bd06a953: "Drivers: hv: Make the vmbus driver
> unloadable") to unload hv_vmbus driver if no other devices are connected.
> 1aec169673d7: "x86: Hyperv: Cleanup the irq mess" fixed double interrupt
> gate setup. However, if we try to unload hv_vmbus and then load it back,
> crashes in different places of vmbus driver occur on both unload and second
> load paths. Address those I saw in my testing.

It seems that the newly introduced clockevent device ("Drivers: hv: vmbus: Implement a clockevent device") makes it impossible to unload the hv_vmbus module:

# rmmod hv_vmbus
rmmod hv_vmbus
rmmod: ERROR: Module hv_vmbus is in use

I'll try investigating before sending v2 without PATCH 2/3.

> Not everything is fixed though. MCE was hit once on Generation2 instance and
> I neither understand what caused it nor do I know the way to reproduce it.
> Anyway, here is the log:
>
> [ 204.846255] mce: [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 0: b200c0020001
> [ 204.846675] mce: [Hardware Error]: TSC 6b5cd64bc8
> [ 204.846675] mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1421944123 SOCKET 0 APIC 0 microcode
> [ 204.846675] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> [ 204.846675] mce: [Hardware Error]: Machine check: Processor context corrupt
> [ 204.846675] Kernel panic - not syncing: Fatal Machine check
> [ 204.846675] Kernel Offset: 0x0 from 0x8100 (relocation range: 0x8000-0x9fff)
> [ 204.846675] Rebooting in 30 seconds..
> [ 204.846675] ACPI MEMORY or I/O RESET_REG.
>
> Vitaly Kuznetsov (3):
>   Drivers: hv: vmbus: avoid double kfree for device_obj
>   Drivers: hv: vmbus: introduce vmbus_acpi_remove
>   Drivers: hv: vmbus: teardown hv_vmbus_con workqueue and
>     vmbus_connection pages on shutdown
>
>  drivers/hv/channel_mgmt.c |  1 -
>  drivers/hv/connection.c   | 17 -
>  drivers/hv/hyperv_vmbus.h |  1 +
>  drivers/hv/vmbus_drv.c    | 16
>  4 files changed, 29 insertions(+), 6 deletions(-)

--
Vitaly
Re: [PATCH 1/4] Drivers: hv: vmbus: implement get/put usage workflow for vmbus channels
Jason Wang jasow...@redhat.com writes: On Wed, Feb 4, 2015 at 1:00 AM, Vitaly Kuznetsov vkuzn...@redhat.com wrote: free_channel() function frees the channel unconditionally so we need to make sure nobody has any link to it. This is not trivial and there are several examples of races we have: 1) In vmbus_onoffer_rescind() we check for channel existence with relid2channel() and then use it. This can go wrong if we're in the middle of channel removal (free_channel() was already called). 2) In process_chn_event() we check for channel existence with pcpu_relid2channel() and then use it. This can also go wrong. 3) vmbus_free_channels() just frees all channels, in case we're in the middle of vmbus_process_rescind_offer() crash is possible. The issue can be solved by holding vmbus_connection.channel_lock everywhere, however, it looks like a way to deadlocks and performance degradation. Get/put workflow fits here the best. Implement vmbus_get_channel()/vmbus_put_channel() pair instead of free_channel(). Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com --- drivers/hv/channel_mgmt.c | 45 ++--- drivers/hv/connection.c | 7 +-- drivers/hv/hyperv_vmbus.h | 4 include/linux/hyperv.h| 13 + 4 files changed, 60 insertions(+), 9 deletions(-) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index 36bacc7..eb9ce94 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -147,6 +147,8 @@ static struct vmbus_channel *alloc_channel(void) return NULL; channel-id = atomic_inc_return(chan_num); +atomic_set(channel-count, 1); + spin_lock_init(channel-inbound_lock); spin_lock_init(channel-lock); @@ -178,19 +180,47 @@ static void release_channel(struct work_struct *work) } /* - * free_channel - Release the resources used by the vmbus channel object + * vmbus_put_channel - Decrease the channel usage counter and release the + * resources when this counter reaches zero. 
*/ -static void free_channel(struct vmbus_channel *channel) +void vmbus_put_channel(struct vmbus_channel *channel) { +unsigned long flags; /* * We have to release the channel's workqueue/thread in the vmbus's * workqueue/thread context * ie we can't destroy ourselves. */ -INIT_WORK(channel-work, release_channel); -queue_work(vmbus_connection.work_queue, channel-work); +spin_lock_irqsave(channel-lock, flags); +if (atomic_dec_and_test(channel-count)) { +channel-dying = true; +INIT_WORK(channel-work, release_channel); +spin_unlock_irqrestore(channel-lock, flags); +queue_work(vmbus_connection.work_queue, channel-work); +} else +spin_unlock_irqrestore(channel-lock, flags); +} +EXPORT_SYMBOL_GPL(vmbus_put_channel); + +/* vmbus_get_channel - Get additional reference to the channel */ +struct vmbus_channel *vmbus_get_channel(struct vmbus_channel *channel) +{ +unsigned long flags; +struct vmbus_channel *ret = NULL; + +if (!channel) +return NULL; + +spin_lock_irqsave(channel-lock, flags); +if (!channel-dying) { +atomic_inc(channel-count); +ret = channel; +} +spin_unlock_irqrestore(channel-lock, flags); Looks like we can use atomic_inc_return_safe() here to avoid extra dying. And then there's also no need for the spinlock. if (atomic_inc_return_safe(channel-count) 0) return channel; else return NULL; Good idea, thanks! I'll try. 
+return ret; } +EXPORT_SYMBOL_GPL(vmbus_get_channel); static void percpu_channel_enq(void *arg) { @@ -253,7 +283,7 @@ static void vmbus_process_rescind_offer(struct work_struct *work) list_del(channel-sc_list); spin_unlock_irqrestore(primary_channel-lock, flags); } -free_channel(channel); +vmbus_put_channel(channel); } void vmbus_free_channels(void) @@ -262,7 +292,7 @@ void vmbus_free_channels(void) list_for_each_entry(channel, vmbus_connection.chn_list, listentry) { vmbus_device_unregister(channel-device_obj); -free_channel(channel); +vmbus_put_channel(channel); } } @@ -391,7 +421,7 @@ done_init_rescind: spin_unlock_irqrestore(newchannel-lock, flags); return; err_free_chan: -free_channel(newchannel); +vmbus_put_channel(newchannel); } enum { @@ -549,6 +579,7 @@ static void vmbus_onoffer_rescind(struct vmbus_channel_message_header *hdr) queue_work(channel-controlwq, channel-work); spin_unlock_irqrestore(channel-lock, flags); +vmbus_put_channel(channel); } /* diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c index c4acd1c..d1ce134 100644 --- a/drivers/hv/connection.c +++ b/drivers/hv/connection.c @@ -247,7
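The get/put workflow discussed in this thread, including the suggestion to take a reference only while the counter is still non-zero instead of keeping a separate dying flag under a spinlock, can be sketched in user-space C11 roughly as follows. This is not the vmbus code: struct chan, chan_init(), chan_get() and chan_put() are hypothetical names, the released flag merely stands in for queueing release_channel(), and the compare-and-swap loop is one assumed way to implement "increment unless zero".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reference-counted object standing in for a vmbus channel. */
struct chan {
	atomic_int count;	/* starts at 1; 0 means release has begun */
	bool released;		/* stands in for queueing the release work */
};

static void chan_init(struct chan *c)
{
	atomic_init(&c->count, 1);
	c->released = false;
}

/* Take a reference unless the count has already dropped to zero. */
static struct chan *chan_get(struct chan *c)
{
	int old;

	if (c == NULL)
		return NULL;

	old = atomic_load(&c->count);
	while (old > 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&c->count, &old, old + 1))
			return c;
	}
	return NULL;
}

/* Drop a reference; whoever drops the last one triggers the release. */
static void chan_put(struct chan *c)
{
	assert(c != NULL);
	if (atomic_fetch_sub(&c->count, 1) == 1)
		c->released = true;
}
```

Once the last chan_put() has run, chan_get() returns NULL, so a late lookup cannot resurrect the object. The sketch covers only the counter; it does not address how a caller safely finds the object in the first place, which is the harder part of the races listed in the patch description.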
Re: [PATCH 1/4] Drivers: hv: vmbus: implement get/put usage workflow for vmbus channels
Dexuan Cui de...@microsoft.com writes: -Original Message- From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] Sent: Wednesday, February 4, 2015 1:01 AM To: KY Srinivasan; de...@linuxdriverproject.org Cc: Haiyang Zhang; linux-kernel@vger.kernel.org; Dexuan Cui; Jason Wang Subject: [PATCH 1/4] Drivers: hv: vmbus: implement get/put usage workflow for vmbus channels free_channel() function frees the channel unconditionally so we need to make sure nobody has any link to it. This is not trivial and there are several examples of races we have: 1) In vmbus_onoffer_rescind() we check for channel existence with relid2channel() and then use it. This can go wrong if we're in the middle of channel removal (free_channel() was already called). 2) In process_chn_event() we check for channel existence with pcpu_relid2channel() and then use it. This can also go wrong. 3) vmbus_free_channels() just frees all channels, in case we're in the middle of vmbus_process_rescind_offer() crash is possible. The issue can be solved by holding vmbus_connection.channel_lock everywhere, however, it looks like a way to deadlocks and performance degradation. Get/put workflow fits here the best. Implement vmbus_get_channel()/vmbus_put_channel() pair instead of free_channel(). 
Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com --- drivers/hv/channel_mgmt.c | 45 ++--- drivers/hv/connection.c | 7 +-- drivers/hv/hyperv_vmbus.h | 4 include/linux/hyperv.h| 13 + 4 files changed, 60 insertions(+), 9 deletions(-) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index 36bacc7..eb9ce94 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -147,6 +147,8 @@ static struct vmbus_channel *alloc_channel(void) return NULL; channel-id = atomic_inc_return(chan_num); +atomic_set(channel-count, 1); + spin_lock_init(channel-inbound_lock); spin_lock_init(channel-lock); @@ -178,19 +180,47 @@ static void release_channel(struct work_struct *work) } /* - * free_channel - Release the resources used by the vmbus channel object + * vmbus_put_channel - Decrease the channel usage counter and release the + * resources when this counter reaches zero. */ -static void free_channel(struct vmbus_channel *channel) +void vmbus_put_channel(struct vmbus_channel *channel) { +unsigned long flags; /* * We have to release the channel's workqueue/thread in the vmbus's * workqueue/thread context * ie we can't destroy ourselves. 
*/ -INIT_WORK(channel-work, release_channel); -queue_work(vmbus_connection.work_queue, channel-work); +spin_lock_irqsave(channel-lock, flags); +if (atomic_dec_and_test(channel-count)) { +channel-dying = true; +INIT_WORK(channel-work, release_channel); +spin_unlock_irqrestore(channel-lock, flags); +queue_work(vmbus_connection.work_queue, channel-work); +} else +spin_unlock_irqrestore(channel-lock, flags); +} +EXPORT_SYMBOL_GPL(vmbus_put_channel); + +/* vmbus_get_channel - Get additional reference to the channel */ +struct vmbus_channel *vmbus_get_channel(struct vmbus_channel *channel) +{ +unsigned long flags; +struct vmbus_channel *ret = NULL; + +if (!channel) +return NULL; + +spin_lock_irqsave(channel-lock, flags); +if (!channel-dying) { +atomic_inc(channel-count); +ret = channel; +} +spin_unlock_irqrestore(channel-lock, flags); +return ret; } +EXPORT_SYMBOL_GPL(vmbus_get_channel); static void percpu_channel_enq(void *arg) { @@ -253,7 +283,7 @@ static void vmbus_process_rescind_offer(struct work_struct *work) list_del(channel-sc_list); spin_unlock_irqrestore(primary_channel-lock, flags); } -free_channel(channel); +vmbus_put_channel(channel); } void vmbus_free_channels(void) @@ -262,7 +292,7 @@ void vmbus_free_channels(void) list_for_each_entry(channel, vmbus_connection.chn_list, listentry) { vmbus_device_unregister(channel-device_obj); -free_channel(channel); +vmbus_put_channel(channel); } } @@ -391,7 +421,7 @@ done_init_rescind: spin_unlock_irqrestore(newchannel-lock, flags); return; err_free_chan: -free_channel(newchannel); +vmbus_put_channel(newchannel); } enum { @@ -549,6 +579,7 @@ static void vmbus_onoffer_rescind(struct vmbus_channel_message_header *hdr) queue_work(channel-controlwq, channel-work); spin_unlock_irqrestore(channel-lock, flags); +vmbus_put_channel(channel); } /* diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c index c4acd1c..d1ce134 100644 --- a/drivers/hv/connection.c +++ b/drivers/hv
Re: [PATCH 2/3] hv: vmbus_post_msg: retry the hypercall on HV_STATUS_INVALID_CONNECTION_ID
Dexuan Cui de...@microsoft.com writes: -Original Message- From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] Sent: Thursday, January 29, 2015 21:31 PM To: Dexuan Cui Cc: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev- de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; jasow...@redhat.com; KY Srinivasan; Haiyang Zhang Subject: Re: [PATCH 2/3] hv: vmbus_post_msg: retry the hypercall on HV_STATUS_INVALID_CONNECTION_ID Dexuan Cui de...@microsoft.com writes: I got the hypercall error code on Hyper-V 2008 R2 when keeping running rmmod hv_netvsc; modprobe hv_netvsc; rmmod hv_utils; modprobe hv_utils in a Linux guest. Without the patch, the driver can occasionally fail to load. CC: K. Y. Srinivasan k...@microsoft.com Signed-off-by: Dexuan Cui de...@microsoft.com --- arch/x86/include/uapi/asm/hyperv.h | 1 + drivers/hv/connection.c| 9 + 2 files changed, 10 insertions(+) diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h index 90c458e..b9daffb 100644 --- a/arch/x86/include/uapi/asm/hyperv.h +++ b/arch/x86/include/uapi/asm/hyperv.h @@ -225,6 +225,7 @@ #define HV_STATUS_INVALID_HYPERCALL_CODE 2 #define HV_STATUS_INVALID_HYPERCALL_INPUT 3 #define HV_STATUS_INVALID_ALIGNMENT 4 +#define HV_STATUS_INVALID_CONNECTION_ID 18 #define HV_STATUS_INSUFFICIENT_BUFFERS19 The gap beween 4 and 18 tells me there are other codes here ;-) Are they all 'permanent failures'? It looks we only need to care about these error codes here. BTW, you can get all the hypercall error codes in the top level functional spec: http://blogs.msdn.com/b/virtual_pc_guy/archive/2014/02/17/updated-hypervisor-top-level-functional-specification.aspx For this hypercall (0x005c), see 14.9.7 HvPostMessage. Thanks, interesting! Btw, HV_STATUS_INSUFFICIENT_MEMORY looks suspicious, looks like we can hit it as well... 
I suggest we split all failures here in 2 classes: 1) permanent 2) worth retrying and treat them accordingly (no big changes, just maybe group them within hv_post_message() together as it is the only place where these codes are being used). typedef struct _HV_REFERENCE_TSC_PAGE { diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c index c4acd1c..8bd05f3 100644 --- a/drivers/hv/connection.c +++ b/drivers/hv/connection.c @@ -440,6 +440,15 @@ int vmbus_post_msg(void *buffer, size_t buflen) ret = hv_post_message(conn_id, 1, buffer, buflen); switch (ret) { + case HV_STATUS_INVALID_CONNECTION_ID: + /* + * We could get this if we send messages too + * frequently or the host is under low resource + * conditions: let's wait 1 more second before + * retrying the hypercall. + */ + msleep(1000); + break; In case it is our last try (No. 10) we will return '18' from the function. I suggest we set ret = -ENOMEM here as well. Thanks for the suggestion! I think it would be better to add this to the case HV_STATUS_INVALID_CONNECTION_ID: ret = -EAGAIN; ? Yes, like fallthrough case HV_STATUS_INSUFFICIENT_BUFFERS: ret = -ENOMEM; Or should we treat these two equally? There is a smaller (100ms) sleep between tries already, we can consider changing it instead. case -ENOMEM: -- Vitaly In my experiments, in the HV_STATUS_INVALID_CONNECTION_ID case, waiting 100ms is not enough sometimes, so I'd like to wait more time. I agree with you both cases can wait 1000ms. I'll update my patch. BTW, the case -ENOMEM: is not reachable(the hypervisor itself doesn't return -ENOMEM), I think. I can remove it. hv_post_message() can return -EMSGSIZE or do_hypercall() return value (which becomes u16 in hv_post_message()). So yes, I agree, -ENOMEM is not possible. 
Thanks, -- Dexuan -- Vitaly
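The split into "permanent" and "worth retrying" status codes discussed above can be sketched in plain userspace C. This is only an illustration, not the patch that was eventually posted: the two HV_STATUS_* values come from the quoted diff, HV_STATUS_SUCCESS is taken to be 0, while hv_status_is_transient(), hv_status_to_errno(), post_with_retry() and fake_post() are hypothetical names, and mapping every other status to -EINVAL is an assumption. The point is that the raw value 18 never escapes on the last try.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define HV_STATUS_SUCCESS                0
#define HV_STATUS_INVALID_CONNECTION_ID  18
#define HV_STATUS_INSUFFICIENT_BUFFERS   19

/* Hypothetical helper: true for the statuses the thread treats as transient. */
static bool hv_status_is_transient(unsigned int status)
{
	return status == HV_STATUS_INVALID_CONNECTION_ID ||
	       status == HV_STATUS_INSUFFICIENT_BUFFERS;
}

/* Map a hypercall status to an errno so a raw code such as 18 never leaks out. */
static int hv_status_to_errno(unsigned int status)
{
	switch (status) {
	case HV_STATUS_SUCCESS:
		return 0;
	case HV_STATUS_INVALID_CONNECTION_ID:
		return -EAGAIN;
	case HV_STATUS_INSUFFICIENT_BUFFERS:
		return -ENOMEM;
	default:
		return -EINVAL;	/* assumption: everything else is permanent */
	}
}

/* Stand-in for the hypercall so the loop can run outside the kernel. */
typedef unsigned int (*post_fn)(void *ctx);

static int post_with_retry(post_fn post, void *ctx, int max_tries)
{
	unsigned int status = HV_STATUS_SUCCESS;
	int i;

	for (i = 0; i < max_tries; i++) {
		status = post(ctx);
		if (!hv_status_is_transient(status))
			break;
		/* a real driver would sleep here before the next try */
	}
	return hv_status_to_errno(status);
}

/* Fake host: fails with INVALID_CONNECTION_ID a given number of times. */
static unsigned int fake_post(void *ctx)
{
	int *failures_left = ctx;

	if (*failures_left > 0) {
		(*failures_left)--;
		return HV_STATUS_INVALID_CONNECTION_ID;
	}
	return HV_STATUS_SUCCESS;
}
```

With two simulated failures the loop succeeds on the third try; with more failures than tries the caller sees -EAGAIN rather than 18.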
[PATCH 3/4] Drivers: hv: vmbus: protect vmbus_get_outgoing_channel() against channel removal
list_for_each_safe() we have in vmbus_get_outgoing_channel() works, however, we are not protected against the channel being removed (e.g. after receiving rescind offer). Users of this function (storvsc_do_io() is the only one at this moment) can get a link to an already freed channel. Make vmbus_get_outgoing_channel() search holding primary-lock as child channels are not being freed unless they're removed from parent's list. Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com --- drivers/hv/channel_mgmt.c | 10 +++--- drivers/scsi/storvsc_drv.c | 2 ++ 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index fdccd16..af6243c 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -881,18 +881,20 @@ cleanup: */ struct vmbus_channel *vmbus_get_outgoing_channel(struct vmbus_channel *primary) { - struct list_head *cur, *tmp; + struct list_head *cur; int cur_cpu; struct vmbus_channel *cur_channel; struct vmbus_channel *outgoing_channel = primary; int cpu_distance, new_cpu_distance; + unsigned long flags; if (list_empty(primary-sc_list)) - return outgoing_channel; + return vmbus_get_channel(outgoing_channel); cur_cpu = hv_context.vp_index[get_cpu()]; put_cpu(); - list_for_each_safe(cur, tmp, primary-sc_list) { + spin_lock_irqsave(primary-lock, flags); + list_for_each(cur, primary-sc_list) { cur_channel = list_entry(cur, struct vmbus_channel, sc_list); if (cur_channel-state != CHANNEL_OPENED_STATE) continue; @@ -913,6 +915,8 @@ struct vmbus_channel *vmbus_get_outgoing_channel(struct vmbus_channel *primary) outgoing_channel = cur_channel; } + outgoing_channel = vmbus_get_channel(outgoing_channel); + spin_unlock_irqrestore(primary-lock, flags); return outgoing_channel; } diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c index 4cff0dd..3b9b851 100644 --- a/drivers/scsi/storvsc_drv.c +++ b/drivers/scsi/storvsc_drv.c @@ -1370,6 +1370,8 @@ static int storvsc_do_io(struct 
hv_device *device, VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED); } + vmbus_put_channel(outgoing_channel); + if (ret != 0) return ret; -- 1.9.3
[PATCH 4/4] hyperv: netvsc: improve protection against rescind offer
The check added in commit c3582a2c4d0b (hyperv: Add support for vNIC hot removal) is incomplete as there is no synchronization between vmbus_onoffer_rescind() and netvsc_send(). In case we get the offer after we checked out_channel-rescind and before netvsc_send() finishes its job we can get a crash as we'll be dealing with already freed channel. Make netvsc_send() take additional reference to the channel with newly introduced vmbus_get_channel(), this guarantees we won't lose the channel. We can still get rescind while we're processing but this won't cause a crash. Reported-by: Jason Wang jasow...@redhat.com Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com --- drivers/net/hyperv/netvsc.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index 9f49c01..d9b13a1 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -763,11 +763,16 @@ int netvsc_send(struct hv_device *device, out_channel = net_device-chn_table[packet-q_idx]; if (out_channel == NULL) out_channel = device-channel; - packet-channel = out_channel; + packet-channel = vmbus_get_channel(out_channel); - if (out_channel-rescind) + if (!packet-channel) return -ENODEV; + if (out_channel-rescind) { + vmbus_put_channel(out_channel); + return -ENODEV; + } + if (packet-page_buf_cnt) { ret = vmbus_sendpacket_pagebuffer(out_channel, packet-page_buf, @@ -810,6 +815,7 @@ int netvsc_send(struct hv_device *device, packet, ret); } + vmbus_put_channel(packet-channel); return ret; } -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/4] Drivers: hv: vmbus: implement get/put usage workflow for vmbus channels
free_channel() function frees the channel unconditionally so we need to make sure nobody has any link to it. This is not trivial and there are several examples of races we have: 1) In vmbus_onoffer_rescind() we check for channel existence with relid2channel() and then use it. This can go wrong if we're in the middle of channel removal (free_channel() was already called). 2) In process_chn_event() we check for channel existence with pcpu_relid2channel() and then use it. This can also go wrong. 3) vmbus_free_channels() just frees all channels, in case we're in the middle of vmbus_process_rescind_offer() crash is possible. The issue can be solved by holding vmbus_connection.channel_lock everywhere, however, it looks like a way to deadlocks and performance degradation. Get/put workflow fits here the best. Implement vmbus_get_channel()/vmbus_put_channel() pair instead of free_channel(). Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com --- drivers/hv/channel_mgmt.c | 45 ++--- drivers/hv/connection.c | 7 +-- drivers/hv/hyperv_vmbus.h | 4 include/linux/hyperv.h| 13 + 4 files changed, 60 insertions(+), 9 deletions(-) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index 36bacc7..eb9ce94 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -147,6 +147,8 @@ static struct vmbus_channel *alloc_channel(void) return NULL; channel-id = atomic_inc_return(chan_num); + atomic_set(channel-count, 1); + spin_lock_init(channel-inbound_lock); spin_lock_init(channel-lock); @@ -178,19 +180,47 @@ static void release_channel(struct work_struct *work) } /* - * free_channel - Release the resources used by the vmbus channel object + * vmbus_put_channel - Decrease the channel usage counter and release the + * resources when this counter reaches zero. 
*/ -static void free_channel(struct vmbus_channel *channel) +void vmbus_put_channel(struct vmbus_channel *channel) { + unsigned long flags; /* * We have to release the channel's workqueue/thread in the vmbus's * workqueue/thread context * ie we can't destroy ourselves. */ - INIT_WORK(channel-work, release_channel); - queue_work(vmbus_connection.work_queue, channel-work); + spin_lock_irqsave(channel-lock, flags); + if (atomic_dec_and_test(channel-count)) { + channel-dying = true; + INIT_WORK(channel-work, release_channel); + spin_unlock_irqrestore(channel-lock, flags); + queue_work(vmbus_connection.work_queue, channel-work); + } else + spin_unlock_irqrestore(channel-lock, flags); +} +EXPORT_SYMBOL_GPL(vmbus_put_channel); + +/* vmbus_get_channel - Get additional reference to the channel */ +struct vmbus_channel *vmbus_get_channel(struct vmbus_channel *channel) +{ + unsigned long flags; + struct vmbus_channel *ret = NULL; + + if (!channel) + return NULL; + + spin_lock_irqsave(channel-lock, flags); + if (!channel-dying) { + atomic_inc(channel-count); + ret = channel; + } + spin_unlock_irqrestore(channel-lock, flags); + return ret; } +EXPORT_SYMBOL_GPL(vmbus_get_channel); static void percpu_channel_enq(void *arg) { @@ -253,7 +283,7 @@ static void vmbus_process_rescind_offer(struct work_struct *work) list_del(channel-sc_list); spin_unlock_irqrestore(primary_channel-lock, flags); } - free_channel(channel); + vmbus_put_channel(channel); } void vmbus_free_channels(void) @@ -262,7 +292,7 @@ void vmbus_free_channels(void) list_for_each_entry(channel, vmbus_connection.chn_list, listentry) { vmbus_device_unregister(channel-device_obj); - free_channel(channel); + vmbus_put_channel(channel); } } @@ -391,7 +421,7 @@ done_init_rescind: spin_unlock_irqrestore(newchannel-lock, flags); return; err_free_chan: - free_channel(newchannel); + vmbus_put_channel(newchannel); } enum { @@ -549,6 +579,7 @@ static void vmbus_onoffer_rescind(struct vmbus_channel_message_header *hdr) 
queue_work(channel-controlwq, channel-work); spin_unlock_irqrestore(channel-lock, flags); + vmbus_put_channel(channel); } /* diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c index c4acd1c..d1ce134 100644 --- a/drivers/hv/connection.c +++ b/drivers/hv/connection.c @@ -247,7 +247,8 @@ void vmbus_disconnect(void) * Map the given relid to the corresponding channel based on the * per-cpu list of channels that have been affinitized to this CPU. * This will be used in the channel callback path as we can do this - * mapping in a lock-free fashion. + * mapping in a lock-free fashion. Takes additional reference
[PATCH 0/4] Drivers: hv: Further protection for the rescind path
This series is a continuation of the Drivers: hv: vmbus: serialize Offer and Rescind offer. I'm trying to address a number of theoretically possible issues with rescind offer handling. All these complications come from the fact that a rescind offer results in vmbus channel being freed and we must ensure nobody still uses it. Instead of introducing new locks I suggest we switch channels usage to the get/put workflow. The main part of the series is [PATCH 1/4] which introduces the workflow for vmbus channels, all other patches fix different corner cases using this workflow. I'm not sure all such cases are covered with this series (probably not), but in case protection is required in some other places it should become relatively easy to add one. I did some sanity testing with CONFIG_DEBUG_LOCKDEP=y and nothing popped out, however, additional testing would be much appreciated. K.Y., Haiyang, I'm not sending this series to netdev@ and linux-scsi@ as it is supposed to be applied as a whole, please resend these patches with your sign-offs when (and if) we're done with reviews. Thanks! Vitaly Kuznetsov (4): Drivers: hv: vmbus: implement get/put usage workflow for vmbus channels Drivers: hv: vmbus: do not lose rescind offer on failure in vmbus_process_offer() Drivers: hv: vmbus: protect vmbus_get_outgoing_channel() against channel removal hyperv: netvsc: improve protection against rescind offer drivers/hv/channel_mgmt.c | 75 + drivers/hv/connection.c | 7 +++-- drivers/hv/hyperv_vmbus.h | 4 +++ drivers/net/hyperv/netvsc.c | 10 -- drivers/scsi/storvsc_drv.c | 2 ++ include/linux/hyperv.h | 13 6 files changed, 95 insertions(+), 16 deletions(-) -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
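For readers unfamiliar with the get/put workflow the cover letter refers to, here is a minimal userspace sketch. It is not the code from PATCH 1/4: that patch uses the channel spinlock plus a 'dying' flag, whereas this sketch swaps in a C11 compare-and-exchange loop ("increment unless zero") to get the same effect without a lock. struct channel, channel_init(), channel_get() and channel_put() are hypothetical names, and the 'released' flag stands in for queueing release_channel().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct channel {
	atomic_int count;	/* usage counter, starts at 1 for the owner */
	bool released;		/* stand-in for "release work was queued" */
};

static void channel_init(struct channel *ch)
{
	atomic_init(&ch->count, 1);
	ch->released = false;
}

/* Take an extra reference, or return NULL if the channel is already dying. */
static struct channel *channel_get(struct channel *ch)
{
	int old;

	if (!ch)
		return NULL;

	old = atomic_load(&ch->count);
	while (old != 0) {
		/* on failure 'old' is reloaded with the current value */
		if (atomic_compare_exchange_weak(&ch->count, &old, old + 1))
			return ch;
	}
	return NULL;
}

/* Drop a reference; the last put triggers the release. */
static void channel_put(struct channel *ch)
{
	if (atomic_fetch_sub(&ch->count, 1) == 1)
		ch->released = true;
}
```

A user that looks a channel up calls channel_get() first and backs off on NULL; the owner's final channel_put() is what actually lets the object go, so a late lookup can no longer touch freed memory.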
[PATCH 2/4] Drivers: hv: vmbus: do not lose rescind offer on failure in vmbus_process_offer()
In case we hit a failure condition in vmbus_process_offer() and a rescind offer was pending for the channel we just do free_channel() so CHANNELMSG_RELID_RELEASED will never be send to the host. We have to follow vmbus_process_rescind_offer() path anyway. To support the change we need to protect list_del in vmbus_process_rescind_offer() hitting an uninitialized list. Reported-by: Dexuan Cui de...@microsoft.com Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com --- drivers/hv/channel_mgmt.c | 20 ++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index eb9ce94..fdccd16 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -152,6 +152,7 @@ static struct vmbus_channel *alloc_channel(void) spin_lock_init(channel-inbound_lock); spin_lock_init(channel-lock); + INIT_LIST_HEAD(channel-listentry); INIT_LIST_HEAD(channel-sc_list); INIT_LIST_HEAD(channel-percpu_list); @@ -308,6 +309,7 @@ static void vmbus_process_offer(struct work_struct *work) struct vmbus_channel *channel; bool fnew = true; bool enq = false; + bool failure = false; int ret; unsigned long flags; @@ -408,19 +410,33 @@ static void vmbus_process_offer(struct work_struct *work) spin_lock_irqsave(vmbus_connection.channel_lock, flags); list_del(newchannel-listentry); spin_unlock_irqrestore(vmbus_connection.channel_lock, flags); + /* +* Init listentry again as vmbus_process_rescind_offer can try +* doing list_del again. +*/ + INIT_LIST_HEAD(channel-listentry); kfree(newchannel-device_obj); + newchannel-device_obj = NULL; goto err_free_chan; } + goto done_init_rescind; +err_free_chan: + failure = true; done_init_rescind: + /* +* Get additional reference as vmbus_put_channel() can be called +* either directly or through vmbus_process_rescind_offer(). 
+*/ + vmbus_get_channel(newchannel); spin_lock_irqsave(&newchannel->lock, flags); /* The next possible work is rescind handling */ INIT_WORK(&newchannel->work, vmbus_process_rescind_offer); /* Check if rescind offer was already received */ if (newchannel->rescind) queue_work(newchannel->controlwq, &newchannel->work); + else if (failure) + vmbus_put_channel(newchannel); spin_unlock_irqrestore(&newchannel->lock, flags); - return; -err_free_chan: vmbus_put_channel(newchannel); } -- 1.9.3
Re: [PATCH 0/4] Drivers: hv: Further protection for the rescind path
KY Srinivasan k...@microsoft.com writes: -Original Message- From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] Sent: Tuesday, February 3, 2015 9:01 AM To: KY Srinivasan; de...@linuxdriverproject.org Cc: Haiyang Zhang; linux-kernel@vger.kernel.org; Dexuan Cui; Jason Wang Subject: [PATCH 0/4] Drivers: hv: Further protection for the rescind path This series is a continuation of the Drivers: hv: vmbus: serialize Offer and Rescind offer. I'm trying to address a number of theoretically possible issues with rescind offer handling. All these complications come from the fact that a rescind offer results in vmbus channel being freed and we must ensure nobody still uses it. Instead of introducing new locks I suggest we switch channels usage to the get/put workflow. The main part of the series is [PATCH 1/4] which introduces the workflow for vmbus channels, all other patches fix different corner cases using this workflow. I'm not sure all such cases are covered with this series (probably not), but in case protection is required in some other places it should become relatively easy to add one. I did some sanity testing with CONFIG_DEBUG_LOCKDEP=y and nothing popped out, however, additional testing would be much appreciated. K.Y., Haiyang, I'm not sending this series to netdev@ and linux-scsi@ as it is supposed to be applied as a whole, please resend these patches with your sign-offs when (and if) we're done with reviews. Thanks! Vitaly, Thanks for looking into this issue. While today, rescind offer results in the freeing of the channel, I don't think that is required. By not freeing up the channel in the rescind path, we can have a safe way to access the channel and that does not have to involve taking a reference on the channel every time you access it - the get/put workflow in your patch set. As part of the network performance improvement work, I had eliminated all locks in the receive path by setting up per-cpu data structures for mapping the relid to channel etc. 
This set of patches introduces locking/atomic operations in performance critical code paths to deal with an event that is truly rare - the channel getting rescinded. It is possible to eliminate all locks/atomic operations from the performance critical path in my patch series by following Dexuan's suggestion - we'll get the channel in vmbus_open and put it in vmbus_close (and on processing offer/rescind offer); this won't affect performance. I'm in the middle of testing this approach. All channel messages are handled in a single work context: vmbus_on_msg_dpc() -> vmbus_onmessage_work() -> Various channel messages [offer, rescind etc.] So, the rescind message cannot be processed while we are processing the offer message and since an offer cannot be rescinded before it is offered, offer and rescind are naturally serialized (I think I have patchset in my queue from you that is trying to solve the concurrent execution of offer and rescind and looking at the code I cannot see how this can occur). As part of handling the rescind message, we will just set the channel state to indicate that the offer is rescinded (we can add the rescind state to the channel states already defined and this will be done under the protection of the channel lock). The cleanup of the channel and sending of the RELID release message will only be done in the context of the driver as part of driver remove function. I think this should be doable in a way that does not penalize the normal path. If it is ok with you, I will try to put together a patch along the lines I have described here. Yes, if we consider rescind event as a very rare event we can avoid freeing channels, but if (in some conditions) it happens frequently we'll have significant memory leakage. We can also free them with something like schedule_delayed_work() with e.g. a 10 second delay after removing it from all lists so the probability of hitting a crash will be very low, I seriously doubt we will ever hit it.
Please let me know what you think is better. In case we follow 'never free' or 'delayed free' approach I'll extract and send separately PATCH 2/4 from my series to address 'losing rescind offer' issue pointed out by Dexuan. Thanks, Regards, K. Y Vitaly Kuznetsov (4): Drivers: hv: vmbus: implement get/put usage workflow for vmbus channels Drivers: hv: vmbus: do not lose rescind offer on failure in vmbus_process_offer() Drivers: hv: vmbus: protect vmbus_get_outgoing_channel() against channel removal hyperv: netvsc: improve protection against rescind offer drivers/hv/channel_mgmt.c | 75 + drivers/hv/connection.c | 7 +++-- drivers/hv/hyperv_vmbus.h | 4 +++ drivers/net/hyperv/netvsc.c | 10 -- drivers/scsi/storvsc_drv.c | 2 ++ include/linux/hyperv.h | 13 6 files changed, 95 insertions(+), 16 deletions
Re: [PATCH v2 1/1] drivers:hv:vmbus drivers:hv:vmbus Allow for more than one MMIO range for children
Jake Oshins ja...@microsoft.com writes: This set of changes finds the _CRS object in the ACPI namespace that contains memory address space descriptors, intended to convey to VMBus which ranges of memory-mapped I/O space are available for child devices, and then builds a resource list that contains all those ranges. Without this change, only some of the memory-mapped I/O space will be available for child devices, and only in some virtual BIOS configurations (Generation 2 VMs). This patch has been updated with feedback from Vitaly Kuznetsov. Cleanup is now driven by the acpi remove callback function. Sorry for being late with this message but I'm seeing issues with this commit. I added some debug to figure out what's going on and here is what I see: With Gen1 VM we end up doing request_resource for two ranges: f800 - fffb fe000 - fffef request_resource() fails (as we already have PCI device at f800 I suppose?) but we don't check the return value. release_resource on module unload crashes the kernel: [ 78.314344] BUG: unable to handle kernel NULL pointer dereference at 0030 [ 78.315021] IP: [8107fac5] release_resource+0x25/0x90 [ 78.315021] PGD 78c67067 PUD 78c5a067 PMD 0 [ 78.315021] Oops: [#1] SMP DEBUG_PAGEALLOC [ 78.315021] Modules linked in: hv_vmbus(-) ... If I'm not mistaken, before the change we didn't do any request_resource() for Gen1 VMs at all. With Gen2 VM we do request_resource for fe000 - f range only, that means this commit doesn't change anything. Can you please take a look? I'd like to help but I don't completely understand the essence of the change wrt Gen1 VMs with PCI devices.
Thanks, Signed-off-by: Jake Oshins ja...@microsoft.com --- drivers/hv/vmbus_drv.c | 99 +-- drivers/video/fbdev/hyperv_fb.c |2 +- include/linux/hyperv.h |2 +- 3 files changed, 86 insertions(+), 17 deletions(-) diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index 4d6b269..ed618ac 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -43,10 +43,7 @@ static struct tasklet_struct msg_dpc; static struct completion probe_event; static int irq; -struct resource hyperv_mmio = { - .name = hyperv mmio, - .flags = IORESOURCE_MEM, -}; +struct resource *hyperv_mmio; EXPORT_SYMBOL_GPL(hyperv_mmio); static int vmbus_exists(void) @@ -849,30 +846,98 @@ void vmbus_device_unregister(struct hv_device *device_obj) /* - * VMBUS is an acpi enumerated device. Get the the information we - * need from DSDT. + * VMBUS is an acpi enumerated device. Get the + * information we need from DSDT. */ static acpi_status vmbus_walk_resources(struct acpi_resource *res, void *ctx) { + resource_size_t start = 0; + resource_size_t end = 0; + struct resource *new_res; + struct resource **old_res = hyperv_mmio; + switch (res-type) { case ACPI_RESOURCE_TYPE_IRQ: irq = res-data.irq.interrupts[0]; + return AE_OK; + + /* + * Address descriptors are for bus windows. Ignore + * memory descriptors, which are for registers on + * devices. + */ + case ACPI_RESOURCE_TYPE_ADDRESS32: + start = res-data.address32.minimum; + end = res-data.address32.maximum; break; case ACPI_RESOURCE_TYPE_ADDRESS64: - hyperv_mmio.start = res-data.address64.minimum; - hyperv_mmio.end = res-data.address64.maximum; + start = res-data.address64.minimum; + end = res-data.address64.maximum; break; + + default: + /* Unused resource type */ + return AE_OK; } + /* + * Ignore ranges that are below 1MB, as they're not + * necessary or useful here. 
+ */ + if (end 0x10) + return AE_OK; + + new_res = kzalloc(sizeof(*new_res), GFP_ATOMIC); + if (!new_res) + return AE_NO_MEMORY; + + new_res-name = hyperv mmio; + new_res-flags = IORESOURCE_MEM; + new_res-start = start; + new_res-end = end; + + do { + if (!*old_res) { + *old_res = new_res; + break; + } + + if ((*old_res)-start new_res-end) { + new_res-sibling = *old_res; + *old_res = new_res; + break; + } + + old_res = (*old_res)-sibling; + + } while (1); + return AE_OK; } +static int vmbus_acpi_remove(struct acpi_device *device) +{ + struct resource *cur_res; + struct resource *next_res; + + if (hyperv_mmio) { + release_resource(hyperv_mmio); + for (cur_res = hyperv_mmio; cur_res; cur_res = next_res) { + next_res = cur_res
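The list-building loop in the quoted diff lost its comparison and address-of operators in the archive. A userspace sketch of the same idea, sorted insertion through a pointer-to-pointer cursor, may help in reading it. It is an illustration under assumptions, not the patch itself: struct range and range_insert_sorted() are hypothetical names, and the "insert before the first entry that starts above the new range's end" condition is my reading of the garbled test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct resource: an address range in a list. */
struct range {
	unsigned long start;
	unsigned long end;
	struct range *sibling;
};

/*
 * Insert new_res so the list stays ordered by address. 'cur' always points
 * at the link to rewrite (the head pointer or some node's sibling field),
 * so an empty list, the front and the tail need no special cases.
 */
static void range_insert_sorted(struct range **head, struct range *new_res)
{
	struct range **cur = head;

	/* skip every entry that does not start above the new range */
	while (*cur && (*cur)->start <= new_res->end)
		cur = &(*cur)->sibling;

	new_res->sibling = *cur;
	*cur = new_res;
}
```

Note that the sketch only orders the ranges; as the reply above points out, whether each range can actually be claimed still has to be checked separately.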
Re: [PATCH v2 1/3] Drivers: hv: check vmbus_device_create() return value in vmbus_process_offer()
Dan Carpenter dan.carpen...@oracle.com writes: On Mon, Jan 19, 2015 at 05:56:11PM +0100, Vitaly Kuznetsov wrote: vmbus_device_create() result is not being checked in vmbus_process_offer() and it can fail if kzalloc() fails. Add the check and do minor cleanup to avoid additional duplication of free_channel(); return; block. Reported-by: Jason Wang jasow...@redhat.com Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com out is always a bad name for a label. It's too vague. It implies that the code uses One Err style error handling which is bug prone and I've ranted about that in the past so I won't here. This kind of coding is buggier than direct returns. But recently I've been looking at bugs where we return zero where the code should return a negative error code and, wow, do I hate out labels! if (function_whatever(xxx)) goto out; [ thousands of lines removed. ] out: return ret; Oh crap... Did the coder mean to return success or not??? If you use a direct return then the code looks like: if (function_whatever(xxx)) return 0; In that case, you can immediately see that the coder typed 0 deliberately. Direct returns are best. I guess that's not directly related to this code. But I didn't know that until I read to the bottom of the patch and I already had this rant prepared in my head ready to go... Thank you for your rant, Dan! It contains an explanation _why_ and so is useful. However ... :-) 1) vmbus_process_offer() returns void so we won't forget to set proper return code. 2) this patch is a preparation for the PATCH 3/3 where the label is being used to do some useful (non-trivial) work. Direct returns approach would require us to duplicate the code or move it to a function and call it from all return places. I consider adding out label being less evil. Anyway, I can rename it to something less provocative in PATCH 3/3, e.g. init_rescind. error is a crap label name because it doesn't tell you what the code does. 
A better name is err_free_chan or something which talks about freeing the channel. And here I have to completely agree with you, I'll rename it in v3. regards, dan carpenter -- Vitaly
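The labelling convention agreed on above (direct returns where there is nothing to undo, a label named after what it frees otherwise) can be shown with a small userspace sketch. Everything in it is hypothetical: struct chan, chan_setup() and the fail_device switch exist only to exercise both paths and are not taken from the driver.

```c
#include <assert.h>
#include <stdlib.h>

struct chan {
	int *device_obj;
};

/*
 * Returns 0 and stores the new object in *out on success, -1 on failure.
 * fail_device simulates the device object allocation failing.
 */
static int chan_setup(int fail_device, struct chan **out)
{
	struct chan *c = calloc(1, sizeof(*c));

	if (!c)
		return -1;	/* nothing to undo yet: direct return */

	c->device_obj = fail_device ? NULL : malloc(sizeof(*c->device_obj));
	if (!c->device_obj)
		goto err_free_chan;

	*out = c;
	return 0;	/* success is visibly deliberate */

err_free_chan:
	/* the label says what happens here: the channel is freed */
	free(c);
	return -1;
}
```

Reading only the goto tells the reviewer which cleanup runs, which is the property a bare 'out' or 'error' label does not give.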
[PATCH v3 1/3] Drivers: hv: check vmbus_device_create() return value in vmbus_process_offer()
vmbus_device_create() result is not being checked in vmbus_process_offer() and it can fail if kzalloc() fails. Add the check and do minor cleanup to avoid additional duplication of free_channel(); return; block. Reported-by: Jason Wang jasow...@redhat.com Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com --- drivers/hv/channel_mgmt.c | 14 +- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index 2c59f03..01f2c2b 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -341,11 +341,10 @@ static void vmbus_process_offer(struct work_struct *work) if (channel-sc_creation_callback != NULL) channel-sc_creation_callback(newchannel); - return; + goto out; } - free_channel(newchannel); - return; + goto err_free_chan; } /* @@ -364,6 +363,8 @@ static void vmbus_process_offer(struct work_struct *work) newchannel-offermsg.offer.if_type, newchannel-offermsg.offer.if_instance, newchannel); + if (!newchannel-device_obj) + goto err_free_chan; /* * Add the new device to the bus. This will kick off device-driver @@ -379,9 +380,12 @@ static void vmbus_process_offer(struct work_struct *work) list_del(newchannel-listentry); spin_unlock_irqrestore(vmbus_connection.channel_lock, flags); kfree(newchannel-device_obj); - - free_channel(newchannel); + goto err_free_chan; } +out: + return; +err_free_chan: + free_channel(newchannel); } enum { -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 2/3] Drivers: hv: rename sc_lock to the more generic lock
sc_lock spinlock in struct vmbus_channel is being used to not only protect the sc_list field, e.g. vmbus_open() function uses it to implement test-and-set access to the state field. Rename it to the more generic 'lock' and add the description. Signed-off-by: Vitaly Kuznetsov vkuzn...@redhat.com --- drivers/hv/channel.c | 6 +++--- drivers/hv/channel_mgmt.c | 10 +- include/linux/hyperv.h| 7 ++- 3 files changed, 14 insertions(+), 9 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index 433f72a..8608ed1 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -73,14 +73,14 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size, unsigned long flags; int ret, t, err = 0; - spin_lock_irqsave(newchannel-sc_lock, flags); + spin_lock_irqsave(newchannel-lock, flags); if (newchannel-state == CHANNEL_OPEN_STATE) { newchannel-state = CHANNEL_OPENING_STATE; } else { - spin_unlock_irqrestore(newchannel-sc_lock, flags); + spin_unlock_irqrestore(newchannel-lock, flags); return -EINVAL; } - spin_unlock_irqrestore(newchannel-sc_lock, flags); + spin_unlock_irqrestore(newchannel-lock, flags); newchannel-onchannel_callback = onchannelcallback; newchannel-channel_callback_context = context; diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index 01f2c2b..c6fdd74 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -146,7 +146,7 @@ static struct vmbus_channel *alloc_channel(void) return NULL; spin_lock_init(channel-inbound_lock); - spin_lock_init(channel-sc_lock); + spin_lock_init(channel-lock); INIT_LIST_HEAD(channel-sc_list); INIT_LIST_HEAD(channel-percpu_list); @@ -246,9 +246,9 @@ static void vmbus_process_rescind_offer(struct work_struct *work) spin_unlock_irqrestore(vmbus_connection.channel_lock, flags); } else { primary_channel = channel-primary_channel; - spin_lock_irqsave(primary_channel-sc_lock, flags); + spin_lock_irqsave(primary_channel-lock, flags); list_del(channel-sc_list); - 
spin_unlock_irqrestore(&primary_channel->sc_lock, flags); + spin_unlock_irqrestore(&primary_channel->lock, flags); } free_channel(channel); } @@ -323,9 +323,9 @@ static void vmbus_process_offer(struct work_struct *work) * Process the sub-channel. */ newchannel->primary_channel = channel; - spin_lock_irqsave(&channel->sc_lock, flags); + spin_lock_irqsave(&channel->lock, flags); list_add_tail(&newchannel->sc_list, &channel->sc_list); - spin_unlock_irqrestore(&channel->sc_lock, flags); + spin_unlock_irqrestore(&channel->lock, flags); if (newchannel->target_cpu != get_cpu()) { put_cpu(); diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index 476c685..02dd978 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -722,7 +722,12 @@ struct vmbus_channel { */ void (*sc_creation_callback)(struct vmbus_channel *new_sc); - spinlock_t sc_lock; + /* +* The spinlock to protect the structure. It is being used to protect +* test-and-set access to various attributes of the structure as well +* as all sc_list operations. +*/ + spinlock_t lock; /* * All Sub-channels of a primary channel are linked here. */ -- 1.9.3
[PATCH v3 0/3] Drivers: hv: vmbus: protect Offer/Rescind offer processing
This patch series is a renamed successor of [PATCH] Drivers: hv: vmbus: serialize Offer and Rescind offer processing. Changes from v2: - Rename labels in vmbus_process_offer() (out -> done_init_rescind in PATCH 3/3, error -> err_free_chan in PATCH 1/3) [Dan Carpenter] - Invert condition, update comment, and remove out label in vmbus_onoffer_rescind() Changes from v1: - Separate vmbus_device_create() return value check [K. Y. Srinivasan] - Do not lose a rescind offer received during offer processing. Use renamed (in [PATCH v2 2/3]) spinlock to protect simultaneous test-and-set workflow for rescind and work fields. [K. Y. Srinivasan] Vitaly Kuznetsov (3): Drivers: hv: check vmbus_device_create() return value in vmbus_process_offer() Drivers: hv: rename sc_lock to the more generic lock Drivers: hv: vmbus: serialize Offer and Rescind offer drivers/hv/channel.c | 6 +++--- drivers/hv/channel_mgmt.c | 50 --- include/linux/hyperv.h| 7 ++- 3 files changed, 43 insertions(+), 20 deletions(-) -- 1.9.3