Re: [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush
On Tue, Nov 14, 2017 at 02:28:56PM +0800, Wanpeng Li wrote: > > - have the TLB invalidate handler do something like: > > > >state = READ_ONCE(src->preempted); > >if (!(state & KVM_VCPU_IPI_PENDING)) > >return; > > > >local_flush_tlb(); > > > >do { > >} while (!try_cmpxchg(&src->preempted, &state, > > state & ~KVM_VCPU_IPI_PENDING)); > > There are a lot of cases handled by flush_tlb_func_remote() -> > flush_tlb_function_common(), so I'm afraid to have hole. Sure, just fix the handler to do what must be done. The above was merely a sketch. The important part is to only clear IPI_PENDING after we do the actual flushing, since the caller is waiting for that bit. So flush_tlb_others() has two callers: - arch_tlbbatch_flush() -- info::end = TLB_FLUSH_ALL - flush_tlb_mm_range() -- info::mm = mm native_flush_tlb_others() does smp_call_function_many( .func=flush_tlb_func_remote) which in turn calls flush_tlb_func_common( .local=false, .reason=TLB_REMOTE_SHOOTDOWN). So something like: struct flush_tlb_info info = { .start = 0, .end = TLB_FLUSH_ALL, }; flush_tlb_func_common(&info, false, TLB_REMOTE_SHOOTDOWN); should work for the new IPI. It 'upgrades' all ranges to full flushes, but meh.
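The ordering described above (flush first, clear IPI_PENDING only afterwards, because the sender is waiting on that bit) can be sketched in userspace C11. This is only an illustration under assumptions: the flag values, fake_full_flush() and tlb_ipi_handler() are made-up stand-ins, and atomic_compare_exchange_weak() plays the role of the kernel's try_cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values; the real KVM_VCPU_* flags come from the patch series. */
#define VCPU_PREEMPTED    (1u << 0)
#define VCPU_SHOULD_FLUSH (1u << 1)
#define VCPU_IPI_PENDING  (1u << 2)

static unsigned int flush_count; /* how many stand-in flushes have run */

/* Stand-in for a full local TLB flush. */
static void fake_full_flush(void)
{
    flush_count++;
}

/* Returns true if a flush was performed on behalf of a waiting sender. */
static bool tlb_ipi_handler(_Atomic unsigned int *preempted)
{
    unsigned int state;

    assert(preempted != NULL);

    state = atomic_load(preempted);
    if (!(state & VCPU_IPI_PENDING))
        return false;

    /* Do the work first: the sender spins until IPI_PENDING is gone. */
    fake_full_flush();

    /*
     * Only now drop IPI_PENDING. On failure the compare-exchange reloads
     * 'state', so the new value is recomputed from the fresh flags.
     */
    while (!atomic_compare_exchange_weak(preempted, &state,
                                         state & ~VCPU_IPI_PENDING))
        ;

    return true;
}
```

Calling tlb_ipi_handler() on a word that has IPI_PENDING set runs one flush and leaves the other bits untouched; calling it again is a no-op.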
Re: [PATCH 0/2] backlight: pwm_bl: prevent backlight flicker when switching PWM on
Hi, On Fri, 10 Nov 2017 12:22:15 +0100 Enric Balletbo i Serra wrote: > Hi all, > > On 08/11/17 11:48, Daniel Thompson wrote: > > On 26/10/17 13:49, Lothar Waßmann wrote: > >> These patches implement some measures to prevent backlight flicker > >> when the backlight is being switched on for a display with an active > >> low brightness control pin. > >> GIT: [PATCH 1/2] backlight: pwm_bl: Enable PWM before switching regulator > >> GIT: [PATCH 2/2] backlight: pwm_bl: add configurable delay between > > > > Other than hoisting the pwm_enable() even earlier in the setup sequence this > > patchset seems to have a significant overlap with Enric's recent posting: > > > > https://lkml.org/lkml/2017/7/21/211 > > > > Any chance of a shared view on this, especially on the DT bindings? > > > > The DT binding were discussed some time ago for the patch series I sent, > though > there isn't a final ack from DT maintainer. > > Lothar, the series I sent have been reviewed and acked, can you test if the > series fixes the problem for you too? > I'll try to test it within the next couple of days and will report back. Lothar Waßmann
Re: [PATCH 4.13 00/33] 4.13.13-stable review
On Mon, Nov 13, 2017 at 02:29:09PM -0800, Guenter Roeck wrote: > On Mon, Nov 13, 2017 at 01:56:21PM +0100, Greg Kroah-Hartman wrote: > > This is the start of the stable review cycle for the 4.13.13 release. > > There are 33 patches in this series, all will be posted as a response > > to this one. If anyone has any issues with these being applied, please > > let me know. > > > > Responses should be made by Wed Nov 15 12:55:46 UTC 2017. > > Anything received after that time might be too late. > > > > Build results: > total: 145 pass: 145 fail: 0 > Qemu test results: > total: 123 pass: 123 fail: 0 > > Details are available at http://kerneltests.org/builders. Thanks for testing all of these and letting me know. greg k-h
Re: [PATCH 4.9 00/87] 4.9.62-stable review --> crash
On 14.11.2017 at 08:41, Greg Kroah-Hartman wrote: On Tue, Nov 14, 2017 at 07:48:47AM +0100, Sebastian Gottschall wrote: ahm it compiles well. but [ 24.838120] Unable to handle kernel NULL pointer dereference at virtual address 0055 [ 24.846193] pgd = c0004000 [ 24.848893] [0055] *pgd= [ 24.852472] Internal error: Oops - BUG: 817 [#1] PREEMPT SMP ARM [ 24.858463] Modules linked in: xhci_plat_hcd xhci_pci xhci_hcd ohci_hcd ehci_pci ehci_platform ehci_hcd usbcore usb_common nls_base qca_ssdk gpio_pca953x mii_gpio wil6210 ath10k_pci ath10k_core ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 compat [ 24.880852] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.9.62-rc1 #90 [ 24.887189] Hardware name: AnnapurnaLabs Alpine (Device Tree) [ 24.892921] task: ef029ac0 task.stack: ef05a000 [ 24.897444] PC is at nf_nat_cleanup_conntrack+0x4c/0x74 [ 24.902657] LR is at nf_nat_cleanup_conntrack+0x38/0x74 [ 24.907869] pc : [] lr : [] psr: 6153 [ 24.907869] sp : ef05bb58 ip : ef05bb58 fp : ef05bb6c [ 24.919317] r10: ed230cc0 r9 : ed230c00 r8 : edf45800 [ 24.924529] r7 : ebcd2f00 r6 : ec33739e r5 : c0892294 r4 : ebcd2f00 [ 24.931040] r3 : r2 : 0055 r1 : r0 : c0892718 [ 24.937551] Flags: nZCv IRQs on FIQs off Mode SVC_32 ISA ARM Segment user [ 24.944755] Control: 10c5387d Table: 2bd1006a DAC: 0055 [ 24.950486] Process swapper/2 (pid: 0, stack limit = 0xef05a210) [ 24.956477] Stack: (0xef05bb58 to 0xef05c000) will dig into the code to find the reason Can you run 'git bisect' or if you use quilt, do a manual bisect to find the offending patch? already done netfilter: nat: Revert "netfilter: nat: convert nat bysrc hash to rhashtable" this one caused the crash. 
If I revert it, it's working again. Sebastian -- Regards Sebastian Gottschall / CTO NewMedia-NET GmbH - DD-WRT Registered office: Stubenwaldallee 21a, 64625 Bensheim Register court: Amtsgericht Darmstadt, HRB 25473 Managing directors: Peter Steinhäuser, Christian Scheele http://www.dd-wrt.com email: s.gottsch...@dd-wrt.com Tel.: +496251-582650 / Fax: +496251-5826565
RE: [PATCH 1/2] mm: drop migrate type checks from has_unmovable_pages
Hi Michal, > -Original Message- > From: Michal Hocko [mailto:mho...@kernel.org] > Sent: Tuesday, November 14, 2017 3:07 PM > To: Ran Wang > Cc: linux...@kvack.org; Michael Ellerman ; Vlastimil > Babka ; Andrew Morton ; > KAMEZAWA Hiroyuki ; Reza Arbab > ; Yasuaki Ishimatsu ; > qiuxi...@huawei.com; Igor Mammedov ; Vitaly > Kuznetsov ; LKML ; > Leo Li ; Xiaobo Xie > Subject: Re: [PATCH 1/2] mm: drop migrate type checks from > has_unmovable_pages > > On Tue 14-11-17 06:10:00, Ran Wang wrote: > [...] > > > > This drop cause DWC3 USB controller fail on initialization with > > > > Layerscaper processors (such as LS1043A) as below: > > > > > > > > [2.701437] xhci-hcd xhci-hcd.0.auto: new USB bus registered, > assigned > > > bus number 1 > > > > [2.710949] cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -16 > > > > [2.717411] xhci-hcd xhci-hcd.0.auto: can't setup: -12 > > > > [2.727940] xhci-hcd xhci-hcd.0.auto: USB bus 1 deregistered > > > > [2.733607] xhci-hcd: probe of xhci-hcd.0.auto failed with error -12 > > > > [2.739978] xhci-hcd xhci-hcd.1.auto: xHCI Host Controller > > > > > > > > And I notice that someone also reported to you that DWC2 got > > > > affected recently, so do you have the solution now? > > > > > > Yes. It should be in linux-next. Have a look at the following email > > > thread: > > > > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Flkml. > > > > kernel.org%2Fr%2F20171104082500.qvzbb2kw4suo6cgy%40dhcp22.suse.cz& > > > > data=02%7C01%7Cran.wang_1%40nxp.com%7C5e73c6a941fc4f1c10e708d52 > > > > a860c5b%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C636461677 > > > > 583607877&sdata=zlRxJ4LZwOBsit5qRx9yFT5qfP54wZ0z6G1z%2Bcywf5g%3D > > > &reserved=0 > > I really have no idea where the above link came from because my email had > a reference to > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Flkml. 
> kernel.org%2Fr%2F20171104082500.qvzbb2kw4suo6cgy%40dhcp22.suse.cz& > data=02%7C01%7Cran.wang_1%40nxp.com%7C9b452e62f11e446d12b408d5 > 2b2e4014%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C63646239 > 9997608449&sdata=S9MPhGyIUiYCJdVYMh3DAHAEytu%2Fu45BB%2BcMhO% > 2BP3Qo%3D&reserved=0 > Has your email client modified the original email? > > > Thanks for your info, although I fail to open the link you shared, but > > I got patch from my colleague and the issue got fix on my side, let you > > know, > thanks. > > Thanks for your testing anyway. Can I assume your Tested-by? Yes, please. BR Ran > -- > Michal Hocko > SUSE Labs
Re: [PATCH RFC v3 4/6] Documentation: Add three sysctls for smart idle poll
* Quan Xu wrote: > > > On 2017/11/13 23:08, Ingo Molnar wrote: > > * Quan Xu wrote: > > > > > From: Quan Xu > > > > > > To reduce the cost of poll, we introduce three sysctl to control the > > > poll time when running as a virtual machine with paravirt. > > > > > > Signed-off-by: Yang Zhang > > > Signed-off-by: Quan Xu > > > --- > > > Documentation/sysctl/kernel.txt | 35 > > > +++ > > > arch/x86/kernel/paravirt.c |4 > > > include/linux/kernel.h |6 ++ > > > kernel/sysctl.c | 34 > > > ++ > > > 4 files changed, 79 insertions(+), 0 deletions(-) > > > > > > diff --git a/Documentation/sysctl/kernel.txt > > > b/Documentation/sysctl/kernel.txt > > > index 694968c..30c25fb 100644 > > > --- a/Documentation/sysctl/kernel.txt > > > +++ b/Documentation/sysctl/kernel.txt > > > @@ -714,6 +714,41 @@ kernel tries to allocate a number starting from this > > > one. > > > == > > > +paravirt_poll_grow: (X86 only) > > > + > > > +Multiplied value to increase the poll time. This is expected to take > > > +effect only when running as a virtual machine with CONFIG_PARAVIRT > > > +enabled. This can't bring any benifit on bare mental even with > > > +CONFIG_PARAVIRT enabled. > > > + > > > +By default this value is 2. Possible values to set are in range {2..16}. > > > + > > > +== > > > + > > > +paravirt_poll_shrink: (X86 only) > > > + > > > +Divided value to reduce the poll time. This is expected to take effect > > > +only when running as a virtual machine with CONFIG_PARAVIRT enabled. > > > +This can't bring any benifit on bare mental even with CONFIG_PARAVIRT > > > +enabled. > > > + > > > +By default this value is 2. Possible values to set are in range {2..16}. > > > + > > > +== > > > + > > > +paravirt_poll_threshold_ns: (X86 only) > > > + > > > +Controls the maximum poll time before entering real idle path. This is > > > +expected to take effect only when running as a virtual machine with > > > +CONFIG_PARAVIRT enabled. 
This can't bring any benifit on bare mental > > > +even with CONFIG_PARAVIRT enabled. > > > + > > > +By default, this value is 0 means not to poll. Possible values to set > > > +are in range {0..50}. Change the value to non-zero if running > > > +latency-bound workloads in a virtual machine. > > I absolutely hate it how this hybrid idle loop polling mechanism is not > > self-tuning! > > Ingo, actually it is self-tuning.. Then why the hell does it touch the syscall ABI? > could I only leave paravirt_poll_threshold_ns parameter (the maximum poll > time), > which is as similar as "adaptive halt-polling" Wanpeng mentioned.. then user > can > turn it off, or find an appropriate threshold for some odd scenario.. That way lies utter madness. Maybe add it as a debugfs knob, but exposing it to userspace: NAK. Thanks, Ingo
Re: [PATCH 3.18 00/28] 3.18.81-stable review
On Mon, Nov 13, 2017 at 02:50:22PM -0700, Shuah Khan wrote: > On 11/13/2017 05:54 AM, Greg Kroah-Hartman wrote: > > This is the start of the stable review cycle for the 3.18.81 release. > > There are 28 patches in this series, all will be posted as a response > > to this one. If anyone has any issues with these being applied, please > > let me know. > > > > Responses should be made by Wed Nov 15 12:53:41 UTC 2017. > > Anything received after that time might be too late. > > > > The whole patch series can be found in one patch at: > > kernel.org/pub/linux/kernel/v3.x/stable-review/patch-3.18.81-rc1.gz > > or in the git tree and branch at: > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git > > linux-3.18.y > > and the diffstat can be found below. > > > > thanks, > > > > greg k-h > > > > Compiled and booted on my test system. No dmesg regressions. Thanks for testing all of these and letting me know. greg k-h
Re: [PATCH 4.13 00/33] 4.13.13-stable review
On Mon, Nov 13, 2017 at 02:02:12PM -0800, kernelci.org bot wrote: > stable-rc/linux-4.13.y boot: 127 boots: 10 failed, 116 passed with 1 conflict > (v4.13.12-34-g109b28ca1340) > > Full Boot Summary: > https://kernelci.org/boot/all/job/stable-rc/branch/linux-4.13.y/kernel/v4.13.12-34-g109b28ca1340/ > Full Build Summary: > https://kernelci.org/build/stable-rc/branch/linux-4.13.y/kernel/v4.13.12-34-g109b28ca1340/ > > Tree: stable-rc > Branch: linux-4.13.y > Git Describe: v4.13.12-34-g109b28ca1340 > Git Commit: 109b28ca1340961002d4bede168f07823451b8e4 > Git URL: > http://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git > Tested: 51 unique boards, 17 SoC families, 18 builds out of 188 > > Boot Regressions Detected: > > arm: > > at91_dt_defconfig: > at91rm9200ek_rootfs:nfs: > lab-free-electrons: new failure (last pass: v4.13.12) > > exynos_defconfig: > exynos4412-odroidx2_rootfs:nfs: > lab-collabora: failing since 4 days (last pass: > v4.13.11-37-g2295c8345797 - first fail: v4.13.12) > exynos5250-snow_rootfs:nfs: > lab-collabora: failing since 4 days (last pass: > v4.13.11-37-g2295c8345797 - first fail: v4.13.12) Thanks for these "failing since..." markings, that makes me feel better that I didn't break anything on my end :) greg k-h
Re: [PATCH 4.9 00/87] 4.9.62-stable review --> crash
On Tue, Nov 14, 2017 at 07:48:47AM +0100, Sebastian Gottschall wrote: > ahm it compiles well. but > > [ 24.838120] Unable to handle kernel NULL pointer dereference at virtual > address 0055 > [ 24.846193] pgd = c0004000 > [ 24.848893] [0055] *pgd= > [ 24.852472] Internal error: Oops - BUG: 817 [#1] PREEMPT SMP ARM > [ 24.858463] Modules linked in: xhci_plat_hcd xhci_pci xhci_hcd ohci_hcd > ehci_pci ehci_platform ehci_hcd usbcore usb_common nls_base qca_ssdk > gpio_pca953x mii_gpio wil6210 ath10k_pci ath10k_core ath9k ath9k_common > ath9k_hw ath mac80211 cfg80211 compat > [ 24.880852] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.9.62-rc1 #90 > [ 24.887189] Hardware name: AnnapurnaLabs Alpine (Device Tree) > [ 24.892921] task: ef029ac0 task.stack: ef05a000 > [ 24.897444] PC is at nf_nat_cleanup_conntrack+0x4c/0x74 > [ 24.902657] LR is at nf_nat_cleanup_conntrack+0x38/0x74 > [ 24.907869] pc : [] lr : [] psr: 6153 > [ 24.907869] sp : ef05bb58 ip : ef05bb58 fp : ef05bb6c > [ 24.919317] r10: ed230cc0 r9 : ed230c00 r8 : edf45800 > [ 24.924529] r7 : ebcd2f00 r6 : ec33739e r5 : c0892294 r4 : ebcd2f00 > [ 24.931040] r3 : r2 : 0055 r1 : r0 : c0892718 > [ 24.937551] Flags: nZCv IRQs on FIQs off Mode SVC_32 ISA ARM Segment > user > [ 24.944755] Control: 10c5387d Table: 2bd1006a DAC: 0055 > [ 24.950486] Process swapper/2 (pid: 0, stack limit = 0xef05a210) > [ 24.956477] Stack: (0xef05bb58 to 0xef05c000) > > > will dig into the code to find the reason Can you run 'git bisect' or if you use quilt, do a manual bisect to find the offending patch? thanks, greg k-h
Re: [PATCH 0/7] net: core: devname allocation cleanups
From: Rasmus Villemoes Date: Mon, 13 Nov 2017 00:15:03 +0100 > It's somewhat confusing to have both dev_alloc_name and > dev_get_valid_name. I can't see why the former is less strict than the > latter, so make them (or rather dev_alloc_name_ns and > dev_get_valid_name) equivalent, hardening dev_alloc_name() a little. > > Obvious follow-up patches would be to only export one function, and > make dev_alloc_name a static inline wrapper for that (whichever name > is chosen for the exported interface). But maybe there is a good > reason the two exported interfaces do different checking, so I'll > refrain from including the trivial but tree-wide renaming in this > series. Series applied, thanks.
Re: [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush
On Tue, Nov 14, 2017 at 02:10:16PM +0800, Wanpeng Li wrote: > 2017-11-13 21:02 GMT+08:00 Peter Zijlstra : > > That can be written like: > > > > do { > > if (state & KVM_VCPU_PREEMPTED) > > new_state = state | KVM_VCPU_SHOULD_FLUSH; > > else > > new_state = state | KVM_VCPU_IPI_PENDING; > > } while (!try_cmpxchg(&src->preempted, state, new_state); > > > > if (new_state & KVM_VCPU_IPI_PENDING) > > Should be new_state & KVM_VCPU_SHOULD_FLUSH I think. Quite so indeed.
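For reference, the sender-side loop quoted above can be written out in compilable form, with the state passed by address to the compare-exchange and with the corrected check on SHOULD_FLUSH. This is a userspace sketch under assumptions: the flag values and the name request_remote_flush() are hypothetical, and C11 atomic_compare_exchange_weak() stands in for try_cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values; the real KVM_VCPU_* flags come from the patch series. */
#define VCPU_PREEMPTED    (1u << 0)
#define VCPU_SHOULD_FLUSH (1u << 1)
#define VCPU_IPI_PENDING  (1u << 2)

/*
 * Mark one target vCPU for a remote flush.
 * Returns true when the target was preempted, i.e. the flush was deferred
 * via SHOULD_FLUSH and no IPI is needed for it.
 */
static bool request_remote_flush(_Atomic unsigned int *preempted)
{
    unsigned int state, new_state;

    assert(preempted != NULL);

    state = atomic_load(preempted);
    do {
        if (state & VCPU_PREEMPTED)
            new_state = state | VCPU_SHOULD_FLUSH;
        else
            new_state = state | VCPU_IPI_PENDING;
        /* A failed exchange refreshes 'state' and the loop recomputes. */
    } while (!atomic_compare_exchange_weak(preempted, &state, new_state));

    return (new_state & VCPU_SHOULD_FLUSH) != 0;
}
```

A running target ends up with IPI_PENDING set and the function returns false; a preempted target ends up with SHOULD_FLUSH set and the function returns true.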
Re: [RESEND PATCH v2 4/4] x86/umip: Warn if UMIP-protected instructions are used
* Ricardo Neri wrote: > +const char * const umip_insns[5] = { > + [UMIP_INST_SGDT] = "sgdt", > + [UMIP_INST_SIDT] = "sidt", > + [UMIP_INST_SMSW] = "smsw", > + [UMIP_INST_SLDT] = "sldt", > + [UMIP_INST_STR] = "str", > +}; Sigh ... > +/* > + * If you change these strings, ensure that buffers using them are > sufficiently > + * large. > + */ > +static const char umip_warn_use[] = "cannot be used by applications."; > +static const char umip_warn_emu[] = "For now, expensive software emulation > returns result."; Please use the string literals directly, don't add an extra obfuscation layer. Plus: > + unsigned char buf[MAX_INSN_SIZE], warn[128]; > + snprintf(warn, sizeof(warn), "%s %s", umip_insns[umip_inst], > + umip_warn_use); This is incredibly fragile against future buffer overflows, and warning about it in comments does not make it less fragile! Thanks, Ingo
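One way to read the review comment is sketched below in userspace C: the message text sits directly in the format string, the array is sized by its initializer, and the only buffer involved is checked for truncation instead of relying on a comment. This is not the code that was eventually merged; the enum values, UMIP_INST_COUNT and umip_format_warning() are assumptions made for the example, and in the kernel the literals could go straight into a printk-style call with no intermediate buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical numbering; only the UMIP_INST_* names appear in the patch. */
enum umip_insn {
    UMIP_INST_SGDT,
    UMIP_INST_SIDT,
    UMIP_INST_SMSW,
    UMIP_INST_SLDT,
    UMIP_INST_STR,
    UMIP_INST_COUNT
};

static const char *const umip_insns[UMIP_INST_COUNT] = {
    [UMIP_INST_SGDT] = "sgdt",
    [UMIP_INST_SIDT] = "sidt",
    [UMIP_INST_SMSW] = "smsw",
    [UMIP_INST_SLDT] = "sldt",
    [UMIP_INST_STR]  = "str",
};

/* Returns 0 on success, -1 if the message did not fit into buf. */
static int umip_format_warning(char *buf, size_t len, enum umip_insn insn,
                               bool emulated)
{
    int n;

    assert(buf != NULL && len > 0 && insn < UMIP_INST_COUNT);

    /* The literals live in the format string, so nothing can drift apart. */
    if (emulated)
        n = snprintf(buf, len,
                     "%s instruction cannot be used by applications. "
                     "For now, expensive software emulation returns the result.",
                     umip_insns[insn]);
    else
        n = snprintf(buf, len,
                     "%s instruction cannot be used by applications.",
                     umip_insns[insn]);

    /* snprintf reports the untruncated length; treat truncation as an error. */
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```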
RE: [patch v2 3/8] KVM: x86: add Intel processor trace virtualization mode
> > + if (!(_cpu_based_2nd_exec_control & SECONDARY_EXEC_PT_USE_GPA) || > > + !(_vmexit_control & VM_EXIT_CLEAR_IA32_RTIT_CTL) || > > + !(_vmentry_control & VM_ENTRY_LOAD_IA32_RTIT_CTL)) { > > + _cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_PT_USE_GPA; > > Also, you are not checking anywhere if the SUPPRESS_PIP controls are > available. This is probably the best place. SUPPRESS_PIP(should be "CONCEAL", will fix it.) is use for control of processor trace packet. I think we should clear it when in SYSTEM mode (For example, PIPs are generated on VM exit, with NonRoot=0. On VM exit to SMM, VMCS packets are additionally generated). Why need check this here? > > > + _vmexit_control &= ~VM_EXIT_CLEAR_IA32_RTIT_CTL; > > + _vmentry_control &= ~VM_ENTRY_LOAD_IA32_RTIT_CTL; > > These two are not needed; disabling SECONDARY_EXEC_PT_USE_GPA is enough. > The tracing mode will revert to PT_SYSTEM, which does not use the load/clear > RTIT_CTL controls. > The status of *_RTIT_CTL should be same with SECONDARY_EXEC_PT_USE_GPA or would cause VM-entry failed. (architecture-instruction-set-extensions-programming-reference 5.2.3)
[f2fs-dev] [PATCH RESEND] f2fs: fix concurrent problem for updating free bitmap
alloc_nid_failed and scan_nat_page can be called at the same time, and we haven't protected add_free_nid and update_free_nid_bitmap with the same nid_list_lock. That could lead to Thread AThread B - __build_free_nids - scan_nat_page - add_free_nid - alloc_nid_failed - update_free_nid_bitmap - update_free_nid_bitmap scan_nat_page will clear the free bitmap since the nid is PREALLOC_NID, but alloc_nid_failed needs to set the free bitmap. This results in free nid with free bitmap cleared. This patch update the bitmap under the same nid_list_lock in add_free_nid. Signed-off-by: Fan li --- fs/f2fs/node.c | 82 ++ 1 file changed, 42 insertions(+), 40 deletions(-) diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c index b965a53..0a217d2 100644 --- a/fs/f2fs/node.c +++ b/fs/f2fs/node.c @@ -1811,8 +1811,33 @@ static void __move_free_nid(struct f2fs_sb_info *sbi, struct free_nid *i, } } +static void update_free_nid_bitmap(struct f2fs_sb_info *sbi, nid_t nid, + bool set, bool build) +{ + struct f2fs_nm_info *nm_i = NM_I(sbi); + unsigned int nat_ofs = NAT_BLOCK_OFFSET(nid); + unsigned int nid_ofs = nid - START_NID(nid); + + if (!test_bit_le(nat_ofs, nm_i->nat_block_bitmap)) + return; + + if (set) { + if (test_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs])) + return; + __set_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs]); + nm_i->free_nid_count[nat_ofs]++; + } else { + if (!test_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs])) + return; + __clear_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs]); + if (!build) + nm_i->free_nid_count[nat_ofs]--; + } +} + /* return if the nid is recognized as free */ -static bool add_free_nid(struct f2fs_sb_info *sbi, nid_t nid, bool build) +static bool add_free_nid(struct f2fs_sb_info *sbi, + nid_t nid, bool build, bool update) { struct f2fs_nm_info *nm_i = NM_I(sbi); struct free_nid *i, *e; @@ -1870,6 +1895,11 @@ static bool add_free_nid(struct f2fs_sb_info *sbi, nid_t nid, bool build) ret = true; err = __insert_free_nid(sbi, i, FREE_NID); err_out: + 
if (update) { + update_free_nid_bitmap(sbi, nid, ret, build); + if (!build) + nm_i->available_nids++; + } spin_unlock(&nm_i->nid_list_lock); radix_tree_preload_end(); err: @@ -1896,30 +1926,6 @@ static void remove_free_nid(struct f2fs_sb_info *sbi, nid_t nid) kmem_cache_free(free_nid_slab, i); } -static void update_free_nid_bitmap(struct f2fs_sb_info *sbi, nid_t nid, - bool set, bool build) -{ - struct f2fs_nm_info *nm_i = NM_I(sbi); - unsigned int nat_ofs = NAT_BLOCK_OFFSET(nid); - unsigned int nid_ofs = nid - START_NID(nid); - - if (!test_bit_le(nat_ofs, nm_i->nat_block_bitmap)) - return; - - if (set) { - if (test_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs])) - return; - __set_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs]); - nm_i->free_nid_count[nat_ofs]++; - } else { - if (!test_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs])) - return; - __clear_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs]); - if (!build) - nm_i->free_nid_count[nat_ofs]--; - } -} - static void scan_nat_page(struct f2fs_sb_info *sbi, struct page *nat_page, nid_t start_nid) { @@ -1937,18 +1943,18 @@ static void scan_nat_page(struct f2fs_sb_info *sbi, i = start_nid % NAT_ENTRY_PER_BLOCK; for (; i < NAT_ENTRY_PER_BLOCK; i++, start_nid++) { - bool freed = false; - if (unlikely(start_nid >= nm_i->max_nid)) break; blk_addr = le32_to_cpu(nat_blk->entries[i].block_addr); f2fs_bug_on(sbi, blk_addr == NEW_ADDR); - if (blk_addr == NULL_ADDR) - freed = add_free_nid(sbi, start_nid, true); - spin_lock(&NM_I(sbi)->nid_list_lock); - update_free_nid_bitmap(sbi, start_nid, freed, true); - spin_unlock(&NM_I(sbi)->nid_list_lock); + if (blk_addr == NULL_ADDR) { + add_free_nid(sbi, start_nid, true, true); + } else { + spin_lock(&NM_I(sbi)->nid_list_lock); + update_free_nid_bitmap(sbi, start_nid, false
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 14/11/17 08:02, Quan Xu wrote: > > > On 2017/11/13 18:53, Juergen Gross wrote: >> On 13/11/17 11:06, Quan Xu wrote: >>> From: Quan Xu >>> >>> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called >>> in idle path which will poll for a while before we enter the real idle >>> state. >>> >>> In virtualization, idle path includes several heavy operations >>> includes timer access(LAPIC timer or TSC deadline timer) which will >>> hurt performance especially for latency intensive workload like message >>> passing task. The cost is mainly from the vmexit which is a hardware >>> context switch between virtual machine and hypervisor. Our solution is >>> to poll for a while and do not enter real idle path if we can get the >>> schedule event during polling. >>> >>> Poll may cause the CPU waste so we adopt a smart polling mechanism to >>> reduce the useless poll. >>> >>> Signed-off-by: Yang Zhang >>> Signed-off-by: Quan Xu >>> Cc: Juergen Gross >>> Cc: Alok Kataria >>> Cc: Rusty Russell >>> Cc: Thomas Gleixner >>> Cc: Ingo Molnar >>> Cc: "H. Peter Anvin" >>> Cc: x...@kernel.org >>> Cc: virtualizat...@lists.linux-foundation.org >>> Cc: linux-kernel@vger.kernel.org >>> Cc: xen-de...@lists.xenproject.org >> Hmm, is the idle entry path really so critical to performance that a new >> pvops function is necessary? > Juergen, Here is the data we get when running benchmark netperf: > 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): > 29031.6 bit/s -- 76.1 %CPU > > 2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0): > 35787.7 bit/s -- 129.4 %CPU > > 3. w/ kvm dynamic poll: > 35735.6 bit/s -- 200.0 %CPU > > 4. w/patch and w/ kvm dynamic poll: > 42225.3 bit/s -- 198.7 %CPU > > 5. idle=poll > 37081.7 bit/s -- 998.1 %CPU > > > > w/ this patch, we will improve performance by 23%.. even we could improve > performance by 45.4%, if we use w/patch and w/ kvm dynamic poll. also the > cost of CPU is much lower than 'idle=poll' case.. 
I don't question the general idea. I just think pvops isn't the best way to implement it. >> Wouldn't a function pointer, maybe guarded >> by a static key, be enough? A further advantage would be that this would >> work on other architectures, too. > > I assume this feature will be ported to other archs.. a new pvops makes > code > clean and easy to maintain. also I tried to add it into existed pvops, > but it > doesn't match. You are aware that pvops is x86 only? I really don't see the big difference in maintainability compared to the static key / function pointer variant: void (*guest_idle_poll_func)(void); struct static_key guest_idle_poll_key __read_mostly; static inline void guest_idle_poll(void) { if (static_key_false(&guest_idle_poll_key)) guest_idle_poll_func(); } And KVM would just need to set guest_idle_poll_func and enable the static key. Works on non-x86 architectures, too. Juergen
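The function pointer variant above can be tried out in userspace with a plain flag in place of the static key. This is only a sketch under assumptions: guest_idle_poll_enabled, guest_idle_poll_register(), demo_poll() and demo_polls are invented for the illustration, and a bool does not give the patched-branch behaviour of a real static key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hook a hypervisor backend may install; NULL until registered. */
static void (*guest_idle_poll_func)(void);

/* Plain flag standing in for the kernel's static key. */
static bool guest_idle_poll_enabled;

/* Called from the idle path; a no-op until a backend registers. */
static void guest_idle_poll(void)
{
    if (guest_idle_poll_enabled)
        guest_idle_poll_func();
}

/* What a backend such as KVM guest code would do once at init time. */
static void guest_idle_poll_register(void (*fn)(void))
{
    assert(fn != NULL);

    /* Install the pointer before enabling, so the call never sees NULL. */
    guest_idle_poll_func = fn;
    guest_idle_poll_enabled = true;
}

/* Demo backend that only counts how often it was polled. */
static unsigned int demo_polls;

static void demo_poll(void)
{
    demo_polls++;
}
```

Before registration guest_idle_poll() does nothing; after guest_idle_poll_register(demo_poll) each call bumps demo_polls by one.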
Re: [PATCH] iio: mma8452: add power_mode sysfs configuration
On 14.11.2017 at 05:56, harinath Nampally wrote: Hi Martin, But given your concerns, I would strip down this patch to only offer the already documented "low_noise" and "low_power" modes. It wouldn't be worth it to extend the ABI just because of this! OK then we can map 'low_noise' to high resolution mode. But I am afraid I can't test the functionality because I don't have proper instruments to measure the current draw (in microamps) accurately. I would like "oversampling" more than this "power_mode" too. For this driver it would be far more complicated to implement though. I doubt that it'll be done. power_mode is basically already there implicitly, and given that there *is* the ABI, we could offer it for free. I think 'oversampling' is already implemented, as I see 'case IIO_CHAN_INFO_OVERSAMPLING_RATIO:' being handled which is basically setting all 4 different power modes. If we also add 'power_mode', I think it would be like having two different user interfaces for the same functionality. So I don't see much value in adding 'power_mode' as well. Please correct me if I am wrong. Thanks, Harinath You're right. I should've looked more closely. oversampling is there and seems to work. No need to blow up this driver, let alone extend an ABI, now. Let's drop this patch. thanks martin On Sun, Nov 12, 2017 at 7:28 AM, Martin Kepplinger wrote: On 2017-11-11 01:33, Jonathan Cameron wrote: On Mon, 6 Nov 2017 08:19:58 +0100 Martin Kepplinger wrote: This adds the power_mode sysfs interface to the device as documented in sysfs-bus-iio. --- Note that I explicitly don't sign off on this. This is a starting point for anybody who can test it and check for correct API usage, and ABI correctness, as documented in Documentation/ABI/testing/sysfs-bus-iio (grep it for "power_mode"). The ABI doc probably would need an addition too, if the 4 power modes here seem generally useful (there are only 2 listed there)! 
So, if you can test this, feel free to set up a proper patch or two, and I'm happy to review. Please note that this patch is quite old. It really should be that simple, as far as I understood back then. We always list the available frequencies of the given power mode we are in, for example, already, and everything basically is in place except for the user interface. Hmm. A lot of devices support something along these lines. The issue has always been - how is userspace to figure out what to do with it? It's all very vague... Funnily enough - this used to be really common, but is becoming less so now - presumably because no one was using it much (or maybe I am reading too much into that ;) Now the question is whether it can be tied to better defined things? Here low noise restricts the range to 4g. Issue is that we don't actually have writeable _available attributes (which correspond to range in this case). Does it? Isn't it merely less oversampling? Low power mode... This one is apparently oversampling. If possible support it as that as we have well defined interfaces for that. Jonathan. Ah, I remember; the oversampling settings were actually a reason why I hadn't submitted the patch :) The oversampling API would definitely be more accurate. I would like "oversampling" more than this "power_mode" too. For this driver it would be far more complicated to implement though. I doubt that it'll be done. power_mode is basically already there implicitly, and given that there *is* the ABI, we could offer it for free. But given your concerns, I would strip down this patch to only offer the already documented "low_noise" and "low_power" modes. It wouldn't be worth it to extend the ABI just because of this! Users would have a simple switch if they don't really *want* to know the details. I think it can be useful to just say "I don't care about power consumption. Be as accurate as possible" or "I just want this thing to work. Use as little power as possible." 
Sure, it's vague, but would it be useless?
Re: [PATCH v3 2/3] usb: xhci: Add DbC support in xHCI driver
Hi, Mathias Nyman writes: >> +static int dbc_buf_alloc(struct dbc_buf *db, unsigned int size) >> +{ >> +db->buf_buf = kzalloc(size, GFP_KERNEL); >> +if (!db->buf_buf) >> +return -ENOMEM; >> + >> +db->buf_size = size; >> +db->buf_put = db->buf_buf; >> +db->buf_get = db->buf_buf; >> + >> +return 0; >> +} you may wanna have a look at kfifo. -- balbi
Re: [PATCH] lib: Avoid redundant sizeof checking in __bitmap_weight() calculation.
On 14 November 2017 at 07:57, Rakib Mullick wrote: > Currently, during __bitmap_weight() calculation hweight_long() is used. > Inside a hweight_long() a check has been made to figure out whether a > hweight32() or hweight64() version to use. > > diff --git a/lib/bitmap.c b/lib/bitmap.c > index d8f0c09..552096f 100644 > --- a/lib/bitmap.c > +++ b/lib/bitmap.c > @@ -241,10 +241,15 @@ EXPORT_SYMBOL(__bitmap_subset); > int __bitmap_weight(const unsigned long *bitmap, unsigned int bits) > { > unsigned int k, lim = bits/BITS_PER_LONG; > - int w = 0; > - > - for (k = 0; k < lim; k++) > - w += hweight_long(bitmap[k]); > + int w = 0, is32 = sizeof(bitmap[0]) ? 1 : 0; > + hint: sizeof() very rarely evaluates to zero... So this is the same as "is32 = 1". So the patch as-is is broken (and may explain the 1-byte delta in vmlinux). But even if this condition is fixed, the patch doesn't change anything, since the sizeof() evaluation is done at compile-time, regardless of whether it happens inside the inlined hweight_long or outside. So it is certainly not worth it to duplicate the loop. Rasmus
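Both points of the review can be checked with a small userspace sketch: sizeof() used as a condition is a nonzero compile-time constant, so "is32" as written is always 1, and a single loop over unsigned long words already covers 32-bit and 64-bit builds. The names popcount_ul(), bitmap_weight_sketch(), BITS_PER_ULONG, IS32_AS_WRITTEN and IS32_INTENDED are assumptions for the example, not kernel identifiers, and the comparison against 4 is only a guess at what the patch meant to test.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

enum {
    /* The condition from the patch: sizeof() is never 0, so this is 1. */
    IS32_AS_WRITTEN = sizeof(unsigned long) ? 1 : 0,
    /* Presumably what was meant; still resolved by the compiler. */
    IS32_INTENDED = sizeof(unsigned long) == 4
};

/* Portable population count for one word (Kernighan's method). */
static unsigned int popcount_ul(unsigned long w)
{
    unsigned int n = 0;

    while (w) {
        w &= w - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Count the set bits among the first 'bits' bits of the bitmap. */
static unsigned int bitmap_weight_sketch(const unsigned long *bitmap,
                                         unsigned int bits)
{
    unsigned int k, lim = bits / BITS_PER_ULONG, w = 0;

    assert(bitmap != NULL);

    for (k = 0; k < lim; k++)
        w += popcount_ul(bitmap[k]);

    /* Partial last word: mask off the bits beyond 'bits'. */
    if (bits % BITS_PER_ULONG) {
        unsigned long mask = ~0UL >> (BITS_PER_ULONG - bits % BITS_PER_ULONG);

        w += popcount_ul(bitmap[k] & mask);
    }
    return w;
}
```

Because the word width is known at compile time, splitting the loop into a 32-bit and a 64-bit copy by hand buys nothing over letting the compiler pick the right popcount.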
Re: [PATCH] iio: accel: mma8452: Add single pulse/tap event detection
On 14.11.2017 at 05:36, harinath Nampally wrote: > This patch adds following related changes: > - defines pulse event related registers > - enables and handles single pulse interrupt for fxls8471 > - handles IIO_EV_DIR_EITHER in read/write callbacks (because > event direction for pulse is either rising or falling) > - configures read/write event value for pulse latency register > using IIO_EV_INFO_HYSTERESIS > - adds multiple events like pulse and transient event spec > as elements of event_spec array named 'mma8452_accel_events' > > Except mma8653 chip all other chips like mma845x and > fxls8471 have single tap detection feature. > Tested thoroughly using iio_event_monitor application on > imx6ul-evk board which has fxls8471. > > Signed-off-by: Harinath Nampally > --- What tree is this written against? It doesn't apply to the current -next anyways. Thanks for the review. It is actually against 'testing' branch, I think two of my earlier patches are not yet applied to any branch, that might be the reason this patch is not good against current -next or 'togreg'. I think the definitions would deserve to be in a separate patch, but that's debatable. Yes, I would argue that definitions are not a logical change. 
I would argue definitions don't break the build and maybe slightly better support features like bisect or revert :) > .type = IIO_EV_TYPE_MAG, > .dir = IIO_EV_DIR_RISING, > .mask_separate = BIT(IIO_EV_INFO_ENABLE), > @@ -1139,6 +1274,15 @@ static const struct iio_event_spec mma8452_transient_event[] = { > BIT(IIO_EV_INFO_PERIOD) | > BIT(IIO_EV_INFO_HIGH_PASS_FILTER_3DB) > }, > + { > + //pulse event > + .type = IIO_EV_TYPE_MAG, > + .dir = IIO_EV_DIR_EITHER, > + .mask_separate = BIT(IIO_EV_INFO_ENABLE), > + .mask_shared_by_type = BIT(IIO_EV_INFO_VALUE) | > + BIT(IIO_EV_INFO_PERIOD) | > + BIT(IIO_EV_INFO_HYSTERESIS) > + }, > }; > > static const struct iio_event_spec mma8452_motion_event[] = { > @@ -1202,8 +1346,8 @@ static struct attribute_group mma8452_event_attribute_group = { > .shift = 16 - (bits), \ > .endianness = IIO_BE, \ > }, \ > - .event_spec = mma8452_transient_event, \ > - .num_event_specs = ARRAY_SIZE(mma8452_transient_event), \ > + .event_spec = mma8452_accel_events, \ > + .num_event_specs = ARRAY_SIZE(mma8452_accel_events), \ that would go in the mentioned separate renaming-patch OK so I will make a patch set; patch 1/2 to just rename 'mma8452_transient_event[]' to 'mma8452_accel_events[]'(without adding pulse event). and everything else would go in 2/2. Does that makes sense? It does to me.
Re: [RESEND PATCH v2 3/4] x86/umip: Identify the str and sldt instructions
* Ricardo Neri wrote:

> The instructions STR and SLDT are not emulated in any case. Thus, it made
> sense to not implement functionality to identify them. However, a
> subsequent commit will introduce functionality to warn about the use of
> all the instructions that UMIP protects, not only those that are emulated.
> A first step for that is the ability to identify them.
>
> Plus, now that STR and SLDT are identified, we need to explicitly avoid
> their emulation (i.e., not rely on unsuccessful identification). Group
> together all the cases that we do not want to emulate: STR, SLDT and user
> long mode processes.
>
> Cc: Andy Lutomirski
> Cc: H. Peter Anvin
> Cc: Borislav Petkov
> Cc: Tony Luck
> Cc: Paolo Bonzini
> Cc: Ravi V. Shankar
> Cc: x...@kernel.org
> Signed-off-by: Ricardo Neri

Sigh, the _title_ still refers to 'str'...

I'll fix it up, no need to resend, but this lack of attention to details is
seriously sad.

Thanks,

	Ingo
Re: [PATCH] arch, mm: introduce arch_tlb_gather_mmu_lazy (was: Re: [RESEND PATCH] mm, oom_reaper: gather each vma to prevent leaking TLB entry)
On Tue 14-11-17 10:45:49, Minchan Kim wrote:
[...]
> Anyway, I think Wang Nan's patch is already broken.
> http://lkml.kernel.org/r/%3c20171107095453.179940-1-wangn...@huawei.com%3E
>
> Because unmap_page_range(ie, zap_pte_range) can flush TLB forcefully
> and free pages. However, the architecture code for TLB flush cannot
> flush at all by wrong fullmm so other threads can write freed-page.

I am not sure I understand what you mean. How is that any different from
any other explicit partial madvise call?
-- 
Michal Hocko
SUSE Labs
[f2fs-dev] [PATCH RESEND v2] f2fs: validate before set/clear free nat bitmap
In flush_nat_entries, all dirty nats will be flushed and if their new
address isn't NULL_ADDR, their bitmaps will be updated, the free_nid_count
of the bitmaps will be increased regardless of whether the nats have
already been occupied before.
This could lead to wrong free_nid_count.
So this patch checks the status of the bits before actually set/clear them.

Fixes: 586d1492f301 ("f2fs: skip scanning free nid bitmap of full NAT blocks")
Signed-off-by: Fan li
---
 fs/f2fs/node.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index d234c6e..b965a53 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1906,15 +1906,18 @@ static void update_free_nid_bitmap(struct f2fs_sb_info *sbi, nid_t nid,
 	if (!test_bit_le(nat_ofs, nm_i->nat_block_bitmap))
 		return;
 
-	if (set)
+	if (set) {
+		if (test_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs]))
+			return;
 		__set_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs]);
-	else
-		__clear_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs]);
-
-	if (set)
 		nm_i->free_nid_count[nat_ofs]++;
-	else if (!build)
-		nm_i->free_nid_count[nat_ofs]--;
+	} else {
+		if (!test_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs]))
+			return;
+		__clear_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs]);
+		if (!build)
+			nm_i->free_nid_count[nat_ofs]--;
+	}
 }
 
 static void scan_nat_page(struct f2fs_sb_info *sbi,
-- 
2.7.4
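The idea in the patch above (only touch the counter when the bit really changes state) can be illustrated outside the kernel. The following is a minimal userspace sketch, not the f2fs code: the struct, the function name update_free_bit() and the single 64-bit word are hypothetical stand-ins, and the `build` special case from the real function is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one NAT block's free-nid state. */
struct nat_block_sketch {
	unsigned long long free_bitmap;	/* one bit per nid, bit set = nid is free */
	unsigned int free_count;	/* meant to equal the number of set bits */
};

/*
 * Set or clear bit `ofs` (0..63) and keep free_count in step.
 * Returns true only if the bit actually changed state.
 */
static bool update_free_bit(struct nat_block_sketch *blk, unsigned int ofs, bool set)
{
	unsigned long long mask = 1ULL << ofs;
	bool is_set = (blk->free_bitmap & mask) != 0;

	if (set == is_set)
		return false;	/* already in the requested state: counter untouched */

	if (set) {
		blk->free_bitmap |= mask;
		blk->free_count++;
	} else {
		blk->free_bitmap &= ~mask;
		blk->free_count--;
	}
	return true;
}
```

Without the early return, setting an already-set bit twice would bump free_count twice, which is the drift the commit message describes.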
Re: [PATCH 3/4] x86/umip: Identify the str and sldt instructions
* Ricardo Neri wrote:

> On Mon, Nov 13, 2017 at 09:12:03AM +0100, Ingo Molnar wrote:
> >
> > * Ricardo Neri wrote:
> >
> > > The instructions str and sldt are not emulated in any case. Thus, it made
> > > sense to not implement functionality to identify them. However, a
> > > subsequent commit will introduce functionality to warn about the use of
> > > all the instructions that UMIP protects, not only those that are emulated.
> > > A first step for that is the ability to identify them.
> > >
> > > Plus, now that str and sldt are identified, we need to explicitly avoid
> > > their emulation (i.e., not rely on unsuccessful identification). Group
> > > together all the cases that we do not want to emulate: str, sldt and user
> > > long mode processes.
> >
> > Did you notice how in all your previous patches (both in the code and in the
> > changelogs) I have manually fixed up the capitalization of these instruction
> > mnemonics?
>
> I am sorry, I tried to see where you made these changes but I could not find
> any. I did a git diff of arch/x86/kernel/umip.c between the branch rneri/umip_v11
> of my repository [1] and the master branch of the tip tree and I did not find
> any differences.

For example, I turned:

  [PATCH v11 12/12] selftests/x86: Add tests for instruction str and sldt

  The instructions str and sldt are not valid when running on virtual-8086
  mode and generate an invalid operand exception.
  ...

into:

  a9e017d5619e: selftests/x86: Add tests for the STR and SLDT instructions

  The STR and SLDT instructions are not valid when running on virtual-8086
  mode and generate an invalid operand exception.
  ...

I did not catch every case though.

Thanks,

	Ingo
Re: [PATCH net-next 0/3] rxrpc: Fixes
From: David Howells
Date: Sat, 11 Nov 2017 17:57:52 +

> Here are some patches that fix some things in AF_RXRPC:
>
>  (1) Prevent notifications from being passed to a kernel service for a call
>      that it has ended.
>
>  (2) Fix a null pointer dereference that occurs under some circumstances when an
>      ACK is generated.
>
>  (3) Fix a number of things to do with call expiration.
>
> The patches can be found here also:
>
>	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-next
>
> Tagged thusly:
>
>	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
>	rxrpc-next-2017

Pulled, thanks David.
Re: [RFC 1/7] x86/asm/64: Allocate and enable the SYSENTER stack
* Andy Lutomirski wrote:

> I have old patches to stop using IST for #DB and #BP, but I never finished
> them.

I'm all in favor of reviving that effort!

Thanks,

	Ingo
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
2017-11-14 15:02 GMT+08:00 Quan Xu : > > > On 2017/11/13 18:53, Juergen Gross wrote: >> >> On 13/11/17 11:06, Quan Xu wrote: >>> >>> From: Quan Xu >>> >>> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called >>> in idle path which will poll for a while before we enter the real idle >>> state. >>> >>> In virtualization, idle path includes several heavy operations >>> includes timer access(LAPIC timer or TSC deadline timer) which will >>> hurt performance especially for latency intensive workload like message >>> passing task. The cost is mainly from the vmexit which is a hardware >>> context switch between virtual machine and hypervisor. Our solution is >>> to poll for a while and do not enter real idle path if we can get the >>> schedule event during polling. >>> >>> Poll may cause the CPU waste so we adopt a smart polling mechanism to >>> reduce the useless poll. >>> >>> Signed-off-by: Yang Zhang >>> Signed-off-by: Quan Xu >>> Cc: Juergen Gross >>> Cc: Alok Kataria >>> Cc: Rusty Russell >>> Cc: Thomas Gleixner >>> Cc: Ingo Molnar >>> Cc: "H. Peter Anvin" >>> Cc: x...@kernel.org >>> Cc: virtualizat...@lists.linux-foundation.org >>> Cc: linux-kernel@vger.kernel.org >>> Cc: xen-de...@lists.xenproject.org >> >> Hmm, is the idle entry path really so critical to performance that a new >> pvops function is necessary? > > Juergen, Here is the data we get when running benchmark netperf: > 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): > 29031.6 bit/s -- 76.1 %CPU > > 2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0): > 35787.7 bit/s -- 129.4 %CPU > > 3. w/ kvm dynamic poll: > 35735.6 bit/s -- 200.0 %CPU Actually we can reduce the CPU utilization by sleeping a period of time as what has already been done in the poll logic of IO subsystem, then we can improve the algorithm in kvm instead of introduing another duplicate one in the kvm guest. Regards, Wanpeng Li > > 4. w/patch and w/ kvm dynamic poll: > 42225.3 bit/s -- 198.7 %CPU > > 5. 
idle=poll > 37081.7 bit/s -- 998.1 %CPU > > > > w/ this patch, we will improve performance by 23%.. even we could improve > performance by 45.4%, if we use w/patch and w/ kvm dynamic poll. also the > cost of CPU is much lower than 'idle=poll' case.. > >> Wouldn't a function pointer, maybe guarded >> by a static key, be enough? A further advantage would be that this would >> work on other architectures, too. > > > I assume this feature will be ported to other archs.. a new pvops makes code > clean and easy to maintain. also I tried to add it into existed pvops, but > it > doesn't match. > > > > Quan > Alibaba Cloud >> >> >> Juergen >> >
RE: [patch v2 3/8] KVM: x86: add Intel processor trace virtualization mode
> > +#define VM_EXIT_PT_SUPPRESS_PIP0x0100 > > +#define VM_EXIT_CLEAR_IA32_RTIT_CTL0x0200 > > > > #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR 0x00036dff > > > > @@ -108,6 +112,8 @@ > > #define VM_ENTRY_LOAD_IA32_PAT 0x4000 > > #define VM_ENTRY_LOAD_IA32_EFER 0x8000 > > #define VM_ENTRY_LOAD_BNDCFGS 0x0001 > > +#define VM_ENTRY_PT_SUPPRESS_PIP 0x0002 > > +#define VM_ENTRY_LOAD_IA32_RTIT_CTL0x0004 > > > Please use PT_CONCEAL instead of PT_SUPPRESS_PIP, to better match the SDM > (for both vmexit and vmentry controls). > > > + if (!enable_ept) > > + vmexit_control &= ~VM_EXIT_CLEAR_IA32_RTIT_CTL; > > + > > Why is this (and the similar bit-clear operation in vmx_vmentry_control) > needed only for !enable_ept? > > Shouldn't it be like > > if (pt_mode == PT_MODE_SYSTEM) { > vmexit_control &= ~VM_EXIT_PT_SUPPRESS_PIP; > vmexit_control &= ~VM_EXIT_CLEAR_IA32_RTIT_CTL; > } > > and > > if (pt_mode == PT_MODE_SYSTEM) { > vmentry_control &= ~VM_ENTRY_PT_SUPPRESS_PIP; > vmentry_control &= ~VM_ENTRY_LOAD_IA32_RTIT_CTL; > } > I think I have a misunderstand of " always set "use GPA for processor tracing" in secondary execution control if it can be ". "use GPA for processor tracing" can't be set in SYSTEM mode even if hardware can set this bit. Because guest will still think this a GPA address and translate by EPT. In fact, RTIT_OUTPUT_BASE will always a HPA in SYSTEM mode. Will fix in next version. Thanks, Luwei Kang
Re: Improving documentation of parent-ID field in /proc/PID/mountinfo
Hi Miklos, Ram Thanks for your comments. A question below. On 13 November 2017 at 09:11, Miklos Szeredi wrote: > On Mon, Nov 13, 2017 at 8:55 AM, Ram Pai wrote: >> On Mon, Nov 13, 2017 at 07:02:21AM +0100, Michael Kerrisk (man-pages) wrote: >>> Hello Ram, >>> >>> Long ago (2.6.29) you added the /proc/PID/mountinfo file and >>> associated documentation in Documentation/filesystems/proc.txt. Later, >>> I pasted much of that documentation into the proc(5) manual page. >>> >>> That documentation says of the second field in the file: >>> >>> [[ >>> This file contains lines of the form: >>> >>> 36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root >>> rw,errors=continue >>> (1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11) >>> >>> (1) mount ID: unique identifier of the mount (may be reused after umount) >>> (2) parent ID: ID of parent (or of self for the top of the mount tree) >>> ... >>> ]] >>> >>> The last piece of the description of field (2) doesn't seem to be >>> correct, or is at least rather unclear. I take this to be saying that >>> that for the root mount point, /, field (2) will have the same value >>> as field (1). I never actually looked at this detail closely, but >>> Alexander pointed out that this is obviously not so, as one can >>> immediately verify: >>> >>> $ grep '/ / ' /proc/$$/mountinfo >>> 65 0 8:2 / / rw,relatime shared:1 - ext4 /dev/sda2 rw,seclabel,data=order >>> >>> I dug around in the kernel source for a bit. I do not have an exact >>> handle on the details, but I can see roughly what is going on. >>> Internally, there seems to be one ("hidden") mount ID reserved to each >>> mount namespace, and that ID is the parent of the root mount point. >>> >>> Looking through the (4.14) kernel source, mount IDs are allocated by >>> mnt_alloc_id() (in fs/namespace.c), which is in turn called by >>> alloc_vfsmnt() which is in turn called by clone_mnt(). 
>>> >>> A new mount namespace is created by the kernel function copy_mnt_ns() >>> (in fs/namespace.c, called by create_new_namespaces() in >>> kernel/nsproxy.c). The copy_mnt_ns() function calls copy_tree() (in >>> fs/namespace.c), and copy_tree() calls clone_mnt() in *two* places. >>> The first of these is the call that creates the "hidden" mount ID that >>> becomes the parent of the root mount point. (I verified this by >>> instrumenting the kernel with a few printk() calls to display the >>> IDs.) The second place where copy_tree() calls clone_mnt() is in a >>> loop that replicates each of the mount points (including the root >>> mount point) in the source mount namespace. >> >> We used to report that mount, ones upon a time. Something has changed >> the behavior since then and its not reported any more, thus making it >> hidden. > > The hidden one is the initramfs, I believe. That's the root of the > mount namespace, and the when a namespace is cloned, the tree is > copied from the namespace root. > > It is "hidden" because no process has its root there. Note the > difference between namespace root and process root: the first is the > real root of the mount tree and is unchangeable, the second is > pointing to some place in a mount tree and can be changed (chroot). > > So there's nothing special in this rootfs, it is just hidden because > it's not the root of any task. > > The description of field (2) is correct, it just does not make it > clear what it means by "root". Sorry -- do you mean the old description is correct, or my new description (below)? Cheers, Michael > Thanks, > Miklos > >> >>> >>> With these details in mind, I propose to patch the man page to read as >>> below. Perhaps you have some corrections or improvements to suggest >>> for this text? >>> >>> [[ >>> (2) parent ID: the ID of the parent mount. For the root >>>mount point, the ID shown here is a hidden mount ID >>>associated with the mount namespace. 
>>> That ID is distinct from any of the IDs shown in field (1) of the
>>> records shown in the mountinfo file, and does not
>>> appear in field (1) in the mountinfo file in any other
>>> mount namespace.  (In the initial mount namespace,
>>> this hidden ID has the value 0.)
>>
>> It captures the current semantics correctly.
>>
>> RP
>>

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
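For readers who want to check fields (1) and (2) themselves, here is a minimal userspace sketch that pulls the mount ID, parent ID and mount point out of one mountinfo line. The struct and function names are made up for illustration, and the fixed 256-byte buffer for the mount point is an assumption rather than anything the kernel guarantees.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the three fields this sketch cares about. */
struct mount_ids_sketch {
	int mount_id;		/* field (1) */
	int parent_id;		/* field (2) */
	char mount_point[256];	/* field (5), truncated to fit the buffer */
};

/*
 * Parse one /proc/PID/mountinfo line.
 * Fields (3) major:minor and (4) root are skipped with %*s.
 * Returns 0 on success, -1 if the line does not match.
 */
static int parse_mountinfo_line(const char *line, struct mount_ids_sketch *out)
{
	int n = sscanf(line, "%d %d %*s %*s %255s",
		       &out->mount_id, &out->parent_id, out->mount_point);

	return n == 3 ? 0 : -1;
}
```

Fed the line quoted earlier in the thread ("65 0 8:2 / / rw,relatime ..."), this yields mount ID 65 and parent ID 0 for the mount point "/", which is the observation that started the discussion: the parent ID of the root mount is not its own ID.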
Re: linux-next: Tree for Nov 7
On Mon 13-11-17 09:35:22, Khalid Aziz wrote: > On 11/13/2017 09:06 AM, Michal Hocko wrote: > > OK, so this one should take care of the backward compatibility while > > still not touching the arch code > > --- > > commit 39ff9bf8597e79a032da0954aea1f0d77d137765 > > Author: Michal Hocko > > Date: Mon Nov 13 17:06:24 2017 +0100 > > > > mm: introduce MAP_FIXED_SAFE > > MAP_FIXED is used quite often but it is inherently dangerous because it > > unmaps an existing mapping covered by the requested range. While this > > might be might be really desidered behavior in many cases there are > > others which would rather see a failure than a silent memory > > corruption. > > Introduce a new MAP_FIXED_SAFE flag for mmap to achive this behavior. > > It is a MAP_FIXED extension with a single exception that it fails with > > ENOMEM if the requested address is already covered by an existing > > mapping. We still do rely on get_unmaped_area to handle all the arch > > specific MAP_FIXED treatment and check for a conflicting vma after it > > returns. > > Signed-off-by: Michal Hocko > > > > .. deleted ... > > diff --git a/mm/mmap.c b/mm/mmap.c > > index 680506faceae..aad8d37f0205 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -1358,6 +1358,10 @@ unsigned long do_mmap(struct file *file, unsigned > > long addr, > > if (mm->map_count > sysctl_max_map_count) > > return -ENOMEM; > > + /* force arch specific MAP_FIXED handling in get_unmapped_area */ > > + if (flags & MAP_FIXED_SAFE) > > + flags |= MAP_FIXED; > > + > > /* Obtain the address to map to. we verify (or select) it and ensure > > * that it represents a valid section of the address space. > > */ > > Do you need to move this code above: > > if (!(flags & MAP_FIXED)) > addr = round_hint_to_min(addr); > > /* Careful about overflows.. 
*/ > len = PAGE_ALIGN(len); > if (!len) > return -ENOMEM; > > Not doing that might mean the hint address will end up being rounded for > MAP_FIXED_SAFE which would change the behavior from MAP_FIXED. Yes, I will move it. -- Michal Hocko SUSE Labs
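The ordering point raised in the review can be shown with a toy model. This is a userspace sketch only: the flag values, the minimum address and all function names below are hypothetical, and round_hint_sketch() is a simplified stand-in for the kernel's hint rounding, not a copy of it.

```c
#include <assert.h>

#define PAGE_SIZE_SKETCH	4096UL
#define MIN_ADDR_SKETCH		0x10000UL	/* stand-in for the minimum mappable address */
#define MAP_FIXED_SKETCH	0x10UL		/* hypothetical flag value */
#define MAP_FIXED_SAFE_SKETCH	0x80000UL	/* hypothetical flag value */

/* Treat addr as a hint: page-align it and lift it above the minimum. */
static unsigned long round_hint_sketch(unsigned long hint)
{
	hint &= ~(PAGE_SIZE_SKETCH - 1);
	if (hint != 0 && hint < MIN_ADDR_SKETCH)
		return MIN_ADDR_SKETCH;
	return hint;
}

/* Upgrade the flag first, so the address is never treated as a hint. */
static unsigned long addr_upgrade_first(unsigned long addr, unsigned long flags)
{
	if (flags & MAP_FIXED_SAFE_SKETCH)
		flags |= MAP_FIXED_SKETCH;
	if (!(flags & MAP_FIXED_SKETCH))
		addr = round_hint_sketch(addr);
	return addr;
}

/* Upgrade the flag too late: the address has already been rounded. */
static unsigned long addr_upgrade_late(unsigned long addr, unsigned long flags)
{
	if (!(flags & MAP_FIXED_SKETCH))
		addr = round_hint_sketch(addr);
	if (flags & MAP_FIXED_SAFE_SKETCH)
		flags |= MAP_FIXED_SKETCH;
	(void)flags;	/* the upgraded flags would be used by later steps */
	return addr;
}
```

In this model a MAP_FIXED_SAFE request for 0x1234 keeps its address only when the upgrade happens before the hint check; with the late upgrade it is silently moved, which is the behavioural difference from MAP_FIXED that the review points out.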
Re: [PATCH 1/2] mm: drop migrate type checks from has_unmovable_pages
On Tue 14-11-17 06:10:00, Ran Wang wrote: [...] > > > This drop cause DWC3 USB controller fail on initialization with > > > Layerscaper processors (such as LS1043A) as below: > > > > > > [2.701437] xhci-hcd xhci-hcd.0.auto: new USB bus registered, assigned > > bus number 1 > > > [2.710949] cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -16 > > > [2.717411] xhci-hcd xhci-hcd.0.auto: can't setup: -12 > > > [2.727940] xhci-hcd xhci-hcd.0.auto: USB bus 1 deregistered > > > [2.733607] xhci-hcd: probe of xhci-hcd.0.auto failed with error -12 > > > [2.739978] xhci-hcd xhci-hcd.1.auto: xHCI Host Controller > > > > > > And I notice that someone also reported to you that DWC2 got affected > > > recently, so do you have the solution now? > > > > Yes. It should be in linux-next. Have a look at the following email > > thread: > > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Flkml. > > kernel.org%2Fr%2F20171104082500.qvzbb2kw4suo6cgy%40dhcp22.suse.cz& > > data=02%7C01%7Cran.wang_1%40nxp.com%7C5e73c6a941fc4f1c10e708d52 > > a860c5b%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C636461677 > > 583607877&sdata=zlRxJ4LZwOBsit5qRx9yFT5qfP54wZ0z6G1z%2Bcywf5g%3D > > &reserved=0 I really have no idea where the above link came from because my email had a reference to http://lkml.kernel.org/r/20171104082500.qvzbb2kw4suo6...@dhcp22.suse.cz Has your email client modified the original email? > Thanks for your info, although I fail to open the link you shared, but I got > patch > from my colleague and the issue got fix on my side, let you know, thanks. Thanks for your testing anyway. Can I assume your Tested-by? -- Michal Hocko SUSE Labs
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in idle path which will poll for a while before we enter the real idle state. In virtualization, idle path includes several heavy operations includes timer access(LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. Our solution is to poll for a while and do not enter real idle path if we can get the schedule event during polling. Poll may cause the CPU waste so we adopt a smart polling mechanism to reduce the useless poll. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Cc: Juergen Gross Cc: Alok Kataria Cc: Rusty Russell Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: virtualizat...@lists.linux-foundation.org Cc: linux-kernel@vger.kernel.org Cc: xen-de...@lists.xenproject.org Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary? Juergen, Here is the data we get when running benchmark netperf: 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 29031.6 bit/s -- 76.1 %CPU 2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0): 35787.7 bit/s -- 129.4 %CPU 3. w/ kvm dynamic poll: 35735.6 bit/s -- 200.0 %CPU 4. w/patch and w/ kvm dynamic poll: 42225.3 bit/s -- 198.7 %CPU 5. idle=poll 37081.7 bit/s -- 998.1 %CPU w/ this patch, we will improve performance by 23%.. even we could improve performance by 45.4%, if we use w/patch and w/ kvm dynamic poll. also the cost of CPU is much lower than 'idle=poll' case.. Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too. I assume this feature will be ported to other archs.. 
a new pvops makes code clean and easy to maintain. also I tried to add it into existed pvops, but it doesn't match. Quan Alibaba Cloud Juergen
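The percentages quoted above follow directly from the reported netperf figures, taking case 1 (no patch, halt_poll_ns=0) as the baseline. As a quick check of the arithmetic, using only the numbers from the mail:

```latex
\[
\frac{35787.7 - 29031.6}{29031.6} = \frac{6756.1}{29031.6} \approx 0.233
\quad\text{(case 2 vs. case 1, about 23\%)}
\]
\[
\frac{42225.3 - 29031.6}{29031.6} = \frac{13193.7}{29031.6} \approx 0.454
\quad\text{(case 4 vs. case 1, about 45.4\%)}
\]
\[
\frac{198.7}{998.1} \approx 0.20
\quad\text{(CPU use of case 4 relative to idle=poll)}
\]
```

So the 23% and 45.4% claims are consistent with the table, and case 4 uses roughly one fifth of the CPU that idle=poll does while reporting higher throughput.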
RE: [patch v2 8/8] KVM: x86: Disable intercept for Intel processor trace MSRs
> > + if (pt_mode == PT_MODE_HOST_GUEST) { > > + u32 i, eax, ebx, ecx, edx; > > + > > + cpuid_count(0x14, 1, &eax, &ebx, &ecx, &edx); > > + vmx_disable_intercept_for_msr(MSR_IA32_RTIT_STATUS, false); > > + vmx_disable_intercept_for_msr(MSR_IA32_RTIT_OUTPUT_BASE, false); > > + vmx_disable_intercept_for_msr(MSR_IA32_RTIT_OUTPUT_MASK, false); > > + vmx_disable_intercept_for_msr(MSR_IA32_RTIT_CR3_MATCH, false); > > + for (i = 0; i < (eax & 0x7); i++) > > + vmx_disable_intercept_for_msr(MSR_IA32_RTIT_ADDR0_A + i, > > + false); > > + } > > + > > As I mentioned earlier, this probably makes vmentry/vmexit too expensive when > guests are not using processor tracing. I would do it only if guest > TRACEEN=1 (since anyway the values have to be correct if guest TRACEEN=1, and > a change in TRACEEN always causes a vmexit). > Will change in next version. Thanks, Luwei Kang > > > return alloc_kvm_area(); > > > > out: > >
RE: [patch v2 7/8] KVM: x86: add Intel PT msr RTIT_CTL read/write
> > static DEFINE_PER_CPU(struct vmcs *, vmxarea); static > > DEFINE_PER_CPU(struct vmcs *, current_vmcs); @@ -3384,6 +3385,11 @@ > > static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > > return 1; > > msr_info->data = vcpu->arch.ia32_xss; > > break; > > + case MSR_IA32_RTIT_CTL: > > + if (!vmx_pt_supported()) > > + return 1; > > + msr_info->data = vmcs_read64(GUEST_IA32_RTIT_CTL); > > + break; > > case MSR_TSC_AUX: > > if (!msr_info->host_initiated && > > !guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP)) @@ -3508,6 > > +3514,11 > > @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > > else > > clear_atomic_switch_msr(vmx, MSR_IA32_XSS); > > break; > > + case MSR_IA32_RTIT_CTL: > > + if (!vmx_pt_supported() || to_vmx(vcpu)->nested.vmxon) > > + return 1; > > VMXON must also clear TraceEn bit (see 23.4 in the SDM). Will clear TraceEn bit in handle_vmon() function. > > You also need to support R/W of all the other MSRs, in order to make them > accessible to userspace, and add them in msrs_to_save and kvm_init_msr_list. > Will add it in next version. This is use for live migration, is that right? > Regarding the save/restore of the state, I am now wondering if you could also > use XSAVES and XRSTORS instead of multiple rdmsr/wrmsr in a loop. > The cost is two MSR writes on vmenter (because the guest must run with > IA32_XSS=0) and two on vmexit. > If use XSAVES and XRSTORS for context switch. 1. Before kvm_load_guest_fpu(vcpu), we need to save host RTIT_CTL, disable PT and restore the value of "vmx->pt_desc.guest.ctl" to GUEST_IA32_RTIT_CTL. Is that right? 2. After VM-exit (step out from kvm_x86_ops->run(vcpu)), we need to save the status of GUEST_IA32_RTIT_CTL . TRACEEN=0 and others MSRs are still in guest status. Where to enable PT if in host-guest mode. I think we should enable PT after vm-exit but it may cause #GP. 
" If XRSTORS would restore (or initialize) PT state and IA32_RTIT_CTL.TraceEn = 1, the instruction causes a general-protection exception. SDM 13.5.6". if enable after kvm_put_guest_fpu() I think it too late.) Thanks, Luwei Kang > > > + vmcs_write64(GUEST_IA32_RTIT_CTL, data); > > + break; > > case MSR_TSC_AUX: > > if (!msr_info->host_initiated && > > !guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP)) > >
[PATCH] lib: Avoid redundant sizeof checking in __bitmap_weight() calculation.
Currently, during __bitmap_weight() calculation hweight_long() is used.
Inside a hweight_long() a check has been made to figure out whether a
hweight32() or hweight64() version to use.

However, it's unnecessary to do it in case of __bitmap_weight() calculation
inside the loop. We can detect whether to use hweight32() or hweight64()
upfront and call respective function directly. It also reduces the
vmlinux size.

Before the patch:
    text     data      bss      dec     hex filename
12901332  7798930 14541816 35242078 219c05e vmlinux

After the patch:
    text     data      bss      dec     hex filename
12901331  7798930 14541816 35242077 219c05d vmlinux

Signed-off-by: Rakib Mullick
Cc: Andrew Morton
Cc: Rasmus Villemoes
Cc: Matthew Wilcox
Cc: Yury Norov
Cc: Mauro Carvalho Chehab
---
Patch was created against torvald's tree (commit 43ff2f4db9d0f764).

 lib/bitmap.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/lib/bitmap.c b/lib/bitmap.c
index d8f0c09..552096f 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -241,10 +241,15 @@ EXPORT_SYMBOL(__bitmap_subset);
 int __bitmap_weight(const unsigned long *bitmap, unsigned int bits)
 {
 	unsigned int k, lim = bits/BITS_PER_LONG;
-	int w = 0;
-
-	for (k = 0; k < lim; k++)
-		w += hweight_long(bitmap[k]);
+	int w = 0, is32 = sizeof(bitmap[0]) ? 1 : 0;
+
+	if (is32) {
+		for (k = 0; k < lim; k++)
+			w += hweight32(bitmap[k]);
+	} else {
+		for (k = 0; k < lim; k++)
+			w += hweight64(bitmap[k]);
+	}
 
 	if (bits % BITS_PER_LONG)
 		w += hweight_long(bitmap[k] & BITMAP_LAST_WORD_MASK(bits));
-- 
2.9.3
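One detail in the posted diff is worth spelling out: `sizeof(bitmap[0]) ? 1 : 0` is always 1, because sizeof never yields zero, so the 32-bit branch would be taken on every build and the upper half of each word would go uncounted on 64-bit. A userspace sketch of what the commit message appears to intend, comparing the word size against 4, might look like the following. It is not kernel code: the function and macro names are hypothetical, and it assumes a GCC or Clang compiler for the __builtin_popcount family in place of hweight32()/hweight64().

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG_SKETCH	((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/* Count the set bits in the first `bits` bits of `bitmap`. */
static unsigned int bitmap_weight_sketch(const unsigned long *bitmap, unsigned int bits)
{
	unsigned int k, lim = bits / BITS_PER_LONG_SKETCH;
	unsigned int rem = bits % BITS_PER_LONG_SKETCH;
	unsigned int w = 0;

	/* Pick the word-size-specific popcount once, outside the loop. */
	if (sizeof(unsigned long) == 4) {
		for (k = 0; k < lim; k++)
			w += (unsigned int)__builtin_popcount((unsigned int)bitmap[k]);
	} else {
		for (k = 0; k < lim; k++)
			w += (unsigned int)__builtin_popcountll(bitmap[k]);
	}

	/* Partial last word: mask off the bits beyond `bits`. */
	if (rem) {
		unsigned long mask = ~0UL >> (BITS_PER_LONG_SKETCH - rem);

		w += (unsigned int)__builtin_popcountl(bitmap[lim] & mask);
	}
	return w;
}
```

Since the size comparison is a compile-time constant, a compiler is free to drop the unused branch entirely, which may be why the measured size difference in the posted numbers is a single byte.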
Re: KASAN: use-after-free Read in rds_tcp_dev_event
On (11/13/17 19:30), Girish Moodalbail wrote:
> (L538-540). However, it leaves behind some of the rds_tcp connections that
> shared the same underlying RDS connection (L534 and 535). These connections
> with pointer to stale network namespace are left behind in the global list.

It leaves behind no such thing. After mprds, you want to collect
only one instance of the conn that is being removed, that's why
lines 534-535 skip over duplicate instances of the same conn
(for multiple paths in the same conn).

> When the 2nd network namespace is deleted, we will hit the above stale
> pointer and hit UAF panic.
> I think we should move away from global list to a per-namespace list. The
> global list are used only in two places (both of which are per-namespace
> operations):

Nice try, but not so.

Let me look at this tomorrow, I missed this mail in my mbox.

--Sowmini
[PATCH -next] irqchip/exiu: Fix return value check in exiu_init()
In case of error, the function of_iomap() returns NULL pointer not
ERR_PTR(). The IS_ERR() test in the return value check should be
replaced with NULL test.

Fixes: 706cffc1b912 ("irqchip/exiu: Add support for Socionext Synquacer EXIU controller")
Signed-off-by: Wei Yongjun
---
 drivers/irqchip/irq-sni-exiu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-sni-exiu.c b/drivers/irqchip/irq-sni-exiu.c
index 1b6e2f7..1927b2f 100644
--- a/drivers/irqchip/irq-sni-exiu.c
+++ b/drivers/irqchip/irq-sni-exiu.c
@@ -196,8 +196,8 @@ static int __init exiu_init(struct device_node *node,
 	}
 
 	data->base = of_iomap(node, 0);
-	if (IS_ERR(data->base)) {
-		err = PTR_ERR(data->base);
+	if (!data->base) {
+		err = -ENODEV;
 		goto out_free;
 	}
 
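Why an IS_ERR() test cannot catch a NULL return is easiest to see from how error pointers are encoded. The sketch below re-creates that encoding in userspace under the assumption that error codes live in the top 4095 values of the address space; the *_sketch names are hypothetical stand-ins, not the kernel's own macros.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO_SKETCH 4095

/* Encode a negative errno value as a pointer. */
static void *err_ptr_sketch(intptr_t error)
{
	return (void *)error;
}

/* Decode the errno value back out of an error pointer. */
static intptr_t ptr_err_sketch(const void *ptr)
{
	return (intptr_t)ptr;
}

/* True only for pointers in the last MAX_ERRNO_SKETCH values of the address space. */
static int is_err_sketch(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

/* Caller-side check for a mapping helper that reports failure as NULL. */
static int check_mapping_sketch(const void *base)
{
	if (!base)
		return -ENODEV;
	return 0;
}
```

NULL is address 0, far from the top of the address space, so is_err_sketch(NULL) is false and a failed mapping would pass an IS_ERR()-style check unnoticed. That is the case the patch closes by testing for NULL directly.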
Re: [PATCH] wcn36xx: Set BTLE coexistence related configuration values to defaults
Ramon Fried writes:

> From: Eyal Ilsar
>
> If the value for the firmware configuration parameters
> BTC_STATIC_LEN_LE_BT and BTC_STATIC_LEN_LE_WLAN are not set the duty
> cycle between BT and WLAN is such that if BT (including BLE) is active
> WLAN gets 0 bandwidth. When tuning these parameters having a too high
> value for WLAN means that BLE performance degrades. The "sweet" point
> of roughly half of the maximal values was empirically found to achieve
> a balance between BLE and Wi-Fi coexistence performance.
>
> Signed-off-by: Eyal Ilsar
> Signed-off-by: Ramon Fried

When you submit a new version of the patch, please include the
version number:

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches#patch_version_missing

So after fixing Bjorn's comments the next version should be v3.

-- 
Kalle Valo
Re: [PATCH v6 4/5] crash: export paddr_vmcoreinfo_note()
On 11/13/17 at 08:29pm, Marc-André Lureau wrote:
> The following patch is going to use the symbol from the fw_cfg module,
> to call the function and write the note location details in the
> vmcoreinfo entry, so qemu can produce dumps with the vmcoreinfo note.
>
> CC: Andrew Morton
> CC: Baoquan He
> CC: Dave Young
> CC: Dave Young
> CC: Hari Bathini
> CC: Tony Luck
> CC: Vivek Goyal
> Signed-off-by: Marc-André Lureau
> Acked-by: Gabriel Somlo

ACK

Acked-by: Baoquan He

Thanks
Baoquan

> ---
>  kernel/crash_core.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 6db80fc0810b..47541c891810 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -375,6 +375,7 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
>  {
>  	return __pa(vmcoreinfo_note);
>  }
> +EXPORT_SYMBOL(paddr_vmcoreinfo_note);
>
>  static int __init crash_save_vmcoreinfo_init(void)
>  {
> --
> 2.15.0.125.g8f49766d64
>
Re: [PATCH 4.9 00/87] 4.9.62-stable review --> crash
Ahm, it compiles well, but:

[ 24.838120] Unable to handle kernel NULL pointer dereference at virtual address 0055
[ 24.846193] pgd = c0004000
[ 24.848893] [0055] *pgd=
[ 24.852472] Internal error: Oops - BUG: 817 [#1] PREEMPT SMP ARM
[ 24.858463] Modules linked in: xhci_plat_hcd xhci_pci xhci_hcd ohci_hcd ehci_pci ehci_platform ehci_hcd usbcore usb_common nls_base qca_ssdk gpio_pca953x mii_gpio wil6210 ath10k_pci ath10k_core ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 compat
[ 24.880852] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.9.62-rc1 #90
[ 24.887189] Hardware name: AnnapurnaLabs Alpine (Device Tree)
[ 24.892921] task: ef029ac0 task.stack: ef05a000
[ 24.897444] PC is at nf_nat_cleanup_conntrack+0x4c/0x74
[ 24.902657] LR is at nf_nat_cleanup_conntrack+0x38/0x74
[ 24.907869] pc : [] lr : [] psr: 6153
[ 24.907869] sp : ef05bb58 ip : ef05bb58 fp : ef05bb6c
[ 24.919317] r10: ed230cc0 r9 : ed230c00 r8 : edf45800
[ 24.924529] r7 : ebcd2f00 r6 : ec33739e r5 : c0892294 r4 : ebcd2f00
[ 24.931040] r3 : r2 : 0055 r1 : r0 : c0892718
[ 24.937551] Flags: nZCv IRQs on FIQs off Mode SVC_32 ISA ARM Segment user
[ 24.944755] Control: 10c5387d Table: 2bd1006a DAC: 0055
[ 24.950486] Process swapper/2 (pid: 0, stack limit = 0xef05a210)
[ 24.956477] Stack: (0xef05bb58 to 0xef05c000)

I will dig into the code to find the reason.

On 13.11.2017 at 13:55, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 4.9.62 release.
There are 87 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.

Responses should be made by Wed Nov 15 12:55:40 UTC 2017.
Anything received after that time might be too late.

The whole patch series can be found in one patch at:
	kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.62-rc1.gz
or in the git tree and branch at:
	git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.9.y
and the diffstat can be found below.
thanks, greg k-h - Pseudo-Shortlog of commits: Greg Kroah-Hartman Linux 4.9.62-rc1 Borislav Petkov x86/oprofile/ppro: Do not use __this_cpu*() in preemptible context Pavel Tatashin x86/smpboot: Make optimization of delay calibration work correctly Florian Westphal netfilter: nat: Revert "netfilter: nat: convert nat bysrc hash to rhashtable" Richard Schütz can: c_can: don't indicate triple sampling support for D_CAN Marek Vasut can: ifi: Fix transmitter delay calculation Gerhard Bertelsmann can: sun4i: handle overrun in RX FIFO John Stultz drm/bridge: adv7511: Re-write the i2c address before EDID probing John Stultz drm/bridge: adv7511: Reuse __adv7511_power_on/off() when probing EDID John Stultz drm/bridge: adv7511: Rework adv7511_power_on/off() so they can be reused internally Sinclair Yeh drm/vmwgfx: Fix Ubuntu 17.10 Wayland black screen issue Ilya Dryomov rbd: use GFP_NOIO for parent stat and data requests Kai-Heng Feng Input: elan_i2c - add ELAN060C to the ACPI table Oswald Buddenhagen MIPS: AR7: Ensure that serial ports are properly set up Jonas Gorski MIPS: AR7: Defer registration of GPIO Jaedon Shin MIPS: BMIPS: Fix missing cbr address Marcus Cooper ASoC: sun4i-spdif: remove legacy dapm components Luis R. Rodriguez tools: firmware: check for distro fallback udev cancel rule Luis R. Rodriguez selftests: firmware: send expected errors to /dev/null Matt Redfearn MIPS: SMP: Fix deadlock & online race Matija Glavinic Pecotic MIPS: Fix race on setting and getting cpu_online_mask Matt Redfearn MIPS: SMP: Use a completion event to signal CPU up Paul Burton MIPS: Fix CM region target definitions Gustavo A. R. 
Silva MIPS: microMIPS: Fix incorrect mask in insn_table_MM Maarten Lankhorst drm/i915: Do not rely on wm preservation for ILK watermarks Takashi Iwai ALSA: seq: Avoid invalid lockdep class warning Takashi Iwai ALSA: seq: Fix OSS sysex delivery in OSS emulation Mark Rutland ARM: 8720/1: ensure dump_instr() checks addr_limit Eric Biggers KEYS: fix NULL pointer dereference during ASN.1 parsing [ver #2] Andrey Ryabinin crypto: x86/sha256-mb - fix panic due to unaligned access Andrey Ryabinin crypto: x86/sha1-mb - fix panic due to unaligned access Romain Izard crypto: ccm - preserve the IV buffer Li Bin workqueue: Fix NULL pointer dereference Peter Zijlstra x86/uaccess, sched/preempt: Verify access_ok() context Carlo Caione platform/x86: hp-wmi: Do not shadow error values Carlo Caione platform/x86: hp-wmi: Fix error value for hp_wmi_tablet_state Eric Biggers KEYS: trusted: fix writing past end of buffer in trusted_read() Eric Biggers KEYS: tr
Re: [PATCH 1/2] arm64: mm: abort uaccess retries upon fatal signal
On Tue, Aug 22, 2017 at 10:45:24AM +0100, Will Deacon wrote: > On Mon, Aug 21, 2017 at 02:42:03PM +0100, Mark Rutland wrote: > > On Tue, Jul 11, 2017 at 03:58:49PM +0100, Will Deacon wrote: > > > On Tue, Jul 11, 2017 at 03:19:22PM +0100, Mark Rutland wrote: > > > > When there's a fatal signal pending, arm64's do_page_fault() > > > > implementation returns 0. The intent is that we'll return to the > > > > faulting userspace instruction, delivering the signal on the way. > > > > > > > > However, if we take a fatal signal during fixing up a uaccess, this > > > > results in a return to the faulting kernel instruction, which will be > > > > instantly retried, resulting in the same fault being taken forever. As > > > > the task never reaches userspace, the signal is not delivered, and the > > > > task is left unkillable. While the task is stuck in this state, it can > > > > inhibit the forward progress of the system. > > > > > > > > To avoid this, we must ensure that when a fatal signal is pending, we > > > > apply any necessary fixup for a faulting kernel instruction. Thus we > > > > will return to an error path, and it is up to that code to make forward > > > > progress towards delivering the fatal signal. > > > > > > > > Signed-off-by: Mark Rutland > > > > Reviewed-by: Steve Capper > > > > Tested-by: Steve Capper > > > > Cc: Catalin Marinas > > > > Cc: James Morse > > > > Cc: Laura Abbott > > > > Cc: Will Deacon > > > > Cc: sta...@vger.kernel.org > > > > --- > > > > arch/arm64/mm/fault.c | 5 - > > > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c > > > > index 37b95df..3952d5e 100644 > > > > --- a/arch/arm64/mm/fault.c > > > > +++ b/arch/arm64/mm/fault.c > > > > @@ -397,8 +397,11 @@ static int __kprobes do_page_fault(unsigned long > > > > addr, unsigned int esr, > > > > * signal first. 
We do not need to release the mmap_sem because > > > > it > > > > * would already be released in __lock_page_or_retry in > > > > mm/filemap.c. > > > > */ > > > > - if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) > > > > + if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) { > > > > + if (!user_mode(regs)) > > > > + goto no_context; > > > > return 0; > > > > + } > > > > > > This will need rebasing at -rc1 (take a look at current HEAD). > > > > > > Also, I think it introduces a weird corner case where we take a page fault > > > when writing the signal frame to the user stack to deliver a SIGSEGV. If > > > we end up with VM_FAULT_RETRY and somebody has sent a SIGKILL to the task, > > > then we'll fail setup_sigframe and force an un-handleable SIGSEGV instead > > > of SIGKILL. > > > > > > The end result (task is killed) is the same, but the fatal signal is > > > wrong. > > > > That doesn't seem to be the case, testing on v4.13-rc5. > > > > I used sigaltstack() to use the userfaultfd region as signal stack, > > registered a SIGSEGV handler, and dereferenced NULL. The task locks up, > > but when killed with a SIGINT or SIGKILL, the exit status reflects that > > signal, rather than the SIGSEGV. > > > > If I move the SIGINT handler onto the userfaultfd-monitored stack, then > > delivering SIGINT hangs, but can be killed with SIGKILL, and the exit > > status reflects that SIGKILL. > > > > As you say, it does look like we'd try to set up a deferred SIGSEGV for > > the failed signal delivery. > > > > I haven't yet figured out exactly how that works; I'll keep digging. > > The SEGV makes it all the way into do_group_exit, but then signal_group_exit > is set and the exit_code is overridden with SIGKILL at the last minute (see > complete_signal). Unfortunately, this last minute is too late for print-fatal-signals. 
With print-fatal-signals enabled, this patch leads to misleading "potentially unexpected fatal signal 11" warnings if a process is SIGKILL'd at the right time. I've seen it without userfaultfd, but it's easiest reproduced by patching Mark's original test code [1] with the following patch and simply running "pkill -WINCH foo; pkill -KILL foo". This results in: foo: potentially unexpected fatal signal 11. CPU: 1 PID: 1793 Comm: foo Not tainted 4.9.58-devel #3 task: b3534780 task.stack: b4b7c000 PC is at 0x76effb60 LR is at 0x4227f4 pc : [<76effb60>]lr : [<004227f4>]psr: 600b0010 sp : 7eaf7bb4 ip : fp : r10: 0001 r9 : 0003 r8 : 76fcd000 r7 : 001d r6 : 76fd0cf0 r5 : 7eaf7c08 r4 : r3 : r2 : r1 : 7eaf7a88 r0 : fffc Flags: nZCv IRQs on FIQs on Mode USER_32 ISA ARM Segment user Control: 10c5387d Table: 3357404a DAC: 0055 CPU: 1 PID: 1793 Comm: foo Not tainted 4.9.58-devel #3 [<801113f0>] (unwind_backtrace) from [<8010cfb0>] (show_stack+0x18/0x1c) [<8010cfb0>] (show_stack) from [<8039725c>] (dump_stack+0x84/0
Re: [PATCH v2] coccinelle: fix parallel build with CHECK=scripts/coccicheck
On Tue, 14 Nov 2017, Masahiro Yamada wrote: > Hi Julia, > > 2017-11-14 1:45 GMT+09:00 Julia Lawall : > > > > > > On Tue, 14 Nov 2017, Masahiro Yamada wrote: > > > >> Hi Julia, > >> > >> > >> 2017-11-14 0:30 GMT+09:00 Julia Lawall : > >> > > >> > > >> > On Thu, 9 Nov 2017, Masahiro Yamada wrote: > >> > > >> >> The command "make -j8 C=1 CHECK=scripts/coccicheck" produces lots of > >> >> "coccicheck failed" error messages. > >> >> > >> >> I do not know the coccinelle internals, but I guess --jobs does not > >> >> work well if spatch is invoked from Make running in parallel. > >> >> Disable --jobs in this case. > >> > > >> > Why is this change under: > >> > > >> > if [ "$C" = "1" -o "$C" = "2" ]; > >> > > >> > The coccicheck failed messages come also if one runs Coccinelle on the > >> > entire kernel. > >> > >> As far as I tested, "coccicheck failed" error only happens > >> when ONLINE=1. > >> > >> > >> make -j8 C=1 CHECK=scripts/coccicheck > >> COCCI=scripts/coccinelle/misc/bugon.cocci > >> > >> emits lots of errors. > >> > >> > >> make -j8 coccicheck COCCI=scripts/coccinelle/misc/bugon.cocci > >> > >> is fine. > >> > >> > >> Have you tested it? > >> Do you mean you got a different result from mine? > > > > I agree with your results, with respect to the number of errors. > > > > julia > > > > So, what shall we do? > > If you do not like to fix it (or you can fix coccinelle itself), > I can take back this patch. I'm OK with your fix. I will check and ack it today. > I am not a coccinelle developer, so > setting USE_JOBS="no" is the best I can do. The problem on the Coccinelle side is that it uses a subdirectory with the name of the semantic patch to store standard output and standard error for the different threads. I didn't want to use a name with the pid, so that one could easily find this information while Coccinelle is running. Normally the subdirectory is cleaned up when Coccinelle completes, so there is only one of them at a time. 
Maybe it is best to just add the pid. There is the risk that these subdirectories will accumulate if Coccinelle crashes in a way such that they don't get cleaned up, but Coccinelle could print a warning if it detects this case, rather than failing. Still I think it is useful to do something on the make coccicheck side, because there is no need for the double layer of parallelism. julia
Re: [PATCH v6 4/5] crash: export paddr_vmcoreinfo_note()
On 11/13/17 at 08:29pm, Marc-André Lureau wrote: > The following patch is going to use the symbol from the fw_cfg module, > to call the function and write the note location details in the > vmcoreinfo entry, so qemu can produce dumps with the vmcoreinfo note. > > CC: Andrew Morton > CC: Baoquan He > CC: Dave Young > CC: Dave Young > CC: Hari Bathini > CC: Tony Luck > CC: Vivek Goyal > Signed-off-by: Marc-André Lureau > Acked-by: Gabriel Somlo > --- > kernel/crash_core.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/kernel/crash_core.c b/kernel/crash_core.c > index 6db80fc0810b..47541c891810 100644 > --- a/kernel/crash_core.c > +++ b/kernel/crash_core.c > @@ -375,6 +375,7 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void) > { > return __pa(vmcoreinfo_note); > } > +EXPORT_SYMBOL(paddr_vmcoreinfo_note); > > static int __init crash_save_vmcoreinfo_init(void) > { > -- > 2.15.0.125.g8f49766d64 > Acked-by: Dave Young Thanks Dave
Re: n900 in next-20170901
On Mon, Nov 13, 2017 at 01:15:30PM -0800, Tony Lindgren wrote: > * Tony Lindgren [171110 07:36]: > > * Joonsoo Kim [171110 06:34]: > > > On Thu, Nov 09, 2017 at 07:26:10PM -0800, Tony Lindgren wrote: > > > > +#define OMAP34XX_SRAM_PHYS 0x4020 > > > > +#define OMAP34XX_SRAM_VIRT 0xd001 > > > > +#define OMAP34XX_SRAM_SIZE 0x1 > > > > > > For my testing environment, vmalloc address space is started at > > > roughly 0xe000 so 0xd001 would not be valid. > > > > Well we can map it anywhere we want, got any preferences? > > Hmm and I'm also wondering what you do to make vmalloc space to > start at 0xe000 instead of 0xd000? Please see the other reply. > > The reason I'm asking is because I think we can just move all of > save_secure_ram_context to run from DDR instead of SRAM. But I'd > rather do a minimal patch first that fixes your series and then we > can test the further changes with more time. Okay. I agree: let's make a minimal patch first and then take the next step. > After moving save_secure_ram_context to DDR, we can call > save_secure_ram_context directly with something like: > > args_pa = __pa(omap3_secure_ram_storage); > offset = (unsigned long)omap3_secure_ram_storage - args_pa; > ret = save_secure_ram_context(args_pa, offset); > > > Just that the current save_secure_ram_context uses "high_mask" > > of 0x to translate the address. To make this more flexible, > > we need the save_secure_ram_context changes too. So we might > > want to do the static mapping and save_secure_ram_context changes > > as a single patch. > > > > > And, PHYS can be different according to the system type. Maybe either > > > OMAP3_SRAM_PUB_PA or OMAP3_SRAM_PA. It seems that SIZE and TYPE should > > > be considered, too. My understanding is correct? > > > > We can have a static map for the whole SRAM area, see function > > __arm_ioremap_pfn_caller() for the comment "Try to reuse one of the > > static mapping whenever possible". 
So the different public SRAM start > > addresses and sizes don't matter there. > > And then if save_secure_ram_context runs in DDR, no static map is > needed. Okay. Thanks.
Re: Fwd: linux v4.14 causes firmware iwlwifi errors on Lenovo Thinkpad T440s
On Mon, 2017-11-13 at 16:23 -0600, Larry Finger wrote: > On 11/13/2017 03:30 PM, Bartosz Golaszewski wrote: > > 2017-11-13 21:45 GMT+01:00 Larry Finger > > : > > > On 11/13/2017 02:22 PM, Bartosz Golaszewski wrote: > > > > > > > > Forwarding here too as I messed up the address the last time. > > > > -- > > > > > > > > Hi, > > > > > > > > I noticed my wireless interface can't get up with linux v4.14 > > > > and the > > > > kernel log is flooded with firmware errors: > > > > > > > > iwlwifi :03:00.0: Firmware error during reconfiguration - > > > > reprobe! > > > > iwlwifi :03:00.0: FW error in SYNC CMD DQA_ENABLE_CMD > > > > > > > > and > > > > > > > > ieee80211 phy63: Hardware restart was requested. > > > > > > > > The wireless controller is: Intel Corporation Wireless 7260 > > > > (rev 83) > > > > Firmware used is: iwlwifi-7260-17 > > > > > > > > Everything works fine with v4.13.12. > > > > > > > > I didn't have time today to bisect for the offending commit. > > > > Full log > > > > uploaded[1]. > > > > > > > > Best regards, > > > > Bartosz Golaszewski > > > > > > > > [1] https://pastebin.com/jksqxvS6 > > > > > > > > > Your log shows "iwlwifi :03:00.0: loaded firmware version > > > 17.228510.0 > > > op_mode iwlmvm" > > > > > > Mine, where the 7260 works, shows "iwlwifi :04:00.0: loaded > > > firmware > > > version 17.459231.0 op_mode iwlmvm". > > > > > > It seems as if you need newer firmware. A detailed file listing > > > shows > > > "-rw-r--r-- 1 root root 1049340 Oct 9 12:03 > > > /lib/firmware/iwlwifi-7260-17.ucode". That date is likely when I > > > installed > > > the updated kernel firmware package from my distro. The md5sum > > > for the file > > > is 73a217f55c47d3a70bb5dbbe1d676423. > > > > > > Larry > > > > > > > Ok so it seems the version in linux-firmware is outdated. The file > > you're using is available on github[1] and fixed the issue for me. > > > > Thanks! 
> > Bartosz Golaszewski > > > > [1] https://github.com/OpenELEC/iwlwifi-firmware > > Interesting. Using md5sum of the git repo for linux-firmware gets > 73a217f55c47d3a70bb5dbbe1d676423 iwlwifi-7260-17.ucode. > > That is the file I'm using. You shouldn't use firmwares from github or anywhere else except from the two official places where we release it: Mainline (but slow to get updated): https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/iwlwifi-7260-17.ucode Our official public tree: https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/linux-firmware.git/plain/iwlwifi-7260-17.ucode And, of course, your distro may distribute these in an official package as well, but it's good to check the versions to be sure you're running the latest one we released. -- Cheers, Luca.
Re: n900 in next-20170901
On Fri, Nov 10, 2017 at 07:36:20AM -0800, Tony Lindgren wrote: > * Joonsoo Kim [171110 06:34]: > > On Thu, Nov 09, 2017 at 07:26:10PM -0800, Tony Lindgren wrote: > > > +#define OMAP34XX_SRAM_PHYS 0x4020 > > > +#define OMAP34XX_SRAM_VIRT 0xd001 > > > +#define OMAP34XX_SRAM_SIZE 0x1 > > > > For my testing environment, vmalloc address space is started at > > roughly 0xe000 so 0xd001 would not be valid. > > Well we can map it anywhere we want, got any preferences? My testing environment is a beagle-(xm?) for QEMU. It is configured by CONFIG_VMSPLIT_3G=y so kernel address space is started at 0xc000. And, it has 512 MB memory so 0xc000 ~ 0xdff0 is used for direct mapping. See below. [0.00] Memory: 429504K/522240K available (11264K kernel code, 1562K rwdata, 4288K rodata, 2048K init, 405K bss, 27200K reserved, 65536K cma-reserved, 0K highmem) [0.00] Virtual kernel memory layout: [0.00] vector : 0x - 0x1000 ( 4 kB) [0.00] fixmap : 0xffc0 - 0xfff0 (3072 kB) [0.00] vmalloc : 0xe000 - 0xff80 ( 504 MB) [0.00] lowmem : 0xc000 - 0xdff0 ( 511 MB) [0.00] pkmap : 0xbfe0 - 0xc000 ( 2 MB) [0.00] modules : 0xbf00 - 0xbfe0 ( 14 MB) [0.00] .text : 0xc0208000 - 0xc0e0 (12256 kB) [0.00] .init : 0xc130 - 0xc150 (2048 kB) [0.00] .data : 0xc150 - 0xc1686810 (1563 kB) [0.00].bss : 0xc168fc68 - 0xc16f512c ( 406 kB) Therefore, if OMAP34XX_SRAM_VIRT is 0xd001, direct mapping is broken and the system doesn't work. I guess that we should use more stable address like as 0xf000. > > Just that the current save_secure_ram_context uses "high_mask" > of 0x to translate the address. To make this more flexible, > we need the save_secure_ram_context changes too. So we might > want to do the static mapping and save_secure_ram_context changes > as a single patch. > > > And, PHYS can be different according to the system type. Maybe either > > OMAP3_SRAM_PUB_PA or OMAP3_SRAM_PA. It seems that SIZE and TYPE should > > be considered, too. My understanding is correct? 
> > We can have a static map for the whole SRAM area, see function > __arm_ioremap_pfn_caller() for the comment "Try to reuse one of the > static mapping whenever possible". So the different public SRAM start > addresses and sizes don't matter there. Okay. Look fine with SRAM start addresses and sizes. However, we need to consider mtype since __arm_ioremap_pfn_caller() doesn't reuse the mapping if mtype is different. mtype can be either MT_MEMORY_RWX or MT_MEMORY_RWX_NONCACHED. Thanks.
[RESEND PATCH v2 2/4] x86/umip: Inform that UMIP has been enabled
Let us have an indication that this feature has been enabled. Cc: Andy Lutomirski Cc: H. Peter Anvin Cc: Borislav Petkov Cc: Tony Luck Cc: Paolo Bonzini Cc: Ravi V. Shankar Cc: x...@kernel.org Suggested-by: Ingo Molnar Signed-off-by: Ricardo Neri --- arch/x86/kernel/cpu/common.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 13ae9e5..fa998ca 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -341,6 +341,8 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c) cr4_set_bits(X86_CR4_UMIP); + pr_info("x86/cpu: Activated the Intel User Mode Instruction Prevention (UMIP) CPU feature\n"); + return; out: -- 2.7.4
[PATCH] usb: quirks: Add no-lpm quirk for KY-688 USB 3.1 Type-C Hub
KY-688 USB 3.1 Type-C Hub internally uses a Genesys Logic hub to connect to Realtek r8153. Similar to commit ("7496cfe5431f2 usb: quirks: Add no-lpm quirk for Moshi USB to Ethernet Adapter"), no-lpm can make r8153 ethernet work. Signed-off-by: Kai-Heng Feng --- drivers/usb/core/quirks.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/usb/core/quirks.c b/drivers/usb/core/quirks.c index a6aaf2f193a4..12246da8fcf6 100644 --- a/drivers/usb/core/quirks.c +++ b/drivers/usb/core/quirks.c @@ -151,6 +151,9 @@ static const struct usb_device_id usb_quirk_list[] = { /* appletouch */ { USB_DEVICE(0x05ac, 0x021a), .driver_info = USB_QUIRK_RESET_RESUME }, + /* Genesys Logic hub, internally used by KY-688 USB 3.1 Type-C Hub */ + { USB_DEVICE(0x05e3, 0x0612), .driver_info = USB_QUIRK_NO_LPM }, + /* Genesys Logic hub, internally used by Moshi USB to Ethernet Adapter */ { USB_DEVICE(0x05e3, 0x0616), .driver_info = USB_QUIRK_NO_LPM }, -- 2.14.1
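For readers unfamiliar with the quirk list, the idea behind the new entry can be modeled as a plain vendor/product lookup. This is only a userspace sketch: the struct, the lookup_quirks() helper and the flag values are made up for illustration and are not the kernel's definitions; only the three device IDs come from the hunk above.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder flag values for this sketch, NOT the kernel's bit assignments. */
#define QUIRK_RESET_RESUME (1u << 0)
#define QUIRK_NO_LPM       (1u << 1)

struct quirk_entry {
    uint16_t vendor;
    uint16_t product;
    uint32_t flags;
};

/* The three entries visible in the hunk above. */
static const struct quirk_entry quirk_list[] = {
    { 0x05ac, 0x021a, QUIRK_RESET_RESUME }, /* appletouch */
    { 0x05e3, 0x0612, QUIRK_NO_LPM },       /* Genesys Logic hub (KY-688) */
    { 0x05e3, 0x0616, QUIRK_NO_LPM },       /* Genesys Logic hub (Moshi) */
};

/* Hypothetical helper: return the quirk flags for a device, 0 if none. */
static uint32_t lookup_quirks(uint16_t vendor, uint16_t product)
{
    size_t i;

    for (i = 0; i < sizeof(quirk_list) / sizeof(quirk_list[0]); i++) {
        if (quirk_list[i].vendor == vendor &&
            quirk_list[i].product == product)
            return quirk_list[i].flags;
    }
    return 0;
}
```

With this model, lookup_quirks(0x05e3, 0x0612) reports the no-LPM flag only once the new entry is present, which mirrors what the patch does for the hub inside the KY-688.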
[RESEND PATCH v2 3/4] x86/umip: Identify the str and sldt instructions
The instructions STR and SLDT are not emulated in any case. Thus, it made sense to not implement functionality to identify them. However, a subsequent commit will introduce functionality to warn about the use of all the instructions that UMIP protects, not only those that are emulated. A first step for that is the ability to identify them. Plus, now that STR and SLDT are identified, we need to explicitly avoid their emulation (i.e., not rely on unsuccessful identification). Group together all the cases that we do not want to emulate: STR, SLDT and user long mode processes. Cc: Andy Lutomirski Cc: H. Peter Anvin Cc: Borislav Petkov Cc: Tony Luck Cc: Paolo Bonzini Cc: Ravi V. Shankar Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- This patch also corrects the #define of SMSW. This change does not have a functional impact as it is only used as an identifier. --- arch/x86/kernel/umip.c | 25 + 1 file changed, 17 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c index 6ba82be..2e09b5b 100644 --- a/arch/x86/kernel/umip.c +++ b/arch/x86/kernel/umip.c @@ -78,7 +78,9 @@ #define UMIP_INST_SGDT 0 /* 0F 01 /0 */ #define UMIP_INST_SIDT 1 /* 0F 01 /1 */ -#define UMIP_INST_SMSW 3 /* 0F 01 /4 */ +#define UMIP_INST_SMSW 2 /* 0F 01 /4 */ +#define UMIP_INST_SLDT 3 /* 0F 00 /0 */ +#define UMIP_INST_STR 4 /* 0F 00 /1 */ /** * identify_insn() - Identify a UMIP-protected instruction @@ -118,10 +120,16 @@ static int identify_insn(struct insn *insn) default: return -EINVAL; } + } else if (insn->opcode.bytes[1] == 0x0) { + if (X86_MODRM_REG(insn->modrm.value) == 0) + return UMIP_INST_SLDT; + else if (X86_MODRM_REG(insn->modrm.value) == 1) + return UMIP_INST_STR; + else + return -EINVAL; + } else { + return -EINVAL; } - - /* SLDT AND STR are not emulated */ - return -EINVAL; } /** @@ -267,10 +275,6 @@ bool fixup_umip_exception(struct pt_regs *regs) if (!regs) return false; - /* Do not emulate 64-bit processes. 
*/ - if (user_64bit_mode(regs)) - return false; - /* * If not in user-space long mode, a custom code segment could be in * use. This is true in protected mode (if the process defined a local @@ -322,6 +326,11 @@ bool fixup_umip_exception(struct pt_regs *regs) if (umip_inst < 0) return false; + /* Do not emulate SLDT, STR or user long mode processes. */ + if (umip_inst == UMIP_INST_STR || umip_inst == UMIP_INST_SLDT || + user_64bit_mode(regs)) + return false; + if (emulate_umip_insn(&insn, umip_inst, dummy_data, &dummy_data_size)) return false; -- 2.7.4
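The identification step in this patch can be tried outside the kernel. What follows is a minimal userspace sketch, not the kernel function: modrm_reg() and identify_umip_insn() are hypothetical names, and -1 stands in for -EINVAL. The opcode bytes, the ModRM /digit values and the identifier numbers are taken from the #defines in the diff above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Identifier values as defined by the patch above. */
#define UMIP_INST_SGDT 0 /* 0F 01 /0 */
#define UMIP_INST_SIDT 1 /* 0F 01 /1 */
#define UMIP_INST_SMSW 2 /* 0F 01 /4 */
#define UMIP_INST_SLDT 3 /* 0F 00 /0 */
#define UMIP_INST_STR  4 /* 0F 00 /1 */

/* The "/digit" in the opcode notation is the reg field, bits 5:3 of ModRM. */
static unsigned int modrm_reg(uint8_t modrm)
{
    return (modrm >> 3) & 0x7;
}

/*
 * Hypothetical helper: map a two-byte opcode plus ModRM byte to one of the
 * identifiers above, or return -1 if it is not a UMIP-protected instruction.
 */
static int identify_umip_insn(const uint8_t opcode[2], uint8_t modrm)
{
    assert(opcode != NULL);

    if (opcode[0] != 0x0f)
        return -1;

    if (opcode[1] == 0x01) {
        switch (modrm_reg(modrm)) {
        case 0: return UMIP_INST_SGDT;
        case 1: return UMIP_INST_SIDT;
        case 4: return UMIP_INST_SMSW;
        default: return -1;
        }
    }

    if (opcode[1] == 0x00) {
        switch (modrm_reg(modrm)) {
        case 0: return UMIP_INST_SLDT;
        case 1: return UMIP_INST_STR;
        default: return -1;
        }
    }

    return -1;
}
```

For example, the bytes 0F 00 with a ModRM of 0x08 (reg field 1) identify as STR, which the patch then refuses to emulate.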
[RESEND PATCH v2 4/4] x86/umip: Warn if UMIP-protected instructions are used
Issue a rate-limited warning whenever any of the instructions that UMIP protects (i.e., SGDT, SIDT, SLDT, STR and SMSW) are used by user space programs. This is useful because, with UMIP enabled, the few programs that use such instructions will start receiving a SIGSEGV signal. In the specific cases for which emulation is provided (instructions SGDT, SIDT and SMSW in protected and virtual-8086 modes), a warning is also helpful to encourage updates in such programs to avoid the use of such instructions. An existing rate-limited pr_err() is converted to use the new function umip_pr_warn() in order to have it printing at the same rate and log level. Cc: Andy Lutomirski Cc: H. Peter Anvin Cc: Borislav Petkov Cc: Tony Luck Cc: Paolo Bonzini Cc: Ravi V. Shankar Cc: x...@kernel.org Suggested-by: Linus Torvalds Signed-off-by: Ricardo Neri --- arch/x86/kernel/umip.c | 65 +- 1 file changed, 59 insertions(+), 6 deletions(-) diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c index 2e09b5b..50f4b11 100644 --- a/arch/x86/kernel/umip.c +++ b/arch/x86/kernel/umip.c @@ -82,6 +82,54 @@ #define UMIP_INST_SLDT 3 /* 0F 00 /0 */ #define UMIP_INST_STR 4 /* 0F 00 /1 */ +const char * const umip_insns[5] = { + [UMIP_INST_SGDT] = "sgdt", + [UMIP_INST_SIDT] = "sidt", + [UMIP_INST_SMSW] = "smsw", + [UMIP_INST_SLDT] = "sldt", + [UMIP_INST_STR] = "str", +}; + +/* + * If you change these strings, ensure that buffers using them are sufficiently + * large. 
+ */ +static const char umip_warn_use[] = "cannot be used by applications."; +static const char umip_warn_emu[] = "For now, expensive software emulation returns result."; + +/** + * umip_pr_warn() - Print a rate-limited warning + * @regs: Register set with the context in which the warning is printed + * @msg: Pointer to a string with the warning message + * @error: Error code to print along with the warning + * + * Print the message contained in @msg along with the task name, ID number and + * instruction and stack pointers of the associated process. Optionally, an + * error code is printed if @error is not zero. These warning messages are + * limited to a burst of 5 messages every two minutes. + * + * Returns: + * + * None. + */ +static void umip_pr_warn(struct pt_regs *regs, char *msg, long error) +{ + struct task_struct *tsk = current; + char err_str[8 + BITS_PER_LONG / 4] = ""; + + /* Bursts of 5 messages every two minutes */ + static DEFINE_RATELIMIT_STATE(ratelimit, 2 * 60 * HZ, 5); + + if (!__ratelimit(&ratelimit)) + return; + + if (error) + snprintf(err_str, sizeof(err_str), " error:%lx", error); + + pr_warn("%s[%d] %s ip:%lx sp:%lx%s\n", tsk->comm, task_pid_nr(tsk), msg, + regs->ip, regs->sp, err_str); +} + /** * identify_insn() - Identify a UMIP-protected instruction * @insn: Instruction structure with opcode and ModRM byte. 
@@ -236,10 +284,7 @@ static void force_sig_info_umip_fault(void __user *addr, struct pt_regs *regs) if (!(show_unhandled_signals && unhandled_signal(tsk, SIGSEGV))) return; - pr_err_ratelimited("%s[%d] umip emulation segfault ip:%lx sp:%lx error:%x in %lx\n", - tsk->comm, task_pid_nr(tsk), regs->ip, - regs->sp, X86_PF_USER | X86_PF_WRITE, - regs->ip); + umip_pr_warn(regs, "segfault at", X86_PF_USER | X86_PF_WRITE); } /** @@ -264,10 +309,10 @@ static void force_sig_info_umip_fault(void __user *addr, struct pt_regs *regs) bool fixup_umip_exception(struct pt_regs *regs) { int not_copied, nr_copied, reg_offset, dummy_data_size, umip_inst; + unsigned char buf[MAX_INSN_SIZE], warn[128]; unsigned long seg_base = 0, *reg_addr; /* 10 bytes is the maximum size of the result of UMIP instructions */ unsigned char dummy_data[10] = { 0 }; - unsigned char buf[MAX_INSN_SIZE]; void __user *uaddr; struct insn insn; char seg_defs; @@ -326,10 +371,18 @@ bool fixup_umip_exception(struct pt_regs *regs) if (umip_inst < 0) return false; + snprintf(warn, sizeof(warn), "%s %s", umip_insns[umip_inst], +umip_warn_use); + /* Do not emulate SLDT, STR or user long mode processes. */ if (umip_inst == UMIP_INST_STR || umip_inst == UMIP_INST_SLDT || - user_64bit_mode(regs)) + user_64bit_mode(regs)) { + umip_pr_warn(regs, warn, 0); return false; + } + + snprintf(warn, sizeof(warn), "%s %s", warn, umip_warn_emu); + umip_pr_warn(regs, warn, 0); if (emulate_umip_insn(&insn, umip_inst, dummy_data, &dummy_data_size)) return false; -- 2.7.4
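One detail in the last hunk may deserve a second look: the second snprintf() passes warn both as the destination and as a "%s" argument. The C standard leaves the behavior undefined when copying takes place between overlapping objects, so the result depends on the implementation. A possible alternative, sketched here in userspace under the assumption that appending at an offset is acceptable, is shown below; build_umip_warning() is a hypothetical helper and not part of the patch, while the two message strings are copied from it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Strings copied from the patch above. */
static const char umip_warn_use[] = "cannot be used by applications.";
static const char umip_warn_emu[] =
    "For now, expensive software emulation returns result.";

/*
 * Hypothetical helper: build the warning text in buf without ever reading
 * from buf while writing to it. Returns the message length, or -1 if the
 * buffer is too small or formatting fails.
 */
static int build_umip_warning(char *buf, size_t len, const char *insn,
                              int emulated)
{
    int n, m;

    assert(buf != NULL && insn != NULL && len > 0);

    n = snprintf(buf, len, "%s %s", insn, umip_warn_use);
    if (n < 0 || (size_t)n >= len)
        return -1;

    if (!emulated)
        return n;

    /* Append after the text already written instead of re-reading buf. */
    m = snprintf(buf + n, len - (size_t)n, " %s", umip_warn_emu);
    if (m < 0 || (size_t)m >= len - (size_t)n)
        return -1;

    return n + m;
}
```

The helper returns -1 on truncation so that a caller can notice when the buffer is too small, which is the concern raised by the "ensure that buffers using them are sufficiently large" comment in the patch.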
[RESEND PATCH v2 0/4] x86: Tweaks for UMIP
[To tip maintainers: This is a resend to copy the Linux kernel mailing list. No changes in the patches since my original v2 submission.]

Now that the support for UMIP [1], [2] has been merged in the tip tree, this series adds a couple of tweaks. Ingo asked for two small additions: select UMIP by default when building, and inform when this feature has been enabled [3]. Also, Linus suggested issuing a rate-limited warning whenever any of the instructions that UMIP protects are used by user space programs [4]. This is useful to give programs a hint on the reason for which they start seeing an unexpected SIGSEGV signal. Also, it helps to encourage updates to those programs and avoid using these instructions if possible.

Thanks and BR,
Ricardo

[1]. https://lkml.org/lkml/2017/10/27/699
[2]. https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1523438.html
[3]. https://lkml.org/lkml/2017/11/8/238
[4]. https://lkml.org/lkml/2017/11/8/593

Changes since V1:
* Capitalize all the instructions' mnemonics in both code and patch descriptions.
* Correct documentation of umip_pr_warn() to correctly reflect the function name.
* Update description of patch #4 to describe the update to the existing rate-limited pr_err().

Ricardo Neri (4):
  x86/umip: Select X86_INTEL_UMIP by default
  x86/umip: Inform that UMIP has been enabled
  x86/umip: Identify the str and sldt instructions
  x86/umip: Warn if UMIP-protected instructions are used

 arch/x86/Kconfig | 10 -
 arch/x86/kernel/cpu/common.c | 2 +
 arch/x86/kernel/umip.c | 88 +---
 3 files changed, 85 insertions(+), 15 deletions(-)

-- 
2.7.4
[RESEND PATCH v2 1/4] x86/umip: Select X86_INTEL_UMIP by default
UMIP does not incur a significant performance penalty. Furthermore, it is triggered only when a small group of instructions are used from user space programs. While here, provide more details on the benefits UMIP provides and on the behavior that the few applications using the instructions protected by UMIP can expect. Cc: Andy Lutomirski Cc: H. Peter Anvin Cc: Borislav Petkov Cc: Tony Luck Cc: Paolo Bonzini Cc: Ravi V. Shankar Cc: x...@kernel.org Suggested-by: Ingo Molnar Signed-off-by: Ricardo Neri --- arch/x86/Kconfig | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index f08977d..a524a7a 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1805,14 +1805,20 @@ config X86_SMAP If unsure, say Y. config X86_INTEL_UMIP - def_bool n + def_bool y depends on CPU_SUP_INTEL prompt "Intel User Mode Instruction Prevention" if EXPERT ---help--- The User Mode Instruction Prevention (UMIP) is a security feature in newer Intel processors. If enabled, a general protection fault is issued if the instructions SGDT, SLDT, - SIDT, SMSW and STR are executed in user mode. + SIDT, SMSW and STR are executed in user mode. These instructions + unnecessarily expose information about the hardware state. + + The vast majority of applications do not use these instructions. + For the very few that do, software emulation is provided in + specific cases in protected and virtual-8086 modes. Emulated + results are dummy. config X86_INTEL_MPX prompt "Intel MPX (Memory Protection Extensions)" -- 2.7.4
Re: [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush
2017-11-13 18:46 GMT+08:00 Peter Zijlstra : > On Mon, Nov 13, 2017 at 04:26:57PM +0800, Wanpeng Li wrote: >> 2017-11-13 16:04 GMT+08:00 Peter Zijlstra : > >> > So if at this point a vCPU gets preempted we'll still spin-wait for it, >> > which is sub-optimal. >> > >> > I think we can come up with something to get around that 'problem' if >> > indeed it is a problem. But we can easily do that as follow up patches. >> > Just let me know if you think it's worth spending more time on. >> >> You can post your idea, it is always smart. :) Then we can evaluate >> the complexity and gains. > > I'm not sure I have a fully baked idea just yet, but the general idea > would be something like: > > - switch (back) to a dedicated TLB invalidate IPI > > - introduce KVM_VCPU_IPI_PENDING > > - change flush_tlb_others() into something like: > >for_each_cpu(cpu, flushmask) { > src = &per_cpu(steal_time, cpu); > state = READ_ONCE(src->preempted); > do { > if (state & KVM_VCPU_PREEMPTED) { > if (try_cmpxchg(&src->preempted, &state, > state | > KVM_VCPU_SHOULD_FLUSH)) { > __cpumask_clear_cpu(cpu, flushmask); > break; > } > } > } while (!try_cmpxchg(&src->preempted, &state, > state | KVM_VCPU_IPI_PENDING)); >} > >apic->send_IPI_mask(flushmask, CALL_TLB_VECTOR); > >for_each_cpu(cpu, flushmask) { > src = &per_cpu(steal_time, cpu); > smp_cond_load_acquire(&src->preempted, !(VAL & KVM_VCPU_IPI_PENDING)); >} > > > - have the TLB invalidate handler do something like: > >state = READ_ONCE(src->preempted); >if (!(state & KVM_VCPU_IPI_PENDING)) >return; > >local_flush_tlb(); > >do { >} while (!try_cmpxchg(&src->preempted, &state, > state & ~KVM_VCPU_IPI_PENDING)); There are a lot of cases handled by flush_tlb_func_remote() -> flush_tlb_func_common(), so I'm afraid of leaving a hole. 
Regards, Wanpeng Li > > - then at VMEXIT time do something like: > >state = READ_ONCE(src->preempted); >do { > if (!(state & KVM_VCPU_IPI_PENDING)) > break; >} while (!try_cmpxchg(&src->preempted, &state, > (state & ~KVM_VCPU_IPI_PENDING) | > KVM_VCPU_SHOULD_FLUSH)); > >and clear any possible pending TLB_VECTOR in the guest state to avoid >raising that IPI spuriously on enter again. > > > This way the preemption will clear the IPI_PENDING and the > flush_others() wait loop will terminate.
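The three-party handshake proposed in this thread can be modeled in userspace with C11 atomics. The following is a simplified sketch under stated assumptions, not kernel code: the bit values, the function names and the flush counter standing in for local_flush_tlb() are all made up for illustration, and the sender re-evaluates the preempted bit on every retry rather than following the exact loop shape from the email.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Bit values assumed for this sketch only. */
#define VCPU_PREEMPTED    (1u << 0)
#define VCPU_SHOULD_FLUSH (1u << 1)
#define VCPU_IPI_PENDING  (1u << 2)

/*
 * Sender: if the vCPU is preempted, defer the flush by setting SHOULD_FLUSH;
 * otherwise mark an IPI as pending. Returns true if an IPI must be sent.
 */
static bool request_flush(_Atomic unsigned int *preempted)
{
    unsigned int state;

    assert(preempted != NULL);
    state = atomic_load(preempted);

    for (;;) {
        unsigned int next = (state & VCPU_PREEMPTED)
            ? (state | VCPU_SHOULD_FLUSH)
            : (state | VCPU_IPI_PENDING);

        /* On failure, state is refreshed and the decision is redone. */
        if (atomic_compare_exchange_weak(preempted, &state, next))
            return !(state & VCPU_PREEMPTED);
    }
}

/*
 * IPI handler: flush first, and only then clear IPI_PENDING, because the
 * sender is waiting on that bit. *flushes stands in for local_flush_tlb().
 */
static bool handle_flush_ipi(_Atomic unsigned int *preempted,
                             unsigned int *flushes)
{
    unsigned int state;

    assert(preempted != NULL && flushes != NULL);
    state = atomic_load(preempted);

    if (!(state & VCPU_IPI_PENDING))
        return false;

    (*flushes)++;

    while (!atomic_compare_exchange_weak(preempted, &state,
                                         state & ~VCPU_IPI_PENDING))
        ;
    return true;
}

/* VMEXIT path: turn a still-pending IPI into a deferred flush. */
static void on_preempt(_Atomic unsigned int *preempted)
{
    unsigned int state;

    assert(preempted != NULL);
    state = atomic_load(preempted);

    do {
        if (!(state & VCPU_IPI_PENDING))
            break;
    } while (!atomic_compare_exchange_weak(preempted, &state,
                 (state & ~VCPU_IPI_PENDING) | VCPU_SHOULD_FLUSH));
}
```

In this model a sender spinning until IPI_PENDING clears is released either by the handler, after the flush has happened, or by the preemption path, which guarantees a flush before the vCPU runs again.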
Re: Reply: [f2fs-dev] [PATCH RESEND] f2fs: validate before set/clear free nat bitmap
On 2017/11/14 13:59, LiFan wrote: > Sorry, it seems my company mailbox single mail would cut the long line short > automatically. > It's fine in my outlook mail, so I overlooked. Maybe 'git send-email' can be one of your options to save some work in your email client? ;) Thanks, > I haven't found a way to solve that yet, please hold both of my new patch. > I will fix it as soon as possible. > > > -Original Message- > From: Jaegeuk Kim [mailto:jaeg...@kernel.org] > Sent: November 14, 2017 12:54 > To: LiFan > Cc: 'Chao Yu'; 'Chao Yu'; linux-kernel@vger.kernel.org; > linux-f2fs-de...@lists.sourceforge.net > Subject: Re: [f2fs-dev] [PATCH RESEND] f2fs: validate before set/clear free nat > bitmap > > Sorry, I can't merge this patch due to wrong format. > > On 11/11, LiFan wrote: >> In flush_nat_entries, all dirty nats will be flushed and if their new >> address isn't NULL_ADDR, their bitmaps will be updated, the >> free_nid_count of the bitmaps will be increased regardless of whether >> the nats have already been occupied before. This could lead to wrong >> free_nid_count. >> So this patch checks the status of the bits before actually set/clear > them. 
>>
>> Fixes: 586d1492f301 ("f2fs: skip scanning free nid bitmap of full NAT
>> blocks")
>>
>> Signed-off-by: Fan li
>> ---
>>  fs/f2fs/node.c | 17 ++++++++++-------
>>  1 file changed, 10 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c index d234c6e..b965a53
>> 100644
>> --- a/fs/f2fs/node.c
>> +++ b/fs/f2fs/node.c
>> @@ -1906,15 +1906,18 @@ static void update_free_nid_bitmap(struct
>> f2fs_sb_info *sbi, nid_t nid,
>>      if (!test_bit_le(nat_ofs, nm_i->nat_block_bitmap))
>>          return;
>>
>> -    if (set)
>> +    if (set) {
>> +        if (test_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs]))
>> +            return;
>>          __set_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs]);
>> -    else
>> -        __clear_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs]);
>> -
>> -    if (set)
>>          nm_i->free_nid_count[nat_ofs]++;
>> -    else if (!build)
>> -        nm_i->free_nid_count[nat_ofs]--;
>> +    } else {
>> +        if (!test_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs]))
>> +            return;
>> +        __clear_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs]);
>> +        if (!build)
>> +            nm_i->free_nid_count[nat_ofs]--;
>> +    }
>>  }
>>
>>  static void scan_nat_page(struct f2fs_sb_info *sbi,
>> --
>> 2.7.4
>>
>
>
>
>
>
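The point of the patch is easier to see in isolation: a counter that shadows a bitmap may only change when the bit really changes. Below is a small userspace sketch of that rule, not the f2fs code itself. The struct nat_block_state, the 64-entry bitmap and update_free_nid() are made-up stand-ins for nm_i->free_nid_bitmap[] and nm_i->free_nid_count[].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NIDS_PER_BLOCK 64u /* hypothetical size, chosen to fit one word */

/* Simplified stand-in for one NAT block's free-nid bookkeeping. */
struct nat_block_state {
    uint64_t free_bitmap;    /* bit n set: nid n is free */
    unsigned int free_count; /* must track the bits set outside build */
};

/*
 * Set or clear one bit and keep the counter in step. The early returns
 * are the fix: a repeated set or a repeated clear leaves the counter
 * alone. As in the patch, a clear during build does not decrement.
 */
static void update_free_nid(struct nat_block_state *blk, unsigned int ofs,
                            bool set, bool build)
{
    assert(ofs < NIDS_PER_BLOCK);

    const uint64_t bit = UINT64_C(1) << ofs;

    if (set) {
        if (blk->free_bitmap & bit)
            return; /* already free: do not count it twice */
        blk->free_bitmap |= bit;
        blk->free_count++;
    } else {
        if (!(blk->free_bitmap & bit))
            return; /* already in use: nothing to undo */
        blk->free_bitmap &= ~bit;
        if (!build)
            blk->free_count--;
    }
}
```

Without the two early returns, flushing the same dirty entry twice would bump free_count twice for a single free nid, which is the drift the commit message describes.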
[p...@keyserver.paulfurley.com: PGP key expires in 3 days: 0x7179B76704ABA18B (it can be extended)]
James, Refreshed my key with expiration +2y in keys.gnupg.net and pgp.mit.edu. Please update. /Jarkko
Re: [PATCH V2 net] net: hns3: Updates MSI/MSI-X alloc/free APIs(depricated) to new APIs
On Thu, Nov 09, 2017 at 04:38:13PM +, Salil Mehta wrote: > This patch migrates the HNS3 driver code from use of depricated PCI > MSI/MSI-X interrupt vector allocation/free APIs to new common APIs. > > Signed-off-by: Salil Mehta > Suggested-by: Christoph Hellwig > --- > PATCH V2: Yuval Shaia > Link -> https://lkml.org/lkml/2017/11/9/138 > PATCH V1: Initial Submit > --- > .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 107 > +++-- > .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h| 15 ++- > 2 files changed, 42 insertions(+), 80 deletions(-) Reviewed-by: Yuval Shaia > > diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c > b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c > index c1cdbfd..d65c599 100644 > --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c > +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c > @@ -885,14 +885,14 @@ static int hclge_query_pf_resource(struct hclge_dev > *hdev) > hdev->pkt_buf_size = __le16_to_cpu(req->buf_size) << HCLGE_BUF_UNIT_S; > > if (hnae3_dev_roce_supported(hdev)) { > - hdev->num_roce_msix = > + hdev->num_roce_msi = > hnae_get_field(__le16_to_cpu(req->pf_intr_vector_number), > HCLGE_PF_VEC_NUM_M, HCLGE_PF_VEC_NUM_S); > > /* PF should have NIC vectors and Roce vectors, >* NIC vectors are queued before Roce vectors. 
>*/ > - hdev->num_msi = hdev->num_roce_msix + HCLGE_ROCE_VECTOR_OFFSET; > + hdev->num_msi = hdev->num_roce_msi + HCLGE_ROCE_VECTOR_OFFSET; > } else { > hdev->num_msi = > hnae_get_field(__le16_to_cpu(req->pf_intr_vector_number), > @@ -1835,7 +1835,7 @@ static int hclge_init_roce_base_info(struct hclge_vport > *vport) > struct hnae3_handle *roce = &vport->roce; > struct hnae3_handle *nic = &vport->nic; > > - roce->rinfo.num_vectors = vport->back->num_roce_msix; > + roce->rinfo.num_vectors = vport->back->num_roce_msi; > > if (vport->back->num_msi_left < vport->roce.rinfo.num_vectors || > vport->back->num_msi_left == 0) > @@ -1853,67 +1853,47 @@ static int hclge_init_roce_base_info(struct > hclge_vport *vport) > return 0; > } > > -static int hclge_init_msix(struct hclge_dev *hdev) > +static int hclge_init_msi(struct hclge_dev *hdev) > { > struct pci_dev *pdev = hdev->pdev; > - int ret, i; > - > - hdev->msix_entries = devm_kcalloc(&pdev->dev, hdev->num_msi, > - sizeof(struct msix_entry), > - GFP_KERNEL); > - if (!hdev->msix_entries) > - return -ENOMEM; > - > - hdev->vector_status = devm_kcalloc(&pdev->dev, hdev->num_msi, > -sizeof(u16), GFP_KERNEL); > - if (!hdev->vector_status) > - return -ENOMEM; > + int vectors; > + int i; > > - for (i = 0; i < hdev->num_msi; i++) { > - hdev->msix_entries[i].entry = i; > - hdev->vector_status[i] = HCLGE_INVALID_VPORT; > + vectors = pci_alloc_irq_vectors(pdev, 1, hdev->num_msi, > + PCI_IRQ_MSI | PCI_IRQ_MSIX); > + if (vectors < 0) { > + dev_err(&pdev->dev, > + "failed(%d) to allocate MSI/MSI-X vectors\n", > + vectors); > + return vectors; > } > + if (vectors < hdev->num_msi) > + dev_warn(&hdev->pdev->dev, > + "requested %d MSI/MSI-X, but allocated %d MSI/MSI-X\n", > + hdev->num_msi, vectors); > > - hdev->num_msi_left = hdev->num_msi; > - hdev->base_msi_vector = hdev->pdev->irq; > + hdev->num_msi = vectors; > + hdev->num_msi_left = vectors; > + hdev->base_msi_vector = pdev->irq; > hdev->roce_base_vector = hdev->base_msi_vector + > 
HCLGE_ROCE_VECTOR_OFFSET; > > - ret = pci_enable_msix_range(hdev->pdev, hdev->msix_entries, > - hdev->num_msi, hdev->num_msi); > - if (ret < 0) { > - dev_info(&hdev->pdev->dev, > - "MSI-X vector alloc failed: %d\n", ret); > - return ret; > - } > - > - return 0; > -} > - > -static int hclge_init_msi(struct hclge_dev *hdev) > -{ > - struct pci_dev *pdev = hdev->pdev; > - int vectors; > - int i; > - > hdev->vector_status = devm_kcalloc(&pdev->dev, hdev->num_msi, > sizeof(u16), GFP_KERNEL); > - if (!hdev->vector_status) > + if (!hdev->vector_status) { > + pci_free_irq_vectors(pdev); > return -ENOMEM; > + } > > for (i = 0; i < hdev->num_msi; i++) > hdev->vector_status[i] = HCLGE_INVALID_VPORT; > > - vectors = pci_alloc_irq_vectors(pdev, 1, hde
Re: [RFC PATCH 0/2] apply write hints to select the type of segments
On 2017/11/14 12:20, Jaegeuk Kim wrote: > On 11/13, Hyunchul Lee wrote: >> On 11/13/2017 10:59 AM, Chao Yu wrote: >>> On 2017/11/13 9:35, Hyunchul Lee wrote: On 11/13/2017 10:26 AM, Chao Yu wrote: > On 2017/11/13 8:24, Hyunchul Lee wrote: >> On 11/10/2017 03:42 PM, Chao Yu wrote: >>> On 2017/11/10 8:23, Hyunchul Lee wrote: Hello, Chao On 11/09/2017 06:12 PM, Chao Yu wrote: > On 2017/11/9 13:51, Hyunchul Lee wrote: >> From: Hyunchul Lee >> >> Using write hints[1], applications can inform the life time of the >> data >> written to devices. and this[2] reported that the write hints patch >> decreased writes in NAND by 25%. >> >> This hints help F2FS to determine the followings. >> 1) the segment types where the data will be written. >> 2) the hints that will be passed down to devices with the data of >> segments. >> >> This patch set implements the first mapping from write hints to >> segment types >> as shown below. >> >> hints segment type >> - >> WRITE_LIFE_SHORT CURSEG_COLD_DATA >> WRITE_LIFE_EXTREMECURSEG_HOT_DATA >> othersCURSEG_WARM_DATA >> >> The F2FS poliy for hot/cold seperation has precedence over this >> hints, And >> hints are not applied in in-place update. > > Could we change to disable IPU if file/inode write hint is existing? > I am afraid that this makes side effects. for example, this could cause out-of-place updates even when there are not enough free segments. I can write the patch that handles these situations. But I wonder that this is required, and I am not sure which IPU polices can be disabled. >>> >>> Oh, As I replied in another thread, I think IPU just affects filesystem >>> hot/cold separating, rather than this feature. So I think it will be >>> okay >>> to not consider it. >>> >> >> Before the second mapping is implemented, write hints are not passed >> down >> to devices. Because it is better that the data of a segment have the >> same >> hint. 
>> >> [1]: c75b1d9421f80f4143e389d2d50ddfc8a28c8c35 >> [2]: https://lwn.net/Articles/726477/ > > Could you write a patch to support passing write hint to block layer > for > buffered writes as below commit: > 0127251c45ae ("ext4: add support for passing in write hints for > buffered writes") > Sure I will. I wrote it already ;) >>> >>> Cool, ;) >>> I think that datas from the same segment should be passed down with the same hint, and the following mapping is reasonable. I wonder what is your opinion about it. segment type hints - CURSEG_COLD_DATA WRITE_LIFE_EXTREME CURSEG_HOT_DATAWRITE_LIFE_SHORT CURSEG_COLD_NODE WRITE_LIFE_NORMAL >>> >>> We have WRITE_LIFE_LONG defined rather than WRITE_LIFE_NORMAL in fs.h? >>> CURSEG_HOT_NODEWRITE_LIFE_MEDIUM >>> >>> As I know, in scenario of cell phone, data of meta_inode is hottest, >>> then hot >>> data, warm node, and cold node should be coldest. So I suggested we can >>> define >>> as below: >>> >>> META_DATA WRITE_LIFE_SHORT >>> HOT_DATA & WARM_NODEWRITE_LIFE_MEDIUM >>> HOT_NODE & WARM_DATAWRITE_LIFE_LONG >>> COLD_NODE & COLD_DATA WRITE_LIFE_EXTREME >>> >> >> I agree, But I am not sure that assigning the same hint to a node and >> data >> segment is good. Because NVMe is likely to write them in the same erase >> block if they have the same hint. > > If we do not give the hint, they can still be written to the same erase > block, >>> >>> I mean it's possible to write them to the same erase block. :) >>> > right? it will not be worse? > If the hint is not given, I think that they could be written to the same erase block, or not. But if we give the same hint, they are written to the same block. >>> >>> IMO, Only if underlying device can support more hint type or opened >>> channels, >>> and actual temperature of data segment and node segment is quite different, >>> we >>> can separate them. >>> >> >> Okay, If Jaegeuk Kim agrees with this, I will submit the patch that >> implements your proposed mapping. > > How about this? 
We'd better to split data and node blocks as much as
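To make the segment-type-to-hint mapping that Chao Yu proposes in the quoted discussion concrete, here is a small table-driven sketch. It only mirrors that proposal as written in the thread; the enum names are local stand-ins, not the kernel's CURSEG_* or WRITE_LIFE_* definitions, and nothing here says which mapping was finally chosen.

```c
#include <assert.h>

/* Local stand-ins for the f2fs segment types named in the thread. */
enum seg_type {
    SEG_HOT_DATA,
    SEG_WARM_DATA,
    SEG_COLD_DATA,
    SEG_HOT_NODE,
    SEG_WARM_NODE,
    SEG_COLD_NODE,
    SEG_META_DATA,
    SEG_TYPE_COUNT
};

/* Local stand-ins for the write-lifetime hints, shortest first. */
enum life_hint {
    LIFE_SHORT,
    LIFE_MEDIUM,
    LIFE_LONG,
    LIFE_EXTREME
};

/* Proposed mapping: hotter segments get shorter lifetimes. */
static const enum life_hint seg_hint[SEG_TYPE_COUNT] = {
    [SEG_META_DATA] = LIFE_SHORT,
    [SEG_HOT_DATA]  = LIFE_MEDIUM,
    [SEG_WARM_NODE] = LIFE_MEDIUM,
    [SEG_HOT_NODE]  = LIFE_LONG,
    [SEG_WARM_DATA] = LIFE_LONG,
    [SEG_COLD_NODE] = LIFE_EXTREME,
    [SEG_COLD_DATA] = LIFE_EXTREME,
};

/* Hint to pass down with every block written from a given segment type. */
static enum life_hint seg_type_to_hint(enum seg_type type)
{
    assert((unsigned int)type < SEG_TYPE_COUNT);
    return seg_hint[type];
}
```

Because the lookup is keyed on the segment type alone, all blocks of one segment carry the same hint, which is the property Hyunchul asks for. The cost, as he notes, is that a node type and a data type sharing a hint may land in the same erase block.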
linux-next: Tree for Nov 14
Hi all,

Please do not add any v4.16 material to your linux-next included trees
until v4.15-rc1 has been released.

Changes since 20171113:

The powerpc tree lost its build failure.

The keys tree lost its build failure.

The nvdimm tree gained a conflict against the parisc-hd tree.

The akpm tree lost a patch that turned up elsewhere.

Non-merge commits (relative to Linus' tree): 11762
 10963 files changed, 528527 insertions(+), 258423 deletions(-)

I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ). If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one. You should use "git fetch" and checkout or reset to the new
master.

You can see which trees have been included by looking in the Next/Trees
file in the source. There are also quilt-import.log and merge.log
files in the Next directory. Between each merge, the tree was built
with a ppc64_defconfig for powerpc, an allmodconfig for x86_64, a
multi_v7_defconfig for arm and a native build of tools/perf. After
the final fixups (if any), I do an x86_64 modules_install followed by
builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit),
ppc44x_defconfig, allyesconfig and pseries_le_defconfig and i386, sparc
and sparc64 defconfig. And finally, a simple boot test of the powerpc
pseries_le_defconfig kernel in qemu (with and without kvm enabled).

Below is a summary of the state of the merge.

I am currently merging 272 trees (counting Linus' and 42 trees of bug
fix patches pending for the current merge release).

Stats about the size of the tree over time can be seen at
http://neuling.org/linux-next-size.html .

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next . If maintainers want to give
advice about cross compilers/configs that work, we are always open to add
more builds.
Thanks to Randy Dunlap for doing many randconfig builds. And to Paul Gortmaker for triage and bug fixes. -- Cheers, Stephen Rothwell $ git checkout master $ git reset --hard stable Merging origin/master (8e9a2dba8686 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip) Merging fixes/master (820bf5c419e4 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi) Merging kbuild-current/fixes (bb3f38c3c5b7 kbuild: clang: fix build failures with sparse check) Merging arc-current/for-curr (92d44128241f ARCv2: Accomodate HS48 MMUv5 by relaxing MMU ver checking) Merging arm-current/fixes (b9dd05c7002e ARM: 8720/1: ensure dump_instr() checks addr_limit) Merging m68k-current/for-linus (5e387199c17c m68k/defconfig: Update defconfigs for v4.14-rc7) Merging metag-fixes/fixes (b884a190afce metag/usercopy: Add missing fixups) Merging powerpc-fixes/fixes (7ecb37f62fe5 powerpc/perf: Fix core-imc hotplug callback failure during imc initialization) Merging sparc/master (23198ddffb6c sparc32: Add cmpxchg64().) Merging fscrypt-current/for-stable (42d97eb0ade3 fscrypt: fix renaming and linking special files) Merging net/master (b39545684a90 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net) Merging ipsec/master (c9f3f813d462 xfrm: Fix stack-out-of-bounds read in xfrm_state_find.) 
Merging netfilter/master (7400bb4b5800 netfilter: nf_reject_ipv4: Fix use-after-free in send_reset) Merging ipvs/master (f7fb77fc1235 netfilter: nft_compat: check extension hook mask only if set) Merging wireless-drivers/master (a6127b4440d1 Merge tag 'iwlwifi-for-kalle-2017-10-06' of git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-fixes) Merging mac80211/master (b39545684a90 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net) Merging sound-current/for-linus (7087cb8fad5e Documentation: sound: hd-audio: notes.rst) Merging pci-current/for-linus (6b7be529634b MAINTAINERS: Add Lorenzo Pieralisi for PCI host bridge drivers) Merging driver-core.current/driver-core-linus (39dae59d66ac Linux 4.14-rc8) Merging tty.current/tty-linus (8a5776a5f498 Linux 4.14-rc4) Merging usb.current/usb-linus (bb176f67090c Linux 4.14-rc6) Merging usb-gadget-fixes/fixes (7c80f9e4a588 usb: usbtest: fix NULL pointer dereference) Merging usb-serial-fixes/usb-linus (0b07194bb55e Linux 4.14-rc7) Merging usb-chipidea-fixes/ci-for-usb-stable (cbb22ebcfb99 usb: chipidea: core: check before accessing ci_role in ci_role_show) Merging phy/fixes (2fb850092fd9 phy: rockchip-typec: Check for errors from tcphy_phy_init()) Merging staging.current/staging-linus (bb176f67090c Linux 4.14-rc6) Merging char-misc.current/char-misc-linus (bb176f67090c Linux 4.14-rc6) Merging input-current/for-linus (26dd633e437d Input: synaptics-rmi4 - RMI4 can also use
Re: [PATCH v4 2/4] KVM: X86: Add paravirt remote TLB flush
2017-11-13 21:02 GMT+08:00 Peter Zijlstra :
> On Mon, Nov 13, 2017 at 11:46:34AM +0100, Peter Zijlstra wrote:
>> On Mon, Nov 13, 2017 at 04:26:57PM +0800, Wanpeng Li wrote:
>> > 2017-11-13 16:04 GMT+08:00 Peter Zijlstra :
>>
>> > > So if at this point a vCPU gets preempted we'll still spin-wait for it,
>> > > which is sub-optimal.
>> > >
>> > > I think we can come up with something to get around that 'problem' if
>> > > indeed it is a problem. But we can easily do that as follow up patches.
>> > > Just let me know if you think it's worth spending more time on.
>> >
>> > You can post your idea, it is always smart. :) Then we can evaluate
>> > the complexity and gains.
>>
>> I'm not sure I have a fully baked idea just yet, but the general idea
>> would be something like:
>>
>>  - switch (back) to a dedicated TLB invalidate IPI
>
> Just for PV that is; the !PV code can continue doing what it does today.
>
>>  - introduce KVM_VCPU_IPI_PENDING
>>
>>  - change flush_tlb_others() into something like:
>>
>>    for_each_cpu(cpu, flushmask) {
>>        src = &per_cpu(steal_time, cpu);
>>        state = READ_ONCE(src->preempted);
>>        do {
>>            if (state & KVM_VCPU_PREEMPTED) {
>>                if (try_cmpxchg(&src->preempted, &state,
>>                                state | KVM_VCPU_SHOULD_FLUSH)) {
>>                    __cpumask_clear_cpu(cpu, flushmask);
>>                    break;
>>                }
>>            }
>>        } while (!try_cmpxchg(&src->preempted, &state,
>>                              state | KVM_VCPU_IPI_PENDING));
>
> That can be written like:
>
>        do {
>            if (state & KVM_VCPU_PREEMPTED)
>                new_state = state | KVM_VCPU_SHOULD_FLUSH;
>            else
>                new_state = state | KVM_VCPU_IPI_PENDING;
>        } while (!try_cmpxchg(&src->preempted, &state, new_state));
>
>        if (new_state & KVM_VCPU_IPI_PENDING)

Should be new_state & KVM_VCPU_SHOULD_FLUSH I think.

Regards,
Wanpeng Li

>            __cpumask_clear_cpu(cpu, flushmask);
>
>>    }
>>
>>    apic->send_IPI_mask(flushmask, CALL_TLB_VECTOR);
>>
>>    for_each_cpu(cpu, flushmask) {
>>        src = &per_cpu(steal_time, cpu);
>
>        /*
>         * The ACQUIRE pairs with the cmpxchg clearing IPI_PENDING,
>         * which is either the TLB IPI handler, or the VMEXIT path.
>         * It ensures that the invalidate happens-before.
>         */
>>        smp_cond_load_acquire(&src->preempted, !(VAL & KVM_VCPU_IPI_PENDING));
>>    }
>
> And here we wait for completion of the invalidate; but because of the
> VMEXIT change below, this will never stall on a !running vCPU.
>
> Note that PLE will not help (much) here, without this extra IPI_PENDING
> state and the VMEXIT transferring it to SHOULD_FLUSH this vCPU's progress
> will be held up until all vCPU's you've IPI'd will have run the IPI
> handler, which in the worst case is still a very long time.
>
>>  - have the TLB invalidate handler do something like:
>>
>>    state = READ_ONCE(src->preempted);
>>    if (!(state & KVM_VCPU_IPI_PENDING))
>>        return;
>>
>>    local_flush_tlb();
>>
>>    do {
>>    } while (!try_cmpxchg(&src->preempted, &state,
>>                          state & ~KVM_VCPU_IPI_PENDING));
>
> That needs to be:
>
>    /*
>     * Clear KVM_VCPU_IPI_PENDING to 'complete' flush_tlb_others().
>     */
>    do {
>        /*
>         * VMEXIT could have cleared this for us, in which case
>         * we're done.
>         */
>        if (!(state & KVM_VCPU_IPI_PENDING))
>            return;
>
>    } while (!try_cmpxchg(&src->preempted, &state,
>                          state & ~KVM_VCPU_IPI_PENDING));
>
>>  - then at VMEXIT time do something like:
>>
>    /*
>     * If we have IPI_PENDING set at VMEXIT time, transfer it to
>     * SHOULD_FLUSH. Clearing IPI_PENDING here allows the
>     * flush_others() vCPU to continue while the SHOULD_FLUSH
>     * guarantees this vCPU will flush TLBs before it continues
>     * execution.
>     */
>
>>    state = READ_ONCE(src->preempted);
>>    do {
>>        if (!(state & KVM_VCPU_IPI_PENDING))
>>            break;
>>    } while (!try_cmpxchg(&src->preempted, &state,
>>                          (state & ~KVM_VCPU_IPI_PENDING) |
>>                          KVM_VCPU_SHOULD_FLUSH));
>>
>>    and clear any possible pending TLB_VECTOR in the guest state to avoid
>>    raising that IPI spuriously on enter again.
>>
>
>
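The sender-side loop, in its "can be written like" form and with the mask test Wanpeng suggests (drop the CPU when SHOULD_FLUSH was chosen), can be tried out in userspace roughly as follows. This is a sketch under assumptions rather than kernel code: the bit values are invented, a uint64_t stands in for the cpumask, and mark_for_flush() and build_ipi_mask() are hypothetical helper names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; the thread does not fix an encoding. */
#define VCPU_PREEMPTED    (1u << 0)
#define VCPU_SHOULD_FLUSH (1u << 1)
#define VCPU_IPI_PENDING  (1u << 2)

/*
 * Tag one target vCPU. Returns true if it still needs an IPI, false if
 * it was preempted and the flush has been deferred via SHOULD_FLUSH.
 */
static bool mark_for_flush(_Atomic uint8_t *preempted)
{
    uint8_t state = atomic_load(preempted);
    uint8_t new_state;

    do {
        if (state & VCPU_PREEMPTED)
            new_state = (uint8_t)(state | VCPU_SHOULD_FLUSH);
        else
            new_state = (uint8_t)(state | VCPU_IPI_PENDING);
        /* A failed CAS reloads 'state', so the choice is re-made. */
    } while (!atomic_compare_exchange_weak(preempted, &state, new_state));

    return !(new_state & VCPU_SHOULD_FLUSH);
}

/*
 * Walk the requested CPUs and return the subset that must actually be
 * sent an IPI (and later waited on). Bit n of the mask is CPU n.
 */
static uint64_t build_ipi_mask(_Atomic uint8_t *preempted,
                               unsigned int ncpus, uint64_t flushmask)
{
    assert(ncpus <= 64);

    for (unsigned int cpu = 0; cpu < ncpus; cpu++) {
        const uint64_t bit = UINT64_C(1) << cpu;

        if (!(flushmask & bit))
            continue;
        if (!mark_for_flush(&preempted[cpu]))
            flushmask &= ~bit; /* deferred flush, no IPI, no wait */
    }
    return flushmask;
}
```

Recomputing new_state on every retry is what lets a vCPU that gets preempted between the load and the CAS still end up on the deferred path instead of being waited on.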
Re: [PATCH] docs: dev-tools: coccinelle: delete out of date wiki reference
On Tue, 14 Nov 2017, Masahiro Yamada wrote:

> Hi Julia, Jon,
>
> 2017-11-14 1:50 GMT+09:00 Julia Lawall :
> > The wiki is no longer available.
> >
> > Signed-off-by: Julia Lawall
> >
>
>
> Jon sent the doc pull request yesterday.
>
> I will pick this up for Kbuild tree
> because I have not sent pull requests for this MW yet.

OK, thanks.

julia

> >
> >
> >
> > diff --git a/Documentation/dev-tools/coccinelle.rst
> > b/Documentation/dev-tools/coccinelle.rst
> > index 37e474f..94f41c2 100644
> > --- a/Documentation/dev-tools/coccinelle.rst
> > +++ b/Documentation/dev-tools/coccinelle.rst
> > @@ -33,9 +33,6 @@ of many distributions, e.g. :
> >  You can get the latest version released from the Coccinelle homepage at
> >  http://coccinelle.lip6.fr/
> >
> > -Information and tips about Coccinelle are also provided on the wiki
> > -pages at http://cocci.ekstranet.diku.dk/wiki/doku.php
> > -
> >  Once you have it, run the following command::
> >
> >  ./configure
> >
>
> --
> Best Regards
> Masahiro Yamada
>
RE: [PATCH 1/2] mm: drop migrate type checks from has_unmovable_pages
Hi Michal, > -Original Message- > From: Michal Hocko [mailto:mho...@kernel.org] > Sent: Monday, November 13, 2017 7:03 PM > To: Ran Wang > Cc: linux...@kvack.org; Michael Ellerman ; Vlastimil > Babka ; Andrew Morton ; > KAMEZAWA Hiroyuki ; Reza Arbab > ; Yasuaki Ishimatsu ; > qiuxi...@huawei.com; Igor Mammedov ; Vitaly > Kuznetsov ; LKML ; > Leo Li ; Xiaobo Xie > Subject: Re: [PATCH 1/2] mm: drop migrate type checks from > has_unmovable_pages > > On Mon 13-11-17 07:33:13, Ran Wang wrote: > > Hello Michal, > > > > > > > > > Date: Fri, 13 Oct 2017 14:00:12 +0200 > > > > > > From: Michal Hocko > > > > > > Michael has noticed that the memory offline tries to migrate kernel > > > code pages when doing echo 0 > > > > /sys/devices/system/memory/memory0/online > > > > > > The current implementation will fail the operation after several > > > failed page migration attempts but we shouldn't even attempt to > > > migrate that memory and fail right away because this memory is > > > clearly not migrateable. This will become a real problem when we drop > the retry loop counter resp. timeout. > > > > > > The real problem is in has_unmovable_pages in fact. We should fail > > > if there are any non migrateable pages in the area. In orther to > > > guarantee that remove the migrate type checks because > > > MIGRATE_MOVABLE is not guaranteed to contain only migrateable pages. > It is merely a heuristic. > > > Similarly MIGRATE_CMA does guarantee that the page allocator doesn't > > > allocate any non-migrateable pages from the block but CMA > > > allocations themselves are unlikely to migrateable. Therefore remove > both checks. 
> > > > > > Reported-by: Michael Ellerman > > > Signed-off-by: Michal Hocko > > > Tested-by: Michael Ellerman > > > Acked-by: Vlastimil Babka > > > --- > > > mm/page_alloc.c | 3 --- > > > 1 file changed, 3 deletions(-) > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c index > > > 3badcedf96a7..ad0294ab3e4f 100644 > > > --- a/mm/page_alloc.c > > > +++ b/mm/page_alloc.c > > > @@ -7355,9 +7355,6 @@ bool has_unmovable_pages(struct zone *zone, > > > struct page *page, int count, > > >*/ > > > if (zone_idx(zone) == ZONE_MOVABLE) > > > return false; > > > - mt = get_pageblock_migratetype(page); > > > - if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt)) > > > - return false; > > > > This drop cause DWC3 USB controller fail on initialization with > > Layerscaper processors (such as LS1043A) as below: > > > > [2.701437] xhci-hcd xhci-hcd.0.auto: new USB bus registered, assigned > bus number 1 > > [2.710949] cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -16 > > [2.717411] xhci-hcd xhci-hcd.0.auto: can't setup: -12 > > [2.727940] xhci-hcd xhci-hcd.0.auto: USB bus 1 deregistered > > [2.733607] xhci-hcd: probe of xhci-hcd.0.auto failed with error -12 > > [2.739978] xhci-hcd xhci-hcd.1.auto: xHCI Host Controller > > > > And I notice that someone also reported to you that DWC2 got affected > > recently, so do you have the solution now? > > Yes. It should be in linux-next. Have a look at the following email > thread: > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Flkml. > kernel.org%2Fr%2F20171104082500.qvzbb2kw4suo6cgy%40dhcp22.suse.cz& > data=02%7C01%7Cran.wang_1%40nxp.com%7C5e73c6a941fc4f1c10e708d52 > a860c5b%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C636461677 > 583607877&sdata=zlRxJ4LZwOBsit5qRx9yFT5qfP54wZ0z6G1z%2Bcywf5g%3D > &reserved=0 Thanks for your info, although I fail to open the link you shared, but I got patch from my colleague and the issue got fix on my side, let you know, thanks. Best Regards, Ran > -- > Michal Hocko > SUSE Labs
[PATCH] x86/mce: add support SRAO reported via CMC check
In Intel SDM Volume 3B (253669-063US, July 2017), SRAO could be reported
via CMC:

  In cases when SRAO is signaled via CMCI the error signature is
  indicated via UC=1, PCC=0, S=0.

So add a check for those known AO MCACODs to mce_severity().

Signed-off-by: Xie XiuQi
Tested-by: Chen Wei
---
 arch/x86/kernel/cpu/mcheck/mce-severity.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index 4ca632a..48f239a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -101,6 +101,16 @@
 		NOSER, BITCLR(MCI_STATUS_UC)
 		),
 
+	/* known AO MCACODs reported via CMC: */
+	MCESEV(
+		AO, "Action optional: memory scrubbing error",
+		SER, MASK(MCI_UC_SAR|MCACOD_SCRUBMSK, MCI_STATUS_UC|MCACOD_SCRUB)
+		),
+	MCESEV(
+		AO, "Action optional: last level cache writeback error",
+		SER, MASK(MCI_UC_SAR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
+		),
+
 	/* ignore OVER for UCNA */
 	MCESEV(
 		UCNA, "Uncorrected no action required",
--
1.8.3.1
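The MASK(mask, result) entries above are matched as (status & mask) == result, and the first matching rule wins, so the two new AO rules only take effect because they sit ahead of the generic UCNA rule. The following standalone sketch shows that behaviour. The bit positions and MCACOD values are written from memory of asm/mce.h and should be treated as assumptions, the rule table is cut down to three entries, and the real mce_severity() additionally checks SER support and execution context, which is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed MCi_STATUS bits and MCACOD values (see lead-in). */
#define ST_UC           (UINT64_C(1) << 61)
#define ST_S            (UINT64_C(1) << 56)
#define ST_AR           (UINT64_C(1) << 55)
#define ST_UC_SAR       (ST_UC | ST_S | ST_AR)
#define MCACOD_ALL      UINT64_C(0xefff)
#define MCACOD_SCRUB    UINT64_C(0x00c0)
#define MCACOD_SCRUBMSK UINT64_C(0xeff0)
#define MCACOD_L3WB     UINT64_C(0x017a)

enum sev { SEV_NONE, SEV_UCNA, SEV_AO };

struct sev_rule {
    uint64_t mask;   /* which status bits the rule looks at */
    uint64_t result; /* value those bits must have */
    enum sev sev;
};

/* Order matters: the specific AO rules precede the generic UCNA rule. */
static const struct sev_rule rules[] = {
    { ST_UC_SAR | MCACOD_SCRUBMSK, ST_UC | MCACOD_SCRUB, SEV_AO },
    { ST_UC_SAR | MCACOD_ALL,      ST_UC | MCACOD_L3WB,  SEV_AO },
    { ST_UC_SAR,                   ST_UC,                SEV_UCNA },
};

/* Return the severity of the first rule whose masked bits match. */
static enum sev classify(uint64_t status)
{
    for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
        const struct sev_rule *r = &rules[i];

        /* A rule that expects bits outside its own mask can never match. */
        assert((r->result & ~r->mask) == 0);
        if ((status & r->mask) == r->result)
            return r->sev;
    }
    return SEV_NONE;
}
```

With UC=1, S=0, AR=0 and a scrubbing or L3 writeback MCACOD, the status is classified as AO; the same signature with any other MCACOD falls through to UCNA, as it did before the patch.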
[GIT PULL]: dmaengine updates for 4.15-rc1
Hi Linus, Here is the PULL request for dmaengine updates for 4.15-rc1. As you may have noticed I am also using topic branches but the branch (for-linus) contains only merge commits. Since I was not in KS and based on reading the coverage I have gathered that you would like it this way, if not do let me know I shall do accordingly. Further we have also done RST conversion for dmaengine documentation. That would come from Jon's tree. The following changes since commit 2bd6bf03f4c1c59381d62c61d03f6cc3fe71f66e: Linux 4.14-rc1 (2017-09-16 15:47:51 -0700) are available in the git repository at: git://git.infradead.org/users/vkoul/slave-dma.git tags/dmaengine-4.15-rc1 for you to fetch changes up to cecd5fc5512349662b9e7a9e06231055d803e3f6: Merge branch 'topic/xilinx' into for-linus (2017-11-14 10:37:28 +0530) dmaengine updates for 4.15-rc1 Updates for this cycle include: - New driver for Spreadtrum dma controller, ST MDMA and DMAMUX controllers - PM support for IMG MDC drivers - Updates to bcm-sba-raid driver and improvements to sun6i driver - Subsystem conversion for: - timers to use timer_setup() - remove usage of PCI pool API - usage of %p format specifier - Minor updates to bunch of drivers Adam Wallis (1): dmaengine: dmatest: warn user when dma test times out Alexander Kochetkov (1): dmaengine: pl330: fix descriptor allocation fail Andy Shevchenko (1): MAINTAINERS: Step down from a co-maintaner of DW DMAC driver Anup Patel (4): dmaengine: bcm-sba-raid: serialize dma_cookie_complete() using reqs_lock dmaengine: bcm-sba-raid: Use only single mailbox channel dmaengine: bcm-sba-raid: Use common GPL comment header dmaengine: Build bcm-sba-raid driver as loadable module for iProc SoCs Arnd Bergmann (1): dmaengine: stm32_mdma: add CONFIG_OF dependency Baolin Wang (2): dt-bindings: dmaengine: Add Spreadtrum SC9860 DMA controller dmaengine: sprd: Add Spreadtrum DMA driver Biju Das (1): dmaengine: usb-dmac: Add compatible string for r8a7743/5 Colin Ian King (1): dmaengine: 
stm32: remove redundant initialization of hwdesc Corentin Labbe (1): dmaengine: sun6i: use of_device_get_match_data Dan Carpenter (1): dmaengine: stm32-dmamux: Fix a NULL vs IS_ERR() check in probe Ed Blake (2): dmaengine: img-mdc: Add suspend / resume handling dmaengine: img-mdc: Add runtime PM Geert Uytterhoeven (1): dmaengine: nbpfaxi: Use of_device_get_match_data() helper Hiroyuki Yokoyama (1): dmaengine: rcar-dmac: use TCRB instead of TCR for residue Kees Cook (1): dmaengine: Convert timers to use timer_setup() Lars-Peter Clausen (3): dmaengine: axi-dmac: Only use hardware cyclic mode for single segment transfers dmaengine: axi-dmac: Fix software cyclic mode dmaengine: xilinx_dma: Move enum xdma_ip_type to driver file Nicolin Chen (1): dmaengine: imx-sdma: Correct src_addr_widths and directions Peter Ujfalusi (3): dmaengine: edma: Implement protection for invalid max_burst dmaengine: omap-dma: Implement protection for invalid max_burst dmaengine: ti-dma-crossbar: Correct am335x/am43xx mux value type Pierre-Yves MORDRET (6): dt-bindings: Document the STM32 DMAMUX bindings dmaengine: Add STM32 DMAMUX driver dt-bindings: stm32-dma: add a property to handle STM32 DMAMUX dt-bindings: Document the STM32 MDMA bindings dmaengine: Add STM32 MDMA driver dmaengine: stm32_mdma: activate pack/unpack feature Romain Perier (1): dmaengine: pch_dma: Replace PCI pool old API Russell King (1): dmaengine: sa11x0: add DMA filters Sricharan R (1): dmaengine: qcom-bam: Process multiple pending descriptors Stefan Brüns (10): dmaengine: List all allowed values for src/dst_addr_width in kernel doc dmaengine: Mark struct dma_slave_caps kernel-doc correctly, clarify dmaengine: sun6i: Correct setting of clock autogating register for A83T/H3 dmaengine: sun6i: Correct burst length field offsets for H3 dmaengine: sun6i: Restructure code to allow extension for new SoCs dmaengine: sun6i: Enable additional burst lengths/widths on H3 dmaengine: sun6i: Move number of pchans/vchans/request to 
device struct arm64: allwinner: a64: Add devicetree binding for DMA controller dmaengine: sun6i: Add support for Allwinner A64 and compatibles dmaengine: sun6i: Retrieve channel count/max request from devicetree Vinod Koul (21): dmaengine: stm32: use %p format specfier for pointer dmaengine: coh901318: Remove unnecessary 0x prefixes before %pad dmaengine: at_hdmac: Remove unnecessary 0x prefixes before %pad dmaengine: Revert "rcar-dmac: use TCRB instead of TCR for residue" Merge branch 'topic/print_fixes' into for-linus
Re: [PATCH v2 1/3] dt-bindings: phy: Add Cygnus usb phy binding
Hi Rob, On Mon, Nov 13, 2017 at 11:23 PM, Rob Herring wrote: > On Sun, Nov 12, 2017 at 10:23 PM, Raveendra Padasalagi > wrote: >> Hi, >> >> On Sat, Nov 11, 2017 at 3:14 AM, Rob Herring wrote: >>> On Wed, Nov 08, 2017 at 01:16:41PM +0530, Raveendra Padasalagi wrote: Add devicetree binding document for broadcom's Cygnus SoC specific usb phy controller driver. Signed-off-by: Raveendra Padasalagi --- .../bindings/phy/brcm,cygnus-usb-phy.txt | 106 + 1 file changed, 106 insertions(+) create mode 100644 Documentation/devicetree/bindings/phy/brcm,cygnus-usb-phy.txt diff --git a/Documentation/devicetree/bindings/phy/brcm,cygnus-usb-phy.txt b/Documentation/devicetree/bindings/phy/brcm,cygnus-usb-phy.txt new file mode 100644 index 000..bbc4b94 --- /dev/null +++ b/Documentation/devicetree/bindings/phy/brcm,cygnus-usb-phy.txt @@ -0,0 +1,106 @@ +BROADCOM CYGNUS USB PHY + +Required Properties: +- compatible: brcm,cygnus-usb-phy +- reg : the register start address and length for +crmu_usbphy_aon_ctrl, +cdru usb phy control, +usb host idm registers, +usb device idm registers. +- reg-names: a list of the names corresponding to the previous register ranges + Should contain +"crmu-usbphy-aon-ctrl", +"cdru-usbphy", +"usb2h-idm", +"usb2d-idm". +- address-cells: should be 1 +- size-cells: should be 0 + +Sub-nodes: + Each port's PHY should be represented as a sub-node. + +Sub-nodes required properties: +- reg: the PHY number +- #phy-cells must be 1 + The node that uses the phy must provide 1 integer argument specifying + port number. + +Optional Properties: +- vbus-p#-supply : The regulator for vbus out control for the host >>> >>> Is this a literal # or something else? >> >> Yes, this is a literal. It's assumed # will replace numeric 0-2 for >> each of the ports. > > I'm still confused. Which is valid? "vbus-p#-supply" or "vbus-p0-supply" > I agree, it's creating confusion. Instead of enumerating "vbus-p0-supply", "vbus-p1-supply", "vbus-p2-supply" kept "vbus-p#-supply". 
Yes, as suggested by you "vbus-supply" should be sufficient as it's in each of phy sub node. > If the latter, you need to enumerate all valid options. But these are > in sub nodes for each port, so just "vbus-supply" should be > sufficient. Keeping "vbus-supply" should not create any confusion. Will send out the change in next version of the patch. > One more question, does Vbus actually supply power to the phy or you > are just associating the Vbus supply to a connector with a port? The > latter needs a connector node instead and Vbus should be part of that. > There's been some attempts at USB connectors, but we don't really have > one yet (the extcon binding is not it). Vbus is not supplied to phy, it's been given to the devices connected on the port and in our platform vbus is controlled through an external regulator which is controlled through gpio. So "vbus-supply" shown above actually points to the phandle of vbus regulator node. >> In the example it's not shown as the regulators specified in vbus-p#-supply >> are board specific. > > Please show in the example. Examples should be complete. Ok. Sure. > Rob
Re: [GIT PULL] USB/PHY driver changes for 4.15-rc1
On Mon, Nov 13, 2017 at 8:19 AM, Greg KH wrote:
>
> Other major thing is the typec code that moved out of staging and into
> the "real" part of the drivers/usb/ tree, which was nice to see happen.

Hmm. So now it asks me about Type-C Port Controller Manager. Fair
enough. I say "N", because I have none.

But then it still asks me about that TI TPS6598x driver...

So I do see the _technical_ logic in there - the "TYPEC" config option
is a hidden internal option, and it's selected by the things that need
it.

But from a user perspective, this configuration model is really
strange. Why is TYPEC_TCPM something you ask the user, but not "do you
want Type-C support"?

And if you single out the PCM side to ask about, why don't you single
out the power delivery side? Wouldn't it make more sense to at least
ask whether I want Type-C power delivery chips before it then starts
asking about individual PD drivers, the same way you asked about the
port controller before you started asking about individual port
controller drivers?

Or is it just me who finds this a bit odd?

              Linus
[PATCH] x86,kvm: move qemu/guest FPU switching out to vcpu_run
Currently, every time a VCPU is scheduled out, the host kernel will first save the guest FPU/xstate context, then load the qemu userspace FPU context, only to then immediately save the qemu userspace FPU context back to memory. When scheduling in a VCPU, the same extraneous FPU loads and saves are done. This could be avoided by moving from a model where the guest FPU is loaded and stored with preemption disabled, to a model where the qemu userspace FPU is swapped out for the guest FPU context for the duration of the KVM_RUN ioctl. This is done under the VCPU mutex, which is also taken when other tasks inspect the VCPU FPU context, so the code should already be safe for this change. That should come as no surprise, given that s390 already has this optimization. No performance changes were detected in quick ping-pong tests on my 4 socket system, which is expected since an FPU+xstate load is on the order of 0.1us, while ping-ponging between CPUs is on the order of 20us, and somewhat noisy. There may be other tests where performance changes are noticeable. Signed-off-by: Rik van Riel Suggested-by: Christian Borntraeger --- arch/x86/include/asm/kvm_host.h | 13 + arch/x86/kvm/x86.c | 29 - 2 files changed, 25 insertions(+), 17 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c73e493adf07..92e66685249e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -536,7 +536,20 @@ struct kvm_vcpu_arch { struct kvm_mmu_memory_cache mmu_page_cache; struct kvm_mmu_memory_cache mmu_page_header_cache; + /* +* QEMU userspace and the guest each have their own FPU state. +* In vcpu_run, we switch between the user and guest FPU contexts. +* While running a VCPU, the VCPU thread will have the guest FPU +* context. +* +* Note that while the PKRU state lives inside the fpu registers, +* it is switched out separately at VMENTER and VMEXIT time. 
The +* "guest_fpu" state here contains the guest FPU context, with the +* host PRKU bits. +*/ + struct fpu user_fpu; struct fpu guest_fpu; + u64 xcr0; u64 guest_supported_xcr0; u32 guest_xstate_size; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 03869eb7fcd6..59912b20a830 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2917,7 +2917,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) srcu_read_unlock(&vcpu->kvm->srcu, idx); pagefault_enable(); kvm_x86_ops->vcpu_put(vcpu); - kvm_put_guest_fpu(vcpu); vcpu->arch.last_host_tsc = rdtsc(); } @@ -6908,7 +6907,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) preempt_disable(); kvm_x86_ops->prepare_guest_switch(vcpu); - kvm_load_guest_fpu(vcpu); /* * Disable IRQs before setting IN_GUEST_MODE. Posted interrupt @@ -7095,6 +7093,8 @@ static int vcpu_run(struct kvm_vcpu *vcpu) vcpu->srcu_idx = srcu_read_lock(&kvm->srcu); + kvm_load_guest_fpu(vcpu); + for (;;) { if (kvm_vcpu_running(vcpu)) { r = vcpu_enter_guest(vcpu); @@ -7132,6 +7132,8 @@ static int vcpu_run(struct kvm_vcpu *vcpu) } } + kvm_put_guest_fpu(vcpu); + srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx); return r; @@ -7663,32 +7665,25 @@ static void fx_init(struct kvm_vcpu *vcpu) vcpu->arch.cr0 |= X86_CR0_ET; } +/* Swap (qemu) user FPU context for the guest FPU context. */ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu) { - if (vcpu->guest_fpu_loaded) - return; - - /* -* Restore all possible states in the guest, -* and assume host would use all available bits. -* Guest xcr0 would be loaded later. -*/ - vcpu->guest_fpu_loaded = 1; - __kernel_fpu_begin(); + preempt_disable(); + copy_fpregs_to_fpstate(&vcpu->arch.user_fpu); /* PKRU is separately restored in kvm_x86_ops->run. */ __copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state, ~XFEATURE_MASK_PKRU); + preempt_enable(); trace_kvm_fpu(1); } +/* When vcpu_run ends, restore user space FPU context. 
*/ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu) { - if (!vcpu->guest_fpu_loaded) - return; - - vcpu->guest_fpu_loaded = 0; + preempt_disable(); copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu); - __kernel_fpu_end(); + copy_kernel_to_fpregs(&vcpu->arch.user_fpu.state); + preempt_enable(); ++vcpu->stat.fpu_reload; trace_kvm_fpu(0); }
[PATCH] f2fs: expose quota information in debugfs
This patch shows # of dirty pages and # of hidden quota files. Signed-off-by: Jaegeuk Kim --- fs/f2fs/debug.c | 11 +++ fs/f2fs/f2fs.h | 10 -- 2 files changed, 19 insertions(+), 2 deletions(-) diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c index f7eec506ceea..ecada8425268 100644 --- a/fs/f2fs/debug.c +++ b/fs/f2fs/debug.c @@ -45,9 +45,18 @@ static void update_general_status(struct f2fs_sb_info *sbi) si->ndirty_dent = get_pages(sbi, F2FS_DIRTY_DENTS); si->ndirty_meta = get_pages(sbi, F2FS_DIRTY_META); si->ndirty_data = get_pages(sbi, F2FS_DIRTY_DATA); + si->ndirty_qdata = get_pages(sbi, F2FS_DIRTY_QDATA); si->ndirty_imeta = get_pages(sbi, F2FS_DIRTY_IMETA); si->ndirty_dirs = sbi->ndirty_inode[DIR_INODE]; si->ndirty_files = sbi->ndirty_inode[FILE_INODE]; + + si->nquota_files = 0; + if (f2fs_sb_has_quota_ino(sbi->sb)) { + for (i = 0; i < MAXQUOTAS; i++) { + if (f2fs_qf_ino(sbi->sb, i)) + si->nquota_files++; + } + } si->ndirty_all = sbi->ndirty_inode[DIRTY_META]; si->inmem_pages = get_pages(sbi, F2FS_INMEM_PAGES); si->aw_cnt = atomic_read(&sbi->aw_cnt); @@ -369,6 +378,8 @@ static int stat_show(struct seq_file *s, void *v) si->ndirty_dent, si->ndirty_dirs, si->ndirty_all); seq_printf(s, " - datas: %4d in files:%4d\n", si->ndirty_data, si->ndirty_files); + seq_printf(s, " - quota datas: %4d in quota files:%4d\n", + si->ndirty_qdata, si->nquota_files); seq_printf(s, " - meta: %4d in %4d\n", si->ndirty_meta, si->meta_pages); seq_printf(s, " - imeta: %4d\n", diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index 5c379a8ea075..44f874483ecf 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -865,6 +865,7 @@ struct f2fs_sm_info { enum count_type { F2FS_DIRTY_DENTS, F2FS_DIRTY_DATA, + F2FS_DIRTY_QDATA, F2FS_DIRTY_NODES, F2FS_DIRTY_META, F2FS_INMEM_PAGES, @@ -1642,6 +1643,8 @@ static inline void inode_inc_dirty_pages(struct inode *inode) atomic_inc(&F2FS_I(inode)->dirty_pages); inc_page_count(F2FS_I_SB(inode), S_ISDIR(inode->i_mode) ? 
F2FS_DIRTY_DENTS : F2FS_DIRTY_DATA); + if (IS_NOQUOTA(inode)) + inc_page_count(F2FS_I_SB(inode), F2FS_DIRTY_QDATA); } static inline void dec_page_count(struct f2fs_sb_info *sbi, int count_type) @@ -1658,6 +1661,8 @@ static inline void inode_dec_dirty_pages(struct inode *inode) atomic_dec(&F2FS_I(inode)->dirty_pages); dec_page_count(F2FS_I_SB(inode), S_ISDIR(inode->i_mode) ? F2FS_DIRTY_DENTS : F2FS_DIRTY_DATA); + if (IS_NOQUOTA(inode)) + dec_page_count(F2FS_I_SB(inode), F2FS_DIRTY_QDATA); } static inline s64 get_pages(struct f2fs_sb_info *sbi, int count_type) @@ -2771,9 +2776,10 @@ struct f2fs_stat_info { unsigned long long hit_largest, hit_cached, hit_rbtree; unsigned long long hit_total, total_ext; int ext_tree, zombie_tree, ext_node; - int ndirty_node, ndirty_dent, ndirty_meta, ndirty_data, ndirty_imeta; + int ndirty_node, ndirty_dent, ndirty_meta, ndirty_imeta; + int ndirty_data, ndirty_qdata; int inmem_pages; - unsigned int ndirty_dirs, ndirty_files, ndirty_all; + unsigned int ndirty_dirs, ndirty_files, nquota_files, ndirty_all; int nats, dirty_nats, sits, dirty_sits; int free_nids, avail_nids, alloc_nids; int total_count, utilization; -- 2.14.0.rc1.383.gd1ce394fe2-goog
Re: [PATCH v1 4/5] perf, tools: Add fallback in perf_evsel__nr_cpus for no map
On Mon, Nov 13, 2017 at 10:22:30AM +0100, Jiri Olsa wrote:
> On Thu, Nov 09, 2017 at 06:55:27AM -0800, Andi Kleen wrote:
> > From: Andi Kleen
> >
> > Support the case of the event having no cpumap in perf_evsel__nr_cpus.
> > Just return 1 in this case. This can happen in perf script
> > when it uses the perf stat shadow functions.
>
> why 1, where in shadow code? you can synthesize cpus for event
> via event_update event

For sampling it should be always 1, right?

Where:

#0 0x00570e03 in __perf_evsel_stat__is (evsel=0x2690ce0, id=PERF_STAT_EVSEL_ID__CYCLES_IN_TX) at util/stat.c:75
#1 0x00572375 in perf_stat__update_shadow_stats (counter=0x2690ce0, count=3744, cpu=0) at util/stat-shadow.c:194

-Andi
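For readers following the thread, the fallback described in the patch summary ("just return 1" when the event has no cpumap) might look roughly like the sketch below. The types `struct cpu_map_sketch` and `struct evsel_sketch` and the function `evsel_nr_cpus_sketch()` are hypothetical stand-ins for illustration only; they are not the real perf structures or the actual patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a perf CPU map. */
struct cpu_map_sketch {
	int nr;		/* number of CPUs in the map */
};

/* Hypothetical, simplified stand-in for an event selector. */
struct evsel_sketch {
	struct cpu_map_sketch *cpus;	/* may be NULL, e.g. for sampled events */
};

/* Return the number of CPUs for the event, falling back to 1 if no map. */
static int evsel_nr_cpus_sketch(const struct evsel_sketch *evsel)
{
	assert(evsel != NULL);
	return evsel->cpus ? evsel->cpus->nr : 1;
}
```

Callers that index per-CPU statistics by CPU number would then see a single slot for events without a map, which matches the `cpu=0` argument visible in the backtrace above.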
Re: [GIT PULL] x86 updates for v4.15
On Mon, Nov 13, 2017 at 12:24 AM, Ingo Molnar wrote: > >git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-asm-for-linus Hmm #2. My laptop had odd SIGBUS and IO errors after a suspend/resume cycle when running commit d6ec9d9a4def, which is after my merge of the x86 core changes. I'm probably not going to be able to bisect it - there's nothing in the logs, probably because processes just died (and most likely the IO errors were due to the disk having gone missing), but looking at the merges I had done up until that point, all the suspect ones are from you. The x86 pull obviously being the most likely one, just based on content, and based on that "after suspend/resume". I'm wondering how much suspend/resume testing that entry code has gotten. Last release it was the TLB ASID code that messed up on suspend/resume, I suspect there is a decided lack of test coverage in the otherwise good x86 farm.. I'll see if I can get anything interesting out of testing some more, but I thought I'd give you guys a heads up. Usually it's the networking tree and the PM tree that triggers issues on my laptop, but neither of those had been merged at that point. But there also really isn't anything else that looks odd in there. Linus
Re: [PATCH] iio: mma8452: add power_mode sysfs configuration
Hi Martin, > But given your concerns, I would strip down this patch to only offer the > already documented "low_noise" and "low_power" modes. It wouldn't be > worth it to extend the ABI just because of this! OK then we can map 'low_noise' to high resolution mode. But I am afraid I can't test the functionality because I don't have proper instruments to measure the current draw(in microAmps) accurately. > I would like "oversampling" more than this "power_mode" too. For this > driver it would be far more complicated to implement though. I doubt > that it'll be done. power_mode is basically already there implicitely, > and given that there *is* the ABI, we could offer it for free. I think 'oversampling' is already implemented, as I see 'case IIO_CHAN_INFO_OVERSAMPLING_RATIO:' being handled which is basically setting the all 4 different power modes. If we also add 'power_mode', I think it would be like having two different user interfaces for same functionality. So I don't see much of value adding 'power_mode' as well. Please correct me if I am wrong. Thanks, Harinath On Sun, Nov 12, 2017 at 7:28 AM, Martin Kepplinger wrote: > On 2017-11-11 01:33, Jonathan Cameron wrote: >> On Mon, 6 Nov 2017 08:19:58 +0100 >> Martin Kepplinger wrote: >> >>> This adds the power_mode sysfs interface to the device as documented in >>> sysfs-bus-iio. >>> >>> --- >>> >>> Note that I explicitely don't sign off on this. >>> >>> This is a starting point for anybody who can test it and check for correct >>> API usage, and ABI correctness, as documented in >>> Documentation/ABI/testing/sys-bus-iio >>> (grep it for "power_mode"). The ABI doc probably would need an addition >>> too, if the 4 power modes here seem generally useful (there are only >>> 2 listed there)! >>> >>> So, if you can test this, feel free to set up a proper patch or >>> two, and I'm happy to review. >>> >>> Please note that this patch is quite old. It really should be that simple >>> as far as my understanding back then. 
We always list the available >>> frequencies >>> of the given power mode we are in, for example, already, and everything >>> basically is in place except for the user interface. >> >> Hmm. A lot of devices support something along these lines. The issue >> has always been - how is userspace to figure out what to do with it? >> It's all very vague... >> >> Funnily enough - this used to be really common, but is becoming less so >> now - presumably because no one was using it much (or maybe I am reading >> too much into that ;) >> >> Now the question is whether it can be tied to better defined things? >> >> Here low noise restricts the range to 4g. Issue is that we don't actually >> have writeable _available attributes (which correspond to range in this >> case). >> > > Does it? Isn't it merely less oversampling. > >> Low power mode... This one is apparently oversampling. If possible support >> it as that as we have well defined interfaces for that. >> >> Jonathan. > > Ah, I remember; the oversampling settings was actually a reason why I > hadn't submitted the patch :) The oversampling API would definitely be > more accurate. > > I would like "oversampling" more than this "power_mode" too. For this > driver it would be far more complicated to implement though. I doubt > that it'll be done. power_mode is basically already there implicitely, > and given that there *is* the ABI, we could offer it for free. > > But given your concerns, I would strip down this patch to only offer the > already documented "low_noise" and "low_power" modes. It wouldn't be > worth it to extend the ABI just because of this! > > Users would have a simple switch if they don't really *want* to know the > details. I think it can be useful to just say "I don't care about power > consuption. Be as accurate as possible" or "I just want this think to > work. Use a little power as possible." Sure it's vage, but would it be > useless?
Re: [f2fs-dev] [PATCH RESEND] f2fs: validate before set/clear free nat bitmap
Sorry, I can't merge this patch due to wrong format. On 11/11, LiFan wrote: > In flush_nat_entries, all dirty nats will be flushed and if their new > address isn't > NULL_ADDR, their bitmaps will be updated, the free_nid_count of the bitmaps > will be increased regardless of whether the nats have already been occupied > before. This could lead to wrong free_nid_count. > So this patch checks the status of the bits before actually set/clear them. > > Fixes: 586d1492f301 ("f2fs: skip scanning free nid bitmap of full NAT > blocks") > > Signed-off-by: Fan li > --- > fs/f2fs/node.c | 17 ++--- > 1 file changed, 10 insertions(+), 7 deletions(-) > > diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c index d234c6e..b965a53 100644 > --- a/fs/f2fs/node.c > +++ b/fs/f2fs/node.c > @@ -1906,15 +1906,18 @@ static void update_free_nid_bitmap(struct > f2fs_sb_info *sbi, nid_t nid, > if (!test_bit_le(nat_ofs, nm_i->nat_block_bitmap)) > return; > > - if (set) > + if (set) { > + if (test_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs])) > + return; > __set_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs]); > - else > - __clear_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs]); > - > - if (set) > nm_i->free_nid_count[nat_ofs]++; > - else if (!build) > - nm_i->free_nid_count[nat_ofs]--; > + } else { > + if (!test_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs])) > + return; > + __clear_bit_le(nid_ofs, nm_i->free_nid_bitmap[nat_ofs]); > + if (!build) > + nm_i->free_nid_count[nat_ofs]--; > + } > } > > static void scan_nat_page(struct f2fs_sb_info *sbi, > -- > 2.7.4 >
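The idea in the quoted patch, touching the counter only when the bit really changes, can be shown with a small user-space sketch. Everything here is an assumption made for illustration: `struct nat_block_sketch`, `update_free_bit_sketch()` and the 64-entry bitmap are hypothetical, and the `build` special case from the real function is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NIDS_PER_BLOCK 64u	/* hypothetical size, small for illustration */

struct nat_block_sketch {
	uint64_t free_bitmap;	/* bit i set: nid i in this block is free */
	unsigned int free_count;	/* should always equal the number of set bits */
};

/*
 * Mark nid_ofs as free (set == true) or in use (set == false).
 * The counter is adjusted only if the bit actually changes state,
 * so repeated calls with the same arguments cannot skew the count.
 */
static void update_free_bit_sketch(struct nat_block_sketch *st,
				   unsigned int nid_ofs, bool set)
{
	uint64_t mask;

	assert(nid_ofs < SKETCH_NIDS_PER_BLOCK);
	mask = UINT64_C(1) << nid_ofs;

	if (set) {
		if (st->free_bitmap & mask)
			return;		/* already free: nothing to count */
		st->free_bitmap |= mask;
		st->free_count++;
	} else {
		if (!(st->free_bitmap & mask))
			return;		/* already in use: nothing to count */
		st->free_bitmap &= ~mask;
		st->free_count--;
	}
}
```

Without the two early returns, setting an already-set bit would still increment the counter, which is the kind of drift the patch description warns about.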
Re: Prototype patch for Linux-kernel memory model
On Mon, Nov 13, 2017 at 03:09:11PM -0500, Alan Stern wrote: > On Mon, 13 Nov 2017, Paul E. McKenney wrote: > > > Hello! > > > > Please see below for the git commit corresponding to a prototype > > patch for the Linux-kernel memory model. This addresses the feedback > > we got at Linux Plumbers Conference: > > > > 1. There is a Documentation/recipes.txt file giving known-good > > useful examples, along with corresponding litmus tests. > > > > 2. There is a Documentation/explanation.txt file giving an > > overview of the memory model and its workings. > > > > 3. There is a Documentation/references.txt file giving some > > background reading. > > > > I believe that we have something that will be extremely useful and > > valuable to novices and experts alike. > > > > Please note that this version of the memory model does not yet reflect > > the changes that make DEC Alpha no longer be a special case because > > those changes have not yet hit mainline. The model will be updated > > once this happens. > > > > Thoughts? > > In references.txt, should we add URLs to non-paywalled PDFs? Or should > we assume that our readers are capable of using Google to find these > things on their own? > > There are a few places where some comments should be resolved/removed > before submission: > > Documentation/references.txt line 98: > Uncategorized stuff (any of this really needed?) > > litmus-tests/README line 92: > [ Shouldn't we have one with smp_wmb() in the process with both > writes, and smp_mb() in the other process. ] I updated these, recategorizing the "Uncategorized stuff" and removing the note from litmus-tests/README -- we don't seem to use R in recipes anyway. > In the files defining the memory model, we should replace the GPL > boilerplate with SPDX headers. We can! I pushed both commits. Thanx, Paul
Re: [PATCH v9 3/7] mailbox: qcom: Move the apcs struct into a separate header
On Mon 13 Nov 18:12 PST 2017, Stephen Boyd wrote: > On 10/27, Georgi Djakov wrote: > > Hi Bjorn, > > > > Thanks for reviewing! > > > > On 10/26/2017 07:28 AM, Bjorn Andersson wrote: > > > On Thu 21 Sep 09:49 PDT 2017, Georgi Djakov wrote: > > > > > >> Move the structure shared by the APCS IPC device and its subdevices > > >> into a separate header file. > > >> > > > > > > As you're creating the apcs regmap with devm_regmap_init_mmio() you can > > > just call dev_get_regmap(dev->parent) in your child to get the handle. > > > > Ok, thanks! > > > > > > > > But I would prefer that you just add the clock code to the existing > > > driver. > > > > This will require an ack from Stephen, and i got the impression that he > > prefers a separate clk driver [1]. > > > > Stephen, are you ok with registering the clocks from the apcs mailbox > > driver? > > > > [1] https://lkml.org/lkml/2017/6/26/750 > > The parent regmap "trick" was the plan. Is something wrong with > that? > Not at all, but then this patch (moving apcs context to a shared header file) shouldn't be needed, or am I missing something? > Not having random clk drivers scattered throughout the tree is > sort of nice because it makes for an easier time finding things > that are similar. Maybe that's an abuse of the driver model > though? Just to get things into some same directory. I'm fine > either way. > Keeping the clock driver in the clock subsystem does make sense. I see now that there is a include of a local header file as well, so that would just be messy to keep split. I'm fine with the extra driver instance, it's the DT that I don't think should describe the fact that we want to keep the clock-part in the clock subsystem. Do you see any problems spawning the clock driver programmatically and then calling of_clk_add_hw_provider() on the parent's of_node? Regards, Bjorn
Fwd: FW: [PATCH 18/31] nds32: Library functions
>>On Wed, Nov 08, 2017 at 01:55:06PM +0800, Greentime Hu wrote: > >> +#define __range_ok(addr, size) (size <= get_fs() && addr <= (get_fs() >> +-size)) >> + >> +#define access_ok(type, addr, size) \ >> + __range_ok((unsigned long)addr, (unsigned long)size) > >> +#define __get_user_x(__r2,__p,__e,__s,__i...) >> \ >> +__asm__ __volatile__ ( \ >> + __asmeq("%0", "$r0") __asmeq("%1", "$r2") \ >> + "bal__get_user_" #__s \ > >... which does not check access_ok() or do any visible equivalents; OK... > >> +#define get_user(x,p) >> \ >> + ({ \ >> + const register typeof(*(p)) __user *__p asm("$r0") = (p);\ >> + register unsigned long __r2 asm("$r2"); \ >> + register int __e asm("$r0");\ >> + switch (sizeof(*(__p))) { \ >> + case 1: \ >> + __get_user_x(__r2, __p, __e, 1, "$lp"); \ > >... and neither does this, which is almost certainly *not* OK. > >> +#define put_user(x,p) >> \ > >Same here, AFAICS. Thanks. I will add access_ok() in get_user/put_user >> +extern unsigned long __arch_copy_from_user(void *to, const void __user * >> from, >> +unsigned long n); > >> +static inline unsigned long raw_copy_from_user(void *to, >> +const void __user * from, >> +unsigned long n) >> +{ >> + return __arch_copy_from_user(to, from, n); } > >Er... Why not call your __arch_... raw_... and be done with that? Thanks. I will modify it in next patch version >> +#define INLINE_COPY_FROM_USER >> +#define INLINE_COPY_TO_USER > >Are those actually worth bothering? IOW, have you compared behaviour with and >without them? We compared the assembly code of copy_from/to_user's caller function, and we think the performance is better by making copy_from/to_user as inline >> +ENTRY(__arch_copy_to_user) >> + push$r0 >> + push$r2 >> + beqz$r2, ctu_exit >> + srli$p0, $r2, #2! $p0 = number of word to clear >> + andi$r2, $r2, #3! Bytes less than a word to copy >> + beqz$p0, byte_ctu ! Only less than a word to copy >> +word_ctu: >> + lmw.bim $p1, [$r1], $p1 ! Load the next word >> +USER(smw.bim,$p1, [$r0], $p1)! 
Store the next word > >Umm... It's that happy with unaligned loads and stores? Your memcpy seems to >be trying to avoid those... Thanks. This should be aligned loads and stores, too. I will modify it in next version patch. >> +9001: >> + pop $p1 ! Original $r2, n >> + pop $p0 ! Original $r0, void *to >> + sub $r1, $r0, $p0 ! Bytes copied >> + sub $r2, $p1, $r1 ! Bytes left to copy >> + push$lp >> + move$r0, $p0 >> + bal memzero ! Clean up the memory > >Just what memory are you zeroing here? The one you had been unable to store >into in the first place? > >> +ENTRY(__arch_copy_from_user) > >> +9001: >> + pop $p1 ! Original $r2, n >> + pop $p0 ! Original $r0, void *to >> + sub $r1, $r1, $p0 ! Bytes copied >> + sub $r2, $p1, $r1 ! Bytes left to copy >> + push$lp >> + bal memzero ! Clean up the memory > >Ditto, only this one is even worse - instead of just oopsing on you, it will >quietly destroy data past the area you've copied into. raw_copy_..._user() >MUST NOT ZERO ANYTHING. Ever. Thanks So, I should keep the area that we've copied into instead of zeroing the area even if unpredicted exception is happened. Right? Best regards Vincent
linux-next: manual merge of the nvdimm tree with the parisc-hd tree
Hi Dan, Today's linux-next merge of the nvdimm tree got a conflict in: arch/parisc/include/uapi/asm/mman.h between commit: 48cd4dc8f57f ("parisc: Convert MAP_TYPE to cover 4 bits on parisc") from the parisc-hd tree and commit: 1c9725974074 ("mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flag") from the nvdimm tree. I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell diff --cc arch/parisc/include/uapi/asm/mman.h index 9a39035986cc,bca652aa1677.. --- a/arch/parisc/include/uapi/asm/mman.h +++ b/arch/parisc/include/uapi/asm/mman.h @@@ -12,11 -11,10 +12,12 @@@ #define MAP_SHARED0x01/* Share changes */ #define MAP_PRIVATE 0x02/* Changes are private */ + #define MAP_SHARED_VALIDATE 0x03 /* share + validate extension flags */ -#define MAP_TYPE 0x03/* Mask for type of mapping */ +#define MAP_TYPE (MAP_SHARED|MAP_PRIVATE|MAP_RESRVD1|MAP_RESRVD2) /* Mask for type of mapping */ #define MAP_FIXED 0x04/* Interpret addr exactly */ +#define MAP_RESRVD1 0x08/* reserved for 3rd bit of MAP_TYPE */ #define MAP_ANONYMOUS 0x10/* don't use a file */ +#define MAP_RESRVD2 0x20/* reserved for 4th bit of MAP_TYPE */ #define MAP_DENYWRITE 0x0800 /* ETXTBSY */ #define MAP_EXECUTABLE0x1000 /* mark it as an executable */
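As a quick sanity check of the resolution above, the sketch below copies the flag values from the merged header under a `PA_` prefix (the prefix and the helper `pa_map_type_sketch()` are made up here so the names cannot clash with the system headers) and shows that the widened mask still extracts MAP_SHARED_VALIDATE and does not overlap MAP_FIXED or MAP_ANONYMOUS.

```c
#include <assert.h>

/* Values copied from the merged parisc mman.h shown above. */
#define PA_MAP_SHARED		0x01
#define PA_MAP_PRIVATE		0x02
#define PA_MAP_SHARED_VALIDATE	0x03
#define PA_MAP_FIXED		0x04
#define PA_MAP_RESRVD1		0x08
#define PA_MAP_ANONYMOUS	0x10
#define PA_MAP_RESRVD2		0x20
#define PA_MAP_TYPE \
	(PA_MAP_SHARED | PA_MAP_PRIVATE | PA_MAP_RESRVD1 | PA_MAP_RESRVD2)

/* Extract the mapping type from a full mmap flags word. */
static int pa_map_type_sketch(int flags)
{
	return flags & PA_MAP_TYPE;
}
```

With these values the mask works out to 0x2b, so the two reserved bits sit on either side of MAP_ANONYMOUS without touching it.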
Re: [PATCH] iio: accel: mma8452: Add single pulse/tap event detection
> > This patch adds following related changes: > > - defines pulse event related registers > > - enables and handles single pulse interrupt for fxls8471 > > - handles IIO_EV_DIR_EITHER in read/write callbacks (because > > event direction for pulse is either rising or falling) > > - configures read/write event value for pulse latency register > > using IIO_EV_INFO_HYSTERESIS > > - adds multiple events like pulse and tranient event spec > > as elements of event_spec array named 'mma8452_accel_events' > > > > Except mma8653 chip all other chips like mma845x and > > fxls8471 have single tap detection feature. > > Tested thoroughly using iio_event_monitor application on > > imx6ul-evk board which has fxls8471. > > > > Signed-off-by: Harinath Nampally > > --- > What tree is this written against? It doesn't apply to the current -next > anyways. Thanks for the review. It is actually against 'testing' branch, I think two of my earlier patches are not yet applied to any branch, that might be reason this patch is not good against current -next or 'togreg'. > I think the defintions would deserve to be in a separate patch, but > that's debatable. Yes, I would argue that definitions are not a logical change. 
> > .type = IIO_EV_TYPE_MAG, > > .dir = IIO_EV_DIR_RISING, > > .mask_separate = BIT(IIO_EV_INFO_ENABLE), > > @@ -1139,6 +1274,15 @@ static const struct iio_event_spec > > mma8452_transient_event[] = { > > BIT(IIO_EV_INFO_PERIOD) | > > BIT(IIO_EV_INFO_HIGH_PASS_FILTER_3DB) > > }, > > + { > > + //pulse event > > + .type = IIO_EV_TYPE_MAG, > > + .dir = IIO_EV_DIR_EITHER, > > + .mask_separate = BIT(IIO_EV_INFO_ENABLE), > > + .mask_shared_by_type = BIT(IIO_EV_INFO_VALUE) | > > + BIT(IIO_EV_INFO_PERIOD) | > > + BIT(IIO_EV_INFO_HYSTERESIS) > > + }, > > }; > > > > static const struct iio_event_spec mma8452_motion_event[] = { > > @@ -1202,8 +1346,8 @@ static struct attribute_group > > mma8452_event_attribute_group = { > > .shift = 16 - (bits), \ > > .endianness = IIO_BE, \ > > }, \ > > - .event_spec = mma8452_transient_event, \ > > - .num_event_specs = ARRAY_SIZE(mma8452_transient_event), \ > > + .event_spec = mma8452_accel_events, \ > > + .num_event_specs = ARRAY_SIZE(mma8452_accel_events), \ > that would go in the mentioned separate renaming-patch OK so I will make a patch set; patch 1/2 to just rename 'mma8452_transient_event[]' to 'mma8452_accel_events[]'(without adding pulse event). and everything else would go in 2/2. Does that makes sense? Thanks, Harinath On Fri, Nov 10, 2017 at 5:35 PM, Martin Kepplinger wrote: > On 2017-11-09 04:12, Harinath Nampally wrote: >> This patch adds following related changes: >> - defines pulse event related registers >> - enables and handles single pulse interrupt for fxls8471 >> - handles IIO_EV_DIR_EITHER in read/write callbacks (because >> event direction for pulse is either rising or falling) >> - configures read/write event value for pulse latency register >> using IIO_EV_INFO_HYSTERESIS >> - adds multiple events like pulse and tranient event spec >> as elements of event_spec array named 'mma8452_accel_events' >> >> Except mma8653 chip all other chips like mma845x and >> fxls8471 have single tap detection feature. 
>> Tested thoroughly using iio_event_monitor application on >> imx6ul-evk board which has fxls8471. >> >> Signed-off-by: Harinath Nampally >> --- > > What tree is this written against? It doesn't apply to the current -next > anyways. > >> drivers/iio/accel/mma8452.c | 156 >> ++-- >> 1 file changed, 151 insertions(+), 5 deletions(-) >> >> diff --git a/drivers/iio/accel/mma8452.c b/drivers/iio/accel/mma8452.c >> index 43c3a6b..36f1b56 100644 >> --- a/drivers/iio/accel/mma8452.c >> +++ b/drivers/iio/accel/mma8452.c >> @@ -72,6 +72,19 @@ >> #define MMA8452_TRANSIENT_THS_MASK GENMASK(6, 0) >> #define MMA8452_TRANSIENT_COUNT 0x20 >> #define MMA8452_TRANSIENT_CHAN_SHIFT 1 >> +#define MMA8452_PULSE_CFG0x21 >> +#define MMA8452_PULSE_CFG_CHAN(chan) BIT(chan * 2) >> +#define MMA8452_PULSE_CFG_ELEBIT(6) >> +#define MMA8452_PULSE_SRC0x22 >> +#define MMA8452_PULSE_SRC_XPULSE BIT(4) >> +#define MMA8452_PULSE_SRC_YPULSE BIT(5) >> +#define MMA8452_PULSE_SRC_ZPULSE BIT(6) >> +#define MMA8452_PULSE_THS0x23 >> +#define MMA8452_PULSE_THS_MASK GENMASK(6, 0) >> +#define MMA8452_PULSE_COUNT 0x26 >> +#define MMA8452_PULSE_CHAN_SHIFT 2 >> +#define MMA8452_PULSE_LTCY 0x27 >> + >> #define MMA8452_CTR
Re: [RFC PATCH 0/2] apply write hints to select the type of segments
On 11/13, Hyunchul Lee wrote: > On 11/13/2017 10:59 AM, Chao Yu wrote: > > On 2017/11/13 9:35, Hyunchul Lee wrote: > >> On 11/13/2017 10:26 AM, Chao Yu wrote: > >>> On 2017/11/13 8:24, Hyunchul Lee wrote: > On 11/10/2017 03:42 PM, Chao Yu wrote: > > On 2017/11/10 8:23, Hyunchul Lee wrote: > >> Hello, Chao > >> > >> On 11/09/2017 06:12 PM, Chao Yu wrote: > >>> On 2017/11/9 13:51, Hyunchul Lee wrote: > From: Hyunchul Lee > > Using write hints[1], applications can inform the life time of the > data > written to devices. and this[2] reported that the write hints patch > decreased writes in NAND by 25%. > > This hints help F2FS to determine the followings. > 1) the segment types where the data will be written. > 2) the hints that will be passed down to devices with the data of > segments. > > This patch set implements the first mapping from write hints to > segment types > as shown below. > > hints segment type > - > WRITE_LIFE_SHORT CURSEG_COLD_DATA > WRITE_LIFE_EXTREMECURSEG_HOT_DATA > othersCURSEG_WARM_DATA > > The F2FS poliy for hot/cold seperation has precedence over this > hints, And > hints are not applied in in-place update. > >>> > >>> Could we change to disable IPU if file/inode write hint is existing? > >>> > >> > >> I am afraid that this makes side effects. for example, this could cause > >> out-of-place updates even when there are not enough free segments. > >> I can write the patch that handles these situations. But I wonder > >> that this is required, and I am not sure which IPU polices can be > >> disabled. > > > > Oh, As I replied in another thread, I think IPU just affects filesystem > > hot/cold separating, rather than this feature. So I think it will be > > okay > > to not consider it. > > > >> > > Before the second mapping is implemented, write hints are not passed > down > to devices. Because it is better that the data of a segment have the > same > hint. 
> > [1]: c75b1d9421f80f4143e389d2d50ddfc8a28c8c35 > [2]: https://lwn.net/Articles/726477/ > >>> > >>> Could you write a patch to support passing write hint to block layer > >>> for > >>> buffered writes as below commit: > >>> 0127251c45ae ("ext4: add support for passing in write hints for > >>> buffered writes") > >>> > >> > >> Sure I will. I wrote it already ;) > > > > Cool, ;) > > > >> I think that datas from the same segment should be passed down with > >> the same > >> hint, and the following mapping is reasonable. I wonder what is your > >> opinion > >> about it. > >> > >> segment type hints > >> - > >> CURSEG_COLD_DATA WRITE_LIFE_EXTREME > >> CURSEG_HOT_DATAWRITE_LIFE_SHORT > >> CURSEG_COLD_NODE WRITE_LIFE_NORMAL > > > > We have WRITE_LIFE_LONG defined rather than WRITE_LIFE_NORMAL in fs.h? > > > >> CURSEG_HOT_NODEWRITE_LIFE_MEDIUM > > > > As I know, in scenario of cell phone, data of meta_inode is hottest, > > then hot > > data, warm node, and cold node should be coldest. So I suggested we can > > define > > as below: > > > > META_DATA WRITE_LIFE_SHORT > > HOT_DATA & WARM_NODEWRITE_LIFE_MEDIUM > > HOT_NODE & WARM_DATAWRITE_LIFE_LONG > > COLD_NODE & COLD_DATA WRITE_LIFE_EXTREME > > > > I agree, But I am not sure that assigning the same hint to a node and > data > segment is good. Because NVMe is likely to write them in the same erase > block if they have the same hint. > >>> > >>> If we do not give the hint, they can still be written to the same erase > >>> block, > > > > I mean it's possible to write them to the same erase block. :) > > > >>> right? it will not be worse? > >>> > >> > >> If the hint is not given, I think that they could be written to > >> the same erase block, or not. But if we give the same hint, they are > >> written > >> to the same block. > > > > IMO, Only if underlying device can support more hint type or opened > > channels, > > and actual temperature of data segment and node segment is quite different, > > we > > can separate them. 
> >
>
> Okay, If Jaegeuk Kim agrees with this, I will submit the patch that
> implements your proposed mapping.

How about this?
We'd better split data and node blocks as much as possible.

segment typeh
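For readers following the thread, the segment-type-to-hint mapping Chao Yu proposes above can be written down as a small lookup. This is only a userspace sketch: the `MODEL_*` enums and `model_seg_to_hint()` are names invented for illustration, not the F2FS or VFS definitions, and the reply above (whose table is cut off) argues for keeping data and node segments apart instead.

```c
/* Userspace model only: MODEL_* names are stand-ins, not kernel definitions. */
enum model_hint {
	MODEL_LIFE_SHORT,
	MODEL_LIFE_MEDIUM,
	MODEL_LIFE_LONG,
	MODEL_LIFE_EXTREME,
};

enum model_seg {
	MODEL_META_DATA,
	MODEL_HOT_DATA,
	MODEL_WARM_DATA,
	MODEL_COLD_DATA,
	MODEL_HOT_NODE,
	MODEL_WARM_NODE,
	MODEL_COLD_NODE,
};

/* Mapping as proposed in the thread: hottest segments get the shortest hint. */
static enum model_hint model_seg_to_hint(enum model_seg seg)
{
	switch (seg) {
	case MODEL_META_DATA:
		return MODEL_LIFE_SHORT;
	case MODEL_HOT_DATA:
	case MODEL_WARM_NODE:
		return MODEL_LIFE_MEDIUM;
	case MODEL_HOT_NODE:
	case MODEL_WARM_DATA:
		return MODEL_LIFE_LONG;
	case MODEL_COLD_NODE:
	case MODEL_COLD_DATA:
		return MODEL_LIFE_EXTREME;
	}
	/* Not reached for valid input; fall back to the coldest hint. */
	return MODEL_LIFE_EXTREME;
}
```

Note that this mapping gives a data segment and a node segment the same hint in three of the four rows, which is exactly the concern Hyunchul raises about a device placing them in the same erase block.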
Re: [PATCH RFC v3 4/6] Documentation: Add three sysctls for smart idle poll
On 2017/11/13 23:08, Ingo Molnar wrote:
> * Quan Xu wrote:
>
>> From: Quan Xu
>>
>> To reduce the cost of poll, we introduce three sysctl to control the
>> poll time when running as a virtual machine with paravirt.
>>
>> Signed-off-by: Yang Zhang
>> Signed-off-by: Quan Xu
>> ---
>>  Documentation/sysctl/kernel.txt | 35 +++
>>  arch/x86/kernel/paravirt.c      |  4
>>  include/linux/kernel.h          |  6 ++
>>  kernel/sysctl.c                 | 34 ++
>>  4 files changed, 79 insertions(+), 0 deletions(-)
>>
>> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
>> index 694968c..30c25fb 100644
>> --- a/Documentation/sysctl/kernel.txt
>> +++ b/Documentation/sysctl/kernel.txt
>> @@ -714,6 +714,41 @@ kernel tries to allocate a number starting from this one.
>>
>>  ==
>>
>> +paravirt_poll_grow: (X86 only)
>> +
>> +Multiplied value to increase the poll time. This is expected to take
>> +effect only when running as a virtual machine with CONFIG_PARAVIRT
>> +enabled. This can't bring any benefit on bare metal even with
>> +CONFIG_PARAVIRT enabled.
>> +
>> +By default this value is 2. Possible values to set are in range {2..16}.
>> +
>> +==
>> +
>> +paravirt_poll_shrink: (X86 only)
>> +
>> +Divided value to reduce the poll time. This is expected to take effect
>> +only when running as a virtual machine with CONFIG_PARAVIRT enabled.
>> +This can't bring any benefit on bare metal even with CONFIG_PARAVIRT
>> +enabled.
>> +
>> +By default this value is 2. Possible values to set are in range {2..16}.
>> +
>> +==
>> +
>> +paravirt_poll_threshold_ns: (X86 only)
>> +
>> +Controls the maximum poll time before entering real idle path. This is
>> +expected to take effect only when running as a virtual machine with
>> +CONFIG_PARAVIRT enabled. This can't bring any benefit on bare metal
>> +even with CONFIG_PARAVIRT enabled.
>> +
>> +By default this value is 0, which means not to poll. Possible values to
>> +set are in range {0..50}. Change the value to non-zero if running
>> +latency-bound workloads in a virtual machine.
>
> I absolutely hate it how this hybrid idle loop polling mechanism is not
> self-tuning!

Ingo, actually it is self-tuning.

> Please make it all work fine by default, and automatically so, instead of
> adding three random parameters...

I will make it all work fine by default. However, cloud environments are
diverse. Could I keep only the paravirt_poll_threshold_ns parameter (the
maximum poll time), which is similar to the "adaptive halt-polling" Wanpeng
mentioned? Then the user can turn it off, or find an appropriate threshold
for some odd scenario.

Thanks for your comments!

Quan
Alibaba Cloud

> And if it cannot be done automatically then we should rather not do it at
> all. Maybe the next submitter of a similar feature can think of a better
> approach.
>
> Thanks,
>
> 	Ingo
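To make the "grow / shrink / threshold" wording in the quoted documentation concrete, here is a minimal userspace sketch of how such a self-tuning poll window could be updated. It is an assumption-laden model, not the patch's code: `model_next_poll_ns()`, the `should_grow` input and the 1 ns starting value are all hypothetical, since the actual heuristic is not shown in this thread.

```c
/*
 * Hypothetical update rule for an adaptive poll window.
 * grow / shrink mirror paravirt_poll_grow / paravirt_poll_shrink,
 * threshold_ns mirrors paravirt_poll_threshold_ns (0 disables polling).
 * The documented range for grow and shrink is 2..16, so shrink is nonzero.
 */
static unsigned long model_next_poll_ns(unsigned long cur_ns, int should_grow,
					unsigned int grow, unsigned int shrink,
					unsigned long threshold_ns)
{
	unsigned long next_ns;

	if (threshold_ns == 0)
		return 0;	/* polling is switched off */

	if (should_grow) {
		/* Assumed base of 1 ns so that growth can start from zero. */
		next_ns = cur_ns ? cur_ns * grow : 1;
		if (next_ns > threshold_ns)
			next_ns = threshold_ns;	/* never exceed the maximum */
	} else {
		next_ns = cur_ns / shrink;
	}
	return next_ns;
}
```

In this model only the threshold bounds the behaviour, which is roughly the single knob Quan asks to keep; the multiplier and divisor could then be fixed constants.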
Re: [Regression/XFS/PM] Freeze tasks failed in xfsaild
On Tue, Nov 14, 2017 at 11:39:59AM +0800, Yu Chen wrote:
> Hi Dave,
> On Tue, Nov 14, 2017 at 09:52:16AM +1100, Dave Chinner wrote:
> > On Mon, Nov 13, 2017 at 06:31:39PM +0800, Yu Chen wrote:
> > > Hi all,
> > > Currently we are running hibernation stress test on a server
> > > and unfortunately after 48 rounds of cycling, it fails at a
> > > early stage that, the xfs task refuses to be frozen by the system:
> > >
> > > [ 1934.221653] PM: Syncing filesystems ...
> > > [ 1934.661517] PM: done.
> > > [ 1934.664067] Freezing user space processes ... (elapsed 0.003 seconds) done.
> > > [ 1934.675251] OOM killer disabled.
> > > [ 1934.724317] PM: Preallocating image memory... done (allocated 6906555 pages)
> > > [ 1954.666378] PM: Allocated 27626220 kbytes in 19.93 seconds (1386.16 MB/s)
> > > [ 1954.673939] Freezing remaining freezable tasks ...
> > > [ 1974.681089] Freezing of tasks failed after 20.001 seconds (1 tasks refusing to freeze, wq_busy=0):
> > > [ 1974.691169] xfsaild/dm-1D0 1362 2 0x0080
> > > [ 1974.697283] Call Trace:
> > > [ 1974.700014] __schedule+0x3be/0x830
> > > [ 1974.703898] schedule+0x36/0x80
> > > [ 1974.707440] _xfs_log_force+0x143/0x280 [xfs]
> > > [ 1974.712295] ? schedule_timeout+0x16b/0x350
> > > [ 1974.716953] ? wake_up_q+0x80/0x80
> > > [ 1974.720752] ? xfsaild+0x16f/0x770 [xfs]
> > > [ 1974.725134] xfs_log_force+0x2c/0x80 [xfs]
> > > [ 1974.729707] xfsaild+0x16f/0x770 [xfs]
> > > [ 1974.733885] kthread+0x109/0x140
> > > [ 1974.737480] ? kthread+0x109/0x140
> > > [ 1974.741271] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> > > [ 1974.747284] ? kthread_park+0x60/0x60
> > > [ 1974.751354] ret_from_fork+0x25/0x30
> > > [ 1974.755366] Restarting kernel threads ... done.
> > > [ 1978.259907] OOM killer enabled.
> > > [ 1978.263405] Restarting tasks ... done.
> > >
> > > The reason for this failure might be that,
> > > while the kernel thread xfsaild/dm-1 is waiting for
> > > xfs-buf/dm-1 to wake it up, however the latter
> > > has already been frozen, thus xfsaild/dm-1 never
> > > has a chance to be woken up and get froze. (Although
> > > the xfsaild/dm-1 remains in TASK_UNINTERRUPTIBLE, which
> > > is quite similar to 'frozen'.)
> >
> > Should be fixed by this commit in the for-next branch:
> >
> > 0bd89676c4fe xfs: check kthread_should_stop() after the setting of task
> > state
> >
> > That should get merged into 4.15 with the next merge...
> >
> I did not quite catch why above commit would fix the issue here,
> according to
> https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=for-next&id=0bd89676c4fed53b003025bc4a5200861ac5d8ef
> it tries to address a race condition between umount and xfsaild on
> checking the kthread_should_stop() in order not to make
> xfsaild falling asleep indefinitely.

Argh, got my threads slightly crossed there.

> But in our case, the xfsaild is waiting for the xfs-buf to wake
> it up, and is nothing related to the kthread_should_stop() checking
> here, did I miss something?

Similar symptoms - the symptom that was fixed by the commit I
mentioned was the xfsaild getting stuck in sleeping forever and so
never seeing the KTHREAD_STOP bit - it was a "set bit vs wakeup"
race caused by the fact that we didn't reset the state of the task
correctly after wakeup. You said:

>> (Although the xfsaild/dm-1 remains in TASK_UNINTERRUPTIBLE, which
>> is quite similar to 'frozen'.)

So from a quick look, it seemed like a similar race condition. I
missed the *un* part of the task state, though.
TASK_UNINTERRUPTIBLE implies waiting for IO completion, which is
what _xfs_log_force() is doing.

So, follow the other branch of the discussion: hibernation needs to
freeze filesystems so they can quiesce gracefully before the kernel
starts shutting down the infrastructure the filesystem relies on...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v2] coccinelle: fix parallel build with CHECK=scripts/coccicheck
Hi Julia,

2017-11-14 1:45 GMT+09:00 Julia Lawall :
>
>
> On Tue, 14 Nov 2017, Masahiro Yamada wrote:
>
>> Hi Julia,
>>
>>
>> 2017-11-14 0:30 GMT+09:00 Julia Lawall :
>> >
>> >
>> > On Thu, 9 Nov 2017, Masahiro Yamada wrote:
>> >
>> >> The command "make -j8 C=1 CHECK=scripts/coccicheck" produces lots of
>> >> "coccicheck failed" error messages.
>> >>
>> >> I do not know the coccinelle internals, but I guess --jobs does not
>> >> work well if spatch is invoked from Make running in parallel.
>> >> Disable --jobs in this case.
>> >
>> > Why is this change under:
>> >
>> > if [ "$C" = "1" -o "$C" = "2" ];
>> >
>> > The coccicheck failed messages come also if one runs Coccinelle on the
>> > entire kernel.
>>
>> As far as I tested, "coccicheck failed" error only happens
>> when ONLINE=1.
>>
>>
>> make -j8 C=1 CHECK=scripts/coccicheck
>> COCCI=scripts/coccinelle/misc/bugon.cocci
>>
>> emits lots of errors.
>>
>>
>> make -j8 coccicheck COCCI=scripts/coccinelle/misc/bugon.cocci
>>
>> is fine.
>>
>>
>> Have you tested it?
>> Do you mean you got a different result from mine?
>
> I agree with your results, with respect to the number of errors.
>
> julia
>

So, what shall we do?

If you do not like to fix it
(or you can fix coccinelle itself),
I can take back this patch.

I am not a coccinelle developer,
so setting USE_JOBS="no" is the best I can do.

--
Best Regards
Masahiro Yamada
Re: [PATCH] quota: be aware of error from dquot_initialize
On 2017/11/13 17:18, Jan Kara wrote:
> On Mon 13-11-17 11:31:48, Chao Yu wrote:
>> Commit 6184fc0b8dd7 ("quota: Propagate error from ->acquire_dquot()")
>> missed to handle error from dquot_initialize in dquot_file_open, fix it.
>>
>> Signed-off-by: Chao Yu
>
> Good spotting. I've added the patch to my tree.

Thanks for queuing the patch. :)

BTW, I noticed that add_dquot_ref also does not handle the error from
__dquot_initialize. Should we handle it too?

Thanks,

>
> 								Honza
>
>> ---
>>  fs/quota/dquot.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
>> index 50b0556a124f..80002c094647 100644
>> --- a/fs/quota/dquot.c
>> +++ b/fs/quota/dquot.c
>> @@ -2133,7 +2133,7 @@ int dquot_file_open(struct inode *inode, struct file *file)
>>
>>  	error = generic_file_open(inode, file);
>>  	if (!error && (file->f_mode & FMODE_WRITE))
>> -		dquot_initialize(inode);
>> +		error = dquot_initialize(inode);
>>  	return error;
>>  }
>>  EXPORT_SYMBOL(dquot_file_open);
>> --
>> 2.15.0.55.gc2ece9dc4de6
>>
>>
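The effect of the one-line change above can be seen in a small userspace model. Everything prefixed with `model_` below is a stand-in invented for this sketch (the real functions take `struct inode` and `struct file`); the only point is the difference between dropping and propagating the return value.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for an inode; the flag lets the caller force a quota failure. */
struct model_inode {
	bool quota_init_fails;
};

/* Stand-in for generic_file_open(): always succeeds in this model. */
static int model_generic_file_open(struct model_inode *inode)
{
	(void)inode;
	return 0;
}

/* Stand-in for dquot_initialize(): may fail with a negative errno. */
static int model_dquot_initialize(struct model_inode *inode)
{
	return inode->quota_init_fails ? -EIO : 0;
}

/* Shape of the code before the patch: the error is silently dropped. */
static int model_file_open_before(struct model_inode *inode, bool for_write)
{
	int error = model_generic_file_open(inode);

	if (!error && for_write)
		model_dquot_initialize(inode);
	return error;
}

/* Shape of the code after the patch: the error reaches the caller. */
static int model_file_open_after(struct model_inode *inode, bool for_write)
{
	int error = model_generic_file_open(inode);

	if (!error && for_write)
		error = model_dquot_initialize(inode);
	return error;
}
```

With `quota_init_fails` set, the "before" variant still reports success for a write open, while the "after" variant returns the negative error code.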
Re: [Regression/XFS/PM] Freeze tasks failed in xfsaild
Hi Dave,
On Tue, Nov 14, 2017 at 09:52:16AM +1100, Dave Chinner wrote:
> On Mon, Nov 13, 2017 at 06:31:39PM +0800, Yu Chen wrote:
> > Hi all,
> > Currently we are running hibernation stress test on a server
> > and unfortunately after 48 rounds of cycling, it fails at a
> > early stage that, the xfs task refuses to be frozen by the system:
> >
> > [ 1934.221653] PM: Syncing filesystems ...
> > [ 1934.661517] PM: done.
> > [ 1934.664067] Freezing user space processes ... (elapsed 0.003 seconds) done.
> > [ 1934.675251] OOM killer disabled.
> > [ 1934.724317] PM: Preallocating image memory... done (allocated 6906555 pages)
> > [ 1954.666378] PM: Allocated 27626220 kbytes in 19.93 seconds (1386.16 MB/s)
> > [ 1954.673939] Freezing remaining freezable tasks ...
> > [ 1974.681089] Freezing of tasks failed after 20.001 seconds (1 tasks refusing to freeze, wq_busy=0):
> > [ 1974.691169] xfsaild/dm-1D0 1362 2 0x0080
> > [ 1974.697283] Call Trace:
> > [ 1974.700014] __schedule+0x3be/0x830
> > [ 1974.703898] schedule+0x36/0x80
> > [ 1974.707440] _xfs_log_force+0x143/0x280 [xfs]
> > [ 1974.712295] ? schedule_timeout+0x16b/0x350
> > [ 1974.716953] ? wake_up_q+0x80/0x80
> > [ 1974.720752] ? xfsaild+0x16f/0x770 [xfs]
> > [ 1974.725134] xfs_log_force+0x2c/0x80 [xfs]
> > [ 1974.729707] xfsaild+0x16f/0x770 [xfs]
> > [ 1974.733885] kthread+0x109/0x140
> > [ 1974.737480] ? kthread+0x109/0x140
> > [ 1974.741271] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> > [ 1974.747284] ? kthread_park+0x60/0x60
> > [ 1974.751354] ret_from_fork+0x25/0x30
> > [ 1974.755366] Restarting kernel threads ... done.
> > [ 1978.259907] OOM killer enabled.
> > [ 1978.263405] Restarting tasks ... done.
> >
> > The reason for this failure might be that,
> > while the kernel thread xfsaild/dm-1 is waiting for
> > xfs-buf/dm-1 to wake it up, however the latter
> > has already been frozen, thus xfsaild/dm-1 never
> > has a chance to be woken up and get froze. (Although
> > the xfsaild/dm-1 remains in TASK_UNINTERRUPTIBLE, which
> > is quite similar to 'frozen'.)
>
> Should be fixed by this commit in the for-next branch:
>
> 0bd89676c4fe xfs: check kthread_should_stop() after the setting of task state
>
> That should get merged into 4.15 with the next merge...
>
I did not quite catch why the above commit would fix the issue here.
According to
https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=for-next&id=0bd89676c4fed53b003025bc4a5200861ac5d8ef
it tries to address a race condition between umount and xfsaild on
checking kthread_should_stop(), so that xfsaild does not fall
asleep indefinitely.
But in our case, xfsaild is waiting for xfs-buf to wake
it up, which is unrelated to the kthread_should_stop() check
here. Did I miss something?

Thanks,
	Yu
Re: [PATCH] tick/broadcast: Remove redundant code in tick_check_new_device()
On 2017/11/14 0:54, Thomas Gleixner wrote:
> On Wed, 8 Nov 2017, Zhenzhong Duan wrote:
>
>> There is no way a timer used as broadcast clockevent device is also used
>> as percpu tick clockevent device currently.
>
> Correct.
>
>> It's better to put related code in tick_install_broadcast_device(), but
>> I feel it's harmless to give it back to the clockevents layer. Pls
>> correct me if I'm wrong.
>
> You already established, that it _cannot_ be the broadcast device and the
> per cpu device at the same time. So that condition can never be true.
>
> What do you want to put into tick_install_broadcast_device()? This second
> paragraph doesn't make sense, unless I'm missing something.

I couldn't find in the history logs the reason behind the comment saying
'If the current device is the broadcast device, do not give it back to
the clockevents layer !'

If there is one, tick_install_broadcast_device() is a proper place for it.
If not, I can resend the patch with a fresh description, pls confirm.

--
thanks
zduan
[PATCH] uapi: fix linux/tls.h userspace compilation error
Move inclusion of a private kernel header from uapi/linux/tls.h to its
only user - net/tls.h, to fix the following linux/tls.h userspace
compilation error:

/usr/include/linux/tls.h:41:21: fatal error: net/tcp.h: No such file or directory

Since up to this point uapi/linux/tls.h was totally unusable for
userspace, clean up this header file further by moving other redundant
includes to net/tls.h.

Fixes: 3c4d7559159b ("tls: kernel TLS support")
Cc: # v4.13+
Signed-off-by: Dmitry V. Levin
---
 include/net/tls.h        | 4
 include/uapi/linux/tls.h | 4
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index b89d397dd62f..c06db1eadac2 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -35,6 +35,10 @@
 #define _TLS_OFFLOAD_H

 #include
+#include
+#include
+#include
+#include

 #include

diff --git a/include/uapi/linux/tls.h b/include/uapi/linux/tls.h
index d5e0682ab837..293b2cdad88d 100644
--- a/include/uapi/linux/tls.h
+++ b/include/uapi/linux/tls.h
@@ -35,10 +35,6 @@
 #define _UAPI_LINUX_TLS_H

 #include
-#include
-#include
-#include
-#include

 /* TLS socket options */
 #define TLS_TX	1	/* Set transmit parameters */
--
ldv
Re: KASAN: use-after-free Read in rds_tcp_dev_event
On 11/7/17 12:28 PM, syzbot wrote:
> Hello,
>
> syzkaller hit the following crash on
> 287683d027a3ff83feb6c7044430c79881664ecf
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
>
> ==
> BUG: KASAN: use-after-free in rds_tcp_kill_sock net/rds/tcp.c:530 [inline]
> BUG: KASAN: use-after-free in rds_tcp_dev_event+0xc01/0xc90 net/rds/tcp.c:568
> Read of size 8 at addr 8801cd879200 by task kworker/u4:3/147
>
> CPU: 0 PID: 147 Comm: kworker/u4:3 Not tainted 4.14.0-rc7+ #156
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Workqueue: netns cleanup_net
> Call Trace:
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:52
>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>  kasan_report_error mm/kasan/report.c:351 [inline]
>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
>  rds_tcp_kill_sock net/rds/tcp.c:530 [inline]
>  rds_tcp_dev_event+0xc01/0xc90 net/rds/tcp.c:568

The issue here is that we are trying to reference a network namespace
(struct net *) that is long gone (i.e., L532 below -- c_net is the
culprit).

528         spin_lock_irq(&rds_tcp_conn_lock);
529         list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node) {
530                 struct net *c_net = tc->t_cpath->cp_conn->c_net;
531
532                 if (net != c_net || !tc->t_sock)
533                         continue;
534                 if (!list_has_conn(&tmp_list, tc->t_cpath->cp_conn))
535                         list_move_tail(&tc->t_tcp_node, &tmp_list);
536         }
537         spin_unlock_irq(&rds_tcp_conn_lock);
538         list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node) {
539                 rds_tcp_conn_paths_destroy(tc->t_cpath->cp_conn);
540                 rds_conn_destroy(tc->t_cpath->cp_conn);
541         }

When a network namespace is deleted, devices within that namespace are
unregistered and removed one by one. RDS is notified about this event
through rds_tcp_dev_event() callback.
When the loopback device is removed from the namespace, the above RDS
callback function destroys all the RDS connections in that namespace.

The loop@L529 above walks through each rds_tcp connection in the global
list (rds_tcp_conn_list) to see if that connection belongs to the
namespace in question. It collects all such connections and destroys
them (L538-540). However, it leaves behind some of the rds_tcp
connections that share the same underlying RDS connection (L534 and
535). These connections, which hold a pointer to the stale network
namespace, are left behind in the global list. When the 2nd network
namespace is deleted, we will hit the above stale pointer and hit the
UAF panic.

I think we should move away from the global list to a per-namespace
list. The global list is used only in two places (both of which are
per-namespace operations):

 - to destroy all the RDS connections belonging to a namespace when the
   network namespace is being deleted.
 - to reset all the RDS connections when socket parameters for a
   namespace are modified using sysctl

Thanks,
~Girish
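The failure mode described above, and why a per-namespace list would avoid it, can be modelled in plain userspace C. This is a deliberately simplified sketch and not RDS code: all `model_*` names are invented, arrays replace the kernel lists, and a `destroyed` flag stands in for freed memory so the program itself never touches a dangling pointer.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX 8

struct model_conn {
	int net_id;		/* owning namespace */
	bool destroyed;		/* stands in for "already freed" */
};

struct model_tc {
	struct model_conn *conn;	/* shared by all paths of one connection */
	bool on_list;
};

/* Count entries still listed whose connection is already destroyed. */
static size_t model_count_stale(const struct model_tc *tcs, size_t n)
{
	size_t stale = 0;

	for (size_t i = 0; i < n; i++)
		if (tcs[i].on_list && tcs[i].conn->destroyed)
			stale++;
	return stale;
}

/* Global-list walk as described: only one entry per connection is moved. */
static void model_kill_global(struct model_tc *tcs, size_t n, int net_id)
{
	struct model_conn *tmp[MODEL_MAX];
	size_t ntmp = 0;

	for (size_t i = 0; i < n && ntmp < MODEL_MAX; i++) {
		bool seen = false;

		if (!tcs[i].on_list || tcs[i].conn->net_id != net_id)
			continue;
		for (size_t j = 0; j < ntmp; j++)
			if (tmp[j] == tcs[i].conn)
				seen = true;
		if (!seen) {
			tmp[ntmp++] = tcs[i].conn;
			tcs[i].on_list = false;
		}
	}
	for (size_t j = 0; j < ntmp; j++)
		tmp[j]->destroyed = true;
}

/* Per-namespace list: every entry on it belongs to the dying namespace. */
static void model_kill_per_net(struct model_tc *net_list, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!net_list[i].on_list)
			continue;
		net_list[i].on_list = false;
		net_list[i].conn->destroyed = true;
	}
}
```

With two entries sharing one connection in the dying namespace, `model_kill_global()` leaves one listed entry pointing at a destroyed connection, whereas tearing down a per-namespace list leaves none.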
Re: [Regression/XFS/PM] Freeze tasks failed in xfsaild
On Mon, Nov 13, 2017 at 09:14:14PM +0100, Luis R. Rodriguez wrote: > On Mon, Nov 13, 2017 at 06:31:39PM +0800, Yu Chen wrote: > > The xfs-buf/dm-1 should be freezed according to > > commit 8018ec083c72 ("xfs: mark all internal workqueues > > as freezable"), thus a easier way might be have to revert > > commit 18f1df4e00ce ("xfs: Make xfsaild freezeable > > again") for now, after this reverting the xfsaild/dm-1 > > becomes non-freezable again, thus pm does not see this > > thread - unless we find a graceful way to treat xfsaild/dm-1 > > as 'frozen' if it is waiting for an already 'frozen' task, > > or if the filesystem freeze is added in. > > > > Any comments would be much appreciated. > > Reverting 18f1df4e00ce ("xfs: Make xfsaild freezeable again") > would break the proper form of the kthread for it to be freezable. > This "form" is not defined formally, and sadly its just a form > learned throughout years over different kthreads in the kernel. > > I'm also not convinced all our hibernation / suspend woes would be fixed by > reverting this commit, its why I worked instead on formalizing a proper freeze > / thaw, which a lot of filesystems already implement prior to system > hibernation / suspend / resume [0]. > > I'll be respinning this series without the last patch, provided I'm able to > ensure I don't need the ext[234] hack I did in that thread. Can you test the > first 3 patches *only* on that series and seeing if that helps on your XFS > front as well? > > [0] https://lkml.kernel.org/r/20171003185313.1017-1-mcg...@kernel.org > > Luis Thanks for the comment Luis, Yes, I agree the freezing of filesystem is a proper/thorough fix for such kind issues, but as Dan said, it might be a little risky for us to to deploy it on our products currently, unless it is in the mainline/stable branch. Although the XFS issue might not be 100% reproducible, we can help test the patch set while seeking for a lightweight 'fix'. Thanks, Yu
[PATCH] perf annotate: Remove precision for mnemonics
There are many instructions, esp on powerpc, whose mnemonics are longer
than 6 characters. Using precision limit causes truncation of such
mnemonics.

Fix this by removing precision limit. Note that, 'width' is still 6, so
alignment won't get affected for length <= 6.

Before:
  li     r11,-1
  xscvdp vs1,vs1
  add.   r10,r10,r11

After:
  li     r11,-1
  xscvdpsxds vs1,vs1
  add.   r10,r10,r11

Reported-by: Donald Stence
Signed-off-by: Ravi Bangoria
---
 tools/perf/util/annotate.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 54321b947de8..6462a7423beb 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -165,7 +165,7 @@ static void ins__delete(struct ins_operands *ops)
 static int ins__raw_scnprintf(struct ins *ins, char *bf, size_t size,
 			      struct ins_operands *ops)
 {
-	return scnprintf(bf, size, "%-6.6s %s", ins->name, ops->raw);
+	return scnprintf(bf, size, "%-6s %s", ins->name, ops->raw);
 }

 int ins__scnprintf(struct ins *ins, char *bf, size_t size,
@@ -230,12 +230,12 @@ static int call__scnprintf(struct ins *ins, char *bf, size_t size,
 			   struct ins_operands *ops)
 {
 	if (ops->target.name)
-		return scnprintf(bf, size, "%-6.6s %s", ins->name, ops->target.name);
+		return scnprintf(bf, size, "%-6s %s", ins->name, ops->target.name);

 	if (ops->target.addr == 0)
 		return ins__raw_scnprintf(ins, bf, size, ops);

-	return scnprintf(bf, size, "%-6.6s *%" PRIx64, ins->name, ops->target.addr);
+	return scnprintf(bf, size, "%-6s *%" PRIx64, ins->name, ops->target.addr);
 }

 static struct ins_ops call_ops = {
@@ -299,7 +299,7 @@ static int jump__scnprintf(struct ins *ins, char *bf, size_t size,
 		c++;
 	}

-	return scnprintf(bf, size, "%-6.6s %.*s%" PRIx64,
+	return scnprintf(bf, size, "%-6s %.*s%" PRIx64,
 			 ins->name, c ? c - ops->raw : 0, ops->raw,
 			 ops->target.offset);
 }
@@ -372,7 +372,7 @@ static int lock__scnprintf(struct ins *ins, char *bf, size_t size,
 	if (ops->locked.ins.ops == NULL)
 		return ins__raw_scnprintf(ins, bf, size, ops);

-	printed = scnprintf(bf, size, "%-6.6s ", ins->name);
+	printed = scnprintf(bf, size, "%-6s ", ins->name);
 	return printed + ins__scnprintf(&ops->locked.ins, bf + printed,
 					size - printed, ops->locked.ops);
 }
@@ -448,7 +448,7 @@ static int mov__parse(struct arch *arch, struct ins_operands *ops, struct map *m
 static int mov__scnprintf(struct ins *ins, char *bf, size_t size,
 			  struct ins_operands *ops)
 {
-	return scnprintf(bf, size, "%-6.6s %s,%s", ins->name,
+	return scnprintf(bf, size, "%-6s %s,%s", ins->name,
 			 ops->source.name ?: ops->source.raw,
 			 ops->target.name ?: ops->target.raw);
 }
@@ -488,7 +488,7 @@ static int dec__parse(struct arch *arch __maybe_unused, struct ins_operands *ops
 static int dec__scnprintf(struct ins *ins, char *bf, size_t size,
 			  struct ins_operands *ops)
 {
-	return scnprintf(bf, size, "%-6.6s %s", ins->name,
+	return scnprintf(bf, size, "%-6s %s", ins->name,
 			 ops->target.name ?: ops->target.raw);
 }

@@ -500,7 +500,7 @@ static struct ins_ops dec_ops = {
 static int nop__scnprintf(struct ins *ins __maybe_unused, char *bf, size_t size,
 			  struct ins_operands *ops __maybe_unused)
 {
-	return scnprintf(bf, size, "%-6.6s", "nop");
+	return scnprintf(bf, size, "%-6s", "nop");
 }

 static struct ins_ops nop_ops = {
@@ -990,7 +990,7 @@ void disasm_line__free(struct disasm_line *dl)
 int disasm_line__scnprintf(struct disasm_line *dl, char *bf, size_t size, bool raw)
 {
 	if (raw || !dl->ins.ops)
-		return scnprintf(bf, size, "%-6.6s %s", dl->ins.name, dl->ops.raw);
+		return scnprintf(bf, size, "%-6s %s", dl->ins.name, dl->ops.raw);

 	return ins__scnprintf(&dl->ins, bf, size, &dl->ops);
 }
--
2.13.6
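The difference between the two format strings in this patch can be reproduced outside the kernel. The sketch below uses standard `snprintf()` as a stand-in for the tools' `scnprintf()` (the two differ in their return value on truncation, which does not matter here); the helper names `model_format_truncating()` and `model_format_full()` are made up for the example.

```c
#include <stdio.h>

/* "%-6.6s": pad to width 6 AND cut the mnemonic at 6 characters. */
static int model_format_truncating(char *bf, size_t size,
				   const char *name, const char *operands)
{
	return snprintf(bf, size, "%-6.6s %s", name, operands);
}

/* "%-6s": pad to width 6, but print longer mnemonics in full. */
static int model_format_full(char *bf, size_t size,
			     const char *name, const char *operands)
{
	return snprintf(bf, size, "%-6s %s", name, operands);
}
```

For a short mnemonic such as `li` both helpers produce the same padded output, so alignment is unchanged; only mnemonics longer than six characters, like `xscvdpsxds`, differ, matching the Before/After listing in the commit message.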
Re: video: fbdev: Convert timers to use timer_setup()
On Mon, Nov 13, 2017 at 5:45 PM, Guenter Roeck wrote: > On Tue, Oct 24, 2017 at 08:20:26AM -0700, Kees Cook wrote: >> In preparation for unconditionally passing the struct timer_list pointer to >> all timer callbacks, switch to using the new timer_setup() and from_timer() >> to pass the timer pointer explicitly. One tracking pointer was added, and >> one initialization was cleaned up. >> >> Cc: Bartlomiej Zolnierkiewicz >> Cc: Benjamin Herrenschmidt >> Cc: Tomi Valkeinen >> Cc: David Lechner >> Cc: Daniel Vetter >> Cc: Sean Paul >> Cc: Jean Delvare >> Cc: Hans de Goede >> Cc: "Gustavo A. R. Silva" >> Cc: linux-fb...@vger.kernel.org >> Cc: dri-de...@lists.freedesktop.org >> Cc: linux-o...@vger.kernel.org >> Signed-off-by: Kees Cook > > Hi Kees, > > this patch causes a large number of qemu crashes. > > Unable to handle kernel NULL pointer dereference at virtual address 0194 > pgd = c0004000 > [0194] *pgd= > Internal error: Oops: 5 [#1] ARM > Modules linked in: > CPU: 0 PID: 0 Comm: swapper Not tainted 4.14.0-next-20171113 #1 > Hardware name: ARM-Versatile (Device Tree Support) > task: c04df238 task.stack: c04da000 > PC is at queue_work_on+0x1c/0x48 > ... > [] (queue_work_on) from [] > (cursor_timer_handler+0x20/0x44) > [] (cursor_timer_handler) from [] > (call_timer_fn+0x24/0xa0) > [] (call_timer_fn) from [] (expire_timers+0x7c/0x8c) > [] (expire_timers) from [] (run_timer_softirq+0x88/0x184) > [] (run_timer_softirq) from [] (__do_softirq+0xe0/0x238) > [] (__do_softirq) from [] (irq_exit+0xb4/0xd0) > [] (irq_exit) from [] (__handle_domain_irq+0x50/0xa8) > [] (__handle_domain_irq) from [] > (vic_handle_irq+0x54/0x94) > [] (vic_handle_irq) from [] (__irq_svc+0x68/0x84) > > See > http://kerneltests.org/builders/qemu-arm-next/builds/806/steps/qemubuildcommand/logs/stdio > for complete crash logs. > > Reverting the patch fixes the problem. > > Images for various other architectures crash as well in next-20171113, > but I didn't bisect those. 
> It looks like there are additional (possibly irq
> related) problems in the latest -next kernel; I don't know if those are
> also related to timer changes.

I think this is already fixed here:
https://marc.info/?l=linux-fbdev&m=151056635200583&w=2

If not, please let me know! :)

-Kees

--
Kees Cook
Pixel Security
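For context on the conversion discussed in this thread: with the new style the callback receives a pointer to the timer itself and has to recover the structure that embeds it. The following userspace model shows that pattern with a hand-written `container_of`-style macro. It is only an illustration; `model_timer`, `model_timer_setup()`, `model_from_timer()` and `struct model_cursor` are invented names, not the kernel's timer API or the fbcon structures.

```c
#include <stddef.h>

/* Minimal stand-in for a timer that passes itself to its callback. */
struct model_timer {
	void (*function)(struct model_timer *t);
};

/* Recover the enclosing structure from a pointer to its embedded timer. */
#define model_from_timer(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical driver state with an embedded timer. */
struct model_cursor {
	int blink_count;
	struct model_timer cursor_timer;
};

static void model_timer_setup(struct model_timer *t,
			      void (*fn)(struct model_timer *))
{
	t->function = fn;
}

/* Callback: no separate data argument, only the timer pointer. */
static void model_cursor_handler(struct model_timer *t)
{
	struct model_cursor *cur =
		model_from_timer(t, struct model_cursor, cursor_timer);

	cur->blink_count++;
}

/* Fire the timer once by hand and report how often the handler ran. */
static int model_run_cursor_demo(void)
{
	struct model_cursor cur = { .blink_count = 0 };

	model_timer_setup(&cur.cursor_timer, model_cursor_handler);
	cur.cursor_timer.function(&cur.cursor_timer);
	return cur.blink_count;
}
```

The pattern only works if every pointer the handler needs is reachable from the enclosing structure and has been initialised before the timer can fire; a missing back-pointer of that kind is one plausible way to end up with a NULL dereference like the one in the quoted oops, though the thread does not state the actual cause.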
Re: [intel-sgx-kernel-dev] [PATCH v5 11/11] intel_sgx: driver documentation
On Mon, 2017-11-13 at 21:45 +0200, Jarkko Sakkinen wrote: > Signed-off-by: Jarkko Sakkinen > --- > Documentation/index.rst | 1 + > Documentation/x86/intel_sgx.rst | 131 > > 2 files changed, 132 insertions(+) > create mode 100644 Documentation/x86/intel_sgx.rst > > diff --git a/Documentation/index.rst b/Documentation/index.rst > index cb7f1ba5b3b1..ccfebc260e04 100644 > --- a/Documentation/index.rst > +++ b/Documentation/index.rst > @@ -86,6 +86,7 @@ implementation. > :maxdepth: 2 > > sh/index > + x86/index > > Korean translations > --- > diff --git a/Documentation/x86/intel_sgx.rst > b/Documentation/x86/intel_sgx.rst > new file mode 100644 > index ..34bcf6a2a495 > --- /dev/null > +++ b/Documentation/x86/intel_sgx.rst > @@ -0,0 +1,131 @@ > +=== > +Intel(R) SGX driver > +=== > + > +Introduction > + > + > +Intel(R) SGX is a set of CPU instructions that can be used by > applications to > +set aside private regions of code and data. The code outside the > enclave is > +disallowed to access the memory inside the enclave by the CPU access > control. > +In a way you can think that SGX provides inverted sandbox. It > protects the > +application from a malicious host. > + > +There is a new hardware unit in the processor called Memory > Encryption Engine > +(MEE) starting from the Skylake microarchitecture. BIOS can define > one or many > +MEE regions that can hold enclave data by configuring them with > PRMRR registers. > + > +The MEE automatically encrypts the data leaving the processor > package to the MEE > +regions. The data is encrypted using a random key whose life-time is > exactly one > +power cycle. Not sure whether you should talk about MEE staff here. They are not in SDM and (thus) may potentially be changed in the future. 
> + > +You can tell if your CPU supports SGX by looking into > ``/proc/cpuinfo``: > + > + ``cat /proc/cpuinfo | grep sgx`` > + > +Enclave data types > +== > + > +SGX defines new data types to maintain information about the > enclaves and their > +security properties. > + > +The following data structures exist in MEE regions: > + > +* **Enclave Page Cache (EPC):** memory pages for protected code and > data > +* **Enclave Page Cache Map (EPCM):** meta-data for each EPC page > + > +The Enclave Page Cache holds following types of pages: > + > +* **SGX Enclave Control Structure (SECS)**: meta-data defining the > global > + properties of an enclave such as range of addresses it can access. > +* **Regular (REG):** containing code and data for the enclave. > +* **Thread Control Structure (TCS):** defines an entry point for a > hardware > + thread to enter into the enclave. The enclave can only be entered > through > + these entry points. > +* **Version Array (VA)**: an EPC page receives a unique 8 byte > version number > + when it is swapped, which is then stored into a VA page. A VA page > can hold up > + to 512 version numbers. > + > +Launch control > +== > + > +For launching an enclave, two structures must be provided for > ENCLS(EINIT): > + > +1. **SIGSTRUCT:** a signed measurement of the enclave binary. > +2. **EINITTOKEN:** the measurement, the public key of the signer and > various > + enclave attributes. This structure contains a MAC of its contents > using > + hardware derived symmetric key called *launch key*. > + > +The hardware platform contains a root key pair for signing the > SIGTRUCT > +for a *launch enclave* that is able to acquire the *launch key* for > +creating EINITTOKEN's for other enclaves. For the launch enclave > +EINITTOKEN is not needed because it is signed with the private root > key. 
> + > +There are two feature control bits associate with launch control > + > +* **IA32_FEATURE_CONTROL[0]**: locks down the feature control > register > +* **IA32_FEATURE_CONTROL[17]**: allow runtime reconfiguration of > + IA32_SGXLEPUBKEYHASHn MSRs that define MRSIGNER hash for the > launch > + enclave. Essentially they define a signing key that does not > require > + EINITTOKEN to be let run. > + > +The BIOS can configure IA32_SGXLEPUBKEYHASHn MSRs before feature > control > +register is locked. > + > +It could be tempting to implement launch control by writing the MSRs > +every time when an enclave is launched. This does not scale because > for > +generic case because BIOS might lock down the MSRs before handover > to > +the OS. > + > +Debug enclaves > +-- > + > +Enclave can be set as a *debug enclave* of which memory can be read > or written > +by using the ENCLS(EDBGRD) and ENCLS(EDBGWR) opcodes. The Intel > provided launch > +enclave provides them always a valid EINITTOKEN and therefore they > are a low > +hanging fruit way to try out SGX. > + > +Virtualization > +== > + > +Launch control > +-- > + > +The values for IA32_SGXLEPUBKEYHASHn MSRs cannot be em
RE: [patch v2 4/8] KVM: x86: add Intel processor trace cpuid emulataion
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index > > 0099e10..ef19a11 100644 > > --- a/arch/x86/kvm/cpuid.c > > +++ b/arch/x86/kvm/cpuid.c > > @@ -70,6 +70,7 @@ u64 kvm_supported_xcr0(void) > > /* These are scattered features in cpufeatures.h. */ > > #define KVM_CPUID_BIT_AVX512_4VNNIW 2 > > #define KVM_CPUID_BIT_AVX512_4FMAPS 3 > > +#define KVM_CPUID_BIT_INTEL_PT 25 > > This is not necessary, because there is no need to place processor tracing in > scattered features. Can you replace this hunk, and the KF usage below, with > the following patch? > Yes, this looks good to me. will fix in next version. Thanks, Luwei Kang > 8< - > From: Paolo Bonzini > Subject: [PATCH] x86: cpufeature: move processor tracing out of scattered > features > > Processor tracing is already enumerated in word 9 (CPUID[7,0].EBX), so do not > duplicate it in the scattered features word. > > Signed-off-by: Paolo Bonzini > --- > arch/x86/include/asm/cpufeatures.h | 3 ++- > arch/x86/kernel/cpu/scattered.c| 1 - > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/include/asm/cpufeatures.h > b/arch/x86/include/asm/cpufeatures.h > index 2519c6c801c9..839781e78763 100644 > --- a/arch/x86/include/asm/cpufeatures.h > +++ b/arch/x86/include/asm/cpufeatures.h > @@ -199,7 +199,7 @@ > #define X86_FEATURE_SME ( 7*32+10) /* AMD Secure Memory > Encryption */ > > #define X86_FEATURE_INTEL_PPIN ( 7*32+14) /* Intel Processor Inventory > Number */ > -#define X86_FEATURE_INTEL_PT ( 7*32+15) /* Intel Processor Trace */ > + > #define X86_FEATURE_AVX512_4VNNIW (7*32+16) /* AVX-512 Neural Network > Instructions */ #define > X86_FEATURE_AVX512_4FMAPS (7*32+17) /* AVX-512 Multiply Accumulation Single > precision */ > > @@ -238,6 +238,7 @@ > #define X86_FEATURE_AVX512IFMA ( 9*32+21) /* AVX-512 Integer Fused > Multiply-Add instructions */ > #define X86_FEATURE_CLFLUSHOPT ( 9*32+23) /* CLFLUSHOPT instruction */ > #define X86_FEATURE_CLWB ( 9*32+24) /* CLWB instruction */ > 
+#define X86_FEATURE_INTEL_PT ( 9*32+25) /* Intel Processor Trace */ > #define X86_FEATURE_AVX512PF ( 9*32+26) /* AVX-512 Prefetch */ > #define X86_FEATURE_AVX512ER ( 9*32+27) /* AVX-512 Exponential and > Reciprocal */ > #define X86_FEATURE_AVX512CD ( 9*32+28) /* AVX-512 Conflict Detection */ > diff --git a/arch/x86/kernel/cpu/scattered.c > b/arch/x86/kernel/cpu/scattered.c index 05459ad3db46..d0e69769abfd 100644 > --- a/arch/x86/kernel/cpu/scattered.c > +++ b/arch/x86/kernel/cpu/scattered.c > @@ -21,7 +21,6 @@ struct cpuid_bit { > static const struct cpuid_bit cpuid_bits[] = { > { X86_FEATURE_APERFMPERF, CPUID_ECX, 0, 0x0006, 0 }, > { X86_FEATURE_EPB, CPUID_ECX, 3, 0x0006, 0 }, > - { X86_FEATURE_INTEL_PT, CPUID_EBX, 25, 0x0007, 0 }, > { X86_FEATURE_AVX512_4VNNIW,CPUID_EDX, 2, 0x0007, 0 }, > { X86_FEATURE_AVX512_4FMAPS,CPUID_EDX, 3, 0x0007, 0 }, > { X86_FEATURE_CAT_L3, CPUID_EBX, 1, 0x0010, 0 }, > > > #define KF(x) bit(KVM_CPUID_BIT_##x) > > > > int kvm_update_cpuid(struct kvm_vcpu *vcpu) @@ -327,6 +328,7 @@ > > static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 > > function, > > unsigned f_invpcid = kvm_x86_ops->invpcid_supported() ? F(INVPCID) : 0; > > unsigned f_mpx = kvm_mpx_supported() ? F(MPX) : 0; > > unsigned f_xsaves = kvm_x86_ops->xsaves_supported() ? F(XSAVES) : 0; > > + unsigned f_intel_pt = kvm_x86_ops->pt_supported() ? KF(INTEL_PT) : > > +0; > > > > /* cpuid 1.edx */ > > const u32 kvm_cpuid_1_edx_x86_features =
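For readers unfamiliar with the (word*32 + bit) encoding used in cpufeatures.h, here is a minimal sketch of how such a number splits back into a word index and a bit position. The helper names (FEATURE, feature_word, feature_bit, feature_mask) are made up for illustration and are not kernel API; only the 9*32+25 value and the "word 9 is CPUID[7,0].EBX" mapping come from the patch above.

```c
#include <stdint.h>

/* Hypothetical helpers, not kernel API: a feature number is word*32 + bit. */
#define FEATURE(word, bit)  ((word) * 32 + (bit))

/* Index of the 32-bit capability word that holds the feature. */
static inline unsigned feature_word(unsigned feature)
{
	return feature / 32;
}

/* Bit position of the feature inside its capability word. */
static inline unsigned feature_bit(unsigned feature)
{
	return feature % 32;
}

/*
 * Mask to test against the register backing that word. For word 9 the
 * patch description says this is CPUID[7,0].EBX.
 */
static inline uint32_t feature_mask(unsigned feature)
{
	return UINT32_C(1) << feature_bit(feature);
}
```

With this encoding, FEATURE(9, 25) lands on bit 25 of word 9, which is the same bit the removed scattered.c entry ({ ..., CPUID_EBX, 25, 0x0007, 0 }) was reading by hand. That is why the duplicate entry is unnecessary.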
Re: [RFC PATCH v10 6/7] PCI / PM: Move acpi wakeup code to pci core
Hi Rafael, I'll answer some of it from my perspective, though Jeffy might have had different ideas (and answers) when he implemented this. On Wed, Nov 08, 2017 at 11:32:20PM +0100, Rafael J. Wysocki wrote: > On Friday, October 27, 2017 9:26:11 AM CET Jeffy Chen wrote: > > Move acpi wakeup code to pci core as pci_set_wakeup(), so that other > > platforms could reuse it. > > What exactly do you want to reuse? > > It looks like that's just several lines of code in acpi_pci_wakeup() > and acpi_pci_propagate_wakeup() which invoke ACPI-specific lower-level > functions, so IMO not worth it at all. The important part he's sharing here is the walking of the tree structure, in which it's possible for some parent along the way to handle wakeup for its children. I'm not sure how valuable nor how reusable that is. In this case (the Rockchip platforms Jeffy and I are working on), I think we really want to just support a single WAKE# pin for the whole system, so maybe the complexity isn't needed. The spec does describe that there are good reasons for supporting more than 1 WAKE# pin though (e.g., 1 per device), so it doesn't seem really wise to shoehorn ourselves into a single setup. But that can be implemented either via copying the "few" lines of tree-walking logic, or by trying to share them. > The structure for other platform code may be the same or similar, but > the details will almost certainly be different and I don't think that > having more callback pointers in pci_platform_pm_ops is necessarily better. I suppose that's reasonable. > > Also add .setup_dev() / .setup_host_bridge() / .cleanup() platform pm > > ops's callbacks to setup and cleanup pci devices and host bridge for > > wakeup. > > Why are they needed? The implementation is in patch 7, if you need more info about why, or want to suggest alternatives. The current set of hooks assumes that there is no state information or initialization needed for tracking actions of these platform PM hooks on a device.
For ACPI this works, because devices have "companion" acpi_dev's to handle everything, and the ACPI framework generally prepares GPE's for you, IIUC. For 'pci-mid', the operations happen to be trivial (and arguably wrong -- several are no-ops, where we might expect the platform to tell us whether the operation was actually supported or not). For device tree, there isn't really a canonical place to store this information, nor to initialize something like wakeup interrupts. Technically, we could shoehorn this into the .set_wakeup() call, but we'd probably rather not do things like request_irq() on every attempt to suspend/resume the system (among other reasons, we'd lose information that we might otherwise track in /proc/ or /sys/). Or the inverse of the above: where would you suggest initializing or tearing down the wakeirq? An alternative could be to include any necessary state into the pci_host_bridge or pci_dev and just inline the setup code into pci.c/remove.c (e.g., pci_register_host_bridge()) and pci-driver.c (pci_device_{probe,remove}()). But I'm not sure that's much more beautiful. Brian > > Signed-off-by: Jeffy Chen > > Thanks, > Rafael >
Re: [GIT pull] printk updates for 4.15
On Mon, Nov 13, 2017 at 5:18 PM, Linus Torvalds wrote: > > (b) just emit a "synchronization printk" every once in a while, which > is obviously also using the same standard time source, but the line > actually _says_ what the other time sources are. Side note: there's a few good obvious times to do this. After a NTP synchronization, after a resume, and maybe "every X hours if nothing else is happening". That "if nothing else is happening" would actually be a nice heartbeat thing for people who care about that. I've had machines crash overnight, and later wondered when it happened. Of course, these days other system journal sources tend to be so chatty that it doesn't much happen, but maybe it would still be appreciated in embedded places where that isn't yet the case.. And that "how often you do the time sync printk" really could be a kernel configuration thing then, but it wouldn't actually affect any existing machinery unlike the "let's just change what the printk header timestamp means". Linus
Re: [v8, 4/5] x86/xsave: Make XSAVE check the base CPUID features before enabling
On 11/13/2017 05:28 PM, Andi Kleen wrote: Guenter, Do you have a command line that reproduces it and the exact log output? Sorry, I forgot: Logs are at http://kerneltests.org/builders/qemu-x86_64-next. Guenter
Re: [RFC 6/7] x86/asm: Remap the TSS into the cpu entry area
On Mon, Nov 13, 2017 at 6:28 PM, Linus Torvalds wrote: > On Mon, Nov 13, 2017 at 6:25 PM, Andy Lutomirski wrote: >> On Mon, Nov 13, 2017 at 11:36 AM, Linus Torvalds >> wrote: >>> >>> I forget what the actual size is, but aligning the hardware TSS struct >>> to 128 bytes might be sufficient. It's not that big. >> >> 104 bytes, so it's probably already fine. For anything except an >> actual task switch, only the first 12 or so bytes matter. > > Note that historically, about half of the Intel errata (that don't get > fixed) are about TSS in oddball situations, mainly page crossers. > > I may be exaggerating just a tiny bit, but it's definitely a "don't do it". :) I suspect the major case where this matters is when we do a task switch, which only ever happens on 32-bit double faults, at which point we're already seriously screwed. But yes, I agree.
Re: linux-next: Signed-off-by missing for commit in the f2fs tree
On 11/13, Stephen Rothwell wrote: > Hi Jaegeuk, > > Commit > > c79d88f915ed ("f2fs: separate nat entry mem alloc from nat_tree_lock") > > is missing a Signed-off-by from its author. Thank you so much for pointing this out. I fixed it. Thanks, > > -- > Cheers, > Stephen Rothwell
Re: [RFC 6/7] x86/asm: Remap the TSS into the cpu entry area
On Mon, Nov 13, 2017 at 6:25 PM, Andy Lutomirski wrote: > On Mon, Nov 13, 2017 at 11:36 AM, Linus Torvalds > wrote: >> >> I forget what the actual size is, but aligning the hardware TSS struct >> to 128 bytes might be sufficient. It's not that big. > > 104 bytes, so it's probably already fine. For anything except an > actual task switch, only the first 12 or so bytes matter. Note that historically, about half of the Intel errata (that don't get fixed) are about TSS in oddball situations, mainly page crossers. I may be exaggerating just a tiny bit, but it's definitely a "don't do it". Linus
Re: [RFC 6/7] x86/asm: Remap the TSS into the cpu entry area
On Mon, Nov 13, 2017 at 11:22 AM, Dave Hansen wrote: > On 11/10/2017 08:05 PM, Andy Lutomirski wrote: >> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h >> index fbc9b7f4e35e..8a9ba5553cab 100644 >> --- a/arch/x86/include/asm/fixmap.h >> +++ b/arch/x86/include/asm/fixmap.h >> @@ -52,6 +52,13 @@ extern unsigned long __FIXADDR_TOP; >> struct cpu_entry_area >> { >> char gdt[PAGE_SIZE]; >> + >> + /* >> + * The gdt is just below cpu_tss and thus serves (on x86_64) as a >> + * a read-only guard page for the SYSENTER stack at the bottom >> + * of the TSS region. >> + */ >> + struct tss_struct tss; >> }; >> > > Aha, and here's the place that you need sizeof(tss_struct) to be nice > and page-aligned. > > But why don't we just do: > > char tss_space[PAGE_SIZE*something]; The idea is to save some space. The TSS plus IO bitmap is slightly over a page, so, if we're giving it a dedicated block of pages, we have almost a page of unused space. I want to use some of that space for the SYSENTER stack. To reliably detect overflow, that space should be at the beginning. It turns out that using almost a page is way too *big*: it masks bugs. I want anything nontrivial that accidentally runs on the SYSENTER stack to overflow and crash very quickly rather than having a decent chance of working or of causing nasty corruption with a crash down the road. So I'm going to make it much smaller and instead just add a build-time assertion that we don't cross a page boundary.
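A build-time check of the kind described above could look roughly like the following. This is only a sketch under stated assumptions, not the actual patch: the struct names, the 512-byte stack size and the FITS_IN_ONE_PAGE macro are hypothetical, the containing struct is assumed to start on a page boundary, 4 KiB pages are assumed, and the 104-byte hardware TSS size is the figure quoted elsewhere in this thread.

```c
#include <stddef.h>

#define PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the hardware TSS (104 bytes per the thread). */
struct hw_tss {
	unsigned char bytes[104];
};

/*
 * Hypothetical layout: a deliberately small SYSENTER stack placed first,
 * so that an overflow runs off the start of the region, followed by the
 * hardware TSS. Assumed to be placed at a page-aligned address.
 */
struct tss_region {
	unsigned char sysenter_stack[512];	/* illustrative size only */
	struct hw_tss hw;
};

/* True when the byte range [off, off + size) stays inside a single page. */
#define FITS_IN_ONE_PAGE(off, size) \
	((off) / PAGE_SIZE == ((off) + (size) - 1) / PAGE_SIZE)

/* Fails the build if the hardware TSS would straddle a page boundary. */
_Static_assert(FITS_IN_ONE_PAGE(offsetof(struct tss_region, hw),
				sizeof(struct hw_tss)),
	       "hardware TSS must not cross a page boundary");
```

The point of asserting instead of padding is that the stack can stay small (so accidental use overflows quickly) while a later change to the layout that pushes the TSS across a page still gets caught at compile time.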
Re: [PATCH 3/4] x86/umip: Identify the str and sldt instructions
On Mon, Nov 13, 2017 at 09:12:03AM +0100, Ingo Molnar wrote: > > * Ricardo Neri wrote: > > > The instructions str and sldt are not emulated in any case. Thus, it made > > sense to not implement functionality to identify them. However, a > > subsequent commit will introduce functionality to warn about the use of > > all the instructions that UMIP protects, not only those that are emulated. > > A first step for that is the ability to identify them. > > > > Plus, now that str and sldt are identified, we need to explicitly avoid > > their emulation (i.e., not rely on unsuccessful identification). Group > > together all the cases that we do not want to emulate: str, sldt and user > > long mode processes. > > Did you notice how in all your previous patches (both in the code and in the > changelogs) I have manually fixed up the capitalization of these instruction > mnemonics? I am sorry, I tried to see where you made these changes but I could not find any. I did a git diff of arch/x86/kernel/umip.c between the branch rneri/umip_v11 of my repository [1] and the master branch of the tip tree and I did not find any differences. Also, looking at the log of the master branch of the tip tree I see, for instance: commit 1e5db223696afa55e6a038fac638f759e1fdcc01 Author: Ricardo Neri Date: Sun Nov 5 18:27:52 2017 -0800 x86/umip: Add emulation code for UMIP instructions The feature User-Mode Instruction Prevention present in recent Intel processor prevents a group of instructions (sgdt, sidt, sldt, smsw, and str) from being executed with CPL > 0. Otherwise, a general protection fault is issued. ... The instruction mnemonics were not capitalized. Is the master branch the one where I can look at your fixes? > > The capitalized form is much more readable, especially with seriously > overloaded > acronyms such as 'str' ... I see.
> > You now repeat the same bad pattern, in fact you regress existing code: > > > - /* SLDT AND STR are not emulated */ > > > + /* Do not emulate sldt, str or user long mode processes. */ > > Please be more careful with such details, and please fix & resend this series. Sure, I will submit a v2 with capitalized mnemonics in both the code and the patch descriptions. I will be more careful in the future. Thanks and BR, Ricardo [1]. https://github.com/ricardon/tip/commits/rneri/umip_v11
Re: [RFC 6/7] x86/asm: Remap the TSS into the cpu entry area
On Mon, Nov 13, 2017 at 11:36 AM, Linus Torvalds wrote: > On Mon, Nov 13, 2017 at 11:22 AM, Dave Hansen wrote: >> >> Aha, and here's the place that you need sizeof(tss_struct) to be nice >> and page-aligned. > > No, it should _not_ be page-aligned. It should fit _within_ a page, > but if 'struct tss_struct' now has something else in front of it, then > page-aligning that is actually pointless. > > I forget what the actual size is, but aligning the hardware TSS struct > to 128 bytes might be sufficient. It's not that big. 104 bytes, so it's probably already fine. For anything except an actual task switch, only the first 12 or so bytes matter.
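The arithmetic behind "aligning to 128 bytes might be sufficient" can be checked with a few lines of C. This is an illustrative sketch only: crosses_page and count_crossings are hypothetical helpers, and 4 KiB pages are an assumption. Since 128 divides 4096, a 128-byte slot never straddles a page, and a 104-byte object fits inside one slot.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* True when an object of `size` bytes starting at `addr` spans two pages. */
static bool crosses_page(uintptr_t addr, size_t size)
{
	return addr / PAGE_SIZE != (addr + size - 1) / PAGE_SIZE;
}

/*
 * Walk every start address within one page that is a multiple of `align`
 * and count how many of them would make a `size`-byte object cross into
 * the next page.
 */
static unsigned count_crossings(size_t size, size_t align)
{
	unsigned n = 0;

	for (uintptr_t addr = 0; addr < PAGE_SIZE; addr += align)
		if (crosses_page(addr, size))
			n++;
	return n;
}
```

count_crossings(104, 128) is 0, whereas with only 8-byte alignment some start addresses near the end of the page do cross, so alignment alone (without a layout check) is what makes the 128-byte suggestion safe.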
Re: [PATCH 1/3] perf help: Document missing options
Hi Arnaldo and Namhyung :) On 11/14/2017 09:15 AM, Namhyung Kim wrote: Hi Arnaldo, On Mon, Nov 13, 2017 at 03:29:56PM -0300, Arnaldo Carvalho de Melo wrote: Em Sun, Nov 12, 2017 at 10:10:45AM +0900, Sihyeon Jang escreveu: Cc: Jiri Olsa Cc: Namhyung Kim Signed-off-by: Sihyeon Jang --- tools/perf/Documentation/perf-help.txt | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/tools/perf/Documentation/perf-help.txt b/tools/perf/Documentation/perf-help.txt index 5143918..bb605af 100644 --- a/tools/perf/Documentation/perf-help.txt +++ b/tools/perf/Documentation/perf-help.txt @@ -7,7 +7,7 @@ perf-help - display help information about perf SYNOPSIS -'perf help' [-a|--all] [COMMAND] +'perf help' [--all] [--man|--web|--info] [COMMAND] Can you try figuring out if this actually works? I tried here and it doesn't, it's an area we took "for free" when we copied the initial codebase from git.git, but I never looked at this area that much, now that I try: Yeah, I'm not sure we need to keep it. [acme@jouet linux]$ perf help Config with no key for man viewer: childrenError: wrong config key-value pair top.children=true [acme@jouet linux]$ Unsure if this is something that got broken by the 'perf config' patches, Taeung? Looks like a bug in 8e99b6d4533c ("tools include: Adopt strstarts() from the kernel"). Following patch should fix it: Thanks, Namhyung I also checked this error and tested the patch below. It seems that Namhyung has already fixed it! Thanks, Taeung From 096b78b437b5758acc025498e88d73d9d471b3c0 Mon Sep 17 00:00:00 2001 From: Namhyung Kim Date: Tue, 14 Nov 2017 09:10:43 +0900 Subject: [PATCH] perf help: Fix a bug during strstart() conversion The commit 8e99b6d4533c changed prefixcmp() to strstart() but missed to change the return value in some place. It makes perf help print annoying output even for sane config items like below: $ perf help '.root': unsupported man viewer sub key. ...
Fixes: 8e99b6d4533c ("tools include: Adopt strstarts() from the kernel") Cc: Taeung Song Signed-off-by: Namhyung Kim --- tools/perf/builtin-help.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/perf/builtin-help.c b/tools/perf/builtin-help.c index dbe4e4153bcf..ff51e5fc0daf 100644 --- a/tools/perf/builtin-help.c +++ b/tools/perf/builtin-help.c @@ -283,7 +283,7 @@ static int perf_help_config(const char *var, const char *value, void *cb) add_man_viewer(value); return 0; } - if (!strstarts(var, "man.")) + if (strstarts(var, "man.")) return add_man_viewer_info(var, value); return 0; @@ -313,7 +313,7 @@ static const char *cmd_to_page(const char *perf_cmd) if (!perf_cmd) return "perf"; - else if (!strstarts(perf_cmd, "perf")) + else if (strstarts(perf_cmd, "perf")) return perf_cmd; return asprintf(&s, "perf-%s", perf_cmd) < 0 ? NULL : s;
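The bug class fixed here comes from the two helpers having opposite return conventions. The sketch below uses local re-implementations (the _sketch names are hypothetical, written for illustration and not copied from the perf or git sources) to show why a mechanical rename from prefixcmp() to strstarts() needs the negation dropped: the first follows strcmp-style "0 means match", the second returns true on match.

```c
#include <stdbool.h>
#include <string.h>

/*
 * strcmp-style convention: returns 0 when `str` begins with `prefix`,
 * non-zero otherwise. Illustrative re-implementation only.
 */
static int prefixcmp_sketch(const char *str, const char *prefix)
{
	return strncmp(str, prefix, strlen(prefix));
}

/*
 * Boolean convention: returns true when `str` begins with `prefix`.
 * Illustrative re-implementation only.
 */
static bool strstarts_sketch(const char *str, const char *prefix)
{
	return strncmp(str, prefix, strlen(prefix)) == 0;
}

/* Old-style call site: "if (!prefixcmp(var, "man."))" means "has prefix". */
static bool is_man_key_old(const char *var)
{
	return !prefixcmp_sketch(var, "man.");
}

/* Converted call site: the '!' must go, otherwise the test is inverted. */
static bool is_man_key_new(const char *var)
{
	return strstarts_sketch(var, "man.");
}
```

With the negation left in place after the conversion, a key such as top.children would be treated as a man viewer key, which matches the "Config with no key for man viewer: children" error Arnaldo reported.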
Re: [RFC 1/7] x86/asm/64: Allocate and enable the SYSENTER stack
On Mon, Nov 13, 2017 at 11:07 AM, Dave Hansen wrote: > On 11/10/2017 08:05 PM, Andy Lutomirski wrote: >> This will simplify some future code changes that will want some >> temporary stack space in more places. It also lets us get rid of a >> SWAPGS_UNSAFE_STACK user. >> >> This does not depend on CONFIG_IA32_EMULATION because we'll want the >> stack space even without IA32 emulation. > > It was never clear to me why we don't use this on 64-bit today. Does > anybody know why? Nothing used it? As far as I can tell, the original x86_64 Linux port was a little bit more excited about IST than I think made sense. As a result, we use IST for #DB and #BP, which is IMO rather nasty and causes a lot more problems than it solves. But, since #DB uses IST, we don't actually need a real stack for SYSENTER (because SYSENTER with TF set will invoke #DB on the IST stack rather than the SYSENTER stack). I have old patches to stop using IST for #DB and #BP, but I never finished them.