Re: [PATCH 1/3] ARM: mach-omap2: remove bogus "or_module" from rx51-peripherals
* Paul Gortmaker[160722 07:02]: > [Re: [PATCH 1/3] ARM: mach-omap2: remove bogus "or_module" from > rx51-peripherals] On 21/07/2016 (Thu 23:41) Tony Lindgren wrote: > > > Hi, > > > > * Paul Gortmaker [160719 21:17]: > > > During unrelated work, attempting to remove an include of the > > > linux/module.h in favour of "struct module;" in order to reduce > > > header entanglement, we found doing so caused a build failure in > > > this file. > > > > We're planning to drop this file after v4.8-rc1 after I've > > verified that legacy booting still works at v4.8-rc1. > > > > Are you OK if I pick this patch into my omap-for-v4.8/legacy > > branch? Or if you have a minimal immutable branch against v4.7-rc1 > > with just this patch I can merge it in no problem. > > Is the legacy branch a contingency plan for the case where legacy > booting doesn't work? If so, that should be OK. Well it's just a branch of omap legacy booting related patches for v4.8. But looking at it now, looks like I already pushed out the removal of the last two remaining board files before I took few weeks off. But I did not add it to Linux next to keep things working until -rc1. > Having the patch present, or having the file deleted both take care of > my concern -- which was was introducing build regressions when adding > the gpio header cleanup into for-4.9 content. OK. As I've already pushed out the board-*.c removal branch, I suggest we just drop the $subject patch to avoid a merge conflict. Regards, Tony
Re: [PATCH 1/3] ARM: mach-omap2: remove bogus "or_module" from rx51-peripherals
* Paul Gortmaker [160722 07:02]: > [Re: [PATCH 1/3] ARM: mach-omap2: remove bogus "or_module" from > rx51-peripherals] On 21/07/2016 (Thu 23:41) Tony Lindgren wrote: > > > Hi, > > > > * Paul Gortmaker [160719 21:17]: > > > During unrelated work, attempting to remove an include of the > > > linux/module.h in favour of "struct module;" in order to reduce > > > header entanglement, we found doing so caused a build failure in > > > this file. > > > > We're planning to drop this file after v4.8-rc1 after I've > > verified that legacy booting still works at v4.8-rc1. > > > > Are you OK if I pick this patch into my omap-for-v4.8/legacy > > branch? Or if you have a minimal immutable branch against v4.7-rc1 > > with just this patch I can merge it in no problem. > > Is the legacy branch a contingency plan for the case where legacy > booting doesn't work? If so, that should be OK. Well it's just a branch of omap legacy booting related patches for v4.8. But looking at it now, looks like I already pushed out the removal of the last two remaining board files before I took few weeks off. But I did not add it to Linux next to keep things working until -rc1. > Having the patch present, or having the file deleted both take care of > my concern -- which was was introducing build regressions when adding > the gpio header cleanup into for-4.9 content. OK. As I've already pushed out the board-*.c removal branch, I suggest we just drop the $subject patch to avoid a merge conflict. Regards, Tony
[PATCH] net/irda: fix NULL pointer dereference on memory allocation failure
I ran into this: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: [#1] PREEMPT SMP KASAN CPU: 2 PID: 2012 Comm: trinity-c3 Not tainted 4.7.0-rc7+ #19 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 task: 8800b745f2c0 ti: 88011174 task.ti: 88011174 RIP: 0010:[] [] irttp_connect_request+0x36/0x710 RSP: 0018:880111747bb8 EFLAGS: 00010286 RAX: dc00 RBX: RCX: 69dd8358 RDX: 0009 RSI: 0027 RDI: 0048 RBP: 880111747c00 R08: R09: R10: 69dd8358 R11: 10759723 R12: R13: 88011a7e4780 R14: 0027 R15: FS: 7fc738404700() GS:88011af0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7fc737fdfb10 CR3: 000118087000 CR4: 06e0 Stack: 0200 880111747bd8 810ee611 880119f1f220 880119f1f4f8 880119f1f4f0 88011a7e4780 880119f1f232 880119f1f220 880111747d58 82bca542 Call Trace: [] irda_connect+0x562/0x1190 [] SYSC_connect+0x202/0x2a0 [] SyS_connect+0x9/0x10 [] do_syscall_64+0x19c/0x410 [] entry_SYSCALL64_slow_path+0x25/0x25 Code: 41 89 ca 48 89 e5 41 57 41 56 41 55 41 54 41 89 d7 53 48 89 fb 48 83 c7 48 48 89 fa 41 89 f6 48 c1 ea 03 48 83 ec 20 4c 8b 65 10 <0f> b6 04 02 84 c0 74 08 84 c0 0f 8e 4c 04 00 00 80 7b 48 00 74 RIP [] irttp_connect_request+0x36/0x710 RSP ---[ end trace 4cda2588bc055b30 ]--- The problem is that irda_open_tsap() can fail and leave self->tsap = NULL, and then irttp_connect_request() almost immediately dereferences it. Cc: sta...@vger.kernel.org Signed-off-by: Vegard Nossum--- net/irda/af_irda.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/net/irda/af_irda.c b/net/irda/af_irda.c index 923abd6..8d2f7c9 100644 --- a/net/irda/af_irda.c +++ b/net/irda/af_irda.c @@ -1024,8 +1024,11 @@ static int irda_connect(struct socket *sock, struct sockaddr *uaddr, } /* Check if we have opened a local TSAP */ - if (!self->tsap) - irda_open_tsap(self, LSAP_ANY, addr->sir_name); + if (!self->tsap) { + err = irda_open_tsap(self, LSAP_ANY, addr->sir_name); + if (err) + goto out; + } /* Move to connecting socket, start sending Connect Requests */ sock->state = SS_CONNECTING; -- 1.9.1
[PATCH] net/irda: fix NULL pointer dereference on memory allocation failure
I ran into this: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: [#1] PREEMPT SMP KASAN CPU: 2 PID: 2012 Comm: trinity-c3 Not tainted 4.7.0-rc7+ #19 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 task: 8800b745f2c0 ti: 88011174 task.ti: 88011174 RIP: 0010:[] [] irttp_connect_request+0x36/0x710 RSP: 0018:880111747bb8 EFLAGS: 00010286 RAX: dc00 RBX: RCX: 69dd8358 RDX: 0009 RSI: 0027 RDI: 0048 RBP: 880111747c00 R08: R09: R10: 69dd8358 R11: 10759723 R12: R13: 88011a7e4780 R14: 0027 R15: FS: 7fc738404700() GS:88011af0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7fc737fdfb10 CR3: 000118087000 CR4: 06e0 Stack: 0200 880111747bd8 810ee611 880119f1f220 880119f1f4f8 880119f1f4f0 88011a7e4780 880119f1f232 880119f1f220 880111747d58 82bca542 Call Trace: [] irda_connect+0x562/0x1190 [] SYSC_connect+0x202/0x2a0 [] SyS_connect+0x9/0x10 [] do_syscall_64+0x19c/0x410 [] entry_SYSCALL64_slow_path+0x25/0x25 Code: 41 89 ca 48 89 e5 41 57 41 56 41 55 41 54 41 89 d7 53 48 89 fb 48 83 c7 48 48 89 fa 41 89 f6 48 c1 ea 03 48 83 ec 20 4c 8b 65 10 <0f> b6 04 02 84 c0 74 08 84 c0 0f 8e 4c 04 00 00 80 7b 48 00 74 RIP [] irttp_connect_request+0x36/0x710 RSP ---[ end trace 4cda2588bc055b30 ]--- The problem is that irda_open_tsap() can fail and leave self->tsap = NULL, and then irttp_connect_request() almost immediately dereferences it. Cc: sta...@vger.kernel.org Signed-off-by: Vegard Nossum --- net/irda/af_irda.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/net/irda/af_irda.c b/net/irda/af_irda.c index 923abd6..8d2f7c9 100644 --- a/net/irda/af_irda.c +++ b/net/irda/af_irda.c @@ -1024,8 +1024,11 @@ static int irda_connect(struct socket *sock, struct sockaddr *uaddr, } /* Check if we have opened a local TSAP */ - if (!self->tsap) - irda_open_tsap(self, LSAP_ANY, addr->sir_name); + if (!self->tsap) { + err = irda_open_tsap(self, LSAP_ANY, addr->sir_name); + if (err) + goto out; + } /* Move to connecting socket, start sending Connect Requests */ sock->state = SS_CONNECTING; -- 1.9.1
Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code
On Sat, Jul 23, 2016 at 2:35 PM, Josh Poimboeufwrote: > > While doing the scanning and printing, it does call the frame pointer > unwinder in parallel, but like before, that's *only* used to determine > whether a found address should be printed without a question mark. If > the unwinder goes off the rails, the scanning and printing of text > addresses goes on, undisturbed. > > The frame pointer unwinder code itself is quite careful not to > dereference anything it shouldn't (though of course I welcome any review > comments that find otherwise). So this was the bug the last time around we did unwinders - the code would dereference the unwind tables, and the tables would be corrupted. End result: recursive oops. And they were corrupted not even because of memory corruption, but simply because they contained incorrect data, due to compiler bugs and other issues. I have really bad memories from that time. Several years after the fact. It took months to finally revert the crap, because the author continued to insist that "this was the last bug" for several passes through that thing. As they say, "Once burned, twice shy". But in this case, it's more like "Four times burned, sixteen times as shy". Linus
Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code
On Sat, Jul 23, 2016 at 2:35 PM, Josh Poimboeuf wrote: > > While doing the scanning and printing, it does call the frame pointer > unwinder in parallel, but like before, that's *only* used to determine > whether a found address should be printed without a question mark. If > the unwinder goes off the rails, the scanning and printing of text > addresses goes on, undisturbed. > > The frame pointer unwinder code itself is quite careful not to > dereference anything it shouldn't (though of course I welcome any review > comments that find otherwise). So this was the bug the last time around we did unwinders - the code would dereference the unwind tables, and the tables would be corrupted. End result: recursive oops. And they were corrupted not even because of memory corruption, but simply because they contained incorrect data, due to compiler bugs and other issues. I have really bad memories from that time. Several years after the fact. It took months to finally revert the crap, because the author continued to insist that "this was the last bug" for several passes through that thing. As they say, "Once burned, twice shy". But in this case, it's more like "Four times burned, sixteen times as shy". Linus
Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code
On Fri, Jul 22, 2016 at 05:31:47PM -0700, Andy Lutomirski wrote: > On Fri, Jul 22, 2016 at 5:22 PM, Linus Torvalds >wrote: > > > > So without having yet looked at the code, I want people to understand > > that to a very real degree, the stack tracer that the *oopsing* code > > (ie what all the usual kernel fault handlers use) is very very special > > code and needs to be handled very carefully, and needs to be extra > > robust, even in the presence of stack corruption, and even in the > > presence of the dwarf info being totally corrupted. Because we've very > > much had both things happen. > > > > It is very possible that we should have two different stack tracers - > > the stupid "for oopses only" code that doesn't necessarily give the > > perfect trace, but is very anal and happily gives old stale addresses > > (which can be very useful for seeing what happened just before the > > "real" stack trace), and then a separate stack trace engine that is > > clever and gets things right, and if that one faults it can depend on > > the normal kernel fault handling picking up the pieces. > > I think that Josh's code has the potential to be extremely robust > *and* give more correct results when possible. One thing I intend to > review when v2 shows up is that it's as conservative as it needs to be > to avoid ever dereferencing an out-of-bounds pointer. And Josh's oops > printer carefully walks and prints out all addresses on the stack > (complete with question marks) even if the unwinder doesn't find them. I should add that while the show_trace_log_lvl() code (which is used for oopses) looks different on the surface, the algorithm is fundamentally the same as before: traverse the stacks, scanning and printing any kernel text addresses. While doing the scanning and printing, it does call the frame pointer unwinder in parallel, but like before, that's *only* used to determine whether a found address should be printed without a question mark. If the unwinder goes off the rails, the scanning and printing of text addresses goes on, undisturbed. The frame pointer unwinder code itself is quite careful not to dereference anything it shouldn't (though of course I welcome any review comments that find otherwise). > > Yes, the current stack tracer is crufty. No, it's not perfect. But it > > is very well tested, and has held up. That should not be dismissed. > > > > I think you may be giving the current tracer slightly more credit than > it's due. In my stack guard page patchset, I fixed two separate > issues, one of which caused recursive faults and one of which caused > it to output nothing at all. So maybe *now* it's very robust :) But > it's still an umaintainable mess IMO, and Josh's patchset helps a > *lot*. -- Josh
Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code
On Fri, Jul 22, 2016 at 05:31:47PM -0700, Andy Lutomirski wrote: > On Fri, Jul 22, 2016 at 5:22 PM, Linus Torvalds > wrote: > > > > So without having yet looked at the code, I want people to understand > > that to a very real degree, the stack tracer that the *oopsing* code > > (ie what all the usual kernel fault handlers use) is very very special > > code and needs to be handled very carefully, and needs to be extra > > robust, even in the presence of stack corruption, and even in the > > presence of the dwarf info being totally corrupted. Because we've very > > much had both things happen. > > > > It is very possible that we should have two different stack tracers - > > the stupid "for oopses only" code that doesn't necessarily give the > > perfect trace, but is very anal and happily gives old stale addresses > > (which can be very useful for seeing what happened just before the > > "real" stack trace), and then a separate stack trace engine that is > > clever and gets things right, and if that one faults it can depend on > > the normal kernel fault handling picking up the pieces. > > I think that Josh's code has the potential to be extremely robust > *and* give more correct results when possible. One thing I intend to > review when v2 shows up is that it's as conservative as it needs to be > to avoid ever dereferencing an out-of-bounds pointer. And Josh's oops > printer carefully walks and prints out all addresses on the stack > (complete with question marks) even if the unwinder doesn't find them. I should add that while the show_trace_log_lvl() code (which is used for oopses) looks different on the surface, the algorithm is fundamentally the same as before: traverse the stacks, scanning and printing any kernel text addresses. While doing the scanning and printing, it does call the frame pointer unwinder in parallel, but like before, that's *only* used to determine whether a found address should be printed without a question mark. If the unwinder goes off the rails, the scanning and printing of text addresses goes on, undisturbed. The frame pointer unwinder code itself is quite careful not to dereference anything it shouldn't (though of course I welcome any review comments that find otherwise). > > Yes, the current stack tracer is crufty. No, it's not perfect. But it > > is very well tested, and has held up. That should not be dismissed. > > > > I think you may be giving the current tracer slightly more credit than > it's due. In my stack guard page patchset, I fixed two separate > issues, one of which caused recursive faults and one of which caused > it to output nothing at all. So maybe *now* it's very robust :) But > it's still an umaintainable mess IMO, and Josh's patchset helps a > *lot*. -- Josh
Re: PM-wakeup: Delete unnecessary checks before two function calls
>> How do you think about to integrate this update suggestion >> into another source code repository? > > I'm not really sure what you mean. Do you find the suggested source code change acceptable? http://article.gmane.org/gmane.linux.power-management.general/61766 https://lkml.org/lkml/2015/6/28/21 Who would dare to commit it? Regards, Markus
Re: PM-wakeup: Delete unnecessary checks before two function calls
>> How do you think about to integrate this update suggestion >> into another source code repository? > > I'm not really sure what you mean. Do you find the suggested source code change acceptable? http://article.gmane.org/gmane.linux.power-management.general/61766 https://lkml.org/lkml/2015/6/28/21 Who would dare to commit it? Regards, Markus
Re: [kernel-hardening] [PATCH v5 03/32] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated
On Thu, 21 Jul 2016 22:34:33 -0700, Andy Lutomirski said: > How much memory do you have and what's your config? My code is > obviously buggy, but I'm wondering why neither I nor the 0day bot caught > this. Probably because your devel box and the 0day bot both have 4-level page tables and the dual-core i5 in my laptop has (presumably) 3? In any case, your patch didn't fix things, nor did (as you noted in a mail to Ingo) does reverting the problem commit (and then the following one that deletes now-dead code so it will compile cleanly). 8G of memory, 2 4G dimms. .config attached. Anything that would help from /proc or /sys? # # Automatically generated file; DO NOT EDIT. # Linux/x86 4.7.0-rc7 Kernel Configuration # CONFIG_64BIT=y CONFIG_X86_64=y CONFIG_X86=y CONFIG_INSTRUCTION_DECODER=y CONFIG_OUTPUT_FORMAT="elf64-x86-64" CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig" CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_MMU=y CONFIG_ARCH_MMAP_RND_BITS_MIN=28 CONFIG_ARCH_MMAP_RND_BITS_MAX=32 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16 CONFIG_NEED_DMA_MAP_STATE=y CONFIG_NEED_SG_DMA_LENGTH=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y CONFIG_GENERIC_HWEIGHT=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_ARCH_HAS_CPU_RELAX=y CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y CONFIG_ARCH_WANT_GENERAL_HUGETLB=y CONFIG_ZONE_DMA32=y CONFIG_AUDIT_ARCH=y CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_HAVE_INTEL_TXT=y CONFIG_X86_64_SMP=y CONFIG_ARCH_SUPPORTS_UPROBES=y CONFIG_FIX_EARLYCON_MEM=y CONFIG_DEBUG_RODATA=y CONFIG_PGTABLE_LEVELS=4 CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_EXTABLE_SORT=y # # General setup # CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_CROSS_COMPILE="" # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_BZIP2=y CONFIG_HAVE_KERNEL_LZMA=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_HAVE_KERNEL_LZO=y CONFIG_HAVE_KERNEL_LZ4=y # CONFIG_KERNEL_GZIP is not set # CONFIG_KERNEL_BZIP2 is not set # CONFIG_KERNEL_LZMA is not set CONFIG_KERNEL_XZ=y # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set CONFIG_DEFAULT_HOSTNAME="(none)" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_POSIX_MQUEUE_SYSCTL=y CONFIG_CROSS_MEMORY_ATTACH=y CONFIG_FHANDLE=y # CONFIG_USELIB is not set CONFIG_AUDIT=y CONFIG_HAVE_ARCH_AUDITSYSCALL=y CONFIG_AUDITSYSCALL=y CONFIG_AUDIT_WATCH=y CONFIG_AUDIT_TREE=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_IRQ_DOMAIN=y CONFIG_IRQ_DOMAIN_HIERARCHY=y CONFIG_GENERIC_MSI_IRQ=y CONFIG_GENERIC_MSI_IRQ_DOMAIN=y CONFIG_IRQ_DOMAIN_DEBUG=y CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_ARCH_CLOCKSOURCE_DATA=y CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y CONFIG_GENERIC_CMOS_UPDATE=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set CONFIG_NO_HZ_IDLE=y # CONFIG_NO_HZ_FULL is not set CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y # # CPU/Task time and stats accounting # CONFIG_TICK_CPU_ACCOUNTING=y # CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set CONFIG_IRQ_TIME_ACCOUNTING=y CONFIG_BSD_PROCESS_ACCT=y CONFIG_BSD_PROCESS_ACCT_V3=y CONFIG_TASKSTATS=y CONFIG_TASK_DELAY_ACCT=y CONFIG_TASK_XACCT=y CONFIG_TASK_IO_ACCOUNTING=y # # RCU Subsystem # CONFIG_PREEMPT_RCU=y # CONFIG_RCU_EXPERT is not set CONFIG_SRCU=y # CONFIG_TASKS_RCU is not set CONFIG_RCU_STALL_COMMON=y # CONFIG_TREE_RCU_TRACE is not set # CONFIG_RCU_EXPEDITE_BOOT is not set CONFIG_BUILD_BIN2C=y CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_LOG_BUF_SHIFT=21 CONFIG_LOG_CPU_MAX_BUF_SHIFT=12 CONFIG_NMI_LOG_BUF_SHIFT=13 CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y CONFIG_ARCH_SUPPORTS_INT128=y CONFIG_CGROUPS=y CONFIG_PAGE_COUNTER=y CONFIG_MEMCG=y CONFIG_MEMCG_SWAP=y CONFIG_MEMCG_SWAP_ENABLED=y CONFIG_BLK_CGROUP=y # CONFIG_DEBUG_BLK_CGROUP is not set CONFIG_CGROUP_WRITEBACK=y CONFIG_CGROUP_SCHED=y CONFIG_FAIR_GROUP_SCHED=y CONFIG_CFS_BANDWIDTH=y CONFIG_RT_GROUP_SCHED=y CONFIG_CGROUP_PIDS=y CONFIG_CGROUP_FREEZER=y CONFIG_CGROUP_HUGETLB=y CONFIG_CPUSETS=y CONFIG_PROC_PID_CPUSET=y CONFIG_CGROUP_DEVICE=y CONFIG_CGROUP_CPUACCT=y CONFIG_CGROUP_PERF=y CONFIG_CGROUP_DEBUG=y CONFIG_CHECKPOINT_RESTORE=y CONFIG_NAMESPACES=y CONFIG_UTS_NS=y CONFIG_IPC_NS=y # CONFIG_USER_NS is not set CONFIG_PID_NS=y CONFIG_NET_NS=y
Re: [kernel-hardening] [PATCH v5 03/32] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated
On Thu, 21 Jul 2016 22:34:33 -0700, Andy Lutomirski said: > How much memory do you have and what's your config? My code is > obviously buggy, but I'm wondering why neither I nor the 0day bot caught > this. Probably because your devel box and the 0day bot both have 4-level page tables and the dual-core i5 in my laptop has (presumably) 3? In any case, your patch didn't fix things, nor did (as you noted in a mail to Ingo) does reverting the problem commit (and then the following one that deletes now-dead code so it will compile cleanly). 8G of memory, 2 4G dimms. .config attached. Anything that would help from /proc or /sys? # # Automatically generated file; DO NOT EDIT. # Linux/x86 4.7.0-rc7 Kernel Configuration # CONFIG_64BIT=y CONFIG_X86_64=y CONFIG_X86=y CONFIG_INSTRUCTION_DECODER=y CONFIG_OUTPUT_FORMAT="elf64-x86-64" CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig" CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_MMU=y CONFIG_ARCH_MMAP_RND_BITS_MIN=28 CONFIG_ARCH_MMAP_RND_BITS_MAX=32 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16 CONFIG_NEED_DMA_MAP_STATE=y CONFIG_NEED_SG_DMA_LENGTH=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y CONFIG_GENERIC_HWEIGHT=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_ARCH_HAS_CPU_RELAX=y CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y CONFIG_ARCH_WANT_GENERAL_HUGETLB=y CONFIG_ZONE_DMA32=y CONFIG_AUDIT_ARCH=y CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_HAVE_INTEL_TXT=y CONFIG_X86_64_SMP=y CONFIG_ARCH_SUPPORTS_UPROBES=y CONFIG_FIX_EARLYCON_MEM=y CONFIG_DEBUG_RODATA=y CONFIG_PGTABLE_LEVELS=4 CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_EXTABLE_SORT=y # # General setup # CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_CROSS_COMPILE="" # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_BZIP2=y CONFIG_HAVE_KERNEL_LZMA=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_HAVE_KERNEL_LZO=y CONFIG_HAVE_KERNEL_LZ4=y # CONFIG_KERNEL_GZIP is not set # CONFIG_KERNEL_BZIP2 is not set # CONFIG_KERNEL_LZMA is not set CONFIG_KERNEL_XZ=y # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set CONFIG_DEFAULT_HOSTNAME="(none)" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_POSIX_MQUEUE_SYSCTL=y CONFIG_CROSS_MEMORY_ATTACH=y CONFIG_FHANDLE=y # CONFIG_USELIB is not set CONFIG_AUDIT=y CONFIG_HAVE_ARCH_AUDITSYSCALL=y CONFIG_AUDITSYSCALL=y CONFIG_AUDIT_WATCH=y CONFIG_AUDIT_TREE=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_IRQ_DOMAIN=y CONFIG_IRQ_DOMAIN_HIERARCHY=y CONFIG_GENERIC_MSI_IRQ=y CONFIG_GENERIC_MSI_IRQ_DOMAIN=y CONFIG_IRQ_DOMAIN_DEBUG=y CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_ARCH_CLOCKSOURCE_DATA=y CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y CONFIG_GENERIC_CMOS_UPDATE=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set CONFIG_NO_HZ_IDLE=y # CONFIG_NO_HZ_FULL is not set CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y # # CPU/Task time and stats accounting # CONFIG_TICK_CPU_ACCOUNTING=y # CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set CONFIG_IRQ_TIME_ACCOUNTING=y CONFIG_BSD_PROCESS_ACCT=y CONFIG_BSD_PROCESS_ACCT_V3=y CONFIG_TASKSTATS=y CONFIG_TASK_DELAY_ACCT=y CONFIG_TASK_XACCT=y CONFIG_TASK_IO_ACCOUNTING=y # # RCU Subsystem # CONFIG_PREEMPT_RCU=y # CONFIG_RCU_EXPERT is not set CONFIG_SRCU=y # CONFIG_TASKS_RCU is not set CONFIG_RCU_STALL_COMMON=y # CONFIG_TREE_RCU_TRACE is not set # CONFIG_RCU_EXPEDITE_BOOT is not set CONFIG_BUILD_BIN2C=y CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_LOG_BUF_SHIFT=21 CONFIG_LOG_CPU_MAX_BUF_SHIFT=12 CONFIG_NMI_LOG_BUF_SHIFT=13 CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y CONFIG_ARCH_SUPPORTS_INT128=y CONFIG_CGROUPS=y CONFIG_PAGE_COUNTER=y CONFIG_MEMCG=y CONFIG_MEMCG_SWAP=y CONFIG_MEMCG_SWAP_ENABLED=y CONFIG_BLK_CGROUP=y # CONFIG_DEBUG_BLK_CGROUP is not set CONFIG_CGROUP_WRITEBACK=y CONFIG_CGROUP_SCHED=y CONFIG_FAIR_GROUP_SCHED=y CONFIG_CFS_BANDWIDTH=y CONFIG_RT_GROUP_SCHED=y CONFIG_CGROUP_PIDS=y CONFIG_CGROUP_FREEZER=y CONFIG_CGROUP_HUGETLB=y CONFIG_CPUSETS=y CONFIG_PROC_PID_CPUSET=y CONFIG_CGROUP_DEVICE=y CONFIG_CGROUP_CPUACCT=y CONFIG_CGROUP_PERF=y CONFIG_CGROUP_DEBUG=y CONFIG_CHECKPOINT_RESTORE=y CONFIG_NAMESPACES=y CONFIG_UTS_NS=y CONFIG_IPC_NS=y # CONFIG_USER_NS is not set CONFIG_PID_NS=y CONFIG_NET_NS=y
Re: [PATCH/RFC] Re: linux-next: build failure after merge of the luto-misc tree
Hi Arnaldo, On Fri, 22 Jul 2016 16:57:34 -0300 Arnaldo Carvalho de Melowrote: > > Em Fri, Jul 22, 2016 at 02:44:17PM -0500, Josh Poimboeuf escreveu: > > On Fri, Jul 22, 2016 at 04:36:55PM -0300, Arnaldo Carvalho de Melo wrote: > > > Em Fri, Jul 22, 2016 at 02:19:20PM -0500, Josh Poimboeuf escreveu: > > > > On Fri, Jul 22, 2016 at 11:37:39AM -0300, Arnaldo Carvalho de Melo > > > > wrote: > > > > > I.e. with the two patches I mentioned, that are equivalent to the > > > > > last patch I > > > > > sent to Stephen for testing, we would end up with HOSTARCH=powerpc and > > > > > ARCH=x86, no? > > > > > Thanks for spelling it out, that helped a lot. > > > > Glad you liked it, I had to do it for my own sanity :-) > > > > And something that gave me mixed feelings was an e-mail from the kbuild > > > test bot that noticed my perf/core changes and said that the build was > > > broken for "make ARCH=x86_64", so I had to reinstate this part: > > > > ifeq ($(ARCH),x86_64) > > > ARCH := x86 > > > endif > > > > Because, as you say, 'make ARCH=x86' works :-\ I think it will not be > > > needed with your patch, right? I'm checking your patch below right now, > > > Yeah, that shouldn't be needed with my patch. I think either would > > work, but my patch is more of a permanent solution. > > Sure, I left it there because then we don't have bisection broke at that > fix I made, i.e. 'make ARCH=x86_64' works at that point too. > > I applied your patch and will push it to Ingo, now we must cross our > fingers so that Stephen doesn't come back to us once more telling it is > still broken :o) Unfortunately, this is what I get when I just build perf/core: DESCEND objtool CC /home/sfr/next/x86_64_allmodconfig/tools/objtool/builtin-check.o LD /home/sfr/next/x86_64_allmodconfig/tools/objtool/objtool-in.o Warning: objtool: x86 instruction decoder differs from kernel LINK /home/sfr/next/x86_64_allmodconfig/tools/objtool/objtool In file included from /home/sfr/next/next/arch/x86/include/uapi/asm/bitsperlong.h:10:0, from /home/sfr/next/next/include/uapi/asm-generic/int-ll64.h:11, from /home/sfr/next/next/include/uapi/asm-generic/types.h:6, from /home/sfr/next/next/arch/x86/include/uapi/asm/types.h:4, from /home/sfr/next/next/tools/include/linux/types.h:9, from /home/sfr/next/next/include/uapi/linux/elf.h:4, from /home/sfr/next/next/arch/x86/entry/vdso/vdso2c.c:66: /home/sfr/next/next/tools/include/asm-generic/bitsperlong.h:13:2: error: #error Inconsistent word size. Check asm/bitsperlong.h #error Inconsistent word size. Check asm/bitsperlong.h ^ The be clear: this is a ppc64le hosted, x86_64 target cross build. I than added the following patch, and the build finishes successfully. From: Stephen Rothwell Date: Sat, 23 Jul 2016 14:35:40 +1000 Subject: [PATCH] x86: make the vdso2c compiler use the host architecture headers Signed-off-by: Stephen Rothwell --- arch/x86/entry/vdso/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile index 253b72eaade6..25e88c030c47 100644 --- a/arch/x86/entry/vdso/Makefile +++ b/arch/x86/entry/vdso/Makefile @@ -55,7 +55,7 @@ VDSO_LDFLAGS_vdso.lds = -m64 -Wl,-soname=linux-vdso.so.1 \ $(obj)/vdso64.so.dbg: $(src)/vdso.lds $(vobjs) FORCE $(call if_changed,vdso) -HOST_EXTRACFLAGS += -I$(srctree)/tools/include -I$(srctree)/include/uapi -I$(srctree)/arch/x86/include/uapi +HOST_EXTRACFLAGS += -I$(srctree)/tools/include -I$(srctree)/include/uapi -I$(srctree)/arch/$(SUBARCH)/include/uapi hostprogs-y+= vdso2c quiet_cmd_vdso2c = VDSO2C $@ -- 2.8.1 There may be a more correct way to do this ... -- Cheers, Stephen Rothwell
Re: [PATCH/RFC] Re: linux-next: build failure after merge of the luto-misc tree
Hi Arnaldo, On Fri, 22 Jul 2016 16:57:34 -0300 Arnaldo Carvalho de Melo wrote: > > Em Fri, Jul 22, 2016 at 02:44:17PM -0500, Josh Poimboeuf escreveu: > > On Fri, Jul 22, 2016 at 04:36:55PM -0300, Arnaldo Carvalho de Melo wrote: > > > Em Fri, Jul 22, 2016 at 02:19:20PM -0500, Josh Poimboeuf escreveu: > > > > On Fri, Jul 22, 2016 at 11:37:39AM -0300, Arnaldo Carvalho de Melo > > > > wrote: > > > > > I.e. with the two patches I mentioned, that are equivalent to the > > > > > last patch I > > > > > sent to Stephen for testing, we would end up with HOSTARCH=powerpc and > > > > > ARCH=x86, no? > > > > > Thanks for spelling it out, that helped a lot. > > > > Glad you liked it, I had to do it for my own sanity :-) > > > > And something that gave me mixed feelings was an e-mail from the kbuild > > > test bot that noticed my perf/core changes and said that the build was > > > broken for "make ARCH=x86_64", so I had to reinstate this part: > > > > ifeq ($(ARCH),x86_64) > > > ARCH := x86 > > > endif > > > > Because, as you say, 'make ARCH=x86' works :-\ I think it will not be > > > needed with your patch, right? I'm checking your patch below right now, > > > Yeah, that shouldn't be needed with my patch. I think either would > > work, but my patch is more of a permanent solution. > > Sure, I left it there because then we don't have bisection broke at that > fix I made, i.e. 'make ARCH=x86_64' works at that point too. > > I applied your patch and will push it to Ingo, now we must cross our > fingers so that Stephen doesn't come back to us once more telling it is > still broken :o) Unfortunately, this is what I get when I just build perf/core: DESCEND objtool CC /home/sfr/next/x86_64_allmodconfig/tools/objtool/builtin-check.o LD /home/sfr/next/x86_64_allmodconfig/tools/objtool/objtool-in.o Warning: objtool: x86 instruction decoder differs from kernel LINK /home/sfr/next/x86_64_allmodconfig/tools/objtool/objtool In file included from /home/sfr/next/next/arch/x86/include/uapi/asm/bitsperlong.h:10:0, from /home/sfr/next/next/include/uapi/asm-generic/int-ll64.h:11, from /home/sfr/next/next/include/uapi/asm-generic/types.h:6, from /home/sfr/next/next/arch/x86/include/uapi/asm/types.h:4, from /home/sfr/next/next/tools/include/linux/types.h:9, from /home/sfr/next/next/include/uapi/linux/elf.h:4, from /home/sfr/next/next/arch/x86/entry/vdso/vdso2c.c:66: /home/sfr/next/next/tools/include/asm-generic/bitsperlong.h:13:2: error: #error Inconsistent word size. Check asm/bitsperlong.h #error Inconsistent word size. Check asm/bitsperlong.h ^ The be clear: this is a ppc64le hosted, x86_64 target cross build. I than added the following patch, and the build finishes successfully. From: Stephen Rothwell Date: Sat, 23 Jul 2016 14:35:40 +1000 Subject: [PATCH] x86: make the vdso2c compiler use the host architecture headers Signed-off-by: Stephen Rothwell --- arch/x86/entry/vdso/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile index 253b72eaade6..25e88c030c47 100644 --- a/arch/x86/entry/vdso/Makefile +++ b/arch/x86/entry/vdso/Makefile @@ -55,7 +55,7 @@ VDSO_LDFLAGS_vdso.lds = -m64 -Wl,-soname=linux-vdso.so.1 \ $(obj)/vdso64.so.dbg: $(src)/vdso.lds $(vobjs) FORCE $(call if_changed,vdso) -HOST_EXTRACFLAGS += -I$(srctree)/tools/include -I$(srctree)/include/uapi -I$(srctree)/arch/x86/include/uapi +HOST_EXTRACFLAGS += -I$(srctree)/tools/include -I$(srctree)/include/uapi -I$(srctree)/arch/$(SUBARCH)/include/uapi hostprogs-y+= vdso2c quiet_cmd_vdso2c = VDSO2C $@ -- 2.8.1 There may be a more correct way to do this ... -- Cheers, Stephen Rothwell
[PATCH] x86/mm/cpa: Unbreak populate_pgd(): stop trying to deallocate failed PUDs
This mostly reverts commit 360cb4d15567a7eca07a5f3ade6de308bbfb4e70. I broke the case where a PUD table got allocated -- populate_pud() would wander off a pgd_none entry and get lost. I'm not sure how this survived my testing. Fixing this directly is difficult or impossible because of the awful state of Linux's page table accessors. Instead, fix the original issue in a much simpler way. The problem was that, if we allocated a PUD table, failed to populate it, and freed it, another CPU could potentially keep using the PGD entry we installed (either by copying it via vmalloc_fault or by speculatively caching it). There's a straightforward fix: simply leave the top-level entry in place if this happens. This can't waste any significant amount of memory -- there are at most 256 entries like this systemwide and, as a practical matter, if we hit this failure path repeatedly, we're likely to reuse the same page anyway. For context, this is a reversion with this hunk added in: if (ret < 0) { + /* +* Leave the PUD page in place in case some other CPU or thread +* already found it, but remove any useless entries we just +* added to it. +*/ - unmap_pgd_range(cpa->pgd, addr, + unmap_pud_range(pgd_entry, addr, addr + (cpa->numpages << PAGE_SHIFT)); return ret; } This effectively open-codes what the now-deleted unmap_pgd_range() function used to do except that unmap_pgd_range() used to try to free the page as well. Cc: Mike KrinkinReported-by: Valdis Kletnieks Signed-off-by: Andy Lutomirski --- arch/x86/mm/pageattr.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index 26c93c6e04a0..2bc6ea153f76 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -1082,6 +1082,8 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr) pud = (pud_t *)get_zeroed_page(GFP_KERNEL | __GFP_NOTRACK); if (!pud) return -1; + + set_pgd(pgd_entry, __pgd(__pa(pud) | _KERNPG_TABLE)); } pgprot_val(pgprot) &= ~pgprot_val(cpa->mask_clr); @@ -1089,16 +1091,11 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr) ret = populate_pud(cpa, addr, pgd_entry, pgprot); if (ret < 0) { - if (pud) - free_page((unsigned long)pud); unmap_pud_range(pgd_entry, addr, addr + (cpa->numpages << PAGE_SHIFT)); return ret; } - if (pud) - set_pgd(pgd_entry, __pgd(__pa(pud) | _KERNPG_TABLE)); - cpa->numpages = ret; return 0; } -- 2.7.4
[PATCH] x86/mm/cpa: Unbreak populate_pgd(): stop trying to deallocate failed PUDs
This mostly reverts commit 360cb4d15567a7eca07a5f3ade6de308bbfb4e70. I broke the case where a PUD table got allocated -- populate_pud() would wander off a pgd_none entry and get lost. I'm not sure how this survived my testing. Fixing this directly is difficult or impossible because of the awful state of Linux's page table accessors. Instead, fix the original issue in a much simpler way. The problem was that, if we allocated a PUD table, failed to populate it, and freed it, another CPU could potentially keep using the PGD entry we installed (either by copying it via vmalloc_fault or by speculatively caching it). There's a straightforward fix: simply leave the top-level entry in place if this happens. This can't waste any significant amount of memory -- there are at most 256 entries like this systemwide and, as a practical matter, if we hit this failure path repeatedly, we're likely to reuse the same page anyway. For context, this is a reversion with this hunk added in: if (ret < 0) { + /* +* Leave the PUD page in place in case some other CPU or thread +* already found it, but remove any useless entries we just +* added to it. +*/ - unmap_pgd_range(cpa->pgd, addr, + unmap_pud_range(pgd_entry, addr, addr + (cpa->numpages << PAGE_SHIFT)); return ret; } This effectively open-codes what the now-deleted unmap_pgd_range() function used to do except that unmap_pgd_range() used to try to free the page as well. Cc: Mike Krinkin Reported-by: Valdis Kletnieks Signed-off-by: Andy Lutomirski --- arch/x86/mm/pageattr.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index 26c93c6e04a0..2bc6ea153f76 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -1082,6 +1082,8 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr) pud = (pud_t *)get_zeroed_page(GFP_KERNEL | __GFP_NOTRACK); if (!pud) return -1; + + set_pgd(pgd_entry, __pgd(__pa(pud) | _KERNPG_TABLE)); } pgprot_val(pgprot) &= ~pgprot_val(cpa->mask_clr); @@ -1089,16 +1091,11 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr) ret = populate_pud(cpa, addr, pgd_entry, pgprot); if (ret < 0) { - if (pud) - free_page((unsigned long)pud); unmap_pud_range(pgd_entry, addr, addr + (cpa->numpages << PAGE_SHIFT)); return ret; } - if (pud) - set_pgd(pgd_entry, __pgd(__pa(pud) | _KERNPG_TABLE)); - cpa->numpages = ret; return 0; } -- 2.7.4
Re: [PATCH v4 2/4] drm/rockchip: add an common abstracted PSR driver
Yakir, On Wed, Jul 13, 2016 at 9:15 PM, Yakir Yangwrote: > +static void psr_set_state(struct psr_drv *psr, enum psr_state state) > +{ > + mutex_lock(>state_mutex); > + > + if (psr->state == state) { > + mutex_unlock(>state_mutex); > + return; > + } > + > + psr->state = state; > + switch (state) { > + case PSR_ENABLE: > + psr->set(psr->encoder, true); > + break; > + > + case PSR_DISABLE: > + case PSR_FLUSH: > + psr->set(psr->encoder, false); > + break; > + }; > + > + mutex_unlock(>state_mutex); > +} > + > +static void psr_flush_handler(unsigned long data) > +{ > + struct psr_drv *psr = (struct psr_drv *)data; > + > + if (!psr || psr->state != PSR_FLUSH) > + return; > + > + psr_set_state(psr, PSR_ENABLE); As mentioned in a separate thread, this is probably not OK. psr_set_state() grabs a mutex and that might sleep. ...but psr_flush_handler() is a timer. I'm nearly certain that timers can't sleep. I believe this is the source of "sleeping function called from invalid context" that I've seen at times. -Doug
Re: [PATCH 1/3] Add a new field to struct shrinker
On 07/22/2016 06:27 PM, Tony Jones wrote: > On 07/20/2016 07:54 AM, Michal Hocko wrote: > >>> Michal, just to make sure I understand you correctly, do you mean that we >>> could infer the names of the shrinkers by looking at the names of their >>> callbacks? >> >> Yes, %ps can then be used for the name of the shrinker structure >> (assuming it is available). > > This is fine for emitting via the ftrace /sys interface, but in order to > have the data [name] get > marshalled thru to perf (for example) you need to add it to the > TP_fast_assign entry. > > tony Unfortunately, %ps/%pF doesn't do much (re: Michal's comment "assuming it is available"): - TP_printk("%pF %p: nid: %d objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld", + TP_printk("%pF %p(%ps): nid: %d objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld", __entry->shrink, __entry->shr, + __entry->shr, __entry->nid, __entry->nr_objects_to_shrink, # cat trace_pipe bash-1917 [003] ...1 2925.941062: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 88042bb60cc0(0x88042bb60cc0): nid: 0 objects to shrink 0 gfp_flags GFP_KERNEL pgs_scanned 1000 lru_pgs 1000 cache items 4 delta 7 total_scan 7 Otherwise what I was suggesting was something like this to ensure it was correctly marshaled for perf/etc: diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -16,6 +16,8 @@ #define RECLAIM_WB_SYNC0x0004u /* Unused, all reclaim async */ #define RECLAIM_WB_ASYNC 0x0008u +#define SHRINKER_NAME_LEN (size_t)32 + #define show_reclaim_flags(flags) \ (flags) ? __print_flags(flags, "|", \ {RECLAIM_WB_ANON, "RECLAIM_WB_ANON"}, \ @@ -191,6 +193,7 @@ TRACE_EVENT(mm_shrink_slab_start, TP_STRUCT__entry( __field(struct shrinker *, shr) __field(void *, shrink) + __array(char, name, SHRINKER_NAME_LEN); __field(int, nid) __field(long, nr_objects_to_shrink) __field(gfp_t, gfp_flags) @@ -202,6 +205,11 @@ TRACE_EVENT(mm_shrink_slab_start, ), TP_fast_assign( + char sym[KSYM_SYMBOL_LEN]; + + sprint_symbol(sym, (unsigned long)shr); + strlcpy(__entry->name, sym, SHRINKER_NAME_LEN); + __entry->shr = shr; __entry->shrink = shr->scan_objects; __entry->nid = sc->nid; @@ -214,9 +222,10 @@ TRACE_EVENT(mm_shrink_slab_start, __entry->total_scan = total_scan; ), - TP_printk("%pF %p: nid: %d objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld", + TP_printk("%pF %p(%s): nid: %d objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld", __entry->shrink, __entry->shr, + __entry->name, __entry->nid, __entry->nr_objects_to_shrink, show_gfp_flags(__entry->gfp_flags), @@ -236,6 +245,7 @@ TRACE_EVENT(mm_shrink_slab_end, TP_STRUCT__entry( __field(struct shrinker *, shr) + __array(char, name, SHRINKER_NAME_LEN); __field(int, nid) __field(void *, shrink) __field(long, unused_scan) @@ -245,6 +255,11 @@ TRACE_EVENT(mm_shrink_slab_end, ), TP_fast_assign( + char sym[KSYM_SYMBOL_LEN]; + + sprint_symbol(sym, (unsigned long)shr); + strlcpy(__entry->name, sym, SHRINKER_NAME_LEN); + __entry->shr = shr; __entry->nid = nid; __entry->shrink = shr->scan_objects; @@ -254,9 +269,10 @@ TRACE_EVENT(mm_shrink_slab_end, __entry->total_scan = total_scan; ), - TP_printk("%pF %p: nid: %d unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d", + TP_printk("%pF %p(%pF): nid: %d unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d", __entry->shrink, __entry->shr, + __entry->shr, __entry->nid, __entry->unused_scan, __entry->new_scan,
Re: [PATCH v4 2/4] drm/rockchip: add an common abstracted PSR driver
Yakir, On Wed, Jul 13, 2016 at 9:15 PM, Yakir Yang wrote: > +static void psr_set_state(struct psr_drv *psr, enum psr_state state) > +{ > + mutex_lock(>state_mutex); > + > + if (psr->state == state) { > + mutex_unlock(>state_mutex); > + return; > + } > + > + psr->state = state; > + switch (state) { > + case PSR_ENABLE: > + psr->set(psr->encoder, true); > + break; > + > + case PSR_DISABLE: > + case PSR_FLUSH: > + psr->set(psr->encoder, false); > + break; > + }; > + > + mutex_unlock(>state_mutex); > +} > + > +static void psr_flush_handler(unsigned long data) > +{ > + struct psr_drv *psr = (struct psr_drv *)data; > + > + if (!psr || psr->state != PSR_FLUSH) > + return; > + > + psr_set_state(psr, PSR_ENABLE); As mentioned in a separate thread, this is probably not OK. psr_set_state() grabs a mutex and that might sleep. ...but psr_flush_handler() is a timer. I'm nearly certain that timers can't sleep. I believe this is the source of "sleeping function called from invalid context" that I've seen at times. -Doug
Re: [PATCH 1/3] Add a new field to struct shrinker
On 07/22/2016 06:27 PM, Tony Jones wrote: > On 07/20/2016 07:54 AM, Michal Hocko wrote: > >>> Michal, just to make sure I understand you correctly, do you mean that we >>> could infer the names of the shrinkers by looking at the names of their >>> callbacks? >> >> Yes, %ps can then be used for the name of the shrinker structure >> (assuming it is available). > > This is fine for emitting via the ftrace /sys interface, but in order to > have the data [name] get > marshalled thru to perf (for example) you need to add it to the > TP_fast_assign entry. > > tony Unfortunately, %ps/%pF doesn't do much (re: Michal's comment "assuming it is available"): - TP_printk("%pF %p: nid: %d objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld", + TP_printk("%pF %p(%ps): nid: %d objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld", __entry->shrink, __entry->shr, + __entry->shr, __entry->nid, __entry->nr_objects_to_shrink, # cat trace_pipe bash-1917 [003] ...1 2925.941062: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 88042bb60cc0(0x88042bb60cc0): nid: 0 objects to shrink 0 gfp_flags GFP_KERNEL pgs_scanned 1000 lru_pgs 1000 cache items 4 delta 7 total_scan 7 Otherwise what I was suggesting was something like this to ensure it was correctly marshaled for perf/etc: diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -16,6 +16,8 @@ #define RECLAIM_WB_SYNC0x0004u /* Unused, all reclaim async */ #define RECLAIM_WB_ASYNC 0x0008u +#define SHRINKER_NAME_LEN (size_t)32 + #define show_reclaim_flags(flags) \ (flags) ? __print_flags(flags, "|", \ {RECLAIM_WB_ANON, "RECLAIM_WB_ANON"}, \ @@ -191,6 +193,7 @@ TRACE_EVENT(mm_shrink_slab_start, TP_STRUCT__entry( __field(struct shrinker *, shr) __field(void *, shrink) + __array(char, name, SHRINKER_NAME_LEN); __field(int, nid) __field(long, nr_objects_to_shrink) __field(gfp_t, gfp_flags) @@ -202,6 +205,11 @@ TRACE_EVENT(mm_shrink_slab_start, ), TP_fast_assign( + char sym[KSYM_SYMBOL_LEN]; + + sprint_symbol(sym, (unsigned long)shr); + strlcpy(__entry->name, sym, SHRINKER_NAME_LEN); + __entry->shr = shr; __entry->shrink = shr->scan_objects; __entry->nid = sc->nid; @@ -214,9 +222,10 @@ TRACE_EVENT(mm_shrink_slab_start, __entry->total_scan = total_scan; ), - TP_printk("%pF %p: nid: %d objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld", + TP_printk("%pF %p(%s): nid: %d objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld", __entry->shrink, __entry->shr, + __entry->name, __entry->nid, __entry->nr_objects_to_shrink, show_gfp_flags(__entry->gfp_flags), @@ -236,6 +245,7 @@ TRACE_EVENT(mm_shrink_slab_end, TP_STRUCT__entry( __field(struct shrinker *, shr) + __array(char, name, SHRINKER_NAME_LEN); __field(int, nid) __field(void *, shrink) __field(long, unused_scan) @@ -245,6 +255,11 @@ TRACE_EVENT(mm_shrink_slab_end, ), TP_fast_assign( + char sym[KSYM_SYMBOL_LEN]; + + sprint_symbol(sym, (unsigned long)shr); + strlcpy(__entry->name, sym, SHRINKER_NAME_LEN); + __entry->shr = shr; __entry->nid = nid; __entry->shrink = shr->scan_objects; @@ -254,9 +269,10 @@ TRACE_EVENT(mm_shrink_slab_end, __entry->total_scan = total_scan; ), - TP_printk("%pF %p: nid: %d unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d", + TP_printk("%pF %p(%pF): nid: %d unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d", __entry->shrink, __entry->shr, + __entry->shr, __entry->nid, __entry->unused_scan, __entry->new_scan,
Re: linux-next: build failure after merge of the nvdimm tree
On Fri, Jul 22, 2016 at 6:04 PM, Dan Williamswrote: > On Thu, Jul 21, 2016 at 11:13 PM, Stephen Rothwell > wrote: >> Hi Dan, >> >> After merging the nvdimm tree, today's linux-next build (powerpc >> ppc64_defconfig) failed like this: >> >> In file included from drivers/md/dm.h:14:0, >> from drivers/md/dm-uevent.c:27: >> include/linux/device-mapper.h:134:22: error: expected ';', ',' or ')' before >> '*' token >> void __pmem **kaddr, pfn_t *pfn, long size); >> ^ >> include/linux/device-mapper.h:182:2: error: unknown type name >> 'dm_direct_access_fn' >> dm_direct_access_fn direct_access; >> ^ >> >> Caused by commit >> >> 7a9eb2066631 ("pmem: kill __pmem address space") >> >> interacting with commit >> >> 545ed20e6df6 ("dm: add infrastructure for DAX support") >> >> from the device-mapper tree. >> >> I applied the following merge fix patch for today. Someone needs to >> tell Linus about this when he merges the trees. > > There's no real rush to remove "__pmem" I'll pull this out until after > DM DAX support has merged. > > Thanks Stephen! Sorry, I forgot that some ARM patches already depend on this change... so we'll need to roll forward and notify Linus about this merge interaction.
Re: linux-next: build failure after merge of the nvdimm tree
On Fri, Jul 22, 2016 at 6:04 PM, Dan Williams wrote: > On Thu, Jul 21, 2016 at 11:13 PM, Stephen Rothwell > wrote: >> Hi Dan, >> >> After merging the nvdimm tree, today's linux-next build (powerpc >> ppc64_defconfig) failed like this: >> >> In file included from drivers/md/dm.h:14:0, >> from drivers/md/dm-uevent.c:27: >> include/linux/device-mapper.h:134:22: error: expected ';', ',' or ')' before >> '*' token >> void __pmem **kaddr, pfn_t *pfn, long size); >> ^ >> include/linux/device-mapper.h:182:2: error: unknown type name >> 'dm_direct_access_fn' >> dm_direct_access_fn direct_access; >> ^ >> >> Caused by commit >> >> 7a9eb2066631 ("pmem: kill __pmem address space") >> >> interacting with commit >> >> 545ed20e6df6 ("dm: add infrastructure for DAX support") >> >> from the device-mapper tree. >> >> I applied the following merge fix patch for today. Someone needs to >> tell Linus about this when he merges the trees. > > There's no real rush to remove "__pmem" I'll pull this out until after > DM DAX support has merged. > > Thanks Stephen! Sorry, I forgot that some ARM patches already depend on this change... so we'll need to roll forward and notify Linus about this merge interaction.
Crypto Fixes for 4.7
Hi Linus: This push fixes a sporadic build failure in the qat driver as well as a memory corruption bug in rsa-pkcs1pad. Please pull from git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6.git linus Herbert Xu (1): crypto: rsa-pkcs1pad - fix rsa-pkcs1pad request struct Jan Stancek (1): crypto: qat - make qat_asym_algs.o depend on asn1 headers crypto/rsa-pkcs1pad.c |4 ++-- drivers/crypto/qat/qat_common/Makefile |1 + 2 files changed, 3 insertions(+), 2 deletions(-) Thanks, -- Email: Herbert XuHome Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Crypto Fixes for 4.7
Hi Linus: This push fixes a sporadic build failure in the qat driver as well as a memory corruption bug in rsa-pkcs1pad. Please pull from git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6.git linus Herbert Xu (1): crypto: rsa-pkcs1pad - fix rsa-pkcs1pad request struct Jan Stancek (1): crypto: qat - make qat_asym_algs.o depend on asn1 headers crypto/rsa-pkcs1pad.c |4 ++-- drivers/crypto/qat/qat_common/Makefile |1 + 2 files changed, 3 insertions(+), 2 deletions(-) Thanks, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [GIT PULL] clk fixes for v4.7-rc8
On Fri, Jul 22, 2016 at 9:47 AM, Michael Turquettewrote: > > git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux.git > tags/clk-fixes-for-linus The pgp key you sign your pull requests with keep changing. Sometimes it's your primary key, most often it seems to be your 4kb subkey, now it's another 2kb subkey. All the signatures look fine, but is there a reason for this odd behavior? Linus
Re: [GIT PULL] clk fixes for v4.7-rc8
On Fri, Jul 22, 2016 at 9:47 AM, Michael Turquette wrote: > > git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux.git > tags/clk-fixes-for-linus The pgp key you sign your pull requests with keep changing. Sometimes it's your primary key, most often it seems to be your 4kb subkey, now it's another 2kb subkey. All the signatures look fine, but is there a reason for this odd behavior? Linus
Re: [PATCH net-next v3] cdc_ether: Improve ZTE MF823/831/910 handling
Hi, [auto build test WARNING on net-next/master] url: https://github.com/0day-ci/linux/commits/Kristian-Evensen/cdc_ether-Improve-ZTE-MF823-831-910-handling/20160723-093100 config: x86_64-randconfig-i0-201629 (attached as .config) compiler: gcc-4.9 (Debian 4.9.3-14) 4.9.3 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All warnings (new ones prefixed by >>): drivers/net/usb/cdc_ether.c: In function 'usbnet_cdc_zte_rx_fixup': >> drivers/net/usb/cdc_ether.c:461:5: warning: unused variable >> 'buggy_hwaddrs_idx' [-Wunused-variable] u8 buggy_hwaddrs_idx = 0; ^ >> drivers/net/usb/cdc_ether.c:460:5: warning: unused variable >> 'num_buggy_hwaddrs' [-Wunused-variable] u8 num_buggy_hwaddrs; ^ vim +/buggy_hwaddrs_idx +461 drivers/net/usb/cdc_ether.c 454 * device sends packets with a static, bogus, random MAC address (event if 455 * device MAC address has been updated). Always set MAC address to that of the 456 * device. 457 */ 458 static int usbnet_cdc_zte_rx_fixup(struct usbnet *dev, struct sk_buff *skb) 459 { > 460 u8 num_buggy_hwaddrs; > 461 u8 buggy_hwaddrs_idx = 0; 462 463 if (skb->len < ETH_HLEN || !(skb->data[0] & 0x02)) 464 return 1; --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [PATCH net-next v3] cdc_ether: Improve ZTE MF823/831/910 handling
Hi, [auto build test WARNING on net-next/master] url: https://github.com/0day-ci/linux/commits/Kristian-Evensen/cdc_ether-Improve-ZTE-MF823-831-910-handling/20160723-093100 config: x86_64-randconfig-i0-201629 (attached as .config) compiler: gcc-4.9 (Debian 4.9.3-14) 4.9.3 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All warnings (new ones prefixed by >>): drivers/net/usb/cdc_ether.c: In function 'usbnet_cdc_zte_rx_fixup': >> drivers/net/usb/cdc_ether.c:461:5: warning: unused variable >> 'buggy_hwaddrs_idx' [-Wunused-variable] u8 buggy_hwaddrs_idx = 0; ^ >> drivers/net/usb/cdc_ether.c:460:5: warning: unused variable >> 'num_buggy_hwaddrs' [-Wunused-variable] u8 num_buggy_hwaddrs; ^ vim +/buggy_hwaddrs_idx +461 drivers/net/usb/cdc_ether.c 454 * device sends packets with a static, bogus, random MAC address (event if 455 * device MAC address has been updated). Always set MAC address to that of the 456 * device. 457 */ 458 static int usbnet_cdc_zte_rx_fixup(struct usbnet *dev, struct sk_buff *skb) 459 { > 460 u8 num_buggy_hwaddrs; > 461 u8 buggy_hwaddrs_idx = 0; 462 463 if (skb->len < ETH_HLEN || !(skb->data[0] & 0x02)) 464 return 1; --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [PATCH v2 00/10] userns: sysctl limits for namespaces
Kees Cookwrites: > On Fri, Jul 22, 2016 at 11:45 AM, Eric W. Biederman > wrote: >> Colin Walters writes: >> >>> On Thu, Jul 21, 2016, at 12:39 PM, Eric W. Biederman wrote: This patchset addresses two use cases: - Implement a sane upper bound on the number of namespaces. - Provide a way for sandboxes to limit the attack surface from namespaces. >>> >>> Perhaps this is obvious, but since you didn't quite explicitly state it; >>> do you see this as obsoleting the existing downstream patches >>> mentioned in: >>> https://lwn.net/Articles/673597/ >>> It seems conceptually similar to Kees' original approach, right? >> >> Similar yes, and I expect it fills the need. My primary difference is >> that I believe this approach makes sense from a perspective of assuming >> that user namespaces or other namespaces are not any buggier than any >> other piece of kernel code and that people will use them. >> >> I don't see these limits making sense from a perspective that user >> namespaces are flawed and distro kernels should not have enabled them in >> the first place. That was my perception right or wrong of Kees patches >> and the related patches that landed in Ubuntu and Debian. >> >> With Kees approach I could not see how to handle the case where some >> applications on the system wanted user namespaces and others don't. >> Which made it very nasty for future evolution and more deployment of >> user namespaces. Being per user namespace these limits can be used to >> sandbox applications without affecting the rest of the system. > > While it certainly works for my use-case (init ns > max_usernamespaces=0), I don't see how this helps the case of "let > user foobar open 1 userns, but everyone else is 0", which is likely > the middle ground between "just turn it off" and "everyone gets to > create usernamespaces". I'm personally not interested in that level of > granularity, but in earlier discussions it sounded like this was > something you wanted? So the case I really care about is when there is limited use, so people don't have to redesign their applications. In this case if you want to disable things in a sandbox like way you just create a user namespace and set the count to 0 in that user namespace. A whole system disable I tend to think is a stupid configuration for a new system. It gets into people negotiating for what they need, and I don't see that as sustainable. I prefer good usable defaults. I would have loved to have done something with per user limits so it could be disabled for a selection of users, but it turns out the kernel doesn't have appropriate data structures for to hold limits for users that have not logged in. And in practice I don't care the case where 1 user is allowed but not the others, I care about disallow this user/program that is in a sandbox. I also seem to recall people have problems with using seccomp to disable things. All of that said a per user policy is easily implemented in pam by setting the size count for a specific user to 0. I do think a limit to catch applications that go crazy is very sane, and that is primarily what is implemented here. Eric
Re: [PATCH v2 00/10] userns: sysctl limits for namespaces
Kees Cook writes: > On Fri, Jul 22, 2016 at 11:45 AM, Eric W. Biederman > wrote: >> Colin Walters writes: >> >>> On Thu, Jul 21, 2016, at 12:39 PM, Eric W. Biederman wrote: This patchset addresses two use cases: - Implement a sane upper bound on the number of namespaces. - Provide a way for sandboxes to limit the attack surface from namespaces. >>> >>> Perhaps this is obvious, but since you didn't quite explicitly state it; >>> do you see this as obsoleting the existing downstream patches >>> mentioned in: >>> https://lwn.net/Articles/673597/ >>> It seems conceptually similar to Kees' original approach, right? >> >> Similar yes, and I expect it fills the need. My primary difference is >> that I believe this approach makes sense from a perspective of assuming >> that user namespaces or other namespaces are not any buggier than any >> other piece of kernel code and that people will use them. >> >> I don't see these limits making sense from a perspective that user >> namespaces are flawed and distro kernels should not have enabled them in >> the first place. That was my perception right or wrong of Kees patches >> and the related patches that landed in Ubuntu and Debian. >> >> With Kees approach I could not see how to handle the case where some >> applications on the system wanted user namespaces and others don't. >> Which made it very nasty for future evolution and more deployment of >> user namespaces. Being per user namespace these limits can be used to >> sandbox applications without affecting the rest of the system. > > While it certainly works for my use-case (init ns > max_usernamespaces=0), I don't see how this helps the case of "let > user foobar open 1 userns, but everyone else is 0", which is likely > the middle ground between "just turn it off" and "everyone gets to > create usernamespaces". I'm personally not interested in that level of > granularity, but in earlier discussions it sounded like this was > something you wanted? So the case I really care about is when there is limited use, so people don't have to redesign their applications. In this case if you want to disable things in a sandbox like way you just create a user namespace and set the count to 0 in that user namespace. A whole system disable I tend to think is a stupid configuration for a new system. It gets into people negotiating for what they need, and I don't see that as sustainable. I prefer good usable defaults. I would have loved to have done something with per user limits so it could be disabled for a selection of users, but it turns out the kernel doesn't have appropriate data structures for to hold limits for users that have not logged in. And in practice I don't care the case where 1 user is allowed but not the others, I care about disallow this user/program that is in a sandbox. I also seem to recall people have problems with using seccomp to disable things. All of that said a per user policy is easily implemented in pam by setting the size count for a specific user to 0. I do think a limit to catch applications that go crazy is very sane, and that is primarily what is implemented here. Eric
Re: [v6 PATCH 1/6] extcon: Add Type-C and DP support
Hi, 2016-07-23 6:40 GMT+09:00 Guenter Roeck: > On Fri, Jul 22, 2016 at 2:29 AM, Chanwoo Choi wrote: >> Hi Chris, >> >> I'm sorry for late reply. I finished the first draft to support the extcon >> property. >> You can check the patches[1]. But, I need more time to test it. After tested >> it, >> I'll send the patches. >> >> [1] >> https://git.kernel.org/cgit/linux/kernel/git/chanwoo/extcon.git/log/?h=extcon-test >> >> Chanwoo Choi (4): >> extcon: Add the extcon_type to group each connector into five category >> extcon: Add the support for extcon property according to type of connector >> extcon: Rename the extcon_set/get_state() to maintain the function naming >> pattern >> extcon: Add the sync APIs to support the notification for extcon property >> > > Some additional feedback: > > - IS_PROPERTY(), is_extcon_property_supported(): >The logic should probably be > if (EXTCON_TYPE_##name & type) { >and > if (!(extcon_info[id].type & type)) { > > (logical & instead of logical | ) > > - extcon_get_property(): > - ret is not initialized. > - case EXTCON_PROP_DISP_MIN ... EXTCON_PROP_DISP_MAX: > misses a break statement Thanks for report. I'll check it. Thanks, Chanwoo Choi > > Guenter > >> Chris Zhong (1): >> extcon: Add EXTCON_DISP_DP and the property for USB Type-C >> >> Regards, >> Chanwoo Choi >> >> On 2016년 07월 21일 22:13, Chris Zhong wrote: >>> Add EXTCON_DISP_DP for the Display external connector. For Type-C >>> connector the DisplayPort can work as an Alternate Mode(VESA DisplayPort >>> Alt Mode on USB Type-C Standard). The Type-C support both normal and >>> flipped orientation, so add a property to extcon. >>> >>> Signe-off-by: Chris Zhong >>> >>> Signed-off-by: Chris Zhong >>> --- >>> >>> Changes in v6: >>> - move the EXTCON_PROP_TYPEC_POLARITY to EXTCON_TYPE_USB in _supported >>> Series-changes: 5 >>> - support get property >>> >>> Changes in v5: None >>> Changes in v4: None >>> Changes in v3: None >>> Changes in v2: None >>> Changes in v1: None >>> >>> drivers/extcon/extcon.c | 26 ++ >>> include/linux/extcon.h | 13 + >>> 2 files changed, 39 insertions(+) >>> >>> diff --git a/drivers/extcon/extcon.c b/drivers/extcon/extcon.c >>> index a1117db..f79b510 100644 >>> --- a/drivers/extcon/extcon.c >>> +++ b/drivers/extcon/extcon.c >>> @@ -157,6 +157,11 @@ struct __extcon_info { >>> .id = EXTCON_DISP_VGA, >>> .name = "VGA", >>> }, >>> + [EXTCON_DISP_DP] = { >>> + .type = EXTCON_TYPE_DISP | EXTCON_TYPE_USB, >>> + .id = EXTCON_DISP_DP, >>> + .name = "DP", >>> + }, >>> >>> /* Miscellaneous external connector */ >>> [EXTCON_DOCK] = { >>> @@ -270,6 +275,7 @@ static bool is_extcon_property_supported(unsigned int >>> id, >>> switch (prop) { >>> case EXTCON_PROP_USB_ID: >>> case EXTCON_PROP_USB_VBUS: >>> + case EXTCON_PROP_TYPEC_POLARITY: >>> return true; >>> default: >>> break; >>> @@ -547,6 +553,26 @@ int extcon_get_cable_property(struct extcon_dev *edev, >>> unsigned int id, >>> enum extcon_property prop, >>> union extcon_property_value *val) >>> { >>> + struct extcon_cable *cable; >>> + int index; >>> + >>> + if (!edev) >>> + return -EINVAL; >>> + >>> + /* Check the property whether is supported or not */ >>> + if (!is_extcon_property_supported(id, prop)) >>> + return -EINVAL; >>> + >>> + /* Find the cable index of external connector by using id */ >>> + index = find_cable_index_by_id(edev, id); >>> + if (index < 0) >>> + return index; >>> + >>> + /* Store the property value */ >>> + cable = >cables[index]; >>> + >>> + val->intval = cable->propval[prop].intval; >>> + >>> return 0; >>> } >>> >>> diff --git a/include/linux/extcon.h b/include/linux/extcon.h >>> index f6f0a8d..50ef87f 100644 >>> --- a/include/linux/extcon.h >>> +++ b/include/linux/extcon.h >>> @@ -77,6 +77,7 @@ enum extcon_type { >>> #define EXTCON_DISP_MHL 41 /* Mobile High-Definition >>> Link */ >>> #define EXTCON_DISP_DVI 42 /* Digital Visual Interface */ >>> #define EXTCON_DISP_VGA 43 /* Video Graphics Array */ >>> +#define EXTCON_DISP_DP 44 /* DisplayPort */ >>> >>> /* Miscellaneous external connector */ >>> #define EXTCON_DOCK 60 >>> @@ -108,9 +109,13 @@ enum extcon_property { >>>* - EXTCON_PROP_USB_USB >>>* @type: integer (int value) >>>* @value: 0 (low) or 1 (high) >>> + * - EXTCON_PROP_TYPEC_POLARITY, >>> + * @type: integer (int value) >>> + * @value: 0 (normal) or 1
Re: [v6 PATCH 1/6] extcon: Add Type-C and DP support
Hi, 2016-07-23 6:40 GMT+09:00 Guenter Roeck : > On Fri, Jul 22, 2016 at 2:29 AM, Chanwoo Choi wrote: >> Hi Chris, >> >> I'm sorry for late reply. I finished the first draft to support the extcon >> property. >> You can check the patches[1]. But, I need more time to test it. After tested >> it, >> I'll send the patches. >> >> [1] >> https://git.kernel.org/cgit/linux/kernel/git/chanwoo/extcon.git/log/?h=extcon-test >> >> Chanwoo Choi (4): >> extcon: Add the extcon_type to group each connector into five category >> extcon: Add the support for extcon property according to type of connector >> extcon: Rename the extcon_set/get_state() to maintain the function naming >> pattern >> extcon: Add the sync APIs to support the notification for extcon property >> > > Some additional feedback: > > - IS_PROPERTY(), is_extcon_property_supported(): >The logic should probably be > if (EXTCON_TYPE_##name & type) { >and > if (!(extcon_info[id].type & type)) { > > (logical & instead of logical | ) > > - extcon_get_property(): > - ret is not initialized. > - case EXTCON_PROP_DISP_MIN ... EXTCON_PROP_DISP_MAX: > misses a break statement Thanks for report. I'll check it. Thanks, Chanwoo Choi > > Guenter > >> Chris Zhong (1): >> extcon: Add EXTCON_DISP_DP and the property for USB Type-C >> >> Regards, >> Chanwoo Choi >> >> On 2016년 07월 21일 22:13, Chris Zhong wrote: >>> Add EXTCON_DISP_DP for the Display external connector. For Type-C >>> connector the DisplayPort can work as an Alternate Mode(VESA DisplayPort >>> Alt Mode on USB Type-C Standard). The Type-C support both normal and >>> flipped orientation, so add a property to extcon. >>> >>> Signe-off-by: Chris Zhong >>> >>> Signed-off-by: Chris Zhong >>> --- >>> >>> Changes in v6: >>> - move the EXTCON_PROP_TYPEC_POLARITY to EXTCON_TYPE_USB in _supported >>> Series-changes: 5 >>> - support get property >>> >>> Changes in v5: None >>> Changes in v4: None >>> Changes in v3: None >>> Changes in v2: None >>> Changes in v1: None >>> >>> drivers/extcon/extcon.c | 26 ++ >>> include/linux/extcon.h | 13 + >>> 2 files changed, 39 insertions(+) >>> >>> diff --git a/drivers/extcon/extcon.c b/drivers/extcon/extcon.c >>> index a1117db..f79b510 100644 >>> --- a/drivers/extcon/extcon.c >>> +++ b/drivers/extcon/extcon.c >>> @@ -157,6 +157,11 @@ struct __extcon_info { >>> .id = EXTCON_DISP_VGA, >>> .name = "VGA", >>> }, >>> + [EXTCON_DISP_DP] = { >>> + .type = EXTCON_TYPE_DISP | EXTCON_TYPE_USB, >>> + .id = EXTCON_DISP_DP, >>> + .name = "DP", >>> + }, >>> >>> /* Miscellaneous external connector */ >>> [EXTCON_DOCK] = { >>> @@ -270,6 +275,7 @@ static bool is_extcon_property_supported(unsigned int >>> id, >>> switch (prop) { >>> case EXTCON_PROP_USB_ID: >>> case EXTCON_PROP_USB_VBUS: >>> + case EXTCON_PROP_TYPEC_POLARITY: >>> return true; >>> default: >>> break; >>> @@ -547,6 +553,26 @@ int extcon_get_cable_property(struct extcon_dev *edev, >>> unsigned int id, >>> enum extcon_property prop, >>> union extcon_property_value *val) >>> { >>> + struct extcon_cable *cable; >>> + int index; >>> + >>> + if (!edev) >>> + return -EINVAL; >>> + >>> + /* Check the property whether is supported or not */ >>> + if (!is_extcon_property_supported(id, prop)) >>> + return -EINVAL; >>> + >>> + /* Find the cable index of external connector by using id */ >>> + index = find_cable_index_by_id(edev, id); >>> + if (index < 0) >>> + return index; >>> + >>> + /* Store the property value */ >>> + cable = >cables[index]; >>> + >>> + val->intval = cable->propval[prop].intval; >>> + >>> return 0; >>> } >>> >>> diff --git a/include/linux/extcon.h b/include/linux/extcon.h >>> index f6f0a8d..50ef87f 100644 >>> --- a/include/linux/extcon.h >>> +++ b/include/linux/extcon.h >>> @@ -77,6 +77,7 @@ enum extcon_type { >>> #define EXTCON_DISP_MHL 41 /* Mobile High-Definition >>> Link */ >>> #define EXTCON_DISP_DVI 42 /* Digital Visual Interface */ >>> #define EXTCON_DISP_VGA 43 /* Video Graphics Array */ >>> +#define EXTCON_DISP_DP 44 /* DisplayPort */ >>> >>> /* Miscellaneous external connector */ >>> #define EXTCON_DOCK 60 >>> @@ -108,9 +109,13 @@ enum extcon_property { >>>* - EXTCON_PROP_USB_USB >>>* @type: integer (int value) >>>* @value: 0 (low) or 1 (high) >>> + * - EXTCON_PROP_TYPEC_POLARITY, >>> + * @type: integer (int value) >>> + * @value: 0 (normal) or 1 (flip) >>>*/ >>> EXTCON_PROP_USB_ID = 0, >>> EXTCON_PROP_USB_VBUS,
Re: [v6 PATCH 1/6] extcon: Add Type-C and DP support
Hi, 2016-07-23 3:21 GMT+09:00 Guenter Roeck: > Hi, > > On Fri, Jul 22, 2016 at 2:29 AM, Chanwoo Choi wrote: >> Hi Chris, >> >> I'm sorry for late reply. I finished the first draft to support the extcon >> property. >> You can check the patches[1]. But, I need more time to test it. After tested >> it, >> I'll send the patches. >> >> [1] >> https://git.kernel.org/cgit/linux/kernel/git/chanwoo/extcon.git/log/?h=extcon-test >> >> Chanwoo Choi (4): >> extcon: Add the extcon_type to group each connector into five category >> extcon: Add the support for extcon property according to type of connector >> extcon: Rename the extcon_set/get_state() to maintain the function naming >> pattern >> extcon: Add the sync APIs to support the notification for extcon property >> >> Chris Zhong (1): >> extcon: Add EXTCON_DISP_DP and the property for USB Type-C >> > > Couple of comments: > > - In extcon.c, line 560: 'state' is unused. > > - extcon_set_property(edev, EXTCON_USB, EXTCON_USB_VBUS, 0); > does not work. As written, it has to be something like: you shoud use the 'union extcon_property_valu' as below. > > union extcon_property_value prop; > ... > prop.intval = vbus_state; > extcon_set_property(edev, EXTCON_USB, EXTCON_USB_VBUS, prop); > > - For USB, the state of EXTCON_USB, EXTCON_USB_HOST, and > EXTCON_DISP_DP tend to change at the same time, together with the > associated properties. > It might be desirable to have an equivalent of extcon_sync_all(edev); Each external connector has the separate notifier independently. All of operation in extcon should be done by each external connector unit. I think that extcon_sync_all() is not appropriate. If the some extcon device support the many connectors, maybe extcon send the un-needed notifications for all connectors. Also, extcon_sync_all(edev) is same operation with follwoing three function calls: extcon_sync(edev, EXTCON_USB); extcon_sync(edev, EXTCON_USB_HOST); extcon_sync(edev, EXTCON_DISP_DP); Also, EXTCON_DISP_DP has the both EXTCON_TYPE_USB and EXTCON_TYPE_DISP. So, the extcon client driver is able to set the vbus property with EXTCON_DISP_DP as following: - extcon_set_property(edev, EXTCON_DISP_DP, EXTCON_USB_VBUS, prop) Thanks, Chanwoo Choi
Re: [v6 PATCH 1/6] extcon: Add Type-C and DP support
Hi, 2016-07-23 3:21 GMT+09:00 Guenter Roeck : > Hi, > > On Fri, Jul 22, 2016 at 2:29 AM, Chanwoo Choi wrote: >> Hi Chris, >> >> I'm sorry for late reply. I finished the first draft to support the extcon >> property. >> You can check the patches[1]. But, I need more time to test it. After tested >> it, >> I'll send the patches. >> >> [1] >> https://git.kernel.org/cgit/linux/kernel/git/chanwoo/extcon.git/log/?h=extcon-test >> >> Chanwoo Choi (4): >> extcon: Add the extcon_type to group each connector into five category >> extcon: Add the support for extcon property according to type of connector >> extcon: Rename the extcon_set/get_state() to maintain the function naming >> pattern >> extcon: Add the sync APIs to support the notification for extcon property >> >> Chris Zhong (1): >> extcon: Add EXTCON_DISP_DP and the property for USB Type-C >> > > Couple of comments: > > - In extcon.c, line 560: 'state' is unused. > > - extcon_set_property(edev, EXTCON_USB, EXTCON_USB_VBUS, 0); > does not work. As written, it has to be something like: you shoud use the 'union extcon_property_valu' as below. > > union extcon_property_value prop; > ... > prop.intval = vbus_state; > extcon_set_property(edev, EXTCON_USB, EXTCON_USB_VBUS, prop); > > - For USB, the state of EXTCON_USB, EXTCON_USB_HOST, and > EXTCON_DISP_DP tend to change at the same time, together with the > associated properties. > It might be desirable to have an equivalent of extcon_sync_all(edev); Each external connector has the separate notifier independently. All of operation in extcon should be done by each external connector unit. I think that extcon_sync_all() is not appropriate. If the some extcon device support the many connectors, maybe extcon send the un-needed notifications for all connectors. Also, extcon_sync_all(edev) is same operation with follwoing three function calls: extcon_sync(edev, EXTCON_USB); extcon_sync(edev, EXTCON_USB_HOST); extcon_sync(edev, EXTCON_DISP_DP); Also, EXTCON_DISP_DP has the both EXTCON_TYPE_USB and EXTCON_TYPE_DISP. So, the extcon client driver is able to set the vbus property with EXTCON_DISP_DP as following: - extcon_set_property(edev, EXTCON_DISP_DP, EXTCON_USB_VBUS, prop) Thanks, Chanwoo Choi
Re: [PATCH v8 0/10] Output raw touch data via V4L2
I'm not Nick, but I'm testing his patches. ;-) Here's what I'm getting from v4l2-compliance with his patches: root@RDU2:/mnt/disk ./v4l2-compliance -a -f -d /dev/v4l-touch0 Driver Info: Driver name : rmi4_f54 Card type : Synaptics RMI4 Touch Sensor Bus info : rmi4:rmi4-00.fn54 Driver version: 4.7.0 Capabilities : 0x9521 Video Capture Touch Capture Read/Write Streaming Extended Pix Format Device Capabilities Device Caps : 0x1521 Video Capture Touch Capture Read/Write Streaming Extended Pix Format Compliance test for device /dev/v4l-touch0 (not using libv4l2): Required ioctls: test VIDIOC_QUERYCAP: OK Allow for multiple opens: test second video open: OK test VIDIOC_QUERYCAP: OK test VIDIOC_G/S_PRIORITY: OK Debug ioctls: test VIDIOC_DBG_G/S_REGISTER: OK (Not Supported) test VIDIOC_LOG_STATUS: OK (Not Supported) Input ioctls: test VIDIOC_G/S_TUNER/ENUM_FREQ_BANDS: OK (Not Supported) test VIDIOC_G/S_FREQUENCY: OK (Not Supported) test VIDIOC_S_HW_FREQ_SEEK: OK (Not Supported) test VIDIOC_ENUMAUDIO: OK (Not Supported) test VIDIOC_G/S/ENUMINPUT: OK test VIDIOC_G/S_AUDIO: OK (Not Supported) Inputs: 5 Audio Inputs: 0 Tuners: 0 Output ioctls: test VIDIOC_G/S_MODULATOR: OK (Not Supported) test VIDIOC_G/S_FREQUENCY: OK (Not Supported) test VIDIOC_ENUMAUDOUT: OK (Not Supported) test VIDIOC_G/S/ENUMOUTPUT: OK (Not Supported) test VIDIOC_G/S_AUDOUT: OK (Not Supported) Outputs: 0 Audio Outputs: 0 Modulators: 0 Input/Output configuration ioctls: test VIDIOC_ENUM/G/S/QUERY_STD: OK (Not Supported) test VIDIOC_ENUM/G/S/QUERY_DV_TIMINGS: OK (Not Supported) test VIDIOC_DV_TIMINGS_CAP: OK (Not Supported) test VIDIOC_G/S_EDID: OK (Not Supported) Test input 0: Control ioctls: test VIDIOC_QUERY_EXT_CTRL/QUERYMENU: OK (Not Supported) test VIDIOC_QUERYCTRL: OK (Not Supported) test VIDIOC_G/S_CTRL: OK (Not Supported) test VIDIOC_G/S/TRY_EXT_CTRLS: OK (Not Supported) test VIDIOC_(UN)SUBSCRIBE_EVENT/DQEVENT: OK (Not Supported) test VIDIOC_G/S_JPEGCOMP: OK (Not Supported) Standard Controls: 0 Private Controls: 0 Format ioctls: test VIDIOC_ENUM_FMT/FRAMESIZES/FRAMEINTERVALS: OK test VIDIOC_G/S_PARM: OK test VIDIOC_G_FBUF: OK (Not Supported) test VIDIOC_G_FMT: OK test VIDIOC_TRY_FMT: OK test VIDIOC_S_FMT: OK test VIDIOC_G_SLICED_VBI_CAP: OK (Not Supported) test Cropping: OK (Not Supported) test Composing: OK (Not Supported) test Scaling: OK (Not Supported) Codec ioctls: test VIDIOC_(TRY_)ENCODER_CMD: OK (Not Supported) test VIDIOC_G_ENC_INDEX: OK (Not Supported) test VIDIOC_(TRY_)DECODER_CMD: OK (Not Supported) Buffer ioctls: test VIDIOC_REQBUFS/CREATE_BUFS/QUERYBUF: OK test VIDIOC_EXPBUF: OK Test input 1: Control ioctls: test VIDIOC_QUERY_EXT_CTRL/QUERYMENU: OK (Not Supported) test VIDIOC_QUERYCTRL: OK (Not Supported) test VIDIOC_G/S_CTRL: OK (Not Supported) test VIDIOC_G/S/TRY_EXT_CTRLS: OK (Not Supported) test VIDIOC_(UN)SUBSCRIBE_EVENT/DQEVENT: OK (Not Supported) test VIDIOC_G/S_JPEGCOMP: OK (Not Supported) Standard Controls: 0 Private Controls: 0 Format ioctls: test VIDIOC_ENUM_FMT/FRAMESIZES/FRAMEINTERVALS: OK test VIDIOC_G/S_PARM: OK test VIDIOC_G_FBUF: OK (Not Supported) test VIDIOC_G_FMT: OK test VIDIOC_TRY_FMT: OK test VIDIOC_S_FMT: OK test VIDIOC_G_SLICED_VBI_CAP: OK (Not Supported) test Cropping: OK (Not Supported) test Composing: OK (Not Supported) test Scaling: OK (Not Supported) Codec ioctls: test VIDIOC_(TRY_)ENCODER_CMD: OK (Not Supported) test VIDIOC_G_ENC_INDEX: OK (Not Supported) test VIDIOC_(TRY_)DECODER_CMD: OK (Not Supported) Buffer ioctls: test VIDIOC_REQBUFS/CREATE_BUFS/QUERYBUF: OK test VIDIOC_EXPBUF: OK Test input 2: Control ioctls: test VIDIOC_QUERY_EXT_CTRL/QUERYMENU: OK (Not Supported) test VIDIOC_QUERYCTRL: OK (Not Supported) test VIDIOC_G/S_CTRL: OK (Not Supported) test VIDIOC_G/S/TRY_EXT_CTRLS: OK (Not Supported) test VIDIOC_(UN)SUBSCRIBE_EVENT/DQEVENT: OK (Not Supported) test VIDIOC_G/S_JPEGCOMP: OK (Not Supported) Standard Controls: 0 Private Controls: 0 Format ioctls: test VIDIOC_ENUM_FMT/FRAMESIZES/FRAMEINTERVALS: OK test VIDIOC_G/S_PARM: OK test VIDIOC_G_FBUF: OK (Not Supported) test VIDIOC_G_FMT: OK test VIDIOC_TRY_FMT: OK test VIDIOC_S_FMT: OK test VIDIOC_G_SLICED_VBI_CAP: OK (Not Supported) test Cropping: OK (Not Supported) test Composing: OK (Not Supported) test Scaling: OK (Not Supported) Codec ioctls: test VIDIOC_(TRY_)ENCODER_CMD: OK (Not Supported) test VIDIOC_G_ENC_INDEX: OK (Not Supported) test VIDIOC_(TRY_)DECODER_CMD: OK (Not Supported) Buffer ioctls: test VIDIOC_REQBUFS/CREATE_BUFS/QUERYBUF: OK test VIDIOC_EXPBUF: OK Test input 3: Control ioctls: test VIDIOC_QUERY_EXT_CTRL/QUERYMENU: OK (Not Supported) test VIDIOC_QUERYCTRL: OK (Not Supported) test VIDIOC_G/S_CTRL: OK (Not Supported) test VIDIOC_G/S/TRY_EXT_CTRLS: OK (Not Supported) test VIDIOC_(UN)SUBSCRIBE_EVENT/DQEVENT: OK (Not Supported) test VIDIOC_G/S_JPEGCOMP: OK (Not Supported) Standard Controls: 0 Private Controls: 0 Format ioctls: test VIDIOC_ENUM_FMT/FRAMESIZES/FRAMEINTERVALS: OK
Re: [PATCH v8 0/10] Output raw touch data via V4L2
I'm not Nick, but I'm testing his patches. ;-) Here's what I'm getting from v4l2-compliance with his patches: root@RDU2:/mnt/disk ./v4l2-compliance -a -f -d /dev/v4l-touch0 Driver Info: Driver name : rmi4_f54 Card type : Synaptics RMI4 Touch Sensor Bus info : rmi4:rmi4-00.fn54 Driver version: 4.7.0 Capabilities : 0x9521 Video Capture Touch Capture Read/Write Streaming Extended Pix Format Device Capabilities Device Caps : 0x1521 Video Capture Touch Capture Read/Write Streaming Extended Pix Format Compliance test for device /dev/v4l-touch0 (not using libv4l2): Required ioctls: test VIDIOC_QUERYCAP: OK Allow for multiple opens: test second video open: OK test VIDIOC_QUERYCAP: OK test VIDIOC_G/S_PRIORITY: OK Debug ioctls: test VIDIOC_DBG_G/S_REGISTER: OK (Not Supported) test VIDIOC_LOG_STATUS: OK (Not Supported) Input ioctls: test VIDIOC_G/S_TUNER/ENUM_FREQ_BANDS: OK (Not Supported) test VIDIOC_G/S_FREQUENCY: OK (Not Supported) test VIDIOC_S_HW_FREQ_SEEK: OK (Not Supported) test VIDIOC_ENUMAUDIO: OK (Not Supported) test VIDIOC_G/S/ENUMINPUT: OK test VIDIOC_G/S_AUDIO: OK (Not Supported) Inputs: 5 Audio Inputs: 0 Tuners: 0 Output ioctls: test VIDIOC_G/S_MODULATOR: OK (Not Supported) test VIDIOC_G/S_FREQUENCY: OK (Not Supported) test VIDIOC_ENUMAUDOUT: OK (Not Supported) test VIDIOC_G/S/ENUMOUTPUT: OK (Not Supported) test VIDIOC_G/S_AUDOUT: OK (Not Supported) Outputs: 0 Audio Outputs: 0 Modulators: 0 Input/Output configuration ioctls: test VIDIOC_ENUM/G/S/QUERY_STD: OK (Not Supported) test VIDIOC_ENUM/G/S/QUERY_DV_TIMINGS: OK (Not Supported) test VIDIOC_DV_TIMINGS_CAP: OK (Not Supported) test VIDIOC_G/S_EDID: OK (Not Supported) Test input 0: Control ioctls: test VIDIOC_QUERY_EXT_CTRL/QUERYMENU: OK (Not Supported) test VIDIOC_QUERYCTRL: OK (Not Supported) test VIDIOC_G/S_CTRL: OK (Not Supported) test VIDIOC_G/S/TRY_EXT_CTRLS: OK (Not Supported) test VIDIOC_(UN)SUBSCRIBE_EVENT/DQEVENT: OK (Not Supported) test VIDIOC_G/S_JPEGCOMP: OK (Not Supported) Standard Controls: 0 Private Controls: 0 Format ioctls: test VIDIOC_ENUM_FMT/FRAMESIZES/FRAMEINTERVALS: OK test VIDIOC_G/S_PARM: OK test VIDIOC_G_FBUF: OK (Not Supported) test VIDIOC_G_FMT: OK test VIDIOC_TRY_FMT: OK test VIDIOC_S_FMT: OK test VIDIOC_G_SLICED_VBI_CAP: OK (Not Supported) test Cropping: OK (Not Supported) test Composing: OK (Not Supported) test Scaling: OK (Not Supported) Codec ioctls: test VIDIOC_(TRY_)ENCODER_CMD: OK (Not Supported) test VIDIOC_G_ENC_INDEX: OK (Not Supported) test VIDIOC_(TRY_)DECODER_CMD: OK (Not Supported) Buffer ioctls: test VIDIOC_REQBUFS/CREATE_BUFS/QUERYBUF: OK test VIDIOC_EXPBUF: OK Test input 1: Control ioctls: test VIDIOC_QUERY_EXT_CTRL/QUERYMENU: OK (Not Supported) test VIDIOC_QUERYCTRL: OK (Not Supported) test VIDIOC_G/S_CTRL: OK (Not Supported) test VIDIOC_G/S/TRY_EXT_CTRLS: OK (Not Supported) test VIDIOC_(UN)SUBSCRIBE_EVENT/DQEVENT: OK (Not Supported) test VIDIOC_G/S_JPEGCOMP: OK (Not Supported) Standard Controls: 0 Private Controls: 0 Format ioctls: test VIDIOC_ENUM_FMT/FRAMESIZES/FRAMEINTERVALS: OK test VIDIOC_G/S_PARM: OK test VIDIOC_G_FBUF: OK (Not Supported) test VIDIOC_G_FMT: OK test VIDIOC_TRY_FMT: OK test VIDIOC_S_FMT: OK test VIDIOC_G_SLICED_VBI_CAP: OK (Not Supported) test Cropping: OK (Not Supported) test Composing: OK (Not Supported) test Scaling: OK (Not Supported) Codec ioctls: test VIDIOC_(TRY_)ENCODER_CMD: OK (Not Supported) test VIDIOC_G_ENC_INDEX: OK (Not Supported) test VIDIOC_(TRY_)DECODER_CMD: OK (Not Supported) Buffer ioctls: test VIDIOC_REQBUFS/CREATE_BUFS/QUERYBUF: OK test VIDIOC_EXPBUF: OK Test input 2: Control ioctls: test VIDIOC_QUERY_EXT_CTRL/QUERYMENU: OK (Not Supported) test VIDIOC_QUERYCTRL: OK (Not Supported) test VIDIOC_G/S_CTRL: OK (Not Supported) test VIDIOC_G/S/TRY_EXT_CTRLS: OK (Not Supported) test VIDIOC_(UN)SUBSCRIBE_EVENT/DQEVENT: OK (Not Supported) test VIDIOC_G/S_JPEGCOMP: OK (Not Supported) Standard Controls: 0 Private Controls: 0 Format ioctls: test VIDIOC_ENUM_FMT/FRAMESIZES/FRAMEINTERVALS: OK test VIDIOC_G/S_PARM: OK test VIDIOC_G_FBUF: OK (Not Supported) test VIDIOC_G_FMT: OK test VIDIOC_TRY_FMT: OK test VIDIOC_S_FMT: OK test VIDIOC_G_SLICED_VBI_CAP: OK (Not Supported) test Cropping: OK (Not Supported) test Composing: OK (Not Supported) test Scaling: OK (Not Supported) Codec ioctls: test VIDIOC_(TRY_)ENCODER_CMD: OK (Not Supported) test VIDIOC_G_ENC_INDEX: OK (Not Supported) test VIDIOC_(TRY_)DECODER_CMD: OK (Not Supported) Buffer ioctls: test VIDIOC_REQBUFS/CREATE_BUFS/QUERYBUF: OK test VIDIOC_EXPBUF: OK Test input 3: Control ioctls: test VIDIOC_QUERY_EXT_CTRL/QUERYMENU: OK (Not Supported) test VIDIOC_QUERYCTRL: OK (Not Supported) test VIDIOC_G/S_CTRL: OK (Not Supported) test VIDIOC_G/S/TRY_EXT_CTRLS: OK (Not Supported) test VIDIOC_(UN)SUBSCRIBE_EVENT/DQEVENT: OK (Not Supported) test VIDIOC_G/S_JPEGCOMP: OK (Not Supported) Standard Controls: 0 Private Controls: 0 Format ioctls: test VIDIOC_ENUM_FMT/FRAMESIZES/FRAMEINTERVALS: OK
[PATCH v18 net-next 1/1] hv_sock: introduce Hyper-V Sockets
Hyper-V Sockets (hv_sock) supplies a byte-stream based communication mechanism between the host and the guest. It's somewhat like TCP over VMBus, but the transportation layer (VMBus) is much simpler than IP. With Hyper-V Sockets, applications between the host and the guest can talk to each other directly by the traditional BSD-style socket APIs. Hyper-V Sockets is only available on new Windows hosts, like Windows Server 2016. More info is in this article "Make your own integration services": https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service The patch implements the necessary support in the guest side by introducing a new socket address family AF_HYPERV. Signed-off-by: Dexuan CuiCc: "K. Y. Srinivasan" Cc: Haiyang Zhang Cc: Vitaly Kuznetsov Cc: Cathy Avery Cc: Olaf Hering --- You can also get the patch by (commit 84146dfb): https://github.com/dcui/linux/commits/decui/hv_sock/net-next/20160721_v18 For the change log before v12, please see https://lkml.org/lkml/2016/5/15/31 In v12, the changes are mainly the following: 1) remove the module params as David suggested. 2) use 5 exact pages for VMBus send/recv rings, respectively. The host side's design of the feature requires 5 exact pages for recv/send rings respectively -- this is suboptimal considering memory consumption, however unluckily we have to live with it, before the host comes up with a new design in the future. :-( 3) remove the per-connection static send/recv buffers Instead, we allocate and free the buffers dynamically only when we recv/send data. This means: when a connection is idle, no memory is consumed as recv/send buffers at all. In v13: I return ENOMEM on buffer alllocation failure Actually "man read/write" says "Other errors may occur, depending on the object connected to fd". "man send/recv" indeed lists ENOMEM. Considering AF_HYPERV is a new socket type, ENOMEM seems OK here. In the long run, I think we should add a new API in the VMBus driver, allowing data copy from VMBus ringbuffer into user mode buffer directly. This way, we can even eliminate this temporary buffer. In v14: fix some coding style issues pointed out by David. In v15: Just some stylistic changes addressing comments from Joe Perches and Olaf Hering -- thank you! - add a GPL blurb. - define a new macro PAGE_SIZE_4K and use it to replace PAGE_SIZE - change sk_to_hvsock/hvsock_to_sk() from macros to inline functions - remove a not-very-useful pr_err() - fix some typos in comment and coding style issues. In v16: Made stylistic changes addressing comments from Vitaly Kuznetsov. Thank you very much for the detailed comments, Vitaly! In v17: - PAGE_SIZE -> PAGE_SIZE_4K - allow regular users to use the socket Thank you Michal Kubecek for the suggestions! In v18: Just some tiny updates to address some spurious compiler warnings: "xxx may be used uninitialized in this function". Looking forward to your comments! MAINTAINERS |2 + include/linux/hyperv.h | 13 + include/linux/socket.h |4 +- include/net/af_hvsock.h | 78 +++ include/uapi/linux/hyperv.h | 23 + net/Kconfig |1 + net/Makefile|1 + net/hv_sock/Kconfig | 10 + net/hv_sock/Makefile|3 + net/hv_sock/af_hvsock.c | 1507 +++ 10 files changed, 1641 insertions(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 50f69ba..6eaa26f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5514,7 +5514,9 @@ F:drivers/pci/host/pci-hyperv.c F: drivers/net/hyperv/ F: drivers/scsi/storvsc_drv.c F: drivers/video/fbdev/hyperv_fb.c +F: net/hv_sock/ F: include/linux/hyperv.h +F: include/net/af_hvsock.h F: tools/hv/ F: Documentation/ABI/stable/sysfs-bus-vmbus diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index 50f493e..1cda6ea5 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -1508,5 +1508,18 @@ static inline void commit_rd_index(struct vmbus_channel *channel) vmbus_set_event(channel); } +struct vmpipe_proto_header { + u32 pkt_type; + u32 data_size; +}; + +#define HVSOCK_HEADER_LEN (sizeof(struct vmpacket_descriptor) + \ +sizeof(struct vmpipe_proto_header)) + +/* See 'prev_indices' in hv_ringbuffer_read(), hv_ringbuffer_write() */ +#define PREV_INDICES_LEN (sizeof(u64)) +#define HVSOCK_PKT_LEN(payload_len)(HVSOCK_HEADER_LEN + \ + ALIGN((payload_len), 8) + \ + PREV_INDICES_LEN) #endif /* _HYPERV_H */ diff --git a/include/linux/socket.h b/include/linux/socket.h index b5cc5a6..0b68b58 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -202,8
[PATCH v18 net-next 1/1] hv_sock: introduce Hyper-V Sockets
Hyper-V Sockets (hv_sock) supplies a byte-stream based communication mechanism between the host and the guest. It's somewhat like TCP over VMBus, but the transportation layer (VMBus) is much simpler than IP. With Hyper-V Sockets, applications between the host and the guest can talk to each other directly by the traditional BSD-style socket APIs. Hyper-V Sockets is only available on new Windows hosts, like Windows Server 2016. More info is in this article "Make your own integration services": https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service The patch implements the necessary support in the guest side by introducing a new socket address family AF_HYPERV. Signed-off-by: Dexuan Cui Cc: "K. Y. Srinivasan" Cc: Haiyang Zhang Cc: Vitaly Kuznetsov Cc: Cathy Avery Cc: Olaf Hering --- You can also get the patch by (commit 84146dfb): https://github.com/dcui/linux/commits/decui/hv_sock/net-next/20160721_v18 For the change log before v12, please see https://lkml.org/lkml/2016/5/15/31 In v12, the changes are mainly the following: 1) remove the module params as David suggested. 2) use 5 exact pages for VMBus send/recv rings, respectively. The host side's design of the feature requires 5 exact pages for recv/send rings respectively -- this is suboptimal considering memory consumption, however unluckily we have to live with it, before the host comes up with a new design in the future. :-( 3) remove the per-connection static send/recv buffers Instead, we allocate and free the buffers dynamically only when we recv/send data. This means: when a connection is idle, no memory is consumed as recv/send buffers at all. In v13: I return ENOMEM on buffer alllocation failure Actually "man read/write" says "Other errors may occur, depending on the object connected to fd". "man send/recv" indeed lists ENOMEM. Considering AF_HYPERV is a new socket type, ENOMEM seems OK here. In the long run, I think we should add a new API in the VMBus driver, allowing data copy from VMBus ringbuffer into user mode buffer directly. This way, we can even eliminate this temporary buffer. In v14: fix some coding style issues pointed out by David. In v15: Just some stylistic changes addressing comments from Joe Perches and Olaf Hering -- thank you! - add a GPL blurb. - define a new macro PAGE_SIZE_4K and use it to replace PAGE_SIZE - change sk_to_hvsock/hvsock_to_sk() from macros to inline functions - remove a not-very-useful pr_err() - fix some typos in comment and coding style issues. In v16: Made stylistic changes addressing comments from Vitaly Kuznetsov. Thank you very much for the detailed comments, Vitaly! In v17: - PAGE_SIZE -> PAGE_SIZE_4K - allow regular users to use the socket Thank you Michal Kubecek for the suggestions! In v18: Just some tiny updates to address some spurious compiler warnings: "xxx may be used uninitialized in this function". Looking forward to your comments! MAINTAINERS |2 + include/linux/hyperv.h | 13 + include/linux/socket.h |4 +- include/net/af_hvsock.h | 78 +++ include/uapi/linux/hyperv.h | 23 + net/Kconfig |1 + net/Makefile|1 + net/hv_sock/Kconfig | 10 + net/hv_sock/Makefile|3 + net/hv_sock/af_hvsock.c | 1507 +++ 10 files changed, 1641 insertions(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 50f69ba..6eaa26f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5514,7 +5514,9 @@ F:drivers/pci/host/pci-hyperv.c F: drivers/net/hyperv/ F: drivers/scsi/storvsc_drv.c F: drivers/video/fbdev/hyperv_fb.c +F: net/hv_sock/ F: include/linux/hyperv.h +F: include/net/af_hvsock.h F: tools/hv/ F: Documentation/ABI/stable/sysfs-bus-vmbus diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index 50f493e..1cda6ea5 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -1508,5 +1508,18 @@ static inline void commit_rd_index(struct vmbus_channel *channel) vmbus_set_event(channel); } +struct vmpipe_proto_header { + u32 pkt_type; + u32 data_size; +}; + +#define HVSOCK_HEADER_LEN (sizeof(struct vmpacket_descriptor) + \ +sizeof(struct vmpipe_proto_header)) + +/* See 'prev_indices' in hv_ringbuffer_read(), hv_ringbuffer_write() */ +#define PREV_INDICES_LEN (sizeof(u64)) +#define HVSOCK_PKT_LEN(payload_len)(HVSOCK_HEADER_LEN + \ + ALIGN((payload_len), 8) + \ + PREV_INDICES_LEN) #endif /* _HYPERV_H */ diff --git a/include/linux/socket.h b/include/linux/socket.h index b5cc5a6..0b68b58 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -202,8 +202,9 @@ struct ucred { #define AF_VSOCK 40 /* vSockets */ #define AF_KCM 41
Re: [PATCH v3] virtio_blk: Fix a slient kernel panic
Ping, Any comment is appreciate. Thanks Minfei > On Jul 19, 2016, at 20:22, Cornelia Huckwrote: > > On Tue, 19 Jul 2016 12:32:42 +0800 > Minfei Huang wrote: > >> From: Minfei Huang >> >> We do a lot of memory allocation in function init_vq, and don't handle >> the allocation failure properly. Then this function will return 0, >> although initialization fails due to lacking memory. At that moment, >> kernel will panic in guest machine, if virtio is used to drive disk. >> >> To fix this bug, we should take care of allocation failure, and return >> correct value to let caller know what happen. >> >> Tested-by: Chao Fan >> Signed-off-by: Minfei Huang >> Signed-off-by: Minfei Huang >> --- >> v2: >> - Remove useless initialisation to NULL >> v1: >> - Refactor the patch to make code more readable >> --- >> drivers/block/virtio_blk.c | 26 -- >> 1 file changed, 8 insertions(+), 18 deletions(-) > > Your changes certainly make the function more compact. > > Reviewed-by: Cornelia Huck smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH v3] virtio_blk: Fix a slient kernel panic
Ping, Any comment is appreciate. Thanks Minfei > On Jul 19, 2016, at 20:22, Cornelia Huck wrote: > > On Tue, 19 Jul 2016 12:32:42 +0800 > Minfei Huang wrote: > >> From: Minfei Huang >> >> We do a lot of memory allocation in function init_vq, and don't handle >> the allocation failure properly. Then this function will return 0, >> although initialization fails due to lacking memory. At that moment, >> kernel will panic in guest machine, if virtio is used to drive disk. >> >> To fix this bug, we should take care of allocation failure, and return >> correct value to let caller know what happen. >> >> Tested-by: Chao Fan >> Signed-off-by: Minfei Huang >> Signed-off-by: Minfei Huang >> --- >> v2: >> - Remove useless initialisation to NULL >> v1: >> - Refactor the patch to make code more readable >> --- >> drivers/block/virtio_blk.c | 26 -- >> 1 file changed, 8 insertions(+), 18 deletions(-) > > Your changes certainly make the function more compact. > > Reviewed-by: Cornelia Huck smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH] add u64 number parser
On 07/22/16 17:23, James Smart wrote: + buf = kmalloc(len + 1, GFP_KERNEL); + if (!buf) + return -ENOMEM; + memcpy(buf, s->from, len); + buf[len] = '\0'; Hello James, Have you considered to combine the above kmalloc() and memcpy() calls into a single kasprintf(GFP_KERNEL, "%.*s", len, s->from) call? Bart.
Re: [PATCH] add u64 number parser
On 07/22/16 17:23, James Smart wrote: + buf = kmalloc(len + 1, GFP_KERNEL); + if (!buf) + return -ENOMEM; + memcpy(buf, s->from, len); + buf[len] = '\0'; Hello James, Have you considered to combine the above kmalloc() and memcpy() calls into a single kasprintf(GFP_KERNEL, "%.*s", len, s->from) call? Bart.
[PATCH v18 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)
Hyper-V Sockets (hv_sock) supplies a byte-stream based communication mechanism between the host and the guest. It's somewhat like TCP over VMBus, but the transportation layer (VMBus) is much simpler than IP. With Hyper-V Sockets, applications between the host and the guest can talk to each other directly by the traditional BSD-style socket APIs. Hyper-V Sockets is only available on new Windows hosts, like Windows Server 2016. More info is in this article "Make your own integration services": https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service The patch implements the necessary support in the guest side by introducing a new socket address family AF_HYPERV. You can also get the patch by (commit 84146dfb): https://github.com/dcui/linux/commits/decui/hv_sock/net-next/20160721_v18 Note: the VMBus driver side's supporting patches have been in the mainline tree. I know the kernel has already had a VM Sockets driver (AF_VSOCK) based on VMware VMCI (net/vmw_vsock/, drivers/misc/vmw_vmci), and KVM is proposing AF_VSOCK of virtio version: http://marc.info/?l=linux-netdev=145952064004765=2 However, though Hyper-V Sockets may seem conceptually similar to AF_VOSCK, there are differences in the transportation layer, and IMO these make the direct code reusing impractical: 1. In AF_VSOCK, the endpoint type is: , but in AF_HYPERV, the endpoint type is: . Here GUID is 128-bit. 2. AF_VSOCK supports SOCK_DGRAM, while AF_HYPERV doesn't. 3. AF_VSOCK supports some special sock opts, like SO_VM_SOCKETS_BUFFER_SIZE, SO_VM_SOCKETS_BUFFER_MIN/MAX_SIZE and SO_VM_SOCKETS_CONNECT_TIMEOUT. These are meaningless to AF_HYPERV. 4. Some AF_VSOCK's VMCI transportation ops are meanless to AF_HYPERV/VMBus, like .notify_recv_init .notify_recv_pre_block .notify_recv_pre_dequeue .notify_recv_post_dequeue .notify_send_init .notify_send_pre_block .notify_send_pre_enqueue .notify_send_post_enqueue etc. So I think we'd better introduce a new address family: AF_HYPERV. Please review the patch. Looking forward to your comments, especially comments from David. :-) Changes since v1: - updated "[PATCH 6/7] hvsock: introduce Hyper-V VM Sockets feature" - added __init and __exit for the module init/exit functions - net/hv_sock/Kconfig: "default m" -> "default m if HYPERV" - MODULE_LICENSE: "Dual MIT/GPL" -> "Dual BSD/GPL" Changes since v2: - fixed various coding issue pointed out by David Miller - fixed indentation issues - removed pr_debug in net/hv_sock/af_hvsock.c - used reverse-Chrismas-tree style for local variables. - EXPORT_SYMBOL -> EXPORT_SYMBOL_GPL Changes since v3: - fixed a few coding issue pointed by Vitaly Kuznetsov and Dan Carpenter - fixed the ret value in vmbus_recvpacket_hvsock on error - fixed the style of multi-line comment: vmbus_get_hvsock_rw_status() Changes since v4 (https://lkml.org/lkml/2015/7/28/404): - addressed all the comments about V4. - treat the hvsock offers/channels as special VMBus devices - add a mechanism to pass hvsock events to the hvsock driver - fixed some corner cases with proper locking when a connection is closed - rebased to the latest Greg's tree Changes since v5 (https://lkml.org/lkml/2015/12/24/103): - addressed the coding style issues (Vitaly Kuznetsov & David Miller, thanks!) - used a better coding for the per-channel rescind callback (Thank Vitaly!) - avoided the introduction of new VMBUS driver APIs vmbus_sendpacket_hvsock() and vmbus_recvpacket_hvsock() and used vmbus_sendpacket()/vmbus_recvpacket() in the higher level (i.e., the vmsock driver). Thank Vitaly! Changes since v6 (http://lkml.iu.edu/hypermail/linux/kernel/1601.3/01813.html) - only a few minor changes of coding style and comments Changes since v7 - a few minor changes of coding style: thanks, Joe Perches! - added some lines of comments about GUID/UUID before the struct sockaddr_hv. Changes since v8 - removed the unnecessary __packed for some definitions: thanks, David! - hvsock_open_connection: use offer.u.pipe.user_def[0] to know the connection and reorganized the function direction - reorganized the code according to suggestions from Cathy Avery: split big functions into small ones, set .setsockopt and getsockopt to sock_no_setsockopt/sock_no_getsockopt - inline'd some small list helper functions Changes since v9 - minimized struct hvsock_sock by making the send/recv buffers pointers. the buffers are allocated by kmalloc() in __hvsock_create() now. - minimized the sizes of the send/recv buffers and the vmbus ringbuffers. Changes since v10 1) add module params: send_ring_page, recv_ring_page. They can be used to enlarge the ringbuffer size to get better performance, e.g., # modprobe hv_sock recv_ring_page=16 send_ring_page=16 By default, recv_ring_page is 3 and send_ring_page is 2. 2) add module param max_socket_number (the default is 1024). A user can enlarge the number to create more than 1024 hv_sock sockets. By default, 1024 sockets take about 1024 * (3+2+1+1) * 4KB
[PATCH v18 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)
Hyper-V Sockets (hv_sock) supplies a byte-stream based communication mechanism between the host and the guest. It's somewhat like TCP over VMBus, but the transportation layer (VMBus) is much simpler than IP. With Hyper-V Sockets, applications between the host and the guest can talk to each other directly by the traditional BSD-style socket APIs. Hyper-V Sockets is only available on new Windows hosts, like Windows Server 2016. More info is in this article "Make your own integration services": https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service The patch implements the necessary support in the guest side by introducing a new socket address family AF_HYPERV. You can also get the patch by (commit 84146dfb): https://github.com/dcui/linux/commits/decui/hv_sock/net-next/20160721_v18 Note: the VMBus driver side's supporting patches have been in the mainline tree. I know the kernel has already had a VM Sockets driver (AF_VSOCK) based on VMware VMCI (net/vmw_vsock/, drivers/misc/vmw_vmci), and KVM is proposing AF_VSOCK of virtio version: http://marc.info/?l=linux-netdev=145952064004765=2 However, though Hyper-V Sockets may seem conceptually similar to AF_VOSCK, there are differences in the transportation layer, and IMO these make the direct code reusing impractical: 1. In AF_VSOCK, the endpoint type is: , but in AF_HYPERV, the endpoint type is: . Here GUID is 128-bit. 2. AF_VSOCK supports SOCK_DGRAM, while AF_HYPERV doesn't. 3. AF_VSOCK supports some special sock opts, like SO_VM_SOCKETS_BUFFER_SIZE, SO_VM_SOCKETS_BUFFER_MIN/MAX_SIZE and SO_VM_SOCKETS_CONNECT_TIMEOUT. These are meaningless to AF_HYPERV. 4. Some AF_VSOCK's VMCI transportation ops are meanless to AF_HYPERV/VMBus, like .notify_recv_init .notify_recv_pre_block .notify_recv_pre_dequeue .notify_recv_post_dequeue .notify_send_init .notify_send_pre_block .notify_send_pre_enqueue .notify_send_post_enqueue etc. So I think we'd better introduce a new address family: AF_HYPERV. Please review the patch. Looking forward to your comments, especially comments from David. :-) Changes since v1: - updated "[PATCH 6/7] hvsock: introduce Hyper-V VM Sockets feature" - added __init and __exit for the module init/exit functions - net/hv_sock/Kconfig: "default m" -> "default m if HYPERV" - MODULE_LICENSE: "Dual MIT/GPL" -> "Dual BSD/GPL" Changes since v2: - fixed various coding issue pointed out by David Miller - fixed indentation issues - removed pr_debug in net/hv_sock/af_hvsock.c - used reverse-Chrismas-tree style for local variables. - EXPORT_SYMBOL -> EXPORT_SYMBOL_GPL Changes since v3: - fixed a few coding issue pointed by Vitaly Kuznetsov and Dan Carpenter - fixed the ret value in vmbus_recvpacket_hvsock on error - fixed the style of multi-line comment: vmbus_get_hvsock_rw_status() Changes since v4 (https://lkml.org/lkml/2015/7/28/404): - addressed all the comments about V4. - treat the hvsock offers/channels as special VMBus devices - add a mechanism to pass hvsock events to the hvsock driver - fixed some corner cases with proper locking when a connection is closed - rebased to the latest Greg's tree Changes since v5 (https://lkml.org/lkml/2015/12/24/103): - addressed the coding style issues (Vitaly Kuznetsov & David Miller, thanks!) - used a better coding for the per-channel rescind callback (Thank Vitaly!) - avoided the introduction of new VMBUS driver APIs vmbus_sendpacket_hvsock() and vmbus_recvpacket_hvsock() and used vmbus_sendpacket()/vmbus_recvpacket() in the higher level (i.e., the vmsock driver). Thank Vitaly! Changes since v6 (http://lkml.iu.edu/hypermail/linux/kernel/1601.3/01813.html) - only a few minor changes of coding style and comments Changes since v7 - a few minor changes of coding style: thanks, Joe Perches! - added some lines of comments about GUID/UUID before the struct sockaddr_hv. Changes since v8 - removed the unnecessary __packed for some definitions: thanks, David! - hvsock_open_connection: use offer.u.pipe.user_def[0] to know the connection and reorganized the function direction - reorganized the code according to suggestions from Cathy Avery: split big functions into small ones, set .setsockopt and getsockopt to sock_no_setsockopt/sock_no_getsockopt - inline'd some small list helper functions Changes since v9 - minimized struct hvsock_sock by making the send/recv buffers pointers. the buffers are allocated by kmalloc() in __hvsock_create() now. - minimized the sizes of the send/recv buffers and the vmbus ringbuffers. Changes since v10 1) add module params: send_ring_page, recv_ring_page. They can be used to enlarge the ringbuffer size to get better performance, e.g., # modprobe hv_sock recv_ring_page=16 send_ring_page=16 By default, recv_ring_page is 3 and send_ring_page is 2. 2) add module param max_socket_number (the default is 1024). A user can enlarge the number to create more than 1024 hv_sock sockets. By default, 1024 sockets take about 1024 * (3+2+1+1) * 4KB
Re: [PATCH 1/3] Add a new field to struct shrinker
On 07/20/2016 07:54 AM, Michal Hocko wrote: >> Michal, just to make sure I understand you correctly, do you mean that we >> could infer the names of the shrinkers by looking at the names of their >> callbacks? > > Yes, %ps can then be used for the name of the shrinker structure > (assuming it is available). This is fine for emitting via the ftrace /sys interface, but in order to have the data [name] get marshalled thru to perf (for example) you need to add it to the TP_fast_assign entry. tony
Re: [PATCH 1/3] Add a new field to struct shrinker
On 07/20/2016 07:54 AM, Michal Hocko wrote: >> Michal, just to make sure I understand you correctly, do you mean that we >> could infer the names of the shrinkers by looking at the names of their >> callbacks? > > Yes, %ps can then be used for the name of the shrinker structure > (assuming it is available). This is fine for emitting via the ftrace /sys interface, but in order to have the data [name] get marshalled thru to perf (for example) you need to add it to the TP_fast_assign entry. tony
Re: [PATCH 2/5] mm: add per-zone lru list stat
Hi Fengguang, On Sat, Jul 23, 2016 at 08:45:15AM +0800, Fengguang Wu wrote: > Hi Minchan, > > We find duplicate /proc/vmstat lines showing up in linux-next, which > look related to this patch. > > >>--- a/mm/vmstat.c > >>+++ b/mm/vmstat.c > >>@@ -921,6 +921,11 @@ int fragmentation_index(struct zone *zone, unsigned > >>int order) > >> const char * const vmstat_text[] = { > >>/* enum zone_stat_item countes */ > >>"nr_free_pages", > >>+ "nr_inactive_anon", > >>+ "nr_active_anon", > >>+ "nr_inactive_file", > >>+ "nr_active_file", > >>+ "nr_unevictable", > >>"nr_mlock", > >>"nr_slab_reclaimable", > >>"nr_slab_unreclaimable", > > In the below vmstat output, "nr_inactive_anon 2217" is shown twice. > So do the other entries added by the above chunk. > > nr_free_pages 831238 > nr_inactive_anon 2217 > nr_active_anon 4386 > nr_inactive_file 117467 > nr_active_file 4602 > nr_unevictable 0 > nr_zone_write_pending 0 > nr_mlock 0 > nr_slab_reclaimable 8323 > nr_slab_unreclaimable 4641 > nr_page_table_pages 870 > nr_kernel_stack 3776 > nr_bounce 0 > nr_zspages 0 > numa_hit 201105 > numa_miss 0 > numa_foreign 0 > numa_interleave 66970 > numa_local 201105 > numa_other 0 > nr_free_cma 0 > nr_inactive_anon 2217 > nr_active_anon 4368 > nr_inactive_file 117449 > nr_active_file 4620 > nr_unevictable 0 > nr_isolated_anon 0 > nr_isolated_file 0 > nr_pages_scanned 0 > workingset_refault 0 > workingset_activate 0 > workingset_nodereclaim 0 > nr_anon_pages 4321 > nr_mapped 3469 > nr_file_pages 124348 > nr_dirty 0 > nr_writeback 0 > nr_writeback_temp 0 > nr_shmem 2279 > nr_shmem_hugepages 0 > nr_shmem_pmdmapped 0 Thanks for catching that. We need a decision to maintain LRU stat both per-zone and per-node. Mel, do you want to keep the LRU stat in per-node in addition?
Re: [PATCH 2/5] mm: add per-zone lru list stat
Hi Fengguang, On Sat, Jul 23, 2016 at 08:45:15AM +0800, Fengguang Wu wrote: > Hi Minchan, > > We find duplicate /proc/vmstat lines showing up in linux-next, which > look related to this patch. > > >>--- a/mm/vmstat.c > >>+++ b/mm/vmstat.c > >>@@ -921,6 +921,11 @@ int fragmentation_index(struct zone *zone, unsigned > >>int order) > >> const char * const vmstat_text[] = { > >>/* enum zone_stat_item countes */ > >>"nr_free_pages", > >>+ "nr_inactive_anon", > >>+ "nr_active_anon", > >>+ "nr_inactive_file", > >>+ "nr_active_file", > >>+ "nr_unevictable", > >>"nr_mlock", > >>"nr_slab_reclaimable", > >>"nr_slab_unreclaimable", > > In the below vmstat output, "nr_inactive_anon 2217" is shown twice. > So do the other entries added by the above chunk. > > nr_free_pages 831238 > nr_inactive_anon 2217 > nr_active_anon 4386 > nr_inactive_file 117467 > nr_active_file 4602 > nr_unevictable 0 > nr_zone_write_pending 0 > nr_mlock 0 > nr_slab_reclaimable 8323 > nr_slab_unreclaimable 4641 > nr_page_table_pages 870 > nr_kernel_stack 3776 > nr_bounce 0 > nr_zspages 0 > numa_hit 201105 > numa_miss 0 > numa_foreign 0 > numa_interleave 66970 > numa_local 201105 > numa_other 0 > nr_free_cma 0 > nr_inactive_anon 2217 > nr_active_anon 4368 > nr_inactive_file 117449 > nr_active_file 4620 > nr_unevictable 0 > nr_isolated_anon 0 > nr_isolated_file 0 > nr_pages_scanned 0 > workingset_refault 0 > workingset_activate 0 > workingset_nodereclaim 0 > nr_anon_pages 4321 > nr_mapped 3469 > nr_file_pages 124348 > nr_dirty 0 > nr_writeback 0 > nr_writeback_temp 0 > nr_shmem 2279 > nr_shmem_hugepages 0 > nr_shmem_pmdmapped 0 Thanks for catching that. We need a decision to maintain LRU stat both per-zone and per-node. Mel, do you want to keep the LRU stat in per-node in addition?
ADS1118: hwmon or iio ? [was: Re: [RCF 1/3] hwmon: Add ads1118 driver]
On 07/22/2016 07:39 AM, Joshua Clayton wrote: Greetings Guenter, Thank you for reviewing my submission. On 07/15/2016 06:40 PM, Guenter Roeck wrote: On 07/15/2016 05:18 PM, Joshua Clayton wrote: Add new driver for Texas Instruments ADS1118 and and ADS1018. This driver works with ADS1018, because of code borrowed from asd1015, which is similar, but I can only test ADS1118 Browsing through the datasheet, I think this should probably be implemented as iio driver (and iio already has a driver for ads1015). Jonathan, what do you think ? Thanks, Guenter No response from Jonathan as yet, but I am willing to rework the driver when I have some time. It might be weeks before I can start, though. Changing the subject - maybe it helps to get Jonathan's attention. Guenter
ADS1118: hwmon or iio ? [was: Re: [RCF 1/3] hwmon: Add ads1118 driver]
On 07/22/2016 07:39 AM, Joshua Clayton wrote: Greetings Guenter, Thank you for reviewing my submission. On 07/15/2016 06:40 PM, Guenter Roeck wrote: On 07/15/2016 05:18 PM, Joshua Clayton wrote: Add new driver for Texas Instruments ADS1118 and and ADS1018. This driver works with ADS1018, because of code borrowed from asd1015, which is similar, but I can only test ADS1118 Browsing through the datasheet, I think this should probably be implemented as iio driver (and iio already has a driver for ads1015). Jonathan, what do you think ? Thanks, Guenter No response from Jonathan as yet, but I am willing to rework the driver when I have some time. It might be weeks before I can start, though. Changing the subject - maybe it helps to get Jonathan's attention. Guenter
Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code
On Fri, Jul 22, 2016 at 10:00 PM, Matt Flemingwrote: > > I suppose we could rewrite the page table mapping for those precious > <1MB regions to coerce the firmware into accessing different pages > instead of the 1:1 addresses and copy the regions elsewhere. Maybe. > That assumes we don't hit other firmware bugs though. .. it also assumes that firmware honors our page tables. Which at a minimum won't be true for things like SMM, which happens with paging entirely disabled. And that's exactly the kind of code that touches some of the memory in the low 1M region - things like the legacy keyboard state bits in the low 4kB etc, for crazy old DOS days, iirc. Linus
Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code
On Fri, Jul 22, 2016 at 10:00 PM, Matt Fleming wrote: > > I suppose we could rewrite the page table mapping for those precious > <1MB regions to coerce the firmware into accessing different pages > instead of the 1:1 addresses and copy the regions elsewhere. Maybe. > That assumes we don't hit other firmware bugs though. .. it also assumes that firmware honors our page tables. Which at a minimum won't be true for things like SMM, which happens with paging entirely disabled. And that's exactly the kind of code that touches some of the memory in the low 1M region - things like the legacy keyboard state bits in the low 4kB etc, for crazy old DOS days, iirc. Linus
Re: linux-next: build failure after merge of the nvdimm tree
On Thu, Jul 21, 2016 at 11:13 PM, Stephen Rothwellwrote: > Hi Dan, > > After merging the nvdimm tree, today's linux-next build (powerpc > ppc64_defconfig) failed like this: > > In file included from drivers/md/dm.h:14:0, > from drivers/md/dm-uevent.c:27: > include/linux/device-mapper.h:134:22: error: expected ';', ',' or ')' before > '*' token > void __pmem **kaddr, pfn_t *pfn, long size); > ^ > include/linux/device-mapper.h:182:2: error: unknown type name > 'dm_direct_access_fn' > dm_direct_access_fn direct_access; > ^ > > Caused by commit > > 7a9eb2066631 ("pmem: kill __pmem address space") > > interacting with commit > > 545ed20e6df6 ("dm: add infrastructure for DAX support") > > from the device-mapper tree. > > I applied the following merge fix patch for today. Someone needs to > tell Linus about this when he merges the trees. There's no real rush to remove "__pmem" I'll pull this out until after DM DAX support has merged. Thanks Stephen!
Re: linux-next: build failure after merge of the nvdimm tree
On Thu, Jul 21, 2016 at 11:13 PM, Stephen Rothwell wrote: > Hi Dan, > > After merging the nvdimm tree, today's linux-next build (powerpc > ppc64_defconfig) failed like this: > > In file included from drivers/md/dm.h:14:0, > from drivers/md/dm-uevent.c:27: > include/linux/device-mapper.h:134:22: error: expected ';', ',' or ')' before > '*' token > void __pmem **kaddr, pfn_t *pfn, long size); > ^ > include/linux/device-mapper.h:182:2: error: unknown type name > 'dm_direct_access_fn' > dm_direct_access_fn direct_access; > ^ > > Caused by commit > > 7a9eb2066631 ("pmem: kill __pmem address space") > > interacting with commit > > 545ed20e6df6 ("dm: add infrastructure for DAX support") > > from the device-mapper tree. > > I applied the following merge fix patch for today. Someone needs to > tell Linus about this when he merges the trees. There's no real rush to remove "__pmem" I'll pull this out until after DM DAX support has merged. Thanks Stephen!
Re: linux-next: Tree for Jul 22
On Fri, Jul 22, 2016 at 3:28 PM, Paul Gortmaker <paul.gortma...@windriver.com> wrote: [...] > > For x86-64 allmod, I am seeing this issue for the 1st time today > in my local builds, even though it doesn't appear in sfr's builds: > > FATAL: drivers/input/evbug: sizeof(struct input_device_id)=312 is not > a modulo of the size of section > __mod_input___device_table=384. > Fix definition of struct input_device_id in mod_devicetable.h > make[2]: *** [__modpost] Error 1 > make[1]: *** [modules] Error 2 > make: *** [sub-make] Error 2 > > paul@builder:~/git/linux-head$ git describe > next-20160722 > > Nothing has touched drivers/input/evbug.c for years... Circling back to look at this more, it seems to be a result of the makefiles not clobbering something properly when using the same build directory from yesterday's next build today. Once I executed a "make clean" I could not reproduce it. P. --
Re: linux-next: Tree for Jul 22
On Fri, Jul 22, 2016 at 3:28 PM, Paul Gortmaker wrote: [...] > > For x86-64 allmod, I am seeing this issue for the 1st time today > in my local builds, even though it doesn't appear in sfr's builds: > > FATAL: drivers/input/evbug: sizeof(struct input_device_id)=312 is not > a modulo of the size of section > __mod_input___device_table=384. > Fix definition of struct input_device_id in mod_devicetable.h > make[2]: *** [__modpost] Error 1 > make[1]: *** [modules] Error 2 > make: *** [sub-make] Error 2 > > paul@builder:~/git/linux-head$ git describe > next-20160722 > > Nothing has touched drivers/input/evbug.c for years... Circling back to look at this more, it seems to be a result of the makefiles not clobbering something properly when using the same build directory from yesterday's next build today. Once I executed a "make clean" I could not reproduce it. P. --
Re: [PATCH v2 16/17] x86/insn: remove pcommit
On Fri, Jul 22, 2016 at 9:52 AM, Ingo Molnarwrote: > > * Dan Williams wrote: > >> On Tue, Jul 12, 2016 at 3:12 PM, Dan Williams >> wrote: >> > On Tue, Jul 12, 2016 at 7:57 AM, Peter Zijlstra >> > wrote: >> >> On Sat, Jul 09, 2016 at 08:25:54PM -0700, Dan Williams wrote: >> >>> The pcommit instruction is being deprecated in favor of either ADR >> >>> (asynchronous DRAM refresh: flush-on-power-fail) at the platform level, >> >>> or >> >>> posted-write-queue flush addresses as defined by the ACPI 6.x NFIT >> >>> (NVDIMM >> >>> Firmware Interface Table). >> >> >> >>> arch/x86/include/asm/cpufeatures.h |1 >> >>> arch/x86/include/asm/special_insns.h | 46 >> >>> >> >>> arch/x86/lib/x86-opcode-map.txt|2 - >> >>> tools/objtool/arch/x86/insn/x86-opcode-map.txt |2 - >> >>> tools/perf/arch/x86/tests/insn-x86-dat-32.c|2 - >> >>> tools/perf/arch/x86/tests/insn-x86-dat-64.c|2 - >> >>> tools/perf/arch/x86/tests/insn-x86-dat-src.c |4 -- >> >> >> >> Just deprecated, or is it completely eradicated, removed from history, >> >> will never ever happen and we'll reissue the opcode for something else? >> >> >> >> Because if its only deprecated then removing it from the instruction >> >> decoders seems wrong, old binaries might still contain the opcode. >> > >> > Eradicated. >> > >> > "The new instructions like CLWB and CLFLUSHOPT will be rolled into the >> > SDM but PCOMMIT will be removed from the Extensions doc and not rolled >> > into the SDM." [1] >> > >> > Existing binaries are already gating their usage on the presence of >> > the cpu id flag, that flag and the instruction opcode are reserved >> > going forward. >> > >> > [1]: https://lists.01.org/pipermail/linux-nvdimm/2016-June/005923.html >> >> x86 maintainers, I have the other patches in this series queued in -next. >> Please >> ack this one and I'll add it for v4.8-rc1, or otherwise let me know how you >> want >> to handle this patch. > > Since it's just a removal AFAICS that the rest of your series should not > depend > on, can you submit it to the x86 tree? This patch depends on the previous patches in the series removing calls to pcommit_sfence().
Re: [PATCH v2 16/17] x86/insn: remove pcommit
On Fri, Jul 22, 2016 at 9:52 AM, Ingo Molnar wrote: > > * Dan Williams wrote: > >> On Tue, Jul 12, 2016 at 3:12 PM, Dan Williams >> wrote: >> > On Tue, Jul 12, 2016 at 7:57 AM, Peter Zijlstra >> > wrote: >> >> On Sat, Jul 09, 2016 at 08:25:54PM -0700, Dan Williams wrote: >> >>> The pcommit instruction is being deprecated in favor of either ADR >> >>> (asynchronous DRAM refresh: flush-on-power-fail) at the platform level, >> >>> or >> >>> posted-write-queue flush addresses as defined by the ACPI 6.x NFIT >> >>> (NVDIMM >> >>> Firmware Interface Table). >> >> >> >>> arch/x86/include/asm/cpufeatures.h |1 >> >>> arch/x86/include/asm/special_insns.h | 46 >> >>> >> >>> arch/x86/lib/x86-opcode-map.txt|2 - >> >>> tools/objtool/arch/x86/insn/x86-opcode-map.txt |2 - >> >>> tools/perf/arch/x86/tests/insn-x86-dat-32.c|2 - >> >>> tools/perf/arch/x86/tests/insn-x86-dat-64.c|2 - >> >>> tools/perf/arch/x86/tests/insn-x86-dat-src.c |4 -- >> >> >> >> Just deprecated, or is it completely eradicated, removed from history, >> >> will never ever happen and we'll reissue the opcode for something else? >> >> >> >> Because if its only deprecated then removing it from the instruction >> >> decoders seems wrong, old binaries might still contain the opcode. >> > >> > Eradicated. >> > >> > "The new instructions like CLWB and CLFLUSHOPT will be rolled into the >> > SDM but PCOMMIT will be removed from the Extensions doc and not rolled >> > into the SDM." [1] >> > >> > Existing binaries are already gating their usage on the presence of >> > the cpu id flag, that flag and the instruction opcode are reserved >> > going forward. >> > >> > [1]: https://lists.01.org/pipermail/linux-nvdimm/2016-June/005923.html >> >> x86 maintainers, I have the other patches in this series queued in -next. >> Please >> ack this one and I'll add it for v4.8-rc1, or otherwise let me know how you >> want >> to handle this patch. > > Since it's just a removal AFAICS that the rest of your series should not > depend > on, can you submit it to the x86 tree? This patch depends on the previous patches in the series removing calls to pcommit_sfence().
Re: [PATCH 2/5] mm: add per-zone lru list stat
Hi Minchan, We find duplicate /proc/vmstat lines showing up in linux-next, which look related to this patch. --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -921,6 +921,11 @@ int fragmentation_index(struct zone *zone, unsigned int order) const char * const vmstat_text[] = { /* enum zone_stat_item countes */ "nr_free_pages", + "nr_inactive_anon", + "nr_active_anon", + "nr_inactive_file", + "nr_active_file", + "nr_unevictable", "nr_mlock", "nr_slab_reclaimable", "nr_slab_unreclaimable", In the below vmstat output, "nr_inactive_anon 2217" is shown twice. So do the other entries added by the above chunk. nr_free_pages 831238 nr_inactive_anon 2217 nr_active_anon 4386 nr_inactive_file 117467 nr_active_file 4602 nr_unevictable 0 nr_zone_write_pending 0 nr_mlock 0 nr_slab_reclaimable 8323 nr_slab_unreclaimable 4641 nr_page_table_pages 870 nr_kernel_stack 3776 nr_bounce 0 nr_zspages 0 numa_hit 201105 numa_miss 0 numa_foreign 0 numa_interleave 66970 numa_local 201105 numa_other 0 nr_free_cma 0 nr_inactive_anon 2217 nr_active_anon 4368 nr_inactive_file 117449 nr_active_file 4620 nr_unevictable 0 nr_isolated_anon 0 nr_isolated_file 0 nr_pages_scanned 0 workingset_refault 0 workingset_activate 0 workingset_nodereclaim 0 nr_anon_pages 4321 nr_mapped 3469 nr_file_pages 124348 nr_dirty 0 nr_writeback 0 nr_writeback_temp 0 nr_shmem 2279 nr_shmem_hugepages 0 nr_shmem_pmdmapped 0 ... Thanks, Fengguang
Re: [PATCH 2/5] mm: add per-zone lru list stat
Hi Minchan, We find duplicate /proc/vmstat lines showing up in linux-next, which look related to this patch. --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -921,6 +921,11 @@ int fragmentation_index(struct zone *zone, unsigned int order) const char * const vmstat_text[] = { /* enum zone_stat_item countes */ "nr_free_pages", + "nr_inactive_anon", + "nr_active_anon", + "nr_inactive_file", + "nr_active_file", + "nr_unevictable", "nr_mlock", "nr_slab_reclaimable", "nr_slab_unreclaimable", In the below vmstat output, "nr_inactive_anon 2217" is shown twice. So do the other entries added by the above chunk. nr_free_pages 831238 nr_inactive_anon 2217 nr_active_anon 4386 nr_inactive_file 117467 nr_active_file 4602 nr_unevictable 0 nr_zone_write_pending 0 nr_mlock 0 nr_slab_reclaimable 8323 nr_slab_unreclaimable 4641 nr_page_table_pages 870 nr_kernel_stack 3776 nr_bounce 0 nr_zspages 0 numa_hit 201105 numa_miss 0 numa_foreign 0 numa_interleave 66970 numa_local 201105 numa_other 0 nr_free_cma 0 nr_inactive_anon 2217 nr_active_anon 4368 nr_inactive_file 117449 nr_active_file 4620 nr_unevictable 0 nr_isolated_anon 0 nr_isolated_file 0 nr_pages_scanned 0 workingset_refault 0 workingset_activate 0 workingset_nodereclaim 0 nr_anon_pages 4321 nr_mapped 3469 nr_file_pages 124348 nr_dirty 0 nr_writeback 0 nr_writeback_temp 0 nr_shmem 2279 nr_shmem_hugepages 0 nr_shmem_pmdmapped 0 ... Thanks, Fengguang
Re: [PATCH v4 00/12] mm: Hardened usercopy
On 07/20/2016 01:26 PM, Kees Cook wrote: Hi, [This is now in my kspp -next tree, though I'd really love to add some additional explicit Tested-bys, Reviewed-bys, or Acked-bys. If you've looked through any part of this or have done any testing, please consider sending an email with your "*-by:" line. :)] This is a start of the mainline port of PAX_USERCOPY[1]. After writing tests (now in lkdtm in -next) for Casey's earlier port[2], I kept tweaking things further and further until I ended up with a whole new patch series. To that end, I took Rik, Laura, and other people's feedback along with additional changes and clean-ups. Based on my understanding, PAX_USERCOPY was designed to catch a few classes of flaws (mainly bad bounds checking) around the use of copy_to_user()/copy_from_user(). These changes don't touch get_user() and put_user(), since these operate on constant sized lengths, and tend to be much less vulnerable. There are effectively three distinct protections in the whole series, each of which I've given a separate CONFIG, though this patch set is only the first of the three intended protections. (Generally speaking, PAX_USERCOPY covers what I'm calling CONFIG_HARDENED_USERCOPY (this) and CONFIG_HARDENED_USERCOPY_WHITELIST (future), and PAX_USERCOPY_SLABS covers CONFIG_HARDENED_USERCOPY_SPLIT_KMALLOC (future).) This series, which adds CONFIG_HARDENED_USERCOPY, checks that objects being copied to/from userspace meet certain criteria: - if address is a heap object, the size must not exceed the object's allocated size. (This will catch all kinds of heap overflow flaws.) - if address range is in the current process stack, it must be within the a valid stack frame (if such checking is possible) or at least entirely within the current process's stack. (This could catch large lengths that would have extended beyond the current process stack, or overflows if their length extends back into the original stack.) - if the address range is part of kernel data, rodata, or bss, allow it. - if address range is page-allocated, that it doesn't span multiple allocations (excepting Reserved and CMA pages). - if address is within the kernel text, reject it. - everything else is accepted The patches in the series are: - Support for examination of CMA page types: 1- mm: Add is_migrate_cma_page - Support for arch-specific stack frame checking (which will likely be replaced in the future by Josh's more comprehensive unwinder): 2- mm: Implement stack frame object validation - The core copy_to/from_user() checks, without the slab object checks: 3- mm: Hardened usercopy - Per-arch enablement of the protection: 4- x86/uaccess: Enable hardened usercopy 5- ARM: uaccess: Enable hardened usercopy 6- arm64/uaccess: Enable hardened usercopy 7- ia64/uaccess: Enable hardened usercopy 8- powerpc/uaccess: Enable hardened usercopy 9- sparc/uaccess: Enable hardened usercopy 10- s390/uaccess: Enable hardened usercopy - The heap allocator implementation of object size checking: 11- mm: SLAB hardened usercopy support 12- mm: SLUB hardened usercopy support Some notes: - This is expected to apply on top of -next which contains fixes for the position of _etext on both arm and arm64, though it has some conflicts with KASAN that should be trivial to fix up. Also in -next are the tests for this protection (in lkdtm), prefixed with USERCOPY_. - I couldn't detect a measurable performance change with these features enabled. Kernel build times were unchanged, hackbench was unchanged, etc. I think we could flip this to "on by default" at some point, but for now, I'm leaving it off until I can get some more definitive measurements. I would love if someone with greater familiarity with perf could give this a spin and report results. - The SLOB support extracted from grsecurity seems entirely broken. I have no idea what's going on there, I spent my time testing SLAB and SLUB. Having someone else look at SLOB would be nice, but this series doesn't depend on it. Additional features that would be nice, but aren't blocking this series: - Needs more architecture support for stack frame checking (only x86 now, but it seems Josh will have a good solution for this soon). Thanks! -Kees [1] https://grsecurity.net/download.php "grsecurity - test kernel patch" [2] http://www.openwall.com/lists/kernel-hardening/2016/05/19/5 v4: - handle CMA pages, labbott - update stack checker comments, labbott - check for vmalloc addresses, labbott - deal with KASAN in -next changing arm64 copy*user calls - check for linear mappings at runtime instead of via CONFIG v3: - switch to using BUG for better Oops integration - when checking page allocations, check each for Reserved - use enums for the stack check return for readability v2: - added s390 support - handle slub red zone - disallow writes to rodata area - stack frame walker now
Re: [PATCH v4 00/12] mm: Hardened usercopy
On 07/20/2016 01:26 PM, Kees Cook wrote: Hi, [This is now in my kspp -next tree, though I'd really love to add some additional explicit Tested-bys, Reviewed-bys, or Acked-bys. If you've looked through any part of this or have done any testing, please consider sending an email with your "*-by:" line. :)] This is a start of the mainline port of PAX_USERCOPY[1]. After writing tests (now in lkdtm in -next) for Casey's earlier port[2], I kept tweaking things further and further until I ended up with a whole new patch series. To that end, I took Rik, Laura, and other people's feedback along with additional changes and clean-ups. Based on my understanding, PAX_USERCOPY was designed to catch a few classes of flaws (mainly bad bounds checking) around the use of copy_to_user()/copy_from_user(). These changes don't touch get_user() and put_user(), since these operate on constant sized lengths, and tend to be much less vulnerable. There are effectively three distinct protections in the whole series, each of which I've given a separate CONFIG, though this patch set is only the first of the three intended protections. (Generally speaking, PAX_USERCOPY covers what I'm calling CONFIG_HARDENED_USERCOPY (this) and CONFIG_HARDENED_USERCOPY_WHITELIST (future), and PAX_USERCOPY_SLABS covers CONFIG_HARDENED_USERCOPY_SPLIT_KMALLOC (future).) This series, which adds CONFIG_HARDENED_USERCOPY, checks that objects being copied to/from userspace meet certain criteria: - if address is a heap object, the size must not exceed the object's allocated size. (This will catch all kinds of heap overflow flaws.) - if address range is in the current process stack, it must be within the a valid stack frame (if such checking is possible) or at least entirely within the current process's stack. (This could catch large lengths that would have extended beyond the current process stack, or overflows if their length extends back into the original stack.) - if the address range is part of kernel data, rodata, or bss, allow it. - if address range is page-allocated, that it doesn't span multiple allocations (excepting Reserved and CMA pages). - if address is within the kernel text, reject it. - everything else is accepted The patches in the series are: - Support for examination of CMA page types: 1- mm: Add is_migrate_cma_page - Support for arch-specific stack frame checking (which will likely be replaced in the future by Josh's more comprehensive unwinder): 2- mm: Implement stack frame object validation - The core copy_to/from_user() checks, without the slab object checks: 3- mm: Hardened usercopy - Per-arch enablement of the protection: 4- x86/uaccess: Enable hardened usercopy 5- ARM: uaccess: Enable hardened usercopy 6- arm64/uaccess: Enable hardened usercopy 7- ia64/uaccess: Enable hardened usercopy 8- powerpc/uaccess: Enable hardened usercopy 9- sparc/uaccess: Enable hardened usercopy 10- s390/uaccess: Enable hardened usercopy - The heap allocator implementation of object size checking: 11- mm: SLAB hardened usercopy support 12- mm: SLUB hardened usercopy support Some notes: - This is expected to apply on top of -next which contains fixes for the position of _etext on both arm and arm64, though it has some conflicts with KASAN that should be trivial to fix up. Also in -next are the tests for this protection (in lkdtm), prefixed with USERCOPY_. - I couldn't detect a measurable performance change with these features enabled. Kernel build times were unchanged, hackbench was unchanged, etc. I think we could flip this to "on by default" at some point, but for now, I'm leaving it off until I can get some more definitive measurements. I would love if someone with greater familiarity with perf could give this a spin and report results. - The SLOB support extracted from grsecurity seems entirely broken. I have no idea what's going on there, I spent my time testing SLAB and SLUB. Having someone else look at SLOB would be nice, but this series doesn't depend on it. Additional features that would be nice, but aren't blocking this series: - Needs more architecture support for stack frame checking (only x86 now, but it seems Josh will have a good solution for this soon). Thanks! -Kees [1] https://grsecurity.net/download.php "grsecurity - test kernel patch" [2] http://www.openwall.com/lists/kernel-hardening/2016/05/19/5 v4: - handle CMA pages, labbott - update stack checker comments, labbott - check for vmalloc addresses, labbott - deal with KASAN in -next changing arm64 copy*user calls - check for linear mappings at runtime instead of via CONFIG v3: - switch to using BUG for better Oops integration - when checking page allocations, check each for Reserved - use enums for the stack check return for readability v2: - added s390 support - handle slub red zone - disallow writes to rodata area - stack frame walker now
Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code
On Fri, Jul 22, 2016 at 5:22 PM, Linus Torvaldswrote: > > So without having yet looked at the code, I want people to understand > that to a very real degree, the stack tracer that the *oopsing* code > (ie what all the usual kernel fault handlers use) is very very special > code and needs to be handled very carefully, and needs to be extra > robust, even in the presence of stack corruption, and even in the > presence of the dwarf info being totally corrupted. Because we've very > much had both things happen. > > It is very possible that we should have two different stack tracers - > the stupid "for oopses only" code that doesn't necessarily give the > perfect trace, but is very anal and happily gives old stale addresses > (which can be very useful for seeing what happened just before the > "real" stack trace), and then a separate stack trace engine that is > clever and gets things right, and if that one faults it can depend on > the normal kernel fault handling picking up the pieces. I think that Josh's code has the potential to be extremely robust *and* give more correct results when possible. One thing I intend to review when v2 shows up is that it's as conservative as it needs to be to avoid ever dereferencing an out-of-bounds pointer. And Josh's oops printer carefully walks and prints out all addresses on the stack (complete with question marks) even if the unwinder doesn't find them. > > Yes, the current stack tracer is crufty. No, it's not perfect. But it > is very well tested, and has held up. That should not be dismissed. > I think you may be giving the current tracer slightly more credit than it's due. In my stack guard page patchset, I fixed two separate issues, one of which caused recursive faults and one of which caused it to output nothing at all. So maybe *now* it's very robust :) But it's still an umaintainable mess IMO, and Josh's patchset helps a *lot*. --Andy
Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code
On Fri, Jul 22, 2016 at 5:22 PM, Linus Torvalds wrote: > > So without having yet looked at the code, I want people to understand > that to a very real degree, the stack tracer that the *oopsing* code > (ie what all the usual kernel fault handlers use) is very very special > code and needs to be handled very carefully, and needs to be extra > robust, even in the presence of stack corruption, and even in the > presence of the dwarf info being totally corrupted. Because we've very > much had both things happen. > > It is very possible that we should have two different stack tracers - > the stupid "for oopses only" code that doesn't necessarily give the > perfect trace, but is very anal and happily gives old stale addresses > (which can be very useful for seeing what happened just before the > "real" stack trace), and then a separate stack trace engine that is > clever and gets things right, and if that one faults it can depend on > the normal kernel fault handling picking up the pieces. I think that Josh's code has the potential to be extremely robust *and* give more correct results when possible. One thing I intend to review when v2 shows up is that it's as conservative as it needs to be to avoid ever dereferencing an out-of-bounds pointer. And Josh's oops printer carefully walks and prints out all addresses on the stack (complete with question marks) even if the unwinder doesn't find them. > > Yes, the current stack tracer is crufty. No, it's not perfect. But it > is very well tested, and has held up. That should not be dismissed. > I think you may be giving the current tracer slightly more credit than it's due. In my stack guard page patchset, I fixed two separate issues, one of which caused recursive faults and one of which caused it to output nothing at all. So maybe *now* it's very robust :) But it's still an umaintainable mess IMO, and Josh's patchset helps a *lot*. --Andy
[PATCH] add u64 number parser
add u64 number parser Will be used by the nvme-fabrics FC transport in parsing options Signed-off-by: James Smart--- include/linux/parser.h | 1 + lib/parser.c | 47 +++ 2 files changed, 48 insertions(+) diff --git a/include/linux/parser.h b/include/linux/parser.h index 39d5b79..884c1e6 100644 --- a/include/linux/parser.h +++ b/include/linux/parser.h @@ -27,6 +27,7 @@ typedef struct { int match_token(char *, const match_table_t table, substring_t args[]); int match_int(substring_t *, int *result); +int match_u64(substring_t *, u64 *result); int match_octal(substring_t *, int *result); int match_hex(substring_t *, int *result); bool match_wildcard(const char *pattern, const char *str); diff --git a/lib/parser.c b/lib/parser.c index b6d1163..3278958 100644 --- a/lib/parser.c +++ b/lib/parser.c @@ -152,6 +152,36 @@ static int match_number(substring_t *s, int *result, int base) } /** + * match_u64int: scan a number in the given base from a substring_t + * @s: substring to be scanned + * @result: resulting u64 on success + * @base: base to use when converting string + * + * Description: Given a _t and a base, attempts to parse the substring + * as a number in that base. On success, sets @result to the integer represented + * by the string and returns 0. Returns -ENOMEM, -EINVAL, or -ERANGE on failure. + */ +static int match_u64int(substring_t *s, u64 *result, int base) +{ + char *buf; + int ret; + u64 val; + size_t len = s->to - s->from; + + buf = kmalloc(len + 1, GFP_KERNEL); + if (!buf) + return -ENOMEM; + memcpy(buf, s->from, len); + buf[len] = '\0'; + + ret = kstrtoull(buf, base, ); + if (!ret) + *result = val; + kfree(buf); + return ret; +} + +/** * match_int: - scan a decimal representation of an integer from a substring_t * @s: substring_t to be scanned * @result: resulting integer on success @@ -167,6 +197,23 @@ int match_int(substring_t *s, int *result) EXPORT_SYMBOL(match_int); /** + * match_u64: - scan a decimal representation of a u64 from + * a substring_t + * @s: substring_t to be scanned + * @result: resulting unsigned long long on success + * + * Description: Attempts to parse the _t @s as a long decimal + * integer. On success, sets @result to the integer represented by the + * string and returns 0. + * Returns -ENOMEM, -EINVAL, or -ERANGE on failure. + */ +int match_u64(substring_t *s, u64 *result) +{ + return match_u64int(s, result, 0); +} +EXPORT_SYMBOL(match_u64); + +/** * match_octal: - scan an octal representation of an integer from a substring_t * @s: substring_t to be scanned * @result: resulting integer on success -- 2.5.0
[PATCH] add u64 number parser
add u64 number parser Will be used by the nvme-fabrics FC transport in parsing options Signed-off-by: James Smart --- include/linux/parser.h | 1 + lib/parser.c | 47 +++ 2 files changed, 48 insertions(+) diff --git a/include/linux/parser.h b/include/linux/parser.h index 39d5b79..884c1e6 100644 --- a/include/linux/parser.h +++ b/include/linux/parser.h @@ -27,6 +27,7 @@ typedef struct { int match_token(char *, const match_table_t table, substring_t args[]); int match_int(substring_t *, int *result); +int match_u64(substring_t *, u64 *result); int match_octal(substring_t *, int *result); int match_hex(substring_t *, int *result); bool match_wildcard(const char *pattern, const char *str); diff --git a/lib/parser.c b/lib/parser.c index b6d1163..3278958 100644 --- a/lib/parser.c +++ b/lib/parser.c @@ -152,6 +152,36 @@ static int match_number(substring_t *s, int *result, int base) } /** + * match_u64int: scan a number in the given base from a substring_t + * @s: substring to be scanned + * @result: resulting u64 on success + * @base: base to use when converting string + * + * Description: Given a _t and a base, attempts to parse the substring + * as a number in that base. On success, sets @result to the integer represented + * by the string and returns 0. Returns -ENOMEM, -EINVAL, or -ERANGE on failure. + */ +static int match_u64int(substring_t *s, u64 *result, int base) +{ + char *buf; + int ret; + u64 val; + size_t len = s->to - s->from; + + buf = kmalloc(len + 1, GFP_KERNEL); + if (!buf) + return -ENOMEM; + memcpy(buf, s->from, len); + buf[len] = '\0'; + + ret = kstrtoull(buf, base, ); + if (!ret) + *result = val; + kfree(buf); + return ret; +} + +/** * match_int: - scan a decimal representation of an integer from a substring_t * @s: substring_t to be scanned * @result: resulting integer on success @@ -167,6 +197,23 @@ int match_int(substring_t *s, int *result) EXPORT_SYMBOL(match_int); /** + * match_u64: - scan a decimal representation of a u64 from + * a substring_t + * @s: substring_t to be scanned + * @result: resulting unsigned long long on success + * + * Description: Attempts to parse the _t @s as a long decimal + * integer. On success, sets @result to the integer represented by the + * string and returns 0. + * Returns -ENOMEM, -EINVAL, or -ERANGE on failure. + */ +int match_u64(substring_t *s, u64 *result) +{ + return match_u64int(s, result, 0); +} +EXPORT_SYMBOL(match_u64); + +/** * match_octal: - scan an octal representation of an integer from a substring_t * @s: substring_t to be scanned * @result: resulting integer on success -- 2.5.0
Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code
On Fri, Jul 22, 2016 at 6:21 AM, Josh Poimboeufwrote: > > Some if its advantages: > > - simplicity: no more callback sprawl and less code duplication. > > - flexibility: allows the caller to stop and inspect the stack state at > each step in the unwinding process. > > - modularity: the unwinder code, console stack dump code, and stack > metadata analysis code are all better separated so that changing one > of them shouldn't have much of an impact on any of the others. I've been without internet for the last week, so I have a ton pending, and not good enough internet even now to take a good look. However, I want to make one thing really really clear: the absolute NUMBER ONE requirement for the stack tracing code is none of the above. The #1 requirement is that it works, and not have a chance in hell of ever breaking. We had that happen once before when people wanted to make it fancy and add Dwarf info, and it was such a f*cking disaster that I am not sure I ever want to do that again. Seriously. It does not matter if the stack tracing gives the wrong answers. It does not matter if the stack tracing is complicated and odd old code. It does not matter one whit if some new user is inconvenienced, and in fact it is possible that new users should write their *own* stack tracer code. The ONLY thing that matters (to a very high degree) is that the code is stable, and if an Oops happens, the stack tracer never *ever* cause even more problems than we already have. If the stack tracer *ever* takes a recursive fault and kills the machine, the stack tracer is worse than bad - we'd be better off *without* a stack tracer at all. And yes, we had exactly that situation, where bugs in the stack tracer meant that other bugs ended up being much harder to debug, because instead of a nice logged oops message that would have been trivial to figure out, we very occasionally ended up with a dead machine instead. So without having yet looked at the code, I want people to understand that to a very real degree, the stack tracer that the *oopsing* code (ie what all the usual kernel fault handlers use) is very very special code and needs to be handled very carefully, and needs to be extra robust, even in the presence of stack corruption, and even in the presence of the dwarf info being totally corrupted. Because we've very much had both things happen. It is very possible that we should have two different stack tracers - the stupid "for oopses only" code that doesn't necessarily give the perfect trace, but is very anal and happily gives old stale addresses (which can be very useful for seeing what happened just before the "real" stack trace), and then a separate stack trace engine that is clever and gets things right, and if that one faults it can depend on the normal kernel fault handling picking up the pieces. Yes, the current stack tracer is crufty. No, it's not perfect. But it is very well tested, and has held up. That should not be dismissed. Linus
Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code
On Fri, Jul 22, 2016 at 6:21 AM, Josh Poimboeuf wrote: > > Some if its advantages: > > - simplicity: no more callback sprawl and less code duplication. > > - flexibility: allows the caller to stop and inspect the stack state at > each step in the unwinding process. > > - modularity: the unwinder code, console stack dump code, and stack > metadata analysis code are all better separated so that changing one > of them shouldn't have much of an impact on any of the others. I've been without internet for the last week, so I have a ton pending, and not good enough internet even now to take a good look. However, I want to make one thing really really clear: the absolute NUMBER ONE requirement for the stack tracing code is none of the above. The #1 requirement is that it works, and not have a chance in hell of ever breaking. We had that happen once before when people wanted to make it fancy and add Dwarf info, and it was such a f*cking disaster that I am not sure I ever want to do that again. Seriously. It does not matter if the stack tracing gives the wrong answers. It does not matter if the stack tracing is complicated and odd old code. It does not matter one whit if some new user is inconvenienced, and in fact it is possible that new users should write their *own* stack tracer code. The ONLY thing that matters (to a very high degree) is that the code is stable, and if an Oops happens, the stack tracer never *ever* cause even more problems than we already have. If the stack tracer *ever* takes a recursive fault and kills the machine, the stack tracer is worse than bad - we'd be better off *without* a stack tracer at all. And yes, we had exactly that situation, where bugs in the stack tracer meant that other bugs ended up being much harder to debug, because instead of a nice logged oops message that would have been trivial to figure out, we very occasionally ended up with a dead machine instead. So without having yet looked at the code, I want people to understand that to a very real degree, the stack tracer that the *oopsing* code (ie what all the usual kernel fault handlers use) is very very special code and needs to be handled very carefully, and needs to be extra robust, even in the presence of stack corruption, and even in the presence of the dwarf info being totally corrupted. Because we've very much had both things happen. It is very possible that we should have two different stack tracers - the stupid "for oopses only" code that doesn't necessarily give the perfect trace, but is very anal and happily gives old stale addresses (which can be very useful for seeing what happened just before the "real" stack trace), and then a separate stack trace engine that is clever and gets things right, and if that one faults it can depend on the normal kernel fault handling picking up the pieces. Yes, the current stack tracer is crufty. No, it's not perfect. But it is very well tested, and has held up. That should not be dismissed. Linus
[PATCH] f2fs: clean up codying style and redundancy
This patch includes minor clean-ups. Signed-off-by: Jaegeuk Kim--- fs/f2fs/acl.h | 2 +- fs/f2fs/data.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/f2fs/acl.h b/fs/f2fs/acl.h index 997ca8e..b2334d1 100644 --- a/fs/f2fs/acl.h +++ b/fs/f2fs/acl.h @@ -37,7 +37,7 @@ struct f2fs_acl_header { #ifdef CONFIG_F2FS_FS_POSIX_ACL extern struct posix_acl *f2fs_get_acl(struct inode *, int); -extern int f2fs_set_acl(struct inode *inode, struct posix_acl *acl, int type); +extern int f2fs_set_acl(struct inode *, struct posix_acl *, int); extern int f2fs_init_acl(struct inode *, struct inode *, struct page *, struct page *); #else diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c index 87a458f..614154f 100644 --- a/fs/f2fs/data.c +++ b/fs/f2fs/data.c @@ -1731,7 +1731,7 @@ static ssize_t f2fs_direct_IO(struct kiocb *iocb, struct iov_iter *iter) if (test_opt(F2FS_I_SB(inode), LFS)) return 0; - trace_f2fs_direct_IO_enter(inode, offset, count, iov_iter_rw(iter)); + trace_f2fs_direct_IO_enter(inode, offset, count, rw); down_read(_I(inode)->dio_rwsem[rw]); err = blockdev_direct_IO(iocb, inode, iter, get_data_block_dio); @@ -1744,7 +1744,7 @@ static ssize_t f2fs_direct_IO(struct kiocb *iocb, struct iov_iter *iter) f2fs_write_failed(mapping, offset + count); } - trace_f2fs_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), err); + trace_f2fs_direct_IO_exit(inode, offset, count, rw, err); return err; } -- 2.8.3
[PATCH] f2fs: clean up codying style and redundancy
This patch includes minor clean-ups. Signed-off-by: Jaegeuk Kim --- fs/f2fs/acl.h | 2 +- fs/f2fs/data.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/f2fs/acl.h b/fs/f2fs/acl.h index 997ca8e..b2334d1 100644 --- a/fs/f2fs/acl.h +++ b/fs/f2fs/acl.h @@ -37,7 +37,7 @@ struct f2fs_acl_header { #ifdef CONFIG_F2FS_FS_POSIX_ACL extern struct posix_acl *f2fs_get_acl(struct inode *, int); -extern int f2fs_set_acl(struct inode *inode, struct posix_acl *acl, int type); +extern int f2fs_set_acl(struct inode *, struct posix_acl *, int); extern int f2fs_init_acl(struct inode *, struct inode *, struct page *, struct page *); #else diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c index 87a458f..614154f 100644 --- a/fs/f2fs/data.c +++ b/fs/f2fs/data.c @@ -1731,7 +1731,7 @@ static ssize_t f2fs_direct_IO(struct kiocb *iocb, struct iov_iter *iter) if (test_opt(F2FS_I_SB(inode), LFS)) return 0; - trace_f2fs_direct_IO_enter(inode, offset, count, iov_iter_rw(iter)); + trace_f2fs_direct_IO_enter(inode, offset, count, rw); down_read(_I(inode)->dio_rwsem[rw]); err = blockdev_direct_IO(iocb, inode, iter, get_data_block_dio); @@ -1744,7 +1744,7 @@ static ssize_t f2fs_direct_IO(struct kiocb *iocb, struct iov_iter *iter) f2fs_write_failed(mapping, offset + count); } - trace_f2fs_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), err); + trace_f2fs_direct_IO_exit(inode, offset, count, rw, err); return err; } -- 2.8.3
Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
On Fri, Jul 22, 2016 at 4:54 PM, Josh Poimboeufwrote: >> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info >> > *info, >> > +unsigned long *visit_mask) >> > +{ >> > + unsigned long *begin = (unsigned long >> > *)this_cpu_read(hardirq_stack); >> > + unsigned long *end = begin + (THREAD_SIZE / sizeof(long)); >> > + >> > + if (stack < begin || stack >= end) >> > + return false; >> > + >> > + if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask)) >> > + return false; >> > + >> > + info->type = STACK_TYPE_IRQ; >> > + info->begin = begin; >> > + info->end = end; >> > + info->next = (unsigned long *)*begin; >> >> This works, but it's a bit magic. I don't suppose we could get rid of >> this ->next thing entirely and teach show_stack_log_lvl(), etc. to >> move from stack to stack by querying the stack type of whatever the >> frame base address is if the frame base address ends up being out of >> bounds for the current stack? Or maybe the unwinder could even do >> this by itself. > > I'm not quite sure what you mean here. The ->next stack pointer is > quite useful and it abstracts that ugliness away from the callers of > get_stack_info(). I'm open to any specific suggestions. So far I've found two users of this thing. One is show_stack_log_lvl(), and it makes sense there, but maybe info->heuristic_next_stack would be a better name. The other is the unwinder itself, and I think that walking from stack to stack using this heuristic is the wrong approach there, at least in the long term. I'd rather we just followed the bp chain wherever it leads us, as long as it leads us to a valid stack that we haven't visited before. As a concrete example of what I think is wrong with the current approach, ISTM it would be totally valid to implement stack switching like this: some_func: push %rbp mov %rsp, %rbp ... mov [next stack], %rsp call some_other_func mov %rbp, %rsp pop %rbp ret With the current approach, you can't unwind out of that function, because there's no way to populate info->next. I'm not actually suggesting that the kernel should ever do such a thing on x86, and my proposed rewrite of the IRQ stack code [1] will be fully compatible with your approach, but it seems odd to me that the unwinder should depend on idea that the stacks in use are chained together in a way that can be decoded without . (But maybe some of the Go compilers do work this way -- I've never looked at their precise stack layout.) Also, if you ever intend to port this thing to other architectures, I think there are architectures that have separate exception stacks and that track the next available slot on those stacks dynamically. I think that x86_32 is an example of this if task gates are used in a back-and-forth manner, although Linux doesn't do that. (x86_64 should have done this for IST, but it didn't.) On those architectures, you can have two separate switches onto the same stack live at the same time, and your current approach won't work. (Even if you make the change I'm suggesting, visit_mask will break, too, but fixing that would be a much less invasive change.) Am I making any sense? This is a suggestion for making it better, not something I see as a requirement for getting the x86 code upstream. >> >> > +static bool in_exception_stack(unsigned long *s, struct stack_info *info, >> > + unsigned long *visit_mask) >> > { >> > unsigned long stack = (unsigned long)s; >> > unsigned long begin, end; >> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long >> > *s, char **name, >> > if (stack < begin || stack >= end) >> > continue; >> > >> > - if (test_and_set_bit(k, visit_mask)) >> > + if (visit_mask && >> > + test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask)) >> > return false; >> > >> > - *name = exception_stack_names[k]; >> > - return (unsigned long *)end; >> > + info->type = STACK_TYPE_EXCEPTION + k; >> > + info->begin = (unsigned long *)begin; >> > + info->end = (unsigned long *)end; >> > + info->next = (unsigned long *)info->end[-2]; >> >> This is so magical that I don't immediately see why it's correct. >> Presumably it's because the thing two slots down from the top of the >> stack is regs->sp? If so, that needs a comment. > > Heck if I know, I just stole it from dump_trace() ;-) > > I'll figure it out and add a comment. If you can write it as: struct pt_regs *regs = (struct pt_regs *)end - 1; info->next = regs->sp; and it still works, then no comment required :) > >> But again, couldn't we use the fact that we now know how to decode >>
Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
On Fri, Jul 22, 2016 at 4:54 PM, Josh Poimboeuf wrote: >> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info >> > *info, >> > +unsigned long *visit_mask) >> > +{ >> > + unsigned long *begin = (unsigned long >> > *)this_cpu_read(hardirq_stack); >> > + unsigned long *end = begin + (THREAD_SIZE / sizeof(long)); >> > + >> > + if (stack < begin || stack >= end) >> > + return false; >> > + >> > + if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask)) >> > + return false; >> > + >> > + info->type = STACK_TYPE_IRQ; >> > + info->begin = begin; >> > + info->end = end; >> > + info->next = (unsigned long *)*begin; >> >> This works, but it's a bit magic. I don't suppose we could get rid of >> this ->next thing entirely and teach show_stack_log_lvl(), etc. to >> move from stack to stack by querying the stack type of whatever the >> frame base address is if the frame base address ends up being out of >> bounds for the current stack? Or maybe the unwinder could even do >> this by itself. > > I'm not quite sure what you mean here. The ->next stack pointer is > quite useful and it abstracts that ugliness away from the callers of > get_stack_info(). I'm open to any specific suggestions. So far I've found two users of this thing. One is show_stack_log_lvl(), and it makes sense there, but maybe info->heuristic_next_stack would be a better name. The other is the unwinder itself, and I think that walking from stack to stack using this heuristic is the wrong approach there, at least in the long term. I'd rather we just followed the bp chain wherever it leads us, as long as it leads us to a valid stack that we haven't visited before. As a concrete example of what I think is wrong with the current approach, ISTM it would be totally valid to implement stack switching like this: some_func: push %rbp mov %rsp, %rbp ... mov [next stack], %rsp call some_other_func mov %rbp, %rsp pop %rbp ret With the current approach, you can't unwind out of that function, because there's no way to populate info->next. I'm not actually suggesting that the kernel should ever do such a thing on x86, and my proposed rewrite of the IRQ stack code [1] will be fully compatible with your approach, but it seems odd to me that the unwinder should depend on idea that the stacks in use are chained together in a way that can be decoded without . (But maybe some of the Go compilers do work this way -- I've never looked at their precise stack layout.) Also, if you ever intend to port this thing to other architectures, I think there are architectures that have separate exception stacks and that track the next available slot on those stacks dynamically. I think that x86_32 is an example of this if task gates are used in a back-and-forth manner, although Linux doesn't do that. (x86_64 should have done this for IST, but it didn't.) On those architectures, you can have two separate switches onto the same stack live at the same time, and your current approach won't work. (Even if you make the change I'm suggesting, visit_mask will break, too, but fixing that would be a much less invasive change.) Am I making any sense? This is a suggestion for making it better, not something I see as a requirement for getting the x86 code upstream. >> >> > +static bool in_exception_stack(unsigned long *s, struct stack_info *info, >> > + unsigned long *visit_mask) >> > { >> > unsigned long stack = (unsigned long)s; >> > unsigned long begin, end; >> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long >> > *s, char **name, >> > if (stack < begin || stack >= end) >> > continue; >> > >> > - if (test_and_set_bit(k, visit_mask)) >> > + if (visit_mask && >> > + test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask)) >> > return false; >> > >> > - *name = exception_stack_names[k]; >> > - return (unsigned long *)end; >> > + info->type = STACK_TYPE_EXCEPTION + k; >> > + info->begin = (unsigned long *)begin; >> > + info->end = (unsigned long *)end; >> > + info->next = (unsigned long *)info->end[-2]; >> >> This is so magical that I don't immediately see why it's correct. >> Presumably it's because the thing two slots down from the top of the >> stack is regs->sp? If so, that needs a comment. > > Heck if I know, I just stole it from dump_trace() ;-) > > I'll figure it out and add a comment. If you can write it as: struct pt_regs *regs = (struct pt_regs *)end - 1; info->next = regs->sp; and it still works, then no comment required :) > >> But again, couldn't we use the fact that we now know how to decode >> pt_regs to avoid needing
Re: [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
On Fri, Jul 22 2016, Michal Hocko wrote: > On Fri 22-07-16 18:46:57, Neil Brown wrote: >> On Mon, Jul 18 2016, Michal Hocko wrote: >> >> > From: Michal Hocko>> > >> > Mikulas has reported that a swap backed by dm-crypt doesn't work >> > properly because the swapout cannot make a sufficient forward progress >> > as the writeout path depends on dm_crypt worker which has to allocate >> > memory to perform the encryption. In order to guarantee a forward >> > progress it relies on the mempool allocator. mempool_alloc(), however, >> > prefers to use the underlying (usually page) allocator before it grabs >> > objects from the pool. Such an allocation can dive into the memory >> > reclaim and consequently to throttle_vm_writeout. >> >> That's just broken. >> I used to think mempool should always use the pre-allocated reserves >> first. That is surely the most logical course of action. Otherwise >> that memory is just sitting there doing nothing useful. >> >> I spoke to Nick Piggin about this some years ago and he pointed out that >> the kmalloc allocation paths are much better optimized for low overhead >> when there is plenty of memory. They can just pluck a free block of a >> per-CPU list without taking any locks. By contrast, accessing the >> preallocated pool always requires a spinlock. >> >> So it makes lots of sense to prefer the underlying allocator if it can >> provide a quick response. If it cannot, the sensible thing is to use >> the pool, or wait for the pool to be replenished. >> >> So the allocator should never wait at all, never enter reclaim, never >> throttle. >> >> Looking at the current code, __GFP_DIRECT_RECLAIM is disabled the first >> time through, but if the pool is empty, direct-reclaim is allowed on the >> next attempt. Presumably this is where the throttling comes in ?? > > Yes that is correct. > >> I suspect that it really shouldn't do that. It should leave kswapd to >> do reclaim (so __GFP_KSWAPD_RECLAIM is appropriate) and only wait in >> mempool_alloc where pool->wait can wake it up. > > Mikulas was already suggesting that and my concern was that this would > give up prematurely even under mild page cache load when there are many > clean page cache pages. That's a valid point - freeing up clean pages is a reasonable thing for a mempool allocator to try to do. > If we just back off and rely on kswapd which > might get stuck on the writeout then the IO throughput can be reduced If I were king of MM, I would make a decree to be proclaimed throughout the land kswapd must never sleep except when it explicitly chooses to Maybe that is impractical, but having firm rules like that would go a long way to make it possible to actually understand and reason about how MM works. As it is, there seems to be a tendency to put bandaids over bandaids. > I believe which would make the whole memory pressure just worse. So I am > not sure this is a good idea in general. I completely agree with you > that the mempool request shouldn't be throttled unless there is a strong > reason for that. More on that below. > >> If I'm following the code properly, the stack trace below can only >> happen if the first pool->alloc() attempt, with direct-reclaim disabled, >> fails and the pool is empty, so mempool_alloc() calls prepare_to_wait() >> and io_schedule_timeout(). > > mempool_alloc retries immediatelly without any sleep after the first > no-reclaim attempt. I missed that ... I see it now... I wonder if anyone has contemplated using some modern programming techniques like, maybe, a "while" loop in there.. Something like the below... > >> I suspect the timeout *doesn't* fire (5 seconds is along time) so it >> gets woken up when there is something in the pool. It then loops around >> and tries pool->alloc() again, even though there is something in the >> pool. This might be justified if that ->alloc would never block, but >> obviously it does. >> >> I would very strongly recommend just changing mempool_alloc() to >> permanently mask out __GFP_DIRECT_RECLAIM. >> >> Quite separately I don't think PF_LESS_THROTTLE is at all appropriate. >> It is "LESS" throttle, not "NO" throttle, but you have made >> throttle_vm_writeout never throttle PF_LESS_THROTTLE threads. > > Yes that is correct. But it still allows to throttle on congestion: > shrink_inactive_list: > /* >* Stall direct reclaim for IO completions if underlying BDIs or zone >* is congested. Allow kswapd to continue until it starts encountering >* unqueued dirty pages or cycling through the LRU too quickly. >*/ > if (!sc->hibernation_mode && !current_is_kswapd() && > current_may_throttle()) > wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10); > > My thinking was that throttle_vm_writeout is there to prevent from > dirtying too many pages from the reclaim the context. PF_LESS_THROTTLE > is part of the writeout so throttling
Re: [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
On Fri, Jul 22 2016, Michal Hocko wrote: > On Fri 22-07-16 18:46:57, Neil Brown wrote: >> On Mon, Jul 18 2016, Michal Hocko wrote: >> >> > From: Michal Hocko >> > >> > Mikulas has reported that a swap backed by dm-crypt doesn't work >> > properly because the swapout cannot make a sufficient forward progress >> > as the writeout path depends on dm_crypt worker which has to allocate >> > memory to perform the encryption. In order to guarantee a forward >> > progress it relies on the mempool allocator. mempool_alloc(), however, >> > prefers to use the underlying (usually page) allocator before it grabs >> > objects from the pool. Such an allocation can dive into the memory >> > reclaim and consequently to throttle_vm_writeout. >> >> That's just broken. >> I used to think mempool should always use the pre-allocated reserves >> first. That is surely the most logical course of action. Otherwise >> that memory is just sitting there doing nothing useful. >> >> I spoke to Nick Piggin about this some years ago and he pointed out that >> the kmalloc allocation paths are much better optimized for low overhead >> when there is plenty of memory. They can just pluck a free block of a >> per-CPU list without taking any locks. By contrast, accessing the >> preallocated pool always requires a spinlock. >> >> So it makes lots of sense to prefer the underlying allocator if it can >> provide a quick response. If it cannot, the sensible thing is to use >> the pool, or wait for the pool to be replenished. >> >> So the allocator should never wait at all, never enter reclaim, never >> throttle. >> >> Looking at the current code, __GFP_DIRECT_RECLAIM is disabled the first >> time through, but if the pool is empty, direct-reclaim is allowed on the >> next attempt. Presumably this is where the throttling comes in ?? > > Yes that is correct. > >> I suspect that it really shouldn't do that. It should leave kswapd to >> do reclaim (so __GFP_KSWAPD_RECLAIM is appropriate) and only wait in >> mempool_alloc where pool->wait can wake it up. > > Mikulas was already suggesting that and my concern was that this would > give up prematurely even under mild page cache load when there are many > clean page cache pages. That's a valid point - freeing up clean pages is a reasonable thing for a mempool allocator to try to do. > If we just back off and rely on kswapd which > might get stuck on the writeout then the IO throughput can be reduced If I were king of MM, I would make a decree to be proclaimed throughout the land kswapd must never sleep except when it explicitly chooses to Maybe that is impractical, but having firm rules like that would go a long way to make it possible to actually understand and reason about how MM works. As it is, there seems to be a tendency to put bandaids over bandaids. > I believe which would make the whole memory pressure just worse. So I am > not sure this is a good idea in general. I completely agree with you > that the mempool request shouldn't be throttled unless there is a strong > reason for that. More on that below. > >> If I'm following the code properly, the stack trace below can only >> happen if the first pool->alloc() attempt, with direct-reclaim disabled, >> fails and the pool is empty, so mempool_alloc() calls prepare_to_wait() >> and io_schedule_timeout(). > > mempool_alloc retries immediatelly without any sleep after the first > no-reclaim attempt. I missed that ... I see it now... I wonder if anyone has contemplated using some modern programming techniques like, maybe, a "while" loop in there.. Something like the below... > >> I suspect the timeout *doesn't* fire (5 seconds is along time) so it >> gets woken up when there is something in the pool. It then loops around >> and tries pool->alloc() again, even though there is something in the >> pool. This might be justified if that ->alloc would never block, but >> obviously it does. >> >> I would very strongly recommend just changing mempool_alloc() to >> permanently mask out __GFP_DIRECT_RECLAIM. >> >> Quite separately I don't think PF_LESS_THROTTLE is at all appropriate. >> It is "LESS" throttle, not "NO" throttle, but you have made >> throttle_vm_writeout never throttle PF_LESS_THROTTLE threads. > > Yes that is correct. But it still allows to throttle on congestion: > shrink_inactive_list: > /* >* Stall direct reclaim for IO completions if underlying BDIs or zone >* is congested. Allow kswapd to continue until it starts encountering >* unqueued dirty pages or cycling through the LRU too quickly. >*/ > if (!sc->hibernation_mode && !current_is_kswapd() && > current_may_throttle()) > wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10); > > My thinking was that throttle_vm_writeout is there to prevent from > dirtying too many pages from the reclaim the context. PF_LESS_THROTTLE > is part of the writeout so throttling it on too many
Re: [PATCH v4 1/2] bpf: Add bpf_probe_write BPF helper to be called in tracers (kprobes)
On Fri, Jul 22, 2016 at 11:53:52AM +0200, Daniel Borkmann wrote: > On 07/22/2016 04:14 AM, Alexei Starovoitov wrote: > >On Thu, Jul 21, 2016 at 06:09:17PM -0700, Sargun Dhillon wrote: > >>This allows user memory to be written to during the course of a kprobe. > >>It shouldn't be used to implement any kind of security mechanism > >>because of TOC-TOU attacks, but rather to debug, divert, and > >>manipulate execution of semi-cooperative processes. > >> > >>Although it uses probe_kernel_write, we limit the address space > >>the probe can write into by checking the space with access_ok. > >>This is so the call doesn't sleep. > >> > >>Given this feature is experimental, and has the risk of crashing > >>the system, we print a warning on invocation. > >> > >>It was tested with the tracex7 program on x86-64. > >> > >>Signed-off-by: Sargun Dhillon> >>Cc: Alexei Starovoitov > >>Cc: Daniel Borkmann > >>--- > >> include/uapi/linux/bpf.h | 12 > >> kernel/bpf/verifier.c | 9 + > >> kernel/trace/bpf_trace.c | 37 + > >> samples/bpf/bpf_helpers.h | 2 ++ > >> 4 files changed, 60 insertions(+) > >> > >>diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > >>index 2b7076f..4536282 100644 > >>--- a/include/uapi/linux/bpf.h > >>+++ b/include/uapi/linux/bpf.h > >>@@ -365,6 +365,18 @@ enum bpf_func_id { > >> */ > >>BPF_FUNC_get_current_task, > >> > >>+ /** > >>+* bpf_probe_write(void *dst, void *src, int len) > >>+* safely attempt to write to a location > >>+* @dst: destination address in userspace > >>+* @src: source address on stack > >>+* @len: number of bytes to copy > >>+* Return: > >>+* Returns number of bytes that could not be copied. > >>+* On success, this will be zero > > > >that is odd comment. > >there are only three possible return values 0, -EFAULT, -EPERM > > Agree. > > >>+*/ > >>+ BPF_FUNC_probe_write, > >>+ > >>__BPF_FUNC_MAX_ID, > >> }; > >> > >>diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > >>index f72f23b..6785008 100644 > >>--- a/kernel/bpf/verifier.c > >>+++ b/kernel/bpf/verifier.c > >>@@ -1154,6 +1154,15 @@ static int check_call(struct verifier_env *env, int > >>func_id) > >>return -EINVAL; > >>} > >> > >>+ if (func_id == BPF_FUNC_probe_write) { > >>+ > >>pr_warn_once("\n"); > >>+ pr_warn_once("* bpf_probe_write: Experimental Feature in use > >>*\n"); > >>+ pr_warn_once("* bpf_probe_write: Feature may corrupt memory > >>*\n"); > >>+ > >>pr_warn_once("\n"); > >>+ pr_notice_ratelimited("bpf_probe_write in use by: %.16s-%d", > >>+ current->comm, task_pid_nr(current)); > >>+ } > > > >I think single line pr_notice_ratelimited() with 'feature may corrupt user > >memory' > >will be enough. > > Agree. > > >Also please move this to tracing specific part into bpf_trace.c > >similar to bpf_get_trace_printk_proto() instead of verifier.c > > Yes, sorry for not being too clear about it, this spot will then be > called by the verifier to fetch it for every function call. Meaning that > this could get printed multiple times for loading a single program, but > I think it's okay. If single line, I'd make that pr_warn_ratelimited(), > and probably something like ... > > "note: %s[%d] is installing a program with bpf_probe_write helper that may > corrupt user memory!", > current->comm, task_pid_nr(current) > > >>+static u64 bpf_probe_write(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) > >>+{ > >>+ void *unsafe_ptr = (void *) (long) r1; > >>+ void *src = (void *) (long) r2; > >>+ int size = (int) r3; > >>+ struct task_struct *task = current; > >>+ > >>+ /* > > > >bpf_trace.c follows non-net comment style, so it's good here. > >just distracting vs the rest of net style. > > > >>+* Ensure we're in a user context which it is safe for the helper > >>+* to run. This helper has no business in a kthread > >>+* > >>+* access_ok should prevent writing to non-user memory, but on > >>+* some architectures (nommu, etc...) access_ok isn't enough > >>+* So we check the current segment > >>+*/ > >>+ > >>+ if (unlikely(in_interrupt() || (task->flags & PF_KTHREAD))) > >>+ return -EPERM; > > > >Should we also add a check for !PF_EXITING ? > >Like signals are not delivered to such tasks and I'm not sure > >what would be the state of mm of it. > > I agree, good point. > > You can make that 'current->flags & (PF_KTHREAD | PF_EXITING)' and > we don't need the extra task var either. > > >>+ if (unlikely(segment_eq(get_fs(), KERNEL_DS))) > >>+ return -EPERM; > >>+ if (!access_ok(VERIFY_WRITE, unsafe_ptr, size)) > >>+ return -EPERM; > >>+ > >>+ return
Re: [PATCH v4 1/2] bpf: Add bpf_probe_write BPF helper to be called in tracers (kprobes)
On Fri, Jul 22, 2016 at 11:53:52AM +0200, Daniel Borkmann wrote: > On 07/22/2016 04:14 AM, Alexei Starovoitov wrote: > >On Thu, Jul 21, 2016 at 06:09:17PM -0700, Sargun Dhillon wrote: > >>This allows user memory to be written to during the course of a kprobe. > >>It shouldn't be used to implement any kind of security mechanism > >>because of TOC-TOU attacks, but rather to debug, divert, and > >>manipulate execution of semi-cooperative processes. > >> > >>Although it uses probe_kernel_write, we limit the address space > >>the probe can write into by checking the space with access_ok. > >>This is so the call doesn't sleep. > >> > >>Given this feature is experimental, and has the risk of crashing > >>the system, we print a warning on invocation. > >> > >>It was tested with the tracex7 program on x86-64. > >> > >>Signed-off-by: Sargun Dhillon > >>Cc: Alexei Starovoitov > >>Cc: Daniel Borkmann > >>--- > >> include/uapi/linux/bpf.h | 12 > >> kernel/bpf/verifier.c | 9 + > >> kernel/trace/bpf_trace.c | 37 + > >> samples/bpf/bpf_helpers.h | 2 ++ > >> 4 files changed, 60 insertions(+) > >> > >>diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > >>index 2b7076f..4536282 100644 > >>--- a/include/uapi/linux/bpf.h > >>+++ b/include/uapi/linux/bpf.h > >>@@ -365,6 +365,18 @@ enum bpf_func_id { > >> */ > >>BPF_FUNC_get_current_task, > >> > >>+ /** > >>+* bpf_probe_write(void *dst, void *src, int len) > >>+* safely attempt to write to a location > >>+* @dst: destination address in userspace > >>+* @src: source address on stack > >>+* @len: number of bytes to copy > >>+* Return: > >>+* Returns number of bytes that could not be copied. > >>+* On success, this will be zero > > > >that is odd comment. > >there are only three possible return values 0, -EFAULT, -EPERM > > Agree. > > >>+*/ > >>+ BPF_FUNC_probe_write, > >>+ > >>__BPF_FUNC_MAX_ID, > >> }; > >> > >>diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > >>index f72f23b..6785008 100644 > >>--- a/kernel/bpf/verifier.c > >>+++ b/kernel/bpf/verifier.c > >>@@ -1154,6 +1154,15 @@ static int check_call(struct verifier_env *env, int > >>func_id) > >>return -EINVAL; > >>} > >> > >>+ if (func_id == BPF_FUNC_probe_write) { > >>+ > >>pr_warn_once("\n"); > >>+ pr_warn_once("* bpf_probe_write: Experimental Feature in use > >>*\n"); > >>+ pr_warn_once("* bpf_probe_write: Feature may corrupt memory > >>*\n"); > >>+ > >>pr_warn_once("\n"); > >>+ pr_notice_ratelimited("bpf_probe_write in use by: %.16s-%d", > >>+ current->comm, task_pid_nr(current)); > >>+ } > > > >I think single line pr_notice_ratelimited() with 'feature may corrupt user > >memory' > >will be enough. > > Agree. > > >Also please move this to tracing specific part into bpf_trace.c > >similar to bpf_get_trace_printk_proto() instead of verifier.c > > Yes, sorry for not being too clear about it, this spot will then be > called by the verifier to fetch it for every function call. Meaning that > this could get printed multiple times for loading a single program, but > I think it's okay. If single line, I'd make that pr_warn_ratelimited(), > and probably something like ... > > "note: %s[%d] is installing a program with bpf_probe_write helper that may > corrupt user memory!", > current->comm, task_pid_nr(current) > > >>+static u64 bpf_probe_write(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) > >>+{ > >>+ void *unsafe_ptr = (void *) (long) r1; > >>+ void *src = (void *) (long) r2; > >>+ int size = (int) r3; > >>+ struct task_struct *task = current; > >>+ > >>+ /* > > > >bpf_trace.c follows non-net comment style, so it's good here. > >just distracting vs the rest of net style. > > > >>+* Ensure we're in a user context which it is safe for the helper > >>+* to run. This helper has no business in a kthread > >>+* > >>+* access_ok should prevent writing to non-user memory, but on > >>+* some architectures (nommu, etc...) access_ok isn't enough > >>+* So we check the current segment > >>+*/ > >>+ > >>+ if (unlikely(in_interrupt() || (task->flags & PF_KTHREAD))) > >>+ return -EPERM; > > > >Should we also add a check for !PF_EXITING ? > >Like signals are not delivered to such tasks and I'm not sure > >what would be the state of mm of it. > > I agree, good point. > > You can make that 'current->flags & (PF_KTHREAD | PF_EXITING)' and > we don't need the extra task var either. > > >>+ if (unlikely(segment_eq(get_fs(), KERNEL_DS))) > >>+ return -EPERM; > >>+ if (!access_ok(VERIFY_WRITE, unsafe_ptr, size)) > >>+ return -EPERM; > >>+ > >>+ return probe_kernel_write(unsafe_ptr, src, size); > >>+} > >>+ >
Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack
On Fri, Jul 22, 2016 at 04:39:00PM -0700, Andy Lutomirski wrote: > On Fri, Jul 22, 2016 at 4:30 PM, Josh Poimboeufwrote: > > On Fri, Jul 22, 2016 at 04:18:04PM -0700, Andy Lutomirski wrote: > >> On Fri, Jul 22, 2016 at 3:20 PM, Josh Poimboeuf > >> wrote: > >> > On Fri, Jul 22, 2016 at 02:46:10PM -0700, Andy Lutomirski wrote: > >> >> On Fri, Jul 22, 2016 at 8:57 AM, Josh Poimboeuf > >> >> wrote: > >> >> > On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote: > >> >> >> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf > >> >> >> wrote: > >> >> >> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote: > >> >> >> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf > >> >> >> >> wrote: > >> >> >> >> > Now that we can find pt_regs registers in the middle of the > >> >> >> >> > stack due to > >> >> >> >> > an interrupt or exception, we can print them. Here's what it > >> >> >> >> > looks > >> >> >> >> > like: > >> >> >> >> > > >> >> >> >> >... > >> >> >> >> >[] do_async_page_fault+0x2c/0xa0 > >> >> >> >> >[] async_page_fault+0x28/0x30 > >> >> >> >> > RIP: 0010:[] [] > >> >> >> >> > __clear_user+0x42/0x70 > >> >> >> >> > RSP: 0018:88007876fd38 EFLAGS: 00010202 > >> >> >> >> > RAX: RBX: 0138 RCX: > >> >> >> >> > 0138 > >> >> >> >> > RDX: RSI: 0008 RDI: > >> >> >> >> > 0061b640 > >> >> >> >> > RBP: 88007876fd48 R08: 000dc2ced0d0 R09: > >> >> >> >> > > >> >> >> >> > R10: 0001 R11: R12: > >> >> >> >> > 0061b640 > >> >> >> >> > R13: R14: 88007877 R15: > >> >> >> >> > 880079947200 > >> >> >> >> >[] ? __clear_user+0x42/0x70 > >> >> >> >> >[] ? __clear_user+0x23/0x70 > >> >> >> >> >[] clear_user+0x2b/0x40 > >> >> >> >> >... > >> >> >> >> > >> >> >> >> This looks wrong. Here are some theories: > >> >> >> >> > >> >> >> >> (a) __clear_user is a reliable address that is indicated by RIP: > >> >> >> >> > >> >> >> >> Then it's found again as an unreliable address as "? > >> >> >> >> __clear_user+0x42/0x70" by scanning the stack. "? > >> >> >> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the > >> >> >> >> stack. > >> >> >> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been > >> >> >> >> suppressed because it matched a reliable address? > >> >> >> >> > >> >> >> >> (b) You actually intended for all the addresses to be printed, in > >> >> >> >> which case "? __clear_user+0x42/0x70" should have been > >> >> >> >> "__clear_user+0x42/0x70" and you have a bug. In this case, it's > >> >> >> >> plausible that your state machine got a bit lost leading to "? > >> >> >> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact -- > >> >> >> >> it's a real frame and you didn't find it). > >> >> >> >> > >> >> >> >> (c) Something else and I'm confused. > >> >> >> > > >> >> >> > So there's a subtle difference between addresses reported by > >> >> >> > regs->ip > >> >> >> > and normal return addresses. For example: > >> >> >> > > >> >> >> >... > >> >> >> >[] smp_apic_timer_interrupt+0x3d/0x50 > >> >> >> >[] apic_timer_interrupt+0x9e/0xb0 > >> >> >> > RIP: 0010:[] [] > >> >> >> > path_init+0x0/0x750 > >> >> >> > RSP: 0018:880036a3fd80 EFLAGS: 0296 > >> >> >> > RAX: 88003691aa40 RBX: 880036a3ff08 RCX: 880036a3ff08 > >> >> >> > RDX: 880036a3ff08 RSI: 0041 RDI: 880036a3fdb0 > >> >> >> > RBP: 880036a3fda0 R08: R09: 0010 > >> >> >> > R10: 8080808080808080 R11: fefefefefefefeff R12: 880036a3fdb0 > >> >> >> > R13: 0001 R14: 880036a3ff08 R15: > >> >> >> > > >> >> >> >[] ? lookup_fast+0x3d0/0x3d0 > >> >> >> >[] ? path_lookupat+0x1b/0x120 > >> >> >> >[] filename_lookup+0xb1/0x180 > >> >> >> >... > >> >> >> > > >> >> >> > In this case the irq hit right after path_lookupat() called into > >> >> >> > path_init(). So the "path_init+0x0" printed by __show_regs() is > >> >> >> > right. > >> >> >> > > >> >> >> > Note the backtrace reports the same address, but it instead > >> >> >> > describes it > >> >> >> > as "lookup_fast+0x3d0", which is the end of lookup_fast(). That's > >> >> >> > because normally, such an address after a call instruction at the > >> >> >> > end of > >> >> >> > a function would indicate a tail call (e.g., to a noreturn > >> >> >> > function). > >> >> >> > If that were the case, printing "path_init+0x0" would be completely > >> >> >> > wrong, because path_init() just happens to be the function located > >> >> >> > immediately after lookup_fast(). > >> >> >> > > >> >> >> > Maybe I could add some special logic to say: "if this return > >> >> >> > address was > >> >> >> > from
Re: [PATCH v4 7/7] power_supply: make use of new strcpytoupper() function
Hi, On Fri, Jul 22, 2016 at 04:31:09PM -0700, Markus Mayer wrote: > Call strcpytoupper() rather than walking the string explicitly to > convert it to uppercase. > > Signed-off-by: Markus Mayer> --- > drivers/power/power_supply_sysfs.c | 13 + > 1 file changed, 5 insertions(+), 8 deletions(-) > > diff --git a/drivers/power/power_supply_sysfs.c > b/drivers/power/power_supply_sysfs.c > index 80fed98..20fdcc5 100644 > --- a/drivers/power/power_supply_sysfs.c > +++ b/drivers/power/power_supply_sysfs.c > @@ -256,19 +256,16 @@ void power_supply_init_attrs(struct device_type > *dev_type) > > static char *kstruprdup(const char *str, gfp_t gfp) > { > - char *ret, *ustr; > + char *ustr; > > - ustr = ret = kmalloc(strlen(str) + 1, gfp); > + ustr = kmalloc(strlen(str) + 1, gfp); > > - if (!ret) > + if (!ustr) > return NULL; > > - while (*str) > - *ustr++ = toupper(*str++); > + strcpytoupper(ustr, str); > > - *ustr = 0; > - > - return ret; > + return ustr; > } > > int power_supply_uevent(struct device *dev, struct kobj_uevent_env *env) Acked-By: Sebastian Reichel Note: I plan to merge [0] (Move power supply subsystem to drivers/power/supply) directly after 4.8-rc1, which will result in merge conflicts if this goes into the kernel via some other subsystem. [0] https://marc.info/?l=linux-pm=146617051028786=2 -- Sebastian signature.asc Description: PGP signature
Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack
On Fri, Jul 22, 2016 at 04:39:00PM -0700, Andy Lutomirski wrote: > On Fri, Jul 22, 2016 at 4:30 PM, Josh Poimboeuf wrote: > > On Fri, Jul 22, 2016 at 04:18:04PM -0700, Andy Lutomirski wrote: > >> On Fri, Jul 22, 2016 at 3:20 PM, Josh Poimboeuf > >> wrote: > >> > On Fri, Jul 22, 2016 at 02:46:10PM -0700, Andy Lutomirski wrote: > >> >> On Fri, Jul 22, 2016 at 8:57 AM, Josh Poimboeuf > >> >> wrote: > >> >> > On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote: > >> >> >> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf > >> >> >> wrote: > >> >> >> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote: > >> >> >> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf > >> >> >> >> wrote: > >> >> >> >> > Now that we can find pt_regs registers in the middle of the > >> >> >> >> > stack due to > >> >> >> >> > an interrupt or exception, we can print them. Here's what it > >> >> >> >> > looks > >> >> >> >> > like: > >> >> >> >> > > >> >> >> >> >... > >> >> >> >> >[] do_async_page_fault+0x2c/0xa0 > >> >> >> >> >[] async_page_fault+0x28/0x30 > >> >> >> >> > RIP: 0010:[] [] > >> >> >> >> > __clear_user+0x42/0x70 > >> >> >> >> > RSP: 0018:88007876fd38 EFLAGS: 00010202 > >> >> >> >> > RAX: RBX: 0138 RCX: > >> >> >> >> > 0138 > >> >> >> >> > RDX: RSI: 0008 RDI: > >> >> >> >> > 0061b640 > >> >> >> >> > RBP: 88007876fd48 R08: 000dc2ced0d0 R09: > >> >> >> >> > > >> >> >> >> > R10: 0001 R11: R12: > >> >> >> >> > 0061b640 > >> >> >> >> > R13: R14: 88007877 R15: > >> >> >> >> > 880079947200 > >> >> >> >> >[] ? __clear_user+0x42/0x70 > >> >> >> >> >[] ? __clear_user+0x23/0x70 > >> >> >> >> >[] clear_user+0x2b/0x40 > >> >> >> >> >... > >> >> >> >> > >> >> >> >> This looks wrong. Here are some theories: > >> >> >> >> > >> >> >> >> (a) __clear_user is a reliable address that is indicated by RIP: > >> >> >> >> > >> >> >> >> Then it's found again as an unreliable address as "? > >> >> >> >> __clear_user+0x42/0x70" by scanning the stack. "? > >> >> >> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the > >> >> >> >> stack. > >> >> >> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been > >> >> >> >> suppressed because it matched a reliable address? > >> >> >> >> > >> >> >> >> (b) You actually intended for all the addresses to be printed, in > >> >> >> >> which case "? __clear_user+0x42/0x70" should have been > >> >> >> >> "__clear_user+0x42/0x70" and you have a bug. In this case, it's > >> >> >> >> plausible that your state machine got a bit lost leading to "? > >> >> >> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact -- > >> >> >> >> it's a real frame and you didn't find it). > >> >> >> >> > >> >> >> >> (c) Something else and I'm confused. > >> >> >> > > >> >> >> > So there's a subtle difference between addresses reported by > >> >> >> > regs->ip > >> >> >> > and normal return addresses. For example: > >> >> >> > > >> >> >> >... > >> >> >> >[] smp_apic_timer_interrupt+0x3d/0x50 > >> >> >> >[] apic_timer_interrupt+0x9e/0xb0 > >> >> >> > RIP: 0010:[] [] > >> >> >> > path_init+0x0/0x750 > >> >> >> > RSP: 0018:880036a3fd80 EFLAGS: 0296 > >> >> >> > RAX: 88003691aa40 RBX: 880036a3ff08 RCX: 880036a3ff08 > >> >> >> > RDX: 880036a3ff08 RSI: 0041 RDI: 880036a3fdb0 > >> >> >> > RBP: 880036a3fda0 R08: R09: 0010 > >> >> >> > R10: 8080808080808080 R11: fefefefefefefeff R12: 880036a3fdb0 > >> >> >> > R13: 0001 R14: 880036a3ff08 R15: > >> >> >> > > >> >> >> >[] ? lookup_fast+0x3d0/0x3d0 > >> >> >> >[] ? path_lookupat+0x1b/0x120 > >> >> >> >[] filename_lookup+0xb1/0x180 > >> >> >> >... > >> >> >> > > >> >> >> > In this case the irq hit right after path_lookupat() called into > >> >> >> > path_init(). So the "path_init+0x0" printed by __show_regs() is > >> >> >> > right. > >> >> >> > > >> >> >> > Note the backtrace reports the same address, but it instead > >> >> >> > describes it > >> >> >> > as "lookup_fast+0x3d0", which is the end of lookup_fast(). That's > >> >> >> > because normally, such an address after a call instruction at the > >> >> >> > end of > >> >> >> > a function would indicate a tail call (e.g., to a noreturn > >> >> >> > function). > >> >> >> > If that were the case, printing "path_init+0x0" would be completely > >> >> >> > wrong, because path_init() just happens to be the function located > >> >> >> > immediately after lookup_fast(). > >> >> >> > > >> >> >> > Maybe I could add some special logic to say: "if this return > >> >> >> > address was > >> >> >> > from a call, use printk_stack_address(); else if it was from a > >> >> >> > pt_regs, > >> >> >> > use
Re: [PATCH v4 7/7] power_supply: make use of new strcpytoupper() function
Hi, On Fri, Jul 22, 2016 at 04:31:09PM -0700, Markus Mayer wrote: > Call strcpytoupper() rather than walking the string explicitly to > convert it to uppercase. > > Signed-off-by: Markus Mayer > --- > drivers/power/power_supply_sysfs.c | 13 + > 1 file changed, 5 insertions(+), 8 deletions(-) > > diff --git a/drivers/power/power_supply_sysfs.c > b/drivers/power/power_supply_sysfs.c > index 80fed98..20fdcc5 100644 > --- a/drivers/power/power_supply_sysfs.c > +++ b/drivers/power/power_supply_sysfs.c > @@ -256,19 +256,16 @@ void power_supply_init_attrs(struct device_type > *dev_type) > > static char *kstruprdup(const char *str, gfp_t gfp) > { > - char *ret, *ustr; > + char *ustr; > > - ustr = ret = kmalloc(strlen(str) + 1, gfp); > + ustr = kmalloc(strlen(str) + 1, gfp); > > - if (!ret) > + if (!ustr) > return NULL; > > - while (*str) > - *ustr++ = toupper(*str++); > + strcpytoupper(ustr, str); > > - *ustr = 0; > - > - return ret; > + return ustr; > } > > int power_supply_uevent(struct device *dev, struct kobj_uevent_env *env) Acked-By: Sebastian Reichel Note: I plan to merge [0] (Move power supply subsystem to drivers/power/supply) directly after 4.8-rc1, which will result in merge conflicts if this goes into the kernel via some other subsystem. [0] https://marc.info/?l=linux-pm=146617051028786=2 -- Sebastian signature.asc Description: PGP signature
Re: [PATCH v4 4/7] staging: speakup: replace spk_strlwr() with strlcpytolower()
Markus Mayerwrites: > After introducing generic strltolower() and strtolower(), spk_strlwr() > is no longer needed. > > Signed-off-by: Markus Mayer > Acked-by: Samuel Thibault Acked-by: Chris Brannon Haven't looked at much kernel code in a while, but this looks good. -- Chris
Re: [PATCH v4 4/7] staging: speakup: replace spk_strlwr() with strlcpytolower()
Markus Mayer writes: > After introducing generic strltolower() and strtolower(), spk_strlwr() > is no longer needed. > > Signed-off-by: Markus Mayer > Acked-by: Samuel Thibault Acked-by: Chris Brannon Haven't looked at much kernel code in a while, but this looks good. -- Chris
Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
On Fri, Jul 22, 2016 at 04:26:46PM -0700, Andy Lutomirski wrote: > On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeufwrote: > > valid_stack_ptr() is buggy: it assumes that all stacks are of size > > THREAD_SIZE, which is not true for exception stacks. So the > > walk_stack() callbacks will need to know the location of the beginning > > of the stack as well as the end. > > > > Another issue is that in general the various features of a stack (type, > > size, next stack pointer, description string) are scattered around in > > various places throughout the stack dump code. > > I finally figured out what visit_info is. But would it make more > sense to track it in the unwind state so that the unwinder can > directly make sure it doesn't start looping? Well, the unwinders aren't the only users of get_stack_info() and the visit_mask. show_trace_log_lvl() also uses it. But it would probably be cleaner to at least do the visit_mask bit testing/setting in get_stack_info() rather than in the in_*_stack() functions. > And please remove test_and_set_bit() -- it's pointlessly slow. Ok. > > > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info, > > +unsigned long *visit_mask) > > +{ > > + unsigned long *begin = (unsigned long > > *)this_cpu_read(hardirq_stack); > > + unsigned long *end = begin + (THREAD_SIZE / sizeof(long)); > > + > > + if (stack < begin || stack >= end) > > + return false; > > + > > + if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask)) > > + return false; > > + > > + info->type = STACK_TYPE_IRQ; > > + info->begin = begin; > > + info->end = end; > > + info->next = (unsigned long *)*begin; > > This works, but it's a bit magic. I don't suppose we could get rid of > this ->next thing entirely and teach show_stack_log_lvl(), etc. to > move from stack to stack by querying the stack type of whatever the > frame base address is if the frame base address ends up being out of > bounds for the current stack? Or maybe the unwinder could even do > this by itself. I'm not quite sure what you mean here. The ->next stack pointer is quite useful and it abstracts that ugliness away from the callers of get_stack_info(). I'm open to any specific suggestions. > > > +static bool in_exception_stack(unsigned long *s, struct stack_info *info, > > + unsigned long *visit_mask) > > { > > unsigned long stack = (unsigned long)s; > > unsigned long begin, end; > > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long > > *s, char **name, > > if (stack < begin || stack >= end) > > continue; > > > > - if (test_and_set_bit(k, visit_mask)) > > + if (visit_mask && > > + test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask)) > > return false; > > > > - *name = exception_stack_names[k]; > > - return (unsigned long *)end; > > + info->type = STACK_TYPE_EXCEPTION + k; > > + info->begin = (unsigned long *)begin; > > + info->end = (unsigned long *)end; > > + info->next = (unsigned long *)info->end[-2]; > > This is so magical that I don't immediately see why it's correct. > Presumably it's because the thing two slots down from the top of the > stack is regs->sp? If so, that needs a comment. Heck if I know, I just stole it from dump_trace() ;-) I'll figure it out and add a comment. > But again, couldn't we use the fact that we now know how to decode > pt_regs to avoid needing this? I can imagine it being useful as a > fallback in the event that the unwinder fails, but this is just a > fallback. Yeah, this is needed as a fallback. But I wouldn't call it "just" a fallback: the stack dump code *needs* to be able to still traverse the stacks if frame pointers fail. > Also, NMI is weird and I'm wondering whether this works at > all when trying to unwind from a looped NMI. Unless I'm missing something, I think it should be fine for nested NMIs, since they're all on the same stack. I can try to test it. What in particular are you worried about? > Fixing this up could be a followup after this series is in, I think -- > you're preserving existing behavior AFAICS. I just don't particularly > like the existing behavior. > > FWIW, I think this code needs to be explicitly tested for the 32-bit > double fault case. It's highly magical. Ok, I'll test it. -- Josh
Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
On Fri, Jul 22, 2016 at 04:26:46PM -0700, Andy Lutomirski wrote: > On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf wrote: > > valid_stack_ptr() is buggy: it assumes that all stacks are of size > > THREAD_SIZE, which is not true for exception stacks. So the > > walk_stack() callbacks will need to know the location of the beginning > > of the stack as well as the end. > > > > Another issue is that in general the various features of a stack (type, > > size, next stack pointer, description string) are scattered around in > > various places throughout the stack dump code. > > I finally figured out what visit_info is. But would it make more > sense to track it in the unwind state so that the unwinder can > directly make sure it doesn't start looping? Well, the unwinders aren't the only users of get_stack_info() and the visit_mask. show_trace_log_lvl() also uses it. But it would probably be cleaner to at least do the visit_mask bit testing/setting in get_stack_info() rather than in the in_*_stack() functions. > And please remove test_and_set_bit() -- it's pointlessly slow. Ok. > > > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info, > > +unsigned long *visit_mask) > > +{ > > + unsigned long *begin = (unsigned long > > *)this_cpu_read(hardirq_stack); > > + unsigned long *end = begin + (THREAD_SIZE / sizeof(long)); > > + > > + if (stack < begin || stack >= end) > > + return false; > > + > > + if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask)) > > + return false; > > + > > + info->type = STACK_TYPE_IRQ; > > + info->begin = begin; > > + info->end = end; > > + info->next = (unsigned long *)*begin; > > This works, but it's a bit magic. I don't suppose we could get rid of > this ->next thing entirely and teach show_stack_log_lvl(), etc. to > move from stack to stack by querying the stack type of whatever the > frame base address is if the frame base address ends up being out of > bounds for the current stack? Or maybe the unwinder could even do > this by itself. I'm not quite sure what you mean here. The ->next stack pointer is quite useful and it abstracts that ugliness away from the callers of get_stack_info(). I'm open to any specific suggestions. > > > +static bool in_exception_stack(unsigned long *s, struct stack_info *info, > > + unsigned long *visit_mask) > > { > > unsigned long stack = (unsigned long)s; > > unsigned long begin, end; > > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long > > *s, char **name, > > if (stack < begin || stack >= end) > > continue; > > > > - if (test_and_set_bit(k, visit_mask)) > > + if (visit_mask && > > + test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask)) > > return false; > > > > - *name = exception_stack_names[k]; > > - return (unsigned long *)end; > > + info->type = STACK_TYPE_EXCEPTION + k; > > + info->begin = (unsigned long *)begin; > > + info->end = (unsigned long *)end; > > + info->next = (unsigned long *)info->end[-2]; > > This is so magical that I don't immediately see why it's correct. > Presumably it's because the thing two slots down from the top of the > stack is regs->sp? If so, that needs a comment. Heck if I know, I just stole it from dump_trace() ;-) I'll figure it out and add a comment. > But again, couldn't we use the fact that we now know how to decode > pt_regs to avoid needing this? I can imagine it being useful as a > fallback in the event that the unwinder fails, but this is just a > fallback. Yeah, this is needed as a fallback. But I wouldn't call it "just" a fallback: the stack dump code *needs* to be able to still traverse the stacks if frame pointers fail. > Also, NMI is weird and I'm wondering whether this works at > all when trying to unwind from a looped NMI. Unless I'm missing something, I think it should be fine for nested NMIs, since they're all on the same stack. I can try to test it. What in particular are you worried about? > Fixing this up could be a followup after this series is in, I think -- > you're preserving existing behavior AFAICS. I just don't particularly > like the existing behavior. > > FWIW, I think this code needs to be explicitly tested for the 32-bit > double fault case. It's highly magical. Ok, I'll test it. -- Josh
Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
On Fri, Jul 22, 2016 at 4:26 PM, Andy Lutomirskiwrote: > On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf wrote: >> valid_stack_ptr() is buggy: it assumes that all stacks are of size >> THREAD_SIZE, which is not true for exception stacks. So the >> walk_stack() callbacks will need to know the location of the beginning >> of the stack as well as the end. >> >> Another issue is that in general the various features of a stack (type, >> size, next stack pointer, description string) are scattered around in >> various places throughout the stack dump code. > > I finally figured out what visit_info is. But would it make more > sense to track it in the unwind state so that the unwinder can > directly make sure it doesn't start looping? > I just realized that it *is* in the unwind state. But maybe this code in update_stack_state: sp = info->next; if (!sp). goto unknown; if (get_stack_info(sp, state->task, info, >stack_mask)) goto unknown; if (!on_stack(info, addr, len)) goto unknown; should do something like: if (get_stack_info(addr, ...)) goto unknown. sp = info->end; instead. Alternatively, maybe it would make sense to keep sp as is (have update_stack_state return bool instead of returning a pointer) so that a frame that switches stacks still shows the actual sp at the time that the frame called whatever the it called. I'm really quite confused by what state->sp means in your current code. In the non-stack-switching case (everything is on the thread stack), it appears to always match state->bp. Am I missing something? If I'm understanding this correctly, when you're pointing at a call frame, state->bp is that frame's base address (the top of the stack frame), unwind_get_return_address() returns the address to which that frame would return, and, in the future, unwind_get_gpr(UNWIND_DI) or whatever it ends up looking like will return RDI at the time that the frame called whatever function it called, if known. By that logic, shouldn't state->sp be sp on entry to the call instruction? (Or could sp just be removed? Does it do anything?) I guess the reason I'm still not 100% comfortable with the idea that pt_regs frames don't exist a real frames is that I keep waffling as to how I should think about the regs associated with a frame in the future DWARF world. I think I imagine them being the regs at the time that the frame did it's call to the next frame, which, by an admittedly weak analogy, means that the pt_regs state would be the regs at the time that the exception or interrupt happened. That offers a third silly option for dealing with the annoying '?': emit two frames for a pt_regs, but have the frame in the entry code show NULL for its return address because it's not a normal return. > And please remove test_and_set_bit() -- it's pointlessly slow. > >> +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info, >> +unsigned long *visit_mask) >> +{ >> + unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack); >> + unsigned long *end = begin + (THREAD_SIZE / sizeof(long)); >> + >> + if (stack < begin || stack >= end) >> + return false; >> + >> + if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask)) >> + return false; >> + >> + info->type = STACK_TYPE_IRQ; >> + info->begin = begin; >> + info->end = end; >> + info->next = (unsigned long *)*begin; > > This works, but it's a bit magic. I don't suppose we could get rid of > this ->next thing entirely and teach show_stack_log_lvl(), etc. to > move from stack to stack by querying the stack type of whatever the > frame base address is if the frame base address ends up being out of > bounds for the current stack? Or maybe the unwinder could even do > this by itself. > >> +static bool in_exception_stack(unsigned long *s, struct stack_info *info, >> + unsigned long *visit_mask) >> { >> unsigned long stack = (unsigned long)s; >> unsigned long begin, end; >> @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long >> *s, char **name, >> if (stack < begin || stack >= end) >> continue; >> >> - if (test_and_set_bit(k, visit_mask)) >> + if (visit_mask && >> + test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask)) >> return false; >> >> - *name = exception_stack_names[k]; >> - return (unsigned long *)end; >> + info->type = STACK_TYPE_EXCEPTION + k; >> + info->begin = (unsigned long *)begin; >> + info->end = (unsigned long *)end; >> + info->next = (unsigned long *)info->end[-2]; > > This is so magical that I don't immediately see why
Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface
On Fri, Jul 22, 2016 at 4:26 PM, Andy Lutomirski wrote: > On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf wrote: >> valid_stack_ptr() is buggy: it assumes that all stacks are of size >> THREAD_SIZE, which is not true for exception stacks. So the >> walk_stack() callbacks will need to know the location of the beginning >> of the stack as well as the end. >> >> Another issue is that in general the various features of a stack (type, >> size, next stack pointer, description string) are scattered around in >> various places throughout the stack dump code. > > I finally figured out what visit_info is. But would it make more > sense to track it in the unwind state so that the unwinder can > directly make sure it doesn't start looping? > I just realized that it *is* in the unwind state. But maybe this code in update_stack_state: sp = info->next; if (!sp). goto unknown; if (get_stack_info(sp, state->task, info, >stack_mask)) goto unknown; if (!on_stack(info, addr, len)) goto unknown; should do something like: if (get_stack_info(addr, ...)) goto unknown. sp = info->end; instead. Alternatively, maybe it would make sense to keep sp as is (have update_stack_state return bool instead of returning a pointer) so that a frame that switches stacks still shows the actual sp at the time that the frame called whatever the it called. I'm really quite confused by what state->sp means in your current code. In the non-stack-switching case (everything is on the thread stack), it appears to always match state->bp. Am I missing something? If I'm understanding this correctly, when you're pointing at a call frame, state->bp is that frame's base address (the top of the stack frame), unwind_get_return_address() returns the address to which that frame would return, and, in the future, unwind_get_gpr(UNWIND_DI) or whatever it ends up looking like will return RDI at the time that the frame called whatever function it called, if known. By that logic, shouldn't state->sp be sp on entry to the call instruction? (Or could sp just be removed? Does it do anything?) I guess the reason I'm still not 100% comfortable with the idea that pt_regs frames don't exist a real frames is that I keep waffling as to how I should think about the regs associated with a frame in the future DWARF world. I think I imagine them being the regs at the time that the frame did it's call to the next frame, which, by an admittedly weak analogy, means that the pt_regs state would be the regs at the time that the exception or interrupt happened. That offers a third silly option for dealing with the annoying '?': emit two frames for a pt_regs, but have the frame in the entry code show NULL for its return address because it's not a normal return. > And please remove test_and_set_bit() -- it's pointlessly slow. > >> +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info, >> +unsigned long *visit_mask) >> +{ >> + unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack); >> + unsigned long *end = begin + (THREAD_SIZE / sizeof(long)); >> + >> + if (stack < begin || stack >= end) >> + return false; >> + >> + if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask)) >> + return false; >> + >> + info->type = STACK_TYPE_IRQ; >> + info->begin = begin; >> + info->end = end; >> + info->next = (unsigned long *)*begin; > > This works, but it's a bit magic. I don't suppose we could get rid of > this ->next thing entirely and teach show_stack_log_lvl(), etc. to > move from stack to stack by querying the stack type of whatever the > frame base address is if the frame base address ends up being out of > bounds for the current stack? Or maybe the unwinder could even do > this by itself. > >> +static bool in_exception_stack(unsigned long *s, struct stack_info *info, >> + unsigned long *visit_mask) >> { >> unsigned long stack = (unsigned long)s; >> unsigned long begin, end; >> @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long >> *s, char **name, >> if (stack < begin || stack >= end) >> continue; >> >> - if (test_and_set_bit(k, visit_mask)) >> + if (visit_mask && >> + test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask)) >> return false; >> >> - *name = exception_stack_names[k]; >> - return (unsigned long *)end; >> + info->type = STACK_TYPE_EXCEPTION + k; >> + info->begin = (unsigned long *)begin; >> + info->end = (unsigned long *)end; >> + info->next = (unsigned long *)info->end[-2]; > > This is so magical that I don't immediately see why it's correct. > Presumably it's because the
Re: [PATCH] Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"
On 23-07-16, 01:47, Rafael J. Wysocki wrote: > I'll apply the revert with a "Cc: stable" tag. That will work. > Question is what to do about the other drivers setting > cpuinfo.transition_latency to CPUFREQ_ETERNAL. Perhaps leave them as is unless someone comes and reports a problem, they don't seem to have any problem right now anyway :) -- viresh
Re: [PATCH] Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"
On 23-07-16, 01:47, Rafael J. Wysocki wrote: > I'll apply the revert with a "Cc: stable" tag. That will work. > Question is what to do about the other drivers setting > cpuinfo.transition_latency to CPUFREQ_ETERNAL. Perhaps leave them as is unless someone comes and reports a problem, they don't seem to have any problem right now anyway :) -- viresh
Re: [PATCH] Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"
On Sat, Jul 23, 2016 at 1:30 AM, Viresh Kumarwrote: > On 22-07-16, 23:46, Rafael J. Wysocki wrote: >> On Friday, July 22, 2016 02:28:52 PM Viresh Kumar wrote: >> > On 22-07-16, 23:31, Rafael J. Wysocki wrote: >> > > > cpufreq.c >> > > > >> > > > if (policy->governor->max_transition_latency && >> > > > policy->cpuinfo.transition_latency > >> > > > policy->governor->max_transition_latency) { >> > > > >> > > > - And this check will always fail, unless max_transition_latency is >> > > > zero. >> > > >> > > Why would it fail? If governor->max_transition_latency is non-zero, but >> > > less >> > > than UNIT_MAX, the condition checked will be true to my eyes. >> > >> > Bad wording. Sorry. >> > >> > I meant, this 'if' check will always succeed (as you also noted), and >> > so we will always get the error message reported in this patch. >> >> Not always, but for drivers setting cpuinfo.transition_latency to >> CPUFREQ_ETERNAL. > > So the drivers which have set their transition_latency to > CPUFREQ_ETERNAL, can't use ondemand governor unless > governor->max_transition_latency is set to 0 from userspace. > > What should be done about this patch then ? It broke in late 2015. I'll apply the revert with a "Cc: stable" tag. Question is what to do about the other drivers setting cpuinfo.transition_latency to CPUFREQ_ETERNAL. Thanks, Rafael
Re: [PATCH] Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"
On Sat, Jul 23, 2016 at 1:30 AM, Viresh Kumar wrote: > On 22-07-16, 23:46, Rafael J. Wysocki wrote: >> On Friday, July 22, 2016 02:28:52 PM Viresh Kumar wrote: >> > On 22-07-16, 23:31, Rafael J. Wysocki wrote: >> > > > cpufreq.c >> > > > >> > > > if (policy->governor->max_transition_latency && >> > > > policy->cpuinfo.transition_latency > >> > > > policy->governor->max_transition_latency) { >> > > > >> > > > - And this check will always fail, unless max_transition_latency is >> > > > zero. >> > > >> > > Why would it fail? If governor->max_transition_latency is non-zero, but >> > > less >> > > than UNIT_MAX, the condition checked will be true to my eyes. >> > >> > Bad wording. Sorry. >> > >> > I meant, this 'if' check will always succeed (as you also noted), and >> > so we will always get the error message reported in this patch. >> >> Not always, but for drivers setting cpuinfo.transition_latency to >> CPUFREQ_ETERNAL. > > So the drivers which have set their transition_latency to > CPUFREQ_ETERNAL, can't use ondemand governor unless > governor->max_transition_latency is set to 0 from userspace. > > What should be done about this patch then ? It broke in late 2015. I'll apply the revert with a "Cc: stable" tag. Question is what to do about the other drivers setting cpuinfo.transition_latency to CPUFREQ_ETERNAL. Thanks, Rafael
[PATCH 1/1] EDAC, sb_edac: Fix channel reporting on Knights Landing
On Intel Xeon Phi Knights Landing processor family the channels of memory controller have untypical arrangement - MC0 is mapped to CH3,4,5 and MC1 is mapped to CH0,1,2. This causes EDAC driver to report the channel name incorrectly. We missed this change earlier, so the code already contains similar comment, but the translation function is incorrect. Without this patch: errors in DIMM_A and DIMM_D were reported in DIMM_D errors in DIMM_B and DIMM_E were reported in DIMM_E errors in DIMM_C and DIMM_F were reported in DIMM_F Fixes: d0cdf9003140 ("sb_edac: Add Knights Landing (Xeon Phi gen 2) support") Signed-off-by: Lukasz OdziobaSigned-off-by: Hubert Chrzaniuk --- drivers/edac/sb_edac.c | 17 - 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c index 6744d88..61e2c52 100644 --- a/drivers/edac/sb_edac.c +++ b/drivers/edac/sb_edac.c @@ -552,9 +552,9 @@ static const struct pci_id_table pci_dev_descr_haswell_table[] = { /* Knight's Landing Support */ /* * KNL's memory channels are swizzled between memory controllers. - * MC0 is mapped to CH3,5,6 and MC1 is mapped to CH0,1,2 + * MC0 is mapped to CH3,4,5 and MC1 is mapped to CH0,1,2 */ -#define knl_channel_remap(channel) ((channel + 3) % 6) +#define knl_channel_remap(mc, chan) (mc ? chan : chan + 3) /* Memory controller, TAD tables, error injection - 2-8-0, 2-9-0 (2 of these) */ #define PCI_DEVICE_ID_INTEL_KNL_IMC_MC 0x7840 @@ -1286,7 +1286,7 @@ static u32 knl_get_mc_route(int entry, u32 reg) mc = GET_BITFIELD(reg, entry*3, (entry*3)+2); chan = GET_BITFIELD(reg, (entry*2) + 18, (entry*2) + 18 + 1); - return knl_channel_remap(mc*3 + chan); + return knl_channel_remap(mc, chan); } /* @@ -3003,9 +3003,16 @@ static void sbridge_mce_output_error(struct mem_ctl_info *mci, mscod, errcode, m->bank); } else { - char A = *("A"); + char mc, A = *("A"); - channel = knl_channel_remap(channel); + /* +* Reported channel is in range 0-2, so we can't map it +* back to mc. To figure out mc we check machine check +* bank register that reported this error. +* bank15 means mc0 and bank16 means mc1. +*/ + mc = m->bank == 16; + channel = knl_channel_remap(mc, channel); channel_mask = 1 << channel; snprintf(msg, sizeof(msg), "%s%s err_code:%04x:%04x channel:%d (DIMM_%c)", -- 1.8.3.1
[PATCH 1/1] EDAC, sb_edac: Fix channel reporting on Knights Landing
On Intel Xeon Phi Knights Landing processor family the channels of memory controller have untypical arrangement - MC0 is mapped to CH3,4,5 and MC1 is mapped to CH0,1,2. This causes EDAC driver to report the channel name incorrectly. We missed this change earlier, so the code already contains similar comment, but the translation function is incorrect. Without this patch: errors in DIMM_A and DIMM_D were reported in DIMM_D errors in DIMM_B and DIMM_E were reported in DIMM_E errors in DIMM_C and DIMM_F were reported in DIMM_F Fixes: d0cdf9003140 ("sb_edac: Add Knights Landing (Xeon Phi gen 2) support") Signed-off-by: Lukasz Odzioba Signed-off-by: Hubert Chrzaniuk --- drivers/edac/sb_edac.c | 17 - 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c index 6744d88..61e2c52 100644 --- a/drivers/edac/sb_edac.c +++ b/drivers/edac/sb_edac.c @@ -552,9 +552,9 @@ static const struct pci_id_table pci_dev_descr_haswell_table[] = { /* Knight's Landing Support */ /* * KNL's memory channels are swizzled between memory controllers. - * MC0 is mapped to CH3,5,6 and MC1 is mapped to CH0,1,2 + * MC0 is mapped to CH3,4,5 and MC1 is mapped to CH0,1,2 */ -#define knl_channel_remap(channel) ((channel + 3) % 6) +#define knl_channel_remap(mc, chan) (mc ? chan : chan + 3) /* Memory controller, TAD tables, error injection - 2-8-0, 2-9-0 (2 of these) */ #define PCI_DEVICE_ID_INTEL_KNL_IMC_MC 0x7840 @@ -1286,7 +1286,7 @@ static u32 knl_get_mc_route(int entry, u32 reg) mc = GET_BITFIELD(reg, entry*3, (entry*3)+2); chan = GET_BITFIELD(reg, (entry*2) + 18, (entry*2) + 18 + 1); - return knl_channel_remap(mc*3 + chan); + return knl_channel_remap(mc, chan); } /* @@ -3003,9 +3003,16 @@ static void sbridge_mce_output_error(struct mem_ctl_info *mci, mscod, errcode, m->bank); } else { - char A = *("A"); + char mc, A = *("A"); - channel = knl_channel_remap(channel); + /* +* Reported channel is in range 0-2, so we can't map it +* back to mc. To figure out mc we check machine check +* bank register that reported this error. +* bank15 means mc0 and bank16 means mc1. +*/ + mc = m->bank == 16; + channel = knl_channel_remap(mc, channel); channel_mask = 1 << channel; snprintf(msg, sizeof(msg), "%s%s err_code:%04x:%04x channel:%d (DIMM_%c)", -- 1.8.3.1
Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack
On Fri, Jul 22, 2016 at 4:30 PM, Josh Poimboeufwrote: > On Fri, Jul 22, 2016 at 04:18:04PM -0700, Andy Lutomirski wrote: >> On Fri, Jul 22, 2016 at 3:20 PM, Josh Poimboeuf wrote: >> > On Fri, Jul 22, 2016 at 02:46:10PM -0700, Andy Lutomirski wrote: >> >> On Fri, Jul 22, 2016 at 8:57 AM, Josh Poimboeuf >> >> wrote: >> >> > On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote: >> >> >> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf >> >> >> wrote: >> >> >> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote: >> >> >> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf >> >> >> >> wrote: >> >> >> >> > Now that we can find pt_regs registers in the middle of the stack >> >> >> >> > due to >> >> >> >> > an interrupt or exception, we can print them. Here's what it >> >> >> >> > looks >> >> >> >> > like: >> >> >> >> > >> >> >> >> >... >> >> >> >> >[] do_async_page_fault+0x2c/0xa0 >> >> >> >> >[] async_page_fault+0x28/0x30 >> >> >> >> > RIP: 0010:[] [] >> >> >> >> > __clear_user+0x42/0x70 >> >> >> >> > RSP: 0018:88007876fd38 EFLAGS: 00010202 >> >> >> >> > RAX: RBX: 0138 RCX: >> >> >> >> > 0138 >> >> >> >> > RDX: RSI: 0008 RDI: >> >> >> >> > 0061b640 >> >> >> >> > RBP: 88007876fd48 R08: 000dc2ced0d0 R09: >> >> >> >> > >> >> >> >> > R10: 0001 R11: R12: >> >> >> >> > 0061b640 >> >> >> >> > R13: R14: 88007877 R15: >> >> >> >> > 880079947200 >> >> >> >> >[] ? __clear_user+0x42/0x70 >> >> >> >> >[] ? __clear_user+0x23/0x70 >> >> >> >> >[] clear_user+0x2b/0x40 >> >> >> >> >... >> >> >> >> >> >> >> >> This looks wrong. Here are some theories: >> >> >> >> >> >> >> >> (a) __clear_user is a reliable address that is indicated by RIP: >> >> >> >> >> >> >> >> Then it's found again as an unreliable address as "? >> >> >> >> __clear_user+0x42/0x70" by scanning the stack. "? >> >> >> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the stack. >> >> >> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been >> >> >> >> suppressed because it matched a reliable address? >> >> >> >> >> >> >> >> (b) You actually intended for all the addresses to be printed, in >> >> >> >> which case "? __clear_user+0x42/0x70" should have been >> >> >> >> "__clear_user+0x42/0x70" and you have a bug. In this case, it's >> >> >> >> plausible that your state machine got a bit lost leading to "? >> >> >> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact -- >> >> >> >> it's a real frame and you didn't find it). >> >> >> >> >> >> >> >> (c) Something else and I'm confused. >> >> >> > >> >> >> > So there's a subtle difference between addresses reported by regs->ip >> >> >> > and normal return addresses. For example: >> >> >> > >> >> >> >... >> >> >> >[] smp_apic_timer_interrupt+0x3d/0x50 >> >> >> >[] apic_timer_interrupt+0x9e/0xb0 >> >> >> > RIP: 0010:[] [] >> >> >> > path_init+0x0/0x750 >> >> >> > RSP: 0018:880036a3fd80 EFLAGS: 0296 >> >> >> > RAX: 88003691aa40 RBX: 880036a3ff08 RCX: 880036a3ff08 >> >> >> > RDX: 880036a3ff08 RSI: 0041 RDI: 880036a3fdb0 >> >> >> > RBP: 880036a3fda0 R08: R09: 0010 >> >> >> > R10: 8080808080808080 R11: fefefefefefefeff R12: 880036a3fdb0 >> >> >> > R13: 0001 R14: 880036a3ff08 R15: >> >> >> > >> >> >> >[] ? lookup_fast+0x3d0/0x3d0 >> >> >> >[] ? path_lookupat+0x1b/0x120 >> >> >> >[] filename_lookup+0xb1/0x180 >> >> >> >... >> >> >> > >> >> >> > In this case the irq hit right after path_lookupat() called into >> >> >> > path_init(). So the "path_init+0x0" printed by __show_regs() is >> >> >> > right. >> >> >> > >> >> >> > Note the backtrace reports the same address, but it instead >> >> >> > describes it >> >> >> > as "lookup_fast+0x3d0", which is the end of lookup_fast(). That's >> >> >> > because normally, such an address after a call instruction at the >> >> >> > end of >> >> >> > a function would indicate a tail call (e.g., to a noreturn function). >> >> >> > If that were the case, printing "path_init+0x0" would be completely >> >> >> > wrong, because path_init() just happens to be the function located >> >> >> > immediately after lookup_fast(). >> >> >> > >> >> >> > Maybe I could add some special logic to say: "if this return address >> >> >> > was >> >> >> > from a call, use printk_stack_address(); else if it was from a >> >> >> > pt_regs, >> >> >> > use printk_address()." (The former prints the preceding function, >> >> >> > the >> >> >> > latter prints the current function.) Then we could remove the >> >> >> > question >> >> >> > mark. >> >> >> > >> >>
Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack
On Fri, Jul 22, 2016 at 4:30 PM, Josh Poimboeuf wrote: > On Fri, Jul 22, 2016 at 04:18:04PM -0700, Andy Lutomirski wrote: >> On Fri, Jul 22, 2016 at 3:20 PM, Josh Poimboeuf wrote: >> > On Fri, Jul 22, 2016 at 02:46:10PM -0700, Andy Lutomirski wrote: >> >> On Fri, Jul 22, 2016 at 8:57 AM, Josh Poimboeuf >> >> wrote: >> >> > On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote: >> >> >> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf >> >> >> wrote: >> >> >> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote: >> >> >> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf >> >> >> >> wrote: >> >> >> >> > Now that we can find pt_regs registers in the middle of the stack >> >> >> >> > due to >> >> >> >> > an interrupt or exception, we can print them. Here's what it >> >> >> >> > looks >> >> >> >> > like: >> >> >> >> > >> >> >> >> >... >> >> >> >> >[] do_async_page_fault+0x2c/0xa0 >> >> >> >> >[] async_page_fault+0x28/0x30 >> >> >> >> > RIP: 0010:[] [] >> >> >> >> > __clear_user+0x42/0x70 >> >> >> >> > RSP: 0018:88007876fd38 EFLAGS: 00010202 >> >> >> >> > RAX: RBX: 0138 RCX: >> >> >> >> > 0138 >> >> >> >> > RDX: RSI: 0008 RDI: >> >> >> >> > 0061b640 >> >> >> >> > RBP: 88007876fd48 R08: 000dc2ced0d0 R09: >> >> >> >> > >> >> >> >> > R10: 0001 R11: R12: >> >> >> >> > 0061b640 >> >> >> >> > R13: R14: 88007877 R15: >> >> >> >> > 880079947200 >> >> >> >> >[] ? __clear_user+0x42/0x70 >> >> >> >> >[] ? __clear_user+0x23/0x70 >> >> >> >> >[] clear_user+0x2b/0x40 >> >> >> >> >... >> >> >> >> >> >> >> >> This looks wrong. Here are some theories: >> >> >> >> >> >> >> >> (a) __clear_user is a reliable address that is indicated by RIP: >> >> >> >> >> >> >> >> Then it's found again as an unreliable address as "? >> >> >> >> __clear_user+0x42/0x70" by scanning the stack. "? >> >> >> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the stack. >> >> >> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been >> >> >> >> suppressed because it matched a reliable address? >> >> >> >> >> >> >> >> (b) You actually intended for all the addresses to be printed, in >> >> >> >> which case "? __clear_user+0x42/0x70" should have been >> >> >> >> "__clear_user+0x42/0x70" and you have a bug. In this case, it's >> >> >> >> plausible that your state machine got a bit lost leading to "? >> >> >> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact -- >> >> >> >> it's a real frame and you didn't find it). >> >> >> >> >> >> >> >> (c) Something else and I'm confused. >> >> >> > >> >> >> > So there's a subtle difference between addresses reported by regs->ip >> >> >> > and normal return addresses. For example: >> >> >> > >> >> >> >... >> >> >> >[] smp_apic_timer_interrupt+0x3d/0x50 >> >> >> >[] apic_timer_interrupt+0x9e/0xb0 >> >> >> > RIP: 0010:[] [] >> >> >> > path_init+0x0/0x750 >> >> >> > RSP: 0018:880036a3fd80 EFLAGS: 0296 >> >> >> > RAX: 88003691aa40 RBX: 880036a3ff08 RCX: 880036a3ff08 >> >> >> > RDX: 880036a3ff08 RSI: 0041 RDI: 880036a3fdb0 >> >> >> > RBP: 880036a3fda0 R08: R09: 0010 >> >> >> > R10: 8080808080808080 R11: fefefefefefefeff R12: 880036a3fdb0 >> >> >> > R13: 0001 R14: 880036a3ff08 R15: >> >> >> > >> >> >> >[] ? lookup_fast+0x3d0/0x3d0 >> >> >> >[] ? path_lookupat+0x1b/0x120 >> >> >> >[] filename_lookup+0xb1/0x180 >> >> >> >... >> >> >> > >> >> >> > In this case the irq hit right after path_lookupat() called into >> >> >> > path_init(). So the "path_init+0x0" printed by __show_regs() is >> >> >> > right. >> >> >> > >> >> >> > Note the backtrace reports the same address, but it instead >> >> >> > describes it >> >> >> > as "lookup_fast+0x3d0", which is the end of lookup_fast(). That's >> >> >> > because normally, such an address after a call instruction at the >> >> >> > end of >> >> >> > a function would indicate a tail call (e.g., to a noreturn function). >> >> >> > If that were the case, printing "path_init+0x0" would be completely >> >> >> > wrong, because path_init() just happens to be the function located >> >> >> > immediately after lookup_fast(). >> >> >> > >> >> >> > Maybe I could add some special logic to say: "if this return address >> >> >> > was >> >> >> > from a call, use printk_stack_address(); else if it was from a >> >> >> > pt_regs, >> >> >> > use printk_address()." (The former prints the preceding function, >> >> >> > the >> >> >> > latter prints the current function.) Then we could remove the >> >> >> > question >> >> >> > mark. >> >> >> > >> >> >> > There's also the question of whether or not the address should be >> >> >> > printed again, after
[git pull] Input updates for 4.7-rc7 (part 2)
Hi Linus, Please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input.git for-linus to receive a few more fixes for the input subsystem: - restore naming for tsc2005 touchscreens as some userspace match on it - fix out of bound access in legacy keyboard driver - fixup in RMI4 driver Everything is tagged for stable as well. Changelog: - Andrew Duggan (1): Input: synaptics-rmi4 - fix maximum size check for F12 control register 8 Dmitry Torokhov (1): tty/vt/keyboard: fix OOB access in do_compute_shiftstate() Michael Welling (1): Input: tsc200x - report proper input_dev name Diffstat: drivers/input/rmi4/rmi_f12.c | 9 + drivers/input/touchscreen/tsc2004.c | 7 ++- drivers/input/touchscreen/tsc2005.c | 7 ++- drivers/input/touchscreen/tsc200x-core.c | 15 --- drivers/input/touchscreen/tsc200x-core.h | 2 +- drivers/tty/vt/keyboard.c| 30 +- 6 files changed, 39 insertions(+), 31 deletions(-)
[git pull] Input updates for 4.7-rc7 (part 2)
Hi Linus, Please pull from: git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input.git for-linus to receive a few more fixes for the input subsystem: - restore naming for tsc2005 touchscreens as some userspace match on it - fix out of bound access in legacy keyboard driver - fixup in RMI4 driver Everything is tagged for stable as well. Changelog: - Andrew Duggan (1): Input: synaptics-rmi4 - fix maximum size check for F12 control register 8 Dmitry Torokhov (1): tty/vt/keyboard: fix OOB access in do_compute_shiftstate() Michael Welling (1): Input: tsc200x - report proper input_dev name Diffstat: drivers/input/rmi4/rmi_f12.c | 9 + drivers/input/touchscreen/tsc2004.c | 7 ++- drivers/input/touchscreen/tsc2005.c | 7 ++- drivers/input/touchscreen/tsc200x-core.c | 15 --- drivers/input/touchscreen/tsc200x-core.h | 2 +- drivers/tty/vt/keyboard.c| 30 +- 6 files changed, 39 insertions(+), 31 deletions(-)
[PATCH v4 7/7] power_supply: make use of new strcpytoupper() function
Call strcpytoupper() rather than walking the string explicitly to convert it to uppercase. Signed-off-by: Markus Mayer--- drivers/power/power_supply_sysfs.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/power/power_supply_sysfs.c b/drivers/power/power_supply_sysfs.c index 80fed98..20fdcc5 100644 --- a/drivers/power/power_supply_sysfs.c +++ b/drivers/power/power_supply_sysfs.c @@ -256,19 +256,16 @@ void power_supply_init_attrs(struct device_type *dev_type) static char *kstruprdup(const char *str, gfp_t gfp) { - char *ret, *ustr; + char *ustr; - ustr = ret = kmalloc(strlen(str) + 1, gfp); + ustr = kmalloc(strlen(str) + 1, gfp); - if (!ret) + if (!ustr) return NULL; - while (*str) - *ustr++ = toupper(*str++); + strcpytoupper(ustr, str); - *ustr = 0; - - return ret; + return ustr; } int power_supply_uevent(struct device *dev, struct kobj_uevent_env *env) -- 2.7.4
[PATCH v4 7/7] power_supply: make use of new strcpytoupper() function
Call strcpytoupper() rather than walking the string explicitly to convert it to uppercase. Signed-off-by: Markus Mayer --- drivers/power/power_supply_sysfs.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/power/power_supply_sysfs.c b/drivers/power/power_supply_sysfs.c index 80fed98..20fdcc5 100644 --- a/drivers/power/power_supply_sysfs.c +++ b/drivers/power/power_supply_sysfs.c @@ -256,19 +256,16 @@ void power_supply_init_attrs(struct device_type *dev_type) static char *kstruprdup(const char *str, gfp_t gfp) { - char *ret, *ustr; + char *ustr; - ustr = ret = kmalloc(strlen(str) + 1, gfp); + ustr = kmalloc(strlen(str) + 1, gfp); - if (!ret) + if (!ustr) return NULL; - while (*str) - *ustr++ = toupper(*str++); + strcpytoupper(ustr, str); - *ustr = 0; - - return ret; + return ustr; } int power_supply_uevent(struct device *dev, struct kobj_uevent_env *env) -- 2.7.4
[PATCH v4 6/7] drm/nouveau/fifo/gk104: make use of new strcpytoupper() function
Call strcpytoupper() rather than copying the string explicitly and then walking it to convert it to uppercase. Signed-off-by: Markus Mayer--- drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c index 743f3a1..8d01032 100644 --- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c +++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c @@ -332,10 +332,7 @@ gk104_fifo_intr_fault(struct gk104_fifo *fifo, int unit) enum nvkm_devidx engidx = nvkm_top_fault(device->top, unit); if (engidx < NVKM_SUBDEV_NR) { const char *src = nvkm_subdev_name[engidx]; - char *dst = en; - do { - *dst++ = toupper(*src++); - } while(*src); + strcpytoupper(en, src); engine = nvkm_device_engine(device, engidx); } } else { -- 2.7.4
[PATCH v4 6/7] drm/nouveau/fifo/gk104: make use of new strcpytoupper() function
Call strcpytoupper() rather than copying the string explicitly and then walking it to convert it to uppercase. Signed-off-by: Markus Mayer --- drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c index 743f3a1..8d01032 100644 --- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c +++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c @@ -332,10 +332,7 @@ gk104_fifo_intr_fault(struct gk104_fifo *fifo, int unit) enum nvkm_devidx engidx = nvkm_top_fault(device->top, unit); if (engidx < NVKM_SUBDEV_NR) { const char *src = nvkm_subdev_name[engidx]; - char *dst = en; - do { - *dst++ = toupper(*src++); - } while(*src); + strcpytoupper(en, src); engine = nvkm_device_engine(device, engidx); } } else { -- 2.7.4