date:20160722

On Sat, Jul 23, 2016 at 2:35 PM, Josh Poimboeuf  wrote:
>
> While doing the scanning and printing, it does call the frame pointer
> unwinder in parallel, but like before, that's *only* used to determine
> whether a found address should be printed without a question mark.  If
> the unwinder goes off the rails, the scanning and printing of text
> addresses goes on, undisturbed.
>
> The frame pointer unwinder code itself is quite careful not to
> dereference anything it shouldn't (though of course I welcome any review
> comments that find otherwise).

So this was the bug the last time around we did unwinders - the code
would dereference the unwind tables, and the tables would be
corrupted. End result: recursive oops.

And they were corrupted not even because of memory corruption, but
simply because they contained incorrect data, due to compiler bugs and
other issues.

I have really bad memories from that time. Several years after the
fact. It took months to finally revert the crap, because the author
continued to insist that "this was the last bug" for several passes
through that thing.

As they say, "Once burned, twice shy". But in this case, it's more
like "Four times burned, sixteen times as shy".

Linus

Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code

On Sat, Jul 23, 2016 at 2:35 PM, Josh Poimboeuf  wrote:
>
> While doing the scanning and printing, it does call the frame pointer
> unwinder in parallel, but like before, that's *only* used to determine
> whether a found address should be printed without a question mark.  If
> the unwinder goes off the rails, the scanning and printing of text
> addresses goes on, undisturbed.
>
> The frame pointer unwinder code itself is quite careful not to
> dereference anything it shouldn't (though of course I welcome any review
> comments that find otherwise).

So this was the bug the last time around we did unwinders - the code
would dereference the unwind tables, and the tables would be
corrupted. End result: recursive oops.

And they were corrupted not even because of memory corruption, but
simply because they contained incorrect data, due to compiler bugs and
other issues.

I have really bad memories from that time. Several years after the
fact. It took months to finally revert the crap, because the author
continued to insist that "this was the last bug" for several passes
through that thing.

As they say, "Once burned, twice shy". But in this case, it's more
like "Four times burned, sixteen times as shy".

Linus

Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code

On Fri, Jul 22, 2016 at 05:31:47PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 22, 2016 at 5:22 PM, Linus Torvalds
>  wrote:
> >
> > So without having yet looked at the code, I want people to understand
> > that to a very real degree, the stack tracer that the *oopsing* code
> > (ie what all the usual kernel fault handlers use) is very very special
> > code and needs to be handled very carefully, and needs to be extra
> > robust, even in the presence of stack corruption, and even in the
> > presence of the dwarf info being totally corrupted. Because we've very
> > much had both things happen.
> >
> > It is very possible that we should have two different stack tracers -
> > the stupid "for oopses only" code that doesn't necessarily give the
> > perfect trace, but is very anal and happily gives old stale addresses
> > (which can be very useful for seeing what happened just before the
> > "real" stack trace), and then a separate stack trace engine that is
> > clever and gets things right, and if that one faults it can depend on
> > the normal kernel fault handling picking up the pieces.
> 
> I think that Josh's code has the potential to be extremely robust
> *and* give more correct results when possible.  One thing I intend to
> review when v2 shows up is that it's as conservative as it needs to be
> to avoid ever dereferencing an out-of-bounds pointer.  And Josh's oops
> printer carefully walks and prints out all addresses on the stack
> (complete with question marks) even if the unwinder doesn't find them.

I should add that while the show_trace_log_lvl() code (which is used for
oopses) looks different on the surface, the algorithm is fundamentally
the same as before: traverse the stacks, scanning and printing any
kernel text addresses.

While doing the scanning and printing, it does call the frame pointer
unwinder in parallel, but like before, that's *only* used to determine
whether a found address should be printed without a question mark.  If
the unwinder goes off the rails, the scanning and printing of text
addresses goes on, undisturbed.

The frame pointer unwinder code itself is quite careful not to
dereference anything it shouldn't (though of course I welcome any review
comments that find otherwise).

> > Yes, the current stack tracer is  crufty. No, it's not perfect. But it
> > is very well tested, and has held up. That should not be dismissed.
> >
> 
> I think you may be giving the current tracer slightly more credit than
> it's due.  In my stack guard page patchset, I fixed two separate
> issues, one of which caused recursive faults and one of which caused
> it to output nothing at all.  So maybe *now* it's very robust :)  But
> it's still an umaintainable mess IMO, and Josh's patchset helps a
> *lot*.

-- 
Josh

Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code

On Fri, Jul 22, 2016 at 05:31:47PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 22, 2016 at 5:22 PM, Linus Torvalds
>  wrote:
> >
> > So without having yet looked at the code, I want people to understand
> > that to a very real degree, the stack tracer that the *oopsing* code
> > (ie what all the usual kernel fault handlers use) is very very special
> > code and needs to be handled very carefully, and needs to be extra
> > robust, even in the presence of stack corruption, and even in the
> > presence of the dwarf info being totally corrupted. Because we've very
> > much had both things happen.
> >
> > It is very possible that we should have two different stack tracers -
> > the stupid "for oopses only" code that doesn't necessarily give the
> > perfect trace, but is very anal and happily gives old stale addresses
> > (which can be very useful for seeing what happened just before the
> > "real" stack trace), and then a separate stack trace engine that is
> > clever and gets things right, and if that one faults it can depend on
> > the normal kernel fault handling picking up the pieces.
> 
> I think that Josh's code has the potential to be extremely robust
> *and* give more correct results when possible.  One thing I intend to
> review when v2 shows up is that it's as conservative as it needs to be
> to avoid ever dereferencing an out-of-bounds pointer.  And Josh's oops
> printer carefully walks and prints out all addresses on the stack
> (complete with question marks) even if the unwinder doesn't find them.

I should add that while the show_trace_log_lvl() code (which is used for
oopses) looks different on the surface, the algorithm is fundamentally
the same as before: traverse the stacks, scanning and printing any
kernel text addresses.

While doing the scanning and printing, it does call the frame pointer
unwinder in parallel, but like before, that's *only* used to determine
whether a found address should be printed without a question mark.  If
the unwinder goes off the rails, the scanning and printing of text
addresses goes on, undisturbed.

The frame pointer unwinder code itself is quite careful not to
dereference anything it shouldn't (though of course I welcome any review
comments that find otherwise).

> > Yes, the current stack tracer is  crufty. No, it's not perfect. But it
> > is very well tested, and has held up. That should not be dismissed.
> >
> 
> I think you may be giving the current tracer slightly more credit than
> it's due.  In my stack guard page patchset, I fixed two separate
> issues, one of which caused recursive faults and one of which caused
> it to output nothing at all.  So maybe *now* it's very robust :)  But
> it's still an umaintainable mess IMO, and Josh's patchset helps a
> *lot*.

-- 
Josh

Re: PM-wakeup: Delete unnecessary checks before two function calls

2016-07-22 Thread SF Markus Elfring

>> How do you think about to integrate this update suggestion
>> into another source code repository?
> 
> I'm not really sure what you mean.

Do you find the suggested source code change acceptable?

http://article.gmane.org/gmane.linux.power-management.general/61766
https://lkml.org/lkml/2015/6/28/21


Who would dare to commit it?

Regards,
Markus

Re: PM-wakeup: Delete unnecessary checks before two function calls

2016-07-22 Thread SF Markus Elfring

>> How do you think about to integrate this update suggestion
>> into another source code repository?
> 
> I'm not really sure what you mean.

Do you find the suggested source code change acceptable?

http://article.gmane.org/gmane.linux.power-management.general/61766
https://lkml.org/lkml/2015/6/28/21


Who would dare to commit it?

Regards,
Markus

Re: [kernel-hardening] [PATCH v5 03/32] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated

2016-07-22 Thread Valdis . Kletnieks

On Thu, 21 Jul 2016 22:34:33 -0700, Andy Lutomirski said:

> How much memory do you have and what's your config?  My code is
> obviously buggy, but I'm wondering why neither I nor the 0day bot caught
> this.

Probably because your devel box and the 0day bot both have 4-level page
tables and the dual-core i5 in my laptop has (presumably) 3?

In any case, your patch didn't fix things, nor did (as you noted in a mail
to Ingo) does reverting the problem commit (and then the following one that
deletes now-dead code so it will compile cleanly).

8G of memory, 2 4G dimms.  .config attached.  Anything that would help
from /proc or /sys?
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 4.7.0-rc7 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DEBUG_RODATA=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
# CONFIG_KERNEL_GZIP is not set
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
CONFIG_KERNEL_XZ=y
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_FHANDLE=y
# CONFIG_USELIB is not set
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_DEBUG=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
CONFIG_IRQ_TIME_ACCOUNTING=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y

#
# RCU Subsystem
#
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_RCU_EXPEDITE_BOOT is not set
CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=21
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_NMI_LOG_BUF_SHIFT=13
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_ARCH_SUPPORTS_INT128=y
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_MEMCG_SWAP_ENABLED=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_DEBUG=y
CONFIG_CHECKPOINT_RESTORE=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
# CONFIG_USER_NS is not set
CONFIG_PID_NS=y
CONFIG_NET_NS=y

Re: [kernel-hardening] [PATCH v5 03/32] x86/cpa: In populate_pgd, don't set the pgd entry until it's populated

2016-07-22 Thread Valdis . Kletnieks

On Thu, 21 Jul 2016 22:34:33 -0700, Andy Lutomirski said:

> How much memory do you have and what's your config?  My code is
> obviously buggy, but I'm wondering why neither I nor the 0day bot caught
> this.

Probably because your devel box and the 0day bot both have 4-level page
tables and the dual-core i5 in my laptop has (presumably) 3?

In any case, your patch didn't fix things, nor did (as you noted in a mail
to Ingo) does reverting the problem commit (and then the following one that
deletes now-dead code so it will compile cleanly).

8G of memory, 2 4G dimms.  .config attached.  Anything that would help
from /proc or /sys?
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 4.7.0-rc7 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DEBUG_RODATA=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
# CONFIG_KERNEL_GZIP is not set
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
CONFIG_KERNEL_XZ=y
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_FHANDLE=y
# CONFIG_USELIB is not set
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_DEBUG=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
CONFIG_IRQ_TIME_ACCOUNTING=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y

#
# RCU Subsystem
#
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_RCU_EXPEDITE_BOOT is not set
CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=21
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_NMI_LOG_BUF_SHIFT=13
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_ARCH_SUPPORTS_INT128=y
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_MEMCG_SWAP_ENABLED=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_DEBUG=y
CONFIG_CHECKPOINT_RESTORE=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
# CONFIG_USER_NS is not set
CONFIG_PID_NS=y
CONFIG_NET_NS=y

Re: [PATCH/RFC] Re: linux-next: build failure after merge of the luto-misc tree

2016-07-22 Thread Stephen Rothwell

Hi Arnaldo,

On Fri, 22 Jul 2016 16:57:34 -0300 Arnaldo Carvalho de Melo  
wrote:
>
> Em Fri, Jul 22, 2016 at 02:44:17PM -0500, Josh Poimboeuf escreveu:
> > On Fri, Jul 22, 2016 at 04:36:55PM -0300, Arnaldo Carvalho de Melo wrote:  
> > > Em Fri, Jul 22, 2016 at 02:19:20PM -0500, Josh Poimboeuf escreveu:  
> > > > On Fri, Jul 22, 2016 at 11:37:39AM -0300, Arnaldo Carvalho de Melo 
> > > > wrote:  
> > > > > I.e. with the two patches I mentioned, that are equivalent to the 
> > > > > last patch I
> > > > > sent to Stephen for testing, we would end up with HOSTARCH=powerpc and
> > > > > ARCH=x86, no?  
> 
> > > > Thanks for spelling it out, that helped a lot.  
> 
> > > Glad you liked it, I had to do it for my own sanity :-)  
> 
> > > And something that gave me mixed feelings was an e-mail from the kbuild
> > > test bot that noticed my perf/core changes and said that the build was
> > > broken for "make ARCH=x86_64", so I had to reinstate this part:  
> 
> > > ifeq ($(ARCH),x86_64)
> > > ARCH := x86
> > > endif  
> 
> > > Because, as you say, 'make ARCH=x86' works :-\ I think it will not be
> > > needed with your patch, right? I'm checking your patch below right now,  
> 
> > Yeah, that shouldn't be needed with my patch.  I think either would
> > work, but my patch is more of a permanent solution.  
> 
> Sure, I left it there because then we don't have bisection broke at that
> fix I made, i.e. 'make ARCH=x86_64' works at that point too.
> 
> I applied your patch and will push it to Ingo, now we must cross our
> fingers so that Stephen doesn't come back to us once more telling it is
> still broken :o)

Unfortunately, this is what I get when I just build perf/core:

  DESCEND  objtool
  CC   /home/sfr/next/x86_64_allmodconfig/tools/objtool/builtin-check.o
  LD   /home/sfr/next/x86_64_allmodconfig/tools/objtool/objtool-in.o
Warning: objtool: x86 instruction decoder differs from kernel
  LINK /home/sfr/next/x86_64_allmodconfig/tools/objtool/objtool
In file included from 
/home/sfr/next/next/arch/x86/include/uapi/asm/bitsperlong.h:10:0,
 from 
/home/sfr/next/next/include/uapi/asm-generic/int-ll64.h:11,
 from /home/sfr/next/next/include/uapi/asm-generic/types.h:6,
 from /home/sfr/next/next/arch/x86/include/uapi/asm/types.h:4,
 from /home/sfr/next/next/tools/include/linux/types.h:9,
 from /home/sfr/next/next/include/uapi/linux/elf.h:4,
 from /home/sfr/next/next/arch/x86/entry/vdso/vdso2c.c:66:
/home/sfr/next/next/tools/include/asm-generic/bitsperlong.h:13:2: error: #error 
Inconsistent word size. Check asm/bitsperlong.h
 #error Inconsistent word size. Check asm/bitsperlong.h
  ^

The be clear: this is a ppc64le hosted, x86_64 target cross build.

I than added the following patch, and the build finishes successfully.

From: Stephen Rothwell 
Date: Sat, 23 Jul 2016 14:35:40 +1000
Subject: [PATCH] x86: make the vdso2c compiler use the host architecture
 headers

Signed-off-by: Stephen Rothwell 
---
 arch/x86/entry/vdso/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 253b72eaade6..25e88c030c47 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -55,7 +55,7 @@ VDSO_LDFLAGS_vdso.lds = -m64 -Wl,-soname=linux-vdso.so.1 \
 $(obj)/vdso64.so.dbg: $(src)/vdso.lds $(vobjs) FORCE
$(call if_changed,vdso)

-HOST_EXTRACFLAGS += -I$(srctree)/tools/include -I$(srctree)/include/uapi 
-I$(srctree)/arch/x86/include/uapi
+HOST_EXTRACFLAGS += -I$(srctree)/tools/include -I$(srctree)/include/uapi 
-I$(srctree)/arch/$(SUBARCH)/include/uapi
 hostprogs-y+= vdso2c

 quiet_cmd_vdso2c = VDSO2C  $@
-- 
2.8.1

There may be a more correct way to do this ...
-- 
Cheers,
Stephen Rothwell

Re: [PATCH/RFC] Re: linux-next: build failure after merge of the luto-misc tree

2016-07-22 Thread Stephen Rothwell

Hi Arnaldo,

On Fri, 22 Jul 2016 16:57:34 -0300 Arnaldo Carvalho de Melo  
wrote:
>
> Em Fri, Jul 22, 2016 at 02:44:17PM -0500, Josh Poimboeuf escreveu:
> > On Fri, Jul 22, 2016 at 04:36:55PM -0300, Arnaldo Carvalho de Melo wrote:  
> > > Em Fri, Jul 22, 2016 at 02:19:20PM -0500, Josh Poimboeuf escreveu:  
> > > > On Fri, Jul 22, 2016 at 11:37:39AM -0300, Arnaldo Carvalho de Melo 
> > > > wrote:  
> > > > > I.e. with the two patches I mentioned, that are equivalent to the 
> > > > > last patch I
> > > > > sent to Stephen for testing, we would end up with HOSTARCH=powerpc and
> > > > > ARCH=x86, no?  
> 
> > > > Thanks for spelling it out, that helped a lot.  
> 
> > > Glad you liked it, I had to do it for my own sanity :-)  
> 
> > > And something that gave me mixed feelings was an e-mail from the kbuild
> > > test bot that noticed my perf/core changes and said that the build was
> > > broken for "make ARCH=x86_64", so I had to reinstate this part:  
> 
> > > ifeq ($(ARCH),x86_64)
> > > ARCH := x86
> > > endif  
> 
> > > Because, as you say, 'make ARCH=x86' works :-\ I think it will not be
> > > needed with your patch, right? I'm checking your patch below right now,  
> 
> > Yeah, that shouldn't be needed with my patch.  I think either would
> > work, but my patch is more of a permanent solution.  
> 
> Sure, I left it there because then we don't have bisection broke at that
> fix I made, i.e. 'make ARCH=x86_64' works at that point too.
> 
> I applied your patch and will push it to Ingo, now we must cross our
> fingers so that Stephen doesn't come back to us once more telling it is
> still broken :o)

Unfortunately, this is what I get when I just build perf/core:

  DESCEND  objtool
  CC   /home/sfr/next/x86_64_allmodconfig/tools/objtool/builtin-check.o
  LD   /home/sfr/next/x86_64_allmodconfig/tools/objtool/objtool-in.o
Warning: objtool: x86 instruction decoder differs from kernel
  LINK /home/sfr/next/x86_64_allmodconfig/tools/objtool/objtool
In file included from 
/home/sfr/next/next/arch/x86/include/uapi/asm/bitsperlong.h:10:0,
 from 
/home/sfr/next/next/include/uapi/asm-generic/int-ll64.h:11,
 from /home/sfr/next/next/include/uapi/asm-generic/types.h:6,
 from /home/sfr/next/next/arch/x86/include/uapi/asm/types.h:4,
 from /home/sfr/next/next/tools/include/linux/types.h:9,
 from /home/sfr/next/next/include/uapi/linux/elf.h:4,
 from /home/sfr/next/next/arch/x86/entry/vdso/vdso2c.c:66:
/home/sfr/next/next/tools/include/asm-generic/bitsperlong.h:13:2: error: #error 
Inconsistent word size. Check asm/bitsperlong.h
 #error Inconsistent word size. Check asm/bitsperlong.h
  ^

The be clear: this is a ppc64le hosted, x86_64 target cross build.

I than added the following patch, and the build finishes successfully.

From: Stephen Rothwell 
Date: Sat, 23 Jul 2016 14:35:40 +1000
Subject: [PATCH] x86: make the vdso2c compiler use the host architecture
 headers

Signed-off-by: Stephen Rothwell 
---
 arch/x86/entry/vdso/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 253b72eaade6..25e88c030c47 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -55,7 +55,7 @@ VDSO_LDFLAGS_vdso.lds = -m64 -Wl,-soname=linux-vdso.so.1 \
 $(obj)/vdso64.so.dbg: $(src)/vdso.lds $(vobjs) FORCE
$(call if_changed,vdso)

-HOST_EXTRACFLAGS += -I$(srctree)/tools/include -I$(srctree)/include/uapi 
-I$(srctree)/arch/x86/include/uapi
+HOST_EXTRACFLAGS += -I$(srctree)/tools/include -I$(srctree)/include/uapi 
-I$(srctree)/arch/$(SUBARCH)/include/uapi
 hostprogs-y+= vdso2c

 quiet_cmd_vdso2c = VDSO2C  $@
-- 
2.8.1

There may be a more correct way to do this ...
-- 
Cheers,
Stephen Rothwell

[PATCH] x86/mm/cpa: Unbreak populate_pgd(): stop trying to deallocate failed PUDs

This mostly reverts commit 360cb4d15567a7eca07a5f3ade6de308bbfb4e70.

I broke the case where a PUD table got allocated -- populate_pud()
would wander off a pgd_none entry and get lost.  I'm not sure how
this survived my testing.

Fixing this directly is difficult or impossible because of the awful
state of Linux's page table accessors.

Instead, fix the original issue in a much simpler way.  The problem
was that, if we allocated a PUD table, failed to populate it, and
freed it, another CPU could potentially keep using the PGD entry we
installed (either by copying it via vmalloc_fault or by speculatively
caching it).  There's a straightforward fix: simply leave the
top-level entry in place if this happens.  This can't waste any
significant amount of memory -- there are at most 256 entries like
this systemwide and, as a practical matter, if we hit this failure
path repeatedly, we're likely to reuse the same page anyway.

For context, this is a reversion with this hunk added in:

if (ret < 0) {
+   /*
+* Leave the PUD page in place in case some other CPU or thread
+* already found it, but remove any useless entries we just
+* added to it.
+*/
-   unmap_pgd_range(cpa->pgd, addr,
+   unmap_pud_range(pgd_entry, addr,
addr + (cpa->numpages << PAGE_SHIFT));
return ret;
}

This effectively open-codes what the now-deleted unmap_pgd_range()
function used to do except that unmap_pgd_range() used to try to
free the page as well.

Cc: Mike Krinkin 
Reported-by: Valdis Kletnieks 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/mm/pageattr.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 26c93c6e04a0..2bc6ea153f76 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1082,6 +1082,8 @@ static int populate_pgd(struct cpa_data *cpa, unsigned 
long addr)
pud = (pud_t *)get_zeroed_page(GFP_KERNEL | __GFP_NOTRACK);
if (!pud)
return -1;
+
+   set_pgd(pgd_entry, __pgd(__pa(pud) | _KERNPG_TABLE));
}
 
pgprot_val(pgprot) &= ~pgprot_val(cpa->mask_clr);
@@ -1089,16 +1091,11 @@ static int populate_pgd(struct cpa_data *cpa, unsigned 
long addr)
 
ret = populate_pud(cpa, addr, pgd_entry, pgprot);
if (ret < 0) {
-   if (pud)
-   free_page((unsigned long)pud);
unmap_pud_range(pgd_entry, addr,
addr + (cpa->numpages << PAGE_SHIFT));
return ret;
}
 
-   if (pud)
-   set_pgd(pgd_entry, __pgd(__pa(pud) | _KERNPG_TABLE));
-
cpa->numpages = ret;
return 0;
 }
-- 
2.7.4

[PATCH] x86/mm/cpa: Unbreak populate_pgd(): stop trying to deallocate failed PUDs

This mostly reverts commit 360cb4d15567a7eca07a5f3ade6de308bbfb4e70.

I broke the case where a PUD table got allocated -- populate_pud()
would wander off a pgd_none entry and get lost.  I'm not sure how
this survived my testing.

Fixing this directly is difficult or impossible because of the awful
state of Linux's page table accessors.

Instead, fix the original issue in a much simpler way.  The problem
was that, if we allocated a PUD table, failed to populate it, and
freed it, another CPU could potentially keep using the PGD entry we
installed (either by copying it via vmalloc_fault or by speculatively
caching it).  There's a straightforward fix: simply leave the
top-level entry in place if this happens.  This can't waste any
significant amount of memory -- there are at most 256 entries like
this systemwide and, as a practical matter, if we hit this failure
path repeatedly, we're likely to reuse the same page anyway.

For context, this is a reversion with this hunk added in:

if (ret < 0) {
+   /*
+* Leave the PUD page in place in case some other CPU or thread
+* already found it, but remove any useless entries we just
+* added to it.
+*/
-   unmap_pgd_range(cpa->pgd, addr,
+   unmap_pud_range(pgd_entry, addr,
addr + (cpa->numpages << PAGE_SHIFT));
return ret;
}

This effectively open-codes what the now-deleted unmap_pgd_range()
function used to do except that unmap_pgd_range() used to try to
free the page as well.

Cc: Mike Krinkin 
Reported-by: Valdis Kletnieks 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/mm/pageattr.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 26c93c6e04a0..2bc6ea153f76 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1082,6 +1082,8 @@ static int populate_pgd(struct cpa_data *cpa, unsigned 
long addr)
pud = (pud_t *)get_zeroed_page(GFP_KERNEL | __GFP_NOTRACK);
if (!pud)
return -1;
+
+   set_pgd(pgd_entry, __pgd(__pa(pud) | _KERNPG_TABLE));
}
 
pgprot_val(pgprot) &= ~pgprot_val(cpa->mask_clr);
@@ -1089,16 +1091,11 @@ static int populate_pgd(struct cpa_data *cpa, unsigned 
long addr)
 
ret = populate_pud(cpa, addr, pgd_entry, pgprot);
if (ret < 0) {
-   if (pud)
-   free_page((unsigned long)pud);
unmap_pud_range(pgd_entry, addr,
addr + (cpa->numpages << PAGE_SHIFT));
return ret;
}
 
-   if (pud)
-   set_pgd(pgd_entry, __pgd(__pa(pud) | _KERNPG_TABLE));
-
cpa->numpages = ret;
return 0;
 }
-- 
2.7.4

Re: [PATCH v4 2/4] drm/rockchip: add an common abstracted PSR driver

2016-07-22 Thread Doug Anderson

Yakir,

On Wed, Jul 13, 2016 at 9:15 PM, Yakir Yang  wrote:
> +static void psr_set_state(struct psr_drv *psr, enum psr_state state)
> +{
> +   mutex_lock(>state_mutex);
> +
> +   if (psr->state == state) {
> +   mutex_unlock(>state_mutex);
> +   return;
> +   }
> +
> +   psr->state = state;
> +   switch (state) {
> +   case PSR_ENABLE:
> +   psr->set(psr->encoder, true);
> +   break;
> +
> +   case PSR_DISABLE:
> +   case PSR_FLUSH:
> +   psr->set(psr->encoder, false);
> +   break;
> +   };
> +
> +   mutex_unlock(>state_mutex);
> +}
> +
> +static void psr_flush_handler(unsigned long data)
> +{
> +   struct psr_drv *psr = (struct psr_drv *)data;
> +
> +   if (!psr || psr->state != PSR_FLUSH)
> +   return;
> +
> +   psr_set_state(psr, PSR_ENABLE);

As mentioned in a separate thread, this is probably not OK.
psr_set_state() grabs a mutex and that might sleep.  ...but
psr_flush_handler() is a timer.  I'm nearly certain that timers can't
sleep.

I believe this is the source of "sleeping function called from invalid
context" that I've seen at times.

-Doug

Re: [PATCH 1/3] Add a new field to struct shrinker

On 07/22/2016 06:27 PM, Tony Jones wrote:
> On 07/20/2016 07:54 AM, Michal Hocko wrote:
> 
>>> Michal, just to make sure I understand you correctly, do you mean that we
>>> could infer the names of the shrinkers by looking at the names of their 
>>> callbacks?
>>
>> Yes, %ps can then be used for the name of the shrinker structure
>> (assuming it is available).
> 
> This is fine for emitting via the ftrace /sys interface,  but in order to 
> have the data [name] get 
> marshalled thru to perf (for example) you need to add it to the 
> TP_fast_assign entry.
> 
> tony

Unfortunately, %ps/%pF doesn't do much (re:  Michal's comment "assuming it is 
available"):

-   TP_printk("%pF %p: nid: %d objects to shrink %ld gfp_flags %s 
pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld",
+   TP_printk("%pF %p(%ps): nid: %d objects to shrink %ld gfp_flags %s 
pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld",
__entry->shrink,
__entry->shr,
+   __entry->shr,
__entry->nid,
__entry->nr_objects_to_shrink,

# cat trace_pipe
bash-1917  [003] ...1  2925.941062: mm_shrink_slab_start: 
super_cache_scan+0x0/0x1a0 88042bb60cc0(0x88042bb60cc0): nid: 0 objects 
to shrink 0 gfp_flags GFP_KERNEL pgs_scanned 1000 lru_pgs 1000 cache items 4 
delta 7 total_scan 7


Otherwise what I was suggesting was something like this to ensure it was 
correctly marshaled for perf/etc:

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -16,6 +16,8 @@
 #define RECLAIM_WB_SYNC0x0004u /* Unused, all reclaim async */
 #define RECLAIM_WB_ASYNC   0x0008u
 
+#define SHRINKER_NAME_LEN  (size_t)32
+
 #define show_reclaim_flags(flags)  \
(flags) ? __print_flags(flags, "|", \
{RECLAIM_WB_ANON,   "RECLAIM_WB_ANON"}, \
@@ -191,6 +193,7 @@ TRACE_EVENT(mm_shrink_slab_start,
TP_STRUCT__entry(
__field(struct shrinker *, shr)
__field(void *, shrink)
+   __array(char, name, SHRINKER_NAME_LEN);
__field(int, nid)
__field(long, nr_objects_to_shrink)
__field(gfp_t, gfp_flags)
@@ -202,6 +205,11 @@ TRACE_EVENT(mm_shrink_slab_start,
),
 
TP_fast_assign(
+   char sym[KSYM_SYMBOL_LEN];
+
+   sprint_symbol(sym, (unsigned long)shr);
+   strlcpy(__entry->name, sym, SHRINKER_NAME_LEN);
+
__entry->shr = shr;
__entry->shrink = shr->scan_objects;
__entry->nid = sc->nid;
@@ -214,9 +222,10 @@ TRACE_EVENT(mm_shrink_slab_start,
__entry->total_scan = total_scan;
),
 
-   TP_printk("%pF %p: nid: %d objects to shrink %ld gfp_flags %s 
pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld",
+   TP_printk("%pF %p(%s): nid: %d objects to shrink %ld gfp_flags %s 
pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld",
__entry->shrink,
__entry->shr,
+   __entry->name,
__entry->nid,
__entry->nr_objects_to_shrink,
show_gfp_flags(__entry->gfp_flags),
@@ -236,6 +245,7 @@ TRACE_EVENT(mm_shrink_slab_end,
 
TP_STRUCT__entry(
__field(struct shrinker *, shr)
+   __array(char, name, SHRINKER_NAME_LEN);
__field(int, nid)
__field(void *, shrink)
__field(long, unused_scan)
@@ -245,6 +255,11 @@ TRACE_EVENT(mm_shrink_slab_end,
),
 
TP_fast_assign(
+   char sym[KSYM_SYMBOL_LEN];
+
+   sprint_symbol(sym, (unsigned long)shr);
+   strlcpy(__entry->name, sym, SHRINKER_NAME_LEN);
+
__entry->shr = shr;
__entry->nid = nid;
__entry->shrink = shr->scan_objects;
@@ -254,9 +269,10 @@ TRACE_EVENT(mm_shrink_slab_end,
__entry->total_scan = total_scan;
),
 
-   TP_printk("%pF %p: nid: %d unused scan count %ld new scan count %ld 
total_scan %ld last shrinker return val %d",
+   TP_printk("%pF %p(%pF): nid: %d unused scan count %ld new scan count 
%ld total_scan %ld last shrinker return val %d",
__entry->shrink,
__entry->shr,
+   __entry->shr,
__entry->nid,
__entry->unused_scan,
__entry->new_scan,

Re: [PATCH v4 2/4] drm/rockchip: add an common abstracted PSR driver

2016-07-22 Thread Doug Anderson

Yakir,

On Wed, Jul 13, 2016 at 9:15 PM, Yakir Yang  wrote:
> +static void psr_set_state(struct psr_drv *psr, enum psr_state state)
> +{
> +   mutex_lock(>state_mutex);
> +
> +   if (psr->state == state) {
> +   mutex_unlock(>state_mutex);
> +   return;
> +   }
> +
> +   psr->state = state;
> +   switch (state) {
> +   case PSR_ENABLE:
> +   psr->set(psr->encoder, true);
> +   break;
> +
> +   case PSR_DISABLE:
> +   case PSR_FLUSH:
> +   psr->set(psr->encoder, false);
> +   break;
> +   };
> +
> +   mutex_unlock(>state_mutex);
> +}
> +
> +static void psr_flush_handler(unsigned long data)
> +{
> +   struct psr_drv *psr = (struct psr_drv *)data;
> +
> +   if (!psr || psr->state != PSR_FLUSH)
> +   return;
> +
> +   psr_set_state(psr, PSR_ENABLE);

As mentioned in a separate thread, this is probably not OK.
psr_set_state() grabs a mutex and that might sleep.  ...but
psr_flush_handler() is a timer.  I'm nearly certain that timers can't
sleep.

I believe this is the source of "sleeping function called from invalid
context" that I've seen at times.

-Doug

Re: [PATCH 1/3] Add a new field to struct shrinker

On 07/22/2016 06:27 PM, Tony Jones wrote:
> On 07/20/2016 07:54 AM, Michal Hocko wrote:
> 
>>> Michal, just to make sure I understand you correctly, do you mean that we
>>> could infer the names of the shrinkers by looking at the names of their 
>>> callbacks?
>>
>> Yes, %ps can then be used for the name of the shrinker structure
>> (assuming it is available).
> 
> This is fine for emitting via the ftrace /sys interface,  but in order to 
> have the data [name] get 
> marshalled thru to perf (for example) you need to add it to the 
> TP_fast_assign entry.
> 
> tony

Unfortunately, %ps/%pF doesn't do much (re:  Michal's comment "assuming it is 
available"):

-   TP_printk("%pF %p: nid: %d objects to shrink %ld gfp_flags %s 
pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld",
+   TP_printk("%pF %p(%ps): nid: %d objects to shrink %ld gfp_flags %s 
pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld",
__entry->shrink,
__entry->shr,
+   __entry->shr,
__entry->nid,
__entry->nr_objects_to_shrink,

# cat trace_pipe
bash-1917  [003] ...1  2925.941062: mm_shrink_slab_start: 
super_cache_scan+0x0/0x1a0 88042bb60cc0(0x88042bb60cc0): nid: 0 objects 
to shrink 0 gfp_flags GFP_KERNEL pgs_scanned 1000 lru_pgs 1000 cache items 4 
delta 7 total_scan 7


Otherwise what I was suggesting was something like this to ensure it was 
correctly marshaled for perf/etc:

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -16,6 +16,8 @@
 #define RECLAIM_WB_SYNC0x0004u /* Unused, all reclaim async */
 #define RECLAIM_WB_ASYNC   0x0008u
 
+#define SHRINKER_NAME_LEN  (size_t)32
+
 #define show_reclaim_flags(flags)  \
(flags) ? __print_flags(flags, "|", \
{RECLAIM_WB_ANON,   "RECLAIM_WB_ANON"}, \
@@ -191,6 +193,7 @@ TRACE_EVENT(mm_shrink_slab_start,
TP_STRUCT__entry(
__field(struct shrinker *, shr)
__field(void *, shrink)
+   __array(char, name, SHRINKER_NAME_LEN);
__field(int, nid)
__field(long, nr_objects_to_shrink)
__field(gfp_t, gfp_flags)
@@ -202,6 +205,11 @@ TRACE_EVENT(mm_shrink_slab_start,
),
 
TP_fast_assign(
+   char sym[KSYM_SYMBOL_LEN];
+
+   sprint_symbol(sym, (unsigned long)shr);
+   strlcpy(__entry->name, sym, SHRINKER_NAME_LEN);
+
__entry->shr = shr;
__entry->shrink = shr->scan_objects;
__entry->nid = sc->nid;
@@ -214,9 +222,10 @@ TRACE_EVENT(mm_shrink_slab_start,
__entry->total_scan = total_scan;
),
 
-   TP_printk("%pF %p: nid: %d objects to shrink %ld gfp_flags %s 
pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld",
+   TP_printk("%pF %p(%s): nid: %d objects to shrink %ld gfp_flags %s 
pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld",
__entry->shrink,
__entry->shr,
+   __entry->name,
__entry->nid,
__entry->nr_objects_to_shrink,
show_gfp_flags(__entry->gfp_flags),
@@ -236,6 +245,7 @@ TRACE_EVENT(mm_shrink_slab_end,
 
TP_STRUCT__entry(
__field(struct shrinker *, shr)
+   __array(char, name, SHRINKER_NAME_LEN);
__field(int, nid)
__field(void *, shrink)
__field(long, unused_scan)
@@ -245,6 +255,11 @@ TRACE_EVENT(mm_shrink_slab_end,
),
 
TP_fast_assign(
+   char sym[KSYM_SYMBOL_LEN];
+
+   sprint_symbol(sym, (unsigned long)shr);
+   strlcpy(__entry->name, sym, SHRINKER_NAME_LEN);
+
__entry->shr = shr;
__entry->nid = nid;
__entry->shrink = shr->scan_objects;
@@ -254,9 +269,10 @@ TRACE_EVENT(mm_shrink_slab_end,
__entry->total_scan = total_scan;
),
 
-   TP_printk("%pF %p: nid: %d unused scan count %ld new scan count %ld 
total_scan %ld last shrinker return val %d",
+   TP_printk("%pF %p(%pF): nid: %d unused scan count %ld new scan count 
%ld total_scan %ld last shrinker return val %d",
__entry->shrink,
__entry->shr,
+   __entry->shr,
__entry->nid,
__entry->unused_scan,
__entry->new_scan,

Re: linux-next: build failure after merge of the nvdimm tree

On Fri, Jul 22, 2016 at 6:04 PM, Dan Williams  wrote:
> On Thu, Jul 21, 2016 at 11:13 PM, Stephen Rothwell  
> wrote:
>> Hi Dan,
>>
>> After merging the nvdimm tree, today's linux-next build (powerpc
>> ppc64_defconfig) failed like this:
>>
>> In file included from drivers/md/dm.h:14:0,
>>  from drivers/md/dm-uevent.c:27:
>> include/linux/device-mapper.h:134:22: error: expected ';', ',' or ')' before 
>> '*' token
>>   void __pmem **kaddr, pfn_t *pfn, long size);
>>   ^
>> include/linux/device-mapper.h:182:2: error: unknown type name 
>> 'dm_direct_access_fn'
>>   dm_direct_access_fn direct_access;
>>   ^
>>
>> Caused by commit
>>
>>   7a9eb2066631 ("pmem: kill __pmem address space")
>>
>> interacting with commit
>>
>>   545ed20e6df6 ("dm: add infrastructure for DAX support")
>>
>> from the device-mapper tree.
>>
>> I applied the following merge fix patch for today.  Someone needs to
>> tell Linus about this when he merges the trees.
>
> There's no real rush to remove "__pmem" I'll pull this out until after
> DM DAX support has merged.
>
> Thanks Stephen!

Sorry, I forgot that some ARM patches already depend on this change...
so we'll need to roll forward and notify Linus about this merge
interaction.

Re: linux-next: build failure after merge of the nvdimm tree

On Fri, Jul 22, 2016 at 6:04 PM, Dan Williams  wrote:
> On Thu, Jul 21, 2016 at 11:13 PM, Stephen Rothwell  
> wrote:
>> Hi Dan,
>>
>> After merging the nvdimm tree, today's linux-next build (powerpc
>> ppc64_defconfig) failed like this:
>>
>> In file included from drivers/md/dm.h:14:0,
>>  from drivers/md/dm-uevent.c:27:
>> include/linux/device-mapper.h:134:22: error: expected ';', ',' or ')' before 
>> '*' token
>>   void __pmem **kaddr, pfn_t *pfn, long size);
>>   ^
>> include/linux/device-mapper.h:182:2: error: unknown type name 
>> 'dm_direct_access_fn'
>>   dm_direct_access_fn direct_access;
>>   ^
>>
>> Caused by commit
>>
>>   7a9eb2066631 ("pmem: kill __pmem address space")
>>
>> interacting with commit
>>
>>   545ed20e6df6 ("dm: add infrastructure for DAX support")
>>
>> from the device-mapper tree.
>>
>> I applied the following merge fix patch for today.  Someone needs to
>> tell Linus about this when he merges the trees.
>
> There's no real rush to remove "__pmem" I'll pull this out until after
> DM DAX support has merged.
>
> Thanks Stephen!

Sorry, I forgot that some ARM patches already depend on this change...
so we'll need to roll forward and notify Linus about this merge
interaction.

Crypto Fixes for 4.7

2016-07-22 Thread Herbert Xu

Hi Linus:

This push fixes a sporadic build failure in the qat driver as well
as a memory corruption bug in rsa-pkcs1pad.


Please pull from

git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6.git linus


Herbert Xu (1):
  crypto: rsa-pkcs1pad - fix rsa-pkcs1pad request struct

Jan Stancek (1):
  crypto: qat - make qat_asym_algs.o depend on asn1 headers

 crypto/rsa-pkcs1pad.c  |4 ++--
 drivers/crypto/qat/qat_common/Makefile |1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Crypto Fixes for 4.7

2016-07-22 Thread Herbert Xu

Hi Linus:

This push fixes a sporadic build failure in the qat driver as well
as a memory corruption bug in rsa-pkcs1pad.


Please pull from

git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6.git linus


Herbert Xu (1):
  crypto: rsa-pkcs1pad - fix rsa-pkcs1pad request struct

Jan Stancek (1):
  crypto: qat - make qat_asym_algs.o depend on asn1 headers

 crypto/rsa-pkcs1pad.c  |4 ++--
 drivers/crypto/qat/qat_common/Makefile |1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [GIT PULL] clk fixes for v4.7-rc8

On Fri, Jul 22, 2016 at 9:47 AM, Michael Turquette
 wrote:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux.git 
> tags/clk-fixes-for-linus

The pgp key you sign your pull requests with keep changing. Sometimes
it's your primary key, most often it seems to be your 4kb subkey, now
it's another 2kb subkey.

All the signatures look fine, but is there a reason for this odd behavior?

Linus

Re: [GIT PULL] clk fixes for v4.7-rc8

On Fri, Jul 22, 2016 at 9:47 AM, Michael Turquette
 wrote:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux.git 
> tags/clk-fixes-for-linus

The pgp key you sign your pull requests with keep changing. Sometimes
it's your primary key, most often it seems to be your 4kb subkey, now
it's another 2kb subkey.

All the signatures look fine, but is there a reason for this odd behavior?

Linus

Re: [PATCH net-next v3] cdc_ether: Improve ZTE MF823/831/910 handling

2016-07-22 Thread kbuild test robot

Hi,

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Kristian-Evensen/cdc_ether-Improve-ZTE-MF823-831-910-handling/20160723-093100
config: x86_64-randconfig-i0-201629 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.3-14) 4.9.3
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   drivers/net/usb/cdc_ether.c: In function 'usbnet_cdc_zte_rx_fixup':
>> drivers/net/usb/cdc_ether.c:461:5: warning: unused variable 
>> 'buggy_hwaddrs_idx' [-Wunused-variable]
 u8 buggy_hwaddrs_idx = 0;
^
>> drivers/net/usb/cdc_ether.c:460:5: warning: unused variable 
>> 'num_buggy_hwaddrs' [-Wunused-variable]
 u8 num_buggy_hwaddrs;
^

vim +/buggy_hwaddrs_idx +461 drivers/net/usb/cdc_ether.c

   454   * device sends packets with a static, bogus, random MAC address (event 
if
   455   * device MAC address has been updated). Always set MAC address to that 
of the
   456   * device.
   457   */
   458  static int usbnet_cdc_zte_rx_fixup(struct usbnet *dev, struct sk_buff 
*skb)
   459  {
 > 460  u8 num_buggy_hwaddrs;
 > 461  u8 buggy_hwaddrs_idx = 0;
   462  
   463  if (skb->len < ETH_HLEN || !(skb->data[0] & 0x02))
   464  return 1;

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data

Re: [PATCH net-next v3] cdc_ether: Improve ZTE MF823/831/910 handling

2016-07-22 Thread kbuild test robot

Hi,

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Kristian-Evensen/cdc_ether-Improve-ZTE-MF823-831-910-handling/20160723-093100
config: x86_64-randconfig-i0-201629 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.3-14) 4.9.3
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   drivers/net/usb/cdc_ether.c: In function 'usbnet_cdc_zte_rx_fixup':
>> drivers/net/usb/cdc_ether.c:461:5: warning: unused variable 
>> 'buggy_hwaddrs_idx' [-Wunused-variable]
 u8 buggy_hwaddrs_idx = 0;
^
>> drivers/net/usb/cdc_ether.c:460:5: warning: unused variable 
>> 'num_buggy_hwaddrs' [-Wunused-variable]
 u8 num_buggy_hwaddrs;
^

vim +/buggy_hwaddrs_idx +461 drivers/net/usb/cdc_ether.c

   454   * device sends packets with a static, bogus, random MAC address (event 
if
   455   * device MAC address has been updated). Always set MAC address to that 
of the
   456   * device.
   457   */
   458  static int usbnet_cdc_zte_rx_fixup(struct usbnet *dev, struct sk_buff 
*skb)
   459  {
 > 460  u8 num_buggy_hwaddrs;
 > 461  u8 buggy_hwaddrs_idx = 0;
   462  
   463  if (skb->len < ETH_HLEN || !(skb->data[0] & 0x02))
   464  return 1;

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data

Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-22 Thread Eric W. Biederman

Kees Cook  writes:

> On Fri, Jul 22, 2016 at 11:45 AM, Eric W. Biederman
>  wrote:
>> Colin Walters  writes:
>>
>>> On Thu, Jul 21, 2016, at 12:39 PM, Eric W. Biederman wrote:

 This patchset addresses two use cases:
 - Implement a sane upper bound on the number of namespaces.
 - Provide a way for sandboxes to limit the attack surface from
   namespaces.
>>>
>>> Perhaps this is obvious, but since you didn't quite explicitly state it;
>>> do you see this as obsoleting the existing downstream patches
>>> mentioned in:
>>> https://lwn.net/Articles/673597/
>>> It seems conceptually similar to Kees' original approach, right?
>>
>> Similar yes, and I expect it fills the need.  My primary difference is
>> that I believe this approach makes sense from a perspective of assuming
>> that user namespaces or other namespaces are not any buggier than any
>> other piece of kernel code and that people will use them.
>>
>> I don't see these limits making sense from a perspective that user
>> namespaces are flawed and distro kernels should not have enabled them in
>> the first place.  That was my perception right or wrong of Kees patches
>> and the related patches that landed in Ubuntu and Debian.
>>
>> With Kees approach I could not see how to handle the case where some
>> applications on the system wanted user namespaces and others don't.
>> Which made it very nasty for future evolution and more deployment of
>> user namespaces.  Being per user namespace these limits can be used to
>> sandbox applications without affecting the rest of the system.
>
> While it certainly works for my use-case (init ns
> max_usernamespaces=0), I don't see how this helps the case of "let
> user foobar open 1 userns, but everyone else is 0", which is likely
> the middle ground between "just turn it off" and "everyone gets to
> create usernamespaces". I'm personally not interested in that level of
> granularity, but in earlier discussions it sounded like this was
> something you wanted?

So the case I really care about is when there is limited use, so people
don't have to redesign their applications.  In this case if you want to
disable things in a sandbox like way you just create a user namespace
and set the count to 0 in that user namespace.

A whole system disable I tend to think is a stupid configuration for a
new system.  It gets into people negotiating for what they need, and I
don't see that as sustainable.  I prefer good usable defaults.

I would have loved to have done something with per user limits so it
could be disabled for a selection of users, but it turns out the kernel
doesn't have appropriate data structures for to hold limits for users
that have not logged in.

And in practice I don't care the case where 1 user is allowed but not
the others, I care about disallow this user/program that is in a
sandbox.  I also seem to recall people have problems with using seccomp
to disable things.  All of that said a per user policy is easily
implemented in pam by setting the size count for a specific user to 0.

I do think a limit to catch applications that go crazy is very sane, and
that is primarily what is implemented here.

Eric

Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-22 Thread Eric W. Biederman

Kees Cook  writes:

> On Fri, Jul 22, 2016 at 11:45 AM, Eric W. Biederman
>  wrote:
>> Colin Walters  writes:
>>
>>> On Thu, Jul 21, 2016, at 12:39 PM, Eric W. Biederman wrote:

 This patchset addresses two use cases:
 - Implement a sane upper bound on the number of namespaces.
 - Provide a way for sandboxes to limit the attack surface from
   namespaces.
>>>
>>> Perhaps this is obvious, but since you didn't quite explicitly state it;
>>> do you see this as obsoleting the existing downstream patches
>>> mentioned in:
>>> https://lwn.net/Articles/673597/
>>> It seems conceptually similar to Kees' original approach, right?
>>
>> Similar yes, and I expect it fills the need.  My primary difference is
>> that I believe this approach makes sense from a perspective of assuming
>> that user namespaces or other namespaces are not any buggier than any
>> other piece of kernel code and that people will use them.
>>
>> I don't see these limits making sense from a perspective that user
>> namespaces are flawed and distro kernels should not have enabled them in
>> the first place.  That was my perception right or wrong of Kees patches
>> and the related patches that landed in Ubuntu and Debian.
>>
>> With Kees approach I could not see how to handle the case where some
>> applications on the system wanted user namespaces and others don't.
>> Which made it very nasty for future evolution and more deployment of
>> user namespaces.  Being per user namespace these limits can be used to
>> sandbox applications without affecting the rest of the system.
>
> While it certainly works for my use-case (init ns
> max_usernamespaces=0), I don't see how this helps the case of "let
> user foobar open 1 userns, but everyone else is 0", which is likely
> the middle ground between "just turn it off" and "everyone gets to
> create usernamespaces". I'm personally not interested in that level of
> granularity, but in earlier discussions it sounded like this was
> something you wanted?

So the case I really care about is when there is limited use, so people
don't have to redesign their applications.  In this case if you want to
disable things in a sandbox like way you just create a user namespace
and set the count to 0 in that user namespace.

A whole system disable I tend to think is a stupid configuration for a
new system.  It gets into people negotiating for what they need, and I
don't see that as sustainable.  I prefer good usable defaults.

I would have loved to have done something with per user limits so it
could be disabled for a selection of users, but it turns out the kernel
doesn't have appropriate data structures for to hold limits for users
that have not logged in.

And in practice I don't care the case where 1 user is allowed but not
the others, I care about disallow this user/program that is in a
sandbox.  I also seem to recall people have problems with using seccomp
to disable things.  All of that said a per user policy is easily
implemented in pam by setting the size count for a specific user to 0.

I do think a limit to catch applications that go crazy is very sane, and
that is primarily what is implemented here.

Eric

Re: [v6 PATCH 1/6] extcon: Add Type-C and DP support

Hi,

2016-07-23 6:40 GMT+09:00 Guenter Roeck :
> On Fri, Jul 22, 2016 at 2:29 AM, Chanwoo Choi  wrote:
>> Hi Chris,
>>
>> I'm sorry for late reply. I finished the first draft to support the extcon 
>> property.
>> You can check the patches[1]. But, I need more time to test it. After tested 
>> it,
>> I'll send the patches.
>>
>> [1] 
>> https://git.kernel.org/cgit/linux/kernel/git/chanwoo/extcon.git/log/?h=extcon-test
>>
>> Chanwoo Choi (4):
>>   extcon: Add the extcon_type to group each connector into five category
>>   extcon: Add the support for extcon property according to type of connector
>>   extcon: Rename the extcon_set/get_state() to maintain the function naming 
>> pattern
>>   extcon: Add the sync APIs to support the  notification for extcon property
>>
>
> Some additional feedback:
>
> - IS_PROPERTY(), is_extcon_property_supported():
>The logic should probably be
> if (EXTCON_TYPE_##name & type) {
>and
> if (!(extcon_info[id].type & type)) {
>
>   (logical & instead of logical | )
>
> - extcon_get_property():
>   - ret is not initialized.
>   - case EXTCON_PROP_DISP_MIN ... EXTCON_PROP_DISP_MAX:
> misses a break statement

Thanks for report. I'll check it.

Thanks,
Chanwoo Choi

>
> Guenter
>
>> Chris Zhong (1):
>>   extcon: Add EXTCON_DISP_DP and the property for USB Type-C
>>
>> Regards,
>> Chanwoo Choi
>>
>> On 2016년 07월 21일 22:13, Chris Zhong wrote:
>>> Add EXTCON_DISP_DP for the Display external connector. For Type-C
>>> connector the DisplayPort can work as an Alternate Mode(VESA DisplayPort
>>> Alt Mode on USB Type-C Standard). The Type-C support both normal and
>>> flipped orientation, so add a property to extcon.
>>>
>>> Signe-off-by: Chris Zhong 
>>>
>>> Signed-off-by: Chris Zhong 
>>> ---
>>>
>>> Changes in v6:
>>> - move the EXTCON_PROP_TYPEC_POLARITY to EXTCON_TYPE_USB in _supported
>>> Series-changes: 5
>>> - support get property
>>>
>>> Changes in v5: None
>>> Changes in v4: None
>>> Changes in v3: None
>>> Changes in v2: None
>>> Changes in v1: None
>>>
>>>  drivers/extcon/extcon.c | 26 ++
>>>  include/linux/extcon.h  | 13 +
>>>  2 files changed, 39 insertions(+)
>>>
>>> diff --git a/drivers/extcon/extcon.c b/drivers/extcon/extcon.c
>>> index a1117db..f79b510 100644
>>> --- a/drivers/extcon/extcon.c
>>> +++ b/drivers/extcon/extcon.c
>>> @@ -157,6 +157,11 @@ struct __extcon_info {
>>>   .id = EXTCON_DISP_VGA,
>>>   .name = "VGA",
>>>   },
>>> + [EXTCON_DISP_DP] = {
>>> + .type = EXTCON_TYPE_DISP | EXTCON_TYPE_USB,
>>> + .id = EXTCON_DISP_DP,
>>> + .name = "DP",
>>> + },
>>>
>>>   /* Miscellaneous external connector */
>>>   [EXTCON_DOCK] = {
>>> @@ -270,6 +275,7 @@ static bool is_extcon_property_supported(unsigned int 
>>> id,
>>>   switch (prop) {
>>>   case EXTCON_PROP_USB_ID:
>>>   case EXTCON_PROP_USB_VBUS:
>>> + case EXTCON_PROP_TYPEC_POLARITY:
>>>   return true;
>>>   default:
>>>   break;
>>> @@ -547,6 +553,26 @@ int extcon_get_cable_property(struct extcon_dev *edev, 
>>> unsigned int id,
>>>   enum extcon_property prop,
>>>   union extcon_property_value *val)
>>>  {
>>> + struct extcon_cable *cable;
>>> + int index;
>>> +
>>> + if (!edev)
>>> + return -EINVAL;
>>> +
>>> + /* Check the property whether is supported or not */
>>> + if (!is_extcon_property_supported(id, prop))
>>> + return -EINVAL;
>>> +
>>> + /* Find the cable index of external connector by using id */
>>> + index = find_cable_index_by_id(edev, id);
>>> + if (index < 0)
>>> + return index;
>>> +
>>> + /* Store the property value */
>>> + cable = >cables[index];
>>> +
>>> + val->intval = cable->propval[prop].intval;
>>> +
>>>   return 0;
>>>  }
>>>
>>> diff --git a/include/linux/extcon.h b/include/linux/extcon.h
>>> index f6f0a8d..50ef87f 100644
>>> --- a/include/linux/extcon.h
>>> +++ b/include/linux/extcon.h
>>> @@ -77,6 +77,7 @@ enum extcon_type {
>>>  #define EXTCON_DISP_MHL  41  /* Mobile High-Definition 
>>> Link */
>>>  #define EXTCON_DISP_DVI  42  /* Digital Visual Interface */
>>>  #define EXTCON_DISP_VGA  43  /* Video Graphics Array */
>>> +#define EXTCON_DISP_DP   44  /* DisplayPort */
>>>
>>>  /* Miscellaneous external connector */
>>>  #define EXTCON_DOCK  60
>>> @@ -108,9 +109,13 @@ enum extcon_property {
>>>* - EXTCON_PROP_USB_USB
>>>* @type:   integer (int value)
>>>* @value:  0 (low) or 1 (high)
>>> +  * - EXTCON_PROP_TYPEC_POLARITY,
>>> +  * @type:   integer (int value)
>>> +  * @value:  0 (normal) or 1

Re: [v6 PATCH 1/6] extcon: Add Type-C and DP support

Hi,

2016-07-23 6:40 GMT+09:00 Guenter Roeck :
> On Fri, Jul 22, 2016 at 2:29 AM, Chanwoo Choi  wrote:
>> Hi Chris,
>>
>> I'm sorry for late reply. I finished the first draft to support the extcon 
>> property.
>> You can check the patches[1]. But, I need more time to test it. After tested 
>> it,
>> I'll send the patches.
>>
>> [1] 
>> https://git.kernel.org/cgit/linux/kernel/git/chanwoo/extcon.git/log/?h=extcon-test
>>
>> Chanwoo Choi (4):
>>   extcon: Add the extcon_type to group each connector into five category
>>   extcon: Add the support for extcon property according to type of connector
>>   extcon: Rename the extcon_set/get_state() to maintain the function naming 
>> pattern
>>   extcon: Add the sync APIs to support the  notification for extcon property
>>
>
> Some additional feedback:
>
> - IS_PROPERTY(), is_extcon_property_supported():
>The logic should probably be
> if (EXTCON_TYPE_##name & type) {
>and
> if (!(extcon_info[id].type & type)) {
>
>   (logical & instead of logical | )
>
> - extcon_get_property():
>   - ret is not initialized.
>   - case EXTCON_PROP_DISP_MIN ... EXTCON_PROP_DISP_MAX:
> misses a break statement

Thanks for report. I'll check it.

Thanks,
Chanwoo Choi

>
> Guenter
>
>> Chris Zhong (1):
>>   extcon: Add EXTCON_DISP_DP and the property for USB Type-C
>>
>> Regards,
>> Chanwoo Choi
>>
>> On 2016년 07월 21일 22:13, Chris Zhong wrote:
>>> Add EXTCON_DISP_DP for the Display external connector. For Type-C
>>> connector the DisplayPort can work as an Alternate Mode(VESA DisplayPort
>>> Alt Mode on USB Type-C Standard). The Type-C support both normal and
>>> flipped orientation, so add a property to extcon.
>>>
>>> Signe-off-by: Chris Zhong 
>>>
>>> Signed-off-by: Chris Zhong 
>>> ---
>>>
>>> Changes in v6:
>>> - move the EXTCON_PROP_TYPEC_POLARITY to EXTCON_TYPE_USB in _supported
>>> Series-changes: 5
>>> - support get property
>>>
>>> Changes in v5: None
>>> Changes in v4: None
>>> Changes in v3: None
>>> Changes in v2: None
>>> Changes in v1: None
>>>
>>>  drivers/extcon/extcon.c | 26 ++
>>>  include/linux/extcon.h  | 13 +
>>>  2 files changed, 39 insertions(+)
>>>
>>> diff --git a/drivers/extcon/extcon.c b/drivers/extcon/extcon.c
>>> index a1117db..f79b510 100644
>>> --- a/drivers/extcon/extcon.c
>>> +++ b/drivers/extcon/extcon.c
>>> @@ -157,6 +157,11 @@ struct __extcon_info {
>>>   .id = EXTCON_DISP_VGA,
>>>   .name = "VGA",
>>>   },
>>> + [EXTCON_DISP_DP] = {
>>> + .type = EXTCON_TYPE_DISP | EXTCON_TYPE_USB,
>>> + .id = EXTCON_DISP_DP,
>>> + .name = "DP",
>>> + },
>>>
>>>   /* Miscellaneous external connector */
>>>   [EXTCON_DOCK] = {
>>> @@ -270,6 +275,7 @@ static bool is_extcon_property_supported(unsigned int 
>>> id,
>>>   switch (prop) {
>>>   case EXTCON_PROP_USB_ID:
>>>   case EXTCON_PROP_USB_VBUS:
>>> + case EXTCON_PROP_TYPEC_POLARITY:
>>>   return true;
>>>   default:
>>>   break;
>>> @@ -547,6 +553,26 @@ int extcon_get_cable_property(struct extcon_dev *edev, 
>>> unsigned int id,
>>>   enum extcon_property prop,
>>>   union extcon_property_value *val)
>>>  {
>>> + struct extcon_cable *cable;
>>> + int index;
>>> +
>>> + if (!edev)
>>> + return -EINVAL;
>>> +
>>> + /* Check the property whether is supported or not */
>>> + if (!is_extcon_property_supported(id, prop))
>>> + return -EINVAL;
>>> +
>>> + /* Find the cable index of external connector by using id */
>>> + index = find_cable_index_by_id(edev, id);
>>> + if (index < 0)
>>> + return index;
>>> +
>>> + /* Store the property value */
>>> + cable = >cables[index];
>>> +
>>> + val->intval = cable->propval[prop].intval;
>>> +
>>>   return 0;
>>>  }
>>>
>>> diff --git a/include/linux/extcon.h b/include/linux/extcon.h
>>> index f6f0a8d..50ef87f 100644
>>> --- a/include/linux/extcon.h
>>> +++ b/include/linux/extcon.h
>>> @@ -77,6 +77,7 @@ enum extcon_type {
>>>  #define EXTCON_DISP_MHL  41  /* Mobile High-Definition 
>>> Link */
>>>  #define EXTCON_DISP_DVI  42  /* Digital Visual Interface */
>>>  #define EXTCON_DISP_VGA  43  /* Video Graphics Array */
>>> +#define EXTCON_DISP_DP   44  /* DisplayPort */
>>>
>>>  /* Miscellaneous external connector */
>>>  #define EXTCON_DOCK  60
>>> @@ -108,9 +109,13 @@ enum extcon_property {
>>>* - EXTCON_PROP_USB_USB
>>>* @type:   integer (int value)
>>>* @value:  0 (low) or 1 (high)
>>> +  * - EXTCON_PROP_TYPEC_POLARITY,
>>> +  * @type:   integer (int value)
>>> +  * @value:  0 (normal) or 1 (flip)
>>>*/
>>>   EXTCON_PROP_USB_ID = 0,
>>>   EXTCON_PROP_USB_VBUS,

Re: [v6 PATCH 1/6] extcon: Add Type-C and DP support

Hi,

2016-07-23 3:21 GMT+09:00 Guenter Roeck :
> Hi,
>
> On Fri, Jul 22, 2016 at 2:29 AM, Chanwoo Choi  wrote:
>> Hi Chris,
>>
>> I'm sorry for late reply. I finished the first draft to support the extcon 
>> property.
>> You can check the patches[1]. But, I need more time to test it. After tested 
>> it,
>> I'll send the patches.
>>
>> [1] 
>> https://git.kernel.org/cgit/linux/kernel/git/chanwoo/extcon.git/log/?h=extcon-test
>>
>> Chanwoo Choi (4):
>>   extcon: Add the extcon_type to group each connector into five category
>>   extcon: Add the support for extcon property according to type of connector
>>   extcon: Rename the extcon_set/get_state() to maintain the function naming 
>> pattern
>>   extcon: Add the sync APIs to support the  notification for extcon property
>>
>> Chris Zhong (1):
>>   extcon: Add EXTCON_DISP_DP and the property for USB Type-C
>>
>
> Couple of comments:
>
> - In extcon.c, line 560: 'state' is unused.
>
> - extcon_set_property(edev, EXTCON_USB, EXTCON_USB_VBUS, 0);
>   does not work. As written, it has to be something like:

you shoud use the 'union extcon_property_valu' as below.

>
>   union extcon_property_value prop;
>   ...
>   prop.intval = vbus_state;
>   extcon_set_property(edev, EXTCON_USB, EXTCON_USB_VBUS, prop);
>
> - For USB, the state of EXTCON_USB, EXTCON_USB_HOST, and
> EXTCON_DISP_DP tend to change at the same time, together with the
> associated properties.
>   It might be desirable to have an equivalent of extcon_sync_all(edev);

Each external connector has the separate notifier independently.
All of operation in extcon should be done by each external connector unit.

I think that extcon_sync_all() is not appropriate. If the some extcon device
support the many connectors, maybe extcon send the un-needed notifications
for all connectors.

Also, extcon_sync_all(edev) is same operation with follwoing three
function calls:
extcon_sync(edev, EXTCON_USB);
extcon_sync(edev, EXTCON_USB_HOST);
extcon_sync(edev, EXTCON_DISP_DP);

Also,
EXTCON_DISP_DP has the both EXTCON_TYPE_USB and EXTCON_TYPE_DISP.
So, the extcon client driver is able to set the vbus property with
EXTCON_DISP_DP as following:
- extcon_set_property(edev, EXTCON_DISP_DP, EXTCON_USB_VBUS, prop)

Thanks,
Chanwoo Choi

Re: [v6 PATCH 1/6] extcon: Add Type-C and DP support

Hi,

2016-07-23 3:21 GMT+09:00 Guenter Roeck :
> Hi,
>
> On Fri, Jul 22, 2016 at 2:29 AM, Chanwoo Choi  wrote:
>> Hi Chris,
>>
>> I'm sorry for late reply. I finished the first draft to support the extcon 
>> property.
>> You can check the patches[1]. But, I need more time to test it. After tested 
>> it,
>> I'll send the patches.
>>
>> [1] 
>> https://git.kernel.org/cgit/linux/kernel/git/chanwoo/extcon.git/log/?h=extcon-test
>>
>> Chanwoo Choi (4):
>>   extcon: Add the extcon_type to group each connector into five category
>>   extcon: Add the support for extcon property according to type of connector
>>   extcon: Rename the extcon_set/get_state() to maintain the function naming 
>> pattern
>>   extcon: Add the sync APIs to support the  notification for extcon property
>>
>> Chris Zhong (1):
>>   extcon: Add EXTCON_DISP_DP and the property for USB Type-C
>>
>
> Couple of comments:
>
> - In extcon.c, line 560: 'state' is unused.
>
> - extcon_set_property(edev, EXTCON_USB, EXTCON_USB_VBUS, 0);
>   does not work. As written, it has to be something like:

you shoud use the 'union extcon_property_valu' as below.

>
>   union extcon_property_value prop;
>   ...
>   prop.intval = vbus_state;
>   extcon_set_property(edev, EXTCON_USB, EXTCON_USB_VBUS, prop);
>
> - For USB, the state of EXTCON_USB, EXTCON_USB_HOST, and
> EXTCON_DISP_DP tend to change at the same time, together with the
> associated properties.
>   It might be desirable to have an equivalent of extcon_sync_all(edev);

Each external connector has the separate notifier independently.
All of operation in extcon should be done by each external connector unit.

I think that extcon_sync_all() is not appropriate. If the some extcon device
support the many connectors, maybe extcon send the un-needed notifications
for all connectors.

Also, extcon_sync_all(edev) is same operation with follwoing three
function calls:
extcon_sync(edev, EXTCON_USB);
extcon_sync(edev, EXTCON_USB_HOST);
extcon_sync(edev, EXTCON_DISP_DP);

Also,
EXTCON_DISP_DP has the both EXTCON_TYPE_USB and EXTCON_TYPE_DISP.
So, the extcon client driver is able to set the vbus property with
EXTCON_DISP_DP as following:
- extcon_set_property(edev, EXTCON_DISP_DP, EXTCON_USB_VBUS, prop)

Thanks,
Chanwoo Choi

Re: [PATCH v8 0/10] Output raw touch data via V4L2

2016-07-22 Thread Chris Healy

I'm not Nick, but I'm testing his patches. ;-)  Here's what I'm
getting from v4l2-compliance with his patches:

root@RDU2:/mnt/disk ./v4l2-compliance -a -f -d /dev/v4l-touch0
Driver Info:
Driver name   : rmi4_f54
Card type : Synaptics RMI4 Touch Sensor
Bus info  : rmi4:rmi4-00.fn54
Driver version: 4.7.0
Capabilities  : 0x9521
Video Capture
Touch Capture
Read/Write
Streaming
Extended Pix Format
Device Capabilities
Device Caps   : 0x1521
Video Capture
Touch Capture
Read/Write
Streaming
Extended Pix Format

Compliance test for device /dev/v4l-touch0 (not using libv4l2):

Required ioctls:
test VIDIOC_QUERYCAP: OK

Allow for multiple opens:
test second video open: OK
test VIDIOC_QUERYCAP: OK
test VIDIOC_G/S_PRIORITY: OK

Debug ioctls:
test VIDIOC_DBG_G/S_REGISTER: OK (Not Supported)
test VIDIOC_LOG_STATUS: OK (Not Supported)

Input ioctls:
test VIDIOC_G/S_TUNER/ENUM_FREQ_BANDS: OK (Not Supported)
test VIDIOC_G/S_FREQUENCY: OK (Not Supported)
test VIDIOC_S_HW_FREQ_SEEK: OK (Not Supported)
test VIDIOC_ENUMAUDIO: OK (Not Supported)
test VIDIOC_G/S/ENUMINPUT: OK
test VIDIOC_G/S_AUDIO: OK (Not Supported)
Inputs: 5 Audio Inputs: 0 Tuners: 0

Output ioctls:
test VIDIOC_G/S_MODULATOR: OK (Not Supported)
test VIDIOC_G/S_FREQUENCY: OK (Not Supported)
test VIDIOC_ENUMAUDOUT: OK (Not Supported)
test VIDIOC_G/S/ENUMOUTPUT: OK (Not Supported)
test VIDIOC_G/S_AUDOUT: OK (Not Supported)
Outputs: 0 Audio Outputs: 0 Modulators: 0

Input/Output configuration ioctls:
test VIDIOC_ENUM/G/S/QUERY_STD: OK (Not Supported)
test VIDIOC_ENUM/G/S/QUERY_DV_TIMINGS: OK (Not Supported)
test VIDIOC_DV_TIMINGS_CAP: OK (Not Supported)
test VIDIOC_G/S_EDID: OK (Not Supported)

Test input 0:

Control ioctls:
test VIDIOC_QUERY_EXT_CTRL/QUERYMENU: OK (Not Supported)
test VIDIOC_QUERYCTRL: OK (Not Supported)
test VIDIOC_G/S_CTRL: OK (Not Supported)
test VIDIOC_G/S/TRY_EXT_CTRLS: OK (Not Supported)
test VIDIOC_(UN)SUBSCRIBE_EVENT/DQEVENT: OK (Not Supported)
test VIDIOC_G/S_JPEGCOMP: OK (Not Supported)
Standard Controls: 0 Private Controls: 0

Format ioctls:
test VIDIOC_ENUM_FMT/FRAMESIZES/FRAMEINTERVALS: OK
test VIDIOC_G/S_PARM: OK
test VIDIOC_G_FBUF: OK (Not Supported)
test VIDIOC_G_FMT: OK
test VIDIOC_TRY_FMT: OK
test VIDIOC_S_FMT: OK
test VIDIOC_G_SLICED_VBI_CAP: OK (Not Supported)
test Cropping: OK (Not Supported)
test Composing: OK (Not Supported)
test Scaling: OK (Not Supported)

Codec ioctls:
test VIDIOC_(TRY_)ENCODER_CMD: OK (Not Supported)
test VIDIOC_G_ENC_INDEX: OK (Not Supported)
test VIDIOC_(TRY_)DECODER_CMD: OK (Not Supported)

Buffer ioctls:
test VIDIOC_REQBUFS/CREATE_BUFS/QUERYBUF: OK
test VIDIOC_EXPBUF: OK

Test input 1:

Control ioctls:
test VIDIOC_QUERY_EXT_CTRL/QUERYMENU: OK (Not Supported)
test VIDIOC_QUERYCTRL: OK (Not Supported)
test VIDIOC_G/S_CTRL: OK (Not Supported)
test VIDIOC_G/S/TRY_EXT_CTRLS: OK (Not Supported)
test VIDIOC_(UN)SUBSCRIBE_EVENT/DQEVENT: OK (Not Supported)
test VIDIOC_G/S_JPEGCOMP: OK (Not Supported)
Standard Controls: 0 Private Controls: 0

Format ioctls:
test VIDIOC_ENUM_FMT/FRAMESIZES/FRAMEINTERVALS: OK
test VIDIOC_G/S_PARM: OK
test VIDIOC_G_FBUF: OK (Not Supported)
test VIDIOC_G_FMT: OK
test VIDIOC_TRY_FMT: OK
test VIDIOC_S_FMT: OK
test VIDIOC_G_SLICED_VBI_CAP: OK (Not Supported)
test Cropping: OK (Not Supported)
test Composing: OK (Not Supported)
test Scaling: OK (Not Supported)

Codec ioctls:
test VIDIOC_(TRY_)ENCODER_CMD: OK (Not Supported)
test VIDIOC_G_ENC_INDEX: OK (Not Supported)
test VIDIOC_(TRY_)DECODER_CMD: OK (Not Supported)

Buffer ioctls:
test VIDIOC_REQBUFS/CREATE_BUFS/QUERYBUF: OK
test VIDIOC_EXPBUF: OK

Test input 2:

Control ioctls:
test VIDIOC_QUERY_EXT_CTRL/QUERYMENU: OK (Not Supported)
test VIDIOC_QUERYCTRL: OK (Not Supported)
test VIDIOC_G/S_CTRL: OK (Not Supported)
test VIDIOC_G/S/TRY_EXT_CTRLS: OK (Not Supported)
test VIDIOC_(UN)SUBSCRIBE_EVENT/DQEVENT: OK (Not Supported)
test VIDIOC_G/S_JPEGCOMP: OK (Not Supported)
Standard Controls: 0 Private Controls: 0

Format ioctls:
test VIDIOC_ENUM_FMT/FRAMESIZES/FRAMEINTERVALS: OK
test VIDIOC_G/S_PARM: OK
test VIDIOC_G_FBUF: OK (Not Supported)
test VIDIOC_G_FMT: OK
test VIDIOC_TRY_FMT: OK
test VIDIOC_S_FMT: OK
test VIDIOC_G_SLICED_VBI_CAP: OK (Not Supported)
test Cropping: OK (Not Supported)
test Composing: OK (Not Supported)
test Scaling: OK (Not Supported)

Codec ioctls:
test VIDIOC_(TRY_)ENCODER_CMD: OK (Not Supported)
test VIDIOC_G_ENC_INDEX: OK (Not Supported)
test VIDIOC_(TRY_)DECODER_CMD: OK (Not Supported)

Buffer ioctls:
test VIDIOC_REQBUFS/CREATE_BUFS/QUERYBUF: OK
test VIDIOC_EXPBUF: OK

Test input 3:

Control ioctls:
test VIDIOC_QUERY_EXT_CTRL/QUERYMENU: OK (Not Supported)
test VIDIOC_QUERYCTRL: OK (Not Supported)
test VIDIOC_G/S_CTRL: OK (Not Supported)
test VIDIOC_G/S/TRY_EXT_CTRLS: OK (Not Supported)
test VIDIOC_(UN)SUBSCRIBE_EVENT/DQEVENT: OK (Not Supported)
test VIDIOC_G/S_JPEGCOMP: OK (Not Supported)
Standard Controls: 0 Private Controls: 0

Format ioctls:
test VIDIOC_ENUM_FMT/FRAMESIZES/FRAMEINTERVALS: OK

Re: [PATCH v8 0/10] Output raw touch data via V4L2

2016-07-22 Thread Chris Healy

I'm not Nick, but I'm testing his patches. ;-)  Here's what I'm
getting from v4l2-compliance with his patches:

root@RDU2:/mnt/disk ./v4l2-compliance -a -f -d /dev/v4l-touch0
Driver Info:
Driver name   : rmi4_f54
Card type : Synaptics RMI4 Touch Sensor
Bus info  : rmi4:rmi4-00.fn54
Driver version: 4.7.0
Capabilities  : 0x9521
Video Capture
Touch Capture
Read/Write
Streaming
Extended Pix Format
Device Capabilities
Device Caps   : 0x1521
Video Capture
Touch Capture
Read/Write
Streaming
Extended Pix Format

Compliance test for device /dev/v4l-touch0 (not using libv4l2):

Required ioctls:
test VIDIOC_QUERYCAP: OK

Allow for multiple opens:
test second video open: OK
test VIDIOC_QUERYCAP: OK
test VIDIOC_G/S_PRIORITY: OK

Debug ioctls:
test VIDIOC_DBG_G/S_REGISTER: OK (Not Supported)
test VIDIOC_LOG_STATUS: OK (Not Supported)

Input ioctls:
test VIDIOC_G/S_TUNER/ENUM_FREQ_BANDS: OK (Not Supported)
test VIDIOC_G/S_FREQUENCY: OK (Not Supported)
test VIDIOC_S_HW_FREQ_SEEK: OK (Not Supported)
test VIDIOC_ENUMAUDIO: OK (Not Supported)
test VIDIOC_G/S/ENUMINPUT: OK
test VIDIOC_G/S_AUDIO: OK (Not Supported)
Inputs: 5 Audio Inputs: 0 Tuners: 0

Output ioctls:
test VIDIOC_G/S_MODULATOR: OK (Not Supported)
test VIDIOC_G/S_FREQUENCY: OK (Not Supported)
test VIDIOC_ENUMAUDOUT: OK (Not Supported)
test VIDIOC_G/S/ENUMOUTPUT: OK (Not Supported)
test VIDIOC_G/S_AUDOUT: OK (Not Supported)
Outputs: 0 Audio Outputs: 0 Modulators: 0

Input/Output configuration ioctls:
test VIDIOC_ENUM/G/S/QUERY_STD: OK (Not Supported)
test VIDIOC_ENUM/G/S/QUERY_DV_TIMINGS: OK (Not Supported)
test VIDIOC_DV_TIMINGS_CAP: OK (Not Supported)
test VIDIOC_G/S_EDID: OK (Not Supported)

Test input 0:

Control ioctls:
test VIDIOC_QUERY_EXT_CTRL/QUERYMENU: OK (Not Supported)
test VIDIOC_QUERYCTRL: OK (Not Supported)
test VIDIOC_G/S_CTRL: OK (Not Supported)
test VIDIOC_G/S/TRY_EXT_CTRLS: OK (Not Supported)
test VIDIOC_(UN)SUBSCRIBE_EVENT/DQEVENT: OK (Not Supported)
test VIDIOC_G/S_JPEGCOMP: OK (Not Supported)
Standard Controls: 0 Private Controls: 0

Format ioctls:
test VIDIOC_ENUM_FMT/FRAMESIZES/FRAMEINTERVALS: OK
test VIDIOC_G/S_PARM: OK
test VIDIOC_G_FBUF: OK (Not Supported)
test VIDIOC_G_FMT: OK
test VIDIOC_TRY_FMT: OK
test VIDIOC_S_FMT: OK
test VIDIOC_G_SLICED_VBI_CAP: OK (Not Supported)
test Cropping: OK (Not Supported)
test Composing: OK (Not Supported)
test Scaling: OK (Not Supported)

Codec ioctls:
test VIDIOC_(TRY_)ENCODER_CMD: OK (Not Supported)
test VIDIOC_G_ENC_INDEX: OK (Not Supported)
test VIDIOC_(TRY_)DECODER_CMD: OK (Not Supported)

Buffer ioctls:
test VIDIOC_REQBUFS/CREATE_BUFS/QUERYBUF: OK
test VIDIOC_EXPBUF: OK

Test input 1:

Control ioctls:
test VIDIOC_QUERY_EXT_CTRL/QUERYMENU: OK (Not Supported)
test VIDIOC_QUERYCTRL: OK (Not Supported)
test VIDIOC_G/S_CTRL: OK (Not Supported)
test VIDIOC_G/S/TRY_EXT_CTRLS: OK (Not Supported)
test VIDIOC_(UN)SUBSCRIBE_EVENT/DQEVENT: OK (Not Supported)
test VIDIOC_G/S_JPEGCOMP: OK (Not Supported)
Standard Controls: 0 Private Controls: 0

Format ioctls:
test VIDIOC_ENUM_FMT/FRAMESIZES/FRAMEINTERVALS: OK
test VIDIOC_G/S_PARM: OK
test VIDIOC_G_FBUF: OK (Not Supported)
test VIDIOC_G_FMT: OK
test VIDIOC_TRY_FMT: OK
test VIDIOC_S_FMT: OK
test VIDIOC_G_SLICED_VBI_CAP: OK (Not Supported)
test Cropping: OK (Not Supported)
test Composing: OK (Not Supported)
test Scaling: OK (Not Supported)

Codec ioctls:
test VIDIOC_(TRY_)ENCODER_CMD: OK (Not Supported)
test VIDIOC_G_ENC_INDEX: OK (Not Supported)
test VIDIOC_(TRY_)DECODER_CMD: OK (Not Supported)

Buffer ioctls:
test VIDIOC_REQBUFS/CREATE_BUFS/QUERYBUF: OK
test VIDIOC_EXPBUF: OK

Test input 2:

Control ioctls:
test VIDIOC_QUERY_EXT_CTRL/QUERYMENU: OK (Not Supported)
test VIDIOC_QUERYCTRL: OK (Not Supported)
test VIDIOC_G/S_CTRL: OK (Not Supported)
test VIDIOC_G/S/TRY_EXT_CTRLS: OK (Not Supported)
test VIDIOC_(UN)SUBSCRIBE_EVENT/DQEVENT: OK (Not Supported)
test VIDIOC_G/S_JPEGCOMP: OK (Not Supported)
Standard Controls: 0 Private Controls: 0

Format ioctls:
test VIDIOC_ENUM_FMT/FRAMESIZES/FRAMEINTERVALS: OK
test VIDIOC_G/S_PARM: OK
test VIDIOC_G_FBUF: OK (Not Supported)
test VIDIOC_G_FMT: OK
test VIDIOC_TRY_FMT: OK
test VIDIOC_S_FMT: OK
test VIDIOC_G_SLICED_VBI_CAP: OK (Not Supported)
test Cropping: OK (Not Supported)
test Composing: OK (Not Supported)
test Scaling: OK (Not Supported)

Codec ioctls:
test VIDIOC_(TRY_)ENCODER_CMD: OK (Not Supported)
test VIDIOC_G_ENC_INDEX: OK (Not Supported)
test VIDIOC_(TRY_)DECODER_CMD: OK (Not Supported)

Buffer ioctls:
test VIDIOC_REQBUFS/CREATE_BUFS/QUERYBUF: OK
test VIDIOC_EXPBUF: OK

Test input 3:

Control ioctls:
test VIDIOC_QUERY_EXT_CTRL/QUERYMENU: OK (Not Supported)
test VIDIOC_QUERYCTRL: OK (Not Supported)
test VIDIOC_G/S_CTRL: OK (Not Supported)
test VIDIOC_G/S/TRY_EXT_CTRLS: OK (Not Supported)
test VIDIOC_(UN)SUBSCRIBE_EVENT/DQEVENT: OK (Not Supported)
test VIDIOC_G/S_JPEGCOMP: OK (Not Supported)
Standard Controls: 0 Private Controls: 0

Format ioctls:
test VIDIOC_ENUM_FMT/FRAMESIZES/FRAMEINTERVALS: OK

[PATCH v18 net-next 1/1] hv_sock: introduce Hyper-V Sockets

Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
mechanism between the host and the guest. It's somewhat like TCP over
VMBus, but the transportation layer (VMBus) is much simpler than IP.

With Hyper-V Sockets, applications between the host and the guest can talk
to each other directly by the traditional BSD-style socket APIs.

Hyper-V Sockets is only available on new Windows hosts, like Windows Server
2016. More info is in this article "Make your own integration services":
https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service

The patch implements the necessary support in the guest side by introducing
a new socket address family AF_HYPERV.

Signed-off-by: Dexuan Cui 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Vitaly Kuznetsov 
Cc: Cathy Avery 
Cc: Olaf Hering 

---

You can also get the patch by (commit 84146dfb):
https://github.com/dcui/linux/commits/decui/hv_sock/net-next/20160721_v18

For the change log before v12, please see https://lkml.org/lkml/2016/5/15/31

In v12, the changes are mainly the following:

1) remove the module params as David suggested.

2) use 5 exact pages for VMBus send/recv rings, respectively.
The host side's design of the feature requires 5 exact pages for recv/send
rings respectively -- this is suboptimal considering memory consumption,
however unluckily we have to live with it, before the host comes up with
a new design in the future. :-(

3) remove the per-connection static send/recv buffers
Instead, we allocate and free the buffers dynamically only when we recv/send
data. This means: when a connection is idle, no memory is consumed as
recv/send buffers at all.

In v13:
I return ENOMEM on buffer alllocation failure

   Actually "man read/write" says "Other errors may occur, depending on the
object connected to fd". "man send/recv" indeed lists ENOMEM.
   Considering AF_HYPERV is a new socket type, ENOMEM seems OK here.
   In the long run, I think we should add a new API in the VMBus driver,
allowing data copy from VMBus ringbuffer into user mode buffer directly.
This way, we can even eliminate this temporary buffer.

In v14:
fix some coding style issues pointed out by David.

In v15:
Just some stylistic changes addressing comments from Joe Perches and
Olaf Hering -- thank you!
- add a GPL blurb.
- define a new macro PAGE_SIZE_4K and use it to replace PAGE_SIZE
- change sk_to_hvsock/hvsock_to_sk() from macros to inline functions
- remove a not-very-useful pr_err()
- fix some typos in comment and coding style issues.

In v16:
Made stylistic changes addressing comments from Vitaly Kuznetsov.
Thank you very much for the detailed comments, Vitaly!

In v17:
- PAGE_SIZE -> PAGE_SIZE_4K
- allow regular users to use the socket
Thank you Michal Kubecek for the suggestions!

In v18:
Just some tiny updates to address some spurious compiler warnings:
"xxx may be used uninitialized in this function".

Looking forward to your comments!

 MAINTAINERS |2 +
 include/linux/hyperv.h  |   13 +
 include/linux/socket.h  |4 +-
 include/net/af_hvsock.h |   78 +++
 include/uapi/linux/hyperv.h |   23 +
 net/Kconfig |1 +
 net/Makefile|1 +
 net/hv_sock/Kconfig |   10 +
 net/hv_sock/Makefile|3 +
 net/hv_sock/af_hvsock.c | 1507 +++
 10 files changed, 1641 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 50f69ba..6eaa26f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5514,7 +5514,9 @@ F:drivers/pci/host/pci-hyperv.c
 F: drivers/net/hyperv/
 F: drivers/scsi/storvsc_drv.c
 F: drivers/video/fbdev/hyperv_fb.c
+F: net/hv_sock/
 F: include/linux/hyperv.h
+F: include/net/af_hvsock.h
 F: tools/hv/
 F: Documentation/ABI/stable/sysfs-bus-vmbus
 
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 50f493e..1cda6ea5 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1508,5 +1508,18 @@ static inline void commit_rd_index(struct vmbus_channel 
*channel)
vmbus_set_event(channel);
 }
 
+struct vmpipe_proto_header {
+   u32 pkt_type;
+   u32 data_size;
+};
+
+#define HVSOCK_HEADER_LEN  (sizeof(struct vmpacket_descriptor) + \
+sizeof(struct vmpipe_proto_header))
+
+/* See 'prev_indices' in hv_ringbuffer_read(), hv_ringbuffer_write() */
+#define PREV_INDICES_LEN   (sizeof(u64))
 
+#define HVSOCK_PKT_LEN(payload_len)(HVSOCK_HEADER_LEN + \
+   ALIGN((payload_len), 8) + \
+   PREV_INDICES_LEN)
 #endif /* _HYPERV_H */
diff --git a/include/linux/socket.h b/include/linux/socket.h
index b5cc5a6..0b68b58 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -202,8

[PATCH v18 net-next 1/1] hv_sock: introduce Hyper-V Sockets

Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
mechanism between the host and the guest. It's somewhat like TCP over
VMBus, but the transportation layer (VMBus) is much simpler than IP.

With Hyper-V Sockets, applications between the host and the guest can talk
to each other directly by the traditional BSD-style socket APIs.

Hyper-V Sockets is only available on new Windows hosts, like Windows Server
2016. More info is in this article "Make your own integration services":
https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service

The patch implements the necessary support in the guest side by introducing
a new socket address family AF_HYPERV.

Signed-off-by: Dexuan Cui 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Vitaly Kuznetsov 
Cc: Cathy Avery 
Cc: Olaf Hering 

---

You can also get the patch by (commit 84146dfb):
https://github.com/dcui/linux/commits/decui/hv_sock/net-next/20160721_v18

For the change log before v12, please see https://lkml.org/lkml/2016/5/15/31

In v12, the changes are mainly the following:

1) remove the module params as David suggested.

2) use 5 exact pages for VMBus send/recv rings, respectively.
The host side's design of the feature requires 5 exact pages for recv/send
rings respectively -- this is suboptimal considering memory consumption,
however unluckily we have to live with it, before the host comes up with
a new design in the future. :-(

3) remove the per-connection static send/recv buffers
Instead, we allocate and free the buffers dynamically only when we recv/send
data. This means: when a connection is idle, no memory is consumed as
recv/send buffers at all.

In v13:
I return ENOMEM on buffer alllocation failure

   Actually "man read/write" says "Other errors may occur, depending on the
object connected to fd". "man send/recv" indeed lists ENOMEM.
   Considering AF_HYPERV is a new socket type, ENOMEM seems OK here.
   In the long run, I think we should add a new API in the VMBus driver,
allowing data copy from VMBus ringbuffer into user mode buffer directly.
This way, we can even eliminate this temporary buffer.

In v14:
fix some coding style issues pointed out by David.

In v15:
Just some stylistic changes addressing comments from Joe Perches and
Olaf Hering -- thank you!
- add a GPL blurb.
- define a new macro PAGE_SIZE_4K and use it to replace PAGE_SIZE
- change sk_to_hvsock/hvsock_to_sk() from macros to inline functions
- remove a not-very-useful pr_err()
- fix some typos in comment and coding style issues.

In v16:
Made stylistic changes addressing comments from Vitaly Kuznetsov.
Thank you very much for the detailed comments, Vitaly!

In v17:
- PAGE_SIZE -> PAGE_SIZE_4K
- allow regular users to use the socket
Thank you Michal Kubecek for the suggestions!

In v18:
Just some tiny updates to address some spurious compiler warnings:
"xxx may be used uninitialized in this function".

Looking forward to your comments!

 MAINTAINERS |2 +
 include/linux/hyperv.h  |   13 +
 include/linux/socket.h  |4 +-
 include/net/af_hvsock.h |   78 +++
 include/uapi/linux/hyperv.h |   23 +
 net/Kconfig |1 +
 net/Makefile|1 +
 net/hv_sock/Kconfig |   10 +
 net/hv_sock/Makefile|3 +
 net/hv_sock/af_hvsock.c | 1507 +++
 10 files changed, 1641 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 50f69ba..6eaa26f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5514,7 +5514,9 @@ F:drivers/pci/host/pci-hyperv.c
 F: drivers/net/hyperv/
 F: drivers/scsi/storvsc_drv.c
 F: drivers/video/fbdev/hyperv_fb.c
+F: net/hv_sock/
 F: include/linux/hyperv.h
+F: include/net/af_hvsock.h
 F: tools/hv/
 F: Documentation/ABI/stable/sysfs-bus-vmbus
 
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 50f493e..1cda6ea5 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1508,5 +1508,18 @@ static inline void commit_rd_index(struct vmbus_channel 
*channel)
vmbus_set_event(channel);
 }
 
+struct vmpipe_proto_header {
+   u32 pkt_type;
+   u32 data_size;
+};
+
+#define HVSOCK_HEADER_LEN  (sizeof(struct vmpacket_descriptor) + \
+sizeof(struct vmpipe_proto_header))
+
+/* See 'prev_indices' in hv_ringbuffer_read(), hv_ringbuffer_write() */
+#define PREV_INDICES_LEN   (sizeof(u64))
 
+#define HVSOCK_PKT_LEN(payload_len)(HVSOCK_HEADER_LEN + \
+   ALIGN((payload_len), 8) + \
+   PREV_INDICES_LEN)
 #endif /* _HYPERV_H */
diff --git a/include/linux/socket.h b/include/linux/socket.h
index b5cc5a6..0b68b58 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -202,8 +202,9 @@ struct ucred {
 #define AF_VSOCK   40  /* vSockets */
 #define AF_KCM 41

Re: [PATCH v3] virtio_blk: Fix a slient kernel panic

2016-07-22 Thread Minfei Huang


Ping, Any comment is appreciate.

Thanks
Minfei

> On Jul 19, 2016, at 20:22, Cornelia Huck  wrote:
> 
> On Tue, 19 Jul 2016 12:32:42 +0800
> Minfei Huang  wrote:
> 
>> From: Minfei Huang 
>> 
>> We do a lot of memory allocation in function init_vq, and don't handle
>> the allocation failure properly. Then this function will return 0,
>> although initialization fails due to lacking memory. At that moment,
>> kernel will panic in guest machine, if virtio is used to drive disk.
>> 
>> To fix this bug, we should take care of allocation failure, and return
>> correct value to let caller know what happen.
>> 
>> Tested-by: Chao Fan 
>> Signed-off-by: Minfei Huang 
>> Signed-off-by: Minfei Huang 
>> ---
>> v2:
>> - Remove useless initialisation to NULL
>> v1:
>> - Refactor the patch to make code more readable
>> ---
>> drivers/block/virtio_blk.c | 26 --
>> 1 file changed, 8 insertions(+), 18 deletions(-)
> 
> Your changes certainly make the function more compact.
> 
> Reviewed-by: Cornelia Huck 



smime.p7s
Description: S/MIME cryptographic signature

Re: [PATCH v3] virtio_blk: Fix a slient kernel panic

2016-07-22 Thread Minfei Huang


Ping, Any comment is appreciate.

Thanks
Minfei

> On Jul 19, 2016, at 20:22, Cornelia Huck  wrote:
> 
> On Tue, 19 Jul 2016 12:32:42 +0800
> Minfei Huang  wrote:
> 
>> From: Minfei Huang 
>> 
>> We do a lot of memory allocation in function init_vq, and don't handle
>> the allocation failure properly. Then this function will return 0,
>> although initialization fails due to lacking memory. At that moment,
>> kernel will panic in guest machine, if virtio is used to drive disk.
>> 
>> To fix this bug, we should take care of allocation failure, and return
>> correct value to let caller know what happen.
>> 
>> Tested-by: Chao Fan 
>> Signed-off-by: Minfei Huang 
>> Signed-off-by: Minfei Huang 
>> ---
>> v2:
>> - Remove useless initialisation to NULL
>> v1:
>> - Refactor the patch to make code more readable
>> ---
>> drivers/block/virtio_blk.c | 26 --
>> 1 file changed, 8 insertions(+), 18 deletions(-)
> 
> Your changes certainly make the function more compact.
> 
> Reviewed-by: Cornelia Huck 



smime.p7s
Description: S/MIME cryptographic signature

Re: [PATCH] add u64 number parser

2016-07-22 Thread Bart Van Assche


On 07/22/16 17:23, James Smart wrote:

+   buf = kmalloc(len + 1, GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+   memcpy(buf, s->from, len);
+   buf[len] = '\0';


Hello James,

Have you considered to combine the above kmalloc() and memcpy() calls 
into a single kasprintf(GFP_KERNEL, "%.*s", len, s->from) call?


Bart.

Re: [PATCH] add u64 number parser

2016-07-22 Thread Bart Van Assche


On 07/22/16 17:23, James Smart wrote:

+   buf = kmalloc(len + 1, GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+   memcpy(buf, s->from, len);
+   buf[len] = '\0';


Hello James,

Have you considered to combine the above kmalloc() and memcpy() calls 
into a single kasprintf(GFP_KERNEL, "%.*s", len, s->from) call?


Bart.

[PATCH v18 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)

Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
mechanism between the host and the guest. It's somewhat like TCP over
VMBus, but the transportation layer (VMBus) is much simpler than IP.

With Hyper-V Sockets, applications between the host and the guest can talk
to each other directly by the traditional BSD-style socket APIs.

Hyper-V Sockets is only available on new Windows hosts, like Windows Server
2016. More info is in this article "Make your own integration services":
https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service

The patch implements the necessary support in the guest side by
introducing a new socket address family AF_HYPERV.

You can also get the patch by (commit 84146dfb):
https://github.com/dcui/linux/commits/decui/hv_sock/net-next/20160721_v18

Note: the VMBus driver side's supporting patches have been in the mainline
tree.

I know the kernel has already had a VM Sockets driver (AF_VSOCK) based
on VMware VMCI (net/vmw_vsock/, drivers/misc/vmw_vmci), and KVM is
proposing AF_VSOCK of virtio version:
http://marc.info/?l=linux-netdev=145952064004765=2

However, though Hyper-V Sockets may seem conceptually similar to
AF_VOSCK, there are differences in the transportation layer, and IMO these
make the direct code reusing impractical:

1. In AF_VSOCK, the endpoint type is: , but in
AF_HYPERV, the endpoint type is: . Here GUID
is 128-bit.

2. AF_VSOCK supports SOCK_DGRAM, while AF_HYPERV doesn't.

3. AF_VSOCK supports some special sock opts, like SO_VM_SOCKETS_BUFFER_SIZE,
SO_VM_SOCKETS_BUFFER_MIN/MAX_SIZE and SO_VM_SOCKETS_CONNECT_TIMEOUT.
These are meaningless to AF_HYPERV.

4. Some AF_VSOCK's VMCI transportation ops are meanless to AF_HYPERV/VMBus,
like .notify_recv_init
.notify_recv_pre_block
.notify_recv_pre_dequeue
.notify_recv_post_dequeue
.notify_send_init
.notify_send_pre_block
.notify_send_pre_enqueue
.notify_send_post_enqueue
etc.

So I think we'd better introduce a new address family: AF_HYPERV.

Please review the patch.

Looking forward to your comments, especially comments from David. :-)

Changes since v1:
- updated "[PATCH 6/7] hvsock: introduce Hyper-V VM Sockets feature"
- added __init and __exit for the module init/exit functions
- net/hv_sock/Kconfig: "default m" -> "default m if HYPERV"
- MODULE_LICENSE: "Dual MIT/GPL" -> "Dual BSD/GPL"

Changes since v2:
- fixed various coding issue pointed out by David Miller
- fixed indentation issues
- removed pr_debug in net/hv_sock/af_hvsock.c
- used reverse-Chrismas-tree style for local variables.
- EXPORT_SYMBOL -> EXPORT_SYMBOL_GPL

Changes since v3:
- fixed a few coding issue pointed by Vitaly Kuznetsov and Dan Carpenter
- fixed the ret value in vmbus_recvpacket_hvsock on error
- fixed the style of multi-line comment: vmbus_get_hvsock_rw_status()

Changes since v4 (https://lkml.org/lkml/2015/7/28/404):
- addressed all the comments about V4.
- treat the hvsock offers/channels as special VMBus devices
- add a mechanism to pass hvsock events to the hvsock driver
- fixed some corner cases with proper locking when a connection is closed
- rebased to the latest Greg's tree

Changes since v5 (https://lkml.org/lkml/2015/12/24/103):
- addressed the coding style issues (Vitaly Kuznetsov & David Miller, thanks!)
- used a better coding for the per-channel rescind callback (Thank Vitaly!)
- avoided the introduction of new VMBUS driver APIs vmbus_sendpacket_hvsock()
and vmbus_recvpacket_hvsock() and used vmbus_sendpacket()/vmbus_recvpacket()
in the higher level (i.e., the vmsock driver). Thank Vitaly!

Changes since v6 (http://lkml.iu.edu/hypermail/linux/kernel/1601.3/01813.html)
- only a few minor changes of coding style and comments

Changes since v7
- a few minor changes of coding style: thanks, Joe Perches!
- added some lines of comments about GUID/UUID before the struct sockaddr_hv.

Changes since v8
- removed the unnecessary __packed for some definitions: thanks, David!
- hvsock_open_connection: use offer.u.pipe.user_def[0] to know the connection
and reorganized the function
direction
- reorganized the code according to suggestions from Cathy Avery: split big
functions into small ones, set .setsockopt and getsockopt to
sock_no_setsockopt/sock_no_getsockopt
- inline'd some small list helper functions

Changes since v9
- minimized struct hvsock_sock by making the send/recv buffers pointers.
the buffers are allocated by kmalloc() in __hvsock_create() now.
- minimized the sizes of the send/recv buffers and the vmbus ringbuffers.

Changes since v10

1) add module params: send_ring_page, recv_ring_page. They can be used to
enlarge the ringbuffer size to get better performance, e.g.,
# modprobe hv_sock recv_ring_page=16 send_ring_page=16
By default, recv_ring_page is 3 and send_ring_page is 2.

2) add module param max_socket_number (the default is 1024).
A user can enlarge the number to create more than 1024 hv_sock sockets.
By default, 1024 sockets take about 1024 * (3+2+1+1) * 4KB

[PATCH v18 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)

With Hyper-V Sockets, applications between the host and the guest can talk
to each other directly by the traditional BSD-style socket APIs.

The patch implements the necessary support in the guest side by
introducing a new socket address family AF_HYPERV.

You can also get the patch by (commit 84146dfb):
https://github.com/dcui/linux/commits/decui/hv_sock/net-next/20160721_v18

Note: the VMBus driver side's supporting patches have been in the mainline
tree.

However, though Hyper-V Sockets may seem conceptually similar to
AF_VOSCK, there are differences in the transportation layer, and IMO these
make the direct code reusing impractical:

1. In AF_VSOCK, the endpoint type is: , but in
AF_HYPERV, the endpoint type is: . Here GUID
is 128-bit.

2. AF_VSOCK supports SOCK_DGRAM, while AF_HYPERV doesn't.

3. AF_VSOCK supports some special sock opts, like SO_VM_SOCKETS_BUFFER_SIZE,
SO_VM_SOCKETS_BUFFER_MIN/MAX_SIZE and SO_VM_SOCKETS_CONNECT_TIMEOUT.
These are meaningless to AF_HYPERV.

So I think we'd better introduce a new address family: AF_HYPERV.

Please review the patch.

Looking forward to your comments, especially comments from David. :-)

Changes since v6 (http://lkml.iu.edu/hypermail/linux/kernel/1601.3/01813.html)
- only a few minor changes of coding style and comments

Changes since v7
- a few minor changes of coding style: thanks, Joe Perches!
- added some lines of comments about GUID/UUID before the struct sockaddr_hv.

Changes since v10

2) add module param max_socket_number (the default is 1024).
A user can enlarge the number to create more than 1024 hv_sock sockets.
By default, 1024 sockets take about 1024 * (3+2+1+1) * 4KB

Re: [PATCH 1/3] Add a new field to struct shrinker

On 07/20/2016 07:54 AM, Michal Hocko wrote:

>> Michal, just to make sure I understand you correctly, do you mean that we
>> could infer the names of the shrinkers by looking at the names of their 
>> callbacks?
> 
> Yes, %ps can then be used for the name of the shrinker structure
> (assuming it is available).

This is fine for emitting via the ftrace /sys interface,  but in order to have 
the data [name] get 
marshalled thru to perf (for example) you need to add it to the TP_fast_assign 
entry.

tony

Re: [PATCH 1/3] Add a new field to struct shrinker

On 07/20/2016 07:54 AM, Michal Hocko wrote:

>> Michal, just to make sure I understand you correctly, do you mean that we
>> could infer the names of the shrinkers by looking at the names of their 
>> callbacks?
> 
> Yes, %ps can then be used for the name of the shrinker structure
> (assuming it is available).

This is fine for emitting via the ftrace /sys interface,  but in order to have 
the data [name] get 
marshalled thru to perf (for example) you need to add it to the TP_fast_assign 
entry.

tony

Re: [PATCH 2/5] mm: add per-zone lru list stat

2016-07-22 Thread Minchan Kim

Hi Fengguang,

On Sat, Jul 23, 2016 at 08:45:15AM +0800, Fengguang Wu wrote:
> Hi Minchan,
> 
> We find duplicate /proc/vmstat lines showing up in linux-next, which
> look related to this patch.
> 
> >>--- a/mm/vmstat.c
> >>+++ b/mm/vmstat.c
> >>@@ -921,6 +921,11 @@ int fragmentation_index(struct zone *zone, unsigned 
> >>int order)
> >> const char * const vmstat_text[] = {
> >>/* enum zone_stat_item countes */
> >>"nr_free_pages",
> >>+   "nr_inactive_anon",
> >>+   "nr_active_anon",
> >>+   "nr_inactive_file",
> >>+   "nr_active_file",
> >>+   "nr_unevictable",
> >>"nr_mlock",
> >>"nr_slab_reclaimable",
> >>"nr_slab_unreclaimable",
> 
> In the below vmstat output, "nr_inactive_anon 2217" is shown twice.
> So do the other entries added by the above chunk.
> 
> nr_free_pages 831238
> nr_inactive_anon 2217
> nr_active_anon 4386
> nr_inactive_file 117467
> nr_active_file 4602
> nr_unevictable 0
> nr_zone_write_pending 0
> nr_mlock 0
> nr_slab_reclaimable 8323
> nr_slab_unreclaimable 4641
> nr_page_table_pages 870
> nr_kernel_stack 3776
> nr_bounce 0
> nr_zspages 0
> numa_hit 201105
> numa_miss 0
> numa_foreign 0
> numa_interleave 66970
> numa_local 201105
> numa_other 0
> nr_free_cma 0
> nr_inactive_anon 2217
> nr_active_anon 4368
> nr_inactive_file 117449
> nr_active_file 4620
> nr_unevictable 0
> nr_isolated_anon 0
> nr_isolated_file 0
> nr_pages_scanned 0
> workingset_refault 0
> workingset_activate 0
> workingset_nodereclaim 0
> nr_anon_pages 4321
> nr_mapped 3469
> nr_file_pages 124348
> nr_dirty 0
> nr_writeback 0
> nr_writeback_temp 0
> nr_shmem 2279
> nr_shmem_hugepages 0
> nr_shmem_pmdmapped 0

Thanks for catching that.
We need a decision to maintain LRU stat both per-zone and per-node.

Mel, do you want to keep the LRU stat in per-node in addition?

Re: [PATCH 2/5] mm: add per-zone lru list stat

2016-07-22 Thread Minchan Kim

Hi Fengguang,

On Sat, Jul 23, 2016 at 08:45:15AM +0800, Fengguang Wu wrote:
> Hi Minchan,
> 
> We find duplicate /proc/vmstat lines showing up in linux-next, which
> look related to this patch.
> 
> >>--- a/mm/vmstat.c
> >>+++ b/mm/vmstat.c
> >>@@ -921,6 +921,11 @@ int fragmentation_index(struct zone *zone, unsigned 
> >>int order)
> >> const char * const vmstat_text[] = {
> >>/* enum zone_stat_item countes */
> >>"nr_free_pages",
> >>+   "nr_inactive_anon",
> >>+   "nr_active_anon",
> >>+   "nr_inactive_file",
> >>+   "nr_active_file",
> >>+   "nr_unevictable",
> >>"nr_mlock",
> >>"nr_slab_reclaimable",
> >>"nr_slab_unreclaimable",
> 
> In the below vmstat output, "nr_inactive_anon 2217" is shown twice.
> So do the other entries added by the above chunk.
> 
> nr_free_pages 831238
> nr_inactive_anon 2217
> nr_active_anon 4386
> nr_inactive_file 117467
> nr_active_file 4602
> nr_unevictable 0
> nr_zone_write_pending 0
> nr_mlock 0
> nr_slab_reclaimable 8323
> nr_slab_unreclaimable 4641
> nr_page_table_pages 870
> nr_kernel_stack 3776
> nr_bounce 0
> nr_zspages 0
> numa_hit 201105
> numa_miss 0
> numa_foreign 0
> numa_interleave 66970
> numa_local 201105
> numa_other 0
> nr_free_cma 0
> nr_inactive_anon 2217
> nr_active_anon 4368
> nr_inactive_file 117449
> nr_active_file 4620
> nr_unevictable 0
> nr_isolated_anon 0
> nr_isolated_file 0
> nr_pages_scanned 0
> workingset_refault 0
> workingset_activate 0
> workingset_nodereclaim 0
> nr_anon_pages 4321
> nr_mapped 3469
> nr_file_pages 124348
> nr_dirty 0
> nr_writeback 0
> nr_writeback_temp 0
> nr_shmem 2279
> nr_shmem_hugepages 0
> nr_shmem_pmdmapped 0

Thanks for catching that.
We need a decision to maintain LRU stat both per-zone and per-node.

Mel, do you want to keep the LRU stat in per-node in addition?

ADS1118: hwmon or iio ? [was: Re: [RCF 1/3] hwmon: Add ads1118 driver]

2016-07-22 Thread Guenter Roeck


On 07/22/2016 07:39 AM, Joshua Clayton wrote:

Greetings Guenter,

Thank you for reviewing my submission.

On 07/15/2016 06:40 PM, Guenter Roeck wrote:

On 07/15/2016 05:18 PM, Joshua Clayton wrote:

Add new driver for Texas Instruments ADS1118 and and ADS1018.
This driver works with ADS1018, because of code borrowed
from asd1015, which is similar, but I can only test ADS1118



Browsing through the datasheet, I think this should probably be implemented
as iio driver (and iio already has a driver for ads1015).

Jonathan, what do you think ?

Thanks,
Guenter


No response from Jonathan as yet, but I am willing to rework the driver when I
have some time. It might be weeks before I can start, though.



Changing the subject - maybe it helps to get Jonathan's attention.

Guenter

ADS1118: hwmon or iio ? [was: Re: [RCF 1/3] hwmon: Add ads1118 driver]

2016-07-22 Thread Guenter Roeck


On 07/22/2016 07:39 AM, Joshua Clayton wrote:

Greetings Guenter,

Thank you for reviewing my submission.

On 07/15/2016 06:40 PM, Guenter Roeck wrote:

On 07/15/2016 05:18 PM, Joshua Clayton wrote:

Add new driver for Texas Instruments ADS1118 and and ADS1018.
This driver works with ADS1018, because of code borrowed
from asd1015, which is similar, but I can only test ADS1118



Browsing through the datasheet, I think this should probably be implemented
as iio driver (and iio already has a driver for ads1015).

Jonathan, what do you think ?

Thanks,
Guenter


No response from Jonathan as yet, but I am willing to rework the driver when I
have some time. It might be weeks before I can start, though.



Changing the subject - maybe it helps to get Jonathan's attention.

Guenter

Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code

On Fri, Jul 22, 2016 at 10:00 PM, Matt Fleming  wrote:
>
> I suppose we could rewrite the page table mapping for those precious
> <1MB regions to coerce the firmware into accessing different pages
> instead of the 1:1 addresses and copy the regions elsewhere. Maybe.
> That assumes we don't hit other firmware bugs though.

.. it also assumes that firmware honors our page tables. Which at a
minimum won't be true for things like SMM, which happens with paging
entirely disabled.

And that's exactly the kind of code that touches some of the memory in
the low 1M region - things like the legacy keyboard state bits in the
low 4kB etc, for crazy old DOS days, iirc.

Linus

Re: [PATCH] x86/boot: Reorganize and clean up the BIOS area reservation code

On Fri, Jul 22, 2016 at 10:00 PM, Matt Fleming  wrote:
>
> I suppose we could rewrite the page table mapping for those precious
> <1MB regions to coerce the firmware into accessing different pages
> instead of the 1:1 addresses and copy the regions elsewhere. Maybe.
> That assumes we don't hit other firmware bugs though.

.. it also assumes that firmware honors our page tables. Which at a
minimum won't be true for things like SMM, which happens with paging
entirely disabled.

And that's exactly the kind of code that touches some of the memory in
the low 1M region - things like the legacy keyboard state bits in the
low 4kB etc, for crazy old DOS days, iirc.

Linus

Re: linux-next: build failure after merge of the nvdimm tree

On Thu, Jul 21, 2016 at 11:13 PM, Stephen Rothwell  
wrote:
> Hi Dan,
>
> After merging the nvdimm tree, today's linux-next build (powerpc
> ppc64_defconfig) failed like this:
>
> In file included from drivers/md/dm.h:14:0,
>  from drivers/md/dm-uevent.c:27:
> include/linux/device-mapper.h:134:22: error: expected ';', ',' or ')' before 
> '*' token
>   void __pmem **kaddr, pfn_t *pfn, long size);
>   ^
> include/linux/device-mapper.h:182:2: error: unknown type name 
> 'dm_direct_access_fn'
>   dm_direct_access_fn direct_access;
>   ^
>
> Caused by commit
>
>   7a9eb2066631 ("pmem: kill __pmem address space")
>
> interacting with commit
>
>   545ed20e6df6 ("dm: add infrastructure for DAX support")
>
> from the device-mapper tree.
>
> I applied the following merge fix patch for today.  Someone needs to
> tell Linus about this when he merges the trees.

There's no real rush to remove "__pmem" I'll pull this out until after
DM DAX support has merged.

Thanks Stephen!

Re: linux-next: build failure after merge of the nvdimm tree

On Thu, Jul 21, 2016 at 11:13 PM, Stephen Rothwell  
wrote:
> Hi Dan,
>
> After merging the nvdimm tree, today's linux-next build (powerpc
> ppc64_defconfig) failed like this:
>
> In file included from drivers/md/dm.h:14:0,
>  from drivers/md/dm-uevent.c:27:
> include/linux/device-mapper.h:134:22: error: expected ';', ',' or ')' before 
> '*' token
>   void __pmem **kaddr, pfn_t *pfn, long size);
>   ^
> include/linux/device-mapper.h:182:2: error: unknown type name 
> 'dm_direct_access_fn'
>   dm_direct_access_fn direct_access;
>   ^
>
> Caused by commit
>
>   7a9eb2066631 ("pmem: kill __pmem address space")
>
> interacting with commit
>
>   545ed20e6df6 ("dm: add infrastructure for DAX support")
>
> from the device-mapper tree.
>
> I applied the following merge fix patch for today.  Someone needs to
> tell Linus about this when he merges the trees.

There's no real rush to remove "__pmem" I'll pull this out until after
DM DAX support has merged.

Thanks Stephen!

Re: linux-next: Tree for Jul 22

2016-07-22 Thread Paul Gortmaker

On Fri, Jul 22, 2016 at 3:28 PM, Paul Gortmaker
<paul.gortma...@windriver.com> wrote:

[...]

>
> For x86-64 allmod, I am seeing this issue for the 1st time today
> in my local builds, even though it doesn't appear in sfr's builds:
>
> FATAL: drivers/input/evbug: sizeof(struct input_device_id)=312 is not
> a modulo of the size of section
> __mod_input___device_table=384.
> Fix definition of struct input_device_id in mod_devicetable.h
> make[2]: *** [__modpost] Error 1
> make[1]: *** [modules] Error 2
> make: *** [sub-make] Error 2
>
> paul@builder:~/git/linux-head$ git describe
> next-20160722
>
> Nothing has touched drivers/input/evbug.c for years...

Circling back to look at this more, it seems to be a result of the
makefiles not clobbering something properly when using the
same build directory from yesterday's next build today.

Once I executed a "make clean" I could not reproduce it.

P.
--

Re: linux-next: Tree for Jul 22

2016-07-22 Thread Paul Gortmaker

On Fri, Jul 22, 2016 at 3:28 PM, Paul Gortmaker
 wrote:

[...]

>
> For x86-64 allmod, I am seeing this issue for the 1st time today
> in my local builds, even though it doesn't appear in sfr's builds:
>
> FATAL: drivers/input/evbug: sizeof(struct input_device_id)=312 is not
> a modulo of the size of section
> __mod_input___device_table=384.
> Fix definition of struct input_device_id in mod_devicetable.h
> make[2]: *** [__modpost] Error 1
> make[1]: *** [modules] Error 2
> make: *** [sub-make] Error 2
>
> paul@builder:~/git/linux-head$ git describe
> next-20160722
>
> Nothing has touched drivers/input/evbug.c for years...

Circling back to look at this more, it seems to be a result of the
makefiles not clobbering something properly when using the
same build directory from yesterday's next build today.

Once I executed a "make clean" I could not reproduce it.

P.
--

Re: [PATCH v2 16/17] x86/insn: remove pcommit

On Fri, Jul 22, 2016 at 9:52 AM, Ingo Molnar  wrote:
>
> * Dan Williams  wrote:
>
>> On Tue, Jul 12, 2016 at 3:12 PM, Dan Williams  
>> wrote:
>> > On Tue, Jul 12, 2016 at 7:57 AM, Peter Zijlstra  
>> > wrote:
>> >> On Sat, Jul 09, 2016 at 08:25:54PM -0700, Dan Williams wrote:
>> >>> The pcommit instruction is being deprecated in favor of either ADR
>> >>> (asynchronous DRAM refresh: flush-on-power-fail) at the platform level, 
>> >>> or
>> >>> posted-write-queue flush addresses as defined by the ACPI 6.x NFIT 
>> >>> (NVDIMM
>> >>> Firmware Interface Table).
>> >>
>> >>>  arch/x86/include/asm/cpufeatures.h |1
>> >>>  arch/x86/include/asm/special_insns.h   |   46 
>> >>> 
>> >>>  arch/x86/lib/x86-opcode-map.txt|2 -
>> >>>  tools/objtool/arch/x86/insn/x86-opcode-map.txt |2 -
>> >>>  tools/perf/arch/x86/tests/insn-x86-dat-32.c|2 -
>> >>>  tools/perf/arch/x86/tests/insn-x86-dat-64.c|2 -
>> >>>  tools/perf/arch/x86/tests/insn-x86-dat-src.c   |4 --
>> >>
>> >> Just deprecated, or is it completely eradicated, removed from history,
>> >> will never ever happen and we'll reissue the opcode for something else?
>> >>
>> >> Because if its only deprecated then removing it from the instruction
>> >> decoders seems wrong, old binaries might still contain the opcode.
>> >
>> > Eradicated.
>> >
>> > "The new instructions like CLWB and CLFLUSHOPT will be rolled into the
>> > SDM but PCOMMIT will be removed from the Extensions doc and not rolled
>> > into the SDM." [1]
>> >
>> > Existing binaries are already gating their usage on the presence of
>> > the cpu id flag, that flag and the instruction opcode are reserved
>> > going forward.
>> >
>> > [1]: https://lists.01.org/pipermail/linux-nvdimm/2016-June/005923.html
>>
>> x86 maintainers, I have the other patches in this series queued in -next. 
>> Please
>> ack this one and I'll add it for v4.8-rc1, or otherwise let me know how you 
>> want
>> to handle this patch.
>
> Since it's just a removal AFAICS that the rest of your series should not 
> depend
> on, can you submit it to the x86 tree?

This patch depends on the previous patches in the series removing
calls to pcommit_sfence().

Re: [PATCH v2 16/17] x86/insn: remove pcommit

On Fri, Jul 22, 2016 at 9:52 AM, Ingo Molnar  wrote:
>
> * Dan Williams  wrote:
>
>> On Tue, Jul 12, 2016 at 3:12 PM, Dan Williams  
>> wrote:
>> > On Tue, Jul 12, 2016 at 7:57 AM, Peter Zijlstra  
>> > wrote:
>> >> On Sat, Jul 09, 2016 at 08:25:54PM -0700, Dan Williams wrote:
>> >>> The pcommit instruction is being deprecated in favor of either ADR
>> >>> (asynchronous DRAM refresh: flush-on-power-fail) at the platform level, 
>> >>> or
>> >>> posted-write-queue flush addresses as defined by the ACPI 6.x NFIT 
>> >>> (NVDIMM
>> >>> Firmware Interface Table).
>> >>
>> >>>  arch/x86/include/asm/cpufeatures.h |1
>> >>>  arch/x86/include/asm/special_insns.h   |   46 
>> >>> 
>> >>>  arch/x86/lib/x86-opcode-map.txt|2 -
>> >>>  tools/objtool/arch/x86/insn/x86-opcode-map.txt |2 -
>> >>>  tools/perf/arch/x86/tests/insn-x86-dat-32.c|2 -
>> >>>  tools/perf/arch/x86/tests/insn-x86-dat-64.c|2 -
>> >>>  tools/perf/arch/x86/tests/insn-x86-dat-src.c   |4 --
>> >>
>> >> Just deprecated, or is it completely eradicated, removed from history,
>> >> will never ever happen and we'll reissue the opcode for something else?
>> >>
>> >> Because if its only deprecated then removing it from the instruction
>> >> decoders seems wrong, old binaries might still contain the opcode.
>> >
>> > Eradicated.
>> >
>> > "The new instructions like CLWB and CLFLUSHOPT will be rolled into the
>> > SDM but PCOMMIT will be removed from the Extensions doc and not rolled
>> > into the SDM." [1]
>> >
>> > Existing binaries are already gating their usage on the presence of
>> > the cpu id flag, that flag and the instruction opcode are reserved
>> > going forward.
>> >
>> > [1]: https://lists.01.org/pipermail/linux-nvdimm/2016-June/005923.html
>>
>> x86 maintainers, I have the other patches in this series queued in -next. 
>> Please
>> ack this one and I'll add it for v4.8-rc1, or otherwise let me know how you 
>> want
>> to handle this patch.
>
> Since it's just a removal AFAICS that the rest of your series should not 
> depend
> on, can you submit it to the x86 tree?

This patch depends on the previous patches in the series removing
calls to pcommit_sfence().

Re: [PATCH 2/5] mm: add per-zone lru list stat

2016-07-22 Thread Fengguang Wu


Hi Minchan,

We find duplicate /proc/vmstat lines showing up in linux-next, which
look related to this patch.


--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -921,6 +921,11 @@ int fragmentation_index(struct zone *zone, unsigned int 
order)
 const char * const vmstat_text[] = {
/* enum zone_stat_item countes */
"nr_free_pages",
+   "nr_inactive_anon",
+   "nr_active_anon",
+   "nr_inactive_file",
+   "nr_active_file",
+   "nr_unevictable",
"nr_mlock",
"nr_slab_reclaimable",
"nr_slab_unreclaimable",


In the below vmstat output, "nr_inactive_anon 2217" is shown twice.
So do the other entries added by the above chunk.

nr_free_pages 831238
nr_inactive_anon 2217
nr_active_anon 4386
nr_inactive_file 117467
nr_active_file 4602
nr_unevictable 0
nr_zone_write_pending 0
nr_mlock 0
nr_slab_reclaimable 8323
nr_slab_unreclaimable 4641
nr_page_table_pages 870
nr_kernel_stack 3776
nr_bounce 0
nr_zspages 0
numa_hit 201105
numa_miss 0
numa_foreign 0
numa_interleave 66970
numa_local 201105
numa_other 0
nr_free_cma 0
nr_inactive_anon 2217
nr_active_anon 4368
nr_inactive_file 117449
nr_active_file 4620
nr_unevictable 0
nr_isolated_anon 0
nr_isolated_file 0
nr_pages_scanned 0
workingset_refault 0
workingset_activate 0
workingset_nodereclaim 0
nr_anon_pages 4321
nr_mapped 3469
nr_file_pages 124348
nr_dirty 0
nr_writeback 0
nr_writeback_temp 0
nr_shmem 2279
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
...

Thanks,
Fengguang

Re: [PATCH 2/5] mm: add per-zone lru list stat

2016-07-22 Thread Fengguang Wu


Hi Minchan,

We find duplicate /proc/vmstat lines showing up in linux-next, which
look related to this patch.


--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -921,6 +921,11 @@ int fragmentation_index(struct zone *zone, unsigned int 
order)
 const char * const vmstat_text[] = {
/* enum zone_stat_item countes */
"nr_free_pages",
+   "nr_inactive_anon",
+   "nr_active_anon",
+   "nr_inactive_file",
+   "nr_active_file",
+   "nr_unevictable",
"nr_mlock",
"nr_slab_reclaimable",
"nr_slab_unreclaimable",


In the below vmstat output, "nr_inactive_anon 2217" is shown twice.
So do the other entries added by the above chunk.

nr_free_pages 831238
nr_inactive_anon 2217
nr_active_anon 4386
nr_inactive_file 117467
nr_active_file 4602
nr_unevictable 0
nr_zone_write_pending 0
nr_mlock 0
nr_slab_reclaimable 8323
nr_slab_unreclaimable 4641
nr_page_table_pages 870
nr_kernel_stack 3776
nr_bounce 0
nr_zspages 0
numa_hit 201105
numa_miss 0
numa_foreign 0
numa_interleave 66970
numa_local 201105
numa_other 0
nr_free_cma 0
nr_inactive_anon 2217
nr_active_anon 4368
nr_inactive_file 117449
nr_active_file 4620
nr_unevictable 0
nr_isolated_anon 0
nr_isolated_file 0
nr_pages_scanned 0
workingset_refault 0
workingset_activate 0
workingset_nodereclaim 0
nr_anon_pages 4321
nr_mapped 3469
nr_file_pages 124348
nr_dirty 0
nr_writeback 0
nr_writeback_temp 0
nr_shmem 2279
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
...

Thanks,
Fengguang

Re: [PATCH v4 00/12] mm: Hardened usercopy

2016-07-22 Thread Laura Abbott


On 07/20/2016 01:26 PM, Kees Cook wrote:

Hi,

[This is now in my kspp -next tree, though I'd really love to add some
additional explicit Tested-bys, Reviewed-bys, or Acked-bys. If you've
looked through any part of this or have done any testing, please consider
sending an email with your "*-by:" line. :)]

This is a start of the mainline port of PAX_USERCOPY[1]. After writing
tests (now in lkdtm in -next) for Casey's earlier port[2], I kept tweaking
things further and further until I ended up with a whole new patch series.
To that end, I took Rik, Laura, and other people's feedback along with
additional changes and clean-ups.

Based on my understanding, PAX_USERCOPY was designed to catch a
few classes of flaws (mainly bad bounds checking) around the use of
copy_to_user()/copy_from_user(). These changes don't touch get_user() and
put_user(), since these operate on constant sized lengths, and tend to be
much less vulnerable. There are effectively three distinct protections in
the whole series, each of which I've given a separate CONFIG, though this
patch set is only the first of the three intended protections. (Generally
speaking, PAX_USERCOPY covers what I'm calling CONFIG_HARDENED_USERCOPY
(this) and CONFIG_HARDENED_USERCOPY_WHITELIST (future), and
PAX_USERCOPY_SLABS covers CONFIG_HARDENED_USERCOPY_SPLIT_KMALLOC
(future).)

This series, which adds CONFIG_HARDENED_USERCOPY, checks that objects
being copied to/from userspace meet certain criteria:
- if address is a heap object, the size must not exceed the object's
  allocated size. (This will catch all kinds of heap overflow flaws.)
- if address range is in the current process stack, it must be within the
  a valid stack frame (if such checking is possible) or at least entirely
  within the current process's stack. (This could catch large lengths that
  would have extended beyond the current process stack, or overflows if
  their length extends back into the original stack.)
- if the address range is part of kernel data, rodata, or bss, allow it.
- if address range is page-allocated, that it doesn't span multiple
  allocations (excepting Reserved and CMA pages).
- if address is within the kernel text, reject it.
- everything else is accepted

The patches in the series are:
- Support for examination of CMA page types:
1- mm: Add is_migrate_cma_page
- Support for arch-specific stack frame checking (which will likely be
  replaced in the future by Josh's more comprehensive unwinder):
2- mm: Implement stack frame object validation
- The core copy_to/from_user() checks, without the slab object checks:
3- mm: Hardened usercopy
- Per-arch enablement of the protection:
4- x86/uaccess: Enable hardened usercopy
5- ARM: uaccess: Enable hardened usercopy
6- arm64/uaccess: Enable hardened usercopy
7- ia64/uaccess: Enable hardened usercopy
8- powerpc/uaccess: Enable hardened usercopy
9- sparc/uaccess: Enable hardened usercopy
   10- s390/uaccess: Enable hardened usercopy
- The heap allocator implementation of object size checking:
   11- mm: SLAB hardened usercopy support
   12- mm: SLUB hardened usercopy support

Some notes:

- This is expected to apply on top of -next which contains fixes for the
  position of _etext on both arm and arm64, though it has some conflicts
  with KASAN that should be trivial to fix up. Also in -next are the
  tests for this protection (in lkdtm), prefixed with USERCOPY_.

- I couldn't detect a measurable performance change with these features
  enabled. Kernel build times were unchanged, hackbench was unchanged,
  etc. I think we could flip this to "on by default" at some point, but
  for now, I'm leaving it off until I can get some more definitive
  measurements. I would love if someone with greater familiarity with
  perf could give this a spin and report results.

- The SLOB support extracted from grsecurity seems entirely broken. I
  have no idea what's going on there, I spent my time testing SLAB and
  SLUB. Having someone else look at SLOB would be nice, but this series
  doesn't depend on it.

Additional features that would be nice, but aren't blocking this series:

- Needs more architecture support for stack frame checking (only x86 now,
  but it seems Josh will have a good solution for this soon).


Thanks!

-Kees

[1] https://grsecurity.net/download.php "grsecurity - test kernel patch"
[2] http://www.openwall.com/lists/kernel-hardening/2016/05/19/5

v4:
- handle CMA pages, labbott
- update stack checker comments, labbott
- check for vmalloc addresses, labbott
- deal with KASAN in -next changing arm64 copy*user calls
- check for linear mappings at runtime instead of via CONFIG

v3:
- switch to using BUG for better Oops integration
- when checking page allocations, check each for Reserved
- use enums for the stack check return for readability

v2:
- added s390 support
- handle slub red zone
- disallow writes to rodata area
- stack frame walker now

Re: [PATCH v4 00/12] mm: Hardened usercopy

2016-07-22 Thread Laura Abbott


On 07/20/2016 01:26 PM, Kees Cook wrote:

Hi,

[This is now in my kspp -next tree, though I'd really love to add some
additional explicit Tested-bys, Reviewed-bys, or Acked-bys. If you've
looked through any part of this or have done any testing, please consider
sending an email with your "*-by:" line. :)]

This is a start of the mainline port of PAX_USERCOPY[1]. After writing
tests (now in lkdtm in -next) for Casey's earlier port[2], I kept tweaking
things further and further until I ended up with a whole new patch series.
To that end, I took Rik, Laura, and other people's feedback along with
additional changes and clean-ups.

Based on my understanding, PAX_USERCOPY was designed to catch a
few classes of flaws (mainly bad bounds checking) around the use of
copy_to_user()/copy_from_user(). These changes don't touch get_user() and
put_user(), since these operate on constant sized lengths, and tend to be
much less vulnerable. There are effectively three distinct protections in
the whole series, each of which I've given a separate CONFIG, though this
patch set is only the first of the three intended protections. (Generally
speaking, PAX_USERCOPY covers what I'm calling CONFIG_HARDENED_USERCOPY
(this) and CONFIG_HARDENED_USERCOPY_WHITELIST (future), and
PAX_USERCOPY_SLABS covers CONFIG_HARDENED_USERCOPY_SPLIT_KMALLOC
(future).)

This series, which adds CONFIG_HARDENED_USERCOPY, checks that objects
being copied to/from userspace meet certain criteria:
- if address is a heap object, the size must not exceed the object's
  allocated size. (This will catch all kinds of heap overflow flaws.)
- if address range is in the current process stack, it must be within the
  a valid stack frame (if such checking is possible) or at least entirely
  within the current process's stack. (This could catch large lengths that
  would have extended beyond the current process stack, or overflows if
  their length extends back into the original stack.)
- if the address range is part of kernel data, rodata, or bss, allow it.
- if address range is page-allocated, that it doesn't span multiple
  allocations (excepting Reserved and CMA pages).
- if address is within the kernel text, reject it.
- everything else is accepted

The patches in the series are:
- Support for examination of CMA page types:
1- mm: Add is_migrate_cma_page
- Support for arch-specific stack frame checking (which will likely be
  replaced in the future by Josh's more comprehensive unwinder):
2- mm: Implement stack frame object validation
- The core copy_to/from_user() checks, without the slab object checks:
3- mm: Hardened usercopy
- Per-arch enablement of the protection:
4- x86/uaccess: Enable hardened usercopy
5- ARM: uaccess: Enable hardened usercopy
6- arm64/uaccess: Enable hardened usercopy
7- ia64/uaccess: Enable hardened usercopy
8- powerpc/uaccess: Enable hardened usercopy
9- sparc/uaccess: Enable hardened usercopy
   10- s390/uaccess: Enable hardened usercopy
- The heap allocator implementation of object size checking:
   11- mm: SLAB hardened usercopy support
   12- mm: SLUB hardened usercopy support

Some notes:

- This is expected to apply on top of -next which contains fixes for the
  position of _etext on both arm and arm64, though it has some conflicts
  with KASAN that should be trivial to fix up. Also in -next are the
  tests for this protection (in lkdtm), prefixed with USERCOPY_.

- I couldn't detect a measurable performance change with these features
  enabled. Kernel build times were unchanged, hackbench was unchanged,
  etc. I think we could flip this to "on by default" at some point, but
  for now, I'm leaving it off until I can get some more definitive
  measurements. I would love if someone with greater familiarity with
  perf could give this a spin and report results.

- The SLOB support extracted from grsecurity seems entirely broken. I
  have no idea what's going on there, I spent my time testing SLAB and
  SLUB. Having someone else look at SLOB would be nice, but this series
  doesn't depend on it.

Additional features that would be nice, but aren't blocking this series:

- Needs more architecture support for stack frame checking (only x86 now,
  but it seems Josh will have a good solution for this soon).


Thanks!

-Kees

[1] https://grsecurity.net/download.php "grsecurity - test kernel patch"
[2] http://www.openwall.com/lists/kernel-hardening/2016/05/19/5

v4:
- handle CMA pages, labbott
- update stack checker comments, labbott
- check for vmalloc addresses, labbott
- deal with KASAN in -next changing arm64 copy*user calls
- check for linear mappings at runtime instead of via CONFIG

v3:
- switch to using BUG for better Oops integration
- when checking page allocations, check each for Reserved
- use enums for the stack check return for readability

v2:
- added s390 support
- handle slub red zone
- disallow writes to rodata area
- stack frame walker now

Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code

On Fri, Jul 22, 2016 at 5:22 PM, Linus Torvalds
 wrote:
>
> So without having yet looked at the code, I want people to understand
> that to a very real degree, the stack tracer that the *oopsing* code
> (ie what all the usual kernel fault handlers use) is very very special
> code and needs to be handled very carefully, and needs to be extra
> robust, even in the presence of stack corruption, and even in the
> presence of the dwarf info being totally corrupted. Because we've very
> much had both things happen.
>
> It is very possible that we should have two different stack tracers -
> the stupid "for oopses only" code that doesn't necessarily give the
> perfect trace, but is very anal and happily gives old stale addresses
> (which can be very useful for seeing what happened just before the
> "real" stack trace), and then a separate stack trace engine that is
> clever and gets things right, and if that one faults it can depend on
> the normal kernel fault handling picking up the pieces.

I think that Josh's code has the potential to be extremely robust
*and* give more correct results when possible.  One thing I intend to
review when v2 shows up is that it's as conservative as it needs to be
to avoid ever dereferencing an out-of-bounds pointer.  And Josh's oops
printer carefully walks and prints out all addresses on the stack
(complete with question marks) even if the unwinder doesn't find them.

>
> Yes, the current stack tracer is  crufty. No, it's not perfect. But it
> is very well tested, and has held up. That should not be dismissed.
>

I think you may be giving the current tracer slightly more credit than
it's due.  In my stack guard page patchset, I fixed two separate
issues, one of which caused recursive faults and one of which caused
it to output nothing at all.  So maybe *now* it's very robust :)  But
it's still an umaintainable mess IMO, and Josh's patchset helps a
*lot*.

--Andy

Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code

On Fri, Jul 22, 2016 at 5:22 PM, Linus Torvalds
 wrote:
>
> So without having yet looked at the code, I want people to understand
> that to a very real degree, the stack tracer that the *oopsing* code
> (ie what all the usual kernel fault handlers use) is very very special
> code and needs to be handled very carefully, and needs to be extra
> robust, even in the presence of stack corruption, and even in the
> presence of the dwarf info being totally corrupted. Because we've very
> much had both things happen.
>
> It is very possible that we should have two different stack tracers -
> the stupid "for oopses only" code that doesn't necessarily give the
> perfect trace, but is very anal and happily gives old stale addresses
> (which can be very useful for seeing what happened just before the
> "real" stack trace), and then a separate stack trace engine that is
> clever and gets things right, and if that one faults it can depend on
> the normal kernel fault handling picking up the pieces.

I think that Josh's code has the potential to be extremely robust
*and* give more correct results when possible.  One thing I intend to
review when v2 shows up is that it's as conservative as it needs to be
to avoid ever dereferencing an out-of-bounds pointer.  And Josh's oops
printer carefully walks and prints out all addresses on the stack
(complete with question marks) even if the unwinder doesn't find them.

>
> Yes, the current stack tracer is  crufty. No, it's not perfect. But it
> is very well tested, and has held up. That should not be dismissed.
>

I think you may be giving the current tracer slightly more credit than
it's due.  In my stack guard page patchset, I fixed two separate
issues, one of which caused recursive faults and one of which caused
it to output nothing at all.  So maybe *now* it's very robust :)  But
it's still an umaintainable mess IMO, and Josh's patchset helps a
*lot*.

--Andy

[PATCH] add u64 number parser

2016-07-22 Thread James Smart


add u64 number parser

Will be used by the nvme-fabrics FC transport in parsing options

Signed-off-by: James Smart 
---
 include/linux/parser.h |  1 +
 lib/parser.c   | 47 +++
 2 files changed, 48 insertions(+)

diff --git a/include/linux/parser.h b/include/linux/parser.h
index 39d5b79..884c1e6 100644
--- a/include/linux/parser.h
+++ b/include/linux/parser.h
@@ -27,6 +27,7 @@ typedef struct {
 
 int match_token(char *, const match_table_t table, substring_t args[]);
 int match_int(substring_t *, int *result);
+int match_u64(substring_t *, u64 *result);
 int match_octal(substring_t *, int *result);
 int match_hex(substring_t *, int *result);
 bool match_wildcard(const char *pattern, const char *str);
diff --git a/lib/parser.c b/lib/parser.c
index b6d1163..3278958 100644
--- a/lib/parser.c
+++ b/lib/parser.c
@@ -152,6 +152,36 @@ static int match_number(substring_t *s, int *result, int 
base)
 }
 
 /**
+ * match_u64int: scan a number in the given base from a substring_t
+ * @s: substring to be scanned
+ * @result: resulting u64 on success
+ * @base: base to use when converting string
+ *
+ * Description: Given a _t and a base, attempts to parse the 
substring
+ * as a number in that base. On success, sets @result to the integer 
represented
+ * by the string and returns 0. Returns -ENOMEM, -EINVAL, or -ERANGE on 
failure.
+ */
+static int match_u64int(substring_t *s, u64 *result, int base)
+{
+   char *buf;
+   int ret;
+   u64 val;
+   size_t len = s->to - s->from;
+
+   buf = kmalloc(len + 1, GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+   memcpy(buf, s->from, len);
+   buf[len] = '\0';
+
+   ret = kstrtoull(buf, base, );
+   if (!ret)
+   *result = val;
+   kfree(buf);
+   return ret;
+}
+
+/**
  * match_int: - scan a decimal representation of an integer from a substring_t
  * @s: substring_t to be scanned
  * @result: resulting integer on success
@@ -167,6 +197,23 @@ int match_int(substring_t *s, int *result)
 EXPORT_SYMBOL(match_int);
 
 /**
+ * match_u64: - scan a decimal representation of a u64 from
+ *  a substring_t
+ * @s: substring_t to be scanned
+ * @result: resulting unsigned long long on success
+ *
+ * Description: Attempts to parse the _t @s as a long decimal
+ * integer. On success, sets @result to the integer represented by the
+ * string and returns 0.
+ * Returns -ENOMEM, -EINVAL, or -ERANGE on failure.
+ */
+int match_u64(substring_t *s, u64 *result)
+{
+   return match_u64int(s, result, 0);
+}
+EXPORT_SYMBOL(match_u64);
+
+/**
  * match_octal: - scan an octal representation of an integer from a substring_t
  * @s: substring_t to be scanned
  * @result: resulting integer on success
-- 
2.5.0

[PATCH] add u64 number parser

2016-07-22 Thread James Smart


add u64 number parser

Will be used by the nvme-fabrics FC transport in parsing options

Signed-off-by: James Smart 
---
 include/linux/parser.h |  1 +
 lib/parser.c   | 47 +++
 2 files changed, 48 insertions(+)

diff --git a/include/linux/parser.h b/include/linux/parser.h
index 39d5b79..884c1e6 100644
--- a/include/linux/parser.h
+++ b/include/linux/parser.h
@@ -27,6 +27,7 @@ typedef struct {
 
 int match_token(char *, const match_table_t table, substring_t args[]);
 int match_int(substring_t *, int *result);
+int match_u64(substring_t *, u64 *result);
 int match_octal(substring_t *, int *result);
 int match_hex(substring_t *, int *result);
 bool match_wildcard(const char *pattern, const char *str);
diff --git a/lib/parser.c b/lib/parser.c
index b6d1163..3278958 100644
--- a/lib/parser.c
+++ b/lib/parser.c
@@ -152,6 +152,36 @@ static int match_number(substring_t *s, int *result, int 
base)
 }
 
 /**
+ * match_u64int: scan a number in the given base from a substring_t
+ * @s: substring to be scanned
+ * @result: resulting u64 on success
+ * @base: base to use when converting string
+ *
+ * Description: Given a _t and a base, attempts to parse the 
substring
+ * as a number in that base. On success, sets @result to the integer 
represented
+ * by the string and returns 0. Returns -ENOMEM, -EINVAL, or -ERANGE on 
failure.
+ */
+static int match_u64int(substring_t *s, u64 *result, int base)
+{
+   char *buf;
+   int ret;
+   u64 val;
+   size_t len = s->to - s->from;
+
+   buf = kmalloc(len + 1, GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+   memcpy(buf, s->from, len);
+   buf[len] = '\0';
+
+   ret = kstrtoull(buf, base, );
+   if (!ret)
+   *result = val;
+   kfree(buf);
+   return ret;
+}
+
+/**
  * match_int: - scan a decimal representation of an integer from a substring_t
  * @s: substring_t to be scanned
  * @result: resulting integer on success
@@ -167,6 +197,23 @@ int match_int(substring_t *s, int *result)
 EXPORT_SYMBOL(match_int);
 
 /**
+ * match_u64: - scan a decimal representation of a u64 from
+ *  a substring_t
+ * @s: substring_t to be scanned
+ * @result: resulting unsigned long long on success
+ *
+ * Description: Attempts to parse the _t @s as a long decimal
+ * integer. On success, sets @result to the integer represented by the
+ * string and returns 0.
+ * Returns -ENOMEM, -EINVAL, or -ERANGE on failure.
+ */
+int match_u64(substring_t *s, u64 *result)
+{
+   return match_u64int(s, result, 0);
+}
+EXPORT_SYMBOL(match_u64);
+
+/**
  * match_octal: - scan an octal representation of an integer from a substring_t
  * @s: substring_t to be scanned
  * @result: resulting integer on success
-- 
2.5.0

Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code

On Fri, Jul 22, 2016 at 6:21 AM, Josh Poimboeuf  wrote:
>
> Some if its advantages:
>
> - simplicity: no more callback sprawl and less code duplication.
>
> - flexibility: allows the caller to stop and inspect the stack state at
>   each step in the unwinding process.
>
> - modularity: the unwinder code, console stack dump code, and stack
>   metadata analysis code are all better separated so that changing one
>   of them shouldn't have much of an impact on any of the others.

I've been without internet for the last week, so I have a ton pending,
and not good enough internet even now to take a good look.

However, I want to make one thing really really clear: the absolute
NUMBER ONE requirement for the stack tracing code is none of the
above.

The #1 requirement is that it works, and not have a chance in hell of
ever breaking. We had that happen once before when people wanted to
make it fancy and add Dwarf info, and it was such a f*cking disaster
that I am not sure I ever want to do that again. Seriously.

It does not matter if the stack tracing gives the wrong answers.

It does not matter if the stack tracing is complicated and odd old code.

It does not matter one whit if some new user is inconvenienced, and in
fact it is possible that new users should write their *own* stack
tracer code.

The ONLY thing that matters (to a very high degree) is that the code
is stable, and if an Oops happens, the stack tracer never *ever*
cause even more problems than we already have.

If the stack tracer *ever* takes a recursive fault and kills the
machine, the stack tracer is worse than bad - we'd be better off
*without* a stack tracer at all.

And yes, we had exactly that situation, where bugs in the stack tracer
meant that other bugs ended up being much harder to debug, because
instead of a nice logged oops message that would have been trivial to
figure out, we very occasionally ended up with a dead machine instead.

So without having yet looked at the code, I want people to understand
that to a very real degree, the stack tracer that the *oopsing* code
(ie what all the usual kernel fault handlers use) is very very special
code and needs to be handled very carefully, and needs to be extra
robust, even in the presence of stack corruption, and even in the
presence of the dwarf info being totally corrupted. Because we've very
much had both things happen.

It is very possible that we should have two different stack tracers -
the stupid "for oopses only" code that doesn't necessarily give the
perfect trace, but is very anal and happily gives old stale addresses
(which can be very useful for seeing what happened just before the
"real" stack trace), and then a separate stack trace engine that is
clever and gets things right, and if that one faults it can depend on
the normal kernel fault handling picking up the pieces.

Yes, the current stack tracer is  crufty. No, it's not perfect. But it
is very well tested, and has held up. That should not be dismissed.

 Linus

Re: [PATCH 00/19] x86/dumpstack: rewrite x86 stack dump code

On Fri, Jul 22, 2016 at 6:21 AM, Josh Poimboeuf  wrote:
>
> Some if its advantages:
>
> - simplicity: no more callback sprawl and less code duplication.
>
> - flexibility: allows the caller to stop and inspect the stack state at
>   each step in the unwinding process.
>
> - modularity: the unwinder code, console stack dump code, and stack
>   metadata analysis code are all better separated so that changing one
>   of them shouldn't have much of an impact on any of the others.

I've been without internet for the last week, so I have a ton pending,
and not good enough internet even now to take a good look.

However, I want to make one thing really really clear: the absolute
NUMBER ONE requirement for the stack tracing code is none of the
above.

The #1 requirement is that it works, and not have a chance in hell of
ever breaking. We had that happen once before when people wanted to
make it fancy and add Dwarf info, and it was such a f*cking disaster
that I am not sure I ever want to do that again. Seriously.

It does not matter if the stack tracing gives the wrong answers.

It does not matter if the stack tracing is complicated and odd old code.

It does not matter one whit if some new user is inconvenienced, and in
fact it is possible that new users should write their *own* stack
tracer code.

The ONLY thing that matters (to a very high degree) is that the code
is stable, and if an Oops happens, the stack tracer never *ever*
cause even more problems than we already have.

If the stack tracer *ever* takes a recursive fault and kills the
machine, the stack tracer is worse than bad - we'd be better off
*without* a stack tracer at all.

And yes, we had exactly that situation, where bugs in the stack tracer
meant that other bugs ended up being much harder to debug, because
instead of a nice logged oops message that would have been trivial to
figure out, we very occasionally ended up with a dead machine instead.

So without having yet looked at the code, I want people to understand
that to a very real degree, the stack tracer that the *oopsing* code
(ie what all the usual kernel fault handlers use) is very very special
code and needs to be handled very carefully, and needs to be extra
robust, even in the presence of stack corruption, and even in the
presence of the dwarf info being totally corrupted. Because we've very
much had both things happen.

It is very possible that we should have two different stack tracers -
the stupid "for oopses only" code that doesn't necessarily give the
perfect trace, but is very anal and happily gives old stale addresses
(which can be very useful for seeing what happened just before the
"real" stack trace), and then a separate stack trace engine that is
clever and gets things right, and if that one faults it can depend on
the normal kernel fault handling picking up the pieces.

Yes, the current stack tracer is  crufty. No, it's not perfect. But it
is very well tested, and has held up. That should not be dismissed.

 Linus

[PATCH] f2fs: clean up codying style and redundancy

2016-07-22 Thread Jaegeuk Kim

This patch includes minor clean-ups.

Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/acl.h  | 2 +-
 fs/f2fs/data.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/f2fs/acl.h b/fs/f2fs/acl.h
index 997ca8e..b2334d1 100644
--- a/fs/f2fs/acl.h
+++ b/fs/f2fs/acl.h
@@ -37,7 +37,7 @@ struct f2fs_acl_header {
 #ifdef CONFIG_F2FS_FS_POSIX_ACL
 
 extern struct posix_acl *f2fs_get_acl(struct inode *, int);
-extern int f2fs_set_acl(struct inode *inode, struct posix_acl *acl, int type);
+extern int f2fs_set_acl(struct inode *, struct posix_acl *, int);
 extern int f2fs_init_acl(struct inode *, struct inode *, struct page *,
struct page *);
 #else
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 87a458f..614154f 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1731,7 +1731,7 @@ static ssize_t f2fs_direct_IO(struct kiocb *iocb, struct 
iov_iter *iter)
if (test_opt(F2FS_I_SB(inode), LFS))
return 0;
 
-   trace_f2fs_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
+   trace_f2fs_direct_IO_enter(inode, offset, count, rw);
 
down_read(_I(inode)->dio_rwsem[rw]);
err = blockdev_direct_IO(iocb, inode, iter, get_data_block_dio);
@@ -1744,7 +1744,7 @@ static ssize_t f2fs_direct_IO(struct kiocb *iocb, struct 
iov_iter *iter)
f2fs_write_failed(mapping, offset + count);
}
 
-   trace_f2fs_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), err);
+   trace_f2fs_direct_IO_exit(inode, offset, count, rw, err);
 
return err;
 }
-- 
2.8.3

[PATCH] f2fs: clean up codying style and redundancy

2016-07-22 Thread Jaegeuk Kim

This patch includes minor clean-ups.

Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/acl.h  | 2 +-
 fs/f2fs/data.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/f2fs/acl.h b/fs/f2fs/acl.h
index 997ca8e..b2334d1 100644
--- a/fs/f2fs/acl.h
+++ b/fs/f2fs/acl.h
@@ -37,7 +37,7 @@ struct f2fs_acl_header {
 #ifdef CONFIG_F2FS_FS_POSIX_ACL
 
 extern struct posix_acl *f2fs_get_acl(struct inode *, int);
-extern int f2fs_set_acl(struct inode *inode, struct posix_acl *acl, int type);
+extern int f2fs_set_acl(struct inode *, struct posix_acl *, int);
 extern int f2fs_init_acl(struct inode *, struct inode *, struct page *,
struct page *);
 #else
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 87a458f..614154f 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1731,7 +1731,7 @@ static ssize_t f2fs_direct_IO(struct kiocb *iocb, struct 
iov_iter *iter)
if (test_opt(F2FS_I_SB(inode), LFS))
return 0;
 
-   trace_f2fs_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
+   trace_f2fs_direct_IO_enter(inode, offset, count, rw);
 
down_read(_I(inode)->dio_rwsem[rw]);
err = blockdev_direct_IO(iocb, inode, iter, get_data_block_dio);
@@ -1744,7 +1744,7 @@ static ssize_t f2fs_direct_IO(struct kiocb *iocb, struct 
iov_iter *iter)
f2fs_write_failed(mapping, offset + count);
}
 
-   trace_f2fs_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), err);
+   trace_f2fs_direct_IO_exit(inode, offset, count, rw, err);
 
return err;
 }
-- 
2.8.3

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

On Fri, Jul 22, 2016 at 4:54 PM, Josh Poimboeuf  wrote:
>> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info 
>> > *info,
>> > +unsigned long *visit_mask)
>> > +{
>> > +   unsigned long *begin = (unsigned long 
>> > *)this_cpu_read(hardirq_stack);
>> > +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
>> > +
>> > +   if (stack < begin || stack >= end)
>> > +   return false;
>> > +
>> > +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
>> > +   return false;
>> > +
>> > +   info->type  = STACK_TYPE_IRQ;
>> > +   info->begin = begin;
>> > +   info->end   = end;
>> > +   info->next  = (unsigned long *)*begin;
>>
>> This works, but it's a bit magic.  I don't suppose we could get rid of
>> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
>> move from stack to stack by querying the stack type of whatever the
>> frame base address is if the frame base address ends up being out of
>> bounds for the current stack?  Or maybe the unwinder could even do
>> this by itself.
>
> I'm not quite sure what you mean here.  The ->next stack pointer is
> quite useful and it abstracts that ugliness away from the callers of
> get_stack_info().  I'm open to any specific suggestions.

So far I've found two users of this thing.  One is
show_stack_log_lvl(), and it makes sense there, but maybe
info->heuristic_next_stack would be a better name.  The other is the
unwinder itself, and I think that walking from stack to stack using
this heuristic is the wrong approach there, at least in the long term.
I'd rather we just followed the bp chain wherever it leads us, as long
as it leads us to a valid stack that we haven't visited before.

As a concrete example of what I think is wrong with the current
approach, ISTM it would be totally valid to implement stack switching
like this:

some_func:
 push %rbp
 mov %rsp, %rbp
 ...
 mov [next stack], %rsp
 call some_other_func
 mov %rbp, %rsp
 pop %rbp
 ret

With the current approach, you can't unwind out of that function,
because there's no way to populate info->next.  I'm not actually
suggesting that the kernel should ever do such a thing on x86, and my
proposed rewrite of the IRQ stack code [1] will be fully compatible
with your approach, but it seems odd to me that the unwinder should
depend on idea that the stacks in use are chained together in a way
that can be decoded without .  (But maybe some of the Go compilers do
work this way -- I've never looked at their precise stack layout.)

Also, if you ever intend to port this thing to other architectures, I
think there are architectures that have separate exception stacks and
that track the next available slot on those stacks dynamically.  I
think that x86_32 is an example of this if task gates are used in a
back-and-forth manner, although Linux doesn't do that.  (x86_64 should
have done this for IST, but it didn't.)  On those architectures, you
can have two separate switches onto the same stack live at the same
time, and your current approach won't work.  (Even if you make the
change I'm suggesting, visit_mask will break, too, but fixing that
would be a much less invasive change.)

Am I making any sense?  This is a suggestion for making it better, not
something I see as a requirement for getting the x86 code upstream.

>>
>> > +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
>> > +  unsigned long *visit_mask)
>> >  {
>> > unsigned long stack = (unsigned long)s;
>> > unsigned long begin, end;
>> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long 
>> > *s, char **name,
>> > if (stack < begin || stack >= end)
>> > continue;
>> >
>> > -   if (test_and_set_bit(k, visit_mask))
>> > +   if (visit_mask &&
>> > +   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
>> > return false;
>> >
>> > -   *name = exception_stack_names[k];
>> > -   return (unsigned long *)end;
>> > +   info->type  = STACK_TYPE_EXCEPTION + k;
>> > +   info->begin = (unsigned long *)begin;
>> > +   info->end   = (unsigned long *)end;
>> > +   info->next  = (unsigned long *)info->end[-2];
>>
>> This is so magical that I don't immediately see why it's correct.
>> Presumably it's because the thing two slots down from the top of the
>> stack is regs->sp?  If so, that needs a comment.
>
> Heck if I know, I just stole it from dump_trace() ;-)
>
> I'll figure it out and add a comment.

If you can write it as:

struct pt_regs *regs = (struct pt_regs *)end - 1;
info->next = regs->sp;

and it still works, then no comment required :)

>
>> But again, couldn't we use the fact that we now know how to decode
>>

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

On Fri, Jul 22, 2016 at 4:54 PM, Josh Poimboeuf  wrote:
>> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info 
>> > *info,
>> > +unsigned long *visit_mask)
>> > +{
>> > +   unsigned long *begin = (unsigned long 
>> > *)this_cpu_read(hardirq_stack);
>> > +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
>> > +
>> > +   if (stack < begin || stack >= end)
>> > +   return false;
>> > +
>> > +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
>> > +   return false;
>> > +
>> > +   info->type  = STACK_TYPE_IRQ;
>> > +   info->begin = begin;
>> > +   info->end   = end;
>> > +   info->next  = (unsigned long *)*begin;
>>
>> This works, but it's a bit magic.  I don't suppose we could get rid of
>> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
>> move from stack to stack by querying the stack type of whatever the
>> frame base address is if the frame base address ends up being out of
>> bounds for the current stack?  Or maybe the unwinder could even do
>> this by itself.
>
> I'm not quite sure what you mean here.  The ->next stack pointer is
> quite useful and it abstracts that ugliness away from the callers of
> get_stack_info().  I'm open to any specific suggestions.

So far I've found two users of this thing.  One is
show_stack_log_lvl(), and it makes sense there, but maybe
info->heuristic_next_stack would be a better name.  The other is the
unwinder itself, and I think that walking from stack to stack using
this heuristic is the wrong approach there, at least in the long term.
I'd rather we just followed the bp chain wherever it leads us, as long
as it leads us to a valid stack that we haven't visited before.

As a concrete example of what I think is wrong with the current
approach, ISTM it would be totally valid to implement stack switching
like this:

some_func:
 push %rbp
 mov %rsp, %rbp
 ...
 mov [next stack], %rsp
 call some_other_func
 mov %rbp, %rsp
 pop %rbp
 ret

With the current approach, you can't unwind out of that function,
because there's no way to populate info->next.  I'm not actually
suggesting that the kernel should ever do such a thing on x86, and my
proposed rewrite of the IRQ stack code [1] will be fully compatible
with your approach, but it seems odd to me that the unwinder should
depend on idea that the stacks in use are chained together in a way
that can be decoded without .  (But maybe some of the Go compilers do
work this way -- I've never looked at their precise stack layout.)

Also, if you ever intend to port this thing to other architectures, I
think there are architectures that have separate exception stacks and
that track the next available slot on those stacks dynamically.  I
think that x86_32 is an example of this if task gates are used in a
back-and-forth manner, although Linux doesn't do that.  (x86_64 should
have done this for IST, but it didn't.)  On those architectures, you
can have two separate switches onto the same stack live at the same
time, and your current approach won't work.  (Even if you make the
change I'm suggesting, visit_mask will break, too, but fixing that
would be a much less invasive change.)

Am I making any sense?  This is a suggestion for making it better, not
something I see as a requirement for getting the x86 code upstream.

>>
>> > +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
>> > +  unsigned long *visit_mask)
>> >  {
>> > unsigned long stack = (unsigned long)s;
>> > unsigned long begin, end;
>> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long 
>> > *s, char **name,
>> > if (stack < begin || stack >= end)
>> > continue;
>> >
>> > -   if (test_and_set_bit(k, visit_mask))
>> > +   if (visit_mask &&
>> > +   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
>> > return false;
>> >
>> > -   *name = exception_stack_names[k];
>> > -   return (unsigned long *)end;
>> > +   info->type  = STACK_TYPE_EXCEPTION + k;
>> > +   info->begin = (unsigned long *)begin;
>> > +   info->end   = (unsigned long *)end;
>> > +   info->next  = (unsigned long *)info->end[-2];
>>
>> This is so magical that I don't immediately see why it's correct.
>> Presumably it's because the thing two slots down from the top of the
>> stack is regs->sp?  If so, that needs a comment.
>
> Heck if I know, I just stole it from dump_trace() ;-)
>
> I'll figure it out and add a comment.

If you can write it as:

struct pt_regs *regs = (struct pt_regs *)end - 1;
info->next = regs->sp;

and it still works, then no comment required :)

>
>> But again, couldn't we use the fact that we now know how to decode
>> pt_regs to avoid needing

Re: [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks

2016-07-22 Thread NeilBrown

On Fri, Jul 22 2016, Michal Hocko wrote:

> On Fri 22-07-16 18:46:57, Neil Brown wrote:
>> On Mon, Jul 18 2016, Michal Hocko wrote:
>> 
>> > From: Michal Hocko 
>> >
>> > Mikulas has reported that a swap backed by dm-crypt doesn't work
>> > properly because the swapout cannot make a sufficient forward progress
>> > as the writeout path depends on dm_crypt worker which has to allocate
>> > memory to perform the encryption. In order to guarantee a forward
>> > progress it relies on the mempool allocator. mempool_alloc(), however,
>> > prefers to use the underlying (usually page) allocator before it grabs
>> > objects from the pool. Such an allocation can dive into the memory
>> > reclaim and consequently to throttle_vm_writeout.
>> 
>> That's just broken.
>> I used to think mempool should always use the pre-allocated reserves
>> first.  That is surely the most logical course of action.  Otherwise
>> that memory is just sitting there doing nothing useful.
>> 
>> I spoke to Nick Piggin about this some years ago and he pointed out that
>> the kmalloc allocation paths are much better optimized for low overhead
>> when there is plenty of memory.  They can just pluck a free block of a
>> per-CPU list without taking any locks.   By contrast, accessing the
>> preallocated pool always requires a spinlock.
>> 
>> So it makes lots of sense to prefer the underlying allocator if it can
>> provide a quick response.  If it cannot, the sensible thing is to use
>> the pool, or wait for the pool to be replenished.
>> 
>> So the allocator should never wait at all, never enter reclaim, never
>> throttle.
>> 
>> Looking at the current code, __GFP_DIRECT_RECLAIM is disabled the first
>> time through, but if the pool is empty, direct-reclaim is allowed on the
>> next attempt.  Presumably this is where the throttling comes in ??
>
> Yes that is correct.
>
>> I suspect that it really shouldn't do that. It should leave kswapd to
>> do reclaim (so __GFP_KSWAPD_RECLAIM is appropriate) and only wait in
>> mempool_alloc where pool->wait can wake it up.
>
> Mikulas was already suggesting that and my concern was that this would
> give up prematurely even under mild page cache load when there are many
> clean page cache pages.

That's a valid point - freeing up clean pages is a reasonable thing for
a mempool allocator to try to do.

>  If we just back off and rely on kswapd which
> might get stuck on the writeout then the IO throughput can be reduced

If I were king of MM, I would make a decree to be proclaimed throughout
the land
kswapd must never sleep except when it explicitly chooses to

Maybe that is impractical, but having firm rules like that would go a
long way to make it possible to actually understand and reason about how
MM works.  As it is, there seems to be a tendency to put bandaids over
bandaids.

> I believe which would make the whole memory pressure just worse. So I am
> not sure this is a good idea in general. I completely agree with you
> that the mempool request shouldn't be throttled unless there is a strong
> reason for that. More on that below.
>
>> If I'm following the code properly, the stack trace below can only
>> happen if the first pool->alloc() attempt, with direct-reclaim disabled,
>> fails and the pool is empty, so mempool_alloc() calls prepare_to_wait()
>> and io_schedule_timeout().
>
> mempool_alloc retries immediatelly without any sleep after the first
> no-reclaim attempt.

I missed that ... I see it now... I wonder if anyone has contemplated
using some modern programming techniques like, maybe, a "while" loop in
there..
Something like the below...

>
>> I suspect the timeout *doesn't* fire (5 seconds is along time) so it
>> gets woken up when there is something in the pool.  It then loops around
>> and tries pool->alloc() again, even though there is something in the
>> pool.  This might be justified if that ->alloc would never block, but
>> obviously it does.
>> 
>> I would very strongly recommend just changing mempool_alloc() to
>> permanently mask out __GFP_DIRECT_RECLAIM.
>> 
>> Quite separately I don't think PF_LESS_THROTTLE is at all appropriate.
>> It is "LESS" throttle, not "NO" throttle, but you have made
>> throttle_vm_writeout never throttle PF_LESS_THROTTLE threads.
>
> Yes that is correct. But it still allows to throttle on congestion:
> shrink_inactive_list:
>   /*
>* Stall direct reclaim for IO completions if underlying BDIs or zone
>* is congested. Allow kswapd to continue until it starts encountering
>* unqueued dirty pages or cycling through the LRU too quickly.
>*/
>   if (!sc->hibernation_mode && !current_is_kswapd() &&
>   current_may_throttle())
>   wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
>
> My thinking was that throttle_vm_writeout is there to prevent from
> dirtying too many pages from the reclaim the context.  PF_LESS_THROTTLE
> is part of the writeout so throttling

Re: [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks

2016-07-22 Thread NeilBrown

On Fri, Jul 22 2016, Michal Hocko wrote:

> On Fri 22-07-16 18:46:57, Neil Brown wrote:
>> On Mon, Jul 18 2016, Michal Hocko wrote:
>> 
>> > From: Michal Hocko 
>> >
>> > Mikulas has reported that a swap backed by dm-crypt doesn't work
>> > properly because the swapout cannot make a sufficient forward progress
>> > as the writeout path depends on dm_crypt worker which has to allocate
>> > memory to perform the encryption. In order to guarantee a forward
>> > progress it relies on the mempool allocator. mempool_alloc(), however,
>> > prefers to use the underlying (usually page) allocator before it grabs
>> > objects from the pool. Such an allocation can dive into the memory
>> > reclaim and consequently to throttle_vm_writeout.
>> 
>> That's just broken.
>> I used to think mempool should always use the pre-allocated reserves
>> first.  That is surely the most logical course of action.  Otherwise
>> that memory is just sitting there doing nothing useful.
>> 
>> I spoke to Nick Piggin about this some years ago and he pointed out that
>> the kmalloc allocation paths are much better optimized for low overhead
>> when there is plenty of memory.  They can just pluck a free block of a
>> per-CPU list without taking any locks.   By contrast, accessing the
>> preallocated pool always requires a spinlock.
>> 
>> So it makes lots of sense to prefer the underlying allocator if it can
>> provide a quick response.  If it cannot, the sensible thing is to use
>> the pool, or wait for the pool to be replenished.
>> 
>> So the allocator should never wait at all, never enter reclaim, never
>> throttle.
>> 
>> Looking at the current code, __GFP_DIRECT_RECLAIM is disabled the first
>> time through, but if the pool is empty, direct-reclaim is allowed on the
>> next attempt.  Presumably this is where the throttling comes in ??
>
> Yes that is correct.
>
>> I suspect that it really shouldn't do that. It should leave kswapd to
>> do reclaim (so __GFP_KSWAPD_RECLAIM is appropriate) and only wait in
>> mempool_alloc where pool->wait can wake it up.
>
> Mikulas was already suggesting that and my concern was that this would
> give up prematurely even under mild page cache load when there are many
> clean page cache pages.

That's a valid point - freeing up clean pages is a reasonable thing for
a mempool allocator to try to do.

>  If we just back off and rely on kswapd which
> might get stuck on the writeout then the IO throughput can be reduced

If I were king of MM, I would make a decree to be proclaimed throughout
the land
kswapd must never sleep except when it explicitly chooses to

Maybe that is impractical, but having firm rules like that would go a
long way to make it possible to actually understand and reason about how
MM works.  As it is, there seems to be a tendency to put bandaids over
bandaids.

> I believe which would make the whole memory pressure just worse. So I am
> not sure this is a good idea in general. I completely agree with you
> that the mempool request shouldn't be throttled unless there is a strong
> reason for that. More on that below.
>
>> If I'm following the code properly, the stack trace below can only
>> happen if the first pool->alloc() attempt, with direct-reclaim disabled,
>> fails and the pool is empty, so mempool_alloc() calls prepare_to_wait()
>> and io_schedule_timeout().
>
> mempool_alloc retries immediatelly without any sleep after the first
> no-reclaim attempt.

I missed that ... I see it now... I wonder if anyone has contemplated
using some modern programming techniques like, maybe, a "while" loop in
there..
Something like the below...

>
>> I suspect the timeout *doesn't* fire (5 seconds is along time) so it
>> gets woken up when there is something in the pool.  It then loops around
>> and tries pool->alloc() again, even though there is something in the
>> pool.  This might be justified if that ->alloc would never block, but
>> obviously it does.
>> 
>> I would very strongly recommend just changing mempool_alloc() to
>> permanently mask out __GFP_DIRECT_RECLAIM.
>> 
>> Quite separately I don't think PF_LESS_THROTTLE is at all appropriate.
>> It is "LESS" throttle, not "NO" throttle, but you have made
>> throttle_vm_writeout never throttle PF_LESS_THROTTLE threads.
>
> Yes that is correct. But it still allows to throttle on congestion:
> shrink_inactive_list:
>   /*
>* Stall direct reclaim for IO completions if underlying BDIs or zone
>* is congested. Allow kswapd to continue until it starts encountering
>* unqueued dirty pages or cycling through the LRU too quickly.
>*/
>   if (!sc->hibernation_mode && !current_is_kswapd() &&
>   current_may_throttle())
>   wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
>
> My thinking was that throttle_vm_writeout is there to prevent from
> dirtying too many pages from the reclaim the context.  PF_LESS_THROTTLE
> is part of the writeout so throttling it on too many

Re: [PATCH v4 1/2] bpf: Add bpf_probe_write BPF helper to be called in tracers (kprobes)

2016-07-22 Thread Sargun Dhillon

On Fri, Jul 22, 2016 at 11:53:52AM +0200, Daniel Borkmann wrote:
> On 07/22/2016 04:14 AM, Alexei Starovoitov wrote:
> >On Thu, Jul 21, 2016 at 06:09:17PM -0700, Sargun Dhillon wrote:
> >>This allows user memory to be written to during the course of a kprobe.
> >>It shouldn't be used to implement any kind of security mechanism
> >>because of TOC-TOU attacks, but rather to debug, divert, and
> >>manipulate execution of semi-cooperative processes.
> >>
> >>Although it uses probe_kernel_write, we limit the address space
> >>the probe can write into by checking the space with access_ok.
> >>This is so the call doesn't sleep.
> >>
> >>Given this feature is experimental, and has the risk of crashing
> >>the system, we print a warning on invocation.
> >>
> >>It was tested with the tracex7 program on x86-64.
> >>
> >>Signed-off-by: Sargun Dhillon 
> >>Cc: Alexei Starovoitov 
> >>Cc: Daniel Borkmann 
> >>---
> >>  include/uapi/linux/bpf.h  | 12 
> >>  kernel/bpf/verifier.c |  9 +
> >>  kernel/trace/bpf_trace.c  | 37 +
> >>  samples/bpf/bpf_helpers.h |  2 ++
> >>  4 files changed, 60 insertions(+)
> >>
> >>diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >>index 2b7076f..4536282 100644
> >>--- a/include/uapi/linux/bpf.h
> >>+++ b/include/uapi/linux/bpf.h
> >>@@ -365,6 +365,18 @@ enum bpf_func_id {
> >> */
> >>BPF_FUNC_get_current_task,
> >>
> >>+   /**
> >>+* bpf_probe_write(void *dst, void *src, int len)
> >>+* safely attempt to write to a location
> >>+* @dst: destination address in userspace
> >>+* @src: source address on stack
> >>+* @len: number of bytes to copy
> >>+* Return:
> >>+*   Returns number of bytes that could not be copied.
> >>+*   On success, this will be zero
> >
> >that is odd comment.
> >there are only three possible return values 0, -EFAULT, -EPERM
> 
> Agree.
> 
> >>+*/
> >>+   BPF_FUNC_probe_write,
> >>+
> >>__BPF_FUNC_MAX_ID,
> >>  };
> >>
> >>diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >>index f72f23b..6785008 100644
> >>--- a/kernel/bpf/verifier.c
> >>+++ b/kernel/bpf/verifier.c
> >>@@ -1154,6 +1154,15 @@ static int check_call(struct verifier_env *env, int 
> >>func_id)
> >>return -EINVAL;
> >>}
> >>
> >>+   if (func_id == BPF_FUNC_probe_write) {
> >>+   
> >>pr_warn_once("\n");
> >>+   pr_warn_once("* bpf_probe_write: Experimental Feature in use 
> >>*\n");
> >>+   pr_warn_once("* bpf_probe_write: Feature may corrupt memory  
> >>*\n");
> >>+   
> >>pr_warn_once("\n");
> >>+   pr_notice_ratelimited("bpf_probe_write in use by: %.16s-%d",
> >>+ current->comm, task_pid_nr(current));
> >>+   }
> >
> >I think single line pr_notice_ratelimited() with 'feature may corrupt user 
> >memory'
> >will be enough.
> 
> Agree.
> 
> >Also please move this to tracing specific part into bpf_trace.c
> >similar to bpf_get_trace_printk_proto() instead of verifier.c
> 
> Yes, sorry for not being too clear about it, this spot will then be
> called by the verifier to fetch it for every function call. Meaning that
> this could get printed multiple times for loading a single program, but
> I think it's okay. If single line, I'd make that pr_warn_ratelimited(),
> and probably something like ...
> 
>  "note: %s[%d] is installing a program with bpf_probe_write helper that may 
> corrupt user memory!",
>  current->comm, task_pid_nr(current)
> 
> >>+static u64 bpf_probe_write(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
> >>+{
> >>+   void *unsafe_ptr = (void *) (long) r1;
> >>+   void *src = (void *) (long) r2;
> >>+   int size = (int) r3;
> >>+   struct task_struct *task = current;
> >>+
> >>+   /*
> >
> >bpf_trace.c follows non-net comment style, so it's good here.
> >just distracting vs the rest of net style.
> >
> >>+* Ensure we're in a user context which it is safe for the helper
> >>+* to run. This helper has no business in a kthread
> >>+*
> >>+* access_ok should prevent writing to non-user memory, but on
> >>+* some architectures (nommu, etc...) access_ok isn't enough
> >>+* So we check the current segment
> >>+*/
> >>+
> >>+   if (unlikely(in_interrupt() || (task->flags & PF_KTHREAD)))
> >>+   return -EPERM;
> >
> >Should we also add a check for !PF_EXITING ?
> >Like signals are not delivered to such tasks and I'm not sure
> >what would be the state of mm of it.
> 
> I agree, good point.
> 
> You can make that 'current->flags & (PF_KTHREAD | PF_EXITING)' and
> we don't need the extra task var either.
> 
> >>+   if (unlikely(segment_eq(get_fs(), KERNEL_DS)))
> >>+   return -EPERM;
> >>+   if (!access_ok(VERIFY_WRITE, unsafe_ptr, size))
> >>+   return -EPERM;
> >>+
> >>+   return

Re: [PATCH v4 1/2] bpf: Add bpf_probe_write BPF helper to be called in tracers (kprobes)

2016-07-22 Thread Sargun Dhillon

On Fri, Jul 22, 2016 at 11:53:52AM +0200, Daniel Borkmann wrote:
> On 07/22/2016 04:14 AM, Alexei Starovoitov wrote:
> >On Thu, Jul 21, 2016 at 06:09:17PM -0700, Sargun Dhillon wrote:
> >>This allows user memory to be written to during the course of a kprobe.
> >>It shouldn't be used to implement any kind of security mechanism
> >>because of TOC-TOU attacks, but rather to debug, divert, and
> >>manipulate execution of semi-cooperative processes.
> >>
> >>Although it uses probe_kernel_write, we limit the address space
> >>the probe can write into by checking the space with access_ok.
> >>This is so the call doesn't sleep.
> >>
> >>Given this feature is experimental, and has the risk of crashing
> >>the system, we print a warning on invocation.
> >>
> >>It was tested with the tracex7 program on x86-64.
> >>
> >>Signed-off-by: Sargun Dhillon 
> >>Cc: Alexei Starovoitov 
> >>Cc: Daniel Borkmann 
> >>---
> >>  include/uapi/linux/bpf.h  | 12 
> >>  kernel/bpf/verifier.c |  9 +
> >>  kernel/trace/bpf_trace.c  | 37 +
> >>  samples/bpf/bpf_helpers.h |  2 ++
> >>  4 files changed, 60 insertions(+)
> >>
> >>diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >>index 2b7076f..4536282 100644
> >>--- a/include/uapi/linux/bpf.h
> >>+++ b/include/uapi/linux/bpf.h
> >>@@ -365,6 +365,18 @@ enum bpf_func_id {
> >> */
> >>BPF_FUNC_get_current_task,
> >>
> >>+   /**
> >>+* bpf_probe_write(void *dst, void *src, int len)
> >>+* safely attempt to write to a location
> >>+* @dst: destination address in userspace
> >>+* @src: source address on stack
> >>+* @len: number of bytes to copy
> >>+* Return:
> >>+*   Returns number of bytes that could not be copied.
> >>+*   On success, this will be zero
> >
> >that is odd comment.
> >there are only three possible return values 0, -EFAULT, -EPERM
> 
> Agree.
> 
> >>+*/
> >>+   BPF_FUNC_probe_write,
> >>+
> >>__BPF_FUNC_MAX_ID,
> >>  };
> >>
> >>diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >>index f72f23b..6785008 100644
> >>--- a/kernel/bpf/verifier.c
> >>+++ b/kernel/bpf/verifier.c
> >>@@ -1154,6 +1154,15 @@ static int check_call(struct verifier_env *env, int 
> >>func_id)
> >>return -EINVAL;
> >>}
> >>
> >>+   if (func_id == BPF_FUNC_probe_write) {
> >>+   
> >>pr_warn_once("\n");
> >>+   pr_warn_once("* bpf_probe_write: Experimental Feature in use 
> >>*\n");
> >>+   pr_warn_once("* bpf_probe_write: Feature may corrupt memory  
> >>*\n");
> >>+   
> >>pr_warn_once("\n");
> >>+   pr_notice_ratelimited("bpf_probe_write in use by: %.16s-%d",
> >>+ current->comm, task_pid_nr(current));
> >>+   }
> >
> >I think single line pr_notice_ratelimited() with 'feature may corrupt user 
> >memory'
> >will be enough.
> 
> Agree.
> 
> >Also please move this to tracing specific part into bpf_trace.c
> >similar to bpf_get_trace_printk_proto() instead of verifier.c
> 
> Yes, sorry for not being too clear about it, this spot will then be
> called by the verifier to fetch it for every function call. Meaning that
> this could get printed multiple times for loading a single program, but
> I think it's okay. If single line, I'd make that pr_warn_ratelimited(),
> and probably something like ...
> 
>  "note: %s[%d] is installing a program with bpf_probe_write helper that may 
> corrupt user memory!",
>  current->comm, task_pid_nr(current)
> 
> >>+static u64 bpf_probe_write(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
> >>+{
> >>+   void *unsafe_ptr = (void *) (long) r1;
> >>+   void *src = (void *) (long) r2;
> >>+   int size = (int) r3;
> >>+   struct task_struct *task = current;
> >>+
> >>+   /*
> >
> >bpf_trace.c follows non-net comment style, so it's good here.
> >just distracting vs the rest of net style.
> >
> >>+* Ensure we're in a user context which it is safe for the helper
> >>+* to run. This helper has no business in a kthread
> >>+*
> >>+* access_ok should prevent writing to non-user memory, but on
> >>+* some architectures (nommu, etc...) access_ok isn't enough
> >>+* So we check the current segment
> >>+*/
> >>+
> >>+   if (unlikely(in_interrupt() || (task->flags & PF_KTHREAD)))
> >>+   return -EPERM;
> >
> >Should we also add a check for !PF_EXITING ?
> >Like signals are not delivered to such tasks and I'm not sure
> >what would be the state of mm of it.
> 
> I agree, good point.
> 
> You can make that 'current->flags & (PF_KTHREAD | PF_EXITING)' and
> we don't need the extra task var either.
> 
> >>+   if (unlikely(segment_eq(get_fs(), KERNEL_DS)))
> >>+   return -EPERM;
> >>+   if (!access_ok(VERIFY_WRITE, unsafe_ptr, size))
> >>+   return -EPERM;
> >>+
> >>+   return probe_kernel_write(unsafe_ptr, src, size);
> >>+}
> >>+
>

Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack

On Fri, Jul 22, 2016 at 04:39:00PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 22, 2016 at 4:30 PM, Josh Poimboeuf  wrote:
> > On Fri, Jul 22, 2016 at 04:18:04PM -0700, Andy Lutomirski wrote:
> >> On Fri, Jul 22, 2016 at 3:20 PM, Josh Poimboeuf  
> >> wrote:
> >> > On Fri, Jul 22, 2016 at 02:46:10PM -0700, Andy Lutomirski wrote:
> >> >> On Fri, Jul 22, 2016 at 8:57 AM, Josh Poimboeuf  
> >> >> wrote:
> >> >> > On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote:
> >> >> >> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf 
> >> >> >>  wrote:
> >> >> >> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote:
> >> >> >> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf 
> >> >> >> >>  wrote:
> >> >> >> >> > Now that we can find pt_regs registers in the middle of the 
> >> >> >> >> > stack due to
> >> >> >> >> > an interrupt or exception, we can print them.  Here's what it 
> >> >> >> >> > looks
> >> >> >> >> > like:
> >> >> >> >> >
> >> >> >> >> >...
> >> >> >> >> >[] do_async_page_fault+0x2c/0xa0
> >> >> >> >> >[] async_page_fault+0x28/0x30
> >> >> >> >> >   RIP: 0010:[]  [] 
> >> >> >> >> > __clear_user+0x42/0x70
> >> >> >> >> >   RSP: 0018:88007876fd38  EFLAGS: 00010202
> >> >> >> >> >   RAX:  RBX: 0138 RCX: 
> >> >> >> >> > 0138
> >> >> >> >> >   RDX:  RSI: 0008 RDI: 
> >> >> >> >> > 0061b640
> >> >> >> >> >   RBP: 88007876fd48 R08: 000dc2ced0d0 R09: 
> >> >> >> >> > 
> >> >> >> >> >   R10: 0001 R11:  R12: 
> >> >> >> >> > 0061b640
> >> >> >> >> >   R13:  R14: 88007877 R15: 
> >> >> >> >> > 880079947200
> >> >> >> >> >[] ? __clear_user+0x42/0x70
> >> >> >> >> >[] ? __clear_user+0x23/0x70
> >> >> >> >> >[] clear_user+0x2b/0x40
> >> >> >> >> >...
> >> >> >> >>
> >> >> >> >> This looks wrong.  Here are some theories:
> >> >> >> >>
> >> >> >> >> (a) __clear_user is a reliable address that is indicated by RIP: 
> >> >> >> >> 
> >> >> >> >> Then it's found again as an unreliable address as "?
> >> >> >> >> __clear_user+0x42/0x70" by scanning the stack.  "?
> >> >> >> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the 
> >> >> >> >> stack.
> >> >> >> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been
> >> >> >> >> suppressed because it matched a reliable address?
> >> >> >> >>
> >> >> >> >> (b) You actually intended for all the addresses to be printed, in
> >> >> >> >> which case "? __clear_user+0x42/0x70" should have been
> >> >> >> >> "__clear_user+0x42/0x70" and you have a bug.  In this case, it's
> >> >> >> >> plausible that your state machine got a bit lost leading to "?
> >> >> >> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact --
> >> >> >> >> it's a real frame and you didn't find it).
> >> >> >> >>
> >> >> >> >> (c) Something else and I'm confused.
> >> >> >> >
> >> >> >> > So there's a subtle difference between addresses reported by 
> >> >> >> > regs->ip
> >> >> >> > and normal return addresses.  For example:
> >> >> >> >
> >> >> >> >...
> >> >> >> >[] smp_apic_timer_interrupt+0x3d/0x50
> >> >> >> >[] apic_timer_interrupt+0x9e/0xb0
> >> >> >> >   RIP: 0010:[]  [] 
> >> >> >> > path_init+0x0/0x750
> >> >> >> >   RSP: 0018:880036a3fd80  EFLAGS: 0296
> >> >> >> >   RAX: 88003691aa40 RBX: 880036a3ff08 RCX: 880036a3ff08
> >> >> >> >   RDX: 880036a3ff08 RSI: 0041 RDI: 880036a3fdb0
> >> >> >> >   RBP: 880036a3fda0 R08:  R09: 0010
> >> >> >> >   R10: 8080808080808080 R11: fefefefefefefeff R12: 880036a3fdb0
> >> >> >> >   R13: 0001 R14: 880036a3ff08 R15: 
> >> >> >> >
> >> >> >> >[] ? lookup_fast+0x3d0/0x3d0
> >> >> >> >[] ? path_lookupat+0x1b/0x120
> >> >> >> >[] filename_lookup+0xb1/0x180
> >> >> >> >...
> >> >> >> >
> >> >> >> > In this case the irq hit right after path_lookupat() called into
> >> >> >> > path_init().  So the "path_init+0x0" printed by __show_regs() is 
> >> >> >> > right.
> >> >> >> >
> >> >> >> > Note the backtrace reports the same address, but it instead 
> >> >> >> > describes it
> >> >> >> > as "lookup_fast+0x3d0", which is the end of lookup_fast().  That's
> >> >> >> > because normally, such an address after a call instruction at the 
> >> >> >> > end of
> >> >> >> > a function would indicate a tail call (e.g., to a noreturn 
> >> >> >> > function).
> >> >> >> > If that were the case, printing "path_init+0x0" would be completely
> >> >> >> > wrong, because path_init() just happens to be the function located
> >> >> >> > immediately after lookup_fast().
> >> >> >> >
> >> >> >> > Maybe I could add some special logic to say: "if this return 
> >> >> >> > address was
> >> >> >> > from

Re: [PATCH v4 7/7] power_supply: make use of new strcpytoupper() function

2016-07-22 Thread Sebastian Reichel

Hi,

On Fri, Jul 22, 2016 at 04:31:09PM -0700, Markus Mayer wrote:
> Call strcpytoupper() rather than walking the string explicitly to
> convert it to uppercase.
> 
> Signed-off-by: Markus Mayer 
> ---
>  drivers/power/power_supply_sysfs.c | 13 +
>  1 file changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/power/power_supply_sysfs.c 
> b/drivers/power/power_supply_sysfs.c
> index 80fed98..20fdcc5 100644
> --- a/drivers/power/power_supply_sysfs.c
> +++ b/drivers/power/power_supply_sysfs.c
> @@ -256,19 +256,16 @@ void power_supply_init_attrs(struct device_type 
> *dev_type)
>  
>  static char *kstruprdup(const char *str, gfp_t gfp)
>  {
> - char *ret, *ustr;
> + char *ustr;
>  
> - ustr = ret = kmalloc(strlen(str) + 1, gfp);
> + ustr = kmalloc(strlen(str) + 1, gfp);
>  
> - if (!ret)
> + if (!ustr)
>   return NULL;
>  
> - while (*str)
> - *ustr++ = toupper(*str++);
> + strcpytoupper(ustr, str);
>  
> - *ustr = 0;
> -
> - return ret;
> + return ustr;
>  }
>  
>  int power_supply_uevent(struct device *dev, struct kobj_uevent_env *env)

Acked-By: Sebastian Reichel 

Note: I plan to merge [0] (Move power supply subsystem to
drivers/power/supply) directly after 4.8-rc1, which will
result in merge conflicts if this goes into the kernel via
some other subsystem.

[0] https://marc.info/?l=linux-pm=146617051028786=2

-- Sebastian


signature.asc
Description: PGP signature

Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack

On Fri, Jul 22, 2016 at 04:39:00PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 22, 2016 at 4:30 PM, Josh Poimboeuf  wrote:
> > On Fri, Jul 22, 2016 at 04:18:04PM -0700, Andy Lutomirski wrote:
> >> On Fri, Jul 22, 2016 at 3:20 PM, Josh Poimboeuf  
> >> wrote:
> >> > On Fri, Jul 22, 2016 at 02:46:10PM -0700, Andy Lutomirski wrote:
> >> >> On Fri, Jul 22, 2016 at 8:57 AM, Josh Poimboeuf  
> >> >> wrote:
> >> >> > On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote:
> >> >> >> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf 
> >> >> >>  wrote:
> >> >> >> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote:
> >> >> >> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf 
> >> >> >> >>  wrote:
> >> >> >> >> > Now that we can find pt_regs registers in the middle of the 
> >> >> >> >> > stack due to
> >> >> >> >> > an interrupt or exception, we can print them.  Here's what it 
> >> >> >> >> > looks
> >> >> >> >> > like:
> >> >> >> >> >
> >> >> >> >> >...
> >> >> >> >> >[] do_async_page_fault+0x2c/0xa0
> >> >> >> >> >[] async_page_fault+0x28/0x30
> >> >> >> >> >   RIP: 0010:[]  [] 
> >> >> >> >> > __clear_user+0x42/0x70
> >> >> >> >> >   RSP: 0018:88007876fd38  EFLAGS: 00010202
> >> >> >> >> >   RAX:  RBX: 0138 RCX: 
> >> >> >> >> > 0138
> >> >> >> >> >   RDX:  RSI: 0008 RDI: 
> >> >> >> >> > 0061b640
> >> >> >> >> >   RBP: 88007876fd48 R08: 000dc2ced0d0 R09: 
> >> >> >> >> > 
> >> >> >> >> >   R10: 0001 R11:  R12: 
> >> >> >> >> > 0061b640
> >> >> >> >> >   R13:  R14: 88007877 R15: 
> >> >> >> >> > 880079947200
> >> >> >> >> >[] ? __clear_user+0x42/0x70
> >> >> >> >> >[] ? __clear_user+0x23/0x70
> >> >> >> >> >[] clear_user+0x2b/0x40
> >> >> >> >> >...
> >> >> >> >>
> >> >> >> >> This looks wrong.  Here are some theories:
> >> >> >> >>
> >> >> >> >> (a) __clear_user is a reliable address that is indicated by RIP: 
> >> >> >> >> 
> >> >> >> >> Then it's found again as an unreliable address as "?
> >> >> >> >> __clear_user+0x42/0x70" by scanning the stack.  "?
> >> >> >> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the 
> >> >> >> >> stack.
> >> >> >> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been
> >> >> >> >> suppressed because it matched a reliable address?
> >> >> >> >>
> >> >> >> >> (b) You actually intended for all the addresses to be printed, in
> >> >> >> >> which case "? __clear_user+0x42/0x70" should have been
> >> >> >> >> "__clear_user+0x42/0x70" and you have a bug.  In this case, it's
> >> >> >> >> plausible that your state machine got a bit lost leading to "?
> >> >> >> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact --
> >> >> >> >> it's a real frame and you didn't find it).
> >> >> >> >>
> >> >> >> >> (c) Something else and I'm confused.
> >> >> >> >
> >> >> >> > So there's a subtle difference between addresses reported by 
> >> >> >> > regs->ip
> >> >> >> > and normal return addresses.  For example:
> >> >> >> >
> >> >> >> >...
> >> >> >> >[] smp_apic_timer_interrupt+0x3d/0x50
> >> >> >> >[] apic_timer_interrupt+0x9e/0xb0
> >> >> >> >   RIP: 0010:[]  [] 
> >> >> >> > path_init+0x0/0x750
> >> >> >> >   RSP: 0018:880036a3fd80  EFLAGS: 0296
> >> >> >> >   RAX: 88003691aa40 RBX: 880036a3ff08 RCX: 880036a3ff08
> >> >> >> >   RDX: 880036a3ff08 RSI: 0041 RDI: 880036a3fdb0
> >> >> >> >   RBP: 880036a3fda0 R08:  R09: 0010
> >> >> >> >   R10: 8080808080808080 R11: fefefefefefefeff R12: 880036a3fdb0
> >> >> >> >   R13: 0001 R14: 880036a3ff08 R15: 
> >> >> >> >
> >> >> >> >[] ? lookup_fast+0x3d0/0x3d0
> >> >> >> >[] ? path_lookupat+0x1b/0x120
> >> >> >> >[] filename_lookup+0xb1/0x180
> >> >> >> >...
> >> >> >> >
> >> >> >> > In this case the irq hit right after path_lookupat() called into
> >> >> >> > path_init().  So the "path_init+0x0" printed by __show_regs() is 
> >> >> >> > right.
> >> >> >> >
> >> >> >> > Note the backtrace reports the same address, but it instead 
> >> >> >> > describes it
> >> >> >> > as "lookup_fast+0x3d0", which is the end of lookup_fast().  That's
> >> >> >> > because normally, such an address after a call instruction at the 
> >> >> >> > end of
> >> >> >> > a function would indicate a tail call (e.g., to a noreturn 
> >> >> >> > function).
> >> >> >> > If that were the case, printing "path_init+0x0" would be completely
> >> >> >> > wrong, because path_init() just happens to be the function located
> >> >> >> > immediately after lookup_fast().
> >> >> >> >
> >> >> >> > Maybe I could add some special logic to say: "if this return 
> >> >> >> > address was
> >> >> >> > from a call, use printk_stack_address(); else if it was from a 
> >> >> >> > pt_regs,
> >> >> >> > use

Re: [PATCH v4 7/7] power_supply: make use of new strcpytoupper() function

2016-07-22 Thread Sebastian Reichel

Hi,

On Fri, Jul 22, 2016 at 04:31:09PM -0700, Markus Mayer wrote:
> Call strcpytoupper() rather than walking the string explicitly to
> convert it to uppercase.
> 
> Signed-off-by: Markus Mayer 
> ---
>  drivers/power/power_supply_sysfs.c | 13 +
>  1 file changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/power/power_supply_sysfs.c 
> b/drivers/power/power_supply_sysfs.c
> index 80fed98..20fdcc5 100644
> --- a/drivers/power/power_supply_sysfs.c
> +++ b/drivers/power/power_supply_sysfs.c
> @@ -256,19 +256,16 @@ void power_supply_init_attrs(struct device_type 
> *dev_type)
>  
>  static char *kstruprdup(const char *str, gfp_t gfp)
>  {
> - char *ret, *ustr;
> + char *ustr;
>  
> - ustr = ret = kmalloc(strlen(str) + 1, gfp);
> + ustr = kmalloc(strlen(str) + 1, gfp);
>  
> - if (!ret)
> + if (!ustr)
>   return NULL;
>  
> - while (*str)
> - *ustr++ = toupper(*str++);
> + strcpytoupper(ustr, str);
>  
> - *ustr = 0;
> -
> - return ret;
> + return ustr;
>  }
>  
>  int power_supply_uevent(struct device *dev, struct kobj_uevent_env *env)

Acked-By: Sebastian Reichel 

Note: I plan to merge [0] (Move power supply subsystem to
drivers/power/supply) directly after 4.8-rc1, which will
result in merge conflicts if this goes into the kernel via
some other subsystem.

[0] https://marc.info/?l=linux-pm=146617051028786=2

-- Sebastian


signature.asc
Description: PGP signature

Re: [PATCH v4 4/7] staging: speakup: replace spk_strlwr() with strlcpytolower()

2016-07-22 Thread Chris Brannon

Markus Mayer  writes:

> After introducing generic strltolower() and strtolower(), spk_strlwr()
> is no longer needed.
>
> Signed-off-by: Markus Mayer 
> Acked-by: Samuel Thibault 

Acked-by: Chris Brannon 

Haven't looked at much kernel code in a while, but this looks good.

-- Chris

Re: [PATCH v4 4/7] staging: speakup: replace spk_strlwr() with strlcpytolower()

2016-07-22 Thread Chris Brannon

Markus Mayer  writes:

> After introducing generic strltolower() and strtolower(), spk_strlwr()
> is no longer needed.
>
> Signed-off-by: Markus Mayer 
> Acked-by: Samuel Thibault 

Acked-by: Chris Brannon 

Haven't looked at much kernel code in a while, but this looks good.

-- Chris

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

On Fri, Jul 22, 2016 at 04:26:46PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf  wrote:
> > valid_stack_ptr() is buggy: it assumes that all stacks are of size
> > THREAD_SIZE, which is not true for exception stacks.  So the
> > walk_stack() callbacks will need to know the location of the beginning
> > of the stack as well as the end.
> >
> > Another issue is that in general the various features of a stack (type,
> > size, next stack pointer, description string) are scattered around in
> > various places throughout the stack dump code.
> 
> I finally figured out what visit_info is.  But would it make more
> sense to track it in the unwind state so that the unwinder can
> directly make sure it doesn't start looping?

Well, the unwinders aren't the only users of get_stack_info() and the
visit_mask.  show_trace_log_lvl() also uses it.

But it would probably be cleaner to at least do the visit_mask bit
testing/setting in get_stack_info() rather than in the in_*_stack()
functions.

> And please remove test_and_set_bit() -- it's pointlessly slow.

Ok.

> 
> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
> > +unsigned long *visit_mask)
> > +{
> > +   unsigned long *begin = (unsigned long 
> > *)this_cpu_read(hardirq_stack);
> > +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
> > +
> > +   if (stack < begin || stack >= end)
> > +   return false;
> > +
> > +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
> > +   return false;
> > +
> > +   info->type  = STACK_TYPE_IRQ;
> > +   info->begin = begin;
> > +   info->end   = end;
> > +   info->next  = (unsigned long *)*begin;
> 
> This works, but it's a bit magic.  I don't suppose we could get rid of
> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
> move from stack to stack by querying the stack type of whatever the
> frame base address is if the frame base address ends up being out of
> bounds for the current stack?  Or maybe the unwinder could even do
> this by itself.

I'm not quite sure what you mean here.  The ->next stack pointer is
quite useful and it abstracts that ugliness away from the callers of
get_stack_info().  I'm open to any specific suggestions.

> 
> > +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
> > +  unsigned long *visit_mask)
> >  {
> > unsigned long stack = (unsigned long)s;
> > unsigned long begin, end;
> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long 
> > *s, char **name,
> > if (stack < begin || stack >= end)
> > continue;
> >
> > -   if (test_and_set_bit(k, visit_mask))
> > +   if (visit_mask &&
> > +   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
> > return false;
> >
> > -   *name = exception_stack_names[k];
> > -   return (unsigned long *)end;
> > +   info->type  = STACK_TYPE_EXCEPTION + k;
> > +   info->begin = (unsigned long *)begin;
> > +   info->end   = (unsigned long *)end;
> > +   info->next  = (unsigned long *)info->end[-2];
> 
> This is so magical that I don't immediately see why it's correct.
> Presumably it's because the thing two slots down from the top of the
> stack is regs->sp?  If so, that needs a comment.

Heck if I know, I just stole it from dump_trace() ;-)

I'll figure it out and add a comment.

> But again, couldn't we use the fact that we now know how to decode
> pt_regs to avoid needing this?  I can imagine it being useful as a
> fallback in the event that the unwinder fails, but this is just a
> fallback.

Yeah, this is needed as a fallback.  But I wouldn't call it "just" a
fallback: the stack dump code *needs* to be able to still traverse the
stacks if frame pointers fail.

> Also, NMI is weird and I'm wondering whether this works at
> all when trying to unwind from a looped NMI.

Unless I'm missing something, I think it should be fine for nested NMIs,
since they're all on the same stack.  I can try to test it.  What in
particular are you worried about?

> Fixing this up could be a followup after this series is in, I think --
> you're preserving existing behavior AFAICS.  I just don't particularly
> like the existing behavior.
> 
> FWIW, I think this code needs to be explicitly tested for the 32-bit
> double fault case.  It's highly magical.

Ok, I'll test it.

-- 
Josh

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

On Fri, Jul 22, 2016 at 04:26:46PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf  wrote:
> > valid_stack_ptr() is buggy: it assumes that all stacks are of size
> > THREAD_SIZE, which is not true for exception stacks.  So the
> > walk_stack() callbacks will need to know the location of the beginning
> > of the stack as well as the end.
> >
> > Another issue is that in general the various features of a stack (type,
> > size, next stack pointer, description string) are scattered around in
> > various places throughout the stack dump code.
> 
> I finally figured out what visit_info is.  But would it make more
> sense to track it in the unwind state so that the unwinder can
> directly make sure it doesn't start looping?

Well, the unwinders aren't the only users of get_stack_info() and the
visit_mask.  show_trace_log_lvl() also uses it.

But it would probably be cleaner to at least do the visit_mask bit
testing/setting in get_stack_info() rather than in the in_*_stack()
functions.

> And please remove test_and_set_bit() -- it's pointlessly slow.

Ok.

> 
> > +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
> > +unsigned long *visit_mask)
> > +{
> > +   unsigned long *begin = (unsigned long 
> > *)this_cpu_read(hardirq_stack);
> > +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
> > +
> > +   if (stack < begin || stack >= end)
> > +   return false;
> > +
> > +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
> > +   return false;
> > +
> > +   info->type  = STACK_TYPE_IRQ;
> > +   info->begin = begin;
> > +   info->end   = end;
> > +   info->next  = (unsigned long *)*begin;
> 
> This works, but it's a bit magic.  I don't suppose we could get rid of
> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
> move from stack to stack by querying the stack type of whatever the
> frame base address is if the frame base address ends up being out of
> bounds for the current stack?  Or maybe the unwinder could even do
> this by itself.

I'm not quite sure what you mean here.  The ->next stack pointer is
quite useful and it abstracts that ugliness away from the callers of
get_stack_info().  I'm open to any specific suggestions.

> 
> > +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
> > +  unsigned long *visit_mask)
> >  {
> > unsigned long stack = (unsigned long)s;
> > unsigned long begin, end;
> > @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long 
> > *s, char **name,
> > if (stack < begin || stack >= end)
> > continue;
> >
> > -   if (test_and_set_bit(k, visit_mask))
> > +   if (visit_mask &&
> > +   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
> > return false;
> >
> > -   *name = exception_stack_names[k];
> > -   return (unsigned long *)end;
> > +   info->type  = STACK_TYPE_EXCEPTION + k;
> > +   info->begin = (unsigned long *)begin;
> > +   info->end   = (unsigned long *)end;
> > +   info->next  = (unsigned long *)info->end[-2];
> 
> This is so magical that I don't immediately see why it's correct.
> Presumably it's because the thing two slots down from the top of the
> stack is regs->sp?  If so, that needs a comment.

Heck if I know, I just stole it from dump_trace() ;-)

I'll figure it out and add a comment.

> But again, couldn't we use the fact that we now know how to decode
> pt_regs to avoid needing this?  I can imagine it being useful as a
> fallback in the event that the unwinder fails, but this is just a
> fallback.

Yeah, this is needed as a fallback.  But I wouldn't call it "just" a
fallback: the stack dump code *needs* to be able to still traverse the
stacks if frame pointers fail.

> Also, NMI is weird and I'm wondering whether this works at
> all when trying to unwind from a looped NMI.

Unless I'm missing something, I think it should be fine for nested NMIs,
since they're all on the same stack.  I can try to test it.  What in
particular are you worried about?

> Fixing this up could be a followup after this series is in, I think --
> you're preserving existing behavior AFAICS.  I just don't particularly
> like the existing behavior.
> 
> FWIW, I think this code needs to be explicitly tested for the 32-bit
> double fault case.  It's highly magical.

Ok, I'll test it.

-- 
Josh

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

On Fri, Jul 22, 2016 at 4:26 PM, Andy Lutomirski  wrote:
> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf  wrote:
>> valid_stack_ptr() is buggy: it assumes that all stacks are of size
>> THREAD_SIZE, which is not true for exception stacks.  So the
>> walk_stack() callbacks will need to know the location of the beginning
>> of the stack as well as the end.
>>
>> Another issue is that in general the various features of a stack (type,
>> size, next stack pointer, description string) are scattered around in
>> various places throughout the stack dump code.
>
> I finally figured out what visit_info is.  But would it make more
> sense to track it in the unwind state so that the unwinder can
> directly make sure it doesn't start looping?
>

I just realized that it *is* in the unwind state.  But maybe this code
in update_stack_state:

sp = info->next;
if (!sp).
goto unknown;

if (get_stack_info(sp, state->task, info, >stack_mask))
goto unknown;

if (!on_stack(info, addr, len))
goto unknown;

should do something like:

if (get_stack_info(addr, ...))
  goto unknown.

sp = info->end;

instead.  Alternatively, maybe it would make sense to keep sp as is
(have update_stack_state return bool instead of returning a pointer)
so that a frame that switches stacks still shows the actual sp at the
time that the frame called whatever the it called.

I'm really quite confused by what state->sp means in your current
code.  In the non-stack-switching case (everything is on the thread
stack), it appears to always match state->bp.  Am I missing something?
 If I'm understanding this correctly, when you're pointing at a call
frame, state->bp is that frame's base address (the top of the stack
frame), unwind_get_return_address() returns the address to which that
frame would return, and, in the future, unwind_get_gpr(UNWIND_DI) or
whatever it ends up looking like will return RDI at the time that the
frame called whatever function it called, if known.  By that logic,
shouldn't state->sp be sp on entry to the call instruction?  (Or could
sp just be removed?  Does it do anything?)

I guess the reason I'm still not 100% comfortable with the idea that
pt_regs frames don't exist a real frames is that I keep waffling as to
how I should think about the regs associated with a frame in the
future DWARF world.  I think I imagine them being the regs at the time
that the frame did it's call to the next frame, which, by an
admittedly weak analogy, means that the pt_regs state would be the
regs at the time that the exception or interrupt happened.  That
offers a third silly option for dealing with the annoying '?': emit
two frames for a pt_regs, but have the frame in the entry code show
NULL for its return address because it's not a normal return.

> And please remove test_and_set_bit() -- it's pointlessly slow.
>
>> +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
>> +unsigned long *visit_mask)
>> +{
>> +   unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack);
>> +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
>> +
>> +   if (stack < begin || stack >= end)
>> +   return false;
>> +
>> +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
>> +   return false;
>> +
>> +   info->type  = STACK_TYPE_IRQ;
>> +   info->begin = begin;
>> +   info->end   = end;
>> +   info->next  = (unsigned long *)*begin;
>
> This works, but it's a bit magic.  I don't suppose we could get rid of
> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
> move from stack to stack by querying the stack type of whatever the
> frame base address is if the frame base address ends up being out of
> bounds for the current stack?  Or maybe the unwinder could even do
> this by itself.
>
>> +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
>> +  unsigned long *visit_mask)
>>  {
>> unsigned long stack = (unsigned long)s;
>> unsigned long begin, end;
>> @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long 
>> *s, char **name,
>> if (stack < begin || stack >= end)
>> continue;
>>
>> -   if (test_and_set_bit(k, visit_mask))
>> +   if (visit_mask &&
>> +   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
>> return false;
>>
>> -   *name = exception_stack_names[k];
>> -   return (unsigned long *)end;
>> +   info->type  = STACK_TYPE_EXCEPTION + k;
>> +   info->begin = (unsigned long *)begin;
>> +   info->end   = (unsigned long *)end;
>> +   info->next  = (unsigned long *)info->end[-2];
>
> This is so magical that I don't immediately see why

Re: [PATCH 10/19] x86/dumpstack: add get_stack_info() interface

On Fri, Jul 22, 2016 at 4:26 PM, Andy Lutomirski  wrote:
> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf  wrote:
>> valid_stack_ptr() is buggy: it assumes that all stacks are of size
>> THREAD_SIZE, which is not true for exception stacks.  So the
>> walk_stack() callbacks will need to know the location of the beginning
>> of the stack as well as the end.
>>
>> Another issue is that in general the various features of a stack (type,
>> size, next stack pointer, description string) are scattered around in
>> various places throughout the stack dump code.
>
> I finally figured out what visit_info is.  But would it make more
> sense to track it in the unwind state so that the unwinder can
> directly make sure it doesn't start looping?
>

I just realized that it *is* in the unwind state.  But maybe this code
in update_stack_state:

sp = info->next;
if (!sp).
goto unknown;

if (get_stack_info(sp, state->task, info, >stack_mask))
goto unknown;

if (!on_stack(info, addr, len))
goto unknown;

should do something like:

if (get_stack_info(addr, ...))
  goto unknown.

sp = info->end;

instead.  Alternatively, maybe it would make sense to keep sp as is
(have update_stack_state return bool instead of returning a pointer)
so that a frame that switches stacks still shows the actual sp at the
time that the frame called whatever the it called.

I'm really quite confused by what state->sp means in your current
code.  In the non-stack-switching case (everything is on the thread
stack), it appears to always match state->bp.  Am I missing something?
 If I'm understanding this correctly, when you're pointing at a call
frame, state->bp is that frame's base address (the top of the stack
frame), unwind_get_return_address() returns the address to which that
frame would return, and, in the future, unwind_get_gpr(UNWIND_DI) or
whatever it ends up looking like will return RDI at the time that the
frame called whatever function it called, if known.  By that logic,
shouldn't state->sp be sp on entry to the call instruction?  (Or could
sp just be removed?  Does it do anything?)

I guess the reason I'm still not 100% comfortable with the idea that
pt_regs frames don't exist a real frames is that I keep waffling as to
how I should think about the regs associated with a frame in the
future DWARF world.  I think I imagine them being the regs at the time
that the frame did it's call to the next frame, which, by an
admittedly weak analogy, means that the pt_regs state would be the
regs at the time that the exception or interrupt happened.  That
offers a third silly option for dealing with the annoying '?': emit
two frames for a pt_regs, but have the frame in the entry code show
NULL for its return address because it's not a normal return.

> And please remove test_and_set_bit() -- it's pointlessly slow.
>
>> +static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info,
>> +unsigned long *visit_mask)
>> +{
>> +   unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack);
>> +   unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
>> +
>> +   if (stack < begin || stack >= end)
>> +   return false;
>> +
>> +   if (visit_mask && test_and_set_bit(STACK_TYPE_IRQ, visit_mask))
>> +   return false;
>> +
>> +   info->type  = STACK_TYPE_IRQ;
>> +   info->begin = begin;
>> +   info->end   = end;
>> +   info->next  = (unsigned long *)*begin;
>
> This works, but it's a bit magic.  I don't suppose we could get rid of
> this ->next thing entirely and teach show_stack_log_lvl(), etc. to
> move from stack to stack by querying the stack type of whatever the
> frame base address is if the frame base address ends up being out of
> bounds for the current stack?  Or maybe the unwinder could even do
> this by itself.
>
>> +static bool in_exception_stack(unsigned long *s, struct stack_info *info,
>> +  unsigned long *visit_mask)
>>  {
>> unsigned long stack = (unsigned long)s;
>> unsigned long begin, end;
>> @@ -44,55 +63,62 @@ static unsigned long *in_exception_stack(unsigned long 
>> *s, char **name,
>> if (stack < begin || stack >= end)
>> continue;
>>
>> -   if (test_and_set_bit(k, visit_mask))
>> +   if (visit_mask &&
>> +   test_and_set_bit(STACK_TYPE_EXCEPTION + k, visit_mask))
>> return false;
>>
>> -   *name = exception_stack_names[k];
>> -   return (unsigned long *)end;
>> +   info->type  = STACK_TYPE_EXCEPTION + k;
>> +   info->begin = (unsigned long *)begin;
>> +   info->end   = (unsigned long *)end;
>> +   info->next  = (unsigned long *)info->end[-2];
>
> This is so magical that I don't immediately see why it's correct.
> Presumably it's because the

Re: [PATCH] Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"

2016-07-22 Thread Viresh Kumar

On 23-07-16, 01:47, Rafael J. Wysocki wrote:
> I'll apply the revert with a "Cc: stable" tag.

That will work.

> Question is what to do about the other drivers setting
> cpuinfo.transition_latency to CPUFREQ_ETERNAL.

Perhaps leave them as is unless someone comes and reports a problem, they don't
seem to have any problem right now anyway :)

-- 
viresh

Re: [PATCH] Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"

2016-07-22 Thread Viresh Kumar

On 23-07-16, 01:47, Rafael J. Wysocki wrote:
> I'll apply the revert with a "Cc: stable" tag.

That will work.

> Question is what to do about the other drivers setting
> cpuinfo.transition_latency to CPUFREQ_ETERNAL.

Perhaps leave them as is unless someone comes and reports a problem, they don't
seem to have any problem right now anyway :)

-- 
viresh

Re: [PATCH] Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"

2016-07-22 Thread Rafael J. Wysocki

On Sat, Jul 23, 2016 at 1:30 AM, Viresh Kumar  wrote:
> On 22-07-16, 23:46, Rafael J. Wysocki wrote:
>> On Friday, July 22, 2016 02:28:52 PM Viresh Kumar wrote:
>> > On 22-07-16, 23:31, Rafael J. Wysocki wrote:
>> > > > cpufreq.c
>> > > >
>> > > > if (policy->governor->max_transition_latency &&
>> > > > policy->cpuinfo.transition_latency >
>> > > > policy->governor->max_transition_latency) {
>> > > >
>> > > > - And this check will always fail, unless max_transition_latency is 
>> > > > zero.
>> > >
>> > > Why would it fail?  If governor->max_transition_latency is non-zero, but 
>> > > less
>> > > than UNIT_MAX, the condition checked will be true to my eyes.
>> >
>> > Bad wording. Sorry.
>> >
>> > I meant, this 'if' check will always succeed (as you also noted), and
>> > so we will always get the error message reported in this patch.
>>
>> Not always, but for drivers setting cpuinfo.transition_latency to 
>> CPUFREQ_ETERNAL.
>
> So the drivers which have set their transition_latency to
> CPUFREQ_ETERNAL, can't use ondemand governor unless
> governor->max_transition_latency is set to 0 from userspace.
>
> What should be done about this patch then ? It broke in late 2015.

I'll apply the revert with a "Cc: stable" tag.

Question is what to do about the other drivers setting
cpuinfo.transition_latency to CPUFREQ_ETERNAL.

Thanks,
Rafael

Re: [PATCH] Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"

2016-07-22 Thread Rafael J. Wysocki

On Sat, Jul 23, 2016 at 1:30 AM, Viresh Kumar  wrote:
> On 22-07-16, 23:46, Rafael J. Wysocki wrote:
>> On Friday, July 22, 2016 02:28:52 PM Viresh Kumar wrote:
>> > On 22-07-16, 23:31, Rafael J. Wysocki wrote:
>> > > > cpufreq.c
>> > > >
>> > > > if (policy->governor->max_transition_latency &&
>> > > > policy->cpuinfo.transition_latency >
>> > > > policy->governor->max_transition_latency) {
>> > > >
>> > > > - And this check will always fail, unless max_transition_latency is 
>> > > > zero.
>> > >
>> > > Why would it fail?  If governor->max_transition_latency is non-zero, but 
>> > > less
>> > > than UNIT_MAX, the condition checked will be true to my eyes.
>> >
>> > Bad wording. Sorry.
>> >
>> > I meant, this 'if' check will always succeed (as you also noted), and
>> > so we will always get the error message reported in this patch.
>>
>> Not always, but for drivers setting cpuinfo.transition_latency to 
>> CPUFREQ_ETERNAL.
>
> So the drivers which have set their transition_latency to
> CPUFREQ_ETERNAL, can't use ondemand governor unless
> governor->max_transition_latency is set to 0 from userspace.
>
> What should be done about this patch then ? It broke in late 2015.

I'll apply the revert with a "Cc: stable" tag.

Question is what to do about the other drivers setting
cpuinfo.transition_latency to CPUFREQ_ETERNAL.

Thanks,
Rafael

[PATCH 1/1] EDAC, sb_edac: Fix channel reporting on Knights Landing

2016-07-22 Thread Lukasz Odzioba

On Intel Xeon Phi Knights Landing processor family the channels
of memory controller have untypical arrangement - MC0 is mapped to
CH3,4,5 and MC1 is mapped to CH0,1,2. This causes EDAC driver to
report the channel name incorrectly.

We missed this change earlier, so the code already contains
similar comment, but the translation function is incorrect.

Without this patch:
  errors in DIMM_A and DIMM_D were reported in DIMM_D
  errors in DIMM_B and DIMM_E were reported in DIMM_E
  errors in DIMM_C and DIMM_F were reported in DIMM_F

Fixes: d0cdf9003140 ("sb_edac: Add Knights Landing (Xeon Phi gen 2) support")
Signed-off-by: Lukasz Odzioba 
Signed-off-by: Hubert Chrzaniuk 
---
 drivers/edac/sb_edac.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c
index 6744d88..61e2c52 100644
--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -552,9 +552,9 @@ static const struct pci_id_table 
pci_dev_descr_haswell_table[] = {
 /* Knight's Landing Support */
 /*
  * KNL's memory channels are swizzled between memory controllers.
- * MC0 is mapped to CH3,5,6 and MC1 is mapped to CH0,1,2
+ * MC0 is mapped to CH3,4,5 and MC1 is mapped to CH0,1,2
  */
-#define knl_channel_remap(channel) ((channel + 3) % 6)
+#define knl_channel_remap(mc, chan) (mc ? chan : chan + 3)
 
 /* Memory controller, TAD tables, error injection - 2-8-0, 2-9-0 (2 of these) 
*/
 #define PCI_DEVICE_ID_INTEL_KNL_IMC_MC   0x7840
@@ -1286,7 +1286,7 @@ static u32 knl_get_mc_route(int entry, u32 reg)
mc = GET_BITFIELD(reg, entry*3, (entry*3)+2);
chan = GET_BITFIELD(reg, (entry*2) + 18, (entry*2) + 18 + 1);
 
-   return knl_channel_remap(mc*3 + chan);
+   return knl_channel_remap(mc, chan);
 }
 
 /*
@@ -3003,9 +3003,16 @@ static void sbridge_mce_output_error(struct mem_ctl_info 
*mci,
mscod, errcode,
m->bank);
} else {
-   char A = *("A");
+   char mc, A = *("A");
 
-   channel = knl_channel_remap(channel);
+   /*
+* Reported channel is in range 0-2, so we can't map it
+* back to mc. To figure out mc we check machine check
+* bank register that reported this error.
+* bank15 means mc0 and bank16 means mc1.
+*/
+   mc = m->bank == 16;
+   channel = knl_channel_remap(mc, channel);
channel_mask = 1 << channel;
snprintf(msg, sizeof(msg),
"%s%s err_code:%04x:%04x channel:%d (DIMM_%c)",
-- 
1.8.3.1

[PATCH 1/1] EDAC, sb_edac: Fix channel reporting on Knights Landing

2016-07-22 Thread Lukasz Odzioba

On Intel Xeon Phi Knights Landing processor family the channels
of memory controller have untypical arrangement - MC0 is mapped to
CH3,4,5 and MC1 is mapped to CH0,1,2. This causes EDAC driver to
report the channel name incorrectly.

We missed this change earlier, so the code already contains
similar comment, but the translation function is incorrect.

Without this patch:
  errors in DIMM_A and DIMM_D were reported in DIMM_D
  errors in DIMM_B and DIMM_E were reported in DIMM_E
  errors in DIMM_C and DIMM_F were reported in DIMM_F

Fixes: d0cdf9003140 ("sb_edac: Add Knights Landing (Xeon Phi gen 2) support")
Signed-off-by: Lukasz Odzioba 
Signed-off-by: Hubert Chrzaniuk 
---
 drivers/edac/sb_edac.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c
index 6744d88..61e2c52 100644
--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -552,9 +552,9 @@ static const struct pci_id_table 
pci_dev_descr_haswell_table[] = {
 /* Knight's Landing Support */
 /*
  * KNL's memory channels are swizzled between memory controllers.
- * MC0 is mapped to CH3,5,6 and MC1 is mapped to CH0,1,2
+ * MC0 is mapped to CH3,4,5 and MC1 is mapped to CH0,1,2
  */
-#define knl_channel_remap(channel) ((channel + 3) % 6)
+#define knl_channel_remap(mc, chan) (mc ? chan : chan + 3)
 
 /* Memory controller, TAD tables, error injection - 2-8-0, 2-9-0 (2 of these) 
*/
 #define PCI_DEVICE_ID_INTEL_KNL_IMC_MC   0x7840
@@ -1286,7 +1286,7 @@ static u32 knl_get_mc_route(int entry, u32 reg)
mc = GET_BITFIELD(reg, entry*3, (entry*3)+2);
chan = GET_BITFIELD(reg, (entry*2) + 18, (entry*2) + 18 + 1);
 
-   return knl_channel_remap(mc*3 + chan);
+   return knl_channel_remap(mc, chan);
 }
 
 /*
@@ -3003,9 +3003,16 @@ static void sbridge_mce_output_error(struct mem_ctl_info 
*mci,
mscod, errcode,
m->bank);
} else {
-   char A = *("A");
+   char mc, A = *("A");
 
-   channel = knl_channel_remap(channel);
+   /*
+* Reported channel is in range 0-2, so we can't map it
+* back to mc. To figure out mc we check machine check
+* bank register that reported this error.
+* bank15 means mc0 and bank16 means mc1.
+*/
+   mc = m->bank == 16;
+   channel = knl_channel_remap(mc, channel);
channel_mask = 1 << channel;
snprintf(msg, sizeof(msg),
"%s%s err_code:%04x:%04x channel:%d (DIMM_%c)",
-- 
1.8.3.1

Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack

On Fri, Jul 22, 2016 at 4:30 PM, Josh Poimboeuf  wrote:
> On Fri, Jul 22, 2016 at 04:18:04PM -0700, Andy Lutomirski wrote:
>> On Fri, Jul 22, 2016 at 3:20 PM, Josh Poimboeuf  wrote:
>> > On Fri, Jul 22, 2016 at 02:46:10PM -0700, Andy Lutomirski wrote:
>> >> On Fri, Jul 22, 2016 at 8:57 AM, Josh Poimboeuf  
>> >> wrote:
>> >> > On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote:
>> >> >> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf  
>> >> >> wrote:
>> >> >> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote:
>> >> >> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf 
>> >> >> >>  wrote:
>> >> >> >> > Now that we can find pt_regs registers in the middle of the stack 
>> >> >> >> > due to
>> >> >> >> > an interrupt or exception, we can print them.  Here's what it 
>> >> >> >> > looks
>> >> >> >> > like:
>> >> >> >> >
>> >> >> >> >...
>> >> >> >> >[] do_async_page_fault+0x2c/0xa0
>> >> >> >> >[] async_page_fault+0x28/0x30
>> >> >> >> >   RIP: 0010:[]  [] 
>> >> >> >> > __clear_user+0x42/0x70
>> >> >> >> >   RSP: 0018:88007876fd38  EFLAGS: 00010202
>> >> >> >> >   RAX:  RBX: 0138 RCX: 
>> >> >> >> > 0138
>> >> >> >> >   RDX:  RSI: 0008 RDI: 
>> >> >> >> > 0061b640
>> >> >> >> >   RBP: 88007876fd48 R08: 000dc2ced0d0 R09: 
>> >> >> >> > 
>> >> >> >> >   R10: 0001 R11:  R12: 
>> >> >> >> > 0061b640
>> >> >> >> >   R13:  R14: 88007877 R15: 
>> >> >> >> > 880079947200
>> >> >> >> >[] ? __clear_user+0x42/0x70
>> >> >> >> >[] ? __clear_user+0x23/0x70
>> >> >> >> >[] clear_user+0x2b/0x40
>> >> >> >> >...
>> >> >> >>
>> >> >> >> This looks wrong.  Here are some theories:
>> >> >> >>
>> >> >> >> (a) __clear_user is a reliable address that is indicated by RIP: 
>> >> >> >> 
>> >> >> >> Then it's found again as an unreliable address as "?
>> >> >> >> __clear_user+0x42/0x70" by scanning the stack.  "?
>> >> >> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the stack.
>> >> >> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been
>> >> >> >> suppressed because it matched a reliable address?
>> >> >> >>
>> >> >> >> (b) You actually intended for all the addresses to be printed, in
>> >> >> >> which case "? __clear_user+0x42/0x70" should have been
>> >> >> >> "__clear_user+0x42/0x70" and you have a bug.  In this case, it's
>> >> >> >> plausible that your state machine got a bit lost leading to "?
>> >> >> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact --
>> >> >> >> it's a real frame and you didn't find it).
>> >> >> >>
>> >> >> >> (c) Something else and I'm confused.
>> >> >> >
>> >> >> > So there's a subtle difference between addresses reported by regs->ip
>> >> >> > and normal return addresses.  For example:
>> >> >> >
>> >> >> >...
>> >> >> >[] smp_apic_timer_interrupt+0x3d/0x50
>> >> >> >[] apic_timer_interrupt+0x9e/0xb0
>> >> >> >   RIP: 0010:[]  [] 
>> >> >> > path_init+0x0/0x750
>> >> >> >   RSP: 0018:880036a3fd80  EFLAGS: 0296
>> >> >> >   RAX: 88003691aa40 RBX: 880036a3ff08 RCX: 880036a3ff08
>> >> >> >   RDX: 880036a3ff08 RSI: 0041 RDI: 880036a3fdb0
>> >> >> >   RBP: 880036a3fda0 R08:  R09: 0010
>> >> >> >   R10: 8080808080808080 R11: fefefefefefefeff R12: 880036a3fdb0
>> >> >> >   R13: 0001 R14: 880036a3ff08 R15: 
>> >> >> >
>> >> >> >[] ? lookup_fast+0x3d0/0x3d0
>> >> >> >[] ? path_lookupat+0x1b/0x120
>> >> >> >[] filename_lookup+0xb1/0x180
>> >> >> >...
>> >> >> >
>> >> >> > In this case the irq hit right after path_lookupat() called into
>> >> >> > path_init().  So the "path_init+0x0" printed by __show_regs() is 
>> >> >> > right.
>> >> >> >
>> >> >> > Note the backtrace reports the same address, but it instead 
>> >> >> > describes it
>> >> >> > as "lookup_fast+0x3d0", which is the end of lookup_fast().  That's
>> >> >> > because normally, such an address after a call instruction at the 
>> >> >> > end of
>> >> >> > a function would indicate a tail call (e.g., to a noreturn function).
>> >> >> > If that were the case, printing "path_init+0x0" would be completely
>> >> >> > wrong, because path_init() just happens to be the function located
>> >> >> > immediately after lookup_fast().
>> >> >> >
>> >> >> > Maybe I could add some special logic to say: "if this return address 
>> >> >> > was
>> >> >> > from a call, use printk_stack_address(); else if it was from a 
>> >> >> > pt_regs,
>> >> >> > use printk_address()."  (The former prints the preceding function, 
>> >> >> > the
>> >> >> > latter prints the current function.)  Then we could remove the 
>> >> >> > question
>> >> >> > mark.
>> >> >> >
>> >>

Re: [PATCH 19/19] x86/dumpstack: print any pt_regs found on the stack

On Fri, Jul 22, 2016 at 4:30 PM, Josh Poimboeuf  wrote:
> On Fri, Jul 22, 2016 at 04:18:04PM -0700, Andy Lutomirski wrote:
>> On Fri, Jul 22, 2016 at 3:20 PM, Josh Poimboeuf  wrote:
>> > On Fri, Jul 22, 2016 at 02:46:10PM -0700, Andy Lutomirski wrote:
>> >> On Fri, Jul 22, 2016 at 8:57 AM, Josh Poimboeuf  
>> >> wrote:
>> >> > On Thu, Jul 21, 2016 at 10:13:03PM -0700, Andy Lutomirski wrote:
>> >> >> On Thu, Jul 21, 2016 at 8:30 PM, Josh Poimboeuf  
>> >> >> wrote:
>> >> >> > On Thu, Jul 21, 2016 at 03:32:32PM -0700, Andy Lutomirski wrote:
>> >> >> >> On Thu, Jul 21, 2016 at 2:21 PM, Josh Poimboeuf 
>> >> >> >>  wrote:
>> >> >> >> > Now that we can find pt_regs registers in the middle of the stack 
>> >> >> >> > due to
>> >> >> >> > an interrupt or exception, we can print them.  Here's what it 
>> >> >> >> > looks
>> >> >> >> > like:
>> >> >> >> >
>> >> >> >> >...
>> >> >> >> >[] do_async_page_fault+0x2c/0xa0
>> >> >> >> >[] async_page_fault+0x28/0x30
>> >> >> >> >   RIP: 0010:[]  [] 
>> >> >> >> > __clear_user+0x42/0x70
>> >> >> >> >   RSP: 0018:88007876fd38  EFLAGS: 00010202
>> >> >> >> >   RAX:  RBX: 0138 RCX: 
>> >> >> >> > 0138
>> >> >> >> >   RDX:  RSI: 0008 RDI: 
>> >> >> >> > 0061b640
>> >> >> >> >   RBP: 88007876fd48 R08: 000dc2ced0d0 R09: 
>> >> >> >> > 
>> >> >> >> >   R10: 0001 R11:  R12: 
>> >> >> >> > 0061b640
>> >> >> >> >   R13:  R14: 88007877 R15: 
>> >> >> >> > 880079947200
>> >> >> >> >[] ? __clear_user+0x42/0x70
>> >> >> >> >[] ? __clear_user+0x23/0x70
>> >> >> >> >[] clear_user+0x2b/0x40
>> >> >> >> >...
>> >> >> >>
>> >> >> >> This looks wrong.  Here are some theories:
>> >> >> >>
>> >> >> >> (a) __clear_user is a reliable address that is indicated by RIP: 
>> >> >> >> 
>> >> >> >> Then it's found again as an unreliable address as "?
>> >> >> >> __clear_user+0x42/0x70" by scanning the stack.  "?
>> >> >> >> __clear_user+0x23/0x70" is a genuine leftover artifact on the stack.
>> >> >> >> In this case, shouldn't "? __clear_user+0x42/0x70" have been
>> >> >> >> suppressed because it matched a reliable address?
>> >> >> >>
>> >> >> >> (b) You actually intended for all the addresses to be printed, in
>> >> >> >> which case "? __clear_user+0x42/0x70" should have been
>> >> >> >> "__clear_user+0x42/0x70" and you have a bug.  In this case, it's
>> >> >> >> plausible that your state machine got a bit lost leading to "?
>> >> >> >> __clear_user+0x23/0x70" as well (i.e. it's not just an artifact --
>> >> >> >> it's a real frame and you didn't find it).
>> >> >> >>
>> >> >> >> (c) Something else and I'm confused.
>> >> >> >
>> >> >> > So there's a subtle difference between addresses reported by regs->ip
>> >> >> > and normal return addresses.  For example:
>> >> >> >
>> >> >> >...
>> >> >> >[] smp_apic_timer_interrupt+0x3d/0x50
>> >> >> >[] apic_timer_interrupt+0x9e/0xb0
>> >> >> >   RIP: 0010:[]  [] 
>> >> >> > path_init+0x0/0x750
>> >> >> >   RSP: 0018:880036a3fd80  EFLAGS: 0296
>> >> >> >   RAX: 88003691aa40 RBX: 880036a3ff08 RCX: 880036a3ff08
>> >> >> >   RDX: 880036a3ff08 RSI: 0041 RDI: 880036a3fdb0
>> >> >> >   RBP: 880036a3fda0 R08:  R09: 0010
>> >> >> >   R10: 8080808080808080 R11: fefefefefefefeff R12: 880036a3fdb0
>> >> >> >   R13: 0001 R14: 880036a3ff08 R15: 
>> >> >> >
>> >> >> >[] ? lookup_fast+0x3d0/0x3d0
>> >> >> >[] ? path_lookupat+0x1b/0x120
>> >> >> >[] filename_lookup+0xb1/0x180
>> >> >> >...
>> >> >> >
>> >> >> > In this case the irq hit right after path_lookupat() called into
>> >> >> > path_init().  So the "path_init+0x0" printed by __show_regs() is 
>> >> >> > right.
>> >> >> >
>> >> >> > Note the backtrace reports the same address, but it instead 
>> >> >> > describes it
>> >> >> > as "lookup_fast+0x3d0", which is the end of lookup_fast().  That's
>> >> >> > because normally, such an address after a call instruction at the 
>> >> >> > end of
>> >> >> > a function would indicate a tail call (e.g., to a noreturn function).
>> >> >> > If that were the case, printing "path_init+0x0" would be completely
>> >> >> > wrong, because path_init() just happens to be the function located
>> >> >> > immediately after lookup_fast().
>> >> >> >
>> >> >> > Maybe I could add some special logic to say: "if this return address 
>> >> >> > was
>> >> >> > from a call, use printk_stack_address(); else if it was from a 
>> >> >> > pt_regs,
>> >> >> > use printk_address()."  (The former prints the preceding function, 
>> >> >> > the
>> >> >> > latter prints the current function.)  Then we could remove the 
>> >> >> > question
>> >> >> > mark.
>> >> >> >
>> >> >> > There's also the question of whether or not the address should be
>> >> >> > printed again, after

[git pull] Input updates for 4.7-rc7 (part 2)

2016-07-22 Thread Dmitry Torokhov

Hi Linus,

Please pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input.git for-linus

to receive  a few more fixes for the input subsystem:

- restore naming for tsc2005 touchscreens as some userspace match on it
- fix out of bound access in legacy keyboard driver
- fixup in RMI4 driver

Everything is tagged for stable as well.

Changelog:
-

Andrew Duggan (1):
  Input: synaptics-rmi4 - fix maximum size check for F12 control register 8

Dmitry Torokhov (1):
  tty/vt/keyboard: fix OOB access in do_compute_shiftstate()

Michael Welling (1):
  Input: tsc200x - report proper input_dev name


Diffstat:


 drivers/input/rmi4/rmi_f12.c |  9 +
 drivers/input/touchscreen/tsc2004.c  |  7 ++-
 drivers/input/touchscreen/tsc2005.c  |  7 ++-
 drivers/input/touchscreen/tsc200x-core.c | 15 ---
 drivers/input/touchscreen/tsc200x-core.h |  2 +-
 drivers/tty/vt/keyboard.c| 30 +-
 6 files changed, 39 insertions(+), 31 deletions(-)

[git pull] Input updates for 4.7-rc7 (part 2)

2016-07-22 Thread Dmitry Torokhov

Hi Linus,

Please pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input.git for-linus

to receive  a few more fixes for the input subsystem:

- restore naming for tsc2005 touchscreens as some userspace match on it
- fix out of bound access in legacy keyboard driver
- fixup in RMI4 driver

Everything is tagged for stable as well.

Changelog:
-

Andrew Duggan (1):
  Input: synaptics-rmi4 - fix maximum size check for F12 control register 8

Dmitry Torokhov (1):
  tty/vt/keyboard: fix OOB access in do_compute_shiftstate()

Michael Welling (1):
  Input: tsc200x - report proper input_dev name


Diffstat:


 drivers/input/rmi4/rmi_f12.c |  9 +
 drivers/input/touchscreen/tsc2004.c  |  7 ++-
 drivers/input/touchscreen/tsc2005.c  |  7 ++-
 drivers/input/touchscreen/tsc200x-core.c | 15 ---
 drivers/input/touchscreen/tsc200x-core.h |  2 +-
 drivers/tty/vt/keyboard.c| 30 +-
 6 files changed, 39 insertions(+), 31 deletions(-)

[PATCH v4 7/7] power_supply: make use of new strcpytoupper() function

Call strcpytoupper() rather than walking the string explicitly to
convert it to uppercase.

Signed-off-by: Markus Mayer 
---
 drivers/power/power_supply_sysfs.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/power/power_supply_sysfs.c 
b/drivers/power/power_supply_sysfs.c
index 80fed98..20fdcc5 100644
--- a/drivers/power/power_supply_sysfs.c
+++ b/drivers/power/power_supply_sysfs.c
@@ -256,19 +256,16 @@ void power_supply_init_attrs(struct device_type *dev_type)
 
 static char *kstruprdup(const char *str, gfp_t gfp)
 {
-   char *ret, *ustr;
+   char *ustr;
 
-   ustr = ret = kmalloc(strlen(str) + 1, gfp);
+   ustr = kmalloc(strlen(str) + 1, gfp);
 
-   if (!ret)
+   if (!ustr)
return NULL;
 
-   while (*str)
-   *ustr++ = toupper(*str++);
+   strcpytoupper(ustr, str);
 
-   *ustr = 0;
-
-   return ret;
+   return ustr;
 }
 
 int power_supply_uevent(struct device *dev, struct kobj_uevent_env *env)
-- 
2.7.4

[PATCH v4 7/7] power_supply: make use of new strcpytoupper() function

Call strcpytoupper() rather than walking the string explicitly to
convert it to uppercase.

Signed-off-by: Markus Mayer 
---
 drivers/power/power_supply_sysfs.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/power/power_supply_sysfs.c 
b/drivers/power/power_supply_sysfs.c
index 80fed98..20fdcc5 100644
--- a/drivers/power/power_supply_sysfs.c
+++ b/drivers/power/power_supply_sysfs.c
@@ -256,19 +256,16 @@ void power_supply_init_attrs(struct device_type *dev_type)
 
 static char *kstruprdup(const char *str, gfp_t gfp)
 {
-   char *ret, *ustr;
+   char *ustr;
 
-   ustr = ret = kmalloc(strlen(str) + 1, gfp);
+   ustr = kmalloc(strlen(str) + 1, gfp);
 
-   if (!ret)
+   if (!ustr)
return NULL;
 
-   while (*str)
-   *ustr++ = toupper(*str++);
+   strcpytoupper(ustr, str);
 
-   *ustr = 0;
-
-   return ret;
+   return ustr;
 }
 
 int power_supply_uevent(struct device *dev, struct kobj_uevent_env *env)
-- 
2.7.4

[PATCH v4 6/7] drm/nouveau/fifo/gk104: make use of new strcpytoupper() function

Call strcpytoupper() rather than copying the string explicitly and then
walking it to convert it to uppercase.

Signed-off-by: Markus Mayer 
---
 drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c 
b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c
index 743f3a1..8d01032 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c
@@ -332,10 +332,7 @@ gk104_fifo_intr_fault(struct gk104_fifo *fifo, int unit)
enum nvkm_devidx engidx = nvkm_top_fault(device->top, unit);
if (engidx < NVKM_SUBDEV_NR) {
const char *src = nvkm_subdev_name[engidx];
-   char *dst = en;
-   do {
-   *dst++ = toupper(*src++);
-   } while(*src);
+   strcpytoupper(en, src);
engine = nvkm_device_engine(device, engidx);
}
} else {
-- 
2.7.4

[PATCH v4 6/7] drm/nouveau/fifo/gk104: make use of new strcpytoupper() function