Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-14 Thread Darrick J. Wong
On Mon, Sep 12, 2016 at 11:40:35AM +1000, Dave Chinner wrote:
> On Thu, Sep 08, 2016 at 04:56:36PM -0600, Ross Zwisler wrote:
> > On Wed, Sep 07, 2016 at 09:32:36PM -0700, Dan Williams wrote:
> > > My understanding is that it is looking for the VM_MIXEDMAP flag which
> > > is already ambiguous for determining if DAX is enabled even if this
> > > dynamic listing issue is fixed.  XFS has arranged for DAX to be a
> > > per-inode capability and has an XFS-specific inode flag.  We can make
> > > that a common inode flag, but it seems we should have a way to
> > > interrogate the mapping itself in the case where the inode is unknown
> > > or unavailable.  I'm thinking extensions to mincore to have flags for
> > > DAX and possibly whether the page is part of a pte, pmd, or pud
> > > mapping.  Just floating that idea before starting to look into the
> > > implementation, comments or other ideas welcome...
> > 
> > I think this goes back to our previous discussion about support for the PMEM
> > programming model.  Really I think what NVML needs isn't a way to tell if it
> > is getting a DAX mapping, but whether it is getting a DAX mapping on a
> > filesystem that fully supports the PMEM programming model.  This of course is
> > defined to be a filesystem where it can do all of its flushes from userspace
> > safely and never call fsync/msync, and that allocations that happen in page
> > faults will be synchronized to media before the page fault completes.
> > 
> > IIUC this is what NVML needs - a way to decide "do I use fsync/msync for
> > everything or can I rely fully on flushes from userspace?" 
> 
> "need fsync/msync" is a dynamic state of an inode, not a static
> property. i.e. users can do things that change an inode behind the
> back of a mapping, even if they are not aware that this might
> happen. As such, a filesystem can invalidate an existing mapping
> at any time and userspace won't notice because it will simply fault
> in a new mapping on the next access...
> 
> > For all existing implementations, I think the answer is "you need to use
> > fsync/msync" because we don't yet have proper support for the PMEM 
> > programming
> > model.
> 
> Yes, that is correct.
> 
> FWIW, I don't think it will ever be possible to support this 
> wonderful "PMEM programming model" from any current or future kernel
> filesystem without a very specific set of restrictions on what can
> be done to a file.  e.g.
> 
>   1. the file has to be fully allocated and zeroed before
>      use. Preallocation/zeroing via unwritten extents is not
>      allowed. Sparse files are not allowed. Shared extents are
>      not allowed.
>   2. set the "PMEM_IMMUTABLE" inode flag - filesystem must
>      check the file is fully allocated before allowing it to
>      be set, and caller must have CAP_LINUX_IMMUTABLE.
>   3. Inode metadata is now immutable, and file data can only
>      be accessed and/or modified via mmap().
>   4. All non-mmap methods of inode data modification
>      will now fail with EPERM.
>   5. all methods of inode metadata modification will now fail
>      with EPERM; timestamp updates will be ignored.
>   6. PMEM_IMMUTABLE flag can only be removed if the file is
>      not currently mapped and caller has CAP_LINUX_IMMUTABLE.
> 
> A flag like this /should/ make it possible to avoid fsync/msync() on
> a file for existing filesystems, but it also means that such files
> have significant management issues (hence the need for
> CAP_LINUX_IMMUTABLE to cover its use).

Hmmm... I started to ponder such a flag, but ran into some questions.
If it's PMEM_IMMUTABLE, does this mean that none of 1-6 apply if the
filesystem discovers it isn't on pmem?

I thought about just having an 'immutable metadata' flag where any
timestamp, xattr, or block mapping update just returns EPERM.  There
wouldn't be any checks as in (1); if you left a hole in the file prior
to setting the flag then you won't be filling it unless you clear the
flag.  OTOH if it merely made the metadata unchangeable then it's a
stretch to get to non-mmap data accesses also being disallowed.

Maybe the immutable metadata and mmap-only properties would only be
implied if both DAX and IMMUTABLE_META are set on a file?

Ok no more rambling until sleep. :)

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
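
The lifecycle Dave describes might look roughly like this from userspace. A
minimal sketch, assuming a hypothetical FS_PMEM_IMMUTABLE_FL bit settable
through the existing FS_IOC_GETFLAGS/FS_IOC_SETFLAGS ioctls; the flag name
and value are illustrative only and exist in no kernel:

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/fs.h>

#define FS_PMEM_IMMUTABLE_FL 0x40000000	/* assumption: not a real flag */

static int make_pmem_immutable(const char *path, off_t size)
{
	char zeros[4096];
	off_t off;
	int fd, flags;

	memset(zeros, 0, sizeof(zeros));
	fd = open(path, O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return -1;

	/* rule 1: allocate by writing real zeros; preallocation via
	 * unwritten extents (e.g. fallocate) is explicitly disallowed */
	for (off = 0; off < size; off += sizeof(zeros))
		if (pwrite(fd, zeros, sizeof(zeros), off) < 0)
			goto fail;
	if (fsync(fd))			/* the last fsync this file should need */
		goto fail;

	/* rule 2: set the flag; caller needs CAP_LINUX_IMMUTABLE, and the
	 * filesystem would verify the file has no holes before honoring it */
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags))
		goto fail;
	flags |= FS_PMEM_IMMUTABLE_FL;
	if (ioctl(fd, FS_IOC_SETFLAGS, &flags))
		goto fail;
	close(fd);
	return 0;
fail:
	close(fd);
	return -1;
}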
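
Dan's floated mincore() extension from the top of the thread might look like
the sketch below from userspace. Today mincore() only reports residency in
bit 0 of each vector byte; the MINCORE_DAX and MINCORE_PMD bits are pure
assumptions for illustration:

#include <sys/mman.h>
#include <unistd.h>

#define MINCORE_DAX	0x02	/* assumed: page is DAX-mapped */
#define MINCORE_PMD	0x04	/* assumed: page is part of a PMD mapping */

static int mapping_is_dax(void *addr)
{
	unsigned char vec;

	/* probe one page of the mapping */
	if (mincore(addr, sysconf(_SC_PAGESIZE), &vec) < 0)
		return -1;
	return !!(vec & MINCORE_DAX);
}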



Re: [PATCH] kvm, x86: Properly check whether a pfn is an MMIO or not

2016-09-14 Thread Raslan, KarimAllah

> On Jun 22, 2016, at 3:41 PM, Paolo Bonzini  wrote:
> 
> 
> 
> On 22/06/2016 04:34, KarimAllah Ahmed wrote:
>> pfn_valid check is not sufficient because it only checks if a page has a
>> struct page or not; if for example "mem=" was passed to the kernel, some
>> valid pages won't have a struct page. This means that if guests were
>> assigned valid memory that lies after the mem= boundary, it will be passed
>> uncached to the guest no matter what the guest caching attributes are for
>> this memory.
> 
> How can you pass memory after the mem= boundary to the guest?

Does my previous reply answer your question?
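
For readers skimming the archive, the distinction at issue can be sketched as
below. This is illustrative only; e820_pfn_is_ram() is an assumed helper for
the sketch, not a real kernel API:

#include <linux/mmzone.h>	/* pfn_valid() */
#include <linux/types.h>

extern bool e820_pfn_is_ram(unsigned long pfn);	/* assumed helper */

/* With mem=2G, RAM above 2G is still RAM in the firmware map but gets
 * no struct page, so pfn_valid() returns false and a naive check
 * treats it as MMIO, hence the uncached mapping described above. */
static bool pfn_is_ram_naive(unsigned long pfn)
{
	return pfn_valid(pfn);		/* false beyond the mem= boundary */
}

static bool pfn_is_ram_robust(unsigned long pfn)
{
	return e820_pfn_is_ram(pfn);	/* consult the firmware map instead */
}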

> 
> Paolo
> 

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B



Re: perf: perf_fuzzer lockup in perf_cgroup_attach

2016-09-14 Thread Stephane Eranian
On Wed, Sep 14, 2016 at 7:43 PM, Vince Weaver  wrote:
>
> so the skylake that was fuzzing finally is mostly locked up.
>
> Really hard to tell what's going on, especially as KASLR made looking up the
> addresses a big pain.
>
I would think there is a way to disable KASLR for this kind of testing!
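
(For reference: KASLR can be disabled for a debugging session by booting with
the nokaslr kernel command-line parameter on the bootloader's kernel line;
the image path below is illustrative:)

    linux /boot/vmlinuz-4.8.0-rc6+ nokaslr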

Which of your fuzzer scripts are you using? fast_repro99.sh?

> The best I can tell things are getting wedged somehow in
> perf_cgroup_switch() while interrupts are disabled.  Interrupts are never
> getting re-enabled, causing the RCU and NMI watchdogs to trigger (and more
> alarming things like the SATA bus resetting).
>
How do you get to perf_cgroup_switch() from the traces you provide below?

> [26292.413603] Task dump for CPU 4:
> [26292.413604] perf_fuzzer R  running task0  8870   1096 
> 0x1008
> [26292.413605]  9045f29e1100 e7143ab0 9045fdd03db8 
> 938accef
> [26292.413606]  0004 0087 9045fdd03dd0 
> 938af5e9
> [26292.413607]  0004 9045fdd03e00 93984928 
> 9045fdd19440
> [26292.413608] Call Trace:
> [26292.413609][] sched_show_task+0xaf/0x110
> [26292.413611]  [] dump_cpu_task+0x39/0x40
> [26292.413613]  [] rcu_dump_cpu_stacks+0x80/0xbb
> [26292.413614]  [] rcu_check_callbacks+0x71e/0x880
> [26292.413615]  [] ? acct_account_cputime+0x1c/0x20
> [26292.413616]  [] ? account_system_time+0x79/0x120
> [26292.413617]  [] ? tick_sched_do_timer+0x30/0x30
> [26292.413619]  [] update_process_times+0x2f/0x60
> [26292.413619]  [] tick_sched_handle.isra.13+0x25/0x60
> [26292.413620]  [] tick_sched_timer+0x3d/0x70
> [26292.413621]  [] __hrtimer_run_queues+0xf3/0x280
> [26292.413623]  [] hrtimer_interrupt+0xa8/0x1a0
> [26292.413624]  [] local_apic_timer_interrupt+0x38/0x60
> [26292.413625]  [] smp_trace_apic_timer_interrupt+0x44/0xde
> [26292.413627]  [] ? perf_cgroup_attach+0x70/0x70
> [26292.413628]  [] trace_apic_timer_interrupt+0x82/0x90
> [26292.413628][] ? perf_cgroup_attach+0x70/0x70
> [26292.413631]  [] ? smp_call_function_single+0xd6/0x130
> [26292.413632]  [] ? perf_cgroup_attach+0x70/0x70
> [26292.413633]  [] cpu_function_call+0x43/0x60
> [26292.413634]  [] ? __perf_event_enable+0x260/0x260
> [26292.413635]  [] perf_install_in_context+0x141/0x150
> [26292.413636]  [] SYSC_perf_event_open+0x70e/0xfe0
> [26292.413637]  [] ? check_preempt_curr+0x54/0x90
> [26292.413639]  [] SyS_perf_event_open+0x9/0x10
> [26292.413640]  [] do_syscall_64+0x64/0x160
> [26292.413641]  [] entry_SYSCALL64_slow_path+0x25/0x25
>
> [26316.489382] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s! 
> [perf_fuzzer:8870]
> [26316.497309] CPU: 4 PID: 8870 Comm: perf_fuzzer Tainted: GW   
> 4.8.0-rc6+ #5
> [26316.497310] Hardware name: LENOVO 10FY0017US/SKYBAY, BIOS FWKT53A   
> 06/06/2016
> [26316.497310] task: 9045f29e1100 task.stack: 9045f0a48000
> [26316.497311] RIP: 0010:[]  [] 
> smp_call_function_single+0xd6/0x130
> [26316.497312] RSP: 0018:9045f0a4bd30  EFLAGS: 0202
> [26316.497313] RAX:  RBX: 0002 RCX: 
> 
> [26316.497313] RDX: 0001 RSI: 00fb RDI: 
> 0286
> [26316.497314] RBP: 9045f0a4bd78 R08: 9045fdc9bed0 R09: 
> 9045f2ac8000
> [26316.497314] R10:  R11: 0246 R12: 
> 9396ff60
> [26316.497315] R13: 9045fdc9bed0 R14: 001e R15: 
> 9045fdc9bee0
> [26316.497315] FS:  7f4157a08700() GS:9045fdd0() 
> knlGS:
> [26316.497316] CS:  0010 DS:  ES:  CR0: 80050033
> [26316.497316] CR2: 0693a048 CR3: 000231049000 CR4: 
> 003407e0
> [26316.497317] DR0:  DR1:  DR2: 
> 
> [26316.497317] DR3:  DR6: fffe0ff0 DR7: 
> 0600
> [26316.497318] Stack:
> [26316.497318]   9045ee35da80  
> 9396ff60
> [26316.497319]  9045f0a4bd88 0003 e7143ab0 
> 9045ee47613a
> [26316.497321]  9045ee476000 9045f0a4bdb0 9396ec53 
> 
> [26316.497322] Call Trace:
> [26316.497323]  [] ? perf_cgroup_attach+0x70/0x70
> [26316.497324]  [] cpu_function_call+0x43/0x60
> [26316.497325]  [] ? __perf_event_enable+0x260/0x260
> [26316.497326]  [] perf_install_in_context+0x141/0x150
> [26316.497327]  [] SYSC_perf_event_open+0x70e/0xfe0
> [26316.497328]  [] ? check_preempt_curr+0x54/0x90
> [26316.497330]  [] SyS_perf_event_open+0x9/0x10
> [26316.497331]  [] do_syscall_64+0x64/0x160
> [26316.497332]  [] entry_SYSCALL64_slow_path+0x25/0x25
> [26316.497332] Code: 25 28 00 00 00 75 70 48 83 c4 38 5b 41 5c 5d c3 48 8d 75 
> c8 48 89 d1 89 df 4c 89 e2 e8 14 fe ff ff 8b 55 e0 83 e2 01 74 0a f3 90 <8b> 
> 55 e0 83 e2 01 75 f6 eb c3 8b 05 ba e5 dd 00 85 c0 75 85 80
>



Re: [PATCH V2 2/3] net-next: dsa: add Qualcomm tag RX/TX handler

2016-09-14 Thread Florian Fainelli
On 09/14/2016 03:39 AM, John Crispin wrote:
> Add support for the 2-byte Qualcomm tag that gigabit switches such as
> the QCA8337/N might insert when receiving packets, or that we need
> to insert while targeting specific switch ports. The tag is inserted
> directly behind the ethernet header.
> 
> Signed-off-by: John Crispin 

Reviewed-by: Florian Fainelli 
-- 
Florian
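
For orientation, the TX side of such a tagger looks roughly like the sketch
below. This is a simplified illustration rather than the submitted
net/dsa/tag_qca.c code, and the tag bit layout is an assumption:

#include <linux/bitops.h>
#include <linux/etherdevice.h>
#include <linux/skbuff.h>
#include <linux/string.h>

#define QCA_HDR_LEN 2	/* the 2-byte tag */

static struct sk_buff *qca_tag_xmit_sketch(struct sk_buff *skb, u8 port)
{
	__be16 *phdr;

	if (skb_cow_head(skb, QCA_HDR_LEN) < 0)	/* make 2 bytes of headroom */
		return NULL;

	skb_push(skb, QCA_HDR_LEN);
	/* move both MAC addresses down so the tag lands right after them */
	memmove(skb->data, skb->data + QCA_HDR_LEN, 2 * ETH_ALEN);
	phdr = (__be16 *)(skb->data + 2 * ETH_ALEN);
	*phdr = htons(BIT(port));	/* steer the frame at one port (assumed layout) */
	return skb;
}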



Re: [PATCH v4 22/22] phy: Add support for Qualcomm's USB HS phy

2016-09-14 Thread Peter Chen
On Wed, Sep 14, 2016 at 10:42:50AM -0700, Stephen Boyd wrote:
> > > 
> > > Hmm.. maybe the confusion is in which registers we should be able to
> > > access? Are we talking about the ULPI viewport MMIO register space or
> > > the ULPI registers that we access through the viewport? I have a
> > > hw_phymode_configure() inside of ci_ulpi_init() so that the
> > > identification registers through the ULPI viewport read properly
> > > (assuming there aren't other power requirements like regulators). If we
> > > don't set the portsc.pts before using the viewport, the viewport doesn't
> > > work and reads timeout. So we really don't touch the ULPI registers
> > > except for the scratch space and the id registers until after the phy is
> > > properly powered on with clks and regulators, because the only place we
> > > touch them after doing the id checking is in this phy driver in
> > > qcom_usb_hs_phy_power_on(). We've "solved" the chicken-egg problem where
> > > we don't know which device driver to probe because the phy needs to be
> > > powered on to read the id registers to know which device driver to use
> > > by using DT to match up device drivers instead.
> > > 
> > > [1] https://www.sparkfun.com/datasheets/Components/SMD/ULPI_v1_1.pdf
> > 
> > Ok, a ulpi phy works like a USB device on the USB bus, which creates the
> > device at runtime. So, like some hard-wired USB devices, it may need a
> > power sequence too; otherwise, how does it know which driver to load?
> > 
> 
> Yes. We use the DT compatible string to ignore any issues with reading
> the device ids when the device is powered off. Unlike USB though, we
> have device drivers for the ULPI PHYs that do the power sequencing along
> with other initializations, so using a common pwrseq layer seems like
> overkill just so we can read the id registers.

If the attempt to read the id registers does not hang the system, this can
work. But in cases like [1], without the PHY ref_clk the system hangs when
configuring the PHY mode (touching portsc.pts), so there is no chance to
read the id registers. So I still think the ULPI bus driver needs to do the
power sequencing for its children, unless you have some other place to
enable the ref_clk.

> 
> Are there any concerns with this patch? Or can they be reapplied?

For this patch, it is ok for me.


[1] http://www.spinics.net/lists/linux-usb/msg146336.html
-- 

Best Regards,
Peter Chen
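
For readers following along, the timeout Stephen mentions comes from polling
a run bit in the viewport register. A sketch, with bit positions modeled on
the MSM/ChipIdea viewport layout but to be treated as assumptions:

#include <linux/bitops.h>
#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/io.h>

#define ULPI_VIEW_RUN		BIT(30)			/* assumed bit position */
#define ULPI_VIEW_ADDR(a)	(((a) & 0xff) << 16)	/* assumed field */
#define ULPI_VIEW_DATA_READ(v)	(((v) >> 8) & 0xff)	/* assumed field */

/* If portsc.pts has not selected ULPI, or the PHY lacks its ref_clk,
 * the RUN bit never clears and the read times out, which is the
 * failure mode discussed above. */
static int ulpi_viewport_read_sketch(void __iomem *view, u8 reg)
{
	int limit = 10000;
	u32 val;

	writel(ULPI_VIEW_RUN | ULPI_VIEW_ADDR(reg), view);
	while (limit--) {
		val = readl(view);
		if (!(val & ULPI_VIEW_RUN))
			return ULPI_VIEW_DATA_READ(val);
		udelay(1);
	}
	return -ETIMEDOUT;
}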



Re: [PATCH] clk: Add USB3 PHY reset lines

2016-09-14 Thread Vivek Gautam
On Thu, Sep 15, 2016 at 10:15 AM,   wrote:
> From: Vivek Gautam 

a stale "--from " flag in my command added this line.
Will take care of it from next time.

>
> Adding missing reset lines for USB 3.0 PHY.
>
> Signed-off-by: Vivek Gautam 
> ---

[snip]

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project



RE: [PATCH 1/6] PCI: xilinx: Keep both legacy and MSI interrupt references.

2016-09-14 Thread Bharat Kumar Gogada
On Wed, Sep 14, 2016 at 03:32:44PM +, Bharat Kumar Gogada wrote:
> On Thu, Sep 01, 2016 at 03:44:41PM +0530, Bharat Kumar Gogada wrote:
> > When built with MSI support the legacy domain reference is being 
> > overwritten with MSI.
> > Instead creating two separate domains for MSI and legacy interrupts.
> > 
> > Signed-off-by: Bharat Kumar Gogada 
> 
> Applied all these (except [5/6]) with Michal's ack to pci/host-xilinx for 
> v4.9, thanks!
> 
> I already applied Po Liu's patch ("PCI: xilinx: Relax device number checking 
> to allow SR-IOV") to pci/virtualization.  This is identical to [5/6], so I 
> dropped [5/6] to avoid a conflict.
> 
> Thanks Bjorn.
> Somehow I couldn't find Po Liu's patch; can you provide a git link to the
> patch that was applied?

> Sure, here it is:
> https://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/commit/?h=pci/virtualization&id=8e7ca8ca5fd8a3242289105723c429733be8b73b

Thanks Bjorn, this patch is similar to the [5/6] patch.
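
The gist of the [1/6] fix, sketched with illustrative names rather than the
exact driver code: before the change a single irq_domain pointer served both
roles, so a CONFIG_PCI_MSI build overwrote the legacy INTx domain reference.

#include <linux/errno.h>
#include <linux/irqdomain.h>
#include <linux/of.h>

static const struct irq_domain_ops intx_ops_sketch;	/* callbacks elided */
static const struct irq_domain_ops msi_ops_sketch;	/* callbacks elided */

struct xilinx_port_sketch {
	struct irq_domain *leg_domain;	/* INTx A-D */
	struct irq_domain *msi_domain;	/* MSI vectors */
};

static int init_irq_domains_sketch(struct xilinx_port_sketch *port,
				   struct device_node *node)
{
	/* two separate domains, so MSI no longer clobbers INTx */
	port->leg_domain = irq_domain_add_linear(node, 4,
						 &intx_ops_sketch, port);
	port->msi_domain = irq_domain_add_linear(node, 128 /* assumed pool */,
						 &msi_ops_sketch, port);
	if (!port->leg_domain || !port->msi_domain)
		return -ENODEV;
	return 0;
}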



Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 09:38:16PM -0700, Andy Lutomirski wrote:
> On Wed, Sep 14, 2016 at 9:31 PM, Alexei Starovoitov
>  wrote:
> > On Wed, Sep 14, 2016 at 09:08:57PM -0700, Andy Lutomirski wrote:
> >> On Wed, Sep 14, 2016 at 9:00 PM, Alexei Starovoitov
> >>  wrote:
> >> > On Wed, Sep 14, 2016 at 07:27:08PM -0700, Andy Lutomirski wrote:
> >> >> >> >
> >> >> >> > This RFC handle both cgroup and seccomp approaches in a similar 
> >> >> >> > way. I
> >> >> >> > don't see why building on top of cgroup v2 is a problem. Is there
> >> >> >> > security issues with delegation?
> >> >> >>
> >> >> >> What I mean is: cgroup v2 delegation has a functionality problem.
> >> >> >> Tejun says [1]:
> >> >> >>
> >> >> >> We haven't had to face this decision because cgroup has never 
> >> >> >> properly
> >> >> >> supported delegating to applications and the in-use setups where this
> >> >> >> happens are custom configurations where there is no boundary between
> >> >> >> system and applications and adhoc trial-and-error is good enough a 
> >> >> >> way
> >> >> >> to find a working solution.  That wiggle room goes away once we
> >> >> >> officially open this up to individual applications.
> >> >> >>
> >> >> >> Unless and until that changes, I think that landlock should stay away
> >> >> >> from cgroups.  Others could reasonably disagree with me.
> >> >> >
> >> >> > Ours and Sargun's use cases for cgroup+lsm+bpf is not for security
> >> >> > and not for sandboxing. So the above doesn't matter in such contexts.
> >> >> > lsm hooks + cgroups provide convenient scope and existing entry 
> >> >> > points.
> >> >> > Please see checmate examples how it's used.
> >> >> >
> >> >>
> >> >> To be clear: I'm not arguing at all that there shouldn't be
> >> >> bpf+lsm+cgroup integration.  I'm arguing that the unprivileged
> >> >> landlock interface shouldn't expose any cgroup integration, at least
> >> >> until the cgroup situation settles down a lot.
> >> >
> >> > ahh. yes. we're perfectly in agreement here.
> >> > I'm suggesting that the next RFC shouldn't include unpriv
> >> > and seccomp at all. Once bpf+lsm+cgroup is merged, we can
> >> > argue about unpriv with cgroups and even unpriv as a whole,
> >> > since it's not a given. Seccomp integration is also questionable.
> >> > I'd rather not have seccomp as a gate keeper for this lsm.
> >> > lsm and seccomp are orthogonal hook points. Syscalls and lsm hooks
> >> > don't have one to one relationship, so mixing them up is only
> >> > asking for trouble further down the road.
> >> > If we really need to carry some information from seccomp to lsm+bpf,
> >> > it's easier to add eBPF support to seccomp and let bpf side deal
> >> > with passing whatever information.
> >> >
> >>
> >> As an argument for keeping seccomp (or an extended seccomp) as the
> >> interface for an unprivileged bpf+lsm: seccomp already checks off most
> >> of the boxes for safely letting unprivileged programs sandbox
> >> themselves.
> >
> > you mean the attach part of seccomp syscall that deals with no_new_priv?
> > sure, that's reusable.
> >
> >> Furthermore, to the extent that there are use cases for
> >> unprivileged bpf+lsm that *aren't* expressible within the seccomp
> >> hierarchy, I suspect that syscall filters have exactly the same
> >> problem and that we should fix seccomp to cover it.
> >
> > not sure what you mean by 'seccomp hierarchy'. The normal process
> > hierarchy ?
> 
> Kind of.  I mean the filter layers that are inherited across fork(),
> the TSYNC mechanism, etc.
> 
> > imo the main deficiency of seccomp is inability to look into arguments.
> > One can argue that it's a blessing, since composite args
> > are not yet copied into the kernel memory.
> > But in a lot of cases the seccomp arguments are FDs pointing
> > to kernel objects and if programs could examine those objects
> > the sandboxing scope would be more precise.
> > lsm+bpf solves that part and I'd still argue that it's
> > orthogonal to seccomp's pass/reject flow.
> > I mean if seccomp says 'ok' the syscall should continue executing
> > as normal and whatever LSM hooks were triggered by it may have
> > their own lsm+bpf verdicts.
> 
> I agree with all of this...
> 
> > Furthermore in the process hierarchy different children
> > should be able to set their own lsm+bpf filters that are not
> > related to parallel seccomp+bpf hierarchy of programs.
> > seccomp syscall can be an interface to attach programs
> > to lsm hooks, but nothing more than that.
> 
> I'm not sure what you mean.  I mean that, logically, I think we should
> be able to do:
> 
> seccomp(attach a syscall filter);
> fork();
> child does seccomp(attach some lsm filters);
> 
> I think that they *should* be related to the seccomp+bpf hierarchy of
> programs in that they are entries in the same logical list of filter
> layers installed.  Some of those layers can be syscall filters and
> some of the layers can be lsm filters.  If we subsequently add a way
> to attach a removable seccomp filter or a way to attach a seccomp
> filter that logs failures to some fd watched by an outside monitor, I
> think that should work for lsm, too, with more or less the same
> interface.


[PATCH] clk: Add USB3 PHY reset lines

2016-09-14 Thread vivek . gautam
From: Vivek Gautam 

Adding missing reset lines for USB 3.0 PHY.

Signed-off-by: Vivek Gautam 
---

 - Build tested on clk-next branch.
 - Tested with my wip branch for usbphy that is available @[1].

[1] https://github.com/vivekgautam1/linux/tree/linaro/integration-linux-qcomlt-usbphy-wip

 drivers/clk/qcom/gcc-msm8996.c   | 2 ++
 include/dt-bindings/clock/qcom,gcc-msm8996.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/drivers/clk/qcom/gcc-msm8996.c b/drivers/clk/qcom/gcc-msm8996.c
index ebe5d18..fe03e6f 100644
--- a/drivers/clk/qcom/gcc-msm8996.c
+++ b/drivers/clk/qcom/gcc-msm8996.c
@@ -3329,6 +3329,8 @@ static const struct qcom_reset_map gcc_msm8996_resets[] = {
[GCC_USB_20_BCR] = { 0x12000 },
[GCC_QUSB2PHY_PRIM_BCR] = { 0x12038 },
[GCC_QUSB2PHY_SEC_BCR] = { 0x1203c },
+   [GCC_USB3_PHY_BCR] = { 0x50020 },
+   [GCC_USB3PHY_PHY_BCR] = { 0x50024 },
[GCC_USB_PHY_CFG_AHB2PHY_BCR] = { 0x6a000 },
[GCC_SDCC1_BCR] = { 0x13000 },
[GCC_SDCC2_BCR] = { 0x14000 },
diff --git a/include/dt-bindings/clock/qcom,gcc-msm8996.h b/include/dt-bindings/clock/qcom,gcc-msm8996.h
index b7ea1e8..1828723 100644
--- a/include/dt-bindings/clock/qcom,gcc-msm8996.h
+++ b/include/dt-bindings/clock/qcom,gcc-msm8996.h
@@ -337,6 +337,8 @@
 #define GCC_QREFS_VBG_CAL_BCR  100
 #define GCC_PCIE_PHY_COM_BCR   101
 #define GCC_PCIE_PHY_COM_NOCSR_BCR 102
+#define GCC_USB3_PHY_BCR   103
+#define GCC_USB3PHY_PHY_BCR  104
 
 
 /* Indexes for GDSCs */
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
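
For context, a consumer driver would exercise the new BCRs through the common
reset framework roughly as below. The "phy" reset-name string and the hold
time are assumptions; the consumer's DT node would map the name to
GCC_USB3_PHY_BCR:

#include <linux/delay.h>
#include <linux/device.h>
#include <linux/err.h>
#include <linux/reset.h>

static int usb3_phy_por_sketch(struct device *dev)
{
	struct reset_control *rst;
	int ret;

	rst = devm_reset_control_get(dev, "phy");	/* name is an assumption */
	if (IS_ERR(rst))
		return PTR_ERR(rst);

	ret = reset_control_assert(rst);
	if (ret)
		return ret;
	udelay(10);			/* hold time is an assumption */
	return reset_control_deassert(rst);
}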



Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Andy Lutomirski
On Wed, Sep 14, 2016 at 9:31 PM, Alexei Starovoitov
 wrote:
> On Wed, Sep 14, 2016 at 09:08:57PM -0700, Andy Lutomirski wrote:
>> On Wed, Sep 14, 2016 at 9:00 PM, Alexei Starovoitov
>>  wrote:
>> > On Wed, Sep 14, 2016 at 07:27:08PM -0700, Andy Lutomirski wrote:
>> >> >> >
>> >> >> > This RFC handle both cgroup and seccomp approaches in a similar way. 
>> >> >> > I
>> >> >> > don't see why building on top of cgroup v2 is a problem. Is there
>> >> >> > security issues with delegation?
>> >> >>
>> >> >> What I mean is: cgroup v2 delegation has a functionality problem.
>> >> >> Tejun says [1]:
>> >> >>
>> >> >> We haven't had to face this decision because cgroup has never properly
>> >> >> supported delegating to applications and the in-use setups where this
>> >> >> happens are custom configurations where there is no boundary between
>> >> >> system and applications and adhoc trial-and-error is good enough a way
>> >> >> to find a working solution.  That wiggle room goes away once we
>> >> >> officially open this up to individual applications.
>> >> >>
>> >> >> Unless and until that changes, I think that landlock should stay away
>> >> >> from cgroups.  Others could reasonably disagree with me.
>> >> >
>> >> > Ours and Sargun's use cases for cgroup+lsm+bpf is not for security
>> >> > and not for sandboxing. So the above doesn't matter in such contexts.
>> >> > lsm hooks + cgroups provide convenient scope and existing entry points.
>> >> > Please see checmate examples how it's used.
>> >> >
>> >>
>> >> To be clear: I'm not arguing at all that there shouldn't be
>> >> bpf+lsm+cgroup integration.  I'm arguing that the unprivileged
>> >> landlock interface shouldn't expose any cgroup integration, at least
>> >> until the cgroup situation settles down a lot.
>> >
>> > ahh. yes. we're perfectly in agreement here.
>> > I'm suggesting that the next RFC shouldn't include unpriv
>> > and seccomp at all. Once bpf+lsm+cgroup is merged, we can
>> > argue about unpriv with cgroups and even unpriv as a whole,
>> > since it's not a given. Seccomp integration is also questionable.
>> > I'd rather not have seccomp as a gate keeper for this lsm.
>> > lsm and seccomp are orthogonal hook points. Syscalls and lsm hooks
>> > don't have one to one relationship, so mixing them up is only
>> > asking for trouble further down the road.
>> > If we really need to carry some information from seccomp to lsm+bpf,
>> > it's easier to add eBPF support to seccomp and let bpf side deal
>> > with passing whatever information.
>> >
>>
>> As an argument for keeping seccomp (or an extended seccomp) as the
>> interface for an unprivileged bpf+lsm: seccomp already checks off most
>> of the boxes for safely letting unprivileged programs sandbox
>> themselves.
>
> you mean the attach part of seccomp syscall that deals with no_new_priv?
> sure, that's reusable.
>
>> Furthermore, to the extent that there are use cases for
>> unprivileged bpf+lsm that *aren't* expressible within the seccomp
>> hierarchy, I suspect that syscall filters have exactly the same
>> problem and that we should fix seccomp to cover it.
>
> not sure what you mean by 'seccomp hierarchy'. The normal process
> hierarchy ?

Kind of.  I mean the filter layers that are inherited across fork(),
the TSYNC mechanism, etc.

> imo the main deficiency of seccomp is inability to look into arguments.
> One can argue that it's a blessing, since composite args
> are not yet copied into the kernel memory.
> But in a lot of cases the seccomp arguments are FDs pointing
> to kernel objects and if programs could examine those objects
> the sandboxing scope would be more precise.
> lsm+bpf solves that part and I'd still argue that it's
> orthogonal to seccomp's pass/reject flow.
> I mean if seccomp says 'ok' the syscall should continue executing
> as normal and whatever LSM hooks were triggered by it may have
> their own lsm+bpf verdicts.

I agree with all of this...

> Furthermore in the process hierarchy different children
> should be able to set their own lsm+bpf filters that are not
> related to parallel seccomp+bpf hierarchy of programs.
> seccomp syscall can be an interface to attach programs
> to lsm hooks, but nothing more than that.

I'm not sure what you mean.  I mean that, logically, I think we should
be able to do:

seccomp(attach a syscall filter);
fork();
child does seccomp(attach some lsm filters);
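
A runnable rendering of those three steps against today's ABI. The child's
second filter stands in for the hypothetical lsm filter, which has no ABI
yet, so both layers here are plain allow-all syscall filters:

#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

static int attach_allow_all(void)
{
	struct sock_filter insns[] = {
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};

	/* an unprivileged attach requires no_new_privs */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog);
}

int main(void)
{
	if (attach_allow_all())		/* layer 1, installed before fork */
		return 1;
	if (fork() == 0) {
		attach_allow_all();	/* layer 2, child only */
		_exit(0);		/* child now runs under both layers */
	}
	wait(NULL);
	return 0;
}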

I think that they *should* be related to the seccomp+bpf hierarchy of
programs in that they are entries in the same logical list of filter
layers installed.  Some of those layers can be syscall filters and
some of the layers can be lsm filters.  If we subsequently add a way
to attach a removable seccomp filter or a way to attach a seccomp
filter that logs failures to some fd watched by an outside monitor, I
think that should work for lsm, too, with more or less the same
interface.

If we need a way for a sandbox manager to opt different children into
different subsets of fancy filters, then 


Re: [PATCH v4] arm64: Improve kprobes test for atomic sequence

2016-09-14 Thread Masami Hiramatsu
On Mon, 12 Sep 2016 21:07:40 -0400
David Long  wrote:
> 
> After the patch the function reads as follows:
> 
> > enum kprobe_insn __kprobes
> > arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn 
> > *asi)
> > {
> > enum kprobe_insn decoded;
> > kprobe_opcode_t insn = le32_to_cpu(*addr);
> > kprobe_opcode_t *scan_end = NULL;
> > unsigned long size = 0, offset = 0;
> >
> > /*
> >  * If there's a symbol defined in front of and near enough to
> >  * the probe address assume it is the entry point to this
> >  * code and use it to further limit how far back we search
> >  * when determining if we're in an atomic sequence. If we could
> >  * not find any symbol skip the atomic test altogether as we
> >  * could otherwise end up searching irrelevant text/literals.
> >  * KPROBES depends on KALLSYMS so this last case should never
> >  * happen.
> >  */
> > if (kallsyms_lookup_size_offset((unsigned long) addr, &size, &offset)) {
> > if (offset < (MAX_ATOMIC_CONTEXT_SIZE*sizeof(kprobe_opcode_t)))
> > scan_end = addr - (offset / sizeof(kprobe_opcode_t));
> > else
> > scan_end = addr - MAX_ATOMIC_CONTEXT_SIZE;
> > }
> > decoded = arm_probe_decode_insn(insn, asi);
> >
> > if (decoded != INSN_REJECTED && scan_end)
> > if (is_probed_address_atomic(addr - 1, scan_end))
> > return INSN_REJECTED;
> >
> > return decoded;
> > }
> 
> A failed kallsyms_lookup_size_offset() call means scan_end will be left 
> as NULL, which in turn means arm_kprobe_decode_insn() will simply return 
> the result of the arm_probe_decode_insn() call.  In other words it does 
> the normal analysis of the instruction to be probed, but does not do the 
> atomic sequence search that normally follows that (since it doesn't 
> really know how far back to search).

OK, my idea was just rejecting it when kallsyms_lookup_size_offset()
fails, because we cannot ensure that the address is in the kernel
text. But anyway, that should be tested in generic code like kernel/kprobes.c.

OK, now I think it is clear to apply.

Acked-by: Masami Hiramatsu 

Thanks,

> 
> Thanks,
> -dl
> 


-- 
Masami Hiramatsu 


Re: [PATCH v4] arm64: Improve kprobes test for atomic sequence

2016-09-14 Thread Masami Hiramatsu
On Mon, 12 Sep 2016 21:07:40 -0400
David Long  wrote:
> 
> After the patch the function reads as follows:
> 
> > enum kprobe_insn __kprobes
> > arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn 
> > *asi)
> > {
> > enum kprobe_insn decoded;
> > kprobe_opcode_t insn = le32_to_cpu(*addr);
> > kprobe_opcode_t *scan_end = NULL;
> > unsigned long size = 0, offset = 0;
> >
> > /*
> >  * If there's a symbol defined in front of and near enough to
> >  * the probe address assume it is the entry point to this
> >  * code and use it to further limit how far back we search
> >  * when determining if we're in an atomic sequence. If we could
> >  * not find any symbol skip the atomic test altogether as we
> >  * could otherwise end up searching irrelevant text/literals.
> >  * KPROBES depends on KALLSYMS so this last case should never
> >  * happen.
> >  */
> > if (kallsyms_lookup_size_offset((unsigned long) addr, , )) {
> > if (offset < (MAX_ATOMIC_CONTEXT_SIZE*sizeof(kprobe_opcode_t)))
> > scan_end = addr - (offset / sizeof(kprobe_opcode_t));
> > else
> > scan_end = addr - MAX_ATOMIC_CONTEXT_SIZE;
> > }
> > decoded = arm_probe_decode_insn(insn, asi);
> >
> > if (decoded != INSN_REJECTED && scan_end)
> > if (is_probed_address_atomic(addr - 1, scan_end))
> > return INSN_REJECTED;
> >
> > return decoded;
> > }
> 
> A failed kallsyms_lookup_size_offset() call means scan_end will be left 
> as NULL, which in turn means arm_kprobe_decode_insn() will simply return 
> the result of the arm_probe_decode_insn() call.  In other words it does 
> the normal analysis of the instruction to be probed, but does not do the 
> atomic sequence search that normally follows that (since it doesn't 
> really know how far back to search).

OK, my idea was just rejecting it when kallsyms_lookup_size_offset() is
failed, because we can not ensure that the address is in the kernel
text. But anyway, that should be tested in general code like kernel/kprobes.c.

OK, now I think this is good to apply.

Acked-by: Masami Hiramatsu 

Thanks,

> 
> Thanks,
> -dl
> 


-- 
Masami Hiramatsu 


Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 09:08:57PM -0700, Andy Lutomirski wrote:
> On Wed, Sep 14, 2016 at 9:00 PM, Alexei Starovoitov
>  wrote:
> > On Wed, Sep 14, 2016 at 07:27:08PM -0700, Andy Lutomirski wrote:
> >> >> >
> >> >> > This RFC handles both cgroup and seccomp approaches in a similar way. I
> >> >> > don't see why building on top of cgroup v2 is a problem. Are there
> >> >> > security issues with delegation?
> >> >>
> >> >> What I mean is: cgroup v2 delegation has a functionality problem.
> >> >> Tejun says [1]:
> >> >>
> >> >> We haven't had to face this decision because cgroup has never properly
> >> >> supported delegating to applications and the in-use setups where this
> >> >> happens are custom configurations where there is no boundary between
> >> >> system and applications and adhoc trial-and-error is good enough a way
> >> >> to find a working solution.  That wiggle room goes away once we
> >> >> officially open this up to individual applications.
> >> >>
> >> >> Unless and until that changes, I think that landlock should stay away
> >> >> from cgroups.  Others could reasonably disagree with me.
> >> >
> >> > Ours and Sargun's use cases for cgroup+lsm+bpf are not for security
> >> > and not for sandboxing, so the above doesn't matter in such contexts.
> >> > lsm hooks + cgroups provide convenient scope and existing entry points.
> >> > Please see the checmate examples for how it's used.
> >> >
> >> >
> >>
> >> To be clear: I'm not arguing at all that there shouldn't be
> >> bpf+lsm+cgroup integration.  I'm arguing that the unprivileged
> >> landlock interface shouldn't expose any cgroup integration, at least
> >> until the cgroup situation settles down a lot.
> >
> > ahh. yes. we're perfectly in agreement here.
> > I'm suggesting that the next RFC shouldn't include unpriv
> > and seccomp at all. Once bpf+lsm+cgroup is merged, we can
> > argue about unpriv with cgroups and even unpriv as a whole,
> > since it's not a given. Seccomp integration is also questionable.
> > I'd rather not have seccomp as a gate keeper for this lsm.
> > lsm and seccomp are orthogonal hook points. Syscalls and lsm hooks
> > don't have one to one relationship, so mixing them up is only
> > asking for trouble further down the road.
> > If we really need to carry some information from seccomp to lsm+bpf,
> > it's easier to add eBPF support to seccomp and let bpf side deal
> > with passing whatever information.
> >
> 
> As an argument for keeping seccomp (or an extended seccomp) as the
> interface for an unprivileged bpf+lsm: seccomp already checks off most
> of the boxes for safely letting unprivileged programs sandbox
> themselves.  

You mean the attach part of the seccomp syscall that deals with no_new_privs?
Sure, that's reusable.
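
For reference, a minimal userspace sketch of that attach path -- set
no_new_privs, then install a (here trivially permissive) classic-BPF
seccomp filter; illustrative only:

#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

/* Mark the process no_new_privs, then attach an allow-everything
 * seccomp filter. A real filter would return SECCOMP_RET_KILL or
 * SECCOMP_RET_ERRNO for disallowed syscalls.
 */
int main(void)
{
	struct sock_filter filter[] = {
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return 1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) ? 1 : 0;
}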

> Furthermore, to the extent that there are use cases for
> unprivileged bpf+lsm that *aren't* expressible within the seccomp
> hierarchy, I suspect that syscall filters have exactly the same
> problem and that we should fix seccomp to cover it.

not sure what you mean by 'seccomp hierarchy'. The normal process
hierarchy?
imo the main deficiency of seccomp is its inability to look into
arguments. One can argue that's a blessing, since composite args
are not yet copied into kernel memory at that point.
But in a lot of cases the seccomp arguments are FDs pointing
to kernel objects, and if programs could examine those objects
the sandboxing scope would be more precise.
lsm+bpf solves that part, and I'd still argue that it's
orthogonal to seccomp's pass/reject flow.
I mean if seccomp says 'ok' the syscall should continue executing
as normal, and whatever LSM hooks were triggered by it may have
their own lsm+bpf verdicts.
Furthermore, in the process hierarchy different children
should be able to set their own lsm+bpf filters that are not
related to the parallel seccomp+bpf hierarchy of programs.
The seccomp syscall can be an interface to attach programs
to lsm hooks, but nothing more than that.
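
For reference, a seccomp filter's entire view of a syscall is struct
seccomp_data (include/uapi/linux/seccomp.h) -- raw register values only,
which is why an FD argument is opaque to the filter:

#include <linux/types.h>

/* UAPI definition: everything a seccomp filter can inspect. */
struct seccomp_data {
	int nr;				/* syscall number */
	__u32 arch;			/* AUDIT_ARCH_* token */
	__u64 instruction_pointer;	/* CPU IP at syscall entry */
	__u64 args[6];			/* raw argument registers; an fd
					 * is just a number here */
};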



Re: [PATCH] KVM: VMX: Enable MSR-BASED TPR shadow even if w/o APICv

2016-09-14 Thread Wanpeng Li
2016-09-15 12:08 GMT+08:00 Mika Penttilä :
> On 09/14/2016 10:58 AM, Wanpeng Li wrote:
>> From: Wanpeng Li 
>>
>> I observed that kvmvapic (which optimizes the flexpriority=N and AMD
>> cases) is used to boost TPR access when testing the
>> kvm-unit-test/eventinj.flat tpr case on my Haswell desktop
>> (w/ flexpriority, w/o APICv). Commit 8d14695f9542 ("x86, apicv: add
>> virtual x2apic support") disables virtual x2apic mode completely if
>> APICv is absent, and the author also told me that Windows guests
>> couldn't enter x2apic mode when he developed the APICv feature several
>> years ago. However, that is no longer true: Interrupt Remapping and
>> vIOMMU support have been added to QEMU, and developers from Intel
>> recently verified that Windows 8 can work in x2apic mode with
>> Interrupt Remapping enabled.
>>
>> This patch enables TPR shadow for virtual x2apic mode to boost Windows
>> guests in x2apic mode even without APICv.
>>
>> It passes the kvm-unit-tests.
>>
>
> While at it, is the vmx flexpriority stuff still valid code?
> AFAICS it gets enabled iff TPR shadow is on. flexpriority is on when:
>
> (flexpriority_enabled && lapic_in_kernel && cpu_has_vmx_tpr_shadow &&
>  cpu_has_vmx_virtualize_apic_accesses)
>
> But APIC accesses to the TPR MMIO page are then not trapped, and TPR
> changes are not reported, because the "use TPR shadow" VM-execution
> control is 1.

Please note the patch is for MSR-based TPR shadow without APICv; TPR
shadow works correctly in the other configurations.

Regards,
Wanpeng Li


Re: [PATCH] KVM: VMX: Enable MSR-BASED TPR shadow even if w/o APICv

2016-09-14 Thread Mika Penttilä
On 09/14/2016 10:58 AM, Wanpeng Li wrote:
> From: Wanpeng Li 
> 
> I observed that kvmvapic (which optimizes the flexpriority=N and AMD
> cases) is used to boost TPR access when testing the
> kvm-unit-test/eventinj.flat tpr case on my Haswell desktop
> (w/ flexpriority, w/o APICv). Commit 8d14695f9542 ("x86, apicv: add
> virtual x2apic support") disables virtual x2apic mode completely if
> APICv is absent, and the author also told me that Windows guests
> couldn't enter x2apic mode when he developed the APICv feature several
> years ago. However, that is no longer true: Interrupt Remapping and
> vIOMMU support have been added to QEMU, and developers from Intel
> recently verified that Windows 8 can work in x2apic mode with
> Interrupt Remapping enabled.
> 
> This patch enables TPR shadow for virtual x2apic mode to boost Windows
> guests in x2apic mode even without APICv.
> 
> It passes the kvm-unit-tests.
> 

While at it, is the vmx flexpriority stuff still valid code?
AFAICS it gets enabled iff TPR shadow is on. flexpriority is on when:

(flexpriority_enabled && lapic_in_kernel && cpu_has_vmx_tpr_shadow &&
 cpu_has_vmx_virtualize_apic_accesses)

But APIC accesses to the TPR MMIO page are then not trapped, and TPR
changes are not reported, because the "use TPR shadow" VM-execution
control is 1.
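
For reference, the capability half of that condition comes from this
vmx.c helper (as in the 4.x-era source; the flexpriority_enabled module
parameter gates it separately):

static inline bool cpu_has_vmx_flexpriority(void)
{
	return cpu_has_vmx_tpr_shadow() &&
		cpu_has_vmx_virtualize_apic_accesses();
}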

Thanks,
Mika


> Suggested-by: Wincy Van 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Wincy Van 
> Cc: Yang Zhang 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/vmx.c | 41 ++---
>  1 file changed, 22 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 5cede40..e703129 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6336,7 +6336,7 @@ static void wakeup_handler(void)
>  
>  static __init int hardware_setup(void)
>  {
> -	int r = -ENOMEM, i, msr;
> +	int r = -ENOMEM, i;
>  
>  	rdmsrl_safe(MSR_EFER, &host_efer);
>  
> @@ -6464,18 +6464,6 @@ static __init int hardware_setup(void)
>  
>  	set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */
>  
> -	for (msr = 0x800; msr <= 0x8ff; msr++)
> -		vmx_disable_intercept_msr_read_x2apic(msr);
> -
> -	/* TMCCT */
> -	vmx_enable_intercept_msr_read_x2apic(0x839);
> -	/* TPR */
> -	vmx_disable_intercept_msr_write_x2apic(0x808);
> -	/* EOI */
> -	vmx_disable_intercept_msr_write_x2apic(0x80b);
> -	/* SELF-IPI */
> -	vmx_disable_intercept_msr_write_x2apic(0x83f);
> -
>  	if (enable_ept) {
>  		kvm_mmu_set_mask_ptes(VMX_EPT_READABLE_MASK,
>  			(enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull,
> @@ -8435,12 +8423,7 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set)
>  		return;
>  	}
>  
> -	/*
> -	 * There is not point to enable virtualize x2apic without enable
> -	 * apicv
> -	 */
> -	if (!cpu_has_vmx_virtualize_x2apic_mode() ||
> -	    !kvm_vcpu_apicv_active(vcpu))
> +	if (!cpu_has_vmx_virtualize_x2apic_mode())
>  		return;
>  
>  	if (!cpu_need_tpr_shadow(vcpu))
> @@ -8449,8 +8432,28 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set)
>  	sec_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
>  
>  	if (set) {
> +		int msr;
> +
>  		sec_exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>  		sec_exec_control |= SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
> +
> +		if (kvm_vcpu_apicv_active(vcpu)) {
> +			for (msr = 0x800; msr <= 0x8ff; msr++)
> +				vmx_disable_intercept_msr_read_x2apic(msr);
> +
> +			/* TMCCT */
> +			vmx_enable_intercept_msr_read_x2apic(0x839);
> +			/* TPR */
> +			vmx_disable_intercept_msr_write_x2apic(0x808);
> +			/* EOI */
> +			vmx_disable_intercept_msr_write_x2apic(0x80b);
> +			/* SELF-IPI */
> +			vmx_disable_intercept_msr_write_x2apic(0x83f);
> +		} else if (vmx_exec_control(to_vmx(vcpu)) & CPU_BASED_TPR_SHADOW) {
> +			/* TPR */
> +			vmx_disable_intercept_msr_read_x2apic(0x808);
> +			vmx_disable_intercept_msr_write_x2apic(0x808);
> +		}
>  	} else {
>  		sec_exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>  		sec_exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> 
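
For reference, the magic numbers in the hunk above are x2apic MSR
indices: the x2apic architecture maps the xAPIC MMIO register at offset
'off' to MSR 0x800 + (off >> 4). A sketch of the mapping (macro names
here are illustrative, not from vmx.c):

#define X2APIC_MSR(mmio_off)	(0x800 + ((mmio_off) >> 4))

#define MSR_X2APIC_TPR		X2APIC_MSR(0x080)	/* 0x808 */
#define MSR_X2APIC_EOI		X2APIC_MSR(0x0b0)	/* 0x80b */
#define MSR_X2APIC_TMCCT	X2APIC_MSR(0x390)	/* 0x839 */
#define MSR_X2APIC_SELF_IPI	X2APIC_MSR(0x3f0)	/* 0x83f */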



Re: [PATCH 3/4] autofs - make mountpoint checks namespace aware

2016-09-14 Thread Ian Kent
On Wed, 2016-09-14 at 21:08 -0500, Eric W. Biederman wrote:
> Ian Kent  writes:
> 
> > On Wed, 2016-09-14 at 12:28 -0500, Eric W. Biederman wrote:
> > > Ian Kent  writes:
> > > 
> > > > If an automount mount is clone(2)ed into a file system that is
> > > > propagation private, when it later expires in the originating
> > > > namespace subsequent calls to autofs ->d_automount() for that
> > > > dentry in the original namespace will return ELOOP until the
> > > > mount is manually umounted in the cloned namespace.
> > > > 
> > > > In the same way, if an autofs mount is triggered by automount(8)
> > > > running within a container the dentry will be seen as mounted in
> > > > the root init namespace and calls to ->d_automount() in that namespace
> > > > will return ELOOP until the mount is umounted within the container.
> > > > 
> > > > Also, have_submounts() can return an incorrect result when a mount
> > > > exists in a namespace other than the one being checked.
> > > 
> > > Overall this appears to be a fairly reasonable set of changes.  It does
> > > increase the expense when an actual mount point is encountered, but if
> > > these are the desired semantics, some increase in cost when a dentry is
> > > a mountpoint is unavoidable.
> > > 
> > > May I ask the motivation for this set of changes?  Reading through the
> > > changes I don't grasp why we want to change the behavior of autofs.
> > > What problem is being solved?  What are the benefits?
> > 
> > LOL, it's all too easy for me to give a patch description that I think
> > explains a problem I need to solve without realizing it isn't clear to
> > others what the problem is, sorry about that.
> > 
> > For quite a while now, not that frequently but consistently, I've been
> > getting reports of people using autofs getting ELOOP errors and not
> > being able to mount automounts.
> > 
> > This has been due to the cloning of autofs file systems (that have
> > active automounts at the time of the clone) by other systems.
> > 
> > An unshare, as one example, can easily result in the cloning of an
> > autofs file system that has active mounts, which shows this problem.
> > 
> > Once an active mount that has been cloned is expired in the namespace
> > that performed the unshare, it can't be (auto)mounted again in the
> > originating namespace, because the mounted check in the autofs module
> > will think it is already mounted.
> > 
> > I'm not sure this is a clear description either; hopefully it is enough
> > to demonstrate the type of problem I'm trying to solve.
> 
> So to rephrase: the problem is that an autofs instance can stop working
> properly, from the perspective of the mount namespace it is mounted in,
> if the autofs instance is shared between multiple mount namespaces.  The
> problem is that mounts and unmounts do not always propagate between
> mount namespaces.  This lack of symmetric mount/unmount behavior
> leads to mountpoints that become unusable.

That's right.

It's also worth considering that symmetric mount propagation is usually not
the behaviour needed either; things like LXC and Docker are set propagation
slave because of problems caused by propagation back to the parent
namespace.

So a mount can be triggered within a container, mounted by the automount
daemon in the parent namespace, and propagated to the child, and similarly
for expires, which is the common use case now.

> 
> Which leads to the question what is the expected new behavior with your
> patchset applied.  New mounts can be added in the parent mount namespace
> (because the test is local).  Does your change also allow the
> autofs mountpoints to be used in the other mount namespaces that share
> the autofs instance if everything becomes unmounted?

The problem occurs when the subordinate namespace doesn't deal with these
propagated mounts properly, although they can obviously be used by the
subordinate namespace.

> 
> Or is it expected that other mount namespaces that share an autofs
> instance will get changes in their mounts via mount propagation and if
> mount propagation is insufficient they are on their own.

Namespaces that receive updates via mount propagation from a parent will
continue to function as they do now.

Namespaces that don't get updates via mount propagation will retain the
mount to use if they need it, as they would without this change, but the
originating namespace will now also continue to function as expected.

The child namespace needs to clean up its mounts on exit, which it had to
do prior to this change as well.

> 
> I believe this is a question of how notifications of the desire for
> an automount work after your change, and whether those notifications
> are consistent with your desired and/or expected behavior.

It sounds like you might be assuming the service receiving these cloned
mounts actually wants to use them, or is expecting them to behave like
automount mounts. But that's not what I've seen, and it is not the way
these cloned mounts behave.

Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Andy Lutomirski
On Wed, Sep 14, 2016 at 9:00 PM, Alexei Starovoitov
 wrote:
> On Wed, Sep 14, 2016 at 07:27:08PM -0700, Andy Lutomirski wrote:
>> >> >
>> >> > This RFC handles both cgroup and seccomp approaches in a similar way. I
>> >> > don't see why building on top of cgroup v2 is a problem. Are there
>> >> > security issues with delegation?
>> >>
>> >> What I mean is: cgroup v2 delegation has a functionality problem.
>> >> Tejun says [1]:
>> >>
>> >> We haven't had to face this decision because cgroup has never properly
>> >> supported delegating to applications and the in-use setups where this
>> >> happens are custom configurations where there is no boundary between
>> >> system and applications and adhoc trial-and-error is good enough a way
>> >> to find a working solution.  That wiggle room goes away once we
>> >> officially open this up to individual applications.
>> >>
>> >> Unless and until that changes, I think that landlock should stay away
>> >> from cgroups.  Others could reasonably disagree with me.
>> >
>> > Ours and Sargun's use cases for cgroup+lsm+bpf are not for security
>> > and not for sandboxing, so the above doesn't matter in such contexts.
>> > lsm hooks + cgroups provide convenient scope and existing entry points.
>> > Please see the checmate examples for how it's used.
>> >
>>
>> To be clear: I'm not arguing at all that there shouldn't be
>> bpf+lsm+cgroup integration.  I'm arguing that the unprivileged
>> landlock interface shouldn't expose any cgroup integration, at least
>> until the cgroup situation settles down a lot.
>
> ahh. yes. we're perfectly in agreement here.
> I'm suggesting that the next RFC shouldn't include unpriv
> and seccomp at all. Once bpf+lsm+cgroup is merged, we can
> argue about unpriv with cgroups and even unpriv as a whole,
> since it's not a given. Seccomp integration is also questionable.
> I'd rather not have seccomp as a gate keeper for this lsm.
> lsm and seccomp are orthogonal hook points. Syscalls and lsm hooks
> don't have one to one relationship, so mixing them up is only
> asking for trouble further down the road.
> If we really need to carry some information from seccomp to lsm+bpf,
> it's easier to add eBPF support to seccomp and let bpf side deal
> with passing whatever information.
>

As an argument for keeping seccomp (or an extended seccomp) as the
interface for an unprivileged bpf+lsm: seccomp already checks off most
of the boxes for safely letting unprivileged programs sandbox
themselves.  Furthermore, to the extent that there are use cases for
unprivileged bpf+lsm that *aren't* expressible within the seccomp
hierarchy, I suspect that syscall filters have exactly the same
problem and that we should fix seccomp to cover it.

If I ever add a "seccomp monitor", which is something I want to do
eventually, I think it should work for lsm+bpf as well, which is
another argument for keeping it in seccomp.

--Andy


[PATCH] rtc/rtc-cmos: Initialize hpet timer before irq is registered

2016-09-14 Thread Pratyush Anand
We have observed on a few x86 machines with an rtc-cmos device that
hpet_rtc_interrupt() is called just after IRQ registration and before
cmos_do_probe() can call hpet_rtc_timer_init().

So neither hpet_default_delta nor hpet_t1_cmp is initialized by the time
the interrupt is raised in this situation, which results in an NMI
watchdog LOCKUP.
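
The failing ordering is roughly (a sketch; names as in the trace below):

	/*
	 * cmos_do_probe()
	 *   hpet_register_irq_handler() / request_irq()
	 *        `-> hpet_rtc_interrupt() can fire from here on, with
	 *            hpet_default_delta == 0 and hpet_t1_cmp == 0,
	 *            and the NMI watchdog detects a hard LOCKUP
	 *   hpet_rtc_timer_init()   <-- runs too late
	 */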

It has only been observed sporadically on kdump secondary kernels.

See the call trace:
---<-snip->---
[   27.913194] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
[   27.915371] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-342.el7.x86_64 
#1
[   27.917503] Hardware name: HP ProLiant DL160 Gen8, BIOS J03 02/10/2014
[   27.919455]  8186a728 59c82488 880034e05af0 
81637bd4
[   27.921870]  880034e05b70 8163144a 0010 
880034e05b80
[   27.924257]  880034e05b20 59c82488  

[   27.926599] Call Trace:
[   27.927352][] dump_stack+0x19/0x1b
[   27.929080]  [] panic+0xd8/0x1e7
[   27.930588]  [] ? restart_watchdog_hrtimer+0x50/0x50
[   27.932502]  [] watchdog_overflow_callback+0xc2/0xd0
[   27.934427]  [] __perf_event_overflow+0xa1/0x250
[   27.936232]  [] perf_event_overflow+0x14/0x20
[   27.937957]  [] intel_pmu_handle_irq+0x1e8/0x470
[   27.939799]  [] perf_event_nmi_handler+0x2b/0x50
[   27.941649]  [] nmi_handle.isra.0+0x69/0xb0
[   27.943348]  [] do_nmi+0x169/0x340
[   27.944802]  [] end_repeat_nmi+0x1e/0x2e
[   27.946424]  [] ? hpet_rtc_interrupt+0x85/0x380
[   27.948197]  [] ? hpet_rtc_interrupt+0x85/0x380
[   27.949992]  [] ? hpet_rtc_interrupt+0x85/0x380
[   27.951816]  <>[] ? 
run_timer_softirq+0x43/0x340
[   27.954114]  [] handle_irq_event_percpu+0x3e/0x1e0
[   27.955962]  [] handle_irq_event+0x3d/0x60
[   27.957635]  [] handle_edge_irq+0x77/0x130
[   27.959332]  [] handle_irq+0xbf/0x150
[   27.960949]  [] do_IRQ+0x4f/0xf0
[   27.962434]  [] common_interrupt+0x6d/0x6d
[   27.964101][] ? 
_raw_spin_unlock_irqrestore+0x1b/0x40
[   27.966308]  [] __setup_irq+0x2a7/0x570
[   28.067859]  [] ? hpet_cpuhp_notify+0x140/0x140
[   28.069709]  [] request_threaded_irq+0xcc/0x170
[   28.071585]  [] cmos_do_probe+0x1e6/0x450
[   28.073240]  [] ? cmos_do_probe+0x450/0x450
[   28.074911]  [] cmos_pnp_probe+0xbb/0xc0
[   28.076533]  [] pnp_device_probe+0x65/0xd0
[   28.078198]  [] driver_probe_device+0x87/0x390
[   28.079971]  [] __driver_attach+0x93/0xa0
[   28.081660]  [] ? __device_attach+0x40/0x40
[   28.083662]  [] bus_for_each_dev+0x73/0xc0
[   28.085370]  [] driver_attach+0x1e/0x20
[   28.086974]  [] bus_add_driver+0x200/0x2d0
[   28.088634]  [] ? rtc_sysfs_init+0xe/0xe
[   28.090349]  [] driver_register+0x64/0xf0
[   28.091989]  [] pnp_register_driver+0x20/0x30
[   28.093707]  [] cmos_init+0x11/0x71
---<-snip->---

This patch moves hpet_rtc_timer_init() before IRQ registration, so that we
can gracefully handle such spurious interrupts. It also masks HPET RTC
interrupts, in case IRQ registration fails.

Without this patch, we were able to reproduce the problem within at most
15 trials of kdump secondary kernel boot on an hp-dl160gen8 FCoE host
machine. With the patch applied, more than 35 trials completed without
issue.

Suggested-by: Thomas Gleixner 
Signed-off-by: Pratyush Anand 
---
 drivers/rtc/rtc-cmos.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/rtc/rtc-cmos.c b/drivers/rtc/rtc-cmos.c
index 43745cac0141..fddde655cbd4 100644
--- a/drivers/rtc/rtc-cmos.c
+++ b/drivers/rtc/rtc-cmos.c
@@ -707,6 +707,8 @@ cmos_do_probe(struct device *dev, struct resource *ports, int rtc_irq)
 		goto cleanup1;
 	}
 
+	hpet_rtc_timer_init();
+
 	if (is_valid_irq(rtc_irq)) {
 		irq_handler_t rtc_cmos_int_handler;
 
@@ -714,6 +716,7 @@ cmos_do_probe(struct device *dev, struct resource *ports, int rtc_irq)
 			rtc_cmos_int_handler = hpet_rtc_interrupt;
 			retval = hpet_register_irq_handler(cmos_interrupt);
 			if (retval) {
+				hpet_mask_rtc_irq_bit(RTC_IRQMASK);
 				dev_warn(dev, "hpet_register_irq_handler "
 						" failed in rtc_init().");
 				goto cleanup1;
@@ -729,7 +732,6 @@ cmos_do_probe(struct device *dev, struct resource *ports, int rtc_irq)
 			goto cleanup1;
 		}
 	}
-	hpet_rtc_timer_init();
 
 	/* export at least the first block of NVRAM */
 	nvram.size = address_space - NVRAM_OFFSET;
-- 
2.7.4



rcu_sched self-detected stall on CPU

2016-09-14 Thread NTU
Hello,

I've been testing out KASAN and experimenting with some other new
config settings related to security and kernel hardening, and I ran
into a problem. I've attached a snippet of my kernel log with the
stack trace, as well as my kernel config. Also, intermittently (every
few seconds or so) my system freezes for a short moment and then
resumes; no errors show up in dmesg. My mouse just stops working and
the system becomes very choppy and sluggish, most likely related to
KASAN from the looks of it. I'm running a 4.2 GHz quad-core APU; if
you need any more info, let me know! (The kernel image is nearly 9 MB
with XZ compression!)

Thanks!
[   26.541937] INFO: rcu_sched self-detected stall on CPU
[   26.542948] INFO: rcu_sched detected stalls on CPUs/tasks:
[   26.542959]  3-...: (20987 ticks this GP) idle=2ef/141/0 
softirq=182/182 fqs=5168 
[   26.542962]  (detected by 1, t=21002 jiffies, g=-222, c=-223, q=14)
[   26.542964] Task dump for CPU 3:
[   26.542968] swapper/0   R  running task0 1  0 0x0008
[   26.542973]  0003 8803a4bef228 00015120 
8803a4be6400
[   26.542976]  a900e5a0 0297  
0297
[   26.542979]  88039d950c00 0006e660 8803a4bef220 
0297
[   26.542980] Call Trace:
[   26.542983]  [] ? 0xa7fbd7c1
[   26.542985]  [] ? 0xa7f17d35
[   26.542986]  [] ? 0xa796c115
[   26.542988]  [] ? 0xa7633d86
[   26.542989]  [] ? 0xa796c0a6
[   26.542991]  [] ? 0xa796c761
[   26.542992]  [] ? 0xa796a5d0
[   26.542993]  [] ? 0xa7f17d35
[   26.542995]  [] ? 0xa7657aee
[   26.542996]  [] ? 0xa77d2daf
[   26.542997]  [] ? 0xa77d3241
[   26.542999]  [] ? 0xa768e427
[   26.543000]  [] ? 0xa7680b62
[   26.543002]  [] ? 0xa7681d05
[   26.543003]  [] ? 0xa7682d36
[   26.543004]  [] ? 0xa76859e4
[   26.543006]  [] ? 0xa7677352
[   26.543007]  [] ? 0xa767886a
[   26.543009]  [] ? 0xa8e68a94
[   26.543010]  [] ? 0xa8e7998f
[   26.543011]  [] ? 0xa7967d0a
[   26.543013]  [] ? 0xa7968cee
[   26.543014]  [] ? 0xa7f17bfe
[   26.543016]  [] ? 0xa796c1b6
[   26.543017]  [] ? 0xa796c1b6
[   26.543018]  [] ? 0xa7f47d98
[   26.543020]  [] ? 0xa796c777
[   26.543021]  [] ? 0xa796a5d0
[   26.543023]  [] ? 0xa7f17d35
[   26.543024]  [] ? 0xa7657aee
[   26.543026]  [] ? 0xa7657950
[   26.543027]  [] ? 0xa7f57519
[   26.543028]  [] ? 0xa7f57577
[   26.543030]  [] ? 0xa77d2928
[   26.543031]  [] ? 0xa768d320
[   26.543033]  [] ? 0xa767fccd
[   26.543034]  [] ? 0xa767fccd
[   26.543035]  [] ? 0xa7680ee7
[   26.543037]  [] ? 0xa76806c0
[   26.543038]  [] ? 0xa8078048
[   26.543040]  [] ? 0xa7620446
[   26.543041]  [] ? 0xa7620446
[   26.543043]  [] ? 0xa7fbd7c1
[   26.543044]  [] ? 0xa7f17d35
[   26.543046]  [] ? 0xa796c115
[   26.543047]  [] ? 0xa7633d86
[   26.543048]  [] ? 0xa796c0a6
[   26.543050]  [] ? 0xa796c761
[   26.543051]  [] ? 0xa796a5d0
[   26.543052]  [] ? 0xa7f17d35
[   26.543054]  [] ? 0xa7657aee
[   26.543055]  [] ? 0xa77d2daf
[   26.543057]  [] ? 0xa77d3241
[   26.543058]  [] ? 0xa7683199
[   26.543059]  [] ? 0xa76859e4
[   26.543061]  [] ? 0xa7677352
[   26.543062]  [] ? 0xa767886a
[   26.543063]  [] ? 0xa8e68a94
[   26.543065]  [] ? 0xa8e7998f
[   26.543066]  [] ? 0xa79687af
[   26.543068]  [] ? 0xa7967d0a
[   26.543069]  [] ? 0xa7968cee
[   26.543071]  [] ? 0xa7f17bfe
[   26.543072]  [] ? 0xa796c1b6
[   26.543073]  [] ? 0xa796c1b6
[   26.543075]  [] ? 0xa7f47d98
[   26.543076]  [] ? 0xa796c777
[   26.543078]  [] ? 0xa796a5d0
[   26.543079]  [] ? 0xa7f17d35
[   26.543080]  [] ? 0xa7657aee
[   26.543082]  [] ? 0xa7657950
[   26.543083]  [] ? 0xa7f57519
[   26.543085]  [] ? 0xa7f57577
[   26.543086]  [] ? 0xa77d2928
[   26.543088]  [] ? 0xa76928d6
[   26.543089]  [] ? 0xa769198f
[   26.543090]  [] ? 0xa76935b9
[   26.543092]  [] ? 0xa7693110
[   26.543093]  [] ? 0xa76859e4
[   26.543095]  [] ? 0xa7693867
[   26.543096]  [] ? 0xa767886f
[   26.543098]  [] ? 0xa8e68a94
[   26.543099]  [] ? 0xa8e7998f
[   26.543101]  [] ? 0xa8e68a60
[   26.757288]  3-...: (20987 ticks this GP) idle=2ef/141/0 
softirq=182/182 fqs=5221 
[   26.758231]   (t=21217 jiffies g=-222 c=-223 q=14)
[   26.759200] Task dump for CPU 3:
[   26.760162] swapper/0   R  running task0 1  0 0x0008
[   26.761198]  

Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 07:27:08PM -0700, Andy Lutomirski wrote:
> >> >
> >> > This RFC handles both cgroup and seccomp approaches in a similar way. I
> >> > don't see why building on top of cgroup v2 is a problem. Are there
> >> > security issues with delegation?
> >>
> >> What I mean is: cgroup v2 delegation has a functionality problem.
> >> Tejun says [1]:
> >>
> >> We haven't had to face this decision because cgroup has never properly
> >> supported delegating to applications and the in-use setups where this
> >> happens are custom configurations where there is no boundary between
> >> system and applications and adhoc trial-and-error is good enough a way
> >> to find a working solution.  That wiggle room goes away once we
> >> officially open this up to individual applications.
> >>
> >> Unless and until that changes, I think that landlock should stay away
> >> from cgroups.  Others could reasonably disagree with me.
> >
> > Ours and Sargun's use cases for cgroup+lsm+bpf are not for security
> > and not for sandboxing, so the above doesn't matter in such contexts.
> > lsm hooks + cgroups provide convenient scope and existing entry points.
> > Please see the checmate examples for how it's used.
> >
> 
> To be clear: I'm not arguing at all that there shouldn't be
> bpf+lsm+cgroup integration.  I'm arguing that the unprivileged
> landlock interface shouldn't expose any cgroup integration, at least
> until the cgroup situation settles down a lot.

ahh. yes. we're perfectly in agreement here.
I'm suggesting that the next RFC shouldn't include unpriv
and seccomp at all. Once bpf+lsm+cgroup is merged, we can
argue about unpriv with cgroups and even unpriv as a whole,
since it's not a given. Seccomp integration is also questionable.
I'd rather not have seccomp as a gate keeper for this lsm.
lsm and seccomp are orthogonal hook points. Syscalls and lsm hooks
don't have one to one relationship, so mixing them up is only
asking for trouble further down the road.
If we really need to carry some information from seccomp to lsm+bpf,
it's easier to add eBPF support to seccomp and let bpf side deal
with passing whatever information. 



RE: [PATCH] dmaengine: dmatest: Restore "memcpy" as default mode

2016-09-14 Thread Appana Durga Kedareswara Rao


> -Original Message-
> From: Eugeniy Paltsev [mailto:eugeniy.palt...@synopsys.com]
> Sent: Wednesday, September 14, 2016 11:11 PM
> To: dmaeng...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org; Appana Durga Kedareswara Rao
> ; vinod.k...@intel.com; dan.j.willi...@intel.com;
> andriy.shevche...@linux.intel.com; linux-snps-...@lists.infradead.org; Eugeniy
> Paltsev 
> Subject: [PATCH] dmaengine: dmatest: Restore "memcpy" as default mode
> 
> Commit 0d4cb44da6ca0e8 ("dmaengine: dmatest: Add support for scatter-
> gather DMA mode") changes default "dmatest" behavior by changing default
> mode from "memcpy" to "scatter-gather".
> Now "memcpy" gets back as default mode.
> 
> Signed-off-by: Eugeniy Paltsev 

Reviewed-by: Kedareswara rao Appana 


Thanks,
Kedar.

> ---
>  drivers/dma/dmatest.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/dma/dmatest.c b/drivers/dma/dmatest.c index
> 53e9728..cf76fc6 100644
> --- a/drivers/dma/dmatest.c
> +++ b/drivers/dma/dmatest.c
> @@ -56,10 +56,10 @@ module_param(sg_buffers, uint, S_IRUGO | S_IWUSR);
> MODULE_PARM_DESC(sg_buffers,
>   "Number of scatter gather buffers (default: 1)");
> 
> -static unsigned int dmatest = 1;
> +static unsigned int dmatest;
>  module_param(dmatest, uint, S_IRUGO | S_IWUSR);
> MODULE_PARM_DESC(dmatest,
> - "dmatest 0-memcpy 1-slave_sg (default: 1)");
> + "dmatest 0-memcpy 1-slave_sg (default: 0)");
> 
>  static unsigned int xor_sources = 3;
>  module_param(xor_sources, uint, S_IRUGO | S_IWUSR);
> --
> 2.5.5



Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-14 Thread Nicholas Piggin
On Thu, 15 Sep 2016 12:31:33 +1000
Dave Chinner  wrote:

> On Wed, Sep 14, 2016 at 08:19:36PM +1000, Nicholas Piggin wrote:
> > On Wed, 14 Sep 2016 17:39:02 +1000
> > Dave Chinner  wrote:  
> > > Ok, looking back over your example, you seem to be suggesting a new
> > > page fault behaviour is required from filesystems that has not been
> > > described or explained, and that behaviour is triggered
> > > (persistently) somehow from userspace. You've also suggested
> > > filesystems store a persistent per-block "no fsync" flag
> > > in their extent map as part of the implementation. Right?  
> > 
> > This is what we're talking about. Of course a filesystem can't just
> > start supporting the feature without any changes.  
> 
> Sure, but one first has to describe the feature desired before all

The DAX people have been. They want to be able to get mappings
that can be synced without doing fsync. The *exact* extent of
those capabilities and what the API exactly looks like is up for
discussion.


> parties can discuss it. We need more than vague references and
> allusions from you to define the solution you are proposing.
> 
> Once everyone understands what is being described, we might be able
> to work out how it can be implemented in a simple, generic manner
> rather than require every filesystem to change their on-disk
> formats. IOWs, we need you to describe /details/ of semantics,
> behaviour and data integrity constraints that are required, not
> describe an implementation of something we have no knowledge about.

Well you said it was impossible already and Christoph told them
they were smoking crack :)

Anyway, there are a few questions here: implementation and API. Some
filesystems may never cope with it. Of course it should be as
generic as possible though.


> > > Reading between the lines, I'm guessing that the "no fsync" flag has
> > > very specific update semantics, constraints and requirements.  Can
> > > you outline how you expect this flag to be set and updated, how it's
> > > used consistently between different applications (e.g. cp of a file
> > > vs the app using the file), behavioural constraints it implies for
> > > page faults vs non-mmap access to the data in the block, how
> > > you'd expect filesystems to deal with things like a hole punch
> > > landing in the middle of an extent marked with "no fsync", etc?  
> > 
> > Well that's what's being discussed.  An approach close to what I did is
> > to allow the app to request a "no sync" type of mmap.  
> 
> That's not an answer to the questions I asked about the "no
> sync" flag you were proposing. You've redirected to a different
> solution, one that 

No sync flag would do the same thing exactly in terms of consistency.
It would just do the no-sync sequence by default rather than being
asked for it. More of an API detail than implementation.

> 
> > Filesystem will
> > invalidate all such mappings before it does buffered IOs or hole punch,
> > and will sync metadata after allocating a new block but before returning
> > from a fault.  
> 
> ... requires synchronous metadata updates from page fault context,
> which we already know is not a good solution.  I'll quote one of
> Christoph's previous replies to save me the trouble:
> 
>   "You could write all metadata synchronously from the page
>   fault handler, but that's basically asking for all kinds of
>   deadlocks."
> So, let's redirect back to the "no sync" flag you were talking about
> - can you answer the questions I asked above? It would be especially
> important to highlight how the proposed feature would avoid requiring
> synchronous metadata updates in page fault contexts

Right. So what deadlocks are you concerned about?

There could be a scale of capabilities here, for different filesystems
that do things differently. 

Some filesystems could require fsync for metadata, but allow fdatasync
to be skipped. Users would need to have some knowledge of block size
or do preallocation and sync.

That might put more burden on libraries/applications if there are
concurrent operations, but that might be something they can deal with
-- fdatasync already requires some knowledge of concurrent operations
(or lack thereof).
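
For example (a sketch only, assuming x86, a DAX mount, and a filesystem
that honours the scheme above; the path and error handling are
illustrative): preallocate and sync once up front, then persist later
stores with cache flushes alone.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <emmintrin.h>	/* _mm_clflush, _mm_sfence */

int main(void)
{
	size_t len = 4096;
	int fd = open("/mnt/pmem/file", O_CREAT | O_RDWR, 0644);

	posix_fallocate(fd, 0, len);	/* allocate every block up front */
	fsync(fd);			/* sync the metadata once */

	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);

	memcpy(p, "data", 4);		/* store through the DAX mapping */
	_mm_clflush(p);			/* write the cache line back */
	_mm_sfence();			/* order the flush */
	/* no fdatasync() needed under the scheme discussed above */
	return 0;
}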


> > > [snip]
> > >   
> > > > If there is any huge complexity or unsolved problem, it is in XFS.
> > > > Conceptual problem is simple.
> > > 
> > > Play nice and be constructive, please?  
> > 
> > So you agree that the persistent memory people who have come with some
> > requirements and ideas for an API should not be immediately shut down
> > with bogus handwaving.  
> 
> Pull your head in, Nick.
> 
> You've been absent from the community for the last 5 years. You
> suddenly barge in with a massive chip on your shoulder and try to

I'm trying to give some constructive input to the nvdimm guys.

You and Christoph know a huge amount about vfs and filesystems.
But sometimes you shut people down prematurely. It can be very
intimidating for someone

RE: [PATCH for-next 10/20] IB/hns: Modify the init of iboe lock

2016-09-14 Thread Salil Mehta


> -Original Message-
> From: Leon Romanovsky [mailto:l...@kernel.org]
> Sent: Wednesday, September 14, 2016 6:05 AM
> To: Salil Mehta
> Cc: dledf...@redhat.com; Huwei (Xavier); oulijun; Zhuangyuzeng (Yisen);
> xuwei (O); mehta.salil@gmail.com; linux-r...@vger.kernel.org;
> linux-kernel@vger.kernel.org; Linuxarm; Huangdongdong (Donald)
> Subject: Re: [PATCH for-next 10/20] IB/hns: Modify the init of iboe
> lock
> 
> On Wed, Sep 14, 2016 at 02:09:37AM +, Salil Mehta wrote:
> >
> >
> > > -Original Message-
> > > From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
> > > ow...@vger.kernel.org] On Behalf Of Leon Romanovsky
> > > Sent: Tuesday, September 13, 2016 7:50 AM
> > > To: Salil Mehta
> > > Cc: dledf...@redhat.com; Huwei (Xavier); oulijun; Zhuangyuzeng
> (Yisen);
> > > xuwei (O); mehta.salil@gmail.com; linux-r...@vger.kernel.org;
> > > linux-kernel@vger.kernel.org; Linuxarm; Huangdongdong (Donald)
> > > Subject: Re: [PATCH for-next 10/20] IB/hns: Modify the init of iboe
> > > lock
> > >
> > > On Fri, Sep 09, 2016 at 06:30:41PM +0800, Salil Mehta wrote:
> > > > From: Lijun Ou 
> > > >
> > > > This lock will be used in query port interface, and will be
> called
> > > > while IB device was registered to OFED frame. So, the lock of
> iboe
> > > > must be initiated before IB device was registered.
> > >
> > > Sorry,
> > > what did you mean by writing "OFED frame"?
> > It is a typo. It should have been 'OFED framework', though a more
> > appropriate term might be the 'IB core' layer of InfiniBand. Will fix
> > this. Thanks!
> 
> As a general note, and I understand that these contributors are not
> native English speakers, and I understand the desire to submit the
> right
> code and code should speak by itself, but can you invest more time in
> commit messages and write them in English?
> 
> Thanks
Hi Leon,
Yes, will try to make them better.

Best regards
Salil
> >
> > Best regards
> > Salil
> > >
> > > >
> > > > Signed-off-by: Lijun Ou 
> > > > Signed-off-by: Dongdong Huang(Donald) 
> > > > Reviewed-by:  Wei Hu (Xavier) 
> > > > Signed-off-by: Salil Mehta 
> > > > ---
> > > >  drivers/infiniband/hw/hns/hns_roce_main.c |3 +--
> > > >  1 file changed, 1 insertion(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c
> > > b/drivers/infiniband/hw/hns/hns_roce_main.c
> > > > index 2704076..4721c0c 100644
> > > > --- a/drivers/infiniband/hw/hns/hns_roce_main.c
> > > > +++ b/drivers/infiniband/hw/hns/hns_roce_main.c
> > > > @@ -615,6 +615,7 @@ static int hns_roce_register_device(struct
> > > hns_roce_dev *hr_dev)
> > > > struct device *dev = &hr_dev->pdev->dev;
> > > >
> > > > iboe = &hr_dev->iboe;
> > > > +   spin_lock_init(&iboe->lock);
> > > >
> > > > ib_dev = &hr_dev->ib_dev;
> > > > strlcpy(ib_dev->name, "hisi_%d", IB_DEVICE_NAME_MAX);
> > > > @@ -701,8 +702,6 @@ static int hns_roce_register_device(struct
> > > hns_roce_dev *hr_dev)
> > > > goto error_failed_setup_mtu_gids;
> > > > }
> > > >
> > > > -   spin_lock_init(&iboe->lock);
> > > > -
> > > > iboe->nb.notifier_call = hns_roce_netdev_event;
> > > > ret = register_netdevice_notifier(&iboe->nb);
> > > > if (ret) {
> > > > --
> > > > 1.7.9.5
> > > >
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-
> rdma"
> > > in
> > > > the body of a message to majord...@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-
> info.html


[PATCH v3] posix-timers: make it configurable

2016-09-14 Thread Nicolas Pitre
Many embedded systems typically don't need POSIX timers.  This removes
about 22KB from the kernel binary size on ARM when configured out.

Corresponding syscalls are routed to a stub that logs the attempt to
use them, which should be enough of a clue if they were disabled
without proper consideration. They are: timer_create, timer_gettime,
timer_getoverrun, timer_settime, timer_delete and clock_adjtime.

The clock_settime, clock_gettime, clock_getres and clock_nanosleep syscalls
are replaced by simple wrappers compatible with CLOCK_REALTIME,
CLOCK_MONOTONIC and CLOCK_BOOTTIME only.

Signed-off-by: Nicolas Pitre 
---

Changes from v2:

- Fix compilation with CONFIG_COMPAT=y due to missing 
  clock_nanosleep_restart(), found by kbuild test robot.

Changes from RFC/v1:

- Stubbed-out functions moved to static inlines.
- The timer signal handling code is now removed.
- The list of removed syscalls is explicitly documented.
- The clock_settime, clock_gettime, clock_getres and clock_nanosleep 
  syscalls are minimally preserved as this required very little code.

I'm now able to boot a copy of Fedora 21 with this patch and 
CONFIG_POSIX_TIMERS=n with no apparent issues.
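
For reference, the preserved syscalls reduce to wrappers of roughly
this shape (a sketch only, not the patch's actual code):

/* sketch: minimal clock_gettime covering only the three kept clocks */
SYSCALL_DEFINE2(clock_gettime, const clockid_t, which_clock,
		struct timespec __user *, tp)
{
	struct timespec ts;

	switch (which_clock) {
	case CLOCK_REALTIME:
		getnstimeofday(&ts);
		break;
	case CLOCK_MONOTONIC:
		ktime_get_ts(&ts);
		break;
	case CLOCK_BOOTTIME:
		get_monotonic_boottime(&ts);
		break;
	default:
		return -EINVAL;
	}
	return copy_to_user(tp, &ts, sizeof(ts)) ? -EFAULT : 0;
}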

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index ee3de3421f..00e6098e9a 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -6,7 +6,7 @@ menu "PTP clock support"
 
 config PTP_1588_CLOCK
tristate "PTP clock support"
-   depends on NET
+   depends on NET && POSIX_TIMERS
select PPS
select NET_PTP_CLASSIFY
help
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 62d44c1760..2288c5c557 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -118,6 +118,8 @@ struct k_clock {
 extern struct k_clock clock_posix_cpu;
 extern struct k_clock clock_posix_dynamic;
 
+#ifdef CONFIG_POSIX_TIMERS
+
 void posix_timers_register_clock(const clockid_t clock_id, struct k_clock 
*new_clock);
 
 /* function to call to trigger timer event */
@@ -131,8 +133,30 @@ void posix_cpu_timers_exit_group(struct task_struct *task);
 void set_process_cpu_timer(struct task_struct *task, unsigned int clock_idx,
   cputime_t *newval, cputime_t *oldval);
 
-long clock_nanosleep_restart(struct restart_block *restart_block);
-
 void update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new);
 
+#else
+
+#include <linux/random.h>
+
+static inline void posix_timers_register_clock(const clockid_t clock_id,
+  struct k_clock *new_clock) {}
+static inline int posix_timer_event(struct k_itimer *timr, int si_private)
+{ return 0; }
+static inline void run_posix_cpu_timers(struct task_struct *task) {}
+static inline void posix_cpu_timers_exit(struct task_struct *task)
+{
+   add_device_randomness((const void*) &task->se.sum_exec_runtime,
+ sizeof(unsigned long long));
+}
+static inline void posix_cpu_timers_exit_group(struct task_struct *task) {}
+static inline void set_process_cpu_timer(struct task_struct *task,
+   unsigned int clock_idx, cputime_t *newval, cputime_t *oldval) {}
+static inline void update_rlimit_cpu(struct task_struct *task,
+unsigned long rlim_new) {}
+
+#endif
+
+long clock_nanosleep_restart(struct restart_block *restart_block);
+
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 54182d52a0..39a1d6d3f5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2924,8 +2924,13 @@ static inline void exit_thread(struct task_struct *tsk)
 extern void exit_files(struct task_struct *);
 extern void __cleanup_sighand(struct sighand_struct *);
 
+#ifdef CONFIG_POSIX_TIMERS
 extern void exit_itimers(struct signal_struct *);
 extern void flush_itimer_signals(void);
+#else
+static inline void exit_itimers(struct signal_struct *s) {}
+static inline void flush_itimer_signals(void) {}
+#endif
 
 extern void do_group_exit(int);
 
@@ -3382,7 +3387,12 @@ static __always_inline bool need_resched(void)
  * Thread group CPU time accounting.
  */
 void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times);
+#ifdef CONFIG_POSIX_TIMERS
 void thread_group_cputimer(struct task_struct *tsk, struct task_cputime 
*times);
+#else
+static inline void thread_group_cputimer(struct task_struct *tsk,
+struct task_cputime *times) {}
+#endif
 
 /*
  * Reevaluate whether the task has signals pending delivery.
diff --git a/kernel/signal.c b/kernel/signal.c
index af21afc00d..ea75065e29 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -427,6 +427,7 @@ void flush_signals(struct task_struct *t)
spin_unlock_irqrestore(&t->sighand->siglock, flags);
 }
 
+#ifdef CONFIG_POSIX_TIMERS
 static void __flush_itimer_signals(struct sigpending *pending)
 {
sigset_t signal, retain;
@@ -460,6 +461,7 @@ void flush_itimer_signals(void)


linux-next: build warning after merge of the lightnvm tree

2016-09-14 Thread Stephen Rothwell
Hi Matias,

After merging the lightnvm tree, today's linux-next build (x86_64
allmodconfig) produced this warning:

drivers/lightnvm/sysfs.c: In function 'nvm_sysfs_register_dev':
drivers/lightnvm/sysfs.c:184:2: warning: ignoring return value of 'device_add', 
declared with attribute warn_unused_result [-Wunused-result]
  device_add(&dev->dev);
  ^

Introduced by commit

  38c9e260b9f9 ("lightnvm: expose device geometry through sysfs")
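
Presumably the return value just needs to be checked and propagated;
a minimal (untested) sketch of the sort of fix:

	ret = device_add(&dev->dev);
	if (ret) {
		pr_err("nvm: failed to register device in sysfs\n");
		return ret;
	}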

-- 
Cheers,
Stephen Rothwell


RE: [PATCH for-next 13/20] IB/hns: Add check for some NULL pointer scenes

2016-09-14 Thread Salil Mehta


> -Original Message-
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
> ow...@vger.kernel.org] On Behalf Of Leon Romanovsky
> Sent: Tuesday, September 13, 2016 8:00 AM
> To: Salil Mehta
> Cc: dledf...@redhat.com; Huwei (Xavier); oulijun; Zhuangyuzeng (Yisen);
> xuwei (O); mehta.salil@gmail.com; linux-r...@vger.kernel.org;
> linux-kernel@vger.kernel.org; Linuxarm; Huangdongdong (Donald)
> Subject: Re: [PATCH for-next 13/20] IB/hns: Add check for some NULL
> pointer scenes
> 
> On Fri, Sep 09, 2016 at 06:30:44PM +0800, Salil Mehta wrote:
> > From: Lijun Ou 
> >
> > Some pointers have not be checked when they are null,
> > so we add check for them.
> >
> > Signed-off-by: Lijun Ou 
> > Signed-off-by: Dongdong Huang(Donald) 
> > Reviewed-by:  Wei Hu (Xavier) 
> > Signed-off-by: Salil Mehta 
> 
> I may admit that I didn't check your code to read the implementations
> of
> get_send_wqe() and hns_roce_v1_poll_one(), but based on my assumption
> that the code is similar to mlx4.
> 
> These failures can't occur.
> 
> Can you throw a light on how did you find them and check it?
Hi Leon,
Looks like this is another redundant patch. These return checks should
never be required. I think the mistake lies in the wrong check placed
inside the function below:

void *get_send_wqe(struct hns_roce_qp *hr_qp, int n)
{
.
.
/* To Be Deleted: Below check is redundantly placed. */
if ((n < 0) || (n > hr_qp->sq.wqe_cnt)) {
dev_err(&hr_dev->pdev->dev, "sq wqe index:%d,sq wqe cnt:%d\r\n",
n, hr_qp->sq.wqe_cnt);
return NULL;
}

return get_wqe(hr_qp, hr_qp->sq.offset + (n << hr_qp->sq.wqe_shift));
}

and perhaps the same is the case in get_rcv_wqe(). The same check needs
to be removed from there as well, along with the error handling in the
calling functions. Thanks for figuring it out. Will correct this in a
subsequent patch; the helper would then reduce to the sketch below
(untested).
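
void *get_send_wqe(struct hns_roce_qp *hr_qp, int n)
{
	/* every caller already masks n with (sq.wqe_cnt - 1),
	 * so the range check above is unreachable
	 */
	return get_wqe(hr_qp, hr_qp->sq.offset + (n << hr_qp->sq.wqe_shift));
}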

Best regards
Salil 
> 
> > ---
> >  drivers/infiniband/hw/hns/hns_roce_hw_v1.c |   11 +++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
> b/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
> > index f0d6315..e3e154c 100644
> > --- a/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
> > +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
> > @@ -92,6 +92,12 @@ int hns_roce_v1_post_send(struct ib_qp *ibqp,
> struct ib_send_wr *wr,
> > }
> >
> > wqe = get_send_wqe(qp, ind & (qp->sq.wqe_cnt - 1));
> > +   if (unlikely(!wqe)) {
> > +   dev_err(dev, "get send wqe failed\n");
> > +   ret = -EINVAL;
> > +   *bad_wr = wr;
> > +   goto out;
> > +   }
> > qp->sq.wrid[(qp->sq.head + nreq) & (qp->sq.wqe_cnt - 1)] =
> >   wr->wr_id;
> >
> > @@ -1558,6 +1564,11 @@ static int hns_roce_v1_poll_one(struct
> hns_roce_cq *hr_cq,
> > sq_wqe = get_send_wqe(*cur_qp, roce_get_field(cqe-
> >cqe_byte_4,
> > CQE_BYTE_4_WQE_INDEX_M,
> > CQE_BYTE_4_WQE_INDEX_S));
> > +   if (unlikely(!sq_wqe)) {
> > +   dev_err(dev, "Get send wqe failed!\n");
> > +   return -EFAULT;
> > +   }
> > +
> > switch (sq_wqe->flag & HNS_ROCE_WQE_OPCODE_MASK) {
> > case HNS_ROCE_WQE_OPCODE_SEND:
> > wc->opcode = IB_WC_SEND;
> > --
> > 1.7.9.5
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma"
> in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -next] mailbox: platform_mhu: Remove redundant dev_err call in platform_mhu_probe()

2016-09-14 Thread Wei Yongjun
From: Wei Yongjun 

There is an error message within devm_ioremap_resource
already, so remove the dev_err call to avoid a redundant
error message.

Signed-off-by: Wei Yongjun 
---
 drivers/mailbox/platform_mhu.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/mailbox/platform_mhu.c b/drivers/mailbox/platform_mhu.c
index e13201a..b3b5acc 100644
--- a/drivers/mailbox/platform_mhu.c
+++ b/drivers/mailbox/platform_mhu.c
@@ -137,10 +137,8 @@ static int platform_mhu_probe(struct platform_device *pdev)
 
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
mhu->base = devm_ioremap_resource(dev, res);
-   if (IS_ERR(mhu->base)) {
-   dev_err(dev, "ioremap failed\n");
+   if (IS_ERR(mhu->base))
return PTR_ERR(mhu->base);
-   }
 
for (i = 0; i < MHU_CHANS; i++) {
mhu->chan[i].con_priv = &mhu->mlink[i];



[PATCH -next] nbd: fix return value check

2016-09-14 Thread Wei Yongjun
From: Wei Yongjun 

In case of error, the function blk_mq_alloc_request() or
blk_mq_init_queue() returns ERR_PTR(), not NULL. The NULL
test in the return value check should be replaced with
IS_ERR().
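
For reference, ERR_PTR() encodes a negative errno in the pointer value
itself, so the usual idiom is (illustrative fragment, not from this
driver):

	void *p = ERR_PTR(-ENOMEM);	/* pointer carrying an errno */

	if (IS_ERR(p))			/* true for the error range */
		return PTR_ERR(p);	/* recovers -ENOMEM */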

Fixes: fd8383fd88a2 ("nbd: convert to blkmq")
Signed-off-by: Wei Yongjun 
---
 drivers/block/nbd.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 4c6dd1a..62ff9cb 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -599,8 +599,8 @@ static int __nbd_ioctl(struct block_device *bdev, struct 
nbd_device *nbd,
return -EINVAL;
 
sreq = blk_mq_alloc_request(bdev_get_queue(bdev), WRITE, 0);
-   if (!sreq)
-   return -ENOMEM;
+   if (IS_ERR(sreq))
+   return PTR_ERR(sreq);
 
mutex_unlock(&nbd->tx_lock);
fsync_bdev(bdev);
@@ -956,7 +956,8 @@ static int __init nbd_init(void)
 * These structs are big so we dynamically allocate them.
 */
disk->queue = blk_mq_init_queue(&nbd_dev[i].tag_set);
-   if (!disk->queue) {
+   if (IS_ERR(disk->queue)) {
+   err = PTR_ERR(disk->queue);
blk_mq_free_tag_set(&nbd_dev[i].tag_set);
put_disk(disk);
goto out;



[PATCH -next] drm/sun4i: backend: remove redundant dev_err call in sun4i_backend_bind()

2016-09-14 Thread Wei Yongjun
From: Wei Yongjun 

There is an error message within devm_ioremap_resource
already, so remove the dev_err call to avoid a redundant
error message.

Signed-off-by: Wei Yongjun 
---
 drivers/gpu/drm/sun4i/sun4i_backend.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/sun4i/sun4i_backend.c 
b/drivers/gpu/drm/sun4i/sun4i_backend.c
index 91a7022..fcd4e91 100644
--- a/drivers/gpu/drm/sun4i/sun4i_backend.c
+++ b/drivers/gpu/drm/sun4i/sun4i_backend.c
@@ -288,10 +288,8 @@ static int sun4i_backend_bind(struct device *dev, struct 
device *master,
 
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
regs = devm_ioremap_resource(dev, res);
-   if (IS_ERR(regs)) {
-   dev_err(dev, "Couldn't map the backend registers\n");
+   if (IS_ERR(regs))
return PTR_ERR(regs);
-   }
 
backend->regs = devm_regmap_init_mmio(dev, regs,
  &sun4i_backend_regmap_config);





Re: [PATCH 0/5] Make /sys/class/net per net namespace objects belong to container

2016-09-14 Thread Dmitry Torokhov
On Mon, Aug 29, 2016 at 5:38 AM, Eric W. Biederman
 wrote:
> David Miller  writes:
>
>> From: Dmitry Torokhov 
>> Date: Tue, 16 Aug 2016 15:33:10 -0700
>>
>>> There are objects in /sys hierarchy (/sys/class/net/) that logically belong
>>> to a namespace/container. Unfortunately all sysfs objects start their life
>>> belonging to global root, and while we could change ownership manually,
>>> keeping track of all objects that come and go is cumbersome. It would
>>> be better if kernel created them using correct uid/gid from the beginning.
>>>
>>> This series changes kernfs to allow creating objects with arbitrary
>>> uid/gid, adds get_ownership() callback to ktype structure so subsystems
>>> could supply their own logic (likely tied to namespace support) for
>>> determining ownership of kobjects, and adjusts sysfs code to make use of
>>> this information. Lastly net-sysfs is adjusted to make sure that objects in
>>> net namespace are owned by the root user from the owning user namespace.
>>>
>>> Note that we do not adjust ownership of objects moved into a new namespace
>>> (as when moving a network device into a container) as userspace can easily
>>> do it.
>>
>> I need some domain experts to review this series please.
>
> I just came back from vacation and I will aim to take a look shortly.
>
> The big picture idea seems sensible.  Having better ownership of sysfs
> files that are part of a network namespace.  I will have to look at the
> details to see if the implementation is similarly sensible.

Eric,

Did you find anything objectionable in the series or should I fix up
the !CONFIG_SYSFS error in networking patch and resubmit?

Thanks.

-- 
Dmitry


[PATCH] DRM: i915: Fix gen8 graphics on Broadwell-U. These patches stop the random gpu hang on my XPS-13-9343, kernel version 4.8-rc5.

2016-09-14 Thread bobcao3
Signed-off-by: bobcao3 
---
 drivers/gpu/drm/i915/i915_gem_gtt.c |  6 
 drivers/gpu/drm/i915/i915_gem_stolen.c  | 61 -
 drivers/gpu/drm/i915/i915_reg.h |  6 
 drivers/gpu/drm/i915/intel_ringbuffer.c | 19 +-
 4 files changed, 59 insertions(+), 33 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c 
b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 7a30af7..0b05dd9 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -2907,6 +2907,12 @@ static unsigned int gen8_get_total_gtt_size(u16 
bdw_gmch_ctl)
if (bdw_gmch_ctl > 4)
bdw_gmch_ctl = 4;
 #endif
+#ifdef CONFIG_X86_64
+   /* Limit 64b platforms to a 4GB GGTT */
+   /* DMA 4GB protection */
+   if (bdw_gmch_ctl > 8)
+   bdw_gmch_ctl = 8;
+#endif
 
return bdw_gmch_ctl << 20;
 }
diff --git a/drivers/gpu/drm/i915/i915_gem_stolen.c 
b/drivers/gpu/drm/i915/i915_gem_stolen.c
index 66be299a1..da272ae 100644
--- a/drivers/gpu/drm/i915/i915_gem_stolen.c
+++ b/drivers/gpu/drm/i915/i915_gem_stolen.c
@@ -352,47 +352,44 @@ static void gen8_get_stolen_reserved(struct 
drm_i915_private *dev_priv,
 unsigned long *base, unsigned long *size)
 {
uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
+   unsigned long stolen_top;
+   struct i915_ggtt *ggtt = &dev_priv->ggtt;
 
*base = reg_val & GEN6_STOLEN_RESERVED_ADDR_MASK;
 
switch (reg_val & GEN8_STOLEN_RESERVED_SIZE_MASK) {
case GEN8_STOLEN_RESERVED_1M:
-   *size = 1024 * 1024;
+   *size = 1 << 10 << 10;
break;
case GEN8_STOLEN_RESERVED_2M:
-   *size = 2 * 1024 * 1024;
+   *size = 2 << 10 << 10;
break;
case GEN8_STOLEN_RESERVED_4M:
-   *size = 4 * 1024 * 1024;
+   *size = 4 << 10 << 10;
break;
case GEN8_STOLEN_RESERVED_8M:
-   *size = 8 * 1024 * 1024;
+   *size = 8 << 10 << 10;
break;
default:
-   *size = 8 * 1024 * 1024;
-   MISSING_CASE(reg_val & GEN8_STOLEN_RESERVED_SIZE_MASK);
-   }
-}
-
-static void bdw_get_stolen_reserved(struct drm_i915_private *dev_priv,
-   unsigned long *base, unsigned long *size)
-{
-   struct i915_ggtt *ggtt = &dev_priv->ggtt;
-   uint32_t reg_val = I915_READ(GEN6_STOLEN_RESERVED);
-   unsigned long stolen_top;
+   /* Whether it is a BDW device, an SKL device,
+* or some other device, this path always works on
+* 5th generation Intel processors
+*/
+   stolen_top = dev_priv->mm.stolen_base + ggtt->stolen_size;
 
-   stolen_top = dev_priv->mm.stolen_base + ggtt->stolen_size;
+   *base = reg_val & GEN6_STOLEN_RESERVED_ADDR_MASK;
 
-   *base = reg_val & GEN6_STOLEN_RESERVED_ADDR_MASK;
+   /* MLIMIT - MBASE => PEG */
+   /*   -- mobile-5th-gen-core-family-datasheet-vol-2.pdf */
+   if (*base == 0) {
+   *size = 0;
+   MISSING_CASE(reg_val & GEN8_STOLEN_RESERVED_SIZE_MASK);
+   } else
+   *size = stolen_top - *base;
 
-   /* On these platforms, the register doesn't have a size field, so the
-* size is the distance between the base and the top of the stolen
-* memory. We also have the genuine case where base is zero and there's
-* nothing reserved. */
-   if (*base == 0)
-   *size = 0;
-   else
-   *size = stolen_top - *base;
+   break;
+   }
 }
 
 int i915_gem_init_stolen(struct drm_device *dev)
@@ -442,14 +439,14 @@ int i915_gem_init_stolen(struct drm_device *dev)
gen7_get_stolen_reserved(dev_priv, &reserved_base,
 &reserved_size);
break;
+   case 8:
+   gen8_get_stolen_reserved(dev_priv, &reserved_base,
+&reserved_size);
+   break;
default:
-   if (IS_BROADWELL(dev_priv) ||
-   IS_SKYLAKE(dev_priv) || IS_KABYLAKE(dev))
-   bdw_get_stolen_reserved(dev_priv, &reserved_base,
-   &reserved_size);
-   else
-   gen8_get_stolen_reserved(dev_priv, &reserved_base,
-&reserved_size);
+   /* FIXME: this seems like it should work */
+   gen8_get_stolen_reserved(dev_priv, &reserved_base,
+&reserved_size);
break;
}
 
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index bf2cad3..3dce37b 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -1748,6 +1748,12 @@ enum 

Re: [PATCH] cpufreq: CPPC: Avoid overflow when calculating desired_perf

2016-09-14 Thread Hoan Tran
Hi Rafael,

On Wed, Sep 14, 2016 at 5:50 PM, Rafael J. Wysocki  wrote:
> On Wednesday, September 14, 2016 04:08:28 PM Hoan Tran wrote:
>> This patch fixes overflow issue when calculating the desired_perf.
>>
>> Signed-off-by: Hoan Tran 
>> ---
>>  drivers/cpufreq/cppc_cpufreq.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
>> index 894e465..3e0961e 100644
>> --- a/drivers/cpufreq/cppc_cpufreq.c
>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>> @@ -84,7 +84,8 @@ static int cppc_cpufreq_set_target(struct cpufreq_policy 
>> *policy,
>>
>>   cpu = all_cpu_data[policy->cpu];
>>
>> - cpu->perf_ctrls.desired_perf = target_freq * policy->max / 
>> cppc_dmi_max_khz;
>> + cpu->perf_ctrls.desired_perf =
>> + (u64)target_freq * policy->max / cppc_dmi_max_khz;
>>   freqs.old = policy->cur;
>>   freqs.new = target_freq;
>
> That's on top of the CPPC material in linux-next I gather?

Yes, it's on TOP of linux-next.

>
> Which commit does it fix?

This is a fix for ad38677df44b67e0f5b6c4d31e9c2734abde8ed9 (cpufreq:
CPPC: Force reporting values in KHz to fix user space interface)
http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/drivers/cpufreq/cppc_cpufreq.c?id=ad38677df44b67e0f5b6c4d31e9c2734abde8ed9

Thanks
Hoan

>
> Thanks,
> Rafael
>
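
As a side note for readers, the bug is easy to demonstrate in
isolation: both operands are 32-bit, so the product wraps long before
the division scales it back down. A minimal userspace sketch (the kHz
values below are made up for the demo, not taken from real CPPC
tables):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint32_t target_freq = 2500000;  /* 2.5 GHz in kHz */
            uint32_t max = 2500000;          /* stands in for policy->max */
            uint32_t dmi_max_khz = 2500000;  /* stands in for cppc_dmi_max_khz */

            /* 32-bit multiply wraps: 2500000 * 2500000 mod 2^32 */
            uint32_t wrong = target_freq * max / dmi_max_khz;

            /* Widening one operand first keeps the full 64-bit product,
             * which is what the (u64) cast in the patch does. */
            uint64_t right = (uint64_t)target_freq * max / dmi_max_khz;

            /* prints: wrong=329 right=2500000 */
            printf("wrong=%u right=%llu\n", wrong,
                   (unsigned long long)right);
            return 0;
    }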


perf: perf_fuzzer lockup in perf_cgroup_attach

2016-09-14 Thread Vince Weaver

so the skylake that was fuzzing finally is mostly locked up.

Really hard to tell what's going on, especially as KASLR made looking up the 
addresses a big pain.

The best I can tell things are getting wedged somehow in 
perf_cgroup_switch() while interrupts are disabled.  Interrupts are never 
getting re-enabled, causing the RCU and NMI watchdogs to trigger (and more 
alarming things like the SATA bus resetting).

[26292.413603] Task dump for CPU 4:
[26292.413604] perf_fuzzer R  running task0  8870   1096 0x1008
[26292.413605]  9045f29e1100 e7143ab0 9045fdd03db8 
938accef
[26292.413606]  0004 0087 9045fdd03dd0 
938af5e9
[26292.413607]  0004 9045fdd03e00 93984928 
9045fdd19440
[26292.413608] Call Trace:
[26292.413609][] sched_show_task+0xaf/0x110
[26292.413611]  [] dump_cpu_task+0x39/0x40
[26292.413613]  [] rcu_dump_cpu_stacks+0x80/0xbb
[26292.413614]  [] rcu_check_callbacks+0x71e/0x880
[26292.413615]  [] ? acct_account_cputime+0x1c/0x20
[26292.413616]  [] ? account_system_time+0x79/0x120
[26292.413617]  [] ? tick_sched_do_timer+0x30/0x30
[26292.413619]  [] update_process_times+0x2f/0x60
[26292.413619]  [] tick_sched_handle.isra.13+0x25/0x60
[26292.413620]  [] tick_sched_timer+0x3d/0x70
[26292.413621]  [] __hrtimer_run_queues+0xf3/0x280
[26292.413623]  [] hrtimer_interrupt+0xa8/0x1a0
[26292.413624]  [] local_apic_timer_interrupt+0x38/0x60
[26292.413625]  [] smp_trace_apic_timer_interrupt+0x44/0xde
[26292.413627]  [] ? perf_cgroup_attach+0x70/0x70
[26292.413628]  [] trace_apic_timer_interrupt+0x82/0x90
[26292.413628][] ? perf_cgroup_attach+0x70/0x70
[26292.413631]  [] ? smp_call_function_single+0xd6/0x130
[26292.413632]  [] ? perf_cgroup_attach+0x70/0x70
[26292.413633]  [] cpu_function_call+0x43/0x60
[26292.413634]  [] ? __perf_event_enable+0x260/0x260
[26292.413635]  [] perf_install_in_context+0x141/0x150
[26292.413636]  [] SYSC_perf_event_open+0x70e/0xfe0
[26292.413637]  [] ? check_preempt_curr+0x54/0x90
[26292.413639]  [] SyS_perf_event_open+0x9/0x10
[26292.413640]  [] do_syscall_64+0x64/0x160
[26292.413641]  [] entry_SYSCALL64_slow_path+0x25/0x25

[26316.489382] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s! 
[perf_fuzzer:8870]
[26316.497309] CPU: 4 PID: 8870 Comm: perf_fuzzer Tainted: GW   
4.8.0-rc6+ #5
[26316.497310] Hardware name: LENOVO 10FY0017US/SKYBAY, BIOS FWKT53A   
06/06/2016
[26316.497310] task: 9045f29e1100 task.stack: 9045f0a48000
[26316.497311] RIP: 0010:[]  [] 
smp_call_function_single+0xd6/0x130
[26316.497312] RSP: 0018:9045f0a4bd30  EFLAGS: 0202
[26316.497313] RAX:  RBX: 0002 RCX: 
[26316.497313] RDX: 0001 RSI: 00fb RDI: 0286
[26316.497314] RBP: 9045f0a4bd78 R08: 9045fdc9bed0 R09: 9045f2ac8000
[26316.497314] R10:  R11: 0246 R12: 9396ff60
[26316.497315] R13: 9045fdc9bed0 R14: 001e R15: 9045fdc9bee0
[26316.497315] FS:  7f4157a08700() GS:9045fdd0() 
knlGS:
[26316.497316] CS:  0010 DS:  ES:  CR0: 80050033
[26316.497316] CR2: 0693a048 CR3: 000231049000 CR4: 003407e0
[26316.497317] DR0:  DR1:  DR2: 
[26316.497317] DR3:  DR6: fffe0ff0 DR7: 0600
[26316.497318] Stack:
[26316.497318]   9045ee35da80  
9396ff60
[26316.497319]  9045f0a4bd88 0003 e7143ab0 
9045ee47613a
[26316.497321]  9045ee476000 9045f0a4bdb0 9396ec53 

[26316.497322] Call Trace:
[26316.497323]  [] ? perf_cgroup_attach+0x70/0x70
[26316.497324]  [] cpu_function_call+0x43/0x60
[26316.497325]  [] ? __perf_event_enable+0x260/0x260
[26316.497326]  [] perf_install_in_context+0x141/0x150
[26316.497327]  [] SYSC_perf_event_open+0x70e/0xfe0
[26316.497328]  [] ? check_preempt_curr+0x54/0x90
[26316.497330]  [] SyS_perf_event_open+0x9/0x10
[26316.497331]  [] do_syscall_64+0x64/0x160
[26316.497332]  [] entry_SYSCALL64_slow_path+0x25/0x25
[26316.497332] Code: 25 28 00 00 00 75 70 48 83 c4 38 5b 41 5c 5d c3 48 8d 75 
c8 48 89 d1 89 df 4c 89 e2 e8 14 fe ff ff 8b 55 e0 83 e2 01 74 0a f3 90 <8b> 55 
e0 83 e2 01 75 f6 eb c3 8b 05 ba e5 dd 00 85 c0 75 85 80 



Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-14 Thread Dave Chinner
On Wed, Sep 14, 2016 at 08:19:36PM +1000, Nicholas Piggin wrote:
> On Wed, 14 Sep 2016 17:39:02 +1000
> Dave Chinner  wrote:
> > Ok, looking back over your example, you seem to be suggesting a new
> > page fault behaviour is required from filesystems that has not been
> > described or explained, and that behaviour is triggered
> > (persistently) somehow from userspace. You've also suggested
> > filesystems store a persistent per-block "no fsync" flag
> > in their extent map as part of the implementation. Right?
> 
> This is what we're talking about. Of course a filesystem can't just
> start supporting the feature without any changes.

Sure, but one first has to describe the feature desired before all
parties can discuss it. We need more than vague references and
allusions from you to define the solution you are proposing.

Once everyone understands what is being described, we might be able
to work out how it can be implemented in a simple, generic manner
rather than require every filesystem to change their on-disk
formats. IOWs, we need you to describe /details/ of semantics,
behaviour and data integrity constraints that are required, not
describe an implementation of something we have no knowledge about.

> > Reading between the lines, I'm guessing that the "no fsync" flag has
> > very specific update semantics, constraints and requirements.  Can
> > you outline how you expect this flag to be set and updated, how it's
> > used consistently between different applications (e.g. cp of a file
> > vs the app using the file), behavioural constraints it implies for
> > page faults vs non-mmap access to the data in the block, how
> > you'd expect filesystems to deal with things like a hole punch
> > landing in the middle of an extent marked with "no fsync", etc?
> 
> Well that's what's being discussed.  An approach close to what I did is
> to allow the app to request a "no sync" type of mmap.

That's not an answer to the questions I asked about the "no
sync" flag you were proposing. You've redirected to a different
solution, one that 

> Filesystem will
> invalidate all such mappings before it does buffered IOs or hole punch,
> and will sync metadata after allocating a new block but before returning
> from a fault.

... requires synchronous metadata updates from page fault context,
which we already know is not a good solution.  I'll quote one of
Christoph's previous replies to save me the trouble:

"You could write all metadata synchronously from the page
fault handler, but that's basically asking for all kinds of
deadlocks."

So, let's redirect back to the "no sync" flag you were talking about
- can you answer the questions I asked above? It would be especially
important to highlight how the proposed feature would avoid requiring
synchronous metadata updates in page fault contexts.
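
To make that concrete, the userspace model being kicked around would
look something like the sketch below. MAP_NOSYNC is a purely
hypothetical flag name for illustration; no such mmap(2) flag exists
today, and its value here is invented:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>

    #define MAP_NOSYNC 0x80000  /* hypothetical, illustrative value only */

    int write_record(const char *path, const char *buf, size_t len)
    {
            int fd = open(path, O_RDWR);
            char *p;

            if (fd < 0)
                    return -1;

            p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_NOSYNC, fd, 0);
            if (p == MAP_FAILED)
                    return -1;

            memcpy(p, buf, len);
            /* The contract under discussion: CPU cache flushes alone make
             * the data durable; no msync()/fsync() is needed because the
             * filesystem synced any block allocation inside the fault. */
            return 0;
    }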

> > [snip]
> > 
> > > If there is any huge complexity or unsolved problem, it is in XFS.
> > > Conceptual problem is simple.  
> > 
> > Play nice and be constructive, please?
> 
> So you agree that the persistent memory people who have come with some
> requirements and ideas for an API should not be immediately shut down
> with bogus handwaving.

Pull your head in, Nick.

You've been absent from the community for the last 5 years. You
suddenly barge in with a massive chip on your shoulder and try to
throw your weight around. You're being arrogant, obnoxious, evasive
and petty. You're belittling anyone who dares to question your
proclamations. You're not listening to the replies you are getting.
You're baiting people to try to get an adverse reaction from them
and when someone gives you the adverse reaction you were fishing
for, you play the victim card.

That's textbook bullying behaviour.

Nick, this behaviour does not help progress the discussion in any
way. It only serves to annoy the other people who are sincerely
trying to understand and determine if/how we can solve the problem
in some way.

So, again, play nice and be constructive, please?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[PATCH -next] staging: fsl-mc: remove .owner field for driver

2016-09-14 Thread Wei Yongjun
From: Wei Yongjun 

Remove .owner field if calls are used which set it automatically.

Generated by: scripts/coccinelle/api/platform_no_drv_owner.cocci

Signed-off-by: Wei Yongjun 
---
 drivers/staging/fsl-mc/bus/fsl-mc-allocator.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/staging/fsl-mc/bus/fsl-mc-allocator.c 
b/drivers/staging/fsl-mc/bus/fsl-mc-allocator.c
index 2004fa7..1a35cfb 100644
--- a/drivers/staging/fsl-mc/bus/fsl-mc-allocator.c
+++ b/drivers/staging/fsl-mc/bus/fsl-mc-allocator.c
@@ -652,7 +652,6 @@ static const struct fsl_mc_device_id match_id_table[] = {
 static struct fsl_mc_driver fsl_mc_allocator_driver = {
.driver = {
   .name = "fsl_mc_allocator",
-  .owner = THIS_MODULE,
   .pm = NULL,
   },
.match_id_table = match_id_table,
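
The field is redundant because bus register helpers conventionally
pass THIS_MODULE themselves. A condensed sketch of the usual kernel
pattern (the names follow the fsl-mc bus; treat the exact form as an
assumption and check the real bus code):

    /* The register helper is a macro capturing the calling module,
     * so individual drivers never need to set .owner by hand. */
    #define fsl_mc_driver_register(drv) \
            __fsl_mc_driver_register(drv, THIS_MODULE)

    int __fsl_mc_driver_register(struct fsl_mc_driver *mc_driver,
                                 struct module *owner)
    {
            mc_driver->driver.owner = owner;  /* set automatically here */
            mc_driver->driver.bus = &fsl_mc_bus_type;
            return driver_register(&mc_driver->driver);
    }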





Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Andy Lutomirski
On Wed, Sep 14, 2016 at 7:19 PM, Alexei Starovoitov
 wrote:
> On Wed, Sep 14, 2016 at 06:25:07PM -0700, Andy Lutomirski wrote:
>> On Wed, Sep 14, 2016 at 3:11 PM, Mickaël Salaün  wrote:
>> >
>> > On 14/09/2016 20:27, Andy Lutomirski wrote:
>> >> On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün  wrote:
>> >>> Add a new flag CGRP_NO_NEW_PRIVS for each cgroup. This flag is initially
>> >>> set for all cgroups except the root. The flag is cleared when a new process
>> >>> without the no_new_privs flag is attached to the cgroup.
>> >>>
>> >>> If a cgroup is landlocked, then any new attempt, from an unprivileged
>> >>> process, to attach a process without no_new_privs to this cgroup will
>> >>> be denied.
>> >>
>> >> Until and unless everyone can agree on a way to properly namespace,
>> >> delegate, etc cgroups, I think that trying to add unprivileged
>> >> semantics to cgroups is nuts.  Given the big thread about cgroup v2,
>> >> no-internal-tasks, etc, I just don't see how this approach can be
>> >> viable.
>> >
>> > As far as I can tell, the no_new_privs flag of a task is not related to
>> > namespaces. The CGRP_NO_NEW_PRIVS flag is only a cache to quickly access
>> > the no_new_privs property of *tasks* in a cgroup. The semantic is 
>> > unchanged.
>> >
>> > Using cgroup is optional, any task could use the seccomp-based
>> > landlocking instead. However, for those that want/need to manage a
>> > security policy in a more dynamic way, using cgroups may make sense.
>> >
>> > I thought cgroup delegation was OK in v2, isn't that the case? Do you
>> > have some links?
>> >
>> >>
>> >> Can we try to make landlock work completely independently of cgroups
>> >> so that it doesn't get stuck and so that programs can use it without
>> >> worrying about cgroup v1 vs v2, interactions with cgroup managers,
>> >> cgroup managers that (supposedly?) will start migrating processes
>> >> around piecemeal and almost certainly blowing up landlock in the
>> >> process, etc?
>> >
>> > This RFC handles both cgroup and seccomp approaches in a similar way. I
>> > don't see why building on top of cgroup v2 is a problem. Is there
>> > security issues with delegation?
>>
>> What I mean is: cgroup v2 delegation has a functionality problem.
>> Tejun says [1]:
>>
>> We haven't had to face this decision because cgroup has never properly
>> supported delegating to applications and the in-use setups where this
>> happens are custom configurations where there is no boundary between
>> system and applications and adhoc trial-and-error is good enough a way
>> to find a working solution.  That wiggle room goes away once we
>> officially open this up to individual applications.
>>
>> Unless and until that changes, I think that landlock should stay away
>> from cgroups.  Others could reasonably disagree with me.
>
> Ours and Sargun's use cases for cgroup+lsm+bpf are not for security
> and not for sandboxing. So the above doesn't matter in such contexts.
> lsm hooks + cgroups provide convenient scope and existing entry points.
> Please see the checmate examples for how it's used.
>

To be clear: I'm not arguing at all that there shouldn't be
bpf+lsm+cgroup integration.  I'm arguing that the unprivileged
landlock interface shouldn't expose any cgroup integration, at least
until the cgroup situation settles down a lot.

--Andy


[PATCH -next] staging: fsl-mc: use list_del_init instead of list_del/INIT_LIST_HEAD

2016-09-14 Thread Wei Yongjun
From: Wei Yongjun 

Using list_del_init() instead of list_del() + INIT_LIST_HEAD().

Signed-off-by: Wei Yongjun 
---
 drivers/staging/fsl-mc/bus/fsl-mc-allocator.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/fsl-mc/bus/fsl-mc-allocator.c 
b/drivers/staging/fsl-mc/bus/fsl-mc-allocator.c
index 2004fa7..1e06d28 100644
--- a/drivers/staging/fsl-mc/bus/fsl-mc-allocator.c
+++ b/drivers/staging/fsl-mc/bus/fsl-mc-allocator.c
@@ -142,8 +142,7 @@ static int __must_check 
fsl_mc_resource_pool_remove_device(struct fsl_mc_device
goto out_unlock;
}
 
-   list_del(&resource->node);
-   INIT_LIST_HEAD(&resource->node);
+   list_del_init(&resource->node);
res_pool->free_count--;
res_pool->max_count--;
 
@@ -220,8 +219,7 @@ int __must_check fsl_mc_resource_allocate(struct fsl_mc_bus 
*mc_bus,
res_pool->free_count > res_pool->max_count))
goto out_unlock;
 
-   list_del(&resource->node);
-   INIT_LIST_HEAD(&resource->node);
+   list_del_init(&resource->node);
 
res_pool->free_count--;
error = 0;
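
For reference, list_del_init() is exactly the two calls it replaces,
folded into one helper. A condensed sketch of the relevant
include/linux/list.h definitions (simplified, debug checks omitted):

    static inline void __list_del_entry(struct list_head *entry)
    {
            entry->next->prev = entry->prev;
            entry->prev->next = entry->next;
    }

    static inline void list_del_init(struct list_head *entry)
    {
            __list_del_entry(entry);
            /* INIT_LIST_HEAD(): entry->next = entry->prev = entry,
             * so the node is safe to list_empty()-test or delete again. */
            INIT_LIST_HEAD(entry);
    }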





Re: [PATCH 3/4] autofs - make mountpoint checks namespace aware

2016-09-14 Thread Eric W. Biederman
Ian Kent  writes:

> On Wed, 2016-09-14 at 12:28 -0500, Eric W. Biederman wrote:
>> Ian Kent  writes:
>> 
>> > If an automount mount is clone(2)ed into a file system that is
>> > propagation private, when it later expires in the originating
>> > namespace subsequent calls to autofs ->d_automount() for that
>> > dentry in the original namespace will return ELOOP until the
>> > mount is manually umounted in the cloned namespace.
>> > 
>> > In the same way, if an autofs mount is triggered by automount(8)
>> > running within a container the dentry will be seen as mounted in
>> > the root init namespace and calls to ->d_automount() in that namespace
>> > will return ELOOP until the mount is umounted within the container.
>> > 
>> > Also, have_submounts() can return an incorrect result when a mount
>> > exists in a namespace other than the one being checked.
>> 
>> Overall this appears to be a fairly reasonable set of changes.  It does
>> increase the expense when an actual mount point is encountered, but if
>> these are the desired semantics, some increase in cost when a dentry is a
>> mountpoint is unavoidable.
>> 
>> May I ask the motivation for this set of changes?  Reading through the
>> changes I don't grasp why we want to change the behavior of autofs.
>> What problem is being solved?  What are the benefits?
>
> LOL, it's all too easy for me to give a patch description that I think 
> explains
> a problem I need to solve without realizing it isn't clear to others what the
> problem is, sorry about that.
>
> For quite a while now, and not that frequently but consistently, I've been
> getting reports of people using autofs getting ELOOP errors and not being able
> to mount automounts.
>
> This has been due to the cloning of autofs file systems (that have active
> automounts at the time of the clone) by other systems.
>
> An unshare, as one example, can easily result in the cloning of an autofs file
> system that has active mounts which shows this problem.
>
> Once an active mount that has been cloned is expired in the namespace that
> performed the unshare it can't be (auto)mounted again in the the originating
> namespace because the mounted check in the autofs module will think it is
> already mounted.
>
> I'm not sure this is a clear description either, hopefully it is enough to
> demonstrate the type of problem I'm trying to solve.

So to rephrase the problem is that an autofs instance can stop working
properly from the perspective of the mount namespace it is mounted in
if the autofs instance is shared between multiple mount namespaces.  The
problem is that mounts and unmounts do not always propagate between
mount namespaces.  This lack of symmetric mount/unmount behavior
leads to mountpoints that become unusable.

Which leads to the question: what is the expected new behavior with your
patchset applied?  New mounts can be added in the parent mount namespace
(because the test is local).  Does your change also allow the
autofs mountpoints to be used in the other mount namespaces that share
the autofs instance if everything becomes unmounted?

Or is it expected that other mount namespaces that share an autofs
instance will get changes in their mounts via mount propagation and if
mount propagation is insufficient they are on their own.

I believe this is a question of how notifications of the desire for
an automount work after your change, and whether those notifications
are consistent with your desired and/or expected behavior.

Eric
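
For anyone wanting to reproduce the scenario Ian describes, the
cloning step is just an unshare of the mount namespace while an
automount is active. A minimal sketch (needs CAP_SYS_ADMIN, and
assumes an autofs-managed mount already exists somewhere in the
tree):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mount.h>
    #include <unistd.h>

    int main(void)
    {
            /* Clone the mount namespace, as unshare(1) does. */
            if (unshare(CLONE_NEWNS) == -1) {
                    perror("unshare");
                    return 1;
            }
            /* Make the tree propagation-private, so umounts here do
             * not propagate back to the original namespace. */
            if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1) {
                    perror("mount");
                    return 1;
            }
            /* Any active automount is now pinned in this namespace.
             * Once it expires in the original namespace, the dentry
             * there still looks mounted and ->d_automount() returns
             * ELOOP, which is the reported failure. */
            pause();
            return 0;
    }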







Re: [PATCH v2 2/3] x86 Test and expose CPUID faulting capabilities in /proc/cpuinfo

2016-09-14 Thread Kyle Huey
On Wed, Sep 14, 2016 at 6:17 PM, Andy Lutomirski  wrote:
> On Wed, Sep 14, 2016 at 3:03 PM, Kyle Huey  wrote:
>> On Wed, Sep 14, 2016 at 2:35 PM, Dave Hansen
>>  wrote:
>>> On 09/14/2016 02:01 PM, Kyle Huey wrote:
>
>>> Is any of this useful to optimize away at compile-time?  We have config
>>> options for when we're running as a guest, and this seems like a feature
>>> that isn't available when running on bare metal.
>>
>> On the contrary, this is only available when we're on bare metal.
>> Neither Xen nor KVM virtualize CPUID faulting (although KVM correctly
>> suppresses MSR_PLATFORM_INFO's report of support for it).
>
> KVM could easily support this.  If rr starts using it, I think KVM
> *should* add support, possibly even for older CPUs that don't support
> the feature in hardware.
>
> It's too bad that x86 doesn't give us the instruction bytes on a
> fault.  Otherwise we could lazily switch this feature.

We are *very* interested in having KVM and Xen support virtualization
of this feature.  I am planning to work on KVM after I get this series
of patches in :)

- Kyle
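
For anyone who wants to check their own machine: support is
advertised in MSR_PLATFORM_INFO (0xce), bit 31 in this series. A
userspace sketch using the msr driver (needs the msr module loaded
and root; the bit position is taken from the patches, so treat it as
an assumption):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MSR_PLATFORM_INFO 0xce

    int main(void)
    {
            uint64_t val;
            int fd = open("/dev/cpu/0/msr", O_RDONLY);

            /* The msr driver uses the file offset as the MSR number. */
            if (fd < 0 || pread(fd, &val, sizeof(val),
                                MSR_PLATFORM_INFO) != sizeof(val)) {
                    perror("msr");
                    return 1;
            }
            printf("CPUID faulting: %s\n",
                   (val & (1ULL << 31)) ? "supported" : "not supported");
            return 0;
    }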


[PATCH -next] staging: android: ion: Fix return value check in hi6220_ion_probe()

2016-09-14 Thread Wei Yongjun
From: Wei Yongjun 

In case of error, the function ion_device_create() returns ERR_PTR() and
never returns NULL. The NULL test in the return value check should be
replaced with IS_ERR().

Signed-off-by: Wei Yongjun 
---
 drivers/staging/android/ion/hisilicon/hi6220_ion.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/android/ion/hisilicon/hi6220_ion.c 
b/drivers/staging/android/ion/hisilicon/hi6220_ion.c
index f392db2..659aa71 100644
--- a/drivers/staging/android/ion/hisilicon/hi6220_ion.c
+++ b/drivers/staging/android/ion/hisilicon/hi6220_ion.c
@@ -49,8 +49,8 @@ static int hi6220_ion_probe(struct platform_device *pdev)
platform_set_drvdata(pdev, ipdev);
 
ipdev->idev = ion_device_create(NULL);
-   if (!ipdev->idev)
-   return -ENOMEM;
+   if (IS_ERR(ipdev->idev))
+   return PTR_ERR(ipdev->idev);
 
ipdev->data = ion_parse_dt(pdev, hisi_heaps);
if (IS_ERR(ipdev->data))
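
The pattern behind the fix, for reference: functions that return
ERR_PTR() encode the errno in the pointer value itself, so a NULL
check never fires. A condensed sketch of the include/linux/err.h
helpers and their intended pairing:

    /* Errors live in the top page of the address space, so NULL is
     * not an error value and must not be tested for. */
    #define MAX_ERRNO       4095
    #define IS_ERR_VALUE(x) ((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

    static inline void *ERR_PTR(long error) { return (void *)error; }
    static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
    static inline bool IS_ERR(const void *ptr)
    {
            return IS_ERR_VALUE((unsigned long)ptr);
    }

    /* Caller side: test with IS_ERR(), extract with PTR_ERR().
     * Testing for NULL instead silently treats failure as success. */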





Re: [PATCH v2 3/3] x86,arch_prctl Add ARCH_[GET|SET]_CPUID for controlling the CPUID instruction

2016-09-14 Thread Kyle Huey
On Wed, Sep 14, 2016 at 6:54 PM, Andy Lutomirski  wrote:
> On Wed, Sep 14, 2016 at 6:47 PM, Kyle Huey  wrote:
>> On Wed, Sep 14, 2016 at 6:29 PM, Andy Lutomirski  wrote:
>>> On Wed, Sep 14, 2016 at 2:01 PM, Kyle Huey  wrote:
>
 +
 +int set_cpuid_mode(struct task_struct *task, unsigned long val)
 +{
 +   /* Only disable/enable_cpuid() if it is supported on this 
 hardware. */
 +   bool cpuid_fault_supported = 
 static_cpu_has(X86_FEATURE_CPUID_FAULT);
 +
 +   if (val == ARCH_CPUID_ENABLE && cpuid_fault_supported) {
 +   if (task_no_new_privs(task) && 
 test_thread_flag(TIF_NOCPUID))
 +   return -EACCES;
>>>
>>> This check seems confused.  If this flag were preserved on execve,
>>> it's the SIGSEGV mode that would need the check.
>>
>> Not sure I follow this one.  no_new_privs should block transitions
>> from SIGSEGV to ENABLE, right?  That's what this check does.
>
> It's the other way around entirely: if you make a change to your
> process context such that a subsequently execve()'d setuid program
> might malfunction, you've just done something dangerous.  This is only
> okay, at least in newly-supported instances, if you are either
> privileged or if you have no_new_privs set.  Having privilege makes it
> okay: unprivileged programs can't use it to subvert setuid programs.
> no_new_privs makes it safe as well: if no_new_privs is set, you can't
> gain privilege via execve(), so there's no attack surface.  So, if you
> have execve() keep ARCH_CPUID_SIGSEGV set, then setting it that way in
> the first place should require privilege or no_new_privs.
>
> I personally favor resetting to ARCH_CPUID_ENABLE on execve() and not
> worrying about no_new_privs.
>
> Does that make sense?

Yes, ok.  Robert and I agree that resetting does make the most sense.
Using this usefully requires a ptrace supervisor (to catch the traps),
which can easily inject a call to arch_prctl to reenable
ARCH_CPUID_SIGSEGV when desired.

- Kyle
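
A sketch of how a supervisor would drive the proposed interface. The
constant names come from this v2 series and the numeric values here
are invented placeholders, so treat the whole thing as illustrative:

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* From the series under discussion; numeric values are made up. */
    #define ARCH_SET_CPUID          0x1006
    #define ARCH_CPUID_ENABLE       1
    #define ARCH_CPUID_SIGSEGV      2

    int main(void)
    {
            /* Ask the kernel to fault CPUID so a ptrace supervisor can
             * intercept the SIGSEGV and emulate the result. */
            if (syscall(SYS_arch_prctl, ARCH_SET_CPUID,
                        ARCH_CPUID_SIGSEGV) != 0) {
                    perror("arch_prctl");  /* e.g. no CPUID faulting */
                    return 1;
            }
            /* CPUID now raises SIGSEGV in this task; per the thread,
             * the mode resets to ARCH_CPUID_ENABLE across execve(). */
            return 0;
    }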


Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks

2016-09-14 Thread Alexei Starovoitov
On Wed, Sep 14, 2016 at 06:25:07PM -0700, Andy Lutomirski wrote:
> On Wed, Sep 14, 2016 at 3:11 PM, Mickaël Salaün  wrote:
> >
> > On 14/09/2016 20:27, Andy Lutomirski wrote:
> >> On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün  wrote:
> >>> Add a new flag CGRP_NO_NEW_PRIVS for each cgroup. This flag is initially
> >>> set for all cgroups except the root. The flag is cleared when a new process
> >>> without the no_new_privs flag is attached to the cgroup.
> >>>
> >>> If a cgroup is landlocked, then any new attempt, from an unprivileged
> >>> process, to attach a process without no_new_privs to this cgroup will
> >>> be denied.
> >>
> >> Until and unless everyone can agree on a way to properly namespace,
> >> delegate, etc cgroups, I think that trying to add unprivileged
> >> semantics to cgroups is nuts.  Given the big thread about cgroup v2,
> >> no-internal-tasks, etc, I just don't see how this approach can be
> >> viable.
> >
> > As far as I can tell, the no_new_privs flag of a task is not related to
> > namespaces. The CGRP_NO_NEW_PRIVS flag is only a cache to quickly access
> > the no_new_privs property of *tasks* in a cgroup. The semantic is unchanged.
> >
> > Using cgroup is optional, any task could use the seccomp-based
> > landlocking instead. However, for those that want/need to manage a
> > security policy in a more dynamic way, using cgroups may make sense.
> >
> > I thought cgroup delegation was OK in v2, isn't that the case? Do you
> > have some links?
> >
> >>
> >> Can we try to make landlock work completely independently of cgroups
> >> so that it doesn't get stuck and so that programs can use it without
> >> worrying about cgroup v1 vs v2, interactions with cgroup managers,
> >> cgroup managers that (supposedly?) will start migrating processes
> >> around piecemeal and almost certainly blowing up landlock in the
> >> process, etc?
> >
> > This RFC handles both cgroup and seccomp approaches in a similar way. I
> > don't see why building on top of cgroup v2 is a problem. Is there
> > security issues with delegation?
> 
> What I mean is: cgroup v2 delegation has a functionality problem.
> Tejun says [1]:
> 
> We haven't had to face this decision because cgroup has never properly
> supported delegating to applications and the in-use setups where this
> happens are custom configurations where there is no boundary between
> system and applications and adhoc trial-and-error is good enough a way
> to find a working solution.  That wiggle room goes away once we
> officially open this up to individual applications.
> 
> Unless and until that changes, I think that landlock should stay away
> from cgroups.  Others could reasonably disagree with me.

Ours and Sargun's use cases for cgroup+lsm+bpf are not for security
and not for sandboxing. So the above doesn't matter in such contexts.
lsm hooks + cgroups provide convenient scope and existing entry points.
Please see the checmate examples for how it's used.



[PATCH 1/4] x86/dumpstack: simplify in_exception_stack()

2016-09-14 Thread Josh Poimboeuf
in_exception_stack() does some bad, bad things just so the unwinder can
print different values for different areas of the debug exception stack.

There's no need to clarify where exactly on the stack it is.  Just print
"#DB" and be done with it.

Signed-off-by: Josh Poimboeuf 
---
 arch/x86/kernel/dumpstack_64.c | 89 --
 1 file changed, 26 insertions(+), 63 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 07373be..904fb46 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -16,83 +16,46 @@
 
 #include 
 
+static char *exception_stack_names[N_EXCEPTION_STACKS] = {
+   [ DOUBLEFAULT_STACK-1   ]   = "#DF",
+   [ NMI_STACK-1   ]   = "NMI",
+   [ DEBUG_STACK-1 ]   = "#DB",
+   [ MCE_STACK-1   ]   = "#MC",
+};
 
-#define N_EXCEPTION_STACKS_END \
-   (N_EXCEPTION_STACKS + DEBUG_STKSZ/EXCEPTION_STKSZ - 2)
-
-static char x86_stack_ids[][8] = {
-   [ DEBUG_STACK-1 ]   = "#DB",
-   [ NMI_STACK-1   ]   = "NMI",
-   [ DOUBLEFAULT_STACK-1   ]   = "#DF",
-   [ MCE_STACK-1   ]   = "#MC",
-#if DEBUG_STKSZ > EXCEPTION_STKSZ
-   [ N_EXCEPTION_STACKS ...
- N_EXCEPTION_STACKS_END]   = "#DB[?]"
-#endif
+static unsigned long exception_stack_sizes[N_EXCEPTION_STACKS] = {
+   [0 ... N_EXCEPTION_STACKS - 1]  = EXCEPTION_STKSZ,
+   [DEBUG_STACK - 1]   = DEBUG_STKSZ
 };
 
 static unsigned long *in_exception_stack(unsigned long stack, unsigned *usedp,
 char **idp)
 {
+   unsigned long begin, end;
unsigned k;
 
-   /*
-* Iterate over all exception stacks, and figure out whether
-* 'stack' is in one of them:
-*/
+   BUILD_BUG_ON(N_EXCEPTION_STACKS != 4);
+
for (k = 0; k < N_EXCEPTION_STACKS; k++) {
-   unsigned long end = raw_cpu_ptr(&orig_ist)->ist[k];
-   /*
-* Is 'stack' above this exception frame's end?
-* If yes then skip to the next frame.
-*/
-   if (stack >= end)
+   end   = raw_cpu_ptr(&orig_ist)->ist[k];
+   begin = end - exception_stack_sizes[k];
+
+   if (stack < begin || stack >= end)
continue;
+
/*
-* Is 'stack' above this exception frame's start address?
-* If yes then we found the right frame.
-*/
-   if (stack >= end - EXCEPTION_STKSZ) {
-   /*
-* Make sure we only iterate through an exception
-* stack once. If it comes up for the second time
-* then there's something wrong going on - just
-* break out and return NULL:
-*/
-   if (*usedp & (1U << k))
-   break;
-   *usedp |= 1U << k;
-   *idp = x86_stack_ids[k];
-   return (unsigned long *)end;
-   }
-   /*
-* If this is a debug stack, and if it has a larger size than
-* the usual exception stacks, then 'stack' might still
-* be within the lower portion of the debug stack:
+* Make sure we only iterate through an exception stack once.
+* If it comes up for the second time then there's something
+* wrong going on - just break and return NULL:
 */
-#if DEBUG_STKSZ > EXCEPTION_STKSZ
-   if (k == DEBUG_STACK - 1 && stack >= end - DEBUG_STKSZ) {
-   unsigned j = N_EXCEPTION_STACKS - 1;
+   if (*usedp & (1U << k))
+   break;
+   *usedp |= 1U << k;
 
-   /*
-* Black magic. A large debug stack is composed of
-* multiple exception stack entries, which we
-* iterate through now. Dont look:
-*/
-   do {
-   ++j;
-   end -= EXCEPTION_STKSZ;
-   x86_stack_ids[j][4] = '1' +
-   (j - N_EXCEPTION_STACKS);
-   } while (stack < end - EXCEPTION_STKSZ);
-   if (*usedp & (1U << j))
-   break;
-   *usedp |= 1U << j;
-   *idp = x86_stack_ids[j];
-   return (unsigned long *)end;
-   }
-#endif
+   *idp = exception_stack_names[k];
+   return (unsigned long *)end;

[PATCH 3/4] x86/dumpstack: support for unwinding empty irq stacks

2016-09-14 Thread Josh Poimboeuf
When an interrupt happens in entry code while running on a software irq
stack, and the irq stack was empty, regs->sp will contain the stack end
address (e.g., irq_stack_ptr).  If the regs are passed to dump_trace(),
get_stack_info() will report STACK_TYPE_UNKNOWN, causing dump_trace() to
return prematurely without trying to go to the next stack.

Update the bounds checking for software interrupt stacks so that the
ending address is now considered part of the stack.

This means that it's now possible for the 'walk_stack' callbacks --
print_context_stack() and print_context_stack_bp() -- to be called with
an empty stack.  But that's fine; they're already prepared to deal with
that due to their on_stack() checks.

Signed-off-by: Josh Poimboeuf 
---
 arch/x86/kernel/dumpstack_32.c | 12 ++--
 arch/x86/kernel/dumpstack_64.c |  6 +-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index c92da5a..50076d4 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -35,7 +35,11 @@ static bool in_hardirq_stack(unsigned long *stack, struct 
stack_info *info)
unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack);
unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
 
-   if (stack < begin || stack >= end)
+   /*
+* This is a software stack, so 'end' can be a valid stack pointer.
+* It just means the stack is empty.
+*/
+   if (stack < begin || stack > end)
return false;
 
info->type  = STACK_TYPE_IRQ;
@@ -56,7 +60,11 @@ static bool in_softirq_stack(unsigned long *stack, struct 
stack_info *info)
unsigned long *begin = (unsigned long *)this_cpu_read(softirq_stack);
unsigned long *end   = begin + (THREAD_SIZE / sizeof(long));
 
-   if (stack < begin || stack >= end)
+   /*
+* This is a software stack, so 'end' can be a valid stack pointer.
+* It just means the stack is empty.
+*/
+   if (stack < begin || stack > end)
return false;
 
info->type  = STACK_TYPE_SOFTIRQ;
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 41813ab..2e708af 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -90,7 +90,11 @@ static bool in_irq_stack(unsigned long *stack, struct stack_info *info)
unsigned long *end   = (unsigned long *)this_cpu_read(irq_stack_ptr);
unsigned long *begin = end - (IRQ_STACK_SIZE / sizeof(long));
 
-   if (stack < begin || stack >= end)
+   /*
+* This is a software stack, so 'end' can be a valid stack pointer.
+* It just means the stack is empty.
+*/
+   if (stack < begin || stack > end)
return false;
 
info->type  = STACK_TYPE_IRQ;
-- 
2.7.4

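To make the off-by-one concrete: with a descending stack that is empty, the
saved stack pointer equals the stack's end address, which the old exclusive
check rejects. A minimal user-space sketch of the two checks (illustrative
values only, not kernel code):

#include <stdbool.h>
#include <stdio.h>

/* Old check: 'end' itself is rejected, so an empty stack is "unknown". */
static bool in_stack_exclusive(unsigned long sp, unsigned long begin,
			       unsigned long end)
{
	return !(sp < begin || sp >= end);
}

/* New check: 'end' is accepted as a valid (empty-stack) pointer. */
static bool in_stack_inclusive(unsigned long sp, unsigned long begin,
			       unsigned long end)
{
	return !(sp < begin || sp > end);
}

int main(void)
{
	unsigned long begin = 0x1000, end = 0x2000, sp = end;

	/* Prints "exclusive=0 inclusive=1" for the empty-stack case. */
	printf("exclusive=%d inclusive=%d\n",
	       in_stack_exclusive(sp, begin, end),
	       in_stack_inclusive(sp, begin, end));
	return 0;
}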


[PATCH 2/4] x86/dumpstack: add get_stack_info() interface

2016-09-14 Thread Josh Poimboeuf
valid_stack_ptr() is buggy: it assumes that all stacks are of size
THREAD_SIZE, which is not true for exception stacks.  So the
walk_stack() callbacks will need to know the location of the beginning
of the stack as well as the end.

Another issue is that in general the various features of a stack (type,
size, next stack pointer, description string) are scattered around in
various places throughout the stack dump code.

Encapsulate all that information in a single place with a new stack_info
struct and a get_stack_info() interface.

Signed-off-by: Josh Poimboeuf 
---
 arch/x86/events/core.c|   2 +-
 arch/x86/include/asm/stacktrace.h |  41 -
 arch/x86/kernel/dumpstack.c   |  40 +
 arch/x86/kernel/dumpstack_32.c| 106 ++--
 arch/x86/kernel/dumpstack_64.c| 169 --
 arch/x86/kernel/stacktrace.c  |   2 +-
 arch/x86/oprofile/backtrace.c |   2 +-
 7 files changed, 234 insertions(+), 128 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index dcaa887..dd3a1dc 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2271,7 +2271,7 @@ void arch_perf_update_userpage(struct perf_event *event,
  * callchain support
  */
 
-static int backtrace_stack(void *data, char *name)
+static int backtrace_stack(void *data, const char *name)
 {
return 0;
 }
diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 3552f5e..780a83e 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -10,6 +10,39 @@
 #include 
 #include 
 
+enum stack_type {
+   STACK_TYPE_UNKNOWN,
+   STACK_TYPE_TASK,
+   STACK_TYPE_IRQ,
+   STACK_TYPE_SOFTIRQ,
+   STACK_TYPE_EXCEPTION,
+   STACK_TYPE_EXCEPTION_LAST = STACK_TYPE_EXCEPTION + N_EXCEPTION_STACKS-1,
+};
+
+struct stack_info {
+   enum stack_type type;
+   unsigned long *begin, *end, *next_sp;
+};
+
+bool in_task_stack(unsigned long *stack, struct task_struct *task,
+  struct stack_info *info);
+
+int get_stack_info(unsigned long *stack, struct task_struct *task,
+  struct stack_info *info, unsigned long *visit_mask);
+
+void stack_type_str(enum stack_type type, const char **begin,
+   const char **end);
+
+static inline bool on_stack(struct stack_info *info, void *addr, size_t len)
+{
+   void *begin = info->begin;
+   void *end   = info->end;
+
+   return (info->type != STACK_TYPE_UNKNOWN &&
+   addr >= begin && addr < end &&
+   addr + len > begin && addr + len <= end);
+}
+
 extern int kstack_depth_to_print;
 
 struct thread_info;
@@ -20,27 +53,27 @@ typedef unsigned long (*walk_stack_t)(struct task_struct *task,
  unsigned long bp,
  const struct stacktrace_ops *ops,
  void *data,
- unsigned long *end,
+ struct stack_info *info,
  int *graph);
 
 extern unsigned long
 print_context_stack(struct task_struct *task,
unsigned long *stack, unsigned long bp,
const struct stacktrace_ops *ops, void *data,
-   unsigned long *end, int *graph);
+   struct stack_info *info, int *graph);
 
 extern unsigned long
 print_context_stack_bp(struct task_struct *task,
   unsigned long *stack, unsigned long bp,
   const struct stacktrace_ops *ops, void *data,
-  unsigned long *end, int *graph);
+  struct stack_info *info, int *graph);
 
 /* Generic stack tracer with callbacks */
 
 struct stacktrace_ops {
int (*address)(void *data, unsigned long address, int reliable);
/* On negative return stop dumping */
-   int (*stack)(void *data, char *name);
+   int (*stack)(void *data, const char *name);
	walk_stack_t		walk_stack;
 };
 
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index c6c6c39..aa208e5 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -25,6 +25,23 @@ unsigned int code_bytes = 64;
 int kstack_depth_to_print = 3 * STACKSLOTS_PER_LINE;
 static int die_counter;
 
+bool in_task_stack(unsigned long *stack, struct task_struct *task,
+  struct stack_info *info)
+{
+   unsigned long *begin = task_stack_page(task);
+   unsigned long *end   = task_stack_page(task) + THREAD_SIZE;
+
+   if (stack < begin || stack >= end)
+   return false;
+
+   info->type  = STACK_TYPE_TASK;
+   info->begin = begin;
+   info->end   = end;
+   info->next_sp   = NULL;
+
+   return true;
+}
+
 static void printk_stack_address(unsigned long address, int reliable,
  

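As a rough sketch of how a caller would use this interface, based only on the
declarations above (the loop shape and helper name are assumptions, not code
from the series):

/* Hypothetical unwind loop over the get_stack_info() interface declared
 * above; error handling and the actual address printing are elided.
 */
static void walk_all_stacks(unsigned long *stack, struct task_struct *task)
{
	struct stack_info info;
	unsigned long visit_mask = 0;	/* one bit per stack type seen */

	while (get_stack_info(stack, task, &info, &visit_mask) == 0) {
		/* ... walk words from 'stack' up to info.end here ... */
		stack = info.next_sp;	/* hop to the next stack, if any */
		if (!stack)
			break;
	}
}
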
[PATCH 1/4] x86/dumpstack: simplify in_exception_stack()

2016-09-14 Thread Josh Poimboeuf
in_exception_stack() does some bad, bad things just so the unwinder can
print different values for different areas of the debug exception stack.

There's no need to clarify where exactly on the stack it is.  Just print
"#DB" and be done with it.

Signed-off-by: Josh Poimboeuf 
---
 arch/x86/kernel/dumpstack_64.c | 89 --
 1 file changed, 26 insertions(+), 63 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 07373be..904fb46 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -16,83 +16,46 @@
 
 #include 
 
+static char *exception_stack_names[N_EXCEPTION_STACKS] = {
+   [ DOUBLEFAULT_STACK-1   ]   = "#DF",
+   [ NMI_STACK-1   ]   = "NMI",
+   [ DEBUG_STACK-1 ]   = "#DB",
+   [ MCE_STACK-1   ]   = "#MC",
+};
 
-#define N_EXCEPTION_STACKS_END \
-   (N_EXCEPTION_STACKS + DEBUG_STKSZ/EXCEPTION_STKSZ - 2)
-
-static char x86_stack_ids[][8] = {
-   [ DEBUG_STACK-1 ]   = "#DB",
-   [ NMI_STACK-1   ]   = "NMI",
-   [ DOUBLEFAULT_STACK-1   ]   = "#DF",
-   [ MCE_STACK-1   ]   = "#MC",
-#if DEBUG_STKSZ > EXCEPTION_STKSZ
-   [ N_EXCEPTION_STACKS ...
- N_EXCEPTION_STACKS_END]   = "#DB[?]"
-#endif
+static unsigned long exception_stack_sizes[N_EXCEPTION_STACKS] = {
+   [0 ... N_EXCEPTION_STACKS - 1]  = EXCEPTION_STKSZ,
+   [DEBUG_STACK - 1]   = DEBUG_STKSZ
 };
 
 static unsigned long *in_exception_stack(unsigned long stack, unsigned *usedp,
 char **idp)
 {
+   unsigned long begin, end;
unsigned k;
 
-   /*
-* Iterate over all exception stacks, and figure out whether
-* 'stack' is in one of them:
-*/
+   BUILD_BUG_ON(N_EXCEPTION_STACKS != 4);
+
for (k = 0; k < N_EXCEPTION_STACKS; k++) {
-   unsigned long end = raw_cpu_ptr(&orig_ist)->ist[k];
-   /*
-* Is 'stack' above this exception frame's end?
-* If yes then skip to the next frame.
-*/
-   if (stack >= end)
+   end   = raw_cpu_ptr(&orig_ist)->ist[k];
+   begin = end - exception_stack_sizes[k];
+
+   if (stack < begin || stack >= end)
continue;
+
/*
-* Is 'stack' above this exception frame's start address?
-* If yes then we found the right frame.
-*/
-   if (stack >= end - EXCEPTION_STKSZ) {
-   /*
-* Make sure we only iterate through an exception
-* stack once. If it comes up for the second time
-* then there's something wrong going on - just
-* break out and return NULL:
-*/
-   if (*usedp & (1U << k))
-   break;
-   *usedp |= 1U << k;
-   *idp = x86_stack_ids[k];
-   return (unsigned long *)end;
-   }
-   /*
-* If this is a debug stack, and if it has a larger size than
-* the usual exception stacks, then 'stack' might still
-* be within the lower portion of the debug stack:
+* Make sure we only iterate through an exception stack once.
+* If it comes up for the second time then there's something
+* wrong going on - just break and return NULL:
 */
-#if DEBUG_STKSZ > EXCEPTION_STKSZ
-   if (k == DEBUG_STACK - 1 && stack >= end - DEBUG_STKSZ) {
-   unsigned j = N_EXCEPTION_STACKS - 1;
+   if (*usedp & (1U << k))
+   break;
+   *usedp |= 1U << k;
 
-   /*
-* Black magic. A large debug stack is composed of
-* multiple exception stack entries, which we
-* iterate through now. Dont look:
-*/
-   do {
-   ++j;
-   end -= EXCEPTION_STKSZ;
-   x86_stack_ids[j][4] = '1' +
-   (j - N_EXCEPTION_STACKS);
-   } while (stack < end - EXCEPTION_STKSZ);
-   if (*usedp & (1U << j))
-   break;
-   *usedp |= 1U << j;
-   *idp = x86_stack_ids[j];
-   return (unsigned long *)end;
-   }
-#endif
+   *idp = exception_stack_names[k];
+ 

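The net effect of the patch is a table-driven bounds check. A stand-alone
sketch of the same pattern with made-up sizes (illustrative only, not the
kernel's values):

#include <stddef.h>

#define N_STACKS  4
#define STKSZ     4096
#define DBG_STKSZ (4 * 4096)

static const char *names[N_STACKS] = { "#DF", "NMI", "#DB", "#MC" };
static const unsigned long sizes[N_STACKS] = {
	STKSZ, STKSZ, DBG_STKSZ, STKSZ	/* one per-stack size, as in the patch */
};

/* Given each stack's end address, report which stack 'sp' lives on. */
static const char *which_stack(unsigned long sp,
			       const unsigned long ends[N_STACKS])
{
	for (unsigned k = 0; k < N_STACKS; k++) {
		unsigned long begin = ends[k] - sizes[k];

		if (sp >= begin && sp < ends[k])
			return names[k];
	}
	return NULL;
}
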
[PATCH 0/4] x86/dumpstack: yet more stack dump improvements

2016-09-14 Thread Josh Poimboeuf
This is the last batch before the new unwinder.

Josh Poimboeuf (4):
  x86/dumpstack: simplify in_exception_stack()
  x86/dumpstack: add get_stack_info() interface
  x86/dumpstack: support for unwinding empty irq stacks
  x86/dumpstack: add recursion checking for all stacks

 arch/x86/events/core.c|   2 +-
 arch/x86/include/asm/stacktrace.h |  41 ++-
 arch/x86/kernel/dumpstack.c   |  40 +++---
 arch/x86/kernel/dumpstack_32.c| 130 
 arch/x86/kernel/dumpstack_64.c| 251 ++
 arch/x86/kernel/stacktrace.c  |   2 +-
 arch/x86/oprofile/backtrace.c |   2 +-
 7 files changed, 284 insertions(+), 184 deletions(-)

-- 
2.7.4



[PATCH 4/4] x86/dumpstack: add recursion checking for all stacks

2016-09-14 Thread Josh Poimboeuf
in_exception_stack() has some recursion checking which makes sure the
stack trace code never traverses a given exception stack more than once.
This prevents an infinite loop if corruption somehow causes a stack's
"next stack" pointer to point to itself (directly or indirectly).

The recursion checking can be useful for other stacks in addition to the
exception stack, so extend it to work for all stacks.

Signed-off-by: Josh Poimboeuf 
---
 arch/x86/kernel/dumpstack_32.c | 22 +++---
 arch/x86/kernel/dumpstack_64.c | 35 +++
 2 files changed, 38 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index 50076d4..2d65cfa 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -89,16 +89,32 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
task = task ? : current;
 
if (in_task_stack(stack, task, info))
-   return 0;
+   goto recursion_check;
 
if (task != current)
goto unknown;
 
if (in_hardirq_stack(stack, info))
-   return 0;
+   goto recursion_check;
 
if (in_softirq_stack(stack, info))
-   return 0;
+   goto recursion_check;
+
+   goto unknown;
+
+recursion_check:
+   /*
+* Make sure we don't iterate through any given stack more than once.
+* If it comes up a second time then there's something wrong going on:
+* just break out and report an unknown stack type.
+*/
+   if (visit_mask) {
+   if (*visit_mask & (1UL << info->type))
+   goto unknown;
+   *visit_mask |= 1UL << info->type;
+   }
+
+   return 0;
 
 unknown:
info->type = STACK_TYPE_UNKNOWN;
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 2e708af..8cb6004 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -47,8 +47,7 @@ void stack_type_str(enum stack_type type, const char **begin, const char **end)
}
 }
 
-static bool in_exception_stack(unsigned long *stack, struct stack_info *info,
-  unsigned long *visit_mask)
+static bool in_exception_stack(unsigned long *stack, struct stack_info *info)
 {
unsigned long *begin, *end;
struct pt_regs *regs;
@@ -64,16 +63,6 @@ static bool in_exception_stack(unsigned long *stack, struct stack_info *info,
if (stack < begin || stack >= end)
continue;
 
-   /*
-* Make sure we don't iterate through an exception stack more
-* than once.  If it comes up a second time then there's
-* something wrong going on - just break out and report an
-* unknown stack type.
-*/
-   if (*visit_mask & (1U << k))
-   break;
-   *visit_mask |= 1U << k;
-
info->type  = STACK_TYPE_EXCEPTION + k;
info->begin = begin;
info->end   = end;
@@ -119,16 +108,30 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
task = task ? : current;
 
if (in_task_stack(stack, task, info))
-   return 0;
+   goto recursion_check;
 
if (task != current)
goto unknown;
 
-   if (in_exception_stack(stack, info, visit_mask))
-   return 0;
+   if (in_exception_stack(stack, info))
+   goto recursion_check;
 
if (in_irq_stack(stack, info))
-   return 0;
+   goto recursion_check;
+
+   goto unknown;
+
+recursion_check:
+   /*
+* Make sure we don't iterate through any given stack more than once.
+* If it comes up a second time then there's something wrong going on:
+* just break out and report an unknown stack type.
+*/
+   if (visit_mask) {
+   if (*visit_mask & (1UL << info->type))
+   goto unknown;
+   *visit_mask |= 1UL << info->type;
+   }
 
return 0;
 
-- 
2.7.4


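The visit_mask scheme generalizes to any walk over a small, enumerable set of
regions: record each visited member in a bitmask and bail out on a revisit. A
minimal sketch of the pattern outside the kernel (names are illustrative):

#include <stdbool.h>

enum region { REGION_A, REGION_B, REGION_C, NR_REGIONS };

/* Returns false on a second visit, which signals a cycle in the
 * "next region" links and stops the walk, exactly as get_stack_info()
 * above reports an unknown stack type.
 */
static bool visit_once(unsigned long *visit_mask, enum region type)
{
	if (*visit_mask & (1UL << type))
		return false;
	*visit_mask |= 1UL << type;
	return true;
}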

Re: [PATCH v2 3/3] x86,arch_prctl Add ARCH_[GET|SET]_CPUID for controlling the CPUID instruction

2016-09-14 Thread Andy Lutomirski
On Wed, Sep 14, 2016 at 6:47 PM, Kyle Huey  wrote:
> On Wed, Sep 14, 2016 at 6:29 PM, Andy Lutomirski  wrote:
>> On Wed, Sep 14, 2016 at 2:01 PM, Kyle Huey  wrote:

>>> +
>>> +int set_cpuid_mode(struct task_struct *task, unsigned long val)
>>> +{
>>> +   /* Only disable/enable_cpuid() if it is supported on this hardware. 
>>> */
>>> +   bool cpuid_fault_supported = 
>>> static_cpu_has(X86_FEATURE_CPUID_FAULT);
>>> +
>>> +   if (val == ARCH_CPUID_ENABLE && cpuid_fault_supported) {
>>> +   if (task_no_new_privs(task) && 
>>> test_thread_flag(TIF_NOCPUID))
>>> +   return -EACCES;
>>
>> This check seems confused.  If this flag were preserved on execve,
>> it's the SIGSEGV mode that would need the check.
>
> Not sure I follow this one.  no_new_privs should block transitions
> from SIGSEGV to ENABLE, right?  That's what this check does.

It's the other way around entirely: if you make a change to your
process context such that a subsequently execve()'d setuid program
might malfunction, you've just done something dangerous.  This is only
okay, at least in newly-supported instances, if you are either
privileged or if you have no_new_privs set.  Having privilege makes it
okay: unprivileged programs can't use it to subvert setuid programs.
no_new_privs makes it safe as well: if no_new_privs is set, you can't
gain privilege via execve(), so there's no attack surface.  So, if you
have execve() keep ARCH_CPUID_SIGSEGV set, then setting it that way in
the first place should require privilege or no_new_privs.

I personally favor resetting to ARCH_CPUID_ENABLE on execve() and not
worrying about no_new_privs.

Does that make sense?


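Concretely, the rule Andy describes, were SIGSEGV mode preserved across
execve(), would reduce to a check like this (a sketch only; the capability
choice is an assumption, not code from the posted patch):

/* Hypothetical gate for setting a mode that survives execve(): require
 * either privilege or no_new_privs, so a setuid program cannot be
 * sabotaged by an unprivileged caller.
 */
static int may_set_persistent_cpuid_sigsegv(struct task_struct *task)
{
	if (capable(CAP_SYS_ADMIN) || task_no_new_privs(task))
		return 0;
	return -EACCES;
}
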
Re: [PATCH v5 1/5] kexec_file: Include the purgatory segment in the kexec image checksum.

2016-09-14 Thread Thiago Jung Bauermann
Hello Stephen,

On Thursday, 15 September 2016 at 11:43:08, Stephen Rothwell wrote:
> Hi Thiago,
> 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 2a1f0ce7c59a..dcd1679f3005 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1792,6 +1792,11 @@ config SECCOMP
> > 
> >  source kernel/Kconfig.hz
> > 
> > +# x86 needs to relocate the purgatory after the checksum is calculated,
> > +# therefore the purgatory cannot be part of the kexec image checksum.
> > +config ARCH_MODIFIES_KEXEC_PURGATORY
> > +   bool
> > +
> 
> The above should probably be in arch/Kconfig (with an appropriately
> changed comment) since it is used in generic code.

Thanks for your quick response! I'll make that change tomorrow and send an 
updated version of just this patch.

-- 
[]'s
Thiago Jung Bauermann
IBM Linux Technology Center


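Generic code would then consult the symbol roughly like this (a sketch; the
helper name is hypothetical and not from the posted series):

/* Sketch: skip the purgatory segment in the kexec image checksum only on
 * architectures that declare they modify purgatory after the checksum is
 * taken. is_purgatory_segment() is a hypothetical helper.
 */
static bool segment_in_checksum(const struct kexec_segment *seg)
{
	if (IS_ENABLED(CONFIG_ARCH_MODIFIES_KEXEC_PURGATORY) &&
	    is_purgatory_segment(seg))
		return false;
	return true;
}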

Re: [PATCH v2 3/3] x86,arch_prctl Add ARCH_[GET|SET]_CPUID for controlling the CPUID instruction

2016-09-14 Thread Kyle Huey
On Wed, Sep 14, 2016 at 6:29 PM, Andy Lutomirski  wrote:
> On Wed, Sep 14, 2016 at 2:01 PM, Kyle Huey  wrote:
>> Intel supports faulting on the CPUID instruction in newer processors. Bit
>> 31 of MSR_PLATFORM_INFO advertises support for this feature. It is
>> documented in detail in Section 2.3.2 of
>> http://www.intel.com/content/dam/www/public/us/en/documents/application-notes/virtualization-technology-flexmigration-application-note.pdf
>>
>> Signed-off-by: Kyle Huey 
>> ---
>>  arch/x86/include/asm/msr-index.h  |   1 +
>>  arch/x86/include/asm/thread_info.h|   4 +-
>>  arch/x86/include/uapi/asm/prctl.h |   6 +
>>  arch/x86/kernel/process.c |  81 +++
>>  tools/testing/selftests/x86/Makefile  |   2 +-
>>  tools/testing/selftests/x86/cpuid-fault.c | 223 
>> ++
>>  6 files changed, 315 insertions(+), 2 deletions(-)
>>  create mode 100644 tools/testing/selftests/x86/cpuid-fault.c
>>
>> diff --git a/arch/x86/include/asm/msr-index.h 
>> b/arch/x86/include/asm/msr-index.h
>> index 83908d5..4aebec2 100644
>> --- a/arch/x86/include/asm/msr-index.h
>> +++ b/arch/x86/include/asm/msr-index.h
>> @@ -53,6 +53,7 @@
>>  #define MSR_MTRRcap 0x00fe
>>  #define MSR_IA32_BBL_CR_CTL 0x0119
>>  #define MSR_IA32_BBL_CR_CTL3   0x011e
>> +#define MSR_MISC_FEATURES_ENABLES  0x0140
>>
>>  #define MSR_IA32_SYSENTER_CS   0x0174
>>  #define MSR_IA32_SYSENTER_ESP  0x0175
>> diff --git a/arch/x86/include/asm/thread_info.h 
>> b/arch/x86/include/asm/thread_info.h
>> index 8b7c8d8..ec93976 100644
>> --- a/arch/x86/include/asm/thread_info.h
>> +++ b/arch/x86/include/asm/thread_info.h
>> @@ -93,6 +93,7 @@ struct thread_info {
>>  #define TIF_SECCOMP 8   /* secure computing */
>>  #define TIF_USER_RETURN_NOTIFY 11  /* notify kernel of userspace return 
>> */
>>  #define TIF_UPROBE 12  /* breakpointed or singlestepping */
>> +#define TIF_NOCPUID 15  /* CPUID is not accessible in userland */
>>  #define TIF_NOTSC   16  /* TSC is not accessible in userland */
>>  #define TIF_IA32   17  /* IA32 compatibility process */
>>  #define TIF_FORK   18  /* ret_from_fork */
>> @@ -117,6 +118,7 @@ struct thread_info {
>>  #define _TIF_SECCOMP   (1 << TIF_SECCOMP)
>>  #define _TIF_USER_RETURN_NOTIFY(1 << TIF_USER_RETURN_NOTIFY)
>>  #define _TIF_UPROBE(1 << TIF_UPROBE)
>> +#define _TIF_NOCPUID   (1 << TIF_NOCPUID)
>>  #define _TIF_NOTSC (1 << TIF_NOTSC)
>>  #define _TIF_IA32  (1 << TIF_IA32)
>>  #define _TIF_FORK  (1 << TIF_FORK)
>> @@ -146,7 +148,7 @@ struct thread_info {
>>
>>  /* flags to check in __switch_to() */
>>  #define _TIF_WORK_CTXSW \
>> -   (_TIF_IO_BITMAP|_TIF_NOTSC|_TIF_BLOCKSTEP)
>> +   (_TIF_IO_BITMAP|_TIF_NOCPUID|_TIF_NOTSC|_TIF_BLOCKSTEP)
>>
>>  #define _TIF_WORK_CTXSW_PREV (_TIF_WORK_CTXSW|_TIF_USER_RETURN_NOTIFY)
>>  #define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW)
>> diff --git a/arch/x86/include/uapi/asm/prctl.h 
>> b/arch/x86/include/uapi/asm/prctl.h
>> index 3ac5032..c087e55 100644
>> --- a/arch/x86/include/uapi/asm/prctl.h
>> +++ b/arch/x86/include/uapi/asm/prctl.h
>> @@ -6,4 +6,10 @@
>>  #define ARCH_GET_FS 0x1003
>>  #define ARCH_GET_GS 0x1004
>>
>> +/* Get/set the process' ability to use the CPUID instruction */
>> +#define ARCH_GET_CPUID 0x1005
>> +#define ARCH_SET_CPUID 0x1006
>> +# define ARCH_CPUID_ENABLE 1   /* allow the use of the CPUID instruction */
>> +# define ARCH_CPUID_SIGSEGV 2  /* throw a SIGSEGV instead of reading the CPUID */
>> +
>>  #endif /* _ASM_X86_PRCTL_H */
>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>> index 0f857c3..5fc8e9d 100644
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -193,6 +193,69 @@ int set_tsc_mode(unsigned int val)
>> return 0;
>>  }
>>
>> +static void switch_cpuid_faulting(bool on)
>> +{
>> +   if (on)
>> +   msr_set_bit(MSR_MISC_FEATURES_ENABLES, 0);
>> +   else
>> +   msr_clear_bit(MSR_MISC_FEATURES_ENABLES, 0);
>> +}
>> +
>
> I don't know if we care (yet?), but this is going to be unnecessarily
> slow because of the implicit rdmsr.  You could add a percpu shadow
> copy of MISC_FEATURES_ENABLES, initialized during boot, and avoid the
> rdmsr.
>
>> +
>> +int set_cpuid_mode(struct task_struct *task, unsigned long val)
>> +{
>> +   /* Only disable/enable_cpuid() if it is supported on this hardware. 
>> */
>> +   bool cpuid_fault_supported = static_cpu_has(X86_FEATURE_CPUID_FAULT);
>> +
>> +   if (val == ARCH_CPUID_ENABLE && cpuid_fault_supported) {
>> +   if 

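For reference, Andy's per-CPU shadow suggestion would look roughly like this
(a sketch, assuming the shadow is filled from a single rdmsr at boot; names
are illustrative):

/* Keep a per-cpu shadow of MSR_MISC_FEATURES_ENABLES so that toggling
 * the CPUID-faulting bit needs only a wrmsr, not an implicit rdmsr.
 */
static DEFINE_PER_CPU(u64, msr_misc_features_shadow);

static void switch_cpuid_faulting(bool on)
{
	u64 msrval = this_cpu_read(msr_misc_features_shadow);

	if (on)
		msrval |= 1ULL;		/* bit 0: CPUID faulting enable */
	else
		msrval &= ~1ULL;

	this_cpu_write(msr_misc_features_shadow, msrval);
	wrmsrl(MSR_MISC_FEATURES_ENABLES, msrval);
}
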
Re: [PATCH v4 3/4] gpu: ipu-ic: Add complete image conversion support with tiling

2016-09-14 Thread Steve Longerbeam

Hi Philipp,


On 09/06/2016 02:26 AM, Philipp Zabel wrote:

Hi Steve,

On Wednesday, 17 Aug 2016 at 17:50 -0700, Steve Longerbeam wrote:

This patch implements complete image conversion support to ipu-ic,
with tiling to support scaling to and from images up to 4096x4096.
Image rotation is also supported.

The internal API is subsystem agnostic (no V4L2 dependency except
for the use of V4L2 fourcc pixel formats).

Callers prepare for image conversion by calling
ipu_image_convert_prepare(), which initializes the parameters of
the conversion.

... and possibly allocates intermediate buffers for rotation support.
This should be documented somewhere, with a note that v4l2 users should
be doing this during REQBUFS.


I added comment headers for all the image conversion prototypes.
It caused bloat in imx-ipu-v3.h, so I moved it to a new header:
include/video/imx-image-convert.h, but let me know if we should put
this somewhere else and/or under Documentation/ somewhere.



  The caller passes in the ipu_ic task to use for
the conversion, the input and output image formats, a rotation mode,
and a completion callback and completion context pointer:

struct image_converter_ctx *
ipu_image_convert_prepare(struct ipu_ic *ic,
   struct ipu_image *in, struct ipu_image *out,
   enum ipu_rotate_mode rot_mode,
   image_converter_cb_t complete,
   void *complete_context);

As I commented on the other patch, I think the image_convert functions
should use a separate handle for the image conversion queues that sit on
top of the ipu_ic task handles.


Here is a new prototype I came up with:

struct ipu_image_convert_ctx *
ipu_image_convert_prepare(struct ipu_soc *ipu, enum ipu_ic_task ic_task,
  struct ipu_image *in, struct ipu_image *out,
  enum ipu_rotate_mode rot_mode,
  ipu_image_convert_cb_t complete,
  void *complete_context);

In other words, the ipu_ic handle is replaced by the IPU handle and IC task
that are requested for carrying out the conversion.

The image converter will acquire the ipu_ic handle internally, whenever
there are queued contexts to that IC task (which I am calling a
'struct ipu_image_convert_chan'). This way the IC handle can be shared by
all contexts using that IC task. After all contexts have been freed from
the (struct ipu_image_convert_chan)->ctx_list queue, the ipu_ic handle is
freed.

The ipu_ic handle is acquired in get_ipu_resources() and freed in
release_ipu_resources(), along with all the other IPU resources that
*could possibly be needed* in that ipu_image_convert_chan by future
contexts (*all* idmac channels, *all* irqs).


I should have done this from the start, instead of allowing multiple
handles to the IC tasks.

Thanks for pointing this out.
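
A usage sketch of the proposed prototype, to show the intended call flow (the
callback signature and ERR_PTR-style error semantics are assumptions, since
the thread does not show the typedef; task and rotation values are arbitrary):

/* Hypothetical caller of the proposed API; my_complete()'s signature is
 * an assumption, as are the error semantics of the returned context.
 */
static void my_complete(void *ctx, int err)
{
	/* conversion finished, or failed with 'err' */
}

static int start_conversion(struct ipu_soc *ipu, struct ipu_image *in,
			    struct ipu_image *out)
{
	struct ipu_image_convert_ctx *ctx;

	ctx = ipu_image_convert_prepare(ipu, IC_TASK_POST_PROCESSOR,
					in, out, IPU_ROTATE_90_RIGHT,
					my_complete, NULL);
	return PTR_ERR_OR_ZERO(ctx);
}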




+
+#define MIN_W 128
+#define MIN_H 128

Where does this minimum come from?


Nowhere really :) These are just some sane minimums to pass
to clamp_align() when aligning input/output width/height in
ipu_image_convert_adjust().
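
For reference, a sketch of what such a clamp_align() helper typically looks
like (an assumed shape based on the name and common v4l2 usage; the driver's
actual implementation may differ):

/* Round 'x' to the nearest multiple of 2^align, then clamp to [min, max].
 * This shape is assumed, not taken from the posted patch.
 */
static unsigned int clamp_align(unsigned int x, unsigned int min,
				unsigned int max, unsigned int align)
{
	if (align)
		x = ((x + (1 << (align - 1))) >> align) << align;
	return x < min ? min : (x > max ? max : x);
}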


+struct ic_task_channels {
+   int in;
+   int out;
+   int rot_in;
+   int rot_out;
+   int vdi_in_p;
+   int vdi_in;
+   int vdi_in_n;

The vdi channels are unused.


Well, I'd prefer to keep the VDI channels. It's quite possible we
can add motion compensated deinterlacing support using the
PRP_VF task to the image converter in the future.


+struct image_converter_ctx {
+   struct image_converter *cvt;
+
+   image_converter_cb_t complete;
+   void *complete_context;
+
+   /* Source/destination image data and rotation mode */
+   struct ipu_ic_image in;
+   struct ipu_ic_image out;
+   enum ipu_rotate_mode rot_mode;
+
+   /* intermediate buffer for rotation */
+   struct ipu_ic_dma_buf rot_intermediate[2];

No need to change it now, but I assume these could be per IC task
instead of per context.


Actually no. The rotation intermediate buffers have the dimension
of a single tile, so they must remain in the context struct.


+static const struct ipu_ic_pixfmt ipu_ic_formats[] = {
+   {
+   .name   = "RGB565",

Please drop the names, keeping a list of user readable format names is
the v4l2 core's business, not ours.


done.


+   .fourcc = V4L2_PIX_FMT_RGB565,
+   .bpp= 16,

bpp is only ever used in bytes, not bits (always divided by 8).
Why not make this bytes_per_pixel or pixel_stride = 2.


Actually bpp is used to calculate *total* tile sizes and *total* bytes
per line. For the planar 4:2:0 formats that means it must be specified
in bits.
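
A worked example of why bits are needed: for V4L2_PIX_FMT_YUV420 at 12 bpp, a
64x64 tile occupies 64*64*12/8 = 6144 bytes (4096 Y + 1024 U + 1024 V), which
no whole bytes-per-pixel value can express:

/* Total bytes for a w x h tile of a format specified in bits per pixel.
 * For 64x64 YUV420 (12 bpp): (64 * 64 * 12) / 8 = 6144 bytes.
 */
static inline unsigned int tile_bytes(unsigned int w, unsigned int h,
				      unsigned int bpp_bits)
{
	return (w * h * bpp_bits) / 8;
}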



+   }, {
+   .name   = "4:2:0 planar, YUV",
+   .fourcc = V4L2_PIX_FMT_YUV420,
+   .bpp= 12,
+   .y_depth = 8,

y_depth is only ever used in bytes, not bits (always divided by 8).
Why not make this bool planar instead.


sure why not, although I think 
