linux-next: Tree for May 9

2016-05-08 Thread Stephen Rothwell
Hi all,

Changes since 20160506:

Dropped tree: hsi (at the maintainer's request)

The f2fs tree gained a conflict against the ext4 tree.

The libata tree gained a build failure so I used the version from
next-20160506 for today.

The net-next tree gained conflicts against the wireless-drivers and
net trees.

The drm tree gained conflicts against Linus' tree.

The sound-asoc tree lost its build failure.

Non-merge commits (relative to Linus' tree): 8795
 7764 files changed, 383723 insertions(+), 168978 deletions(-)



I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ).  If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one.  You should use "git fetch" and checkout or reset to the new
master.

You can see which trees have been included by looking in the Next/Trees
file in the source.  There are also quilt-import.log and merge.log
files in the Next directory.  Between each merge, the tree was built
with a ppc64_defconfig for powerpc and an allmodconfig (with
CONFIG_BUILD_DOCSRC=n) for x86_64, a multi_v7_defconfig for arm and a
native build of tools/perf. After the final fixups (if any), I do an
x86_64 modules_install followed by builds for x86_64 allnoconfig,
powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig, allyesconfig
(this fails its final link) and pseries_le_defconfig and i386, sparc
and sparc64 defconfig.

Below is a summary of the state of the merge.

I am currently merging 235 trees (counting Linus' and 35 trees of patches
pending for Linus' tree).

Stats about the size of the tree over time can be seen at
http://neuling.org/linux-next-size.html .

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next .  If maintainers want to give
advice about cross compilers/configs that work, we are always open to adding
more builds.

Thanks to Randy Dunlap for doing many randconfig builds.  And to Paul
Gortmaker for triage and bug fixes.

-- 
Cheers,
Stephen Rothwell

$ git checkout master
$ git reset --hard stable
Merging origin/master (44549e8f5eea Linux 4.6-rc7)
Merging fixes/master (9735a22799b9 Linux 4.6-rc2)
Merging kbuild-current/rc-fixes (3d1450d54a4f Makefile: Force gzip and xz on 
module install)
Merging arc-current/for-curr (26f9d5fd82ca ARC: support HIGHMEM even without 
PAE40)
Merging arm-current/fixes (ec953b70f368 ARM: 8573/1: domain: move 
{set,get}_domain under config guard)
Merging m68k-current/for-linus (7b8ba82ad4ad m68k/defconfig: Update defconfigs 
for v4.6-rc2)
Merging metag-fixes/fixes (0164a711c97b metag: Fix ioremap_wc/ioremap_cached 
build errors)
Merging powerpc-fixes/fixes (b4c112114aab powerpc: Fix bad inline asm 
constraint in create_zero_mask())
Merging powerpc-merge-mpe/fixes (bc0195aad0da Linux 4.2-rc2)
Merging sparc/master (33656a1f2ee5 Merge branch 'for_linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs)
Merging net/master (8c1f45462574 netxen: netxen_rom_fast_read() doesn't return 
-1)
Merging ipsec/master (d6af1a31cc72 vti: Add pmtu handling to vti_xmit.)
Merging ipvs/master (f28f20da704d Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net)
Merging wireless-drivers/master (cbbba30f1ac9 Merge tag 
'iwlwifi-for-kalle-2016-05-04' of 
https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-fixes)
Merging mac80211/master (e6436be21e77 mac80211: fix statistics leak if 
dev_alloc_name() fails)
Merging sound-current/for-linus (2d2c038a ALSA: usb-audio: Quirk for yet 
another Phoenix Audio devices (v2))
Merging pci-current/for-linus (9a2a5a638f8e PCI: Do not treat EPROBE_DEFER as 
device attach failure)
Merging driver-core.current/driver-core-linus (c3b46c73264b Linux 4.6-rc4)
Merging tty.current/tty-linus (02da2d72174c Linux 4.6-rc5)
Merging usb.current/usb-linus (9be427efc764 Revert "USB / PM: Allow USB devices 
to remain runtime-suspended when sleeping")
Merging usb-gadget-fixes/fixes (38740a5b87d5 usb: gadget: f_fs: Fix 
use-after-free)
Merging usb-serial-fixes/usb-linus (74d2a91aec97 USB: serial: option: add even 
more ZTE device ids)
Merging usb-chipidea-fixes/ci-for-usb-stable (d144dfea8af7 usb: chipidea: otg: 
change workqueue ci_otg as freezable)
Merging staging.current/staging-linus (2b86c4a84377 Merge tag 
'iio-fixes-for-4.6d' of git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio 
into staging-linus)
Merging char-misc.current/char-misc-linus (d1306eb675ad nvmem: mxs-ocotp: fix 
buffer overflow in read)
Merging input-current/for-linus (eb43335c4095 Input: atmel_mxt_ts - use 
mxt_acquire_irq in mxt_soft_reset)
Merging crypto-current/master (58446fef579e crypto: rsa - select crypto mgr 
dependency)
Merging ide/master (1993b176a822 Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide)
Merging 

Re: [PATCH v2 1/2] mm, kasan: improve double-free detection

2016-05-08 Thread Dmitry Vyukov
On Sun, May 8, 2016 at 11:17 AM, Yury Norov  wrote:
> On Sat, May 07, 2016 at 03:15:59PM +, Luruo, Kuthonuzo wrote:
>> Thank you for the review!
>>
>> > > + switch (alloc_data.state) {
>> > > + case KASAN_STATE_QUARANTINE:
>> > > + case KASAN_STATE_FREE:
>> > > + kasan_report((unsigned long)object, 0, false,
>> > > + (unsigned long)__builtin_return_address(1));
>> >
>> > __builtin_return_address() is unsafe if argument is non-zero. Use
>> > return_address() instead.
>>
>> hmm, I/cscope can't seem to find an x86 implementation for return_address().
>> Will dig further; thanks.
>>
>
> It seems there's no generic interface to obtain the return address. x86
> has a working __builtin_return_address() and is OK with it; others
> use their own return_address(), and are OK as well.
>
> I think unification is needed here.


We use _RET_IP_ in other places in the portable part of KASAN.
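
For reference, the suggested change would look roughly like this (an
illustrative sketch only, reusing the state names from the patch under
review; _RET_IP_ is defined in <linux/kernel.h> as
__builtin_return_address(0), and kasan_report() is the internal helper
declared in mm/kasan/kasan.h):

	switch (alloc_data.state) {
	case KASAN_STATE_QUARANTINE:
	case KASAN_STATE_FREE:
		/* report with the immediate caller's IP, which is safe on
		 * every architecture, instead of __builtin_return_address(1) */
		kasan_report((unsigned long)object, 0, false, _RET_IP_);
		break;
	default:
		break;
	}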


Re: [PATCH v7 7/9] clk: mediatek: Enable critical clocks for MT2701

2016-05-08 Thread James Liao
Hi Stephen,

On Fri, 2016-05-06 at 16:12 -0700, Stephen Boyd wrote:
> On 04/14, James Liao wrote:
> > Some system clocks should be turned on by default on MT2701.
> > This patch enables these clocks when the related clocks have
> > been registered.
> > 
> > Signed-off-by: James Liao 
> > ---
> 
> critical clks got merged now (sorry I'm slowly getting back to
> looking at patches). Please use that flag.

I don't see critical clock support in v4.6-rc7. Is there a repo/branch
that has critical clocks merged?


Best regards,

James
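
For reference, the flag Stephen refers to is CLK_IS_CRITICAL from the clk
tree's development branch (not yet in v4.6-rc7). A minimal sketch of marking
a clock critical, with hypothetical clock and parent names rather than the
actual MT2701 ones, looks like this:

#include <linux/kernel.h>
#include <linux/clk-provider.h>

/* Hypothetical example: the clock and parent names are placeholders. */
static const char * const sys_clk_parents[] = { "clk26m" };

static const struct clk_init_data sys_clk_init = {
	.name		= "sys_critical_clk",
	.ops		= &clk_gate_ops,
	.parent_names	= sys_clk_parents,
	.num_parents	= ARRAY_SIZE(sys_clk_parents),
	/* the core enables this clock at registration and keeps it enabled */
	.flags		= CLK_IS_CRITICAL,
};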



Re: [PATCH] compiler-gcc: require gcc 4.8 for powerpc __builtin_bswap16()

2016-05-08 Thread Sedat Dilek
On 5/9/16, Stephen Rothwell  wrote:
> Hi Josh,
>
> On Fri, 6 May 2016 09:22:25 -0500 Josh Poimboeuf 
> wrote:
>>
>> I've also seen no problems on powerpc with 4.4 and 4.8.  I suspect it's
>> specific to gcc 4.6.  Stephen, can you confirm this patch fixes it?
>
> That will obviously fix the problem for us (since it will effectively
> restore the code to what it was before the other commit for our gcc
> 4.6.3 builds and we have not seen it in other builds).  I will add this
> patch to linux-next today.
>
> And since "byteswap: try to avoid __builtin_constant_p gcc bug" is not
> in Linus' tree, hopefully we can have this fix applied soon.
>

FYI, this patch is in Linus' tree (v4.6-rc7 has it).

- Sedat -

[1] 
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7322dd755e7dd34bc5359aa27abeed1687e0f628

>> From: Josh Poimboeuf 
>> Subject: [PATCH] compiler-gcc: require gcc 4.8 for powerpc
>> __builtin_bswap16()
>>
>> gcc support for __builtin_bswap16() was supposedly added for powerpc in
>> gcc 4.6, and was then later added for other architectures in gcc 4.8.
>>
>> However, Stephen Rothwell reported that attempting to use it on powerpc
>> in gcc 4.6 fails with:
>>
>>   lib/vsprintf.c:160:2: error: initializer element is not constant
>>   lib/vsprintf.c:160:2: error: (near initialization for 'decpair[0]')
>>   lib/vsprintf.c:160:2: error: initializer element is not constant
>>   lib/vsprintf.c:160:2: error: (near initialization for 'decpair[1]')
>>   lib/vsprintf.c:160:2: error: initializer element is not constant
>>   lib/vsprintf.c:160:2: error: (near initialization for 'decpair[2]')
>>   lib/vsprintf.c:160:2: error: initializer element is not constant
>>   lib/vsprintf.c:160:2: error: (near initialization for 'decpair[3]')
>>   lib/vsprintf.c:160:2: error: initializer element is not constant
>>
>> I'm not entirely sure what those errors mean, but I don't see them on
>> gcc 4.8.  So let's consider gcc 4.8 to be the official starting point
>> for __builtin_bswap16().
>>
>> Fixes: 7322dd755e7d ("byteswap: try to avoid __builtin_constant_p gcc
>> bug")
>> Reported-by: Stephen Rothwell 
>> Signed-off-by: Josh Poimboeuf 
>> ---
>>  include/linux/compiler-gcc.h | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
>> index eeae401..3d5202e 100644
>> --- a/include/linux/compiler-gcc.h
>> +++ b/include/linux/compiler-gcc.h
>> @@ -246,7 +246,7 @@
>>  #define __HAVE_BUILTIN_BSWAP32__
>>  #define __HAVE_BUILTIN_BSWAP64__
>>  #endif
>> -#if GCC_VERSION >= 40800 || (defined(__powerpc__) && GCC_VERSION >= 40600)
>> +#if GCC_VERSION >= 40800
>>  #define __HAVE_BUILTIN_BSWAP16__
>>  #endif
>>  #endif /* CONFIG_ARCH_USE_BUILTIN_BSWAP */
>> --
>> 2.4.11
>
> --
> Cheers,
> Stephen Rothwell
> --
> To unsubscribe from this list: send the line "unsubscribe linux-next" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
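
For context, the construct that trips gcc 4.6 is a static table whose
initializers rely on __builtin_bswap16() being folded to a compile-time
constant. A reduced illustration of the same shape (a sketch, not the
actual lib/vsprintf.c code):

#include <stdint.h>

/* gcc 4.6 on powerpc reportedly fails to fold these to constants,
 * producing "initializer element is not constant"; gcc 4.8 succeeds. */
static const uint16_t le_table[] = {
	__builtin_bswap16(0x3030),	/* "00" stored as a little-endian pair */
	__builtin_bswap16(0x3130),	/* "01" */
};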


Re: [PATCH 0/6] Intel Secure Guard Extensions

2016-05-08 Thread Jarkko Sakkinen
On Fri, May 06, 2016 at 01:54:14PM +0200, Thomas Gleixner wrote:
> On Fri, 6 May 2016, Jarkko Sakkinen wrote:
> 
> > On Tue, May 03, 2016 at 04:06:27AM -0500, Dr. Greg Wettstein wrote:
> > > It would be helpful and instructive for anyone involved in this debate
> > > to review the following URL which details Intel's SGX licensing
> > > program:
> > > 
> > > https://software.intel.com/en-us/articles/intel-sgx-product-licensing
> > 
> > I think it would be good to note that the licensing process is available
> > only for Windows. For Linux you can only use debug enclaves at the
> > moment. The default LE has "allow-all" policy for debug enclaves.
> 
> Which makes the feature pretty useless.
>  
> > > I think the only way forward to make all of this palatable is to
> > > embrace something similar to what has been done with Secure Boot.  The
> > > Root Enclave Key will need to be something which can be reconfigured
> > > by the Platform Owner through BIOS/EFI.  That model would take Intel
> > > off the hook from a security perspective and establish the notion of
> > > platform trust to be a bilateral relationship between a service
> > > provider and client.
> > 
> > This concern has been raised many times now. Sadly this did not make
> > it into Skylake, but in the future we will have one-shot MSRs (can be set
> > only once per boot cycle) for defining your own root of trust.
> 
> We'll wait for that to happen.

I fully understand if you (and others) want to keep this standpoint but
what if we could get it to staging after I've revised it with suggested
changes and internal changes in my TODO? Then it would not pollute the
mainline kernel but still would be easily available for experimentation.

There was one header outside the staging tree in the patch set, sgx.h, which
I could move into the staging area in the next revision.

For the next revision I'll document how IA32_LEPUBKEYHASHx MSRs work
based on some concerns that Andy raised so that we can hopefully have a
better discussion about this feature.

> Thanks,
> 
>   tglx

/Jarkko


Re: [PATCH v7 8/9] clk: mediatek: Add config options for MT2701 subsystem clocks

2016-05-08 Thread James Liao
Hi Stephen,

On Fri, 2016-05-06 at 16:02 -0700, Stephen Boyd wrote:
> On 04/14, James Liao wrote:
> > MT2701 subsystem clocks are optional and should be enabled only if
> > their subsystem drivers are ready to control these clocks.
> > 
> > Signed-off-by: James Liao 
> > ---
> 
> Why is this patch split off from the patch that introduces the
> file?

I was looking for comments about how to make subsystem clocks optional,
so I used a separate patch to do it. Is using config options to enable
subsystem clock support an acceptable approach?


Best regards,

James




Re: [PATCH 1/4] locking/rwsem: Avoid stale ->count for rwsem_down_write_failed()

2016-05-08 Thread Peter Hurley
On 05/08/2016 09:56 PM, Davidlohr Bueso wrote:
> The field is obviously updated without the lock and needs a READ_ONCE
> while waiting for lock holder(s) to go away, just like we do with
> all other ->count accesses.

This isn't actually fixing a bug, because the code already passes through
several full barriers which force sem->count to be reloaded.

I think the patch is ok if you want it just for consistency anyway,
but please change $subject and changelog.

Regards,
Peter Hurley


> Signed-off-by: Davidlohr Bueso 
> ---
>  kernel/locking/rwsem-xadd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> index df4dcb883b50..7d62772600cf 100644
> --- a/kernel/locking/rwsem-xadd.c
> +++ b/kernel/locking/rwsem-xadd.c
> @@ -494,7 +494,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore 
> *sem, int state)
>   }
>   schedule();
>   set_current_state(state);
> - } while ((count = sem->count) & RWSEM_ACTIVE_MASK);
> + } while ((count = READ_ONCE(sem->count)) & RWSEM_ACTIVE_MASK);
>  
>   raw_spin_lock_irq(&sem->wait_lock);
>   }
> 



Re: [PATCH v2 2/2] kasan: add kasan_double_free() test

2016-05-08 Thread Dmitry Vyukov
On Fri, May 6, 2016 at 1:50 PM, Kuthonuzo Luruo  wrote:
> This patch adds a new 'test_kasan' test for KASAN double-free error
> detection when the same slab object is concurrently deallocated.
>
> Signed-off-by: Kuthonuzo Luruo 
> ---
> Changes in v2:
> - This patch is new for v2.
> ---
>  lib/test_kasan.c |   79 
> ++
>  1 files changed, 79 insertions(+), 0 deletions(-)
>
> diff --git a/lib/test_kasan.c b/lib/test_kasan.c
> index bd75a03..dec5f74 100644
> --- a/lib/test_kasan.c
> +++ b/lib/test_kasan.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  static noinline void __init kmalloc_oob_right(void)
>  {
> @@ -389,6 +390,83 @@ static noinline void __init ksize_unpoisons_memory(void)
> kfree(ptr);
>  }
>
> +#ifdef CONFIG_SLAB
> +#ifdef CONFIG_SMP

Will it fail without CONFIG_SMP if we create more than 1 kthread? If
it does not fail, then please remove the ifdef.
Also see below.


> +static DECLARE_COMPLETION(starting_gun);
> +static DECLARE_COMPLETION(finish_line);
> +
> +static int try_free(void *p)
> +{
> +   wait_for_completion(&starting_gun);
> +   kfree(p);
> +   complete(&finish_line);
> +   return 0;
> +}
> +
> +/*
> + * allocs an object; then all cpus concurrently attempt to free the
> + * same object.
> + */
> +static noinline void __init kasan_double_free(void)
> +{
> +   char *p;
> +   int cpu;
> +   struct task_struct **tasks;
> +   size_t size = (KMALLOC_MAX_CACHE_SIZE/4 + 1);

Is it important to use such a tricky size calculation here? If it is not
important, then please replace it with some small constant.
There are some tests that calculate size based on
KMALLOC_MAX_CACHE_SIZE, but that's important for them.



> +   /*
> +* max slab size instrumented by KASAN is KMALLOC_MAX_CACHE_SIZE/2.
> +* Do not increase size beyond this: slab corruption from double-free
> +* may ensue.
> +*/
> +   pr_info("concurrent double-free test\n");
> +   init_completion(&starting_gun);
> +   init_completion(&finish_line);
> +   tasks = kzalloc((sizeof(tasks) * nr_cpu_ids), GFP_KERNEL);
> +   if (!tasks) {
> +   pr_err("Allocation failed\n");
> +   return;
> +   }
> +   p = kmalloc(size, GFP_KERNEL);
> +   if (!p) {
> +   pr_err("Allocation failed\n");
> +   return;
> +   }
> +
> +   for_each_online_cpu(cpu) {


Won't the test fail with 1 cpu?
By failing I mean that it won't detect the double-free. Soon we will
start automatically ensuring that a double-free test in fact detects a
double-free.
I think it will be much simpler to use just, say, 4 threads. It will
eliminate the kzalloc/kfree, the allocation-failure tests, potential memory
leaks, and also fix !CONFIG_SMP.



> +   tasks[cpu] = kthread_create(try_free, (void *)p, "try_free%d",
> +   cpu);
> +   if (IS_ERR(tasks[cpu])) {
> +   WARN(1, "kthread_create failed.\n");
> +   return;
> +   }
> +   kthread_bind(tasks[cpu], cpu);
> +   wake_up_process(tasks[cpu]);
> +   }
> +
> +   complete_all(&starting_gun);
> +   for_each_online_cpu(cpu)
> +   wait_for_completion(&finish_line);
> +   kfree(tasks);
> +}
> +#else
> +static noinline void __init kasan_double_free(void)

This test should work with CONFIG_SLAB as well.
Please name the tests differently (e.g. kasan_double_free and
kasan_double_free_threaded), and run kasan_double_free always.
If kasan_double_free_threaded fails, but kasan_double_free does not,
that's already some useful info. And if both fail, then it's always
better to have a simpler reproducer.


> +{
> +   char *p;
> +   size_t size = 2049;
> +
> +   pr_info("double-free test\n");
> +   p = kmalloc(size, GFP_KERNEL);
> +   if (!p) {
> +   pr_err("Allocation failed\n");
> +   return;
> +   }
> +   kfree(p);
> +   kfree(p);
> +}
> +#endif
> +#endif
> +
>  static int __init kmalloc_tests_init(void)
>  {
> kmalloc_oob_right();
> @@ -414,6 +492,7 @@ static int __init kmalloc_tests_init(void)
> kasan_global_oob();
>  #ifdef CONFIG_SLAB
> kasan_quarantine_cache();
> +   kasan_double_free();
>  #endif
> ksize_unpoisons_memory();
> return -EAGAIN;
> --
> 1.7.1
>
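
A minimal sketch of the fixed-thread-count variant suggested above
(illustrative only: the function name, thread count and allocation size are
assumptions, while try_free(), starting_gun and finish_line are the helpers
from the patch):

#define DOUBLE_FREE_THREADS 4

static noinline void __init kasan_double_free_threaded(void)
{
	struct task_struct *tasks[DOUBLE_FREE_THREADS];
	char *p;
	int i;

	pr_info("concurrent double-free test\n");
	p = kmalloc(128, GFP_KERNEL);
	if (!p) {
		pr_err("Allocation failed\n");
		return;
	}
	for (i = 0; i < DOUBLE_FREE_THREADS; i++) {
		/* kthread_run() creates and immediately wakes the thread */
		tasks[i] = kthread_run(try_free, p, "try_free%d", i);
		if (IS_ERR(tasks[i]))
			return;
	}
	complete_all(&starting_gun);
	for (i = 0; i < DOUBLE_FREE_THREADS; i++)
		wait_for_completion(&finish_line);
}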


Re: [PATCH 3/6] intel_sgx: driver for Intel Secure Guard eXtensions

2016-05-08 Thread Jarkko Sakkinen
On Fri, Apr 29, 2016 at 03:22:19PM -0700, Jethro Beekman wrote:
> On 29-04-16 13:04, Jarkko Sakkinen wrote:
> >>> Why would you want to do that?
> >>
> >> ...
> >
> > Do you see this as a performance issue or why do you think that this
> > would hurt that much?
> 
> I don't think it's a performance issue at all. I'm just giving an example of 
> why
> you'd want to do this. I'm sure people who want to use this instruction set 
> can
> come up with other uses, so I think the driver should support it. Other 
> drivers
> on different platform might support this, in which case we should be 
> compatible
> (to achieve the same enclave measurement). Other Linux drivers support it 
> [1]. I
> would ask: why would you not want to do this? It seems trivial to expand the
> current flag into 16 separate flags; one for each 256-byte chunk in the page.

I'm fine with adding a 16-bit bitmask.

/Jarkko


Re: [PATCH 1/1] xen/gntdev: kmalloc structure gntdev_copy_batch

2016-05-08 Thread Juergen Gross
On 07/05/16 10:17, Heinrich Schuchardt wrote:
> Commit a4cdb556cae0 ("xen/gntdev: add ioctl for grant copy")
> leads to a warning
> xen/gntdev.c: In function ‘gntdev_ioctl_grant_copy’:
> xen/gntdev.c:949:1: warning: the frame size of 1248 bytes
> is larger than 1024 bytes [-Wframe-larger-than=]
> 
> This can be avoided by using kmalloc instead of the stack.
> 
> Testing requires CONFIG_XEN_GNTDEV.
> 
> Fixes: a4cdb556cae0 ("xen/gntdev: add ioctl for grant copy")
> Signed-off-by: Heinrich Schuchardt 

Acked-by: Juergen Gross 
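
For illustration, the general shape of the fix (a sketch with placeholder
names, not the actual gntdev code):

#include <linux/slab.h>

/* Placeholder structure standing in for gntdev_copy_batch. */
struct big_batch {
	char data[1200];	/* roughly the size that blew up the stack frame */
};

static long ioctl_with_heap_batch(void)
{
	struct big_batch *batch;
	long ret = 0;

	batch = kmalloc(sizeof(*batch), GFP_KERNEL);	/* was an on-stack variable */
	if (!batch)
		return -ENOMEM;

	/* ... use *batch exactly as the on-stack copy was used ... */

	kfree(batch);
	return ret;
}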


Re: [PATCH] kdump: Fix gdb macros work work with newer and 64-bit kernels

2016-05-08 Thread Baoquan He
Hi Corey,

I am trying to review this patch now, and the fixes it contains are very
good. I have added several concerns as inline comments.

By the way, did you run this on your side?

Hi Vivek,

A member variable was added to task_struct in the commit below, replacing
pids[PIDTYPE_TGID], and nobody has complained about it since. It seems
people rarely use this utility.

commit 47e65328a7b1cdfc4e3102e50d60faf94ebba7d3
Author: Oleg Nesterov 
Date:   Tue Mar 28 16:11:25 2006 -0800

[PATCH] pids: kill PIDTYPE_TGID



On 04/27/16 at 07:21am, Corey Minyard wrote:
> Any comments on this?  If no one else cares I'd be willing to take over
> maintenance of this.
> 
> -corey
> 
> On 02/25/2016 07:51 AM, miny...@acm.org wrote:
> >From: Corey Minyard 
> >
> >Lots of little changes needed to be made to clean these up, remove the
> >four byte pointer assumption and traverse the pid queue properly.
> >Also consolidate the traceback code into a single function instead
> >of having three copies of it.
> >
> >Signed-off-by: Corey Minyard 
> >---
> >  Documentation/kdump/gdbmacros.txt | 90 
> > +--
> >  1 file changed, 40 insertions(+), 50 deletions(-)
> >
> >I sent this earlier, but I didn't get a response.  These are clearly
> >wrong.  I'd be happy to take over maintenance of these macros.  It
> >might be better to move them someplace else, too, since they are also
> >useful for kgdb.
> >
> >diff --git a/Documentation/kdump/gdbmacros.txt 
> >b/Documentation/kdump/gdbmacros.txt
> >index 9b9b454..e5bbd8d 100644
> >--- a/Documentation/kdump/gdbmacros.txt
> >+++ b/Documentation/kdump/gdbmacros.txt
> >@@ -15,14 +15,14 @@
> >  define bttnobp
> > set $tasks_off=((size_t)&((struct task_struct *)0)->tasks)
> >-set $pid_off=((size_t)&((struct task_struct *)0)->pids[1].pid_list.next)
> >+set $pid_off=((size_t)&((struct task_struct *)0)->thread_group.next)

This is quite a nice fix.

> > set $init_t=&init_task
> > set $next_t=(((char *)($init_t->tasks).next) - $tasks_off)
> > while ($next_t != $init_t)
> > set $next_t=(struct task_struct *)$next_t
> > printf "\npid %d; comm %s:\n", $next_t.pid, $next_t.comm
> > printf "===\n"
> >-set var $stackp = $next_t.thread.esp
> >+set var $stackp = $next_t.thread.sp
> > set var $stack_top = ($stackp & ~4095) + 4096
> > while ($stackp < $stack_top)
> >@@ -31,12 +31,12 @@ define bttnobp
> > end
> > set $stackp += 4
> > end
> >-set $next_th=(((char *)$next_t->pids[1].pid_list.next) - 
> >$pid_off)
> >+set $next_th=(((char *)$next_t->thread_group.next) - $pid_off)
> > while ($next_th != $next_t)
> > set $next_th=(struct task_struct *)$next_th
> > printf "\npid %d; comm %s:\n", $next_t.pid, $next_t.comm
> > printf "===\n"
> >-set var $stackp = $next_t.thread.esp
> >+set var $stackp = $next_t.thread.sp
> > set var $stack_top = ($stackp & ~4095) + 4096
> > while ($stackp < $stack_top)
> >@@ -45,7 +45,7 @@ define bttnobp
> > end
> > set $stackp += 4
> > end
> >-set $next_th=(((char *)$next_th->pids[1].pid_list.next) 
> >- $pid_off)
> >+set $next_th=(((char *)$next_th->thread_group.next) - 
> >$pid_off)
> > end
> > set $next_t=(char *)($next_t->tasks.next) - $tasks_off
> > end
> >@@ -54,42 +54,43 @@ document bttnobp
> > dump all thread stack traces on a kernel compiled with 
> > !CONFIG_FRAME_POINTER
> >  end
> >+define btthreadstruct

This is a nice wrapper, but I guess you want to name it
btthreadstack, right? I don't see how it is related to
thread_struct except for getting 'sp'.

> >+set var $pid_task = $arg0
> >+
> >+printf "\npid %d; comm %s:\n", $pid_task.pid, $pid_task.comm
> >+printf "task struct: "
> >+print $pid_task
> >+printf "===\n"
> >+set var $stackp = $pid_task.thread.sp
> >+set var $stack_top = ($stackp & ~4095) + 4096
> >+set var $stack_bot = ($stackp & ~4095)
> >+
> >+set $stackp = *((unsigned long *) $stackp)
> >+while (($stackp < $stack_top) && ($stackp > $stack_bot))
> >+set var $addr = *(((unsigned long *) $stackp) + 1)
> >+info symbol $addr
> >+set $stackp = *((unsigned long *) $stackp)
> >+end
> >+end
> >+document btthreadstruct
> >+ dump a thread stack using the given task structure pointer
> >+end
> >+
> >+
> >  define btt
> > set $tasks_off=((size_t)&((struct task_struct *)0)->tasks)
> >-set $pid_off=((size_t)&((struct task_struct 

RE: [Patch v3 5/8] firmware: qcom: scm: Convert to streaming DMA APIS

2016-05-08 Thread Sricharan
Hi,
> This patch converts the Qualcomm SCM driver to use the streaming DMA
> APIs for communication buffers.
> 
> Signed-off-by: Andy Gross 
> ---

 Reviewed-by: sricha...@codeaurora.org

Regards,
 Sricharan

>  drivers/firmware/qcom_scm-32.c | 152
+
> 
>  drivers/firmware/qcom_scm.c|   6 +-
>  drivers/firmware/qcom_scm.h|  10 +--
>  3 files changed, 58 insertions(+), 110 deletions(-)
> 
> diff --git a/drivers/firmware/qcom_scm-32.c b/drivers/firmware/qcom_scm-
> 32.c index 4388d13..3e71aec 100644
> --- a/drivers/firmware/qcom_scm-32.c
> +++ b/drivers/firmware/qcom_scm-32.c
> @@ -23,8 +23,7 @@
>  #include 
>  #include 
>  #include 
> -
> -#include 
> +#include 
> 
>  #include "qcom_scm.h"
> 
> @@ -97,44 +96,6 @@ struct qcom_scm_response {  };
> 
>  /**
> - * alloc_qcom_scm_command() - Allocate an SCM command
> - * @cmd_size: size of the command buffer
> - * @resp_size: size of the response buffer
> - *
> - * Allocate an SCM command, including enough room for the command
> - * and response headers as well as the command and response buffers.
> - *
> - * Returns a valid &qcom_scm_command on success or %NULL if the
> allocation fails.
> - */
> -static struct qcom_scm_command *alloc_qcom_scm_command(size_t
> cmd_size, size_t resp_size) -{
> - struct qcom_scm_command *cmd;
> - size_t len = sizeof(*cmd) + sizeof(struct qcom_scm_response) +
> cmd_size +
> - resp_size;
> - u32 offset;
> -
> - cmd = kzalloc(PAGE_ALIGN(len), GFP_KERNEL);
> - if (cmd) {
> - cmd->len = cpu_to_le32(len);
> - offset = offsetof(struct qcom_scm_command, buf);
> - cmd->buf_offset = cpu_to_le32(offset);
> - cmd->resp_hdr_offset = cpu_to_le32(offset + cmd_size);
> - }
> - return cmd;
> -}
> -
> -/**
> - * free_qcom_scm_command() - Free an SCM command
> - * @cmd: command to free
> - *
> - * Free an SCM command.
> - */
> -static inline void free_qcom_scm_command(struct qcom_scm_command
> *cmd) -{
> - kfree(cmd);
> -}
> -
> -/**
>   * qcom_scm_command_to_response() - Get a pointer to a
> qcom_scm_response
>   * @cmd: command
>   *
> @@ -168,7 +129,7 @@ static inline void
> *qcom_scm_get_response_buffer(const struct qcom_scm_response
>   return (void *)rsp + le32_to_cpu(rsp->buf_offset);  }
> 
> -static u32 smc(u32 cmd_addr)
> +static u32 smc(dma_addr_t cmd_addr)
>  {
>   int context_id;
>   register u32 r0 asm("r0") = 1;
> @@ -192,51 +153,15 @@ static u32 smc(u32 cmd_addr)
>   return r0;
>  }
> 
> -static int __qcom_scm_call(const struct qcom_scm_command *cmd) -{
> - int ret;
> - u32 cmd_addr = virt_to_phys(cmd);
> -
> - /*
> -  * Flush the command buffer so that the secure world sees
> -  * the correct data.
> -  */
> - secure_flush_area(cmd, cmd->len);
> -
> - ret = smc(cmd_addr);
> - if (ret < 0)
> - ret = qcom_scm_remap_error(ret);
> -
> - return ret;
> -}
> -
> -static void qcom_scm_inv_range(unsigned long start, unsigned long end) -{
> - u32 cacheline_size, ctr;
> -
> - asm volatile("mrc p15, 0, %0, c0, c0, 1" : "=r" (ctr));
> - cacheline_size = 4 << ((ctr >> 16) & 0xf);
> -
> - start = round_down(start, cacheline_size);
> - end = round_up(end, cacheline_size);
> - outer_inv_range(start, end);
> - while (start < end) {
> - asm ("mcr p15, 0, %0, c7, c6, 1" : : "r" (start)
> -  : "memory");
> - start += cacheline_size;
> - }
> - dsb();
> - isb();
> -}
> -
>  /**
>   * qcom_scm_call() - Send an SCM command
> - * @svc_id: service identifier
> - * @cmd_id: command identifier
> - * @cmd_buf: command buffer
> - * @cmd_len: length of the command buffer
> - * @resp_buf: response buffer
> - * @resp_len: length of the response buffer
> + * @dev: struct device
> + * @svc_id:  service identifier
> + * @cmd_id:  command identifier
> + * @cmd_buf: command buffer
> + * @cmd_len: length of the command buffer
> + * @resp_buf:response buffer
> + * @resp_len:length of the response buffer
>   *
>   * Sends a command to the SCM and waits for the command to finish
> processing.
>   *
> @@ -247,42 +172,60 @@ static void qcom_scm_inv_range(unsigned long
> start, unsigned long end)
>   * and response buffers is taken care of by qcom_scm_call; however,
callers
> are
>   * responsible for any other cached buffers passed over to the secure
world.
>   */
> -static int qcom_scm_call(u32 svc_id, u32 cmd_id, const void *cmd_buf,
> - size_t cmd_len, void *resp_buf, size_t resp_len)
> +static int qcom_scm_call(struct device *dev, u32 svc_id, u32 cmd_id,
> +  const void *cmd_buf, size_t cmd_len, void
> *resp_buf,
> +  size_t resp_len)
>  {
>   int ret;
>   struct qcom_scm_command *cmd;
>   struct qcom_scm_response *rsp;
> - unsigned long start, end;
> + size_t alloc_len = 

Re: [PATCH 4/4] x86/kasan: Instrument user memory access API

2016-05-08 Thread Dmitry Vyukov
On Fri, May 6, 2016 at 2:45 PM, Andrey Ryabinin  wrote:
> Exchange between user and kernel memory is coded in assembly language.
> Which means that such accesses won't be spotted by KASAN as a compiler
> instruments only C code.
> Add explicit KASAN checks to user memory access API to ensure that
> userspace writes to (or reads from) a valid kernel memory.
>
> Note: Unlike the others, strncpy_from_user() is written mostly in C and KASAN
> sees the memory accesses in it. However, it makes sense to add an explicit
> check for all @count bytes that could *potentially* be written to the kernel.


Reviewed-by: Dmitry Vyukov 

Thanks!


> Signed-off-by: Andrey Ryabinin 
> Cc: Alexander Potapenko 
> Cc: Dmitry Vyukov 
> Cc: x...@kernel.org
> ---
>  arch/x86/include/asm/uaccess.h| 5 +
>  arch/x86/include/asm/uaccess_64.h | 7 +++
>  lib/strncpy_from_user.c   | 2 ++
>  3 files changed, 14 insertions(+)
>
> diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
> index 0b17fad..5dd6d18 100644
> --- a/arch/x86/include/asm/uaccess.h
> +++ b/arch/x86/include/asm/uaccess.h
> @@ -5,6 +5,7 @@
>   */
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -732,6 +733,8 @@ copy_from_user(void *to, const void __user *from, 
> unsigned long n)
>
> might_fault();
>
> +   kasan_check_write(to, n);
> +
> /*
>  * While we would like to have the compiler do the checking for us
>  * even in the non-constant size case, any false positives there are
> @@ -765,6 +768,8 @@ copy_to_user(void __user *to, const void *from, unsigned 
> long n)
>  {
> int sz = __compiletime_object_size(from);
>
> +   kasan_check_read(from, n);
> +
> might_fault();
>
> /* See the comment in copy_from_user() above. */
> diff --git a/arch/x86/include/asm/uaccess_64.h 
> b/arch/x86/include/asm/uaccess_64.h
> index 3076986..2eac2aa 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -109,6 +110,7 @@ static __always_inline __must_check
>  int __copy_from_user(void *dst, const void __user *src, unsigned size)
>  {
> might_fault();
> +   kasan_check_write(dst, size);
> return __copy_from_user_nocheck(dst, src, size);
>  }
>
> @@ -175,6 +177,7 @@ static __always_inline __must_check
>  int __copy_to_user(void __user *dst, const void *src, unsigned size)
>  {
> might_fault();
> +   kasan_check_read(src, size);
> return __copy_to_user_nocheck(dst, src, size);
>  }
>
> @@ -242,12 +245,14 @@ int __copy_in_user(void __user *dst, const void __user 
> *src, unsigned size)
>  static __must_check __always_inline int
>  __copy_from_user_inatomic(void *dst, const void __user *src, unsigned size)
>  {
> +   kasan_check_write(dst, size);
> return __copy_from_user_nocheck(dst, src, size);
>  }
>
>  static __must_check __always_inline int
>  __copy_to_user_inatomic(void __user *dst, const void *src, unsigned size)
>  {
> +   kasan_check_read(src, size);
> return __copy_to_user_nocheck(dst, src, size);
>  }
>
> @@ -258,6 +263,7 @@ static inline int
>  __copy_from_user_nocache(void *dst, const void __user *src, unsigned size)
>  {
> might_fault();
> +   kasan_check_write(dst, size);
> return __copy_user_nocache(dst, src, size, 1);
>  }
>
> @@ -265,6 +271,7 @@ static inline int
>  __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
>   unsigned size)
>  {
> +   kasan_check_write(dst, size);
> return __copy_user_nocache(dst, src, size, 0);
>  }
>
> diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
> index 3384032..e3472b0 100644
> --- a/lib/strncpy_from_user.c
> +++ b/lib/strncpy_from_user.c
> @@ -1,5 +1,6 @@
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -103,6 +104,7 @@ long strncpy_from_user(char *dst, const char __user *src, 
> long count)
> if (unlikely(count <= 0))
> return 0;
>
> +   kasan_check_write(dst, count);
> max_addr = user_addr_max();
> src_addr = (unsigned long)src;
> if (likely(src_addr < max_addr)) {
> --
> 2.7.3
>


Re: [PATCH] mm/zsmalloc: avoid unnecessary iteration in get_pages_per_zspage()

2016-05-08 Thread Minchan Kim
On Fri, May 06, 2016 at 06:33:42PM +0900, Sergey Senozhatsky wrote:
> On (05/06/16 18:08), Sergey Senozhatsky wrote:
> [..]
> > and it's not 45 iterations that we are getting rid of, but around 31:
> > not every class reaches its ideal 100% ratio on the first iteration.
> > So, no, sorry, I don't think the patch really does what we want.
> 
> 
> to be clear, what I meant was:
> 
>   495 `cmp' + 15 `cmp je' IN
>   31 `mov cltd idiv mov sub imul cltd idiv cmp'   OUT
> 
> IN > OUT.
> 
> 
> CORRECTION here:
> 
> > * by the way, we don't even need `cltd' in those calculations. the
> > reason why gcc puts cltd is because ZS_MAX_PAGES_PER_ZSPAGE has the
> > 'wrong' data type. the patch to correct it is below (not a formal
> > patch).
> 
> no, we need cltd there. but ZS_MAX_PAGES_PER_ZSPAGE also affects
> ZS_MIN_ALLOC_SIZE, which is used in several places, like
> get_size_class_index(). that's why ZS_MAX_PAGES_PER_ZSPAGE data
> type change `improves' zs_malloc().

Why not, if such a simple change improves zsmalloc? :)
Please send a patch.

Thanks a lot, Sergey!


[PATCH 2/4] locking/rwsem: Drop superfluous waiter refcount

2016-05-08 Thread Davidlohr Bueso
Read waiters are currently reference counted from the time they enter
the slowpath until the lock is released and the waiter is awoken. This
is fragile and superfluous considering that everything occurs within
down_read() without returning to the caller, and the very nature of the
primitive does not suggest that the task can disappear from underneath us.
In addition, spurious wakeups can make the whole refcount useless as
get_task_struct() is only called when setting up the waiter.

Signed-off-by: Davidlohr Bueso 
---
 kernel/locking/rwsem-xadd.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 7d62772600cf..b592bb48d880 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -197,7 +197,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
smp_mb();
waiter->task = NULL;
wake_up_process(tsk);
-   put_task_struct(tsk);
} while (--loop);
 
sem->wait_list.next = next;
@@ -220,7 +219,6 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct 
rw_semaphore *sem)
/* set up my own style of waitqueue */
waiter.task = tsk;
waiter.type = RWSEM_WAITING_FOR_READ;
-   get_task_struct(tsk);
 
raw_spin_lock_irq(&sem->wait_lock);
if (list_empty(&sem->wait_list))
-- 
2.8.1



[PATCH 3/4] locking/rwsem: Enable lockless waiter wakeup(s)

2016-05-08 Thread Davidlohr Bueso
As wake_qs gain users, we can teach rwsems about them so that
waiters can be awoken without holding the wait_lock. This applies to both
readers and writers, the former being the ideal candidate
as we can batch the wakeups, shortening the critical region that
much more -- e.g. a writer task blocking a bunch of tasks waiting to
service page faults (mmap_sem readers).

In general applying wake_qs to rwsem (xadd) is not difficult as
the wait_lock is intended to be released soon _anyways_, with
the exception of when a writer slowpath will proactively wakeup
any queued readers if it sees that the lock is owned by a reader,
in which we simply do the wakeups with the lock held (see comment
in __rwsem_down_write_failed_common()).

As with other locking primitives, delaying the waiter's wakeup
does allow, at least in theory, the lock to be stolen in
the case of writers; however, no harm was seen from this (in fact
lock stealing tends to be a _good_ thing in most workloads), and
this is a tiny window anyway.

Some page-fault (pft) and mmap_sem intensive benchmarks show a
fairly consistent reduction in system time (by up to ~8% and ~10%) on a
2-socket, 12-core AMD box.

Signed-off-by: Davidlohr Bueso 
---
 kernel/locking/rwsem-xadd.c | 53 +++--
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index b592bb48d880..1b8c1285a2aa 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -114,12 +114,16 @@ enum rwsem_wake_type {
  *   - the 'active part' of count (&0x) reached 0 (but may have 
changed)
  *   - the 'waiting part' of count (&0x) is -ve (and will still be so)
  * - there must be someone on the queue
- * - the spinlock must be held by the caller
+ * - the wait_lock must be held by the caller
+ * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
+ *   to actually wakeup the blocked task(s), preferably when the wait_lock
+ *   is released
  * - woken process blocks are discarded from the list after having task zeroed
- * - writers are only woken if downgrading is false
+ * - writers are only marked woken if downgrading is false
  */
 static struct rw_semaphore *
-__rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
+__rwsem_mark_wake(struct rw_semaphore *sem,
+ enum rwsem_wake_type wake_type, struct wake_q_head *wake_q)
 {
struct rwsem_waiter *waiter;
struct task_struct *tsk;
@@ -129,12 +133,14 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
if (wake_type == RWSEM_WAKE_ANY)
-   /* Wake writer at the front of the queue, but do not
-* grant it the lock yet as we want other writers
-* to be able to steal it.  Readers, on the other hand,
-* will block as they will notice the queued writer.
+   /*
+* Mark the writer at the front of the queue for wakeup.
+* Until the task is actually awoken later by the caller,
+* other writers are able to steal the lock.  Readers, on
+* the other hand, will block as they will notice the
+* queued writer.
 */
-   wake_up_process(waiter->task);
+   wake_q_add(wake_q, waiter->task);
goto out;
}
 
@@ -196,12 +202,11 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
 */
smp_mb();
waiter->task = NULL;
-   wake_up_process(tsk);
+   wake_q_add(wake_q, tsk);
} while (--loop);
 
sem->wait_list.next = next;
next->prev = &sem->wait_list;
-
  out:
return sem;
 }
@@ -215,6 +220,7 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct 
rw_semaphore *sem)
long count, adjustment = -RWSEM_ACTIVE_READ_BIAS;
struct rwsem_waiter waiter;
struct task_struct *tsk = current;
+   WAKE_Q(wake_q);
 
/* set up my own style of waitqueue */
waiter.task = tsk;
@@ -236,9 +242,10 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct 
rw_semaphore *sem)
if (count == RWSEM_WAITING_BIAS ||
(count > RWSEM_WAITING_BIAS &&
 adjustment != -RWSEM_ACTIVE_READ_BIAS))
-   sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
+   sem = __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
raw_spin_unlock_irq(&sem->wait_lock);
+   wake_up_q(&wake_q);
 
/* wait to be given the lock */
while (true) {
@@ -470,9 +477,19 @@ 
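
For readers unfamiliar with wake_q, the bare pattern the series relies on
looks roughly like this (a generic sketch, not code from the patch):

	WAKE_Q(wake_q);				/* on-stack wake queue (4.6-era macro) */

	raw_spin_lock_irq(&sem->wait_lock);
	/* ... decide which waiters to wake while holding the lock ... */
	wake_q_add(&wake_q, waiter->task);	/* takes a task reference, no wakeup yet */
	raw_spin_unlock_irq(&sem->wait_lock);

	wake_up_q(&wake_q);			/* do the wakeups outside the critical section */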

[PATCH 1/4] locking/rwsem: Avoid stale ->count for rwsem_down_write_failed()

2016-05-08 Thread Davidlohr Bueso
The field is obviously updated without the lock and needs a READ_ONCE
while waiting for lock holder(s) to go away, just like we do with
all other ->count accesses.

Signed-off-by: Davidlohr Bueso 
---
 kernel/locking/rwsem-xadd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index df4dcb883b50..7d62772600cf 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -494,7 +494,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, 
int state)
}
schedule();
set_current_state(state);
-   } while ((count = sem->count) & RWSEM_ACTIVE_MASK);
+   } while ((count = READ_ONCE(sem->count)) & RWSEM_ACTIVE_MASK);
 
raw_spin_lock_irq(&sem->wait_lock);
}
-- 
2.8.1



[PATCH 4/4] locking/rwsem: Rework zeroing reader waiter->task

2016-05-08 Thread Davidlohr Bueso
Readers that are awoken will expect a nil ->task indicating
that a wakeup has occurred. There is a mismatch between the
smp_mb() and its documentation, in that the serialization is
done between reading the task and the nil store. Furthermore,
in addition to having the overlapping of loads and stores to
waiter->task guaranteed to be ordered within that CPU, both
wake_up_process() originally and now wake_q_add() already
imply barriers upon successful calls, which serves the comment.

Just do an atomic xchg() and simplify the whole thing. We can
use relaxed semantics, as mentioned before, in addition to the
barrier provided by wake_q_add(); as the wakeup is delayed anyway,
there is no risk of reordering with the actual wakeup.

Signed-off-by: Davidlohr Bueso 
---
 kernel/locking/rwsem-xadd.c | 17 +
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 1b8c1285a2aa..96e53cb4a4db 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -126,7 +126,6 @@ __rwsem_mark_wake(struct rw_semaphore *sem,
  enum rwsem_wake_type wake_type, struct wake_q_head *wake_q)
 {
struct rwsem_waiter *waiter;
-   struct task_struct *tsk;
struct list_head *next;
long oldcount, woken, loop, adjustment;
 
@@ -190,24 +189,18 @@ __rwsem_mark_wake(struct rw_semaphore *sem,
next = sem->wait_list.next;
loop = woken;
do {
+   struct task_struct *tsk;
+
waiter = list_entry(next, struct rwsem_waiter, list);
next = waiter->list.next;
-   tsk = waiter->task;
-   /*
-* Make sure we do not wakeup the next reader before
-* setting the nil condition to grant the next reader;
-* otherwise we could miss the wakeup on the other
-* side and end up sleeping again. See the pairing
-* in rwsem_down_read_failed().
-*/
-   smp_mb();
-   waiter->task = NULL;
+
+   tsk = xchg_relaxed(&waiter->task, NULL);
wake_q_add(wake_q, tsk);
} while (--loop);
 
sem->wait_list.next = next;
next->prev = &sem->wait_list;
- out:
+out:
return sem;
 }
 
-- 
2.8.1



[PATCH -tip 0/4] locking/rwsem (xadd): Reader waiter optimizations

2016-05-08 Thread Davidlohr Bueso
Hi,

This is a follow up series while reviewing Waiman's reader-owned
state work[1]. While I have based it on -tip instead of that change,
I can certainly rebase the series in some future iteration.

Changes are mainly around reader-waiter optimizations, in no particular
order. The series has passed numerous DB benchmarks without things falling
apart, on an 8-core Westmere doing page allocations (page_test) in aim9:

aim9
                                   4.6-rc6              4.6-rc6-rwsemv2
Min  page_test   378167.89 (  0.00%)   382613.33 (  1.18%)
Min  exec_test  499.00 (  0.00%)  502.67 (  0.74%)
Min  fork_test 3395.47 (  0.00%) 3537.64 (  4.19%)
Hmeanpage_test   395433.06 (  0.00%)   414693.68 (  4.87%)
Hmeanexec_test  499.67 (  0.00%)  505.30 (  1.13%)
Hmeanfork_test 3504.22 (  0.00%) 3594.95 (  2.59%)
Stddev   page_test17426.57 (  0.00%)26649.92 (-52.93%)
Stddev   exec_test0.47 (  0.00%)1.41 (-199.05%)
Stddev   fork_test   63.74 (  0.00%)   32.59 ( 48.86%)
Max  page_test   429873.33 (  0.00%)   456960.00 (  6.30%)
Max  exec_test  500.33 (  0.00%)  507.66 (  1.47%)
Max  fork_test 3653.33 (  0.00%) 3650.90 ( -0.07%)

                  4.6-rc6     4.6-rc6-rwsemv2
User                 1.12                0.04
System               0.23                0.04
Elapsed            727.27              721.98

[1] http://permalink.gmane.org/gmane.linux.kernel/2216743

Thanks!

Davidlohr Bueso (4):
  locking/rwsem: Avoid stale ->count for rwsem_down_write_failed()
  locking/rwsem: Drop superfluous waiter refcount
  locking/rwsem: Enable lockless waiter wakeup(s)
  locking/rwsem: Rework zeroing reader waiter->task

 kernel/locking/rwsem-xadd.c | 74 ++---
 1 file changed, 43 insertions(+), 31 deletions(-)

-- 
2.8.1



[GIT] Networking

2016-05-08 Thread David Miller

1) Check klogctl failure correctly, from Colin Ian King.

2) Prevent OOM when under memory pressure in flowcache, from Steffen
   Klassert.

3) Fix info leak in llc and rtnetlink ifmap code, from Kangjie Lu.

4) Memory barrier and multicast handling fixes in bnxt_en, from
   Michael Chan.

5) Endianness bug in mlx5, from Daniel Jurgens.

6) Fix disconnect handling in VSOCK, from Ian Campbell.

7) Fix locking of netdev list walking in get_bridge_ifindices(), from
   Nikolay Aleksandrov.

8) Bridge multicast MLD parser can look at wrong packet offsets, fix
   from Linus Lüssing.

9) Fix chip hang in qede driver, from Sudarsana Reddy Kalluru.

10) Fix missing setting of encapsulation before inner handling
completes in udp_offload code, from Jarno Rajahalme.

11) Missing rollbacks during LAG join and flood configuration failures
in mlxsw driver, from Ido Schimmel.

12) Fix error code checks in netxen driver, from Dan Carpenter.

13) Fix key size in new macsec driver, from Sabrina Dubroca.

14) Fix mlx5/VXLAN dependencies, from Arnd Bergmann.

Please pull, thanks a lot!

The following changes since commit 7391daf2ffc780679d6ab3fad1db2619e5dd2c2a:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2016-05-03 
15:07:50 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to 8846a125de97f96be64ca234906eedfd26ad778e:

  Merge branch 'mlx5-build-fix' (2016-05-09 00:21:13 -0400)


Arnd Bergmann (2):
  Revert "net/mlx5: Kconfig: Fix MLX5_EN/VXLAN build issue"
  net/mlx5e: make VXLAN support conditional

Colin Ian King (1):
  tools: bpf_jit_disasm: check for klogctl failure

Dan Carpenter (4):
  netxen: fix error handling in netxen_get_flash_block()
  netxen: reversed condition in netxen_nic_set_link_parameters()
  netxen: netxen_rom_fast_read() doesn't return -1
  qede: uninitialized variable in qede_start_xmit()

Daniel Jurgens (1):
  net/mlx4_en: Fix endianness bug in IPV6 csum calculation

David Ahern (1):
  net: ipv6: tcp reset, icmp need to consider L3 domain

David S. Miller (3):
  Merge branch 'master' of git://git.kernel.org/.../klassert/ipsec
  Merge branch 'bnxt_en-fixes'
  Merge branch 'mlx5-build-fix'

Eric Dumazet (1):
  macvtap: segmented packet is consumed

Ian Campbell (1):
  VSOCK: do not disconnect socket when peer has shutdown SEND only

Ido Schimmel (2):
  mlxsw: spectrum: Fix rollback order in LAG join failure
  mlxsw: spectrum: Add missing rollback in flood configuration

Jarno Rajahalme (2):
  udp_tunnel: Remove redundant udp_tunnel_gro_complete().
  udp_offload: Set encapsulation before inner completes.

Kangjie Lu (2):
  net: fix infoleak in llc
  net: fix infoleak in rtnetlink

Linus Lüssing (1):
  bridge: fix igmp / mld query parsing

Matthias Brugger (1):
  drivers: net: xgene: Fix error handling

Michael Chan (2):
  bnxt_en: Need memory barrier when processing the completion ring.
  bnxt_en: Setup multicast properly after resetting device.

Nikolay Aleksandrov (1):
  net: bridge: fix old ioctl unlocked net device walk

Sabrina Dubroca (1):
  macsec: key identifier is 128 bits, not 64

Shmulik Ladkani (1):
  Documentation/networking: more accurate LCO explanation

Steffen Klassert (3):
  flowcache: Avoid OOM condition under preasure
  xfrm: Reset encapsulation field of the skb before transformation
  vti: Add pmtu handling to vti_xmit.

Sudarsana Reddy Kalluru (1):
  qede: prevent chip hang when increasing channels

Uwe Kleine-König (1):
  net: fec: only clear a queue's work bit if the queue was emptied

 Documentation/networking/checksum-offloads.txt   | 14 +++---
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c |  7 ---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c| 23 
+++
 drivers/net/ethernet/freescale/fec_main.c| 10 --
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig  |  8 +++-
 drivers/net/ethernet/mellanox/mlx5/core/Makefile |  3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en.h |  2 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c|  4 
 drivers/net/ethernet/mellanox/mlx5/core/vxlan.h  | 11 +--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c   |  4 ++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c |  8 
 drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c   | 14 +-
 drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c |  3 ++-
 drivers/net/ethernet/qlogic/qede/qede_main.c |  8 +++-
 drivers/net/geneve.c |  5 +++--
 drivers/net/macsec.c | 

Re: [PATCH 2/2] net: Use ns_capable_noaudit() when determining net sysctl permissions

2016-05-08 Thread Serge Hallyn
Quoting Tyler Hicks (tyhi...@canonical.com):
> The capability check should not be audited since it is only being used
> to determine the inode permissions. A failed check does not indicate a
> violation of security policy but, when an LSM is enabled, a denial audit
> message was being generated.
> 
> The denial audit message caused confusion for some application authors
> because root-running Go applications always triggered the denial. To
> prevent this confusion, the capability check in net_ctl_permissions() is
> switched to the noaudit variant.
> 
> BugLink: https://launchpad.net/bugs/1465724
> 
> Signed-off-by: Tyler Hicks 

Acked-by: Serge E. Hallyn 

> ---
>  net/sysctl_net.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sysctl_net.c b/net/sysctl_net.c
> index ed98c1f..46a71c7 100644
> --- a/net/sysctl_net.c
> +++ b/net/sysctl_net.c
> @@ -46,7 +46,7 @@ static int net_ctl_permissions(struct ctl_table_header 
> *head,
>   kgid_t root_gid = make_kgid(net->user_ns, 0);
>  
>   /* Allow network administrator to have same access as root. */
> - if (ns_capable(net->user_ns, CAP_NET_ADMIN) ||
> + if (ns_capable_noaudit(net->user_ns, CAP_NET_ADMIN) ||
>   uid_eq(root_uid, current_euid())) {
>   int mode = (table->mode >> 6) & 7;
>   return (mode << 6) | (mode << 3) | mode;
> -- 
> 2.7.4
> 


Re: [PATCH 1/2] kernel: Add noaudit variant of ns_capable()

2016-05-08 Thread Serge Hallyn
Quoting Tyler Hicks (tyhi...@canonical.com):
> When checking the current cred for a capability in a specific user
> namespace, it isn't always desirable to have the LSMs audit the check.
> This patch adds a noaudit variant of ns_capable() for when those
> situations arise.
> 
> The common logic between ns_capable() and the new ns_capable_noaudit()
> is moved into a single, shared function to keep duplicated code to a
> minimum and ease maintainability.
> 
> Signed-off-by: Tyler Hicks 

Acked-by: Serge E. Hallyn 

> ---
>  include/linux/capability.h |  5 +
>  kernel/capability.c| 46 
> --
>  2 files changed, 41 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index 00690ff..5f3c63d 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -206,6 +206,7 @@ extern bool has_ns_capability_noaudit(struct task_struct 
> *t,
> struct user_namespace *ns, int cap);
>  extern bool capable(int cap);
>  extern bool ns_capable(struct user_namespace *ns, int cap);
> +extern bool ns_capable_noaudit(struct user_namespace *ns, int cap);
>  #else
>  static inline bool has_capability(struct task_struct *t, int cap)
>  {
> @@ -233,6 +234,10 @@ static inline bool ns_capable(struct user_namespace *ns, 
> int cap)
>  {
>   return true;
>  }
> +static inline bool ns_capable_noaudit(struct user_namespace *ns, int cap)
> +{
> + return true;
> +}
>  #endif /* CONFIG_MULTIUSER */
>  extern bool capable_wrt_inode_uidgid(const struct inode *inode, int cap);
>  extern bool file_ns_capable(const struct file *file, struct user_namespace 
> *ns, int cap);
> diff --git a/kernel/capability.c b/kernel/capability.c
> index 45432b5..00411c8 100644
> --- a/kernel/capability.c
> +++ b/kernel/capability.c
> @@ -361,6 +361,24 @@ bool has_capability_noaudit(struct task_struct *t, int 
> cap)
>   return has_ns_capability_noaudit(t, &init_user_ns, cap);
>  }
>  
> +static bool ns_capable_common(struct user_namespace *ns, int cap, bool audit)
> +{
> + int capable;
> +
> + if (unlikely(!cap_valid(cap))) {
> + pr_crit("capable() called with invalid cap=%u\n", cap);
> + BUG();
> + }
> +
> + capable = audit ? security_capable(current_cred(), ns, cap) :
> +   security_capable_noaudit(current_cred(), ns, cap);
> + if (capable == 0) {
> + current->flags |= PF_SUPERPRIV;
> + return true;
> + }
> + return false;
> +}
> +
>  /**
>   * ns_capable - Determine if the current task has a superior capability in 
> effect
>   * @ns:  The usernamespace we want the capability in
> @@ -374,19 +392,27 @@ bool has_capability_noaudit(struct task_struct *t, int 
> cap)
>   */
>  bool ns_capable(struct user_namespace *ns, int cap)
>  {
> - if (unlikely(!cap_valid(cap))) {
> - pr_crit("capable() called with invalid cap=%u\n", cap);
> - BUG();
> - }
> -
> - if (security_capable(current_cred(), ns, cap) == 0) {
> - current->flags |= PF_SUPERPRIV;
> - return true;
> - }
> - return false;
> + return ns_capable_common(ns, cap, true);
>  }
>  EXPORT_SYMBOL(ns_capable);
>  
> +/**
> + * ns_capable_noaudit - Determine if the current task has a superior 
> capability
> + * (unaudited) in effect
> + * @ns:  The usernamespace we want the capability in
> + * @cap: The capability to be tested for
> + *
> + * Return true if the current task has the given superior capability 
> currently
> + * available for use, false if not.
> + *
> + * This sets PF_SUPERPRIV on the task if the capability is available on the
> + * assumption that it's about to be used.
> + */
> +bool ns_capable_noaudit(struct user_namespace *ns, int cap)
> +{
> + return ns_capable_common(ns, cap, false);
> +}
> +EXPORT_SYMBOL(ns_capable_noaudit);
>  
>  /**
>   * capable - Determine if the current task has a superior capability in 
> effect
> -- 
> 2.7.4
> 



Re: [PATCH] Use pid_t instead of int

2016-05-08 Thread René Nyffenegger
Elsewhere in the headers, pid_t is just a typedef for int.

Rene

On 09.05.2016 03:25, Andy Lutomirski wrote:
> On Sun, May 8, 2016 at 12:38 PM, René Nyffenegger
>  wrote:
>> Use pid_t instead of int in the declarations of sys_kill, sys_tgkill,
>> sys_tkill and sys_rt_sigqueueinfo in include/linux/syscalls.h
> 
> The description is no good.  *Why* are you changing it?
> 
> I checked tgkill and, indeed, tgkill takes pid_t parameters, so this
> fixes an incorrect declaration.  I'm wondering why the code compiles
> without warning.  Is SYSCALL_DEFINE too lenient for some reason?  Or
> is pid_t just defined as int.
> 
> --Andy
> 
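
A quick way to convince oneself (a userspace sketch; the kernel's
__kernel_pid_t is likewise a plain int in the generic uapi headers):

#include <sys/types.h>

/* On Linux, pid_t resolves to int, so "int" and "pid_t" parameters are
 * type-identical after preprocessing and the old declarations never warned. */
_Static_assert(_Generic((pid_t)0, int: 1, default: 0),
	       "pid_t is plain int on this platform");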



Re: sched: tweak select_idle_sibling to look for idle threads

2016-05-08 Thread Yuyang Du
On Mon, May 09, 2016 at 05:52:51AM +0200, Mike Galbraith wrote:
> On Mon, 2016-05-09 at 02:57 +0800, Yuyang Du wrote:
> 
> > In addition, I would argue maybe beefing up idle balancing is a more
> > productive way to spread load, as work-stealing just does what needs
> > to be done. And seems it has been (sub-unconsciously) neglected in this
> > case, :)
> 
> P.S. Nope, I'm dinging up multiple spots ;-)

You bet, :)


Re: [RFC PATCH v2 07/10] efi: load SSDTs from EFI variables

2016-05-08 Thread Jon Masters
Hi Octavian,

Apologies for missing this earlier, just catching up on this thread...

On 04/19/2016 06:39 PM, Octavian Purdila wrote:

> This patch allows SSDTs to be loaded from EFI variables. It works by
> specifying the EFI variable name containing the SSDT to be loaded. All
> variables with the same name (regardless of the vendor GUID) will be
> loaded.

This sounds very useful during development. Using EFI variables isn't so
terrible either, but I am concerned that this should be standardized
through the ASWG and at least involve certain other OS vendors, so that
the variable (GUID) can be captured somewhere. If not in the spec itself,
then it should be captured as an external ACPI resource on the UEFI
website with a clear pointer to the exact IDs to be used.

Can you confirm that's the intention? i.e. that you're allowing a
command line option for specifying the ID now because you intend to go
ensure that there is a standard one that everyone will use later?

I should check (but maybe you know) if the kernel is automatically
tainted by this codepath as well?

Thanks,

Jon.

-- 
Computer Architect | Sent from my Fedora powered laptop



[PATCH] sched/rt/deadline: Don't push if task's scheduling class was changed

2016-05-08 Thread Xunlei Pang
We got a warning below:
WARNING: CPU: 1 PID: 2468 at kernel/sched/core.c:1161 
set_task_cpu+0x1af/0x1c0
CPU: 1 PID: 2468 Comm: bugon Not tainted 4.6.0-rc3+ #16
Hardware name: Intel Corporation Broadwell Client
0086 89618374 8800897a7d50 8133dc8c
  8800897a7d90 81089921
048981037f39 88016c4315c0 88016ecd6e40 
Call Trace:
[] dump_stack+0x63/0x87
[] __warn+0xd1/0xf0
[] warn_slowpath_null+0x1d/0x20
[] set_task_cpu+0x1af/0x1c0
[] push_dl_task.part.34+0xea/0x180
[] push_dl_tasks+0x17/0x30
[] __balance_callback+0x45/0x5c
[] __sched_setscheduler+0x906/0xb90
[] SyS_sched_setattr+0x150/0x190
[] do_syscall_64+0x62/0x110
[] entry_SYSCALL64_slow_path+0x25/0x25

The corresponding warning triggering code:
WARN_ON_ONCE(p->state == TASK_RUNNING &&
 p->sched_class == &fair_sched_class &&
 (p->on_rq && !task_on_rq_migrating(p)))

This is because in find_lock_later_rq(), the task whose scheduling
class was changed to fair class is still pushed away as deadline.

So, check in find_lock_later_rq() after double_lock_balance(), if the
scheduling class of the deadline task was changed, break and retry.
Apply the same logic to RT.

Signed-off-by: Xunlei Pang 
---
 kernel/sched/deadline.c | 1 +
 kernel/sched/rt.c   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 169d40d..57eb3e4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1385,6 +1385,7 @@ static struct rq *find_lock_later_rq(struct task_struct 
*task, struct rq *rq)
 !cpumask_test_cpu(later_rq->cpu,
    &task->cpus_allowed) ||
 task_running(rq, task) ||
+!dl_task(task) ||
 !task_on_rq_queued(task))) {
double_unlock_balance(rq, later_rq);
later_rq = NULL;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ecfc83d..c10a6f5 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1720,6 +1720,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct 
*task, struct rq *rq)
 !cpumask_test_cpu(lowest_rq->cpu,
   tsk_cpus_allowed(task)) 
||
 task_running(rq, task) ||
+!rt_task(task) ||
 !task_on_rq_queued(task))) {
 
double_unlock_balance(rq, lowest_rq);
-- 
1.8.3.1



Re: sched: tweak select_idle_sibling to look for idle threads

2016-05-08 Thread Yuyang Du
On Mon, May 09, 2016 at 05:45:40AM +0200, Mike Galbraith wrote:
> On Mon, 2016-05-09 at 02:57 +0800, Yuyang Du wrote:
> > On Sun, May 08, 2016 at 10:08:55AM +0200, Mike Galbraith wrote:
> > > > Maybe give the criteria a bit margin, not just wakees tend to equal 
> > > > llc_size,
> > > > but the numbers are so wild to easily break the fragile condition, like:
> > > 
> > > Seems lockless traversal and averages just lets multiple CPUs select
> > > the same spot.  An atomic reservation (feature) when looking for an
> > > idle spot (also for fork) might fix it up.  Run the thing as RT,
> > > push/pull ensures that it reaches box saturation regardless of the
> > > number of messaging threads, whereas with fair class, any number > 1
> > > will certainly stack tasks before the box is saturated.
> > 
> > Yes, good idea, bringing order to the race to grab idle CPU is absolutely
> > helpful.
> 
> Well, good ideas work, as yet this one helps jack diddly spit.

Then a valid question, which should always be asked, is whether it is this
selection that is screwed up in a case like this.
 
> > In addition, I would argue maybe beefing up idle balancing is a more
> > productive way to spread load, as work-stealing just does what needs
> > to be done. And seems it has been (sub-unconsciously) neglected in this
> > case, :)
> > 
> > Regarding wake_wide(), it seems the M:N is 1:24, not 6:6*24, if so,
> > the slave will be 0 forever (as last_wakee is never flipped).
> 
> Yeah, it's irrelevant here, this load is all about instantaneous state.
>  I could use a bit more of that, reserving on the wakeup side won't
> help this benchmark until everything else cares.  One stack, and it's
> game over.  It could help generic utilization and latency some.. but it
> seems kinda unlikely it'll be worth the cycle expenditure.

Yes and no, it depends on how efficient work-stealing is compared to
selection, but remember, at the end of the day, the wakee CPU measures the
latency; that CPU does not care whether it was selected or whether it stole.
 
> > Basically whenever a waker has more than 1 wakee, the wakee_flips
> > will comfortably grow very large (with last_wakee alternating),
> > whereas when a waker has 0 or 1 wakee, the wakee_flips will just be 0.
> 
> Yup, it is a heuristic, and like all of those, imperfect.  I've watched
> it improving utilization in the wild though, so won't mind that until I
> catch it doing really bad things.
 
> > So recording only the last_wakee seems not right unless you have other
> > good reason. If not the latter, counting waking wakee times should be
> > better, and then allow the statistics to happily play.

Hmm... should we try removing the recording of last_wakee?


Re: [PATCH V2 2/2] irqchip/gicv3-its: Implement two-level(indirect) device table support

2016-05-08 Thread Shanker Donthineni


On 05/08/2016 09:14 PM, Shanker Donthineni wrote:
> Since device IDs are extremely sparse, the single, a.k.a. flat, table is
> not sufficient for the following two reasons.
>
> 1) According to the ARM-GIC spec, ITS hardware can access a maximum of
>256 (pages) * 64K (page size) bytes. In the best case, that supports a
>sparse DEVid range of up to 21 bits with the minimum device table
>entry size of 8 bytes.
>
> 2) The maximum memory size that is possible without memblock depends on
>MAX_ORDER: 4MB on a 4K page size kernel with the default MAX_ORDER,
>which supports a DEVid range of 19 bits.
>
> The two-level device table feature brings us two advantages: the first
> is a very high possibility of supporting a sparse DEVid range of up to
> 32 bits, and the second one is the best utilization of memory allocation.
>
> The feature is enabled automatically during driver probe if a single
> ITS page is not adequate for flat table and the hardware is capable
> of two-level table walk.
>
> Signed-off-by: Shanker Donthineni 
> ---
>
> This patch is based on Marc Zyngier's branch 
> https://git.kernel.org/cgit/linux/kernel/git/maz/arm-platforms.git/log/?h=irq/irqchip-4.7
>
> I have tested the Indirection feature on Qualcomm Technologies QDF2XXX server 
> platform.
>
> Changes since v1:
>   Most of this patch has been rewritten after refactoring its_alloc_tables().
>   Always enable device two-level if the memory requirement is more than 
> PAGE_SIZE.
>   Fixed the coding bug that breaks on the BE machine.
>   Edited the commit text.
>
>  drivers/irqchip/irq-gic-v3-its.c | 100 
> ---
>  1 file changed, 83 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3-its.c 
> b/drivers/irqchip/irq-gic-v3-its.c
> index b23e00c..27be792 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -938,6 +938,18 @@ retry_baser:
>   return 0;
>  }
>  
> +/**
> + * Find out whether an implemented baser register supports a single, flat 
> table
> + * or a two-level table by reading bit offset at '62' after writing '1' to 
> it.
> + */
> +static u64 its_baser_check_indirect(struct its_baser *baser)
> +{
> + u64 val = GITS_BASER_InnerShareable | GITS_BASER_WaWb;
> +
> + writeq_relaxed(val | GITS_BASER_INDIRECT, baser->hwreg);
> + return (readq_relaxed(baser->hwreg) & GITS_BASER_INDIRECT);
> +}
> +
>  static int its_alloc_tables(const char *node_name, struct its_node *its)
>  {
>   u64 typer = readq_relaxed(its->base + GITS_TYPER);
> @@ -964,6 +976,7 @@ static int its_alloc_tables(const char *node_name, struct 
> its_node *its)
>   u64 entry_size = GITS_BASER_ENTRY_SIZE(val);
>   int order = get_order(psz);
>   struct its_baser *baser = its->tables + i;
> + u64 indirect = 0;
>  
>   if (type == GITS_BASER_TYPE_NONE)
>   continue;
> @@ -977,17 +990,27 @@ static int its_alloc_tables(const char *node_name, 
> struct its_node *its)
>* Allocate as many entries as required to fit the
>* range of device IDs that the ITS can grok... The ID
>* space being incredibly sparse, this results in a
> -  * massive waste of memory.
> +  * massive waste of memory if two-level device table
> +  * feature is not supported by hardware.
>*
>* For other tables, only allocate a single page.
>*/
>   if (type == GITS_BASER_TYPE_DEVICE) {
> - /*
> -  * 'order' was initialized earlier to the default page
> -  * granule of the the ITS.  We can't have an allocation
> -  * smaller than that.  If the requested allocation
> -  * is smaller, round up to the default page granule.
> -  */
> + if ((entry_size << ids) > psz)
> + indirect = its_baser_check_indirect(baser);
> +
> + if (indirect) {
> + /*
> +  * The size of the lvl2 table is equal to ITS
> +  * page size which is 'psz'. For computing lvl1
> +  * table size, subtract ID bits that sparse
> +  * lvl2 table from 'ids' which is reported by
> +  * ITS hardware times lvl1 table entry size.
> +  */
> + ids -= ilog2(psz / entry_size);
> + entry_size = GITS_LVL1_ENTRY_SIZE;
> + }
> +
>   order = max(get_order(entry_size << ids), order);
>   if (order >= MAX_ORDER) {
>   order = MAX_ORDER - 1;
> @@ -997,7 +1020,7 @@ static int its_alloc_tables(const char *node_name, 
> struct its_node *its)
>  

[PATCH -tip] sched/wake_q: fix typo in wake_q_add

2016-05-08 Thread Davidlohr Bueso
... the comment clearly refers to wake_up_q, and not
wake_up_list.

Signed-off-by: Davidlohr Bueso 
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c82ca6eccfec..c59e4df38591 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -400,7 +400,7 @@ void wake_q_add(struct wake_q_head *head, struct 
task_struct *task)
 * wakeup due to that.
 *
 * This cmpxchg() implies a full barrier, which pairs with the write
-* barrier implied by the wakeup in wake_up_list().
+* barrier implied by the wakeup in wake_up_q().
 */
if (cmpxchg(>next, NULL, WAKE_Q_TAIL))
return;
-- 
2.8.1



Re: [PATCH] tools: bpf_jit_disasm: check for klogctl failure

2016-05-08 Thread David Miller
From: Daniel Borkmann 
Date: Fri, 06 May 2016 00:46:56 +0200

> On 05/06/2016 12:39 AM, Colin King wrote:
>> From: Colin Ian King 
>>
> klogctl can fail and return a negative len, so check for this and
>> return NULL to avoid passing a (size_t)-1 to malloc.
>>
>> Signed-off-by: Colin Ian King 
> 
> [ would be nice to get Cc'ed in future ... ]
> 
> Acked-by: Daniel Borkmann 

Applied.
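
For illustration, a minimal sketch of the guard being discussed (not the
actual tools/net/bpf_jit_disasm.c code; the helper name is made up):

    #include <stdlib.h>
    #include <sys/klog.h>

    static char *read_kernel_log(void)
    {
            /* 10 == SYSLOG_ACTION_SIZE_BUFFER, 3 == SYSLOG_ACTION_READ_ALL */
            int len = klogctl(10, NULL, 0);
            char *buf;

            if (len < 0)
                    return NULL;    /* avoid malloc((size_t)-1) on failure */

            buf = malloc(len + 1);
            if (!buf)
                    return NULL;

            len = klogctl(3, buf, len);
            if (len < 0) {
                    free(buf);
                    return NULL;
            }
            buf[len] = '\0';
            return buf;
    }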


Re: [PATCH 0/2] Quiet noisy LSM denial when accessing net sysctl

2016-05-08 Thread David Miller
From: Tyler Hicks 
Date: Fri,  6 May 2016 18:04:12 -0500

> This pair of patches does away with what I believe is a useless denial
> audit message when a privileged process initially accesses a net sysctl.

The LSM folks can apply this if they agree with you.


Re: [PATCH v2] net: arc/emac: Move arc_emac_tx_clean() into arc_emac_tx() and disable tx interrut

2016-05-08 Thread David Miller
From: Caesar Wang 
Date: Fri,  6 May 2016 20:19:16 +0800

> Doing tx_clean() inside poll() may scramble the tx ring buffer if
> tx() is running. This will cause tx to stop working, which can be
> reproduced by simultaneously downloading two large files at high speed.
> 
> Moving tx_clean() into tx() will prevent this. And tx interrupt is no
> longer needed now.

TX completion work is always recommended to be done in the ->poll()
handler.

Fix the race or whatever bug there is rather than working around it,
and regressing the driver, by handling TX completion in the interrupt
handler.

Thanks.


Re: [PATCH v4 1/2] soc: qcom: smd: Introduce compile stubs

2016-05-08 Thread David Miller
From: Bjorn Andersson 
Date: Fri,  6 May 2016 07:09:07 -0700

> Introduce compile stubs for the SMD API, allowing consumers to be
> compile tested.
> 
> Acked-by: Andy Gross 
> Signed-off-by: Bjorn Andersson 

Applied.


Re: [PATCH v4 2/2] net: Add Qualcomm IPC router

2016-05-08 Thread David Miller
From: Bjorn Andersson 
Date: Fri,  6 May 2016 07:09:08 -0700

> From: Courtney Cavin 
> 
> Add an implementation of Qualcomm's IPC router protocol, used to
> communicate with service providing remote processors.
> 
> Signed-off-by: Courtney Cavin 
> Signed-off-by: Bjorn Andersson 
> [bjorn: Cope with 0 being a valid node id and implement RTM_NEWADDR]
> Signed-off-by: Bjorn Andersson 

Applied.


Re: sched: tweak select_idle_sibling to look for idle threads

2016-05-08 Thread Mike Galbraith
On Mon, 2016-05-09 at 02:57 +0800, Yuyang Du wrote:

> In addition, I would argue maybe beefing up idle balancing is a more
> productive way to spread load, as work-stealing just does what needs
> to be done. And seems it has been (sub-unconsciously) neglected in this
> case, :)

P.S. Nope, I'm dinging up multiple spots ;-)



Re: sched: tweak select_idle_sibling to look for idle threads

2016-05-08 Thread Mike Galbraith
On Mon, 2016-05-09 at 02:57 +0800, Yuyang Du wrote:
> On Sun, May 08, 2016 at 10:08:55AM +0200, Mike Galbraith wrote:
> > > Maybe give the criteria a bit margin, not just wakees tend to equal 
> > > llc_size,
> > > but the numbers are so wild to easily break the fragile condition, like:
> > 
> > Seems lockless traversal and averages just lets multiple CPUs select
> > the same spot.  An atomic reservation (feature) when looking for an
> > idle spot (also for fork) might fix it up.  Run the thing as RT,
> > push/pull ensures that it reaches box saturation regardless of the
> > number of messaging threads, whereas with fair class, any number > 1
> > will certainly stack tasks before the box is saturated.
> 
> Yes, good idea, bringing order to the race to grab idle CPU is absolutely
> helpful.

Well, good ideas work, as yet this one helps jack diddly spit.

> In addition, I would argue maybe beefing up idle balancing is a more
> productive way to spread load, as work-stealing just does what needs
> to be done. And seems it has been (sub-unconsciously) neglected in this
> case, :)
> 
> Regarding wake_wide(), it seems the M:N is 1:24, not 6:6*24, if so,
> the slave will be 0 forever (as last_wakee is never flipped).

Yeah, it's irrelevant here, this load is all about instantaneous state.
 I could use a bit more of that, reserving on the wakeup side won't
help this benchmark until everything else cares.  One stack, and it's
game over.  It could help generic utilization and latency some.. but it
seems kinda unlikely it'll be worth the cycle expenditure.

> Basically whenever a waker has more than 1 wakee, the wakee_flips
> will comfortably grow very large (with last_wakee alternating),
> whereas when a waker has 0 or 1 wakee, the wakee_flips will just be 0.

Yup, it is a heuristic, and like all of those, imperfect.  I've watched
it improving utilization in the wild though, so won't mind that until I
catch it doing really bad things.

> So recording only the last_wakee seems not right unless you have other
> good reason. If not the latter, counting waking wakee times should be
> better, and then allow the statistics to happily play.




Re: [patch] qede: uninitialized variable in qede_start_xmit()

2016-05-08 Thread David Miller
From: Dan Carpenter 
Date: Thu, 5 May 2016 16:21:30 +0300

> "data_split" was never set to false.  It's just uninitialized.
> 
> Fixes: 2950219d87b0 ('qede: Add basic network device support')
> Signed-off-by: Dan Carpenter 

Applied, thanks Dan.


RE: [PATCH] debugobjects: insulate non-fixup logic related to static obj from fixup callbacks

2016-05-08 Thread Du, Changbin
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> On Sun, 8 May 2016, Du, Changbin wrote:
> > > From: Thomas Gleixner [mailto:t...@linutronix.de]
> > > > raw_spin_unlock_irqrestore(&db->lock, flags);
> > > > /*
> > > > -* Maybe the object is static.  Let the type specific
> > > > +* Maybe the object is static. Let the type specific
> > > >  * code decide what to do.
> > >
> > > Instead of doing white space changes you really want to explain the logic
> > > here.
> > >
> > The comments are in the following code.
> 
> Well. It's a comment, but the code you replace has better explanations about
> statically initialized objects. This should move here.
> 
> Thanks,
> 
>   tglx

Ok, let me improve the comment for patch v2.

Best Regards,
Du, Changbin



Re: [PATCH] mmc: mmc: do not use CMD13 to get status after speed mode switch

2016-05-08 Thread Shawn Lin

+ linux-rockchip


I just hacked my local branch to fix the issues found on the Rockchip
platform. The reason is that the mmc core fails to get status
after switching from hs200 to hs, so I disabled sending status for it,
just like what Chaotian does here. I didn't dig deeply into the root
cause, but I agree with Chaotian's opinion.

FYI:
My eMMC device is KLMA62WEPD-B031.

[1.526008] sdhci: Secure Digital Host Controller Interface driver
[1.526558] sdhci: Copyright(c) Pierre Ossman
[1.527899] sdhci-pltfm: SDHCI platform and OF driver helper
[1.529967] sdhci-arasan fe33.sdhci: No vmmc regulator found
[1.530501] sdhci-arasan fe33.sdhci: No vqmmc regulator found
[1.568710] mmc0: SDHCI controller on fe33.sdhci [fe33.sdhci] 
using ADMA

[1.627552] mmc0: switch to high-speed from hs200 failed, err:-84
[1.628108] mmc0: error -84 whilst initialising MMC card


On 2016/5/4 14:54, Chaotian Jing wrote:

Per the JEDEC spec, it is not recommended to use CMD13 to get card status
after a speed mode switch. Below are two reasons for this:
1. CMD13 cannot be guaranteed due to the asynchronous operation.
Therefore it is not recommended to use CMD13 to check busy completion
of the timing change indication.
2. After switching to HS200, CMD13 will get a response of 0x800, and even
after the busy signal gets de-asserted, the response of CMD13 is still 0x800.

This patch drops CMD13 when doing the speed mode switch. If the host does
not support MMC_CAP_WAIT_WHILE_BUSY and there is no ops->card_busy(),
then the only way is to wait a fixed timeout.

Signed-off-by: Chaotian Jing 
---
 drivers/mmc/core/mmc.c |   82 
 drivers/mmc/core/mmc_ops.c |   25 +-
 2 files changed, 45 insertions(+), 62 deletions(-)

diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
index 4dbe3df..03ee7a4 100644
--- a/drivers/mmc/core/mmc.c
+++ b/drivers/mmc/core/mmc.c
@@ -962,7 +962,7 @@ static int mmc_select_hs(struct mmc_card *card)
err = __mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
   EXT_CSD_HS_TIMING, EXT_CSD_TIMING_HS,
   card->ext_csd.generic_cmd6_time,
-  true, true, true);
+  true, false, true);
if (!err)
mmc_set_timing(card->host, MMC_TIMING_MMC_HS);

@@ -1056,7 +1056,6 @@ static int mmc_switch_status(struct mmc_card *card)
 static int mmc_select_hs400(struct mmc_card *card)
 {
struct mmc_host *host = card->host;
-   bool send_status = true;
unsigned int max_dtr;
int err = 0;
u8 val;
@@ -1068,9 +1067,6 @@ static int mmc_select_hs400(struct mmc_card *card)
  host->ios.bus_width == MMC_BUS_WIDTH_8))
return 0;

-   if (host->caps & MMC_CAP_WAIT_WHILE_BUSY)
-   send_status = false;
-
/* Reduce frequency to HS frequency */
max_dtr = card->ext_csd.hs_max_dtr;
mmc_set_clock(host, max_dtr);
@@ -1080,7 +1076,7 @@ static int mmc_select_hs400(struct mmc_card *card)
err = __mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
   EXT_CSD_HS_TIMING, val,
   card->ext_csd.generic_cmd6_time,
-  true, send_status, true);
+  true, false, true);
if (err) {
pr_err("%s: switch to high-speed from hs200 failed, err:%d\n",
mmc_hostname(host), err);
@@ -1090,11 +1086,9 @@ static int mmc_select_hs400(struct mmc_card *card)
/* Set host controller to HS timing */
mmc_set_timing(card->host, MMC_TIMING_MMC_HS);

-   if (!send_status) {
-   err = mmc_switch_status(card);
-   if (err)
-   goto out_err;
-   }
+   err = mmc_switch_status(card);
+   if (err)
+   goto out_err;

/* Switch card to DDR */
err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
@@ -1113,7 +1107,7 @@ static int mmc_select_hs400(struct mmc_card *card)
err = __mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
   EXT_CSD_HS_TIMING, val,
   card->ext_csd.generic_cmd6_time,
-  true, send_status, true);
+  true, false, true);
if (err) {
pr_err("%s: switch to hs400 failed, err:%d\n",
 mmc_hostname(host), err);
@@ -1124,11 +1118,9 @@ static int mmc_select_hs400(struct mmc_card *card)
mmc_set_timing(host, MMC_TIMING_MMC_HS400);
mmc_set_bus_speed(card);

-   if (!send_status) {
-   err = mmc_switch_status(card);
-   if (err)
-   goto out_err;
-   }
+   err = mmc_switch_status(card);
+   if (err)
+   goto out_err;

return 0;

@@ -1146,14 +1138,10 @@ int mmc_hs200_to_hs400(struct mmc_card *card)
 int 
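
A minimal sketch of the fallback described in the commit message
(illustrative; the actual mmc_ops.c hunk is truncated above, and timeout_ms
stands for the switch timeout already passed to __mmc_switch):

    /*
     * With CMD13 polling dropped: if the host can neither wait while busy
     * in hardware nor report busy via ->card_busy(), the only remaining
     * option is to sleep for the full switch timeout.
     */
    if (!(host->caps & MMC_CAP_WAIT_WHILE_BUSY) && !host->ops->card_busy)
            mmc_delay(timeout_ms);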

[PATCH 3/3] staging: dgnc: Need to check for NULL of ch

2016-05-08 Thread Daeseok Youn
the "ch" from brd structure could be NULL, it need to
check for NULL.

Signed-off-by: Daeseok Youn 
---
 drivers/staging/dgnc/dgnc_neo.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/dgnc/dgnc_neo.c b/drivers/staging/dgnc/dgnc_neo.c
index 9eae1a6..ba57e95 100644
--- a/drivers/staging/dgnc/dgnc_neo.c
+++ b/drivers/staging/dgnc/dgnc_neo.c
@@ -380,7 +380,7 @@ static inline void neo_parse_isr(struct dgnc_board *brd, 
uint port)
unsigned long flags;
 
ch = brd->channels[port];
-   if (ch->magic != DGNC_CHANNEL_MAGIC)
+   if (!ch || ch->magic != DGNC_CHANNEL_MAGIC)
return;
 
/* Here we try to figure out what caused the interrupt to happen */
-- 
2.8.2



[PATCH 1/3] staging: dgnc: fix 'line over 80 characters'

2016-05-08 Thread Daeseok Youn
fix checkpatch.pl warning about 'line over 80 characters'.

Signed-off-by: Daeseok Youn 
---
 drivers/staging/dgnc/dgnc_sysfs.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/staging/dgnc/dgnc_sysfs.c 
b/drivers/staging/dgnc/dgnc_sysfs.c
index d825964..b8d41c5 100644
--- a/drivers/staging/dgnc/dgnc_sysfs.c
+++ b/drivers/staging/dgnc/dgnc_sysfs.c
@@ -189,19 +189,21 @@ static ssize_t dgnc_ports_msignals_show(struct device *p,
DGNC_VERIFY_BOARD(p, bd);
 
for (i = 0; i < bd->nasync; i++) {
-   if (bd->channels[i]->ch_open_count) {
+   struct channel_t *ch = bd->channels[i];
+
+   if (ch->ch_open_count) {
count += snprintf(buf + count, PAGE_SIZE - count,
"%d %s %s %s %s %s %s\n",
-   bd->channels[i]->ch_portnum,
-   (bd->channels[i]->ch_mostat & UART_MCR_RTS) ? 
"RTS" : "",
-   (bd->channels[i]->ch_mistat & UART_MSR_CTS) ? 
"CTS" : "",
-   (bd->channels[i]->ch_mostat & UART_MCR_DTR) ? 
"DTR" : "",
-   (bd->channels[i]->ch_mistat & UART_MSR_DSR) ? 
"DSR" : "",
-   (bd->channels[i]->ch_mistat & UART_MSR_DCD) ? 
"DCD" : "",
-   (bd->channels[i]->ch_mistat & UART_MSR_RI)  ? 
"RI"  : "");
+   ch->ch_portnum,
+   (ch->ch_mostat & UART_MCR_RTS) ? "RTS" : "",
+   (ch->ch_mistat & UART_MSR_CTS) ? "CTS" : "",
+   (ch->ch_mostat & UART_MCR_DTR) ? "DTR" : "",
+   (ch->ch_mistat & UART_MSR_DSR) ? "DSR" : "",
+   (ch->ch_mistat & UART_MSR_DCD) ? "DCD" : "",
+   (ch->ch_mistat & UART_MSR_RI)  ? "RI"  : "");
} else {
count += snprintf(buf + count, PAGE_SIZE - count,
-   "%d\n", bd->channels[i]->ch_portnum);
+   "%d\n", ch->ch_portnum);
}
}
return count;
-- 
2.8.2



[PATCH 2/3] staging: dgnc: remove redundant condition check

2016-05-08 Thread Daeseok Youn
dgnc_board(brd) was already checked for NULL before neo_parse_isr() is
called, and port does not need to be checked either.

Signed-off-by: Daeseok Youn 
---
 drivers/staging/dgnc/dgnc_neo.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/staging/dgnc/dgnc_neo.c b/drivers/staging/dgnc/dgnc_neo.c
index 3b8ce38..9eae1a6 100644
--- a/drivers/staging/dgnc/dgnc_neo.c
+++ b/drivers/staging/dgnc/dgnc_neo.c
@@ -379,12 +379,6 @@ static inline void neo_parse_isr(struct dgnc_board *brd, 
uint port)
unsigned char cause;
unsigned long flags;
 
-   if (!brd || brd->magic != DGNC_BOARD_MAGIC)
-   return;
-
-   if (port >= brd->maxports)
-   return;
-
ch = brd->channels[port];
if (ch->magic != DGNC_CHANNEL_MAGIC)
return;
-- 
2.8.2



[PATCH] sched: fix the calculation of __sched_period in sched_slice()

2016-05-08 Thread Zhou Chengming
When we get the sched_slice of a sched_entity, we use cfs_rq->nr_running
to calculate the whole __sched_period. But cfs_rq->nr_running is only the
number of sched_entities in that cfs_rq, whereas rq->nr_running is the
number of all tasks that are not throttled. So we should use
rq->nr_running to calculate the whole __sched_period value.

Signed-off-by: Zhou Chengming 
---
 kernel/sched/fair.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fe30e6..59c9378 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -625,7 +625,7 @@ static u64 __sched_period(unsigned long nr_running)
  */
 static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-   u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
+   u64 slice = __sched_period(rq_of(cfs_rq)->nr_running + !se->on_rq);
 
for_each_sched_entity(se) {
struct load_weight *load;
-- 
1.7.7
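
As a concrete example of the distinction drawn above: if ten runnable tasks
all sit in a single task group, the top-level cfs_rq sees nr_running == 1
(one group entity) while rq->nr_running is 10, so deriving __sched_period()
from the per-cfs_rq count can badly underestimate how many tasks actually
have to share that period.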



Re: [PATCH 1/6] statx: Add a system call to make enhanced file info available

2016-05-08 Thread J. Bruce Fields
On Mon, May 09, 2016 at 11:45:43AM +1000, Dave Chinner wrote:
> [ OT, but I'll reply anyway :P ]
> 
> On Fri, May 06, 2016 at 02:29:23PM -0400, J. Bruce Fields wrote:
> > On Thu, May 05, 2016 at 08:56:02AM +1000, Dave Chinner wrote:
> > > In the latest XFS filesystem format, we randomise the generation
> > > value during every inode allocation to make it hard to guess the
> > > handle of adjacent inodes from an existing ino+gen pair, or even
> > > from life time to life time of the same inode.
> > 
> > The one thing I wonder about is whether that increases the probability
> > of a filehandle collision (where you accidentally generate the same
> > filehandle for two different files).
> 
> Not possible - inode number is still different between the two
> files. i.e. ino+gen makes the handle unique, not gen.
> 
> > If the generation number is a 32-bit counter per inode number (is that
> > actually the way filesystems work?), then it takes 2^32 reuses of the
> > inode number to hit the same filehandle.
> 
> 4 billion unlink/create operations that hit the same inode number
> are going to take some time. I suspect someone will notice the load
> generated by an attempt to brute force this sort of thing ;)
> 
> > If you choose it randomly then
> > you expect a collision after about 2^16 reuses.
> 
> I'm pretty sure that a random search will need to, on average,
> search half the keyspace before a match is found (i.e. 2^31
> attempts, not 2^16).

Yeah, but I was wondering whether you could somehow get into the
situation where clients between them are caching N distinct filehandles
with the same inode number.  Then a collision becomes likely around
2^16, by the usual birthday paradox rule-of-thumb.

Uh, but now that I think of it that's irrelevant.  At most one of those
filehandles actually refers to a still-existing file.  Any attempt to
use the other 2^16-1 should return -ESTALE.  So collisions among that
set don't matter, it's only collisions involving the existing file that
are interesting.  So, never mind, I can't see a practical way to hit a
problem here.

--b.
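
As a back-of-the-envelope check of the numbers above: for N independently
random 32-bit generation values, P(collision) ~= 1 - exp(-N*(N-1)/2^33),
which crosses 50% around N ~= sqrt(2 * ln(2) * 2^32) ~= 77k, a bit over
2^16 (the birthday rule-of-thumb), whereas guessing one specific value by
exhaustive search takes about 2^31 attempts on average, matching the two
figures quoted in the thread.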


Re: sched: tweak select_idle_sibling to look for idle threads

2016-05-08 Thread Yuyang Du
On Sun, May 08, 2016 at 10:08:55AM +0200, Mike Galbraith wrote:
> > Maybe give the criteria a bit margin, not just wakees tend to equal 
> > llc_size,
> > but the numbers are so wild to easily break the fragile condition, like:
> 
> Seems lockless traversal and averages just lets multiple CPUs select
> the same spot.  An atomic reservation (feature) when looking for an
> idle spot (also for fork) might fix it up.  Run the thing as RT,
> push/pull ensures that it reaches box saturation regardless of the
> number of messaging threads, whereas with fair class, any number > 1
> will certainly stack tasks before the box is saturated.

Yes, good idea, bringing order to the race to grab idle CPU is absolutely
helpful.

In addition, I would argue maybe beefing up idle balancing is a more
productive way to spread load, as work-stealing just does what needs
to be done. And seems it has been (sub-unconsciously) neglected in this
case, :)

Regarding wake_wide(), it seems the M:N is 1:24, not 6:6*24, if so,
the slave will be 0 forever (as last_wakee is never flipped).

Basically whenever a waker has more than 1 wakee, the wakee_flips
will comfortably grow very large (with last_wakee alternating),
whereas when a waker has 0 or 1 wakee, the wakee_flips will just be 0.

So recording only the last_wakee seems not right unless you have other
good reason. If not the latter, counting waking wakee times should be
better, and then allow the statistics to happily play.
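
For readers following along, this is roughly the heuristic under discussion
(paraphrased from kernel/sched/fair.c of this era; treat it as a sketch,
the tree has the authoritative version):

    static void record_wakee(struct task_struct *p)
    {
            /* Decay wakee_flips roughly once per second. */
            if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
                    current->wakee_flips >>= 1;
                    current->wakee_flip_decay_ts = jiffies;
            }

            /* Only a *different* wakee than last time counts as a flip. */
            if (current->last_wakee != p) {
                    current->last_wakee = p;
                    current->wakee_flips++;
            }
    }

    static int wake_wide(struct task_struct *p)
    {
            unsigned int master = current->wakee_flips;
            unsigned int slave = p->wakee_flips;
            int factor = this_cpu_read(sd_llc_size);

            if (master < slave)
                    swap(master, slave);
            if (slave < factor || master < slave * factor)
                    return 0;       /* stay affine: not a 1:N wakeup pattern */
            return 1;
    }

This is what the 1:N observation above refers to: a waker with a single
wakee never changes last_wakee, so its wakee_flips stays at 0.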


[PATCH v5 00/13] Support non-lru page migration

2016-05-08 Thread Minchan Kim
Recently, I got many reports about performance degradation in embedded
systems (Android mobile phones, webOS TVs and so on) and frequent fork
failures.

The problem was fragmentation caused mainly by zram and GPU drivers.
Under memory pressure, their pages were spread across all pageblocks and
could not be migrated with the current compaction algorithm, which
supports only LRU pages. In the end, compaction cannot work well, so the
reclaimer shrinks all of the working set pages. It made the system very
slow and even made fork, which requires order-2 or order-3 allocations,
fail easily.

The other pain point is that they cannot use CMA memory space, so when an
OOM kill happens, I can see many free pages in the CMA area, which is not
memory efficient. In our product, which has a big CMA area, zones are
reclaimed too excessively to allocate GPU and zram pages although there is
lots of free space in CMA, so the system easily becomes very slow.

To solve these problem, this patch tries to add facility to migrate
non-lru pages via introducing new functions and page flags to help
migration.


struct address_space_operations {
..
..
bool (*isolate_page)(struct page *, isolate_mode_t);
void (*putback_page)(struct page *);
..
}

new page flags

PG_movable
PG_isolated

For details, please read description in "mm: migrate: support non-lru
movable page migration".

Originally, Gioh Kim had tried to support this feature but he moved on, so
I took over the work. I took much code from his work and changed it a
little, and Konstantin Khlebnikov helped Gioh a lot, so he deserves much
credit, too.

And I should mention Chulmin, who has tested this patchset heavily so
that I could find many bugs with his help. :)

Thanks, Gioh, Konstantin and Chulmin!

This patchset consists of five parts.

1. clean up migration
  mm: use put_page to free page instead of putback_lru_page

2. add non-lru page migration feature
  mm: migrate: support non-lru movable page migration

3. rework KVM memory-ballooning
  mm: balloon: use general non-lru movable page feature

4. zsmalloc refactoring for preparing page migration
  zsmalloc: keep max_object in size_class
  zsmalloc: use bit_spin_lock
  zsmalloc: use accessor
  zsmalloc: factor page chain functionality out
  zsmalloc: introduce zspage structure
  zsmalloc: separate free_zspage from putback_zspage
  zsmalloc: use freeobj for index

5. zsmalloc page migration
  zsmalloc: page migration support
  zram: use __GFP_MOVABLE for memory allocation

* From v4
  * rebase on mmotm-2016-05-05-17-19
  * fix huge object migration - Chulmin
  * !CONFIG_COMPACTION support for zsmalloc

* From v3
  * rebase on mmotm-2016-04-06-20-40
  * fix swap_info deadlock - Chulmin
  * race without page_lock - Vlastimil
  * no use page._mapcount for potential user-mapped page driver - Vlastimil
  * fix and enhance doc/description - Vlastimil
  * use page->mapping lower bits to represent PG_movable
  * make driver side's rule simple.

* From v2
  * rebase on mmotm-2016-03-29-15-54-16
  * check PageMovable before lock_page - Joonsoo
  * check PageMovable before PageIsolated checking - Joonsoo
  * add more description about rule

* From v1
  * rebase on v4.5-mmotm-2016-03-17-15-04
  * reordering patches to merge clean-up patches first
  * add Acked-by/Reviewed-by from Vlastimil and Sergey
  * use each own mount model instead of reusing anon_inode_fs - Al Viro
  * small changes - YiPing, Gioh

Cc: Vlastimil Babka 
Cc: dri-de...@lists.freedesktop.org
Cc: Hugh Dickins 
Cc: John Einar Reitan 
Cc: Jonathan Corbet 
Cc: Joonsoo Kim 
Cc: Konstantin Khlebnikov 
Cc: Mel Gorman 
Cc: Naoya Horiguchi 
Cc: Rafael Aquini 
Cc: Rik van Riel 
Cc: Sergey Senozhatsky 
Cc: virtualizat...@lists.linux-foundation.org
Cc: Gioh Kim 
Cc: Chan Gyun Jeong 
Cc: Sangseok Lee 
Cc: Kyeongdon Kim 
Cc: Chulmin Kim 

Minchan Kim (12):
  mm: use put_page to free page instead of putback_lru_page
  mm: migrate: support non-lru movable page migration
  mm: balloon: use general non-lru movable page feature
  zsmalloc: keep max_object in size_class
  zsmalloc: use bit_spin_lock
  zsmalloc: use accessor
  zsmalloc: factor page chain functionality out
  zsmalloc: introduce zspage structure
  zsmalloc: separate free_zspage from putback_zspage
  zsmalloc: use freeobj for index
  zsmalloc: page migration support
  zram: use __GFP_MOVABLE for memory allocation

 Documentation/filesystems/Locking  |4 +
 Documentation/filesystems/vfs.txt  |   11 +
 Documentation/vm/page_migration|  107 ++-
 drivers/block/zram/zram_drv.c  |6 +-
 drivers/virtio/virtio_balloon.c|   52 +-
 include/linux/balloon_compaction.h |   51 +-
 

[PATCH v5 02/12] mm: migrate: support non-lru movable page migration

2016-05-08 Thread Minchan Kim
We have allowed migration for only LRU pages until now, and it was
enough to make high-order pages. But recently, embedded systems (e.g.,
webOS, Android) use lots of non-movable pages (e.g., zram, GPU memory),
so we have seen several reports about trouble with small high-order
allocations. To fix the problem, there have been several efforts
(e.g., enhancing the compaction algorithm, SLUB fallback to 0-order pages,
reserved memory, vmalloc and so on), but if there are lots of
non-movable pages in the system, these solutions are void in the long run.

So, this patch adds a facility to turn non-movable pages into movable
ones. For the feature, this patch introduces migration-related functions
in address_space_operations as well as some page flags.

If a driver wants to make its own pages movable, it should define three
functions, which are function pointers of struct address_space_operations.

1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

What the VM expects from a driver's isolate_page function is to return
*true* if the driver isolates the page successfully. On returning true,
the VM marks the page as PG_isolated so that concurrent isolation on
several CPUs skips the page. If a driver cannot isolate the page, it
should return *false*.

Once a page is successfully isolated, the VM uses the page.lru fields, so
the driver shouldn't expect the values in those fields to be preserved.

2. int (*migratepage) (struct address_space *mapping,
struct page *newpage, struct page *oldpage, enum migrate_mode);

After isolation, the VM calls the driver's migratepage with the isolated
page. The job of migratepage is to move the content of the old page to the
new page and to set up the fields of struct page newpage. Keep in mind that
you should clear PG_movable of oldpage via __ClearPageMovable under
page_lock if you migrated the oldpage successfully, and return
MIGRATEPAGE_SUCCESS. If the driver cannot migrate the page at the moment,
it can return -EAGAIN. On -EAGAIN, the VM will retry page migration after
a short time because it interprets -EAGAIN as a "temporary migration
failure". On returning any error other than -EAGAIN, the VM will give up
on migrating the page without retrying this time.

The driver shouldn't touch the page.lru field that the VM uses in these
functions.

3. void (*putback_page)(struct page *);

If migration fails on an isolated page, the VM should return the isolated
page to the driver, so the VM calls the driver's putback_page with the
page whose migration failed. In this function, the driver should put the
isolated page back into its own data structure.

4. non-lru movable page flags

There are two page flags for supporting non-lru movable page.

* PG_movable

The driver should use the function below to make a page movable under
page_lock.

void __SetPageMovable(struct page *page, struct address_space *mapping)

It takes an address_space argument to register the migration family of
functions that will be called by the VM. Strictly speaking, PG_movable is
not a real flag of struct page. Rather, the VM reuses the lower bits of
page->mapping to represent it.

#define PAGE_MAPPING_MOVABLE 0x2
page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

so the driver shouldn't access page->mapping directly. Instead, the driver
should use page_mapping(), which masks off the low two bits of
page->mapping, so it can get the right struct address_space.

For testing a non-lru movable page, the VM supports the __PageMovable
function. However, it doesn't guarantee identification of a non-lru
movable page because the page->mapping field is unified with other
variables in struct page. Also, if the driver releases the page after
isolation by the VM, page->mapping doesn't have a stable value although it
has PAGE_MAPPING_MOVABLE (look at __ClearPageMovable). But __PageMovable
is a cheap way to tell whether a page is LRU or non-lru movable once the
page has been isolated, because LRU pages can never have
PAGE_MAPPING_MOVABLE in page->mapping. It is also good for just peeking to
test for non-lru movable pages before the more expensive check with
lock_page during pfn scanning to select a victim.

To guarantee a non-lru movable page, the VM provides the PageMovable
function. Unlike __PageMovable, PageMovable validates page->mapping and
mapping->a_ops->isolate_page under lock_page. The lock_page prevents
sudden destruction of page->mapping.

A driver using __SetPageMovable should clear the flag via __ClearPageMovable
under page_lock before releasing the page.

* PG_isolated

To prevent concurrent isolation among several CPUs, the VM marks an
isolated page as PG_isolated under lock_page. So if a CPU encounters a
PG_isolated non-lru movable page, it can skip it. The driver doesn't need
to manipulate the flag because the VM will set/clear it automatically.
Keep in mind that if the driver sees a PG_isolated page, it means the page
has been isolated by the VM, so it shouldn't touch the page.lru field.
PG_isolated is an alias of the PG_reclaim flag, so the driver shouldn't
use the flag for its own purpose.
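
To make the contract above concrete, a minimal, illustrative driver-side
skeleton (not taken from the patchset; all my_dev_* helpers are
hypothetical):

    static bool my_dev_isolate_page(struct page *page, isolate_mode_t mode)
    {
            /* Return true only if the page was taken off the driver's lists. */
            return my_dev_detach(page);
    }

    static int my_dev_migratepage(struct address_space *mapping,
                                  struct page *newpage, struct page *oldpage,
                                  enum migrate_mode mode)
    {
            if (!my_dev_copy(newpage, oldpage))
                    return -EAGAIN; /* temporary failure: the VM retries soon */

            /* Success: drop the movable marking on the old page. */
            __ClearPageMovable(oldpage);
            return MIGRATEPAGE_SUCCESS;
    }

    static void my_dev_putback_page(struct page *page)
    {
            /* Migration failed: put the page back into the driver's lists. */
            my_dev_reattach(page);
    }

    static const struct address_space_operations my_dev_aops = {
            .isolate_page   = my_dev_isolate_page,
            .migratepage    = my_dev_migratepage,
            .putback_page   = my_dev_putback_page,
    };

    /* When a page becomes driver-owned and movable (under page_lock):
     *      __SetPageMovable(page, mapping_backed_by_my_dev_aops);
     */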

Cc: Rik van Riel 
Cc: Vlastimil Babka 
Cc: Joonsoo Kim 
Cc: Mel 

[PATCH v5 04/12] zsmalloc: keep max_object in size_class

2016-05-08 Thread Minchan Kim
Every zspage in a size_class has the same maximum number of objects, so
we can move that value into the size_class.

Reviewed-by: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 mm/zsmalloc.c | 32 +++-
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 3d6d3dae505a..3c2574be8cee 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -32,8 +32,6 @@
  * page->freelist: points to the first free object in zspage.
  * Free objects are linked together using in-place
  * metadata.
- * page->objects: maximum number of objects we can store in this
- * zspage (class->zspage_order * PAGE_SIZE / class->size)
  * page->lru: links together first pages of various zspages.
  * Basically forming list of zspages in a fullness group.
  * page->mapping: class index and fullness group of the zspage
@@ -211,6 +209,7 @@ struct size_class {
 * of ZS_ALIGN.
 */
int size;
+   int objs_per_zspage;
unsigned int index;
 
struct zs_size_stat stats;
@@ -627,21 +626,22 @@ static inline void zs_pool_stat_destroy(struct zs_pool 
*pool)
  * the pool (not yet implemented). This function returns fullness
  * status of the given page.
  */
-static enum fullness_group get_fullness_group(struct page *first_page)
+static enum fullness_group get_fullness_group(struct size_class *class,
+   struct page *first_page)
 {
-   int inuse, max_objects;
+   int inuse, objs_per_zspage;
enum fullness_group fg;
 
VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
inuse = first_page->inuse;
-   max_objects = first_page->objects;
+   objs_per_zspage = class->objs_per_zspage;
 
if (inuse == 0)
fg = ZS_EMPTY;
-   else if (inuse == max_objects)
+   else if (inuse == objs_per_zspage)
fg = ZS_FULL;
-   else if (inuse <= 3 * max_objects / fullness_threshold_frac)
+   else if (inuse <= 3 * objs_per_zspage / fullness_threshold_frac)
fg = ZS_ALMOST_EMPTY;
else
fg = ZS_ALMOST_FULL;
@@ -728,7 +728,7 @@ static enum fullness_group fix_fullness_group(struct 
size_class *class,
enum fullness_group currfg, newfg;
 
get_zspage_mapping(first_page, &class_idx, &currfg);
-   newfg = get_fullness_group(first_page);
+   newfg = get_fullness_group(class, first_page);
if (newfg == currfg)
goto out;
 
@@ -1011,9 +1011,6 @@ static struct page *alloc_zspage(struct size_class 
*class, gfp_t flags)
init_zspage(class, first_page);
 
first_page->freelist = location_to_obj(first_page, 0);
-   /* Maximum number of objects we can store in this zspage */
-   first_page->objects = class->pages_per_zspage * PAGE_SIZE / class->size;
-
error = 0; /* Success */
 
 cleanup:
@@ -1241,11 +1238,11 @@ static bool can_merge(struct size_class *prev, int 
size, int pages_per_zspage)
return true;
 }
 
-static bool zspage_full(struct page *first_page)
+static bool zspage_full(struct size_class *class, struct page *first_page)
 {
VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-   return first_page->inuse == first_page->objects;
+   return first_page->inuse == class->objs_per_zspage;
 }
 
 unsigned long zs_get_total_pages(struct zs_pool *pool)
@@ -1631,7 +1628,7 @@ static int migrate_zspage(struct zs_pool *pool, struct 
size_class *class,
}
 
/* Stop if there is no more space */
-   if (zspage_full(d_page)) {
+   if (zspage_full(class, d_page)) {
unpin_tag(handle);
ret = -ENOMEM;
break;
@@ -1690,7 +1687,7 @@ static enum fullness_group putback_zspage(struct zs_pool 
*pool,
 {
enum fullness_group fullness;
 
-   fullness = get_fullness_group(first_page);
+   fullness = get_fullness_group(class, first_page);
insert_zspage(class, fullness, first_page);
set_zspage_mapping(first_page, class->index, fullness);
 
@@ -1939,8 +1936,9 @@ struct zs_pool *zs_create_pool(const char *name)
class->size = size;
class->index = i;
class->pages_per_zspage = pages_per_zspage;
-   if (pages_per_zspage == 1 &&
-   get_maxobj_per_zspage(size, pages_per_zspage) == 1)
+   class->objs_per_zspage = class->pages_per_zspage *
+   PAGE_SIZE / class->size;
+   if (pages_per_zspage == 1 && class->objs_per_zspage == 1)
class->huge = true;
spin_lock_init(&class->lock);
pool->size_class[i] = class;
-- 
1.9.1



[PATCH v5 05/12] zsmalloc: use bit_spin_lock

2016-05-08 Thread Minchan Kim
Use the kernel's standard bit spin-lock instead of a custom mess. The
custom version even has a bug: it doesn't disable preemption. The reason
we haven't hit any problem is that we have only used it inside a
preemption-disabled section guarded by the class->lock spinlock. So there
is no need to go to stable.

Cc: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 mm/zsmalloc.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 3c2574be8cee..718dde7fd028 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -867,21 +867,17 @@ static unsigned long obj_idx_to_offset(struct page *page,
 
 static inline int trypin_tag(unsigned long handle)
 {
-   unsigned long *ptr = (unsigned long *)handle;
-
-   return !test_and_set_bit_lock(HANDLE_PIN_BIT, ptr);
+   return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
 }
 
 static void pin_tag(unsigned long handle)
 {
-   while (!trypin_tag(handle));
+   bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
 }
 
 static void unpin_tag(unsigned long handle)
 {
-   unsigned long *ptr = (unsigned long *)handle;
-
-   clear_bit_unlock(HANDLE_PIN_BIT, ptr);
+   bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle);
 }
 
 static void reset_page(struct page *page)
-- 
1.9.1



[PATCH v5 03/12] mm: balloon: use general non-lru movable page feature

2016-05-08 Thread Minchan Kim
Now, VM has a feature to migrate non-lru movable pages so
balloon doesn't need custom migration hooks in migrate.c
and compaction.c. Instead, this patch implements
page->mapping->a_ops->{isolate|migrate|putback} functions.

With that, we could remove hooks for ballooning in general
migration functions and make balloon compaction simple.

Cc: virtualizat...@lists.linux-foundation.org
Cc: Rafael Aquini 
Cc: Konstantin Khlebnikov 
Signed-off-by: Gioh Kim 
Signed-off-by: Minchan Kim 
---
 drivers/virtio/virtio_balloon.c| 52 +++--
 include/linux/balloon_compaction.h | 51 ++---
 include/uapi/linux/magic.h |  1 +
 mm/balloon_compaction.c| 94 +++---
 mm/compaction.c|  7 ---
 mm/migrate.c   | 19 +---
 mm/vmscan.c|  2 +-
 7 files changed, 83 insertions(+), 143 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7b6d74f0c72f..04fc63b4a735 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -45,6 +46,10 @@ static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 
+#ifdef CONFIG_BALLOON_COMPACTION
+static struct vfsmount *balloon_mnt;
+#endif
+
 struct virtio_balloon {
struct virtio_device *vdev;
struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -486,6 +491,24 @@ static int virtballoon_migratepage(struct balloon_dev_info 
*vb_dev_info,
 
return MIGRATEPAGE_SUCCESS;
 }
+
+static struct dentry *balloon_mount(struct file_system_type *fs_type,
+   int flags, const char *dev_name, void *data)
+{
+   static const struct dentry_operations ops = {
+   .d_dname = simple_dname,
+   };
+
+   return mount_pseudo(fs_type, "balloon-kvm:", NULL, &ops,
+   BALLOON_KVM_MAGIC);
+}
+
+static struct file_system_type balloon_fs = {
+   .name   = "balloon-kvm",
+   .mount  = balloon_mount,
+   .kill_sb= kill_anon_super,
+};
+
 #endif /* CONFIG_BALLOON_COMPACTION */
 
 static int virtballoon_probe(struct virtio_device *vdev)
@@ -515,9 +538,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
vb->vdev = vdev;
 
balloon_devinfo_init(&vb->vb_dev_info);
-#ifdef CONFIG_BALLOON_COMPACTION
-   vb->vb_dev_info.migratepage = virtballoon_migratepage;
-#endif
 
err = init_vqs(vb);
if (err)
@@ -527,13 +547,33 @@ static int virtballoon_probe(struct virtio_device *vdev)
vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
err = register_oom_notifier(&vb->nb);
if (err < 0)
-   goto out_oom_notify;
+   goto out_del_vqs;
+
+#ifdef CONFIG_BALLOON_COMPACTION
+   balloon_mnt = kern_mount(&balloon_fs);
+   if (IS_ERR(balloon_mnt)) {
+   err = PTR_ERR(balloon_mnt);
+   unregister_oom_notifier(&vb->nb);
+   goto out_del_vqs;
+   }
+
+   vb->vb_dev_info.migratepage = virtballoon_migratepage;
+   vb->vb_dev_info.inode = alloc_anon_inode(balloon_mnt->mnt_sb);
+   if (IS_ERR(vb->vb_dev_info.inode)) {
+   err = PTR_ERR(vb->vb_dev_info.inode);
+   kern_unmount(balloon_mnt);
+   unregister_oom_notifier(&vb->nb);
+   vb->vb_dev_info.inode = NULL;
+   goto out_del_vqs;
+   }
+   vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
+#endif
 
virtio_device_ready(vdev);
 
return 0;
 
-out_oom_notify:
+out_del_vqs:
vdev->config->del_vqs(vdev);
 out_free_vb:
kfree(vb);
@@ -567,6 +607,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
cancel_work_sync(&vb->update_balloon_stats_work);
 
remove_common(vb);
+   if (vb->vb_dev_info.inode)
+   iput(vb->vb_dev_info.inode);
kfree(vb);
 }
 
diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
index 9b0a15d06a4f..79542b2698ec 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -48,6 +48,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Balloon device information descriptor.
@@ -62,6 +63,7 @@ struct balloon_dev_info {
struct list_head pages; /* Pages enqueued & handled to Host */
int (*migratepage)(struct balloon_dev_info *, struct page *newpage,
struct page *page, enum migrate_mode mode);
+   struct inode *inode;
 };
 
 extern struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info);
@@ -73,45 +75,19 @@ static inline void 

[PATCH v5 06/12] zsmalloc: use accessor

2016-05-08 Thread Minchan Kim
An upcoming patch will change how zspage metadata is encoded, so for easier
review this patch wraps the code that accesses metadata in accessor functions.

Cc: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 mm/zsmalloc.c | 82 +++
 1 file changed, 60 insertions(+), 22 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 718dde7fd028..086fd65311f7 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -266,10 +266,14 @@ struct zs_pool {
  * A zspage's class index and fullness group
  * are encoded in its (first)page->mapping
  */
-#define CLASS_IDX_BITS 28
 #define FULLNESS_BITS  4
-#define CLASS_IDX_MASK ((1 << CLASS_IDX_BITS) - 1)
-#define FULLNESS_MASK  ((1 << FULLNESS_BITS) - 1)
+#define CLASS_BITS 28
+
+#define FULLNESS_SHIFT 0
+#define CLASS_SHIFT(FULLNESS_SHIFT + FULLNESS_BITS)
+
+#define FULLNESS_MASK  ((1UL << FULLNESS_BITS) - 1)
+#define CLASS_MASK ((1UL << CLASS_BITS) - 1)
 
 struct mapping_area {
 #ifdef CONFIG_PGTABLE_MAPPING
@@ -416,6 +420,41 @@ static int is_last_page(struct page *page)
return PagePrivate2(page);
 }
 
+static inline int get_zspage_inuse(struct page *first_page)
+{
+   return first_page->inuse;
+}
+
+static inline void set_zspage_inuse(struct page *first_page, int val)
+{
+   first_page->inuse = val;
+}
+
+static inline void mod_zspage_inuse(struct page *first_page, int val)
+{
+   first_page->inuse += val;
+}
+
+static inline int get_first_obj_offset(struct page *page)
+{
+   return page->index;
+}
+
+static inline void set_first_obj_offset(struct page *page, int offset)
+{
+   page->index = offset;
+}
+
+static inline unsigned long get_freeobj(struct page *first_page)
+{
+   return (unsigned long)first_page->freelist;
+}
+
+static inline void set_freeobj(struct page *first_page, unsigned long obj)
+{
+   first_page->freelist = (void *)obj;
+}
+
 static void get_zspage_mapping(struct page *first_page,
unsigned int *class_idx,
enum fullness_group *fullness)
@@ -424,8 +463,8 @@ static void get_zspage_mapping(struct page *first_page,
VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
m = (unsigned long)first_page->mapping;
-   *fullness = m & FULLNESS_MASK;
-   *class_idx = (m >> FULLNESS_BITS) & CLASS_IDX_MASK;
+   *fullness = (m >> FULLNESS_SHIFT) & FULLNESS_MASK;
+   *class_idx = (m >> CLASS_SHIFT) & CLASS_MASK;
 }
 
 static void set_zspage_mapping(struct page *first_page,
@@ -435,8 +474,7 @@ static void set_zspage_mapping(struct page *first_page,
unsigned long m;
VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-   m = ((class_idx & CLASS_IDX_MASK) << FULLNESS_BITS) |
-   (fullness & FULLNESS_MASK);
+   m = (class_idx << CLASS_SHIFT) | (fullness << FULLNESS_SHIFT);
first_page->mapping = (struct address_space *)m;
 }
 
@@ -634,7 +672,7 @@ static enum fullness_group get_fullness_group(struct 
size_class *class,
 
VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-   inuse = first_page->inuse;
+   inuse = get_zspage_inuse(first_page);
objs_per_zspage = class->objs_per_zspage;
 
if (inuse == 0)
@@ -680,7 +718,7 @@ static void insert_zspage(struct size_class *class,
 * empty/full. Put pages with higher ->inuse first.
 */
list_add_tail(&first_page->lru, &(*head)->lru);
-   if (first_page->inuse >= (*head)->inuse)
+   if (get_zspage_inuse(first_page) >= get_zspage_inuse(*head))
*head = first_page;
 }
 
@@ -860,7 +898,7 @@ static unsigned long obj_idx_to_offset(struct page *page,
unsigned long off = 0;
 
if (!is_first_page(page))
-   off = page->index;
+   off = get_first_obj_offset(page);
 
return off + obj_idx * class_size;
 }
@@ -895,7 +933,7 @@ static void free_zspage(struct page *first_page)
struct page *nextp, *tmp, *head_extra;
 
VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
-   VM_BUG_ON_PAGE(first_page->inuse, first_page);
+   VM_BUG_ON_PAGE(get_zspage_inuse(first_page), first_page);
 
head_extra = (struct page *)page_private(first_page);
 
@@ -936,7 +974,7 @@ static void init_zspage(struct size_class *class, struct 
page *first_page)
 * head of corresponding zspage's freelist.
 */
if (page != first_page)
-   page->index = off;
+   set_first_obj_offset(page, off);
 
vaddr = kmap_atomic(page);
link = (struct link_free *)vaddr + off / sizeof(*link);
@@ -991,7 +1029,7 @@ static struct page *alloc_zspage(struct size_class *class, 
gfp_t flags)
SetPagePrivate(page);
set_page_private(page, 0);
first_page = page;
-   

[PATCH v5 10/12] zsmalloc: use freeobj for index

2016-05-08 Thread Minchan Kim
Zsmalloc stores the first free object's position in freeobj of each
zspage. If we change it to an index from first_page instead of a position,
it makes page migration simple because we don't need to correct other
entries of the linked list if a page is migrated out.

Cc: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 mm/zsmalloc.c | 139 ++
 1 file changed, 73 insertions(+), 66 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 5ccd83732a14..29dd413322b0 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -69,9 +69,7 @@
  * Object location (<PFN>, <obj_idx>) is encoded as
  * as single (unsigned long) handle value.
  *
- * Note that object index <obj_idx> is relative to system
- * page <PFN> it is stored in, so for each sub-page belonging
- * to a zspage, obj_idx starts with 0.
+ * Note that object index <obj_idx> starts from 0.
  *
  * This is made more complicated by various memory models and PAE.
  */
@@ -212,10 +210,10 @@ struct size_class {
 struct link_free {
union {
/*
-* Position of next free chunk (encodes <PFN, obj_idx>)
+* Free object index;
 * It's valid for non-allocated object
 */
-   void *next;
+   unsigned long next;
/*
 * Handle of allocated object.
 */
@@ -259,7 +257,7 @@ struct zspage {
unsigned int class:CLASS_BITS;
};
unsigned int inuse;
-   void *freeobj;
+   unsigned int freeobj;
struct page *first_page;
struct list_head list; /* fullness list */
 };
@@ -456,14 +454,14 @@ static inline void set_first_obj_offset(struct page 
*page, int offset)
page->index = offset;
 }
 
-static inline unsigned long get_freeobj(struct zspage *zspage)
+static inline unsigned int get_freeobj(struct zspage *zspage)
 {
-   return (unsigned long)zspage->freeobj;
+   return zspage->freeobj;
 }
 
-static inline void set_freeobj(struct zspage *zspage, unsigned long obj)
+static inline void set_freeobj(struct zspage *zspage, unsigned int obj)
 {
-   zspage->freeobj = (void *)obj;
+   zspage->freeobj = obj;
 }
 
 static void get_zspage_mapping(struct zspage *zspage,
@@ -808,6 +806,10 @@ static int get_pages_per_zspage(int class_size)
return max_usedpc_order;
 }
 
+static struct page *get_first_page(struct zspage *zspage)
+{
+   return zspage->first_page;
+}
 
 static struct zspage *get_zspage(struct page *page)
 {
@@ -819,37 +821,33 @@ static struct page *get_next_page(struct page *page)
return page->next;
 }
 
-/*
- * Encode <page, obj_idx> as a single handle value.
- * We use the least bit of handle for tagging.
+/**
+ * obj_to_location - get (<page>, <obj_idx>) from encoded object value
+ * @page: page object resides in zspage
+ * @obj_idx: object index
  */
-static void *location_to_obj(struct page *page, unsigned long obj_idx)
+static void obj_to_location(unsigned long obj, struct page **page,
+   unsigned int *obj_idx)
 {
-   unsigned long obj;
+   obj >>= OBJ_TAG_BITS;
+   *page = pfn_to_page(obj >> OBJ_INDEX_BITS);
+   *obj_idx = (obj & OBJ_INDEX_MASK);
+}
 
-   if (!page) {
-   VM_BUG_ON(obj_idx);
-   return NULL;
-   }
+/**
+ * location_to_obj - get obj value encoded from (<page>, <obj_idx>)
+ * @page: page object resides in zspage
+ * @obj_idx: object index
+ */
+static unsigned long location_to_obj(struct page *page, unsigned int obj_idx)
+{
+   unsigned long obj;
 
obj = page_to_pfn(page) << OBJ_INDEX_BITS;
-   obj |= ((obj_idx) & OBJ_INDEX_MASK);
+   obj |= obj_idx & OBJ_INDEX_MASK;
obj <<= OBJ_TAG_BITS;
 
-   return (void *)obj;
-}
-
-/*
- * Decode <page, obj_idx> pair from the given object handle. We adjust the
- * decoded obj_idx back to its original value since it was adjusted in
- * location_to_obj().
- */
-static void obj_to_location(unsigned long obj, struct page **page,
-   unsigned long *obj_idx)
-{
-   obj >>= OBJ_TAG_BITS;
-   *page = pfn_to_page(obj >> OBJ_INDEX_BITS);
-   *obj_idx = (obj & OBJ_INDEX_MASK);
+   return obj;
 }
 
 static unsigned long handle_to_obj(unsigned long handle)
@@ -867,16 +865,6 @@ static unsigned long obj_to_head(struct size_class *class, 
struct page *page,
return *(unsigned long *)obj;
 }
 
-static unsigned long obj_idx_to_offset(struct page *page,
-   unsigned long obj_idx, int class_size)
-{
-   unsigned long off;
-
-   off = get_first_obj_offset(page);
-
-   return off + obj_idx * class_size;
-}
-
 static inline int trypin_tag(unsigned long handle)
 {
return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
@@ -920,13 +908,13 @@ static void free_zspage(struct zs_pool *pool, struct zspage *zspage)
 /* Initialize a newly 

[PATCH v5 12/12] zram: use __GFP_MOVABLE for memory allocation

2016-05-08 Thread Minchan Kim
Zsmalloc is ready for page migration so zram can use __GFP_MOVABLE
from now on.

I ran a test to see how much it helps to create higher-order pages.
The test scenario is as follows.

KVM guest, 1G memory, ext4-formatted zram block device,

for i in `seq 1 8`;
do
dd if=/dev/vda1 of=mnt/test$i.txt bs=128M count=1 &
done

wait `pidof dd`

for i in `seq 1 2 8`;
do
rm -rf mnt/test$i.txt
done
fstrim -v mnt

echo "init"
cat /proc/buddyinfo

echo "compaction"
echo 1 > /proc/sys/vm/compact_memory
cat /proc/buddyinfo

old:

init
Node 0, zone      DMA    208    120     51     41     11      0      0      0      0      0      0
Node 0, zone    DMA32  16380  13777   9184   3805    789     54      3      0      0      0      0
compaction
Node 0, zone      DMA    132     82     40     39     16      2      1      0      0      0      0
Node 0, zone    DMA32   5219   5526   4969   3455   1831    677    139     15      0      0      0

new:

init
Node 0, zone      DMA    379    115     97     19      2      0      0      0      0      0      0
Node 0, zone    DMA32  18891  16774  10862   3947    637     21      0      0      0      0      0
compaction
Node 0, zone      DMA    214     66     87     29     10      3      0      0      0      0      0
Node 0, zone    DMA32   1612   3139   3154   2469   1745    990    384     94      7      0      0

As you can see, compaction now produces many more high-order pages. Yay!
(Each /proc/buddyinfo column is the count of free blocks of that order,
from order 0 on the left to order 10 on the right, so the growth in the
right-hand columns is exactly the gain in high-order pages.)

Reviewed-by: Sergey Senozhatsky 

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 8fcfbebe79cd..55419f104f67 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -714,13 +714,15 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
handle = zs_malloc(meta->mem_pool, clen,
__GFP_KSWAPD_RECLAIM |
__GFP_NOWARN |
-   __GFP_HIGHMEM);
+   __GFP_HIGHMEM |
+   __GFP_MOVABLE);
if (!handle) {
zcomp_strm_release(zram->comp, zstrm);
zstrm = NULL;
 
handle = zs_malloc(meta->mem_pool, clen,
-   GFP_NOIO | __GFP_HIGHMEM);
+   GFP_NOIO | __GFP_HIGHMEM |
+   __GFP_MOVABLE);
if (handle)
goto compress_again;
 
-- 
1.9.1



[PATCH v5 09/12] zsmalloc: separate free_zspage from putback_zspage

2016-05-08 Thread Minchan Kim
Currently, putback_zspage frees the zspage under class->lock if its
fullness becomes ZS_EMPTY, but that makes it hard to implement the
locking scheme needed for the new zspage migration.
So this patch separates free_zspage from putback_zspage and frees the
zspage outside of class->lock, as preparation for zspage migration.

Cc: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 mm/zsmalloc.c | 27 +++
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 162a598a417a..5ccd83732a14 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1685,14 +1685,12 @@ static struct zspage *isolate_zspage(struct size_class *class, bool source)
 
 /*
  * putback_zspage - add @zspage into right class's fullness list
- * @pool: target pool
  * @class: destination class
  * @zspage: target page
  *
  * Return @zspage's fullness_group
  */
-static enum fullness_group putback_zspage(struct zs_pool *pool,
-   struct size_class *class,
+static enum fullness_group putback_zspage(struct size_class *class,
struct zspage *zspage)
 {
enum fullness_group fullness;
@@ -1701,15 +1699,6 @@ static enum fullness_group putback_zspage(struct zs_pool *pool,
insert_zspage(class, zspage, fullness);
set_zspage_mapping(zspage, class->index, fullness);
 
-   if (fullness == ZS_EMPTY) {
-   zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
-   class->size, class->pages_per_zspage));
-   atomic_long_sub(class->pages_per_zspage,
-   &pool->pages_allocated);
-
-   free_zspage(pool, zspage);
-   }
-
return fullness;
 }
 
@@ -1755,23 +1744,29 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
if (!migrate_zspage(pool, class, &cc))
break;
 
-   putback_zspage(pool, class, dst_zspage);
+   putback_zspage(class, dst_zspage);
}
 
/* Stop if we couldn't find slot */
if (dst_zspage == NULL)
break;
 
-   putback_zspage(pool, class, dst_zspage);
-   if (putback_zspage(pool, class, src_zspage) == ZS_EMPTY)
+   putback_zspage(class, dst_zspage);
+   if (putback_zspage(class, src_zspage) == ZS_EMPTY) {
+   zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
+   class->size, class->pages_per_zspage));
+   atomic_long_sub(class->pages_per_zspage,
+   &pool->pages_allocated);
+   free_zspage(pool, src_zspage);
pool->stats.pages_compacted += class->pages_per_zspage;
+   }
spin_unlock(&class->lock);
cond_resched();
spin_lock(&class->lock);
}
 
if (src_zspage)
-   putback_zspage(pool, class, src_zspage);
+   putback_zspage(class, src_zspage);
 
spin_unlock(&class->lock);
 }
-- 
1.9.1



[PATCH v5 11/12] zsmalloc: page migration support

2016-05-08 Thread Minchan Kim
This patch introduces run-time migration feature for zspage.

For migration, the VM uses the page.lru field, so it is better not to use
the page.next field, which is unified with page.lru, for our own purpose.
To that end, we first obtain the first object offset of a page via a
runtime calculation instead of using page.index, so that page.index can
be used as the link for page chaining instead of page.next.

In the case of a huge object, the handle is stored in page.index instead
of the next link of the page chain, because a huge object does not need
page chaining. So get_next_page needs to identify huge-object pages and
return NULL for them. For that, this patch uses the PG_owner_priv_1 page
flag.
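
A rough sketch of the resulting check (illustrative only; the real patch may
wrap the flag test in its own helper, here the generic PG_owner_priv_1 bit is
tested directly):

	static struct page *get_next_page(struct page *page)
	{
		/* A huge-object page stores a handle in page->index and is
		 * never part of a chain, so it has no next page. */
		if (test_bit(PG_owner_priv_1, &page->flags))
			return NULL;

		return page->freelist;	/* link to the next component page */
	}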

For migration, it supports three functions

* zs_page_isolate

It isolates the zspage that contains the subpage the VM wants to migrate
from its class, so that nobody can allocate a new object from that zspage.

A zspage can be isolated through any of its subpages, so a subsequent
isolation attempt via another subpage of the same zspage should not fail.
For that, we introduce a zspage.isolated count. With it, zs_page_isolate
can tell whether the zspage is already isolated for migration, and if so
the subsequent isolation attempt succeeds without doing any further
isolation work.

* zs_page_migrate

First of all, it takes the write side of zspage->lock to prevent
concurrent migration of other subpages in the zspage. Then it locks all
objects in the page the VM wants to migrate. All objects in the page must
be locked because of a race between zs_map_object and zs_page_migrate:

zs_map_object                          zs_page_migrate

pin_tag(handle)
obj = handle_to_obj(handle)
obj_to_location(obj, &page, &obj_idx);

                                       write_lock(&zspage->lock)
                                       if (!trypin_tag(handle))
                                               goto unpin_object

zspage = get_zspage(page);
read_lock(&zspage->lock);

If zs_page_migrate didn't do trypin_tag, the page zs_map_object is
working on could become stale due to migration, and it would crash.

If it locks all of the objects successfully, it copies the content from
the old page to the new one and finally creates a new zspage chain with
the new page. If that was the last isolated subpage in the zspage, it
puts the zspage back into its class.

* zs_page_putback

It returns an isolated zspage to the right fullness_group list if it
failed to migrate a page. If it finds the zspage is ZS_EMPTY, it queues
the zspage for freeing on a workqueue. See below about async zspage
freeing.

This patch introduces asynchronous zspage freeing. We need it because
clearing PG_movable requires the page lock, but unfortunately the zs_free
path must be atomic, so the approach is to trylock the pages. If zs_free
gets the page lock of all pages successfully, it can free the zspage
immediately; otherwise it queues a free request and frees the zspage via
a workqueue in process context.

If zs_free finds the zspage is isolated when it tries to free it, it
delays the freeing until zs_page_putback notices it and finally frees the
zspage.

In this patch, we expand fullness_list from ZS_EMPTY to ZS_FULL. First of
all, the ZS_EMPTY list is used for delayed freeing. And with the added
ZS_FULL list, we can identify whether a zspage is isolated via a
list_empty(&zspage->list) test.
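
A minimal sketch of that test (illustrative; the helper name is made up here
and the real patch may open-code it):

	/* A zspage sitting on one of the fullness lists is not isolated;
	 * an isolated zspage has been taken off its list, so its list
	 * head is empty. */
	static bool zspage_is_isolated(struct zspage *zspage)
	{
		return list_empty(&zspage->list);
	}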

Cc: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 include/uapi/linux/magic.h |   1 +
 mm/zsmalloc.c  | 796 ++---
 2 files changed, 675 insertions(+), 122 deletions(-)

diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index d829ce63529d..e398beac67b8 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -81,5 +81,6 @@
 /* Since UDF 2.01 is ISO 13346 based... */
 #define UDF_SUPER_MAGIC0x15013346
 #define BALLOON_KVM_MAGIC  0x13661366
+#define ZSMALLOC_MAGIC 0x58295829
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 29dd413322b0..107ec06eabb6 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -17,14 +17,14 @@
  *
  * Usage of struct page fields:
  * page->private: points to zspage
- * page->index: offset of the first object starting in this page.
- * For the first page, this is always 0, so we use this field
- * to store handle for huge object.
- * page->next: links together all component pages of a zspage
+ * page->freelist(index): links together all component pages of a zspage
+ * For the huge page, this is always 0, so we use this field
+ * to store handle.
  *
  * Usage of struct page flags:
  * PG_private: identifies the first component page
  * PG_private2: identifies the last component page
+ * PG_owner_priv_1: identifies the huge component page
  *
  */
 
@@ -47,6 +47,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+
+#define ZSPAGE_MAGIC   0x58
 
 /*
  * This must be power of 2 and greater than of equal to sizeof(link_free).
@@ -134,25 +138,23 @@
  * We do not 

[PATCH v5 08/12] zsmalloc: introduce zspage structure

2016-05-08 Thread Minchan Kim
We have squeezed the metadata of a zspage into the first page's
descriptor, so to get metadata from a subpage we must first obtain the
first page. That makes it hard to implement the page migration feature
of zsmalloc, because any place that derives the first page from a subpage
can race with migration of the first page; in other words, the first page
it got could be stale. To prevent that, I tried several approaches, but
they made the code complicated, so I finally concluded that the metadata
should be separated from the first page. Of course, it consumes more
memory: 16 bytes per zspage on 32-bit at the moment. That means we lose
1% in the *worst case* (40B/4096B), which I think is not bad as the cost
of maintainability.

Cc: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 mm/zsmalloc.c | 530 ++
 1 file changed, 241 insertions(+), 289 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 1855e75766c7..162a598a417a 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -16,26 +16,11 @@
  * struct page(s) to form a zspage.
  *
  * Usage of struct page fields:
- * page->private: points to the first component (0-order) page
- * page->index (union with page->freelist): offset of the first object
- * starting in this page. For the first page, this is
- * always 0, so we use this field (aka freelist) to point
- * to the first free object in zspage.
- * page->lru: links together all component pages (except the first page)
- * of a zspage
- *
- * For _first_ page only:
- *
- * page->private: refers to the component page after the first page
- * If the page is first_page for huge object, it stores handle.
- * Look at size_class->huge.
- * page->freelist: points to the first free object in zspage.
- * Free objects are linked together using in-place
- * metadata.
- * page->lru: links together first pages of various zspages.
- * Basically forming list of zspages in a fullness group.
- * page->mapping: class index and fullness group of the zspage
- * page->inuse: the number of objects that are used in this zspage
+ * page->private: points to zspage
+ * page->index: offset of the first object starting in this page.
+ * For the first page, this is always 0, so we use this field
+ * to store handle for huge object.
+ * page->next: links together all component pages of a zspage
  *
  * Usage of struct page flags:
  * PG_private: identifies the first component page
@@ -145,7 +130,7 @@
  *  ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
  *  (reason above)
  */
-#define ZS_SIZE_CLASS_DELTA(PAGE_SIZE >> 8)
+#define ZS_SIZE_CLASS_DELTA(PAGE_SIZE >> CLASS_BITS)
 
 /*
  * We do not maintain any list for completely empty or full pages
@@ -153,8 +138,6 @@
 enum fullness_group {
ZS_ALMOST_FULL,
ZS_ALMOST_EMPTY,
-   _ZS_NR_FULLNESS_GROUPS,
-
ZS_EMPTY,
ZS_FULL
 };
@@ -203,7 +186,7 @@ static const int fullness_threshold_frac = 4;
 
 struct size_class {
spinlock_t lock;
-   struct page *fullness_list[_ZS_NR_FULLNESS_GROUPS];
+   struct list_head fullness_list[2];
/*
 * Size of objects stored in this class. Must be multiple
 * of ZS_ALIGN.
@@ -222,7 +205,7 @@ struct size_class {
 
 /*
  * Placed within free objects to form a singly linked list.
- * For every zspage, first_page->freelist gives head of this list.
+ * For every zspage, zspage->freeobj gives head of this list.
  *
  * This must be power of 2 and less than or equal to ZS_ALIGN
  */
@@ -245,6 +228,7 @@ struct zs_pool {
 
struct size_class **size_class;
struct kmem_cache *handle_cachep;
+   struct kmem_cache *zspage_cachep;
 
atomic_long_t pages_allocated;
 
@@ -266,14 +250,19 @@ struct zs_pool {
  * A zspage's class index and fullness group
  * are encoded in its (first)page->mapping
  */
-#define FULLNESS_BITS  4
-#define CLASS_BITS 28
+#define FULLNESS_BITS  2
+#define CLASS_BITS 8
 
-#define FULLNESS_SHIFT 0
-#define CLASS_SHIFT(FULLNESS_SHIFT + FULLNESS_BITS)
-
-#define FULLNESS_MASK  ((1UL << FULLNESS_BITS) - 1)
-#define CLASS_MASK ((1UL << CLASS_BITS) - 1)
+struct zspage {
+   struct {
+   unsigned int fullness:FULLNESS_BITS;
+   unsigned int class:CLASS_BITS;
+   };
+   unsigned int inuse;
+   void *freeobj;
+   struct page *first_page;
+   struct list_head list; /* fullness list */
+};
 
 struct mapping_area {
 #ifdef CONFIG_PGTABLE_MAPPING
@@ -285,29 +274,50 @@ struct mapping_area {
enum zs_mapmode vm_mm; /* mapping mode */
 };
 
-static int create_handle_cache(struct zs_pool *pool)
+static int create_cache(struct zs_pool *pool)
 {
pool->handle_cachep = kmem_cache_create("zs_handle", ZS_HANDLE_SIZE,

[PATCH v5 01/12] mm: use put_page to free page instead of putback_lru_page

2016-05-08 Thread Minchan Kim
The procedure of page migration is as follows:

First of all, it should isolate a page from the LRU and try to migrate
it. If that is successful, it releases the page for freeing. Otherwise,
it should put the page back on the LRU list.

For LRU pages, we have used putback_lru_page for both freeing and
putting back to the LRU list. That is okay because put_page is aware of
the LRU list, so if it releases the last refcount of the page, it removes
the page from the LRU list. However, it performs unnecessary operations
(e.g., lru_cache_add, pagevec and flags operations; not significant, but
not worth doing) and makes it harder to support new non-LRU page
migration, because put_page isn't aware of a non-LRU page's data
structure.

To solve the problem, we could add a new hook in put_page with a
PageMovable flag check, but that would increase overhead in a hot path
and would need a new locking scheme to stabilize the flag check against
put_page.
So, this patch cleans it up by dividing the two semantics (i.e., put and
putback). If migration is successful, use put_page instead of
putback_lru_page, and use putback_lru_page only on failure. That makes
the code more readable and doesn't add overhead to put_page.

Comment from Vlastimil
"Yeah, and compaction (perhaps also other migration users) has to drain
the lru pvec... Getting rid of this stuff is worth even by itself."

Cc: Rik van Riel 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Naoya Horiguchi 
Acked-by: Vlastimil Babka 
Signed-off-by: Minchan Kim 
---
 mm/migrate.c | 64 +---
 1 file changed, 40 insertions(+), 24 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 53ab6398e7a2..f2932498ded2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -913,6 +913,19 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
put_anon_vma(anon_vma);
unlock_page(page);
 out:
+   /*
+* If migration is successful, decrease refcount of the newpage
+* which will not free the page because new page owner increased
+* refcounter. As well, if it is LRU page, add the page to LRU
+* list in here.
+*/
+   if (rc == MIGRATEPAGE_SUCCESS) {
+   if (unlikely(__is_movable_balloon_page(newpage)))
+   put_page(newpage);
+   else
+   putback_lru_page(newpage);
+   }
+
return rc;
 }
 
@@ -946,6 +959,12 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 
if (page_count(page) == 1) {
/* page was freed from under us. So we are done. */
+   ClearPageActive(page);
+   ClearPageUnevictable(page);
+   if (put_new_page)
+   put_new_page(newpage, private);
+   else
+   put_page(newpage);
goto out;
}
 
@@ -958,10 +977,8 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
}
 
rc = __unmap_and_move(page, newpage, force, mode);
-   if (rc == MIGRATEPAGE_SUCCESS) {
-   put_new_page = NULL;
+   if (rc == MIGRATEPAGE_SUCCESS)
set_page_owner_migrate_reason(newpage, reason);
-   }
 
 out:
if (rc != -EAGAIN) {
@@ -974,34 +991,33 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
list_del(&page->lru);
dec_zone_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
-   /* Soft-offlined page shouldn't go through lru cache list */
-   if (reason == MR_MEMORY_FAILURE && rc == MIGRATEPAGE_SUCCESS) {
+   }
+
+   /*
+* If migration is successful, releases reference grabbed during
+* isolation. Otherwise, restore the page to right list unless
+* we want to retry.
+*/
+   if (rc == MIGRATEPAGE_SUCCESS) {
+   put_page(page);
+   if (reason == MR_MEMORY_FAILURE) {
/*
-* With this release, we free successfully migrated
-* page and set PG_HWPoison on just freed page
-* intentionally. Although it's rather weird, it's how
-* HWPoison flag works at the moment.
+* Set PG_HWPoison on just freed page
+* intentionally. Although it's rather weird,
+* it's how HWPoison flag works at the moment.
 */
-   put_page(page);
if (!test_set_page_hwpoison(page))
num_poisoned_pages_inc();
-   } else
+   }
+   } else {
+   if (rc != -EAGAIN)
putback_lru_page(page);
+   if (put_new_page)
+   

[PATCH v5 07/12] zsmalloc: factor page chain functionality out

2016-05-08 Thread Minchan Kim
For page migration, we need to create the page chain of a zspage
dynamically, so this patch factors that logic out of alloc_zspage.

Cc: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 mm/zsmalloc.c | 59 +++
 1 file changed, 35 insertions(+), 24 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 086fd65311f7..1855e75766c7 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -959,7 +959,8 @@ static void init_zspage(struct size_class *class, struct page *first_page)
unsigned long off = 0;
struct page *page = first_page;
 
-   VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+   first_page->freelist = NULL;
+   set_zspage_inuse(first_page, 0);
 
while (page) {
struct page *next_page;
@@ -995,15 +996,16 @@ static void init_zspage(struct size_class *class, struct page *first_page)
page = next_page;
off %= PAGE_SIZE;
}
+
+   set_freeobj(first_page, (unsigned long)location_to_obj(first_page, 0));
 }
 
-/*
- * Allocate a zspage for the given size class
- */
-static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
+static void create_page_chain(struct page *pages[], int nr_pages)
 {
-   int i, error;
-   struct page *first_page = NULL, *uninitialized_var(prev_page);
+   int i;
+   struct page *page;
+   struct page *prev_page = NULL;
+   struct page *first_page = NULL;
 
/*
 * Allocate individual pages and link them together as:
@@ -1016,20 +1018,14 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 * (i.e. no other sub-page has this flag set) and PG_private_2 to
 * identify the last page.
 */
-   error = -ENOMEM;
-   for (i = 0; i < class->pages_per_zspage; i++) {
-   struct page *page;
-
-   page = alloc_page(flags);
-   if (!page)
-   goto cleanup;
+   for (i = 0; i < nr_pages; i++) {
+   page = pages[i];
 
INIT_LIST_HEAD(&page->lru);
-   if (i == 0) {   /* first page */
+   if (i == 0) {
SetPagePrivate(page);
set_page_private(page, 0);
first_page = page;
-   set_zspage_inuse(first_page, 0);
}
if (i == 1)
set_page_private(first_page, (unsigned long)page);
@@ -1037,22 +1033,37 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
set_page_private(page, (unsigned long)first_page);
if (i >= 2)
list_add(&page->lru, &prev_page->lru);
-   if (i == class->pages_per_zspage - 1)   /* last page */
+   if (i == nr_pages - 1)
SetPagePrivate2(page);
prev_page = page;
}
+}
 
-   init_zspage(class, first_page);
+/*
+ * Allocate a zspage for the given size class
+ */
+static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
+{
+   int i;
+   struct page *first_page = NULL;
+   struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE];
 
-   set_freeobj(first_page, (unsigned long)location_to_obj(first_page, 0));
-   error = 0; /* Success */
+   for (i = 0; i < class->pages_per_zspage; i++) {
+   struct page *page;
 
-cleanup:
-   if (unlikely(error) && first_page) {
-   free_zspage(first_page);
-   first_page = NULL;
+   page = alloc_page(flags);
+   if (!page) {
+   while (--i >= 0)
+   __free_page(pages[i]);
+   return NULL;
+   }
+   pages[i] = page;
}
 
+   create_page_chain(pages, class->pages_per_zspage);
+   first_page = pages[0];
+   init_zspage(class, first_page);
+
return first_page;
 }
 
-- 
1.9.1



[PATCH V2 2/2] irqchip/gicv3-its: Implement two-level(indirect) device table support

2016-05-08 Thread Shanker Donthineni
Since device IDs are extremely sparse, a single (a.k.a. flat) table is
not sufficient, for the following two reasons.

1) According to the ARM GIC spec, the ITS hardware can access a maximum
   of 256 (pages) * 64K (page size) bytes. In the best case it supports
   up to a 21-bit sparse DeviceID space with the minimum device table
   entry size of 8 bytes.

2) The maximum memory size that can be allocated without memblock
   depends on MAX_ORDER: 4MB on a 4K page size kernel with the default
   MAX_ORDER, so it supports a DeviceID range of 19 bits.

The two-level device table feature brings us two advantages: the first
is a very high likelihood of supporting the full 32-bit sparse DeviceID
space, and the second is much better utilization of the allocated
memory.

The feature is enabled automatically during driver probe if a single
ITS page is not adequate for a flat table and the hardware is capable
of a two-level table walk.
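
A worked sizing example for the case described above (illustrative numbers
only, assuming an 8-byte device table entry, ids = 20 bits, an ITS page size
psz = 64KB and an 8-byte lvl1 entry):

  flat table size  : entry_size << ids = 8 << 20 = 8MB  (> psz, so try indirect)
  IDs per lvl2 page: ilog2(64K / 8)    = 13 bits covered by one lvl2 page
  lvl1 entries     : 1 << (20 - 13)    = 128
  lvl1 table size  : 128 * 8 bytes     = 1KB

so the lvl1 table fits easily in a single ITS page, and lvl2 pages only need
to be allocated for DeviceID ranges that are actually in use.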

Signed-off-by: Shanker Donthineni 
---

This patch is based on Marc Zyngier's branch 
https://git.kernel.org/cgit/linux/kernel/git/maz/arm-platforms.git/log/?h=irq/irqchip-4.7

I have tested the Indirection feature on Qualcomm Technologies QDF2XXX server 
platform.

Changes since v1:
  Most of this patch has been rewritten after refactoring its_alloc_tables().
  Always enable device two-level if the memory requirement is more than 
PAGE_SIZE.
  Fixed the coding bug that breaks on the BE machine.
  Edited the commit text.

 drivers/irqchip/irq-gic-v3-its.c | 100 ---
 1 file changed, 83 insertions(+), 17 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index b23e00c..27be792 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -938,6 +938,18 @@ retry_baser:
return 0;
 }
 
+/**
+ * Find out whether an implemented baser register supports a single, flat table
+ * or a two-level table by reading bit offset at '62' after writing '1' to it.
+ */
+static u64 its_baser_check_indirect(struct its_baser *baser)
+{
+   u64 val = GITS_BASER_InnerShareable | GITS_BASER_WaWb;
+
+   writeq_relaxed(val | GITS_BASER_INDIRECT, baser->hwreg);
+   return (readq_relaxed(baser->hwreg) & GITS_BASER_INDIRECT);
+}
+
 static int its_alloc_tables(const char *node_name, struct its_node *its)
 {
u64 typer = readq_relaxed(its->base + GITS_TYPER);
@@ -964,6 +976,7 @@ static int its_alloc_tables(const char *node_name, struct its_node *its)
u64 entry_size = GITS_BASER_ENTRY_SIZE(val);
int order = get_order(psz);
struct its_baser *baser = its->tables + i;
+   u64 indirect = 0;
 
if (type == GITS_BASER_TYPE_NONE)
continue;
@@ -977,17 +990,27 @@ static int its_alloc_tables(const char *node_name, struct its_node *its)
 * Allocate as many entries as required to fit the
 * range of device IDs that the ITS can grok... The ID
 * space being incredibly sparse, this results in a
-* massive waste of memory.
+* massive waste of memory if two-level device table
+* feature is not supported by hardware.
 *
 * For other tables, only allocate a single page.
 */
if (type == GITS_BASER_TYPE_DEVICE) {
-   /*
-* 'order' was initialized earlier to the default page
-* granule of the the ITS.  We can't have an allocation
-* smaller than that.  If the requested allocation
-* is smaller, round up to the default page granule.
-*/
+   if ((entry_size << ids) > psz)
+   indirect = its_baser_check_indirect(baser);
+
+   if (indirect) {
+   /*
+* The size of the lvl2 table is equal to ITS
+* page size which is 'psz'. For computing lvl1
+* table size, subtract ID bits that sparse
+* lvl2 table from 'ids' which is reported by
+* ITS hardware times lvl1 table entry size.
+*/
+   ids -= ilog2(psz / entry_size);
+   entry_size = GITS_LVL1_ENTRY_SIZE;
+   }
+
order = max(get_order(entry_size << ids), order);
if (order >= MAX_ORDER) {
order = MAX_ORDER - 1;
@@ -997,7 +1020,7 @@ static int its_alloc_tables(const char *node_name, struct its_node *its)
}
}
 
-   err = its_baser_setup(its, baser, order, 0);
+   err = its_baser_setup(its, baser, order, indirect);
if (err < 

[PATCH V2 1/2] irqchip/gicv3-its: split its_alloc_tables() into two functions

2016-05-08 Thread Shanker Donthineni
The function is getting out of control: it has too many goto statements
and would become too complicated once the two-level device table feature
is added. So it is time to clean it up and move some of the logic to a
separate function, without affecting the existing functionality.

Signed-off-by: Shanker Donthineni 
---

This patch is based on Marc Zyngier's branch 
https://git.kernel.org/cgit/linux/kernel/git/maz/arm-platforms.git/log/?h=irq/irqchip-4.7

 drivers/irqchip/irq-gic-v3-its.c   | 256 -
 include/linux/irqchip/arm-gic-v3.h |   3 +
 2 files changed, 144 insertions(+), 115 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 6bd881b..b23e00c 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -55,13 +55,15 @@ struct its_collection {
 };
 
 /*
- * The ITS_BASER structure - contains memory information and cached
- * value of BASER register configuration.
+ * The ITS_BASER structure - contains memory information, cached value
+ * of BASER register configuration, ioremaped address and page size.
  */
 struct its_baser {
+   void __iomem*hwreg;
void*base;
u64 val;
u32 order;
+   u32 psz;
 };
 
 /*
@@ -823,27 +825,135 @@ static void its_free_tables(struct its_node *its)
}
 }
 
+static int its_baser_setup(struct its_node *its, struct its_baser *baser,
+ u32 order, u64 indirect)
+{
+   u64 val = readq_relaxed(baser->hwreg);
+   u64 type = GITS_BASER_TYPE(val);
+   u64 entry_size = GITS_BASER_ENTRY_SIZE(val);
+   int psz, alloc_pages;
+   u64 cache, shr, tmp;
+   void *base;
+
+   /* Do first attempt with the requested attributes */
+   cache = baser->val & GITS_BASER_CACHEABILITY_MASK;
+   shr = baser->val & GITS_BASER_SHAREABILITY_MASK;
+   psz = baser->psz;
+
+retry_alloc_baser:
+   alloc_pages = (PAGE_ORDER_TO_SIZE(order) / psz);
+   if (alloc_pages > GITS_BASER_PAGES_MAX) {
+   pr_warn("ITS@%lx: %s too large, reduce ITS pages %u->%u\n",
+   its->phys_base, its_base_type_string[type],
+   alloc_pages, GITS_BASER_PAGES_MAX);
+   alloc_pages = GITS_BASER_PAGES_MAX;
+   order = get_order(GITS_BASER_PAGES_MAX * psz);
+   }
+
+   base = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
+   if (!base)
+   return -ENOMEM;
+
+retry_baser:
+   val = (virt_to_phys(base)|
+   (type << GITS_BASER_TYPE_SHIFT)  |
+   ((entry_size - 1) << GITS_BASER_ENTRY_SIZE_SHIFT) |
+   ((alloc_pages - 1) << GITS_BASER_PAGES_SHIFT)|
+   cache|
+   shr  |
+   indirect |
+   GITS_BASER_VALID);
+
+   switch (psz) {
+   case SZ_4K:
+   val |= GITS_BASER_PAGE_SIZE_4K;
+   break;
+   case SZ_16K:
+   val |= GITS_BASER_PAGE_SIZE_16K;
+   break;
+   case SZ_64K:
+   val |= GITS_BASER_PAGE_SIZE_64K;
+   break;
+   }
+
+   writeq_relaxed(val, baser->hwreg);
+   tmp = readq_relaxed(baser->hwreg);
+
+   if ((val ^ tmp) & GITS_BASER_SHAREABILITY_MASK) {
+   /*
+* Shareability didn't stick. Just use
+* whatever the read reported, which is likely
+* to be the only thing this redistributor
+* supports. If that's zero, make it
+* non-cacheable as well.
+*/
+   shr = tmp & GITS_BASER_SHAREABILITY_MASK;
+   if (!shr) {
+   cache = GITS_BASER_nC;
+   __flush_dcache_area(base, PAGE_ORDER_TO_SIZE(order));
+   }
+   goto retry_baser;
+   }
+
+   if ((val ^ tmp) & GITS_BASER_PAGE_SIZE_MASK) {
+   /*
+* Page size didn't stick. Let's try a smaller
+* size and retry. If we reach 4K, then
+* something is horribly wrong...
+*/
+   free_pages((unsigned long)base, order);
+   baser->base = NULL;
+
+   switch (psz) {
+   case SZ_16K:
+   psz = SZ_4K;
+   goto retry_alloc_baser;
+   case SZ_64K:
+   psz = SZ_16K;
+   goto retry_alloc_baser;
+   }
+   }
+
+   if (val != tmp) {
+   pr_err("ITS@%lx: %s doesn't stick: %lx %lx\n",
+  its->phys_base, its_base_type_string[type],
+  (unsigned long) val, (unsigned 

Re: [PATCH 1/4] signals/sigaltstack: If SS_AUTODISARM, bypass on_sig_stack

2016-05-08 Thread Stas Sergeev

09.05.2016 04:32, Andy Lutomirski wrote:

On May 7, 2016 7:38 AM, "Stas Sergeev"  wrote:

03.05.2016 20:31, Andy Lutomirski wrote:


If a signal stack is set up with SS_AUTODISARM, then the kernel
inherently avoids incorrectly resetting the signal stack if signals
recurse: the signal stack will be reset on the first signal
delivery.  This means that we don't need check the stack pointer
when delivering signals if SS_AUTODISARM is set.

This will make segmented x86 programs more robust: currently there's
a hole that could be triggered if ESP/RSP appears to point to the
signal stack but actually doesn't due to a nonzero SS base.

Signed-off-by: Stas Sergeev 
Cc: Al Viro 
Cc: Aleksa Sarai 
Cc: Amanieu d'Antras 
Cc: Andrea Arcangeli 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Denys Vlasenko 
Cc: Eric W. Biederman 
Cc: Frederic Weisbecker 
Cc: H. Peter Anvin 
Cc: Heinrich Schuchardt 
Cc: Jason Low 
Cc: Josh Triplett 
Cc: Konstantin Khlebnikov 
Cc: Linus Torvalds 
Cc: Oleg Nesterov 
Cc: Palmer Dabbelt 
Cc: Paul Moore 
Cc: Pavel Emelyanov 
Cc: Peter Zijlstra 
Cc: Richard Weinberger 
Cc: Sasha Levin 
Cc: Shuah Khan 
Cc: Tejun Heo 
Cc: Thomas Gleixner 
Cc: Vladimir Davydov 
Cc: linux-...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Andy Lutomirski 
---
   include/linux/sched.h | 12 
   1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2950c5cd3005..8f03a93348b9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2576,6 +2576,18 @@ static inline int kill_cad_pid(int sig, int priv)
*/
   static inline int on_sig_stack(unsigned long sp)
   {
+   /*
+* If the signal stack is AUTODISARM then, by construction, we
+* can't be on the signal stack unless user code deliberately set
+* SS_AUTODISARM when we were already on the it.

"on the it" -> "on it".

Anyway, I am a bit puzzled with this patch.
You say "unless user code deliberately set

SS_AUTODISARM when we were already on the it"
so what happens in case it actually does?


Stack corruption.  Don't do that.

Only after your change, I have to admit. :)


Without your patch: if user sets up the same sas - no stack switch.
if user sets up different sas - stack switch on nested signal.

With your patch: stack switch in any case, so if user
set up same sas - stack corruption by nested signal.

Or am I missing the intention?

The intention is to make everything completely explicit.  With
SS_AUTODISARM, the kernel knows directly whether you're on the signal
stack, and there should be no need to look at sp.  If you set
SS_AUTODISARM and get a signal, the signal stack gets disarmed.  If
you take a nested signal, it's delivered normally.  When you return
all the way out, the signal stack is re-armed.

For DOSEMU, this means that no 16-bit register state can possibly
cause a signal to be delivered wrong, because the register state when
a signal is raised won't affect delivery, which seems like a good
thing to me.

Yes, but doesn't affect dosemu1 which doesn't use SS_AUTODISARM.
So IMHO the SS check should still be added, even if not for dosemu2.


If this behavior would be problematic for you, can you explain why?

Only theoretically: if someone sets SS_AUTODISARM inside a
sighandler. Since this doesn't give EPERM, I wouldn't deliberately
make it a broken scenario (esp if it wasn't before the particular change).
Ideally it would give EPERM, but we can't, so doesn't matter much.
I just wanted to warn about the possible regression.


RE: [PATCH v2 0/3] net: ethtool: add ethtool_op_{get|set}_link_ksettings

2016-05-08 Thread Fugang Duan
From: Philippe Reynes  Sent: Monday, May 09, 2016 5:45 AM
> To: Fugang Duan ; da...@davemloft.net;
> b...@decadent.org.uk; kan.li...@intel.com; de...@googlers.com;
> adu...@mirantis.com; j...@mellanox.com; jacob.e.kel...@intel.com;
> t...@herbertland.com; and...@lunn.ch
> Cc: net...@vger.kernel.org; linux-kernel@vger.kernel.org; Philippe Reynes
> 
> Subject: [PATCH v2 0/3] net: ethtool: add ethtool_op_{get|set}_link_ksettings
> 
> Ethtool callbacks {get|set}_link_ksettings may be the same for many drivers. 
> So
> we add two generics callbacks ethtool_op_{get|set}_link_ksettings.
> 
> To use those generics callbacks, the ethernet driver must use the pointer
> phydev contained in struct net_device, and not use a private structure to 
> store
> this pointer.
> 
> Changelog:
> v2:
> - use generic function instead of macro
> - ethernet driver use the pointer phydev provided by struct net_device
>   Those idea were provided by Ben Hutchings,
>   and Florian Fainelli acknowledge them.
> 
> Philippe Reynes (3):
>   net: core: ethtool: add ethtool_op_{get|set}_link_ksettings
>   net: ethernet: fec: use phydev from struct net_device
>   net: ethernet: fec: use ethtool_op_{get|set}_link_ksettings
> 
>  drivers/net/ethernet/freescale/fec.h  |1 -
>  drivers/net/ethernet/freescale/fec_main.c |   71 
> +
>  include/linux/ethtool.h   |5 ++
>  net/core/ethtool.c|   24 ++
>  4 files changed, 50 insertions(+), 51 deletions(-)
> 
> --
> 1.7.4.4

Acked-by: Fugang Duan 


RE: EXT4 bad block - ext4_xattr_block_get

2016-05-08 Thread Lay, Kuan Loon
Hi,

I am not getting the bad block message after disabling metadata_csum.

Best Regards,
Lay

> -Original Message-
> From: Philipp Hahn [mailto:pmh...@pmhahn.de]
> Sent: Monday, May 2, 2016 2:43 PM
> To: Lay, Kuan Loon ; ty...@mit.edu;
> adilger.ker...@dilger.ca; linux-e...@vger.kernel.org; linux-
> ker...@vger.kernel.org
> Subject: Re: EXT4 bad block - ext4_xattr_block_get
> 
> Hello,
> 
> Am 28.04.2016 um 11:44 schrieb Lay, Kuan Loon:
> > I encounter random bad block on different file, the message looks like
> "EXT4-fs error (device mmcblk0p14): ext4_xattr_block_get:298: inode #77:
> comm (syslogd): bad block 7288".
> 
> Interesting; I posted a similar bug report on 2016-04-19 titles  [BUG 4.1.11]
> EXT4-fs error: ext4_xattr_block_get:299 - Remounting filesystem read-only
> 
> I never got a reply.
> 
> > I am using mke2fs 1.43-WIP (18-May-2015) and I saw this message
> "Suggestion: Use Linux kernel >= 3.18 for improved stability of the metadata
> and journal checksum features." print out.
> >
> > My current kernel version is 3.14.55, what patch I need to backport to solve
> the bad block issue?
> 
> That one happened on 4.1.11 on a virtual machine running inside VMware-
> ESX after a hardware change. Last change was to disabled the pvscsi drivers
> again; the system seems to be running fine since 1 week, but the first time it
> took 1 month to notice the corruption, so we're not yet sure that the
> problem is solved.
> 
> Philipp


[PATCH v3 1/2] perf tools: Support reading from backward ring buffer

2016-05-08 Thread Wang Nan
perf_evlist__mmap_read_backward() is introduced for reading a backward
ring buffer. Since the direction for reading such a ring buffer is the
opposite of the direction the kernel writes to it, and since the user
needs to fetch the most recent records from it,
perf_evlist__mmap_read_catchup() is introduced to move the read pointer
to the end of the buffer.
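
A minimal sketch of the intended consumer-side usage (illustrative;
process_event() stands for whatever the caller does with each record):

	union perf_event *event;

	/* Move the read pointer to the most recent data first... */
	perf_evlist__mmap_read_catchup(evlist, idx);

	/* ...then walk the records from newest to oldest. */
	while ((event = perf_evlist__mmap_read_backward(evlist, idx)) != NULL)
		process_event(event);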

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
---
 tools/perf/util/evlist.c | 50 
 tools/perf/util/evlist.h |  4 
 2 files changed, 54 insertions(+)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 17cd014..c4bfe11 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -766,6 +766,56 @@ union perf_event *perf_evlist__mmap_read(struct 
perf_evlist *evlist, int idx)
return perf_mmap__read(md, evlist->overwrite, old, head, &md->prev);
 }
 
+union perf_event *
+perf_evlist__mmap_read_backward(struct perf_evlist *evlist, int idx)
+{
+   struct perf_mmap *md = &evlist->mmap[idx];
+   u64 head, end;
+   u64 start = md->prev;
+
+   /*
+* Check if event was unmapped due to a POLLHUP/POLLERR.
+*/
+   if (!atomic_read(&md->refcnt))
+   return NULL;
+
+   head = perf_mmap__read_head(md);
+   if (!head)
+   return NULL;
+
+   /*
+* 'head' pointer starts from 0. Kernel minus sizeof(record) form
+* it each time when kernel writes to it, so in fact 'head' is
+* negative. 'end' pointer is made manually by adding the size of
+* the ring buffer to 'head' pointer, means the validate data can
+* read is the whole ring buffer. If 'end' is positive, the ring
+* buffer has not fully filled, so we must adjust 'end' to 0.
+*
+* However, since both 'head' and 'end' is unsigned, we can't
+* simply compare 'end' against 0. Here we compare '-head' and
+* the size of the ring buffer, where -head is the number of bytes
+* kernel write to the ring buffer.
+*/
+   if (-head < (u64)(md->mask + 1))
+   end = 0;
+   else
+   end = head + md->mask + 1;
+
+   return perf_mmap__read(md, false, start, end, &md->prev);
+}
+
+void perf_evlist__mmap_read_catchup(struct perf_evlist *evlist, int idx)
+{
+   struct perf_mmap *md = &evlist->mmap[idx];
+   u64 head;
+
+   if (!atomic_read(&md->refcnt))
+   return;
+
+   head = perf_mmap__read_head(md);
+   md->prev = head;
+}
+
 static bool perf_mmap__empty(struct perf_mmap *md)
 {
return perf_mmap__read_head(md) == md->prev && !md->auxtrace_mmap.base;
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 208897a..85d1b59 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -129,6 +129,10 @@ struct perf_sample_id *perf_evlist__id2sid(struct 
perf_evlist *evlist, u64 id);
 
 union perf_event *perf_evlist__mmap_read(struct perf_evlist *evlist, int idx);
 
+union perf_event *perf_evlist__mmap_read_backward(struct perf_evlist *evlist,
+ int idx);
+void perf_evlist__mmap_read_catchup(struct perf_evlist *evlist, int idx);
+
 void perf_evlist__mmap_consume(struct perf_evlist *evlist, int idx);
 
 int perf_evlist__open(struct perf_evlist *evlist);
-- 
1.8.3.4



[PATCH v3 2/2] perf tests: Add test to check backward ring buffer

2016-05-08 Thread Wang Nan
This test checks reading from backward ring buffer.

Test result:

 # ~/perf test 'ring buffer'
 45: Test backward reading from ring buffer   : Ok

Test case is a while loop which calls prctl(PR_SET_NAME) multiple
times. Each prctl should issue 2 events: one PERF_RECORD_SAMPLE,
one PERF_RECORD_COMM.

The first round creates a relatively large ring buffer (256 pages),
which can hold all the events. Read from it and check the count of each
type of event.

The second round creates a small ring buffer (1 page) and makes it
overwritable. Check the correctness of the buffer.

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
---
 tools/perf/tests/Build  |   1 +
 tools/perf/tests/backward-ring-buffer.c | 151 
 tools/perf/tests/builtin-test.c |   4 +
 tools/perf/tests/tests.h|   1 +
 4 files changed, 157 insertions(+)
 create mode 100644 tools/perf/tests/backward-ring-buffer.c

diff --git a/tools/perf/tests/Build b/tools/perf/tests/Build
index 449fe97..66a2898 100644
--- a/tools/perf/tests/Build
+++ b/tools/perf/tests/Build
@@ -38,6 +38,7 @@ perf-y += cpumap.o
 perf-y += stat.o
 perf-y += event_update.o
 perf-y += event-times.o
+perf-y += backward-ring-buffer.o
 
 $(OUTPUT)tests/llvm-src-base.c: tests/bpf-script-example.c tests/Build
$(call rule_mkdir)
diff --git a/tools/perf/tests/backward-ring-buffer.c b/tools/perf/tests/backward-ring-buffer.c
new file mode 100644
index 000..d9ba991
--- /dev/null
+++ b/tools/perf/tests/backward-ring-buffer.c
@@ -0,0 +1,151 @@
+/*
+ * Test backward bit in event attribute, read ring buffer from end to
+ * beginning
+ */
+
+#include 
+#include 
+#include 
+#include "tests.h"
+#include "debug.h"
+
+#define NR_ITERS 111
+
+static void testcase(void)
+{
+   int i;
+
+   for (i = 0; i < NR_ITERS; i++) {
+   char proc_name[10];
+
+   snprintf(proc_name, sizeof(proc_name), "p:%d\n", i);
+   prctl(PR_SET_NAME, proc_name);
+   }
+}
+
+static int count_samples(struct perf_evlist *evlist, int *sample_count,
+int *comm_count)
+{
+   int i;
+
+   for (i = 0; i < evlist->nr_mmaps; i++) {
+   union perf_event *event;
+
+   perf_evlist__mmap_read_catchup(evlist, i);
+   while ((event = perf_evlist__mmap_read_backward(evlist, i)) != NULL) {
+   const u32 type = event->header.type;
+
+   switch (type) {
+   case PERF_RECORD_SAMPLE:
+   (*sample_count)++;
+   break;
+   case PERF_RECORD_COMM:
+   (*comm_count)++;
+   break;
+   default:
+   pr_err("Unexpected record of type %d\n", type);
+   return TEST_FAIL;
+   }
+   }
+   }
+   return TEST_OK;
+}
+
+static int do_test(struct perf_evlist *evlist, int mmap_pages,
+  int *sample_count, int *comm_count)
+{
+   int err;
+   char sbuf[STRERR_BUFSIZE];
+
+   err = perf_evlist__mmap(evlist, mmap_pages, true);
+   if (err < 0) {
+   pr_debug("perf_evlist__mmap: %s\n",
+strerror_r(errno, sbuf, sizeof(sbuf)));
+   return TEST_FAIL;
+   }
+
+   perf_evlist__enable(evlist);
+   testcase();
+   perf_evlist__disable(evlist);
+
+   err = count_samples(evlist, sample_count, comm_count);
+   perf_evlist__munmap(evlist);
+   return err;
+}
+
+
+int test__backward_ring_buffer(int subtest __maybe_unused)
+{
+   int ret = TEST_SKIP, err, sample_count = 0, comm_count = 0;
+   char pid[16], sbuf[STRERR_BUFSIZE];
+   struct perf_evlist *evlist;
+   struct perf_evsel *evsel __maybe_unused;
+   struct parse_events_error parse_error;
+   struct record_opts opts = {
+   .target = {
+   .uid = UINT_MAX,
+   .uses_mmap = true,
+   },
+   .freq = 0,
+   .mmap_pages   = 256,
+   .default_interval = 1,
+   };
+
+   snprintf(pid, sizeof(pid), "%d", getpid());
+   pid[sizeof(pid) - 1] = '\0';
+   opts.target.tid = opts.target.pid = pid;
+
+   evlist = perf_evlist__new();
+   if (!evlist) {
+   pr_debug("No ehough memory to create evlist\n");
+   return TEST_FAIL;
+   }
+
+   err = perf_evlist__create_maps(evlist, &opts.target);
+   if (err < 0) {
+   pr_debug("Not enough memory to create thread/cpu maps\n");
+   goto out_delete_evlist;
+   }
+
+   bzero(&parse_error, sizeof(parse_error));
+   err = parse_events(evlist, 

[PATCH v3 0/2] perf tools: Backward ring buffer support

2016-05-08 Thread Wang Nan
Commit 9ecda41acb97 ("perf/core: Add ::write_backward attribute to
perf event") introduces backward ring buffer. This 2 patches add basic
support for reading from it, and add a new test case for it.

v2 -> v3:
  Improve commit message, add more comments (patch 1/2). patch 1-2/4 in
  v2 have been collected so remove them.

v1 -> v2:
  Patch 1/5 in v1 has been collected by perf/core, so remove it;
  Change function names in patch 2/5 in v1 (1/4 in v2):
  __perf_evlist__mmap_read -> perf_mmap__read

Wang Nan (2):
  perf tools: Support reading from backward ring buffer
  perf tests: Add test to check backward ring buffer

 tools/perf/tests/Build  |   1 +
 tools/perf/tests/backward-ring-buffer.c | 151 
 tools/perf/tests/builtin-test.c |   4 +
 tools/perf/tests/tests.h|   1 +
 tools/perf/util/evlist.c|  50 +++
 tools/perf/util/evlist.h|   4 +
 6 files changed, 211 insertions(+)
 create mode 100644 tools/perf/tests/backward-ring-buffer.c

-- 
1.8.3.4



Re: [PATCH 1/6] statx: Add a system call to make enhanced file info available

2016-05-08 Thread Dave Chinner
[ OT, but I'll reply anyway :P ]

On Fri, May 06, 2016 at 02:29:23PM -0400, J. Bruce Fields wrote:
> On Thu, May 05, 2016 at 08:56:02AM +1000, Dave Chinner wrote:
> > In the latest XFS filesystem format, we randomise the generation
> > value during every inode allocation to make it hard to guess the
> > handle of adjacent inodes from an existing ino+gen pair, or even
> > from life time to life time of the same inode.
> 
> The one thing I wonder about is whether that increases the probability
> of a filehandle collision (where you accidentally generate the same
> filehandle for two different files).

Not possible - inode number is still different between the two
files. i.e. ino+gen makes the handle unique, not gen.

> If the generation number is a 32-bit counter per inode number (is that
> actually the way filesystems work?), then it takes 2^32 reuses of the
> inode number to hit the same filehandle.

4 billion unlink/create operations that hit the same inode number
are going to take some time. I suspect someone will notice the load
generated by an attmept to brute force this sort of thing ;)

> If you choose it randomly then
> you expect a collision after about 2^16 reuses.

I'm pretty sure that a random search will need to, on average,
search half the keyspace before a match is found (i.e. 2^31
attempts, not 2^16).
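
(For concreteness: brute-forcing one specific 32-bit generation value takes
on the order of 2^32 / 2 = 2^31 attempts on average, whereas ~sqrt(2^32) =
2^16 is the birthday bound for two independently chosen random generations
to collide with each other - the two figures answer different questions.)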

> > >  If the caller didn't ask for them, then they may be approximated.  
> > > For
> > >  example, NFS won't waste any time updating them from the server, 
> > > unless
> > >  as a byproduct of updating something requested.
> > 
> > I would suggest that exposing them from the NFS server is something
> > we most definitely don't want to do because they are the only thing
> > that keeps remote users from guessing filehandles with ease
> 
> The first line of defense is not to depend on unguessable filehandles.
> (Don't export sudirectories unless you're willing to export the whole
> filesystem; and don't depend on directory permissions to keep children
> secret.)

Defense in depth also says "don't make it easy to guess filehandles"
because not everyone knows this is a problem. In many cases, users
may not even know what constitutes a "filesystem" because their NFS
server appliance only defines "exports". The underlying
implementation may, in fact, be "everything exported from a single
filesystem" and so the user has no choice in the matter


Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/4] selftests/sigaltstack: Fix the sas test on old kernels

2016-05-08 Thread Andy Lutomirski
On May 7, 2016 8:02 AM, "Stas Sergeev"  wrote:
>
> 03.05.2016 20:31, Andy Lutomirski wrote:
>
>> The handling for old kernels was wrong.  Fix it.
>>
>> Reported-by: Ingo Molnar 
>> Cc: Stas Sergeev 
>> Cc: Al Viro 
>> Cc: Andrew Morton 
>> Cc: Andy Lutomirski 
>> Cc: Borislav Petkov 
>> Cc: Brian Gerst 
>> Cc: Denys Vlasenko 
>> Cc: H. Peter Anvin 
>> Cc: Linus Torvalds 
>> Cc: Oleg Nesterov 
>> Cc: Pavel Emelyanov 
>> Cc: Peter Zijlstra 
>> Cc: Shuah Khan 
>> Cc: Thomas Gleixner 
>> Cc: linux-...@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Andy Lutomirski 
>> ---
>>   tools/testing/selftests/sigaltstack/sas.c | 21 ++---
>>   1 file changed, 14 insertions(+), 7 deletions(-)
>>
>> diff --git a/tools/testing/selftests/sigaltstack/sas.c b/tools/testing/selftests/sigaltstack/sas.c
>> index 57da8bfde60b..a98c3ef8141f 100644
>> --- a/tools/testing/selftests/sigaltstack/sas.c
>> +++ b/tools/testing/selftests/sigaltstack/sas.c
>> @@ -15,6 +15,7 @@
>>   #include 
>>   #include 
>>   #include 
>> +#include 
>> #ifndef SS_AUTODISARM
>>   #define SS_AUTODISARM  (1 << 4)
>> @@ -117,13 +118,19 @@ int main(void)
>> stk.ss_flags = SS_ONSTACK | SS_AUTODISARM;
>> err = sigaltstack(, NULL);
>> if (err) {
>> -   perror("[FAIL]\tsigaltstack(SS_ONSTACK | SS_AUTODISARM)");
>> -   stk.ss_flags = SS_ONSTACK;
>> -   }
>> -   err = sigaltstack(, NULL);
>> -   if (err) {
>> -   perror("[FAIL]\tsigaltstack(SS_ONSTACK)");
>> -   return EXIT_FAILURE;
>> +   if (errno == EINVAL) {
>> +   printf("[NOTE]\tThe running kernel doesn't support 
>> SS_AUTODISARM\n");
>> +   /*
>> +* If test cases for the !SS_AUTODISARM variant were
>> +* added, we could still run them.  We don't have any
>> +* test cases like that yet, so just exit and report
>> +* success.
>> +*/
>
> But that was the point, please see how it handles the
> old kernels:
>
> $ ./sas
> [FAIL]sigaltstack(SS_ONSTACK | SS_AUTODISARM): Invalid argument
> [RUN]signal USR1
> [FAIL]ss_flags=1, should be SS_DISABLE
> [RUN]switched to user ctx
> [RUN]signal USR2
> [FAIL]sigaltstack re-used
> [FAIL]Stack corrupted
> [RUN]Aborting

This is useful as a demonstration of why the feature is useful, but it
doesn't indicate that anything is wrong with old kernels per se.
That's why I changed it to simply report that the feature is missing.


Re: [PATCH 1/4] signals/sigaltstack: If SS_AUTODISARM, bypass on_sig_stack

2016-05-08 Thread Andy Lutomirski
On May 7, 2016 7:38 AM, "Stas Sergeev"  wrote:
>
> 03.05.2016 20:31, Andy Lutomirski wrote:
>
>> If a signal stack is set up with SS_AUTODISARM, then the kernel
>> inherently avoids incorrectly resetting the signal stack if signals
>> recurse: the signal stack will be reset on the first signal
>> delivery.  This means that we don't need check the stack pointer
>> when delivering signals if SS_AUTODISARM is set.
>>
>> This will make segmented x86 programs more robust: currently there's
>> a hole that could be triggered if ESP/RSP appears to point to the
>> signal stack but actually doesn't due to a nonzero SS base.
>>
>> Signed-off-by: Stas Sergeev 
>> Cc: Al Viro 
>> Cc: Aleksa Sarai 
>> Cc: Amanieu d'Antras 
>> Cc: Andrea Arcangeli 
>> Cc: Andrew Morton 
>> Cc: Andy Lutomirski 
>> Cc: Borislav Petkov 
>> Cc: Brian Gerst 
>> Cc: Denys Vlasenko 
>> Cc: Eric W. Biederman 
>> Cc: Frederic Weisbecker 
>> Cc: H. Peter Anvin 
>> Cc: Heinrich Schuchardt 
>> Cc: Jason Low 
>> Cc: Josh Triplett 
>> Cc: Konstantin Khlebnikov 
>> Cc: Linus Torvalds 
>> Cc: Oleg Nesterov 
>> Cc: Palmer Dabbelt 
>> Cc: Paul Moore 
>> Cc: Pavel Emelyanov 
>> Cc: Peter Zijlstra 
>> Cc: Richard Weinberger 
>> Cc: Sasha Levin 
>> Cc: Shuah Khan 
>> Cc: Tejun Heo 
>> Cc: Thomas Gleixner 
>> Cc: Vladimir Davydov 
>> Cc: linux-...@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Andy Lutomirski 
>> ---
>>   include/linux/sched.h | 12 
>>   1 file changed, 12 insertions(+)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 2950c5cd3005..8f03a93348b9 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2576,6 +2576,18 @@ static inline int kill_cad_pid(int sig, int priv)
>>*/
>>   static inline int on_sig_stack(unsigned long sp)
>>   {
>> +   /*
>> +* If the signal stack is AUTODISARM then, by construction, we
>> +* can't be on the signal stack unless user code deliberately set
>> +* SS_AUTODISARM when we were already on the it.
>
> "on the it" -> "on it".
>
> Anyway, I am a bit puzzled with this patch.
> You say "unless user code deliberately set
>
> SS_AUTODISARM when we were already on the it"
> so what happens in case it actually does?
>

Stack corruption.  Don't do that.

> Without your patch: if user sets up the same sas - no stack switch.
> if user sets up different sas - stack switch on nested signal.
>
> With your patch: stack switch in any case, so if user
> set up same sas - stack corruption by nested signal.
>
> Or am I missing the intention?

The intention is to make everything completely explicit.  With
SS_AUTODISARM, the kernel knows directly whether you're on the signal
stack, and there should be no need to look at sp.  If you set
SS_AUTODISARM and get a signal, the signal stack gets disarmed.  If
you take a nested signal, it's delivered normally.  When you return
all the way out, the signal stack is re-armed.

For DOSEMU, this means that no 16-bit register state can possibly
cause a signal to be delivered wrong, because the register state when
a signal is raised won't affect delivery, which seems like a good
thing to me.

If this behavior would be problematic for you, can you explain why?
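
A minimal illustration of that life cycle (illustrative only; error handling
omitted, <signal.h> assumed, SS_AUTODISARM taken from the uapi headers added
by this series):

	static char altstack_mem[SIGSTKSZ];	/* backing store for the sas */

	stack_t ss = {
		.ss_sp    = altstack_mem,
		.ss_size  = sizeof(altstack_mem),
		.ss_flags = SS_AUTODISARM,	/* disarm on first delivery */
	};
	sigaltstack(&ss, NULL);
	/*
	 * The first signal runs on altstack_mem and disarms the sas; a
	 * nested signal is delivered on whatever stack is current; once
	 * the first handler returns, the sas is re-armed.
	 */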


Re: [PATCH 0/6] Intel Secure Guard Extensions

2016-05-08 Thread Andy Lutomirski
On May 8, 2016 2:59 AM, "Dr. Greg Wettstein"  wrote:
>
>
> This now means the security of SGX on 'unlocked' platforms, at least
> from a trust perspective, will be dependent on using TXT so as to
> provide a hardware root of trust on which to base the SGX trust model.

Can you explain what you mean by "trust"?  In particular, what kind of
"trust" would you have with a verified or trusted boot plus verified
SGX launch root key that you would not have in the complete absence of
hardware launch control.

I've heard multiple people say they want launch control but I've never
heard a cogent explanation of a threat model in which it's useful.

> I would assume that everyone is using signed Launch Control Policies
> (LCP) as we are.  This means that TXT/tboot already has access to the
> public key which is used for the LCP data file signature.  It would
> seem logical to have tboot compute the signature on that public key
> and program that signature into the module signature registers.  That
> would tie the hardware root of trust to the SGX root of trust.

Now I'm confused.  TXT, in theory*, lets you establish a good root of
trust for TPM PCR measurements.  So, with TXT, if you had one-shop
launch control MSRs, you could attest that you've locked the launch
control policy.

But what do you gain by doing such a thing?  All you're actually
attesting is that you locked it until the next reboot.  Someone who
subsequently compromises you can reboot you, bypass TXT on the next
boot, and launch any enclave they want.  In any event, SGX is supposed
to make it so that your enclaves remain secure regardless of what
happens to the kernel, so I'm at a loss for what you're trying to do.

* There are some serious concerns about the security of TXT.  SGX is
supposed to be better.

--Andy


linux-next: manual merge of the drm tree with Linus' tree

2016-05-08 Thread Stephen Rothwell
Hi Dave,

Today's linux-next merge of the drm tree got a conflict in:

  drivers/gpu/drm/ttm/ttm_bo.c

between commit:

  56fc350224f1 ("drm/ttm: fix kref count mess in ttm_bo_move_to_lru_tail")

from Linus' tree and commits:

  c3ea576e0583 ("drm/ttm: add optional LRU removal callback v2")
  98c2872ae99b ("drm/ttm: implement LRU add callbacks v2")

from the drm tree.

I fixed it up (I have no idea how to fix merge these changes, so I just
used the latter ones) and can carry the fix as necessary. This is now fixed
as far as linux-next is concerned, but any non trivial conflicts should
be mentioned to your upstream maintainer when your tree is submitted for
merging.  You may also want to consider cooperating with the maintainer
of the conflicting tree to minimise any particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell


linux-next: manual merge of the drm tree with Linus' tree

2016-05-08 Thread Stephen Rothwell
Hi Dave,

Today's linux-next merge of the drm tree got a conflict in:

  drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c

between commit:

  562e2689baeb ("amdgpu/uvd: add uvd fw version for amdgpu")

from Linus' tree and commit:

  c036554170fc ("drm/amdgpu: handle more than 10 UVD sessions (v2)")

from the drm tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
index 871018c634e0,db86012deb67..
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
@@@ -158,11 -171,19 +171,22 @@@ int amdgpu_uvd_sw_init(struct amdgpu_de
DRM_INFO("Found UVD firmware Version: %hu.%hu Family ID: %hu\n",
version_major, version_minor, family_id);
  
 +  adev->uvd.fw_version = ((version_major << 24) | (version_minor << 16) |
 +  (family_id << 8));
 +
+   /*
+* Limit the number of UVD handles depending on microcode major
+* and minor versions. The firmware version which has 40 UVD
+* instances support is 1.80. So all subsequent versions should
+* also have the same support.
+*/
+   if ((version_major > 0x01) ||
+   ((version_major == 0x01) && (version_minor >= 0x50)))
+   adev->uvd.max_handles = AMDGPU_MAX_UVD_HANDLES;
+ 
bo_size = AMDGPU_GPU_PAGE_ALIGN(le32_to_cpu(hdr->ucode_size_bytes) + 8)
-+  AMDGPU_UVD_STACK_SIZE + AMDGPU_UVD_HEAP_SIZE;
+ +  AMDGPU_UVD_STACK_SIZE + AMDGPU_UVD_HEAP_SIZE
+ +  AMDGPU_UVD_SESSION_SIZE * adev->uvd.max_handles;
r = amdgpu_bo_create(adev, bo_size, PAGE_SIZE, true,
 AMDGPU_GEM_DOMAIN_VRAM,
 AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED,


Re: [PATCH] Use pid_t instead of int

2016-05-08 Thread Andy Lutomirski
On Sun, May 8, 2016 at 12:38 PM, René Nyffenegger
 wrote:
> Use pid_t instead of int in the declarations of sys_kill, sys_tgkill,
> sys_tkill and sys_rt_sigqueueinfo in include/linux/syscalls.h

The description is no good.  *Why* are you changing it?

I checked tgkill and, indeed, tgkill takes pid_t parameters, so this
fixes an incorrect declaration.  I'm wondering why the code compiles
without warning.  Is SYSCALL_DEFINE too lenient for some reason?  Or
is pid_t just defined as int?
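
For reference, in the generic kernel headers pid_t does boil down to a plain
int, which would explain why the mismatched declarations compile without any
warning; paraphrased (not an exact copy of the headers):

	/* from include/uapi/asm-generic/posix_types.h (typical configs) */
	typedef int             __kernel_pid_t;
	/* from include/linux/types.h */
	typedef __kernel_pid_t  pid_t;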

--Andy


Re: [PATCHv3 0/2] target: make location of /var/targets configurable

2016-05-08 Thread Lee Duncan
On 04/14/2016 06:18 PM, Lee Duncan wrote:
> These patches make the location of "/var/target" configurable,
> though it still defaults to "/var/target".
> 
> This "target database directory" can only be changed
> after the target_core_mod loads but before any
> fabric drivers are loaded, and must be the pathname
> of an existing directory.
> 
> This configuration is accomplished via the configfs
> top-level target attribute "dbroot", i.e. dumping
> out "/sys/kernel/config/target/dbroot" will normally
> return "/var/target". Writing to this attribute
> changes the location where the kernel looks for the
> target database.
> 
> The first patch creates this configurable value for
> the "dbroot", and the second patch modifies users
> of this directory to use this new attribute.
> 
> Changes from v2:
>  * Add locking around access to target driver list
> 
> Changes from v1:
>  * Only allow changing target DB root before it
>can be used by others
>  * Validate that new DB root is a valid directory
> 
> Lee Duncan (2):
>   target: make target db location configurable
>   target: use new "dbroot" target attribute
> 
>  drivers/target/target_core_alua.c |  6 ++--
>  drivers/target/target_core_configfs.c | 62 
> +++
>  drivers/target/target_core_internal.h |  6 
>  drivers/target/target_core_pr.c   |  2 +-
>  4 files changed, 72 insertions(+), 4 deletions(-)
> 

Ping?
-- 
Lee Duncan



Re: [PATCH v2 3/4] perf tools: Support reading from backward ring buffer

2016-05-08 Thread Wangnan (F)



On 2016/5/6 21:40, Wangnan (F) wrote:



On 2016/5/6 4:07, Arnaldo Carvalho de Melo wrote:

Em Wed, Apr 27, 2016 at 02:19:22AM +, Wang Nan escreveu:

perf_evlist__mmap_read_backward() is introduced for reading a backward
ring buffer. Unlike reading forward, the caller needs to call
perf_evlist__mmap_read_catchup() before reading.

A backward ring buffer should be read from the 'head' pointer, not from '0'.
perf_evlist__mmap_read_catchup() saves 'head' to 'md->prev', which then
remembers the start position after each read.

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Cc: Peter Zijlstra 
Cc: Zefan Li 
Cc: pi3or...@163.com
---
  tools/perf/util/evlist.c | 39 +++
  tools/perf/util/evlist.h |  4 
  2 files changed, 43 insertions(+)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 17cd014..2e0b7b0 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -766,6 +766,45 @@ union perf_event *perf_evlist__mmap_read(struct 
perf_evlist *evlist, int idx)
  return perf_mmap__read(md, evlist->overwrite, old, head, 
&md->prev);

  }
  +union perf_event *
+perf_evlist__mmap_read_backward(struct perf_evlist *evlist, int idx)
+{
+struct perf_mmap *md = &evlist->mmap[idx];
+u64 head, end;
+u64 start = md->prev;
+
+/*
+ * Check if event was unmapped due to a POLLHUP/POLLERR.
+ */
+if (!atomic_read(&md->refcnt))
+return NULL;
+
+/* NOTE: head is negative in this case */

What is this comment for, can you elaborate? Are you doubly sure this
arithmetic with u64, negative values, and the -head below are all ok?


Yes. In a backward ring buffer, the kernel writes data starting from '0'.
Each time it writes a record, it subtracts sizeof(record) from the 'head'
pointer. So '-head' is the number of bytes already written.
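
As a concrete illustration: with a 64KiB ring buffer, after the kernel has
written 0x3000 bytes of records, head is (u64)-0x3000, so -head == 0x3000,
the number of valid bytes. A caller-side sketch of the intended usage (not
code from this patch; process_event() is a made-up consumer):

	union perf_event *event;

	/* snapshot 'head' into md->prev before walking the buffer */
	perf_evlist__mmap_read_catchup(evlist, idx);

	/* returns the most recent record first, then older ones, then NULL */
	while ((event = perf_evlist__mmap_read_backward(evlist, idx)) != NULL)
		process_event(event);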


I've applied the first two patches in this series.

I also need to check why we now need that catchup thing :-\


I think catchup is necessary. I tried to get rid of it by setting
md->prev to -1 and initializing it to 0 when reading forward and
to head when reading backward, but that requires too many code
adjustments.



In addition, a reader of the backward ring buffer tends to fetch the most
recent record. Since each read returns an earlier record, a catchup function
is always required.

Thank you.

+head = perf_mmap__read_head(md);
+
+if (!head)
+return NULL;
+
+end = head + md->mask + 1;
+
+if ((end - head) > -head)
+end = 0;
+


This '-head' is used to detect wrapping; 'head' should never be treated as a positive pointer here.


+return perf_mmap__read(md, false, start, end, &md->prev);
+}
+
+void perf_evlist__mmap_read_catchup(struct perf_evlist *evlist, int 
idx)

+{
+struct perf_mmap *md = &evlist->mmap[idx];
+u64 head;
+
+if (!atomic_read(&md->refcnt))
+return;
+
+head = perf_mmap__read_head(md);
+md->prev = head;
+}
+
  static bool perf_mmap__empty(struct perf_mmap *md)
  {
  return perf_mmap__read_head(md) == md->prev && 
!md->auxtrace_mmap.base;

diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 208897a..85d1b59 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -129,6 +129,10 @@ struct perf_sample_id 
*perf_evlist__id2sid(struct perf_evlist *evlist, u64 id);
union perf_event *perf_evlist__mmap_read(struct perf_evlist 
*evlist, int idx);
  +union perf_event *perf_evlist__mmap_read_backward(struct 
perf_evlist *evlist,

+  int idx);
+void perf_evlist__mmap_read_catchup(struct perf_evlist *evlist, int 
idx);

+
  void perf_evlist__mmap_consume(struct perf_evlist *evlist, int idx);
int perf_evlist__open(struct perf_evlist *evlist);
--
1.8.3.4







Re: [PATCH v2 00/23] ata: sata_dwc_460ex: make it working again

2016-05-08 Thread Tejun Heo
On Sun, May 08, 2016 at 04:00:08PM -0400, Tejun Heo wrote:
> Hello, Andy.
> 
> On Wed, May 04, 2016 at 03:22:51PM +0300, Andy Shevchenko wrote:
> > Tejun, since Vinod applied all necessary patches into his tree, the
> > series now has just a dependency to whatever branch / tag he marks for
> > it.
> > Do we have a chance to see the SATA series be applied in your tree?
> 
> Applied 1-22 to libata/for-4.7.  There were a couple of trivial conflicts
> which I resolved while applying but it'd be great if you can check
> whether everything looks okay.
> 
>  https://git.kernel.org/cgit/linux/kernel/git/tj/libata.git/log/?h=for-4.7

Oops, build failure.  Reverted for now.

Thanks.

-- 
tejun


Re: [PATCH] compiler-gcc: require gcc 4.8 for powerpc __builtin_bswap16()

2016-05-08 Thread Stephen Rothwell
Hi Josh,

On Fri, 6 May 2016 09:22:25 -0500 Josh Poimboeuf  wrote:
>
> I've also seen no problems on powerpc with 4.4 and 4.8.  I suspect it's
> specific to gcc 4.6.  Stephen, can you confirm this patch fixes it?

That will obviously fix the problem for us (since it will effectively
restore the code to what it was before the other commit for our gcc
4.6.3 builds and we have not seen it in other builds).  I will add this
patch to linux-next today.

And since "byteswap: try to avoid __builtin_constant_p gcc bug" is not
in Linus' tree, hopefully we can have this fix applied soon.
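
For reference, the failing construct has the same shape as this minimal
sketch (a reconstruction from the error messages above, untested on
gcc 4.6): a static initializer that needs __builtin_bswap16() to fold to a
compile-time constant on big-endian powerpc.

	#include <linux/types.h>
	#include <asm/byteorder.h>

	/* lib/vsprintf.c builds decpair[] from cpu_to_le16() like this */
	static const u16 table[] = {
		cpu_to_le16(0x3130),	/* constant-folds on gcc 4.8, not 4.6/ppc */
	};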

> From: Josh Poimboeuf 
> Subject: [PATCH] compiler-gcc: require gcc 4.8 for powerpc __builtin_bswap16()
> 
> gcc support for __builtin_bswap16() was supposedly added for powerpc in
> gcc 4.6, and was then later added for other architectures in gcc 4.8.
> 
> However, Stephen Rothwell reported that attempting to use it on powerpc
> in gcc 4.6 fails with:
> 
>   lib/vsprintf.c:160:2: error: initializer element is not constant
>   lib/vsprintf.c:160:2: error: (near initialization for 'decpair[0]')
>   lib/vsprintf.c:160:2: error: initializer element is not constant
>   lib/vsprintf.c:160:2: error: (near initialization for 'decpair[1]')
>   lib/vsprintf.c:160:2: error: initializer element is not constant
>   lib/vsprintf.c:160:2: error: (near initialization for 'decpair[2]')
>   lib/vsprintf.c:160:2: error: initializer element is not constant
>   lib/vsprintf.c:160:2: error: (near initialization for 'decpair[3]')
>   lib/vsprintf.c:160:2: error: initializer element is not constant
> 
> I'm not entirely sure what those errors mean, but I don't see them on
> gcc 4.8.  So let's consider gcc 4.8 to be the official starting point
> for __builtin_bswap16().
> 
> Fixes: 7322dd755e7d ("byteswap: try to avoid __builtin_constant_p gcc bug")
> Reported-by: Stephen Rothwell 
> Signed-off-by: Josh Poimboeuf 
> ---
>  include/linux/compiler-gcc.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
> index eeae401..3d5202e 100644
> --- a/include/linux/compiler-gcc.h
> +++ b/include/linux/compiler-gcc.h
> @@ -246,7 +246,7 @@
>  #define __HAVE_BUILTIN_BSWAP32__
>  #define __HAVE_BUILTIN_BSWAP64__
>  #endif
> -#if GCC_VERSION >= 40800 || (defined(__powerpc__) && GCC_VERSION >= 40600)
> +#if GCC_VERSION >= 40800
>  #define __HAVE_BUILTIN_BSWAP16__
>  #endif
>  #endif /* CONFIG_ARCH_USE_BUILTIN_BSWAP */
> -- 
> 2.4.11

-- 
Cheers,
Stephen Rothwell


linux-next: manual merge of the net-next tree with the net tree

2016-05-08 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  include/linux/netdevice.h

between commit:

  229740c63169 ("udp_offload: Set encapsulation before inner completes.")

from the net tree and commit:

  46aa2f30aa7f ("udp: Remove udp_offloads")

from the net-next tree.

I fixed it up (the latter removed the struct that was commented in the
former) and can carry the fix as necessary. This is now fixed as far as
linux-next is concerned, but any non trivial conflicts should be mentioned
to your upstream maintainer when your tree is submitted for merging.
You may also want to consider cooperating with the maintainer of the
conflicting tree to minimise any particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell


linux-next: manual merge of the net-next tree with the wireless-drivers tree

2016-05-08 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  drivers/net/wireless/intel/iwlwifi/mvm/tx.c

between commit:

  5c08b0f5026f ("iwlwifi: mvm: don't override the rate with the AMSDU len")

from the wireless-drivers tree and commit:

  d8fe484470dd ("iwlwifi: mvm: add support for new TX CMD API")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/net/wireless/intel/iwlwifi/mvm/tx.c
index 34731e29c589,bd286fca3776..
--- a/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/mvm/tx.c
@@@ -186,10 -294,16 +295,16 @@@ void iwl_mvm_set_tx_cmd(struct iwl_mvm 
tx_cmd->tx_flags = cpu_to_le32(tx_flags);
/* Total # bytes to be transmitted */
tx_cmd->len = cpu_to_le16((u16)skb->len +
 -  (uintptr_t)info->driver_data[0]);
 +  (uintptr_t)skb_info->driver_data[0]);
-   tx_cmd->next_frame_len = 0;
tx_cmd->life_time = cpu_to_le32(TX_CMD_LIFE_TIME_INFINITE);
tx_cmd->sta_id = sta_id;
+ 
+   /* padding is inserted later in transport */
+   if (ieee80211_hdrlen(fc) % 4 &&
+   !(tx_cmd->offload_assist & cpu_to_le16(BIT(TX_CMD_OFFLD_AMSDU))))
+   tx_cmd->offload_assist |= cpu_to_le16(BIT(TX_CMD_OFFLD_PAD));
+ 
+   iwl_mvm_tx_csum(mvm, skb, hdr, info, tx_cmd);
  }
  
  /*


Re: usb: dwc2: regression on MyBook Live Duo / Canyonlands since 4.3.0-rc4

2016-05-08 Thread Benjamin Herrenschmidt
On Sun, 2016-05-08 at 13:44 +0200, Christian Lamparter wrote:
> On Sunday, May 08, 2016 08:40:55 PM Benjamin Herrenschmidt wrote:
> > 
> > On Sun, 2016-05-08 at 00:54 +0200, Christian Lamparter via Linuxppc-dev 
> > wrote:
> > > 
> > > I've been looking in getting the MyBook Live Duo's USB OTG port
> > > to function. The SoC is a APM82181. Which has a PowerPC 464 core
> > > and related to the supported canyonlands architecture in
> > > arch/powerpc/.
> > > 
> > > Currently in -next the dwc2 module doesn't load: 
> > Smells like the APM implementation is little endian. You might need to
> > use a flag to indicate what endian to use instead and set it
> > appropriately based on some DT properties.
> I tried. As per common-properties[0], I added little-endian; but it has no
> effect. I looked in dwc2_driver_probe and found no way of specifying the
> endian of the device. It all comes down to the dwc2_readl & dwc2_writel
> accessors. These - sadly - have been hardwired to use __raw_readl and
> __raw_writel. So, it's always "native-endian". While common-properties
> says little-endian should be preferred.

Right, I meant, you should produce a patch adding a runtime test inside
those functions based on a device-tree property, a bit like we do for
some of the HCDs like OHCI, EHCI etc...

Cheers,
Ben.
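
A minimal sketch of that suggestion (illustrative only, not existing dwc2
code): key the byte swap off a flag filled in from a DT property such as
"little-endian" in dwc2_driver_probe().  A real patch would keep the flag
per-device in struct dwc2_hsotg and hand it to the accessors; a single
variable is used below only so the current dwc2_readl(addr) signature can
stay unchanged.

	#include <linux/io.h>
	#include <linux/swab.h>

	extern bool dwc2_swap_regs;	/* hypothetical, set at probe time */

	static inline u32 dwc2_readl(const void __iomem *addr)
	{
		u32 value = __raw_readl(addr);

		if (dwc2_swap_regs)
			value = swab32(value);
		mb();	/* keep the ordering guarantee of the current accessor */
		return value;
	}

	static inline void dwc2_writel(u32 value, void __iomem *addr)
	{
		if (dwc2_swap_regs)
			value = swab32(value);
		__raw_writel(value, addr);
		mb();
	}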

> > 
> > > 
> > > dwc2 4bff8.usbotg: dwc2_core_reset() HANG! AHB Idle GRSTCTL=80
> > > dwc2 4bff8.usbotg: Bad value for GSNPSID: 0x0a29544f
> > > 
> > > Looking at the Bad GSNPSID value: 0x0a29544f. It is obvious that
> > > this is an endian problem. git finds this patch:
> > > 
> > > commit 95c8bc3609440af5e4a4f760b8680caea7424396
> > > Author: Antti Seppälä 
> > > Date:   Thu Aug 20 21:41:07 2015 +0300
> > > 
> > > usb: dwc2: Use platform endianness when accessing registers
> > > 
> > > This patch is necessary to access dwc2 registers correctly on
> > > big-endian
> > > systems such as the mips based SoCs made by Lantiq. Then dwc2 can
> > > be
> > > used to replace ifx-hcd driver for Lantiq platforms found e.g. in
> > > OpenWrt.
> > > 
> > > The patch was autogenerated with the following commands:
> > > $EDITOR core.h
> > > sed -i "s/\<readl\>/dwc2_readl/g" *.c hcd.h hw.h
> > > sed -i "s/\<writel\>/dwc2_writel/g" *.c hcd.h hw.h
> > > 
> > > Some files were then hand-edited to fix checkpatch.pl warnings
> > > about
> > > too long lines.
> > > 
> > > which unfortunately, broke the USB-OTG port on the MyBook Live Duo.
> > > Reverting to the readl / writel:
> > > 
> > > --- 
> > > diff --git a/drivers/usb/dwc2/core.h b/drivers/usb/dwc2/core.h
> > > index 3c58d63..c021c1f 100644
> > > --- a/drivers/usb/dwc2/core.h
> > > +++ b/drivers/usb/dwc2/core.h
> > > @@ -66,7 +66,7 @@
> > >  
> > >  static inline u32 dwc2_readl(const void __iomem *addr)
> > >  {
> > > - u32 value = __raw_readl(addr);
> > > + u32 value = readl(addr);
> > >  
> > >   /* In order to preserve endianness __raw_* operation is
> > > used. Therefore
> > >    * a barrier is needed to ensure IO access is not re-ordered 
> > > across
> > > @@ -78,7 +78,7 @@ static inline u32 dwc2_readl(const void __iomem
> > > *addr)
> > >  
> > >  static inline void dwc2_writel(u32 value, void __iomem *addr)
> > >  {
> > > - __raw_writel(value, addr);
> > > + writel(value, addr);
> > >  
> > >   /*
> > >    * In order to preserve endianness __raw_* operation is
> > > used. Therefore
> > > 
> > > ---
> > > 
> > > restores the dwc-otg port to full working order:
> > > dwc2 4bff8.usbotg: Specified GNPTXFDEP=1024 > 256
> > > dwc2 4bff8.usbotg: EPs: 3, shared fifos, 2042 entries in SPRAM
> > > dwc2 4bff8.usbotg: DWC OTG Controller
> > > dwc2 4bff8.usbotg: new USB bus registered, assigned bus number 1
> > > dwc2 4bff8.usbotg: irq 33, io mem 0x
> > > hub 1-0:1.0: USB hub found
> > > hub 1-0:1.0: 1 port detected
> > > root@mbl:~# usb 1-1: new high-speed USB device number 2 using dwc2
> > > 
> > > So, what to do?
>  ^^^
> 
> Regards,
> Christian
> 
> [0] 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-usb" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [PATCH net-next 5/7] Driver: Vmxnet3: Add support for get_coalesce, set_coalesce ethtool operations

2016-05-08 Thread Ben Hutchings
On Sun, 2016-05-08 at 13:55 -0700, Shrikrishna Khare wrote:
> 
> On Sat, 7 May 2016, Ben Hutchings wrote:
> 
> > On Fri, 2016-05-06 at 16:12 -0700, Shrikrishna Khare wrote:
> > [...]
> > > +static int
> > > +vmxnet3_set_coalesce(struct net_device *netdev, struct ethtool_coalesce 
> > > *ec)
> > > +{
> > [...]
> > > +   switch (ec->rx_coalesce_usecs) {
> > > +   case VMXNET3_COALESCE_DEFAULT:
> > > +   case VMXNET3_COALESCE_DISABLED:
> > > +   case VMXNET3_COALESCE_ADAPT:
> > > +   if (ec->tx_max_coalesced_frames ||
> > > +   ec->tx_max_coalesced_frames_irq ||
> > > +   ec->rx_max_coalesced_frames_irq) {
> > > +   return -EINVAL;
> > > +   }
> > > +   memset(adapter->coal_conf, 0, sizeof(*adapter->coal_conf));
> > > +   adapter->coal_conf->coalMode = ec->rx_coalesce_usecs;
> > > +   break;
> > > +   case VMXNET3_COALESCE_STATIC:
> > [...]
> > 
> > I don't want to see drivers introducing magic values for fields that
> > are denominated in microseconds (especially not for 0, which is the
> > correct way to specify 'no coalescing').  If the current
> > ethtool_coalesce structure is inadequate, propose an extension.
> 
> For vmxnet3, we need an ethtool mechanism to indicate coalescing mode to 
> the device.
> 
> Would a patch that maps 0 to 'no coalescing' be acceptable? That is:
> 
> rx-usecs = 0 -> coalescing disabled.
> rx-usecs = 1 -> default (chosen by the device).
> rx-usecs = 2 -> adaptive coalescing.
> rx-usecs = 3 -> static coalescing.

I still don't like it much.  For the 3 special values (0 isn't really
special):

1 = default: When the driver sets the virtual device to this mode, can it then 
read back what the actual settings are, or are they hidden?  If it can, then 
userland can also read the defaults and explicitly return to them later.  But I 
do see the usefulness of an explicit request to reset to defaults.

2 = adaptive coalescing: There are already fields to request adaptive 
coalescing; you should support them.

3 = static coalescing: I don't understand what this means.

> all other rx-usecs values -> rate based coalescing where rx-usecs denotes 
> rate.
> 
> Alternatively: I don't think new members could be added to struct 
> ethtool_coalesce without breaking compatibility.

That's right, unfortunately.

> Thus, I could extend coalescing as follows:
> - new struct ethtool_coalesce_v2 with coalesce_mode (along with all the 
> members of struct ethtool_coalesce).
> - introduce new ETHTOOL_{G,S}COALESCE_V2 commands.
> - extend userspace ethtool to invoke new commands.
> 
> Could you please advise?

That's roughly how you would extend it.  Though we would probably want
to consider making other extensions to interrupt coalescing control at
the same time.
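
Purely as a sketch of the shape such an extension could take (none of these
names exist today; the field list would need to be agreed on the list
first):

	struct ethtool_coalesce_v2 {
		__u32	cmd;		/* ETHTOOL_{G,S}COALESCE_V2, hypothetical */
		/* ... every field of today's struct ethtool_coalesce ... */
		__u32	coalesce_mode;	/* the new knob vmxnet3 is asking for */
		__u32	reserved[8];	/* headroom for further extensions */
	};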

Ben.

-- 
Ben Hutchings
If you seem to know what you are doing, you'll be given more to do.



linux-next: build failure after merge of the libata tree

2016-05-08 Thread Stephen Rothwell
Hi Tejun,

After merging the libata tree, today's linux-next build (x86_64
allmodconfig) failed like this:

drivers/ata/sata_dwc_460ex.c:203:2: error: unknown field 'm_master' specified 
in initializer
  .m_master = 1,
  ^
/home/sfr/next/next/drivers/ata/sata_dwc_460ex.c:204:2: error: unknown field 
'p_master' specified in initializer
  .p_master = 0,
  ^
/home/sfr/next/next/drivers/ata/sata_dwc_460ex.c: In function 
'sata_dwc_dma_init_old':
/home/sfr/next/next/drivers/ata/sata_dwc_460ex.c:269:9: error: too few 
arguments to function 'dw_dma_probe'
  return dw_dma_probe(hsdev->dma);
 ^

Caused by commit

  30ee9d52f7a7 ("ata: sata_dwc_460ex: use "dmas" DT property to find dma 
channel")

I have used the libata tree from next-20160506 for today.

-- 
Cheers,
Stephen Rothwell


[PATCH] arm64: defconfig: Enable Cadence MACB/GEM support

2016-05-08 Thread Chanho Min
This patch enables the Cadence MACB/GEM support that is needed
by the lg1312 SoC.

Signed-off-by: Chanho Min 
---
 arch/arm64/configs/defconfig |1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 14dbe27..ed11cb6 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -105,6 +105,7 @@ CONFIG_TUN=y
 CONFIG_VIRTIO_NET=y
 CONFIG_AMD_XGBE=y
 CONFIG_NET_XGENE=y
+CONFIG_MACB=y
 CONFIG_E1000E=y
 CONFIG_IGB=y
 CONFIG_IGBVF=y
-- 
1.7.9.5



linux-next: manual merge of the f2fs tree with the ext4 tree

2016-05-08 Thread Stephen Rothwell
Hi Jaegeuk,

Today's linux-next merge of the f2fs tree got a conflict in:

  fs/ext4/ext4.h

between commit:

  c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and 
ext4_writepages")

from the ext4 tree and commit:

  a618a2a1dda4 ("ext4 crypto: migrate into vfs's crypto engine")

from the f2fs tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

P.S. I would expect to see a Reviewed-by or Acked-by from the ext4
maintainer on that f2fs tree commit ...

-- 
Cheers,
Stephen Rothwell

diff --cc fs/ext4/ext4.h
index ba5aecc07fbc,91b62e54ef51..
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@@ -32,8 -32,8 +32,9 @@@
  #include 
  #include 
  #include 
+ #include 
  #include 
 +#include 
  #ifdef __KERNEL__
  #include 
  #endif
@@@ -1509,9 -1498,10 +1502,13 @@@ struct ext4_sb_info 
struct ratelimit_state s_err_ratelimit_state;
struct ratelimit_state s_warning_ratelimit_state;
struct ratelimit_state s_msg_ratelimit_state;
 +
 +  /* Barrier between changing inodes' journal flags and writepages ops. */
 +  struct percpu_rw_semaphore s_journal_flag_rwsem;
+ #ifdef CONFIG_EXT4_FS_ENCRYPTION
+   u8 key_prefix[EXT4_KEY_DESC_PREFIX_SIZE];
+   u8 key_prefix_size;
+ #endif
  };
  
  static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)


Re: [PATCH] md: make the code more readable in the for-loop

2016-05-08 Thread Shaohua Li
On Sun, May 08, 2016 at 08:56:55PM +0800, Tiezhu Yang wrote:
> This patch modifies raid1.c, raid10.c and raid5.c
> to make the code more readable in the for-loop
> and also fixes the scripts/checkpatch.pl error:
> ERROR: trailing statements should be on next line.
> 
> Signed-off-by: Tiezhu Yang 
> @@ -3573,7 +3573,8 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>   pr_debug("force RCW rmw_level=%u, recovery_cp=%llu 
> sh->sector=%llu\n",
>conf->rmw_level, (unsigned long long)recovery_cp,
>(unsigned long long)sh->sector);
> - } else for (i = disks; i--; ) {
> + } else
> + for (i = disks - 1; i >= 0; i--) {
>   /* would I have to read this buffer for read_modify_write */
>   struct r5dev *dev = &sh->dev[i];
>   if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&

Applied. I moved the for statement into a {} block of the else statement.


Re: [PATCH v2 2/3] net: ethernet: fec: use phydev from struct net_device

2016-05-08 Thread Ben Hutchings
On Mon, 2016-05-09 at 00:47 +0200, Philippe Reynes wrote:
> On 09/05/16 00:22, Ben Hutchings wrote:
> > 
> > On Sun, 2016-05-08 at 23:44 +0200, Philippe Reynes wrote:
> > > 
> > > The private structure contains a pointer to phydev, but the structure
> > > net_device already contains such a pointer. So we can remove the phydev
> > > pointer from the private structure, and update the driver to use the one
> > > contained in struct net_device.
> > But there is no central code that updates the pointer, so:
> The functions phy_attach_direct and phy_detach update the phydev
> pointer in struct net_device.
[...]
> So from my understanding, those two lines aren't useful.
> Can you confirm that I'm on the right track, please?

Sorry, you're right.

Ben.

-- 
Ben Hutchings
I haven't lost my mind; it's backed up on tape somewhere.



Re: [PATCH v2 2/3] net: ethernet: fec: use phydev from struct net_device

2016-05-08 Thread Philippe Reynes

On 09/05/16 00:22, Ben Hutchings wrote:

On Sun, 2016-05-08 at 23:44 +0200, Philippe Reynes wrote:

The private structure contains a pointer to phydev, but the structure
net_device already contains such a pointer. So we can remove the phydev
pointer from the private structure, and update the driver to use the one
contained in struct net_device.


But there is no central code that updates the pointer, so:


The functions phy_attach_direct and phy_detach update the phydev
pointer in struct net_device.



[...]

@@ -1928,7 +1926,6 @@ static int fec_enet_mii_probe(struct net_device *ndev)

phy_dev->advertising = phy_dev->supported;

-   fep->phy_dev = phy_dev;


you need to assign ndev->phydev here


The function fec_enet_mii_probe calls of_phy_connect,
which calls phy_connect_direct, which calls phy_attach_direct.
This last function updates the phydev pointer of struct net_device.


[...]

@@ -2875,8 +2869,7 @@ fec_enet_close(struct net_device *ndev)
fec_stop(ndev);
}

-   phy_disconnect(fep->phy_dev);
-   fep->phy_dev = NULL;
+   phy_disconnect(ndev->phydev);

[...]

and you need to set it to NULL here.


The function fec_enet_close calls phy_disconnect, which calls phy_detach.
This last function sets the phydev pointer in struct net_device to NULL.


So from my understanding, those two lines aren't useful.
Can you confirm that I'm on the right track, please?

 

Ben.



Philippe


[PATCH 1/1] net: thunderx: avoid exposing kernel stack

2016-05-08 Thread Heinrich Schuchardt
Reserved fields should be set to zero to avoid exposing
bits from the kernel stack.
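
The problem pattern, sketched (not literal driver code): the config
structures below live on the stack, only some bit-fields get assigned, and
the whole structure is then punned to a u64 and written to hardware, so any
unassigned/reserved bits carry whatever the stack happened to contain.

	struct rq_cfg rq_cfg;			/* uninitialized stack memory */

	rq_cfg.ena = 1;
	rq_cfg.tcp_ena = 0;
	/* reserved fields still hold stale stack bits at this point ... */
	nicvf_queue_reg_write(nic, NIC_QSET_RQ_0_7_CFG, qidx, *(u64 *)&rq_cfg);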

Signed-off-by: Heinrich Schuchardt 
---
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index 8acd7c0..0ff8e60 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -533,6 +533,7 @@ static void nicvf_rcv_queue_config(struct nicvf *nic, 
struct queue_set *qs,
nicvf_config_vlan_stripping(nic, nic->netdev->features);
 
/* Enable Receive queue */
+   memset(&rq_cfg, 0, sizeof(struct rq_cfg));
rq_cfg.ena = 1;
rq_cfg.tcp_ena = 0;
nicvf_queue_reg_write(nic, NIC_QSET_RQ_0_7_CFG, qidx, *(u64 *)&rq_cfg);
@@ -565,6 +566,7 @@ void nicvf_cmp_queue_config(struct nicvf *nic, struct 
queue_set *qs,
  qidx, (u64)(cq->dmem.phys_base));
 
/* Enable Completion queue */
+   memset(&cq_cfg, 0, sizeof(struct cq_cfg));
cq_cfg.ena = 1;
cq_cfg.reset = 0;
cq_cfg.caching = 0;
@@ -613,6 +615,7 @@ static void nicvf_snd_queue_config(struct nicvf *nic, 
struct queue_set *qs,
  qidx, (u64)(sq->dmem.phys_base));
 
/* Enable send queue  & set queue size */
+   memset(&sq_cfg, 0, sizeof(struct sq_cfg));
sq_cfg.ena = 1;
sq_cfg.reset = 0;
sq_cfg.ldwb = 0;
@@ -649,6 +652,7 @@ static void nicvf_rbdr_config(struct nicvf *nic, struct 
queue_set *qs,
 
/* Enable RBDR  & set queue size */
/* Buffer size should be in multiples of 128 bytes */
+   memset(&rbdr_cfg, 0, sizeof(struct rbdr_cfg));
rbdr_cfg.ena = 1;
rbdr_cfg.reset = 0;
rbdr_cfg.ldwb = 0;
-- 
2.1.4



[PATCH 2/7] libnvdimm, dax: introduce device-dax infrastructure

2016-05-08 Thread Dan Williams
Device DAX is the device-centric analogue of Filesystem DAX
(CONFIG_FS_DAX).  It allows persistent memory ranges to be allocated and
mapped without need of an intervening file system.  This initial
infrastructure arranges for a libnvdimm pfn-device to be represented as
a different device-type so that it can be attached to a driver other
than the pmem driver.

Signed-off-by: Dan Williams 
---
 drivers/nvdimm/Kconfig  |   13 +
 drivers/nvdimm/Makefile |1 
 drivers/nvdimm/bus.c|4 ++
 drivers/nvdimm/claim.c  |2 +
 drivers/nvdimm/dax_devs.c   |   99 +++
 drivers/nvdimm/namespace_devs.c |   19 +++
 drivers/nvdimm/nd-core.h|1 
 drivers/nvdimm/nd.h |   25 ++
 drivers/nvdimm/pfn_devs.c   |  100 ++-
 drivers/nvdimm/region.c |2 +
 drivers/nvdimm/region_devs.c|   29 +++
 include/uapi/linux/ndctl.h  |2 +
 tools/testing/nvdimm/Kbuild |1 
 13 files changed, 264 insertions(+), 34 deletions(-)
 create mode 100644 drivers/nvdimm/dax_devs.c

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 53c11621d5b1..7c8a3bf07884 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -88,4 +88,17 @@ config NVDIMM_PFN
 
  Select Y if unsure
 
+config NVDIMM_DAX
+   bool "NVDIMM DAX: Raw access to persistent memory"
+   default LIBNVDIMM
+   depends on NVDIMM_PFN
+   help
+ Support raw device dax access to a persistent memory
+ namespace.  For environments that want to hard partition
+ persistent memory, this capability provides a mechanism to
+ sub-divide a namespace into character devices that can only be
+ accessed via DAX (mmap(2)).
+
+ Select Y if unsure
+
 endif
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index ea84d3c4e8e5..909554c3f955 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -23,3 +23,4 @@ libnvdimm-y += label.o
 libnvdimm-$(CONFIG_ND_CLAIM) += claim.o
 libnvdimm-$(CONFIG_BTT) += btt_devs.o
 libnvdimm-$(CONFIG_NVDIMM_PFN) += pfn_devs.o
+libnvdimm-$(CONFIG_NVDIMM_DAX) += dax_devs.o
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 19f822d7f652..97589e3cb852 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -40,6 +40,8 @@ static int to_nd_device_type(struct device *dev)
return ND_DEVICE_REGION_PMEM;
else if (is_nd_blk(dev))
return ND_DEVICE_REGION_BLK;
+   else if (is_nd_dax(dev))
+   return ND_DEVICE_DAX_PMEM;
else if (is_nd_pmem(dev->parent) || is_nd_blk(dev->parent))
return nd_region_to_nstype(to_nd_region(dev->parent));
 
@@ -246,6 +248,8 @@ static void nd_async_device_unregister(void *d, 
async_cookie_t cookie)
 
 void __nd_device_register(struct device *dev)
 {
+   if (!dev)
+   return;
dev->bus = &nvdimm_bus_type;
get_device(dev);
async_schedule_domain(nd_async_device_register, dev,
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 6bbd0a36994a..5f53db59a058 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -85,6 +85,8 @@ static bool is_idle(struct device *dev, struct 
nd_namespace_common *ndns)
seed = nd_region->btt_seed;
else if (is_nd_pfn(dev))
seed = nd_region->pfn_seed;
+   else if (is_nd_dax(dev))
+   seed = nd_region->dax_seed;
 
if (seed == dev || ndns || dev->driver)
return false;
diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
new file mode 100644
index ..f90f7549e7f4
--- /dev/null
+++ b/drivers/nvdimm/dax_devs.c
@@ -0,0 +1,99 @@
+/*
+ * Copyright(c) 2013-2016 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include 
+#include 
+#include 
+#include 
+#include "nd-core.h"
+#include "nd.h"
+
+static void nd_dax_release(struct device *dev)
+{
+   struct nd_region *nd_region = to_nd_region(dev->parent);
+   struct nd_dax *nd_dax = to_nd_dax(dev);
+   struct nd_pfn *nd_pfn = &nd_dax->nd_pfn;
+
+   dev_dbg(dev, "%s\n", __func__);
+   nd_detach_ndns(dev, &nd_pfn->ndns);
+   ida_simple_remove(&nd_region->dax_ida, nd_pfn->id);
+   kfree(nd_pfn->uuid);
+   kfree(nd_dax);
+}
+
+static struct device_type nd_dax_device_type = {
+   .name = "nd_dax",
+   .release = nd_dax_release,
+};
+
+bool 

[PATCH 3/7] libnvdimm, dax: reserve space to store labels for device-dax

2016-05-08 Thread Dan Williams
We may want to subdivide a device-dax range into multiple devices so
that each can have separate permissions or naming.  Reserve 128K of
label space by default so we have the capability of making allocation
decisions persistent.  This reservation is not something we can add
later since it would result in the default size of a device-dax range
changing between kernel versions.

Signed-off-by: Dan Williams 
---
 drivers/nvdimm/pfn_devs.c |8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 6ade2eb7615d..ca396c8f2cd5 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -540,6 +540,7 @@ static struct vmem_altmap *__nvdimm_setup_pfn(struct nd_pfn 
*nd_pfn,
 
 static int nd_pfn_init(struct nd_pfn *nd_pfn)
 {
+   u32 dax_label_reserve = is_nd_dax(&nd_pfn->dev) ? SZ_128K : 0;
struct nd_namespace_common *ndns = nd_pfn->ndns;
u32 start_pad = 0, end_trunc = 0;
resource_size_t start, size;
@@ -606,10 +607,11 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
size = resource_size(>res);
npfns = (size - start_pad - end_trunc - SZ_8K) / SZ_4K;
if (nd_pfn->mode == PFN_MODE_PMEM)
-   offset = ALIGN(start + SZ_8K + 64 * npfns, nd_pfn->align)
-   - start;
+   offset = ALIGN(start + SZ_8K + 64 * npfns + dax_label_reserve,
+   nd_pfn->align) - start;
else if (nd_pfn->mode == PFN_MODE_RAM)
-   offset = ALIGN(start + SZ_8K, nd_pfn->align) - start;
+   offset = ALIGN(start + SZ_8K + dax_label_reserve,
+   nd_pfn->align) - start;
else
return -ENXIO;
 



[PATCH 6/7] /dev/dax, core: file operations and dax-mmap

2016-05-08 Thread Dan Williams
The "Device DAX" core enables dax mappings of performance / feature
differentiated memory.  An open mapping or file handle keeps the backing
struct device live, but new mappings are only possible while the device
is enabled.   Faults are handled under the device lock to synchronize
with the enabled state of the device.  Per the standard device model,
device->driver_data is non-NULL while the device is enabled, and device
state transitions happen under the device lock.

Similar to the filesystem-dax case the backing memory may optionally
have struct page entries.  However, unlike fs-dax there is no support
for private mappings, or mappings that are not backed by media (see
use of zero-page in fs-dax).

Mappings are always guaranteed to match the alignment of the dax_region.
If the dax_region is configured to have a 2MB alignment, all mappings
are guaranteed to be backed by a pmd entry.  Contrast this determinism
with the fs-dax case where pmd mappings are opportunistic.  If userspace
attempts to force a misaligned mapping, the driver will fail the mmap
attempt.  See dax_dev_check_vma() for other scenarios that are rejected,
like MAP_PRIVATE mappings.
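
A hypothetical userspace sketch of the resulting contract (device path and
sizes are only examples): the mapping must be MAP_SHARED and sized/aligned
to the dax_region alignment, 2MB here, or the driver refuses it.

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		const size_t len = 2UL << 20;		/* one 2MB fault granule */
		int fd = open("/dev/dax0.0", O_RDWR);	/* example device node */
		char *p;

		if (fd < 0)
			return 1;
		/* MAP_PRIVATE would be rejected by dax_dev_check_vma() */
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;
		p[0] = 1;	/* backed by a pmd entry when the region is 2MB aligned */
		munmap(p, len);
		close(fd);
		return 0;
	}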

Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Andrew Morton 
Cc: Dave Hansen 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 drivers/dax/dax.c |  320 +
 mm/huge_memory.c  |1 
 mm/hugetlb.c  |1 
 3 files changed, 322 insertions(+)

diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index e95af4ead357..4198054e703c 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -187,9 +187,329 @@ int devm_create_dax_dev(struct dax_region *dax_region, 
struct resource *res,
 }
 EXPORT_SYMBOL_GPL(devm_create_dax_dev);
 
+/* return an unmapped area aligned to the dax region specified alignment */
+static unsigned long dax_dev_get_unmapped_area(struct file *filp,
+   unsigned long addr, unsigned long len, unsigned long pgoff,
+   unsigned long flags)
+{
+   struct dax_dev *dax_dev;
+   struct device *dev = filp ? filp->private_data : NULL;
+   unsigned long off, off_end, off_align, len_align, addr_align, align = 0;
+
+   if (!filp || addr)
+   goto out;
+
+   device_lock(dev);
+   dax_dev = dev_get_drvdata(dev);
+   if (dax_dev) {
+   struct dax_region *dax_region = dax_dev->region;
+
+   align = dax_region->align;
+   }
+   device_unlock(dev);
+
+   if (!align)
+   goto out;
+
+   off = pgoff << PAGE_SHIFT;
+   off_end = off + len;
+   off_align = round_up(off, align);
+
+   if ((off_end <= off_align) || ((off_end - off_align) < align))
+   goto out;
+
+   len_align = len + align;
+   if ((off + len_align) < off)
+   goto out;
+
+   addr_align = current->mm->get_unmapped_area(filp, addr, len_align,
+   pgoff, flags);
+   if (!IS_ERR_VALUE(addr_align)) {
+   addr_align += (off - addr_align) & (align - 1);
+   return addr_align;
+   }
+ out:
+   return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
+}
+
+static int __match_devt(struct device *dev, const void *data)
+{
+   const dev_t *devt = data;
+
+   return dev->devt == *devt;
+}
+
+static int dax_dev_open(struct inode *inode, struct file *filp)
+{
+   struct device *dev;
+
+   dev = class_find_device(dax_class, NULL, &inode->i_rdev, __match_devt);
+   if (dev) {
+   dev_dbg(dev, "%s\n", __func__);
+   filp->private_data = dev;
+   inode->i_flags = S_DAX;
+   return 0;
+   }
+   return -ENXIO;
+}
+
+static int dax_dev_release(struct inode *inode, struct file *filp)
+{
+   struct device *dev = filp->private_data;
+
+   dev_dbg(dev, "%s\n", __func__);
+   put_device(dev);
+   return 0;
+}
+
+static struct dax_dev *to_dax_dev(struct device *dev)
+{
+   WARN_ON(dev->class != dax_class);
+   device_lock_assert(dev);
+   return dev_get_drvdata(dev);
+}
+
+static int dax_dev_check_vma(struct device *dev, struct vm_area_struct *vma,
+   const char *func)
+{
+   struct dax_dev *dax_dev = to_dax_dev(dev);
+   struct dax_region *dax_region;
+   unsigned long mask;
+
+   if (!dax_dev)
+   return -ENXIO;
+
+   /* prevent private / writable mappings from being established */
+   if ((vma->vm_flags & (VM_NORESERVE|VM_SHARED|VM_WRITE)) == VM_WRITE) {
+   dev_dbg(dev, "%s: fail, attempted private mapping\n", func);
+   return -EINVAL;
+   }
+
+   dax_region = dax_dev->region;
+   mask = dax_region->align - 1;
+   if (vma->vm_start & mask || vma->vm_end & mask) {
+   dev_dbg(dev, "%s: fail, unaligned vma 

[PATCH 5/7] /dev/dax, pmem: direct access to persistent memory

2016-05-08 Thread Dan Williams
Device DAX is the device-centric analogue of Filesystem DAX
(CONFIG_FS_DAX).  It allows memory ranges to be allocated and mapped
without need of an intervening file system.  Device DAX is strict,
precise and predictable.  Specifically this interface:

1/ Guarantees fault granularity with respect to a given page size (pte,
pmd, or pud) set at configuration time.

2/ Enforces deterministic behavior by being strict about what fault
scenarios are supported.

For example, by forcing MADV_DONTFORK semantics and omitting MAP_PRIVATE
support device-dax guarantees that a mapping always behaves/performs the
same once established.  It is the "what you see is what you get" access
mechanism to differentiated memory vs filesystem DAX which has
filesystem specific implementation semantics.

Persistent memory is the first target, but the mechanism is also
targeted for exclusive allocations of performance differentiated memory
ranges.

This commit is limited to the base device driver infrastructure to
associate a dax device with pmem range.

Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Andrew Morton 
Cc: Dave Hansen 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 drivers/Kconfig |2 
 drivers/Makefile|1 
 drivers/dax/Kconfig |   25 
 drivers/dax/Makefile|4 +
 drivers/dax/dax.c   |  223 +++
 drivers/dax/dax.h   |   24 
 drivers/dax/pmem.c  |  168 ++
 tools/testing/nvdimm/Kbuild |9 +
 tools/testing/nvdimm/config_check.c |2 
 9 files changed, 458 insertions(+)
 create mode 100644 drivers/dax/Kconfig
 create mode 100644 drivers/dax/Makefile
 create mode 100644 drivers/dax/dax.c
 create mode 100644 drivers/dax/dax.h
 create mode 100644 drivers/dax/pmem.c

diff --git a/drivers/Kconfig b/drivers/Kconfig
index d2ac339de85f..8298eab84a6f 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -190,6 +190,8 @@ source "drivers/android/Kconfig"
 
 source "drivers/nvdimm/Kconfig"
 
+source "drivers/dax/Kconfig"
+
 source "drivers/nvmem/Kconfig"
 
 source "drivers/hwtracing/stm/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 8f5d076baeb0..0b6f3d60193d 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -66,6 +66,7 @@ obj-$(CONFIG_PARPORT) += parport/
 obj-$(CONFIG_NVM)  += lightnvm/
 obj-y  += base/ block/ misc/ mfd/ nfc/
 obj-$(CONFIG_LIBNVDIMM)+= nvdimm/
+obj-$(CONFIG_DEV_DAX)  += dax/
 obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/
 obj-$(CONFIG_NUBUS)+= nubus/
 obj-y  += macintosh/
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
new file mode 100644
index ..86ffbaa891ad
--- /dev/null
+++ b/drivers/dax/Kconfig
@@ -0,0 +1,25 @@
+menuconfig DEV_DAX
+   tristate "DAX: direct access to differentiated memory"
+   default m if NVDIMM_DAX
+   help
+ Support raw access to differentiated (persistence, bandwidth,
+ latency...) memory via an mmap(2) capable character
+ device.  Platform firmware or a device driver may identify a
+ platform memory resource that is differentiated from the
+ baseline memory pool.  Mappings of a /dev/daxX.Y device impose
+ restrictions that make the mapping behavior deterministic.
+
+if DEV_DAX
+
+config DEV_DAX_PMEM
+   tristate "PMEM DAX: direct access to persistent memory"
+   depends on NVDIMM_DAX
+   default DEV_DAX
+   help
+ Support raw access to persistent memory.  Note that this
+ driver consumes memory ranges allocated and exported by the
+ libnvdimm sub-system.
+
+ Say Y if unsure
+
+endif
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
new file mode 100644
index ..27c54e38478a
--- /dev/null
+++ b/drivers/dax/Makefile
@@ -0,0 +1,4 @@
+obj-$(CONFIG_DEV_DAX) += dax.o
+obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
+
+dax_pmem-y := pmem.o
diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
new file mode 100644
index ..e95af4ead357
--- /dev/null
+++ b/drivers/dax/dax.c
@@ -0,0 +1,223 @@
+/*
+ * Copyright(c) 2016 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+

[PATCH 7/7] Revert "block: enable dax for raw block devices"

2016-05-08 Thread Dan Williams
This reverts commit 5a023cdba50c5f5f2bc351783b3131699deb3937.

The functionality is superseded by the new "Device DAX" facility.

Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: Andrew Morton 
Cc: Ross Zwisler 
Cc: Jan Kara 
Signed-off-by: Dan Williams 
---
 block/ioctl.c   |   32 
 fs/block_dev.c  |   96 ++-
 include/linux/fs.h  |8 
 include/uapi/linux/fs.h |1 
 4 files changed, 29 insertions(+), 108 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 4ff1f92f89ca..698c7933d582 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -407,35 +407,6 @@ static inline int is_unrecognized_ioctl(int ret)
ret == -ENOIOCTLCMD;
 }
 
-#ifdef CONFIG_FS_DAX
-bool blkdev_dax_capable(struct block_device *bdev)
-{
-   struct gendisk *disk = bdev->bd_disk;
-
-   if (!disk->fops->direct_access)
-   return false;
-
-   /*
-* If the partition is not aligned on a page boundary, we can't
-* do dax I/O to it.
-*/
-   if ((bdev->bd_part->start_sect % (PAGE_SIZE / 512))
-   || (bdev->bd_part->nr_sects % (PAGE_SIZE / 512)))
-   return false;
-
-   /*
-* If the device has known bad blocks, force all I/O through the
-* driver / page cache.
-*
-* TODO: support finer grained dax error handling
-*/
-   if (disk->bb && disk->bb->count)
-   return false;
-
-   return true;
-}
-#endif
-
 static int blkdev_flushbuf(struct block_device *bdev, fmode_t mode,
unsigned cmd, unsigned long arg)
 {
@@ -598,9 +569,6 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, 
unsigned cmd,
case BLKTRACESETUP:
case BLKTRACETEARDOWN:
return blk_trace_ioctl(bdev, cmd, argp);
-   case BLKDAXGET:
-   return put_int(arg, !!(bdev->bd_inode->i_flags & S_DAX));
-   break;
case IOC_PR_REGISTER:
return blkdev_pr_register(bdev, argp);
case IOC_PR_RESERVE:
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 20a2c02b77c4..36ee10ca503e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "internal.h"
 
@@ -1159,6 +1160,33 @@ void bd_set_size(struct block_device *bdev, loff_t size)
 }
 EXPORT_SYMBOL(bd_set_size);
 
+static bool blkdev_dax_capable(struct block_device *bdev)
+{
+   struct gendisk *disk = bdev->bd_disk;
+
+   if (!disk->fops->direct_access || !IS_ENABLED(CONFIG_FS_DAX))
+   return false;
+
+   /*
+* If the partition is not aligned on a page boundary, we can't
+* do dax I/O to it.
+*/
+   if ((bdev->bd_part->start_sect % (PAGE_SIZE / 512))
+   || (bdev->bd_part->nr_sects % (PAGE_SIZE / 512)))
+   return false;
+
+   /*
+* If the device has known bad blocks, force all I/O through the
+* driver / page cache.
+*
+* TODO: support finer grained dax error handling
+*/
+   if (disk->bb && disk->bb->count)
+   return false;
+
+   return true;
+}
+
 static void __blkdev_put(struct block_device *bdev, fmode_t mode, int 
for_part);
 
 /*
@@ -1724,79 +1752,13 @@ static const struct address_space_operations 
def_blk_aops = {
.is_dirty_writeback = buffer_check_dirty_writeback,
 };
 
-#ifdef CONFIG_FS_DAX
-/*
- * In the raw block case we do not need to contend with truncation nor
- * unwritten file extents.  Without those concerns there is no need for
- * additional locking beyond the mmap_sem context that these routines
- * are already executing under.
- *
- * Note, there is no protection if the block device is dynamically
- * resized (partition grow/shrink) during a fault. A stable block device
- * size is already not enforced in the blkdev_direct_IO path.
- *
- * For DAX, it is the responsibility of the block device driver to
- * ensure the whole-disk device size is stable while requests are in
- * flight.
- *
- * Finally, unlike the filemap_page_mkwrite() case there is no
- * filesystem superblock to sync against freezing.  We still include a
- * pfn_mkwrite callback for dax drivers to receive write fault
- * notifications.
- */
-static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
-   return __dax_fault(vma, vmf, blkdev_get_block, NULL);
-}
-
-static int blkdev_dax_pfn_mkwrite(struct vm_area_struct *vma,
-   struct vm_fault *vmf)
-{
-   return dax_pfn_mkwrite(vma, vmf);
-}
-
-static int blkdev_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
-   pmd_t *pmd, unsigned int flags)
-{
-   return __dax_pmd_fault(vma, addr, pmd, flags, 

[PATCH 4/7] libnvdimm, dax: record the specified alignment of a dax-device instance

2016-05-08 Thread Dan Williams
We want to use the alignment as the allocation and mapping unit.
Previously this information was only useful for establishing the data
offset, but now it is important to remember the granularity for
later use.

Signed-off-by: Dan Williams 
---
 drivers/nvdimm/pfn.h  |4 +++-
 drivers/nvdimm/pfn_devs.c |8 ++--
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h
index 8e343a3ca873..9d2704c83fa7 100644
--- a/drivers/nvdimm/pfn.h
+++ b/drivers/nvdimm/pfn.h
@@ -33,7 +33,9 @@ struct nd_pfn_sb {
/* minor-version-1 additions for section alignment */
__le32 start_pad;
__le32 end_trunc;
-   u8 padding[4004];
+   /* minor-version-2 record the base alignment of the mapping */
+   __le32 align;
+   u8 padding[4000];
__le64 checksum;
 };
 
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index ca396c8f2cd5..58740d7ce81b 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -394,6 +394,9 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn)
pfn_sb->end_trunc = 0;
}
 
+   if (__le16_to_cpu(pfn_sb->version_minor) < 2)
+   pfn_sb->align = 0;
+
switch (le32_to_cpu(pfn_sb->mode)) {
case PFN_MODE_RAM:
case PFN_MODE_PMEM:
@@ -433,7 +436,7 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn)
return -EBUSY;
}
 
-   nd_pfn->align = 1UL << ilog2(offset);
+   nd_pfn->align = le32_to_cpu(pfn_sb->align);
if (!is_power_of_2(offset) || offset < PAGE_SIZE) {
dev_err(&nd_pfn->dev, "bad offset: %#llx dax disabled\n",
offset);
@@ -629,9 +632,10 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
memcpy(pfn_sb->uuid, nd_pfn->uuid, 16);
memcpy(pfn_sb->parent_uuid, nd_dev_to_uuid(&ndns->dev), 16);
pfn_sb->version_major = cpu_to_le16(1);
-   pfn_sb->version_minor = cpu_to_le16(1);
+   pfn_sb->version_minor = cpu_to_le16(2);
pfn_sb->start_pad = cpu_to_le32(start_pad);
pfn_sb->end_trunc = cpu_to_le32(end_trunc);
+   pfn_sb->align = cpu_to_le32(nd_pfn->align);
checksum = nd_sb_checksum((struct nd_gen_sb *) pfn_sb);
pfn_sb->checksum = cpu_to_le64(checksum);
 



[PATCH 1/7] libnvdimm: cleanup nvdimm_namespace_common_probe(), kill 'host'

2016-05-08 Thread Dan Williams
The 'host' variable can be killed as it is always the same as the passed
in device.

Signed-off-by: Dan Williams 
---
 drivers/nvdimm/namespace_devs.c |   19 +++
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index f5cb88601359..e5ad5162bf34 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -1379,21 +1379,16 @@ struct nd_namespace_common 
*nvdimm_namespace_common_probe(struct device *dev)
 {
struct nd_btt *nd_btt = is_nd_btt(dev) ? to_nd_btt(dev) : NULL;
struct nd_pfn *nd_pfn = is_nd_pfn(dev) ? to_nd_pfn(dev) : NULL;
-   struct nd_namespace_common *ndns;
+   struct nd_namespace_common *ndns = NULL;
resource_size_t size;
 
if (nd_btt || nd_pfn) {
-   struct device *host = NULL;
-
-   if (nd_btt) {
-   host = &nd_btt->dev;
+   if (nd_btt)
ndns = nd_btt->ndns;
-   } else if (nd_pfn) {
-   host = &nd_pfn->dev;
+   else if (nd_pfn)
ndns = nd_pfn->ndns;
-   }
 
-   if (!ndns || !host)
+   if (!ndns)
return ERR_PTR(-ENODEV);
 
/*
@@ -1404,12 +1399,12 @@ struct nd_namespace_common 
*nvdimm_namespace_common_probe(struct device *dev)
device_unlock(&ndns->dev);
if (ndns->dev.driver) {
dev_dbg(&ndns->dev, "is active, can't bind %s\n",
-   dev_name(host));
+   dev_name(dev));
return ERR_PTR(-EBUSY);
}
-   if (dev_WARN_ONCE(&ndns->dev, ndns->claim != host,
+   if (dev_WARN_ONCE(&ndns->dev, ndns->claim != dev,
"host (%s) vs claim (%s) mismatch\n",
-   dev_name(host),
+   dev_name(dev),
dev_name(ndns->claim)))
return ERR_PTR(-ENXIO);
} else {



[PATCH 0/7] "Device DAX" for persistent memory

2016-05-08 Thread Dan Williams
Device DAX is the device-centric analogue of Filesystem DAX
(CONFIG_FS_DAX).  It allows memory ranges to be allocated and mapped
without need of an intervening file system or being bound to block
device semantics.  Device DAX is strict and predictable.  Specifically
this interface:

1/ Guarantees fault granularity with respect to a given page size (pte,
pmd, or pud) set at configuration time.

2/ Enforces deterministic behavior by being strict about what fault
scenarios are supported.

This first implementation, for persistent memory, is targeted at
applications like hypervisors and some databases that only need an
allocate + map mechanism from the kernel.  Later this mechanism can be
used to enable direct access to other performance/feature differentiated
memory ranges.

This series is built on "[PATCH 00/13] prep for device-dax, untangle
pfn-device setup" [1], posted at the end of March.

A libnvdimm pmem namespace can be switched from its default /dev/pmemX
(block device) interface to /dev/daxX.Y with the ndctl utility:

ndctl create-namespace -m dax -e namespace0.0 -f

This implementation passes a basic setup, map, fault, and shutdown
sequence.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-March/005086.html

---

Dan Williams (7):
  libnvdimm: cleanup nvdimm_namespace_common_probe(), kill 'host'
  libnvdimm, dax: introduce device-dax infrastructure
  libnvdimm, dax: reserve space to store labels for device-dax
  libnvdimm, dax: record the specified alignment of a dax-device instance
  /dev/dax, pmem: direct access to persistent memory
  /dev/dax, core: file operations and dax-mmap
  Revert "block: enable dax for raw block devices"


 block/ioctl.c   |   32 --
 drivers/Kconfig |2 
 drivers/Makefile|1 
 drivers/dax/Kconfig |   25 ++
 drivers/dax/Makefile|4 
 drivers/dax/dax.c   |  543 +++
 drivers/dax/dax.h   |   24 ++
 drivers/dax/pmem.c  |  168 +++
 drivers/nvdimm/Kconfig  |   13 +
 drivers/nvdimm/Makefile |1 
 drivers/nvdimm/bus.c|4 
 drivers/nvdimm/claim.c  |2 
 drivers/nvdimm/dax_devs.c   |   99 ++
 drivers/nvdimm/namespace_devs.c |   38 ++
 drivers/nvdimm/nd-core.h|1 
 drivers/nvdimm/nd.h |   25 ++
 drivers/nvdimm/pfn.h|4 
 drivers/nvdimm/pfn_devs.c   |  116 +--
 drivers/nvdimm/region.c |2 
 drivers/nvdimm/region_devs.c|   29 ++
 fs/block_dev.c  |   96 ++
 include/linux/fs.h  |8 -
 include/uapi/linux/fs.h |1 
 include/uapi/linux/ndctl.h  |2 
 mm/huge_memory.c|1 
 mm/hugetlb.c|1 
 tools/testing/nvdimm/Kbuild |   10 +
 tools/testing/nvdimm/config_check.c |2 
 28 files changed, 1094 insertions(+), 160 deletions(-)
 create mode 100644 drivers/dax/Kconfig
 create mode 100644 drivers/dax/Makefile
 create mode 100644 drivers/dax/dax.c
 create mode 100644 drivers/dax/dax.h
 create mode 100644 drivers/dax/pmem.c
 create mode 100644 drivers/nvdimm/dax_devs.c


Re: [PATCH v2 2/3] net: ethernet: fec: use phydev from struct net_device

2016-05-08 Thread Ben Hutchings
On Sun, 2016-05-08 at 23:44 +0200, Philippe Reynes wrote:
> The private structure contain a pointer to phydev, but the structure
> net_device already contain such pointer. So we can remove the pointer
> phydev in the private structure, and update the driver to use the one
> contained in struct net_device.

But there is no central code that updates the pointer, so:

[...]
> @@ -1928,7 +1926,6 @@ static int fec_enet_mii_probe(struct net_device *ndev)
>  
>   phy_dev->advertising = phy_dev->supported;
>  
> - fep->phy_dev = phy_dev;

you need to assign ndev->phydev here

[...]
> @@ -2875,8 +2869,7 @@ fec_enet_close(struct net_device *ndev)
>   fec_stop(ndev);
>   }
>  
> - phy_disconnect(fep->phy_dev);
> - fep->phy_dev = NULL;
> + phy_disconnect(ndev->phydev);
[...]

and you need to set it to NULL here.

Ben.
 
-- 
Ben Hutchings
I haven't lost my mind; it's backed up on tape somewhere.



[PATCH 1/1] USB: FHCI: avoid redundant condition

2016-05-08 Thread Heinrich Schuchardt
The right-hand side of the following 'or' expression is only evaluated if
td is non-NULL:
!td || (td && td->status == USB_TD_INPROGRESS)
So no need to check td again.

Signed-off-by: Heinrich Schuchardt 
---
 drivers/usb/host/fhci-sched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/usb/host/fhci-sched.c b/drivers/usb/host/fhci-sched.c
index a9609a3..2f162fa 100644
--- a/drivers/usb/host/fhci-sched.c
+++ b/drivers/usb/host/fhci-sched.c
@@ -288,7 +288,7 @@ static int scan_ed_list(struct fhci_usb *usb,
list_for_each_entry(ed, list, node) {
td = ed->td_head;
 
-   if (!td || (td && td->status == USB_TD_INPROGRESS))
+   if (!td || td->status == USB_TD_INPROGRESS)
continue;
 
if (ed->state != FHCI_ED_OPER) {
-- 
2.1.4



Re: [PATCH 1/3] md: set MD_CHANGE_PENDING in a atomic region

2016-05-08 Thread Shaohua Li
On Tue, May 03, 2016 at 10:22:13PM -0400, Guoqing Jiang wrote:
> Some code waits for a metadata update by:
> 
> 1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
> 2. setting MD_CHANGE_PENDING and waking the management thread
> 3. waiting for MD_CHANGE_PENDING to be cleared
> 
> If the first two are done without locking, the code in md_update_sb()
> which checks if it needs to repeat might test if an update is needed
> before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
> in the wait returning early.
> 
> So make sure all places that set MD_CHANGE_PENDING are atomic, and
> bit_clear_unless (suggested by Neil) is introduced for the purpose.
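
An illustrative sketch of that primitive (the real bit_clear_unless() added
by the patch may differ in detail): clear the PENDING bit only if none of
the "more work" bits got set in the meantime, and report whether the clear
happened so md_update_sb() knows whether to repeat.

	#define sketch_bit_clear_unless(ptr, clear, test)		\
	({								\
		typeof(*(ptr)) old, new;				\
									\
		do {							\
			old = READ_ONCE(*(ptr));			\
			new = old & ~(clear);				\
		} while (!(old & (test)) &&				\
			 cmpxchg(ptr, old, new) != old);		\
									\
		!(old & (test));	/* true if the bits were cleared */	\
	})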

Applied the 3, thanks!

