linux-next: Tree for May 9

2016-05-08 Thread Stephen Rothwell
Hi all,

Changes since 20160506:

Dropped tree: hsi (at the maintainer's request)

The f2fs tree gained a conflict against the ext4 tree.

The libata tree gained a build failure so I used the version from
next-20160506 for today.

The net-next tree gained conflicts against the wireless-drivers and
net trees.

The drm tree gained conflicts against Linus' tree.

The sound-asoc tree lost its build failure.

Non-merge commits (relative to Linus' tree): 8795
 7764 files changed, 383723 insertions(+), 168978 deletions(-)



I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ).  If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one.  You should use "git fetch" and checkout or reset to the new
master.
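
For example, one possible invocation (the remote name "linux-next" below is
only an illustration; any name works):

$ git remote add linux-next git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
$ git fetch linux-next
$ git checkout -B next linux-next/master
$ # or, on an existing branch: git reset --hard linux-next/master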

You can see which trees have been included by looking in the Next/Trees
file in the source.  There are also quilt-import.log and merge.log
files in the Next directory.  Between each merge, the tree was built
with a ppc64_defconfig for powerpc and an allmodconfig (with
CONFIG_BUILD_DOCSRC=n) for x86_64, a multi_v7_defconfig for arm and a
native build of tools/perf. After the final fixups (if any), I do an
x86_64 modules_install followed by builds for x86_64 allnoconfig,
powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig, allyesconfig
(this fails its final link) and pseries_le_defconfig and i386, sparc
and sparc64 defconfig.

Below is a summary of the state of the merge.

I am currently merging 235 trees (counting Linus' and 35 trees of patches
pending for Linus' tree).

Stats about the size of the tree over time can be seen at
http://neuling.org/linux-next-size.html .

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next .  If maintainers want to give
advice about cross compilers/configs that work, we are always open to add
more builds.

Thanks to Randy Dunlap for doing many randconfig builds.  And to Paul
Gortmaker for triage and bug fixes.

-- 
Cheers,
Stephen Rothwell

$ git checkout master
$ git reset --hard stable
Merging origin/master (44549e8f5eea Linux 4.6-rc7)
Merging fixes/master (9735a22799b9 Linux 4.6-rc2)
Merging kbuild-current/rc-fixes (3d1450d54a4f Makefile: Force gzip and xz on 
module install)
Merging arc-current/for-curr (26f9d5fd82ca ARC: support HIGHMEM even without 
PAE40)
Merging arm-current/fixes (ec953b70f368 ARM: 8573/1: domain: move 
{set,get}_domain under config guard)
Merging m68k-current/for-linus (7b8ba82ad4ad m68k/defconfig: Update defconfigs 
for v4.6-rc2)
Merging metag-fixes/fixes (0164a711c97b metag: Fix ioremap_wc/ioremap_cached 
build errors)
Merging powerpc-fixes/fixes (b4c112114aab powerpc: Fix bad inline asm 
constraint in create_zero_mask())
Merging powerpc-merge-mpe/fixes (bc0195aad0da Linux 4.2-rc2)
Merging sparc/master (33656a1f2ee5 Merge branch 'for_linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs)
Merging net/master (8c1f45462574 netxen: netxen_rom_fast_read() doesn't return 
-1)
Merging ipsec/master (d6af1a31cc72 vti: Add pmtu handling to vti_xmit.)
Merging ipvs/master (f28f20da704d Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net)
Merging wireless-drivers/master (cbbba30f1ac9 Merge tag 
'iwlwifi-for-kalle-2016-05-04' of 
https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-fixes)
Merging mac80211/master (e6436be21e77 mac80211: fix statistics leak if 
dev_alloc_name() fails)
Merging sound-current/for-linus (2d2c038a ALSA: usb-audio: Quirk for yet 
another Phoenix Audio devices (v2))
Merging pci-current/for-linus (9a2a5a638f8e PCI: Do not treat EPROBE_DEFER as 
device attach failure)
Merging driver-core.current/driver-core-linus (c3b46c73264b Linux 4.6-rc4)
Merging tty.current/tty-linus (02da2d72174c Linux 4.6-rc5)
Merging usb.current/usb-linus (9be427efc764 Revert "USB / PM: Allow USB devices 
to remain runtime-suspended when sleeping")
Merging usb-gadget-fixes/fixes (38740a5b87d5 usb: gadget: f_fs: Fix 
use-after-free)
Merging usb-serial-fixes/usb-linus (74d2a91aec97 USB: serial: option: add even 
more ZTE device ids)
Merging usb-chipidea-fixes/ci-for-usb-stable (d144dfea8af7 usb: chipidea: otg: 
change workqueue ci_otg as freezable)
Merging staging.current/staging-linus (2b86c4a84377 Merge tag 
'iio-fixes-for-4.6d' of git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio 
into staging-linus)
Merging char-misc.current/char-misc-linus (d1306eb675ad nvmem: mxs-ocotp: fix 
buffer overflow in read)
Merging input-current/for-linus (eb43335c4095 Input: atmel_mxt_ts - use 
mxt_acquire_irq in mxt_soft_reset)
Merging crypto-current/master (58446fef579e crypto: rsa - select crypto mgr 
dependency)
Merging ide/master (1993b176a822 Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide)
Merging 


Re: [PATCH v2 1/2] mm, kasan: improve double-free detection

2016-05-08 Thread Dmitry Vyukov
On Sun, May 8, 2016 at 11:17 AM, Yury Norov  wrote:
> On Sat, May 07, 2016 at 03:15:59PM +, Luruo, Kuthonuzo wrote:
>> Thank you for the review!
>>
>> > > + switch (alloc_data.state) {
>> > > + case KASAN_STATE_QUARANTINE:
>> > > + case KASAN_STATE_FREE:
>> > > + kasan_report((unsigned long)object, 0, false,
>> > > + (unsigned long)__builtin_return_address(1));
>> >
>> > __builtin_return_address() is unsafe if argument is non-zero. Use
>> > return_address() instead.
>>
>> hmm, I/cscope can't seem to find an x86 implementation for return_address().
>> Will dig further; thanks.
>>
>
> It seems there's no generic interface to obtain return address. x86
> has a working __builtin_return_address() and it's ok with it, others
> use their own return_address(), and ok as well.
>
> I think unification is needed here.


We use _RET_IP_ in other places in portable part of kasan.
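
For reference, a minimal sketch of that approach; note that _RET_IP_ expands to
__builtin_return_address(0), i.e. the immediate return address, so the reported
frame is one level shallower than the __builtin_return_address(1) call quoted
above:

	case KASAN_STATE_QUARANTINE:
	case KASAN_STATE_FREE:
		kasan_report((unsigned long)object, 0, false, _RET_IP_);
		break;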


Re: [PATCH v7 7/9] clk: mediatek: Enable critical clocks for MT2701

2016-05-08 Thread James Liao
Hi Stephen,

On Fri, 2016-05-06 at 16:12 -0700, Stephen Boyd wrote:
> On 04/14, James Liao wrote:
> > Some system clocks should be turned on by default on MT2701.
> > This patch enables these clocks when related clocks have
> > been registered.
> > 
> > Signed-off-by: James Liao 
> > ---
> 
> critical clks got merged now (sorry I'm slowly getting back to
> looking at patches). Please use that flag.

I don't see critical clock support in v4.6-rc7. Is there a repo/branch
that has critical clocks merged?


Best regards,

James
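
For illustration, once the flag is available, using it looks roughly like this
(a sketch against the generic clk framework; the clock names, register and
helper function are made up):

	#include <linux/clk-provider.h>

	static struct clk *register_always_on_gate(struct device *dev,
						   void __iomem *reg,
						   spinlock_t *lock)
	{
		/* CLK_IS_CRITICAL asks the core to enable this clock and keep
		 * it enabled, replacing hand-rolled "turn on by default" code
		 * in the provider driver. */
		return clk_register_gate(dev, "sys_always_on", "clk26m",
					 CLK_IS_CRITICAL, reg, 0, 0, lock);
	}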



Re: [PATCH] compiler-gcc: require gcc 4.8 for powerpc __builtin_bswap16()

2016-05-08 Thread Sedat Dilek
On 5/9/16, Stephen Rothwell  wrote:
> Hi Josh,
>
> On Fri, 6 May 2016 09:22:25 -0500 Josh Poimboeuf 
> wrote:
>>
>> I've also seen no problems on powerpc with 4.4 and 4.8.  I suspect it's
>> specific to gcc 4.6.  Stephen, can you confirm this patch fixes it?
>
> That will obviously fix the problem for us (since it will effectively
> restore the code to what it was before the other commit for our gcc
> 4.6.3 builds and we have not seen it in other builds).  I will add this
> patch to linux-next today.
>
> And since "byteswap: try to avoid __builtin_constant_p gcc bug" is not
> in Linus' tree, hopefully we can have this fix applied soon.
>

FYI, this patch is in Linus' tree (v4.6-rc7 has it).

- Sedat -

[1] 
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7322dd755e7dd34bc5359aa27abeed1687e0f628

>> From: Josh Poimboeuf 
>> Subject: [PATCH] compiler-gcc: require gcc 4.8 for powerpc
>> __builtin_bswap16()
>>
>> gcc support for __builtin_bswap16() was supposedly added for powerpc in
>> gcc 4.6, and was then later added for other architectures in gcc 4.8.
>>
>> However, Stephen Rothwell reported that attempting to use it on powerpc
>> in gcc 4.6 fails with:
>>
>>   lib/vsprintf.c:160:2: error: initializer element is not constant
>>   lib/vsprintf.c:160:2: error: (near initialization for 'decpair[0]')
>>   lib/vsprintf.c:160:2: error: initializer element is not constant
>>   lib/vsprintf.c:160:2: error: (near initialization for 'decpair[1]')
>>   lib/vsprintf.c:160:2: error: initializer element is not constant
>>   lib/vsprintf.c:160:2: error: (near initialization for 'decpair[2]')
>>   lib/vsprintf.c:160:2: error: initializer element is not constant
>>   lib/vsprintf.c:160:2: error: (near initialization for 'decpair[3]')
>>   lib/vsprintf.c:160:2: error: initializer element is not constant
>>
>> I'm not entirely sure what those errors mean, but I don't see them on
>> gcc 4.8.  So let's consider gcc 4.8 to be the official starting point
>> for __builtin_bswap16().
>>
>> Fixes: 7322dd755e7d ("byteswap: try to avoid __builtin_constant_p gcc
>> bug")
>> Reported-by: Stephen Rothwell 
>> Signed-off-by: Josh Poimboeuf 
>> ---
>>  include/linux/compiler-gcc.h | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
>> index eeae401..3d5202e 100644
>> --- a/include/linux/compiler-gcc.h
>> +++ b/include/linux/compiler-gcc.h
>> @@ -246,7 +246,7 @@
>>  #define __HAVE_BUILTIN_BSWAP32__
>>  #define __HAVE_BUILTIN_BSWAP64__
>>  #endif
>> -#if GCC_VERSION >= 40800 || (defined(__powerpc__) && GCC_VERSION >=
>> 40600)
>> +#if GCC_VERSION >= 40800
>>  #define __HAVE_BUILTIN_BSWAP16__
>>  #endif
>>  #endif /* CONFIG_ARCH_USE_BUILTIN_BSWAP */
>> --
>> 2.4.11
>
> --
> Cheers,
> Stephen Rothwell
> --
> To unsubscribe from this list: send the line "unsubscribe linux-next" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
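
For context, the quoted initializer errors boil down to using the builtin in a
static initializer, roughly as below (a simplified sketch, not the real decpair
table from lib/vsprintf.c):

	/* On big-endian powerpc, cpu_to_le16() expands to __builtin_bswap16()
	 * when __HAVE_BUILTIN_BSWAP16__ is defined; gcc 4.6 rejects that in a
	 * static initializer as "not constant", while gcc 4.8 accepts it. */
	static const unsigned short decpair_example[] = {
		__builtin_bswap16(0x3030),
		__builtin_bswap16(0x3130),
	};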


Re: [PATCH 0/6] Intel Secure Guard Extensions

2016-05-08 Thread Jarkko Sakkinen
On Fri, May 06, 2016 at 01:54:14PM +0200, Thomas Gleixner wrote:
> On Fri, 6 May 2016, Jarkko Sakkinen wrote:
> 
> > On Tue, May 03, 2016 at 04:06:27AM -0500, Dr. Greg Wettstein wrote:
> > > It would be helpful and instructive for anyone involved in this debate
> > > to review the following URL which details Intel's SGX licensing
> > > program:
> > > 
> > > https://software.intel.com/en-us/articles/intel-sgx-product-licensing
> > 
> > I think it would be good  to note that the licensing process is available
> > only for Windows. For Linux you can only use debug enclaves at the
> > moment. The default LE has "allow-all" policy for debug enclaves.
> 
> Which makes the feature pretty useless.
>  
> > > I think the only way forward to make all of this palatable is to
> > > embrace something similar to what has been done with Secure Boot.  The
> > > Root Enclave Key will need to be something which can be reconfigured
> > > by the Platform Owner through BIOS/EFI.  That model would take Intel
> > > off the hook from a security perspective and establish the notion of
> > > platform trust to be a bilateral relationship between a service
> > > provider and client.
> > 
> > This concern has been raised many times now. Sadly this did not make
> > into Skyle but in future we will have one shot MSRs (can be set only
> > once per boot cycle) for defining your own root of trust.
> 
> We'll wait for that to happen.

I fully understand if you (and others) want to keep this standpoint but
what if we could get it to staging after I've revised it with suggested
changes and internal changes in my TODO? Then it would not pollute the
mainline kernel but still would be easily available for experimentation.

There was one header outside the staging tree in the patch set, sgx.h, which I
could place in the staging area in the next revision.

For the next revision I'll document how IA32_LEPUBKEYHASHx MSRs work
based on some concerns that Andy raised so that we can hopefully have a
better discussion about this feature.

> Thanks,
> 
>   tglx

/Jarkko


Re: [PATCH v7 8/9] clk: mediatek: Add config options for MT2701 subsystem clocks

2016-05-08 Thread James Liao
Hi Stephen,

On Fri, 2016-05-06 at 16:02 -0700, Stephen Boyd wrote:
> On 04/14, James Liao wrote:
> > MT2701 subsystem clocks are optional and should be enabled only if
> > their subsystem drivers are ready to control these clocks.
> > 
> > Signed-off-by: James Liao 
> > ---
> 
> Why is this patch split off from the patch that introduces the
> file?

I was looking for comments about how to make subsystem clocks optional.
So I used a separate patch to do it. Is it an acceptable way to use
config options to enable subsystem clock support?


Best regards,

James




Re: [PATCH 1/4] locking/rwsem: Avoid stale ->count for rwsem_down_write_failed()

2016-05-08 Thread Peter Hurley
On 05/08/2016 09:56 PM, Davidlohr Bueso wrote:
> The field is obviously updated w/o the lock and needs a READ_ONCE
> while waiting for lock holder(s) to go away, just like we do with
> all other ->count accesses.

This isn't actually fixing a bug because it's passed through
several full barriers which will force reloading from sem->count.

I think the patch is ok if you want it just for consistency anyway,
but please change $subject and changelog.

Regards,
Peter Hurley
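
For readers unfamiliar with the distinction being drawn here: the barriers
already in the loop force a reload, while READ_ONCE() makes that requirement
explicit at the access itself. A generic illustration (not the rwsem code):

	/* Plain load: absent other barriers, the compiler may hoist it and
	 * spin on a stale register copy. */
	while (flag == 0)
		cpu_relax();

	/* READ_ONCE() forces a fresh load from memory on every iteration. */
	while (READ_ONCE(flag) == 0)
		cpu_relax();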


> Signed-off-by: Davidlohr Bueso 
> ---
>  kernel/locking/rwsem-xadd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> index df4dcb883b50..7d62772600cf 100644
> --- a/kernel/locking/rwsem-xadd.c
> +++ b/kernel/locking/rwsem-xadd.c
> @@ -494,7 +494,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore 
> *sem, int state)
>   }
>   schedule();
>   set_current_state(state);
> - } while ((count = sem->count) & RWSEM_ACTIVE_MASK);
> + } while ((count = READ_ONCE(sem->count)) & RWSEM_ACTIVE_MASK);
>  
>   raw_spin_lock_irq(&sem->wait_lock);
>   }
> 



Re: [PATCH v2 2/2] kasan: add kasan_double_free() test

2016-05-08 Thread Dmitry Vyukov
On Fri, May 6, 2016 at 1:50 PM, Kuthonuzo Luruo  wrote:
> This patch adds a new 'test_kasan' test for KASAN double-free error
> detection when the same slab object is concurrently deallocated.
>
> Signed-off-by: Kuthonuzo Luruo 
> ---
> Changes in v2:
> - This patch is new for v2.
> ---
>  lib/test_kasan.c |   79 
> ++
>  1 files changed, 79 insertions(+), 0 deletions(-)
>
> diff --git a/lib/test_kasan.c b/lib/test_kasan.c
> index bd75a03..dec5f74 100644
> --- a/lib/test_kasan.c
> +++ b/lib/test_kasan.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  static noinline void __init kmalloc_oob_right(void)
>  {
> @@ -389,6 +390,83 @@ static noinline void __init ksize_unpoisons_memory(void)
> kfree(ptr);
>  }
>
> +#ifdef CONFIG_SLAB
> +#ifdef CONFIG_SMP

Will it fail without CONFIG_SMP if we create more than 1 kthread? If
it does not fail, then please remove the ifdef.
Also see below.


> +static DECLARE_COMPLETION(starting_gun);
> +static DECLARE_COMPLETION(finish_line);
> +
> +static int try_free(void *p)
> +{
> +   wait_for_completion(&starting_gun);
> +   kfree(p);
> +   complete(&finish_line);
> +   return 0;
> +}
> +
> +/*
> + * allocs an object; then all cpus concurrently attempt to free the
> + * same object.
> + */
> +static noinline void __init kasan_double_free(void)
> +{
> +   char *p;
> +   int cpu;
> +   struct task_struct **tasks;
> +   size_t size = (KMALLOC_MAX_CACHE_SIZE/4 + 1);

Is it important to use such tricky size calculation here? If it is not
important, then please replace it with some small constant.
There are some tests that calculate size based on
KMALLOC_MAX_CACHE_SIZE, but that's important for them.



> +   /*
> +* max slab size instrumented by KASAN is KMALLOC_MAX_CACHE_SIZE/2.
> +* Do not increase size beyond this: slab corruption from double-free
> +* may ensue.
> +*/
> +   pr_info("concurrent double-free test\n");
> +   init_completion(&starting_gun);
> +   init_completion(&finish_line);
> +   tasks = kzalloc((sizeof(tasks) * nr_cpu_ids), GFP_KERNEL);
> +   if (!tasks) {
> +   pr_err("Allocation failed\n");
> +   return;
> +   }
> +   p = kmalloc(size, GFP_KERNEL);
> +   if (!p) {
> +   pr_err("Allocation failed\n");
> +   return;
> +   }
> +
> +   for_each_online_cpu(cpu) {


Won't the test fail with 1 cpu?
By failing I mean that it won't detect the double-free. Soon we will
start automatically ensuring that a double-free test in fact detects a
double-free.
I think it will be much simpler to use just, say, 4 threads. It will
eliminate kzalloc, kfree, allocation failure tests, memory leaks and
also fix !CONFIG_SMP.
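
A rough sketch of that suggestion (fixed thread count instead of a per-CPU task
array; the helper names and the count of four threads are illustrative, not
from the patch):

	#include <linux/completion.h>
	#include <linux/err.h>
	#include <linux/kthread.h>
	#include <linux/slab.h>

	#define DOUBLE_FREE_THREADS 4

	static DECLARE_COMPLETION(starting_gun);
	static DECLARE_COMPLETION(finish_line);

	static int try_free(void *p)
	{
		wait_for_completion(&starting_gun);
		kfree(p);
		complete(&finish_line);
		return 0;
	}

	static noinline void __init kasan_double_free_threaded(void)
	{
		struct task_struct *tasks[DOUBLE_FREE_THREADS];
		char *p = kmalloc(128, GFP_KERNEL);
		int i, started = 0;

		if (!p)
			return;
		for (i = 0; i < DOUBLE_FREE_THREADS; i++) {
			tasks[i] = kthread_run(try_free, p, "try_free%d", i);
			if (IS_ERR(tasks[i]))
				break;
			started++;
		}
		complete_all(&starting_gun);	/* let the threads race on kfree(p) */
		while (started--)
			wait_for_completion(&finish_line);
	}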



> +   tasks[cpu] = kthread_create(try_free, (void *)p, "try_free%d",
> +   cpu);
> +   if (IS_ERR(tasks[cpu])) {
> +   WARN(1, "kthread_create failed.\n");
> +   return;
> +   }
> +   kthread_bind(tasks[cpu], cpu);
> +   wake_up_process(tasks[cpu]);
> +   }
> +
> +   complete_all(&starting_gun);
> +   for_each_online_cpu(cpu)
> +   wait_for_completion(&finish_line);
> +   kfree(tasks);
> +}
> +#else
> +static noinline void __init kasan_double_free(void)

This test should work with CONFIG_SLAB as well.
Please name the tests differently (e.g. kasan_double_free and
kasan_double_free_threaded), and run kasan_double_free always.
If kasan_double_free_threaded fails, but kasan_double_free does not,
that's already some useful info. And if both fail, then it's always
better to have a simpler reproducer.


> +{
> +   char *p;
> +   size_t size = 2049;
> +
> +   pr_info("double-free test\n");
> +   p = kmalloc(size, GFP_KERNEL);
> +   if (!p) {
> +   pr_err("Allocation failed\n");
> +   return;
> +   }
> +   kfree(p);
> +   kfree(p);
> +}
> +#endif
> +#endif
> +
>  static int __init kmalloc_tests_init(void)
>  {
> kmalloc_oob_right();
> @@ -414,6 +492,7 @@ static int __init kmalloc_tests_init(void)
> kasan_global_oob();
>  #ifdef CONFIG_SLAB
> kasan_quarantine_cache();
> +   kasan_double_free();
>  #endif
> ksize_unpoisons_memory();
> return -EAGAIN;
> --
> 1.7.1
>


Re: [PATCH 3/6] intel_sgx: driver for Intel Secure Guard eXtensions

2016-05-08 Thread Jarkko Sakkinen
On Fri, Apr 29, 2016 at 03:22:19PM -0700, Jethro Beekman wrote:
> On 29-04-16 13:04, Jarkko Sakkinen wrote:
> >>> Why would you want to do that?
> >>
> >> ...
> >
> > Do you see this as a performance issue or why do you think that this
> > would hurt that much?
> 
> I don't think it's a performance issue at all. I'm just giving an example of
> why you'd want to do this. I'm sure people who want to use this instruction
> set can come up with other uses, so I think the driver should support it.
> Other drivers on different platforms might support this, in which case we
> should be compatible (to achieve the same enclave measurement). Other Linux
> drivers support it [1]. I would ask: why would you not want to do this? It
> seems trivial to expand the current flag into 16 separate flags; one for each
> 256-byte chunk in the page.

I'm fine with adding a 16-bit bitmask.

/Jarkko
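
For illustration only (not from the patch set), one way a 16-bit mask could map
onto a 4 KiB enclave page, with bit i covering the i-th 256-byte chunk:

	#define SGX_CHUNK_SIZE		256
	#define SGX_CHUNKS_PER_PAGE	(PAGE_SIZE / SGX_CHUNK_SIZE)	/* 16 for 4 KiB pages */

	/* true if chunk 'i' of the page should be added and measured */
	static bool chunk_selected(u16 mask, unsigned int i)
	{
		return mask & (1u << i);
	}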


Re: [PATCH 1/1] xen/gntdev: kmalloc structure gntdev_copy_batch

2016-05-08 Thread Juergen Gross
On 07/05/16 10:17, Heinrich Schuchardt wrote:
> Commit a4cdb556cae0 ("xen/gntdev: add ioctl for grant copy")
> leads to a warning
> xen/gntdev.c: In function ‘gntdev_ioctl_grant_copy’:
> xen/gntdev.c:949:1: warning: the frame size of 1248 bytes
> is larger than 1024 bytes [-Wframe-larger-than=]
> 
> This can be avoided by using kmalloc instead of the stack.
> 
> Testing requires CONFIG_XEN_GNTDEV.
> 
> Fixes: a4cdb556cae0 ("xen/gntdev: add ioctl for grant copy")
> Signed-off-by: Heinrich Schuchardt 

Acked-by: Juergen Gross 
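
The general pattern being applied is simply moving a large object off the
kernel stack (a generic sketch; the stand-in struct below is not the real
gntdev_copy_batch):

	#include <linux/slab.h>

	struct big_batch {			/* stand-in for the ~1 KiB on-stack struct */
		char data[1200];
	};

	static int do_batch_on_heap(void)
	{
		struct big_batch *batch;

		/* A heap allocation keeps the frame well under the 1024-byte
		 * -Wframe-larger-than= limit, at the cost of a kmalloc/kfree pair. */
		batch = kmalloc(sizeof(*batch), GFP_KERNEL);
		if (!batch)
			return -ENOMEM;

		/* ... fill and use *batch ... */

		kfree(batch);
		return 0;
	}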


Re: [PATCH] kdump: Fix gdb macros to work with newer and 64-bit kernels

2016-05-08 Thread Baoquan He
Hi Corey,

I am trying to review this patch now, and the fixes it contains are very
good. Just several concerns are added as inline comments.

By the way, did you run this on your side?

Hi Vivek,

A member variable was added to task_struct in the commit below, replacing
pids[PIDTYPE_TGID], and from then on nobody complained about it. It seems
people rarely use this utility.

commit 47e65328a7b1cdfc4e3102e50d60faf94ebba7d3
Author: Oleg Nesterov 
Date:   Tue Mar 28 16:11:25 2006 -0800

[PATCH] pids: kill PIDTYPE_TGID



On 04/27/16 at 07:21am, Corey Minyard wrote:
> Any comments on this?  If no one else cares I'd be willing to take over
> maintenance of this.
> 
> -corey
> 
> On 02/25/2016 07:51 AM, miny...@acm.org wrote:
> >From: Corey Minyard 
> >
> >Lots of little changes needed to be made to clean these up, remove the
> >four byte pointer assumption and traverse the pid queue properly.
> >Also consolidate the traceback code into a single function instead
> >of having three copies of it.
> >
> >Signed-off-by: Corey Minyard 
> >---
> >  Documentation/kdump/gdbmacros.txt | 90 
> > +--
> >  1 file changed, 40 insertions(+), 50 deletions(-)
> >
> >I sent this earlier, but I didn't get a response.  These are clearly
> >wrong.  I'd be happy to take over maintenance of these macros.  It
> >might be better to move them someplace else, too, since they are also
> >useful for kgdb.
> >
> >diff --git a/Documentation/kdump/gdbmacros.txt 
> >b/Documentation/kdump/gdbmacros.txt
> >index 9b9b454..e5bbd8d 100644
> >--- a/Documentation/kdump/gdbmacros.txt
> >+++ b/Documentation/kdump/gdbmacros.txt
> >@@ -15,14 +15,14 @@
> >  define bttnobp
> > set $tasks_off=((size_t)&((struct task_struct *)0)->tasks)
> >-set $pid_off=((size_t)&((struct task_struct *)0)->pids[1].pid_list.next)
> >+set $pid_off=((size_t)&((struct task_struct *)0)->thread_group.next)

This is a quite nice fix.

> > set $init_t=&init_task
> > set $next_t=(((char *)($init_t->tasks).next) - $tasks_off)
> > while ($next_t != $init_t)
> > set $next_t=(struct task_struct *)$next_t
> > printf "\npid %d; comm %s:\n", $next_t.pid, $next_t.comm
> > printf "===\n"
> >-set var $stackp = $next_t.thread.esp
> >+set var $stackp = $next_t.thread.sp
> > set var $stack_top = ($stackp & ~4095) + 4096
> > while ($stackp < $stack_top)
> >@@ -31,12 +31,12 @@ define bttnobp
> > end
> > set $stackp += 4
> > end
> >-set $next_th=(((char *)$next_t->pids[1].pid_list.next) - 
> >$pid_off)
> >+set $next_th=(((char *)$next_t->thread_group.next) - $pid_off)
> > while ($next_th != $next_t)
> > set $next_th=(struct task_struct *)$next_th
> > printf "\npid %d; comm %s:\n", $next_t.pid, $next_t.comm
> > printf "===\n"
> >-set var $stackp = $next_t.thread.esp
> >+set var $stackp = $next_t.thread.sp
> > set var $stack_top = ($stackp & ~4095) + 4096
> > while ($stackp < $stack_top)
> >@@ -45,7 +45,7 @@ define bttnobp
> > end
> > set $stackp += 4
> > end
> >-set $next_th=(((char *)$next_th->pids[1].pid_list.next) 
> >- $pid_off)
> >+set $next_th=(((char *)$next_th->thread_group.next) - 
> >$pid_off)
> > end
> > set $next_t=(char *)($next_t->tasks.next) - $tasks_off
> > end
> >@@ -54,42 +54,43 @@ document bttnobp
> > dump all thread stack traces on a kernel compiled with 
> > !CONFIG_FRAME_POINTER
> >  end
> >+define btthreadstruct

This is a nice wrapping, but I guess you want to name it as
btthreadstack, right? Since I didn't get at all why it's related to
thread_struct except for getting 'sp'.

> >+set var $pid_task = $arg0
> >+
> >+printf "\npid %d; comm %s:\n", $pid_task.pid, $pid_task.comm
> >+printf "task struct: "
> >+print $pid_task
> >+printf "===\n"
> >+set var $stackp = $pid_task.thread.sp
> >+set var $stack_top = ($stackp & ~4095) + 4096
> >+set var $stack_bot = ($stackp & ~4095)
> >+
> >+set $stackp = *((unsigned long *) $stackp)
> >+while (($stackp < $stack_top) && ($stackp > $stack_bot))
> >+set var $addr = *(((unsigned long *) $stackp) + 1)
> >+info symbol $addr
> >+set $stackp = *((unsigned long *) $stackp)
> >+end
> >+end
> >+document btthreadstruct
> >+ dump a thread stack using the given task structure pointer
> >+end
> >+
> >+
> >  define btt
> > set $tasks_off=((size_t)&((struct task_struct *)0)->tasks)
> >-set $pid_off=((size_t)&((struct task_struct 


RE: [Patch v3 5/8] firmware: qcom: scm: Convert to streaming DMA APIs

2016-05-08 Thread Sricharan
Hi,
> This patch converts the Qualcomm SCM driver to use the streaming DMA
> APIs for communication buffers.
> 
> Signed-off-by: Andy Gross 
> ---

 Reviewed-by: sricha...@codeaurora.org

Regards,
 Sricharan

>  drivers/firmware/qcom_scm-32.c | 152
+
> 
>  drivers/firmware/qcom_scm.c|   6 +-
>  drivers/firmware/qcom_scm.h|  10 +--
>  3 files changed, 58 insertions(+), 110 deletions(-)
> 
> diff --git a/drivers/firmware/qcom_scm-32.c b/drivers/firmware/qcom_scm-
> 32.c index 4388d13..3e71aec 100644
> --- a/drivers/firmware/qcom_scm-32.c
> +++ b/drivers/firmware/qcom_scm-32.c
> @@ -23,8 +23,7 @@
>  #include 
>  #include 
>  #include 
> -
> -#include 
> +#include 
> 
>  #include "qcom_scm.h"
> 
> @@ -97,44 +96,6 @@ struct qcom_scm_response {  };
> 
>  /**
> - * alloc_qcom_scm_command() - Allocate an SCM command
> - * @cmd_size: size of the command buffer
> - * @resp_size: size of the response buffer
> - *
> - * Allocate an SCM command, including enough room for the command
> - * and response headers as well as the command and response buffers.
> - *
> - * Returns a valid &qcom_scm_command on success or %NULL if the
> allocation fails.
> - */
> -static struct qcom_scm_command *alloc_qcom_scm_command(size_t
> cmd_size, size_t resp_size) -{
> - struct qcom_scm_command *cmd;
> - size_t len = sizeof(*cmd) + sizeof(struct qcom_scm_response) +
> cmd_size +
> - resp_size;
> - u32 offset;
> -
> - cmd = kzalloc(PAGE_ALIGN(len), GFP_KERNEL);
> - if (cmd) {
> - cmd->len = cpu_to_le32(len);
> - offset = offsetof(struct qcom_scm_command, buf);
> - cmd->buf_offset = cpu_to_le32(offset);
> - cmd->resp_hdr_offset = cpu_to_le32(offset + cmd_size);
> - }
> - return cmd;
> -}
> -
> -/**
> - * free_qcom_scm_command() - Free an SCM command
> - * @cmd: command to free
> - *
> - * Free an SCM command.
> - */
> -static inline void free_qcom_scm_command(struct qcom_scm_command
> *cmd) -{
> - kfree(cmd);
> -}
> -
> -/**
>   * qcom_scm_command_to_response() - Get a pointer to a
> qcom_scm_response
>   * @cmd: command
>   *
> @@ -168,7 +129,7 @@ static inline void
> *qcom_scm_get_response_buffer(const struct qcom_scm_response
>   return (void *)rsp + le32_to_cpu(rsp->buf_offset);  }
> 
> -static u32 smc(u32 cmd_addr)
> +static u32 smc(dma_addr_t cmd_addr)
>  {
>   int context_id;
>   register u32 r0 asm("r0") = 1;
> @@ -192,51 +153,15 @@ static u32 smc(u32 cmd_addr)
>   return r0;
>  }
> 
> -static int __qcom_scm_call(const struct qcom_scm_command *cmd) -{
> - int ret;
> - u32 cmd_addr = virt_to_phys(cmd);
> -
> - /*
> -  * Flush the command buffer so that the secure world sees
> -  * the correct data.
> -  */
> - secure_flush_area(cmd, cmd->len);
> -
> - ret = smc(cmd_addr);
> - if (ret < 0)
> - ret = qcom_scm_remap_error(ret);
> -
> - return ret;
> -}
> -
> -static void qcom_scm_inv_range(unsigned long start, unsigned long end) -{
> - u32 cacheline_size, ctr;
> -
> - asm volatile("mrc p15, 0, %0, c0, c0, 1" : "=r" (ctr));
> - cacheline_size = 4 << ((ctr >> 16) & 0xf);
> -
> - start = round_down(start, cacheline_size);
> - end = round_up(end, cacheline_size);
> - outer_inv_range(start, end);
> - while (start < end) {
> - asm ("mcr p15, 0, %0, c7, c6, 1" : : "r" (start)
> -  : "memory");
> - start += cacheline_size;
> - }
> - dsb();
> - isb();
> -}
> -
>  /**
>   * qcom_scm_call() - Send an SCM command
> - * @svc_id: service identifier
> - * @cmd_id: command identifier
> - * @cmd_buf: command buffer
> - * @cmd_len: length of the command buffer
> - * @resp_buf: response buffer
> - * @resp_len: length of the response buffer
> + * @dev: struct device
> + * @svc_id:  service identifier
> + * @cmd_id:  command identifier
> + * @cmd_buf: command buffer
> + * @cmd_len: length of the command buffer
> + * @resp_buf:response buffer
> + * @resp_len:length of the response buffer
>   *
>   * Sends a command to the SCM and waits for the command to finish
> processing.
>   *
> @@ -247,42 +172,60 @@ static void qcom_scm_inv_range(unsigned long
> start, unsigned long end)
>   * and response buffers is taken care of by qcom_scm_call; however,
callers
> are
>   * responsible for any other cached buffers passed over to the secure
world.
>   */
> -static int qcom_scm_call(u32 svc_id, u32 cmd_id, const void *cmd_buf,
> - size_t cmd_len, void *resp_buf, size_t resp_len)
> +static int qcom_scm_call(struct device *dev, u32 svc_id, u32 cmd_id,
> +  const void *cmd_buf, size_t cmd_len, void
> *resp_buf,
> +  size_t resp_len)
>  {
>   int ret;
>   struct qcom_scm_command *cmd;
>   struct qcom_scm_response *rsp;
> - unsigned long start, end;
> + size_t alloc_len = 
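
For background, the streaming DMA pattern that the patch converts to looks
roughly like this (a generic sketch; everything except the dma_* calls is made
up for illustration):

	#include <linux/dma-mapping.h>

	static int issue_smc(dma_addr_t addr);	/* hypothetical: hands the bus address to firmware */

	static int send_one_command(struct device *dev, void *cmd, size_t len)
	{
		dma_addr_t bus_addr;
		int ret;

		/* map the buffer for a single device-bound transaction */
		bus_addr = dma_map_single(dev, cmd, len, DMA_TO_DEVICE);
		if (dma_mapping_error(dev, bus_addr))
			return -ENOMEM;

		ret = issue_smc(bus_addr);

		dma_unmap_single(dev, bus_addr, len, DMA_TO_DEVICE);
		return ret;
	}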


Re: [PATCH 4/4] x86/kasan: Instrument user memory access API

2016-05-08 Thread Dmitry Vyukov
On Fri, May 6, 2016 at 2:45 PM, Andrey Ryabinin  wrote:
> Exchange between user and kernel memory is coded in assembly language.
> Which means that such accesses won't be spotted by KASAN as a compiler
> instruments only C code.
> Add explicit KASAN checks to user memory access API to ensure that
> userspace writes to (or reads from) a valid kernel memory.
>
> Note: Unlike others strncpy_from_user() is written mostly in C and KASAN
> sees memory accesses in it. However, it makes sense to add explicit check
> for all @count bytes that *potentially* could be written to the kernel.


Reviewed-by: Dmitry Vyukov 

Thanks!


> Signed-off-by: Andrey Ryabinin 
> Cc: Alexander Potapenko 
> Cc: Dmitry Vyukov 
> Cc: x...@kernel.org
> ---
>  arch/x86/include/asm/uaccess.h| 5 +
>  arch/x86/include/asm/uaccess_64.h | 7 +++
>  lib/strncpy_from_user.c   | 2 ++
>  3 files changed, 14 insertions(+)
>
> diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
> index 0b17fad..5dd6d18 100644
> --- a/arch/x86/include/asm/uaccess.h
> +++ b/arch/x86/include/asm/uaccess.h
> @@ -5,6 +5,7 @@
>   */
>  #include 
>  #include 
> +#include <linux/kasan-checks.h>
>  #include 
>  #include 
>  #include 
> @@ -732,6 +733,8 @@ copy_from_user(void *to, const void __user *from, 
> unsigned long n)
>
> might_fault();
>
> +   kasan_check_write(to, n);
> +
> /*
>  * While we would like to have the compiler do the checking for us
>  * even in the non-constant size case, any false positives there are
> @@ -765,6 +768,8 @@ copy_to_user(void __user *to, const void *from, unsigned 
> long n)
>  {
> int sz = __compiletime_object_size(from);
>
> +   kasan_check_read(from, n);
> +
> might_fault();
>
> /* See the comment in copy_from_user() above. */
> diff --git a/arch/x86/include/asm/uaccess_64.h 
> b/arch/x86/include/asm/uaccess_64.h
> index 3076986..2eac2aa 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/kasan-checks.h>
>  #include 
>  #include 
>  #include 
> @@ -109,6 +110,7 @@ static __always_inline __must_check
>  int __copy_from_user(void *dst, const void __user *src, unsigned size)
>  {
> might_fault();
> +   kasan_check_write(dst, size);
> return __copy_from_user_nocheck(dst, src, size);
>  }
>
> @@ -175,6 +177,7 @@ static __always_inline __must_check
>  int __copy_to_user(void __user *dst, const void *src, unsigned size)
>  {
> might_fault();
> +   kasan_check_read(src, size);
> return __copy_to_user_nocheck(dst, src, size);
>  }
>
> @@ -242,12 +245,14 @@ int __copy_in_user(void __user *dst, const void __user 
> *src, unsigned size)
>  static __must_check __always_inline int
>  __copy_from_user_inatomic(void *dst, const void __user *src, unsigned size)
>  {
> +   kasan_check_write(dst, size);
> return __copy_from_user_nocheck(dst, src, size);
>  }
>
>  static __must_check __always_inline int
>  __copy_to_user_inatomic(void __user *dst, const void *src, unsigned size)
>  {
> +   kasan_check_read(src, size);
> return __copy_to_user_nocheck(dst, src, size);
>  }
>
> @@ -258,6 +263,7 @@ static inline int
>  __copy_from_user_nocache(void *dst, const void __user *src, unsigned size)
>  {
> might_fault();
> +   kasan_check_write(dst, size);
> return __copy_user_nocache(dst, src, size, 1);
>  }
>
> @@ -265,6 +271,7 @@ static inline int
>  __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
>   unsigned size)
>  {
> +   kasan_check_write(dst, size);
> return __copy_user_nocache(dst, src, size, 0);
>  }
>
> diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
> index 3384032..e3472b0 100644
> --- a/lib/strncpy_from_user.c
> +++ b/lib/strncpy_from_user.c
> @@ -1,5 +1,6 @@
>  #include 
>  #include 
> +#include <linux/kasan-checks.h>
>  #include 
>  #include 
>  #include 
> @@ -103,6 +104,7 @@ long strncpy_from_user(char *dst, const char __user *src, 
> long count)
> if (unlikely(count <= 0))
> return 0;
>
> +   kasan_check_write(dst, count);
> max_addr = user_addr_max();
> src_addr = (unsigned long)src;
> if (likely(src_addr < max_addr)) {
> --
> 2.7.3
>
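For illustration, a minimal sketch (hypothetical module code, not part of the patch) of the failure mode the added checks target: the copy itself runs in assembly, so only the new kasan_check_write() call lets KASAN see the overflow before the bytes land in the kernel buffer.

#include <linux/errno.h>
#include <linux/uaccess.h>

/* Hypothetical example: the destination is smaller than the requested copy.
 * Before this series the assembly copy was invisible to KASAN; with the
 * patch, kasan_check_write(to, n) reports the bad write before copying. */
static long example_ioctl_copy(void __user *arg)
{
	char buf[8];

	if (copy_from_user(buf, arg, 16))	/* 16 > sizeof(buf): KASAN report */
		return -EFAULT;
	return 0;
}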



Re: [PATCH] mm/zsmalloc: avoid unnecessary iteration in get_pages_per_zspage()

2016-05-08 Thread Minchan Kim
On Fri, May 06, 2016 at 06:33:42PM +0900, Sergey Senozhatsky wrote:
> On (05/06/16 18:08), Sergey Senozhatsky wrote:
> [..]
> > and it's not 45 iterations that we are getting rid of, but around 31:
> > not every class reaches it's ideal 100% ratio on the first iteration.
> > so, no, sorry, I don't think the patch really does what we want.
> 
> 
> to be clear, what I meant was:
> 
>   495 `cmp' + 15 `cmp je' IN
>   31 `mov cltd idiv mov sub imul cltd idiv cmp'   OUT
> 
> IN > OUT.
> 
> 
> CORRECTION here:
> 
> > * by the way, we don't even need `cltd' in those calculations. the
> > reason why gcc puts cltd is because ZS_MAX_PAGES_PER_ZSPAGE has the
> > 'wrong' data type. the patch to correct it is below (not a formal
> > patch).
> 
> no, we need cltd there. but ZS_MAX_PAGES_PER_ZSPAGE also affects
> ZS_MIN_ALLOC_SIZE, which is used in several places, like
> get_size_class_index(). that's why ZS_MAX_PAGES_PER_ZSPAGE data
> type change `improves' zs_malloc().

Why not if such simple improves zsmalloc? :)
Please send a patch.

Thanks a lot, Sergey!
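As a stand-alone illustration of the cltd point above (not the zsmalloc code itself): with signed int operands gcc must sign-extend into %edx (cltd) before idivl, while unsigned operands are divided with divl after simply zeroing %edx, which is why the type of ZS_MAX_PAGES_PER_ZSPAGE shows up in the generated division.

/* Illustrative only, not the zsmalloc code. */
static int waste_signed(int zspage_size, int class_size)
{
	return zspage_size % class_size;	/* movl, cltd, idivl */
}

static unsigned int waste_unsigned(unsigned int zspage_size,
				   unsigned int class_size)
{
	return zspage_size % class_size;	/* movl, xorl %edx,%edx, divl */
}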


[PATCH 2/4] locking/rwsem: Drop superfluous waiter refcount

2016-05-08 Thread Davidlohr Bueso
Read waiters are currently reference counted from the time it enters
the slowpath until the lock is released and the waiter is awoken. This
is fragile and superfluous considering everything occurs within down_read()
without returning to the caller, and the very nature of the primitive does
not suggest that the task can disappear from underneath us. In addition,
spurious wakeups can make the whole refcount useless as get_task_struct()
is only called when setting up the waiter.

Signed-off-by: Davidlohr Bueso 
---
 kernel/locking/rwsem-xadd.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 7d62772600cf..b592bb48d880 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -197,7 +197,6 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
smp_mb();
waiter->task = NULL;
wake_up_process(tsk);
-   put_task_struct(tsk);
} while (--loop);
 
sem->wait_list.next = next;
@@ -220,7 +219,6 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct 
rw_semaphore *sem)
/* set up my own style of waitqueue */
waiter.task = tsk;
waiter.type = RWSEM_WAITING_FOR_READ;
-   get_task_struct(tsk);
 
raw_spin_lock_irq(&sem->wait_lock);
if (list_empty(&sem->wait_list))
-- 
2.8.1






[PATCH 3/4] locking/rwsem: Enable lockless waiter wakeup(s)

2016-05-08 Thread Davidlohr Bueso
As wake_qs gain users, we can teach rwsems about them such that
waiters can be awoken without the wait_lock. This is for both
readers and writer, the former being the most ideal candidate
as we can batch the wakeups shortening the critical region that
much more -- ie writer task blocking a bunch of tasks waiting to
service page-faults (mmap_sem readers).

In general applying wake_qs to rwsem (xadd) is not difficult as
the wait_lock is intended to be released soon _anyways_, with
the exception of when a writer slowpath will proactively wakeup
any queued readers if it sees that the lock is owned by a reader,
in which we simply do the wakeups with the lock held (see comment
in __rwsem_down_write_failed_common()).

Similar to other locking primitives, delaying the waiter being
awoken does allow, at least in theory, the lock to be stolen in
the case of writers, however no harm was seen in this (in fact
lock stealing tends to be a _good_ thing in most workloads), and
this is a tiny window anyways.

Some page-fault (pft) and mmap_sem intensive benchmarks show some
pretty constant reduction in systime (by up to ~8 and ~10%) on a
2-socket, 12 core AMD box.

Signed-off-by: Davidlohr Bueso 
---
 kernel/locking/rwsem-xadd.c | 53 +++--
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index b592bb48d880..1b8c1285a2aa 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -114,12 +114,16 @@ enum rwsem_wake_type {
  *   - the 'active part' of count (&0x0000ffff) reached 0 (but may have changed)
  *   - the 'waiting part' of count (&0xffff0000) is -ve (and will still be so)
  * - there must be someone on the queue
- * - the spinlock must be held by the caller
+ * - the wait_lock must be held by the caller
+ * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
+ *   to actually wakeup the blocked task(s), preferably when the wait_lock
+ *   is released
  * - woken process blocks are discarded from the list after having task zeroed
- * - writers are only woken if downgrading is false
+ * - writers are only marked woken if downgrading is false
  */
 static struct rw_semaphore *
-__rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
+__rwsem_mark_wake(struct rw_semaphore *sem,
+ enum rwsem_wake_type wake_type, struct wake_q_head *wake_q)
 {
struct rwsem_waiter *waiter;
struct task_struct *tsk;
@@ -129,12 +133,14 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
if (wake_type == RWSEM_WAKE_ANY)
-   /* Wake writer at the front of the queue, but do not
-* grant it the lock yet as we want other writers
-* to be able to steal it.  Readers, on the other hand,
-* will block as they will notice the queued writer.
+   /*
+* Mark writer at the front of the queue for wakeup.
+* Until the task is actually later awoken later by
+* the caller, other writers are able to steal it.
+* Readers, on the other hand, will block as they
+* will notice the queued writer.
 */
-   wake_up_process(waiter->task);
+   wake_q_add(wake_q, waiter->task);
goto out;
}
 
@@ -196,12 +202,11 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum 
rwsem_wake_type wake_type)
 */
smp_mb();
waiter->task = NULL;
-   wake_up_process(tsk);
+   wake_q_add(wake_q, tsk);
} while (--loop);
 
sem->wait_list.next = next;
next->prev = &sem->wait_list;
-
  out:
return sem;
 }
@@ -215,6 +220,7 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct 
rw_semaphore *sem)
long count, adjustment = -RWSEM_ACTIVE_READ_BIAS;
struct rwsem_waiter waiter;
struct task_struct *tsk = current;
+   WAKE_Q(wake_q);
 
/* set up my own style of waitqueue */
waiter.task = tsk;
@@ -236,9 +242,10 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct 
rw_semaphore *sem)
if (count == RWSEM_WAITING_BIAS ||
(count > RWSEM_WAITING_BIAS &&
 adjustment != -RWSEM_ACTIVE_READ_BIAS))
-   sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
+   sem = __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
 
raw_spin_unlock_irq(&sem->wait_lock);
+   wake_up_q(&wake_q);
 
/* wait to be given the lock */
while (true) {
@@ -470,9 +477,19 @@ 
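For context, the wake_q pattern the series adopts looks roughly like this (simplified sketch with made-up example_lock/example_waiter types, not the rwsem code): waiters are only marked under the lock, and the actual wakeups happen after it is dropped.

/* Simplified sketch, not the rwsem code: queue wakeups under the lock,
 * issue them after the lock is released so the critical section stays
 * short.  The example_* types are hypothetical. */
static void example_release_waiters(struct example_lock *lock)
{
	WAKE_Q(wake_q);				/* on-stack wake queue */
	struct example_waiter *w;

	raw_spin_lock_irq(&lock->wait_lock);
	list_for_each_entry(w, &lock->wait_list, list)
		wake_q_add(&wake_q, w->task);	/* mark only, no wakeup yet */
	raw_spin_unlock_irq(&lock->wait_lock);

	wake_up_q(&wake_q);			/* wakeups without wait_lock */
}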

[PATCH 1/4] locking/rwsem: Avoid stale ->count for rwsem_down_write_failed()

2016-05-08 Thread Davidlohr Bueso
The field is obviously updated w.o the lock and needs a READ_ONCE
while waiting for lock holder(s) to go away, just like we do with
all other ->count accesses.

Signed-off-by: Davidlohr Bueso 
---
 kernel/locking/rwsem-xadd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index df4dcb883b50..7d62772600cf 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -494,7 +494,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, 
int state)
}
schedule();
set_current_state(state);
-   } while ((count = sem->count) & RWSEM_ACTIVE_MASK);
+   } while ((count = READ_ONCE(sem->count)) & RWSEM_ACTIVE_MASK);
 
raw_spin_lock_irq(&sem->wait_lock);
}
-- 
2.8.1
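The change is a one-liner, but the reasoning is worth spelling out (illustrative sketch, not the rwsem code): without READ_ONCE() the compiler is free to hoist the load out of the loop and spin on a stale register copy.

/* Illustrative only, not the rwsem code itself. */
static void example_wait_for_idle(struct rw_semaphore *sem)
{
	/* Fragile form:
	 *	while (sem->count & RWSEM_ACTIVE_MASK)
	 *		cpu_relax();
	 * nothing tells the compiler sem->count can change underneath us.
	 */
	while (READ_ONCE(sem->count) & RWSEM_ACTIVE_MASK)	/* fresh load each pass */
		cpu_relax();
}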






[PATCH 4/4] locking/rwsem: Rework zeroing reader waiter->task

2016-05-08 Thread Davidlohr Bueso
Readers that are awoken will expect a nil ->task indicating
that a wakeup has occurred. There is a mismatch between the
smp_mb() and its documentation, in that the serialization is
done between reading the task and the nil store. Furthermore,
in addition to having the overlapping of loads and stores to
waiter->task guaranteed to be ordered within that CPU, both
wake_up_process() originally and now wake_q_add() already
imply barriers upon successful calls, which serves the comment.

Just atomically do a xchg() and simplify the whole thing. We can
use relaxed semantics as before mentioned in addition to the
barrier provided by wake_q_add(), delaying there is no risk in
reordering with the actual wakeup.

Signed-off-by: Davidlohr Bueso 
---
 kernel/locking/rwsem-xadd.c | 17 +
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 1b8c1285a2aa..96e53cb4a4db 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -126,7 +126,6 @@ __rwsem_mark_wake(struct rw_semaphore *sem,
  enum rwsem_wake_type wake_type, struct wake_q_head *wake_q)
 {
struct rwsem_waiter *waiter;
-   struct task_struct *tsk;
struct list_head *next;
long oldcount, woken, loop, adjustment;
 
@@ -190,24 +189,18 @@ __rwsem_mark_wake(struct rw_semaphore *sem,
next = sem->wait_list.next;
loop = woken;
do {
+   struct task_struct *tsk;
+
waiter = list_entry(next, struct rwsem_waiter, list);
next = waiter->list.next;
-   tsk = waiter->task;
-   /*
-* Make sure we do not wakeup the next reader before
-* setting the nil condition to grant the next reader;
-* otherwise we could miss the wakeup on the other
-* side and end up sleeping again. See the pairing
-* in rwsem_down_read_failed().
-*/
-   smp_mb();
-   waiter->task = NULL;
+
+   tsk = xchg_relaxed(&waiter->task, NULL);
wake_q_add(wake_q, tsk);
} while (--loop);
 
sem->wait_list.next = next;
next->prev = &sem->wait_list;
- out:
+out:
return sem;
 }
 
-- 
2.8.1






[PATCH -tip 0/4] locking/rwsem (xadd): Reader waiter optimizations

2016-05-08 Thread Davidlohr Bueso
Hi,

This is a follow up series while reviewing Waiman's reader-owned
state work[1]. While I have based it on -tip instead of that change,
I can certainly rebase the series in some future iteration.

Changes are mainly around reader-waiter optimizations, in no particular
order. Has passed numerous DB benchmarks without things falling apart
for a 8 core Westmere doing page allocations (page_test) in aim9:

aim9
                              4.6-rc6                4.6-rc6
                                                     rwsemv2
Min  page_test   378167.89 (  0.00%)   382613.33 (  1.18%)
Min  exec_test  499.00 (  0.00%)  502.67 (  0.74%)
Min  fork_test 3395.47 (  0.00%) 3537.64 (  4.19%)
Hmean    page_test   395433.06 (  0.00%)   414693.68 (  4.87%)
Hmean    exec_test      499.67 (  0.00%)      505.30 (  1.13%)
Hmean    fork_test     3504.22 (  0.00%)     3594.95 (  2.59%)
Stddev   page_test    17426.57 (  0.00%)    26649.92 (-52.93%)
Stddev   exec_test        0.47 (  0.00%)        1.41 (-199.05%)
Stddev   fork_test   63.74 (  0.00%)   32.59 ( 48.86%)
Max  page_test   429873.33 (  0.00%)   456960.00 (  6.30%)
Max  exec_test  500.33 (  0.00%)  507.66 (  1.47%)
Max  fork_test 3653.33 (  0.00%) 3650.90 ( -0.07%)

                 4.6-rc6     4.6-rc6
                             rwsemv2
User                1.12        0.04
System              0.23        0.04
Elapsed           727.27      721.98

[1] http://permalink.gmane.org/gmane.linux.kernel/2216743

Thanks!

Davidlohr Bueso (4):
  locking/rwsem: Avoid stale ->count for rwsem_down_write_failed()
  locking/rwsem: Drop superfluous waiter refcount
  locking/rwsem: Enable lockless waiter wakeup(s)
  locking/rwsem: Rework zeroing reader waiter->task

 kernel/locking/rwsem-xadd.c | 74 ++---
 1 file changed, 43 insertions(+), 31 deletions(-)

-- 
2.8.1






[GIT] Networking

2016-05-08 Thread David Miller

1) Check klogctl failure correctly, from Colin Ian King.

2) Prevent OOM when under memory pressure in flowcache, from Steffen
   Klassert.

3) Fix info leak in llc and rtnetlink ifmap code, from Kangjie Lu.

4) Memory barrier and multicast handling fixes in bnxt_en, from
   Michael Chan.

5) Endianness bug in mlx5, from Daniel Jurgens.

6) Fix disconnect handling in VSOCK, from Ian Campbell.

7) Fix locking of netdev list walking in get_bridge_ifindices(), from
   Nikolay Aleksandrov.

8) Bridge multicast MLD parser can look at wrong packet offsets, fix
   from Linus Lüssing.

9) Fix chip hang in qede driver, from Sudarsana Reddy Kalluru.

10) Fix missing setting of encapsulation before inner handling
completes in udp_offload code, from Jarno Rajahalme.

11) Missing rollbacks during LAG join and flood configuration failures
in mlxsw driver, from Ido Schimmel.

12) Fix error code checks in netxen driver, from Dan Carpenter.

13) Fix key size in new macsec driver, from Sabrina Dubroca.

14) Fix mlx5/VXLAN dependencies, from Arnd Bergmann.

Please pull, thanks a lot!

The following changes since commit 7391daf2ffc780679d6ab3fad1db2619e5dd2c2a:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2016-05-03 
15:07:50 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to 8846a125de97f96be64ca234906eedfd26ad778e:

  Merge branch 'mlx5-build-fix' (2016-05-09 00:21:13 -0400)


Arnd Bergmann (2):
  Revert "net/mlx5: Kconfig: Fix MLX5_EN/VXLAN build issue"
  net/mlx5e: make VXLAN support conditional

Colin Ian King (1):
  tools: bpf_jit_disasm: check for klogctl failure

Dan Carpenter (4):
  netxen: fix error handling in netxen_get_flash_block()
  netxen: reversed condition in netxen_nic_set_link_parameters()
  netxen: netxen_rom_fast_read() doesn't return -1
  qede: uninitialized variable in qede_start_xmit()

Daniel Jurgens (1):
  net/mlx4_en: Fix endianness bug in IPV6 csum calculation

David Ahern (1):
  net: ipv6: tcp reset, icmp need to consider L3 domain

David S. Miller (3):
  Merge branch 'master' of git://git.kernel.org/.../klassert/ipsec
  Merge branch 'bnxt_en-fixes'
  Merge branch 'mlx5-build-fix'

Eric Dumazet (1):
  macvtap: segmented packet is consumed

Ian Campbell (1):
  VSOCK: do not disconnect socket when peer has shutdown SEND only

Ido Schimmel (2):
  mlxsw: spectrum: Fix rollback order in LAG join failure
  mlxsw: spectrum: Add missing rollback in flood configuration

Jarno Rajahalme (2):
  udp_tunnel: Remove redundant udp_tunnel_gro_complete().
  udp_offload: Set encapsulation before inner completes.

Kangjie Lu (2):
  net: fix infoleak in llc
  net: fix infoleak in rtnetlink

Linus Lüssing (1):
  bridge: fix igmp / mld query parsing

Matthias Brugger (1):
  drivers: net: xgene: Fix error handling

Michael Chan (2):
  bnxt_en: Need memory barrier when processing the completion ring.
  bnxt_en: Setup multicast properly after resetting device.

Nikolay Aleksandrov (1):
  net: bridge: fix old ioctl unlocked net device walk

Sabrina Dubroca (1):
  macsec: key identifier is 128 bits, not 64

Shmulik Ladkani (1):
  Documentation/networking: more accurate LCO explanation

Steffen Klassert (3):
  flowcache: Avoid OOM condition under preasure
  xfrm: Reset encapsulation field of the skb before transformation
  vti: Add pmtu handling to vti_xmit.

Sudarsana Reddy Kalluru (1):
  qede: prevent chip hang when increasing channels

Uwe Kleine-König (1):
  net: fec: only clear a queue's work bit if the queue was emptied

 Documentation/networking/checksum-offloads.txt   | 14 +++---
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c |  7 ---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c| 23 
+++
 drivers/net/ethernet/freescale/fec_main.c| 10 --
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig  |  8 +++-
 drivers/net/ethernet/mellanox/mlx5/core/Makefile |  3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en.h |  2 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c|  4 
 drivers/net/ethernet/mellanox/mlx5/core/vxlan.h  | 11 +--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c   |  4 ++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c |  8 
 drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c   | 14 +-
 drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c |  3 ++-
 drivers/net/ethernet/qlogic/qede/qede_main.c |  8 +++-
 drivers/net/geneve.c |  5 +++--
 drivers/net/macsec.c | 


Re: [PATCH 2/2] net: Use ns_capable_noaudit() when determining net sysctl permissions

2016-05-08 Thread Serge Hallyn
Quoting Tyler Hicks (tyhi...@canonical.com):
> The capability check should not be audited since it is only being used
> to determine the inode permissions. A failed check does not indicate a
> violation of security policy but, when an LSM is enabled, a denial audit
> message was being generated.
> 
> The denial audit message caused confusion for some application authors
> because root-running Go applications always triggered the denial. To
> prevent this confusion, the capability check in net_ctl_permissions() is
> switched to the noaudit variant.
> 
> BugLink: https://launchpad.net/bugs/1465724
> 
> Signed-off-by: Tyler Hicks 

Acked-by: Serge E. Hallyn 

> ---
>  net/sysctl_net.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sysctl_net.c b/net/sysctl_net.c
> index ed98c1f..46a71c7 100644
> --- a/net/sysctl_net.c
> +++ b/net/sysctl_net.c
> @@ -46,7 +46,7 @@ static int net_ctl_permissions(struct ctl_table_header 
> *head,
>   kgid_t root_gid = make_kgid(net->user_ns, 0);
>  
>   /* Allow network administrator to have same access as root. */
> - if (ns_capable(net->user_ns, CAP_NET_ADMIN) ||
> + if (ns_capable_noaudit(net->user_ns, CAP_NET_ADMIN) ||
>   uid_eq(root_uid, current_euid())) {
>   int mode = (table->mode >> 6) & 7;
>   return (mode << 6) | (mode << 3) | mode;
> -- 
> 2.7.4
> 



Re: [PATCH 1/2] kernel: Add noaudit variant of ns_capable()

2016-05-08 Thread Serge Hallyn
Quoting Tyler Hicks (tyhi...@canonical.com):
> When checking the current cred for a capability in a specific user
> namespace, it isn't always desirable to have the LSMs audit the check.
> This patch adds a noaudit variant of ns_capable() for when those
> situations arise.
> 
> The common logic between ns_capable() and the new ns_capable_noaudit()
> is moved into a single, shared function to keep duplicated code to a
> minimum and ease maintainability.
> 
> Signed-off-by: Tyler Hicks 

Acked-by: Serge E. Hallyn 

> ---
>  include/linux/capability.h |  5 +
>  kernel/capability.c| 46 
> --
>  2 files changed, 41 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index 00690ff..5f3c63d 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -206,6 +206,7 @@ extern bool has_ns_capability_noaudit(struct task_struct 
> *t,
> struct user_namespace *ns, int cap);
>  extern bool capable(int cap);
>  extern bool ns_capable(struct user_namespace *ns, int cap);
> +extern bool ns_capable_noaudit(struct user_namespace *ns, int cap);
>  #else
>  static inline bool has_capability(struct task_struct *t, int cap)
>  {
> @@ -233,6 +234,10 @@ static inline bool ns_capable(struct user_namespace *ns, 
> int cap)
>  {
>   return true;
>  }
> +static inline bool ns_capable_noaudit(struct user_namespace *ns, int cap)
> +{
> + return true;
> +}
>  #endif /* CONFIG_MULTIUSER */
>  extern bool capable_wrt_inode_uidgid(const struct inode *inode, int cap);
>  extern bool file_ns_capable(const struct file *file, struct user_namespace 
> *ns, int cap);
> diff --git a/kernel/capability.c b/kernel/capability.c
> index 45432b5..00411c8 100644
> --- a/kernel/capability.c
> +++ b/kernel/capability.c
> @@ -361,6 +361,24 @@ bool has_capability_noaudit(struct task_struct *t, int 
> cap)
  return has_ns_capability_noaudit(t, &init_user_ns, cap);
>  }
>  
> +static bool ns_capable_common(struct user_namespace *ns, int cap, bool audit)
> +{
> + int capable;
> +
> + if (unlikely(!cap_valid(cap))) {
> + pr_crit("capable() called with invalid cap=%u\n", cap);
> + BUG();
> + }
> +
> + capable = audit ? security_capable(current_cred(), ns, cap) :
> +   security_capable_noaudit(current_cred(), ns, cap);
> + if (capable == 0) {
> + current->flags |= PF_SUPERPRIV;
> + return true;
> + }
> + return false;
> +}
> +
>  /**
>   * ns_capable - Determine if the current task has a superior capability in 
> effect
>   * @ns:  The usernamespace we want the capability in
> @@ -374,19 +392,27 @@ bool has_capability_noaudit(struct task_struct *t, int 
> cap)
>   */
>  bool ns_capable(struct user_namespace *ns, int cap)
>  {
> - if (unlikely(!cap_valid(cap))) {
> - pr_crit("capable() called with invalid cap=%u\n", cap);
> - BUG();
> - }
> -
> - if (security_capable(current_cred(), ns, cap) == 0) {
> - current->flags |= PF_SUPERPRIV;
> - return true;
> - }
> - return false;
> + return ns_capable_common(ns, cap, true);
>  }
>  EXPORT_SYMBOL(ns_capable);
>  
> +/**
> + * ns_capable_noaudit - Determine if the current task has a superior 
> capability
> + * (unaudited) in effect
> + * @ns:  The usernamespace we want the capability in
> + * @cap: The capability to be tested for
> + *
> + * Return true if the current task has the given superior capability 
> currently
> + * available for use, false if not.
> + *
> + * This sets PF_SUPERPRIV on the task if the capability is available on the
> + * assumption that it's about to be used.
> + */
> +bool ns_capable_noaudit(struct user_namespace *ns, int cap)
> +{
> + return ns_capable_common(ns, cap, false);
> +}
> +EXPORT_SYMBOL(ns_capable_noaudit);
>  
>  /**
>   * capable - Determine if the current task has a superior capability in 
> effect
> -- 
> 2.7.4
> 
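A short usage sketch (hypothetical caller, not from the series): the noaudit variant suits checks whose result merely selects behaviour, as in the sysctl permission code of patch 2/2, rather than gating a security-sensitive operation.

/* Hypothetical caller: the check only picks a mode, so a failed check is
 * not a policy violation and should not produce an LSM denial record. */
static umode_t example_table_mode(struct net *net, umode_t mode)
{
	if (ns_capable_noaudit(net->user_ns, CAP_NET_ADMIN))
		return mode | 0220;	/* network admin may also write */
	return mode;
}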




RE:Drawstring bags

2016-05-08 Thread Jack
Dear purchasing manager,

We have rich experience in manufacturing and exporting all kinds of bags, We 
have our own production base with advanced machine equipment, and employ 
professional workforce of technicians and engineers. 

Our products range from tote bags, drawstring bags, luggage bags, cooler bags, 
diaper bags, backpacks, handbags, cosmetic bags, travel bags, school bags, 
computer bags, gym bags, tool bags and so on. Material options include 
polyester, nylon, jeans, canvas and PVC,etc.

Hope we can have a chance to do business with you.

Best regards,

Jack Xiu
Sales manager
Ronta(Xiamen)Co.,Ltd
www.xmronta.com



Re: [PATCH] Use pid_t instead of int

2016-05-08 Thread René Nyffenegger
Somewhere else, pid_t is a typedef for an int.

Rene

On 09.05.2016 03:25, Andy Lutomirski wrote:
> On Sun, May 8, 2016 at 12:38 PM, René Nyffenegger
>  wrote:
>> Use pid_t instead of int in the declarations of sys_kill, sys_tgkill,
>> sys_tkill and sys_rt_sigqueueinfo in include/linux/syscalls.h
> 
> The description is no good.  *Why* are you changing it?
> 
> I checked tgkill and, indeed, tgkill takes pid_t parameters, so this
> fixes an incorrect declaration.  I'm wondering why the code compiles
> without warning.  Is SYSCALL_DEFINE too lenient for some reason?  Or
> is pid_t just defined as int.
> 
> --Andy
> 
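For reference, the declarations being discussed would then read roughly as follows (the patch itself is not quoted in this thread, so treat the exact lines as a sketch):

/* Sketch of the corrected prototypes in include/linux/syscalls.h. */
asmlinkage long sys_kill(pid_t pid, int sig);
asmlinkage long sys_tgkill(pid_t tgid, pid_t pid, int sig);
asmlinkage long sys_tkill(pid_t pid, int sig);
asmlinkage long sys_rt_sigqueueinfo(pid_t pid, int sig, siginfo_t __user *uinfo);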




Re: sched: tweak select_idle_sibling to look for idle threads

2016-05-08 Thread Yuyang Du
On Mon, May 09, 2016 at 05:52:51AM +0200, Mike Galbraith wrote:
> On Mon, 2016-05-09 at 02:57 +0800, Yuyang Du wrote:
> 
> > In addition, I would argue maybe beefing up idle balancing is a more
> > productive way to spread load, as work-stealing just does what needs
> > to be done. And seems it has been (sub-unconsciously) neglected in this
> > case, :)
> 
> P.S. Nope, I'm dinging up multiple spots ;-)

You bet, :)




Re: [RFC PATCH v2 07/10] efi: load SSTDs from EFI variables

2016-05-08 Thread Jon Masters
Hi Octavian,

Apologies for missing this earlier, just catching up on this thread...

On 04/19/2016 06:39 PM, Octavian Purdila wrote:

> This patch allows SSDTs to be loaded from EFI variables. It works by
> specifying the EFI variable name containing the SSDT to be loaded. All
> variables with the same name (regardless of the vendor GUID) will be
> loaded.

This sounds very useful during development. However, and using EFI
variables isn't so terrible, but I am concerned that this should be
standardized through ASWG and at least involve certain other OS vendors
so that the variable (GUID) can be captured somewhere. If not in the
spec itself, then it should be captured as an external ACPI resource on
the UEFI website with a clear pointer to the exact IDs to be used.

Can you confirm that's the intention? i.e. that you're allowing a
command line option for specifying the ID now because you intend to go
ensure that there is a standard one that everyone will use later?

I should check (but maybe you know) if the kernel is automatically
tainted by this codepath as well?

Thanks,

Jon.

-- 
Computer Architect | Sent from my Fedora powered laptop





[PATCH] sched/rt/deadline: Don't push if task's scheduling class was changed

2016-05-08 Thread Xunlei Pang
We got a warning below:
WARNING: CPU: 1 PID: 2468 at kernel/sched/core.c:1161 
set_task_cpu+0x1af/0x1c0
CPU: 1 PID: 2468 Comm: bugon Not tainted 4.6.0-rc3+ #16
Hardware name: Intel Corporation Broadwell Client
0086 89618374 8800897a7d50 8133dc8c
  8800897a7d90 81089921
048981037f39 88016c4315c0 88016ecd6e40 
Call Trace:
[] dump_stack+0x63/0x87
[] __warn+0xd1/0xf0
[] warn_slowpath_null+0x1d/0x20
[] set_task_cpu+0x1af/0x1c0
[] push_dl_task.part.34+0xea/0x180
[] push_dl_tasks+0x17/0x30
[] __balance_callback+0x45/0x5c
[] __sched_setscheduler+0x906/0xb90
[] SyS_sched_setattr+0x150/0x190
[] do_syscall_64+0x62/0x110
[] entry_SYSCALL64_slow_path+0x25/0x25

The corresponding warning triggering code:
WARN_ON_ONCE(p->state == TASK_RUNNING &&
 p->sched_class == &fair_sched_class &&
 (p->on_rq && !task_on_rq_migrating(p)))

This is because in find_lock_later_rq(), the task whose scheduling
class was changed to fair class is still pushed away as deadline.

So, check in find_lock_later_rq() after double_lock_balance(), if the
scheduling class of the deadline task was changed, break and retry.
Apply the same logic to RT.

Signed-off-by: Xunlei Pang 
---
 kernel/sched/deadline.c | 1 +
 kernel/sched/rt.c   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 169d40d..57eb3e4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1385,6 +1385,7 @@ static struct rq *find_lock_later_rq(struct task_struct 
*task, struct rq *rq)
 !cpumask_test_cpu(later_rq->cpu,
   &task->cpus_allowed) ||
 task_running(rq, task) ||
+!dl_task(task) ||
 !task_on_rq_queued(task))) {
double_unlock_balance(rq, later_rq);
later_rq = NULL;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ecfc83d..c10a6f5 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1720,6 +1720,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct 
*task, struct rq *rq)
 !cpumask_test_cpu(lowest_rq->cpu,
   tsk_cpus_allowed(task)) 
||
 task_running(rq, task) ||
+!rt_task(task) ||
 !task_on_rq_queued(task))) {
 
double_unlock_balance(rq, lowest_rq);
-- 
1.8.3.1



Re: sched: tweak select_idle_sibling to look for idle threads

2016-05-08 Thread Yuyang Du
On Mon, May 09, 2016 at 05:45:40AM +0200, Mike Galbraith wrote:
> On Mon, 2016-05-09 at 02:57 +0800, Yuyang Du wrote:
> > On Sun, May 08, 2016 at 10:08:55AM +0200, Mike Galbraith wrote:
> > > > Maybe give the criteria a bit margin, not just wakees tend to equal 
> > > > llc_size,
> > > > but the numbers are so wild to easily break the fragile condition, like:
> > > 
> > > Seems lockless traversal and averages just lets multiple CPUs select
> > > the same spot.  An atomic reservation (feature) when looking for an
> > > idle spot (also for fork) might fix it up.  Run the thing as RT,
> > > push/pull ensures that it reaches box saturation regardless of the
> > > number of messaging threads, whereas with fair class, any number > 1
> > > will certainly stack tasks before the box is saturated.
> > 
> > Yes, good idea, bringing order to the race to grab idle CPU is absolutely
> > helpful.
> 
> Well, good ideas work, as yet this one helps jack diddly spit.

Then a valid question, one that should always be asked, is whether it is
this selection that is screwed up in a case like this.
 
> > In addition, I would argue maybe beefing up idle balancing is a more
> > productive way to spread load, as work-stealing just does what needs
> > to be done. And seems it has been (sub-unconsciously) neglected in this
> > case, :)
> > 
> > Regarding wake_wide(), it seems the M:N is 1:24, not 6:6*24, if so,
> > the slave will be 0 forever (as last_wakee is never flipped).
> 
> Yeah, it's irrelevant here, this load is all about instantaneous state.
>  I could use a bit more of that, reserving on the wakeup side won't
> help this benchmark until everything else cares.  One stack, and it's
> game over.  It could help generic utilization and latency some.. but it
> seems kinda unlikely it'll be worth the cycle expenditure.

Yes and no; it depends on how efficient work-stealing is compared to
selection. But remember, at the end of the day, the wakee CPU measures the
latency, and that CPU does not care whether it was selected or it stole.
 
> > Basically whenever a waker has more than 1 wakee, the wakee_flips
> > will comfortably grow very large (with last_wakee alternating),
> > whereas when a waker has 0 or 1 wakee, the wakee_flips will just be 0.
> 
> Yup, it is a heuristic, and like all of those, imperfect.  I've watched
> it improving utilization in the wild though, so won't mind that until I
> catch it doing really bad things.
 
> > So recording only the last_wakee seems not right unless you have other
> > good reason. If not the latter, counting waking wakee times should be
> > better, and then allow the statistics to happily play.

Hmm... should we try removing the recording of last_wakee?


Re: [PATCH V2 2/2] irqchip/gicv3-its: Implement two-level(indirect) device table support

2016-05-08 Thread Shanker Donthineni


On 05/08/2016 09:14 PM, Shanker Donthineni wrote:
> Since device IDs are extremely sparse, the single, a.k.a flat table is
> not sufficient for the following two reasons.
>
> 1) According to ARM-GIC spec, ITS hw can access maximum of 256(pages)*
>64K(pageszie) bytes. In the best case, it supports upto DEVid=21
>sparse with minimum device table entry size 8bytes.
>
> 2) The maximum memory size that is possible without memblock depends on
>MAX_ORDER. 4MB on 4K page size kernel with default MAX_ORDER, so it
>supports DEVid range 19bits.
>
> The two-level device table feature brings us two advantages, the first
> is a very high possibility of supporting upto 32bit sparse, and the
> second one is the best utilization of memory allocation.
>
> The feature is enabled automatically during driver probe if a single
> ITS page is not adequate for flat table and the hardware is capable
> of two-level table walk.
>
> Signed-off-by: Shanker Donthineni 
> ---
>
> This patch is based on Marc Zyngier's branch 
> https://git.kernel.org/cgit/linux/kernel/git/maz/arm-platforms.git/log/?h=irq/irqchip-4.7
>
> I have tested the Indirection feature on Qualcomm Technologies QDF2XXX server 
> platform.
>
> Changes since v1:
>   Most of this patch has been rewritten after refactoring its_alloc_tables().
>   Always enable device two-level if the memory requirement is more than 
> PAGE_SIZE.
>   Fixed the coding bug that breaks on the BE machine.
>   Edited the commit text.
>
>  drivers/irqchip/irq-gic-v3-its.c | 100 
> ---
>  1 file changed, 83 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3-its.c 
> b/drivers/irqchip/irq-gic-v3-its.c
> index b23e00c..27be792 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -938,6 +938,18 @@ retry_baser:
>   return 0;
>  }
>  
> +/**
> + * Find out whether an implemented baser register supports a single, flat 
> table
> + * or a two-level table by reading bit offset at '62' after writing '1' to 
> it.
> + */
> +static u64 its_baser_check_indirect(struct its_baser *baser)
> +{
> + u64 val = GITS_BASER_InnerShareable | GITS_BASER_WaWb;
> +
> + writeq_relaxed(val | GITS_BASER_INDIRECT, baser->hwreg);
> + return (readq_relaxed(baser->hwreg) & GITS_BASER_INDIRECT);
> +}
> +
>  static int its_alloc_tables(const char *node_name, struct its_node *its)
>  {
>   u64 typer = readq_relaxed(its->base + GITS_TYPER);
> @@ -964,6 +976,7 @@ static int its_alloc_tables(const char *node_name, struct 
> its_node *its)
>   u64 entry_size = GITS_BASER_ENTRY_SIZE(val);
>   int order = get_order(psz);
>   struct its_baser *baser = its->tables + i;
> + u64 indirect = 0;
>  
>   if (type == GITS_BASER_TYPE_NONE)
>   continue;
> @@ -977,17 +990,27 @@ static int its_alloc_tables(const char *node_name, 
> struct its_node *its)
>* Allocate as many entries as required to fit the
>* range of device IDs that the ITS can grok... The ID
>* space being incredibly sparse, this results in a
> -  * massive waste of memory.
> +  * massive waste of memory if two-level device table
> +  * feature is not supported by hardware.
>*
>* For other tables, only allocate a single page.
>*/
>   if (type == GITS_BASER_TYPE_DEVICE) {
> - /*
> -  * 'order' was initialized earlier to the default page
> -  * granule of the the ITS.  We can't have an allocation
> -  * smaller than that.  If the requested allocation
> -  * is smaller, round up to the default page granule.
> -  */
> + if ((entry_size << ids) > psz)
> + indirect = its_baser_check_indirect(baser);
> +
> + if (indirect) {
> + /*
> +  * The size of the lvl2 table is equal to ITS
> +  * page size which is 'psz'. For computing lvl1
> +  * table size, subtract ID bits that sparse
> +  * lvl2 table from 'ids' which is reported by
> +  * ITS hardware times lvl1 table entry size.
> +  */
> + ids -= ilog2(psz / entry_size);
> + entry_size = GITS_LVL1_ENTRY_SIZE;
> + }
> +
>   order = max(get_order(entry_size << ids), order);
>   if (order >= MAX_ORDER) {
>   order = MAX_ORDER - 1;
> @@ -997,7 +1020,7 @@ static int its_alloc_tables(const char *node_name, 
> struct its_node *its)
>  
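
As a rough worked example of the lvl1 sizing rule in the hunk quoted above (assuming an ITS page size psz of 64KB, an 8-byte device table entry, ids = 32 DevID bits and an 8-byte GITS_LVL1_ENTRY_SIZE):

    IDs covered per lvl2 page = psz / entry_size = 65536 / 8 = 2^13
    remaining lvl1 ID bits    = 32 - ilog2(2^13) = 19
    lvl1 table size           = 2^19 * 8 bytes   = 4MB

where a flat table over the same 32-bit ID space would have needed 2^32 * 8 bytes = 32GB.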


[PATCH -tip] sched/wake_q: fix typo in wake_q_add

2016-05-08 Thread Davidlohr Bueso
... the comment clearly refers to wake_up_q, and not
wake_up_list.

Signed-off-by: Davidlohr Bueso 
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c82ca6eccfec..c59e4df38591 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -400,7 +400,7 @@ void wake_q_add(struct wake_q_head *head, struct 
task_struct *task)
 * wakeup due to that.
 *
 * This cmpxchg() implies a full barrier, which pairs with the write
-* barrier implied by the wakeup in wake_up_list().
+* barrier implied by the wakeup in wake_up_q().
 */
if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
return;
-- 
2.8.1



Re: [PATCH] tools: bpf_jit_disasm: check for klogctl failure

2016-05-08 Thread David Miller
From: Daniel Borkmann 
Date: Fri, 06 May 2016 00:46:56 +0200

> On 05/06/2016 12:39 AM, Colin King wrote:
>> From: Colin Ian King 
>>
>> klogctl can fail and return -ve len, so check for this and
>> return NULL to avoid passing a (size_t)-1 to malloc.
>>
>> Signed-off-by: Colin Ian King 
> 
> [ would be nice to get Cc'ed in future ... ]
> 
> Acked-by: Daniel Borkmann 

Applied.
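
For reference, the pattern being applied looks roughly like the sketch below (illustrative only, not the exact bpf_jit_disasm hunk). klogctl() reports the needed buffer size and can return a negative length on failure, which would otherwise wrap into a huge size_t when handed to malloc():

#include <stdlib.h>
#include <sys/klog.h>

static char *read_kernel_log(void)
{
	/* SYSLOG_ACTION_SIZE_BUFFER: ask how large the log buffer is */
	int len = klogctl(10, NULL, 0);
	char *buf;

	if (len < 0)
		return NULL;	/* klogctl failed; don't malloc((size_t)-1) */

	buf = malloc(len + 1);
	if (!buf)
		return NULL;

	/* SYSLOG_ACTION_READ_ALL: non-destructively read the log */
	len = klogctl(3, buf, len);
	if (len < 0) {
		free(buf);
		return NULL;
	}
	buf[len] = '\0';
	return buf;
}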


Re: [PATCH 0/2] Quiet noisy LSM denial when accessing net sysctl

2016-05-08 Thread David Miller
From: Tyler Hicks 
Date: Fri,  6 May 2016 18:04:12 -0500

> This pair of patches does away with what I believe is a useless denial
> audit message when a privileged process initially accesses a net sysctl.

The LSM folks can apply this if they agree with you.


Re: [PATCH v2] net: arc/emac: Move arc_emac_tx_clean() into arc_emac_tx() and disable tx interrut

2016-05-08 Thread David Miller
From: Caesar Wang 
Date: Fri,  6 May 2016 20:19:16 +0800

> Doing tx_clean() inside poll() may scramble the tx ring buffer if
> tx() is running. This will cause tx to stop working, which can be
> reproduced by simultaneously downloading two large files at high speed.
> 
> Moving tx_clean() into tx() will prevent this. And tx interrupt is no
> longer needed now.

TX completion work is always recommended to be done in the ->poll()
handler.

Fix the race or whatever bug there is rather than working around it,
and regressing the driver, by handling TX completion in the interrupt
handler.

Thanks.
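
For context, the shape being asked for is the usual NAPI pattern, roughly as sketched below; the foo_* names and types are made up for illustration and this is not the arc/emac driver's actual code:

static int foo_poll(struct napi_struct *napi, int budget)
{
	struct foo_priv *priv = container_of(napi, struct foo_priv, napi);
	int done;

	/* reclaim completed TX descriptors before doing RX work */
	foo_tx_clean(priv);

	done = foo_rx_process(priv, budget);	/* handle up to 'budget' frames */
	if (done < budget) {
		napi_complete(napi);
		foo_enable_irqs(priv);		/* re-arm device interrupts */
	}
	return done;
}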


Re: [PATCH v4 1/2] soc: qcom: smd: Introduce compile stubs

2016-05-08 Thread David Miller
From: Bjorn Andersson 
Date: Fri,  6 May 2016 07:09:07 -0700

> Introduce compile stubs for the SMD API, allowing consumers to be
> compile tested.
> 
> Acked-by: Andy Gross 
> Signed-off-by: Bjorn Andersson 

Applied.
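
The usual shape of such compile stubs, as a hedged sketch (the stubbed functions live in include/linux/soc/qcom/smd.h; qcom_smd_send() is used here only as an example):

struct qcom_smd_channel;

#if IS_ENABLED(CONFIG_QCOM_SMD)
int qcom_smd_send(struct qcom_smd_channel *channel, const void *data, int len);
#else
static inline int qcom_smd_send(struct qcom_smd_channel *channel,
				const void *data, int len)
{
	/* SMD not built: consumers still compile, calls just fail */
	return -ENXIO;
}
#endif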


Re: [PATCH v4 2/2] net: Add Qualcomm IPC router

2016-05-08 Thread David Miller
From: Bjorn Andersson 
Date: Fri,  6 May 2016 07:09:08 -0700

> From: Courtney Cavin 
> 
> Add an implementation of Qualcomm's IPC router protocol, used to
> communicate with service providing remote processors.
> 
> Signed-off-by: Courtney Cavin 
> Signed-off-by: Bjorn Andersson 
> [bjorn: Cope with 0 being a valid node id and implement RTM_NEWADDR]
> Signed-off-by: Bjorn Andersson 

Applied.


Re: sched: tweak select_idle_sibling to look for idle threads

2016-05-08 Thread Mike Galbraith
On Mon, 2016-05-09 at 02:57 +0800, Yuyang Du wrote:

> In addition, I would argue maybe beefing up idle balancing is a more
> productive way to spread load, as work-stealing just does what needs
> to be done. And seems it has been (sub-unconsciously) neglected in this
> case, :)

P.S. Nope, I'm dinging up multiple spots ;-)



Re: sched: tweak select_idle_sibling to look for idle threads

2016-05-08 Thread Mike Galbraith
On Mon, 2016-05-09 at 02:57 +0800, Yuyang Du wrote:
> On Sun, May 08, 2016 at 10:08:55AM +0200, Mike Galbraith wrote:
> > > Maybe give the criteria a bit margin, not just wakees tend to equal 
> > > llc_size,
> > > but the numbers are so wild to easily break the fragile condition, like:
> > 
> > Seems lockless traversal and averages just lets multiple CPUs select
> > the same spot.  An atomic reservation (feature) when looking for an
> > idle spot (also for fork) might fix it up.  Run the thing as RT,
> > push/pull ensures that it reaches box saturation regardless of the
> > number of messaging threads, whereas with fair class, any number > 1
> > will certainly stack tasks before the box is saturated.
> 
> Yes, good idea, bringing order to the race to grab idle CPU is absolutely
> helpful.

Well, good ideas work, as yet this one helps jack diddly spit.

> In addition, I would argue maybe beefing up idle balancing is a more
> productive way to spread load, as work-stealing just does what needs
> to be done. And seems it has been (sub-unconsciously) neglected in this
> case, :)
> 
> Regarding wake_wide(), it seems the M:N is 1:24, not 6:6*24, if so,
> the slave will be 0 forever (as last_wakee is never flipped).

Yeah, it's irrelevant here, this load is all about instantaneous state.
 I could use a bit more of that, reserving on the wakeup side won't
help this benchmark until everything else cares.  One stack, and it's
game over.  It could help generic utilization and latency some.. but it
seems kinda unlikely it'll be worth the cycle expenditure.

> Basically whenever a waker has more than 1 wakee, the wakee_flips
> will comfortably grow very large (with last_wakee alternating),
> whereas when a waker has 0 or 1 wakee, the wakee_flips will just be 0.

Yup, it is a heuristic, and like all of those, imperfect.  I've watched
it improving utilization in the wild though, so won't mind that until I
catch it doing really bad things.

> So recording only the last_wakee seems not right unless you have other
> good reason. If not the latter, counting waking wakee times should be
> better, and then allow the statistics to happily play.




Re: [patch] qede: uninitialized variable in qede_start_xmit()

2016-05-08 Thread David Miller
From: Dan Carpenter 
Date: Thu, 5 May 2016 16:21:30 +0300

> "data_split" was never set to false.  It's just uninitialized.
> 
> Fixes: 2950219d87b0 ('qede: Add basic network device support')
> Signed-off-by: Dan Carpenter 

Applied, thanks Dan.


RE: [PATCH] debugobjects: insulate non-fixup logic related to static obj from fixup callbacks

2016-05-08 Thread Du, Changbin
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> On Sun, 8 May 2016, Du, Changbin wrote:
> > > From: Thomas Gleixner [mailto:t...@linutronix.de]
> > > > raw_spin_unlock_irqrestore(&db->lock, flags);
> > > > /*
> > > > -* Maybe the object is static.  Let the type specific
> > > > +* Maybe the object is static. Let the type specific
> > > >  * code decide what to do.
> > >
> > > Instead of doing white space changes you really want to explain the logic
> > > here.
> > >
> > Comments is in following code.
> 
> Well. It's a comment, but the code you replace has better explanations about
> statically initialized objects. This should move here.
> 
> Thanks,
> 
>   tglx

Ok, let me improve the comment for patch v2.

Best Regards,
Du, Changbin



Re: [PATCH] mmc: mmc: do not use CMD13 to get status after speed mode switch

2016-05-08 Thread Shawn Lin

+ linux-rockchip


I just hacked my local branch to fix the issues found on the rockchip
platform. The reason is that the mmc core fails to get status
after switching from hs200 to hs, so I disabled sending status for it,
just like what Chaotian does here. I didn't dig deeply into the root
cause, but I agree with Chaotian's opinion.

FYI:
My eMMC device is KLMA62WEPD-B031.

[1.526008] sdhci: Secure Digital Host Controller Interface driver
[1.526558] sdhci: Copyright(c) Pierre Ossman
[1.527899] sdhci-pltfm: SDHCI platform and OF driver helper
[1.529967] sdhci-arasan fe33.sdhci: No vmmc regulator found
[1.530501] sdhci-arasan fe33.sdhci: No vqmmc regulator found
[1.568710] mmc0: SDHCI controller on fe33.sdhci [fe33.sdhci] 
using ADMA

[1.627552] mmc0: switch to high-speed from hs200 failed, err:-84
[1.628108] mmc0: error -84 whilst initialising MMC card


On 2016/5/4 14:54, Chaotian Jing wrote:

Per the JEDEC spec, it is not recommended to use CMD13 to get card status
after a speed mode switch. Below are two reasons for this:
1. CMD13 cannot be guaranteed due to the asynchronous operation.
Therefore it is not recommended to use CMD13 to check busy completion
of the timing change indication.
2. After the switch to HS200, CMD13 will get a response of 0x800, and even
after the busy signal gets de-asserted, the response of CMD13 is also 0x800.

This patch drops CMD13 when doing the speed mode switch. If the host does
not support MMC_CAP_WAIT_WHILE_BUSY and there is no ops->card_busy(),
then the only way is to wait a fixed timeout.

Signed-off-by: Chaotian Jing 
---
 drivers/mmc/core/mmc.c |   82 
 drivers/mmc/core/mmc_ops.c |   25 +-
 2 files changed, 45 insertions(+), 62 deletions(-)

diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
index 4dbe3df..03ee7a4 100644
--- a/drivers/mmc/core/mmc.c
+++ b/drivers/mmc/core/mmc.c
@@ -962,7 +962,7 @@ static int mmc_select_hs(struct mmc_card *card)
err = __mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
   EXT_CSD_HS_TIMING, EXT_CSD_TIMING_HS,
   card->ext_csd.generic_cmd6_time,
-  true, true, true);
+  true, false, true);
if (!err)
mmc_set_timing(card->host, MMC_TIMING_MMC_HS);

@@ -1056,7 +1056,6 @@ static int mmc_switch_status(struct mmc_card *card)
 static int mmc_select_hs400(struct mmc_card *card)
 {
struct mmc_host *host = card->host;
-   bool send_status = true;
unsigned int max_dtr;
int err = 0;
u8 val;
@@ -1068,9 +1067,6 @@ static int mmc_select_hs400(struct mmc_card *card)
  host->ios.bus_width == MMC_BUS_WIDTH_8))
return 0;

-   if (host->caps & MMC_CAP_WAIT_WHILE_BUSY)
-   send_status = false;
-
/* Reduce frequency to HS frequency */
max_dtr = card->ext_csd.hs_max_dtr;
mmc_set_clock(host, max_dtr);
@@ -1080,7 +1076,7 @@ static int mmc_select_hs400(struct mmc_card *card)
err = __mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
   EXT_CSD_HS_TIMING, val,
   card->ext_csd.generic_cmd6_time,
-  true, send_status, true);
+  true, false, true);
if (err) {
pr_err("%s: switch to high-speed from hs200 failed, err:%d\n",
mmc_hostname(host), err);
@@ -1090,11 +1086,9 @@ static int mmc_select_hs400(struct mmc_card *card)
/* Set host controller to HS timing */
mmc_set_timing(card->host, MMC_TIMING_MMC_HS);

-   if (!send_status) {
-   err = mmc_switch_status(card);
-   if (err)
-   goto out_err;
-   }
+   err = mmc_switch_status(card);
+   if (err)
+   goto out_err;

/* Switch card to DDR */
err = mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
@@ -1113,7 +1107,7 @@ static int mmc_select_hs400(struct mmc_card *card)
err = __mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
   EXT_CSD_HS_TIMING, val,
   card->ext_csd.generic_cmd6_time,
-  true, send_status, true);
+  true, false, true);
if (err) {
pr_err("%s: switch to hs400 failed, err:%d\n",
 mmc_hostname(host), err);
@@ -1124,11 +1118,9 @@ static int mmc_select_hs400(struct mmc_card *card)
mmc_set_timing(host, MMC_TIMING_MMC_HS400);
mmc_set_bus_speed(card);

-   if (!send_status) {
-   err = mmc_switch_status(card);
-   if (err)
-   goto out_err;
-   }
+   err = mmc_switch_status(card);
+   if (err)
+   goto out_err;

return 0;

@@ -1146,14 +1138,10 @@ int mmc_hs200_to_hs400(struct mmc_card *card)
 int 
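
The fallback described in the commit message, as a rough sketch (illustrative, not the actual mmc_ops.c hunk; timeout_ms stands for the generic CMD6 time):

	if (host->caps & MMC_CAP_WAIT_WHILE_BUSY) {
		/* the controller itself waits out the busy phase */
	} else if (host->ops->card_busy) {
		unsigned long timeout = jiffies + msecs_to_jiffies(timeout_ms);

		while (host->ops->card_busy(host)) {
			if (time_after(jiffies, timeout))
				return -ETIMEDOUT;
			usleep_range(100, 200);
		}
	} else {
		/* no way to observe busy at all: wait the full CMD6 time */
		mmc_delay(timeout_ms);
	}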


[PATCH 3/3] staging: dgnc: Need to check for NULL of ch

2016-05-08 Thread Daeseok Youn
the "ch" from brd structure could be NULL, it need to
check for NULL.

Signed-off-by: Daeseok Youn 
---
 drivers/staging/dgnc/dgnc_neo.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/dgnc/dgnc_neo.c b/drivers/staging/dgnc/dgnc_neo.c
index 9eae1a6..ba57e95 100644
--- a/drivers/staging/dgnc/dgnc_neo.c
+++ b/drivers/staging/dgnc/dgnc_neo.c
@@ -380,7 +380,7 @@ static inline void neo_parse_isr(struct dgnc_board *brd, 
uint port)
unsigned long flags;
 
ch = brd->channels[port];
-   if (ch->magic != DGNC_CHANNEL_MAGIC)
+   if (!ch || ch->magic != DGNC_CHANNEL_MAGIC)
return;
 
/* Here we try to figure out what caused the interrupt to happen */
-- 
2.8.2



[PATCH 1/3] staging: dgnc: fix 'line over 80 characters'

2016-05-08 Thread Daeseok Youn
fix checkpatch.pl warning about 'line over 80 characters'.

Signed-off-by: Daeseok Youn 
---
 drivers/staging/dgnc/dgnc_sysfs.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/staging/dgnc/dgnc_sysfs.c 
b/drivers/staging/dgnc/dgnc_sysfs.c
index d825964..b8d41c5 100644
--- a/drivers/staging/dgnc/dgnc_sysfs.c
+++ b/drivers/staging/dgnc/dgnc_sysfs.c
@@ -189,19 +189,21 @@ static ssize_t dgnc_ports_msignals_show(struct device *p,
DGNC_VERIFY_BOARD(p, bd);
 
for (i = 0; i < bd->nasync; i++) {
-   if (bd->channels[i]->ch_open_count) {
+   struct channel_t *ch = bd->channels[i];
+
+   if (ch->ch_open_count) {
count += snprintf(buf + count, PAGE_SIZE - count,
"%d %s %s %s %s %s %s\n",
-   bd->channels[i]->ch_portnum,
-   (bd->channels[i]->ch_mostat & UART_MCR_RTS) ? 
"RTS" : "",
-   (bd->channels[i]->ch_mistat & UART_MSR_CTS) ? 
"CTS" : "",
-   (bd->channels[i]->ch_mostat & UART_MCR_DTR) ? 
"DTR" : "",
-   (bd->channels[i]->ch_mistat & UART_MSR_DSR) ? 
"DSR" : "",
-   (bd->channels[i]->ch_mistat & UART_MSR_DCD) ? 
"DCD" : "",
-   (bd->channels[i]->ch_mistat & UART_MSR_RI)  ? 
"RI"  : "");
+   ch->ch_portnum,
+   (ch->ch_mostat & UART_MCR_RTS) ? "RTS" : "",
+   (ch->ch_mistat & UART_MSR_CTS) ? "CTS" : "",
+   (ch->ch_mostat & UART_MCR_DTR) ? "DTR" : "",
+   (ch->ch_mistat & UART_MSR_DSR) ? "DSR" : "",
+   (ch->ch_mistat & UART_MSR_DCD) ? "DCD" : "",
+   (ch->ch_mistat & UART_MSR_RI)  ? "RI"  : "");
} else {
count += snprintf(buf + count, PAGE_SIZE - count,
-   "%d\n", bd->channels[i]->ch_portnum);
+   "%d\n", ch->ch_portnum);
}
}
return count;
-- 
2.8.2



[PATCH 2/3] staging: dgnc: remove redundant condition check

2016-05-08 Thread Daeseok Youn
dgnc_board(brd) was already checked for NULL before calling
neo_parse_isr(). And the port doesn't need to be checked either.

Signed-off-by: Daeseok Youn 
---
 drivers/staging/dgnc/dgnc_neo.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/staging/dgnc/dgnc_neo.c b/drivers/staging/dgnc/dgnc_neo.c
index 3b8ce38..9eae1a6 100644
--- a/drivers/staging/dgnc/dgnc_neo.c
+++ b/drivers/staging/dgnc/dgnc_neo.c
@@ -379,12 +379,6 @@ static inline void neo_parse_isr(struct dgnc_board *brd, 
uint port)
unsigned char cause;
unsigned long flags;
 
-   if (!brd || brd->magic != DGNC_BOARD_MAGIC)
-   return;
-
-   if (port >= brd->maxports)
-   return;
-
ch = brd->channels[port];
if (ch->magic != DGNC_CHANNEL_MAGIC)
return;
-- 
2.8.2



[PATCH] sched: fix the calculation of __sched_period in sched_slice()

2016-05-08 Thread Zhou Chengming
When we get the sched_slice of a sched_entity, we use cfs_rq->nr_running
to calculate the whole __sched_period. But cfs_rq->nr_running is the
number of sched_entities in that cfs_rq, whereas rq->nr_running is the
number of all the tasks that are not throttled. So we should use
rq->nr_running to calculate the whole __sched_period value.

Signed-off-by: Zhou Chengming 
---
 kernel/sched/fair.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fe30e6..59c9378 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -625,7 +625,7 @@ static u64 __sched_period(unsigned long nr_running)
  */
 static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-   u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
+   u64 slice = __sched_period(rq_of(cfs_rq)->nr_running + !se->on_rq);
 
for_each_sched_entity(se) {
struct load_weight *load;
-- 
1.7.7
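
A small worked example of the difference, assuming the default tunables of that kernel (sysctl_sched_latency = 6ms, sysctl_sched_min_granularity = 0.75ms, sched_nr_latency = 8) and two task groups with 8 runnable tasks each on one CPU:

    per-group cfs_rq->nr_running = 8   ->  __sched_period(8)  = 6ms
    rq->nr_running               = 16  ->  __sched_period(16) = 16 * 0.75ms = 12ms

so with the patch the slice is carved out of a period that reflects all 16 runnable tasks, not just the 8 entities visible in one group's cfs_rq.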



Re: [PATCH 1/6] statx: Add a system call to make enhanced file info available

2016-05-08 Thread J. Bruce Fields
On Mon, May 09, 2016 at 11:45:43AM +1000, Dave Chinner wrote:
> [ OT, but I'll reply anyway :P ]
> 
> On Fri, May 06, 2016 at 02:29:23PM -0400, J. Bruce Fields wrote:
> > On Thu, May 05, 2016 at 08:56:02AM +1000, Dave Chinner wrote:
> > > In the latest XFS filesystem format, we randomise the generation
> > > value during every inode allocation to make it hard to guess the
> > > handle of adjacent inodes from an existing ino+gen pair, or even
> > > from life time to life time of the same inode.
> > 
> > The one thing I wonder about is whether that increases the probability
> > of a filehandle collision (where you accidentally generate the same
> > filehandle for two different files).
> 
> Not possible - inode number is still different between the two
> files. i.e. ino+gen makes the handle unique, not gen.
> 
> > If the generation number is a 32-bit counter per inode number (is that
> > actually the way filesystems work?), then it takes 2^32 reuses of the
> > inode number to hit the same filehandle.
> 
> 4 billion unlink/create operations that hit the same inode number
> are going to take some time. I suspect someone will notice the load
> generated by an attmept to brute force this sort of thing ;)
> 
> > If you choose it randomly then
> > you expect a collision after about 2^16 reuses.
> 
> I'm pretty sure that a random search will need to, on average,
> search half the keyspace before a match is found (i.e. 2^31
> attempts, not 2^16).

Yeah, but I was wondering whether you could somehow get into the
situation where clients between them are caching N distinct filehandles
with the same inode number.  Then a collision becomes likely around
2^16, by the usual birthday paradox rule-of-thumb.

Uh, but now that I think of it, that's irrelevant.  At most one of those
filehandles actually refers to a still-existing file.  Any attempt to
use the other 2^16-1 should return -ESTALE.  So collisions among that
set don't matter; it's only collisions involving the existing file that
are interesting.  So, never mind, I can't see a practical way to hit a
problem here.

--b.
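
For reference, the two bounds being contrasted in this sub-thread, as back-of-the-envelope math:

    brute-forcing one specific 32-bit gen value:
        expected guesses ~ 2^32 / 2 = 2^31
    collision among N independently random 32-bit values (birthday bound):
        P(collision) ~ N(N-1) / 2^33, which passes 1/2 around N ~ 2^16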


Re: sched: tweak select_idle_sibling to look for idle threads

2016-05-08 Thread Yuyang Du
On Sun, May 08, 2016 at 10:08:55AM +0200, Mike Galbraith wrote:
> > Maybe give the criteria a bit margin, not just wakees tend to equal 
> > llc_size,
> > but the numbers are so wild to easily break the fragile condition, like:
> 
> Seems lockless traversal and averages just lets multiple CPUs select
> the same spot.  An atomic reservation (feature) when looking for an
> idle spot (also for fork) might fix it up.  Run the thing as RT,
> push/pull ensures that it reaches box saturation regardless of the
> number of messaging threads, whereas with fair class, any number > 1
> will certainly stack tasks before the box is saturated.

Yes, good idea, bringing order to the race to grab idle CPU is absolutely
helpful.

In addition, I would argue maybe beefing up idle balancing is a more
productive way to spread load, as work-stealing just does what needs
to be done. And seems it has been (sub-unconsciously) neglected in this
case, :)

Regarding wake_wide(), it seems the M:N is 1:24, not 6:6*24, if so,
the slave will be 0 forever (as last_wakee is never flipped).

Basically whenever a waker has more than 1 wakee, the wakee_flips
will comfortably grow very large (with last_wakee alternating),
whereas when a waker has 0 or 1 wakee, the wakee_flips will just be 0.

So recording only the last_wakee seems not right unless you have other
good reason. If not the latter, counting waking wakee times should be
better, and then allow the statistics to happily play.
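
For reference, the bookkeeping being discussed is, with the periodic decay left out, roughly the following simplified sketch of record_wakee() in kernel/sched/fair.c:

static void record_wakee(struct task_struct *p)
{
	/* only a switch of wakee bumps the counter: a waker that always
	 * wakes the same single task keeps wakee_flips at 0, while
	 * alternating between two or more wakees makes it grow quickly */
	if (current->last_wakee != p) {
		current->last_wakee = p;
		current->wakee_flips++;
	}
}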


[PATCH v5 00/13] Support non-lru page migration

2016-05-08 Thread Minchan Kim
Recently, I have received many reports about performance degradation in
embedded systems (Android mobile phones, webOS TVs and so on) and easy
fork failures.

The problem is fragmentation, caused mainly by zram and GPU drivers.
Under memory pressure, their pages end up spread across all pageblocks,
and they cannot be migrated by the current compaction algorithm, which
supports only LRU pages. In the end, compaction cannot work well, so the
reclaimer shrinks the entire working set. This made systems very slow
and even made fork, which requires order-2 or order-3 allocations, fail
easily.

Another pain point is that these pages cannot use CMA memory space, so
when an OOM kill happens I can see many free pages in the CMA area,
which is not memory efficient. In our product, which has a big CMA area,
zones are reclaimed too excessively to allocate GPU and zram pages even
though there is lots of free space in CMA, so the system easily becomes
very slow.

To solve these problems, this patch series adds a facility to migrate
non-lru pages by introducing new functions and page flags to help
migration.


struct address_space_operations {
..
..
bool (*isolate_page)(struct page *, isolate_mode_t);
void (*putback_page)(struct page *);
..
}

new page flags

PG_movable
PG_isolated

For details, please read description in "mm: migrate: support non-lru
movable page migration".
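
As a rough illustration (hypothetical driver; the my_* names are invented
for this example and are not part of the patchset), a driver opts in by
filling in the new hooks in its address_space_operations and marking its
pages with __SetPageMovable() under the page lock:

static const struct address_space_operations my_driver_aops = {
	.isolate_page	= my_isolate_page,
	.migratepage	= my_migratepage,
	.putback_page	= my_putback_page,
};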

Originally, Gioh Kim had tried to support this feature, but he moved on,
so I took over the work. I took much of the code from his work and changed
it a little, and Konstantin Khlebnikov helped Gioh a lot, so he deserves
much of the credit, too.

And I should mention Chulmin, who has tested this patchset heavily and
helped me find many bugs. :)

Thanks, Gioh, Konstantin and Chulmin!

This patchset consists of five parts.

1. clean up migration
  mm: use put_page to free page instead of putback_lru_page

2. add non-lru page migration feature
  mm: migrate: support non-lru movable page migration

3. rework KVM memory-ballooning
  mm: balloon: use general non-lru movable page feature

4. zsmalloc refactoring for preparing page migration
  zsmalloc: keep max_object in size_class
  zsmalloc: use bit_spin_lock
  zsmalloc: use accessor
  zsmalloc: factor page chain functionality out
  zsmalloc: introduce zspage structure
  zsmalloc: separate free_zspage from putback_zspage
  zsmalloc: use freeobj for index

5. zsmalloc page migration
  zsmalloc: page migration support
  zram: use __GFP_MOVABLE for memory allocation

* From v4
  * rebase on mmotm-2016-05-05-17-19
  * fix huge object migration - Chulmin
  * !CONFIG_COMPACTION support for zsmalloc

* From v3
  * rebase on mmotm-2016-04-06-20-40
  * fix swap_info deadlock - Chulmin
  * race without page_lock - Vlastimil
  * no use page._mapcount for potential user-mapped page driver - Vlastimil
  * fix and enhance doc/description - Vlastimil
  * use page->mapping lower bits to represent PG_movable
  * make driver side's rule simple.

* From v2
  * rebase on mmotm-2016-03-29-15-54-16
  * check PageMovable before lock_page - Joonsoo
  * check PageMovable before PageIsolated checking - Joonsoo
  * add more description about rule

* From v1
  * rebase on v4.5-mmotm-2016-03-17-15-04
  * reordering patches to merge clean-up patches first
  * add Acked-by/Reviewed-by from Vlastimil and Sergey
  * use each own mount model instead of reusing anon_inode_fs - Al Viro
  * small changes - YiPing, Gioh

Cc: Vlastimil Babka 
Cc: dri-de...@lists.freedesktop.org
Cc: Hugh Dickins 
Cc: John Einar Reitan 
Cc: Jonathan Corbet 
Cc: Joonsoo Kim 
Cc: Konstantin Khlebnikov 
Cc: Mel Gorman 
Cc: Naoya Horiguchi 
Cc: Rafael Aquini 
Cc: Rik van Riel 
Cc: Sergey Senozhatsky 
Cc: virtualizat...@lists.linux-foundation.org
Cc: Gioh Kim 
Cc: Chan Gyun Jeong 
Cc: Sangseok Lee 
Cc: Kyeongdon Kim 
Cc: Chulmin Kim 

Minchan Kim (12):
  mm: use put_page to free page instead of putback_lru_page
  mm: migrate: support non-lru movable page migration
  mm: balloon: use general non-lru movable page feature
  zsmalloc: keep max_object in size_class
  zsmalloc: use bit_spin_lock
  zsmalloc: use accessor
  zsmalloc: factor page chain functionality out
  zsmalloc: introduce zspage structure
  zsmalloc: separate free_zspage from putback_zspage
  zsmalloc: use freeobj for index
  zsmalloc: page migration support
  zram: use __GFP_MOVABLE for memory allocation

 Documentation/filesystems/Locking  |4 +
 Documentation/filesystems/vfs.txt  |   11 +
 Documentation/vm/page_migration|  107 ++-
 drivers/block/zram/zram_drv.c  |6 +-
 drivers/virtio/virtio_balloon.c|   52 +-
 include/linux/balloon_compaction.h |   51 +-
 include/linux/fs.h |2 +
 include/linux/ksm.h|3 +-
 include/linux/migrate.h|5 +
 include/linux/mm.h |1 +
 include/linux/page-flags.h |   29 +-
 include/uapi/linux/magic.h |2 +
 mm/balloon_compaction.c|   94 +--
 mm/compaction.c 

[PATCH v5 02/12] mm: migrate: support non-lru movable page migration

2016-05-08 Thread Minchan Kim
Until now we have allowed migration only for LRU pages, and that was
enough to build high-order pages. But recently, embedded systems (e.g.,
webOS, Android) use lots of non-movable pages (e.g., zram, GPU memory),
so we have seen several reports about trouble with small high-order
allocations. There have been several efforts to fix the problem (e.g.,
enhancing the compaction algorithm, SLUB fallback to order-0 pages,
reserved memory, vmalloc and so on), but if there are lots of
non-movable pages in the system, those solutions are void in the long run.

So, this patch adds a facility to turn non-movable pages into movable
ones. For that, it introduces migration-related functions in
address_space_operations as well as some page flags.

If a driver wants to make its own pages movable, it should define three
functions, which are function pointers of struct address_space_operations.

1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

What the VM expects from the driver's isolate_page function is to return
*true* if the driver isolates the page successfully. On returning true,
the VM marks the page as PG_isolated so that concurrent isolation on
other CPUs skips the page. If the driver cannot isolate the page, it
should return *false*.

Once the page is successfully isolated, the VM uses the page.lru fields,
so the driver shouldn't expect the values in those fields to be preserved.

2. int (*migratepage) (struct address_space *mapping,
struct page *newpage, struct page *oldpage, enum migrate_mode);

After isolation, the VM calls the driver's migratepage with the isolated
page. The job of migratepage is to move the content of the old page to the
new page and set up the fields of struct page for newpage. Keep in mind
that you should clear PG_movable on oldpage via __ClearPageMovable under
page_lock if you migrated the oldpage successfully, and then return
MIGRATEPAGE_SUCCESS. If the driver cannot migrate the page at the moment,
it can return -EAGAIN. On -EAGAIN, the VM will retry page migration after
a short time, because it interprets -EAGAIN as a "temporary migration
failure". On any error other than -EAGAIN, the VM gives up on migrating
the page this time around.

The driver shouldn't touch the page.lru field, which the VM uses, in
these functions.

3. void (*putback_page)(struct page *);

If migration fails on an isolated page, the VM should return the isolated
page to the driver, so it calls the driver's putback_page with the page
whose migration failed. In this function, the driver should put the
isolated page back into its own data structures.
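
Put together, a minimal sketch of a driver implementing these three hooks
might look like the following (all my_* helpers are hypothetical and
invented for illustration; locking and error handling are elided):

static bool my_isolate_page(struct page *page, isolate_mode_t mode)
{
	/* detach the page from the driver's own lists; refuse if busy */
	if (!my_detach_page(page))
		return false;
	return true;	/* VM will mark the page PG_isolated */
}

static int my_migratepage(struct address_space *mapping,
			  struct page *newpage, struct page *oldpage,
			  enum migrate_mode mode)
{
	if (!my_copy_contents(newpage, oldpage))
		return -EAGAIN;		/* VM will retry shortly */

	/* done with the old page: drop its movable marking */
	__ClearPageMovable(oldpage);
	return MIGRATEPAGE_SUCCESS;
}

static void my_putback_page(struct page *page)
{
	/* migration failed: put the page back into the driver's lists */
	my_attach_page(page);
}

These are then wired into the driver's address_space_operations, as in
the cover-letter sketch above.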

4. non-lru movable page flags

There are two page flags for supporting non-lru movable page.

* PG_movable

The driver should use the function below, under page_lock, to make a page movable.

void __SetPageMovable(struct page *page, struct address_space *mapping)

It takes an address_space argument in order to register the family of
migration functions that the VM will call. Strictly speaking, PG_movable
is not a real flag of struct page; instead, the VM reuses the lower bits
of page->mapping to represent it:

#define PAGE_MAPPING_MOVABLE 0x2
page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

so the driver shouldn't access page->mapping directly. Instead, it should
use page_mapping(), which masks off the low two bits of page->mapping so
it can get the right struct address_space.

For testing whether a page is a non-lru movable page, the VM provides the
__PageMovable function. However, it doesn't guarantee correct
identification, because the page->mapping field is unioned with other
variables in struct page. Also, if the driver releases the page after the
VM has isolated it, page->mapping no longer has a stable value even though
it had PAGE_MAPPING_MOVABLE set (look at __ClearPageMovable). But
__PageMovable is a cheap way to tell whether a page is LRU or non-lru
movable once the page has been isolated, because LRU pages can never have
PAGE_MAPPING_MOVABLE in page->mapping. It is also good as a quick peek at
non-lru movable pages before the more expensive check under lock_page
during pfn scanning to select a victim.
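
For a quick peek, the cheap test amounts to something like the sketch
below (assuming the low two bits of page->mapping carry the anon/movable
tags as described above; the real helper lives in the page-flags header):

static inline bool __PageMovable(struct page *page)
{
	/* low two bits of ->mapping are used as anon/movable tags */
	return ((unsigned long)page->mapping & 0x3) == PAGE_MAPPING_MOVABLE;
}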

To identify a non-lru movable page reliably, the VM provides the
PageMovable function. Unlike __PageMovable, PageMovable validates
page->mapping and mapping->a_ops->isolate_page under lock_page. The
lock_page prevents page->mapping from being destroyed suddenly.

A driver using __SetPageMovable should clear the flag via
__ClearPageMovable under page_lock before releasing the page.

* PG_isolated

To prevent concurrent isolation on several CPUs, the VM marks an isolated
page as PG_isolated under lock_page, so a CPU that encounters a
PG_isolated non-lru movable page can skip it. The driver doesn't need to
manipulate the flag, because the VM sets and clears it automatically. Keep
in mind that if the driver sees a PG_isolated page, it means the page has
been isolated by the VM, so the driver shouldn't touch the page.lru field.
PG_isolated is an alias for the PG_reclaim flag, so the driver shouldn't
use the flag for its own purposes.

Cc: Rik van Riel 
Cc: Vlastimil Babka 
Cc: Joonsoo Kim 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Rafael Aquini 
Cc: 

[PATCH v5 04/12] zsmalloc: keep max_object in size_class

2016-05-08 Thread Minchan Kim
Every zspage in a size_class has the same maximum number of objects, so
we can move that value into the size_class.

Reviewed-by: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 mm/zsmalloc.c | 32 +++-
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 3d6d3dae505a..3c2574be8cee 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -32,8 +32,6 @@
  * page->freelist: points to the first free object in zspage.
  * Free objects are linked together using in-place
  * metadata.
- * page->objects: maximum number of objects we can store in this
- * zspage (class->zspage_order * PAGE_SIZE / class->size)
  * page->lru: links together first pages of various zspages.
  * Basically forming list of zspages in a fullness group.
  * page->mapping: class index and fullness group of the zspage
@@ -211,6 +209,7 @@ struct size_class {
 * of ZS_ALIGN.
 */
int size;
+   int objs_per_zspage;
unsigned int index;
 
struct zs_size_stat stats;
@@ -627,21 +626,22 @@ static inline void zs_pool_stat_destroy(struct zs_pool 
*pool)
  * the pool (not yet implemented). This function returns fullness
  * status of the given page.
  */
-static enum fullness_group get_fullness_group(struct page *first_page)
+static enum fullness_group get_fullness_group(struct size_class *class,
+   struct page *first_page)
 {
-   int inuse, max_objects;
+   int inuse, objs_per_zspage;
enum fullness_group fg;
 
VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
inuse = first_page->inuse;
-   max_objects = first_page->objects;
+   objs_per_zspage = class->objs_per_zspage;
 
if (inuse == 0)
fg = ZS_EMPTY;
-   else if (inuse == max_objects)
+   else if (inuse == objs_per_zspage)
fg = ZS_FULL;
-   else if (inuse <= 3 * max_objects / fullness_threshold_frac)
+   else if (inuse <= 3 * objs_per_zspage / fullness_threshold_frac)
fg = ZS_ALMOST_EMPTY;
else
fg = ZS_ALMOST_FULL;
@@ -728,7 +728,7 @@ static enum fullness_group fix_fullness_group(struct 
size_class *class,
enum fullness_group currfg, newfg;
 
get_zspage_mapping(first_page, &class_idx, &currfg);
-   newfg = get_fullness_group(first_page);
+   newfg = get_fullness_group(class, first_page);
if (newfg == currfg)
goto out;
 
@@ -1011,9 +1011,6 @@ static struct page *alloc_zspage(struct size_class 
*class, gfp_t flags)
init_zspage(class, first_page);
 
first_page->freelist = location_to_obj(first_page, 0);
-   /* Maximum number of objects we can store in this zspage */
-   first_page->objects = class->pages_per_zspage * PAGE_SIZE / class->size;
-
error = 0; /* Success */
 
 cleanup:
@@ -1241,11 +1238,11 @@ static bool can_merge(struct size_class *prev, int 
size, int pages_per_zspage)
return true;
 }
 
-static bool zspage_full(struct page *first_page)
+static bool zspage_full(struct size_class *class, struct page *first_page)
 {
VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-   return first_page->inuse == first_page->objects;
+   return first_page->inuse == class->objs_per_zspage;
 }
 
 unsigned long zs_get_total_pages(struct zs_pool *pool)
@@ -1631,7 +1628,7 @@ static int migrate_zspage(struct zs_pool *pool, struct 
size_class *class,
}
 
/* Stop if there is no more space */
-   if (zspage_full(d_page)) {
+   if (zspage_full(class, d_page)) {
unpin_tag(handle);
ret = -ENOMEM;
break;
@@ -1690,7 +1687,7 @@ static enum fullness_group putback_zspage(struct zs_pool 
*pool,
 {
enum fullness_group fullness;
 
-   fullness = get_fullness_group(first_page);
+   fullness = get_fullness_group(class, first_page);
insert_zspage(class, fullness, first_page);
set_zspage_mapping(first_page, class->index, fullness);
 
@@ -1939,8 +1936,9 @@ struct zs_pool *zs_create_pool(const char *name)
class->size = size;
class->index = i;
class->pages_per_zspage = pages_per_zspage;
-   if (pages_per_zspage == 1 &&
-   get_maxobj_per_zspage(size, pages_per_zspage) == 1)
+   class->objs_per_zspage = class->pages_per_zspage *
+   PAGE_SIZE / class->size;
+   if (pages_per_zspage == 1 && class->objs_per_zspage == 1)
class->huge = true;
spin_lock_init(&class->lock);
pool->size_class[i] = class;
-- 
1.9.1



[PATCH v5 05/12] zsmalloc: use bit_spin_lock

2016-05-08 Thread Minchan Kim
Use the kernel's standard bit spin-lock instead of the custom mess. The
custom version even has a bug: it doesn't disable preemption. The reason
we haven't hit any problem is that it has only been used inside a
preemption-disabled section, under the class->lock spinlock, so there is
no need to send this to stable.

Cc: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 mm/zsmalloc.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 3c2574be8cee..718dde7fd028 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -867,21 +867,17 @@ static unsigned long obj_idx_to_offset(struct page *page,
 
 static inline int trypin_tag(unsigned long handle)
 {
-   unsigned long *ptr = (unsigned long *)handle;
-
-   return !test_and_set_bit_lock(HANDLE_PIN_BIT, ptr);
+   return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
 }
 
 static void pin_tag(unsigned long handle)
 {
-   while (!trypin_tag(handle));
+   bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
 }
 
 static void unpin_tag(unsigned long handle)
 {
-   unsigned long *ptr = (unsigned long *)handle;
-
-   clear_bit_unlock(HANDLE_PIN_BIT, ptr);
+   bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle);
 }
 
 static void reset_page(struct page *page)
-- 
1.9.1



[PATCH v5 03/12] mm: balloon: use general non-lru movable page feature

2016-05-08 Thread Minchan Kim
Now that the VM has a feature to migrate non-lru movable pages, the
balloon driver doesn't need custom migration hooks in migrate.c and
compaction.c. Instead, this patch implements the
page->mapping->a_ops->{isolate|migrate|putback} functions.

With that, we can remove the ballooning hooks from the general migration
code and make balloon compaction simple.
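
The net effect (sketched here for orientation; the hunks below are
truncated, and the exact wiring lives in balloon_compaction.{h,c}) is
that the balloon inode's mapping carries an address_space_operations
table along these lines, so compaction reaches the balloon through the
generic non-lru hooks instead of the old balloon-specific callbacks:

const struct address_space_operations balloon_aops = {
	.migratepage	= balloon_page_migrate,
	.isolate_page	= balloon_page_isolate,
	.putback_page	= balloon_page_putback,
};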

Cc: virtualizat...@lists.linux-foundation.org
Cc: Rafael Aquini 
Cc: Konstantin Khlebnikov 
Signed-off-by: Gioh Kim 
Signed-off-by: Minchan Kim 
---
 drivers/virtio/virtio_balloon.c| 52 +++--
 include/linux/balloon_compaction.h | 51 ++---
 include/uapi/linux/magic.h |  1 +
 mm/balloon_compaction.c| 94 +++---
 mm/compaction.c|  7 ---
 mm/migrate.c   | 19 +---
 mm/vmscan.c|  2 +-
 7 files changed, 83 insertions(+), 143 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7b6d74f0c72f..04fc63b4a735 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -45,6 +46,10 @@ static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 
+#ifdef CONFIG_BALLOON_COMPACTION
+static struct vfsmount *balloon_mnt;
+#endif
+
 struct virtio_balloon {
struct virtio_device *vdev;
struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -486,6 +491,24 @@ static int virtballoon_migratepage(struct balloon_dev_info 
*vb_dev_info,
 
return MIGRATEPAGE_SUCCESS;
 }
+
+static struct dentry *balloon_mount(struct file_system_type *fs_type,
+   int flags, const char *dev_name, void *data)
+{
+   static const struct dentry_operations ops = {
+   .d_dname = simple_dname,
+   };
+
+   return mount_pseudo(fs_type, "balloon-kvm:", NULL, &ops,
+   BALLOON_KVM_MAGIC);
+}
+
+static struct file_system_type balloon_fs = {
+   .name   = "balloon-kvm",
+   .mount  = balloon_mount,
+   .kill_sb= kill_anon_super,
+};
+
 #endif /* CONFIG_BALLOON_COMPACTION */
 
 static int virtballoon_probe(struct virtio_device *vdev)
@@ -515,9 +538,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
vb->vdev = vdev;
 
balloon_devinfo_init(&vb->vb_dev_info);
-#ifdef CONFIG_BALLOON_COMPACTION
-   vb->vb_dev_info.migratepage = virtballoon_migratepage;
-#endif
 
err = init_vqs(vb);
if (err)
@@ -527,13 +547,33 @@ static int virtballoon_probe(struct virtio_device *vdev)
vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
err = register_oom_notifier(&vb->nb);
if (err < 0)
-   goto out_oom_notify;
+   goto out_del_vqs;
+
+#ifdef CONFIG_BALLOON_COMPACTION
+   balloon_mnt = kern_mount(&balloon_fs);
+   if (IS_ERR(balloon_mnt)) {
+   err = PTR_ERR(balloon_mnt);
+   unregister_oom_notifier(&vb->nb);
+   goto out_del_vqs;
+   }
+
+   vb->vb_dev_info.migratepage = virtballoon_migratepage;
+   vb->vb_dev_info.inode = alloc_anon_inode(balloon_mnt->mnt_sb);
+   if (IS_ERR(vb->vb_dev_info.inode)) {
+   err = PTR_ERR(vb->vb_dev_info.inode);
+   kern_unmount(balloon_mnt);
+   unregister_oom_notifier(&vb->nb);
+   vb->vb_dev_info.inode = NULL;
+   goto out_del_vqs;
+   }
+   vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
+#endif
 
virtio_device_ready(vdev);
 
return 0;
 
-out_oom_notify:
+out_del_vqs:
vdev->config->del_vqs(vdev);
 out_free_vb:
kfree(vb);
@@ -567,6 +607,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
cancel_work_sync(>update_balloon_stats_work);
 
remove_common(vb);
+   if (vb->vb_dev_info.inode)
+   iput(vb->vb_dev_info.inode);
kfree(vb);
 }
 
diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
index 9b0a15d06a4f..79542b2698ec 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -48,6 +48,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Balloon device information descriptor.
@@ -62,6 +63,7 @@ struct balloon_dev_info {
struct list_head pages; /* Pages enqueued & handled to Host */
int (*migratepage)(struct balloon_dev_info *, struct page *newpage,
struct page *page, enum migrate_mode mode);
+   struct inode *inode;
 };
 
 extern struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info);
@@ -73,45 +75,19 @@ static inline void 

[PATCH v5 06/12] zsmalloc: use accessor

2016-05-08 Thread Minchan Kim
An upcoming patch will change how zspage metadata is encoded, so for
easier review this patch wraps metadata accesses in accessor functions.

Cc: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 mm/zsmalloc.c | 82 +++
 1 file changed, 60 insertions(+), 22 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 718dde7fd028..086fd65311f7 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -266,10 +266,14 @@ struct zs_pool {
  * A zspage's class index and fullness group
  * are encoded in its (first)page->mapping
  */
-#define CLASS_IDX_BITS 28
 #define FULLNESS_BITS  4
-#define CLASS_IDX_MASK ((1 << CLASS_IDX_BITS) - 1)
-#define FULLNESS_MASK  ((1 << FULLNESS_BITS) - 1)
+#define CLASS_BITS 28
+
+#define FULLNESS_SHIFT 0
+#define CLASS_SHIFT    (FULLNESS_SHIFT + FULLNESS_BITS)
+
+#define FULLNESS_MASK  ((1UL << FULLNESS_BITS) - 1)
+#define CLASS_MASK ((1UL << CLASS_BITS) - 1)
 
 struct mapping_area {
 #ifdef CONFIG_PGTABLE_MAPPING
@@ -416,6 +420,41 @@ static int is_last_page(struct page *page)
return PagePrivate2(page);
 }
 
+static inline int get_zspage_inuse(struct page *first_page)
+{
+   return first_page->inuse;
+}
+
+static inline void set_zspage_inuse(struct page *first_page, int val)
+{
+   first_page->inuse = val;
+}
+
+static inline void mod_zspage_inuse(struct page *first_page, int val)
+{
+   first_page->inuse += val;
+}
+
+static inline int get_first_obj_offset(struct page *page)
+{
+   return page->index;
+}
+
+static inline void set_first_obj_offset(struct page *page, int offset)
+{
+   page->index = offset;
+}
+
+static inline unsigned long get_freeobj(struct page *first_page)
+{
+   return (unsigned long)first_page->freelist;
+}
+
+static inline void set_freeobj(struct page *first_page, unsigned long obj)
+{
+   first_page->freelist = (void *)obj;
+}
+
 static void get_zspage_mapping(struct page *first_page,
unsigned int *class_idx,
enum fullness_group *fullness)
@@ -424,8 +463,8 @@ static void get_zspage_mapping(struct page *first_page,
VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
m = (unsigned long)first_page->mapping;
-   *fullness = m & FULLNESS_MASK;
-   *class_idx = (m >> FULLNESS_BITS) & CLASS_IDX_MASK;
+   *fullness = (m >> FULLNESS_SHIFT) & FULLNESS_MASK;
+   *class_idx = (m >> CLASS_SHIFT) & CLASS_MASK;
 }
 
 static void set_zspage_mapping(struct page *first_page,
@@ -435,8 +474,7 @@ static void set_zspage_mapping(struct page *first_page,
unsigned long m;
VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-   m = ((class_idx & CLASS_IDX_MASK) << FULLNESS_BITS) |
-   (fullness & FULLNESS_MASK);
+   m = (class_idx << CLASS_SHIFT) | (fullness << FULLNESS_SHIFT);
first_page->mapping = (struct address_space *)m;
 }
 
@@ -634,7 +672,7 @@ static enum fullness_group get_fullness_group(struct 
size_class *class,
 
VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-   inuse = first_page->inuse;
+   inuse = get_zspage_inuse(first_page);
objs_per_zspage = class->objs_per_zspage;
 
if (inuse == 0)
@@ -680,7 +718,7 @@ static void insert_zspage(struct size_class *class,
 * empty/full. Put pages with higher ->inuse first.
 */
list_add_tail(&first_page->lru, &(*head)->lru);
-   if (first_page->inuse >= (*head)->inuse)
+   if (get_zspage_inuse(first_page) >= get_zspage_inuse(*head))
*head = first_page;
 }
 
@@ -860,7 +898,7 @@ static unsigned long obj_idx_to_offset(struct page *page,
unsigned long off = 0;
 
if (!is_first_page(page))
-   off = page->index;
+   off = get_first_obj_offset(page);
 
return off + obj_idx * class_size;
 }
@@ -895,7 +933,7 @@ static void free_zspage(struct page *first_page)
struct page *nextp, *tmp, *head_extra;
 
VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
-   VM_BUG_ON_PAGE(first_page->inuse, first_page);
+   VM_BUG_ON_PAGE(get_zspage_inuse(first_page), first_page);
 
head_extra = (struct page *)page_private(first_page);
 
@@ -936,7 +974,7 @@ static void init_zspage(struct size_class *class, struct 
page *first_page)
 * head of corresponding zspage's freelist.
 */
if (page != first_page)
-   page->index = off;
+   set_first_obj_offset(page, off);
 
vaddr = kmap_atomic(page);
link = (struct link_free *)vaddr + off / sizeof(*link);
@@ -991,7 +1029,7 @@ static struct page *alloc_zspage(struct size_class *class, 
gfp_t flags)
SetPagePrivate(page);
set_page_private(page, 0);
first_page = page;
-   
