date:20150417

Re: [PATCH 0/2] Tentative fix for the divide-by-zero on score/paris/..

2015-04-17 Thread Guenter Roeck


Hi Quentin,

it looks like there is another failure in linux-next, this time with 
sparc64:allmodconfig:

WARNING: arch/sparc/kernel/built-in.o(__ex_table+0x3b4): Section mismatch in reference from the (unknown reference) (unknown) to the variable :__retl_efault
/bin/sh: line 1: 22844 Floating point exception(core dumped) scripts/mod/modpost arch/sparc/kernel/built-in.o

The problem was previously hidden by a different build failure.

Guenter

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

sparc64: Build failure due to commit f1600e549b94 (sparc: Make sparc64 use scalable lib/iommu-common.c functions)

2015-04-17 Thread Guenter Roeck

Hi all,

I see the following build failure when compiling sparc64:allmodconfig
in the upstream kernel (v4.0-7820-g04b7fe6a4a23).

arch/sparc/kernel/pci_sun4v.o:(.discard+0x1): multiple definition of `__pcpu_unique_iommu_pool_hash'
arch/sparc/kernel/iommu.o:(.discard+0x0): first defined here
make[2]: *** [arch/sparc/kernel/built-in.o] Error 1
make[1]: *** [arch/sparc/kernel] Error 2

The problem is caused by commit f1600e549b94 ("sparc: Make sparc64
use scalable lib/iommu-common.c functions"), which introduces 

static DEFINE_PER_CPU(unsigned int, iommu_pool_hash);

in both files.

DEFINE_PER_CPU translates to DEFINE_PER_CPU_SECTION, which in turn is
defined as

#define DEFINE_PER_CPU_SECTION(type, name, sec) \
__PCPU_DUMMY_ATTRS char __pcpu_scope_##name;\
extern __PCPU_DUMMY_ATTRS char __pcpu_unique_##name;\
--> __PCPU_DUMMY_ATTRS char __pcpu_unique_##name;   \
extern __PCPU_ATTRS(sec) __typeof__(type) name; \
__PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES __weak \
__typeof__(type) name

if CONFIG_DEBUG_FORCE_WEAK_PER_CPU is configured, which is the case here.
The marked line above shows that __pcpu_unique_iommu_pool_hash is defined as
a global variable, which explains the problem (and makes me wonder what the
'static' keyword in front of DEFINE_PER_CPU is supposed to accomplish).
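
To illustrate why the leading 'static' does not help here, the following is a minimal sketch, assuming a cut-down macro with the same shape as the one quoted above. DEMO_DEFINE_PER_CPU and the demo_* names are hypothetical, and the section and __weak attributes of the real macro are left out. A storage-class specifier written in front of a macro that expands to several declarations binds only to the first of them.

```c
/*
 * Simplified stand-in for the CONFIG_DEBUG_FORCE_WEAK_PER_CPU variant of
 * DEFINE_PER_CPU_SECTION. All demo_* names are hypothetical; the section
 * and __weak attributes of the real macro are omitted.
 */
#define DEMO_DEFINE_PER_CPU(type, name)                                  \
	char demo_scope_##name;   /* receives a leading 'static' */      \
	extern char demo_unique_##name;                                  \
	char demo_unique_##name;  /* always a global definition */       \
	extern type name;                                                \
	type name                 /* global too, 'static' never reaches it */

/* 'static' binds only to the first declaration, demo_scope_iommu_pool_hash. */
static DEMO_DEFINE_PER_CPU(unsigned int, iommu_pool_hash);

/* Reads the dummy so the compiler does not flag it as unused. */
char demo_scope_probe(void)
{
	return demo_scope_iommu_pool_hash;
}
```

With this expansion, a second source file that uses the same variable name emits a second global definition of the demo_unique_ symbol, which is the shape of the collision the linker reports above for the .discard section.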

I thought about fixing the problem by renaming one of the variables, but
I am not sure if that is what is intended. Specifically, I am not sure if
the variables are supposed to be different, as it looks like, or if they
are supposed to be the same.

In case it is relevant, I use gcc version 4.6.3 for my build test.

Guenter

Re: Help with debugging intermittent crash on resume from hibernation

2015-04-17 Thread Nikolaus Rath

On Apr 17 2015, Mike Galbraith wrote:
> On Fri, 2015-04-17 at 20:15 -0700, Nikolaus Rath wrote:
>> 
>> Is there anything I can do to help debug this issue?
>
> You could try to bisect it, but judging from the problem description, 
> that could be more like a career than a troubleshooting session.

Furthermore, I have yet to find a kernel version that does *not* exhibit
the problem (I went back to 3.14).

Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«

Re: Help with debugging intermittent crash on resume from hibernation

2015-04-17 Thread Mike Galbraith

On Fri, 2015-04-17 at 21:13 -0700, Nikolaus Rath wrote:
> On Apr 17 2015, Mike Galbraith wrote:
> > On Fri, 2015-04-17 at 20:15 -0700, Nikolaus Rath wrote:
> > > 
> > > Is there anything I can do to help debug this issue?
> > 
> > You could try to bisect it, but judging from the problem description,
> > that could be more like a career than a troubleshooting session.
> 
> Furthermore, I have yet to find a kernel version that does *not* exhibit
> the problem (I went back to 3.14).

My solution would be: don't do that, then.  Hibernation doesn't save
much time anyway, so it is worth zero annoyance.

I would like to be able to suspend, but went from GeForce 8600 GT in 
old box that suspended great, resumed not at all to GeForce GTX 980 in 
new box which suspends great and resumes not at all, and is supported 
by nothing that makes eyecandy, so I have roughly a zillion unused 
transistors.  The thing occupies nearly the same volume of space as my 
laptop... but lappy doesn't have sexy big green leds to remind me that 
it's a Satellite lest I forget :)

-Mike

Re: [PATCH 01/21] e820, efi: add ACPI 6.0 persistent memory types

2015-04-17 Thread Andy Lutomirski

On Fri, Apr 17, 2015 at 6:35 PM, Dan Williams wrote:
> ACPI 6.0 formalizes e820-type-7 and efi-type-14 as persistent memory.
> Mark it "reserved" and allow it to be claimed by a persistent memory
> device driver.
>
> This definition is in addition to the Linux kernel's existing type-12
> definition that was recently added in support of shipping platforms with
> NVDIMM support that predate ACPI 6.0 (which now classifies type-12 as
> OEM reserved).  We may choose to exploit this wealth of definitions for
> NVDIMMs to differentiate E820_PRAM (type-12) from E820_PMEM (type-7).
> One potential differentiation is that PMEM is not backed by struct page
> by default in contrast to PRAM.  For now, they are effectively treated
> as aliases by the mm.
>
> Note, /proc/iomem can be consulted for differentiating legacy
> "Persistent RAM" E820_PRAM vs standard "Persistent I/O Memory"
> E820_PMEM.
>

Looks reasonable.  Time to ask my vendor if they can give me ACPI
6.0-compliant firmware.

--Andy

> Cc: Andy Lutomirski 
> Cc: Boaz Harrosh 
> Cc: H. Peter Anvin 
> Cc: Jens Axboe 
> Cc: Ingo Molnar 
> Cc: Christoph Hellwig 
> Signed-off-by: Dan Williams 
> Reviewed-by: Ross Zwisler 
> ---
>  arch/arm64/kernel/efi.c  |1 +
>  arch/ia64/kernel/efi.c   |1 +
>  arch/x86/boot/compressed/eboot.c |4 
>  arch/x86/include/uapi/asm/e820.h |1 +
>  arch/x86/kernel/e820.c   |   25 +++--
>  arch/x86/platform/efi/efi.c  |3 +++
>  include/linux/efi.h  |3 ++-
>  7 files changed, 31 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
> index ab21e0d58278..9d4aa18f2a82 100644
> --- a/arch/arm64/kernel/efi.c
> +++ b/arch/arm64/kernel/efi.c
> @@ -158,6 +158,7 @@ static __init int is_reserve_region(efi_memory_desc_t *md)
> case EFI_BOOT_SERVICES_CODE:
> case EFI_BOOT_SERVICES_DATA:
> case EFI_CONVENTIONAL_MEMORY:
> +   case EFI_PERSISTENT_MEMORY:
> return 0;
> default:
> break;
> diff --git a/arch/ia64/kernel/efi.c b/arch/ia64/kernel/efi.c
> index c52d7540dc05..cd8b7485e396 100644
> --- a/arch/ia64/kernel/efi.c
> +++ b/arch/ia64/kernel/efi.c
> @@ -1227,6 +1227,7 @@ efi_initialize_iomem_resources(struct resource *code_resource,
> case EFI_RUNTIME_SERVICES_CODE:
> case EFI_RUNTIME_SERVICES_DATA:
> case EFI_ACPI_RECLAIM_MEMORY:
> +   case EFI_PERSISTENT_MEMORY:
> default:
> name = "reserved";
> break;
> diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c
> index ef17683484e9..dde5bf7726f4 100644
> --- a/arch/x86/boot/compressed/eboot.c
> +++ b/arch/x86/boot/compressed/eboot.c
> @@ -1222,6 +1222,10 @@ static efi_status_t setup_e820(struct boot_params *params,
> e820_type = E820_NVS;
> break;
>
> +   case EFI_PERSISTENT_MEMORY:
> +   e820_type = E820_PMEM;
> +   break;
> +
> default:
> continue;
> }
> diff --git a/arch/x86/include/uapi/asm/e820.h b/arch/x86/include/uapi/asm/e820.h
> index 960a8a9dc4ab..0f457e6eab18 100644
> --- a/arch/x86/include/uapi/asm/e820.h
> +++ b/arch/x86/include/uapi/asm/e820.h
> @@ -32,6 +32,7 @@
>  #define E820_ACPI  3
>  #define E820_NVS   4
>  #define E820_UNUSABLE  5
> +#define E820_PMEM  7
>
>  /*
>   * This is a non-standardized way to represent ADR or NVDIMM regions that
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 11cc7d54ec3f..410af501a941 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -137,6 +137,8 @@ static void __init e820_print_type(u32 type)
> case E820_RESERVED_KERN:
> printk(KERN_CONT "usable");
> break;
> +   case E820_PMEM:
> +   case E820_PRAM:
> case E820_RESERVED:
> printk(KERN_CONT "reserved");
> break;
> @@ -149,9 +151,6 @@ static void __init e820_print_type(u32 type)
> case E820_UNUSABLE:
> printk(KERN_CONT "unusable");
> break;
> -   case E820_PRAM:
> -   printk(KERN_CONT "persistent (type %u)", type);
> -   break;
> default:
> printk(KERN_CONT "type %u", type);
> break;
> @@ -919,10 +918,26 @@ static inline const char *e820_type_to_string(int e820_type)
> case E820_NVS:  return "ACPI Non-volatile Storage";
> case E820_UNUSABLE: return "Unusable memory";
> case E820_PRAM: return "Persistent RAM";
> +   case E820_PMEM: return "Persistent I/O Memory";
> default:return "reserved";
> }
>  }
>
> +static bool

Re: panic with CPU hotplug + blk-mq + scsi-mq

2015-04-17 Thread Ming Lei

Hi Dongsu,

On Fri, Apr 17, 2015 at 5:41 AM, Dongsu Park wrote:
> Hi,
>
> there's a critical bug regarding CPU hotplug, blk-mq, and scsi-mq.
> Every time when a CPU is offlined, some arbitrary range of kernel memory
> seems to get corrupted. Then after a while, kernel panics at random places
> when block IOs are issued. (for example, see the call traces below)

Thanks for the report.

>
> This bug can be easily reproducible with a Qemu VM running with virtio-scsi,
> when its guest kernel is 3.19-rc1 or higher, and when scsi-mq is loaded
> with blk-mq enabled. And yes, 4.0 release is still affected, as well as
> Jens' for-4.1/core. How to reproduce:
>
>  # echo 0 > /sys/devices/system/cpu/cpu1/online
>  (and issue some block IOs, that's it.)
>
> Bisecting between 3.18 and 3.19-rc1, it looks like this bug had been hidden
> until commit ccbedf117f01 ("virtio_scsi: support multi hw queue of blk-mq"),
> which started to allow virtio-scsi to map virtqueues to hardware queues of
> blk-mq. Reverting that commit makes the bug go away. However, I suppose
> reverting it could not be a correct solution.

I agree, and that patch only enables multiple hw queues.

>
> More precisely, every time a CPU hotplug event gets triggered,
> a call graph is like the following:
>
>   blk_mq_queue_reinit_notify()
>   -> blk_mq_queue_reinit()
>-> blk_mq_map_swqueue()
> -> blk_mq_free_rq_map()
>  -> scsi_exit_request()
>
> From that point, as soon as any address in the request gets modified, an
> arbitrary range of memory gets corrupted. My first guess was that probably
> the exit routine could try to deallocate tags->rqs[] where invalid
> addresses are stored. But actually it looks like it's not the case,
> and cmd->sense_buffer looks also valid.
> It's not obvious to me, exactly what could go wrong.
>
> Does anyone have an idea?

As far as I can see, at least two problems exist:
- race between timeout and CPU hotplug
- with shared tags, the setting and checking of hctx->tags during CPU
  online handling

So could you please test the attached two patches to see if they fix your issue?

I ran them in my VM, and it looks like the oops does disappear.


Thanks,
Ming Lei
>
> Regards,
> Dongsu
>
>  [beginning of call traces] 
> [   47.274292] BUG: unable to handle kernel NULL pointer dereference at 
> 0018
> [   47.275013] IP: [] __bt_get.isra.5+0x7d/0x1e0
> [   47.275013] PGD 79c55067 PUD 7ba17067 PMD 0
> [   47.275013] Oops:  [#1] SMP
> [   47.275013] Modules linked in: fuse cpufreq_stats binfmt_misc 9p fscache 
> dm_round_robin loop dm_multipath 9pnet_virtio rtc_cmos 9pnet acpi_cpufreq 
> serio_raw i2c_piix4 virtio_net
> [   47.275013] CPU: 3 PID: 6232 Comm: blkid Not tainted 4.0.0 #303
> [   47.275013] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.7.5-20140709_153950- 04/01/2014
> [   47.275013] task: 88003dfbc020 ti: 880079bac000 task.ti: 
> 880079bac000
> [   47.275013] RIP: 0010:[]  [] 
> __bt_get.isra.5+0x7d/0x1e0
> [   47.275013] RSP: 0018:880079baf898  EFLAGS: 00010246
> [   47.275013] RAX: 003c RBX: 880079198400 RCX: 
> 0078
> [   47.275013] RDX: 88007fddbb80 RSI: 0010 RDI: 
> 880079198400
> [   47.275013] RBP: 880079baf8e8 R08: 88007fddbb80 R09: 
> 
> [   47.275013] R10: 0001 R11: 0001 R12: 
> 0010
> [   47.275013] R13: 0010 R14: 880079baf9e8 R15: 
> 88007fddbb80
> [   47.275013] FS:  2b270c049800() GS:88007fc0() 
> knlGS:
> [   47.275013] CS:  0010 DS:  ES:  CR0: 80050033
> [   47.275013] CR2: 0018 CR3: 7ca8d000 CR4: 
> 001407e0
> [   47.275013] Stack:
> [   47.275013]  880079baf978 88007fdd58c0 0078 
> 814071ff
> [   47.275013]  880079baf8d8 880079198400 0010 
> 0010
> [   47.275013]  880079baf9e8 88007fddbb80 880079baf968 
> 8140b4e5
> [   47.275013] Call Trace:
> [   47.275013]  [] ? blk_mq_queue_enter+0x9f/0x2d0
> [   47.275013]  [] bt_get+0x65/0x1e0
> [   47.275013]  [] ? blk_mq_queue_enter+0x9f/0x2d0
> [   47.275013]  [] ? wait_woken+0xa0/0xa0
> [   47.275013]  [] blk_mq_get_tag+0xa7/0xd0
> [   47.275013]  [] __blk_mq_alloc_request+0x1b/0x200
> [   47.275013]  [] blk_mq_map_request+0xd6/0x4e0
> [   47.275013]  [] blk_mq_make_request+0x6e/0x2d0
> [   47.275013]  [] ? generic_make_request_checks+0x674/0x6a0
> [   47.275013]  [] ? bio_add_page+0x5e/0x70
> [   47.275013]  [] generic_make_request+0xc0/0x110
> [   47.275013]  [] submit_bio+0x68/0x150
> [   47.275013]  [] ? lru_cache_add+0x1c/0x50
> [   47.275013]  [] mpage_bio_submit+0x2a/0x40
> [   47.275013]  [] mpage_readpages+0x10c/0x130
> [   47.275013]  [] ? I_BDEV+0x10/0x10
> [   47.275013]  [] ? I_BDEV+0x10/0x10
> [   47.275013]  [] ? __page_cache_alloc+0x137/0x160
> [   47.275013]  [] blkdev_readpages+0x1d/0x20
> [

Re: Help with debugging intermittent crash on resume from hibernation

2015-04-17 Thread Mike Galbraith

On Fri, 2015-04-17 at 20:15 -0700, Nikolaus Rath wrote:
> 
> Is there anything I can do to help debug this issue?

You could try to bisect it, but judging from the problem description, 
that could be more like a career than a troubleshooting session.

-Mike

[PATCH] Input: elants_i2c - zero-extend hardware ID in firmware name

2015-04-17 Thread Dmitry Torokhov

Let's zero-extend the hardware ID number when forming the firmware file
name, to avoid the kernel requesting firmware like "elants_i2c_   0.bin",
which is quite unexpected.

Signed-off-by: Dmitry Torokhov 
---
 drivers/input/touchscreen/elants_i2c.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/input/touchscreen/elants_i2c.c b/drivers/input/touchscreen/elants_i2c.c
index 43b3c9c..0efd766 100644
--- a/drivers/input/touchscreen/elants_i2c.c
+++ b/drivers/input/touchscreen/elants_i2c.c
@@ -699,7 +699,7 @@ static int elants_i2c_fw_update(struct elants_data *ts)
char *fw_name;
int error;
 
-   fw_name = kasprintf(GFP_KERNEL, "elants_i2c_%4x.bin", ts->hw_version);
+   fw_name = kasprintf(GFP_KERNEL, "elants_i2c_%04x.bin", ts->hw_version);
if (!fw_name)
return -ENOMEM;
 
-- 
2.2.0.rc0.207.ga3a616c
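
The effect of the one-character change can be checked in userspace. The sketch below is only an illustration: it uses snprintf() in place of the kernel's kasprintf(), and the demo_* function names are hypothetical. "%4x" pads the hex value to a width of four with spaces, while "%04x" pads with zeros.

```c
#include <stdio.h>

/* Old format: "%4x" pads the hex value to width 4 with spaces. */
int demo_fw_name_old(char *buf, size_t len, unsigned int hw_version)
{
	return snprintf(buf, len, "elants_i2c_%4x.bin", hw_version);
}

/* New format: "%04x" pads the hex value to width 4 with zeros. */
int demo_fw_name_new(char *buf, size_t len, unsigned int hw_version)
{
	return snprintf(buf, len, "elants_i2c_%04x.bin", hw_version);
}
```

For a hardware version of 0, the old format yields "elants_i2c_   0.bin" and the new one yields "elants_i2c_0000.bin".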


-- 
Dmitry

Re: Help with debugging intermittent crash on resume from hibernation

2015-04-17 Thread Nikolaus Rath

On 03/19/2015 08:57 PM, Mike Galbraith wrote:
> (+CC)
> 
> On Thu, 2015-03-19 at 20:21 -0700, Nikolaus Rath wrote:
>> On Mar 13 2015, Nikolaus Rath wrote:
>>> Hello,
>>>
>>> In about one out of 10 resumes from hibernation, my system resets after
>>> the hibernation image has been loaded. I am hibernating using
>>>
>>> # echo platform > /sys/power/disk
>>> # echo disk > /sys/power/state
>>>
>>> When testing hibernation using
>>>
>>> # echo core > /sys/power/pm_test
>>> # echo platform > /sys/power/disk
>>> # echo disk > /sys/power/state
>>>
>>> I was not able to produce the same failure.
>>>
>>> I then tried attaching a serial console and booted with
>>> console=ttyS0,115200n8 no_console_suspend. For the failed attempt, the
>>> last messages before the reset are:
>> [...]
>>
>> I reproduced the same problem with 4.0.0-rc3. Is there anything I can do
>> to get more debugging information?

I also found that blacklisting the nouveau module seems to dramatically
reduce the occurrence of this problem (it has happened only once since I
blacklisted the module, which was about 30 resumes ago).

Is there anything I can do to help debug this issue?


Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«

Re: [PATCH 02/20] STAGING/lustre: limit follow_link recursion using stack space.

2015-04-17 Thread Al Viro

On Mon, Mar 23, 2015 at 01:37:38PM +1100, NeilBrown wrote:
> lustre's ->follow_link() uses a lot of stack space and so
> needs to limit symlink recursion based on stack size.
> 
> It currently tests current->link_count, but that will soon
> become private to fs/namei.c.
> So instead base on actual available stack space.
> This patch aborts recursive symlinks if less than 2K of space
> is available.  This seems consistent with current code, but
> hasn't been tested.

BTW, in the best case that logic is fishy.  We have "up to 5 levels
with 4Kb stack and up to 7 with 8Kb one".  Could somebody manage to dig out
the reasons for such limits?  Preferably along with the kernel version where
the overflows had been observed, both for 4K and 8K cases.

I'm very tempted to rip that thing out in the "kill link_path_walk()
recursion completely" series...

Re: DRBG seeding

2015-04-17 Thread Herbert Xu

On Sat, Apr 18, 2015 at 04:04:14AM +0200, Stephan Mueller wrote:
> 
> However, the only serious solution I can offer to not block is to use my
> Jitter RNG which delivers entropy in (almost all) use cases. See [1]. The
> code is relatively small and does not have any dependencies. In this case,
> we could perform the initialization of the DRBG as follows:
> 
> 1. pull buffer of size entropy + nonce from get_random_bytes
> 
> 2. pull another buffer of size entropy + nonce from my Jitter RNG
> 
> 3. XOR both

Don't xor them.  Just provide them both as input to the seed
function.

> 4. seed the DRBG with it
> 
> 5. trigger the async invocation of the in-kernel /dev/random
> 
> 6. return the DRBG instance to the caller without waiting for the
> completion of step 5
> 
> This way, we will get entropy during the first initialization without
> blocking. After speaking with mathematicians at NIST, that Jitter RNG
> approach would be accepted. Note, I personally think that the Jitter RNG
> has sufficient entropy in almost all circumstances (see the massive
> testing I conducted on all more widely used CPUs).
> 
> [1] http://www.chronox.de/jent.html

OK this sounds like it should satisfy everybody.
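
As a rough illustration of "provide them both as input" instead of XOR, the sketch below simply concatenates the two buffers into one block of seed material. This is an assumption about one possible reading, not the DRBG code: demo_build_seed and its parameter names are hypothetical, and no kernel crypto API is used.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Build seed material from two independent sources by concatenation.
 * 'out' must have room for 2 * len bytes. Returns the number of bytes
 * written. Hypothetical helper, not part of any kernel interface.
 */
size_t demo_build_seed(uint8_t *out, const uint8_t *from_random,
		       const uint8_t *from_jitter, size_t len)
{
	memcpy(out, from_random, len);       /* first source, unchanged */
	memcpy(out + len, from_jitter, len); /* second source, appended */
	return 2 * len;
}
```

One reason to prefer this shape: if the two buffers ever turned out to be identical, XOR would yield all zeros, whereas concatenation hands every input bit to the seed function.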

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[PATCH] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

2015-04-17 Thread Waiman Long

In up_read()/up_write(), rwsem_wake() will be called whenever it
detects that some writers/readers are waiting. The rwsem_wake()
function will take the wait_lock and call __rwsem_do_wake() to do
the real wakeup.  This can be a problem especially for up_read()
where many readers can potentially call rwsem_wake() at more or less
the same time even though a single call should be enough. This will
cause contention in the wait_lock cacheline resulting in delay of
the up_read/up_write operations.

This patch makes the wait_lock taking and the call to __rwsem_do_wake()
optional if at least one spinning writer is present. The spinning
writer will be able to take the rwsem and call rwsem_wake() later
when it calls up_write(). With the presence of a spinning writer,
rwsem_wake() will now try to acquire the lock using trylock. If that
fails, it will just quit.

This patch also checks one more time to see if the rwsem was stolen
just before doing the expensive wakeup operation which will be
unnecessary if the lock was stolen.
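
The control flow described above can be sketched in userspace roughly as follows. This is a simplified model under stated assumptions, not the patch itself: struct demo_rwsem and the demo_* helpers are hypothetical, a C11 atomic_flag stands in for wait_lock, and a plain boolean stands in for the osq_is_locked() check.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct demo_rwsem {
	atomic_flag wait_lock;   /* stands in for sem->wait_lock */
	atomic_bool has_spinner; /* stands in for osq_is_locked(&sem->osq) */
	int wakeups;             /* counts runs of the expensive wakeup path */
};

static bool demo_trylock(atomic_flag *l)
{
	return !atomic_flag_test_and_set(l);
}

static void demo_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set(l))
		; /* spin until the flag is released */
}

static void demo_unlock(atomic_flag *l)
{
	atomic_flag_clear(l);
}

/* Returns true if the wakeup path ran, false if it was skipped. */
static bool demo_rwsem_wake(struct demo_rwsem *sem)
{
	if (atomic_load(&sem->has_spinner)) {
		/* A spinner will call wake again later, so do not wait. */
		if (!demo_trylock(&sem->wait_lock))
			return false;
	} else {
		demo_lock(&sem->wait_lock);
	}
	sem->wakeups++; /* placeholder for __rwsem_do_wake() */
	demo_unlock(&sem->wait_lock);
	return true;
}
```

When a spinner is present and wait_lock is already held, the caller returns immediately instead of queueing on the lock, which is where the reduction in cacheline contention is expected to come from.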

On an 8-socket Westmere-EX server (80 cores, HT off), running AIM7's
high_systime workload (1000 users) on a vanilla 4.0 kernel produced
the following perf profile for spinlock contention:

  9.23%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
         |--97.39%-- rwsem_wake
         |--0.69%-- try_to_wake_up
         |--0.52%-- release_pages
          --1.40%-- [...]

  1.70%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irq
         |--96.61%-- rwsem_down_write_failed
         |--2.03%-- __schedule
         |--0.50%-- run_timer_softirq
          --0.86%-- [...]

With a patched 4.0 kernel, the perf profile became:

  1.87%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
         |--87.64%-- rwsem_wake
         |--2.80%-- release_pages
         |--2.56%-- try_to_wake_up
         |--1.10%-- __wake_up
         |--1.06%-- pagevec_lru_move_fn
         |--0.93%-- prepare_to_wait_exclusive
         |--0.71%-- free_pid
         |--0.58%-- get_page_from_freelist
         |--0.57%-- add_device_randomness
          --2.04%-- [...]

  0.80%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irq
         |--92.49%-- rwsem_down_write_failed
         |--4.24%-- __schedule
         |--1.37%-- run_timer_softirq
          --1.91%-- [...]

The table below shows the % improvement in throughput (1100-2000 users)
in the various AIM7 workloads:

  Workload        % increase in throughput
  --------        ------------------------
  custom                   3.8%
  five-sec                 3.5%
  fserver                  4.1%
  high_systime            22.2%
  shared                   2.1%
  short                   10.1%

Signed-off-by: Waiman Long 
---
 include/linux/osq_lock.h|5 +++
 kernel/locking/rwsem-xadd.c |   60 ++-
 2 files changed, 64 insertions(+), 1 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 3a6490e..703ea5c 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -32,4 +32,9 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
 extern bool osq_lock(struct optimistic_spin_queue *lock);
 extern void osq_unlock(struct optimistic_spin_queue *lock);
 
+static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
+{
+   return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
+}
+
 #endif
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2f7cc40..d088045 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -107,6 +107,35 @@ enum rwsem_wake_type {
RWSEM_WAKE_READ_OWNED   /* Waker thread holds the read lock */
 };
 
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * return true if there is an active writer by checking the owner field which
+ * should be set if there is one.
+ */
+static inline bool rwsem_has_active_writer(struct rw_semaphore *sem)
+{
+   return READ_ONCE(sem->owner) != NULL;
+}
+
+/*
+ * Return true if the rwsem has active spinner
+ */
+static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
+{
+   return osq_is_locked(&sem->osq);
+}
+#else /* CONFIG_RWSEM_SPIN_ON_OWNER */
+static inline bool rwsem_has_active_writer(struct rw_semaphore *sem)
+{
+   return false;   /* Assume it has no active writer */
+}
+
+static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
+{
+   return false;
+}
+#endif /* CONFIG_RWSEM_SPIN_ON_OWNER */
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_xxxx(), then:
@@ -125,6 +154,14 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
struct list_head *next;
long oldcount, woken, loop, adjustment;
 
+   /*
+* up_write() cleared the owner field before calling this function.
+

Re: [PATCH 3/6] direct-io: add support for write stream IDs

2015-04-17 Thread Jens Axboe


On 04/17/2015 05:51 PM, Dave Chinner wrote:

On Fri, Apr 17, 2015 at 05:11:40PM -0600, Jens Axboe wrote:

On 04/17/2015 05:06 PM, Dave Chinner wrote:

On Thu, Apr 16, 2015 at 11:20:45PM -0700, Ming Lin wrote:

On Sat, Apr 11, 2015 at 4:59 AM, Dave Chinner wrote:

On Fri, Apr 10, 2015 at 04:50:05PM -0700, Ming Lin wrote:

On Wed, Mar 25, 2015 at 7:26 AM, Jens Axboe wrote:

If iocb->ki_filp->f_streamid is not set, then it should fall back to
whatever is set on the inode->i_streamid.


Why should it do the fallback?


Because then you have a method of using streams with applications
that aren't aware of streams.

Or perhaps you have a file you know has different access patterns to
the rest of the files in a directory, and you don't want to have to
set the stream on every process that opens and uses that file. e.g.
database writeahead log files (sequential write, never read) vs
database index/table files (random read/write).


Good point, agree. Will make that change.


That change causes a problem for direct IO, for example:

process 1:
fd = open("/dev/nvme0n1", O_DIRECT...);
//set stream_id 1
fadvise(fd, 1, 0, POSIX_FADV_STREAMID);
pwrite(fd, );

process 2:
fd = open("/dev/nvme0n1", O_DIRECT...);
//should be legacy stream_id 0
pwrite(fd, );

But now process 2 also sees stream_id 1, which is wrong.


It's not wrong, your behaviour model is just different. You have
defined a process/fd based stream model and not considered
that admins and applications might want to use a file
based stream model instead, so applications don't need to even be
aware that write streams are in use...


The stream must be opened, otherwise the device will return an error if
an application writes to a stream that has not been opened.


That's an extremely device specific *implementation* of a write
stream. The *concept* of a write stream being passed from userspace to
the block layer doesn't have such constraints, and I get really
concerned when implementations of a generic concept are so tightly
focussed around one type of hardware implementation of the
concept...


Indeed, which is why the implementation posted cares ONLY about the
stream ID itself, and passing that through.

But the point about fallback is valid, however, for some use cases
that will not be what you want. But we have to make some sort of
decision, and falling back to the inode set value (if one is set) is
probably the right thing to do in most use cases.
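
The fallback rule under discussion can be sketched as below. This is only a model: struct demo_file and struct demo_inode are hypothetical, cut-down stand-ins for the kernel structures, and treating 0 as "not set" is an assumption based on the "legacy stream_id 0" comment in the example above.

```c
/* Hypothetical, cut-down stand-ins for struct inode and struct file. */
struct demo_inode {
	unsigned int i_streamid; /* inode-wide default, 0 means not set */
};

struct demo_file {
	unsigned int f_streamid; /* per-open value, 0 means not set */
	struct demo_inode *f_inode;
};

/* Prefer the per-file stream ID; fall back to the inode's if unset. */
unsigned int demo_effective_streamid(const struct demo_file *filp)
{
	if (filp->f_streamid)
		return filp->f_streamid;
	return filp->f_inode->i_streamid;
}
```

With this rule, a second opener that never sets a stream ID stays on stream 0 only as long as nothing has set the inode value. If setting the ID on one fd also stores it on the inode, the second opener inherits it, which is the behaviour the two sides of this thread read differently.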


Right, the question is then whether fadvise should set the value on
the inode at all, because then the effect of setting it on a fd also
changes the fallback. Perhaps we need a distinction between
"setting the stream for this fd" which lasts as long as the fd is
active, and "setting the default inode stream" which is potentially
a persistent operation if the filesystem stores it on disk...


Yes, that might be a good compromise. The easiest would be to define a 
second fadvise advice, where the stronger advice would be file + inode. 
Another option would be changing the file approach to use fcntl(), and 
keeping the fadvise for the inode. I'll be happy to take input on what 
people would prefer here.



Device has limited number of streams, for example, 16 streams.
There are 2 APIs to open/close the stream.


What's to stop me writing something for DM-thinp that understands
write streams in bios and uses it to separate out the write streams
into different regions of the thinp device to improve locality of
its data placement and hence reduce fragmentation?


Absolutely nothing, in fact that's one of the use cases that I had
in mind. Or for caching software.


*nod*. We are on the same page, then :)


Yes completely, basically just wanted to clarify that.

--
Jens Axboe


Re: Device mapper failed to open temporary keystore device

2015-04-17 Thread Herbert Xu

On Fri, Apr 17, 2015 at 06:38:49PM -0400, Mike Snitzer wrote:
>
> There are also some crypto changes that could very easily be the cause
> of your problem (cc'ing Herbert), e.g.:
> 
> $ git diff next-20150410^..next-20150413 -- crypto | diffstat

My guess is that you guys need this patch:

commit 34c9a0ffc75ad25b6a60f61e27c4a4b1189b8085
Author: Herbert Xu 
Date:   Thu Apr 16 11:07:13 2015 +0800

crypto: fix broken crypto_register_instance() module handling

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[PATCH 03/21] nd_acpi: initial core implementation and nfit skeleton

2015-04-17 Thread Dan Williams

1/ Autodetect an NFIT table for the ACPI namespace device with _HID of
   "ACPI0012"

2/ Skeleton implementation to register an NFIT bus.

The NFIT provided by ACPI is the primary method by which platforms will
discover NVDIMM resources.  However, the intent of the
nfit_bus_descriptor abstraction is to contain "provider" specific
details, leaving the nd core to be NFIT-provider agnostic.  This
flexibility is exploited in later patches to implement special purpose
providers of test and custom-defined NFITs.

Cc: 
Cc: Robert Moore 
Cc: Rafael J. Wysocki 
Signed-off-by: Dan Williams 
---
 drivers/block/Kconfig |2 
 drivers/block/Makefile|1 
 drivers/block/nd/Kconfig  |   44 ++
 drivers/block/nd/Makefile |6 +
 drivers/block/nd/acpi.c   |  112 +
 drivers/block/nd/core.c   |   48 +++
 drivers/block/nd/nfit.h   |  201 +
 7 files changed, 414 insertions(+)
 create mode 100644 drivers/block/nd/Kconfig
 create mode 100644 drivers/block/nd/Makefile
 create mode 100644 drivers/block/nd/acpi.c
 create mode 100644 drivers/block/nd/core.c
 create mode 100644 drivers/block/nd/nfit.h

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index eb1fed5bd516..dfe40e5ca9bd 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -321,6 +321,8 @@ config BLK_DEV_NVME
  To compile this driver as a module, choose M here: the
  module will be called nvme.
 
+source "drivers/block/nd/Kconfig"
+
 config BLK_DEV_SKD
tristate "STEC S1120 Block Driver"
depends on PCI
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 9cc6c18a1c7e..18b27bb9cd2d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_CDROM_PKTCDVD)   += pktcdvd.o
 obj-$(CONFIG_MG_DISK)  += mg_disk.o
 obj-$(CONFIG_SUNVDC)   += sunvdc.o
 obj-$(CONFIG_BLK_DEV_NVME) += nvme.o
+obj-$(CONFIG_NFIT_DEVICES) += nd/
 obj-$(CONFIG_BLK_DEV_SKD)  += skd.o
 obj-$(CONFIG_BLK_DEV_OSD)  += osdblk.o
 
diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
new file mode 100644
index ..5fa74f124b3e
--- /dev/null
+++ b/drivers/block/nd/Kconfig
@@ -0,0 +1,44 @@
+config ND_ARCH_HAS_IOREMAP_CACHE
+   depends on (X86 || IA64 || ARM || ARM64 || SH || XTENSA)
+   def_bool y
+
+menuconfig NFIT_DEVICES
+   bool "NVDIMM (NFIT) Support"
+   depends on ND_ARCH_HAS_IOREMAP_CACHE
+   depends on PHYS_ADDR_T_64BIT
+   help
+ Support for non-volatile memory devices defined by the NVDIMM
+ Firmware Interface Table (NFIT).  On platforms that define an
+ NFIT, via ACPI or other means, an "nd_bus" is registered to
+ advertise PM (persistent memory) namespaces (/dev/pmemX) and
+ BLOCK (sliding block data window) namespaces (/dev/ndX). A PM
+ namespace refers to a system-physical-address-range that may
+ span multiple DIMMs and support DAX (see CONFIG_DAX).  A BLOCK
+ namespace refers to an NVDIMM control region which exposes a
+ register-based windowed access mode to non-volatile memory.
+ See the NVDIMM Firmware Interface Table specification for more
+ details.
+
+if NFIT_DEVICES
+
+config ND_CORE
+   tristate "Core: Generic 'nd' Device Model"
+   help
+ Platform agnostic device model for an NFIT-defined bus.
+ Publishes resources for a NFIT-persistent-memory driver and/or
+ NFIT-block-data-window driver to attach.  Exposes a device
+ topology under a "ndX" bus device and a "/dev/ndctl"
+ dimm-ioctl message passing interface per registered NFIT
+ instance.  A userspace library "ndctl" provides an API to
+ enumerate/manage this subsystem.
+
+config NFIT_ACPI
+   tristate "NFIT ACPI: Discover ACPI-Namespace NFIT Devices"
+   select ND_CORE
+   depends on ACPI
+   help
+ Infrastructure to probe the ACPI namespace for NVDIMMs and
+ register the platform-global NFIT blob with the core.  Also
+ enables the core to craft ACPI._DSM messages for platform/dimm
+ configuration.
+endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
new file mode 100644
index ..22701ab7dcae
--- /dev/null
+++ b/drivers/block/nd/Makefile
@@ -0,0 +1,6 @@
+obj-$(CONFIG_ND_CORE) += nd.o
+obj-$(CONFIG_NFIT_ACPI) += nd_acpi.o
+
+nd_acpi-y := acpi.o
+
+nd-y := core.o
diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
new file mode 100644
index ..48db723d7a90
--- /dev/null
+++ b/drivers/block/nd/acpi.c
@@ -0,0 +1,112 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed

[PATCH 15/21] nd: pmem label sets and namespace instantiation.

2015-04-17 Thread Dan Williams

A complete label set is a PMEM-label per dimm where all the UUIDs
match and the interleave set cookie matches an active interleave set.

Present a sysfs ABI for manipulation of a PMEM-namespace's 'alt_name',
'uuid', and 'size' attributes.  A later patch will make these settings
persistent by writing back the label.

Note that PMEM allocations grow forwards from the start of an interleave
set (lowest dimm-physical-address (DPA)).  BLK-namespaces that alias
with a PMEM interleave set will grow allocations backward from the
highest DPA.
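
The two allocation directions described above can be illustrated with a
small user-space sketch.  This is not code from the patch: struct
dpa_space, dpa_init(), alloc_pmem() and alloc_blk() are hypothetical
names, and the model assumes a single free gap between the PMEM front
and the BLK front of one dimm.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one dimm's DPA space: PMEM grows up, BLK grows down. */
struct dpa_space {
	uint64_t pmem_end;   /* first DPA not yet claimed by PMEM */
	uint64_t blk_start;  /* lowest DPA already claimed by BLK */
};

static void dpa_init(struct dpa_space *s, uint64_t size)
{
	s->pmem_end = 0;
	s->blk_start = size;
}

/* Claim 'len' bytes for PMEM at the lowest free DPA; false if no room. */
static bool alloc_pmem(struct dpa_space *s, uint64_t len, uint64_t *start)
{
	if (len > s->blk_start - s->pmem_end)
		return false;
	*start = s->pmem_end;
	s->pmem_end += len;
	return true;
}

/* Claim 'len' bytes for BLK just below the current BLK front; false if no room. */
static bool alloc_blk(struct dpa_space *s, uint64_t len, uint64_t *start)
{
	if (len > s->blk_start - s->pmem_end)
		return false;
	s->blk_start -= len;
	*start = s->blk_start;
	return true;
}
```

Because the two fronts only meet when the dimm is full, neither kind of
namespace has to be relocated to let the other one grow.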

Cc: Greg KH 
Cc: Neil Brown 
Signed-off-by: Dan Williams 
---
 drivers/block/nd/bus.c|6 
 drivers/block/nd/core.c   |   64 ++
 drivers/block/nd/dimm.c   |2 
 drivers/block/nd/dimm_devs.c  |  127 +
 drivers/block/nd/label.c  |   54 ++
 drivers/block/nd/label.h  |3 
 drivers/block/nd/namespace_devs.c | 1020 +
 drivers/block/nd/nd-private.h |   14 +
 drivers/block/nd/nd.h |   33 +
 drivers/block/nd/pmem.c   |   22 +
 drivers/block/nd/region_devs.c|  147 +
 include/linux/nd.h|   24 +
 include/uapi/linux/ndctl.h|4 
 13 files changed, 1508 insertions(+), 12 deletions(-)

diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index 944d7d7845fe..8e70098b6cb0 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -274,8 +274,10 @@ void nd_bus_destroy_ndctl(struct nd_bus *nd_bus)
device_destroy(nd_class, MKDEV(nd_bus_major, nd_bus->id));
 }
 
-static void wait_nd_bus_probe_idle(struct nd_bus *nd_bus)
+void wait_nd_bus_probe_idle(struct device *dev)
 {
+   struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
do {
if (nd_bus->probe_active == 0)
break;
@@ -294,7 +296,7 @@ static int nd_cmd_clear_to_send(struct nd_dimm *nd_dimm, unsigned int cmd)
return 0;
 
nd_bus = walk_to_nd_bus(&nd_dimm->dev);
-   wait_nd_bus_probe_idle(nd_bus);
+   wait_nd_bus_probe_idle(&nd_bus->dev);
 
if (atomic_read(&nd_dimm->busy))
return -EBUSY;
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index 976cd5e3ebaf..560ed496 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -149,6 +150,69 @@ struct nd_bus *walk_to_nd_bus(struct device *nd_dev)
return NULL;
 }
 
+static bool is_uuid_sep(char sep)
+{
+   if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+   return true;
+   return false;
+}
+
+static int nd_uuid_parse(struct device *dev, u8 *uuid_out, const char *buf,
+   size_t len)
+{
+   const char *str = buf;
+   u8 uuid[16];
+   int i;
+
+   for (i = 0; i < 16; i++) {
+   if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+   dev_dbg(dev, "%s: pos: %d buf[%zd]: %c buf[%zd]: %c\n",
+   __func__, i, str - buf, str[0],
+   str + 1 - buf, str[1]);
+   return -EINVAL;
+   }
+
+   uuid[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+   str += 2;
+   if (is_uuid_sep(*str))
+   str++;
+   }
+
+   memcpy(uuid_out, uuid, sizeof(uuid));
+   return 0;
+}
+
+/**
+ * nd_uuid_store: common implementation for writing 'uuid' sysfs attributes
+ * @dev: container device for the uuid property
+ * @uuid_out: uuid buffer to replace
+ * @buf: raw sysfs buffer to parse
+ *
+ * Enforce that uuids can only be changed while the device is disabled
+ * (driver detached)
+ * LOCKING: expects device_lock() is held on entry
+ */
+int nd_uuid_store(struct device *dev, u8 **uuid_out, const char *buf,
+   size_t len)
+{
+   u8 uuid[16];
+   int rc;
+
+   if (dev->driver)
+   return -EBUSY;
+
+   rc = nd_uuid_parse(dev, uuid, buf, len);
+   if (rc)
+   return rc;
+
+   kfree(*uuid_out);
+   *uuid_out = kmemdup(uuid, sizeof(uuid), GFP_KERNEL);
+   if (!(*uuid_out))
+   return -ENOMEM;
+
+   return 0;
+}
+
 static ssize_t commands_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
diff --git a/drivers/block/nd/dimm.c b/drivers/block/nd/dimm.c
index ccc96d8fe2e7..eb62bc2848d3 100644
--- a/drivers/block/nd/dimm.c
+++ b/drivers/block/nd/dimm.c
@@ -97,7 +97,7 @@ static int nd_dimm_remove(struct device *dev)
nd_bus_lock(dev);
dev_set_drvdata(dev, NULL);
for_each_dpa_resource_safe(ndd, res, _r)
-   __release_region(&ndd->dpa, res->start, resource_size(res));
+   nd_dimm_free_dpa(ndd, res);
nd_bus_unlock(dev);
free_data(ndd);
 
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index

[PATCH 09/21] nd_dimm: dimm driver and base nd-bus device-driver infrastructure

2015-04-17 Thread Dan Williams

* Implement the device-model infrastructure for loading modules and
  attaching drivers to nd devices.  This is a simple association of a
  nd-device-type number with a driver that has a bitmask of supported
  device types.  To facilitate userspace bind/unbind operations 'modalias'
  and 'devtype', that also appear in the uevent, are added as generic
  sysfs attributes for all nd devices.  The reason for the device-type
  number is to support sub-types within a given parent devtype, be it a
  vendor-specific sub-type or otherwise.

* The first consumer of this infrastructure is the driver
  for dimm devices.  It simply uses control messages to retrieve and
  store the configuration-data image (label set) from each dimm.

Note: nd_device_register() arranges for asynchronous registration of
  nd bus devices.
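
The association between a device-type number and a driver's bitmask of
supported types can be sketched in user space as below.  The names
demo_driver, DEMO_DRIVER_TYPE() and demo_bus_match(), as well as the
numeric type values, are made up for illustration; the patch itself
performs the equivalent check with test_bit() in nd_bus_match().

```c
#include <stdbool.h>

/* Hypothetical device-type numbers; 0 is treated as "no known type". */
enum { DEMO_DEVICE_DIMM = 1, DEMO_DEVICE_REGION = 2 };

/* Build the mask bit for device type 't'. */
#define DEMO_DRIVER_TYPE(t) (1UL << (t))

struct demo_driver {
	const char *name;
	unsigned long type_mask;  /* bit N set => driver handles device type N */
};

/* A driver matches a device when the bit for the device's type is set. */
static bool demo_bus_match(int dev_type, const struct demo_driver *drv)
{
	return (drv->type_mask >> dev_type) & 1UL;
}
```

A driver that handles several sub-types simply ORs the corresponding
bits together in its mask.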

Cc: Greg KH 
Cc: Neil Brown 
Signed-off-by: Dan Williams 
---
 drivers/block/nd/Makefile |1 
 drivers/block/nd/bus.c|  158 
 drivers/block/nd/core.c   |   43 ++-
 drivers/block/nd/dimm.c   |  103 ++
 drivers/block/nd/dimm_devs.c  |  161 ++---
 drivers/block/nd/nd-private.h |   12 ++-
 drivers/block/nd/nd.h |   21 +
 include/linux/nd.h|   39 ++
 include/uapi/linux/ndctl.h|6 ++
 9 files changed, 526 insertions(+), 18 deletions(-)
 create mode 100644 drivers/block/nd/dimm.c
 create mode 100644 include/linux/nd.h

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 6b34dd4d4df8..9f1b69c86fba 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -22,3 +22,4 @@ nd_acpi-y := acpi.o
 nd-y := core.o
 nd-y += bus.o
 nd-y += dimm_devs.o
+nd-y += dimm.o
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index 67a0624c265b..c815dd425a49 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -16,10 +16,12 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include "nd-private.h"
 #include "nfit.h"
 #include "nd.h"
@@ -28,8 +30,57 @@ int nd_dimm_major;
 static int nd_bus_major;
 static struct class *nd_class;
 
-struct bus_type nd_bus_type = {
+static int to_nd_device_type(struct device *dev)
+{
+   if (is_nd_dimm(dev))
+   return ND_DEVICE_DIMM;
+
+   return 0;
+}
+
+static int nd_bus_uevent(struct device *dev, struct kobj_uevent_env *env)
+{
+   return add_uevent_var(env, "MODALIAS=" ND_DEVICE_MODALIAS_FMT,
+   to_nd_device_type(dev));
+}
+
+static int nd_bus_match(struct device *dev, struct device_driver *drv)
+{
+   struct nd_device_driver *nd_drv = to_nd_device_driver(drv);
+
+   return test_bit(to_nd_device_type(dev), &nd_drv->type);
+}
+
+static int nd_bus_probe(struct device *dev)
+{
+   struct nd_device_driver *nd_drv = to_nd_device_driver(dev->driver);
+   struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+   int rc;
+
+   rc = nd_drv->probe(dev);
+   dev_dbg(&nd_bus->dev, "%s.probe(%s) = %d\n", dev->driver->name,
+   dev_name(dev), rc);
+   return rc;
+}
+
+static int nd_bus_remove(struct device *dev)
+{
+   struct nd_device_driver *nd_drv = to_nd_device_driver(dev->driver);
+   struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+   int rc;
+
+   rc = nd_drv->remove(dev);
+   dev_dbg(&nd_bus->dev, "%s.remove(%s) = %d\n", dev->driver->name,
+   dev_name(dev), rc);
+   return rc;
+}
+
+static struct bus_type nd_bus_type = {
.name = "nd",
+   .uevent = nd_bus_uevent,
+   .match = nd_bus_match,
+   .probe = nd_bus_probe,
+   .remove = nd_bus_remove,
 };
 
 static ASYNC_DOMAIN_EXCLUSIVE(nd_async_domain);
@@ -68,6 +119,109 @@ void nd_synchronize(void)
async_synchronize_full_domain(&nd_async_domain);
 }
 
+static void nd_async_device_register(void *d, async_cookie_t cookie)
+{
+   struct device *dev = d;
+
+   if (device_add(dev) != 0) {
+   dev_err(dev, "%s: failed\n", __func__);
+   put_device(dev);
+   }
+   put_device(dev);
+}
+
+static void nd_async_device_unregister(void *d, async_cookie_t cookie)
+{
+   struct device *dev = d;
+
+   device_unregister(dev);
+   put_device(dev);
+}
+
+void nd_device_register(struct device *dev)
+{
+   dev->bus = &nd_bus_type;
+   device_initialize(dev);
+   get_device(dev);
+   async_schedule_domain(nd_async_device_register, dev,
+   &nd_async_domain);
+}
+EXPORT_SYMBOL(nd_device_register);
+
+void nd_device_unregister(struct device *dev, enum nd_async_mode mode)
+{
+   switch (mode) {
+   case ND_ASYNC:
+   get_device(dev);
+   async_schedule_domain(nd_async_device_unregister, dev,
+   &nd_async_domain);
+   break;
+   case ND_SYNC:
+   nd_synchronize();
+

[PATCH 04/21] nd: create an 'nd_bus' from an 'nfit_desc'

2015-04-17 Thread Dan Williams

Basic allocation and parsing of an nfit table.  This is infrastructure
for walking the list of "System Physical Address (SPA) Range Tables",
and "Memory device to SPA" to create "region" devices representing
persistent-memory (PMEM) or a dimm block data window set (BLK).

Note, BLK windows may be interleaved.  The nd_mem object tracks all the
tables needed for carrying out BLK I/O operations.  For the interleaved
case there may be multiple nd_mem instances per dimm-control-region
(DCR).
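
Walking a list of sub-tables that each start with a type/length header
(compare struct nfit_table_header in the diff below) could look roughly
like this user-space sketch.  get_le16() and count_tables() are
hypothetical helpers, and the sketch assumes that the length field
counts the 4-byte header itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 16-bit value from a byte buffer. */
static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Count the sub-tables of 'want_type' in buf[0..len).  Each sub-table is
 * assumed to start with a 4-byte header: le16 type, le16 length, where
 * length includes the header.  Returns -1 on a truncated header or an
 * implausible length, so a corrupt table cannot cause an endless loop.
 */
static int count_tables(const uint8_t *buf, size_t len, uint16_t want_type)
{
	size_t off = 0;
	int count = 0;

	while (off < len) {
		uint16_t type, length;

		if (len - off < 4)
			return -1;
		type = get_le16(buf + off);
		length = get_le16(buf + off + 2);
		if (length < 4 || length > len - off)
			return -1;
		if (type == want_type)
			count++;
		off += length;
	}
	return count;
}
```

Rejecting a length smaller than the header is what guarantees forward
progress through the buffer.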

Signed-off-by: Dan Williams 
---
 drivers/block/nd/core.c   |  438 +
 drivers/block/nd/nd-private.h |   61 ++
 drivers/block/nd/nfit.h   |   25 ++
 3 files changed, 523 insertions(+), 1 deletion(-)
 create mode 100644 drivers/block/nd/nd-private.h

diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index 8df8d315b726..d126799e7ff7 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -10,19 +10,455 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include "nd-private.h"
 #include "nfit.h"
 
-struct nd_bus *nfit_bus_register(struct device *parent,
+static DEFINE_IDA(nd_ida);
+
+static bool warn_checksum;
+module_param(warn_checksum, bool, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(warn_checksum, "Turn checksum errors into warnings");
+
+static void nd_bus_release(struct device *dev)
+{
+   struct nd_bus *nd_bus = container_of(dev, struct nd_bus, dev);
+   struct nd_memdev *nd_memdev, *_memdev;
+   struct nd_spa *nd_spa, *_spa;
+   struct nd_mem *nd_mem, *_mem;
+   struct nd_dcr *nd_dcr, *_dcr;
+   struct nd_bdw *nd_bdw, *_bdw;
+
+   list_for_each_entry_safe(nd_spa, _spa, &nd_bus->spas, list) {
+   list_del_init(&nd_spa->list);
+   kfree(nd_spa);
+   }
+   list_for_each_entry_safe(nd_dcr, _dcr, &nd_bus->dcrs, list) {
+   list_del_init(&nd_dcr->list);
+   kfree(nd_dcr);
+   }
+   list_for_each_entry_safe(nd_bdw, _bdw, &nd_bus->bdws, list) {
+   list_del_init(&nd_bdw->list);
+   kfree(nd_bdw);
+   }
+   list_for_each_entry_safe(nd_memdev, _memdev, &nd_bus->memdevs, list) {
+   list_del_init(&nd_memdev->list);
+   kfree(nd_memdev);
+   }
+   list_for_each_entry_safe(nd_mem, _mem, &nd_bus->dimms, list) {
+   list_del_init(&nd_mem->list);
+   kfree(nd_mem);
+   }
+
+   ida_simple_remove(&nd_ida, nd_bus->id);
+   kfree(nd_bus);
+}
+
+struct nd_bus *to_nd_bus(struct device *dev)
+{
+   struct nd_bus *nd_bus = container_of(dev, struct nd_bus, dev);
+
+   WARN_ON(nd_bus->dev.release != nd_bus_release);
+   return nd_bus;
+}
+
+static void *nd_bus_new(struct device *parent,
struct nfit_bus_descriptor *nfit_desc)
 {
+   struct nd_bus *nd_bus = kzalloc(sizeof(*nd_bus), GFP_KERNEL);
+   int rc;
+
+   if (!nd_bus)
+   return NULL;
+   INIT_LIST_HEAD(&nd_bus->spas);
+   INIT_LIST_HEAD(&nd_bus->dcrs);
+   INIT_LIST_HEAD(&nd_bus->bdws);
+   INIT_LIST_HEAD(&nd_bus->memdevs);
+   INIT_LIST_HEAD(&nd_bus->dimms);
+   nd_bus->id = ida_simple_get(&nd_ida, 0, 0, GFP_KERNEL);
+   if (nd_bus->id < 0) {
+   kfree(nd_bus);
+   return NULL;
+   }
+   nd_bus->nfit_desc = nfit_desc;
+   nd_bus->dev.parent = parent;
+   nd_bus->dev.release = nd_bus_release;
+   dev_set_name(&nd_bus->dev, "ndbus%d", nd_bus->id);
+   rc = device_register(&nd_bus->dev);
+   if (rc) {
+   dev_dbg(&nd_bus->dev, "device registration failed: %d\n", rc);
+   put_device(&nd_bus->dev);
+   return NULL;
+   }
+   return nd_bus;
+}
+
+struct nfit_table_header {
+   __le16 type;
+   __le16 length;
+};
+
+static const char *spa_type_name(u16 type)
+{
+   switch (type) {
+   case NFIT_SPA_VOLATILE: return "volatile";
+   case NFIT_SPA_PM: return "pmem";
+   case NFIT_SPA_DCR: return "dimm-control-region";
+   case NFIT_SPA_BDW: return "block-data-window";
+   default: return "unknown";
+   }
+}
+
+static int nfit_spa_type(struct nfit_spa __iomem *nfit_spa)
+{
+   __u8 uuid[16];
+
+   memcpy_fromio(uuid, &nfit_spa->type_uuid, sizeof(uuid));
+
+   if (memcmp(&nfit_spa_uuid_volatile, uuid, sizeof(uuid)) == 0)
+   return NFIT_SPA_VOLATILE;
+
+   if (memcmp(&nfit_spa_uuid_pm, uuid, sizeof(uuid)) == 0)
+   return NFIT_SPA_PM;
+
+   if (memcmp(&nfit_spa_uuid_dcr, uuid, sizeof(uuid)) == 0)
+   return NFIT_SPA_DCR;
+
+   if (memcmp(&nfit_spa_uuid_bdw, uuid, sizeof(uuid)) == 0)
+   return NFIT_SPA_BDW;
+
+   if (memcmp(&nfit_spa_uuid_vdisk, uuid, sizeof(uuid)) == 0)
+   return NFIT_SPA_VDISK;
+
+   if (memcmp(&nfit_spa_uuid_vcd, uuid, sizeof(uuid)) == 0)
+   return

[PATCH 14/21] nd: namespace indices: read and validate

2015-04-17 Thread Dan Williams

The on-media label format consists of two index blocks followed by an array
of labels.  None of these structures are ever updated in place.  A
sequence number tracks the current active index and the next one to
write, while labels are written to free slots.

+------------+
|            |
|  nsindex0  |
|            |
+------------+
|            |
|  nsindex1  |
|            |
+------------+
|   label0   |
+------------+
|   label1   |
+------------+
|            |
 ....nslot...
|            |
+------------+
|   labelN   |
+------------+

After reading valid labels, store the dpa ranges they claim into
per-dimm resource trees.
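
As a rough illustration of the layout drawn above, the offsets of the
index blocks and label slots can be derived as follows.  This is a
sketch only: struct label_layout and the helper names are hypothetical,
the sizes are parameters rather than the values defined in label.h, and
label_size is assumed to be non-zero.

```c
#include <stddef.h>

/* Hypothetical description of a label area; sizes are caller-provided. */
struct label_layout {
	size_t index_size;   /* bytes occupied by one index block */
	size_t label_size;   /* bytes occupied by one label slot (non-zero) */
	size_t config_size;  /* total bytes in the dimm's label area */
};

/* Byte offset of index block 'i' (0 or 1) from the start of the area. */
static size_t index_offset(const struct label_layout *l, unsigned int i)
{
	return (size_t)i * l->index_size;
}

/* Number of label slots that fit after the two index blocks. */
static size_t nslot(const struct label_layout *l)
{
	if (l->config_size < 2 * l->index_size)
		return 0;
	return (l->config_size - 2 * l->index_size) / l->label_size;
}

/* Byte offset of label slot 'slot'; labels follow both index blocks. */
static size_t label_offset(const struct label_layout *l, size_t slot)
{
	return 2 * l->index_size + slot * l->label_size;
}
```

Since nothing is updated in place, a writer would pick a free slot via
label_offset() and then write the inactive index block, leaving the
active one untouched until the new one is complete.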

Signed-off-by: Dan Williams 
---
 drivers/block/nd/Makefile|1 
 drivers/block/nd/dimm.c  |   25 +++-
 drivers/block/nd/dimm_devs.c |6 +
 drivers/block/nd/label.c |  291 ++
 drivers/block/nd/label.h |  129 +++
 drivers/block/nd/nd.h|   45 ++
 include/uapi/linux/ndctl.h   |1 
 7 files changed, 495 insertions(+), 3 deletions(-)
 create mode 100644 drivers/block/nd/label.c
 create mode 100644 drivers/block/nd/label.h

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index c0194d52e5ad..93856f1c9dbd 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -27,5 +27,6 @@ nd-y += dimm.o
 nd-y += region_devs.o
 nd-y += region.o
 nd-y += namespace_devs.o
+nd-y += label.o
 
 nd_pmem-y := pmem.o
diff --git a/drivers/block/nd/dimm.c b/drivers/block/nd/dimm.c
index 7e043c0c1bf5..ccc96d8fe2e7 100644
--- a/drivers/block/nd/dimm.c
+++ b/drivers/block/nd/dimm.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include "label.h"
 #include "nd.h"
 
 static bool force_enable_dimms;
@@ -53,6 +54,12 @@ static int nd_dimm_probe(struct device *dev)
return -ENOMEM;
 
dev_set_drvdata(dev, ndd);
+   ndd->dpa.name = dev_name(dev);
+   ndd->ns_current = -1;
+   ndd->ns_next = -1;
+   ndd->dpa.start = 0;
+   ndd->dpa.end = -1;
+   ndd->dev = dev;
 
rc = nd_dimm_init_nsarea(ndd);
if (rc)
@@ -64,18 +71,34 @@ static int nd_dimm_probe(struct device *dev)
 
dev_dbg(dev, "config data size: %d\n", ndd->nsarea.config_size);
 
+   nd_bus_lock(dev);
+   ndd->ns_current = nd_label_validate(ndd);
+   ndd->ns_next = nd_label_next_nsindex(ndd->ns_current);
+   nd_label_copy(ndd, to_next_namespace_index(ndd),
+   to_current_namespace_index(ndd));
+   rc = nd_label_reserve_dpa(ndd);
+   nd_bus_unlock(dev);
+
+   if (rc)
+   goto err;
+
return 0;
 
  err:
free_data(ndd);
return rc;
-
 }
 
 static int nd_dimm_remove(struct device *dev)
 {
struct nd_dimm_drvdata *ndd = dev_get_drvdata(dev);
+   struct resource *res, *_r;
 
+   nd_bus_lock(dev);
+   dev_set_drvdata(dev, NULL);
+   for_each_dpa_resource_safe(ndd, res, _r)
+   __release_region(&ndd->dpa, res->start, resource_size(res));
+   nd_bus_unlock(dev);
free_data(ndd);
 
return 0;
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index 6192d9c82b9b..652dee210fe8 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -94,8 +94,12 @@ int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd)
if (ndd->data)
return 0;
 
-   if (ndd->nsarea.status || ndd->nsarea.max_xfer == 0)
+   if (ndd->nsarea.status || ndd->nsarea.max_xfer == 0
+   || ndd->nsarea.config_size < ND_LABEL_MIN_SIZE) {
+   dev_dbg(ndd->dev, "failed to init config data area: (%d:%d)\n",
+   ndd->nsarea.max_xfer, ndd->nsarea.config_size);
return -ENXIO;
+   }
 
ndd->data = kmalloc(ndd->nsarea.config_size, GFP_KERNEL);
if (!ndd->data)
diff --git a/drivers/block/nd/label.c b/drivers/block/nd/label.c
new file mode 100644
index ..e791ea8bbdde
--- /dev/null
+++ b/drivers/block/nd/label.c
@@ -0,0 +1,291 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include 
+#include 
+#include 
+#include 
+#include "nd-private.h"
+#include "label.h"
+#include "nd.h"
+
+#include 
+
+static u32 best_seq(u32 a, u32 b)
+{
+   a &= NSINDEX_SEQ_MASK;
+   b &= NSINDEX_SEQ_MASK;
+
+   if (a == 0 || a == b)
+   return b;
+   else if (b == 0)
+

[PATCH 05/21] nfit-test: manufactured NFITs for interface development

2015-04-17 Thread Dan Williams

Manually create and register NFITs to describe 2 topologies.  Topology1
is an advanced plausible configuration for BLK/PMEM aliased NVDIMMs.
Topology2 is an example configuration for current platforms that only
ship with a persistent address range.

 Kernel provider "nfit_test.0" produces an NFIT with the following attributes:

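                              (a)               (b)           DIMM   BLK-REGION
           +-------------------+--------+--------+--------+
 +------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
 | imc0 +--+- - - region0- - - +--------+        +--------+
 +--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
    |      +-------------------+--------v        v--------+
 +--+---+                               |                 |
 | cpu0 |                                     region1
 +--+---+                               |                 |
    |      +----------------------------^        ^--------+
 +--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
 | imc1 +--+----------------------------|        +--------+
 +------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
           +----------------------------+--------+--------+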

 *) In this layout we have four dimms and two memory controllers in one
socket.  Each unique interface ("block" or "pmem") to DPA space
is identified by a region device with a dynamically assigned id.

 *) The first portion of dimm0 and dimm1 are interleaved as REGION0.
A single "pmem" namespace is created in the REGION0-"spa"-range
that spans dimm0 and dimm1 with a user-specified name of "pm0.0".
Some of that interleaved "spa" range is reclaimed as "bdw"
accessed space starting at offset (a) into each dimm.  In that
reclaimed space we create two "bdw" "namespaces" from REGION2 and
REGION3 where "blk2.0" and "blk3.0" are just human readable names
that could be set to any user-desired name in the label.

 *) In the last portion of dimm0 and dimm1 we have an interleaved
"spa" range, REGION1, that spans those two dimms as well as dimm2
and dimm3.  Some of REGION1 is allocated to a "pmem" namespace named
"pm1.0"; the rest is reclaimed in 4 "bdw" namespaces (one for each
dimm in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
"blk5.0".

 *) The portion of dimm2 and dimm3 that do not participate in the
REGION1 interleaved "spa" range (i.e. the DPA address below
offset (b)) are also included in the "blk4.0" and "blk5.0"
namespaces.  Note, that this example shows that "bdw" namespaces
don't need to be contiguous in DPA-space.

 Kernel provider "nfit_test.1" produces an NFIT with the following attributes:

 region2
 +---------------------+
 |---------------------|
 ||       pm2.0       ||
 |---------------------|
 +---------------------+

 *) Describes a simple system-physical-address range with no backing
dimm or interleave description.

Signed-off-by: Dan Williams 
---
 drivers/block/nd/Kconfig  |   20 +
 drivers/block/nd/Makefile |   16 +
 drivers/block/nd/nfit.h   |9 
 drivers/block/nd/test/Makefile|5 
 drivers/block/nd/test/iomap.c |  148 ++
 drivers/block/nd/test/nfit.c  |  930 +
 drivers/block/nd/test/nfit_test.h |   25 +
 7 files changed, 1153 insertions(+)
 create mode 100644 drivers/block/nd/test/Makefile
 create mode 100644 drivers/block/nd/test/iomap.c
 create mode 100644 drivers/block/nd/test/nfit.c
 create mode 100644 drivers/block/nd/test/nfit_test.h

diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 5fa74f124b3e..0106b3807202 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -41,4 +41,24 @@ config NFIT_ACPI
  register the platform-global NFIT blob with the core.  Also
  enables the core to craft ACPI._DSM messages for platform/dimm
  configuration.
+
+config NFIT_TEST
+   tristate "NFIT TEST: Manufactured NFIT for interface testing"
+   depends on DMA_CMA
+   depends on ND_CORE=m
+   depends on m
+   help
+ For development purposes register a manufactured
+ NFIT table to verify the resulting device model topology.
+ Note, this module arranges for ioremap_cache() to be
+ overridden locally to allow simulation of system-memory as an
+ io-memory-resource.
+
+ Note, this test expects to be able to find at least
+ 256MB of CMA space (CONFIG_CMA_SIZE_MBYTES) or it will fail to
+ load.  Kconfig does not allow for numerical value
+ dependencies, so we can only warn at runtime.
+
+ Say N unless you are doing development of the 'nd' subsystem.
+
 endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 22701ab7dcae..c6bec0c185c5 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -1,3 +1,19 @@

[PATCH 20/21] nd_btt: atomic sector updates

2015-04-17 Thread Dan Williams

From: Vishal Verma 

BTT stands for Block Translation Table, and is a way to provide power
fail sector atomicity semantics for block devices that have the ability
to perform byte granularity IO. It relies on the ->rw_bytes() capability
of provided nd namespace devices.

The BTT works as a stacked block device, and reserves a chunk of space
from the backing device for its accounting metadata.  BLK namespaces may
mandate use of a BTT and expect the bus to initialize a BTT if not
already present.  Otherwise if a BTT is desired for other namespaces (or
partitions of a namespace) a BTT may be manually configured.
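
The core idea of an indirection table that makes sector writes atomic
can be shown with a toy in-memory model.  This is a sketch under heavy
simplifying assumptions, not the driver's implementation: struct
toy_btt and its helpers are hypothetical, there is exactly one spare
block, and persistence, recovery metadata and concurrency are all left
out.

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 8   /* tiny sectors keep the demo readable */
#define NLBA        4   /* logical blocks exposed to the user */
#define NBLOCKS     5   /* physical blocks: NLBA plus one spare */

struct toy_btt {
	uint8_t  data[NBLOCKS][SECTOR_SIZE];
	uint32_t map[NLBA];   /* logical block -> physical block */
	uint32_t free_block;  /* the one physical block the map does not use */
};

static void toy_btt_init(struct toy_btt *b)
{
	memset(b, 0, sizeof(*b));
	for (uint32_t i = 0; i < NLBA; i++)
		b->map[i] = i;
	b->free_block = NLBA;
}

/*
 * Write a whole sector out of place, then publish it with one map update.
 * Until the map entry changes, readers still see the complete old sector.
 */
static void toy_btt_write(struct toy_btt *b, uint32_t lba, const uint8_t *buf)
{
	uint32_t old = b->map[lba];
	uint32_t fresh = b->free_block;

	memcpy(b->data[fresh], buf, SECTOR_SIZE);  /* 1: fill the spare block */
	b->map[lba] = fresh;                       /* 2: single-word switch */
	b->free_block = old;                       /* 3: old block becomes spare */
}

static const uint8_t *toy_btt_read(const struct toy_btt *b, uint32_t lba)
{
	return b->data[b->map[lba]];
}
```

The sector is never modified where readers can see it, so an
interruption between steps 1 and 2 leaves the old contents intact
rather than a mix of old and new data.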

Cc: Andy Lutomirski 
Cc: Boaz Harrosh 
Cc: H. Peter Anvin 
Cc: Jens Axboe 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Cc: Neil Brown 
Cc: Jeff Moyer 
Cc: Dave Chinner 
Cc: Greg KH 
[jmoyer: fix nmi watchdog timeout in btt_map_init]
[jmoyer: move btt initialization to module load path]
[jmoyer: fix memory leak in the btt initialization path]
[jmoyer: Don't overwrite corrupted arenas]
Signed-off-by: Vishal Verma 
Signed-off-by: Dan Williams 
---
 Documentation/blockdev/btt.txt |  273 
 drivers/block/nd/Kconfig   |   18 -
 drivers/block/nd/Makefile  |2 
 drivers/block/nd/btt.c | 1423 
 drivers/block/nd/btt.h |  140 
 drivers/block/nd/btt_devs.c|3 
 drivers/block/nd/core.c|1 
 drivers/block/nd/nd.h  |9 
 drivers/block/nd/region_devs.c |   77 ++
 9 files changed, 1944 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/blockdev/btt.txt
 create mode 100644 drivers/block/nd/btt.c

diff --git a/Documentation/blockdev/btt.txt b/Documentation/blockdev/btt.txt
new file mode 100644
index ..95134d5ec4a0
--- /dev/null
+++ b/Documentation/blockdev/btt.txt
@@ -0,0 +1,273 @@
+BTT - Block Translation Table
+=============================
+
+
+1. Introduction
+---------------
+
+Persistent memory based storage is able to perform IO at byte (or more
+accurately, cache line) granularity. However, we often want to expose such
+storage as traditional block devices. The block drivers for persistent memory
+will do exactly this. However, they do not provide any atomicity guarantees.
+Traditional SSDs typically provide protection against torn sectors in hardware,
+using stored energy in capacitors to complete in-flight block writes, or perhaps
+in firmware. We don't have this luxury with persistent memory - if a write is in
+progress, and we experience a power failure, the block will contain a mix of old
+and new data. Applications may not be prepared to handle such a scenario.
+
+The Block Translation Table (BTT) provides atomic sector update semantics for
+persistent memory devices, so that applications that rely on sector writes not
+being torn can continue to do so. The BTT manifests itself as a stacked block
+device, and reserves a portion of the underlying storage for its metadata. At
+the heart of it, is an indirection table that re-maps all the blocks on the
+volume. It can be thought of as an extremely simple file system that only
+provides atomic sector updates.
+
+
+2. Static Layout
+----------------
+
+The underlying storage on which a BTT can be laid out is not limited in any way.
+The BTT, however, splits the available space into chunks of up to 512 GiB,
+called "Arenas".
+
+Each arena follows the same layout for its metadata, and all references in an
+arena are internal to it (with the exception of one field that points to the
+next arena). The following depicts the "On-disk" metadata layout:
+
+
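+    Backing Store     +------->  Arena
+  +---------------+   |   +------------------+
+  |               |   |   | Arena info block |
+  |    Arena 0    +---+   |       4K         |
+  |     512G      |       +------------------+
+  |               |       |                  |
+  +---------------+       |                  |
+  |               |       |                  |
+  |    Arena 1    |       |   Data Blocks    |
+  |     512G      |       |                  |
+  |               |       |                  |
+  +---------------+       |                  |
+  |       .       |       |                  |
+  |       .       |       |                  |
+  |       .       |       |                  |
+  |               |       |                  |
+  |               |       |                  |
+  +---------------+       +------------------+
+                          |                  |
+                          |     BTT Map      |
+                          |                  |
+                          |                  |
+                          +------------------+
+                          |                  |
+                          |     BTT Flog     |
+                          |                  |
+                          +------------------+
+                          | Info block copy  |
+                          |       4K         |
+                          +------------------+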
+
+
+3. Theory of Operation

[PATCH 13/21] nd: add interleave-set state-tracking infrastructure

2015-04-17 Thread Dan Williams

On platforms that have firmware support for reading/writing per-dimm
label space, a portion of the dimm may be accessible via an interleave
set PMEM mapping in addition to the dimm's BLK (block-data-window
aperture(s)) interface.  A label, stored in a "configuration data
region" on the dimm, disambiguates which dimm addresses are accessed
through which exclusive interface.

Add infrastructure that allows the kernel to block modifications to a
label in the set while any member dimm is active.  Note that this is
meant only for enforcing "no modifications of active labels" via the
coarse ioctl command.  Adding/deleting namespaces from an active
interleave set will only be possible via sysfs.

Another aspect of tracking interleave sets is tracking their integrity
when DIMMs in a set are physically re-ordered.  For this purpose we
generate an "interleave-set cookie" that can be recorded in a label and
validated against the current configuration.
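
One way to picture an "interleave-set cookie" is as an order-sensitive
digest over the identifiers of the member dimms, so that re-ordering
the dimms yields a different value.  The sketch below is an assumption
made for illustration: demo_set_cookie() is a hypothetical helper, and
FNV-1a is used only as a compact stand-in for whatever computation the
patch actually performs.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Order-sensitive 64-bit digest over the dimm identifiers of a set.
 * FNV-1a is a stand-in chosen for brevity, not the patch's algorithm.
 */
static uint64_t demo_set_cookie(const uint32_t *dimm_ids, size_t n)
{
	uint64_t h = 0xcbf29ce484222325ULL;  /* FNV-1a 64-bit offset basis */

	for (size_t i = 0; i < n; i++) {
		/* Feed each identifier byte by byte, low byte first. */
		for (int shift = 0; shift < 32; shift += 8) {
			h ^= (dimm_ids[i] >> shift) & 0xff;
			h *= 0x100000001b3ULL;   /* FNV-1a 64-bit prime */
		}
	}
	return h;
}
```

Validation then amounts to recomputing the value for the current dimm
order and comparing it with the cookie recorded in the label.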

Signed-off-by: Dan Williams 
---
 drivers/block/nd/bus.c |   41 +
 drivers/block/nd/core.c|   51 
 drivers/block/nd/dimm_devs.c   |   18 
 drivers/block/nd/nd-private.h  |   17 
 drivers/block/nd/nd.h  |4 +
 drivers/block/nd/region_devs.c |  176 
 6 files changed, 305 insertions(+), 2 deletions(-)

diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index c98fe05a4c9b..944d7d7845fe 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -79,7 +79,10 @@ static int nd_bus_probe(struct device *dev)
if (!try_module_get(provider))
return -ENXIO;
 
+   nd_region_probe_start(nd_bus, dev);
rc = nd_drv->probe(dev);
+   nd_region_probe_end(nd_bus, dev, rc);
+
dev_dbg(&nd_bus->dev, "%s.probe(%s) = %d\n", dev->driver->name,
dev_name(dev), rc);
if (rc != 0)
@@ -95,6 +98,8 @@ static int nd_bus_remove(struct device *dev)
int rc;
 
rc = nd_drv->remove(dev);
+   nd_region_notify_remove(nd_bus, dev, rc);
+
dev_dbg(&nd_bus->dev, "%s.remove(%s) = %d\n", dev->driver->name,
dev_name(dev), rc);
module_put(provider);
@@ -269,6 +274,33 @@ void nd_bus_destroy_ndctl(struct nd_bus *nd_bus)
device_destroy(nd_class, MKDEV(nd_bus_major, nd_bus->id));
 }
 
+static void wait_nd_bus_probe_idle(struct nd_bus *nd_bus)
+{
+   do {
+   if (nd_bus->probe_active == 0)
+   break;
+   nd_bus_unlock(&nd_bus->dev);
+   wait_event(nd_bus->probe_wait, nd_bus->probe_active == 0);
+   nd_bus_lock(&nd_bus->dev);
+   } while (true);
+}
+
+/* set_config requires an idle interleave set */
+static int nd_cmd_clear_to_send(struct nd_dimm *nd_dimm, unsigned int cmd)
+{
+   struct nd_bus *nd_bus;
+
+   if (!nd_dimm || cmd != NFIT_CMD_SET_CONFIG_DATA)
+   return 0;
+
+   nd_bus = walk_to_nd_bus(&nd_dimm->dev);
+   wait_nd_bus_probe_idle(nd_bus);
+
+   if (atomic_read(&nd_dimm->busy))
+   return -EBUSY;
+   return 0;
+}
+
 static int __nd_ioctl(struct nd_bus *nd_bus, struct nd_dimm *nd_dimm,
int read_only, unsigned int cmd, unsigned long arg)
 {
@@ -399,11 +431,18 @@ static int __nd_ioctl(struct nd_bus *nd_bus, struct 
nd_dimm *nd_dimm,
goto out;
}
 
+   nd_bus_lock(_bus->dev);
+   rc = nd_cmd_clear_to_send(nd_dimm, _IOC_NR(cmd));
+   if (rc)
+   goto out_unlock;
+
rc = nfit_desc->nfit_ctl(nfit_desc, nd_dimm, _IOC_NR(cmd), buf, buf_len);
if (rc < 0)
-   goto out;
+   goto out_unlock;
if (copy_to_user(p, buf, buf_len))
rc = -EFAULT;
+ out_unlock:
+   nd_bus_unlock(&nd_bus->dev);
  out:
if (is_vmalloc_addr(buf))
vfree(buf);
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index c795e8057061..976cd5e3ebaf 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -31,6 +31,36 @@ static bool warn_checksum;
 module_param(warn_checksum, bool, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(warn_checksum, "Turn checksum errors into warnings");
 
+void nd_bus_lock(struct device *dev)
+{
+   struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
+   if (!nd_bus)
+   return;
+   mutex_lock(&nd_bus->reconfig_mutex);
+}
+EXPORT_SYMBOL(nd_bus_lock);
+
+void nd_bus_unlock(struct device *dev)
+{
+   struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
+   if (!nd_bus)
+   return;
+   mutex_unlock(&nd_bus->reconfig_mutex);
+}
+EXPORT_SYMBOL(nd_bus_unlock);
+
+bool is_nd_bus_locked(struct device *dev)
+{
+   struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
+   if (!nd_bus)
+   return false;
+   return mutex_is_locked(&nd_bus->reconfig_mutex);
+}
+EXPORT_SYMBOL(is_nd_bus_locked);
+
 /**
  * nd_dimm_by_handle - lookup an nd_dimm by its corresponding nfit_handle
  *

[PATCH 11/21] nd_region: support for legacy nvdimms

2015-04-17 Thread Dan Williams

The NFIT region driver is an intermediary driver that translates NFIT
defined "region"s into "namespace" devices that are consumed by
persistent memory block drivers.  A "namespace" is a sub-division of a
region.

Support for NVDIMM labels is reserved for a later patch.  For now,
publish 'nd_namespace_io' devices which are simply memory ranges with no
regard for dimm boundaries, interleave, or aliasing.  This also adds a
"nstype" attribute to the parent region so that userspace can know ahead
of time the type of namespaces a given region will produce.

Signed-off-by: Dan Williams 
---
 drivers/block/nd/Makefile |2 +
 drivers/block/nd/bus.c|   26 +
 drivers/block/nd/core.c   |   18 --
 drivers/block/nd/dimm.c   |2 -
 drivers/block/nd/namespace_devs.c |  111 +
 drivers/block/nd/nd-private.h |8 ++-
 drivers/block/nd/nd.h |7 ++
 drivers/block/nd/nfit.h   |7 ++
 drivers/block/nd/region.c |   88 +
 drivers/block/nd/region_devs.c|   65 +-
 include/linux/nd.h|   10 +++
 include/uapi/linux/ndctl.h|   10 +++
 12 files changed, 343 insertions(+), 11 deletions(-)
 create mode 100644 drivers/block/nd/namespace_devs.c
 create mode 100644 drivers/block/nd/region.c

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 6698acbe7b44..769ddc34f974 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -24,3 +24,5 @@ nd-y += bus.o
 nd-y += dimm_devs.o
 nd-y += dimm.o
 nd-y += region_devs.o
+nd-y += region.o
+nd-y += namespace_devs.o
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index c815dd425a49..c98fe05a4c9b 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -13,6 +13,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -34,6 +35,12 @@ static int to_nd_device_type(struct device *dev)
 {
if (is_nd_dimm(dev))
return ND_DEVICE_DIMM;
+   else if (is_nd_pmem(dev))
+   return ND_DEVICE_REGION_PMEM;
+   else if (is_nd_blk(dev))
+   return ND_DEVICE_REGION_BLOCK;
+   else if (is_nd_pmem(dev->parent) || is_nd_blk(dev->parent))
+   return nd_region_to_namespace_type(to_nd_region(dev->parent));
 
return 0;
 }
@@ -51,27 +58,46 @@ static int nd_bus_match(struct device *dev, struct device_driver *drv)
return test_bit(to_nd_device_type(dev), &nd_drv->type);
 }
 
+static struct module *to_bus_provider(struct device *dev)
+{
+   /* pin bus providers while regions are enabled */
+   if (is_nd_pmem(dev) || is_nd_blk(dev)) {
+   struct nd_bus *nd_bus = walk_to_nd_bus(dev);
+
+   return nd_bus->module;
+   }
+   return NULL;
+}
+
 static int nd_bus_probe(struct device *dev)
 {
struct nd_device_driver *nd_drv = to_nd_device_driver(dev->driver);
+   struct module *provider = to_bus_provider(dev);
struct nd_bus *nd_bus = walk_to_nd_bus(dev);
int rc;
 
+   if (!try_module_get(provider))
+   return -ENXIO;
+
rc = nd_drv->probe(dev);
dev_dbg(&nd_bus->dev, "%s.probe(%s) = %d\n", dev->driver->name,
dev_name(dev), rc);
+   if (rc != 0)
+   module_put(provider);
return rc;
 }
 
 static int nd_bus_remove(struct device *dev)
 {
struct nd_device_driver *nd_drv = to_nd_device_driver(dev->driver);
+   struct module *provider = to_bus_provider(dev);
struct nd_bus *nd_bus = walk_to_nd_bus(dev);
int rc;
 
rc = nd_drv->remove(dev);
dev_dbg(&nd_bus->dev, "%s.remove(%s) = %d\n", dev->driver->name,
dev_name(dev), rc);
+   module_put(provider);
return rc;
 }
 
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index 32ecd6f05c90..c795e8057061 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -192,7 +192,7 @@ static const struct attribute_group *nd_bus_attribute_groups[] = {
 };
 
 static void *nd_bus_new(struct device *parent,
-   struct nfit_bus_descriptor *nfit_desc)
+   struct nfit_bus_descriptor *nfit_desc, struct module *module)
 {
struct nd_bus *nd_bus = kzalloc(sizeof(*nd_bus), GFP_KERNEL);
int rc;
@@ -212,6 +212,7 @@ static void *nd_bus_new(struct device *parent,
return NULL;
}
nd_bus->nfit_desc = nfit_desc;
+   nd_bus->module = module;
nd_bus->dev.parent = parent;
nd_bus->dev.release = nd_bus_release;
nd_bus->dev.groups = nd_bus_attribute_groups;
@@ -595,15 +596,16 @@ static struct nd_bus *nd_bus_probe(struct nd_bus *nd_bus)
 
 }
 
-struct nd_bus *nfit_bus_register(struct device *parent,
-   struct nfit_bus_descriptor *nfit_desc)
+struct nd_bus

[PATCH 10/21] nd: regions (block-data-window, persistent memory, volatile memory)

2015-04-17 Thread Dan Williams

A "region" device represents the maximum capacity of a
block-data-window, or an interleaved spa range (direct-access persistent
memory or volatile memory), without regard for aliasing.  Aliasing is
resolved by the label data on the dimm to designate which exclusive
interface will access the aliased data.  Enabling for the
label-designated sub-device is in a subsequent patch.

The "region" types are defined in the NFIT System Physical Address (spa)
table.  In the case of persistent memory the spa-range describes the
direct memory address range of the storage (NFIT_SPA_PM).  A block
"region" (NFIT_SPA_DCR) points to a DIMM Control Region (DCR) or
an interleaved group of DCRs.  Those DCRs are (optionally) referenced by
a block-data-window (BDW) set to describe the access mechanism and
capacity of the BLK-accessible storage.  If the related BDW is not
published then the dimm is only available for control/configuration
commands.  Finally, a volatile "region" (NFIT_SPA_VOLATILE) indicates
the portions of NVDIMMs that have been re-assigned as normal volatile
system memory by platform firmware.

The name format of "region" devices is "regionN" where, like dimms, N is
a global ida index assigned at discovery time.  This id is not reliable
across reboots nor in the presence of hotplug.  Look to attributes of
the region or static id-data of the sub-namespace to generate a
persistent name.

"region"s have two generic attributes, "size" and "mappingN", where:
- size: the block-data-window accessible capacity or the span of the
  spa-range in the case of pm.

- mappingN: a tuple describing a dimm's contribution to the region's
  capacity in the format (,,).  For a
  pm-region there will be at least one mapping per dimm in the interleave
  set.  For a block-region there is only "mapping0" listing the starting dimm
  offset of the block-data-window and the available capacity of that
  window (matches "size" above).

The max number of mappings per "region" is hard coded per the constraints of
sysfs attribute groups.  That said the number of mappings per region should
never exceed the maximum number of possible dimms in the system.  If the
current number turns out to not be enough then the "mappings" attribute
clarifies how many there are supposed to be. "32 should be enough for
anybody...".

Cc: Greg KH 
Cc: Neil Brown 
Signed-off-by: Dan Williams 
---
 drivers/block/nd/Makefile  |1 
 drivers/block/nd/core.c|8 +
 drivers/block/nd/nd-private.h  |5 
 drivers/block/nd/nd.h  |   17 ++
 drivers/block/nd/region_devs.c |  426 
 5 files changed, 455 insertions(+), 2 deletions(-)
 create mode 100644 drivers/block/nd/region_devs.c

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 9f1b69c86fba..6698acbe7b44 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -23,3 +23,4 @@ nd-y := core.o
 nd-y += bus.o
 nd-y += dimm_devs.o
 nd-y += dimm.o
+nd-y += region_devs.o
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index 426f96b02594..32ecd6f05c90 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -230,7 +230,7 @@ struct nfit_table_header {
__le16 length;
 };
 
-static const char *spa_type_name(u16 type)
+const char *spa_type_name(u16 type)
 {
switch (type) {
case NFIT_SPA_VOLATILE: return "volatile";
@@ -241,7 +241,7 @@ static const char *spa_type_name(u16 type)
}
 }
 
-static int nfit_spa_type(struct nfit_spa __iomem *nfit_spa)
+int nfit_spa_type(struct nfit_spa __iomem *nfit_spa)
 {
__u8 uuid[16];
 
@@ -577,6 +577,10 @@ static struct nd_bus *nd_bus_probe(struct nd_bus *nd_bus)
if (rc)
goto err_child;
 
+   rc = nd_bus_register_regions(nd_bus);
+   if (rc)
+   goto err_child;
+
mutex_lock(&nd_bus_list_mutex);
list_add_tail(&nd_bus->list, &nd_bus_list);
mutex_unlock(&nd_bus_list_mutex);
diff --git a/drivers/block/nd/nd-private.h b/drivers/block/nd/nd-private.h
index 72197992e386..d254ff688ad6 100644
--- a/drivers/block/nd/nd-private.h
+++ b/drivers/block/nd/nd-private.h
@@ -85,6 +85,8 @@ struct nd_mem {
struct list_head list;
 };
 
+const char *spa_type_name(u16 type);
+int nfit_spa_type(struct nfit_spa __iomem *nfit_spa);
 struct nd_dimm *nd_dimm_by_handle(struct nd_bus *nd_bus, u32 nfit_handle);
 bool is_nd_dimm(struct device *dev);
 struct nd_bus *to_nd_bus(struct device *dev);
@@ -99,4 +101,7 @@ void __exit nd_dimm_exit(void);
 int nd_bus_create_ndctl(struct nd_bus *nd_bus);
 void nd_bus_destroy_ndctl(struct nd_bus *nd_bus);
 int nd_bus_register_dimms(struct nd_bus *nd_bus);
+int nd_bus_register_regions(struct nd_bus *nd_bus);
+int nd_match_dimm(struct device *dev, void *data);
+bool is_nd_dimm(struct device *dev);
 #endif /* __ND_PRIVATE_H__ */
diff --git a/drivers/block/nd/nd.h b/drivers/block/nd/nd.h
index f277440c72b4..13eba9bd74c7 100644
--- a/drivers/block/nd/nd.h
+++

[PATCH 08/21] nd: ndctl.h, the nd ioctl abi

2015-04-17 Thread Dan Williams

Most configuration of the nd-subsystem is done via nd-sysfs.  However,
the NFIT specification defines a small set of messages that can be
passed to the subsystem via platform-firmware-defined methods.  The
command set (as of the current version of the NFIT-DSM spec) is:

NFIT_CMD_SMART: media health and diagnostics
NFIT_CMD_GET_CONFIG_SIZE: size of the label space
NFIT_CMD_GET_CONFIG_DATA: read label
NFIT_CMD_SET_CONFIG_DATA: write label
NFIT_CMD_VENDOR: vendor-specific command passthrough
NFIT_CMD_ARS_CAP: report address-range-scrubbing capabilities
NFIT_CMD_ARS_START: initiate scrubbing
NFIT_CMD_ARS_QUERY: report on scrubbing state
NFIT_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events

Most of the commands target a specific dimm.  However, the
address-range-scrubbing commands target the entire NFIT-bus / platform.
The 'commands' attribute of an nd-bus or an nd-dimm enumerates the
supported commands for that object.
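
To illustrate how a 'commands' attribute might enumerate the supported
commands of an object, here is a minimal userspace sketch that keeps a
per-object bitmask and prints the name of every set bit.  The demo_*
names, the command numbering, and the output format are assumptions made
for this example; they are not taken from ndctl.h or from the patch.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical command numbers; the real values are defined by the ABI header. */
enum demo_cmd {
	DEMO_CMD_SMART = 1,
	DEMO_CMD_GET_CONFIG_SIZE,
	DEMO_CMD_GET_CONFIG_DATA,
	DEMO_CMD_SET_CONFIG_DATA,
	DEMO_CMD_ARS_CAP,
	DEMO_CMD_MAX,
};

static const char *demo_cmd_name(int cmd)
{
	switch (cmd) {
	case DEMO_CMD_SMART:           return "smart";
	case DEMO_CMD_GET_CONFIG_SIZE: return "get_config_size";
	case DEMO_CMD_GET_CONFIG_DATA: return "get_config_data";
	case DEMO_CMD_SET_CONFIG_DATA: return "set_config_data";
	case DEMO_CMD_ARS_CAP:         return "ars_cap";
	default:                       return "unknown";
	}
}

/*
 * Write the space-separated names of all commands set in @mask to @buf.
 * Returns the string length, or -1 if @buf is too small.
 */
static int demo_commands_show(unsigned long mask, char *buf, size_t len)
{
	size_t used = 0;
	int cmd;

	if (len == 0)
		return -1;
	buf[0] = '\0';

	for (cmd = 1; cmd < DEMO_CMD_MAX; cmd++) {
		int n;

		if (!(mask & (1UL << cmd)))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
				used ? " " : "", demo_cmd_name(cmd));
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}
	return (int)used;
}
```

A dimm object and a bus object would each carry their own mask, so the
same helper can serve both kinds of attribute.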

Cc: 
Cc: Robert Moore 
Cc: Rafael J. Wysocki 
Reported-by: Nicholas Moulin 
Signed-off-by: Dan Williams 
---
 drivers/block/nd/Kconfig  |   11 +
 drivers/block/nd/acpi.c   |  333 +
 drivers/block/nd/bus.c|  230 
 drivers/block/nd/core.c   |   17 ++
 drivers/block/nd/dimm_devs.c  |   69 
 drivers/block/nd/nd-private.h |   11 +
 drivers/block/nd/nd.h |   21 +++
 drivers/block/nd/test/nfit.c  |   89 +++
 include/uapi/linux/Kbuild |1 
 include/uapi/linux/ndctl.h|  178 ++
 10 files changed, 950 insertions(+), 10 deletions(-)
 create mode 100644 drivers/block/nd/nd.h
 create mode 100644 include/uapi/linux/ndctl.h

diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 0106b3807202..6c15d10bf4e0 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -42,6 +42,17 @@ config NFIT_ACPI
  enables the core to craft ACPI._DSM messages for platform/dimm
  configuration.
 
+config NFIT_ACPI_DEBUG
+   bool "NFIT ACPI: Turn on extra debugging"
+   depends on NFIT_ACPI
+   depends on DYNAMIC_DEBUG
+   default n
+   help
+ Enabling this option causes the nd_acpi driver to dump the
+ input and output buffers of _DSM operations on the ACPI0012
+ device, which can be very verbose.  Leave it disabled unless
+ you are debugging a hardware / firmware issue.
+
 config NFIT_TEST
tristate "NFIT TEST: Manufactured NFIT for interface testing"
depends on DMA_CMA
diff --git a/drivers/block/nd/acpi.c b/drivers/block/nd/acpi.c
index 48db723d7a90..073ff28fdbfe 100644
--- a/drivers/block/nd/acpi.c
+++ b/drivers/block/nd/acpi.c
@@ -13,8 +13,10 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "nfit.h"
+#include "nd.h"
 
 enum {
NFIT_ACPI_NOTIFY_TABLE = 0x80,
@@ -26,20 +28,330 @@ struct acpi_nfit {
struct nd_bus *nd_bus;
 };
 
+static struct acpi_nfit *to_acpi_nfit(struct nfit_bus_descriptor *nfit_desc)
+{
+   return container_of(nfit_desc, struct acpi_nfit, nfit_desc);
+}
+
+#define NFIT_ACPI_MAX_ELEM 4
+struct nfit_cmd_desc {
+   int in_num;
+   int out_num;
+   u32 in_sizes[NFIT_ACPI_MAX_ELEM];
+   int out_sizes[NFIT_ACPI_MAX_ELEM];
+};
+
+static const struct nfit_cmd_desc nfit_dimm_descs[] = {
+   [NFIT_CMD_IMPLEMENTED] = { },
+   [NFIT_CMD_SMART] = {
+   .out_num = 2,
+   .out_sizes = { 4, 8, },
+   },
+   [NFIT_CMD_SMART_THRESHOLD] = {
+   .out_num = 2,
+   .out_sizes = { 4, 8, },
+   },
+   [NFIT_CMD_DIMM_FLAGS] = {
+   .out_num = 2,
+   .out_sizes = { 4, 4 },
+   },
+   [NFIT_CMD_GET_CONFIG_SIZE] = {
+   .out_num = 3,
+   .out_sizes = { 4, 4, 4, },
+   },
+   [NFIT_CMD_GET_CONFIG_DATA] = {
+   .in_num = 2,
+   .in_sizes = { 4, 4, },
+   .out_num = 2,
+   .out_sizes = { 4, UINT_MAX, },
+   },
+   [NFIT_CMD_SET_CONFIG_DATA] = {
+   .in_num = 3,
+   .in_sizes = { 4, 4, UINT_MAX, },
+   .out_num = 1,
+   .out_sizes = { 4, },
+   },
+   [NFIT_CMD_VENDOR] = {
+   .in_num = 3,
+   .in_sizes = { 4, 4, UINT_MAX, },
+   .out_num = 3,
+   .out_sizes = { 4, 4, UINT_MAX, },
+   },
+};
+
+static const struct nfit_cmd_desc nfit_acpi_descs[] = {
+   [NFIT_CMD_IMPLEMENTED] = { },
+   [NFIT_CMD_ARS_CAP] = {
+   .in_num = 2,
+   .in_sizes = { 8, 8, },
+   .out_num = 2,
+   .out_sizes = { 4, 4, },
+   },
+   [NFIT_CMD_ARS_START] = {
+   .in_num = 4,
+   .in_sizes = { 8, 8, 2, 6, },
+   .out_num = 1,
+   .out_sizes = { 4, },
+   },
+   [NFIT_CMD_ARS_QUERY] = {
+

[PATCH 19/21] nd: infrastructure for btt devices

2015-04-17 Thread Dan Williams

Block devices from an nd bus, in addition to accepting "struct bio"
based requests, also have the capability to perform byte-aligned
accesses.  By default only the bio/block interface is used.  However, if
another driver can make effective use of the byte-aligned capability it
can claim/disable the block interface and use the byte-aligned "nd_io"
interface.

The BTT driver is the intended first consumer of this mechanism to allow
layering atomic sector update guarantees on top of nd_io capable
nd-bus-block-devices.
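
The claim/disable idea described above can be sketched in plain C as
follows.  This is only an illustration under assumed names (demo_ndio,
demo_ndio_claim, and so on are hypothetical); the patch's own claim
handling lives in the ndio_* helpers and is not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a byte-addressable nd_io endpoint. */
struct demo_ndio {
	const void *claimed_by;	/* NULL while the bio/block interface is active */
};

/* A stacking driver (e.g. a BTT instance) takes exclusive ownership. */
static bool demo_ndio_claim(struct demo_ndio *ndio, const void *owner)
{
	if (!owner || ndio->claimed_by)
		return false;	/* invalid owner, or already claimed */
	ndio->claimed_by = owner;
	return true;
}

/* Only the current owner can give the endpoint back. */
static void demo_ndio_release(struct demo_ndio *ndio, const void *owner)
{
	if (ndio->claimed_by == owner)
		ndio->claimed_by = NULL;
}

/* The default block path is usable only while nobody holds a claim. */
static bool demo_bio_allowed(const struct demo_ndio *ndio)
{
	return ndio->claimed_by == NULL;
}
```

The point of the single-owner rule is that the block interface and a
byte-aligned consumer never drive the same device at the same time.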

Cc: Greg KH 
Cc: Neil Brown 
Signed-off-by: Dan Williams 
---
 drivers/block/nd/Kconfig  |3 
 drivers/block/nd/Makefile |2 
 drivers/block/nd/btt.h|   45 
 drivers/block/nd/btt_devs.c   |  442 +
 drivers/block/nd/bus.c|  128 
 drivers/block/nd/core.c   |   80 +++
 drivers/block/nd/nd-private.h |   28 +++
 drivers/block/nd/nd.h |   94 +
 drivers/block/nd/pmem.c   |   30 +++
 include/uapi/linux/ndctl.h|2 
 10 files changed, 850 insertions(+), 4 deletions(-)
 create mode 100644 drivers/block/nd/btt.h
 create mode 100644 drivers/block/nd/btt_devs.c

diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 38eae5f0ae4b..faa756841773 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -89,4 +89,7 @@ config BLK_DEV_PMEM
 
  Say Y if you want to use a NVDIMM described by NFIT
 
+config ND_BTT_DEVS
+   def_bool y
+
 endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 93856f1c9dbd..3e4878e0fe1d 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -29,4 +29,6 @@ nd-y += region.o
 nd-y += namespace_devs.o
 nd-y += label.o
 
+nd-$(CONFIG_ND_BTT_DEVS) += btt_devs.o
+
 nd_pmem-y := pmem.o
diff --git a/drivers/block/nd/btt.h b/drivers/block/nd/btt.h
new file mode 100644
index ..e8f6d8e0ddd3
--- /dev/null
+++ b/drivers/block/nd/btt.h
@@ -0,0 +1,45 @@
+/*
+ * Block Translation Table library
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_BTT_H
+#define _LINUX_BTT_H
+
+#include 
+
+#define BTT_SIG_LEN 16
+#define BTT_SIG "BTT_ARENA_INFO\0"
+
+struct btt_sb {
+   u8 signature[BTT_SIG_LEN];
+   u8 uuid[16];
+   u8 parent_uuid[16];
+   __le32 flags;
+   __le16 version_major;
+   __le16 version_minor;
+   __le32 external_lbasize;
+   __le32 external_nlba;
+   __le32 internal_lbasize;
+   __le32 internal_nlba;
+   __le32 nfree;
+   __le32 infosize;
+   __le64 nextoff;
+   __le64 dataoff;
+   __le64 mapoff;
+   __le64 logoff;
+   __le64 info2off;
+   u8 padding[3968];
+   __le64 checksum;
+};
+
+#endif
diff --git a/drivers/block/nd/btt_devs.c b/drivers/block/nd/btt_devs.c
new file mode 100644
index ..746d582910b6
--- /dev/null
+++ b/drivers/block/nd/btt_devs.c
@@ -0,0 +1,442 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "nd-private.h"
+#include "btt.h"
+#include "nd.h"
+
+static DEFINE_IDA(btt_ida);
+
+static void nd_btt_release(struct device *dev)
+{
+   struct nd_btt *nd_btt = to_nd_btt(dev);
+
+   dev_dbg(dev, "%s\n", __func__);
+   WARN_ON(nd_btt->backing_dev);
+   ndio_del_claim(nd_btt->ndio_claim);
+   ida_simple_remove(&btt_ida, nd_btt->id);
+   kfree(nd_btt->uuid);
+   kfree(nd_btt);
+}
+
+static struct device_type nd_btt_device_type = {
+   .name = "nd_btt",
+   .release = nd_btt_release,
+};
+
+bool is_nd_btt(struct device *dev)
+{
+   return dev->type == &nd_btt_device_type;
+}
+
+struct nd_btt *to_nd_btt(struct device *dev)
+{
+   struct nd_btt *nd_btt = container_of(dev, struct nd_btt, dev);
+
+   WARN_ON(!is_nd_btt(dev));
+   return nd_btt;
+}
+EXPORT_SYMBOL(to_nd_btt);
+
+static const unsigned long btt_lbasize_supported[] = { 512, 4096, 0 };
+
+static ssize_t sector_size_show(struct device *dev,
+   struct

[PATCH 12/21] nd_pmem: add NFIT support to the pmem driver

2015-04-17 Thread Dan Williams

nd_pmem attaches to persistent memory regions and namespaces emitted by
the nd subsystem, and, same as the original pmem driver, presents the
system-physical-address range as a block device.

Cc: Andy Lutomirski 
Cc: Boaz Harrosh 
Cc: H. Peter Anvin 
Cc: Jens Axboe 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 drivers/block/Kconfig |   11 ---
 drivers/block/Makefile|1 -
 drivers/block/nd/Kconfig  |   17 +++
 drivers/block/nd/Makefile |3 ++
 drivers/block/nd/pmem.c   |   72 +++--
 5 files changed, 83 insertions(+), 21 deletions(-)
 rename drivers/block/{pmem.c => nd/pmem.c} (81%)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index dfe40e5ca9bd..1cef4ffb16c5 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -406,17 +406,6 @@ config BLK_DEV_RAM_DAX
  and will prevent RAM block device backing store memory from being
  allocated from highmem (only a problem for highmem systems).
 
-config BLK_DEV_PMEM
-   tristate "Persistent memory block device support"
-   help
- Saying Y here will allow you to use a contiguous range of reserved
- memory as one or more persistent block devices.
-
- To compile this driver as a module, choose M here: the module will be
- called 'pmem'.
-
- If unsure, say N.
-
 config CDROM_PKTCDVD
tristate "Packet writing on CD/DVD media"
depends on !UML
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 18b27bb9cd2d..3a2f15be66a3 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -14,7 +14,6 @@ obj-$(CONFIG_PS3_VRAM)+= ps3vram.o
 obj-$(CONFIG_ATARI_FLOPPY) += ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)  += z2ram.o
 obj-$(CONFIG_BLK_DEV_RAM)  += brd.o
-obj-$(CONFIG_BLK_DEV_PMEM) += pmem.o
 obj-$(CONFIG_BLK_DEV_LOOP) += loop.o
 obj-$(CONFIG_BLK_CPQ_DA)   += cpqarray.o
 obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o
diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 6c15d10bf4e0..38eae5f0ae4b 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -72,4 +72,21 @@ config NFIT_TEST
 
  Say N unless you are doing development of the 'nd' subsystem.
 
+config BLK_DEV_PMEM
+   tristate "PMEM: Persistent memory block device support"
+   depends on ND_CORE || X86_PMEM_LEGACY
+   default ND_CORE
+   help
+ Memory ranges for PMEM are described by either an NFIT
+ (NVDIMM Firmware Interface Table, see CONFIG_NFIT_ACPI), a
+ non-standard OEM-specific E820 memory type (type-12, see
+ CONFIG_X86_PMEM_LEGACY), or it is manually specified by the
+ 'memmap=nn[KMG]!ss[KMG]' kernel command line (see
+ Documentation/kernel-parameters.txt).  This driver converts
+ these persistent memory ranges into block devices that are
+ capable of DAX (direct-access) file system mappings.  See
+ Documentation/blockdev/nd.txt for more details.
+
+ Say Y if you want to use a NVDIMM described by NFIT
+
 endif
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 769ddc34f974..c0194d52e5ad 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -16,6 +16,7 @@ endif
 
 obj-$(CONFIG_ND_CORE) += nd.o
 obj-$(CONFIG_NFIT_ACPI) += nd_acpi.o
+obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 
 nd_acpi-y := acpi.o
 
@@ -26,3 +27,5 @@ nd-y += dimm.o
 nd-y += region_devs.o
 nd-y += region.o
 nd-y += namespace_devs.o
+
+nd_pmem-y := pmem.o
diff --git a/drivers/block/pmem.c b/drivers/block/nd/pmem.c
similarity index 81%
rename from drivers/block/pmem.c
rename to drivers/block/nd/pmem.c
index eabf4a8d0085..cd83a9a98d89 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/nd/pmem.c
@@ -1,7 +1,7 @@
 /*
  * Persistent Memory Driver
  *
- * Copyright (c) 2014, Intel Corporation.
+ * Copyright (c) 2014-2015, Intel Corporation.
  * Copyright (c) 2015, Christoph Hellwig .
  * Copyright (c) 2015, Boaz Harrosh .
  *
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define PMEM_MINORS16
 
@@ -34,10 +35,11 @@ struct pmem_device {
phys_addr_t phys_addr;
void*virt_addr;
size_t  size;
+   int id;
 };
 
 static int pmem_major;
-static atomic_t pmem_index;
+static DEFINE_IDA(pmem_ida);
 
 static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
unsigned int len, unsigned int off, int rw,
@@ -122,20 +124,26 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
 {
struct pmem_device *pmem;
struct gendisk *disk;
-   int idx, err;
+   int err;
 
err = -ENOMEM;
pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
if (!pmem)
goto out;
 
+   pmem->id = ida_simple_get(&pmem_ida, 0, 0, GFP_KERNEL);
+

[PATCH 18/21] nd: write blk label set

2015-04-17 Thread Dan Williams

After 'uuid', 'size', 'sector_size', and optionally 'alt_name' have been
set to valid values, the labels on the dimm can be updated.  Unlike the
pmem case, blk namespaces are limited to one dimm and can cover
discontiguous ranges in dpa space.

Also, after allocating label slots, it is useful for userspace to know
how many slots are left.  Export this information in sysfs.
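
As a rough userspace sketch of the slot accounting: count the free bits
in the index bitmap, then report one fewer than that, clamped at zero so
the unsigned value cannot wrap.  The demo_* helpers are hypothetical, and
the byte-wise bitmap layout is an assumption for illustration.  Holding
one slot back mirrors the "nfree - 1" logic in available_slots_show();
the reason (presumably keeping a slot in reserve for a later update) is
an inference, not something this message states.

```c
#include <stdint.h>

/* Count slots whose bit is set (free) in a little-endian bitmap. */
static uint32_t demo_count_free(const uint8_t *free_map, uint32_t nslot)
{
	uint32_t slot, n = 0;

	for (slot = 0; slot < nslot; slot++)
		if (free_map[slot / 8] & (1u << (slot % 8)))
			n++;
	return n;
}

/* Report free slots minus one, never letting the unsigned value wrap. */
static uint32_t demo_available_slots(uint32_t nfree)
{
	if (nfree == 0)
		return 0;	/* corresponds to the "nfree - 1 > nfree" check */
	return nfree - 1;
}
```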

Signed-off-by: Dan Williams 
---
 drivers/block/nd/bus.c|4 
 drivers/block/nd/dimm_devs.c  |   25 +++
 drivers/block/nd/label.c  |  297 +++--
 drivers/block/nd/label.h  |5 +
 drivers/block/nd/namespace_devs.c |   57 +++
 drivers/block/nd/nd-private.h |1 
 6 files changed, 367 insertions(+), 22 deletions(-)

diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index 8e70098b6cb0..cb619d70166d 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -165,6 +165,10 @@ static void nd_async_device_unregister(void *d, async_cookie_t cookie)
 {
struct device *dev = d;
 
+   /* flush bus operations before delete */
+   nd_bus_lock(dev);
+   nd_bus_unlock(dev);
+
device_unregister(dev);
put_device(dev);
 }
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index a1685c01a2bb..eead15c98196 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include "nd-private.h"
+#include "label.h"
 #include "nfit.h"
 #include "nd.h"
 
@@ -364,6 +365,29 @@ static ssize_t state_show(struct device *dev, struct device_attribute *attr,
 }
 static DEVICE_ATTR_RO(state);
 
+static ssize_t available_slots_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct nd_dimm_drvdata *ndd = dev_get_drvdata(dev);
+   ssize_t rc;
+   u32 nfree;
+
+   if (!ndd)
+   return -ENXIO;
+
+   nd_bus_lock(dev);
+   nfree = nd_label_nfree(ndd);
+   if (nfree - 1 > nfree) {
+   dev_WARN_ONCE(dev, 1, "we ate our last label?\n");
+   nfree = 0;
+   } else
+   nfree--;
+   rc = sprintf(buf, "%d\n", nfree);
+   nd_bus_unlock(dev);
+   return rc;
+}
+static DEVICE_ATTR_RO(available_slots);
+
 static struct attribute *nd_dimm_attributes[] = {
&dev_attr_handle.attr,
&dev_attr_phys_id.attr,
@@ -374,6 +398,7 @@ static struct attribute *nd_dimm_attributes[] = {
&dev_attr_state.attr,
&dev_attr_revision.attr,
&dev_attr_commands.attr,
+   &dev_attr_available_slots.attr,
NULL,
 };
 
diff --git a/drivers/block/nd/label.c b/drivers/block/nd/label.c
index 78898b642191..069c26d50ed1 100644
--- a/drivers/block/nd/label.c
+++ b/drivers/block/nd/label.c
@@ -58,7 +58,7 @@ size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd)
return ndd->nsindex_size;
 }
 
-static int nd_dimm_num_label_slots(struct nd_dimm_drvdata *ndd)
+int nd_dimm_num_label_slots(struct nd_dimm_drvdata *ndd)
 {
return ndd->nsarea.config_size / 129;
 }
@@ -416,7 +416,7 @@ u32 nd_label_nfree(struct nd_dimm_drvdata *ndd)
WARN_ON(!is_nd_bus_locked(ndd->dev));
 
if (!preamble_next(ndd, &nsindex, &free, &nslot))
-   return 0;
+   return nd_dimm_num_label_slots(ndd);
 
return bitmap_weight(free, nslot);
 }
@@ -553,22 +553,270 @@ static int __pmem_label_update(struct nd_region *nd_region,
return 0;
 }
 
-static int init_labels(struct nd_mapping *nd_mapping)
+static void del_label(struct nd_mapping *nd_mapping, int l)
+{
+   struct nd_namespace_label __iomem *next_label, __iomem *nd_label;
+   struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+   unsigned int slot;
+   int j;
+
+   nd_label = nd_get_label(nd_mapping->labels, l);
+   slot = to_slot(ndd, nd_label);
+   dev_vdbg(ndd->dev, "%s: clear: %d\n", __func__, slot);
+
+   for (j = l; (next_label = nd_get_label(nd_mapping->labels, j + 1)); j++)
+   nd_set_label(nd_mapping->labels, next_label, j);
+   nd_set_label(nd_mapping->labels, NULL, j);
+}
+
+static bool is_old_resource(struct resource *res, struct resource **list, int n)
 {
int i;
+
+   if (res->flags & DPA_RESOURCE_ADJUSTED)
+   return false;
+   for (i = 0; i < n; i++)
+   if (res == list[i])
+   return true;
+   return false;
+}
+
+static struct resource *to_resource(struct nd_dimm_drvdata *ndd,
+   struct nd_namespace_label __iomem *nd_label)
+{
+   struct resource *res;
+
+   for_each_dpa_resource(ndd, res) {
+   if (res->start != readq(&nd_label->dpa))
+   continue;
+   if (resource_size(res) != readq(&nd_label->rawsize))
+   continue;
+   return res;
+   }
+
+   return NULL;
+}
+
+/*
+ * 1/ Account all the labels that can be freed after this update
+ * 2/ Allocate

[PATCH 17/21] nd: write pmem label set

2015-04-17 Thread Dan Williams

After 'uuid', 'size', and optionally 'alt_name' have been set to valid
values the labels on the dimms can be updated.

Write procedure is:
1/ Allocate and write new labels in the "next" index
2/ Free the old labels in the working copy
3/ Write the bitmap and the label space on the dimm
4/ Write the index to make the update valid

Label ranges directly mirror the dpa resource values for the given
label_id of the namespace.
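
The write procedure above can be modelled in a few lines of userspace C.
This is a simplified sketch, not the patch's implementation: the demo_*
types are hypothetical, a label is reduced to a single 64-bit word, there
are only 8 slots, and the index sequence number is a plain counter chosen
for readability.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_NSLOT 8

struct demo_index {
	uint32_t seq;	/* the index with the higher seq is current */
	uint8_t free;	/* bit set = label slot is free */
};

struct demo_dimm {
	struct demo_index index[2];	/* two index blocks, used alternately */
	uint64_t label[DEMO_NSLOT];
};

static int demo_current(const struct demo_dimm *d)
{
	return d->index[1].seq > d->index[0].seq ? 1 : 0;
}

/*
 * Replace the label in @old_slot (or -1 for none) with @val.
 * Returns the slot of the new label, or -1 if no slot is free.
 */
static int demo_label_update(struct demo_dimm *d, int old_slot, uint64_t val)
{
	int cur = demo_current(d), next = !cur, slot;
	struct demo_index work = d->index[cur];	/* working copy */

	/* 1/ allocate a slot and write the new label for the "next" index */
	for (slot = 0; slot < DEMO_NSLOT; slot++)
		if (work.free & (1u << slot))
			break;
	if (slot == DEMO_NSLOT)
		return -1;
	work.free &= (uint8_t)~(1u << slot);
	d->label[slot] = val;

	/* 2/ free the old label in the working copy only */
	if (old_slot >= 0)
		work.free |= (uint8_t)(1u << old_slot);

	/* 3/ and 4/ store the bitmap, then publish it by bumping seq */
	work.seq = d->index[cur].seq + 1;
	d->index[next] = work;
	return slot;
}
```

Because the new label lands in a slot that the current index still marks
free, and the old index block is never modified, an interruption before
the final index write leaves the previous label set intact in this model.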

Signed-off-by: Dan Williams 
---
 drivers/block/nd/dimm_devs.c  |   49 ++
 drivers/block/nd/label.c  |  327 +
 drivers/block/nd/label.h  |6 +
 drivers/block/nd/namespace_devs.c |   82 -
 drivers/block/nd/nd.h |3 
 5 files changed, 453 insertions(+), 14 deletions(-)

diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index ae77bf4a5188..a1685c01a2bb 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -134,6 +134,55 @@ int nd_dimm_init_config_data(struct nd_dimm_drvdata *ndd)
return rc;
 }
 
+int nd_dimm_set_config_data(struct nd_dimm_drvdata *ndd, size_t offset,
+   void *buf, size_t len)
+{
+   int rc = validate_dimm(ndd);
+   size_t max_cmd_size, buf_offset;
+   struct nfit_cmd_set_config_hdr *cmd;
+   struct nd_bus *nd_bus = walk_to_nd_bus(ndd->dev);
+   struct nfit_bus_descriptor *nfit_desc = nd_bus->nfit_desc;
+
+   if (rc)
+   return rc;
+
+   if (!ndd->data)
+   return -ENXIO;
+
+   if (offset + len > ndd->nsarea.config_size)
+   return -ENXIO;
+
+   max_cmd_size = min_t(u32, PAGE_SIZE, len);
+   max_cmd_size = min_t(u32, max_cmd_size, ndd->nsarea.max_xfer);
+   cmd = kzalloc(max_cmd_size + sizeof(*cmd) + sizeof(u32), GFP_KERNEL);
+   if (!cmd)
+   return -ENOMEM;
+
+   for (buf_offset = 0; len; len -= cmd->in_length,
+   buf_offset += cmd->in_length) {
+   size_t cmd_size;
+   u32 *status;
+
+   cmd->in_offset = offset + buf_offset;
+   cmd->in_length = min(max_cmd_size, len);
+   memcpy(cmd->in_buf, buf + buf_offset, cmd->in_length);
+
+   /* status is output in the last 4-bytes of the command buffer */
+   cmd_size = sizeof(*cmd) + cmd->in_length + sizeof(u32);
+   status = ((void *) cmd) + cmd_size - sizeof(u32);
+
+   rc = nfit_desc->nfit_ctl(nfit_desc, to_nd_dimm(ndd->dev),
+   NFIT_CMD_SET_CONFIG_DATA, cmd, cmd_size);
+   if (rc || *status) {
+   rc = rc ? rc : -ENXIO;
+   break;
+   }
+   }
+   kfree(cmd);
+
+   return rc;
+}
+
 static void nd_dimm_release(struct device *dev)
 {
struct nd_dimm *nd_dimm = to_nd_dimm(dev);
diff --git a/drivers/block/nd/label.c b/drivers/block/nd/label.c
index b55fa2a6f872..78898b642191 100644
--- a/drivers/block/nd/label.c
+++ b/drivers/block/nd/label.c
@@ -12,6 +12,7 @@
  */
 #include 
 #include 
+#include 
 #include 
 #include 
 #include "nd-private.h"
@@ -57,6 +58,11 @@ size_t sizeof_namespace_index(struct nd_dimm_drvdata *ndd)
return ndd->nsindex_size;
 }
 
+static int nd_dimm_num_label_slots(struct nd_dimm_drvdata *ndd)
+{
+   return ndd->nsarea.config_size / 129;
+}
+
 int nd_label_validate(struct nd_dimm_drvdata *ndd)
 {
/*
@@ -202,23 +208,30 @@ static struct nd_namespace_label __iomem *nd_label_base(struct nd_dimm_drvdata *
return base + 2 * sizeof_namespace_index(ndd);
 }
 
+static int to_slot(struct nd_dimm_drvdata *ndd,
+   struct nd_namespace_label __iomem *nd_label)
+{
+   return nd_label - nd_label_base(ndd);
+}
+
 #define for_each_clear_bit_le(bit, addr, size) \
for ((bit) = find_next_zero_bit_le((addr), (size), 0);  \
 (bit) < (size);\
 (bit) = find_next_zero_bit_le((addr), (size), (bit) + 1))
 
 /**
- * preamble_current - common variable initialization for nd_label_* routines
+ * preamble_index - common variable initialization for nd_label_* routines
  * @nd_dimm: dimm container for the relevant label set
+ * @idx: namespace_index index
  * @nsindex: on return set to the currently active namespace index
  * @free: on return set to the free label bitmap in the index
  * @nslot: on return set to the number of slots in the label space
  */
-static bool preamble_current(struct nd_dimm_drvdata *ndd,
+static bool preamble_index(struct nd_dimm_drvdata *ndd, int idx,
struct nd_namespace_index **nsindex,
unsigned long **free, u32 *nslot)
 {
-   *nsindex = to_current_namespace_index(ndd);
+   *nsindex = to_namespace_index(ndd, idx);
if (*nsindex == NULL)
return false;
 
@@ -237,6 +250,22 @@ char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags)

[PATCH 21/21] nd_blk: nfit blk driver

2015-04-17 Thread Dan Williams

From: Ross Zwisler 

Block-device driver for BLK namespaces described by DCR (dimm control
region), BDW (block data window), and IDT (interleave descriptor) NFIT
structures.

The BIOS may choose to interleave multiple dimms into a given SPA
(system physical address) range, so this driver includes core nd
infrastructure for multiplexing multiple BLK namespace devices on a
single request_mem_region() + ioremap() mapping.  Note, the math and
table walking to de-interleave the memory space on each I/O may prove to
be too computationally expensive, in which case we would look to replace
it with a flat lookup implementation.
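
To give a feel for the per-I/O math mentioned above, here is a minimal
userspace sketch of round-robin de-interleaving.  It is not the driver's
IDT table walk: the fixed line size, the dimm count, and the names
deinterleave() and struct deinterleave_result are all assumptions made
for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: which dimm an offset lands on, and where. */
struct deinterleave_result {
	unsigned int dimm;	/* index of the dimm within the interleave set */
	uint64_t dpa_offset;	/* offset relative to that dimm's base */
};

/*
 * Assumed layout: consecutive 'line_size' byte lines of the interleaved
 * range rotate across 'num_dimms' dimms in round-robin order.  A real
 * NFIT interleave descriptor may describe a different line ordering.
 */
struct deinterleave_result deinterleave(uint64_t offset, uint64_t line_size,
		unsigned int num_dimms)
{
	struct deinterleave_result res;
	uint64_t line;

	assert(line_size > 0 && num_dimms > 0);

	line = offset / line_size;	/* line number within the whole range */
	res.dimm = (unsigned int)(line % num_dimms);
	/* full rotations completed so far, plus the offset inside the line */
	res.dpa_offset = (line / num_dimms) * line_size + offset % line_size;
	return res;
}
```

Even in this simplified form every access costs two divisions and two
modulo operations, which is the kind of overhead a flat lookup table
would trade for memory.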

A new nd core api nd_blk_validate_namespace() is introduced to check
that the labels on the DIMM are in sync with the current set of
dpa-resources assigned to the namespace.  nd_blk_validate_namespace()
prevents enabling the namespace when they are out of sync.  Userspace
can retry writing the labels in that scenario.

Finally, enable testing of the BLK namespace infrastructure via
nfit_test.  Provide a mock implementation of nd_blk_do_io() to route
block-data-window accesses to an nfit_test allocation simulating BLK
storage.

Cc: Andy Lutomirski 
Cc: Boaz Harrosh 
Cc: H. Peter Anvin 
Cc: Jens Axboe 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Signed-off-by: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 drivers/block/nd/Kconfig  |   19 ++
 drivers/block/nd/Makefile |3 
 drivers/block/nd/blk.c|  269 
 drivers/block/nd/core.c   |   57 ++-
 drivers/block/nd/namespace_devs.c |   47 ++
 drivers/block/nd/nd-private.h |   24 +++
 drivers/block/nd/nd.h |   51 ++
 drivers/block/nd/region.c |   11 +
 drivers/block/nd/region_devs.c|  314 -
 drivers/block/nd/test/iomap.c |   53 ++
 drivers/block/nd/test/nfit.c  |3 
 drivers/block/nd/test/nfit_test.h |   14 ++
 12 files changed, 851 insertions(+), 14 deletions(-)
 create mode 100644 drivers/block/nd/blk.c

diff --git a/drivers/block/nd/Kconfig b/drivers/block/nd/Kconfig
index 29d9f8e4eedb..72580cb0e39c 100644
--- a/drivers/block/nd/Kconfig
+++ b/drivers/block/nd/Kconfig
@@ -70,6 +70,9 @@ config NFIT_TEST
  load.  Kconfig does not allow for numerical value
  dependencies, so we can only warn at runtime.
 
+ Enabling this option will degrade the performance of other BLK
+ namespaces.  Do not enable for production environments.
+
  Say N unless you are doing development of the 'nd' subsystem.
 
 config BLK_DEV_PMEM
@@ -89,6 +92,22 @@ config BLK_DEV_PMEM
 
  Say Y if you want to use a NVDIMM described by NFIT
 
+config ND_BLK
+   tristate "BLK: Block data window (aperture) device support"
+   depends on ND_CORE
+   default ND_CORE
+   help
+ This driver performs I/O using a set of DCR/BDW defined
+ apertures.  The set of apertures will all access the one
+ DIMM.  Multiple windows allow multiple concurrent accesses,
+ much like tagged-command-queuing, and would likely be used
+ by different threads or different CPUs.
+
+ The NFIT specification defines a standard format for a Block
+ Data Window.
+
+ Say Y if you want to use a NVDIMM described by NFIT
+
 config ND_BTT_DEVS
bool
 
diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 2dc1ab6fdef2..df104f2123a4 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -12,12 +12,14 @@ ldflags-y += --wrap=ioremap_nocache
 ldflags-y += --wrap=iounmap
 ldflags-y += --wrap=__request_region
 ldflags-y += --wrap=__release_region
+ldflags-y += --wrap=nd_blk_do_io
 endif
 
 obj-$(CONFIG_ND_CORE) += nd.o
 obj-$(CONFIG_NFIT_ACPI) += nd_acpi.o
 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 obj-$(CONFIG_ND_BTT) += nd_btt.o
+obj-$(CONFIG_ND_BLK) += nd_blk.o
 
 nd_acpi-y := acpi.o
 
@@ -34,3 +36,4 @@ nd-$(CONFIG_ND_BTT_DEVS) += btt_devs.o
 
 nd_pmem-y := pmem.o
 nd_btt-y := btt.o
+nd_blk-y := blk.o
diff --git a/drivers/block/nd/blk.c b/drivers/block/nd/blk.c
new file mode 100644
index ..9e32ae610d15
--- /dev/null
+++ b/drivers/block/nd/blk.c
@@ -0,0 +1,269 @@
+/*
+ * NVDIMM Block Window Driver
+ * Copyright (c) 2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "nd.h"
+
+struct nd_blk_device {
+   struct request_queue *queue;
+   struct gendisk *disk;
+

[PATCH 16/21] nd: blk labels and namespace instantiation

2015-04-17 Thread Dan Williams

A blk label set describes a namespace comprised of one or more
discontiguous dpa ranges on a single dimm.  They may alias with one or
more pmem interleave sets that include the given dimm.

This is the runtime/volatile configuration infrastructure for sysfs
manipulation of 'alt_name', 'uuid', 'size', and 'sector_size'.  A later
patch will make these settings persistent by writing back the label(s).
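
As an illustration of the 'sector_size' attribute semantics, the
following userspace sketch mirrors the show/store pair added to core.c
below: the supported sizes are listed with the current one in brackets,
and a store is accepted only for a listed size.  The function names, the
explicit buffer length, and the use of snprintf() are assumptions of the
sketch, not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a zero-terminated list of supported sizes into buf, putting the
 * current size in brackets.  Returns the number of bytes written, or -1
 * if buf is too small.
 */
int sector_size_show(unsigned long current, const unsigned long *supported,
		char *buf, size_t len)
{
	size_t used = 0;
	int i, n;

	for (i = 0; supported[i]; i++) {
		if (current == supported[i])
			n = snprintf(buf + used, len - used, "[%lu] ",
					supported[i]);
		else
			n = snprintf(buf + used, len - used, "%lu ",
					supported[i]);
		if (n < 0 || (size_t)n >= len - used)
			return -1;	/* output error or truncation */
		used += (size_t)n;
	}
	n = snprintf(buf + used, len - used, "\n");
	if (n < 0 || (size_t)n >= len - used)
		return -1;
	return (int)(used + (size_t)n);
}

/* Accept 'requested' only if it appears in the supported list. */
int sector_size_store(unsigned long requested, unsigned long *current,
		const unsigned long *supported)
{
	int i;

	for (i = 0; supported[i]; i++)
		if (requested == supported[i]) {
			*current = requested;
			return 0;
		}
	return -EINVAL;
}
```

With a supported list of { 512, 520, 4096, 0 } and a current size of
512, the show routine produces "[512] 520 4096 " followed by a newline.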

Unlike pmem namespaces, multiple blk namespaces can be created per
region.  Once a blk namespace has been created a new seed device
(unconfigured child of a parent blk region) is instantiated.  As long as
a region has 'available_size' != 0 new child namespaces may be created.
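
The 'available_size' idea can be sketched in userspace as "mapping size
minus whatever the busy dpa ranges overlap".  This is a simplified
illustration, not the patch's nd_blk_available_dpa(): struct range,
overlap(), and available_in_mapping() are hypothetical names, ranges use
inclusive end addresses, and the busy ranges are assumed not to overlap
each other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address range with an inclusive end, like struct resource. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Number of bytes shared by two inclusive ranges, 0 if they are disjoint. */
uint64_t overlap(const struct range *a, const struct range *b)
{
	uint64_t start = a->start > b->start ? a->start : b->start;
	uint64_t end = a->end < b->end ? a->end : b->end;

	return start <= end ? end - start + 1 : 0;
}

/*
 * Bytes of 'map' not covered by any of the 'n' busy ranges.  Assumes the
 * busy ranges are disjoint from one another, so overlaps can be summed.
 */
uint64_t available_in_mapping(const struct range *map,
		const struct range *busy, size_t n)
{
	uint64_t total = map->end - map->start + 1;
	uint64_t used = 0;
	size_t i;

	for (i = 0; i < n; i++)
		used += overlap(map, &busy[i]);

	return used < total ? total - used : 0;
}
```

A new seed namespace only makes sense while this value is non-zero,
which matches the 'available_size' != 0 condition described above.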

Cc: Greg KH 
Cc: Neil Brown 
Signed-off-by: Dan Williams 
---
 drivers/block/nd/core.c   |   40 +++
 drivers/block/nd/dimm_devs.c  |   35 +++
 drivers/block/nd/namespace_devs.c |  502 ++---
 drivers/block/nd/nd-private.h |   10 +
 drivers/block/nd/nd.h |5 
 drivers/block/nd/region_devs.c|   15 +
 include/linux/nd.h|   25 ++
 7 files changed, 588 insertions(+), 44 deletions(-)

diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index 560ed496..880aef08f919 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -213,6 +213,46 @@ int nd_uuid_store(struct device *dev, u8 **uuid_out, const char *buf,
return 0;
 }
 
+ssize_t nd_sector_size_show(unsigned long current_lbasize,
+   const unsigned long *supported, char *buf)
+{
+   ssize_t len = 0;
+   int i;
+
+   for (i = 0; supported[i]; i++)
+   if (current_lbasize == supported[i])
+   len += sprintf(buf + len, "[%ld] ", supported[i]);
+   else
+   len += sprintf(buf + len, "%ld ", supported[i]);
+   len += sprintf(buf + len, "\n");
+   return len;
+}
+
+ssize_t nd_sector_size_store(struct device *dev, const char *buf,
+   unsigned long *current_lbasize, const unsigned long *supported)
+{
+   unsigned long lbasize;
+   int rc, i;
+
+   if (dev->driver)
+   return -EBUSY;
+
+   rc = kstrtoul(buf, 0, &lbasize);
+   if (rc)
+   return rc;
+
+   for (i = 0; supported[i]; i++)
+   if (lbasize == supported[i])
+   break;
+
+   if (supported[i]) {
+   *current_lbasize = lbasize;
+   return 0;
+   } else {
+   return -EINVAL;
+   }
+}
+
 static ssize_t commands_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
diff --git a/drivers/block/nd/dimm_devs.c b/drivers/block/nd/dimm_devs.c
index caa51d3ea6af..ae77bf4a5188 100644
--- a/drivers/block/nd/dimm_devs.c
+++ b/drivers/block/nd/dimm_devs.c
@@ -417,6 +417,41 @@ static struct nd_dimm *nd_dimm_create(struct nd_bus *nd_bus,
 }
 
 /**
+ * nd_blk_available_dpa - account the unused dpa of BLK region
+ * @nd_mapping: container of dpa-resource-root + labels
+ *
+ * Unlike PMEM, BLK namespaces can occupy discontiguous DPA ranges.
+ */
+resource_size_t nd_blk_available_dpa(struct nd_mapping *nd_mapping)
+{
+   struct nd_dimm_drvdata *ndd = to_ndd(nd_mapping);
+   resource_size_t map_end, busy = 0, available;
+   struct resource *res;
+
+   if (!ndd)
+   return 0;
+
+   map_end = nd_mapping->start + nd_mapping->size - 1;
+   for_each_dpa_resource(ndd, res)
+   if (res->start >= nd_mapping->start && res->start < map_end) {
+   resource_size_t end = min(map_end, res->end);
+
+   busy += end - res->start + 1;
+   } else if (res->end >= nd_mapping->start && res->end <= map_end) {
+   busy += res->end - nd_mapping->start;
+   } else if (nd_mapping->start > res->start
+   && nd_mapping->start < res->end) {
+   /* total eclipse of the BLK region mapping */
+   busy += nd_mapping->size;
+   }
+
+   available = map_end - nd_mapping->start + 1;
+   if (busy < available)
+   return available - busy;
+   return 0;
+}
+
+/**
  * nd_pmem_available_dpa - for the given dimm+region account unallocated dpa
  * @nd_mapping: container of dpa-resource-root + labels
  * @nd_region: constrain available space check to this reference region
diff --git a/drivers/block/nd/namespace_devs.c b/drivers/block/nd/namespace_devs.c
index 386776845830..de36f3891284 100644
--- a/drivers/block/nd/namespace_devs.c
+++ b/drivers/block/nd/namespace_devs.c
@@ -37,7 +37,15 @@ static void namespace_pmem_release(struct device *dev)
 
 static void namespace_blk_release(struct device *dev)
 {
-   /* TODO: blk namespace support */
+   struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
+   struct nd_region *nd_region = to_nd_region(dev->parent);
+
+   if (nsblk->id >= 0)
+

[PATCH 01/21] e820, efi: add ACPI 6.0 persistent memory types

2015-04-17 Thread Dan Williams

ACPI 6.0 formalizes e820-type-7 and efi-type-14 as persistent memory.
Mark it "reserved" and allow it to be claimed by a persistent memory
device driver.

This definition is in addition to the Linux kernel's existing type-12
definition that was recently added in support of shipping platforms with
NVDIMM support that predate ACPI 6.0 (which now classifies type-12 as
OEM reserved).  We may choose to exploit this wealth of definitions for
NVDIMMs to differentiate E820_PRAM (type-12) from E820_PMEM (type-7).
One potential differentiation is that PMEM is not backed by struct page
by default in contrast to PRAM.  For now, they are effectively treated
as aliases by the mm.

Note, /proc/iomem can be consulted for differentiating legacy
"Persistent RAM" E820_PRAM vs standard "Persistent I/O Memory"
E820_PMEM.
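
A compact way to see the resulting policy is the userspace sketch below.
The type values 7 and 12 come from the text above and 2 is the standard
e820 "reserved" type; the SKETCH_ names and both helper functions are
made up for illustration, and the string table is reduced to the cases
discussed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative copies of the e820 type numbers discussed above. */
enum {
	SKETCH_E820_RESERVED = 2,
	SKETCH_E820_PMEM = 7,	/* ACPI 6.0 persistent memory */
	SKETCH_E820_PRAM = 12,	/* pre-ACPI-6.0 Linux definition */
};

/* Name shown in /proc/iomem for the types covered by this sketch. */
const char *sketch_type_to_string(unsigned int type)
{
	switch (type) {
	case SKETCH_E820_PRAM:
		return "Persistent RAM";
	case SKETCH_E820_PMEM:
		return "Persistent I/O Memory";
	default:
		return "reserved";
	}
}

/*
 * Ranges below 1MiB are always marked busy.  Above that, reserved and
 * persistent ranges are left claimable by a driver.
 */
bool sketch_mark_busy(unsigned int type, unsigned long long start)
{
	if (start < (1ULL << 20))
		return true;

	switch (type) {
	case SKETCH_E820_RESERVED:
	case SKETCH_E820_PRAM:
	case SKETCH_E820_PMEM:
		return false;
	default:
		return true;
	}
}
```

Leaving these ranges not busy is what allows a persistent memory driver
to claim them later with request_mem_region().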

Cc: Andy Lutomirski 
Cc: Boaz Harrosh 
Cc: H. Peter Anvin 
Cc: Jens Axboe 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Signed-off-by: Dan Williams 
Reviewed-by: Ross Zwisler 
---
 arch/arm64/kernel/efi.c  |1 +
 arch/ia64/kernel/efi.c   |1 +
 arch/x86/boot/compressed/eboot.c |4 
 arch/x86/include/uapi/asm/e820.h |1 +
 arch/x86/kernel/e820.c   |   25 +++--
 arch/x86/platform/efi/efi.c  |3 +++
 include/linux/efi.h  |3 ++-
 7 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index ab21e0d58278..9d4aa18f2a82 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -158,6 +158,7 @@ static __init int is_reserve_region(efi_memory_desc_t *md)
case EFI_BOOT_SERVICES_CODE:
case EFI_BOOT_SERVICES_DATA:
case EFI_CONVENTIONAL_MEMORY:
+   case EFI_PERSISTENT_MEMORY:
return 0;
default:
break;
diff --git a/arch/ia64/kernel/efi.c b/arch/ia64/kernel/efi.c
index c52d7540dc05..cd8b7485e396 100644
--- a/arch/ia64/kernel/efi.c
+++ b/arch/ia64/kernel/efi.c
@@ -1227,6 +1227,7 @@ efi_initialize_iomem_resources(struct resource *code_resource,
case EFI_RUNTIME_SERVICES_CODE:
case EFI_RUNTIME_SERVICES_DATA:
case EFI_ACPI_RECLAIM_MEMORY:
+   case EFI_PERSISTENT_MEMORY:
default:
name = "reserved";
break;
diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c
index ef17683484e9..dde5bf7726f4 100644
--- a/arch/x86/boot/compressed/eboot.c
+++ b/arch/x86/boot/compressed/eboot.c
@@ -1222,6 +1222,10 @@ static efi_status_t setup_e820(struct boot_params *params,
e820_type = E820_NVS;
break;
 
+   case EFI_PERSISTENT_MEMORY:
+   e820_type = E820_PMEM;
+   break;
+
default:
continue;
}
diff --git a/arch/x86/include/uapi/asm/e820.h b/arch/x86/include/uapi/asm/e820.h
index 960a8a9dc4ab..0f457e6eab18 100644
--- a/arch/x86/include/uapi/asm/e820.h
+++ b/arch/x86/include/uapi/asm/e820.h
@@ -32,6 +32,7 @@
 #define E820_ACPI  3
 #define E820_NVS   4
 #define E820_UNUSABLE  5
+#define E820_PMEM  7
 
 /*
  * This is a non-standardized way to represent ADR or NVDIMM regions that
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 11cc7d54ec3f..410af501a941 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -137,6 +137,8 @@ static void __init e820_print_type(u32 type)
case E820_RESERVED_KERN:
printk(KERN_CONT "usable");
break;
+   case E820_PMEM:
+   case E820_PRAM:
case E820_RESERVED:
printk(KERN_CONT "reserved");
break;
@@ -149,9 +151,6 @@ static void __init e820_print_type(u32 type)
case E820_UNUSABLE:
printk(KERN_CONT "unusable");
break;
-   case E820_PRAM:
-   printk(KERN_CONT "persistent (type %u)", type);
-   break;
default:
printk(KERN_CONT "type %u", type);
break;
@@ -919,10 +918,26 @@ static inline const char *e820_type_to_string(int e820_type)
case E820_NVS:  return "ACPI Non-volatile Storage";
case E820_UNUSABLE: return "Unusable memory";
case E820_PRAM: return "Persistent RAM";
+   case E820_PMEM: return "Persistent I/O Memory";
default:return "reserved";
}
 }
 
+static bool do_mark_busy(u32 type, struct resource *res)
+{
+   if (res->start < (1ULL<<20))
+   return true;
+
+   switch (type) {
+   case E820_RESERVED:
+   case E820_PRAM:
+   case E820_PMEM:
+   return false;
+   default:
+   return true;
+   }
+}
+
 /*
  * Mark e820 reserved areas as busy for the resource manager.
  */
@@ -952,9 +967,7 @@ void

[PATCH 02/21] ND NFIT-Defined/NVIDIMM Subsystem

2015-04-17 Thread Dan Williams

Maintainer information and documentation for drivers/block/nd/

Cc: Andy Lutomirski 
Cc: Boaz Harrosh 
Cc: H. Peter Anvin 
Cc: Jens Axboe 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Cc: Neil Brown 
Cc: Greg KH 
Signed-off-by: Dan Williams 
---
 Documentation/blockdev/nd.txt |  867 +
 MAINTAINERS   |   34 +-
 2 files changed, 895 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/blockdev/nd.txt

diff --git a/Documentation/blockdev/nd.txt b/Documentation/blockdev/nd.txt
new file mode 100644
index ..bcfdf21063ab
--- /dev/null
+++ b/Documentation/blockdev/nd.txt
@@ -0,0 +1,867 @@
+ The NFIT-Defined/NVDIMM Sub-system (ND)
+
+  nd - kernel abi / device-model & ndctl - userspace helper library
+ linux-nvd...@lists.01.org
+v9: April 17th, 2015
+
+
+  Glossary
+
+  Overview
+Supporting Documents
+Git Trees
+
+  NFIT Terminology and NVDIMM Types
+
+  Why BLK?
+PMEM vs BLK (SPA vs BDW)
+  BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+
+  Example NFIT Diagram
+
+  ND Device Model/ABI and NDCTL API
+NDCTL: Context
+  ndctl: instantiate a new library context example
+
+ND/NDCTL: Bus
+  nd: control class device in /sys/class
+  nd: bus layout
+  ndctl: bus enumeration example
+
+ND/NDCTL: DIMM (NMEM)
+  nd: DIMM (NMEM) layout
+  ndctl: DIMM enumeration example
+
+ND/NDCTL: Region
+  nd: region layout
+  ndctl: region enumeration example
+  Why Not Encode the Region Type into the Region Name?
+  How Do I Determine the Major Type of a Region?
+
+ND/NDCTL: Namespace
+  nd: namespace layout
+  ndctl: namespace enumeration example
+  ndctl: namespace creation example
+  Why the Term “namespace”?
+
+ND/NDCTL: Block Translation Table “btt”
+  nd: btt layout
+  ndctl: btt creation example
+
+  Summary NDCTL Diagram
+
+
+Glossary
+
+
+NFIT: NVDIMM Firmware Interface Table
+
+SPA: System Physical Address also refers to an NFIT system-physical
+address table entry describing contiguous persistent memory range.
+
+DPA: DIMM Physical Address, is a DIMM-relative offset.  With one DIMM in
+the system there would be a 1:1 SPA:DPA association.  Once more DIMMs
+are added an interleave-description-table provided by NFIT is needed to
+decode a SPA to a DPA.
+
+DCR: DIMM Control Region Descriptor, an NFIT sub-table entry conveying
+the vendor, format, revision, and geometry of the related
+block-data-windows.
+
+BDW: Block Data Window Region Descriptor, an NFIT sub-table referenced
+by a DCR locating a set of data transfer apertures and control registers
+in system memory.
+
+PMEM: A linux block device which provides access to an SPA range. A PMEM
+device is capable of DAX (see below).
+
+DAX: File system extensions to bypass the page cache and block layer to
+map persistent memory, from a PMEM block device, directly into a process
+address space.
+
+BLK: A linux block device which accesses NVDIMM storage through a BDW
+(block-data-window aperture).  A BLK device is not amenable to DAX.
+
+DSM: Device Specific Method, refers to a runtime service provided by
+platform firmware to send formatted control/configuration messages to a
+DIMM device.  In ACPI this is an _DSM attribute of an object.
+
+BTT: Block Translation Table: Persistent memory is byte addressable.
+Existing software may have an expectation that the power-fail-atomicity
+of writes is at least one sector, 512 bytes.  The BTT is an indirection
+table with atomic update semantics to front a PMEM/BLK block device
+driver and present arbitrary atomic sector sizes.
+
+LABEL: Metadata stored on a DIMM device that partitions and identifies
+(persistently names) storage between PMEM and BLK.  It also partitions
+BLK storage to host BTTs with different parameters per BLK-partition.
+Note that traditional partition tables, GPT/MBR, are layered on top of a
+BLK or PMEM device.
+
+
+
+
+Overview
+
+
+The “NVDIMM Firmware Interface Table” (NFIT) defines a set of tables
+that describe the non-volatile memory resources in a platform.  Platform
+firmware provides this table as well as  runtime-services for sending
+control and configuration messages to capable NVDIMM devices.  NFIT is a
+new top-level table in ACPI 6.  The Linux ND subsystem is designed as a
+generic mechanism that can register a binary NFIT from any provider,
+ACPI being just one example of a provider.  The unit test infrastructure
+in the kernel exploits this capability to provide multiple sample NFITs
+via custom test-platform-devices.
+
+
+Supporting Documents
+ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
+NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
+DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+Driver Writer’s Guide:

[PATCH 06/21] nd: ndctl class device, and nd bus attributes

2015-04-17 Thread Dan Williams

This is the position (device topology) independent method to find all
the NFIT-defined buses in the system.  The expectation is that there
will only ever be one "nd" bus discovered via /sys/class/nd/ndctl0.
However, we allow for the possibility of multiple buses and they will
be listed in discovery order as ndctl0...ndctlN.  This character device
hosts the ioctl for passing control messages (as defined by the NFIT
spec).  The "format" and "revision" attributes of this device identify
the format of the messages.  In the event an NFIT is registered with an
unknown/unsupported control message format then the "format" attribute
will not be visible.
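
From userspace, the discovery scheme described above might be probed
with something like the following sketch.  It assumes the ids are
contiguous starting at ndctl0, which the text suggests but does not
guarantee; ndctl_class_path() and count_nd_buses() are hypothetical
helpers, not part of the ndctl library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Build "/sys/class/nd/ndctl<id>" into buf; 0 on success, -1 on error. */
int ndctl_class_path(int id, char *buf, size_t len)
{
	int n;

	if (id < 0 || len == 0)
		return -1;
	n = snprintf(buf, len, "/sys/class/nd/ndctl%d", id);
	if (n < 0 || (size_t)n >= len)
		return -1;	/* output error or truncated path */
	return 0;
}

/*
 * Count class devices ndctl0, ndctl1, ... up to 'max', stopping at the
 * first id that is missing.  Assumption: ids are handed out contiguously.
 */
int count_nd_buses(int max)
{
	char path[64];
	int id;

	for (id = 0; id < max; id++) {
		if (ndctl_class_path(id, path, sizeof(path)) < 0)
			break;
		if (access(path, F_OK) != 0)
			break;
	}
	return id;
}
```

On a system without the nd subsystem loaded count_nd_buses() simply
returns 0.  A more robust tool would enumerate the /sys/class/nd
directory instead of assuming contiguous ids.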

Cc: Greg KH 
Cc: Neil Brown 
Signed-off-by: Dan Williams 
---
 drivers/block/nd/Makefile |1 
 drivers/block/nd/bus.c|   84 +
 drivers/block/nd/core.c   |   71 ++-
 drivers/block/nd/nd-private.h |5 ++
 4 files changed, 160 insertions(+), 1 deletion(-)
 create mode 100644 drivers/block/nd/bus.c

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index c6bec0c185c5..7772fb599809 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -20,3 +20,4 @@ obj-$(CONFIG_NFIT_ACPI) += nd_acpi.o
 nd_acpi-y := acpi.o
 
 nd-y := core.o
+nd-y += bus.o
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
new file mode 100644
index ..c27db50511f2
--- /dev/null
+++ b/drivers/block/nd/bus.c
@@ -0,0 +1,84 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "nd-private.h"
+#include "nfit.h"
+
+static int nd_major;
+static struct class *nd_class;
+
+int nd_bus_create_ndctl(struct nd_bus *nd_bus)
+{
+   dev_t devt = MKDEV(nd_major, nd_bus->id);
+   struct device *dev;
+
+   dev = device_create(nd_class, &nd_bus->dev, devt, nd_bus, "ndctl%d",
+   nd_bus->id);
+
+   if (IS_ERR(dev)) {
+   dev_dbg(&nd_bus->dev, "failed to register ndctl%d: %ld\n",
+   nd_bus->id, PTR_ERR(dev));
+   return PTR_ERR(dev);
+   }
+   return 0;
+}
+
+void nd_bus_destroy_ndctl(struct nd_bus *nd_bus)
+{
+   device_destroy(nd_class, MKDEV(nd_major, nd_bus->id));
+}
+
+static long nd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+   return -ENXIO;
+}
+
+static const struct file_operations nd_bus_fops = {
+   .owner = THIS_MODULE,
+   .open = nonseekable_open,
+   .unlocked_ioctl = nd_ioctl,
+   .compat_ioctl = nd_ioctl,
+   .llseek = noop_llseek,
+};
+
+int __init nd_bus_init(void)
+{
+   int rc;
+
+   rc = register_chrdev(0, "ndctl", &nd_bus_fops);
+   if (rc < 0)
+   return rc;
+   nd_major = rc;
+
+   nd_class = class_create(THIS_MODULE, "nd");
+   if (IS_ERR(nd_class))
+   goto err_class;
+
+   return 0;
+
+ err_class:
+   unregister_chrdev(nd_major, "ndctl");
+
+   return rc;
+}
+
+void __exit nd_bus_exit(void)
+{
+   class_destroy(nd_class);
+   unregister_chrdev(nd_major, "ndctl");
+}
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index d126799e7ff7..d6a666b9228b 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -14,12 +14,15 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include "nd-private.h"
 #include "nfit.h"
 
+LIST_HEAD(nd_bus_list);
+DEFINE_MUTEX(nd_bus_list_mutex);
 static DEFINE_IDA(nd_ida);
 
 static bool warn_checksum;
@@ -68,6 +71,53 @@ struct nd_bus *to_nd_bus(struct device *dev)
return nd_bus;
 }
 
+static const char *nd_bus_provider(struct nd_bus *nd_bus)
+{
+   struct nfit_bus_descriptor *nfit_desc = nd_bus->nfit_desc;
+   struct device *parent = nd_bus->dev.parent;
+
+   if (nfit_desc->provider_name)
+   return nfit_desc->provider_name;
+   else if (parent)
+   return dev_name(parent);
+   else
+   return "unknown";
+}
+
+static ssize_t provider_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct nd_bus *nd_bus = to_nd_bus(dev);
+
+   return sprintf(buf, "%s\n", nd_bus_provider(nd_bus));
+}
+static DEVICE_ATTR_RO(provider);
+
+static ssize_t revision_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct nd_bus *nd_bus = to_nd_bus(dev);
+

[PATCH 07/21] nd: dimm devices (nfit "memory-devices")

2015-04-17 Thread Dan Williams

Register the dimms described in the nfit as devices on a nd_bus, named
"dimmN" where N is a global ida index.  The dimm numbering per-bus may
appear contiguous, since we only allow a single nd_bus to be registered
at a time.  However, eventually, dimm-hotplug invalidates this
property and dimms should be addressed via NFIT-handle.

Cc: Greg KH 
Cc: Neil Brown 
Signed-off-by: Dan Williams 
---
 drivers/block/nd/Makefile |1 
 drivers/block/nd/bus.c|   62 +-
 drivers/block/nd/core.c   |   55 +
 drivers/block/nd/dimm_devs.c  |  243 +
 drivers/block/nd/nd-private.h |   19 +++
 5 files changed, 373 insertions(+), 7 deletions(-)
 create mode 100644 drivers/block/nd/dimm_devs.c

diff --git a/drivers/block/nd/Makefile b/drivers/block/nd/Makefile
index 7772fb599809..6b34dd4d4df8 100644
--- a/drivers/block/nd/Makefile
+++ b/drivers/block/nd/Makefile
@@ -21,3 +21,4 @@ nd_acpi-y := acpi.o
 
 nd-y := core.o
 nd-y += bus.o
+nd-y += dimm_devs.o
diff --git a/drivers/block/nd/bus.c b/drivers/block/nd/bus.c
index c27db50511f2..e24db67001d0 100644
--- a/drivers/block/nd/bus.c
+++ b/drivers/block/nd/bus.c
@@ -13,18 +13,59 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include "nd-private.h"
 #include "nfit.h"
 
-static int nd_major;
+static int nd_bus_major;
 static struct class *nd_class;
 
+struct bus_type nd_bus_type = {
+   .name = "nd",
+};
+
+static ASYNC_DOMAIN_EXCLUSIVE(nd_async_domain);
+
+static void nd_async_dimm_delete(void *d, async_cookie_t cookie)
+{
+   u32 nfit_handle;
+   struct nd_dimm_delete *del_info = d;
+   struct nd_bus *nd_bus = del_info->nd_bus;
+   struct nd_mem *nd_mem = del_info->nd_mem;
+
+   nfit_handle = readl(&nd_mem->nfit_mem_dcr->nfit_handle);
+
+   mutex_lock(&nd_bus_list_mutex);
+   radix_tree_delete(&nd_bus->dimm_radix, nfit_handle);
+   mutex_unlock(&nd_bus_list_mutex);
+
+   put_device(&nd_bus->dev);
+   kfree(del_info);
+}
+
+void nd_dimm_delete(struct nd_dimm *nd_dimm)
+{
+   struct nd_bus *nd_bus = walk_to_nd_bus(&nd_dimm->dev);
+   struct nd_dimm_delete *del_info = nd_dimm->del_info;
+
+   del_info->nd_bus = nd_bus;
+   get_device(&nd_bus->dev);
+   del_info->nd_mem = nd_dimm->nd_mem;
+   async_schedule_domain(nd_async_dimm_delete, del_info,
+   &nd_async_domain);
+}
+
+void nd_synchronize(void)
+{
+   async_synchronize_full_domain(&nd_async_domain);
+}
+
 int nd_bus_create_ndctl(struct nd_bus *nd_bus)
 {
-   dev_t devt = MKDEV(nd_major, nd_bus->id);
+   dev_t devt = MKDEV(nd_bus_major, nd_bus->id);
struct device *dev;
 
dev = device_create(nd_class, &nd_bus->dev, devt, nd_bus, "ndctl%d",
@@ -40,7 +81,7 @@ int nd_bus_create_ndctl(struct nd_bus *nd_bus)
 
 void nd_bus_destroy_ndctl(struct nd_bus *nd_bus)
 {
-   device_destroy(nd_class, MKDEV(nd_major, nd_bus->id));
+   device_destroy(nd_class, MKDEV(nd_bus_major, nd_bus->id));
 }
 
 static long nd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
@@ -60,10 +101,14 @@ int __init nd_bus_init(void)
 {
int rc;
 
+   rc = bus_register(&nd_bus_type);
+   if (rc)
+   return rc;
+
rc = register_chrdev(0, "ndctl", &nd_bus_fops);
if (rc < 0)
-   return rc;
-   nd_major = rc;
+   goto err_chrdev;
+   nd_bus_major = rc;
 
nd_class = class_create(THIS_MODULE, "nd");
if (IS_ERR(nd_class))
@@ -72,7 +117,9 @@ int __init nd_bus_init(void)
return 0;
 
  err_class:
-   unregister_chrdev(nd_major, "ndctl");
+   unregister_chrdev(nd_bus_major, "ndctl");
+ err_chrdev:
+   bus_unregister(&nd_bus_type);
 
return rc;
 }
@@ -80,5 +127,6 @@ int __init nd_bus_init(void)
 void __exit nd_bus_exit(void)
 {
class_destroy(nd_class);
-   unregister_chrdev(nd_major, "ndctl");
+   unregister_chrdev(nd_bus_major, "ndctl");
+   bus_unregister(&nd_bus_type);
 }
diff --git a/drivers/block/nd/core.c b/drivers/block/nd/core.c
index d6a666b9228b..a0d1623b3641 100644
--- a/drivers/block/nd/core.c
+++ b/drivers/block/nd/core.c
@@ -29,6 +29,24 @@ static bool warn_checksum;
 module_param(warn_checksum, bool, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(warn_checksum, "Turn checksum errors into warnings");
 
+/**
+ * nd_dimm_by_handle - lookup an nd_dimm by its corresponding nfit_handle
+ * @nd_bus: parent bus of the dimm
+ * @nfit_handle: handle from the memory-device-to-spa (nfit_mem) structure
+ *
+ * LOCKING: expect nd_bus_list_mutex() held at entry
+ */
+struct nd_dimm *nd_dimm_by_handle(struct nd_bus *nd_bus, u32 nfit_handle)
+{
+   struct nd_dimm *nd_dimm;
+
+   WARN_ON_ONCE(!mutex_is_locked(&nd_bus_list_mutex));
+   nd_dimm = radix_tree_lookup(&nd_bus->dimm_radix, nfit_handle);
+   if (nd_dimm)
+   get_device(&nd_dimm->dev);
+   return nd_dimm;
+}
+
 static void nd_bus_release(struct

[PATCH 00/21] ND: NFIT-Defined / NVDIMM Subsystem

2015-04-17 Thread Dan Williams

Since 2010 Intel has included non-volatile memory support on a few
storage-focused platforms with a feature named ADR (Asynchronous DRAM
Refresh).  These platforms were mostly targeted at custom applications
and never enjoyed standard discovery mechanisms for platform firmware
to advertise non-volatile memory capabilities.  This now changes with
the publication of version 6 of the ACPI specification [1] and its
inclusion of a new table for describing platform memory capabilities.
The NVDIMM Firmware Interface Table (NFIT), along with new EFI and E820
memory types, enumerates persistent memory ranges, memory-mapped-I/O
apertures, physical memory devices (DIMMs), and their associated
properties.

The ND-subsystem wraps a Linux device driver model around the objects
and address boundaries defined in the specification and introduces 3 new
drivers.

  nd_pmem: NFIT enabled version of the existing 'pmem' driver [2]
  nd_blk: mmio aperture method for accessing persistent storage
  nd_btt: give persistent memory disk semantics (atomic sector update)

See the documentation in patch2 for more details, and there is
supplemental documentation on pmem.io [4].  Please review, and
patches welcome...

For kicking the tires, this release is accompanied by a userspace
management library 'ndctl' [3] that includes unit tests (make check) for all
of the kernel ABIs.  The nfit_test.ko module can be used to explore a
sample NFIT topology.

[1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
[2]: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/log/?h=x86/pmem
[3]: https://github.com/pmem/ndctl
[4]: http://pmem.io/documents/

--
Dan for the NFIT driver development team Andy Rudoff, Matthew Wilcox, Ross
Zwisler, and Vishal Verma


---

Dan Williams (19):
  e820, efi: add ACPI 6.0 persistent memory types
  ND NFIT-Defined/NVIDIMM Subsystem
  nd_acpi: initial core implementation and nfit skeleton
  nd: create an 'nd_bus' from an 'nfit_desc'
  nfit-test: manufactured NFITs for interface development
  nd: ndctl class device, and nd bus attributes
  nd: dimm devices (nfit "memory-devices")
  nd: ndctl.h, the nd ioctl abi
  nd_dimm: dimm driver and base nd-bus device-driver infrastructure
  nd: regions (block-data-window, persistent memory, volatile memory)
  nd_region: support for legacy nvdimms
  nd_pmem: add NFIT support to the pmem driver
  nd: add interleave-set state-tracking infrastructure
  nd: namespace indices: read and validate
  nd: pmem label sets and namespace instantiation.
  nd: blk labels and namespace instantiation
  nd: write pmem label set
  nd: write blk label set
  nd: infrastructure for btt devices

Ross Zwisler (1):
  nd_blk: nfit blk driver

Vishal Verma (1):
  nd_btt: atomic sector updates


 Documentation/blockdev/btt.txt    |  273 ++
 Documentation/blockdev/nd.txt     |  867 +++
 MAINTAINERS                       |   34 +
 arch/arm64/kernel/efi.c           |    1
 arch/ia64/kernel/efi.c            |    1
 arch/x86/boot/compressed/eboot.c  |    4
 arch/x86/include/uapi/asm/e820.h  |    1
 arch/x86/kernel/e820.c            |   25 -
 arch/x86/platform/efi/efi.c       |    3
 drivers/block/Kconfig             |   13
 drivers/block/Makefile            |    2
 drivers/block/nd/Kconfig          |  130 +++
 drivers/block/nd/Makefile         |   39 +
 drivers/block/nd/acpi.c           |  443 ++
 drivers/block/nd/blk.c            |  269 ++
 drivers/block/nd/btt.c            | 1423 +++
 drivers/block/nd/btt.h            |  185
 drivers/block/nd/btt_devs.c       |  443 ++
 drivers/block/nd/bus.c            |  703 +++
 drivers/block/nd/core.c           |  963 +
 drivers/block/nd/dimm.c           |  126 +++
 drivers/block/nd/dimm_devs.c      |  701 +++
 drivers/block/nd/label.c          |  925
 drivers/block/nd/label.h          |  143 +++
 drivers/block/nd/namespace_devs.c | 1697 +
 drivers/block/nd/nd-private.h     |  203
 drivers/block/nd/nd.h             |  310 +++
 drivers/block/nd/nfit.h           |  238 +
 drivers/block/nd/pmem.c           |  122 ++-
 drivers/block/nd/region.c         |   95 ++
 drivers/block/nd/region_devs.c    | 1196 ++
 drivers/block/nd/test/Makefile    |    5
 drivers/block/nd/test/iomap.c     |  199
 drivers/block/nd/test/nfit.c      | 1018 ++
 drivers/block/nd/test/nfit_test.h |   37 +
 include/linux/efi.h               |    3
 include/linux/nd.h                |   98 ++
 include/uapi/linux/Kbuild         |    1
 include/uapi/linux/ndctl.h        |  199
 39 files changed, 13102 insertions(+), 36 deletions(-)
 create mode 100644 Documentation/blockdev/btt.txt
 create mode 100644 Documentation/blockdev/nd.txt
 create mode 100644

[PATCH] x86: Reset FPU on exec

2015-04-17 Thread Andi Kleen

From: Andi Kleen 

Currently we don't reset FPU state on exec. This can be seen as a
(minor) security issue. The bigger issue however is that the
AVX state also does not get reset. So a program that uses SSE
without VZEROUPPER may get a large penalty.

Always set the FPU to the init state at exec time.

For the eager FPU case this restores the init state,
for non eager it forces an init on the next FPU use.

Signed-off-by: Andi Kleen 
---
 arch/x86/include/asm/elf.h | 4 
 arch/x86/kernel/xsave.c| 5 +
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index ca3347a..56ab629 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -90,6 +90,8 @@ extern unsigned int vdso32_enabled;
 
 #include 
 
+extern void reset_fpu(void);
+
 #ifdef CONFIG_X86_32
 #include 
 
@@ -110,6 +112,7 @@ extern unsigned int vdso32_enabled;
_r->bx = 0; _r->cx = 0; _r->dx = 0; \
_r->si = 0; _r->di = 0; _r->bp = 0; \
_r->ax = 0; \
+   reset_fpu();\
 } while (0)
 
 /*
@@ -178,6 +181,7 @@ static inline void elf_common_init(struct thread_struct *t,
t->fs = t->gs = 0;
t->fsindex = t->gsindex = 0;
t->ds = t->es = ds;
+   reset_fpu();
 }
 
 #define ELF_PLAT_INIT(_r, load_addr)   \
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index cdc6cf9..520e505 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -741,3 +741,8 @@ void *get_xsave_addr(struct xsave_struct *xsave, int xstate)
return (void *)xsave + xstate_comp_offsets[feature];
 }
 EXPORT_SYMBOL_GPL(get_xsave_addr);
+
+void reset_fpu(void)
+{
+   drop_init_fpu(current);
+}
-- 
1.9.3


Re: [PATCH V2 1/6] perf,core: allow invalid context events to be part of sw/hw groups

2015-04-17 Thread Andi Kleen

> ... which would give you arbitrary skew, because one counter is
> free-running and the other is not (when scheduling a context in or out we stop
> the PMU)

Everyone just reads the counter and subtracts it from
the last value they've seen.

That's the same way any other shared free-running counter works,
such as the standard timer.
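
To illustrate the read-and-subtract scheme described above, here is a
minimal userspace sketch. It is not perf code: `struct counter_reader` and
`counter_delta()` are hypothetical names, and the raw counter value is
passed in as an argument rather than read from hardware.

```c
#include <stdint.h>

/* Hypothetical per-reader state: the last raw value this reader observed. */
struct counter_reader {
	uint64_t last;
};

/*
 * Return how far the shared free-running counter advanced since this
 * reader last looked, and remember the new raw value. Unsigned
 * subtraction stays correct across a single wraparound of the counter.
 */
static uint64_t counter_delta(struct counter_reader *r, uint64_t raw_now)
{
	uint64_t delta = raw_now - r->last;

	r->last = raw_now;
	return delta;
}
```

Each reader keeps its own `last` value, so nobody has to stop or reset the
shared counter; only the differences are ever interpreted.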

The only thing that perf needs to enforce is that the counters
are running with the same event. 

It also wouldn't work for sampling, but the uncore doesn't do
sampling anyways.

> From my PoV that violates group semantics, because now the events aren't
> always counting at the same time (which would be the reason I grouped
> them in the first place).

You never use the absolute value, just differences. The differences 
effectively count only when the group runs.

> However, it is the case that you cannot offer group semantics.

I don't think that's true.

-Andi

[PATCH v3] rtc: add rtc-abx80x, a driver for the Abracon AB x80x i2c rtc

2015-04-17 Thread Alexandre Belloni

From: Philippe De Muyter 

This is a basic driver for the ultra-low-power Abracon AB x80x series of RTC
chips. It supports in particular, the supersets AB0805 and AB1805.
It allows reading and writing the time, and enables the supercapacitor/
battery charger.

[a...@arndb.de: abx805 depends on i2c]
Signed-off-by: Philippe De Muyter 
Cc: Alessandro Zummo 
Signed-off-by: Alexandre Belloni 
Signed-off-by: Arnd Bergmann 
Signed-off-by: Andrew Morton 
---
Changes in v3:
 - renamed buffer from date to buf in abx80x_rtc_read_time()


 drivers/rtc/Kconfig  |  10 ++
 drivers/rtc/Makefile |   1 +
 drivers/rtc/rtc-abx80x.c | 307 +++
 3 files changed, 318 insertions(+)
 create mode 100644 drivers/rtc/rtc-abx80x.c

diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig
index b5b5c3d485d6..89507c1858ec 100644
--- a/drivers/rtc/Kconfig
+++ b/drivers/rtc/Kconfig
@@ -164,6 +164,16 @@ config RTC_DRV_ABB5ZES3
  This driver can also be built as a module. If so, the module
  will be called rtc-ab-b5ze-s3.
 
+config RTC_DRV_ABX80X
+   tristate "Abracon ABx80x"
+   help
+ If you say yes here you get support for Abracon AB080X and AB180X
+ families of ultra-low-power  battery- and capacitor-backed real-time
+ clock chips.
+
+ This driver can also be built as a module. If so, the module
+ will be called rtc-abx80x.
+
 config RTC_DRV_AS3722
tristate "ams AS3722 RTC driver"
depends on MFD_AS3722
diff --git a/drivers/rtc/Makefile b/drivers/rtc/Makefile
index 69c87062b098..7b20b0462cb6 100644
--- a/drivers/rtc/Makefile
+++ b/drivers/rtc/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_RTC_DRV_88PM80X) += rtc-88pm80x.o
 obj-$(CONFIG_RTC_DRV_AB3100)   += rtc-ab3100.o
 obj-$(CONFIG_RTC_DRV_AB8500)   += rtc-ab8500.o
 obj-$(CONFIG_RTC_DRV_ABB5ZES3) += rtc-ab-b5ze-s3.o
+obj-$(CONFIG_RTC_DRV_ABX80X)   += rtc-abx80x.o
 obj-$(CONFIG_RTC_DRV_ARMADA38X)+= rtc-armada38x.o
 obj-$(CONFIG_RTC_DRV_AS3722)   += rtc-as3722.o
 obj-$(CONFIG_RTC_DRV_AT32AP700X)+= rtc-at32ap700x.o
diff --git a/drivers/rtc/rtc-abx80x.c b/drivers/rtc/rtc-abx80x.c
new file mode 100644
index ..4337c3bc6ace
--- /dev/null
+++ b/drivers/rtc/rtc-abx80x.c
@@ -0,0 +1,307 @@
+/*
+ * A driver for the I2C members of the Abracon AB x8xx RTC family,
+ * and compatible: AB 1805 and AB 0805
+ *
+ * Copyright 2014-2015 Macq S.A.
+ *
+ * Author: Philippe De Muyter 
+ * Author: Alexandre Belloni 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#define ABX8XX_REG_HTH 0x00
+#define ABX8XX_REG_SC  0x01
+#define ABX8XX_REG_MN  0x02
+#define ABX8XX_REG_HR  0x03
+#define ABX8XX_REG_DA  0x04
+#define ABX8XX_REG_MO  0x05
+#define ABX8XX_REG_YR  0x06
+#define ABX8XX_REG_WD  0x07
+
+#define ABX8XX_REG_CTRL1   0x10
+#define ABX8XX_CTRL_WRITE  BIT(1)
+#define ABX8XX_CTRL_12_24  BIT(6)
+
+#define ABX8XX_REG_CFG_KEY 0x1f
+#define ABX8XX_CFG_KEY_MISC0x9d
+
+#define ABX8XX_REG_ID0 0x28
+
+#define ABX8XX_REG_TRICKLE 0x20
+#define ABX8XX_TRICKLE_CHARGE_ENABLE   0xa0
+#define ABX8XX_TRICKLE_STANDARD_DIODE  0x8
+#define ABX8XX_TRICKLE_SCHOTTKY_DIODE  0x4
+
+static u8 trickle_resistors[] = {0, 3, 6, 11};
+
+enum abx80x_chip {AB0801, AB0803, AB0804, AB0805,
+   AB1801, AB1803, AB1804, AB1805, ABX80X};
+
+struct abx80x_cap {
+   u16 pn;
+   bool has_tc;
+};
+
+static struct abx80x_cap abx80x_caps[] = {
+   [AB0801] = {.pn = 0x0801},
+   [AB0803] = {.pn = 0x0803},
+   [AB0804] = {.pn = 0x0804, .has_tc = true},
+   [AB0805] = {.pn = 0x0805, .has_tc = true},
+   [AB1801] = {.pn = 0x1801},
+   [AB1803] = {.pn = 0x1803},
+   [AB1804] = {.pn = 0x1804, .has_tc = true},
+   [AB1805] = {.pn = 0x1805, .has_tc = true},
+   [ABX80X] = {.pn = 0}
+};
+
+static struct i2c_driver abx80x_driver;
+
+static int abx80x_enable_trickle_charger(struct i2c_client *client,
+u8 trickle_cfg)
+{
+   int err;
+
+   /*
+* Write the configuration key register to enable access to the Trickle
+* register
+*/
+   err = i2c_smbus_write_byte_data(client, ABX8XX_REG_CFG_KEY,
+   ABX8XX_CFG_KEY_MISC);
+   if (err < 0) {
+   dev_err(&client->dev, "Unable to write configuration key\n");
+   return -EIO;
+   }
+
+   err = i2c_smbus_write_byte_data(client, ABX8XX_REG_TRICKLE,
+   ABX8XX_TRICKLE_CHARGE_ENABLE |
+   trickle_cfg);
+   if (err < 0) {
+   dev_err(&client->dev, "Unable to write trickle register\n");
+   return -EIO;
+   }
+
+

Re: Is it OK to export symbols 'getname' and 'putname'?

2015-04-17 Thread Boqun Feng

On Fri, Apr 17, 2015 at 08:35:30PM +0800, Boqun Feng wrote:
> Hi Al,
> 
> On Sun, Apr 12, 2015 at 02:13:18AM +0100, Al Viro wrote:
> > 
> > BTW, looking at the __getname() callers...  Lustre one sure as hell looks
> > bogus:
> > char *tmp = __getname();
> > 
> > if (!tmp)
> > return ERR_PTR(-ENOMEM);
> > 
> > len = strncpy_from_user(tmp, filename, PATH_MAX);
> > if (len == 0)
> > ret = -ENOENT;
> > else if (len > PATH_MAX)
> > ret = -ENAMETOOLONG;
> > 
> > if (ret) {
> > __putname(tmp);
> > tmp =  ERR_PTR(ret);
> > }
> > return tmp;
> > 
> > Note that
> > * strncpy_from_user(p, u, n) can return a negative (-EFAULT)
> > * strncpy_from_user(p, u, n) cannot return a positive greater than
> > n.  In case of missing NUL among the n bytes starting at u (and no faults
> > accessing those) we get exactly n bytes copied and n returned.  In case
> > when NUL is there, we copy everything up to and including that NUL and
> > return number of non-NUL bytes copied.
> > 
> > IOW, these failure cases had never been tested.  Name being too long ends up
> > with non-NUL-terminated array of characters returned, and the very first
> > caller of ll_getname() feeds it to strlen().  Fault ends up with 
> > uninitialized
> > array...
> > 
> > AFAICS, the damn thing should just use getname() and quit reinventing the
> > wheel, badly.
> > 
> 
> I'm trying to clean that part of code you mentioned, and I found I have
> to export the symbols 'getname' and 'putname' as follow to replace that
> __getname() caller:
> 
> diff --git a/drivers/staging/lustre/lustre/llite/dir.c 
> b/drivers/staging/lustre/lustre/llite/dir.c
> index a182019..014f51a 100644
> --- a/drivers/staging/lustre/lustre/llite/dir.c
> +++ b/drivers/staging/lustre/lustre/llite/dir.c
> @@ -1216,29 +1216,8 @@ out:
>   return rc;
>  }
>  
> -static char *
> -ll_getname(const char __user *filename)
> -{
> - int ret = 0, len;
> - char *tmp = __getname();
> -
> - if (!tmp)
> - return ERR_PTR(-ENOMEM);
> -
> - len = strncpy_from_user(tmp, filename, PATH_MAX);
> - if (len == 0)
> - ret = -ENOENT;
> - else if (len > PATH_MAX)
> - ret = -ENAMETOOLONG;
> -
> - if (ret) {
> - __putname(tmp);
> - tmp =  ERR_PTR(ret);
> - }
> - return tmp;
> -}
> -
> -#define ll_putname(filename) __putname(filename)
> +#define ll_getname(filename) getname(filename)
> +#define ll_putname(name) putname(name)
>  
>  static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long 
> arg)
>  {
> @@ -1441,6 +1420,7 @@ free_lmv:
>   return rc;
>   }
>   case LL_IOC_REMOVE_ENTRY: {
> + struct filename *name = NULL;
>   char*filename = NULL;
>   int  namelen = 0;
>   int  rc;
> @@ -1453,20 +1433,17 @@ free_lmv:
>   if (!(exp_connect_flags(sbi->ll_md_exp) & OBD_CONNECT_LVB_TYPE))
>   return -ENOTSUPP;
>  
> - filename = ll_getname((const char *)arg);
> - if (IS_ERR(filename))
> - return PTR_ERR(filename);
> + name = ll_getname((const char *)arg);
> + if (IS_ERR(name))
> + return PTR_ERR(name);
>  
> + filename = name->name;
>   namelen = strlen(filename);
> - if (namelen < 1) {
> - rc = -EINVAL;
> - goto out_rmdir;
> - }
>  
>   rc = ll_rmdir_entry(inode, filename, namelen);
>  out_rmdir:
> - if (filename)
> - ll_putname(filename);
> + if (name)
> + ll_putname(name);
>   return rc;
>   }
>   case LL_IOC_LOV_SWAP_LAYOUTS:
> @@ -1481,15 +1458,17 @@ out_rmdir:
>   struct lov_user_md *lump;
>   struct lov_mds_md *lmm = NULL;
>   struct mdt_body *body;
> + struct filename *name;
>   char *filename = NULL;
>   int lmmsize;
>  
>   if (cmd == IOC_MDC_GETFILEINFO ||
>   cmd == IOC_MDC_GETFILESTRIPE) {
> - filename = ll_getname((const char *)arg);
> - if (IS_ERR(filename))
> - return PTR_ERR(filename);
> + name = ll_getname((const char *)arg);
> + if (IS_ERR(name))
> + return PTR_ERR(name);
>  
> + filename = name->name;
>   rc = ll_lov_getstripe_ea_info(inode, filename, ,
> , );
>   } else {

Sorry.. one modification is missing here:

@@ -1535,8 +1535,8 @@ skip_lmm:
 
 out_req:
ptlrpc_req_finished(request);
-

[PATCH] sound/oss: fix deadlock in sequencer_ioctl(SNDCTL_SEQ_OUTOFBAND)

2015-04-17 Thread Alexey Khoroshilov

A deadlock can be initiated by userspace via ioctl(SNDCTL_SEQ_OUTOFBAND)
on /dev/sequencer with TMR_ECHO midi event.

In this case the control flow is:
sound_ioctl()
-> case SND_DEV_SEQ:
   case SND_DEV_SEQ2:
 sequencer_ioctl()
 -> case SNDCTL_SEQ_OUTOFBAND:
  spin_lock_irqsave(,flags);
  play_event();
  -> case EV_TIMING:
   seq_timing_event()
   -> case TMR_ECHO:
seq_copy_to_input()
-> spin_lock_irqsave(,flags);

It seems that spin_lock_irqsave() around play_event() is not necessary,
because the only other call location in seq_startplay() makes the call
without acquiring spinlock.
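
The shape of the problem can be shown with a toy model. This is only a
sketch, not the driver code: `struct toy_lock` and every function below are
hypothetical, and "acquire fails" stands in for "a real non-recursive
spinlock would spin forever".

```c
#include <stdbool.h>

/* Toy stand-in for a non-recursive spinlock (hypothetical, not kernel code). */
struct toy_lock {
	bool held;
};

static struct toy_lock seq_lock;

/* Returns false where a real spinlock would spin forever on the same CPU. */
static bool toy_lock_acquire(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void toy_lock_release(struct toy_lock *l)
{
	l->held = false;
}

/* Plays the role of seq_copy_to_input(): it takes the lock itself. */
static bool copy_to_input(void)
{
	if (!toy_lock_acquire(&seq_lock))
		return false;	/* self-deadlock in the real driver */
	toy_lock_release(&seq_lock);
	return true;
}

/* Old SNDCTL_SEQ_OUTOFBAND shape: lock held around the whole call chain. */
static bool outofband_locked(void)
{
	bool ok;

	toy_lock_acquire(&seq_lock);
	ok = copy_to_input();
	toy_lock_release(&seq_lock);
	return ok;
}

/* Patched shape: the caller no longer takes the lock. */
static bool outofband_unlocked(void)
{
	return copy_to_input();
}
```

In this model outofband_locked() reports failure because the inner acquire
finds the lock already held, while outofband_unlocked() succeeds.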

So, the patch just removes spinlocks around play_event().
By the way, it removes unreachable code in seq_timing_event(),
since (seq_mode == SEQ_2) case is handled in the beginning.

Compile tested only.

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov 
---
 sound/oss/sequencer.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/sound/oss/sequencer.c b/sound/oss/sequencer.c
index c0eea1dfe90f..f19da4b47c1d 100644
--- a/sound/oss/sequencer.c
+++ b/sound/oss/sequencer.c
@@ -681,13 +681,8 @@ static int seq_timing_event(unsigned char *event_rec)
break;
 
case TMR_ECHO:
-   if (seq_mode == SEQ_2)
-   seq_copy_to_input(event_rec, 8);
-   else
-   {
-   parm = (parm << 8 | SEQ_ECHO);
-   seq_copy_to_input((unsigned char *) , 4);
-   }
+   parm = (parm << 8 | SEQ_ECHO);
+   seq_copy_to_input((unsigned char *) , 4);
break;
 
default:;
@@ -1324,7 +1319,6 @@ int sequencer_ioctl(int dev, struct file *file, unsigned 
int cmd, void __user *a
int mode = translate_mode(file);
struct synth_info inf;
struct seq_event_rec event_rec;
-   unsigned long flags;
int __user *p = arg;
 
orig_dev = dev = dev >> 4;
@@ -1479,9 +1473,7 @@ int sequencer_ioctl(int dev, struct file *file, unsigned 
int cmd, void __user *a
case SNDCTL_SEQ_OUTOFBAND:
if (copy_from_user(_rec, arg, sizeof(event_rec)))
return -EFAULT;
-   spin_lock_irqsave(,flags);
play_event(event_rec.arr);
-   spin_unlock_irqrestore(,flags);
return 0;
 
case SNDCTL_MIDI_INFO:
-- 
1.9.1


Re: [PATCH 3/6] direct-io: add support for write stream IDs

2015-04-17 Thread Dave Chinner

On Fri, Apr 17, 2015 at 05:11:40PM -0600, Jens Axboe wrote:
> On 04/17/2015 05:06 PM, Dave Chinner wrote:
> >On Thu, Apr 16, 2015 at 11:20:45PM -0700, Ming Lin wrote:
> >>On Sat, Apr 11, 2015 at 4:59 AM, Dave Chinner  wrote:
> >>>On Fri, Apr 10, 2015 at 04:50:05PM -0700, Ming Lin wrote:
> On Wed, Mar 25, 2015 at 7:26 AM, Jens Axboe  wrote:
> >>If iocb->ki_filp->f_streamid is not set, then it should fall back to
> >>whatever is set on the inode->i_streamid.
> 
> Why should do the fall back?
> >>>
> >>>Because then you have a method of using streams with applications
> >>>that aren't aware of streams.
> >>>
> >>>Or perhaps you have a file you know has different access patterns to
> >>>the rest of the files in a directory, and you don't want to have to
> >>>set the stream on every process that opens and uses that file. e.g.
> >>>database writeahead log files (sequential write, never read) vs
> >>>database index/table files (random read/write).
> >>>
> >Good point, agree. Will make that change.
> 
> That change causes problem for direct IO, for example
> 
> process 1:
> fd = open("/dev/nvme0n1", O_DIRECT...);
> //set stream_id 1
> fadvise(fd, 1, 0, POSIX_FADV_STREAMID);
> pwrite(fd, );
> 
> process 2:
> fd = open("/dev/nvme0n1", O_DIRECT...);
> //should be legacy stream_id 0
> pwrite(fd, );
> 
> But now process 2 also see stream_id 1, which is wrong.
> >>>
> >>>It's not wrong, your behaviour model is just different. You have
> >>>defined a process/fd based stream model and not considered
> >>>that admins and applications might want to use a file
> >>>based stream model instead, so applications don't need to even be
> >>>aware that write streams are in use...
> >>
> >>The stream must be opened, otherwise device will return error if application
> >>write to a not-opened stream.
> >
> >That's an extremely device specific *implementation* of a write
> >stream. The *concept* of a write stream being passed from userspace to
> >the block layer doesn't have such constraints, and I get really
> >concerned when implementations of a generic concept are so tightly
> >focussed around one type of hardware implementation of the
> >concept...
> 
> Indeed, which is why the implementation posted cares ONLY about the
> stream ID itself, and passing that through.
> 
> But the point about fallback is valid, however, for some use cases
> that will not be what you want. But we have to make some sort of
> decision, and falling back to the inode set value (if one is set) is
> probably the right thing to do in most use cases.

Right, the question is then whether fadvise should set the value on
the inode at all, because then the effect of setting it on a fd also
changes the fallback. Perhaps we need a distinction between
"setting the stream for this fd" which lasts as long as the fd is
active, and "setting the default inode stream" which is potentially
a persistent operation if the filesystem stores it on disk...
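
A rough sketch of that distinction, under stated assumptions: the structs
below are hypothetical stand-ins for struct file and struct inode, a stream
ID of 0 is taken to mean "not set", and the two setter names are invented
for illustration. None of this is the posted patch.

```c
/* Hypothetical stand-ins; a stream ID of 0 means "no stream set". */
struct toy_inode {
	unsigned int i_streamid;	/* default for every opener of the file */
};

struct toy_file {
	unsigned int f_streamid;	/* lives only as long as this open fd */
	struct toy_inode *inode;
};

/* "Setting the stream for this fd": does not touch the inode. */
static void set_fd_stream(struct toy_file *f, unsigned int id)
{
	f->f_streamid = id;
}

/* "Setting the default inode stream": seen by every fd without its own ID. */
static void set_inode_default_stream(struct toy_inode *i, unsigned int id)
{
	i->i_streamid = id;
}

/* Per-fd value wins; otherwise fall back to the inode default. */
static unsigned int effective_streamid(const struct toy_file *f)
{
	if (f->f_streamid)
		return f->f_streamid;
	return f->inode->i_streamid;
}
```

With the two operations kept separate, a second opener of the same file
keeps stream 0 after the first opener sets a per-fd stream, and only picks
up a non-zero ID once someone sets the inode default.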

> >>Device has limited number of streams, for example, 16 streams.
> >>There are 2 APIs to open/close the stream.
> >
> >What's to stop me writing something for DM-thinp that understands
> >write streams in bios and uses it to separate out the write streams
> >into different regions of the thinp device to improve locality of
> >its data placement and hence reduce fragmentation?
> 
> Absolutely nothing, in fact that's one of the use cases that I had
> in mind. Or for caching software.

*nod*. We are on the same page, then :)

Cheers,

Dave.
> 
> -- 
> Jens Axboe
> 
> 

-- 
Dave Chinner
da...@fromorbit.com

RE: [PATCH] rcu: small rcu_dereference doc update

2015-04-17 Thread Jeff Haran

> -Original Message-
> From: Paul E. McKenney [mailto:paul...@linux.vnet.ibm.com]
> Sent: Friday, April 17, 2015 11:41 AM
> To: Jeff Haran
> Cc: Milos Vyletel; Josh Triplett; Steven Rostedt; Mathieu Desnoyers; Lai
> Jiangshan; Jonathan Corbet; open list:READ-COPY UPDATE...; open
> list:DOCUMENTATION
> Subject: Re: [PATCH] rcu: small rcu_dereference doc update
> 
> On Fri, Apr 17, 2015 at 04:53:15PM +, Jeff Haran wrote:
> > > -Original Message-
> > > From: Paul E. McKenney [mailto:paul...@linux.vnet.ibm.com]
> > > Sent: Friday, April 17, 2015 7:07 AM
> > > To: Milos Vyletel
> > > Cc: Josh Triplett; Steven Rostedt; Mathieu Desnoyers; Lai Jiangshan;
> > > Jonathan Corbet; open list:READ-COPY UPDATE...; open
> > > list:DOCUMENTATION; Jeff Haran
> > > Subject: Re: [PATCH] rcu: small rcu_dereference doc update
> > >
> > > On Fri, Apr 17, 2015 at 12:33:36PM +0200, Milos Vyletel wrote:
> > > > Make a note stating that repeated calls of rcu_dereference() may
> > > > not return the same pointer if update happens while in critical section.
> > > >
> > > > Reported-by: Jeff Haran 
> > > > Signed-off-by: Milos Vyletel 
> > >
> > > Hmmm...  Seems like that should be obvious, but on the other hand, I
> > > have been using RCU for more than twenty years, so my obviousness
> > > sensors might need recalibration.
> > >
> > > Queued for 4.2.
> > >
> > >   Thanx, Paul
> >
> > It's just that the original text suggests repeated rcu_dereference() calls 
> > are
> discouraged because they are ugly and not efficient on some architectures.
> When I read that I concluded that those were the only reasons not to do it,
> that despite the possible inefficiency it would always return the same
> pointer. Depending on how one's code is structured, being able to do this
> could be advantageous. Then I started looking at the code that implements it
> and I couldn't see how it could possibly be the case. I even wrote a little
> kernel module to prove to myself that doing this could return different
> pointer values. If I misinterpreted the original text I figured others might 
> also.
> Milos even found some code in the kernel where it's author had done this,
> so it might be a widely held misunderstanding. It's easy for people who have
> worked with rwlock_ts to think an RCU read lock works the same.
> 
> Fair point, and thank you for the rationale!  Are there any other parts of the RCU
> documentation that are similarly blind to your initial point of view?  If so, 
> it
> would be good for them to be fixed.
> 
>   Thanx, Paul

I can't think of much off the top of my head, but I'm hoping I might get some 
time to review it again and perhaps provide some more concrete suggestions.

One thing that does come to mind is the article you wrote in LWN, 
http://lwn.net/Articles/263130/, where you discussed RCU as a reader-writer lock 
replacement. whatisRCU.txt seems to have incorporated some of that. Something along 
the lines of the original section in the LWN article where there was some 
discussion of the differences between a rwlock_t read lock critical section and 
a RCU read lock critical section might be beneficial, a key thing being that 
with RCU there really is no locking, the value of the pointer can change in a 
RCU critical section because writers aren't blocked from updating it. Another 
thing might be some discussion that the cases where you'd call read_lock_bh() 
are way different than when you'd call rcu_read_lock_bh() and as a corollary 
why there is no rcu_read_lock_irq().
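
As a small illustration of the "writers aren't blocked" point, here is a
userspace sketch using C11 atomics. It is not the kernel RCU API:
`toy_publish()` and `toy_dereference()` are hypothetical stand-ins for
rcu_assign_pointer() and rcu_dereference(), and there is no grace-period
handling at all.

```c
#include <stdatomic.h>

struct cfg {
	int value;
};

/* The shared pointer that readers load and writers replace. */
static _Atomic(struct cfg *) shared_cfg;

/* Writer side: publish a new version; readers are never waited for. */
static void toy_publish(struct cfg *newcfg)
{
	atomic_store_explicit(&shared_cfg, newcfg, memory_order_release);
}

/* Reader side: every call loads whatever pointer is current right now. */
static struct cfg *toy_dereference(void)
{
	return atomic_load_explicit(&shared_cfg, memory_order_acquire);
}

/*
 * Reader pattern that stays consistent: load the pointer once and use
 * only the local copy for the rest of the read-side section.
 */
static int read_value_once(void)
{
	struct cfg *p = toy_dereference();

	return p ? p->value : -1;
}
```

If a writer calls toy_publish() between two toy_dereference() calls in the
same read-side section, the two calls return different pointers, which is
why the single-load pattern in read_value_once() is the safer habit.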

To me it seems that the names of some of these functions are perhaps 
misleading. rcu_read_lock() sort of implies there is some locking going on when 
there isn't. It might have been easier to understand if rcu_read_lock() was 
called rcu_get() and rcu_read_unlock() was called rcu_put() to reflect that 
they are really as much about memory management as synchronization. Too late 
to change any of that, obviously.

Thanks,

Jeff Haran


Re: 4.0 kernel XFS filesystem crash when running AIM7's disk workload

2015-04-17 Thread Dave Chinner

On Fri, Apr 17, 2015 at 01:38:49PM -0400, Waiman Long wrote:
> Hi Dave,
> 
> When I was running AIM7's disk workload on an 8-socket
> Westmere-EX server with a 4.0 kernel, the kernel crashed. A set of small
> ramdisks were created (ramdisk_size=271072). Those ramdisks were
> formatted with XFS filesystem before the test began. The kernel log
> was:
> 
> XFS (ram12): Mounting V4 Filesystem
> XFS (ram12): Log size 1424 blocks too small, minimum size is 1596 blocks
> XFS (ram12): Log size out of supported range. Continuing onwards,
> but if log hangs are
> experienced then please report this message in the bug report.

First thing you need to do is upgrade xfsprogs so that this message
goes away, or use "mkfs.xfs -l size=10m" so that the log is larger
than the minimum.

> XFS (ram15): Ending clean mount
> BUG: unable to handle kernel NULL pointer dereference at   (null)
> IP: [] __memcpy+0xd/0x110
> PGD 29f7655f067 PUD 29f75a80067 PMD 0
> Oops:  [#1] SMP
> Modules linked in: xfs exportfs libcrc32c ebtable_nat ebtables
> xt_CHECKSUM iptable_mangle bridge stp llc autofs4 ipt_REJECT
> nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter
> ip_tables ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
> nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6
> vhost_net macvtap macvlan vhost tun kvm_intel kvm ipmi_si
> ipmi_msghandler tpm_infineon iTCO_wdt iTCO_vendor_support wmi
> acpi_cpufreq microcode pcspkr serio_raw qlcnic be2net vxlan
> udp_tunnel ip6_udp_tunnel ses enclosure igb dca ptp pps_core lpc_ich
> mfd_core hpilo hpwdt sg i7core_edac edac_core netxen_nic ext4(E)
> jbd2(E) mbcache(E) sr_mod(E) cdrom(E) sd_mod(E) lpfc(E) qla2xxx(E)
> scsi_transport_fc(E) pata_acpi(E) ata_generic(E) ata_piix(E) hpsa(E)
> radeon(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E)
> i2c_core(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E)

Why do you have a mix of signed and unsigned modules loaded?

> CPU: 69 PID: 116603 Comm: xfsaild/ram5 Tainted: GE   4.0.0 #2
> Hardware name: HP ProLiant DL980 G7, BIOS P66 07/30/2012
> task: 8b9f7eeb4f80 ti: 8b9f7f1ac000 task.ti: 8b9f7f1ac000
> RIP: 0010:[]  [] __memcpy+0xd/0x110
> RSP: 0018:8b9f7f1afc10  EFLAGS: 00010206
> RAX: 88102476a3cc RBX: 889ff2ab5000 RCX: 0005
> RDX: 0006 RSI:  RDI: 88102476a3cc

edx = 6 bytes.

> RBP: 8b9f7f1afc18 R08: 0001 R09: 88102476a3cc
> R10: 8a1f6c03ea80 R11:  R12: 8b1ff1269400
> R13: 8b1f64837c98 R14: 881038701200 R15: 88102476a300
> FS:  () GS:8b1fffa4() knlGS:
> CS:  0010 DS:  ES:  CR0: 8005003b
> CR2:  CR3: 029f7655e000 CR4: 06e0
> Stack:
>  a0ca8c41 8b9f7f1afc68 a0cc4803 8b9f7f1afc68
>  a0cd2777 8b9f7f1afc68 8b1ff1269400 8a9f59022800
>  8b1f7c932718 0003 8a9f590228e4 8b9f7f1afce8
> Call Trace:
>  [] ? xfs_iflush_fork+0x181/0x240 [xfs]
>  [] xfs_iflush_int+0x1f3/0x320 [xfs]
>  [] ? kmem_alloc+0x87/0x100 [xfs]
>  [] xfs_iflush_cluster+0x295/0x380 [xfs]
>  [] xfs_iflush+0xf4/0x1f0 [xfs]
>  [] xfs_inode_item_push+0xea/0x130 [xfs]
>  [] xfsaild_push+0x10d/0x500 [xfs]
>  [] ? lock_timer_base+0x70/0x70
>  [] xfsaild+0x98/0x130 [xfs]
>  [] ? xfsaild_push+0x500/0x500 [xfs]
>  [] ? xfsaild_push+0x500/0x500 [xfs]
>  [] ? xfsaild_push+0x500/0x500 [xfs]
>  [] ? kthread_freezable_should_stop+0x70/0x70
>  [] ret_from_fork+0x58/0x90
>  [] ? kthread_freezable_should_stop+0x70/0x70
> Code: 0f b6 c0 5b c9 c3 0f 1f 84 00 00 00 00 00 e8 2b f9 ff ff 80 7b
> 25 00 74 c8 eb d3 90 90 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07
>  48 a5 89 d1 f3 a4 c3 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c
> RIP  [] __memcpy+0xd/0x110
>  RSP 
> CR2: 
> ---[ end trace fb8a4add69562a76 ]---
> 
> The xfs_iflush_fork+0x181/0x240 (385) IP address is at:
> 

(rearrange slightly to make more sense)

> 823case XFS_DINODE_FMT_LOCAL:
> 824if ((iip->ili_fields & dataflag[whichfork]) &&
>0x23c0 <+336>:movslq %ecx,%rcx
>0x23c3 <+339>:movswl 0x0(%rcx,%rcx,1),%eax
>0x23cb <+347>:test   %eax,0x90(%rdx)
>0x23d1 <+353>:je 0x2350 
> 
> 825(ifp->if_bytes > 0)) {
>0x23d7 <+359>:mov(%r10),%edx
>0x23da <+362>:test   %edx,%edx
>0x23dc <+364>:jle0x2350 

So the contents of rdx says that the inode fork size is 6 bytes in
local format. The call location also indicates that it is the
attribute fork that is being flushed. The minimum size of the
attr fork is 3 bytes - an empty header. However, the next valid size
has a second header that adds 4 bytes to the size, plus the bytes
in the attr name and value.

Hence a size of 6 bytes is invalid, and probably indicates that
there is some form of

Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-17 Thread Marcelo Tosatti

On Fri, Apr 17, 2015 at 03:25:28PM -0700, Andy Lutomirski wrote:
> On Fri, Apr 17, 2015 at 3:04 PM, Linus Torvalds
>  wrote:
> > On Fri, Apr 17, 2015 at 5:42 PM, Andy Lutomirski  
> > wrote:
> >>
> >> Muahaha.  The auditors have invaded your system.  (I did my little
> >> benchmark with a more sensible configuration -- see way below).
> >>
> >> Can you send the output of:
> >>
> >> # auditctl -s
> >> # auditctl -l
> >
> >   # auditctl -s
> >   enabled 1
> >   flag 1
> >   pid 822
> >   rate_limit 0
> >   backlog_limit 320
> >   lost 0
> >   backlog 0
> >   backlog_wait_time 6
> >   loginuid_immutable 0 unlocked
> >   # auditctl -l
> >   No rules
> 
> Yes, "No rules" doesn't mean "don't audit".
> 
> >
> >> Are you, perchance, using Fedora?
> >
> > F21. Yup.
> >
> > I used to just disable auditing in the kernel entirely, but then I
> > ended up deciding that I need to run something closer to the broken
> > Fedora config (selinux in particular) in order to actually optimize
> > the real-world pathname handling situation rather than the _sane_ one.
> > Oh well. I think audit support got enabled at the same time in my
> > kernels because I ended up using the default config and then taking
> > out the truly crazy stuff without noticing AUDITSYSCALL.
> >
> >> I lobbied rather heavily, and
> >> successfully, to get the default configuration to stop auditing.
> >> Unfortunately, the fix wasn't retroactive, so, unless you have a very
> >> fresh install, you might want to apply the fix yourself:
> >
> > Is that fix happening in Fedora going forward, though? Like F22?
> 
> It's in F21, actually.  Unfortunately, the audit package is really
> weird.  It installs /etc/audit/rules.d/audit.rules, containing:
> 
> # This file contains the auditctl rules that are loaded
> # whenever the audit daemon is started via the initscripts.
> # The rules are simply the parameters that would be passed
> # to auditctl.
> 
> # First rule - delete all
> -D
> 
> # This suppresses syscall auditing for all tasks started
> # with this rule in effect.  Remove it if you need syscall
> # auditing.
> -a task,never
> 
> Then, if it's a fresh install (i.e. /etc/audit/audit.rules doesn't
> exist) it copies that file to /etc/audit/audit.rules post-install.
> (No, I don't know why it works this way.)
> 
> IOW, you might want to copy your /etc/audit/rules.d/audit.rules to
> /etc/audit/audit.rules.  I think you need to reboot to get the full
> effect.  You could apply the rule manually (or maybe restart the audit
> service), but it will only affect newly-started tasks.
> 
> >
> >> Amdy Lumirtowsky thinks he meant to attach a condition to his
> >> maintainerish activities: he will do his best to keep the audit code
> >> *out* of the low-level stuff, but he's going to try to avoid ever
> >> touching the audit code itself, because if he ever had to change it,
> >> he might accidentally delete the entire file.
> >
> > Oooh. That would be _such_ a shame.
> >
> > Can we please do it by mistake? "Oops, my fingers slipped"
> >
> >> Seriously, wasn't there a TAINT_PERFORMANCE thing proposed at some
> >> point?  I would love auditing to set some really loud global warning
> >> that you've just done a Bad Thing (tm) performance-wise by enabling
> >> it.
> >
> > Or even just a big fat warning in dmesg the first time auditing triggers.
> >
> >> Back to timing.  With kvm-clock, I see:
> >>
> >>   23.80%  timing_test_64  [kernel.kallsyms]   [k] pvclock_clocksource_read
> >
> > Oh wow. How can that possibly be sane?
> >
> > Isn't the *whole* point of pvclock_clocksource_read() to be a native
> > rdtsc with scaling? How does it cause that kind of insane pain?
> 
> An unnecessarily complicated protocol, a buggy host implementation,
> and an unnecessarily complicated guest implementation :(

How about starting by removing the unnecessary rdtsc-barrier?
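
For readers unfamiliar with what "a native rdtsc with scaling" involves, here is a minimal userspace sketch of the shift-then-multiply arithmetic, assuming a consistent snapshot of the time parameters has already been read. The struct, field, and function names are hypothetical and simplified; this is not the kernel's code, and it leaves out the retry loop and any barriers debated above. It relies on unsigned __int128, which is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified snapshot of host-published time parameters. */
struct pv_time_snapshot {
    uint64_t tsc_timestamp;   /* TSC value when the snapshot was taken */
    uint64_t system_time_ns;  /* nanoseconds corresponding to tsc_timestamp */
    uint32_t tsc_to_ns_mul;   /* fractional multiplier, scaled by 2^32 */
    int8_t   tsc_shift;       /* shift applied to the delta before multiplying */
};

/* Scale a TSC delta to nanoseconds: shift first, then multiply by mul / 2^32. */
static uint64_t scale_tsc_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    assert(shift > -64 && shift < 64);
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;
    /* 64x32-bit multiply needs more than 64 bits for the intermediate. */
    return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Convert a raw TSC reading to nanoseconds using one snapshot. */
static uint64_t snapshot_to_ns(const struct pv_time_snapshot *s, uint64_t tsc_now)
{
    uint64_t delta = tsc_now - s->tsc_timestamp;
    return s->system_time_ns +
           scale_tsc_delta(delta, s->tsc_to_ns_mul, s->tsc_shift);
}
```

With a multiplier of 0x80000000 (one half) and a shift of 0, a delta of 1000 cycles scales to 500 ns. The arithmetic itself is cheap, so overhead on the scale quoted above would have to come from whatever surrounds it.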


Re: [PATCH 3/6] direct-io: add support for write stream IDs

2015-04-17 Thread Jens Axboe


On 04/17/2015 05:06 PM, Dave Chinner wrote:

> On Thu, Apr 16, 2015 at 11:20:45PM -0700, Ming Lin wrote:
>> On Sat, Apr 11, 2015 at 4:59 AM, Dave Chinner  wrote:
>>> On Fri, Apr 10, 2015 at 04:50:05PM -0700, Ming Lin wrote:
>>>> On Wed, Mar 25, 2015 at 7:26 AM, Jens Axboe  wrote:
>>>>>> If iocb->ki_filp->f_streamid is not set, then it should fall back to
>>>>>> whatever is set on the inode->i_streamid.
>>>>
>>>> Why should do the fall back?
>>>
>>> Because then you have a method of using streams with applications
>>> that aren't aware of streams.
>>>
>>> Or perhaps you have a file you know has different access patterns to
>>> the rest of the files in a directory, and you don't want to have to
>>> set the stream on every process that opens and uses that file. e.g.
>>> database writeahead log files (sequential write, never read) vs
>>> database index/table files (random read/write).
>>>
>>>>> Good point, agree. Will make that change.
>>>>
>>>> That change causes problem for direct IO, for example
>>>>
>>>> process 1:
>>>> fd = open("/dev/nvme0n1", O_DIRECT...);
>>>> //set stream_id 1
>>>> fadvise(fd, 1, 0, POSIX_FADV_STREAMID);
>>>> pwrite(fd, );
>>>>
>>>> process 2:
>>>> fd = open("/dev/nvme0n1", O_DIRECT...);
>>>> //should be legacy stream_id 0
>>>> pwrite(fd, );
>>>>
>>>> But now process 2 also sees stream_id 1, which is wrong.
>>>
>>> It's not wrong, your behaviour model is just different. You have
>>> defined a process/fd based stream model and not considered
>>> that admins and applications might want to use a file
>>> based stream model instead, so applications don't need to even be
>>> aware that write streams are in use...
>>
>> The stream must be opened, otherwise device will return error if application
>> write to a not-opened stream.
>
> That's an extremely device specific *implementation* of a write
> stream. The *concept* of a write stream being passed from userspace to
> the block layer doesn't have such constraints, and I get really
> concerned when implementations of a generic concept are so tightly
> focussed around one type of hardware implementation of the
> concept...


Indeed, which is why the implementation posted cares ONLY about the 
stream ID itself, and passing that through.


But the point about fallback is valid, however, for some use cases that 
will not be what you want. But we have to make some sort of decision, 
and falling back to the inode set value (if one is set) is probably the 
right thing to do in most use cases.
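
The fallback described above can be sketched in a few lines. This is a standalone illustration, not the posted patch: the demo_* types are hypothetical stand-ins for the kernel's file and inode objects, and it assumes that a stream ID of 0 means "not set", following the "legacy stream_id 0" wording in the thread.

```c
#include <assert.h>
#include <stddef.h>

#define STREAM_ID_UNSET 0u  /* assumption: 0 means no stream was chosen */

/* Hypothetical stand-ins for the kernel's inode and file objects. */
struct demo_inode {
    unsigned int i_streamid;
};

struct demo_file {
    unsigned int f_streamid;
    struct demo_inode *f_inode;
};

/* Prefer the per-file stream ID; otherwise use whatever the inode carries. */
static unsigned int effective_stream_id(const struct demo_file *file)
{
    assert(file != NULL && file->f_inode != NULL);
    if (file->f_streamid != STREAM_ID_UNSET)
        return file->f_streamid;
    return file->f_inode->i_streamid;
}
```

Under this model, a second opener of the same inode that never sets its own ID picks up the inode's ID. That is the behaviour Ming Lin objects to for the block-device case and the one Dave Chinner wants for stream-unaware applications.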



>> Device has limited number of streams, for example, 16 streams.
>> There are 2 APIs to open/close the stream.
>
> What's to stop me writing something for DM-thinp that understands
> write streams in bios and uses it to separate out the write streams
> into different regions of the thinp device to improve locality of
> its data placement and hence reduce fragmentation?


Absolutely nothing, in fact that's one of the use cases that I had in
mind. Or for caching software.


--
Jens Axboe


Re: [PATCH 3/6] direct-io: add support for write stream IDs

2015-04-17 Thread Dave Chinner

On Thu, Apr 16, 2015 at 11:20:45PM -0700, Ming Lin wrote:
> On Sat, Apr 11, 2015 at 4:59 AM, Dave Chinner  wrote:
> > On Fri, Apr 10, 2015 at 04:50:05PM -0700, Ming Lin wrote:
> >> On Wed, Mar 25, 2015 at 7:26 AM, Jens Axboe  wrote:
> >> >> If iocb->ki_filp->f_streamid is not set, then it should fall back to
> >> >> whatever is set on the inode->i_streamid.
> >>
> >> Why should do the fall back?
> >
> > Because then you have a method of using streams with applications
> > that aren't aware of streams.
> >
> > Or perhaps you have a file you know has different access patterns to
> > the rest of the files in a directory, and you don't want to have to
> > set the stream on every process that opens and uses that file. e.g.
> > database writeahead log files (sequential write, never read) vs
> > database index/table files (random read/write).
> >
> >> > Good point, agree. Will make that change.
> >>
> >> That change causes problem for direct IO, for example
> >>
> >> process 1:
> >> fd = open("/dev/nvme0n1", O_DIRECT...);
> >> //set stream_id 1
> >> fadvise(fd, 1, 0, POSIX_FADV_STREAMID);
> >> pwrite(fd, );
> >>
> >> process 2:
> >> fd = open("/dev/nvme0n1", O_DIRECT...);
> >> //should be legacy stream_id 0
> >> pwrite(fd, );
> >>
> >> But now process 2 also sees stream_id 1, which is wrong.
> >
> > It's not wrong, your behaviour model is just different. You have
> > defined a process/fd based stream model and not considered
> > that admins and applications might want to use a file
> > based stream model instead, so applications don't need to even be
> > aware that write streams are in use...
> 
> The stream must be opened, otherwise device will return error if application
> write to a not-opened stream.

That's an extremely device specific *implementation* of a write
stream. The *concept* of a write stream being passed from userspace to
the block layer doesn't have such constraints, and I get really
concerned when implementations of a generic concept are so tightly
focussed around one type of hardware implementation of the
concept...

> Device has limited number of streams, for example, 16 streams.
> There are 2 APIs to open/close the stream.

What's to stop me writing something for DM-thinp that understands
write streams in bios and uses it to separate out the write streams
into different regions of the thinp device to improve locality of
its data placement and hence reduce fragmentation?

Yes, for nvme devices, the "streamid" might come from hardware,
but there's nothing stopping other storage devices using it
differently or having fewer constraints. e.g. unknown ID -> same as
stream 0
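
As a rough sketch of the "unknown ID -> same as stream 0" policy, a consumer of stream IDs could resolve them as follows. Everything here is hypothetical: the table type, the function name, and the limit of 16 (taken only from the example figure quoted above). It is meant to show that a device with open/close semantics and a device without them can share one generic ID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_STREAMS 16u  /* example limit quoted in the thread */

/* Hypothetical per-device record of which stream IDs are usable. */
struct demo_stream_table {
    bool open[DEMO_MAX_STREAMS];
};

/* Map a requested stream ID to one the device will accept. */
static unsigned int resolve_stream(const struct demo_stream_table *t,
                                   unsigned int id)
{
    assert(t != NULL);
    if (id < DEMO_MAX_STREAMS && t->open[id])
        return id;
    return 0;  /* unknown or unopened ID: treat as stream 0 */
}
```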

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [RFC PATCH] fs: use a sequence counter instead of file_lock in fd_install

2015-04-17 Thread Al Viro

On Sat, Apr 18, 2015 at 12:16:48AM +0200, Mateusz Guzik wrote:

> I would say this makes the use of seq counter impossible. Even if we
> decided to fall back to a lock on retry, we cannot know what to do if
> the slot is reserved - it very well could be that something called
> close, and something else reserved the slot, so putting the file inside
> could be really bad. In fact we would be putting a file for which we
> don't have a reference anymore.
> 
> However, not all hope is lost and I still think we can speed things up.
> 
> A locking primitive which only locks stuff for current cpu and has
> another mode where it locks stuff for all cpus would do the trick just
> fine. I'm not a linux guy, quick search suggests 'lglock' would do what
> I want.
> 
> table reallocation is an extremely rare operation, so this should be
> fine. It would take the lock 'globally' for given table.

It would also mean percpu_alloc() for each descriptor table...

Re: [RFC 1/4] fs: Add generic file system event notifications

2015-04-17 Thread Andreas Dilger

On Apr 17, 2015, at 5:31 AM, Jan Kara  wrote:
> On Wed 15-04-15 09:15:44, Beata Michalska wrote:
>> Introduce configurable generic interface for file
>> system-wide event notifications to provide file
>> systems with a common way of reporting any potential
>> issues as they emerge.
>> 
>> The notifications are to be issued through generic
>> netlink interface, by a dedicated, for file system
>> events, multicast group. The file systems might as
>> well use this group to send their own custom messages.
>> 
>> The events have been split into four base categories:
>> information, warnings, errors and threshold notifications,
>> with some very basic event types like running out of space
>> or file system being remounted as read-only.
>> 
>> Threshold notifications have been included to allow
>> triggering an event whenever the amount of free space
>> drops below a certain level - or levels to be more precise
>> as two of them are being supported: the lower and the upper
>> range. The notifications work both ways: once the threshold
>> level has been reached, an event shall be generated whenever
>> the number of available blocks goes up again re-activating
>> the threshold.
>> 
>> The interface has been exposed through a vfs. Once mounted,
>> it serves as an entry point for the set-up where one can
>> register for particular file system events.
>> 
>> Signed-off-by: Beata Michalska 
>  Thanks for the patches! Some comments are below.
> 
>> ---
>> Documentation/filesystems/events.txt |  254 +++
>> fs/Makefile  |1 +
>> fs/events/Makefile   |6 +
>> fs/events/fs_event.c |  775 ++
>> fs/events/fs_event.h |   27 ++
>> fs/events/fs_event_netlink.c |   94 +
>> fs/namespace.c   |1 +
>> include/linux/fs.h   |6 +-
>> include/linux/fs_event.h |   69 +++
>> include/uapi/linux/fs_event.h|   62 +++
>> include/uapi/linux/genetlink.h   |1 +
>> net/netlink/genetlink.c  |7 +-
>> 12 files changed, 1301 insertions(+), 2 deletions(-)
>> create mode 100644 Documentation/filesystems/events.txt
>> create mode 100644 fs/events/Makefile
>> create mode 100644 fs/events/fs_event.c
>> create mode 100644 fs/events/fs_event.h
>> create mode 100644 fs/events/fs_event_netlink.c
>> create mode 100644 include/linux/fs_event.h
>> create mode 100644 include/uapi/linux/fs_event.h
>> 
>> diff --git a/Documentation/filesystems/events.txt 
>> b/Documentation/filesystems/events.txt
>> new file mode 100644
>> index 000..c85dd88
>> --- /dev/null
>> +++ b/Documentation/filesystems/events.txt
>> @@ -0,0 +1,254 @@
>> +
>> +Generic file system event notification interface
>> +
>> +Document created 09 April 2015 by Beata Michalska 
>> +
>> +1. The reason behind:
>> +=
>> +
>> +There are many corner cases when things might get messy with the 
>> filesystems.
>> +And it is not always obvious what and when went wrong. Sometimes you might
>> +get some subtle hints that there is something going on - but by the time
>> +you realise it, it might be too late as you are already out-of-space
>> +or the filesystem has been remounted as read-only (i.e.). The generic
>> +interface for the filesystem events fills the gap by providing a rather
>> +easy way of real-time notifications triggered whenever something interesting
>> +happens, allowing filesystems to report events in a common way, as they 
>> occur.
>> +
>> +2. How does it work:
>> +
>> +
>> +The interface itself has been exposed as fstrace-type Virtual File System,
>> +primarily to ease the process of setting up the configuration for the file
>> +system notifications. So for starters it needs to get mounted (obviously):
>> +
>> +mount -t fstrace none /sys/fs/events
>> +
>> +This will unveil the single fstrace filesystem entry - the 'config' file,
>> +through which the notification are being set-up.
>> +
>> +Activating notifications for particular filesystem is as straightforward
>> +as writing into the 'config' file. Note that by default all events despite
>> +the actual filesystem type are being disregarded.
>  Is there a reason to have a special filesystem for this? Do you expect
> extending it by (many) more files? Why not just creating a file in sysfs or
> something like that?
> 
>> +Synopsis of config:
>> +--
>> +
>> +MOUNT EVENT_TYPE [L1] [L2]
>> +
>> + MOUNT  : the filesystem's mount point
>  I'm not quite decided but is mountpoint really the right thing to pass
> via the interface? They aren't unique (filesystem can be mounted in
> multiple places) and more importantly can change over time. So won't it be
> better to pass major:minor over the interface? These are stable, unique to
> the filesystem, and userspace can easily get them by calling stat(2) on the
> desired path (or directly from /proc/self/mountinfo).
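
To illustrate the stat(2) route mentioned above, here is a minimal userspace sketch that produces a "major:minor" string for the filesystem holding a given path. The helper name fs_dev_id is made up for this example, and whether the proposed interface would accept this exact format is an assumption. It uses stat(2) and the major()/minor() macros from <sys/sysmacros.h> as provided by glibc on Linux.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/*
 * Hypothetical helper: write "major:minor" of the device backing `path`
 * into `buf`. Returns 0 on success, -1 if stat() fails or buf is too small.
 */
static int fs_dev_id(const char *path, char *buf, size_t len)
{
    struct stat st;
    int n;

    assert(path != NULL && buf != NULL);
    if (stat(path, &st) != 0)
        return -1;
    n = snprintf(buf, len, "%u:%u", major(st.st_dev), minor(st.st_dev));
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Calling fs_dev_id on any path inside a mounted filesystem gives the same identifier regardless of which mount point the path goes through, which is the stability property argued for above.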

Re: Device mapper failed to open temporary keystore device

2015-04-17 Thread Mike Snitzer

On Fri, Apr 17 2015 at  4:11pm -0400,
Murilo Opsfelder Araújo  wrote:

> Hello, everyone.
> 
> Right after I enter my passphrase to unlock my cryptsetup partition,
> it displays the following error and asks for cryptsetup password again
> (it got stuck on this loop).
> 
> This issue was introduced in next-20150413.  next-20150410 is working just 
> fine.
> 
> Any hint on how to debug this?
> 
> Unlocking the disk /dev/disk/by-uuid/ (sda5_crypt)
> Enter passphrase: *
> [  244.239821] device-mapper: table: 252:0: crypt: Error allocating crypto tfm
> device-mapper: reload ioctl on  failed: No such file or directory
> Failed to open temporary keystore device.
> device-mapper: remove ioctl on temporary-cryptsetup-239 failed: No
> such device or address
> device-mapper: reload ioctl on temporary-cryptsetup-239 failed: No
> such device or address
> device-mapper: remove ioctl on temporary-cryptsetup-239 failed: No
> such device or address
> device-mapper: remove ioctl on temporary-cryptsetup-239 failed: No
> such device or address
> device-mapper: remove ioctl on temporary-cryptsetup-239 failed: No
> such device or address
> device-mapper: remove ioctl on temporary-cryptsetup-239 failed: No
> such device or address

git diff next-20150410^..next-20150413 drivers/md/dm-crypt.c points to
this commit as the only dm-crypt.c change:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=5977907937afa2b5584a874d44ba6c0f56aeaa9c

Which appears unrelated given your "crypt: Error allocating crypto tfm"
error.

But any chance you can try reverting that commit from next-20150413 and
re-test just to be absolutely sure?

There are also some crypto changes that could very easily be the cause
of your problem (cc'ing Herbert), e.g.:

$ git diff next-20150410^..next-20150413 -- crypto | diffstat
 algapi.c |   13 -
 algif_hash.c |4 -
 algif_skcipher.c |4 -
 sha1_generic.c   |  102 --
 sha256_generic.c |  133 ---
 sha512_generic.c |  123 --
 6 files changed, 76 insertions(+), 303 deletions(-)

Re: [RFC 1/4] fs: Add generic file system event notifications

2015-04-17 Thread Andreas Dilger

On Apr 17, 2015, at 11:37 AM, John Spray  wrote:
> On 17/04/2015 17:22, Jan Kara wrote:
>> On Fri 17-04-15 17:08:10, John Spray wrote:
>>> On 17/04/2015 16:43, Jan Kara wrote:
>>> In that case I'm confused -- why would ENOSPC be an appropriate use
>>> of this interface if the mount being entirely blocked would be
>>> inappropriate?  Isn't being unable to service any I/O a more
>>> fundamental and severe thing than being up and healthy but full?
>>> 
>>> Were you intending the interface to be exclusively for data
>>> integrity issues like checksum failures, rather than more general
>>> events about a mount that userspace would probably like to know
>>> about?
>>   Well, I'm not saying we cannot have those events for fs availability /
>> unavailability. I'm just saying I'd like to see some use for that first.
>> I don't want events to be added just because it's possible...
>> 
>> For ENOSPC we have thin provisioned storage and the userspace daemon
>> shuffling real storage underneath. So there I know the usecase.
>> 
> 
> Ah, OK.  So I can think of a couple of use cases:
> * a cluster scheduling service (think MPI jobs or docker containers) might
>   check for events like this.  If it can see the cluster filesystem is
>   unavailable, then it can avoid scheduling the job, so that the (multi-node)
>   application does not get hung on one node with a bad mount.  If it sees a
>   mount go bad (unavailable, or client evicted) partway through a job, then it
>   can kill -9 the process that was relying on the bad mount, and go run it
>   somewhere else.
> * Boring but practical case: a nagios health check for checking if mounts are
>   OK.

John,
thanks for chiming in, as I was just about to write the same.  Some users
were just asking yesterday at the Lustre User Group meeting about adding
an interface to notify job schedulers for your #1 point, and I'd much
rather use a generic interface than inventing our own for Lustre.

Cheers, Andreas

> We don't have to invent these event types now of course, but something to 
> bear in mind.  Hopefully if/when any of the distributed filesystems 
> (Lustre/Ceph/etc) choose to implement this, we can look at making the event 
> types common at that time though.
> 
> BTW in any case an interface for filesystem events to userspace will be a 
> useful addition, thank you!
> 
> Cheers,
> John


Cheers, Andreas






Re: [PATCH v2 7/8] selftest/x86: install tests

2015-04-17 Thread Tyler Baker

On 17 April 2015 at 15:28, Andy Lutomirski  wrote:
> On Fri, Apr 17, 2015 at 3:02 PM, Tyler Baker  wrote:
>> Include lib.mk and set TEST_PROGS where appropriate. Skip the install and 
>> test
>> case when CROSS_COMPILE is not set.
>>
>> Cc: Andy Lutomirski 
>> Signed-off-by: Tyler Baker 
>> ---
>>  tools/testing/selftests/x86/Makefile | 9 +
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/tools/testing/selftests/x86/Makefile 
>> b/tools/testing/selftests/x86/Makefile
>> index 9962e10..622717e 100644
>> --- a/tools/testing/selftests/x86/Makefile
>> +++ b/tools/testing/selftests/x86/Makefile
>> @@ -12,19 +12,28 @@ UNAME_M := $(shell uname -m)
>>  ifeq ($(CROSS_COMPILE),)
>>  # Always build 32-bit tests
>>  all: all_32
>> +# Install 32-bit tests
>> +TEST_PROGS += $(BINARIES_32) run_x86_tests.sh
>>  # If we're on a 64-bit host, build 64-bit tests as well
>>  ifeq ($(UNAME_M),x86_64)
>>  all: all_32 all_64
>> +# Install 64-bit tests
>> +TEST_PROGS += $(BINARIES_64)
>>  endif
>>  else
>>  # No dependency on all when cross building
>>  all:
>> +# Skip install and test case when not built
>> +override INSTALL_RULE :=
>> +override EMIT_TESTS :=  echo "echo \"selftests: run_x86_tests.sh [SKIP]\""
>
> I may just be confused, but why is an empty TEST_PROGS insufficient?

This is a good question. The default install rule in lib.mk blindly
calls 'install -t   ', which fails because not enough arguments are
passed. Perhaps we should fix this behavior in lib.mk.

>
> --Andy
>
>>  endif
>>
>>  all_32: check_build32 $(BINARIES_32)
>>
>>  all_64: $(BINARIES_64)
>>
>> +include ../lib.mk
>> +
>>  clean:
>> $(RM) $(BINARIES_32) $(BINARIES_64)
>>
>> --
>> 2.1.4
>>
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC



-- 
Tyler Baker
Tech Lead, LAVA
Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog

Re: [PATCH 4/6] phy: twl4030-usb: add ABI documentation

2015-04-17 Thread NeilBrown

On Sat, 18 Apr 2015 00:14:36 +0200 Pavel Machek  wrote:

> On Thu 2015-04-16 18:03:04, NeilBrown wrote:
> > From: NeilBrown 
> > 
> > This driver defines one local attribute: vbus.
> > Describe that in Documentation/ABI/testing/sysfs-platform-twl4030-usb.
> > 
> > Signed-off-by: NeilBrown 
> > ---
> >  .../ABI/testing/sysfs-platform-twl4030-usb |8 
> >  1 file changed, 8 insertions(+)
> >  create mode 100644 Documentation/ABI/testing/sysfs-platform-twl4030-usb
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-platform-twl4030-usb 
> > b/Documentation/ABI/testing/sysfs-platform-twl4030-usb
> > new file mode 100644
> > index ..512c51be64ae
> > --- /dev/null
> > +++ b/Documentation/ABI/testing/sysfs-platform-twl4030-usb
> > @@ -0,0 +1,8 @@
> > +What: /sys/bus/platform/devices/*twl4030-usb/vbus
> > +Description:
> > +   Read-only status reporting if VBUS (approx 5V)
> > +   is being supplied by the USB bus.
> > +
> > +   Possible values: "on", "off".
> 
> Would it be better to have values "0" and "1"? Kernel usually does
> that for booleans...

1/ The code already uses "on" and "off", so changing would be an ABI
breakage.

2/ No it doesn't.
 For module params, the kernel uses "Y" and "N"

  git grep '? "on" : "off"'  | wc

 finds 172 matches.
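
Since the attribute reports strings rather than digits, a userspace consumer has to compare text. The following is a minimal sketch of such a parser. The helper name parse_vbus is hypothetical, and it assumes the value may be followed by a trailing newline, as sysfs attributes commonly are.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: interpret the text read from the vbus attribute.
 * Returns 0 and stores the state in *on, or -1 for any other content.
 */
static int parse_vbus(const char *text, bool *on)
{
    size_t n;

    assert(text != NULL && on != NULL);
    n = strcspn(text, "\n");  /* length up to an optional trailing newline */
    if (n == 2 && strncmp(text, "on", 2) == 0) {
        *on = true;
        return 0;
    }
    if (n == 3 && strncmp(text, "off", 3) == 0) {
        *on = false;
        return 0;
    }
    return -1;
}
```

A parser written this way would reject "0" and "1", which is one concrete way that changing the values would break existing users of the ABI.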

NeilBrown



Re: [PATCH v2 5/8] selftest/x86: build both bitnesses

2015-04-17 Thread Tyler Baker

On 17 April 2015 at 15:27, Andy Lutomirski  wrote:
> On Fri, Apr 17, 2015 at 3:24 PM, Tyler Baker  wrote:
>> On 17 April 2015 at 15:06, Andy Lutomirski  wrote:
>>> On Fri, Apr 17, 2015 at 3:01 PM, Tyler Baker  wrote:
>>>> Using uname with the processor flag option in some cases can yield 'unknown'
>>>> so lets use the machine flag option as it is deterministic. Add a dependency
>>>> for all_32 when building on a x86 64 bit host so that both bitnesses are
>>>> built in this case.
>>>>
>>>> Cc: Andy Lutomirski
>>>> Signed-off-by: Tyler Baker
>>>> ---
>>>>  tools/testing/selftests/x86/Makefile | 6 +++---
>>>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
>>>> index f0a7918..57090ad 100644
>>>> --- a/tools/testing/selftests/x86/Makefile
>>>> +++ b/tools/testing/selftests/x86/Makefile
>>>> @@ -7,14 +7,14 @@ BINARIES_64 := $(TARGETS_C_BOTHBITS:%=%_64)
>>>>
>>>>  CFLAGS := -O2 -g -std=gnu99 -pthread -Wall
>>>>
>>>> -UNAME_P := $(shell uname -p)
>>>> +UNAME_M := $(shell uname -m)
>>>>
>>>>  # Always build 32-bit tests
>>>>  all: all_32
>>>>
>>>>  # If we're on a 64-bit host, build 64-bit tests as well
>>>> -ifeq ($(shell uname -p),x86_64)
>>>> -all: all_64
>>>> +ifeq ($(UNAME_M),x86_64)
>>>> +all: all_32 all_64
>>>
>>> This duplicates the all: all_32 above.
>>
>> I agree with you but the behavior is different than expected.
>>
>> From a clean linux-next tree building on a 64-bit x86 host
>>
>> (jessie)tyler@localhost:~/Dev/kernels/linux$ git describe
>> next-20150415
>> (jessie)tyler@localhost:~/Dev/kernels/linux$ uname -m
>> x86_64
>> (jessie)tyler@localhost:~/Dev/kernels/linux$ make -C
>> tools/testing/selftests/x86/
>> make: Entering directory
>> '/home/tyler/Dev/kernels/linux/tools/testing/selftests/x86'
>> cc -m32 -o sigreturn_32 -O2 -g -std=gnu99 -pthread -Wall  sigreturn.c -lrt 
>> -ldl
>> make: Leaving directory
>> '/home/tyler/Dev/kernels/linux/tools/testing/selftests/x86'
>>
>> With this series applied on top I get
>>
>> (jessie)tyler@localhost:~/Dev/kernels/linux$ make -C
>> tools/testing/selftests/x86/
>> make: Entering directory
>> '/home/tyler/Dev/kernels/linux/tools/testing/selftests/x86'
>> Makefile:41: warning: overriding recipe for target 'run_tests'
>> ../lib.mk:12: warning: ignoring old recipe for target 'run_tests'
>
> That can't be a good sign.
>
>> gcc -m32 -o sigreturn_32 -O2 -g -std=gnu99 -pthread -Wall  sigreturn.c -lrt 
>> -ldl
>> gcc -m64 -o sigreturn_64 -O2 -g -std=gnu99 -pthread -Wall  sigreturn.c -lrt 
>> -ldl
>> make: Leaving directory
>> '/home/tyler/Dev/kernels/linux/tools/testing/selftests/x86'
>>
>> Which is what I expected.
>
> I meant specifically this line:
>
> +all: all_32 all_64
>
> The rest of this patch looks okay.

You are right again. On my machine 'uname -p' returns 'unknown' so
that is why it's not building the 64-bit test on -next :) I'll fix
this up.
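
For reference, the string printed by 'uname -m' corresponds to the machine field that uname(2) fills in, so the same check can be done from a program. This is a small sketch with made-up helper names (machine_name, machine_is); it only mirrors the Makefile test and is not part of the series.

```c
#include <assert.h>
#include <string.h>
#include <sys/utsname.h>

/* Copy the kernel-reported machine name (what 'uname -m' prints) into buf. */
static int machine_name(char *buf, size_t len)
{
    struct utsname u;

    assert(buf != NULL && len > 0);
    if (uname(&u) != 0)
        return -1;
    strncpy(buf, u.machine, len - 1);
    buf[len - 1] = '\0';
    return 0;
}

/* Returns 1 if the machine name equals `want`, 0 if not, -1 on failure. */
static int machine_is(const char *want)
{
    struct utsname u;

    assert(want != NULL);
    if (uname(&u) != 0)
        return -1;
    return strcmp(u.machine, want) == 0;
}
```

On a 64-bit x86 host, machine_is("x86_64") would be expected to return 1, matching the 'uname -m' output shown in the transcript above.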

>
> --Andy

Tyler

[PATCH] intel powerclamp: support Knights Landing

2015-04-17 Thread Dasaratharaman Chandramouli

This patch enables intel_powerclamp driver to run on the
next-generation Intel(R) Xeon Phi Microarchitecture
code named "Knights Landing"

Signed-off-by: Dasaratharaman Chandramouli 

---
 drivers/thermal/intel_powerclamp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/thermal/intel_powerclamp.c 
b/drivers/thermal/intel_powerclamp.c
index 12623bc..e34ccd5 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -690,6 +690,7 @@ static const struct x86_cpu_id intel_powerclamp_ids[] = {
{ X86_VENDOR_INTEL, 6, 0x4c},
{ X86_VENDOR_INTEL, 6, 0x4d},
{ X86_VENDOR_INTEL, 6, 0x56},
+   { X86_VENDOR_INTEL, 6, 0x57},
{}
 };
 MODULE_DEVICE_TABLE(x86cpu, intel_powerclamp_ids);
-- 
1.8.1.5
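
To show what adding one row to such an ID table accomplishes, here is a standalone sketch of matching a CPU against a terminator-ended table. It is a userspace model only: the demo_* names are hypothetical, the vendor constant is an illustrative value, and the table lists just the entries visible in the hunk above, not the driver's full list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_VENDOR_INTEL 0u  /* illustrative value for this sketch */

/* Hypothetical, reduced form of a vendor/family/model match entry. */
struct demo_cpu_id {
    unsigned int vendor;
    unsigned int family;
    unsigned int model;
};

/* Only the rows visible in the patch context, plus the new 0x57 row. */
static const struct demo_cpu_id demo_powerclamp_ids[] = {
    { DEMO_VENDOR_INTEL, 6, 0x4c },
    { DEMO_VENDOR_INTEL, 6, 0x4d },
    { DEMO_VENDOR_INTEL, 6, 0x56 },
    { DEMO_VENDOR_INTEL, 6, 0x57 },  /* the entry this patch adds */
    { 0, 0, 0 }                      /* all-zero terminator */
};

/* Walk the table until the all-zero terminator, looking for an exact match. */
static bool demo_cpu_matches(const struct demo_cpu_id *table,
                             unsigned int vendor, unsigned int family,
                             unsigned int model)
{
    const struct demo_cpu_id *m;

    assert(table != NULL);
    for (m = table; m->vendor | m->family | m->model; m++) {
        if (m->vendor == vendor && m->family == family && m->model == model)
            return true;
    }
    return false;
}
```

In this model a family 6, model 0x57 CPU matches only once the new row is present, which is the whole effect of the one-line change.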


Re: [PATCH v2 7/8] selftest/x86: install tests

2015-04-17 Thread Andy Lutomirski

On Fri, Apr 17, 2015 at 3:02 PM, Tyler Baker  wrote:
> Include lib.mk and set TEST_PROGS where appropriate. Skip the install and test
> case when CROSS_COMPILE is not set.
>
> Cc: Andy Lutomirski 
> Signed-off-by: Tyler Baker 
> ---
>  tools/testing/selftests/x86/Makefile | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/tools/testing/selftests/x86/Makefile 
> b/tools/testing/selftests/x86/Makefile
> index 9962e10..622717e 100644
> --- a/tools/testing/selftests/x86/Makefile
> +++ b/tools/testing/selftests/x86/Makefile
> @@ -12,19 +12,28 @@ UNAME_M := $(shell uname -m)
>  ifeq ($(CROSS_COMPILE),)
>  # Always build 32-bit tests
>  all: all_32
> +# Install 32-bit tests
> +TEST_PROGS += $(BINARIES_32) run_x86_tests.sh
>  # If we're on a 64-bit host, build 64-bit tests as well
>  ifeq ($(UNAME_M),x86_64)
>  all: all_32 all_64
> +# Install 64-bit tests
> +TEST_PROGS += $(BINARIES_64)
>  endif
>  else
>  # No dependency on all when cross building
>  all:
> +# Skip install and test case when not built
> +override INSTALL_RULE :=
> +override EMIT_TESTS :=  echo "echo \"selftests: run_x86_tests.sh [SKIP]\""

I may just be confused, but why is an empty TEST_PROGS insufficient?

--Andy

>  endif
>
>  all_32: check_build32 $(BINARIES_32)
>
>  all_64: $(BINARIES_64)
>
> +include ../lib.mk
> +
>  clean:
> $(RM) $(BINARIES_32) $(BINARIES_64)
>
> --
> 2.1.4
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

Re: [PATCH v2 5/8] selftest/x86: build both bitnesses

2015-04-17 Thread Andy Lutomirski

On Fri, Apr 17, 2015 at 3:24 PM, Tyler Baker  wrote:
> On 17 April 2015 at 15:06, Andy Lutomirski  wrote:
>> On Fri, Apr 17, 2015 at 3:01 PM, Tyler Baker  wrote:
>>> Using uname with the processor flag option in some cases can yield 'unknown'
>>> so lets use the machine flag option as it is deterministic. Add a dependency
>>> for all_32 when building on a x86 64 bit host so that both bitnesses are
>>> built in this case.
>>>
>>> Cc: Andy Lutomirski 
>>> Signed-off-by: Tyler Baker 
>>> ---
>>>  tools/testing/selftests/x86/Makefile | 6 +++---
>>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/tools/testing/selftests/x86/Makefile 
>>> b/tools/testing/selftests/x86/Makefile
>>> index f0a7918..57090ad 100644
>>> --- a/tools/testing/selftests/x86/Makefile
>>> +++ b/tools/testing/selftests/x86/Makefile
>>> @@ -7,14 +7,14 @@ BINARIES_64 := $(TARGETS_C_BOTHBITS:%=%_64)
>>>
>>>  CFLAGS := -O2 -g -std=gnu99 -pthread -Wall
>>>
>>> -UNAME_P := $(shell uname -p)
>>> +UNAME_M := $(shell uname -m)
>>>
>>>  # Always build 32-bit tests
>>>  all: all_32
>>>
>>>  # If we're on a 64-bit host, build 64-bit tests as well
>>> -ifeq ($(shell uname -p),x86_64)
>>> -all: all_64
>>> +ifeq ($(UNAME_M),x86_64)
>>> +all: all_32 all_64
>>
>> This duplicates the all: all_32 above.
>
> I agree with you but the behavior is different than expected.
>
> From a clean linux-next tree building on a 64-bit x86 host
>
> (jessie)tyler@localhost:~/Dev/kernels/linux$ git describe
> next-20150415
> (jessie)tyler@localhost:~/Dev/kernels/linux$ uname -m
> x86_64
> (jessie)tyler@localhost:~/Dev/kernels/linux$ make -C
> tools/testing/selftests/x86/
> make: Entering directory
> '/home/tyler/Dev/kernels/linux/tools/testing/selftests/x86'
> cc -m32 -o sigreturn_32 -O2 -g -std=gnu99 -pthread -Wall  sigreturn.c -lrt 
> -ldl
> make: Leaving directory
> '/home/tyler/Dev/kernels/linux/tools/testing/selftests/x86'
>
> With this series applied on top I get
>
> (jessie)tyler@localhost:~/Dev/kernels/linux$ make -C
> tools/testing/selftests/x86/
> make: Entering directory
> '/home/tyler/Dev/kernels/linux/tools/testing/selftests/x86'
> Makefile:41: warning: overriding recipe for target 'run_tests'
> ../lib.mk:12: warning: ignoring old recipe for target 'run_tests'

That can't be a good sign.

> gcc -m32 -o sigreturn_32 -O2 -g -std=gnu99 -pthread -Wall  sigreturn.c -lrt 
> -ldl
> gcc -m64 -o sigreturn_64 -O2 -g -std=gnu99 -pthread -Wall  sigreturn.c -lrt 
> -ldl
> make: Leaving directory
> '/home/tyler/Dev/kernels/linux/tools/testing/selftests/x86'
>
> Which is what I expected.

I meant specifically this line:

+all: all_32 all_64

The rest of this patch looks okay.

--Andy

Re: [RFC PATCH v2 02/11] slab: add private memory allocator header for arch/lib

2015-04-17 Thread Christoph Lameter

On Fri, 17 Apr 2015, Richard Weinberger wrote:

> SLUB is the unqueued SLAB and SLLB is the library SLAB. :D

Good that this convention is now so broadly known that I did not even
have to explain what it meant. But I think you can give it any name you
want. SLLB was just a way to tersely state how this is going to integrate
into the overall scheme of things.


Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-17 Thread Andy Lutomirski

On Fri, Apr 17, 2015 at 3:04 PM, Linus Torvalds wrote:
> On Fri, Apr 17, 2015 at 5:42 PM, Andy Lutomirski  wrote:
>>
>> Muahaha.  The auditors have invaded your system.  (I did my little
>> benchmark with a more sensible configuration -- see way below).
>>
>> Can you send the output of:
>>
>> # auditctl -s
>> # auditctl -l
>
>   # auditctl -s
>   enabled 1
>   flag 1
>   pid 822
>   rate_limit 0
>   backlog_limit 320
>   lost 0
>   backlog 0
>   backlog_wait_time 6
>   loginuid_immutable 0 unlocked
>   # auditctl -l
>   No rules

Yes, "No rules" doesn't mean "don't audit".

>
>> Are you, perchance, using Fedora?
>
> F21. Yup.
>
> I used to just disable auditing in the kernel entirely, but then I
> ended up deciding that I need to run something closer to the broken
> Fedora config (selinux in particular) in order to actually optimize
> the real-world pathname handling situation rather than the _sane_ one.
> Oh well. I think audit support got enabled at the same time in my
> kernels because I ended up using the default config and then taking
> out the truly crazy stuff without noticing AUDITSYSCALL.
>
>> I lobbied rather heavily, and
>> successfully, to get the default configuration to stop auditing.
>> Unfortunately, the fix wasn't retroactive, so, unless you have a very
>> fresh install, you might want to apply the fix yourself:
>
> Is that fix happening in Fedora going forward, though? Like F22?

It's in F21, actually.  Unfortunately, the audit package is really
weird.  It installs /etc/audit/rules.d/audit.rules, containing:

# This file contains the auditctl rules that are loaded
# whenever the audit daemon is started via the initscripts.
# The rules are simply the parameters that would be passed
# to auditctl.

# First rule - delete all
-D

# This suppresses syscall auditing for all tasks started
# with this rule in effect.  Remove it if you need syscall
# auditing.
-a task,never

Then, if it's a fresh install (i.e. /etc/audit/audit.rules doesn't
exist) it copies that file to /etc/audit/audit.rules post-install.
(No, I don't know why it works this way.)

IOW, you might want to copy your /etc/audit/rules.d/audit.rules to
/etc/audit/audit.rules.  I think you need to reboot to get the full
effect.  You could apply the rule manually (or maybe restart the audit
service), but it will only affect newly-started tasks.

>
>> Amdy Lumirtowsky thinks he meant to attach a condition to his
>> maintainerish activities: he will do his best to keep the audit code
>> *out* of the low-level stuff, but he's going to try to avoid ever
>> touching the audit code itself, because if he ever had to change it,
>> he might accidentally delete the entire file.
>
> Oooh. That would be _such_ a shame.
>
> Can we please do it by mistake? "Oops, my fingers slipped"
>
>> Seriously, wasn't there a TAINT_PERFORMANCE thing proposed at some
>> point?  I would love auditing to set some really loud global warning
>> that you've just done a Bad Thing (tm) performance-wise by enabling
>> it.
>
> Or even just a big fat warning in dmesg the first time auditing triggers.
>
>> Back to timing.  With kvm-clock, I see:
>>
>>   23.80%  timing_test_64  [kernel.kallsyms]   [k] pvclock_clocksource_read
>
> Oh wow. How can that possibly be sane?
>
> Isn't the *whole* point of pvclock_clocksource_read() to be a native
> rdtsc with scaling? How does it cause that kind of insane pain?

An unnecessarily complicated protocol, a buggy host implementation,
and an unnecessarily complicated guest implementation :(

>
> Oh well. Some paravirt person would need to look and care.

The code there is a bit scary.

--Andy

>
>Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC

Re: [PATCH v2 5/8] selftest/x86: build both bitnesses

2015-04-17 Thread Tyler Baker

On 17 April 2015 at 15:06, Andy Lutomirski  wrote:
> On Fri, Apr 17, 2015 at 3:01 PM, Tyler Baker  wrote:
>> Using uname with the processor flag option in some cases can yield 'unknown'
>> so lets use the machine flag option as it is deterministic. Add a dependency
>> for all_32 when building on a x86 64 bit host so that both bitnesses are
>> built in this case.
>>
>> Cc: Andy Lutomirski 
>> Signed-off-by: Tyler Baker 
>> ---
>>  tools/testing/selftests/x86/Makefile | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/tools/testing/selftests/x86/Makefile 
>> b/tools/testing/selftests/x86/Makefile
>> index f0a7918..57090ad 100644
>> --- a/tools/testing/selftests/x86/Makefile
>> +++ b/tools/testing/selftests/x86/Makefile
>> @@ -7,14 +7,14 @@ BINARIES_64 := $(TARGETS_C_BOTHBITS:%=%_64)
>>
>>  CFLAGS := -O2 -g -std=gnu99 -pthread -Wall
>>
>> -UNAME_P := $(shell uname -p)
>> +UNAME_M := $(shell uname -m)
>>
>>  # Always build 32-bit tests
>>  all: all_32
>>
>>  # If we're on a 64-bit host, build 64-bit tests as well
>> -ifeq ($(shell uname -p),x86_64)
>> -all: all_64
>> +ifeq ($(UNAME_M),x86_64)
>> +all: all_32 all_64
>
> This duplicates the all: all_32 above.

I agree with you but the behavior is different than expected.

From a clean linux-next tree building on a 64-bit x86 host

(jessie)tyler@localhost:~/Dev/kernels/linux$ git describe
next-20150415
(jessie)tyler@localhost:~/Dev/kernels/linux$ uname -m
x86_64
(jessie)tyler@localhost:~/Dev/kernels/linux$ make -C
tools/testing/selftests/x86/
make: Entering directory
'/home/tyler/Dev/kernels/linux/tools/testing/selftests/x86'
cc -m32 -o sigreturn_32 -O2 -g -std=gnu99 -pthread -Wall  sigreturn.c -lrt -ldl
make: Leaving directory
'/home/tyler/Dev/kernels/linux/tools/testing/selftests/x86'

With this series applied on top I get

(jessie)tyler@localhost:~/Dev/kernels/linux$ make -C
tools/testing/selftests/x86/
make: Entering directory
'/home/tyler/Dev/kernels/linux/tools/testing/selftests/x86'
Makefile:41: warning: overriding recipe for target 'run_tests'
../lib.mk:12: warning: ignoring old recipe for target 'run_tests'
gcc -m32 -o sigreturn_32 -O2 -g -std=gnu99 -pthread -Wall  sigreturn.c -lrt -ldl
gcc -m64 -o sigreturn_64 -O2 -g -std=gnu99 -pthread -Wall  sigreturn.c -lrt -ldl
make: Leaving directory
'/home/tyler/Dev/kernels/linux/tools/testing/selftests/x86'

Which is what I expected.

>
> --Andy
>
>>  endif
>>
>>  all_32: check_build32 $(BINARIES_32)
>> --
>> 2.1.4
>>
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC

Tyler

Re: [Adi-buildroot-devel] [Consult] blackfin: About one building break issue for STACKTRACE

2015-04-17 Thread Chen Gang

On 4/17/15 22:02, Chen Gang wrote:
> On 4/17/15 11:02, Zhang, Sonic wrote:
>> Hi Gang,
>>
>> Please only use the GCC for Blackfin 2013R1 or 2014R1 from 
>> https://sourceforge.net/projects/adi-buildroot/files/ . Upstream GCC5 isn't 
>> ported to Blackfin properly.
>>

I should still consult the gcc developers about this issue, though. At the
least, I should report it to Bugzilla and try to find the root cause
(although I cannot fix it at present).

> 
> OK, thank you very much for your reply. :-)
> 
> 
> For me, I want to let gcc5 support Blackfin properly, but sorry, at
> present I can not.
> 
>  - In honest, I am still not quite familiar with gcc (although I am
>trying and improving).
> 
>  - This year, I have no enough time resource for it (I am mainly for
>upstream qemu this year).
> 
> But if next year, upstream gcc is still not ported to Blackfin properly,
> I shall try.
> 
> However, there are still several another issues for upstream blackfin
> gcc5 (they are all coredumps), I should still try to analyze them and
> find root causes, hope I can finish within this month.
> 
> 
> Thanks.
> 

Thanks.
-- 
Chen Gang

Open, share, and attitude like air, water, and life which God blessed

Re: [RFC PATCH] fs: use a sequence counter instead of file_lock in fd_install

2015-04-17 Thread Mateusz Guzik

On Fri, Apr 17, 2015 at 02:46:56PM -0700, Eric Dumazet wrote:
> On Thu, 2015-04-16 at 14:16 +0200, Mateusz Guzik wrote:
> > Hi,
> > 
> > Currently obtaining a new file descriptor results in locking fdtable
> > twice - once in order to reserve a slot and second time to fill it
> 
> ...
> 
> 
> >  void __fd_install(struct files_struct *files, unsigned int fd,
> > struct file *file)
> >  {
> > +   unsigned long seq;
> 
>   unsigned int seq;
> 
> > struct fdtable *fdt;
> > -   spin_lock(&files->file_lock);
> > -   fdt = files_fdtable(files);
> > -   BUG_ON(fdt->fd[fd] != NULL);
> > -   rcu_assign_pointer(fdt->fd[fd], file);
> > -   spin_unlock(&files->file_lock);
> > +
> > +   rcu_read_lock();
> > +   do {
> > +   seq = read_seqcount_begin(&files->fdt_seqcount);
> > +   fdt = files_fdtable_seq(files);
> > +   /*
> > +* Entry in the table can already be equal to file if we
> > +* had to restart and copy_fdtable picked up our update.
> > +*/
> > +   BUG_ON(!(fdt->fd[fd] == NULL || fdt->fd[fd] == file));
> > +   rcu_assign_pointer(fdt->fd[fd], file);
> > +   smp_mb();
> > +   } while (__read_seqcount_retry(&files->fdt_seqcount, seq));
> > +   rcu_read_unlock();
> >  }
> >  
> 
> So one problem here is :
> 
> As soon as  rcu_assign_pointer(fdt->fd[fd], file) is done, and other cpu
> does one expand_fdtable() and releases files->file_lock, another cpu can
> close(fd).
> 
> Then another cpu can reuse the [fd] now empty slot and install a new
> file in it.
> 
> Then this cpu will crash here :
> 
> BUG_ON(!(fdt->fd[fd] == NULL || fdt->fd[fd] == file));
> 

Ouch, this is so obvious now that you mention it. Really stupid
mistake on my side.

I would say this makes the use of seq counter impossible. Even if we
decided to fall back to a lock on retry, we cannot know what to do if
the slot is reserved - it very well could be that something called
close, and something else reserved the slot, so putting the file inside
could be really bad. In fact we would be putting a file for which we
don't have a reference anymore.

However, not all hope is lost and I still think we can speed things up.

A locking primitive which only locks stuff for the current cpu, and has
another mode where it locks stuff for all cpus, would do the trick just
fine. I'm not a Linux guy, but a quick search suggests 'lglock' would do
what I want.

Table reallocation is an extremely rare operation, so this should be
fine. It would take the lock 'globally' for the given table.

I'll play with this.

-- 
Mateusz Guzik

Re: [PATCH v2 6/8] selftest/x86: have no dependency on all when cross building

2015-04-17 Thread Tyler Baker

On 17 April 2015 at 15:08, Andy Lutomirski  wrote:
> On Fri, Apr 17, 2015 at 3:01 PM, Tyler Baker  wrote:
>> If the CROSS_COMPILE is set have no dependency on all.
>
> You mean "remove all's dependency on all_32 and all_64", I think.

Yes I'll clean this up.

>
>>
>> Cc: Andy Lutomirski 
>> Signed-off-by: Tyler Baker 
>> ---
>>  tools/testing/selftests/x86/Makefile | 6 +-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/testing/selftests/x86/Makefile 
>> b/tools/testing/selftests/x86/Makefile
>> index 57090ad..9962e10 100644
>> --- a/tools/testing/selftests/x86/Makefile
>> +++ b/tools/testing/selftests/x86/Makefile
>> @@ -9,13 +9,17 @@ CFLAGS := -O2 -g -std=gnu99 -pthread -Wall
>>
>>  UNAME_M := $(shell uname -m)
>
> I think you should add
>
> all:
>
> above.  Otherwise, with CROSS_COMPILE set, the default rule won't be
> 'all' any more.

Ack. Good suggestion, thanks.

>
> -Andy
>
>>
>> +ifeq ($(CROSS_COMPILE),)
>>  # Always build 32-bit tests
>>  all: all_32
>> -
>>  # If we're on a 64-bit host, build 64-bit tests as well
>>  ifeq ($(UNAME_M),x86_64)
>>  all: all_32 all_64
>>  endif
>> +else
>> +# No dependency on all when cross building
>> +all:
>> +endif
>>
>>  all_32: check_build32 $(BINARIES_32)
>>
>> --
>> 2.1.4
>>
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC

Tyler

Re: [PATCH 4/6] phy: twl4030-usb: add ABI documentation

2015-04-17 Thread Pavel Machek

On Thu 2015-04-16 18:03:04, NeilBrown wrote:
> From: NeilBrown 
> 
> This driver device one local attribute: vbus.
> Describe that in Documentation/ABI/testing/sysfs-platform/twl4030-usb.
> 
> Signed-off-by: NeilBrown 
> ---
>  .../ABI/testing/sysfs-platform-twl4030-usb |8 
>  1 file changed, 8 insertions(+)
>  create mode 100644 Documentation/ABI/testing/sysfs-platform-twl4030-usb
> 
> diff --git a/Documentation/ABI/testing/sysfs-platform-twl4030-usb 
> b/Documentation/ABI/testing/sysfs-platform-twl4030-usb
> new file mode 100644
> index ..512c51be64ae
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-platform-twl4030-usb
> @@ -0,0 +1,8 @@
> +What: /sys/bus/platform/devices/*twl4030-usb/vbus
> +Description:
> + Read-only status reporting if VBUS (approx 5V)
> + is being supplied by the USB bus.
> +
> + Possible values: "on", "off".

Would it be better to have values "0" and "1"? The kernel usually does
that for booleans...

Thanks,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Re: [PATCH v2 6/8] selftest/x86: have no dependency on all when cross building

2015-04-17 Thread Andy Lutomirski

On Fri, Apr 17, 2015 at 3:01 PM, Tyler Baker  wrote:
> If the CROSS_COMPILE is set have no dependency on all.

You mean "remove all's dependency on all_32 and all_64", I think.

>
> Cc: Andy Lutomirski 
> Signed-off-by: Tyler Baker 
> ---
>  tools/testing/selftests/x86/Makefile | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/x86/Makefile 
> b/tools/testing/selftests/x86/Makefile
> index 57090ad..9962e10 100644
> --- a/tools/testing/selftests/x86/Makefile
> +++ b/tools/testing/selftests/x86/Makefile
> @@ -9,13 +9,17 @@ CFLAGS := -O2 -g -std=gnu99 -pthread -Wall
>
>  UNAME_M := $(shell uname -m)

I think you should add

all:

above.  Otherwise, with CROSS_COMPILE set, the default rule won't be
'all' any more.

-Andy

>
> +ifeq ($(CROSS_COMPILE),)
>  # Always build 32-bit tests
>  all: all_32
> -
>  # If we're on a 64-bit host, build 64-bit tests as well
>  ifeq ($(UNAME_M),x86_64)
>  all: all_32 all_64
>  endif
> +else
> +# No dependency on all when cross building
> +all:
> +endif
>
>  all_32: check_build32 $(BINARIES_32)
>
> --
> 2.1.4
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

Re: [PATCH v2 5/8] selftest/x86: build both bitnesses

2015-04-17 Thread Andy Lutomirski

On Fri, Apr 17, 2015 at 3:01 PM, Tyler Baker  wrote:
> Using uname with the processor flag option in some cases can yield 'unknown'
> so lets use the machine flag option as it is deterministic. Add a dependency
> for all_32 when building on a x86 64 bit host so that both bitnesses are
> built in this case.
>
> Cc: Andy Lutomirski 
> Signed-off-by: Tyler Baker 
> ---
>  tools/testing/selftests/x86/Makefile | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/tools/testing/selftests/x86/Makefile 
> b/tools/testing/selftests/x86/Makefile
> index f0a7918..57090ad 100644
> --- a/tools/testing/selftests/x86/Makefile
> +++ b/tools/testing/selftests/x86/Makefile
> @@ -7,14 +7,14 @@ BINARIES_64 := $(TARGETS_C_BOTHBITS:%=%_64)
>
>  CFLAGS := -O2 -g -std=gnu99 -pthread -Wall
>
> -UNAME_P := $(shell uname -p)
> +UNAME_M := $(shell uname -m)
>
>  # Always build 32-bit tests
>  all: all_32
>
>  # If we're on a 64-bit host, build 64-bit tests as well
> -ifeq ($(shell uname -p),x86_64)
> -all: all_64
> +ifeq ($(UNAME_M),x86_64)
> +all: all_32 all_64

This duplicates the all: all_32 above.
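
(For reference: make merges the prerequisites of all recipe-less rules that
name the same target, so a second "all: all_64" line is enough to get both.
A minimal shell sketch of that behaviour, assuming a make binary is
installed; the function name and the throwaway Makefile are made up for
illustration and are not part of the patch.)

```shell
#!/bin/sh
# Hypothetical demo: build a throwaway Makefile with two "all:" lines and
# run its default goal. Both prerequisites are built, in the order listed.
demo_default_goal() {
    dir=$(mktemp -d) || return 1
    printf '%s\n' \
        'all: all_32' \
        'all: all_64' \
        'all_32: ; @echo all_32' \
        'all_64: ; @echo all_64' > "$dir/Makefile"
    # -s keeps make from echoing anything except the recipe output.
    (cd "$dir" && make -s)
    rm -rf "$dir"
}

# Only run the demo when make is actually available.
if command -v make >/dev/null 2>&1; then
    demo_default_goal
fi
```

Because the first rule in the file is "all", it is also the default goal,
which is why a plain "make" is enough here.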

--Andy

>  endif
>
>  all_32: check_build32 $(BINARIES_32)
> --
> 2.1.4
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-17 Thread Linus Torvalds

On Fri, Apr 17, 2015 at 5:42 PM, Andy Lutomirski  wrote:
>
> Muahaha.  The auditors have invaded your system.  (I did my little
> benchmark with a more sensible configuration -- see way below).
>
> Can you send the output of:
>
> # auditctl -s
> # auditctl -l

  # auditctl -s
  enabled 1
  flag 1
  pid 822
  rate_limit 0
  backlog_limit 320
  lost 0
  backlog 0
  backlog_wait_time 6
  loginuid_immutable 0 unlocked
  # auditctl -l
  No rules

> Are you, perchance, using Fedora?

F21. Yup.

I used to just disable auditing in the kernel entirely, but then I
ended up deciding that I need to run something closer to the broken
Fedora config (selinux in particular) in order to actually optimize
the real-world pathname handling situation rather than the _sane_ one.
Oh well. I think audit support got enabled at the same time in my
kernels because I ended up using the default config and then taking
out the truly crazy stuff without noticing AUDITSYSCALL.

> I lobbied rather heavily, and
> successfully, to get the default configuration to stop auditing.
> Unfortunately, the fix wasn't retroactive, so, unless you have a very
> fresh install, you might want to apply the fix yourself:

Is that fix happening in Fedora going forward, though? Like F22?

> Amdy Lumirtowsky thinks he meant to attach a condition to his
> maintainerish activities: he will do his best to keep the audit code
> *out* of the low-level stuff, but he's going to try to avoid ever
> touching the audit code itself, because if he ever had to change it,
> he might accidentally delete the entire file.

Oooh. That would be _such_ a shame.

Can we please do it by mistake? "Oops, my fingers slipped"

> Seriously, wasn't there a TAINT_PERFORMANCE thing proposed at some
> point?  I would love auditing to set some really loud global warning
> that you've just done a Bad Thing (tm) performance-wise by enabling
> it.

Or even just a big fat warning in dmesg the first time auditing triggers.

> Back to timing.  With kvm-clock, I see:
>
>   23.80%  timing_test_64  [kernel.kallsyms]   [k] pvclock_clocksource_read

Oh wow. How can that possibly be sane?

Isn't the *whole* point of pvclock_clocksource_read() to be a native
rdtsc with scaling? How does it cause that kind of insane pain?

Oh well. Some paravirt person would need to look and care.

   Linus

[PATCH v2 8/8] selftests/exec: do not install subdir as it is already created

2015-04-17 Thread Tyler Baker

Remove subdir from DEPS as it is already created at runtime. Without this,
make install fails.

Signed-off-by: Tyler Baker 
---
 tools/testing/selftests/exec/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/exec/Makefile b/tools/testing/selftests/exec/Makefile
index 4edb7d0..6b76bfd 100644
--- a/tools/testing/selftests/exec/Makefile
+++ b/tools/testing/selftests/exec/Makefile
@@ -1,6 +1,6 @@
 CFLAGS = -Wall
 BINARIES = execveat
-DEPS = execveat.symlink execveat.denatured script subdir
+DEPS = execveat.symlink execveat.denatured script
 all: $(BINARIES) $(DEPS)
 
 subdir:
-- 
2.1.4


[PATCH v2 7/8] selftest/x86: install tests

2015-04-17 Thread Tyler Baker

Include lib.mk and set TEST_PROGS where appropriate. Skip the install and test
case when CROSS_COMPILE is set.
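
The nested echo used for the EMIT_TESTS override is easier to follow in
isolation. The sketch below assumes that whatever EMIT_TESTS prints ends up
as one line of a generated run script; emit_skip is a hypothetical helper
name used only for this illustration.

```shell
#!/bin/sh
# emit_skip is a hypothetical helper, not something lib.mk provides.
# It prints a shell command; it does not print the SKIP message itself.
emit_skip() {
    echo "echo \"selftests: $1 [SKIP]\""
}

# First level: the line that would be written into the run script.
line=$(emit_skip run_x86_tests.sh)
echo "$line"

# Second level: what that line prints once the run script executes it.
eval "$line"
```

The outer echo runs at install time and the inner one at test time, which
is why the inner quotes have to be escaped.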

Cc: Andy Lutomirski 
Signed-off-by: Tyler Baker 
---
 tools/testing/selftests/x86/Makefile | 9 +
 1 file changed, 9 insertions(+)

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 9962e10..622717e 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -12,19 +12,28 @@ UNAME_M := $(shell uname -m)
 ifeq ($(CROSS_COMPILE),)
 # Always build 32-bit tests
 all: all_32
+# Install 32-bit tests
+TEST_PROGS += $(BINARIES_32) run_x86_tests.sh
 # If we're on a 64-bit host, build 64-bit tests as well
 ifeq ($(UNAME_M),x86_64)
 all: all_32 all_64
+# Install 64-bit tests
+TEST_PROGS += $(BINARIES_64)
 endif
 else
 # No dependency on all when cross building
 all:
+# Skip install and test case when not built
+override INSTALL_RULE :=
+override EMIT_TESTS :=  echo "echo \"selftests: run_x86_tests.sh [SKIP]\""
 endif
 
 all_32: check_build32 $(BINARIES_32)
 
 all_64: $(BINARIES_64)
 
+include ../lib.mk
+
 clean:
$(RM) $(BINARIES_32) $(BINARIES_64)
 
-- 
2.1.4


[PATCH v2] Makefile: Fix detection of clang when cross-compiling

2015-04-17 Thread Paul Cercueil

When the host's C compiler is clang, and when attempting to
cross-compile Linux e.g. to MIPS with mipsel-linux-gcc, the Makefile
would incorrectly detect the use of clang, which resulted in
clang-specific flags being passed to mipsel-linux-gcc.

This can be verified under Debian by installing the "clang" package,
and then using it as the default compiler with:
sudo update-alternatives --config cc

This patch moves the detection of clang after the $(CC) variable is
initialized to the name of the cross-compiler, so that the check applies
to the cross-compiler and not the host's C compiler.
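
The check itself can be tried outside of make. This is only a sketch of the
same test (run the compiler with -v and look for "clang version"); the
function name is hypothetical, and it uses grep -q where the Makefile
compares the grep -c count against 1.

```shell
#!/bin/sh
# Hypothetical wrapper around the test used in the patch: run "<compiler> -v"
# and look for the string "clang version" in its combined output.
detect_compiler() {
    if "$1" -v 2>&1 | grep -q "clang version"; then
        echo clang
    else
        echo gcc
    fi
}

# Mirror the kernel's CROSS_COMPILE/CC naming; both default to the host cc
# here, which is an assumption of this sketch.
detect_compiler "${CROSS_COMPILE:-}${CC:-cc}"
```

Running it once with CROSS_COMPILE unset and once with CROSS_COMPILE=mipsel-linux-
and CC=gcc shows why the order matters: the answer depends on which compiler
name has been assembled by the time the check runs.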

v2: Move the detection of clang after the inclusion of the
arch/*/Makefile (as they might set $(CROSS_COMPILE))

Signed-off-by: Paul Cercueil 
---
 Makefile | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/Makefile b/Makefile
index fbd43bf..e1e8c7e 100644
--- a/Makefile
+++ b/Makefile
@@ -335,15 +335,6 @@ endif
 export KBUILD_MODULES KBUILD_BUILTIN
 export KBUILD_CHECKSRC KBUILD_SRC KBUILD_EXTMOD
 
-ifneq ($(CC),)
-ifeq ($(shell $(CC) -v 2>&1 | grep -c "clang version"), 1)
-COMPILER := clang
-else
-COMPILER := gcc
-endif
-export COMPILER
-endif
-
 # Look for make include files relative to root of kernel src
 MAKEFLAGS += --include-dir=$(srctree)
 
@@ -673,6 +664,13 @@ endif
 endif
 KBUILD_CFLAGS += $(stackp-flag)
 
+ifeq ($(shell $(CC) -v 2>&1 | grep -c "clang version"), 1)
+COMPILER := clang
+else
+COMPILER := gcc
+endif
+export COMPILER
+
 ifeq ($(COMPILER),clang)
 KBUILD_CPPFLAGS += $(call cc-option,-Qunused-arguments,)
 KBUILD_CPPFLAGS += $(call cc-option,-Wno-unknown-warning-option,)
-- 
2.1.4


[PATCH v2 6/8] selftest/x86: have no dependency on all when cross building

2015-04-17 Thread Tyler Baker

If the CROSS_COMPILE is set have no dependency on all.

Cc: Andy Lutomirski 
Signed-off-by: Tyler Baker 
---
 tools/testing/selftests/x86/Makefile | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 57090ad..9962e10 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -9,13 +9,17 @@ CFLAGS := -O2 -g -std=gnu99 -pthread -Wall
 
 UNAME_M := $(shell uname -m)
 
+ifeq ($(CROSS_COMPILE),)
 # Always build 32-bit tests
 all: all_32
-
 # If we're on a 64-bit host, build 64-bit tests as well
 ifeq ($(UNAME_M),x86_64)
 all: all_32 all_64
 endif
+else
+# No dependency on all when cross building
+all:
+endif
 
 all_32: check_build32 $(BINARIES_32)
 
-- 
2.1.4


[PATCH v2 5/8] selftest/x86: build both bitnesses

2015-04-17 Thread Tyler Baker

Using uname with the processor flag option can yield 'unknown' in some cases,
so let's use the machine flag option, as it is deterministic. Add a dependency
on all_32 when building on an x86 64-bit host so that both bitnesses are
built in this case.
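
As a rough illustration of the host check, the selection can be written as a
small shell function. The function name is hypothetical and the target names
are taken from the Makefile in this patch; the rest is only a sketch.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): map a machine string, as
# printed by "uname -m", to the targets the x86 selftest Makefile would build.
targets_for_machine() {
    case "$1" in
        x86_64) echo "all_32 all_64" ;;
        *)      echo "all_32" ;;
    esac
}

# "uname -m" is specified by POSIX; "uname -p" is not, and some systems
# print "unknown" for it.
targets_for_machine "$(uname -m)"
```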

Cc: Andy Lutomirski 
Signed-off-by: Tyler Baker 
---
 tools/testing/selftests/x86/Makefile | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index f0a7918..57090ad 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -7,14 +7,14 @@ BINARIES_64 := $(TARGETS_C_BOTHBITS:%=%_64)
 
 CFLAGS := -O2 -g -std=gnu99 -pthread -Wall
 
-UNAME_P := $(shell uname -p)
+UNAME_M := $(shell uname -m)
 
 # Always build 32-bit tests
 all: all_32
 
 # If we're on a 64-bit host, build 64-bit tests as well
-ifeq ($(shell uname -p),x86_64)
-all: all_64
+ifeq ($(UNAME_M),x86_64)
+all: all_32 all_64
 endif
 
 all_32: check_build32 $(BINARIES_32)
-- 
2.1.4


Re: [PATCH V6 07/10] sched: add a macro to ref all CLONE_NEW* flags

2015-04-17 Thread Richard Guy Briggs

On 15/04/17, Peter Zijlstra wrote:
> On Fri, Apr 17, 2015 at 11:42:50AM -0400, Richard Guy Briggs wrote:
> > On 15/04/17, Peter Zijlstra wrote:
> > > On Fri, Apr 17, 2015 at 03:35:54AM -0400, Richard Guy Briggs wrote:
> > > > Added the macro CLONE_NEW_MASK_ALL to refer to all CLONE_NEW* flags.
> > > 
> > > A wee bit about why might be nice..
> > 
> > It makes the following patch much cleaner to read:
> > [PATCH V6 08/10] fork: audit on creation of new namespace(s)
> > https://lkml.org/lkml/2015/4/17/50
> > 
> > I was hoping it might also make a lot of other code cleaner, but most of
> > the other places where multiple CLONE_NEW* flags are used, not all six
> > are used together, but only 5 are used.  Ok, so it is helpful in 1 of 3:
> > 
> > It would actually be useful in check_unshare_flags():
> > https://github.com/torvalds/linux/blob/v3.17/kernel/fork.c#L1791
> > 
> > but not in copy_namespaces() or unshare_nsproxy_namespaces():
> > https://github.com/torvalds/linux/blob/v3.17/kernel/nsproxy.c#L130
> > https://github.com/torvalds/linux/blob/v3.17/kernel/nsproxy.c#L183
> 
> Right, so no objections from me on this, its just that I only saw this
> one patch in isolation without context and the changelog failed on
> rationale.

I realize you only saw a small window of this patchset, but this feels
like bike shedding about the main objective of the set...

I'll add a bit more justification and context if/when I respin for the
rest of the set.

> Does it perchance make sense to fold this patch into the next patch that
> actually makes use of it?

It would if it were the only potential user.  I don't want to bury a
surprise in something bigger.  Is there a preferred way to use such a
macro to make the other three examples cleaner, or is that just useless
churn and obfuscation?  Would there be a concise way to express all
CLONE_NEW* flags *except* user?

- RGB

--
Richard Guy Briggs 
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545

[PATCH v2 4/8] selftests/kdbus: install kdbus-test

2015-04-17 Thread Tyler Baker

Set TEST_PROGS so that kdbus-test is installed.

Cc: Greg Kroah-Hartman 
Signed-off-by: Tyler Baker 
---
 tools/testing/selftests/kdbus/Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/kdbus/Makefile b/tools/testing/selftests/kdbus/Makefile
index e21bf1f..62f8d5a 100644
--- a/tools/testing/selftests/kdbus/Makefile
+++ b/tools/testing/selftests/kdbus/Makefile
@@ -42,6 +42,8 @@ include ../lib.mk
 kdbus-test: $(OBJS)
$(CC) $(CFLAGS) $^ $(LDLIBS) -o $@
 
+TEST_PROGS := kdbus-test
+
 run_tests:
./kdbus-test --tap
 
-- 
2.1.4


[PATCH v2 2/8] selftests/ftrace: install test.d

2015-04-17 Thread Tyler Baker

The ftrace test requires the directory test.d and all of its contents to be
present during execution. Use TEST_DIRS to ensure this is copied to the
INSTALL_PATH.

Signed-off-by: Tyler Baker 
---
 tools/testing/selftests/ftrace/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/ftrace/Makefile b/tools/testing/selftests/ftrace/Makefile
index 3467206..0acbeca 100644
--- a/tools/testing/selftests/ftrace/Makefile
+++ b/tools/testing/selftests/ftrace/Makefile
@@ -1,6 +1,7 @@
 all:
 
 TEST_PROGS := ftracetest
+TEST_DIRS := test.d/
 
 include ../lib.mk
 
-- 
2.1.4


[PATCH v2 3/8] selftests/breakpoints: emit skip and omit installation when tests are not compiled

2015-04-17 Thread Tyler Baker

The breakpoints test should only be executed on x86 targets, so let's
emit a skip and omit the installation when ARCH != x86.

Signed-off-by: Tyler Baker 
---
 tools/testing/selftests/breakpoints/Makefile | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/breakpoints/Makefile b/tools/testing/selftests/breakpoints/Makefile
index 1822356..430b76d 100644
--- a/tools/testing/selftests/breakpoints/Makefile
+++ b/tools/testing/selftests/breakpoints/Makefile
@@ -8,7 +8,6 @@ ifeq ($(ARCH),x86_64)
ARCH := x86
 endif
 
-
 all:
 ifeq ($(ARCH),x86)
gcc breakpoint_test.c -o breakpoint_test
@@ -20,5 +19,11 @@ TEST_PROGS := breakpoint_test
 
 include ../lib.mk
 
+install:
+ifneq ($(ARCH),x86)
+echo "Not an x86 target, can't install breakpoints selftests"
+override EMIT_TESTS :=  echo "echo \"selftests: breakpoint_test [SKIP]\""
+endif
+
 clean:
rm -fr breakpoint_test
-- 
2.1.4


[PATCH v2 0/8] selftests: fixes for installation and cross compilation

2015-04-17 Thread Tyler Baker

This patch set fixes various issues observed when cross building and
installing selftests.

As I began investigating improving the test output format, I performed an
audit of the current tests to ensure all tests were able to execute on various
target architectures. I found that some tests did not install their binaries
and others required directories to be installed to execute properly. There
were also cases in which tests were being installed when they were never built.
With this series applied all tests compile when appropriate and install their
output properly.

I have tested this series by building, installing and deploying all selftests
to x86, arm and arm64 targets.

Changes v1 -> v2:
* Have no dependency on all when CROSS_COMPILE is set. (Andy Lutomirski)
* Added Andy on CC for all x86 test patches.
* Split up the x86 patches for better clarity.
* Rebased onto next-20150415.

This series is based on next-20150415.

Tyler Baker (8):
  selftests: copy TEST_DIRS to INSTALL_PATH
  selftests/ftrace: install test.d
  selftests/breakpoints: emit skip and omit installation when tests are
not compiled
  selftests/kdbus: install kdbus-test
  selftest/x86: build both bitnesses
  selftest/x86: have no dependency on all when cross building
  selftest/x86: install tests
  selftests/exec: do not install subdir as it is already created

 tools/testing/selftests/breakpoints/Makefile |  7 ++-
 tools/testing/selftests/exec/Makefile|  2 +-
 tools/testing/selftests/ftrace/Makefile  |  1 +
 tools/testing/selftests/kdbus/Makefile   |  2 ++
 tools/testing/selftests/lib.mk   |  3 +++
 tools/testing/selftests/x86/Makefile | 21 +
 6 files changed, 30 insertions(+), 6 deletions(-)

-- 
2.1.4


[PATCH v2 1/8] selftests: copy TEST_DIRS to INSTALL_PATH

2015-04-17 Thread Tyler Baker

Loop over all TEST_DIRS and recursively copy them to the INSTALL_PATH. Tests
such as ftrace require a directory and all of its contents to execute the
test properly, thus these directories and files need to be copied when we
perform an install.

Signed-off-by: Tyler Baker 
---
 tools/testing/selftests/lib.mk | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index 2194155..ee412ba 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -13,6 +13,9 @@ run_tests: all
 
 define INSTALL_RULE
 	mkdir -p $(INSTALL_PATH)
+	@for TEST_DIR in $(TEST_DIRS); do\
+		cp -r $$TEST_DIR $(INSTALL_PATH); \
+	done;
 	install -t $(INSTALL_PATH) $(TEST_PROGS) $(TEST_PROGS_EXTENDED) $(TEST_FILES)
 endef
 
-- 
2.1.4


Re: [RFC PATCH] fs: use a sequence counter instead of file_lock in fd_install

2015-04-17 Thread Eric Dumazet

On Thu, 2015-04-16 at 14:16 +0200, Mateusz Guzik wrote:
> Hi,
> 
> Currently obtaining a new file descriptor results in locking fdtable
> twice - once in order to reserve a slot and second time to fill it

...


>  void __fd_install(struct files_struct *files, unsigned int fd,
>   struct file *file)
>  {
> + unsigned long seq;

unsigned int seq;

>   struct fdtable *fdt;
> - spin_lock(&files->file_lock);
> - fdt = files_fdtable(files);
> - BUG_ON(fdt->fd[fd] != NULL);
> - rcu_assign_pointer(fdt->fd[fd], file);
> - spin_unlock(&files->file_lock);
> +
> + rcu_read_lock();
> + do {
> + seq = read_seqcount_begin(&files->fdt_seqcount);
> + fdt = files_fdtable_seq(files);
> + /*
> +  * Entry in the table can already be equal to file if we
> +  * had to restart and copy_fdtable picked up our update.
> +  */
> + BUG_ON(!(fdt->fd[fd] == NULL || fdt->fd[fd] == file));
> + rcu_assign_pointer(fdt->fd[fd], file);
> + smp_mb();
> + } while (__read_seqcount_retry(&files->fdt_seqcount, seq));
> + rcu_read_unlock();
>  }
>  

So one problem here is :

As soon as rcu_assign_pointer(fdt->fd[fd], file) is done, and another cpu
does an expand_fdtable() and releases files->file_lock, another cpu can
close(fd).

Then another cpu can reuse the [fd] now empty slot and install a new
file in it.

Then this cpu will crash here :

BUG_ON(!(fdt->fd[fd] == NULL || fdt->fd[fd] == file));
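
To make the sequence above concrete, here is a minimal single-threaded
replay of it in user-space C. It is only a sketch: demo_fd_table, struct
demo_file and replay_race() are made-up stand-ins, not kernel code, and the
interleaving between cpus is simply written out in program order.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_FDS 4

struct demo_file { int id; };

/* Stand-in for fdt->fd[]; hypothetical, not the kernel's fdtable. */
static struct demo_file *demo_fd_table[DEMO_NR_FDS];

/* Mirrors the condition checked by the BUG_ON() in the patch. */
static int bug_on_would_fire(unsigned int fd, struct demo_file *file)
{
	return !(demo_fd_table[fd] == NULL || demo_fd_table[fd] == file);
}

/* Returns 1 if the second pass through the retry loop would hit BUG_ON. */
int replay_race(void)
{
	struct demo_file a = { 1 }, b = { 2 };
	unsigned int fd = 3;

	demo_fd_table[fd] = NULL;

	/* cpu0: first pass of the loop in __fd_install(). */
	if (bug_on_would_fire(fd, &a))
		return -1;
	demo_fd_table[fd] = &a;

	/* cpu1: expand_fdtable() bumps the seqcount, so cpu0 must retry. */

	/* cpu2: close(fd) empties the slot. */
	demo_fd_table[fd] = NULL;

	/* cpu3: reuses the now empty slot for a different file. */
	demo_fd_table[fd] = &b;

	/* cpu0: second pass, the slot is neither NULL nor our file. */
	return bug_on_would_fire(fd, &a);
}
```

In this replay the second check sees a slot holding a file other than the
one being installed, which is the case the BUG_ON() does not allow for.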




Re: [GIT PULL] kdbus for 4.1-rc1

2015-04-17 Thread Alex Elsayed

Havoc Pennington wrote:

> Hi,
> 
> On Fri, Apr 17, 2015 at 3:27 PM, James Bottomley
>  wrote:
>>
>> This is why I think kdbus is a bad idea: it solidifies as a linux kernel
>> API something which runs counter to granular OS virtualization (and
>> something which caused Windows to fall behind Linux in the container
>> space).  Splitting out the acceleration problem and leaving the rest to
>> user space currently looks fine because the ideas Al and Andy are
>> kicking around don't cause problems with OS virtualization.
>>
> 
> I'm interested in understanding this problem (if only for my own
> curiosity) but I'm not confident I understand what you're saying
> correctly.
> 
> Can I try to explain back / ask questions and see what I have right?
> 
> I think you are saying that if an application relies on a system
> service (= any other process that runs on the system bus) then to
> virtualize that app by itself in a dedicated container, the system bus
> and the system service need to also be in the container. So the
> container ends up with a bunch of stuff in it beyond only the
> application.  Right / wrong / confused?
> 
> I also think you're saying that userspace dbus has the same issue
> (this isn't a userspace vs. kernel thing per se), the objection to
> kdbus is that it makes this issue more solidified / harder to fix?
> 
> Do you have ideas on how to go about fixing it, whether in userspace
> or kernel dbus?
> 
> Havoc

So far as I understand (and this may be wrong), this is the use case of 
kdbus "endpoints" - you'd create a (constrained) kdbus endpoint on the host, 
and then expose it to the application, such that the application uses it as 
if it were the system bus.


Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-17 Thread Andy Lutomirski

On Fri, Apr 17, 2015 at 2:28 PM, Linus Torvalds
 wrote:
> On Fri, Apr 17, 2015 at 4:39 PM, Andy Lutomirski  wrote:
>>
>> On my box, vclock_gettime using kvm-clock is about 40 ns.  An empty
>> syscall is about 33 ns.  clock_gettime *should* be around 17 ns.
>>
>> The clock_gettime syscall is about 73 ns.
>>
>> Could we figure out why clock_gettime (the syscall) is so slow
>
> If we only could profile some random program (let's call it "a.out")
> that did the syscall(__NR_gettime_syscall) a couple million times.
>
> Oh, lookie here, Santa came around:
>
>   21.83%   [k] system_call
>   12.85%   [.] syscall
>    9.76%   [k] __audit_syscall_exit
>    9.55%   [k] copy_user_enhanced_fast_string
>    4.68%   [k] __getnstimeofday64
>    4.08%   [k] syscall_trace_enter_phase1
>    3.85%   [k] __audit_syscall_entry
>    3.77%   [k] unroll_tree_refs
>    3.15%   [k] sys_clock_gettime
>    2.92%   [k] int_very_careful
>    2.73%   [.] main
>    2.35%   [k] syscall_trace_leave
>    2.28%   [k] read_tsc
>    1.73%   [k] int_restore_rest
>    1.73%   [k] int_with_check
>    1.48%   [k] syscall_return
>    1.32%   [k] dput
>    1.24%   [k] system_call_fastpath
>    1.21%   [k] syscall_return_via_sysret
>    1.21%   [k] tracesys
>    0.81%   [k] do_audit_syscall_entry
>    0.80%   [k] current_kernel_time
>    0.73%   [k] getnstimeofday64
>    0.68%   [k] path_put
>    0.66%   [k] posix_clock_realtime_get
>    0.61%   [k] int_careful
>    0.60%   [k] mntput
>    0.49%   [k] kfree
>    0.36%   [k] _copy_to_user
>    0.31%   [k] int_ret_from_sys_call_irqs_off
>
> looks to me like it's spending a lot of time in system call auditing.
> Which makes no sense to me, since none of this should be triggering
> any auditing. And there's a lot of time in low-level kernel system
> call assembly code.

Muahaha.  The auditors have invaded your system.  (I did my little
benchmark with a more sensible configuration -- see way below).

Can you send the output of:

# auditctl -s
# auditctl -l

Are you, perchance, using Fedora?  I lobbied rather heavily, and
successfully, to get the default configuration to stop auditing.
Unfortunately, the fix wasn't retroactive, so, unless you have a very
fresh install, you might want to apply the fix yourself:

https://fedorahosted.org/fesco/ticket/1311

>
> If I only remembered the name of the crazy person who said "Ok" when I
> suggested he just be the maintainer of the code since he has spent a lot
> of time sending patches for it. Something like Amdy Letorsky. No, that
> wasn't it. Hmm. It's on the tip of my tongue.
>
> Oh well. Maybe somebody can remember the guy's name. It's familiar for
> some reason. Andy?

Amdy Lumirtowsky thinks he meant to attach a condition to his
maintainerish activities: he will do his best to keep the audit code
*out* of the low-level stuff, but he's going to try to avoid ever
touching the audit code itself, because if he ever had to change it,
he might accidentally delete the entire file.

Seriously, wasn't there a TAINT_PERFORMANCE thing proposed at some
point?  I would love auditing to set some really loud global warning
that you've just done a Bad Thing (tm) performance-wise by enabling
it.

Back to timing.  With kvm-clock, I see:

  23.80%  timing_test_64  [kernel.kallsyms]   [k] pvclock_clocksource_read
  15.57%  timing_test_64  libc-2.20.so        [.] syscall
  12.39%  timing_test_64  [kernel.kallsyms]   [k] system_call
  12.35%  timing_test_64  [kernel.kallsyms]   [k] copy_user_generic_string
  10.95%  timing_test_64  [kernel.kallsyms]   [k] system_call_after_swapgs
   7.35%  timing_test_64  [kernel.kallsyms]   [k] ktime_get_ts64
   6.20%  timing_test_64  [kernel.kallsyms]   [k] sys_clock_gettime
   3.62%  timing_test_64  [kernel.kallsyms]   [k] system_call_fastpath
   2.08%  timing_test_64  timing_test_64      [.] main
   1.72%  timing_test_64  timing_test_64      [.] syscall@plt
   1.58%  timing_test_64  [kernel.kallsyms]   [k] kvm_clock_get_cycles
   1.22%  timing_test_64  [kernel.kallsyms]   [k] _copy_to_user
   0.65%  timing_test_64  [kernel.kallsyms]   [k] posix_ktime_get_ts
   0.13%  timing_test_64  [kernel.kallsyms]   [k] apic_timer_interrupt

We've got some silly indirection, a uaccess that probably didn't get
optimized very well, and the terrifying function
pvclock_clocksource_read.

By comparison, using tsc:

  19.51%  timing_test_64  libc-2.20.so   [.] syscall
  15.52%  timing_test_64  [kernel.kallsyms]  [k] system_call
  15.25%  timing_test_64  [kernel.kallsyms]  [k] copy_user_generic_string
  14.34%  timing_test_64  [kernel.kallsyms]  [k] system_call_after_swapgs
   8.66%  timing_test_64  [kernel.kallsyms]  [k] ktime_get_ts64
   6.95%  timing_test_64  [kernel.kallsyms]  [k] sys_clock_gettime
   5.93%  timing_test_64  [kernel.kallsyms]  [k] native_read_tsc
   5.12%  timing_test_64  [kernel.kallsyms]  [k] system_call_fastpath
   2.62%  timing_test_64  timing_test_64 [.] main

That's better, although the uaccess silliness is still there.

(No PEBS -- I

Re: [patch 10/10] perf_event_open.2: 4.0 update rdpmc documentation

2015-04-17 Thread Andy Lutomirski




On 04/16/2015 11:20 AM, Vince Weaver wrote:


The rdpmc instruction allows reading performance counters directly
from userspace.  Prior to Linux 4.0 any process could use this
instruction when a perf event was running, even if the process itself
did not have any open.  The following changesets changed the default
behavior so that only processes with active events can use rdpmc.

Note this change broke the ABI.  Previously:
/sys/bus/event_source/devices/cpu/rdpmc
Set to "1" meant allow across whole system.

After the change "2" means the whole system, and "1" means per-process.

Probably a better change would have been to add "2" to mean per-process
and make that the default setting.  Probably too late to fix that now.


Good point.  I wish you'd thought of that sooner :(

--Andy
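
For reference, the before/after meaning of the sysfs value described above
can be written down as a small lookup. This is a sketch only:
rdpmc_meaning() and its pre_4_0 parameter are hypothetical names used for
illustration, not a kernel or libc interface, and the strings are just
summaries of the text in this thread.

```c
#include <assert.h>
#include <string.h>

/*
 * Describe the value read from /sys/bus/event_source/devices/cpu/rdpmc.
 * pre_4_0 != 0 selects the semantics from before Linux 4.0.
 */
const char *rdpmc_meaning(int value, int pre_4_0)
{
	if (value == 0)
		return "rdpmc disabled";
	if (pre_4_0)
		return value == 1 ? "any process" : "undefined";
	if (value == 1)
		return "processes with active perf events";
	if (value == 2)
		return "any process";
	return "undefined";
}
```

The ABI change is visible in the value 1: the same setting selects
different behaviour depending on the kernel version.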



commit a66734297f78707ce39d756b656bfae861d53f62
Author: Andy Lutomirski 

perf/x86: Add /sys/devices/cpu/rdpmc=2 to allow rdpmc for all tasks

commit 7911d3f7af14a614617e38245fedf98a724e46a9
Author: Andy Lutomirski 

perf/x86: Only allow rdpmc if a perf_event is mapped

Signed-off-by: Andy Lutomirski 

Signed-off-by: Peter Zijlstra (Intel) 

Cc: Paul Mackerras 
Cc: Arnaldo Carvalho de Melo 

Cc: Kees Cook 
Cc: Andrea Arcangeli 
Cc: Vince Weaver 
Cc: "hillf.zj" 
Cc: Valdis Kletnieks 
Cc: Linus Torvalds 

Link: 
http://lkml.kernel.org/r/caac3c1c707dcca48ecbc35f4def21495856f479.1414190806.git.luto-klttt9wpgjjwatoyat5...@public.gmane.org
Signed-off-by: Ingo Molnar 


Signed-off-by: Vince Weaver 


diff --git a/man2/perf_event_open.2 b/man2/perf_event_open.2
index 01ee579..c854d21 100644
--- a/man2/perf_event_open.2
+++ b/man2/perf_event_open.2
@@ -2377,6 +2377,16 @@ Support for this can be detected with the
  .I cap_usr_rdpmc
  field in the mmap page; documentation on how
  to calculate event values can be found in that section.
+
+Originally when rdpmc support was enabled, any process (not just ones
+with an active perf event) could use the rdpmc instruction to access
+the counters.
+Starting with Linux 4.0
+.\" 7911d3f7af14a614617e38245fedf98a724e46a9
+rdpmc support is only enabled if an event is currently enabled
+in a process' context.
+To restore the old behavior, write the value 2 to
+.IR /sys/devices/cpu/rdpmc .
  .SS perf_event ioctl calls
  .PP
  Various ioctls act on
@@ -2552,11 +2562,18 @@ field of
  .I perf_event_attr
  to indicate that you wish to use this PMU.
  .TP
-.IR /sys/bus/event_source/devices/*/rdpmc " (since Linux 3.4)"
+.IR /sys/bus/event_source/devices/cpu/rdpmc " (since Linux 3.4)"
  .\" commit 0c9d42ed4cee2aa1dfc3a260b741baae8615744f
  If this file is 1, then direct user-space access to the
  performance counter registers is allowed via the rdpmc instruction.
  This can be disabled by echoing 0 to the file.
+
+As of Linux 4.0
+.\" a66734297f78707ce39d756b656bfae861d53f62
+.\" 7911d3f7af14a614617e38245fedf98a724e46a9
+the behavior has changed, so that 1 now means only allow access
+to processes with active perf events, with 2 indicating the old
+allow-anyone-access behavior.
  .TP
  .IR /sys/bus/event_source/devices/*/format/ " (since Linux 3.4)"
  .\" commit 641cc938815dfd09f8fa1ec72deb814f0938ac33
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwxl29ty76z2rm5m...@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-17 Thread Linus Torvalds

On Fri, Apr 17, 2015 at 4:39 PM, Andy Lutomirski  wrote:
>
> On my box, vclock_gettime using kvm-clock is about 40 ns.  An empty
> syscall is about 33 ns.  clock_gettime *should* be around 17 ns.
>
> The clock_gettime syscall is about 73 ns.
>
> Could we figure out why clock_gettime (the syscall) is so slow

If we only could profile some random program (let's call it "a.out")
that did the syscall(__NR_gettime_syscall) a couple million times.

Oh, lookie here, Santa came around:

  21.83%   [k] system_call
  12.85%   [.] syscall
   9.76%   [k] __audit_syscall_exit
   9.55%   [k] copy_user_enhanced_fast_string
   4.68%   [k] __getnstimeofday64
   4.08%   [k] syscall_trace_enter_phase1
   3.85%   [k] __audit_syscall_entry
   3.77%   [k] unroll_tree_refs
   3.15%   [k] sys_clock_gettime
   2.92%   [k] int_very_careful
   2.73%   [.] main
   2.35%   [k] syscall_trace_leave
   2.28%   [k] read_tsc
   1.73%   [k] int_restore_rest
   1.73%   [k] int_with_check
   1.48%   [k] syscall_return
   1.32%   [k] dput
   1.24%   [k] system_call_fastpath
   1.21%   [k] syscall_return_via_sysret
   1.21%   [k] tracesys
   0.81%   [k] do_audit_syscall_entry
   0.80%   [k] current_kernel_time
   0.73%   [k] getnstimeofday64
   0.68%   [k] path_put
   0.66%   [k] posix_clock_realtime_get
   0.61%   [k] int_careful
   0.60%   [k] mntput
   0.49%   [k] kfree
   0.36%   [k] _copy_to_user
   0.31%   [k] int_ret_from_sys_call_irqs_off

looks to me like it's spending a lot of time in system call auditing.
Which makes no sense to me, since none of this should be triggering
any auditing. And there's a lot of time in low-level kernel system
call assembly code.
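
The test program itself was not posted. A loop of the kind described above
might look like the following sketch, which assumes Linux with glibc and
uses the real SYS_clock_gettime number in place of the placeholder
__NR_gettime_syscall; avg_ns_per_syscall() is a made-up helper name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for syscall() */
#endif
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Average wall-clock nanoseconds per raw clock_gettime system call. */
double avg_ns_per_syscall(long iterations)
{
	struct timespec start, end, ts;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iterations; i++)
		syscall(SYS_clock_gettime, CLOCK_REALTIME, &ts);
	clock_gettime(CLOCK_MONOTONIC, &end);

	return ((end.tv_sec - start.tv_sec) * 1e9 +
		(end.tv_nsec - start.tv_nsec)) / iterations;
}
```

Calling avg_ns_per_syscall(2000000) from main() and profiling the result
with perf should give a breakdown of the same general shape, although the
exact symbols will depend on the kernel configuration.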

If I only remembered the name of the crazy person who said "Ok" when I
suggested he just be the maintainer of the code since he has spent a lot
of time sending patches for it. Something like Amdy Letorsky. No, that
wasn't it. Hmm. It's on the tip of my tongue.

Oh well. Maybe somebody can remember the guy's name. It's familiar for
some reason. Andy?

 Linus

Re: [PATCH] rcu: small rcu_dereference doc update

2015-04-17 Thread Paul E. McKenney

On Fri, Apr 17, 2015 at 01:13:18PM -0400, Pranith Kumar wrote:
> On Fri, Apr 17, 2015 at 12:15 PM, Paul E. McKenney
>  wrote:
> > Sounds like a good thought for a separate patch.  Please take a look
> > through the rest of the documentation -- this might well be the right
> > place for such an example, but there might well be a better place.
> > Is this issue mentioned in the checklist?  If not, another item might
> > be good.
> 
> Yup, I will take a look and send a patch for this.

Sounds good!

Thanx, Paul


Re: [V4.1] Regression: Bluetooth mouse not working.

2015-04-17 Thread Marcel Holtmann

Hi Linus,

>> accepting all flags regardless was an oversight on my part in the first 
>> place. What this patch tried to do is to limit it to what userspace is 
>> currently actually using. My mistake was to look only at BlueZ 5.x userspace 
>> and not at BlueZ 4.x userspace.
> 
> So what about anybody else? Android doesn't use BlueZ, afaik. Any
> other direct accesses?

this interface is for bluetoothd (Bluetooth userspace daemon) since you need to 
do a lot of initial setup before you can hand this over to the kernel to drive 
HID. On Android this was never used. And even BlueZ for Android (replacement 
for Bluedroid) is not using it today either.

Google Fiber (their set-top boxes) actually moved this all over to /dev/uhid 
now since it gives them better re-connect experience for their remotes.

> If we already know that BlueZ 4.x did something else, what makes you
> so sure that this now covers all cases?

I am certain since nothing other than bluetoothd ever used this interface.

> The thing is, the bluetooth code has clearly never cared about these
> bits before. Is there any real reason to think that people haven't
> passed in garbage? Do we even know that those flags were *initialized*
> at all by user space in all use cases?
> 
> So I'm ok with trying to fix things up, but I have to say that if the
> fixed-up case also causes problems (because there was some other case
> that you didn't think of), I'm going to be pissed off, and I'm going
> to expect you to *jump* on it, and revert the whole thing.

The reason why I started cleaning this up is because there is an overlay with 
internal and external flags mixed together. This is clearly a bug, but sadly 
it can also open up security issues, since we clearly do not want to allow 
userspace to mess with internal flags. That is actually worse.

My viewpoint is that reverting the whole patch is actually not helping here 
either. So either we take the patch that I just sent around to fix the breakage 
that I caused with BlueZ 4.x userspace. Or as an alternative we keep allowing 
userspace to provide whatever flags it wants, but clear all unknown ones before 
using them in the HIDP logic. My intent was to make this old code less 
vulnerable.

Is one of these options acceptable for you compared to reverting the whole 
patch?
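
Roughly, the two options above differ as in the following sketch. The flag
names and bit positions are made up for illustration and are NOT the real
HIDP definitions; demo_check_flags() and demo_sanitize_flags() are
hypothetical helpers.

```c
#include <assert.h>

/* Hypothetical external flags; the real HIDP layout is not shown here. */
#define DEMO_FLAG_A		(1u << 0)
#define DEMO_FLAG_B		(1u << 1)
#define DEMO_VALID_FLAGS	(DEMO_FLAG_A | DEMO_FLAG_B)

/* Option 1: refuse any request that carries unknown bits. */
int demo_check_flags(unsigned int flags)
{
	return (flags & ~DEMO_VALID_FLAGS) ? -1 : 0;
}

/* Option 2: accept anything, but drop unknown bits before using them. */
unsigned int demo_sanitize_flags(unsigned int flags)
{
	return flags & DEMO_VALID_FLAGS;
}
```

Either way userspace can no longer reach bits outside the valid mask; the
second variant just never fails a caller that passes garbage.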

Regards

Marcel


Re: [PATCH 1/5] clocksource: st_lpc: Add LPC timer as a clocksource.

2015-04-17 Thread Paul Bolle

On Fri, 2015-04-17 at 11:50 +0100, Peter Griffin wrote:
> --- a/drivers/clocksource/Kconfig
> +++ b/drivers/clocksource/Kconfig

> +config CLKSRC_ST_LPC_CLOCK
> + bool
> + depends on ARCH_STI
> + select CLKSRC_OF if OF
> + help
> +   Enable this option to use the Low Power controller timer
> +   as clock source.
> +
> +config CLKSRC_ST_LPC_TIMER_SCHED_CLOCK
> + bool
> + depends on ST_LPC_CLOCK

It looks like you meant
 depends on CLKSRC_ST_LPC_CLOCK

> + default y
> + help
> +   Use Low Power controller timer clock source as sched_clock

> --- a/drivers/clocksource/Makefile
> +++ b/drivers/clocksource/Makefile

> +obj-$(CONFIG_CLKSRC_ST_LPC_CLOCK)+= st_lpc.o

> --- /dev/null
> +++ b/drivers/clocksource/st_lpc.c

> +#ifdef CONFIG_CLKSRC_LPC_TIMER_SCHED_CLOCK

#ifdef CONFIG_CLKSRC_ST_LPC_TIMER_SCHED_CLOCK here?

> +static u64 notrace st_lpc_sched_clock_read(void)
> +{
> + return st_lpc_counter_read();
> +}
> +#endif

> +#ifdef CONFIG_CLKSRC_LPC_TIMER_SCHED_CLOCK

Again, #ifdef CONFIG_CLKSRC_ST_LPC_TIMER_SCHED_CLOCK here?

> + sched_clock_register(st_lpc_sched_clock_read, 64, rate);
> +#endif

Assuming the above suggestions are correct: checkkconfigsymbols.py, as
shipped in linux-next, helps detect stuff like this. See
scripts/checkkconfigsymbols.py --help.
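
The reason this kind of typo goes unnoticed is that the preprocessor
treats an unknown symbol as simply undefined. A stand-alone sketch of the
effect follows; the function names are hypothetical, and the #define
stands in for what Kconfig would generate with the option enabled.

```c
#include <assert.h>

/* Stand-in for the Kconfig-generated define when the option is set. */
#define CONFIG_CLKSRC_ST_LPC_TIMER_SCHED_CLOCK 1

/* Guard spelled as in the patch: returns 1 if the code was compiled in. */
int guarded_by_patch_spelling(void)
{
#ifdef CONFIG_CLKSRC_LPC_TIMER_SCHED_CLOCK	/* "ST_" is missing */
	return 1;
#else
	return 0;
#endif
}

/* Guard spelled as in the Kconfig entry. */
int guarded_by_kconfig_spelling(void)
{
#ifdef CONFIG_CLKSRC_ST_LPC_TIMER_SCHED_CLOCK
	return 1;
#else
	return 0;
#endif
}
```

With the misspelled guard the block is silently dropped, without any
compiler diagnostic.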

Thanks,


Paul Bolle


Re: [PATCH] blk: clean up plug

2015-04-17 Thread Jeff Moyer

Shaohua Li  writes:

>> Also, Jens or Shaohua or anyone, please review my blk-mq plug fix (patch
>> 1/2 of aforementioned thread).  ;)
>
> You are not alone :), I posted 2 times too
> http://marc.info/?l=linux-kernel&m=142627559617005&w=2

Oh, sorry!  I think Jens had mentioned that you had a patch that touched
that code.  I took a quick look at it, and I think the general idea is
good.  I'll take a closer look next week, and I'll also give it to our
performance team for testing.  Hopefully I can follow-up on that patch
by the end of next week.

Cheers,
Jeff

Re: [PATCH v9 3/4] cgroups: allow a cgroup subsystem to reject a fork

2015-04-17 Thread Aleksa Sarai

>> Do you mean like this?
>>
>> #define SUBSYS_TAG_COUNT(_tag) (CGROUP_ ## _tag ## _END - CGROUP_ ##
>> _tag ## _START)
>>
>> That's fine I guess, I just wanted to match CGROUP_SUBSYS_COUNT in
>> semantics, but I'll do that if you prefer it that way.
>
> Not even that, just do it manually.
>
> #define CGROUP_TAGNAME_COUNT (CGROUP_TAGNAME_END - CGROUP_TAGNAME_START)
>
> At maximum, we're only gonna have a few of these.  No reason to be
> smart about it.

Yeah, that's fair I guess.

--
Aleksa Sarai (cyphar)
www.cyphar.com

Re: [V4.1] Regression: Bluetooth mouse not working.

2015-04-17 Thread Linus Torvalds

On Fri, Apr 17, 2015 at 4:35 PM, Marcel Holtmann  wrote:
>
> accepting all flags regardless was an oversight on my part in the first 
> place. What this patch tried to do is to limit it to what userspace is 
> currently actually using. My mistake was to look only at BlueZ 5.x userspace 
> and not at BlueZ 4.x userspace.

So what about anybody else? Android doesn't use BlueZ, afaik. Any
other direct accesses?

If we already know that BlueZ 4.x did something else, what makes you
so sure that this now covers all cases?

The thing is, the bluetooth code has clearly never cared about these
bits before. Is there any real reason to think that people haven't
passed in garbage? Do we even know that those flags were *initialized*
at all by user space in all use cases?

So I'm ok with trying to fix things up, but I have to say that if the
fixed-up case also causes problems (because there was some other case
that you didn't think of), I'm going to be pissed off, and I'm going
to expect you to *jump* on it, and revert the whole thing.

Ok?

  Linus

Re: [PATCH] blk: clean up plug

2015-04-17 Thread Shaohua Li

On Fri, Apr 17, 2015 at 04:54:40PM -0400, Jeff Moyer wrote:
> Shaohua Li  writes:
> 
> > Current code looks like inner plug gets flushed with a
> > blk_finish_plug(). Actually it's a nop. All requests/callbacks are added
> > to current->plug, while only outmost plug is assigned to current->plug.
> > So inner plug always has empty request/callback list, which makes
> > blk_flush_plug_list() a nop. This tries to make the code more clear.
> >
> > Signed-off-by: Shaohua Li 
> 
> Hi, Shaohua,
> 
> I agree that this looks like a clean-up with no behavioral change, and
> it looks good to me.  However, it does make me scratch my head about the
> numbers I was seeing.  Here's the table from that other email thread[1]:
> 
> device |  vanilla  |   patch1   | dio-noplug | noflush-nested
> -------+-----------+------------+------------+---------------
> rssda  |  701,684  | 1,168,527  | 1,342,177  |   1,297,612
>        |   100%    |    +66%    |    +91%    |     +85%
> vdb0   |  358,264  |  902,913   |  906,850   |    922,327
>        |   100%    |   +152%    |   +153%    |     +157%
> Patch1 refers to the first patch in this series, which fixes the merge
> logic for single-queue blk-mq devices.  Each column after that includes
> that first patch.  In dio-noplug, I removed the blk_plug from the
> direct-io code path (so there is no nesting at all).  This is a control,
> since it is what I expect the outcome of the noflush-nested column to
> actually be.  Then, the noflush-nested column leaves the blk_plug in
> place in the dio code, but includes the patch that prevents nested
> blk_plug's from being flushed.  All numbers are the average of 5 runs.
> With the exception of the vanilla run on rssda (the first run was
> faster, causing the average to go up), the standard deviation is very
> small.
> 
> For the dio-noplug column, if the inner plug really was a noop, then why
> would we see any change in performance?  Like I said, I agree with your
> reading of the code and the patch.  Color me confused.  I'll poke at it
> more next week.  For now, I think your patch is fine.
> 
> Reviewed-by: Jeff Moyer 

Thanks! I don't know either why your second patch makes a difference.
 
> Also, Jens or Shaohua or anyone, please review my blk-mq plug fix (patch
> 1/2 of aforementioned thread).  ;)

You are not alone :), I posted 2 times too
http://marc.info/?l=linux-kernel&m=142627559617005&w=2


Thanks,
Shaohua

Re: [PATCH] perf: Ensure symbols for plugins are exported

2015-04-17 Thread Mathias Krause

On 17 April 2015 at 17:34, Jiri Olsa  wrote:
> On Sun, Apr 12, 2015 at 06:00:51PM +0200, Mathias Krause wrote:
>> When building perf with perl or python support it implicitly gets linked
>> with the -export-dynamic linker option through the additional linker
>> flags, namely with -Wl,-E via perl or -Xlinker -export-dynamic via
>> python. That flag is essential for the traceevent plugin support so we
>> shouldn't rely on adding it implicitly.
>>
>> Ensure perf's exported symbols can be used by dlopen()ed plugins by
>> unconditionally adding this flag when linking perf. Otherwise plugins
>> won't be able to access symbols in the perf binary.
>>
>> This fixes the following warning / bug when trying to load plugins:
>>
>>   Warning: could not load plugin 
>> '/home/minipli/.traceevent/plugins/plugin_xen.so'
>>   /home/minipli/.traceevent/plugins/plugin_xen.so: undefined symbol: 
>> trace_seq_printf
>>   [...]
>
> hum, not sure now how -export-dynamic works but should this
> be rather in traceevent lib then?

Nope. Here's the relevant excerpt from ld(1):

   --export-dynamic
   [...]
   If you use "dlopen" to load a dynamic object which needs to
   refer back to the symbols defined by the program, rather
   than some other dynamic object, then you will probably need
   to use this option when linking the program itself.

So that flag has to be in the linker call for perf, as the plugins
(which are dlopen()ed) want to access symbols within the perf binary
(or more specific, within libperf.a / libtraceevent.a statically
linked into perf).


Regards,
Mathias

Re: [PATCH v9 3/4] cgroups: allow a cgroup subsystem to reject a fork

2015-04-17 Thread Tejun Heo

On Sat, Apr 18, 2015 at 06:48:55AM +1000, Aleksa Sarai wrote:
> Do you mean like this?
> 
> #define SUBSYS_TAG_COUNT(_tag) (CGROUP_ ## _tag ## _END - CGROUP_ ##
> _tag ## _START)
> 
> That's fine I guess, I just wanted to match CGROUP_SUBSYS_COUNT in
> semantics, but I'll do that if you prefer it that way.

Not even that, just do it manually.

#define CGROUP_TAGNAME_COUNT (CGROUP_TAGNAME_END - CGROUP_TAGNAME_START)

At maximum, we're only gonna have a few of these.  No reason to be
smart about it.
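
Side by side, the two variants look like the sketch below. The DEMO tag
and its enum values are hypothetical placeholders, not actual cgroup
subsystem markers.

```c
#include <assert.h>

/* Hypothetical start/end markers for one tagged group of subsystems. */
enum {
	CGROUP_DEMO_START = 3,
	CGROUP_DEMO_END = 5,
};

/* Generic variant: builds both marker names by token pasting. */
#define SUBSYS_TAG_COUNT(_tag) \
	(CGROUP_ ## _tag ## _END - CGROUP_ ## _tag ## _START)

/* Manual variant: one plain define per tag. */
#define CGROUP_DEMO_COUNT (CGROUP_DEMO_END - CGROUP_DEMO_START)

int demo_count_generic(void)
{
	return SUBSYS_TAG_COUNT(DEMO);
}

int demo_count_manual(void)
{
	return CGROUP_DEMO_COUNT;
}
```

Both evaluate to the same constant; the manual form costs one extra line
per tag, and a search for CGROUP_DEMO_COUNT finds its definition directly.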

-- 
tejun

Re: [PATCH] blk: clean up plug

2015-04-17 Thread Jeff Moyer

Shaohua Li  writes:

> Current code looks like inner plug gets flushed with a
> blk_finish_plug(). Actually it's a nop. All requests/callbacks are added
> to current->plug, while only outmost plug is assigned to current->plug.
> So inner plug always has empty request/callback list, which makes
> blk_flush_plug_list() a nop. This tries to make the code more clear.
>
> Signed-off-by: Shaohua Li 

Hi, Shaohua,

I agree that this looks like a clean-up with no behavioral change, and
it looks good to me.  However, it does make me scratch my head about the
numbers I was seeing.  Here's the table from that other email thread[1]:

device |  vanilla  |   patch1   |  dio-noplug  |  noflush-nested
-------+-----------+------------+--------------+----------------
rssda  |  701,684  | 1,168,527  |  1,342,177   |    1,297,612
       |    100%   |    +66%    |     +91%     |      +85%
vdb0   |  358,264  |   902,913  |    906,850   |      922,327
       |    100%   |   +152%    |    +153%     |     +157%

Patch1 refers to the first patch in this series, which fixes the merge
logic for single-queue blk-mq devices.  Each column after that includes
that first patch.  In dio-noplug, I removed the blk_plug from the
direct-io code path (so there is no nesting at all).  This is a control,
since it is what I expect the outcome of the noflush-nested column to
actually be.  Then, the noflush-nested column leaves the blk_plug in
place in the dio code, but includes the patch that prevents nested
blk_plug's from being flushed.  All numbers are the average of 5 runs.
With the exception of the vanilla run on rssda (the first run was
faster, causing the average to go up), the standard deviation is very
small.
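
The percentage rows in the table appear to be the change relative to the
vanilla column. A small sketch of that arithmetic, where percent_change() is a
helper name made up for this illustration:

```c
/*
 * Relative change of a patched result against the vanilla baseline,
 * in percent. Illustrative helper, not part of any benchmark tool.
 */
static double percent_change(double vanilla, double patched)
{
	return (patched - vanilla) / vanilla * 100.0;
}
```

For example, percent_change(358264, 902913) gives the vdb0 patch1 figure and
percent_change(701684, 1342177) gives the rssda dio-noplug figure.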

For the dio-noplug column, if the inner plug really was a noop, then why
would we see any change in performance?  Like I said, I agree with your
reading of the code and the patch.  Color me confused.  I'll poke at it
more next week.  For now, I think your patch is fine.

Reviewed-by: Jeff Moyer 

Also, Jens or Shaohua or anyone, please review my blk-mq plug fix (patch
1/2 of aforementioned thread).  ;)

-Jeff

[1] https://lkml.org/lkml/2015/4/16/366

> ---
>  block/blk-core.c | 24 
>  1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 794c3e7..d3161f3 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -3018,21 +3018,20 @@ void blk_start_plug(struct blk_plug *plug)
>  {
>   struct task_struct *tsk = current;
>  
> + /*
> +  * If this is a nested plug, don't actually assign it.
> +  */
> + if (tsk->plug)
> + return;
> +
>   INIT_LIST_HEAD(&plug->list);
>   INIT_LIST_HEAD(&plug->mq_list);
>   INIT_LIST_HEAD(&plug->cb_list);
> -
>   /*
> -  * If this is a nested plug, don't actually assign it. It will be
> -  * flushed on its own.
> +  * Store ordering should not be needed here, since a potential
> +  * preempt will imply a full memory barrier
>*/
> - if (!tsk->plug) {
> - /*
> -  * Store ordering should not be needed here, since a potential
> -  * preempt will imply a full memory barrier
> -  */
> - tsk->plug = plug;
> - }
> + tsk->plug = plug;
>  }
>  EXPORT_SYMBOL(blk_start_plug);
>  
> @@ -3179,10 +3178,11 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>  
>  void blk_finish_plug(struct blk_plug *plug)
>  {
> + if (plug != current->plug)
> + return;
>   blk_flush_plug_list(plug, false);
>  
> - if (plug == current->plug)
> - current->plug = NULL;
> + current->plug = NULL;
>  }
>  EXPORT_SYMBOL(blk_finish_plug);
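
The patched start/finish logic quoted above can be modeled in userspace to
check the claim that a nested plug never flushes anything. This is a sketch
only: the model_* types and functions are made up for illustration, the
request list is reduced to a counter, and nothing here is kernel code.

```c
#include <stddef.h>

/* Stand-in for struct blk_plug: the request list is just a counter. */
struct model_plug {
	int pending;
};

/* Stand-in for the task: the active plug plus a count of real flushes. */
struct model_task {
	struct model_plug *plug;
	int flushes;
};

/* Mirrors the patched blk_start_plug(): a nested plug is never assigned. */
static void model_start_plug(struct model_task *tsk, struct model_plug *plug)
{
	if (tsk->plug)
		return;
	plug->pending = 0;
	tsk->plug = plug;
}

/* Requests always land on the task's active (outermost) plug. */
static void model_queue_request(struct model_task *tsk)
{
	if (tsk->plug)
		tsk->plug->pending++;
}

/* Mirrors the patched blk_finish_plug(): only the outermost plug flushes. */
static void model_finish_plug(struct model_task *tsk, struct model_plug *plug)
{
	if (plug != tsk->plug)
		return;
	tsk->flushes++;
	plug->pending = 0;
	tsk->plug = NULL;
}
```

Starting an outer and an inner plug, queueing a request and finishing the inner
plug leaves the request pending on the outer plug; only finishing the outer
plug counts as a flush.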

Re: [PATCH v9 3/4] cgroups: allow a cgroup subsystem to reject a fork

2015-04-17 Thread Aleksa Sarai

>> >> Do you also want me to completely drop the COUNT macro? IMO it makes
>> >> the CGROUP__COUNT consolidation much nicer.
>> >
>> > What's wrong with simply having start and end tags?
>>
>> Because you'd have to write (CGROUP_TAG_END - CGROUP_TAG_START) every
>> time? It's a small addition and it makes referencing the range of a
>> tagged section much easier.
>
> Wouldn't loops look more like
>
> for (subsys = CGROUP_TAG_START; subsys < CGROUP_TAG_END; subsys++)

Sorry, I meant for defining arrays. `state[CGROUP_TAG_END -
CGROUP_TAG_START]` is just more annoying to type and read than
`state[CGROUP_TAG_COUNT]`.

> And even if not, just define a separate macro for the length.  It's
> not like we're gonna have a lot of tags.

Do you mean like this?

#define SUBSYS_TAG_COUNT(_tag) (CGROUP_ ## _tag ## _END - CGROUP_ ## _tag ## _START)

That's fine I guess, I just wanted to match CGROUP_SUBSYS_COUNT in
semantics, but I'll do that if you prefer it that way.
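
As a sketch of how the token-pasting variant would expand, assuming a
hypothetical tag DEMO whose start and end values are invented for this example:

```c
#include <stddef.h>

/* Hypothetical tag boundaries; the values are made up for illustration. */
enum {
	CGROUP_DEMO_START = 2,
	CGROUP_DEMO_END = 5,
};

/* SUBSYS_TAG_COUNT(DEMO) expands to (CGROUP_DEMO_END - CGROUP_DEMO_START). */
#define SUBSYS_TAG_COUNT(_tag) (CGROUP_ ## _tag ## _END - CGROUP_ ## _tag ## _START)

/* The result is an integer constant expression, so it can size an array. */
static int demo_state[SUBSYS_TAG_COUNT(DEMO)];

/* Number of elements in demo_state, derived from the array itself. */
static size_t demo_state_len(void)
{
	return sizeof(demo_state) / sizeof(demo_state[0]);
}
```

The cost of this form is that grepping for CGROUP_DEMO_END no longer finds the
place where the count is computed.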

--
Aleksa Sarai (cyphar)
www.cyphar.com

Re: [PATCH v9 3/4] cgroups: allow a cgroup subsystem to reject a fork

2015-04-17 Thread Tejun Heo

On Sat, Apr 18, 2015 at 06:35:41AM +1000, Aleksa Sarai wrote:
> >> Do you also want me to completely drop the COUNT macro? IMO it makes
> >> the CGROUP__COUNT consolidation much nicer.
> >
> > What's wrong with simply having start and end tags?
> 
> Because you'd have to write (CGROUP_TAG_END - CGROUP_TAG_START) every
> time? It's a small addition and it makes referencing the range of a
> tagged section much easier.

Wouldn't loops look more like

for (subsys = CGROUP_TAG_START; subsys < CGROUP_TAG_END; subsys++)

And even if not, just define a separate macro for the length.  It's
not like we're gonna have a lot of tags.

Thanks.

-- 
tejun

Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-17 Thread Andy Lutomirski

On Fri, Apr 17, 2015 at 1:18 PM, Marcelo Tosatti  wrote:
> On Fri, Apr 17, 2015 at 09:57:12PM +0200, Paolo Bonzini wrote:
>>
>>
>> >> From 4eb9d7132e1990c0586f28af3103675416d38974 Mon Sep 17 00:00:00 2001
>> >> From: Paolo Bonzini 
>> >> Date: Fri, 17 Apr 2015 14:57:34 +0200
>> >> Subject: [PATCH] sched: add CONFIG_TASK_MIGRATION_NOTIFIER
>> >>
>> >> The task migration notifier is only used in x86 paravirt.  Make it
>> >> possible to compile it out.
>> >>
>> >> While at it, move some code around to ensure tmn is filled from CPU
>> >> registers.
>> >>
>> >> Signed-off-by: Paolo Bonzini 
>> >> ---
>> >>  arch/x86/Kconfig| 1 +
>> >>  init/Kconfig| 3 +++
>> >>  kernel/sched/core.c | 9 -
>> >>  3 files changed, 12 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> >> index d43e7e1c784b..9af252c8698d 100644
>> >> --- a/arch/x86/Kconfig
>> >> +++ b/arch/x86/Kconfig
>> >> @@ -649,6 +649,7 @@ if HYPERVISOR_GUEST
>> >>
>> >>  config PARAVIRT
>> >>bool "Enable paravirtualization code"
>> >> +  select TASK_MIGRATION_NOTIFIER
>> >>---help---
>> >>  This changes the kernel so it can modify itself when it is run
>> >>  under a hypervisor, potentially improving performance significantly
>> >> diff --git a/init/Kconfig b/init/Kconfig
>> >> index 3b9df1aa35db..891917123338 100644
>> >> --- a/init/Kconfig
>> >> +++ b/init/Kconfig
>> >> @@ -2016,6 +2016,9 @@ source "block/Kconfig"
>> >>  config PREEMPT_NOTIFIERS
>> >>bool
>> >>
>> >> +config TASK_MIGRATION_NOTIFIER
>> >> +  bool
>> >> +
>> >>  config PADATA
>> >>depends on SMP
>> >>bool
>> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> >> index f9123a82cbb6..c07a53aa543c 100644
>> >> --- a/kernel/sched/core.c
>> >> +++ b/kernel/sched/core.c
>> >> @@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
>> >>rq_clock_skip_update(rq, true);
>> >>  }
>> >>
>> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
>> >>  static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
>> >>
>> >>  void register_task_migration_notifier(struct notifier_block *n)
>> >>  {
>> >>atomic_notifier_chain_register(&task_migration_notifier, n);
>> >>  }
>> >> +#endif
>> >>
>> >>  #ifdef CONFIG_SMP
>> >>  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>> >> @@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>> >>trace_sched_migrate_task(p, new_cpu);
>> >>
>> >>if (task_cpu(p) != new_cpu) {
>> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
>> >>struct task_migration_notifier tmn;
>> >> +  int from_cpu = task_cpu(p);
>> >> +#endif
>> >>
>> >>if (p->sched_class->migrate_task_rq)
>> >>p->sched_class->migrate_task_rq(p, new_cpu);
>> >>p->se.nr_migrations++;
>> >>perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0);
>> >>
>> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
>> >>tmn.task = p;
>> >> -  tmn.from_cpu = task_cpu(p);
>> >> +  tmn.from_cpu = from_cpu;
>> >>tmn.to_cpu = new_cpu;
>> >>
>> >>atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
>> >> +#endif
>> >>}
>> >>
>> >>__set_task_cpu(p, new_cpu);
>> >> --
>> >> 2.3.5
>> >
>> > Paolo,
>> >
>> > Please revert the patch -- this can be fixed properly in the host,
>> > which also conforms to the KVM guest/host documented protocol.
>> >
>> > Radim submitted a patch to kvm@ to split
>> > the kvm_write_guest in two with a barrier in between, I think.
>> >
>> > I'll review that patch.
>>
>> You're thinking of
>> http://article.gmane.org/gmane.linux.kernel.stable/129187, but see
>> Andy's reply:
>>
>> >
>> > I think there are at least two ways that would work:
>> >
>> > a) If KVM incremented version as advertised:
>> >
>> > cpu = getcpu();
>> > pvti = pvti for cpu;
>> >
>> > ver1 = pvti->version;
>> > check stable bit;
>> > rdtsc_barrier, rdtsc, read scale, shift, etc.
>> > if (getcpu() != cpu) retry;
>> > if (pvti->version != ver1) retry;
>> >
>> > I think this is safe because we're guaranteed that there was an
>> > interval (between the two version reads) in which the vcpu we think
>> > we're on was running and the kvmclock data was valid and marked
>> > stable, and we know that the tsc we read came from that interval.
>> >
>> > Note: rdtscp isn't needed. If we're stable, it makes no difference
>> > which cpu's tsc we actually read.
>> >
>> > b) If version remains buggy but we use this migrations_from hack:
>> >
>> > cpu = getcpu();
>> > pvti = pvti for cpu;
>> > m1 = pvti->migrations_from;
>> > barrier();
>> >
>> > ver1 = pvti->version;
>> > check stable bit;
>> > rdtsc_barrier, rdtsc, read scale, shift, etc.
>> > if (getcpu() != cpu) retry;
>> > if (pvti->version != ver1) retry;  /* probably not really needed */
>> >
>> > barrier();
>> > if (pvti->migrations_from != m1) retry;
>> >
>> > This is just like (a), except that
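
The retry loop in option (a) above can be sketched in userspace C11. This is a
model under stated assumptions, not the kvmclock code: struct pvti_model, the
getcpu callback and the single system_time field are invented for the example,
the stable-bit check and the rdtsc/scale/shift math are left out, and the
"odd version means an update is in progress" test is an assumption about how
the version counter is advertised to behave.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-cpu time record guarded by a version counter. */
struct pvti_model {
	_Atomic uint32_t version;	/* assumed odd while being updated */
	_Atomic uint64_t system_time;	/* stands in for the protected fields */
};

/* Hypothetical callback returning the cpu the caller is running on. */
typedef int (*getcpu_fn)(void);

/* Trivial getcpu for single-cpu demonstrations. */
static int model_fixed_cpu0(void)
{
	return 0;
}

/* Writer: bump version to odd, update the data, bump version to even. */
static void model_update(struct pvti_model *p, uint64_t now)
{
	uint32_t v = atomic_load_explicit(&p->version, memory_order_relaxed);

	atomic_store_explicit(&p->version, v + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&p->system_time, now, memory_order_relaxed);
	atomic_store_explicit(&p->version, v + 2, memory_order_release);
}

/* Reader: retry if we migrated or if the version changed under us. */
static uint64_t model_read(struct pvti_model *table, getcpu_fn getcpu)
{
	for (;;) {
		int cpu = getcpu();
		struct pvti_model *p = &table[cpu];
		uint32_t ver1 = atomic_load_explicit(&p->version,
						     memory_order_acquire);
		uint64_t t = atomic_load_explicit(&p->system_time,
						  memory_order_relaxed);

		atomic_thread_fence(memory_order_acquire);
		if (getcpu() != cpu)
			continue;
		if ((ver1 & 1) ||
		    atomic_load_explicit(&p->version,
					 memory_order_relaxed) != ver1)
			continue;
		return t;
	}
}
```

The two version reads bracket the data read, which is the interval argument
made in the quoted text.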

Re: [PATCH] Bluetooth: Pre-initialize variables in read_local_oob_ext_data_complete()

2015-04-17 Thread Marcel Holtmann

Hi Geert,

>>> net/bluetooth/mgmt.c: In function ‘read_local_oob_ext_data_complete’:
>>> net/bluetooth/mgmt.c:6474: warning: ‘r256’ may be used uninitialized in 
>>> this function
>>> net/bluetooth/mgmt.c:6474: warning: ‘h256’ may be used uninitialized in 
>>> this function
>>> net/bluetooth/mgmt.c:6474: warning: ‘r192’ may be used uninitialized in 
>>> this function
>>> net/bluetooth/mgmt.c:6474: warning: ‘h192’ may be used uninitialized in 
>>> this function
>>> 
>>> While these are false positives, the code can be shortened by
>>> pre-initializing the hash table pointers and eir_len. This has the side
>>> effect of killing the compiler warnings.
>> 
>> can you be a bit more specific about which compiler version this is? I fixed one 
>> occurrence that seemed valid. However, in this case the compiler seems to be 
>> just plain stupid. With gcc 4.9, for example, I am not seeing these.
> 
> gcc 4.1.2. As there were too many false positives, these warnings were
> disabled in later versions (throwing away the children with the bad water).
> 
> If you don't like my patch, just drop it. I only look at newly
> introduced warnings
> of this kind anyway.

I really do not know what the best solution is here. This is a false positive. 
I have been looking at this particular code because of a warning that was valid 
but that we missed initially. The warnings that you are fixing, however, are 
clearly false positives.

If this only happens with an old compiler version, I would tend to leave the 
code as is. Then again, what is the generally preferred approach here?
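
For reference, the pre-initialization pattern under discussion looks roughly
like the sketch below. The function, its parameters and the length value are
hypothetical stand-ins and not the code in net/bluetooth/mgmt.c; the point is
only that every local has a defined value on every path, so even a compiler
with weak flow analysis has nothing to warn about.

```c
#include <stddef.h>

/*
 * Hypothetical model: pick a hash pointer and a length depending on
 * status and mode. All locals are initialized at their declaration.
 */
static size_t model_eir_len(int status, int secure, const unsigned char *buf,
			    const unsigned char **hash_out)
{
	const unsigned char *h192 = NULL, *h256 = NULL;
	size_t eir_len = 0;

	if (!status) {
		if (secure)
			h256 = buf;
		else
			h192 = buf;
		eir_len = 16;	/* illustrative size only */
	}

	/* Without the initializers, the error path would read unset locals. */
	*hash_out = h256 ? h256 : h192;
	return eir_len;
}
```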

Regards

Marcel


Re: [PATCH v9 3/4] cgroups: allow a cgroup subsystem to reject a fork

2015-04-17 Thread Aleksa Sarai

>> Do you also want me to completely drop the COUNT macro? IMO it makes
>> the CGROUP__COUNT consolidation much nicer.
>
> What's wrong with simply having start and end tags?

Because you'd have to write (CGROUP_TAG_END - CGROUP_TAG_START) every
time? It's a small addition and it makes referencing the range of a
tagged section much easier.

--
Aleksa Sarai (cyphar)
www.cyphar.com

Re: [V4.1] Regression: Bluetooth mouse not working.

2015-04-17 Thread Marcel Holtmann

Hi Linus,

>> okay. I only looked at BlueZ 5.x and that might have been my mistake. Let me 
>> check this and fix this properly.
> 
> Why not just revert that commit. It looks like garbage. It has odd code like
> 
> +   u32 valid_flags = 0;
> +   ci->flags = session->flags & valid_flags;
> 
> which is basically saying "no flags are valid, and we are silently
> just clearing them all when copying".
> 
> The reason I think it's garbage is
> 
> (a) the commit clearly breaks something, so the whole "let's check
> flags that we've never checked before" is already fundamentally
> suspicious
> 
> (b) code like the above is just crap to begin with, because it makes
> things superficially "look" sensible when looking at individual lines
> of code (for example, when grepping things), and then when you look at
> the actual bigger picture, it turns out that the code doesn't actually
> care about the flags it is "copying", it just clears them all.
> 
> The other code sequences do things like
> 
> +   u32 valid_flags = 0;
> +   if (req->flags & ~valid_flags)
> +   return -EINVAL;
> 
> Which again is just a very unreadable way of saying "if any flags are
> set, return an error". This kind of thing is presumably what breaks
> things, because clearly people *have* set flags that you thought are
> invalid.
> 
> Now *IF* the interfaces had had these kinds of flag validation checks
> from day one, that would be one thing. But adding these kinds of
> things after the fact, when somebody then reports that they break
> things, then that's just a big big flag that you shouldn't try to do
> this at all. It's water under the bridge. That ship has sailed. It's
> too late. Give up on it.
> 
> So I don't think this code is "fixable". It really smells like a
> fundamental mistake to begin with. Just revert it, chalk it up as "ok,
> that was a stupid idea", and move on.

accepting all flags regardless was an oversight on my part in the first place. 
What this patch tried to do is to limit it to what userspace is currently 
actually using. My mistake was to look only at BlueZ 5.x userspace and not at 
BlueZ 4.x userspace. The fix to not break existing userspace is essentially 
this:

diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c
index a05b9dbf14c9..9070dfd6b4ad 100644
--- a/net/bluetooth/hidp/core.c
+++ b/net/bluetooth/hidp/core.c
@@ -1313,7 +1313,8 @@ int hidp_connection_add(struct hidp_connadd_req *req,
struct socket *ctrl_sock,
struct socket *intr_sock)
 {
-   u32 valid_flags = 0;
+   u32 valid_flags = BIT(HIDP_VIRTUAL_CABLE_UNPLUG) |
+ BIT(HIDP_BOOT_PROTOCOL_MODE);

I asked Joerg to test this patch, but looking at old userspace, that is what is 
happening there.
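
A userspace sketch of the check, to show why a mask of 0 rejects older
userspace and why the widened mask does not. The MODEL_* names and bit
positions are assumptions made for this illustration; the real definitions
live in the kernel's hidp.h.

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_BIT(n) (UINT32_C(1) << (n))

/* Bit positions assumed for illustration only. */
enum {
	MODEL_VIRTUAL_CABLE_UNPLUG = 0,
	MODEL_BOOT_PROTOCOL_MODE = 1,
};

/* Reject any request that sets a flag outside the valid mask. */
static int model_check_flags(uint32_t flags, uint32_t valid_flags)
{
	if (flags & ~valid_flags)
		return -EINVAL;
	return 0;
}

/* The mask from the fix: the two flags older userspace is known to set. */
static uint32_t model_valid_flags(void)
{
	return MODEL_BIT(MODEL_VIRTUAL_CABLE_UNPLUG) |
	       MODEL_BIT(MODEL_BOOT_PROTOCOL_MODE);
}
```

With valid_flags = 0, a request that sets the boot protocol flag gets -EINVAL;
with the mask above, the same request passes while unknown bits are still
rejected.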

Regards

Marcel


[PATCH] Bluetooth: hidp: Fix regression with older userspace and flags validation

2015-04-17 Thread Marcel Holtmann

While it is not used by newer userspace anymore, the older userspace was
utilizing HIDP_VIRTUAL_CABLE_UNPLUG and HIDP_BOOT_PROTOCOL_MODE flags
when adding a new HIDP connection.

The flags validation is important, but we cannot break older userspace,
so allow these flags to be provided even if newer userspace does not
use them anymore.

Reported-by: Jörg Otte 
Signed-off-by: Marcel Holtmann 
---
 net/bluetooth/hidp/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c
index a05b9dbf14c9..9070dfd6b4ad 100644
--- a/net/bluetooth/hidp/core.c
+++ b/net/bluetooth/hidp/core.c
@@ -1313,7 +1313,8 @@ int hidp_connection_add(struct hidp_connadd_req *req,
struct socket *ctrl_sock,
struct socket *intr_sock)
 {
-   u32 valid_flags = 0;
+   u32 valid_flags = BIT(HIDP_VIRTUAL_CABLE_UNPLUG) |
+ BIT(HIDP_BOOT_PROTOCOL_MODE);
struct hidp_session *session;
struct l2cap_conn *conn;
struct l2cap_chan *chan;
-- 
2.1.0

