Re: [PATCH 4/8] x86/platform/UV: Add Support for UV4 Hubless NMIs

2017-01-13 Thread Ingo Molnar

* Mike Travis  wrote:

> --- linux.orig/arch/x86/kernel/apic/x2apic_uv_x.c
> +++ linux/arch/x86/kernel/apic/x2apic_uv_x.c
> @@ -1529,6 +1529,8 @@ void __init uv_system_init(void)
>   uv_system_init_hub();
>  
>   /* Initialize UV hubless system here */
> + else
> + uv_nmi_setup_hubless();

That's not the proper coding style ...
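
(For reference, kernel style keeps the if/else branches together and puts a comment inside the branch it describes; a rough sketch of the quoted hunk, with the surrounding condition elided:

	if (...)	/* condition elided from the quoted hunk */
		uv_system_init_hub();
	else
		/* Initialize UV hubless system here */
		uv_nmi_setup_hubless();
)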

> +++ linux/arch/x86/platform/uv/uv_nmi.c
> @@ -67,6 +67,19 @@ static struct uv_hub_nmi_s **uv_hub_nmi_
>  DEFINE_PER_CPU(struct uv_cpu_nmi_s, uv_cpu_nmi);
>  EXPORT_PER_CPU_SYMBOL_GPL(uv_cpu_nmi);
>  
> +/* UV hubless values */
> +static int *pch_base;
> +#define  NMI_CONTROL_PORT 0x70

Ditto...

> +#define  NMI_DUMMY_PORT  0x71
> +#define  GPI_NMI_STS_GPP_D_0 0x164
> +#define  GPI_NMI_ENA_GPP_D_0 0x174
> +#define  STS_GPP_D_0_MASK 0x1
> +#define  PAD_CFG_DW0_GPP_D_0 0x4c0
> +#define  GPIROUTNMI  (1ul << 17)
> +#define  PCH_PCR_GPIO_1_BASE 0xfdaeul
> +#define  PCH_PCR_GPIO_ADDRESS(offset)\
> + (int *)((unsigned long long)pch_base | offset)

> +static void uv_init_hubless_pch_io(int offset, int mask, int data)
> +{
> + int *addr = PCH_PCR_GPIO_ADDRESS(offset);
> + int readd = readl(addr);
> +
> + if (mask) { /* OR in new data */
> + int writed = (readd & ~mask) | data;
> +
> + nmi_debug("UV:PCH: %p = %x & %x | %x (%x)\n",
> + addr, readd, ~mask, data, writed);

No unnecessary linebreaks please.

> +
>   /* wait a moment for the hub nmi locker to set flag */
> - cpu_relax();

Please capitalize 'NMI' consistently in all the patches.

> + /* PCH NMI causes only one cpu to respond */

Same for 'CPU'.

>   else if (uv_nmi_action_is("kdb") || uv_nmi_action_is("kgdb"))
>   uv_call_kgdb_kdb(cpu, regs, master);
>  
> + /* Unknown NMI action */
> + else {

Sigh ...

> +/* setup for UV Hubless systems */
> +void __init uv_nmi_setup_hubless(void)
> +{
> + uv_nmi_setup_common(false);
> +
> + pch_base = xlate_dev_mem_ptr(PCH_PCR_GPIO_1_BASE);
> + nmi_debug("UV: PCH base:%p from 0x%lx, GPP_D_0\n",
> + pch_base, PCH_PCR_GPIO_1_BASE);
> +
> + uv_init_hubless_pch_io(GPI_NMI_ENA_GPP_D_0,
> + STS_GPP_D_0_MASK, STS_GPP_D_0_MASK);
> +
> + uv_nmi_setup_hubless_intr();
> +
> + uv_reassert_nmi();  /* insure NMI enabled in proc inf reg */

And typos ...

Much cleaner patches please! I haven't checked the others in the series but
please review them for similar mishaps.

Thanks,

Ingo


Re: [PATCHv4 3/5] pinctrl: mvebu: pinctrl driver for 98DX3236 SoC

2017-01-13 Thread Chris Packham
On 13/01/17 22:54, Sebastian Hesselbarth wrote:
> On 13.01.2017 10:12, Chris Packham wrote:
>> From: Kalyan Kinthada 
>>
>> This pinctrl driver supports the 98DX3236, 98DX3336 and 98DX4251 SoCs
>> from Marvell.
>>
>> Signed-off-by: Kalyan Kinthada 
>> Signed-off-by: Chris Packham 
>> Acked-by: Rob Herring 
>> Acked-by: Sebastian Hesselbarth 
>> ---
>>
>> Notes:
>> Changes in v2:
>> - include sdio support for the 98DX4251
>> Changes in v3:
>> - None
>> Changes in v4:
>> - Correct some discrepencies between binding and driver.
>
> Well, unfortunately I still see differences between the "gpio" in
> the binding and "gpo" in the driver.
>
> Please go back to that list I sent you yesterday and fix them all.
>

I think you may have missed my initial reply [1], or I may have missed your
response to it. Long story short, "gpo" is intentional because some of those
pins can't be used as inputs. But if you still want me to change it, I will.

[1] - https://lkml.org/lkml/2017/1/12/117



Re: [f2fs-dev] [PATCH 6/6] f2fs: show # of on-going flush and discard bios

2017-01-13 Thread heyunlei

Hi Jaegeuk,

On 2017/1/13 6:44, Jaegeuk Kim wrote:

This patch adds stat information for flush and discard commands.

Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/debug.c   | 7 +--
 fs/f2fs/f2fs.h| 3 ++-
 fs/f2fs/segment.c | 4 
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c
index f9f6b0aeba02..e67cab2fafac 100644
--- a/fs/f2fs/debug.c
+++ b/fs/f2fs/debug.c
@@ -54,6 +54,8 @@ static void update_general_status(struct f2fs_sb_info *sbi)
si->max_aw_cnt = atomic_read(&sbi->max_aw_cnt);
si->nr_wb_cp_data = get_pages(sbi, F2FS_WB_CP_DATA);
si->nr_wb_data = get_pages(sbi, F2FS_WB_DATA);
+   si->nr_flush = atomic_read(&SM_I(sbi)->fcc_info->submit_flush);
+   si->nr_discard = atomic_read(&SM_I(sbi)->dcc_info->submit_discard);
si->total_count = (int)sbi->user_block_count / sbi->blocks_per_seg;
si->rsvd_segs = reserved_segments(sbi);
si->overp_segs = overprovision_segments(sbi);
@@ -318,8 +320,9 @@ static int stat_show(struct seq_file *s, void *v)
seq_printf(s, "  - Inner Struct Count: tree: %d(%d), node: 
%d\n",
si->ext_tree, si->zombie_tree, si->ext_node);
seq_puts(s, "\nBalancing F2FS Async:\n");
-   seq_printf(s, "  - IO (CP: %4d, Data: %4d)\n",
-  si->nr_wb_cp_data, si->nr_wb_data);
+   seq_printf(s, "  - IO (CP: %4d, Data: %4d, Flush: %4d, Discard: 
%4d)\n",
+  si->nr_wb_cp_data, si->nr_wb_data,
+  si->nr_flush, si->nr_discard);
seq_printf(s, "  - inmem: %4d, atomic IO: %4d (Max. %4d)\n",
   si->inmem_pages, si->aw_cnt, si->max_aw_cnt);
seq_printf(s, "  - nodes: %4d in %4d\n",
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index ef5e3709c161..d51bc18e7292 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -201,6 +201,7 @@ struct discard_cmd_control {
wait_queue_head_t discard_wait_queue;   /* waiting queue for wake-up */
struct mutex cmd_lock;
int max_discards;   /* max. discards to be issued */
+   atomic_t submit_discard;/* # of issued discard */
 };

 /* for the list of fsync inodes, used only during recovery */
@@ -2254,7 +2255,7 @@ struct f2fs_stat_info {
unsigned int ndirty_dirs, ndirty_files, ndirty_all;
int nats, dirty_nats, sits, dirty_sits, free_nids, alloc_nids;
int total_count, utilization;
-   int bg_gc, nr_wb_cp_data, nr_wb_data;
+   int bg_gc, nr_wb_cp_data, nr_wb_data, nr_flush, nr_discard;
int inline_xattr, inline_inode, inline_dir, orphans;
int aw_cnt, max_aw_cnt;
unsigned int valid_count, valid_node_count, valid_inode_count, 
discard_blks;
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 74364006bfa6..5a2d380182d9 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -653,6 +653,8 @@ static void __remove_discard_cmd(struct f2fs_sb_info *sbi, 
struct discard_cmd *d
 {
int err = dc->bio->bi_error;

+   atomic_dec(&(SM_I(sbi)->dcc_info->submit_discard));

Here the submit_discard decrement will also count discard commands whose bio was never submitted.

Thanks.

+
if (err == -EOPNOTSUPP)
err = 0;

@@ -678,6 +680,7 @@ void f2fs_wait_discard_bio(struct f2fs_sb_info *sbi, 
block_t blkaddr)
if (dc->state == D_PREP) {
dc->state = D_SUBMIT;
submit_bio(dc->bio);
+   atomic_inc(&dc->submit_discard);
}
wait_for_completion_io(&dc->wait);

@@ -723,6 +726,7 @@ static int issue_discard_thread(void *data)
if (dc->state == D_PREP) {
dc->state = D_SUBMIT;
submit_bio(dc->bio);
+   atomic_inc(&dc->submit_discard);
if (iter++ > DISCARD_ISSUE_RATE)
break;
} else if (dc->state == D_DONE) {





Re: [PATCH v9 4/4] console: Make persistent scrollback a boot parameter

2017-01-13 Thread Greg KH
On Fri, Jan 13, 2017 at 09:00:34PM +0100, Manuel Schölling wrote:
> On Tue, 2017-01-10 at 23:58 +0100, Adam Borowski wrote:
> > On Tue, Jan 10, 2017 at 10:28:38PM +0100, Manuel Schölling wrote:
> > > The impact of the persistent scrollback feature on the code size is
> > > rather small, so the config option is removed. The feature stays
> > > disabled by default and can be enabled by using the boot command
> > > line
> > > parameter 'vgacon.scrollback_persistent=1' or by setting
> > > VGACON_SOFT_SCROLLBACK_PERSISTENT_ENABLE_BY_DEFAULT=y.
> > > 
> > > Signed-off-by: Manuel Schölling 
> > > Suggested-by: Bartlomiej Zolnierkiewicz 
> > > +module_param_named(scrollback_persistent, scrollback_persistent,
> > > bool, );
> > > +MODULE_PARM_DESC(scrollback_persistent, "Enable persistent
> > > scrollback for all vga consoles");
> > 
> > A command-line knob settable by the end-user is something more
> > persistent
> > than a config option.  As you're going to extend this code beyond
> > vgacon in
> > the near future, perhaps it'd be better to have a shared setting for
> > all
> > console drivers?
> According to the guys at #kernelnewbies on IRC everybody hates new
> command line options.

That was me, you can use my name here :)

> I'd rather stick to the module parameter for now and maybe introduce a 
> new cmd line option later, once this feature has been implemented in
> several console drivers.

Yes, that should be fine.

thanks,

greg k-h
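
(For reference, module_param_named() takes a permissions mode as its last argument, which the archive stripped from the quoted hunk above; a minimal sketch, where the 0644 mode and the variable's default are assumptions:

	static bool scrollback_persistent;	/* off by default, enabled via vgacon.scrollback_persistent=1 */
	module_param_named(scrollback_persistent, scrollback_persistent, bool, 0644);
	MODULE_PARM_DESC(scrollback_persistent, "Enable persistent scrollback for all vga consoles");
)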


Re: [PATCH 4.4 00/27] 4.4.43-stable review

2017-01-13 Thread Greg Kroah-Hartman
On Fri, Jan 13, 2017 at 12:19:08PM -0800, Guenter Roeck wrote:
> On Fri, Jan 13, 2017 at 12:38:17PM +0100, Greg Kroah-Hartman wrote:
> > This is the start of the stable review cycle for the 4.4.43 release.
> > There are 27 patches in this series, all will be posted as a response
> > to this one.  If anyone has any issues with these being applied, please
> > let me know.
> > 
> > Responses should be made by Sun Jan 15 11:36:57 UTC 2017.
> > Anything received after that time might be too late.
> > 
> Build results:
>   total: 149 pass: 149 fail: 0
> Qemu test results:
>   total: 115 pass: 115 fail: 0
> 
> In case you wonder, the avr32 tests don't fail anymore because I disabled
> CONFIG_NET for those builds, not because anything has been fixed. This lets us
> at least build test the architecture code.

Ah, I was wondering, building the arch code is good.

thanks,

greg k-h


Re: [PATCH 4.9 00/59] 4.9.4-stable review

2017-01-13 Thread Greg Kroah-Hartman
On Fri, Jan 13, 2017 at 12:20:20PM -0800, Guenter Roeck wrote:
> On Fri, Jan 13, 2017 at 01:01:07PM +0100, Greg Kroah-Hartman wrote:
> > This is the start of the stable review cycle for the 4.9.4 release.
> > There are 59 patches in this series, all will be posted as a response
> > to this one.  If anyone has any issues with these being applied, please
> > let me know.
> > 
> > Responses should be made by Sun Jan 15 11:38:27 UTC 2017.
> > Anything received after that time might be too late.
> > 
> 
> Build results:
>   total: 149 pass: 149 fail: 0
> Qemu test results:
>   total: 122 pass: 122 fail: 0
> 
> Details are available at http://kerneltests.org/builders.

Thanks for testing all of these and letting me know.

greg k-h


Re: [PATCH 4.9 00/59] 4.9.4-stable review

2017-01-13 Thread Greg Kroah-Hartman
On Fri, Jan 13, 2017 at 02:58:09PM -0700, Shuah Khan wrote:
> On 01/13/2017 05:01 AM, Greg Kroah-Hartman wrote:
> > This is the start of the stable review cycle for the 4.9.4 release.
> > There are 59 patches in this series, all will be posted as a response
> > to this one.  If anyone has any issues with these being applied, please
> > let me know.
> > 
> > Responses should be made by Sun Jan 15 11:38:27 UTC 2017.
> > Anything received after that time might be too late.
> > 
> > The whole patch series can be found in one patch at:
> > kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.4-rc1.gz
> > or in the git tree and branch at:
> >   git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> > linux-4.9.y
> > and the diffstat can be found below.
> > 
> > thanks,
> > 
> > greg k-h
> > 
> 
> Compiled and booted on my test system. No dmesg regressions.

Thanks for testing these and letting me know.

greg k-h


[PATCH v1 1/3] dt: bindings: add documentation for zx2967 family reset controller

2017-01-13 Thread Baoyou Xie
This patch adds dt-binding documentation for zx2967 family
reset controller.

Signed-off-by: Baoyou Xie 
---
 .../devicetree/bindings/reset/zte,zx2967-reset.txt   | 20 
 1 file changed, 20 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/reset/zte,zx2967-reset.txt

diff --git a/Documentation/devicetree/bindings/reset/zte,zx2967-reset.txt 
b/Documentation/devicetree/bindings/reset/zte,zx2967-reset.txt
new file mode 100644
index 000..22d590e
--- /dev/null
+++ b/Documentation/devicetree/bindings/reset/zte,zx2967-reset.txt
@@ -0,0 +1,20 @@
+ZTE zx2967 SoCs Reset Controller
+===
+
+Please also refer to reset.txt in this directory for common reset
+controller binding usage.
+
+Required properties:
+- compatible: should be one of the following.
+   * zte,zx296718-reset
+- reg: physical base address of the controller and length of memory mapped
+   region.
+- #reset-cells: must be 1.
+
+example:
+
+   toprst: reset@1461060 {
+   compatible = "zte,zx296718-reset";
+   reg = <0x01461060 0x8>;
+   #reset-cells = <1>;
+   };
-- 
2.7.4



[PATCH v1 2/3] MAINTAINERS: add zx2967 reset controller driver to ARM ZTE architecture

2017-01-13 Thread Baoyou Xie
Add the zx2967 reset controller driver as maintained by ARM ZTE
architecture maintainers, as they're parts of the core IP.

Signed-off-by: Baoyou Xie 
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 2793808..08f8155 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1980,10 +1980,12 @@ L:  linux-arm-ker...@lists.infradead.org (moderated 
for non-subscribers)
 S: Maintained
 F: arch/arm/mach-zx/
 F: drivers/clk/zte/
+F: drivers/reset/reset-zx2967.c
 F: drivers/soc/zte/
 F: drivers/thermal/zx*
 F: Documentation/devicetree/bindings/arm/zte.txt
 F: Documentation/devicetree/bindings/clock/zx296702-clk.txt
+F: Documentation/devicetree/bindings/reset/zte,zx2967-reset.txt
 F: Documentation/devicetree/bindings/soc/zte/
 F: Documentation/devicetree/bindings/thermal/zx*
 F: include/dt-bindings/soc/zx*.h
-- 
2.7.4



[PATCH v1 3/3] reset: zx2967: add reset controller driver for ZTE's zx2967 family

2017-01-13 Thread Baoyou Xie
This patch adds reset controller driver for ZTE's zx2967 family.

Signed-off-by: Baoyou Xie 
---
 drivers/reset/Kconfig|   6 ++
 drivers/reset/Makefile   |   1 +
 drivers/reset/reset-zx2967.c | 136 +++
 3 files changed, 143 insertions(+)
 create mode 100644 drivers/reset/reset-zx2967.c

diff --git a/drivers/reset/Kconfig b/drivers/reset/Kconfig
index 172dc96..972d077 100644
--- a/drivers/reset/Kconfig
+++ b/drivers/reset/Kconfig
@@ -92,6 +92,12 @@ config RESET_ZYNQ
help
  This enables the reset controller driver for Xilinx Zynq SoCs.
 
+config RESET_ZX2967
+   bool "ZX2967 Reset Driver"
+   depends on ARCH_ZX || COMPILE_TEST
+   help
+ This enables the reset controller driver for ZTE zx2967 family.
+
 source "drivers/reset/sti/Kconfig"
 source "drivers/reset/hisilicon/Kconfig"
 source "drivers/reset/tegra/Kconfig"
diff --git a/drivers/reset/Makefile b/drivers/reset/Makefile
index 13b346e..807b77b 100644
--- a/drivers/reset/Makefile
+++ b/drivers/reset/Makefile
@@ -14,3 +14,4 @@ obj-$(CONFIG_RESET_SUNXI) += reset-sunxi.o
 obj-$(CONFIG_TI_SYSCON_RESET) += reset-ti-syscon.o
 obj-$(CONFIG_RESET_UNIPHIER) += reset-uniphier.o
 obj-$(CONFIG_RESET_ZYNQ) += reset-zynq.o
+obj-$(CONFIG_RESET_ZX2967) += reset-zx2967.o
diff --git a/drivers/reset/reset-zx2967.c b/drivers/reset/reset-zx2967.c
new file mode 100644
index 000..63f9c41
--- /dev/null
+++ b/drivers/reset/reset-zx2967.c
@@ -0,0 +1,136 @@
+/*
+ * ZTE's zx2967 family thermal sensor driver
+ *
+ * Copyright (C) 2017 ZTE Ltd.
+ *
+ * Author: Baoyou Xie 
+ *
+ * License terms: GNU General Public License (GPL) version 2
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+struct zx2967_reset {
+   void __iomem*reg_base;
+   spinlock_t  lock;
+   struct reset_controller_dev rcdev;
+};
+
+static int zx2967_reset_assert(struct reset_controller_dev *rcdev,
+  unsigned long id)
+{
+   struct zx2967_reset *reset = NULL;
+   int bank = id / 32;
+   int offset = id % 32;
+   unsigned int reg;
+   unsigned long flags;
+
+   reset = container_of(rcdev, struct zx2967_reset, rcdev);
+
+   spin_lock_irqsave(&reset->lock, flags);
+
+   reg = readl(reset->reg_base + (bank * 4));
+   writel(reg & ~BIT(offset), reset->reg_base + (bank * 4));
+   reg = readl(reset->reg_base + (bank * 4));
+
+   spin_unlock_irqrestore(&reset->lock, flags);
+
+   return 0;
+}
+
+static int zx2967_reset_deassert(struct reset_controller_dev *rcdev,
+unsigned long id)
+{
+   struct zx2967_reset *reset = NULL;
+   int bank = id / 32;
+   int offset = id % 32;
+   unsigned int reg;
+   unsigned long flags;
+
+   reset = container_of(rcdev, struct zx2967_reset, rcdev);
+
+   spin_lock_irqsave(&reset->lock, flags);
+
+   reg = readl(reset->reg_base + (bank * 4));
+   writel(reg | BIT(offset), reset->reg_base + (bank * 4));
+   reg = readl(reset->reg_base + (bank * 4));
+
+   spin_unlock_irqrestore(&reset->lock, flags);
+
+   return 0;
+}
+
+static struct reset_control_ops zx2967_reset_ops = {
+   .assert = zx2967_reset_assert,
+   .deassert   = zx2967_reset_deassert,
+};
+
+static int zx2967_reset_probe(struct platform_device *pdev)
+{
+   struct zx2967_reset *reset;
+   struct resource *res;
+
+   reset = devm_kzalloc(&pdev->dev, sizeof(*reset), GFP_KERNEL);
+   if (!reset)
+   return -ENOMEM;
+
+   res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+   reset->reg_base = devm_ioremap_resource(&pdev->dev, res);
+   if (IS_ERR(reset->reg_base))
+   return PTR_ERR(reset->reg_base);
+
+   spin_lock_init(>lock);
+
+   reset->rcdev.owner = THIS_MODULE;
+   reset->rcdev.nr_resets = resource_size(res) * 8;
+   reset->rcdev.ops = &zx2967_reset_ops;
+   reset->rcdev.of_node = pdev->dev.of_node;
+
+   dev_info(&pdev->dev, "reset controller cnt:%d",
+ reset->rcdev.nr_resets);
+
+   return reset_controller_register(&reset->rcdev);
+}
+
+static int zx2967_reset_remove(struct platform_device *pdev)
+{
+   struct zx2967_reset *reset = platform_get_drvdata(pdev);
+
+   reset_controller_unregister(&reset->rcdev);
+
+   return 0;
+}
+
+static const struct of_device_id zx2967_reset_dt_ids[] = {
+{ .compatible = "zte,zx296718-reset", },
+{},
+};
+MODULE_DEVICE_TABLE(of, zx2967_reset_dt_ids);
+
+static struct platform_driver zx2967_reset_driver = {
+   .probe  = zx2967_reset_probe,
+   .remove = zx2967_reset_remove,
+   .driver = {
+   .name   = "zx2967-reset",
+   .of_match_table = zx2967_reset_dt_ids,
+   },
+};
+
+static int __init zx2967_reset_init(void)
+{
+   return platform_driver_register(&zx2967_reset_driver);
+}
+arch_initcall(zx2967_reset_init);
+
+static void __exit zx2967_reset_exit(void)
+{
+   

Re: [PATCH 14/17] spi/topcliff-pch: Adjust six checks for null pointers

2017-01-13 Thread Dan Carpenter
On Fri, Jan 13, 2017 at 06:24:22PM +0100, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Fri, 13 Jan 2017 16:16:05 +0100
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> The script "checkpatch.pl" pointed information out like the following.
> 
> Comparison to NULL could be written …
> 
> Thus fix the affected source code places.
> 
> Signed-off-by: Markus Elfring 
> ---
>  drivers/spi/spi-topcliff-pch.c | 15 ++-
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/spi/spi-topcliff-pch.c b/drivers/spi/spi-topcliff-pch.c
> index 0a876311b67b..e4f1f66b751b 100644
> --- a/drivers/spi/spi-topcliff-pch.c
> +++ b/drivers/spi/spi-topcliff-pch.c
> @@ -531,12 +531,11 @@ static int pch_spi_transfer(struct spi_device *pspi, 
> struct spi_message *pmsg)
>  static inline void pch_spi_select_chip(struct pch_spi_data *data,
>  struct spi_device *pspi)
>  {
> - if (data->current_chip != NULL) {
> + if (data->current_chip)


Put the curly braces back.  Multi-line indents get curly braces for
readability.

>   if (pspi->chip_select != data->n_curnt_chip) {
>   dev_dbg(&pspi->dev, "%s : different slave\n", __func__);
>   data->current_chip = NULL;
>   }
> - }

regards,
dan carpenter
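
(I.e. keep the outer braces even after dropping the NULL comparison; a sketch of what the reviewer is asking for:

	if (data->current_chip) {
		if (pspi->chip_select != data->n_curnt_chip) {
			dev_dbg(&pspi->dev, "%s : different slave\n", __func__);
			data->current_chip = NULL;
		}
	}
)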



Re: [PATCH] printk: Correctly handle preemption in console_unlock()

2017-01-13 Thread Sergey Senozhatsky
On (01/13/17 14:15), Petr Mladek wrote:
> Some console drivers code calls console_conditional_schedule()
> that looks at @console_may_schedule. The value must be cleared
> when the drivers are called from console_unlock() with
> interrupts disabled. But rescheduling is fine when the same
> code is called, for example, from tty operations where the
> console semaphore is taken via console_lock().
> 
> This is why @console_may_schedule is cleared before calling console
> drivers. The original value is stored to decide if we could sleep
> between lines.
> 
> Now, @console_may_schedule is not cleared when we call
> console_trylock() and jump back to the "again" goto label.
> This has become a problem, since the commit 6b97a20d3a7909daa066
> ("printk: set may_schedule for some of console_trylock() callers").

so I think I'd prefer to revert that commit.

the reason I added the commit in question was to reduce the number of
printk() soft lockups that I observed back then. however, it obviously
didn't solve all of the printk() problems. now printk() is moving in a
completely different direction in terms of lockups and deadlocks. there
will be no console_trylock() call in vprintk_emit() at all. we will
either do console_lock() from scheduleable printk_kthread or
console_trylock() from IRQ work. so 6b97a20d3a7909daa066 didn't buy us
a lot, and it still doesn't (+ it introduced a bug).


apart from that, Tetsuo wasn't really happy with the patch
http://www.spinics.net/lists/linux-mm/msg103099.html


so let's just return the old behavior back.

---

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 7180088cbb23..ddfbd47398f8 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2078,20 +2078,7 @@ int console_trylock(void)
return 0;
}
console_locked = 1;
-   /*
-* When PREEMPT_COUNT disabled we can't reliably detect if it's
-* safe to schedule (e.g. calling printk while holding a spin_lock),
-* because preempt_disable()/preempt_enable() are just barriers there
-* and preempt_count() is always 0.
-*
-* RCU read sections have a separate preemption counter when
-* PREEMPT_RCU enabled thus we must take extra care and check
-* rcu_preempt_depth(), otherwise RCU read sections modify
-* preempt_count().
-*/
-   console_may_schedule = !oops_in_progress &&
-   preemptible() &&
-   !rcu_preempt_depth();
+   console_may_schedule = 0;
return 1;
 }
 EXPORT_SYMBOL(console_trylock);



Re: [PATCH 05/62] watchdog: bcm2835_wdt: Convert to use device managed functions and other improvements

2017-01-13 Thread Eric Anholt
Guenter Roeck  writes:

> Use device managed functions to simplify error handling, reduce
> source code size, improve readability, and reduce the likelyhood of bugs.
> Other improvements as listed below.
>
> The conversion was done automatically with coccinelle using the
> following semantic patches. The semantic patches and the scripts used
> to generate this commit log are available at
> https://github.com/groeck/coccinelle-patches
>
> - Drop assignments to otherwise unused variables
> - Replace of_iomap() with platform_get_resource() followed by
>   devm_ioremap_resource()

Every time I see this pattern I wish we had a
devm_ioremap_platform_resource().

> - Replace >dev with dev if 'struct device *dev' is a declared
>   variable
> - Use devm_watchdog_register_driver() to register watchdog device
> - Replace shutdown function with call to watchdog_stop_on_reboot()

I'm trusting you here that this last change is right.  All the rest of
it looks good:

Acked-by: Eric Anholt 
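
(No such helper exists in the tree at this point; a hypothetical sketch of the wrapper being wished for, name and shape assumed:

	/* hypothetical convenience wrapper -- not an existing kernel API */
	static void __iomem *devm_ioremap_platform_resource(struct platform_device *pdev,
							    unsigned int index)
	{
		struct resource *res;

		res = platform_get_resource(pdev, IORESOURCE_MEM, index);
		return devm_ioremap_resource(&pdev->dev, res);
	}
)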




[PATCH 9/9] slab: remove slub sysfs interface files early for empty memcg caches

2017-01-13 Thread Tejun Heo
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

Each cache has a number of sysfs interface files under
/sys/kernel/slab.  On a system with a lot of memory and transient
memcgs, the number of interface files which have to be removed once
memory reclaim kicks in can reach millions.

Signed-off-by: Tejun Heo 
Reported-by: Jay Vana 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 mm/slub.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 8621940..41a3da7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3951,8 +3951,20 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 #ifdef CONFIG_MEMCG
 static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s)
 {
-   /* called with all the locks held after a sched RCU grace period */
-   __kmem_cache_shrink(s);
+   /*
+* Called with all the locks held after a sched RCU grace period.
+* Even if @s becomes empty after shrinking, we can't know that @s
+* doesn't have allocations already in-flight and thus can't
+* destroy @s until the associated memcg is released.
+*
+* However, let's remove the sysfs files for empty caches here.
+* Each cache has a lot of interface files which aren't
+* particularly useful for empty draining caches; otherwise, we can
+* easily end up with millions of unnecessary sysfs files on
+* systems which have a lot of memory and transient cgroups.
+*/
+   if (!__kmem_cache_shrink(s))
+   sysfs_slab_remove(s);
 }
 
 void __kmemcg_cache_deactivate(struct kmem_cache *s)
@@ -5650,6 +5662,15 @@ static void sysfs_slab_remove(struct kmem_cache *s)
 */
return;
 
+   if (!s->kobj.state_in_sysfs)
+   /*
+* For a memcg cache, this may be called during
+* deactivation and again on shutdown.  Remove only once.
+* A cache is never shut down before deactivation is
+* complete, so no need to worry about synchronization.
+*/
+   return;
+
 #ifdef CONFIG_MEMCG
kset_unregister(s->memcg_kset);
 #endif
-- 
2.9.3



[PATCHSET] slab: make memcg slab destruction scalable

2017-01-13 Thread Tejun Heo
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.

I've seen machines which end up with hundreds of thousands of caches and
many millions of kernfs_nodes.  The current code is O(N^2) on the
total number of caches and has synchronous rcu_barrier() and
synchronize_sched() in cgroup offline / release path which is executed
while holding cgroup_mutex.  Combined, this leads to very expensive
and slow cache destruction operations which can easily keep running
for half a day.

This also messes up /proc/slabinfo along with other cache iterating
operations.  seq_file operates on 4k chunks and on each 4k boundary
tries to seek to the last position in the list.  With a huge number of
caches on the list, this becomes very slow and very prone to the list
content changing underneath it leading to a lot of missing and/or
duplicate entries.

This patchset addresses the scalability problem.

* Separate out root and memcg cache lists and add per-memcg list.
  Update each user to use the appropriate list.

* Replace rcu_barrier() and synchronize_rcu() with call_rcu() and
  call_rcu_sched().

* For dying empty slub caches, remove the sysfs files after
  deactivation so that we don't end up with millions of sysfs files
  without any useful information on them.

This patchset contains the following nine patches.

 0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch
 0002-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch
 0003-slab-simplify-shutdown_memcg_caches.patch
 0004-slab-reorganize-memcg_cache_params.patch
 0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch
 0006-slab-don-t-put-memcg-caches-on-slab_caches-list.patch
 0007-slab-introduce-__kmemcg_cache_deactivate.patch
 0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch
 0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch

0001 reverts an existing optimization to prepare for the following
changes.  0002 replaces rcu_barrier() in release path with call_rcu().
0003-0006 separate out the lists.  0007-0008 replace
synchronize_sched() in slub destruction path with call_rcu_sched().
0009 removes sysfs files early for empty dying caches.

This patchset is on top of the current linus#master a121103c9228 and
also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git 
review-kmemcg-scalability

diffstat follows.  Thanks.

 include/linux/memcontrol.h |1 
 include/linux/slab.h   |   39 -
 include/linux/slab_def.h   |5 
 include/linux/slub_def.h   |9 -
 mm/memcontrol.c|7 -
 mm/slab.c  |7 +
 mm/slab.h  |   21 ++-
 mm/slab_common.c   |  306 -
 mm/slub.c  |   54 +++
 9 files changed, 283 insertions(+), 166 deletions(-)

--
tejun


[PATCH 8/9] slab: remove synchronous synchronize_sched() from memcg cache deactivation path

2017-01-13 Thread Tejun Heo
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

slub uses synchronize_sched() to deactivate a memcg cache.
synchronize_sched() is an expensive and slow operation and doesn't
scale when a huge number of caches are destroyed back-to-back.  While
there used to be a simple batching mechanism, the batching was too
restricted to be helpful.

This patch implements slab_deactivate_memcg_cache_rcu_sched() which
slub can use to schedule sched RCU callback instead of performing
synchronize_sched() synchronously while holding cgroup_mutex.  While
this adds online cpus, mems and slab_mutex operations, operating on
these locks back-to-back from the same kworker, which is what's gonna
happen when there are many to deactivate, isn't expensive at all and
this gets rid of the scalability problem completely.

Signed-off-by: Tejun Heo 
Reported-by: Jay Vana 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 include/linux/slab.h |  6 ++
 mm/slab.h|  2 ++
 mm/slab_common.c | 60 
 mm/slub.c| 12 +++
 4 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 63d543d..84701bb 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -576,6 +576,12 @@ struct memcg_cache_params {
struct {
struct mem_cgroup *memcg;
struct list_head kmem_caches_node;
+
+   void (*deact_fn)(struct kmem_cache *);
+   union {
+   struct rcu_head deact_rcu_head;
+   struct work_struct deact_work;
+   };
};
};
 };
diff --git a/mm/slab.h b/mm/slab.h
index 73ed6b5..b67ac8f 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -299,6 +299,8 @@ static __always_inline void memcg_uncharge_slab(struct page 
*page, int order,
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
+extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
+   void (*deact_fn)(struct kmem_cache *));
 
 #else /* CONFIG_MEMCG && !CONFIG_SLOB */
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 87e5535..4730ef6 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -583,6 +583,66 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
put_online_cpus();
 }
 
+static void kmemcg_deactivate_workfn(struct work_struct *work)
+{
+   struct kmem_cache *s = container_of(work, struct kmem_cache,
+   memcg_params.deact_work);
+
+   get_online_cpus();
+   get_online_mems();
+
+   mutex_lock(&slab_mutex);
+
+   s->memcg_params.deact_fn(s);
+
+   mutex_unlock(&slab_mutex);
+
+   put_online_mems();
+   put_online_cpus();
+
+   /* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */
+   css_put(&s->memcg_params.memcg->css);
+}
+
+static void kmemcg_deactivate_rcufn(struct rcu_head *head)
+{
+   struct kmem_cache *s = container_of(head, struct kmem_cache,
+   memcg_params.deact_rcu_head);
+
+   /*
+* We need to grab blocking locks.  Bounce to ->deact_work.  The
+* work item shares the space with the RCU head and can't be
+* initialized earlier.
+*/
+   INIT_WORK(&s->memcg_params.deact_work, kmemcg_deactivate_workfn);
+   schedule_work(&s->memcg_params.deact_work);
+}
+
+/**
+ * slab_deactivate_memcg_cache_rcu_sched - schedule deactivation after a
+ *sched RCU grace period
+ * @s: target kmem_cache
+ * @deact_fn: deactivation function to call
+ *
+ * Schedule @deact_fn to be invoked with online cpus, mems and slab_mutex
+ * held after a sched RCU grace period.  The slab is guaranteed to stay
+ * alive until @deact_fn is finished.  This is to be used from
+ * __kmemcg_cache_deactivate().
+ */
+void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
+  void (*deact_fn)(struct kmem_cache 
*))
+{
+   if (WARN_ON_ONCE(is_root_cache(s)) ||
+   WARN_ON_ONCE(s->memcg_params.deact_fn))
+   return;
+
+   /* pin memcg so 

[PATCH 3/9] slab: simplify shutdown_memcg_caches()

2017-01-13 Thread Tejun Heo
shutdown_memcg_caches() shuts down all memcg caches associated with a
root cache.  It first walks the index table clearing and shutting down
each entry and then shuts down the ones on
root_cache->memcg_params.list.  As active caches are on both the table
and the list, they're stashed away from the list to avoid shutting
down twice and then get spliced back later.

This is unnecessarily complicated.  All memcg caches are on
root_cache->memcg_params.list.  The function can simply clear the
index table and shut down all caches on the list.  There's no need to
muck with temporary stashing.

Simplify the code.

Signed-off-by: Tejun Heo 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 mm/slab_common.c | 32 +---
 1 file changed, 5 insertions(+), 27 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 851c75e..45aa67c 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -634,48 +634,26 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
 {
struct memcg_cache_array *arr;
struct kmem_cache *c, *c2;
-   LIST_HEAD(busy);
int i;
 
BUG_ON(!is_root_cache(s));
 
/*
-* First, shutdown active caches, i.e. caches that belong to online
-* memory cgroups.
+* First, clear the pointers to all memcg caches so that they will
+* never be accessed even if the root cache stays alive.
 */
arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
lockdep_is_held(&slab_mutex));
-   for_each_memcg_cache_index(i) {
-   c = arr->entries[i];
-   if (!c)
-   continue;
-   if (shutdown_cache(c))
-   /*
-* The cache still has objects. Move it to a temporary
-* list so as not to try to destroy it for a second
-* time while iterating over inactive caches below.
-*/
-   list_move(&c->memcg_params.list, &busy);
-   else
-   /*
-* The cache is empty and will be destroyed soon. Clear
-* the pointer to it in the memcg_caches array so that
-* it will never be accessed even if the root cache
-* stays alive.
-*/
-   arr->entries[i] = NULL;
-   }
+   for_each_memcg_cache_index(i)
+   arr->entries[i] = NULL;
 
/*
-* Second, shutdown all caches left from memory cgroups that are now
-* offline.
+* Shutdown all caches.
 */
list_for_each_entry_safe(c, c2, &s->memcg_params.list,
 memcg_params.list)
shutdown_cache(c);
 
-   list_splice(&busy, &s->memcg_params.list);
-
/*
 * A cache being destroyed must be empty. In particular, this means
 * that all per memcg caches attached to it must be empty too.
-- 
2.9.3



[PATCH 9/9] slab: remove slub sysfs interface files early for empty memcg caches

2017-01-13 Thread Tejun Heo
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

Each cache has a number of sysfs interface files under
/sys/kernel/slab.  On a system with a lot of memory and transient
memcgs, the number of interface files which have to be removed once
memory reclaim kicks in can reach millions.

Signed-off-by: Tejun Heo 
Reported-by: Jay Vana 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 mm/slub.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 8621940..41a3da7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3951,8 +3951,20 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 #ifdef CONFIG_MEMCG
 static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s)
 {
-   /* called with all the locks held after a sched RCU grace period */
-   __kmem_cache_shrink(s);
+   /*
+* Called with all the locks held after a sched RCU grace period.
+* Even if @s becomes empty after shrinking, we can't know that @s
+* doesn't have allocations already in-flight and thus can't
+* destroy @s until the associated memcg is released.
+*
+* However, let's remove the sysfs files for empty caches here.
+* Each cache has a lot of interface files which aren't
+* particularly useful for empty draining caches; otherwise, we can
+* easily end up with millions of unnecessary sysfs files on
+* systems which have a lot of memory and transient cgroups.
+*/
+   if (!__kmem_cache_shrink(s))
+   sysfs_slab_remove(s);
 }
 
 void __kmemcg_cache_deactivate(struct kmem_cache *s)
@@ -5650,6 +5662,15 @@ static void sysfs_slab_remove(struct kmem_cache *s)
 */
return;
 
+   if (!s->kobj.state_in_sysfs)
+   /*
+* For a memcg cache, this may be called during
+* deactivation and again on shutdown.  Remove only once.
+* A cache is never shut down before deactivation is
+* complete, so no need to worry about synchronization.
+*/
+   return;
+
 #ifdef CONFIG_MEMCG
kset_unregister(s->memcg_kset);
 #endif
-- 
2.9.3



[PATCHSET] slab: make memcg slab destruction scalable

2017-01-13 Thread Tejun Heo
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.

I've seen machines which end up with hundreds of thousands of caches and
many millions of kernfs_nodes.  The current code is O(N^2) on the
total number of caches and has synchronous rcu_barrier() and
synchronize_sched() in cgroup offline / release path which is executed
while holding cgroup_mutex.  Combined, this leads to very expensive
and slow cache destruction operations which can easily keep running
for half a day.

This also messes up /proc/slabinfo along with other cache iterating
operations.  seq_file operates on 4k chunks and on each 4k boundary
tries to seek to the last position in the list.  With a huge number of
caches on the list, this becomes very slow and very prone to the list
content changing underneath it leading to a lot of missing and/or
duplicate entries.
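
For illustration only (not part of this patchset): the /proc/slabinfo
problem comes from the list-backed seq_file pattern sketched below, where
every resumed 4k chunk re-walks the list up to *pos.  The example_* names
are placeholders standing in for slab_caches and slab_mutex.

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/seq_file.h>

static LIST_HEAD(example_caches);		/* stand-in for slab_caches */
static DEFINE_MUTEX(example_mutex);		/* stand-in for slab_mutex */

static void *example_start(struct seq_file *m, loff_t *pos)
{
	mutex_lock(&example_mutex);
	/* O(*pos): walks the list from the head, skipping *pos entries */
	return seq_list_start(&example_caches, *pos);
}

static void *example_next(struct seq_file *m, void *p, loff_t *pos)
{
	return seq_list_next(p, &example_caches, pos);
}

static void example_stop(struct seq_file *m, void *p)
{
	mutex_unlock(&example_mutex);
}

With hundreds of thousands of entries churning on the list, each resume is
both slow and liable to land on a different element than where the previous
chunk stopped, hence the missing and duplicate lines.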

This patchset addresses the scalability problem.

* Separate out root and memcg cache lists and add per-memcg list.
  Update each user to use the appropriate list.

* Replace rcu_barrier() and synchronize_rcu() with call_rcu() and
  call_rcu_sched().

* For dying empty slub caches, remove the sysfs files after
  deactivation so that we don't end up with millions of sysfs files
  without any useful information on them.

This patchset contains the following nine patches.

 0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch
 0002-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch
 0003-slab-simplify-shutdown_memcg_caches.patch
 0004-slab-reorganize-memcg_cache_params.patch
 0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch
 0006-slab-don-t-put-memcg-caches-on-slab_caches-list.patch
 0007-slab-introduce-__kmemcg_cache_deactivate.patch
 0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch
 0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch

0001 reverts an existing optimization to prepare for the following
changes.  0002 replaces rcu_barrier() in release path with call_rcu().
0003-0006 separate out the lists.  0007-0008 replace
synchronize_sched() in slub destruction path with call_rcu_sched().
0009 removes sysfs files early for empty dying caches.

This patchset is on top of the current linus#master a121103c9228 and
also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git 
review-kmemcg-scalability

diffstat follows.  Thanks.

 include/linux/memcontrol.h |1 
 include/linux/slab.h   |   39 -
 include/linux/slab_def.h   |5 
 include/linux/slub_def.h   |9 -
 mm/memcontrol.c|7 -
 mm/slab.c  |7 +
 mm/slab.h  |   21 ++-
 mm/slab_common.c   |  306 -
 mm/slub.c  |   54 +++
 9 files changed, 283 insertions(+), 166 deletions(-)

--
tejun


[PATCH 8/9] slab: remove synchronous synchronize_sched() from memcg cache deactivation path

2017-01-13 Thread Tejun Heo
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

slub uses synchronize_sched() to deactivate a memcg cache.
synchronize_sched() is an expensive and slow operation and doesn't
scale when a huge number of caches are destroyed back-to-back.  While
there used to be a simple batching mechanism, the batching was too
restricted to be helpful.

This patch implements slab_deactivate_memcg_cache_rcu_sched() which
slub can use to schedule sched RCU callback instead of performing
synchronize_sched() synchronously while holding cgroup_mutex.  While
this adds online cpus, mems and slab_mutex operations, operating on
these locks back-to-back from the same kworker, which is what's gonna
happen when there are many to deactivate, isn't expensive at all and
this gets rid of the scalability problem completely.
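
For reference, this is roughly how slub ends up wiring the helper.  The
mm/slub.c hunk of this patch is cut off in this archive, so the bodies
below are a sketch pieced together from this description and from patches
7/9 and 9/9, not a verbatim copy of the patch:

/* runs from the bounced work item with cpus, mems and slab_mutex held */
static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s)
{
	__kmem_cache_shrink(s);
}

void __kmemcg_cache_deactivate(struct kmem_cache *s)
{
	/* stop caching empty slabs so freeable pages don't pin the dead memcg */
	s->cpu_partial = 0;
	s->min_partial = 0;

	/*
	 * s->cpu_partial is checked locklessly (see put_cpu_partial), so
	 * instead of a synchronize_sched() here under cgroup_mutex, the
	 * shrink is deferred past a sched RCU grace period.
	 */
	slab_deactivate_memcg_cache_rcu_sched(s, kmemcg_cache_deact_after_rcu);
}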

Signed-off-by: Tejun Heo 
Reported-by: Jay Vana 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 include/linux/slab.h |  6 ++
 mm/slab.h|  2 ++
 mm/slab_common.c | 60 
 mm/slub.c| 12 +++
 4 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 63d543d..84701bb 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -576,6 +576,12 @@ struct memcg_cache_params {
struct {
struct mem_cgroup *memcg;
struct list_head kmem_caches_node;
+
+   void (*deact_fn)(struct kmem_cache *);
+   union {
+   struct rcu_head deact_rcu_head;
+   struct work_struct deact_work;
+   };
};
};
 };
diff --git a/mm/slab.h b/mm/slab.h
index 73ed6b5..b67ac8f 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -299,6 +299,8 @@ static __always_inline void memcg_uncharge_slab(struct page 
*page, int order,
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
+extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
+   void (*deact_fn)(struct kmem_cache *));
 
 #else /* CONFIG_MEMCG && !CONFIG_SLOB */
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 87e5535..4730ef6 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -583,6 +583,66 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
put_online_cpus();
 }
 
+static void kmemcg_deactivate_workfn(struct work_struct *work)
+{
+   struct kmem_cache *s = container_of(work, struct kmem_cache,
+   memcg_params.deact_work);
+
+   get_online_cpus();
+   get_online_mems();
+
+   mutex_lock(&slab_mutex);
+
+   s->memcg_params.deact_fn(s);
+
+   mutex_unlock(&slab_mutex);
+
+   put_online_mems();
+   put_online_cpus();
+
+   /* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */
+   css_put(&s->memcg_params.memcg->css);
+}
+
+static void kmemcg_deactivate_rcufn(struct rcu_head *head)
+{
+   struct kmem_cache *s = container_of(head, struct kmem_cache,
+   memcg_params.deact_rcu_head);
+
+   /*
+* We need to grab blocking locks.  Bounce to ->deact_work.  The
+* work item shares the space with the RCU head and can't be
+* initialized earlier.
+*/
+   INIT_WORK(&s->memcg_params.deact_work, kmemcg_deactivate_workfn);
+   schedule_work(&s->memcg_params.deact_work);
+}
+
+/**
+ * slab_deactivate_memcg_cache_rcu_sched - schedule deactivation after a
+ *sched RCU grace period
+ * @s: target kmem_cache
+ * @deact_fn: deactivation function to call
+ *
+ * Schedule @deact_fn to be invoked with online cpus, mems and slab_mutex
+ * held after a sched RCU grace period.  The slab is guaranteed to stay
+ * alive until @deact_fn is finished.  This is to be used from
+ * __kmemcg_cache_deactivate().
+ */
+void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
+  void (*deact_fn)(struct kmem_cache *))
+{
+   if (WARN_ON_ONCE(is_root_cache(s)) ||
+   WARN_ON_ONCE(s->memcg_params.deact_fn))
+   return;
+
+   /* pin memcg so that @s doesn't get destroyed in the middle */
+   css_get(&s->memcg_params.memcg->css);
+
+   s->memcg_params.deact_fn = deact_fn;
+   

[PATCH 5/9] slab: link memcg kmem_caches on their associated memory cgroup

2017-01-13 Thread Tejun Heo
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

While a memcg kmem_cache is listed on its root cache's ->children
list, there is no direct way to iterate all kmem_caches which are
associated with a memory cgroup.  The only way to iterate them is
walking all caches while filtering out caches which don't match, which
would be most of them.

This makes memcg destruction operations O(N^2) where N is the total
number of slab caches which can be huge.  This combined with the
synchronous RCU operations can tie up a CPU and affect the whole
machine for many hours when memory reclaim triggers offlining and
destruction of the stale memcgs.

This patch adds mem_cgroup->kmem_caches list which goes through
memcg_cache_params->kmem_caches_node of all kmem_caches which are
associated with the memcg.  All memcg specific iterations, including
stat file access, are updated to use the new list instead.
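
As a rough sketch of the payoff (the corresponding mm/slab_common.c hunk
is truncated below), destroying a dead memcg's caches becomes a walk of
the new per-memcg list instead of a scan of every cache in the system:

void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
{
	struct kmem_cache *s, *s2;

	get_online_cpus();
	get_online_mems();

	mutex_lock(&slab_mutex);
	/* only this memcg's caches are visited */
	list_for_each_entry_safe(s, s2, &memcg->kmem_caches,
				 memcg_params.kmem_caches_node)
		shutdown_cache(s);
	mutex_unlock(&slab_mutex);

	put_online_mems();
	put_online_cpus();
}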

Signed-off-by: Tejun Heo 
Reported-by: Jay Vana 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 include/linux/memcontrol.h |  1 +
 include/linux/slab.h   |  3 +++
 mm/memcontrol.c|  7 ---
 mm/slab.h  |  6 +++---
 mm/slab_common.c   | 42 --
 5 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 61d20c1..4de925c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -253,6 +253,7 @@ struct mem_cgroup {
 /* Index in the kmem_cache->memcg_params.memcg_caches array */
int kmemcg_id;
enum memcg_kmem_state kmem_state;
+   struct list_head kmem_caches;
 #endif
 
int last_scanned_node;
diff --git a/include/linux/slab.h b/include/linux/slab.h
index b310173..54ec959 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -565,6 +565,8 @@ struct memcg_cache_array {
  * @memcg: Pointer to the memcg this cache belongs to.
  *
  * @children_node: List node for @root_cache->children list.
+ *
+ * @kmem_caches_node: List node for @memcg->kmem_caches list.
  */
 struct memcg_cache_params {
struct kmem_cache *root_cache;
@@ -576,6 +578,7 @@ struct memcg_cache_params {
struct {
struct mem_cgroup *memcg;
struct list_head children_node;
+   struct list_head kmem_caches_node;
};
};
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4048897..a2b20f7f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2839,6 +2839,7 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 */
memcg->kmemcg_id = memcg_id;
memcg->kmem_state = KMEM_ONLINE;
+   INIT_LIST_HEAD(&memcg->kmem_caches);
 
return 0;
 }
@@ -4004,9 +4005,9 @@ static struct cftype mem_cgroup_legacy_files[] = {
 #ifdef CONFIG_SLABINFO
{
.name = "kmem.slabinfo",
-   .seq_start = slab_start,
-   .seq_next = slab_next,
-   .seq_stop = slab_stop,
+   .seq_start = memcg_slab_start,
+   .seq_next = memcg_slab_next,
+   .seq_stop = memcg_slab_stop,
.seq_show = memcg_slab_show,
},
 #endif
diff --git a/mm/slab.h b/mm/slab.h
index 9b3ec3f..b5e0040 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -488,9 +488,9 @@ static inline struct kmem_cache_node *get_node(struct 
kmem_cache *s, int node)
 
 #endif
 
-void *slab_start(struct seq_file *m, loff_t *pos);
-void *slab_next(struct seq_file *m, void *p, loff_t *pos);
-void slab_stop(struct seq_file *m, void *p);
+void *memcg_slab_start(struct seq_file *m, loff_t *pos);
+void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos);
+void memcg_slab_stop(struct seq_file *m, void *p);
 int memcg_slab_show(struct seq_file *m, void *p);
 
 void ___cache_free(struct kmem_cache *cache, void *x, unsigned long addr);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index fdf5b6d..74c36d8 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -149,6 +149,7 @@ static int init_memcg_params(struct kmem_cache *s,
s->memcg_params.root_cache = root_cache;
s->memcg_params.memcg = memcg;
INIT_LIST_HEAD(&s->memcg_params.children_node);

[PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path

2017-01-13 Thread Tejun Heo
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

SLAB_DESTROY_BY_RCU caches need an RCU grace period before destruction.
Currently, it's done synchronously with rcu_barrier().  As
rcu_barrier() is expensive time-wise, slab implements a batching
mechanism so that rcu_barrier() can be done for multiple caches at the
same time.

Unfortunately, the rcu_barrier() is in a synchronous path which is
called while holding cgroup_mutex and the batching is too limited to
be actually helpful.  Besides, the batching is just a very degenerate
form of the actual RCU callback mechanism.

This patch updates the cache release path so that it simply uses
call_rcu() instead of the synchronous rcu_barrier() + custom batching.
This doesn't cost more while being logically simpler and way more
scalable.

* ->rcu_head is added to kmem_cache structs.  It shares storage space
  with ->list.

* slub sysfs removal and release are separated and the former is now
  called from __kmem_cache_shutdown() while the latter is called from
  the release path.  There's no reason to defer sysfs removal through
  RCU and this makes it unnecessary to bounce to workqueue from the
  RCU callback.

* release_caches() is removed and shutdown_cache() now either directly
  releases the cache or schedules an RCU callback to do that.  This
  makes the cache inaccessible once shutdown_cache() is called and
  makes it impossible for shutdown_memcg_caches() to do memcg-specific
  cleanups afterwards.  Move memcg-specific part into a helper,
  unlink_memcg_cache(), and make shutdown_cache() call it directly.
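
Putting the points above together, the reworked shutdown path has roughly
the following shape.  This is a sketch only; helper names and the exact
context in which the final free runs may differ from the patch, whose
mm/slab_common.c hunk is truncated below:

static void kmem_cache_release_rcufn(struct rcu_head *head)
{
	struct kmem_cache *s = container_of(head, struct kmem_cache, rcu_head);

	slab_kmem_cache_release(s);	/* final free after the grace period */
}

static int shutdown_cache(struct kmem_cache *s)
{
	if (__kmem_cache_shutdown(s) != 0)
		return -EBUSY;		/* still has objects, leave it alone */

	/* unlink first so the dying cache can no longer be found */
	list_del(&s->list);
	if (!is_root_cache(s))
		unlink_memcg_cache(s);

	if (s->flags & SLAB_DESTROY_BY_RCU)
		call_rcu(&s->rcu_head, kmem_cache_release_rcufn);
	else
		slab_kmem_cache_release(s);

	return 0;
}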

Signed-off-by: Tejun Heo 
Reported-by: Jay Vana 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 include/linux/slab_def.h |  5 ++-
 include/linux/slub_def.h |  9 --
 mm/slab.h|  5 ++-
 mm/slab_common.c | 84 
 mm/slub.c|  9 +-
 5 files changed, 57 insertions(+), 55 deletions(-)

diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 4ad2c5a..b649629 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -39,7 +39,10 @@ struct kmem_cache {
 
 /* 4) cache creation/removal */
const char *name;
-   struct list_head list;
+   union {
+   struct list_head list;
+   struct rcu_head rcu_head;
+   };
int refcount;
int object_size;
int align;
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 75f56c2..7637b41 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -80,7 +80,10 @@ struct kmem_cache {
int align;  /* Alignment */
int reserved;   /* Reserved bytes at the end of slabs */
const char *name;   /* Name (only for display!) */
-   struct list_head list;  /* List of slab caches */
+   union {
+   struct list_head list;  /* List of slab caches */
+   struct rcu_head rcu_head;
+   };
int red_left_pad;   /* Left redzone padding size */
 #ifdef CONFIG_SYSFS
struct kobject kobj;/* For sysfs */
@@ -113,9 +116,9 @@ struct kmem_cache {
 
 #ifdef CONFIG_SYSFS
 #define SLAB_SUPPORTS_SYSFS
-void sysfs_slab_remove(struct kmem_cache *);
+void sysfs_slab_release(struct kmem_cache *);
 #else
-static inline void sysfs_slab_remove(struct kmem_cache *s)
+static inline void sysfs_slab_release(struct kmem_cache *s)
 {
 }
 #endif
diff --git a/mm/slab.h b/mm/slab.h
index 4acc644..3fa2d77 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -24,7 +24,10 @@ struct kmem_cache {
const char *name;   /* Slab name for sysfs */
int refcount;   /* Use counter */
void (*ctor)(void *);   /* Called on object slot creation */
-   struct list_head list;  /* List of all slab caches on the system */
+   union {
+   struct list_head list;  /* List of all slab caches on the 
system */
+   struct rcu_head rcu_head;
+   };
 };
 
 #endif /* CONFIG_SLOB */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 46ff746..851c75e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -215,6 +215,11 @@ int memcg_update_all_caches(int num_memcgs)
mutex_unlock(&slab_mutex);
return ret;
 }
+
+static 

[PATCH 5/9] slab: link memcg kmem_caches on their associated memory cgroup

2017-01-13 Thread Tejun Heo
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

While a memcg kmem_cache is listed on its root cache's ->children
list, there is no direct way to iterate all kmem_caches which are
associated with a memory cgroup.  The only way to iterate them is
walking all caches while filtering out caches which don't match, which
would be most of them.

This makes memcg destruction operations O(N^2) where N is the total
number of slab caches which can be huge.  This combined with the
synchronous RCU operations can tie up a CPU and affect the whole
machine for many hours when memory reclaim triggers offlining and
destruction of the stale memcgs.

This patch adds mem_cgroup->kmem_caches list which goes through
memcg_cache_params->kmem_caches_node of all kmem_caches which are
associated with the memcg.  All memcg specific iterations, including
stat file access, are updated to use the new list instead.

Signed-off-by: Tejun Heo 
Reported-by: Jay Vana 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 include/linux/memcontrol.h |  1 +
 include/linux/slab.h   |  3 +++
 mm/memcontrol.c|  7 ---
 mm/slab.h  |  6 +++---
 mm/slab_common.c   | 42 --
 5 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 61d20c1..4de925c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -253,6 +253,7 @@ struct mem_cgroup {
 /* Index in the kmem_cache->memcg_params.memcg_caches array */
int kmemcg_id;
enum memcg_kmem_state kmem_state;
+   struct list_head kmem_caches;
 #endif
 
int last_scanned_node;
diff --git a/include/linux/slab.h b/include/linux/slab.h
index b310173..54ec959 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -565,6 +565,8 @@ struct memcg_cache_array {
  * @memcg: Pointer to the memcg this cache belongs to.
  *
  * @children_node: List node for @root_cache->children list.
+ *
+ * @kmem_caches_node: List node for @memcg->kmem_caches list.
  */
 struct memcg_cache_params {
struct kmem_cache *root_cache;
@@ -576,6 +578,7 @@ struct memcg_cache_params {
struct {
struct mem_cgroup *memcg;
struct list_head children_node;
+   struct list_head kmem_caches_node;
};
};
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4048897..a2b20f7f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2839,6 +2839,7 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 */
memcg->kmemcg_id = memcg_id;
memcg->kmem_state = KMEM_ONLINE;
+   INIT_LIST_HEAD(&memcg->kmem_caches);
 
return 0;
 }
@@ -4004,9 +4005,9 @@ static struct cftype mem_cgroup_legacy_files[] = {
 #ifdef CONFIG_SLABINFO
{
.name = "kmem.slabinfo",
-   .seq_start = slab_start,
-   .seq_next = slab_next,
-   .seq_stop = slab_stop,
+   .seq_start = memcg_slab_start,
+   .seq_next = memcg_slab_next,
+   .seq_stop = memcg_slab_stop,
.seq_show = memcg_slab_show,
},
 #endif
diff --git a/mm/slab.h b/mm/slab.h
index 9b3ec3f..b5e0040 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -488,9 +488,9 @@ static inline struct kmem_cache_node *get_node(struct 
kmem_cache *s, int node)
 
 #endif
 
-void *slab_start(struct seq_file *m, loff_t *pos);
-void *slab_next(struct seq_file *m, void *p, loff_t *pos);
-void slab_stop(struct seq_file *m, void *p);
+void *memcg_slab_start(struct seq_file *m, loff_t *pos);
+void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos);
+void memcg_slab_stop(struct seq_file *m, void *p);
 int memcg_slab_show(struct seq_file *m, void *p);
 
 void ___cache_free(struct kmem_cache *cache, void *x, unsigned long addr);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index fdf5b6d..74c36d8 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -149,6 +149,7 @@ static int init_memcg_params(struct kmem_cache *s,
s->memcg_params.root_cache = root_cache;
s->memcg_params.memcg = memcg;
INIT_LIST_HEAD(&s->memcg_params.children_node);
+   INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
return 0;
}
 
@@ -219,6 +220,7 @@ int memcg_update_all_caches(int 

[PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path

2017-01-13 Thread Tejun Heo
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

SLAB_DESTROY_BY_RCU caches need an RCU grace period before destruction.
Currently, it's done synchronously with rcu_barrier().  As
rcu_barrier() is expensive time-wise, slab implements a batching
mechanism so that rcu_barrier() can be done for multiple caches at the
same time.

Unfortunately, the rcu_barrier() is in a synchronous path which is
called while holding cgroup_mutex and the batching is too limited to
be actually helpful.  Besides, the batching is just a very degenerate
form of the actual RCU callback mechanism.

This patch updates the cache release path so that it simply uses
call_rcu() instead of the synchronous rcu_barrier() + custom batching.
This doesn't cost more while being logically simpler and way more
scalable.

* ->rcu_head is added to kmem_cache structs.  It shares storage space
  with ->list.

* slub sysfs removal and release are separated and the former is now
  called from __kmem_cache_shutdown() while the latter is called from
  the release path.  There's no reason to defer sysfs removal through
  RCU and this makes it unnecessary to bounce to workqueue from the
  RCU callback.

* release_caches() is removed and shutdown_cache() now either directly
  releases the cache or schedules an RCU callback to do that.  This
  makes the cache inaccessible once shutdown_cache() is called and
  makes it impossible for shutdown_memcg_caches() to do memcg-specific
  cleanups afterwards.  Move memcg-specific part into a helper,
  unlink_memcg_cache(), and make shutdown_cache() call it directly.

Signed-off-by: Tejun Heo 
Reported-by: Jay Vana 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 include/linux/slab_def.h |  5 ++-
 include/linux/slub_def.h |  9 --
 mm/slab.h|  5 ++-
 mm/slab_common.c | 84 
 mm/slub.c|  9 +-
 5 files changed, 57 insertions(+), 55 deletions(-)

diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 4ad2c5a..b649629 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -39,7 +39,10 @@ struct kmem_cache {
 
 /* 4) cache creation/removal */
const char *name;
-   struct list_head list;
+   union {
+   struct list_head list;
+   struct rcu_head rcu_head;
+   };
int refcount;
int object_size;
int align;
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 75f56c2..7637b41 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -80,7 +80,10 @@ struct kmem_cache {
int align;  /* Alignment */
int reserved;   /* Reserved bytes at the end of slabs */
const char *name;   /* Name (only for display!) */
-   struct list_head list;  /* List of slab caches */
+   union {
+   struct list_head list;  /* List of slab caches */
+   struct rcu_head rcu_head;
+   };
int red_left_pad;   /* Left redzone padding size */
 #ifdef CONFIG_SYSFS
struct kobject kobj;/* For sysfs */
@@ -113,9 +116,9 @@ struct kmem_cache {
 
 #ifdef CONFIG_SYSFS
 #define SLAB_SUPPORTS_SYSFS
-void sysfs_slab_remove(struct kmem_cache *);
+void sysfs_slab_release(struct kmem_cache *);
 #else
-static inline void sysfs_slab_remove(struct kmem_cache *s)
+static inline void sysfs_slab_release(struct kmem_cache *s)
 {
 }
 #endif
diff --git a/mm/slab.h b/mm/slab.h
index 4acc644..3fa2d77 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -24,7 +24,10 @@ struct kmem_cache {
const char *name;   /* Slab name for sysfs */
int refcount;   /* Use counter */
void (*ctor)(void *);   /* Called on object slot creation */
-   struct list_head list;  /* List of all slab caches on the system */
+   union {
+   struct list_head list;  /* List of all slab caches on the 
system */
+   struct rcu_head rcu_head;
+   };
 };
 
 #endif /* CONFIG_SLOB */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 46ff746..851c75e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -215,6 +215,11 @@ int memcg_update_all_caches(int num_memcgs)
mutex_unlock(&slab_mutex);
return ret;
 }
+
+static void unlink_memcg_cache(struct kmem_cache *s)
+{
+   list_del(&s->memcg_params.list);
+}
 #else
 static inline int init_memcg_params(struct kmem_cache *s,
 

[PATCH 1/9] Revert "slub: move synchronize_sched out of slab_mutex on shrink"

2017-01-13 Thread Tejun Heo
This reverts commit 89e364db71fb5e7fc8d93228152abfa67daf35fa.

With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

Moving synchronize_sched() out of slab_mutex isn't enough as it's
still inside cgroup_mutex.  The whole deactivation / release path will
be updated to avoid all synchronous RCU operations.  Revert this
insufficient optimization in preparation to ease future changes.

Signed-off-by: Tejun Heo 
Reported-by: Jay Vana 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 mm/slab.c|  4 ++--
 mm/slab.h|  2 +-
 mm/slab_common.c | 27 ++-
 mm/slob.c|  2 +-
 mm/slub.c| 19 +--
 5 files changed, 23 insertions(+), 31 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 29bc6c0..767e8e4 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2314,7 +2314,7 @@ static int drain_freelist(struct kmem_cache *cache,
return nr_freed;
 }
 
-int __kmem_cache_shrink(struct kmem_cache *cachep)
+int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
 {
int ret = 0;
int node;
@@ -2334,7 +2334,7 @@ int __kmem_cache_shrink(struct kmem_cache *cachep)
 
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
-   return __kmem_cache_shrink(cachep);
+   return __kmem_cache_shrink(cachep, false);
 }
 
 void __kmem_cache_release(struct kmem_cache *cachep)
diff --git a/mm/slab.h b/mm/slab.h
index de6579d..4acc644 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -161,7 +161,7 @@ static inline unsigned long kmem_cache_flags(unsigned long 
object_size,
 
 int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
-int __kmem_cache_shrink(struct kmem_cache *);
+int __kmem_cache_shrink(struct kmem_cache *, bool);
 void slab_kmem_cache_release(struct kmem_cache *);
 
 struct seq_file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index ae32384..46ff746 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -579,29 +579,6 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
get_online_cpus();
get_online_mems();
 
-#ifdef CONFIG_SLUB
-   /*
-* In case of SLUB, we need to disable empty slab caching to
-* avoid pinning the offline memory cgroup by freeable kmem
-* pages charged to it. SLAB doesn't need this, as it
-* periodically purges unused slabs.
-*/
-   mutex_lock(&slab_mutex);
-   list_for_each_entry(s, &slab_caches, list) {
-   c = is_root_cache(s) ? cache_from_memcg_idx(s, idx) : NULL;
-   if (c) {
-   c->cpu_partial = 0;
-   c->min_partial = 0;
-   }
-   }
-   mutex_unlock(&slab_mutex);
-   /*
-* kmem_cache->cpu_partial is checked locklessly (see
-* put_cpu_partial()). Make sure the change is visible.
-*/
-   synchronize_sched();
-#endif
-
mutex_lock(&slab_mutex);
list_for_each_entry(s, &slab_caches, list) {
if (!is_root_cache(s))
@@ -613,7 +590,7 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
if (!c)
continue;
 
-   __kmem_cache_shrink(c);
+   __kmem_cache_shrink(c, true);
arr->entries[idx] = NULL;
}
mutex_unlock(&slab_mutex);
@@ -784,7 +761,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
get_online_cpus();
get_online_mems();
kasan_cache_shrink(cachep);
-   ret = __kmem_cache_shrink(cachep);
+   ret = __kmem_cache_shrink(cachep, false);
put_online_mems();
put_online_cpus();
return ret;
diff --git a/mm/slob.c b/mm/slob.c
index eac04d4..5ec1580 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -634,7 +634,7 @@ void __kmem_cache_release(struct kmem_cache *c)
 {
 }
 
-int __kmem_cache_shrink(struct kmem_cache *d)
+int __kmem_cache_shrink(struct kmem_cache *d, bool deactivate)
 {
return 0;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 067598a..68b84f9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3883,7 +3883,7 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int __kmem_cache_shrink(struct kmem_cache *s)
+int __kmem_cache_shrink(struct kmem_cache *s, bool 

[PATCH 6/9] slab: don't put memcg caches on slab_caches list

2017-01-13 Thread Tejun Heo
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

slab_caches currently lists all caches including root and memcg ones.
This is the only data structure which lists the root caches and
iterating root caches can only be done by walking the list while
skipping over memcg caches.  As there can be a huge number of memcg
caches, this can become very expensive.

This also can make /proc/slabinfo behave very badly.  seq_file
processes reads in 4k chunks and seeks to the previous Nth position on
the slab_caches list to resume after each chunk.  With a lot of memcg
cache churns on the list, reading /proc/slabinfo can become very slow
and its content often ends up with duplicate and/or missing entries.

As the previous patch made it unnecessary to walk slab_caches to
iterate memcg-specific caches, there is no reason to keep memcg caches
on the list.  This patch makes slab_caches include only the root
caches.  As this makes slab_cache->list unused for memcg caches,
->memcg_params.children_node is removed and ->list is used instead.
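
With only root caches left on slab_caches, "walk every cache" becomes an
explicit two-level loop instead of a flat scan that mostly skips memcg
clones.  A minimal sketch, assuming slab_mutex is held as the existing
iterators require:

static void for_all_caches_sketch(void (*fn)(struct kmem_cache *))
{
	struct kmem_cache *root, *child;

	list_for_each_entry(root, &slab_caches, list) {
		fn(root);
		/* children now hang off root->memcg_params.children via ->list */
		for_each_memcg_cache(child, root)
			fn(child);
	}
}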

Signed-off-by: Tejun Heo 
Reported-by: Jay Vana 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 include/linux/slab.h |  3 ---
 mm/slab.h|  3 +--
 mm/slab_common.c | 58 +---
 3 files changed, 29 insertions(+), 35 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 54ec959..63d543d 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -564,8 +564,6 @@ struct memcg_cache_array {
  *
  * @memcg: Pointer to the memcg this cache belongs to.
  *
- * @children_node: List node for @root_cache->children list.
- *
  * @kmem_caches_node: List node for @memcg->kmem_caches list.
  */
 struct memcg_cache_params {
@@ -577,7 +575,6 @@ struct memcg_cache_params {
};
struct {
struct mem_cgroup *memcg;
-   struct list_head children_node;
struct list_head kmem_caches_node;
};
};
diff --git a/mm/slab.h b/mm/slab.h
index b5e0040..8f47a44 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -203,8 +203,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, 
size_t, void **);
  * slab_mutex.
  */
 #define for_each_memcg_cache(iter, root) \
-   list_for_each_entry(iter, &(root)->memcg_params.children, \
-   memcg_params.children_node)
+   list_for_each_entry(iter, &(root)->memcg_params.children, list)
 
 static inline bool is_root_cache(struct kmem_cache *s)
 {
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 74c36d8..c0d0126 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -68,6 +68,22 @@ unsigned int kmem_cache_size(struct kmem_cache *s)
 EXPORT_SYMBOL(kmem_cache_size);
 
 #ifdef CONFIG_DEBUG_VM
+static void kmem_cache_verify_name(struct kmem_cache *s)
+{
+   char tmp;
+   int res;
+
+   /*
+* This happens when the module gets unloaded and doesn't destroy
+* its slab cache and no-one else reuses the vmalloc area of the
+* module.  Print a warning.
+*/
+   res = probe_kernel_address(s->name, tmp);
+   if (res)
+   pr_err("Slab cache with size %d has lost its name\n",
+  s->object_size);
+}
+
 static int kmem_cache_sanity_check(const char *name, size_t size)
 {
struct kmem_cache *s = NULL;
@@ -79,20 +95,12 @@ static int kmem_cache_sanity_check(const char *name, size_t 
size)
}
 
list_for_each_entry(s, &slab_caches, list) {
-   char tmp;
-   int res;
+   struct kmem_cache *c;
 
-   /*
-* This happens when the module gets unloaded and doesn't
-* destroy its slab cache and no-one else reuses the vmalloc
-* area of the module.  Print a warning.
-*/
-   res = probe_kernel_address(s->name, tmp);
-   if (res) {
-   pr_err("Slab cache with size %d has lost its name\n",
-  s->object_size);
-   continue;
-   }
+   kmem_cache_verify_name(s);
+
+   for_each_memcg_cache(c, s)
+   

[PATCH 7/9] slab: introduce __kmemcg_cache_deactivate()

2017-01-13 Thread Tejun Heo
__kmem_cache_shrink() is called with %true @deactivate only for memcg
caches.  Remove @deactivate from __kmem_cache_shrink() and introduce
__kmemcg_cache_deactivate() instead.  Each memcg-supporting allocator
should implement it and it should deactivate and drain the cache.

This is to allow memcg cache deactivation behavior to further deviate
from simple shrinking without messing up __kmem_cache_shrink().

This is pure reorganization and doesn't introduce any observable
behavior changes.

Signed-off-by: Tejun Heo 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 mm/slab.c| 11 +--
 mm/slab.h|  5 -
 mm/slab_common.c |  4 ++--
 mm/slob.c|  2 +-
 mm/slub.c| 39 ++-
 5 files changed, 38 insertions(+), 23 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 767e8e4..65814f2 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2314,7 +2314,7 @@ static int drain_freelist(struct kmem_cache *cache,
return nr_freed;
 }
 
-int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
+int __kmem_cache_shrink(struct kmem_cache *cachep)
 {
int ret = 0;
int node;
@@ -2332,9 +2332,16 @@ int __kmem_cache_shrink(struct kmem_cache *cachep, bool 
deactivate)
return (ret ? 1 : 0);
 }
 
+#ifdef CONFIG_MEMCG
+void __kmemcg_cache_deactivate(struct kmem_cache *cachep)
+{
+   __kmem_cache_shrink(cachep);
+}
+#endif
+
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
-   return __kmem_cache_shrink(cachep, false);
+   return __kmem_cache_shrink(cachep);
 }
 
 void __kmem_cache_release(struct kmem_cache *cachep)
diff --git a/mm/slab.h b/mm/slab.h
index 8f47a44..73ed6b5 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -164,7 +164,10 @@ static inline unsigned long kmem_cache_flags(unsigned long 
object_size,
 
 int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
-int __kmem_cache_shrink(struct kmem_cache *, bool);
+int __kmem_cache_shrink(struct kmem_cache *);
+#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+void __kmemcg_cache_deactivate(struct kmem_cache *s);
+#endif
 void slab_kmem_cache_release(struct kmem_cache *);
 
 struct seq_file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index c0d0126..87e5535 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -602,7 +602,7 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
if (!c)
continue;
 
-   __kmem_cache_shrink(c, true);
+   __kmemcg_cache_deactivate(c);
arr->entries[idx] = NULL;
}
mutex_unlock(&slab_mutex);
@@ -727,7 +727,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
get_online_cpus();
get_online_mems();
kasan_cache_shrink(cachep);
-   ret = __kmem_cache_shrink(cachep, false);
+   ret = __kmem_cache_shrink(cachep);
put_online_mems();
put_online_cpus();
return ret;
diff --git a/mm/slob.c b/mm/slob.c
index 5ec1580..eac04d4 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -634,7 +634,7 @@ void __kmem_cache_release(struct kmem_cache *c)
 {
 }
 
-int __kmem_cache_shrink(struct kmem_cache *d, bool deactivate)
+int __kmem_cache_shrink(struct kmem_cache *d)
 {
return 0;
 }
diff --git a/mm/slub.c b/mm/slub.c
index a26cb90..ef89a07 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3886,7 +3886,7 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
+int __kmem_cache_shrink(struct kmem_cache *s)
 {
int node;
int i;
@@ -3898,21 +3898,6 @@ int __kmem_cache_shrink(struct kmem_cache *s, bool 
deactivate)
unsigned long flags;
int ret = 0;
 
-   if (deactivate) {
-   /*
-* Disable empty slabs caching. Used to avoid pinning offline
-* memory cgroups by kmem pages that can be freed.
-*/
-   s->cpu_partial = 0;
-   s->min_partial = 0;
-
-   /*
-* s->cpu_partial is checked locklessly (see put_cpu_partial),
-* so we have to make sure the change is visible.
-*/
-   synchronize_sched();
-   }
-
flush_all(s);
for_each_kmem_cache_node(s, node, n) {
INIT_LIST_HEAD(&discard);
@@ -3963,13 +3948,33 @@ int __kmem_cache_shrink(struct kmem_cache *s, bool 
deactivate)
return ret;
 }
 
+#ifdef CONFIG_MEMCG
+void __kmemcg_cache_deactivate(struct kmem_cache *s)
+{
+   /*
+* Disable empty slabs caching. Used to avoid pinning offline
+* memory cgroups by kmem pages that can be freed.
+ 

[PATCH 4/9] slab: reorganize memcg_cache_params

2017-01-13 Thread Tejun Heo
We're gonna change how memcg caches are iterated.  In preparation,
clean up and reorganize memcg_cache_params.

* The shared ->list is replaced by ->children in root and
  ->children_node in children.

* ->is_root_cache is removed.  Instead ->root_cache is moved out of
  the child union and now used by both root and children.  NULL
  indicates root cache.  Non-NULL a memcg one.

This patch doesn't cause any observable behavior changes.
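
A condensed view of what the reorganization means for mm/slab.h users:
root versus child is now a NULL test on ->root_cache, and a child reaches
its root directly.  is_root_cache() below is from the diff that follows;
memcg_root_cache() is the existing helper this layout implies, shown here
only for illustration:

static inline bool is_root_cache(struct kmem_cache *s)
{
	return !s->memcg_params.root_cache;
}

static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
{
	if (is_root_cache(s))
		return s;
	return s->memcg_params.root_cache;
}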

Signed-off-by: Tejun Heo 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 include/linux/slab.h | 33 -
 mm/slab.h|  6 +++---
 mm/slab_common.c | 21 +++--
 3 files changed, 38 insertions(+), 22 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 084b12b..b310173 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -545,22 +545,37 @@ struct memcg_cache_array {
  * array to be accessed without taking any locks, on relocation we free the old
  * version only after a grace period.
  *
- * Child caches will hold extra metadata needed for its operation. Fields are:
+ * Root and child caches hold different metadata.
  *
- * @memcg: pointer to the memcg this cache belongs to
- * @root_cache: pointer to the global, root cache, this cache was derived from
+ * @root_cache: Common to root and child caches.  NULL for root, pointer to
+ * the root cache for children.
  *
- * Both root and child caches of the same kind are linked into a list chained
- * through @list.
+ * The following fields are specific to root caches.
+ *
+ * @memcg_caches: kmemcg ID indexed table of child caches.  This table is
+ * used to index child caches during allocation and cleared
+ * early during shutdown.
+ *
+ * @children:  List of all child caches.  While the child caches are also
+ * reachable through @memcg_caches, a child cache remains on
+ * this list until it is actually destroyed.
+ *
+ * The following fields are specific to child caches.
+ *
+ * @memcg: Pointer to the memcg this cache belongs to.
+ *
+ * @children_node: List node for @root_cache->children list.
  */
 struct memcg_cache_params {
-   bool is_root_cache;
-   struct list_head list;
+   struct kmem_cache *root_cache;
union {
-   struct memcg_cache_array __rcu *memcg_caches;
+   struct {
+   struct memcg_cache_array __rcu *memcg_caches;
+   struct list_head children;
+   };
struct {
struct mem_cgroup *memcg;
-   struct kmem_cache *root_cache;
+   struct list_head children_node;
};
};
 };
diff --git a/mm/slab.h b/mm/slab.h
index 3fa2d77..9b3ec3f 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -203,12 +203,12 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, 
size_t, void **);
  * slab_mutex.
  */
 #define for_each_memcg_cache(iter, root) \
-   list_for_each_entry(iter, &(root)->memcg_params.list, \
-   memcg_params.list)
+   list_for_each_entry(iter, &(root)->memcg_params.children, \
+   memcg_params.children_node)
 
 static inline bool is_root_cache(struct kmem_cache *s)
 {
-   return s->memcg_params.is_root_cache;
+   return !s->memcg_params.root_cache;
 }
 
 static inline bool slab_equal_or_root(struct kmem_cache *s,
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 45aa67c..fdf5b6d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -135,9 +135,9 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t 
flags, size_t nr,
 #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
 void slab_init_memcg_params(struct kmem_cache *s)
 {
-   s->memcg_params.is_root_cache = true;
-   INIT_LIST_HEAD(&s->memcg_params.list);
+   s->memcg_params.root_cache = NULL;
RCU_INIT_POINTER(s->memcg_params.memcg_caches, NULL);
+   INIT_LIST_HEAD(&s->memcg_params.children);
 }
 
 static int init_memcg_params(struct kmem_cache *s,
@@ -145,10 +145,10 @@ static int init_memcg_params(struct kmem_cache *s,
 {
struct memcg_cache_array *arr;
 
-   if (memcg) {
-   s->memcg_params.is_root_cache = false;
-   s->memcg_params.memcg = memcg;
+   if (root_cache) {
s->memcg_params.root_cache = root_cache;
+   s->memcg_params.memcg = memcg;
+   INIT_LIST_HEAD(&s->memcg_params.children_node);
return 0;
}
 
@@ -218,7 +218,7 @@ int memcg_update_all_caches(int num_memcgs)
 
 static void unlink_memcg_cache(struct kmem_cache *s)
 {
-   list_del(&s->memcg_params.list);
+   

[PATCH 7/9] slab: introduce __kmemcg_cache_deactivate()

2017-01-13 Thread Tejun Heo
__kmem_cache_shrink() is called with %true @deactivate only for memcg
caches.  Remove @deactivate from __kmem_cache_shrink() and introduce
__kmemcg_cache_deactivate() instead.  Each memcg-supporting allocator
should implement it and it should deactivate and drain the cache.

This is to allow memcg cache deactivation behavior to further deviate
from simple shrinking without messing up __kmem_cache_shrink().

This is pure reorganization and doesn't introduce any observable
behavior changes.

Signed-off-by: Tejun Heo 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 mm/slab.c| 11 +--
 mm/slab.h|  5 -
 mm/slab_common.c |  4 ++--
 mm/slob.c|  2 +-
 mm/slub.c| 39 ++-
 5 files changed, 38 insertions(+), 23 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 767e8e4..65814f2 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2314,7 +2314,7 @@ static int drain_freelist(struct kmem_cache *cache,
return nr_freed;
 }
 
-int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
+int __kmem_cache_shrink(struct kmem_cache *cachep)
 {
int ret = 0;
int node;
@@ -2332,9 +2332,16 @@ int __kmem_cache_shrink(struct kmem_cache *cachep, bool 
deactivate)
return (ret ? 1 : 0);
 }
 
+#ifdef CONFIG_MEMCG
+void __kmemcg_cache_deactivate(struct kmem_cache *cachep)
+{
+   __kmem_cache_shrink(cachep);
+}
+#endif
+
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
-   return __kmem_cache_shrink(cachep, false);
+   return __kmem_cache_shrink(cachep);
 }
 
 void __kmem_cache_release(struct kmem_cache *cachep)
diff --git a/mm/slab.h b/mm/slab.h
index 8f47a44..73ed6b5 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -164,7 +164,10 @@ static inline unsigned long kmem_cache_flags(unsigned long 
object_size,
 
 int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
-int __kmem_cache_shrink(struct kmem_cache *, bool);
+int __kmem_cache_shrink(struct kmem_cache *);
+#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+void __kmemcg_cache_deactivate(struct kmem_cache *s);
+#endif
 void slab_kmem_cache_release(struct kmem_cache *);
 
 struct seq_file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index c0d0126..87e5535 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -602,7 +602,7 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
if (!c)
continue;
 
-   __kmem_cache_shrink(c, true);
+   __kmemcg_cache_deactivate(c);
arr->entries[idx] = NULL;
}
mutex_unlock(&slab_mutex);
@@ -727,7 +727,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
get_online_cpus();
get_online_mems();
kasan_cache_shrink(cachep);
-   ret = __kmem_cache_shrink(cachep, false);
+   ret = __kmem_cache_shrink(cachep);
put_online_mems();
put_online_cpus();
return ret;
diff --git a/mm/slob.c b/mm/slob.c
index 5ec1580..eac04d4 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -634,7 +634,7 @@ void __kmem_cache_release(struct kmem_cache *c)
 {
 }
 
-int __kmem_cache_shrink(struct kmem_cache *d, bool deactivate)
+int __kmem_cache_shrink(struct kmem_cache *d)
 {
return 0;
 }
diff --git a/mm/slub.c b/mm/slub.c
index a26cb90..ef89a07 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3886,7 +3886,7 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
+int __kmem_cache_shrink(struct kmem_cache *s)
 {
int node;
int i;
@@ -3898,21 +3898,6 @@ int __kmem_cache_shrink(struct kmem_cache *s, bool 
deactivate)
unsigned long flags;
int ret = 0;
 
-   if (deactivate) {
-   /*
-* Disable empty slabs caching. Used to avoid pinning offline
-* memory cgroups by kmem pages that can be freed.
-*/
-   s->cpu_partial = 0;
-   s->min_partial = 0;
-
-   /*
-* s->cpu_partial is checked locklessly (see put_cpu_partial),
-* so we have to make sure the change is visible.
-*/
-   synchronize_sched();
-   }
-
flush_all(s);
for_each_kmem_cache_node(s, node, n) {
INIT_LIST_HEAD(&discard);
@@ -3963,13 +3948,33 @@ int __kmem_cache_shrink(struct kmem_cache *s, bool 
deactivate)
return ret;
 }
 
+#ifdef CONFIG_MEMCG
+void __kmemcg_cache_deactivate(struct kmem_cache *s)
+{
+   /*
+* Disable empty slabs caching. Used to avoid pinning offline
+* memory cgroups by kmem pages that can be freed.
+*/
+   s->cpu_partial = 0;
+   s->min_partial = 0;
+
+   /*
+* s->cpu_partial is checked locklessly (see put_cpu_partial), so

[PATCH 4/9] slab: reorganize memcg_cache_params

2017-01-13 Thread Tejun Heo
We're gonna change how memcg caches are iterated.  In preparation,
clean up and reorganize memcg_cache_params.

* The shared ->list is replaced by ->children in root and
  ->children_node in children.

* ->is_root_cache is removed.  Instead ->root_cache is moved out of
  the child union and now used by both root and children.  NULL
  indicates root cache.  Non-NULL a memcg one.

This patch doesn't cause any observable behavior changes.

Signed-off-by: Tejun Heo 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 include/linux/slab.h | 33 -
 mm/slab.h|  6 +++---
 mm/slab_common.c | 21 +++--
 3 files changed, 38 insertions(+), 22 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 084b12b..b310173 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -545,22 +545,37 @@ struct memcg_cache_array {
  * array to be accessed without taking any locks, on relocation we free the old
  * version only after a grace period.
  *
- * Child caches will hold extra metadata needed for its operation. Fields are:
+ * Root and child caches hold different metadata.
  *
- * @memcg: pointer to the memcg this cache belongs to
- * @root_cache: pointer to the global, root cache, this cache was derived from
+ * @root_cache: Common to root and child caches.  NULL for root, pointer to
+ * the root cache for children.
  *
- * Both root and child caches of the same kind are linked into a list chained
- * through @list.
+ * The following fields are specific to root caches.
+ *
+ * @memcg_caches: kmemcg ID indexed table of child caches.  This table is
+ * used to index child caches during allocation and cleared
+ * early during shutdown.
+ *
+ * @children:  List of all child caches.  While the child caches are also
+ * reachable through @memcg_caches, a child cache remains on
+ * this list until it is actually destroyed.
+ *
+ * The following fields are specific to child caches.
+ *
+ * @memcg: Pointer to the memcg this cache belongs to.
+ *
+ * @children_node: List node for @root_cache->children list.
  */
 struct memcg_cache_params {
-   bool is_root_cache;
-   struct list_head list;
+   struct kmem_cache *root_cache;
union {
-   struct memcg_cache_array __rcu *memcg_caches;
+   struct {
+   struct memcg_cache_array __rcu *memcg_caches;
+   struct list_head children;
+   };
struct {
struct mem_cgroup *memcg;
-   struct kmem_cache *root_cache;
+   struct list_head children_node;
};
};
 };
diff --git a/mm/slab.h b/mm/slab.h
index 3fa2d77..9b3ec3f 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -203,12 +203,12 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, 
size_t, void **);
  * slab_mutex.
  */
 #define for_each_memcg_cache(iter, root) \
-   list_for_each_entry(iter, &(root)->memcg_params.list, \
-   memcg_params.list)
+   list_for_each_entry(iter, &(root)->memcg_params.children, \
+   memcg_params.children_node)
 
 static inline bool is_root_cache(struct kmem_cache *s)
 {
-   return s->memcg_params.is_root_cache;
+   return !s->memcg_params.root_cache;
 }
 
 static inline bool slab_equal_or_root(struct kmem_cache *s,
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 45aa67c..fdf5b6d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -135,9 +135,9 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t 
flags, size_t nr,
 #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
 void slab_init_memcg_params(struct kmem_cache *s)
 {
-   s->memcg_params.is_root_cache = true;
-   INIT_LIST_HEAD(>memcg_params.list);
+   s->memcg_params.root_cache = NULL;
RCU_INIT_POINTER(s->memcg_params.memcg_caches, NULL);
+   INIT_LIST_HEAD(>memcg_params.children);
 }
 
 static int init_memcg_params(struct kmem_cache *s,
@@ -145,10 +145,10 @@ static int init_memcg_params(struct kmem_cache *s,
 {
struct memcg_cache_array *arr;
 
-   if (memcg) {
-   s->memcg_params.is_root_cache = false;
-   s->memcg_params.memcg = memcg;
+   if (root_cache) {
s->memcg_params.root_cache = root_cache;
+   s->memcg_params.memcg = memcg;
+   INIT_LIST_HEAD(>memcg_params.children_node);
return 0;
}
 
@@ -218,7 +218,7 @@ int memcg_update_all_caches(int num_memcgs)
 
 static void unlink_memcg_cache(struct kmem_cache *s)
 {
-   list_del(>memcg_params.list);
+   list_del(>memcg_params.children_node);
 }
 #else
 static inline int init_memcg_params(struct kmem_cache *s,
@@ -559,7 +559,8 @@ void memcg_create_kmem_cache(struct mem_cgroup 

[PATCH 1/9] Revert "slub: move synchronize_sched out of slab_mutex on shrink"

2017-01-13 Thread Tejun Heo
This reverts commit 89e364db71fb5e7fc8d93228152abfa67daf35fa.

With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

Moving synchronize_sched() out of slab_mutex isn't enough as it's
still inside cgroup_mutex.  The whole deactivation / release path will
be updated to avoid all synchronous RCU operations.  Revert this
insufficient optimization in preparation to ease future changes.
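
(For context, a rough sketch of the call pattern whose cost is being discussed --
deactivate_one_memcg() is a made-up name standing in for the memcg offline path:)

	/*
	 * Illustrative only, not actual kernel code: each dying cgroup runs
	 * something like this with cgroup_mutex held, so the synchronous
	 * grace periods of all dying cgroups serialize behind one mutex.
	 */
	static void deactivate_one_memcg(struct mem_cgroup *memcg)
	{
		mutex_lock(&slab_mutex);
		/* zero cpu_partial / min_partial on each of memcg's caches */
		mutex_unlock(&slab_mutex);

		synchronize_sched();	/* one full sched-RCU grace period per cgroup */
	}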

Signed-off-by: Tejun Heo 
Reported-by: Jay Vana 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 mm/slab.c|  4 ++--
 mm/slab.h|  2 +-
 mm/slab_common.c | 27 ++-
 mm/slob.c|  2 +-
 mm/slub.c| 19 +--
 5 files changed, 23 insertions(+), 31 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 29bc6c0..767e8e4 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2314,7 +2314,7 @@ static int drain_freelist(struct kmem_cache *cache,
return nr_freed;
 }
 
-int __kmem_cache_shrink(struct kmem_cache *cachep)
+int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
 {
int ret = 0;
int node;
@@ -2334,7 +2334,7 @@ int __kmem_cache_shrink(struct kmem_cache *cachep)
 
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
-   return __kmem_cache_shrink(cachep);
+   return __kmem_cache_shrink(cachep, false);
 }
 
 void __kmem_cache_release(struct kmem_cache *cachep)
diff --git a/mm/slab.h b/mm/slab.h
index de6579d..4acc644 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -161,7 +161,7 @@ static inline unsigned long kmem_cache_flags(unsigned long 
object_size,
 
 int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
-int __kmem_cache_shrink(struct kmem_cache *);
+int __kmem_cache_shrink(struct kmem_cache *, bool);
 void slab_kmem_cache_release(struct kmem_cache *);
 
 struct seq_file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index ae32384..46ff746 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -579,29 +579,6 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
get_online_cpus();
get_online_mems();
 
-#ifdef CONFIG_SLUB
-   /*
-* In case of SLUB, we need to disable empty slab caching to
-* avoid pinning the offline memory cgroup by freeable kmem
-* pages charged to it. SLAB doesn't need this, as it
-* periodically purges unused slabs.
-*/
-   mutex_lock(_mutex);
-   list_for_each_entry(s, _caches, list) {
-   c = is_root_cache(s) ? cache_from_memcg_idx(s, idx) : NULL;
-   if (c) {
-   c->cpu_partial = 0;
-   c->min_partial = 0;
-   }
-   }
-   mutex_unlock(_mutex);
-   /*
-* kmem_cache->cpu_partial is checked locklessly (see
-* put_cpu_partial()). Make sure the change is visible.
-*/
-   synchronize_sched();
-#endif
-
mutex_lock(_mutex);
list_for_each_entry(s, _caches, list) {
if (!is_root_cache(s))
@@ -613,7 +590,7 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
if (!c)
continue;
 
-   __kmem_cache_shrink(c);
+   __kmem_cache_shrink(c, true);
arr->entries[idx] = NULL;
}
mutex_unlock(_mutex);
@@ -784,7 +761,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
get_online_cpus();
get_online_mems();
kasan_cache_shrink(cachep);
-   ret = __kmem_cache_shrink(cachep);
+   ret = __kmem_cache_shrink(cachep, false);
put_online_mems();
put_online_cpus();
return ret;
diff --git a/mm/slob.c b/mm/slob.c
index eac04d4..5ec1580 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -634,7 +634,7 @@ void __kmem_cache_release(struct kmem_cache *c)
 {
 }
 
-int __kmem_cache_shrink(struct kmem_cache *d)
+int __kmem_cache_shrink(struct kmem_cache *d, bool deactivate)
 {
return 0;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 067598a..68b84f9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3883,7 +3883,7 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int __kmem_cache_shrink(struct kmem_cache *s)
+int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
 {
int node;
int i;
@@ -3895,6 +3895,21 @@ int __kmem_cache_shrink(struct kmem_cache *s)
unsigned long flags;
int ret = 

[PATCH 6/9] slab: don't put memcg caches on slab_caches list

2017-01-13 Thread Tejun Heo
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

slab_caches currently lists all caches including root and memcg ones.
This is the only data structure which lists the root caches and
iterating root caches can only be done by walking the list while
skipping over memcg caches.  As there can be a huge number of memcg
caches, this can become very expensive.

This also can make /proc/slabinfo behave very badly.  seq_file
processes reads in 4k chunks and seeks to the previous Nth position on
slab_caches list to resume after each chunk.  With a lot of memcg
cache churns on the list, reading /proc/slabinfo can become very slow
and its content often ends up with duplicate and/or missing entries.
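
(For illustration, the position-based resume that seq_file forces on a list
iterator looks roughly like this -- a simplified sketch, not the exact
mm/slab_common.c code:)

	/* ->start() is called at the beginning of every 4k read chunk; *pos is
	 * the element index where the previous chunk stopped.  The matching
	 * unlock happens in ->stop(), omitted here. */
	static void *slab_start(struct seq_file *m, loff_t *pos)
	{
		struct kmem_cache *s;
		loff_t n = *pos;

		mutex_lock(&slab_mutex);
		list_for_each_entry(s, &slab_caches, list)
			if (n-- == 0)
				return s;	/* resume at the Nth cache */
		return NULL;
	}

If caches were created or destroyed between chunks, the Nth position no longer
names the same cache, which is where the duplicated and missing entries come from.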

As the previous patch made it unnecessary to walk slab_caches to
iterate memcg-specific caches, there is no reason to keep memcg caches
on the list.  This patch makes slab_caches include only the root
caches.  As this makes slab_cache->list unused for memcg caches,
->memcg_params.children_node is removed and ->list is used instead.

Signed-off-by: Tejun Heo 
Reported-by: Jay Vana 
Cc: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrew Morton 
---
 include/linux/slab.h |  3 ---
 mm/slab.h|  3 +--
 mm/slab_common.c | 58 +---
 3 files changed, 29 insertions(+), 35 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 54ec959..63d543d 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -564,8 +564,6 @@ struct memcg_cache_array {
  *
  * @memcg: Pointer to the memcg this cache belongs to.
  *
- * @children_node: List node for @root_cache->children list.
- *
  * @kmem_caches_node: List node for @memcg->kmem_caches list.
  */
 struct memcg_cache_params {
@@ -577,7 +575,6 @@ struct memcg_cache_params {
};
struct {
struct mem_cgroup *memcg;
-   struct list_head children_node;
struct list_head kmem_caches_node;
};
};
diff --git a/mm/slab.h b/mm/slab.h
index b5e0040..8f47a44 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -203,8 +203,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, 
size_t, void **);
  * slab_mutex.
  */
 #define for_each_memcg_cache(iter, root) \
-   list_for_each_entry(iter, &(root)->memcg_params.children, \
-   memcg_params.children_node)
+   list_for_each_entry(iter, &(root)->memcg_params.children, list)
 
 static inline bool is_root_cache(struct kmem_cache *s)
 {
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 74c36d8..c0d0126 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -68,6 +68,22 @@ unsigned int kmem_cache_size(struct kmem_cache *s)
 EXPORT_SYMBOL(kmem_cache_size);
 
 #ifdef CONFIG_DEBUG_VM
+static void kmem_cache_verify_name(struct kmem_cache *s)
+{
+   char tmp;
+   int res;
+
+   /*
+* This happens when the module gets unloaded and doesn't destroy
+* its slab cache and no-one else reuses the vmalloc area of the
+* module.  Print a warning.
+*/
+   res = probe_kernel_address(s->name, tmp);
+   if (res)
+   pr_err("Slab cache with size %d has lost its name\n",
+  s->object_size);
+}
+
 static int kmem_cache_sanity_check(const char *name, size_t size)
 {
struct kmem_cache *s = NULL;
@@ -79,20 +95,12 @@ static int kmem_cache_sanity_check(const char *name, size_t 
size)
}
 
list_for_each_entry(s, _caches, list) {
-   char tmp;
-   int res;
+   struct kmem_cache *c;
 
-   /*
-* This happens when the module gets unloaded and doesn't
-* destroy its slab cache and no-one else reuses the vmalloc
-* area of the module.  Print a warning.
-*/
-   res = probe_kernel_address(s->name, tmp);
-   if (res) {
-   pr_err("Slab cache with size %d has lost its name\n",
-  s->object_size);
-   continue;
-   }
+   kmem_cache_verify_name(s);
+
+   for_each_memcg_cache(c, s)
+   kmem_cache_verify_name(c);
}
 
WARN_ON(strchr(name, ' ')); /* It confuses parsers */
@@ -148,7 +156,6 @@ static int init_memcg_params(struct kmem_cache *s,
if 

Re: [PATCH] platform/x86: Add IMS/ZII SCU driver

2017-01-13 Thread Guenter Roeck

On 01/13/2017 03:15 PM, Florian Fainelli wrote:

On 01/13/2017 08:38 AM, Andy Shevchenko wrote:

On Wed, Jan 11, 2017 at 11:26 PM, Florian Fainelli  wrote:

From: Guenter Roeck 

This patch adds support for the IMS (now Zodiac Inflight Innovations)
SCU Generation 1/2/3 platform driver. This driver registers all the
on-module peripherals: Ethernet switches (Broadcom or Marvell), I2C to
GPIO expanders, Kontron CPLD/I2C master, and more.

Signed-off-by: Guenter Roeck 
Signed-off-by: Florian Fainelli 



---
Darren,

This is against your "for-next" branch thanks!


I'm going to review this later, though few of comments.


+config SCU


No, no. We have enough mess with Intel's SCU/PMIC here, not add more.


OK OK.



At least add manufacturer as prefix to option and file name.

Btw, Darren, would it be good idea to start creating folders to make a
bit of order in the subsystem? For first I would move Intel's PMIC/SCU
stuff to somewhere (not sure if it should be per manufacturer or per
function).


+obj-$(CONFIG_SCU)  += scu.o


For file name as well.


+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 


Is it possible to keep them in order?
Do you need all of them?
Does it sound like MFD driver + individual drivers?


My understanding of a valid MFD candidate driver is that it partitions a
shared resource space into individual resources that can all be managed
by their respective drivers. The KemPLD driver is already an MFD driver;
here, this is more of a superset of all of that: an aggregate
x86-based module that has a number of on-board peripherals.




+struct __packed eeprom_data {
+   unsigned short length;  /* 0 - 1 */
+   unsigned char checksum; /* 2 */
+   unsigned char have_gsm_modem;   /* 3 */
+   unsigned char have_cdma_modem;  /* 4 */
+   unsigned char have_wifi_modem;  /* 5 */
+   unsigned char have_rhdd;/* 6 */
+   unsigned char have_dvd; /* 7 */
+   unsigned char have_tape;/* 8 */
+   unsigned char have_humidity_sensor; /* 9 */
+   unsigned char have_fiber_channel;   /* 10 */
+   unsigned char lru_part_number[11];  /* 11 - 21 Box Part Number */
+   unsigned char lru_revision[7];  /* 22 - 28 Box Revision */
+   unsigned char lru_serial_number[7]; /* 29 - 35 Box Serial Number */
+   unsigned char lru_date_of_manufacture[7];
+   /* 36 - 42 Box Date of Manufacture */
+   unsigned char board_part_number[11];
+   /* 43 - 53 Base Board Part Number */
+   unsigned char board_revision[7];
+   /* 54 - 60 Base Board Revision */
+   unsigned char board_serial_number[7];
+   /* 61 - 67 Base Board Serial Number */
+   unsigned char board_date_of_manufacture[7];
+   /* 68 - 74 Base Board Date of Manufacture */
+   unsigned char board_updated_date_of_manufacture[7];
+   /* 75 - 81 Updated Box Date of Manufacture */
+   unsigned char board_updated_revision[7];
+   /* 82 - 88 Updated Box Revision */
+   unsigned char dummy[7]; /* 89 - 95 spare/filler */
+};


Would it be better to use fixed-width types here:
u8
u16


+enum scu_version { scu1, scu2, scu3, unknown };


MANUFACTURER_SCU_VERSION_x
?


OK.




+struct scu_data {
+   struct device *dev; /* SCU platform device */
+   struct net_device *netdev;  /* Ethernet device */
+   struct platform_device *mdio_dev;   /* MDIO device */
+   struct platform_device *dsa_dev;/* DSA device */
+   struct proc_dir_entry *rave_proc_dir;
+   struct mutex write_lock;
+   struct platform_device *leds_pdev[3];
+   struct i2c_adapter *adapter;/* I2C adapter */
+   struct spi_master *master;  /* SPI master */
+   struct i2c_client *client[10];  /* I2C clients */
+   struct spi_device *spidev[1];   /* SPI devices */


Comments are good candidates for kernel doc.


+   const struct scu_platform_data *pdata;
+   bool have_write_magic;
+   struct eeprom_data eeprom;
+   struct nvmem_device *nvmem;



+   bool eeprom_accessible;
+   bool eeprom_valid;


unsigned int flag1:1;
unsigned int flag2:1;

?


Are you concerned with storage, or what motivates the preference for
bitfields vs. bools?


Four bytes less storage, lots of additional code for each access. Not really 
sure
if I like that idea. In my scope of responsibility I ask for the opposite.
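
(For readers following the bool-vs-bitfield point, a minimal illustration --
field names borrowed from the driver, exact sizes depend on the surrounding
struct layout:)

	/* One byte per flag; reads and writes are plain byte loads/stores. */
	struct scu_flags_bool {
		bool eeprom_accessible;
		bool eeprom_valid;
	};

	/* Flags packed into bits of one word; smaller once there are enough
	 * flags, but every update compiles to a read-modify-write of the
	 * shared word. */
	struct scu_flags_bits {
		unsigned int eeprom_accessible:1;
		unsigned int eeprom_valid:1;
	};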





+/* platform data */
+
+static struct gpio_led 

[PATCH v6 0/3] dmaengine: xilinx_dma: Bug fixes

2017-01-13 Thread Kedareswara rao Appana
This patch series fixes the below bugs in the DMA and VDMA IPs:
---> Do not start VDMA until the frame buffer is processed by the h/w
---> Fix bug in multi frame stores handling in VDMA
---> Fix issues w.r.t. multi frame descriptor submission on the AXI DMA S2MM (recv)
side.

Kedareswara rao Appana (3):
  dmaengine: xilinx_dma: Check for channel idle state before submitting
dma descriptor
  dmaengine: xilinx_dma: Fix bug in multiple frame stores scenario in
vdma
  dmaengine: xilinx_dma: Fix race condition in the driver for multiple
descriptor scenario

 .../devicetree/bindings/dma/xilinx/xilinx_dma.txt  |   2 +
 drivers/dma/xilinx/xilinx_dma.c| 270 -
 2 files changed, 161 insertions(+), 111 deletions(-)

-- 
2.1.2



[PATCH v6 1/3] dmaengine: xilinx_dma: Check for channel idle state before submitting dma descriptor

2017-01-13 Thread Kedareswara rao Appana
Add a variable for checking the channel idle state to ensure that a DMA
descriptor is not submitted when the DMA engine is already in progress.

This avoids polling a bit in the status register to know the DMA state
in the driver hot path.

Reviewed-by: Jose Abreu 
Signed-off-by: Kedareswara rao Appana 
---
Changes for v6:
---> Updated commit message as suggested by Vinod.
---> Added Channel idle variable description in the driver
 as suggested by Vinod.
Changes for v5:
---> None.
Changes for v4:
---> None.
Changes for v3:
---> None.
Changes for v2:
---> Add idle check in the reset as suggested by Jose Abreu
---> Removed xilinx_dma_is_running/xilinx_dma_is_idle checks
in the driver and used common idle checks across the driver
as suggested by Laurent Pinchart.

 drivers/dma/xilinx/xilinx_dma.c | 61 +++--
 1 file changed, 22 insertions(+), 39 deletions(-)

diff --git a/drivers/dma/xilinx/xilinx_dma.c b/drivers/dma/xilinx/xilinx_dma.c
index 8288fe4..5eeea57 100644
--- a/drivers/dma/xilinx/xilinx_dma.c
+++ b/drivers/dma/xilinx/xilinx_dma.c
@@ -321,6 +321,7 @@ struct xilinx_dma_tx_descriptor {
  * @cyclic: Check for cyclic transfers.
  * @genlock: Support genlock mode
  * @err: Channel has errors
+ * @idle: Check for channel idle
  * @tasklet: Cleanup work after irq
  * @config: Device configuration info
  * @flush_on_fsync: Flush on Frame sync
@@ -351,6 +352,7 @@ struct xilinx_dma_chan {
bool cyclic;
bool genlock;
bool err;
+   bool idle;
struct tasklet_struct tasklet;
struct xilinx_vdma_config config;
bool flush_on_fsync;
@@ -920,32 +922,6 @@ static enum dma_status xilinx_dma_tx_status(struct 
dma_chan *dchan,
 }
 
 /**
- * xilinx_dma_is_running - Check if DMA channel is running
- * @chan: Driver specific DMA channel
- *
- * Return: '1' if running, '0' if not.
- */
-static bool xilinx_dma_is_running(struct xilinx_dma_chan *chan)
-{
-   return !(dma_ctrl_read(chan, XILINX_DMA_REG_DMASR) &
-XILINX_DMA_DMASR_HALTED) &&
-   (dma_ctrl_read(chan, XILINX_DMA_REG_DMACR) &
-XILINX_DMA_DMACR_RUNSTOP);
-}
-
-/**
- * xilinx_dma_is_idle - Check if DMA channel is idle
- * @chan: Driver specific DMA channel
- *
- * Return: '1' if idle, '0' if not.
- */
-static bool xilinx_dma_is_idle(struct xilinx_dma_chan *chan)
-{
-   return dma_ctrl_read(chan, XILINX_DMA_REG_DMASR) &
-   XILINX_DMA_DMASR_IDLE;
-}
-
-/**
  * xilinx_dma_halt - Halt DMA channel
  * @chan: Driver specific DMA channel
  */
@@ -966,6 +942,7 @@ static void xilinx_dma_halt(struct xilinx_dma_chan *chan)
chan, dma_ctrl_read(chan, XILINX_DMA_REG_DMASR));
chan->err = true;
}
+   chan->idle = true;
 }
 
 /**
@@ -1007,6 +984,9 @@ static void xilinx_vdma_start_transfer(struct 
xilinx_dma_chan *chan)
if (chan->err)
return;
 
+   if (!chan->idle)
+   return;
+
if (list_empty(>pending_list))
return;
 
@@ -1018,13 +998,6 @@ static void xilinx_vdma_start_transfer(struct 
xilinx_dma_chan *chan)
tail_segment = list_last_entry(_desc->segments,
   struct xilinx_vdma_tx_segment, node);
 
-   /* If it is SG mode and hardware is busy, cannot submit */
-   if (chan->has_sg && xilinx_dma_is_running(chan) &&
-   !xilinx_dma_is_idle(chan)) {
-   dev_dbg(chan->dev, "DMA controller still busy\n");
-   return;
-   }
-
/*
 * If hardware is idle, then all descriptors on the running lists are
 * done, start new transfers
@@ -1110,6 +1083,7 @@ static void xilinx_vdma_start_transfer(struct 
xilinx_dma_chan *chan)
vdma_desc_write(chan, XILINX_DMA_REG_VSIZE, last->hw.vsize);
}
 
+   chan->idle = false;
if (!chan->has_sg) {
list_del(>node);
list_add_tail(>node, >active_list);
@@ -1136,6 +1110,9 @@ static void xilinx_cdma_start_transfer(struct 
xilinx_dma_chan *chan)
if (chan->err)
return;
 
+   if (!chan->idle)
+   return;
+
if (list_empty(>pending_list))
return;
 
@@ -1181,6 +1158,7 @@ static void xilinx_cdma_start_transfer(struct 
xilinx_dma_chan *chan)
 
list_splice_tail_init(>pending_list, >active_list);
chan->desc_pendingcount = 0;
+   chan->idle = false;
 }
 
 /**
@@ -1196,15 +1174,11 @@ static void xilinx_dma_start_transfer(struct 
xilinx_dma_chan *chan)
if (chan->err)
return;
 
-   if (list_empty(>pending_list))
+   if (!chan->idle)
return;
 
-   /* If it is SG mode and hardware is busy, cannot submit */
-   if (chan->has_sg && xilinx_dma_is_running(chan) &&
-   !xilinx_dma_is_idle(chan)) {
-   dev_dbg(chan->dev, "DMA controller still busy\n");
+   if (list_empty(>pending_list))

[PATCH v6 2/3] dmaengine: xilinx_dma: Fix bug in multiple frame stores scenario in vdma

2017-01-13 Thread Kedareswara rao Appana
When VDMA is configured for more than one frame in the h/w (for
example, the h/w is configured for n frames), the user submits n frames
and triggers the DMA using the issue_pending API.

In the current driver flow we submit one frame at a time, but we should
submit all n frames at once, since the h/w is configured for n frames.

This patch fixes this issue.

Acked-by: Rob Herring 
Reviewed-by: Jose Abreu 
Signed-off-by: Kedareswara rao Appana 
---
Changes for v6:
---> Added Rob Acked-by
---> Updated commit message as suggested by Vinod.
Changes for v5:
---> Updated xlnx,fstore-config property to xlnx,fstore-enable
 and updated description as suggested by Rob.
Changes for v4:
---> Add Check for framestore configuration on Transmit case as well
 as suggested by Jose Abreu.
---> Modified the dev_dbg checks to dev_warn checks as suggested
 by Jose Abreu.
Changes for v3:
---> Added Checks for frame store configuration. If frame store
 Configuration is not present at the h/w level and user
 Submits less frames added debug prints in the driver as relevant.
Changes for v2:
---> Fixed race conditions in the driver as suggested by Jose Abreu
---> Fixed unnecessray if else checks in the vdma_start_transfer
 as suggested by Laurent Pinchart.

 .../devicetree/bindings/dma/xilinx/xilinx_dma.txt  |  2 +
 drivers/dma/xilinx/xilinx_dma.c| 78 +++---
 2 files changed, 57 insertions(+), 23 deletions(-)

diff --git a/Documentation/devicetree/bindings/dma/xilinx/xilinx_dma.txt 
b/Documentation/devicetree/bindings/dma/xilinx/xilinx_dma.txt
index a2b8bfa..e951c09 100644
--- a/Documentation/devicetree/bindings/dma/xilinx/xilinx_dma.txt
+++ b/Documentation/devicetree/bindings/dma/xilinx/xilinx_dma.txt
@@ -66,6 +66,8 @@ Optional child node properties:
 Optional child node properties for VDMA:
 - xlnx,genlock-mode: Tells Genlock synchronization is
enabled/disabled in hardware.
+- xlnx,fstore-enable: boolean; if defined, it indicates that controller
+   supports frame store configuration.
 Optional child node properties for AXI DMA:
 -dma-channels: Number of dma channels in child node.
 
diff --git a/drivers/dma/xilinx/xilinx_dma.c b/drivers/dma/xilinx/xilinx_dma.c
index 5eeea57..edb5b71 100644
--- a/drivers/dma/xilinx/xilinx_dma.c
+++ b/drivers/dma/xilinx/xilinx_dma.c
@@ -322,6 +322,7 @@ struct xilinx_dma_tx_descriptor {
  * @genlock: Support genlock mode
  * @err: Channel has errors
  * @idle: Check for channel idle
+ * @has_fstoreen: Check for frame store configuration
  * @tasklet: Cleanup work after irq
  * @config: Device configuration info
  * @flush_on_fsync: Flush on Frame sync
@@ -353,6 +354,7 @@ struct xilinx_dma_chan {
bool genlock;
bool err;
bool idle;
+   bool has_fstoreen;
struct tasklet_struct tasklet;
struct xilinx_vdma_config config;
bool flush_on_fsync;
@@ -990,6 +992,27 @@ static void xilinx_vdma_start_transfer(struct 
xilinx_dma_chan *chan)
if (list_empty(>pending_list))
return;
 
+   /*
+* Note: When VDMA is built with the default h/w configuration,
+* the user should submit as many frames as the h/w is configured for.
+* If the user submits fewer frames than the h/w is configured for,
+* the VDMA engine tries to write to an invalid location, resulting
+* in undefined behaviour/memory corruption.
+*
+* If the user would like to submit fewer frames than the h/w is
+* capable of, enable debug info 13 (S2MM side) or debug info 6
+* (MM2S side) at the h/w level; this allows the number of frame
+* buffers to be modified at runtime.
+*/
+   if (!chan->has_fstoreen &&
+chan->desc_pendingcount < chan->num_frms) {
+   dev_warn(chan->dev, "Frame Store Configuration is not enabled 
at the\n");
+   dev_warn(chan->dev, "H/w level enable Debug info 13 or 6 at the 
h/w level\n");
+   dev_warn(chan->dev, "OR Submit the frames upto h/w 
Capable\n\r");
+
+   return;
+   }
+
desc = list_first_entry(>pending_list,
struct xilinx_dma_tx_descriptor, node);
tail_desc = list_last_entry(>pending_list,
@@ -1052,25 +1075,38 @@ static void xilinx_vdma_start_transfer(struct 
xilinx_dma_chan *chan)
if (chan->has_sg) {
dma_ctrl_write(chan, XILINX_DMA_REG_TAILDESC,
tail_segment->phys);
+   list_splice_tail_init(>pending_list, >active_list);
+   chan->desc_pendingcount = 0;
} else {
struct xilinx_vdma_tx_segment *segment, *last = NULL;
-   int i = 0;
+   int i = 0, j = 0;
 
if (chan->desc_submitcount < chan->num_frms)
i = chan->desc_submitcount;
 
-   list_for_each_entry(segment, >segments, node) {
- 

[PATCH v6 3/3] dmaengine: xilinx_dma: Fix race condition in the driver for multiple descriptor scenario

2017-01-13 Thread Kedareswara rao Appana
As per the AXI DMA spec, the software must not move the tail pointer to
a location that has not been updated (the next descriptor field of the
h/w descriptor should always point to a valid address).

When the user submits multiple descriptors on the recv side, with the
current driver flow the last buffer descriptor's next descriptor field
points to an invalid location, resulting in invalid data or errors in
the DMA engine.

This patch fixes this issue by creating a buffer descriptor chain during
channel allocation itself and using those buffer descriptors.
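
(A rough sketch of the idea, not verbatim from the patch: the BD ring is carved
out of one coherent allocation at alloc_chan_resources() time and every
next_desc is pointed at the following BD, so the tail pointer can never be
moved past an unallocated descriptor.  Error handling omitted; `i` is a plain
loop counter.)

	chan->seg_v = dma_zalloc_coherent(chan->dev,
					  sizeof(*chan->seg_v) * XILINX_DMA_NUM_DESCS,
					  &chan->seg_p, GFP_KERNEL);
	for (i = 0; i < XILINX_DMA_NUM_DESCS; i++) {
		dma_addr_t next = chan->seg_p +
				  sizeof(*chan->seg_v) * ((i + 1) % XILINX_DMA_NUM_DESCS);

		/* link each BD to the next one so next_desc is always valid */
		chan->seg_v[i].hw.next_desc = lower_32_bits(next);
		chan->seg_v[i].hw.next_desc_msb = upper_32_bits(next);
		chan->seg_v[i].phys = chan->seg_p + sizeof(*chan->seg_v) * i;
		list_add_tail(&chan->seg_v[i].node, &chan->free_seg_list);
	}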

Signed-off-by: Kedareswara rao Appana 
---
Changes for v6:
---> Updated Commit message as suggested by Vinod.
Changes for v5:
---> None.
Changes for v4:
---> None.
Changes for v3:
---> None.
Changes for v2:
---> None.

 drivers/dma/xilinx/xilinx_dma.c | 133 +---
 1 file changed, 83 insertions(+), 50 deletions(-)

diff --git a/drivers/dma/xilinx/xilinx_dma.c b/drivers/dma/xilinx/xilinx_dma.c
index edb5b71..c5cd935 100644
--- a/drivers/dma/xilinx/xilinx_dma.c
+++ b/drivers/dma/xilinx/xilinx_dma.c
@@ -163,6 +163,7 @@
 #define XILINX_DMA_BD_SOP  BIT(27)
 #define XILINX_DMA_BD_EOP  BIT(26)
 #define XILINX_DMA_COALESCE_MAX255
+#define XILINX_DMA_NUM_DESCS   255
 #define XILINX_DMA_NUM_APP_WORDS   5
 
 /* Multi-Channel DMA Descriptor offsets*/
@@ -310,6 +311,7 @@ struct xilinx_dma_tx_descriptor {
  * @pending_list: Descriptors waiting
  * @active_list: Descriptors ready to submit
  * @done_list: Complete descriptors
+ * @free_seg_list: Free descriptors
  * @common: DMA common channel
  * @desc_pool: Descriptors pool
  * @dev: The dma device
@@ -331,7 +333,9 @@ struct xilinx_dma_tx_descriptor {
  * @desc_submitcount: Descriptor h/w submitted count
  * @residue: Residue for AXI DMA
  * @seg_v: Statically allocated segments base
+ * @seg_p: Physical allocated segments base
  * @cyclic_seg_v: Statically allocated segment base for cyclic transfers
+ * @cyclic_seg_p: Physical allocated segments base for cyclic dma
  * @start_transfer: Differentiate b/w DMA IP's transfer
  */
 struct xilinx_dma_chan {
@@ -342,6 +346,7 @@ struct xilinx_dma_chan {
struct list_head pending_list;
struct list_head active_list;
struct list_head done_list;
+   struct list_head free_seg_list;
struct dma_chan common;
struct dma_pool *desc_pool;
struct device *dev;
@@ -363,7 +368,9 @@ struct xilinx_dma_chan {
u32 desc_submitcount;
u32 residue;
struct xilinx_axidma_tx_segment *seg_v;
+   dma_addr_t seg_p;
struct xilinx_axidma_tx_segment *cyclic_seg_v;
+   dma_addr_t cyclic_seg_p;
void (*start_transfer)(struct xilinx_dma_chan *chan);
u16 tdest;
 };
@@ -569,17 +576,31 @@ static struct xilinx_axidma_tx_segment *
 xilinx_axidma_alloc_tx_segment(struct xilinx_dma_chan *chan)
 {
struct xilinx_axidma_tx_segment *segment;
-   dma_addr_t phys;
-
-   segment = dma_pool_zalloc(chan->desc_pool, GFP_ATOMIC, );
-   if (!segment)
-   return NULL;
+   unsigned long flags;
 
-   segment->phys = phys;
+   spin_lock_irqsave(>lock, flags);
+   if (!list_empty(>free_seg_list)) {
+   segment = list_first_entry(>free_seg_list,
+  struct xilinx_axidma_tx_segment,
+  node);
+   list_del(>node);
+   }
+   spin_unlock_irqrestore(>lock, flags);
 
return segment;
 }
 
+static void xilinx_dma_clean_hw_desc(struct xilinx_axidma_desc_hw *hw)
+{
+   u32 next_desc = hw->next_desc;
+   u32 next_desc_msb = hw->next_desc_msb;
+
+   memset(hw, 0, sizeof(struct xilinx_axidma_desc_hw));
+
+   hw->next_desc = next_desc;
+   hw->next_desc_msb = next_desc_msb;
+}
+
 /**
  * xilinx_dma_free_tx_segment - Free transaction segment
  * @chan: Driver specific DMA channel
@@ -588,7 +609,9 @@ xilinx_axidma_alloc_tx_segment(struct xilinx_dma_chan *chan)
 static void xilinx_dma_free_tx_segment(struct xilinx_dma_chan *chan,
struct xilinx_axidma_tx_segment *segment)
 {
-   dma_pool_free(chan->desc_pool, segment, segment->phys);
+   xilinx_dma_clean_hw_desc(>hw);
+
+   list_add_tail(>node, >free_seg_list);
 }
 
 /**
@@ -713,16 +736,26 @@ static void xilinx_dma_free_descriptors(struct 
xilinx_dma_chan *chan)
 static void xilinx_dma_free_chan_resources(struct dma_chan *dchan)
 {
struct xilinx_dma_chan *chan = to_xilinx_chan(dchan);
+   unsigned long flags;
 
dev_dbg(chan->dev, "Free all channel resources.\n");
 
xilinx_dma_free_descriptors(chan);
+
if (chan->xdev->dma_config->dmatype == XDMA_TYPE_AXIDMA) {
-   xilinx_dma_free_tx_segment(chan, chan->cyclic_seg_v);
-   xilinx_dma_free_tx_segment(chan, chan->seg_v);
+   spin_lock_irqsave(>lock, flags);
+   

Re: [RFC/PATCH] ftrace: Allow to change size of function graph filters

2017-01-13 Thread Steven Rostedt
On Sat, 14 Jan 2017 13:20:00 +0900
Namhyung Kim  wrote:


> But I'm not sure how to synchronize hash table manipulations.  It
> seems synchronize_sched() is not good enough for function graph
> tracer, right?  So I limited changing filter size only when no tracer
> is used, but is it ok to have the limitation when adding or removing
> an entry to/from the table?  If not, what can I do?

There's already a hash that function tracing uses. Look at the code in
ftrace_hash_move(). If you make it where the hash is always looked at
with preemption disabled, you can use synchronize_sched() to modify the
hash.
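
(For readers following along, a rough sketch of that pattern -- the fgraph_*
names are made up, and ftrace_lookup_ip()/free_ftrace_hash() stand in for
whatever lookup/free helpers the hash uses:)

	static struct ftrace_hash __rcu *fgraph_hash;	/* hypothetical filter hash */

	static bool fgraph_addr_hashed(unsigned long ip)
	{
		struct ftrace_hash *hash;
		bool ret;

		preempt_disable_notrace();		/* readers run with preemption off */
		hash = rcu_dereference_sched(fgraph_hash);
		ret = hash && ftrace_lookup_ip(hash, ip);
		preempt_enable_notrace();
		return ret;
	}

	static void fgraph_replace_hash(struct ftrace_hash *new_hash)
	{
		struct ftrace_hash *old = rcu_dereference_protected(fgraph_hash, 1);

		rcu_assign_pointer(fgraph_hash, new_hash);
		synchronize_sched();	/* every preempt-off reader has finished */
		free_ftrace_hash(old);
	}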

The trampolines are a bit trickier, because there's no way to disable
preemption or do any other synchronization when they are called.

-- Steve


[Update][PATCH v5 2/9] mm/swap: Add cluster lock

2017-01-13 Thread Huang, Ying
This patch is to reduce the lock contention of swap_info_struct->lock
by using a more fine grained lock in swap_cluster_info for some swap
operations.  swap_info_struct->lock is heavily contended if multiple
processes reclaim pages simultaneously, because there is only one lock
for each swap device, while in a common configuration there are only
one or a few swap devices in the system.  The lock protects almost all
swap related operations.
In fact, many swap operations only access one element of
swap_info_struct->swap_map array.  And there is no dependency between
different elements of swap_info_struct->swap_map.  So a fine grained
lock can be used to allow parallel access to the different elements of
swap_info_struct->swap_map.

In this patch, a spinlock is added to swap_cluster_info to protect the
elements of swap_info_struct->swap_map in the swap cluster and the
fields of swap_cluster_info.  This reduced locking contention for
swap_info_struct->swap_map access greatly.

Because of the added spinlock, the size of swap_cluster_info increases
from 4 bytes to 8 bytes on the 64 bit and 32 bit system.  This will
use additional 4k RAM for every 1G swap space.

Because the size of swap_cluster_info is much smaller than the size of
the cache line (8 vs 64 on x86_64 architecture), there may be false
cache line sharing between spinlocks in swap_cluster_info.  To avoid
the false sharing in the first round of the swap cluster allocation,
the order of the swap clusters in the free clusters list is changed.
So that, the swap_cluster_info sharing the same cache line will be
placed as far as possible.  After the first round of allocation, the
order of the clusters in free clusters list is expected to be random.
So the false sharing should be not serious.

Compared with a previous implementation using bit_spin_lock, the
sequential swap out throughput improved about 3.2%.  Test was done on
a Xeon E5 v3 system. The swap device used is a RAM simulated PMEM
(persistent memory) device.  To test the sequential swapping out, the
test case created 32 processes, which sequentially allocate and write
to the anonymous pages until the RAM and part of the swap device is
used.

Signed-off-by: "Huang, Ying" 
---
 include/linux/swap.h |   6 ++
 mm/swapfile.c| 211 +--
 2 files changed, 177 insertions(+), 40 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 09f4be179ff3..48e0ed4dc3c8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -175,6 +175,12 @@ enum {
  * protected by swap_info_struct.lock.
  */
 struct swap_cluster_info {
+   spinlock_t lock;/*
+* Protect swap_cluster_info fields
+* and swap_info_struct->swap_map
+* elements correspond to the swap
+* cluster
+*/
unsigned int data:24;
unsigned int flags:8;
 };
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 19a7c1ddd6c2..f5515c1552fe 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -257,6 +257,52 @@ static inline void cluster_set_null(struct 
swap_cluster_info *info)
info->data = 0;
 }
 
+static inline void __lock_cluster(struct swap_cluster_info *ci)
+{
+   spin_lock(>lock);
+}
+
+static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct 
*si,
+unsigned long offset)
+{
+   struct swap_cluster_info *ci;
+
+   ci = si->cluster_info;
+   if (ci) {
+   ci += offset / SWAPFILE_CLUSTER;
+   __lock_cluster(ci);
+   }
+   return ci;
+}
+
+static inline void unlock_cluster(struct swap_cluster_info *ci)
+{
+   if (ci)
+   spin_unlock(>lock);
+}
+
+static inline struct swap_cluster_info *lock_cluster_or_swap_info(
+   struct swap_info_struct *si,
+   unsigned long offset)
+{
+   struct swap_cluster_info *ci;
+
+   ci = lock_cluster(si, offset);
+   if (!ci)
+   spin_lock(>lock);
+
+   return ci;
+}
+
+static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
+  struct swap_cluster_info *ci)
+{
+   if (ci)
+   unlock_cluster(ci);
+   else
+   spin_unlock(>lock);
+}
+
 static inline bool cluster_list_empty(struct swap_cluster_list *list)
 {
return cluster_is_null(>head);
@@ -281,9 +327,17 @@ static void cluster_list_add_tail(struct swap_cluster_list 
*list,
cluster_set_next_flag(>head, idx, 0);
cluster_set_next_flag(>tail, idx, 0);
} else {
+   struct swap_cluster_info *ci_tail;
unsigned int tail = cluster_next(>tail);
 
-   cluster_set_next([tail], idx);
+   /*
+* Nested cluster lock, but both cluster locks are
+  

Re: [PATCH v7 00/15] ACPI platform MSI support and its example mbigen

2017-01-13 Thread Hanjun Guo
Hi Wei,

On 2017/1/13 22:11, Wei Xu wrote:
> Hi Hanjun,
>
> On 2017/1/11 15:06, Hanjun Guo wrote:
>> With platform msi support landed in the kernel, and the introduction
>> of IORT for GICv3 ITS (PCI MSI) and SMMU, the framework for platform msi
>> is ready; this patch set adds a few patches to enable the ACPI platform
>> msi support.
>>
>> For platform devices connecting to ITS on arm platforms, we have the IORT
>> table with the named component node to describe the mappings of platform
>> device and ITS, so we can retrieve the dev id and find its parent
>> irqdomain (ITS) from the IORT table (similar to the ACPI ITS support).
>>
>> v6 -> v7: 
>>  - Introduce iort_node_map_platform_id() to retrieve the
>>dev id for both NC (named component) -> ITS/SMMU and
>>NC -> SMMU -> ITS cases, suggested by Lorenzo;
>>
>>  - Reorder the patches and rewrite some commit message;
>>
>>  - Remove the test tags because it has major changes
>>to retrieve the dev id, Sinan, Majun, Xinwei, could
>>you please test them again on your platform?
>>
>>  - rebased on top of 4.10-rc3 and Lorenzo's patch
>>   https://patchwork.kernel.org/patch/9507041/
>>
>>  - Tested against Agustin's patch [1-2/3] "[PATCH V9 0/3] irqchip: qcom:
>>   Add IRQ combiner driver"
> I tested this patch set on the Hisilicon D05 board with these patches:
>
>   [1] Agustin's V9 IRQ combiner driver patch set
>   https://patchwork.kernel.org/patch/9474751/
>
>   [2] Lorenzo's v2 iort_node_get_id fix patch
>   https://patchwork.kernel.org/patch/9507041/
>
> The branch is at 
> https://github.com/hisilicon/linux-hisi/tree/topic-acpi-mbigen.
>
> The integrated XGE, SAS and PCIe controller works fine.
> So with this patch set:
>
> Tested-by: Wei Xu 

I didn't test PCIe, so it's good to know it's working. Thanks for testing
and letting me know.

Hanjun



Re: [RFC/PATCH] ftrace: Allow to change size of function graph filters

2017-01-13 Thread Namhyung Kim
Hi Steve,

On Fri, Jan 13, 2017 at 11:16:18AM -0500, Steven Rostedt wrote:
> On Fri, 13 Jan 2017 13:22:43 +0900
> Namhyung Kim  wrote:
> 
> > It's currently fixed to 32, and matches beyond that are ignored when the
> > user gives a pattern which matches more functions than the size.  So
> > filters like all system calls or many functions with a common prefix
> > cannot be set in full.  Not sure this is right though.
> 
> Yes it's small, and there's a reason for it. So I'm giving a
> conditional nack to the patch.
> > 
> > This patch adds 'graph_filter_size' file in the tracefs to adjust the
> > size.  It can be changed only if the current tracer is not set.
> > 
> > Signed-off-by: Namhyung Kim 
> 
> The condition is, we need to fix ftrace_graph_addr() first.
> 
>   for (i = 0; i < ftrace_graph_count; i++) {
> 
> That gets called at every function being traced. See where I'm heading
> with that? ;-)
> 
> We need to create a hash table or binary search first and make that
> function handle a large ftrace_graph_count before implementing your
> patch. Remove the linear search, replace it with either a binary search
> or a hash. But an O(n) algorithm at every function call is out of the
> question.

Fair enough.  I'll try to add a hash table then.

But I'm not sure how to synchronize the hash table manipulations.  It
seems synchronize_sched() is not good enough for the function graph
tracer, right?  So I limited changing the filter size to only when no
tracer is set, but is it OK to have that limitation when adding or
removing an entry to/from the table?  If not, what can I do?

Thanks,
Namhyung
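
As an illustration of the direction Steve suggests (sort the table once when
the filter is written, then do an O(log n) lookup on the hot path), here is a
self-contained user-space sketch; it is not the ftrace code and the names are
made up:

#include <stdlib.h>
#include <stdbool.h>

static unsigned long graph_addrs[32768];	/* kept sorted at filter-update time */
static int graph_count;

static int cmp_addr(const void *a, const void *b)
{
	unsigned long x = *(const unsigned long *)a;
	unsigned long y = *(const unsigned long *)b;

	return (x > y) - (x < y);
}

/* Called once per filter update: sorting here is cheap. */
static void graph_filter_updated(void)
{
	qsort(graph_addrs, graph_count, sizeof(graph_addrs[0]), cmp_addr);
}

/* Called on every traced function: binary search instead of a linear scan. */
static bool graph_addr_match(unsigned long addr)
{
	return bsearch(&addr, graph_addrs, graph_count,
		       sizeof(graph_addrs[0]), cmp_addr) != NULL;
}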


Re: [PATCH v7 09/15] ACPI: platform-msi: retrieve dev id from IORT

2017-01-13 Thread Hanjun Guo
Hi Lorenzo,

On 2017/1/13 20:11, Lorenzo Pieralisi wrote:
> On Wed, Jan 11, 2017 at 11:06:33PM +0800, Hanjun Guo wrote:
>> For devices connecting to ITS, it needs dev id to identify itself, and
>> this dev id is represented in the IORT table in named component node
>> [1] for platform devices, so in this patch we will scan the IORT to
>> retrieve device's dev id.
>>
>> For named components we know that there are always two steps
>> involved (second optional):
>>
>> (1) Retrieve the initial id (this may well provide the final mapping)
>> (2) Map the id (optional if (1) represents the map type we need), this
>> is needed for use cases such as NC (named component) -> SMMU -> ITS
>> mappings.
>>
>> we have API iort_node_get_id() for step (1) above and
>> iort_node_map_rid() for step (2), so create a wrapper
>> iort_node_map_platform_id() to retrieve the dev id.
>>
>> [1]: https://static.docs.arm.com/den0049/b/DEN0049B_IO_Remapping_Table.pdf
> This patch should be split and IORT changes should be squashed with
> patch 10.

If I split the IORT changes from the platform MSI changes, the API introduced
in IORT will not be used within the same patch, which seems to violate the
suggestion that a newly introduced API needs to be used in the same patch.
Did I miss something?

>
>> Suggested-by: Lorenzo Pieralisi 
>> Suggested-by: Tomasz Nowicki 
>> Signed-off-by: Hanjun Guo 
>> Cc: Marc Zyngier 
>> Cc: Lorenzo Pieralisi 
>> Cc: Sinan Kaya 
>> Cc: Tomasz Nowicki 
>> Cc: Thomas Gleixner 
>> ---
>>  drivers/acpi/arm64/iort.c | 56 
>> +++
>>  drivers/irqchip/irq-gic-v3-its-platform-msi.c |  4 +-
>>  include/linux/acpi_iort.h |  8 
>>  3 files changed, 67 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
>> index 069a690..95fd20b 100644
>> --- a/drivers/acpi/arm64/iort.c
>> +++ b/drivers/acpi/arm64/iort.c
>> @@ -30,6 +30,7 @@
>>  #define IORT_MSI_TYPE   (1 << ACPI_IORT_NODE_ITS_GROUP)
>>  #define IORT_IOMMU_TYPE ((1 << ACPI_IORT_NODE_SMMU) |   \
>>  (1 << ACPI_IORT_NODE_SMMU_V3))
>> +#define IORT_TYPE_ANY   (IORT_MSI_TYPE | IORT_IOMMU_TYPE)
>>  
>>  struct iort_its_msi_chip {
>>  struct list_headlist;
>> @@ -406,6 +407,34 @@ static struct acpi_iort_node *iort_node_map_id(struct 
>> acpi_iort_node *node,
>>  return NULL;
>>  }
>>  
>> +static
>> +struct acpi_iort_node *iort_node_map_platform_id(struct acpi_iort_node 
>> *node,
>> + u32 *id_out, u8 type_mask,
>> + int index)
>> +{
>> +struct acpi_iort_node *parent;
>> +u32 id;
>> +
>> +/* step 1: retrieve the initial dev id */
>> +parent = iort_node_get_id(node, &id, IORT_TYPE_ANY, index);
>> +if (!parent)
>> +return NULL;
>> +
>> +/*
>> + * optional step 2: map the initial dev id if its parent is not
>> + * the target type we wanted, map it again for the use cases such
>> + * as NC (named component) -> SMMU -> ITS. If the type is matched,
>> + * return the parent pointer directly.
>> + */
>> +if (!(IORT_TYPE_MASK(parent->type) & type_mask))
>> +parent = iort_node_map_id(parent, id, id_out, type_mask);
>> +else
>> +if (id_out)
> Remove this pointer check.

This was added to avoid a NULL pointer dereference: I passed NULL for id_out
because I only want to get its parent node. I think we have four options:

 - Introduce a new API to get the parent from scratch, but it will duplicate
   a lot of code;

 - Don't check id_out in iort_node_map_platform_id(), and introduce a wrapper
   that passes a dummy id to iort_node_map_platform_id():
static
struct acpi_iort_node *iort_node_get_platform_parent(struct device *dev,
						      u8 type_mask)
{
struct acpi_iort_node *node, *parent = NULL;
int i;
u32 dummy_id;

node = iort_scan_node(ACPI_IORT_NODE_NAMED_COMPONENT,
  iort_match_node_callback, dev);

if (!node)
return NULL;

for (i = 0; i < node->mapping_count; i++) {
/* we just want to get the parent node */
parent = iort_node_map_platform_id(node, &dummy_id,
   IORT_MSI_TYPE, i);
if (parent)
break;
}

return parent;
}

 - Similar solution as above but don't introduce wrapper, just use dummy_id if
   iort_node_map_platform_id() is called;

- Use the solution I proposed in this patch.

Please share your suggestions on this :)

>
>> +*id_out = id;
>> +
>> +return parent;
>> +}
>> +
>>  static struct acpi_iort_node *iort_find_dev_node(struct device *dev)
>>  {
>>  struct pci_bus *pbus;
>> @@ -444,6 +473,33 @@ u32 iort_msi_map_rid(struct device *dev, u32 req_id)
>>  }
>>  
>>  /**
>> + * 

Re: [PATCH] can: Fix kernel panic at security_sock_rcv_skb

2017-01-13 Thread Liu Shuo

On Thu 12.Jan'17 at 17:33:38 +0100, Oliver Hartkopp wrote:

On 01/12/2017 02:01 PM, Eric Dumazet wrote:

On Thu, 2017-01-12 at 09:22 +0100, Oliver Hartkopp wrote:



But my main concern is:

The reason why can_rx_delete_receiver() was introduced was the need to
remove a huge number of receivers with can_rx_unregister().

When you call synchronize_rcu() after each receiver removal this would
potentially lead to a big performance issue when e.g. closing CAN_RAW
sockets with a high number of receivers.

So the idea was to remove/unlink the receiver hlist_del_rcu(&r->list)
and also kmem_cache_free(rcv_cache, r) by some rcu mechanism - so that
all elements are cleaned up by rcu at a later point.

Is it possible that the problems emerge due to hlist_del_rcu(&r->list)
and you accidentally fix it with your introduced synchronize_rcu()?


I agree this patch does not fix the root cause.

The main problem seems that the sockets themselves are not RCU
protected.

If CAN uses RCU for delivery, then sockets should be freed only after
one RCU grace period.

On recent kernels, following patch could help :



Thanks Eric!

@Liu ShuoX: Can you check if Eric's suggestion fixes the issue in your 
setup?

Sorry for the late reply, I was OOO yesterday.
With Eric's hint, I just found his patch "net: add SOCK_RCU_FREE
socket flag" in the latest kernel. Backporting that one plus Eric's
following patch fixes my failure.

Thanks Eric and Oliver!

Shuo


Best regards,
Oliver
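
For context, a minimal sketch of what opting into that flag looks like for a
protocol; this is illustrative only and is not the patch that ultimately
fixed the CAN issue:

#include <net/sock.h>

/* SOCK_RCU_FREE defers __sk_destruct() by one RCU grace period, so
 * receive-path readers running under rcu_read_lock() cannot observe a
 * freed socket.  A protocol opts in at socket init time, e.g.:
 */
static void example_proto_init(struct sock *sk)
{
	sock_set_flag(sk, SOCK_RCU_FREE);
}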


Re: 4.10-rc3, firmware loading via user space helper crashes if firmware not present

2017-01-13 Thread Jakub Kicinski
On Fri, 13 Jan 2017 13:32:58 -0800, Jakub Kicinski wrote:
> If one requests a FW which does not exist in the FS and the user space
> helper is used then fw_load_abort() will be called twice which leads to
> NULL-deref.
> 
> It will be called once in firmware_loading_store() (handling the -1
> case) and then again in _request_firmware_load() because return value
> from fw_state_wait_timeout() was negative.
> 
> I think this is introduced in by f52cc379423d ("firmware: refactor
> loading status").
> 
> The simple fix would be to not "unlink" the buf by fw_load_abort() in
> firmware_loading_store() and always rely on firmware_loading_store().
> 
> --->8  
> 
> diff --git a/drivers/base/firmware_class.c
> b/drivers/base/firmware_class.c index 4497d263209f..89eb9de81145 100644
> --- a/drivers/base/firmware_class.c
> +++ b/drivers/base/firmware_class.c
> @@ -766,7 +770,7 @@ static ssize_t firmware_loading_store(struct device
> *dev, dev_err(dev, "%s: unexpected value (%d)\n", __func__, loading);
>   /* fallthrough */
>   case -1:
> - fw_load_abort(fw_priv);
> + fw_state_aborted(&fw_buf->fw_st);
>   break;
>   }
>  out:
> 
> ---8<
> 
> Or should we fix up the ret code handling in __fw_state_wait_common()?

I got this backwards, I blamed the wrong commit, it's the 5d47ec02c37e
("firmware: Correct handling of fw_state_wait() return value") in fact.
It made the assumption that all errors returned from
fw_state_wait_timeout() require calling the abort, while in fact abort
should only be called if user mode helper didn't do anything (either
it timed out or someone hit ^C).

I'm leaning towards this:

--->8

diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c
index 4497d263209f..ce142e6b2c72 100644
--- a/drivers/base/firmware_class.c
+++ b/drivers/base/firmware_class.c
@@ -1020,7 +1020,7 @@ static int _request_firmware_load(struct firmware_priv 
*fw_priv,
}
 
retval = fw_state_wait_timeout(&buf->fw_st, timeout);
-   if (retval < 0) {
+   if (retval == -ETIMEDOUT || retval == -ERESTARTSYS) {
mutex_lock(&fw_lock);
fw_load_abort(fw_priv);
mutex_unlock(&fw_lock);

---8<

Unless advised otherwise I will submit this officially on Monday :)


Re: [PATCH v7 08/15] ACPI: IORT: rename iort_node_map_rid() to make it generic

2017-01-13 Thread Hanjun Guo
On 2017/1/13 19:47, Lorenzo Pieralisi wrote:
> On Wed, Jan 11, 2017 at 11:06:32PM +0800, Hanjun Guo wrote:
>> iort_node_map_rid() was designed for both PCI and platform
>> device, but the rid means requester id is for ITS mappings,
> I do not understand what this means sorry.
>
>> rename iort_node_map_rid() to iort_node_map_id() and update
>> its argument names to make it more generic.
>>
> "iort_node_map_rid() was designed to take an input id (that is not
> necessarily a PCI requester id) and map it to an output id (eg an SMMU
> streamid or an ITS deviceid) according to the mappings provided by an
> IORT node mapping entries. This means that the iort_node_map_rid() input
> id is not always a PCI requester id as its name, parameters and local
> variables suggest, which is misleading.
>
> Apply the s/rid/id substitution to the iort_node_map_rid() mapping
> function and its users to make sure its intended usage is clearer."

Thanks for your patience, I will update the commit message.

Hanjun



Re: [PATCH v7 12/15] msi: platform: make platform_msi_create_device_domain() ACPI aware

2017-01-13 Thread Hanjun Guo
On 2017/1/13 18:45, Lorenzo Pieralisi wrote:
> On Wed, Jan 11, 2017 at 11:06:36PM +0800, Hanjun Guo wrote:
>> platform_msi_create_device_domain() is used to ctreate
>> irqdomain for the device such as irqchip mbigen generating
>> the MSIs, it's almost ready for ACPI use except
>> of_node_to_fwnode() is for dt only, make it ACPI aware then
>> things will work in both DTS and ACPI.
> "The irqdomain creation carried out in:
>
> platform_msi_create_device_domain()
>
> relies on the fwnode_handle interrupt controller token to associate the
> interrupt controller with a specific irqdomain. Current code relies on
> the OF layer to retrieve a fwnode_handle for the device representing the
> interrupt controller from its device->of_node pointer.  This makes
> platform_msi_create_device_domain() DT specific whilst it really is not
> because after the merge of commit f94277af03ea ("of/platform: Initialise
> dev->fwnode appropriately") the fwnode_handle can easily be retrieved
> from the dev->fwnode pointer in a firmware agnostic way.
>
> Update platform_msi_create_device_domain() to retrieve the interrupt
> controller fwnode_handle from the dev->fwnode pointer so that it can
> be used seamlessly in ACPI and DT systems".

Much better, I will update the patch.

>
> Reviewed-by: Lorenzo Pieralisi 

Thanks
Hanjun
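
In other words, the functional change being described boils down to something
like the following inside platform_msi_create_device_domain() (an
illustration of the idea, not the exact hunk):

-	fwnode = of_node_to_fwnode(dev->of_node);	/* DT-only lookup */
+	fwnode = dev->fwnode;	/* works for both DT and ACPI since f94277af03ea */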



Re: [PATCH v7 15/15] irqchip: mbigen: Add ACPI support

2017-01-13 Thread Hanjun Guo
Hi Lorenzo,

On 2017/1/13 18:21, Lorenzo Pieralisi wrote:
> On Wed, Jan 11, 2017 at 11:06:39PM +0800, Hanjun Guo wrote:
>> With the preparation of platform msi support and interrupt producer
>> in DSDT, we can add mbigen ACPI support now.
>>
>> We are using the _PRS method to indicate the number of irq pins instead
>> of num_pins in DT to avoid _DSD usage in this case.
>>
>> For mbi-gen,
>> Device(MBI0) {
>>   Name(_HID, "HISI0152")
>>   Name(_UID, Zero)
>>   Name(_CRS, ResourceTemplate() {
>>   Memory32Fixed(ReadWrite, 0xa008, 0x1)
>>   })
>>
>>   Name (_PRS, ResourceTemplate() {
>>Interrupt(ResourceProducer,...) {12,14,}
> I still do not understand why you are using _PRS for this, I think
> the MBIgen configuration is static and if it is so the Interrupt
> resource should be part of the _CRS unless there is something I am
> missing here.

Sorry for not being clear in the commit message. MBIgen is an interrupt
producer which provides irq resources to the devices connecting to it, and
MBIgen itself doesn't consume wired interrupts.

Also, devices connecting to MBIgen may not consume all the interrupts
produced by MBIgen; for example, MBIgen may produce 128 interrupts but only
half of them are currently used, so _PRS here means "interrupt resources
that may be consumed by devices connecting to it".

Should I add this to the commit message?

Thanks
Hanjun



Re: [PATCH 5/6] treewide: use kv[mz]alloc* rather than opencoded variants

2017-01-13 Thread Tetsuo Handa
On 2017/01/13 2:29, Michal Hocko wrote:
> Ilya has noticed that I've screwed up some k[zc]alloc conversions and
> didn't use the kvzalloc. This is an updated patch with some acks
> collected on the way
> ---
> From a7b89c6d0a3c685045e37740c8f97b065f37e0a4 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Wed, 4 Jan 2017 13:30:32 +0100
> Subject: [PATCH] treewide: use kv[mz]alloc* rather than opencoded variants
> 
> There are many code paths opencoding kvmalloc. Let's use the helper
> instead. The main difference to kvmalloc is that those users are usually
> not considering all the aspects of the memory allocator. E.g. allocation
> requests < 64kB are basically never failing and invoke OOM killer to

Isn't this "requests <= 32kB" because allocation requests for 33kB will be
rounded up to 64kB?

Same for "smaller than 64kB" in PATCH 6/6. But strictly speaking, isn't
it bogus to refer actual size because PAGE_SIZE is not always 4096?

-- arch/ia64/include/asm/page.h --
/*
 * PAGE_SHIFT determines the actual kernel page size.
 */
#if defined(CONFIG_IA64_PAGE_SIZE_4KB)
# define PAGE_SHIFT 12
#elif defined(CONFIG_IA64_PAGE_SIZE_8KB)
# define PAGE_SHIFT 13
#elif defined(CONFIG_IA64_PAGE_SIZE_16KB)
# define PAGE_SHIFT 14
#elif defined(CONFIG_IA64_PAGE_SIZE_64KB)
# define PAGE_SHIFT 16
#else
# error Unsupported page size!
#endif

#define PAGE_SIZE   (__IA64_UL_CONST(1) << PAGE_SHIFT)


> satisfy the allocation. This sounds too disruptive for something that
> has a reasonable fallback - the vmalloc. On the other hand those
> requests might fallback to vmalloc even when the memory allocator would
> succeed after several more reclaim/compaction attempts previously. There
> is no guarantee something like that happens though.
> 
> This patch converts many of those places to kv[mz]alloc* helpers because
> they are more conservative.
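
For reference, with 4 kB pages the cutoff being discussed is
PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER = 4096 << 3 = 32768 bytes, i.e. 32 kB,
and kmalloc() rounds requests up to the next power-of-two bucket, so a 33 kB
request is already served as a 64 kB (order-4) allocation.  That is where the
"<= 32kB" versus "< 64kB" wording difference comes from; as noted above, the
byte values shift on architectures with a larger PAGE_SIZE.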



Re: [PATCH v7 7/7] perf/amd/iommu: Enable support for multiple IOMMUs

2017-01-13 Thread Suravee Suthikulpanit



On 1/13/17 18:49, Borislav Petkov wrote:

On Fri, Jan 13, 2017 at 05:24:01PM +0700, Suravee Suthikulpanit wrote:

IIUC, Perf tools looks at the /sys/devices/x to identify
available PMUs. Are you planning to have perf tools look at
/sys/devices/system/iommu/xxx instead?


No, I'm planning to understand what do you mean exactly. Because the
PMUs are, as I'm being pointed to, here:

#define EVENT_SOURCE_DEVICE_PATH "/bus/event_source/devices/"

Do you mean that, per chance?



Ah, okay. I can see the following directory path:

# ls -l  /sys/devices/
drwxr-xr-x  5 root root 0 Jan 13 20:36 amd_iommu_0

# ls -l /sys/bus/event_source/devices/
lrwxrwxrwx 1 root root 0 Jan 13 20:33 amd_iommu_0 -> 
../../../devices/amd_iommu_0

I'll update the commit log to mention /bus/event_source/devices/amd_iommu_X 
instead.

Thanks,
S


Re: [PATCH 8/9] serdev: add a tty port controller driver

2017-01-13 Thread Rob Herring
On Tue, Jan 10, 2017 at 4:04 PM, Pavel Machek  wrote:
> Hi!
>
>> +MODULE_AUTHOR("Rob Herring 
> Missing ">".

Actually, I should drop this because this driver cannot be a module.
The bus can be, though.

Rob


Re: [PATCH v2 1/2] of: base: add support to find the level of the last cache

2017-01-13 Thread Rob Herring
On Thu, Jan 12, 2017 at 12:29 PM, Sudeep Holla  wrote:
> It is useful to have helper function just to get the number of cache
> levels for a given logical cpu. We can obtain the same by just checking
> the level at which the last cache is present. This patch adds support
> to find the level of the last cache for a given cpu.
>
> It will be used on ARM64 platform where the device tree provides the
> information for the additional non-architected/transparent/external
> last level caches that are not integrated with the processors.
>
> Suggested-by: Rob Herring 
> Cc: Rob Herring 
> Cc: Mark Rutland 
> Signed-off-by: Sudeep Holla 
> ---
>  drivers/of/base.c  | 27 +++
>  include/linux/of.h |  1 +
>  2 files changed, 28 insertions(+)
>
> v1->v2:
> - Moved to using "cache-level" in the last level cache instead
>   of counting through all the nodes as suggested by Rob
>
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index d4bea3c797d6..c1128a077aea 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -25,6 +25,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -2268,6 +2269,32 @@ struct device_node *of_find_next_cache_node(const 
> struct device_node *np)
>  }
>
>  /**
> + * of_find_last_cache_level - Find the level at which the last cache is
> + * present for the given logical cpu
> + *
> + * @cpu: cpu number(logical index) for which the last cache level is needed
> + *
> + * Returns the level at which the last cache is present. It is exactly the
> + * same as the total number of cache levels for the given logical cpu.
> + */
> +int of_find_last_cache_level(unsigned int cpu)
> +{
> +   int cache_level = 0;
> +   struct device_node *prev = NULL, *np = of_cpu_device_node_get(cpu);
> +
> +   while (np) {
> +   prev = np;
> +   of_node_put(np);
> +   np = of_find_next_cache_node(np);
> +   }
> +
> +   if (prev)

Probably don't need this check. Otherwise,

Acked-by: Rob Herring 

> +   of_property_read_u32(prev, "cache-level", &cache_level);
> +
> +   return cache_level;
> +}
> +
> +/**
>   * of_graph_parse_endpoint() - parse common endpoint node properties
>   * @node: pointer to endpoint device_node
>   * @endpoint: pointer to the OF endpoint data structure
> diff --git a/include/linux/of.h b/include/linux/of.h
> index d72f01009297..21e6323de0f3 100644
> --- a/include/linux/of.h
> +++ b/include/linux/of.h
> @@ -280,6 +280,7 @@ extern struct device_node *of_get_child_by_name(const 
> struct device_node *node,
>
>  /* cache lookup */
>  extern struct device_node *of_find_next_cache_node(const struct device_node 
> *);
> +extern int of_find_last_cache_level(unsigned int cpu);
>  extern struct device_node *of_find_node_with_property(
> struct device_node *from, const char *prop_name);
>
> --
> 2.7.4
>


Re: [PATCH 2/6] mm: support __GFP_REPEAT in kvmalloc_node for >=64kB

2017-01-13 Thread Tetsuo Handa
On 2017/01/13 0:37, Michal Hocko wrote:
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 5dc34653274a..105cd04c7414 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -797,12 +797,9 @@ static int vhost_net_open(struct inode *inode, struct 
> file *f)
>   struct vhost_virtqueue **vqs;
>   int i;
>  
> - n = kmalloc(sizeof *n, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
> - if (!n) {
> - n = vmalloc(sizeof *n);
> - if (!n)
> - return -ENOMEM;
> - }
> + n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_REPEAT);

An opportunity to standardize as sizeof(*n) like other allocations.

> diff --git a/mm/util.c b/mm/util.c
> index 7e0c240b5760..9306244b9f41 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -333,7 +333,8 @@ EXPORT_SYMBOL(vm_mmap);
>   * Uses kmalloc to get the memory but if the allocation fails then falls back
>   * to the vmalloc allocator. Use kvfree for freeing the memory.
>   *
> - * Reclaim modifiers - __GFP_NORETRY, __GFP_REPEAT and __GFP_NOFAIL are not 
> supported
> + * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. 
> __GFP_REPEAT
> + * is supported only for large (>64kB) allocations

Isn't this ">32kB" (i.e. __GFP_REPEAT is supported for 64kB allocation) ?

>   */
>  void *kvmalloc_node(size_t size, gfp_t flags, int node)
>  {
> @@ -350,8 +351,18 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
>* Make sure that larger requests are not too disruptive - no OOM
>* killer and no allocation failure warnings as we have a fallback
>*/
> - if (size > PAGE_SIZE)
> - kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;
> + if (size > PAGE_SIZE) {
> + kmalloc_flags |= __GFP_NOWARN;
> +
> + /*
> +  * We have to override __GFP_REPEAT by __GFP_NORETRY for !costly
> +  * requests because there is no other way to tell the allocator
> +  * that we want to fail rather than retry endlessly.
> +  */
> + if (!(kmalloc_flags & __GFP_REPEAT) ||
> + (size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> + kmalloc_flags |= __GFP_NORETRY;
> + }
>  
>   ret = kmalloc_node(size, kmalloc_flags, node);
>  
> 



Re: [PATCH v3 2/4] dt-bindings: Add TI SCI PM Domains

2017-01-13 Thread Rob Herring
On Fri, Jan 13, 2017 at 2:28 PM, Dave Gerlach  wrote:
> On 01/13/2017 01:25 PM, Rob Herring wrote:
>>
>> On Thu, Jan 12, 2017 at 9:27 AM, Dave Gerlach  wrote:
>>>
>>> Rob,
>>>
>>> On 01/11/2017 03:34 PM, Rob Herring wrote:


 On Mon, Jan 9, 2017 at 11:57 AM, Dave Gerlach  wrote:
>
>
> Rob,
>
> On 01/09/2017 11:50 AM, Rob Herring wrote:
>>
>>
>>
>> On Wed, Jan 04, 2017 at 02:55:34PM -0600, Dave Gerlach wrote:
>>>
>>>
>>>
>>> Add a generic power domain implementation, TI SCI PM Domains, that
>>> will hook into the genpd framework and allow the TI SCI protocol to
>>> control device power states.
>>>
>>> Also, provide macros representing each device index as understood
>>> by TI SCI to be used in the device node power-domain references.
>>> These are identifiers for the K2G devices managed by the PMMC.
>>>
>>> Signed-off-by: Nishanth Menon 
>>> Signed-off-by: Dave Gerlach 
>>> ---
>>> v2->v3:
>>> Update k2g_pds node docs to show it should be a child of pmmc node.
>>> In early versions a phandle was used to point to pmmc and docs still
>>> incorrectly showed this.
>>>
>>>  .../devicetree/bindings/soc/ti/sci-pm-domain.txt   | 59
>>> ++
>>>  MAINTAINERS|  2 +
>>>  include/dt-bindings/genpd/k2g.h| 90
>>> ++
>>>  3 files changed, 151 insertions(+)
>>>  create mode 100644
>>> Documentation/devicetree/bindings/soc/ti/sci-pm-domain.txt
>>>  create mode 100644 include/dt-bindings/genpd/k2g.h
>>>
>>> diff --git
>>> a/Documentation/devicetree/bindings/soc/ti/sci-pm-domain.txt
>>> b/Documentation/devicetree/bindings/soc/ti/sci-pm-domain.txt
>>> new file mode 100644
>>> index ..4c9064e512cb
>>> --- /dev/null
>>> +++ b/Documentation/devicetree/bindings/soc/ti/sci-pm-domain.txt
>>> @@ -0,0 +1,59 @@
>>> +Texas Instruments TI-SCI Generic Power Domain
>>> +-
>>> +
>>> +Some TI SoCs contain a system controller (like the PMMC, etc...) that is
>>> +responsible for controlling the state of the IPs that are present.
>>> +Communication between the host processor running an OS and the system
>>> +controller happens through a protocol known as TI-SCI [1]. This pm domain
>>> +implementation plugs into the generic pm domain framework and makes use of
>>> +the TI SCI protocol to power on and off each device when needed.
>>> +
>>> +[1] Documentation/devicetree/bindings/arm/keystone/ti,sci.txt
>>> +
>>> +PM Domain Node
>>> +==
>>> +The PM domain node represents the global PM domain managed by the PMMC,
>>> +which in this case is the single implementation as documented by the generic
>>> +PM domain bindings in Documentation/devicetree/bindings/power/power_domain.txt.
>>> +Because this relies on the TI SCI protocol to communicate with the PMMC it
>>> +must be a child of the pmmc node.
>>> +
>>> +Required Properties:
>>> +
>>> +- compatible: should be "ti,sci-pm-domain"
>>> +- #power-domain-cells: Must be 0.
>>> +
>>> +Example (K2G):
>>> +-
>>> +   pmmc: pmmc {
>>> +   compatible = "ti,k2g-sci";
>>> +   ...
>>> +
>>> +   k2g_pds: k2g_pds {
>>> +   compatible = "ti,sci-pm-domain";
>>> +   #power-domain-cells = <0>;
>>> +   };
>>> +   };
>>> +
>>> +PM Domain Consumers
>>> +===
>>> +Hardware blocks that require SCI control over their state must provide
>>> +a reference to the sci-pm-domain they are part of and a unique device
>>> +specific ID that identifies the device.
>>> +
>>> +Required Properties:
>>> +
>>> +- power-domains: phandle pointing to the corresponding PM domain node.
>>> +- ti,sci-id: index representing the device id to be passed over SCI to
>>> +be used for device control.
>>
>>
>>
>>
>> As I've already stated before, this goes in power-domain cells. When you
>> have a single thing (i.e. node) that controls multiple things, then you
>> need to specify the ID for each of them in phandle args. This is how
>> irqs, gpio, clocks, *everything* in DT works.
>
>
>
>
> You think the reasoning for doing it this way provided by both Ulf and
> myself on v2 [1] is not valid then?
>
> From Ulf:
>
> To me, the TI SCI ID, is similar to a "conid" for any another "device
> 

Re: [f2fs-dev] [PATCH 6/6] f2fs: show # of on-going flush and discard bios

2017-01-13 Thread heyunlei

Hi Jaegeuk Kim,

On 2017/1/13 6:44, Jaegeuk Kim wrote:

This patch adds stat information for flush and discard commands.

Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/debug.c   | 7 +--
 fs/f2fs/f2fs.h| 3 ++-
 fs/f2fs/segment.c | 4 
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c
index f9f6b0aeba02..e67cab2fafac 100644
--- a/fs/f2fs/debug.c
+++ b/fs/f2fs/debug.c
@@ -54,6 +54,8 @@ static void update_general_status(struct f2fs_sb_info *sbi)
si->max_aw_cnt = atomic_read(&sbi->max_aw_cnt);
si->nr_wb_cp_data = get_pages(sbi, F2FS_WB_CP_DATA);
si->nr_wb_data = get_pages(sbi, F2FS_WB_DATA);
+   si->nr_flush = atomic_read(&SM_I(sbi)->fcc_info->submit_flush);
+   si->nr_discard = atomic_read(&SM_I(sbi)->dcc_info->submit_discard);
si->total_count = (int)sbi->user_block_count / sbi->blocks_per_seg;
si->rsvd_segs = reserved_segments(sbi);
si->overp_segs = overprovision_segments(sbi);
@@ -318,8 +320,9 @@ static int stat_show(struct seq_file *s, void *v)
seq_printf(s, "  - Inner Struct Count: tree: %d(%d), node: 
%d\n",
si->ext_tree, si->zombie_tree, si->ext_node);
seq_puts(s, "\nBalancing F2FS Async:\n");
-   seq_printf(s, "  - IO (CP: %4d, Data: %4d)\n",
-  si->nr_wb_cp_data, si->nr_wb_data);
+   seq_printf(s, "  - IO (CP: %4d, Data: %4d, Flush: %4d, Discard: 
%4d)\n",
+  si->nr_wb_cp_data, si->nr_wb_data,
+  si->nr_flush, si->nr_discard);
seq_printf(s, "  - inmem: %4d, atomic IO: %4d (Max. %4d)\n",
   si->inmem_pages, si->aw_cnt, si->max_aw_cnt);
seq_printf(s, "  - nodes: %4d in %4d\n",
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index ef5e3709c161..d51bc18e7292 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -201,6 +201,7 @@ struct discard_cmd_control {
wait_queue_head_t discard_wait_queue;   /* waiting queue for wake-up */
struct mutex cmd_lock;
int max_discards;   /* max. discards to be issued */
+   atomic_t submit_discard;/* # of issued discard */
 };


Here, we need to initialize submit_discard.

Thanks.


 /* for the list of fsync inodes, used only during recovery */
@@ -2254,7 +2255,7 @@ struct f2fs_stat_info {
unsigned int ndirty_dirs, ndirty_files, ndirty_all;
int nats, dirty_nats, sits, dirty_sits, free_nids, alloc_nids;
int total_count, utilization;
-   int bg_gc, nr_wb_cp_data, nr_wb_data;
+   int bg_gc, nr_wb_cp_data, nr_wb_data, nr_flush, nr_discard;
int inline_xattr, inline_inode, inline_dir, orphans;
int aw_cnt, max_aw_cnt;
unsigned int valid_count, valid_node_count, valid_inode_count, 
discard_blks;
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 74364006bfa6..5a2d380182d9 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -653,6 +653,8 @@ static void __remove_discard_cmd(struct f2fs_sb_info *sbi, 
struct discard_cmd *d
 {
int err = dc->bio->bi_error;

+   atomic_dec(&(SM_I(sbi)->dcc_info->submit_discard));
+
if (err == -EOPNOTSUPP)
err = 0;

@@ -678,6 +680,7 @@ void f2fs_wait_discard_bio(struct f2fs_sb_info *sbi, 
block_t blkaddr)
if (dc->state == D_PREP) {
dc->state = D_SUBMIT;
submit_bio(dc->bio);
+   atomic_inc(>submit_discard);
}
wait_for_completion_io(&dc->wait);

@@ -723,6 +726,7 @@ static int issue_discard_thread(void *data)
if (dc->state == D_PREP) {
dc->state = D_SUBMIT;
submit_bio(dc->bio);
+   atomic_inc(>submit_discard);
if (iter++ > DISCARD_ISSUE_RATE)
break;
} else if (dc->state == D_DONE) {





[PATCH] mwifiex: don't complain about 'unknown event id: 0x63'

2017-01-13 Thread Brian Norris
Marvell folks tell me this is a debugging event that the driver doesn't
need to handle, but on 8997 w/ firmware 16.68.1.p97, I see several of
these sorts of messages at (for instance) boot time:

[   13.825848] mwifiex_pcie 0000:01:00.0: event: unknown event id: 0x63
[   14.838561] mwifiex_pcie 0000:01:00.0: event: unknown event id: 0x63
[   14.850397] mwifiex_pcie 0000:01:00.0: event: unknown event id: 0x63
[   32.529923] mwifiex_pcie 0000:01:00.0: event: unknown event id: 0x63

Let's handle this "event" with a much lower verbosity.

Signed-off-by: Brian Norris 
---
 drivers/net/wireless/marvell/mwifiex/fw.h| 1 +
 drivers/net/wireless/marvell/mwifiex/sta_event.c | 4 
 2 files changed, 5 insertions(+)

diff --git a/drivers/net/wireless/marvell/mwifiex/fw.h 
b/drivers/net/wireless/marvell/mwifiex/fw.h
index 55db158fd156..cb6a1a81d44e 100644
--- a/drivers/net/wireless/marvell/mwifiex/fw.h
+++ b/drivers/net/wireless/marvell/mwifiex/fw.h
@@ -550,6 +550,7 @@ enum mwifiex_channel_flags {
 #define EVENT_TX_DATA_PAUSE 0x0055
 #define EVENT_EXT_SCAN_REPORT   0x0058
 #define EVENT_RXBA_SYNC 0x0059
+#define EVENT_UNKNOWN_DEBUG 0x0063
 #define EVENT_BG_SCAN_STOPPED   0x0065
 #define EVENT_REMAIN_ON_CHAN_EXPIRED0x005f
 #define EVENT_MULTI_CHAN_INFO   0x006a
diff --git a/drivers/net/wireless/marvell/mwifiex/sta_event.c 
b/drivers/net/wireless/marvell/mwifiex/sta_event.c
index 9df0c4dc06ed..96503d3d053f 100644
--- a/drivers/net/wireless/marvell/mwifiex/sta_event.c
+++ b/drivers/net/wireless/marvell/mwifiex/sta_event.c
@@ -1009,6 +1009,10 @@ int mwifiex_process_sta_event(struct mwifiex_private 
*priv)
adapter->event_skb->len -
sizeof(eventcause));
break;
+   /* Debugging event; not used, but let's not print an ERROR for it. */
+   case EVENT_UNKNOWN_DEBUG:
+   mwifiex_dbg(adapter, EVENT, "event: debug\n");
+   break;
default:
mwifiex_dbg(adapter, ERROR, "event: unknown event id: %#x\n",
eventcause);
-- 
2.11.0.483.g087da7b7c-goog



RE: [PATCH v2 00/26] IB: Optimize DMA mapping

2017-01-13 Thread Estrin, Alex


> -Original Message-
> From: Bart Van Assche [mailto:bart.vanass...@sandisk.com]
> Sent: Friday, January 13, 2017 5:00 PM
> To: Estrin, Alex ; dledf...@redhat.com
> Cc: linux-kernel@vger.kernel.org; linux-r...@vger.kernel.org;
> gre...@linuxfoundation.org
> Subject: Re: [PATCH v2 00/26] IB: Optimize DMA mapping
> 
> On Fri, 2017-01-13 at 21:08 +, Estrin, Alex wrote:
> > It didn't fix the failure... Apparently there is an issue with generic
> > mapping itself.
> 
> Hello Alex,
> 
> The generic mapping code in lib/dma-virt.c works fine with at least the rxe
> driver. Additionally, as far as I can see the mapping code in lib/dma-virt.c
> is identical to the mapping in drivers/infiniband/sw/rdmavt/dma.c. What is
> not clear to me however is how my patches can have caused the SDMA engine
> error messages? As far as I can see the code in
> drivers/infiniband/hw/hfi1/sdma.c uses PCI DMA mapping operations instead of
> the generic dma_virt_ops. Is that correct?

Hi Bart,
If you refer to code like this 
from sdma.c: 
/* dma map the coalesce buffer */
addr = dma_map_single(&dd->pcidev->dev,
. 
please see hfi1/verbs.c @ hfi1_register_ib_device()  

'dma_device' field is initialized here
->  ibdev->dma_device = &dd->pcidev->dev;
followed by 
ret = rvt_register_device(&dd->verbs_dev.rdi);
->  set_dma_ops(rdi->ibdev.dma_device, &dma_virt_ops);
 
Thanks,
Alex.
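
To make the point above concrete: once set_dma_ops(dev, &dma_virt_ops) has been called on the hfi1 PCI device, every streaming mapping made against that struct device is routed through the virtual ops, including the SDMA engine's own dma_map_single() calls, because the mapping helpers dispatch through the per-device ops. A simplified sketch of that dispatch (not the literal kernel inlines; attrs and error handling omitted):

/*
 * Simplified illustration of the per-device dispatch in the DMA mapping
 * API; not the literal kernel code.
 */
#include <linux/dma-mapping.h>

static inline dma_addr_t sketch_dma_map_single(struct device *dev, void *ptr,
					       size_t size,
					       enum dma_data_direction dir)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	/*
	 * If &dma_virt_ops was installed on this device (as the
	 * rvt_register_device() path above does for &dd->pcidev->dev),
	 * the returned "DMA address" is just a kernel virtual address,
	 * which the real SDMA hardware cannot use.
	 */
	return ops->map_page(dev, virt_to_page(ptr), offset_in_page(ptr),
			     size, dir, 0);
}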


Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn

2017-01-13 Thread Al Viro
On Fri, Jan 13, 2017 at 05:46:34PM -0800, Linus Torvalds wrote:
> On Fri, Jan 13, 2017 at 5:24 PM, Al Viro  wrote:
> >
> > Why would advance by 0 change ->iov_offset here?
> 
> That's not my worry. Advancing by zero obviously doesn't change the position.
> 
> But the _truncation_ of the rest requires iov_offset to be zero in
> order to actually truncate everything.
> 
> So I was worrying about something updating it, and then wanting to
> truncate things on error.
> 
> But you bring up the kinds of cases I worried about:
> 
> > On error it does use iov_iter_advance(), pretty much as a way to
> > trigger pipe_truncate().  There we directly reset ->iov_offset to 0
> > and ->idx to its original value.
> 
> Ok, this was the part I worried about. And this
> 
> > However, theoretically it is possible that ->read_iter() instance does
> > successful copy_to_iter() and then decides to return an error.  This
> > } else if (ret < 0) {
> > to.idx = idx;
> > to.iov_offset = 0;
> > iov_iter_advance(&to, 0); /* to free what was emitted */
> > in generic_file_splice_read() catches any such cases.
> 
> So I'm happy with that last patch then, and my worries are laid to rest.

OK.  Let's wait for Alan to confirm that the last variant works and
I'll send a pull request (with Cc: stable # v4.9).


Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn

2017-01-13 Thread Linus Torvalds
On Fri, Jan 13, 2017 at 5:24 PM, Al Viro  wrote:
>
> Why would advance by 0 change ->iov_offset here?

That's not my worry. Advancing by zero obviously doesn't change the position.

But the _truncation_ of the rest requires iov_offset to be zero in
order to actually truncate everything.

So I was worrying about something updating it, and then wanting to
truncate things on error.

But you bring up the kinds of cases I worried about:

> On error it does use iov_iter_advance(), pretty much as a way to
> trigger pipe_truncate().  There we directly reset ->iov_offset to 0
> and ->idx to its original value.

Ok, this was the part I worried about. And this

> However, theoretically it is possible that ->read_iter() instance does
> successful copy_to_iter() and then decides to return an error.  This
> } else if (ret < 0) {
> to.idx = idx;
> to.iov_offset = 0;
> > iov_iter_advance(&to, 0); /* to free what was emitted */
> in generic_file_splice_read() catches any such cases.

So I'm happy with that last patch then, and my worries are laid to rest.

Linus



Re: 4.9.0 regression in pipe-backed iov_iter with systemd-nspawn

2017-01-13 Thread Al Viro
On Sat, Jan 14, 2017 at 01:24:28AM +, Al Viro wrote:
> On Fri, Jan 13, 2017 at 04:59:37PM -0800, Linus Torvalds wrote:
> 
> > EXCEPT.
> > 
> > I don't think "i->iov_offset" is actually correct. If you truncated
> > the whole thing, you should have cleared iov_offset too, and that
> > never happened. So truncating everything will not empty the buffers
> > for us, we'll still get to that "if (off)" case and have nrbufs=1.
> > 
> > So I'd expect something like
> > 
> > if (!size)
> > i->iov_offset = 0;
> > 
> > in pipe_advance(), in order to really free all buffers for that case. No?
> 
> Why would advance by 0 change ->iov_offset here?
> 
> > Or is there some guarantee that iov_offset was already zero there? I
> > don't see it, but I'm batting zero on this code...
> 
> It was zero initially (see iov_iter_pipe()).  It was not affected by
> iov_iter_get_pages() and friends.  If there was copy_to_iter/zero_iter/
> copy_page_to_iter for any non-zero amount of data, then it's _not_ zero and
> should bloody well stay such.

PS: note that after copy_page_to_iter()/copy_to_iter()/zero_iter() we have
->idx pointing to the last used buffer and ->iov_offset pointing to the end
of data in it, even if it's certain to be full and the next piece written
will go into the next buffer.  The only situation where we have zero
->iov_offset is when no copying had been done at all (via that iov_iter,
that is).  In that case ->idx points to the empty buffer.

If you start with empty pipe and copy_to_pipe() until it fills, you'll have
(assuming e.g. 4K pages and ->curbuf == 3 in the beginning)
->idx == 3, ->iov_offset == 0 initially
->idx == 3, ->iov_offset == 4096 after 4096 bytes copied
->idx == 4, ->iov_offset == 1 after 4097 bytes
...
->idx == 2, ->iov_offset == 4096 by the end of it.

That's what this "correction" is about...
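
A small self-contained model of the bookkeeping described above (assuming 4K buffers, 16 buffers in the pipe, and a starting ->curbuf/->idx of 3, as in the example; this is an illustration, not kernel code):

#include <stdio.h>

#define BUF_SIZE	4096
#define NR_BUFS		16

/* State of a pipe-backed iov_iter after n bytes have been copied in. */
static void state_after(unsigned int start_idx, unsigned long n,
			unsigned int *idx, unsigned int *off)
{
	if (n == 0) {
		/* nothing copied yet: ->idx points at the empty buffer */
		*idx = start_idx;
		*off = 0;
		return;
	}
	/* last used buffer, and the end of data within it (1..4096) */
	*idx = (start_idx + (unsigned int)((n - 1) / BUF_SIZE)) % NR_BUFS;
	*off = (unsigned int)((n - 1) % BUF_SIZE) + 1;
}

int main(void)
{
	unsigned long sizes[] = { 0, 4096, 4097, 16UL * 4096 };
	unsigned int i, idx, off;

	for (i = 0; i < 4; i++) {
		state_after(3, sizes[i], &idx, &off);
		printf("after %lu bytes: ->idx == %u, ->iov_offset == %u\n",
		       sizes[i], idx, off);
	}
	return 0;
}

Running it reproduces the progression quoted above: (3, 0) with nothing copied, (3, 4096) after 4096 bytes, (4, 1) after 4097, and (2, 4096) once all 16 buffers are full.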


  1   2   3   4   5   6   7   8   9   10   >