Re: [PATCH] mm, swap: Make VMA based swap readahead configurable

2017-09-24 Thread Huang, Ying
Hi, Minchan,

Minchan Kim  writes:

> Hi Huang,
>
> On Thu, Sep 21, 2017 at 09:33:10AM +0800, Huang, Ying wrote:
>> From: Huang Ying 

[snip]

>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 9c4b80c2..e62c8e2e34ef 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -434,6 +434,26 @@ config THP_SWAP
>>  
>>For selection by architectures with reasonable THP sizes.
>>  
>> +config VMA_SWAP_READAHEAD
>> +bool "VMA based swap readahead"
>> +depends on SWAP
>> +default y
>> +help
>> +  VMA based swap readahead detects the page access pattern in a
>> +  VMA and adjusts the swap readahead window for pages in the
>> +  VMA accordingly.  It works better than the original physical
>> +  swap readahead for more complex workloads.
>> +
>> +  It can be controlled via the following sysfs interface,
>> +
>> +/sys/kernel/mm/swap/vma_ra_enabled
>> +/sys/kernel/mm/swap/vma_ra_max_order
>
> It might be better to discuss in other thread but if you mention new
> interface here again, I will discuss it here.
>
> We are creating new ABI in here so I want to ask question in here.
>
> Did you consider using /sys/block/xxx/queue/read_ahead_kb for the
> swap readahead knob? Reusing such a common/consistent knob would be better
> than adding a new separate knob.

The problem is that the configuration of VMA based swap readahead is
global rather than block device specific.  Because it works on virtual
addresses, swap blocks on different block devices may be read ahead
together, so it is hard to map the feature onto a block device specific
configuration.

>> +
>> +  If set to no, the original physical swap readahead will be
>> +  used.
>
> In here, could you point out kindly somewhere where describes two
> readahead algorithm in the system?
>
> I don't mean we should explain how it works. Rather, there are
> two parallel algorithms in swap readahead.
>
> Anonymous memory works based on the VMA while shm works based on the
> physical block. They work separately, in parallel. Their knobs are
> vma_ra_max_order and page-cluster, blah, blah.

Sure.  I will add some description about that somewhere.

>> +
>> +  If unsure, say Y to enable VMA based swap readahead.
>> +
>>  config  TRANSPARENT_HUGE_PAGECACHE
>>  def_bool y
>>  depends on TRANSPARENT_HUGEPAGE

[snip]


Re: [PATCH] mm, swap: Make VMA based swap readahead configurable

2017-09-24 Thread Minchan Kim
Hi Huang,

On Thu, Sep 21, 2017 at 09:33:10AM +0800, Huang, Ying wrote:
> From: Huang Ying 
> 
> This patch adds a new Kconfig option VMA_SWAP_READAHEAD and wraps VMA
> based swap readahead code inside #ifdef CONFIG_VMA_SWAP_READAHEAD/#endif.
> This is more friendly for tiny kernels.  And, as pointed out by Minchan
> Kim, it gives people who want to disable swap readahead an opportunity
> to notice the changes to the swap readahead algorithm and the
> corresponding knobs.
> 
> Cc: Johannes Weiner 
> Cc: Rik van Riel 
> Cc: Shaohua Li 
> Cc: Hugh Dickins 
> Cc: Fengguang Wu 
> Cc: Tim Chen 
> Cc: Dave Hansen 
> Suggested-by: Minchan Kim 
> Signed-off-by: "Huang, Ying" 
> ---
>  include/linux/mm_types.h |  2 ++
>  include/linux/swap.h | 64 
> +---
>  mm/Kconfig   | 20 +++
>  mm/swap_state.c  | 25 ---
>  4 files changed, 72 insertions(+), 39 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 46f4ecf5479a..51da54d8027f 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -336,7 +336,9 @@ struct vm_area_struct {
>   struct file * vm_file;  /* File we map to (can be NULL). */
>   void * vm_private_data; /* was vm_pte (shared mem) */
>  
> +#ifdef CONFIG_VMA_SWAP_READAHEAD
>   atomic_long_t swap_readahead_info;
> +#endif
>  #ifndef CONFIG_MMU
>   struct vm_region *vm_region;/* NOMMU mapping region */
>  #endif
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 8a807292037f..ebc783a23b80 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -278,6 +278,7 @@ struct swap_info_struct {
>  #endif
>  
>  struct vma_swap_readahead {
> +#ifdef CONFIG_VMA_SWAP_READAHEAD
>   unsigned short win;
>   unsigned short offset;
>   unsigned short nr_pte;
> @@ -286,6 +287,7 @@ struct vma_swap_readahead {
>  #else
>   pte_t ptes[SWAP_RA_PTE_CACHE_SIZE];
>  #endif
> +#endif
>  };
>  
>  /* linux/mm/workingset.c */
> @@ -387,7 +389,6 @@ int generic_swapfile_activate(struct swap_info_struct *, 
> struct file *,
>  #define SWAP_ADDRESS_SPACE_SHIFT 14
>  #define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
>  extern struct address_space *swapper_spaces[];
> -extern bool swap_vma_readahead;
>  #define swap_address_space(entry)\
>   (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
>   >> SWAP_ADDRESS_SPACE_SHIFT])
> @@ -412,23 +413,12 @@ extern struct page 
> *__read_swap_cache_async(swp_entry_t, gfp_t,
>  extern struct page *swapin_readahead(swp_entry_t, gfp_t,
>   struct vm_area_struct *vma, unsigned long addr);
>  
> -extern struct page *swap_readahead_detect(struct vm_fault *vmf,
> -   struct vma_swap_readahead *swap_ra);
> -extern struct page *do_swap_page_readahead(swp_entry_t fentry, gfp_t 
> gfp_mask,
> -struct vm_fault *vmf,
> -struct vma_swap_readahead *swap_ra);
> -
>  /* linux/mm/swapfile.c */
>  extern atomic_long_t nr_swap_pages;
>  extern long total_swap_pages;
>  extern atomic_t nr_rotate_swap;
>  extern bool has_usable_swap(void);
>  
> -static inline bool swap_use_vma_readahead(void)
> -{
> - return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
> -}
> -
>  /* Swap 50% full? Release swapcache more aggressively.. */
>  static inline bool vm_swap_full(void)
>  {
> @@ -518,24 +508,6 @@ static inline struct page *swapin_readahead(swp_entry_t 
> swp, gfp_t gfp_mask,
>   return NULL;
>  }
>  
> -static inline bool swap_use_vma_readahead(void)
> -{
> - return false;
> -}
> -
> -static inline struct page *swap_readahead_detect(
> - struct vm_fault *vmf, struct vma_swap_readahead *swap_ra)
> -{
> - return NULL;
> -}
> -
> -static inline struct page *do_swap_page_readahead(
> - swp_entry_t fentry, gfp_t gfp_mask,
> - struct vm_fault *vmf, struct vma_swap_readahead *swap_ra)
> -{
> - return NULL;
> -}
> -
>  static inline int swap_writepage(struct page *p, struct writeback_control 
> *wbc)
>  {
>   return 0;
> @@ -662,5 +634,37 @@ static inline bool mem_cgroup_swap_full(struct page 
> *page)
>  }
>  #endif
>  
> +#ifdef CONFIG_VMA_SWAP_READAHEAD
> +extern bool swap_vma_readahead;
> +
> +static inline bool swap_use_vma_readahead(void)
> +{
> + return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
> +}
> +extern struct page *swap_readahead_detect(struct vm_fault *vmf,
> +   struct vma_swap_readahead *swap_ra);
> +extern struct page *do_swap_page_readahead(swp_entry_t fentry, gfp_t 
> gfp_mask,
> +struct vm_fault *vmf,
> +struct vma_swap_readahead *swap_ra);
> +#else
> +static inline bool swap_use_vma_readahead(void)
> +{
> + 

Re: [PATCH] nvme: make controller 'state' sysfs attribute pollable

2017-09-24 Thread Sagi Grimberg



So why expose it in the first place? I know you don't want dm-mpath in
NVMe (neither do I) but we have to have something until your patchset and ANA
is merged. And with this patch it's trivial to build a path checker that just
looks at the state attribute in sysfs.


Can't we just not use path-checkers for nvme (we already have one in
nvme)?



Re: [PATCH] nvme: make controller 'state' sysfs attribute pollable

2017-09-24 Thread Sagi Grimberg



Notify sysfs about state changes of an nvme controller so user-space can watch the
file via poll() or select() in order to react to a state change.


Userspace has no business polling for the state.



Please consider this patch. At least upstream multipath-tools is using the 
sysfs state now:
[1] 
https://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=commitdiff;h=29c3b0446c4d919859f9e87b291563d483aab594
[2] 
https://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=commitdiff;h=d2561442cc0b444e8a728bac2c1466468816ee9d


I have to agree with Christoph that we need to be able to keep the
controller states internal as they are bound to change at any point.

We do need to move it into debugfs to avoid the confusion...



[PATCH v3 2/2] ARM: socfpga: dtsi: add dw-wdt reset lines

2017-09-24 Thread Oleksij Rempel
From: Steffen Trumtrar 

Signed-off-by: Steffen Trumtrar 
Signed-off-by: Oleksij Rempel 
Cc: Dinh Nguyen 
Cc: linux-arm-ker...@lists.infradead.org
---

no changes since version v1

 arch/arm/boot/dts/socfpga.dtsi | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm/boot/dts/socfpga.dtsi b/arch/arm/boot/dts/socfpga.dtsi
index 7e24dc8e82d4..6e49cee084b8 100644
--- a/arch/arm/boot/dts/socfpga.dtsi
+++ b/arch/arm/boot/dts/socfpga.dtsi
@@ -924,6 +924,7 @@
reg = <0xffd02000 0x1000>;
interrupts = <0 171 4>;
clocks = <>;
+   resets = <&rst L4WD0_RESET>;
status = "disabled";
};
 
@@ -932,6 +933,7 @@
reg = <0xffd03000 0x1000>;
interrupts = <0 172 4>;
clocks = <>;
+   resets = <&rst L4WD1_RESET>;
status = "disabled";
};
};
-- 
2.11.0



[PATCH v3 1/2] watchdog: dw_wdt: add stop watchdog operation

2017-09-24 Thread Oleksij Rempel
From: Steffen Trumtrar 

The only way of stopping the watchdog is by resetting it.
Add the watchdog op for stopping the device and reset if
a reset line is provided.

Signed-off-by: Steffen Trumtrar 
Signed-off-by: Oleksij Rempel 
Cc: Wim Van Sebroeck 
Cc: Guenter Roeck 
Cc: linux-watch...@vger.kernel.org
---

changes v3:
 - don't return error if rst is not present and set WDOG_HW_RUNNING bit
   to notify watchdog core.

changes v2:
 - test if dw_wdt->rst is NULL instead of IS_ERR

 drivers/watchdog/dw_wdt.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/drivers/watchdog/dw_wdt.c b/drivers/watchdog/dw_wdt.c
index 36be987ff9ef..6cc56b18ee52 100644
--- a/drivers/watchdog/dw_wdt.c
+++ b/drivers/watchdog/dw_wdt.c
@@ -135,6 +135,21 @@ static int dw_wdt_start(struct watchdog_device *wdd)
return 0;
 }
 
+static int dw_wdt_stop(struct watchdog_device *wdd)
+{
+   struct dw_wdt *dw_wdt = to_dw_wdt(wdd);
+
+   if (!dw_wdt->rst) {
+   set_bit(WDOG_HW_RUNNING, &wdd->status);
+   return 0;
+   }
+
+   reset_control_assert(dw_wdt->rst);
+   reset_control_deassert(dw_wdt->rst);
+
+   return 0;
+}
+
 static int dw_wdt_restart(struct watchdog_device *wdd,
  unsigned long action, void *data)
 {
@@ -173,6 +188,7 @@ static const struct watchdog_info dw_wdt_ident = {
 static const struct watchdog_ops dw_wdt_ops = {
.owner  = THIS_MODULE,
.start  = dw_wdt_start,
+   .stop   = dw_wdt_stop,
.ping   = dw_wdt_ping,
.set_timeout= dw_wdt_set_timeout,
.get_timeleft   = dw_wdt_get_timeleft,
-- 
2.11.0



Re: [PATCH] s390/sclp: Use setup_timer and mod_timer

2017-09-24 Thread Martin Schwidefsky
On Sun, 24 Sep 2017 17:30:14 +0530
Himanshu Jha  wrote:

> Use setup_timer and mod_timer API instead of structure assignments.
> 
> This is done using Coccinelle and semantic patch used
> for this as follows:
> 
> @@
> expression x,y,z,a,b;
> @@
> 
> -init_timer (&x);
> +setup_timer (&x, y, z);
> +mod_timer (&x, b);
> -x.function = y;
> -x.data = z;
> -x.expires = b;
> -add_timer(&x);
> 
> Signed-off-by: Himanshu Jha 
> ---
>  drivers/s390/char/sclp_con.c | 7 ++-
>  drivers/s390/char/sclp_tty.c | 7 ++-
>  2 files changed, 4 insertions(+), 10 deletions(-)

Added to s390/linux for the 4.15 merge window, thanks.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.









Re: [PATCH] mac80211: aead api to reduce redundancy

2017-09-24 Thread Johannes Berg
On Mon, 2017-09-25 at 12:56 +0800, Herbert Xu wrote:
> On Sun, Sep 24, 2017 at 07:42:46PM +0200, Johannes Berg wrote:
> > 
> > Unrelated to this, I'm not sure whose tree this should go through -
> > probably Herbert's (or DaveM's with his ACK? not sure if there's a
> > crypto tree?) or so?
> 
> Since you're just rearranging code invoking the crypto API, rather
> than touching actual crypto API code, I think you should handle it
> as you do with any other wireless patch.

The code moves to crypto/ though, and I'm not even sure I can vouch for
the Makefile choice there.

johannes



[PATCH v1 2/4] KVM/vmx: auto switch MSR_IA32_DEBUGCTLMSR

2017-09-24 Thread Wei Wang
Pass MSR_IA32_DEBUGCTLMSR through to the guest, and take advantage of
the hardware VT-x feature to auto switch the MSR upon VMExit and VMEntry.

Signed-off-by: Wei Wang 
---
 arch/x86/kvm/vmx.c | 13 -
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8434fc8..5f5c2f1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5502,13 +5502,12 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
if (cpu_has_vmx_vmfunc())
vmcs_write64(VM_FUNCTION_CONTROL, 0);
 
-   vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
-   vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
vmcs_write64(VM_EXIT_MSR_STORE_ADDR, __pa(vmx->msr_autoload.guest));
-   vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
 
+   add_atomic_switch_msr(vmx, MSR_IA32_DEBUGCTLMSR, 0, 0);
+
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);
 
@@ -6821,6 +6820,7 @@ static __init int hardware_setup(void)
vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_CS, false);
vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
+   vmx_disable_intercept_for_msr(MSR_IA32_DEBUGCTLMSR, false);
 
memcpy(vmx_msr_bitmap_legacy_x2apic_apicv,
vmx_msr_bitmap_legacy, PAGE_SIZE);
@@ -9285,7 +9285,7 @@ static void vmx_save_host_msrs(struct msr_autoload *m)
 static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
-   unsigned long debugctlmsr, cr3, cr4;
+   unsigned long cr3, cr4;
 
/* Don't enter VMX if guest state is invalid, let the exit handler
   start emulation until we arrive back to a valid state */
@@ -9333,7 +9333,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
__write_pkru(vcpu->arch.pkru);
 
atomic_switch_perf_msrs(vmx);
-   debugctlmsr = get_debugctlmsr();
 
vmx_arm_hv_timer(vcpu);
 
@@ -9445,10 +9444,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 #endif
  );
 
-   /* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */
-   if (debugctlmsr)
-   update_debugctlmsr(debugctlmsr);
-
 #ifndef CONFIG_X86_64
/*
 * The sysexit path does not restore ds/es, so we must set them to
-- 
2.7.4



[PATCH v1 1/4] KVM/vmx: re-write the msr auto switch feature

2017-09-24 Thread Wei Wang
This patch clarifies a vague statement in the SDM: the recommended maximum
number of MSRs that can be automatically switched by the CPU during VMExit and
VMEntry is 512, rather than 512 bytes of MSRs.

Depending on the CPU implementations, it may also support more than 512
MSRs to be auto switched. This can be calculated by
(MSR_IA32_VMX_MISC[27:25] + 1) * 512.

Signed-off-by: Wei Wang 
---
 arch/x86/kvm/vmx.c | 72 +++---
 1 file changed, 63 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0726ca7..8434fc8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -158,6 +158,7 @@ module_param_named(preemption_timer, 
enable_preemption_timer, bool, S_IRUGO);
 #define KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK 0
 #define KVM_VMX_DEFAULT_PLE_WINDOW_MAX\
INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW
+#define KVM_VMX_DEFAULT_MSR_AUTO_LOAD_COUNT 512
 
 static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
 module_param(ple_gap, int, S_IRUGO);
@@ -178,9 +179,10 @@ static int ple_window_actual_max = 
KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
 static int ple_window_max= KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
 module_param(ple_window_max, int, S_IRUGO);
 
+static int msr_autoload_count_max = KVM_VMX_DEFAULT_MSR_AUTO_LOAD_COUNT;
+
 extern const ulong vmx_return;
 
-#define NR_AUTOLOAD_MSRS 8
 #define VMCS02_POOL_SIZE 1
 
 struct vmcs {
@@ -588,8 +590,8 @@ struct vcpu_vmx {
bool  __launched; /* temporary, used in vmx_vcpu_run */
struct msr_autoload {
unsigned nr;
-   struct vmx_msr_entry guest[NR_AUTOLOAD_MSRS];
-   struct vmx_msr_entry host[NR_AUTOLOAD_MSRS];
+   struct vmx_msr_entry *guest;
+   struct vmx_msr_entry *host;
} msr_autoload;
struct {
int   loaded;
@@ -1942,6 +1944,7 @@ static void clear_atomic_switch_msr(struct vcpu_vmx *vmx, 
unsigned msr)
m->host[i] = m->host[m->nr];
vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
+   vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
 }
 
 static void add_atomic_switch_msr_special(struct vcpu_vmx *vmx,
@@ -1997,7 +2000,7 @@ static void add_atomic_switch_msr(struct vcpu_vmx *vmx, 
unsigned msr,
if (m->guest[i].index == msr)
break;
 
-   if (i == NR_AUTOLOAD_MSRS) {
+   if (i == msr_autoload_count_max) {
printk_once(KERN_WARNING "Not enough msr switch entries. "
"Can't add msr %x\n", msr);
return;
@@ -2005,6 +2008,7 @@ static void add_atomic_switch_msr(struct vcpu_vmx *vmx, 
unsigned msr,
++m->nr;
vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
+   vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
}
 
m->guest[i].index = msr;
@@ -5501,6 +5505,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
+   vmcs_write64(VM_EXIT_MSR_STORE_ADDR, __pa(vmx->msr_autoload.guest));
vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
 
@@ -6670,6 +6675,21 @@ static void update_ple_window_actual_max(void)
ple_window_grow, INT_MIN);
 }
 
+static void update_msr_autoload_count_max(void)
+{
+   u64 vmx_msr;
+   int n;
+
+   /*
+* According to the Intel SDM, if Bits 27:25 of MSR_IA32_VMX_MISC is
+* n, then (n + 1) * 512 is the recommended max number of MSRs to be
+* included in the VMExit and VMEntry MSR auto switch list.
+*/
+   rdmsrl(MSR_IA32_VMX_MISC, vmx_msr);
+   n = ((vmx_msr >> 25) & 0x7) + 1;
+   msr_autoload_count_max = n * KVM_VMX_DEFAULT_MSR_AUTO_LOAD_COUNT;
+}
+
 /*
  * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
  */
@@ -6837,6 +6857,7 @@ static __init int hardware_setup(void)
kvm_disable_tdp();
 
update_ple_window_actual_max();
+   update_msr_autoload_count_max();
 
/*
 * Only enable PML when hardware supports PML feature, and both EPT
@@ -9248,6 +9269,19 @@ static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc);
 }
 
+/*
+ * Currently, the CPU does not support the auto save of MSRs on VMEntry, so we
+ * save the MSRs for the host before entering into guest.
+ */
+static void vmx_save_host_msrs(struct msr_autoload *m)
+
+{
+   u32 i;
+
+   for (i = 0; i < m->nr; i++)
+   m->host[i].value = __rdmsr(m->host[i].index);
+}
+
 static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
   

[PATCH v1 2/4] KVM/vmx: auto switch MSR_IA32_DEBUGCTLMSR

2017-09-24 Thread Wei Wang
Passthrough the MSR_IA32_DEBUGCTLMSR to the guest, and take advantage of
the hardware VT-x feature to auto switch the msr upon VMExit and VMEntry.

Signed-off-by: Wei Wang 
---
 arch/x86/kvm/vmx.c | 13 -
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8434fc8..5f5c2f1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5502,13 +5502,12 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
if (cpu_has_vmx_vmfunc())
vmcs_write64(VM_FUNCTION_CONTROL, 0);
 
-   vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
-   vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
vmcs_write64(VM_EXIT_MSR_STORE_ADDR, __pa(vmx->msr_autoload.guest));
-   vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
 
+   add_atomic_switch_msr(vmx, MSR_IA32_DEBUGCTLMSR, 0, 0);
+
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);
 
@@ -6821,6 +6820,7 @@ static __init int hardware_setup(void)
vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_CS, false);
vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
+   vmx_disable_intercept_for_msr(MSR_IA32_DEBUGCTLMSR, false);
 
memcpy(vmx_msr_bitmap_legacy_x2apic_apicv,
vmx_msr_bitmap_legacy, PAGE_SIZE);
@@ -9285,7 +9285,7 @@ static void vmx_save_host_msrs(struct msr_autoload *m)
 static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
-   unsigned long debugctlmsr, cr3, cr4;
+   unsigned long cr3, cr4;
 
/* Don't enter VMX if guest state is invalid, let the exit handler
   start emulation until we arrive back to a valid state */
@@ -9333,7 +9333,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
__write_pkru(vcpu->arch.pkru);
 
atomic_switch_perf_msrs(vmx);
-   debugctlmsr = get_debugctlmsr();
 
vmx_arm_hv_timer(vcpu);
 
@@ -9445,10 +9444,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 #endif
  );
 
-   /* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */
-   if (debugctlmsr)
-   update_debugctlmsr(debugctlmsr);
-
 #ifndef CONFIG_X86_64
/*
 * The sysexit path does not restore ds/es, so we must set them to
-- 
2.7.4



[PATCH v1 1/4] KVM/vmx: re-write the msr auto switch feature

2017-09-24 Thread Wei Wang
This patch clarifies a vague statement in the SDM: the recommended maximum
number of MSRs that can be automatically switched by the CPU during VMExit
and VMEntry is 512, rather than 512 bytes of MSRs.

Depending on the CPU implementation, more than 512 MSRs may be supported
for auto switching. The limit can be calculated as
(MSR_IA32_VMX_MISC[27:25] + 1) * 512.
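For concreteness, the arithmetic above can be sketched as a small decode helper (illustrative only — the helper name is invented; the kernel reads the MSR with rdmsrl() and applies the same formula):

```c
/*
 * Illustrative decode of the IA32_VMX_MISC field described above:
 * bits 27:25 hold a value n, and the recommended maximum number of
 * MSR auto-switch entries is (n + 1) * 512.
 */
int msr_autoload_recommended_max(unsigned long long vmx_misc)
{
	int n = (int)((vmx_misc >> 25) & 0x7);	/* extract bits 27:25 */

	return (n + 1) * 512;
}
```

With bits 27:25 clear the recommended maximum stays at the baseline of 512 entries; the largest encodable value (7) yields 4096.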

Signed-off-by: Wei Wang 
---
 arch/x86/kvm/vmx.c | 72 +++---
 1 file changed, 63 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0726ca7..8434fc8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -158,6 +158,7 @@ module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
 #define KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK 0
 #define KVM_VMX_DEFAULT_PLE_WINDOW_MAX\
INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW
+#define KVM_VMX_DEFAULT_MSR_AUTO_LOAD_COUNT 512
 
 static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
 module_param(ple_gap, int, S_IRUGO);
@@ -178,9 +179,10 @@ static int ple_window_actual_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
 static int ple_window_max= KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
 module_param(ple_window_max, int, S_IRUGO);
 
+static int msr_autoload_count_max = KVM_VMX_DEFAULT_MSR_AUTO_LOAD_COUNT;
+
 extern const ulong vmx_return;
 
-#define NR_AUTOLOAD_MSRS 8
 #define VMCS02_POOL_SIZE 1
 
 struct vmcs {
@@ -588,8 +590,8 @@ struct vcpu_vmx {
bool  __launched; /* temporary, used in vmx_vcpu_run */
struct msr_autoload {
unsigned nr;
-   struct vmx_msr_entry guest[NR_AUTOLOAD_MSRS];
-   struct vmx_msr_entry host[NR_AUTOLOAD_MSRS];
+   struct vmx_msr_entry *guest;
+   struct vmx_msr_entry *host;
} msr_autoload;
struct {
int   loaded;
@@ -1942,6 +1944,7 @@ static void clear_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr)
m->host[i] = m->host[m->nr];
vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
+   vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
 }
 
 static void add_atomic_switch_msr_special(struct vcpu_vmx *vmx,
@@ -1997,7 +2000,7 @@ static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,
if (m->guest[i].index == msr)
break;
 
-   if (i == NR_AUTOLOAD_MSRS) {
+   if (i == msr_autoload_count_max) {
printk_once(KERN_WARNING "Not enough msr switch entries. "
"Can't add msr %x\n", msr);
return;
@@ -2005,6 +2008,7 @@ static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,
++m->nr;
vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
+   vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
}
 
m->guest[i].index = msr;
@@ -5501,6 +5505,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
+   vmcs_write64(VM_EXIT_MSR_STORE_ADDR, __pa(vmx->msr_autoload.guest));
vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
 
@@ -6670,6 +6675,21 @@ static void update_ple_window_actual_max(void)
ple_window_grow, INT_MIN);
 }
 
+static void update_msr_autoload_count_max(void)
+{
+   u64 vmx_msr;
+   int n;
+
+   /*
+* According to the Intel SDM, if bits 27:25 of MSR_IA32_VMX_MISC
+* are n, then (n + 1) * 512 is the recommended max number of MSRs
+* to be included in the VMExit and VMEntry MSR auto switch list.
+*/
+   rdmsrl(MSR_IA32_VMX_MISC, vmx_msr);
+   n = ((vmx_msr & 0xe000000) >> 25) + 1;
+   msr_autoload_count_max = n * KVM_VMX_DEFAULT_MSR_AUTO_LOAD_COUNT;
+}
+
 /*
  * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
  */
@@ -6837,6 +6857,7 @@ static __init int hardware_setup(void)
kvm_disable_tdp();
 
update_ple_window_actual_max();
+   update_msr_autoload_count_max();
 
/*
 * Only enable PML when hardware supports PML feature, and both EPT
@@ -9248,6 +9269,19 @@ static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc);
 }
 
+/*
+ * Currently, the CPU does not automatically save MSRs on VMEntry, so we
+ * save the host MSRs before entering the guest.
+ */
+static void vmx_save_host_msrs(struct msr_autoload *m)
+{
+   u32 i;
+
+   for (i = 0; i < m->nr; i++)
+   m->host[i].value = __rdmsr(m->host[i].index);
+}
+
 static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx 

[PATCH v1 0/4] Enable LBR for the guest

2017-09-24 Thread Wei Wang
This patch series enables the Last Branch Recording feature for the
guest. Instead of trapping each LBR stack MSR access, the MSRs are
passed through to the guest. Those MSRs are switched (i.e. loaded and
saved) on VMExit and VMEntry.

Test:
Try "perf record -b ./test_program" on guest.

Wei Wang (4):
  KVM/vmx: re-write the msr auto switch feature
  KVM/vmx: auto switch MSR_IA32_DEBUGCTLMSR
  perf/x86: add a function to get the lbr stack
  KVM/vmx: enable lbr for the guest

 arch/x86/events/intel/lbr.c   |  23 +++
 arch/x86/include/asm/perf_event.h |  14 
 arch/x86/kvm/vmx.c| 135 +-
 3 files changed, 154 insertions(+), 18 deletions(-)

-- 
2.7.4



Re: [PATCH] mac80211: aead api to reduce redundancy

2017-09-24 Thread Herbert Xu
On Sun, Sep 24, 2017 at 07:42:46PM +0200, Johannes Berg wrote:
>
> Unrelated to this, I'm not sure whose tree this should go through -
> probably Herbert's (or DaveM's with his ACK? not sure if there's a
> crypto tree?) or so?

Since you're just rearranging code invoking the crypto API, rather
than touching actual crypto API code, I think you should handle it
as you do with any other wireless patch.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH v1 3/4] perf/x86: add a function to get the lbr stack

2017-09-24 Thread Wei Wang
The LBR stack MSRs are architecture-specific. The perf subsystem has
already assigned the abstracted MSR values based on the CPU architecture.

This patch enables a caller outside the perf subsystem to get the LBR
stack info. This is useful for hypervisors to prepare the LBR feature
for the guest.

Signed-off-by: Wei Wang 
---
 arch/x86/events/intel/lbr.c   | 23 +++
 arch/x86/include/asm/perf_event.h | 14 ++
 2 files changed, 37 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 8a6bbac..ea547ec 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1230,3 +1230,26 @@ void intel_pmu_lbr_init_knl(void)
x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
x86_pmu.lbr_sel_map  = snb_lbr_sel_map;
 }
+
+/**
+ * perf_get_lbr_stack - get the lbr stack related MSRs
+ *
+ * @stack: the caller's memory to get the lbr stack
+ *
+ * Returns: 0 indicates that the lbr stack has been successfully obtained.
+ */
+int perf_get_lbr_stack(struct perf_lbr_stack *stack)
+{
+   stack->lbr_nr = x86_pmu.lbr_nr;
+   stack->lbr_tos = x86_pmu.lbr_tos;
+   stack->lbr_from = x86_pmu.lbr_from;
+   stack->lbr_to = x86_pmu.lbr_to;
+
+   if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
+   stack->lbr_info = MSR_LBR_INFO_0;
+   else
+   stack->lbr_info = 0;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(perf_get_lbr_stack);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index f353061..c098462 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -266,7 +266,16 @@ struct perf_guest_switch_msr {
u64 host, guest;
 };
 
+struct perf_lbr_stack {
+   int lbr_nr;
+   unsigned long   lbr_tos;
+   unsigned long   lbr_from;
+   unsigned long   lbr_to;
+   unsigned long   lbr_info;
+};
+
 extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+extern int perf_get_lbr_stack(struct perf_lbr_stack *stack);
 extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
 extern void perf_check_microcode(void);
 #else
@@ -276,6 +285,11 @@ static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
return NULL;
 }
 
+static inline int perf_get_lbr_stack(struct perf_lbr_stack *stack)
+{
+   return -1;
+}
+
 static inline void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 {
memset(cap, 0, sizeof(*cap));
-- 
2.7.4



[PATCH v1 4/4] KVM/vmx: enable lbr for the guest

2017-09-24 Thread Wei Wang
Pass the LBR stack MSRs through to the guest, and automatically switch
them upon VMEntry and VMExit.
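To give a rough sense of scale for what gets switched: the loop in auto_switch_lbr_msrs() below adds SELECT and TOS plus, per LBR entry, a FROM/TO pair and optionally an INFO MSR. A hedged sketch of that count — the struct mirrors the perf_lbr_stack layout from patch 3/4, and the helper name is invented:

```c
/* Mirror of the perf_lbr_stack shape from patch 3/4 (for the sketch). */
struct lbr_stack_sketch {
	int nr;		/* number of LBR entries */
	int has_info;	/* non-zero when MSR_LBR_INFO_* MSRs exist */
};

/*
 * Invented helper: how many MSRs end up on the VMX auto-switch list
 * for the LBR feature -- SELECT + TOS, then FROM/TO (and INFO when
 * present) for each of the nr entries.
 */
int lbr_switched_msr_count(const struct lbr_stack_sketch *s)
{
	int per_entry = s->has_info ? 3 : 2;

	return 2 + s->nr * per_entry;
}
```

A 32-entry LBR stack with INFO MSRs puts 2 + 32 * 3 = 98 MSRs on the switch list — well past the old 8-entry NR_AUTOLOAD_MSRS limit that patch 1/4 removes.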

Signed-off-by: Wei Wang 
---
 arch/x86/kvm/vmx.c | 50 ++
 1 file changed, 50 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5f5c2f1..35e02a7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -107,6 +107,9 @@ static u64 __read_mostly host_xss;
 static bool __read_mostly enable_pml = 1;
 module_param_named(pml, enable_pml, bool, S_IRUGO);
 
+static bool __read_mostly enable_lbrv = 1;
+module_param_named(lbrv, enable_lbrv, bool, 0444);
+
 #define KVM_VMX_TSC_MULTIPLIER_MAX 0xULL
 
 /* Guest_tsc -> host_tsc conversion requires 64-bit division.  */
@@ -5428,6 +5431,25 @@ static void ept_set_mmio_spte_mask(void)
   VMX_EPT_MISCONFIG_WX_VALUE);
 }
 
+static void auto_switch_lbr_msrs(struct vcpu_vmx *vmx)
+{
+   int i;
+   struct perf_lbr_stack lbr_stack;
+
+   perf_get_lbr_stack(&lbr_stack);
+
+   add_atomic_switch_msr(vmx, MSR_LBR_SELECT, 0, 0);
+   add_atomic_switch_msr(vmx, lbr_stack.lbr_tos, 0, 0);
+
+   for (i = 0; i < lbr_stack.lbr_nr; i++) {
+   add_atomic_switch_msr(vmx, lbr_stack.lbr_from + i, 0, 0);
+   add_atomic_switch_msr(vmx, lbr_stack.lbr_to + i, 0, 0);
+   if (lbr_stack.lbr_info)
+   add_atomic_switch_msr(vmx, lbr_stack.lbr_info + i, 0,
+ 0);
+   }
+}
+
 #define VMX_XSS_EXIT_BITMAP 0
 /*
  * Sets up the vmcs for emulated real mode.
@@ -5508,6 +5530,9 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 
add_atomic_switch_msr(vmx, MSR_IA32_DEBUGCTLMSR, 0, 0);
 
+   if (enable_lbrv)
+   auto_switch_lbr_msrs(vmx);
+
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);
 
@@ -6721,6 +6746,28 @@ void vmx_enable_tdp(void)
kvm_enable_tdp();
 }
 
+static void vmx_passthrough_lbr_msrs(void)
+{
+   int i;
+   struct perf_lbr_stack lbr_stack;
+
+   if (perf_get_lbr_stack(&lbr_stack) < 0) {
+   enable_lbrv = false;
+   return;
+   }
+
+   vmx_disable_intercept_for_msr(MSR_LBR_SELECT, false);
+   vmx_disable_intercept_for_msr(lbr_stack.lbr_tos, false);
+
+   for (i = 0; i < lbr_stack.lbr_nr; i++) {
+   vmx_disable_intercept_for_msr(lbr_stack.lbr_from + i, false);
+   vmx_disable_intercept_for_msr(lbr_stack.lbr_to + i, false);
+   if (lbr_stack.lbr_info)
+   vmx_disable_intercept_for_msr(lbr_stack.lbr_info + i,
+ false);
+   }
+}
+
 static __init int hardware_setup(void)
 {
int r = -ENOMEM, i, msr;
@@ -6822,6 +6869,9 @@ static __init int hardware_setup(void)
vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
vmx_disable_intercept_for_msr(MSR_IA32_DEBUGCTLMSR, false);
 
+   if (enable_lbrv)
+   vmx_passthrough_lbr_msrs();
+
memcpy(vmx_msr_bitmap_legacy_x2apic_apicv,
vmx_msr_bitmap_legacy, PAGE_SIZE);
memcpy(vmx_msr_bitmap_longmode_x2apic_apicv,
-- 
2.7.4



Re: [PATCH 1/4] rcu: Allow for page faults in NMI handlers

2017-09-24 Thread Paul E. McKenney
On Mon, Sep 25, 2017 at 06:41:30AM +0200, Steven Rostedt wrote:
> Sorry for the top post, currently on a train to Paris.
> 
> This series already went through all my testing, and I would hate to rebase 
> it for this reason. Can you just add a patch to remove the READ_ONCE()s?

If Linus accepts the original series, easy enough.

Thanx, Paul

> Thanks,
> 
> -- Steve
> 
> 
> On September 25, 2017 2:34:56 AM GMT+02:00, "Paul E. McKenney" 
>  wrote:
> >On Sun, Sep 24, 2017 at 05:26:53PM -0700, Paul E. McKenney wrote:
> >> On Sun, Sep 24, 2017 at 05:12:13PM -0700, Linus Torvalds wrote:
> >> > On Sun, Sep 24, 2017 at 5:03 PM, Paul E. McKenney
> >> >  wrote:
> >> > >
> >> > > Mostly just paranoia on my part.  I would be happy to remove it
> >if
> >> > > you prefer.  Or you or Steve can do so if that is more
> >convenient.
> >> > 
> >> > I really don't think it's warranted. The values are *stable*.
> >There's
> >> > no subtle lack of locking, or some optimistic access to a value
> >that
> >> > can change.
> >> > 
> >> > The compiler can generate code to read the value fifteen billion
> >> > times, and it will always get the same value.
> >> > 
> >> > Yes, maybe in between the different accesses, an NMI will happen,
> >and
> >> > the value will be incremented, but then as the NMI exits, it will
> >> > decrement again, so the code that got interrupted will not actually
> >> > see the change.
> >> > 
> >> > So the READ_ONCE() isn't "paranoia". It's just confusing.
> >> > 
> >> > > And yes, consistency would dictate that the uses in
> >rcu_nmi_enter()
> >> > > and rcu_nmi_exit() should be _ONCE(), particularly the stores to
> >> > > ->dynticks_nmi_nesting.
> >> > 
> >> > NO.
> >> > 
> >> > That would be just more of that confusion.
> >> > 
> >> > That value is STABLE. It's stable even within an NMI handler. The
> >NMI
> >> > code can read it, modify it, write it back, do a little dance, all
> >> > without having to care. There's no "_ONCE()" about it - not for the
> >> > readers, not for the writers, not for _anybody_.
> >> > 
> >> > So adding even more READ/WRITE_ONCE() accesses wouldn't be
> >> > "consistent", it would just be insanity.
> >> > 
> >> > Now, if an NMI happens and the value would be different on entry
> >than
> >> > it is on exit, that would be something else. Then it really
> >wouldn't
> >> > be stable wrt random users. But that would also be a major bug in
> >the
> >> > NMI handler, as far as I can tell.
> >> > 
> >> > So the reason I'm objecting to that READ_ONCE() is that it isn't
> >> > "paranoia", it's "voodoo programming". And we don't do voodoo
> >> > programming.
> >> 
> >> I already agreed that the READ_ONCE() can be removed.
> >
> >And for whatever it is worth, here is the updated patch.
> >
> > Thanx, Paul
> >
> >
> >
> >commit 3e2baa988b9c13095995c46c51e0e32c0b6a7d43
> >Author: Paul E. McKenney 
> >Date:   Fri Sep 22 13:14:42 2017 -0700
> >
> >rcu: Allow for page faults in NMI handlers
> >
> >A number of architectures invoke rcu_irq_enter() on exception entry in
> >order to allow RCU read-side critical sections in the exception handler
> >when the exception is from an idle or nohz_full CPU.  This works, at
> >least unless the exception happens in an NMI handler.  In that case,
> >rcu_nmi_enter() would already have exited the extended quiescent state,
> >which would mean that rcu_irq_enter() would (incorrectly) cause RCU
> >to think that it is again in an extended quiescent state.  This will
> >in turn result in lockdep splats in response to later RCU read-side
> >critical sections.
> >
> >This commit therefore causes rcu_irq_enter() and rcu_irq_exit() to
> >take no action if there is an rcu_nmi_enter() in effect, thus avoiding
> >the unscheduled return to RCU quiescent state.  This in turn should
> >make the kernel safe for on-demand RCU voyeurism.
> >
> >Reported-by: Steven Rostedt 
> >Signed-off-by: Paul E. McKenney 
> >[ paulmck: Remove READ_ONCE() per Linus Torvalds feedback. ]
> >
> >diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >index db5eb8c3f7af..e4fe06d42385 100644
> >--- a/kernel/rcu/tree.c
> >+++ b/kernel/rcu/tree.c
> >@@ -891,6 +891,11 @@ void rcu_irq_exit(void)
> > 
> > RCU_LOCKDEP_WARN(!irqs_disabled(), "rcu_irq_exit() invoked with irqs
> >enabled!!!");
> > rdtp = this_cpu_ptr(&rcu_dynticks);
> >+
> >+/* Page faults can happen in NMI handlers, so check... */
> >+if (rdtp->dynticks_nmi_nesting)
> >+return;
> >+
> > WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
> >  rdtp->dynticks_nesting < 1);
> > if (rdtp->dynticks_nesting <= 1) {
> >@@ -1036,6 +1041,11 @@ void rcu_irq_enter(void)
> > 
> > RCU_LOCKDEP_WARN(!irqs_disabled(), "rcu_irq_enter() invoked with irqs
> >enabled!!!");

[PATCH][trivial] Kconfig: Fix typos in Kconfig

2017-09-24 Thread Masanari Iida
This patch fixes some spelling typos found in Kconfig files.

Signed-off-by: Masanari Iida 
---
 arch/alpha/Kconfig| 2 +-
 arch/arc/Kconfig  | 6 +++---
 arch/arm/mach-bcm/Kconfig | 4 ++--
 arch/arm/plat-samsung/Kconfig | 2 +-
 arch/arm64/Kconfig| 2 +-
 arch/powerpc/platforms/Kconfig| 2 +-
 arch/unicore32/Kconfig| 2 +-
 arch/xtensa/Kconfig   | 2 +-
 drivers/net/ethernet/aquantia/Kconfig | 2 +-
 drivers/nfc/st-nci/Kconfig| 4 ++--
 drivers/nvdimm/Kconfig| 2 +-
 drivers/platform/x86/Kconfig  | 2 +-
 drivers/power/supply/Kconfig  | 2 +-
 drivers/scsi/Kconfig  | 2 +-
 fs/notify/fanotify/Kconfig| 2 +-
 15 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 0e49d39ea74a..aa6b11957cd7 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -505,7 +505,7 @@ config ALPHA_QEMU
 
  Generic kernels will auto-detect QEMU.  But when building a
  system-specific kernel, the assumption is that we want to
- elimiate as many runtime tests as possible.
+ eliminate as many runtime tests as possible.
 
  If unsure, say N.
 
diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index a598641eed98..eab4ba316a58 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -298,7 +298,7 @@ config ARC_MMU_V1
 config ARC_MMU_V2
bool "MMU v2"
help
- Fixed the deficiency of v1 - possible thrashing in memcpy sceanrio
+ Fixed the deficiency of v1 - possible thrashing in memcpy scenario
  when 2 D-TLB and 1 I-TLB entries index into same 2way set.
 
 config ARC_MMU_V3
@@ -371,7 +371,7 @@ config ARC_FPU_SAVE_RESTORE
bool "Enable FPU state persistence across context switch"
default n
help
- Double Precision Floating Point unit had dedictaed regs which
+ Double Precision Floating Point unit had dedicated regs which
  need to be saved/restored across context-switch.
  Note that ARC FPU is overly simplistic, unlike say x86, which has
  hardware pieces to allow software to conditionally save/restore,
@@ -467,7 +467,7 @@ config ARC_PLAT_NEEDS_PHYS_TO_DMA
bool
 
 config ARC_KVADDR_SIZE
-   int "Kernel Virtaul Address Space size (MB)"
+   int "Kernel Virtual Address Space size (MB)"
range 0 512
default "256"
help
diff --git a/arch/arm/mach-bcm/Kconfig b/arch/arm/mach-bcm/Kconfig
index 73be3d578851..0bb7e74e1d87 100644
--- a/arch/arm/mach-bcm/Kconfig
+++ b/arch/arm/mach-bcm/Kconfig
@@ -22,7 +22,7 @@ config ARCH_BCM_IPROC
help
  This enables support for systems based on Broadcom IPROC architected SoCs.
  The IPROC complex contains one or more ARM CPUs along with common
- core periperals. Application specific SoCs are created by adding a
+ core peripherals. Application specific SoCs are created by adding a
  uArchitecture containing peripherals outside of the IPROC complex.
  Currently supported SoCs are Cygnus.
 
@@ -69,7 +69,7 @@ config ARCH_BCM_5301X
 
  This is a network SoC line mostly used in home routers and
  wifi access points, it's internal name is Northstar.
- This inclused the following SoC: BCM53010, BCM53011, BCM53012,
+ This include the following SoC: BCM53010, BCM53011, BCM53012,
  BCM53014, BCM53015, BCM53016, BCM53017, BCM53018, BCM4707,
  BCM4708 and BCM4709.
 
diff --git a/arch/arm/plat-samsung/Kconfig b/arch/arm/plat-samsung/Kconfig
index e8229b9fee4a..8d4a64cc644c 100644
--- a/arch/arm/plat-samsung/Kconfig
+++ b/arch/arm/plat-samsung/Kconfig
@@ -278,7 +278,7 @@ config SAMSUNG_PM_CHECK_CHUNKSIZE
help
  Set the chunksize in Kilobytes of the CRC for checking memory
  corruption over suspend and resume. A smaller value will mean that
- the CRC data block will take more memory, but wil identify any
+ the CRC data block will take more memory, but will identify any
  faults with better precision.
 
  See 
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0df64a6a56d4..416dbc637dc8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -946,7 +946,7 @@ config ARM64_UAO
help
  User Access Override (UAO; part of the ARMv8.2 Extensions)
  causes the 'unprivileged' variant of the load/store instructions to
- be overriden to be privileged.
+ be overridden to be privileged.
 
  This option changes get_user() and friends to use the 'unprivileged'
  variant of the load/store instructions. This ensures that user-space
diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index 4fd64d3f5c44..aac89f51d824 100644
--- a/arch/powerpc/platforms/Kconfig
+++ 

[PATCH][trivial] Kconfig: Fix typos in Kconfig

2017-09-24 Thread Masanari Iida
This patch fixes some spelling typos found in Kconfig files.

Signed-off-by: Masanari Iida 
---
 arch/alpha/Kconfig| 2 +-
 arch/arc/Kconfig  | 6 +++---
 arch/arm/mach-bcm/Kconfig | 4 ++--
 arch/arm/plat-samsung/Kconfig | 2 +-
 arch/arm64/Kconfig| 2 +-
 arch/powerpc/platforms/Kconfig| 2 +-
 arch/unicore32/Kconfig| 2 +-
 arch/xtensa/Kconfig   | 2 +-
 drivers/net/ethernet/aquantia/Kconfig | 2 +-
 drivers/nfc/st-nci/Kconfig| 4 ++--
 drivers/nvdimm/Kconfig| 2 +-
 drivers/platform/x86/Kconfig  | 2 +-
 drivers/power/supply/Kconfig  | 2 +-
 drivers/scsi/Kconfig  | 2 +-
 fs/notify/fanotify/Kconfig| 2 +-
 15 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 0e49d39ea74a..aa6b11957cd7 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -505,7 +505,7 @@ config ALPHA_QEMU
 
  Generic kernels will auto-detect QEMU.  But when building a
  system-specific kernel, the assumption is that we want to
- elimiate as many runtime tests as possible.
+ eliminate as many runtime tests as possible.
 
  If unsure, say N.
 
diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index a598641eed98..eab4ba316a58 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -298,7 +298,7 @@ config ARC_MMU_V1
 config ARC_MMU_V2
bool "MMU v2"
help
- Fixed the deficiency of v1 - possible thrashing in memcpy sceanrio
+ Fixed the deficiency of v1 - possible thrashing in memcpy scenario
  when 2 D-TLB and 1 I-TLB entries index into same 2way set.
 
 config ARC_MMU_V3
@@ -371,7 +371,7 @@ config ARC_FPU_SAVE_RESTORE
bool "Enable FPU state persistence across context switch"
default n
help
- Double Precision Floating Point unit had dedictaed regs which
+ Double Precision Floating Point unit had dedicated regs which
  need to be saved/restored across context-switch.
  Note that ARC FPU is overly simplistic, unlike say x86, which has
  hardware pieces to allow software to conditionally save/restore,
@@ -467,7 +467,7 @@ config ARC_PLAT_NEEDS_PHYS_TO_DMA
bool
 
 config ARC_KVADDR_SIZE
-   int "Kernel Virtaul Address Space size (MB)"
+   int "Kernel Virtual Address Space size (MB)"
range 0 512
default "256"
help
diff --git a/arch/arm/mach-bcm/Kconfig b/arch/arm/mach-bcm/Kconfig
index 73be3d578851..0bb7e74e1d87 100644
--- a/arch/arm/mach-bcm/Kconfig
+++ b/arch/arm/mach-bcm/Kconfig
@@ -22,7 +22,7 @@ config ARCH_BCM_IPROC
help
  This enables support for systems based on Broadcom IPROC architected 
SoCs.
  The IPROC complex contains one or more ARM CPUs along with common
- core periperals. Application specific SoCs are created by adding a
+ core peripherals. Application specific SoCs are created by adding a
  uArchitecture containing peripherals outside of the IPROC complex.
  Currently supported SoCs are Cygnus.
 
@@ -69,7 +69,7 @@ config ARCH_BCM_5301X
 
  This is a network SoC line mostly used in home routers and
  wifi access points, it's internal name is Northstar.
- This inclused the following SoC: BCM53010, BCM53011, BCM53012,
+ This includes the following SoC: BCM53010, BCM53011, BCM53012,
  BCM53014, BCM53015, BCM53016, BCM53017, BCM53018, BCM4707,
  BCM4708 and BCM4709.
 
diff --git a/arch/arm/plat-samsung/Kconfig b/arch/arm/plat-samsung/Kconfig
index e8229b9fee4a..8d4a64cc644c 100644
--- a/arch/arm/plat-samsung/Kconfig
+++ b/arch/arm/plat-samsung/Kconfig
@@ -278,7 +278,7 @@ config SAMSUNG_PM_CHECK_CHUNKSIZE
help
  Set the chunksize in Kilobytes of the CRC for checking memory
  corruption over suspend and resume. A smaller value will mean that
- the CRC data block will take more memory, but wil identify any
+ the CRC data block will take more memory, but will identify any
  faults with better precision.
 
  See 
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0df64a6a56d4..416dbc637dc8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -946,7 +946,7 @@ config ARM64_UAO
help
  User Access Override (UAO; part of the ARMv8.2 Extensions)
  causes the 'unprivileged' variant of the load/store instructions to
- be overriden to be privileged.
+ be overridden to be privileged.
 
  This option changes get_user() and friends to use the 'unprivileged'
  variant of the load/store instructions. This ensures that user-space
diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index 4fd64d3f5c44..aac89f51d824 100644
--- a/arch/powerpc/platforms/Kconfig
+++ 

Re: [PATCH 1/4] rcu: Allow for page faults in NMI handlers

2017-09-24 Thread Steven Rostedt
Sorry for the top post, currently on a train to Paris.

This series already went through all my testing, and I would hate to rebase it 
for this reason. Can you just add a patch to remove the READ_ONCE()s?

Thanks,

-- Steve


On September 25, 2017 2:34:56 AM GMT+02:00, "Paul E. McKenney" 
 wrote:
>On Sun, Sep 24, 2017 at 05:26:53PM -0700, Paul E. McKenney wrote:
>> On Sun, Sep 24, 2017 at 05:12:13PM -0700, Linus Torvalds wrote:
>> > On Sun, Sep 24, 2017 at 5:03 PM, Paul E. McKenney
>> >  wrote:
>> > >
>> > > Mostly just paranoia on my part.  I would be happy to remove it
>if
>> > > you prefer.  Or you or Steve can do so if that is more
>convenient.
>> > 
>> > I really don't think it's warranted. The values are *stable*.
>There's
>> > no subtle lack of locking, or some optimistic access to a value
>that
>> > can change.
>> > 
>> > The compiler can generate code to read the value fifteen billion
>> > times, and it will always get the same value.
>> > 
>> > Yes, maybe in between the different accesses, an NMI will happen,
>and
>> > the value will be incremented, but then as the NMI exits, it will
>> > decrement again, so the code that got interrupted will not actually
>> > see the change.
>> > 
>> > So the READ_ONCE() isn't "paranoia". It's just confusing.
>> > 
>> > > And yes, consistency would dictate that the uses in
>rcu_nmi_enter()
>> > > and rcu_nmi_exit() should be _ONCE(), particularly the stores to
>> > > ->dynticks_nmi_nesting.
>> > 
>> > NO.
>> > 
>> > That would be just more of that confusion.
>> > 
>> > That value is STABLE. It's stable even within an NMI handler. The
>NMI
>> > code can read it, modify it, write it back, do a little dance, all
>> > without having to care. There's no "_ONCE()" about it - not for the
>> > readers, not for the writers, not for _anybody_.
>> > 
>> > So adding even more READ/WRITE_ONCE() accesses wouldn't be
>> > "consistent", it would just be insanity.
>> > 
>> > Now, if an NMI happens and the value would be different on entry
>than
>> > it is on exit, that would be something else. Then it really
>wouldn't
>> > be stable wrt random users. But that would also be a major bug in
>the
>> > NMI handler, as far as I can tell.
>> > 
>> > So the reason I'm objecting to that READ_ONCE() is that it isn't
>> > "paranoia", it's "voodoo programming". And we don't do voodoo
>> > programming.
>> 
>> I already agreed that the READ_ONCE() can be removed.
>
>And for whatever it is worth, here is the updated patch.
>
>   Thanx, Paul
>
>
>
>commit 3e2baa988b9c13095995c46c51e0e32c0b6a7d43
>Author: Paul E. McKenney 
>Date:   Fri Sep 22 13:14:42 2017 -0700
>
>rcu: Allow for page faults in NMI handlers
>
>  A number of architectures invoke rcu_irq_enter() on exception entry in
>order to allow RCU read-side critical sections in the exception handler
>   when the exception is from an idle or nohz_full CPU.  This works, at
>   least unless the exception happens in an NMI handler.  In that case,
>rcu_nmi_enter() would already have exited the extended quiescent state,
>which would mean that rcu_irq_enter() would (incorrectly) cause RCU
>   to think that it is again in an extended quiescent state.  This will
>in turn result in lockdep splats in response to later RCU read-side
>critical sections.
>
>This commit therefore causes rcu_irq_enter() and rcu_irq_exit() to
> take no action if there is an rcu_nmi_enter() in effect, thus avoiding
>the unscheduled return to RCU quiescent state.  This in turn should
>make the kernel safe for on-demand RCU voyeurism.
>
>Reported-by: Steven Rostedt 
>Signed-off-by: Paul E. McKenney 
>[ paulmck: Remove READ_ONCE() per Linus Torvalds feedback. ]
>
>diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>index db5eb8c3f7af..e4fe06d42385 100644
>--- a/kernel/rcu/tree.c
>+++ b/kernel/rcu/tree.c
>@@ -891,6 +891,11 @@ void rcu_irq_exit(void)
> 
>   RCU_LOCKDEP_WARN(!irqs_disabled(), "rcu_irq_exit() invoked with irqs
>enabled!!!");
>   rdtp = this_cpu_ptr(&rcu_dynticks);
>+
>+  /* Page faults can happen in NMI handlers, so check... */
>+  if (rdtp->dynticks_nmi_nesting)
>+  return;
>+
>   WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
>rdtp->dynticks_nesting < 1);
>   if (rdtp->dynticks_nesting <= 1) {
>@@ -1036,6 +1041,11 @@ void rcu_irq_enter(void)
> 
>   RCU_LOCKDEP_WARN(!irqs_disabled(), "rcu_irq_enter() invoked with irqs
>enabled!!!");
>   rdtp = this_cpu_ptr(&rcu_dynticks);
>+
>+  /* Page faults can happen in NMI handlers, so check... */
>+  if (rdtp->dynticks_nmi_nesting)
>+  return;
>+
>   oldval = rdtp->dynticks_nesting;
>   rdtp->dynticks_nesting++;
>   


Re: [PATCH v4 1/9] brcmsmac: make some local variables 'static const' to reduce stack size

2017-09-24 Thread Kalle Valo
Arnd Bergmann  writes:

> With KASAN and a couple of other patches applied, this driver is one
> of the few remaining ones that actually use more than 2048 bytes of
> kernel stack:
>
> broadcom/brcm80211/brcmsmac/phy/phy_n.c: In function 
> 'wlc_phy_workarounds_nphy_gainctrl':
> broadcom/brcm80211/brcmsmac/phy/phy_n.c:16065:1: warning: the frame size of 
> 3264 bytes is larger than 2048 bytes [-Wframe-larger-than=]
> broadcom/brcm80211/brcmsmac/phy/phy_n.c: In function 
> 'wlc_phy_workarounds_nphy':
> broadcom/brcm80211/brcmsmac/phy/phy_n.c:17138:1: warning: the frame size of 
> 2864 bytes is larger than 2048 bytes [-Wframe-larger-than=]
>
> Here, I'm reducing the stack size by marking as many local variables as
> 'static const' as I can without changing the actual code.
>
> This is the first of three patches to improve the stack usage in this
> driver. It would be good to have this backported to stable kernels
> to get all drivers in 'allmodconfig' below the 2048 byte limit so
> we can turn on the frame warning again globally, but I realize that
> the patch is larger than the normal limit for stable backports.
>
> The other two patches do not need to be backported.
>
> Acked-by: Arend van Spriel 
> Signed-off-by: Arnd Bergmann 

I'll queue this and the two following brcmsmac patches for 4.14.

Also I'll add (only for this patch):

Cc: 

-- 
Kalle Valo




Re: [PATCH 4.9 00/77] 4.9.52-stable review

2017-09-24 Thread Tom Gall
On Sun, Sep 24, 2017 at 3:31 PM, Greg Kroah-Hartman
 wrote:
> This is the start of the stable review cycle for the 4.9.52 release.
> There are 77 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Tue Sep 26 20:32:25 UTC 2017.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.52-rc1.gz
> or in the git tree and branch at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> linux-4.9.y
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h

kernel: 4.9.52-rc1
git repo: 
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
git branch: linux-4.9.y
git commit: e009129d09cb626ff5271c9de006883671f38ab3
git describe: v4.9.51-78-ge009129d09cb
Test details: 
https://qa-reports.linaro.org/lkft/linux-stable-rc-4.9-oe/build/v4.9.51-78-ge009129d09cb


No regressions (compared to build v4.9.51)

Boards, architectures and test suites:
-

hi6220-hikey - arm64
* boot
* libhugetlbfs
* ltp-syscalls-tests

dell-poweredge-r200 - x86_64
* boot
* kselftest
* libhugetlbfs
* ltp-syscalls-tests



Documentation - https://collaborate.linaro.org/display/LKFT/Email+Reports


-- 
Regards,
Tom

Director, Linaro Mobile Group
Linaro.org │ Open source software for ARM SoCs
irc: tgall_foo | skype : tom_gall

"Where's the kaboom!? There was supposed to be an earth-shattering
kaboom!" Marvin Martian




Re: usb/wireless/rsi_91x: use-after-free write in __run_timers

2017-09-24 Thread Kalle Valo
Andrey Konovalov  writes:

> I've got the following report while fuzzing the kernel with syzkaller.
>
> On commit 6e80ecdddf4ea6f3cd84e83720f3d852e6624a68 (Sep 21).
>
> ==
> BUG: KASAN: use-after-free in __run_timers+0xc0e/0xd40
> Write of size 8 at addr 880069f701b8 by task swapper/0/0
>
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc1-42311-g6e80ecdddf4e #234
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011

[...]

> Allocated by task 1845:
>  save_stack_trace+0x1b/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>  kmem_cache_alloc_trace+0x11e/0x2d0 mm/slub.c:2772
>  kmalloc ./include/linux/slab.h:493
>  kzalloc ./include/linux/slab.h:666
>  rsi_91x_init+0x98/0x510 drivers/net/wireless/rsi/rsi_91x_main.c:203
>  rsi_probe+0xb6/0x13b0 drivers/net/wireless/rsi/rsi_91x_usb.c:665
>  usb_probe_interface+0x35d/0x8e0 drivers/usb/core/driver.c:361

I'm curious about your setup. Apparently you are running syzkaller on
QEMU but what I don't understand is how the rsi device comes into the
picture. Did you have a rsi usb device connected to the virtual machine
or what? Or does syzkaller do some kind of magic here?

-- 
Kalle Valo




Re: [PATCH] brcm80211: make const array ucode_ofdm_rates static, reduces object code size

2017-09-24 Thread Kalle Valo
Arend van Spriel  writes:

> Please use 'brcmsmac:' as prefix instead of 'brcm80211:'.

I can fix that.

-- 
Kalle Valo




Re: [PATCH v2 06/10] arm64: allwinner: a64: Add devicetree binding for DMA controller

2017-09-24 Thread Rob Herring
On Sat, Sep 23, 2017 at 6:34 PM, Stefan Bruens
 wrote:
> On Mittwoch, 20. September 2017 22:53:00 CEST Rob Herring wrote:
>> On Sun, Sep 17, 2017 at 05:19:52AM +0200, Stefan Brüns wrote:
>> > The A64 is register compatible with the H3, but has a different number
>> > of dma channels and request ports.
>> >
>> > Attach additional properties to the node to allow future reuse of the
>> > compatible for controllers with different number of channels/requests.
>> >
>> > If dma-requests is not specified, the register layout defined maximum
>> > of 32 is used.
>> >
>> > Signed-off-by: Stefan Brüns 
>> > ---
>> >
>> >  .../devicetree/bindings/dma/sun6i-dma.txt  | 26 ++
>> >  1 file changed, 26 insertions(+)
>> >
>> > diff --git a/Documentation/devicetree/bindings/dma/sun6i-dma.txt
>> > b/Documentation/devicetree/bindings/dma/sun6i-dma.txt index
>> > 98fbe1a5c6dd..6ebc79f95202 100644
>> > --- a/Documentation/devicetree/bindings/dma/sun6i-dma.txt
>> > +++ b/Documentation/devicetree/bindings/dma/sun6i-dma.txt
>> >
>> > @@ -27,6 +27,32 @@ Example:
>> > #dma-cells = <1>;
>> >
>> > };
>> >
>> > +------------------------------------------------------------------------------
>> > +For A64 DMA controller:
>> > +
>> > +Required properties:
>> > +- compatible:  "allwinner,sun50i-a64-dma"
>> > +- dma-channels: Number of DMA channels supported by the controller.
>> > +   Refer to Documentation/devicetree/bindings/dma/dma.txt
>> > +- all properties above, i.e. reg, interrupts, clocks, resets and
>> > #dma-cells +
>> > +Optional properties:
>> > +- dma-requests: Number of DMA request signals supported by the
>> > controller.
>> > +   Refer to Documentation/devicetree/bindings/dma/dma.txt
>> > +
>> > +Example:
>> > +   dma: dma-controller@01c02000 {
>>
>> Drop the leading 0. Building dtbs with W=2 will tell you this.
>>
>> With that,
>>
>> Acked-by: Rob Herring 
>
> The leading 0 was copied from the A31 example just a few lines above. Should I
> also correct that one, or should that go in a separate patch?

A separate patch.

Rob




Re: [PATCH v4 2/5] dt-bindings: input: Add document bindings for mtk-pmic-keys

2017-09-24 Thread Rob Herring
On Sat, Sep 23, 2017 at 1:47 AM, Chen Zhong  wrote:
> Sorry for the typo.
>
> On Sat, 2017-09-23 at 14:38 +0800, Chen Zhong wrote:
>> On Wed, 2017-09-20 at 15:53 -0500, Rob Herring wrote:
>> > On Sun, Sep 17, 2017 at 04:00:49PM +0800, Chen Zhong wrote:
>> > > This patch adds the device tree binding documentation for the MediaTek
>> > > pmic keys found on PMIC MT6397/MT6323.
>> > >
>> > > Signed-off-by: Chen Zhong 
>> > > ---
>> > >  .../devicetree/bindings/input/mtk-pmic-keys.txt|   41 
>> > > 
>> > >  1 file changed, 41 insertions(+)
>> > >  create mode 100644 
>> > > Documentation/devicetree/bindings/input/mtk-pmic-keys.txt
>> > >
>> > > diff --git a/Documentation/devicetree/bindings/input/mtk-pmic-keys.txt 
>> > > b/Documentation/devicetree/bindings/input/mtk-pmic-keys.txt
>> > > new file mode 100644
>> > > index 000..fd48ff7
>> > > --- /dev/null
>> > > +++ b/Documentation/devicetree/bindings/input/mtk-pmic-keys.txt
>> > > @@ -0,0 +1,41 @@
>> > > +MediaTek MT6397/MT6323 PMIC Keys Device Driver
>> > > +
>> > > +There are two key functions provided by MT6397/MT6323 PMIC, pwrkey
>> > > +and homekey. The key functions are defined as the subnode of the 
Re: [PATCH v4 2/5] dt-bindings: input: Add document bindings for mtk-pmic-keys

2017-09-24 Thread Rob Herring
On Sat, Sep 23, 2017 at 1:47 AM, Chen Zhong  wrote:
> Sorry for the typo.
>
> On Sat, 2017-09-23 at 14:38 +0800, Chen Zhong wrote:
>> On Wed, 2017-09-20 at 15:53 -0500, Rob Herring wrote:
>> > On Sun, Sep 17, 2017 at 04:00:49PM +0800, Chen Zhong wrote:
>> > > This patch adds the device tree binding documentation for the MediaTek
>> > > pmic keys found on PMIC MT6397/MT6323.
>> > >
>> > > Signed-off-by: Chen Zhong 
>> > > ---
>> > >  .../devicetree/bindings/input/mtk-pmic-keys.txt|   41 
>> > > 
>> > >  1 file changed, 41 insertions(+)
>> > >  create mode 100644 
>> > > Documentation/devicetree/bindings/input/mtk-pmic-keys.txt
>> > >
>> > > diff --git a/Documentation/devicetree/bindings/input/mtk-pmic-keys.txt 
>> > > b/Documentation/devicetree/bindings/input/mtk-pmic-keys.txt
>> > > new file mode 100644
>> > > index 000..fd48ff7
>> > > --- /dev/null
>> > > +++ b/Documentation/devicetree/bindings/input/mtk-pmic-keys.txt
>> > > @@ -0,0 +1,41 @@
>> > > +MediaTek MT6397/MT6323 PMIC Keys Device Driver
>> > > +
>> > > +There are two key functions provided by MT6397/MT6323 PMIC, pwrkey
>> > > +and homekey. The key functions are defined as the subnode of the 
>> > > function
>> > > +node provided by MT6397/MT6323 PMIC that is being defined as one kind
>> > > +of Multi-Function Device (MFD).
>> > > +
>> > > +For MT6397/MT6323 MFD bindings see:
>> > > +Documentation/devicetree/bindings/mfd/mt6397.txt
>> > > +
>> > > +Required properties:
>> > > +- compatible: "mediatek,mt6397-keys" or "mediatek,mt6323-keys"
>> > > +- linux,keycodes: Specifies the numeric keycode values to
>> > > + be used for reporting key presses. The array can
>> > > + contain up to 2 entries.
>> > > +
>> > > +Optional Properties:
>> > > +- mediatek,wakeup-keys: Specifies each key can be used as a wakeup 
>> > > source
>> > > + or not. This can be customized depending on board design.
>> >
>> > I think this should be a common property if we're going to put into DT.
>> > Something like "wakeup-scancodes" to be clear the values are the raw
>> > scancodes. Alternatively, we could list Linux keycodes instead with
>> > something like "linux,wakeup-keycodes".
>> >
>> > > +- wakeup-source: PMIC keys can be used as wakeup sources.
>> >
>> > Just "See ../power/wakeup-source.txt" for the description.
>> >
>> > > +- mediatek,long-press-mode: Long press key shutdown setting, 1 for
>> > > + pwrkey only, 2 for pwrkey/homekey together, others for disabled.
>> > > +- debounce-interval: Long press key shutdown debouncing interval time
>> > > + in seconds. 0/1/2/3 for 8/11/14/5 seconds. If not specified defaults 
>> > > to 0.
>> >
>> > This property units should be in milliseconds. However, this doesn't
>> > sound like debounce filtering time if 5-14 seconds. That sounds like
>> > forced power off time (i.e. for a hung device). This also should be
>> > common. I imagine we already have some drivers with similar properties.
>>
>> Hi Rob,
>>
>> I searched in kernel documents and found a similar usage in
>> "ti,palmas-pwrbutton.txt"
>> "- ti,palmas-long-press-seconds: Duration in seconds which the power
>>   button should be kept pressed for Palmas to power off automatically."
>>
>> Could I just write it like this?
>> mediatek,long-press-seconds = <0>;

That doesn't really tell what the long press does. How about
"power-off-time-sec"? Surprisingly we don't have a common keyboard
binding doc, so please start one and document it there. Then just
refer to it.

>>
>> And for the wakeup source part, how about Dmitry's suggestion?

It's fine for me.

>> The whole device node would be:
>>
>> mt6397keys: mt6397keys {
>>   compatible = "mediatek,mt6397-keys";
>>   mediatek,long-press-mode = <1>;
>>   mediatek,long-press-seconds = <0>;
>>
>>   power@0 {
>>   linux,code = <116>;

linux,keycodes

Also, you either need a reg property with "0" or drop the unit address.

>>   wakeup-source;
>>   };
>>
>>   home@0 {
> should be home@1 {
>>   linux,code = <114>;
>>   };
>> };
>>
>> Thank you.
>>
>> >
>> > Rob
>>
>
>


[Kernel.org Helpdesk #46182] [linuxfoundation.org #46182] Re: Linux 4.14-rc2 (bad patch file on kernel.org)

2017-09-24 Thread Linus Torvalds via RT
On Sun, Sep 24, 2017 at 7:57 PM, Randy Dunlap  wrote:
>
> Downloading & applying 4.14-rc2 [patch] 
> 
>
> from kernel.org (home page) gives me a file that does not apply cleanly to 
> v4.13:

Hmm. The rc patches are automatically generated from the git tree
these days, so I don't have control over them.

It does sound like you might have caught it while it was being generated:

> patch unexpectedly ends in middle of line
> patch:  unexpected end of file in patch

which would seem to indicate that maybe you just caught it while it
was still being generated.

But I just tried it myself, and get the same breakage. In fact, the
patch it downloads is exactly 50397184 bytes in size. That may not
sound like a round number, but it is: it is hex 0x3010000, so it's
evenly divisible by 65536.

Methinks there's some incorrect flushing of block IO going on. Konstantin?

> I also notice that the [pgp] signing is not there.  Is that normal?

So I don't sign the rc patches any more because I don't generate them
(but the final release patches and tar-balls I *do* sign).

But maybe they could be signed by some kernel.org key.

Again, that would be an automation issue..

Linus



[Kernel.org Helpdesk #46182] [linuxfoundation.org #46182] Re: Linux 4.14-rc2 (bad patch file on kernel.org)

2017-09-24 Thread Linus Torvalds via RT
On Sun, Sep 24, 2017 at 8:41 PM, Linus Torvalds
 wrote:
>
> But I just tried it myself, and get the same breakage. In fact, the
> patch it downloads is exactly 50397184 bytes in size.

Side note: instead of downloading a 50MB patch, you could probably use
the same amount of bandwidth to download and build git, and then use
that to download much smaller incremental updates.

I'm surprised that people still even use those nasty patches and tar-balls.

   Linus



Re: [PATCH v2 1/5] clk: Add clock driver for ASPEED BMC SoCs

2017-09-24 Thread Andrew Jeffery
On Thu, 2017-09-21 at 13:56 +0930, Joel Stanley wrote:
> This adds the stub of a driver for the ASPEED SoCs. The clocks are
> defined and the static registration is set up.
> 
> Signed-off-by: Joel Stanley 
> ---
>  drivers/clk/Kconfig  |  12 +++
>  drivers/clk/Makefile |   1 +
>  drivers/clk/clk-aspeed.c | 162 
> +++
>  include/dt-bindings/clock/aspeed-clock.h |  43 
>  4 files changed, 218 insertions(+)
>  create mode 100644 drivers/clk/clk-aspeed.c
>  create mode 100644 include/dt-bindings/clock/aspeed-clock.h
> 
> diff --git a/drivers/clk/Kconfig b/drivers/clk/Kconfig
> index 1c4e1aa6767e..9abe063ef8d2 100644
> --- a/drivers/clk/Kconfig
> +++ b/drivers/clk/Kconfig
> @@ -142,6 +142,18 @@ config COMMON_CLK_GEMINI
>     This driver supports the SoC clocks on the Cortina Systems Gemini
>     platform, also known as SL3516 or CS3516.
>  
> +config COMMON_CLK_ASPEED
> + bool "Clock driver for Aspeed BMC SoCs"
> + depends on ARCH_ASPEED || COMPILE_TEST
> + default ARCH_ASPEED
> + select MFD_SYSCON
> + select RESET_CONTROLLER
> + ---help---
> +   This driver supports the SoC clocks on the Aspeed BMC platforms.
>
> +   The G4 and G5 series, including the ast2400 and ast2500, are supported
> +   by this driver.
> +
>  config COMMON_CLK_S2MPS11
>   tristate "Clock driver for S2MPS1X/S5M8767 MFD"
>   depends on MFD_SEC_CORE || COMPILE_TEST
> diff --git a/drivers/clk/Makefile b/drivers/clk/Makefile
> index c99f363826f0..575c68919d9b 100644
> --- a/drivers/clk/Makefile
> +++ b/drivers/clk/Makefile
> @@ -26,6 +26,7 @@ obj-$(CONFIG_ARCH_CLPS711X) += clk-clps711x.o
>  obj-$(CONFIG_COMMON_CLK_CS2000_CP)   += clk-cs2000-cp.o
>  obj-$(CONFIG_ARCH_EFM32) += clk-efm32gg.o
>  obj-$(CONFIG_COMMON_CLK_GEMINI)  += clk-gemini.o
> +obj-$(CONFIG_COMMON_CLK_ASPEED)  += clk-aspeed.o
>  obj-$(CONFIG_ARCH_HIGHBANK)  += clk-highbank.o
>  obj-$(CONFIG_CLK_HSDK)   += clk-hsdk-pll.o
>  obj-$(CONFIG_COMMON_CLK_MAX77686)+= clk-max77686.o
> diff --git a/drivers/clk/clk-aspeed.c b/drivers/clk/clk-aspeed.c
> new file mode 100644
> index ..824c54767009
> --- /dev/null
> +++ b/drivers/clk/clk-aspeed.c
> @@ -0,0 +1,162 @@
> +/*
> + * Copyright 2017 IBM Corporation
> + *
> + * Joel Stanley 
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +#define pr_fmt(fmt) "clk-aspeed: " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +#define ASPEED_RESET_CTRL	0x04
> +#define ASPEED_CLK_SELECTION	0x08
> +#define ASPEED_CLK_STOP_CTRL	0x0c
> +#define ASPEED_MPLL_PARAM	0x20
> +#define ASPEED_HPLL_PARAM	0x24
> +#define ASPEED_MISC_CTRL 0x2c
> +#define ASPEED_STRAP 0x70
> +
> +/* Keeps track of all clocks */
> +static struct clk_hw_onecell_data *aspeed_clk_data;
> +
> +static void __iomem *scu_base;
> +
> +/**
> + * struct aspeed_gate_data - Aspeed gated clocks
> + * @clock_idx: bit used to gate this clock in the clock register
> + * @reset_idx: bit used to reset this IP in the reset register. -1 if no
> + * reset is required when enabling the clock
> + * @name: the clock name
> + * @parent_name: the name of the parent clock
> + * @flags: standard clock framework flags
> + */
> +struct aspeed_gate_data {
> + u8  clock_idx;
> + s8  reset_idx;
> + const char  *name;
> + const char  *parent_name;
> + unsigned long   flags;
> +};
> +
> +/**
> + * struct aspeed_clk_gate - Aspeed specific clk_gate structure
> + * @hw:  handle between common and hardware-specific interfaces
> + * @reg: register controlling gate
> + * @clock_idx:   bit used to gate this clock in the clock register
> + * @reset_idx:   bit used to reset this IP in the reset register. -1 if 
> no
> + *   reset is required when enabling the clock
> + * @flags:   hardware-specific flags
> + * @lock:register lock
> + *
> + * Some of the clocks in the Aspeed SoC must be put in reset before enabling.
> + * This modified version of clk_gate allows an optional reset bit to be
> + * specified.
> + */
> +struct aspeed_clk_gate {
> + struct clk_hw   hw;
> + struct regmap   *map;
> + u8  clock_idx;
> + s8  reset_idx;
> + u8  flags;
> + spinlock_t  *lock;
> +};

It feels like the two structures could be unified, but the result turns into a
bit of a mess with a union of structs to limit the space impact, so perhaps we
shouldn't go there?

> +
> +#define to_aspeed_clk_gate(_hw) container_of(_hw, struct aspeed_clk_gate, hw)

[PATCH 2/3] iio: accel: mma8452: Rename read/write event value callbacks to generic function name.

2017-09-24 Thread Harinath Nampally
'mma8452_read_thresh' and 'mma8452_write_thresh' functions
do more than just read/write threshold values.
They also handle IIO_EV_INFO_HIGH_PASS_FILTER_3DB and
IIO_EV_INFO_PERIOD, therefore renaming to generic names.

Improves code readability, no impact on functionality.

Signed-off-by: Harinath Nampally 
---
 drivers/iio/accel/mma8452.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/iio/accel/mma8452.c b/drivers/iio/accel/mma8452.c
index 74b6221..43c3a6b 100644
--- a/drivers/iio/accel/mma8452.c
+++ b/drivers/iio/accel/mma8452.c
@@ -792,7 +792,7 @@ static int mma8452_get_event_regs(struct mma8452_data *data,
}
 }
 
-static int mma8452_read_thresh(struct iio_dev *indio_dev,
+static int mma8452_read_event_value(struct iio_dev *indio_dev,
   const struct iio_chan_spec *chan,
   enum iio_event_type type,
   enum iio_event_direction dir,
@@ -855,7 +855,7 @@ static int mma8452_read_thresh(struct iio_dev *indio_dev,
}
 }
 
-static int mma8452_write_thresh(struct iio_dev *indio_dev,
+static int mma8452_write_event_value(struct iio_dev *indio_dev,
const struct iio_chan_spec *chan,
enum iio_event_type type,
enum iio_event_direction dir,
@@ -1391,8 +1391,8 @@ static const struct iio_info mma8452_info = {
	.read_raw = &mma8452_read_raw,
	.write_raw = &mma8452_write_raw,
	.event_attrs = &mma8452_event_attribute_group,
-	.read_event_value = &mma8452_read_thresh,
-	.write_event_value = &mma8452_write_thresh,
+	.read_event_value = &mma8452_read_event_value,
+	.write_event_value = &mma8452_write_event_value,
	.read_event_config = &mma8452_read_event_config,
	.write_event_config = &mma8452_write_event_config,
	.debugfs_reg_access = &mma8452_reg_access_dbg,
-- 
2.7.4



[PATCH 1/3] iio: accel: mma8452: Rename time step look up struct to generic name as the values are same for all the events.

2017-09-24 Thread Harinath Nampally
Improves code readability, no impact on functionality.

Signed-off-by: Harinath Nampally 
---
 drivers/iio/accel/mma8452.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/iio/accel/mma8452.c b/drivers/iio/accel/mma8452.c
index 3472e7e..74b6221 100644
--- a/drivers/iio/accel/mma8452.c
+++ b/drivers/iio/accel/mma8452.c
@@ -284,7 +284,7 @@ static const int mma8452_samp_freq[8][2] = {
 };
 
 /* Datasheet table: step time "Relationship with the ODR" (sample frequency) */
-static const unsigned int mma8452_transient_time_step_us[4][8] = {
+static const unsigned int mma8452_time_step_us[4][8] = {
	{ 1250, 2500, 5000, 10000, 20000, 20000, 20000, 20000 },  /* normal */
	{ 1250, 2500, 5000, 10000, 20000, 80000, 80000, 80000 },  /* l p l n */
	{ 1250, 2500, 2500, 2500, 2500, 2500, 2500, 2500 },	  /* high res*/
@@ -826,7 +826,7 @@ static int mma8452_read_thresh(struct iio_dev *indio_dev,
if (power_mode < 0)
return power_mode;
 
-   us = ret * mma8452_transient_time_step_us[power_mode][
+   us = ret * mma8452_time_step_us[power_mode][
mma8452_get_odr_index(data)];
*val = us / USEC_PER_SEC;
*val2 = us % USEC_PER_SEC;
@@ -883,7 +883,7 @@ static int mma8452_write_thresh(struct iio_dev *indio_dev,
return ret;
 
steps = (val * USEC_PER_SEC + val2) /
-   mma8452_transient_time_step_us[ret][
+   mma8452_time_step_us[ret][
mma8452_get_odr_index(data)];
 
if (steps < 0 || steps > 0xff)
-- 
2.7.4



[PATCH 0/3] This patchset refactors event related functions

2017-09-24 Thread Harinath Nampally
Harinath Nampally (3):
The following two patches are refactoring:
  iio: accel: mma8452: Rename time step look up struct to generic
name as the values are same for all the events.
  iio: accel: mma8452: Rename read/write event value callbacks to
generic function name.
The following patch adds a new feature:
  iio: accel: mma8452: Add single pulse/tap event detection feature
for fxls8471.

 drivers/iio/accel/mma8452.c | 170 
 1 file changed, 158 insertions(+), 12 deletions(-)

-- 
2.7.4



[PATCH 3/3] iio: accel: mma8452: Add single pulse/tap event detection feature for fxls8471.

2017-09-24 Thread Harinath Nampally
This patch adds the following changes to support the tap feature:
- defines pulse event related registers
- enables and handles single pulse interrupt for fxls8471
- handles IIO_EV_DIR_EITHER in read/write callbacks
  because event direction for pulse is either rising or
  falling.
- configures read/write event value for pulse latency register
  using IIO_EV_INFO_HYSTERESIS.
- add multiple events like pulse and transient event specs
  as elements of the event_spec array named 'mma8452_accel_events'

Except for the mma8653 chip, all other chips like mma845x and
fxls8471 have the single tap detection feature.
Tested thoroughly using iio_event_monitor application on
imx6ul-evk board.

Signed-off-by: Harinath Nampally 
---
 drivers/iio/accel/mma8452.c | 156 ++--
 1 file changed, 151 insertions(+), 5 deletions(-)

diff --git a/drivers/iio/accel/mma8452.c b/drivers/iio/accel/mma8452.c
index 43c3a6b..36f1b56 100644
--- a/drivers/iio/accel/mma8452.c
+++ b/drivers/iio/accel/mma8452.c
@@ -72,6 +72,19 @@
#define  MMA8452_TRANSIENT_THS_MASK	GENMASK(6, 0)
 #define MMA8452_TRANSIENT_COUNT0x20
 #define MMA8452_TRANSIENT_CHAN_SHIFT 1
+#define MMA8452_PULSE_CFG  0x21
+#define MMA8452_PULSE_CFG_CHAN(chan)   BIT(chan * 2)
+#define MMA8452_PULSE_CFG_ELE  BIT(6)
+#define MMA8452_PULSE_SRC  0x22
+#define MMA8452_PULSE_SRC_XPULSE   BIT(4)
+#define MMA8452_PULSE_SRC_YPULSE   BIT(5)
+#define MMA8452_PULSE_SRC_ZPULSE   BIT(6)
+#define MMA8452_PULSE_THS  0x23
+#define MMA8452_PULSE_THS_MASK GENMASK(6, 0)
+#define MMA8452_PULSE_COUNT0x26
+#define MMA8452_PULSE_CHAN_SHIFT   2
+#define MMA8452_PULSE_LTCY 0x27
+
 #define MMA8452_CTRL_REG1  0x2a
 #define  MMA8452_CTRL_ACTIVE   BIT(0)
 #define  MMA8452_CTRL_DR_MASK  GENMASK(5, 3)
@@ -91,6 +104,7 @@
 
 #define  MMA8452_INT_DRDY  BIT(0)
 #define  MMA8452_INT_FF_MT BIT(2)
+#define  MMA8452_INT_PULSE BIT(3)
 #define  MMA8452_INT_TRANS BIT(5)
 
 #define MMA8451_DEVICE_ID  0x1a
@@ -155,6 +169,16 @@ static const struct mma8452_event_regs trans_ev_regs = {
.ev_count = MMA8452_TRANSIENT_COUNT,
 };
 
+static const struct mma8452_event_regs pulse_ev_regs = {
+   .ev_cfg = MMA8452_PULSE_CFG,
+   .ev_cfg_ele = MMA8452_PULSE_CFG_ELE,
+   .ev_cfg_chan_shift = MMA8452_PULSE_CHAN_SHIFT,
+   .ev_src = MMA8452_PULSE_SRC,
+   .ev_ths = MMA8452_PULSE_THS,
+   .ev_ths_mask = MMA8452_PULSE_THS_MASK,
+   .ev_count = MMA8452_PULSE_COUNT,
+};
+
 /**
  * struct mma_chip_info - chip specific data
  * @chip_id:   WHO_AM_I register's value
@@ -784,6 +808,14 @@ static int mma8452_get_event_regs(struct mma8452_data *data,
case IIO_EV_DIR_FALLING:
*ev_reg = &ff_mt_ev_regs;
return 0;
+   case IIO_EV_DIR_EITHER:
+   if (!(data->chip_info->all_events
+   & MMA8452_INT_PULSE) ||
+   !(data->chip_info->enabled_events
+   & MMA8452_INT_PULSE))
+   return -EINVAL;
+   *ev_reg = &pulse_ev_regs;
+   return 0;
default:
return -EINVAL;
}
@@ -848,6 +880,25 @@ static int mma8452_read_event_value(struct iio_dev *indio_dev,
return ret;
}
 
+   case IIO_EV_INFO_HYSTERESIS:
+   if (!(data->chip_info->all_events & MMA8452_INT_PULSE) ||
+   !(data->chip_info->enabled_events & MMA8452_INT_PULSE))
+   return -EINVAL;
+
+   ret = i2c_smbus_read_byte_data(data->client,
+   MMA8452_PULSE_LTCY);
+   if (ret < 0)
+   return ret;
+
+   power_mode = mma8452_get_power_mode(data);
+   if (power_mode < 0)
+   return power_mode;
+
+   us = ret * mma8452_time_step_us[power_mode][
+   mma8452_get_odr_index(data)];
+   *val = us / USEC_PER_SEC;
+   *val2 = us % USEC_PER_SEC;
+
return IIO_VAL_INT_PLUS_MICRO;
 
default:
@@ -908,6 +959,24 @@ static int mma8452_write_event_value(struct iio_dev 
*indio_dev,
 
return mma8452_change_config(data, MMA8452_TRANSIENT_CFG, reg);
 
+   case IIO_EV_INFO_HYSTERESIS:
+   if (!(data->chip_info->all_events & MMA8452_INT_PULSE) ||
+  

[PATCH 0/3] This patchset refactors event related functions

2017-09-24 Thread Harinath Nampally
Harinath Nampally (3):
The following 2 patches are for refactoring:
  iio: accel: mma8452: Rename time step look up struct to a generic
name as the values are the same for all the events.
  iio: accel: mma8452: Rename read/write event value callbacks to a
generic function name.
The following patch adds a new feature:
  iio: accel: mma8452: Add single pulse/tap event detection feature
for fxls8471.

 drivers/iio/accel/mma8452.c | 170 
 1 file changed, 158 insertions(+), 12 deletions(-)

-- 
2.7.4



[PATCH 3/3] iio: accel: mma8452: Add single pulse/tap event detection feature for fxls8471.

2017-09-24 Thread Harinath Nampally
This patch adds the following changes to support the tap feature:
- defines pulse event related registers
- enables and handles the single pulse interrupt for fxls8471
- handles IIO_EV_DIR_EITHER in read/write callbacks
  because the event direction for pulse is either rising or
  falling.
- configures the read/write event value for the pulse latency
  register using IIO_EV_INFO_HYSTERESIS.
- adds multiple events like the pulse and transient event specs
  as elements of the event_spec array named 'mma8452_accel_events'

Except for the mma8653 chip, all other chips like mma845x and
fxls8471 have the single tap detection feature.
Tested thoroughly using the iio_event_monitor application on an
imx6ul-evk board.

Signed-off-by: Harinath Nampally 
---
 drivers/iio/accel/mma8452.c | 156 ++--
 1 file changed, 151 insertions(+), 5 deletions(-)

diff --git a/drivers/iio/accel/mma8452.c b/drivers/iio/accel/mma8452.c
index 43c3a6b..36f1b56 100644
--- a/drivers/iio/accel/mma8452.c
+++ b/drivers/iio/accel/mma8452.c
@@ -72,6 +72,19 @@
 #define  MMA8452_TRANSIENT_THS_MASKGENMASK(6, 0)
 #define MMA8452_TRANSIENT_COUNT0x20
 #define MMA8452_TRANSIENT_CHAN_SHIFT 1
+#define MMA8452_PULSE_CFG  0x21
+#define MMA8452_PULSE_CFG_CHAN(chan)   BIT(chan * 2)
+#define MMA8452_PULSE_CFG_ELE  BIT(6)
+#define MMA8452_PULSE_SRC  0x22
+#define MMA8452_PULSE_SRC_XPULSE   BIT(4)
+#define MMA8452_PULSE_SRC_YPULSE   BIT(5)
+#define MMA8452_PULSE_SRC_ZPULSE   BIT(6)
+#define MMA8452_PULSE_THS  0x23
+#define MMA8452_PULSE_THS_MASK GENMASK(6, 0)
+#define MMA8452_PULSE_COUNT0x26
+#define MMA8452_PULSE_CHAN_SHIFT   2
+#define MMA8452_PULSE_LTCY 0x27
+
 #define MMA8452_CTRL_REG1  0x2a
 #define  MMA8452_CTRL_ACTIVE   BIT(0)
 #define  MMA8452_CTRL_DR_MASK  GENMASK(5, 3)
@@ -91,6 +104,7 @@
 
 #define  MMA8452_INT_DRDY  BIT(0)
 #define  MMA8452_INT_FF_MT BIT(2)
+#define  MMA8452_INT_PULSE BIT(3)
 #define  MMA8452_INT_TRANS BIT(5)
 
 #define MMA8451_DEVICE_ID  0x1a
@@ -155,6 +169,16 @@ static const struct mma8452_event_regs trans_ev_regs = {
.ev_count = MMA8452_TRANSIENT_COUNT,
 };
 
+static const struct mma8452_event_regs pulse_ev_regs = {
+   .ev_cfg = MMA8452_PULSE_CFG,
+   .ev_cfg_ele = MMA8452_PULSE_CFG_ELE,
+   .ev_cfg_chan_shift = MMA8452_PULSE_CHAN_SHIFT,
+   .ev_src = MMA8452_PULSE_SRC,
+   .ev_ths = MMA8452_PULSE_THS,
+   .ev_ths_mask = MMA8452_PULSE_THS_MASK,
+   .ev_count = MMA8452_PULSE_COUNT,
+};
+
 /**
  * struct mma_chip_info - chip specific data
  * @chip_id:   WHO_AM_I register's value
@@ -784,6 +808,14 @@ static int mma8452_get_event_regs(struct mma8452_data *data,
case IIO_EV_DIR_FALLING:
*ev_reg = &ff_mt_ev_regs;
return 0;
+   case IIO_EV_DIR_EITHER:
+   if (!(data->chip_info->all_events
+   & MMA8452_INT_PULSE) ||
+   !(data->chip_info->enabled_events
+   & MMA8452_INT_PULSE))
+   return -EINVAL;
*ev_reg = &pulse_ev_regs;
+   return 0;
default:
return -EINVAL;
}
@@ -848,6 +880,25 @@ static int mma8452_read_event_value(struct iio_dev *indio_dev,
return ret;
}
 
+   case IIO_EV_INFO_HYSTERESIS:
+   if (!(data->chip_info->all_events & MMA8452_INT_PULSE) ||
+   !(data->chip_info->enabled_events & MMA8452_INT_PULSE))
+   return -EINVAL;
+
+   ret = i2c_smbus_read_byte_data(data->client,
+   MMA8452_PULSE_LTCY);
+   if (ret < 0)
+   return ret;
+
+   power_mode = mma8452_get_power_mode(data);
+   if (power_mode < 0)
+   return power_mode;
+
+   us = ret * mma8452_time_step_us[power_mode][
+   mma8452_get_odr_index(data)];
+   *val = us / USEC_PER_SEC;
+   *val2 = us % USEC_PER_SEC;
+
return IIO_VAL_INT_PLUS_MICRO;
 
default:
@@ -908,6 +959,24 @@ static int mma8452_write_event_value(struct iio_dev *indio_dev,
 
return mma8452_change_config(data, MMA8452_TRANSIENT_CFG, reg);
 
+   case IIO_EV_INFO_HYSTERESIS:
+   if (!(data->chip_info->all_events & MMA8452_INT_PULSE) ||
+   

Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall

2017-09-24 Thread Marcelo Tosatti
On Sun, Sep 24, 2017 at 09:05:44AM -0400, Paolo Bonzini wrote:
> 
> 
> - Original Message -
> > From: "Peter Zijlstra" 
> > To: "Paolo Bonzini" 
> > Cc: "Marcelo Tosatti" , "Konrad Rzeszutek Wilk" 
> > , mi...@redhat.com,
> > k...@vger.kernel.org, linux-kernel@vger.kernel.org, "Thomas Gleixner" 
> > 
> > Sent: Saturday, September 23, 2017 3:41:14 PM
> > Subject: Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO 
> > hypercall
> > 
> > On Sat, Sep 23, 2017 at 12:56:12PM +0200, Paolo Bonzini wrote:
> > > On 22/09/2017 14:55, Peter Zijlstra wrote:
> > > > You just explained it yourself. If the thread that needs to complete
> > > > what you're waiting on has lower priority, it will _never_ get to run if
> > > > you're busy waiting on it.
> > > > 
> > > > This is _trivial_.
> > > > 
> > > > And even for !RT it can be quite costly, because you can end up having
> > > > to burn your entire slot of CPU time before you run the other task.
> > > > 
> > > > Userspace spinning is _bad_, do not do this.
> > > 
> > > This is not userspace spinning, it is guest spinning---which has
> > > effectively the same effect but you cannot quite avoid.
> > 
> > So I'm virt illiterate and have no clue on how all this works; but
> > wasn't this a vmexit ? (that's what marcelo traced). And once you've
> > done a vmexit you're a regular task again, not a vcpu.
> 
> His trace simply shows that the timer tick happened and the SCHED_NORMAL
> thread was preempted.  Bumping the vCPU thread to SCHED_FIFO drops
> the scheduler tick (the system is NOHZ_FULL) and thus 1) the frequency
> of EXTERNAL_INTERRUPT vmexits drops to 1 second 2) the thread is not
> preempted anymore.
> 
> > > But I agree that the solution is properly prioritizing threads that can
> > > interrupt the VCPU, and using PI mutexes.

That's exactly what the patch does: the prioritization is not fixed in
time, and depends on whether or not vcpu-0 is in a spinlock protected
section.

Are you suggesting a different prioritization? Can you describe it
please, even if incomplete?

> > 
> > Right, if you want to run RT VCPUs the whole emulator/vcpu interaction
> > needs to be designed for RT.
> > 
> > > I'm not a priori opposed to paravirt scheduling primitives, but I am not
> > > at all sure that it's required.
> > 
> > Problem is that the proposed thing doesn't solve anything. There is
> > nothing that prohibits the guest from triggering a vmexit while holding
> > a spinlock and landing in the self-same problems.
> 
> Well, part of configuring virt for RT is (at all levels: host hypervisor+QEMU
> and guest kernel+userspace) is that vmexits while holding a spinlock are 
> either
> confined to one vCPU or are handled in the host hypervisor very quickly, like
> less than 2000 clock cycles.
> 
> So I'm not denying that Marcelo's approach solves the problem, but it's very
> heavyweight and it masks an important misconfiguration (as you write above,
> everything needs to be RT and the priorities must be designed carefully).

I think you are missing the following point:

"vcpu0 can be interrupted when it's not in a spinlock protected section,
otherwise it can't."

So you _have_ to communicate to the host when the guest enters/leaves a
critical section.

So this point of "everything needs to be RT and the priorities must be
designed carefully", is this: 

WHEN in spinlock protected section (more specifically, when 
spinlock protected section _shared with realtime vcpus_),

priority of vcpu0 > priority of emulator thread

OTHERWISE

priority of vcpu0 < priority of emulator thread.

(*)

So emulator thread can interrupt and inject interrupts to vcpu0.

> 
> _However_, even if you do this, you may want to put the less important vCPUs
> and the emulator threads on the same physical CPU.  In that case, the vCPU
> can be placed at SCHED_RR to avoid starvation (while the emulator thread needs
> to stay at SCHED_FIFO and higher priority).  Some kind of trick that bumps
> spinlock critical sections in that vCPU to SCHED_FIFO, for a limited time 
> only,
> might still be useful.

Anything that violates (*) above is going to cause excessive latencies
in realtime vcpus, via:

PCPU-0:
* vcpu-0 grabs spinlock A.
* event wakes up emulator thread, vcpu-0 sched out, emulator thread sched in.
PCPU-1:
* realtime vcpu grabs spinlock-A, busy spins on emulator thread's completion.

So it's more than useful, it's necessary.

I'm open to suggestions of better ways to solve this problem 
while sharing emulator thread with vcpu-0 (which is something users
are interested in, for obvious economical reasons), but:

1) Don't get the point of Peter's rejection.

2) Don't get how SCHED_RR can help the situation.



Re: I/O hangs after resuming from suspend-to-ram

2017-09-24 Thread Ming Lei
On Sun, Sep 24, 2017 at 07:33:00PM +0200, Martin Steigerwald wrote:
> Ming Lei - 21.09.17, 06:17:
> > On Wed, Sep 20, 2017 at 07:25:02PM +0200, Martin Steigerwald wrote:
> > > Ming Lei - 28.08.17, 21:32:
> > > > On Mon, Aug 28, 2017 at 03:10:35PM +0200, Martin Steigerwald wrote:
> > > > > Ming Lei - 28.08.17, 20:58:
> > > > > > On Sun, Aug 27, 2017 at 09:43:52AM +0200, Oleksandr Natalenko wrote:
> > > > > > > Hi.
> > > 
> > > > > > > Here is disk setup for QEMU VM:
> > > […]
> > > 
> > > > > > > In words: 2 virtual disks, RAID10 setup with far-2 layout, LUKS on
> > > > > > > it,
> > > > > > > then
> > > > > > > LVM, then ext4 for boot, swap and btrfs for /.
> > > > > > > 
> > > > > > > I couldn't reproduce the issue with single disk without RAID.
> > > > > > 
> > > > > > Could you verify if the following patch fixes your issue?
> > > > > 
> > > > > Could this also apply to non MD RAID systems? I am using BTRFS RAID
> > > > > 1 with two SSDs. So far with CFQ it runs stable.
> > > > 
> > > > It is for fixing Oleksandr's issue wrt. blk-mq, and looks not for you.
> > > 
> > > My findings are different:
> > > 
> > > On 4.12.10 with CONFIG_HZ=1000, CONFIG_PREEMPT=y and optimizations for
> > > Intel Core/newer Xeon I see this:
> > > 
> > > 1) Running with CFQ: No hang after resume
> > > 
> > > 2) Running with scsi_mod.use_blk_mq=1 + BFQ: Hang after resume within
> > > first 1-2 days.
> > 
> > Hi Martin,
> > 
> > Thanks for your report!
> > 
> > Could you test the following patchset to see if it fixes your issue?
> > 
> > https://marc.info/?l=linux-block=150579298505484=2
> 
> Testing with https://github.com/ming1/linux.git, my_v4.13-safe-scsi-
> quiesce_V5_for_test branch as of 53954fd58fb9fe6894b1635dad1acec96ca51a2f I 
> now have a bit more than three days and 8 hours of uptime without a hang on 
> resume from memory or disk, despite using blk-mq + BFQ.
> 
> So it looks like this patch set fixes the issue for me. To say for sure I'd
> say some more days of testing are needed. But it looks like the hang on
> resume is fixed by this patch set.

Martin, thanks for your test, it is great to see your issue is fixed
by this patchset.

Also I remember that your issue wasn't related to MD, and actually
you were using BTRFS (RAID). Just want to double check with you:
is that true?

-- 
Ming


Re: Linux 4.14-rc2 (bad patch file on kernel.org)

2017-09-24 Thread Randy Dunlap
On 09/24/17 17:03, Linus Torvalds wrote:
> I'm back to my usual Sunday release schedule, and rc2 is out there in
> all the normal places.

Downloading & applying 4.14-rc2 [patch]
from kernel.org (home page) gives me a file that does not apply cleanly to v4.13:

--
|diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
|index 4d81f6d..9deb5a2 100644
|--- a/virt/kvm/kvm_main.c
|+++ b/virt/kvm/kvm_main.c
--
checking file virt/kvm/kvm_main.c
Using Plan A...
patch unexpectedly ends in middle of line
patch:  unexpected end of file in patch


I also notice that the [pgp] signing is not there.  Is that normal?



thanks,
-- 
~Randy


Re: [PATCH] Revert "f2fs: node segment is prior to data segment selected victim"

2017-09-24 Thread Chao Yu
On 2017/9/23 17:02, Yunlong Song wrote:
> This reverts commit b9cd20619e359d199b755543474c3d853c8e3415.
> 
> That patch causes much fewer node segments (which can be used for SSR)
> than before, and in the corner case (e.g. create and delete *.txt files in
> one same directory, there will be very few node segments but many data
> segments), if the reserved free segments are all used up during gc, then
> the write_checkpoint can still flush dentry pages to data ssr segments,
> but will probably fail to flush node pages to node ssr segments, since
> there are not enough node ssr segments left (the left ones are all
> full).

IMO, the greedy algorithm wants to minimize the cost of moving one dirty
segment; our behavior is in accord with the semantics of the algorithm when we
select the victim with the fewest valid blocks. Pengyang's patch tries to adjust
the greedy algorithm to minimize the total number of valid blocks in all
selected victim segments during a whole FGGC cycle, but its algorithm is broken:
if the valid data blocks in the current victim segment do not all belong to
different dnode blocks, our selection may be incorrect.

Anyway, I agree to revert Pengyang's patch first before we get an entire scheme.

BTW, for SSR or LFS selection, there is a trade-off in between: a) SSR-write
costs fewer free segments and moves fewer data/node blocks, but it triggers
random write, which results in bad performance. b) LFS-write costs more free
segments and moves more data/node blocks, but it triggers sequential write,
which results in good performance. So I don't think that the more SSR we
trigger, the lower the latency our FGGC faces.

Thanks,

> 
> So revert this patch to give a fair chance to let node segments remain
> for SSR, which provides more robustness for corner cases.
> 
> Conflicts:
>   fs/f2fs/gc.c
> ---
>  fs/f2fs/gc.c | 12 +---
>  1 file changed, 1 insertion(+), 11 deletions(-)
> 
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index bfe6a8c..f777e07 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -267,16 +267,6 @@ static unsigned int get_cb_cost(struct f2fs_sb_info *sbi, unsigned int segno)
>   return UINT_MAX - ((100 * (100 - u) * age) / (100 + u));
>  }
>  
> -static unsigned int get_greedy_cost(struct f2fs_sb_info *sbi,
> - unsigned int segno)
> -{
> - unsigned int valid_blocks =
> - get_valid_blocks(sbi, segno, true);
> -
> - return IS_DATASEG(get_seg_entry(sbi, segno)->type) ?
> - valid_blocks * 2 : valid_blocks;
> -}
> -
>  static inline unsigned int get_gc_cost(struct f2fs_sb_info *sbi,
>   unsigned int segno, struct victim_sel_policy *p)
>  {
> @@ -285,7 +275,7 @@ static inline unsigned int get_gc_cost(struct f2fs_sb_info *sbi,
>  
>   /* alloc_mode == LFS */
>   if (p->gc_mode == GC_GREEDY)
> - return get_greedy_cost(sbi, segno);
> + return get_valid_blocks(sbi, segno, true);
>   else
>   return get_cb_cost(sbi, segno);
>  }
> 



Re: [PATCH] Input: add support for the Samsung S6SY761 touchscreen

2017-09-24 Thread Andi Shyti
Hi Dmitry,

> > > > +static void s6sy761_report_coordinates(struct s6sy761_data *sdata, u8 
> > > > *event)
> > > > +{
> > > > +   u8 tid = ((event[0] & S6SY761_MASK_TID) >> 2) - 1;
> > > 
> > > Should we make sure that event[0] & S6SY761_MASK_TID is not 0?
> > 
> > I check event[0] already in s6sy761_handle_events (called by the
> > irq handler), if we get here event[0] is for sure positive...
> > 
> > [...]
> > 
> > > > +static void s6sy761_handle_events(struct s6sy761_data *sdata, u8 
> > > > left_event)
> > > > +{
> > > > +   int i;
> > > > +
> > > > +   for (i = 0; i < left_event; i++) {
> > > > +   u8 *event = &sdata->data[i * S6SY761_EVENT_SIZE];
> > > > +   u8 event_id = event[0] & S6SY761_MASK_EID;
> > > > +
> > > > +   if (!event[0])
> > > > +   return;
> >
> > ... exactly here.
> > 
> > '!event[0]' means also to me that there is nothing left,
> > therefore I can discard whatever is next (given that there is
> > something left).
> 
> What happens if you get event[0] == S6SY761_EVENT_ID_COORDINATE? I.e.
> the value is non-zero, but tid component is 0?

Oh, I see what you mean. It shouldn't happen; in any case I can
put it under an 'unlikely' statement.

> > > > +   err = devm_request_threaded_irq(&client->dev, client->irq, NULL,
> > > > +   s6sy761_irq_handler,
> > > > +   IRQF_TRIGGER_LOW | IRQF_ONESHOT,
> > > > +   "s6sy761_irq", sdata);
> > > > +   if (err)
> > > > +   return err;
> > > > +
> > > > +   disable_irq(client->irq);
> > > 
> > > Can you request IRQ after allocating and setting up the input device?
> > > Then you do not need to check for its presence in the interrupt handler.
> > 
> > The reason I do it here is because the x and y are embedded in
> > the device itself. This means that I first need to enable the
> > device, read x and y and then register the input device.
> > 
> > At power up I might expect an interrupt coming, thus I need to
> > check if 'input' is not 'NULL'.
> 
> But you do not need interrupts to read x and y, right? So you can power
> device, create input device, set it up as needed, and then request irq,
> or am I missing something?

OK, all right. I'll do that. I will move the irq request after
the input registration.

Thanks again for your review,
Andi


RE: [PATCH V9 14/15] mmc: cqhci: support for command queue enabled host

2017-09-24 Thread Bough Chen
> -Original Message-
> From: Adrian Hunter [mailto:adrian.hun...@intel.com]
> Sent: Friday, September 22, 2017 8:37 PM
> To: Ulf Hansson 
> Cc: linux-mmc ; linux-block  bl...@vger.kernel.org>; linux-kernel ; Bough
> Chen ; Alex Lemberg ;
> Mateusz Nowak ; Yuliy Izrailov
> ; Jaehoon Chung ;
> Dong Aisheng ; Das Asutosh
> ; Zhangfei Gao ;
> Sahitya Tummala ; Harjani Ritesh
> ; Venu Byravarasu ;
> Linus Walleij ; Shawn Lin  chips.com>; Christoph Hellwig 
> Subject: [PATCH V9 14/15] mmc: cqhci: support for command queue enabled
> host
> 
> From: Venkat Gopalakrishnan 
> 
> This patch adds CMDQ support for command-queue compatible hosts.
> 
> Command queue was added in the eMMC 5.1 specification. This enables the
> controller to process up to 32 requests at a time.
> 
> Adrian Hunter contributed renaming to cqhci, recovery, suspend and resume,
> cqhci_off, cqhci_wait_for_idle, and external timeout handling.
> 
> Signed-off-by: Asutosh Das 
> Signed-off-by: Sujit Reddy Thumma 
> Signed-off-by: Konstantin Dorfman 
> Signed-off-by: Venkat Gopalakrishnan 
> Signed-off-by: Subhash Jadavani 
> Signed-off-by: Ritesh Harjani 
> Signed-off-by: Adrian Hunter 
> ---
>  drivers/mmc/host/Kconfig  |   13 +
>  drivers/mmc/host/Makefile |1 +
>  drivers/mmc/host/cqhci.c  | 1154 +
>  drivers/mmc/host/cqhci.h  |  240 ++
>  4 files changed, 1408 insertions(+)
>  create mode 100644 drivers/mmc/host/cqhci.c
>  create mode 100644 drivers/mmc/host/cqhci.h
> 
> diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
> index 17afe1ad3a03..f2751465bc54 100644
> --- a/drivers/mmc/host/Kconfig
> +++ b/drivers/mmc/host/Kconfig
> @@ -843,6 +843,19 @@ config MMC_SUNXI
> This selects support for the SD/MMC Host Controller on
> Allwinner sunxi SoCs.
> 
> +config MMC_CQHCI
> + tristate "Command Queue Host Controller Interface support"
> + depends on HAS_DMA
> + help
> +   This selects the Command Queue Host Controller Interface (CQHCI)
> +   support present in host controllers of Qualcomm Technologies, Inc
> +   amongst others.
> +   This controller supports eMMC devices with command queue support.
> +
> +   If you have a controller with this interface, say Y or M here.
> +
> +   If unsure, say N.
> +
>  config MMC_TOSHIBA_PCI
>   tristate "Toshiba Type A SD/MMC Card Interface Driver"
>   depends on PCI
> diff --git a/drivers/mmc/host/Makefile b/drivers/mmc/host/Makefile
> index 2b5a8133948d..f01d9915304d 100644
> --- a/drivers/mmc/host/Makefile
> +++ b/drivers/mmc/host/Makefile
> @@ -90,6 +90,7 @@ obj-$(CONFIG_MMC_SDHCI_ST)  += sdhci-st.o
>  obj-$(CONFIG_MMC_SDHCI_MICROCHIP_PIC32)  += sdhci-pic32.o
>  obj-$(CONFIG_MMC_SDHCI_BRCMSTB)  += sdhci-brcmstb.o
>  obj-$(CONFIG_MMC_SDHCI_OMAP) += sdhci-omap.o
> +obj-$(CONFIG_MMC_CQHCI)  += cqhci.o
> 
>  ifeq ($(CONFIG_CB710_DEBUG),y)
>   CFLAGS-cb710-mmc+= -DDEBUG
> diff --git a/drivers/mmc/host/cqhci.c b/drivers/mmc/host/cqhci.c
> new file mode 100644
> index ..eb3c1695b0c7
> --- /dev/null
> +++ b/drivers/mmc/host/cqhci.c
> @@ -0,0 +1,1154 @@
> +/* Copyright (c) 2015, The Linux Foundation. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 and
> + * only version 2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +
> +#include "cqhci.h"
> +
> +#define DCMD_SLOT 31
> +#define NUM_SLOTS 32
> +
> +struct cqhci_slot {
> + struct mmc_request *mrq;
> + unsigned int flags;
> +#define CQHCI_EXTERNAL_TIMEOUT   BIT(0)
> +#define CQHCI_COMPLETED  BIT(1)
> +#define CQHCI_HOST_CRC   BIT(2)
> +#define CQHCI_HOST_TIMEOUT   BIT(3)
> +#define CQHCI_HOST_OTHER BIT(4)
> +};
> +
> +static inline u8 *get_desc(struct cqhci_host *cq_host, u8 tag)
> +{
> +	return cq_host->desc_base + (tag * cq_host->slot_sz);
> +}
> +
> +static inline u8 *get_link_desc(struct cqhci_host *cq_host, u8 tag)
> +{
> +	u8 *desc = get_desc(cq_host, tag);
> +
> +	return desc + cq_host->task_desc_len;
> +}
> +
> +static inline dma_addr_t get_trans_desc_dma(struct cqhci_host *cq_host,
> +					    u8 tag)
> +{
> +	return cq_host->trans_desc_dma_base +
> +		(cq_host->mmc->max_segs * tag *
> +		 cq_host->trans_desc_len);
> +}
> +
> +static inline u8 *get_trans_desc(struct cqhci_host *cq_host, u8 tag)
> +{
> +	return cq_host->trans_desc_base +

Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall

2017-09-24 Thread Marcelo Tosatti
On Fri, Sep 22, 2017 at 02:59:51PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 22, 2017 at 09:36:39AM -0300, Marcelo Tosatti wrote:
> > On Fri, Sep 22, 2017 at 02:31:07PM +0200, Peter Zijlstra wrote:
> > > On Fri, Sep 22, 2017 at 09:16:40AM -0300, Marcelo Tosatti wrote:
> > > > On Fri, Sep 22, 2017 at 12:00:05PM +0200, Peter Zijlstra wrote:
> > > > > On Thu, Sep 21, 2017 at 10:10:41PM -0300, Marcelo Tosatti wrote:
> > > > > > When executing guest vcpu-0 with FIFO:1 priority, which is necessary
> > > > > > to
> > > > > > deal with the following situation:
> > > > > > 
> > > > > > VCPU-0 (housekeeping VCPU)  VCPU-1 (realtime VCPU)
> > > > > > 
> > > > > > raw_spin_lock(A)
> > > > > > interrupted, schedule task T-1  raw_spin_lock(A) (spin)
> > > > > > 
> > > > > > raw_spin_unlock(A)
> > > > > > 
> > > > > > Certain operations must interrupt guest vcpu-0 (see trace below).
> > > > > 
> > > > > Those traces don't make any sense. All they include is kvm_exit and 
> > > > > you
> > > > > can't tell anything from that.
> > > > 
> > > > Hi Peter,
> > > > 
> > > > OK lets describe whats happening:
> > > > 
> > > > With QEMU emulator thread and vcpu-0 sharing a physical CPU
> > > > (which is a request from several NFV customers, to improve
> > > > guest packing), the following occurs when the guest generates 
> > > > the following pattern:
> > > > 
> > > > 1. submit IO.
> > > > 2. busy spin.
> > > 
> > > User-space spinning is a bad idea in general and terminally broken in
> > > a RT setup. Sounds like you need to go fix qemu to not suck.
> > 
> > One can run whatever application they want on the housekeeping
> > vcpus. This is why rteval exists.
> 
> Nobody cares about other tasks. The problem is between the VCPU and
> emulator thread. They get a priority inversion and live-lock because of
> spin-waiting.
> 
> > This is not the realtime vcpu we are talking about.
> 
> You're being confused, its a RT _guest_, all VCPUs _must_ be RT.
> Because, as you ran into, the guest functions as a whole, not as a bunch
> of individual CPUs.
> 
> > We can fix the BIOS, which is hanging now, but userspace can 
> > do whatever it wants, on non realtime vcpus (again, this is why
> > rteval test exists and is used by the -RT community as 
> > a testcase).
> 
> But nobody cares what other tasks on the system do, all you care about
> is that the VCPUs make deterministic forward progress.
> 
> > I haven't understood what is the wrong with the patch? Are you trying
> > to avoid pollution of the spinlock codepath to keep it simple?
> 
> Your patch is voodoo programming. You don't solve the actual problem,
> you try and paper over it.

Priority boosting on a particular section of code is voodoo programming? 




Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO hypercall

2017-09-24 Thread Marcelo Tosatti
On Fri, Sep 22, 2017 at 03:01:41PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 22, 2017 at 09:40:05AM -0300, Marcelo Tosatti wrote:
> 
> > Are you arguing its invalid for the following application to execute on 
> > housekeeping vcpu of a realtime system:
> > 
> > void main(void)
> > {
> > 
> > submit_IO();
> > do {
> >computation();
> > } while (!interrupted());
> > }
> > 
> > Really?
> 
> No. Nobody cares about random crap tasks.

Nobody has control over all the code that runs in userspace, Peter. And not
supporting a valid sequence of steps because it's "crap" (whatever your
definition of crap is) makes no sense.

It might be that someone decides to do the above (I really can't see any
actual reasoning I can follow and agree with in your "it's crap" argument);
this truly seems valid to me.

So let's follow the reasoning steps:

1) "NACK, because you didn't understand the problem".

OK, that's an invalid NACK; you did understand the problem
later, and now your argument is the following.

2) "NACK, because all VCPUs should be SCHED_FIFO all the time".

But consider the existence of this code path from userspace:

  submit_IO();
  do {
 computation();
  } while (!interrupted());

It's a supported code sequence, and it works fine in a non-RT environment.
Therefore it should work in an -RT environment.
Think of any two applications, such as an IO application
and a CPU-bound application. The IO application will be severely
impacted, or never execute, in such a scenario.

Is that combination of tasks "random crap tasks"? (No, it's not, which
makes me think you're just NACKing without giving enough thought to the
problem.)

So please give me some logical reasoning for the NACK (people can live with
it, but it has to be good enough to justify the decreased packing of
guests on pCPUs):

1) "Voodoo programming" (it's hard for me to parse what you mean by
that... do you mean you foresee this style of priority boosting causing
problems in the future? Can you give an example?).

Is there something fundamentally wrong with priority boosting in spinlock
sections, or is this particular style of priority boosting wrong?

2) "Pollution of the kernel code path". That makes sense to me, if that's
what you're concerned about.

3) "Reduction of spinlock performance". That's true, but NFV workloads
don't care about it.

4) "All vcpus should be SCHED_FIFO all the time". OK, why is that?
What dictates that to be true?

What the patch does is the following:
it reduces the window where SCHED_FIFO is applied to vcpu0
to those sections where a spinlock is shared between -RT vcpus and vcpu0
(why: because otherwise, when the emulator thread is sharing a
pCPU with vcpu0, it is unable to deliver interrupts to vcpu0).

And it's being rejected because:
Please fill in.


9f4835fb96 ("x86/fpu: Tighten validation of user-supplied .."): Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

2017-09-24 Thread kernel test robot
Hi Ingo,

On your request I'm resending the report here, with attached dmesg,
kconfig and reproduce script.

I'll go on to test your split up commits, too.

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.x86/fpu

commit 9f4835fb965d8eea7e608d0cb62c246c804dec90
Author: Eric Biggers 
AuthorDate: Fri Sep 22 10:41:55 2017 -0700
Commit: Ingo Molnar 
CommitDate: Sat Sep 23 11:02:00 2017 +0200

 x86/fpu: Tighten validation of user-supplied xstate_header
 
 Move validation of user-supplied xstate_headers into a helper function
 and call it from both the ptrace and sigreturn syscall paths.  The new
 function also considers it to be an error if *any* reserved bits are
 set, whereas before we were just clearing most of them.
 
 This should reduce the chance of bugs that fail to correctly validate
 user-supplied XSAVE areas.  It also will expose any broken userspace
 programs that set the other reserved bits; this is desirable because
 such programs will lose compatibility with future CPUs and kernels if
 those bits are ever used for anything.  (There shouldn't be any such
 programs, and in fact in the case where the compacted format is in use
 we were already validating xfeatures.  But you never know...)
 
 Signed-off-by: Eric Biggers 
 Reviewed-by: Kees Cook 
 Reviewed-by: Rik van Riel 
 Acked-by: Dave Hansen 
 Cc: Andy Lutomirski 
 Cc: Dmitry Vyukov 
 Cc: Fenghua Yu 
 Cc: Kevin Hao 
 Cc: Linus Torvalds 
 Cc: Michael Halcrow 
 Cc: Oleg Nesterov 
 Cc: Peter Zijlstra 
 Cc: Thomas Gleixner 
 Cc: Wanpeng Li 
 Cc: Yu-cheng Yu 
 Cc: kernel-harden...@lists.openwall.com
 Link: http://lkml.kernel.org/r/20170922174156.16780-3-ebigge...@gmail.com
 Signed-off-by: Ingo Molnar 

29ed270cd3  x86/fpu: Don't let userspace set bogus xcomp_bv
9f4835fb96  x86/fpu: Tighten validation of user-supplied xstate_header
8d3e268d89  x86/fpu: Rename fpu__activate_fpstate_read/write() to fpu__read/write()
e7c6e36753  Merge branch 'x86/urgent'
+-----------------------------------------------------------+------------+------------+------------+------------+
|                                                           | 29ed270cd3 | 9f4835fb96 | 8d3e268d89 | e7c6e36753 |
+-----------------------------------------------------------+------------+------------+------------+------------+
| boot_successes                                            | 35         | 2          | 6          | 0          |
| boot_failures                                             | 0          | 13         | 13         | 11         |
| Kernel_panic-not_syncing:Attempted_to_kill_init!exitcode= | 0          | 13         | 13         | 11         |
+-----------------------------------------------------------+------------+------------+------------+------------+

procd: Console is alive
procd: - preinit -
Press the [f] key and hit [enter] to enter failsafe mode
Press the [1], [2], [3] or [4] key and hit [enter] to select the debug level
[   23.975862] init[1] bad frame in sigreturn frame:7fad9e6c ip:77f3bbc6 sp:7fada3fc orax: in libuClibc-0.9.33.2.so[77f31000+4f000]
[   23.977287] Kernel panic - not syncing: Attempted to kill init! exitcode=0x000b
[   23.977287]
[   23.978120] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc1-00218-g9f4835f #1
[   23.978770] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[   23.979681] Call Trace:
[   23.980087]  dump_stack+0x40/0x5e
[   23.980558]  panic+0x1c5/0x58c
[   23.980963]  forget_original_parent+0x1ee/0x843
[   23.981363]  do_exit+0x1087/0x17c6
[   23.981668]  do_group_exit+0x1d1/0x1d1
[   23.982017]  get_signal+0x1294/0x12ca
[   23.982345]  do_signal+0x2c/0x55b
[   23.982643]  ? force_sig_info+0x1bd/0x1d5
[   23.983079]  ? force_sig+0x22/0x32
[   23.983563]  ? signal_fault+0x14b/0x161
[   23.984168]  ? exit_to_usermode_loop+0x2f/0x2ae
[   23.984748]  ? trace_hardirqs_on_caller+0x2d/0x384
[   23.985170]  exit_to_usermode_loop+0xf7/0x2ae
[   23.985554]  do_int80_syscall_32+0x4e8/0x4fe
[   23.985937]  entry_INT80_32+0x2f/0x2f
[   23.986264] EIP: 0x77f3bbc6
[   23.986515] EFLAGS: 0246 CPU: 0
[   23.986851] EAX:  EBX: 0003 ECX: 77fb9554 EDX: 000a
[   23.987385] ESI:  EDI: 7fada55c EBP: 7fada468 ESP: 7fada3fc
[   23.987925]  DS: 007b ES: 007b FS:  GS:  SS: 007b
[   23.988462] Kernel Offset: disabled
   # HH:MM RESULT GOOD BAD GOOD_BUT_DIRTY DIRTY_NOT_BAD
git bisect start f8fce8fa419bb00ed5a5d6e91abe6dbed75f5842 2bd6bf03f4c1c59381d62c61d03f6cc3fe71f66e --
git bisect good 330ac28434f18e4dfc62985e9d2ed5119c224781  # 23:44  G 11  00  0  Merge 'rdma/k.o/net-next-base' into devel-spot-201709232001
git bisect good 2cf018879b36a0d3681086cfc1c08c6cc9bef52a  # 00:58  G 11  00  0  Merge

Re: [PATCH net-next RFC 2/5] vhost: introduce helper to prefetch desc index

2017-09-24 Thread Jason Wang



On 2017-09-22 17:02, Stefan Hajnoczi wrote:

On Fri, Sep 22, 2017 at 04:02:32PM +0800, Jason Wang wrote:

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f87ec75..8424166d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2437,6 +2437,61 @@ struct vhost_msg_node *vhost_dequeue_msg(struct 
vhost_dev *dev,
  }
  EXPORT_SYMBOL_GPL(vhost_dequeue_msg);
  
+int vhost_prefetch_desc_indices(struct vhost_virtqueue *vq,

+   struct vring_used_elem *heads,
+   u16 num, bool used_update)

Missing doc comment.


Will fix this.




+{
+   int ret, ret2;
+   u16 last_avail_idx, last_used_idx, total, copied;
+   __virtio16 avail_idx;
+   struct vring_used_elem __user *used;
+   int i;

The following variable names are a little confusing:

last_avail_idx vs vq->last_avail_idx.  last_avail_idx is a wrapped
avail->ring[] index, vq->last_avail_idx is a free-running counter.  The
same for last_used_idx vs vq->last_used_idx.

num argument vs vq->num.  The argument could be called nheads instead to
make it clear that this is heads[] and not the virtqueue size.

Not a bug but it took me a while to figure out what was going on.


I admit the name is confusing. Let me try better ones in V2.

Thanks


Re: [PATCH net-next RFC 1/5] vhost: split out ring head fetching logic

2017-09-24 Thread Jason Wang



On 2017-09-22 16:31, Stefan Hajnoczi wrote:

On Fri, Sep 22, 2017 at 04:02:31PM +0800, Jason Wang wrote:

+/* This looks in the virtqueue and for the first available buffer, and converts
+ * it to an iovec for convenient access.  Since descriptors consist of some
+ * number of output then some number of input descriptors, it's actually two
+ * iovecs, but we pack them into one and note how many of each there were.
+ *
+ * This function returns the descriptor number found, or vq->num (which is
+ * never a valid descriptor number) if none was found.  A negative code is
+ * returned on error. */
+int __vhost_get_vq_desc(struct vhost_virtqueue *vq,
+   struct iovec iov[], unsigned int iov_size,
+   unsigned int *out_num, unsigned int *in_num,
+   struct vhost_log *log, unsigned int *log_num,
+   __virtio16 head)

[...]

+int vhost_get_vq_desc(struct vhost_virtqueue *vq,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num)

Please document vhost_get_vq_desc().

Please also explain the difference between __vhost_get_vq_desc() and
vhost_get_vq_desc() in the documentation.


Right, will document this in next version.

Thanks



Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov

2017-09-24 Thread Vitaly Mayatskikh
On Sun, 24 Sep 2017 10:27:39 -0400,
Al Viro wrote:

> BTW, there's something fishy in bio_copy_user_iov().  If the area we'd asked for
> had been too large for a single bio, we are going to create a bio and have
>  bio_add_pc_page() eventually fill it up to limit.  Then we return into
> __blk_rq_map_user_iov(), advance iter and call bio_copy_user_iov() again.
> Fine, but... now we might have non-zero iter->iov_offset.  And this
> bmd->is_our_pages = map_data ? 0 : 1;
> memcpy(bmd->iov, iter->iov, sizeof(struct iovec) * iter->nr_segs);
> iov_iter_init(&bmd->iter, iter->type, bmd->iov,
> iter->nr_segs, iter->count);
> does not even look at iter->iov_offset.  As the result, when it gets to
> bio_uncopy_user(), we copy the data from each bio into the *beginning* of
> the user area, overwriting that from the other bio.

Yeah, something is wrong with bio_copy_user_iov. Our datapath hangs when IO
flows through unmodified SG (it forces bio_copy if iov_count is set). I did not
look at details, but the same IO pattern and memory layout work well with
bio_map (modulo the refcount problem).

> Anyway, I'd added the obvious fix to #work.iov_iter, reordered it and
> force-pushed the result.

I'll give it a try, thanks!
-- 
wbr, Vitaly



ffsb job does not exit on xfs 4.14-rc1+

2017-09-24 Thread Xiong Zhou
Hi,

ffsb test won't exit (it hangs as shown below) on Linus tree 4.14-rc1+.
Latest commit cd4175b11685

This does not happen on v4.13

Thanks,

1  1505  Ss   0   0:00 /usr/sbin/sshd -D
 1505  1752  Ss   0   0:00  \_ sshd: root [priv]
 1752  1762  S0   0:00  |   \_ sshd: root@pts/0
 1762  1763  Ss   0   0:00  |   \_ -bash
 1763  8706  S+   0   0:00  |   \_ /bin/bash -x ./run.sh --daxoff ffsb
 8706 10044  S+   0   0:00  |   \_ /bin/bash -x ./ffsb.sh xfs /dev/pmem0 /daxmnt 8099
10044 10053  S+   0   0:00  |   \_ make run
10053 10056  S+   0   0:00  |   \_ /bin/bash -x ./runtest.sh
10056 10167  Sl+  0 21171969:20  |   \_ ffsb large_file_creates_threads_192.ffsb
10056 10168  S+   0   0:00  |   \_ tee large_file_creates_threads_192.ffsb.out
 

sh-4.4# xfs_info /daxmnt/
meta-data=/dev/pmem0             isize=512    agcount=4, agsize=524288 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=2097152, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

sh-4.4# uname -r
4.14.0-rc1-master-cd4175b11685+

sh-4.4# xfs_io -V
xfs_io version 4.10.0

sh-4.4# ffsb -V
FFSB version 6.0-RC2 started
-V: No such file or directory


# cat large_file_creates_threads_192.ffsb

# Large file creates
# Creating 1 GB files.

time=300
alignio=1
directio=0

[filesystem0]
location=__SCRATCH_MNT__

# All created files will be 1 GB.
min_filesize=1GB
max_filesize=1GB
[end0]

[threadgroup0]
num_threads=192

create_weight=1

write_blocksize=4KB

[stats]
enable_stats=1
enable_range=1

msec_range0.00  0.01
msec_range0.01  0.02
msec_range0.02  0.05
msec_range0.05  0.10
msec_range0.10  0.20
msec_range0.20  0.50
msec_range0.50  1.00
msec_range1.00  2.00
msec_range2.00  5.00
msec_range5.00 10.00
msec_range   10.00 20.00
msec_range   20.00 50.00
msec_range   50.00100.00
msec_range  100.00200.00
msec_range  200.00500.00
msec_range  500.00   1000.00
msec_range 1000.00   2000.00
msec_range 2000.00   5000.00
msec_range 5000.00 10000.00
[end]
[end0]



Re: [PATCH v2] mm, sysctl: make VM stats configurable

2017-09-24 Thread Huang, Ying
Kemi Wang  writes:

> This is the second step, which introduces a tunable interface that allows VM
> stats to be configured, for optimizing zone_statistics(), as suggested by Dave
> Hansen and Ying Huang.
>
> ===
> When performance becomes a bottleneck and you can tolerate some possible
> tool breakage and some decreased counter precision (e.g. numa counter), you
> can do:
>   echo [C|c]oarse > /proc/sys/vm/vmstat_mode
> In this case, numa counter update is ignored. We can see about
> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
> memory).
>
> Benchmark link provided by Jesper D Brouer(increase loop times to
> 1000):
> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
> bench
>
> ===
> When performance is not a bottleneck and you want all tooling to work,
> you

When page allocation performance isn't a bottleneck ...

> can do:
>   echo [S|s]trict > /proc/sys/vm/vmstat_mode
>
> ===
> We recommend automatic detection of virtual memory statistics by system,
> this is also system default configuration, you can do:
>   echo [A|a]uto > /proc/sys/vm/vmstat_mode
> In this case, VM statistics are detected automatically: numa counter updates
> are skipped unless the counters have been read by users at least once, e.g.
> via cat /proc/zoneinfo.
>
> Therefore, with different VM stats mode, numa counters update can operate
> differently so that everybody can benefit.
>
> Many thanks to Michal Hocko and Dave Hansen for comments to help improve
> the original patch.
>
> ChangeLog:
>   Since V1->V2:
>   a) Merge to one patch;
>   b) Use jump label to eliminate the overhead of branch selection;
>   c) Add a single-time log message at boot time to help tell users what
>   happened.
>
> Reported-by: Jesper Dangaard Brouer 
> Suggested-by: Dave Hansen 
> Suggested-by: Ying Huang 
> Signed-off-by: Kemi Wang 
> ---
>  Documentation/sysctl/vm.txt |  26 +
>  drivers/base/node.c |   2 +
>  include/linux/vmstat.h  |  22 
>  init/main.c |   2 +
>  kernel/sysctl.c |   7 +++
>  mm/page_alloc.c |  14 +
>  mm/vmstat.c | 126 
> 
>  7 files changed, 199 insertions(+)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 9baf66a..6ab2843 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm:
>  - swappiness
>  - user_reserve_kbytes
>  - vfs_cache_pressure
> +- vmstat_mode
>  - watermark_scale_factor
>  - zone_reclaim_mode
>  
> @@ -843,6 +844,31 @@ ten times more freeable objects than there are.
>  
>  =
>  
> +vmstat_mode
> +
> +This interface makes virtual memory statistics configurable.
> +
> +When performance becomes a bottleneck and you can tolerate some possible
> +tool breakage and some decreased counter precision (e.g. numa counter), you
> +can do:
> + echo [C|c]oarse > /proc/sys/vm/vmstat_mode
> +ignorable statistics list:
> +- numa counters
> +
> +When performance is not a bottleneck and you want all tooling to work, you
> +can do:
> + echo [S|s]trict > /proc/sys/vm/vmstat_mode
> +
> +We recommend automatic detection of virtual memory statistics by system,
> +this is also system default configuration, you can do:
> + echo [A|a]uto > /proc/sys/vm/vmstat_mode
> +
> +E.g. numa statistics do not affect the system's decisions and are very
> +rarely consumed. If vmstat_mode = auto is set, numa counter updates are
> +skipped unless the counter is *read* by users at least once.
> +
> +==
> +
>  watermark_scale_factor:
>  
>  This factor controls the aggressiveness of kswapd. It defines the
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 3855902..033c0c3 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -153,6 +153,7 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, NULL);
>  static ssize_t node_read_numastat(struct device *dev,
>   struct device_attribute *attr, char *buf)
>  {
> + disable_zone_statistics = false;
>   return sprintf(buf,
>  "numa_hit %lu\n"
>  "numa_miss %lu\n"
> @@ -194,6 +195,7 @@ static ssize_t node_read_vmstat(struct device *dev,
>NR_VM_NUMA_STAT_ITEMS],
>node_page_state(pgdat, i));
>  
> + disable_zone_statistics = false;
>   return n;
>  }
>  static DEVICE_ATTR(vmstat, S_IRUGO, 

[PATCH review for 4.9 15/50] sched/fair: Update rq clock before changing a task's CPU affinity

2017-09-24 Thread Levin, Alexander (Sasha Levin)
From: Wanpeng Li 

[ Upstream commit a499c3ead88ccf147fc50689e85a530ad923ce36 ]

This is triggered during boot when CONFIG_SCHED_DEBUG is enabled:

 [ cut here ]
 WARNING: CPU: 6 PID: 81 at kernel/sched/sched.h:812 set_next_entity+0x11d/0x380
 rq->clock_update_flags < RQCF_ACT_SKIP
 CPU: 6 PID: 81 Comm: torture_shuffle Not tainted 4.10.0+ #1
 Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
 Call Trace:
  dump_stack+0x85/0xc2
  __warn+0xcb/0xf0
  warn_slowpath_fmt+0x5f/0x80
  set_next_entity+0x11d/0x380
  set_curr_task_fair+0x2b/0x60
  do_set_cpus_allowed+0x139/0x180
  __set_cpus_allowed_ptr+0x113/0x260
  set_cpus_allowed_ptr+0x10/0x20
  torture_shuffle+0xfd/0x180
  kthread+0x10f/0x150
  ? torture_shutdown_init+0x60/0x60
  ? kthread_create_on_node+0x60/0x60
  ret_from_fork+0x31/0x40
 ---[ end trace dd94d92344cea9c6 ]---

The task is running && !queued, so there is no rq clock update before calling
set_curr_task().

This patch fixes it by updating the rq clock after taking rq->lock/pi_lock,
just as the other dequeue + put_prev + enqueue + set_curr paths do.

Signed-off-by: Wanpeng Li 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Matt Fleming 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1487749975-5994-1-git-send-email-wanpeng...@hotmail.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin 
---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2098954c690f..d271a1c387eb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1141,6 +1141,7 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
int ret = 0;
 
rq = task_rq_lock(p, &rf);
+   update_rq_clock(rq);
 
if (p->flags & PF_KTHREAD) {
/*
-- 
2.11.0


[PATCH review for 4.9 29/50] qede: Prevent index problems in loopback test

2017-09-24 Thread Levin, Alexander (Sasha Levin)
From: Sudarsana Reddy Kalluru 

[ Upstream commit afe981d664aeeebc8d1bcbd7d2070b5432edaecb ]

Driver currently utilizes the same loop variable in two
nested loops.

Signed-off-by: Sudarsana Reddy Kalluru 
Signed-off-by: Yuval Mintz 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
---
 drivers/net/ethernet/qlogic/qede/qede_ethtool.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
index 7567cc464b88..634e4149af22 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_ethtool.c
@@ -1221,7 +1221,7 @@ static int qede_selftest_receive_traffic(struct qede_dev *edev)
struct qede_rx_queue *rxq = NULL;
struct sw_rx_data *sw_rx_data;
union eth_rx_cqe *cqe;
-   int i, rc = 0;
+   int i, iter, rc = 0;
u8 *data_ptr;
 
for_each_queue(i) {
@@ -1240,7 +1240,7 @@ static int qede_selftest_receive_traffic(struct qede_dev *edev)
 * enabled. This is because the queue 0 is configured as the default
 * queue and that the loopback traffic is not IP.
 */
-   for (i = 0; i < QEDE_SELFTEST_POLL_COUNT; i++) {
+   for (iter = 0; iter < QEDE_SELFTEST_POLL_COUNT; iter++) {
if (!qede_has_rx_work(rxq)) {
usleep_range(100, 200);
continue;
@@ -1287,7 +1287,7 @@ static int qede_selftest_receive_traffic(struct qede_dev *edev)
qed_chain_recycle_consumed(&rxq->rx_comp_ring);
}
 
-   if (i == QEDE_SELFTEST_POLL_COUNT) {
+   if (iter == QEDE_SELFTEST_POLL_COUNT) {
DP_NOTICE(edev, "Failed to receive the traffic\n");
return -1;
}
-- 
2.11.0



[PATCH review for 4.9 14/50] f2fs: do SSR for data when there is enough free space

2017-09-24 Thread Levin, Alexander (Sasha Levin)
From: Yunlong Song 

[ Upstream commit 035e97adab26c1121cedaeb9bd04cf48a8e8cf51 ]

In allocate_segment_by_default(), need_SSR() already detected it's time to do
SSR. So, let's try to find victims for data segments more aggressively in time.

Signed-off-by: Yunlong Song 
Signed-off-by: Jaegeuk Kim 
Signed-off-by: Sasha Levin 
---
 fs/f2fs/segment.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 74a2b06d..e10f61684ea4 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -1263,7 +1263,7 @@ static int get_ssr_segment(struct f2fs_sb_info *sbi, int type)
struct curseg_info *curseg = CURSEG_I(sbi, type);
const struct victim_selection *v_ops = DIRTY_I(sbi)->v_ops;
 
-   if (IS_NODESEG(type) || !has_not_enough_free_secs(sbi, 0, 0))
+   if (IS_NODESEG(type))
return v_ops->get_victim(sbi,
&(curseg)->next_segno, BG_GC, type, SSR);
 
-- 
2.11.0


[PATCH review for 4.9 26/50] ASoC: mediatek: add I2C dependency for CS42XX8

2017-09-24 Thread Levin, Alexander (Sasha Levin)
From: Arnd Bergmann 

[ Upstream commit 72cedf599fcebfd6cd2550274d7855838068d28c ]

We should not select drivers that depend on I2C when that is disabled,
as it results in a build error:

warning: (SND_SOC_MT2701_CS42448) selects SND_SOC_CS42XX8_I2C which has unmet direct dependencies (SOUND && !M68K && !UML && SND && SND_SOC && I2C)
sound/soc/codecs/cs42xx8-i2c.c:60:1: warning: data definition has no type or storage class
 module_i2c_driver(cs42xx8_i2c_driver);
sound/soc/codecs/cs42xx8-i2c.c:60:1: error: type defaults to 'int' in declaration of 'module_i2c_driver' [-Werror=implicit-int]

Fixes: 1f458d53f76c ("ASoC: mediatek: Add mt2701-cs42448 driver and config option.")
Signed-off-by: Arnd Bergmann 
Signed-off-by: Mark Brown 
Signed-off-by: Sasha Levin 
---
 sound/soc/mediatek/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sound/soc/mediatek/Kconfig b/sound/soc/mediatek/Kconfig
index 05cf809cf9e1..d7013bde6f45 100644
--- a/sound/soc/mediatek/Kconfig
+++ b/sound/soc/mediatek/Kconfig
@@ -13,7 +13,7 @@ config SND_SOC_MT2701
 
 config SND_SOC_MT2701_CS42448
tristate "ASoc Audio driver for MT2701 with CS42448 codec"
-   depends on SND_SOC_MT2701
+   depends on SND_SOC_MT2701 && I2C
select SND_SOC_CS42XX8_I2C
select SND_SOC_BT_SCO
help
-- 
2.11.0



