Re: Is Documentation/networking/phy.txt still up-to-date?

2016-11-09 Thread Florian Fainelli
On 11/09/2016 05:24 AM, Sebastian Frias wrote:
> Hi,
> 
> Documentation/networking/phy.txt discusses phy_connect and states that:
> 
>  "...
> 
>  interface is a u32 which specifies the connection type used
>  between the controller and the PHY.  Examples are GMII, MII,
>  RGMII, and SGMII.  For a full list, see include/linux/phy.h
> 
>  Now just make sure that phydev->supported and phydev->advertising have any
>  values pruned from them which don't make sense for your controller (a 10/100
>  controller may be connected to a gigabit capable PHY, so you would need to
>  mask off SUPPORTED_1000baseT*).  See include/linux/ethtool.h for definitions
>  for these bitfields. Note that you should not SET any bits, or the PHY may
>  get put into an unsupported state.
> 
>  ..."
> 
> However, 'drivers/net/ethernet/aurora/nb8800.c' for example, does SETs some
> bits (in function 'nb8800_pause_adv').

All pause/flow control related bits should be set by the Ethernet MAC
driver because this is an Ethernet MAC, not PHY, thing. See this
discussion for some details:

https://www.mail-archive.com/netdev@vger.kernel.org/msg135347.html

So the nb8800 drivers does the correct thing here, but the documentation
should be updated to reflect that this applies to all bits, except the
Pause capabilities because these need to come from the Ethernet MAC.

> 
> I checked 'drivers/net/ethernet/broadcom/genet/bcmmii.c' and that one CLEARs
> bits (as per the documentation).
> 
> Does anybody knows what is the correct/recommended approach?

Both drivers do correct things, they just don't set the same things here.
-- 
Florian


Re: Is Documentation/networking/phy.txt still up-to-date?

2016-11-09 Thread Florian Fainelli
On 11/09/2016 05:24 AM, Sebastian Frias wrote:
> Hi,
> 
> Documentation/networking/phy.txt discusses phy_connect and states that:
> 
>  "...
> 
>  interface is a u32 which specifies the connection type used
>  between the controller and the PHY.  Examples are GMII, MII,
>  RGMII, and SGMII.  For a full list, see include/linux/phy.h
> 
>  Now just make sure that phydev->supported and phydev->advertising have any
>  values pruned from them which don't make sense for your controller (a 10/100
>  controller may be connected to a gigabit capable PHY, so you would need to
>  mask off SUPPORTED_1000baseT*).  See include/linux/ethtool.h for definitions
>  for these bitfields. Note that you should not SET any bits, or the PHY may
>  get put into an unsupported state.
> 
>  ..."
> 
> However, 'drivers/net/ethernet/aurora/nb8800.c' for example, does SETs some
> bits (in function 'nb8800_pause_adv').

All pause/flow control related bits should be set by the Ethernet MAC
driver because this is an Ethernet MAC, not PHY, thing. See this
discussion for some details:

https://www.mail-archive.com/netdev@vger.kernel.org/msg135347.html

So the nb8800 drivers does the correct thing here, but the documentation
should be updated to reflect that this applies to all bits, except the
Pause capabilities because these need to come from the Ethernet MAC.

> 
> I checked 'drivers/net/ethernet/broadcom/genet/bcmmii.c' and that one CLEARs
> bits (as per the documentation).
> 
> Does anybody knows what is the correct/recommended approach?

Both drivers do correct things, they just don't set the same things here.
-- 
Florian


[PATCH v7] powerpc: Do not make the entire heap executable

2016-11-09 Thread Denys Vlasenko
On 32-bit powerpc the ELF PLT sections of binaries (built with --bss-plt,
or with a toolchain which defaults to it) look like this:

  [17] .sbss NOBITS  0002aff8 01aff8 14 00  WA  0   0  4
  [18] .plt  NOBITS  0002b00c 01aff8 84 00 WAX  0   0  4
  [19] .bss  NOBITS  0002b090 01aff8 a4 00  WA  0   0  4

Which results in an ELF load header:

  Type   Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD   0x019c70 0x00029c70 0x00029c70 0x01388 0x014c4 RWE 0x1

This is all correct, the load region containing the PLT is marked as
executable. Note that the PLT starts at 0002b00c but the file mapping ends at
0002aff8, so the PLT falls in the 0 fill section described by the load header,
and after a page boundary.

Unfortunately the generic ELF loader ignores the X bit in the load headers
when it creates the 0 filled non-file backed mappings. It assumes all of these
mappings are RW BSS sections, which is not the case for PPC.

gcc/ld has an option (--secure-plt) to not do this, this is said to incur
a small performance penalty.

Currently, to support 32-bit binaries with PLT in BSS kernel maps *entire
brk area* with executable rights for all binaries, even --secure-plt ones.

Stop doing that.

Teach the ELF loader to check the X bit in the relevant load header
and create 0 filled anonymous mappings that are executable
if the load header requests that.

The patch was originally posted in 2012 by Jason Gunthorpe
and apparently ignored:

https://lkml.org/lkml/2012/9/30/138

Lightly run-tested.

Signed-off-by: Jason Gunthorpe 
Signed-off-by: Denys Vlasenko 
Acked-by: Kees Cook 
Acked-by: Michael Ellerman 
Tested-by: Jason Gunthorpe 
CC: Andrew Morton 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: "Aneesh Kumar K.V" 
CC: Kees Cook 
CC: Oleg Nesterov 
CC: Michael Ellerman 
CC: Florian Weimer 
CC: linux...@kvack.org
CC: linuxppc-...@lists.ozlabs.org
CC: linux-kernel@vger.kernel.org
---
Changes since v6:
* rebased to current Linus tree
* sending to akpm

Changes since v5:
* made do_brk_flags() error out if any bits other than VM_EXEC are set.
  (Kees Cook: "With this, I'd be happy to Ack.")
  See https://patchwork.ozlabs.org/patch/661595/

Changes since v4:
* if (current->personality & READ_IMPLIES_EXEC), still use VM_EXEC
  for 32-bit executables.

Changes since v3:
* typo fix in commit message
* rebased to current Linus tree

Changes since v2:
* moved capability to map with VM_EXEC into vm_brk_flags()

Changes since v1:
* wrapped lines to not exceed 79 chars
* improved comment
* expanded CC list

 arch/powerpc/include/asm/page.h |  4 +++-
 fs/binfmt_elf.c | 30 ++
 include/linux/mm.h  |  1 +
 mm/mmap.c   | 24 +++-
 4 files changed, 45 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 56398e7..17d3d2c 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -230,7 +230,9 @@ extern long long virt_phys_offset;
  * and needs to be executable.  This means the whole heap ends
  * up being executable.
  */
-#define VM_DATA_DEFAULT_FLAGS32(VM_READ | VM_WRITE | VM_EXEC | \
+#define VM_DATA_DEFAULT_FLAGS32 \
+   (((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0) | \
+VM_READ | VM_WRITE | \
 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
 
 #define VM_DATA_DEFAULT_FLAGS64(VM_READ | VM_WRITE | \
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 2472af2..065134b 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -91,12 +91,18 @@ static struct linux_binfmt elf_format = {
 
 #define BAD_ADDR(x) ((unsigned long)(x) >= TASK_SIZE)
 
-static int set_brk(unsigned long start, unsigned long end)
+static int set_brk(unsigned long start, unsigned long end, int prot)
 {
start = ELF_PAGEALIGN(start);
end = ELF_PAGEALIGN(end);
if (end > start) {
-   int error = vm_brk(start, end - start);
+   /*
+* Map the last of the bss segment.
+* If the header is requesting these pages to be
+* executable, honour that (ppc32 needs this).
+*/
+   int error = vm_brk_flags(start, end - start,
+   prot & PROT_EXEC ? VM_EXEC : 0);
if (error)
return error;
}
@@ -524,6 +530,7 @@ static unsigned long load_elf_interp(struct elfhdr 
*interp_elf_ex,
unsigned long 

[PATCH v7] powerpc: Do not make the entire heap executable

2016-11-09 Thread Denys Vlasenko
On 32-bit powerpc the ELF PLT sections of binaries (built with --bss-plt,
or with a toolchain which defaults to it) look like this:

  [17] .sbss NOBITS  0002aff8 01aff8 14 00  WA  0   0  4
  [18] .plt  NOBITS  0002b00c 01aff8 84 00 WAX  0   0  4
  [19] .bss  NOBITS  0002b090 01aff8 a4 00  WA  0   0  4

Which results in an ELF load header:

  Type   Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD   0x019c70 0x00029c70 0x00029c70 0x01388 0x014c4 RWE 0x1

This is all correct, the load region containing the PLT is marked as
executable. Note that the PLT starts at 0002b00c but the file mapping ends at
0002aff8, so the PLT falls in the 0 fill section described by the load header,
and after a page boundary.

Unfortunately the generic ELF loader ignores the X bit in the load headers
when it creates the 0 filled non-file backed mappings. It assumes all of these
mappings are RW BSS sections, which is not the case for PPC.

gcc/ld has an option (--secure-plt) to not do this, this is said to incur
a small performance penalty.

Currently, to support 32-bit binaries with PLT in BSS kernel maps *entire
brk area* with executable rights for all binaries, even --secure-plt ones.

Stop doing that.

Teach the ELF loader to check the X bit in the relevant load header
and create 0 filled anonymous mappings that are executable
if the load header requests that.

The patch was originally posted in 2012 by Jason Gunthorpe
and apparently ignored:

https://lkml.org/lkml/2012/9/30/138

Lightly run-tested.

Signed-off-by: Jason Gunthorpe 
Signed-off-by: Denys Vlasenko 
Acked-by: Kees Cook 
Acked-by: Michael Ellerman 
Tested-by: Jason Gunthorpe 
CC: Andrew Morton 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: "Aneesh Kumar K.V" 
CC: Kees Cook 
CC: Oleg Nesterov 
CC: Michael Ellerman 
CC: Florian Weimer 
CC: linux...@kvack.org
CC: linuxppc-...@lists.ozlabs.org
CC: linux-kernel@vger.kernel.org
---
Changes since v6:
* rebased to current Linus tree
* sending to akpm

Changes since v5:
* made do_brk_flags() error out if any bits other than VM_EXEC are set.
  (Kees Cook: "With this, I'd be happy to Ack.")
  See https://patchwork.ozlabs.org/patch/661595/

Changes since v4:
* if (current->personality & READ_IMPLIES_EXEC), still use VM_EXEC
  for 32-bit executables.

Changes since v3:
* typo fix in commit message
* rebased to current Linus tree

Changes since v2:
* moved capability to map with VM_EXEC into vm_brk_flags()

Changes since v1:
* wrapped lines to not exceed 79 chars
* improved comment
* expanded CC list

 arch/powerpc/include/asm/page.h |  4 +++-
 fs/binfmt_elf.c | 30 ++
 include/linux/mm.h  |  1 +
 mm/mmap.c   | 24 +++-
 4 files changed, 45 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 56398e7..17d3d2c 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -230,7 +230,9 @@ extern long long virt_phys_offset;
  * and needs to be executable.  This means the whole heap ends
  * up being executable.
  */
-#define VM_DATA_DEFAULT_FLAGS32(VM_READ | VM_WRITE | VM_EXEC | \
+#define VM_DATA_DEFAULT_FLAGS32 \
+   (((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0) | \
+VM_READ | VM_WRITE | \
 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
 
 #define VM_DATA_DEFAULT_FLAGS64(VM_READ | VM_WRITE | \
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 2472af2..065134b 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -91,12 +91,18 @@ static struct linux_binfmt elf_format = {
 
 #define BAD_ADDR(x) ((unsigned long)(x) >= TASK_SIZE)
 
-static int set_brk(unsigned long start, unsigned long end)
+static int set_brk(unsigned long start, unsigned long end, int prot)
 {
start = ELF_PAGEALIGN(start);
end = ELF_PAGEALIGN(end);
if (end > start) {
-   int error = vm_brk(start, end - start);
+   /*
+* Map the last of the bss segment.
+* If the header is requesting these pages to be
+* executable, honour that (ppc32 needs this).
+*/
+   int error = vm_brk_flags(start, end - start,
+   prot & PROT_EXEC ? VM_EXEC : 0);
if (error)
return error;
}
@@ -524,6 +530,7 @@ static unsigned long load_elf_interp(struct elfhdr 
*interp_elf_ex,
unsigned long load_addr = 0;
int load_addr_set = 0;
unsigned long last_bss = 0, elf_bss = 0;
+   int bss_prot = 0;
unsigned long error = ~0UL;
unsigned long total_size;
int i;
@@ -606,8 +613,10 @@ static unsigned long load_elf_interp(struct elfhdr 
*interp_elf_ex,
 * 

Re: [PATCH] net/mlx4_en: Fix bpf_prog_add ref_cnt in mlx4

2016-11-09 Thread Brenden Blanco
On Wed, Nov 09, 2016 at 10:57:32AM +0100, Daniel Borkmann wrote:
> On 11/09/2016 10:45 AM, Zhiyi Sun wrote:
> >On Wed, Nov 09, 2016 at 10:05:31AM +0100, Daniel Borkmann wrote:
> >>On 11/09/2016 08:35 AM, Zhiyi Sun wrote:
> >>>There are rx_ring_num queues. Each queue will load xdp prog. So
> >>>bpf_prog_add() should add rx_ring_num to ref_cnt.
> >>>
> >>>Signed-off-by: Zhiyi Sun 
> >>
> >>Your analysis looks incorrect to me. Please elaborate in more detail why
> >>you think current code is buggy ...
> >
> >Yes, you are correct. My patch is incorrect. It is not a bug.
> >
> >>Call path is dev_change_xdp_fd(), which does bpf_prog_get_type() on the
> >>fd. This already takes a ref and only drops it in case of error. Thus
> >>in mlx4_xdp_set(), you only need priv->rx_ring_num - 1 refs for the rest
> >>of the rings, so that dropping refs from old_prog makes sure we release
> >>it again. Looks correct to me (maybe a comment would have helped there).
> >
> >I thought mlx4's code is incorrect because in mlx5's driver, function
> >mlx5e_xdp_set() calls a pair of bpf_prog_add/put, the number of add and
> >put to the refs are same. I didn't notice that one "add" has been called in 
> >its
> >calller. So, it seems that mlx5's code is incorrect, right?
> 
> Yep, I think the two attached patches are needed.
> 
> The other thing I noticed in mlx5e_create_rq() is that it calls
> bpf_prog_add(rq->xdp_prog, 1) without actually checking for errors.

> From d2bd6b3cd8636716a06b0ea3b1e041e16f87cce0 Mon Sep 17 00:00:00 2001
> Message-Id: 
> 
> From: Daniel Borkmann 
> Date: Wed, 9 Nov 2016 10:31:19 +0100
> Subject: [PATCH net-next 1/2] bpf, mlx4: fix prog refcount in 
> mlx4_en_try_alloc_resources error path
> 
> Commit 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings
> scheme") added a bug in that the prog's reference count is not dropped
> in the error path when mlx4_en_try_alloc_resources() is failing.
> 
> We previously took bpf_prog_add(prog, priv->rx_ring_num - 1), that we
> need to release again. Earlier in the call-path, dev_change_xdp_fd()
> itself holds a ref to the prog as well, which is then released though
> bpf_prog_put() due to the propagated error.
> 
> Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
> Signed-off-by: Daniel Borkmann 
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  5 -
>  include/linux/bpf.h|  1 +
>  kernel/bpf/syscall.c   | 11 +++
>  3 files changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 0f6225c..4104aec 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -2747,8 +2747,11 @@ static int mlx4_xdp_set(struct net_device *dev, struct 
> bpf_prog *prog)
>   }
>  
>   err = mlx4_en_try_alloc_resources(priv, tmp, _prof);
> - if (err)
> + if (err) {
> + if (prog)
> + bpf_prog_add_undo(prog, priv->rx_ring_num - 1);
Why not just move the above bpf_prog_add to be below the try_alloc?
Nobody needs those references until all of the resources have been
allocated, and then we can remove the need for bpf_prog_add_undo.
>   goto unlock_out;
> + }
>  
>   if (priv->port_up) {
>   port_up = 1;
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index edcd96d..4f6a4f1 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -234,6 +234,7 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void 
> *meta, u64 meta_size,
>  struct bpf_prog *bpf_prog_get(u32 ufd);
>  struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type);
>  struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i);
> +void bpf_prog_add_undo(struct bpf_prog *prog, int i);
>  struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog);
>  void bpf_prog_put(struct bpf_prog *prog);
>  
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 228f962..a6e4dd8 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -680,6 +680,17 @@ struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int 
> i)
>  }
>  EXPORT_SYMBOL_GPL(bpf_prog_add);
>  
> +void bpf_prog_add_undo(struct bpf_prog *prog, int i)
> +{
> + /* Only to be used for undoing previous bpf_prog_add() in some
> +  * error path. We still know that another entity in our call
> +  * path holds a reference to the program, thus atomic_sub() can
> +  * be safely used here!
> +  */
> + atomic_sub(i, >aux->refcnt);
> +}
> +EXPORT_SYMBOL_GPL(bpf_prog_add_undo);
> +
>  struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
>  {
>   return bpf_prog_add(prog, 1);
> -- 
> 1.9.3
> 

> From 

Re: [PATCH] net/mlx4_en: Fix bpf_prog_add ref_cnt in mlx4

2016-11-09 Thread Brenden Blanco
On Wed, Nov 09, 2016 at 10:57:32AM +0100, Daniel Borkmann wrote:
> On 11/09/2016 10:45 AM, Zhiyi Sun wrote:
> >On Wed, Nov 09, 2016 at 10:05:31AM +0100, Daniel Borkmann wrote:
> >>On 11/09/2016 08:35 AM, Zhiyi Sun wrote:
> >>>There are rx_ring_num queues. Each queue will load xdp prog. So
> >>>bpf_prog_add() should add rx_ring_num to ref_cnt.
> >>>
> >>>Signed-off-by: Zhiyi Sun 
> >>
> >>Your analysis looks incorrect to me. Please elaborate in more detail why
> >>you think current code is buggy ...
> >
> >Yes, you are correct. My patch is incorrect. It is not a bug.
> >
> >>Call path is dev_change_xdp_fd(), which does bpf_prog_get_type() on the
> >>fd. This already takes a ref and only drops it in case of error. Thus
> >>in mlx4_xdp_set(), you only need priv->rx_ring_num - 1 refs for the rest
> >>of the rings, so that dropping refs from old_prog makes sure we release
> >>it again. Looks correct to me (maybe a comment would have helped there).
> >
> >I thought mlx4's code is incorrect because in mlx5's driver, function
> >mlx5e_xdp_set() calls a pair of bpf_prog_add/put, the number of add and
> >put to the refs are same. I didn't notice that one "add" has been called in 
> >its
> >calller. So, it seems that mlx5's code is incorrect, right?
> 
> Yep, I think the two attached patches are needed.
> 
> The other thing I noticed in mlx5e_create_rq() is that it calls
> bpf_prog_add(rq->xdp_prog, 1) without actually checking for errors.

> From d2bd6b3cd8636716a06b0ea3b1e041e16f87cce0 Mon Sep 17 00:00:00 2001
> Message-Id: 
> 
> From: Daniel Borkmann 
> Date: Wed, 9 Nov 2016 10:31:19 +0100
> Subject: [PATCH net-next 1/2] bpf, mlx4: fix prog refcount in 
> mlx4_en_try_alloc_resources error path
> 
> Commit 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings
> scheme") added a bug in that the prog's reference count is not dropped
> in the error path when mlx4_en_try_alloc_resources() is failing.
> 
> We previously took bpf_prog_add(prog, priv->rx_ring_num - 1), that we
> need to release again. Earlier in the call-path, dev_change_xdp_fd()
> itself holds a ref to the prog as well, which is then released though
> bpf_prog_put() due to the propagated error.
> 
> Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
> Signed-off-by: Daniel Borkmann 
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  5 -
>  include/linux/bpf.h|  1 +
>  kernel/bpf/syscall.c   | 11 +++
>  3 files changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 0f6225c..4104aec 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -2747,8 +2747,11 @@ static int mlx4_xdp_set(struct net_device *dev, struct 
> bpf_prog *prog)
>   }
>  
>   err = mlx4_en_try_alloc_resources(priv, tmp, _prof);
> - if (err)
> + if (err) {
> + if (prog)
> + bpf_prog_add_undo(prog, priv->rx_ring_num - 1);
Why not just move the above bpf_prog_add to be below the try_alloc?
Nobody needs those references until all of the resources have been
allocated, and then we can remove the need for bpf_prog_add_undo.
>   goto unlock_out;
> + }
>  
>   if (priv->port_up) {
>   port_up = 1;
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index edcd96d..4f6a4f1 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -234,6 +234,7 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void 
> *meta, u64 meta_size,
>  struct bpf_prog *bpf_prog_get(u32 ufd);
>  struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type);
>  struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i);
> +void bpf_prog_add_undo(struct bpf_prog *prog, int i);
>  struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog);
>  void bpf_prog_put(struct bpf_prog *prog);
>  
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 228f962..a6e4dd8 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -680,6 +680,17 @@ struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int 
> i)
>  }
>  EXPORT_SYMBOL_GPL(bpf_prog_add);
>  
> +void bpf_prog_add_undo(struct bpf_prog *prog, int i)
> +{
> + /* Only to be used for undoing previous bpf_prog_add() in some
> +  * error path. We still know that another entity in our call
> +  * path holds a reference to the program, thus atomic_sub() can
> +  * be safely used here!
> +  */
> + atomic_sub(i, >aux->refcnt);
> +}
> +EXPORT_SYMBOL_GPL(bpf_prog_add_undo);
> +
>  struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
>  {
>   return bpf_prog_add(prog, 1);
> -- 
> 1.9.3
> 

> From f0789544432bbb89c53c3b8ac6575d48fed97786 Mon Sep 17 00:00:00 2001
> Message-Id: 
> 
> In-Reply-To: 
> 
> References: 
> 
> From: Daniel Borkmann 
> Date: Wed, 9 Nov 2016 

Re: Summary of LPC guest MSI discussion in Santa Fe

2016-11-09 Thread Will Deacon
On Tue, Nov 08, 2016 at 09:52:33PM -0500, Don Dutile wrote:
> On 11/08/2016 06:35 PM, Alex Williamson wrote:
> >On Tue, 8 Nov 2016 21:29:22 +0100
> >Christoffer Dall  wrote:
> >>Is my understanding correct, that you need to tell userspace about the
> >>location of the doorbell (in the IOVA space) in case (2), because even
> >>though the configuration of the device is handled by the (host) kernel
> >>through trapping of the BARs, we have to avoid the VFIO user programming
> >>the device to create other DMA transactions to this particular address,
> >>since that will obviously conflict and either not produce the desired
> >>DMA transactions or result in unintended weird interrupts?

Yes, that's the crux of the issue.

> >Correct, if the MSI doorbell IOVA range overlaps RAM in the VM, then
> >it's potentially a DMA target and we'll get bogus data on DMA read from
> >the device, and lose data and potentially trigger spurious interrupts on
> >DMA write from the device.  Thanks,
> >
> That's b/c the MSI doorbells are not positioned *above* the SMMU, i.e.,
> they address match before the SMMU checks are done.  if
> all DMA addrs had to go through SMMU first, then the DMA access could
> be ignored/rejected.

That's actually not true :( The SMMU can't generally distinguish between MSI
writes and DMA writes, so it would just see a write transaction to the
doorbell address, regardless of how it was generated by the endpoint.

Will


Re: Summary of LPC guest MSI discussion in Santa Fe

2016-11-09 Thread Will Deacon
On Tue, Nov 08, 2016 at 09:52:33PM -0500, Don Dutile wrote:
> On 11/08/2016 06:35 PM, Alex Williamson wrote:
> >On Tue, 8 Nov 2016 21:29:22 +0100
> >Christoffer Dall  wrote:
> >>Is my understanding correct, that you need to tell userspace about the
> >>location of the doorbell (in the IOVA space) in case (2), because even
> >>though the configuration of the device is handled by the (host) kernel
> >>through trapping of the BARs, we have to avoid the VFIO user programming
> >>the device to create other DMA transactions to this particular address,
> >>since that will obviously conflict and either not produce the desired
> >>DMA transactions or result in unintended weird interrupts?

Yes, that's the crux of the issue.

> >Correct, if the MSI doorbell IOVA range overlaps RAM in the VM, then
> >it's potentially a DMA target and we'll get bogus data on DMA read from
> >the device, and lose data and potentially trigger spurious interrupts on
> >DMA write from the device.  Thanks,
> >
> That's b/c the MSI doorbells are not positioned *above* the SMMU, i.e.,
> they address match before the SMMU checks are done.  if
> all DMA addrs had to go through SMMU first, then the DMA access could
> be ignored/rejected.

That's actually not true :( The SMMU can't generally distinguish between MSI
writes and DMA writes, so it would just see a write transaction to the
doorbell address, regardless of how it was generated by the endpoint.

Will


Re: [PATCH 1/2] net: ethernet: nb8800: Do not apply TX delay at MAC level

2016-11-09 Thread Florian Fainelli
On 11/09/2016 05:02 AM, Sebastian Frias wrote:
> On 11/04/2016 05:49 PM, Måns Rullgård wrote:
 But when doing so, both the Atheros 8035 and the Aurora NB8800 drivers
 will apply the delay.

 I think a better way of dealing with this is that both, PHY and MAC
 drivers exchange information so that the delay is applied only once.
>>>
>>> Exchange what information? The PHY device interface (phydev->interface)
>>> conveys the needed information for both entities.
>>
>> There doesn't seem to be any consensus among the drivers regarding where
>> the delay should be applied.  Since only a few drivers, MAC or PHY, act
>> on this property, most combinations still work by chance.  It is common
>> for boards to set the delay at the PHY using external config pins so no
>> software setup is required (although I have one Sigma based board that
>> gets this wrong).  I suspect if drivers/net/ethernet/broadcom/genet were
>> used with one of the four PHY drivers that also set the delay based on
>> this DT property, things would go wrong.
>>
> 
> Exactly, what about a patch like (I can make a formal submission, even
> merge it with the patch discussed in this thread, consider this a RFC):

I really don't see a point in doing this when we can just clarify what
phydev->interface does and already have the knowledge that we need
without introducing additional flags in the phy driver.

> 
> diff --git a/drivers/net/ethernet/aurora/nb8800.c 
> b/drivers/net/ethernet/aurora/nb8800.c
> index fba2699..4217ff4 100644
> --- a/drivers/net/ethernet/aurora/nb8800.c
> +++ b/drivers/net/ethernet/aurora/nb8800.c
> @@ -1283,6 +1283,10 @@ static int nb8800_tangox_init(struct net_device *dev)
>   case PHY_INTERFACE_MODE_RGMII_RXID:
>   case PHY_INTERFACE_MODE_RGMII_TXID:
>   pad_mode = PAD_MODE_RGMII;
> +
> + if ((dev->phydev->flags & PHY_SUPPORTS_TXID) == 0)
> + pad_mode |= PAD_MODE_GTX_CLK_DELAY;
> +
>   break;
>  
>   default:
> diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
> index 2e0c759..5eddb04 100644
> --- a/drivers/net/phy/at803x.c
> +++ b/drivers/net/phy/at803x.c
> @@ -426,7 +426,9 @@ static int at803x_aneg_done(struct phy_device *phydev)
>   .suspend= at803x_suspend,
>   .resume = at803x_resume,
>   .features   = PHY_GBIT_FEATURES,
> - .flags  = PHY_HAS_INTERRUPT,
> + .flags  = PHY_HAS_INTERRUPT |
> +   PHY_SUPPORTS_RXID |
> +   PHY_SUPPORTS_TXID,
>   .config_aneg= genphy_config_aneg,
>   .read_status= genphy_read_status,
>   .ack_interrupt  = at803x_ack_interrupt,
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index e7e1fd3..0f0b17e 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -61,6 +61,8 @@
>  #define PHY_HAS_INTERRUPT0x0001
>  #define PHY_HAS_MAGICANEG0x0002
>  #define PHY_IS_INTERNAL  0x0004
> +#define PHY_SUPPORTS_RXID   0x0008
> +#define PHY_SUPPORTS_TXID   0x0010
>  #define MDIO_DEVICE_IS_PHY   0x8000
>  
>  /* Interface Mode definitions */
> 


-- 
Florian


Re: [PATCH 1/2] net: ethernet: nb8800: Do not apply TX delay at MAC level

2016-11-09 Thread Florian Fainelli
On 11/09/2016 05:02 AM, Sebastian Frias wrote:
> On 11/04/2016 05:49 PM, Måns Rullgård wrote:
 But when doing so, both the Atheros 8035 and the Aurora NB8800 drivers
 will apply the delay.

 I think a better way of dealing with this is that both, PHY and MAC
 drivers exchange information so that the delay is applied only once.
>>>
>>> Exchange what information? The PHY device interface (phydev->interface)
>>> conveys the needed information for both entities.
>>
>> There doesn't seem to be any consensus among the drivers regarding where
>> the delay should be applied.  Since only a few drivers, MAC or PHY, act
>> on this property, most combinations still work by chance.  It is common
>> for boards to set the delay at the PHY using external config pins so no
>> software setup is required (although I have one Sigma based board that
>> gets this wrong).  I suspect if drivers/net/ethernet/broadcom/genet were
>> used with one of the four PHY drivers that also set the delay based on
>> this DT property, things would go wrong.
>>
> 
> Exactly, what about a patch like (I can make a formal submission, even
> merge it with the patch discussed in this thread, consider this a RFC):

I really don't see a point in doing this when we can just clarify what
phydev->interface does and already have the knowledge that we need
without introducing additional flags in the phy driver.

> 
> diff --git a/drivers/net/ethernet/aurora/nb8800.c 
> b/drivers/net/ethernet/aurora/nb8800.c
> index fba2699..4217ff4 100644
> --- a/drivers/net/ethernet/aurora/nb8800.c
> +++ b/drivers/net/ethernet/aurora/nb8800.c
> @@ -1283,6 +1283,10 @@ static int nb8800_tangox_init(struct net_device *dev)
>   case PHY_INTERFACE_MODE_RGMII_RXID:
>   case PHY_INTERFACE_MODE_RGMII_TXID:
>   pad_mode = PAD_MODE_RGMII;
> +
> + if ((dev->phydev->flags & PHY_SUPPORTS_TXID) == 0)
> + pad_mode |= PAD_MODE_GTX_CLK_DELAY;
> +
>   break;
>  
>   default:
> diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
> index 2e0c759..5eddb04 100644
> --- a/drivers/net/phy/at803x.c
> +++ b/drivers/net/phy/at803x.c
> @@ -426,7 +426,9 @@ static int at803x_aneg_done(struct phy_device *phydev)
>   .suspend= at803x_suspend,
>   .resume = at803x_resume,
>   .features   = PHY_GBIT_FEATURES,
> - .flags  = PHY_HAS_INTERRUPT,
> + .flags  = PHY_HAS_INTERRUPT |
> +   PHY_SUPPORTS_RXID |
> +   PHY_SUPPORTS_TXID,
>   .config_aneg= genphy_config_aneg,
>   .read_status= genphy_read_status,
>   .ack_interrupt  = at803x_ack_interrupt,
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index e7e1fd3..0f0b17e 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -61,6 +61,8 @@
>  #define PHY_HAS_INTERRUPT0x0001
>  #define PHY_HAS_MAGICANEG0x0002
>  #define PHY_IS_INTERNAL  0x0004
> +#define PHY_SUPPORTS_RXID   0x0008
> +#define PHY_SUPPORTS_TXID   0x0010
>  #define MDIO_DEVICE_IS_PHY   0x8000
>  
>  /* Interface Mode definitions */
> 


-- 
Florian


Re: [PATCH 0/2] PCI: Add quirk for Fintek F81504/508/512 on Skylake

2016-11-09 Thread Bjorn Helgaas
Hi Peter,

On Mon, Nov 07, 2016 at 05:22:30PM +0800, Ji-Ze Hong (Peter Hong) wrote:
> We had tested Fintek F81504/508/512 PCIe-to-UART/GPIO on Intel Skylake
> platform. It's maybe flood AER correctable error interrupt and slow down
> the system boot. It's the same issue about below link:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173

The issue in the Launchpad report above is a bug in the PCI AER
driver.  We need to fix that problem, not add quirks to avoid the
problem.

I think it's a pretty straightforward problem:
https://marc.info/?l=linux-pci=145140470807043

I haven't had time to fix it myself, so it'd be great if somebody
else would step up and fix it.

> and this IC will malfunctional after suspend/resume (S3, D0->D3->D0) on
> Skylake platform. 
> 
> The first patch will use parent AER interrupt mask to prevent generating
> correctable error interrupt, the  and second will prevent the IC malfunctional
> after D3.
> 
> Ji-Ze Hong (Peter Hong) (2):
>   PCI: Add quirk for Fintek F81504/508/512 AER issue
>   PCI: Add quirk for Fintek F81504/508/512 D3 issue
> 
>  drivers/pci/quirks.c| 50 
> +
>  include/linux/pci_ids.h |  5 +
>  2 files changed, 55 insertions(+)
> 
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] PCI: Add quirk for Fintek F81504/508/512 on Skylake

2016-11-09 Thread Bjorn Helgaas
Hi Peter,

On Mon, Nov 07, 2016 at 05:22:30PM +0800, Ji-Ze Hong (Peter Hong) wrote:
> We had tested Fintek F81504/508/512 PCIe-to-UART/GPIO on Intel Skylake
> platform. It's maybe flood AER correctable error interrupt and slow down
> the system boot. It's the same issue about below link:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173

The issue in the Launchpad report above is a bug in the PCI AER
driver.  We need to fix that problem, not add quirks to avoid the
problem.

I think it's a pretty straightforward problem:
https://marc.info/?l=linux-pci=145140470807043

I haven't had time to fix it myself, so it'd be great if somebody
else would step up and fix it.

> and this IC will malfunctional after suspend/resume (S3, D0->D3->D0) on
> Skylake platform. 
> 
> The first patch will use parent AER interrupt mask to prevent generating
> correctable error interrupt, the  and second will prevent the IC malfunctional
> after D3.
> 
> Ji-Ze Hong (Peter Hong) (2):
>   PCI: Add quirk for Fintek F81504/508/512 AER issue
>   PCI: Add quirk for Fintek F81504/508/512 D3 issue
> 
>  drivers/pci/quirks.c| 50 
> +
>  include/linux/pci_ids.h |  5 +
>  2 files changed, 55 insertions(+)
> 
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 22/25] x86/mcheck: Do the init in one place

2016-11-09 Thread Borislav Petkov
On Wed, Nov 09, 2016 at 05:24:51PM +0100, Sebastian Andrzej Siewior wrote:
> The behaviour was not changed - it was only reorganized to keep in one
> spot.

So there's the full CPU init path down cpu_up() -> ... -> identify_cpu()
where mcheck_cpu_init() is called and then there's also the hotplug
callbacks in mce_cpu_callback().

What you're proposing now is, merge the two.

But then the full path down identify_cpu() could still do
mheck_cpu_init() regardless where you move it.

IOW, I still don't see why this change is needed.

In more OW, why can't you simply do:

err = cpuhp_setup_state(CPUHP_AP_X86_MCE_STARTING, "x86/mce:starting",
mce_reenable_cpu, NULL);

and use the current notifier callback?

I still don't get the need for this churn.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH 22/25] x86/mcheck: Do the init in one place

2016-11-09 Thread Borislav Petkov
On Wed, Nov 09, 2016 at 05:24:51PM +0100, Sebastian Andrzej Siewior wrote:
> The behaviour was not changed - it was only reorganized to keep in one
> spot.

So there's the full CPU init path down cpu_up() -> ... -> identify_cpu()
where mcheck_cpu_init() is called and then there's also the hotplug
callbacks in mce_cpu_callback().

What you're proposing now is, merge the two.

But then the full path down identify_cpu() could still do
mheck_cpu_init() regardless where you move it.

IOW, I still don't see why this change is needed.

In more OW, why can't you simply do:

err = cpuhp_setup_state(CPUHP_AP_X86_MCE_STARTING, "x86/mce:starting",
mce_reenable_cpu, NULL);

and use the current notifier callback?

I still don't get the need for this churn.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


[PATCH] ASoC: fix platform_no_drv_owner.cocci warnings

2016-11-09 Thread kbuild test robot
sound/soc/codecs/cs42l42.c:1972:3-8: No need to set .owner here. The core will 
do it.

 Remove .owner field if calls are used which set it automatically

Generated by: scripts/coccinelle/api/platform_no_drv_owner.cocci

CC: James Schulman 
Signed-off-by: Fengguang Wu 
---

 cs42l42.c |1 -
 1 file changed, 1 deletion(-)

--- a/sound/soc/codecs/cs42l42.c
+++ b/sound/soc/codecs/cs42l42.c
@@ -1969,7 +1969,6 @@ MODULE_DEVICE_TABLE(i2c, cs42l42_id);
 static struct i2c_driver cs42l42_i2c_driver = {
.driver = {
.name = "cs42l42",
-   .owner = THIS_MODULE,
.pm = _runtime_pm,
.of_match_table = cs42l42_of_match,
},


Re: [Intel-gfx] [PATCH v3] drm: move allocation out of drm_get_format_name()

2016-11-09 Thread Eric Engestrom
On Wednesday, 2016-11-09 14:13:40 +0100, Daniel Vetter wrote:
> On Wed, Nov 9, 2016 at 12:42 PM, Eric Engestrom
>  wrote:
> >> Well, had to drop it again since it didn't compile:
> >>
> >>
> >>   CC [M]  drivers/gpu/drm/drm_blend.o
> >> drivers/gpu/drm/drm_atomic.c: In function ‘drm_atomic_plane_print_state’:
> >> drivers/gpu/drm/drm_atomic.c:920:5: error: too few arguments to function 
> >> ‘drm_get_format_name’
> >>  drm_get_format_name(fb->pixel_format));
> >>  ^~~
> >> In file included from ./include/drm/drmP.h:71:0,
> >>  from drivers/gpu/drm/drm_atomic.c:29:
> >> ./include/drm/drm_fourcc.h:65:7: note: declared here
> >>  char *drm_get_format_name(uint32_t format, struct drm_format_name_buf 
> >> *buf);
> >>^~~
> >>
> >> Can you pls rebase onto drm-misc or linux-next or something?
> >
> > That was based on airlied/drm-next (last fetched on Sunday I think),
> > I can rebase it on drm-misc if it helps, but it seems older than
> > drm-next.
> > Should I just rebase on top of current head of drm-next?
> 
> It needs to be drm-misc (linux-next doesn't have it yet) due to the
> new atomic debug work that we just landed. I'm working on drm-tip as a
> drm local integration tree to ease pains like these a bit, but that
> doesn't really exist yet.

I'm confused as to how the different trees and branches merge back to
Torvalds' tree (I'm interested in particular in drm), and I'm not sure
which branch you want me to rebase on in the drm-misc tree [1],
especially since all of them are older than drm-next [2].

I'll try to rebase on drm-misc-fixes (currently at 4da5caa6a6f82cda3193)
as it sounds about right, but it doesn't apply at all, so it'll take
a little while.

Could you give me a quick explanation or point me to a doc/page that
explains how the various trees and branches get merged?
I googled a bit and found this doc [4] by Jani, but it doesn't mention
drm-misc for instance, so I'm not sure how up-to-date and
non-intel-specific it is.

Looking at this page, something just occurred to me: did you mean
drm-fixes [3], instead of one of the branches on drm-misc?

Cheers,
  Eric

[1] git://anongit.freedesktop.org/drm/drm-misc
[2] git://people.freedesktop.org/~airlied/linux drm-next
[2] git://people.freedesktop.org/~airlied/linux drm-fixes
[3] https://01.org/linuxgraphics/gfx-docs/maintainer-tools/drm-intel.html


[PATCH] ASoC: fix platform_no_drv_owner.cocci warnings

2016-11-09 Thread kbuild test robot
sound/soc/codecs/cs42l42.c:1972:3-8: No need to set .owner here. The core will 
do it.

 Remove .owner field if calls are used which set it automatically

Generated by: scripts/coccinelle/api/platform_no_drv_owner.cocci

CC: James Schulman 
Signed-off-by: Fengguang Wu 
---

 cs42l42.c |1 -
 1 file changed, 1 deletion(-)

--- a/sound/soc/codecs/cs42l42.c
+++ b/sound/soc/codecs/cs42l42.c
@@ -1969,7 +1969,6 @@ MODULE_DEVICE_TABLE(i2c, cs42l42_id);
 static struct i2c_driver cs42l42_i2c_driver = {
.driver = {
.name = "cs42l42",
-   .owner = THIS_MODULE,
.pm = _runtime_pm,
.of_match_table = cs42l42_of_match,
},


Re: [Intel-gfx] [PATCH v3] drm: move allocation out of drm_get_format_name()

2016-11-09 Thread Eric Engestrom
On Wednesday, 2016-11-09 14:13:40 +0100, Daniel Vetter wrote:
> On Wed, Nov 9, 2016 at 12:42 PM, Eric Engestrom
>  wrote:
> >> Well, had to drop it again since it didn't compile:
> >>
> >>
> >>   CC [M]  drivers/gpu/drm/drm_blend.o
> >> drivers/gpu/drm/drm_atomic.c: In function ‘drm_atomic_plane_print_state’:
> >> drivers/gpu/drm/drm_atomic.c:920:5: error: too few arguments to function 
> >> ‘drm_get_format_name’
> >>  drm_get_format_name(fb->pixel_format));
> >>  ^~~
> >> In file included from ./include/drm/drmP.h:71:0,
> >>  from drivers/gpu/drm/drm_atomic.c:29:
> >> ./include/drm/drm_fourcc.h:65:7: note: declared here
> >>  char *drm_get_format_name(uint32_t format, struct drm_format_name_buf 
> >> *buf);
> >>^~~
> >>
> >> Can you pls rebase onto drm-misc or linux-next or something?
> >
> > That was based on airlied/drm-next (last fetched on Sunday I think),
> > I can rebase it on drm-misc if it helps, but it seems older than
> > drm-next.
> > Should I just rebase on top of current head of drm-next?
> 
> It needs to be drm-misc (linux-next doesn't have it yet) due to the
> new atomic debug work that we just landed. I'm working on drm-tip as a
> drm local integration tree to ease pains like these a bit, but that
> doesn't really exist yet.

I'm confused as to how the different trees and branches merge back to
Torvalds' tree (I'm interested in particular in drm), and I'm not sure
which branch you want me to rebase on in the drm-misc tree [1],
especially since all of them are older than drm-next [2].

I'll try to rebase on drm-misc-fixes (currently at 4da5caa6a6f82cda3193)
as it sounds about right, but it doesn't apply at all, so it'll take
a little while.

Could you give me a quick explanation or point me to a doc/page that
explains how the various trees and branches get merged?
I googled a bit and found this doc [4] by Jani, but it doesn't mention
drm-misc for instance, so I'm not sure how up-to-date and
non-intel-specific it is.

Looking at this page, something just occurred to me: did you mean
drm-fixes [3], instead of one of the branches on drm-misc?

Cheers,
  Eric

[1] git://anongit.freedesktop.org/drm/drm-misc
[2] git://people.freedesktop.org/~airlied/linux drm-next
[2] git://people.freedesktop.org/~airlied/linux drm-fixes
[3] https://01.org/linuxgraphics/gfx-docs/maintainer-tools/drm-intel.html


sched/autogroup: race if !sysctl_sched_autogroup_enabled ?

2016-11-09 Thread Oleg Nesterov
I am trying to investigate a bug-report which looks as an autogroup bug,
and it seems I found the race which _might_ explain the problem. I'll
try to make the fix tomorrow but could you confirm I got it right and
answer the question below?

Let's look at task_wants_autogroup()

/*
 * We can only assume the task group can't go away on us if
 * autogroup_move_group() can see us on ->thread_group list.
 */
if (p->flags & PF_EXITING)
return false;

Firstly, I think that this PF_EXITING check is no longer needed.
sched_change_group() can be called by autogroup or cgroups code.
autogroup is obviously fine, cgroup-attach will never try to migrate
the exiting task (cgroup_threadgroup_rwsem helps), cpu_cgroup_fork()
is obviously fine too.

But!!! at the same time the comment is _correct_ even if very cryptic ;)

We need to ensure that autogroup/tg returned by autogroup_task_group()
can't go away if we race with autogroup_move_group(), and unless the
caller holds ->siglock we rely on fact that autogroup_move_group()
will a) see this task and b) do sched_move_task() which needs the same
same rq->lock.

However. autogroup_move_group() skips for_each_thread/sched_move_task
if sysctl_sched_autogroup_enabled == 0.

So. Doesn't this mean that cgroup migration to the root cgroup can race
with autogroup_move_group() and use the soon-to-be-freed autogroup->tg?

Or, even simpler, cgroup_post_fork()->cpu_cgroup_fork() can hit the
same race if CLONE_TRHEAD?

Or I am totally confused?

-
Now the qustions. Can't we simplify autogroup_task_get() and avoid
lock_task_sighand() ?

struct autogroup *autogroup_task_get(struct task_struct *p)
{
struct autogroup *ag

rcu_read_lock();
for (;;) {
// it is freed by sched_free_group_rcu() path
// and thus ->autogroup is rcu-safe too.
ag = READ_ONCE(p->signal->autogroup);
if (kref_get_unless_zero(>kref))
break;
}
rcu_read_unlock();

return ag;
}

although this is a bit off-topic. Another question is that I fail to
understand why sched_autogroup_create_attach() does autogroup_create()
and changes signal->autogroup even if !sysctl_sched_autogroup_enabled.

IOW, even ignoring the problem above, what is wrong with this patch?

--- x/kernel/sched/auto_group.c
+++ x/kernel/sched/auto_group.c
@@ -152,8 +152,12 @@ out:
 /* Allocates GFP_KERNEL, cannot be called under any spinlock */
 void sched_autogroup_create_attach(struct task_struct *p)
 {
-   struct autogroup *ag = autogroup_create();
+   struct autogroup *ag;
+
+   if (!sysctl_sched_autogroup_enabled)
+   return;
 
+   ag = autogroup_create();
autogroup_move_group(p, ag);
/* drop extra reference added by autogroup_create() */
autogroup_kref_put(ag);

or even

--- x/kernel/sched/auto_group.c
+++ x/kernel/sched/auto_group.c
@@ -64,9 +64,13 @@ static inline struct autogroup *autogrou
 
 static inline struct autogroup *autogroup_create(void)
 {
-   struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL);
+   struct autogroup *ag;
struct task_group *tg;
 
+   if (!sysctl_sched_autogroup_enabled)
+   goto xxx;
+
+   ag = kzalloc(sizeof(*ag), GFP_KERNEL);
if (!ag)
goto out_fail;
 
@@ -103,7 +107,7 @@ out_fail:
printk(KERN_WARNING "autogroup_create: %s failure.\n",
ag ? "sched_create_group()" : "kmalloc()");
}
-
+xxx:
return autogroup_kref_get(_default);
 }
 

Oleg.



sched/autogroup: race if !sysctl_sched_autogroup_enabled ?

2016-11-09 Thread Oleg Nesterov
I am trying to investigate a bug-report which looks as an autogroup bug,
and it seems I found the race which _might_ explain the problem. I'll
try to make the fix tomorrow but could you confirm I got it right and
answer the question below?

Let's look at task_wants_autogroup()

/*
 * We can only assume the task group can't go away on us if
 * autogroup_move_group() can see us on ->thread_group list.
 */
if (p->flags & PF_EXITING)
return false;

Firstly, I think that this PF_EXITING check is no longer needed.
sched_change_group() can be called by autogroup or cgroups code.
autogroup is obviously fine, cgroup-attach will never try to migrate
the exiting task (cgroup_threadgroup_rwsem helps), cpu_cgroup_fork()
is obviously fine too.

But!!! at the same time the comment is _correct_ even if very cryptic ;)

We need to ensure that autogroup/tg returned by autogroup_task_group()
can't go away if we race with autogroup_move_group(), and unless the
caller holds ->siglock we rely on fact that autogroup_move_group()
will a) see this task and b) do sched_move_task() which needs the same
same rq->lock.

However. autogroup_move_group() skips for_each_thread/sched_move_task
if sysctl_sched_autogroup_enabled == 0.

So. Doesn't this mean that cgroup migration to the root cgroup can race
with autogroup_move_group() and use the soon-to-be-freed autogroup->tg?

Or, even simpler, cgroup_post_fork()->cpu_cgroup_fork() can hit the
same race if CLONE_TRHEAD?

Or I am totally confused?

-
Now the qustions. Can't we simplify autogroup_task_get() and avoid
lock_task_sighand() ?

struct autogroup *autogroup_task_get(struct task_struct *p)
{
struct autogroup *ag

rcu_read_lock();
for (;;) {
// it is freed by sched_free_group_rcu() path
// and thus ->autogroup is rcu-safe too.
ag = READ_ONCE(p->signal->autogroup);
if (kref_get_unless_zero(>kref))
break;
}
rcu_read_unlock();

return ag;
}

although this is a bit off-topic. Another question is that I fail to
understand why sched_autogroup_create_attach() does autogroup_create()
and changes signal->autogroup even if !sysctl_sched_autogroup_enabled.

IOW, even ignoring the problem above, what is wrong with this patch?

--- x/kernel/sched/auto_group.c
+++ x/kernel/sched/auto_group.c
@@ -152,8 +152,12 @@ out:
 /* Allocates GFP_KERNEL, cannot be called under any spinlock */
 void sched_autogroup_create_attach(struct task_struct *p)
 {
-   struct autogroup *ag = autogroup_create();
+   struct autogroup *ag;
+
+   if (!sysctl_sched_autogroup_enabled)
+   return;
 
+   ag = autogroup_create();
autogroup_move_group(p, ag);
/* drop extra reference added by autogroup_create() */
autogroup_kref_put(ag);

or even

--- x/kernel/sched/auto_group.c
+++ x/kernel/sched/auto_group.c
@@ -64,9 +64,13 @@ static inline struct autogroup *autogrou
 
 static inline struct autogroup *autogroup_create(void)
 {
-   struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL);
+   struct autogroup *ag;
struct task_group *tg;
 
+   if (!sysctl_sched_autogroup_enabled)
+   goto xxx;
+
+   ag = kzalloc(sizeof(*ag), GFP_KERNEL);
if (!ag)
goto out_fail;
 
@@ -103,7 +107,7 @@ out_fail:
printk(KERN_WARNING "autogroup_create: %s failure.\n",
ag ? "sched_create_group()" : "kmalloc()");
}
-
+xxx:
return autogroup_kref_get(_default);
 }
 

Oleg.



Re: [PATCH v2 0/7] soc: renesas: Identify SoC and register with the SoC bus

2016-11-09 Thread Arnd Bergmann
On Wednesday, November 9, 2016 2:34:33 PM CET Geert Uytterhoeven wrote:
> >
> > And Samsung.
> > Shall I create the immutable branch now?
> 
> Arnd: are you happy with the new patches and changes?

I still had some comments for patch 7, but that shouldn't stop
you from creating a branch for the first three so everyone can
build on top of that.

Arnd


Re: [PATCH v2 0/7] soc: renesas: Identify SoC and register with the SoC bus

2016-11-09 Thread Arnd Bergmann
On Wednesday, November 9, 2016 2:34:33 PM CET Geert Uytterhoeven wrote:
> >
> > And Samsung.
> > Shall I create the immutable branch now?
> 
> Arnd: are you happy with the new patches and changes?

I still had some comments for patch 7, but that shouldn't stop
you from creating a branch for the first three so everyone can
build on top of that.

Arnd


RE: [PATCH 3/3] clk: qcom: Set BRANCH_HALT_DELAY flags for venus core0/1 clks

2016-11-09 Thread Sricharan
Hi Stephen,

>>
>> On 11/05/2016 01:48 AM, 'Stephen Boyd' wrote:
>> > Well I'm also curious which case is failing. Does turning on the
>> > clocks work after the gdsc is enabled? Does turning off the
>> > clocks fail because we don't know when the gdsc has turned off? I
>> > would hope that the firmware keeps the gdsc on when it's done
>> > processing things, goes idle, and hands back control to software.
>> > Right now I'm failing to see how the halt bits fail to toggle
>> > assuming that firmware isn't misbehaving and the kernel driver is
>> > power controlling in a coordinated manner with the firmware.
>>
>> What fails is turning ON the clocks after the gdsc is put under
>> hardware control (by fails I mean the halt checks fail to indicate
>> the clock is running, but register accesses etc thereafter suggest
>> the clocks are actually running)
>> The halt checks seem to work only while the gdsc is put in SW enabled
>> state.
>>
>
>Um... that is bad. I don't see how that is possible. It sounds
>like the clocks are not turning on when we're asking for them to
>turn on. The register accesses are always working because these
>subcore clks aren't required for register accesses. Most likely
>the GDSC for the subdomains is off (even after we thought we
>enabled it).
>
>Let's take a step back. The video hardware has three GDSCs. One
>for the main logic, and two for each subdomain. We're adding hw
>control for the two subdomains here. From what I can tell there
>isn't any hw control for the main domain. I see that there are
>two registers in the video hardware to control those subdomain hw
>enable signals that go to the GDSC. The reset value is OFF (not
>ON like was stated earlier), so it seems that if we switch the
>subdomain GDSCs on in these patches it will turn on for a short
>time, and then turn off when we switch into hw mode (by virtue of
>the way we enable the GDSC). Presumably we can assert these hw
>signal bits regardless of the subdomain power states, because
>otherwise we're in a chicken-egg situation for the firmware
>controlling this stuff.
>
>The proper sequence sounds like it should be:
>
>   1. Enable GDSC for main domain
>   2. Enable clocks for main domain (video_{core,maxi,ahb,axi}_clk)
>   3. Write the two registers to assert hw signal for subdomains
>   4. Enable GDSCs for two subdomains
>   5. Enable clocks for subdomains (video_subcore{0,1}_clk)
>
>I can only guess that we're not doing this. Probably the sequence
>right now is:
>
>   1. Enable GDSC for main domain and both sub-domains
>   2. Enable clocks for main domain (video_{core,maxi,ahb,axi}_clk)
>   3. Enable clocks for subdomains (video_subcore{0,1}_clk)
>   
>
>Sound right? If so, please fix the sequence in the video driver.
>

So the above is the sequence which is actually carried out on the
firmware side. The same can be done in host as well.
The clocks stuck issue indeed is not there with this. But with the
above sequence we need to add a step to do inverse of STEP3
above (ie write the registers to de-assert hw_signal), to keep
the subdomains in off, till firmware uses it. So the above sequence
helps to avoid masking the halt check, although the host really
does not wants to use these clocks, except setting it up for the
firmware.

Regards,
 Sricharan







RE: [PATCH 3/3] clk: qcom: Set BRANCH_HALT_DELAY flags for venus core0/1 clks

2016-11-09 Thread Sricharan
Hi Stephen,

>>
>> On 11/05/2016 01:48 AM, 'Stephen Boyd' wrote:
>> > Well I'm also curious which case is failing. Does turning on the
>> > clocks work after the gdsc is enabled? Does turning off the
>> > clocks fail because we don't know when the gdsc has turned off? I
>> > would hope that the firmware keeps the gdsc on when it's done
>> > processing things, goes idle, and hands back control to software.
>> > Right now I'm failing to see how the halt bits fail to toggle
>> > assuming that firmware isn't misbehaving and the kernel driver is
>> > power controlling in a coordinated manner with the firmware.
>>
>> What fails is turning ON the clocks after the gdsc is put under
>> hardware control (by fails I mean the halt checks fail to indicate
>> the clock is running, but register accesses etc thereafter suggest
>> the clocks are actually running)
>> The halt checks seem to work only while the gdsc is put in SW enabled
>> state.
>>
>
>Um... that is bad. I don't see how that is possible. It sounds
>like the clocks are not turning on when we're asking for them to
>turn on. The register accesses are always working because these
>subcore clks aren't required for register accesses. Most likely
>the GDSC for the subdomains is off (even after we thought we
>enabled it).
>
>Let's take a step back. The video hardware has three GDSCs. One
>for the main logic, and two for each subdomain. We're adding hw
>control for the two subdomains here. From what I can tell there
>isn't any hw control for the main domain. I see that there are
>two registers in the video hardware to control those subdomain hw
>enable signals that go to the GDSC. The reset value is OFF (not
>ON like was stated earlier), so it seems that if we switch the
>subdomain GDSCs on in these patches it will turn on for a short
>time, and then turn off when we switch into hw mode (by virtue of
>the way we enable the GDSC). Presumably we can assert these hw
>signal bits regardless of the subdomain power states, because
>otherwise we're in a chicken-egg situation for the firmware
>controlling this stuff.
>
>The proper sequence sounds like it should be:
>
>   1. Enable GDSC for main domain
>   2. Enable clocks for main domain (video_{core,maxi,ahb,axi}_clk)
>   3. Write the two registers to assert hw signal for subdomains
>   4. Enable GDSCs for two subdomains
>   5. Enable clocks for subdomains (video_subcore{0,1}_clk)
>
>I can only guess that we're not doing this. Probably the sequence
>right now is:
>
>   1. Enable GDSC for main domain and both sub-domains
>   2. Enable clocks for main domain (video_{core,maxi,ahb,axi}_clk)
>   3. Enable clocks for subdomains (video_subcore{0,1}_clk)
>   
>
>Sound right? If so, please fix the sequence in the video driver.
>

So the above is the sequence which is actually carried out on the
firmware side. The same can be done in host as well.
The clocks stuck issue indeed is not there with this. But with the
above sequence we need to add a step to do inverse of STEP3
above (ie write the registers to de-assert hw_signal), to keep
the subdomains in off, till firmware uses it. So the above sequence
helps to avoid masking the halt check, although the host really
does not wants to use these clocks, except setting it up for the
firmware.

Regards,
 Sricharan







Re: [PATCH v2 7/7] soc: renesas: Identify SoC and register with the SoC bus

2016-11-09 Thread Arnd Bergmann
On Monday, October 31, 2016 12:30:55 PM CET Geert Uytterhoeven wrote:

> v2:
>   - Drop SoC families and family names; use fixed "Renesas" instead,

I think I'd rather have seen the family names left in there, but it's
not important, so up to you.

>   - Use "renesas,prr" and "renesas,cccr" device nodes in DT if
> available, else fall back to hardcoded addresses for compatibility
> with existing DTBs,

I only see patches 2, 3, 5, and 7 in my inbox, so I'm lacking the DT
binding for these, among other things.

It does seem wrong to have a device node for a specific register though.
Shouldn't the node be for the block of registers that these are inside
of?

>   - Don't register the SoC bus if the chip ID register is missing,

Why? My objection was to hardcoding the register, not to registering
the device? I think I'd rather see the device registered with an
empty revision string.

> +#define CCCR   0xe600101c  /* Common Chip Code Register */
> +#define PRR0xff44  /* Product Register */
> +#define PRR3   0xfff00044  /* Product Register on R-Car Gen3 */
> +
> +static const struct of_device_id renesas_socs[] __initconst = {
> +#ifdef CONFIG_ARCH_R8A73A4
> +   { .compatible = "renesas,r8a73a4",  .data = (void *)PRR, },
> +#endif
> +#ifdef CONFIG_ARCH_R8A7740
> +   { .compatible = "renesas,r8a7740",  .data = (void *)CCCR, },
> +#endif

My preference here would be to list the register address only for
SoCs that are known to need them, while also having .dtb files that
don't have the nodes.

> +static int __init renesas_soc_init(void)
> +{
> +   struct soc_device_attribute *soc_dev_attr;
> +   const struct of_device_id *match;
> +   void __iomem *chipid = NULL;
> +   struct soc_device *soc_dev;
> +   struct device_node *np;
> +   unsigned int product;
> +
> +   np = of_find_matching_node_and_match(NULL, renesas_socs, );
> +   if (!np)
> +   return -ENODEV;
> +
> +   of_node_put(np);
> +
> +   /* Try PRR first, then CCCR, then hardcoded fallback */
> +   np = of_find_compatible_node(NULL, NULL, "renesas,prr");
> +   if (!np)
> +   np = of_find_compatible_node(NULL, NULL, "renesas,cccr");
> +   if (np) {
> +   chipid = of_iomap(np, 0);
> +   of_node_put(np);
> +   } else if (match->data) {
> +   chipid = ioremap((uintptr_t)match->data, 4);
> +   }
> +   if (!chipid)
> 

Here, I'd turn the order around and look for the DT nodes of the
devices first. Only if they are not found, look at the compatible
string of the root node. No need to search for a node though,
you know which one it is when you look for a compatible =
"renesas,r8a73a4".

Arnd


Re: [PATCH v2 7/7] soc: renesas: Identify SoC and register with the SoC bus

2016-11-09 Thread Arnd Bergmann
On Monday, October 31, 2016 12:30:55 PM CET Geert Uytterhoeven wrote:

> v2:
>   - Drop SoC families and family names; use fixed "Renesas" instead,

I think I'd rather have seen the family names left in there, but it's
not important, so up to you.

>   - Use "renesas,prr" and "renesas,cccr" device nodes in DT if
> available, else fall back to hardcoded addresses for compatibility
> with existing DTBs,

I only see patches 2, 3, 5, and 7 in my inbox, so I'm lacking the DT
binding for these, among other things.

It does seem wrong to have a device node for a specific register though.
Shouldn't the node be for the block of registers that these are inside
of?

>   - Don't register the SoC bus if the chip ID register is missing,

Why? My objection was to hardcoding the register, not to registering
the device? I think I'd rather see the device registered with an
empty revision string.

> +#define CCCR   0xe600101c  /* Common Chip Code Register */
> +#define PRR0xff44  /* Product Register */
> +#define PRR3   0xfff00044  /* Product Register on R-Car Gen3 */
> +
> +static const struct of_device_id renesas_socs[] __initconst = {
> +#ifdef CONFIG_ARCH_R8A73A4
> +   { .compatible = "renesas,r8a73a4",  .data = (void *)PRR, },
> +#endif
> +#ifdef CONFIG_ARCH_R8A7740
> +   { .compatible = "renesas,r8a7740",  .data = (void *)CCCR, },
> +#endif

My preference here would be to list the register address only for
SoCs that are known to need them, while also having .dtb files that
don't have the nodes.

> +static int __init renesas_soc_init(void)
> +{
> +   struct soc_device_attribute *soc_dev_attr;
> +   const struct of_device_id *match;
> +   void __iomem *chipid = NULL;
> +   struct soc_device *soc_dev;
> +   struct device_node *np;
> +   unsigned int product;
> +
> +   np = of_find_matching_node_and_match(NULL, renesas_socs, );
> +   if (!np)
> +   return -ENODEV;
> +
> +   of_node_put(np);
> +
> +   /* Try PRR first, then CCCR, then hardcoded fallback */
> +   np = of_find_compatible_node(NULL, NULL, "renesas,prr");
> +   if (!np)
> +   np = of_find_compatible_node(NULL, NULL, "renesas,cccr");
> +   if (np) {
> +   chipid = of_iomap(np, 0);
> +   of_node_put(np);
> +   } else if (match->data) {
> +   chipid = ioremap((uintptr_t)match->data, 4);
> +   }
> +   if (!chipid)
> 

Here, I'd turn the order around and look for the DT nodes of the
devices first. Only if they are not found, look at the compatible
string of the root node. No need to search for a node though,
you know which one it is when you look for a compatible =
"renesas,r8a73a4".

Arnd


Re: [PATCH] drbd: Fix kernel_sendmsg() usage

2016-11-09 Thread Jens Axboe

On 11/09/2016 08:32 AM, Lars Ellenberg wrote:

On Tue, Nov 08, 2016 at 09:52:04AM -0700, Jens Axboe wrote:

This should go into 4.9,
and into all stable branches since and including v4.0,
which is the first to contain the exposing change.

It is correct for all stable branches older than that as well
(which contain the DRBD driver; which is 2.6.33 and up).

It requires a small "conflict" resolution for v4.4 and earlier, with v4.5
we dropped the comment block immediately preceding the kernel_sendmsg().

Cc: sta...@vger.kernel.org
Cc: v...@zeniv.linux.org.uk
Cc: christoph.lechleit...@iteg.at
Cc: wolfgang.g...@iteg.at
Reported-by: Christoph Lechleitner 
Tested-by: Christoph Lechleitner 
Signed-off-by: Richard Weinberger 
Signed-off-by: Lars Ellenberg 


Changing my patch is perfectly fine, but please clearly state it.
I.e. by adding something like that before your S-o-b.
[Lars: Massaged patch to match my personal taste...]





Lars, are you sending a new one? If you do, add the stable tag as well.


So my "change" against his original patch was
- rv = kernel_sendmsg(sock, , , 1, size - sent);
+ rv = kernel_sendmsg(sock, , , 1, iov.iov_len);
to make it "more obviously correct" from looking just at the one line
without even having to read the context.  And a more verbose commit message.


I'm fine with you making that change, I do that too sometimes for
patches. But Richard is absolutely right in that you need to make a note
of that. It's no longer the patch he signed off on, so it needs to
reflect that.


Should I sent two patches, one that applies to 4.5 and later,
and one that applies to 2.6.33 ... 4.4, or are you or stable
willing to resolve the trivial "missing comment block" conflict yourself?


Since it's a trivial one liner, let's just do one for the current series
and the stable folks should be able to do that one. If not, they will
let us know.

--
Jens Axboe



Re: [PATCH] drbd: Fix kernel_sendmsg() usage

2016-11-09 Thread Jens Axboe

On 11/09/2016 08:32 AM, Lars Ellenberg wrote:

On Tue, Nov 08, 2016 at 09:52:04AM -0700, Jens Axboe wrote:

This should go into 4.9,
and into all stable branches since and including v4.0,
which is the first to contain the exposing change.

It is correct for all stable branches older than that as well
(which contain the DRBD driver; which is 2.6.33 and up).

It requires a small "conflict" resolution for v4.4 and earlier, with v4.5
we dropped the comment block immediately preceding the kernel_sendmsg().

Cc: sta...@vger.kernel.org
Cc: v...@zeniv.linux.org.uk
Cc: christoph.lechleit...@iteg.at
Cc: wolfgang.g...@iteg.at
Reported-by: Christoph Lechleitner 
Tested-by: Christoph Lechleitner 
Signed-off-by: Richard Weinberger 
Signed-off-by: Lars Ellenberg 


Changing my patch is perfectly fine, but please clearly state it.
I.e. by adding something like that before your S-o-b.
[Lars: Massaged patch to match my personal taste...]





Lars, are you sending a new one? If you do, add the stable tag as well.


So my "change" against his original patch was
- rv = kernel_sendmsg(sock, , , 1, size - sent);
+ rv = kernel_sendmsg(sock, , , 1, iov.iov_len);
to make it "more obviously correct" from looking just at the one line
without even having to read the context.  And a more verbose commit message.


I'm fine with you making that change, I do that too sometimes for
patches. But Richard is absolutely right in that you need to make a note
of that. It's no longer the patch he signed off on, so it needs to
reflect that.


Should I sent two patches, one that applies to 4.5 and later,
and one that applies to 2.6.33 ... 4.4, or are you or stable
willing to resolve the trivial "missing comment block" conflict yourself?


Since it's a trivial one liner, let's just do one for the current series
and the stable folks should be able to do that one. If not, they will
let us know.

--
Jens Axboe



Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

2016-11-09 Thread Vincent Guittot
On 19 October 2016 at 11:53, Peter Zijlstra  wrote:
> On Wed, Oct 19, 2016 at 08:38:12AM +0200, Vincent Guittot wrote:
>> > It might make sense to have helper functions to evaluate those
>>
>> The main issue is the number of parameters used in these conditions
>> that makes helper function not really more readable.
>
> Fair enough I suppose..
>
>> > conditions, because currently there's two instances of each, once in the
>> > branch selection and then again (but inverted, we miss the == case fwiw)
>>
>> not sure to catch the comment about inverted and miss the == case
>
> Oh, you're right. Which proves my point on this not being entirely
> readable :/

Hi Peter,

I'm not sure of the status of this patch.
It seems difficult to get a more readable code.
Do you want me to add more comments about when we enter each state or
current version is enough ?

Vincent

>
> Initially I thought the things were of the form:
>
> else if (ineq) {
> }
>
> ...
>
> if (... || !ineq || ...)
>
>
> Which would get you things like: a undefined. But if I put them along side one another like:
>
>
> +   } else if (min_runnable_load > (runnable_load + imbalance)) {
>
> + (min_runnable_load > (this_runnable_load + 
> imbalance)) ||
>
>
> +   } else if ((runnable_load < (min_runnable_load + imbalance)) 
> &&
> +   (100*min_avg_load > 
> sd->imbalance_pct*avg_load)) {
>
> +((this_runnable_load < (min_runnable_load + imbalance)) 
> &&
> +   (100*min_avg_load > 
> sd->imbalance_pct*this_avg_load)))
>
> We can see this is not in fact the case.
>
> Blergh, I also cannot see a pretty way to increase readability here,
> because while they have the same general shape, there's this small
> variation with this_*.
>
> A well..


Re: [PATCH] sched/fair: Do not decay new task load on first enqueue

2016-11-09 Thread Vincent Guittot
On 19 October 2016 at 11:53, Peter Zijlstra  wrote:
> On Wed, Oct 19, 2016 at 08:38:12AM +0200, Vincent Guittot wrote:
>> > It might make sense to have helper functions to evaluate those
>>
>> The main issue is the number of parameters used in these conditions
>> that makes helper function not really more readable.
>
> Fair enough I suppose..
>
>> > conditions, because currently there's two instances of each, once in the
>> > branch selection and then again (but inverted, we miss the == case fwiw)
>>
>> not sure to catch the comment about inverted and miss the == case
>
> Oh, you're right. Which proves my point on this not being entirely
> readable :/

Hi Peter,

I'm not sure of the status of this patch.
It seems difficult to get a more readable code.
Do you want me to add more comments about when we enter each state or
current version is enough ?

Vincent

>
> Initially I thought the things were of the form:
>
> else if (ineq) {
> }
>
> ...
>
> if (... || !ineq || ...)
>
>
> Which would get you things like: ab, and leave a==b
> undefined. But if I put them along side one another like:
>
>
> +   } else if (min_runnable_load > (runnable_load + imbalance)) {
>
> + (min_runnable_load > (this_runnable_load + 
> imbalance)) ||
>
>
> +   } else if ((runnable_load < (min_runnable_load + imbalance)) 
> &&
> +   (100*min_avg_load > 
> sd->imbalance_pct*avg_load)) {
>
> +((this_runnable_load < (min_runnable_load + imbalance)) 
> &&
> +   (100*min_avg_load > 
> sd->imbalance_pct*this_avg_load)))
>
> We can see this is not in fact the case.
>
> Blergh, I also cannot see a pretty way to increase readability here,
> because while they have the same general shape, there's this small
> variation with this_*.
>
> A well..


Re: KASAN & the vmalloc area

2016-11-09 Thread Andrey Ryabinin
On 11/08/2016 10:03 PM, Mark Rutland wrote:
> Hi,
> 
> I see a while back [1] there was a discussion of what to do about KASAN
> and vmapped stacks, but it doesn't look like that was solved, judging by
> the vmapped stacks pull [2] for v4.9.
> 
> I wondered whether anyone had looked at that since?
> 

Sadly, but I didn't have much time for this yet, so it's in an initial state.

> I have an additional reason to want to dynamically allocate the vmalloc
> area shadow: it turns out that KASAN currently interacts rather poorly
> with the arm64 ptdump code.
> 
> When KASAN is selected, we allocate shadow for the whole vmalloc area,
> using common zero pte, pmd, pud tables. Walking over these in the ptdump
> code takes a *very* long time (I've seen up to 15 minutes with
> KASAN_OUTLINE enabled). For DEBUG_WX [3], this means boot hangs for that
> long, too.
> 

I didn't look at any code, but we probably could can remember last visited pgd
and skip next pgd if it's the same as previous.
Alternatively - just skip kasan_zero_p*d in ptdump walker.

> If I don't allocate vmalloc shadow (and remove the apparently pointlesss
> shadow of the shadow area), and only allocate shadow for the image,
> fixmap, vmemmap and so on, that delay gets cut to a few seconds, which
> is tolerable for a debug configuration...
> 
> ... however, things blow up when the kernel touches vmalloc'd memory for
> the first time, as we don't install shadow for that dynamically.
> 
> Thanks,
> Mark.
> 
> [1] 
> https://lkml.kernel.org/r/CALCETrWucrYp+yq8RHSDqf93xtg793duByirurzJbLRhrz=t...@mail.gmail.com
> [2] https://lkml.kernel.org/r/20161003092940.ga...@gmail.com
> [3] 
> http://lists.infradead.org/pipermail/linux-arm-kernel/2016-October/464191.html
> 


Re: KASAN & the vmalloc area

2016-11-09 Thread Andrey Ryabinin
On 11/08/2016 10:03 PM, Mark Rutland wrote:
> Hi,
> 
> I see a while back [1] there was a discussion of what to do about KASAN
> and vmapped stacks, but it doesn't look like that was solved, judging by
> the vmapped stacks pull [2] for v4.9.
> 
> I wondered whether anyone had looked at that since?
> 

Sadly, but I didn't have much time for this yet, so it's in an initial state.

> I have an additional reason to want to dynamically allocate the vmalloc
> area shadow: it turns out that KASAN currently interacts rather poorly
> with the arm64 ptdump code.
> 
> When KASAN is selected, we allocate shadow for the whole vmalloc area,
> using common zero pte, pmd, pud tables. Walking over these in the ptdump
> code takes a *very* long time (I've seen up to 15 minutes with
> KASAN_OUTLINE enabled). For DEBUG_WX [3], this means boot hangs for that
> long, too.
> 

I didn't look at any code, but we probably could can remember last visited pgd
and skip next pgd if it's the same as previous.
Alternatively - just skip kasan_zero_p*d in ptdump walker.

> If I don't allocate vmalloc shadow (and remove the apparently pointlesss
> shadow of the shadow area), and only allocate shadow for the image,
> fixmap, vmemmap and so on, that delay gets cut to a few seconds, which
> is tolerable for a debug configuration...
> 
> ... however, things blow up when the kernel touches vmalloc'd memory for
> the first time, as we don't install shadow for that dynamically.
> 
> Thanks,
> Mark.
> 
> [1] 
> https://lkml.kernel.org/r/CALCETrWucrYp+yq8RHSDqf93xtg793duByirurzJbLRhrz=t...@mail.gmail.com
> [2] https://lkml.kernel.org/r/20161003092940.ga...@gmail.com
> [3] 
> http://lists.infradead.org/pipermail/linux-arm-kernel/2016-October/464191.html
> 


Re: [PATCH 1/2] mfd: davinci_voicecodec: tidyup header difinitions

2016-11-09 Thread Lee Jones
On Thu, 27 Oct 2016, Kuninori Morimoto wrote:

> 
> From: Kuninori Morimoto 

Please use `git send-mail` to submit your patches.

> mach/hardware.h is needed on C source code side, not header.
> And struct davinci_vc is duplicated definition.
> 
> Signed-off-by: Kuninori Morimoto 
> ---
>  drivers/mfd/davinci_voicecodec.c   | 1 +
>  include/linux/mfd/davinci_voicecodec.h | 4 
>  2 files changed, 1 insertion(+), 4 deletions(-)

Nice.

Applied, thanks.

> diff --git a/drivers/mfd/davinci_voicecodec.c 
> b/drivers/mfd/davinci_voicecodec.c
> index dff2f19..4d0a5f3 100644
> --- a/drivers/mfd/davinci_voicecodec.c
> +++ b/drivers/mfd/davinci_voicecodec.c
> @@ -32,6 +32,7 @@
>  #include 
>  
>  #include 
> +#include 
>  
>  static const struct regmap_config davinci_vc_regmap = {
>   .reg_bits = 32,
> diff --git a/include/linux/mfd/davinci_voicecodec.h 
> b/include/linux/mfd/davinci_voicecodec.h
> index 8e1cdbe..2c0127c 100644
> --- a/include/linux/mfd/davinci_voicecodec.h
> +++ b/include/linux/mfd/davinci_voicecodec.h
> @@ -28,8 +28,6 @@
>  #include 
>  #include 
>  
> -#include 
> -
>  struct regmap;
>  
>  /*
> @@ -99,8 +97,6 @@ struct davinci_vcif {
>   dma_addr_t dma_rx_addr;
>  };
>  
> -struct davinci_vc;
> -
>  struct davinci_vc {
>   /* Device data */
>   struct device *dev;

-- 
Lee Jones
Linaro STMicroelectronics Landing Team Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog


Re: [PATCH 1/2] mfd: davinci_voicecodec: tidyup header difinitions

2016-11-09 Thread Lee Jones
On Thu, 27 Oct 2016, Kuninori Morimoto wrote:

> 
> From: Kuninori Morimoto 

Please use `git send-mail` to submit your patches.

> mach/hardware.h is needed on C source code side, not header.
> And struct davinci_vc is duplicated definition.
> 
> Signed-off-by: Kuninori Morimoto 
> ---
>  drivers/mfd/davinci_voicecodec.c   | 1 +
>  include/linux/mfd/davinci_voicecodec.h | 4 
>  2 files changed, 1 insertion(+), 4 deletions(-)

Nice.

Applied, thanks.

> diff --git a/drivers/mfd/davinci_voicecodec.c 
> b/drivers/mfd/davinci_voicecodec.c
> index dff2f19..4d0a5f3 100644
> --- a/drivers/mfd/davinci_voicecodec.c
> +++ b/drivers/mfd/davinci_voicecodec.c
> @@ -32,6 +32,7 @@
>  #include 
>  
>  #include 
> +#include 
>  
>  static const struct regmap_config davinci_vc_regmap = {
>   .reg_bits = 32,
> diff --git a/include/linux/mfd/davinci_voicecodec.h 
> b/include/linux/mfd/davinci_voicecodec.h
> index 8e1cdbe..2c0127c 100644
> --- a/include/linux/mfd/davinci_voicecodec.h
> +++ b/include/linux/mfd/davinci_voicecodec.h
> @@ -28,8 +28,6 @@
>  #include 
>  #include 
>  
> -#include 
> -
>  struct regmap;
>  
>  /*
> @@ -99,8 +97,6 @@ struct davinci_vcif {
>   dma_addr_t dma_rx_addr;
>  };
>  
> -struct davinci_vc;
> -
>  struct davinci_vc {
>   /* Device data */
>   struct device *dev;

-- 
Lee Jones
Linaro STMicroelectronics Landing Team Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog


Re: [Drbd-dev] [PATCH] drbd: Fix kernel_sendmsg() usage

2016-11-09 Thread Lars Ellenberg
On Wed, Nov 09, 2016 at 04:47:15PM +0100, Richard Weinberger wrote:
> > Should I sent two patches, one that applies to 4.5 and later,
> > and one that applies to 2.6.33 ... 4.4, or are you or stable
> > willing to resolve the trivial "missing comment block" conflict yourself?
> 
> BTW: Why did you drop the "Fixes:" tag too?
> 

By accident, probably.

I'm ok with you guys just using the original, if you prefer, just let me
know.  It's the fix, that's important, not how much noise we can make
about that oneliner.

Lars



Re: [Drbd-dev] [PATCH] drbd: Fix kernel_sendmsg() usage

2016-11-09 Thread Lars Ellenberg
On Wed, Nov 09, 2016 at 04:47:15PM +0100, Richard Weinberger wrote:
> > Should I sent two patches, one that applies to 4.5 and later,
> > and one that applies to 2.6.33 ... 4.4, or are you or stable
> > willing to resolve the trivial "missing comment block" conflict yourself?
> 
> BTW: Why did you drop the "Fixes:" tag too?
> 

By accident, probably.

I'm ok with you guys just using the original, if you prefer, just let me
know.  It's the fix, that's important, not how much noise we can make
about that oneliner.

Lars



Re: [PATCH V5 2/3] ARM64 LPC: Add missing range exception for special ISA

2016-11-09 Thread liviu.du...@arm.com
On Wed, Nov 09, 2016 at 04:16:17PM +, Gabriele Paoloni wrote:
> Hi Liviu
> 
> Thanks for reviewing
> 

[removed some irrelevant part of discussion, avoid crazy formatting]

> > > +/**
> > > + * addr_is_indirect_io - check whether the input taddr is for
> > indirectIO.
> > > + * @taddr: the io address to be checked.
> > > + *
> > > + * Returns 1 when taddr is in the range; otherwise return 0.
> > > + */
> > > +int addr_is_indirect_io(u64 taddr)
> > > +{
> > > + if (arm64_extio_ops->start > taddr || arm64_extio_ops->end <
> > taddr)
> > 
> > start >= taddr ?
> 
> Nope... if  (taddr < arm64_extio_ops->start || taddr > arm64_extio_ops->end)
> then taddr is outside the range [start; end] and will return 0; otherwise
> it will return 1...

Oops, sorry, did not pay attention to the returned value. The check is
correct as it is, no need to change then.

> 
> > 
> > > + return 0;
> > > +
> > > + return 1;
> > > +}
> > >
> > >  BUILD_EXTIO(b, u8)
> > >
> > > diff --git a/drivers/of/address.c b/drivers/of/address.c
> > > index 02b2903..cc2a05d 100644
> > > --- a/drivers/of/address.c
> > > +++ b/drivers/of/address.c
> > > @@ -479,6 +479,50 @@ static int of_empty_ranges_quirk(struct
> > device_node *np)
> > >   return false;
> > >  }
> > >
> > > +
> > > +/*
> > > + * of_isa_indirect_io - get the IO address from some isa reg
> > property value.
> > > + *   For some isa/lpc devices, no ranges property in ancestor node.
> > > + *   The device addresses are described directly in their regs
> > property.
> > > + *   This fixup function will be called to get the IO address of
> > isa/lpc
> > > + *   devices when the normal of_translation failed.
> > > + *
> > > + * @parent:  points to the parent dts node;
> > > + * @bus: points to the of_bus which can be used to parse
> > address;
> > > + * @addr:the address from reg property;
> > > + * @na:  the address cell counter of @addr;
> > > + * @presult: store the address paresed from @addr;
> > > + *
> > > + * return 1 when successfully get the I/O address;
> > > + * 0 will return for some failures.
> > 
> > Bah, you are returning a signed int, why 0 for failure? Return a
> > negative value with
> > error codes. Otherwise change the return value into a bool.
> 
> Yes we'll move to bool
> 
> > 
> > > + */
> > > +static int of_get_isa_indirect_io(struct device_node *parent,
> > > + struct of_bus *bus, __be32 *addr,
> > > + int na, u64 *presult)
> > > +{
> > > + unsigned int flags;
> > > + unsigned int rlen;
> > > +
> > > + /* whether support indirectIO */
> > > + if (!indirect_io_enabled())
> > > + return 0;
> > > +
> > > + if (!of_bus_isa_match(parent))
> > > + return 0;
> > > +
> > > + flags = bus->get_flags(addr);
> > > + if (!(flags & IORESOURCE_IO))
> > > + return 0;
> > > +
> > > + /* there is ranges property, apply the normal translation
> > directly. */
> > 
> > s/there is ranges/if we have a 'ranges'/
> 
> Thanks for spotting this
> 
> > 
> > > + if (of_get_property(parent, "ranges", ))
> > > + return 0;
> > > +
> > > + *presult = of_read_number(addr + 1, na - 1);
> > > + /* this fixup is only valid for specific I/O range. */
> > > + return addr_is_indirect_io(*presult);
> > > +}
> > > +
> > >  static int of_translate_one(struct device_node *parent, struct
> > of_bus *bus,
> > >   struct of_bus *pbus, __be32 *addr,
> > >   int na, int ns, int pna, const char *rprop)
> > > @@ -595,6 +639,15 @@ static u64 __of_translate_address(struct
> > device_node *dev,
> > >   result = of_read_number(addr, na);
> > >   break;
> > >   }
> > > + /*
> > > +  * For indirectIO device which has no ranges property, get
> > > +  * the address from reg directly.
> > > +  */
> > > + if (of_get_isa_indirect_io(dev, bus, addr, na, )) {
> > > + pr_debug("isa indirectIO matched(%s)..addr =
> > 0x%llx\n",
> > > + of_node_full_name(dev), result);
> > > + break;
> > > + }
> > >
> > >   /* Get new parent bus and counts */
> > >   pbus = of_match_bus(parent);
> > > @@ -688,8 +741,9 @@ static int __of_address_to_resource(struct
> > device_node *dev,
> > >   if (taddr == OF_BAD_ADDR)
> > >   return -EINVAL;
> > >   memset(r, 0, sizeof(struct resource));
> > > - if (flags & IORESOURCE_IO) {
> > > + if (flags & IORESOURCE_IO && taddr >= PCIBIOS_MIN_IO) {
> > >   unsigned long port;
> > > +
> > >   port = pci_address_to_pio(taddr);
> > >   if (port == (unsigned long)-1)
> > >   return -EINVAL;
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index ba34907..1a08511 100644
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -3263,7 +3263,7 @@ int __weak pci_register_io_range(phys_addr_t
> > addr, 

Re: [PATCH V5 2/3] ARM64 LPC: Add missing range exception for special ISA

2016-11-09 Thread liviu.du...@arm.com
On Wed, Nov 09, 2016 at 04:16:17PM +, Gabriele Paoloni wrote:
> Hi Liviu
> 
> Thanks for reviewing
> 

[removed some irrelevant part of discussion, avoid crazy formatting]

> > > +/**
> > > + * addr_is_indirect_io - check whether the input taddr is for
> > indirectIO.
> > > + * @taddr: the io address to be checked.
> > > + *
> > > + * Returns 1 when taddr is in the range; otherwise return 0.
> > > + */
> > > +int addr_is_indirect_io(u64 taddr)
> > > +{
> > > + if (arm64_extio_ops->start > taddr || arm64_extio_ops->end <
> > taddr)
> > 
> > start >= taddr ?
> 
> Nope... if  (taddr < arm64_extio_ops->start || taddr > arm64_extio_ops->end)
> then taddr is outside the range [start; end] and will return 0; otherwise
> it will return 1...

Oops, sorry, did not pay attention to the returned value. The check is
correct as it is, no need to change then.

> 
> > 
> > > + return 0;
> > > +
> > > + return 1;
> > > +}
> > >
> > >  BUILD_EXTIO(b, u8)
> > >
> > > diff --git a/drivers/of/address.c b/drivers/of/address.c
> > > index 02b2903..cc2a05d 100644
> > > --- a/drivers/of/address.c
> > > +++ b/drivers/of/address.c
> > > @@ -479,6 +479,50 @@ static int of_empty_ranges_quirk(struct
> > device_node *np)
> > >   return false;
> > >  }
> > >
> > > +
> > > +/*
> > > + * of_isa_indirect_io - get the IO address from some isa reg
> > property value.
> > > + *   For some isa/lpc devices, no ranges property in ancestor node.
> > > + *   The device addresses are described directly in their regs
> > property.
> > > + *   This fixup function will be called to get the IO address of
> > isa/lpc
> > > + *   devices when the normal of_translation failed.
> > > + *
> > > + * @parent:  points to the parent dts node;
> > > + * @bus: points to the of_bus which can be used to parse
> > address;
> > > + * @addr:the address from reg property;
> > > + * @na:  the address cell counter of @addr;
> > > + * @presult: store the address paresed from @addr;
> > > + *
> > > + * return 1 when successfully get the I/O address;
> > > + * 0 will return for some failures.
> > 
> > Bah, you are returning a signed int, why 0 for failure? Return a
> > negative value with
> > error codes. Otherwise change the return value into a bool.
> 
> Yes we'll move to bool
> 
> > 
> > > + */
> > > +static int of_get_isa_indirect_io(struct device_node *parent,
> > > + struct of_bus *bus, __be32 *addr,
> > > + int na, u64 *presult)
> > > +{
> > > + unsigned int flags;
> > > + unsigned int rlen;
> > > +
> > > + /* whether support indirectIO */
> > > + if (!indirect_io_enabled())
> > > + return 0;
> > > +
> > > + if (!of_bus_isa_match(parent))
> > > + return 0;
> > > +
> > > + flags = bus->get_flags(addr);
> > > + if (!(flags & IORESOURCE_IO))
> > > + return 0;
> > > +
> > > + /* there is ranges property, apply the normal translation
> > directly. */
> > 
> > s/there is ranges/if we have a 'ranges'/
> 
> Thanks for spotting this
> 
> > 
> > > + if (of_get_property(parent, "ranges", ))
> > > + return 0;
> > > +
> > > + *presult = of_read_number(addr + 1, na - 1);
> > > + /* this fixup is only valid for specific I/O range. */
> > > + return addr_is_indirect_io(*presult);
> > > +}
> > > +
> > >  static int of_translate_one(struct device_node *parent, struct
> > of_bus *bus,
> > >   struct of_bus *pbus, __be32 *addr,
> > >   int na, int ns, int pna, const char *rprop)
> > > @@ -595,6 +639,15 @@ static u64 __of_translate_address(struct
> > device_node *dev,
> > >   result = of_read_number(addr, na);
> > >   break;
> > >   }
> > > + /*
> > > +  * For indirectIO device which has no ranges property, get
> > > +  * the address from reg directly.
> > > +  */
> > > + if (of_get_isa_indirect_io(dev, bus, addr, na, )) {
> > > + pr_debug("isa indirectIO matched(%s)..addr =
> > 0x%llx\n",
> > > + of_node_full_name(dev), result);
> > > + break;
> > > + }
> > >
> > >   /* Get new parent bus and counts */
> > >   pbus = of_match_bus(parent);
> > > @@ -688,8 +741,9 @@ static int __of_address_to_resource(struct
> > device_node *dev,
> > >   if (taddr == OF_BAD_ADDR)
> > >   return -EINVAL;
> > >   memset(r, 0, sizeof(struct resource));
> > > - if (flags & IORESOURCE_IO) {
> > > + if (flags & IORESOURCE_IO && taddr >= PCIBIOS_MIN_IO) {
> > >   unsigned long port;
> > > +
> > >   port = pci_address_to_pio(taddr);
> > >   if (port == (unsigned long)-1)
> > >   return -EINVAL;
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index ba34907..1a08511 100644
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -3263,7 +3263,7 @@ int __weak pci_register_io_range(phys_addr_t
> > addr, 

Re: [PATCH 1/1 V2] mqueue: Implment generic xattr support

2016-11-09 Thread Christoph Hellwig
> +/*
> + * Callback for security_inode_init_security() for acquiring xattrs.
> + */
> +static int mqueue_initxattrs(struct inode *inode,
> + const struct xattr *xattr_array,
> + void *fs_info)
> +{
> + struct mqueue_inode_info *info = MQUEUE_I(inode);
> + const struct xattr *xattr;
> + struct simple_xattr *new_xattr;
> + size_t len;
> +
> + for (xattr = xattr_array; xattr->name != NULL; xattr++) {
> + new_xattr = simple_xattr_alloc(xattr->value, xattr->value_len);
> + if (!new_xattr)
> + return -ENOMEM;
> + len = strlen(xattr->name) + 1;
> + new_xattr->name = kmalloc(XATTR_SECURITY_PREFIX_LEN + len,
> +   GFP_KERNEL);
> + if (!new_xattr->name) {
> + kfree(new_xattr);
> + return -ENOMEM;
> + }
> +
> + memcpy(new_xattr->name, XATTR_SECURITY_PREFIX,
> +XATTR_SECURITY_PREFIX_LEN);
> + memcpy(new_xattr->name + XATTR_SECURITY_PREFIX_LEN,
> +xattr->name, len);
> +
> + simple_xattr_list_add(>xattrs, new_xattr);
> + }
> +
> + return 0;
> +}

This is a 1:1 copy of the shmem code, we rally should consolidate it
into a single place first, as people will want it for whatever virtual
fs they care about sooner or later.


Re: [PATCH 1/1 V2] mqueue: Implment generic xattr support

2016-11-09 Thread Christoph Hellwig
> +/*
> + * Callback for security_inode_init_security() for acquiring xattrs.
> + */
> +static int mqueue_initxattrs(struct inode *inode,
> + const struct xattr *xattr_array,
> + void *fs_info)
> +{
> + struct mqueue_inode_info *info = MQUEUE_I(inode);
> + const struct xattr *xattr;
> + struct simple_xattr *new_xattr;
> + size_t len;
> +
> + for (xattr = xattr_array; xattr->name != NULL; xattr++) {
> + new_xattr = simple_xattr_alloc(xattr->value, xattr->value_len);
> + if (!new_xattr)
> + return -ENOMEM;
> + len = strlen(xattr->name) + 1;
> + new_xattr->name = kmalloc(XATTR_SECURITY_PREFIX_LEN + len,
> +   GFP_KERNEL);
> + if (!new_xattr->name) {
> + kfree(new_xattr);
> + return -ENOMEM;
> + }
> +
> + memcpy(new_xattr->name, XATTR_SECURITY_PREFIX,
> +XATTR_SECURITY_PREFIX_LEN);
> + memcpy(new_xattr->name + XATTR_SECURITY_PREFIX_LEN,
> +xattr->name, len);
> +
> + simple_xattr_list_add(>xattrs, new_xattr);
> + }
> +
> + return 0;
> +}

This is a 1:1 copy of the shmem code, we rally should consolidate it
into a single place first, as people will want it for whatever virtual
fs they care about sooner or later.


[PATCH] kvm: kvmclock: let KVM_GET_CLOCK return whether the master clock is in use

2016-11-09 Thread Paolo Bonzini
Userspace can read the exact value of kvmclock by reading the TSC
and fetching the timekeeping parameters out of guest memory.  This
however is brittle and not necessary anymore with KVM 4.11.  Provide
a mechanism that lets userspace know if the new KVM_GET_CLOCK
semantics are in effect, and---since we are at it---if the clock
is stable across all VCPUs.

Cc: Radim Krčmář 
Cc: Marcelo Tosatti 
Signed-off-by: Paolo Bonzini 
---
 Documentation/virtual/kvm/api.txt | 11 +++
 arch/x86/kvm/x86.c| 10 +++---
 include/uapi/linux/kvm.h  |  7 +++
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 739db9ab16b2..6bbceb9a3a19 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -777,6 +777,17 @@ Gets the current timestamp of kvmclock as seen by the 
current guest. In
 conjunction with KVM_SET_CLOCK, it is used to ensure monotonicity on scenarios
 such as migration.
 
+When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the
+set of bits that KVM can return in struct kvm_clock_data's flag member.
+
+The only flag defined now is KVM_CLOCK_TSC_STABLE.  If set, the returned
+value is the exact kvmclock value seen by all VCPUs at the instant
+when KVM_GET_CLOCK was called.  If clear, the returned value is simply
+CLOCK_MONOTONIC plus a constant offset; the offset can be modified
+with KVM_SET_CLOCK.  KVM will try to make all VCPUs follow this clock,
+but the exact value read by each VCPU could differ, because the host
+TSC is not stable.
+
 struct kvm_clock_data {
__u64 clock;  /* kvmclock current value */
__u32 flags;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3017de0431bd..1ba08278a9a9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2596,7 +2596,6 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext)
case KVM_CAP_PIT_STATE2:
case KVM_CAP_SET_IDENTITY_MAP_ADDR:
case KVM_CAP_XEN_HVM:
-   case KVM_CAP_ADJUST_CLOCK:
case KVM_CAP_VCPU_EVENTS:
case KVM_CAP_HYPERV:
case KVM_CAP_HYPERV_VAPIC:
@@ -2623,6 +2622,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext)
 #endif
r = 1;
break;
+   case KVM_CAP_ADJUST_CLOCK:
+   r = KVM_CLOCK_TSC_STABLE;
+   break;
case KVM_CAP_X86_SMM:
/* SMBASE is usually relocated above 1M on modern chipsets,
 * and SMM handlers might indeed rely on 4G segment limits,
@@ -4103,9 +4105,11 @@ long kvm_arch_vm_ioctl(struct file *filp,
struct kvm_clock_data user_ns;
u64 now_ns;
 
-   now_ns = get_kvmclock_ns(kvm);
+   local_irq_disable();
+   now_ns = __get_kvmclock_ns(kvm);
user_ns.clock = now_ns;
-   user_ns.flags = 0;
+   user_ns.flags = kvm->arch.use_master_clock ? 
KVM_CLOCK_TSC_STABLE : 0;
+   local_irq_enable();
memset(_ns.pad, 0, sizeof(user_ns.pad));
 
r = -EFAULT;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 300ef255d1e0..4ee67cb99143 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -972,12 +972,19 @@ struct kvm_irqfd {
__u8  pad[16];
 };
 
+/* For KVM_CAP_ADJUST_CLOCK */
+
+/* Do not use 1, KVM_CHECK_EXTENSION returned it before we had flags.  */
+#define KVM_CLOCK_TSC_STABLE   2
+
 struct kvm_clock_data {
__u64 clock;
__u32 flags;
__u32 pad[9];
 };
 
+/* For KVM_CAP_SW_TLB */
+
 #define KVM_MMU_FSL_BOOKE_NOHV 0
 #define KVM_MMU_FSL_BOOKE_HV   1
 
-- 
1.8.3.1



[PATCH] kvm: kvmclock: let KVM_GET_CLOCK return whether the master clock is in use

2016-11-09 Thread Paolo Bonzini
Userspace can read the exact value of kvmclock by reading the TSC
and fetching the timekeeping parameters out of guest memory.  This
however is brittle and not necessary anymore with KVM 4.11.  Provide
a mechanism that lets userspace know if the new KVM_GET_CLOCK
semantics are in effect, and---since we are at it---if the clock
is stable across all VCPUs.

Cc: Radim Krčmář 
Cc: Marcelo Tosatti 
Signed-off-by: Paolo Bonzini 
---
 Documentation/virtual/kvm/api.txt | 11 +++
 arch/x86/kvm/x86.c| 10 +++---
 include/uapi/linux/kvm.h  |  7 +++
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 739db9ab16b2..6bbceb9a3a19 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -777,6 +777,17 @@ Gets the current timestamp of kvmclock as seen by the 
current guest. In
 conjunction with KVM_SET_CLOCK, it is used to ensure monotonicity on scenarios
 such as migration.
 
+When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the
+set of bits that KVM can return in struct kvm_clock_data's flag member.
+
+The only flag defined now is KVM_CLOCK_TSC_STABLE.  If set, the returned
+value is the exact kvmclock value seen by all VCPUs at the instant
+when KVM_GET_CLOCK was called.  If clear, the returned value is simply
+CLOCK_MONOTONIC plus a constant offset; the offset can be modified
+with KVM_SET_CLOCK.  KVM will try to make all VCPUs follow this clock,
+but the exact value read by each VCPU could differ, because the host
+TSC is not stable.
+
 struct kvm_clock_data {
__u64 clock;  /* kvmclock current value */
__u32 flags;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3017de0431bd..1ba08278a9a9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2596,7 +2596,6 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext)
case KVM_CAP_PIT_STATE2:
case KVM_CAP_SET_IDENTITY_MAP_ADDR:
case KVM_CAP_XEN_HVM:
-   case KVM_CAP_ADJUST_CLOCK:
case KVM_CAP_VCPU_EVENTS:
case KVM_CAP_HYPERV:
case KVM_CAP_HYPERV_VAPIC:
@@ -2623,6 +2622,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext)
 #endif
r = 1;
break;
+   case KVM_CAP_ADJUST_CLOCK:
+   r = KVM_CLOCK_TSC_STABLE;
+   break;
case KVM_CAP_X86_SMM:
/* SMBASE is usually relocated above 1M on modern chipsets,
 * and SMM handlers might indeed rely on 4G segment limits,
@@ -4103,9 +4105,11 @@ long kvm_arch_vm_ioctl(struct file *filp,
struct kvm_clock_data user_ns;
u64 now_ns;
 
-   now_ns = get_kvmclock_ns(kvm);
+   local_irq_disable();
+   now_ns = __get_kvmclock_ns(kvm);
user_ns.clock = now_ns;
-   user_ns.flags = 0;
+   user_ns.flags = kvm->arch.use_master_clock ? 
KVM_CLOCK_TSC_STABLE : 0;
+   local_irq_enable();
memset(_ns.pad, 0, sizeof(user_ns.pad));
 
r = -EFAULT;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 300ef255d1e0..4ee67cb99143 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -972,12 +972,19 @@ struct kvm_irqfd {
__u8  pad[16];
 };
 
+/* For KVM_CAP_ADJUST_CLOCK */
+
+/* Do not use 1, KVM_CHECK_EXTENSION returned it before we had flags.  */
+#define KVM_CLOCK_TSC_STABLE   2
+
 struct kvm_clock_data {
__u64 clock;
__u32 flags;
__u32 pad[9];
 };
 
+/* For KVM_CAP_SW_TLB */
+
 #define KVM_MMU_FSL_BOOKE_NOHV 0
 #define KVM_MMU_FSL_BOOKE_HV   1
 
-- 
1.8.3.1



Re: [PATCH 2/2] ASoC: axentia: tse850: add ASoC driver for the Axentia TSE-850

2016-11-09 Thread Peter Rosin
On 2016-11-09 14:38, Mark Brown wrote:
> On Tue, Nov 08, 2016 at 05:20:57PM +0100, Peter Rosin wrote:
> 
>> +++ b/sound/soc/axentia/Kconfig
>> @@ -0,0 +1,10 @@
>> +config SND_SOC_AXENTIA_TSE850_PCM5142
>> +tristate "ASoC driver for the Axentia TSE-850"
>> +depends on ARCH_AT91 && OF
>> +select ATMEL_SSC
>> +select SND_ATMEL_SOC
>> +select SND_ATMEL_SOC_SSC_DMA
>> +select SND_SOC_PCM512x_I2C
>> +help
>> +  Say Y if you want to add support for the ASoC driver for the
>> +  Axentia TSE-850 with a PCM5142 codec.
> 
> This just looks like a normal machine driver for an Atmel system which
> would usually go in the atemel directory - why is a new directory being
> created?

I thought atmel in this context meant that Atmel made the board, not that
the board was based on an Atmel cpu.

I'll move it for v2.

>> +static int tse850_get_mux2(struct snd_kcontrol *kctrl,
>> +   struct snd_ctl_elem_value *ucontrol)
>> +{
>> +struct snd_soc_dapm_context *dapm = snd_soc_dapm_kcontrol_dapm(kctrl);
>> +struct snd_soc_card *card = dapm->card;
>> +struct tse850_priv *tse850 = snd_soc_card_get_drvdata(card);
>> +int ret;
>> +
>> +ret = gpiod_get_value(tse850->loop2);
>> +if (ret < 0)
>> +return ret;
> 
> We can't reliably read the value of output GPIOs (though in practice the
> majority do support it) so it'd be better practice to use a state
> variable to remember what we set.  I'd also expect this to use the
> _cansleep() GPIO calls as it's not in a context where sleeping would be
> a problem.

Ok, I'll add _cansleep and cached values for v2.

>> +int tse850_get_ana(struct snd_kcontrol *kctrl,
>> +   struct snd_ctl_elem_value *ucontrol)
>> +{
>> +struct snd_soc_dapm_context *dapm = snd_soc_dapm_kcontrol_dapm(kctrl);
>> +struct snd_soc_card *card = dapm->card;
>> +struct tse850_priv *tse850 = snd_soc_card_get_drvdata(card);
>> +int ret;
>> +
>> +ret = regulator_get_voltage(tse850->ana);
>> +if (ret < 0)
>> +return ret;
>> +
>> +if (ret < 1100)
>> +ret = 1100;
>> +else if (ret > 2000)
>> +ret = 2000;
> 
> This needs some comments...

Ok, I'll add some words...

>> +struct snd_soc_pcm_runtime *rtd = substream->private_data;
>> +struct device *dev = rtd->dev;
>> +struct snd_soc_dai *cpu_dai = rtd->cpu_dai;
>> +int dir = substream->stream != SNDRV_PCM_STREAM_PLAYBACK;
>> +int div_id = dir ? ATMEL_SSC_RCMR_PERIOD : ATMEL_SSC_TCMR_PERIOD;
>> +int period = snd_soc_params_to_frame_size(params) / 2 - 1;
> 
> Please write the logic out as normal if statements for legibility.  It's
> a bit concerning that we even need this function, it looks like pretty
> basic stuff that I'd expect the CPU DAI to just be doing - why can't
> this be the default behaviour of the CPU DAI?

I don't know and obviously don't have all the relevant HW to test
changes. Do you want me to attempt such a change anyway?
Adding Cc: Nicolas Ferre

>> +static int tse850_init(struct snd_soc_pcm_runtime *rtd)
>> +{
>> +struct snd_soc_dapm_context *dapm = >card->dapm;
>> +
>> +return snd_soc_dapm_add_routes(dapm, tse850_intercon,
>> +   ARRAY_SIZE(tse850_intercon));
> 
> Set this up in the card data structure rather than open coding the call,
> you can register DAPM routes there too.

Right.

Thanks for looking!

Cheers,
Peter



Re: [PATCH 2/2] ASoC: axentia: tse850: add ASoC driver for the Axentia TSE-850

2016-11-09 Thread Peter Rosin
On 2016-11-09 14:38, Mark Brown wrote:
> On Tue, Nov 08, 2016 at 05:20:57PM +0100, Peter Rosin wrote:
> 
>> +++ b/sound/soc/axentia/Kconfig
>> @@ -0,0 +1,10 @@
>> +config SND_SOC_AXENTIA_TSE850_PCM5142
>> +tristate "ASoC driver for the Axentia TSE-850"
>> +depends on ARCH_AT91 && OF
>> +select ATMEL_SSC
>> +select SND_ATMEL_SOC
>> +select SND_ATMEL_SOC_SSC_DMA
>> +select SND_SOC_PCM512x_I2C
>> +help
>> +  Say Y if you want to add support for the ASoC driver for the
>> +  Axentia TSE-850 with a PCM5142 codec.
> 
> This just looks like a normal machine driver for an Atmel system which
> would usually go in the atemel directory - why is a new directory being
> created?

I thought atmel in this context meant that Atmel made the board, not that
the board was based on an Atmel cpu.

I'll move it for v2.

>> +static int tse850_get_mux2(struct snd_kcontrol *kctrl,
>> +   struct snd_ctl_elem_value *ucontrol)
>> +{
>> +struct snd_soc_dapm_context *dapm = snd_soc_dapm_kcontrol_dapm(kctrl);
>> +struct snd_soc_card *card = dapm->card;
>> +struct tse850_priv *tse850 = snd_soc_card_get_drvdata(card);
>> +int ret;
>> +
>> +ret = gpiod_get_value(tse850->loop2);
>> +if (ret < 0)
>> +return ret;
> 
> We can't reliably read the value of output GPIOs (though in practice the
> majority do support it) so it'd be better practice to use a state
> variable to remember what we set.  I'd also expect this to use the
> _cansleep() GPIO calls as it's not in a context where sleeping would be
> a problem.

Ok, I'll add _cansleep and cached values for v2.

>> +int tse850_get_ana(struct snd_kcontrol *kctrl,
>> +   struct snd_ctl_elem_value *ucontrol)
>> +{
>> +struct snd_soc_dapm_context *dapm = snd_soc_dapm_kcontrol_dapm(kctrl);
>> +struct snd_soc_card *card = dapm->card;
>> +struct tse850_priv *tse850 = snd_soc_card_get_drvdata(card);
>> +int ret;
>> +
>> +ret = regulator_get_voltage(tse850->ana);
>> +if (ret < 0)
>> +return ret;
>> +
>> +if (ret < 1100)
>> +ret = 1100;
>> +else if (ret > 2000)
>> +ret = 2000;
> 
> This needs some comments...

Ok, I'll add some words...

>> +struct snd_soc_pcm_runtime *rtd = substream->private_data;
>> +struct device *dev = rtd->dev;
>> +struct snd_soc_dai *cpu_dai = rtd->cpu_dai;
>> +int dir = substream->stream != SNDRV_PCM_STREAM_PLAYBACK;
>> +int div_id = dir ? ATMEL_SSC_RCMR_PERIOD : ATMEL_SSC_TCMR_PERIOD;
>> +int period = snd_soc_params_to_frame_size(params) / 2 - 1;
> 
> Please write the logic out as normal if statements for legibility.  It's
> a bit concerning that we even need this function, it looks like pretty
> basic stuff that I'd expect the CPU DAI to just be doing - why can't
> this be the default behaviour of the CPU DAI?

I don't know and obviously don't have all the relevant HW to test
changes. Do you want me to attempt such a change anyway?
Adding Cc: Nicolas Ferre

>> +static int tse850_init(struct snd_soc_pcm_runtime *rtd)
>> +{
>> +struct snd_soc_dapm_context *dapm = >card->dapm;
>> +
>> +return snd_soc_dapm_add_routes(dapm, tse850_intercon,
>> +   ARRAY_SIZE(tse850_intercon));
> 
> Set this up in the card data structure rather than open coding the call,
> you can register DAPM routes there too.

Right.

Thanks for looking!

Cheers,
Peter



[PATCH] i.MX: Kconfig: Drop errata 769419 for Vybrid

2016-11-09 Thread Andrey Smirnov
According to the datasheet, VF610 uses revision r3p2 of the L2C-310
block, same as i.MX6Q+, which does not require a software workaround for
ARM errata 769419.

Signed-off-by: Andrey Smirnov 
---
 arch/arm/mach-imx/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm/mach-imx/Kconfig b/arch/arm/mach-imx/Kconfig
index 9155b63..936c59d 100644
--- a/arch/arm/mach-imx/Kconfig
+++ b/arch/arm/mach-imx/Kconfig
@@ -557,7 +557,6 @@ config SOC_VF610
bool "Vybrid Family VF610 support"
select ARM_GIC if ARCH_MULTI_V7
select PINCTRL_VF610
-   select PL310_ERRATA_769419 if CACHE_L2X0
 
help
  This enables support for Freescale Vybrid VF610 processor.
-- 
2.5.5



[PATCH] i.MX: Kconfig: Drop errata 769419 for Vybrid

2016-11-09 Thread Andrey Smirnov
According to the datasheet, VF610 uses revision r3p2 of the L2C-310
block, same as i.MX6Q+, which does not require a software workaround for
ARM errata 769419.

Signed-off-by: Andrey Smirnov 
---
 arch/arm/mach-imx/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm/mach-imx/Kconfig b/arch/arm/mach-imx/Kconfig
index 9155b63..936c59d 100644
--- a/arch/arm/mach-imx/Kconfig
+++ b/arch/arm/mach-imx/Kconfig
@@ -557,7 +557,6 @@ config SOC_VF610
bool "Vybrid Family VF610 support"
select ARM_GIC if ARCH_MULTI_V7
select PINCTRL_VF610
-   select PL310_ERRATA_769419 if CACHE_L2X0
 
help
  This enables support for Freescale Vybrid VF610 processor.
-- 
2.5.5



Re: [PATCH 1/3] tuntap: rx batching

2016-11-09 Thread Michael S. Tsirkin
On Wed, Nov 09, 2016 at 03:38:31PM +0800, Jason Wang wrote:
> Backlog were used for tuntap rx, but it can only process 1 packet at
> one time since it was scheduled during sendmsg() synchronously in
> process context. This lead bad cache utilization so this patch tries
> to do some batching before call rx NAPI. This is done through:
> 
> - accept MSG_MORE as a hint from sendmsg() caller, if it was set,
>   batch the packet temporarily in a linked list and submit them all
>   once MSG_MORE were cleared.
> - implement a tuntap specific NAPI handler for processing this kind of
>   possible batching. (This could be done by extending backlog to
>   support skb like, but using a tun specific one looks cleaner and
>   easier for future extension).
> 
> Signed-off-by: Jason Wang 

So why do we need an extra queue? This is not what hardware devices do.
How about adding the packet to queue unconditionally, deferring
signalling until we get sendmsg without MSG_MORE?


> ---
>  drivers/net/tun.c | 71 
> ++-
>  1 file changed, 65 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 1588469..d40583b 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -74,6 +74,7 @@
>  #include 
>  
>  #include 
> +#include 
>  
>  /* Uncomment to enable debugging */
>  /* #define TUN_DEBUG 1 */
> @@ -169,6 +170,8 @@ struct tun_file {
>   struct list_head next;
>   struct tun_struct *detached;
>   struct skb_array tx_array;
> + struct napi_struct napi;
> + struct sk_buff_head process_queue;
>  };
>  
>  struct tun_flow_entry {
> @@ -522,6 +525,8 @@ static void tun_queue_purge(struct tun_file *tfile)
>   while ((skb = skb_array_consume(>tx_array)) != NULL)
>   kfree_skb(skb);
>  
> + skb_queue_purge(>sk.sk_write_queue);
> + skb_queue_purge(>process_queue);
>   skb_queue_purge(>sk.sk_error_queue);
>  }
>  
> @@ -532,6 +537,11 @@ static void __tun_detach(struct tun_file *tfile, bool 
> clean)
>  
>   tun = rtnl_dereference(tfile->tun);
>  
> + if (tun && clean) {
> + napi_disable(>napi);
> + netif_napi_del(>napi);
> + }
> +
>   if (tun && !tfile->detached) {
>   u16 index = tfile->queue_index;
>   BUG_ON(index >= tun->numqueues);
> @@ -587,6 +597,7 @@ static void tun_detach_all(struct net_device *dev)
>  
>   for (i = 0; i < n; i++) {
>   tfile = rtnl_dereference(tun->tfiles[i]);
> + napi_disable(>napi);
>   BUG_ON(!tfile);
>   tfile->socket.sk->sk_shutdown = RCV_SHUTDOWN;
>   tfile->socket.sk->sk_data_ready(tfile->socket.sk);
> @@ -603,6 +614,7 @@ static void tun_detach_all(struct net_device *dev)
>   synchronize_net();
>   for (i = 0; i < n; i++) {
>   tfile = rtnl_dereference(tun->tfiles[i]);
> + netif_napi_del(>napi);
>   /* Drop read queue */
>   tun_queue_purge(tfile);
>   sock_put(>sk);
> @@ -618,6 +630,41 @@ static void tun_detach_all(struct net_device *dev)
>   module_put(THIS_MODULE);
>  }
>  
> +static int tun_poll(struct napi_struct *napi, int budget)
> +{
> + struct tun_file *tfile = container_of(napi, struct tun_file, napi);
> + struct sk_buff_head *input_queue =
> +>socket.sk->sk_write_queue;
> + struct sk_buff *skb;
> + unsigned int received = 0;
> +
> + while (1) {
> + while ((skb = __skb_dequeue(>process_queue))) {
> + netif_receive_skb(skb);
> + if (++received >= budget)
> + return received;
> + }
> +
> + spin_lock(_queue->lock);
> + if (skb_queue_empty(input_queue)) {
> + spin_unlock(_queue->lock);
> + break;
> + }
> + skb_queue_splice_tail_init(input_queue, >process_queue);
> + spin_unlock(_queue->lock);
> + }
> +
> + if (received < budget) {
> + napi_complete(napi);
> + if (skb_peek(>socket.sk->sk_write_queue) &&
> + unlikely(napi_schedule_prep(napi))) {
> + __napi_schedule(napi);
> + }
> + }
> +
> + return received;
> +}
> +
>  static int tun_attach(struct tun_struct *tun, struct file *file, bool 
> skip_filter)
>  {
>   struct tun_file *tfile = file->private_data;
> @@ -666,9 +713,11 @@ static int tun_attach(struct tun_struct *tun, struct 
> file *file, bool skip_filte
>  
>   if (tfile->detached)
>   tun_enable_queue(tfile);
> - else
> + else {
>   sock_hold(>sk);
> -
> + netif_napi_add(tun->dev, >napi, tun_poll, 64);
> + napi_enable(>napi);
> + }
>   tun_set_real_num_queues(tun);
>  
>   /* device is allowed to go away first, so no need to hold extra
> @@ 

Re: [PATCH 1/3] tuntap: rx batching

2016-11-09 Thread Michael S. Tsirkin
On Wed, Nov 09, 2016 at 03:38:31PM +0800, Jason Wang wrote:
> Backlog were used for tuntap rx, but it can only process 1 packet at
> one time since it was scheduled during sendmsg() synchronously in
> process context. This lead bad cache utilization so this patch tries
> to do some batching before call rx NAPI. This is done through:
> 
> - accept MSG_MORE as a hint from sendmsg() caller, if it was set,
>   batch the packet temporarily in a linked list and submit them all
>   once MSG_MORE were cleared.
> - implement a tuntap specific NAPI handler for processing this kind of
>   possible batching. (This could be done by extending backlog to
>   support skb like, but using a tun specific one looks cleaner and
>   easier for future extension).
> 
> Signed-off-by: Jason Wang 

So why do we need an extra queue? This is not what hardware devices do.
How about adding the packet to queue unconditionally, deferring
signalling until we get sendmsg without MSG_MORE?


> ---
>  drivers/net/tun.c | 71 
> ++-
>  1 file changed, 65 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 1588469..d40583b 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -74,6 +74,7 @@
>  #include 
>  
>  #include 
> +#include 
>  
>  /* Uncomment to enable debugging */
>  /* #define TUN_DEBUG 1 */
> @@ -169,6 +170,8 @@ struct tun_file {
>   struct list_head next;
>   struct tun_struct *detached;
>   struct skb_array tx_array;
> + struct napi_struct napi;
> + struct sk_buff_head process_queue;
>  };
>  
>  struct tun_flow_entry {
> @@ -522,6 +525,8 @@ static void tun_queue_purge(struct tun_file *tfile)
>   while ((skb = skb_array_consume(>tx_array)) != NULL)
>   kfree_skb(skb);
>  
> + skb_queue_purge(>sk.sk_write_queue);
> + skb_queue_purge(>process_queue);
>   skb_queue_purge(>sk.sk_error_queue);
>  }
>  
> @@ -532,6 +537,11 @@ static void __tun_detach(struct tun_file *tfile, bool 
> clean)
>  
>   tun = rtnl_dereference(tfile->tun);
>  
> + if (tun && clean) {
> + napi_disable(>napi);
> + netif_napi_del(>napi);
> + }
> +
>   if (tun && !tfile->detached) {
>   u16 index = tfile->queue_index;
>   BUG_ON(index >= tun->numqueues);
> @@ -587,6 +597,7 @@ static void tun_detach_all(struct net_device *dev)
>  
>   for (i = 0; i < n; i++) {
>   tfile = rtnl_dereference(tun->tfiles[i]);
> + napi_disable(>napi);
>   BUG_ON(!tfile);
>   tfile->socket.sk->sk_shutdown = RCV_SHUTDOWN;
>   tfile->socket.sk->sk_data_ready(tfile->socket.sk);
> @@ -603,6 +614,7 @@ static void tun_detach_all(struct net_device *dev)
>   synchronize_net();
>   for (i = 0; i < n; i++) {
>   tfile = rtnl_dereference(tun->tfiles[i]);
> + netif_napi_del(>napi);
>   /* Drop read queue */
>   tun_queue_purge(tfile);
>   sock_put(>sk);
> @@ -618,6 +630,41 @@ static void tun_detach_all(struct net_device *dev)
>   module_put(THIS_MODULE);
>  }
>  
> +static int tun_poll(struct napi_struct *napi, int budget)
> +{
> + struct tun_file *tfile = container_of(napi, struct tun_file, napi);
> + struct sk_buff_head *input_queue =
> +>socket.sk->sk_write_queue;
> + struct sk_buff *skb;
> + unsigned int received = 0;
> +
> + while (1) {
> + while ((skb = __skb_dequeue(>process_queue))) {
> + netif_receive_skb(skb);
> + if (++received >= budget)
> + return received;
> + }
> +
> + spin_lock(_queue->lock);
> + if (skb_queue_empty(input_queue)) {
> + spin_unlock(_queue->lock);
> + break;
> + }
> + skb_queue_splice_tail_init(input_queue, >process_queue);
> + spin_unlock(_queue->lock);
> + }
> +
> + if (received < budget) {
> + napi_complete(napi);
> + if (skb_peek(>socket.sk->sk_write_queue) &&
> + unlikely(napi_schedule_prep(napi))) {
> + __napi_schedule(napi);
> + }
> + }
> +
> + return received;
> +}
> +
>  static int tun_attach(struct tun_struct *tun, struct file *file, bool 
> skip_filter)
>  {
>   struct tun_file *tfile = file->private_data;
> @@ -666,9 +713,11 @@ static int tun_attach(struct tun_struct *tun, struct 
> file *file, bool skip_filte
>  
>   if (tfile->detached)
>   tun_enable_queue(tfile);
> - else
> + else {
>   sock_hold(>sk);
> -
> + netif_napi_add(tun->dev, >napi, tun_poll, 64);
> + napi_enable(>napi);
> + }
>   tun_set_real_num_queues(tun);
>  
>   /* device is allowed to go away first, so no need to hold extra
> @@ -1150,7 +1199,7 @@ 

Re: [RFC v3 1/6] Track the active utilisation

2016-11-09 Thread luca abeni
On Tue, 8 Nov 2016 17:56:35 +
Juri Lelli  wrote:
[...]
> > > > @@ -947,14 +965,19 @@ static void enqueue_task_dl(struct rq *rq, struct 
> > > > task_struct *p, int flags)
> > > > return;
> > > > }
> > > >  
> > > > +   if (p->on_rq == TASK_ON_RQ_MIGRATING)
> > > > +   add_running_bw(>dl, >dl);
> > > > +
> > > > /*
> > > >  * If p is throttled, we do nothing. In fact, if it exhausted
> > > >  * its budget it needs a replenishment and, since it now is on
> > > >  * its rq, the bandwidth timer callback (which clearly has not
> > > >  * run yet) will take care of this.
> > > >  */
> > > > -   if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
> > > > +   if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
> > > > +   add_running_bw(>dl, >dl);
> > > 
> > > Don't rememeber if we discussed this already, but do we need to add the 
> > > bw here
> > > even if the task is not actually enqueued until after the replenishment 
> > > timer
> > > fires?  
> > I think yes... The active utilization does not depend on the fact that the 
> > task
> > is on the runqueue or not, but depends on the task's state (in GRUB 
> > parlance,
> > "inactive" vs "active contending"). In other words, even when a task is 
> > throttled
> > its utilization must be counted in the active utilization.
> >   
> 
> OK. Could you add a comment about this point please (so that I don't
> forget again :)?
So, I just changed the comment in

/*
 * If p is throttled, we do not enqueue it. In fact, if it exhausted
 * its budget it needs a replenishment and, since it now is on
 * its rq, the bandwidth timer callback (which clearly has not
 * run yet) will take care of this.
 * However, the active utilization does not depend on the fact
 * that the task is on the runqueue or not (but depends on the
 * task's state - in GRUB parlance, "inactive" vs "active contending").
 * In other words, even if a task is throttled its utilization must
 * be counted in the active utilization; hence, we need to call
 * add_running_bw().
 */

Is this ok?


Thanks,
Luca


Re: [RFC v3 1/6] Track the active utilisation

2016-11-09 Thread luca abeni
On Tue, 8 Nov 2016 17:56:35 +
Juri Lelli  wrote:
[...]
> > > > @@ -947,14 +965,19 @@ static void enqueue_task_dl(struct rq *rq, struct 
> > > > task_struct *p, int flags)
> > > > return;
> > > > }
> > > >  
> > > > +   if (p->on_rq == TASK_ON_RQ_MIGRATING)
> > > > +   add_running_bw(>dl, >dl);
> > > > +
> > > > /*
> > > >  * If p is throttled, we do nothing. In fact, if it exhausted
> > > >  * its budget it needs a replenishment and, since it now is on
> > > >  * its rq, the bandwidth timer callback (which clearly has not
> > > >  * run yet) will take care of this.
> > > >  */
> > > > -   if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
> > > > +   if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
> > > > +   add_running_bw(>dl, >dl);
> > > 
> > > Don't rememeber if we discussed this already, but do we need to add the 
> > > bw here
> > > even if the task is not actually enqueued until after the replenishment 
> > > timer
> > > fires?  
> > I think yes... The active utilization does not depend on the fact that the 
> > task
> > is on the runqueue or not, but depends on the task's state (in GRUB 
> > parlance,
> > "inactive" vs "active contending"). In other words, even when a task is 
> > throttled
> > its utilization must be counted in the active utilization.
> >   
> 
> OK. Could you add a comment about this point please (so that I don't
> forget again :)?
So, I just changed the comment in

/*
 * If p is throttled, we do not enqueue it. In fact, if it exhausted
 * its budget it needs a replenishment and, since it now is on
 * its rq, the bandwidth timer callback (which clearly has not
 * run yet) will take care of this.
 * However, the active utilization does not depend on the fact
 * that the task is on the runqueue or not (but depends on the
 * task's state - in GRUB parlance, "inactive" vs "active contending").
 * In other words, even if a task is throttled its utilization must
 * be counted in the active utilization; hence, we need to call
 * add_running_bw().
 */

Is this ok?


Thanks,
Luca


Re: [PATCH] x86/cpuid: Deal with broken firmware once more

2016-11-09 Thread Charles (Chas) Williams



On 11/09/2016 11:03 AM, Peter Zijlstra wrote:

On Wed, Nov 09, 2016 at 04:35:51PM +0100, Thomas Gleixner wrote:

Both ACPI and MP specifications require that the APIC id in the respective
tables must be the same as the APIC id in CPUID.

The kernel retrieves the physical package id from the APIC id during the
ACPI/MP table scan and builds the physical to logical package map.

There exist Virtualbox and Xen implementations which violate the spec. As a


ISTR it was VMware, not VirtualBox, but whatever.. they're both crazy
virt stuff.


Yes, this was VMware in particular.  It would be good to get this comment
right so as not to mislead anyone.


 /*
+ * The physical to logical package id mapping is initialized from the
+ * acpi/mptables information. Make sure that CPUID actually agrees with
+ * that.
+ */
+static void sanitize_package_id(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_SMP
+   unsigned int pkg, apicid, cpu = smp_processor_id();
+
+   apicid = apic->cpu_present_to_apicid(cpu);
+   pkg = apicid >> boot_cpu_data.x86_coreid_bits;
+
+   if (apicid != c->initial_apicid) {
+   pr_err(FW_BUG "CPU%u: APIC id mismatch. Firmware: %x CPUID: 
%x\n",
+  cpu, apicid, c->initial_apicid);


Should we not also 'fix' c->initial_apicid ?


Since we have c->apicid and c->initial_apicid it seems reasonable to keep one
set to the "correct" value.  I don't think c->initial_apicid is used past
this.

I should have some tests on this patch later today.


Re: [PATCH] x86/cpuid: Deal with broken firmware once more

2016-11-09 Thread Charles (Chas) Williams



On 11/09/2016 11:03 AM, Peter Zijlstra wrote:

On Wed, Nov 09, 2016 at 04:35:51PM +0100, Thomas Gleixner wrote:

Both ACPI and MP specifications require that the APIC id in the respective
tables must be the same as the APIC id in CPUID.

The kernel retrieves the physical package id from the APIC id during the
ACPI/MP table scan and builds the physical to logical package map.

There exist Virtualbox and Xen implementations which violate the spec. As a


ISTR it was VMware, not VirtualBox, but whatever.. they're both crazy
virt stuff.


Yes, this was VMware in particular.  It would be good to get this comment
right so as not to mislead anyone.


 /*
+ * The physical to logical package id mapping is initialized from the
+ * acpi/mptables information. Make sure that CPUID actually agrees with
+ * that.
+ */
+static void sanitize_package_id(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_SMP
+   unsigned int pkg, apicid, cpu = smp_processor_id();
+
+   apicid = apic->cpu_present_to_apicid(cpu);
+   pkg = apicid >> boot_cpu_data.x86_coreid_bits;
+
+   if (apicid != c->initial_apicid) {
+   pr_err(FW_BUG "CPU%u: APIC id mismatch. Firmware: %x CPUID: 
%x\n",
+  cpu, apicid, c->initial_apicid);


Should we not also 'fix' c->initial_apicid ?


Since we have c->apicid and c->initial_apicid it seems reasonable to keep one
set to the "correct" value.  I don't think c->initial_apicid is used past
this.

I should have some tests on this patch later today.


Re: [PATCH 1/1 V2] mqueue: Implment generic xattr support

2016-11-09 Thread David Graziano
On Mon, Nov 7, 2016 at 4:23 PM, Paul Moore  wrote:
> On Mon, Nov 7, 2016 at 3:46 PM, David Graziano
>  wrote:
>> This patch adds support for generic extended attributes within the
>> POSIX message queues filesystem and setting them by consulting the LSM.
>> This is needed so that the security.selinux extended attribute can be
>> set via a SELinux named type transition on file inodes created within
>> the filesystem. The implementation and LSM call back function are based
>> off tmpfs/shmem.
>>
>> Signed-off-by: David Graziano 
>> ---
>>  ipc/mqueue.c | 46 ++
>>  1 file changed, 46 insertions(+)
>
> Hi David,
>
> At first glance this looks reasonable to me, I just have a two
> questions/comments:
>
> * Can you explain your current need for this functionality?  For
> example, what are you trying to do that is made easier by allowing
> greater message queue labeling flexibility?  This helps put things in
> context and helps people review and comment on your patch.
>
> * How have you tested this?  While this patch is not SELinux specific,
> I think adding a test to the selinux-testsuite[1] would be worthwhile.
> The other LSM maintainers may suggest something similar if they have
> an established public testsuite.
>
> [1] https://github.com/SELinuxProject/selinux-testsuite

Hi Paul,

I needed to write a selinux policy for a set of custom applications that use
POSIX message queues for their IPC. The queues are created by one
application and we needed a way for selinux to enforce which of the other
apps are able to read/write to each individual queue. Uniquely labeling them
based on the app that created them and the file name seemed to be our best
solution since it’s an embedded system and we don’t have restorecond to
handle any relabeling.


To test this patch I used both a selinux enabled, buildroot based qemu target
with a customized selinux policy and test C app to create the mqueues. I also
tested with our real apps and selinux policy on our target hardware. I can
certainly look at adding a test to the selinux-testsuite if that would
be helpful.

Thanks,
David

>
>> diff --git a/ipc/mqueue.c b/ipc/mqueue.c
>> index 0b13ace..512a546 100644
>> --- a/ipc/mqueue.c
>> +++ b/ipc/mqueue.c
>> @@ -35,6 +35,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  #include 
>>  #include "util.h"
>> @@ -70,6 +71,7 @@ struct mqueue_inode_info {
>> struct rb_root msg_tree;
>> struct posix_msg_tree_node *node_cache;
>> struct mq_attr attr;
>> +   struct simple_xattrs xattrs;/* list of xattrs */
>>
>> struct sigevent notify;
>> struct pid *notify_owner;
>> @@ -254,6 +256,7 @@ static struct inode *mqueue_get_inode(struct super_block 
>> *sb,
>> info->attr.mq_maxmsg = attr->mq_maxmsg;
>> info->attr.mq_msgsize = attr->mq_msgsize;
>> }
>> +   simple_xattrs_init(>xattrs);
>> /*
>>  * We used to allocate a static array of pointers and account
>>  * the size of that array as well as one msg_msg struct per
>> @@ -413,6 +416,41 @@ static void mqueue_evict_inode(struct inode *inode)
>> put_ipc_ns(ipc_ns);
>>  }
>>
>> +/*
>> + * Callback for security_inode_init_security() for acquiring xattrs.
>> + */
>> +static int mqueue_initxattrs(struct inode *inode,
>> +   const struct xattr *xattr_array,
>> +   void *fs_info)
>> +{
>> +   struct mqueue_inode_info *info = MQUEUE_I(inode);
>> +   const struct xattr *xattr;
>> +   struct simple_xattr *new_xattr;
>> +   size_t len;
>> +
>> +   for (xattr = xattr_array; xattr->name != NULL; xattr++) {
>> +   new_xattr = simple_xattr_alloc(xattr->value, 
>> xattr->value_len);
>> +   if (!new_xattr)
>> +   return -ENOMEM;
>> +   len = strlen(xattr->name) + 1;
>> +   new_xattr->name = kmalloc(XATTR_SECURITY_PREFIX_LEN + len,
>> + GFP_KERNEL);
>> +   if (!new_xattr->name) {
>> +   kfree(new_xattr);
>> +   return -ENOMEM;
>> +   }
>> +
>> +   memcpy(new_xattr->name, XATTR_SECURITY_PREFIX,
>> +  XATTR_SECURITY_PREFIX_LEN);
>> +   memcpy(new_xattr->name + XATTR_SECURITY_PREFIX_LEN,
>> +  xattr->name, len);
>> +
>> +   simple_xattr_list_add(>xattrs, new_xattr);
>> +   }
>> +
>> +   return 0;
>> +}
>> +
>>  static int mqueue_create(struct inode *dir, struct dentry *dentry,
>> umode_t mode, bool excl)
>>  {
>> @@ -443,6 +481,14 @@ static int mqueue_create(struct inode *dir, struct 
>> dentry *dentry,
>> 

Re: [PATCH 1/1 V2] mqueue: Implment generic xattr support

2016-11-09 Thread David Graziano
On Mon, Nov 7, 2016 at 4:23 PM, Paul Moore  wrote:
> On Mon, Nov 7, 2016 at 3:46 PM, David Graziano
>  wrote:
>> This patch adds support for generic extended attributes within the
>> POSIX message queues filesystem and setting them by consulting the LSM.
>> This is needed so that the security.selinux extended attribute can be
>> set via a SELinux named type transition on file inodes created within
>> the filesystem. The implementation and LSM call back function are based
>> off tmpfs/shmem.
>>
>> Signed-off-by: David Graziano 
>> ---
>>  ipc/mqueue.c | 46 ++
>>  1 file changed, 46 insertions(+)
>
> Hi David,
>
> At first glance this looks reasonable to me, I just have a two
> questions/comments:
>
> * Can you explain your current need for this functionality?  For
> example, what are you trying to do that is made easier by allowing
> greater message queue labeling flexibility?  This helps put things in
> context and helps people review and comment on your patch.
>
> * How have you tested this?  While this patch is not SELinux specific,
> I think adding a test to the selinux-testsuite[1] would be worthwhile.
> The other LSM maintainers may suggest something similar if they have
> an established public testsuite.
>
> [1] https://github.com/SELinuxProject/selinux-testsuite

Hi Paul,

I needed to write a selinux policy for a set of custom applications that use
POSIX message queues for their IPC. The queues are created by one
application and we needed a way for selinux to enforce which of the other
apps are able to read/write to each individual queue. Uniquely labeling them
based on the app that created them and the file name seemed to be our best
solution since it’s an embedded system and we don’t have restorecond to
handle any relabeling.


To test this patch I used both a selinux enabled, buildroot based qemu target
with a customized selinux policy and test C app to create the mqueues. I also
tested with our real apps and selinux policy on our target hardware. I can
certainly look at adding a test to the selinux-testsuite if that would
be helpful.

Thanks,
David

>
>> diff --git a/ipc/mqueue.c b/ipc/mqueue.c
>> index 0b13ace..512a546 100644
>> --- a/ipc/mqueue.c
>> +++ b/ipc/mqueue.c
>> @@ -35,6 +35,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  #include 
>>  #include "util.h"
>> @@ -70,6 +71,7 @@ struct mqueue_inode_info {
>> struct rb_root msg_tree;
>> struct posix_msg_tree_node *node_cache;
>> struct mq_attr attr;
>> +   struct simple_xattrs xattrs;/* list of xattrs */
>>
>> struct sigevent notify;
>> struct pid *notify_owner;
>> @@ -254,6 +256,7 @@ static struct inode *mqueue_get_inode(struct super_block 
>> *sb,
>> info->attr.mq_maxmsg = attr->mq_maxmsg;
>> info->attr.mq_msgsize = attr->mq_msgsize;
>> }
>> +   simple_xattrs_init(>xattrs);
>> /*
>>  * We used to allocate a static array of pointers and account
>>  * the size of that array as well as one msg_msg struct per
>> @@ -413,6 +416,41 @@ static void mqueue_evict_inode(struct inode *inode)
>> put_ipc_ns(ipc_ns);
>>  }
>>
>> +/*
>> + * Callback for security_inode_init_security() for acquiring xattrs.
>> + */
>> +static int mqueue_initxattrs(struct inode *inode,
>> +   const struct xattr *xattr_array,
>> +   void *fs_info)
>> +{
>> +   struct mqueue_inode_info *info = MQUEUE_I(inode);
>> +   const struct xattr *xattr;
>> +   struct simple_xattr *new_xattr;
>> +   size_t len;
>> +
>> +   for (xattr = xattr_array; xattr->name != NULL; xattr++) {
>> +   new_xattr = simple_xattr_alloc(xattr->value, 
>> xattr->value_len);
>> +   if (!new_xattr)
>> +   return -ENOMEM;
>> +   len = strlen(xattr->name) + 1;
>> +   new_xattr->name = kmalloc(XATTR_SECURITY_PREFIX_LEN + len,
>> + GFP_KERNEL);
>> +   if (!new_xattr->name) {
>> +   kfree(new_xattr);
>> +   return -ENOMEM;
>> +   }
>> +
>> +   memcpy(new_xattr->name, XATTR_SECURITY_PREFIX,
>> +  XATTR_SECURITY_PREFIX_LEN);
>> +   memcpy(new_xattr->name + XATTR_SECURITY_PREFIX_LEN,
>> +  xattr->name, len);
>> +
>> +   simple_xattr_list_add(>xattrs, new_xattr);
>> +   }
>> +
>> +   return 0;
>> +}
>> +
>>  static int mqueue_create(struct inode *dir, struct dentry *dentry,
>> umode_t mode, bool excl)
>>  {
>> @@ -443,6 +481,14 @@ static int mqueue_create(struct inode *dir, struct 
>> dentry *dentry,
>> ipc_ns->mq_queues_count--;
>> goto out_unlock;
>> }
>> +   error = 

Re: [PATCH] igb: use igb_adapter->io_addr instead of e1000_hw->hw_addr

2016-11-09 Thread Alexander Duyck
On Mon, Nov 7, 2016 at 11:06 PM, Cao jin  wrote:
> When running as guest, under certain condition, it will oops as following.
> writel() in igb_configure_tx_ring() results in oops, because hw->hw_addr
> is NULL. While other register access won't oops kernel because they use
> wr32/rd32 which have a defense against NULL pointer.
>
> [  141.225449] pcieport :00:1c.0: AER: Multiple Uncorrected (Fatal)
> error received: id=0101
> [  141.225523] igb :01:00.1: PCIe Bus Error:
> severity=Uncorrected (Fatal), type=Unaccessible,
> id=0101(Unregistered Agent ID)
> [  141.299442] igb :01:00.1: broadcast error_detected message
> [  141.300539] igb :01:00.0 enp1s0f0: PCIe link lost, device now
> detached
> [  141.351019] igb :01:00.1 enp1s0f1: PCIe link lost, device now
> detached
> [  143.465904] pcieport :00:1c.0: Root Port link has been reset
> [  143.465994] igb :01:00.1: broadcast slot_reset message
> [  143.466039] igb :01:00.0: enabling device ( -> 0002)
> [  144.389078] igb :01:00.1: enabling device ( -> 0002)
> [  145.312078] igb :01:00.1: broadcast resume message
> [  145.322211] BUG: unable to handle kernel paging request at
> 3818
> [  145.361275] IP: []
> igb_configure_tx_ring+0x14d/0x280 [igb]
> [  145.400048] PGD 0
> [  145.438007] Oops: 0002 [#1] SMP
>
> A similiar issue & solution could be found at:
> http://patchwork.ozlabs.org/patch/689592/
>
> Signed-off-by: Cao jin 
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
> b/drivers/net/ethernet/intel/igb/igb_main.c
> index edc9a6a..3f240ac 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -3390,7 +3390,7 @@ void igb_configure_tx_ring(struct igb_adapter *adapter,
>  tdba & 0xULL);
> wr32(E1000_TDBAH(reg_idx), tdba >> 32);
>
> -   ring->tail = hw->hw_addr + E1000_TDT(reg_idx);
> +   ring->tail = adapter->io_addr + E1000_TDT(reg_idx);
> wr32(E1000_TDH(reg_idx), 0);
> writel(0, ring->tail);
>
> @@ -3729,7 +3729,7 @@ void igb_configure_rx_ring(struct igb_adapter *adapter,
>  ring->count * sizeof(union e1000_adv_rx_desc));
>
> /* initialize head and tail */
> -   ring->tail = hw->hw_addr + E1000_RDT(reg_idx);
> +   ring->tail = adapter->io_addr + E1000_RDT(reg_idx);
> wr32(E1000_RDH(reg_idx), 0);
> writel(0, ring->tail);
>

This patch looks good to me.

Acked-by: Alexander Duyck 


Re: [PATCH] igb: use igb_adapter->io_addr instead of e1000_hw->hw_addr

2016-11-09 Thread Alexander Duyck
On Mon, Nov 7, 2016 at 11:06 PM, Cao jin  wrote:
> When running as guest, under certain condition, it will oops as following.
> writel() in igb_configure_tx_ring() results in oops, because hw->hw_addr
> is NULL. While other register access won't oops kernel because they use
> wr32/rd32 which have a defense against NULL pointer.
>
> [  141.225449] pcieport :00:1c.0: AER: Multiple Uncorrected (Fatal)
> error received: id=0101
> [  141.225523] igb :01:00.1: PCIe Bus Error:
> severity=Uncorrected (Fatal), type=Unaccessible,
> id=0101(Unregistered Agent ID)
> [  141.299442] igb :01:00.1: broadcast error_detected message
> [  141.300539] igb :01:00.0 enp1s0f0: PCIe link lost, device now
> detached
> [  141.351019] igb :01:00.1 enp1s0f1: PCIe link lost, device now
> detached
> [  143.465904] pcieport :00:1c.0: Root Port link has been reset
> [  143.465994] igb :01:00.1: broadcast slot_reset message
> [  143.466039] igb :01:00.0: enabling device ( -> 0002)
> [  144.389078] igb :01:00.1: enabling device ( -> 0002)
> [  145.312078] igb :01:00.1: broadcast resume message
> [  145.322211] BUG: unable to handle kernel paging request at
> 3818
> [  145.361275] IP: []
> igb_configure_tx_ring+0x14d/0x280 [igb]
> [  145.400048] PGD 0
> [  145.438007] Oops: 0002 [#1] SMP
>
> A similiar issue & solution could be found at:
> http://patchwork.ozlabs.org/patch/689592/
>
> Signed-off-by: Cao jin 
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
> b/drivers/net/ethernet/intel/igb/igb_main.c
> index edc9a6a..3f240ac 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -3390,7 +3390,7 @@ void igb_configure_tx_ring(struct igb_adapter *adapter,
>  tdba & 0xULL);
> wr32(E1000_TDBAH(reg_idx), tdba >> 32);
>
> -   ring->tail = hw->hw_addr + E1000_TDT(reg_idx);
> +   ring->tail = adapter->io_addr + E1000_TDT(reg_idx);
> wr32(E1000_TDH(reg_idx), 0);
> writel(0, ring->tail);
>
> @@ -3729,7 +3729,7 @@ void igb_configure_rx_ring(struct igb_adapter *adapter,
>  ring->count * sizeof(union e1000_adv_rx_desc));
>
> /* initialize head and tail */
> -   ring->tail = hw->hw_addr + E1000_RDT(reg_idx);
> +   ring->tail = adapter->io_addr + E1000_RDT(reg_idx);
> wr32(E1000_RDH(reg_idx), 0);
> writel(0, ring->tail);
>

This patch looks good to me.

Acked-by: Alexander Duyck 


[PATCH -next 0/2] ipc/sem: semop updates

2016-11-09 Thread Davidlohr Bueso
Hi,

Here are two small updates to semop(2), suggested a while ago by Manfred.

Both patches have passed the usual regression tests, including ltp and
some sysvsem benchmarks.

Thanks!

Davidlohr Bueso (2):
  ipc/sem: simplify wait-wake loop
  ipc/sem: avoid idr tree lookup for interrupted semop

 ipc/sem.c | 124 +++---
 1 file changed, 46 insertions(+), 78 deletions(-)

-- 
2.6.6



[PATCH -next 0/2] ipc/sem: semop updates

2016-11-09 Thread Davidlohr Bueso
Hi,

Here are two small updates to semop(2), suggested a while ago by Manfred.

Both patches have passed the usual regression tests, including ltp and
some sysvsem benchmarks.

Thanks!

Davidlohr Bueso (2):
  ipc/sem: simplify wait-wake loop
  ipc/sem: avoid idr tree lookup for interrupted semop

 ipc/sem.c | 124 +++---
 1 file changed, 46 insertions(+), 78 deletions(-)

-- 
2.6.6



[PATCH -next 1/2] ipc/sem: simplify wait-wake loop

2016-11-09 Thread Davidlohr Bueso
Instead of using the reverse goto, we can simplify the flow and
make it more language natural by just doing do-while instead.
One would hope this is the standard way (or obviously just with
a while bucle) that we do wait/wakeup handling in the kernel.
The exact same logic is kept, just more indented.

Signed-off-by: Davidlohr Bueso 
---
 ipc/sem.c | 107 ++
 1 file changed, 51 insertions(+), 56 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index ebd18a7104fd..a5eaf517c8b4 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1980,71 +1980,66 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf 
__user *, tsops,
sma->complex_count++;
}
 
-sleep_again:
-   queue.status = -EINTR;
-   queue.sleeper = current;
+   do {
+   queue.status = -EINTR;
+   queue.sleeper = current;
 
-   __set_current_state(TASK_INTERRUPTIBLE);
-   sem_unlock(sma, locknum);
-   rcu_read_unlock();
+   __set_current_state(TASK_INTERRUPTIBLE);
+   sem_unlock(sma, locknum);
+   rcu_read_unlock();
 
-   if (timeout)
-   jiffies_left = schedule_timeout(jiffies_left);
-   else
-   schedule();
+   if (timeout)
+   jiffies_left = schedule_timeout(jiffies_left);
+   else
+   schedule();
 
-   /*
-* fastpath: the semop has completed, either successfully or not, from
-* the syscall pov, is quite irrelevant to us at this point; we're done.
-*
-* We _do_ care, nonetheless, about being awoken by a signal or
-* spuriously.  The queue.status is checked again in the slowpath (aka
-* after taking sem_lock), such that we can detect scenarios where we
-* were awakened externally, during the window between wake_q_add() and
-* wake_up_q().
-*/
-   error = READ_ONCE(queue.status);
-   if (error != -EINTR) {
/*
-* User space could assume that semop() is a memory barrier:
-* Without the mb(), the cpu could speculatively read in user
-* space stale data that was overwritten by the previous owner
-* of the semaphore.
+* fastpath: the semop has completed, either successfully or 
not, from
+* the syscall pov, is quite irrelevant to us at this point; 
we're done.
+*
+* We _do_ care, nonetheless, about being awoken by a signal or
+* spuriously.  The queue.status is checked again in the 
slowpath (aka
+* after taking sem_lock), such that we can detect scenarios 
where we
+* were awakened externally, during the window between 
wake_q_add() and
+* wake_up_q().
 */
-   smp_mb();
-   goto out_free;
-   }
-
-   rcu_read_lock();
-   sma = sem_obtain_lock(ns, semid, sops, nsops, );
-   error = READ_ONCE(queue.status);
+   error = READ_ONCE(queue.status);
+   if (error != -EINTR) {
+   /*
+* User space could assume that semop() is a memory 
barrier:
+* Without the mb(), the cpu could speculatively read 
in user
+* space stale data that was overwritten by the 
previous owner
+* of the semaphore.
+*/
+   smp_mb();
+   goto out_free;
+   }
 
-   /*
-* Array removed? If yes, leave without sem_unlock().
-*/
-   if (IS_ERR(sma)) {
-   rcu_read_unlock();
-   goto out_free;
-   }
+   rcu_read_lock();
+   sma = sem_obtain_lock(ns, semid, sops, nsops, );
+   error = READ_ONCE(queue.status);
 
-   /*
-* If queue.status != -EINTR we are woken up by another process.
-* Leave without unlink_queue(), but with sem_unlock().
-*/
-   if (error != -EINTR)
-   goto out_unlock_free;
+   /*
+* Array removed? If yes, leave without sem_unlock().
+*/
+   if (IS_ERR(sma)) {
+   rcu_read_unlock();
+   goto out_free;
+   }
 
-   /*
-* If an interrupt occurred we have to clean up the queue.
-*/
-   if (timeout && jiffies_left == 0)
-   error = -EAGAIN;
+   /*
+* If queue.status != -EINTR we are woken up by another process.
+* Leave without unlink_queue(), but with sem_unlock().
+*/
+   if (error != -EINTR)
+   goto out_unlock_free;
 
-   /*
-* If the wakeup was spurious, just retry.
-*/
-   if 

[PATCH -next 1/2] ipc/sem: simplify wait-wake loop

2016-11-09 Thread Davidlohr Bueso
Instead of using the reverse goto, we can simplify the flow and
make it more language natural by just doing do-while instead.
One would hope this is the standard way (or obviously just with
a while bucle) that we do wait/wakeup handling in the kernel.
The exact same logic is kept, just more indented.

Signed-off-by: Davidlohr Bueso 
---
 ipc/sem.c | 107 ++
 1 file changed, 51 insertions(+), 56 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index ebd18a7104fd..a5eaf517c8b4 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1980,71 +1980,66 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf 
__user *, tsops,
sma->complex_count++;
}
 
-sleep_again:
-   queue.status = -EINTR;
-   queue.sleeper = current;
+   do {
+   queue.status = -EINTR;
+   queue.sleeper = current;
 
-   __set_current_state(TASK_INTERRUPTIBLE);
-   sem_unlock(sma, locknum);
-   rcu_read_unlock();
+   __set_current_state(TASK_INTERRUPTIBLE);
+   sem_unlock(sma, locknum);
+   rcu_read_unlock();
 
-   if (timeout)
-   jiffies_left = schedule_timeout(jiffies_left);
-   else
-   schedule();
+   if (timeout)
+   jiffies_left = schedule_timeout(jiffies_left);
+   else
+   schedule();
 
-   /*
-* fastpath: the semop has completed, either successfully or not, from
-* the syscall pov, is quite irrelevant to us at this point; we're done.
-*
-* We _do_ care, nonetheless, about being awoken by a signal or
-* spuriously.  The queue.status is checked again in the slowpath (aka
-* after taking sem_lock), such that we can detect scenarios where we
-* were awakened externally, during the window between wake_q_add() and
-* wake_up_q().
-*/
-   error = READ_ONCE(queue.status);
-   if (error != -EINTR) {
/*
-* User space could assume that semop() is a memory barrier:
-* Without the mb(), the cpu could speculatively read in user
-* space stale data that was overwritten by the previous owner
-* of the semaphore.
+* fastpath: the semop has completed, either successfully or 
not, from
+* the syscall pov, is quite irrelevant to us at this point; 
we're done.
+*
+* We _do_ care, nonetheless, about being awoken by a signal or
+* spuriously.  The queue.status is checked again in the 
slowpath (aka
+* after taking sem_lock), such that we can detect scenarios 
where we
+* were awakened externally, during the window between 
wake_q_add() and
+* wake_up_q().
 */
-   smp_mb();
-   goto out_free;
-   }
-
-   rcu_read_lock();
-   sma = sem_obtain_lock(ns, semid, sops, nsops, );
-   error = READ_ONCE(queue.status);
+   error = READ_ONCE(queue.status);
+   if (error != -EINTR) {
+   /*
+* User space could assume that semop() is a memory 
barrier:
+* Without the mb(), the cpu could speculatively read 
in user
+* space stale data that was overwritten by the 
previous owner
+* of the semaphore.
+*/
+   smp_mb();
+   goto out_free;
+   }
 
-   /*
-* Array removed? If yes, leave without sem_unlock().
-*/
-   if (IS_ERR(sma)) {
-   rcu_read_unlock();
-   goto out_free;
-   }
+   rcu_read_lock();
+   sma = sem_obtain_lock(ns, semid, sops, nsops, );
+   error = READ_ONCE(queue.status);
 
-   /*
-* If queue.status != -EINTR we are woken up by another process.
-* Leave without unlink_queue(), but with sem_unlock().
-*/
-   if (error != -EINTR)
-   goto out_unlock_free;
+   /*
+* Array removed? If yes, leave without sem_unlock().
+*/
+   if (IS_ERR(sma)) {
+   rcu_read_unlock();
+   goto out_free;
+   }
 
-   /*
-* If an interrupt occurred we have to clean up the queue.
-*/
-   if (timeout && jiffies_left == 0)
-   error = -EAGAIN;
+   /*
+* If queue.status != -EINTR we are woken up by another process.
+* Leave without unlink_queue(), but with sem_unlock().
+*/
+   if (error != -EINTR)
+   goto out_unlock_free;
 
-   /*
-* If the wakeup was spurious, just retry.
-*/
-   if (error == -EINTR 

Re: [PATCH v2] fs/nfsd/nfs4callback: Remove deprecated create_singlethread_workqueue

2016-11-09 Thread J. Bruce Fields
On Wed, Nov 09, 2016 at 08:18:08AM -0500, Jeff Layton wrote:
> On Tue, 2016-11-08 at 20:27 -0500, J. Bruce Fields wrote:
> > On Tue, Nov 08, 2016 at 05:52:21PM -0500, Tejun Heo wrote:
> > > 
> > > Hello, Bruce.
> > > 
> > > On Tue, Nov 08, 2016 at 04:39:11PM -0500, J. Bruce Fields wrote:
> > > > 
> > > > Apologies, just cleaning out old mail and finding some I should have
> > > > responded to long ago:
> > > > 
> > > > On Wed, Aug 31, 2016 at 02:23:48AM +0530, Bhaktipriya Shridhar wrote:
> > > > > 
> > > > > The workqueue "callback_wq" queues a single work item >cb_work per
> > > > > nfsd4_callback instance and thus, it doesn't require execution 
> > > > > ordering.
> > > > 
> > > > What's "execution ordering"?
> > > > 
> 
> AIUI, it means that jobs are always run in the order queued and are
> serialized.
> 
> > > > We definitely do depend on the fact that at most one of these is running
> > > > at a time.
> > > 
> 
> We do?
> 
> > > If there can be multiple cb's and thus cb->cb_work's per callback_wq,
> > > it'd need explicit ordering.  Is that the case?
> > 
> 
> These are basically client RPC tasks, and the cb_work just handles the
> submission into the client RPC state machine. Just because we're running
> several callbacks at the same time doesn't mean that they need to be
> strictly ordered. The client state machine can certainly handle running
> these in parallel.

I'm not worried about the rpc calls themselves, I'm worried about the
other stuff in nfsd4_run_cb_work(), especially
nfsd4_process_cb_update().

It's been a while since I thought about it and maybe it'd be OK with a
little bit of extra locking.

--b.

> > Yes, there can be multiple cb_work's.
> > 
> 
> Yes, but each is effectively a separate work unit. I see no reason why
> we'd need to order them at all.
> 
> -- 
> Jeff Layton 


Re: [PATCH v2] fs/nfsd/nfs4callback: Remove deprecated create_singlethread_workqueue

2016-11-09 Thread J. Bruce Fields
On Wed, Nov 09, 2016 at 08:18:08AM -0500, Jeff Layton wrote:
> On Tue, 2016-11-08 at 20:27 -0500, J. Bruce Fields wrote:
> > On Tue, Nov 08, 2016 at 05:52:21PM -0500, Tejun Heo wrote:
> > > 
> > > Hello, Bruce.
> > > 
> > > On Tue, Nov 08, 2016 at 04:39:11PM -0500, J. Bruce Fields wrote:
> > > > 
> > > > Apologies, just cleaning out old mail and finding some I should have
> > > > responded to long ago:
> > > > 
> > > > On Wed, Aug 31, 2016 at 02:23:48AM +0530, Bhaktipriya Shridhar wrote:
> > > > > 
> > > > > The workqueue "callback_wq" queues a single work item >cb_work per
> > > > > nfsd4_callback instance and thus, it doesn't require execution 
> > > > > ordering.
> > > > 
> > > > What's "execution ordering"?
> > > > 
> 
> AIUI, it means that jobs are always run in the order queued and are
> serialized.
> 
> > > > We definitely do depend on the fact that at most one of these is running
> > > > at a time.
> > > 
> 
> We do?
> 
> > > If there can be multiple cb's and thus cb->cb_work's per callback_wq,
> > > it'd need explicit ordering.  Is that the case?
> > 
> 
> These are basically client RPC tasks, and the cb_work just handles the
> submission into the client RPC state machine. Just because we're running
> several callbacks at the same time doesn't mean that they need to be
> strictly ordered. The client state machine can certainly handle running
> these in parallel.

I'm not worried about the rpc calls themselves, I'm worried about the
other stuff in nfsd4_run_cb_work(), especially
nfsd4_process_cb_update().

It's been a while since I thought about it and maybe it'd be OK with a
little bit of extra locking.

--b.

> > Yes, there can be multiple cb_work's.
> > 
> 
> Yes, but each is effectively a separate work unit. I see no reason why
> we'd need to order them at all.
> 
> -- 
> Jeff Layton 


[PATCH -next 2/2] ipc/sem: avoid idr tree lookup for interrupted semop

2016-11-09 Thread Davidlohr Bueso
We can avoid the idr tree lookup (albeit possibly avoiding
idr_find_fast()) when being awoken in EINTR, as the semid
will not change in this context while blocked. Use the sma
pointer directly and take the sem_lock, then re-check for
RMID races. We continue to re-check the queue.status with
the lock held such that we can detect situations where we
where are dealing with a spurious wakeup but another task
that holds the sem_lock updated the queue.status while we
were spinning for it. Once we take the lock it obviously
won't change again.

Being the only caller, get rid of sem_obtain_lock() altogether.

Signed-off-by: Davidlohr Bueso 
---
 ipc/sem.c | 37 +
 1 file changed, 5 insertions(+), 32 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index a5eaf517c8b4..11cdec301167 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -431,29 +431,6 @@ static inline void sem_unlock(struct sem_array *sma, int 
locknum)
  *
  * The caller holds the RCU read lock.
  */
-static inline struct sem_array *sem_obtain_lock(struct ipc_namespace *ns,
-   int id, struct sembuf *sops, int nsops, int *locknum)
-{
-   struct kern_ipc_perm *ipcp;
-   struct sem_array *sma;
-
-   ipcp = ipc_obtain_object_idr(_ids(ns), id);
-   if (IS_ERR(ipcp))
-   return ERR_CAST(ipcp);
-
-   sma = container_of(ipcp, struct sem_array, sem_perm);
-   *locknum = sem_lock(sma, sops, nsops);
-
-   /* ipc_rmid() may have already freed the ID while sem_lock
-* was spinning: verify that the structure is still valid
-*/
-   if (ipc_valid_object(ipcp))
-   return container_of(ipcp, struct sem_array, sem_perm);
-
-   sem_unlock(sma, *locknum);
-   return ERR_PTR(-EINVAL);
-}
-
 static inline struct sem_array *sem_obtain_object(struct ipc_namespace *ns, 
int id)
 {
struct kern_ipc_perm *ipcp = ipc_obtain_object_idr(_ids(ns), id);
@@ -2016,16 +1993,12 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf 
__user *, tsops,
}
 
rcu_read_lock();
-   sma = sem_obtain_lock(ns, semid, sops, nsops, );
-   error = READ_ONCE(queue.status);
+   sem_lock(sma, sops, nsops);
 
-   /*
-* Array removed? If yes, leave without sem_unlock().
-*/
-   if (IS_ERR(sma)) {
-   rcu_read_unlock();
-   goto out_free;
-   }
+   if (!ipc_valid_object(>sem_perm))
+   goto out_unlock_free;
+
+   error = READ_ONCE(queue.status);
 
/*
 * If queue.status != -EINTR we are woken up by another process.
-- 
2.6.6



[PATCH -next 2/2] ipc/sem: avoid idr tree lookup for interrupted semop

2016-11-09 Thread Davidlohr Bueso
We can avoid the idr tree lookup (albeit possibly avoiding
idr_find_fast()) when being awoken in EINTR, as the semid
will not change in this context while blocked. Use the sma
pointer directly and take the sem_lock, then re-check for
RMID races. We continue to re-check the queue.status with
the lock held such that we can detect situations where we
where are dealing with a spurious wakeup but another task
that holds the sem_lock updated the queue.status while we
were spinning for it. Once we take the lock it obviously
won't change again.

Being the only caller, get rid of sem_obtain_lock() altogether.

Signed-off-by: Davidlohr Bueso 
---
 ipc/sem.c | 37 +
 1 file changed, 5 insertions(+), 32 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index a5eaf517c8b4..11cdec301167 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -431,29 +431,6 @@ static inline void sem_unlock(struct sem_array *sma, int 
locknum)
  *
  * The caller holds the RCU read lock.
  */
-static inline struct sem_array *sem_obtain_lock(struct ipc_namespace *ns,
-   int id, struct sembuf *sops, int nsops, int *locknum)
-{
-   struct kern_ipc_perm *ipcp;
-   struct sem_array *sma;
-
-   ipcp = ipc_obtain_object_idr(_ids(ns), id);
-   if (IS_ERR(ipcp))
-   return ERR_CAST(ipcp);
-
-   sma = container_of(ipcp, struct sem_array, sem_perm);
-   *locknum = sem_lock(sma, sops, nsops);
-
-   /* ipc_rmid() may have already freed the ID while sem_lock
-* was spinning: verify that the structure is still valid
-*/
-   if (ipc_valid_object(ipcp))
-   return container_of(ipcp, struct sem_array, sem_perm);
-
-   sem_unlock(sma, *locknum);
-   return ERR_PTR(-EINVAL);
-}
-
 static inline struct sem_array *sem_obtain_object(struct ipc_namespace *ns, 
int id)
 {
struct kern_ipc_perm *ipcp = ipc_obtain_object_idr(_ids(ns), id);
@@ -2016,16 +1993,12 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf 
__user *, tsops,
}
 
rcu_read_lock();
-   sma = sem_obtain_lock(ns, semid, sops, nsops, );
-   error = READ_ONCE(queue.status);
+   sem_lock(sma, sops, nsops);
 
-   /*
-* Array removed? If yes, leave without sem_unlock().
-*/
-   if (IS_ERR(sma)) {
-   rcu_read_unlock();
-   goto out_free;
-   }
+   if (!ipc_valid_object(>sem_perm))
+   goto out_unlock_free;
+
+   error = READ_ONCE(queue.status);
 
/*
 * If queue.status != -EINTR we are woken up by another process.
-- 
2.6.6



Re: [PATCH 22/25] x86/mcheck: Do the init in one place

2016-11-09 Thread Sebastian Andrzej Siewior
On 2016-11-09 16:38:07 [+0100], Borislav Petkov wrote:
> On Wed, Nov 09, 2016 at 03:22:21PM +0100, Sebastian Andrzej Siewior wrote:
> > I want to get rid of non-symmetrical part and the arch hook which should
> > be part of the hp notifier itself. I wouldn't be too much afraid about
> > the when point in time when the notifier runs: It is the *first*
> > notifier that will be invoked on the target CPU. This is only a few
> > lines after the old hook. Nothing else long delaying should be invoked.
> 
> So I'm not sure *everything* mcheck_cpu_init() does should be in hotplug
> and toggle on and off. For example, we should set CR4.MCE once during
> init and that's it. Not turn it on and off. It would be pretty pointless
> IMO.
> 
> That's why the hotplug callback mce_disable_cpu() doesn't fiddle with
> CR4 - it only clears the bits in MCi_CTL. And I think we should remain
> that way.

The behaviour was not changed - it was only reorganized to keep in one
spot.

> The initial, one-time enabling of MCE features per core should be done
> only once per boot and that's it.

It is. You still want this in your hotplug code in case someone replaces
(physically) your CPU (say a larger NUMA box).

Sebastian


Re: [PATCH 22/25] x86/mcheck: Do the init in one place

2016-11-09 Thread Sebastian Andrzej Siewior
On 2016-11-09 16:38:07 [+0100], Borislav Petkov wrote:
> On Wed, Nov 09, 2016 at 03:22:21PM +0100, Sebastian Andrzej Siewior wrote:
> > I want to get rid of non-symmetrical part and the arch hook which should
> > be part of the hp notifier itself. I wouldn't be too much afraid about
> > the when point in time when the notifier runs: It is the *first*
> > notifier that will be invoked on the target CPU. This is only a few
> > lines after the old hook. Nothing else long delaying should be invoked.
> 
> So I'm not sure *everything* mcheck_cpu_init() does should be in hotplug
> and toggle on and off. For example, we should set CR4.MCE once during
> init and that's it. Not turn it on and off. It would be pretty pointless
> IMO.
> 
> That's why the hotplug callback mce_disable_cpu() doesn't fiddle with
> CR4 - it only clears the bits in MCi_CTL. And I think we should remain
> that way.

The behaviour was not changed - it was only reorganized to keep in one
spot.

> The initial, one-time enabling of MCE features per core should be done
> only once per boot and that's it.

It is. You still want this in your hotplug code in case someone replaces
(physically) your CPU (say a larger NUMA box).

Sebastian


Re: [RFC 2/2] mmc: sdhci-pci: Use ACPI to get max frequency for Intel byt sdio host controller sub-vended by NI

2016-11-09 Thread Zach Brown
On Wed, Nov 09, 2016 at 03:24:24PM +0200, Adrian Hunter wrote:
> On 08/11/16 22:07, Zach Brown wrote:
> > On NI 9037 boards the max SDIO frequency is limited by trace lengths
> > and other layout choices. The max SDIO frequency is stored in an ACPI
> > table, as MXFQ.
> >
> > The driver reads the ACPI entry MXFQ during sdio_probe_slot and sets the
> > f_max field of the host with it.
> >
> > Signed-off-by: Nathan Sullivan 
> > Reviewed-by: Jaeden Amero 
> > Reviewed-by: Josh Cartwright 
> > Signed-off-by: Zach Brown 
> > ---
> >  drivers/mmc/host/sdhci-pci-core.c | 30 ++
> >  1 file changed, 30 insertions(+)
> >
> > diff --git a/drivers/mmc/host/sdhci-pci-core.c 
> > b/drivers/mmc/host/sdhci-pci-core.c
> > index c333ce2..4ac7f16 100644
> > --- a/drivers/mmc/host/sdhci-pci-core.c
> > +++ b/drivers/mmc/host/sdhci-pci-core.c
> > @@ -27,6 +27,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #include "sdhci.h"
> >  #include "sdhci-pci.h"
> > @@ -377,6 +378,35 @@ static int byt_emmc_probe_slot(struct sdhci_pci_slot 
> > *slot)
> >
> >  static int ni_byt_sdio_probe_slot(struct sdhci_pci_slot *slot)
> >  {
> > +#ifdef CONFIG_ACPI
> > +   /* Get max freq from ACPI for NI hardware */
> > +   acpi_handle acpi_hdl;
> > +   acpi_status status;
> > +   struct acpi_buffer acpi_result = {
> > +   ACPI_ALLOCATE_BUFFER, NULL };
> > +   union acpi_object *acpi_buffer;
> > +   int max_freq;
> > +
> > +   status = acpi_get_handle(ACPI_HANDLE(>chip->pdev->dev), "MXFQ",
> > +_hdl);
>
> Is "MXFQ" an object that has already been deployed or are you inventing it
> now?  In the latter case, did you consider device properties as an 
> alternative?
>
"MXFQ" is an object that we have already deployed on some of our devices.

> > +   if (ACPI_FAILURE(status))
> > +   return  -ENODEV;
> > +
> > +   status = acpi_evaluate_object(acpi_hdl, NULL,
> > + NULL, _result);
> > +   if (ACPI_FAILURE(status))
> > +   return -EINVAL;
> > +
> > +   acpi_buffer = (union acpi_object *)acpi_result.pointer;
> > +
> > +   if (acpi_buffer->type != ACPI_TYPE_INTEGER)
> > +   return -EINVAL;
> > +
> > +   max_freq = acpi_buffer->integer.value;
> > +
> > +   slot->host->mmc->f_max = max_freq * 100;
> > +#endif
> > +
> > slot->host->mmc->caps |= MMC_CAP_POWER_OFF_CARD | MMC_CAP_NONREMOVABLE;
> > return 0;
> >  }
> >
>


Re: [RFC 2/2] mmc: sdhci-pci: Use ACPI to get max frequency for Intel byt sdio host controller sub-vended by NI

2016-11-09 Thread Zach Brown
On Wed, Nov 09, 2016 at 03:24:24PM +0200, Adrian Hunter wrote:
> On 08/11/16 22:07, Zach Brown wrote:
> > On NI 9037 boards the max SDIO frequency is limited by trace lengths
> > and other layout choices. The max SDIO frequency is stored in an ACPI
> > table, as MXFQ.
> >
> > The driver reads the ACPI entry MXFQ during sdio_probe_slot and sets the
> > f_max field of the host with it.
> >
> > Signed-off-by: Nathan Sullivan 
> > Reviewed-by: Jaeden Amero 
> > Reviewed-by: Josh Cartwright 
> > Signed-off-by: Zach Brown 
> > ---
> >  drivers/mmc/host/sdhci-pci-core.c | 30 ++
> >  1 file changed, 30 insertions(+)
> >
> > diff --git a/drivers/mmc/host/sdhci-pci-core.c 
> > b/drivers/mmc/host/sdhci-pci-core.c
> > index c333ce2..4ac7f16 100644
> > --- a/drivers/mmc/host/sdhci-pci-core.c
> > +++ b/drivers/mmc/host/sdhci-pci-core.c
> > @@ -27,6 +27,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #include "sdhci.h"
> >  #include "sdhci-pci.h"
> > @@ -377,6 +378,35 @@ static int byt_emmc_probe_slot(struct sdhci_pci_slot 
> > *slot)
> >
> >  static int ni_byt_sdio_probe_slot(struct sdhci_pci_slot *slot)
> >  {
> > +#ifdef CONFIG_ACPI
> > +   /* Get max freq from ACPI for NI hardware */
> > +   acpi_handle acpi_hdl;
> > +   acpi_status status;
> > +   struct acpi_buffer acpi_result = {
> > +   ACPI_ALLOCATE_BUFFER, NULL };
> > +   union acpi_object *acpi_buffer;
> > +   int max_freq;
> > +
> > +   status = acpi_get_handle(ACPI_HANDLE(>chip->pdev->dev), "MXFQ",
> > +_hdl);
>
> Is "MXFQ" an object that has already been deployed or are you inventing it
> now?  In the latter case, did you consider device properties as an 
> alternative?
>
"MXFQ" is an object that we have already deployed on some of our devices.

> > +   if (ACPI_FAILURE(status))
> > +   return  -ENODEV;
> > +
> > +   status = acpi_evaluate_object(acpi_hdl, NULL,
> > + NULL, _result);
> > +   if (ACPI_FAILURE(status))
> > +   return -EINVAL;
> > +
> > +   acpi_buffer = (union acpi_object *)acpi_result.pointer;
> > +
> > +   if (acpi_buffer->type != ACPI_TYPE_INTEGER)
> > +   return -EINVAL;
> > +
> > +   max_freq = acpi_buffer->integer.value;
> > +
> > +   slot->host->mmc->f_max = max_freq * 100;
> > +#endif
> > +
> > slot->host->mmc->caps |= MMC_CAP_POWER_OFF_CARD | MMC_CAP_NONREMOVABLE;
> > return 0;
> >  }
> >
>


Re: [PATCH] vxlan: hide unused local variable

2016-11-09 Thread Jiri Benc
On Mon,  7 Nov 2016 22:09:07 +0100, Arnd Bergmann wrote:
> A bugfix introduced a harmless warning in v4.9-rc4:
> 
> drivers/net/vxlan.c: In function 'vxlan_group_used':
> drivers/net/vxlan.c:947:21: error: unused variable 'sock6' 
> [-Werror=unused-variable]
> 
> This hides the variable inside of the same #ifdef that is
> around its user. The extraneous initialization is removed
> at the same time, it was accidentally introduced in the
> same commit.
> 
> Fixes: c6fcc4fc5f8b ("vxlan: avoid using stale vxlan socket.")
> Signed-off-by: Arnd Bergmann 

I think this should be applied instead of Pravin's patch. It fixes just
the one problem, contains the proper Fixes: tag and addresses net.git.

Acked-by: Jiri Benc 


Re: [PATCH] vxlan: hide unused local variable

2016-11-09 Thread Jiri Benc
On Mon,  7 Nov 2016 22:09:07 +0100, Arnd Bergmann wrote:
> A bugfix introduced a harmless warning in v4.9-rc4:
> 
> drivers/net/vxlan.c: In function 'vxlan_group_used':
> drivers/net/vxlan.c:947:21: error: unused variable 'sock6' 
> [-Werror=unused-variable]
> 
> This hides the variable inside of the same #ifdef that is
> around its user. The extraneous initialization is removed
> at the same time, it was accidentally introduced in the
> same commit.
> 
> Fixes: c6fcc4fc5f8b ("vxlan: avoid using stale vxlan socket.")
> Signed-off-by: Arnd Bergmann 

I think this should be applied instead of Pravin's patch. It fixes just
the one problem, contains the proper Fixes: tag and addresses net.git.

Acked-by: Jiri Benc 


RE: [PATCH V5 2/3] ARM64 LPC: Add missing range exception for special ISA

2016-11-09 Thread Gabriele Paoloni
Hi Liviu

Thanks for reviewing

> -Original Message-
> From: liviu.du...@arm.com [mailto:liviu.du...@arm.com]
> Sent: 09 November 2016 11:40
> To: Yuanzhichang
> Cc: catalin.mari...@arm.com; will.dea...@arm.com; robh...@kernel.org;
> bhelg...@google.com; mark.rutl...@arm.com; o...@lixom.net;
> a...@arndb.de; linux-arm-ker...@lists.infradead.org;
> lorenzo.pieral...@arm.com; linux-kernel@vger.kernel.org; Linuxarm;
> devicet...@vger.kernel.org; linux-...@vger.kernel.org; linux-
> ser...@vger.kernel.org; miny...@acm.org; b...@kernel.crashing.org;
> zourongr...@gmail.com; John Garry; Gabriele Paoloni;
> zhichang.yua...@gmail.com; kant...@163.com; xuwei (O)
> Subject: Re: [PATCH V5 2/3] ARM64 LPC: Add missing range exception for
> special ISA
> 
> On Tue, Nov 08, 2016 at 11:47:08AM +0800, zhichang.yuan wrote:
> > This patch solves two issues:
> > 1) parse and get the right I/O range from DTS node whose parent does
> not
> > define the corresponding ranges property;
> >
> > There are some special ISA/LPC devices that work on a specific I/O
> range where
> > it is not correct to specify a ranges property in DTS parent node as
> cpu
> > addresses translated from DTS node are only for memory space on some
> > architectures, such as Arm64. Without the parent 'ranges' property,
> current
> > of_translate_address() return an error.
> > Here we add a fixup function, of_get_isa_indirect_io(). During the OF
> address
> > translation, this fixup will be called to check the 'reg' address to
> be
> > translating is for those sepcial ISA/LPC devices and get the I/O
> range
> > directly from the 'reg' property.
> >
> > 2) eliminate the I/O range conflict risk with PCI/PCIE leagecy I/O
> device;
> >
> > The current __of_address_to_resource() always translates the I/O
> range to PIO.
> > But this processing is not suitable for our ISA/LPC devices whose I/O
> range is
> > not cpu address(Arnd had stressed this in his comments on V2,V3
> patch-set).
> > Here, we bypass the mapping between cpu address and PIO for the
> special
> > ISA/LPC devices. But to drive these ISA/LPC devices, a I/O port
> address below
> > PCIBIOS_MIN_IO is needed by in*/out*(). Which means there is conflict
> risk
> > between I/O range of [0, PCIBIOS_MIN_IO) and PCI/PCIE legacy I/O
> range of [0,
> > IO_SPACE_LIMIT).
> > To avoid the I/O conflict, this patch reserve the I/O range below
> > PCIBIOS_MIN_IO.
> >
> > Signed-off-by: zhichang.yuan 
> > Signed-off-by: Gabriele Paoloni 
> > ---
> >  .../arm/hisilicon/hisilicon-low-pin-count.txt  | 31 
> >  arch/arm64/include/asm/io.h|  6 +++
> >  arch/arm64/kernel/extio.c  | 25 ++
> >  drivers/of/address.c   | 56
> +-
> >  drivers/pci/pci.c  |  6 +--
> >  include/linux/of_address.h | 17 +++
> >  include/linux/pci.h|  8 
> >  7 files changed, 145 insertions(+), 4 deletions(-)
> >  create mode 100644
> Documentation/devicetree/bindings/arm/hisilicon/hisilicon-low-pin-
> count.txt
> >
> > diff --git
> a/Documentation/devicetree/bindings/arm/hisilicon/hisilicon-low-pin-
> count.txt b/Documentation/devicetree/bindings/arm/hisilicon/hisilicon-
> low-pin-count.txt
> > new file mode 100644
> > index 000..13c8ddd
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/arm/hisilicon/hisilicon-low-
> pin-count.txt
> > @@ -0,0 +1,31 @@
> > +Hisilicon Hip06 low-pin-count device
> > +  Usually LPC controller is part of PCI host bridge, so the legacy
> ISA ports
> > +  locate on LPC bus can be accessed direclty. But some SoCs have
> independent
> > +  LPC controller, and access the legacy ports by triggering LPC I/O
> cycles.
> > +  Hisilicon Hip06 implements this LPC device.
> > +
> > +Required properties:
> > +- compatible: should be "hisilicon,low-pin-count"
> > +- #address-cells: must be 2 which stick to the ISA/EISA binding doc.
> > +- #size-cells: must be 1 which stick to the ISA/EISA binding doc.
> > +- reg: base memory range where the register set of this device is
> mapped.
> > +
> > +Note:
> > +  The node name before '@' must be "isa" to represent the binding
> stick to the
> > +  ISA/EISA binding specification.
> > +
> > +Example:
> > +
> > +isa@a01b {
> > +   compatible = "hisilicom,low-pin-count";
> > +   #address-cells = <2>;
> > +   #size-cells = <1>;
> > +   reg = <0x0 0xa01b 0x0 0x1000>;
> > +
> > +   ipmi0: bt@e4 {
> > +   compatible = "ipmi-bt";
> > +   device_type = "ipmi";
> > +   reg = <0x01 0xe4 0x04>;
> > +   status = "disabled";
> > +   };
> > +};
> 
> 
> This documentation file needs to be part of the next patch. It has
> nothing to do with
> what you are trying to fix here.

Yes you're right...we'll move it to next one

> 
> 
> > diff --git a/arch/arm64/include/asm/io.h
> 

RE: [PATCH V5 2/3] ARM64 LPC: Add missing range exception for special ISA

2016-11-09 Thread Gabriele Paoloni
Hi Liviu

Thanks for reviewing

> -Original Message-
> From: liviu.du...@arm.com [mailto:liviu.du...@arm.com]
> Sent: 09 November 2016 11:40
> To: Yuanzhichang
> Cc: catalin.mari...@arm.com; will.dea...@arm.com; robh...@kernel.org;
> bhelg...@google.com; mark.rutl...@arm.com; o...@lixom.net;
> a...@arndb.de; linux-arm-ker...@lists.infradead.org;
> lorenzo.pieral...@arm.com; linux-kernel@vger.kernel.org; Linuxarm;
> devicet...@vger.kernel.org; linux-...@vger.kernel.org; linux-
> ser...@vger.kernel.org; miny...@acm.org; b...@kernel.crashing.org;
> zourongr...@gmail.com; John Garry; Gabriele Paoloni;
> zhichang.yua...@gmail.com; kant...@163.com; xuwei (O)
> Subject: Re: [PATCH V5 2/3] ARM64 LPC: Add missing range exception for
> special ISA
> 
> On Tue, Nov 08, 2016 at 11:47:08AM +0800, zhichang.yuan wrote:
> > This patch solves two issues:
> > 1) parse and get the right I/O range from DTS node whose parent does
> not
> > define the corresponding ranges property;
> >
> > There are some special ISA/LPC devices that work on a specific I/O
> range where
> > it is not correct to specify a ranges property in DTS parent node as
> cpu
> > addresses translated from DTS node are only for memory space on some
> > architectures, such as Arm64. Without the parent 'ranges' property,
> current
> > of_translate_address() return an error.
> > Here we add a fixup function, of_get_isa_indirect_io(). During the OF
> address
> > translation, this fixup will be called to check the 'reg' address to
> be
> > translating is for those sepcial ISA/LPC devices and get the I/O
> range
> > directly from the 'reg' property.
> >
> > 2) eliminate the I/O range conflict risk with PCI/PCIE leagecy I/O
> device;
> >
> > The current __of_address_to_resource() always translates the I/O
> range to PIO.
> > But this processing is not suitable for our ISA/LPC devices whose I/O
> range is
> > not cpu address(Arnd had stressed this in his comments on V2,V3
> patch-set).
> > Here, we bypass the mapping between cpu address and PIO for the
> special
> > ISA/LPC devices. But to drive these ISA/LPC devices, a I/O port
> address below
> > PCIBIOS_MIN_IO is needed by in*/out*(). Which means there is conflict
> risk
> > between I/O range of [0, PCIBIOS_MIN_IO) and PCI/PCIE legacy I/O
> range of [0,
> > IO_SPACE_LIMIT).
> > To avoid the I/O conflict, this patch reserve the I/O range below
> > PCIBIOS_MIN_IO.
> >
> > Signed-off-by: zhichang.yuan 
> > Signed-off-by: Gabriele Paoloni 
> > ---
> >  .../arm/hisilicon/hisilicon-low-pin-count.txt  | 31 
> >  arch/arm64/include/asm/io.h|  6 +++
> >  arch/arm64/kernel/extio.c  | 25 ++
> >  drivers/of/address.c   | 56
> +-
> >  drivers/pci/pci.c  |  6 +--
> >  include/linux/of_address.h | 17 +++
> >  include/linux/pci.h|  8 
> >  7 files changed, 145 insertions(+), 4 deletions(-)
> >  create mode 100644
> Documentation/devicetree/bindings/arm/hisilicon/hisilicon-low-pin-
> count.txt
> >
> > diff --git
> a/Documentation/devicetree/bindings/arm/hisilicon/hisilicon-low-pin-
> count.txt b/Documentation/devicetree/bindings/arm/hisilicon/hisilicon-
> low-pin-count.txt
> > new file mode 100644
> > index 000..13c8ddd
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/arm/hisilicon/hisilicon-low-
> pin-count.txt
> > @@ -0,0 +1,31 @@
> > +Hisilicon Hip06 low-pin-count device
> > +  Usually LPC controller is part of PCI host bridge, so the legacy
> ISA ports
> > +  locate on LPC bus can be accessed direclty. But some SoCs have
> independent
> > +  LPC controller, and access the legacy ports by triggering LPC I/O
> cycles.
> > +  Hisilicon Hip06 implements this LPC device.
> > +
> > +Required properties:
> > +- compatible: should be "hisilicon,low-pin-count"
> > +- #address-cells: must be 2 which stick to the ISA/EISA binding doc.
> > +- #size-cells: must be 1 which stick to the ISA/EISA binding doc.
> > +- reg: base memory range where the register set of this device is
> mapped.
> > +
> > +Note:
> > +  The node name before '@' must be "isa" to represent the binding
> stick to the
> > +  ISA/EISA binding specification.
> > +
> > +Example:
> > +
> > +isa@a01b {
> > +   compatible = "hisilicom,low-pin-count";
> > +   #address-cells = <2>;
> > +   #size-cells = <1>;
> > +   reg = <0x0 0xa01b 0x0 0x1000>;
> > +
> > +   ipmi0: bt@e4 {
> > +   compatible = "ipmi-bt";
> > +   device_type = "ipmi";
> > +   reg = <0x01 0xe4 0x04>;
> > +   status = "disabled";
> > +   };
> > +};
> 
> 
> This documentation file needs to be part of the next patch. It has
> nothing to do with
> what you are trying to fix here.

Yes you're right...we'll move it to next one

> 
> 
> > diff --git a/arch/arm64/include/asm/io.h
> b/arch/arm64/include/asm/io.h
> > index 136735d..c26b7cc 

[tip:x86/urgent] x86/cpu/AMD: Fix cpu_llc_id for AMD Fam17h systems

2016-11-09 Thread tip-bot for Yazen Ghannam
Commit-ID:  b0b6e86846093c5f8820386bc01515f857dd8faa
Gitweb: http://git.kernel.org/tip/b0b6e86846093c5f8820386bc01515f857dd8faa
Author: Yazen Ghannam 
AuthorDate: Tue, 8 Nov 2016 09:35:06 +0100
Committer:  Ingo Molnar 
CommitDate: Wed, 9 Nov 2016 17:06:08 +0100

x86/cpu/AMD: Fix cpu_llc_id for AMD Fam17h systems

cpu_llc_id (Last Level Cache ID) derivation on AMD Fam17h has an
underflow bug when extracting the socket_id value. It starts from 0
so subtracting 1 from it will result in an invalid value. This breaks
scheduling topology later on since the cpu_llc_id will be incorrect.

For example, the the cpu_llc_id of the *other* CPU in the loops in
set_cpu_sibling_map() underflows and we're generating the funniest
thread_siblings masks and then when I run 8 threads of nbench, they get
spread around the LLC domains in a very strange pattern which doesn't
give you the normal scheduling spread one would expect for performance.

Other things like EDAC use cpu_llc_id so they will be b0rked too.

So, the APIC ID is preset in APICx020 for bits 3 and above: they contain
the core complex, node and socket IDs.

The LLC is at the core complex level so we can find a unique cpu_llc_id
by right shifting the APICID by 3 because then the least significant bit
will be the Core Complex ID.

Tested-by: Borislav Petkov 
Signed-off-by: Yazen Ghannam 
[ Cleaned up and extended the commit message. ]
Signed-off-by: Borislav Petkov 
Acked-by: Thomas Gleixner 
Cc:  # v4.4..
Cc: Aravind Gopalakrishnan 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Fixes: 3849e91f571d ("x86/AMD: Fix last level cache topology for AMD Fam17h 
systems")
Link: http://lkml.kernel.org/r/20161108083506.rvqb5h4chrcpt...@pd.tnic
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/cpu/amd.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index b81fe2d..1e81a37 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -347,7 +347,6 @@ static void amd_detect_cmp(struct cpuinfo_x86 *c)
 #ifdef CONFIG_SMP
unsigned bits;
int cpu = smp_processor_id();
-   unsigned int socket_id, core_complex_id;
 
bits = c->x86_coreid_bits;
/* Low order bits define the core id (index of core in socket) */
@@ -365,10 +364,7 @@ static void amd_detect_cmp(struct cpuinfo_x86 *c)
 if (c->x86 != 0x17 || !cpuid_edx(0x8006))
return;
 
-   socket_id   = (c->apicid >> bits) - 1;
-   core_complex_id = (c->apicid & ((1 << bits) - 1)) >> 3;
-
-   per_cpu(cpu_llc_id, cpu) = (socket_id << 3) | core_complex_id;
+   per_cpu(cpu_llc_id, cpu) = c->apicid >> 3;
 #endif
 }
 


[tip:x86/urgent] x86/cpu/AMD: Fix cpu_llc_id for AMD Fam17h systems

2016-11-09 Thread tip-bot for Yazen Ghannam
Commit-ID:  b0b6e86846093c5f8820386bc01515f857dd8faa
Gitweb: http://git.kernel.org/tip/b0b6e86846093c5f8820386bc01515f857dd8faa
Author: Yazen Ghannam 
AuthorDate: Tue, 8 Nov 2016 09:35:06 +0100
Committer:  Ingo Molnar 
CommitDate: Wed, 9 Nov 2016 17:06:08 +0100

x86/cpu/AMD: Fix cpu_llc_id for AMD Fam17h systems

cpu_llc_id (Last Level Cache ID) derivation on AMD Fam17h has an
underflow bug when extracting the socket_id value. It starts from 0
so subtracting 1 from it will result in an invalid value. This breaks
scheduling topology later on since the cpu_llc_id will be incorrect.

For example, the the cpu_llc_id of the *other* CPU in the loops in
set_cpu_sibling_map() underflows and we're generating the funniest
thread_siblings masks and then when I run 8 threads of nbench, they get
spread around the LLC domains in a very strange pattern which doesn't
give you the normal scheduling spread one would expect for performance.

Other things like EDAC use cpu_llc_id so they will be b0rked too.

So, the APIC ID is preset in APICx020 for bits 3 and above: they contain
the core complex, node and socket IDs.

The LLC is at the core complex level so we can find a unique cpu_llc_id
by right shifting the APICID by 3 because then the least significant bit
will be the Core Complex ID.

Tested-by: Borislav Petkov 
Signed-off-by: Yazen Ghannam 
[ Cleaned up and extended the commit message. ]
Signed-off-by: Borislav Petkov 
Acked-by: Thomas Gleixner 
Cc:  # v4.4..
Cc: Aravind Gopalakrishnan 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Fixes: 3849e91f571d ("x86/AMD: Fix last level cache topology for AMD Fam17h 
systems")
Link: http://lkml.kernel.org/r/20161108083506.rvqb5h4chrcpt...@pd.tnic
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/cpu/amd.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index b81fe2d..1e81a37 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -347,7 +347,6 @@ static void amd_detect_cmp(struct cpuinfo_x86 *c)
 #ifdef CONFIG_SMP
unsigned bits;
int cpu = smp_processor_id();
-   unsigned int socket_id, core_complex_id;
 
bits = c->x86_coreid_bits;
/* Low order bits define the core id (index of core in socket) */
@@ -365,10 +364,7 @@ static void amd_detect_cmp(struct cpuinfo_x86 *c)
 if (c->x86 != 0x17 || !cpuid_edx(0x8006))
return;
 
-   socket_id   = (c->apicid >> bits) - 1;
-   core_complex_id = (c->apicid & ((1 << bits) - 1)) >> 3;
-
-   per_cpu(cpu_llc_id, cpu) = (socket_id << 3) | core_complex_id;
+   per_cpu(cpu_llc_id, cpu) = c->apicid >> 3;
 #endif
 }
 


Re: [PATCH 2/4] mfd: palmas: Reset the POWERHOLD mux during power off

2016-11-09 Thread Lee Jones
On Thu, 27 Oct 2016, Keerthy wrote:

> POWERHOLD signal has higher priority  over the DEV_ON bit.
> So power off will not happen if the POWERHOLD is held high.
> Hence reset the MUX to GPIO_7 mode to release the POWERHOLD
> and the DEV_ON bit to take effect to power off the PMIC.
> 
> PMIC Power off happens in dire situations like thermal shutdown
> so irrespective of the POWERHOLD setting go ahead and turn off
> the powerhold.  Currently poweroff is broken on boards that have
> powerhold enabled. This fixes poweroff on those boards.
> 
> Signed-off-by: Keerthy 
> ---
>  drivers/mfd/palmas.c | 18 ++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/drivers/mfd/palmas.c b/drivers/mfd/palmas.c
> index 8f8bacb..8fbc5e0 100644
> --- a/drivers/mfd/palmas.c
> +++ b/drivers/mfd/palmas.c
> @@ -430,10 +430,28 @@ static void palmas_power_off(void)
>  {
>   unsigned int addr;
>   int ret, slave;
> + struct device_node *node;
> + bool override_powerhold;
>  
>   if (!palmas_dev)

Can this happen?

>   return;
>  
> + node = palmas_dev->dev->of_node;

Just do:

struct device_node *np = palmas_dev->dev->of_node;

> + override_powerhold = of_property_read_bool(node,
> + "ti,palmas-override-powerhold");

Break the line after the '=' instead.

> + if (override_powerhold) {

if (of_property_read_bool(node, "ti,palmas-override-powerhold"))

Then remove 'override_powerhold'.

> + addr = PALMAS_BASE_TO_REG(PALMAS_PU_PD_OD_BASE,
> +   PALMAS_PRIMARY_SECONDARY_PAD2);
> + slave = PALMAS_BASE_TO_SLAVE(PALMAS_PU_PD_OD_BASE);
> +
> + ret = regmap_update_bits(palmas_dev->regmap[slave], addr,
> +  
> PALMAS_PRIMARY_SECONDARY_PAD2_GPIO_7_MASK, 0);
> + if (ret)
> + pr_err("%s: Unable to write 
> PALMAS_PRIMARY_SECONDARY_PAD2 %d\n",
> +__func__, ret);

Don't use __func__ in live code.

And use dev_err();


> + }
> +
>   slave = PALMAS_BASE_TO_SLAVE(PALMAS_PMU_CONTROL_BASE);
>   addr = PALMAS_BASE_TO_REG(PALMAS_PMU_CONTROL_BASE, PALMAS_DEV_CTRL);
>  

-- 
Lee Jones
Linaro STMicroelectronics Landing Team Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog


Re: [PATCH 2/4] mfd: palmas: Reset the POWERHOLD mux during power off

2016-11-09 Thread Lee Jones
On Thu, 27 Oct 2016, Keerthy wrote:

> POWERHOLD signal has higher priority  over the DEV_ON bit.
> So power off will not happen if the POWERHOLD is held high.
> Hence reset the MUX to GPIO_7 mode to release the POWERHOLD
> and the DEV_ON bit to take effect to power off the PMIC.
> 
> PMIC Power off happens in dire situations like thermal shutdown
> so irrespective of the POWERHOLD setting go ahead and turn off
> the powerhold.  Currently poweroff is broken on boards that have
> powerhold enabled. This fixes poweroff on those boards.
> 
> Signed-off-by: Keerthy 
> ---
>  drivers/mfd/palmas.c | 18 ++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/drivers/mfd/palmas.c b/drivers/mfd/palmas.c
> index 8f8bacb..8fbc5e0 100644
> --- a/drivers/mfd/palmas.c
> +++ b/drivers/mfd/palmas.c
> @@ -430,10 +430,28 @@ static void palmas_power_off(void)
>  {
>   unsigned int addr;
>   int ret, slave;
> + struct device_node *node;
> + bool override_powerhold;
>  
>   if (!palmas_dev)

Can this happen?

>   return;
>  
> + node = palmas_dev->dev->of_node;

Just do:

struct device_node *np = palmas_dev->dev->of_node;

> + override_powerhold = of_property_read_bool(node,
> + "ti,palmas-override-powerhold");

Break the line after the '=' instead.

> + if (override_powerhold) {

if (of_property_read_bool(node, "ti,palmas-override-powerhold"))

Then remove 'override_powerhold'.

> + addr = PALMAS_BASE_TO_REG(PALMAS_PU_PD_OD_BASE,
> +   PALMAS_PRIMARY_SECONDARY_PAD2);
> + slave = PALMAS_BASE_TO_SLAVE(PALMAS_PU_PD_OD_BASE);
> +
> + ret = regmap_update_bits(palmas_dev->regmap[slave], addr,
> +  
> PALMAS_PRIMARY_SECONDARY_PAD2_GPIO_7_MASK, 0);
> + if (ret)
> + pr_err("%s: Unable to write 
> PALMAS_PRIMARY_SECONDARY_PAD2 %d\n",
> +__func__, ret);

Don't use __func__ in live code.

And use dev_err();


> + }
> +
>   slave = PALMAS_BASE_TO_SLAVE(PALMAS_PMU_CONTROL_BASE);
>   addr = PALMAS_BASE_TO_REG(PALMAS_PMU_CONTROL_BASE, PALMAS_DEV_CTRL);
>  

-- 
Lee Jones
Linaro STMicroelectronics Landing Team Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog


Re: [PATCH 6/8] block: add scalable completion tracking of requests

2016-11-09 Thread Jens Axboe

On 11/09/2016 02:01 AM, Jan Kara wrote:

On Tue 08-11-16 08:25:52, Jens Axboe wrote:

On 11/08/2016 06:30 AM, Jan Kara wrote:

On Tue 01-11-16 15:08:49, Jens Axboe wrote:

For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

Signed-off-by: Jens Axboe 


This patch looks mostly good to me but I have one concern: You track
statistics in a fixed 134ms window, stats get cleared at the beginning of
each window. Now this can interact with the writeback window and latency
settings which are dynamic and settable from userspace - so if the
writeback code observation window gets set larger than the stats window,
things become strange since you'll likely miss quite some observations
about read latencies. So I think you need to make sure stats window is
always larger than writeback window. Or actually, why do you have something
like stats window and don't leave clearing of statistics completely to the
writeback tracking code?


That's a good point, and there actually used to be a comment to that
effect in the code. I think the best solution here would be to make the
stats code mask available somewhere, and allow a consumer of the stats
to request a larger window.

Similarly, we could make the stat window be driven by the consumer, as
you suggest.

Currently there are two pending submissions that depend on the stats
code. One is this writeback series, and the other one is the hybrid
polling code. The latter does not really care about the window size as
such, since it has no monitoring window of its own, and it wants the
auto-clearing as well.

I don't mind working on additions for this, but I'd prefer if we could
layer them on top of the existing series instead of respinning it.
There's considerable test time on the existing patchset. Would that work
for you? Especially collapsing the stats and wbt windows would require
some re-architecting.


OK, that works for me. Actually, when thinking about this, I have one more
suggestion: Do we really want to expose the wbt window as a sysfs tunable?
I guess it is good for initial experiments but longer term having the wbt
window length be a function of target read latency might be better.
Generally you want the window length to be considerably larger than the
target latency but OTOH not too large so that the algorithm can react
reasonably quickly so that suggests it could really be autotuned (and we
scale the window anyway to adapt it to current situation).


That's not a bad idea, I have thought about that as well before. We
don't need the window tunable, and you are right, it can be a function
of the desired latency.

I'll hardwire the 100msec latency window for now and get rid of the
exposed tunable. It's harder to remove sysfs files once they have made
it into the kernel...


Also as a side note - nobody currently uses the mean value of the
statistics. It may be faster to track just sum and count so that mean can
be computed on request which will be presumably much more rare than current
situation where we recompute the mean on each batch update. Actually, that
way you could get rid of the batching as well I assume.


That could be opt-in as well. The poll code uses it. And fwiw, it is
exposed through sysfs as well.


Yeah, my point was that just doing the division in response to sysfs read
or actual request to read the average is likely going to be less expensive
than having to do it on each batch completion (actually, you seem to have
that batching code only so that you don't have to do the division too
often). Whether my suggestion is right depends on how often polling code
actually needs to read the average...


The polling code currently does it for every IO... That is not ideal for
other purposes, I think I'm going to work on changing that to just keep
the previous window available, so we only need to read it when the stats
window changes.

With the batching, I don't see the division as a problem in micro
benchmarks. That's why I added the batching, because it did show up
before.

--
Jens Axboe



Re: [PATCH 6/8] block: add scalable completion tracking of requests

2016-11-09 Thread Jens Axboe

On 11/09/2016 02:01 AM, Jan Kara wrote:

On Tue 08-11-16 08:25:52, Jens Axboe wrote:

On 11/08/2016 06:30 AM, Jan Kara wrote:

On Tue 01-11-16 15:08:49, Jens Axboe wrote:

For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

Signed-off-by: Jens Axboe 


This patch looks mostly good to me but I have one concern: You track
statistics in a fixed 134ms window, stats get cleared at the beginning of
each window. Now this can interact with the writeback window and latency
settings which are dynamic and settable from userspace - so if the
writeback code observation window gets set larger than the stats window,
things become strange since you'll likely miss quite some observations
about read latencies. So I think you need to make sure stats window is
always larger than writeback window. Or actually, why do you have something
like stats window and don't leave clearing of statistics completely to the
writeback tracking code?


That's a good point, and there actually used to be a comment to that
effect in the code. I think the best solution here would be to make the
stats code mask available somewhere, and allow a consumer of the stats
to request a larger window.

Similarly, we could make the stat window be driven by the consumer, as
you suggest.

Currently there are two pending submissions that depend on the stats
code. One is this writeback series, and the other one is the hybrid
polling code. The latter does not really care about the window size as
such, since it has no monitoring window of its own, and it wants the
auto-clearing as well.

I don't mind working on additions for this, but I'd prefer if we could
layer them on top of the existing series instead of respinning it.
There's considerable test time on the existing patchset. Would that work
for you? Especially collapsing the stats and wbt windows would require
some re-architecting.


OK, that works for me. Actually, when thinking about this, I have one more
suggestion: Do we really want to expose the wbt window as a sysfs tunable?
I guess it is good for initial experiments but longer term having the wbt
window length be a function of target read latency might be better.
Generally you want the window length to be considerably larger than the
target latency but OTOH not too large so that the algorithm can react
reasonably quickly so that suggests it could really be autotuned (and we
scale the window anyway to adapt it to current situation).


That's not a bad idea, I have thought about that as well before. We
don't need the window tunable, and you are right, it can be a function
of the desired latency.

I'll hardwire the 100msec latency window for now and get rid of the
exposed tunable. It's harder to remove sysfs files once they have made
it into the kernel...


Also as a side note - nobody currently uses the mean value of the
statistics. It may be faster to track just sum and count so that mean can
be computed on request which will be presumably much more rare than current
situation where we recompute the mean on each batch update. Actually, that
way you could get rid of the batching as well I assume.


That could be opt-in as well. The poll code uses it. And fwiw, it is
exposed through sysfs as well.


Yeah, my point was that just doing the division in response to sysfs read
or actual request to read the average is likely going to be less expensive
than having to do it on each batch completion (actually, you seem to have
that batching code only so that you don't have to do the division too
often). Whether my suggestion is right depends on how often polling code
actually needs to read the average...


The polling code currently does it for every IO... That is not ideal for
other purposes, I think I'm going to work on changing that to just keep
the previous window available, so we only need to read it when the stats
window changes.

With the batching, I don't see the division as a problem in micro
benchmarks. That's why I added the batching, because it did show up
before.

--
Jens Axboe



Re: [PATCH 1/4] Documentation: pinctrl: palmas: Add ti,palmas-powerhold-override property definition

2016-11-09 Thread Lee Jones
On Thu, 27 Oct 2016, Keerthy wrote:

> GPIO7 is configured in POWERHOLD mode which has higher priority
> over DEV_ON bit and keeps the PMIC supplies on even after the DEV_ON
> bit is turned off. This property enables driver to over ride the
> POWERHOLD value to GPIO7 so as to turn off the PMIC in power off
> scenarios.
> 
> Signed-off-by: Keerthy 
> ---
>  Documentation/devicetree/bindings/pinctrl/pinctrl-palmas.txt | 9 +
>  1 file changed, 9 insertions(+)

This requires a DT Ack.

> diff --git a/Documentation/devicetree/bindings/pinctrl/pinctrl-palmas.txt 
> b/Documentation/devicetree/bindings/pinctrl/pinctrl-palmas.txt
> index caf297b..c28d4eb8 100644
> --- a/Documentation/devicetree/bindings/pinctrl/pinctrl-palmas.txt
> +++ b/Documentation/devicetree/bindings/pinctrl/pinctrl-palmas.txt
> @@ -35,6 +35,15 @@ Optional properties:
>  - ti,palmas-enable-dvfs2: Enable DVFS2. Configure pins for DVFS2 mode.
>   Selection primary or secondary function associated to GPADC_START
>   and SYSEN2 pin/pad for DVFS2 interface
> +- ti,palmas-override-powerhold: This is applicable for PMICs for which
> + GPIO7 is configured in POWERHOLD mode which has higher priority
> + over DEV_ON bit and keeps the PMIC supplies on even after the DEV_ON
> + bit is turned off. This property enables driver to over ride the
> + POWERHOLD value to GPIO7 so as to turn off the PMIC in power off
> + scenarios. So for GPIO7 if ti,palmas-override-powerhold is set
> + then the GPIO_7 field should never be muxed to anything else.
> + It should be set to POWERHOLD by default and only in case of
> + power off scenarios the driver will over ride the mux value.
>  
>  This binding uses the following generic properties as defined in
>  pinctrl-bindings.txt:

-- 
Lee Jones
Linaro STMicroelectronics Landing Team Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog


Re: [PATCH 1/4] Documentation: pinctrl: palmas: Add ti,palmas-powerhold-override property definition

2016-11-09 Thread Lee Jones
On Thu, 27 Oct 2016, Keerthy wrote:

> GPIO7 is configured in POWERHOLD mode which has higher priority
> over DEV_ON bit and keeps the PMIC supplies on even after the DEV_ON
> bit is turned off. This property enables driver to over ride the
> POWERHOLD value to GPIO7 so as to turn off the PMIC in power off
> scenarios.
> 
> Signed-off-by: Keerthy 
> ---
>  Documentation/devicetree/bindings/pinctrl/pinctrl-palmas.txt | 9 +
>  1 file changed, 9 insertions(+)

This requires a DT Ack.

> diff --git a/Documentation/devicetree/bindings/pinctrl/pinctrl-palmas.txt 
> b/Documentation/devicetree/bindings/pinctrl/pinctrl-palmas.txt
> index caf297b..c28d4eb8 100644
> --- a/Documentation/devicetree/bindings/pinctrl/pinctrl-palmas.txt
> +++ b/Documentation/devicetree/bindings/pinctrl/pinctrl-palmas.txt
> @@ -35,6 +35,15 @@ Optional properties:
>  - ti,palmas-enable-dvfs2: Enable DVFS2. Configure pins for DVFS2 mode.
>   Selection primary or secondary function associated to GPADC_START
>   and SYSEN2 pin/pad for DVFS2 interface
> +- ti,palmas-override-powerhold: This is applicable for PMICs for which
> + GPIO7 is configured in POWERHOLD mode which has higher priority
> + over DEV_ON bit and keeps the PMIC supplies on even after the DEV_ON
> + bit is turned off. This property enables driver to over ride the
> + POWERHOLD value to GPIO7 so as to turn off the PMIC in power off
> + scenarios. So for GPIO7 if ti,palmas-override-powerhold is set
> + then the GPIO_7 field should never be muxed to anything else.
> + It should be set to POWERHOLD by default and only in case of
> + power off scenarios the driver will over ride the mux value.
>  
>  This binding uses the following generic properties as defined in
>  pinctrl-bindings.txt:

-- 
Lee Jones
Linaro STMicroelectronics Landing Team Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog


Re: [PATCH 7/8] blk-wbt: add general throttling mechanism

2016-11-09 Thread Jens Axboe

On 11/09/2016 01:40 AM, Jan Kara wrote:

So for devices with write cache, you will completely drain the device
before waking anybody waiting to issue new requests. Isn't it too strict?
In particular may_queue() will allow new writers to issue new writes once
we drop below the limit so it can happen that some processes will be
effectively starved waiting in may_queue?


It is strict, and perhaps too strict. In testing, it's the only method
that's proven to keep the writeback caching devices in check. It will
round robin the writers, if we have more, which isn't necessarily a bad
thing. Each will get to do a burst of depth writes, then wait for a new
one.


Well, I'm more concerned about a situation where one writer does a
bursty write and blocks sleeping in may_queue(). Another writer
produces a steady flow of write requests so that never causes the
write queue to completely drain but that writer also never blocks in
may_queue() when it starts queueing after write queue has somewhat
drained because it never submits many requests in parallel. In such
case the first writer would get starved AFAIU.


I see what you are saying. I can modify the logic to ensure that if we
do have a waiter, we queue up others behind it. That should get rid of
that concern.


Also I'm not sure why such logic for devices with writeback cache is
needed. Sure the disk is fast to accept writes but if that causes long
read latencies, we should scale down the writeback limits so that we
eventually end up submitting only one write request anyway -
effectively the same thing as limit=0 - won't we?


Basically we want to avoid getting into that situation. The problem with
write caching is that it takes a while for you to notice that anything
is wrong, and when you do, you are way down in the hole. That causes the
first violations to be pretty bad.

I'm fine with playing with this logic and improving it, but I'd rather
wait for a 2nd series for that.

--
Jens Axboe



Re: [PATCH 7/8] blk-wbt: add general throttling mechanism

2016-11-09 Thread Jens Axboe

On 11/09/2016 01:40 AM, Jan Kara wrote:

So for devices with write cache, you will completely drain the device
before waking anybody waiting to issue new requests. Isn't it too strict?
In particular may_queue() will allow new writers to issue new writes once
we drop below the limit so it can happen that some processes will be
effectively starved waiting in may_queue?


It is strict, and perhaps too strict. In testing, it's the only method
that's proven to keep the writeback caching devices in check. It will
round robin the writers, if we have more, which isn't necessarily a bad
thing. Each will get to do a burst of depth writes, then wait for a new
one.


Well, I'm more concerned about a situation where one writer does a
bursty write and blocks sleeping in may_queue(). Another writer
produces a steady flow of write requests so that never causes the
write queue to completely drain but that writer also never blocks in
may_queue() when it starts queueing after write queue has somewhat
drained because it never submits many requests in parallel. In such
case the first writer would get starved AFAIU.


I see what you are saying. I can modify the logic to ensure that if we
do have a waiter, we queue up others behind it. That should get rid of
that concern.


Also I'm not sure why such logic for devices with writeback cache is
needed. Sure the disk is fast to accept writes but if that causes long
read latencies, we should scale down the writeback limits so that we
eventually end up submitting only one write request anyway -
effectively the same thing as limit=0 - won't we?


Basically we want to avoid getting into that situation. The problem with
write caching is that it takes a while for you to notice that anything
is wrong, and when you do, you are way down in the hole. That causes the
first violations to be pretty bad.

I'm fine with playing with this logic and improving it, but I'd rather
wait for a 2nd series for that.

--
Jens Axboe



Re: [PATCH] x86/cpuid: Deal with broken firmware once more

2016-11-09 Thread Peter Zijlstra
On Wed, Nov 09, 2016 at 04:35:51PM +0100, Thomas Gleixner wrote:
> Both ACPI and MP specifications require that the APIC id in the respective
> tables must be the same as the APIC id in CPUID. 
> 
> The kernel retrieves the physical package id from the APIC id during the
> ACPI/MP table scan and builds the physical to logical package map.
> 
> There exist Virtualbox and Xen implementations which violate the spec. As a

ISTR it was VMware, not VirtualBox, but whatever.. they're both crazy
virt stuff.

>  /*
> + * The physical to logical package id mapping is initialized from the
> + * acpi/mptables information. Make sure that CPUID actually agrees with
> + * that.
> + */
> +static void sanitize_package_id(struct cpuinfo_x86 *c)
> +{
> +#ifdef CONFIG_SMP
> + unsigned int pkg, apicid, cpu = smp_processor_id();
> +
> + apicid = apic->cpu_present_to_apicid(cpu);
> + pkg = apicid >> boot_cpu_data.x86_coreid_bits;
> +
> + if (apicid != c->initial_apicid) {
> + pr_err(FW_BUG "CPU%u: APIC id mismatch. Firmware: %x CPUID: 
> %x\n",
> +cpu, apicid, c->initial_apicid);

Should we not also 'fix' c->initial_apicid ?

> + }
> + if (pkg != c->phys_proc_id) {
> + pr_err(FW_BUG "CPU%u: Using firmware package id %u instead of 
> %u\n",
> +cpu, pkg, c->phys_proc_id);
> + c->phys_proc_id = pkg;
> + }
> + c->logical_proc_id = topology_phys_to_logical_pkg(pkg);
> +#else
> + c->locical_proc_id = 0;

UP FTW ;-)

> +#endif
> +}


Re: [PATCH] x86/cpuid: Deal with broken firmware once more

2016-11-09 Thread Peter Zijlstra
On Wed, Nov 09, 2016 at 04:35:51PM +0100, Thomas Gleixner wrote:
> Both ACPI and MP specifications require that the APIC id in the respective
> tables must be the same as the APIC id in CPUID. 
> 
> The kernel retrieves the physical package id from the APIC id during the
> ACPI/MP table scan and builds the physical to logical package map.
> 
> There exist Virtualbox and Xen implementations which violate the spec. As a

ISTR it was VMware, not VirtualBox, but whatever.. they're both crazy
virt stuff.

>  /*
> + * The physical to logical package id mapping is initialized from the
> + * acpi/mptables information. Make sure that CPUID actually agrees with
> + * that.
> + */
> +static void sanitize_package_id(struct cpuinfo_x86 *c)
> +{
> +#ifdef CONFIG_SMP
> + unsigned int pkg, apicid, cpu = smp_processor_id();
> +
> + apicid = apic->cpu_present_to_apicid(cpu);
> + pkg = apicid >> boot_cpu_data.x86_coreid_bits;
> +
> + if (apicid != c->initial_apicid) {
> + pr_err(FW_BUG "CPU%u: APIC id mismatch. Firmware: %x CPUID: 
> %x\n",
> +cpu, apicid, c->initial_apicid);

Should we not also 'fix' c->initial_apicid ?

> + }
> + if (pkg != c->phys_proc_id) {
> + pr_err(FW_BUG "CPU%u: Using firmware package id %u instead of 
> %u\n",
> +cpu, pkg, c->phys_proc_id);
> + c->phys_proc_id = pkg;
> + }
> + c->logical_proc_id = topology_phys_to_logical_pkg(pkg);
> +#else
> + c->locical_proc_id = 0;

UP FTW ;-)

> +#endif
> +}


Re: [lustre-devel] [PATCH] staging: lustre: ldlm: pl_recalc time handling is wrong

2016-11-09 Thread Arnd Bergmann
On Wednesday, November 9, 2016 3:50:29 AM CET Dilger, Andreas wrote:
> On Nov 7, 2016, at 19:47, James Simmons  wrote:
> > 
> > The ldlm_pool field pl_recalc_time is set to the current
> > monotonic clock value but the interval period is calculated
> > with the wall clock. This means the interval period will
> > always be far larger than the pl_recalc_period, which is
> > just a small interval time period. The correct thing to
> > do is to use monotomic clock current value instead of the
> > wall clocks value when calculating recalc_interval_sec.
> 
> It looks like this was introduced by commit 8f83409cf
> "staging/lustre: use 64-bit time for pl_recalc" but that patch changed
> get_seconds() to a mix of ktime_get_seconds() and ktime_get_real_seconds()
> for an unknown reason.  It doesn't appear that there is any difference
> in overhead between the two (on 64-bit at least).

It was meant to use ktime_get_real_seconds() consistently, very sorry
about the mistake. I don't remember exactly how we got there, I assume
I had started out using ktime_get_seconds() and then moved to
ktime_get_real_seconds() later but missed the last three instances.

> Since the ldlm pool recalculation interval is actually driven in response to
> load on the server, it makes sense to use the "real" time instead of the
> monotonic time (if I understand correctly) if the client is in a VM that
> may periodically be blocked and "miss time" compared to the outside world.
> Using the "real" clock, the recalc_interval_sec will correctly reflect the
> actual elapsed time rather than just the number of ticks inside the VM.
> 
> Is my understanding of these different clocks correct?

No, this is not the difference: monotonic and real time always tick
at exactly the same rate, the only difference is the starting point.
monotonic time starts at boot and can not be adjusted, while real
time is set to reflect the UTC time base and gets initialized
from the real time clock at boot, or using settimeofday(2) or
NTP later on.

In my changelog text, I wrote

keeping the 'real' instead of 'monotonic' time because of the
debug prints.

the intention here is simply to have the console log keep the
same numbers as "date +%s" for absolute values. The patch that
James suggested converting everything to ktime_get_seconds()
would result in the same intervals that have always been there
(until I broke it by using time domains inconsistently), but
would mean we could use a u32 type for pl_recalc_time and
pl_recalc_period because that doesn't overflow until 136 years
after booting. (signed time_t overflows 68 years after 1970,
i.e 2038).

Arnd


Re: [lustre-devel] [PATCH] staging: lustre: ldlm: pl_recalc time handling is wrong

2016-11-09 Thread Arnd Bergmann
On Wednesday, November 9, 2016 3:50:29 AM CET Dilger, Andreas wrote:
> On Nov 7, 2016, at 19:47, James Simmons  wrote:
> > 
> > The ldlm_pool field pl_recalc_time is set to the current
> > monotonic clock value but the interval period is calculated
> > with the wall clock. This means the interval period will
> > always be far larger than the pl_recalc_period, which is
> > just a small interval time period. The correct thing to
> > do is to use monotomic clock current value instead of the
> > wall clocks value when calculating recalc_interval_sec.
> 
> It looks like this was introduced by commit 8f83409cf
> "staging/lustre: use 64-bit time for pl_recalc" but that patch changed
> get_seconds() to a mix of ktime_get_seconds() and ktime_get_real_seconds()
> for an unknown reason.  It doesn't appear that there is any difference
> in overhead between the two (on 64-bit at least).

It was meant to use ktime_get_real_seconds() consistently, very sorry
about the mistake. I don't remember exactly how we got there, I assume
I had started out using ktime_get_seconds() and then moved to
ktime_get_real_seconds() later but missed the last three instances.

> Since the ldlm pool recalculation interval is actually driven in response to
> load on the server, it makes sense to use the "real" time instead of the
> monotonic time (if I understand correctly) if the client is in a VM that
> may periodically be blocked and "miss time" compared to the outside world.
> Using the "real" clock, the recalc_interval_sec will correctly reflect the
> actual elapsed time rather than just the number of ticks inside the VM.
> 
> Is my understanding of these different clocks correct?

No, this is not the difference: monotonic and real time always tick
at exactly the same rate, the only difference is the starting point.
monotonic time starts at boot and can not be adjusted, while real
time is set to reflect the UTC time base and gets initialized
from the real time clock at boot, or using settimeofday(2) or
NTP later on.

In my changelog text, I wrote

keeping the 'real' instead of 'monotonic' time because of the
debug prints.

the intention here is simply to have the console log keep the
same numbers as "date +%s" for absolute values. The patch that
James suggested converting everything to ktime_get_seconds()
would result in the same intervals that have always been there
(until I broke it by using time domains inconsistently), but
would mean we could use a u32 type for pl_recalc_time and
pl_recalc_period because that doesn't overflow until 136 years
after booting. (signed time_t overflows 68 years after 1970,
i.e 2038).

Arnd


[ppdev] sysfs warning on qemu boot

2016-11-09 Thread Joe Lawrence
Hi Sudip,

I hit a sysfs_warn_dup inside QEMU running 4.9.0-rc4 (I suspect earlier
versions as well, but this is the first upstream I've run in a while).
This warning looks like something the kernel test robot ran into a few
months ago:

  http://marc.info/?t=14726777333=1=2

I'm running CentOS7 inside QEMU with the following options:

  qemu-system-x86_64 -smp 4 -m 8192 \
-drive file=centos7.qcow2,cache=none,if=virtio \
-usbdevice tablet -enable-kvm -monitor stdio \
-parallel /dev/null -redir tcp:::22 \
-serial file:serial.out

a kernel with these options:

  % grep PARPORT .config
  CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
  CONFIG_PARPORT=m
  CONFIG_PARPORT_PC=m
  # CONFIG_PARPORT_SERIAL is not set
  # CONFIG_PARPORT_PC_FIFO is not set
  # CONFIG_PARPORT_PC_SUPERIO is not set
  # CONFIG_PARPORT_GSC is not set
  # CONFIG_PARPORT_AX88796 is not set
  CONFIG_PARPORT_1284=y
  # CONFIG_I2C_PARPORT is not set
  # CONFIG_I2C_PARPORT_LIGHT is not set

and I see the following warning on boot:

  ppdev: user-space parallel port driver
  [ cut here ]
  WARNING: CPU: 2 PID: 581 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
  sysfs: cannot create duplicate filename '/devices/pnp0/00:04/ppdev/parport0'
  Modules linked in: ppdev sg parport_pc(+) parport pcspkr i2c_piix4 ip_tables 
xfs libcrc32c sr_mod cdrom ata_generic pata_acpi cirrus drm_kms_helper 
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm virtio_blk e1000 
virtio_pci ata_piix virtio_ring i2c_core libata virtio serio_raw floppy 
dm_mirror dm_region_hash dm_log dm_mod
  CPU: 2 PID: 581 Comm: systemd-udevd Not tainted 4.9.0-rc4+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
   c900014c76f8 81351b2f c900014c7748 
   c900014c7738 8108bab1 001f1000 8802280bc548
   88022c290c68 880225c98398 88022d834bd8 0630
  Call Trace:
   [] dump_stack+0x63/0x84
   [] __warn+0xd1/0xf0
   [] warn_slowpath_fmt+0x5f/0x80
   [] ? kernfs_path_from_node+0x50/0x60
   [] sysfs_warn_dup+0x64/0x80
   [] sysfs_create_dir_ns+0x7e/0x90
   [] kobject_add_internal+0xa2/0x320
   [] ? device_private_init+0x23/0x70
   [] kobject_add+0x75/0xd0
   [] ? mutex_lock+0x12/0x2f
   [] device_add+0x119/0x610
   [] device_create_groups_vargs+0xd8/0x100
   [] ? dead_read+0x10/0x10 [parport]
   [] device_create+0x51/0x70
   [] ? kfree+0x121/0x170
   [] ? klist_next+0x20/0xf0
   [] pp_attach+0x34/0x40 [ppdev]
   [] driver_check+0x17/0x20 [parport]
   [] bus_for_each_drv+0x68/0xb0
   [] attach_driver_chain+0x59/0x60 [parport]
   [] parport_announce_port+0xc0/0x110 [parport]
   [] parport_pc_probe_port+0x70c/0xb20 [parport_pc]
   [] ? kernfs_activate+0x7a/0xe0
   [] ? _dev_info+0x6c/0x90
   [] parport_pc_pnp_probe+0x145/0x1e0 [parport_pc]
   [] ? sio_via_probe+0x430/0x430 [parport_pc]
   [] pnp_device_probe+0x61/0xc0
   [] driver_probe_device+0x227/0x440
   [] __driver_attach+0xdd/0xe0
   [] ? driver_probe_device+0x440/0x440
   [] bus_for_each_dev+0x6c/0xc0
   [] driver_attach+0x1e/0x20
   [] bus_add_driver+0x45/0x270
   [] driver_register+0x60/0xe0
   [] pnp_register_driver+0x20/0x30
   [] parport_pc_init+0x2b5/0xf1d [parport_pc]
   [] ? parport_parse_param.constprop.15+0xe3/0xe3 
[parport_pc]
   [] do_one_initcall+0x50/0x190
   [] ? do_init_module+0x27/0x1f1
   [] ? kmem_cache_alloc_trace+0x16c/0x1b0
   [] do_init_module+0x60/0x1f1
   [] load_module+0x15ab/0x1aa0
   [] ? __symbol_put+0x60/0x60
   [] ? ima_post_read_file+0x3d/0x80
   [] ? security_kernel_post_read_file+0x6b/0x80
   [] SYSC_finit_module+0xa6/0xf0
   [] SyS_finit_module+0xe/0x10
   [] entry_SYSCALL_64_fastpath+0x1a/0xa9
  ---[ end trace 8304e3df5abed4a7 ]---

  [ cut here ]
  WARNING: CPU: 2 PID: 581 at lib/kobject.c:240 kobject_add_internal+0x2b5/0x320
  kobject_add_internal failed for parport0 with -EEXIST, don't try to register 
things with the same name in the same directory.
  Modules linked in: ppdev sg parport_pc(+) parport pcspkr i2c_piix4 ip_tables 
xfs libcrc32c sr_mod cdrom ata_generic pata_acpi cirrus drm_kms_helper 
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm virtio_blk e1000 
virtio_pci ata_piix virtio_ring i2c_core libata virtio serio_raw floppy 
dm_mirror dm_region_hash dm_log dm_mod
  CPU: 2 PID: 581 Comm: systemd-udevd Tainted: GW   4.9.0-rc4+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
   c900014c7748 81351b2f c900014c7798 
   c900014c7788 8108bab1 00f02c290c68 880226a0afe8
   88022d834bd8 ffef 88022d834bd8 0630
  Call Trace:
   [] dump_stack+0x63/0x84
   [] __warn+0xd1/0xf0
   [] warn_slowpath_fmt+0x5f/0x80
   [] kobject_add_internal+0x2b5/0x320
   [] ? device_private_init+0x23/0x70
   [] kobject_add+0x75/0xd0
   [] ? mutex_lock+0x12/0x2f
   [] device_add+0x119/0x610
   [] 

[ppdev] sysfs warning on qemu boot

2016-11-09 Thread Joe Lawrence
Hi Sudip,

I hit a sysfs_warn_dup inside QEMU running 4.9.0-rc4 (I suspect earlier
versions as well, but this is the first upstream I've run in a while).
This warning looks like something the kernel test robot ran into a few
months ago:

  http://marc.info/?t=14726777333=1=2

I'm running CentOS7 inside QEMU with the following options:

  qemu-system-x86_64 -smp 4 -m 8192 \
-drive file=centos7.qcow2,cache=none,if=virtio \
-usbdevice tablet -enable-kvm -monitor stdio \
-parallel /dev/null -redir tcp:::22 \
-serial file:serial.out

a kernel with these options:

  % grep PARPORT .config
  CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
  CONFIG_PARPORT=m
  CONFIG_PARPORT_PC=m
  # CONFIG_PARPORT_SERIAL is not set
  # CONFIG_PARPORT_PC_FIFO is not set
  # CONFIG_PARPORT_PC_SUPERIO is not set
  # CONFIG_PARPORT_GSC is not set
  # CONFIG_PARPORT_AX88796 is not set
  CONFIG_PARPORT_1284=y
  # CONFIG_I2C_PARPORT is not set
  # CONFIG_I2C_PARPORT_LIGHT is not set

and I see the following warning on boot:

  ppdev: user-space parallel port driver
  [ cut here ]
  WARNING: CPU: 2 PID: 581 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
  sysfs: cannot create duplicate filename '/devices/pnp0/00:04/ppdev/parport0'
  Modules linked in: ppdev sg parport_pc(+) parport pcspkr i2c_piix4 ip_tables 
xfs libcrc32c sr_mod cdrom ata_generic pata_acpi cirrus drm_kms_helper 
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm virtio_blk e1000 
virtio_pci ata_piix virtio_ring i2c_core libata virtio serio_raw floppy 
dm_mirror dm_region_hash dm_log dm_mod
  CPU: 2 PID: 581 Comm: systemd-udevd Not tainted 4.9.0-rc4+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
   c900014c76f8 81351b2f c900014c7748 
   c900014c7738 8108bab1 001f1000 8802280bc548
   88022c290c68 880225c98398 88022d834bd8 0630
  Call Trace:
   [] dump_stack+0x63/0x84
   [] __warn+0xd1/0xf0
   [] warn_slowpath_fmt+0x5f/0x80
   [] ? kernfs_path_from_node+0x50/0x60
   [] sysfs_warn_dup+0x64/0x80
   [] sysfs_create_dir_ns+0x7e/0x90
   [] kobject_add_internal+0xa2/0x320
   [] ? device_private_init+0x23/0x70
   [] kobject_add+0x75/0xd0
   [] ? mutex_lock+0x12/0x2f
   [] device_add+0x119/0x610
   [] device_create_groups_vargs+0xd8/0x100
   [] ? dead_read+0x10/0x10 [parport]
   [] device_create+0x51/0x70
   [] ? kfree+0x121/0x170
   [] ? klist_next+0x20/0xf0
   [] pp_attach+0x34/0x40 [ppdev]
   [] driver_check+0x17/0x20 [parport]
   [] bus_for_each_drv+0x68/0xb0
   [] attach_driver_chain+0x59/0x60 [parport]
   [] parport_announce_port+0xc0/0x110 [parport]
   [] parport_pc_probe_port+0x70c/0xb20 [parport_pc]
   [] ? kernfs_activate+0x7a/0xe0
   [] ? _dev_info+0x6c/0x90
   [] parport_pc_pnp_probe+0x145/0x1e0 [parport_pc]
   [] ? sio_via_probe+0x430/0x430 [parport_pc]
   [] pnp_device_probe+0x61/0xc0
   [] driver_probe_device+0x227/0x440
   [] __driver_attach+0xdd/0xe0
   [] ? driver_probe_device+0x440/0x440
   [] bus_for_each_dev+0x6c/0xc0
   [] driver_attach+0x1e/0x20
   [] bus_add_driver+0x45/0x270
   [] driver_register+0x60/0xe0
   [] pnp_register_driver+0x20/0x30
   [] parport_pc_init+0x2b5/0xf1d [parport_pc]
   [] ? parport_parse_param.constprop.15+0xe3/0xe3 
[parport_pc]
   [] do_one_initcall+0x50/0x190
   [] ? do_init_module+0x27/0x1f1
   [] ? kmem_cache_alloc_trace+0x16c/0x1b0
   [] do_init_module+0x60/0x1f1
   [] load_module+0x15ab/0x1aa0
   [] ? __symbol_put+0x60/0x60
   [] ? ima_post_read_file+0x3d/0x80
   [] ? security_kernel_post_read_file+0x6b/0x80
   [] SYSC_finit_module+0xa6/0xf0
   [] SyS_finit_module+0xe/0x10
   [] entry_SYSCALL_64_fastpath+0x1a/0xa9
  ---[ end trace 8304e3df5abed4a7 ]---

  [ cut here ]
  WARNING: CPU: 2 PID: 581 at lib/kobject.c:240 kobject_add_internal+0x2b5/0x320
  kobject_add_internal failed for parport0 with -EEXIST, don't try to register 
things with the same name in the same directory.
  Modules linked in: ppdev sg parport_pc(+) parport pcspkr i2c_piix4 ip_tables 
xfs libcrc32c sr_mod cdrom ata_generic pata_acpi cirrus drm_kms_helper 
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm virtio_blk e1000 
virtio_pci ata_piix virtio_ring i2c_core libata virtio serio_raw floppy 
dm_mirror dm_region_hash dm_log dm_mod
  CPU: 2 PID: 581 Comm: systemd-udevd Tainted: GW   4.9.0-rc4+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
   c900014c7748 81351b2f c900014c7798 
   c900014c7788 8108bab1 00f02c290c68 880226a0afe8
   88022d834bd8 ffef 88022d834bd8 0630
  Call Trace:
   [] dump_stack+0x63/0x84
   [] __warn+0xd1/0xf0
   [] warn_slowpath_fmt+0x5f/0x80
   [] kobject_add_internal+0x2b5/0x320
   [] ? device_private_init+0x23/0x70
   [] kobject_add+0x75/0xd0
   [] ? mutex_lock+0x12/0x2f
   [] device_add+0x119/0x610
   [] 

Re: [Patch V6 2/6] irqchip: xilinx: Clean up irqdomain argument and read/write

2016-11-09 Thread Marc Zyngier
On 01/11/16 11:05, Zubair Lutfullah Kakakhel wrote:
> Hi,
> 
> Thanks for the review.
> 
> On 10/31/2016 07:51 PM, Thomas Gleixner wrote:
>> On Mon, 31 Oct 2016, Zubair Lutfullah Kakakhel wrote:
>>> The drivers read/write function handling is a bit quirky.
>>
>> Can you please explain in more detail what's quirky and why it should be
>> done differently,
>>
>>> And the irqmask is passed directly to the handler.
>>
>> I can't make any sense out of that sentence.  Which handler? If you talk
>> about the write function, then I don't see a change. So what are you
>> talking about?
> 
> Thanks. I'll add more detail in v7 if this patch survives.
> 
>>
>>> Add a new irqchip struct to pass to the handler and
>>> cleanup read/write handling.
>>
>> I still don't see what it cleans up. You move the write function pointer
>> into a data structure, which is exposed by another pointer. So you create
>> two levels of indirection in the hotpath. The function prototype is still
>> the same. So all this does is making things slower unless I'm missing
>> something.
> 
> I wrote this patch/cleanup based on a review of driver by Marc when I moved 
> the
> driver from arch/microblaze to drivers/irqchip
> 
> "Marc Zyngier
> 
> ...
> 
>  >  arch/microblaze/kernel/intc.c   | 196 
> 
>  >  drivers/irqchip/irq-axi-intc.c  | 196 
> 
> 
> ...
> 
>  > +  /* Yeah, okay, casting the intr_mask to a void* is butt-ugly, but I'm
>  > +   * lazy and Michal can clean it up to something nicer when he tests
>  > +   * and commits this patch.  ~~gcl */
>  > +  root_domain = irq_domain_add_linear(intc, nr_irq, _irq_domain_ops,
>  > +  (void *)intr_mask);
> 
> Since you're now reworking this driver, how about addressing this
> ugliness? You could store the intr_mask together with intc_baseaddr,
> and the read/write functions in a global structure, and pass a
> pointer to it? That would make the code a bit nicer...
> "
> 
> https://patchwork.kernel.org/patch/9287933/
> 
>>
>>> -static unsigned int (*read_fn)(void __iomem *);
>>> -static void (*write_fn)(u32, void __iomem *);
>>> +struct xintc_irq_chip {
>>> +   void __iomem *base;
>>> +   struct  irq_domain *domain;
>>> +   struct  irq_chip chip;
>>
>> The tabs between struct and the structure name are bogus.
>>
>>> +   u32 intr_mask;
>>> +   unsigned int (*read)(void __iomem *iomem);
>>> +   void (*write)(u32 data, void __iomem *iomem);
>>
>> Please structure that like a table:
>>
>>  void__iomem *base;
>>  struct irq_domain   *domain;
>>  struct irq_chip chip;
>>  u32 intr_mask;
>>  unsigned int(*read)(void __iomem *iomem);
>>  void(*write)(u32 data, void __iomem *iomem);
>>
>> Can you see how that makes parsing the struct simpler, because the data
>> types are clearly to identify?
> 
> That does make it look much better.
> 
>>
>>> +static struct xintc_irq_chip *xintc_irqc;
>>>
>>>  static void intc_write32(u32 val, void __iomem *addr)
>>>  {
>>> @@ -54,6 +60,18 @@ static unsigned int intc_read32_be(void __iomem *addr)
>>> return ioread32be(addr);
>>>  }
>>>
>>> +static inline unsigned int xintc_read(struct xintc_irq_chip *xintc_irqc,
>>> +int reg)
>>> +{
>>> +   return xintc_irqc->read(xintc_irqc->base + reg);
>>> +}
>>> +
>>> +static inline void xintc_write(struct xintc_irq_chip *xintc_irqc,
>>> +int reg, u32 data)
>>> +{
>>> +   xintc_irqc->write(data, xintc_irqc->base + reg);
>>> +}
>>> +
>>>  static void intc_enable_or_unmask(struct irq_data *d)
>>>  {
>>> unsigned long mask = 1 << d->hwirq;
>>> @@ -65,21 +83,21 @@ static void intc_enable_or_unmask(struct irq_data *d)
>>>  * acks the irq before calling the interrupt handler
>>>  */
>>> if (irqd_is_level_type(d))
>>> -   write_fn(mask, intc_baseaddr + IAR);
>>> +   xintc_write(xintc_irqc, IAR, mask);
>>
>> So this whole thing makes only sense, when you want to support multiple
>> instances of that chip and then you need to store the xintc_irqc pointer as
>> irqchip data and retrieve it from there. Unless you do that, this "cleanup"
>> is just churn for nothing with the effect of making things less efficient.
>>
> 
> Indeed the driver doesn't support multiple instances of the Xilinx Interrupt 
> controller.
> I don't have a use-case or the hardware for that.
> 
> So what would be the recommended course of action?

If you really don't want/need to support multi-instance, then this is
indeed overkill. You're then better off having a simple static key that
deals with the endianess of the peripheral, and have a global structure
that contains the relevant data:

static DEFINE_STATIC_KEY_FALSE(xintc_is_be);

static void xintc_write(int reg, u32 data)
{
if (static_branch_unlikely(_is_be))
 

Re: [Patch V6 2/6] irqchip: xilinx: Clean up irqdomain argument and read/write

2016-11-09 Thread Marc Zyngier
On 01/11/16 11:05, Zubair Lutfullah Kakakhel wrote:
> Hi,
> 
> Thanks for the review.
> 
> On 10/31/2016 07:51 PM, Thomas Gleixner wrote:
>> On Mon, 31 Oct 2016, Zubair Lutfullah Kakakhel wrote:
>>> The drivers read/write function handling is a bit quirky.
>>
>> Can you please explain in more detail what's quirky and why it should be
>> done differently,
>>
>>> And the irqmask is passed directly to the handler.
>>
>> I can't make any sense out of that sentence.  Which handler? If you talk
>> about the write function, then I don't see a change. So what are you
>> talking about?
> 
> Thanks. I'll add more detail in v7 if this patch survives.
> 
>>
>>> Add a new irqchip struct to pass to the handler and
>>> cleanup read/write handling.
>>
>> I still don't see what it cleans up. You move the write function pointer
>> into a data structure, which is exposed by another pointer. So you create
>> two levels of indirection in the hotpath. The function prototype is still
>> the same. So all this does is making things slower unless I'm missing
>> something.
> 
> I wrote this patch/cleanup based on a review of driver by Marc when I moved 
> the
> driver from arch/microblaze to drivers/irqchip
> 
> "Marc Zyngier
> 
> ...
> 
>  >  arch/microblaze/kernel/intc.c   | 196 
> 
>  >  drivers/irqchip/irq-axi-intc.c  | 196 
> 
> 
> ...
> 
>  > +  /* Yeah, okay, casting the intr_mask to a void* is butt-ugly, but I'm
>  > +   * lazy and Michal can clean it up to something nicer when he tests
>  > +   * and commits this patch.  ~~gcl */
>  > +  root_domain = irq_domain_add_linear(intc, nr_irq, _irq_domain_ops,
>  > +  (void *)intr_mask);
> 
> Since you're now reworking this driver, how about addressing this
> ugliness? You could store the intr_mask together with intc_baseaddr,
> and the read/write functions in a global structure, and pass a
> pointer to it? That would make the code a bit nicer...
> "
> 
> https://patchwork.kernel.org/patch/9287933/
> 
>>
>>> -static unsigned int (*read_fn)(void __iomem *);
>>> -static void (*write_fn)(u32, void __iomem *);
>>> +struct xintc_irq_chip {
>>> +   void __iomem *base;
>>> +   struct  irq_domain *domain;
>>> +   struct  irq_chip chip;
>>
>> The tabs between struct and the structure name are bogus.
>>
>>> +   u32 intr_mask;
>>> +   unsigned int (*read)(void __iomem *iomem);
>>> +   void (*write)(u32 data, void __iomem *iomem);
>>
>> Please structure that like a table:
>>
>>  void__iomem *base;
>>  struct irq_domain   *domain;
>>  struct irq_chip chip;
>>  u32 intr_mask;
>>  unsigned int(*read)(void __iomem *iomem);
>>  void(*write)(u32 data, void __iomem *iomem);
>>
>> Can you see how that makes parsing the struct simpler, because the data
>> types are clearly to identify?
> 
> That does make it look much better.
> 
>>
>>> +static struct xintc_irq_chip *xintc_irqc;
>>>
>>>  static void intc_write32(u32 val, void __iomem *addr)
>>>  {
>>> @@ -54,6 +60,18 @@ static unsigned int intc_read32_be(void __iomem *addr)
>>> return ioread32be(addr);
>>>  }
>>>
>>> +static inline unsigned int xintc_read(struct xintc_irq_chip *xintc_irqc,
>>> +int reg)
>>> +{
>>> +   return xintc_irqc->read(xintc_irqc->base + reg);
>>> +}
>>> +
>>> +static inline void xintc_write(struct xintc_irq_chip *xintc_irqc,
>>> +int reg, u32 data)
>>> +{
>>> +   xintc_irqc->write(data, xintc_irqc->base + reg);
>>> +}
>>> +
>>>  static void intc_enable_or_unmask(struct irq_data *d)
>>>  {
>>> unsigned long mask = 1 << d->hwirq;
>>> @@ -65,21 +83,21 @@ static void intc_enable_or_unmask(struct irq_data *d)
>>>  * acks the irq before calling the interrupt handler
>>>  */
>>> if (irqd_is_level_type(d))
>>> -   write_fn(mask, intc_baseaddr + IAR);
>>> +   xintc_write(xintc_irqc, IAR, mask);
>>
>> So this whole thing makes only sense, when you want to support multiple
>> instances of that chip and then you need to store the xintc_irqc pointer as
>> irqchip data and retrieve it from there. Unless you do that, this "cleanup"
>> is just churn for nothing with the effect of making things less efficient.
>>
> 
> Indeed the driver doesn't support multiple instances of the Xilinx Interrupt 
> controller.
> I don't have a use-case or the hardware for that.
> 
> So what would be the recommended course of action?

If you really don't want/need to support multi-instance, then this is
indeed overkill. You're then better off having a simple static key that
deals with the endianess of the peripheral, and have a global structure
that contains the relevant data:

static DEFINE_STATIC_KEY_FALSE(xintc_is_be);

static void xintc_write(int reg, u32 data)
{
if (static_branch_unlikely(_is_be))
 

Re: [PATCH] perf/x86: Fix overlap counter scheduling bug

2016-11-09 Thread Peter Zijlstra
On Wed, Nov 09, 2016 at 03:25:15PM +0100, Robert Richter wrote:
> On 08.11.16 19:27:39, Peter Zijlstra wrote:
> > The comment with EVENT_CONSTRAINT_OVERLAP states: "This is the case if
> > the counter mask of such an event is not a subset of any other counter
> > mask of a constraint with an equal or higher weight".
> > 
> > Esp. that latter part is of interest here I think, our overlapping mask
> > is 0x0e, that has 3 bits set and is the highest weight mask in on the
> > PMU, therefore it will be placed last. Can we still create a scenario
> > where we would need to rewind that?
> > 
> > The scenario for AMD Fam15h is we're having masks like:
> > 
> > 0x3F -- 11
> > 0x38 -- 111000
> > 0x07 -- 000111
> > 
> > 0x09 -- 001001
> > 
> > And we mark 0x09 as overlapping, because it is not a direct subset of
> > 0x38 or 0x07 and has less weight than either of those. This means we'll
> > first try and place the 0x09 event, then try and place 0x38/0x07 events.
> > Now imagine we have:
> > 
> > 3 * 0x07 + 0x09
> > 
> > and the initial pick for the 0x09 event is counter 0, then we'll fail to
> > place all 0x07 events. So we'll pop back, try counter 4 for the 0x09
> > event, and then re-try all 0x07 events, which will now work.
> > 
> > 
> > 
> > But given, that in the uncore case, the overlapping event is the
> > heaviest mask, I don't think this can happen. Or did I overlook
> > something takes a bit to page all this back in.
> 
> Right, IMO 0xE mask may not be marked as overlapping. It is placed
> last and if there is no space left we are lost. There is no other
> combination or state we could try then. So the fix is to remove the
> overlapping bit for that counter, the state is then never saved.
> 
> This assumes there are no other counters than 0x3 and 0xc that overlap
> with 0xe. It becomes a bit tricky if there is another counter with the
> same or higher weight that overlaps with 0xe, e.g. 0x7.

As per a prior mail, the masks on the PMU in question are:

 0x01 - 0001
 0x03 - 0011
 0x0e - 1110
 0x0c - 1100

But since all the masks that have overlap (0xe -> {0xc,0x3}) and (0x3 ->
0x1) are of heavier weight, it should all work out I think.

So yes, something like the below (removing the OVERLAP bit) looks like
its sufficient.

---
 arch/x86/events/intel/uncore_snbep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/uncore_snbep.c 
b/arch/x86/events/intel/uncore_snbep.c
index 272427700d48..e6832be714bc 100644
--- a/arch/x86/events/intel/uncore_snbep.c
+++ b/arch/x86/events/intel/uncore_snbep.c
@@ -669,7 +669,7 @@ static struct event_constraint 
snbep_uncore_cbox_constraints[] = {
UNCORE_EVENT_CONSTRAINT(0x1c, 0xc),
UNCORE_EVENT_CONSTRAINT(0x1d, 0xc),
UNCORE_EVENT_CONSTRAINT(0x1e, 0xc),
-   EVENT_CONSTRAINT_OVERLAP(0x1f, 0xe, 0xff),
+   UNCORE_EVENT_CONSTRAINT(0x1f, 0xe),
UNCORE_EVENT_CONSTRAINT(0x21, 0x3),
UNCORE_EVENT_CONSTRAINT(0x23, 0x3),
UNCORE_EVENT_CONSTRAINT(0x31, 0x3),


Re: [PATCH] perf/x86: Fix overlap counter scheduling bug

2016-11-09 Thread Peter Zijlstra
On Wed, Nov 09, 2016 at 03:25:15PM +0100, Robert Richter wrote:
> On 08.11.16 19:27:39, Peter Zijlstra wrote:
> > The comment with EVENT_CONSTRAINT_OVERLAP states: "This is the case if
> > the counter mask of such an event is not a subset of any other counter
> > mask of a constraint with an equal or higher weight".
> > 
> > Esp. that latter part is of interest here I think, our overlapping mask
> > is 0x0e, that has 3 bits set and is the highest weight mask in on the
> > PMU, therefore it will be placed last. Can we still create a scenario
> > where we would need to rewind that?
> > 
> > The scenario for AMD Fam15h is we're having masks like:
> > 
> > 0x3F -- 11
> > 0x38 -- 111000
> > 0x07 -- 000111
> > 
> > 0x09 -- 001001
> > 
> > And we mark 0x09 as overlapping, because it is not a direct subset of
> > 0x38 or 0x07 and has less weight than either of those. This means we'll
> > first try and place the 0x09 event, then try and place 0x38/0x07 events.
> > Now imagine we have:
> > 
> > 3 * 0x07 + 0x09
> > 
> > and the initial pick for the 0x09 event is counter 0, then we'll fail to
> > place all 0x07 events. So we'll pop back, try counter 4 for the 0x09
> > event, and then re-try all 0x07 events, which will now work.
> > 
> > 
> > 
> > But given, that in the uncore case, the overlapping event is the
> > heaviest mask, I don't think this can happen. Or did I overlook
> > something takes a bit to page all this back in.
> 
> Right, IMO 0xE mask may not be marked as overlapping. It is placed
> last and if there is no space left we are lost. There is no other
> combination or state we could try then. So the fix is to remove the
> overlapping bit for that counter, the state is then never saved.
> 
> This assumes there are no other counters than 0x3 and 0xc that overlap
> with 0xe. It becomes a bit tricky if there is another counter with the
> same or higher weight that overlaps with 0xe, e.g. 0x7.

As per a prior mail, the masks on the PMU in question are:

 0x01 - 0001
 0x03 - 0011
 0x0e - 1110
 0x0c - 1100

But since all the masks that have overlap (0xe -> {0xc,0x3}) and (0x3 ->
0x1) are of heavier weight, it should all work out I think.

So yes, something like the below (removing the OVERLAP bit) looks like
its sufficient.

---
 arch/x86/events/intel/uncore_snbep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/uncore_snbep.c 
b/arch/x86/events/intel/uncore_snbep.c
index 272427700d48..e6832be714bc 100644
--- a/arch/x86/events/intel/uncore_snbep.c
+++ b/arch/x86/events/intel/uncore_snbep.c
@@ -669,7 +669,7 @@ static struct event_constraint 
snbep_uncore_cbox_constraints[] = {
UNCORE_EVENT_CONSTRAINT(0x1c, 0xc),
UNCORE_EVENT_CONSTRAINT(0x1d, 0xc),
UNCORE_EVENT_CONSTRAINT(0x1e, 0xc),
-   EVENT_CONSTRAINT_OVERLAP(0x1f, 0xe, 0xff),
+   UNCORE_EVENT_CONSTRAINT(0x1f, 0xe),
UNCORE_EVENT_CONSTRAINT(0x21, 0x3),
UNCORE_EVENT_CONSTRAINT(0x23, 0x3),
UNCORE_EVENT_CONSTRAINT(0x31, 0x3),


Re: [PATCH 1/2] backlight: arcxcnn: add support for ArticSand devices

2016-11-09 Thread Lee Jones
Jingoo?

On Wed, 26 Oct 2016, Olimpiu Dejeu wrote:

> Resubmition of arcxcnn backliught driver adding devicetree entries
>  for all registers
> 
> Signed-off-by: Olimpiu Dejeu 
> 
> ---
>  drivers/video/backlight/Kconfig  |   7 +
>  drivers/video/backlight/Makefile |   1 +
>  drivers/video/backlight/arcxcnn_bl.c | 541 
> +++
>  include/linux/i2c/arcxcnn.h  |  67 +
>  4 files changed, 616 insertions(+)
>  create mode 100644 drivers/video/backlight/arcxcnn_bl.c
>  create mode 100644 include/linux/i2c/arcxcnn.h
> 
> diff --git a/drivers/video/backlight/Kconfig b/drivers/video/backlight/Kconfig
> index 5ffa4b4..4e1d2ad 100644
> --- a/drivers/video/backlight/Kconfig
> +++ b/drivers/video/backlight/Kconfig
> @@ -460,6 +460,13 @@ config BACKLIGHT_BD6107
>   help
> If you have a Rohm BD6107 say Y to enable the backlight driver.
>  
> +config BACKLIGHT_ARCXCNN
> + tristate "Backlight driver for the Arctic Sands ARCxC family"
> + depends on I2C
> + help
> +   If you have an ARCxC family backlight say Y to enable
> +   the backlight driver.
> +
>  endif # BACKLIGHT_CLASS_DEVICE
>  
>  endif # BACKLIGHT_LCD_SUPPORT
> diff --git a/drivers/video/backlight/Makefile 
> b/drivers/video/backlight/Makefile
> index 16ec534..8905129 100644
> --- a/drivers/video/backlight/Makefile
> +++ b/drivers/video/backlight/Makefile
> @@ -55,3 +55,4 @@ obj-$(CONFIG_BACKLIGHT_SKY81452)+= sky81452-backlight.o
>  obj-$(CONFIG_BACKLIGHT_TOSA) += tosa_bl.o
>  obj-$(CONFIG_BACKLIGHT_TPS65217) += tps65217_bl.o
>  obj-$(CONFIG_BACKLIGHT_WM831X)   += wm831x_bl.o
> +obj-$(CONFIG_BACKLIGHT_ARCXCNN)  += arcxcnn_bl.o
> diff --git a/drivers/video/backlight/arcxcnn_bl.c 
> b/drivers/video/backlight/arcxcnn_bl.c
> new file mode 100644
> index 000..1dad680
> --- /dev/null
> +++ b/drivers/video/backlight/arcxcnn_bl.c
> @@ -0,0 +1,541 @@
> +/*
> + * Backlight driver for ArcticSand ARC_X_C_0N_0N Devices
> + *
> + * Copyright 2016 ArcticSand, Inc.
> + *
> + * Licensed under the GPL-2 or later.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "linux/i2c/arcxcnn.h"
> +
> +#define ARCXCNN_CMD  (0x00)  /* Command Register */
> +#define ARCXCNN_CMD_STDBY(0x80)  /* I2C Standby */
> +#define ARCXCNN_CMD_RESET(0x40)  /* Reset */
> +#define ARCXCNN_CMD_BOOST(0x10)  /* Boost */
> +#define ARCXCNN_CMD_OVP_MASK (0x0C)  /* --- Over Voltage Threshold */
> +#define ARCXCNN_CMD_OVP_XXV  (0x0C)  /*  Over Voltage Threshold */
> +#define ARCXCNN_CMD_OVP_20V  (0x08)  /* 20v Over Voltage Threshold */
> +#define ARCXCNN_CMD_OVP_24V  (0x04)  /* 24v Over Voltage Threshold */
> +#define ARCXCNN_CMD_OVP_31V  (0x00)  /* 31.4v Over Voltage Threshold */
> +#define ARCXCNN_CMD_EXT_COMP (0x01)  /* part (0) or full (1) external comp */
> +
> +#define ARCXCNN_CONFIG   (0x01)  /* Configuration */
> +#define ARCXCNN_STATUS1  (0x02)  /* Status 1 */
> +#define ARCXCNN_STATUS2  (0x03)  /* Status 2 */
> +#define ARCXCNN_FADECTRL (0x04)  /* Fading Control */
> +#define ARCXCNN_ILED_CONFIG  (0x05)  /* ILED Configuration */
> +
> +#define ARCXCNN_LEDEN(0x06)  /* LED Enable Register */
> +#define ARCXCNN_LEDEN_ISETEXT(0x80)  /* Full-scale current set 
> externally */
> +#define ARCXCNN_LEDEN_MASK   (0x3F)  /* LED string enables */
> +#define ARCXCNN_LEDEN_LED1   (0x01)
> +#define ARCXCNN_LEDEN_LED2   (0x02)
> +#define ARCXCNN_LEDEN_LED3   (0x04)
> +#define ARCXCNN_LEDEN_LED4   (0x08)
> +#define ARCXCNN_LEDEN_LED5   (0x10)
> +#define ARCXCNN_LEDEN_LED6   (0x20)
> +
> +#define ARCXCNN_WLED_ISET_LSB(0x07)  /* LED ISET LSB (in upper 
> nibble) */
> +#define ARCXCNN_WLED_ISET_MSB(0x08)  /* LED ISET MSB (8 bits) */
> +
> +#define ARCXCNN_DIMFREQ  (0x09)
> +#define ARCXCNN_COMP_CONFIG  (0x0A)
> +#define ARCXCNN_FILT_CONFIG  (0x0B)
> +#define ARCXCNN_IMAXTUNE (0x0C)
> +
> +#define DEFAULT_BL_NAME  "arctic_bl"
> +#define MAX_BRIGHTNESS   4095
> +
> +static int s_no_reset_on_remove;
> +module_param_named(noreset, s_no_reset_on_remove, int, 0644);
> +MODULE_PARM_DESC(noreset, "No reset on module removal");
> +
> +static int s_ibright = 60;
> +module_param_named(ibright, s_ibright, int, 0644);
> +MODULE_PARM_DESC(ibright, "Initial brightness (when no plat data)");
> +
> +static int s_iledstr = 0x3F;
> +module_param_named(iledstr, s_iledstr, int, 0644);
> +MODULE_PARM_DESC(iledstr, "Initial LED String (when no plat data)");
> +
> +static int s_retries = 2; /* 1 == only one try */
> +module_param_named(retries, s_retries, int, 0644);
> +MODULE_PARM_DESC(retries, "I2C retries attempted");
> +
> +enum arcxcnn_brightness_ctrl_mode {
> + PWM_BASED = 1,
> + REGISTER_BASED,
> +};
> +
> +struct arcxcnn;
> +
> +struct arcxcnn {
> + char chipname[64];
> + enum 

Re: [PATCH 1/2] backlight: arcxcnn: add support for ArticSand devices

2016-11-09 Thread Lee Jones
Jingoo?

On Wed, 26 Oct 2016, Olimpiu Dejeu wrote:

> Resubmition of arcxcnn backliught driver adding devicetree entries
>  for all registers
> 
> Signed-off-by: Olimpiu Dejeu 
> 
> ---
>  drivers/video/backlight/Kconfig  |   7 +
>  drivers/video/backlight/Makefile |   1 +
>  drivers/video/backlight/arcxcnn_bl.c | 541 
> +++
>  include/linux/i2c/arcxcnn.h  |  67 +
>  4 files changed, 616 insertions(+)
>  create mode 100644 drivers/video/backlight/arcxcnn_bl.c
>  create mode 100644 include/linux/i2c/arcxcnn.h
> 
> diff --git a/drivers/video/backlight/Kconfig b/drivers/video/backlight/Kconfig
> index 5ffa4b4..4e1d2ad 100644
> --- a/drivers/video/backlight/Kconfig
> +++ b/drivers/video/backlight/Kconfig
> @@ -460,6 +460,13 @@ config BACKLIGHT_BD6107
>   help
> If you have a Rohm BD6107 say Y to enable the backlight driver.
>  
> +config BACKLIGHT_ARCXCNN
> + tristate "Backlight driver for the Arctic Sands ARCxC family"
> + depends on I2C
> + help
> +   If you have an ARCxC family backlight say Y to enable
> +   the backlight driver.
> +
>  endif # BACKLIGHT_CLASS_DEVICE
>  
>  endif # BACKLIGHT_LCD_SUPPORT
> diff --git a/drivers/video/backlight/Makefile 
> b/drivers/video/backlight/Makefile
> index 16ec534..8905129 100644
> --- a/drivers/video/backlight/Makefile
> +++ b/drivers/video/backlight/Makefile
> @@ -55,3 +55,4 @@ obj-$(CONFIG_BACKLIGHT_SKY81452)+= sky81452-backlight.o
>  obj-$(CONFIG_BACKLIGHT_TOSA) += tosa_bl.o
>  obj-$(CONFIG_BACKLIGHT_TPS65217) += tps65217_bl.o
>  obj-$(CONFIG_BACKLIGHT_WM831X)   += wm831x_bl.o
> +obj-$(CONFIG_BACKLIGHT_ARCXCNN)  += arcxcnn_bl.o
> diff --git a/drivers/video/backlight/arcxcnn_bl.c 
> b/drivers/video/backlight/arcxcnn_bl.c
> new file mode 100644
> index 000..1dad680
> --- /dev/null
> +++ b/drivers/video/backlight/arcxcnn_bl.c
> @@ -0,0 +1,541 @@
> +/*
> + * Backlight driver for ArcticSand ARC_X_C_0N_0N Devices
> + *
> + * Copyright 2016 ArcticSand, Inc.
> + *
> + * Licensed under the GPL-2 or later.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "linux/i2c/arcxcnn.h"
> +
> +#define ARCXCNN_CMD  (0x00)  /* Command Register */
> +#define ARCXCNN_CMD_STDBY(0x80)  /* I2C Standby */
> +#define ARCXCNN_CMD_RESET(0x40)  /* Reset */
> +#define ARCXCNN_CMD_BOOST(0x10)  /* Boost */
> +#define ARCXCNN_CMD_OVP_MASK (0x0C)  /* --- Over Voltage Threshold */
> +#define ARCXCNN_CMD_OVP_XXV  (0x0C)  /*  Over Voltage Threshold */
> +#define ARCXCNN_CMD_OVP_20V  (0x08)  /* 20v Over Voltage Threshold */
> +#define ARCXCNN_CMD_OVP_24V  (0x04)  /* 24v Over Voltage Threshold */
> +#define ARCXCNN_CMD_OVP_31V  (0x00)  /* 31.4v Over Voltage Threshold */
> +#define ARCXCNN_CMD_EXT_COMP (0x01)  /* part (0) or full (1) external comp */
> +
> +#define ARCXCNN_CONFIG   (0x01)  /* Configuration */
> +#define ARCXCNN_STATUS1  (0x02)  /* Status 1 */
> +#define ARCXCNN_STATUS2  (0x03)  /* Status 2 */
> +#define ARCXCNN_FADECTRL (0x04)  /* Fading Control */
> +#define ARCXCNN_ILED_CONFIG  (0x05)  /* ILED Configuration */
> +
> +#define ARCXCNN_LEDEN(0x06)  /* LED Enable Register */
> +#define ARCXCNN_LEDEN_ISETEXT(0x80)  /* Full-scale current set 
> externally */
> +#define ARCXCNN_LEDEN_MASK   (0x3F)  /* LED string enables */
> +#define ARCXCNN_LEDEN_LED1   (0x01)
> +#define ARCXCNN_LEDEN_LED2   (0x02)
> +#define ARCXCNN_LEDEN_LED3   (0x04)
> +#define ARCXCNN_LEDEN_LED4   (0x08)
> +#define ARCXCNN_LEDEN_LED5   (0x10)
> +#define ARCXCNN_LEDEN_LED6   (0x20)
> +
> +#define ARCXCNN_WLED_ISET_LSB(0x07)  /* LED ISET LSB (in upper 
> nibble) */
> +#define ARCXCNN_WLED_ISET_MSB(0x08)  /* LED ISET MSB (8 bits) */
> +
> +#define ARCXCNN_DIMFREQ  (0x09)
> +#define ARCXCNN_COMP_CONFIG  (0x0A)
> +#define ARCXCNN_FILT_CONFIG  (0x0B)
> +#define ARCXCNN_IMAXTUNE (0x0C)
> +
> +#define DEFAULT_BL_NAME  "arctic_bl"
> +#define MAX_BRIGHTNESS   4095
> +
> +static int s_no_reset_on_remove;
> +module_param_named(noreset, s_no_reset_on_remove, int, 0644);
> +MODULE_PARM_DESC(noreset, "No reset on module removal");
> +
> +static int s_ibright = 60;
> +module_param_named(ibright, s_ibright, int, 0644);
> +MODULE_PARM_DESC(ibright, "Initial brightness (when no plat data)");
> +
> +static int s_iledstr = 0x3F;
> +module_param_named(iledstr, s_iledstr, int, 0644);
> +MODULE_PARM_DESC(iledstr, "Initial LED String (when no plat data)");
> +
> +static int s_retries = 2; /* 1 == only one try */
> +module_param_named(retries, s_retries, int, 0644);
> +MODULE_PARM_DESC(retries, "I2C retries attempted");
> +
> +enum arcxcnn_brightness_ctrl_mode {
> + PWM_BASED = 1,
> + REGISTER_BASED,
> +};
> +
> +struct arcxcnn;
> +
> +struct arcxcnn {
> + char chipname[64];
> + enum arcxcnn_chip_id chip_id;
> + 

Re: [PATCH] x86/cpuid: Deal with broken firmware once more

2016-11-09 Thread Thomas Gleixner
On Wed, 9 Nov 2016, Thomas Gleixner wrote:
> + if (pkg != c->phys_proc_id) {
> + pr_err(FW_BUG "CPU%u: Using firmware package id %u instead of 
> %u\n",
> +cpu, pkg, c->phys_proc_id);
> + c->phys_proc_id = pkg;
> + }
> + c->logical_proc_id = topology_phys_to_logical_pkg(pkg);
> +#else
> + c->locical_proc_id = 0;

That want's to be logical_proc_id of course ...

Stupid me.


Re: [PATCH] x86/cpuid: Deal with broken firmware once more

2016-11-09 Thread Thomas Gleixner
On Wed, 9 Nov 2016, Thomas Gleixner wrote:
> + if (pkg != c->phys_proc_id) {
> + pr_err(FW_BUG "CPU%u: Using firmware package id %u instead of 
> %u\n",
> +cpu, pkg, c->phys_proc_id);
> + c->phys_proc_id = pkg;
> + }
> + c->logical_proc_id = topology_phys_to_logical_pkg(pkg);
> +#else
> + c->locical_proc_id = 0;

That want's to be logical_proc_id of course ...

Stupid me.


Re: linux.git: printk() problem

2016-11-09 Thread Petr Mladek
On Mon 2016-10-24 19:22:59, Linus Torvalds wrote:
> On Mon, Oct 24, 2016 at 7:06 PM, Linus Torvalds
>  wrote:
> > On Mon, Oct 24, 2016 at 6:55 PM, Sergey Senozhatsky
> >  wrote:
> >>
> >> I think cont_flush() should grab the logbuf_lock lock, because
> >> it does log_store() and touches the cont.len. so something like
> >> this perhaps
> >
> > Absolutely. Good catch.
> 
> Actually, you can't do it the way you did (inside cont_flush), because
> "cont_flush()" is already called with logbuf_lock held in most cases
> (see "cont_add()").
> 
> So it's really just the timer function that needs to take the
> logbuf_lock before it calls cont_flush().
> 
> So here's a new version. How does this look to you?
> 
> Again, this still tests "cont.len" outside the lock (not just in
> console_unlock(), but also in deferred_cont_flush()). And it's fine:
> even if it sees the "wrong" value due to some race, it does so either
> because cont.len was just set to non-zero (and whoever set it will
> force the re-check anyway), or it got cleared just as it was tested
> (and at worst you end up with an extra timer invocation).

This patch really seems to reduce the number of too-early flushed
continuous lines. It reduces the scattered output. But I am not sure
if we want to add a timer code into the printk calls at this stage
(for 4.9-rc5).

Well, the patch looks fine, except that we call cont_flush() without
poking console. It is not a regression because only newlines triggered
console in the past and they still do but...

I would suggest to revert the commit bfd8d3f23b51018388be
("printk: make reading the kernel log flush pending lines") for
4.9. Then we could  test/fix it properly for 4.10. Me and Sergey would
happily help with it.

Just in case, you still want to commit this patch. I would suggest
to apply the one below on top.


>From 7a0ad7ce2347346fc81872fe42a95ad5dfba4098 Mon Sep 17 00:00:00 2001
From: Petr Mladek 
Date: Tue, 25 Oct 2016 15:23:13 +0200
Subject: [PATCH] printk: Poke console and other loggers when cont buffer is
 flushed

The commit bfd8d3f23b51018388be041 ("printk: make reading the kernel
log flush pending lines") allows to add new message into the log
buffer without flushing it to the console and waking other loggers.

This patch adds wake_up_console() and calls it when the cont
buffer is flushed. The function name will make more sense once
we switch to the async printk.

Note that it is enough to poke console. The other loggers
are waken from console_unlock() when there is a new message.

The patch also renames PRINTK_PENDING flags and wake_up_klogd_work*
stuff to reduce confusion. First, there are more possible log
daemons, e.g. klogd, syslogd. Second, the word "console" is more
descriptive than "output".

This patch is based on top of the fix proposed at
https://lkml.kernel.org/r/CA+55aFwKYnrMJr_vSE+GfDGszeUGyd=cpud15-zz8ywqw61...@mail.gmail.com

Signed-off-by: Petr Mladek 
---
 kernel/printk/printk.c | 60 ++
 1 file changed, 41 insertions(+), 19 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index e63aa679614e..f0e72de6ddbc 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -780,6 +780,7 @@ static ssize_t devkmsg_write(struct kiocb *iocb, struct 
iov_iter *from)
 }
 
 static void deferred_cont_flush(void);
+static void __wake_up_console(void);
 
 static ssize_t devkmsg_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
@@ -1620,13 +1621,11 @@ static bool cont_flush(void)
 static void flush_timer(unsigned long data)
 {
unsigned long flags;
-   bool did_flush;
 
raw_spin_lock_irqsave(_lock, flags);
-   did_flush = cont_flush();
+   if (cont_flush())
+   __wake_up_console();
raw_spin_unlock_irqrestore(_lock, flags);
-   if (did_flush)
-   wake_up_klogd();
 }
 
 static void deferred_cont_flush(void)
@@ -2730,40 +2729,57 @@ static int __init printk_late_init(void)
 
 #if defined CONFIG_PRINTK
 /*
- * Delayed printk version, for scheduler-internal messages:
+ * Handle console and wakeup loggers via IRQ work so that it can be done
+ * from any context.
  */
-#define PRINTK_PENDING_WAKEUP  0x01
-#define PRINTK_PENDING_OUTPUT  0x02
+#define PRINTK_PENDING_LOGGERS 0x01
+#define PRINTK_PENDING_CONSOLE 0x02
 
 static DEFINE_PER_CPU(int, printk_pending);
 
-static void wake_up_klogd_work_func(struct irq_work *irq_work)
+static void printk_pending_func(struct irq_work *irq_work)
 {
int pending = __this_cpu_xchg(printk_pending, 0);
 
-   if (pending & PRINTK_PENDING_OUTPUT) {
+   if (pending & PRINTK_PENDING_CONSOLE) {
/* If trylock fails, someone else is doing the printing */
if (console_trylock())
console_unlock();
}
 
-   if 

Re: linux.git: printk() problem

2016-11-09 Thread Petr Mladek
On Mon 2016-10-24 19:22:59, Linus Torvalds wrote:
> On Mon, Oct 24, 2016 at 7:06 PM, Linus Torvalds
>  wrote:
> > On Mon, Oct 24, 2016 at 6:55 PM, Sergey Senozhatsky
> >  wrote:
> >>
> >> I think cont_flush() should grab the logbuf_lock lock, because
> >> it does log_store() and touches the cont.len. so something like
> >> this perhaps
> >
> > Absolutely. Good catch.
> 
> Actually, you can't do it the way you did (inside cont_flush), because
> "cont_flush()" is already called with logbuf_lock held in most cases
> (see "cont_add()").
> 
> So it's really just the timer function that needs to take the
> logbuf_lock before it calls cont_flush().
> 
> So here's a new version. How does this look to you?
> 
> Again, this still tests "cont.len" outside the lock (not just in
> console_unlock(), but also in deferred_cont_flush()). And it's fine:
> even if it sees the "wrong" value due to some race, it does so either
> because cont.len was just set to non-zero (and whoever set it will
> force the re-check anyway), or it got cleared just as it was tested
> (and at worst you end up with an extra timer invocation).

This patch really seems to reduce the number of too-early flushed
continuous lines. It reduces the scattered output. But I am not sure
if we want to add a timer code into the printk calls at this stage
(for 4.9-rc5).

Well, the patch looks fine, except that we call cont_flush() without
poking console. It is not a regression because only newlines triggered
console in the past and they still do but...

I would suggest to revert the commit bfd8d3f23b51018388be
("printk: make reading the kernel log flush pending lines") for
4.9. Then we could  test/fix it properly for 4.10. Me and Sergey would
happily help with it.

Just in case, you still want to commit this patch. I would suggest
to apply the one below on top.


>From 7a0ad7ce2347346fc81872fe42a95ad5dfba4098 Mon Sep 17 00:00:00 2001
From: Petr Mladek 
Date: Tue, 25 Oct 2016 15:23:13 +0200
Subject: [PATCH] printk: Poke console and other loggers when cont buffer is
 flushed

The commit bfd8d3f23b51018388be041 ("printk: make reading the kernel
log flush pending lines") allows to add new message into the log
buffer without flushing it to the console and waking other loggers.

This patch adds wake_up_console() and calls it when the cont
buffer is flushed. The function name will make more sense once
we switch to the async printk.

Note that it is enough to poke console. The other loggers
are waken from console_unlock() when there is a new message.

The patch also renames PRINTK_PENDING flags and wake_up_klogd_work*
stuff to reduce confusion. First, there are more possible log
daemons, e.g. klogd, syslogd. Second, the word "console" is more
descriptive than "output".

This patch is based on top of the fix proposed at
https://lkml.kernel.org/r/CA+55aFwKYnrMJr_vSE+GfDGszeUGyd=cpud15-zz8ywqw61...@mail.gmail.com

Signed-off-by: Petr Mladek 
---
 kernel/printk/printk.c | 60 ++
 1 file changed, 41 insertions(+), 19 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index e63aa679614e..f0e72de6ddbc 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -780,6 +780,7 @@ static ssize_t devkmsg_write(struct kiocb *iocb, struct 
iov_iter *from)
 }
 
 static void deferred_cont_flush(void);
+static void __wake_up_console(void);
 
 static ssize_t devkmsg_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
@@ -1620,13 +1621,11 @@ static bool cont_flush(void)
 static void flush_timer(unsigned long data)
 {
unsigned long flags;
-   bool did_flush;
 
raw_spin_lock_irqsave(_lock, flags);
-   did_flush = cont_flush();
+   if (cont_flush())
+   __wake_up_console();
raw_spin_unlock_irqrestore(_lock, flags);
-   if (did_flush)
-   wake_up_klogd();
 }
 
 static void deferred_cont_flush(void)
@@ -2730,40 +2729,57 @@ static int __init printk_late_init(void)
 
 #if defined CONFIG_PRINTK
 /*
- * Delayed printk version, for scheduler-internal messages:
+ * Handle console and wakeup loggers via IRQ work so that it can be done
+ * from any context.
  */
-#define PRINTK_PENDING_WAKEUP  0x01
-#define PRINTK_PENDING_OUTPUT  0x02
+#define PRINTK_PENDING_LOGGERS 0x01
+#define PRINTK_PENDING_CONSOLE 0x02
 
 static DEFINE_PER_CPU(int, printk_pending);
 
-static void wake_up_klogd_work_func(struct irq_work *irq_work)
+static void printk_pending_func(struct irq_work *irq_work)
 {
int pending = __this_cpu_xchg(printk_pending, 0);
 
-   if (pending & PRINTK_PENDING_OUTPUT) {
+   if (pending & PRINTK_PENDING_CONSOLE) {
/* If trylock fails, someone else is doing the printing */
if (console_trylock())
console_unlock();
}
 
-   if (pending & PRINTK_PENDING_WAKEUP)
+   if (pending & PRINTK_PENDING_LOGGERS)

Re: [PATCH] drbd: Fix kernel_sendmsg() usage

2016-11-09 Thread Richard Weinberger
On 09.11.2016 16:32, Lars Ellenberg wrote:
> On Tue, Nov 08, 2016 at 09:52:04AM -0700, Jens Axboe wrote:
 This should go into 4.9,
 and into all stable branches since and including v4.0,
 which is the first to contain the exposing change.

 It is correct for all stable branches older than that as well
 (which contain the DRBD driver; which is 2.6.33 and up).

 It requires a small "conflict" resolution for v4.4 and earlier, with v4.5
 we dropped the comment block immediately preceding the kernel_sendmsg().

 Cc: sta...@vger.kernel.org
 Cc: v...@zeniv.linux.org.uk
 Cc: christoph.lechleit...@iteg.at
 Cc: wolfgang.g...@iteg.at
 Reported-by: Christoph Lechleitner 
 Tested-by: Christoph Lechleitner 
 Signed-off-by: Richard Weinberger 
 Signed-off-by: Lars Ellenberg 
>>>
>>> Changing my patch is perfectly fine, but please clearly state it.
>>> I.e. by adding something like that before your S-o-b.
>>> [Lars: Massaged patch to match my personal taste...]
>>
> 
>> Lars, are you sending a new one? If you do, add the stable tag as well.
> 
> So my "change" against his original patch was
> - rv = kernel_sendmsg(sock, , , 1, size - sent);
> + rv = kernel_sendmsg(sock, , , 1, iov.iov_len);
> to make it "more obviously correct" from looking just at the one line
> without even having to read the context.  And a more verbose commit message.
> 
> If that requires yet additional noise, sure, so be it :)
> 
> Should I sent two patches, one that applies to 4.5 and later,
> and one that applies to 2.6.33 ... 4.4, or are you or stable
> willing to resolve the trivial "missing comment block" conflict yourself?

BTW: Why did you drop the "Fixes:" tag too?

Thanks,
//richard


Re: [PATCH] drbd: Fix kernel_sendmsg() usage

2016-11-09 Thread Richard Weinberger
On 09.11.2016 16:32, Lars Ellenberg wrote:
> On Tue, Nov 08, 2016 at 09:52:04AM -0700, Jens Axboe wrote:
 This should go into 4.9,
 and into all stable branches since and including v4.0,
 which is the first to contain the exposing change.

 It is correct for all stable branches older than that as well
 (which contain the DRBD driver; which is 2.6.33 and up).

 It requires a small "conflict" resolution for v4.4 and earlier, with v4.5
 we dropped the comment block immediately preceding the kernel_sendmsg().

 Cc: sta...@vger.kernel.org
 Cc: v...@zeniv.linux.org.uk
 Cc: christoph.lechleit...@iteg.at
 Cc: wolfgang.g...@iteg.at
 Reported-by: Christoph Lechleitner 
 Tested-by: Christoph Lechleitner 
 Signed-off-by: Richard Weinberger 
 Signed-off-by: Lars Ellenberg 
>>>
>>> Changing my patch is perfectly fine, but please clearly state it.
>>> I.e. by adding something like that before your S-o-b.
>>> [Lars: Massaged patch to match my personal taste...]
>>
> 
>> Lars, are you sending a new one? If you do, add the stable tag as well.
> 
> So my "change" against his original patch was
> - rv = kernel_sendmsg(sock, , , 1, size - sent);
> + rv = kernel_sendmsg(sock, , , 1, iov.iov_len);
> to make it "more obviously correct" from looking just at the one line
> without even having to read the context.  And a more verbose commit message.
> 
> If that requires yet additional noise, sure, so be it :)
> 
> Should I sent two patches, one that applies to 4.5 and later,
> and one that applies to 2.6.33 ... 4.4, or are you or stable
> willing to resolve the trivial "missing comment block" conflict yourself?

BTW: Why did you drop the "Fixes:" tag too?

Thanks,
//richard


<    5   6   7   8   9   10   11   12   13   14   >