date:20161012

Re: [mac80211] BUG_ON with current -git (4.8.0-11417-g24532f7)

2016-10-12 Thread Andy Lutomirski

On Wed, Oct 12, 2016 at 7:22 AM, Johannes Berg wrote:
>
>> > Can you elaborate on how exactly it kills your system?
>>
>> the last time I saw it it was a NULL deref at
>> ieee80211_aes_ccm_decrypt.
>
> Hm. I was expecting something within the crypto code would cause the
> crash, this seems strange.
>
> Anyway, I'm surely out of my depth wrt. the actual cause. Something
> like the patch below probably works around it, but it's horribly
> inefficient due to the locking and doesn't cover CMAC/GMAC either.

In a pinch, I have these patches sitting around:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=0a39cfa6fbb5d5635c85253cc7d6b44b54822afd
https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack&id=bf8cfa200b5a01383ea39fc8ce2f32909767baa8

I don't like them, though.  I think it's rather silly that we can't
just pass virtual addresses to the crypto code.

Re: [PATCH 1/1] vxlan: insert ipv6 macro

2016-10-12 Thread zhuyj

 Soon I will analyze the previous patch. I will let you know.

Thanks a lot.

On Thu, Oct 13, 2016 at 1:28 PM, zhuyj wrote:
>  Hi,  Jiri
>
> The dumped source code is in the attachment. Please check it. I think
> this file can explain all.
>
> If anything, please just let me know.
> Thanks a lot.
>
> On Wed, Oct 12, 2016 at 9:16 PM, Jiri Benc  wrote:
>> On Wed, 12 Oct 2016 21:01:54 +0800, zhuyj wrote:
>>> How to explain the following source code? As you mentioned,  are the
>>> #ifdefs in the following source pointless?
>>
>> They are not, the code would not compile without them. Look how struct
>> vxlan_dev is defined.
>>
>> Those are really basic questions you have. I suggest you try yourself
>> before asking such questions next time. In this case, you could
>> trivially remove the #ifdef and see for yourself, as I explained in the
>> previous email. Please do not try to offload your homework to other
>> people. It's very obvious you didn't even try to understand this, even
>> after the feedback you received.
>>
>> And do not top post.
>>
>> Thanks,
>>
>>  Jiri

[PATCH 06/10] mm: replace get_user_pages() write/force parameters with gup_flags

2016-10-12 Thread Lorenzo Stoakes

This patch removes the write and force parameters from get_user_pages() and
replaces them with a gup_flags parameter to make the use of FOLL_FORCE explicit
in callers as use of this flag can result in surprising behaviour (and hence
bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes 
---
 arch/cris/arch-v32/drivers/cryptocop.c |  4 +---
 arch/ia64/kernel/err_inject.c  |  2 +-
 arch/x86/mm/mpx.c  |  5 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c|  7 +--
 drivers/gpu/drm/radeon/radeon_ttm.c|  3 ++-
 drivers/gpu/drm/via/via_dmablit.c  |  4 ++--
 drivers/infiniband/core/umem.c |  6 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c|  2 +-
 drivers/infiniband/hw/qib/qib_user_pages.c |  3 ++-
 drivers/infiniband/hw/usnic/usnic_uiom.c   |  5 -
 drivers/media/v4l2-core/videobuf-dma-sg.c  |  7 +--
 drivers/misc/mic/scif/scif_rma.c   |  3 +--
 drivers/misc/sgi-gru/grufault.c|  2 +-
 drivers/platform/goldfish/goldfish_pipe.c  |  3 ++-
 drivers/rapidio/devices/rio_mport_cdev.c   |  3 ++-
 .../vc04_services/interface/vchiq_arm/vchiq_2835_arm.c |  3 +--
 .../vc04_services/interface/vchiq_arm/vchiq_arm.c  |  3 +--
 drivers/virt/fsl_hypervisor.c  |  4 ++--
 include/linux/mm.h |  2 +-
 mm/gup.c   | 12 +++-
 mm/mempolicy.c |  2 +-
 mm/nommu.c | 18 --
 22 files changed, 49 insertions(+), 54 deletions(-)

diff --git a/arch/cris/arch-v32/drivers/cryptocop.c b/arch/cris/arch-v32/drivers/cryptocop.c
index b5698c8..099e170 100644
--- a/arch/cris/arch-v32/drivers/cryptocop.c
+++ b/arch/cris/arch-v32/drivers/cryptocop.c
@@ -2722,7 +2722,6 @@ static int cryptocop_ioctl_process(struct inode *inode, struct file *filp, unsig
err = get_user_pages((unsigned long int)(oper.indata + prev_ix),
 noinpages,
 0,  /* read access only for in data */
-0, /* no force */
 inpages,
 NULL);
 
@@ -2736,8 +2735,7 @@ static int cryptocop_ioctl_process(struct inode *inode, struct file *filp, unsig
if (oper.do_cipher){
err = get_user_pages((unsigned long int)oper.cipher_outdata,
 nooutpages,
-1, /* write access for out data */
-0, /* no force */
+FOLL_WRITE, /* write access for out data */
 outpages,
 NULL);
up_read(&current->mm->mmap_sem);
diff --git a/arch/ia64/kernel/err_inject.c b/arch/ia64/kernel/err_inject.c
index 09f8457..5ed0ea9 100644
--- a/arch/ia64/kernel/err_inject.c
+++ b/arch/ia64/kernel/err_inject.c
@@ -142,7 +142,7 @@ store_virtual_to_phys(struct device *dev, struct device_attribute *attr,
u64 virt_addr=simple_strtoull(buf, NULL, 16);
int ret;
 
-   ret = get_user_pages(virt_addr, 1, VM_READ, 0, NULL, NULL);
+   ret = get_user_pages(virt_addr, 1, FOLL_WRITE, NULL, NULL);
if (ret<=0) {
 #ifdef ERR_INJ_DEBUG
printk("Virtual address %lx is not existing.\n",virt_addr);
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 8047687..e4f8009 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -544,10 +544,9 @@ static int mpx_resolve_fault(long __user *addr, int write)
 {
long gup_ret;
int nr_pages = 1;
-   int force = 0;
 
-   gup_ret = get_user_pages((unsigned long)addr, nr_pages, write,
-   force, NULL, NULL);
+   gup_ret = get_user_pages((unsigned long)addr, nr_pages,
+   write ? FOLL_WRITE : 0, NULL, NULL);
/*
 * get_user_pages() returns number of pages gotten.
 * 0 means we failed to fault in and get anything,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 887483b..dcaf691 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -555,10 +555,13 @@ struct amdgpu_ttm_tt {
 int amdgpu_ttm_tt_get_user_pages(struct ttm_tt *ttm, struct page **pages)
 {
struct amdgpu_ttm_tt *gtt = (void *)ttm;
-   int write = !(gtt->userflags & AMDGPU_GEM_USERPTR_READONLY);
+   unsigned int flags = 0;
unsigned pinned = 0;
int r;
 
+   if (!(gtt->userflags & AMDGPU_GEM_USERPTR_READONLY))
+   flags |= FOLL_WRITE;
+
if (gtt->userflags &

[PATCH 10/10] mm: replace access_process_vm() write parameter with gup_flags

2016-10-12 Thread Lorenzo Stoakes

This patch removes the write parameter from access_process_vm() and replaces it
with a gup_flags parameter as use of this function previously _implied_
FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

We make this explicit as use of FOLL_FORCE can result in surprising behaviour
(and hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes 
---
 arch/alpha/kernel/ptrace.c |  9 ++---
 arch/blackfin/kernel/ptrace.c  |  5 +++--
 arch/cris/arch-v32/kernel/ptrace.c |  4 ++--
 arch/ia64/kernel/ptrace.c  | 14 +-
 arch/m32r/kernel/ptrace.c  | 15 ++-
 arch/mips/kernel/ptrace32.c|  5 +++--
 arch/powerpc/kernel/ptrace32.c |  5 +++--
 arch/score/kernel/ptrace.c | 10 ++
 arch/sparc/kernel/ptrace_64.c  | 24 
 arch/x86/kernel/step.c |  3 ++-
 arch/x86/um/ptrace_32.c|  3 ++-
 arch/x86/um/ptrace_64.c|  3 ++-
 include/linux/mm.h |  3 ++-
 kernel/ptrace.c| 16 ++--
 mm/memory.c|  8 ++--
 mm/nommu.c |  6 +++---
 mm/util.c  |  5 +++--
 17 files changed, 84 insertions(+), 54 deletions(-)

diff --git a/arch/alpha/kernel/ptrace.c b/arch/alpha/kernel/ptrace.c
index d9ee817..940dfb4 100644
--- a/arch/alpha/kernel/ptrace.c
+++ b/arch/alpha/kernel/ptrace.c
@@ -157,14 +157,16 @@ put_reg(struct task_struct *task, unsigned long regno, unsigned long data)
 static inline int
 read_int(struct task_struct *task, unsigned long addr, int * data)
 {
-   int copied = access_process_vm(task, addr, data, sizeof(int), 0);
+   int copied = access_process_vm(task, addr, data, sizeof(int),
+   FOLL_FORCE);
return (copied == sizeof(int)) ? 0 : -EIO;
 }
 
 static inline int
 write_int(struct task_struct *task, unsigned long addr, int data)
 {
-   int copied = access_process_vm(task, addr, &data, sizeof(int), 1);
+   int copied = access_process_vm(task, addr, &data, sizeof(int),
+   FOLL_FORCE | FOLL_WRITE);
return (copied == sizeof(int)) ? 0 : -EIO;
 }
 
@@ -281,7 +283,8 @@ long arch_ptrace(struct task_struct *child, long request,
/* When I and D space are separate, these will need to be fixed.  */
case PTRACE_PEEKTEXT: /* read word at location addr. */
case PTRACE_PEEKDATA:
-   copied = access_process_vm(child, addr, &tmp, sizeof(tmp), 0);
+   copied = access_process_vm(child, addr, &tmp, sizeof(tmp),
+   FOLL_FORCE);
ret = -EIO;
if (copied != sizeof(tmp))
break;
diff --git a/arch/blackfin/kernel/ptrace.c b/arch/blackfin/kernel/ptrace.c
index 8b8fe67..8d79286 100644
--- a/arch/blackfin/kernel/ptrace.c
+++ b/arch/blackfin/kernel/ptrace.c
@@ -271,7 +271,7 @@ long arch_ptrace(struct task_struct *child, long request,
case BFIN_MEM_ACCESS_CORE:
case BFIN_MEM_ACCESS_CORE_ONLY:
copied = access_process_vm(child, addr, ,
-  to_copy, 0);
+  to_copy, FOLL_FORCE);
if (copied)
break;
 
@@ -324,7 +324,8 @@ long arch_ptrace(struct task_struct *child, long request,
case BFIN_MEM_ACCESS_CORE:
case BFIN_MEM_ACCESS_CORE_ONLY:
copied = access_process_vm(child, addr, ,
-  to_copy, 1);
+  to_copy,
+  FOLL_FORCE | 
FOLL_WRITE);
break;
case BFIN_MEM_ACCESS_DMA:
if (safe_dma_memcpy(paddr, , to_copy))
diff --git a/arch/cris/arch-v32/kernel/ptrace.c b/arch/cris/arch-v32/kernel/ptrace.c
index f085229..f0df654 100644
--- a/arch/cris/arch-v32/kernel/ptrace.c
+++ b/arch/cris/arch-v32/kernel/ptrace.c
@@ -147,7 +147,7 @@ long arch_ptrace(struct task_struct *child, long request,
/* The trampoline page is globally mapped, no page table to traverse.*/
tmp = *(unsigned long*)addr;
} else {
-   copied = access_process_vm(child, addr, &tmp, sizeof(tmp), 0);
+   copied = access_process_vm(child, addr, &tmp, sizeof(tmp), FOLL_FORCE);
 
if (copied != sizeof(tmp))
break;
@@ -279,7 +279,7 @@ static int insn_size(struct task_struct *child, unsigned long pc)
   int opsize = 0;
 
   /* Read the

[PATCH 05/10] mm: replace get_vaddr_frames() write/force parameters with gup_flags

2016-10-12 Thread Lorenzo Stoakes

This patch removes the write and force parameters from get_vaddr_frames() and
replaces them with a gup_flags parameter to make the use of FOLL_FORCE explicit
in callers as use of this flag can result in surprising behaviour (and hence
bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes 
---
 drivers/gpu/drm/exynos/exynos_drm_g2d.c|  3 ++-
 drivers/media/platform/omap/omap_vout.c|  2 +-
 drivers/media/v4l2-core/videobuf2-memops.c |  6 +-
 include/linux/mm.h |  2 +-
 mm/frame_vector.c  | 13 ++---
 5 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
index aa92dec..fbd13fa 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
@@ -488,7 +488,8 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct drm_device *drm_dev,
goto err_free;
}
 
-   ret = get_vaddr_frames(start, npages, true, true, g2d_userptr->vec);
+   ret = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE,
+   g2d_userptr->vec);
if (ret != npages) {
DRM_ERROR("failed to get user pages from userptr.\n");
if (ret < 0)
diff --git a/drivers/media/platform/omap/omap_vout.c b/drivers/media/platform/omap/omap_vout.c
index e668dde..a31b95c 100644
--- a/drivers/media/platform/omap/omap_vout.c
+++ b/drivers/media/platform/omap/omap_vout.c
@@ -214,7 +214,7 @@ static int omap_vout_get_userptr(struct videobuf_buffer *vb, u32 virtp,
if (!vec)
return -ENOMEM;
 
-   ret = get_vaddr_frames(virtp, 1, true, false, vec);
+   ret = get_vaddr_frames(virtp, 1, FOLL_WRITE, vec);
if (ret != 1) {
frame_vector_destroy(vec);
return -EINVAL;
diff --git a/drivers/media/v4l2-core/videobuf2-memops.c b/drivers/media/v4l2-core/videobuf2-memops.c
index 3c3b517..1cd322e 100644
--- a/drivers/media/v4l2-core/videobuf2-memops.c
+++ b/drivers/media/v4l2-core/videobuf2-memops.c
@@ -42,6 +42,10 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
unsigned long first, last;
unsigned long nr;
struct frame_vector *vec;
+   unsigned int flags = FOLL_FORCE;
+
+   if (write)
+   flags |= FOLL_WRITE;
 
first = start >> PAGE_SHIFT;
last = (start + length - 1) >> PAGE_SHIFT;
@@ -49,7 +53,7 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
vec = frame_vector_create(nr);
if (!vec)
return ERR_PTR(-ENOMEM);
-   ret = get_vaddr_frames(start & PAGE_MASK, nr, write, true, vec);
+   ret = get_vaddr_frames(start & PAGE_MASK, nr, flags, vec);
if (ret < 0)
goto out_destroy;
/* We accept only complete set of PFNs */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ab538..5ff084f6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1305,7 +1305,7 @@ struct frame_vector {
 struct frame_vector *frame_vector_create(unsigned int nr_frames);
 void frame_vector_destroy(struct frame_vector *vec);
 int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
-bool write, bool force, struct frame_vector *vec);
+unsigned int gup_flags, struct frame_vector *vec);
 void put_vaddr_frames(struct frame_vector *vec);
 int frame_vector_to_pages(struct frame_vector *vec);
 void frame_vector_to_pfns(struct frame_vector *vec);
diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index 81b6749..db77dcb 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -11,10 +11,7 @@
  * get_vaddr_frames() - map virtual addresses to pfns
  * @start: starting user address
  * @nr_frames: number of pages / pfns from start to map
- * @write: whether pages will be written to by the caller
- * @force: whether to force write access even if user mapping is
- * readonly. See description of the same argument of
-   get_user_pages().
+ * @gup_flags: flags modifying lookup behaviour
  * @vec:   structure which receives pages / pfns of the addresses mapped.
  * It should have space for at least nr_frames entries.
  *
@@ -34,23 +31,17 @@
  * This function takes care of grabbing mmap_sem as necessary.
  */
 int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
-bool write, bool force, struct frame_vector *vec)
+unsigned int gup_flags, struct frame_vector *vec)
 {
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
int ret = 0;
int err;
int locked;
-   unsigned int gup_flags = 0;
 
if (nr_frames == 0)
return 0;
 
-   if (write)
-   gup_flags |= FOLL_WRITE;
-   if (force)
-   gup_flags |= FOLL_FORCE;
-
if

[PATCH 08/10] mm: replace __access_remote_vm() write parameter with gup_flags

2016-10-12 Thread Lorenzo Stoakes

This patch removes the write parameter from __access_remote_vm() and replaces it
with a gup_flags parameter as use of this function previously _implied_
FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

We make this explicit as use of FOLL_FORCE can result in surprising behaviour
(and hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes 
---
 mm/memory.c | 23 +++
 mm/nommu.c  |  9 ++---
 2 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 20a9adb..79ebed3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3869,14 +3869,11 @@ EXPORT_SYMBOL_GPL(generic_access_phys);
  * given task for page fault accounting.
  */
 static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
-   unsigned long addr, void *buf, int len, int write)
+   unsigned long addr, void *buf, int len, unsigned int gup_flags)
 {
struct vm_area_struct *vma;
void *old_buf = buf;
-   unsigned int flags = FOLL_FORCE;
-
-   if (write)
-   flags |= FOLL_WRITE;
+   int write = gup_flags & FOLL_WRITE;
 
down_read(&mm->mmap_sem);
/* ignore errors, just check how much was successfully transferred */
@@ -3886,7 +3883,7 @@ static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
struct page *page = NULL;
 
ret = get_user_pages_remote(tsk, mm, addr, 1,
-   flags, &page, &vma);
+   gup_flags, &page, &vma);
if (ret <= 0) {
 #ifndef CONFIG_HAVE_IOREMAP_PROT
break;
@@ -3945,7 +3942,12 @@ static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 int access_remote_vm(struct mm_struct *mm, unsigned long addr,
void *buf, int len, int write)
 {
-   return __access_remote_vm(NULL, mm, addr, buf, len, write);
+   unsigned int flags = FOLL_FORCE;
+
+   if (write)
+   flags |= FOLL_WRITE;
+
+   return __access_remote_vm(NULL, mm, addr, buf, len, flags);
 }
 
 /*
@@ -3958,12 +3960,17 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr,
 {
struct mm_struct *mm;
int ret;
+   unsigned int flags = FOLL_FORCE;
 
mm = get_task_mm(tsk);
if (!mm)
return 0;
 
-   ret = __access_remote_vm(tsk, mm, addr, buf, len, write);
+   if (write)
+   flags |= FOLL_WRITE;
+
+   ret = __access_remote_vm(tsk, mm, addr, buf, len, flags);
+
mmput(mm);
 
return ret;
diff --git a/mm/nommu.c b/mm/nommu.c
index 70cb844..bde7df3 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1809,9 +1809,10 @@ void filemap_map_pages(struct fault_env *fe,
 EXPORT_SYMBOL(filemap_map_pages);
 
 static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
-   unsigned long addr, void *buf, int len, int write)
+   unsigned long addr, void *buf, int len, unsigned int gup_flags)
 {
struct vm_area_struct *vma;
+   int write = gup_flags & FOLL_WRITE;
 
down_read(&mm->mmap_sem);
 
@@ -1853,7 +1854,8 @@ static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 int access_remote_vm(struct mm_struct *mm, unsigned long addr,
void *buf, int len, int write)
 {
-   return __access_remote_vm(NULL, mm, addr, buf, len, write);
+   return __access_remote_vm(NULL, mm, addr, buf, len,
+   write ? FOLL_WRITE : 0);
 }
 
 /*
@@ -1871,7 +1873,8 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, in
if (!mm)
return 0;
 
-   len = __access_remote_vm(tsk, mm, addr, buf, len, write);
+   len = __access_remote_vm(tsk, mm, addr, buf, len,
+   write ? FOLL_WRITE : 0);
 
mmput(mm);
return len;
-- 
2.10.0
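
The layering this patch sets up can be sketched in userspace C. This is only an illustration under stated assumptions: `inner_access()` and `outer_access()` are hypothetical stand-ins for `__access_remote_vm()` and its `int write` wrappers, and the `FOLL_*` values are modelled on the kernel's definitions rather than taken from this thread. The point is where `FOLL_FORCE` now appears: in the wrapper, where it can be seen and grepped for, instead of inside the helper.

```c
#include <assert.h>

/* Illustrative values modelled on the kernel's FOLL_* bits (assumption). */
#define FOLL_WRITE 0x01u
#define FOLL_FORCE 0x10u

/*
 * Hypothetical stand-in for the inner helper. It takes the flags as
 * given and derives "write" from them, like the patched helper does.
 * It returns the flags it received so they can be inspected.
 */
static unsigned int inner_access(unsigned int gup_flags, int *write_out)
{
	*write_out = (gup_flags & FOLL_WRITE) != 0;
	return gup_flags;
}

/*
 * Hypothetical stand-in for a wrapper that still takes "int write".
 * The FOLL_FORCE that used to be set inside the helper is now added
 * here, in plain sight of anyone reading the call chain.
 */
static unsigned int outer_access(int write, int *write_out)
{
	unsigned int flags = FOLL_FORCE;

	if (write)
		flags |= FOLL_WRITE;

	return inner_access(flags, write_out);
}

/* Small usage example exercising both read and write paths. */
static void example_usage(void)
{
	int write;

	assert(outer_access(0, &write) == FOLL_FORCE);
	assert(write == 0);
	assert(outer_access(1, &write) == (FOLL_FORCE | FOLL_WRITE));
	assert(write == 1);
}
```

Note that the mm/nommu.c wrappers in the diff above pass `write ? FOLL_WRITE : 0` and do not add `FOLL_FORCE`, so the sketch corresponds to the mm/memory.c side only.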

[PATCH 09/10] mm: replace access_remote_vm() write parameter with gup_flags

2016-10-12 Thread Lorenzo Stoakes

This patch removes the write parameter from access_remote_vm() and replaces it
with a gup_flags parameter as use of this function previously _implied_
FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

We make this explicit as use of FOLL_FORCE can result in surprising behaviour
(and hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes 
---
 fs/proc/base.c | 19 +--
 include/linux/mm.h |  2 +-
 mm/memory.c| 11 +++
 mm/nommu.c |  7 +++
 4 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index c2964d8..8e65446 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -252,7 +252,7 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
 * Inherently racy -- command line shares address space
 * with code and data.
 */
-   rv = access_remote_vm(mm, arg_end - 1, , 1, 0);
+   rv = access_remote_vm(mm, arg_end - 1, , 1, FOLL_FORCE);
if (rv <= 0)
goto out_free_page;
 
@@ -270,7 +270,8 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
int nr_read;
 
_count = min3(count, len, PAGE_SIZE);
-   nr_read = access_remote_vm(mm, p, page, _count, 0);
+   nr_read = access_remote_vm(mm, p, page, _count,
+   FOLL_FORCE);
if (nr_read < 0)
rv = nr_read;
if (nr_read <= 0)
@@ -305,7 +306,8 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
bool final;
 
_count = min3(count, len, PAGE_SIZE);
-   nr_read = access_remote_vm(mm, p, page, _count, 0);
+   nr_read = access_remote_vm(mm, p, page, _count,
+   FOLL_FORCE);
if (nr_read < 0)
rv = nr_read;
if (nr_read <= 0)
@@ -354,7 +356,8 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
bool final;
 
_count = min3(count, len, PAGE_SIZE);
-   nr_read = access_remote_vm(mm, p, page, _count, 0);
+   nr_read = access_remote_vm(mm, p, page, _count,
+   FOLL_FORCE);
if (nr_read < 0)
rv = nr_read;
if (nr_read <= 0)
@@ -832,6 +835,7 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
unsigned long addr = *ppos;
ssize_t copied;
char *page;
+   unsigned int flags = FOLL_FORCE;
 
if (!mm)
return 0;
@@ -844,6 +848,9 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
if (!atomic_inc_not_zero(&mm->mm_users))
goto free;
 
+   if (write)
+   flags |= FOLL_WRITE;
+
while (count > 0) {
int this_len = min_t(int, count, PAGE_SIZE);
 
@@ -852,7 +859,7 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
break;
}
 
-   this_len = access_remote_vm(mm, addr, page, this_len, write);
+   this_len = access_remote_vm(mm, addr, page, this_len, flags);
if (!this_len) {
if (!copied)
copied = -EIO;
@@ -965,7 +972,7 @@ static ssize_t environ_read(struct file *file, char __user *buf,
this_len = min(max_len, this_len);
 
retval = access_remote_vm(mm, (env_start + src),
-   page, this_len, 0);
+   page, this_len, FOLL_FORCE);
 
if (retval <= 0) {
ret = retval;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2a481d3..3e5234e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1268,7 +1268,7 @@ static inline int fixup_user_fault(struct task_struct *tsk,
 
 extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
 extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
-   void *buf, int len, int write);
+   void *buf, int len, unsigned int gup_flags);
 
 long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
  unsigned long start, unsigned long nr_pages,
diff --git a/mm/memory.c b/mm/memory.c
index 79ebed3..bac2d99 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3935,19 +3935,14 @@ static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
  * @addr:  start address to access
  * @buf:   source or destination buffer
  * @len:   number of bytes to transfer
- * @write:

[PATCH 07/10] mm: replace get_user_pages_remote() write/force parameters with gup_flags

2016-10-12 Thread Lorenzo Stoakes

This patch removes the write and force parameters from get_user_pages_remote()
and replaces them with a gup_flags parameter to make the use of FOLL_FORCE
explicit in callers as use of this flag can result in surprising behaviour (and
hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes 
---
 drivers/gpu/drm/etnaviv/etnaviv_gem.c   |  7 +--
 drivers/gpu/drm/i915/i915_gem_userptr.c |  6 +-
 drivers/infiniband/core/umem_odp.c  |  7 +--
 fs/exec.c   |  9 +++--
 include/linux/mm.h  |  2 +-
 kernel/events/uprobes.c |  6 --
 mm/gup.c| 22 +++---
 mm/memory.c |  6 +-
 security/tomoyo/domain.c|  2 +-
 9 files changed, 40 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
index 5ce3603..0370b84 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
@@ -748,19 +748,22 @@ static struct page **etnaviv_gem_userptr_do_get_pages(
int ret = 0, pinned, npages = etnaviv_obj->base.size >> PAGE_SHIFT;
struct page **pvec;
uintptr_t ptr;
+   unsigned int flags = 0;
 
pvec = drm_malloc_ab(npages, sizeof(struct page *));
if (!pvec)
return ERR_PTR(-ENOMEM);
 
+   if (!etnaviv_obj->userptr.ro)
+   flags |= FOLL_WRITE;
+
pinned = 0;
ptr = etnaviv_obj->userptr.ptr;
 
down_read(&mm->mmap_sem);
while (pinned < npages) {
ret = get_user_pages_remote(task, mm, ptr, npages - pinned,
-   !etnaviv_obj->userptr.ro, 0,
-   pvec + pinned, NULL);
+   flags, pvec + pinned, NULL);
if (ret < 0)
break;
 
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index e537930..c6f780f 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -508,6 +508,10 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
pvec = drm_malloc_gfp(npages, sizeof(struct page *), GFP_TEMPORARY);
if (pvec != NULL) {
struct mm_struct *mm = obj->userptr.mm->mm;
+   unsigned int flags = 0;
+
+   if (!obj->userptr.read_only)
+   flags |= FOLL_WRITE;
 
ret = -EFAULT;
if (atomic_inc_not_zero(&mm->mm_users)) {
@@ -517,7 +521,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
(work->task, mm,
 obj->userptr.ptr + pinned * PAGE_SIZE,
 npages - pinned,
-!obj->userptr.read_only, 0,
+flags,
 pvec + pinned, NULL);
if (ret < 0)
break;
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 75077a0..1f0fe32 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -527,6 +527,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 user_virt, u64 bcnt,
u64 off;
int j, k, ret = 0, start_idx, npages = 0;
u64 base_virt_addr;
+   unsigned int flags = 0;
 
if (access_mask == 0)
return -EINVAL;
@@ -556,6 +557,9 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 user_virt, u64 bcnt,
goto out_put_task;
}
 
+   if (access_mask & ODP_WRITE_ALLOWED_BIT)
+   flags |= FOLL_WRITE;
+
start_idx = (user_virt - ib_umem_start(umem)) >> PAGE_SHIFT;
k = start_idx;
 
@@ -574,8 +578,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 user_virt, u64 bcnt,
 */
npages = get_user_pages_remote(owning_process, owning_mm,
user_virt, gup_num_pages,
-   access_mask & ODP_WRITE_ALLOWED_BIT,
-   0, local_page_list, NULL);
+   flags, local_page_list, NULL);
up_read(&owning_mm->mmap_sem);
 
if (npages < 0)
diff --git a/fs/exec.c b/fs/exec.c
index 6fcfb3f..4e497b9 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -191,6 +191,7 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 {
struct page *page;
int ret;
+   unsigned int gup_flags = FOLL_FORCE;
 
 #ifdef CONFIG_STACK_GROWSUP
if (write) {
@@ -199,12 +200,16 @@ static struct page *get_arg_page(struct linux_binprm 
*bprm, unsigned

Re: [PATCH V2 net-next] net/mlx5: Add MLX5_ARRAY_SET64 to fix BUILD_BUG_ON

2016-10-12 Thread David Miller


I've been travelling and will get to this patch when I get to it.

Re: [PATCH v6] net: ip, diag -- Add diag interface for raw sockets

2016-10-12 Thread David Miller

From: Cyrill Gorcunov 
Date: Wed, 12 Oct 2016 09:53:29 +0300

> I can't rename the field, nor can I use a union.

Remind me again what is wrong with using an anonymous union?

RE: [PATCH] xen-netback: fix type mismatch warning

2016-10-12 Thread Paul Durrant

> -----Original Message-----
> From: Arnd Bergmann [mailto:a...@arndb.de]
> Sent: 12 October 2016 10:54
> To: Wei Liu ; Paul Durrant 
> Cc: Arnd Bergmann ; David S. Miller
> ; David Vrabel ; xen-
> de...@lists.xenproject.org; netdev@vger.kernel.org; linux-
> ker...@vger.kernel.org
> Subject: [PATCH] xen-netback: fix type mismatch warning
> 
> With the latest rework of the xen-netback driver, we get a warning
> on ARM about the types passed into min():
> 
> drivers/net/xen-netback/rx.c: In function 'xenvif_rx_next_chunk':
> include/linux/kernel.h:739:16: error: comparison of distinct pointer types
> lacks a cast [-Werror]
> 
> The reason is that XEN_PAGE_SIZE is not size_t here. There
> is no actual bug, and we can easily avoid the warning using the
> min_t() macro instead of min().
> 
> Fixes: eb1723a29b9a ("xen-netback: refactor guest rx")
> Signed-off-by: Arnd Bergmann 

LGTM

Acked-by: Paul Durrant 

> ---
>  drivers/net/xen-netback/rx.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/xen-netback/rx.c b/drivers/net/xen-netback/rx.c
> index 8e9ade6ccf18..aeb150258c6c 100644
> --- a/drivers/net/xen-netback/rx.c
> +++ b/drivers/net/xen-netback/rx.c
> @@ -337,9 +337,9 @@ static void xenvif_rx_next_chunk(struct xenvif_queue *queue,
>   frag_data += pkt->frag_offset;
>   frag_len -= pkt->frag_offset;
> 
> - chunk_len = min(frag_len, XEN_PAGE_SIZE - offset);
> - chunk_len = min(chunk_len,
> - XEN_PAGE_SIZE - xen_offset_in_page(frag_data));
> + chunk_len = min_t(size_t, frag_len, XEN_PAGE_SIZE - offset);
> + chunk_len = min_t(size_t, chunk_len, XEN_PAGE_SIZE -
> +  xen_offset_in_page(frag_data));
> 
>   pkt->frag_offset += chunk_len;
> 
> --
> 2.9.0
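
The difference between min() and min_t() that the patch relies on can be sketched in userspace GNU C. Everything here is an assumption made for illustration: the macros are simplified versions written for this example (they need GCC or Clang for `__typeof__` and statement expressions), `DEMO_PAGE_SIZE` stands in for XEN_PAGE_SIZE as an `unsigned long` constant, and `next_chunk_len()` is a hypothetical helper, not the driver's code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified type-checked minimum. Comparing the addresses of the two
 * temporaries makes the compiler complain about "distinct pointer
 * types" when x and y have different types. It is defined here only
 * to show the mechanism and is deliberately left unused.
 */
#define min_checked(x, y) ({			\
	__typeof__(x) _min1 = (x);		\
	__typeof__(y) _min2 = (y);		\
	(void) (&_min1 == &_min2);		\
	_min1 < _min2 ? _min1 : _min2; })

/* Simplified min_t(): both operands are converted to one named type. */
#define min_t(type, x, y) ({			\
	type _min1 = (x);			\
	type _min2 = (y);			\
	_min1 < _min2 ? _min1 : _min2; })

/* Stand-in for a page-size constant that is unsigned long, not size_t. */
#define DEMO_PAGE_SIZE 4096UL

/* Hypothetical helper: clamp a fragment length to the room left in a page. */
static size_t next_chunk_len(size_t frag_len, unsigned long offset)
{
	return min_t(size_t, frag_len, DEMO_PAGE_SIZE - offset);
}

/* Small usage example. */
static void example_usage(void)
{
	assert(next_chunk_len(100, 0) == 100);
	assert(next_chunk_len(5000, 96) == 4000);
}
```

On a target where `size_t` and `unsigned long` are different types, calling `min_checked()` with one operand of each would trigger the pointer-type diagnostic, while `min_t(size_t, ...)` converts both operands first and compiles cleanly.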

[PATCH 00/10] mm: adjust get_user_pages* functions to explicitly pass FOLL_* flags

2016-10-12 Thread Lorenzo Stoakes

This patch series adjusts functions in the get_user_pages* family such that
desired FOLL_* flags are passed as an argument rather than implied by
write/force parameters or by the function itself.

The purpose of this change is to make the use of FOLL_FORCE explicit so it is
easier to grep for and clearer to callers that this flag is being used. The use
of FOLL_FORCE is an issue as it overrides missing VM_READ/VM_WRITE flags for the
VMA whose pages we are reading from/writing to, which can result in surprising
behaviour.

The patch series came out of the discussion around commit 38e0885, which
addressed a BUG_ON() being triggered when a page was faulted in with PROT_NONE
set but having been overridden by FOLL_FORCE. do_numa_page() was run on the
assumption the page _must_ be one marked for NUMA node migration as an actual
PROT_NONE page would have been dealt with prior to this code path, however
FOLL_FORCE introduced a situation where this assumption did not hold.

See https://marc.info/?l=linux-mm&m=147585445805166 for the patch proposal.
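
As a rough illustration of the conversion the series performs at each call site, the old pair of integer arguments maps onto flag bits as sketched below. The helper name `legacy_to_gup_flags()` is hypothetical and exists only for this example, and the `FOLL_*` values are modelled on the kernel's definitions and should be treated as assumptions.

```c
#include <assert.h>

/* Illustrative values modelled on the kernel's FOLL_* bits (assumption). */
#define FOLL_WRITE 0x01u	/* caller intends to write to the pages */
#define FOLL_FORCE 0x10u	/* override missing VM_READ/VM_WRITE on the VMA */

/*
 * Hypothetical helper: the translation that a (write, force) argument
 * pair used to undergo inside the callee. After the series, callers
 * build the flag word themselves and pass it directly.
 */
static unsigned int legacy_to_gup_flags(int write, int force)
{
	unsigned int gup_flags = 0;

	if (write)
		gup_flags |= FOLL_WRITE;
	if (force)
		gup_flags |= FOLL_FORCE;

	return gup_flags;
}

/* Small usage example covering the four combinations. */
static void example_usage(void)
{
	assert(legacy_to_gup_flags(0, 0) == 0);
	assert(legacy_to_gup_flags(1, 0) == FOLL_WRITE);
	assert(legacy_to_gup_flags(0, 1) == FOLL_FORCE);
	assert(legacy_to_gup_flags(1, 1) == (FOLL_WRITE | FOLL_FORCE));
}
```

A call that used to end in `..., 1, 1, pages, NULL)` says nothing about forcing access until the prototype is consulted; the converted form `..., FOLL_WRITE | FOLL_FORCE, pages, NULL)` spells it out and can be found with grep.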

Lorenzo Stoakes (10):
  mm: remove write/force parameters from __get_user_pages_locked()
  mm: remove write/force parameters from __get_user_pages_unlocked()
  mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
  mm: replace get_user_pages_locked() write/force parameters with gup_flags
  mm: replace get_vaddr_frames() write/force parameters with gup_flags
  mm: replace get_user_pages() write/force parameters with gup_flags
  mm: replace get_user_pages_remote() write/force parameters with gup_flags
  mm: replace __access_remote_vm() write parameter with gup_flags
  mm: replace access_remote_vm() write parameter with gup_flags
  mm: replace access_process_vm() write parameter with gup_flags

 arch/alpha/kernel/ptrace.c |  9 ++--
 arch/blackfin/kernel/ptrace.c  |  5 ++-
 arch/cris/arch-v32/drivers/cryptocop.c |  4 +-
 arch/cris/arch-v32/kernel/ptrace.c |  4 +-
 arch/ia64/kernel/err_inject.c  |  2 +-
 arch/ia64/kernel/ptrace.c  | 14 +++---
 arch/m32r/kernel/ptrace.c  | 15 ---
 arch/mips/kernel/ptrace32.c|  5 ++-
 arch/mips/mm/gup.c |  2 +-
 arch/powerpc/kernel/ptrace32.c |  5 ++-
 arch/s390/mm/gup.c |  3 +-
 arch/score/kernel/ptrace.c | 10 +++--
 arch/sh/mm/gup.c   |  3 +-
 arch/sparc/kernel/ptrace_64.c  | 24 +++
 arch/sparc/mm/gup.c|  3 +-
 arch/x86/kernel/step.c |  3 +-
 arch/x86/mm/gup.c  |  2 +-
 arch/x86/mm/mpx.c  |  5 +--
 arch/x86/um/ptrace_32.c|  3 +-
 arch/x86/um/ptrace_64.c|  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c|  7 ++-
 drivers/gpu/drm/etnaviv/etnaviv_gem.c  |  7 ++-
 drivers/gpu/drm/exynos/exynos_drm_g2d.c|  3 +-
 drivers/gpu/drm/i915/i915_gem_userptr.c|  6 ++-
 drivers/gpu/drm/radeon/radeon_ttm.c|  3 +-
 drivers/gpu/drm/via/via_dmablit.c  |  4 +-
 drivers/infiniband/core/umem.c |  6 ++-
 drivers/infiniband/core/umem_odp.c |  7 ++-
 drivers/infiniband/hw/mthca/mthca_memfree.c|  2 +-
 drivers/infiniband/hw/qib/qib_user_pages.c |  3 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c   |  5 ++-
 drivers/media/pci/ivtv/ivtv-udma.c |  4 +-
 drivers/media/pci/ivtv/ivtv-yuv.c  |  5 ++-
 drivers/media/platform/omap/omap_vout.c|  2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c  |  7 ++-
 drivers/media/v4l2-core/videobuf2-memops.c |  6 ++-
 drivers/misc/mic/scif/scif_rma.c   |  3 +-
 drivers/misc/sgi-gru/grufault.c|  2 +-
 drivers/platform/goldfish/goldfish_pipe.c  |  3 +-
 drivers/rapidio/devices/rio_mport_cdev.c   |  3 +-
 drivers/scsi/st.c  |  5 +--
 .../interface/vchiq_arm/vchiq_2835_arm.c   |  3 +-
 .../vc04_services/interface/vchiq_arm/vchiq_arm.c  |  3 +-
 drivers/video/fbdev/pvr2fb.c   |  4 +-
 drivers/virt/fsl_hypervisor.c  |  4 +-
 fs/exec.c  |  9 +++-
 fs/proc/base.c | 19 +---
 include/linux/mm.h | 18 
 kernel/events/uprobes.c|  6 ++-
 kernel/ptrace.c| 16 ---
 mm/frame_vector.c  |  9 ++--
 mm/gup.c   | 50 ++
 mm/memory.c| 16 ---

[PATCH 01/10] mm: remove write/force parameters from __get_user_pages_locked()

2016-10-12 Thread Lorenzo Stoakes

This patch removes the write and force parameters from __get_user_pages_locked()
to make the use of FOLL_FORCE explicit in callers as use of this flag can result
in surprising behaviour (and hence bugs) within the mm subsystem.
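
As a rough user-space illustration of the conversion described above, the sketch below folds the two legacy booleans into a single flags word, which is what each wrapper now does before calling the inner function. The DEMO_* names and bit values are placeholders chosen for this example and are not taken from the kernel headers.

```c
#include <stdbool.h>

/* Placeholder flag bits for illustration only; not the kernel's values. */
enum {
	DEMO_FOLL_WRITE = 1u << 0,
	DEMO_FOLL_TOUCH = 1u << 1,
	DEMO_FOLL_FORCE = 1u << 2,
};

/*
 * Fold the legacy write/force booleans into one flags word, starting
 * from whatever base flags the wrapper already passes.
 */
static unsigned int demo_gup_flags(unsigned int base, bool write, bool force)
{
	unsigned int flags = base;

	if (write)
		flags |= DEMO_FOLL_WRITE;
	if (force)
		flags |= DEMO_FOLL_FORCE;

	return flags;
}
```

With the translation done in the wrapper, the inner function only ever sees flags, so any use of the force bit is visible at the point where it is requested.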

Signed-off-by: Lorenzo Stoakes 
---
 mm/gup.c | 47 +--
 1 file changed, 33 insertions(+), 14 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 96b2b2f..ba83942 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -729,7 +729,6 @@ static __always_inline long __get_user_pages_locked(struct 
task_struct *tsk,
struct mm_struct *mm,
unsigned long start,
unsigned long nr_pages,
-   int write, int force,
struct page **pages,
struct vm_area_struct **vmas,
int *locked, bool notify_drop,
@@ -747,10 +746,6 @@ static __always_inline long __get_user_pages_locked(struct 
task_struct *tsk,
 
if (pages)
flags |= FOLL_GET;
-   if (write)
-   flags |= FOLL_WRITE;
-   if (force)
-   flags |= FOLL_FORCE;
 
pages_done = 0;
lock_dropped = false;
@@ -846,9 +841,15 @@ long get_user_pages_locked(unsigned long start, unsigned 
long nr_pages,
   int write, int force, struct page **pages,
   int *locked)
 {
+   unsigned int flags = FOLL_TOUCH;
+
+   if (write)
+   flags |= FOLL_WRITE;
+   if (force)
+   flags |= FOLL_FORCE;
+
return __get_user_pages_locked(current, current->mm, start, nr_pages,
-  write, force, pages, NULL, locked, true,
-  FOLL_TOUCH);
+  pages, NULL, locked, true, flags);
 }
 EXPORT_SYMBOL(get_user_pages_locked);
 
@@ -869,9 +870,15 @@ __always_inline long __get_user_pages_unlocked(struct 
task_struct *tsk, struct m
 {
long ret;
int locked = 1;
+
+   if (write)
+   gup_flags |= FOLL_WRITE;
+   if (force)
+   gup_flags |= FOLL_FORCE;
+
down_read(&mm->mmap_sem);
-   ret = __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
- pages, NULL, &locked, false, gup_flags);
+   ret = __get_user_pages_locked(tsk, mm, start, nr_pages, pages, NULL,
+ &locked, false, gup_flags);
if (locked)
up_read(&mm->mmap_sem);
return ret;
@@ -963,9 +970,15 @@ long get_user_pages_remote(struct task_struct *tsk, struct 
mm_struct *mm,
int write, int force, struct page **pages,
struct vm_area_struct **vmas)
 {
-   return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
-  pages, vmas, NULL, false,
-  FOLL_TOUCH | FOLL_REMOTE);
+   unsigned int flags = FOLL_TOUCH | FOLL_REMOTE;
+
+   if (write)
+   flags |= FOLL_WRITE;
+   if (force)
+   flags |= FOLL_FORCE;
+
+   return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
+  NULL, false, flags);
 }
 EXPORT_SYMBOL(get_user_pages_remote);
 
@@ -979,9 +992,15 @@ long get_user_pages(unsigned long start, unsigned long 
nr_pages,
int write, int force, struct page **pages,
struct vm_area_struct **vmas)
 {
+   unsigned int flags = FOLL_TOUCH;
+
+   if (write)
+   flags |= FOLL_WRITE;
+   if (force)
+   flags |= FOLL_FORCE;
+
return __get_user_pages_locked(current, current->mm, start, nr_pages,
-  write, force, pages, vmas, NULL, false,
-  FOLL_TOUCH);
+  pages, vmas, NULL, false, flags);
 }
 EXPORT_SYMBOL(get_user_pages);
 
-- 
2.10.0

[PATCH 03/10] mm: replace get_user_pages_unlocked() write/force parameters with gup_flags

2016-10-12 Thread Lorenzo Stoakes

This patch removes the write and force parameters from get_user_pages_unlocked()
and replaces them with a gup_flags parameter to make the use of FOLL_FORCE
explicit in callers as use of this flag can result in surprising behaviour (and
hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes 
---
 arch/mips/mm/gup.c |  2 +-
 arch/s390/mm/gup.c |  3 ++-
 arch/sh/mm/gup.c   |  3 ++-
 arch/sparc/mm/gup.c|  3 ++-
 arch/x86/mm/gup.c  |  2 +-
 drivers/media/pci/ivtv/ivtv-udma.c |  4 ++--
 drivers/media/pci/ivtv/ivtv-yuv.c  |  5 +++--
 drivers/scsi/st.c  |  5 ++---
 drivers/video/fbdev/pvr2fb.c   |  4 ++--
 include/linux/mm.h |  2 +-
 mm/gup.c   | 14 --
 mm/nommu.c | 11 ++-
 mm/util.c  |  3 ++-
 net/ceph/pagevec.c |  2 +-
 14 files changed, 27 insertions(+), 36 deletions(-)

diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 42d124f..d8c3c15 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -287,7 +287,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, 
int write,
pages += nr;
 
ret = get_user_pages_unlocked(start, (end - start) >> PAGE_SHIFT,
- write, 0, pages);
+ pages, write ? FOLL_WRITE : 0);
 
/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index adb0c34..18d4107 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -266,7 +266,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, 
int write,
/* Try to get the remaining pages with get_user_pages */
start += nr << PAGE_SHIFT;
pages += nr;
-   ret = get_user_pages_unlocked(start, nr_pages - nr, write, 0, pages);
+   ret = get_user_pages_unlocked(start, nr_pages - nr, pages,
+ write ? FOLL_WRITE : 0);
/* Have to be a bit careful with return values */
if (nr > 0)
ret = (ret < 0) ? nr : ret + nr;
diff --git a/arch/sh/mm/gup.c b/arch/sh/mm/gup.c
index 40fa6c8..063c298 100644
--- a/arch/sh/mm/gup.c
+++ b/arch/sh/mm/gup.c
@@ -258,7 +258,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, 
int write,
pages += nr;
 
ret = get_user_pages_unlocked(start,
-   (end - start) >> PAGE_SHIFT, write, 0, pages);
+   (end - start) >> PAGE_SHIFT, pages,
+   write ? FOLL_WRITE : 0);
 
/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 4e06750..cd0e32b 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -238,7 +238,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, 
int write,
pages += nr;
 
ret = get_user_pages_unlocked(start,
-   (end - start) >> PAGE_SHIFT, write, 0, pages);
+   (end - start) >> PAGE_SHIFT, pages,
+   write ? FOLL_WRITE : 0);
 
/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index b8b6a60..0d4fb3e 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -435,7 +435,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, 
int write,
 
ret = get_user_pages_unlocked(start,
  (end - start) >> PAGE_SHIFT,
- write, 0, pages);
+ pages, write ? FOLL_WRITE : 0);
 
/* Have to be a bit careful with return values */
if (nr > 0) {
diff --git a/drivers/media/pci/ivtv/ivtv-udma.c 
b/drivers/media/pci/ivtv/ivtv-udma.c
index 4769469..2c9232e 100644
--- a/drivers/media/pci/ivtv/ivtv-udma.c
+++ b/drivers/media/pci/ivtv/ivtv-udma.c
@@ -124,8 +124,8 @@ int ivtv_udma_setup(struct ivtv *itv, unsigned long 
ivtv_dest_addr,
}
 
/* Get user pages for DMA Xfer */
-   err = get_user_pages_unlocked(user_dma.uaddr, user_dma.page_count, 0,
-   1, dma->map);
+   err = get_user_pages_unlocked(user_dma.uaddr, user_dma.page_count,
+   dma->map, FOLL_FORCE);
 
if (user_dma.page_count != err) {
IVTV_DEBUG_WARN("failed to map user pages, returned %d instead 
of %d\n",
diff --git a/drivers/media/pci/ivtv/ivtv-yuv.c 
b/drivers/media/pci/ivtv/ivtv-yuv.c
index b094054..f7299d3 100644
--- a/drivers/media/pci/ivtv/ivtv-yuv.c
+++ b/drivers/media/pci/ivtv/ivtv-yuv.c
@@ -76,11 +76,12 @@ static int ivtv_yuv_prep_user_dma(struct ivtv *itv, struct

[PATCH 02/10] mm: remove write/force parameters from __get_user_pages_unlocked()

2016-10-12 Thread Lorenzo Stoakes

This patch removes the write and force parameters from
__get_user_pages_unlocked() to make the use of FOLL_FORCE explicit in callers as
use of this flag can result in surprising behaviour (and hence bugs) within the
mm subsystem.

Signed-off-by: Lorenzo Stoakes 
---
 include/linux/mm.h |  3 +--
 mm/gup.c   | 17 +
 mm/nommu.c | 12 +---
 mm/process_vm_access.c |  7 +--
 virt/kvm/async_pf.c|  3 ++-
 virt/kvm/kvm_main.c| 11 ---
 6 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e9caec6..2db98b6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1285,8 +1285,7 @@ long get_user_pages_locked(unsigned long start, unsigned 
long nr_pages,
int write, int force, struct page **pages, int *locked);
 long __get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
   unsigned long start, unsigned long nr_pages,
-  int write, int force, struct page **pages,
-  unsigned int gup_flags);
+  struct page **pages, unsigned int gup_flags);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
int write, int force, struct page **pages);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
diff --git a/mm/gup.c b/mm/gup.c
index ba83942..3d620dd 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -865,17 +865,11 @@ EXPORT_SYMBOL(get_user_pages_locked);
  */
 __always_inline long __get_user_pages_unlocked(struct task_struct *tsk, struct 
mm_struct *mm,
   unsigned long start, unsigned 
long nr_pages,
-  int write, int force, struct 
page **pages,
-  unsigned int gup_flags)
+  struct page **pages, unsigned 
int gup_flags)
 {
long ret;
int locked = 1;
 
-   if (write)
-   gup_flags |= FOLL_WRITE;
-   if (force)
-   gup_flags |= FOLL_FORCE;
-
down_read(&mm->mmap_sem);
ret = __get_user_pages_locked(tsk, mm, start, nr_pages, pages, NULL,
  &locked, false, gup_flags);
@@ -905,8 +899,15 @@ EXPORT_SYMBOL(__get_user_pages_unlocked);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 int write, int force, struct page **pages)
 {
+   unsigned int flags = FOLL_TOUCH;
+
+   if (write)
+   flags |= FOLL_WRITE;
+   if (force)
+   flags |= FOLL_FORCE;
+
return __get_user_pages_unlocked(current, current->mm, start, nr_pages,
-write, force, pages, FOLL_TOUCH);
+pages, flags);
 }
 EXPORT_SYMBOL(get_user_pages_unlocked);
 
diff --git a/mm/nommu.c b/mm/nommu.c
index 95daf81..925dcc1 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -185,8 +185,7 @@ EXPORT_SYMBOL(get_user_pages_locked);
 
 long __get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
   unsigned long start, unsigned long nr_pages,
-  int write, int force, struct page **pages,
-  unsigned int gup_flags)
+  struct page **pages, unsigned int gup_flags)
 {
long ret;
down_read(&mm->mmap_sem);
@@ -200,8 +199,15 @@ EXPORT_SYMBOL(__get_user_pages_unlocked);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 int write, int force, struct page **pages)
 {
+   unsigned int flags = 0;
+
+   if (write)
+   flags |= FOLL_WRITE;
+   if (force)
+   flags |= FOLL_FORCE;
+
return __get_user_pages_unlocked(current, current->mm, start, nr_pages,
-write, force, pages, 0);
+pages, flags);
 }
 EXPORT_SYMBOL(get_user_pages_unlocked);
 
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 07514d4..be8dc8d 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -88,12 +88,16 @@ static int process_vm_rw_single_vec(unsigned long addr,
ssize_t rc = 0;
unsigned long max_pages_per_loop = PVM_MAX_KMALLOC_PAGES
/ sizeof(struct pages *);
+   unsigned int flags = FOLL_REMOTE;
 
/* Work out address and page range required */
if (len == 0)
return 0;
nr_pages = (addr + len - 1) / PAGE_SIZE - addr / PAGE_SIZE + 1;
 
+   if (vm_write)
+   flags |= FOLL_WRITE;
+
while (!rc && nr_pages && iov_iter_count(iter)) {
int pages = min(nr_pages, max_pages_per_loop);
size_t bytes;
@@ -104,8

[PATCH net-next 4/5] udp: UDP tunnel flow dissection infrastructure

2016-10-12 Thread Tom Herbert

Add infrastructure to allow UDP tunnels to set up flow dissection.

Signed-off-by: Tom Herbert 
---
 include/net/udp_tunnel.h | 5 +
 net/ipv4/udp_tunnel.c| 5 +
 2 files changed, 10 insertions(+)

diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index 02c5be0..81d2584 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -69,6 +69,10 @@ typedef struct sk_buff **(*udp_tunnel_gro_receive_t)(struct 
sock *sk,
 struct sk_buff *skb);
 typedef int (*udp_tunnel_gro_complete_t)(struct sock *sk, struct sk_buff *skb,
 int nhoff);
+typedef int (*udp_tunnel_flow_dissect_t)(struct sock *sk,
+const struct sk_buff *skb,
+void *data, int hlen, int *nhoff,
+u8 *ip_proto, __be16 *proto);
 
 struct udp_tunnel_sock_cfg {
void *sk_user_data; /* user data used by encap_rcv call back */
@@ -78,6 +82,7 @@ struct udp_tunnel_sock_cfg {
udp_tunnel_encap_destroy_t encap_destroy;
udp_tunnel_gro_receive_t gro_receive;
udp_tunnel_gro_complete_t gro_complete;
+   udp_tunnel_flow_dissect_t flow_dissect;
 };
 
 /* Setup the given (UDP) sock to receive UDP encapsulated packets */
diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
index 58bd39f..4459288 100644
--- a/net/ipv4/udp_tunnel.c
+++ b/net/ipv4/udp_tunnel.c
@@ -72,6 +72,11 @@ void setup_udp_tunnel_sock(struct net *net, struct socket 
*sock,
udp_sk(sk)->gro_receive = cfg->gro_receive;
udp_sk(sk)->gro_complete = cfg->gro_complete;
 
+   if (cfg->flow_dissect) {
+   udp_sk(sk)->flow_dissect = cfg->flow_dissect;
+   udp_flow_dissect_enable();
+   }
+
udp_tunnel_encap_enable(sock);
 }
 EXPORT_SYMBOL_GPL(setup_udp_tunnel_sock);
-- 
2.9.3

[PATCH net-next 3/5] udp: Add UDP flow dissection functions to IPv4 and IPv6

2016-10-12 Thread Tom Herbert

Add per protocol offload callbacks for flow_dissect to UDP for
IPv4 and IPv6. Each callback extracts the port numbers and, together
with the packet addresses (passed in an argument of type
flow_dissector_key_addrs), uses them to look up the UDP socket.
If a socket is found and flow_dissect is set for that socket,
the socket's function is called.
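
The port extraction step can be sketched in user space as a bounds-checked copy of the two 16-bit port fields at the current header offset. This is only an illustration: demo_read_ports() is a hypothetical helper written for this example, not a kernel function, and it returns -1 where the patch would report a bad packet.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for the bounds-checked header access in the
 * patch: copy the source and destination port fields found at offset
 * nhoff, or report failure when the buffer is too short.
 */
static int demo_read_ports(const uint8_t *data, size_t len, size_t nhoff,
			   uint16_t ports[2])
{
	if (nhoff > len || len - nhoff < 2 * sizeof(uint16_t))
		return -1;	/* corresponds to the "bad packet" outcome */

	/* memcpy avoids unaligned loads; values stay in network byte order */
	memcpy(ports, data + nhoff, 2 * sizeof(uint16_t));
	return 0;
}
```

The ports are left in network byte order, which is the form a socket lookup keyed on __be16 values would expect.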

Signed-off-by: Tom Herbert 
---
 net/ipv4/udp_offload.c | 39 +++
 net/ipv6/udp_offload.c | 38 ++
 2 files changed, 77 insertions(+)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index f9333c9..c7753ba 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -377,11 +377,50 @@ static int udp4_gro_complete(struct sk_buff *skb, int 
nhoff)
return udp_gro_complete(skb, nhoff, udp4_lib_lookup_skb);
 }
 
+/* Assumes rcu lock is held */
+static int udp4_flow_dissect(const struct sk_buff *skb, void *data, int hlen,
+int *nhoff, u8 *ip_proto, __be16 *proto,
+struct flow_dissector_key_addrs *key_addrs)
+{
+   u16 _ports[2], *ports;
+   struct net *net;
+   struct sock *sk;
+   int dif = -1;
+
+   /* See if there is a flow dissector in the UDP socket */
+
+   if (skb->dev) {
+   net = dev_net(skb->dev);
+   dif = skb->dev->ifindex;
+   } else if (skb->sk) {
+   net = sock_net(skb->sk);
+   } else {
+   return FLOW_DIS_RET_PASS;
+   }
+
+   ports = __skb_header_pointer(skb, *nhoff, sizeof(_ports),
+data, hlen, &_ports);
+   if (!ports)
+   return FLOW_DIS_RET_BAD;
+
+   sk = udp4_lib_lookup_noref(net,
+  key_addrs->v4addrs.src, ports[0],
+  key_addrs->v4addrs.dst, ports[1],
+  dif);
+
+   if (sk && udp_sk(sk)->flow_dissect)
+   return udp_sk(sk)->flow_dissect(sk, skb, data, hlen, nhoff,
+   ip_proto, proto);
+   else
+   return FLOW_DIS_RET_PASS;
+}
+
 static const struct net_offload udpv4_offload = {
.callbacks = {
.gso_segment = udp4_ufo_fragment,
.gro_receive  = udp4_gro_receive,
.gro_complete = udp4_gro_complete,
+   .flow_dissect = udp4_flow_dissect,
},
 };
 
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index ac858c4..b3f4a6c 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -163,11 +163,49 @@ static int udp6_gro_complete(struct sk_buff *skb, int 
nhoff)
return udp_gro_complete(skb, nhoff, udp6_lib_lookup_skb);
 }
 
+/* Assumes rcu lock is held */
+static int udp6_flow_dissect(const struct sk_buff *skb, void *data, int hlen,
+int *nhoff, u8 *ip_proto, __be16 *proto,
+const struct flow_dissector_key_addrs *key_addrs)
+{
+   u16 _ports[2], *ports;
+   struct net *net;
+   struct sock *sk;
+   int dif = -1;
+
+   /* See if there is a flow dissector in the UDP socket */
+
+   if (skb->dev) {
+   net = dev_net(skb->dev);
+   dif = skb->dev->ifindex;
+   } else if (skb->sk) {
+   net = sock_net(skb->sk);
+   } else {
+   return FLOW_DIS_RET_PASS;
+   }
+
+   ports = __skb_header_pointer(skb, *nhoff, sizeof(_ports),
+data, hlen, &_ports);
+   if (!ports)
+   return FLOW_DIS_RET_BAD;
+
+   sk = udp6_lib_lookup_noref(net,
+  &key_addrs->v6addrs.src, ports[0],
+  &key_addrs->v6addrs.dst, ports[1],
+  dif);
+
+   if (sk && udp_sk(sk)->flow_dissect)
+   return udp_sk(sk)->flow_dissect(sk, skb, data, hlen, nhoff,
+   ip_proto, proto);
+   return FLOW_DIS_RET_PASS;
+}
+
 static const struct net_offload udpv6_offload = {
.callbacks = {
.gso_segment=   udp6_ufo_fragment,
.gro_receive=   udp6_gro_receive,
.gro_complete   =   udp6_gro_complete,
+   .flow_dissect   =   udp6_flow_dissect,
},
 };
 
-- 
2.9.3

[PATCH net-next 0/5] udp: Flow dissection for tunnels

2016-10-12 Thread Tom Herbert

Now that we have a means to perform a UDP socket lookup without taking
a reference, it is feasible to have flow dissector crack open UDP
encapsulated packets. Generally, we would expect that the UDP source
port or the flow label in IPv6 would contain enough entropy about
the encapsulated flow. However, there will be cases, such as a static
UDP tunnel with fixed ports, where dissecting the encapsulated packet
is valuable.

The model here is similar to that implemented for UDP GRO. A
tunnel implementation (e.g. GUE) may set a flow_dissect function
in the udp_sk. In __skb_flow_dissect a case has been added for
UDP to check if there is a socket with flow_dissect set. If there
is, the function is called. The (per tunnel implementation)
function can parse the encapsulation headers and return the
next protocol for __skb_flow_dissect to process and its position
in nhoff.
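
The callback model can be sketched in user space as an optional function pointer stored in the socket. All demo_* names below are hypothetical, and the argument list is reduced compared with the callback in the series; the point is only that the generic code calls the tunnel's function when one is registered and passes otherwise.

```c
#include <stddef.h>

enum demo_ret {
	DEMO_RET_PASS = 0,
	DEMO_RET_BAD,
	DEMO_RET_IPPROTO,
	DEMO_RET_PROTO,
};

struct demo_sock;
typedef int (*demo_dissect_fn)(struct demo_sock *sk, int *nhoff,
			       unsigned char *ip_proto);

struct demo_sock {
	demo_dissect_fn flow_dissect;	/* NULL unless the tunnel opted in */
	unsigned char protocol;		/* inner protocol for a FOU-like tunnel */
};

/* Example per-tunnel callback: skip an 8-byte UDP header, report inner proto. */
static int demo_fou_dissect(struct demo_sock *sk, int *nhoff,
			    unsigned char *ip_proto)
{
	*ip_proto = sk->protocol;
	*nhoff += 8;
	return DEMO_RET_IPPROTO;
}

/* Generic side: call the socket's callback only when one was registered. */
static int demo_udp_dissect(struct demo_sock *sk, int *nhoff,
			    unsigned char *ip_proto)
{
	if (sk && sk->flow_dissect)
		return sk->flow_dissect(sk, nhoff, ip_proto);

	return DEMO_RET_PASS;
}
```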

Since performing a UDP lookup on every packet might be expensive
I added a static key check to bypass the lookup if there are no
sockets with flow_dissect set. I should mention that doing the
lookup wasn't particularly a big hit anyway.
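
The bypass can be modelled in user space with a plain counter, as in the sketch below. This only mimics the behaviour: a real static key patches the branch at run time rather than testing a variable, and every demo_* name here is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the static key: counts sockets that opted in. */
static int demo_dissect_users;
/* Counts how often the (placeholder) costly lookup actually ran. */
static int demo_lookups;

static void demo_dissect_enable(void)
{
	demo_dissect_users++;
}

static void demo_dissect_disable(void)
{
	demo_dissect_users--;
}

/* Returns true when the lookup was performed, false when it was bypassed. */
static bool demo_maybe_lookup(void)
{
	if (demo_dissect_users == 0)
		return false;	/* fast path: nobody asked for dissection */

	demo_lookups++;		/* placeholder for the socket lookup */
	return true;
}
```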

Fou/gue was modified to perform tunnel dissection. This is enabled
on each listener socket via a netlink configuration option.

Tested:

Running 200 streams with TCP_RR.

GRE/GUE variable source port (baseline)
RSS distributes packets, RFS is effective
1211702 tps
147/241/442 50/90/99% latencies
87.95 CPU utilization

GRE/GUE fixed source port
All packets to one CPU, RFS is ineffective
173680 tps
1170/1377/1853 50/90/99% latencies
7.42 CPU utilization

GRE/GUE fixed source port with deep hash enabled
All packets to one CPU, but now RFS is effective
730359 tps
263/325/464 50/90/99% latencies
38.25% CPU utilization (Interrupting CPU is maxed out)


Tom Herbert (5):
  udp: Add socket lookup functions with noref
  udp: UDP flow dissector
  udp: Add UDP flow dissection functions to IPv4 and IPv6
  udp: UDP tunnel flow dissection infrastructure
  fou: Support flow dissection

 include/linux/netdevice.h|  5 +++
 include/linux/udp.h  |  7 +
 include/net/flow_dissector.h |  8 +
 include/net/udp.h| 12 
 include/net/udp_tunnel.h |  5 +++
 include/uapi/linux/fou.h |  1 +
 net/core/flow_dissector.c| 73 ++--
 net/ipv4/fou.c   | 68 -
 net/ipv4/udp.c   | 11 +++
 net/ipv4/udp_offload.c   | 39 +++
 net/ipv4/udp_tunnel.c|  5 +++
 net/ipv6/udp.c   | 10 ++
 net/ipv6/udp_offload.c   | 38 +++
 13 files changed, 279 insertions(+), 3 deletions(-)

-- 
2.9.3

[PATCH net-next 1/5] udp: Add socket lookup functions with noref

2016-10-12 Thread Tom Herbert

Create udp4_lib_lookup_noref and udp6_lib_lookup_noref. These perform
a socket lookup on addresses and ports without taking a reference.

Signed-off-by: Tom Herbert 
---
 include/net/udp.h |  8 
 net/ipv4/udp.c|  8 
 net/ipv6/udp.c| 10 ++
 3 files changed, 26 insertions(+)

diff --git a/include/net/udp.h b/include/net/udp.h
index ea53a87..717a972 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -275,6 +275,10 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 
saddr, __be16 sport,
   struct udp_table *tbl, struct sk_buff *skb);
 struct sock *udp4_lib_lookup_skb(struct sk_buff *skb,
 __be16 sport, __be16 dport);
+struct sock *udp4_lib_lookup_noref(struct net *net,
+  __be32 saddr, __be16 sport,
+  __be32 daddr, __be16 dport,
+  int dif);
 struct sock *udp6_lib_lookup(struct net *net,
 const struct in6_addr *saddr, __be16 sport,
 const struct in6_addr *daddr, __be16 dport,
@@ -286,6 +290,10 @@ struct sock *__udp6_lib_lookup(struct net *net,
   struct sk_buff *skb);
 struct sock *udp6_lib_lookup_skb(struct sk_buff *skb,
 __be16 sport, __be16 dport);
+struct sock *udp6_lib_lookup_noref(struct net *net,
+  const struct in6_addr *saddr, __be16 sport,
+  const struct in6_addr *daddr, __be16 dport,
+  int dif);
 
 /*
  * SNMP statistics for UDP and UDP-Lite
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 7d96dc2..7f84c51 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -595,6 +595,14 @@ struct sock *udp4_lib_lookup(struct net *net, __be32 
saddr, __be16 sport,
 EXPORT_SYMBOL_GPL(udp4_lib_lookup);
 #endif
 
+struct sock *udp4_lib_lookup_noref(struct net *net, __be32 saddr, __be16 sport,
+  __be32 daddr, __be16 dport, int dif)
+{
+   return __udp4_lib_lookup(net, saddr, sport, daddr, dport,
+dif, &udp_table, NULL);
+}
+EXPORT_SYMBOL_GPL(udp4_lib_lookup_noref);
+
 static inline bool __udp_is_mcast_sock(struct net *net, struct sock *sk,
   __be16 loc_port, __be32 loc_addr,
   __be16 rmt_port, __be32 rmt_addr,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 9aa7c1c..6e382d9 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -317,6 +317,16 @@ struct sock *udp6_lib_lookup(struct net *net, const struct 
in6_addr *saddr, __be
 EXPORT_SYMBOL_GPL(udp6_lib_lookup);
 #endif
 
+struct sock *udp6_lib_lookup_noref(struct net *net,
+  const struct in6_addr *saddr, __be16 sport,
+  const struct in6_addr *daddr, __be16 dport,
+  int dif)
+{
+   return __udp6_lib_lookup(net, saddr, sport, daddr, dport,
+dif, &udp_table, NULL);
+}
+EXPORT_SYMBOL_GPL(udp6_lib_lookup_noref);
+
 /*
  * This should be easy, if there is something there we
  * return it, otherwise we block.
-- 
2.9.3

[PATCH net-next 5/5] fou: Support flow dissection

2016-10-12 Thread Tom Herbert

This patch performs flow dissection for GUE and FOU. This is an
optional feature on the receiver and is set by the FOU_ATTR_DEEP_HASH
netlink configuration. When enabled, the UDP socket flow_dissect
function is set to fou_flow_dissect or gue_flow_dissect as
appropriate. These functions return FLOW_DIS_RET_IPPROTO and
set the IP protocol argument. In the case of GUE the header is
parsed to find the protocol number.
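
The GUE case can be sketched in user space by parsing raw bytes. The sketch assumes the first GUE byte carries a 2-bit version, a 1-bit control flag and a 5-bit header length (in 32-bit words), and that the second byte carries the protocol; this layout is an assumption made for the example. The demo_* and DEMO_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_UDP_HLEN 8	/* size of a UDP header */
#define DEMO_GUE_HLEN 4	/* fixed part of the GUE header */

enum { DEMO_PASS = 0, DEMO_BAD, DEMO_IPPROTO };

/*
 * data + *nhoff points at the UDP header. On DEMO_IPPROTO, *nhoff has been
 * advanced past the encapsulation and *ip_proto holds the inner protocol.
 */
static int demo_gue_dissect(const uint8_t *data, size_t len, size_t *nhoff,
			    uint8_t *ip_proto)
{
	const uint8_t *gue;
	unsigned int version, control, hlen;

	/* Need the UDP header plus the fixed GUE header in the buffer */
	if (*nhoff > len || len - *nhoff < DEMO_UDP_HLEN + DEMO_GUE_HLEN)
		return DEMO_BAD;

	gue = data + *nhoff + DEMO_UDP_HLEN;
	version = gue[0] >> 6;
	control = (gue[0] >> 5) & 1;
	hlen = gue[0] & 0x1f;

	switch (version) {
	case 0:	/* full GUE header present */
		if (control)
			return DEMO_PASS;
		*nhoff += DEMO_UDP_HLEN + DEMO_GUE_HLEN + (hlen << 2);
		*ip_proto = gue[1];
		return DEMO_IPPROTO;
	case 1:	/* direct encapsulation: first nibble is the IP version */
		switch (gue[0] >> 4) {
		case 4:
			*ip_proto = 4;	/* IPPROTO_IPIP */
			break;
		case 6:
			*ip_proto = 41;	/* IPPROTO_IPV6 */
			break;
		default:
			return DEMO_PASS;
		}
		*nhoff += DEMO_UDP_HLEN;
		return DEMO_IPPROTO;
	default:
		return DEMO_PASS;
	}
}
```

Control messages are passed through untouched because they carry no inner flow to hash on.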

Signed-off-by: Tom Herbert 
---
 include/uapi/linux/fou.h |  1 +
 net/ipv4/fou.c   | 68 +++-
 2 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/fou.h b/include/uapi/linux/fou.h
index d2947c5..2c837eb 100644
--- a/include/uapi/linux/fou.h
+++ b/include/uapi/linux/fou.h
@@ -15,6 +15,7 @@ enum {
FOU_ATTR_IPPROTO,   /* u8 */
FOU_ATTR_TYPE,  /* u8 */
FOU_ATTR_REMCSUM_NOPARTIAL, /* flag */
+   FOU_ATTR_DEEP_HASH, /* flag */
 
__FOU_ATTR_MAX,
 };
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index cf50f7e..95ac5a8 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -27,7 +27,8 @@ struct fou {
struct rcu_head rcu;
 };
 
-#define FOU_F_REMCSUM_NOPARTIAL BIT(0)
+#define FOU_F_REMCSUM_NOPARTIALBIT(0)
+#define FOU_F_DEEP_HASHBIT(1)
 
 struct fou_cfg {
u16 type;
@@ -281,6 +282,16 @@ static int fou_gro_complete(struct sock *sk, struct 
sk_buff *skb,
return err;
 }
 
+static int fou_flow_dissect(struct sock *sk, const struct sk_buff *skb,
+   void *data, int hlen, int *nhoff, u8 *ip_proto,
+   __be16 *proto)
+{
+   *ip_proto = fou_from_sock(sk)->protocol;
+   *nhoff += sizeof(struct udphdr);
+
+   return FLOW_DIS_RET_IPPROTO;
+}
+
 static struct guehdr *gue_gro_remcsum(struct sk_buff *skb, unsigned int off,
  struct guehdr *guehdr, void *data,
  size_t hdrlen, struct gro_remcsum *grc,
@@ -498,6 +509,48 @@ static int gue_gro_complete(struct sock *sk, struct 
sk_buff *skb, int nhoff)
return err;
 }
 
+static int gue_flow_dissect(struct sock *sk, const struct sk_buff *skb,
+   void *data, int hlen, int *nhoff, u8 *ip_proto,
+   __be16 *proto)
+{
+   struct guehdr _hdr, *hdr;
+
+   hdr = __skb_header_pointer(skb, *nhoff + sizeof(struct udphdr),
+  sizeof(_hdr), data, hlen, &_hdr);
+   if (!hdr)
+   return FLOW_DIS_RET_BAD;
+
+   switch (hdr->version) {
+   case 0: /* Full GUE header present */
+   if (hdr->control)
+   return FLOW_DIS_RET_PASS;
+
+   *nhoff += sizeof(struct udphdr) + sizeof(_hdr) +
+ (hdr->hlen << 2);
+   *ip_proto = hdr->proto_ctype;
+
+   return FLOW_DIS_RET_IPPROTO;
+   case 1:
+   /* Direct encapsulation of IPv4 or IPv6 */
+
+   switch (((struct iphdr *)hdr)->version) {
+   case 4:
+   *nhoff += sizeof(struct udphdr);
+   *ip_proto = IPPROTO_IPIP;
+   return FLOW_DIS_RET_IPPROTO;
+   case 6:
+   *nhoff += sizeof(struct udphdr);
+   *ip_proto = IPPROTO_IPV6;
+   return FLOW_DIS_RET_IPPROTO;
+   default:
+   return FLOW_DIS_RET_PASS;
+   }
+
+   default:
+   return FLOW_DIS_RET_PASS;
+   }
+}
+
 static int fou_add_to_port_list(struct net *net, struct fou *fou)
 {
struct fou_net *fn = net_generic(net, fou_net_id);
@@ -568,12 +621,16 @@ static int fou_create(struct net *net, struct fou_cfg 
*cfg,
tunnel_cfg.encap_rcv = fou_udp_recv;
tunnel_cfg.gro_receive = fou_gro_receive;
tunnel_cfg.gro_complete = fou_gro_complete;
+   if (cfg->flags & FOU_F_DEEP_HASH)
+   tunnel_cfg.flow_dissect = fou_flow_dissect;
fou->protocol = cfg->protocol;
break;
case FOU_ENCAP_GUE:
tunnel_cfg.encap_rcv = gue_udp_recv;
tunnel_cfg.gro_receive = gue_gro_receive;
tunnel_cfg.gro_complete = gue_gro_complete;
+   if (cfg->flags & FOU_F_DEEP_HASH)
+   tunnel_cfg.flow_dissect = gue_flow_dissect;
break;
default:
err = -EINVAL;
@@ -637,6 +694,7 @@ static const struct nla_policy fou_nl_policy[FOU_ATTR_MAX + 
1] = {
[FOU_ATTR_IPPROTO] = { .type = NLA_U8, },
[FOU_ATTR_TYPE] = { .type = NLA_U8, },
[FOU_ATTR_REMCSUM_NOPARTIAL] = { .type = NLA_FLAG, },
+   [FOU_ATTR_DEEP_HASH] = { .type = NLA_FLAG },
 };
 
 static int parse_nl_config(struct

[PATCH net-next 2/5] udp: UDP flow dissector

2016-10-12 Thread Tom Herbert

Add infrastructure for performing per protocol flow dissection and
support flow dissection in UDP payloads (e.g. flow dissection on a
UDP encapsulated tunnel).

The per protocol flow dissector is called by flow_dissect function
in the offload_callbacks of a protocol. The arguments of this function
include the necessary information to do flow dissection as derived
from __skb_flow_dissect which is where the callback is intended to be
called from. There are return codes from the callback in the form
FLOW_DIS_RET_* that indicate the result. FLOW_DIS_RET_IPPROTO
means that the payload should be dissected as an IP protocol; the
specific protocol is returned in a pointer argument. Likewise,
FLOW_DIS_RET_PROTO indicates the payload should be processed as
an ethertype, which is returned in another argument.

A case for IPPROTO_UDP was added to __skb_flow_dissect. Since the
UDP flow dissector involves a relatively expensive socket lookup,
there is a static key check first to see if there are any sockets
that have enabled flow dissection. After this check, the offload
ops for UDP for either IPv4 or IPv6 are considered. If the
flow_dissect function is set, it is called. Upon return the result
is processed (pass, out_bad, process as IP protocol, process
as ethertype). Note that if the result indicates a protocol must
be processed it is expected that nhoff has been updated to the
encapsulated protocol header.
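
How a caller might act on the four return codes can be sketched as a small mapping function. The DEMO_* names are hypothetical, and treating an unknown code like "pass" is an assumption of this sketch, not something the patch states.

```c
/* Mirrors the four outcomes named in the description above. */
enum demo_ret {
	DEMO_RET_PASS = 0,
	DEMO_RET_BAD,
	DEMO_RET_IPPROTO,
	DEMO_RET_PROTO,
};

/* What the generic dissector does next. */
enum demo_next {
	DEMO_NEXT_STOP,		/* nothing more to dissect here */
	DEMO_NEXT_OUT_BAD,	/* abort: packet could not be parsed */
	DEMO_NEXT_IP_PROTO,	/* continue with the returned IP protocol */
	DEMO_NEXT_ETHERTYPE,	/* continue with the returned ethertype */
};

static enum demo_next demo_handle_result(int ret)
{
	switch (ret) {
	case DEMO_RET_BAD:
		return DEMO_NEXT_OUT_BAD;
	case DEMO_RET_IPPROTO:
		return DEMO_NEXT_IP_PROTO;
	case DEMO_RET_PROTO:
		return DEMO_NEXT_ETHERTYPE;
	case DEMO_RET_PASS:
	default:
		/* Assumption: unknown codes are treated like "pass" */
		return DEMO_NEXT_STOP;
	}
}
```

In both "continue" cases the callback is expected to have already moved nhoff to the encapsulated header, so the caller can resume parsing from there.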

Signed-off-by: Tom Herbert 
---
 include/linux/netdevice.h|  5 +++
 include/linux/udp.h  |  7 +
 include/net/flow_dissector.h |  8 +
 include/net/udp.h|  4 +++
 net/core/flow_dissector.c| 73 ++--
 net/ipv4/udp.c   |  3 ++
 6 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 136ae6bb..51b43fb1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2199,6 +2199,11 @@ struct offload_callbacks {
struct sk_buff  **(*gro_receive)(struct sk_buff **head,
 struct sk_buff *skb);
int (*gro_complete)(struct sk_buff *skb, int nhoff);
+   int (*flow_dissect)(const struct sk_buff *skb,
+   void *data, int hlen,
+   int *nhoff, u8 *ip_proto,
+   __be16 *proto,
+struct flow_dissector_key_addrs *key_addrs);
 };
 
 struct packet_offload {
diff --git a/include/linux/udp.h b/include/linux/udp.h
index d1fd8cd..608ebf4 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -79,6 +79,13 @@ struct udp_sock {
int (*gro_complete)(struct sock *sk,
struct sk_buff *skb,
int nhoff);
+
+   /* Flow dissector function for UDP socket */
+   int (*flow_dissect)(struct sock *sk,
+   const struct sk_buff *skb,
+   void *data, int hlen,
+   int *nhoff, u8 *ip_proto,
+   __be16 *proto);
 };
 
 static inline struct udp_sock *udp_sk(const struct sock *sk)
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index d953492..9de4904 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -203,4 +203,12 @@ static inline void *skb_flow_dissector_target(struct 
flow_dissector *flow_dissec
return ((char *)target_container) + flow_dissector->offset[key_id];
 }
 
+/* Return codes from per socket flow dissector (e.g. UDP) */
+enum {
+   FLOW_DIS_RET_PASS = 0,
+   FLOW_DIS_RET_BAD,
+   FLOW_DIS_RET_IPPROTO,
+   FLOW_DIS_RET_PROTO,
+};
+
 #endif
diff --git a/include/net/udp.h b/include/net/udp.h
index 717a972..8d364e8 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -360,4 +360,8 @@ void udp_encap_enable(void);
 #if IS_ENABLED(CONFIG_IPV6)
 void udpv6_encap_enable(void);
 #endif
+
+void udp_flow_dissect_enable(void);
+void udp_flow_dissect_disable(void);
+
 #endif /* _UDP_H */
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 1a7b80f..5a4dfaf 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -8,6 +8,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -57,6 +59,20 @@ void skb_flow_dissector_init(struct flow_dissector 
*flow_dissector,
 }
 EXPORT_SYMBOL(skb_flow_dissector_init);
 
+static struct static_key udp_flow_dissect __read_mostly;
+
+void udp_flow_dissect_enable(void)
+{
+   static_key_slow_inc(&udp_flow_dissect);
+}
+EXPORT_SYMBOL(udp_flow_dissect_enable);
+
+void udp_flow_dissect_disable(void)
+{
+   static_key_slow_dec(&udp_flow_dissect);
+}
+EXPORT_SYMBOL(udp_flow_dissect_disable);
+
 /**
  * __skb_flow_get_ports -

Re: [PATCH v3] iproute2: macvlan: add "source" mode

2016-10-12 Thread Stephen Hemminger

On Sun, 25 Sep 2016 21:08:55 +0200
Michael Braun  wrote:

> Adjusting iproute2 utility to support new macvlan link type mode called
> "source".
> 
> Example of commands that can be applied:
>   ip link add link eth0 name macvlan0 type macvlan mode source
>   ip link set link dev macvlan0 type macvlan macaddr add 00:11:11:11:11:11
>   ip link set link dev macvlan0 type macvlan macaddr del 00:11:11:11:11:11
>   ip link set link dev macvlan0 type macvlan macaddr flush
>   ip -details link show dev macvlan0
> 
> Based on previous work of Stefan Gula 
> 
> Signed-off-by: Michael Braun 
> 
> Cc: ste...@gmail.com

Applied.

Re: [PATCH iproute2 1/1] tc filters: add support to get individual filters by handle

2016-10-12 Thread Stephen Hemminger

On Mon, 10 Oct 2016 12:45:14 -0400
Jamal Hadi Salim  wrote:

> From: Jamal Hadi Salim 
> 
> sudo $TC filter add dev $ETH parent : prio 2 protocol ip \
> u32 match u32 0 0 flowid 1:1 \
> action ok
> sudo $TC filter add dev $ETH parent : prio 1 protocol ip \
> u32 match ip protocol 1 0xff flowid 1:10 \
> action ok
> 
> now dump to see all rules..
> $TC -s filter ls dev $ETH parent : protocol ip
>  
> filter pref 1 u32
> filter pref 1 u32 fh 801: ht divisor 1
> filter pref 1 u32 fh 801::800 order 2048 key ht 801 bkt 0 flowid 1:10  (rule 
> hit 0 success 0)
>   match 0001/00ff at 8 (success 0 )
> action order 1: gact action drop
>  random type none pass val 0
>  index 6 ref 1 bind 1 installed 4 sec used 4 sec
> Action statistics:
> Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> 
> filter pref 2 u32
> filter pref 2 u32 fh 800: ht divisor 1
> filter pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1  (rule 
> hit 336 success 336)
>   match / at 0 (success 336 )
> action order 1: gact action pass
>  random type none pass val 0
>  index 5 ref 1 bind 1 installed 38 sec used 4 sec
> Action statistics:
> Sent 24864 bytes 336 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
>  
> 
> ..get filter 801::800
> $TC -s filter get dev $ETH parent : protocol ip \
> handle 801:0:800 prio 2  u32
> 
>  
> filter parent : protocol ip pref 1 u32 fh 801::800 order 2048 key ht 801 
> bkt 0 flowid 1:10  (rule hit 260 success 130)
>   match 0001/00ff at 8 (success 130 )
> action order 1: gact action drop
>  random type none pass val 0
>  index 6 ref 1 bind 1 installed 348 sec used 0 sec
> Action statistics:
> Sent 11440 bytes 130 pkt (dropped 130, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
>  
> 
> ..get other one
> $TC -s filter get dev $ETH parent : protocol ip \
> handle 800:0:800 prio 2  u32
> 
> 
> filter parent : protocol ip pref 2 u32 fh 800::800 order 2048 key ht 800 
> bkt 0 flowid 1:1  (rule hit 514 success 514)
>   match / at 0 (success 514 )
> action order 1: gact action pass
>  random type none pass val 0
>  index 5 ref 1 bind 1 installed 506 sec used 4 sec
> Action statistics:
> Sent 35544 bytes 514 pkt (dropped 0, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> 
> 
> ..try something that doesn't exist
> $TC -s filter get dev $ETH parent : protocol ip  handle 800:0:803 prio 2 u32
> 
> .
> RTNETLINK answers: No such file or directory
> We have an error talking to the kernel
> .
> 
> Note: the added NLM_F_ECHO is for backward compatibility. Old kernels (from
> before Eric's patch) will not respond without it, and newer kernels (after
> Eric's patch) will ignore it.
> In old kernels there is a side effect: in addition to a response to the GET,
> you will receive an event (if you run tc mon).
> But this is still better than what it was before (not working at all).
> 
> Signed-off-by: Jamal Hadi Salim 

Applied

Re: [PATCH v2 iproute2 0/9] Cleanup backlog

2016-10-12 Thread Stephen Hemminger

On Tue, 11 Oct 2016 07:00:39 -0400
Jamal Hadi Salim  wrote:

> From: Jamal Hadi Salim 
> 
> 
> Variety of cleanup and new functionality I had sitting around on my
> private tree
> 
> Craig Dillabaugh (1):
>   action gact: list pipe as a valid action
> 
> Jamal Hadi Salim (3):
>   actions ife: Introduce encoding and decoding of tcindex metadata
>   actions:  add skbmod action
>   man pages: Add tc-ife to Makefile
> 
> Lucas Bates (2):
>   man pages: update ife action to include tcindex
>   man pages: add man page for skbmod action
> 
> Roman Mashak (3):
>   ife action: allow specifying index in hex
>   ife: print prio, mark and hash as unsigned
>   ife: improve help text
> 
>  include/linux/tc_act/tc_ife.h |   3 +-
>  man/man8/Makefile |   2 +-
>  man/man8/tc-ife.8 |  29 -
>  tc/m_gact.c   |   4 +-
>  tc/m_ife.c|  38 --
>  tc/m_skbmod.c | 260 
> ++
>  6 files changed, 319 insertions(+), 17 deletions(-)
>  create mode 100644 tc/m_skbmod.c
> 

Applied. Then went through and cleaned up minor checkpatch style issues.

Re: [PATCH net-next] net: phy: Trigger state machine on state change and not polling.

2016-10-12 Thread Kyle Roeschley

On Wed, Oct 12, 2016 at 10:14:53PM +0200, Andrew Lunn wrote:
> The phy_start() is used to indicate the PHY is now ready to do its
> work. The state is changed, normally to PHY_UP which means that both
> the MAC and the PHY are ready.
> 
> If the phy driver is using polling, when the next poll happens, the
> state machine notices the PHY is now in PHY_UP, and kicks off
> auto-negotiation, if needed.
> 
> If however, the PHY is using interrupts, there is no polling. The phy
> is stuck in PHY_UP until the next interrupt comes along. And there is
> no reason for the PHY to interrupt.
> 
> Have phy_start() schedule the state machine to run, which both speeds
> up the polling use case, and makes the interrupt use case actually
> work.
> 
> This problem exists whenever there is a state change which will not
> cause an interrupt. Trigger the state machine in these cases,
> e.g. phy_error().
> 
> Signed-off-by: Andrew Lunn 
> Cc: Kyle Roeschley 
> ---
> 
> This should be applied to stable, but i've no idea what fixes: tag to
> use. It could be phylib has been broken since interrupts were added?
> 

This patch fixed another problem I had with the PHY state machine, which was
caused by d5c3d8465 ("net: phy: Avoid polling PHY with PHY_IGNORE_INTERRUPTS").
That might be the commit you're looking for. Also,

Tested-by: Kyle Roeschley 

>  drivers/net/phy/phy.c | 22 --
>  1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index c6f66832a1a6..f424b867f73e 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -608,6 +608,21 @@ void phy_start_machine(struct phy_device *phydev)
>  }
>  
>  /**
> + * phy_trigger_machine - trigger the state machine to run
> + *
> + * @phydev: the phy_device struct
> + *
> + * Description: There has been a change in state which requires that the
> + *   state machine runs.
> + */
> +

You've got a bonus newline here.

> +static void phy_trigger_machine(struct phy_device *phydev)
> +{
> + cancel_delayed_work_sync(&phydev->state_queue);
> + queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, 0);
> +}
> +
> +/**
>   * phy_stop_machine - stop the PHY state machine tracking
>   * @phydev: target phy_device struct
>   *
> @@ -639,6 +654,8 @@ static void phy_error(struct phy_device *phydev)
>   mutex_lock(&phydev->lock);
>   phydev->state = PHY_HALTED;
>   mutex_unlock(&phydev->lock);
> +
> + phy_trigger_machine(phydev);
>  }
>  
>  /**
> @@ -800,8 +817,7 @@ void phy_change(struct work_struct *work)
>   }
>  
>   /* reschedule state queue work to run as soon as possible */
> - cancel_delayed_work_sync(&phydev->state_queue);
> - queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, 0);
> + phy_trigger_machine(phydev);
>   return;
>  
>  ignore:
> @@ -890,6 +906,8 @@ void phy_start(struct phy_device *phydev)
>   /* if phy was suspended, bring the physical link up again */
>   if (do_resume)
>   phy_resume(phydev);
> +
> + phy_trigger_machine(phydev);
>  }
>  EXPORT_SYMBOL(phy_start);
>  
> -- 
> 2.9.3
> 

-- 
Kyle Roeschley
Software Engineer
National Instruments
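
The behaviour described in the quoted commit message can be modelled in user space. The following is a minimal sketch, not phylib code: every `model_*` name and the reduced state enum are hypothetical, and the delayed work item is replaced by a plain flag. It only illustrates why an interrupt-driven PHY stays in the "up" state unless phy_start() asks for one run of the state machine.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced set of states; the real phylib enum has more. */
enum model_state { MODEL_READY, MODEL_UP, MODEL_AN, MODEL_HALTED };

struct model_phy {
	enum model_state state;
	bool polling;     /* true: a periodic timer runs the state machine */
	bool work_queued; /* stands in for the queued delayed work item */
};

/* Stand-in for phy_trigger_machine(): request one state machine run. */
static void model_trigger_machine(struct model_phy *phy)
{
	phy->work_queued = true;
}

/* One pass of the state machine: the "up" state starts auto-negotiation. */
static void model_state_machine(struct model_phy *phy)
{
	if (phy->state == MODEL_UP)
		phy->state = MODEL_AN;
}

/* Run the machine if work was queued, or if a poll tick would fire anyway. */
static void model_run_pending(struct model_phy *phy)
{
	if (phy->work_queued || phy->polling) {
		phy->work_queued = false;
		model_state_machine(phy);
	}
}

/* Model of phy_start(); 'trigger' selects behaviour with or without the patch. */
static void model_phy_start(struct model_phy *phy, bool trigger)
{
	phy->state = MODEL_UP;
	if (trigger)
		model_trigger_machine(phy);
}
```

In this model an interrupt-driven PHY (polling == false) started without the trigger never leaves MODEL_UP, while the same PHY started with the trigger reaches MODEL_AN on the next run. A polled PHY gets there either way, only later.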

[PATCH net-next 06/11] ixgbe: Flip to the new dev walk API

2016-10-12 Thread David Ahern

Convert ixgbe users of the old macros to new dev walk API. 
This is just a move to the new API; no functional change is intended.

Signed-off-by: David Ahern 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 132 --
 1 file changed, 82 insertions(+), 50 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index a244d9a67264..aae5e0349d93 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5012,24 +5012,23 @@ static int ixgbe_fwd_ring_up(struct net_device *vdev,
return err;
 }
 
-static void ixgbe_configure_dfwd(struct ixgbe_adapter *adapter)
+static int ixgbe_upper_dev_walk(struct net_device *upper, void *data)
 {
-   struct net_device *upper;
-   struct list_head *iter;
-   int err;
-
-   netdev_for_each_all_upper_dev_rcu(adapter->netdev, upper, iter) {
-   if (netif_is_macvlan(upper)) {
-   struct macvlan_dev *dfwd = netdev_priv(upper);
-   struct ixgbe_fwd_adapter *vadapter = dfwd->fwd_priv;
+   if (netif_is_macvlan(upper)) {
+   struct macvlan_dev *dfwd = netdev_priv(upper);
+   struct ixgbe_fwd_adapter *vadapter = dfwd->fwd_priv;
 
-   if (dfwd->fwd_priv) {
-   err = ixgbe_fwd_ring_up(upper, vadapter);
-   if (err)
-   continue;
-   }
-   }
+   if (dfwd->fwd_priv)
+   ixgbe_fwd_ring_up(upper, vadapter);
}
+
+   return 0;
+}
+
+static void ixgbe_configure_dfwd(struct ixgbe_adapter *adapter)
+{
+   netdev_walk_all_upper_dev_rcu(adapter->netdev,
+ ixgbe_upper_dev_walk, NULL);
 }
 
 static void ixgbe_configure(struct ixgbe_adapter *adapter)
@@ -5448,12 +5447,25 @@ static void ixgbe_fdir_filter_exit(struct ixgbe_adapter *adapter)
spin_unlock(&adapter->fdir_perfect_lock);
 }
 
+static int ixgbe_disable_macvlan(struct net_device *upper, void *data)
+{
+   if (netif_is_macvlan(upper)) {
+   struct macvlan_dev *vlan = netdev_priv(upper);
+
+   if (vlan->fwd_priv) {
+   netif_tx_stop_all_queues(upper);
+   netif_carrier_off(upper);
+   netif_tx_disable(upper);
+   }
+   }
+
+   return 0;
+}
+
 void ixgbe_down(struct ixgbe_adapter *adapter)
 {
struct net_device *netdev = adapter->netdev;
struct ixgbe_hw *hw = &adapter->hw;
-   struct net_device *upper;
-   struct list_head *iter;
int i;
 
/* signal that we are down to the interrupt handler */
@@ -5477,17 +5489,8 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
netif_tx_disable(netdev);
 
/* disable any upper devices */
-   netdev_for_each_all_upper_dev_rcu(adapter->netdev, upper, iter) {
-   if (netif_is_macvlan(upper)) {
-   struct macvlan_dev *vlan = netdev_priv(upper);
-
-   if (vlan->fwd_priv) {
-   netif_tx_stop_all_queues(upper);
-   netif_carrier_off(upper);
-   netif_tx_disable(upper);
-   }
-   }
-   }
+   netdev_walk_all_upper_dev_rcu(adapter->netdev,
+ ixgbe_disable_macvlan, NULL);
 
ixgbe_irq_disable(adapter);
 
@@ -6728,6 +6731,18 @@ static void ixgbe_update_default_up(struct ixgbe_adapter 
*adapter)
 #endif
 }
 
+static int ixgbe_enable_macvlan(struct net_device *upper, void *data)
+{
+   if (netif_is_macvlan(upper)) {
+   struct macvlan_dev *vlan = netdev_priv(upper);
+
+   if (vlan->fwd_priv)
+   netif_tx_wake_all_queues(upper);
+   }
+
+   return 0;
+}
+
 /**
  * ixgbe_watchdog_link_is_up - update netif_carrier status and
  * print link up message
@@ -6737,8 +6752,6 @@ static void ixgbe_watchdog_link_is_up(struct 
ixgbe_adapter *adapter)
 {
struct net_device *netdev = adapter->netdev;
struct ixgbe_hw *hw = &adapter->hw;
-   struct net_device *upper;
-   struct list_head *iter;
u32 link_speed = adapter->link_speed;
const char *speed_str;
bool flow_rx, flow_tx;
@@ -6809,14 +6822,8 @@ static void ixgbe_watchdog_link_is_up(struct 
ixgbe_adapter *adapter)
 
/* enable any upper devices */
rtnl_lock();
-   netdev_for_each_all_upper_dev_rcu(adapter->netdev, upper, iter) {
-   if (netif_is_macvlan(upper)) {
-   struct macvlan_dev *vlan = netdev_priv(upper);
-
-   if (vlan->fwd_priv)
-   netif_tx_wake_all_queues(upper);
-   }
-   }
+

[PATCH net-next 11/11] net: dev: Improve debug statements for adjacency tracking

2016-10-12 Thread David Ahern

Adjacency code only has debugs for the insert case. Add debugs for
the remove path and make both consistently worded to make it easier
to follow the insert and removal with reference counts.

In addition, change the BUG to a WARN_ON. A missing adjacency at
removal time is not cause for a panic.

Signed-off-by: David Ahern 
---
 net/core/dev.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 52e70a3d61a4..ad5e7bfda403 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5700,6 +5700,9 @@ static int __netdev_adjacent_dev_insert(struct net_device 
*dev,
 
if (adj) {
adj->ref_nr += 1;
+   pr_debug("Insert adjacency: dev %s adj_dev %s adj->ref_nr %d\n",
+dev->name, adj_dev->name, adj->ref_nr);
+
return 0;
}
 
@@ -5713,8 +5716,8 @@ static int __netdev_adjacent_dev_insert(struct net_device 
*dev,
adj->private = private;
dev_hold(adj_dev);
 
-   pr_debug("dev_hold for %s, because of link added from %s to %s\n",
-adj_dev->name, dev->name, adj_dev->name);
+   pr_debug("Insert adjacency: dev %s adj_dev %s adj->ref_nr %d; dev_hold on %s\n",
+            dev->name, adj_dev->name, adj->ref_nr, adj_dev->name);
 
if (netdev_adjacent_is_neigh_list(dev, adj_dev, dev_list)) {
ret = netdev_adjacent_sysfs_add(dev, adj_dev, dev_list);
@@ -5753,17 +5756,22 @@ static void __netdev_adjacent_dev_remove(struct 
net_device *dev,
 {
struct netdev_adjacent *adj;
 
+   pr_debug("Remove adjacency: dev %s adj_dev %s ref_nr %d\n",
+dev->name, adj_dev->name, ref_nr);
+
adj = __netdev_find_adj(adj_dev, dev_list);
 
if (!adj) {
-   pr_err("tried to remove device %s from %s\n",
+   pr_err("Adjacency does not exist for device %s from %s\n",
   dev->name, adj_dev->name);
-   BUG();
+   WARN_ON(1);
+   return;
}
 
if (adj->ref_nr > ref_nr) {
-   pr_debug("%s to %s ref_nr-%d = %d\n", dev->name, adj_dev->name,
-ref_nr, adj->ref_nr-ref_nr);
+   pr_debug("adjacency: %s to %s ref_nr - %d = %d\n",
+dev->name, adj_dev->name, ref_nr,
+adj->ref_nr - ref_nr);
adj->ref_nr -= ref_nr;
return;
}
@@ -5775,7 +5783,7 @@ static void __netdev_adjacent_dev_remove(struct 
net_device *dev,
netdev_adjacent_sysfs_del(dev, adj_dev->name, dev_list);
 
list_del_rcu(&adj->list);
-   pr_debug("dev_put for %s, because link removed from %s to %s\n",
+   pr_debug("adjacency: dev_put for %s, because link removed from %s to %s\n",
 adj_dev->name, dev->name, adj_dev->name);
dev_put(adj_dev);
kfree_rcu(adj, rcu);
-- 
2.1.4
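
The reference counting that the new debug statements trace can be sketched outside the kernel. This is a simplified model under stated assumptions: `struct model_adj` and both functions are hypothetical, a single struct stands in for one list entry, and the remove function returns an int only so the missing-adjacency case is observable (the kernel function returns void). The point is the shape of the logic: insert always adds exactly one reference, remove drops `ref_nr` references, and a missing adjacency produces a warning and an early return rather than a halt.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for one adjacency entry and its reference count. */
struct model_adj {
	int present;         /* 1 while the adjacency exists */
	unsigned int ref_nr; /* number of references to the link */
};

/* Insert path: a new link starts at 1, a repeated insert adds exactly 1. */
static void model_adj_insert(struct model_adj *adj)
{
	if (adj->present) {
		adj->ref_nr += 1;
		return;
	}
	adj->present = 1;
	adj->ref_nr = 1;
}

/*
 * Remove path: drop ref_nr references. A missing adjacency is reported
 * and -1 is returned; the caller keeps running. Returns 0 otherwise.
 */
static int model_adj_remove(struct model_adj *adj, unsigned int ref_nr)
{
	if (!adj->present) {
		fprintf(stderr, "warning: adjacency does not exist\n");
		return -1;
	}

	if (adj->ref_nr > ref_nr) {
		/* other references remain, keep the entry */
		adj->ref_nr -= ref_nr;
		return 0;
	}

	/* last reference gone: the entry itself goes away */
	adj->present = 0;
	adj->ref_nr = 0;
	return 0;
}
```

Two inserts followed by two single-reference removes leave the entry absent again, and a further remove only warns.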

[PATCH net-next 09/11] net: Remove all_adj_list and its references

2016-10-12 Thread David Ahern

Only direct adjacencies are maintained. All upper or lower devices can
be learned via the new walk API which recursively walks the adj_list for
upper devices or lower devices.

Signed-off-by: David Ahern 
---
 include/linux/netdevice.h |  25 -
 net/core/dev.c| 229 +-
 2 files changed, 21 insertions(+), 233 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 053f6f75f26a..8215908556ec 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1456,7 +1456,6 @@ enum netdev_priv_flags {
  * @ptype_specific: Device-specific, protocol-specific packet handlers
  *
  * @adj_list:  Directly linked devices, like slaves for bonding
- * @all_adj_list:  All linked devices, *including* neighbours
  * @features:  Currently active device features
  * @hw_features:   User-changeable features
  *
@@ -1673,11 +1672,6 @@ struct net_device {
struct list_head lower;
} adj_list;
 
-   struct {
-   struct list_head upper;
-   struct list_head lower;
-   } all_adj_list;
-
netdev_features_t   features;
netdev_features_t   hw_features;
netdev_features_t   wanted_features;
@@ -3832,13 +3826,6 @@ struct net_device 
*netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
 updev; \
 updev = netdev_upper_get_next_dev_rcu(dev, &(iter)))
 
-/* iterate through upper list, must be called under RCU read lock */
-#define netdev_for_each_all_upper_dev_rcu(dev, updev, iter) \
-   for (iter = &(dev)->all_adj_list.upper, \
-updev = netdev_all_upper_get_next_dev_rcu(dev, &(iter)); \
-updev; \
-updev = netdev_all_upper_get_next_dev_rcu(dev, &(iter)))
-
 int netdev_walk_all_upper_dev_rcu(struct net_device *dev,
  int (*fn)(struct net_device *lower_dev,
void *data),
@@ -3878,18 +3865,6 @@ struct net_device *netdev_all_lower_get_next(struct 
net_device *dev,
 struct net_device *netdev_all_lower_get_next_rcu(struct net_device *dev,
 struct list_head **iter);
 
-#define netdev_for_each_all_lower_dev(dev, ldev, iter) \
-   for (iter = (dev)->all_adj_list.lower.next, \
-ldev = netdev_all_lower_get_next(dev, &(iter)); \
-ldev; \
-ldev = netdev_all_lower_get_next(dev, &(iter)))
-
-#define netdev_for_each_all_lower_dev_rcu(dev, ldev, iter) \
-   for (iter = (dev)->all_adj_list.lower.next, \
-ldev = netdev_all_lower_get_next_rcu(dev, &(iter)); \
-ldev; \
-ldev = netdev_all_lower_get_next_rcu(dev, &(iter)))
-
 int netdev_walk_all_lower_dev(struct net_device *dev,
  int (*fn)(struct net_device *lower_dev,
void *data),
diff --git a/net/core/dev.c b/net/core/dev.c
index fcf3641db783..0f9b0985a84c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5273,6 +5273,16 @@ static struct netdev_adjacent *__netdev_find_adj(struct 
net_device *adj_dev,
return NULL;
 }
 
+static int __netdev_has_upper_dev(struct net_device *upper_dev, void *data)
+{
+   struct net_device *dev = (struct net_device *)data;
+
+   if (upper_dev == dev)
+   return 1;
+
+   return 0;
+}
+
 /**
  * netdev_has_upper_dev - Check if device is linked to an upper device
  * @dev: device
@@ -5287,7 +5297,8 @@ bool netdev_has_upper_dev(struct net_device *dev,
 {
ASSERT_RTNL();
 
-   return __netdev_find_adj(upper_dev, &dev->all_adj_list.upper);
+   return netdev_walk_all_upper_dev_rcu(dev, __netdev_has_upper_dev,
+upper_dev);
 }
 EXPORT_SYMBOL(netdev_has_upper_dev);
 
@@ -5301,16 +5312,6 @@ EXPORT_SYMBOL(netdev_has_upper_dev);
  * The caller must hold rcu lock.
  */
 
-static int __netdev_has_upper_dev(struct net_device *upper_dev, void *data)
-{
-   struct net_device *dev = (struct net_device *)data;
-
-   if (upper_dev == dev)
-   return 1;
-
-   return 0;
-}
-
 bool netdev_has_upper_dev_all_rcu(struct net_device *dev,
  struct net_device *upper_dev)
 {
@@ -5333,7 +5334,7 @@ static bool netdev_has_any_upper_dev(struct net_device 
*dev)
 {
ASSERT_RTNL();
 
-   return !list_empty(&dev->all_adj_list.upper);
+   return !list_empty(&dev->adj_list.upper);
 }
 
 /**
@@ -5396,32 +5397,6 @@ struct net_device *netdev_upper_get_next_dev_rcu(struct 
net_device *dev,
 }
 EXPORT_SYMBOL(netdev_upper_get_next_dev_rcu);
 
-/**
- * netdev_all_upper_get_next_dev_rcu - Get the next dev from upper list
- * @dev: device
- * @iter: list_head ** of the current position
- *
- * Gets the next device from the dev's upper list, starting from iter
- * position. The caller must hold RCU read lock.
-

[PATCH net-next 07/11] mlxsw: Flip to the new dev walk API

2016-10-12 Thread David Ahern

Convert mlxsw users to new dev walk API. This is just a move to the
new API; no functional change is intended.

Signed-off-by: David Ahern 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 37 --
 1 file changed, 23 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 1ec0a4ce3c46..1d5434ff02bb 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -3090,19 +3090,30 @@ static bool mlxsw_sp_port_dev_check(const struct net_device *dev)
return dev->netdev_ops == &mlxsw_sp_port_netdev_ops;
 }
 
+static int mlxsw_lower_dev_walk(struct net_device *lower_dev, void *data)
+{
+   struct mlxsw_sp_port **port = data;
+   int ret = 0;
+
+   if (mlxsw_sp_port_dev_check(lower_dev)) {
+   *port = netdev_priv(lower_dev);
+   ret = 1;
+   }
+
+   return ret;
+}
+
 static struct mlxsw_sp_port *mlxsw_sp_port_dev_lower_find(struct net_device 
*dev)
 {
-   struct net_device *lower_dev;
-   struct list_head *iter;
+   struct mlxsw_sp_port *port;
 
if (mlxsw_sp_port_dev_check(dev))
return netdev_priv(dev);
 
-   netdev_for_each_all_lower_dev(dev, lower_dev, iter) {
-   if (mlxsw_sp_port_dev_check(lower_dev))
-   return netdev_priv(lower_dev);
-   }
-   return NULL;
+   port = NULL;
+   netdev_walk_all_lower_dev(dev, mlxsw_lower_dev_walk, &port);
+
+   return port;
 }
 
 static struct mlxsw_sp *mlxsw_sp_lower_get(struct net_device *dev)
@@ -3115,17 +3126,15 @@ static struct mlxsw_sp *mlxsw_sp_lower_get(struct 
net_device *dev)
 
 static struct mlxsw_sp_port *mlxsw_sp_port_dev_lower_find_rcu(struct 
net_device *dev)
 {
-   struct net_device *lower_dev;
-   struct list_head *iter;
+   struct mlxsw_sp_port *port;
 
if (mlxsw_sp_port_dev_check(dev))
return netdev_priv(dev);
 
-   netdev_for_each_all_lower_dev_rcu(dev, lower_dev, iter) {
-   if (mlxsw_sp_port_dev_check(lower_dev))
-   return netdev_priv(lower_dev);
-   }
-   return NULL;
+   port = NULL;
+   netdev_walk_all_lower_dev_rcu(dev, mlxsw_lower_dev_walk, &port);
+
+   return port;
 }
 
 struct mlxsw_sp_port *mlxsw_sp_port_lower_dev_hold(struct net_device *dev)
-- 
2.1.4
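
The conversion pattern in this patch (a callback that records the first match through the opaque data pointer and returns non-zero to end the walk) can be tried in user space. The sketch below is an illustration under assumptions, not the kernel implementation: `struct model_dev`, the fixed-size `lower` array, and all `model_*` functions are hypothetical, and the visiting order (callback on a direct lower device first, then recursion into it) is assumed rather than taken from the series.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_LOWER 4

/* Hypothetical miniature of a net_device: only direct lower links are kept. */
struct model_dev {
	const char *name;
	int is_port; /* stands in for a driver check such as mlxsw_sp_port_dev_check() */
	struct model_dev *lower[MODEL_MAX_LOWER];
	size_t n_lower;
};

typedef int (*model_walk_fn)(struct model_dev *lower, void *data);

/* Depth-first walk over all lower devices; a non-zero return from fn stops it. */
static int model_walk_all_lower(struct model_dev *dev, model_walk_fn fn,
				void *data)
{
	size_t i;
	int ret;

	for (i = 0; i < dev->n_lower; i++) {
		ret = fn(dev->lower[i], data);
		if (ret)
			return ret;

		ret = model_walk_all_lower(dev->lower[i], fn, data);
		if (ret)
			return ret;
	}

	return 0;
}

/* Callback: store the first matching device via 'data' and end the walk. */
static int model_find_port(struct model_dev *lower, void *data)
{
	struct model_dev **found = data;

	if (lower->is_port) {
		*found = lower;
		return 1;
	}

	return 0;
}

/* Same shape as the converted lookup: check dev itself, then walk below it. */
static struct model_dev *model_port_lower_find(struct model_dev *dev)
{
	struct model_dev *found = NULL;

	if (dev->is_port)
		return dev;

	model_walk_all_lower(dev, model_find_port, &found);

	return found;
}
```

The result pointer has to be initialised to NULL before the walk, because the callback only writes to it on a match.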

Re: [RFC] net: phy: smsc: Disable auto-negotiation on startup

2016-10-12 Thread Andrew Lunn

On Wed, Oct 12, 2016 at 10:05:33AM -0500, Kyle Roeschley wrote:
> On Wed, Oct 12, 2016 at 02:13:06AM -0700, Florian Fainelli wrote:
> > On 10/10/2016 10:41 AM, Kyle Roeschley wrote:
> > > Because the SMSC PHY completes auto-negotiation before the driver is
> > > ready to handle interrupts, the PHY state machine never realizes that we
> > > have a link. Clear the ANENABLE bit on initialization, which lets
> > > genphy_config_aneg do its thing when that code is hit later.
> > > 
> > > While this patch does fix the problem we see (no link on boot without
> > > re-plugging the cable), it seems like the generic PHY code should be
> > > able to handle auto-negotiation completing before interrupts are
> > > enabled. Submitted as an RFC in the hopes that someone has an idea as to
> > > how that could be done.
> > > 
> > > This fix is copied from commit 99f81afc139c ("phy: micrel: Disable auto
> > > negotiation on startup").
> > 
> > Do you mind trying:
> > 
> > https://www.spinics.net/lists/netdev/msg397857.html
> > 
> > and see if you do get link interrupts without your patch applied? Thanks!
> 
> Yep, that fixes it. I figured there was some state machine issue I was 
> missing.
> Thanks very much!

Humm, O.K, time to pull that patch out of the series and make it
standalone.

Andrew

[PATCH net-next 08/11] rocker: Flip to the new dev walk API

2016-10-12 Thread David Ahern

Convert rocker to the new dev walk API. This is just a code conversion;
no functional change is intended.

Signed-off-by: David Ahern 
---
 drivers/net/ethernet/rocker/rocker_main.c | 31 ---
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker_main.c 
b/drivers/net/ethernet/rocker/rocker_main.c
index 5424fb341613..9310adc0bcbb 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -2839,20 +2839,37 @@ static bool rocker_port_dev_check_under(const struct 
net_device *dev,
return true;
 }
 
+struct rocker_walk_data {
+   struct rocker *rocker;
+   struct rocker_port *port;
+};
+
+static int rocker_lower_dev_walk(struct net_device *lower_dev, void *_data)
+{
+   struct rocker_walk_data *data = (struct rocker_walk_data *)_data;
+   int ret = 0;
+
+   if (rocker_port_dev_check_under(lower_dev, data->rocker)) {
+   data->port = netdev_priv(lower_dev);
+   ret = 1;
+   }
+
+   return ret;
+}
+
 struct rocker_port *rocker_port_dev_lower_find(struct net_device *dev,
   struct rocker *rocker)
 {
-   struct net_device *lower_dev;
-   struct list_head *iter;
+   struct rocker_walk_data data;
 
if (rocker_port_dev_check_under(dev, rocker))
return netdev_priv(dev);
 
-   netdev_for_each_all_lower_dev(dev, lower_dev, iter) {
-   if (rocker_port_dev_check_under(lower_dev, rocker))
-   return netdev_priv(lower_dev);
-   }
-   return NULL;
+   data.rocker = rocker;
+   data.port = NULL;
+   netdev_walk_all_lower_dev(dev, rocker_lower_dev_walk, &data);
+
+   return data.port;
 }
 
 static int rocker_netdevice_event(struct notifier_block *unused,
-- 
2.1.4

[PATCH net-next 10/11] net: Add warning if any lower device is still in adjacency list

2016-10-12 Thread David Ahern

Lower list should be empty just like upper.

Signed-off-by: David Ahern 
---
 net/core/dev.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 0f9b0985a84c..52e70a3d61a4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5361,6 +5361,20 @@ struct net_device *netdev_master_upper_dev_get(struct 
net_device *dev)
 }
 EXPORT_SYMBOL(netdev_master_upper_dev_get);
 
+/**
+ * netdev_has_any_lower_dev - Check if device is linked to some device
+ * @dev: device
+ *
+ * Find out if a device is linked to a lower device and return true in case
+ * it is. The caller must hold the RTNL lock.
+ */
+static bool netdev_has_any_lower_dev(struct net_device *dev)
+{
+   ASSERT_RTNL();
+
+   return !list_empty(&dev->adj_list.lower);
+}
+
 void *netdev_adjacent_get_private(struct list_head *adj_list)
 {
struct netdev_adjacent *adj;
@@ -6746,6 +6760,7 @@ static void rollback_registered_many(struct list_head 
*head)
 
/* Notifier chain MUST detach us all upper devices. */
WARN_ON(netdev_has_any_upper_dev(dev));
+   WARN_ON(netdev_has_any_lower_dev(dev));
 
/* Remove entries from kobject tree */
netdev_unregister_kobject(dev);
-- 
2.1.4

[PATCH net-next 05/11] IB/ipoib: Flip to new dev walk API

2016-10-12 Thread David Ahern

Convert ipoib_get_net_dev_match_addr to the new upper device walk API.
This is just a move to the new API; no functional change is intended.

Signed-off-by: David Ahern 
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 37 +--
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 5636fc3da6b8..624855ab7205 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -292,6 +292,25 @@ static struct net_device *ipoib_get_master_net_dev(struct 
net_device *dev)
return dev;
 }
 
+struct ipoib_walk_data {
+   const struct sockaddr *addr;
+   struct net_device *result;
+};
+
+static int ipoib_upper_walk(struct net_device *upper, void *_data)
+{
+   struct ipoib_walk_data *data = (struct ipoib_walk_data *)_data;
+   int ret = 0;
+
+   if (ipoib_is_dev_match_addr_rcu(data->addr, upper)) {
+   dev_hold(upper);
+   data->result = upper;
+   ret = 1;
+   }
+
+   return ret;
+}
+
 /**
  * Find a net_device matching the given address, which is an upper device of
  * the given net_device.
@@ -304,27 +323,21 @@ static struct net_device *ipoib_get_master_net_dev(struct 
net_device *dev)
 static struct net_device *ipoib_get_net_dev_match_addr(
const struct sockaddr *addr, struct net_device *dev)
 {
-   struct net_device *upper,
- *result = NULL;
-   struct list_head *iter;
+   struct ipoib_walk_data data = {
+   .addr = addr,
+   };
 
rcu_read_lock();
if (ipoib_is_dev_match_addr_rcu(addr, dev)) {
dev_hold(dev);
-   result = dev;
+   data.result = dev;
goto out;
}
 
-   netdev_for_each_all_upper_dev_rcu(dev, upper, iter) {
-   if (ipoib_is_dev_match_addr_rcu(addr, upper)) {
-   dev_hold(upper);
-   result = upper;
-   break;
-   }
-   }
+   netdev_walk_all_upper_dev_rcu(dev, ipoib_upper_walk, &data);
 out:
rcu_read_unlock();
-   return result;
+   return data.result;
 }
 
 /* returns the number of IPoIB netdevs on top a given ipoib device matching a
-- 
2.1.4

[PATCH net-next 04/11] IB/core: Flip to the new dev walk API

2016-10-12 Thread David Ahern

Convert rdma_is_upper_dev_rcu, handle_netdev_upper and
ipoib_get_net_dev_match_addr to the new upper device walk API.
This is just a move to the new API; no functional change is intended.

Signed-off-by: David Ahern 
---
 drivers/infiniband/core/core_priv.h |  9 +--
 drivers/infiniband/core/roce_gid_mgmt.c | 42 ++---
 2 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/core/core_priv.h 
b/drivers/infiniband/core/core_priv.h
index 19d499dcab76..0c0bea091de8 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -127,14 +127,7 @@ void ib_cache_release_one(struct ib_device *device);
 static inline bool rdma_is_upper_dev_rcu(struct net_device *dev,
 struct net_device *upper)
 {
-   struct net_device *_upper = NULL;
-   struct list_head *iter;
-
-   netdev_for_each_all_upper_dev_rcu(dev, _upper, iter)
-   if (_upper == upper)
-   break;
-
-   return _upper == upper;
+   return netdev_has_upper_dev_all_rcu(dev, upper);
 }
 
 int addr_init(void);
diff --git a/drivers/infiniband/core/roce_gid_mgmt.c 
b/drivers/infiniband/core/roce_gid_mgmt.c
index 06556c34606d..db759a68d948 100644
--- a/drivers/infiniband/core/roce_gid_mgmt.c
+++ b/drivers/infiniband/core/roce_gid_mgmt.c
@@ -437,6 +437,28 @@ static void callback_for_addr_gid_device_scan(struct 
ib_device *device,
  >gid_attr);
 }
 
+struct upper_list {
+   struct list_head list;
+   struct net_device *upper;
+};
+
+static int netdev_upper_walk(struct net_device *upper, void *data)
+{
+   struct upper_list *entry = kmalloc(sizeof(*entry), GFP_ATOMIC);
+   struct list_head *upper_list = (struct list_head *)data;
+
+   if (!entry) {
+   pr_info("roce_gid_mgmt: couldn't allocate entry to delete ndev\n");
+   return 0;
+   }
+
+   list_add_tail(&entry->list, upper_list);
+   dev_hold(upper);
+   entry->upper = upper;
+
+   return 0;
+}
+
 static void handle_netdev_upper(struct ib_device *ib_dev, u8 port,
void *cookie,
void (*handle_netdev)(struct ib_device *ib_dev,
@@ -444,30 +466,12 @@ static void handle_netdev_upper(struct ib_device *ib_dev, 
u8 port,
  struct net_device *ndev))
 {
struct net_device *ndev = (struct net_device *)cookie;
-   struct upper_list {
-   struct list_head list;
-   struct net_device *upper;
-   };
-   struct net_device *upper;
-   struct list_head *iter;
struct upper_list *upper_iter;
struct upper_list *upper_temp;
LIST_HEAD(upper_list);
 
rcu_read_lock();
-   netdev_for_each_all_upper_dev_rcu(ndev, upper, iter) {
-   struct upper_list *entry = kmalloc(sizeof(*entry),
-  GFP_ATOMIC);
-
-   if (!entry) {
-   pr_info("roce_gid_mgmt: couldn't allocate entry to delete ndev\n");
-   continue;
-   }
-
-   list_add_tail(&entry->list, &upper_list);
-   dev_hold(upper);
-   entry->upper = upper;
-   }
+   netdev_walk_all_upper_dev_rcu(ndev, netdev_upper_walk, &upper_list);
rcu_read_unlock();
 
handle_netdev(ib_dev, port, ndev);
-- 
2.1.4
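
Unlike the find-first conversions elsewhere in the series, the callback in this patch always returns 0, so the walk visits every upper device and the callback only collects them. A minimal user-space sketch of that collect-all pattern follows. All names are hypothetical, a fixed-size array stands in for the kmalloc'ed list entries, and a full buffer plays the role of a failed allocation (the device is skipped and the walk continues).

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_UPPER 4
#define MODEL_MAX_SEEN  8

/* Hypothetical miniature of a net_device: only direct upper links are kept. */
struct model_dev {
	const char *name;
	struct model_dev *upper[MODEL_MAX_UPPER];
	size_t n_upper;
};

/* Caller-owned result buffer, standing in for the patch's upper_list. */
struct model_seen {
	struct model_dev *devs[MODEL_MAX_SEEN];
	size_t count;
};

typedef int (*model_walk_fn)(struct model_dev *upper, void *data);

/* Depth-first walk over all upper devices; a non-zero return from fn stops it. */
static int model_walk_all_upper(struct model_dev *dev, model_walk_fn fn,
				void *data)
{
	size_t i;
	int ret;

	for (i = 0; i < dev->n_upper; i++) {
		ret = fn(dev->upper[i], data);
		if (ret)
			return ret;

		ret = model_walk_all_upper(dev->upper[i], fn, data);
		if (ret)
			return ret;
	}

	return 0;
}

/* Callback: record the device and always return 0 so the walk goes on. */
static int model_collect_upper(struct model_dev *upper, void *data)
{
	struct model_seen *seen = data;

	if (seen->count == MODEL_MAX_SEEN)
		return 0; /* no room: skip this device, keep walking */

	seen->devs[seen->count++] = upper;

	return 0;
}
```

Collecting first and acting on the devices afterwards keeps the work done inside the walk small, which is the same split the patch keeps between the RCU-protected walk and the later handle_netdev() calls.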

[PATCH net-next 01/11] net: Remove refnr arg when inserting link adjacencies

2016-10-12 Thread David Ahern

Commit 93409033ae65 ("net: Add netdev all_adj_list refcnt propagation to
fix panic") propagated the refnr to insert and remove functions tracking
the netdev adjacency graph. However, for the insert path the refnr can
only be 1. Accordingly, remove the refnr argument to make that clear;
i.e., the refnr arg in 93409033ae65 was only needed for the remove path.

Signed-off-by: David Ahern 
---
 net/core/dev.c | 27 ---
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index f1fe26f66458..5aaa12a9e369 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5589,7 +5589,6 @@ static inline bool netdev_adjacent_is_neigh_list(struct 
net_device *dev,
 
 static int __netdev_adjacent_dev_insert(struct net_device *dev,
struct net_device *adj_dev,
-   u16 ref_nr,
struct list_head *dev_list,
void *private, bool master)
 {
@@ -5599,7 +5598,7 @@ static int __netdev_adjacent_dev_insert(struct net_device 
*dev,
adj = __netdev_find_adj(adj_dev, dev_list);
 
if (adj) {
-   adj->ref_nr += ref_nr;
+   adj->ref_nr += 1;
return 0;
}
 
@@ -5609,7 +5608,7 @@ static int __netdev_adjacent_dev_insert(struct net_device 
*dev,
 
adj->dev = adj_dev;
adj->master = master;
-   adj->ref_nr = ref_nr;
+   adj->ref_nr = 1;
adj->private = private;
dev_hold(adj_dev);
 
@@ -5683,22 +5682,21 @@ static void __netdev_adjacent_dev_remove(struct 
net_device *dev,
 
 static int __netdev_adjacent_dev_link_lists(struct net_device *dev,
struct net_device *upper_dev,
-   u16 ref_nr,
struct list_head *up_list,
struct list_head *down_list,
void *private, bool master)
 {
int ret;
 
-   ret = __netdev_adjacent_dev_insert(dev, upper_dev, ref_nr, up_list,
+   ret = __netdev_adjacent_dev_insert(dev, upper_dev, up_list,
   private, master);
if (ret)
return ret;
 
-   ret = __netdev_adjacent_dev_insert(upper_dev, dev, ref_nr, down_list,
+   ret = __netdev_adjacent_dev_insert(upper_dev, dev, down_list,
   private, false);
if (ret) {
-   __netdev_adjacent_dev_remove(dev, upper_dev, ref_nr, up_list);
+   __netdev_adjacent_dev_remove(dev, upper_dev, 1, up_list);
return ret;
}
 
@@ -5706,10 +5704,9 @@ static int __netdev_adjacent_dev_link_lists(struct 
net_device *dev,
 }
 
 static int __netdev_adjacent_dev_link(struct net_device *dev,
- struct net_device *upper_dev,
- u16 ref_nr)
+ struct net_device *upper_dev)
 {
-   return __netdev_adjacent_dev_link_lists(dev, upper_dev, ref_nr,
+   return __netdev_adjacent_dev_link_lists(dev, upper_dev,
&dev->all_adj_list.upper,
&upper_dev->all_adj_list.lower,
NULL, false);
@@ -5738,12 +5735,12 @@ static int __netdev_adjacent_dev_link_neighbour(struct 
net_device *dev,
struct net_device *upper_dev,
void *private, bool master)
 {
-   int ret = __netdev_adjacent_dev_link(dev, upper_dev, 1);
+   int ret = __netdev_adjacent_dev_link(dev, upper_dev);
 
if (ret)
return ret;
 
-   ret = __netdev_adjacent_dev_link_lists(dev, upper_dev, 1,
+   ret = __netdev_adjacent_dev_link_lists(dev, upper_dev,
   &dev->adj_list.upper,
   &upper_dev->adj_list.lower,
   private, master);
@@ -5812,7 +5809,7 @@ static int __netdev_upper_dev_link(struct net_device *dev,
list_for_each_entry(j, &upper_dev->all_adj_list.upper, list) {
pr_debug("Interlinking %s with %s, non-neighbour\n",
 i->dev->name, j->dev->name);
-   ret = __netdev_adjacent_dev_link(i->dev, j->dev, 
i->ref_nr);
+   ret = __netdev_adjacent_dev_link(i->dev, j->dev);
if (ret)
goto rollback_mesh;
}
@@ -5822,7 +5819,7 @@ static int __netdev_upper_dev_link(struct net_device *dev,
list_for_each_entry(i, &upper_dev->all_adj_list.upper, list) {
pr_debug("linking %s's upper

[PATCH net-next 00/11] net: Fix netdev adjacency tracking

2016-10-12 Thread David Ahern

The netdev adjacency tracking is failing to create proper dependencies
for some topologies. For example this topology

    +--------+
    |  myvrf |
    +--------+
      |    |
      |  +---------+
      |  | macvlan |
      |  +---------+
      |    |
  +----------+
  |  bridge  |
  +----------+
      |
  +--------+
  | bond1  |
  +--------+
      |
  +--------+
  |  eth3  |
  +--------+

hits 1 of 2 problems depending on the order of enslavement. The base set of
commands for both cases:

ip link add bond1 type bond
ip link set bond1 up
ip link set eth3 down
ip link set eth3 master bond1
ip link set eth3 up

ip link add bridge type bridge
ip link set bridge up
ip link add macvlan link bridge type macvlan
ip link set macvlan up

ip link add myvrf type vrf table 1234
ip link set myvrf up

ip link set bridge master myvrf

Case 1 enslave macvlan to the vrf before enslaving the bond to the bridge:

ip link set macvlan master myvrf
ip link set bond1 master bridge

Attempts to delete the VRF:
ip link delete myvrf

trigger the BUG in __netdev_adjacent_dev_remove:

[  587.405260] tried to remove device eth3 from myvrf
[  587.407269] ------------[ cut here ]------------
[  587.408918] kernel BUG at /home/dsa/kernel.git/net/core/dev.c:5661!
[  587.43] invalid opcode:  [#1] SMP
[  587.412454] Modules linked in: macvlan bridge stp llc bonding vrf
[  587.414765] CPU: 0 PID: 726 Comm: ip Not tainted 4.8.0+ #109
[  587.416766] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.7.5-20140531_083030-gandalf 04/01/2014
[  587.420241] task: 88013ab6eec0 task.stack: c9628000
[  587.422163] RIP: 0010:[]  [] 
__netdev_adjacent_dev_remove+0x40/0x12c
...
[  587.446053] Call Trace:
[  587.446424]  [] __netdev_adjacent_dev_unlink+0x20/0x3c
[  587.447390]  [] netdev_upper_dev_unlink+0xfa/0x15e
[  587.448297]  [] vrf_del_slave+0x13/0x2a [vrf]
[  587.449153]  [] vrf_dev_uninit+0xea/0x114 [vrf]
[  587.450036]  [] rollback_registered_many+0x22b/0x2da
[  587.450974]  [] unregister_netdevice_many+0x17/0x48
[  587.451903]  [] rtnl_delete_link+0x3c/0x43
[  587.452719]  [] rtnl_dellink+0x180/0x194

When the BUG is converted to a WARN_ON it shows 4 missing adjacencies:
  eth3 - myvrf, myvrf - eth3, bond1 - myvrf and myvrf - bond1

All of those are because the __netdev_upper_dev_link function does not
properly link macvlan lower devices to myvrf when it is enslaved.

The second case just flips the ordering of the enslavements:
ip link set bond1 master bridge
ip link set macvlan master myvrf

Then run:
ip link delete bond1
ip link delete myvrf

The vrf delete command hangs because myvrf has a reference that has not
been released. In this case the removal code does not account for 2 paths 
between eth3 and myvrf - one from bridge to vrf and the other through the
macvlan.

Rather than try to maintain a linked list of all upper and lower devices
per netdevice, only track the direct neighbors. The remaining stack can
be determined by recursively walking the neighbors.

The existing netdev_for_each_all_upper_dev_rcu,
netdev_for_each_all_lower_dev and netdev_for_each_all_lower_dev_rcu macros
are replaced with APIs that walk the upper and lower device lists. The
new APIs take a callback function and a data arg that is passed to the
callback for each device in the list. Drivers using the old macros are
converted in separate patches to make it easier on reviewers. It is an
API conversion only; no functional change is intended.
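
For illustration only, and not part of this series: a minimal userspace
sketch of the idea described above, where each device records just its
direct upper neighbors and the full set of upper devices is found by
walking them recursively, calling a callback on each and stopping at the
first non-zero return. The names struct toy_dev, toy_walk_all_upper(),
toy_has_upper() and toy_count_upper_visits() are hypothetical stand-ins,
not the kernel API.

```c
#include <stddef.h>

#define TOY_MAX_UPPER 4

/* Hypothetical stand-in for struct net_device: direct upper neighbors only. */
struct toy_dev {
	const char *name;
	struct toy_dev *upper[TOY_MAX_UPPER];
	size_t n_upper;
};

typedef int (*toy_walk_fn)(struct toy_dev *dev, void *data);

/*
 * Depth-first walk over everything above @dev. The callback runs on each
 * direct upper device first, then the walk recurses into that device's own
 * upper neighbors. A non-zero return from the callback ends the walk and
 * is handed back to the caller.
 */
int toy_walk_all_upper(struct toy_dev *dev, toy_walk_fn fn, void *data)
{
	size_t i;
	int ret;

	for (i = 0; i < dev->n_upper; i++) {
		struct toy_dev *udev = dev->upper[i];

		ret = fn(udev, data);
		if (ret)
			return ret;

		ret = toy_walk_all_upper(udev, fn, data);
		if (ret)
			return ret;
	}

	return 0;
}

/* Callback: report a match by returning non-zero, which stops the walk. */
static int toy_match(struct toy_dev *dev, void *data)
{
	return dev == (struct toy_dev *)data;
}

int toy_has_upper(struct toy_dev *dev, struct toy_dev *target)
{
	return toy_walk_all_upper(dev, toy_match, target) != 0;
}

/* Callback: count every visit and never stop the walk. */
static int toy_count(struct toy_dev *dev, void *data)
{
	(void)dev;
	(*(int *)data)++;
	return 0;
}

int toy_count_upper_visits(struct toy_dev *dev)
{
	int visits = 0;

	toy_walk_all_upper(dev, toy_count, &visits);
	return visits;
}
```

In this sketch, with the example topology (eth3 -> bond1 -> bridge ->
myvrf, plus bridge -> macvlan -> myvrf), a walk starting at eth3 calls
the callback for myvrf twice, once per path.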

DaveM: Given the impact of this bug (both cases requiring a reboot) I
would like to get this backported to at least the 4.8 tree which as I 
understand it has been targeted as the next LTS.

David Ahern (11):
  net: Remove refnr arg when inserting link adjacencies
  net: Introduce new api for walking upper and lower devices
  net: bonding: Flip to the new dev walk API
  IB/core: Flip to the new dev walk API
  IB/ipoib: Flip to new dev walk API
  ixgbe: Flip to the new dev walk API
  mlxsw: Flip to the new dev walk API
  rocker: Flip to the new dev walk API
  net: Remove all_adj_list and its references
  net: Add warning if any lower device is still in adjacency list
  net: dev: Improve debug statements for adjacency tracking

 drivers/infiniband/core/core_priv.h|   9 +-
 drivers/infiniband/core/roce_gid_mgmt.c|  42 +--
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |  37 ++-
 drivers/net/bonding/bond_alb.c |  82 +++---
 drivers/net/bonding/bond_main.c|  21 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 132 ++
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |  37 ++-
 drivers/net/ethernet/rocker/rocker_main.c  |  31 ++-
 include/linux/netdevice.h  |  38 ++-
 net/core/dev.c | 349 -
 10 files changed, 428

[PATCH net-next 02/11] net: Introduce new api for walking upper and lower devices

2016-10-12 Thread David Ahern

This patch introduces netdev_walk_all_upper_dev_rcu,
netdev_walk_all_lower_dev and netdev_walk_all_lower_dev_rcu. These
functions recursively walk the adj_list of devices to determine all upper
and lower devices.

The functions take a callback function that is invoked for each device
in the list. If the callback returns non-0, the walk is terminated and
the functions return that code back to callers.

Signed-off-by: David Ahern 
---
 include/linux/netdevice.h |  17 +
 net/core/dev.c| 158 ++
 2 files changed, 175 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 136ae6bbe81e..053f6f75f26a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3839,6 +3839,14 @@ struct net_device 
*netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
 updev; \
 updev = netdev_all_upper_get_next_dev_rcu(dev, &(iter)))
 
+int netdev_walk_all_upper_dev_rcu(struct net_device *dev,
+ int (*fn)(struct net_device *lower_dev,
+   void *data),
+ void *data);
+
+bool netdev_has_upper_dev_all_rcu(struct net_device *dev,
+ struct net_device *upper_dev);
+
 void *netdev_lower_get_next_private(struct net_device *dev,
struct list_head **iter);
 void *netdev_lower_get_next_private_rcu(struct net_device *dev,
@@ -3882,6 +3890,15 @@ struct net_device *netdev_all_lower_get_next_rcu(struct 
net_device *dev,
 ldev; \
 ldev = netdev_all_lower_get_next_rcu(dev, &(iter)))
 
+int netdev_walk_all_lower_dev(struct net_device *dev,
+ int (*fn)(struct net_device *lower_dev,
+   void *data),
+ void *data);
+int netdev_walk_all_lower_dev_rcu(struct net_device *dev,
+ int (*fn)(struct net_device *lower_dev,
+   void *data),
+ void *data);
+
 void *netdev_adjacent_get_private(struct list_head *adj_list);
 void *netdev_lower_get_first_private_rcu(struct net_device *dev);
 struct net_device *netdev_master_upper_dev_get(struct net_device *dev);
diff --git a/net/core/dev.c b/net/core/dev.c
index 5aaa12a9e369..fcf3641db783 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5292,6 +5292,37 @@ bool netdev_has_upper_dev(struct net_device *dev,
 EXPORT_SYMBOL(netdev_has_upper_dev);
 
 /**
+ * netdev_has_upper_dev_all - Check if device is linked to an upper device
+ * @dev: device
+ * @upper_dev: upper device to check
+ *
+ * Find out if a device is linked to specified upper device and return true
+ * in case it is. Note that this checks the entire upper device chain.
+ * The caller must hold rcu lock.
+ */
+
+static int __netdev_has_upper_dev(struct net_device *upper_dev, void *data)
+{
+   struct net_device *dev = (struct net_device *)data;
+
+   if (upper_dev == dev)
+   return 1;
+
+   return 0;
+}
+
+bool netdev_has_upper_dev_all_rcu(struct net_device *dev,
+ struct net_device *upper_dev)
+{
+   if (netdev_walk_all_upper_dev_rcu(dev, __netdev_has_upper_dev,
+ upper_dev))
+   return true;
+
+   return false;
+}
+EXPORT_SYMBOL(netdev_has_upper_dev_all_rcu);
+
+/**
  * netdev_has_any_upper_dev - Check if device is linked to some device
  * @dev: device
  *
@@ -5391,6 +5422,51 @@ struct net_device 
*netdev_all_upper_get_next_dev_rcu(struct net_device *dev,
 }
 EXPORT_SYMBOL(netdev_all_upper_get_next_dev_rcu);
 
+static struct net_device *netdev_next_upper_dev_rcu(struct net_device *dev,
+   struct list_head **iter)
+{
+   struct netdev_adjacent *upper;
+
+   WARN_ON_ONCE(!rcu_read_lock_held() && !lockdep_rtnl_is_held());
+
+   upper = list_entry_rcu((*iter)->next, struct netdev_adjacent, list);
+
+   if (&upper->list == &dev->adj_list.upper)
+   return NULL;
+
+   *iter = &upper->list;
+
+   return upper->dev;
+}
+
+int netdev_walk_all_upper_dev_rcu(struct net_device *dev,
+ int (*fn)(struct net_device *dev,
+   void *data),
+ void *data)
+{
+   struct list_head *iter;
+   struct net_device *udev;
+   int ret;
+
+   for (iter = &(dev)->adj_list.upper,
+udev = netdev_next_upper_dev_rcu(dev, &(iter));
+udev;
+udev = netdev_next_upper_dev_rcu(dev, &(iter))) {
+   /* first is the upper device itself */
+   ret = fn(udev, data);
+   if (ret)
+   return ret;
+
+   /* then look at all of its upper devices */
+

[PATCH net-next 03/11] net: bonding: Flip to the new dev walk API

2016-10-12 Thread David Ahern

Convert alb_send_learning_packets and bond_has_this_ip to use the new
netdev_walk_all_upper_dev_rcu API. In both cases this is just a move
to the new API; no functional change is intended.

Signed-off-by: David Ahern 
---
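
For illustration only, and not part of this patch: a minimal userspace
sketch of the conversion pattern used here, where the variables that the
old loop body read directly are packed into a context struct and handed
to the per-device callback through the void *data argument. The names
struct toy_upper, struct toy_walk_data, toy_upper_walk() and
toy_send_learning() are hypothetical, and a counter stands in for the
alb_send_lp_vid() calls.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an upper device seen by the walk. */
struct toy_upper {
	bool is_vlan;
	unsigned char addr[6];
};

/* Everything the callback needs, bundled so it fits in one void pointer. */
struct toy_walk_data {
	const unsigned char *mac_addr;
	bool strict_match;
	int sent;	/* stands in for sending a learning packet */
};

/* Per-device callback: the former loop body, reading its inputs from data. */
static int toy_upper_walk(struct toy_upper *upper, void *data)
{
	struct toy_walk_data *d = data;

	if (!upper->is_vlan)
		return 0;

	/* Strict mode sends only on an address match; otherwise always send. */
	if (!d->strict_match || memcmp(d->mac_addr, upper->addr, 6) == 0)
		d->sent++;

	return 0;	/* zero means: keep walking */
}

/* Caller: fill in the context once, then let the walk drive the callback. */
int toy_send_learning(struct toy_upper *uppers, size_t n,
		      const unsigned char *mac_addr, bool strict_match)
{
	struct toy_walk_data data = {
		.strict_match = strict_match,
		.mac_addr = mac_addr,
		.sent = 0,
	};
	size_t i;

	for (i = 0; i < n; i++) {
		if (toy_upper_walk(&uppers[i], &data))
			break;
	}

	return data.sent;
}
```
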
 drivers/net/bonding/bond_alb.c  | 82 ++---
 drivers/net/bonding/bond_main.c | 21 +++
 2 files changed, 65 insertions(+), 38 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 551f0f8dead3..1ddedec61900 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -950,13 +950,61 @@ static void alb_send_lp_vid(struct slave *slave, u8 
mac_addr[],
dev_queue_xmit(skb);
 }
 
+struct alb_walk_data {
+   struct bonding *bond;
+   struct slave *slave;
+   u8 *mac_addr;
+   bool strict_match;
+};
+
+static int alb_upper_dev_walk(struct net_device *upper, void *data)
+{
+   struct alb_walk_data *_data = (struct alb_walk_data *)data;
+   bool strict_match = _data->strict_match;
+   struct bonding *bond = _data->bond;
+   struct slave *slave = _data->slave;
+   u8 *mac_addr = _data->mac_addr;
+   struct bond_vlan_tag *tags;
+
+   if (is_vlan_dev(upper) && vlan_get_encap_level(upper) == 0) {
+   if (strict_match &&
+   ether_addr_equal_64bits(mac_addr,
+   upper->dev_addr)) {
+   alb_send_lp_vid(slave, mac_addr,
+   vlan_dev_vlan_proto(upper),
+   vlan_dev_vlan_id(upper));
+   } else if (!strict_match) {
+   alb_send_lp_vid(slave, upper->dev_addr,
+   vlan_dev_vlan_proto(upper),
+   vlan_dev_vlan_id(upper));
+   }
+   }
+
+   /* If this is a macvlan device, then only send updates
+* when strict_match is turned off.
+*/
+   if (netif_is_macvlan(upper) && !strict_match) {
+   tags = bond_verify_device_path(bond->dev, upper, 0);
+   if (IS_ERR_OR_NULL(tags))
+   BUG();
+   alb_send_lp_vid(slave, upper->dev_addr,
+   tags[0].vlan_proto, tags[0].vlan_id);
+   kfree(tags);
+   }
+
+   return 0;
+}
+
 static void alb_send_learning_packets(struct slave *slave, u8 mac_addr[],
  bool strict_match)
 {
struct bonding *bond = bond_get_bond_by_slave(slave);
-   struct net_device *upper;
-   struct list_head *iter;
-   struct bond_vlan_tag *tags;
+   struct alb_walk_data data = {
+   .strict_match = strict_match,
+   .mac_addr = mac_addr,
+   .slave = slave,
+   .bond = bond,
+   };
 
/* send untagged */
alb_send_lp_vid(slave, mac_addr, 0, 0);
@@ -965,33 +1013,7 @@ static void alb_send_learning_packets(struct slave 
*slave, u8 mac_addr[],
 * for that device.
 */
rcu_read_lock();
-   netdev_for_each_all_upper_dev_rcu(bond->dev, upper, iter) {
-   if (is_vlan_dev(upper) && vlan_get_encap_level(upper) == 0) {
-   if (strict_match &&
-   ether_addr_equal_64bits(mac_addr,
-   upper->dev_addr)) {
-   alb_send_lp_vid(slave, mac_addr,
-   vlan_dev_vlan_proto(upper),
-   vlan_dev_vlan_id(upper));
-   } else if (!strict_match) {
-   alb_send_lp_vid(slave, upper->dev_addr,
-   vlan_dev_vlan_proto(upper),
-   vlan_dev_vlan_id(upper));
-   }
-   }
-
-   /* If this is a macvlan device, then only send updates
-* when strict_match is turned off.
-*/
-   if (netif_is_macvlan(upper) && !strict_match) {
-   tags = bond_verify_device_path(bond->dev, upper, 0);
-   if (IS_ERR_OR_NULL(tags))
-   BUG();
-   alb_send_lp_vid(slave, upper->dev_addr,
-   tags[0].vlan_proto, tags[0].vlan_id);
-   kfree(tags);
-   }
-   }
+   netdev_walk_all_upper_dev_rcu(bond->dev, alb_upper_dev_walk, &data);
rcu_read_unlock();
 }
 
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 3f31ca32f52b..2b4134d5e081 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2270,22 +2270,27 @@ static void bond_mii_monitor(struct work_struct *work)
}
 }

[PATCH] net: ipv4: Do not drop to make_route if oif is l3mdev

2016-10-12 Thread David Ahern

Commit e0d56fdd7342 was a bit aggressive removing l3mdev calls in
the IPv4 stack. If the fib_lookup fails we do not want to drop to
make_route if the oif is an l3mdev device.

Also reverts 19664c6a0009 ("net: l3mdev: Remove netif_index_is_l3_master")
which removed netif_index_is_l3_master.

Fixes: e0d56fdd7342 ("net: l3mdev: remove redundant calls")
Signed-off-by: David Ahern 
---
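
For illustration only, and not part of this patch: a minimal sketch of
the decision the route.c hunk changes, reduced to plain values. The enum
and toy_route_output() are hypothetical; the real code works on struct
flowi4 and the fib_lookup() result.

```c
#include <stdbool.h>

enum toy_action {
	TOY_USE_FIB_RESULT,	/* lookup succeeded, use what it found */
	TOY_ASSUME_ON_LINK,	/* lookup failed, fall through to make_route */
	TOY_RETURN_ERROR,	/* lookup failed, report the error */
};

/*
 * With this patch, a failed lookup only falls through to the "destination
 * is on link" assumption when an output interface is set and that
 * interface is not an l3mdev master.
 */
enum toy_action toy_route_output(int lookup_err, int oif,
				 bool oif_is_l3_master)
{
	if (!lookup_err)
		return TOY_USE_FIB_RESULT;

	if (oif && !oif_is_l3_master)
		return TOY_ASSUME_ON_LINK;

	return TOY_RETURN_ERROR;
}
```
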
 include/net/l3mdev.h | 24 
 net/ipv4/route.c |  3 ++-
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/net/l3mdev.h b/include/net/l3mdev.h
index b220dabeab45..3832099289c5 100644
--- a/include/net/l3mdev.h
+++ b/include/net/l3mdev.h
@@ -114,6 +114,25 @@ static inline u32 l3mdev_fib_table(const struct net_device 
*dev)
return tb_id;
 }
 
+static inline bool netif_index_is_l3_master(struct net *net, int ifindex)
+{
+   struct net_device *dev;
+   bool rc = false;
+
+   if (ifindex == 0)
+   return false;
+
+   rcu_read_lock();
+
+   dev = dev_get_by_index_rcu(net, ifindex);
+   if (dev)
+   rc = netif_is_l3_master(dev);
+
+   rcu_read_unlock();
+
+   return rc;
+}
+
 struct dst_entry *l3mdev_link_scope_lookup(struct net *net, struct flowi6 
*fl6);
 
 static inline
@@ -207,6 +226,11 @@ static inline u32 l3mdev_fib_table_by_index(struct net 
*net, int ifindex)
return 0;
 }
 
+static inline bool netif_index_is_l3_master(struct net *net, int ifindex)
+{
+   return false;
+}
+
 static inline
 struct dst_entry *l3mdev_link_scope_lookup(struct net *net, struct flowi6 *fl6)
 {
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f2be689a6c85..62d4d90c1389 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2265,7 +2265,8 @@ struct rtable *__ip_route_output_key_hash(struct net 
*net, struct flowi4 *fl4,
if (err) {
res.fi = NULL;
res.table = NULL;
-   if (fl4->flowi4_oif) {
+   if (fl4->flowi4_oif &&
+   !netif_index_is_l3_master(net, fl4->flowi4_oif)) {
/* Apparently, routing tables are wrong. Assume,
   that the destination is on link.
 
-- 
2.1.4

[PATCH net-next] net: phy: Trigger state machine on state change and not polling.

2016-10-12 Thread Andrew Lunn

The phy_start() is used to indicate the PHY is now ready to do its
work. The state is changed, normally to PHY_UP which means that both
the MAC and the PHY are ready.

If the phy driver is using polling, when the next poll happens, the
state machine notices the PHY is now in PHY_UP, and kicks off
auto-negotiation, if needed.

If however, the PHY is using interrupts, there is no polling. The phy
is stuck in PHY_UP until the next interrupt comes along. And there is
no reason for the PHY to interrupt.

Have phy_start() schedule the state machine to run, which both speeds
up the polling use case, and makes the interrupt use case actually
work.

This problem exists whenever there is a state change which will not
cause an interrupt. Trigger the state machine in these cases,
e.g. phy_error().

Signed-off-by: Andrew Lunn 
Cc: Kyle Roeschley 
---

This should be applied to stable, but I've no idea what Fixes: tag to
use. It could be phylib has been broken since interrupts were added?
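
For illustration only, and not part of this patch: a minimal userspace
model of the problem described in the commit message. The names struct
toy_phy, toy_phy_start() and toy_tick() are hypothetical; a boolean
stands in for the queued state_queue work, and a "tick" stands in for
whatever would run the state machine next.

```c
#include <stdbool.h>

enum toy_phy_state { TOY_PHY_READY, TOY_PHY_UP, TOY_PHY_AN };

struct toy_phy {
	enum toy_phy_state state;
	bool uses_irq;		/* true: no periodic polling */
	bool work_pending;	/* models queued state machine work */
};

/* One pass of the state machine: PHY_UP kicks off auto-negotiation. */
static void toy_state_machine(struct toy_phy *phy)
{
	if (phy->state == TOY_PHY_UP)
		phy->state = TOY_PHY_AN;
}

/* Models the new helper: ask for the state machine to run soon. */
static void toy_trigger_machine(struct toy_phy *phy)
{
	phy->work_pending = true;
}

/* Models phy_start(); @trigger selects the behavior with the patch. */
void toy_phy_start(struct toy_phy *phy, bool trigger)
{
	phy->state = TOY_PHY_UP;
	if (trigger)
		toy_trigger_machine(phy);
}

/*
 * One scheduling opportunity. A polled PHY always runs the state machine;
 * an interrupt-driven PHY runs it only if work was explicitly queued.
 */
void toy_tick(struct toy_phy *phy)
{
	if (phy->work_pending || !phy->uses_irq) {
		phy->work_pending = false;
		toy_state_machine(phy);
	}
}
```

In this model an interrupt-driven PHY started without the trigger stays
in TOY_PHY_UP no matter how many ticks pass, while the polled PHY moves
on by itself at its next tick.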

 drivers/net/phy/phy.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index c6f66832a1a6..f424b867f73e 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -608,6 +608,21 @@ void phy_start_machine(struct phy_device *phydev)
 }
 
 /**
+ * phy_trigger_machine - trigger the state machine to run
+ *
+ * @phydev: the phy_device struct
+ *
+ * Description: There has been a change in state which requires that the
+ *   state machine runs.
+ */
+
+static void phy_trigger_machine(struct phy_device *phydev)
+{
+   cancel_delayed_work_sync(&phydev->state_queue);
+   queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, 0);
+}
+
+/**
  * phy_stop_machine - stop the PHY state machine tracking
  * @phydev: target phy_device struct
  *
@@ -639,6 +654,8 @@ static void phy_error(struct phy_device *phydev)
mutex_lock(&phydev->lock);
phydev->state = PHY_HALTED;
mutex_unlock(&phydev->lock);
+
+   phy_trigger_machine(phydev);
 }
 
 /**
@@ -800,8 +817,7 @@ void phy_change(struct work_struct *work)
}
 
/* reschedule state queue work to run as soon as possible */
-   cancel_delayed_work_sync(&phydev->state_queue);
-   queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, 0);
+   phy_trigger_machine(phydev);
return;
 
 ignore:
@@ -890,6 +906,8 @@ void phy_start(struct phy_device *phydev)
/* if phy was suspended, bring the physical link up again */
if (do_resume)
phy_resume(phydev);
+
+   phy_trigger_machine(phydev);
 }
 EXPORT_SYMBOL(phy_start);
 
-- 
2.9.3

[PATCH] net: wan: slic_ds26522: Allow driver to be built if COMPILE_TEST is enabled

2016-10-12 Thread Javier Martinez Canillas

The driver has only a runtime dependency, not a build-time one, on
FSL_SOC || ARCH_MXC || ARCH_LAYERSCAPE, so it can be built for testing
purposes if the COMPILE_TEST option is enabled.

This is useful to have more build coverage and make sure that the driver
is not affected by changes that could cause build regressions.

Signed-off-by: Javier Martinez Canillas 
---

 drivers/net/wan/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wan/Kconfig b/drivers/net/wan/Kconfig
index 33ab3345d333..4e9fe75d7067 100644
--- a/drivers/net/wan/Kconfig
+++ b/drivers/net/wan/Kconfig
@@ -294,7 +294,7 @@ config FSL_UCC_HDLC
 config SLIC_DS26522
tristate "Slic Maxim ds26522 card support"
depends on SPI
-   depends on FSL_SOC || ARCH_MXC || ARCH_LAYERSCAPE
+   depends on FSL_SOC || ARCH_MXC || ARCH_LAYERSCAPE || COMPILE_TEST
help
  This module initializes and configures the slic maxim card
  in T1 or E1 mode.
-- 
2.7.4

[PATCH 1/2] net: wan: slic_ds26522: add SPI device ID table to fix module autoload

2016-10-12 Thread Javier Martinez Canillas

If the driver is built as a module, module alias information isn't filled
so the module won't be autoloaded. Add a SPI device ID table and use the
MODULE_DEVICE_TABLE() macro so the information is exported in the module.

Before this patch:

$ modinfo drivers/net/wan/slic_ds26522.ko | grep alias
$

After this patch:

$ modinfo drivers/net/wan/slic_ds26522.ko | grep alias
alias:  spi:ds26522

Signed-off-by: Javier Martinez Canillas 
---

 drivers/net/wan/slic_ds26522.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/wan/slic_ds26522.c b/drivers/net/wan/slic_ds26522.c
index d06a887a2352..53366a2232f0 100644
--- a/drivers/net/wan/slic_ds26522.c
+++ b/drivers/net/wan/slic_ds26522.c
@@ -223,6 +223,12 @@ static int slic_ds26522_probe(struct spi_device *spi)
return ret;
 }
 
+static const struct spi_device_id slic_ds26522_id[] = {
+   { .name = "ds26522" },
+   { /* sentinel */ },
+};
+MODULE_DEVICE_TABLE(spi, slic_ds26522_id);
+
 static const struct of_device_id slic_ds26522_match[] = {
{
 .compatible = "maxim,ds26522",
@@ -239,6 +245,7 @@ static struct spi_driver slic_ds26522_driver = {
   },
.probe = slic_ds26522_probe,
.remove = slic_ds26522_remove,
+   .id_table = slic_ds26522_id,
 };
 
 static int __init slic_ds26522_init(void)
-- 
2.7.4

[PATCH 2/2] net: wan: slic_ds26522: Export OF module alias information

2016-10-12 Thread Javier Martinez Canillas

When the device is registered via OF, the OF table is used to match the
driver instead of the SPI device ID table, but the entries in the latter
are used as aliases to load the module if the driver was not built-in.

This is because the SPI core always reports an SPI module alias instead
of an OF one, but that could change so it's better to always export it.

Signed-off-by: Javier Martinez Canillas 
---

 drivers/net/wan/slic_ds26522.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/wan/slic_ds26522.c b/drivers/net/wan/slic_ds26522.c
index 53366a2232f0..b776a0ab106c 100644
--- a/drivers/net/wan/slic_ds26522.c
+++ b/drivers/net/wan/slic_ds26522.c
@@ -235,6 +235,7 @@ static const struct of_device_id slic_ds26522_match[] = {
 },
{},
 };
+MODULE_DEVICE_TABLE(of, slic_ds26522_match);
 
 static struct spi_driver slic_ds26522_driver = {
.driver = {
-- 
2.7.4

Re: igb driver can cause cache invalidation of non-owned memory?

2016-10-12 Thread Alexander Duyck

On Wed, Oct 12, 2016 at 11:12 AM, Nikita Yushchenko
 wrote:
>> It would make more sense to update the DMA API for
>> __dma_page_cpu_to_dev on ARM so that you don't invalidate the cache if
>> the direction is DMA_FROM_DEVICE.
>
> No, in generic case it's unsafe.
>
> If CPU issued a write to a location, and sometime later that location is
> used as DMA buffer, there is danger that write is still in cache only,
> and writeback is pending. Later this writeback can overwrite data
> written to memory via DMA, causing corruption.

Okay so if I understand it correctly then the invalidation in
sync_for_device is to force any writes to be flushed out, and the
invalidation in sync_for_cpu is to flush out any speculative reads.
So without speculative reads then the sync_for_cpu portion is not
needed.  You might want to determine if the core you are using
supports the speculative reads, if not you might be able to get away
without having to do the sync_for_cpu at all.

>> The point I was trying to make is that you are invalidating the cache
>> in both the sync_for_device and sync_for_cpu.  Do you really need that
>> for ARM or do you need to perform the invalidation on sync_for_device
>> if that may be pushed out anyway?  If you aren't running with
>> speculative look-ups do you even need the invalidation in
>> sync_for_cpu?
>
> I'm not a low-level ARM guru and I don't know for sure. Probably another
> CPU core can be accessing locations in a neighboring page, causing a
> speculative load of locations in the DMA page.
>
>
>> Changing the driver code for this won't necessarily work on all
>> architectures, and on top of it we have some changes planned which
>> will end up making the pages writable in the future to support the
>> ongoing XDP effort.  That is one of the reasons why I would not be
>> okay with changing the driver to make this work.
>
> Well I was not really serious about removing that sync_for_device() in
> mainline :)   Although >20% throughput win that this provides is
> impressive...

I agree the improvement is pretty impressive.  The thing is, there are
similar gains we can generate on x86 by stripping out bits and pieces
that are needed for other architectures.  I just want to make
certain we aren't optimizing for one architecture to the detriment of
others.

> But what about doing something safer, e.g. adding a bit of tracking and
> only sync_for_device() what was previously sync_for_cpu()ed?  Will you
> accept that?
>
> Nikita

The problem is that as we move things over for XDP we will be looking
at having the CPU potentially write to any spot in the region that was
mapped as we could append headers to the front or pad data onto the
end of the frame.  It is probably safest for us to invalidate the
entire region just to make sure we don't have a collision with
something that is writing to the page.

So for example in the near future I am planning to expand out the
DMA_ATTR_SKIP_CPU_SYNC DMA attribute beyond just the ARM architecture
to see if I can expand it for use with SWIOTLB.  If we can use this on
unmap we might be able to solve some of the existing problems that
required us to make the page read-only since we could unmap the page
without invalidating any existing writes on the page.

- Alex

Re: [PATCH] iwlwifi: pcie: reduce "unsupported splx" to a warning

2016-10-12 Thread Chris Rorvick

On Wed, Oct 12, 2016 at 1:05 PM, Paul Bolle  wrote:
> On Wed, 2016-10-12 at 12:50 -0500, Chris Rorvick wrote:
>> This may already be apparent, but Dell sells two versions of the 9350:
>> one with the Broadcom adapter and one with the AC 8260.
>
> Off topic, for most readers: my version (with the AC 8260) came with
> Ubuntu preinstalled. Perhaps Chris' version came preinstalled with
> something else?

Exactly.  Mine came with Windows 10 and the Broadcom adapter, while
the "Developer Edition" comes preloaded with Ubuntu and has the AC
8260.  The Broadcom adapter has been supported since 4.4 but still
seems to have issues.  For example, I had it working at home but I was
not able to connect to a friend's AT&T U-Verse wireless gateway;
something was failing with -EBUSY.  I ordered the AC 8260 and it works
fine after the upgrade.

Chris

Re: [PATCH] IB/ipoib: move back the IB LL address into the hard header

2016-10-12 Thread Doug Ledford

On 10/11/2016 2:50 PM, Doug Ledford wrote:
> On 10/11/2016 2:30 PM, Jason Gunthorpe wrote:
>> On Tue, Oct 11, 2016 at 02:17:51PM -0400, Doug Ledford wrote:
>>
>>> Well, not exactly.  Even if we put 65520 into the scripts, the kernel
>>> will silently drop it down to 65504.  It actually won't require anyone
>>> change anything, they just won't get the full value.  I experimented
>>> with this in the past for other reasons and an overly large MTU setting
>>> just resulted in the max MTU.  I don't know if that's changed, but if it
>>> still works that way, this is much less of an issue than it might
>>> otherwise be.
>>
>> So it is just docs and relying on PMTU? That is not as bad..
>>
>> Still would be nice to avoid if at all possible..
> 
> I agree, but we have a test getting ready to commence.  We'll know
> shortly how much the reduced MTU affects things because they aren't
> going to alter any of their setup, just put the new kernel in place, and
> see what happens.
> 
> 

Long story short on the MTU stuff, the setups whined a bit about not
being able to set the desired MTU, used the new max MTU instead, and
things otherwise worked fine.  But, Paolo submitted a v2 patch that
removes this change, so it's all moot anyway.

-- 
Doug Ledford 
GPG Key ID: 0E572FDD




Re: igb driver can cause cache invalidation of non-owned memory?

2016-10-12 Thread Nikita Yushchenko

> It would make more sense to update the DMA API for
> __dma_page_cpu_to_dev on ARM so that you don't invalidate the cache if
> the direction is DMA_FROM_DEVICE.

No, in generic case it's unsafe.

If CPU issued a write to a location, and sometime later that location is
used as DMA buffer, there is danger that write is still in cache only,
and writeback is pending. Later this writeback can overwrite data
written to memory via DMA, causing corruption.
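
For illustration only: a minimal userspace model of the hazard described
above, using one memory word shadowed by one write-back cache line. All
names (struct toy_line, toy_cpu_write(), toy_dma_write() and so on) are
hypothetical; this is a sketch of the reasoning and not the ARM DMA
mapping code.

```c
#include <stdbool.h>

/* One word of RAM plus the single cache line that shadows it. */
struct toy_line {
	int mem;	/* what is in RAM */
	int cached;	/* what is in the cache line */
	bool valid;
	bool dirty;
};

/* CPU store: lands in the cache only, writeback is pending. */
void toy_cpu_write(struct toy_line *l, int v)
{
	l->cached = v;
	l->valid = true;
	l->dirty = true;
}

/* CPU load: served from the cache if valid, otherwise filled from RAM. */
int toy_cpu_read(struct toy_line *l)
{
	if (!l->valid) {
		l->cached = l->mem;
		l->valid = true;
		l->dirty = false;
	}
	return l->cached;
}

/* Device write: goes straight to RAM, the cache is not updated. */
void toy_dma_write(struct toy_line *l, int v)
{
	l->mem = v;
}

/* Eviction at some arbitrary later time: dirty data is written back. */
void toy_evict(struct toy_line *l)
{
	if (l->valid && l->dirty)
		l->mem = l->cached;
	l->valid = false;
	l->dirty = false;
}

/* Invalidate: drop the line without writing it back. */
void toy_invalidate(struct toy_line *l)
{
	l->valid = false;
	l->dirty = false;
}
```

In this model, a CPU write followed by a device write and then an
eviction leaves the old CPU value in RAM, which is the corruption case.
Invalidating the line before the device writes avoids it. A load that
fills the line after that invalidate but before the device writes leaves
stale data in the cache until the line is invalidated again, which is the
speculative-read concern raised elsewhere in this thread.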


> The point I was trying to make is that you are invalidating the cache
> in both the sync_for_device and sync_for_cpu.  Do you really need that
> for ARM or do you need to perform the invalidation on sync_for_device
> if that may be pushed out anyway?  If you aren't running with
> speculative look-ups do you even need the invalidation in
> sync_for_cpu?

I'm not a low-level ARM guru and I don't know for sure. Probably another
CPU core can be accessing locations in a neighboring page, causing a
speculative load of locations in the DMA page.


> Changing the driver code for this won't necessarily work on all
> architectures, and on top of it we have some changes planned which
> will end up making the pages writable in the future to support the
> ongoing XDP effort.  That is one of the reasons why I would not be
> okay with changing the driver to make this work.

Well I was not really serious about removing that sync_for_device() in
mainline :)   Although >20% throughput win that this provides is
impressive...

But what about doing something safer, e.g. adding a bit of tracking and
only sync_for_device() what was previously sync_for_cpu()ed?  Will you
accept that?

Nikita

Re: [PATCH] iwlwifi: pcie: reduce "unsupported splx" to a warning

2016-10-12 Thread Paul Bolle

On Wed, 2016-10-12 at 12:50 -0500, Chris Rorvick wrote:
> This may already be apparent, but Dell sells two versions of the 9350:
> one with the Broadcom adapter and one with the AC 8260.

Off topic, for most readers: my version (with the AC 8260) came with
Ubuntu preinstalled. Perhaps Chris' version came preinstalled with
something else?

Thanks,

Paul Bolle

Re: igb driver can cause cache invalidation of non-owned memory?

2016-10-12 Thread Alexander Duyck

On Wed, Oct 12, 2016 at 9:11 AM, Nikita Yushchenko
 wrote:
>>>> To get some throughput improvement, I propose removal of that
>>>> sync_for_device() before reusing buffer. Will you accept such a patch ;)
>>>
>>> Not one that gets rid of sync_for_device() in the driver.  From what I
>>> can tell there are some DMA APIs that use that to perform the
>>> invalidation on the region of memory so that it can be DMAed into.
>>> Without that we run the risk of having a race between something the
>>> CPU might have placed in the cache and something the device wrote into
>>> memory.  The sync_for_device() call is meant to invalidate the cache
>>> for the region so that when the device writes into memory there is no
>>> risk of that race.
>>
>> I'm not expert, but some thought...
>>
>> Just remember that the cpu can do speculative memory and cache line reads.
>> So you need to ensure there are no dirty cache lines when the receive
>> buffer is set up and no cache lines at all before looking at the frame.
>
> Exactly.
>
> And because of that, arm does cache invalidation both in sync_for_cpu()
> and sync_for_device().
>
>> If you can 100% guarantee the cpu hasn't dirtied the cache then I
>> think the invalidate prior to reusing the buffer can be skipped.
>> But I wouldn't want to debug that going wrong.
>
> I've written earlier in this thread why it is the case for igb - as long
> as Alexander's statement that igb's buffers are readonly for the stack
> is true. If Alexander's statement is wrong, then igb is vulnerable to
> issue I've described in the first message of this thread.
>
> Btw, we have been unable to trigger any breakage with that
> sync_for_device() removed for a day already. And our tests run with
> software checksumming, because on our hardware the i210 is wired to a
> DSA switch and in this config no i210 offloading works (the i210
> hardware gets frames with eDSA headers that it can't parse). Thus any
> breakage should be immediately discovered.
>
> And throughput measured by iperf rises from 650 to 850 Mbps.
>
> Nikita

The point I was trying to make is that you are invalidating the cache
in both the sync_for_device and sync_for_cpu.  Do you really need that
for ARM or do you need to perform the invalidation on sync_for_device
if that may be pushed out anyway?  If you aren't running with
speculative look-ups do you even need the invalidation in
sync_for_cpu?  The underlying problem lives in the code for
__dma_page_cpu_to_dev and __dma_page_dev_to_cpu.  It seems like they
are trying to solve for both speculative and non-speculative setups
and having "FIXME" comments in the code related to this kind of points
to your problems being there.

Changing the driver code for this won't necessarily work on all
architectures, and on top of it we have some changes planned which
will end up making the pages writable in the future to support the
ongoing XDP effort.  That is one of the reasons why I would not be
okay with changing the driver to make this work.

It would make more sense to update the DMA API for
__dma_page_cpu_to_dev on ARM so that you don't invalidate the cache if
the direction is DMA_FROM_DEVICE.

- Alex

Re: [PATCH] iwlwifi: pcie: reduce "unsupported splx" to a warning

2016-10-12 Thread Chris Rorvick

Hi Luca,

FYI, it seems that Google does not like your email as I'm not
receiving any of your messages in gmail.  Some responses below:

On Wed, 2016-10-12 at 15:24 +0300, Luca Coelho wrote:
> Hi Chris,
> On Tue, 2016-10-11 at 09:09 -0500, Chris Rorvick wrote:
> > On Tue, Oct 11, 2016 at 5:11 AM, Paul Bolle  wrote:
> > > > This is not coming from the NIC itself, but from the platform's ACPI
> > > > tables. Can you tell us which platform you are using?
> >
> > Interesting. I'm running a Dell XPS 13 9350. I replaced the
> > factory-provided Broadcom card with an AC 8260. I can update the
> > commit log to reflect this.
>
> Okay, so this makes sense. Those entries are probably formatted for
> the Broadcom card, which the iwlwifi driver obviously doesn't
> understand. The best we can do, as I already said, is to ignore values
> we don't understand.

This may already be apparent, but Dell sells two versions of the 9350:
one with the Broadcom adapter and one with the AC 8260.  I just
happened to find the former version at a deep discount at Costco so
decided to chance it.  Turns out the Broadcom card is not so good even
with new kernels so I upgraded.  Anyway, since Paul is seeing the same
issue I don't think the values are intended to be Broadcom-specific.

On Wed, 2016-10-12 at 17:21 +0300, Luca Coelho wrote:
> And, the values in the SPLX structs are being changed here, to DOM1,
> LIM1, TIM1 etc., before being returned.  This also matches your
> description that, at runtime, you got something different than the pure
> dump.  If you follow these DOM*, LIM*, TIM* symbols, you'll probably
> end up getting the values you observed at runtime.

Probably not important, but it seems that there is some additional
indirection.  The only values I'm seeing associated with those symbols
are 8 and 16:

$ grep -e 'DOM[0-9]' -e 'LIM[0-9]' -e 'TIM[0-9]' dsdt.dsl | grep -v Store
DOM1,   8,
LIM1,   16,
TIM1,   16,
DOM2,   8,
LIM2,   16,
TIM2,   16,
DOM3,   8,
LIM3,   16,
TIM3,   16,

> I'll send you a patch for testing soon.

I will keep an eye on the list archive, thanks!

Chris

Re: [RFC] net: phy: smsc: Disable auto-negotiation on startup

2016-10-12 Thread Kyle Roeschley

On Tue, Oct 11, 2016 at 09:32:30AM -0500, Jeremy Linton wrote:
> On 10/10/2016 12:41 PM, Kyle Roeschley wrote:
> > Because the SMSC PHY completes auto-negotiation before the driver is
> > ready to handle interrupts, the PHY state machine never realizes that we
> > have a link. Clear the ANENABLE bit on initialization, which lets
> > genphy_config_aneg do its thing when that code is hit later.
> > 
> > While this patch does fix the problem we see (no link on boot without
> > re-plugging the cable), it seems like the generic PHY code should be
> > able to handle auto-negotiation completing before interrupts are
> > enabled. Submitted as an RFC in the hopes that someone has an idea as to
> > how that could be done.
> 
> Hi,
> 
>   Which smsc chip/driver? Maybe assuring the device interrupts are enabled
> before the phy is started is a solution?
> 
>   The whole problem sounds similar to what was recently happening in the
> smsc911x driver, but AFAIK that driver is basically only polling at this
> point so connecting the phy before the interrupts are enabled shouldn't be a
> problem.
> 

We're using the SMSC LAN8720A with the Cadence MACB ethernet controller.
Interrupts are enabled before the phy is started, but it looks like the patch
Florian pointed me to (https://www.spinics.net/lists/netdev/msg397857.html)
fixes my interrupt problem.

> > 
> > This fix is copied from commit 99f81afc139c ("phy: micrel: Disable auto
> > negotiation on startup").
> > 
> > Signed-off-by: Kyle Roeschley 
> > ---
> >  drivers/net/phy/smsc.c | 10 ++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/drivers/net/phy/smsc.c b/drivers/net/phy/smsc.c
> > index b62c4aa..8de8011 100644
> > --- a/drivers/net/phy/smsc.c
> > +++ b/drivers/net/phy/smsc.c
> > @@ -62,6 +62,16 @@ static int smsc_phy_config_init(struct phy_device 
> > *phydev)
> > return rc;
> > }
> > 
> > +   if (phy_interrupt_is_valid(phydev)) {
> > +   rc = phy_read(phydev, MII_BMCR);
> > +   if (rc < 0)
> > +   return rc;
> > +
> > +   rc = phy_write(phydev, MII_BMCR, rc & ~BMCR_ANENABLE);
> > +   if (rc < 0)
> > +   return rc;
> > +   }
> > +
> > return smsc_phy_ack_interrupt(phydev);
> >  }
> > 
> > 
> 

-- 
Kyle Roeschley
Software Engineer
National Instruments
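
For readers following the quoted patch, the read-modify-write it performs on
the BMCR register can be sketched in user space as below. This is only an
illustration: struct fake_phy and the fake_phy_* helpers are hypothetical
stand-ins for a real PHY and for phy_read()/phy_write(); the two register
constants use the standard MII values.

```c
#include <assert.h>
#include <stdint.h>

/* Standard MII values, as found in the kernel's mii.h. */
#define MII_BMCR       0x00    /* Basic mode control register */
#define BMCR_ANENABLE  0x1000  /* Enable auto-negotiation */

/* Hypothetical stand-in for a PHY: just a small register file. */
struct fake_phy {
	uint16_t regs[32];
};

/* Hypothetical counterpart of phy_read(); returns the register value. */
static int fake_phy_read(const struct fake_phy *phy, unsigned int reg)
{
	assert(reg < 32);
	return phy->regs[reg];
}

/* Hypothetical counterpart of phy_write(); returns 0 on success. */
static int fake_phy_write(struct fake_phy *phy, unsigned int reg, uint16_t val)
{
	assert(reg < 32);
	phy->regs[reg] = val;
	return 0;
}

/* Clear only the auto-negotiation enable bit; other BMCR bits are kept. */
static int fake_phy_disable_aneg(struct fake_phy *phy)
{
	int rc = fake_phy_read(phy, MII_BMCR);

	if (rc < 0)
		return rc;
	return fake_phy_write(phy, MII_BMCR, (uint16_t)(rc & ~BMCR_ANENABLE));
}
```

With BMCR initially 0x3100, calling fake_phy_disable_aneg() leaves 0x2100 in
the register, i.e. only the ANENABLE bit changes.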

Re: [RFC v3 3/3] phy,leds: add support for led triggers on phy link state change

2016-10-12 Thread Zach Brown

On Mon, Oct 10, 2016 at 02:03:32AM -0700, Florian Fainelli wrote:
> > +
> > +#ifdef CONFIG_LED_TRIGGER_PHY
> > +
> > +#include 
> > +#include 
> > +
> > +#define PHY_LINK_LED_MAX_TRIGGERS  5
> > +#define PHY_LED_TRIGGER_SPEED_SUFFIX_SIZE  7
> > +#define PHY_MII_BUS_ID_SIZE(20 - 3)
> 
> This particular constant may be something worth moving to
> include/linux/phy.h eventually.
> -- 
> Florian

MII_BUS_ID_SIZE is defined in include/linux/phy.h but it's defined after
phy_led_triggers.h is included so phy_led_triggers.h doesn't have access.
I could move the definition of MII_BUS_ID_SIZE above the include, but that
seemed ugly. Do you have any suggestions?

[PATCH] net: limit a number of namespaces which can be cleaned up concurrently

2016-10-12 Thread Andrei Vagin

From: Andrey Vagin 

The operation of destroying netns is heavy and it is executed under
net_mutex. If many namespaces are destroyed concurrently, net_mutex can
be locked for a long time. It is impossible to create a new netns during
this period of time.

Now that userns allows unprivileged users to create network namespaces,
this can be a real problem.

On my laptop (Fedora 24, i5-5200U, 12GB) 1000 namespaces require about
300MB of RAM and take 8 seconds to destroy.

In this patch, the number of namespaces which can be cleaned up
concurrently is limited to 32. net_mutex is released after handling each
batch of net namespaces and then it is locked again to handle the next
one. This allows other users to take it without waiting for a long
time.
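
The batching idea can be modeled in user space roughly as follows. This is a
sketch under assumptions, not the kernel code: struct pending_ns, take_batch()
and list_len() are hypothetical names, a singly linked list stands in for the
kernel's list_head, and the "more" flag plays the role of re-queueing the
work item. Unlike the patch, which re-queues whenever the limit is hit, the
sketch only reports more work when entries actually remain.

```c
#include <assert.h>
#include <stddef.h>

#define CLEANUP_BATCH 32	/* the limit used in the patch */

/* Hypothetical stand-in for a namespace queued for destruction. */
struct pending_ns {
	struct pending_ns *next;
};

/*
 * Detach at most `limit` entries from the front of *pending and return
 * them as a separate list. *more is set when entries remain, i.e. when
 * the caller should schedule another pass.
 */
static struct pending_ns *take_batch(struct pending_ns **pending, int limit,
				     int *more)
{
	struct pending_ns *batch = *pending;
	struct pending_ns *last = NULL;
	struct pending_ns *cur = *pending;
	int n = 0;

	assert(limit > 0);
	while (cur && n < limit) {
		last = cur;
		cur = cur->next;
		n++;
	}
	if (last)
		last->next = NULL;	/* cut the batch off the pending list */
	*pending = cur;
	*more = (cur != NULL);
	return batch;
}

/* Count the entries of a list; only used to inspect the result. */
static int list_len(const struct pending_ns *p)
{
	int n = 0;

	for (; p; p = p->next)
		n++;
	return n;
}
```

With 70 queued entries, three passes would take 32, 32 and 6 entries, and the
lock protecting the real work could be dropped between the passes.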

I am not sure whether we need to add a sysctl to customize this limit.
Let me know if you think it's required.

Cc: "David S. Miller" 
Cc: "Eric W. Biederman" 
Signed-off-by: Andrei Vagin 
---
 net/core/net_namespace.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 989434f..33dd3b7 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -406,10 +406,20 @@ static void cleanup_net(struct work_struct *work)
struct net *net, *tmp;
struct list_head net_kill_list;
LIST_HEAD(net_exit_list);
+   int i = 0;
 
/* Atomically snapshot the list of namespaces to cleanup */
spin_lock_irq(&cleanup_list_lock);
-   list_replace_init(&cleanup_list, &net_kill_list);
+   list_for_each_entry_safe(net, tmp, &cleanup_list, cleanup_list)
+   if (++i == 32)
+   break;
+   if (i == 32) {
+   list_cut_position(&net_kill_list,
+ &cleanup_list, &net->cleanup_list);
+   queue_work(netns_wq, work);
+   } else {
+   list_replace_init(&cleanup_list, &net_kill_list);
+   }
spin_unlock_irq(&cleanup_list_lock);
 
mutex_lock(&net_mutex);
-- 
2.7.4

[PATCH net] ipv6: tcp: restore IP6CB for pktoptions skbs

2016-10-12 Thread Eric Dumazet

From: Eric Dumazet 

Baozeng Ding reported the following KASAN splat:

BUG: KASAN: use-after-free in ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 at 
addr 880029c84ec8
Read of size 1 by task poc/25548
Call Trace:
 [] dump_stack+0x12e/0x185 /lib/dump_stack.c:15
 [< inline >] print_address_description /mm/kasan/report.c:204
 [] kasan_report_error+0x48b/0x4b0 /mm/kasan/report.c:283
 [< inline >] kasan_report /mm/kasan/report.c:303
 [] __asan_report_load1_noabort+0x3e/0x40 
/mm/kasan/report.c:321
 [] ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 
/net/ipv6/datagram.c:687
 [] ip6_datagram_recv_ctl+0x33/0x40
 [] do_ipv6_getsockopt.isra.4+0xaec/0x2150
 [] ipv6_getsockopt+0x116/0x230
 [] tcp_getsockopt+0x82/0xd0 /net/ipv4/tcp.c:3035
 [] sock_common_getsockopt+0x95/0xd0 /net/core/sock.c:2647
 [< inline >] SYSC_getsockopt /net/socket.c:1776
 [] SyS_getsockopt+0x142/0x230 /net/socket.c:1758
 [] entry_SYSCALL_64_fastpath+0x23/0xc6
Memory state around the buggy address:
 880029c84d80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 880029c84e00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 880029c84e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ^
 880029c84f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 880029c84f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

He also provided a syzkaller reproducer.

The issue is that ip6_datagram_recv_specific_ctl() expects to find IP6CB
data at the beginning of skb->cb, but tcp_v6_rcv() has moved it to a
different place.

This patch moves tcp_v6_restore_cb() up and calls it from
tcp_v6_do_rcv() when np->pktoptions is set.
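
A simplified user-space model of the problem may help here. It is only a
sketch: struct ip6_parm, struct tcp_cb, struct packet and the two helpers are
hypothetical stand-ins for struct inet6_skb_parm, struct tcp_skb_cb, the skb
control block and tcp_v6_fill_cb()/tcp_v6_restore_cb(). The point it shows is
that both layers share one scratch area, so the IPv6 view is stale until the
saved copy is moved back to the beginning.

```c
#include <assert.h>
#include <string.h>

/* Plays the role of struct inet6_skb_parm. */
struct ip6_parm {
	int iif;
	unsigned short dst0;
};

/* Plays the role of struct tcp_skb_cb. */
struct tcp_cb {
	unsigned int seq;
	unsigned int end_seq;
	struct ip6_parm h6;	/* saved copy of the IPv6 data */
};

/* Scratch area shared by both layers, like skb->cb. */
struct packet {
	union {
		struct ip6_parm ip6;	/* IPv6 view: data at offset 0 */
		struct tcp_cb tcp;	/* TCP view: own fields at offset 0 */
	} cb;
};

/* Save the IPv6 data further back, then reuse the start for TCP fields. */
static void tcp_fill_cb(struct packet *p, unsigned int seq, unsigned int len)
{
	assert(p != NULL);
	memmove(&p->cb.tcp.h6, &p->cb.ip6, sizeof(struct ip6_parm));
	p->cb.tcp.seq = seq;		/* clobbers what the IPv6 view sees */
	p->cb.tcp.end_seq = seq + len;
}

/* Move the saved IPv6 data back to where IPv6 code expects it. */
static void tcp_restore_cb(struct packet *p)
{
	assert(p != NULL);
	memmove(&p->cb.ip6, &p->cb.tcp.h6, sizeof(struct ip6_parm));
}
```

In this model, reading p->cb.ip6 between tcp_fill_cb() and tcp_restore_cb()
returns TCP sequence data rather than the IPv6 fields, which is the kind of
stale read the patch avoids for skbs stored in np->pktoptions.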

Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Signed-off-by: Eric Dumazet 
Reported-by: Baozeng Ding 
---
 net/ipv6/tcp_ipv6.c |   20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 54cf7197c7ab..5a27ab4eab39 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1190,6 +1190,16 @@ static struct sock *tcp_v6_syn_recv_sock(const struct 
sock *sk, struct sk_buff *
return NULL;
 }
 
+static void tcp_v6_restore_cb(struct sk_buff *skb)
+{
+   /* We need to move header back to the beginning if xfrm6_policy_check()
+* and tcp_v6_fill_cb() are going to be called again.
+* ip6_datagram_recv_specific_ctl() also expects IP6CB to be there.
+*/
+   memmove(IP6CB(skb), &TCP_SKB_CB(skb)->header.h6,
+   sizeof(struct inet6_skb_parm));
+}
+
 /* The socket must have it's spinlock held when we get
  * here, unless it is a TCP_LISTEN socket.
  *
@@ -1319,6 +1329,7 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff 
*skb)
np->flow_label = ip6_flowlabel(ipv6_hdr(opt_skb));
if (ipv6_opt_accepted(sk, opt_skb, &TCP_SKB_CB(opt_skb)->header.h6)) {
skb_set_owner_r(opt_skb, sk);
+   tcp_v6_restore_cb(opt_skb);
opt_skb = xchg(&np->pktoptions, opt_skb);
} else {
__kfree_skb(opt_skb);
@@ -1352,15 +1363,6 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const 
struct ipv6hdr *hdr,
TCP_SKB_CB(skb)->sacked = 0;
 }
 
-static void tcp_v6_restore_cb(struct sk_buff *skb)
-{
-   /* We need to move header back to the beginning if xfrm6_policy_check()
-* and tcp_v6_fill_cb() are going to be called again.
-*/
-   memmove(IP6CB(skb), &TCP_SKB_CB(skb)->header.h6,
-   sizeof(struct inet6_skb_parm));
-}
-
 static int tcp_v6_rcv(struct sk_buff *skb)
 {
const struct tcphdr *th;

Re: igb driver can cause cache invalidation of non-owned memory?

2016-10-12 Thread Nikita Yushchenko

>>> To get some throughput improvement, I propose removal of that
>>> sync_for_device() before reusing buffer. Will you accept such a patch ;)
>>
>> Not one that gets rid of sync_for_device() in the driver.  From what I
>> can tell there are some DMA APIs that use that to perform the
>> invalidation on the region of memory so that it can be DMAed into.
>> Without that we run the risk of having a race between something the
>> CPU might have placed in the cache and something the device wrote into
>> memory.  The sync_for_device() call is meant to invalidate the cache
>> for the region so that when the device writes into memory there is no
>> risk of that race.
> 
> I'm not expert, but some thought...
> 
> Just remember that the cpu can do speculative memory and cache line reads.
> So you need to ensure there are no dirty cache lines when the receive
> buffer is setup and no cache lines at all at before looking at the frame.

Exactly.

And because of that, arm does cache invalidation both in sync_for_cpu()
and sync_for_device().

> If you can 100% guarantee the cpu hasn't dirtied the cache then I
> think the invalidate prior to reusing the buffer can be skipped.
> But I wouldn't want to debug that going wrong.

I've written earlier in this thread why this is the case for igb - as long
as Alexander's statement that igb's buffers are read-only for the stack
is true. If Alexander's statement is wrong, then igb is vulnerable to
the issue I've described in the first message of this thread.

Btw, we have been unable to trigger any breakage with that
sync_for_device() removed for a day already. And our tests run with
software checksumming, because on our hardware the i210 is wired to a DSA
switch and in this config no i210 offloading works (the i210 hardware gets
frames with eDSA headers that it can't parse). Thus any breakage should be
immediately discovered.

And throughput measured by iperf rises from 650 to 850 Mbps.

Nikita

Re: [PATCH iproute2 1/1] tc filters: add support to get individual filters by handle

2016-10-12 Thread Cong Wang

On Mon, Oct 10, 2016 at 9:45 AM, Jamal Hadi Salim  wrote:
>  tc/tc_filter.c | 185 
> +
>  1 file changed, 175 insertions(+), 10 deletions(-)

Please update man/man8/tc.8 too, though that can be a separate patch. ;)

[PATCH net v2] netvsc: fix checksum on UDP IPV6

2016-10-12 Thread Stephen Hemminger

From: Stephen Hemminger 

The software calculation of the UDP checksum in the netvsc driver
only handled the IPv4 case. Instead, use common code to recompute
the checksum as needed for all protocols.
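
For reference, the arithmetic behind a software checksum is the same for IPv4
and IPv6; only the pseudo-header fed into it differs. The sketch below shows
the generic one's-complement sum in user space. It is not the driver code and
not skb_checksum_help(); csum_add_bytes(), csum_fold16() and udp_csum_finish()
are hypothetical names, and building the pseudo-header is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold an accumulator into 16 bits, adding the carries back in. */
static uint16_t csum_fold16(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Add `len` bytes, read as big-endian 16-bit words, to a running sum.
 * A trailing odd byte is padded with zero. `sum` carries in a partial
 * sum, for example one computed over a pseudo-header.
 */
static uint64_t csum_add_bytes(const uint8_t *data, size_t len, uint64_t sum)
{
	assert(data != NULL || len == 0);
	while (len > 1) {
		sum += ((uint32_t)data[0] << 8) | data[1];
		data += 2;
		len -= 2;
	}
	if (len)
		sum += (uint32_t)data[0] << 8;
	return sum;
}

/* Final checksum value; UDP sends 0xffff in place of a computed zero. */
static uint16_t udp_csum_finish(uint64_t sum)
{
	uint16_t csum = (uint16_t)~csum_fold16(sum);

	return csum ? csum : 0xffff;
}
```

The zero-to-0xffff substitution in udp_csum_finish() corresponds to the
CSUM_MANGLED_0 handling in the code the patch removes.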

Signed-off-by: Stephen Hemminger 
---
v2 remove accidental udp.h inclusion

 drivers/net/hyperv/netvsc_drv.c | 71 -
 1 file changed, 21 insertions(+), 50 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 52eeb2f..f0919bd 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -442,8 +442,6 @@ static int netvsc_start_xmit(struct sk_buff *skb, struct 
net_device *net)
}
 
net_trans_info = get_net_transport_info(skb, &hdr_offset);
-   if (net_trans_info == TRANSPORT_INFO_NOT_IP)
-   goto do_send;
 
/*
 * Setup the sendside checksum offload only if this is not a
@@ -478,56 +476,29 @@ static int netvsc_start_xmit(struct sk_buff *skb, struct 
net_device *net)
}
lso_info->lso_v2_transmit.tcp_header_offset = hdr_offset;
lso_info->lso_v2_transmit.mss = skb_shinfo(skb)->gso_size;
-   goto do_send;
-   }
-
-   if ((skb->ip_summed == CHECKSUM_NONE) ||
-   (skb->ip_summed == CHECKSUM_UNNECESSARY))
-   goto do_send;
-
-   rndis_msg_size += NDIS_CSUM_PPI_SIZE;
-   ppi = init_ppi_data(rndis_msg, NDIS_CSUM_PPI_SIZE,
-   TCPIP_CHKSUM_PKTINFO);
-
-   csum_info = (struct ndis_tcp_ip_checksum_info *)((void *)ppi +
-   ppi->ppi_offset);
-
-   if (net_trans_info & (INFO_IPV4 << 16))
-   csum_info->transmit.is_ipv4 = 1;
-   else
-   csum_info->transmit.is_ipv6 = 1;
-
-   if (net_trans_info & INFO_TCP) {
-   csum_info->transmit.tcp_checksum = 1;
-   csum_info->transmit.tcp_header_offset = hdr_offset;
-   } else if (net_trans_info & INFO_UDP) {
-   /* UDP checksum offload is not supported on ws2008r2.
-* Furthermore, on ws2012 and ws2012r2, there are some
-* issues with udp checksum offload from Linux guests.
-* (these are host issues).
-* For now compute the checksum here.
-*/
-   struct udphdr *uh;
-   u16 udp_len;
-
-   ret = skb_cow_head(skb, 0);
-   if (ret)
-   goto no_memory;
-
-   uh = udp_hdr(skb);
-   udp_len = ntohs(uh->len);
-   uh->check = 0;
-   uh->check = csum_tcpudp_magic(ip_hdr(skb)->saddr,
- ip_hdr(skb)->daddr,
- udp_len, IPPROTO_UDP,
- csum_partial(uh, udp_len, 0));
-   if (uh->check == 0)
-   uh->check = CSUM_MANGLED_0;
-
-   csum_info->transmit.udp_checksum = 0;
+   } else if (skb->ip_summed == CHECKSUM_PARTIAL) {
+   if (net_trans_info & INFO_TCP) {
+   rndis_msg_size += NDIS_CSUM_PPI_SIZE;
+   ppi = init_ppi_data(rndis_msg, NDIS_CSUM_PPI_SIZE,
+   TCPIP_CHKSUM_PKTINFO);
+
+   csum_info = (struct ndis_tcp_ip_checksum_info *)((void *)ppi +
+   ppi->ppi_offset);
+
+   if (net_trans_info & (INFO_IPV4 << 16))
+   csum_info->transmit.is_ipv4 = 1;
+   else
+   csum_info->transmit.is_ipv6 = 1;
+
+   csum_info->transmit.tcp_checksum = 1;
+   csum_info->transmit.tcp_header_offset = hdr_offset;
+   } else {
+   /* UDP checksum (and other) offload is not supported. */
+   if (skb_checksum_help(skb))
+   goto drop;
+   }
}
 
-do_send:
/* Start filling in the page buffers with the rndis hdr */
rndis_msg->msg_len += rndis_msg_size;
packet->total_data_buflen = rndis_msg->msg_len;
-- 
2.9.3

Re: [RFC] net: phy: smsc: Disable auto-negotiation on startup

2016-10-12 Thread Kyle Roeschley

On Wed, Oct 12, 2016 at 02:13:06AM -0700, Florian Fainelli wrote:
> On 10/10/2016 10:41 AM, Kyle Roeschley wrote:
> > Because the SMSC PHY completes auto-negotiation before the driver is
> > ready to handle interrupts, the PHY state machine never realizes that we
> > have a link. Clear the ANENABLE bit on initialization, which lets
> > genphy_config_aneg do its thing when that code is hit later.
> > 
> > While this patch does fix the problem we see (no link on boot without
> > re-plugging the cable), it seems like the generic PHY code should be
> > able to handle auto-negotiation completing before interrupts are
> > enabled. Submitted as an RFC in the hopes that someone has an idea as to
> > how that could be done.
> > 
> > This fix is copied from commit 99f81afc139c ("phy: micrel: Disable auto
> > negotiation on startup").
> 
> Do you mind trying:
> 
> https://www.spinics.net/lists/netdev/msg397857.html
> 
> and see if you do get link interrupts without your patch applied? Thanks!

Yep, that fixes it. I figured there was some state machine issue I was missing.
Thanks very much!

> -- 
> Florian

-- 
Kyle Roeschley
Software Engineer
National Instruments

Re: [PATCH net] net_sched: do not broadcast RTM_GETTFILTER result

2016-10-12 Thread Cong Wang

On Sun, Oct 9, 2016 at 8:25 PM, Eric Dumazet  wrote:
> +   if (unicast)
> +   return netlink_unicast(net->rtnl, skb, portid, MSG_DONTWAIT);

Nit: rtnl_unicast() is simpler.

[PATCH net-next v11 0/1] net: phy: Rebase Edge-Rate clean up

2016-10-12 Thread Allan W. Nielsen

Hi,

I can see that the "Add Wake-on-LAN driver for Microsemi PHYs" (0a55c12f97)
patch has been applied to net-next, and that it is causing the Edge-Rate patch
to conflict.

I have therefore rebased v10 of the patch to fit on top of net-next.

/Allan

-- 
2.7.3

RE: igb driver can cause cache invalidation of non-owned memory?

2016-10-12 Thread David Laight

From: Alexander Duyck
> Sent: 12 October 2016 16:33
...
> > To get some throughput improvement, I propose removal of that
> > sync_for_device() before reusing buffer. Will you accept such a patch ;)
> 
> Not one that gets rid of sync_for_device() in the driver.  From what I
> can tell there are some DMA APIs that use that to perform the
> invalidation on the region of memory so that it can be DMAed into.
> Without that we run the risk of having a race between something the
> CPU might have placed in the cache and something the device wrote into
> memory.  The sync_for_device() call is meant to invalidate the cache
> for the region so that when the device writes into memory there is no
> risk of that race.

I'm no expert, but some thoughts...

Just remember that the cpu can do speculative memory and cache line reads.
So you need to ensure there are no dirty cache lines when the receive
buffer is set up and no cache lines at all before looking at the frame.

So unless you know the exact rules for these speculative cache line reads
you have to invalidate the cache after the buffer is written to by the
hardware even if it was invalidated when the buffer was set up.

If you can 100% guarantee the cpu hasn't dirtied the cache then I
think the invalidate prior to reusing the buffer can be skipped.
But I wouldn't want to debug that going wrong.
Might be provably safe in the 'copybreak' path.

David

Re: [PATCH iproute2] tc: cls_bpf: handle skip_sw and skip_hw flags

2016-10-12 Thread Daniel Borkmann


On 10/12/2016 05:46 PM, Jakub Kicinski wrote:

Add support for controlling hardware offload using (now standard)
skip_sw and skip_hw flags in cls_bpf.

Signed-off-by: Jakub Kicinski 


Acked-by: Daniel Borkmann

[PATCH iproute2] tc: cls_bpf: handle skip_sw and skip_hw flags

2016-10-12 Thread Jakub Kicinski

Add support for controlling hardware offload using (now standard)
skip_sw and skip_hw flags in cls_bpf.

Signed-off-by: Jakub Kicinski 
---
Hi Stephen!

This requires header rebase to get TCA_BPF_FLAGS_GEN in pkt_cls.h.
The patch is for 4.9 release so it is targeted at master branch.

Thanks!

 man/man8/tc-bpf.8 | 14 ++
 tc/f_bpf.c| 21 +++--
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/man/man8/tc-bpf.8 b/man/man8/tc-bpf.8
index c8d5c5f94da8..e371964d06ab 100644
--- a/man/man8/tc-bpf.8
+++ b/man/man8/tc-bpf.8
@@ -14,6 +14,10 @@ CLS_NAME ] [
 UDS_FILE ] [
 .B verbose
 ] [
+.B skip_hw
+|
+.B skip_sw
+] [
 .B police
 POLICE_SPEC ] [
 .B action
@@ -137,6 +141,16 @@ if set, it will dump the eBPF verifier output, even if 
loading the eBPF
 program was successful. By default, only on error, the verifier log is
 being emitted to the user.
 
+.SS skip_hw | skip_sw
+hardware offload control flags. By default TC will try to offload
+filters to hardware if possible.
+.B skip_hw
+explicitly disables the attempt to offload.
+.B skip_sw
+forces the offload and disables running the eBPF program in the kernel.
+If hardware offload is not possible and this flag was set kernel will
+report an error and filter will not be installed at all.
+
 .SS police
 is an optional parameter for an eBPF/cBPF classifier that specifies a
 police in
diff --git a/tc/f_bpf.c b/tc/f_bpf.c
index 5c97c863d15a..665bc6612eeb 100644
--- a/tc/f_bpf.c
+++ b/tc/f_bpf.c
@@ -37,8 +37,8 @@ static void explain(void)
fprintf(stderr, "\n");
fprintf(stderr, "eBPF use case:\n");
fprintf(stderr, " object-file FILE [ section CLS_NAME ] [ export UDS_FILE ]");
-   fprintf(stderr, " [ verbose ] [ direct-action ]\n");
-   fprintf(stderr, " object-pinned FILE [ direct-action ]\n");
+   fprintf(stderr, " [ verbose ] [ direct-action ] [ skip_hw | skip_sw ]\n");
+   fprintf(stderr, " object-pinned FILE [ direct-action ] [ skip_hw | skip_sw ]\n");
fprintf(stderr, "\n");
fprintf(stderr, "Common remaining options:\n");
fprintf(stderr, " [ action ACTION_SPEC ]\n");
@@ -66,6 +66,7 @@ static int bpf_parse_opt(struct filter_util *qu, char *handle,
 {
const char *bpf_obj = NULL, *bpf_uds_name = NULL;
struct tcmsg *t = NLMSG_DATA(n);
+   unsigned int bpf_gen_flags = 0;
unsigned int bpf_flags = 0;
bool seen_run = false;
struct rtattr *tail;
@@ -107,6 +108,10 @@ static int bpf_parse_opt(struct filter_util *qu, char 
*handle,
} else if (matches(*argv, "direct-action") == 0 ||
   matches(*argv, "da") == 0) {
bpf_flags |= TCA_BPF_FLAG_ACT_DIRECT;
+   } else if (matches(*argv, "skip_hw") == 0) {
+   bpf_gen_flags |= TCA_CLS_FLAGS_SKIP_HW;
+   } else if (matches(*argv, "skip_sw") == 0) {
+   bpf_gen_flags |= TCA_CLS_FLAGS_SKIP_SW;
} else if (matches(*argv, "action") == 0) {
NEXT_ARG();
if (parse_action(&argc, &argv, TCA_BPF_ACT, n)) {
@@ -136,6 +141,8 @@ static int bpf_parse_opt(struct filter_util *qu, char 
*handle,
NEXT_ARG_FWD();
}
 
+   if (bpf_gen_flags)
+   addattr32(n, MAX_MSG, TCA_BPF_FLAGS_GEN, bpf_gen_flags);
if (bpf_obj && bpf_flags)
addattr32(n, MAX_MSG, TCA_BPF_FLAGS, bpf_flags);
 
@@ -178,6 +185,16 @@ static int bpf_print_opt(struct filter_util *qu, FILE *f,
fprintf(f, "direct-action ");
}
 
+   if (tb[TCA_BPF_FLAGS_GEN]) {
+   unsigned int flags =
+   rta_getattr_u32(tb[TCA_BPF_FLAGS_GEN]);
+
+   if (flags & TCA_CLS_FLAGS_SKIP_HW)
+   fprintf(f, "skip_hw ");
+   if (flags & TCA_CLS_FLAGS_SKIP_SW)
+   fprintf(f, "skip_sw ");
+   }
+
if (tb[TCA_BPF_OPS] && tb[TCA_BPF_OPS_LEN]) {
bpf_print_ops(f, tb[TCA_BPF_OPS],
  rta_getattr_u16(tb[TCA_BPF_OPS_LEN]));
-- 
1.9.1
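
The flag handling in the patch above boils down to collecting keywords into a
bitmask on the parse side and decoding the same bits on the print side. A
minimal stand-alone sketch, assuming demo bit values rather than the real
TCA_CLS_FLAGS_* definitions from pkt_cls.h, and using exact string comparison
where tc uses its matches() prefix helper (all demo_* names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Demo stand-ins for TCA_CLS_FLAGS_SKIP_HW / TCA_CLS_FLAGS_SKIP_SW. */
#define DEMO_FLAG_SKIP_HW (1u << 0)
#define DEMO_FLAG_SKIP_SW (1u << 1)

/* OR the bit for a known keyword into *flags; return -1 if unknown. */
static int demo_parse_flag(const char *arg, unsigned int *flags)
{
	assert(arg != NULL && flags != NULL);
	if (strcmp(arg, "skip_hw") == 0)
		*flags |= DEMO_FLAG_SKIP_HW;
	else if (strcmp(arg, "skip_sw") == 0)
		*flags |= DEMO_FLAG_SKIP_SW;
	else
		return -1;
	return 0;
}

/* Write the names of the set flags into buf, each followed by a space. */
static void demo_format_flags(unsigned int flags, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	snprintf(buf, len, "%s%s",
		 (flags & DEMO_FLAG_SKIP_HW) ? "skip_hw " : "",
		 (flags & DEMO_FLAG_SKIP_SW) ? "skip_sw " : "");
}
```

Parsing "skip_sw" and formatting the result gives back "skip_sw ", matching
the round trip between bpf_parse_opt() and bpf_print_opt() in the patch.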

Re: igb driver can cause cache invalidation of non-owned memory?

2016-10-12 Thread Alexander Duyck

On Tue, Oct 11, 2016 at 11:55 PM, Nikita Yushchenko
 wrote:
>>>> The main reason why this isn't a concern for the igb driver is because
>>>> we currently pass the page up as read-only.  We don't allow the stack
>>>> to write into the page by keeping the page count greater than 1 which
>>>> means that the page is shared.  It isn't until we unmap the page that
>>>> the page count is allowed to drop to 1 indicating that it is writable.
>>>
>>> Doesn't that mean that sync_to_device() in igb_reuse_rx_page() can be
>>> avoided? If page is read only for entire world, then it can't be dirty
>>> in cache and thus device can safely write to it without preparation step.
>>
>> For the sake of correctness we were adding the
>> dma_sync_single_range_for_device.
>
> Could you please elaborate this "for sake of correctness"?
>
> If by "correctness" you mean ensuring that buffer gets frame DMAed by
> device and that's not broken by cache activity, then:
> - on first use of this buffer after page allocation, sync_for_device()
> is not needed due to previous dma_page_map() call,
> - on later uses of the same buffer, sync_for_device() is not needed due
> to buffer being read-only since dma_page_map() call, thus it can't be
> dirty in cache and thus no writebacks of this area can be possible.
>
> If by "correctness" you mean strict following "ownership" concept - i.e.
> memory area is "owned" either by cpu or by device, and "ownersip" must
> be passed to device before DMA and back to cpu after DMA - then, igb
> driver already breaks these rules anyway:

Sort of.  Keep in mind that only syncing what the device had DMAed
into is a recent change and was provided by a third party.

> - igb calls dma_map_page() at page allocation time, thus entire page
> becomes "owned" by device,
> - and then, on first use of second buffer inside the page, igb calls
> sync_for_device() for buffer area, despite of that area is already
> "owned" by device,

Right.  However there is nothing wrong with assigning a buffer to the
device twice especially since we are just starting out.  If we wanted
to be more correct we would probably be allocating and deallocating
the pages with the DMA_ATTR_SKIP_CPU_SYNC attribute and then just do
the sync_for_device before reassigning the page back to the device.

> - and later, if a buffer within page gets reused, igb calls
> sync_for_device() for entire buffer, despite of only part of buffer was
> sync_for_cpu()'ed at time of completing receive of previous frame into
> this buffer,

This is going to come into play with future changes we have planned.
If we update things to use build_skb we are going to be writing into
other parts of the page as we will be using that for shared info.

> - and later, igb calls dma_unmap_page(), despite of that part of page
> was sync_for_cpu()'ed and thus is "owned" by CPU.

Right we are looking into finding a fix for that.

> Given all that, not calling sync_for_device() before reusing buffer
> won't make picture much worse :)

Actually it will, since you are suggesting taking things in a different
direction than the rest of the community.  What we plan to do is
weaken the dma_map/dma_unmap semantics, likely by using
DMA_ATTR_SKIP_CPU_SYNC on the map and unmap calls.

>> Since it is an DMA_FROM_DEVICE
>> mapping calling it should really have no effect for most DMA mapping
>> interfaces.
>
> Unfortunately dma_sync_single_range_for_device() *is* slow on imx6q - it
> does cache invalidation.  I don't really understand why invalidating
> cache can be slow - it only removes data from cache, it should not
> access slow outer memory - but cache invalidation *is* in top of perf
> profiles.
>
> To get some throughput improvement, I propose removal of that
> sync_for_device() before reusing buffer. Will you accept such a patch ;)

Not one that gets rid of sync_for_device() in the driver.  From what I
can tell there are some DMA APIs that use that to perform the
invalidation on the region of memory so that it can be DMAed into.
Without that we run the risk of having a race between something the
CPU might have placed in the cache and something the device wrote into
memory.  The sync_for_device() call is meant to invalidate the cache
for the region so that when the device writes into memory there is no
risk of that race.

What you may want to do is look at the DMA API you are using and
determine if it is functioning correctly.  Most DMA APIs I am familiar
with will either sync Rx data on the sync_for_cpu() or
sync_for_device() but it should not sync on both.  The fact that it is
syncing on both makes me wonder if the API was modified to work around
a buggy driver that didn't follow the proper semantics for buffer
ownership instead of just fixing the buggy driver.

>> Also you may want to try updating to the 4.8 version of the driver.
>> It reduces the size of the dma_sync_single_range_for_cpu loops by
>> reducing the sync size down to the size that

Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits

2016-10-12 Thread Eric Dumazet

On Tue, 2016-10-11 at 20:56 -0700, Yuchung Cheng wrote:

> I thought more about this patch on my way home and have more
> questions: why do we exclude RTO retransmission specifically? also
> when we rehash, we'll introduce reordering either in recovery or after
> recovery, as some TCP CC like bbr would continue sending regardlessly,
> so starting in tcp_ack() with tp->txhash_want does not really prevent
> causing more reordering.

Note that changing txhash during a non rto retransmit is going to break
pacing on a bonding setup, since the change in txhash will likely select
a different slave, where MQ+FQ are the qdisc in place.

Re: BUG: net/ipv6: kernel memory leak in ip6_datagram_recv_specific_ctl

2016-10-12 Thread Eric Dumazet

On Wed, 2016-10-12 at 19:09 +0800, Baozeng Ding wrote:
> Hi all,
> The following program triggers use-after-free in 
> ip6_datagram_recv_specific_ctl, which may leak kernel memory. The
> kernel version is 4.8.0+ (on Oct 7 commit 
> d1f5323370fceaed43a7ee38f4c7bfc7e70f28d0).
> ==
> BUG: KASAN: use-after-free in ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 at 
> addr 880029c84ec8
> Read of size 1 by task poc/25548
> Call Trace:
>  [] dump_stack+0x12e/0x185 /lib/dump_stack.c:15
>  [< inline >] print_address_description /mm/kasan/report.c:204
>  [] kasan_report_error+0x48b/0x4b0 /mm/kasan/report.c:283
>  [< inline >] kasan_report /mm/kasan/report.c:303
>  [] __asan_report_load1_noabort+0x3e/0x40 
> /mm/kasan/report.c:321
>  [] ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 
> /net/ipv6/datagram.c:687
>  [] ip6_datagram_recv_ctl+0x33/0x40
>  [] do_ipv6_getsockopt.isra.4+0xaec/0x2150
>  [] ipv6_getsockopt+0x116/0x230
>  [] tcp_getsockopt+0x82/0xd0 /net/ipv4/tcp.c:3035
>  [] sock_common_getsockopt+0x95/0xd0 /net/core/sock.c:2647
>  [< inline >] SYSC_getsockopt /net/socket.c:1776
>  [] SyS_getsockopt+0x142/0x230 /net/socket.c:1758
>  [] entry_SYSCALL_64_fastpath+0x23/0xc6
> Memory state around the buggy address:
>  880029c84d80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>  880029c84e00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> > 880029c84e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>   ^
>  880029c84f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>  880029c84f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 
> 
> ==
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> #define IPV6_2292DSTOPTS4
> #define IPV6_2292PKTOPTIONS 6
> #define IPV6_FLOWINFO   11
> 
> 
> int main()
> {
>   int fd;
>   int i, r;
>   int opt = 1, len = 0;
>   struct msghdr msg;
>   struct sockaddr_in6 addr;
>   int sub_addr[4];
>   struct iovec iov;
>   memset(sub_addr, 0, sizeof(sub_addr));
>   sub_addr[3] = 0x100;
>   memcpy(&addr.sin6_addr, sub_addr, sizeof(sub_addr));
>   addr.sin6_family = AF_INET6;
>   addr.sin6_port = 0x10ab;
>   addr.sin6_flowinfo = 0x1;
>   addr.sin6_scope_id = 0;
> 
>   mmap(0x2000ul, 0x1c000ul, 0x3ul, 0x32ul, -1, 0x0ul, 0, 0, 0);
>   memset(0x2000, 'a', 0x1c000);
>   fd = socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP);
>   bind(fd, &addr, sizeof(addr));
> 
>   setsockopt(fd, IPPROTO_IPV6, IPV6_2292DSTOPTS, &opt, 4);
>   setsockopt(fd, IPPROTO_IPV6, IPV6_FLOWINFO, &opt, 4);
> 
>   addr.sin6_flowinfo = 0;
>   addr.sin6_scope_id = 0;
>   msg.msg_name = &addr;
>   msg.msg_namelen = sizeof(addr);
>   msg.msg_iov = &iov;
>   msg.msg_iovlen = 1;
>   msg.msg_iov->iov_base = 0x2000;
>   msg.msg_iov->iov_len = 0x100;
>   msg.msg_control = 0x2000;
>   msg.msg_controllen = 0x100;
>   msg.msg_flags = 0;
>   r = sendmsg(fd, &msg, MSG_FASTOPEN);
>   if (r < 0) {
>   printf("sendmsg errno=%d\n", r);
>   }
> 
>   r = getsockopt(fd, IPPROTO_IPV6, IPV6_2292PKTOPTIONS, 0x20012000ul, &len);
>   if (r < 0) printf("getsockopt error\n");
>   return 0;
> }
> 
> The following lines cause an out-of-bounds read, which may leak kernel memory
> via put_cmsg.
> 686   u8 *ptr = nh + opt->dst0; // Out-of-bounds when opt->dst0 is large.
> 687   put_cmsg(msg, SOL_IPV6, IPV6_2292DSTOPTS, (ptr[1]+1)<<3, ptr);
> 
> I debugged using printk and got some values of opt, shown below, which
> may help locate the root cause of the bug. Thanks.
> 
> [85564.842733] degug: opt->iif is 0xe21c
> [85564.842737] degug: opt->ra is 0x121c
> [85564.842741] degug: opt->dst0 is 0xe111
> 
> Best Regards,
> Baozeng Ding

Thanks for the report.

I am testing a patch to fix this issue.
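
For readers following the report: the read goes wrong because an offset
stored in the per-packet option block is trusted without checking it
against the header area. Below is a minimal user-space sketch of the kind
of bounds check that would reject such an offset. It is not the fix that
was being tested; dstopts_len_checked() and its parameters are hypothetical
names, and the (ptr[1] + 1) << 3 length rule is taken from the line quoted
in the report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the length of the extension header found at
 * offset `off` inside a header area of `nh_len` bytes, or -1 if the header
 * would not fit. The length rule mirrors the quoted put_cmsg() argument.
 */
static long dstopts_len_checked(const uint8_t *nh, size_t nh_len, uint16_t off)
{
	size_t len;

	assert(nh != NULL);

	/* The two fixed bytes (next header, hdr ext len) must be readable. */
	if ((size_t)off + 2 > nh_len)
		return -1;

	len = ((size_t)nh[off + 1] + 1) << 3;

	/* The whole option header must lie inside the buffer. */
	if (len > nh_len - off)
		return -1;

	return (long)len;
}
```

With the opt->dst0 value from the debug output (0xe111) and a header area
of a few dozen bytes, this helper returns -1 instead of reading past the
buffer.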

[PATCH] xen-netback: fix type mismatch warning

2016-10-12 Thread Arnd Bergmann

With the latest rework of the xen-netback driver, we get a warning
on ARM about the types passed into min():

drivers/net/xen-netback/rx.c: In function 'xenvif_rx_next_chunk':
include/linux/kernel.h:739:16: error: comparison of distinct pointer types 
lacks a cast [-Werror]

The reason is that XEN_PAGE_SIZE is not size_t here. There
is no actual bug, and we can easily avoid the warning using the
min_t() macro instead of min().

Fixes: eb1723a29b9a ("xen-netback: refactor guest rx")
Signed-off-by: Arnd Bergmann 
---
 drivers/net/xen-netback/rx.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/xen-netback/rx.c b/drivers/net/xen-netback/rx.c
index 8e9ade6ccf18..aeb150258c6c 100644
--- a/drivers/net/xen-netback/rx.c
+++ b/drivers/net/xen-netback/rx.c
@@ -337,9 +337,9 @@ static void xenvif_rx_next_chunk(struct xenvif_queue *queue,
frag_data += pkt->frag_offset;
frag_len -= pkt->frag_offset;
 
-   chunk_len = min(frag_len, XEN_PAGE_SIZE - offset);
-   chunk_len = min(chunk_len,
-   XEN_PAGE_SIZE - xen_offset_in_page(frag_data));
+   chunk_len = min_t(size_t, frag_len, XEN_PAGE_SIZE - offset);
+   chunk_len = min_t(size_t, chunk_len, XEN_PAGE_SIZE -
+xen_offset_in_page(frag_data));
 
pkt->frag_offset += chunk_len;
 
-- 
2.9.0
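
As an illustration of the change above, here is a simplified user-space
sketch of the chunk-length computation with both operands forced to one
type. sketch_min_t() is a plain stand-in for the kernel's min_t() (it
evaluates its arguments twice, which the kernel macro avoids), and
SKETCH_PAGE_SIZE is an assumed constant typed as unsigned long to mimic a
page-size macro that is not size_t.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size; unsigned long on purpose, not size_t. */
#define SKETCH_PAGE_SIZE 4096UL

/* Simplified stand-in for min_t(): cast both sides, then compare. */
#define sketch_min_t(type, x, y) \
	((type)(x) < (type)(y) ? (type)(x) : (type)(y))

/*
 * Length of the next chunk: limited by the remaining fragment, by the room
 * left in the destination page, and by the room left in the source page.
 */
static size_t next_chunk_len(size_t frag_len, size_t offset,
			     size_t data_off_in_page)
{
	size_t chunk_len;

	assert(offset <= SKETCH_PAGE_SIZE);
	assert(data_off_in_page <= SKETCH_PAGE_SIZE);

	chunk_len = sketch_min_t(size_t, frag_len, SKETCH_PAGE_SIZE - offset);
	chunk_len = sketch_min_t(size_t, chunk_len,
				 SKETCH_PAGE_SIZE - data_off_in_page);
	return chunk_len;
}
```

The explicit type argument is the point: the comparison is done in size_t
regardless of how the page-size constant is typed on a given architecture.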

[PATCH v2] IB/ipoib: move back IB LL address into the hard header

2016-10-12 Thread Paolo Abeni

After commit 9207f9d45b0a ("net: preserve IP control block
during GSO segmentation"), the GSO CB and the IPoIB CB conflict.
That destroys the IPoIB address information cached there,
causing a severe performance regression, as better described here:

http://marc.info/?l=linux-kernel&m=146787279825501&w=2

This change moves the data cached by the IPoIB driver from the
skb control block into the IPoIB hard header, as was done before
commit 936d7de3d736 ("IPoIB: Stop lying about hard_header_len
and use skb->cb to stash LL addresses").
To avoid GRO issues on packet reception, the IPoIB driver
stashes a dummy pseudo header into the skb, so that received
packets actually have a hard header matching the declared length.
To avoid changing the connected mode maximum MTU, the allocated
head buffer size is increased by the pseudo header length.

After this commit, IPoIB performance is back to its pre-regression
value.

v1 -> v2: avoid changing the max mtu, increasing the head buf size

Fixes: 9207f9d45b0a ("net: preserve IP control block during GSO segmentation")
Signed-off-by: Paolo Abeni 
---
 drivers/infiniband/ulp/ipoib/ipoib.h   | 20 +++---
 drivers/infiniband/ulp/ipoib/ipoib_cm.c| 15 +++
 drivers/infiniband/ulp/ipoib/ipoib_ib.c| 12 +++---
 drivers/infiniband/ulp/ipoib/ipoib_main.c  | 54 --
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |  6 ++-
 5 files changed, 64 insertions(+), 43 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 9dbfcc0..5ff64af 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -63,6 +63,8 @@ enum ipoib_flush_level {
 
 enum {
IPOIB_ENCAP_LEN   = 4,
+   IPOIB_PSEUDO_LEN  = 20,
+   IPOIB_HARD_LEN= IPOIB_ENCAP_LEN + IPOIB_PSEUDO_LEN,
 
IPOIB_UD_HEAD_SIZE= IB_GRH_BYTES + IPOIB_ENCAP_LEN,
IPOIB_UD_RX_SG= 2, /* max buffer needed for 4K mtu */
@@ -134,15 +136,21 @@ struct ipoib_header {
u16 reserved;
 };
 
-struct ipoib_cb {
-   struct qdisc_skb_cb qdisc_cb;
-   u8  hwaddr[INFINIBAND_ALEN];
+struct ipoib_pseudo_header {
+   u8  hwaddr[INFINIBAND_ALEN];
 };
 
-static inline struct ipoib_cb *ipoib_skb_cb(const struct sk_buff *skb)
+static inline void skb_add_pseudo_hdr(struct sk_buff *skb)
 {
-   BUILD_BUG_ON(sizeof(skb->cb) < sizeof(struct ipoib_cb));
-   return (struct ipoib_cb *)skb->cb;
+   char *data = skb_push(skb, IPOIB_PSEUDO_LEN);
+
+   /*
+* only the ipoib header is present now, make room for a dummy
+* pseudo header and set skb field accordingly
+*/
+   memset(data, 0, IPOIB_PSEUDO_LEN);
+   skb_reset_mac_header(skb);
+   skb_pull(skb, IPOIB_HARD_LEN);
 }
 
 /* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 
b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 4ad297d..339a1ee 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -63,6 +63,8 @@ MODULE_PARM_DESC(cm_data_debug_level,
 #define IPOIB_CM_RX_DELAY   (3 * 256 * HZ)
 #define IPOIB_CM_RX_UPDATE_MASK (0x3)
 
+#define IPOIB_CM_RX_RESERVE (ALIGN(IPOIB_HARD_LEN, 16) - IPOIB_ENCAP_LEN)
+
 static struct ib_qp_attr ipoib_cm_err_attr = {
.qp_state = IB_QPS_ERR
 };
@@ -146,15 +148,15 @@ static struct sk_buff *ipoib_cm_alloc_rx_skb(struct 
net_device *dev,
struct sk_buff *skb;
int i;
 
-   skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12);
+   skb = dev_alloc_skb(ALIGN(IPOIB_CM_HEAD_SIZE + IPOIB_PSEUDO_LEN, 16));
if (unlikely(!skb))
return NULL;
 
/*
-* IPoIB adds a 4 byte header. So we need 12 more bytes to align the
+* IPoIB adds a IPOIB_ENCAP_LEN byte header, this will align the
 * IP header to a multiple of 16.
 */
-   skb_reserve(skb, 12);
+   skb_reserve(skb, IPOIB_CM_RX_RESERVE);
 
mapping[0] = ib_dma_map_single(priv->ca, skb->data, IPOIB_CM_HEAD_SIZE,
   DMA_FROM_DEVICE);
@@ -624,9 +626,9 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct 
ib_wc *wc)
if (wc->byte_len < IPOIB_CM_COPYBREAK) {
int dlen = wc->byte_len;
 
-   small_skb = dev_alloc_skb(dlen + 12);
+   small_skb = dev_alloc_skb(dlen + IPOIB_CM_RX_RESERVE);
if (small_skb) {
-   skb_reserve(small_skb, 12);
+   skb_reserve(small_skb, IPOIB_CM_RX_RESERVE);
ib_dma_sync_single_for_cpu(priv->ca, 
rx_ring[wr_id].mapping[0],
   dlen, DMA_FROM_DEVICE);
skb_copy_from_linear_data(skb, small_skb->data, dlen);
@@ -663,8 +665,7 @@ void
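
The headroom arithmetic behind IPOIB_CM_RX_RESERVE can be checked with a
small user-space sketch. The three length constants are copied from the
patch; SKETCH_ALIGN() is a local stand-in for the kernel's ALIGN(), and the
two helper functions are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the kernel's ALIGN(); `a` must be a power of two. */
#define SKETCH_ALIGN(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

enum {
	IPOIB_ENCAP_LEN  = 4,
	IPOIB_PSEUDO_LEN = 20,
	IPOIB_HARD_LEN   = IPOIB_ENCAP_LEN + IPOIB_PSEUDO_LEN,
};

/* Bytes reserved in front of the received data, as in the patch. */
static size_t ipoib_cm_rx_reserve(void)
{
	assert(IPOIB_HARD_LEN == 24);
	return SKETCH_ALIGN((size_t)IPOIB_HARD_LEN, 16) - IPOIB_ENCAP_LEN;
}

/* Offset of the IP header: reserve plus the 4-byte IPoIB header. */
static size_t ip_header_offset(void)
{
	return ipoib_cm_rx_reserve() + IPOIB_ENCAP_LEN;
}
```

Under these definitions the reserve works out to 28 bytes: enough headroom
for the 20-byte pseudo header to be pushed later, while the IP header still
starts at a multiple of 16.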

Re: [mac80211] BUG_ON with current -git (4.8.0-11417-g24532f7)

2016-10-12 Thread Johannes Berg


> > Can you elaborate on how exactly it kills your system?
> 
> the last time I saw it it was a NULL deref at
> ieee80211_aes_ccm_decrypt.

Hm. I was expecting something within the crypto code would cause the
crash, this seems strange.

Anyway, I'm surely out of my depth wrt. the actual cause. Something
like the patch below probably works around it, but it's horribly
inefficient due to the locking and doesn't cover CMAC/GMAC either.

johannes

diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index 103187ca9474..e820f437f02e 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1224,6 +1225,10 @@ struct ieee80211_local {
 
spinlock_t rx_path_lock;
 
+   /* temporary buffers for software crypto */
+   u8 aad[2 * AES_BLOCK_SIZE];
+   u8 b_0[AES_BLOCK_SIZE];
+
/* Station data */
/*
 * The mutex only protects the list, hash table and
diff --git a/net/mac80211/wpa.c b/net/mac80211/wpa.c
index b48c1e13e281..a3f17a710b85 100644
--- a/net/mac80211/wpa.c
+++ b/net/mac80211/wpa.c
@@ -405,8 +405,8 @@ static int ccmp_encrypt_skb(struct ieee80211_tx_data *tx, 
struct sk_buff *skb,
u8 *pos;
u8 pn[6];
u64 pn64;
-   u8 aad[2 * AES_BLOCK_SIZE];
-   u8 b_0[AES_BLOCK_SIZE];
+   u8 *aad = tx->local->aad;
+   u8 *b_0 = tx->local->b_0;
 
if (info->control.hw_key &&
!(info->control.hw_key->flags & IEEE80211_KEY_FLAG_GENERATE_IV) &&
@@ -460,9 +460,11 @@ static int ccmp_encrypt_skb(struct ieee80211_tx_data *tx, 
struct sk_buff *skb,
return 0;
 
pos += IEEE80211_CCMP_HDR_LEN;
+   spin_lock_bh(&tx->local->rx_path_lock);
ccmp_special_blocks(skb, pn, b_0, aad);
ieee80211_aes_ccm_encrypt(key->u.ccmp.tfm, b_0, aad, pos, len,
  skb_put(skb, mic_len), mic_len);
+   spin_unlock_bh(&tx->local->rx_path_lock);
 
return 0;
 }
@@ -534,8 +536,9 @@ ieee80211_crypto_ccmp_decrypt(struct ieee80211_rx_data *rx,
}
 
if (!(status->flag & RX_FLAG_DECRYPTED)) {
-   u8 aad[2 * AES_BLOCK_SIZE];
-   u8 b_0[AES_BLOCK_SIZE];
+   u8 *aad = rx->local->aad;
+   u8 *b_0 = rx->local->b_0;
+
/* hardware didn't decrypt/verify MIC */
ccmp_special_blocks(skb, pn, b_0, aad);
 
@@ -639,8 +642,8 @@ static int gcmp_encrypt_skb(struct ieee80211_tx_data *tx, 
struct sk_buff *skb)
u8 *pos;
u8 pn[6];
u64 pn64;
-   u8 aad[2 * AES_BLOCK_SIZE];
-   u8 j_0[AES_BLOCK_SIZE];
+   u8 *aad = tx->local->aad;
+   u8 *j_0 = tx->local->b_0;
 
if (info->control.hw_key &&
!(info->control.hw_key->flags & IEEE80211_KEY_FLAG_GENERATE_IV) &&
@@ -695,9 +698,11 @@ static int gcmp_encrypt_skb(struct ieee80211_tx_data *tx, 
struct sk_buff *skb)
return 0;
 
pos += IEEE80211_GCMP_HDR_LEN;
+   spin_lock_bh(&tx->local->rx_path_lock);
gcmp_special_blocks(skb, pn, j_0, aad);
ieee80211_aes_gcm_encrypt(key->u.gcmp.tfm, j_0, aad, pos, len,
  skb_put(skb, IEEE80211_GCMP_MIC_LEN));
+   spin_unlock_bh(&tx->local->rx_path_lock);
 
return 0;
 }
@@ -764,8 +769,9 @@ ieee80211_crypto_gcmp_decrypt(struct ieee80211_rx_data *rx)
}
 
if (!(status->flag & RX_FLAG_DECRYPTED)) {
-   u8 aad[2 * AES_BLOCK_SIZE];
-   u8 j_0[AES_BLOCK_SIZE];
+   u8 *aad = rx->local->aad;
+   u8 *j_0 = rx->local->b_0;
+
/* hardware didn't decrypt/verify MIC */
gcmp_special_blocks(skb, pn, j_0, aad);

[PATCH net-next v11 1/1] net: phy: Cleanup the Edge-Rate feature in Microsemi PHYs.

2016-10-12 Thread Allan W. Nielsen

Edge-Rate cleanup include the following:
- Updated device tree bindings documentation for edge-rate
- The edge-rate is now specified as a "slowdown", meaning that it is now
  being specified as positive values instead of negative (both
  documentation and implementation wise).
- Only explicitly documented values for "vsc8531,vddmac" and
  "vsc8531,edge-slowdown" are accepted by the device driver.
- Deleted include/dt-bindings/net/mscc-phy-vsc8531.h as it was not needed.
- Read/validate devicetree settings in probe instead of init

Signed-off-by: Allan W. Nielsen 
Signed-off-by: Raju Lakkaraju 
---
 .../devicetree/bindings/net/mscc-phy-vsc8531.txt   |  51 
 drivers/net/phy/mscc.c | 135 ++---
 include/dt-bindings/net/mscc-phy-vsc8531.h |  21 
 3 files changed, 90 insertions(+), 117 deletions(-)
 delete mode 100644 include/dt-bindings/net/mscc-phy-vsc8531.h

diff --git a/Documentation/devicetree/bindings/net/mscc-phy-vsc8531.txt 
b/Documentation/devicetree/bindings/net/mscc-phy-vsc8531.txt
index 99c7eb0..bdefefc6 100644
--- a/Documentation/devicetree/bindings/net/mscc-phy-vsc8531.txt
+++ b/Documentation/devicetree/bindings/net/mscc-phy-vsc8531.txt
@@ -6,22 +6,27 @@ Required properties:
  Documentation/devicetree/bindings/net/phy.txt
 
 Optional properties:
-- vsc8531,vddmac   : The vddmac in mV.
+- vsc8531,vddmac   : The vddmac in mV. Allowed values is listed
+ in the first row of Table 1 (below).
+ This property is only used in combination
+ with the 'edge-slowdown' property.
+ Default value is 3300.
 - vsc8531,edge-slowdown: % the edge should be slowed down relative to
- the fastest possible edge time. Native sign
- need not enter.
+ the fastest possible edge time.
  Edge rate sets the drive strength of the MAC
- interface output signals.  Changing the drive
- strength will affect the edge rate of the output
- signal.  The goal of this setting is to help
- reduce electrical emission (EMI) by being able
- to reprogram drive strength and in effect slow
- down the edge rate if desired.  Table 1 shows the
- impact to the edge rate per VDDMAC supply for each
- drive strength setting.
- Ref: Table:1 - Edge rate change below.
-
-Note: see dt-bindings/net/mscc-phy-vsc8531.h for applicable values
+ interface output signals.  Changing the
+ drive strength will affect the edge rate of
+ the output signal.  The goal of this setting
+ is to help reduce electrical emission (EMI)
+ by being able to reprogram drive strength
+ and in effect slow down the edge rate if
+ desired.
+ To adjust the edge-slowdown, the 'vddmac'
+ must be specified. Table 1 lists the
+ supported edge-slowdown values for a given
+ 'vddmac'.
+ Default value is 0%.
+ Ref: Table:1 - Edge rate change (below).
 
 Table: 1 - Edge rate change
 |
@@ -29,23 +34,23 @@ Table: 1 - Edge rate change
 |  |
 | 3300 mV  2500 mV 1800 mV 1500 mV |
 |---|
-| Default  Deafult Default Default |
+| 0%   0%  0%  0%  |
 | (Fastest)(recommended)   (recommended)   |
 |---|
-| -2%  -3% -5% -6% |
+| 2%   3%  5%  6%  |
 |---|
-| -4%  -6% -9% -14%|
+| 4%   6%  9%  14% |
 |---|
-| -7%  -10%-16%-21%|
+| 7%   10% 16% 21% |
 |(recommended) (recommended)   |
 |---|
-| -10% -14%-23%-29%|
+| 10%  14% 23% 29% |

Re: [mac80211] BUG_ON with current -git (4.8.0-11417-g24532f7)

2016-10-12 Thread Sergey Senozhatsky

Hello,

On (10/12/16 11:05), Johannes Berg wrote:
> Sorry - I meant to look into this yesterday but forgot.
> 
> > Andy, can this be related to CONFIG_VMAP_STACK?
> 
> I think it is.

yeah, the system works fine with !CONFIG_VMAP_STACK.

> > > current -git kills my system.
> 
> Can you elaborate on how exactly it kills your system?

the last time I saw it it was a NULL deref at ieee80211_aes_ccm_decrypt.

-ss

Re: [PATCH V2 net-next 15/15] smc: proc-fs interface for smc connections

2016-10-12 Thread Ursula Braun

Hi Dave,

thank you for your feedback. Following your guidance I studied the
inet_diag/tcp_diag kernel code for AF_INET sockets. It could make sense
to create an smc_diag module with an smc_diag_handler to provide
SMC-socket data to userspace. Userspace tools could exploit this by
receiving the SMC-socket data via AF_NETLINK sockets of protocol
NETLINK_SOCK_DIAG.
Please let me know, if this is the right direction.

Regards, Ursula

On 09/28/2016 04:27 AM, David Miller wrote:
> From: Ursula Braun 
> Date: Tue, 27 Sep 2016 18:41:56 +0200
> 
>> Maintain a list of SMC sockets and display important SMC socket
>> information in /proc/net/smc.
>>
>> Signed-off-by: Ursula Braun 
> 
> Dumping internal tables and information via /procfs is strongly
> deprecated.
> 
> Please use a more modern mechanism (such as netlink) to expose this
> information to the user.
> 
> You'll be most likely to succeed in your submission if you make use of
> or design a generic facility that allows other drivers similar to
> your's to provide this kind of information as well.
> 
> I'm sorry if this is frustrating, but this is a huge piece of
> infrastructure, therefore you can expect lots of pieces to get
> feedback and require changes like this.
> 
>

[PATCH (net.git) 2/2] stmmac: fix error check when init ptp

2016-10-12 Thread Giuseppe Cavallaro

This patch fixes the propagation of a ptp_clock_register
failure to the open function.

Signed-off-by: Giuseppe Cavallaro 
Cc: Alexandre TORGUE 
Cc: Rayagond Kokatanur 
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c |  4 ++--
 drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c  | 10 ++
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index e838850..6c85b61 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1709,8 +1709,8 @@ static int stmmac_hw_setup(struct net_device *dev, bool 
init_ptp)
 
if (init_ptp) {
ret = stmmac_init_ptp(priv);
-   if (ret && ret != -EOPNOTSUPP)
-   pr_warn("%s: failed PTP initialisation\n", __func__);
+   if (ret)
+   netdev_warn(priv->dev, "PTP support cannot init.\n");
}
 
 #ifdef CONFIG_DEBUG_FS
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
index 6e3b829..289d527 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
@@ -186,10 +186,12 @@ int stmmac_ptp_register(struct stmmac_priv *priv)
 priv->device);
if (IS_ERR(priv->ptp_clock)) {
priv->ptp_clock = NULL;
-   pr_err("ptp_clock_register() failed on %s\n", priv->dev->name);
-   } else if (priv->ptp_clock)
-   pr_debug("Added PTP HW clock successfully on %s\n",
-priv->dev->name);
+   return PTR_ERR(priv->ptp_clock);
+   }
+
+   spin_lock_init(&priv->ptp_lock);
+
+   netdev_dbg(priv->dev, "Added PTP HW clock successfully\n");
 
return 0;
 }
-- 
2.7.4

[PATCH (net.git) 1/2] stmmac: fix ptp init for gmac4

2016-10-12 Thread Giuseppe Cavallaro

The gmac 4.x version does not have extended descriptors
(which are available on 3.x instead).
While initializing the PTP module, advanced PTP was
enabled only when extended descriptors were in use. This cannot be
applied to the 4.x version, where only the hardware capability
register has to show whether the feature is present.
The patch also adds some extra netdev_(dbg/info) calls to better
dump the configuration.

Signed-off-by: Giuseppe Cavallaro 
Cc: Alexandre TORGUE 
Cc: Rayagond Kokatanur 
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 4c8c60a..e838850 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -650,20 +650,27 @@ static int stmmac_init_ptp(struct stmmac_priv *priv)
if (IS_ERR(priv->clk_ptp_ref)) {
priv->clk_ptp_rate = clk_get_rate(priv->stmmac_clk);
priv->clk_ptp_ref = NULL;
+   netdev_dbg(priv->dev, "PTP uses main clock\n");
} else {
clk_prepare_enable(priv->clk_ptp_ref);
priv->clk_ptp_rate = clk_get_rate(priv->clk_ptp_ref);
+   netdev_dbg(priv->dev, "PTP rate %d\n", priv->clk_ptp_rate);
}
 
priv->adv_ts = 0;
-   if (priv->dma_cap.atime_stamp && priv->extend_desc)
+   /* Check if adv_ts can be enabled for dwmac 4.x core */
+   if (priv->plat->has_gmac4 && priv->dma_cap.atime_stamp)
+   priv->adv_ts = 1;
+   /* Dwmac 3.x core with extend_desc can support adv_ts */
+   else if (priv->extend_desc && priv->dma_cap.atime_stamp)
priv->adv_ts = 1;
 
-   if (netif_msg_hw(priv) && priv->dma_cap.time_stamp)
-   pr_debug("IEEE 1588-2002 Time Stamp supported\n");
+   if (priv->dma_cap.time_stamp)
+   netdev_info(priv->dev, "IEEE 1588-2002 Timestamp supported\n");
 
-   if (netif_msg_hw(priv) && priv->adv_ts)
-   pr_debug("IEEE 1588-2008 Advanced Time Stamp supported\n");
+   if (priv->adv_ts)
+   netdev_info(priv->dev,
+   "IEEE 1588-2008 Advanced Timestamp supported\n");
 
priv->hw->ptp = &stmmac_ptp;
priv->hwts_tx_en = 0;
-- 
2.7.4
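
The adv_ts decision introduced by this patch reduces to a small predicate.
The sketch below restates it in user-space C; ptp_caps_sketch and
adv_ts_enabled() are hypothetical names, and the three flags stand for
priv->plat->has_gmac4, priv->extend_desc and priv->dma_cap.atime_stamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ptp_caps_sketch {
	bool has_gmac4;		/* core is a 4.x GMAC */
	bool extend_desc;	/* 3.x core using extended descriptors */
	bool atime_stamp;	/* capability register reports advanced TS */
};

static bool adv_ts_enabled(const struct ptp_caps_sketch *c)
{
	assert(c != NULL);

	/* 4.x: the capability bit alone decides. */
	if (c->has_gmac4 && c->atime_stamp)
		return true;
	/* 3.x: extended descriptors are required as well. */
	if (c->extend_desc && c->atime_stamp)
		return true;
	return false;
}
```

The case that changes behaviour is a 4.x core without extended
descriptors: before the patch it was refused, after it the capability bit
is sufficient.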

Re: [PATCH] iwlwifi: pcie: reduce "unsupported splx" to a warning

2016-10-12 Thread Luca Coelho

On Wed, 2016-10-12 at 14:36 +0200, Paul Bolle wrote:
> On Wed, 2016-10-12 at 15:24 +0300, Luca Coelho wrote:
> > Okay... Actually this is a structure in the BIOS and the actual method
> > we call is SPLC.  The SPLC method may return one item from this table,
> > or something entirely different, possibly one of the three values
> > depending on a configuration option or so.
> > 
> > Can you find and send me the actual SPLC method that we call, from
> > your BIOS?
> 
> 
> It seems Chris and I basically have identical setups, so I'll answer.

Thanks! Yeah, I meant either of you two. ;)

> There are 20 SPLC methods in the BIOS. The first reads
> Method (SPLC, 0, Serialized)
> {
> DerefOf (SPLX [One]) [Zero] = DOM1 /* \DOM1 */
> DerefOf (SPLX [One]) [One] = LIM1 /* \LIM1 */
> DerefOf (SPLX [One]) [0x02] = TIM1 /* \TIM1 */
> DerefOf (SPLX [0x02]) [Zero] = DOM2 /* \DOM2 */
> DerefOf (SPLX [0x02]) [One] = LIM2 /* \LIM2 */
> DerefOf (SPLX [0x02]) [0x02] = TIM2 /* \TIM2 */
> DerefOf (SPLX [0x03]) [Zero] = DOM3 /* \DOM3 */
> DerefOf (SPLX [0x03]) [One] = LIM3 /* \LIM3 */
> DerefOf (SPLX [0x03]) [0x02] = TIM3 /* \TIM3 */
> Return (SPLX) /* \_SB_.PCI0.RP01.PXSX.SPLX */
> }
> 
> The only difference is in the last comment. Ie, RP01 is increased until
> it reaches RP20. (The machine has 20 PCI devices according to lspci. I
> have no clue how to match that RPxx number to the 20 devices showing up
> in lspci, sorry.)

No problem, these BIOSes are usually quite cryptic. :) But what you're
saying makes sense.  They have added the SPLC method to all PCI root-
ports (which is what RP stands for here).

And, the values in the SPLX structs are being changed here, to DOM1,
LIM1, TIM1 etc., before being returned.  This also matches your
description that, at runtime, you got something different than the pure
dump.  If you follow these DOM*, LIM*, TIM* symbols, you'll probably
end up getting the values you observed at runtime.

Basically this tells me that indeed 3 "structs" are being returned (as
your dumps already showed).  And, according to the specs that I found
(which unfortunately are confidential, so I can't share) this is
correct and the driver code is broken.

I'll send you a patch for testing soon.

Thanks for all the help!

--
Cheers,
Luca.
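
To make the "three structs" point concrete, here is a rough user-space
sketch of how a caller could pick the matching entry out of a returned
list of (domain, limit, time) triplets instead of assuming a single one.
This is not the iwlwifi patch; the struct layout, the function name and
the domain value 0x07 are assumptions made for illustration, since the
actual specification is not public.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed domain identifier for the WiFi device; illustrative only. */
#define SKETCH_WIFI_DOMAIN 0x07

/* One (DOMx, LIMx, TIMx) triplet as filled in by the SPLC method. */
struct splx_entry_sketch {
	uint32_t domain;
	uint32_t limit;
	uint32_t time_window;
};

/*
 * Scan all returned entries and report the limit of the first one whose
 * domain matches. Returns 0 on success, -1 if no entry matches.
 */
static int splx_find_limit(const struct splx_entry_sketch *entries, size_t n,
			   uint32_t domain, uint32_t *limit)
{
	size_t i;

	assert(entries != NULL && limit != NULL);

	for (i = 0; i < n; i++) {
		if (entries[i].domain == domain) {
			*limit = entries[i].limit;
			return 0;
		}
	}
	return -1;
}
```

A caller would pass all three entries and SKETCH_WIFI_DOMAIN, and treat a
-1 result as "no limit configured" rather than as a malformed table.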

Re: [Patch net] net_sched: reorder pernet ops and act ops registrations

2016-10-12 Thread Jamal Hadi Salim


On 16-10-11 01:56 PM, Cong Wang wrote:

Krister reported a kernel NULL pointer dereference after
tcf_action_init_1() invokes a_o->init(), it is a race condition
where one thread calling tcf_register_action() to initialize
the netns data after putting act ops in the global list and
the other thread searching the list and then calling
a_o->init(net, ...).

Fix this by moving the pernet ops registration before making
the action ops visible. This is fine because: a) we don't
rely on act_base in pernet ops->init(), b) in the worst case we
have a fully initialized netns but ops is still not ready so
new actions still can't be created.

Reported-by: Krister Johansen 
Tested-by: Krister Johansen 
Cc: Jamal Hadi Salim 
Signed-off-by: Cong Wang 


Acked-by: Jamal Hadi Salim 


cheers,
jamal
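
The ordering argument in the commit message can be shown with a
single-threaded user-space sketch: prepare the per-namespace state first,
and only then make the ops findable. All names below are hypothetical, a
plain flag stands in for the netns data, and there is no locking, so this
only illustrates the order of the two steps, not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct act_ops_sketch {
	const char *kind;
	int (*init)(void);
	struct act_ops_sketch *next;
};

static struct act_ops_sketch *act_base_sketch;	/* list searched by lookups */
static int pernet_ready;			/* stands in for netns data */

static int pernet_setup(void)
{
	pernet_ready = 1;
	return 0;
}

/* An action init() that depends on the per-netns data being present. */
static int sample_act_init(void)
{
	return pernet_ready ? 0 : -EFAULT;
}

static int register_action_sketch(struct act_ops_sketch *ops)
{
	int err;

	assert(ops != NULL && ops->init != NULL);

	/* Step 1: set up per-namespace state while ops is still invisible. */
	err = pernet_setup();
	if (err)
		return err;

	/* Step 2: publish; any lookup from now on finds initialized state. */
	ops->next = act_base_sketch;
	act_base_sketch = ops;
	return 0;
}

static int create_action_sketch(const char *kind)
{
	struct act_ops_sketch *ops;

	for (ops = act_base_sketch; ops; ops = ops->next)
		if (strcmp(ops->kind, kind) == 0)
			return ops->init();
	return -ENOENT;
}
```

Before registration a lookup simply fails with -ENOENT, which corresponds
to case b) in the commit message: the worst outcome is that new actions
cannot be created yet.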

Re: [PATCH 1/1] vxlan: insert ipv6 macro

2016-10-12 Thread Jiri Benc

On Wed, 12 Oct 2016 21:01:54 +0800, zhuyj wrote:
> How to explain the following source code? As you mentioned,  are the
> #ifdefs in the following source pointless?

They are not, the code would not compile without them. Look how struct
vxlan_dev is defined.

Those are really basic questions you have. I suggest you try yourself
before asking such questions next time. In this case, you could
trivially remove the #ifdef and see for yourself, as I explained in the
previous email. Please do not try to offload your homework to other
people. It's very obvious you didn't even try to understand this, even
after the feedback you received.

And do not top post.

Thanks,

 Jiri

Re: [PATCH 1/1] vxlan: insert ipv6 macro

2016-10-12 Thread zhuyj

Hi, Jiri

How to explain the following source code? As you mentioned,  are the
#ifdefs in the following source pointless?
As to the previous patch, I will compile and analyze it. But now I am
busy with something else. After I draw a conclusion, I will let you
know.

Thanks for your reply.

 static void vxlan_sock_release(struct vxlan_dev *vxlan)
{
bool ipv4 = __vxlan_sock_release_prep(vxlan->vn4_sock);
#if IS_ENABLED(CONFIG_IPV6)
bool ipv6 = __vxlan_sock_release_prep(vxlan->vn6_sock);
#endif
synchronize_net();
if (ipv4) {
udp_tunnel_sock_release(vxlan->vn4_sock->sock);
kfree(vxlan->vn4_sock);
}
#if IS_ENABLED(CONFIG_IPV6)
if (ipv6) {
udp_tunnel_sock_release(vxlan->vn6_sock->sock);
kfree(vxlan->vn6_sock);
}
#endif
}

On Tue, Oct 11, 2016 at 10:06 PM, Jiri Benc  wrote:
> On Tue, 11 Oct 2016 16:23:31 +0800, zyjzyj2...@gmail.com wrote:
>> --- a/drivers/net/vxlan.c
>> +++ b/drivers/net/vxlan.c
>> @@ -2647,15 +2647,15 @@ static struct socket *vxlan_create_sock(struct net 
>> *net, bool ipv6,
>>   int err;
>>
>>   memset(&udp_conf, 0, sizeof(udp_conf));
>> -
>> +#if IS_ENABLED(CONFIG_IPV6)
>>   if (ipv6) {
>>   udp_conf.family = AF_INET6;
>>   udp_conf.use_udp6_rx_checksums =
>>   !(flags & VXLAN_F_UDP_ZERO_CSUM6_RX);
>>   udp_conf.ipv6_v6only = 1;
>> - } else {
>> + } else
>> +#endif
>>   udp_conf.family = AF_INET;
>> - }
>
> Zhu Yanjun, before posting patches such as the previous ones or
> this one, please test whether they make any difference. In this case,
> try to compile the code with IPv6 disabled before and after this patch,
> disassemble and compare the results. You'll see that this patch is
> pointless.
>
> It's pretty obvious from the code but to be really sure, I've just
> quickly built the vxlan module with IPv6 disabled. And indeed, as
> expected, the compiler just inlined everything into vxlan_open. The
> whole chain vxlan_open -> vxlan_sock_add -> __vxlan_sock_add (note that
> there's only a single caller of __vxlan_sock_add with IPv6 disabled) ->
> vxlan_socket_create -> vxlan_create_sock is inlined.
>
> It also means the code in the "if (ipv6)" branch is completely
> eliminated by the compiler even without ugly #ifdefs.
>
>  Jiri
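
The point about the "if (ipv6)" branch can be pictured with a reduced
user-space sketch: when the only caller passes a constant false and the
function is inlined, the compiler can drop the IPv6 branch without any
#ifdef. The names and the two address-family constants are stand-ins
chosen for this example, and whether the branch is really removed depends
on the compiler and optimization level, so it is best confirmed by
disassembling, as suggested above.

```c
#include <assert.h>

#define SKETCH_AF_INET  2
#define SKETCH_AF_INET6 10

/* Stand-in for the socket-creation helper taking an ipv6 flag. */
static inline int create_sock_family(int ipv6)
{
	if (ipv6)
		return SKETCH_AF_INET6;
	return SKETCH_AF_INET;
}

/*
 * With IPv6 disabled there is a single caller, and it passes a constant
 * false, so the IPv6 branch above is dead code after inlining.
 */
static int open_ipv4_only(void)
{
	int family = create_sock_family(0);

	assert(family == SKETCH_AF_INET);
	return family;
}
```

The #if blocks in vxlan_sock_release() are a different case: they guard a
struct member that does not exist when IPv6 is disabled, so removing them
breaks the build instead of merely leaving dead code.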

[PATCH iproute2 net-next] bridge: vlan: remove wrong stats help

2016-10-12 Thread Nikolay Aleksandrov

When I did the per-vlan stats iproute2 support, I left out a hunk from a
previous version of the patch that was using a special subcommand "stats".
Since the latest version uses the -s switch remove the help for the stats
subcommand.

Fixes: 7abf5de677e32 ("bridge: vlan: add support to display per-vlan 
statistics")
Signed-off-by: Nikolay Aleksandrov 
---
The commit is only in iproute2 net-next branch.

 bridge/vlan.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/bridge/vlan.c b/bridge/vlan.c
index 0b6c69077160..ebcdacee309b 100644
--- a/bridge/vlan.c
+++ b/bridge/vlan.c
@@ -24,7 +24,6 @@ static void usage(void)
fprintf(stderr, "Usage: bridge vlan { add | del } vid VLAN_ID dev DEV [ 
pvid ] [ untagged ]\n");
fprintf(stderr, " [ 
self ] [ master ]\n");
fprintf(stderr, "   bridge vlan { show } [ dev DEV ] [ vid VLAN_ID 
]\n");
-   fprintf(stderr, "   bridge vlan { stats } [ dev DEV ] [ vid VLAN_ID 
]\n");
exit(-1);
 }
 
-- 
2.1.4

[patch net-next RFC 5/6] mlxsw: reg: add the Monitoring Packet Sampling Configuration Register

2016-10-12 Thread Jiri Pirko

From: Yotam Gigi 

The MPSC register allows configuring ingress packet sampling on a specific
port of the mlxsw device. The sampled packets are then trapped via the
PKT_SAMPLE trap.

Signed-off-by: Yotam Gigi 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 43 +++
 1 file changed, 43 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h 
b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 6460c72..e657865 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -4832,6 +4832,47 @@ static inline void mlxsw_reg_mlcr_pack(char *payload, u8 
local_port,
   MLXSW_REG_MLCR_DURATION_MAX : 0);
 }
 
+/* MPSC - Monitoring Packet Sampling Configuration Register
+ * 
+ * MPSC Register is used to configure the Packet Sampling mechanism.
+ */
+#define MLXSW_REG_MPSC_ID 0x9080
+#define MLXSW_REG_MPSC_LEN 0x14
+
+static const struct mlxsw_reg_info mlxsw_reg_mpsc = {
+   .id = MLXSW_REG_MPSC_ID,
+   .len = MLXSW_REG_MPSC_LEN,
+};
+
+/* reg_mpsc_local_port
+ * Local port number
+ * Not supported for CPU port
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, mpsc, local_port, 0x00, 16, 8);
+
+/* reg_mpsc_e
+ * Enable sampling on port local_port
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, mpsc, e, 0x04, 30, 1);
+
+/* reg_mpsc_rate
+ * Sampling rate = 1 out of rate packets (with randomization around
+ * the point). Valid values are: 1 to 3.5*10^9
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, mpsc, rate, 0x08, 0, 32);
+
+static inline void mlxsw_reg_mpsc_pack(char *payload, u8 local_port, bool e,
+  u32 rate)
+{
+   MLXSW_REG_ZERO(mpsc, payload);
+   mlxsw_reg_mpsc_local_port_set(payload, local_port);
+   mlxsw_reg_mpsc_e_set(payload, e);
+   mlxsw_reg_mpsc_rate_set(payload, rate);
+}
+
 /* SBPR - Shared Buffer Pools Register
  * ---
  * The SBPR configures and retrieves the shared buffer pools and configuration.
@@ -5367,6 +5408,8 @@ static inline const char *mlxsw_reg_id_str(u16 reg_id)
return "MTMP";
case MLXSW_REG_MLCR_ID:
return "MLCR";
+   case MLXSW_REG_MPSC_ID:
+   return "MPSC";
case MLXSW_REG_SBPR_ID:
return "SBPR";
case MLXSW_REG_SBCM_ID:
-- 
2.5.5

[patch net-next RFC 6/6] mlxsw: packet sample: Add packet sample offloading support

2016-10-12 Thread Jiri Pirko

From: Yotam Gigi 

Using the MPSC register, add the functions that configure port packet
sampling in hardware and the necessary datatypes in the mlxsw_sp_port
struct. In addition, add the necessary trap for sampled packets and
integrate with matchall offloading to allow offloading of the sample tc
action.

The current offload support is for the tc command:

tc filter add dev  parent :   \
  matchall   \
  action sample rate  mark  [trunc ]   \
[src ] [dst ] [type ]

Where only ingress qdiscs are supported, and only a combination of
matchall classifier and sample action will lead to activating hardware
packet sampling.

Signed-off-by: Yotam Gigi 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 116 +++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  11 +++
 drivers/net/ethernet/mellanox/mlxsw/trap.h |   1 +
 3 files changed, 122 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 1ec0a4c..f9504ae 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -57,6 +57,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "spectrum.h"
 #include "core.h"
@@ -467,6 +468,16 @@ static void mlxsw_sp_span_mirror_remove(struct 
mlxsw_sp_port *from,
mlxsw_sp_span_inspected_port_unbind(from, span_entry, type);
 }
 
+static int mlxsw_sp_port_sample_set(struct mlxsw_sp_port *mlxsw_sp_port,
+   bool enable, u32 rate)
+{
+   struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+   char mpsc_pl[MLXSW_REG_MPSC_LEN];
+
+   mlxsw_reg_mpsc_pack(mpsc_pl, mlxsw_sp_port->local_port, enable, rate);
+   return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(mpsc), mpsc_pl);
+}
+
 static int mlxsw_sp_port_admin_status_set(struct mlxsw_sp_port *mlxsw_sp_port,
  bool is_up)
 {
@@ -1221,6 +1232,46 @@ err_mirror_add:
return err;
 }
 
+static int
+mlxsw_sp_port_add_cls_matchall_sample(struct mlxsw_sp_port *mlxsw_sp_port,
+ struct tc_cls_matchall_offload *cls,
+ const struct tc_action *a,
+ bool ingress)
+{
+   struct mlxsw_sp_port_mall_tc_entry *mall_tc_entry;
+   int err;
+
+   if (mlxsw_sp_port->sample.enable) {
+   netdev_err(mlxsw_sp_port->dev, "Sample already active\n");
+   return -EEXIST;
+   }
+
+   err = mlxsw_sp_port_sample_set(mlxsw_sp_port, true, tcf_sample_rate(a));
+   if (err)
+   return err;
+
+   mlxsw_sp_port->sample.enable = true;
+   mlxsw_sp_port->sample.mark = tcf_sample_mark(a);
+   mlxsw_sp_port->sample.truncate = tcf_sample_truncate(a);
+   mlxsw_sp_port->sample.trunc_size = tcf_sample_trunc_size(a);
+   mlxsw_sp_port->sample.eth_type = tcf_sample_eth_type(a);
+   mlxsw_sp_port->sample.eth_type_set = tcf_sample_eth_type_set(a);
+   tcf_sample_eth_dst_addr(a, mlxsw_sp_port->sample.eth_dst);
+   tcf_sample_eth_src_addr(a, mlxsw_sp_port->sample.eth_src);
+
+   netdev_dbg(mlxsw_sp_port->dev, "Activate hardware sample\n");
+
+   mall_tc_entry = kzalloc(sizeof(*mall_tc_entry), GFP_KERNEL);
+   if (!mall_tc_entry)
+   return -ENOMEM;
+
+   mall_tc_entry->cookie = cls->cookie;
+   mall_tc_entry->type = MLXSW_SP_PORT_MALL_SAMPLE;
+   list_add_tail(&mall_tc_entry->list, &mlxsw_sp_port->mall_tc_list);
+
+   return 0;
+}
+
 static int mlxsw_sp_port_add_cls_matchall(struct mlxsw_sp_port *mlxsw_sp_port,
  __be16 protocol,
  struct tc_cls_matchall_offload *cls,
@@ -1236,15 +1287,19 @@ static int mlxsw_sp_port_add_cls_matchall(struct mlxsw_sp_port *mlxsw_sp_port,
}
 
tcf_exts_to_list(cls->exts, );
-   list_for_each_entry(a, , list) {
-   if (!is_tcf_mirred_mirror(a) || protocol != htons(ETH_P_ALL))
-   return -ENOTSUPP;
+   a = list_first_entry(, struct tc_action, list);
 
+   if (is_tcf_mirred_mirror(a) && protocol == htons(ETH_P_ALL))
err = mlxsw_sp_port_add_cls_matchall_mirror(mlxsw_sp_port, cls,
a, ingress);
-   if (err)
-   return err;
-   }
+   else if (is_tcf_sample(a) && protocol == htons(ETH_P_ALL))
+   err = mlxsw_sp_port_add_cls_matchall_sample(mlxsw_sp_port, cls,
+   a, ingress);
+   else
+   return -ENOTSUPP;
+
+   if (err)
+   return err;
 
return 0;
 }
@@ -1272,6 +1327,10 @@ static void

[patch net-next RFC 2/6] act_ife: Change to use ife module

2016-10-12 Thread Jiri Pirko

From: Yotam Gigi 

Use the encode/decode functionality from the ife module instead of the
implementation private to act_ife.

Signed-off-by: Yotam Gigi 
Signed-off-by: Jiri Pirko 
---
 include/net/tc_act/tc_ife.h|   3 -
 include/uapi/linux/tc_act/tc_ife.h |  10 +---
 net/sched/Kconfig  |   1 +
 net/sched/act_ife.c| 109 +++--
 4 files changed, 34 insertions(+), 89 deletions(-)

diff --git a/include/net/tc_act/tc_ife.h b/include/net/tc_act/tc_ife.h
index 9fd2bea0..30ba459 100644
--- a/include/net/tc_act/tc_ife.h
+++ b/include/net/tc_act/tc_ife.h
@@ -6,7 +6,6 @@
 #include 
 #include 
 
-#define IFE_METAHDRLEN 2
 struct tcf_ife_info {
struct tc_action common;
u8 eth_dst[ETH_ALEN];
@@ -45,8 +44,6 @@ struct tcf_meta_ops {
 
 int ife_get_meta_u32(struct sk_buff *skb, struct tcf_meta_info *mi);
 int ife_get_meta_u16(struct sk_buff *skb, struct tcf_meta_info *mi);
-int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
-   const void *dval);
 int ife_alloc_meta_u32(struct tcf_meta_info *mi, void *metaval, gfp_t gfp);
 int ife_alloc_meta_u16(struct tcf_meta_info *mi, void *metaval, gfp_t gfp);
 int ife_check_meta_u32(u32 metaval, struct tcf_meta_info *mi);
diff --git a/include/uapi/linux/tc_act/tc_ife.h b/include/uapi/linux/tc_act/tc_ife.h
index cd18360..7c28178 100644
--- a/include/uapi/linux/tc_act/tc_ife.h
+++ b/include/uapi/linux/tc_act/tc_ife.h
@@ -3,6 +3,7 @@
 
 #include 
 #include 
+#include 
 
 #define TCA_ACT_IFE 25
 /* Flag bits for now just encoding/decoding; mutually exclusive */
@@ -28,13 +29,4 @@ enum {
 };
 #define TCA_IFE_MAX (__TCA_IFE_MAX - 1)
 
-#define IFE_META_SKBMARK 1
-#define IFE_META_HASHID 2
-#defineIFE_META_PRIO 3
-#defineIFE_META_QMAP 4
-#defineIFE_META_TCINDEX 5
-/*Can be overridden at runtime by module option*/
-#define__IFE_META_MAX 6
-#define IFE_META_MAX (__IFE_META_MAX - 1)
-
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 87956a7..24f7cac 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -763,6 +763,7 @@ config NET_ACT_SKBMOD
 config NET_ACT_IFE
 tristate "Inter-FE action based on IETF ForCES InterFE LFB"
 depends on NET_CLS_ACT
+select NET_IFE
 ---help---
  Say Y here to allow for sourcing and terminating metadata
  For details refer to netdev01 paper:
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 95c463c..5c2478a 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define IFE_TAB_MASK 15
 
@@ -46,23 +47,6 @@ static const struct nla_policy ife_policy[TCA_IFE_MAX + 1] = {
[TCA_IFE_TYPE] = { .type = NLA_U16},
 };
 
-/* Caller takes care of presenting data in network order
-*/
-int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen, const void *dval)
-{
-   u32 *tlv = (u32 *)(skbdata);
-   u16 totlen = nla_total_size(dlen);  /*alignment + hdr */
-   char *dptr = (char *)tlv + NLA_HDRLEN;
-   u32 htlv = attrtype << 16 | (dlen + NLA_HDRLEN);
-
-   *tlv = htonl(htlv);
-   memset(dptr, 0, totlen - NLA_HDRLEN);
-   memcpy(dptr, dval, dlen);
-
-   return totlen;
-}
-EXPORT_SYMBOL_GPL(ife_tlv_meta_encode);
-
 int ife_encode_meta_u16(u16 metaval, void *skbdata, struct tcf_meta_info *mi)
 {
u16 edata = 0;
@@ -637,69 +621,60 @@ int find_decode_metaid(struct sk_buff *skb, struct tcf_ife_info *ife,
return 0;
 }
 
-struct ifeheadr {
-   __be16 metalen;
-   u8 tlv_data[];
-};
-
-struct meta_tlvhdr {
-   __be16 type;
-   __be16 len;
-};
-
 static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
  struct tcf_result *res)
 {
struct tcf_ife_info *ife = to_ife(a);
+   u32 at = G_TC_AT(skb->tc_verd);
int action = ife->tcf_action;
-   struct ifeheadr *ifehdr = (struct ifeheadr *)skb->data;
-   int ifehdrln = (int)ifehdr->metalen;
-   struct meta_tlvhdr *tlv = (struct meta_tlvhdr *)(ifehdr->tlv_data);
+   u8 *ifehdr_end;
+   u8 *tlv_data;
+   u16 metalen;
 
spin_lock(&ife->tcf_lock);
bstats_update(&ife->tcf_bstats, skb);
tcf_lastuse_update(&ife->tcf_tm);
spin_unlock(&ife->tcf_lock);
 
-   ifehdrln = ntohs(ifehdrln);
-   if (unlikely(!pskb_may_pull(skb, ifehdrln))) {
+   if (!(at & AT_EGRESS))
+   skb_push(skb, skb->dev->hard_header_len);
+
+   tlv_data = ife_decode(skb, &metalen);
+   if (unlikely(!tlv_data)) {
spin_lock(&ife->tcf_lock);
ife->tcf_qstats.drops++;
spin_unlock(&ife->tcf_lock);
return TC_ACT_SHOT;
}
 
-   skb_set_mac_header(skb, ifehdrln);
-   __skb_pull(skb, ifehdrln);
-   skb->protocol = eth_type_trans(skb, skb->dev);
-   ifehdrln -=

[patch net-next RFC 3/6] ife: Introduce new metadata tlv types

2016-10-12 Thread Jiri Pirko

From: Yotam Gigi 

 - IFE_META_IFINDEX: pass the ifindex value as part of the ife
   metadata
 - IFE_META_ORIGSIZE: pass the original packet size as part of the ife
   metadata; useful when the packet has been truncated
 - IFE_META_SIZE: pass the size of the encapsulated packet as part of
   the ife metadata

Signed-off-by: Yotam Gigi 
Signed-off-by: Jiri Pirko 
---
 include/uapi/linux/ife.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/uapi/linux/ife.h b/include/uapi/linux/ife.h
index abec4e0..990f29e 100644
--- a/include/uapi/linux/ife.h
+++ b/include/uapi/linux/ife.h
@@ -8,6 +8,9 @@
 #define IFE_META_PRIO 3
 #define IFE_META_QMAP 4
 #define IFE_META_TCINDEX 5
+#define IFE_META_IFINDEX 6
+#define IFE_META_ORIGSIZE 7
+#define IFE_META_SIZE 8
 
 /*Can be overridden at runtime by module option*/
 #define __IFE_META_MAX 6
-- 
2.5.5

[patch net-next RFC 0/6] Add support for offloading packet-sampling

2016-10-12 Thread Jiri Pirko

From: Jiri Pirko 

Add the sample tc action, which allows sampling packets that match
a classifier. The sample action picks packets at random, duplicates them,
truncates them and adds informative metadata to the packet, for example
the input interface and the original packet length. The sampled packets
are marked to allow matching them and redirecting them to a specific
collector device.

The sampled packets' metadata is packed using ife encapsulation. To do
that, this patch-set extracts the ife logic from the tc_ife action into an
independent ife module, and uses that functionality to pack the metadata.
To include all the needed metadata, this patch-set introduces some new
IFE_META tlv types.

In addition, add support for offloading the matchall-sample tc command
in the Mellanox mlxsw driver, for ingress qdiscs.

Yotam Gigi (6):
  Introduce ife encapsulation module
  act_ife: Change to use ife module
  ife: Introduce new metadata tlv types
  Introduce sample tc action
  mlxsw: reg: add the Monitoring Packet Sampling Configuration Register
  mlxsw: packet sample: Add packet sample offloading support

 MAINTAINERS|   7 +
 drivers/net/ethernet/mellanox/mlxsw/reg.h  |  43 
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 116 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  11 +
 drivers/net/ethernet/mellanox/mlxsw/trap.h |   1 +
 include/net/ife.h  |  19 ++
 include/net/tc_act/tc_ife.h|   3 -
 include/net/tc_act/tc_sample.h |  88 
 include/uapi/linux/Kbuild  |   1 +
 include/uapi/linux/ife.h   |  19 ++
 include/uapi/linux/tc_act/Kbuild   |   1 +
 include/uapi/linux/tc_act/tc_ife.h |  10 +-
 include/uapi/linux/tc_act/tc_sample.h  |  31 +++
 net/Kconfig|   1 +
 net/Makefile   |   1 +
 net/ife/Kconfig|  16 ++
 net/ife/Makefile   |   5 +
 net/ife/ife.c  | 147 
 net/sched/Kconfig  |  14 ++
 net/sched/Makefile |   1 +
 net/sched/act_ife.c| 109 +++--
 net/sched/act_sample.c | 300 +
 22 files changed, 849 insertions(+), 95 deletions(-)
 create mode 100644 include/net/ife.h
 create mode 100644 include/net/tc_act/tc_sample.h
 create mode 100644 include/uapi/linux/ife.h
 create mode 100644 include/uapi/linux/tc_act/tc_sample.h
 create mode 100644 net/ife/Kconfig
 create mode 100644 net/ife/Makefile
 create mode 100644 net/ife/ife.c
 create mode 100644 net/sched/act_sample.c

-- 
2.5.5

Re: [PATCH] iwlwifi: pcie: reduce "unsupported splx" to a warning

2016-10-12 Thread Paul Bolle

On Wed, 2016-10-12 at 15:24 +0300, Luca Coelho wrote:
> Okay... Actually this is a structure in the BIOS and the actual method
> we call is SPLC.  The SPLC method may return one item from this table,
> or something entirely different, possibly one of the three values
> depending on a configuration option or so.
> 
> Can you find and send me the actual SPLC method that we call, from
> your BIOS?

It seems Chris and I basically have identical setups, so I'll answer.

There are 20 SPLC methods in the BIOS. The first reads
Method (SPLC, 0, Serialized)
{
DerefOf (SPLX [One]) [Zero] = DOM1 /* \DOM1 */
DerefOf (SPLX [One]) [One] = LIM1 /* \LIM1 */
DerefOf (SPLX [One]) [0x02] = TIM1 /* \TIM1 */
DerefOf (SPLX [0x02]) [Zero] = DOM2 /* \DOM2 */
DerefOf (SPLX [0x02]) [One] = LIM2 /* \LIM2 */
DerefOf (SPLX [0x02]) [0x02] = TIM2 /* \TIM2 */
DerefOf (SPLX [0x03]) [Zero] = DOM3 /* \DOM3 */
DerefOf (SPLX [0x03]) [One] = LIM3 /* \LIM3 */
DerefOf (SPLX [0x03]) [0x02] = TIM3 /* \TIM3 */
Return (SPLX) /* \_SB_.PCI0.RP01.PXSX.SPLX */
}

The only difference is in the last comment. Ie, RP01 is increased until
it reaches RP20. (The machine has 20 PCI devices according to lspci. I
have no clue how to match that RPxx number to the 20 devices showing up
in lspci, sorry.)


Paul Bolle

[patch net-next RFC 4/6] Introduce sample tc action

2016-10-12 Thread Jiri Pirko

From: Yotam Gigi 

This action allows the user to sample traffic matched by a tc classifier.
The sampling consists of choosing packets randomly, truncating them,
adding some informative metadata regarding the interface and the original
packet size, and marking them with a specific mark to allow further tc rules
to match and process them. The marked sample packets are then injected into
the device ingress qdisc using netif_receive_skb.

The packet metadata is packed using the ife encapsulation protocol, and
the outer packet's ethernet dest, source and eth_type, along with the
rate, mark and the optional truncation size can be configured from
userspace.

Example:
To sample ingress traffic from interface eth1, and redirect the sampled
packets to interface dummy0, one may use the commands:

tc qdisc add dev eth1 handle : ingress

tc filter add dev eth1 parent : \
   matchall action sample rate 12 mark 17

tc filter add parent : dev eth1 protocol all \
   u32 match mark 172 0xff \
   action mirred egress redirect dev dummy0

Where the first command adds an ingress qdisc and the second starts
sampling every 12th packet on dev eth1 and marks the sampled packets with
17. The third command catches the sampled packets, which are marked with
17, and redirects them to dev dummy0.
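
For illustration only, the "every 12th packet" behaviour described above
could be sketched in userspace C roughly as below. This assumes a
deterministic per-action counter; the names sampler and sampler_hit are
hypothetical and not taken from act_sample.c, which may pick packets
randomly instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-action state, loosely modelled on the rate and
 * packet_counter fields of struct tcf_sample. */
struct sampler {
	uint32_t rate;           /* sample one packet out of every `rate` */
	uint32_t packet_counter; /* packets seen since the last sample */
};

/* Returns true when the current packet should be sampled. */
static bool sampler_hit(struct sampler *s)
{
	if (s->rate == 0)
		return false;            /* sampling disabled */
	if (++s->packet_counter < s->rate)
		return false;            /* not the Nth packet yet */
	s->packet_counter = 0;           /* restart the window */
	return true;
}
```

With rate set to 12, the first eleven calls return false and the twelfth
returns true, after which the window starts over.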

Signed-off-by: Yotam Gigi 
Signed-off-by: Jiri Pirko 
---
 include/net/tc_act/tc_sample.h|  88 ++
 include/uapi/linux/tc_act/Kbuild  |   1 +
 include/uapi/linux/tc_act/tc_sample.h |  31 
 net/sched/Kconfig |  13 ++
 net/sched/Makefile|   1 +
 net/sched/act_sample.c| 300 ++
 6 files changed, 434 insertions(+)
 create mode 100644 include/net/tc_act/tc_sample.h
 create mode 100644 include/uapi/linux/tc_act/tc_sample.h
 create mode 100644 net/sched/act_sample.c

diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
new file mode 100644
index 000..a2b445a
--- /dev/null
+++ b/include/net/tc_act/tc_sample.h
@@ -0,0 +1,88 @@
+#ifndef __NET_TC_SAMPLE_H
+#define __NET_TC_SAMPLE_H
+
+#include 
+#include 
+
+struct tcf_sample {
+   struct tc_actioncommon;
+   u32 rate;
+   u32 mark;
+   booltruncate;
+   u32 trunc_size;
+   u32 packet_counter;
+   u8  eth_dst[ETH_ALEN];
+   u8  eth_src[ETH_ALEN];
+   u16 eth_type;
+   booleth_type_set;
+   struct list_headtcfm_list;
+};
+#define to_sample(a) ((struct tcf_sample *)a)
+
+struct sample_packet_metadata {
+   int sample_size;
+   int orig_size;
+   int ifindex;
+};
+
+#if IS_ENABLED(CONFIG_NET_ACT_SAMPLE)
+struct ethhdr *sample_packet_pack(struct sk_buff *skb,
+ struct sample_packet_metadata *metadata);
+#else
+static inline struct ethhdr *sample_packet_pack(struct sk_buff *skb,
+ struct sample_packet_metadata *metadata)
+{
+   return NULL;
+}
+#endif
+
+static inline bool is_tcf_sample(const struct tc_action *a)
+{
+#ifdef CONFIG_NET_CLS_ACT
+   return a->ops && a->ops->type == TCA_ACT_SAMPLE;
+#else
+   return false;
+#endif
+}
+
+static inline __u32 tcf_sample_mark(const struct tc_action *a)
+{
+   return to_sample(a)->mark;
+}
+
+static inline __u32 tcf_sample_rate(const struct tc_action *a)
+{
+   return to_sample(a)->rate;
+}
+
+static inline bool tcf_sample_truncate(const struct tc_action *a)
+{
+   return to_sample(a)->truncate;
+}
+
+static inline int tcf_sample_trunc_size(const struct tc_action *a)
+{
+   return to_sample(a)->trunc_size;
+}
+
+static inline u16 tcf_sample_eth_type(const struct tc_action *a)
+{
+   return to_sample(a)->eth_type;
+}
+
+static inline bool tcf_sample_eth_type_set(const struct tc_action *a)
+{
+   return to_sample(a)->eth_type_set;
+}
+
+static inline void tcf_sample_eth_dst_addr(const struct tc_action *a, u8 *dst)
+{
+   ether_addr_copy(dst, to_sample(a)->eth_dst);
+}
+
+static inline void tcf_sample_eth_src_addr(const struct tc_action *a, u8 *src)
+{
+   ether_addr_copy(src, to_sample(a)->eth_src);
+}
+
+#endif /* __NET_TC_SAMPLE_H */
diff --git a/include/uapi/linux/tc_act/Kbuild b/include/uapi/linux/tc_act/Kbuild
index e3969bd..6c6b8d6 100644
--- a/include/uapi/linux/tc_act/Kbuild
+++ b/include/uapi/linux/tc_act/Kbuild
@@ -4,6 +4,7 @@ header-y += tc_defact.h
 header-y += tc_gact.h
 header-y += tc_ipt.h
 header-y += tc_mirred.h
+header-y += tc_sample.h
 header-y += tc_nat.h
 header-y += tc_pedit.h
 header-y += tc_skbedit.h
diff --git a/include/uapi/linux/tc_act/tc_sample.h b/include/uapi/linux/tc_act/tc_sample.h
new file mode 100644
index 000..654945b
--- /dev/null
+++

[patch net-next RFC 1/6] Introduce ife encapsulation module

2016-10-12 Thread Jiri Pirko

From: Yotam Gigi 

This module is responsible for the ife encapsulation protocol
encode/decode logic. The module provides:
 - ife_encode: encode skb and reserve space for the ife meta header
 - ife_decode: decode skb and extract the meta header size
 - ife_tlv_meta_encode: encode one tlv entry into the reserved ife
   header space
 - ife_tlv_meta_decode: decode one tlv entry from the packet
 - ife_tlv_meta_next: advance to the next tlv
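
As a rough userspace sketch of the tlv layout these helpers deal with: a
16-bit type and a 16-bit length (header included) in network byte order,
followed by the value padded to a 4-byte boundary, which is the layout the
existing ife_tlv_meta_encode() in act_ife produces. tlv_encode and
tlv_decode are hypothetical names used for illustration, not the functions
in net/ife/ife.c, and dlen is assumed to be small enough that header plus
value fits in 16 bits.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TLV_HDRLEN 4u                       /* 16-bit type + 16-bit length */
#define TLV_ALIGN(len) (((len) + 3u) & ~3u) /* round up to 4 bytes */

/* Write one tlv at buf and return the bytes consumed, padding included.
 * The caller presents dval in network order and provides enough room. */
static size_t tlv_encode(void *buf, uint16_t type, uint16_t dlen,
			 const void *dval)
{
	size_t totlen = TLV_ALIGN(TLV_HDRLEN + dlen);
	uint32_t hdr = htonl(((uint32_t)type << 16) | (TLV_HDRLEN + dlen));
	unsigned char *p = buf;

	memcpy(p, &hdr, sizeof(hdr));
	memset(p + TLV_HDRLEN, 0, totlen - TLV_HDRLEN); /* zero the padding */
	memcpy(p + TLV_HDRLEN, dval, dlen);
	return totlen;
}

/* Read one tlv header at buf and return a pointer to its value. */
static const void *tlv_decode(const void *buf, uint16_t *type,
			      uint16_t *dlen, size_t *totlen)
{
	uint32_t hdr;

	memcpy(&hdr, buf, sizeof(hdr));
	hdr = ntohl(hdr);
	*type = (uint16_t)(hdr >> 16);
	*dlen = (uint16_t)((hdr & 0xffffu) - TLV_HDRLEN);
	*totlen = TLV_ALIGN(TLV_HDRLEN + *dlen);
	return (const unsigned char *)buf + TLV_HDRLEN;
}
```

Advancing to the next tlv then amounts to adding the returned total length
to the current position.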

Signed-off-by: Yotam Gigi 
Signed-off-by: Jiri Pirko 
---
 MAINTAINERS   |   7 +++
 include/net/ife.h |  19 ++
 include/uapi/linux/Kbuild |   1 +
 include/uapi/linux/ife.h  |  16 +
 net/Kconfig   |   1 +
 net/Makefile  |   1 +
 net/ife/Kconfig   |  16 +
 net/ife/Makefile  |   5 ++
 net/ife/ife.c | 147 ++
 9 files changed, 213 insertions(+)
 create mode 100644 include/net/ife.h
 create mode 100644 include/uapi/linux/ife.h
 create mode 100644 net/ife/Kconfig
 create mode 100644 net/ife/Makefile
 create mode 100644 net/ife/ife.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 464437d..8f6741f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6042,6 +6042,13 @@ F:   include/net/cfg802154.h
 F: include/net/ieee802154_netdev.h
 F: Documentation/networking/ieee802154.txt
 
+IFE PROTOCOL
+M: Yotam Gigi 
+M: Jamal Hadi Salim 
+F: net/ife
+F: include/net/ife.h
+F: include/uapi/linux/ife.h
+
 IGORPLUG-USB IR RECEIVER
 M: Sean Young 
 L: linux-me...@vger.kernel.org
diff --git a/include/net/ife.h b/include/net/ife.h
new file mode 100644
index 000..05d7bbb
--- /dev/null
+++ b/include/net/ife.h
@@ -0,0 +1,19 @@
+#ifndef __NET_IFE_H
+#define __NET_IFE_H
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+void *ife_encode(struct sk_buff *skb, u16 metalen);
+void *ife_decode(struct sk_buff *skb, u16 *metalen);
+
+void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen);
+int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
+   const void *dval);
+
+void *ife_tlv_meta_next(void *skbdata);
+
+#endif // __NET_IFE_H
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index d0352a9..27f39bc 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -190,6 +190,7 @@ header-y += if_tun.h
 header-y += if_tunnel.h
 header-y += if_vlan.h
 header-y += if_x25.h
+header-y += ife.h
 header-y += igmp.h
 header-y += ila.h
 header-y += in6.h
diff --git a/include/uapi/linux/ife.h b/include/uapi/linux/ife.h
new file mode 100644
index 000..abec4e0
--- /dev/null
+++ b/include/uapi/linux/ife.h
@@ -0,0 +1,16 @@
+#ifndef __UAPI_IFE_H
+#define __UAPI_IFE_H
+
+#define IFE_METAHDRLEN 2
+
+#define IFE_META_SKBMARK 1
+#define IFE_META_HASHID 2
+#define IFE_META_PRIO 3
+#define IFE_META_QMAP 4
+#define IFE_META_TCINDEX 5
+
+/*Can be overridden at runtime by module option*/
+#define __IFE_META_MAX 6
+#define IFE_META_MAX (__IFE_META_MAX - 1)
+
+#endif
diff --git a/net/Kconfig b/net/Kconfig
index 7b6cd34..3cf29b1 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -393,6 +393,7 @@ source "net/9p/Kconfig"
 source "net/caif/Kconfig"
 source "net/ceph/Kconfig"
 source "net/nfc/Kconfig"
+source "net/ife/Kconfig"
 
 config LWTUNNEL
bool "Network light weight tunnels"
diff --git a/net/Makefile b/net/Makefile
index 4cafaa2..4ddc67e 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -69,6 +69,7 @@ obj-$(CONFIG_DNS_RESOLVER)+= dns_resolver/
 obj-$(CONFIG_CEPH_LIB) += ceph/
 obj-$(CONFIG_BATMAN_ADV)   += batman-adv/
 obj-$(CONFIG_NFC)  += nfc/
+obj-$(CONFIG_NET_IFE)  += ife/
 obj-$(CONFIG_OPENVSWITCH)  += openvswitch/
 obj-$(CONFIG_VSOCKETS) += vmw_vsock/
 obj-$(CONFIG_MPLS) += mpls/
diff --git a/net/ife/Kconfig b/net/ife/Kconfig
new file mode 100644
index 000..31e48b6
--- /dev/null
+++ b/net/ife/Kconfig
@@ -0,0 +1,16 @@
+#
+# IFE subsystem configuration
+#
+
+menuconfig NET_IFE
+   depends on NET
+tristate "Inter-FE based on IETF ForCES InterFE LFB"
+   default n
+   help
+ Say Y here to add support of IFE encapsulation protocol
+ For details refer to netdev01 paper:
+ "Distributing Linux Traffic Control Classifier-Action Subsystem"
+  Authors: Jamal Hadi Salim and Damascene M. Joachimpillai
+
+ To compile this support as a module, choose M here: the module will
+ be called ife.
diff --git a/net/ife/Makefile b/net/ife/Makefile
new file mode 100644
index 000..2a90d97
--- /dev/null
+++ b/net/ife/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the IFE encapsulation protocol
+#
+
+obj-$(CONFIG_NET_IFE) += ife.o
diff --git a/net/ife/ife.c b/net/ife/ife.c
new file mode 100644
index 000..6feaa9d
--- /dev/null
+++ b/net/ife/ife.c
@@ -0,0

Re: [PATCH] iwlwifi: pcie: reduce "unsupported splx" to a warning

2016-10-12 Thread Luca Coelho

On Tue, 2016-10-11 at 23:32 -0500, Chris Rorvick wrote:
> On Tue, Oct 11, 2016 at 5:11 AM, Paul Bolle  wrote:
> > For what it's worth, on my machine I have twenty (!) SPLX entries, all
> > reading:
> > Name (SPLX, Package (0x04)
> > {
> > Zero,
> > Package (0x03)
> > {
> > 0x8000,
> > 0x8000,
> > 0x8000
> > },
> > 
> > Package (0x03)
> > {
> >0x8000,
> >0x8000,
> >0x8000
> > },
> > 
> > Package (0x03)
> > {
> > 0x8000,
> > 0x8000,
> > 0x8000
> > }
> > })
> 
> 
> I actually see exactly the same on my Dell XPS 13 (9350) when I  use
> acpidump, etc.  I typed the entry I included in the commit log by hand
> based on what the driver gets back from the SPLC method (I added a
> function to dump the returned object.)

Okay... Actually this is a structure in the BIOS and the actual method
we call is SPLC.  The SPLC method may return one item from this table,
or something entirely different, possibly one of the three values
depending on a configuration option or so.

Can you find and send me the actual SPLC method that we call, from
your BIOS?

--
Cheers,
Luca.

Re: [PATCH v6 2/4] mac80211: filter multicast data packets on AP / AP_VLAN

2016-10-12 Thread Johannes Berg

On Mon, 2016-10-10 at 19:12 +0200, Michael Braun wrote:
> This patch adds filtering for multicast data packets on AP_VLAN
> interfaces
> 
[...]

Applied patches 1 and 2 for now, I'll look at the others again later.

johannes

BUG: net/ipv6: kernel memory leak in ip6_datagram_recv_specific_ctl

2016-10-12 Thread Baozeng Ding

Hi all,
The following program triggers a use-after-free in
ip6_datagram_recv_specific_ctl, which may leak kernel memory. The
kernel version is 4.8.0+ (Oct 7, commit
d1f5323370fceaed43a7ee38f4c7bfc7e70f28d0).
==
BUG: KASAN: use-after-free in ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 at addr 880029c84ec8
Read of size 1 by task poc/25548
Call Trace:
 [] dump_stack+0x12e/0x185 /lib/dump_stack.c:15
 [< inline >] print_address_description /mm/kasan/report.c:204
 [] kasan_report_error+0x48b/0x4b0 /mm/kasan/report.c:283
 [< inline >] kasan_report /mm/kasan/report.c:303
 [] __asan_report_load1_noabort+0x3e/0x40 /mm/kasan/report.c:321
 [] ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 /net/ipv6/datagram.c:687
 [] ip6_datagram_recv_ctl+0x33/0x40
 [] do_ipv6_getsockopt.isra.4+0xaec/0x2150
 [] ipv6_getsockopt+0x116/0x230
 [] tcp_getsockopt+0x82/0xd0 /net/ipv4/tcp.c:3035
 [] sock_common_getsockopt+0x95/0xd0 /net/core/sock.c:2647
 [< inline >] SYSC_getsockopt /net/socket.c:1776
 [] SyS_getsockopt+0x142/0x230 /net/socket.c:1758
 [] entry_SYSCALL_64_fastpath+0x23/0xc6
Memory state around the buggy address:
 880029c84d80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 880029c84e00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 880029c84e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ^
 880029c84f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 880029c84f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff


==
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define IPV6_2292DSTOPTS4
#define IPV6_2292PKTOPTIONS 6
#define IPV6_FLOWINFO   11


int main()
{
  int fd;
  int i, r;
  int opt = 1, len = 0;
  struct msghdr msg;
  struct sockaddr_in6 addr;
  int sub_addr[4];
  struct iovec iov;
  memset(sub_addr, 0, sizeof(sub_addr));
  sub_addr[3] = 0x100;
  memcpy(&addr.sin6_addr, sub_addr, sizeof(sub_addr));
  addr.sin6_family = AF_INET6;
  addr.sin6_port = 0x10ab;
  addr.sin6_flowinfo = 0x1;
  addr.sin6_scope_id = 0;

  mmap(0x2000ul, 0x1c000ul, 0x3ul, 0x32ul, -1, 0x0ul, 0, 0, 0);
  memset(0x2000, 'a', 0x1c000);
  fd = socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP);
  bind(fd, &addr, sizeof(addr));

  setsockopt(fd, IPPROTO_IPV6, IPV6_2292DSTOPTS, &opt, 4);
  setsockopt(fd, IPPROTO_IPV6, IPV6_FLOWINFO, &opt, 4);

  addr.sin6_flowinfo = 0;
  addr.sin6_scope_id = 0;
  msg.msg_name = &addr;
  msg.msg_namelen = sizeof(addr);
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_iov->iov_base = 0x2000;
  msg.msg_iov->iov_len = 0x100;
  msg.msg_control = 0x2000;
  msg.msg_controllen = 0x100;
  msg.msg_flags = 0;
  r = sendmsg(fd, &msg, MSG_FASTOPEN);
  if (r < 0) {
  printf("sendmsg errno=%d\n", r);
  }

  r = getsockopt(fd, IPPROTO_IPV6, IPV6_2292PKTOPTIONS, 0x20012000ul, &len);
  if (r < 0) printf("getsockopt error\n");
  return 0;
}

The following lines cause an out-of-bounds read, which may leak kernel memory
via put_cmsg.
686   u8 *ptr = nh + opt->dst0; // Out-of-bounds when opt->dst0 is large.
687   put_cmsg(msg, SOL_IPV6, IPV6_2292DSTOPTS, (ptr[1]+1)<<3, ptr);

I debugged using printk and got some values of opt as follows, which may
help locate the root cause of the bug. Thanks.

[85564.842733] degug: opt->iif is 0xe21c
[85564.842737] degug: opt->ra is 0x121c
[85564.842741] degug: opt->dst0 is 0xe111
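
To make the failure mode concrete, here is a minimal userspace sketch of
the kind of check that would reject an offset like 0xe111 before the option
length byte is read. It is only an illustration of the bounds problem, not
a proposed fix and not kernel code; opt_at and its parameters are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the option header at nh + off if it lies entirely within the
 * len bytes starting at nh, and NULL otherwise. On success *optlen is
 * set to the option size, computed as (ptr[1] + 1) << 3 like line 687. */
static const uint8_t *opt_at(const uint8_t *nh, size_t len, size_t off,
			     size_t *optlen)
{
	size_t need;

	/* The two-byte option header itself must be inside the buffer. */
	if (off > len || len - off < 2)
		return NULL;

	/* The length announced by the option must fit as well. */
	need = ((size_t)nh[off + 1] + 1) << 3;
	if (need > len - off)
		return NULL;

	*optlen = need;
	return nh + off;
}
```

With a 64-byte header area, an offset of 0xe111 is rejected before any
byte is read, whereas the unchecked pointer arithmetic above reads far
beyond the packet.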

Best Regards,
Baozeng Ding

[PATCH net 3/3] s390/lcs: remove trailing space at end of dev_err message

2016-10-12 Thread Ursula Braun

From: Colin Ian King  

There is a trailing white space at the end of a dev_err
message that does nothing useful - remove it.

Signed-off-by: Colin Ian King 
Signed-off-by: Ursula Braun 
---
 drivers/s390/net/lcs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/s390/net/lcs.c b/drivers/s390/net/lcs.c
index 251db0a..211b31d 100644
--- a/drivers/s390/net/lcs.c
+++ b/drivers/s390/net/lcs.c
@@ -1888,7 +1888,7 @@ lcs_stop_device(struct net_device *dev)
rc = lcs_stopcard(card);
if (rc)
dev_err(&card->dev->dev,
-   " Shutting down the LCS device failed\n ");
+   " Shutting down the LCS device failed\n");
return rc;
 }
 
-- 
2.8.4

[PATCH net 1/3] s390/netiucv: get rid of one memcpy in netiucv_printuser

2016-10-12 Thread Ursula Braun

Save a memcpy in netiucv_printuser().

Signed-off-by: Ursula Braun 
Reported-by: David Binderman 
---
 drivers/s390/net/netiucv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/s390/net/netiucv.c b/drivers/s390/net/netiucv.c
index b0e8ffd..88b6e9c 100644
--- a/drivers/s390/net/netiucv.c
+++ b/drivers/s390/net/netiucv.c
@@ -302,8 +302,7 @@ static char *netiucv_printuser(struct iucv_connection *conn)
if (memcmp(conn->userdata, iucvMagic_ebcdic, 16)) {
tmp_uid[8] = '\0';
tmp_udat[16] = '\0';
-   memcpy(tmp_uid, conn->userid, 8);
-   memcpy(tmp_uid, netiucv_printname(tmp_uid, 8), 8);
+   memcpy(tmp_uid, netiucv_printname(conn->userid, 8), 8);
memcpy(tmp_udat, conn->userdata, 16);
EBCASC(tmp_udat, 16);
memcpy(tmp_udat, netiucv_printname(tmp_udat, 16), 16);
-- 
2.8.4

[PATCH net 2/3] s390/netiucv: improve checking of sysfs attribute buffer

2016-10-12 Thread Ursula Braun

High values are always wrong for netiucv's sysfs attribute "buffer".
But the current code does not detect values between 2**31 and 2**32
as invalid. Choosing type "unsigned int" for variable "bs1" and making
use of "kstrtouint()" improves the syntax checking for "buffer".
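
A small userspace sketch of why the signed variable hides the problem, for
anyone who wants to try it outside the kernel. parse_uint() is only an
approximate stand-in for kstrtouint(), and BUFSIZE_MAX is a placeholder
value assumed for the example rather than the driver's actual limit.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

#define BUFSIZE_MAX 65537u /* placeholder for NETIUCV_BUFSIZE_MAX */

/* Old pattern: the parsed value lands in a signed int, so anything from
 * 2**31 to 2**32-1 turns negative on common ABIs and passes the check. */
static int check_signed(const char *buf)
{
	int bs1 = (int)strtoul(buf, NULL, 0);

	return bs1 > (int)BUFSIZE_MAX ? -EINVAL : 0;
}

/* Userspace approximation of kstrtouint(): whole string must parse,
 * one trailing newline is tolerated, overflow yields -ERANGE. */
static int parse_uint(const char *buf, unsigned int *res)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(buf, &end, 0);
	if (end == buf)
		return -EINVAL;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;
	if (errno == ERANGE || val > UINT_MAX)
		return -ERANGE;
	*res = (unsigned int)val;
	return 0;
}

/* New pattern: unsigned result plus explicit range handling. */
static int check_unsigned(const char *buf)
{
	unsigned int bs1 = 0;
	int rc = parse_uint(buf, &bs1);

	if (rc == -EINVAL)
		return -EINVAL;
	if (rc == -ERANGE || bs1 > BUFSIZE_MAX)
		return -EINVAL;
	return 0;
}
```

Feeding "3000000000" to both shows the difference: check_signed() accepts
it, check_unsigned() rejects it.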

Signed-off-by: Ursula Braun 
Reported-by: Dan Carpenter 
---
 drivers/s390/net/netiucv.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/s390/net/netiucv.c b/drivers/s390/net/netiucv.c
index 88b6e9c..2f0f391 100644
--- a/drivers/s390/net/netiucv.c
+++ b/drivers/s390/net/netiucv.c
@@ -1563,21 +1563,21 @@ static ssize_t buffer_write (struct device *dev, struct device_attribute *attr,
 {
struct netiucv_priv *priv = dev_get_drvdata(dev);
struct net_device *ndev = priv->conn->netdev;
-   char *e;
-   int  bs1;
+   unsigned int bs1;
+   int rc;
 
IUCV_DBF_TEXT(trace, 3, __func__);
if (count >= 39)
return -EINVAL;
 
-   bs1 = simple_strtoul(buf, &e, 0);
+   rc = kstrtouint(buf, 0, &bs1);
 
-   if (e && (!isspace(*e))) {
-   IUCV_DBF_TEXT_(setup, 2, "buffer_write: invalid char %02x\n",
-   *e);
+   if (rc == -EINVAL) {
+   IUCV_DBF_TEXT_(setup, 2, "buffer_write: invalid char %s\n",
+   buf);
return -EINVAL;
}
-   if (bs1 > NETIUCV_BUFSIZE_MAX) {
+   if ((rc == -ERANGE) || (bs1 > NETIUCV_BUFSIZE_MAX)) {
IUCV_DBF_TEXT_(setup, 2,
"buffer_write: buffer size %d too large\n",
bs1);
-- 
2.8.4

[PATCH net 0/3] s390 network driver patches

2016-10-12 Thread Ursula Braun

Hi Dave,

here are 3 small patches for the s390 network drivers netiucv and lcs.
They are built for the net-tree.

Thanks, Ursula

Ursula Braun (2):
  s390/netiucv: get rid of one memcpy in netiucv_printuser
  s390/netiucv: improve checking of sysfs attribute buffer

Colin Ian King (1):
  s390/lcs: remove trailing space at end of dev_err message

 drivers/s390/net/lcs.c |  2 +-
 drivers/s390/net/netiucv.c | 17 -
 2 files changed, 9 insertions(+), 10 deletions(-)

-- 
2.8.4

[PATCH] qede: fix CONFIG_INFINIBAND_QEDR=m build error

2016-10-12 Thread Arnd Bergmann

The newly introduced INFINIBAND_QEDR option is 'tristate' but
fails to build when set to 'm':

drivers/net/built-in.o: In function `qed_hw_init':
(.text+0x1c0e17): undefined reference to `qed_rdma_dpm_bar'
drivers/net/built-in.o: In function `qed_eq_completion':
(.text+0x1d185b): undefined reference to `qed_async_roce_event'
drivers/net/built-in.o: In function `qed_ll2_txq_completion':
qed_ll2.c:(.text+0x1e2fdd): undefined reference to `qed_ll2b_complete_tx_gsi_packet'
drivers/net/built-in.o: In function `qed_ll2_rxq_completion':
qed_ll2.c:(.text+0x1e479a): undefined reference to `qed_ll2b_complete_rx_gsi_packet'
drivers/net/built-in.o: In function `qed_ll2_terminate_connection':
(.text+0x1e5645): undefined reference to `qed_ll2b_release_tx_gsi_packet'

There are multiple problems here:

- The option should be 'bool', as this is not a separate module
  but rather a single file that gets added to the normal driver
  module

- The qed_rdma_dpm_bar() helper function should have been 'static
  inline' as it's declared in a header file, the current workaround
  of including qed_roce.h conditionally is not good

- There is no reason to use '#if' all the time to check for the
  symbol, it should use 'if IS_ENABLED()' to make the code
  more readable and get better compile coverage.

This addresses all three of the above.
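
As a standalone illustration of the pattern (not the qed code itself), the
sketch below uses a plain 0/1 macro FEATURE_RDMA as a stand-in for
IS_ENABLED(CONFIG_INFINIBAND_QEDR); rdma_dpm_bar, init_doorbell and
page_bytes are hypothetical names. Both branches are always seen by the
compiler, and the disabled case relies on a static inline stub so nothing
is left undefined at link time.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the Kconfig symbol; build with -DFEATURE_RDMA=1 to enable. */
#ifndef FEATURE_RDMA
#define FEATURE_RDMA 0
#endif

static int dpm_bar_calls; /* call counter, for illustration only */

#if FEATURE_RDMA
static inline void rdma_dpm_bar(void) { dpm_bar_calls++; }
#else
/* Empty stub: callers still compile and link when the feature is off. */
static inline void rdma_dpm_bar(void) { }
#endif

/* Parenthesised so the conditional survives use inside an expression. */
#define DEFAULT_HW_P_SIZE (FEATURE_RDMA ? 4 : 3)

static unsigned int page_bytes(void)
{
	return 1u << (DEFAULT_HW_P_SIZE + 12);
}

static void init_doorbell(bool cond)
{
	/* Plain C condition instead of an #if block around the call. */
	if (FEATURE_RDMA && cond)
		rdma_dpm_bar();
}
```

The compiler can drop the dead branch when the feature is off, but it still
type-checks it, which is where the better compile coverage comes from.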

Fixes: cee9fbd8e2e9 ("qede: Add qedr framework")
Signed-off-by: Arnd Bergmann 
---
 drivers/net/ethernet/qlogic/Kconfig|  2 +-
 drivers/net/ethernet/qlogic/qed/qed_cxt.c  |  6 +-
 drivers/net/ethernet/qlogic/qed/qed_dev.c  |  7 +++
 drivers/net/ethernet/qlogic/qed/qed_main.c | 24 +++-
 drivers/net/ethernet/qlogic/qed/qed_roce.h |  4 
 drivers/net/ethernet/qlogic/qed/qed_spq.c  | 13 ++---
 6 files changed, 22 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/Kconfig b/drivers/net/ethernet/qlogic/Kconfig
index 0df1391f9663..90562cf8fa19 100644
--- a/drivers/net/ethernet/qlogic/Kconfig
+++ b/drivers/net/ethernet/qlogic/Kconfig
@@ -108,7 +108,7 @@ config QEDE
  This enables the support for ...
 
 config INFINIBAND_QEDR
-   tristate "QLogic qede RoCE sources [debug]"
+   bool "QLogic qede RoCE sources [debug]"
depends on QEDE && 64BIT
select QED_LL2
default n
diff --git a/drivers/net/ethernet/qlogic/qed/qed_cxt.c b/drivers/net/ethernet/qlogic/qed/qed_cxt.c
index 82370a1a59ad..0a3ffcd9f073 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_cxt.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_cxt.c
@@ -48,12 +48,8 @@
 #define TM_ELEM_SIZE 4
 
 /* ILT constants */
-#if IS_ENABLED(CONFIG_INFINIBAND_QEDR)
 /* For RoCE we configure to 64K to cover for RoCE max tasks 256K purpose. */
-#define ILT_DEFAULT_HW_P_SIZE  4
-#else
-#define ILT_DEFAULT_HW_P_SIZE  3
-#endif
+#define ILT_DEFAULT_HW_P_SIZE  (IS_ENABLED(CONFIG_INFINIBAND_QEDR) ? 4 : 3)
 
 #define ILT_PAGE_IN_BYTES(hw_p_size)   (1U << ((hw_p_size) + 12))
 #define ILT_CFG_REG(cli, reg)  PSWRQ2_REG_ ## cli ## _ ## reg ## _RT_OFFSET
diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index 754f6a908858..63a38e3b8f3f 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -890,7 +890,7 @@ qed_hw_init_pf_doorbell_bar(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt)
n_cpus = 1;
rc = qed_hw_init_dpi_size(p_hwfn, p_ptt, pwm_regsize, n_cpus);
 
-   if (cond)
+   if (IS_ENABLED(CONFIG_INFINIBAND_QEDR) && cond)
qed_rdma_dpm_bar(p_hwfn, p_ptt);
}
 
@@ -1422,19 +1422,18 @@ static void qed_hw_set_feat(struct qed_hwfn *p_hwfn)
u32 *feat_num = p_hwfn->hw_info.feat_num;
int num_features = 1;
 
-#if IS_ENABLED(CONFIG_INFINIBAND_QEDR)
/* Roce CNQ each requires: 1 status block + 1 CNQ. We divide the
 * status blocks equally between L2 / RoCE but with consideration as
 * to how many l2 queues / cnqs we have
 */
-   if (p_hwfn->hw_info.personality == QED_PCI_ETH_ROCE) {
+   if (IS_ENABLED(CONFIG_INFINIBAND_QEDR) &&
+   p_hwfn->hw_info.personality == QED_PCI_ETH_ROCE) {
num_features++;
 
feat_num[QED_RDMA_CNQ] =
min_t(u32, RESC_NUM(p_hwfn, QED_SB) / num_features,
  RESC_NUM(p_hwfn, QED_RDMA_CNQ_RAM));
}
-#endif
feat_num[QED_PF_L2_QUE] = min_t(u32, RESC_NUM(p_hwfn, QED_SB) /
num_features,
RESC_NUM(p_hwfn, QED_L2_QUEUE));
diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c b/drivers/net/ethernet/qlogic/qed/qed_main.c
index 4ee3151e80c2..36023a3583f2 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -33,10 +33,8 @@
 #include "qed_hw.h"
 #include

Re: [PATCH net-next v10 1/1] net: phy: Cleanup the Edge-Rate feature in Microsemi PHYs.

2016-10-12 Thread Florian Fainelli



On 10/10/2016 07:13 AM, Allan W. Nielsen wrote:
> Edge-Rate cleanup include the following:
> - Updated device tree bindings documentation for edge-rate
> - The edge-rate is now specified as a "slowdown", meaning that it is now
>   being specified as positive values instead of negative (both
>   documentation and implementation wise).
> - Only explicitly documented values for "vsc8531,vddmac" and
>   "vsc8531,edge-slowdown" are accepted by the device driver.
> - Deleted include/dt-bindings/net/mscc-phy-vsc8531.h as it was not needed.
> - Read/validate devicetree settings in probe instead of init
> 
> Signed-off-by: Allan W. Nielsen 
> Signed-off-by: Raju Lakkaraju 

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH v3] Add support for ethtool operations and statistics to RDC-R6040.

2016-10-12 Thread Florian Fainelli

On 10/11/2016 11:36 PM, VENKAT PRASHANTH B U wrote:
> This is a patch to add support for ethtool operations and keeping
> up to date statistics for RDC R6040 fast ethernet MAC driver.
> 
> Signed-off-by: Venkat Prashanth B U 
> ---
> changelog v3:
> -Made the commit message more clear.
> -Modified the locking interface used in r6040_get_regs().
> -Verified the tabs vs space indentation.

The tabs vs. spaces still look odd in this submission, please run
scripts/checkpatch.pl on the patch file to make sure the script is also
happy.

> -code cleanup on r6040_get_regs()
> -Implemented a get_ethtool_stats callback that fills the shadow copy
>  of statistics obtained in the software.
> 
> changelog v2:
> -Made the commit message more clear
> -Add enumeration data type RTL_FLAG_MAX
> -Modified the locking interface used in r6040_get_regs()
> -Initialized mutex dynamically in a function r6040_get_regs()
> -Declared u32 msg_enable in struct r6040_private.
> ---
> ---
> drivers/net/ethernet/rdc/r6040.c | 229 +++
> 1 file changed, 229 insertions(+)
> 
> diff --git a/drivers/net/ethernet/rdc/r6040.c b/drivers/net/ethernet/rdc/r6040.c
> index cb29ee2..83478b1 100644
> --- a/drivers/net/ethernet/rdc/r6040.c
> +++ b/drivers/net/ethernet/rdc/r6040.c
> @@ -44,6 +44,7 @@
> #include 
> #include 
> #include 
> +#include 
>  
> #include 
>  
> @@ -172,6 +173,62 @@ MODULE_VERSION(DRV_VERSION " " DRV_RELDATE);
> #define TX_INTS   (TX_FINISH)
> #define INT_MASK  (RX_INTS | TX_INTS)
>  
> +/* write/read MMIO register */
> +#define R6040_W8(reg, val8)    writeb ((val8), ioaddr + (reg))
> +#define R6040_W16(reg, val16)  writew ((val16), ioaddr + (reg))
> +#define R6040_W32(reg, val32)  writel ((val32), ioaddr + (reg))
> +#define R6040_R8(reg)          readb (ioaddr + (reg))
> +#define R6040_R16(reg)         readw (ioaddr + (reg))
> +#define R6040_R32(reg)         readl (ioaddr + (reg))
> +
> +enum r6040_flag
> +{
> +  RTL_FLAG_MAX
> +};
> +
> +enum r6040_registers {
> + CounterAddrLow  = 0x10,
> + CounterAddrHigh = 0x14,
> + ChipCmd = 0x37,
> +};
> +
> +enum r6040_register_content {
> + /* ChipCmdBits */
> + StopReq = 0x80,
> + CmdReset= 0x10,
> + CmdRxEnb= 0x08,
> + CmdTxEnb= 0x04,
> + RxBufEmpty  = 0x01,
> + /* ResetCounterCommand */
> + CounterReset= 0x1,
> +
> + /* DumpCounterCommand */
> + CounterDump = 0x8,
> +};

No CamelCase style please, this is not the realtek drivers.

> +
> +struct r6040_counters {
> + __le64  tx_packets;
> + __le64  rx_packets;
> + __le64  tx_errors;
> + __le32  rx_errors;
> + __le16  rx_missed;
> + __le16  align_errors;
> + __le32  tx_one_collision;
> + __le32  tx_multi_collision;
> + __le64  rx_unicast;
> + __le64  rx_broadcast;
> + __le32  rx_multicast;
> + __le16  tx_aborted;
> + __le16  tx_underun;
> +};
> +
> +struct r6040_tc_offsets {
> + bool inited;

initialized maybe?

> + __le64  tx_errors;
> + __le32  tx_multi_collision;
> + __le16  tx_aborted;
> +};
> +
> struct r6040_descriptor {
>   u16 status, len;/* 0-3 */
>   __le32  buf;/* 4-7 */
> @@ -192,10 +249,14 @@ struct r6040_private {
>   struct r6040_descriptor *tx_remove_ptr;
>   struct r6040_descriptor *rx_ring;
>   struct r6040_descriptor *tx_ring;
> + struct r6040_counters *counters;
> + struct r6040_tc_offsets tc_offset;
>   dma_addr_t rx_ring_dma;
>   dma_addr_t tx_ring_dma;
> + dma_addr_t counters_phys_addr;
>   u16 tx_free_desc;
>   u16 mcr0;
> + u32 msg_enable;
>   struct net_device *dev;
>   struct mii_bus *mii_bus;
>   struct napi_struct napi;
> @@ -955,12 +1016,180 @@ static void netdev_get_drvinfo(struct net_device *dev,
>   strlcpy(info->bus_info, pci_name(rp->pdev), sizeof(info->bus_info));
> }
>  
> +static int
> +r6040_get_regs_len (struct net_device *dev)
> +{
> +  return R6040_IO_SIZE;
> +}

Tabs vs. spaces here.

> +
> +static void
> +r6040_get_regs (struct net_device *dev, struct ethtool_regs *regs, void *p)
> +{
> +  struct r6040_private *tp = netdev_priv (dev);
> +  u32 __iomem *data = tp->base;
> +  u32 *dw = p;
> +  int i;
> +
> +  spin_lock (&tp->lock);
> +  for (i = 0; i < R6040_IO_SIZE; i += 4)
> +memcpy_fromio (dw++, data++, 4);
> +  spin_unlock (&tp->lock);

What part of my last comment was not clear when I indicated that
registers are typically (exclusively actually) 16-bit wide?
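
As a rough illustration of the point (a userspace sketch, not the
driver's code): the register window can be walked one 16-bit register
at a time. demo_dump_regs16() and its parameters are hypothetical
names, and the plain array read stands in for an MMIO accessor such as
ioread16() that a real driver would use.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy a register window one 16-bit register at a time.
 * 'regs' stands in for the mapped I/O base (assumption for this
 * sketch); 'len_bytes' is the window size in bytes. */
static void demo_dump_regs16(const volatile uint16_t *regs,
			     uint16_t *out, size_t len_bytes)
{
	size_t i;

	for (i = 0; i < len_bytes / sizeof(uint16_t); i++)
		out[i] = regs[i];	/* one 16-bit access per register */
}
```

Each loop iteration performs exactly one 16-bit access, which matches
the register width, instead of the 32-bit copies in the quoted loop.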

> +}
> +
> +static u32
> +r6040_get_msglevel (struct net_device *dev)
> +{
> +  struct r6040_private *tp = netdev_priv (dev);
> +
> +  return tp->msg_enable;
> +}

Tabs vs. spaces here.

> +
> +static void
> +r6040_set_msglevel (struct net_device *dev, u32

[patch v2] netfilter: nf_tables: underflow in nft_parse_u32_check()

2016-10-12 Thread Dan Carpenter

We don't want to allow negatives here.
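
To make the underflow concrete, here is a small userspace sketch with
hypothetical helper names (demo_check_signed, demo_check_unsigned); it
mirrors only the range check, not the netlink parsing. With a signed
temporary, a wire value such as 0xffffffff turns into -1 on the usual
two's complement targets, so "val > max" is false and the value is
accepted:

```c
#include <stdint.h>

/* Old behaviour: signed temporary. Converting a value above INT_MAX
 * to int is implementation-defined; common compilers wrap it to a
 * negative number, which then slips past the upper-bound check. */
static int demo_check_signed(uint32_t wire, int max, uint32_t *dest)
{
	int val = (int)wire;

	if (val > max)
		return -1;	/* out of range */
	*dest = (uint32_t)val;
	return 0;
}

/* Patched behaviour: unsigned temporary, so large values stay large. */
static int demo_check_unsigned(uint32_t wire, int max, uint32_t *dest)
{
	uint32_t val = wire;

	if (val > (uint32_t)max)
		return -1;	/* out of range */
	*dest = val;
	return 0;
}
```

With max set to 255, the signed variant lets 0xffffffff through while
the unsigned variant rejects it; both accept an in-range value such as
200.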

Fixes: 36b701fae12a ('netfilter: nf_tables: validate maximum value of u32 netlink attributes')
Signed-off-by: Dan Carpenter 
---
v2: cosmetic change

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index b70d3ea..dd55187 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -4423,7 +4423,7 @@ static int nf_tables_check_loops(const struct nft_ctx *ctx,
  */
 unsigned int nft_parse_u32_check(const struct nlattr *attr, int max, u32 *dest)
 {
-   int val;
+   u32 val;
 
val = ntohl(nla_get_be32(attr));
if (val > max)
--
To unsubscribe from this list: send the line "unsubscribe kernel-janitors" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] net: phy: smsc: Disable auto-negotiation on startup

2016-10-12 Thread Florian Fainelli

On 10/10/2016 10:41 AM, Kyle Roeschley wrote:
> Because the SMSC PHY completes auto-negotiation before the driver is
> ready to handle interrupts, the PHY state machine never realizes that we
> have a link. Clear the ANENABLE bit on initialization, which lets
> genphy_config_aneg do its thing when that code is hit later.
> 
> While this patch does fix the problem we see (no link on boot without
> re-plugging the cable), it seems like the generic PHY code should be
> able to handle auto-negotiation completing before interrupts are
> enabled. Submitted as an RFC in the hopes that someone has an idea as to
> how that could be done.
> 
> This fix is copied from commit 99f81afc139c ("phy: micrel: Disable auto
> negotiation on startup").
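
For reference, the bit manipulation described in the quoted text can be
sketched in userspace as below. DEMO_BMCR_ANENABLE and
demo_clear_anenable() are hypothetical names; the value 0x1000 is the
same bit as BMCR_ANENABLE in <linux/mii.h>. In a driver the register
value would presumably be read with phy_read() and written back with
phy_write() rather than passed around like this.

```c
#include <stdint.h>

/* Auto-negotiation enable bit of the MII basic mode control register;
 * same value as BMCR_ANENABLE in <linux/mii.h>. */
#define DEMO_BMCR_ANENABLE 0x1000u

/* Return the BMCR value with auto-negotiation disabled; every other
 * bit is left untouched. */
static uint16_t demo_clear_anenable(uint16_t bmcr)
{
	return (uint16_t)(bmcr & ~DEMO_BMCR_ANENABLE);
}
```

Clearing the bit is idempotent: a value that already has ANENABLE
cleared is returned unchanged.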

Do you mind trying:

https://www.spinics.net/lists/netdev/msg397857.html

and see if you do get link interrupts without your patch applied? Thanks!
-- 
Florian
