date:20161217

RE: [PATCH] qed: fix memory leak of a qed_spq_entry on error failure paths

2016-12-17 Thread Mintz, Yuval

> From: Colin Ian King 
> 
> A qed_spq_entry entry is allocated by qed_sp_init_request but is not kfree'd
> if an error occurs, causing a memory leak. Fix this by kfree'ing it and also
> setting *pp_ent to NULL to be safe.
> 
> Found with static analysis by CoverityScan, CIDs 1389468-1389470
> 
> Signed-off-by: Colin Ian King 
...
> +err:
> + kfree(*pp_ent);
> + *pp_ent = NULL;
> +
> + return rc;
>  }

Hi Colin - thanks for this.
It would have been preferable to return the previously allocated spq entry.
I.e., do:

+err:
+   qed_spq_return_entry(p_hwfn, *pp_ent);
+   *pp_ent = NULL;
+   return rc;

Thanks,
Yuval

Re: [PATCH net] ipvlan: fix crash

2016-12-17 Thread David Miller

From: Mahesh Bandewar 
Date: Sat, 17 Dec 2016 18:16:19 -0800

> diff --git a/drivers/net/ipvlan/ipvlan_core.c 
> b/drivers/net/ipvlan/ipvlan_core.c
> index b4e990743e1d..4294fc1f5564 100644
> --- a/drivers/net/ipvlan/ipvlan_core.c
> +++ b/drivers/net/ipvlan/ipvlan_core.c
> @@ -660,6 +660,9 @@ rx_handler_result_t ipvlan_handle_frame(struct sk_buff 
> **pskb)
>   if (!port)
>   return RX_HANDLER_PASS;
>  
> + if (unlikely(!pskb_may_pull(skb, sizeof(struct ethhdr
> + goto out;
> +
>   switch (port->mode) {

ipvlan only allows non-loopback ethernet devices to register
this RX handler.

Such situations being tested here should therefore be completely
impossible.

Every such device must send the SKB through eth_type_trans(), which
unconditionally accesses the ethernet header, therefore it must
be pulled into the linear SKB area already, long before this RX
handler is invoked.

If this really can legitimately happen, you must explain how so.

Just showing the crash that later happens in some (completely
unrelated BTW) ipvlan multicast workqueue handling function, is
really an insufficient commit log message for a bug like this.

[GIT] Networking

2016-12-17 Thread David Miller


1) Revert bogus nla_ok() change, from Alexey Dobriyan.

2) Various bpf validator fixes from Daniel Borkmann.

3) Add some necessary SET_NETDEV_DEV() calls to hsis_femac and hip04
   drivers, from Dongpo Li.

4) Several ethtool ksettings conversions from Philippe Reynes.

5) Fix bugs in inet port management wrt. soreuseport, from Tom
   Herbert.

6) XDP support for virtio_net, from John Fastabend.

7) Fix NAT handling within a vrf, from David Ahern.

8) Endianness fixes in dpaa_eth driver, from Claudiu Manoil.

Please pull, thanks a lot!

The following changes since commit 8fa3b6f9392bf6d90cb7b908e07bd90166639f0a:

  Merge tag 'cris-for-4.10' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jesper/cris (2016-12-12 09:06:38 
-0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to 3e3397e7b11ce1b9526975ddfbe8dd569fc1f316:

  net: mv643xx_eth: fix build failure (2016-12-17 21:47:26 -0500)


Alexey Dobriyan (1):
  netlink: revert broken, broken "2-clause nla_ok()"

Andrew Lunn (1):
  net: dsa: mv88e6xxx: Fix opps when adding vlan bridge

Andy Lutomirski (1):
  cgroup: Fix CGROUP_BPF config

Arnd Bergmann (1):
  qed: fix old-style function definition

Bartosz Folta (1):
  net: macb: Added PCI wrapper for Platform Driver.

Ben Greear (1):
  mac80211: fix legacy and invalid rx-rate report

Cedric Izoard (1):
  mac80211: Ensure enough headroom when forwarding mesh pkt

Claudiu Manoil (1):
  dpaa_eth: use big endian accessors

Dan Carpenter (1):
  irda: w83977af_ir: cleanup an indent issue

Daniel Borkmann (5):
  bpf: fix regression on verifier pruning wrt map lookups
  bpf, test_verifier: fix a test case error result on unprivileged
  bpf: dynamically allocate digest scratch buffer
  bpf: fix overflow in prog accounting
  bpf: fix mark_reg_unknown_value for spilled regs on map value marking

Daniel Mack (1):
  bpf: cgroup: annotate pointers in struct cgroup_bpf with __rcu

David Ahern (2):
  net: vrf: Fix NAT within a VRF
  net: vrf: Drop conntrack data after pass through VRF device on Tx

David S. Miller (8):
  Merge branch 'hisilicon-netdev-dev'
  Merge branch 'cls_flower-mask'
  Merge branch 'inet_csk_get_port-and-soreusport-fixes'
  Merge branch 'dpaa_eth-fixes'
  Merge branch 'virtio_net-XDP'
  Merge branch 'gtp-fixes'
  Merge branch 'bpf-fixes'
  Merge tag 'mac80211-for-davem-2016-12-16' of 
git://git.kernel.org/.../jberg/mac80211

Dongpo Li (2):
  net: ethernet: hisi_femac: Call SET_NETDEV_DEV()
  net: ethernet: hip04: Call SET_NETDEV_DEV()

Emese Revfy (1):
  isdn: Constify some function parameters

Harald Welte (1):
  gtp: Fix initialization of Flags octet in GTPv1 header

Ido Schimmel (1):
  mlxsw: spectrum: Mark split ports as such

Jason Wang (1):
  virtio-net: correctly enable multiqueue

Jeroen De Wachter (2):
  encx24j600: bugfix - always move ERXTAIL to next packet in 
encx24j600_rx_packets
  encx24j600: Fix some checkstyle warnings

Johannes Berg (1):
  mac80211: don't call drv_set_default_unicast_key() for VLANs

John Fastabend (5):
  net: xdp: add invalid buffer warning
  virtio_net: Add XDP support
  virtio_net: add dedicated XDP transmit queues
  virtio_net: add XDP_TX support
  virtio_net: xdp, add slowpath case for non contiguous buffers

Kees Cook (7):
  isdn/gigaset: use designated initializers
  ATM: use designated initializers
  net: use designated initializers
  WAN: use designated initializers
  bna: use designated initializers
  isdn: use designated initializers
  net/x25: use designated initializers

LABBE Corentin (5):
  irda: irproc.c: Remove unneeded linux/miscdevice.h include
  irda: irnet: Move linux/miscdevice.h include
  irnet: ppp: move IRNET_MINOR to include/linux/miscdevice.h
  irda: irnet: Remove unused IRNET_MAJOR define
  irda: irnet: add member name to the miscdevice declaration

Lionel Gauthier (1):
  gtp: gtp_check_src_ms_ipv4() always return success

Madalin Bucur (2):
  dpaa_eth: remove redundant dependency on FSL_SOC
  MAINTAINERS: net: add entry for Freescale QorIQ DPAA Ethernet driver

Mantas M (1):
  net: ipv6: check route protocol when deleting routes

Manuel Bessler (1):
  r6040: move spinlock in r6040_close as SOFTIRQ-unsafe lock order detected

Paul Blakey (2):
  net/sched: cls_flower: Use mask for addr_type
  net/sched: cls_flower: Use masked key when calling HW offloads

Philippe Reynes (5):
  net: chelsio: cxgb2: use new api ethtool_{get|set}_link_ksettings
  net: chelsio: cxgb3: use new api ethtool_{get|set}_link_ksettings
  net: cirrus: ep93xx: use new api ethtool_{get|set}_link_ksettings
  net: davicom: dm9000: use new api ethtool_{get|set}_link_ksettings

Re: [PATCH] net: mv643xx_eth: fix build failure

2016-12-17 Thread David Miller

From: Sudip Mukherjee 
Date: Sat, 17 Dec 2016 00:45:05 +

> The build of sparc allmodconfig fails with the error:
> "of_irq_to_resource" [drivers/net/ethernet/marvell/mv643xx_eth.ko]
>   undefined!
> 
> of_irq_to_resource() is defined when CONFIG_OF_IRQ is defined. And also
> CONFIG_OF_IRQ can only be defined if CONFIG_IRQ is defined. So we can
> safely use #if defined(CONFIG_OF_IRQ) in the code.
> 
> Signed-off-by: Sudip Mukherjee 

Applied, thanks.

Re: [PATCH] isdn: Constify some function parameters

2016-12-17 Thread David Miller

From: Kees Cook 
Date: Fri, 16 Dec 2016 13:40:47 -0800

> From: Emese Revfy 
> 
> The coming initify gcc plugin expects const pointer types, and caught
> some __printf arguments that weren't const yet. This fixes those.
> 
> Signed-off-by: Emese Revfy 
> [kees: expanded commit message]
> Signed-off-by: Kees Cook 

Applied.

Re: [patch net] mlxsw: spectrum: Mark split ports as such

2016-12-17 Thread David Miller

From: Jiri Pirko 
Date: Fri, 16 Dec 2016 19:29:03 +0100

> From: Ido Schimmel 
> 
> When a port is split we should mark it as such, as otherwise the split
> ports aren't renamed correctly (e.g. sw1p3 -> sw1p3s1) and the unsplit
> operation fails:
> 
> $ devlink port split sw1p3 count 4
> $ devlink port unsplit eth0
> devlink answers: Invalid argument
> [  598.565307] mlxsw_spectrum :03:00.0 eth0: Port wasn't split
> 
> Fixes: 67963a33b4fd ("mlxsw: Make devlink port instances independent of 
> spectrum/switchx2 port instances")
> Signed-off-by: Ido Schimmel 
> Reported-by: Tamir Winetroub 
> Reviewed-by: Elad Raz 
> Tested-by: Tamir Winetroub 
> Signed-off-by: Jiri Pirko 

Applied, thanks Jiri.

Re: [PATCH] cgroup: Fix CGROUP_BPF config

2016-12-17 Thread David Miller

From: Andy Lutomirski 
Date: Fri, 16 Dec 2016 08:33:45 -0800

> CGROUP_BPF depended on SOCK_CGROUP_DATA which can't be manually
> enabled, making it rather challenging to turn CGROUP_BPF on.
> 
> Signed-off-by: Andy Lutomirski 

Applied, thanks.

Re: pull-request: mac80211 2016-12-16

2016-12-17 Thread David Miller

From: Johannes Berg 
Date: Fri, 16 Dec 2016 13:39:56 +0100

> Since you seem to be updating net, I thought I'd send you a few fixes.
> These aren't really all that important though, so if you want to let
> them wait for a bit I can live with that.
> 
> Please pull and let me know if there's any problem.

Pulled, thanks.

Re: [PATCH net] qed: fix old-style function definition

2016-12-17 Thread David Miller

From: Arnd Bergmann 
Date: Fri, 16 Dec 2016 09:47:41 +0100

> The newly added file causes a harmless warning, with "make W=1":
> 
> drivers/net/ethernet/qlogic/qed/qed_iscsi.c: In function 'qed_get_iscsi_ops':
> drivers/net/ethernet/qlogic/qed/qed_iscsi.c:1268:29: warning: old-style 
> function definition [-Wold-style-definition]
> 
> This makes it a proper prototype.
> 
> Fixes: fc831825f99e ("qed: Add support for hardware offloaded iSCSI.")
> Signed-off-by: Arnd Bergmann 

APplied.

Re: [PATCH] net: ipv6: check route protocol when deleting routes

2016-12-17 Thread David Miller

From: Mantas Mikulėnas 
Date: Fri, 16 Dec 2016 10:30:59 +0200

> The protocol field is checked when deleting IPv4 routes, but ignored for
> IPv6, which causes problems with routing daemons accidentally deleting
> externally set routes (observed by multiple bird6 users).
> 
> This can be verified using `ip -6 route del  proto something`.
> 
> Signed-off-by: Mantas Mikulėnas 

Applied, thanks.

Re: [PATCH v3 net] r6040: move spinlock in r6040_close as SOFTIRQ-unsafe lock order detected

2016-12-17 Thread David Miller

From: Manuel Bessler 
Date: Thu, 15 Dec 2016 22:55:00 -0500

> 'ifconfig eth0 down' makes r6040_close() trigger:
>  INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
> 
> Fixed by moving calls to phy_stop(), napi_disable(), netif_stop_queue()
> to outside of the module's private spin_lock_irq block.
> 
> Found on a Versalogic Tomcat SBC with a Vortex86 SoC
 ...
> Signed-off-by: Manuel Bessler 

Applied, thanks.

Re: [patch net-next] irda: w83977af_ir: cleanup an indent issue

2016-12-17 Thread David Miller

From: Dan Carpenter 
Date: Mon, 12 Dec 2016 14:21:34 +0300

> In commit 99d8d2159d7c ("irda: w83977af_ir: Neaten logging"), we
> accidentally added an extra tab to these lines.
> 
> Signed-off-by: Dan Carpenter 

Applied.

Re: [PATCH] net: sfc: use new api ethtool_{get|set}_link_ksettings

2016-12-17 Thread David Miller

From: Philippe Reynes 
Date: Thu, 15 Dec 2016 00:12:53 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Applied.

Re: [PATCH] net: chelsio: cxgb3: use new api ethtool_{get|set}_link_ksettings

2016-12-17 Thread David Miller

From: Philippe Reynes 
Date: Mon, 12 Dec 2016 00:27:49 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Applied.

Re: [PATCH] net: chelsio: cxgb2: use new api ethtool_{get|set}_link_ksettings

2016-12-17 Thread David Miller

From: Philippe Reynes 
Date: Sun, 11 Dec 2016 22:47:50 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Applied.

Re: [PATCH] net: cirrus: ep93xx: use new api ethtool_{get|set}_link_ksettings

2016-12-17 Thread David Miller

From: Philippe Reynes 
Date: Mon, 12 Dec 2016 23:28:33 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Applied.

Re: [PATCH] net: davicom: dm9000: use new api ethtool_{get|set}_link_ksettings

2016-12-17 Thread David Miller

From: Philippe Reynes 
Date: Wed, 14 Dec 2016 10:01:58 +0100

> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Applied.

Re: [PATCH net v2 0/3] Couple of BPF fixes

2016-12-17 Thread David Miller

From: Daniel Borkmann 
Date: Sun, 18 Dec 2016 01:52:56 +0100

> This set contains three BPF fixes for net, one that addresses the
> complaint from Geert wrt static allocations, and the other is a fix
> wrt mem accounting that I found recently during testing and there's
> still one more fix on the map value marking.
> 
> Thanks!
> 
> v1 -> v2:
>   - Patch 1 as is.
>   - Fixed kbuild bot issue by letting charging helpers stay in the
> syscall.c, since there locked_vm is valid and only export the
> ones needed by bpf_prog_realloc(). Add empty stubs in case the
> bpf syscall is not enabled.
>   - Added patch 3 that addresses one more issue in map val marking.

Series applied, thanks Daniel.

[PATCH net] ipvlan: fix crash

2016-12-17 Thread Mahesh Bandewar

From: Mahesh Bandewar 

[ cut here ]
kernel BUG at include/linux/skbuff.h:1737!
Call Trace:
 [] dev_forward_skb+0x92/0xd0
 [] ipvlan_process_multicast+0x395/0x4c0 [ipvlan]
 [] ? ipvlan_process_multicast+0xd7/0x4c0 [ipvlan]
 [] ? process_one_work+0x147/0x660
 [] process_one_work+0x1a9/0x660
 [] ? process_one_work+0x147/0x660
 [] worker_thread+0x11d/0x360
 [] ? rescuer_thread+0x350/0x350
 [] kthread+0xdb/0xe0
 [] ? _raw_spin_unlock_irq+0x30/0x50
 [] ? flush_kthread_worker+0xc0/0xc0
 [] ret_from_fork+0x9a/0xd0
 [] ? flush_kthread_worker+0xc0/0xc0

Signed-off-by: Mahesh Bandewar 
---
 drivers/net/ipvlan/ipvlan_core.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c
index b4e990743e1d..4294fc1f5564 100644
--- a/drivers/net/ipvlan/ipvlan_core.c
+++ b/drivers/net/ipvlan/ipvlan_core.c
@@ -660,6 +660,9 @@ rx_handler_result_t ipvlan_handle_frame(struct sk_buff 
**pskb)
if (!port)
return RX_HANDLER_PASS;
 
+   if (unlikely(!pskb_may_pull(skb, sizeof(struct ethhdr
+   goto out;
+
switch (port->mode) {
case IPVLAN_MODE_L2:
return ipvlan_handle_mode_l2(pskb, port);
@@ -672,6 +675,8 @@ rx_handler_result_t ipvlan_handle_frame(struct sk_buff 
**pskb)
/* Should not reach here */
WARN_ONCE(true, "ipvlan_handle_frame() called for mode = [%hx]\n",
  port->mode);
+
+out:
kfree_skb(skb);
return RX_HANDLER_CONSUMED;
 }
-- 
2.8.0.rc3.226.g39d4020

[PATCH net v2 0/3] Couple of BPF fixes

2016-12-17 Thread Daniel Borkmann

This set contains three BPF fixes for net, one that addresses the
complaint from Geert wrt static allocations, and the other is a fix
wrt mem accounting that I found recently during testing and there's
still one more fix on the map value marking.

Thanks!

v1 -> v2:
  - Patch 1 as is.
  - Fixed kbuild bot issue by letting charging helpers stay in the
syscall.c, since there locked_vm is valid and only export the
ones needed by bpf_prog_realloc(). Add empty stubs in case the
bpf syscall is not enabled.
  - Added patch 3 that addresses one more issue in map val marking.

Daniel Borkmann (3):
  bpf: dynamically allocate digest scratch buffer
  bpf: fix overflow in prog accounting
  bpf: fix mark_reg_unknown_value for spilled regs on map value marking

 include/linux/bpf.h| 13 -
 include/linux/filter.h | 14 +++---
 kernel/bpf/core.c  | 43 +--
 kernel/bpf/syscall.c   | 38 +-
 kernel/bpf/verifier.c  | 17 -
 5 files changed, 93 insertions(+), 32 deletions(-)

-- 
1.9.3

[PATCH net v2 2/3] bpf: fix overflow in prog accounting

2016-12-17 Thread Daniel Borkmann

Commit aaac3ba95e4c ("bpf: charge user for creation of BPF maps and
programs") made a wrong assumption of charging against prog->pages.
Unlike map->pages, prog->pages are still subject to change when we
need to expand the program through bpf_prog_realloc().

This can for example happen during verification stage when we need to
expand and rewrite parts of the program. Should the required space
cross a page boundary, then prog->pages is not the same anymore as
its original value that we used to bpf_prog_charge_memlock() on. Thus,
we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
is freed eventually. I noticed this that despite having unlimited
memlock, programs suddenly refused to load with EPERM error due to
insufficient memlock.

There are two ways to fix this issue. One would be to add a cached
variable to struct bpf_prog that takes a snapshot of prog->pages at the
time of charging. The other approach is to also account for resizes. I
chose to go with the latter for a couple of reasons: i) We want accounting
rather to be more accurate instead of further fooling limits, ii) adding
yet another page counter on struct bpf_prog would also be a waste just
for this purpose. We also do want to charge as early as possible to
avoid going into the verifier just to find out later on that we crossed
limits. The only place that needs to be fixed is bpf_prog_realloc(),
since only here we expand the program, so we try to account for the
needed delta and should we fail, call-sites check for outcome anyway.
On cBPF to eBPF migrations, we don't grab a reference to the user as
they are charged differently. With that in place, my test case worked
fine.

Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf.h  | 11 +++
 kernel/bpf/core.c| 16 +---
 kernel/bpf/syscall.c | 36 
 3 files changed, 52 insertions(+), 11 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 201eb48..f74ae68 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -238,6 +238,8 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void 
*meta, u64 meta_size,
 void bpf_prog_sub(struct bpf_prog *prog, int i);
 struct bpf_prog * __must_check bpf_prog_inc(struct bpf_prog *prog);
 void bpf_prog_put(struct bpf_prog *prog);
+int __bpf_prog_charge(struct user_struct *user, u32 pages);
+void __bpf_prog_uncharge(struct user_struct *user, u32 pages);
 
 struct bpf_map *bpf_map_get_with_uref(u32 ufd);
 struct bpf_map *__bpf_map_get(struct fd f);
@@ -318,6 +320,15 @@ static inline struct bpf_prog * __must_check 
bpf_prog_inc(struct bpf_prog *prog)
 {
return ERR_PTR(-EOPNOTSUPP);
 }
+
+static inline int __bpf_prog_charge(struct user_struct *user, u32 pages)
+{
+   return 0;
+}
+
+static inline void __bpf_prog_uncharge(struct user_struct *user, u32 pages)
+{
+}
 #endif /* CONFIG_BPF_SYSCALL */
 
 /* verifier prototypes for helper functions called from eBPF programs */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 75c08b8..1eb4f13 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -105,19 +105,29 @@ struct bpf_prog *bpf_prog_realloc(struct bpf_prog 
*fp_old, unsigned int size,
gfp_t gfp_flags = GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO |
  gfp_extra_flags;
struct bpf_prog *fp;
+   u32 pages, delta;
+   int ret;
 
BUG_ON(fp_old == NULL);
 
size = round_up(size, PAGE_SIZE);
-   if (size <= fp_old->pages * PAGE_SIZE)
+   pages = size / PAGE_SIZE;
+   if (pages <= fp_old->pages)
return fp_old;
 
+   delta = pages - fp_old->pages;
+   ret = __bpf_prog_charge(fp_old->aux->user, delta);
+   if (ret)
+   return NULL;
+
fp = __vmalloc(size, gfp_flags, PAGE_KERNEL);
-   if (fp != NULL) {
+   if (fp == NULL) {
+   __bpf_prog_uncharge(fp_old->aux->user, delta);
+   } else {
kmemcheck_annotate_bitfield(fp, meta);
 
memcpy(fp, fp_old, fp_old->pages * PAGE_SIZE);
-   fp->pages = size / PAGE_SIZE;
+   fp->pages = pages;
fp->aux->prog = fp;
 
/* We keep fp->aux from fp_old around in the new
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 35d674c..e89acea 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -615,19 +615,39 @@ static void free_used_maps(struct bpf_prog_aux *aux)
kfree(aux->used_maps);
 }
 
+int __bpf_prog_charge(struct user_struct *user, u32 pages)
+{
+   unsigned long memlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+   unsigned long user_bufs;
+
+   if (user) {
+   user_bufs = atomic_long_add_return(pages, >locked_vm);
+   if (user_bufs > memlock_limit) {
+

[PATCH net v2 3/3] bpf: fix mark_reg_unknown_value for spilled regs on map value marking

2016-12-17 Thread Daniel Borkmann

Martin reported a verifier issue that hit the BUG_ON() for his
test case in the mark_reg_unknown_value() function:

  [  202.861380] kernel BUG at kernel/bpf/verifier.c:467!
  [...]
  [  203.291109] Call Trace:
  [  203.296501]  [] mark_map_reg+0x45/0x50
  [  203.308225]  [] mark_map_regs+0x78/0x90
  [  203.320140]  [] do_check+0x226d/0x2c90
  [  203.331865]  [] bpf_check+0x48b/0x780
  [  203.343403]  [] bpf_prog_load+0x27e/0x440
  [  203.355705]  [] ? handle_mm_fault+0x11af/0x1230
  [  203.369158]  [] ? security_capable+0x48/0x60
  [  203.382035]  [] SyS_bpf+0x124/0x960
  [  203.393185]  [] ? __do_page_fault+0x276/0x490
  [  203.406258]  [] entry_SYSCALL_64_fastpath+0x13/0x94

This issue got uncovered after the fix in a08dd0da5307 ("bpf: fix
regression on verifier pruning wrt map lookups"). The reason why it
wasn't noticed before was, because as mentioned in a08dd0da5307,
mark_map_regs() was doing the id matching incorrectly based on the
uncached regs[regno].id. So, in the first loop, we walked all regs
and as soon as we found regno == i, then this reg's id was cleared
when calling mark_reg_unknown_value() thus that every subsequent
register was probed against id of 0 (which, in combination with the
PTR_TO_MAP_VALUE_OR_NULL type is an invalid condition that no other
register state can hold), and therefore wasn't type transitioned such
as in the spilled register case for the second loop.

Now since that got fixed, it turned out that 57a09bf0a416 ("bpf:
Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") used
mark_reg_unknown_value() incorrectly for the spilled regs, and thus
hitting the BUG_ON() in some cases due to regno >= MAX_BPF_REG.

Although spilled regs have the same type as the non-spilled regs
for the verifier state, that is, struct bpf_reg_state, they are
semantically different from the non-spilled regs. In other words,
there can be up to 64 (MAX_BPF_STACK / BPF_REG_SIZE) spilled regs
in the stack, for example, register R could have been spilled by
the program to stack location X, Y, Z, and in mark_map_regs() we
need to scan these stack slots of type STACK_SPILL for potential
registers that we have to transition from PTR_TO_MAP_VALUE_OR_NULL.
Therefore, depending on the location, the spilled_regs regno can
be a lot higher than just MAX_BPF_REG's value since we operate on
stack instead. The reset in mark_reg_unknown_value() itself is
just fine, only that the BUG_ON() was inappropriate for this. Fix
it by making a __mark_reg_unknown_value() version that can be
called from mark_map_reg() generically; we know for the non-spilled
case that the regno is always < MAX_BPF_REG anyway.

Fixes: 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
Reported-by: Martin KaFai Lau 
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 64b7b1a..83ed2f8 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -462,14 +462,19 @@ static void init_reg_state(struct bpf_reg_state *regs)
regs[BPF_REG_1].type = PTR_TO_CTX;
 }
 
-static void mark_reg_unknown_value(struct bpf_reg_state *regs, u32 regno)
+static void __mark_reg_unknown_value(struct bpf_reg_state *regs, u32 regno)
 {
-   BUG_ON(regno >= MAX_BPF_REG);
regs[regno].type = UNKNOWN_VALUE;
regs[regno].id = 0;
regs[regno].imm = 0;
 }
 
+static void mark_reg_unknown_value(struct bpf_reg_state *regs, u32 regno)
+{
+   BUG_ON(regno >= MAX_BPF_REG);
+   __mark_reg_unknown_value(regs, regno);
+}
+
 static void reset_reg_range_values(struct bpf_reg_state *regs, u32 regno)
 {
regs[regno].min_value = BPF_REGISTER_MIN_RANGE;
@@ -1976,7 +1981,7 @@ static void mark_map_reg(struct bpf_reg_state *regs, u32 
regno, u32 id,
 */
reg->id = 0;
if (type == UNKNOWN_VALUE)
-   mark_reg_unknown_value(regs, regno);
+   __mark_reg_unknown_value(regs, regno);
}
 }
 
-- 
1.9.3

[PATCH net v2 1/3] bpf: dynamically allocate digest scratch buffer

2016-12-17 Thread Daniel Borkmann

Geert rightfully complained that 7bd509e311f4 ("bpf: add prog_digest
and expose it via fdinfo/netlink") added a too large allocation of
variable 'raw' from bss section, and should instead be done dynamically:

  # ./scripts/bloat-o-meter kernel/bpf/core.o.1 kernel/bpf/core.o.2
  add/remove: 3/0 grow/shrink: 0/0 up/down: 33291/0 (33291)
  function old new   delta
  raw-   32832  +32832
  [...]

Since this is only relevant during program creation path, which can be
considered slow-path anyway, lets allocate that dynamically and be not
implicitly dependent on verifier mutex. Move bpf_prog_calc_digest() at
the beginning of replace_map_fd_with_map_ptr() and also error handling
stays straight forward.

Reported-by: Geert Uytterhoeven 
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf.h|  2 +-
 include/linux/filter.h | 14 +++---
 kernel/bpf/core.c  | 27 ---
 kernel/bpf/syscall.c   |  2 +-
 kernel/bpf/verifier.c  |  6 --
 5 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8796ff0..201eb48 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -216,7 +216,7 @@ struct bpf_event_entry {
 u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
 bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog 
*fp);
-void bpf_prog_calc_digest(struct bpf_prog *fp);
+int bpf_prog_calc_digest(struct bpf_prog *fp);
 
 const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
 
diff --git a/include/linux/filter.h b/include/linux/filter.h
index af8a180..7023142 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -57,9 +57,6 @@
 /* BPF program can access up to 512 bytes of stack space. */
 #define MAX_BPF_STACK  512
 
-/* Maximum BPF program size in bytes. */
-#define MAX_BPF_SIZE   (BPF_MAXINSNS * sizeof(struct bpf_insn))
-
 /* Helper macros for filter block array initializers. */
 
 /* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
@@ -517,6 +514,17 @@ static __always_inline u32 bpf_prog_run_xdp(const struct 
bpf_prog *prog,
return BPF_PROG_RUN(prog, xdp);
 }
 
+static inline u32 bpf_prog_insn_size(const struct bpf_prog *prog)
+{
+   return prog->len * sizeof(struct bpf_insn);
+}
+
+static inline u32 bpf_prog_digest_scratch_size(const struct bpf_prog *prog)
+{
+   return round_up(bpf_prog_insn_size(prog) +
+   sizeof(__be64) + 1, SHA_MESSAGE_BYTES);
+}
+
 static inline unsigned int bpf_prog_size(unsigned int proglen)
 {
return max(sizeof(struct bpf_prog),
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 83e0d15..75c08b8 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -136,28 +136,29 @@ void __bpf_prog_free(struct bpf_prog *fp)
vfree(fp);
 }
 
-#define SHA_BPF_RAW_SIZE   \
-   round_up(MAX_BPF_SIZE + sizeof(__be64) + 1, SHA_MESSAGE_BYTES)
-
-/* Called under verifier mutex. */
-void bpf_prog_calc_digest(struct bpf_prog *fp)
+int bpf_prog_calc_digest(struct bpf_prog *fp)
 {
const u32 bits_offset = SHA_MESSAGE_BYTES - sizeof(__be64);
-   static u32 ws[SHA_WORKSPACE_WORDS];
-   static u8 raw[SHA_BPF_RAW_SIZE];
-   struct bpf_insn *dst = (void *)raw;
+   u32 raw_size = bpf_prog_digest_scratch_size(fp);
+   u32 ws[SHA_WORKSPACE_WORDS];
u32 i, bsize, psize, blocks;
+   struct bpf_insn *dst;
bool was_ld_map;
-   u8 *todo = raw;
+   u8 *raw, *todo;
__be32 *result;
__be64 *bits;
 
+   raw = vmalloc(raw_size);
+   if (!raw)
+   return -ENOMEM;
+
sha_init(fp->digest);
memset(ws, 0, sizeof(ws));
 
/* We need to take out the map fd for the digest calculation
 * since they are unstable from user space side.
 */
+   dst = (void *)raw;
for (i = 0, was_ld_map = false; i < fp->len; i++) {
dst[i] = fp->insnsi[i];
if (!was_ld_map &&
@@ -177,12 +178,13 @@ void bpf_prog_calc_digest(struct bpf_prog *fp)
}
}
 
-   psize = fp->len * sizeof(struct bpf_insn);
-   memset([psize], 0, sizeof(raw) - psize);
+   psize = bpf_prog_insn_size(fp);
+   memset([psize], 0, raw_size - psize);
raw[psize++] = 0x80;
 
bsize  = round_up(psize, SHA_MESSAGE_BYTES);
blocks = bsize / SHA_MESSAGE_BYTES;
+   todo   = raw;
if (bsize - psize >= sizeof(__be64)) {
bits = (__be64 *)(todo + bsize - sizeof(__be64));
} else {
@@ -199,6 +201,9 @@ void bpf_prog_calc_digest(struct bpf_prog *fp)
result = (__force __be32 *)fp->digest;
for (i = 0; i < SHA_DIGEST_WORDS; i++)
result[i] = cpu_to_be32(fp->digest[i]);
+
+

Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock

2016-12-17 Thread Francois Romieu

Pavel Machek  :
[...]
> Won't this up/down the interface, in a way userspace can observe?

It won't up/down the interface as it doesn't exactly mimic what the
network code does (there's more than rtnl_lock).

For the same reason it's broken if it races with the transmit path: it
can release driver resources while the transmit path uses these.

Btw the points below may not matter/hurt much for a proof a concept
but they would need to be addressed as well:
1) unchecked (and avoidable) extra error paths due to stmmac_release()
2) racy cancel_work_sync. Low probability as it is, an irq + error could
   take place right after cancel_work_sync

Lino, have you considered via-rhine.c since its "move work from irq to
workqueue context" changes that started in
7ab87ff4c770eed71e3777936299292739fcd0fe [*] ?

It's a shameless plug - originated in r8169.c - but it should be rather
close to what the sxgbe and friends require. Thought / opinion ?

[*] Including fixes/changes in:
- 3a5a883a8a663b930908cae4abe5ec913b9b2fd2
- e1efa87241272104d6a12c8b9fcdc4f62634d447
- 810f19bcb862f8889b27e0c9d9eceac9593925dd
- e45af497950a89459a0c4b13ffd91e1729fffef4
- a926592f5e4e900f3fa903298c4619a131e60963
- 559bcac35facfed49ab4f408e162971612dcfdf3

-- 
Ueimor

Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function

2016-12-17 Thread Christian Kujau

On Thu, 15 Dec 2016, Jason A. Donenfeld wrote:
> > I'd still drop the "24" unless you really think we're going to have
> > multiple variants coming into the kernel.
> 
> Okay. I don't have a problem with this, unless anybody has some reason
> to the contrary.

What if the 2/4-round version falls and we need more rounds to withstand 
future cryptoanalysis? We'd then have siphash_ and siphash48_ functions, 
no? My amateurish bike-shedding argument would be "let's keep the 24 then" :-)

C.
-- 
BOFH excuse #354:

Chewing gum on /dev/sd3c

[PATCH] net: wireless: marvell: libertas: constify cfg80211_ops structures

2016-12-17 Thread Bhumika Goyal

cfg80211_ops structures are only passed as an argument to the function
wiphy_new. This argument is of type const, so cfg80211_ops strutures
having this property can be declared as const.
Done using Coccinelle

@r1 disable optional_qualifier @
identifier i;
position p;
@@
static struct cfg80211_ops i@p = {...};

@ok1@
identifier r1.i;
position p;
@@
wiphy_new(@p,...)

@bad@
position p!={r1.p,ok1.p};
identifier r1.i;
@@
i@p

@depends on !bad disable optional_qualifier@
identifier r1.i;
@@
+const
struct cfg80211_ops i;

File size before:
   textdata bss dec hex filename
  212251954  16   231955a9b wireless/marvell/libertas/cfg.o

File size after:
   textdata bss dec hex filename
  220411154  16   232115aab wireless/marvell/libertas/cfg.o

Signed-off-by: Bhumika Goyal 
---
 drivers/net/wireless/marvell/libertas/cfg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/marvell/libertas/cfg.c 
b/drivers/net/wireless/marvell/libertas/cfg.c
index 7ff2efa..3f97acb 100644
--- a/drivers/net/wireless/marvell/libertas/cfg.c
+++ b/drivers/net/wireless/marvell/libertas/cfg.c
@@ -2086,7 +2086,7 @@ static int lbs_set_power_mgmt(struct wiphy *wiphy, struct 
net_device *dev,
  * Initialization
  */
 
-static struct cfg80211_ops lbs_cfg80211_ops = {
+static const struct cfg80211_ops lbs_cfg80211_ops = {
.set_monitor_channel = lbs_cfg_set_monitor_channel,
.libertas_set_mesh_channel = lbs_cfg_set_mesh_channel,
.scan = lbs_cfg_scan,
-- 
1.9.1

[PATCH] net: wireless: ath: wil6210: constify cfg80211_ops structures

2016-12-17 Thread Bhumika Goyal

cfg80211_ops structures are only passed as an argument to the function
wiphy_new. This argument is of type const, so cfg80211_ops strutures
having this property can be declared as const.
Done using Coccinelle

@r1 disable optional_qualifier @
identifier i;
position p;
@@
static struct cfg80211_ops i@p = {...};

@ok1@
identifier r1.i;
position p;
@@
wiphy_new(@p,...)

@bad@
position p!={r1.p,ok1.p};
identifier r1.i;
@@
i@p

@depends on !bad disable optional_qualifier@
identifier r1.i;
@@
+const
struct cfg80211_ops i;

File size before:
   textdata bss dec hex filename
  181336632   0   2476560bd wireless/ath/wil6210/cfg80211.o

File size after:
   textdata bss dec hex filename
  189335832   0   2476560bd wireless/ath/wil6210/cfg80211.o

Signed-off-by: Bhumika Goyal 
---
 drivers/net/wireless/ath/wil6210/cfg80211.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/ath/wil6210/cfg80211.c 
b/drivers/net/wireless/ath/wil6210/cfg80211.c
index 6aa3ff4..54dd116 100644
--- a/drivers/net/wireless/ath/wil6210/cfg80211.c
+++ b/drivers/net/wireless/ath/wil6210/cfg80211.c
@@ -1499,7 +1499,7 @@ static int wil_cfg80211_set_power_mgmt(struct wiphy 
*wiphy,
return rc;
 }
 
-static struct cfg80211_ops wil_cfg80211_ops = {
+static const struct cfg80211_ops wil_cfg80211_ops = {
.add_virtual_intf = wil_cfg80211_add_iface,
.del_virtual_intf = wil_cfg80211_del_iface,
.scan = wil_cfg80211_scan,
-- 
1.9.1

Re: [PATCHv2 2/5] sh_eth: enable wake-on-lan for Gen2 devices

2016-12-17 Thread Sergei Shtylyov


On 12/14/2016 04:37 PM, Sergei Shtylyov wrote:


Tested on Gen2 r8a7791/Koelsch.

Signed-off-by: Niklas Söderlund 
---
 drivers/net/ethernet/renesas/sh_eth.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/renesas/sh_eth.c
b/drivers/net/ethernet/renesas/sh_eth.c
index 87640b9..348ed22 100644
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c
@@ -624,8 +624,9 @@ static struct sh_eth_cpu_data r8a779x_data = {

 .register_type= SH_ETH_REG_FAST_RCAR,

-.ecsr_value= ECSR_PSRTO | ECSR_LCHNG | ECSR_ICD,
-.ecsipr_value= ECSIPR_PSRTOIP | ECSIPR_LCHNGIP | ECSIPR_ICDIP,
+.ecsr_value= ECSR_PSRTO | ECSR_LCHNG | ECSR_ICD | ECSR_MPD,
+.ecsipr_value= ECSIPR_PSRTOIP | ECSIPR_LCHNGIP | ECSIPR_ICDIP |
+  ECSIPR_MPDIP,


  These expressions seem to have been sorted by the bit # before your patch,
now they aren't... care to fix? :-)


   After looking at the SH7743/64 code, nevermind this request. Sorry. :-)

MBR, Sergei

Re: [PATCHv2 1/5] sh_eth: add generic wake-on-lan support via magic packet

2016-12-17 Thread Sergei Shtylyov


Hello!

On 12/12/2016 07:09 PM, Niklas Söderlund wrote:

   Not the complete review yet, just some superficial comments.


Add generic functionality to support Wake-on-Lan using MagicPacket which


   LAN.


are supported by at least a few versions of sh_eth. Only add
functionality for WoL, no specific sh_eth version are marked to support
WoL yet.


[...]


diff --git a/drivers/net/ethernet/renesas/sh_eth.c 
b/drivers/net/ethernet/renesas/sh_eth.c
index 05b0dc5..87640b9 100644
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c

[...]

@@ -3150,15 +3187,67 @@ static int sh_eth_drv_remove(struct platform_device 
*pdev)

[...]

+static int sh_eth_wol_restore(struct net_device *ndev)
+{
+   struct sh_eth_private *mdp = netdev_priv(ndev);
+   int ret;
+
+   napi_enable(>napi);
+
+   /* Disable MagicPacket */
+   sh_eth_modify(ndev, ECMR, ECMR_PMDE, 0);
+
+   /* The device need to be reset to restore MagicPacket logic


   Needs.

[...]

index d050f37..4ceed00 100644
--- a/drivers/net/ethernet/renesas/sh_eth.h
+++ b/drivers/net/ethernet/renesas/sh_eth.h
@@ -493,6 +493,7 @@ struct sh_eth_cpu_data {
unsigned shift_rd0:1;   /* shift Rx descriptor word 0 right by 16 */
unsigned rmiimode:1;/* EtherC has RMIIMODE register */
unsigned rtrate:1;  /* EtherC has RTRATE register */
+   unsigned magic:1;   /* EtherC has ECMR.PMDE and ECSR.MPD */


   After looking at the SH7734/63 manuals it became obvious that PMDE was a 
result of typo, the bit is called MPDE actually, the current name doesn't make 
sense anyway. Care to fix?


MBR, Sergei

Re: [PATCH] net: use designated initializers

2016-12-17 Thread Kees Cook

On Sat, Dec 17, 2016 at 8:57 AM, David Miller  wrote:
> From: Kees Cook 
> Date: Fri, 16 Dec 2016 16:58:58 -0800
>
>> Prepare to mark sensitive kernel structures for randomization by making
>> sure they're using designated initializers. These were identified during
>> allyesconfig builds of x86, arm, and arm64, with most initializer fixes
>> extracted from grsecurity.
>>
>> Signed-off-by: Kees Cook 
>
> Applied, although "decnet: " would have been a much better
> subsystem prefix.

Thanks! Yeah, I had corrected that in my tree already in case there
was a v2 needed. I was working off an auto-splitting script that I
taught to guess at prefixes by looking at commit logs. It didn't work
very well. ;) We need another field in MAINTAINERS. :)

-Kees

-- 
Kees Cook
Nexus Security

Re: [TSN RFC v2 0/9] TSN driver for the kernel

2016-12-17 Thread Richard Cochran

On Sat, Dec 17, 2016 at 10:05:54AM +0100, Henrik Austad wrote:
> I'm sending out a new set now because, what I have works _fairly_ well for 
> testing and a way to see what you can do with AVB. Using spotify to play 
> music on random machines is quite entertaining.

You have missed the point about TSN entirely.  Unless your demo has
synchronized playback (in the low microsecond range), then it really
is pointless.  We can already play music over the LAN using gstreamer,
without a single kernel change.  Heck, gstreamer even has its own
rudimentary synchronization, so your series is a step backwards.

> And therein lies the problem. It cannot yet be written, so we have to start 
> in *some* end. And as I repeatedly stated in June, I'm at an RFC here, 
> trying to spark some interest and lure other developers in :)

The best way to attract interest is to provide the critical
infrastructure missing in the kernel.  Coding a media player in kernel
space is not very interesting in my view.

> Yes, and this requires a lot of change to ALSA (and probably something in 
> V4L2 as well?), so before we get to that, perhaps have a set of patches 
> that does this best effort and *then* work on getting time-triggered 
> playback into the kernel?

No, we don't need a best effort implementation.  That is gstreamer and Co.

> So, the next issue I plan to tackle, is how I do buffers, the current 
> approach where tsn_core allocates memory is on its way out and I'll let the 
> shim (which means alsa/v4l2) will provide a buffer. Then I'll start looking 
> at qdisc.

More than once you wrote something like, "I know that's needed, but it
is just too hard ATM."  Please start with qdisc and tc.  That
shouldn't be too hard.  If we had the AVB shaping rules with one or
two drivers supporting them, that would be one piece already done.

Thanks,
Richard

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-17 Thread Andy Lutomirski

On Sat, Dec 17, 2016 at 11:26 AM, Mickaël Salaün  wrote:
>
> On 17/12/2016 19:18, Andy Lutomirski wrote:
>> Hi all-
>>
>> I apologize for being rather late with this.  I didn't realize that
>> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
>> it on the linux-api list, so I missed the discussion.
>>
>> I think that the inet ingress, egress etc filters are a neat feature,
>> but I think the API has some issues that will bite us down the road
>> if it becomes stable in its current form.
>>
>> Most of the problems I see are summarized in this transcript:
>>
>> # mkdir cg2
>> # mount -t cgroup2 none cg2
>> # mkdir cg2/nosockets
>> # strace cgrp_socket_rule cg2/nosockets/ 0
>> ...
>> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
>>
>>  You can modify a cgroup after opening it O_RDONLY?
>
> I sent a patch to check the cgroup.procs permission before attaching a
> BPF program to it [1], but it was not merged because not part of the
> current security model (which may not be crystal clear). The thing is
> that the current socket/BPF/cgroup feature is only available to a
> process with the *global CAP_NET_ADMIN* and such a process can already
> modify the network for every processes, so it doesn't make much sense to
> check if it can modify the network for a subset of this processes.
>
> [1] https://lkml.org/lkml/2016/9/19/854

I don't think that's quite the right approach.  First, checking
permissions on an fd that's not the fd passed to the syscall is weird
and IMO shouldn't be done without a good reason.  Second,
cgroups.procs seems like the wrong file.

Why not create "net.socket_create_filter" and require a writable fd to
*that* to be passed to the syscall.

>
> However, needing a process to open a cgroup *directory* in write mode
> may not make sense because the process does not modify the content of
> the cgroup but only use it as a *reference* in the network stack.
> Forcing an open with write mode may forbid to use this kind of
> network-filtering feature in a read-only file-system but not necessarily
> read-only *network configuration*.

It does modify the cgroup.  Logically it changes the cgroup behavior.
As an implementation matter, it modifies the struct cgroup.

>
> Another point of view is that the CAP_NET_ADMIN may be an unneeded
> privilege if the cgroup migration is using a no_new_privs-like feature
> as I proposed with Landlock [2] (with an extra ptrace_may_access() check).
> The new capability proposition for cgroup may be interesting too.
>
> [2] https://lkml.org/lkml/2016/9/14/82
>
>>
>> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
>> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
>> log_buf=0x6020c0, kern_version=0}, 48) = 4
>>
>>  This is fine.  The bpf() syscall manipulates bpf objects.
>>
>> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
>>
>>  This is not so good:
>> 
>>  a) The bpf() syscall is supposed to manipulate bpf objects.  This
>> is manipulating a cgroup.  There's no reason that a socket creation
>> filter couldn't be written in a different language (new iptables
>> table?  Simple list of address families?), but if that happened,
>> then using bpf() to install it would be entirely nonsensical.
>
> Another point of view is to say that the BPF program (called by the
> network stack) is using a reference to a set of processes thanks to a
> cgroup.

I strongly disagree with this point of view.  From a user's
perspective, nothing about the bpf program changed -- the cgroup
changes.  Even in the code, the bpf program doesn't have a reference
to anything.  The cgroup references the bpf program.

>>
>> # echo $$ >cg2/nosockets/cgroup.procs
>> # ping 127.0.0.1
>> ping: socket: Operation not permitted
>> # ls cg2/nosockets/
>> cgroup.controllers  cgroup.events  cgroup.procs  cgroup.subtree_control
>> # cat cg2/nosockets/cgroup.controllers
>>
>>  Something in cgroupfs should give an indication that this cgroup
>>  filters socket creation, but there's nothing there.  You should also
>>  be able to turn the filter off from cgroupfs.
>
> Right. Everybody was OK at LPC to add such an information but it is not
> there yet.

Then let's do it before CONFIG_CGROUP_BPF=y becomes possible in a
released kernel.

> Right. Because this feature doesn't handle namespaces (only global
> CAP_NET_ADMIN), nested containers should not be allowed to use it at all.
> If we want this kind of feature to be usable by something other than a
> process with a global capability, then we need an inheritance mechanism
> similar to what I designed for Landlock. I think it could be added later.

I think it's okay to have a new kernel feature that doesn't support
namespacing yet.  I don't think it's okay to have a new kernel feature
that will become problematic when namespacing is added, and I think
cgroup-bpf is in the latter class.


Anyway, here's a half-baked proposal

probably serious conntrack/netfilter panic, 4.8.14, timers and intel turbo

2016-12-17 Thread Denys Fedoryshchenko


Hi,

I posted recently several netfilter related crashes, didn't got any 
answers, one of them started to happen quite often on loaded NAT 
(17Gbps),
so after trying endless ways to make it stable, i found out that in 
backtrace i can often see timers, and this bug probably appearing on 
older releases,

i've seen such backtrace with timer fired for conntrack on them.
I disabled Intel turbo for cpus on this loaded NAT, and voila, panic 
disappeared for 2nd day!

* by wrmsr -a 0x1a0 0x4000850089
I am not sure timers is the reason, but probably turbo creating some 
condition for bug.




Here is examples of backtrace of last reboots (kernel 4.8.14), and same 
kernel worked perfectly without turbo.
Last one also one crash on 4.8.0 that looks painfully similar, on 
totally different workload, but with conntrack enabled. It happens there 
much less often,

so harder to crash and test by disabling turbo.

[28904.162607] BUG: unable to handle kernel
NULL pointer dereference
at 0008
[28904.163210] IP:
[] nf_ct_add_to_dying_list+0x55/0x61 [nf_conntrack]
[28904.163745] PGD 0

[28904.164058] Oops: 0002 [#1] SMP
[28904.164323] Modules linked in:
nf_nat_pptp
nf_nat_proto_gre
xt_TCPMSS
xt_connmark
ipt_MASQUERADE
nf_nat_masquerade_ipv4
xt_nat
xt_rateest
xt_RATEEST
nf_conntrack_pptp
nf_conntrack_proto_gre
xt_CT
xt_set
xt_hl
xt_tcpudp
ip_set_hash_net
ip_set
nfnetlink
iptable_raw
iptable_mangle
iptable_nat
nf_conntrack_ipv4
nf_defrag_ipv4
nf_nat_ipv4
nf_nat
nf_conntrack
iptable_filter
ip_tables
x_tables
netconsole
configfs
8021q
garp
mrp
stp
llc
bonding
ixgbe
dca

[28904.168132] CPU: 27 PID: 0 Comm: swapper/27 Not tainted 
4.8.14-build-0124 #2
[28904.168398] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS 
SE5C610.86B.01.01.1008.031920151331 03/19/2015

[28904.168853] task: 885fa42e8c40 task.stack: 885fa42f
[28904.169114] RIP: 0010:[]
[] nf_ct_add_to_dying_list+0x55/0x61 [nf_conntrack]
[28904.169643] RSP: 0018:885fbccc3dd8 EFLAGS: 00010246
[28904.169901] RAX:  RBX: 885fbccc RCX: 
885fbccc0010
[28904.170169] RDX: 885f87a1c150 RSI: 0142 RDI: 
885fbccc
[28904.170437] RBP: 885fbccc3de8 R08: cbdee177 R09: 
0100
[28904.170704] R10: 885fbccc3dd0 R11: 820050c0 R12: 
885f87a1c140
[28904.170971] R13: 0005d948 R14: 000ea942 R15: 
885f87a1c160
[28904.171237] FS: () GS:885fbccc() 
knlGS:

[28904.171688] CS: 0010 DS:  ES:  CR0: 80050033
[28904.171964] CR2: 0008 CR3: 00607f006000 CR4: 
001406e0

[28904.172231] Stack:
[28904.172482] 885f87a1c140
820a1405
885fbccc3e28
a00abb30

[28904.173182] 0002820a1405
885f87a1c140
885f99a28201


[28904.173884] 
820050c8
885fbccc3e58
a00abc62

[28904.174585] Call Trace:
[28904.174835] 

[28904.174912] [] nf_ct_delete_from_lists+0xc9/0xf2 
[nf_conntrack]
[28904.175613] [] nf_ct_delete+0x109/0x12c 
[nf_conntrack]
[28904.175894] [] ? nf_ct_delete+0x12c/0x12c 
[nf_conntrack]
[28904.176169] [] death_by_timeout+0xd/0xf 
[nf_conntrack]

[28904.176443] [] call_timer_fn.isra.5+0x17/0x6b
[28904.176714] [] expire_timers+0x6f/0x7e
[28904.176975] [] run_timer_softirq+0x69/0x8b
[28904.177238] [] ? 
clockevents_program_event+0xd0/0xe8

[28904.177504] [] __do_softirq+0xbd/0x1aa
[28904.177765] [] irq_exit+0x37/0x7c
[28904.178026] [] 
smp_trace_apic_timer_interrupt+0x7b/0x88

[28904.178300] [] smp_apic_timer_interrupt+0x9/0xb
[28904.178565] [] apic_timer_interrupt+0x7c/0x90
[28904.178835] 

[28904.178907] [] ? mwait_idle+0x64/0x7a
[28904.179436] [] ? 
atomic_notifier_call_chain+0x13/0x15

[28904.179712] [] arch_cpu_idle+0xa/0xc
[28904.179976] [] default_idle_call+0x27/0x29
[28904.180244] [] cpu_startup_entry+0x11d/0x1c7
[28904.180508] [] start_secondary+0xe8/0xeb
[28904.180767] Code:
80
2f
0b
82
48
89
df
e8
da
90
84
e1
48
8b
43
10
49
8d
54
24
10
48
8d
4b
10
49
89
4c
24
18
a8
01
49
89
44
24
10
48
89
53
10
75
04

89
50
08
c6
03
00
5b
41
5c
5d
c3
48
8b
05
10
be
00
00
89
f6

[28904.185546] RIP
[] nf_ct_add_to_dying_list+0x55/0x61 [nf_conntrack]
[28904.186065] RSP 
[28904.186319] CR2: 0008
[28904.186593] ---[ end trace 35cbc6c885a5c2d8 ]---
[28904.186860] Kernel panic - not syncing: Fatal exception in interrupt
[28904.187155] Kernel Offset: disabled
[28904.187419] Rebooting in 5 seconds..

[28909.193662] ACPI MEMORY or I/O RESET_REG.



[14125.227611] BUG: unable to handle kernel
NULL pointer dereference
at (null)
[14125.228215] IP:
[] nf_nat_setup_info+0x6d8/0x755 [nf_nat]
[14125.228564] PGD 0

[14125.228882] Oops:  [#1] SMP
[14125.229146] Modules linked in:
nf_nat_pptp
nf_nat_proto_gre
xt_TCPMSS
xt_connmark
ipt_MASQUERADE
nf_nat_masquerade_ipv4
xt_nat
xt_rateest
xt_RATEEST
nf_conntrack_pptp
nf_conntrack_proto_gre
xt_CT
xt_set
xt_hl
xt_tcpudp
ip_set_hash_net
ip_set
nfnetlink
iptable_raw

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-17 Thread Mickaël Salaün


On 17/12/2016 19:18, Andy Lutomirski wrote:
> Hi all-
> 
> I apologize for being rather late with this.  I didn't realize that
> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
> it on the linux-api list, so I missed the discussion.
> 
> I think that the inet ingress, egress etc filters are a neat feature,
> but I think the API has some issues that will bite us down the road
> if it becomes stable in its current form.
> 
> Most of the problems I see are summarized in this transcript:
> 
> # mkdir cg2
> # mount -t cgroup2 none cg2
> # mkdir cg2/nosockets
> # strace cgrp_socket_rule cg2/nosockets/ 0
> ...
> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
> 
>  You can modify a cgroup after opening it O_RDONLY?

I sent a patch to check the cgroup.procs permission before attaching a
BPF program to it [1], but it was not merged because not part of the
current security model (which may not be crystal clear). The thing is
that the current socket/BPF/cgroup feature is only available to a
process with the *global CAP_NET_ADMIN* and such a process can already
modify the network for every processes, so it doesn't make much sense to
check if it can modify the network for a subset of this processes.

[1] https://lkml.org/lkml/2016/9/19/854

However, needing a process to open a cgroup *directory* in write mode
may not make sense because the process does not modify the content of
the cgroup but only use it as a *reference* in the network stack.
Forcing an open with write mode may forbid to use this kind of
network-filtering feature in a read-only file-system but not necessarily
read-only *network configuration*.

Another point of view is that the CAP_NET_ADMIN may be an unneeded
privilege if the cgroup migration is using a no_new_privs-like feature
as I proposed with Landlock [2] (with an extra ptrace_may_access() check).
The new capability proposition for cgroup may be interesting too.

[2] https://lkml.org/lkml/2016/9/14/82

> 
> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
> log_buf=0x6020c0, kern_version=0}, 48) = 4
> 
>  This is fine.  The bpf() syscall manipulates bpf objects.
> 
> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
> 
>  This is not so good:
> 
>  a) The bpf() syscall is supposed to manipulate bpf objects.  This
> is manipulating a cgroup.  There's no reason that a socket creation
> filter couldn't be written in a different language (new iptables
> table?  Simple list of address families?), but if that happened,
> then using bpf() to install it would be entirely nonsensical.

Another point of view is to say that the BPF program (called by the
network stack) is using a reference to a set of processes thanks to a
cgroup.

> 
>  b) This is starting to be an excessively ugly multiplexer.  Among
> other things, it's very unfriendly to seccomp.

FWIW, Landlock will have the capability to filter this kind of action.

> 
> # echo $$ >cg2/nosockets/cgroup.procs
> # ping 127.0.0.1
> ping: socket: Operation not permitted
> # ls cg2/nosockets/
> cgroup.controllers  cgroup.events  cgroup.procs  cgroup.subtree_control
> # cat cg2/nosockets/cgroup.controllers
> 
>  Something in cgroupfs should give an indication that this cgroup
>  filters socket creation, but there's nothing there.  You should also
>  be able to turn the filter off from cgroupfs.

Right. Everybody was OK at LPC to add such an information but it is not
there yet.

> 
> # mkdir cg2/nosockets/sockets
> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1
> 
>  This succeeded, which means that, if this feature is enabled in 4.10,
>  then we're stuck with its semantics.  If it returned -EINVAL instead,
>  there would be a chance to refine it.

This is indeed unfortunate.

> 
> # echo $$ >cg2/nosockets/sockets/cgroup.procs
> # ping 127.0.0.1
> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms
> ^C
> --- 127.0.0.1 ping statistics ---
> 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> rtt min/avg/max/mdev = 0.029/0.029/0.029/0.000 ms
> 
>  Bash was inside a cgroup that disallowed socket creation, but socket
>  creation wasn't disallowed.  This means that the obvious use of socket
>  creation filters in nestable constainers fails insecurely.
> 
> 
> There's also a subtle but nasty potential security problem here.
> In 4.9 and before, cgroups has only one real effect in the kernel:
> resource control. A process in a malicious cgroup could be DoSed,
> but that was about the extent of the damage that a malicious cgroup
> could do.
> 
> In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf
> programs attached that can do things if various events occur. (Right
> now, this means socket operations, but there are plans in the works
> to do this

[PATCH] rtlwifi: rtl8192ee: New firmware from Realtek

2016-12-17 Thread Larry Finger

From: Troy Tan 

Recent testing by Realtek found bugs in both the driver and firmware for
the RTL8192EE chips.

Signed-off-by: Troy Tan 
Signed-off-by: Larry Finger 
---
 WHENCE  |   5 -
 rtlwifi/rtl8192eefw.bin | Bin 32754 -> 31818 bytes
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/WHENCE b/WHENCE
index df2dffa..c6f650e 100644
--- a/WHENCE
+++ b/WHENCE
@@ -2115,7 +2115,10 @@ Licence: Redistributable. See 
LICENCE.rtlwifi_firmware.txt for details.
 
 Driver: rtl8192ee - Realtek 802.11n WLAN driver for RTL8192EE
 
-Info: Taken from Realtek version 
rtl_92ce_92se_92de_8723ae_88ee_8723be_92ee_linux_mac80211_0017.1224.2013
+Info: Initial version taken from Realtek version
+  rtl_92ce_92se_92de_8723ae_88ee_8723be_92ee_linux_mac80211_0017.1224.2013
+  Updated Jan. 14, 2015 with file added by Realtek to
+  http://github.com/lwfinger/rtlwifi_new.git.
 File: rtlwifi/rtl8192eefw.bin
 
 Licence: Redistributable. See LICENCE.rtlwifi_firmware.txt for details.
diff --git a/rtlwifi/rtl8192eefw.bin b/rtlwifi/rtl8192eefw.bin
index 
bede1ad79f9c5243484f31c790ef794d579c9b4b..4a034d3ae8910df49682876dfe44fe8690798825
 100644
GIT binary patch
literal 31818
zcmb`w3w#q*+CQF|OfGHH+q6K-wS-nhMS^%+*Oekw@dhn~izpJX-epPGbOp6!0CB~&
z%P2+PT@}P6s3=r->#D4Bv9d4kzPr+jyX#(r0h5A2nyE_=Y?J)H=jWddH;oE
z{`*WV>InVQ)lLOBdnMzD1lX+Zm_4$vPOz#!pCF_sB;TC>gJS)Ov;ypj;|L#4n
z-}~t={Jaw^iobWV_FQx%UriMvoUhV19@%e2+FOy97H1G9?ZuSO3NQgAg;0??f!=fN
zX0u7ivC+5PYUc56F>@v%Ki6h4*!DP?NwIZCCYcbO|VCAO(C_sCp)j~WTcicupd>>NSyBS#DbmxT)LuqOsuAKYXmQsx9(mpjhSHdkOwPJcfw
zKR;V|D+6ObZ24!-$9eA+zMP5E$}t~4vn+}OngiF0$|_vdwZMP;nax013&-(1J?Y(I
zv0CvT3m_-gcpB+en*m|ZG17ALG9ixsutb3RcH>E@cC4MulS81*KZLILI+n3zrNkD=9A<
zT#i}%zx{pGDkU!#ul23$$ZPDdZ`UNIJ@Ka^PhYYm{`99$AFKS?$*vE0-j`}$
zyR5C+RQ<^k^I~aWm))E9QEQj$fVrmZ#HX$HzeSJyyu%rBq()UPJ6UMAZ#N5m)8b`a
zuJG5Uh@*XdZH2|*d!ov5vMaH1nZ0X^V;dFur;~>ldK7!t4n;Zr26JcOQ2@_WMs;
z-O@ES=i(4wA@`;mO!pYcJ0nJ)-!X6vD=7itmOgAJ({q1tGhPa~
zFY^5i;q5f47`9WC(%aSd)q@`UYnDSNhLualUo$~^qF@G{D0_k{B*=}N{Rx4R
z_&1kDktILpHA~kw(s7E~w`$oL4WVI=SX>JKY53EF7;KFGL9UR$_Jh@GIF+QykxqW}
zy6c;KzrD+oH|B)LReOuWMisaYI*?tEMD||dJ~jn?Lx22s
zd#q*!hJ7ZR%c&`6?1EJHmC5K2634FSbH;8qK7LKoXI_(XDklTVsetlrKD4IEJeU#ZTH6&=e@hSzvSJg{qfQAM>R-%uM~%B}Hy5JmRu0|>#6hic@Misai$YzYm?=nNT|iJztW|o4p-*g_U5YiFzzFZ_=yA71)nMnjJ0us@3>SDYvTpqNy!>(Gdp~JV`!lH_==;dQ=hKY%Ntji6nUdn^%pHQ
ziC6y_$hC^QXk=TXisT*4GW6HSBKKEpsN{1T~PuGCbha^QDLie_X4S|i
z`#3e$UgA

[PATCH net-next v4 2/2] net: stmmac: dwmac-meson8b: make the RGMII TX delay configurable

2016-12-17 Thread Martin Blumenstingl

Prior to this patch we were using a hardcoded RGMII TX clock delay of
2ns (= 1/4 cycle of the 125MHz RGMII TX clock). This value works for
many boards, but unfortunately not for all (due to the way the actual
circuit is designed, sometimes because the TX delay is enabled in the
PHY, etc.). Making the TX delay on the MAC side configurable allows us
to support all possible hardware combinations.

This allows fixing a compatibility issue on some boards, where the
RTL8211F PHY is configured to generate the TX delay. We can now turn
off the TX delay in the MAC, because otherwise we would be applying the
delay twice (which results in non-working TX traffic).

Signed-off-by: Martin Blumenstingl 
Tested-by: Neil Armstrong 
---
 drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
index ffaed1f35efe..db970cd6600f 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
@@ -35,10 +35,6 @@
 
 #define PRG_ETH0_TXDLY_SHIFT   5
 #define PRG_ETH0_TXDLY_MASKGENMASK(6, 5)
-#define PRG_ETH0_TXDLY_OFF (0x0 << PRG_ETH0_TXDLY_SHIFT)
-#define PRG_ETH0_TXDLY_QUARTER (0x1 << PRG_ETH0_TXDLY_SHIFT)
-#define PRG_ETH0_TXDLY_HALF(0x2 << PRG_ETH0_TXDLY_SHIFT)
-#define PRG_ETH0_TXDLY_THREE_QUARTERS  (0x3 << PRG_ETH0_TXDLY_SHIFT)
 
 /* divider for the result of m250_sel */
 #define PRG_ETH0_CLK_M250_DIV_SHIFT7
@@ -69,6 +65,8 @@ struct meson8b_dwmac {
 
struct clk_divider  m25_div;
struct clk  *m25_div_clk;
+
+   u32 tx_delay_ns;
 };
 
 static void meson8b_dwmac_mask_bits(struct meson8b_dwmac *dwmac, u32 reg,
@@ -179,6 +177,7 @@ static int meson8b_init_prg_eth(struct meson8b_dwmac *dwmac)
 {
int ret;
unsigned long clk_rate;
+   u8 tx_dly_val;
 
switch (dwmac->phy_mode) {
case PHY_INTERFACE_MODE_RGMII:
@@ -196,9 +195,13 @@ static int meson8b_init_prg_eth(struct meson8b_dwmac 
*dwmac)
meson8b_dwmac_mask_bits(dwmac, PRG_ETH0,
PRG_ETH0_INVERTED_RMII_CLK, 0);
 
-   /* TX clock delay - all known boards use a 1/4 cycle delay */
+   /* TX clock delay in ns = "8ns / 4 * tx_dly_val" (where
+* 8ns are exactly one cycle of the 125MHz RGMII TX clock):
+* 0ns = 0x0, 2ns = 0x1, 4ns = 0x2, 6ns = 0x3
+*/
+   tx_dly_val = dwmac->tx_delay_ns >> 1;
meson8b_dwmac_mask_bits(dwmac, PRG_ETH0, PRG_ETH0_TXDLY_MASK,
-   PRG_ETH0_TXDLY_QUARTER);
+   tx_dly_val << PRG_ETH0_TXDLY_SHIFT);
break;
 
case PHY_INTERFACE_MODE_RMII:
@@ -282,6 +285,12 @@ static int meson8b_dwmac_probe(struct platform_device 
*pdev)
dev_err(>dev, "missing phy-mode property\n");
ret = -EINVAL;
goto err_remove_config_dt;
+   } else if (dwmac->phy_mode != PHY_INTERFACE_MODE_RMII) {
+   /* ignore errors as this is an optional property - by default
+* we assume a TX delay of 0ns.
+*/
+   of_property_read_u32(pdev->dev.of_node, "amlogic,tx-delay-ns",
+>tx_delay_ns);
}
 
ret = meson8b_init_clk(dwmac);
-- 
2.11.0

[PATCH net-next v4 1/2] net: dt-bindings: add RGMII TX delay configuration to meson8b-dwmac

2016-12-17 Thread Martin Blumenstingl

This allows configuring the RGMII TX clock delay. The RGMII clock is
generated by underlying hardware of the the Meson 8b / GXBB DWMAC glue.
The configuration depends on the actual hardware (no delay may be
needed due to the design of the actual circuit, the PHY might add this
delay, etc.).

Signed-off-by: Martin Blumenstingl 
Tested-by: Neil Armstrong 
Acked-by: Rob Herring 
---
 Documentation/devicetree/bindings/net/meson-dwmac.txt | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/meson-dwmac.txt 
b/Documentation/devicetree/bindings/net/meson-dwmac.txt
index 89e62ddc69ca..f8bc54094e3c 100644
--- a/Documentation/devicetree/bindings/net/meson-dwmac.txt
+++ b/Documentation/devicetree/bindings/net/meson-dwmac.txt
@@ -25,6 +25,20 @@ Required properties on Meson8b and newer:
- "clkin0" - first parent clock of the internal mux
- "clkin1" - second parent clock of the internal mux
 
+Optional properties on Meson8b and newer:
+- amlogic,tx-delay-ns: The internal RGMII TX clock delay (provided
+   by this driver) in nanoseconds. Allowed values
+   are: 0ns, 2ns, 4ns, 6ns.
+   This must be configured when the phy-mode is
+   "rgmii" (typically a value of 2ns is used in
+   this case).
+   When phy-mode is set to "rgmii-id" or
+   "rgmii-txid" the TX clock delay is already
+   provided by the PHY. In that case this
+   property should be set to 0ns (which disables
+   the TX clock delay in the MAC to prevent the
+   clock from going off because both PHY and MAC
+   are adding a delay).
 
 Example for Meson6:
 
-- 
2.11.0

[PATCH net-next v4 0/2] stmmac: dwmac-meson8b: configurable RGMII TX delay

2016-12-17 Thread Martin Blumenstingl

Currently the dwmac-meson8b stmmac glue driver uses a hardcoded 1/4
cycle (= 2ns) TX clock delay. This seems to work fine for many boards
(for example Odroid-C2 or Amlogic's reference boards) but there are
some others where TX traffic is simply broken.
There are probably multiple reasons why it's working on some boards
while it's broken on others:
- some of Amlogic's reference boards are using a Micrel PHY
- hardware circuit design
- maybe more...

iperf3 results on my Mecool BB2 board (Meson GXM, RTL8211F PHY) with
TX clock delay disabled on the MAC (as it's enabled in the PHY driver).
TX throughput was virtually zero before:
$ iperf3 -c 192.168.1.100 -R
Connecting to host 192.168.1.100, port 5201
Reverse mode, remote host 192.168.1.100 is sending
[  4] local 192.168.1.206 port 52828 connected to 192.168.1.100 port 5201
[ ID] Interval   Transfer Bandwidth
[  4]   0.00-1.00   sec   108 MBytes   901 Mbits/sec
[  4]   1.00-2.00   sec  94.2 MBytes   791 Mbits/sec
[  4]   2.00-3.00   sec  96.5 MBytes   810 Mbits/sec
[  4]   3.00-4.00   sec  96.2 MBytes   808 Mbits/sec
[  4]   4.00-5.00   sec  96.6 MBytes   810 Mbits/sec
[  4]   5.00-6.00   sec  96.5 MBytes   810 Mbits/sec
[  4]   6.00-7.00   sec  96.6 MBytes   810 Mbits/sec
[  4]   7.00-8.00   sec  96.5 MBytes   809 Mbits/sec
[  4]   8.00-9.00   sec   105 MBytes   884 Mbits/sec
[  4]   9.00-10.00  sec   111 MBytes   934 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval   Transfer Bandwidth   Retr
[  4]   0.00-10.00  sec  1000 MBytes   839 Mbits/sec0 sender
[  4]   0.00-10.00  sec   998 MBytes   837 Mbits/sec  receiver

iperf Done.
$ iperf3 -c 192.168.1.100
Connecting to host 192.168.1.100, port 5201
[  4] local 192.168.1.206 port 52832 connected to 192.168.1.100 port 5201
[ ID] Interval   Transfer Bandwidth   Retr  Cwnd
[  4]   0.00-1.01   sec  99.5 MBytes   829 Mbits/sec  117139 KBytes
[  4]   1.01-2.00   sec   105 MBytes   884 Mbits/sec  129   70.7 KBytes
[  4]   2.00-3.01   sec   107 MBytes   889 Mbits/sec  106187 KBytes
[  4]   3.01-4.01   sec   105 MBytes   878 Mbits/sec   92143 KBytes
[  4]   4.01-5.00   sec   105 MBytes   882 Mbits/sec  140129 KBytes
[  4]   5.00-6.01   sec   106 MBytes   883 Mbits/sec  115195 KBytes
[  4]   6.01-7.00   sec   102 MBytes   863 Mbits/sec  133   70.7 KBytes
[  4]   7.00-8.01   sec   106 MBytes   884 Mbits/sec  143   97.6 KBytes
[  4]   8.01-9.01   sec   104 MBytes   875 Mbits/sec  124107 KBytes
[  4]   9.01-10.01  sec   105 MBytes   876 Mbits/sec   90139 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval   Transfer Bandwidth   Retr
[  4]   0.00-10.01  sec  1.02 GBytes   874 Mbits/sec  1189 sender
[  4]   0.00-10.01  sec  1.02 GBytes   873 Mbits/sec  receiver

iperf Done.

I get similar TX throughput on my Meson GXBB "MXQ Pro+" board when I
disable the PHY's TX-delay and configure a 4ms TX-delay on the MAC.
So changes to at least the RTL8211F PHY driver are needed to get it
working properly in all situations.

Changes since v3:
- rebased to apply against current net-next branch (fixes a conflict
  with d2ed0a7755fe14c7 "net: ethernet: stmmac: fix of-node and
  fixed-link-phydev leaks")
- 

Changes since v2:
- moved all .dts patches (3-7) to a separate series
- removed the default 2ns TX delay when phy-mode RGMII is specified
- (rebased against current net-next)

Changes since v1:
- renamed the devicetree property "amlogic,tx-delay" to
  "amlogic,tx-delay-ns", which makes the .dts easier to read as we can
  simply specify human-readable values instead of having "preprocessor
  defines and calculation in human brain". Thanks to Andrew Lunn for
  the suggestion!
- improved documentation to indicate when the MAC TX-delay should be
  configured and how to use the PHY's TX-delay
- changed the default TX-delay in the dwmac-meson8b driver from 2ns
  to 0ms when any of the rgmii-*id modes are used (the 2ns default
  value still applies for phy-mode "rgmii")
- added patches to properly reset the PHY on Meson GXBB devices and to
  use a similar configuration than the one we use on Meson GXL devices
  (by passing a phy-handle to stmmac and defining the PHY in the mdio0
  bus - patch 3-6)
- add the "amlogic,tx-delay-ns" property to all boards which are using
  the RGMII PHY (patch 7)

Martin Blumenstingl (2):
  net: dt-bindings: add RGMII TX delay configuration to meson8b-dwmac
  net: stmmac: dwmac-meson8b: make the RGMII TX delay configurable

 .../devicetree/bindings/net/meson-dwmac.txt | 14 ++
 drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c | 21 +++--
 2 files changed, 29 insertions(+), 6 deletions(-)

-- 
2.11.0

Potential issues (security and otherwise) with the current cgroup-bpf API

2016-12-17 Thread Andy Lutomirski

Hi all-

I apologize for being rather late with this.  I didn't realize that
cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
it on the linux-api list, so I missed the discussion.

I think that the inet ingress, egress etc filters are a neat feature,
but I think the API has some issues that will bite us down the road
if it becomes stable in its current form.

Most of the problems I see are summarized in this transcript:

# mkdir cg2
# mount -t cgroup2 none cg2
# mkdir cg2/nosockets
# strace cgrp_socket_rule cg2/nosockets/ 0
...
open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3

 You can modify a cgroup after opening it O_RDONLY?

bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
log_buf=0x6020c0, kern_version=0}, 48) = 4

 This is fine.  The bpf() syscall manipulates bpf objects.

bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0

 This is not so good:

 a) The bpf() syscall is supposed to manipulate bpf objects.  This
is manipulating a cgroup.  There's no reason that a socket creation
filter couldn't be written in a different language (new iptables
table?  Simple list of address families?), but if that happened,
then using bpf() to install it would be entirely nonsensical.

 b) This is starting to be an excessively ugly multiplexer.  Among
other things, it's very unfriendly to seccomp.

# echo $$ >cg2/nosockets/cgroup.procs
# ping 127.0.0.1
ping: socket: Operation not permitted
# ls cg2/nosockets/
cgroup.controllers  cgroup.events  cgroup.procs  cgroup.subtree_control
# cat cg2/nosockets/cgroup.controllers

 Something in cgroupfs should give an indication that this cgroup
 filters socket creation, but there's nothing there.  You should also
 be able to turn the filter off from cgroupfs.

# mkdir cg2/nosockets/sockets
# /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1

 This succeeded, which means that, if this feature is enabled in 4.10,
 then we're stuck with its semantics.  If it returned -EINVAL instead,
 there would be a chance to refine it.

# echo $$ >cg2/nosockets/sockets/cgroup.procs
# ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms
^C
--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.029/0.029/0.029/0.000 ms

 Bash was inside a cgroup that disallowed socket creation, but socket
 creation wasn't disallowed.  This means that the obvious use of socket
 creation filters in nestable constainers fails insecurely.


There's also a subtle but nasty potential security problem here.
In 4.9 and before, cgroups has only one real effect in the kernel:
resource control. A process in a malicious cgroup could be DoSed,
but that was about the extent of the damage that a malicious cgroup
could do.

In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf
programs attached that can do things if various events occur. (Right
now, this means socket operations, but there are plans in the works
to do this for LSM hooks too.) These bpf programs can say yes or no,
but they can also read out various data (including socket payloads!)
and save them away where an attacker can find them. This sounds a
lot like seccomp with a narrower scope but a much stronger ability to
exfiltrate private information.

Unfortunately, while seccomp is very, very careful to prevent
injection of a privileged victim into a malicious sandbox, the
CGROUP_BPF mechanism appears to have no real security model. There
is nothing to prevent a program that's in a malicious cgroup from
running a setuid binary, and there is nothing to prevent a program
that has the ability to move itself or another program into a
malicious cgroup from doing so and then, if needed for exploitation,
exec a setuid binary.

This isn't much of a problem yet because you currently need
CAP_NET_ADMIN to create a malicious sandbox in the first place.  I'm
sure that, in the near future, someone will want to make this stuff
work in containers with delegated cgroup hierarchies, and then there
may be a real problem here.


I've included a few security people on this thread.  The current API
looks abusable, and it would be nice to find all the holes before
4.10 comes out.


(The cgrp_socket_rule source is attached.  You can build it by sticking it
 in samples/bpf and doing:

 $ make headers_install
 $ cd samples/bpf
 $ gcc -o cgrp_socket_rule cgrp_socket_rule.c libbpf.c -I../../usr/include
)

--Andy
/* eBPF example program:
 *
 * - Loads eBPF program
 *
 *   The eBPF program sets the sk_bound_dev_if index in new AF_INET{6}
 *   sockets opened by processes in the cgroup.
 *
 * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
 */

#define _GNU_SOURCE

#include 
#include 
#include 
#include 
#include

Re: Synopsys Ethernet QoS

2016-12-17 Thread Pavel Machek

Hi!

> >> So if there is a long time before handling interrupts,
> >> I guess that it makes sense that one stream could
> >> get an advantage in the net scheduler.
> >>
> >> If I find the time, and if no one beats me to it, I will try to replace
> >> the normal timers with HR timers + a smaller default timeout.
> >>
> > 
> > Can you try something like this? Highres timers will be needed, too,
> > but this fixes the logic problem.
> > 
> > You'll need to apply it twice as code is copy
> > 
> > Best regards,
> > Pavel
> > 
> > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > 
> >  */
> > priv->tx_count_frames += nfrags + 1;
> > if (likely(priv->tx_coal_frames > priv->tx_count_frames)) {
> > -   mod_timer(>txtimer,
> > - STMMAC_COAL_TIMER(priv->tx_coal_timer));
> > +   if (priv->tx_count_frames == nfrags + 1)
> > +   mod_timer(>txtimer,
> > + STMMAC_COAL_TIMER(priv->tx_coal_timer));
> > } else {
> > priv->tx_count_frames = 0;
> > priv->hw->desc->set_tx_ic(desc);
> > 
> > 
> 
> I know that this is completely of topic, but I am facing a dificulty with
> stmmac. I have interrupts, mac well configured rx packets being received
> successfully, but TX is not working, resulting in Tx errors = Total TX 
> packets.
> I have made a lot of debug and my conclusions is that by some reason when 
> using
> stmmac after starting tx dma, the hw state machine enters a deadend state
> resulting in those errors. Anyone faced this trouble?

SMP or UP system? AFAICT the driver gets the memory barriers wrong. It
does not fail completely for me, but still fails rather quickly.

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 3e40578..641b03d 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2121,11 +2205,11 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, 
struct net_device *dev)
if (mss_desc)
priv->hw->desc->set_tx_owner(mss_desc);
 
-   /* The own bit must be the latest setting done when prepare the
+   /* The own bit must be the latest setting done when preparing the
 * descriptor and then barrier is needed to make sure that
 * all is coherent before granting the DMA engine.
 */
-   smp_wmb();
+   wmb();
 
if (netif_msg_pktdata(priv)) {
pr_info("%s: curr=%d dirty=%d f=%d, e=%d, f_p=%p, nfrags %d\n",
@@ -2336,9 +2401,9 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, 
struct net_device *dev)
 
/* The own bit must be the latest setting done when prepare the
 * descriptor and then barrier is needed to make sure that
-* all is coherent before granting the DMA engine.
+* all is coherent before granting access to the DMA engine.
 */
-   smp_wmb();
+   wmb();
}
 
netdev_sent_queue(dev, skb->len);

Plus I'd suggest... at least (hand-edited). Driver should really be
modified to use readl() when accessing memory that changes.

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 3e40578..641b03d 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1309,6 +1323,8 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
status = priv->hw->desc->tx_status(>dev->stats,
  >xstats, p,
  priv->ioaddr);
+   rmb();
+   
/* Check if the descriptor is owned by the DMA */
if (unlikely(status & tx_dma_own))
break;

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature

Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock

2016-12-17 Thread Pavel Machek

On Thu 2016-12-15 23:33:22, Lino Sanfilippo wrote:
> On 15.12.2016 22:32, Lino Sanfilippo wrote:
> 
> > Ah ok. Then maybe priv->hw->dma->stop_tx() does not do the job correctly 
> > (stop the
> > tx path properly) and the HW is still active on the tx path while the tx 
> > buffers are
> > freed. OTOH stmmac_release() also stops the phy before the tx (and rx) 
> > paths are stopped.
> > Did you try to stop the phy fist in stmmac_tx_err_work(), too?
> > 
> > Regards,
> > Lino
> > 
> 
> And this is the "sledgehammer" approach: Do a complete shutdown and restart
> of the hardware in case of tx error (against net-next and only
>compile tested).

Wow, thanks a lot. I'll try to get the driver back to the non-working
state, and try it.

I believe I have some idea what is wrong there. (Missing memory barriers).

> +static void stmmac_tx_err_work(struct work_struct *work)
> +{
> + struct stmmac_priv *priv = container_of(work, struct stmmac_priv,
> + tx_err_work);
> + /* restart netdev */
> + rtnl_lock();
> + stmmac_release(priv->dev);
> + stmmac_open(priv->dev);
> + rtnl_unlock();
> +}

Won't this up/down the interface, in a way userspace can observe?

Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature

Re: [PATCH 0/2] GTP tunneling fixes for net

2016-12-17 Thread David Miller

From: Pablo Neira Ayuso 
Date: Thu, 15 Dec 2016 22:35:51 +0100

> The following patchset contains two GTP tunneling fixes for your net
> tree, they are:
> 
> 1) Offset to IPv4 header in gtp_check_src_ms_ipv4() is incorrect, thus
>this function always succeeds and therefore this defeats this sanity
>check. This allows packets that have no PDP to go though, patch from
>Lionel Gauthier.
> 
> 2) According to Note 0 of Figure 2 in Section 6 of 3GPP TS 29.060 v13.5.0
>Release 13, always set GTPv1 reserved bit to zero. This may cause
>interoperability problems, patch from Harald Welte.
> 
> Please, apply, thanks a lot!

Series applied, thanks Pablo.

Re: [PATCH] WAN: use designated initializers

2016-12-17 Thread David Miller

From: Kees Cook 
Date: Fri, 16 Dec 2016 16:59:18 -0800

> Prepare to mark sensitive kernel structures for randomization by making
> sure they're using designated initializers. These were identified during
> allyesconfig builds of x86, arm, and arm64, with most initializer fixes
> extracted from grsecurity.
> 
> Signed-off-by: Kees Cook 

Applied.

Re: [PATCH] bna: use designated initializers

2016-12-17 Thread David Miller

From: Kees Cook 
Date: Fri, 16 Dec 2016 17:00:54 -0800

> Prepare to mark sensitive kernel structures for randomization by making
> sure they're using designated initializers. These were identified during
> allyesconfig builds of x86, arm, and arm64, with most initializer fixes
> extracted from grsecurity.
> 
> Signed-off-by: Kees Cook 

Applied.

Re: [PATCH] isdn/gigaset: use designated initializers

2016-12-17 Thread David Miller

From: Kees Cook 
Date: Fri, 16 Dec 2016 16:58:06 -0800

> Prepare to mark sensitive kernel structures for randomization by making
> sure they're using designated initializers. These were identified during
> allyesconfig builds of x86, arm, and arm64, with most initializer fixes
> extracted from grsecurity.
> 
> Signed-off-by: Kees Cook 

Applied.

Re: [PATCH] net/x25: use designated initializers

2016-12-17 Thread David Miller

From: Kees Cook 
Date: Fri, 16 Dec 2016 17:03:39 -0800

> Prepare to mark sensitive kernel structures for randomization by making
> sure they're using designated initializers. These were identified during
> allyesconfig builds of x86, arm, and arm64, with most initializer fixes
> extracted from grsecurity.
> 
> Signed-off-by: Kees Cook 

Applied.

Re: [PATCH] net: use designated initializers

2016-12-17 Thread David Miller

From: Kees Cook 
Date: Fri, 16 Dec 2016 16:58:58 -0800

> Prepare to mark sensitive kernel structures for randomization by making
> sure they're using designated initializers. These were identified during
> allyesconfig builds of x86, arm, and arm64, with most initializer fixes
> extracted from grsecurity.
> 
> Signed-off-by: Kees Cook 

Applied, although "decnet: " would have been a much better
subsystem prefix.

Re: [PATCH] isdn: use designated initializers

2016-12-17 Thread David Miller

From: Kees Cook 
Date: Fri, 16 Dec 2016 17:01:42 -0800

> Prepare to mark sensitive kernel structures for randomization by making
> sure they're using designated initializers. These were identified during
> allyesconfig builds of x86, arm, and arm64, with most initializer fixes
> extracted from grsecurity.
> 
> Signed-off-by: Kees Cook 

Applied.

Re: [PATCH] ATM: use designated initializers

2016-12-17 Thread David Miller

From: Kees Cook 
Date: Fri, 16 Dec 2016 16:58:43 -0800

> Prepare to mark sensitive kernel structures for randomization by making
> sure they're using designated initializers. These were identified during
> allyesconfig builds of x86, arm, and arm64, with most initializer fixes
> extracted from grsecurity.
> 
> Signed-off-by: Kees Cook 

Applied.

Re: [net-next PATCH v6 0/5] XDP for virtio_net

2016-12-17 Thread David Miller

From: "Michael S. Tsirkin" 
Date: Fri, 16 Dec 2016 22:48:14 +0200

> On Fri, Dec 16, 2016 at 01:20:02PM -0500, David Miller wrote:
>> From: "Michael S. Tsirkin" 
>> Date: Fri, 16 Dec 2016 01:17:44 +0200
>> 
>> > OK, I think we can queue this for -next.
>> > 
>> > It's fairly limited in the kind of hardware supported, we can and
>> > probably should extend it further with time.
>> > 
>> > Acked-by: Michael S. Tsirkin 
>> 
>> Michael, thanks for reviewing.
>> 
>> Since the substance of this patch series has honestly been ready since
>> before the merge window, would you mind if I send this to Linus now?
>> 
>> That's what I was hoping I would be able to do.
>> 
>> Thanks again.
> 
> ACK, no problem.
> BTW in case people wonder, I'll be offline for a couple of weeks.
> This might delay review of some patches a bit.

Great, series applied, thanks everyone!

Re: ipv6: handle -EFAULT from skb_copy_bits

2016-12-17 Thread Dave Jones

On Sat, Dec 17, 2016 at 10:41:20AM -0500, David Miller wrote:
 > From: Dave Jones 
 > Date: Wed, 14 Dec 2016 10:47:29 -0500
 > 
 > > It seems to be possible to craft a packet for sendmsg that triggers
 > > the -EFAULT path in skb_copy_bits resulting in a BUG_ON that looks like:
 > > 
 > > RIP: 0010:[] [] 
 > > rawv6_sendmsg+0xc30/0xc40
 > > RSP: 0018:881f6c4a7c18  EFLAGS: 00010282
 > > RAX: fff2 RBX: 881f6c681680 RCX: 0002
 > > RDX: 881f6c4a7cf8 RSI: 0030 RDI: 881fed0f6a00
 > > RBP: 881f6c4a7da8 R08:  R09: 0009
 > > R10: 881fed0f6a00 R11: 0009 R12: 0030
 > > R13: 881fed0f6a00 R14: 881fee39ba00 R15: 881fefa93a80
 > > 
 > > Call Trace:
 > >  [] ? unmap_page_range+0x693/0x830
 > >  [] inet_sendmsg+0x67/0xa0
 > >  [] sock_sendmsg+0x38/0x50
 > >  [] SYSC_sendto+0xef/0x170
 > >  [] SyS_sendto+0xe/0x10
 > >  [] do_syscall_64+0x50/0xa0
 > >  [] entry_SYSCALL64_slow_path+0x25/0x25
 > > 
 > > Handle this in rawv6_push_pending_frames and jump to the failure path.
 > > 
 > > Signed-off-by: Dave Jones 
 > 
 > Hmmm, that's interesting.  Becaue the code in __ip6_append_data(), which
 > sets up the ->cork.base.length value, seems to be defensively trying to
 > avoid this possibility.
 > 
 > For example, it checks things like:
 > 
 >  if (cork->length + length > mtu - headersize && ipc6->dontfrag &&
 >  (sk->sk_protocol == IPPROTO_UDP ||
 >   sk->sk_protocol == IPPROTO_RAW)) {
 > 
 > This is why the transport offset plus the length should never exceed
 > the total length for that skb_copy_bits() call.
 > 
 > Perhaps this protocol check in the code above is incomplete?  Do you
 > know what the sk->sk_protocol value was when that BUG triggered?  That
 > might shine some light on what is really happening here.

I'll see if I can craft up a reproducer next week.
For some reason I've not hit this on my test setup at home, but it
reproduces daily in our test setup at facebook.  The only thing
I can think of is that those fb boxes are ipv6 only, so I might be
exercising v4 more at home.

Dave

Re: [PATCH] net: mvpp2: fix dma unmapping of TX buffers for fragments

2016-12-17 Thread David Miller

From: Thomas Petazzoni 
Date: Sat, 17 Dec 2016 16:26:58 +0100

> Yes, I was thinking of moving towards a single array, as it's indeed
> crazy to have three arrays for that. However, since it's a fix going
> into stable, I also wanted to keep it as simple/straightforward as
> possible and avoid refactoring other parts of the code.

By the same token, by adding a third array you are making the code
more complex, adding more error recovery paths, etc.

> If you however believe moving to one array should be done as part of
> the fix, I'll do this.

Please do.

Re: [PATCH net] sctp: sctp_transport_lookup_process should rcu_read_unlock when transport is null

2016-12-17 Thread David Miller

From: Xin Long 
Date: Thu, 15 Dec 2016 23:05:52 +0800

> Prior to this patch, sctp_transport_lookup_process didn't rcu_read_unlock
> when it failed to find a transport by sctp_addrs_lookup_transport.
> 
> This patch is to fix it by moving up rcu_read_unlock right before checking
> transport and also to remove the out path.
> 
> Fixes: 1cceda784980 ("sctp: fix the issue sctp_diag uses lock_sock in 
> rcu_read_lock")
> Signed-off-by: Xin Long 

Applied and queued up for -stable, thanks.

Re: [PATCH net] sctp: sctp_epaddr_lookup_transport should be protected by rcu_read_lock

2016-12-17 Thread David Miller

From: Xin Long 
Date: Thu, 15 Dec 2016 23:00:55 +0800

> Since commit 7fda702f9315 ("sctp: use new rhlist interface on sctp transport
> rhashtable"), sctp has changed to use rhlist_lookup to look up transport, but
> rhlist_lookup doesn't call rcu_read_lock inside, unlike 
> rhashtable_lookup_fast.
> 
> It is called in sctp_epaddr_lookup_transport and sctp_addrs_lookup_transport.
> sctp_addrs_lookup_transport is always in the protection of rcu_read_lock(),
> as __sctp_lookup_association is called in rx path or sctp_lookup_association
> which are in the protection of rcu_read_lock() already.
> 
> But sctp_epaddr_lookup_transport is called by sctp_endpoint_lookup_assoc, it
> doesn't call rcu_read_lock, which may cause "suspicious rcu_dereference_check
> usage' in __rhashtable_lookup.
> 
> This patch is to fix it by adding rcu_read_lock in sctp_endpoint_lookup_assoc
> before calling sctp_epaddr_lookup_transport.
> 
> Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport 
> rhashtable")
> Reported-by: Dmitry Vyukov 
> Signed-off-by: Xin Long 

Applied.

Re: [PATCH net 0/3] dpaa_eth: a couple of fixes

2016-12-17 Thread David Miller

From: Madalin Bucur 
Date: Thu, 15 Dec 2016 15:13:03 +0200

> This patch set introduces big endian accessors in the dpaa_eth driver
> making sure accesses to the QBMan HW are correct on little endian
> platforms. Removing a redundant Kconfig dependency on FSL_SOC.
> Adding myself as maintainer of the dpaa_eth driver.

Series applied, thanks.

Re: [PATCH 2/5] irda: irnet: Move linux/miscdevice.h include

2016-12-17 Thread David Miller

From: Corentin Labbe 
Date: Thu, 15 Dec 2016 11:42:47 +0100

> The only use of miscdevice is irda_ppp so no need to include
> linux/miscdevice.h for all irda files.
> This patch move the linux/miscdevice.h include to irnet_ppp.h
> 
> Signed-off-by: Corentin Labbe 

Applied.

Re: [PATCH 3/5] irnet: ppp: move IRNET_MINOR to include/linux/miscdevice.h

2016-12-17 Thread David Miller

From: Corentin Labbe 
Date: Thu, 15 Dec 2016 11:42:48 +0100

> This patch move the define for IRNET_MINOR to include/linux/miscdevice.h
> It is better that all minor number definitions are in the same place.
> 
> Signed-off-by: Corentin Labbe 

Applied.

Re: [PATCH 1/5] irda: irproc.c: Remove unneeded linux/miscdevice.h include

2016-12-17 Thread David Miller

From: Corentin Labbe 
Date: Thu, 15 Dec 2016 11:42:46 +0100

> irproc.c does not use any miscdevice so this patch remove this
> unnecessary inclusion.
> 
> Signed-off-by: Corentin Labbe 

Applied.

Re: [PATCH 4/5] irda: irnet: Remove unused IRNET_MAJOR define

2016-12-17 Thread David Miller

From: Corentin Labbe 
Date: Thu, 15 Dec 2016 11:42:49 +0100

> The IRNET_MAJOR define is not used, so this patch remove it.
> 
> Signed-off-by: Corentin Labbe 

Applied.

Re: [PATCH 5/5] irda: irnet: add member name to the miscdevice declaration

2016-12-17 Thread David Miller

From: Corentin Labbe 
Date: Thu, 15 Dec 2016 11:42:50 +0100

> Since the struct miscdevice have many members, it is dangerous to init
> it without members name relying only on member order.
> 
> This patch add member name to the init declaration.
> 
> Signed-off-by: Corentin Labbe 

Applied.

Re: [kernel-hardening] Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-17 Thread Jeffrey Walton

> As far as half-siphash is concerned, it occurs to me that the main
> problem will be those users who need to guarantee that output can't be
> guessed over a long period of time.  For example, if you have a
> long-running process, then the output needs to remain unguessable over
> potentially months or years, or else you might be weakening the ASLR
> protections.  If on the other hand, the hash table or the process will
> be going away in a matter of seconds or minutes, the requirements with
> respect to cryptographic strength go down significantly.

Perhaps SipHash-4-8 should be used instead of SipHash-2-4. I believe
SipHash-4-8 is recommended for the security conscious who want to be
more conservative in their security estimates.

SipHash-4-8 does not add much more processing. If you are clocking
SipHash-2-4 at 2.0 or 2.5 cpb, then SipHash-4-8 will run at 3.0 to
4.0. Both are well below MD5 times. (At least with the data sets I've
tested).

> Now, maybe this doesn't matter that much if we can guarantee (or make
> assumptions) that the attacker doesn't have unlimited access the
> output stream of get_random_{long,int}(), or if it's being used in an
> anti-DOS use case where it ultimately only needs to be harder than
> alternate ways of attacking the system.
>
> Rekeying every five minutes doesn't necessarily help the with respect
> to ASLR, but it might reduce the amount of the output stream that
> would be available to the attacker in order to be able to attack the
> get_random_{long,int}() generator, and it also reduces the value of
> doing that attack to only compromising the ASLR for those processes
> started within that five minute window.

Forgive my ignorance... I did not find reading on using the primitive
in a PRNG. Does anyone know what Aumasson or Bernstein have to say?
Aumasson's site does not seem to discuss the use case:
https://www.google.com/search?q=siphash+rng+site%3A131002.net. (And
their paper only mentions random-number once in a different context).

Making the leap from internal hash tables and short-lived network
packets to the rng case may leave something to be desired, especially
if the bits get used in unanticipated ways, like creating long term
private keys.

Jeff

Re: [PATCH] bpf: cgroup: annotate pointers in struct cgroup_bpf with __rcu

2016-12-17 Thread David Miller

From: Daniel Mack 
Date: Thu, 15 Dec 2016 10:53:21 +0100

> The member 'effective' in 'struct cgroup_bpf' is protected by RCU.
> Annotate it accordingly to squelch a sparse warning.
> 
> Signed-off-by: Daniel Mack 

Applied.

Re: [PATCH net-next 0/2] inet: Fixes for inet_csk_get_port and soreusport

2016-12-17 Thread David Miller

From: Tom Herbert 
Date: Wed, 14 Dec 2016 16:54:14 -0800

> This patch set fixes a couple of issues I noticed while debugging our
> softlockup issue in inet_csk_get_port.
> 
> - Don't allow jump into port scan in inet_csk_get_port if function
>   was called with non-zero port number (looking up explicit port
>   number).
> - When inet_csk_get_port is called with zero port number (ie. perform
>   scan) an reuseport is set on the socket, don't match sockets that
>   also have reuseport set. The intent from the user should be
>   to get a new port number and then explictly bind other
>   sockets to that number using soreuseport.
> 
> Tested:
> 
> Ran first patch on production workload with no ill effect.
> 
> For second patch, ran a little listener application and first
> demonstrated that unbound sockets with soreuseport can indeed
> be bound to unrelated soreuseport sockets.

Series applied, thanks Tom.

Re: [PATCH net] bpf, test_verifier: fix a test case error result on unprivileged

2016-12-17 Thread David Miller

From: Daniel Borkmann 
Date: Thu, 15 Dec 2016 01:39:10 +0100

> Running ./test_verifier as unprivileged lets 1 out of 98 tests fail:
> 
>   [...]
>   #71 unpriv: check that printk is disallowed FAIL
>   Unexpected error message!
>   0: (7a) *(u64 *)(r10 -8) = 0
>   1: (bf) r1 = r10
>   2: (07) r1 += -8
>   3: (b7) r2 = 8
>   4: (bf) r3 = r1
>   5: (85) call bpf_trace_printk#6
>   unknown func bpf_trace_printk#6
>   [...]
> 
> The test case is correct, just that the error outcome changed with
> ebb676daa1a3 ("bpf: Print function name in addition to function id").
> Same as with e00c7b216f34 ("bpf: fix multiple issues in selftest suite
> and samples") issue 2), so just fix up the function name.
> 
> Fixes: ebb676daa1a3 ("bpf: Print function name in addition to function id")
> Signed-off-by: Daniel Borkmann 

Applied.

Re: [PATCH net] bpf: fix regression on verifier pruning wrt map lookups

2016-12-17 Thread David Miller

From: Daniel Borkmann 
Date: Thu, 15 Dec 2016 01:30:06 +0100

> Commit 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL
> registers") introduced a regression where existing programs stopped
> loading due to reaching the verifier's maximum complexity limit,
> whereas prior to this commit they were loading just fine; the affected
> program has roughly 2k instructions.
> 
> What was found is that state pruning couldn't be performed effectively
> anymore due to mismatches of the verifier's register state, in particular
> in the id tracking. It doesn't mean that 57a09bf0a416 is incorrect per
> se, but rather that verifier needs to perform a lot more work for the
> same program with regards to involved map lookups.
 ...
> Fixes: 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL 
> registers")
> Signed-off-by: Daniel Borkmann 
> Acked-by: Thomas Graf 
> Acked-by: Alexei Starovoitov 

Applied.

Re: [PATCH net] net: vrf: Drop conntrack data after pass through VRF device on Tx

2016-12-17 Thread David Miller

From: David Ahern 
Date: Wed, 14 Dec 2016 14:31:11 -0800

> Locally originated traffic in a VRF fails in the presence of a POSTROUTING
> rule. For example,
> 
> $ iptables -t nat -A POSTROUTING -s 11.1.1.0/24  -j MASQUERADE
> $ ping -I red -c1 11.1.1.3
> ping: Warning: source address might be selected on device other than red.
> PING 11.1.1.3 (11.1.1.3) from 11.1.1.2 red: 56(84) bytes of data.
> ping: sendmsg: Operation not permitted
> 
> Worse, the above causes random corruption resulting in a panic in random
> places (I have not seen a consistent backtrace).
> 
> Call nf_reset to drop the conntrack info following the pass through the
> VRF device.  The nf_reset is needed on Tx but not Rx because of the order
> in which NF_HOOK's are hit: on Rx the VRF device is after the real ingress
> device and on Tx it is is before the real egress device. Connection
> tracking should be tied to the real egress device and not the VRF device.
> 
> Fixes: 8f58336d3f78a ("net: Add ethernet header for pass through VRF device")
> Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device")
> Signed-off-by: David Ahern 

Applied and queued up for -stable.

Re: [PATCH net] net: vrf: Fix NAT within a VRF

2016-12-17 Thread David Miller

From: David Ahern 
Date: Wed, 14 Dec 2016 11:06:18 -0800

> Connection tracking with VRF is broken because the pass through the VRF
> device drops the connection tracking info. Removing the call to nf_reset
> allows DNAT and MASQUERADE to work across interfaces within a VRF.
> 
> Fixes: 73e20b761acf ("net: vrf: Add support for PREROUTING rules on vrf 
> device")
> Signed-off-by: David Ahern 

Applied and queued up for -stable, thanks David.

Re: [PATCH net 0/2] net/sched: cls_flower: Fix mask handling

2016-12-17 Thread David Miller

From: Paul Blakey 
Date: Wed, 14 Dec 2016 19:00:56 +0200

> The series fix how the mask is being handled.

I guess this is reasonable, series applied, thanks.

Re: [kernel-hardening] Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-17 Thread Theodore Ts'o

On Fri, Dec 16, 2016 at 09:15:03PM -0500, George Spelvin wrote:
> >> - Ted, Andy Lutorminski and I will try to figure out a construction of
> >>   get_random_long() that we all like.

We don't have to find the most optimal solution right away; we can
approach this incrementally, after all.

So long as we replace get_random_{long,int}() with something which is
(a) strictly better in terms of security given today's use of MD5, and
(b) which is strictly *faster* than the current construction on 32-bit
and 64-bit systems, we can do that, and can try to make it be faster
while maintaining some minimum level of security which is sufficient
for all current users of get_random_{long,int}() and which can be
clearly artificulated for future users of get_random_{long,int}().

The main worry at this point I have is benchmarking siphash on a
32-bit system.  It may be that simply batching the chacha20 output so
that we're using the urandom construction more efficiently is the
better way to go, since that *does* meet the criteron of strictly more
secure and strictly faster than the current MD5 solution.  I'm open to
using siphash, but I want to see the the 32-bit numbers first.

As far as half-siphash is concerned, it occurs to me that the main
problem will be those users who need to guarantee that output can't be
guessed over a long period of time.  For example, if you have a
long-running process, then the output needs to remain unguessable over
potentially months or years, or else you might be weakening the ASLR
protections.  If on the other hand, the hash table or the process will
be going away in a matter of seconds or minutes, the requirements with
respect to cryptographic strength go down significantly.

Now, maybe this doesn't matter that much if we can guarantee (or make
assumptions) that the attacker doesn't have unlimited access the
output stream of get_random_{long,int}(), or if it's being used in an
anti-DOS use case where it ultimately only needs to be harder than
alternate ways of attacking the system.

Rekeying every five minutes doesn't necessarily help the with respect
to ASLR, but it might reduce the amount of the output stream that
would be available to the attacker in order to be able to attack the
get_random_{long,int}() generator, and it also reduces the value of
doing that attack to only compromising the ASLR for those processes
started within that five minute window.

Cheers,

- Ted

P.S.  I'm using ASLR as an example use case, above; of course we will
need to make similar eximainations of the other uses of
get_random_{long,int}().

P.P.S.  We might also want to think about potentially defining
get_random_{long,int}() to be unambiguously strong, and then creating
a get_weak_random_{long,int}() which on platforms where performance
might be a consideration, it uses a weaker algorithm perhaps with some
kind of rekeying interval.

Re: ipv6: handle -EFAULT from skb_copy_bits

2016-12-17 Thread David Miller

From: Dave Jones 
Date: Wed, 14 Dec 2016 10:47:29 -0500

> It seems to be possible to craft a packet for sendmsg that triggers
> the -EFAULT path in skb_copy_bits resulting in a BUG_ON that looks like:
> 
> RIP: 0010:[] [] rawv6_sendmsg+0xc30/0xc40
> RSP: 0018:881f6c4a7c18  EFLAGS: 00010282
> RAX: fff2 RBX: 881f6c681680 RCX: 0002
> RDX: 881f6c4a7cf8 RSI: 0030 RDI: 881fed0f6a00
> RBP: 881f6c4a7da8 R08:  R09: 0009
> R10: 881fed0f6a00 R11: 0009 R12: 0030
> R13: 881fed0f6a00 R14: 881fee39ba00 R15: 881fefa93a80
> 
> Call Trace:
>  [] ? unmap_page_range+0x693/0x830
>  [] inet_sendmsg+0x67/0xa0
>  [] sock_sendmsg+0x38/0x50
>  [] SYSC_sendto+0xef/0x170
>  [] SyS_sendto+0xe/0x10
>  [] do_syscall_64+0x50/0xa0
>  [] entry_SYSCALL64_slow_path+0x25/0x25
> 
> Handle this in rawv6_push_pending_frames and jump to the failure path.
> 
> Signed-off-by: Dave Jones 

Hmmm, that's interesting.  Becaue the code in __ip6_append_data(), which
sets up the ->cork.base.length value, seems to be defensively trying to
avoid this possibility.

For example, it checks things like:

if (cork->length + length > mtu - headersize && ipc6->dontfrag &&
(sk->sk_protocol == IPPROTO_UDP ||
 sk->sk_protocol == IPPROTO_RAW)) {

This is why the transport offset plus the length should never exceed
the total length for that skb_copy_bits() call.

Perhaps this protocol check in the code above is incomplete?  Do you
know what the sk->sk_protocol value was when that BUG triggered?  That
might shine some light on what is really happening here.

Thanks.

Re: [PATCH] net: mvpp2: fix dma unmapping of TX buffers for fragments

2016-12-17 Thread Thomas Petazzoni

Hello,

On Sat, 17 Dec 2016 10:20:57 -0500 (EST), David Miller wrote:

> You're really destroying cache locality, and making things overly
> complicated, by having two arrays.  Actually this is now the third in
> this structure alone.  That's crazy.
> 
> Just have one array for the TX ring software state:
> 
> struct tx_buff_info {
>   struct sk_buff  *skb;
>   dma_addr_t  dma_addr;
>   unsigned intsize;
> };
> 
> Then in the per-cpu TX struct:
> 
>   struct tx_buff_info *info;
> 
> This way every data access by the cpu for processing a ring entry
> will be localized, increasing cache hit rates.
> 
> This also significantly simplifies the code that allocates and
> frees this memory.

Yes, I was thinking of moving towards a single array, as it's indeed
crazy to have three arrays for that. However, since it's a fix going
into stable, I also wanted to keep it as simple/straightforward as
possible and avoid refactoring other parts of the code.

If you however believe moving to one array should be done as part of
the fix, I'll do this.

Thanks for your feedback,

Thomas Petazzoni
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

Re: [PATCH v3] net: macb: Added PCI wrapper for Platform Driver.

2016-12-17 Thread David Miller

From: Bartosz Folta 
Date: Wed, 14 Dec 2016 06:39:15 +

> There are hardware PCI implementations of Cadence GEM network
> controller. This patch will allow to use such hardware with reuse of
> existing Platform Driver.
> 
> Signed-off-by: Bartosz Folta 
> ---
> Changed in v3:
> Fixed dependencies in Kconfig.
> ---
> Changed in v2:
> Respin to net-next. Changed patch formatting.

Applied.

Re: [PATCH net] ibmveth: calculate gso_segs for large packets

2016-12-17 Thread David Miller

From: Thomas Falcon 
Date: Tue, 13 Dec 2016 18:15:09 -0600

> Include calculations to compute the number of segments
> that comprise an aggregated large packet.
> 
> Signed-off-by: Thomas Falcon 

Applied.

Re: [PATCH] [v2] net: qcom/emac: don't try to claim clocks on ACPI systems

2016-12-17 Thread David Miller

From: Timur Tabi 
Date: Tue, 13 Dec 2016 17:49:02 -0600

> On ACPI systems, clocks are not available to drivers directly.  They are
> handled exclusively by ACPI and/or firmware, so there is no clock driver.
> Calls to clk_get() always fail, so we should not even attempt to claim
> any clocks on ACPI systems.
> 
> Signed-off-by: Timur Tabi 
> ---
> 
> Notes:
> v2: move check into functions

Applied, thanks.

Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-17 Thread George Spelvin

To follow up on my comments that your benchmark results were peculiar,
here's my benchmark code.

It just computes the hash of all n*(n+1)/2 possible non-empty substrings
of a buffer of n (called "max" below) bytes.  "cpb" is "cycles per byte".

(The average length is (n+2)/3, c.f. https://oeis.org/A000292)

On x86-32, HSipHash is asymptotically twice the speed of SipHash,
rising to 2.5x for short strings:

SipHash/HSipHash benchmark, sizeof(long) = 4
 SipHash: max=   4 cycles= 10495 cpb=524.7500 (sum=47a4f5554869fa97)
HSipHash: max=   4 cycles=  3400 cpb=170. (sum=146a863e)
 SipHash: max=   8 cycles= 24468 cpb=203.9000 (sum=21c41a86355affcc)
HSipHash: max=   8 cycles=  9237 cpb= 76.9750 (sum=d3b5e0cd)
 SipHash: max=  16 cycles= 94622 cpb=115.9583 (sum=26d816b72721e48f)
HSipHash: max=  16 cycles= 34499 cpb= 42.2782 (sum=16bb7475)
 SipHash: max=  32 cycles=418767 cpb= 69.9811 (sum=dd5a97694b8a832d)
HSipHash: max=  32 cycles=156695 cpb= 26.1857 (sum=eed00fcb)
 SipHash: max=  64 cycles=   2119152 cpb= 46.3101 (sum=a2a725aecc09ed00)
HSipHash: max=  64 cycles=   1008678 cpb= 22.0428 (sum=99b9f4f)
 SipHash: max= 128 cycles=  12728659 cpb= 35.5788 (sum=420878cd20272817)
HSipHash: max= 128 cycles=   5452931 cpb= 15.2419 (sum=f1f4ad18)
 SipHash: max= 256 cycles=  38931946 cpb= 13.7615 (sum=e05dfb28b90dfd98)
HSipHash: max= 256 cycles=  13807312 cpb=  4.8805 (sum=ceeafcc1)
 SipHash: max= 512 cycles= 205537380 cpb=  9.1346 (sum=7d129d4de145fbea)
HSipHash: max= 512 cycles= 103420960 cpb=  4.5963 (sum=7f15a313)
 SipHash: max=1024 cycles=1540259472 cpb=  8.5817 (sum=cca7cbdc778ca8af)
HSipHash: max=1024 cycles= 796090824 cpb=  4.4355 (sum=d8f3374f)

On x86-64, SipHash is consistently faster, asymptotically approaching 2x
for long strings:

SipHash/HSipHash benchmark, sizeof(long) = 8
 SipHash: max=   4 cycles=  2642 cpb=132.1000 (sum=47a4f5554869fa97)
HSipHash: max=   4 cycles=  2498 cpb=124.9000 (sum=146a863e)
 SipHash: max=   8 cycles=  5270 cpb= 43.9167 (sum=21c41a86355affcc)
HSipHash: max=   8 cycles=  7140 cpb= 59.5000 (sum=d3b5e0cd)
 SipHash: max=  16 cycles= 19950 cpb= 24.4485 (sum=26d816b72721e48f)
HSipHash: max=  16 cycles= 23546 cpb= 28.8554 (sum=16bb7475)
 SipHash: max=  32 cycles= 80188 cpb= 13.4004 (sum=dd5a97694b8a832d)
HSipHash: max=  32 cycles=101218 cpb= 16.9148 (sum=eed00fcb)
 SipHash: max=  64 cycles=373286 cpb=  8.1575 (sum=a2a725aecc09ed00)
HSipHash: max=  64 cycles=535568 cpb= 11.7038 (sum=99b9f4f)
 SipHash: max= 128 cycles=   2075224 cpb=  5.8006 (sum=420878cd20272817)
HSipHash: max= 128 cycles=   3336820 cpb=  9.3270 (sum=f1f4ad18)
 SipHash: max= 256 cycles=  14276278 cpb=  5.0463 (sum=e05dfb28b90dfd98)
HSipHash: max= 256 cycles=  28847880 cpb= 10.1970 (sum=ceeafcc1)
 SipHash: max= 512 cycles=  50135180 cpb=  2.2281 (sum=7d129d4de145fbea)
HSipHash: max= 512 cycles=  86145916 cpb=  3.8286 (sum=7f15a313)
 SipHash: max=1024 cycles= 334111900 cpb=  1.8615 (sum=cca7cbdc778ca8af)
HSipHash: max=1024 cycles= 640432452 cpb=  3.5682 (sum=d8f3374f)


Here's the code; compile with -DSELFTEST.  (The main purpose of
printing the sum is to prevent dead code elimination.)


#if SELFTEST
#include 
#include 

static inline uint64_t rol64(uint64_t word, unsigned int shift)
{
return word << shift | word >> (64 - shift);
}

static inline uint32_t rol32(uint32_t word, unsigned int shift)
{
return word << shift | word >> (32 - shift);
}

static inline uint64_t get_unaligned_le64(void const *p)
{
return *(uint64_t const *)p;
}

static inline uint32_t get_unaligned_le32(void const *p)
{
return *(uint32_t const *)p;
}

static inline uint64_t le64_to_cpup(uint64_t const *p)
{
return *p;
}

static inline uint32_t le32_to_cpup(uint32_t const *p)
{
return *p;
}


#else
#include/* For rol64 */
#include 
#include 
#include 
#endif

/* The basic ARX mixing function, taken from Skein */
#define SIP_MIX(a, b, s) ((a) += (b), (b) = rol64(b, s), (b) ^= (a))

/*
 * The complete SipRound.  Note that, when unrolled twice like below,
 * the 32-bit rotates drop out on 32-bit machines.
 */
#define SIP_ROUND(a, b, c, d) \
(SIP_MIX(a, b, 13), SIP_MIX(c, d, 16), (a) = rol64(a, 32), \
 SIP_MIX(c, b, 17), SIP_MIX(a, d, 21), (c) = rol64(c, 32))

/*
 * This is rolled up more than most implementations, resulting in about
 * 55% the code size.  Speed is a few precent slower.  A crude benchmark
 * (for (i=1; i <= max; i++) for (j = 0; j < 4096-i; j++) hash(buf+j, i);)
 * produces the following timings (in usec):
 *
 *  i386i386i386x86_64  x86_64  x86_64  x86_64
 * Length   small   unroll  halfmd4 small   unroll  halfmd4 teahash
 * 1..4106910291608 195 160 399 690
 * 1..8248323813851 410 360 9881659
 * 1..12   430341526207 690 61816422690
 * 1..16   61225931

Re: [PATCH] net: mvpp2: fix dma unmapping of TX buffers for fragments

2016-12-17 Thread David Miller

From: Thomas Petazzoni 
Date: Tue, 13 Dec 2016 17:53:15 +0100

> diff --git a/drivers/net/ethernet/marvell/mvpp2.c 
> b/drivers/net/ethernet/marvell/mvpp2.c
> index 1026c45..d168b13 100644
> --- a/drivers/net/ethernet/marvell/mvpp2.c
> +++ b/drivers/net/ethernet/marvell/mvpp2.c
> @@ -791,6 +791,8 @@ struct mvpp2_txq_pcpu {
>   /* Array of transmitted buffers' physical addresses */
>   dma_addr_t *tx_buffs;
>  
> + size_t *tx_data_size;
> +

You're really destroying cache locality, and making things overly
complicated, by having two arrays.  Actually this is now the third in
this structure alone.  That's crazy.

Just have one array for the TX ring software state:

struct tx_buff_info {
struct sk_buff  *skb;
dma_addr_t  dma_addr;
unsigned intsize;
};

Then in the per-cpu TX struct:

struct tx_buff_info *info;

This way every data access by the cpu for processing a ring entry
will be localized, increasing cache hit rates.

This also significantly simplifies the code that allocates and
frees this memory.

Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-17 Thread Jeffrey Walton

> diff --git a/lib/test_siphash.c b/lib/test_siphash.c
> new file mode 100644
> index ..93549e4e22c5
> --- /dev/null
> +++ b/lib/test_siphash.c
> @@ -0,0 +1,83 @@
> +/* Test cases for siphash.c
> + *
> + * Copyright (C) 2016 Jason A. Donenfeld . All Rights 
> Reserved.
> + *
> + * This file is provided under a dual BSD/GPLv2 license.
> + *
> + * SipHash: a fast short-input PRF
> + * https://131002.net/siphash/
> + *
> + * This implementation is specifically for SipHash2-4.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/* Test vectors taken from official reference source available at:
> + * https://131002.net/siphash/siphash24.c
> + */
> +static const u64 test_vectors[64] = {
> +   0x726fdb47dd0e0e31ULL, 0x74f839c593dc67fdULL, 0x0d6c8009d9a94f5aULL,
> +   0x85676696d7fb7e2dULL, 0xcf2794e0277187b7ULL, 0x18765564cd99a68dULL,
> +   0xcbc9466e58fee3ceULL, 0xab0200f58b01d137ULL, 0x93f5f5799a932462ULL,
> +   0x9e0082df0ba9e4b0ULL, 0x7a5dbbc594ddb9f3ULL, 0xf4b32f46226bada7ULL,
> +   0x751e8fbc860ee5fbULL, 0x14ea5627c0843d90ULL, 0xf723ca908e7af2eeULL,
> +   0xa129ca6149be45e5ULL, 0x3f2acc7f57c29bdbULL, 0x699ae9f52cbe4794ULL,
> +   0x4bc1b3f0968dd39cULL, 0xbb6dc91da77961bdULL, 0xbed65cf21aa2ee98ULL,
> +   0xd0f2cbb02e3b67c7ULL, 0x93536795e3a33e88ULL, 0xa80c038ccd5ccec8ULL,
> +   0xb8ad50c6f649af94ULL, 0xbce192de8a85b8eaULL, 0x17d835b85bbb15f3ULL,
> +   0x2f2e6163076bcfadULL, 0xde4daaaca71dc9a5ULL, 0xa6a2506687956571ULL,
> +   0xad87a3535c49ef28ULL, 0x32d892fad841c342ULL, 0x7127512f72f27cceULL,
> +   0xa7f32346f95978e3ULL, 0x12e0b01abb051238ULL, 0x15e034d40fa197aeULL,
> +   0x314dffbe0815a3b4ULL, 0x027990f029623981ULL, 0xcadcd4e59ef40c4dULL,
> +   0x9abfd8766a33735cULL, 0x0e3ea96b5304a7d0ULL, 0xad0c42d6fc585992ULL,
> +   0x187306c89bc215a9ULL, 0xd4a60abcf3792b95ULL, 0xf935451de4f21df2ULL,
> +   0xa9538f0419755787ULL, 0xdb9acddff56ca510ULL, 0xd06c98cd5c0975ebULL,
> +   0xe612a3cb9ecba951ULL, 0xc766e62cfcadaf96ULL, 0xee64435a9752fe72ULL,
> +   0xa192d576b245165aULL, 0x0a8787bf8ecb74b2ULL, 0x81b3e73d20b49b6fULL,
> +   0x7fa8220ba3b2eceaULL, 0x245731c13ca42499ULL, 0xb78dbfaf3a8d83bdULL,
> +   0xea1ad565322a1a0bULL, 0x60e61c23a3795013ULL, 0x6606d7e446282b93ULL,
> +   0x6ca4ecb15c5f91e1ULL, 0x9f626da15c9625f3ULL, 0xe51b38608ef25f57ULL,
> +   0x958a324ceb064572ULL
> +};
> +static const siphash_key_t test_key =
> +   { 0x0706050403020100ULL , 0x0f0e0d0c0b0a0908ULL };
> +
> +static int __init siphash_test_init(void)
> +{
> +   u8 in[64] __aligned(SIPHASH_ALIGNMENT);
> +   u8 in_unaligned[65];
> +   u8 i;
> +   int ret = 0;
> +
> +   for (i = 0; i < 64; ++i) {
> +   in[i] = i;
> +   in_unaligned[i + 1] = i;
> +   if (siphash(in, i, test_key) != test_vectors[i]) {
> +   pr_info("self-test aligned %u: FAIL\n", i + 1);
> +   ret = -EINVAL;
> +   }
> +   if (siphash_unaligned(in_unaligned + 1, i, test_key) != 
> test_vectors[i]) {
> +   pr_info("self-test unaligned %u: FAIL\n", i + 1);
> +   ret = -EINVAL;
> +   }
> +   }
> +   if (!ret)
> +   pr_info("self-tests: pass\n");
> +   return ret;
> +}
> +
> +static void __exit siphash_test_exit(void)
> +{
> +}
> +
> +module_init(siphash_test_init);
> +module_exit(siphash_test_exit);
> +
> +MODULE_AUTHOR("Jason A. Donenfeld ");
> +MODULE_LICENSE("Dual BSD/GPL");
> --
> 2.11.0
>

I believe the output of SipHash depends upon endianness. Folks who
request a digest through the af_alg interface will likely expect a
byte array.

I think that means on little endian machines, values like element 0
must be reversed byte reversed:

0x726fdb47dd0e0e31ULL => 31,0e,0e,dd,47,db,6f,72

If I am not mistaken, that value (and other tv's) are returned here:

return (v0 ^ v1) ^ (v2 ^ v3);

It may be prudent to include the endian reversal in the test to ensure
big endian machines produce expected results. Some closely related
testing on an old Apple PowerMac G5 revealed that result needed to be
reversed before returning it to a caller.

Jeff

DO YOU NEED A LOAN??

2016-12-17 Thread bancoleite

LOAN

Re: Soft lockup in inet_put_port on 4.6

2016-12-17 Thread Josef Bacik


> On Dec 17, 2016, at 6:09 AM, Hannes Frederic Sowa 
>  wrote:
> 
>> On 16.12.2016 23:50, Josef Bacik wrote:
>>> On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert  wrote:
 On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik  wrote:
> On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik  wrote:
> 
>> On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik  wrote:
>> 
>> On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
>>  wrote:
>>> 
>>> Hi Josef,
>>> 
 On 15.12.2016 19:53, Josef Bacik wrote:
 
  On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert 
 wrote:
> 
>  On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek
> 
>  wrote:
>> 
>>   On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert
>> 
>>  wrote:
>>> 
>>>   I think there may be some suspicious code in
>>> inet_csk_get_port. At
>>>   tb_found there is:
>>> 
>>>   if (((tb->fastreuse > 0 && reuse) ||
>>>(tb->fastreuseport > 0 &&
>>> 
>>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>> sk->sk_reuseport && uid_eq(tb->fastuid,
>>>  uid))) &&
>>>   smallest_size == -1)
>>>   goto success;
>>>   if
>>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>>  tb, true)) {
>>>   if ((reuse ||
>>>(tb->fastreuseport > 0 &&
>>> sk->sk_reuseport &&
>>> 
>>>  !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>> uid_eq(tb->fastuid, uid))) &&
>>>   smallest_size != -1 &&
>>> --attempts >=
>>> 0) {
>>>   spin_unlock_bh(>lock);
>>>   goto again;
>>>   }
>>>   goto fail_unlock;
>>>   }
>>> 
>>>   AFAICT there is redundancy in these two conditionals.  The
>>> same
>>> clause
>>>   is being checked in both: (tb->fastreuseport > 0 &&
>>>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>> sk->sk_reuseport &&
>>>   uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this
>>> is true
>>> the
>>>   first conditional should be hit, goto done,  and the second
>>> will
>>> never
>>>   evaluate that part to true-- unless the sk is changed (do
>>> we need
>>>   READ_ONCE for sk->sk_reuseport_cb?).
>> 
>>   That's an interesting point... It looks like this function also
>>   changed in 4.6 from using a single local_bh_disable() at the
>> beginning
>>   with several spin_lock(>lock) to exclusively
>>   spin_lock_bh(>lock) at each locking point.  Perhaps
>> the full
>> bh
>>   disable variant was preventing the timers in your stack
>> trace from
>>   running interleaved with this function before?
> 
> 
>  Could be, although dropping the lock shouldn't be able to
> affect the
>  search state. TBH, I'm a little lost in reading function, the
>  SO_REUSEPORT handling is pretty complicated. For instance,
>  rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in
> that
>  function and also in every call to inet_csk_bind_conflict. I
> wonder
> if
>  we can simply this under the assumption that SO_REUSEPORT is only
>  allowed if the port number (snum) is explicitly specified.
 
 
  Ok first I have data for you Hannes, here's the time distributions
  before during and after the lockup (with all the debugging in
 place
 the
  box eventually recovers).  I've attached it as a text file
 since it is
  long.
>>> 
>>> 
>>> Thanks a lot!
>>> 
  Second is I was thinking about why we would spend so much time
 doing
 the
  ->owners list, and obviously it's because of the massive amount of
  timewait sockets on the owners list.  I wrote the following
 dumb patch
  and tested it and the problem has disappeared completely.  Now
 I don't
  know if this is right at all, but I thought it was weird we
 weren't
  copying the soreuseport option from the original socket onto
 the twsk.
  Is there are reason

Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-17 Thread George Spelvin

BTW, here's some SipHash code I wrote for Linux a while ago.

My target application was ext4 directory hashing, resulting in different
implementation choices, although I still think that a rolled-up
implementation like this is reasonable.  Reducing I-cache impact speeds
up the calling code.

One thing I'd like to suggest you steal is the way it handles the
fetch of the final partial word.  It's a lot smaller and faster than
an 8-way case statement.


#include/* For rol64 */
#include 
#include 
#include 

/* The basic ARX mixing function, taken from Skein */
#define SIP_MIX(a, b, s) ((a) += (b), (b) = rol64(b, s), (b) ^= (a))

/*
 * The complete SipRound.  Note that, when unrolled twice like below,
 * the 32-bit rotates drop out on 32-bit machines.
 */
#define SIP_ROUND(a, b, c, d) \
(SIP_MIX(a, b, 13), SIP_MIX(c, d, 16), (a) = rol64(a, 32), \
 SIP_MIX(c, b, 17), SIP_MIX(a, d, 21), (c) = rol64(c, 32))

/*
 * This is rolled up more than most implementations, resulting in about
 * 55% the code size.  Speed is a few precent slower.  A crude benchmark
 * (for (i=1; i <= max; i++) for (j = 0; j < 4096-i; j++) hash(buf+j, i);)
 * produces the following timings (in usec):
 *
 *  i386i386i386x86_64  x86_64  x86_64  x86_64
 * Length   small   unroll  halfmd4 small   unroll  halfmd4 teahash
 * 1..4106910291608 195 160 399 690
 * 1..8248323813851 410 360 9881659
 * 1..12   430341526207 690 61816422690
 * 1..16   612259318668 968 87623633786
 * 1..20   83488137   112451323118531625567
 * 1..24  10580   10327   139351657150440667635
 * 1..28  13211   12956   168032069187150289759
 * 1..32  15843   15572   19725247022606084   11932
 * 1..36  18864   18609   24259293426787566   14794
 * 1..1024  5890194 6130242 10264816 881933  881244 3617392 7589036
 *
 * The performance penalty is quite minor, decreasing for long strings,
 * and it's significantly faster than half_md4, so I'm going for the
 * I-cache win.
 */
uint64_t
siphash24(char const *in, size_t len, uint32_t const seed[4])
{
uint64_t a = 0x736f6d6570736575;/* somepseu */
uint64_t b = 0x646f72616e646f6d;/* dorandom */
uint64_t c = 0x6c7967656e657261;/* lygenera */
uint64_t d = 0x7465646279746573;/* tedbytes */
uint64_t m = 0;
uint8_t padbyte = len;

/*
 * Mix in the 128-bit hash seed.  This is in a format convenient
 * to the ext3/ext4 code.  Please feel free to adapt the
 * */
if (seed) {
m = seed[2] | (uint64_t)seed[3] << 32;
b ^= m;
d ^= m;
m = seed[0] | (uint64_t)seed[1] << 32;
/* a ^= m; is done in loop below */
c ^= m;
}

/*
 * By using the same SipRound code for all iterations, we
 * save space, at the expense of some branch prediction.  But
 * branch prediction is hard because of variable length anyway.
 */
len = len/8 + 3;/* Now number of rounds to perform */
do {
a ^= m;

switch (--len) {
unsigned bytes;

default:/* Full words */
d ^= m = get_unaligned_le64(in);
in += 8;
break;
case 2: /* Final partial word */
/*
 * We'd like to do one 64-bit fetch rather than
 * mess around with bytes, but reading past the end
 * might hit a protection boundary.  Fortunately,
 * we know that protection boundaries are aligned,
 * so we can consider only three cases:
 * - The remainder occupies zero words
 * - The remainder fits into one word
 * - The remainder straddles two words
 */
bytes = padbyte & 7;

if (bytes == 0) {
m = 0;
} else {
unsigned offset = (unsigned)(uintptr_t)in & 7;

if (offset + bytes <= 8) {
m = le64_to_cpup((uint64_t const *)
(in - offset));
m >>= 8*offset;
} else {
m = get_unaligned_le64(in);
}
m &= ((uint64_t)1 <<

Re: Soft lockup in inet_put_port on 4.6

2016-12-17 Thread Hannes Frederic Sowa

On 16.12.2016 23:50, Josef Bacik wrote:
> On Fri, Dec 16, 2016 at 5:18 PM, Tom Herbert  wrote:
>> On Fri, Dec 16, 2016 at 2:08 PM, Josef Bacik  wrote:
>>>  On Fri, Dec 16, 2016 at 10:21 AM, Josef Bacik  wrote:

  On Fri, Dec 16, 2016 at 9:54 AM, Josef Bacik  wrote:
>
>  On Thu, Dec 15, 2016 at 7:07 PM, Hannes Frederic Sowa
>   wrote:
>>
>>  Hi Josef,
>>
>>  On 15.12.2016 19:53, Josef Bacik wrote:
>>>
>>>   On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert 
>>>  wrote:

   On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek
 
   wrote:
>
>On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert
> 
>   wrote:
>>
>>I think there may be some suspicious code in
>> inet_csk_get_port. At
>>tb_found there is:
>>
>>if (((tb->fastreuse > 0 && reuse) ||
>> (tb->fastreuseport > 0 &&
>> 
>> !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>  sk->sk_reuseport && uid_eq(tb->fastuid,
>>   uid))) &&
>>smallest_size == -1)
>>goto success;
>>if
>> (inet_csk(sk)->icsk_af_ops->bind_conflict(sk,
>>   tb, true)) {
>>if ((reuse ||
>> (tb->fastreuseport > 0 &&
>>  sk->sk_reuseport &&
>>
>>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>  uid_eq(tb->fastuid, uid))) &&
>>smallest_size != -1 &&
>> --attempts >=
>>  0) {
>>spin_unlock_bh(>lock);
>>goto again;
>>}
>>goto fail_unlock;
>>}
>>
>>AFAICT there is redundancy in these two conditionals.  The
>> same
>>  clause
>>is being checked in both: (tb->fastreuseport > 0 &&
>>!rcu_access_pointer(sk->sk_reuseport_cb) &&
>> sk->sk_reuseport &&
>>uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this
>> is true
>>  the
>>first conditional should be hit, goto done,  and the second
>> will
>>  never
>>evaluate that part to true-- unless the sk is changed (do
>> we need
>>READ_ONCE for sk->sk_reuseport_cb?).
>
>That's an interesting point... It looks like this function also
>changed in 4.6 from using a single local_bh_disable() at the
>  beginning
>with several spin_lock(>lock) to exclusively
>spin_lock_bh(>lock) at each locking point.  Perhaps
> the full
>  bh
>disable variant was preventing the timers in your stack
> trace from
>running interleaved with this function before?


   Could be, although dropping the lock shouldn't be able to
 affect the
   search state. TBH, I'm a little lost in reading function, the
   SO_REUSEPORT handling is pretty complicated. For instance,
   rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in
  that
   function and also in every call to inet_csk_bind_conflict. I
 wonder
  if
   we can simply this under the assumption that SO_REUSEPORT is only
   allowed if the port number (snum) is explicitly specified.
>>>
>>>
>>>   Ok first I have data for you Hannes, here's the time distributions
>>>   before during and after the lockup (with all the debugging in
>>> place
>>>  the
>>>   box eventually recovers).  I've attached it as a text file
>>> since it is
>>>   long.
>>
>>
>>  Thanks a lot!
>>
>>>   Second is I was thinking about why we would spend so much time
>>> doing
>>>  the
>>>   ->owners list, and obviously it's because of the massive amount of
>>>   timewait sockets on the owners list.  I wrote the following
>>> dumb patch
>>>   and tested it and the problem has disappeared completely.  Now
>>> I don't
>>>   know if this is right at all, but I thought it was weird we
>>> weren't
>>>   copying the soreuseport option from the original socket onto
>>> the twsk.
>>>   Is there are reason we aren't doing this currently?  Does this
>>> help
>>>   explain what is happening?  Thanks,
>>
>>
>>  The patch is

mlx4: Bug in XDP_TX + 16 rx-queues

2016-12-17 Thread Martin KaFai Lau

Hi All,

I have been debugging with XDP_TX and 16 rx-queues.

1) When 16 rx-queues is used and an XDP prog is doing XDP_TX,
it seems that the packet cannot be XDP_TX out if the pkt
is received from some particular CPUs (/rx-queues).

2) If 8 rx-queues is used, it does not have problem.

3) The 16 rx-queues problem also went away after reverting these
two patches:
15fca2c8eb41 net/mlx4_en: Add ethtool statistics for XDP cases
67f8b1dcb9ee net/mlx4_en: Refactor the XDP forwarding rings scheme

4) I can reproduce the problem by running samples/bof/xdp_ip_tunnel at
the receiver side.  The sender side sends out TCP packets with
source port ranging from 1 to 1024.  At the sender side also, do
a tcpdump to capture the ip-tunnel packet reflected by xdp_ip_tunnel.
With 8 rx-queues,  I can get all 1024 packets back.  With 16 rx-queues,
I can only get 512 packets back.  It is a 40 CPUs machine.
I also checked the rx*_xdp_tx counters (from ethtool -S eth0) to ensure
the xdp prog has XDP_TX-ed it out.

Not saying that 67f8b1dcb9ee is 100% the cause because there are other
changes since then.  It is merely a brain dump on what I have already
tried.

Tariq/Saeed, any thoughts?  I can easily test some patches in
my setup.

Thanks,
--Martin

Re: [PATCH net 2/2] bpf: fix overflow in prog accounting

2016-12-17 Thread Daniel Borkmann


On 12/17/2016 03:52 AM, kbuild test robot wrote:

Hi Daniel,

[auto build test ERROR on net/master]

url:
https://github.com/0day-ci/linux/commits/Daniel-Borkmann/bpf-dynamically-allocate-digest-scratch-buffer/20161217-090046
config: sparc-defconfig (attached as .config)
compiler: sparc-linux-gcc (GCC) 6.2.0
reproduce:
 wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
 chmod +x ~/bin/make.cross
 # save the attached .config to linux build tree
 make.cross ARCH=sparc

All errors (new ones prefixed by >>):

kernel/bpf/core.c: In function '__bpf_prog_charge':

kernel/bpf/core.c:80:50: error: 'struct user_struct' has no member named 
'locked_vm'; did you mean 'locked_shm'?

   user_bufs = atomic_long_add_return(pages, >locked_vm);
  ^~
kernel/bpf/core.c:82:32: error: 'struct user_struct' has no member named 
'locked_vm'; did you mean 'locked_shm'?
atomic_long_sub(pages, >locked_vm);
^~
kernel/bpf/core.c: In function '__bpf_prog_uncharge':
kernel/bpf/core.c:93:31: error: 'struct user_struct' has no member named 
'locked_vm'; did you mean 'locked_shm'?
   atomic_long_sub(pages, >locked_vm);


Argh, right, I'll send v2 later today.

Re: [PATCH net 2/2] sctp: not copying duplicate addrs to the assoc's bind address list

2016-12-17 Thread Xin Long

On Wed, Aug 24, 2016 at 6:38 PM, Neil Horman  wrote:
> On Wed, Aug 24, 2016 at 01:14:27PM +0800, Xin Long wrote:
>> >> > Ah, I see what you're doing.  Ok, this makes some sense, at least on 
>> >> > the receive
>> >> > side, when you get a cookie unpacked and modify the remote peers 
>> >> > address list,
>> >> > it makes sense to check for duplicates.  On the local side however, I 
>> >> > would,
>> >> > instead of checking it when the list gets copied, I'd check it when the 
>> >> > master
>> >> > list gets updated (in the NETDEV_UP event notifier for the local 
>> >> > address list,
>> >>
>> >> I was thinking about to check it in the NETDEV_UP, yes it can make the
>> >> master list has no duplicated addresses.  But what if two same addresses
>> >> events come, and they come from different NICs (though I can't point  out
>> >> the valid use case), then we filter there.
>> >>
>> > That I think would be a bug in the protocol code.  For the ipv4 case, all
>> > addresses are owned by the system and the same addresses added to multiple
>> > interfaces should not be allowed.  The same is true of ipv6 case.  The only
>> > exception there is a link local address and that should still be unique 
>> > within
>> > the context of an address/dev tuple.
>> >
>> understand, just sounds a little harsh. :-)
>>
>> For now, does it make sense to you to just leave as the master list
>> is, and check
>> the duplicate address when sctp is really binding them ?
>>
> I would think so, yes.

Hi, Neil,

About this patch, I think we are on the page, right ?

If yes, I will repost v2, but other than improving some changelog,
no other change compare to v1. Do you agree ?

Re: [TSN RFC v2 0/9] TSN driver for the kernel

2016-12-17 Thread Henrik Austad

Hi Richard,

On Fri, Dec 16, 2016 at 11:05:30PM +0100, Richard Cochran wrote:
> On Fri, Dec 16, 2016 at 06:59:04PM +0100, hen...@austad.us wrote:
> > The driver is directed via ConfigFS as we need userspace to handle
> > stream-reservation (MSRP), discovery and enumeration (IEEE 1722.1) and
> > whatever other management is needed.
> 
> I complained about configfs before, but you didn't listen.

Yes you did, I remember quite well, and no, I didn't listen :)

At the time, there were other issues that I had to address. The 
configfs-part is fairly isolated. As I tried to explain the last round, 
the *reason* I've used ConfigFS thus far, is because it makes it pretty 
easy from userspace to signal the driver to create a new alsa-device.

And the reason I haven't changed configfs, is because so far, that part has 
worked fairly well and have made testing quite easy. At this stage, *this* 
is what is helpful, not a perfect interface. This does not mean that 
configfs is set in stone.

To clearify:
I'm sending out a new set now because, what I have works _fairly_ well for 
testing and a way to see what you can do with AVB. Using spotify to play 
music on random machines is quite entertaining.

It is by no means -done-, nor do I consider it done. I have been tight on 
time, and instead of sitting in an office polishing on some code, I thought 
it better to send out a new (and not done) set of patches so that others 
could see it still being worked on. If this turned out to be noise-only, I 
appologize!

> > 2 new fields in netdev_ops have been introduced, and the Intel
> > igb-driver has been updated (as this an AVB-capable NIC which is
> > available as a PCI-e card).
> 
> The igb hacks show that you are on the wrong track.  We can and should
> be able to support TSN without resorting to driver specific hacks and
> module parameters.

I was not able to find a sane way to change the mode of the NIC, some of 
the settings required to enable Qav-mode must be done when bringing the 
NIC up, so I needed hooks in _probe().

ANother elemnt needed is a way for tsn_core to ascertain if a NIC is 
capable of TSN or not (this would be ndo_tsn_capable)

Then finally, you need to update values in a per-tx-queue manner when a new 
stream is ready (hence ndo_tsn_link_configure).

What you mean by 'driver specific hacks' is not obvious though, TSN 
requires a set of fairly standardized parameters (priority code points, 
size of frames to send in a new stream and so on), adding this to the 
hw-registers in the NIC is an operation that will be common for all 
TSN-capable NICs.

> > Before reading on - this is not even beta, but I'd really appreciate if
> > people would comment on the overall architecture and perhaps provide
> > some pointers to where I should improve/fix/update
> 
> As I said before about V1, this architecture stinks. 

I like feedback when it's short, sweet and to the point
2 out of 3 ain't that bad ;)

> You appear to have continued hacking along and posted the same design 
> again.  Did you even address any of the points I raised back then?

So you did raise a lot of good points the last round, and no, I have not 
had the time to address them properly. That does not mean I do not *want* 
to (apart from configfs actually having worked quite nicely thus far and 
'shim' being a name I like ;)

From the last round of discussion:

> 1. A proper userland stack for AVDECC, MAAP, FQTSS, and so on.  The
>OpenAVB project does not offer much beyond simple examples.

Yes, I fully agree, as far as I know, no-one is working on this. That being 
said, I have not paid much attention the userspace tooling lately. But all 
of this must be handled in userspace, having avdecc in the kernel would be 
an utter nightmare :)

> 2. A user space audio application that puts it all together, making
>   use of the services in #1, the linuxptp gPTP service, the ALSA
>   services, and the network connections.  This program will have all
>   the knowledge about packet formats, AV encodings, and the local HW
>   capabilities.  This program cannot yet be written, as we still need
>   some kernel work in the audio and networking subsystems.

And therein lies the problem. It cannot yet be written, so we have to start 
in *some* end. And as I repeatedly stated in June, I'm at an RFC here, 
trying to spark some interest and lure other developers in :)

Also, I really do not want a media-application to care about _where_ the 
frames are going. Sure, I see the issue of configuring a link, but that can 
be done from _outside_ the media-application. VLC (or aplay, or totem, or 
.. take your pick) should not have to worry about this.

Applications that require finer control over timestamping is easier to 
adapt to AVB than all the others, I'd rather add special knobs for those 
who are interested than adding a set of knobs that -all- applications must 
be aware of.

Could be that we are talking about the same thing, just from different 
perspectives.

Re: [TSN RFC v2 5/9] Add TSN header for the driver

2016-12-17 Thread Henrik Austad

On Fri, Dec 16, 2016 at 11:09:38PM +0100, Richard Cochran wrote:
> On Fri, Dec 16, 2016 at 06:59:09PM +0100, hen...@austad.us wrote:
> > +/*
> > + * List of current subtype fields in the common header of AVTPDU
> > + *
> > + * Note: AVTPDU is a remnant of the standards from when it was AVB.
> > + *
> > + * The list has been updated with the recent values from IEEE 1722, draft 
> > 16.
> > + */
> > +enum avtp_subtype {
> > +   TSN_61883_IIDC = 0, /* IEC 61883/IIDC Format */
> > +   TSN_MMA_STREAM, /* MMA Streams */
> > +   TSN_AAF,/* AVTP Audio Format */
> > +   TSN_CVF,/* Compressed Video Format */
> > +   TSN_CRF,/* Clock Reference Format */
> > +   TSN_TSCF,   /* Time-Synchronous Control Format */
> > +   TSN_SVF,/* SDI Video Format */
> > +   TSN_RVF,/* Raw Video Format */
> > +   /* 0x08 - 0x6D reserved */
> > +   TSN_AEF_CONTINOUS = 0x6e, /* AES Encrypted Format Continous */
> > +   TSN_VSF_STREAM, /* Vendor Specific Format Stream */
> > +   /* 0x70 - 0x7e reserved */
> > +   TSN_EF_STREAM = 0x7f,   /* Experimental Format Stream */
> > +   /* 0x80 - 0x81 reserved */
> > +   TSN_NTSCF = 0x82,   /* Non Time-Synchronous Control Format */
> > +   /* 0x83 - 0xed reserved */
> > +   TSN_ESCF = 0xec,/* ECC Signed Control Format */
> > +   TSN_EECF,   /* ECC Encrypted Control Format */
> > +   TSN_AEF_DISCRETE,   /* AES Encrypted Format Discrete */
> > +   /* 0xef - 0xf9 reserved */
> > +   TSN_ADP = 0xfa, /* AVDECC Discovery Protocol */
> > +   TSN_AECP,   /* AVDECC Enumeration and Control Protocol */
> > +   TSN_ACMP,   /* AVDECC Connection Management Protocol */
> > +   /* 0xfd reserved */
> > +   TSN_MAAP = 0xfe,/* MAAP Protocol */
> > +   TSN_EF_CONTROL, /* Experimental Format Control */
> > +};
> 
> The kernel shouldn't be in the business of assembling media packets.

No, but assembling the packets and shipping frames to a destination is not 
neccessarily the same thing.

A nice workflow would be to signal to the shim that "I'm sending a 
compressed video format" and then the shim/tsn_core will ship out the 
frames over the network - and then you need to set TSN_CVF as subtype in 
each header.

That does not that mean you should do H.264 encode/decode *in* the kernel

Perhaps this is better placed in include/uapi/tsn.h so that userspace and 
kernel share the same header?

-- 
Henrik Austad


signature.asc
Description: PGP signature

Re: [PATCH 1/2] bpf: do not use KMALLOC_SHIFT_MAX

2016-12-17 Thread Michal Hocko

On Fri 16-12-16 16:28:21, Alexei Starovoitov wrote:
> On Sat, Dec 17, 2016 at 12:39:17AM +0100, Michal Hocko wrote:
> > On Fri 16-12-16 15:23:42, Alexei Starovoitov wrote:
> > > On Fri, Dec 16, 2016 at 11:02:35PM +0100, Michal Hocko wrote:
> > > > On Fri 16-12-16 10:02:10, Alexei Starovoitov wrote:
> > > > > On Thu, Dec 15, 2016 at 05:47:21PM +0100, Michal Hocko wrote:
> > > > > > From: Michal Hocko 
> > > > > > 
> > > > > > 01b3f52157ff ("bpf: fix allocation warnings in bpf maps and integer
> > > > > > overflow") has added checks for the maximum allocateable size. It
> > > > > > (ab)used KMALLOC_SHIFT_MAX for that purpose. While this is not 
> > > > > > incorrect
> > > > > > it is not very clean because we already have KMALLOC_MAX_SIZE for 
> > > > > > this
> > > > > > very reason so let's change both checks to use KMALLOC_MAX_SIZE 
> > > > > > instead.
> > > > > > 
> > > > > > Cc: Alexei Starovoitov 
> > > > > > Signed-off-by: Michal Hocko 
> > > > > 
> > > > > Nack until the patches 1 and 2 are reversed.
> > > > 
> > > > I do not insist on ordering. The thing is that it shouldn't matter all
> > > > that much. Or are you worried about bisectability?
> > > 
> > > This patch 1 strongly depends on patch 2 !
> > > Therefore order matters.
> > > The patch 1 by itself is broken.
> > > The commit log is saying
> > > '(ab)used KMALLOC_SHIFT_MAX for that purpose .. use KMALLOC_MAX_SIZE 
> > > instead'
> > > that is also incorrect. We cannot do that until KMALLOC_MAX_SIZE is fixed.
> > > So please change the order
> > 
> > Yes, I agree that using KMALLOC_MAX_SIZE could lead to a warning with
> > the current ordering. Why that matters all that much is less clear to
> > me. The allocation would simply fail and you would return ENOMEM rather
> > than E2BIG. Does this really matter?
> > 
> > Anyway, as I've said, I do not really insist on the current ordering and
> > the will ask Andrew to reorder them. I am just really wondering about
> > such a strong pushback about something that barely matters. Or maybe I
> > am just missing your point and checking KMALLOC_MAX_SIZE without an
> > update would lead to a wrong behavior, user space breakage, crash or
> > anything similar.
> 
> if admin set ulimit for locked memory high enough for the particular user,
> that non-root user will be able to trigger warn_on_once in 
> __alloc_pages_slowpath
> which is not acceptable.

But why is the warning such a big deal?

Also note that such a setup would be inherently dangerous. Even the
default ulimit for the locked memory allows to allocat 64k which means
that an untrusted user will be able to request PAGE_ALLOC_COSTLY_ORDER
and potentially deplete those larger blocks to the extend it hits the
OOM killer with the current gfp flags.

I think what you really want is a GFP_NORETRY for size > PAGE_SIZE and
fallback to the vmalloc for failure. But that is a separate topic.

> Also see the comment in hashtab.c
>   if (htab->map.value_size >= (1 << (KMALLOC_SHIFT_MAX - 1)) -
>   MAX_BPF_STACK - sizeof(struct htab_elem))
>   /* if value_size is bigger, the user space won't be able to
>* access the elements via bpf syscall. This check also makes
>* sure that the elem_size doesn't overflow and it's
>* kmalloc-able later in htab_map_update_elem()
>*/
>   goto free_htab;

I have seen this comment before, but honestly, I do not understand it
(well apart from the overflow part).
htab_map_update_elem has to be able to handle the allocation failure in
any case. Note that any allocation larger than 64kB is likely to fail
anyway.

> 
> > > and fix the commit log to say that KMALLOC_MAX_SIZE
> > > is actually valid limit now.
> > 
> > KMALLOC_MAX_SIZE has always been the right limit. It's value has been
> > incorrect but that is to be fixed now. Using KMALLOC_SHIFT_MAX is simply
> > abusing an internal constant. So I am not sure what should be fixed in
> > the changelog.
> 
> that's exactly my problem with this patch and the commit log.
> You think it's abusing KMALLOC_SHIFT_MAX whereas it's doing so
> for reasons stated above.
> That piece of code cannot use KMALLOC_MAX_SIZE until it's fixed.
> So commit log should say something like:
> "now since KMALLOC_MAX_SIZE is fixed and size < KMALLOC_MAX_SIZE condition
> guarantees warn free allocation in kmalloc(value_size, GFP_USER | 
> __GFP_NOWARN);
> we can safely use KMALLOC_MAX_SIZE instead of KMALLOC_SHIFT_MAX"

OK, fair enough, I will update the changelog

-- 
Michal Hocko
SUSE Labs

87 matches

Mail list logo