Re: [PATCH v2 net] net: dsa: microchip: initialize mutex before use

2018-11-02 Thread David Miller
From: 
Date: Fri, 2 Nov 2018 19:23:41 -0700

> From: Tristram Ha 
> 
> Initialize mutex before use.  Avoid kernel complaint when
> CONFIG_DEBUG_LOCK_ALLOC is enabled.
> 
> Fixes: b987e98e50ab90e5 ("dsa: add DSA switch driver for Microchip KSZ9477")
> Signed-off-by: Tristram Ha 
> Reviewed-by: Pavel Machek 
> Reviewed-by: Andrew Lunn 
> Reviewed-by: Florian Fainelli 

Applied and queued up for -stable.


Re: [PATCH net v7] net/ipv6: Add anycast addresses to a global hashtable

2018-11-02 Thread David Miller
From: Jeff Barnhill <0xeff...@gmail.com>
Date: Fri,  2 Nov 2018 20:23:57 +

> icmp6_send() function is expensive on systems with a large number of
> interfaces. Every time it’s called, it has to verify that the source
> address does not correspond to an existing anycast address by looping
> through every device and every anycast address on the device.  This can
> result in significant delays for a CPU when there are a large number of
> neighbors and ND timers are frequently timing out and calling
> neigh_invalidate().
> 
> Add anycast addresses to a global hashtable to allow quick searching for
> matching anycast addresses.  This is based on inet6_addr_lst in addrconf.c.
> 
> Signed-off-by: Jeff Barnhill <0xeff...@gmail.com>

Applied, thank you.


[PATCH net] sctp: define SCTP_SS_DEFAULT for Stream schedulers

2018-11-02 Thread Xin Long
According to rfc8260#section-4.3.2, SCTP_SS_DEFAULT is required to be
defined as SCTP_SS_FCFS or SCTP_SS_RR.

This patch uses SCTP_SS_FCFS as the value of SCTP_SS_DEFAULT.

Fixes: 5e32a431 ("sctp: introduce stream scheduler foundations")
Reported-by: Jianwen Ji 
Signed-off-by: Xin Long 
---
 include/uapi/linux/sctp.h | 1 +
 net/sctp/outqueue.c   | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 680ecc3..c81feb3 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -1153,6 +1153,7 @@ struct sctp_add_streams {
 /* SCTP Stream schedulers */
 enum sctp_sched_type {
SCTP_SS_FCFS,
+   SCTP_SS_DEFAULT = SCTP_SS_FCFS,
SCTP_SS_PRIO,
SCTP_SS_RR,
SCTP_SS_MAX = SCTP_SS_RR
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 9cb854b..c37e1c2 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -212,7 +212,7 @@ void sctp_outq_init(struct sctp_association *asoc, struct sctp_outq *q)
INIT_LIST_HEAD(&q->retransmit);
INIT_LIST_HEAD(&q->sacked);
INIT_LIST_HEAD(&q->abandoned);
-   sctp_sched_set_sched(asoc, SCTP_SS_FCFS);
+   sctp_sched_set_sched(asoc, SCTP_SS_DEFAULT);
 }
 
 /* Free the outqueue structure and any related pending chunks.
-- 
2.1.0



[PATCH net] sctp: fix strchange_flags name for Stream Change Event

2018-11-02 Thread Xin Long
As defined in rfc6525#section-6.1.3, SCTP_STREAM_CHANGE_DENIED
and SCTP_STREAM_CHANGE_FAILED should be used instead of
SCTP_ASSOC_CHANGE_DENIED and SCTP_ASSOC_CHANGE_FAILED.

To keep the compatibility, fix it by adding two macros.

Fixes: b444153fb5a6 ("sctp: add support for generating add stream change event notification")
Reported-by: Jianwen Ji 
Signed-off-by: Xin Long 
---
 include/uapi/linux/sctp.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 34dd3d4..680ecc3 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -568,6 +568,8 @@ struct sctp_assoc_reset_event {
 
 #define SCTP_ASSOC_CHANGE_DENIED   0x0004
 #define SCTP_ASSOC_CHANGE_FAILED   0x0008
+#define SCTP_STREAM_CHANGE_DENIED  SCTP_ASSOC_CHANGE_DENIED
+#define SCTP_STREAM_CHANGE_FAILED  SCTP_ASSOC_CHANGE_FAILED
 struct sctp_stream_change_event {
__u16 strchange_type;
__u16 strchange_flags;
-- 
2.1.0



[PATCH v2 net] net: dsa: microchip: initialize mutex before use

2018-11-02 Thread Tristram.Ha
From: Tristram Ha 

Initialize mutex before use.  Avoid kernel complaint when
CONFIG_DEBUG_LOCK_ALLOC is enabled.

Fixes: b987e98e50ab90e5 ("dsa: add DSA switch driver for Microchip KSZ9477")
Signed-off-by: Tristram Ha 
Reviewed-by: Pavel Machek 
Reviewed-by: Andrew Lunn 
Reviewed-by: Florian Fainelli 
---
v2
- Add endorsements

v1
- Remove comment

 drivers/net/dsa/microchip/ksz_common.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c
index 54e0ca6..86b6464 100644
--- a/drivers/net/dsa/microchip/ksz_common.c
+++ b/drivers/net/dsa/microchip/ksz_common.c
@@ -1117,11 +1117,6 @@ static int ksz_switch_init(struct ksz_device *dev)
 {
int i;
 
-   mutex_init(&dev->reg_mutex);
-   mutex_init(&dev->stats_mutex);
-   mutex_init(&dev->alu_mutex);
-   mutex_init(&dev->vlan_mutex);
-
dev->ds->ops = &ksz_switch_ops;
 
for (i = 0; i < ARRAY_SIZE(ksz_switch_chips); i++) {
@@ -1206,6 +1201,11 @@ int ksz_switch_register(struct ksz_device *dev)
if (dev->pdata)
dev->chip_id = dev->pdata->chip_id;
 
+   mutex_init(&dev->reg_mutex);
+   mutex_init(&dev->stats_mutex);
+   mutex_init(&dev->alu_mutex);
+   mutex_init(&dev->vlan_mutex);
+
if (ksz_switch_detect(dev))
return -EINVAL;
 
-- 
1.9.1



[RFC PATCH] net/mlx5e: Temp software stats variable is not required

2018-11-02 Thread Saeed Mahameed
mlx5e_grp_sw_update_stats can be called from two threads,
1) ndo_get_stats64
2) get_ethtool_stats

For this reason, and to minimize the impact of concurrency issues on 64-bit
machines, mlx5e_grp_sw_update_stats folds the software stats into a temporary
sw stats variable and at the end memcpys it to the global driver stats;
both ethtool and ndo will use the global software stats copy to report
whatever stats they need.

Actually, ndo_get_stats64 doesn't need to fold the whole software stats
(mlx5e_grp_sw_update_stats); all it needs is five counters to fill the
relevant rtnl_link_stats64 stats parameter.

Hence this patch introduces a simpler helper function, mlx5e_fold_sw_stats,
to fold software stats for ndo_get_stats64. It works directly on the
rtnl_link_stats64 stats parameter and not on the global or even a temporary
mlx5e_sw_stats. And since ndo_get_stats64 doesn't call
mlx5e_grp_sw_update_stats any more, we can make it static and remove the
temp var.

This patch also fixes the high stack usage of mlx5e_grp_sw_update_stats on
x86 with gcc-4.9 and higher, and the concurrency issue between mlx5's
ndo_get_stats64 and get_ethtool_stats.

Reported-by: Arnd Bergmann 
Reported-by: Andrew Morton 
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 36 ++-
 .../ethernet/mellanox/mlx5/core/en_stats.c|  6 ++--
 .../ethernet/mellanox/mlx5/core/en_stats.h|  2 --
 3 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 1243edbedc9e..2d1c5bb81acd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3437,11 +3437,35 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
}
 }
 
+static void mlx5e_fold_sw_stats(struct mlx5e_priv *priv, struct rtnl_link_stats64 *s)
+{
+   int i;
+
+   /* not required ? */
+   memset(s, 0, sizeof(*s));
+
+   for (i = 0; i < mlx5e_get_netdev_max_channels(priv->netdev); i++) {
+   struct mlx5e_channel_stats *channel_stats = &priv->channel_stats[i];
+   struct mlx5e_rq_stats *rq_stats = &channel_stats->rq;
+   int j;
+
+   s->rx_packets   += rq_stats->packets;
+   s->rx_bytes += rq_stats->bytes;
+
+   for (j = 0; j < priv->max_opened_tc; j++) {
+   struct mlx5e_sq_stats *sq_stats = &channel_stats->sq[j];
+
+   s->tx_packets+= sq_stats->packets;
+   s->tx_bytes  += sq_stats->bytes;
+   s->tx_dropped+= sq_stats->dropped;
+   }
+   }
+}
+
 static void
 mlx5e_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
 {
struct mlx5e_priv *priv = netdev_priv(dev);
-   struct mlx5e_sw_stats *sstats = &priv->stats.sw;
struct mlx5e_vport_stats *vstats = &priv->stats.vport;
struct mlx5e_pport_stats *pstats = &priv->stats.pport;
 
@@ -3453,14 +3477,8 @@ mlx5e_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
	stats->rx_bytes   = PPORT_802_3_GET(pstats, a_octets_received_ok);
	stats->tx_packets = PPORT_802_3_GET(pstats, a_frames_transmitted_ok);
	stats->tx_bytes   = PPORT_802_3_GET(pstats, a_octets_transmitted_ok);
-   } else {
-   mlx5e_grp_sw_update_stats(priv);
-   stats->rx_packets = sstats->rx_packets;
-   stats->rx_bytes   = sstats->rx_bytes;
-   stats->tx_packets = sstats->tx_packets;
-   stats->tx_bytes   = sstats->tx_bytes;
-   stats->tx_dropped = sstats->tx_queue_dropped;
-   }
+   } else
+   mlx5e_fold_sw_stats(priv, stats);
 
stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 1e55b9c27ffc..8afc2a183a23 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -126,9 +126,9 @@ static int mlx5e_grp_sw_fill_stats(struct mlx5e_priv *priv, u64 *data, int idx)
return idx;
 }
 
-void mlx5e_grp_sw_update_stats(struct mlx5e_priv *priv)
+static void mlx5e_grp_sw_update_stats(struct mlx5e_priv *priv)
 {
-   struct mlx5e_sw_stats temp, *s = &temp;
+   struct mlx5e_sw_stats *s = &priv->stats.sw;
int i;
 
memset(s, 0, sizeof(*s));
@@ -211,8 +211,6 @@ void mlx5e_grp_sw_update_stats(struct mlx5e_priv *priv)
s->tx_cqes  += sq_stats->cqes;
}
}
-
-   memcpy(&priv->stats.sw, s, sizeof(*s));
 }
 
 static const struct counter_desc q_stats_desc[] = {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 77f7

RE: [Intel-wired-lan] [PATCH v2 net] igb: shorten maximum PHC timecounter update interval

2018-11-02 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Miroslav Lichvar
> Sent: Friday, October 26, 2018 10:13 AM
> To: netdev@vger.kernel.org
> Cc: intel-wired-...@lists.osuosl.org; Richard Cochran
> ; Thomas Gleixner 
> Subject: [Intel-wired-lan] [PATCH v2 net] igb: shorten maximum PHC
> timecounter update interval
> 
> The timecounter needs to be updated at least once per ~550 seconds in
> order to avoid a 40-bit SYSTIM timestamp being misinterpreted as an old
> timestamp.
> 
> Since commit 500462a9de65 ("timers: Switch to a non-cascading wheel"),
> scheduling of delayed work seems to be less accurate and a requested
> delay of 540 seconds may actually be longer than 550 seconds. Also, the
> PHC may be adjusted to run up to 6% faster than real time and the system
> clock up to 10% slower. Shorten the delay to 360 seconds to be sure the
> timecounter is updated in time.
> 
> This fixes an issue with HW timestamps on 82580/I350/I354 being off by
> ~1100 seconds for few seconds every ~9 minutes.
> 
> Cc: Jacob Keller 
> Cc: Richard Cochran 
> Cc: Thomas Gleixner 
> Signed-off-by: Miroslav Lichvar 
> ---
>  drivers/net/ethernet/intel/igb/igb_ptp.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>

Tested-by: Aaron Brown 


pull-request: bpf 2018-11-03

2018-11-02 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix BPF prog kallsyms and bpf_prog_get_info_by_fd() jited address export
   to not use the page start but the actual start instead, to work correctly
   with profiling, e.g. finding hot instructions from stack traces. Also fix
   the latter with regard to dumping address and jited len for !multi prog,
   from Song.

2) Fix bpf_prog_get_info_by_fd() for !root to zero info.nr_jited_func_lens
   instead of leaving the user provided length, from Daniel.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!



The following changes since commit 7de414a9dd91426318df7b63da024b2b07e53df5:

  net: drop skb on failure in ip_check_defrag() (2018-11-01 13:55:30 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to 28c2fae726bf5003cd209b0d5910a642af98316f:

  bpf: fix bpf_prog_get_info_by_fd to return 0 func_lens for unpriv (2018-11-02 
13:51:15 -0700)


Daniel Borkmann (2):
  Merge branch 'bpf-accurate-prog-addr'
  bpf: fix bpf_prog_get_info_by_fd to return 0 func_lens for unpriv

Song Liu (3):
  bpf: show real jited prog address in /proc/kallsyms
  bpf: show real jited address in bpf_prog_info->jited_ksyms
  bpf: show main program address and length in bpf_prog_info

 kernel/bpf/core.c|  4 +---
 kernel/bpf/syscall.c | 35 +--
 2 files changed, 26 insertions(+), 13 deletions(-)


Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-02 Thread Paweł Staszewski

On 01.11.2018 21:37, Saeed Mahameed wrote:

On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote:

On 01.11.2018 10:50, Saeed Mahameed wrote:

On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote:

Hi

So maybee someone will be interested how linux kernel handles
normal
traffic (not pktgen :) )


Server HW configuration:

CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz

NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)


Server software:

FRR - as routing daemon

enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound to local
NUMA node)

enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to local
NUMA node)


Maximum traffic that the server can handle:

Bandwidth

bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
 input: /proc/net/dev type: rate
 \ iface                   Rx                 Tx              Total
 ====================================================================
   enp175s0f1:       28.51 Gb/s         37.24 Gb/s        65.74 Gb/s
   enp175s0f0:       38.07 Gb/s         28.44 Gb/s        66.51 Gb/s
 --------------------------------------------------------------------
        total:       66.58 Gb/s         65.67 Gb/s       132.25 Gb/s


Packets per second:

bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
 input: /proc/net/dev type: rate
 - iface                   Rx                 Tx              Total
 ====================================================================
   enp175s0f1:   5248589.00 P/s     3486617.75 P/s    8735207.00 P/s
   enp175s0f0:   3557944.25 P/s     5232516.00 P/s    8790460.00 P/s
 --------------------------------------------------------------------
        total:   8806533.00 P/s     8719134.00 P/s   17525668.00 P/s


After reaching those limits, the NICs on the upstream side (more RX
traffic) start to drop packets.


I just don't understand why the server can't handle more bandwidth
(~40 Gbit/s is the limit where all CPUs are at 100% utilization) - while
pps on the RX side is increasing.


Where do you see 40 Gb/s? You showed that both ports on the same NIC (same
pcie link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) = 132.25 Gb/s, which
aligns with your pcie link limit, so what am I missing?

hmm yes, that was my concern also - because I can't find information
anywhere about whether that bandwidth is uni- or bidirectional - so if
126 Gbit for x16 8GT is unidir, then bidir will be 126/2 ~63 Gbit - which
would fit the total bw on both ports

I think it is bidir.
So yes - we are hitting some other problem there. I think the PCIe max bw
of 126 Gbit is most probably bidirectional, so RX 126 Gbit and at the same
time TX should be 126 Gbit.

Maybe this can also explain why CPU load rises rapidly from 120 Gbit/s in
total to 132 Gbit (the bwm-ng counters are from /proc/net - so there can be
some error in reading them when offloading (gro/gso/tso) is enabled on the
NICs).


I was thinking that maybe I reached some pcie x16 limit - but x16 8GT is
126 Gbit - and also when testing with pktgen I can reach more bw and pps
(like 4x more compared to normal internet traffic)


Are you forwarding when using pktgen as well, or are you just testing the
RX side pps?

Yes, pktgen was tested on single port RX.
I can also check forwarding to eliminate pcie limits.


So this explains why you have more RX pps, since tx is idle and pcie
will be free to do only rx.

[...]



ethtool -S enp175s0f1
NIC statistics:
rx_packets: 173730800927
rx_bytes: 99827422751332
tx_packets: 142532009512
tx_bytes: 184633045911222
tx_tso_packets: 25989113891
tx_tso_bytes: 132933363384458
tx_tso_inner_packets: 0
tx_tso_inner_bytes: 0
tx_added_vlan_packets: 74630239613
tx_nop: 2029817748
rx_lro_packets: 0
rx_lro_bytes: 0
rx_ecn_mark: 0
rx_removed_vlan_packets: 173730800927
rx_csum_unnecessary: 0
rx_csum_none: 434357
rx_csum_complete: 173730366570
rx_csum_unnecessary_inner: 0
rx_xdp_drop: 0
rx_xdp_redirect: 0
rx_xdp_tx_xmit: 0
rx_xdp_tx_full: 0
rx_xdp_tx_err: 0
rx_xdp_tx_cqe: 0
tx_csum_none: 38260960853
tx_csum_partial: 36369278774
tx_csum_partial_inner: 0
tx_queue_stopped: 1
tx_queue_dropped: 0
tx_xmit_more: 748638099
tx_recover: 0
tx_cqes: 73881645031
tx_queue_wake: 1
tx_udp_seg_rem: 0
tx_cqe_err: 0
tx_xdp_xmit: 0
tx_xdp_full: 0
tx_xdp_err: 0
tx_xdp_cqes: 0
rx_wqe_err: 0
rx_mpwqe_filler_cqes: 0
rx_mpwqe_filler_strides: 0
rx_buff_alloc_err: 0
rx_cqe_compress_blks: 0
rx_cqe_compress_pkts: 0

If this is a pcie bottleneck it might be useful to enable CQE compression
(to reduce PCIe completion descriptor transactions); you should see the
above rx_cqe_compress_pkts counter increase.

Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-02 Thread Paweł Staszewski

On 02.11.2018 20:02, Paweł Staszewski wrote:

On 02.11.2018 15:20, Aaron Lu wrote:

On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote:

On Fri, 2 Nov 2018 13:23:56 +0800
Aaron Lu  wrote:


On Thu, Nov 01, 2018 at 08:23:19PM +, Saeed Mahameed wrote:

On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:

On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
wrote:
... ...

Section copied out:

   mlx5e_poll_tx_cq
   |
    --16.34%--napi_consume_skb
              |
              |--12.65%--__free_pages_ok
              |          |
              |           --11.86%--free_one_page
              |                     |
              |                     |--10.10%--queued_spin_lock_slowpath
              |                     |
              |                      --0.65%--_raw_spin_lock
This callchain looks like it is freeing higher order pages than order 0:
__free_pages_ok is only called for pages whose order is bigger than 0.
mlx5 rx uses only order 0 pages, so I don't know where these high order
tx SKBs are coming from...

Perhaps here:
__netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and
__napi_alloc_frag() will all call page_frag_alloc(), which will use
__page_frag_cache_refill() to get an order 3 page if possible, or fall
back to an order 0 page if order 3 page is not available.

I'm not sure if your workload will use the above code path though.

TL;DR: this is order-0 pages (code-walk trough proof below)

To Aaron, the network stack *can* call __free_pages_ok() with order-0
pages, via:

static void skb_free_head(struct sk_buff *skb)
{
unsigned char *head = skb->head;

if (skb->head_frag)
    skb_free_frag(head);
else
    kfree(head);
}

static inline void skb_free_frag(void *addr)
{
page_frag_free(addr);
}

/*
  * Frees a page fragment allocated out of either a compound or order 0 page.
  */
void page_frag_free(void *addr)
{
struct page *page = virt_to_head_page(addr);

if (unlikely(put_page_testzero(page)))
    __free_pages_ok(page, compound_order(page));
}
EXPORT_SYMBOL(page_frag_free);

I think there is a problem here - order-0 pages are freed directly to the
buddy allocator, bypassing per-cpu-pages. This might be the reason lock
contention appeared on the free path. Can someone apply the below diff and
see if the lock contention is gone?

Will test it tonight


Patch applied
perf report:
https://ufile.io/sytfh



But I also need to wait for more traffic - currently the CPUs are sleeping.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2ef1c17942f..65c0ae13215a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4554,8 +4554,14 @@ void page_frag_free(void *addr)
 {
 	struct page *page = virt_to_head_page(addr);
 
-	if (unlikely(put_page_testzero(page)))
-		__free_pages_ok(page, compound_order(page));
+	if (unlikely(put_page_testzero(page))) {
+		unsigned int order = compound_order(page);
+
+		if (order == 0)
+			free_unref_page(page);
+		else
+			__free_pages_ok(page, order);
+	}
 }
 EXPORT_SYMBOL(page_frag_free);

Notice the mlx5 driver supports several RX-memory models, so it can be hard
to follow, but from the perf report output we can see that it uses
mlx5e_skb_from_cqe_linear, which uses build_skb.

--13.63%--mlx5e_skb_from_cqe_linear
   |
    --5.02%--build_skb
  |
   --1.85%--__build_skb
 |
  --1.00%--kmem_cache_alloc

/* build_skb() is wrapper over __build_skb(), that specifically
  * takes care of skb->head and skb->pfmemalloc
  * This means that if @frag_size is not zero, then @data must be backed
  * by a page fragment, not kmalloc() or vmalloc()
  */
struct sk_buff *build_skb(void *data, unsigned int frag_size)
{
struct sk_buff *skb = __build_skb(data, frag_size);

if (skb && frag_size) {
    skb->head_frag = 1;
    if (page_is_pfmemalloc(virt_to_head_page(data)))
    skb->pfmemalloc = 1;
}
return skb;
}
EXPORT_SYMBOL(build_skb);

It still doesn't prove that the @data is backed by an order-0 page.
For the mlx5 driver it uses mlx5e_page_alloc_mapped ->
page_pool_dev_alloc_pages(), and I can see perf report using
__page_pool_alloc_pages_slow().

The setup for page_pool in mlx5 uses order=0.

/* Create a page_pool and register it with rxq */
pp_params.order = 0;
pp_params.flags = 0; /* No-internal DMA mapping in page_pool */
pp_params.pool_size = pool_size;
pp_params.nid   = cpu_to_node(c->cpu);
pp_params.dev   = c->pdev;
pp_params.dma_dir   = rq->buff.map_dir;

/* page_pool can be used even when there is no rq->xdp_prog,
 * given page_pool does not handle DMA mapping there is no
 * required state to clear. And page_pool gracefully handle
 * elevated refcnt.
 */
rq->page_pool = page_pool_create(&pp_params);
if (IS_ERR(rq->page_pool)) {

[PATCH RFC net-next] net: vlan: add support for tunnel offload

2018-11-02 Thread Davide Caratti
GSO tunneled packets are always segmented in software before they are
transmitted by a VLAN, even when the lower device can offload tunnel
encapsulation and VLAN together (i.e., some bits in NETIF_F_GSO_ENCAP_ALL
mask are set in the lower device 'vlan_features'). If we let VLANs have
the same tunnel offload capabilities as their lower device, throughput
can improve significantly when CPU is limited on the transmitter side.

 - set NETIF_F_GSO_ENCAP_ALL bits in the VLAN 'hw_features', to ensure
 that 'features' will have those bits zeroed only when the lower device
 has no hardware support for tunnel encapsulation.
 - for the same reason, copy GSO-related bits of 'hw_enc_features' from
 lower device to VLAN, and ensure to update that value when the lower
 device changes its features.
 - set NETIF_F_HW_CSUM bit in the VLAN 'hw_enc_features' if 'real_dev'
 is able to compute checksums at least for a kind of packets, like done
 with commit 8403debeead8 ("vlan: Keep NETIF_F_HW_CSUM similar to other
 software devices"). This avoids software segmentation due to mismatching
 checksum capabilities between VLAN's 'features' and 'hw_enc_features'.

Reported-by: Flavio Leitner 
Signed-off-by: Davide Caratti 
---
 net/8021q/vlan.c |  1 +
 net/8021q/vlan.h | 12 
 net/8021q/vlan_dev.c |  2 ++
 3 files changed, 15 insertions(+)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 5e9950453955..1b7a375c6616 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -330,6 +330,7 @@ static void vlan_transfer_features(struct net_device *dev,
 
vlandev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
	vlandev->priv_flags |= (vlan->real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
+   vlandev->hw_enc_features = vlan_tnl_features(vlan->real_dev);
 
netdev_update_features(vlandev);
 }
diff --git a/net/8021q/vlan.h b/net/8021q/vlan.h
index 44df1c3df02d..c46daf09a501 100644
--- a/net/8021q/vlan.h
+++ b/net/8021q/vlan.h
@@ -92,6 +92,18 @@ static inline struct net_device *vlan_find_dev(struct net_device *real_dev,
return NULL;
 }
 
+static inline netdev_features_t vlan_tnl_features(struct net_device *real_dev)
+{
+   netdev_features_t ret;
+
+   ret = real_dev->hw_enc_features &
+ (NETIF_F_CSUM_MASK | NETIF_F_ALL_TSO | NETIF_F_GSO_ENCAP_ALL);
+
+   if ((ret & NETIF_F_GSO_ENCAP_ALL) && (ret & NETIF_F_CSUM_MASK))
+   return (ret & ~NETIF_F_CSUM_MASK) | NETIF_F_HW_CSUM;
+   return 0;
+}
+
 #define vlan_group_for_each_dev(grp, i, dev) \
for ((i) = 0; i < VLAN_PROTO_NUM * VLAN_N_VID; i++) \
if (((dev) = __vlan_group_get_device((grp), (i) / VLAN_N_VID, \
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index ff720f1ebf73..b2d9c8f27cd7 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -562,6 +562,7 @@ static int vlan_dev_init(struct net_device *dev)
 
dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
   NETIF_F_FRAGLIST | NETIF_F_GSO_SOFTWARE |
+  NETIF_F_GSO_ENCAP_ALL |
   NETIF_F_HIGHDMA | NETIF_F_SCTP_CRC |
   NETIF_F_ALL_FCOE;
 
@@ -572,6 +573,7 @@ static int vlan_dev_init(struct net_device *dev)
		netdev_warn(real_dev, "VLAN features are set incorrectly.  Q-in-Q configurations may not work correctly.\n");
 
dev->vlan_features = real_dev->vlan_features & ~NETIF_F_ALL_FCOE;
+   dev->hw_enc_features = vlan_tnl_features(real_dev);
 
/* ipv6 shared card related stuff */
dev->dev_id = real_dev->dev_id;
-- 
2.17.1



Re: Help with the BPF verifier

2018-11-02 Thread Yonghong Song


On 11/2/18 8:42 AM, Edward Cree wrote:
> On 02/11/18 15:02, Arnaldo Carvalho de Melo wrote:
>> Yeah, didn't work as well:
> 
>> And the -vv in 'perf trace' didn't seem to map to further details in the
>> output of the verifier debug:
> Yeah for log_level 2 you probably need to make source-level changes to either
>   perf or libbpf (I think the latter).  It's annoying that essentially no tools
>   plumb through an option for that, someone should fix them ;-)
> 
>> libbpf: -- BEGIN DUMP LOG ---
>> libbpf:
>> 0: (bf) r6 = r1
>> 1: (bf) r1 = r10
>> 2: (07) r1 += -328
>> 3: (b7) r7 = 64
>> 4: (b7) r2 = 64
>> 5: (bf) r3 = r6
>> 6: (85) call bpf_probe_read#4
>> 7: (79) r1 = *(u64 *)(r10 -320)
>> 8: (15) if r1 == 0x101 goto pc+4
>>   R0=inv(id=0) R1=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv64 R10=fp0,call_-1
>> 9: (55) if r1 != 0x2 goto pc+22
>>   R0=inv(id=0) R1=inv2 R6=ctx(id=0,off=0,imm=0) R7=inv64 R10=fp0,call_-1
>> 10: (bf) r1 = r6
>> 11: (07) r1 += 16
>> 12: (05) goto pc+2
>> 15: (79) r3 = *(u64 *)(r1 +0)
>> dereference of modified ctx ptr R1 off=16 disallowed
> Aha, we at least got a different error message this time.
> And indeed llvm has done that optimisation, rather than the more obvious
> 11: r3 = *(u64 *)(r1 +16)
>   because it wants to have lots of reads share a single insn.  You may be able
>   to defeat that optimisation by adding compiler barriers, idk.  Maybe someone
>   with llvm knowledge can figure out how to stop it (ideally, llvm would know
>   when it's generating for bpf backend and not do that).  -O0?  ¯\_(ツ)_/¯

The optimization looks roughly like the following:
br1:
  ...
  r1 += 16
  goto merge
br2:
  ...
  r1 += 20
  goto merge
merge:
  *(u64 *)(r1 + 0)

The compiler tries to merge common loads. There is no easy way to
stop this compiler optimization without turning off a lot of other
optimizations. The easiest way is to add barriers
__asm__ __volatile__("": : :"memory")
after the ctx memory access to prevent their downstream merging.

> Alternatively, your prog looks short enough that maybe you could kick the C
>   habit and write it directly in eBPF asm, that way no-one is optimising things
>   behind your back.  (I realise this option won't appeal to everyone ;-)

The LLVM supports BPF inline assembly as well. Some examples here
https://github.com/llvm-mirror/llvm/blob/master/test/CodeGen/BPF/inline_asm.ll
You may try it for selective ctx access to work around some
compiler optimizations. I personally have not used it yet and actually
not sure whether it actually works or not :-)

> The reason the verifier disallows this, iirc, is because it needs to be able
>   to rewrite the offsets on ctx accesses (see convert_ctx_accesses()) in case the
>   underlying kernel struct doesn't match the layout of the ctx ABI.  To do this
>   it needs the ctx offset to live entirely in the insn doing the access,
>   otherwise different paths could lead to the same insn accessing different ctx
>   offsets with different fixups needed — can't be done.
> 
> -Ed
> 


Re: [PATCH bpf] bpf: fix bpf_prog_get_info_by_fd to return 0 func_lens for unpriv

2018-11-02 Thread Alexei Starovoitov
On Fri, Nov 02, 2018 at 11:35:46AM +0100, Daniel Borkmann wrote:
> While dbecd7388476 ("bpf: get kernel symbol addresses via syscall")
> zeroed info.nr_jited_ksyms in bpf_prog_get_info_by_fd() for queries
> from unprivileged users, commit 815581c11cc2 ("bpf: get JITed image
> lengths of functions via syscall") forgot about doing so and therefore
> returns the #elems of the user set up buffer which is incorrect. It
> also needs to indicate a info.nr_jited_func_lens of zero.
> 
> Fixes: 815581c11cc2 ("bpf: get JITed image lengths of functions via syscall")
> Signed-off-by: Daniel Borkmann 
> Cc: Sandipan Das 
> Cc: Song Liu 

Applied, thanks.



Re: [PATCH v2 bpf 0/3] show more accurrate bpf program address

2018-11-02 Thread Daniel Borkmann
On 11/02/2018 06:16 PM, Song Liu wrote:
> Changes v1 -> v2:
> 1. Added main program length to bpf_prog_info->jited_fun_lens (3/3).
> 2. Updated commit message of 1/3 and 2/3 with more background about the
>    address masking, and why it is still safe after the changes.
> 3. Replace "ulong" with "unsigned long".
> 
> This set improves bpf program address showed in /proc/kallsyms and in
> bpf_prog_info. First, real program address is showed instead of page
> address. Second, when there is no subprogram, bpf_prog_info->jited_ksyms
> and bpf_prog_info->jited_fun_lens returns the main prog address and
> length.
> 
> Song Liu (3):
>   bpf: show real jited prog address in /proc/kallsyms
>   bpf: show real jited address in bpf_prog_info->jited_ksyms
>   bpf: show main program address and length in bpf_prog_info
> 
>  kernel/bpf/core.c|  4 +---
>  kernel/bpf/syscall.c | 34 --
>  2 files changed, 25 insertions(+), 13 deletions(-)
> 
> --
> 2.17.1
> 

Applied to bpf, thanks!


[PATCH net v7] net/ipv6: Add anycast addresses to a global hashtable

2018-11-02 Thread Jeff Barnhill
icmp6_send() function is expensive on systems with a large number of
interfaces. Every time it’s called, it has to verify that the source
address does not correspond to an existing anycast address by looping
through every device and every anycast address on the device.  This can
result in significant delays for a CPU when there are a large number of
neighbors and ND timers are frequently timing out and calling
neigh_invalidate().

Add anycast addresses to a global hashtable to allow quick searching for
matching anycast addresses.  This is based on inet6_addr_lst in addrconf.c.

Signed-off-by: Jeff Barnhill <0xeff...@gmail.com>
---
 include/net/addrconf.h |  2 ++
 include/net/if_inet6.h |  2 ++
 net/ipv6/af_inet6.c|  5 
 net/ipv6/anycast.c | 80 +++---
 4 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 14b789a123e7..1656c5978498 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -317,6 +317,8 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 const struct in6_addr *addr);
 bool ipv6_chk_acast_addr_src(struct net *net, struct net_device *dev,
 const struct in6_addr *addr);
+int ipv6_anycast_init(void);
+void ipv6_anycast_cleanup(void);
 
 /* Device notifier */
 int register_inet6addr_notifier(struct notifier_block *nb);
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index d7578cf49c3a..c9c78c15bce0 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -146,10 +146,12 @@ struct ifacaddr6 {
struct in6_addr aca_addr;
struct fib6_info*aca_rt;
struct ifacaddr6*aca_next;
+   struct hlist_node   aca_addr_lst;
int aca_users;
refcount_t  aca_refcnt;
unsigned long   aca_cstamp;
unsigned long   aca_tstamp;
+   struct rcu_head rcu;
 };
 
#define	IFA_HOST	IPV6_ADDR_LOOPBACK
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 3f4d61017a69..f0cd291034f0 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1001,6 +1001,9 @@ static int __init inet6_init(void)
err = ip6_flowlabel_init();
if (err)
goto ip6_flowlabel_fail;
+   err = ipv6_anycast_init();
+   if (err)
+   goto ipv6_anycast_fail;
err = addrconf_init();
if (err)
goto addrconf_fail;
@@ -1091,6 +1094,8 @@ static int __init inet6_init(void)
 ipv6_exthdrs_fail:
addrconf_cleanup();
 addrconf_fail:
+   ipv6_anycast_cleanup();
+ipv6_anycast_fail:
ip6_flowlabel_cleanup();
 ip6_flowlabel_fail:
ndisc_late_cleanup();
diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 4e0ff7031edd..7698637cf827 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -44,8 +44,22 @@
 
 #include 
 
+#define IN6_ADDR_HSIZE_SHIFT   8
+#define IN6_ADDR_HSIZE BIT(IN6_ADDR_HSIZE_SHIFT)
+/* anycast address hash table
+ */
+static struct hlist_head inet6_acaddr_lst[IN6_ADDR_HSIZE];
+static DEFINE_SPINLOCK(acaddr_hash_lock);
+
 static int ipv6_dev_ac_dec(struct net_device *dev, const struct in6_addr *addr);
 
+static u32 inet6_acaddr_hash(struct net *net, const struct in6_addr *addr)
+{
+   u32 val = ipv6_addr_hash(addr) ^ net_hash_mix(net);
+
+   return hash_32(val, IN6_ADDR_HSIZE_SHIFT);
+}
+
 /*
  * socket join an anycast group
  */
@@ -204,16 +218,39 @@ void ipv6_sock_ac_close(struct sock *sk)
rtnl_unlock();
 }
 
+static void ipv6_add_acaddr_hash(struct net *net, struct ifacaddr6 *aca)
+{
+   unsigned int hash = inet6_acaddr_hash(net, &aca->aca_addr);
+
+   spin_lock(&acaddr_hash_lock);
+   hlist_add_head_rcu(&aca->aca_addr_lst, &inet6_acaddr_lst[hash]);
+   spin_unlock(&acaddr_hash_lock);
+}
+
+static void ipv6_del_acaddr_hash(struct ifacaddr6 *aca)
+{
+   spin_lock(&acaddr_hash_lock);
+   hlist_del_init_rcu(&aca->aca_addr_lst);
+   spin_unlock(&acaddr_hash_lock);
+}
+
 static void aca_get(struct ifacaddr6 *aca)
 {
refcount_inc(&aca->aca_refcnt);
 }
 
+static void aca_free_rcu(struct rcu_head *h)
+{
+   struct ifacaddr6 *aca = container_of(h, struct ifacaddr6, rcu);
+
+   fib6_info_release(aca->aca_rt);
+   kfree(aca);
+}
+
 static void aca_put(struct ifacaddr6 *ac)
 {
if (refcount_dec_and_test(&ac->aca_refcnt)) {
-   fib6_info_release(ac->aca_rt);
-   kfree(ac);
+   call_rcu(&ac->rcu, aca_free_rcu);
}
 }
 
@@ -229,6 +266,7 @@ static struct ifacaddr6 *aca_alloc(struct fib6_info *f6i,
aca->aca_addr = *addr;
fib6_info_hold(f6i);
aca->aca_rt = f6i;
+   INIT_HLIST_NODE(&aca->aca_addr_lst);
aca->aca_users = 1;
/* aca_tstamp should be updated upon changes */
aca->aca_cstamp = aca->aca_ts

[PATCH net] mlxsw: spectrum: Fix IP2ME CPU policer configuration

2018-11-02 Thread Ido Schimmel
From: Shalom Toledo 

The CPU policer used to police packets being trapped via a local route
(IP2ME) was incorrectly configured to police based on bytes per second
instead of packets per second.

Change the policer to police based on packets per second and avoid
packet loss under certain circumstances.

Fixes: 9148e7cf73ce ("mlxsw: spectrum: Add policers for trap groups")
Signed-off-by: Shalom Toledo 
Signed-off-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index a2df12b79f8e..9bec940330a4 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -3568,7 +3568,6 @@ static int mlxsw_sp_cpu_policers_set(struct mlxsw_core *mlxsw_core)
burst_size = 7;
break;
case MLXSW_REG_HTGT_TRAP_GROUP_SP_IP2ME:
-   is_bytes = true;
rate = 4 * 1024;
burst_size = 4;
break;
-- 
2.17.2



[PATCH net-next v4 2/9] net: ensure unbound stream socket to be chosen when not in a VRF

2018-11-02 Thread Mike Manning
The commit a04a480d4392 ("net: Require exact match for TCP socket
lookups if dif is l3mdev") only ensures that the correct socket is
selected for packets in a VRF. However, there is no guarantee that
the unbound socket will be selected for packets when not in a VRF.
By checking for a device match in compute_score() also for the case
when there is no bound device and attaching a score to this, the
unbound socket is selected. And if a failure is returned when there
is no device match, this ensures that bound sockets are never selected,
even if there is no unbound socket.

Signed-off-by: Mike Manning 
---
 include/net/inet_hashtables.h | 11 +++
 include/net/inet_sock.h   |  8 
 net/ipv4/inet_hashtables.c| 14 ++
 net/ipv6/inet6_hashtables.c   | 14 ++
 4 files changed, 31 insertions(+), 16 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 4ae060b4bac2..5de2d9f24c05 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -189,6 +189,17 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
hashinfo->ehash_locks = NULL;
 }
 
+static inline bool inet_sk_bound_dev_eq(struct net *net, int bound_dev_if,
+   int dif, int sdif)
+{
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+   return inet_bound_dev_eq(net->ipv4.sysctl_tcp_l3mdev_accept,
+   bound_dev_if, dif, sdif);
+#else
+   return inet_bound_dev_eq(1, bound_dev_if, dif, sdif);
+#endif
+}
+
 struct inet_bind_bucket *
 inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
struct inet_bind_hashbucket *head,
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index ed3f723af00b..e8eef85006aa 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -143,6 +143,14 @@ static inline int inet_sk_bound_l3mdev(const struct sock *sk)
return 0;
 }
 
+static inline bool inet_bound_dev_eq(bool l3mdev_accept, int bound_dev_if,
+int dif, int sdif)
+{
+   if (!bound_dev_if)
+   return !sdif || l3mdev_accept;
+   return bound_dev_if == dif || bound_dev_if == sdif;
+}
+
 struct inet_cork {
unsigned intflags;
__be32  addr;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 40d722ab1738..13890d5bfc34 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -235,6 +235,7 @@ static inline int compute_score(struct sock *sk, struct net *net,
 {
int score = -1;
struct inet_sock *inet = inet_sk(sk);
+   bool dev_match;
 
if (net_eq(sock_net(sk), net) && inet->inet_num == hnum &&
!ipv6_only_sock(sk)) {
@@ -245,15 +246,12 @@ static inline int compute_score(struct sock *sk, struct net *net,
return -1;
score += 4;
}
-   if (sk->sk_bound_dev_if || exact_dif) {
-   bool dev_match = (sk->sk_bound_dev_if == dif ||
- sk->sk_bound_dev_if == sdif);
+   dev_match = inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
+dif, sdif);
+   if (!dev_match)
+   return -1;
+   score += 4;
 
-   if (!dev_match)
-   return -1;
-   if (sk->sk_bound_dev_if)
-   score += 4;
-   }
if (sk->sk_incoming_cpu == raw_smp_processor_id())
score++;
}
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 3d7c7460a0c5..5eeeba7181a1 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -99,6 +99,7 @@ static inline int compute_score(struct sock *sk, struct net *net,
const int dif, const int sdif, bool exact_dif)
 {
int score = -1;
+   bool dev_match;
 
if (net_eq(sock_net(sk), net) && inet_sk(sk)->inet_num == hnum &&
sk->sk_family == PF_INET6) {
@@ -109,15 +110,12 @@ static inline int compute_score(struct sock *sk, struct net *net,
return -1;
score++;
}
-   if (sk->sk_bound_dev_if || exact_dif) {
-   bool dev_match = (sk->sk_bound_dev_if == dif ||
- sk->sk_bound_dev_if == sdif);
+   dev_match = inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
+dif, sdif);
+   if (!dev_match)
+   return -1;
+   score++;
 
-   if (!dev_match)
-   return -1;
- 

[PATCH net-next v4 0/9] vrf: allow simultaneous service instances in default and other VRFs

2018-11-02 Thread Mike Manning
Services currently have to be VRF-aware if they are using an unbound
socket. One cannot have multiple service instances running in the
default and other VRFs for services that are not VRF-aware and listen
on an unbound socket. This is because there is no easy way of isolating
packets received in the default VRF from those arriving in other VRFs.

This series provides this isolation for stream sockets subject to the
existing kernel parameter net.ipv4.tcp_l3mdev_accept not being set,
given that this is documented as allowing a single service instance to
work across all VRF domains. Similarly, net.ipv4.udp_l3mdev_accept is
checked for datagram sockets, and net.ipv4.raw_l3mdev_accept is
introduced for raw sockets. The functionality applies to UDP & TCP
services as well as those using raw sockets, and is for IPv4 and IPv6.

Example of running ssh instances in default and blue VRF:

$ /usr/sbin/sshd -D
$ ip vrf exec vrf-blue /usr/sbin/sshd
$ ss -ta | egrep 'State|ssh'
State   Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
LISTEN  0128   0.0.0.0%vrf-blue:ssh 0.0.0.0:*
LISTEN  01280.0.0.0:ssh 0.0.0.0:*
ESTAB   00  192.168.122.220:ssh   192.168.122.1:50282
LISTEN  0128  [::]%vrf-blue:ssh[::]:*
LISTEN  0128   [::]:ssh[::]:*
ESTAB   00   [3000::2]%vrf-blue:ssh   [3000::9]:45896
ESTAB   00[2000::2]:ssh   [2000::9]:46398

v1:
   - Address Paolo Abeni's comments (patch 4/5)
   - Fix build when CONFIG_NET_L3_MASTER_DEV not defined (patch 1/5)
v2:
   - Address David Ahern's comments (patches 4/5 and 5/5)
   - Remove patches 3/5 and 5/5 from series for individual submissions
   - Include a sysctl for raw sockets as recommended by David Ahern
   - Expand series into 10 patches and provide improved descriptions
v3:
   - Update description for patch 1/10 and remove patch 6/10
v4:
   - Set default to enabled for raw socket sysctl as recommended by David Ahern

Dewi Morgan (1):
  ipv6: do not drop vrf udp multicast packets

Duncan Eastoe (1):
  net: fix raw socket lookup device bind matching with VRFs

Mike Manning (6):
  net: ensure unbound stream socket to be chosen when not in a VRF
  net: ensure unbound datagram socket to be chosen when not in a VRF
  net: provide a sysctl raw_l3mdev_accept for raw socket lookup with
VRFs
  vrf: mark skb for multicast or link-local as enslaved to VRF
  ipv6: allow ping to link-local address in VRF
  ipv6: handling of multicast packets received in VRF

Robert Shearman (1):
  net: allow binding socket in a VRF when there's an unbound socket

 Documentation/networking/ip-sysctl.txt | 12 
 Documentation/networking/vrf.txt   | 22 +
 drivers/net/vrf.c  | 19 +-
 include/net/inet6_hashtables.h |  5 ++---
 include/net/inet_hashtables.h  | 24 ---
 include/net/inet_sock.h| 21 
 include/net/netns/ipv4.h   |  3 +++
 include/net/raw.h  | 13 +
 include/net/udp.h  | 11 +++
 net/core/sock.c|  2 ++
 net/ipv4/af_inet.c |  2 ++
 net/ipv4/inet_connection_sock.c| 13 ++---
 net/ipv4/inet_hashtables.c | 34 -
 net/ipv4/raw.c | 19 ++
 net/ipv4/sysctl_net_ipv4.c | 11 +++
 net/ipv4/udp.c | 15 ++-
 net/ipv6/datagram.c|  5 -
 net/ipv6/inet6_hashtables.c| 14 ++
 net/ipv6/ip6_input.c   | 35 +++---
 net/ipv6/ipv6_sockglue.c   |  2 +-
 net/ipv6/raw.c |  5 ++---
 net/ipv6/udp.c | 22 ++---
 22 files changed, 228 insertions(+), 81 deletions(-)

-- 
2.11.0



[PATCH net-next v4 6/9] vrf: mark skb for multicast or link-local as enslaved to VRF

2018-11-02 Thread Mike Manning
The skbs for packets that are multicast or to a link-local address are
not marked as being enslaved to a VRF if they are received on a socket
bound to the VRF. This is needed for ND and it is preferable for the
kernel not to have to deal with the additional use-cases if ll or mcast
packets are handled as enslaved. However, this does not allow service
instances listening on unbound and bound to VRF sockets to distinguish
the VRF used, if packets are sent as multicast or to a link-local
address. The fix is for the VRF driver to also mark these skb as being
enslaved to the VRF.

Signed-off-by: Mike Manning 
---
 drivers/net/vrf.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 69b7227c637e..21ad4b1d7f03 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -981,24 +981,23 @@ static struct sk_buff *vrf_ip6_rcv(struct net_device *vrf_dev,
   struct sk_buff *skb)
 {
int orig_iif = skb->skb_iif;
-   bool need_strict;
+   bool need_strict = rt6_need_strict(&ipv6_hdr(skb)->daddr);
+   bool is_ndisc = ipv6_ndisc_frame(skb);
 
-   /* loopback traffic; do not push through packet taps again.
-* Reset pkt_type for upper layers to process skb
+   /* loopback, multicast & non-ND link-local traffic; do not push through
+* packet taps again. Reset pkt_type for upper layers to process skb
 */
-   if (skb->pkt_type == PACKET_LOOPBACK) {
+   if (skb->pkt_type == PACKET_LOOPBACK || (need_strict && !is_ndisc)) {
skb->dev = vrf_dev;
skb->skb_iif = vrf_dev->ifindex;
IP6CB(skb)->flags |= IP6SKB_L3SLAVE;
-   skb->pkt_type = PACKET_HOST;
+   if (skb->pkt_type == PACKET_LOOPBACK)
+   skb->pkt_type = PACKET_HOST;
goto out;
}
 
-   /* if packet is NDISC or addressed to multicast or link-local
-* then keep the ingress interface
-*/
-   need_strict = rt6_need_strict(&ipv6_hdr(skb)->daddr);
-   if (!ipv6_ndisc_frame(skb) && !need_strict) {
+   /* if packet is NDISC then keep the ingress interface */
+   if (!is_ndisc) {
vrf_rx_stats(vrf_dev, skb->len);
skb->dev = vrf_dev;
skb->skb_iif = vrf_dev->ifindex;
-- 
2.11.0



[PATCH net-next v4 5/9] net: fix raw socket lookup device bind matching with VRFs

2018-11-02 Thread Mike Manning
From: Duncan Eastoe 

When there exists a pair of raw sockets, one unbound and one bound
to a VRF but equal in all other respects, when a packet is received
in the VRF context, __raw_v4_lookup() matches on both sockets.

This results in the packet being delivered over both sockets,
instead of only the raw socket bound to the VRF. The bound device
checks in __raw_v4_lookup() are replaced with a call to
raw_sk_bound_dev_eq() which correctly handles whether the packet
should be delivered over the unbound socket in such cases.

In __raw_v6_lookup() the match on the device binding of the socket is
similarly updated to use raw_sk_bound_dev_eq() which matches the
handling in __raw_v4_lookup().

Importantly raw_sk_bound_dev_eq() takes the raw_l3mdev_accept sysctl
into account.

Signed-off-by: Duncan Eastoe 
Signed-off-by: Mike Manning 
---
 include/net/raw.h | 12 
 net/ipv4/raw.c|  3 +--
 net/ipv6/raw.c|  5 ++---
 3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/net/raw.h b/include/net/raw.h
index 20ebf0b3dfa8..6ed2ae5b4a80 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -18,6 +18,7 @@
 #define _RAW_H
 
 
+#include 
 #include 
 #include 
 
@@ -75,4 +76,15 @@ static inline struct raw_sock *raw_sk(const struct sock *sk)
return (struct raw_sock *)sk;
 }
 
+static inline bool raw_sk_bound_dev_eq(struct net *net, int bound_dev_if,
+  int dif, int sdif)
+{
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+   return inet_bound_dev_eq(net->ipv4.sysctl_raw_l3mdev_accept,
+bound_dev_if, dif, sdif);
+#else
+   return inet_bound_dev_eq(1, bound_dev_if, dif, sdif);
+#endif
+}
+
 #endif /* _RAW_H */
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index da453c7dfb75..d42cdd018987 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -131,8 +131,7 @@ struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
if (net_eq(sock_net(sk), net) && inet->inet_num == num  &&
!(inet->inet_daddr && inet->inet_daddr != raddr)&&
!(inet->inet_rcv_saddr && inet->inet_rcv_saddr != laddr) &&
-   !(sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif &&
- sk->sk_bound_dev_if != sdif))
+   raw_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif))
goto found; /* gotcha */
}
sk = NULL;
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 5e0efd3954e9..aed7eb5c2123 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -86,9 +86,8 @@ struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
!ipv6_addr_equal(&sk->sk_v6_daddr, rmt_addr))
continue;
 
-   if (sk->sk_bound_dev_if &&
-   sk->sk_bound_dev_if != dif &&
-   sk->sk_bound_dev_if != sdif)
+   if (!raw_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
+dif, sdif))
continue;
 
if (!ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
-- 
2.11.0



[PATCH net-next v4 3/9] net: ensure unbound datagram socket to be chosen when not in a VRF

2018-11-02 Thread Mike Manning
Ensure an unbound datagram skt is chosen when not in a VRF. The check
for a device match in compute_score() for UDP must also be made when
there is no bound device, with a failure returned when there is no
device match. This ensures that bound sockets are never selected,
even if there is no unbound socket.

Allow IPv6 packets to be sent over a datagram skt bound to a VRF. These
packets are currently blocked, as flowi6_oif was set to that of the
master vrf device, and the ipi6_ifindex is that of the slave device.
Allow these packets to be sent by checking that the device with
ipi6_ifindex has the same L3 scope as the bound device of the skt, which is
the master vrf device. Note that this check always succeeds if the skt
is unbound.

Even though the right datagram skt is now selected by compute_score(),
a different skt is being returned that is bound to the wrong vrf. The
difference between these and stream sockets is the handling of the skt
option for SO_REUSEPORT. While the handling when adding a skt for reuse
correctly checks that the bound device of the skt is a match, the skts
in the hashslot are already incorrect. So for the same hash, a skt for
the wrong vrf may be selected for the required port. The root cause is
that the skt is immediately placed into a slot when it is created,
but when the skt is then bound using SO_BINDTODEVICE, it remains in the
same slot. The solution is to move the skt to the correct slot by
forcing a rehash.

Signed-off-by: Mike Manning 
---
 include/net/udp.h   | 11 +++
 net/core/sock.c |  2 ++
 net/ipv4/udp.c  | 15 ++-
 net/ipv6/datagram.c |  5 -
 net/ipv6/udp.c  | 14 +-
 5 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 9e82cb391dea..057972d0eea5 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -252,6 +252,17 @@ static inline int udp_rqueue_get(struct sock *sk)
return sk_rmem_alloc_get(sk) - READ_ONCE(udp_sk(sk)->forward_deficit);
 }
 
+static inline bool udp_sk_bound_dev_eq(struct net *net, int bound_dev_if,
+  int dif, int sdif)
+{
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+   return inet_bound_dev_eq(net->ipv4.sysctl_udp_l3mdev_accept,
+bound_dev_if, dif, sdif);
+#else
+   return inet_bound_dev_eq(1, bound_dev_if, dif, sdif);
+#endif
+}
+
 /* net/ipv4/udp.c */
 void udp_destruct_sock(struct sock *sk);
 void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
diff --git a/net/core/sock.c b/net/core/sock.c
index 6fcc4bc07d19..6eda848192aa 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -567,6 +567,8 @@ static int sock_setbindtodevice(struct sock *sk, char __user *optval,
 
lock_sock(sk);
sk->sk_bound_dev_if = index;
+   if (sk->sk_prot->rehash)
+   sk->sk_prot->rehash(sk);
sk_dst_reset(sk);
release_sock(sk);
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1976fddb9e00..cf73c9194bb6 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -371,6 +371,7 @@ static int compute_score(struct sock *sk, struct net *net,
 {
int score;
struct inet_sock *inet;
+   bool dev_match;
 
if (!net_eq(sock_net(sk), net) ||
udp_sk(sk)->udp_port_hash != hnum ||
@@ -398,15 +399,11 @@ static int compute_score(struct sock *sk, struct net *net,
score += 4;
}
 
-   if (sk->sk_bound_dev_if || exact_dif) {
-   bool dev_match = (sk->sk_bound_dev_if == dif ||
- sk->sk_bound_dev_if == sdif);
-
-   if (!dev_match)
-   return -1;
-   if (sk->sk_bound_dev_if)
-   score += 4;
-   }
+   dev_match = udp_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
+   dif, sdif);
+   if (!dev_match)
+   return -1;
+   score += 4;
 
if (sk->sk_incoming_cpu == raw_smp_processor_id())
score++;
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 1ede7a16a0be..4813293d4fad 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -782,7 +782,10 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
 
if (src_info->ipi6_ifindex) {
if (fl6->flowi6_oif &&
-   src_info->ipi6_ifindex != fl6->flowi6_oif)
+   src_info->ipi6_ifindex != fl6->flowi6_oif &&
+   (sk->sk_bound_dev_if != fl6->flowi6_oif ||
+!sk_dev_equal_l3scope(
+sk, src_info->ipi6_ifindex)))
return -EINVAL;
fl6->flowi6_oif = src_info->ipi6_ifindex;
}
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index d2d9

[PATCH net-next v4 7/9] ipv6: allow ping to link-local address in VRF

2018-11-02 Thread Mike Manning
If link-local packets are marked as enslaved to a VRF, then to allow
ping to the link-local from a vrf, the error handling for IPV6_PKTINFO
needs to be relaxed to also allow the pkt ipi6_ifindex to be that of a
slave device to the vrf.

Note that the real device also needs to be retrieved in icmp6_iif()
to set the ipv6 flow oif to this for icmp echo reply handling. The
recent commit 24b711edfc34 ("net/ipv6: Fix linklocal to global address
with VRF") takes care of this, so the sdif does not need checking here.

This fix makes ping to link-local consistent with that to global
addresses, in that this can now be done from within the same VRF that
the address is in.

Signed-off-by: Mike Manning 
---
 net/ipv6/ipv6_sockglue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 381ce38940ae..973e215c3114 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -486,7 +486,7 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, 
int optname,
retv = -EFAULT;
break;
}
-   if (sk->sk_bound_dev_if && pkt.ipi6_ifindex != sk->sk_bound_dev_if)
+   if (!sk_dev_equal_l3scope(sk, pkt.ipi6_ifindex))
goto e_inval;
 
np->sticky_pktinfo.ipi6_ifindex = pkt.ipi6_ifindex;
-- 
2.11.0



[PATCH net-next v4 8/9] ipv6: handling of multicast packets received in VRF

2018-11-02 Thread Mike Manning
If skbs for multicast packets marked as enslaved to a VRF are
received, then the secondary device index should be used to obtain
the real device, and the multicast address verified against the
enslaved rather than the l3mdev device.

Signed-off-by: Dewi Morgan 
Signed-off-by: Mike Manning 
---
 net/ipv6/ip6_input.c | 35 ---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 96577e742afd..df58e1100226 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -359,6 +359,8 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk
}
} else if (ipprot->flags & INET6_PROTO_FINAL) {
const struct ipv6hdr *hdr;
+   int sdif = inet6_sdif(skb);
+   struct net_device *dev;
 
/* Only do this once for first final protocol */
have_final = true;
@@ -371,9 +373,19 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk
skb_postpull_rcsum(skb, skb_network_header(skb),
   skb_network_header_len(skb));
hdr = ipv6_hdr(skb);
+
+   /* skb->dev passed may be master dev for vrfs. */
+   if (sdif) {
+   dev = dev_get_by_index_rcu(net, sdif);
+   if (!dev)
+   goto discard;
+   } else {
+   dev = skb->dev;
+   }
+
if (ipv6_addr_is_multicast(&hdr->daddr) &&
-   !ipv6_chk_mcast_addr(skb->dev, &hdr->daddr,
-   &hdr->saddr) &&
+   !ipv6_chk_mcast_addr(dev, &hdr->daddr,
+&hdr->saddr) &&
!ipv6_is_mld(skb, nexthdr, 
skb_network_header_len(skb)))
goto discard;
}
@@ -432,15 +444,32 @@ EXPORT_SYMBOL_GPL(ip6_input);
 
 int ip6_mc_input(struct sk_buff *skb)
 {
+   int sdif = inet6_sdif(skb);
const struct ipv6hdr *hdr;
+   struct net_device *dev;
bool deliver;
 
__IP6_UPD_PO_STATS(dev_net(skb_dst(skb)->dev),
 __in6_dev_get_safely(skb->dev), IPSTATS_MIB_INMCAST,
 skb->len);
 
+   /* skb->dev passed may be master dev for vrfs. */
+   if (sdif) {
+   rcu_read_lock();
+   dev = dev_get_by_index_rcu(dev_net(skb->dev), sdif);
+   if (!dev) {
+   rcu_read_unlock();
+   kfree_skb(skb);
+   return -ENODEV;
+   }
+   } else {
+   dev = skb->dev;
+   }
+
hdr = ipv6_hdr(skb);
-   deliver = ipv6_chk_mcast_addr(skb->dev, &hdr->daddr, NULL);
+   deliver = ipv6_chk_mcast_addr(dev, &hdr->daddr, NULL);
+   if (sdif)
+   rcu_read_unlock();
 
 #ifdef CONFIG_IPV6_MROUTE
/*
-- 
2.11.0



[PATCH net-next v4 4/9] net: provide a sysctl raw_l3mdev_accept for raw socket lookup with VRFs

2018-11-02 Thread Mike Manning
Add a sysctl raw_l3mdev_accept to control raw socket lookup in a manner
similar to use of tcp_l3mdev_accept for stream and of udp_l3mdev_accept
for datagram sockets. Have this default to enabled for reasons of
backwards compatibility. This is so as to specify the output device
with cmsg and IP_PKTINFO, but using a socket not bound to the
corresponding VRF. This allows e.g. older ping implementations to
specify the device without being executed in the VRF.
If the option is disabled, packets received in a VRF context are only
handled by a raw socket bound to the VRF, and correspondingly packets
in the default VRF are only handled by a socket not bound to any VRF.

Signed-off-by: Mike Manning 
---
 Documentation/networking/ip-sysctl.txt | 12 
 Documentation/networking/vrf.txt   | 13 +
 include/net/netns/ipv4.h   |  3 +++
 include/net/raw.h  |  1 +
 net/ipv4/af_inet.c |  2 ++
 net/ipv4/raw.c | 16 ++--
 net/ipv4/sysctl_net_ipv4.c | 11 +++
 7 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 32b21571adfe..aa9e6a331679 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -370,6 +370,7 @@ tcp_l3mdev_accept - BOOLEAN
derived from the listen socket to be bound to the L3 domain in
which the packets originated. Only valid when the kernel was
compiled with CONFIG_NET_L3_MASTER_DEV.
+Default: 0 (disabled)
 
 tcp_low_latency - BOOLEAN
This is a legacy option, it has no effect anymore.
@@ -773,6 +774,7 @@ udp_l3mdev_accept - BOOLEAN
being received regardless of the L3 domain in which they
originated. Only valid when the kernel was compiled with
CONFIG_NET_L3_MASTER_DEV.
+Default: 0 (disabled)
 
 udp_mem - vector of 3 INTEGERs: min, pressure, max
Number of pages allowed for queueing by all UDP sockets.
@@ -799,6 +801,16 @@ udp_wmem_min - INTEGER
total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
Default: 4K
 
+RAW variables:
+
+raw_l3mdev_accept - BOOLEAN
+   Enabling this option allows a "global" bound socket to work
+   across L3 master domains (e.g., VRFs) with packets capable of
+   being received regardless of the L3 domain in which they
+   originated. Only valid when the kernel was compiled with
+   CONFIG_NET_L3_MASTER_DEV.
+   Default: 1 (enabled)
+
 CIPSOv4 Variables:
 
 cipso_cache_enable - BOOLEAN
diff --git a/Documentation/networking/vrf.txt b/Documentation/networking/vrf.txt
index d4b129402d57..d234c9750c72 100644
--- a/Documentation/networking/vrf.txt
+++ b/Documentation/networking/vrf.txt
@@ -111,9 +111,22 @@ the same port if they bind to an l3mdev.
 TCP & UDP services running in the default VRF context (ie., not bound
 to any VRF device) can work across all VRF domains by enabling the
 tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:
+
 sysctl -w net.ipv4.tcp_l3mdev_accept=1
 sysctl -w net.ipv4.udp_l3mdev_accept=1
 
+These options are disabled by default so that a socket in a VRF is only
+selected for packets in that VRF. There is a similar option for RAW
+sockets, which is enabled by default for reasons of backwards compatibility.
+This is so as to specify the output device with cmsg and IP_PKTINFO, but
+using a socket not bound to the corresponding VRF. This allows e.g. older ping
+implementations to be run with specifying the device but without executing it
+in the VRF. This option can be disabled so that packets received in a VRF
+context are only handled by a raw socket bound to the VRF, and packets in the
+default VRF are only handled by a socket not bound to any VRF:
+
+sysctl -w net.ipv4.raw_l3mdev_accept=0
+
 netfilter rules on the VRF device can be used to limit access to services
 running in the default VRF context as well.
 
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index e47503b4e4d1..104a6669e344 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -103,6 +103,9 @@ struct netns_ipv4 {
/* Shall we try to damage output packets if routing dev changes? */
int sysctl_ip_dynaddr;
int sysctl_ip_early_demux;
+#ifdef CONFIG_NET_L3_MASTER_DEV
+   int sysctl_raw_l3mdev_accept;
+#endif
int sysctl_tcp_early_demux;
int sysctl_udp_early_demux;
 
diff --git a/include/net/raw.h b/include/net/raw.h
index 9c9fa98a91a4..20ebf0b3dfa8 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -61,6 +61,7 @@ void raw_seq_stop(struct seq_file *seq, void *v);
 
 int raw_hash_sk(struct sock *sk);
 void raw_unhash_sk(struct sock *sk);
+void raw_init(void);
 
 struct raw_sock {
/* inet_sock has to be the first member */
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
i

[PATCH net-next v4 9/9] ipv6: do not drop vrf udp multicast packets

2018-11-02 Thread Mike Manning
From: Dewi Morgan 

For bound udp sockets in a vrf, also check the sdif to get the index
for ingress devices enslaved to an l3mdev.

Signed-off-by: Dewi Morgan 
Signed-off-by: Mike Manning 
---
 net/ipv6/udp.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 0559adc2f357..a25571c12a8a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -637,7 +637,7 @@ static int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 static bool __udp_v6_is_mcast_sock(struct net *net, struct sock *sk,
   __be16 loc_port, const struct in6_addr *loc_addr,
   __be16 rmt_port, const struct in6_addr *rmt_addr,
-  int dif, unsigned short hnum)
+  int dif, int sdif, unsigned short hnum)
 {
struct inet_sock *inet = inet_sk(sk);
 
@@ -649,7 +649,7 @@ static bool __udp_v6_is_mcast_sock(struct net *net, struct sock *sk,
(inet->inet_dport && inet->inet_dport != rmt_port) ||
(!ipv6_addr_any(&sk->sk_v6_daddr) &&
!ipv6_addr_equal(&sk->sk_v6_daddr, rmt_addr)) ||
-   (sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif) ||
+   !udp_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif) ||
(!ipv6_addr_any(&sk->sk_v6_rcv_saddr) &&
!ipv6_addr_equal(&sk->sk_v6_rcv_saddr, loc_addr)))
return false;
@@ -683,6 +683,7 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
unsigned int offset = offsetof(typeof(*sk), sk_node);
unsigned int hash2 = 0, hash2_any = 0, use_hash2 = (hslot->count > 10);
int dif = inet6_iif(skb);
+   int sdif = inet6_sdif(skb);
struct hlist_node *node;
struct sk_buff *nskb;
 
@@ -697,7 +698,8 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 
sk_for_each_entry_offset_rcu(sk, node, &hslot->head, offset) {
if (!__udp_v6_is_mcast_sock(net, sk, uh->dest, daddr,
-   uh->source, saddr, dif, hnum))
+   uh->source, saddr, dif, sdif,
+   hnum))
continue;
/* If zero checksum and no_check is not on for
 * the socket then skip it.
-- 
2.11.0



[PATCH net-next v4 1/9] net: allow binding socket in a VRF when there's an unbound socket

2018-11-02 Thread Mike Manning
From: Robert Shearman 

Change the inet socket lookup to prevent packets arriving on a device
enslaved to an l3mdev from matching unbound sockets, by removing the
wildcard match for a zero sk_bound_dev_if and instead relying on the
check against the secondary device index, which will be 0 when the
input device is not enslaved to an l3mdev (and so match an unbound
socket) but non-zero, and so not match, when the input device is enslaved.

Change the socket binding to take the l3mdev into account, to allow an
unbound socket to not conflict with sockets bound to an l3mdev, given
the datapath isolation now guaranteed.

Signed-off-by: Robert Shearman 
Signed-off-by: Mike Manning 
---
 Documentation/networking/vrf.txt |  9 +
 include/net/inet6_hashtables.h   |  5 ++---
 include/net/inet_hashtables.h| 13 ++---
 include/net/inet_sock.h  | 13 +
 net/ipv4/inet_connection_sock.c  | 13 ++---
 net/ipv4/inet_hashtables.c   | 20 +++-
 6 files changed, 51 insertions(+), 22 deletions(-)

diff --git a/Documentation/networking/vrf.txt b/Documentation/networking/vrf.txt
index 8ff7b4c8f91b..d4b129402d57 100644
--- a/Documentation/networking/vrf.txt
+++ b/Documentation/networking/vrf.txt
@@ -103,6 +103,11 @@ VRF device:
 
 or to specify the output device using cmsg and IP_PKTINFO.
 
+By default the scope of the port bindings for unbound sockets is
+limited to the default VRF. That is, it will not be matched by packets
+arriving on interfaces enslaved to an l3mdev and processes may bind to
+the same port if they bind to an l3mdev.
+
 TCP & UDP services running in the default VRF context (ie., not bound
 to any VRF device) can work across all VRF domains by enabling the
 tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:
@@ -112,10 +117,6 @@ tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:
 netfilter rules on the VRF device can be used to limit access to services
 running in the default VRF context as well.
 
-The default VRF does not have limited scope with respect to port bindings.
-That is, if a process does a wildcard bind to a port in the default VRF it
-owns the port across all VRF domains within the network namespace.
-
 

 
 Using iproute2 for VRFs
diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 6e91e38a31da..9db98af46985 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -115,9 +115,8 @@ int inet6_hash(struct sock *sk);
 ((__sk)->sk_family == AF_INET6)&&  \
 ipv6_addr_equal(&(__sk)->sk_v6_daddr, (__saddr))   &&  \
 ipv6_addr_equal(&(__sk)->sk_v6_rcv_saddr, (__daddr))   &&  \
-(!(__sk)->sk_bound_dev_if  ||  \
-  ((__sk)->sk_bound_dev_if == (__dif)) ||  \
-  ((__sk)->sk_bound_dev_if == (__sdif)))   &&  \
+(((__sk)->sk_bound_dev_if == (__dif))  ||  \
+ ((__sk)->sk_bound_dev_if == (__sdif)))&&  \
 net_eq(sock_net(__sk), (__net)))
 
 #endif /* _INET6_HASHTABLES_H */
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 9141e95529e7..4ae060b4bac2 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -79,6 +79,7 @@ struct inet_ehash_bucket {
 
 struct inet_bind_bucket {
possible_net_t  ib_net;
+   int l3mdev;
unsigned short  port;
signed char fastreuse;
signed char fastreuseport;
@@ -191,7 +192,7 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
 struct inet_bind_bucket *
 inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
struct inet_bind_hashbucket *head,
-   const unsigned short snum);
+   const unsigned short snum, int l3mdev);
 void inet_bind_bucket_destroy(struct kmem_cache *cachep,
  struct inet_bind_bucket *tb);
 
@@ -282,9 +283,8 @@ static inline struct sock *inet_lookup_listener(struct net *net,
 #define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
(((__sk)->sk_portpair == (__ports)) &&  \
 ((__sk)->sk_addrpair == (__cookie))&&  \
-(!(__sk)->sk_bound_dev_if  ||  \
-  ((__sk)->sk_bound_dev_if == (__dif)) ||  \
-  ((__sk)->sk_bound_dev_if == (__sdif)))   &&  \
+(((__sk)->sk_bound_dev_if == (__dif))  ||  \
+ ((__sk)->sk_bound_dev_if == (__sdif)))&&  \
 net_eq(sock_net(__sk), (__net)))
 #else /* 32-bit arch */
 #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
@@ -294,9 +29

Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-02 Thread Paweł Staszewski




On 02.11.2018 at 15:20, Aaron Lu wrote:

On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote:

On Fri, 2 Nov 2018 13:23:56 +0800
Aaron Lu  wrote:


On Thu, Nov 01, 2018 at 08:23:19PM +, Saeed Mahameed wrote:

On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:

On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
wrote:
... ...

Section copied out:

   mlx5e_poll_tx_cq
   |
--16.34%--napi_consume_skb
  |
  |--12.65%--__free_pages_ok
  |  |
  |   --11.86%--free_one_page
  | |
  | |--10.10%--queued_spin_lock_slowpath
  | |
  |  --0.65%--_raw_spin_lock

This callchain looks like it is freeing pages of order higher than 0:
__free_pages_ok() is only called for pages whose order is bigger
than 0.

mlx5 RX uses only order-0 pages, so I don't know where these high-order
TX SKBs are coming from.

Perhaps here:
__netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and
__napi_alloc_frag() will all call page_frag_alloc(), which will use
__page_frag_cache_refill() to get an order 3 page if possible, or fall
back to an order 0 page if order 3 page is not available.

I'm not sure if your workload will use the above code path though.

TL;DR: this is order-0 pages (code-walkthrough proof below)

To Aaron, the network stack *can* call __free_pages_ok() with order-0
pages, via:

static void skb_free_head(struct sk_buff *skb)
{
unsigned char *head = skb->head;

if (skb->head_frag)
skb_free_frag(head);
else
kfree(head);
}

static inline void skb_free_frag(void *addr)
{
page_frag_free(addr);
}

/*
  * Frees a page fragment allocated out of either a compound or order 0 page.
  */
void page_frag_free(void *addr)
{
struct page *page = virt_to_head_page(addr);

if (unlikely(put_page_testzero(page)))
__free_pages_ok(page, compound_order(page));
}
EXPORT_SYMBOL(page_frag_free);

I think there is a problem here: order-0 pages are freed directly to
the buddy allocator, bypassing the per-cpu pages. This might be the
reason lock contention appeared on the free path. Can someone apply the
diff below and see if the lock contention is gone?

Will test it tonight





diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2ef1c17942f..65c0ae13215a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4554,8 +4554,14 @@ void page_frag_free(void *addr)
  {
struct page *page = virt_to_head_page(addr);
  
-	if (unlikely(put_page_testzero(page)))
-   __free_pages_ok(page, compound_order(page));
+   if (unlikely(put_page_testzero(page))) {
+   unsigned int order = compound_order(page);
+
+   if (order == 0)
+   free_unref_page(page);
+   else
+   __free_pages_ok(page, order);
+   }
  }
  EXPORT_SYMBOL(page_frag_free);
  

Notice that the mlx5 driver supports several RX memory models, so it
can be hard to follow, but from the perf report output we can see that
it uses mlx5e_skb_from_cqe_linear(), which uses build_skb().

--13.63%--mlx5e_skb_from_cqe_linear
   |
--5.02%--build_skb
  |
   --1.85%--__build_skb
 |
  --1.00%--kmem_cache_alloc

/* build_skb() is wrapper over __build_skb(), that specifically
  * takes care of skb->head and skb->pfmemalloc
  * This means that if @frag_size is not zero, then @data must be backed
  * by a page fragment, not kmalloc() or vmalloc()
  */
struct sk_buff *build_skb(void *data, unsigned int frag_size)
{
struct sk_buff *skb = __build_skb(data, frag_size);

if (skb && frag_size) {
skb->head_frag = 1;
if (page_is_pfmemalloc(virt_to_head_page(data)))
skb->pfmemalloc = 1;
}
return skb;
}
EXPORT_SYMBOL(build_skb);

It still doesn't prove that the @data is backed by an order-0 page.
The mlx5 driver uses mlx5e_page_alloc_mapped() ->
page_pool_dev_alloc_pages(), and I can see the perf report using
__page_pool_alloc_pages_slow().

The setup for page_pool in mlx5 uses order=0.

/* Create a page_pool and register it with rxq */
pp_params.order = 0;
pp_params.flags = 0; /* No-internal DMA mapping in page_pool */
pp_params.pool_size = pool_size;
pp_params.nid   = cpu_to_node(c->cpu);
pp_params.dev   = c->pdev;
pp_params.dma_dir   = rq->buff.map_dir;

/* page_pool can be used even when there is no rq->xdp_prog,
 * given page_pool does not handle DMA mapping there is no
 * required state to clear. And page_pool gracefully handle
 * elevated refcnt.
 */
rq->page_pool = page_pool_create(&pp_params

[PATCH net] net-sysfs: fix formatting of tx_timeout attribute

2018-11-02 Thread Julian Wiedmann
Add a missing newline.

Signed-off-by: Julian Wiedmann 
---
 net/core/net-sysfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index bd67c4d0fcfd..ef06409d768e 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1039,7 +1039,7 @@ static ssize_t tx_timeout_show(struct netdev_queue *queue, char *buf)
trans_timeout = queue->trans_timeout;
spin_unlock_irq(&queue->_xmit_lock);
 
-   return sprintf(buf, "%lu", trans_timeout);
+   return sprintf(buf, "%lu\n", trans_timeout);
 }
 
 static unsigned int get_netdev_queue_index(struct netdev_queue *queue)
-- 
2.16.4



[PATCH net 4/6] s390/qeth: fix initial operstate

2018-11-02 Thread Julian Wiedmann
Setting the carrier 'on' for an unregistered netdevice doesn't update
its operstate. Fix this by delaying the update until the netdevice has
been registered.

Fixes: 91cc98f51e3d ("s390/qeth: remove duplicated carrier state tracking")
Signed-off-by: Julian Wiedmann 
---
 drivers/s390/net/qeth_core.h  |  2 +-
 drivers/s390/net/qeth_core_main.c | 13 ++---
 drivers/s390/net/qeth_l2_main.c   | 10 +++---
 drivers/s390/net/qeth_l3_main.c   | 10 +++---
 4 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h
index b3a0b8838d2f..90cb213b0d55 100644
--- a/drivers/s390/net/qeth_core.h
+++ b/drivers/s390/net/qeth_core.h
@@ -991,7 +991,7 @@ int qeth_wait_for_threads(struct qeth_card *, unsigned long);
 int qeth_do_run_thread(struct qeth_card *, unsigned long);
 void qeth_clear_thread_start_bit(struct qeth_card *, unsigned long);
 void qeth_clear_thread_running_bit(struct qeth_card *, unsigned long);
-int qeth_core_hardsetup_card(struct qeth_card *);
+int qeth_core_hardsetup_card(struct qeth_card *card, bool *carrier_ok);
 void qeth_print_status_message(struct qeth_card *);
 int qeth_init_qdio_queues(struct qeth_card *);
 int qeth_send_ipa_cmd(struct qeth_card *, struct qeth_cmd_buffer *,
diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index 639ac0aca1e9..aed1a7961553 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -5075,7 +5075,7 @@ static struct ccw_driver qeth_ccw_driver = {
.remove = ccwgroup_remove_ccwdev,
 };
 
-int qeth_core_hardsetup_card(struct qeth_card *card)
+int qeth_core_hardsetup_card(struct qeth_card *card, bool *carrier_ok)
 {
int retries = 3;
int rc;
@@ -5150,13 +5150,20 @@ int qeth_core_hardsetup_card(struct qeth_card *card)
if (rc == IPA_RC_LAN_OFFLINE) {
dev_warn(&card->gdev->dev,
"The LAN is offline\n");
-   netif_carrier_off(card->dev);
+   *carrier_ok = false;
} else {
rc = -ENODEV;
goto out;
}
} else {
-   netif_carrier_on(card->dev);
+   *carrier_ok = true;
+   }
+
+   if (qeth_netdev_is_registered(card->dev)) {
+   if (*carrier_ok)
+   netif_carrier_on(card->dev);
+   else
+   netif_carrier_off(card->dev);
}
 
card->options.ipa4.supported_funcs = 0;
diff --git a/drivers/s390/net/qeth_l2_main.c b/drivers/s390/net/qeth_l2_main.c
index 2b978eba7e30..2914a1a69f83 100644
--- a/drivers/s390/net/qeth_l2_main.c
+++ b/drivers/s390/net/qeth_l2_main.c
@@ -863,7 +863,7 @@ static const struct net_device_ops qeth_l2_netdev_ops = {
.ndo_set_features   = qeth_set_features
 };
 
-static int qeth_l2_setup_netdev(struct qeth_card *card)
+static int qeth_l2_setup_netdev(struct qeth_card *card, bool carrier_ok)
 {
int rc;
 
@@ -920,6 +920,9 @@ static int qeth_l2_setup_netdev(struct qeth_card *card)
qeth_l2_request_initial_mac(card);
netif_napi_add(card->dev, &card->napi, qeth_poll, QETH_NAPI_WEIGHT);
rc = register_netdev(card->dev);
+   if (!rc && carrier_ok)
+   netif_carrier_on(card->dev);
+
if (rc)
card->dev->netdev_ops = NULL;
return rc;
@@ -950,6 +953,7 @@ static int __qeth_l2_set_online(struct ccwgroup_device *gdev, int recovery_mode)
struct qeth_card *card = dev_get_drvdata(&gdev->dev);
int rc = 0;
enum qeth_card_states recover_flag;
+   bool carrier_ok;
 
mutex_lock(&card->discipline_mutex);
mutex_lock(&card->conf_mutex);
@@ -957,7 +961,7 @@ static int __qeth_l2_set_online(struct ccwgroup_device *gdev, int recovery_mode)
QETH_DBF_HEX(SETUP, 2, &card, sizeof(void *));
 
recover_flag = card->state;
-   rc = qeth_core_hardsetup_card(card);
+   rc = qeth_core_hardsetup_card(card, &carrier_ok);
if (rc) {
QETH_DBF_TEXT_(SETUP, 2, "2err%04x", rc);
rc = -ENODEV;
@@ -968,7 +972,7 @@ static int __qeth_l2_set_online(struct ccwgroup_device *gdev, int recovery_mode)
dev_info(&card->gdev->dev,
"The device represents a Bridge Capable Port\n");
 
-   rc = qeth_l2_setup_netdev(card);
+   rc = qeth_l2_setup_netdev(card, carrier_ok);
if (rc)
goto out_remove;
 
diff --git a/drivers/s390/net/qeth_l3_main.c b/drivers/s390/net/qeth_l3_main.c
index a719c5ec4171..b26f7d7a2ca0 100644
--- a/drivers/s390/net/qeth_l3_main.c
+++ b/drivers/s390/net/qeth_l3_main.c
@@ -2351,7 +2351,7 @@ static const struct net_device_ops qeth_l3_osa_netdev_ops = {
.ndo_neigh_setup= qeth_l3_neigh_setup,
 };
 
-static int qeth_l3_setup_netdev(struct qeth_card *card)
+static int qeth_

[PATCH net 2/6] s390/qeth: fix HiperSockets sniffer

2018-11-02 Thread Julian Wiedmann
Sniffing mode for L3 HiperSockets requires that no IP addresses are
registered with the HW. The preferred way to achieve this is for
userspace to delete all the IPs on the interface. But qeth is expected
to also tolerate a configuration where that is not the case, by skipping
the IP registration when in sniffer mode.
Since commit 5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback")
reworked the IP registration logic in the L3 subdriver, this no longer
works. When the qeth device is set online, qeth_l3_recover_ip() now
unconditionally registers all unicast addresses from our internal
IP table.

While we could fix this particular problem by skipping
qeth_l3_recover_ip() on a sniffer device, the more future-proof change
is to skip the IP address registration at the lowest level. This way we
a) catch any future code path that attempts to register an IP address
   without considering the sniffer scenario, and
b) continue to build up our internal IP table, so that if sniffer mode
   is switched off later we can operate just like normal.

Fixes: 5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback")
Signed-off-by: Julian Wiedmann 
---
 drivers/s390/net/qeth_l3_main.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/s390/net/qeth_l3_main.c b/drivers/s390/net/qeth_l3_main.c
index ffa2aa1dd4c5..968e344a240b 100644
--- a/drivers/s390/net/qeth_l3_main.c
+++ b/drivers/s390/net/qeth_l3_main.c
@@ -278,9 +278,6 @@ static void qeth_l3_clear_ip_htable(struct qeth_card *card, int recover)
 
QETH_CARD_TEXT(card, 4, "clearip");
 
-   if (recover && card->options.sniffer)
-   return;
-
spin_lock_bh(&card->ip_lock);
 
hash_for_each_safe(card->ip_htable, i, tmp, addr, hnode) {
@@ -661,6 +658,8 @@ static int qeth_l3_register_addr_entry(struct qeth_card *card,
int rc = 0;
int cnt = 3;
 
+   if (card->options.sniffer)
+   return 0;
 
if (addr->proto == QETH_PROT_IPV4) {
QETH_CARD_TEXT(card, 2, "setaddr4");
@@ -695,6 +694,9 @@ static int qeth_l3_deregister_addr_entry(struct qeth_card *card,
 {
int rc = 0;
 
+   if (card->options.sniffer)
+   return 0;
+
if (addr->proto == QETH_PROT_IPV4) {
QETH_CARD_TEXT(card, 2, "deladdr4");
QETH_CARD_HEX(card, 3, &addr->u.a4.addr, sizeof(int));
-- 
2.16.4



[PATCH net 1/6] s390/qeth: sanitize strings in debug messages

2018-11-02 Thread Julian Wiedmann
As Documentation/s390/s390dbf.txt states quite clearly, using any
pointer in sprintf-formatted s390dbf debug entries is dangerous.
The pointers are dereferenced whenever the trace file is read from.
So if the referenced data has a shorter life-time than the trace file,
any read operation can result in a use-after-free.

So rip out all hazardous use of indirect data, and replace any usage of
dev_name() and such by the Bus ID number.

Signed-off-by: Julian Wiedmann 
---
 drivers/s390/net/qeth_core.h  |  15 -
 drivers/s390/net/qeth_core_main.c | 127 +-
 drivers/s390/net/qeth_l2_main.c   |  24 +++
 drivers/s390/net/qeth_l3_main.c   | 104 +++
 4 files changed, 119 insertions(+), 151 deletions(-)

diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h
index 6843bc7ee9f2..884ba9dfb341 100644
--- a/drivers/s390/net/qeth_core.h
+++ b/drivers/s390/net/qeth_core.h
@@ -87,6 +87,18 @@ struct qeth_dbf_info {
 #define SENSE_RESETTING_EVENT_BYTE 1
 #define SENSE_RESETTING_EVENT_FLAG 0x80
 
+static inline u32 qeth_get_device_id(struct ccw_device *cdev)
+{
+   struct ccw_dev_id dev_id;
+   u32 id;
+
+   ccw_device_get_id(cdev, &dev_id);
+   id = dev_id.devno;
+   id |= (u32) (dev_id.ssid << 16);
+
+   return id;
+}
+
 /*
  * Common IO related definitions
  */
@@ -97,7 +109,8 @@ struct qeth_dbf_info {
 #define CARD_RDEV_ID(card) dev_name(&card->read.ccwdev->dev)
 #define CARD_WDEV_ID(card) dev_name(&card->write.ccwdev->dev)
 #define CARD_DDEV_ID(card) dev_name(&card->data.ccwdev->dev)
-#define CHANNEL_ID(channel) dev_name(&channel->ccwdev->dev)
+#define CCW_DEVID(cdev)(qeth_get_device_id(cdev))
+#define CARD_DEVID(card)   (CCW_DEVID(CARD_RDEV(card)))
 
 /**
  * card stuff
diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index 3274f13aad57..639ac0aca1e9 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -554,8 +554,8 @@ static int __qeth_issue_next_read(struct qeth_card *card)
if (!iob) {
dev_warn(&card->gdev->dev, "The qeth device driver "
"failed to recover an error on the device\n");
-   QETH_DBF_MESSAGE(2, "%s issue_next_read failed: no iob "
-   "available\n", dev_name(&card->gdev->dev));
+   QETH_DBF_MESSAGE(2, "issue_next_read on device %x failed: no iob available\n",
+CARD_DEVID(card));
return -ENOMEM;
}
qeth_setup_ccw(channel->ccw, CCW_CMD_READ, QETH_BUFSIZE, iob->data);
@@ -563,8 +563,8 @@ static int __qeth_issue_next_read(struct qeth_card *card)
rc = ccw_device_start(channel->ccwdev, channel->ccw,
  (addr_t) iob, 0, 0);
if (rc) {
-   QETH_DBF_MESSAGE(2, "%s error in starting next read ccw! "
-   "rc=%i\n", dev_name(&card->gdev->dev), rc);
+   QETH_DBF_MESSAGE(2, "error %i on device %x when starting next read ccw!\n",
+rc, CARD_DEVID(card));
atomic_set(&channel->irq_pending, 0);
card->read_or_write_problem = 1;
qeth_schedule_recovery(card);
@@ -613,16 +613,14 @@ static void qeth_issue_ipa_msg(struct qeth_ipa_cmd *cmd, int rc,
const char *ipa_name;
int com = cmd->hdr.command;
ipa_name = qeth_get_ipa_cmd_name(com);
+
if (rc)
-   QETH_DBF_MESSAGE(2, "IPA: %s(x%X) for %s/%s returned "
-   "x%X \"%s\"\n",
-   ipa_name, com, dev_name(&card->gdev->dev),
-   QETH_CARD_IFNAME(card), rc,
-   qeth_get_ipa_msg(rc));
+   QETH_DBF_MESSAGE(2, "IPA: %s(%#x) for device %x returned %#x \"%s\"\n",
+ipa_name, com, CARD_DEVID(card), rc,
+qeth_get_ipa_msg(rc));
else
-   QETH_DBF_MESSAGE(5, "IPA: %s(x%X) for %s/%s succeeded\n",
-   ipa_name, com, dev_name(&card->gdev->dev),
-   QETH_CARD_IFNAME(card));
+   QETH_DBF_MESSAGE(5, "IPA: %s(%#x) for device %x succeeded\n",
+ipa_name, com, CARD_DEVID(card));
 }
 
 static struct qeth_ipa_cmd *qeth_check_ipa_data(struct qeth_card *card,
@@ -711,7 +709,7 @@ static int qeth_check_idx_response(struct qeth_card *card,
 
QETH_DBF_HEX(CTRL, 2, buffer, QETH_DBF_CTRL_LEN);
if ((buffer[2] & 0xc0) == 0xc0) {
-   QETH_DBF_MESSAGE(2, "received an IDX TERMINATE with cause code %#02x\n",
+   QETH_DBF_MESSAGE(2, "received an IDX TERMINATE with cause code %#04x\n",
 buffer[4]);
QETH_CARD_TEXT(card, 2, "ckidxres");
QETH_CARD_TEXT(card, 

[PATCH net 3/6] s390/qeth: unregister netdevice only when registered

2018-11-02 Thread Julian Wiedmann
qeth only registers its netdevice when the qeth device is first set
online. Thus a device that has never been set online will trigger
a WARN ("network todo 'hsi%d' but state 0") in unregister_netdev() when
removed.

Fix this by protecting the unregister step, just like we already protect
against repeated registering of the netdevice.

Fixes: d3d1b205e89f ("s390/qeth: allocate netdevice early")
Reported-by: Karsten Graul 
Signed-off-by: Julian Wiedmann 
---
 drivers/s390/net/qeth_core.h| 5 +
 drivers/s390/net/qeth_l2_main.c | 5 +++--
 drivers/s390/net/qeth_l3_main.c | 5 +++--
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h
index 884ba9dfb341..b3a0b8838d2f 100644
--- a/drivers/s390/net/qeth_core.h
+++ b/drivers/s390/net/qeth_core.h
@@ -843,6 +843,11 @@ struct qeth_trap_id {
 /*some helper functions*/
 #define QETH_CARD_IFNAME(card) (((card)->dev)? (card)->dev->name : "")
 
+static inline bool qeth_netdev_is_registered(struct net_device *dev)
+{
+   return dev->netdev_ops != NULL;
+}
+
 static inline void qeth_scrub_qdio_buffer(struct qdio_buffer *buf,
  unsigned int elements)
 {
diff --git a/drivers/s390/net/qeth_l2_main.c b/drivers/s390/net/qeth_l2_main.c
index 5b67fd1f2b77..2b978eba7e30 100644
--- a/drivers/s390/net/qeth_l2_main.c
+++ b/drivers/s390/net/qeth_l2_main.c
@@ -826,7 +826,8 @@ static void qeth_l2_remove_device(struct ccwgroup_device *cgdev)
 
if (cgdev->state == CCWGROUP_ONLINE)
qeth_l2_set_offline(cgdev);
-   unregister_netdev(card->dev);
+   if (qeth_netdev_is_registered(card->dev))
+   unregister_netdev(card->dev);
 }
 
 static const struct ethtool_ops qeth_l2_ethtool_ops = {
@@ -866,7 +867,7 @@ static int qeth_l2_setup_netdev(struct qeth_card *card)
 {
int rc;
 
-   if (card->dev->netdev_ops)
+   if (qeth_netdev_is_registered(card->dev))
return 0;
 
card->dev->priv_flags |= IFF_UNICAST_FLT;
diff --git a/drivers/s390/net/qeth_l3_main.c b/drivers/s390/net/qeth_l3_main.c
index 968e344a240b..a719c5ec4171 100644
--- a/drivers/s390/net/qeth_l3_main.c
+++ b/drivers/s390/net/qeth_l3_main.c
@@ -2356,7 +2356,7 @@ static int qeth_l3_setup_netdev(struct qeth_card *card)
unsigned int headroom;
int rc;
 
-   if (card->dev->netdev_ops)
+   if (qeth_netdev_is_registered(card->dev))
return 0;
 
if (card->info.type == QETH_CARD_TYPE_OSD ||
@@ -2465,7 +2465,8 @@ static void qeth_l3_remove_device(struct ccwgroup_device *cgdev)
if (cgdev->state == CCWGROUP_ONLINE)
qeth_l3_set_offline(cgdev);
 
-   unregister_netdev(card->dev);
+   if (qeth_netdev_is_registered(card->dev))
+   unregister_netdev(card->dev);
qeth_l3_clear_ip_htable(card, 0);
qeth_l3_clear_ipato_list(card);
 }
-- 
2.16.4



[PATCH net 5/6] s390/qeth: sanitize ARP requests

2018-11-02 Thread Julian Wiedmann
The ARP_{ADD,REMOVE}_ENTRY cmd structs contain reserved fields.
Introduce a common helper that doesn't raw-copy the user-provided data
into the cmd, but only sets those fields that are strictly needed for
the command.

This also sets the correct command length for ARP_REMOVE_ENTRY.

Signed-off-by: Julian Wiedmann 
---
 drivers/s390/net/qeth_core.h  |  5 ---
 drivers/s390/net/qeth_core_main.c | 12 ++---
 drivers/s390/net/qeth_core_mpc.h  |  2 +-
 drivers/s390/net/qeth_l3_main.c   | 94 +++
 4 files changed, 34 insertions(+), 79 deletions(-)

diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h
index 90cb213b0d55..04e294d1d16d 100644
--- a/drivers/s390/net/qeth_core.h
+++ b/drivers/s390/net/qeth_core.h
@@ -1046,11 +1046,6 @@ int qeth_configure_cq(struct qeth_card *, enum qeth_cq);
 int qeth_hw_trap(struct qeth_card *, enum qeth_diags_trap_action);
 void qeth_trace_features(struct qeth_card *);
 void qeth_close_dev(struct qeth_card *);
-int qeth_send_setassparms(struct qeth_card *, struct qeth_cmd_buffer *, __u16,
- long,
- int (*reply_cb)(struct qeth_card *,
- struct qeth_reply *, unsigned long),
- void *);
 int qeth_setassparms_cb(struct qeth_card *, struct qeth_reply *, unsigned long);
 struct qeth_cmd_buffer *qeth_get_setassparms_cmd(struct qeth_card *,
 enum qeth_ipa_funcs,
diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index aed1a7961553..82282b2092d8 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -5477,11 +5477,12 @@ struct qeth_cmd_buffer *qeth_get_setassparms_cmd(struct qeth_card *card,
 }
 EXPORT_SYMBOL_GPL(qeth_get_setassparms_cmd);
 
-int qeth_send_setassparms(struct qeth_card *card,
- struct qeth_cmd_buffer *iob, __u16 len, long data,
- int (*reply_cb)(struct qeth_card *,
- struct qeth_reply *, unsigned long),
- void *reply_param)
+static int qeth_send_setassparms(struct qeth_card *card,
+struct qeth_cmd_buffer *iob, u16 len,
+long data, int (*reply_cb)(struct qeth_card *,
+   struct qeth_reply *,
+   unsigned long),
+void *reply_param)
 {
int rc;
struct qeth_ipa_cmd *cmd;
@@ -5497,7 +5498,6 @@ int qeth_send_setassparms(struct qeth_card *card,
rc = qeth_send_ipa_cmd(card, iob, reply_cb, reply_param);
return rc;
 }
-EXPORT_SYMBOL_GPL(qeth_send_setassparms);
 
 int qeth_send_simple_setassparms_prot(struct qeth_card *card,
  enum qeth_ipa_funcs ipa_func,
diff --git a/drivers/s390/net/qeth_core_mpc.h b/drivers/s390/net/qeth_core_mpc.h
index e85090467afe..80c036acf563 100644
--- a/drivers/s390/net/qeth_core_mpc.h
+++ b/drivers/s390/net/qeth_core_mpc.h
@@ -436,7 +436,7 @@ struct qeth_ipacmd_setassparms {
__u32 flags_32bit;
struct qeth_ipa_caps caps;
struct qeth_checksum_cmd chksum;
-   struct qeth_arp_cache_entry add_arp_entry;
+   struct qeth_arp_cache_entry arp_entry;
struct qeth_arp_query_data query_arp;
struct qeth_tso_start_data tso;
__u8 ip[16];
diff --git a/drivers/s390/net/qeth_l3_main.c b/drivers/s390/net/qeth_l3_main.c
index b26f7d7a2ca0..f08b745c2007 100644
--- a/drivers/s390/net/qeth_l3_main.c
+++ b/drivers/s390/net/qeth_l3_main.c
@@ -1777,13 +1777,18 @@ static int qeth_l3_arp_query(struct qeth_card *card, char __user *udata)
return rc;
 }
 
-static int qeth_l3_arp_add_entry(struct qeth_card *card,
-   struct qeth_arp_cache_entry *entry)
+static int qeth_l3_arp_modify_entry(struct qeth_card *card,
+   struct qeth_arp_cache_entry *entry,
+   enum qeth_arp_process_subcmds arp_cmd)
 {
+   struct qeth_arp_cache_entry *cmd_entry;
struct qeth_cmd_buffer *iob;
int rc;
 
-   QETH_CARD_TEXT(card, 3, "arpadent");
+   if (arp_cmd == IPA_CMD_ASS_ARP_ADD_ENTRY)
+   QETH_CARD_TEXT(card, 3, "arpadd");
+   else
+   QETH_CARD_TEXT(card, 3, "arpdel");
 
/*
 * currently GuestLAN only supports the ARP assist function
@@ -1796,54 +1801,19 @@ static int qeth_l3_arp_add_entry(struct qeth_card *card,
return -EOPNOTSUPP;
}
 
-   iob = qeth_get_setassparms_cmd(card, IPA_ARP_PROCESSING,
-  IPA_CMD_ASS_ARP_ADD_ENTRY,
-  sizeof(struct qeth_arp_cache_entry),

[PATCH net 0/6] s390/qeth: fixes 2018-11-02

2018-11-02 Thread Julian Wiedmann
Hi Dave,

please apply one round of qeth fixes for -net.

Patch 1 is rather large and removes a use-after-free hazard from many of our
debug trace entries.
Patch 2 is yet another fix-up for the L3 subdriver's new IP address management
code.
Patch 3 and 4 resolve some fallout from the recent changes wrt how/when qeth
allocates its net_device.
Patch 5 makes sure we don't set reserved bits when building HW commands from
user-provided data.
And finally, patch 6 allows ethtool to play nice with new HW.

Thanks,
Julian


Julian Wiedmann (6):
  s390/qeth: sanitize strings in debug messages
  s390/qeth: fix HiperSockets sniffer
  s390/qeth: unregister netdevice only when registered
  s390/qeth: fix initial operstate
  s390/qeth: sanitize ARP requests
  s390/qeth: report 25Gbit link speed

 drivers/s390/net/qeth_core.h  |  27 +++--
 drivers/s390/net/qeth_core_main.c | 172 ---
 drivers/s390/net/qeth_core_mpc.h  |   4 +-
 drivers/s390/net/qeth_l2_main.c   |  39 +++
 drivers/s390/net/qeth_l3_main.c   | 207 +-
 5 files changed, 207 insertions(+), 242 deletions(-)

-- 
2.16.4



[PATCH net 6/6] s390/qeth: report 25Gbit link speed

2018-11-02 Thread Julian Wiedmann
This adds the various identifiers for 25Gbit cards, and wires them up
into sysfs and ethtool.

Signed-off-by: Julian Wiedmann 
---
 drivers/s390/net/qeth_core_main.c | 20 ++--
 drivers/s390/net/qeth_core_mpc.h  |  2 ++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index 82282b2092d8..4bce5ae65a55 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -167,6 +167,8 @@ const char *qeth_get_cardname_short(struct qeth_card *card)
return "OSD_1000";
case QETH_LINK_TYPE_10GBIT_ETH:
return "OSD_10GIG";
+   case QETH_LINK_TYPE_25GBIT_ETH:
+   return "OSD_25GIG";
case QETH_LINK_TYPE_LANE_ETH100:
return "OSD_FE_LANE";
case QETH_LINK_TYPE_LANE_TR:
@@ -4432,7 +4434,8 @@ static int qeth_mdio_read(struct net_device *dev, int phy_id, int regnum)
rc = BMCR_FULLDPLX;
if ((card->info.link_type != QETH_LINK_TYPE_GBIT_ETH) &&
(card->info.link_type != QETH_LINK_TYPE_OSN) &&
-   (card->info.link_type != QETH_LINK_TYPE_10GBIT_ETH))
+   (card->info.link_type != QETH_LINK_TYPE_10GBIT_ETH) &&
+   (card->info.link_type != QETH_LINK_TYPE_25GBIT_ETH))
rc |= BMCR_SPEED100;
break;
case MII_BMSR: /* Basic mode status register */
@@ -6166,8 +6169,14 @@ static void qeth_set_cmd_adv_sup(struct ethtool_link_ksettings *cmd,
WARN_ON_ONCE(1);
}
 
-   /* fallthrough from high to low, to select all legal speeds: */
+   /* partially does fall through, to also select lower speeds */
switch (maxspeed) {
+   case SPEED_25000:
+   ethtool_link_ksettings_add_link_mode(cmd, supported,
+25000baseSR_Full);
+   ethtool_link_ksettings_add_link_mode(cmd, advertising,
+25000baseSR_Full);
+   break;
case SPEED_1:
ethtool_link_ksettings_add_link_mode(cmd, supported,
 1baseT_Full);
@@ -6250,6 +6259,10 @@ int qeth_core_ethtool_get_link_ksettings(struct net_device *netdev,
cmd->base.speed = SPEED_1;
cmd->base.port = PORT_FIBRE;
break;
+   case QETH_LINK_TYPE_25GBIT_ETH:
+   cmd->base.speed = SPEED_25000;
+   cmd->base.port = PORT_FIBRE;
+   break;
default:
cmd->base.speed = SPEED_10;
cmd->base.port = PORT_TP;
@@ -6316,6 +6329,9 @@ int qeth_core_ethtool_get_link_ksettings(struct net_device *netdev,
case CARD_INFO_PORTS_10G:
cmd->base.speed = SPEED_1;
break;
+   case CARD_INFO_PORTS_25G:
+   cmd->base.speed = SPEED_25000;
+   break;
}
 
return 0;
diff --git a/drivers/s390/net/qeth_core_mpc.h b/drivers/s390/net/qeth_core_mpc.h
index 80c036acf563..3e54be201b27 100644
--- a/drivers/s390/net/qeth_core_mpc.h
+++ b/drivers/s390/net/qeth_core_mpc.h
@@ -90,6 +90,7 @@ enum qeth_link_types {
QETH_LINK_TYPE_GBIT_ETH = 0x03,
QETH_LINK_TYPE_OSN  = 0x04,
QETH_LINK_TYPE_10GBIT_ETH   = 0x10,
+   QETH_LINK_TYPE_25GBIT_ETH   = 0x12,
QETH_LINK_TYPE_LANE_ETH100  = 0x81,
QETH_LINK_TYPE_LANE_TR  = 0x82,
QETH_LINK_TYPE_LANE_ETH1000 = 0x83,
@@ -347,6 +348,7 @@ enum qeth_card_info_port_speed {
CARD_INFO_PORTS_100M= 0x0006,
CARD_INFO_PORTS_1G  = 0x0007,
CARD_INFO_PORTS_10G = 0x0008,
+   CARD_INFO_PORTS_25G = 0x000A,
 };
 
 /* (SET)DELIP(M) IPA stuff ***/
-- 
2.16.4



[PATCH v2 bpf 3/3] bpf: show main program address and length in bpf_prog_info

2018-11-02 Thread Song Liu
Currently, when there is no subprog (prog->aux->func_cnt == 0),
bpf_prog_info does not return any jited_ksyms or jited_func_lens. This
patch adds main program address (prog->bpf_func) and main program
length (prog->jited_len) to bpf_prog_info.

Signed-off-by: Song Liu 
---
 kernel/bpf/syscall.c | 33 -
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 34a9eef5992c..9418174c276c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2158,11 +2158,11 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
}
 
ulen = info.nr_jited_ksyms;
-   info.nr_jited_ksyms = prog->aux->func_cnt;
+   info.nr_jited_ksyms = prog->aux->func_cnt ? : 1;
if (info.nr_jited_ksyms && ulen) {
if (bpf_dump_raw_ok()) {
+   unsigned long ksym_addr;
u64 __user *user_ksyms;
-   ulong ksym_addr;
u32 i;
 
/* copy the address of the kernel symbol
@@ -2170,9 +2170,17 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
 */
ulen = min_t(u32, info.nr_jited_ksyms, ulen);
user_ksyms = u64_to_user_ptr(info.jited_ksyms);
-   for (i = 0; i < ulen; i++) {
-   ksym_addr = (ulong) prog->aux->func[i]->bpf_func;
-   if (put_user((u64) ksym_addr, &user_ksyms[i]))
+   if (prog->aux->func_cnt) {
+   for (i = 0; i < ulen; i++) {
+   ksym_addr = (unsigned long)
+   prog->aux->func[i]->bpf_func;
+   if (put_user((u64) ksym_addr,
+&user_ksyms[i]))
+   return -EFAULT;
+   }
+   } else {
+   ksym_addr = (unsigned long) prog->bpf_func;
+   if (put_user((u64) ksym_addr, &user_ksyms[0]))
return -EFAULT;
}
} else {
@@ -2181,7 +2189,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
}
 
ulen = info.nr_jited_func_lens;
-   info.nr_jited_func_lens = prog->aux->func_cnt;
+   info.nr_jited_func_lens = prog->aux->func_cnt ? : 1;
if (info.nr_jited_func_lens && ulen) {
if (bpf_dump_raw_ok()) {
u32 __user *user_lens;
@@ -2190,9 +2198,16 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
/* copy the JITed image lengths for each function */
ulen = min_t(u32, info.nr_jited_func_lens, ulen);
user_lens = u64_to_user_ptr(info.jited_func_lens);
-   for (i = 0; i < ulen; i++) {
-   func_len = prog->aux->func[i]->jited_len;
-   if (put_user(func_len, &user_lens[i]))
+   if (prog->aux->func_cnt) {
+   for (i = 0; i < ulen; i++) {
+   func_len =
+   prog->aux->func[i]->jited_len;
+   if (put_user(func_len, &user_lens[i]))
+   return -EFAULT;
+   }
+   } else {
+   func_len = prog->jited_len;
+   if (put_user(func_len, &user_lens[0]))
return -EFAULT;
}
} else {
-- 
2.17.1



[PATCH v2 bpf 0/3] show more accurate bpf program address

2018-11-02 Thread Song Liu
Changes v1 -> v2:
1. Added main program length to bpf_prog_info->jited_func_lens (3/3).
2. Updated commit message of 1/3 and 2/3 with more background about the
   address masking, and why it is still safe after the changes.
3. Replace "ulong" with "unsigned long".

This set improves the bpf program addresses shown in /proc/kallsyms and in
bpf_prog_info. First, the real program address is shown instead of the page
address. Second, when there is no subprogram, bpf_prog_info->jited_ksyms
and bpf_prog_info->jited_func_lens return the main prog address and
length.

Song Liu (3):
  bpf: show real jited prog address in /proc/kallsyms
  bpf: show real jited address in bpf_prog_info->jited_ksyms
  bpf: show main program address and length in bpf_prog_info

 kernel/bpf/core.c|  4 +---
 kernel/bpf/syscall.c | 34 --
 2 files changed, 25 insertions(+), 13 deletions(-)

--
2.17.1


[PATCH v2 bpf 1/3] bpf: show real jited prog address in /proc/kallsyms

2018-11-02 Thread Song Liu
Currently, /proc/kallsyms shows the page address of a jited bpf program. The
main reason here is to not expose the randomized start address. However,
this is not ideal for detailed profiling (finding hot instructions from
stack traces). This patch replaces the page address with the real prog
start address.

This change is OK because these addresses are still protected by sysctl
kptr_restrict (see kallsyms_show_value()), and only programs loaded by
root are added to kallsyms (see bpf_prog_kallsyms_add()).

Signed-off-by: Song Liu 
---
 kernel/bpf/core.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 6377225b2082..1a796e0799ec 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -553,7 +553,6 @@ bool is_bpf_text_address(unsigned long addr)
 int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
char *sym)
 {
-   unsigned long symbol_start, symbol_end;
struct bpf_prog_aux *aux;
unsigned int it = 0;
int ret = -ERANGE;
@@ -566,10 +565,9 @@ int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
if (it++ != symnum)
continue;
 
-   bpf_get_prog_addr_region(aux->prog, &symbol_start, &symbol_end);
bpf_get_prog_name(aux->prog, sym);
 
-   *value = symbol_start;
+   *value = (unsigned long)aux->prog->bpf_func;
*type  = BPF_SYM_ELF_TYPE;
 
ret = 0;
-- 
2.17.1



[PATCH v2 bpf 2/3] bpf: show real jited address in bpf_prog_info->jited_ksyms

2018-11-02 Thread Song Liu
Currently, jited_ksyms in bpf_prog_info shows the page addresses of a jited
bpf program. The main reason here is to not expose the randomized start
address. However, this is not ideal for detailed profiling (finding hot
instructions from stack traces). This patch replaces the page address
with the real prog start address.

This change is OK because bpf_prog_get_info_by_fd() is only available
to root.

Signed-off-by: Song Liu 
---
 kernel/bpf/syscall.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ccb93277aae2..34a9eef5992c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2172,7 +2172,6 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
user_ksyms = u64_to_user_ptr(info.jited_ksyms);
for (i = 0; i < ulen; i++) {
ksym_addr = (ulong) prog->aux->func[i]->bpf_func;
-   ksym_addr &= PAGE_MASK;
if (put_user((u64) ksym_addr, &user_ksyms[i]))
return -EFAULT;
}
-- 
2.17.1



Re: [PATCH iproute2-next v1 4/4] rdma: Document IB device renaming option

2018-11-02 Thread David Ahern
On 10/31/18 1:17 AM, Leon Romanovsky wrote:
> diff --git a/man/man8/rdma-dev.8 b/man/man8/rdma-dev.8
> index 461681b6..b2f9964a 100644
> --- a/man/man8/rdma-dev.8
> +++ b/man/man8/rdma-dev.8
> @@ -22,6 +22,12 @@ rdmak-dev \- RDMA device configuration
>  .B rdma dev show
>  .RI "[ " DEV " ]"
>  
> +.ti -8
> +.B rdma dev set
> +.RI "[ " DEV " ]"
> +.BR name
> +.BR NEWNAME
> +
>  .ti -8
>  .B rdma dev help
>  
> @@ -33,6 +39,8 @@ rdmak-dev \- RDMA device configuration
>  - specifies the RDMA device to show.
>  If this argument is omitted all devices are listed.
>  
> +.SS rdma dev set - rename rdma device
> +

'nroff -u0 -Tlp -man man/man8/rdma-dev.8' shows that you lost the
spacing between that line and the Examples line.

   rdma dev set - rename rdma device
EXAMPLES
   rdma dev
   Shows the state of all RDMA devices on the system.

...


>  .SH "EXAMPLES"
>  .PP
>  rdma dev
> @@ -45,6 +53,11 @@ rdma dev show mlx5_3
>  Shows the state of specified RDMA device.
>  .RE
>  .PP
> +rdma dev set mlx5_3 name rdma_0
> +.RS 4
> +Renames the mlx5_3 device to be named rdma_0.

That does not read well. I suggest dropping the 'be named' to make it
"Renames the mlx5_3 device to rdma_0."

> +.RE
> +.PP
>  
>  .SH SEE ALSO
>  .BR rdma (8),
> 


Re: [PATCH iproute2-next v1 3/4] rdma: Add an option to rename IB device interface

2018-11-02 Thread David Ahern
On 10/31/18 1:17 AM, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> Enrich rdmatool with an option to rename IB devices,
> the command interface follows Iproute2 convention:
> "rdma dev set [OLD-DEVNAME] name NEW-DEVNAME"
> 
> Signed-off-by: Leon Romanovsky 
> Reviewed-by: Steve Wise 
> ---
>  rdma/dev.c | 35 +++
>  1 file changed, 35 insertions(+)
> 

applied to iproute2-next. Thanks





Re: [PATCH iproute2-next v1 2/4] rdma: Introduce command execution helper with required device name

2018-11-02 Thread David Ahern
On 10/31/18 1:17 AM, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> In contrast to the various show commands, the set command explicitly
> requires a device name as an argument. Provide a new command
> execution helper which enforces it.
> 
> Signed-off-by: Leon Romanovsky 
> Reviewed-by: Steve Wise 
> ---
>  rdma/rdma.h  |  1 +
>  rdma/utils.c | 10 ++
>  2 files changed, 11 insertions(+)
> 

applied to iproute2-next. Thanks





Re: [PATCH iproute2-next v1 1/4] rdma: Update kernel include file to support IB device renaming

2018-11-02 Thread David Ahern
On 10/31/18 1:17 AM, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> Bring kernel header file changes up to commit 05d940d3a3ec
> ("RDMA/nldev: Allow IB device rename through RDMA netlink")
> 
> Signed-off-by: Leon Romanovsky 
> Reviewed-by: Steve Wise 
> ---
>  rdma/include/uapi/rdma/rdma_netlink.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 

applied to iproute2-next. Thanks





Re: [PATCH bpf 2/3] bpf: show real jited address in bpf_prog_info->jited_ksyms

2018-11-02 Thread Song Liu



> On Nov 2, 2018, at 3:19 AM, Daniel Borkmann  wrote:
> 
> On 11/02/2018 11:09 AM, Daniel Borkmann wrote:
>> On 11/01/2018 08:00 AM, Song Liu wrote:
>>> Currently, jited_ksyms in bpf_prog_info shows page addresses of jited
>>> bpf program. This is not ideal for detailed profiling (find hot
>>> instructions from stack traces). This patch replaces the page address
>>> with real prog start address.
>>> 
>>> Signed-off-by: Song Liu 
>>> ---
>>> kernel/bpf/syscall.c | 1 -
>>> 1 file changed, 1 deletion(-)
>>> 
>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>> index ccb93277aae2..34a9eef5992c 100644
>>> --- a/kernel/bpf/syscall.c
>>> +++ b/kernel/bpf/syscall.c
>>> @@ -2172,7 +2172,6 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
>>> user_ksyms = u64_to_user_ptr(info.jited_ksyms);
>>> for (i = 0; i < ulen; i++) {
>>> ksym_addr = (ulong) prog->aux->func[i]->bpf_func;
>>> -   ksym_addr &= PAGE_MASK;
>> 
>> Note that the masking was done on purpose here and in patch 1/3 in order to
>> not expose randomized start address to kallsyms at least. I suppose it's
>> okay to change it here and for kallsyms given bpf_prog_get_info_by_fd() dump
>> is for root only, and in each of the two cases we additionally apply
>> kallsyms_show_value() logic, so for unpriv this is zeroed out plus only root
>> loaded programs are added under kallsyms (capable(CAP_SYS_ADMIN)) anyway.
> 
> (Btw, something like above should have been in changelog to provide some more
> historical context of why we used to do it like that and explaining why it is
> okay to change it this way.)

Thanks Daniel!

I will send v2 with these fixes. 

Song

[PATCH net v2 2/2] ipv6: properly check return value in inet6_dump_all()

2018-11-02 Thread Alexey Kodanev
Make sure we call fib6_dump_end() if it happens that skb->len
is zero. rtnl_dump_all() can reset cb->args on the next loop
iteration there.

Fixes: 08e814c9e8eb ("net/ipv6: Bail early if user only wants cloned entries")
Fixes: ae677bbb4441 ("net: Don't return invalid table id error when dumping all families")
Signed-off-by: Alexey Kodanev 
---

v2: a new patch in v2

 net/ipv6/ip6_fib.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 1b8bc00..ae37861 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -591,7 +591,7 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 
/* fib entries are never clones */
if (arg.filter.flags & RTM_F_CLONED)
-   return skb->len;
+   goto out;
 
w = (void *)cb->args[2];
if (!w) {
@@ -621,7 +621,7 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
tb = fib6_get_table(net, arg.filter.table_id);
if (!tb) {
if (arg.filter.dump_all_families)
-   return skb->len;
+   goto out;
 
NL_SET_ERR_MSG_MOD(cb->extack, "FIB table does not exist");
return -ENOENT;
-- 
1.8.3.1



[PATCH net v2 1/2] rtnetlink: restore handling of dumpit return value in rtnl_dump_all()

2018-11-02 Thread Alexey Kodanev
For non-zero return from dumpit() we should break the loop
in rtnl_dump_all() and return the result. Otherwise, e.g.,
we could get the memory leak in inet6_dump_fib() [1]. The
pointer to the allocated struct fib6_walker there (saved
in cb->args) can be lost, reset on the next iteration.

Fix it by partially restoring the previous behavior before
commit c63586dc9b3e ("net: rtnl_dump_all needs to propagate
error from dumpit function"). The returned error from
dumpit() is still passed further.

[1]:
unreferenced object 0x88001322a200 (size 96):
  comm "sshd", pid 1484, jiffies 4296032768 (age 1432.542s)
  hex dump (first 32 bytes):
00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  
18 09 41 36 00 88 ff ff 18 09 41 36 00 88 ff ff  ..A6..A6
  backtrace:
[<95846b39>] kmem_cache_alloc_trace+0x151/0x220
[<7d12709f>] inet6_dump_fib+0x68d/0x940
[<2775a316>] rtnl_dump_all+0x1d9/0x2d0
[] netlink_dump+0x945/0x11a0
[<2f43485f>] __netlink_dump_start+0x55d/0x800
[] rtnetlink_rcv_msg+0x4fa/0xa00
[<9b5761f3>] netlink_rcv_skb+0x29c/0x420
[<87a1dae1>] rtnetlink_rcv+0x15/0x20
[<691b703b>] netlink_unicast+0x4e3/0x6c0
[] netlink_sendmsg+0x7f2/0xba0
[<96d2aa60>] sock_sendmsg+0xba/0xf0
[<8c1b786f>] __sys_sendto+0x1e4/0x330
[<19587b3f>] __x64_sys_sendto+0xe1/0x1a0
[<071f4d56>] do_syscall_64+0x9f/0x300
[<2737577f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<57587684>] 0x

Fixes: c63586dc9b3e ("net: rtnl_dump_all needs to propagate error from dumpit function")
Signed-off-by: Alexey Kodanev 
---

v2: * fix it by restoring the previous behavior as suggested by David

* adjust subject and commit description

 net/core/rtnetlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index e01274b..33d9227 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3367,7 +3367,7 @@ static int rtnl_dump_all(struct sk_buff *skb, struct netlink_callback *cb)
cb->seq = 0;
}
ret = dumpit(skb, cb);
-   if (ret < 0)
+   if (ret)
break;
}
cb->family = idx;
-- 
1.8.3.1



Re: Help with the BPF verifier

2018-11-02 Thread Edward Cree
On 02/11/18 15:02, Arnaldo Carvalho de Melo wrote:
> Yeah, didn't work as well: 

> And the -vv in 'perf trace' didn't seem to map to further details in the
> output of the verifier debug:
Yeah for log_level 2 you probably need to make source-level changes to either
 perf or libbpf (I think the latter).  It's annoying that essentially no tools
 plumb through an option for that, someone should fix them ;-)

> libbpf: -- BEGIN DUMP LOG ---
> libbpf: 
> 0: (bf) r6 = r1
> 1: (bf) r1 = r10
> 2: (07) r1 += -328
> 3: (b7) r7 = 64
> 4: (b7) r2 = 64
> 5: (bf) r3 = r6
> 6: (85) call bpf_probe_read#4
> 7: (79) r1 = *(u64 *)(r10 -320)
> 8: (15) if r1 == 0x101 goto pc+4
>  R0=inv(id=0) R1=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv64 R10=fp0,call_-1
> 9: (55) if r1 != 0x2 goto pc+22
>  R0=inv(id=0) R1=inv2 R6=ctx(id=0,off=0,imm=0) R7=inv64 R10=fp0,call_-1
> 10: (bf) r1 = r6
> 11: (07) r1 += 16
> 12: (05) goto pc+2
> 15: (79) r3 = *(u64 *)(r1 +0)
> dereference of modified ctx ptr R1 off=16 disallowed
Aha, we at least got a different error message this time.
And indeed llvm has done that optimisation, rather than the more obvious
11: r3 = *(u64 *)(r1 +16)
 because it wants to have lots of reads share a single insn.  You may be able
 to defeat that optimisation by adding compiler barriers, idk.  Maybe someone
 with llvm knowledge can figure out how to stop it (ideally, llvm would know
 when it's generating for bpf backend and not do that).  -O0?  ¯\_(ツ)_/¯
Alternatively, your prog looks short enough that maybe you could kick the C
 habit and write it directly in eBPF asm, that way no-one is optimising things
 behind your back.  (I realise this option won't appeal to everyone ;-)
The reason the verifier disallows this, iirc, is because it needs to be able
 to rewrite the offsets on ctx accesses (see convert_ctx_accesses()) in case
 underlying kernel struct doesn't match the layout of the ctx ABI.  To do this
 it needs the ctx offset to live entirely in the insn doing the access,
 otherwise different paths could lead to the same insn accessing different ctx
 offsets with different fixups needed — can't be done.
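Concretely, the accepted vs. rejected shapes look like this (a sketch in the same eBPF-asm notation as the verifier dump above; offsets illustrative):

```
; accepted: the ctx offset lives entirely in the load instruction,
; so convert_ctx_accesses() can rewrite it for the kernel struct
r3 = *(u64 *)(r1 + 16)

; rejected: the offset is folded into the pointer first, so the
; load no longer carries the full ctx offset
r1 += 16
r3 = *(u64 *)(r1 + 0)    ; "dereference of modified ctx ptr"
```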

-Ed


[PATCH iproute2 v2] Fix warning in tc-skbprio.8 manpage

2018-11-02 Thread Luca Boccassi
". If" gets interpreted as a macro, so move the period to the previous
line:

  33: warning: macro `If' not defined

Fixes: 141b55f8544e ("Add SKB Priority qdisc support in tc(8)")

Signed-off-by: Luca Boccassi 
---
v2: remove extra space to avoid making the full-stop bold.

 man/man8/tc-skbprio.8 | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/man/man8/tc-skbprio.8 b/man/man8/tc-skbprio.8
index 844bbf46..a0a316ba 100644
--- a/man/man8/tc-skbprio.8
+++ b/man/man8/tc-skbprio.8
@@ -29,8 +29,8 @@ While SKB Priority Queue is agnostic to how
 .B skb->priority
 is assigned. A typical use case is to copy
 the 6-bit DS field of IPv4 and IPv6 packets using
-.BR tc-skbedit (8)
-. If
+.BR tc-skbedit (8).
+If
 .B skb->priority
 is greater or equal to 64, the priority is assumed to be 63.
 Priorities less than 64 are taken at face value.
-- 
2.19.1



Re: [PATCH iproute2] Fix warning in tc-skbprio.8 manpage

2018-11-02 Thread Luca Boccassi
On Fri, 2018-11-02 at 15:19 +, Edward Cree wrote:
> On 02/11/18 10:57, Luca Boccassi wrote:
> > ". If" gets interpreted as a macro, so move the period to the
> > previous
> > line:
> > 
> >   33: warning: macro `If' not defined
> > 
> > Fixes: 141b55f8544e ("Add SKB Priority qdisc support in tc(8)")
> > 
> > Signed-off-by: Luca Boccassi 
> > ---
> >  man/man8/tc-skbprio.8 | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/man/man8/tc-skbprio.8 b/man/man8/tc-skbprio.8
> > index 844bbf46..8b259839 100644
> > --- a/man/man8/tc-skbprio.8
> > +++ b/man/man8/tc-skbprio.8
> > @@ -29,8 +29,8 @@ While SKB Priority Queue is agnostic to how
> >  .B skb->priority
> >  is assigned. A typical use case is to copy
> >  the 6-bit DS field of IPv4 and IPv6 packets using
> > -.BR tc-skbedit (8)
> > -. If
> > +.BR tc-skbedit (8) .
> > +If
> 
> Won't that make the full-stop bold?
> Removing the space, i.e.
> .BR rc-skbedit (8).
> ought to work.
> -Ed

Yes you are right, fixed in v2, thanks.

-- 
Kind regards,
Luca Boccassi

signature.asc
Description: This is a digitally signed message part


Re: [PATCH iproute2] Fix warning in tc-skbprio.8 manpage

2018-11-02 Thread Edward Cree
On 02/11/18 10:57, Luca Boccassi wrote:
> ". If" gets interpreted as a macro, so move the period to the previous
> line:
>
>   33: warning: macro `If' not defined
>
> Fixes: 141b55f8544e ("Add SKB Priority qdisc support in tc(8)")
>
> Signed-off-by: Luca Boccassi 
> ---
>  man/man8/tc-skbprio.8 | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/man/man8/tc-skbprio.8 b/man/man8/tc-skbprio.8
> index 844bbf46..8b259839 100644
> --- a/man/man8/tc-skbprio.8
> +++ b/man/man8/tc-skbprio.8
> @@ -29,8 +29,8 @@ While SKB Priority Queue is agnostic to how
>  .B skb->priority
>  is assigned. A typical use case is to copy
>  the 6-bit DS field of IPv4 and IPv6 packets using
> -.BR tc-skbedit (8)
> -. If
> +.BR tc-skbedit (8) .
> +If
Won't that make the full-stop bold?
Removing the space, i.e.
.BR rc-skbedit (8).
ought to work.
-Ed
>  .B skb->priority
>  is greater or equal to 64, the priority is assumed to be 63.
>  Priorities less than 64 are taken at face value.



ethtool 4.19 released

2018-11-02 Thread John W. Linville
ethtool version 4.19 has been released.

Home page: https://www.kernel.org/pub/software/network/ethtool/
Download link:
https://www.kernel.org/pub/software/network/ethtool/ethtool-4.19.tar.xz

Release notes:

* Feature: support combinations of FEC modes
* Feature: better syntax for combinations of FEC modes
* Fix: Fix uninitialized variable use at qsfp dump

John
-- 
John W. Linville	Someday the world will need a hero, and you
linvi...@tuxdriver.com	might be all we have.  Be ready.



Re: Help with the BPF verifier

2018-11-02 Thread Arnaldo Carvalho de Melo
Em Thu, Nov 01, 2018 at 08:05:07PM +, Edward Cree escreveu:
> On 01/11/18 18:52, Arnaldo Carvalho de Melo wrote:
> >  R0=inv(id=0) R1=inv6 R2=inv6 R3=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv64 R10=fp0,call_-1
> > 15: (b7) r2 = 0
> > 16: (63) *(u32 *)(r10 -260) = r2
> > 17: (67) r1 <<= 32
> > 18: (77) r1 >>= 32
> > 19: (67) r1 <<= 3
> > 20: (bf) r2 = r6
> > 21: (0f) r2 += r1
> > 22: (79) r3 = *(u64 *)(r2 +16)
> > R2 invalid mem access 'inv'
> I wonder if you could run this with verifier log level 2?  (I'm not sure how
>  you would go about plumbing that through the perf tooling.)  It seems very
>  odd that it ends up with R2=inv, and I'm wondering whether R1 becomes unknown
>  during the shifts or whether the addition in insn 21 somehow produces the
>  unknown-ness.  (I know we used to have a thing[1] where doing ptr += K and
>  then also having an offset in the LDX produced an error about
>  ptr+const+const, but that seems to have been fixed at some point.)
> 
> Note however that even if we get past this, R1 at this point holds 6, so it
>  looks like the verifier is walking the impossible path where we're inside the
>  'if' even though filename_arg = 6.  This is a (slightly annoying) verifier
>  limitation, that it walks paths with impossible combinations of constraints
>  (we've previously had cases where assertions in the verifier would blow up
>  because of this, e.g. registers with max_val less than min_val).  So if the
>  check_ctx_access() is going to worry about whether you're off the end of the
>  array (I'm not sure what your program type is and thus which is_valid_access
>  callback is involved), then it'll complain about this.
> If filename_arg came from some external source you'd have a different
>  problem, because then it would have a totally unknown value, that on entering
>  the 'if' becomes "unknown but < 6", which is still too variable to have as
>  the offset of a ctx access.  Those have to be at a known constant offset, so
>  that we can determine the type of the returned value.
> 
> As a way to fix this, how about [UNTESTED!]:
>     const void *filename_arg = NULL;
>     /* ... */
>     switch (augmented_args.args.syscall_nr) {
>         case SYS_OPEN: filename_arg = args->args[0]; break;
>         case SYS_OPENAT: filename_arg = args->args[1]; break;
>     }
>     /* ... */
>     if (filename_arg) {
>     /* stuff */
>     blah = probe_read_str(/* ... */, filename_arg);
>     } else {
>     /* the other stuff */
>     }
> That way, you're only ever dealing in constant pointers (although judging by
>  an old thread I found[1] about ptr+const+const, the compiler might decide to
>  make some optimisations that end up looking like your existing code).

Yeah, didn't work as well:

SEC("raw_syscalls:sys_enter")
int sys_enter(struct syscall_enter_args *args)
{
struct {
struct syscall_enter_args args;
struct augmented_filename filename;
} augmented_args;
unsigned int len = sizeof(augmented_args);
const void *filename_arg = NULL;

probe_read(&augmented_args.args, sizeof(augmented_args.args), args);

switch (augmented_args.args.syscall_nr) {
case SYS_OPEN:   filename_arg = (const void *)args->args[0]; break;
case SYS_OPENAT: filename_arg = (const void *)args->args[1]; break;
}

if (filename_arg != NULL) {
augmented_args.filename.reserved = 0;
		augmented_args.filename.size = probe_read_str(&augmented_args.filename.value,
							      sizeof(augmented_args.filename.value),
							      filename_arg);
		if (augmented_args.filename.size < sizeof(augmented_args.filename.value)) {
			len -= sizeof(augmented_args.filename.value) - augmented_args.filename.size;
			len &= sizeof(augmented_args.filename.value) - 1;
		}
} else {
len = sizeof(augmented_args.args);
}

	perf_event_output(args, &__augmented_syscalls__, BPF_F_CURRENT_CPU, &augmented_args, len);
return 0;
}

And the -vv in 'perf trace' didn't seem to map to further details in the
output of the verifier debug:

# trace -vv -e tools/perf/examples/bpf/augmented_raw_syscalls.c sleep 1
bpf: builtin compilation failed: -95, try external compiler
Kernel build dir is set to /lib/modules/4.19.0-rc8-00014-gc0cff31be705/build
set env: KBUILD_DIR=/lib/modules/4.19.0-rc8-00014-gc0cff31be705/build
unset env: KBUILD_OPTS
include option is set to  -nostdinc -isystem 
/usr/lib/gcc/x86_64-redhat-linux/7/include 
-I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  
-I/home/acme/git/linux/include -I./include 
-I/home/acme/git/linux/arch/x86/include/uapi 
-I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi 
-I./include/generated/uapi -include 
/hom

Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-02 Thread Aaron Lu
On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote:
> On Fri, 2 Nov 2018 13:23:56 +0800
> Aaron Lu  wrote:
> 
> > On Thu, Nov 01, 2018 at 08:23:19PM +, Saeed Mahameed wrote:
> > > On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:  
> > > > On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
> > > > wrote:
> > > > ... ...  
> > > > > Section copied out:
> > > > > 
> > > > >   mlx5e_poll_tx_cq
> > > > >   |  
> > > > >--16.34%--napi_consume_skb
> > > > >  |  
> > > > >  |--12.65%--__free_pages_ok
> > > > >  |  |  
> > > > >  |   --11.86%--free_one_page
> > > > >  | |  
> > > > >  | |--10.10%
> > > > > --queued_spin_lock_slowpath
> > > > >  | |  
> > > > >  |  --0.65%--_raw_spin_lock  
> > > > 
> > > > This callchain looks like it is freeing higher order pages than order
> > > > 0:
> > > > __free_pages_ok is only called for pages whose order are bigger than
> > > > 0.  
> > > 
> > > mlx5 rx uses only order 0 pages, so i don't know where these high order
> > > tx SKBs are coming from..   
> > 
> > Perhaps here:
> > __netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and
> > __napi_alloc_frag() will all call page_frag_alloc(), which will use
> > __page_frag_cache_refill() to get an order 3 page if possible, or fall
> > back to an order 0 page if order 3 page is not available.
> > 
> > I'm not sure if your workload will use the above code path though.
> 
> TL;DR: this is order-0 pages (code-walk trough proof below)
> 
> To Aaron, the network stack *can* call __free_pages_ok() with order-0
> pages, via:
> 
> static void skb_free_head(struct sk_buff *skb)
> {
>   unsigned char *head = skb->head;
> 
>   if (skb->head_frag)
>   skb_free_frag(head);
>   else
>   kfree(head);
> }
> 
> static inline void skb_free_frag(void *addr)
> {
>   page_frag_free(addr);
> }
> 
> /*
>  * Frees a page fragment allocated out of either a compound or order 0 page.
>  */
> void page_frag_free(void *addr)
> {
>   struct page *page = virt_to_head_page(addr);
> 
>   if (unlikely(put_page_testzero(page)))
>   __free_pages_ok(page, compound_order(page));
> }
> EXPORT_SYMBOL(page_frag_free);

I think there is a problem here: order-0 pages are freed directly to the
buddy allocator, bypassing the per-cpu pages. This might be the reason
lock contention appeared on the free path. Can someone apply the diff
below and see if the lock contention is gone?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2ef1c17942f..65c0ae13215a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4554,8 +4554,14 @@ void page_frag_free(void *addr)
 {
struct page *page = virt_to_head_page(addr);
 
-   if (unlikely(put_page_testzero(page)))
-   __free_pages_ok(page, compound_order(page));
+   if (unlikely(put_page_testzero(page))) {
+   unsigned int order = compound_order(page);
+
+   if (order == 0)
+   free_unref_page(page);
+   else
+   __free_pages_ok(page, order);
+   }
 }
 EXPORT_SYMBOL(page_frag_free);
 
> Notice for the mlx5 driver it supports several RX-memory models, so it
> can be hard to follow, but from the perf report output we can see that
> it uses mlx5e_skb_from_cqe_linear, which uses build_skb.
> 
> --13.63%--mlx5e_skb_from_cqe_linear
>   |  
>--5.02%--build_skb
>  |  
>   --1.85%--__build_skb
> |  
>  --1.00%--kmem_cache_alloc
> 
> /* build_skb() is wrapper over __build_skb(), that specifically
>  * takes care of skb->head and skb->pfmemalloc
>  * This means that if @frag_size is not zero, then @data must be backed
>  * by a page fragment, not kmalloc() or vmalloc()
>  */
> struct sk_buff *build_skb(void *data, unsigned int frag_size)
> {
>   struct sk_buff *skb = __build_skb(data, frag_size);
> 
>   if (skb && frag_size) {
>   skb->head_frag = 1;
>   if (page_is_pfmemalloc(virt_to_head_page(data)))
>   skb->pfmemalloc = 1;
>   }
>   return skb;
> }
> EXPORT_SYMBOL(build_skb);
> 
> It still doesn't prove that the @data is backed by an order-0 page.
> For the mlx5 driver it uses mlx5e_page_alloc_mapped ->
> page_pool_dev_alloc_pages(), and I can see perf report using
> __page_pool_alloc_pages_slow().
> 
> The setup for page_pool in mlx5 uses order=0.
> 
>   /* Create a page_pool and register it with rxq */
>   pp_params.order = 0;
>   pp_params.flags = 0; /* No-internal DMA mapping in page_pool */
>   pp_params.pool_size = pool_size;
>   pp_params.nid   = cpu_to_node(c->cpu);
>   pp_params.d

Re: [RFC PATCH v3 06/10] udp: cope with UDP GRO packet misdirection

2018-11-02 Thread Paolo Abeni
On Thu, 2018-11-01 at 17:01 -0400, Willem de Bruijn wrote:
> On Wed, Oct 31, 2018 at 5:57 AM Paolo Abeni  wrote:
> > @@ -450,4 +457,32 @@ DECLARE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
> > >  void udpv6_encap_enable(void);
> > >  #endif
> > > 
> > > +static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
> > > +   struct sk_buff *skb)
> > > +{
> > > + bool ipv4 = skb->protocol == htons(ETH_P_IP);
> > 
> > And this cause a compile warning when # CONFIG_IPV6 is not set, I will
> > fix in the next iteration (again thanks kbuildbot)
> 
> Can also just pass it as argument. 

Agreed. 

> This skb->protocol should work correctly
> with tunneled packets, but it wasn't as immediately obvious to me.
> 
> Also
> 
> +   if (unlikely(!segs))
> +   goto drop;
> 
> this should not happen. But if it could and the caller treats it the
> same as error (both now return NULL), then skb needs to be freed.

Right you are. Will do in the next iteration.

Thanks,

Paolo



Re: [RFC PATCH v3 01/10] udp: implement complete book-keeping for encap_needed

2018-11-02 Thread Paolo Abeni
Hi,

On Thu, 2018-11-01 at 16:59 -0400, Willem de Bruijn wrote:
> On Tue, Oct 30, 2018 at 1:28 PM Paolo Abeni  wrote:
> > 
> > The *encap_needed static keys are enabled by UDP tunnels
> > and several UDP encapsulations type, but they are never
> > turned off. This can cause unneeded overall performance
> > degradation for systems where such features are used
> > transiently.
> > 
> > This patch introduces complete book-keeping for such keys,
> > decreasing the usage at socket destruction time, if needed,
> > and avoiding that the same socket could increase the key
> > usage multiple times.
> > 
> > rfc v2 - rfc v3:
> >  - use udp_tunnel_encap_enable() in setsockopt()
> > 
> > Signed-off-by: Paolo Abeni 
> > @@ -2447,7 +2452,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, 
> > int optname,
> > /* FALLTHROUGH */
> > case UDP_ENCAP_L2TPINUDP:
> > up->encap_type = val;
> > -   udp_encap_enable();
> > +   udp_tunnel_encap_enable(sk->sk_socket);
> 
> this now also needs lock_sock?

Yep, you are right. I'll add it in the next iteration.

Thanks,

Paolo



Re: [RFC PATCH v3 04/10] ip: factor out protocol delivery helper

2018-11-02 Thread Paolo Abeni
On Thu, 2018-11-01 at 00:35 -0600, Subash Abhinov Kasiviswanathan
wrote:
> On 2018-10-30 11:24, Paolo Abeni wrote:
> > So that we can re-use it at the UDP lavel in a later patch
> > 
> 
> Hi Paolo
> 
> Minor queries -
> Should it be "level" instead of "lavel"? Similar comment for the ipv6
> patch as well.

Indeed. Will fix in the next iteration.

> > diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> > index 35a786c0aaa0..72250b4e466d 100644
> > --- a/net/ipv4/ip_input.c
> > +++ b/net/ipv4/ip_input.c
> > @@ -188,51 +188,50 @@ bool ip_call_ra_chain(struct sk_buff *skb)
> > return false;
> >  }
> > 
> > -static int ip_local_deliver_finish(struct net *net, struct sock *sk,
> > struct sk_buff *skb)
> > +void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb,
> > int protocol)
> 
> Would it be better if this function was declared in include/net/ip.h &
> include/net/ipv6.h rather than in net/ipv4/udp.c & net/ipv6/udp.c as in
> the patch "udp: cope with UDP GRO packet misdirection"
> 
> diff --git a/include/net/ip.h b/include/net/ip.h
> index 72593e1..3d7fdb4 100644
> --- a/include/net/ip.h
> +++ b/include/net/ip.h
> @@ -717,4 +717,6 @@ static inline void ip_cmsg_recv(struct msghdr *msg, 
> struct sk_buff *skb)
>   int rtm_getroute_parse_ip_proto(struct nlattr *attr, u8 *ip_proto,
>  struct netlink_ext_ack *extack);
> 
> +void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
> proto);
> +
>   #endif /* _IP_H */
> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> index 8296505..4d4d2cfe 100644
> --- a/include/net/ipv6.h
> +++ b/include/net/ipv6.h
> @@ -1101,4 +1101,8 @@ int ipv6_sock_mc_join_ssm(struct sock *sk, int 
> ifindex,
>const struct in6_addr *addr, unsigned int 
> mode);
>   int ipv6_sock_mc_drop(struct sock *sk, int ifindex,
>const struct in6_addr *addr);
> +
> +void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
> nexthdr,
> +  bool have_final);
> +
>   #endif /* _NET_IPV6_H */
> 

Agreed, I will do in the next iteration. 

Thanks,

Paolo



[PATCH iproute2] man: ss.8: break and indent long line

2018-11-02 Thread Luca Boccassi
Fixes groff warning:
  ss.8 92: warning [p 2, 2.8i]: can't break line

And makes the line also more readable.

Signed-off-by: Luca Boccassi 
---
This line is an example of output, so you might not want to break it.
Let me know if you prefer a different solution.

 man/man8/ss.8 | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/man/man8/ss.8 b/man/man8/ss.8
index 7a6572b1..699a1271 100644
--- a/man/man8/ss.8
+++ b/man/man8/ss.8
@@ -89,7 +89,13 @@ an uuid of the socket
 Show socket memory usage. The output format is:
 .RS
 .P
-skmem:(r,rb,t,tb,f,w,o,bl)
+skmem:(r,rb,t,tb,f,
+.br
+.RS
+.RS
+w,o,bl)
+.RE
+.RE
 .P
 .TP
 .B 
-- 
2.19.1



[PATCH iproute2 1/2] testsuite: build generate_nlmsg with QUIET_CC

2018-11-02 Thread Luca Boccassi
Follow the standard pattern, and respect user's verbosity setting.

Signed-off-by: Luca Boccassi 
---
 testsuite/tools/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/testsuite/tools/Makefile b/testsuite/tools/Makefile
index e1d9bfef..85df69ec 100644
--- a/testsuite/tools/Makefile
+++ b/testsuite/tools/Makefile
@@ -3,7 +3,7 @@ CFLAGS=
 include ../../config.mk
 
 generate_nlmsg: generate_nlmsg.c ../../lib/libnetlink.c
-   $(CC) $(CPPFLAGS) $(CFLAGS) $(EXTRA_CFLAGS) -I../../include 
-include../../include/uapi/linux/netlink.h -o $@ $^ -lmnl
+   $(QUIET_CC) $(CPPFLAGS) $(CFLAGS) $(EXTRA_CFLAGS) -I../../include 
-include../../include/uapi/linux/netlink.h -o $@ $^ -lmnl
 
 clean:
rm -f generate_nlmsg
-- 
2.19.1



[PATCH iproute2 2/2] Pass CPPFLAGS to the compiler

2018-11-02 Thread Luca Boccassi
When building Debian packages, pre-processor flags are passed via
CPPFLAGS, as the convention indicates. Specifically, the hardening
-D_FORTIFY_SOURCE=2 flag is used.
Pass CPPFLAGS to all calls of QUIET_CC together with CFLAGS.

Signed-off-by: Luca Boccassi 
---
 configure | 2 +-
 misc/Makefile | 8 
 tc/Makefile   | 8 
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/configure b/configure
index c5655978..5df6082b 100755
--- a/configure
+++ b/configure
@@ -436,4 +436,4 @@ check_cap
 
 echo >> $CONFIG
 echo "%.o: %.c" >> $CONFIG
-echo ' $(QUIET_CC)$(CC) $(CFLAGS) $(EXTRA_CFLAGS) -c -o $@ $<' >> $CONFIG
+echo ' $(QUIET_CC)$(CC) $(CFLAGS) $(EXTRA_CFLAGS) $(CPPFLAGS) -c -o $@ $<' >> 
$CONFIG
diff --git a/misc/Makefile b/misc/Makefile
index b2dd6b26..6a849af4 100644
--- a/misc/Makefile
+++ b/misc/Makefile
@@ -16,16 +16,16 @@ ss: $(SSOBJ)
$(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
 
 nstat: nstat.c
-   $(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o nstat nstat.c $(LDLIBS) -lm
+   $(QUIET_CC)$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -o nstat nstat.c 
$(LDLIBS) -lm
 
 ifstat: ifstat.c
-   $(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o ifstat ifstat.c $(LDLIBS) -lm
+   $(QUIET_CC)$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -o ifstat ifstat.c 
$(LDLIBS) -lm
 
 rtacct: rtacct.c
-   $(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o rtacct rtacct.c $(LDLIBS) -lm
+   $(QUIET_CC)$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -o rtacct rtacct.c 
$(LDLIBS) -lm
 
 arpd: arpd.c
-   $(QUIET_CC)$(CC) $(CFLAGS) -I$(DBM_INCLUDE) $(LDFLAGS) -o arpd arpd.c 
$(LDLIBS) -ldb
+   $(QUIET_CC)$(CC) $(CFLAGS) -I$(DBM_INCLUDE) $(CPPFLAGS) $(LDFLAGS) -o 
arpd arpd.c $(LDLIBS) -ldb
 
 ssfilter.c: ssfilter.y
$(QUIET_YACC)bison ssfilter.y -o ssfilter.c
diff --git a/tc/Makefile b/tc/Makefile
index 25a28284..f8010d3c 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -128,7 +128,7 @@ CFLAGS += -DYY_NO_INPUT
 MODDESTDIR := $(DESTDIR)$(LIBDIR)/tc
 
 %.so: %.c
-   $(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -shared -fpic $< -o $@
+   $(QUIET_CC)$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -shared -fpic $< -o $@
 
 
 all: tc $(TCSO)
@@ -158,13 +158,13 @@ clean:
rm -f emp_ematch.yacc.*
 
 q_atm.so: q_atm.c
-   $(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -shared -fpic -o q_atm.so q_atm.c 
-latm
+   $(QUIET_CC)$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -shared -fpic -o 
q_atm.so q_atm.c -latm
 
 m_xt.so: m_xt.c
-   $(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -shared -fpic -o m_xt.so m_xt.c 
$$($(PKG_CONFIG) xtables --cflags --libs)
+   $(QUIET_CC)$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -shared -fpic -o 
m_xt.so m_xt.c $$($(PKG_CONFIG) xtables --cflags --libs)
 
 m_xt_old.so: m_xt_old.c
-   $(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -shared -fpic -o m_xt_old.so 
m_xt_old.c $$($(PKG_CONFIG) xtables --cflags --libs)
+   $(QUIET_CC)$(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) -shared -fpic -o 
m_xt_old.so m_xt_old.c $$($(PKG_CONFIG) xtables --cflags --libs)
 
 em_ipset.o: CFLAGS += $$($(PKG_CONFIG) xtables --cflags)
 
-- 
2.19.1



Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-02 Thread Jesper Dangaard Brouer
On Fri, 2 Nov 2018 13:23:56 +0800
Aaron Lu  wrote:

> On Thu, Nov 01, 2018 at 08:23:19PM +, Saeed Mahameed wrote:
> > On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:  
> > > On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
> > > wrote:
> > > ... ...  
> > > > Section copied out:
> > > > 
> > > >   mlx5e_poll_tx_cq
> > > >   |  
> > > >--16.34%--napi_consume_skb
> > > >  |  
> > > >  |--12.65%--__free_pages_ok
> > > >  |  |  
> > > >  |   --11.86%--free_one_page
> > > >  | |  
> > > >  | |--10.10%
> > > > --queued_spin_lock_slowpath
> > > >  | |  
> > > >  |  --0.65%--_raw_spin_lock  
> > > 
> > > This callchain looks like it is freeing higher order pages than order
> > > 0:
> > > __free_pages_ok is only called for pages whose order are bigger than
> > > 0.  
> > 
> > mlx5 rx uses only order 0 pages, so i don't know where these high order
> > tx SKBs are coming from..   
> 
> Perhaps here:
> __netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and
> __napi_alloc_frag() will all call page_frag_alloc(), which will use
> __page_frag_cache_refill() to get an order 3 page if possible, or fall
> back to an order 0 page if order 3 page is not available.
> 
> I'm not sure if your workload will use the above code path though.

TL;DR: these are order-0 pages (code walk-through proof below)

To Aaron, the network stack *can* call __free_pages_ok() with order-0
pages, via:

static void skb_free_head(struct sk_buff *skb)
{
unsigned char *head = skb->head;

if (skb->head_frag)
skb_free_frag(head);
else
kfree(head);
}

static inline void skb_free_frag(void *addr)
{
page_frag_free(addr);
}

/*
 * Frees a page fragment allocated out of either a compound or order 0 page.
 */
void page_frag_free(void *addr)
{
struct page *page = virt_to_head_page(addr);

if (unlikely(put_page_testzero(page)))
__free_pages_ok(page, compound_order(page));
}
EXPORT_SYMBOL(page_frag_free);


Notice the mlx5 driver supports several RX-memory models, so it
can be hard to follow, but from the perf report output we can see that
it uses mlx5e_skb_from_cqe_linear, which uses build_skb.

--13.63%--mlx5e_skb_from_cqe_linear
  |  
   --5.02%--build_skb
 |  
  --1.85%--__build_skb
|  
 --1.00%--kmem_cache_alloc

/* build_skb() is wrapper over __build_skb(), that specifically
 * takes care of skb->head and skb->pfmemalloc
 * This means that if @frag_size is not zero, then @data must be backed
 * by a page fragment, not kmalloc() or vmalloc()
 */
struct sk_buff *build_skb(void *data, unsigned int frag_size)
{
struct sk_buff *skb = __build_skb(data, frag_size);

if (skb && frag_size) {
skb->head_frag = 1;
if (page_is_pfmemalloc(virt_to_head_page(data)))
skb->pfmemalloc = 1;
}
return skb;
}
EXPORT_SYMBOL(build_skb);

It still doesn't prove that the @data is backed by an order-0 page.
For the mlx5 driver, it uses mlx5e_page_alloc_mapped ->
page_pool_dev_alloc_pages(), and I can see perf report using
__page_pool_alloc_pages_slow().

The setup for page_pool in mlx5 uses order=0.

/* Create a page_pool and register it with rxq */
pp_params.order = 0;
pp_params.flags = 0; /* No-internal DMA mapping in page_pool */
pp_params.pool_size = pool_size;
pp_params.nid   = cpu_to_node(c->cpu);
pp_params.dev   = c->pdev;
pp_params.dma_dir   = rq->buff.map_dir;

/* page_pool can be used even when there is no rq->xdp_prog,
 * given page_pool does not handle DMA mapping there is no
 * required state to clear. And page_pool gracefully handle
 * elevated refcnt.
 */
rq->page_pool = page_pool_create(&pp_params);
if (IS_ERR(rq->page_pool)) {
err = PTR_ERR(rq->page_pool);
rq->page_pool = NULL;
goto err_free;
}
err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
 MEM_TYPE_PAGE_POOL, rq->page_pool);

 
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC bpf-next] libbpf: increase rlimit before trying to create BPF maps

2018-11-02 Thread Quentin Monnet
2018-11-02 10:08 UTC+0100 ~ Daniel Borkmann 
> On 11/01/2018 06:18 PM, Quentin Monnet wrote:
>> 2018-10-30 15:23 UTC+ ~ Quentin Monnet 
>>> The limit for memory locked in the kernel by a process is usually set to
>>> 64 KiB by default. This can be an issue when creating large BPF maps.
>>> A workaround is to raise this limit for the current process before
>>> trying to create a new BPF map. Changing the hard limit requires the
>>> CAP_SYS_RESOURCE and can usually only be done by root user (but then
>>> only root can create BPF maps).
>>
>> Sorry, the parenthesis is not correct: non-root users can in fact create
>> BPF maps as well. If a non-root user calls the function to create a map,
>> setrlimit() will fail silently (but set errno), and the program will
>> simply go on with its rlimit unchanged.
>>
>>> As far as I know there is no API to get the current amount of memory
>>> locked for a user, therefore we cannot raise the limit only when
>>> required. One solution, used by bcc, is to try to create the map, and on
>>> getting a EPERM error, raising the limit to infinity before giving
>>> another try. Another approach, used in iproute, is to raise the limit in
>>> all cases, before trying to create the map.
>>>
>>> Here we do the same as in iproute2: the rlimit is raised to infinity
>>> before trying to load the map.
>>>
>>> I send this patch as a RFC to see if people would prefer the bcc
>>> approach instead, or the rlimit change to be in bpftool rather than in
>>> libbpf.
> 
> I'd avoid doing something like this in a generic library; it's basically an
> ugly hack for the kind of accounting we're doing and only shows that while
> this was "good enough" to start off with in the early days, we should be
> doing something better today if every application raises it to inf anyway
> then it's broken. :) It just shows that this missed its purpose. Similarly
> to the jit_limit discussion on rlimit, perhaps we should be considering
> switching to something else entirely from kernel side. Could be something
> like memcg but this definitely needs some more evaluation first.

Changing the way limitations are enforced sounds like a cleaner
long-term approach indeed.

> (Meanwhile
> I'd not change the lib but callers instead and once we have something better
> in place we remove this type of "raising to inf" from the tree ...)

Understood, for the time being I'll repost a patch adding the
modification to bpftool once bpf-next is open.

Thanks Daniel!
Quentin


[PATCH iproute2] Fix warning in tc-skbprio.8 manpage

2018-11-02 Thread Luca Boccassi
". If" gets interpreted as a macro, so move the period to the previous
line:

  33: warning: macro `If' not defined

Fixes: 141b55f8544e ("Add SKB Priority qdisc support in tc(8)")

Signed-off-by: Luca Boccassi 
---
 man/man8/tc-skbprio.8 | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/man/man8/tc-skbprio.8 b/man/man8/tc-skbprio.8
index 844bbf46..8b259839 100644
--- a/man/man8/tc-skbprio.8
+++ b/man/man8/tc-skbprio.8
@@ -29,8 +29,8 @@ While SKB Priority Queue is agnostic to how
 .B skb->priority
 is assigned. A typical use case is to copy
 the 6-bit DS field of IPv4 and IPv6 packets using
-.BR tc-skbedit (8)
-. If
+.BR tc-skbedit (8) .
+If
 .B skb->priority
 is greater or equal to 64, the priority is assumed to be 63.
 Priorities less than 64 are taken at face value.
-- 
2.19.1



Re: [PATCH bpf] bpf: fix bpf_prog_get_info_by_fd to return 0 func_lens for unpriv

2018-11-02 Thread Sandipan Das



On 02/11/18 4:05 PM, Daniel Borkmann wrote:
> While dbecd7388476 ("bpf: get kernel symbol addresses via syscall")
> zeroed info.nr_jited_ksyms in bpf_prog_get_info_by_fd() for queries
> from unprivileged users, commit 815581c11cc2 ("bpf: get JITed image
> lengths of functions via syscall") forgot about doing so and therefore
> returns the #elems of the user set up buffer which is incorrect. It
> also needs to indicate a info.nr_jited_func_lens of zero.
> 
> Fixes: 815581c11cc2 ("bpf: get JITed image lengths of functions via syscall")
> Signed-off-by: Daniel Borkmann 
> Cc: Sandipan Das 
> Cc: Song Liu 
> ---
>  kernel/bpf/syscall.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index ccb9327..1ea5ce1 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2078,6 +2078,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog 
> *prog,
>   info.jited_prog_len = 0;
>   info.xlated_prog_len = 0;
>   info.nr_jited_ksyms = 0;
> + info.nr_jited_func_lens = 0;
>   goto done;
>   }
> 

Looks good. Thanks for fixing this. 

- Sandipan



[PATCH bpf] bpf: fix bpf_prog_get_info_by_fd to return 0 func_lens for unpriv

2018-11-02 Thread Daniel Borkmann
While dbecd7388476 ("bpf: get kernel symbol addresses via syscall")
zeroed info.nr_jited_ksyms in bpf_prog_get_info_by_fd() for queries
from unprivileged users, commit 815581c11cc2 ("bpf: get JITed image
lengths of functions via syscall") forgot about doing so and therefore
returns the #elems of the user set up buffer which is incorrect. It
also needs to indicate a info.nr_jited_func_lens of zero.

Fixes: 815581c11cc2 ("bpf: get JITed image lengths of functions via syscall")
Signed-off-by: Daniel Borkmann 
Cc: Sandipan Das 
Cc: Song Liu 
---
 kernel/bpf/syscall.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ccb9327..1ea5ce1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2078,6 +2078,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
info.jited_prog_len = 0;
info.xlated_prog_len = 0;
info.nr_jited_ksyms = 0;
+   info.nr_jited_func_lens = 0;
goto done;
}
 
-- 
2.9.5



Re: [PATCH bpf 2/3] bpf: show real jited address in bpf_prog_info->jited_ksyms

2018-11-02 Thread Daniel Borkmann
On 11/02/2018 11:09 AM, Daniel Borkmann wrote:
> On 11/01/2018 08:00 AM, Song Liu wrote:
>> Currently, jited_ksyms in bpf_prog_info shows page addresses of jited
>> bpf program. This is not ideal for detailed profiling (find hot
>> instructions from stack traces). This patch replaces the page address
>> with real prog start address.
>>
>> Signed-off-by: Song Liu 
>> ---
>>  kernel/bpf/syscall.c | 1 -
>>  1 file changed, 1 deletion(-)
>>
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index ccb93277aae2..34a9eef5992c 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -2172,7 +2172,6 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog 
>> *prog,
>>  user_ksyms = u64_to_user_ptr(info.jited_ksyms);
>>  for (i = 0; i < ulen; i++) {
>>  ksym_addr = (ulong) 
>> prog->aux->func[i]->bpf_func;
>> -ksym_addr &= PAGE_MASK;
> 
> Note that the masking was done on purpose here and in patch 1/3 in order to
> not expose randomized start address to kallsyms at least. I suppose it's
> okay to change it here and for kallsyms given bpf_prog_get_info_by_fd() dump
> is for root only, and in each of the two cases we additionally apply
> kallsyms_show_value() logic, so for unpriv this is zeroed out plus only root
> loaded programs are added under kallsyms (capable(CAP_SYS_ADMIN)) anyway.

(Btw, something like the above should have been in the changelog to provide
some more historical context on why we used to do it like that, and to
explain why it is okay to change it this way.)

>>  if (put_user((u64) ksym_addr, &user_ksyms[i]))
>>  return -EFAULT;
>>  }
>>
> 



Re: [PATCH bpf 2/3] bpf: show real jited address in bpf_prog_info->jited_ksyms

2018-11-02 Thread Daniel Borkmann
On 11/01/2018 08:00 AM, Song Liu wrote:
> Currently, jited_ksyms in bpf_prog_info shows page addresses of jited
> bpf program. This is not ideal for detailed profiling (find hot
> instructions from stack traces). This patch replaces the page address
> with real prog start address.
> 
> Signed-off-by: Song Liu 
> ---
>  kernel/bpf/syscall.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index ccb93277aae2..34a9eef5992c 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2172,7 +2172,6 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog 
> *prog,
>   user_ksyms = u64_to_user_ptr(info.jited_ksyms);
>   for (i = 0; i < ulen; i++) {
>   ksym_addr = (ulong) 
> prog->aux->func[i]->bpf_func;
> - ksym_addr &= PAGE_MASK;

Note that the masking was done on purpose here and in patch 1/3 in order to
not expose randomized start address to kallsyms at least. I suppose it's
okay to change it here and for kallsyms given bpf_prog_get_info_by_fd() dump
is for root only, and in each of the two cases we additionally apply
kallsyms_show_value() logic, so for unpriv this is zeroed out plus only root
loaded programs are added under kallsyms (capable(CAP_SYS_ADMIN)) anyway.

>   if (put_user((u64) ksym_addr, &user_ksyms[i]))
>   return -EFAULT;
>   }
> 



Re: Ethernet on my CycloneV broke since 4.9.124

2018-11-02 Thread Clément Péron
Hi Dinh,

On Wed, 31 Oct 2018 at 23:02, Dinh Nguyen  wrote:
>
> Hi Clement,
>
> On 10/31/2018 10:36 AM, Clément Péron wrote:
> > Hi Dinh,
> >
> > On Wed, 31 Oct 2018 at 15:42, Dinh Nguyen  wrote:
> >>
> >> Hi Clement,
> >>
> >> On 10/31/2018 08:01 AM, Clément Péron wrote:
> >>> Hi,
> >>>
> >>> The patch "net: stmmac: socfpga: add additional ocp reset line for
> >>> Stratix10" introduce in 4.9.124 broke the ethernet on my CycloneV
> >>> board.
> >>>
> >>> When I boot i have this issue :
> >>>
> >>> socfpga-dwmac ff702000.ethernet: error getting reset control of ocp -2
> >>> socfpga-dwmac: probe of ff702000.ethernet failed with error -2
> >>>
> >>> Reverting the commit : 6f37f7b62baa6a71d7f3f298acb64de51275e724 fix the 
> >>> issue.
> >>>
> >>
> >> Are you sure? I just booted v4.9.124 and did not see any errors. The
> >> error should not appear because the commit is using
> >> devm_reset_control_get_optional().
> >
> > I'm booting on 4.9.130 actually. I agree with you that
> > devm_reset_control_get_optional() should not fail, but checking other
> > usage of this helper
> > https://elixir.bootlin.com/linux/v4.9.135/source/drivers/i2c/busses/i2c-mv64xxx.c#L824
> > https://elixir.bootlin.com/linux/v4.9.135/source/drivers/crypto/sunxi-ss/sun4i-ss-core.c#L259
> > Show that they don't check for errors except for PROBE_DEFER
> >
>
> I made a mistake, I was booting linux-next. I am seeing the error with
> v4.9.124. It's due to this commit not getting backported:
>
> "bb475230b8e59a reset: make optional functions really optional"
>
> I have backported the patch and is available here if you like to take a
> look:

Thanks, works fine on my board too.
Regards,
Clement

>
> https://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux.git/log/?h=v4.9.124_optional_reset
>
> Dinh


Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed

2018-11-02 Thread Stefano Brivio
On Thu, 1 Nov 2018 14:06:23 -0700
Jakub Kicinski  wrote:

> JSONification would probably be quite an undertaking for ss :(

Probably not too much of an undertaking: for JSON we would skip
buffering entirely, as we don't need to print columns, so buffering and
rendering performance wouldn't be a concern.

All it takes is to extend the out() function a bit and pass the right
arguments to it -- it looks much easier than implementing format
strings, even though I'd like the latter more.

-- 
Stefano


Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed

2018-11-02 Thread Stefano Brivio
On Wed, 31 Oct 2018 20:48:05 -0600
David Ahern  wrote:

> On 10/30/18 11:34 AM, Stefano Brivio wrote:
> > 
> > - the current column abstraction is rather lightweight, things are
> >   already buffered in the defined column order so we don't have to jump
> >   back and forth in the buffer while rendering. Doing that needs some
> >   extra care to avoid a performance hit, but it's probably doable, I
> >   can put that on my to-do list  
> 
> The ss command is always a pain; it's much better in newer releases but
> I am always having to adjust terminal width.

Ouch. Do you have some examples?

-- 
Stefano


Re: [PATCH bpf 3/3] bpf: show main program address in bpf_prog_info->jited_ksyms

2018-11-02 Thread Daniel Borkmann
On 11/01/2018 08:00 AM, Song Liu wrote:
> Currently, when there is no subprog (prog->aux->func_cnt == 0),
> bpf_prog_info does not return any jited_ksyms. This patch adds
> main program address (prog->bpf_func) to jited_ksyms.
> 
> Signed-off-by: Song Liu 
> ---
>  kernel/bpf/syscall.c | 16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 34a9eef5992c..7293b17ca62a 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2158,7 +2158,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog 
> *prog,
>   }
>  
>   ulen = info.nr_jited_ksyms;
> - info.nr_jited_ksyms = prog->aux->func_cnt;
> + info.nr_jited_ksyms = prog->aux->func_cnt ? : 1;
>   if (info.nr_jited_ksyms && ulen) {
>   if (bpf_dump_raw_ok()) {
>   u64 __user *user_ksyms;
> @@ -2170,9 +2170,17 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog 
> *prog,
>*/
>   ulen = min_t(u32, info.nr_jited_ksyms, ulen);
>   user_ksyms = u64_to_user_ptr(info.jited_ksyms);
> - for (i = 0; i < ulen; i++) {
> - ksym_addr = (ulong) 
> prog->aux->func[i]->bpf_func;
> - if (put_user((u64) ksym_addr, &user_ksyms[i]))
> + if (prog->aux->func_cnt) {
> + for (i = 0; i < ulen; i++) {
> + ksym_addr = (ulong)

Small nit: can we change ksym_addr, the above and below cast to kernel-style
'unsigned long' while at it?

> + prog->aux->func[i]->bpf_func;
> + if (put_user((u64) ksym_addr,
> +  &user_ksyms[i]))
> + return -EFAULT;
> + }
> + } else {
> + ksym_addr = (ulong) prog->bpf_func;
> + if (put_user((u64) ksym_addr, &user_ksyms[0]))
>   return -EFAULT;

If we do this here, I think we should also update nr_jited_func_lens to copy
prog->jited_len to user space to be consistent with this change here. In case
of multi-func, the latter copies the len of the main program, and the lens of
the subprogs. Given we push the address for it to user space, we should then
also push the main prog len if it's only main prog there so this case doesn't
need any special handling by user space.

>   }
>   } else {
> 



Re: [RFC bpf-next] libbpf: increase rlimit before trying to create BPF maps

2018-11-02 Thread Daniel Borkmann
On 11/01/2018 06:18 PM, Quentin Monnet wrote:
> 2018-10-30 15:23 UTC+ ~ Quentin Monnet 
>> The limit for memory locked in the kernel by a process is usually set to
>> 64 KiB by default. This can be an issue when creating large BPF maps.
>> A workaround is to raise this limit for the current process before
>> trying to create a new BPF map. Changing the hard limit requires the
>> CAP_SYS_RESOURCE and can usually only be done by root user (but then
>> only root can create BPF maps).
> 
> Sorry, the parenthesis is not correct: non-root users can in fact create
> BPF maps as well. If a non-root user calls the function to create a map,
> setrlimit() will fail silently (but set errno), and the program will
> simply go on with its rlimit unchanged.
> 
> As far as I know there is no API to get the current amount of memory
>> locked for a user, therefore we cannot raise the limit only when
>> required. One solution, used by bcc, is to try to create the map, and on
>> getting a EPERM error, raising the limit to infinity before giving
>> another try. Another approach, used in iproute, is to raise the limit in
>> all cases, before trying to create the map.
>>
>> Here we do the same as in iproute2: the rlimit is raised to infinity
>> before trying to load the map.
>>
>> I send this patch as a RFC to see if people would prefer the bcc
>> approach instead, or the rlimit change to be in bpftool rather than in
>> libbpf.

I'd avoid doing something like this in a generic library; it's basically an
ugly hack for the kind of accounting we're doing and only shows that while
this was "good enough" to start off with in the early days, we should be
doing something better today: if every application raises it to inf anyway,
then it's broken. :) It just shows that this missed its purpose. Similarly
to the jit_limit discussion on rlimit, perhaps we should be considering
switching to something else entirely from kernel side. Could be something
like memcg but this definitely needs some more evaluation first. (Meanwhile
I'd not change the lib but callers instead and once we have something better
in place we remove this type of "raising to inf" from the tree ...)

>> Signed-off-by: Quentin Monnet 
>> ---
>>  tools/lib/bpf/bpf.c | 5 +
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
>> index 03f9bcc4ef50..456a5a7b112c 100644
>> --- a/tools/lib/bpf/bpf.c
>> +++ b/tools/lib/bpf/bpf.c
>> @@ -26,6 +26,8 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>> +#include 
>>  #include "bpf.h"
>>  #include "libbpf.h"
>>  #include 
>> @@ -68,8 +70,11 @@ static inline int sys_bpf(enum bpf_cmd cmd, union 
>> bpf_attr *attr,
>>  int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr)
>>  {
>>  __u32 name_len = create_attr->name ? strlen(create_attr->name) : 0;
>> +struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
>>  union bpf_attr attr;
>>  
>> +setrlimit(RLIMIT_MEMLOCK, &rinf);
>> +
>>  memset(&attr, '\0', sizeof(attr));
>>  
>>  attr.map_type = create_attr->map_type;
>>
>