date:20181030

Re: Fw: [Bug 201423] New: eth0: hw csum failure

2018-10-30 Thread Andre Tomt

On 30.10.2018 12:04, Andre Tomt wrote:

On 30.10.2018 11:58, Andre Tomt wrote:

On 27.10.2018 23:41, Andre Tomt wrote:

On 26.10.2018 13:45, Andre Tomt wrote:

On 25.10.2018 19:38, Eric Dumazet wrote:

On 10/24/2018 12:41 PM, Andre Tomt wrote:

It eventually showed up again with mlx4, on 4.18.16 + fix and also
on 4.19. I still do not have a useful packet capture.

It is running a torrent client serving up various linux
distributions.

Have you also applied this fix ?

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=db4f1be3ca9b0ef7330763d07bf4ace83ad6f913

No. I've applied it now to 4.19 and will report back if anything
shows up.

Just hit it on the simpler server; no VRF, no tunnels, no
nat/conntrack. Only a basic stateless nftables ruleset and a vlan
netdev (unlikely to be the one triggering this I guess; it has only
v4 traffic).

I'm currently testing 4.19 with the recomended commit added, plus
these to sort out some GRO issues (on a hunch, unsure if related):
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=a8305bff685252e80b7c60f4f5e7dd2e63e38218

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=992cba7e276d438ac8b0a8c17b147b37c8c286f7

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=ece23711dd956cd5053c9cb03e9fe0668f9c8894

and I *think* it is behaving better now? it's not conclusive as it
could take a while to trip in this environment but some of the test
servers have not shown anything bad in almost 24h.

Sorry, s/some of the/none of the

I think it is fairly safe to say 4.19 + mlx4 + these 4 commits is OK. At
least for my workload. Servers are now 51-61 hours in, no splats. I also
added ntp pool traffic to one of them to make things a little more exciting.

Not sure what is needed for 4.18, I dont have the mental bandwidth to
test that right now. Also no idea about the similar looking mlx5 splats
reported elsewhere.

Re: [RFC PATCH 1/4] ptp: add PTP_SYS_OFFSET_EXTENDED ioctl

2018-10-30 Thread Richard Cochran

On Tue, Oct 30, 2018 at 07:23:51PM -0700, Richard Cochran wrote:
> This collection of automatic variables is getting ugly.  May I ask you
> to prefix a patch that puts them into reverse Christmas tree before
> your changes?  (Patch below)

Forgot the diff. Here it is...

---
diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c
index 2012551d93e0..b54b8158ff8a 100644
--- a/drivers/ptp/ptp_chardev.c
+++ b/drivers/ptp/ptp_chardev.c
@@ -121,18 +121,18 @@ int ptp_open(struct posix_clock *pc, fmode_t fmode)
 
 long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
 {
-   struct ptp_clock_caps caps;
-   struct ptp_clock_request req;
-   struct ptp_sys_offset *sysoff = NULL;
-   struct ptp_sys_offset_precise precise_offset;
-   struct ptp_pin_desc pd;
struct ptp_clock *ptp = container_of(pc, struct ptp_clock, clock);
+   struct ptp_sys_offset_precise precise_offset;
+   struct system_device_crosststamp xtstamp;
struct ptp_clock_info *ops = ptp->info;
+   struct ptp_sys_offset *sysoff = NULL;
+   struct ptp_clock_request req;
+   struct ptp_clock_caps caps;
struct ptp_clock_time *pct;
+   unsigned int i, pin_index;
+   struct ptp_pin_desc pd;
struct timespec64 ts;
-   struct system_device_crosststamp xtstamp;
int enable, err = 0;
-   unsigned int i, pin_index;
 
switch (cmd) {

Re: [PATCH v2 net] igb: shorten maximum PHC timecounter update interval

2018-10-30 Thread Richard Cochran

On Fri, Oct 26, 2018 at 07:13:00PM +0200, Miroslav Lichvar wrote:
> The timecounter needs to be updated at least once per ~550 seconds in
> order to avoid a 40-bit SYSTIM timestamp to be misinterpreted as an old
> timestamp.
> 
> Since commit 500462a9de65 ("timers: Switch to a non-cascading wheel"),
> scheduling of delayed work seems to be less accurate and a requested
> delay of 540 seconds may actually be longer than 550 seconds. Also, the
> PHC may be adjusted to run up to 6% faster than real time and the system
> clock up to 10% slower. Shorten the delay to 360 seconds to be sure the
> timecounter is updated in time.
> 
> This fixes an issue with HW timestamps on 82580/I350/I354 being off by
> ~1100 seconds for few seconds every ~9 minutes.

Acked-by: Richard Cochran

Re: [RFC PATCH 3/4] igb: add support for extended PHC gettime

2018-10-30 Thread Richard Cochran

On Fri, Oct 26, 2018 at 06:27:41PM +0200, Miroslav Lichvar wrote:
> +static int igb_ptp_gettimex(struct ptp_clock_info *ptp,
> + struct ptp_system_timestamp *sts)
> +{
> + struct igb_adapter *igb = container_of(ptp, struct igb_adapter,
> +ptp_caps);
> + struct e1000_hw *hw = >hw;
> + unsigned long flags;
> + u32 lo, hi;
> + u64 ns;
> +
> + spin_lock_irqsave(>tmreg_lock, flags);
> +
> + /* 82576 doesn't have SYSTIMR */
> + if (igb->hw.mac.type == e1000_82576) {

Instead of if/then/else, can't you follow the pattern of providing
different function flavors ...

> + ptp_read_system_prets(sts);
> + lo = rd32(E1000_SYSTIML);
> + ptp_read_system_postts(sts);
> + hi = rd32(E1000_SYSTIMH);
> + } else {
> + ptp_read_system_prets(sts);
> + rd32(E1000_SYSTIMR);
> + ptp_read_system_postts(sts);
> + lo = rd32(E1000_SYSTIML);
> + hi = rd32(E1000_SYSTIMH);
> + }
> +
> + /* SYSTIM on I210/I211 counts time in seconds and nanoseconds */
> + if (igb->hw.mac.type == e1000_i210 || igb->hw.mac.type == e1000_i211) {
> + sts->phc_ts.tv_sec = hi;
> + sts->phc_ts.tv_nsec = lo;
> + } else {
> + ns = timecounter_cyc2time(>tc, ((u64)hi << 32) | lo);
> + sts->phc_ts = ns_to_timespec64(ns);
> + }
> +
> + spin_unlock_irqrestore(>tmreg_lock, flags);
> +
> + return 0;
> +}
> +
>  static int igb_ptp_settime_82576(struct ptp_clock_info *ptp,
>const struct timespec64 *ts)
>  {
> @@ -1125,6 +1165,7 @@ void igb_ptp_init(struct igb_adapter *adapter)
>   adapter->ptp_caps.adjfreq = igb_ptp_adjfreq_82576;
>   adapter->ptp_caps.adjtime = igb_ptp_adjtime_82576;
>   adapter->ptp_caps.gettime64 = igb_ptp_gettime_82576;
> + adapter->ptp_caps.gettimex64 = igb_ptp_gettimex;

... here?

Thanks,
Richard

>   adapter->ptp_caps.settime64 = igb_ptp_settime_82576;
>   adapter->ptp_caps.enable = igb_ptp_feature_enable;
>   adapter->cc.read = igb_ptp_read_82576;
> @@ -1144,6 +1185,7 @@ void igb_ptp_init(struct igb_adapter *adapter)
>   adapter->ptp_caps.adjfine = igb_ptp_adjfine_82580;
>   adapter->ptp_caps.adjtime = igb_ptp_adjtime_82576;
>   adapter->ptp_caps.gettime64 = igb_ptp_gettime_82576;
> + adapter->ptp_caps.gettimex64 = igb_ptp_gettimex;
>   adapter->ptp_caps.settime64 = igb_ptp_settime_82576;
>   adapter->ptp_caps.enable = igb_ptp_feature_enable;
>   adapter->cc.read = igb_ptp_read_82580;
> @@ -1172,6 +1214,7 @@ void igb_ptp_init(struct igb_adapter *adapter)
>   adapter->ptp_caps.adjfine = igb_ptp_adjfine_82580;
>   adapter->ptp_caps.adjtime = igb_ptp_adjtime_i210;
>   adapter->ptp_caps.gettime64 = igb_ptp_gettime_i210;
> + adapter->ptp_caps.gettimex64 = igb_ptp_gettimex;
>   adapter->ptp_caps.settime64 = igb_ptp_settime_i210;
>   adapter->ptp_caps.enable = igb_ptp_feature_enable_i210;
>   adapter->ptp_caps.verify = igb_ptp_verify_pin;

Re: [RFC PATCH 0/4] More accurate PHC<->system clock synchronization

2018-10-30 Thread Richard Cochran

On Fri, Oct 26, 2018 at 06:27:38PM +0200, Miroslav Lichvar wrote:
> The other patches add support for the new ioctl to the e1000e, igb,
> and ixgbe driver. Tests with few different NICs in different machines
> (and PCIe slots) show that:
> - with an I219 (e1000e) the measured delay improved from 2500 to 1300 ns
>   and the error in the measured offset, when compared to cross
>   timestamping, was reduced by a factor of 5
> - with an I210 (igb) the delay improved from 5100 to 1700 ns
> - with an I350 (igb) the delay improved from 2300 to 750 ns
> - with an X550 (ixgbe) the delay improved from 1950 to 650 ns

Nice work!  This is a welcome improvement.

Thanks,
Richard

Re: [RFC PATCH 1/4] ptp: add PTP_SYS_OFFSET_EXTENDED ioctl

2018-10-30 Thread Richard Cochran

On Fri, Oct 26, 2018 at 06:27:39PM +0200, Miroslav Lichvar wrote:
> diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c
> index 2012551d93e0..1a04c437fd4f 100644
> --- a/drivers/ptp/ptp_chardev.c
> +++ b/drivers/ptp/ptp_chardev.c
> @@ -124,11 +124,13 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int 
> cmd, unsigned long arg)
>   struct ptp_clock_caps caps;
>   struct ptp_clock_request req;
>   struct ptp_sys_offset *sysoff = NULL;
> + struct ptp_sys_offset_extended *sysoff_extended = NULL;

How about a more succinct name, like 'extoff' ?

>   struct ptp_sys_offset_precise precise_offset;
>   struct ptp_pin_desc pd;
>   struct ptp_clock *ptp = container_of(pc, struct ptp_clock, clock);
>   struct ptp_clock_info *ops = ptp->info;
>   struct ptp_clock_time *pct;
> + struct ptp_system_timestamp sts;
>   struct timespec64 ts;
>   struct system_device_crosststamp xtstamp;
>   int enable, err = 0;

This collection of automatic variables is getting ugly.  May I ask you
to prefix a patch that puts them into reverse Christmas tree before
your changes?  (Patch below)

> @@ -211,6 +213,43 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, 
> unsigned long arg)
>   err = -EFAULT;
>   break;
>  
> + case PTP_SYS_OFFSET_EXTENDED:
> + if (!ptp->info->gettimex64) {
> + err = -EOPNOTSUPP;
> + break;
> + }
> + sysoff_extended = memdup_user((void __user *)arg,
> +   sizeof(*sysoff_extended));

As pointed out before, this needs to be freed, and

> + if (IS_ERR(sysoff_extended)) {
> + err = PTR_ERR(sysoff_extended);
> + sysoff = NULL;

here you meant, sysoff_extended = NULL;

> + break;
> + }
> + if (sysoff_extended->n_samples > PTP_MAX_SAMPLES) {
> + err = -EINVAL;
> + break;
> + }
> +
> + pct = _extended->ts[0];
> + for (i = 0; i < sysoff_extended->n_samples; i++) {
> + err = ptp->info->gettimex64(ptp->info, );
> + if (err)
> + break;
> + pct->sec = sts.sys_ts1.tv_sec;
> + pct->nsec = sts.sys_ts1.tv_nsec;
> + pct++;
> + pct->sec = sts.phc_ts.tv_sec;
> + pct->nsec = sts.phc_ts.tv_nsec;
> + pct++;
> + pct->sec = sts.sys_ts2.tv_sec;
> + pct->nsec = sts.sys_ts2.tv_nsec;
> + pct++;
> + }
> + if (copy_to_user((void __user *)arg, sysoff_extended,
> +  sizeof(*sysoff_extended)))
> + err = -EFAULT;
> + break;
> +
>   case PTP_SYS_OFFSET:
>   sysoff = memdup_user((void __user *)arg, sizeof(*sysoff));
>   if (IS_ERR(sysoff)) {
> diff --git a/include/linux/ptp_clock_kernel.h 
> b/include/linux/ptp_clock_kernel.h
> index 51349d124ee5..79321d929925 100644
> --- a/include/linux/ptp_clock_kernel.h
> +++ b/include/linux/ptp_clock_kernel.h
> @@ -39,6 +39,13 @@ struct ptp_clock_request {
>  };
>  
>  struct system_device_crosststamp;
> +

KernelDoc please for this:

> +struct ptp_system_timestamp {
> + struct timespec64 sys_ts1;
> + struct timespec64 phc_ts;
> + struct timespec64 sys_ts2;
> +};
> +

Thanks,
Richard

Re: [PATCH net] net: dsa: microchip: initialize mutex before use

2018-10-30 Thread Andrew Lunn

On Tue, Oct 30, 2018 at 04:45:49PM -0700, tristram...@microchip.com wrote:
> From: Tristram Ha 
> 
> Initialize mutex before use.  Avoid kernel complaint when
> CONFIG_DEBUG_LOCK_ALLOC is enabled.
> 
> Fixes: b987e98e50ab90e5 ("dsa: add DSA switch driver for Microchip KSZ9477")
> Signed-off-by: Tristram Ha 
> ---
>  drivers/net/dsa/microchip/ksz_common.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/dsa/microchip/ksz_common.c 
> b/drivers/net/dsa/microchip/ksz_common.c
> index 54e0ca6..f6f0662 100644
> --- a/drivers/net/dsa/microchip/ksz_common.c
> +++ b/drivers/net/dsa/microchip/ksz_common.c
> @@ -1117,11 +1117,6 @@ static int ksz_switch_init(struct ksz_device *dev)
>  {
>   int i;
>  
> - mutex_init(>reg_mutex);
> - mutex_init(>stats_mutex);
> - mutex_init(>alu_mutex);
> - mutex_init(>vlan_mutex);
> -
>   dev->ds->ops = _switch_ops;
>  
>   for (i = 0; i < ARRAY_SIZE(ksz_switch_chips); i++) {
> @@ -1206,6 +1201,12 @@ int ksz_switch_register(struct ksz_device *dev)
>   if (dev->pdata)
>   dev->chip_id = dev->pdata->chip_id;
>  
> + /* mutex is used in next function call. */
> + mutex_init(>reg_mutex);
> + mutex_init(>stats_mutex);
> + mutex_init(>alu_mutex);
> + mutex_init(>vlan_mutex);

Hi Tristram

The comment is no longer relevant now that all mutexes are
initialised.

Andrew

[no subject]

2018-10-30 Thread Ubaithullah Masood

This is Mr Ubaithullah Masood from Banco Santander Bank S A Hong Kong.
I got your contact during my private search on net..Would you be
interested in a business transaction to act as the beneficiary to
claim 9.8M USD funds of my deceased client who died intestate (
Without a Will)and my bank wants to confiscate the funds if the funds
are not claimed soon. Do get back for more details as this deal is
safe and all documentation will be done legally and we will share 50%
each.
Thanks.

Re: Fw: [Bug 201423] New: eth0: hw csum failure

2018-10-30 Thread Fabio Rossi

> On 10/16/2018 06:00 AM, Eric Dumazet wrote:
> > On Mon, Oct 15, 2018 at 11:30 PM Andre Tomt  wrote:
> >>
> >> On 15.10.2018 17:41, Eric Dumazet wrote:
> >>> On Mon, Oct 15, 2018 at 8:15 AM Stephen Hemminger
>  Something is changed between 4.17.12 and 4.18, after bisecting the 
>  problem I
>  got the following first bad commit:
> 
>  commit 88078d98d1bb085d72af8437707279e203524fa5
>  Author: Eric Dumazet 
>  Date:   Wed Apr 18 11:43:15 2018 -0700
> 
>   net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
> 
>   After working on IP defragmentation lately, I found that some large
>   packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
>   zero paddings on the last (small) fragment.
> 
>   While removing the padding with pskb_trim_rcsum(), we set 
>  skb->ip_summed
>   to CHECKSUM_NONE, forcing a full csum validation, even if all prior
>   fragments had CHECKSUM_COMPLETE set.
> 
>   We can instead compute the checksum of the part we are trimming,
>   usually smaller than the part we keep.
> 
>   Signed-off-by: Eric Dumazet 
>   Signed-off-by: David S. Miller 
> 
> >>>
> >>> Thanks for bisecting !
> >>>
> >>> This commit is known to expose some NIC/driver bugs.
> >>>
> >>> Look at commit 12b03558cef6d655d0d394f5e98a6fd07c1f6c0f
> >>> ("net: sungem: fix rx checksum support")  for one driver needing a fix.
> >>>
> >>> I assume SKY2_HW_NEW_LE is not set on your NIC ?
> >>>
> >>
> >> I've seen similar on several systems with mlx4 cards when using 4.18.x -
> >> that is hw csum failure followed by some backtrace.
> >>
> >> Only seems to happen on systems dealing with quite a bit of UDP.
> >>
> > 
> > Strange, because mlx4 on IPv6+UDP should not use CHECKSUM_COMPLETE,
> > but CHECKSUM_UNNECESSARY
> > 
> > I would be nice to track this a bit further, maybe by providing the
> > full packet content.
> > 
> >> Example from 4.18.10:
> >>> [635607.740574] p0xe0: hw csum failure
> >>> [635607.740598] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.18.0-1 #1
> >>> [635607.740599] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 
> >>> 2.0b 05/02/2017
> >>> [635607.740599] Call Trace:
> >>> [635607.740602]  
> >>> [635607.740611]  dump_stack+0x5c/0x7b
> >>> [635607.740617]  __skb_gro_checksum_complete+0x9a/0xa0
> >>> [635607.740621]  udp6_gro_receive+0x211/0x290
> >>> [635607.740624]  ipv6_gro_receive+0x1a8/0x390
> >>> [635607.740627]  dev_gro_receive+0x33e/0x550
> >>> [635607.740628]  napi_gro_frags+0xa2/0x210
> >>> [635607.740635]  mlx4_en_process_rx_cq+0xa01/0xb40 [mlx4_en]
> >>> [635607.740648]  ? mlx4_cq_completion+0x23/0x70 [mlx4_core]
> >>> [635607.740654]  ? mlx4_eq_int+0x373/0xc80 [mlx4_core]
> >>> [635607.740657]  mlx4_en_poll_rx_cq+0x55/0xf0 [mlx4_en]
> >>> [635607.740658]  net_rx_action+0xe0/0x2e0
> >>> [635607.740662]  __do_softirq+0xd8/0x2e5
> >>> [635607.740666]  irq_exit+0xb4/0xc0
> >>> [635607.740667]  do_IRQ+0x85/0xd0
> >>> [635607.740670]  common_interrupt+0xf/0xf
> >>> [635607.740671]  
> >>> [635607.740675] RIP: 0010:cpuidle_enter_state+0xb4/0x2a0
> >>> [635607.740675] Code: 31 ff e8 df a6 ba ff 45 84 f6 74 17 9c 58 0f 1f 44 
> >>> 00 00 f6 c4 02 0f 85 d8 01 00 00 31 ff e8 13 81 bf ff fb 66 0f 1f 44 00 
> >>> 00 <4c> 29 fb 48 ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7
> >>> [635607.740701] RSP: 0018:a5c206353ea8 EFLAGS: 0246 ORIG_RAX: 
> >>> ffd9
> >>> [635607.740703] RAX: 8d72ffd20f00 RBX: 00024214f597c5b0 RCX: 
> >>> 001f
> >>> [635607.740703] RDX: 00024214f597c5b0 RSI: 00020780 RDI: 
> >>> 
> >>> [635607.740704] RBP: 0004 R08: 002542bfbefa99fa R09: 
> >>> 
> >>> [635607.740705] R10: a5c206353e88 R11: 00c5 R12: 
> >>> af0aaf78
> >>> [635607.740706] R13: 8d72ffd297d8 R14:  R15: 
> >>> 00024214f58c2ed5
> >>> [635607.740709]  ? cpuidle_enter_state+0x91/0x2a0
> >>> [635607.740712]  do_idle+0x1d0/0x240
> >>> [635607.740715]  cpu_startup_entry+0x5f/0x70
> >>> [635607.740719]  start_secondary+0x185/0x1a0
> >>> [635607.740722]  secondary_startup_64+0xa5/0xb0
> >>> [635607.740731] p0xe0: hw csum failure
> >>> [635607.740745] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.18.0-1 #1
> >>> [635607.740746] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 
> >>> 2.0b 05/02/2017
> >>> [635607.740746] Call Trace:
> >>> [635607.740747]  
> >>> [635607.740750]  dump_stack+0x5c/0x7b
> >>> [635607.740755]  __skb_checksum_complete+0xb8/0xd0
> >>> [635607.740760]  __udp6_lib_rcv+0xa6b/0xa70
> >>> [635607.740767]  ? nft_do_chain_inet+0x7a/0xd0 [nf_tables]
> >>> [635607.740770]  ? nft_do_chain_inet+0x7a/0xd0 [nf_tables]
> >>> [635607.740774]  ip6_input_finish+0xc0/0x460
> >>> [635607.740776]  ip6_input+0x2b/0x90
> >>> [635607.740778]  ? ip6_rcv_finish+0x110/0x110
> >>> [635607.740780]  ipv6_rcv+0x2cd/0x4b0
> >>>

[PATCH net] net: dsa: microchip: initialize mutex before use

2018-10-30 Thread Tristram.Ha

From: Tristram Ha 

Initialize mutex before use.  Avoid kernel complaint when
CONFIG_DEBUG_LOCK_ALLOC is enabled.

Fixes: b987e98e50ab90e5 ("dsa: add DSA switch driver for Microchip KSZ9477")
Signed-off-by: Tristram Ha 
---
 drivers/net/dsa/microchip/ksz_common.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/dsa/microchip/ksz_common.c 
b/drivers/net/dsa/microchip/ksz_common.c
index 54e0ca6..f6f0662 100644
--- a/drivers/net/dsa/microchip/ksz_common.c
+++ b/drivers/net/dsa/microchip/ksz_common.c
@@ -1117,11 +1117,6 @@ static int ksz_switch_init(struct ksz_device *dev)
 {
int i;
 
-   mutex_init(>reg_mutex);
-   mutex_init(>stats_mutex);
-   mutex_init(>alu_mutex);
-   mutex_init(>vlan_mutex);
-
dev->ds->ops = _switch_ops;
 
for (i = 0; i < ARRAY_SIZE(ksz_switch_chips); i++) {
@@ -1206,6 +1201,12 @@ int ksz_switch_register(struct ksz_device *dev)
if (dev->pdata)
dev->chip_id = dev->pdata->chip_id;
 
+   /* mutex is used in next function call. */
+   mutex_init(>reg_mutex);
+   mutex_init(>stats_mutex);
+   mutex_init(>alu_mutex);
+   mutex_init(>vlan_mutex);
+
if (ksz_switch_detect(dev))
return -EINVAL;
 
-- 
1.9.1

Re: [PATCH net v5] net/ipv6: Add anycast addresses to a global hashtable

2018-10-30 Thread David Miller

From: David Ahern 
Date: Tue, 30 Oct 2018 16:06:46 -0600

> or make the table per namespace.

This will increase namespace create/destroy cost, so I'd rather not
for something like this.

[RFC PATCH] net: vlan ipsec-xfrm offload

2018-10-30 Thread Shannon Nelson

This is an RFC post - not working code.  I welcome your comments.

If the real device offers IPsec offload, why shouldn't the VLAN device
also offer the offload?  This essentially becomes a passthru interface
to the real device by swapping out the xfrm_state's netdevice reference
before calling to the real_dev.

Signed-off-by: Shannon Nelson 
---
 net/8021q/vlan_dev.c | 98 
 1 file changed, 98 insertions(+)

diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index ff720f1..a46e056 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "vlan.h"
 #include "vlanproc.h"
@@ -546,6 +547,7 @@ static struct device_type vlan_type = {
 };
 
 static const struct net_device_ops vlan_netdev_ops;
+static const struct xfrmdev_ops vlan_xfrmdev_ops;
 
 static int vlan_dev_init(struct net_device *dev)
 {
@@ -553,6 +555,7 @@ static int vlan_dev_init(struct net_device *dev)
 
netif_carrier_off(dev);
 
+   /* copy netdev features into list of user selectable features */
/* IFF_BROADCAST|IFF_MULTICAST; ??? */
dev->flags  = real_dev->flags & ~(IFF_UP | IFF_PROMISC | IFF_ALLMULTI |
  IFF_MASTER | IFF_SLAVE);
@@ -564,6 +567,12 @@ static int vlan_dev_init(struct net_device *dev)
   NETIF_F_FRAGLIST | NETIF_F_GSO_SOFTWARE |
   NETIF_F_HIGHDMA | NETIF_F_SCTP_CRC |
   NETIF_F_ALL_FCOE;
+#ifdef CONFIG_XFRM_OFFLOAD
+   dev->hw_features |= real_dev->hw_features & (NETIF_F_HW_ESP |
+NETIF_F_HW_ESP_TX_CSUM |
+NETIF_F_GSO_ESP);
+   dev->xfrmdev_ops = _xfrmdev_ops;
+#endif
 
dev->features |= dev->hw_features | NETIF_F_LLTX;
dev->gso_max_size = real_dev->gso_max_size;
@@ -767,6 +776,95 @@ static int vlan_dev_get_iflink(const struct net_device 
*dev)
return real_dev->ifindex;
 }
 
+static int vlan_xdo_state_add(struct xfrm_state *xs)
+{
+   struct net_device *dev = xs->xso.dev;
+   struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;
+   int ret;
+
+   /* swap out the vlan dev and put a hold on the real device */
+   xs->xso.dev = real_dev;
+   dev_hold(real_dev);
+
+   ret = real_dev->xfrmdev_ops->xdo_dev_state_add(xs);
+
+   /* if it wasn't successful, release the real_dev */
+   if (ret)
+   dev_put(real_dev);
+
+   /* put dev back before we leave */
+   xs->xso.dev = dev;
+
+   return ret;
+}
+
+static void vlan_xdo_state_delete(struct xfrm_state *xs)
+{
+   struct net_device *dev = xs->xso.dev;
+   struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;
+
+   xs->xso.dev = real_dev;
+   real_dev->xfrmdev_ops->xdo_dev_state_delete(xs);
+   xs->xso.dev = dev;
+}
+
+static void vlan_xdo_state_free(struct xfrm_state *xs)
+{
+   struct net_device *dev = xs->xso.dev;
+   struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;
+
+   /* check for optional interface */
+   if (real_dev->xfrmdev_ops->xdo_dev_state_free) {
+   xs->xso.dev = real_dev;
+   real_dev->xfrmdev_ops->xdo_dev_state_free(xs);
+   xs->xso.dev = dev;
+   }
+
+   /* let go of the real device, we're done with it */
+   dev_put(real_dev);
+}
+
+static bool vlan_xdo_offload_ok(struct sk_buff *skb, struct xfrm_state *xs)
+{
+   struct net_device *dev = xs->xso.dev;
+   struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;
+   bool ret = true;
+
+   /* check for optional interface */
+   if (likely(real_dev->xfrmdev_ops->xdo_dev_offload_ok)) {
+   xs->xso.dev = real_dev;
+   ret = real_dev->xfrmdev_ops->xdo_dev_offload_ok(skb, xs);
+   xs->xso.dev = dev;
+   }
+
+   if (!ret)
+   netdev_err(dev, "%s: real_dev %s NAKd the offload\n",
+  __func__, real_dev->name);
+
+   return ret;
+}
+
+static void vlan_xdo_advance_esn(struct xfrm_state *xs)
+{
+   struct net_device *dev = xs->xso.dev;
+   struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;
+
+   /* check for optional interface */
+   if (real_dev->xfrmdev_ops->xdo_dev_state_advance_esn) {
+   xs->xso.dev = real_dev;
+   real_dev->xfrmdev_ops->xdo_dev_state_advance_esn(xs);
+   xs->xso.dev = dev;
+   }
+}
+
+static const struct xfrmdev_ops vlan_xfrmdev_ops = {
+   .xdo_dev_state_add = vlan_xdo_state_add,
+   .xdo_dev_state_delete = vlan_xdo_state_delete,
+   .xdo_dev_state_free = vlan_xdo_state_free,
+   .xdo_dev_offload_ok = vlan_xdo_offload_ok,
+   .xdo_dev_state_advance_esn = vlan_xdo_advance_esn,
+};
+
 static const struct ethtool_ops vlan_ethtool_ops = {

Re: [PATCH] bpf: tcp_bpf_recvmsg should return EAGAIN when nonblocking and no data

2018-10-30 Thread Daniel Borkmann

On 10/29/2018 08:31 PM, John Fastabend wrote:
> We return 0 in the case of a nonblocking socket that has no data
> available. However, this is incorrect and may confuse applications.
> After this patch we do the correct thing and return the error
> EAGAIN.
> 
> Quoting return codes from recvmsg manpage,
> 
> EAGAIN or EWOULDBLOCK
>  The socket is marked nonblocking and the receive operation would
>  block, or a receive timeout had been set and the timeout expired
>  before data was received.
> 
> Signed-off-by: John Fastabend 

Applied to bpf, thanks!

Re: [PATCH bpf] tools/bpf: add unlimited rlimit for flow_dissector_load

2018-10-30 Thread Daniel Borkmann

On 10/29/2018 10:56 PM, Yonghong Song wrote:
> On our test machine, bpf selftest test_flow_dissector.sh failed
> with the following error:
>   # ./test_flow_dissector.sh
>   bpffs not mounted. Mounting...
>   libbpf: failed to create map (name: 'jmp_table'): Operation not permitted
>   libbpf: failed to load object 'bpf_flow.o'
>   ./flow_dissector_load: bpf_prog_load bpf_flow.o
>   selftests: test_flow_dissector [FAILED]
> 
> Let us increase the rlimit to remove the above map
> creation failure.
> 
> Signed-off-by: Yonghong Song 

Applied to bpf, thanks!

Re: [PATCH net v5] net/ipv6: Add anycast addresses to a global hashtable

2018-10-30 Thread David Ahern

On 10/30/18 12:31 PM, David Miller wrote:
> From: Jeff Barnhill <0xeff...@gmail.com>
> Date: Tue, 30 Oct 2018 07:10:58 -0400
> 
>> I originally started implementing it the way you suggested; however,
>> it seemed to complicate management of that structure because it isn't
>> currently using rcu.  Also, assuming that can be worked out, where
>> would I get the net from?  Would I need to store a copy in ifcaddr6,
>> or is there some way to access it during ipv6_chk_acast_addr()?  It
>> seems that if I don't add a copy of net, but instead access it through
>> aca_rt(?), then freeing the ifcaddr6 memory becomes problematic
>> (detaching it from idev, while read_rcu may still be accessing it).
>> On Mon, Oct 29, 2018 at 11:32 PM David Miller  wrote:
> 
> I don't think converting the structure over to RCU, especially because
> all of the read paths (everything leading to ipv6_chk_acast_dev()) are
> taking RCU locks already.
> 
> And I cannot understand how having _two_ structures to manage a piece
> of information can be less complicated than just one.
> 
> You can add a backpointer to the 'idev' in ifacaddr6 to get at the
> network namespace.  You don't even need to do additional reference
> counting because the idev->ac_list is always purged before an idev
> is destroyed.
> 

or make the table per namespace.

[PATCH iproute2-next] ip rule: Add ipproto and port range to filter list

2018-10-30 Thread David Ahern

From: David Ahern 

Allow ip rule dumps and flushes to filter based on ipproto, sport
and dport. Example:

$ ip ru ls ipproto udp
99: from all to 8.8.8.8 ipproto udp dport 53 lookup 1001
$ ip ru ls dport 53
99: from all to 8.8.8.8 ipproto udp dport 53 lookup 1001

Signed-off-by: David Ahern 
---
 ip/iprule.c | 66 +
 1 file changed, 66 insertions(+)

diff --git a/ip/iprule.c b/ip/iprule.c
index a85a43904e6e..0713694d4a49 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -78,6 +78,9 @@ static struct
inet_prefix dst;
int protocol;
int protocolmask;
+   struct fib_rule_port_range sport;
+   struct fib_rule_port_range dport;
+   __u8 ipproto;
 } filter;
 
 static inline int frh_get_table(struct fib_rule_hdr *frh, struct rtattr **tb)
@@ -174,6 +177,39 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr 
**tb, int host_len)
return false;
}
 
+   if (filter.ipproto) {
+   __u8 ipproto = 0;
+
+   if (tb[FRA_IP_PROTO])
+   ipproto = rta_getattr_u8(tb[FRA_IP_PROTO]);
+   if (filter.ipproto != ipproto)
+   return false;
+   }
+
+   if (filter.sport.start) {
+   const struct fib_rule_port_range *r;
+
+   if (!tb[FRA_SPORT_RANGE])
+   return false;
+
+   r = RTA_DATA(tb[FRA_SPORT_RANGE]);
+   if (r->start != filter.sport.start ||
+   r->end != filter.sport.end)
+   return false;
+   }
+
+   if (filter.dport.start) {
+   const struct fib_rule_port_range *r;
+
+   if (!tb[FRA_DPORT_RANGE])
+   return false;
+
+   r = RTA_DATA(tb[FRA_DPORT_RANGE]);
+   if (r->start != filter.dport.start ||
+   r->end != filter.dport.end)
+   return false;
+   }
+
table = frh_get_table(frh, tb);
if (filter.tb > 0 && filter.tb ^ table)
return false;
@@ -607,6 +643,36 @@ static int iprule_list_flush_or_save(int argc, char 
**argv, int action)
filter.protocolmask = 0;
}
filter.protocol = prot;
+   } else if (strcmp(*argv, "ipproto") == 0) {
+   int ipproto;
+
+   NEXT_ARG();
+   ipproto = inet_proto_a2n(*argv);
+   if (ipproto < 0)
+   invarg("Invalid \"ipproto\" value\n", *argv);
+   filter.ipproto = ipproto;
+   } else if (strcmp(*argv, "sport") == 0) {
+   struct fib_rule_port_range r;
+   int ret = 0;
+
+   NEXT_ARG();
+   ret = sscanf(*argv, "%hu-%hu", , );
+   if (ret == 1)
+   r.end = r.start;
+   else if (ret != 2)
+   invarg("invalid port range\n", *argv);
+   filter.sport = r;
+   } else if (strcmp(*argv, "dport") == 0) {
+   struct fib_rule_port_range r;
+   int ret = 0;
+
+   NEXT_ARG();
+   ret = sscanf(*argv, "%hu-%hu", , );
+   if (ret == 1)
+   r.end = r.start;
+   else if (ret != 2)
+   invarg("invalid dport range\n", *argv);
+   filter.dport = r;
} else{
if (matches(*argv, "dst") == 0 ||
matches(*argv, "to") == 0) {
-- 
2.11.0

[PATCH iproute2] ip rule: Require at least one argument for add

2018-10-30 Thread David Ahern

From: David Ahern 

'ip rule add' with no additional arguments just adds another rule
for the main table - which exists by default. Require at least
1 argument similar to delete.

Signed-off-by: David Ahern 
---
 ip/iprule.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/ip/iprule.c b/ip/iprule.c
index d89d808d8909..b465a80785b1 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -691,6 +691,11 @@ static int iprule_modify(int cmd, int argc, char **argv)
};
 
if (cmd == RTM_NEWRULE) {
+   if (argc == 0) {
+   fprintf(stderr,
+   "\"ip rule add\" requires arguments.\n");
+   return -1;
+   }
req.n.nlmsg_flags |= NLM_F_CREATE|NLM_F_EXCL;
req.frh.action = FR_ACT_TO_TBL;
}
-- 
2.11.0

[PATCH iproute2] ip rule: Honor filter arguments on flush

2018-10-30 Thread David Ahern

From: David Ahern 

'ip ru flush' currently removes all rules with priority > 0 regardless
of any other command line arguments passed in. Update flush_rule to
call filter_nlmsg to determine if the rule should be flushed or not.
This enables rule flushing such as 'ip ru flush table 1001' and
'ip ru flush pref 99'.

Signed-off-by: David Ahern 
---
 ip/iprule.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/ip/iprule.c b/ip/iprule.c
index b465a80785b1..a85a43904e6e 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -461,6 +461,7 @@ static int flush_rule(struct nlmsghdr *n, void *arg)
struct fib_rule_hdr *frh = NLMSG_DATA(n);
int len = n->nlmsg_len;
struct rtattr *tb[FRA_MAX+1];
+   int host_len = -1;
 
len -= NLMSG_LENGTH(sizeof(*frh));
if (len < 0)
@@ -468,6 +469,10 @@ static int flush_rule(struct nlmsghdr *n, void *arg)
 
parse_rtattr(tb, FRA_MAX, RTM_RTA(frh), len);
 
+   host_len = af_bit_len(frh->family);
+   if (!filter_nlmsg(n, tb, host_len))
+   return 0;
+
if (tb[FRA_PROTOCOL]) {
__u8 protocol = rta_getattr_u8(tb[FRA_PROTOCOL]);
 
-- 
2.11.0

RE: [PATCH net] net: ethernet: cadence: fix socket buffer corruption problem

2018-10-30 Thread Tristram.Ha

> Could you check on your side that applying this on top of your patch, your
> scenario is still working?
> 
> diff --git a/drivers/net/ethernet/cadence/macb_main.c
> b/drivers/net/ethernet/cadence/macb_main.c
> index 1d86b4d5645a..e1347d6d1b50 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -1702,12 +1702,8 @@ static int macb_pad_and_fcs(struct sk_buff **skb,
> struct net_device *ndev)
> *skb = nskb;
> }
> 
> -   if (padlen) {
> -   if (padlen >= ETH_FCS_LEN)
> -   skb_put_zero(*skb, padlen - ETH_FCS_LEN);
> -   else
> -   skb_trim(*skb, ETH_FCS_LEN - padlen);
> -   }
> +   if (padlen > ETH_FCS_LEN)
> +   skb_put_zero(*skb, padlen - ETH_FCS_LEN);

I think it is okay but I need to check all paths are covered.
 
> It was reported in [1] that UDP checksum is offloaded to hardware no matter
> the application previously computed it.
> 
> The code should be executed only for packets that has checksum computed
> by
> applications ((*skb)->ip_summed != CHECKSUM_PARTIAL). The idea was to
> not
> recompute checksum for packets with checksum already computed. To do
> so,
> while hardware checksum is enabled (NETIF_F_HW_CSUM), TX_NOCRC bit
> should
> be set on buffer descriptor. But to do so, packets must have a minimum size
> of 64 and FCS to be computed.
> 
> The NETIF_F_HW_CSUM check was placed there because the issue
> described in
> [1] is reproducible because hardware checksum is enabled and overrides the
> checksum provided by applications.
> 
> [1] https://www.spinics.net/lists/netdev/msg505065.html

I understand the issue now.  It is weird that the transmit descriptor does not
have direct control over turning on checksum generation or not, but it wastes
3 bits returning the error codes of such generation.  What can the driver do
with such information?

In my opinion then hardware transmit checksumming cannot be supported
In Linux.

> > NETIF_F_SG is not enabled in the MAC I used, so enabling
> NETIF_IF_HW_CSUM
> > is rather pointless.  With the padding code the transmit throughput cannot
> get
> > higher than 100Mbps in a gigabit connection.
> >
> > I would recommend to add this option to disable manual padding in one of
> those
> > macb_config structures.
> 
> In this way the user would have to know from the beginning what kind of
> packets are used.
> 

The kernel already does a good job of calculating checksum.  Using hardware to
do that does not improve performance much.

Alternative is to use ethtool to disable hardware tx checksum so that software 
can
intentionally send wrong checksums.

Re: [PATCH v4.14-stable] sch_netem: restore skb->dev after dequeuing from the rbtree

2018-10-30 Thread Eduardo Valentin

Greg,

On Thu, Oct 18, 2018 at 03:43:48PM -0700, David Miller wrote:
> From: Christoph Paasch 
> Date: Thu, 18 Oct 2018 13:38:40 -0700
> 
> > Upstream commit bffa72cf7f9d ("net: sk_buff rbnode reorg") got
> > backported as commit 6b921536f170 ("net: sk_buff rbnode reorg") into the
> > v4.14.x-tree.
> > 
> > However, the backport does not include the changes in sch_netem.c
> > 
> > We need these, as otherwise the skb->dev pointer is not set when
> > dequeueing from the netem rbtree, resulting in a panic:
>  ...
> > Fixes: 6b921536f170 ("net: sk_buff rbnode reorg")
> > Cc: Stephen Hemminger 
> > Cc: Eric Dumazet 
> > Cc: Soheil Hassas Yeganeh 
> > Cc: Wei Wang 
> > Cc: Willem de Bruijn 
> > Signed-off-by: Christoph Paasch 
> > ---
> > 
> > Notes:
> > This patch should only make it into v4.14-stable as that's the only 
> > branch where
> > the offending commit has been backported to.
> 
> Greg, please queue up.

Are you planing to queue this one ?

Looks to me it was a miss on the backport.

It seams that the backport was touching different files, and missed the change
on net/sched/sch_netem.c. So, to me, even if this patch may not follow the
strictly the rules of stable, as it is not a patch in upstream, seams to be a 
needed change, even if it is specific to stable linux-4.14.y.

> 

-- 
All the best,
Eduardo Valentin

Re: [PATCH net] net/mlx5e: fix csum adjustments caused by RXFCS

2018-10-30 Thread Eric Dumazet

On Tue, Oct 30, 2018 at 11:59 AM Cong Wang  wrote:

> I wonder how compiler recognizes it as "never fail" when marked with
> __must_check.

__must_check means that you can not ignore the return value of a function.

Here we do use the return value.

Also prior code was not checking skb->length so really my patch does
add explicit
check if in the future skb->len gets wrong after a bug is added in this driver.

(A NULL deref will trap the bug, instead of potentially reading
garbage from skb->data)

Re: [PATCH net] net/mlx5e: fix csum adjustments caused by RXFCS

2018-10-30 Thread Cong Wang

On Tue, Oct 30, 2018 at 11:42 AM Eric Dumazet  wrote:
>
> On Tue, Oct 30, 2018 at 11:09 AM Cong Wang  wrote:
>
> > At least skb_header_pointer() is marked as __must_check, I don't see
> > you check its return value here.
>
> This can not fail here.
>
> skb->length must be above 14+4 at this point.

Never say it is wrong, just saying what compiler thinks.

>
> My compiler seems to be okay with that.

I wonder how compiler recognizes it as "never fail" when marked with
__must_check.

Re: [Patch net] net: make pskb_trim_rcsum_slow() robust

2018-10-30 Thread Cong Wang

On Mon, Oct 29, 2018 at 8:08 PM Eric Dumazet  wrote:
>
>
>
> On 10/29/2018 07:41 PM, Cong Wang wrote:
> > On Mon, Oct 29, 2018 at 7:25 PM Eric Dumazet  wrote:
> >>
> >>
> >>
> >> On 10/29/2018 07:21 PM, Cong Wang wrote:
> >>> On Mon, Oct 29, 2018 at 7:14 PM Eric Dumazet  
> >>> wrote:
> 
>  Would not it be simpler to set ip_summed to CHECKSUM_NONE (no need to 
>  save old_csum) ?
> >>>
> >>> For !CHECKSUM_COMPLETE, ip_summed should be untouched, right?
> >>>
> >>> If you mean only setting to CHECKSUM_NONE for CHECKSUM_COMPLETE case,
> >>> the end result may not be simpler.
> >>
> >> I meant to reinstate what was there before my patch in this error case
> >>
> >>if (skb->ip_summed == CHECKSUM_COMPLETE)
> >>skb->ip_summed = CHECKSUM_NONE;
> >>
> >> That would only be run in error (quite unlikely) path, instead of saving 
> >> old_csum in all cases.
> >
> > I know your point, however, I am not sure that is a desired behavior.
> >
> > On failure, I think the whole skb should be restored to its previous state
> > before entering this function, changing it to CHECKSUM_NONE on failure
> > is inconsistent with success case.
> >
>
> Before my patch, we were changing skb->ip_summed to CHECKSUM_NONE,
> so why suddenly we need to be consistent ?

That is because setting it to CHECKSUM_NONE _was_ how the success
case works and nothing _was_ needed for failure case.

You changed how we handle checksum for success case, it is why we need
to change for the failure case too.


>
> In any case, ip_check_defrag() should really drop this skb, as for other 
> allocation
> failures (like skb_share_check()), if really we want consistency.

I have the same feeling, just not brave enough to change the logic of
ip_check_defrag() where pskb_may_pull() failure is treated in a same way.

Re: [PATCH net] net/mlx5e: fix csum adjustments caused by RXFCS

2018-10-30 Thread David Miller

From: Cong Wang 
Date: Tue, 30 Oct 2018 11:09:21 -0700

> On Tue, Oct 30, 2018 at 12:57 AM Eric Dumazet  wrote:
>> -static __be32 mlx5e_get_fcs(struct sk_buff *skb)
>> +static u32 mlx5e_get_fcs(const struct sk_buff *skb)
>>  {
>> -   int last_frag_sz, bytes_in_prev, nr_frags;
>> -   u8 *fcs_p1, *fcs_p2;
>> -   skb_frag_t *last_frag;
>> -   __be32 fcs_bytes;
>> +   const void *fcs_bytes;
>> +   u32 _fcs_bytes;
>>
>> -   if (!skb_is_nonlinear(skb))
>> -   return *(__be32 *)(skb->data + skb->len - ETH_FCS_LEN);
>> +   fcs_bytes = skb_header_pointer(skb, skb->len - ETH_FCS_LEN,
>> +  ETH_FCS_LEN, &_fcs_bytes);
> 
> At least skb_header_pointer() is marked as __must_check, I don't see
> you check its return value here.

In this case we reasonably know it is guaranteed to succeed though.

We know skb->len is non-zero and we are asking for the skb->len - 1
byte or something like that.

Re: [PATCH net] net/mlx5e: fix csum adjustments caused by RXFCS

2018-10-30 Thread Eric Dumazet

On Tue, Oct 30, 2018 at 11:09 AM Cong Wang  wrote:

> At least skb_header_pointer() is marked as __must_check, I don't see
> you check its return value here.

This can not fail here.

skb->length must be above 14+4 at this point.

My compiler seems to be okay with that.

Re: [PATCH net v5] net/ipv6: Add anycast addresses to a global hashtable

2018-10-30 Thread David Miller

From: Jeff Barnhill <0xeff...@gmail.com>
Date: Tue, 30 Oct 2018 07:10:58 -0400

> I originally started implementing it the way you suggested; however,
> it seemed to complicate management of that structure because it isn't
> currently using rcu.  Also, assuming that can be worked out, where
> would I get the net from?  Would I need to store a copy in ifcaddr6,
> or is there some way to access it during ipv6_chk_acast_addr()?  It
> seems that if I don't add a copy of net, but instead access it through
> aca_rt(?), then freeing the ifcaddr6 memory becomes problematic
> (detaching it from idev, while read_rcu may still be accessing it).
> On Mon, Oct 29, 2018 at 11:32 PM David Miller  wrote:

I don't think converting the structure over to RCU, especially because
all of the read paths (everything leading to ipv6_chk_acast_dev()) are
taking RCU locks already.

And I cannot understand how having _two_ structures to manage a piece
of information can be less complicated than just one.

You can add a backpointer to the 'idev' in ifacaddr6 to get at the
network namespace.  You don't even need to do additional reference
counting because the idev->ac_list is always purged before an idev
is destroyed.

Re: [PATCH net] net/mlx4_en: add a missing include

2018-10-30 Thread David Miller

From: Eric Dumazet 
Date: Tue, 30 Oct 2018 00:18:12 -0700

> Abdul Haleem reported a build error on ppc :
> 
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: `struct
> iphdr` declared inside parameter list [enabled by default]
>struct iphdr *iph)
>   ^
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: its scope is
> only this definition or declaration, which is probably not what you want
> [enabled by default]
> drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function
> get_fixed_ipv4_csum:
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:586:20: error: dereferencing
> pointer to incomplete type
>   __u8 ipproto = iph->protocol;
> ^
> 
> Fixes: 55469bc6b577 ("drivers: net: remove  inclusion when 
> not needed")
> Signed-off-by: Eric Dumazet 
> Reported-by: Abdul Haleem 

Applied, thanks Eric.

Re: [PATCH net] net/mlx5e: fix csum adjustments caused by RXFCS

2018-10-30 Thread Cong Wang

On Tue, Oct 30, 2018 at 12:57 AM Eric Dumazet  wrote:
> -static __be32 mlx5e_get_fcs(struct sk_buff *skb)
> +static u32 mlx5e_get_fcs(const struct sk_buff *skb)
>  {
> -   int last_frag_sz, bytes_in_prev, nr_frags;
> -   u8 *fcs_p1, *fcs_p2;
> -   skb_frag_t *last_frag;
> -   __be32 fcs_bytes;
> +   const void *fcs_bytes;
> +   u32 _fcs_bytes;
>
> -   if (!skb_is_nonlinear(skb))
> -   return *(__be32 *)(skb->data + skb->len - ETH_FCS_LEN);
> +   fcs_bytes = skb_header_pointer(skb, skb->len - ETH_FCS_LEN,
> +  ETH_FCS_LEN, &_fcs_bytes);

At least skb_header_pointer() is marked as __must_check, I don't see
you check its return value here.

Re: Latest net-next kernel 4.19.0+

2018-10-30 Thread Cong Wang

On Tue, Oct 30, 2018 at 10:50 AM Eric Dumazet  wrote:
>
>
>
> On 10/30/2018 10:32 AM, Cong Wang wrote:
>
> > Unlike Pawel's case, we don't use vlan at all, maybe this is why we see
> > it much less frequently than Pawel.
> >
> > Also, it is probably not specific to mlx5, as there is another report which
> > is probably a non-mlx5 driver.
>
> Not sure if you provided a stack trace ?

I said it is the same with Pawel's. Here it is anyway:

[ 3731.075989] eth0: hw csum failure
[ 3731.079316] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 4.14.74.x86_64 #1
[ 3731.086703] Hardware name: Wiwynn F4WW/Y 300-0284/F4WW MAIN BOARD,
BIOS F4WWP02 10/19/2018
[ 3731.094961] Call Trace:
[ 3731.097408]  
[ 3731.099432]  dump_stack+0x46/0x59
[ 3731.102751]  __skb_checksum_complete+0xb8/0xd0
[ 3731.107194]  tcp_v4_rcv+0x116/0xa30
[ 3731.110688]  ip_local_deliver_finish+0x5d/0x1f0
[ 3731.115218]  ip_local_deliver+0x6b/0xe0
[ 3731.119056]  ? ip_rcv_finish+0x400/0x400
[ 3731.122973]  ip_rcv+0x287/0x360
[ 3731.126112]  ? inet_del_offload+0x40/0x40
[ 3731.130124]  __netif_receive_skb_core+0x404/0xc10
[ 3731.134831]  ? netif_receive_skb_internal+0x34/0xd0
[ 3731.139709]  netif_receive_skb_internal+0x34/0xd0
[ 3731.144415]  napi_gro_receive+0xb8/0xe0
[ 3731.148271]  mlx5e_handle_rx_cqe_mpwrq+0x4e3/0x7f0 [mlx5_core]
[ 3731.154099]  ? enqueue_entity+0x103/0x7f0
[ 3731.158114]  mlx5e_poll_rx_cq+0xba/0x850 [mlx5_core]
[ 3731.163080]  mlx5e_napi_poll+0x91/0x290 [mlx5_core]
[ 3731.167955]  net_rx_action+0x14a/0x3e0
[ 3731.171707]  ? credit_entropy_bits+0x23d/0x260
[ 3731.176153]  __do_softirq+0xe2/0x2c3
[ 3731.179734]  irq_exit+0xbc/0xd0
[ 3731.182878]  do_IRQ+0x89/0xd0
[ 3731.185851]  common_interrupt+0x7a/0x7a
[ 3731.189690]  
[ 3731.191799] RIP: 0010:cpuidle_enter_state+0xa6/0x2d0
[ 3731.196761] RSP: 0018:bb950c6f7eb0 EFLAGS: 0246 ORIG_RAX:
ff60
[ 3731.204328] RAX: 9fe25fbe14c0 RBX: 0364b57553af RCX: 001f
[ 3731.211459] RDX: 20c49ba5e353f7cf RSI: 68294248f469 RDI: 
[ 3731.218583] RBP: db7d003c3300 R08: c3be R09: 8612
[ 3731.225709] R10: bb950c6f7e98 R11: c3be R12: 0003
[ 3731.232841] R13: 912c9d18 R14:  R15: 0364b396207a
[ 3731.239968]  do_idle+0x166/0x1a0
[ 3731.243199]  cpu_startup_entry+0x6f/0x80
[ 3731.247128]  start_secondary+0x19c/0x1f0
[ 3731.251052]  secondary_startup_64+0xa5/0xb0



>
> Have you tried IPv6 frags maybe ?
>

We have no IPv6 traffic. I asked people to try to generate IPv4 fragment
traffic to see if it would be more reproducible, no progress yet.

Re: Latest net-next kernel 4.19.0+

2018-10-30 Thread Eric Dumazet

On 10/30/2018 10:32 AM, Cong Wang wrote:

> Unlike Pawel's case, we don't use vlan at all, maybe this is why we see
> it much less frequently than Pawel.
> 
> Also, it is probably not specific to mlx5, as there is another report which
> is probably a non-mlx5 driver.

Not sure if you provided a stack trace ?

Have you tried IPv6 frags maybe ?

Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed

2018-10-30 Thread Stefano Brivio

On Tue, 30 Oct 2018 10:34:45 -0600
David Ahern  wrote:

> A more flexible approach is to use format strings to allow users to
> customize the output order and whitespace as well. So for ss and your
> column list (winging it here):
> 
> netid  = %N
> state  = %S
> recv Q = %Qr
> send Q = %Qs
> local address  = %Al
> lport port = %Pl
> remote address = %Ar
> remote port= %Pr
> process data   = %p
> ...
> 
> then a format string could be: "%S  %Qr %Qs  %Al:%Pl %Ar:%Pr  %p\n"

I like the idea indeed, but I see two issues with ss:

- the current column abstraction is rather lightweight, things are
  already buffered in the defined column order so we don't have to jump
  back and forth in the buffer while rendering. Doing that needs some
  extra care to avoid a performance hit, but it's probably doable, I
  can put that on my to-do list

- how would you model automatic spacing in a format string? Should we
  support width specifiers? Disable automatic spacing if a format
  string is given? It might even make sense to allow partial automatic
  spacing with a special character in the format string, that is:

"%S.%Qr.%Qs  %Al:%Pl %Ar:%Pr  %p\n"

  would mean "align everything to the right, distribute remaining
  whitespace between %S, %Qr and %Qs". But it looks rather complicated
  at a glance.

-- 
Stefano

Re: Latest net-next kernel 4.19.0+

2018-10-30 Thread Cong Wang

On Tue, Oct 30, 2018 at 7:16 AM Eric Dumazet  wrote:
>
>
>
> On 10/30/2018 01:09 AM, Paweł Staszewski wrote:
> >
> >
> > W dniu 30.10.2018 o 08:29, Eric Dumazet pisze:
> >>
> >> On 10/29/2018 11:09 PM, Dimitris Michailidis wrote:
> >>
> >>> Indeed this is a bug. I would expect it to produce frequent errors
> >>> though as many odd-length
> >>> packets would trigger it. Do you have RXFCS? Regardless, how
> >>> frequently do you see the problem?
> >>>
> >> Old kernels (before 88078d98d1bb) were simply resetting ip_summed to 
> >> CHECKSUM_NONE
> >>
> >> And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the 
> >> bug you fixed.
> >>
> >> So we now need to also fix mlx5.
> >>
> >> And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned 
> >> earlier,
> >> plus __get_unaligned_cpu32() as you hinted.
> >>
> >>
> >>
> >>
> >
> > No RXFCS


Same with Pawel, RXFCS is disabled by default.


> >
> > And this trace is rly frequently like once per 3/4 seconds
> > like below:
> > [28965.776864] vlan1490: hw csum failure
>
> Might be vlan related.

Unlike Pawel's case, we don't use vlan at all, maybe this is why we see
it much less frequently than Pawel.

Also, it is probably not specific to mlx5, as there is another report which
is probably a non-mlx5 driver.

Thanks.

[RFC PATCH v3 10/10] selftests: add functionals test for UDP GRO

2018-10-30 Thread Paolo Abeni

Extends the existing udp programs to allow checking for proper
GRO aggregation/GSO size, and run the tests via a shell script, using
a veth pair with XDP program attached to trigger the GRO code path.

rfc v2 -> rfc v3:
 - add missing test program options documentation
 - fix sporatic test failures (receiver faster than sender)

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/Makefile  |   2 +-
 tools/testing/selftests/net/udpgro.sh | 147 ++
 tools/testing/selftests/net/udpgro_bench.sh   |   8 +-
 tools/testing/selftests/net/udpgso_bench.sh   |   2 +-
 tools/testing/selftests/net/udpgso_bench_rx.c | 123 +--
 tools/testing/selftests/net/udpgso_bench_tx.c |  22 ++-
 6 files changed, 281 insertions(+), 23 deletions(-)
 create mode 100755 tools/testing/selftests/net/udpgro.sh

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index ac999354af54..a8a0d256aafb 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -7,7 +7,7 @@ CFLAGS += -I../../../../usr/include/
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh 
rtnetlink.sh
 TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh
 TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
-TEST_PROGS += udpgro_bench.sh
+TEST_PROGS += udpgro_bench.sh udpgro.sh
 TEST_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
diff --git a/tools/testing/selftests/net/udpgro.sh 
b/tools/testing/selftests/net/udpgro.sh
new file mode 100755
index ..3f12b72a3568
--- /dev/null
+++ b/tools/testing/selftests/net/udpgro.sh
@@ -0,0 +1,147 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Run a series of udpgro functional tests.
+
+readonly PEER_NS="ns-peer-$(mktemp -u XX)"
+
+cleanup() {
+   local -r jobs="$(jobs -p)"
+   local -r ns="$(ip netns list|grep $PEER_NS)"
+
+   [ -n "${jobs}" ] && kill -1 ${jobs} 2>/dev/null
+   [ -n "$ns" ] && ip netns del $ns 2>/dev/null
+}
+trap cleanup EXIT
+
+cfg_veth() {
+   ip netns add "${PEER_NS}"
+   ip -netns "${PEER_NS}" link set lo up
+   ip link add type veth
+   ip link set dev veth0 up
+   ip addr add dev veth0 192.168.1.2/24
+   ip addr add dev veth0 2001:db8::2/64 nodad
+
+   ip link set dev veth1 netns "${PEER_NS}"
+   ip -netns "${PEER_NS}" addr add dev veth1 192.168.1.1/24
+   ip -netns "${PEER_NS}" addr add dev veth1 2001:db8::1/64 nodad
+   ip -netns "${PEER_NS}" link set dev veth1 up
+}
+
+run_one() {
+   # use 'rx' as separator between sender args and receiver args
+   local -r all="$@"
+   local -r tx_args=${all%rx*}
+   local -r rx_args=${all#*rx}
+
+   cfg_veth
+
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${rx_args} && \
+   echo "ok" || \
+   echo "failed" &
+
+   # Hack: let bg programs complete the startup
+   sleep 0.1
+   ./udpgso_bench_tx ${tx_args}
+   wait $(jobs -p)
+}
+
+run_test() {
+   local -r args=$@
+
+   printf " %-40s" "$1"
+   ./in_netns.sh $0 __subprocess $2 rx -G -r -x veth1 $3
+}
+
+run_one_nat() {
+   # use 'rx' as separator between sender args and receiver args
+   local addr1 addr2 pid family="" ipt_cmd=ip6tables
+   local -r all="$@"
+   local -r tx_args=${all%rx*}
+   local -r rx_args=${all#*rx}
+
+   if [[ ${tx_args} = *-4* ]]; then
+   ipt_cmd=iptables
+   family=-4
+   addr1=192.168.1.1
+   addr2=192.168.1.3/24
+   else
+   addr1=2001:db8::1
+   addr2="2001:db8::3/64 nodad"
+   fi
+
+   cfg_veth
+   ip -netns "${PEER_NS}" addr add dev veth1 ${addr2}
+
+   # fool the GRO engine changing the destination address ...
+   ip netns exec "${PEER_NS}" $ipt_cmd -t nat -I PREROUTING -d ${addr1} -j 
DNAT --to-destination ${addr2%/*}
+
+   # ... so that GRO will match the UDP_GRO enabled socket, but packets
+   # will land on the 'plain' one
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx -G ${family} -x veth1 -b 
${addr1} -n 0 &
+   pid=$!
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${family} -b ${addr2%/*} 
${rx_args} && \
+   echo "ok" || \
+   echo "failed"&
+
+   sleep 0.1
+   ./udpgso_bench_tx ${tx_args}
+   kill -INT $pid
+   wait $(jobs -p)
+}
+
+run_nat_test() {
+   local -r args=$@
+
+   printf " %-40s" "$1"
+   ./in_netns.sh $0 __subprocess_nat $2 rx -r $3
+}
+
+run_all() {
+   local -r core_args="-l 4"
+   local -r ipv4_args="${core_args} -4 -D 192.168.1.1"
+   local -r ipv6_args="${core_args} -6 -D 2001:db8::1"
+
+   echo "ipv4"
+   run_test "no GRO" "${ipv4_args} -M 10 -s 1400" "-4 -n 10 -l 1400"
+
+   # explicitly check we are not receiving UDP_SEGMENT

[RFC PATCH v3 08/10] selftests: conditionally enable XDP support in udpgso_bench_rx

2018-10-30 Thread Paolo Abeni

XDP support will be used by a later patch to test the GRO path
in a net namespace, leveraging the veth XDP implementation.
To avoid breaking existing setup, XDP support is conditionally
enabled and build only if llc is locally available.

rfc v2 -> rfc v3:
 - move 'x' option handling here

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/Makefile  | 69 +++
 tools/testing/selftests/net/udpgso_bench_rx.c | 41 ++-
 tools/testing/selftests/net/xdp_dummy.c   | 13 
 3 files changed, 121 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/net/xdp_dummy.c

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index 256d82d5fa87..176459b7c4d6 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -16,8 +16,77 @@ TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu 
reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict tls
 
 KSFT_KHDR_INSTALL := 1
+
+# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
cmdline:
+#  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
CLANG=~/git/llvm/build/bin/clang
+LLC ?= llc
+CLANG ?= clang
+LLVM_OBJCOPY ?= llvm-objcopy
+BTF_PAHOLE ?= pahole
+HAS_LLC := $(shell which $(LLC) 2>/dev/null)
+
+# conditional enable testes requiring llc
+ifneq (, $(HAS_LLC))
+TEST_GEN_FILES += xdp_dummy.o
+endif
+
 include ../lib.mk
 
+ifneq (, $(HAS_LLC))
+
+# Detect that we're cross compiling and use the cross compiler
+ifdef CROSS_COMPILE
+CLANG_ARCH_ARGS = -target $(ARCH)
+endif
+
+PROBE := $(shell $(LLC) -march=bpf -mcpu=probe -filetype=null /dev/null 2>&1)
+
+# Let newer LLVM versions transparently probe the kernel for availability
+# of full BPF instruction set.
+ifeq ($(PROBE),)
+  CPU ?= probe
+else
+  CPU ?= generic
+endif
+
+SRC_PATH := $(abspath ../../../..)
+LIB_PATH := $(SRC_PATH)/tools/lib
+XDP_CFLAGS := -D SUPPORT_XDP=1 -I$(LIB_PATH)
+LIBBPF = $(LIB_PATH)/bpf/libbpf.a
+BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris)
+BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
+BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --help 2>&1 | grep -i 
'usage.*llvm')
+CLANG_SYS_INCLUDES := $(shell $(CLANG) -v -E - &1 \
+| sed -n '/<...> search starts here:/,/End of search list./{ s| 
\(/.*\)|-idirafter \1|p }')
+CLANG_FLAGS = -I. -I$(SRC_PATH)/include -I../bpf/ \
+ $(CLANG_SYS_INCLUDES) -Wno-compare-distinct-pointer-types
+
+ifneq ($(and $(BTF_LLC_PROBE),$(BTF_PAHOLE_PROBE),$(BTF_OBJCOPY_PROBE)),)
+   CLANG_CFLAGS += -g
+   LLC_FLAGS += -mattr=dwarfris
+   DWARF2BTF = y
+endif
+
+$(LIBBPF): FORCE
+# Fix up variables inherited from Kbuild that tools/ build system won't like
+   $(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= srctree=$(SRC_PATH) O= 
$(nodir $@)
+
+$(OUTPUT)/udpgso_bench_rx: $(OUTPUT)/udpgso_bench_rx.c $(LIBBPF)
+   $(CC) -o $@ $(XDP_CFLAGS) $(CFLAGS) $(LOADLIBES) $(LDLIBS) $^ -lelf
+
+FORCE:
+
+# bpf program[s] generation
+$(OUTPUT)/%.o: %.c
+   $(CLANG) $(CLANG_FLAGS) \
+-O2 -target bpf -emit-llvm -c $< -o - |  \
+   $(LLC) -march=bpf -mcpu=$(CPU) $(LLC_FLAGS) -filetype=obj -o $@
+ifeq ($(DWARF2BTF),y)
+   $(BTF_PAHOLE) -J $@
+endif
+
+endif
+
 $(OUTPUT)/reuseport_bpf_numa: LDFLAGS += -lnuma
 $(OUTPUT)/tcp_mmap: LDFLAGS += -lpthread
 $(OUTPUT)/tcp_inq: LDFLAGS += -lpthread
diff --git a/tools/testing/selftests/net/udpgso_bench_rx.c 
b/tools/testing/selftests/net/udpgso_bench_rx.c
index 8f48d7fb32cf..5dcb719abe04 100644
--- a/tools/testing/selftests/net/udpgso_bench_rx.c
+++ b/tools/testing/selftests/net/udpgso_bench_rx.c
@@ -31,6 +31,10 @@
 #include 
 #include 
 
+#ifdef SUPPORT_XDP
+#include "bpf/libbpf.h"
+#endif
+
 #ifndef UDP_GRO
 #define UDP_GRO104
 #endif
@@ -40,6 +44,9 @@ static bool cfg_tcp;
 static bool cfg_verify;
 static bool cfg_read_all;
 static bool cfg_gro_segment;
+#ifdef SUPPORT_XDP
+static int cfg_xdp_iface;
+#endif
 
 static bool interrupted;
 static unsigned long packets, bytes;
@@ -202,14 +209,14 @@ static void do_flush_udp(int fd)
 
 static void usage(const char *filepath)
 {
-   error(1, 0, "Usage: %s [-Grtv] [-p port]", filepath);
+   error(1, 0, "Usage: %s [-Grtv] [-p port] [-x device]", filepath);
 }
 
 static void parse_opts(int argc, char **argv)
 {
int c;
 
-   while ((c = getopt(argc, argv, "Gp:rtv")) != -1) {
+   while ((c = getopt(argc, argv, "Gp:rtvx:")) != -1) {
switch (c) {
case 'G':
cfg_gro_segment = true;
@@ -227,6 +234,13 @@ static void parse_opts(int argc, char **argv)
cfg_verify = true;
cfg_read_all = true;
break;
+#ifdef SUPPORT_XDP
+   case 'x':
+   cfg_xdp_iface = if_nametoindex(optarg);
+   if (!cfg_xdp_iface)
+

[RFC PATCH v3 07/10] selftests: add GRO support to udp bench rx program

2018-10-30 Thread Paolo Abeni

And fix a couple of buglets (port option processing,
clean termination on SIGINT). This is preparatory work
for GRO tests.

rfc v2 -> rfc v3:
 - use ETH_MAX_MTU

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/udpgso_bench_rx.c | 37 +++
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/net/udpgso_bench_rx.c 
b/tools/testing/selftests/net/udpgso_bench_rx.c
index 727cf67a3f75..8f48d7fb32cf 100644
--- a/tools/testing/selftests/net/udpgso_bench_rx.c
+++ b/tools/testing/selftests/net/udpgso_bench_rx.c
@@ -31,9 +31,15 @@
 #include 
 #include 
 
+#ifndef UDP_GRO
+#define UDP_GRO104
+#endif
+
 static int  cfg_port   = 8000;
 static bool cfg_tcp;
 static bool cfg_verify;
+static bool cfg_read_all;
+static bool cfg_gro_segment;
 
 static bool interrupted;
 static unsigned long packets, bytes;
@@ -63,6 +69,8 @@ static void do_poll(int fd)
 
do {
ret = poll(, 1, 10);
+   if (interrupted)
+   break;
if (ret == -1)
error(1, errno, "poll");
if (ret == 0)
@@ -70,7 +78,7 @@ static void do_poll(int fd)
if (pfd.revents != POLLIN)
error(1, errno, "poll: 0x%x expected 0x%x\n",
pfd.revents, POLLIN);
-   } while (!ret && !interrupted);
+   } while (!ret);
 }
 
 static int do_socket(bool do_tcp)
@@ -102,6 +110,8 @@ static int do_socket(bool do_tcp)
error(1, errno, "listen");
 
do_poll(accept_fd);
+   if (interrupted)
+   exit(0);
 
fd = accept(accept_fd, NULL, NULL);
if (fd == -1)
@@ -167,10 +177,10 @@ static void do_verify_udp(const char *data, int len)
 /* Flush all outstanding datagrams. Verify first few bytes of each. */
 static void do_flush_udp(int fd)
 {
-   static char rbuf[ETH_DATA_LEN];
+   static char rbuf[ETH_MAX_MTU];
int ret, len, budget = 256;
 
-   len = cfg_verify ? sizeof(rbuf) : 0;
+   len = cfg_read_all ? sizeof(rbuf) : 0;
while (budget--) {
/* MSG_TRUNC will make return value full datagram length */
ret = recv(fd, rbuf, len, MSG_TRUNC | MSG_DONTWAIT);
@@ -178,7 +188,7 @@ static void do_flush_udp(int fd)
return;
if (ret == -1)
error(1, errno, "recv");
-   if (len) {
+   if (len && cfg_verify) {
if (ret == 0)
error(1, errno, "recv: 0 byte datagram\n");
 
@@ -192,23 +202,30 @@ static void do_flush_udp(int fd)
 
 static void usage(const char *filepath)
 {
-   error(1, 0, "Usage: %s [-tv] [-p port]", filepath);
+   error(1, 0, "Usage: %s [-Grtv] [-p port]", filepath);
 }
 
 static void parse_opts(int argc, char **argv)
 {
int c;
 
-   while ((c = getopt(argc, argv, "ptv")) != -1) {
+   while ((c = getopt(argc, argv, "Gp:rtv")) != -1) {
switch (c) {
+   case 'G':
+   cfg_gro_segment = true;
+   break;
case 'p':
-   cfg_port = htons(strtoul(optarg, NULL, 0));
+   cfg_port = strtoul(optarg, NULL, 0);
+   break;
+   case 'r':
+   cfg_read_all = true;
break;
case 't':
cfg_tcp = true;
break;
case 'v':
cfg_verify = true;
+   cfg_read_all = true;
break;
}
}
@@ -227,6 +244,12 @@ static void do_recv(void)
 
fd = do_socket(cfg_tcp);
 
+   if (cfg_gro_segment && !cfg_tcp) {
+   int val = 1;
+   if (setsockopt(fd, IPPROTO_UDP, UDP_GRO, , sizeof(val)))
+   error(1, errno, "setsockopt UDP_GRO");
+   }
+
treport = gettimeofday_ms() + 1000;
do {
do_poll(fd);
-- 
2.17.2

[RFC PATCH v3 02/10] udp: implement GRO for plain UDP sockets.

2018-10-30 Thread Paolo Abeni

This is the RX counterpart of commit bec1f6f69736 ("udp: generate gso
with UDP_SEGMENT"). When UDP_GRO is enabled, such socket is also
eligible for GRO in the rx path: UDP segments directed to such socket
are assembled into a larger GSO_UDP_L4 packet.

The core UDP GRO support is enabled with setsockopt(UDP_GRO).

Initial benchmark numbers:

Before:
udp rx:   1079 MB/s   769065 calls/s

After:
udp rx:   1466 MB/s24877 calls/s

This change introduces a side effect in respect to UDP tunnels:
after a UDP tunnel creation, now the kernel performs a lookup per ingress
UDP packet, while before such lookup happened only if the ingress packet
carried a valid internal header csum.

rfc v2 -> rfc v3:
 - fixed typos in macro name and comments
 - really enforce UDP_GRO_CNT_MAX, instead of UDP_GRO_CNT_MAX + 1
 - acquire socket lock in UDP_GRO setsockopt

rfc v1 -> rfc v2:
 - use a new option to enable UDP GRO
 - use static keys to protect the UDP GRO socket lookup

Signed-off-by: Paolo Abeni 
--
Note: I opted for acquiring the socket lock only for the newly introduced
setsockopt instead for every value, despite the previous conversation on
this topic, to avoid introducing somewhat larger and unrelated changes.
---
 include/linux/udp.h  |   3 +-
 include/uapi/linux/udp.h |   1 +
 net/ipv4/udp.c   |   8 +++
 net/ipv4/udp_offload.c   | 109 +++
 net/ipv6/udp_offload.c   |   6 +--
 5 files changed, 99 insertions(+), 28 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index a4dafff407fb..f613b329852e 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -50,11 +50,12 @@ struct udp_sock {
__u8 encap_type;/* Is this an Encapsulation socket? */
unsigned charno_check6_tx:1,/* Send zero UDP6 checksums on TX? */
 no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */
-encap_enabled:1; /* This socket enabled encap
+encap_enabled:1, /* This socket enabled encap
   * processing; UDP tunnels and
   * different encapsulation layer set
   * this
   */
+gro_enabled:1; /* Can accept GRO packets */
/*
 * Following member retains the information to create a UDP header
 * when the socket is uncorked.
diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h
index 09502de447f5..30baccb6c9c4 100644
--- a/include/uapi/linux/udp.h
+++ b/include/uapi/linux/udp.h
@@ -33,6 +33,7 @@ struct udphdr {
 #define UDP_NO_CHECK6_TX 101   /* Disable sending checksum for UDP6X */
 #define UDP_NO_CHECK6_RX 102   /* Disable accpeting checksum for UDP6 */
 #define UDP_SEGMENT103 /* Set GSO segmentation size */
+#define UDP_GRO104 /* This socket can receive UDP GRO 
packets */
 
 /* UDP encapsulation types */
 #define UDP_ENCAP_ESPINUDP_NON_IKE 1 /* draft-ietf-ipsec-nat-t-ike-00/01 */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c51721fb293a..4d4f4d044c28 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2474,6 +2474,14 @@ int udp_lib_setsockopt(struct sock *sk, int level, int 
optname,
up->gso_size = val;
break;
 
+   case UDP_GRO:
+   lock_sock(sk);
+   if (valbool)
+   udp_tunnel_encap_enable(sk->sk_socket);
+   up->gro_enabled = valbool;
+   release_sock(sk);
+   break;
+
/*
 *  UDP-Lite's partial checksum coverage (RFC 3828).
 */
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 802f2bc00d69..0646d61f4fa8 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -343,6 +343,54 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff 
*skb,
return segs;
 }
 
+#define UDP_GRO_CNT_MAX 64
+static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
+  struct sk_buff *skb)
+{
+   struct udphdr *uh = udp_hdr(skb);
+   struct sk_buff *pp = NULL;
+   struct udphdr *uh2;
+   struct sk_buff *p;
+
+   /* requires non zero csum, for symmetry with GSO */
+   if (!uh->check) {
+   NAPI_GRO_CB(skb)->flush = 1;
+   return NULL;
+   }
+
+   /* pull encapsulating udp header */
+   skb_gro_pull(skb, sizeof(struct udphdr));
+   skb_gro_postpull_rcsum(skb, uh, sizeof(struct udphdr));
+
+   list_for_each_entry(p, head, list) {
+   if (!NAPI_GRO_CB(p)->same_flow)
+   continue;
+
+   uh2 = udp_hdr(p);
+
+   /* Match ports only, as csum is always non zero */
+   if ((*(u32 *)>source != *(u32 *)>source)) {
+   NAPI_GRO_CB(p)->same_flow = 0;
+

[RFC PATCH v3 09/10] selftests: add some benchmark for UDP GRO

2018-10-30 Thread Paolo Abeni

Run on top of veth pair, using a dummy XDP program to enable the GRO.

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/Makefile|  1 +
 tools/testing/selftests/net/udpgro_bench.sh | 92 +
 2 files changed, 93 insertions(+)
 create mode 100755 tools/testing/selftests/net/udpgro_bench.sh

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index 176459b7c4d6..ac999354af54 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -7,6 +7,7 @@ CFLAGS += -I../../../../usr/include/
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh 
rtnetlink.sh
 TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh
 TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
+TEST_PROGS += udpgro_bench.sh
 TEST_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
diff --git a/tools/testing/selftests/net/udpgro_bench.sh 
b/tools/testing/selftests/net/udpgro_bench.sh
new file mode 100755
index ..03d37e5e7424
--- /dev/null
+++ b/tools/testing/selftests/net/udpgro_bench.sh
@@ -0,0 +1,92 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Run a series of udpgro benchmarks
+
+readonly PEER_NS="ns-peer-$(mktemp -u XX)"
+
+cleanup() {
+   local -r jobs="$(jobs -p)"
+   local -r ns="$(ip netns list|grep $PEER_NS)"
+
+   [ -n "${jobs}" ] && kill -INT ${jobs} 2>/dev/null
+   [ -n "$ns" ] && ip netns del $ns 2>/dev/null
+}
+trap cleanup EXIT
+
+run_one() {
+   # use 'rx' as separator between sender args and receiver args
+   local -r all="$@"
+   local -r tx_args=${all%rx*}
+   local -r rx_args=${all#*rx}
+
+   ip netns add "${PEER_NS}"
+   ip -netns "${PEER_NS}" link set lo up
+   ip link add type veth
+   ip link set dev veth0 up
+   ip addr add dev veth0 192.168.1.2/24
+   ip addr add dev veth0 2001:db8::2/64 nodad
+
+   ip link set dev veth1 netns "${PEER_NS}"
+   ip -netns "${PEER_NS}" addr add dev veth1 192.168.1.1/24
+   ip -netns "${PEER_NS}" addr add dev veth1 2001:db8::1/64 nodad
+   ip -netns "${PEER_NS}" link set dev veth1 up
+
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${rx_args} -r -x veth1 &
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx -t ${rx_args} -r &
+
+   # Hack: let bg programs complete the startup
+   sleep 0.1
+   ./udpgso_bench_tx ${tx_args}
+}
+
+run_in_netns() {
+   local -r args=$@
+
+   ./in_netns.sh $0 __subprocess ${args}
+}
+
+run_udp() {
+   local -r args=$@
+
+   echo "udp gso - over veth touching data"
+   run_in_netns ${args} -S rx
+
+   echo "udp gso and gro - over veth touching data"
+   run_in_netns ${args} -S rx -G
+}
+
+run_tcp() {
+   local -r args=$@
+
+   echo "tcp - over veth touching data"
+   run_in_netns ${args} -t rx
+}
+
+run_all() {
+   local -r core_args="-l 4"
+   local -r ipv4_args="${core_args} -4 -D 192.168.1.1"
+   local -r ipv6_args="${core_args} -6 -D 2001:db8::1"
+
+   echo "ipv4"
+   run_tcp "${ipv4_args}"
+   run_udp "${ipv4_args}"
+
+   echo "ipv6"
+   run_tcp "${ipv4_args}"
+   run_udp "${ipv6_args}"
+}
+
+if [ ! -f xdp_dummy.o ]; then
+   echo "Skipping GRO benchmarks - missing LLC"
+   exit 0
+fi
+
+if [[ $# -eq 0 ]]; then
+   run_all
+elif [[ $1 == "__subprocess" ]]; then
+   shift
+   run_one $@
+else
+   run_in_netns $@
+fi
-- 
2.17.2

[RFC PATCH v3 01/10] udp: implement complete book-keeping for encap_needed

2018-10-30 Thread Paolo Abeni

The *encap_needed static keys are enabled by UDP tunnels
and several UDP encapsulations type, but they are never
turned off. This can cause unneeded overall performance
degradation for systems where such features are used
transiently.

This patch introduces complete book-keeping for such keys,
decreasing the usage at socket destruction time, if needed,
and avoiding that the same socket could increase the key
usage multiple times.

rfc v2 - rfc v3:
 - use udp_tunnel_encap_enable() in setsockopt()

Signed-off-by: Paolo Abeni 
---
 include/linux/udp.h  |  7 ++-
 include/net/udp_tunnel.h |  6 ++
 net/ipv4/udp.c   | 17 +++--
 net/ipv6/udp.c   | 14 +-
 4 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 320d49d85484..a4dafff407fb 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -49,7 +49,12 @@ struct udp_sock {
unsigned int corkflag;  /* Cork is required */
__u8 encap_type;/* Is this an Encapsulation socket? */
unsigned charno_check6_tx:1,/* Send zero UDP6 checksums on TX? */
-no_check6_rx:1;/* Allow zero UDP6 checksums on RX? */
+no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */
+encap_enabled:1; /* This socket enabled encap
+  * processing; UDP tunnels and
+  * different encapsulation layer set
+  * this
+  */
/*
 * Following member retains the information to create a UDP header
 * when the socket is uncorked.
diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index fe680ab6b15a..3fbe56430e3b 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -165,6 +165,12 @@ static inline int udp_tunnel_handle_offloads(struct 
sk_buff *skb, bool udp_csum)
 
 static inline void udp_tunnel_encap_enable(struct socket *sock)
 {
+   struct udp_sock *up = udp_sk(sock->sk);
+
+   if (up->encap_enabled)
+   return;
+
+   up->encap_enabled = 1;
 #if IS_ENABLED(CONFIG_IPV6)
if (sock->sk->sk_family == PF_INET6)
ipv6_stub->udpv6_encap_enable();
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ca3ed931f2a9..c51721fb293a 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -115,6 +115,7 @@
 #include "udp_impl.h"
 #include 
 #include 
+#include 
 
 struct udp_table udp_table __read_mostly;
 EXPORT_SYMBOL(udp_table);
@@ -2398,11 +2399,15 @@ void udp_destroy_sock(struct sock *sk)
bool slow = lock_sock_fast(sk);
udp_flush_pending_frames(sk);
unlock_sock_fast(sk, slow);
-   if (static_branch_unlikely(_encap_needed_key) && up->encap_type) {
-   void (*encap_destroy)(struct sock *sk);
-   encap_destroy = READ_ONCE(up->encap_destroy);
-   if (encap_destroy)
-   encap_destroy(sk);
+   if (static_branch_unlikely(_encap_needed_key)) {
+   if (up->encap_type) {
+   void (*encap_destroy)(struct sock *sk);
+   encap_destroy = READ_ONCE(up->encap_destroy);
+   if (encap_destroy)
+   encap_destroy(sk);
+   }
+   if (up->encap_enabled)
+   static_branch_disable(_encap_needed_key);
}
 }
 
@@ -2447,7 +2452,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int 
optname,
/* FALLTHROUGH */
case UDP_ENCAP_L2TPINUDP:
up->encap_type = val;
-   udp_encap_enable();
+   udp_tunnel_encap_enable(sk->sk_socket);
break;
default:
err = -ENOPROTOOPT;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index d2d97d07ef27..fc0ce6c59ebb 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1458,11 +1458,15 @@ void udpv6_destroy_sock(struct sock *sk)
udp_v6_flush_pending_frames(sk);
release_sock(sk);
 
-   if (static_branch_unlikely(_encap_needed_key) && up->encap_type) {
-   void (*encap_destroy)(struct sock *sk);
-   encap_destroy = READ_ONCE(up->encap_destroy);
-   if (encap_destroy)
-   encap_destroy(sk);
+   if (static_branch_unlikely(_encap_needed_key)) {
+   if (up->encap_type) {
+   void (*encap_destroy)(struct sock *sk);
+   encap_destroy = READ_ONCE(up->encap_destroy);
+   if (encap_destroy)
+   encap_destroy(sk);
+   }
+   if (up->encap_enabled)
+   static_branch_disable(_encap_needed_key);
}
 
inet6_destroy_sock(sk);

[RFC PATCH v3 04/10] ip: factor out protocol delivery helper

2018-10-30 Thread Paolo Abeni

So that we can re-use it at the UDP lavel in a later patch

Signed-off-by: Paolo Abeni 
---
 net/ipv4/ip_input.c | 73 ++---
 1 file changed, 36 insertions(+), 37 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 35a786c0aaa0..72250b4e466d 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -188,51 +188,50 @@ bool ip_call_ra_chain(struct sk_buff *skb)
return false;
 }
 
-static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct 
sk_buff *skb)
+void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
protocol)
 {
-   __skb_pull(skb, skb_network_header_len(skb));
-
-   rcu_read_lock();
-   {
-   int protocol = ip_hdr(skb)->protocol;
-   const struct net_protocol *ipprot;
-   int raw;
+   const struct net_protocol *ipprot;
+   int raw, ret;
 
-   resubmit:
-   raw = raw_local_deliver(skb, protocol);
+resubmit:
+   raw = raw_local_deliver(skb, protocol);
 
-   ipprot = rcu_dereference(inet_protos[protocol]);
-   if (ipprot) {
-   int ret;
-
-   if (!ipprot->no_policy) {
-   if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, 
skb)) {
-   kfree_skb(skb);
-   goto out;
-   }
-   nf_reset(skb);
+   ipprot = rcu_dereference(inet_protos[protocol]);
+   if (ipprot) {
+   if (!ipprot->no_policy) {
+   if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
+   kfree_skb(skb);
+   return;
}
-   ret = ipprot->handler(skb);
-   if (ret < 0) {
-   protocol = -ret;
-   goto resubmit;
+   nf_reset(skb);
+   }
+   ret = ipprot->handler(skb);
+   if (ret < 0) {
+   protocol = -ret;
+   goto resubmit;
+   }
+   __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+   } else {
+   if (!raw) {
+   if (xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
+   __IP_INC_STATS(net, 
IPSTATS_MIB_INUNKNOWNPROTOS);
+   icmp_send(skb, ICMP_DEST_UNREACH,
+ ICMP_PROT_UNREACH, 0);
}
-   __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+   kfree_skb(skb);
} else {
-   if (!raw) {
-   if (xfrm4_policy_check(NULL, XFRM_POLICY_IN, 
skb)) {
-   __IP_INC_STATS(net, 
IPSTATS_MIB_INUNKNOWNPROTOS);
-   icmp_send(skb, ICMP_DEST_UNREACH,
- ICMP_PROT_UNREACH, 0);
-   }
-   kfree_skb(skb);
-   } else {
-   __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
-   consume_skb(skb);
-   }
+   __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+   consume_skb(skb);
}
}
- out:
+}
+
+static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct 
sk_buff *skb)
+{
+   __skb_pull(skb, skb_network_header_len(skb));
+
+   rcu_read_lock();
+   ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
rcu_read_unlock();
 
return 0;
-- 
2.17.2

[RFC PATCH v3 06/10] udp: cope with UDP GRO packet misdirection

2018-10-30 Thread Paolo Abeni

In some scenarios, the GRO engine can assemble an UDP GRO packet
that ultimately lands on a non GRO-enabled socket.
This patch tries to address the issue explicitly checking for the UDP
socket features before enqueuing the packet, and eventually segmenting
the unexpected GRO packet, as needed.

We must also cope with re-insertion requests: after segmentation the
UDP code calls the helper introduced by the previous patches, as needed.

Segmentation is performed by a common helper, which takes care of
updating socket and protocol stats is case of failure.

rfc v2 -> rfc v3
 - moved udp_rcv_segment() into net/udp.h, account errors to socket
   and ns, always return NULL or segs list

Signed-off-by: Paolo Abeni 
---
 include/linux/udp.h |  6 ++
 include/net/udp.h   | 51 ++---
 net/ipv4/udp.c  | 25 +-
 net/ipv6/udp.c  | 27 +++-
 4 files changed, 99 insertions(+), 10 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index e23d5024f42f..0a9c54e76305 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -132,6 +132,12 @@ static inline void udp_cmsg_recv(struct msghdr *msg, 
struct sock *sk,
}
 }
 
+static inline bool udp_unexpected_gso(struct sock *sk, struct sk_buff *skb)
+{
+   return !udp_sk(sk)->gro_enabled && skb_is_gso(skb) &&
+  skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4;
+}
+
 #define udp_portaddr_for_each_entry(__sk, list) \
hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
 
diff --git a/include/net/udp.h b/include/net/udp.h
index 9e82cb391dea..f94aed316a04 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -406,17 +406,24 @@ static inline int copy_linear_skb(struct sk_buff *skb, 
int len, int off,
 } while(0)
 
 #if IS_ENABLED(CONFIG_IPV6)
-#define __UDPX_INC_STATS(sk, field)\
-do {   \
-   if ((sk)->sk_family == AF_INET) \
-   __UDP_INC_STATS(sock_net(sk), field, 0);\
-   else\
-   __UDP6_INC_STATS(sock_net(sk), field, 0);   \
-} while (0)
+#define __UDPX_MIB(sk, ipv4)   \
+({ \
+   ipv4 ? (IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_statistics : \
+sock_net(sk)->mib.udp_statistics) :\
+   (IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_stats_in6 : \
+sock_net(sk)->mib.udp_stats_in6);  \
+})
 #else
-#define __UDPX_INC_STATS(sk, field) __UDP_INC_STATS(sock_net(sk), field, 0)
+#define __UDPX_MIB(sk, ipv4)   \
+({ \
+   IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_statistics : \
+sock_net(sk)->mib.udp_statistics;  \
+})
 #endif
 
+#define __UDPX_INC_STATS(sk, field) \
+   __SNMP_INC_STATS(__UDPX_MIB(sk, (sk)->sk_family == AF_INET, field)
+
 #ifdef CONFIG_PROC_FS
 struct udp_seq_afinfo {
sa_family_t family;
@@ -450,4 +457,32 @@ DECLARE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
 void udpv6_encap_enable(void);
 #endif
 
+static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
+ struct sk_buff *skb)
+{
+   bool ipv4 = skb->protocol == htons(ETH_P_IP);
+   int segs_nr = skb_shinfo(skb)->gso_segs;
+   struct sk_buff *segs;
+
+   /* the GSO CB lays after the UDP one, no need to save and restore any
+* CB fragment
+*/
+   segs = __skb_gso_segment(skb, NETIF_F_SG, false);
+   if (unlikely(IS_ERR(segs))) {
+   kfree_skb(skb);
+   goto drop;
+   }
+
+   if (unlikely(!segs))
+   goto drop;
+
+   consume_skb(skb);
+   return segs;
+
+drop:
+   atomic_add(segs_nr, >sk_drops);
+   SNMP_ADD_STATS(__UDPX_MIB(sk, ipv4), UDP_MIB_INERRORS, segs_nr);
+   return NULL;
+}
+
 #endif /* _UDP_H */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index b345f71b1cbb..b45033f63673 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1909,7 +1909,7 @@ EXPORT_SYMBOL(udp_encap_enable);
  * Note that in the success and error cases, the skb is assumed to
  * have either been requeued or freed.
  */
-static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+static int udp_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
 {
struct udp_sock *up = udp_sk(sk);
int is_udplite = IS_UDPLITE(sk);
@@ -2012,6 +2012,29 @@ static int udp_queue_rcv_skb(struct sock *sk, struct 
sk_buff *skb)
return -1;
 }
 
+void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int

[RFC PATCH v3 05/10] ipv6: factor out protocol delivery helper

2018-10-30 Thread Paolo Abeni

So that we can re-use it at the UDP lavel in the next patch

Signed-off-by: Paolo Abeni 
---
 net/ipv6/ip6_input.c | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 96577e742afd..3065226bdc57 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -319,28 +319,26 @@ void ipv6_list_rcv(struct list_head *head, struct 
packet_type *pt,
 /*
  * Deliver the packet to the host
  */
-
-
-static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff 
*skb)
+void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
nexthdr,
+ bool have_final)
 {
const struct inet6_protocol *ipprot;
struct inet6_dev *idev;
unsigned int nhoff;
-   int nexthdr;
bool raw;
-   bool have_final = false;
 
/*
 *  Parse extension headers
 */
 
-   rcu_read_lock();
 resubmit:
idev = ip6_dst_idev(skb_dst(skb));
-   if (!pskb_pull(skb, skb_transport_offset(skb)))
-   goto discard;
nhoff = IP6CB(skb)->nhoff;
-   nexthdr = skb_network_header(skb)[nhoff];
+   if (!have_final) {
+   if (!pskb_pull(skb, skb_transport_offset(skb)))
+   goto discard;
+   nexthdr = skb_network_header(skb)[nhoff];
+   }
 
 resubmit_final:
raw = raw6_local_deliver(skb, nexthdr);
@@ -411,13 +409,19 @@ static int ip6_input_finish(struct net *net, struct sock 
*sk, struct sk_buff *sk
consume_skb(skb);
}
}
-   rcu_read_unlock();
-   return 0;
+   return;
 
 discard:
__IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS);
-   rcu_read_unlock();
kfree_skb(skb);
+}
+
+static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff 
*skb)
+{
+   rcu_read_lock();
+   ip6_protocol_deliver_rcu(net, skb, 0, false);
+   rcu_read_unlock();
+
return 0;
 }
 
-- 
2.17.2

[RFC PATCH v3 03/10] udp: add support for UDP_GRO cmsg

2018-10-30 Thread Paolo Abeni

When UDP GRO is enabled, the UDP_GRO cmsg will carry the ingress
datagram size. User-space can use such info to compute the original
packets layout.

Signed-off-by: Paolo Abeni 
---
Note: I avoided setting a bit in cmsg_flag for UDP_GRO, as that
attempt produced some uglyfication, expecially on the ipv6 side
with no measurable performances benefits.
---
 include/linux/udp.h | 11 +++
 net/ipv4/udp.c  |  4 
 net/ipv6/udp.c  |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index f613b329852e..e23d5024f42f 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -121,6 +121,17 @@ static inline bool udp_get_no_check6_rx(struct sock *sk)
return udp_sk(sk)->no_check6_rx;
 }
 
+static inline void udp_cmsg_recv(struct msghdr *msg, struct sock *sk,
+struct sk_buff *skb)
+{
+   int gso_size;
+
+   if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) {
+   gso_size = skb_shinfo(skb)->gso_size;
+   put_cmsg(msg, SOL_UDP, UDP_GRO, sizeof(gso_size), _size);
+   }
+}
+
 #define udp_portaddr_for_each_entry(__sk, list) \
hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 4d4f4d044c28..b345f71b1cbb 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1714,6 +1714,10 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int noblock,
memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
*addr_len = sizeof(*sin);
}
+
+   if (udp_sk(sk)->gro_enabled)
+   udp_cmsg_recv(msg, sk, skb);
+
if (inet->cmsg_flags)
ip_cmsg_recv_offset(msg, sk, skb, sizeof(struct udphdr), off);
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index fc0ce6c59ebb..8e76e719305c 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -421,6 +421,9 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len,
*addr_len = sizeof(*sin6);
}
 
+   if (udp_sk(sk)->gro_enabled)
+   udp_cmsg_recv(msg, sk, skb);
+
if (np->rxopt.all)
ip6_datagram_recv_common_ctl(sk, msg, skb);
 
-- 
2.17.2

[RFC PATCH v3 00/10] udp: implement GRO support

2018-10-30 Thread Paolo Abeni

This series implements GRO support for UDP sockets, as the RX counterpart
of commit bec1f6f69736 ("udp: generate gso with UDP_SEGMENT").
The core functionality is implemented by the second patch, introducing a new
sockopt to enable UDP_GRO, while patch 3 implements support for passing the
segment size to the user space via a new cmsg.
UDP GRO performs a socket lookup for each ingress packets and aggregate datagram
directed to UDP GRO enabled sockets with constant l4 tuple.

UDP GRO packets can land on non GRO-enabled sockets, e.g. due to iptables NAT
rules, and that could potentially confuse existing applications.

The solution adopted here is to de-segment the GRO packet before enqueuing
as needed. Since we must cope with packet reinsertion after de-segmentation,
the relevant code is factored-out in ipv4 and ipv6 specific helpers and exposed
to UDP usage.

While the current code can probably be improved, this safeguard ,implemented in
the patches 4-7, allows future enachements to enable UDP GSO offload on more
virtual devices eventually even on forwarded packets.

The last 4 for patches implement some performance and functional self-tests,
re-using the existing udpgso infrastructure. The problematic scenario described
above is explicitly tested.

This revision of the series try to address the feedback provided by Willem,
Steffen and Subash fixing several bugs all along

rfc v2 - rfc v3:
 - cope better with exceptional conditions
 - test cases cleanup

rfc v1 - rfc v2:
 - use a new option to enable UDP GRO
 - use static keys to protect the UDP GRO socket lookup
 - cope with UDP GRO misdirection
 - add self-tests

Paolo Abeni (10):
  udp: implement complete book-keeping for encap_needed
  udp: implement GRO for plain UDP sockets.
  udp: add support for UDP_GRO cmsg
  ip: factor out protocol delivery helper
  ipv6: factor out protocol delivery helper
  udp: cope with UDP GRO packet misdirection
  selftests: add GRO support to udp bench rx program
  selftests: conditionally enable XDP support in udpgso_bench_rx
  selftests: add some benchmark for UDP GRO
  selftests: add functionals test for UDP GRO

 include/linux/udp.h   |  25 ++-
 include/net/udp.h |  51 -
 include/net/udp_tunnel.h  |   6 +
 include/uapi/linux/udp.h  |   1 +
 net/ipv4/ip_input.c   |  73 ---
 net/ipv4/udp.c|  54 -
 net/ipv4/udp_offload.c| 109 --
 net/ipv6/ip6_input.c  |  28 +--
 net/ipv6/udp.c|  44 +++-
 net/ipv6/udp_offload.c|   6 +-
 tools/testing/selftests/net/Makefile  |  70 +++
 tools/testing/selftests/net/udpgro.sh | 147 +
 tools/testing/selftests/net/udpgro_bench.sh   |  94 +
 tools/testing/selftests/net/udpgso_bench.sh   |   2 +-
 tools/testing/selftests/net/udpgso_bench_rx.c | 193 --
 tools/testing/selftests/net/udpgso_bench_tx.c |  22 +-
 tools/testing/selftests/net/xdp_dummy.c   |  13 ++
 17 files changed, 816 insertions(+), 122 deletions(-)
 create mode 100755 tools/testing/selftests/net/udpgro.sh
 create mode 100755 tools/testing/selftests/net/udpgro_bench.sh
 create mode 100644 tools/testing/selftests/net/xdp_dummy.c

-- 
2.17.2

Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed

2018-10-30 Thread David Ahern

On 10/30/18 10:38 AM, Stephen Hemminger wrote:
> On Tue, 30 Oct 2018 10:34:45 -0600
> David Ahern  wrote:
> 
>> On 10/30/18 9:05 AM, Stefano Brivio wrote:
>>> Now that we have an abstraction for columns, it's relatively easy to
>>> selectively display only some of them, and Yoann has a use case for it.
>>>
>>> Patch 1/3 fixes a rendering issue that shows up only when display of
>>> arbitrary columns is disabled. Patch 2/3 implements the relevant option,
>>> and patch 3/3 makes the output more readable when some columns are
>>> disabled.
>>>
>>>  
>>
>> I like the intent, and I have prototyped something similar for 'ip'.
>>
>> A more flexible approach is to use format strings to allow users to
>> customize the output order and whitespace as well. So for ss and your
>> column list (winging it here):
>>
>> netid  = %N
>> state  = %S
>> recv Q = %Qr
>> send Q = %Qs
>> local address  = %Al
>> lport port = %Pl
>> remote address = %Ar
>> remote port= %Pr
>> process data   = %p
>> ...
>>
>> then a format string could be: "%S  %Qr %Qs  %Al:%Pl %Ar:%Pr  %p\n"
>>
>> or for csv output: "%S,%Qr,%Qs,%Al,%Pl,%Ar,%Pr,%p\n"
>>
>> I have not had time to look into an implementation for ip. Conceptually
>> - and scanning the kernel's vsprintf code - it does not look that
>> difficult, just time consuming on the frontend with the initial setup.
> 
> The problem with custom formats is that you lose all ability for Gcc
> to check format strings.
> 

Sure, trade-offs. A custom print string is powerful.

While selecting columns is an improvement, column ordering is also
important - even handling other output formats (csv).

Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed

2018-10-30 Thread Stephen Hemminger

On Tue, 30 Oct 2018 10:34:45 -0600
David Ahern  wrote:

> On 10/30/18 9:05 AM, Stefano Brivio wrote:
> > Now that we have an abstraction for columns, it's relatively easy to
> > selectively display only some of them, and Yoann has a use case for it.
> > 
> > Patch 1/3 fixes a rendering issue that shows up only when display of
> > arbitrary columns is disabled. Patch 2/3 implements the relevant option,
> > and patch 3/3 makes the output more readable when some columns are
> > disabled.
> > 
> >  
> 
> I like the intent, and I have prototyped something similar for 'ip'.
> 
> A more flexible approach is to use format strings to allow users to
> customize the output order and whitespace as well. So for ss and your
> column list (winging it here):
> 
> netid  = %N
> state  = %S
> recv Q = %Qr
> send Q = %Qs
> local address  = %Al
> lport port = %Pl
> remote address = %Ar
> remote port= %Pr
> process data   = %p
> ...
> 
> then a format string could be: "%S  %Qr %Qs  %Al:%Pl %Ar:%Pr  %p\n"
> 
> or for csv output: "%S,%Qr,%Qs,%Al,%Pl,%Ar,%Pr,%p\n"
> 
> I have not had time to look into an implementation for ip. Conceptually
> - and scanning the kernel's vsprintf code - it does not look that
> difficult, just time consuming on the frontend with the initial setup.

The problem with custom formats is that you lose all ability for Gcc
to check format strings.

Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed

2018-10-30 Thread David Ahern

On 10/30/18 9:05 AM, Stefano Brivio wrote:
> Now that we have an abstraction for columns, it's relatively easy to
> selectively display only some of them, and Yoann has a use case for it.
> 
> Patch 1/3 fixes a rendering issue that shows up only when display of
> arbitrary columns is disabled. Patch 2/3 implements the relevant option,
> and patch 3/3 makes the output more readable when some columns are
> disabled.
> 
>

I like the intent, and I have prototyped something similar for 'ip'.

A more flexible approach is to use format strings to allow users to
customize the output order and whitespace as well. So for ss and your
column list (winging it here):

netid  = %N
state  = %S
recv Q = %Qr
send Q = %Qs
local address  = %Al
lport port = %Pl
remote address = %Ar
remote port= %Pr
process data   = %p
...

then a format string could be: "%S  %Qr %Qs  %Al:%Pl %Ar:%Pr  %p\n"

or for csv output: "%S,%Qr,%Qs,%Al,%Pl,%Ar,%Pr,%p\n"

I have not had time to look into an implementation for ip. Conceptually
- and scanning the kernel's vsprintf code - it does not look that
difficult, just time consuming on the frontend with the initial setup.

[RFC bpf-next] libbpf: increase rlimit before trying to create BPF maps

2018-10-30 Thread Quentin Monnet

The limit for memory locked in the kernel by a process is usually set to
64 bytes by default. This can be an issue when creating large BPF maps.
A workaround is to raise this limit for the current process before
trying to create a new BPF map. Changing the hard limit requires the
CAP_SYS_RESOURCE and can usually only be done by root user (but then
only root can create BPF maps).

As far as I know there is not API to get the current amount of memory
locked for a user, therefore we cannot raise the limit only when
required. One solution, used by bcc, is to try to create the map, and on
getting a EPERM error, raising the limit to infinity before giving
another try. Another approach, used in iproute, is to raise the limit in
all cases, before trying to create the map.

Here we do the same as in iproute2: the rlimit is raised to infinity
before trying to load the map.

I send this patch as a RFC to see if people would prefer the bcc
approach instead, or the rlimit change to be in bpftool rather than in
libbpf.

Signed-off-by: Quentin Monnet 
---
 tools/lib/bpf/bpf.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 03f9bcc4ef50..456a5a7b112c 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -26,6 +26,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include "bpf.h"
 #include "libbpf.h"
 #include 
@@ -68,8 +70,11 @@ static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr 
*attr,
 int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr)
 {
__u32 name_len = create_attr->name ? strlen(create_attr->name) : 0;
+   struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
union bpf_attr attr;
 
+   setrlimit(RLIMIT_MEMLOCK, );
+
memset(, '\0', sizeof(attr));
 
attr.map_type = create_attr->map_type;
-- 
2.7.4

Re: [BUG] MVPP2 driver exploding in presence of a tap interface

2018-10-30 Thread Marc Zyngier

On 30/10/18 15:10, Thomas Petazzoni wrote:
> Hello,
> 
> On Tue, 30 Oct 2018 14:55:01 +, Marc Zyngier wrote:
> 
>>> I.e, isn't the firmware fix papering over a bug that should be fixed in
>>> Linux mvpp2 driver anyway ?  
>>
>> Absolutely. Leaving this unpatched in the kernel, with a 100% chance of
>> memory corruption is just mad.
>>
>> I'm pretty sure there should be a way to sanely reset the interface
>> before it starts repainting the memory.
> 
> I agree here. Do you still have an image of that old firmware version,
> so that we can try to reproduce, and see if we can come up with a way
> to reset the BM on boot up that would avoid this issue ?

Yup. I still have both the original build tree as well as the sdcard, so
you should be able to trigger on demand.

I'll email you the stuff separately, unless you want another delivery
method.

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...

Re: [BUG] MVPP2 driver exploding in presence of a tap interface

2018-10-30 Thread Thomas Petazzoni

Hello,

On Tue, 30 Oct 2018 14:55:01 +, Marc Zyngier wrote:

> > I.e, isn't the firmware fix papering over a bug that should be fixed in
> > Linux mvpp2 driver anyway ?  
> 
> Absolutely. Leaving this unpatched in the kernel, with a 100% chance of
> memory corruption is just mad.
> 
> I'm pretty sure there should be a way to sanely reset the interface
> before it starts repainting the memory.

I agree here. Do you still have an image of that old firmware version,
so that we can try to reproduce, and see if we can come up with a way
to reset the BM on boot up that would avoid this issue ?

Thanks,

Thomas
-- 
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

[PATCH iproute2 net-next 1/3] ss: Discard empty descriptor at the end of buffer, if any, before rendering

2018-10-30 Thread Stefano Brivio

This will allow us to disable display of any given column.

Signed-off-by: Stefano Brivio 
---
 misc/ss.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index c8970438ce73..c3f61ef66258 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -1245,8 +1245,15 @@ static void render(void)
 
token = (struct buf_token *)buffer.head->data;
 
-   /* Ensure end alignment of last token, it wasn't necessarily flushed */
-   buffer.tail->end += buffer.cur->len % 2;
+   if (!buffer.cur->len) {
+   /* Last token was flushed, a new empty descriptor was appended:
+* discard it
+*/
+   buffer.tail->end -= sizeof(buffer.cur->len);
+   } else {
+   /* Last token wasn't flushed: ensure end alignment */
+   buffer.tail->end += buffer.cur->len % 2;
+   }
 
render_calc_width();
 
-- 
2.19.1

[PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed

2018-10-30 Thread Stefano Brivio

Now that we have an abstraction for columns, it's relatively easy to
selectively display only some of them, and Yoann has a use case for it.

Patch 1/3 fixes a rendering issue that shows up only when display of
arbitrary columns is disabled. Patch 2/3 implements the relevant option,
and patch 3/3 makes the output more readable when some columns are
disabled.

Stefano Brivio (3):
  ss: Discard empty descriptor at the end of buffer, if any, before
rendering
  ss: Introduce option to display selected columns only
  ss: Beautify output when arbitrary columns are hidden

 man/man8/ss.8 |  5 +++
 misc/ss.c | 85 +++
 2 files changed, 77 insertions(+), 13 deletions(-)

-- 
2.19.1

[PATCH iproute2 net-next 3/3] ss: Beautify output when arbitrary columns are hidden

2018-10-30 Thread Stefano Brivio

Define a secondary alignment for columns in case the next column is
hidden, this avoids awkward outputs if e.g. the local address is shown,
but not the local port.

Omit embedded delimiter in socket specifiers if the port or service field
is hidden.

Signed-off-by: Stefano Brivio 
---
 misc/ss.c | 30 --
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 91be3c6db151..d489233681e9 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -131,7 +131,8 @@ enum col_align {
 };
 
 struct column {
-   const enum col_align align;
+   enum col_align align;
+   const enum col_align align_without_next;
const char *optname;
const char *header;
const char *ldelim;
@@ -141,15 +142,15 @@ struct column {
 };
 
 static struct column columns[] = {
-   { ALIGN_LEFT,   "netid","Netid","",  0, 0, 0 },
-   { ALIGN_LEFT,   "state","State"," ", 0, 0, 0 },
-   { ALIGN_LEFT,   "recvq","Recv-Q",   " ", 0, 0, 0 },
-   { ALIGN_LEFT,   "sendq","Send-Q",   " ", 0, 0, 0 },
-   { ALIGN_RIGHT,  "local","Local Address:",   " ", 0, 0, 0 },
-   { ALIGN_LEFT,   "lport","Port", "",  0, 0, 0 },
-   { ALIGN_RIGHT,  "peer", "Peer Address:"," ", 0, 0, 0 },
-   { ALIGN_LEFT,   "pport","Port", "",  0, 0, 0 },
-   { ALIGN_LEFT,   "ext",  "", "",  0, 0, 0 },
+   { ALIGN_LEFT,  ALIGN_LEFT, "netid", "Netid",  "",  0, 0, 0 },
+   { ALIGN_LEFT,  ALIGN_LEFT, "state", "State",  " ", 0, 0, 0 },
+   { ALIGN_LEFT,  ALIGN_LEFT, "recvq", "Recv-Q", " ", 0, 0, 0 },
+   { ALIGN_LEFT,  ALIGN_LEFT, "sendq", "Send-Q", " ", 0, 0, 0 },
+   { ALIGN_RIGHT, ALIGN_LEFT, "local", "Local Address:", " ", 0, 0, 0 },
+   { ALIGN_LEFT,  ALIGN_LEFT, "lport", "Port",   "",  0, 0, 0 },
+   { ALIGN_RIGHT, ALIGN_LEFT, "peer",  "Peer Address:",  " ", 0, 0, 0 },
+   { ALIGN_LEFT,  ALIGN_LEFT, "pport", "Port",   "",  0, 0, 0 },
+   { ALIGN_LEFT,  ALIGN_LEFT, "ext",   "",   "",  0, 0, 0 },
 };
 
 static struct column *current_field = columns;
@@ -1374,6 +1375,9 @@ static void sock_details_print(struct sockstat *s)
 static void sock_addr_print(const char *addr, char *delim, const char *port,
const char *ifname)
 {
+   if ((current_field + 1)->disabled)
+   delim = "";
+
if (ifname)
out("%s" "%%" "%s%s", addr, ifname, delim);
else
@@ -5006,6 +5010,12 @@ int main(int argc, char *argv[])
}
p = p1 + 1;
} while (p1);
+
+   for (f = columns; field_is_valid(f + 1); f++) {
+   if ((f + 1)->disabled)
+   f->align = f->align_without_next;
+   }
+
break;
}
case 'h':
-- 
2.19.1

[PATCH iproute2 net-next 2/3] ss: Introduce option to display selected columns only

2018-10-30 Thread Stefano Brivio

The new option --columns (short: -c) allows to select columns to be
displayed. Note that this doesn't affect the order in which columns are
displayed.

Reported-by: Yoann P. 
Signed-off-by: Stefano Brivio 
---
 man/man8/ss.8 |  5 +
 misc/ss.c | 62 ++-
 2 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/man/man8/ss.8 b/man/man8/ss.8
index 7a6572b17364..c987dec6bcd7 100644
--- a/man/man8/ss.8
+++ b/man/man8/ss.8
@@ -24,6 +24,11 @@ Output version information.
 .B \-H, \-\-no-header
 Suppress header line.
 .TP
+.B \-c COLS, \-\-columns=COLS
+Only display selected columns, separated by commas. The following column names
+are understood: netid, state, local, lport, peer, pport, ext. This does not
+define the order of columns.
+.TP
 .B \-n, \-\-numeric
 Do not try to resolve service names.
 .TP
diff --git a/misc/ss.c b/misc/ss.c
index c3f61ef66258..91be3c6db151 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -132,6 +132,7 @@ enum col_align {
 
 struct column {
const enum col_align align;
+   const char *optname;
const char *header;
const char *ldelim;
int disabled;
@@ -140,15 +141,15 @@ struct column {
 };
 
 static struct column columns[] = {
-   { ALIGN_LEFT,   "Netid","", 0, 0, 0 },
-   { ALIGN_LEFT,   "State"," ",0, 0, 0 },
-   { ALIGN_LEFT,   "Recv-Q",   " ",0, 0, 0 },
-   { ALIGN_LEFT,   "Send-Q",   " ",0, 0, 0 },
-   { ALIGN_RIGHT,  "Local Address:",   " ",0, 0, 0 },
-   { ALIGN_LEFT,   "Port", "", 0, 0, 0 },
-   { ALIGN_RIGHT,  "Peer Address:"," ",0, 0, 0 },
-   { ALIGN_LEFT,   "Port", "", 0, 0, 0 },
-   { ALIGN_LEFT,   "", "", 0, 0, 0 },
+   { ALIGN_LEFT,   "netid","Netid","",  0, 0, 0 },
+   { ALIGN_LEFT,   "state","State"," ", 0, 0, 0 },
+   { ALIGN_LEFT,   "recvq","Recv-Q",   " ", 0, 0, 0 },
+   { ALIGN_LEFT,   "sendq","Send-Q",   " ", 0, 0, 0 },
+   { ALIGN_RIGHT,  "local","Local Address:",   " ", 0, 0, 0 },
+   { ALIGN_LEFT,   "lport","Port", "",  0, 0, 0 },
+   { ALIGN_RIGHT,  "peer", "Peer Address:"," ", 0, 0, 0 },
+   { ALIGN_LEFT,   "pport","Port", "",  0, 0, 0 },
+   { ALIGN_LEFT,   "ext",  "", "",  0, 0, 0 },
 };
 
 static struct column *current_field = columns;
@@ -1073,6 +1074,11 @@ static int field_is_last(struct column *f)
return f - columns == COL_MAX - 1;
 }
 
+static int field_is_valid(struct column *f)
+{
+   return f >= columns && f - columns < COL_MAX;
+}
+
 static void field_next(void)
 {
field_flush(current_field);
@@ -4666,6 +4672,8 @@ static void _usage(FILE *dest)
 "\n"
 "   -K, --kill  forcibly close sockets, display what was closed\n"
 "   -H, --no-header Suppress header line\n"
+"   -c, --columns=COLS  display only COLS columns\n"
+"   COLS := {netid|state|local|lport|peer|pport|ext}[,COLS]\n"
 "\n"
 "   -A, --query=QUERY, --socket=QUERY\n"
 "   QUERY := 
{all|inet|tcp|udp|raw|unix|unix_dgram|unix_stream|unix_seqpacket|packet|netlink|vsock_stream|vsock_dgram|tipc}[,QUERY]\n"
@@ -4785,6 +4793,7 @@ static const struct option long_opts[] = {
{ "tipcinfo", 0, 0, OPT_TIPCINFO},
{ "kill", 0, 0, 'K' },
{ "no-header", 0, 0, 'H' },
+   { "columns", 1, 0, 'c' },
{ 0 }
 
 };
@@ -4800,7 +4809,7 @@ int main(int argc, char *argv[])
int state_filter = 0;
 
while ((ch = getopt_long(argc, argv,
-"dhaletuwxnro460spbEf:miA:D:F:vVzZN:KHS",
+"dhaletuwxnro460spbEf:miA:D:F:vVzZN:KHc:S",
 long_opts, NULL)) != EOF) {
switch (ch) {
case 'n':
@@ -4966,6 +4975,39 @@ int main(int argc, char *argv[])
case 'H':
show_header = 0;
break;
+   case 'c':
+   {
+   struct column *f;
+   char *p, *p1;
+
+   if (!optarg) {
+   fprintf(stderr, "ss: No columns given.\n");
+   usage();
+   }
+
+   for (f = columns; field_is_valid(f); f++)
+   f->disabled = 1;
+
+   p = optarg;
+   do {
+   p1 = strchr(p, ',');
+   if (p1)
+   *p1 = 0;
+   for (f = columns; field_is_valid(f); f++) {
+   if (!strcmp(f->optname, p)) {
+

Re: [BUG] MVPP2 driver exploding in presence of a tap interface

2018-10-30 Thread Marc Zyngier

On 30/10/18 13:00, Thomas Petazzoni wrote:
> Hello Marcin,
> 
> Thanks for the feedback.
> 
> On Tue, 30 Oct 2018 13:37:37 +0100, Marcin Wojtas wrote:
> 
>> You use _really_ archaic firmware, the bug you see is 99% caused by a
>> bug already fixed long time ago (cleanup all PP2 BM pools correctly
>> during exit boot services). Please grab the latest release:
>> https://github.com/MarvellEmbeddedProcessors/edk2-open-platform/wiki/files/flash-image-18.09.4.bin
>> and let know if you observe any further issues with vanilla kernel.
> 
> Even if this was a bug in the UEFI firmware, shouldn't the kernel be
> independent from that, by doing a proper reset/reinit of the HW ?
> 
> I.e, isn't the firmware fix papering over a bug that should be fixed in
> Linux mvpp2 driver anyway ?

Absolutely. Leaving this unpatched in the kernel, with a 100% chance of
memory corruption is just mad.

I'm pretty sure there should be a way to sanely reset the interface
before it starts repainting the memory. And if there is none, we must
find a way to tell the user that the machine is a death trap. Really.

M.

PS: updating the FW to the version provided by Marcin indeed makes
things much more reliable. Thanks for that.
-- 
Jazz is not dead. It just smells funny...

Re: [PATCH v1 0/5] can: add SAE J1939 protocol

2018-10-30 Thread Marc Kleine-Budde

On 10/08/2018 11:48 AM, Oleksij Rempel wrote:
> This series adds SAE J1939 support to the current kernel v4.19-rc6.
> 
> This stack has long history, starting back in 27 Apr 2011, if not
> earlier:
> https://lists.openwall.net/netdev/2011/04/27/45
> 
> After major rework and testing it is a time to send it mainline.

I've removed some trailing newlines and added the stack to
linux-can-next/j1939.

Marc

-- 
Pengutronix e.K.  | Marc Kleine-Budde   |
Industrial Linux Solutions| Phone: +49-231-2826-924 |
Vertretung West/Dortmund  | Fax:   +49-5121-206917- |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |



signature.asc
Description: OpenPGP digital signature

Re: Latest net-next kernel 4.19.0+

2018-10-30 Thread Eric Dumazet




On 10/30/2018 01:09 AM, Paweł Staszewski wrote:
> 
> 
> W dniu 30.10.2018 o 08:29, Eric Dumazet pisze:
>>
>> On 10/29/2018 11:09 PM, Dimitris Michailidis wrote:
>>
>>> Indeed this is a bug. I would expect it to produce frequent errors
>>> though as many odd-length
>>> packets would trigger it. Do you have RXFCS? Regardless, how
>>> frequently do you see the problem?
>>>
>> Old kernels (before 88078d98d1bb) were simply resetting ip_summed to 
>> CHECKSUM_NONE
>>
>> And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the 
>> bug you fixed.
>>
>> So we now need to also fix mlx5.
>>
>> And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned 
>> earlier,
>> plus __get_unaligned_cpu32() as you hinted.
>>
>>
>>
>>
> 
> No RXFCS
> 
> And this trace is rly frequently like once per 3/4 seconds
> like below:
> [28965.776864] vlan1490: hw csum failure

Might be vlan related.

Can you first check this :

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 
94224c22ecc310a87b6715051e335446f29bec03..6f4bfebf0d9a3ae7567062abb3ea6532b3aaf3d6
 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -789,13 +789,8 @@ static inline void mlx5e_handle_csum(struct net_device 
*netdev,
skb->ip_summed = CHECKSUM_COMPLETE;
skb->csum = csum_unfold((__force __sum16)cqe->check_sum);
if (network_depth > ETH_HLEN)
-   /* CQE csum is calculated from the IP header and does
-* not cover VLAN headers (if present). This will add
-* the checksum manually.
-*/
-   skb->csum = csum_partial(skb->data + ETH_HLEN,
-network_depth - ETH_HLEN,
-skb->csum);
+   /* Temporary debugging */
+   skb->ip_summed = CHECKSUM_NONE;
if (unlikely(netdev->features & NETIF_F_RXFCS))
skb->csum = csum_add(skb->csum,
 (__force 
__wsum)mlx5e_get_fcs(skb));

[Patch V5 net 03/11] net: hns3: bugfix for reporting unknown vector0 interrupt repeatly problem

2018-10-30 Thread Huazhong Tan

The current driver supports handling two vector0 interrupts, reset and
mailbox. When the hardware reports an interrupt of another type of
interrupt source, if the driver does not process the interrupt, but
enables the interrupt, the hardware will repeatedly report the unknown
interrupt.

Therefore, the driver enables the vector0 interrupt after clearing the
known type of interrupt source. Other conditions are not enabled.

Fixes: cd8c5c269b1d ("net: hns3: Fix for hclge_reset running repeatly problem")
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 5234b53..2a63147 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2236,7 +2236,7 @@ static irqreturn_t hclge_misc_irq_handle(int irq, void 
*data)
}
 
/* clear the source of interrupt if it is not cause by reset */
-   if (event_cause != HCLGE_VECTOR0_EVENT_RST) {
+   if (event_cause == HCLGE_VECTOR0_EVENT_MBX) {
hclge_clear_event_cause(hdev, event_cause, clearval);
hclge_enable_vector(>misc_vector, true);
}
-- 
2.7.4

[Patch V5 net 01/11] net: hns3: add error handler for hns3_nic_init_vector_data()

2018-10-30 Thread Huazhong Tan

When hns3_nic_init_vector_data() fails to map ring to vector,
it should cancel the netif_napi_add() that has been successfully
done and then exits.

Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 
SoC")
Signed-off-by: Huazhong Tan 
---
V5: Fixes comments from Joe Perches
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 32f3aca8..0b4323b 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2821,7 +2821,7 @@ static int hns3_nic_init_vector_data(struct hns3_nic_priv 
*priv)
struct hnae3_handle *h = priv->ae_handle;
struct hns3_enet_tqp_vector *tqp_vector;
int ret = 0;
-   u16 i;
+   int i;
 
hns3_nic_set_cpumask(priv);
 
@@ -2868,13 +2868,19 @@ static int hns3_nic_init_vector_data(struct 
hns3_nic_priv *priv)
hns3_free_vector_ring_chain(tqp_vector, _ring_chain);
 
if (ret)
-   return ret;
+   goto map_ring_fail;
 
netif_napi_add(priv->netdev, _vector->napi,
   hns3_nic_common_poll, NAPI_POLL_WEIGHT);
}
 
return 0;
+
+map_ring_fail:
+   while (i--)
+   netif_napi_del(>tqp_vector[i].napi);
+
+   return ret;
 }
 
 static int hns3_nic_alloc_vector_data(struct hns3_nic_priv *priv)
-- 
2.7.4

[Patch V5 net 08/11] net: hns3: fix incorrect return value/type of some functions

2018-10-30 Thread Huazhong Tan

There are some functions that, when they fail to send the command,
need to return the corresponding error value to its caller.

Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility 
Layer Support")
Fixes: 681ec3999b3d ("net: hns3: fix for vlan table lost problem when 
resetting")
Signed-off-by: Huazhong Tan 
---
V2: Fixes the compilation error reported by kbuild test robot
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h|  6 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c| 80 +++---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h|  2 +-
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 34 -
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|  2 +-
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 14 ++--
 6 files changed, 85 insertions(+), 53 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index e82e4ca..055b406 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -316,8 +316,8 @@ struct hnae3_ae_ops {
int (*set_loopback)(struct hnae3_handle *handle,
enum hnae3_loop loop_mode, bool en);
 
-   void (*set_promisc_mode)(struct hnae3_handle *handle, bool en_uc_pmc,
-bool en_mc_pmc);
+   int (*set_promisc_mode)(struct hnae3_handle *handle, bool en_uc_pmc,
+   bool en_mc_pmc);
int (*set_mtu)(struct hnae3_handle *handle, int new_mtu);
 
void (*get_pauseparam)(struct hnae3_handle *handle,
@@ -391,7 +391,7 @@ struct hnae3_ae_ops {
  int vector_num,
  struct hnae3_ring_chain_node *vr_chain);
 
-   void (*reset_queue)(struct hnae3_handle *handle, u16 queue_id);
+   int (*reset_queue)(struct hnae3_handle *handle, u16 queue_id);
u32 (*get_fw_version)(struct hnae3_handle *handle);
void (*get_mdix_mode)(struct hnae3_handle *handle,
  u8 *tp_mdix_ctrl, u8 *tp_mdix);
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index bf71c23..3f96aa3 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -509,16 +509,18 @@ static void hns3_nic_set_rx_mode(struct net_device 
*netdev)
h->netdev_flags = new_flags;
 }
 
-void hns3_update_promisc_mode(struct net_device *netdev, u8 promisc_flags)
+int hns3_update_promisc_mode(struct net_device *netdev, u8 promisc_flags)
 {
struct hns3_nic_priv *priv = netdev_priv(netdev);
struct hnae3_handle *h = priv->ae_handle;
 
if (h->ae_algo->ops->set_promisc_mode) {
-   h->ae_algo->ops->set_promisc_mode(h,
- promisc_flags & HNAE3_UPE,
- promisc_flags & HNAE3_MPE);
+   return h->ae_algo->ops->set_promisc_mode(h,
+   promisc_flags & HNAE3_UPE,
+   promisc_flags & HNAE3_MPE);
}
+
+   return 0;
 }
 
 void hns3_enable_vlan_filter(struct net_device *netdev, bool enable)
@@ -1494,18 +1496,22 @@ static int hns3_vlan_rx_kill_vid(struct net_device 
*netdev,
return ret;
 }
 
-static void hns3_restore_vlan(struct net_device *netdev)
+static int hns3_restore_vlan(struct net_device *netdev)
 {
struct hns3_nic_priv *priv = netdev_priv(netdev);
+   int ret = 0;
u16 vid;
-   int ret;
 
for_each_set_bit(vid, priv->active_vlans, VLAN_N_VID) {
ret = hns3_vlan_rx_add_vid(netdev, htons(ETH_P_8021Q), vid);
-   if (ret)
-   netdev_warn(netdev, "Restore vlan: %d filter, ret:%d\n",
-   vid, ret);
+   if (ret) {
+   netdev_err(netdev, "Restore vlan: %d filter, ret:%d\n",
+  vid, ret);
+   return ret;
+   }
}
+
+   return ret;
 }
 
 static int hns3_ndo_set_vf_vlan(struct net_device *netdev, int vf, u16 vlan,
@@ -3257,11 +3263,12 @@ int hns3_uninit_all_ring(struct hns3_nic_priv *priv)
 }
 
 /* Set mac addr if it is configured. or leave it to the AE driver */
-static void hns3_init_mac_addr(struct net_device *netdev, bool init)
+static int hns3_init_mac_addr(struct net_device *netdev, bool init)
 {
struct hns3_nic_priv *priv = netdev_priv(netdev);
struct hnae3_handle *h = priv->ae_handle;
u8 mac_addr_temp[ETH_ALEN];
+   int ret = 0;
 
if (h->ae_algo->ops->get_mac_addr && init) {
h->ae_algo->ops->get_mac_addr(h, mac_addr_temp);
@@ -3276,8 +3283,9 @@ static void hns3_init_mac_addr(struct net_device *netdev, 
bool init)
}
 
if

[Patch V5 net 10/11] net: hns3: bugfix for rtnl_lock's range in the hclge_reset()

2018-10-30 Thread Huazhong Tan

Since hclge_reset_wait() is used to wait for the hardware to complete
the reset, it is not necessary to hold the rtnl_lock during
hclge_reset_wait(). So this patch releases the lock for the duration
of hclge_reset_wait().

Fixes: 6d4fab39533f ("net: hns3: Reset net device with rtnl_lock")
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index f3212c9..ffdd960 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2470,14 +2470,17 @@ static void hclge_reset(struct hclge_dev *hdev)
handle = >vport[0].nic;
rtnl_lock();
hclge_notify_client(hdev, HNAE3_DOWN_CLIENT);
+   rtnl_unlock();
 
if (!hclge_reset_wait(hdev)) {
+   rtnl_lock();
hclge_notify_client(hdev, HNAE3_UNINIT_CLIENT);
hclge_reset_ae_dev(hdev->ae_dev);
hclge_notify_client(hdev, HNAE3_INIT_CLIENT);
 
hclge_clear_reset_cause(hdev);
} else {
+   rtnl_lock();
/* schedule again to check pending resets later */
set_bit(hdev->reset_type, >reset_pending);
hclge_reset_task_schedule(hdev);
-- 
2.7.4

[Patch V5 net 11/11] net: hns3: bugfix for rtnl_lock's range in the hclgevf_reset()

2018-10-30 Thread Huazhong Tan

Since hclgevf_reset_wait() is used to wait for the hardware to complete
the reset, it is not necessary to hold the rtnl_lock during
hclgevf_reset_wait(). So this patch releases the lock for the duration
of hclgevf_reset_wait().

Fixes: 6988eb2a9b77 ("net: hns3: Add support to reset the enet/ring mgmt layer")
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
index b224f6a..085edb9 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c
@@ -1170,6 +1170,8 @@ static int hclgevf_reset(struct hclgevf_dev *hdev)
/* bring down the nic to stop any ongoing TX/RX */
hclgevf_notify_client(hdev, HNAE3_DOWN_CLIENT);
 
+   rtnl_unlock();
+
/* check if VF could successfully fetch the hardware reset completion
 * status from the hardware
 */
@@ -1181,12 +1183,15 @@ static int hclgevf_reset(struct hclgevf_dev *hdev)
ret);
 
dev_warn(>pdev->dev, "VF reset failed, disabling VF!\n");
+   rtnl_lock();
hclgevf_notify_client(hdev, HNAE3_UNINIT_CLIENT);
 
rtnl_unlock();
return ret;
}
 
+   rtnl_lock();
+
/* now, re-initialize the nic client and ae device*/
ret = hclgevf_reset_stack(hdev);
if (ret)
-- 
2.7.4

[Patch V5 net 02/11] net: hns3: bugfix for buffer not free problem during resetting

2018-10-30 Thread Huazhong Tan

When hns3_get_ring_config()/hns3_queue_to_ring()/
hns3_get_vector_ring_chain() failed during resetting, the allocated
memory has not been freed before these three functions return. So
this patch adds error handler in these functions to fix it.

Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 
SoC")
Signed-off-by: Huazhong Tan 
---
V5: Fixes comments from Sergei Shtylyov
add error handler for hns3_get_vector_ring_chain()
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 24 +---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 0b4323b..b767ff9 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2727,7 +2727,7 @@ static int hns3_get_vector_ring_chain(struct 
hns3_enet_tqp_vector *tqp_vector,
chain = devm_kzalloc(>dev, sizeof(*chain),
 GFP_KERNEL);
if (!chain)
-   return -ENOMEM;
+   goto err_free_chain;
 
cur_chain->next = chain;
chain->tqp_index = tx_ring->tqp->tqp_index;
@@ -2757,7 +2757,7 @@ static int hns3_get_vector_ring_chain(struct 
hns3_enet_tqp_vector *tqp_vector,
while (rx_ring) {
chain = devm_kzalloc(>dev, sizeof(*chain), GFP_KERNEL);
if (!chain)
-   return -ENOMEM;
+   goto err_free_chain;
 
cur_chain->next = chain;
chain->tqp_index = rx_ring->tqp->tqp_index;
@@ -2772,6 +2772,16 @@ static int hns3_get_vector_ring_chain(struct 
hns3_enet_tqp_vector *tqp_vector,
}
 
return 0;
+
+err_free_chain:
+   cur_chain = head->next;
+   while (cur_chain) {
+   chain = cur_chain->next;
+   devm_kfree(>dev, chain);
+   cur_chain = chain;
+   }
+
+   return -ENOMEM;
 }
 
 static void hns3_free_vector_ring_chain(struct hns3_enet_tqp_vector 
*tqp_vector,
@@ -3037,8 +3047,10 @@ static int hns3_queue_to_ring(struct hnae3_queue *tqp,
return ret;
 
ret = hns3_ring_get_cfg(tqp, priv, HNAE3_RING_TYPE_RX);
-   if (ret)
+   if (ret) {
+   devm_kfree(priv->dev, priv->ring_data[tqp->tqp_index].ring);
return ret;
+   }
 
return 0;
 }
@@ -3065,6 +3077,12 @@ static int hns3_get_ring_config(struct hns3_nic_priv 
*priv)
 
return 0;
 err:
+   while (i--) {
+   devm_kfree(priv->dev, priv->ring_data[i].ring);
+   devm_kfree(priv->dev,
+  priv->ring_data[i + h->kinfo.num_tqps].ring);
+   }
+
devm_kfree(>dev, priv->ring_data);
return ret;
 }
-- 
2.7.4

[Patch V5 net 05/11] net: hns3: remove unnecessary queue reset in the hns3_uninit_all_ring()

2018-10-30 Thread Huazhong Tan

It is not necessary to reset the queue in the hns3_uninit_all_ring(),
since the queue is stopped in the down operation, and will be reset
in the up operation. And the judgment of the HCLGE_STATE_RST_HANDLING
flag in the hclge_reset_tqp() is not correct, because we need to reset
tqp during pf reset, otherwise it may cause queue not being reset to
working state problem.

Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 
SoC")
Signed-off-by: Huazhong Tan 
---
V4: Fixes comments from Sergei Shtylyov
V3: Fixes comments from Sergei Shtylyov
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 3 ---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 3 ---
 2 files changed, 6 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index b767ff9..bf71c23 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -3250,9 +3250,6 @@ int hns3_uninit_all_ring(struct hns3_nic_priv *priv)
int i;
 
for (i = 0; i < h->kinfo.num_tqps; i++) {
-   if (h->ae_algo->ops->reset_queue)
-   h->ae_algo->ops->reset_queue(h, i);
-
hns3_fini_ring(priv->ring_data[i].ring);
hns3_fini_ring(priv->ring_data[i + h->kinfo.num_tqps].ring);
}
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 2a63147..4dd0506 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -6116,9 +6116,6 @@ void hclge_reset_tqp(struct hnae3_handle *handle, u16 
queue_id)
u16 queue_gid;
int ret;
 
-   if (test_bit(HCLGE_STATE_RST_HANDLING, >state))
-   return;
-
queue_gid = hclge_covert_handle_qid_global(handle, queue_id);
 
ret = hclge_tqp_enable(hdev, queue_id, 0, false);
-- 
2.7.4

[Patch V5 net 07/11] net: hns3: bugfix for hclge_mdio_write and hclge_mdio_read

2018-10-30 Thread Huazhong Tan

When there is a PHY, the driver needs to complete some operations through
MDIO during reset reinitialization, so HCLGE_STATE_CMD_DISABLE is more
suitable than HCLGE_STATE_RST_HANDLING to prevent the MDIO operation from
being sent during the hardware reset.

Fixes: b50ae26c57cb ("net: hns3: never send command queue message to IMP when 
reset)
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
index 24b1f2a..0301863 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
@@ -52,7 +52,7 @@ static int hclge_mdio_write(struct mii_bus *bus, int phyid, 
int regnum,
struct hclge_desc desc;
int ret;
 
-   if (test_bit(HCLGE_STATE_RST_HANDLING, >state))
+   if (test_bit(HCLGE_STATE_CMD_DISABLE, >state))
return 0;
 
hclge_cmd_setup_basic_desc(, HCLGE_OPC_MDIO_CONFIG, false);
@@ -90,7 +90,7 @@ static int hclge_mdio_read(struct mii_bus *bus, int phyid, 
int regnum)
struct hclge_desc desc;
int ret;
 
-   if (test_bit(HCLGE_STATE_RST_HANDLING, >state))
+   if (test_bit(HCLGE_STATE_CMD_DISABLE, >state))
return 0;
 
hclge_cmd_setup_basic_desc(, HCLGE_OPC_MDIO_CONFIG, true);
-- 
2.7.4

[Patch V5 net 04/11] net: hns3: bugfix for the initialization of command queue's spin lock

2018-10-30 Thread Huazhong Tan

The spin lock of the command queue only need to be initialized once
when the driver initializes the command queue. It is not necessary to
initialize the spin lock when resetting. At the same time, the
modification of the queue member should be performed after acquiring
the lock.

Fixes: 3efb960f056d ("net: hns3: Refactor the initialization of command queue")
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c
index ac13cb2..68026a5 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c
@@ -304,6 +304,10 @@ int hclge_cmd_queue_init(struct hclge_dev *hdev)
 {
int ret;
 
+   /* Setup the lock for command queue */
+   spin_lock_init(>hw.cmq.csq.lock);
+   spin_lock_init(>hw.cmq.crq.lock);
+
/* Setup the queue entries for use cmd queue */
hdev->hw.cmq.csq.desc_num = HCLGE_NIC_CMQ_DESC_NUM;
hdev->hw.cmq.crq.desc_num = HCLGE_NIC_CMQ_DESC_NUM;
@@ -337,18 +341,20 @@ int hclge_cmd_init(struct hclge_dev *hdev)
u32 version;
int ret;
 
+   spin_lock_bh(>hw.cmq.csq.lock);
+   spin_lock_bh(>hw.cmq.crq.lock);
+
hdev->hw.cmq.csq.next_to_clean = 0;
hdev->hw.cmq.csq.next_to_use = 0;
hdev->hw.cmq.crq.next_to_clean = 0;
hdev->hw.cmq.crq.next_to_use = 0;
 
-   /* Setup the lock for command queue */
-   spin_lock_init(>hw.cmq.csq.lock);
-   spin_lock_init(>hw.cmq.crq.lock);
-
hclge_cmd_init_regs(>hw);
clear_bit(HCLGE_STATE_CMD_DISABLE, >state);
 
+   spin_unlock_bh(>hw.cmq.crq.lock);
+   spin_unlock_bh(>hw.cmq.csq.lock);
+
ret = hclge_cmd_query_firmware_version(>hw, );
if (ret) {
dev_err(>pdev->dev,
-- 
2.7.4

[Patch V5 net 09/11] net: hns3: bugfix for handling mailbox while the command queue reinitialized

2018-10-30 Thread Huazhong Tan

In a multi-core machine, the mailbox service and reset service
will be executed at the same time. The reset service will re-initialize
the command queue, before that, the mailbox handler can only get some
invalid messages.

The HCLGE_STATE_CMD_DISABLE flag means that the command queue is not
available and needs to be reinitialized. Therefore, when the mailbox
handler recognizes this flag, it should not process the command.

Fixes: dde1a86e93ca ("net: hns3: Add mailbox support to PF driver")
Signed-off-by: Huazhong Tan 
---
V3: Fixes comments from Sergei Shtylyov
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c
index 04462a3..f890022 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c
@@ -400,6 +400,12 @@ void hclge_mbx_handler(struct hclge_dev *hdev)
 
/* handle all the mailbox requests in the queue */
while (!hclge_cmd_crq_empty(>hw)) {
+   if (test_bit(HCLGE_STATE_CMD_DISABLE, >state)) {
+   dev_warn(>pdev->dev,
+"command queue needs re-initializing\n");
+   return;
+   }
+
desc = >desc[crq->next_to_use];
req = (struct hclge_mbx_vf_to_pf_cmd *)desc->data;
 
-- 
2.7.4

[Patch V5 net 06/11] net: hns3: bugfix for is_valid_csq_clean_head()

2018-10-30 Thread Huazhong Tan

The HEAD pointer of the hardware command queue maybe equal to the command
queue's next_to_use in the driver, so that does not belong to the invalid
HEAD pointer, since the hardware may not process the command in time,
causing the HEAD pointer to be too late to update. The variables' name
in this function is unreadable, so give them a more readable one.

Fixes: 3ff504908f95 ("net: hns3: fix a dead loop in hclge_cmd_csq_clean")
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c
index 68026a5..690f62e 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c
@@ -24,15 +24,15 @@ static int hclge_ring_space(struct hclge_cmq_ring *ring)
return ring->desc_num - used - 1;
 }
 
-static int is_valid_csq_clean_head(struct hclge_cmq_ring *ring, int h)
+static int is_valid_csq_clean_head(struct hclge_cmq_ring *ring, int head)
 {
-   int u = ring->next_to_use;
-   int c = ring->next_to_clean;
+   int ntu = ring->next_to_use;
+   int ntc = ring->next_to_clean;
 
-   if (unlikely(h >= ring->desc_num))
-   return 0;
+   if (ntu > ntc)
+   return head >= ntc && head <= ntu;
 
-   return u > c ? (h > c && h <= u) : (h > c || h <= u);
+   return head >= ntc || head <= ntu;
 }
 
 static int hclge_alloc_cmd_desc(struct hclge_cmq_ring *ring)
-- 
2.7.4

[Patch V5 net 00/11] Bugfix for the HNS3 driver

2018-10-30 Thread Huazhong Tan

This patch series include bugfix for the HNS3 ethernet
controller driver.

Change log:
V4->V5:
Fixes comments from Joe Perches & Sergei Shtylyov
V3->V4:
Fixes comments from Sergei Shtylyov
V2->V3:
Fixes comments from Sergei Shtylyov
V1->V2:
Fixes the compilation break reported by kbuild test robot
http://patchwork.ozlabs.org/patch/989818/

Huazhong Tan (11):
  net: hns3: add error handler for hns3_nic_init_vector_data()
  net: hns3: bugfix for buffer not free problem during resetting
  net: hns3: bugfix for reporting unknown vector0 interrupt repeatly
problem
  net: hns3: bugfix for the initialization of command queue's spin lock
  net: hns3: remove unnecessary queue reset in the
hns3_uninit_all_ring()
  net: hns3: bugfix for is_valid_csq_clean_head()
  net: hns3: bugfix for hclge_mdio_write and hclge_mdio_read
  net: hns3: fix incorrect return value/type of some functions
  net: hns3: bugfix for handling mailbox while the command queue
reinitialized
  net: hns3: bugfix for rtnl_lock's range in the hclge_reset()
  net: hns3: bugfix for rtnl_lock's range in the hclgevf_reset()

 drivers/net/ethernet/hisilicon/hns3/hnae3.h|   6 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c| 117 +++--
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h|   2 +-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c |  26 +++--
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c|  42 
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|   2 +-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c |   6 ++
 .../ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c|   4 +-
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  |  19 ++--
 9 files changed, 147 insertions(+), 77 deletions(-)

-- 
2.7.4

Re: [PATCH net] net: ethernet: cadence: fix socket buffer corruption problem

2018-10-30 Thread Claudiu.Beznea



On 29.10.2018 23:40, Tristram Ha - C24268 wrote:
>> Could you, please, tell me if the above variable was false in your case?
>>
>> bool cloned = skb_cloned(*skb) || skb_header_cloned(*skb);
>>
>> If yes, then, the proper fix would be as follows:
>>
>> diff --git a/drivers/net/ethernet/cadence/macb_main.c
>> b/drivers/net/ethernet/cadence/macb_main.c
>> index 8f5bf9166c11..492a8e1a34cd 100644
>> --- a/drivers/net/ethernet/cadence/macb_main.c
>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>> @@ -1690,7 +1690,7 @@ static int macb_pad_and_fcs(struct sk_buff **skb,
>> struct net_device *ndev)
>> padlen += ETH_FCS_LEN;
>> }
>>
>> -   if (!cloned && headroom + tailroom >= padlen) {
>> +   if (!cloned && headroom + tailroom >= ETH_FCS_LEN) {
>> (*skb)->data = memmove((*skb)->head, (*skb)->data, 
>> (*skb)->len);
>> skb_set_tail_pointer(*skb, (*skb)->len);
>> } else {
>>
>> Could you please check if it works in your case (and without your patch)?
>>
> 
> Actually doing that reveals another bug:
> 
> if (padlen) {
> if (padlen >= ETH_FCS_LEN)
> skb_put_zero(*skb, padlen - ETH_FCS_LEN);
> else
> skb_trim(*skb, ETH_FCS_LEN - padlen);
> }
> 
> My fix calls skb_put_zero with zero length.  Your change calls skb_trim which
> actually sets the socket buffer length to 1!
> 
> When this problem happens headroom is 2, tailroom is 1, skb->len is 61, and
> padlen is 3.
> 

Ok, I see now. Looking again, your fix is good. But, as you said, there is
the skb_trim() in this function that is wrong from the beginning (my bad).
I propose to remove it since, with your fix is not even reached anymore.

Could you check on your side that applying this on top of your patch, your
scenario is still working?

diff --git a/drivers/net/ethernet/cadence/macb_main.c
b/drivers/net/ethernet/cadence/macb_main.c
index 1d86b4d5645a..e1347d6d1b50 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -1702,12 +1702,8 @@ static int macb_pad_and_fcs(struct sk_buff **skb,
struct net_device *ndev)
*skb = nskb;
}

-   if (padlen) {
-   if (padlen >= ETH_FCS_LEN)
-   skb_put_zero(*skb, padlen - ETH_FCS_LEN);
-   else
-   skb_trim(*skb, ETH_FCS_LEN - padlen);
-   }
+   if (padlen > ETH_FCS_LEN)
+   skb_put_zero(*skb, padlen - ETH_FCS_LEN);


> DSA driver is being used.  That is why the length is already padded to 60 
> bytes
> and 1-byte tail tag is added.

Ok, I see, I didn't test with such configurations.

> 
> BTW, I am not sure while this macb_pad_and_fcs function was added.  Is it to
> workaround some hardware bugs?  The code is executed only when
> NETIF_IF_HW_CSUM is used.  But if hardware tx checksumming is enabled why
> not also use the hardware to calculate CRC?

It was reported in [1] that UDP checksum is offloaded to hardware no matter
the application previously computed it.

The code should be executed only for packets that has checksum computed by
applications ((*skb)->ip_summed != CHECKSUM_PARTIAL). The idea was to not
recompute checksum for packets with checksum already computed. To do so,
while hardware checksum is enabled (NETIF_F_HW_CSUM), TX_NOCRC bit should
be set on buffer descriptor. But to do so, packets must have a minimum size
of 64 and FCS to be computed.

The NETIF_F_HW_CSUM check was placed there because the issue described in
[1] is reproducible because hardware checksum is enabled and overrides the
checksum provided by applications.

[1] https://www.spinics.net/lists/netdev/msg505065.html
> 
> NETIF_F_SG is not enabled in the MAC I used, so enabling NETIF_IF_HW_CSUM
> is rather pointless.  With the padding code the transmit throughput cannot get
> higher than 100Mbps in a gigabit connection.
> 
> I would recommend to add this option to disable manual padding in one of those
> macb_config structures.

In this way the user would have to know from the beginning what kind of
packets are used.

Thank you,
Claudiu Beznea

>

Re: [BUG] MVPP2 driver exploding in presence of a tap interface

2018-10-30 Thread Thomas Petazzoni

Hello Marcin,

Thanks for the feedback.

On Tue, 30 Oct 2018 13:37:37 +0100, Marcin Wojtas wrote:

> You use _really_ archaic firmware, the bug you see is 99% caused by a
> bug already fixed long time ago (cleanup all PP2 BM pools correctly
> during exit boot services). Please grab the latest release:
> https://github.com/MarvellEmbeddedProcessors/edk2-open-platform/wiki/files/flash-image-18.09.4.bin
> and let know if you observe any further issues with vanilla kernel.

Even if this was a bug in the UEFI firmware, shouldn't the kernel be
independent from that, by doing a proper reset/reinit of the HW ?

I.e, isn't the firmware fix papering over a bug that should be fixed in
Linux mvpp2 driver anyway ?

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

Re: [BUG] MVPP2 driver exploding in presence of a tap interface

2018-10-30 Thread Marc Zyngier

Marcin,

On 30/10/18 12:37, Marcin Wojtas wrote:
> [Resend in UTF-8]
> 
> Hi Marc,
> 
> You use _really_ archaic firmware, the bug you see is 99% caused by a

Please let me fix this for you:

s/_really_ archaic/released/

> bug already fixed long time ago (cleanup all PP2 BM pools correctly
> during exit boot services). 

How long ago? Why didn't you say so when I reported the bug to you and
Antoine back in January? Also, why isn't that "clean-up" taken care of
by the Linux driver? Exiting boot services itself doesn't seem to cause
the issue, and it is setting the interface up that causes it.

> Please grab the latest release:
> https://github.com/MarvellEmbeddedProcessors/edk2-open-platform/wiki/files/flash-image-18.09.4.bin
> and let know if you observe any further issues with vanilla kernel.

What does this image contain?

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...

Re: [BUG] MVPP2 driver exploding in presence of a tap interface

2018-10-30 Thread Marcin Wojtas

[Resend in UTF-8]

Hi Marc,

You use _really_ archaic firmware, the bug you see is 99% caused by a
bug already fixed long time ago (cleanup all PP2 BM pools correctly
during exit boot services). Please grab the latest release:
https://github.com/MarvellEmbeddedProcessors/edk2-open-platform/wiki/files/flash-image-18.09.4.bin
and let know if you observe any further issues with vanilla kernel.

Best regards,
Marcin

wt., 30 paź 2018 o 13:16 Marc Zyngier  napisał(a):
>
> Antoine,
>
> On 30/10/18 10:50, Antoine Tenart wrote:
> > Marc,
> >
> > On Mon, Oct 29, 2018 at 03:05:53PM +, Marc Zyngier wrote:
> >>
> >> This is a follow-up on the conversation Thomas and I had last week at
> >> ELC, with me ranting at the sorry state of the MVPP2 driver.
> >
> >> Triggering this is dead simple:
> >> - Add a macvtap to one of the MVPP2 interfaces
> >> - Bring it online
> >> - Watch the kernel exploding and memory being corrupted
> >>
> >> You don't even need anything listening on the tap interface, just its
> >> simple existence triggers it. I use a similar setup on a large variety
> >> of machines, and this box is the only one that catches fire. Removing
> >> the macvtap interface makes it (more) reliable.
> >>
> >> Given that I cannot reproduce this issue on any other ARM (32 or 64bit)
> >> platform, including other Marvell stuff, I can only conclude that the
> >> MVPP2 driver is responsible for this.
> >>
> >> Example crash and .config below (4.19 vanilla, as linux/master dies in
> >> new and wonderful ways on this box). I'm looking forward to testing any
> >> idea you may have.
> >
> > I used a 4.19 vanilla kernel, with both your configuration and mine,
> > on 2 different Macchiatobins, but was unable to trigger the issue:
> >
> >   # ip link set eth0 up
> >   # ip link add link eth0 name macvtap0 type macvtap
> >   # ip link set macvtap0 up>
> > I can even configure the eth0/macvtap0 interfaces, and use them
> > generating or receiving tcp/udp/icmp traffic.
> >
> > (I also made other tests using macvtap and tap interfaces).
> >
> > How much memory do you have on the board? What version of ATF are you
> > using? Version of U-Boot?
>
> 4GB of RAM. As for the version numbers, see below. I don't use u-boot,
> but UEFI (EDK-II v2.60). The problem can be reproduced on two different
> machines, with the same configuration (and firmwares dating from a
> similar era):
>
> Starting CP-0 IOROM 1.07
> Booting from SD 0 (0x29)
> Found valid image at boot postion 0x002
> lNOTICE:  Starting binary extension
> NOTICE:  Gathering DRAM information
> mv_ddr: mv_ddr-armada-17.06.1-g47f4c8b (Jun  2 2017 - 17:07:23)
> mv_ddr: completed successfully
> NOTICE:  Booting Trusted Firmware
> NOTICE:  BL1: v1.3(release):armada-17.06.2:297d68f
> NOTICE:  BL1: Built : 17:07:27, Jun  2 2017
> NOTICE:  BL1: Booting BL2
> lNOTICE:  BL2: v1.3(release):armada-17.06.2:297d68f
> NOTICE:  BL2: Built : 17:07:28, Jun  2 2017
> NOTICE:  BL1: Booting BL31
> lNOTICE:  BL31: v1.3(release):armada-17.06.2:297d68f
> NOTICE:  BL31: Built : 17:07:30, Jun  2 2017
> lUEFI firmware (version MARVELL_EFI built at 17:12:21 on Jun  2 2017)
>
> Armada 8040 MachiatoBin Platform Init
>
> Comphy0-0: PCIE0 5 Gbps
> Comphy0-1: PCIE0 5 Gbps
> Comphy0-2: PCIE0 5 Gbps
> Comphy0-3: PCIE0 5 Gbps
> Comphy0-4: SFI   10.31 Gbps
> Comphy0-5: SATA1 5 Gbps
>
> Comphy1-0: SGMII11.25 Gbps
> Comphy1-1: SATA2 5 Gbps
> Comphy1-2: USB3_HOST05 Gbps
> Comphy1-3: SATA3 5 Gbps
> Comphy1-4: SFI   10.31 Gbps
> Comphy1-5: SGMII23.125 Gbps
>
> UTMI PHY 0 initialized to USB Host0
> UTMI PHY 1 initialized to USB Host1
> UTMI PHY 0 initialized to USB Host0
> RTC: Initialize controller 1
> Skip I2c chip 0
> Succesfully installed protocol interfaces
> ramdisk:blckio install. Status=Success
>
> With the latest mainline, and after fixing that other irq affinity
> bug (see patch posted yesterday), I only need to bring the interface
> up, without doing anything else:
>
> # ip link set eth0 up
> [  155.507877] mvpp2 f200.ethernet eth0: PHY [f212a600.mdio-mii:00] 
> driver [mv88x3310]
> [  155.526732] mvpp2 f200.ethernet eth0: configuring for phy/10gbase-kr 
> link mode
> [  157.592581] mvpp2 f200.ethernet eth0: Link is Up - 1Gbps/Full - flow 
> control rx/tx
> [  158.339396] BUG: Bad page state in process swapper/0  pfn:e6804
> [  158.345345] page:7e00039a0100 count:0 mapcount:0 
> mapping:8000e7bf3b00 index:0x8000e6804c00
> [  158.354696] flags: 0xfffc200(slab)
> [  158.358815] raw: 0fffc200 7e00039cff80 00040004 
> 8000e7bf3b00
> [  158.366594] raw: 8000e6804c00 801f  
> 
> [  158.374371] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> [  158.380840] bad because of flags: 0x200(slab)
> [  158.385216] Modules linked in:
> [  158.388288] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
>

Re: [BUG] MVPP2 driver exploding in presence of a tap interface

2018-10-30 Thread Marc Zyngier

Antoine,

On 30/10/18 10:50, Antoine Tenart wrote:
> Marc,
> 
> On Mon, Oct 29, 2018 at 03:05:53PM +, Marc Zyngier wrote:
>>
>> This is a follow-up on the conversation Thomas and I had last week at 
>> ELC, with me ranting at the sorry state of the MVPP2 driver.
> 
>> Triggering this is dead simple:
>> - Add a macvtap to one of the MVPP2 interfaces
>> - Bring it online
>> - Watch the kernel exploding and memory being corrupted
>>
>> You don't even need anything listening on the tap interface, just its
>> simple existence triggers it. I use a similar setup on a large variety 
>> of machines, and this box is the only one that catches fire. Removing
>> the macvtap interface makes it (more) reliable.
>>
>> Given that I cannot reproduce this issue on any other ARM (32 or 64bit)
>> platform, including other Marvell stuff, I can only conclude that the
>> MVPP2 driver is responsible for this.
>>
>> Example crash and .config below (4.19 vanilla, as linux/master dies in
>> new and wonderful ways on this box). I'm looking forward to testing any
>> idea you may have.
> 
> I used a 4.19 vanilla kernel, with both your configuration and mine,
> on 2 different Macchiatobins, but was unable to trigger the issue:
> 
>   # ip link set eth0 up
>   # ip link add link eth0 name macvtap0 type macvtap
>   # ip link set macvtap0 up> 
> I can even configure the eth0/macvtap0 interfaces, and use them
> generating or receiving tcp/udp/icmp traffic.
> 
> (I also made other tests using macvtap and tap interfaces).
> 
> How much memory do you have on the board? What version of ATF are you
> using? Version of U-Boot?

4GB of RAM. As for the version numbers, see below. I don't use u-boot, 
but UEFI (EDK-II v2.60). The problem can be reproduced on two different
machines, with the same configuration (and firmwares dating from a 
similar era):

Starting CP-0 IOROM 1.07
Booting from SD 0 (0x29)
Found valid image at boot postion 0x002
lNOTICE:  Starting binary extension
NOTICE:  Gathering DRAM information
mv_ddr: mv_ddr-armada-17.06.1-g47f4c8b (Jun  2 2017 - 17:07:23)
mv_ddr: completed successfully
NOTICE:  Booting Trusted Firmware
NOTICE:  BL1: v1.3(release):armada-17.06.2:297d68f
NOTICE:  BL1: Built : 17:07:27, Jun  2 2017
NOTICE:  BL1: Booting BL2
lNOTICE:  BL2: v1.3(release):armada-17.06.2:297d68f
NOTICE:  BL2: Built : 17:07:28, Jun  2 2017
NOTICE:  BL1: Booting BL31
lNOTICE:  BL31: v1.3(release):armada-17.06.2:297d68f
NOTICE:  BL31: Built : 17:07:30, Jun  2 2017
lUEFI firmware (version MARVELL_EFI built at 17:12:21 on Jun  2 2017)

Armada 8040 MachiatoBin Platform Init

Comphy0-0: PCIE0 5 Gbps
Comphy0-1: PCIE0 5 Gbps
Comphy0-2: PCIE0 5 Gbps
Comphy0-3: PCIE0 5 Gbps
Comphy0-4: SFI   10.31 Gbps
Comphy0-5: SATA1 5 Gbps

Comphy1-0: SGMII11.25 Gbps 
Comphy1-1: SATA2 5 Gbps
Comphy1-2: USB3_HOST05 Gbps
Comphy1-3: SATA3 5 Gbps
Comphy1-4: SFI   10.31 Gbps
Comphy1-5: SGMII23.125 Gbps

UTMI PHY 0 initialized to USB Host0
UTMI PHY 1 initialized to USB Host1
UTMI PHY 0 initialized to USB Host0
RTC: Initialize controller 1
Skip I2c chip 0
Succesfully installed protocol interfaces
ramdisk:blckio install. Status=Success

With the latest mainline, and after fixing that other irq affinity
bug (see patch posted yesterday), I only need to bring the interface
up, without doing anything else:

# ip link set eth0 up
[  155.507877] mvpp2 f200.ethernet eth0: PHY [f212a600.mdio-mii:00] driver 
[mv88x3310]
[  155.526732] mvpp2 f200.ethernet eth0: configuring for phy/10gbase-kr 
link mode
[  157.592581] mvpp2 f200.ethernet eth0: Link is Up - 1Gbps/Full - flow 
control rx/tx
[  158.339396] BUG: Bad page state in process swapper/0  pfn:e6804
[  158.345345] page:7e00039a0100 count:0 mapcount:0 
mapping:8000e7bf3b00 index:0x8000e6804c00
[  158.354696] flags: 0xfffc200(slab)
[  158.358815] raw: 0fffc200 7e00039cff80 00040004 
8000e7bf3b00
[  158.366594] raw: 8000e6804c00 801f  

[  158.374371] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
[  158.380840] bad because of flags: 0x200(slab)
[  158.385216] Modules linked in:
[  158.388288] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
4.19.0-09420-g34ae82ac683c #278
[  158.396148] Hardware name: Marvell 8040 MACCHIATOBin (DT)
[  158.401567] Call trace:
[  158.404031]  dump_backtrace+0x0/0x148
[  158.407708]  show_stack+0x14/0x20
[  158.411036]  dump_stack+0x90/0xb4
[  158.414365]  bad_page+0x104/0x130
[  158.417692]  free_pages_check_bad+0x9c/0xa8
[  158.421892]  __free_pages_ok+0x1b0/0x450
[  158.425829]  page_frag_free+0x8c/0xa8
[  158.429505]  skb_free_head+0x18/0x30
[  158.433093]  skb_release_data+0x130/0x160
[  158.437117]  skb_release_all+0x24/0x30
[  158.440881]  consume_skb+0x2c/0x58
[  158.444296]  arp_process.constprop.4+0x200/0x6f0
[  158.448931]  arp_rcv+0xf4/0x128
[

Re: [PATCH] xfrm: Fix error return code in xfrm_output_one()

2018-10-30 Thread Steffen Klassert

On Sat, Oct 27, 2018 at 06:12:06AM +, Wei Yongjun wrote:
> xfrm_output_one() does not return a error code when there is
> no dst_entry attached to the skb, it is still possible crash
> with a NULL pointer dereference in xfrm_output_resume(). Fix
> it by return error code -EHOSTUNREACH.
> 
> Fixes: 9e1437937807 ("xfrm: Fix NULL pointer dereference when skb_dst_force 
> clears the dst_entry.")
> Signed-off-by: Wei Yongjun 

Applied, thanks a lot!

Re: [PATCH net v5] net/ipv6: Add anycast addresses to a global hashtable

2018-10-30 Thread Jeff Barnhill

I originally started implementing it the way you suggested; however,
it seemed to complicate management of that structure because it isn't
currently using rcu.  Also, assuming that can be worked out, where
would I get the net from?  Would I need to store a copy in ifcaddr6,
or is there some way to access it during ipv6_chk_acast_addr()?  It
seems that if I don't add a copy of net, but instead access it through
aca_rt(?), then freeing the ifcaddr6 memory becomes problematic
(detaching it from idev, while read_rcu may still be accessing it).
On Mon, Oct 29, 2018 at 11:32 PM David Miller  wrote:
>
> From: Jeff Barnhill <0xeff...@gmail.com>
> Date: Sun, 28 Oct 2018 01:51:59 +
>
> > +struct ipv6_ac_addrlist {
> > + struct in6_addr acal_addr;
> > + possible_net_t  acal_pnet;
> > + refcount_t  acal_users;
> > + struct hlist_node   acal_lst; /* inet6_acaddr_lst */
> > + struct rcu_head rcu;
> > +};
>
> Please just add the hlist to ifcaddr6 instead of duplicating so much
> information and reference counters here.
>
> This seems to waste a lot of memory unnecessary and add lots of
> unnecessary object allocate/setup/destroy logic.

Re: Fw: [Bug 201423] New: eth0: hw csum failure

2018-10-30 Thread Andre Tomt


On 30.10.2018 11:58, Andre Tomt wrote:

On 27.10.2018 23:41, Andre Tomt wrote:

On 26.10.2018 13:45, Andre Tomt wrote:

On 25.10.2018 19:38, Eric Dumazet wrote:



On 10/24/2018 12:41 PM, Andre Tomt wrote:


It eventually showed up again with mlx4, on 4.18.16 + fix and also 
on 4.19. I still do not have a useful packet capture.


It is running a torrent client serving up various linux distributions.



Have you also applied this fix ?

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=db4f1be3ca9b0ef7330763d07bf4ace83ad6f913 





No. I've applied it now to 4.19 and will report back if anything 
shows up.


Just hit it on the simpler server; no VRF, no tunnels, no 
nat/conntrack. Only a basic stateless nftables ruleset and a vlan 
netdev (unlikely to be the one triggering this I guess; it has only v4 
traffic).


I'm currently testing 4.19 with the recomended commit added, plus these 
to sort out some GRO issues (on a hunch, unsure if related):
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=a8305bff685252e80b7c60f4f5e7dd2e63e38218 

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=992cba7e276d438ac8b0a8c17b147b37c8c286f7 

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=ece23711dd956cd5053c9cb03e9fe0668f9c8894 



and I *think* it is behaving better now? it's not conclusive as it could 
take a while to trip in this environment but some of the test servers 
have not shown anything bad in almost 24h.


Sorry, s/some of the/none of the

Re: Fw: [Bug 201423] New: eth0: hw csum failure

2018-10-30 Thread Andre Tomt


On 27.10.2018 23:41, Andre Tomt wrote:

On 26.10.2018 13:45, Andre Tomt wrote:

On 25.10.2018 19:38, Eric Dumazet wrote:



On 10/24/2018 12:41 PM, Andre Tomt wrote:


It eventually showed up again with mlx4, on 4.18.16 + fix and also 
on 4.19. I still do not have a useful packet capture.


It is running a torrent client serving up various linux distributions.



Have you also applied this fix ?

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=db4f1be3ca9b0ef7330763d07bf4ace83ad6f913 





No. I've applied it now to 4.19 and will report back if anything shows 
up.


Just hit it on the simpler server; no VRF, no tunnels, no nat/conntrack. 
Only a basic stateless nftables ruleset and a vlan netdev (unlikely to 
be the one triggering this I guess; it has only v4 traffic).


I'm currently testing 4.19 with the recomended commit added, plus these 
to sort out some GRO issues (on a hunch, unsure if related):

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=a8305bff685252e80b7c60f4f5e7dd2e63e38218
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=992cba7e276d438ac8b0a8c17b147b37c8c286f7
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=ece23711dd956cd5053c9cb03e9fe0668f9c8894

and I *think* it is behaving better now? it's not conclusive as it could 
take a while to trip in this environment but some of the test servers 
have not shown anything bad in almost 24h.

Re: [BUG] MVPP2 driver exploding in presence of a tap interface

2018-10-30 Thread Antoine Tenart

Marc,

On Mon, Oct 29, 2018 at 03:05:53PM +, Marc Zyngier wrote:
> 
> This is a follow-up on the conversation Thomas and I had last week at 
> ELC, with me ranting at the sorry state of the MVPP2 driver.

> Triggering this is dead simple:
> - Add a macvtap to one of the MVPP2 interfaces
> - Bring it online
> - Watch the kernel exploding and memory being corrupted
> 
> You don't even need anything listening on the tap interface, just its
> simple existence triggers it. I use a similar setup on a large variety 
> of machines, and this box is the only one that catches fire. Removing
> the macvtap interface makes it (more) reliable.
> 
> Given that I cannot reproduce this issue on any other ARM (32 or 64bit)
> platform, including other Marvell stuff, I can only conclude that the
> MVPP2 driver is responsible for this.
> 
> Example crash and .config below (4.19 vanilla, as linux/master dies in
> new and wonderful ways on this box). I'm looking forward to testing any
> idea you may have.

I used a 4.19 vanilla kernel, with both your configuration and mine,
on 2 different Macchiatobins, but was unable to trigger the issue:

  # ip link set eth0 up
  # ip link add link eth0 name macvtap0 type macvtap
  # ip link set macvtap0 up

I can even configure the eth0/macvtap0 interfaces, and use them
generating or receiving tcp/udp/icmp traffic.

(I also made other tests using macvtap and tap interfaces).

How much memory do you have on the board? What version of ATF are you
using? Version of U-Boot?

Antoine

-- 
Antoine Ténart, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

Re: [Patch V4 net 01/11] net: hns3: add error handler for hns3_nic_init_vector_data()

2018-10-30 Thread tanhuazhong





On 2018/10/30 17:11, Sergei Shtylyov wrote:

On 10/29/2018 4:54 PM, Huazhong Tan wrote:


When hns3_nic_init_vector_data() fails to map ring to vector,
it should cancel the netif_napi_add() that has been successfully
done and then exits.

Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver 
for hip08 SoC")

Signed-off-by: Huazhong Tan 
---
  drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 10 --
  1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c

index 32f3aca8..d9066c5 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2821,7 +2821,7 @@ static int hns3_nic_init_vector_data(struct 
hns3_nic_priv *priv)

  struct hnae3_handle *h = priv->ae_handle;
  struct hns3_enet_tqp_vector *tqp_vector;
  int ret = 0;
-    u16 i;
+    int i, j;
  hns3_nic_set_cpumask(priv);
@@ -2868,13 +2868,19 @@ static int hns3_nic_init_vector_data(struct 
hns3_nic_priv *priv)

  hns3_free_vector_ring_chain(tqp_vector, _ring_chain);
  if (ret)
-    return ret;
+    goto map_ring_fail;
  netif_napi_add(priv->netdev, _vector->napi,
 hns3_nic_common_poll, NAPI_POLL_WEIGHT);
  }
  return 0;
+
+map_ring_fail:
+    for (j = i - 1; j >= 0; j--)
+    netif_napi_del(>tqp_vector[j].napi);


    'j' doesn't seem needed as well.



yes, it will be change to below one.

+
+map_ring_fail:
+   while(i--)
+   netif_napi_del(>tqp_vector[i].napi);
+
+   return ret;


[...]

MBR, Sergei

.


Thanks, Huazhong.

Re: [Patch V4 net 02/11] net: hns3: add error handler for hns3_get_ring_config/hns3_queue_to_ring

2018-10-30 Thread tanhuazhong





On 2018/10/30 17:09, Sergei Shtylyov wrote:

Hello!

On 10/29/2018 4:54 PM, Huazhong Tan wrote:


When hns3_get_ring_config()/hns3_queue_to_ring() failed during resetting,
the allocated memory has not been freed before hns3_get_ring_config() and
hns3_queue_to_ring() return. So this patch fixes the buffer not freeing
problem during resetting.

Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver 
for hip08 SoC")

Signed-off-by: Huazhong Tan 
---
  drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 12 ++--
  1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c

index d9066c5..6f0fd62 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c

[...]
@@ -3047,7 +3049,7 @@ static int hns3_get_ring_config(struct 
hns3_nic_priv *priv)

  {
  struct hnae3_handle *h = priv->ae_handle;
  struct pci_dev *pdev = h->pdev;
-    int i, ret;
+    int i, j, ret;
  priv->ring_data =  devm_kzalloc(>dev,
  array3_size(h->kinfo.num_tqps,
@@ -3065,6 +3067,12 @@ static int hns3_get_ring_config(struct 
hns3_nic_priv *priv)

  return 0;
  err:
+    for (j = i - 1; j >= 0; j--) {


    As is with the other patch, you don't need 'j' here.



Yes, i have modified it.
Thanks.


+    devm_kfree(priv->dev, priv->ring_data[j].ring);
+    devm_kfree(priv->dev,
+   priv->ring_data[j + h->kinfo.num_tqps].ring);
+    }
+
  devm_kfree(>dev, priv->ring_data);
  return ret;
  }


MBR, Sergei



Greeting.
Huazhong.


.

Re: [Patch V4 net 01/11] net: hns3: add error handler for hns3_nic_init_vector_data()

2018-10-30 Thread Sergei Shtylyov


On 10/29/2018 4:54 PM, Huazhong Tan wrote:


When hns3_nic_init_vector_data() fails to map ring to vector,
it should cancel the netif_napi_add() that has been successfully
done and then exits.

Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 
SoC")
Signed-off-by: Huazhong Tan 
---
  drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 10 --
  1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 32f3aca8..d9066c5 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2821,7 +2821,7 @@ static int hns3_nic_init_vector_data(struct hns3_nic_priv 
*priv)
struct hnae3_handle *h = priv->ae_handle;
struct hns3_enet_tqp_vector *tqp_vector;
int ret = 0;
-   u16 i;
+   int i, j;
  
  	hns3_nic_set_cpumask(priv);
  
@@ -2868,13 +2868,19 @@ static int hns3_nic_init_vector_data(struct hns3_nic_priv *priv)

hns3_free_vector_ring_chain(tqp_vector, _ring_chain);
  
  		if (ret)

-   return ret;
+   goto map_ring_fail;
  
  		netif_napi_add(priv->netdev, _vector->napi,

   hns3_nic_common_poll, NAPI_POLL_WEIGHT);
}
  
  	return 0;

+
+map_ring_fail:
+   for (j = i - 1; j >= 0; j--)
+   netif_napi_del(>tqp_vector[j].napi);


   'j' doesn't seem needed as well.

[...]

MBR, Sergei

[bug report] PCI: Remove NULL device handling from PCI DMA API

2018-10-30 Thread Dan Carpenter

Hello Christoph Hellwig,

The patch 4167b2ad5182: "PCI: Remove NULL device handling from PCI
DMA API" from Jan 10, 2018, leads to the following static checker
warning:

drivers/net/ethernet/amd/pcnet32.c:1921 pcnet32_probe1()
warn: variable dereferenced before check 'pdev' (see line 1843)

drivers/net/ethernet/amd/pcnet32.c
  1839  
  1840  dev->base_addr = ioaddr;
  1841  lp = netdev_priv(dev);
  1842  /* pci_alloc_consistent returns page-aligned memory, so we do 
not have to check the alignment */
  1843  lp->init_block = pci_alloc_consistent(pdev, 
sizeof(*lp->init_block),
  
This function is called with a NULL "pdev" when we're probing from
pcnet32_probe_vlbus().

  1844>init_dma_addr);
  1845  if (!lp->init_block) {
  1846  if (pcnet32_debug & NETIF_MSG_PROBE)
  1847  pr_err("Consistent memory allocation failed\n");
  1848  ret = -ENOMEM;
  1849  goto err_free_netdev;
  1850  }
  1851  lp->pci_dev = pdev;
  1852  
  1853  lp->dev = dev;
  1854  

regards,
dan carpenter

Re: [Patch V4 net 02/11] net: hns3: add error handler for hns3_get_ring_config/hns3_queue_to_ring

2018-10-30 Thread Sergei Shtylyov


Hello!

On 10/29/2018 4:54 PM, Huazhong Tan wrote:


When hns3_get_ring_config()/hns3_queue_to_ring() failed during resetting,
the allocated memory has not been freed before hns3_get_ring_config() and
hns3_queue_to_ring() return. So this patch fixes the buffer not freeing
problem during resetting.

Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 
SoC")
Signed-off-by: Huazhong Tan 
---
  drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 12 ++--
  1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index d9066c5..6f0fd62 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c

[...]

@@ -3047,7 +3049,7 @@ static int hns3_get_ring_config(struct hns3_nic_priv 
*priv)
  {
struct hnae3_handle *h = priv->ae_handle;
struct pci_dev *pdev = h->pdev;
-   int i, ret;
+   int i, j, ret;
  
  	priv->ring_data =  devm_kzalloc(>dev,

array3_size(h->kinfo.num_tqps,
@@ -3065,6 +3067,12 @@ static int hns3_get_ring_config(struct hns3_nic_priv 
*priv)
  
  	return 0;

  err:
+   for (j = i - 1; j >= 0; j--) {


   As is with the other patch, you don't need 'j' here.


+   devm_kfree(priv->dev, priv->ring_data[j].ring);
+   devm_kfree(priv->dev,
+  priv->ring_data[j + h->kinfo.num_tqps].ring);
+   }
+
devm_kfree(>dev, priv->ring_data);
return ret;
  }


MBR, Sergei

Re: Latest net-next kernel 4.19.0+

2018-10-30 Thread Paweł Staszewski





W dniu 30.10.2018 o 08:29, Eric Dumazet pisze:


On 10/29/2018 11:09 PM, Dimitris Michailidis wrote:


Indeed this is a bug. I would expect it to produce frequent errors
though as many odd-length
packets would trigger it. Do you have RXFCS? Regardless, how
frequently do you see the problem?


Old kernels (before 88078d98d1bb) were simply resetting ip_summed to 
CHECKSUM_NONE

And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the bug 
you fixed.

So we now need to also fix mlx5.

And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned 
earlier,
plus __get_unaligned_cpu32() as you hinted.






No RXFCS

And this trace is rly frequently like once per 3/4 seconds
like below:
[28965.776864] vlan1490: hw csum failure
[28965.776867] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0+ #1
[28965.776868] Call Trace:
[28965.776870]  
[28965.776876]  dump_stack+0x46/0x5b
[28965.776879]  __skb_checksum_complete+0x9a/0xa0
[28965.776882]  tcp_v4_rcv+0xef/0x960
[28965.776884]  ip_local_deliver_finish+0x49/0xd0
[28965.776886]  ip_local_deliver+0x5e/0xe0
[28965.776888]  ? ip_sublist_rcv_finish+0x50/0x50
[28965.776889]  ip_rcv+0x41/0xc0
[28965.776891]  __netif_receive_skb_one_core+0x4b/0x70
[28965.776893]  netif_receive_skb_internal+0x2f/0xd0
[28965.776894]  napi_gro_receive+0xb7/0xe0
[28965.776897]  mlx5e_handle_rx_cqe+0x7a/0xd0
[28965.776899]  mlx5e_poll_rx_cq+0xc6/0x930
[28965.776900]  mlx5e_napi_poll+0xab/0xc90
[28965.776904]  ? kmem_cache_free_bulk+0x1e4/0x280
[28965.776905]  net_rx_action+0x1f1/0x320
[28965.776909]  __do_softirq+0xec/0x2b7
[28965.776912]  irq_exit+0x7b/0x80
[28965.776913]  do_IRQ+0x45/0xc0
[28965.776915]  common_interrupt+0xf/0xf
[28965.776916]  
[28965.776918] RIP: 0010:mwait_idle+0x5f/0x1b0
[28965.776919] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 
01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 
c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0
[28965.776920] RSP: 0018:82203e98 EFLAGS: 0246 ORIG_RAX: 
ffd3
[28965.776921] RAX:  RBX:  RCX: 

[28965.776922] RDX:  RSI:  RDI: 

[28965.776922] RBP:  R08: 00aa R09: 
88046f81fbc0
[28965.776923] R10:  R11: 0001006d5985 R12: 
8220f780
[28965.776924] R13: 8220f780 R14:  R15: 


[28965.776927]  do_idle+0x1a3/0x1c0
[28965.776929]  cpu_startup_entry+0x14/0x20
[28965.776932]  start_kernel+0x488/0x4a8
[28965.776935]  secondary_startup_64+0xa4/0xb0
[28965.981529] vlan1490: hw csum failure
[28965.981531] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0+ #1
[28965.981532] Call Trace:
[28965.981534]  
[28965.981539]  dump_stack+0x46/0x5b
[28965.981543]  __skb_checksum_complete+0x9a/0xa0
[28965.981545]  tcp_v4_rcv+0xef/0x960
[28965.981548]  ip_local_deliver_finish+0x49/0xd0
[28965.981550]  ip_local_deliver+0x5e/0xe0
[28965.981551]  ? ip_sublist_rcv_finish+0x50/0x50
[28965.981552]  ip_rcv+0x41/0xc0
[28965.981555]  __netif_receive_skb_one_core+0x4b/0x70
[28965.981556]  netif_receive_skb_internal+0x2f/0xd0
[28965.981558]  napi_gro_receive+0xb7/0xe0
[28965.981560]  mlx5e_handle_rx_cqe+0x7a/0xd0
[28965.981562]  mlx5e_poll_rx_cq+0xc6/0x930
[28965.981563]  mlx5e_napi_poll+0xab/0xc90
[28965.981567]  ? kmem_cache_free_bulk+0x1e4/0x280
[28965.981568]  net_rx_action+0x1f1/0x320
[28965.981571]  __do_softirq+0xec/0x2b7
[28965.981575]  irq_exit+0x7b/0x80
[28965.981576]  do_IRQ+0x45/0xc0
[28965.981578]  common_interrupt+0xf/0xf
[28965.981579]  
[28965.981580] RIP: 0010:mwait_idle+0x5f/0x1b0
[28965.981582] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 
01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 
c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0
[28965.981583] RSP: 0018:82203e98 EFLAGS: 0246 ORIG_RAX: 
ffd3
[28965.981584] RAX:  RBX:  RCX: 

[28965.981585] RDX:  RSI:  RDI: 

[28965.981586] RBP:  R08: 0383 R09: 
88046f81fbc0
[28965.981586] R10:  R11: 0001006d59b8 R12: 
8220f780
[28965.981587] R13: 8220f780 R14:  R15: 


[28965.981591]  do_idle+0x1a3/0x1c0
[28965.981592]  cpu_startup_entry+0x14/0x20
[28965.981596]  start_kernel+0x488/0x4a8
[28965.981600]  secondary_startup_64+0xa4/0xb0
[28966.511782] vlan1490: hw csum failure
[28966.511785] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0+ #1
[28966.511785] Call Trace:
[28966.511787]  
[28966.511793]  dump_stack+0x46/0x5b
[28966.511797]  __skb_checksum_complete+0x9a/0xa0
[28966.511799]  tcp_v4_rcv+0xef/0x960
[28966.511802]  ip_local_deliver_finish+0x49/0xd0
[28966.511804]  ip_local_deliver+0x5e/0xe0
[28966.511806]  ? ip_sublist_rcv_finish+0x50/0x50

[PATCH net] net/mlx5e: fix csum adjustments caused by RXFCS

2018-10-30 Thread Eric Dumazet

As shown by Dmitris, we need to use csum_block_add() instead of csum_add()
when adding the FCS contribution to skb csum.

Before 4.18 (more exactly commit 88078d98d1bb "net: pskb_trim_rcsum()
and CHECKSUM_COMPLETE are friends"), the whole skb csum was thrown away,
so RXFCS changes were ignored.

Then before commit d55bef5059dd ("net: fix pskb_trim_rcsum_slow() with
odd trim offset") both mlx5 and pskb_trim_rcsum_slow() bugs were canceling
each other.

Now we fixed pskb_trim_rcsum_slow() we need to fix mlx5.

Note that this patch also rewrites mlx5e_get_fcs() to :

- Use skb_header_pointer() instead of reinventing it.
- Use __get_unaligned_cpu32() to avoid possible non aligned accesses
  as Dmitris pointed out.

Fixes: 902a545904c7 ("net/mlx5e: When RXFCS is set, add FCS data into checksum 
calculation")
Reported-by: Paweł Staszewski 
Signed-off-by: Eric Dumazet 
Cc: Eran Ben Elisha 
Cc: Saeed Mahameed 
Cc: Dimitris Michailidis 
Cc: Cong Wang 
Cc: Paweł Staszewski 
---
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 45 ---
 1 file changed, 9 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 
94224c22ecc310a87b6715051e335446f29bec03..79638dcbae78395fb723c9bf3fa877e7a42d91cd
 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -713,43 +713,15 @@ static inline void mlx5e_enable_ecn(struct mlx5e_rq *rq, 
struct sk_buff *skb)
rq->stats->ecn_mark += !!rc;
 }
 
-static __be32 mlx5e_get_fcs(struct sk_buff *skb)
+static u32 mlx5e_get_fcs(const struct sk_buff *skb)
 {
-   int last_frag_sz, bytes_in_prev, nr_frags;
-   u8 *fcs_p1, *fcs_p2;
-   skb_frag_t *last_frag;
-   __be32 fcs_bytes;
+   const void *fcs_bytes;
+   u32 _fcs_bytes;
 
-   if (!skb_is_nonlinear(skb))
-   return *(__be32 *)(skb->data + skb->len - ETH_FCS_LEN);
+   fcs_bytes = skb_header_pointer(skb, skb->len - ETH_FCS_LEN,
+  ETH_FCS_LEN, &_fcs_bytes);
 
-   nr_frags = skb_shinfo(skb)->nr_frags;
-   last_frag = _shinfo(skb)->frags[nr_frags - 1];
-   last_frag_sz = skb_frag_size(last_frag);
-
-   /* If all FCS data is in last frag */
-   if (last_frag_sz >= ETH_FCS_LEN)
-   return *(__be32 *)(skb_frag_address(last_frag) +
-  last_frag_sz - ETH_FCS_LEN);
-
-   fcs_p2 = (u8 *)skb_frag_address(last_frag);
-   bytes_in_prev = ETH_FCS_LEN - last_frag_sz;
-
-   /* Find where the other part of the FCS is - Linear or another frag */
-   if (nr_frags == 1) {
-   fcs_p1 = skb_tail_pointer(skb);
-   } else {
-   skb_frag_t *prev_frag = _shinfo(skb)->frags[nr_frags - 2];
-
-   fcs_p1 = skb_frag_address(prev_frag) +
-   skb_frag_size(prev_frag);
-   }
-   fcs_p1 -= bytes_in_prev;
-
-   memcpy(_bytes, fcs_p1, bytes_in_prev);
-   memcpy(((u8 *)_bytes) + bytes_in_prev, fcs_p2, last_frag_sz);
-
-   return fcs_bytes;
+   return __get_unaligned_cpu32(fcs_bytes);
 }
 
 static u8 get_ip_proto(struct sk_buff *skb, __be16 proto)
@@ -797,8 +769,9 @@ static inline void mlx5e_handle_csum(struct net_device 
*netdev,
 network_depth - ETH_HLEN,
 skb->csum);
if (unlikely(netdev->features & NETIF_F_RXFCS))
-   skb->csum = csum_add(skb->csum,
-(__force 
__wsum)mlx5e_get_fcs(skb));
+   skb->csum = csum_block_add(skb->csum,
+  (__force 
__wsum)mlx5e_get_fcs(skb),
+  skb->len - ETH_FCS_LEN);
stats->csum_complete++;
return;
}
-- 
2.19.1.568.g152ad8e336-goog

Re: Latest net-next kernel 4.19.0+

2018-10-30 Thread Eric Dumazet

On 10/29/2018 11:09 PM, Dimitris Michailidis wrote:

> 
> Indeed this is a bug. I would expect it to produce frequent errors
> though as many odd-length
> packets would trigger it. Do you have RXFCS? Regardless, how
> frequently do you see the problem?
> 

Old kernels (before 88078d98d1bb) were simply resetting ip_summed to 
CHECKSUM_NONE

And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the bug 
you fixed.

So we now need to also fix mlx5.

And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned 
earlier,
plus __get_unaligned_cpu32() as you hinted.

[PATCH net] net/mlx4_en: add a missing include

2018-10-30 Thread Eric Dumazet

Abdul Haleem reported a build error on ppc :

drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: `struct
iphdr` declared inside parameter list [enabled by default]
   struct iphdr *iph)
  ^
drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: its scope is
only this definition or declaration, which is probably not what you want
[enabled by default]
drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function
get_fixed_ipv4_csum:
drivers/net/ethernet/mellanox/mlx4/en_rx.c:586:20: error: dereferencing
pointer to incomplete type
  __u8 ipproto = iph->protocol;
^

Fixes: 55469bc6b577 ("drivers: net: remove  inclusion when not 
needed")
Signed-off-by: Eric Dumazet 
Reported-by: Abdul Haleem 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 
5a6d0919533d6e0e619927abd753c5d07ed95dac..db00bf1c23f5ad31d64652ddc8bee32e2e7534c8
 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 
+#include 
 #if IS_ENABLED(CONFIG_IPV6)
 #include 
 #endif
-- 
2.19.1.568.g152ad8e336-goog

Re: Latest net-next kernel 4.19.0+

2018-10-30 Thread Dimitris Michailidis

On Mon, Oct 29, 2018 at 8:52 PM, Eric Dumazet  wrote:
>
>
> On 10/29/2018 07:53 PM, Eric Dumazet wrote:
>>
>>
>> On 10/29/2018 07:27 PM, Cong Wang wrote:
>>> Hi,
>>>
>>> On Mon, Oct 29, 2018 at 5:19 PM Paweł Staszewski  
>>> wrote:

 Sorry not complete - followed by hw csum:

 [  342.190831] vlan1490: hw csum failure
 [  342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1
 [  342.190836] Call Trace:
 [  342.190839]  
 [  342.190849]  dump_stack+0x46/0x5b
 [  342.190856]  __skb_checksum_complete+0x9a/0xa0
 [  342.190859]  tcp_v4_rcv+0xef/0x960
 [  342.190864]  ip_local_deliver_finish+0x49/0xd0
 [  342.190866]  ip_local_deliver+0x5e/0xe0
 [  342.190869]  ? ip_sublist_rcv_finish+0x50/0x50
 [  342.190870]  ip_rcv+0x41/0xc0
 [  342.190874]  __netif_receive_skb_one_core+0x4b/0x70
 [  342.190877]  netif_receive_skb_internal+0x2f/0xd0
 [  342.190879]  napi_gro_receive+0xb7/0xe0
 [  342.190884]  mlx5e_handle_rx_cqe+0x7a/0xd0
 [  342.190886]  mlx5e_poll_rx_cq+0xc6/0x930
 [  342.190888]  mlx5e_napi_poll+0xab/0xc90
>>>
>>>
>>> We got exactly the same backtrace in our data center. However,
>>> it is not easy for us to reproduce it, do you have any clue to reproduce it?
>>>
>>> If you do, try to tcpdump the packets triggering this warning, it could
>>> be useful for debugging.
>>>
>>> Also, we tried to apply commit d55bef5059dd057bd, the warning _still_
>>> occurs. We tried to revert the offending commit 88078d98d1bb, it
>>> disappears. So it is likely that commit 88078d98d1bb introduces
>>> more troubles than the one fixed by d55bef5059dd057bd.
>>>
>>
>> Or this could be that mlx5 driver is buggy when dealing with VLAN tags.
>>
>> It both uses vlan_tci (hardware vlan offload) in skb _and_ this piece of 
>> code in mlx5e_handle_csum()
>>
>>   if (network_depth > ETH_HLEN)
>>   /* CQE csum is calculated from the IP header and does
>>* not cover VLAN headers (if present). This will add
>>* the checksum manually.
>>*/
>>   skb->csum = csum_partial(skb->data + ETH_HLEN,
>>network_depth - ETH_HLEN,
>>skb->csum);
>>
>>
>> That seems strange to me, because skb_vlan_untag() will not adjust skb->csum 
>> in this case.
>>
>
> Bug might be in NETIF_F_RXFCS mlx5 handling btw...
>
> Code does :
>
> if (unlikely(netdev->features & NETIF_F_RXFCS))
>  skb->csum = csum_add(skb->csum,
>   (__force __wsum)mlx5e_get_fcs(skb));
>
> But Dimitris told us that we need to take into account if FCS starts at odd 
> or even offset.
>
> ->
> if (unlikely(netdev->features & NETIF_F_RXFCS))
>  skb->csum = csum_block_add(skb->csum,
> (__force __wsum)mlx5e_get_fcs(skb),
> skb->len);
>

Indeed this is a bug. I would expect it to produce frequent errors
though as many odd-length
packets would trigger it. Do you have RXFCS? Regardless, how
frequently do you see the problem?

There is some other questionable code in the driver's RXFCS implementation.
Code like

return *(__be32 *)(skb->data + skb->len - ETH_FCS_LEN);

doesn't work on processors with alignment requirements.