[PATCH net-next 2/2] selftests: rtnetlink: use a local IP address for IPsec tests

2018-06-19 Thread Shannon Nelson
Find an IP address on this machine to use as a source IP, and
make up a destination IP address based on the source IP.  No
actual messages will be sent, just a couple of IPsec rules are
created and deleted.

Fixes: 5e596ee171ba ("selftests: add xfrm state-policy-monitor to rtnetlink.sh")
Reported-by: Anders Roxell 
Signed-off-by: Shannon Nelson 
---
 tools/testing/selftests/net/rtnetlink.sh | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/rtnetlink.sh 
b/tools/testing/selftests/net/rtnetlink.sh
index 0a2bc6e..b33a371 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -522,8 +522,12 @@ kci_test_macsec()
 #---
 kci_test_ipsec()
 {
-   srcip="14.0.0.52"
-   dstip="14.0.0.70"
+   # find an ip address on this machine and make up a destination
+   srcip=`ip -o addr | awk '/inet / { print $4; }' | grep -v "^127" | head 
-1 | cut -f1 -d/`
+   net=`echo $srcip | cut -f1-3 -d.`
+   base=`echo $srcip | cut -f4 -d.`
+   dstip="$net."`expr $base + 1`
+
algo="aead rfc4106(gcm(aes)) 0x3132333435363738393031323334353664636261 
128"
 
# flush to be sure there's nothing configured
-- 
2.7.4



[PATCH net-next 0/2] fixes for ipsec selftests

2018-06-19 Thread Shannon Nelson
A couple of bad behaviors in the ipsec selftest were pointed out
by Anders Roxell  and are addressed here.

Shannon Nelson (2):
  selftests: rtnetlink: hide complaint from terminated monitor
  selftests: rtnetlink: use a local IP address for IPsec tests

 tools/testing/selftests/net/rtnetlink.sh | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

-- 
2.7.4



[PATCH net-next 1/2] selftests: rtnetlink: hide complaint from terminated monitor

2018-06-19 Thread Shannon Nelson
Set up the "ip xfrm monitor" subprogram so as to not see
a "Terminated" message when the subprogram is killed.

Fixes: 5e596ee171ba ("selftests: add xfrm state-policy-monitor to rtnetlink.sh")
Reported-by: Anders Roxell 
Signed-off-by: Shannon Nelson 
---
 tools/testing/selftests/net/rtnetlink.sh | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/rtnetlink.sh 
b/tools/testing/selftests/net/rtnetlink.sh
index 760faef..0a2bc6e 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -532,8 +532,7 @@ kci_test_ipsec()
 
# start the monitor in the background
tmpfile=`mktemp ipsectestXXX`
-   ip x m > $tmpfile &
-   mpid=$!
+   mpid=`(ip x m > $tmpfile & echo $!) 2>/dev/null`
sleep 0.2
 
ipsecid="proto esp src $srcip dst $dstip spi 0x07"
-- 
2.7.4



Re: [PATCH net] ip: limit use of gso_size to udp

2018-06-19 Thread David Miller
From: Willem de Bruijn 
Date: Tue, 19 Jun 2018 06:40:26 -0400

> From: Willem de Bruijn 
> 
> The ipcm(6)_cookie field gso_size is set only in the udp path. The ip
> layer copies this to cork only if sk_type is SOCK_DGRAM. This check
> proved too permissive. Ping and l2tp sockets have the same type.
> 
> Limit to sockets of type SOCK_DGRAM and protocol IPPROTO_UDP to
> exclude ping sockets.
> 
> Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT")
> Reported-by: Maciej Żenczykowski 
> Signed-off-by: Willem de Bruijn 

Applied, thanks Willem.

> For net-next, I'll take a look whether ipcm(6)_cookie fields like
> these can be initialized uniformly, and then this branch removed
> completely.

Sounds good.


Re: [PATCH] ucc_geth: Add BQL support

2018-06-19 Thread David Miller
From: Joakim Tjernlund 
Date: Tue, 19 Jun 2018 18:30:36 +0200

> @@ -3242,6 +3243,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
>   struct ucc_geth_private *ugeth = netdev_priv(dev);
>   u8 __iomem *bd; /* BD pointer */
>   u32 bd_status;
> + int howmany = 0;
> + unsigned int bytes_sent = 0;

Please keep the function local variable declarations ordered from
longest to shortest line.

Thank you.


Re: [PATCH v1 net] stmmac: fix DMA channel hang in half-duplex mode

2018-06-19 Thread David Miller
From: Bhadram Varka 
Date: Sun, 17 Jun 2018 20:02:05 +0530

> HW does not support Half-duplex mode in multi-queue
> scenario. Fix it by not advertising the Half-Duplex
> mode if multi-queue enabled.
> 
> Signed-off-by: Bhadram Varka 

Applied and queued up for -stable.


Re: [PATCH net] ipvlan: call dev_change_flags when reset ipvlan mode

2018-06-19 Thread David Miller
From: Hangbin Liu 
Date: Wed, 20 Jun 2018 11:22:54 +0800

> The only case dev_change_flags() return an err is when we change IFF_UP flag.
> Since we only set/reset IFF_NOARP, do you think we still need to check the
> return value?

It is bad to try and take shortcuts on error handling using assumptions
like that.

If dev_change_flags() is adjusted to return error codes in more
situations, nobody is going to remember to undo your "optimziation"
here.

Please check for errors, thank you.


Re: [PATCH net] net: sungem: fix rx checksum support

2018-06-19 Thread David Miller
From: Eric Dumazet 
Date: Tue, 19 Jun 2018 19:18:50 -0700

> After commit 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE
> are friends"), sungem owners reported the infamous "eth0: hw csum failure"
> message.
> 
> CHECKSUM_COMPLETE has in fact never worked for this driver, but this
> was masked by the fact that upper stacks had to strip the FCS, and
> therefore skb->ip_summed was set back to CHECKSUM_NONE before
> my recent change.
> 
> Driver configures a number of bytes to skip when the chip computes
> the checksum, and for some reason only half of the Ethernet header
> was skipped.
> 
> Then a second problem is that we should strip the FCS by default,
> unless the driver is updated to eventually support NETIF_F_RXFCS in
> the future.
> 
> Finally, a driver should check if NETIF_F_RXCSUM feature is enabled
> or not, so that the admin can turn off rx checksum if wanted.
> 
> Many thanks to Andreas Schwab and Mathieu Malaterre for their
> help in debugging this issue.
> 
> Signed-off-by: Eric Dumazet 
> Reported-by: Meelis Roos 
> Reported-by: Mathieu Malaterre 
> Reported-by: Andreas Schwab 
> Tested-by: Andreas Schwab 

Applied and queued up for -stable, thanks Eric.


Re: [PATCH net v3 2/2] ipv4: igmp: use alarmtimer to prevent delayed reports

2018-06-19 Thread kbuild test robot
Hi Tejaswi,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net/master]

url:
https://github.com/0day-ci/linux/commits/Tejaswi-Tanikella/ktime-helpers-to-convert-between-ktime-and-jiffies/20180611-214916
config: x86_64-randconfig-s4-06200944 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   In file included from include/linux/timer.h:6:0,
from include/linux/workqueue.h:9,
from include/linux/srcu.h:34,
from include/linux/notifier.h:16,
from include/linux/memory_hotplug.h:7,
from include/linux/mmzone.h:777,
from include/linux/gfp.h:6,
from include/linux/umh.h:4,
from include/linux/kmod.h:22,
from include/linux/module.h:13,
from net/ipv4/igmp.c:73:
   net/ipv4/igmp.c: In function 'igmp_mc_seq_show':
>> net/ipv4/igmp.c:2819:28: error: implicit declaration of function 
>> 'alarm_expires_remaining'; did you mean 'hrtimer_expires_remaining'? 
>> [-Werror=implicit-function-declaration]
  delta = ktime_to_jiffies(alarm_expires_remaining(>alarm));
   ^
   include/linux/ktime.h:100:48: note: in definition of macro 'ktime_to_jiffies'
#define ktime_to_jiffies(kt)  nsecs_to_jiffies(kt)
   ^~
>> net/ipv4/igmp.c:2819:55: error: 'struct ip_mc_list' has no member named 
>> 'alarm'
  delta = ktime_to_jiffies(alarm_expires_remaining(>alarm));
  ^
   include/linux/ktime.h:100:48: note: in definition of macro 'ktime_to_jiffies'
#define ktime_to_jiffies(kt)  nsecs_to_jiffies(kt)
   ^~
   cc1: some warnings being treated as errors

vim +2819 net/ipv4/igmp.c

  2813  
  2814  if (rcu_access_pointer(state->in_dev->mc_list) == im) {
  2815  seq_printf(seq, "%d\t%-10s: %5d %7s\n",
  2816 state->dev->ifindex, 
state->dev->name, state->in_dev->mc_count, querier);
  2817  }
  2818  
> 2819  delta = 
> ktime_to_jiffies(alarm_expires_remaining(>alarm));
  2820  seq_printf(seq,
  2821 "\t\t\t\t%08X %5d %d:%08lX\t\t%d\n",
  2822 im->multiaddr, im->users,
  2823 im->tm_running,
  2824 im->tm_running ? 
jiffies_delta_to_clock_t(delta) : 0,
  2825 im->reporter);
  2826  }
  2827  return 0;
  2828  }
  2829  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


[PATCH net-next] tcp: ignore rcv_rtt sample with old ts ecr value

2018-06-19 Thread Wei Wang
From: Wei Wang 

When receiving multiple packets with the same ts ecr value, only try
to compute rcv_rtt sample with the earliest received packet.
This is because the rcv_rtt calculated by later received packets
could possibly include long idle time or other types of delay.
For example:
(1) server sends last packet of reply with TS val V1
(2) client ACKs last packet of reply with TS ecr V1
(3) long idle time passes
(4) client sends next request data packet with TS ecr V1 (again!)
At this time, the rcv_rtt computed on server with TS ecr V1 will be
inflated with the idle time and should get ignored.

Signed-off-by: Wei Wang 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
---
 include/linux/tcp.h  |  1 +
 net/ipv4/tcp.c   |  1 +
 net/ipv4/tcp_input.c | 14 +++---
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 72705eaf4b84..3dbea6610304 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -350,6 +350,7 @@ struct tcp_sock {
 #endif
 
 /* Receiver side RTT estimation */
+   u32 rcv_rtt_last_tsecr;
struct {
u32 rtt_us;
u32 seq;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 141acd92e58a..47c45d5be9f9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2563,6 +2563,7 @@ int tcp_disconnect(struct sock *sk, int flags)
sk->sk_shutdown = 0;
sock_reset_flag(sk, SOCK_DONE);
tp->srtt_us = 0;
+   tp->rcv_rtt_last_tsecr = 0;
tp->write_seq += tp->max_window + 2;
if (tp->write_seq == 0)
tp->write_seq = 1;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 355d3dffd021..76ca88f63b70 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -582,9 +582,12 @@ static inline void tcp_rcv_rtt_measure_ts(struct sock *sk,
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (tp->rx_opt.rcv_tsecr &&
-   (TCP_SKB_CB(skb)->end_seq -
-TCP_SKB_CB(skb)->seq >= inet_csk(sk)->icsk_ack.rcv_mss)) {
+   if (tp->rx_opt.rcv_tsecr == tp->rcv_rtt_last_tsecr)
+   return;
+   tp->rcv_rtt_last_tsecr = tp->rx_opt.rcv_tsecr;
+
+   if (TCP_SKB_CB(skb)->end_seq -
+   TCP_SKB_CB(skb)->seq >= inet_csk(sk)->icsk_ack.rcv_mss) {
u32 delta = tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;
u32 delta_us;
 
@@ -5475,6 +5478,11 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff 
*skb)
tcp_ack(sk, skb, 0);
__kfree_skb(skb);
tcp_data_snd_check(sk);
+   /* When receiving pure ack in fast path, update
+* last ts ecr directly instead of calling
+* tcp_rcv_rtt_measure_ts()
+*/
+   tp->rcv_rtt_last_tsecr = tp->rx_opt.rcv_tsecr;
return;
} else { /* Header too small */
TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
-- 
2.18.0.rc1.244.gcf134e6275-goog



Re: [PATCH net] ipvlan: call dev_change_flags when reset ipvlan mode

2018-06-19 Thread Hangbin Liu
On Tue, Jun 19, 2018 at 02:10:18PM -0700, Cong Wang wrote:
> On Mon, Jun 18, 2018 at 7:04 AM, Hangbin Liu  wrote:
> > @@ -94,10 +95,13 @@ static int ipvlan_set_port_mode(struct ipvl_port *port, 
> > u16 nval)
> > mdev->l3mdev_ops = NULL;
> > }
> > list_for_each_entry(ipvlan, >ipvlans, pnode) {
> > +   flags = ipvlan->dev->flags;
> > if (nval == IPVLAN_MODE_L3 || nval == 
> > IPVLAN_MODE_L3S)
> > -   ipvlan->dev->flags |= IFF_NOARP;
> > +   dev_change_flags(ipvlan->dev,
> > +flags | IFF_NOARP);
> > else
> > -   ipvlan->dev->flags &= ~IFF_NOARP;
> > +   dev_change_flags(ipvlan->dev,
> > +flags & ~IFF_NOARP);
> 
> You need to check the return value of dev_change_flags().

Hi Wang Cong,

The only case dev_change_flags() return an err is when we change IFF_UP flag.
Since we only set/reset IFF_NOARP, do you think we still need to check the
return value?

Thanks
Hangbin


Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0

2018-06-19 Thread Jakub Kicinski
On Tue, 19 Jun 2018 16:37:15 -0500, Bjorn Helgaas wrote:
> On Fri, May 25, 2018 at 09:02:23AM -0500, Bjorn Helgaas wrote:
> > On Thu, May 24, 2018 at 06:20:15PM -0700, Jakub Kicinski wrote:  
> > > Hi Bjorn!
> > > 
> > > On Thu, 24 May 2018 18:57:48 -0500, Bjorn Helgaas wrote:  
> > > > On Mon, Apr 02, 2018 at 03:46:52PM -0700, Jakub Kicinski wrote:  
> > > > > Some user space depends on enabling sriov_totalvfs number of VFs
> > > > > to not fail, e.g.:
> > > > > 
> > > > > $ cat .../sriov_totalvfs > .../sriov_numvfs
> > > > > 
> > > > > For devices which VF support depends on loaded FW we have the
> > > > > pci_sriov_{g,s}et_totalvfs() API.  However, this API uses 0 as
> > > > > a special "unset" value, meaning drivers can't limit sriov_totalvfs
> > > > > to 0.  Remove the special values completely and simply initialize
> > > > > driver_max_VFs to total_VFs.  Then always use driver_max_VFs.
> > > > > Add a helper for drivers to reset the VF limit back to total.
> > > > 
> > > > I still can't really make sense out of the changelog.
> > > >
> > > > I think part of the reason it's confusing is because there are two
> > > > things going on:
> > > > 
> > > >   1) You want this:
> > > >   
> > > >pci_sriov_set_totalvfs(dev, 0);
> > > >x = pci_sriov_get_totalvfs(dev) 
> > > > 
> > > >  to return 0 instead of total_VFs.  That seems to connect with
> > > >  your subject line.  It means "sriov_totalvfs" in sysfs could be
> > > >  0, but I don't know how that is useful (I'm sure it is; just
> > > >  educate me :))  
> > > 
> > > Let me just quote the bug report that got filed on our internal bug
> > > tracker :)
> > > 
> > >   When testing Juju Openstack with Ubuntu 18.04, enabling SR-IOV causes
> > >   errors because Juju gets the sriov_totalvfs for SR-IOV-capable device
> > >   then tries to set that as the sriov_numvfs parameter.
> > > 
> > >   For SR-IOV incapable FW, the sriov_totalvfs parameter should be 0, 
> > >   but it's set to max.  When FW is switched to flower*, the correct 
> > >   sriov_totalvfs value is presented.
> > > 
> > > * flower is a project name  
> > 
> > From the point of view of the PCI core (which knows nothing about
> > device firmware and relies on the architected config space described
> > by the PCIe spec), this sounds like an erratum: with some firmware
> > installed, the device is not capable of SR-IOV, but still advertises
> > an SR-IOV capability with "TotalVFs > 0".
> > 
> > Regardless of whether that's an erratum, we do allow PF drivers to use
> > pci_sriov_set_totalvfs() to limit the number of VFs that may be
> > enabled by writing to the PF's "sriov_numvfs" sysfs file.
> > 
> > But the current implementation does not allow a PF driver to limit VFs
> > to 0, and that does seem nonsensical.
> >   
> > > My understanding is OpenStack uses sriov_totalvfs to determine how many
> > > VFs can be enabled, looks like this is the code:
> > > 
> > > http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n464
> > >   
> > > >   2) You're adding the pci_sriov_reset_totalvfs() interface.  I'm not
> > > >  sure what you intend for this.  Is *every* driver supposed to
> > > >  call it in .remove()?  Could/should this be done in the core
> > > >  somehow instead of depending on every driver?  
> > > 
> > > Good question, I was just thinking yesterday we may want to call it
> > > from the core, but I don't think it's strictly necessary nor always
> > > sufficient (we may reload FW without re-probing).
> > > 
> > > We have a device which supports different number of VFs based on the FW
> > > loaded.  Some legacy FWs does not inform the driver how many VFs it can
> > > support, because it supports max.  So the flow in our driver is this:
> > > 
> > > load_fw(dev);
> > > ...
> > > max_vfs = ask_fw_for_max_vfs(dev);
> > > if (max_vfs >= 0)
> > >   return pci_sriov_set_totalvfs(dev, max_vfs);
> > > else /* FW didn't tell us, assume max */
> > >   return pci_sriov_reset_totalvfs(dev); 
> > > 
> > > We also reset the max on device remove, but that's not strictly
> > > necessary.
> > > 
> > > Other users of pci_sriov_set_totalvfs() always know the value to set
> > > the total to (either always get it from FW or it's a constant).
> > > 
> > > If you prefer we can work out the correct max for those legacy cases in
> > > the driver as well, although it seemed cleaner to just ask the core,
> > > since it already has total_VFs value handy :)
> > >   
> > > > I'm also having a hard time connecting your user-space command example
> > > > with the rest of this.  Maybe it will make more sense to me tomorrow
> > > > after some coffee.  
> > > 
> > > OpenStack assumes it will always be able to set sriov_numvfs to
> > > sriov_totalvfs, see this 'if':
> > > 
> > > http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n512
> > >   
> > 
> > Thanks for educating me.  I think there 

[PATCH net] net: sungem: fix rx checksum support

2018-06-19 Thread Eric Dumazet
After commit 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE
are friends"), sungem owners reported the infamous "eth0: hw csum failure"
message.

CHECKSUM_COMPLETE has in fact never worked for this driver, but this
was masked by the fact that upper stacks had to strip the FCS, and
therefore skb->ip_summed was set back to CHECKSUM_NONE before
my recent change.

Driver configures a number of bytes to skip when the chip computes
the checksum, and for some reason only half of the Ethernet header
was skipped.

Then a second problem is that we should strip the FCS by default,
unless the driver is updated to eventually support NETIF_F_RXFCS in
the future.

Finally, a driver should check if NETIF_F_RXCSUM feature is enabled
or not, so that the admin can turn off rx checksum if wanted.

Many thanks to Andreas Schwab and Mathieu Malaterre for their
help in debugging this issue.

Signed-off-by: Eric Dumazet 
Reported-by: Meelis Roos 
Reported-by: Mathieu Malaterre 
Reported-by: Andreas Schwab 
Tested-by: Andreas Schwab 
---
 drivers/net/ethernet/sun/sungem.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/sun/sungem.c 
b/drivers/net/ethernet/sun/sungem.c
index 
7a16d40a72d13cf1d522e8a3a396c826fe76f9b9..b9221fc1674dfa0ef17a43f8ff86d700a1ae514f
 100644
--- a/drivers/net/ethernet/sun/sungem.c
+++ b/drivers/net/ethernet/sun/sungem.c
@@ -60,8 +60,7 @@
 #include 
 #include "sungem.h"
 
-/* Stripping FCS is causing problems, disabled for now */
-#undef STRIP_FCS
+#define STRIP_FCS
 
 #define DEFAULT_MSG(NETIF_MSG_DRV  | \
 NETIF_MSG_PROBE| \
@@ -435,7 +434,7 @@ static int gem_rxmac_reset(struct gem *gp)
writel(desc_dma & 0x, gp->regs + RXDMA_DBLOW);
writel(RX_RING_SIZE - 4, gp->regs + RXDMA_KICK);
val = (RXDMA_CFG_BASE | (RX_OFFSET << 10) |
-  ((14 / 2) << 13) | RXDMA_CFG_FTHRESH_128);
+  (ETH_HLEN << 13) | RXDMA_CFG_FTHRESH_128);
writel(val, gp->regs + RXDMA_CFG);
if (readl(gp->regs + GREG_BIFCFG) & GREG_BIFCFG_M66EN)
writel(((5 & RXDMA_BLANK_IPKTS) |
@@ -760,7 +759,6 @@ static int gem_rx(struct gem *gp, int work_to_do)
struct net_device *dev = gp->dev;
int entry, drops, work_done = 0;
u32 done;
-   __sum16 csum;
 
if (netif_msg_rx_status(gp))
printk(KERN_DEBUG "%s: rx interrupt, done: %d, rx_new: %d\n",
@@ -855,9 +853,13 @@ static int gem_rx(struct gem *gp, int work_to_do)
skb = copy_skb;
}
 
-   csum = (__force __sum16)htons((status & RXDCTRL_TCPCSUM) ^ 
0x);
-   skb->csum = csum_unfold(csum);
-   skb->ip_summed = CHECKSUM_COMPLETE;
+   if (likely(dev->features & NETIF_F_RXCSUM)) {
+   __sum16 csum;
+
+   csum = (__force __sum16)htons((status & 
RXDCTRL_TCPCSUM) ^ 0x);
+   skb->csum = csum_unfold(csum);
+   skb->ip_summed = CHECKSUM_COMPLETE;
+   }
skb->protocol = eth_type_trans(skb, gp->dev);
 
napi_gro_receive(>napi, skb);
@@ -1761,7 +1763,7 @@ static void gem_init_dma(struct gem *gp)
writel(0, gp->regs + TXDMA_KICK);
 
val = (RXDMA_CFG_BASE | (RX_OFFSET << 10) |
-  ((14 / 2) << 13) | RXDMA_CFG_FTHRESH_128);
+  (ETH_HLEN << 13) | RXDMA_CFG_FTHRESH_128);
writel(val, gp->regs + RXDMA_CFG);
 
writel(desc_dma >> 32, gp->regs + RXDMA_DBHI);
@@ -2985,8 +2987,8 @@ static int gem_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
pci_set_drvdata(pdev, dev);
 
/* We can do scatter/gather and HW checksum */
-   dev->hw_features = NETIF_F_SG | NETIF_F_HW_CSUM;
-   dev->features |= dev->hw_features | NETIF_F_RXCSUM;
+   dev->hw_features = NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_RXCSUM;
+   dev->features = dev->hw_features;
if (pci_using_dac)
dev->features |= NETIF_F_HIGHDMA;
 
-- 
2.18.0.rc1.244.gcf134e6275-goog



Re: [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition

2018-06-19 Thread Eric Dumazet



On 06/19/2018 11:05 AM, Saeed Mahameed wrote:

> this is only true for XDP setup, for non XDP max stride_size can only
> be around ~3k and only for mtu > ~6k
> 
> For XDP setup you suggested:
> -   priv->frag_info[0].frag_size = eff_mtu;
> +   priv->frag_info[0].frag_size = PAGE_SIZE;
> 
> currently the condition is:
> 
> release = frags->page_offset + frag_info->frag_size > PAGE_SIZE;
> 
> so my solution and yours have the same problem you described above.
> 
> the problem is not with the initial values or with stride/farg size
> math, it just that in XDP we shouldn't reuse at ALL. I agree with you
> that we need to optimize and maybe for PAGE_SIZE > 8k we need to allow
> XDP setup to reuses. but for now there is a data corruption to handle.


Sure, we all agree there is a bug to fix.

The way you are fixing it is kind of illogical.

The NIC can use a frag if its _size_ is big enough to receive the frame.

The _stride_  is an abstraction created by the driver to report an estimation 
of the _truesize_,
or memory consumption, so that linux can better track overall memory usage.

For example, if MTU=1500, the size of the fragment is 1536 bytes, but since we 
can put only
2 fragments per 4KB page (on x86), we declare the _stride_ to be 2048 bytes.

Declaring that a final blob of a page, being 1600 bytes, not able to receive a 
frame because
_stride_ is 2048 is illogical and waste resources.




Re: [PATCH] bpfilter: ignore binary files

2018-06-19 Thread David Miller
From: Matteo Croce 
Date: Tue, 19 Jun 2018 17:21:36 +0200

> net/bpfilter/bpfilter_umh is a binary file generated when bpfilter is
> enabled, add it to .gitignore to avoid committing it.
> 
> Fixes: d2ba09c17a064 ("net: add skeleton of bpfilter kernel module")
> Signed-off-by: Matteo Croce 

Applied.


Re: [PATCH] bpfilter: fix build error

2018-06-19 Thread David Miller
From: Matteo Croce 
Date: Tue, 19 Jun 2018 17:16:20 +0200

> bpfilter Makefile assumes that the system locale is en_US, and the
> parsing of objdump output fails.
> Set LC_ALL=C and, while at it, rewrite the objdump parsing so it spawns
> only 2 processes instead of 7.
> 
> Fixes: d2ba09c17a064 ("net: add skeleton of bpfilter kernel module")
> Signed-off-by: Matteo Croce 

Applied.


Re: [PATCH net] net/sched: act_ife: fix recursive lock and idr leak

2018-06-19 Thread David Miller
From: Davide Caratti 
Date: Tue, 19 Jun 2018 15:39:46 +0200

> a recursive lock warning [1] can be observed with the following script,
 ...
> in case the kernel was unable to run the last command (e.g. because of
> the impossibility to load 'act_meta_skbtcindex'). For a similar reason,
> the kernel can leak idr in the error path of tcf_ife_init(), because
> tcf_idr_release() is not called after successful idr reservation:
 ...
> Since tcfa_lock is already taken when the action is being edited, a call
> to tcf_idr_release() wrongly makes tcf_idr_cleanup() take the same lock
> again. On the other hand, tcf_idr_release() needs to be called in the
> error path of tcf_ife_init(), to undo the last tcf_idr_create() invocation.
> Fix both problems in tcf_ife_init().
> Since the cleanup() routine can now be called when ife->params is NULL,
> also add a NULL pointer check to avoid calling kfree_rcu(NULL, rcu).
> 
>  [1]
 ...
> Fixes: 4e8c86155010 ("net sched: net sched: ife action fix late binding")
> Fixes: ef6980b6becb ("introduce IFE action")
> Signed-off-by: Davide Caratti 

Applied and queued up for -stable.


Re: [PATCH] net/usb/drivers: Remove useless hrtimer_active check

2018-06-19 Thread David Miller
From: Daniel Lezcano 
Date: Tue, 19 Jun 2018 16:14:30 +0200

> The code does:
> 
>  if (hrtimer_active())
> hrtimer_cancel();
> 
> However, hrtimer_cancel() checks if the timer is active, so the
> test above is pointless.
> 
> Signed-off-by: Daniel Lezcano 

Applied.


Re: [PATCH net] net/sched: act_ife: preserve the action control in case of error

2018-06-19 Thread David Miller
From: Davide Caratti 
Date: Tue, 19 Jun 2018 15:45:50 +0200

> in the following script
> 
>  # tc actions add action ife encode allow prio pass index 42
>  # tc actions replace action ife encode allow tcindex drop index 42
> 
> the action control should remain equal to 'pass', if the kernel failed
> to replace the TC action. Pospone the assignment of the action control,
> to ensure it is not overwritten in the error path of tcf_ife_init().
> 
> Fixes: ef6980b6becb ("introduce IFE action")
> Signed-off-by: Davide Caratti 

Applied and queued up for -stable.


Re: [PATCH] [net-next, v2] tcp: use monotonic timestamps for PAWS

2018-06-19 Thread Eric Dumazet



On 06/19/2018 05:36 AM, Arnd Bergmann wrote:
> Using get_seconds() for timestamps is deprecated since it can lead
> to overflows on 32-bit systems. While the interface generally doesn't
> overflow until year 2106, the specific implementation of the TCP PAWS
> algorithm breaks in 2038 when the intermediate signed 32-bit timestamps
> overflow.
> 
> 
> Signed-off-by: Arnd Bergmann 
> ---
> v2: use time_before32()/time_after32() everywhere as suggested
> Eric Dumazet


Thanks a lot Arnd.

Signed-off-by: Eric Dumazet 



Re: [PATCH] net: propagate dev_get_valid_name return code

2018-06-19 Thread David Miller
From: Li RongQing 
Date: Tue, 19 Jun 2018 17:23:17 +0800

> if dev_get_valid_name failed, propagate its return code
> 
> and remove the setting err to ENODEV, it will be set to
> 0 again before dev_change_net_namespace exits.
> 
> Signed-off-by: Li RongQing 

Applied, thank you.


Re: Incomplete fix for be9c798 / 7b2ee50 hv_netvsc: common detach logic?

2018-06-19 Thread Stephen Hemminger
On Wed, 20 Jun 2018 08:01:50 +0900
Greg KH  wrote:

> On Tue, Jun 19, 2018 at 03:19:41PM -0400, Thomas Walker wrote:
> > Upon updating some internal kernels from 4.14.18 to 4.14.49, our hyper-v 
> > guests failed to bring up network interfaces on boot, logging "A link 
> > change request failed with some changes committed already. Interface eth0 
> > may have been left with an inconsistent configuration, please check."  
> > Running 'ifconfig eth0 up' appears to fix the problem temporarily so I went 
> > about bisecting which landed on:
> > 
> > commit be9c798d0d13ae609a91177323ac816545c39d28
> > Author: Stephen Hemminger 
> > Date:   Mon May 14 15:32:18 2018 -0700
> > 
> > hv_netvsc: common detach logic
> > 
> > [ Commit 7b2ee50c0cd513a176a26a71f2989facdd75bfea upstream. ]
> > 
> > Make common function for detaching internals of device
> > during changes to MTU and RSS. Make sure no more packets
> > are transmitted and all packets have been received before
> > doing device teardown.
> > 
> > Change the wait logic to be common and use usleep_range().
> > 
> > Changes transmit enabling logic so that transmit queues are disabled
> > during the period when lower device is being changed. And enabled
> > only after sub channels are setup. This avoids issue where it could
> > be that a packet was being sent while subchannel was not initialized.
> > 
> > Fixes: 8195b1396ec8 ("hv_netvsc: fix deadlock on hotplug")
> > Signed-off-by: Stephen Hemminger 
> > Signed-off-by: David S. Miller 
> > Signed-off-by: Greg Kroah-Hartman 
> > 
> > 
> > The removal of which does indeed fix the problem.  That led me to:
> > 
> > commit 52acf73b6e9a6962045feb2ba5a8921da2201915
> > Author: Dexuan Cui 
> > Date:   Wed Jun 6 21:32:51 2018 +
> > 
> > hv_netvsc: Fix a network regression after ifdown/ifup
> > 
> > Recently people reported the NIC stops working after
> > "ifdown eth0; ifup eth0". It turns out in this case the TX queues are 
> > not
> > enabled, after the refactoring of the common detach logic: when the NIC
> > has sub-channels, usually we enable all the TX queues after all
> > sub-channels are set up: see rndis_set_subchannel() ->
> > netif_device_attach(), but in the case of "ifdown eth0; ifup eth0" where
> > the number of channels doesn't change, we also must make sure the TX 
> > queues
> > are enabled. The patch fixes the regression.
> > 
> > Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> > Signed-off-by: Dexuan Cui 
> > Cc: Stephen Hemminger 
> > Cc: K. Y. Srinivasan 
> > Cc: Haiyang Zhang 
> > Signed-off-by: David S. Miller 
> > 
> > Which sounded very promising, but does not seem to fully fix the issue.  
> > Doing some more digging, I was able to determine that the message coincides 
> > with 'ip link set dev eth0 mtu 1300 up' very shortly (>~1 second) after the 
> > hv_netvsc driver loads.  If I delay the mtu change until well after the 
> > driver loads, everything works fine.  If I unload hv_netvsc and then reload 
> > it and apply the mtu change immediately, the failure re-occurs.  So 
> > something is racy here, and the above doesn't entirely address it.
> > 
> > I'm happy to test out any suggested patches and/or do additional debugging 
> > if anyone has any suggestions.
> > 
> > (oh, and I did also try 4.18-rc1 and the problem still persists)  
> 
> Always cc: the authors of the patch you are having problems with, along
> with the mailing list for the networking subsystem, so that the right
> people know to look at this.

How are you changing the MTU? and starting the device.
When MTU changes the device has to reconnect with the host. This takes a small 
amount of time
and no changes to device state are possible then.

If MTU change happens and at the same time some other script tries to bring up 
the connection,
then that script will get an error.


Re: bpfilter compile failure on parisc

2018-06-19 Thread John David Anglin

On 2018-06-19 1:38 PM, Meelis Roos wrote:

Tried enabling bpfilter on parisc, got this:

   HOSTCC  net/bpfilter/main.o
net/bpfilter/main.c:3:21: fatal error: sys/uio.h: No such file or directory
  #include 
Probably has something to do with the include directories searched by 
HOSTCC.  The location of

the file is "/usr/include/hppa-linux-gnu/sys/uio.h".

Dave

--
John David Anglin  dave.ang...@bell.net



davinci_emac failures in 4.18-rc1 on AM3517-EVM

2018-06-19 Thread Adam Ford
I am not sure if anyone else has seen this.  If not, I'll bisect, but
the AM3517 ethernet seems to have died, and it's throwing some errors

[2.708933] davinci_mdio davinci_mdio.0: failed to get device clock
[2.715363] davinci_mdio: probe of davinci_mdio.0 failed with error -2

[snip]

[3.054552] davinci_emac davinci_emac.0: failed to get EMAC clock
[3.061195] davinci_emac: probe of davinci_emac.0 failed with error -16


If no one has seen this, I'll look into it, but I didn't want to waste
time if someone is already aware of it.

adam


Re: [PATCH net] enic: do not overwrite error code

2018-06-19 Thread David Miller
From: Govindarajulu Varadarajan 
Date: Mon, 18 Jun 2018 10:01:05 -0700

> In failure path, we overwrite err to what vnic_rq_disable() returns. In
> case it returns 0, enic_open() returns success in case of error.
> 
> Reported-by: Ben Hutchings 
> Fixes: e8588e268509 ("enic: enable rq before updating rq descriptors")
> Signed-off-by: Govindarajulu Varadarajan 

Applied and queued up for -stable.


Re: Incomplete fix for be9c798 / 7b2ee50 hv_netvsc: common detach logic?

2018-06-19 Thread Stephen Hemminger
On Wed, 20 Jun 2018 08:01:50 +0900
Greg KH  wrote:

> On Tue, Jun 19, 2018 at 03:19:41PM -0400, Thomas Walker wrote:
> > Upon updating some internal kernels from 4.14.18 to 4.14.49, our hyper-v 
> > guests failed to bring up network interfaces on boot, logging "A link 
> > change request failed with some changes committed already. Interface eth0 
> > may have been left with an inconsistent configuration, please check."  
> > Running 'ifconfig eth0 up' appears to fix the problem temporarily so I went 
> > about bisecting which landed on:
> > 
> > commit be9c798d0d13ae609a91177323ac816545c39d28
> > Author: Stephen Hemminger 
> > Date:   Mon May 14 15:32:18 2018 -0700
> > 
> > hv_netvsc: common detach logic
> > 
> > [ Commit 7b2ee50c0cd513a176a26a71f2989facdd75bfea upstream. ]
> > 
> > Make common function for detaching internals of device
> > during changes to MTU and RSS. Make sure no more packets
> > are transmitted and all packets have been received before
> > doing device teardown.
> > 
> > Change the wait logic to be common and use usleep_range().
> > 
> > Changes transmit enabling logic so that transmit queues are disabled
> > during the period when lower device is being changed. And enabled
> > only after sub channels are setup. This avoids issue where it could
> > be that a packet was being sent while subchannel was not initialized.
> > 
> > Fixes: 8195b1396ec8 ("hv_netvsc: fix deadlock on hotplug")
> > Signed-off-by: Stephen Hemminger 
> > Signed-off-by: David S. Miller 
> > Signed-off-by: Greg Kroah-Hartman 
> > 
> > 
> > The removal of which does indeed fix the problem.  That led me to:
> > 
> > commit 52acf73b6e9a6962045feb2ba5a8921da2201915
> > Author: Dexuan Cui 
> > Date:   Wed Jun 6 21:32:51 2018 +
> > 
> > hv_netvsc: Fix a network regression after ifdown/ifup
> > 
> > Recently people reported the NIC stops working after
> > "ifdown eth0; ifup eth0". It turns out in this case the TX queues are 
> > not
> > enabled, after the refactoring of the common detach logic: when the NIC
> > has sub-channels, usually we enable all the TX queues after all
> > sub-channels are set up: see rndis_set_subchannel() ->
> > netif_device_attach(), but in the case of "ifdown eth0; ifup eth0" where
> > the number of channels doesn't change, we also must make sure the TX 
> > queues
> > are enabled. The patch fixes the regression.
> > 
> > Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> > Signed-off-by: Dexuan Cui 
> > Cc: Stephen Hemminger 
> > Cc: K. Y. Srinivasan 
> > Cc: Haiyang Zhang 
> > Signed-off-by: David S. Miller 
> > 
> > Which sounded very promising, but does not seem to fully fix the issue.  
> > Doing some more digging, I was able to determine that the message coincides 
> > with 'ip link set dev eth0 mtu 1300 up' very shortly (>~1 second) after the 
> > hv_netvsc driver loads.  If I delay the mtu change until well after the 
> > driver loads, everything works fine.  If I unload hv_netvsc and then reload 
> > it and apply the mtu change immediately, the failure re-occurs.  So 
> > something is racy here, and the above doesn't entirely address it.
> > 
> > I'm happy to test out any suggested patches and/or do additional debugging 
> > if anyone has any suggestions.
> > 
> > (oh, and I did also try 4.18-rc1 and the problem still persists)  
> 
> Always cc: the authors of the patch you are having problems with, along
> with the mailing list for the networking subsystem, so that the right
> people know to look at this.

Let me take a look at it, and log it into the bug system.


Re: [PATCH net] net/tcp: Fix socket lookups with SO_BINDTODEVICE

2018-06-19 Thread David Miller
From: dsah...@kernel.org
Date: Mon, 18 Jun 2018 12:30:37 -0700

> From: David Ahern 
> 
> Similar to 69678bcd4d2d ("udp: fix SO_BINDTODEVICE"), TCP socket lookups
> need to fail if dev_match is not true. Currently, a packet to a given port
> can match a socket bound to device when it should not. In the VRF case,
> this causes the lookup to hit a VRF socket and not a global socket
> resulting in a response trying to go through the VRF when it should it.
  ^^^

"not", I fixed this up for you.

> Fixes: 3fa6f616a7a4d ("net: ipv4: add second dif to inet socket lookups")
> Fixes: 4297a0ef08572 ("net: ipv6: add second dif to inet6 socket lookups")
> Reported-by: Lou Berger 
> Diagnosed-by: Renato Westphal 
> Tested-by: Renato Westphal 
> Signed-off-by: David Ahern 

Applied and queued up for -stable, thanks.


Re: Incomplete fix for be9c798 / 7b2ee50 hv_netvsc: common detach logic?

2018-06-19 Thread Greg KH
On Tue, Jun 19, 2018 at 03:19:41PM -0400, Thomas Walker wrote:
> Upon updating some internal kernels from 4.14.18 to 4.14.49, our hyper-v 
> guests failed to bring up network interfaces on boot, logging "A link change 
> request failed with some changes committed already. Interface eth0 may have 
> been left with an inconsistent configuration, please check."  Running 
> 'ifconfig eth0 up' appears to fix the problem temporarily so I went about 
> bisecting which landed on:
> 
> commit be9c798d0d13ae609a91177323ac816545c39d28
> Author: Stephen Hemminger 
> Date:   Mon May 14 15:32:18 2018 -0700
> 
> hv_netvsc: common detach logic
> 
> [ Commit 7b2ee50c0cd513a176a26a71f2989facdd75bfea upstream. ]
> 
> Make common function for detaching internals of device
> during changes to MTU and RSS. Make sure no more packets
> are transmitted and all packets have been received before
> doing device teardown.
> 
> Change the wait logic to be common and use usleep_range().
> 
> Changes transmit enabling logic so that transmit queues are disabled
> during the period when lower device is being changed. And enabled
> only after sub channels are setup. This avoids issue where it could
> be that a packet was being sent while subchannel was not initialized.
> 
> Fixes: 8195b1396ec8 ("hv_netvsc: fix deadlock on hotplug")
> Signed-off-by: Stephen Hemminger 
> Signed-off-by: David S. Miller 
> Signed-off-by: Greg Kroah-Hartman 
> 
> 
> The removal of which does indeed fix the problem.  That led me to:
> 
> commit 52acf73b6e9a6962045feb2ba5a8921da2201915
> Author: Dexuan Cui 
> Date:   Wed Jun 6 21:32:51 2018 +
> 
> hv_netvsc: Fix a network regression after ifdown/ifup
> 
> Recently people reported the NIC stops working after
> "ifdown eth0; ifup eth0". It turns out in this case the TX queues are not
> enabled, after the refactoring of the common detach logic: when the NIC
> has sub-channels, usually we enable all the TX queues after all
> sub-channels are set up: see rndis_set_subchannel() ->
> netif_device_attach(), but in the case of "ifdown eth0; ifup eth0" where
> the number of channels doesn't change, we also must make sure the TX 
> queues
> are enabled. The patch fixes the regression.
> 
> Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> Signed-off-by: Dexuan Cui 
> Cc: Stephen Hemminger 
> Cc: K. Y. Srinivasan 
> Cc: Haiyang Zhang 
> Signed-off-by: David S. Miller 
> 
> Which sounded very promising, but does not seem to fully fix the issue.  
> Doing some more digging, I was able to determine that the message coincides 
> with 'ip link set dev eth0 mtu 1300 up' very shortly (>~1 second) after the 
> hv_netvsc driver loads.  If I delay the mtu change until well after the 
> driver loads, everything works fine.  If I unload hv_netvsc and then reload 
> it and apply the mtu change immediately, the failure re-occurs.  So something 
> is racy here, and the above doesn't entirely address it.
> 
> I'm happy to test out any suggested patches and/or do additional debugging if 
> anyone has any suggestions.
> 
> (oh, and I did also try 4.18-rc1 and the problem still persists)

Always cc: the authors of the patch you are having problems with, along
with the mailing list for the networking subsystem, so that the right
people know to look at this.

Now fixed...

thanks,

greg k-h


Re: [PATCH net] net/ipv6: respect rcu grace period before freeing fib6_info

2018-06-19 Thread David Miller
From: Eric Dumazet 
Date: Mon, 18 Jun 2018 05:24:31 -0700

> syzbot reported use after free that is caused by fib6_info being
> freed without a proper RCU grace period.
 ...
> Fixes: a64efe142f5e ("net/ipv6: introduce fib6_info struct and helpers")
> Signed-off-by: Eric Dumazet 
> Cc: David Ahern 
> Reported-by: syzbot+9e6d75e3edef427ee...@syzkaller.appspotmail.com

Applied.


Re: [PATCH 1/1] tcp: tcp_mtup_probe -> tcp_mtu_probe

2018-06-19 Thread David Miller
From: Wang Jian 
Date: Mon, 18 Jun 2018 18:58:50 +0800

> Comment correction.
> 
> Signed-off-by: Jian Wang 

This patch was corrupted by your email client.

Please fix this, email the patch to yourself, and only resubmit this patch
to the mailing list when you can successfull apply the patch yourself that
you send in the test email.

Thank you.


Re: [PATCH net] ipvlan: use ETH_MAX_MTU as max mtu

2018-06-19 Thread David Miller
From: Xin Long 
Date: Mon, 18 Jun 2018 16:15:57 +0800

> Similar to the fixes on team and bonding, this restores the ability
> to set an ipvlan device's mtu to anything higher than 1500.
> 
> Fixes: 91572088e3fd ("net: use core MTU range checking in core net infra")
> Signed-off-by: Xin Long 

Applied, thanks.


Re: [PATCH net] enic: initialize enic->rfs_h.lock in enic_probe

2018-06-19 Thread David Miller
From: Govindarajulu Varadarajan 
Date: Tue, 19 Jun 2018 08:15:24 -0700

> lockdep spotted that we are using rfs_h.lock in enic_get_rxnfc() without
> initializing. rfs_h.lock is initialized in enic_open(). But ethtool_ops
> can be called when interface is down.
> 
> Move enic_rfs_flw_tbl_init to enic_probe.
 ...
> Signed-off-by: Govindarajulu Varadarajan 

Applied, thanks.


Re: [net] bpf, xdp, i40e: fix i40e_build_skb skb reserve and truesize

2018-06-19 Thread David Miller
From: Jeff Kirsher 
Date: Tue, 19 Jun 2018 14:33:54 -0700

> From: Daniel Borkmann 
> 
> Using skb_reserve(skb, I40E_SKB_PAD + (xdp->data - xdp->data_hard_start))
> is clearly wrong since I40E_SKB_PAD already points to the offset where
> the original xdp->data was sitting since xdp->data_hard_start is defined
> as xdp->data - i40e_rx_offset(rx_ring) where latter offsets to I40E_SKB_PAD
> when build skb is used.
> 
> However, also before cc5b114dcf98 ("bpf, i40e: add meta data support")
> this seems broken since bpf_xdp_adjust_head() helper could have been used
> to alter headroom and enlarge / shrink the frame and with that the assumption
> that the xdp->data remains unchanged does not hold and would push a bogus
> packet to upper stack.
> 
> ixgbe got this right in 924708081629 ("ixgbe: add XDP support for pass and
> drop actions"). In any case, fix it by removing the I40E_SKB_PAD from both
> skb_reserve() and truesize calculation.
> 
> Fixes: cc5b114dcf98 ("bpf, i40e: add meta data support")
> Fixes: 0c8493d90b6b ("i40e: add XDP support for pass and drop actions")
> Reported-by: Keith Busch 
> Reported-by: Toshiaki Makita 
> Signed-off-by: Daniel Borkmann 
> Cc: Björn Töpel 
> Cc: John Fastabend 
> Tested-by: Keith Busch 
> Acked-by: John Fastabend 
> Acked-by: Alexander Duyck 
> Signed-off-by: Jeff Kirsher 

Applied.


Re: [PATCH v2 0/4] Slience NCSI logging

2018-06-19 Thread David Miller
From: Joel Stanley 
Date: Tue, 19 Jun 2018 15:08:30 +0930

> v2:
>   Fix indent issue and commit message based on Joe's feedback
>   Add Sam's acks
> 
> Here are three changes to silence unnecessary warnings in the ncsi code.
> 
> The final patch adds Sam as the maintainer for NCSI.

Series applied.


Re: [PATCH net 0/3] qed*: Fix series.

2018-06-19 Thread David Miller
From: Sudarsana Reddy Kalluru 
Date: Mon, 18 Jun 2018 21:57:59 -0700

> From: Sudarsana Reddy Kalluru 
> 
> The patch series fixes few issues in the qed/qede drivers.
> Please consider applying this series to "net".

Series applied.


[PATCH net] enic: initialize enic->rfs_h.lock in enic_probe

2018-06-19 Thread Govindarajulu Varadarajan
lockdep spotted that we are using rfs_h.lock in enic_get_rxnfc() without
initializing. rfs_h.lock is initialized in enic_open(). But ethtool_ops
can be called when interface is down.

Move enic_rfs_flw_tbl_init to enic_probe.

INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 18 PID: 1189 Comm: ethtool Not tainted 4.17.0-rc7-devel+ #27
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.11.0-20171110_100015-anatol 04/01/2014
Call Trace:
dump_stack+0x85/0xc0
register_lock_class+0x550/0x560
? __handle_mm_fault+0xa8b/0x1100
__lock_acquire+0x81/0x670
lock_acquire+0xb9/0x1e0
?  enic_get_rxnfc+0x139/0x2b0 [enic]
_raw_spin_lock_bh+0x38/0x80
? enic_get_rxnfc+0x139/0x2b0 [enic]
enic_get_rxnfc+0x139/0x2b0 [enic]
ethtool_get_rxnfc+0x8d/0x1c0
dev_ethtool+0x16c8/0x2400
? __mutex_lock+0x64d/0xa00
? dev_load+0x6a/0x150
dev_ioctl+0x253/0x4b0
sock_do_ioctl+0x9a/0x130
sock_ioctl+0x1af/0x350
do_vfs_ioctl+0x8e/0x670
? syscall_trace_enter+0x1e2/0x380
ksys_ioctl+0x60/0x90
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5a/0x170
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Govindarajulu Varadarajan 
---
 drivers/net/ethernet/cisco/enic/enic_clsf.c | 3 +--
 drivers/net/ethernet/cisco/enic/enic_main.c | 3 ++-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/cisco/enic/enic_clsf.c 
b/drivers/net/ethernet/cisco/enic/enic_clsf.c
index 973c1fb70d09..99038dfc7fbe 100644
--- a/drivers/net/ethernet/cisco/enic/enic_clsf.c
+++ b/drivers/net/ethernet/cisco/enic/enic_clsf.c
@@ -79,7 +79,6 @@ void enic_rfs_flw_tbl_init(struct enic *enic)
enic->rfs_h.max = enic->config.num_arfs;
enic->rfs_h.free = enic->rfs_h.max;
enic->rfs_h.toclean = 0;
-   enic_rfs_timer_start(enic);
 }
 
 void enic_rfs_flw_tbl_free(struct enic *enic)
@@ -88,7 +87,6 @@ void enic_rfs_flw_tbl_free(struct enic *enic)
 
enic_rfs_timer_stop(enic);
spin_lock_bh(>rfs_h.lock);
-   enic->rfs_h.free = 0;
for (i = 0; i < (1 << ENIC_RFS_FLW_BITSHIFT); i++) {
struct hlist_head *hhead;
struct hlist_node *tmp;
@@ -99,6 +97,7 @@ void enic_rfs_flw_tbl_free(struct enic *enic)
enic_delfltr(enic, n->fltr_id);
hlist_del(>node);
kfree(n);
+   enic->rfs_h.free++;
}
}
spin_unlock_bh(>rfs_h.lock);
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c 
b/drivers/net/ethernet/cisco/enic/enic_main.c
index 30d2eaa18c04..e6ad581eadd8 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -1971,7 +1971,7 @@ static int enic_open(struct net_device *netdev)
vnic_intr_unmask(>intr[i]);
 
enic_notify_timer_start(enic);
-   enic_rfs_flw_tbl_init(enic);
+   enic_rfs_timer_start(enic);
 
return 0;
 
@@ -2904,6 +2904,7 @@ static int enic_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
 
timer_setup(>notify_timer, enic_notify_timer, 0);
 
+   enic_rfs_flw_tbl_init(enic);
enic_set_rx_coal_setting(enic);
INIT_WORK(>reset, enic_reset);
INIT_WORK(>tx_hang_reset, enic_tx_hang_reset);
-- 
2.17.1



Re: [PATCH] tc, bpf: add option to dump bpf verifier as C program fragment

2018-06-19 Thread Daniel Borkmann
On 06/18/2018 11:44 PM, David Ahern wrote:
> On 6/18/18 2:18 PM, Jakub Kicinski wrote:
>> On Sun, 17 Jun 2018 08:48:41 +, Ophir Munk wrote:
>>> Similar to cbpf used within tcpdump utility with a "-d" option to dump
>>> the compiled packet-matching code in a human readable form - tc has the
>>> "verbose" option to dump ebpf verifier output.
>>> Another useful option of cbpf using tcpdump "-dd" option is to dump
>>> packet-matching code a C program fragment. Similar to this - this commit
>>> adds a new tc ebpf option named "code" to dump ebpf verifier as C program
>>> fragment.
>>>
>>> Existing "verbose" option sample output:
>>>
>>> Verifier analysis:
>>> 0: (61) r2 = *(u32 *)(r1 +52)
>>> 1: (18) r3 = 0xdeadbeef
>>> 3: (63) *(u32 *)(r10 -4) = r3
>>> .
>>> .
>>> 11: (63) *(u32 *)(r1 +52) = r2
>>> 12: (18) r0 = 0x
>>> 14: (95) exit
>>>
>>> New "code" option sample output:
>>>
>>> /* struct bpf_insn cls_q_code[] = { */
>>> {0x61,2,1,   52, 0x},
>>> {0x18,3,0,0, 0xdeadbeef},
>>> {0x00,0,0,0, 0x},
>>> .
>>> .
>>> {0x63,1,2,   52, 0x},
>>> {0x18,0,0,0, 0x},
>>> {0x00,0,0,0, 0x},
>>> {0x95,0,0,0, 0x},
>>>
>>> Signed-off-by: Ophir Munk 
>>
>> Hmm... printing C arrays looks like hacky integration with some C
>> code...  Would you not be better served by simply using libbpf in
>> whatever is consuming this output?
> 
> I was thinking the same. bpftool would provide options too -- print the
> above, print in macro encodings and verifier. I gave an example of this
> side by side dump at netconf 2.1. Does not look like the slides made it
> online; see attached.

+1, I would also doubt that this adds a lot in terms of debuggability
when you're trying to load an object file with thousands of insns. Better
way would be to use llvm-objdump on the obj file to get to the annotated
disassembly, see also example in [0]. A .o to .c converter is wip for
libbpf/bpftool as presented in [1], which should provide the flexibility
to embedd an obj file.

Cheers,
Daniel

  [0] http://cilium.readthedocs.io/en/latest/bpf/#llvm
  [1] 
http://vger.kernel.org/netconf2018_files/AlexeiStarovoitov_netconf2018.pdf page 
22


Re: [PATCH] PCI: allow drivers to limit the number of VFs to 0

2018-06-19 Thread Bjorn Helgaas
On Fri, May 25, 2018 at 09:02:23AM -0500, Bjorn Helgaas wrote:
> On Thu, May 24, 2018 at 06:20:15PM -0700, Jakub Kicinski wrote:
> > Hi Bjorn!
> > 
> > On Thu, 24 May 2018 18:57:48 -0500, Bjorn Helgaas wrote:
> > > On Mon, Apr 02, 2018 at 03:46:52PM -0700, Jakub Kicinski wrote:
> > > > Some user space depends on enabling sriov_totalvfs number of VFs
> > > > to not fail, e.g.:
> > > > 
> > > > $ cat .../sriov_totalvfs > .../sriov_numvfs
> > > > 
> > > > For devices which VF support depends on loaded FW we have the
> > > > pci_sriov_{g,s}et_totalvfs() API.  However, this API uses 0 as
> > > > a special "unset" value, meaning drivers can't limit sriov_totalvfs
> > > > to 0.  Remove the special values completely and simply initialize
> > > > driver_max_VFs to total_VFs.  Then always use driver_max_VFs.
> > > > Add a helper for drivers to reset the VF limit back to total.  
> > > 
> > > I still can't really make sense out of the changelog.
> > >
> > > I think part of the reason it's confusing is because there are two
> > > things going on:
> > > 
> > >   1) You want this:
> > >   
> > >pci_sriov_set_totalvfs(dev, 0);
> > >x = pci_sriov_get_totalvfs(dev) 
> > > 
> > >  to return 0 instead of total_VFs.  That seems to connect with
> > >  your subject line.  It means "sriov_totalvfs" in sysfs could be
> > >  0, but I don't know how that is useful (I'm sure it is; just
> > >  educate me :))
> > 
> > Let me just quote the bug report that got filed on our internal bug
> > tracker :)
> > 
> >   When testing Juju Openstack with Ubuntu 18.04, enabling SR-IOV causes
> >   errors because Juju gets the sriov_totalvfs for SR-IOV-capable device
> >   then tries to set that as the sriov_numvfs parameter.
> > 
> >   For SR-IOV incapable FW, the sriov_totalvfs parameter should be 0, 
> >   but it's set to max.  When FW is switched to flower*, the correct 
> >   sriov_totalvfs value is presented.
> > 
> > * flower is a project name
> 
> From the point of view of the PCI core (which knows nothing about
> device firmware and relies on the architected config space described
> by the PCIe spec), this sounds like an erratum: with some firmware
> installed, the device is not capable of SR-IOV, but still advertises
> an SR-IOV capability with "TotalVFs > 0".
> 
> Regardless of whether that's an erratum, we do allow PF drivers to use
> pci_sriov_set_totalvfs() to limit the number of VFs that may be
> enabled by writing to the PF's "sriov_numvfs" sysfs file.
> 
> But the current implementation does not allow a PF driver to limit VFs
> to 0, and that does seem nonsensical.
> 
> > My understanding is OpenStack uses sriov_totalvfs to determine how many
> > VFs can be enabled, looks like this is the code:
> > 
> > http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n464
> > 
> > >   2) You're adding the pci_sriov_reset_totalvfs() interface.  I'm not
> > >  sure what you intend for this.  Is *every* driver supposed to
> > >  call it in .remove()?  Could/should this be done in the core
> > >  somehow instead of depending on every driver?
> > 
> > Good question, I was just thinking yesterday we may want to call it
> > from the core, but I don't think it's strictly necessary nor always
> > sufficient (we may reload FW without re-probing).
> > 
> > We have a device which supports different number of VFs based on the FW
> > loaded.  Some legacy FWs does not inform the driver how many VFs it can
> > support, because it supports max.  So the flow in our driver is this:
> > 
> > load_fw(dev);
> > ...
> > max_vfs = ask_fw_for_max_vfs(dev);
> > if (max_vfs >= 0)
> > return pci_sriov_set_totalvfs(dev, max_vfs);
> > else /* FW didn't tell us, assume max */
> > return pci_sriov_reset_totalvfs(dev); 
> > 
> > We also reset the max on device remove, but that's not strictly
> > necessary.
> > 
> > Other users of pci_sriov_set_totalvfs() always know the value to set
> > the total to (either always get it from FW or it's a constant).
> > 
> > If you prefer we can work out the correct max for those legacy cases in
> > the driver as well, although it seemed cleaner to just ask the core,
> > since it already has total_VFs value handy :)
> > 
> > > I'm also having a hard time connecting your user-space command example
> > > with the rest of this.  Maybe it will make more sense to me tomorrow
> > > after some coffee.
> > 
> > OpenStack assumes it will always be able to set sriov_numvfs to
> > sriov_totalvfs, see this 'if':
> > 
> > http://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/tree/hooks/neutron_ovs_utils.py#n512
> 
> Thanks for educating me.  I think there are two issues here that we
> can separate.  I extracted the patch below for the first.
> 
> The second is the question of resetting driver_max_VFs.  I think we
> currently have a general issue in the core:
> 
>   - load PF driver 1
>   - driver calls pci_sriov_set_totalvfs() to reduce 

[net] bpf, xdp, i40e: fix i40e_build_skb skb reserve and truesize

2018-06-19 Thread Jeff Kirsher
From: Daniel Borkmann 

Using skb_reserve(skb, I40E_SKB_PAD + (xdp->data - xdp->data_hard_start))
is clearly wrong since I40E_SKB_PAD already points to the offset where
the original xdp->data was sitting since xdp->data_hard_start is defined
as xdp->data - i40e_rx_offset(rx_ring) where latter offsets to I40E_SKB_PAD
when build skb is used.

However, also before cc5b114dcf98 ("bpf, i40e: add meta data support")
this seems broken since bpf_xdp_adjust_head() helper could have been used
to alter headroom and enlarge / shrink the frame and with that the assumption
that the xdp->data remains unchanged does not hold and would push a bogus
packet to upper stack.

ixgbe got this right in 924708081629 ("ixgbe: add XDP support for pass and
drop actions"). In any case, fix it by removing the I40E_SKB_PAD from both
skb_reserve() and truesize calculation.

Fixes: cc5b114dcf98 ("bpf, i40e: add meta data support")
Fixes: 0c8493d90b6b ("i40e: add XDP support for pass and drop actions")
Reported-by: Keith Busch 
Reported-by: Toshiaki Makita 
Signed-off-by: Daniel Borkmann 
Cc: Björn Töpel 
Cc: John Fastabend 
Tested-by: Keith Busch 
Acked-by: John Fastabend 
Acked-by: Alexander Duyck 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 8ffb7454e67c..ed6dbcfd4e96 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2103,9 +2103,8 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring 
*rx_ring,
unsigned int truesize = i40e_rx_pg_size(rx_ring) / 2;
 #else
unsigned int truesize = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) +
-   SKB_DATA_ALIGN(I40E_SKB_PAD +
-  (xdp->data_end -
-   xdp->data_hard_start));
+   SKB_DATA_ALIGN(xdp->data_end -
+  xdp->data_hard_start);
 #endif
struct sk_buff *skb;
 
@@ -2124,7 +2123,7 @@ static struct sk_buff *i40e_build_skb(struct i40e_ring 
*rx_ring,
return NULL;
 
/* update pointers within the skb to store the data */
-   skb_reserve(skb, I40E_SKB_PAD + (xdp->data - xdp->data_hard_start));
+   skb_reserve(skb, xdp->data - xdp->data_hard_start);
__skb_put(skb, xdp->data_end - xdp->data);
if (metasize)
skb_metadata_set(skb, metasize);
-- 
2.17.1



Re: [PATCH bpf-net] bpf: Change bpf_fib_lookup to return lookup status

2018-06-19 Thread Martin KaFai Lau
On Tue, Jun 19, 2018 at 02:16:53PM -0600, David Ahern wrote:
> On 6/19/18 10:36 AM, Martin KaFai Lau wrote:
> > On Tue, Jun 19, 2018 at 09:34:28AM -0600, David Ahern wrote:
> >> On 6/19/18 9:25 AM, Martin KaFai Lau wrote:
> >>> On Mon, Jun 18, 2018 at 03:35:25PM -0600, David Ahern wrote:
>  On 6/18/18 2:55 PM, Martin KaFai Lau wrote:
> >>/* rc > 0 case */
> >>switch(rc) {
> >>case BPF_FIB_LKUP_RET_BLACKHOLE:
> >>case BPF_FIB_LKUP_RET_UNREACHABLE:
> >>case BPF_FIB_LKUP_RET_PROHIBIT:
> >>return XDP_DROP;
> >>}
> >>
> >> For the others it becomes a question of do we share why the stack needs
> >> to be involved? Maybe the program wants to collect stats to show 
> >> traffic
> >> patterns that can be improved (BPF_FIB_LKUP_RET_FRAG_NEEDED) or support
> >> in the kernel needs to be improved (BPF_FIB_LKUP_RET_UNSUPP_LWT) or an
> >> interface is misconfigured (BPF_FIB_LKUP_RET_FWD_DISABLED).
> > Thanks for the explanation.
> >
> > Agree on the bpf able to collect stats will be useful.
> >
> > I am wondering, if a new BPF_FIB_LKUP_RET_XYZ is added later,
> > how may the old xdp_prog work/not-work?  As of now, the return value
> > is straight forward, FWD, PASS (to stack) or DROP (error).
> > With this change, the xdp_prog needs to match/switch() the
> > BPF_FIB_LKUP_RET_* to at least PASS and DROP.
> 
>  IMO, programs should only call XDP_DROP for known reasons - like the 3
>  above. Anything else punt to the stack.
> 
>  If a new RET_XYZ comes along:
>  1. the new XYZ is a new ACL response where the packet is to be dropped.
>  If the program does not understand XYZ and punts to the stack
>  (recommendation), then a second lookup is done during normal packet
>  processing and the stack drops it.
> 
>  2. the new XYZ is a new path in the kernel that is unsupported with
>  respect to XDP forwarding, nothing new for the program to do.
> 
>  Either way I would expect stats on BPF_FIB_LKUP_RET_* to give a hint to
>  the program writer.
> 
>  Worst case of punting packets to the stack for any rc != 0 means the
>  stack is doing 2 lookups - 1 in XDP based on its lookup parameters and 1
>  in normal stack processing - to handle the packet.
> >>> Instead of having the xdp_prog to follow the meaning of what RET_SYZ is,
> >>> should the bpf_*_fib_lookup() return value be kept as is such that
> >>> the xdp_prog is clear what to do.  The reason can be returned in
> >>> the 'struct bpf_fib_lookup'.  The number of reasons can be extended.
> >>> If the xdp_prog does not understand a reason, it still will not
> >>> affect its decision because the return value is clear.
> >>> I think the situation here is similar to regular syscall which usually
> >>> uses -1 to clearly states error and errno to spells out the reason.
> >>>
> >>
> >> I did consider returning the status in struct bpf_fib_lookup. However,
> >> it is 64 bytes and can not be extended without a big performance
> >> penalty, so the only option there is to make an existing entry a union
> >> the most logical of which is the ifindex. It seemed odd to me to have
> >> the result by hidden in the struct as a union on ifindex and returning
> >> the egress index from the function:
> >>
> >> @@ -2625,7 +2636,11 @@ struct bpf_fib_lookup {
> >>
> >> /* total length of packet from network header - used for MTU
> >> check */
> >> __u16   tot_len;
> >> -   __u32   ifindex;  /* L3 device index for lookup */
> >> +
> >> +   union {
> >> +   __u32   ifindex;  /* input: L3 device index for lookup */
> >> +   __u32   result;   /* output: one of BPF_FIB_LKUP_RET_* */
> >> +   };
> >>
> >>
> >> It seemed more natural to have ifindex stay ifindex and only change
> >> value on return:
> >>
> >> @@ -2625,7 +2639,11 @@ struct bpf_fib_lookup {
> >>
> >>/* total length of packet from network header - used for MTU check */
> >>__u16   tot_len;
> >> -  __u32   ifindex;  /* L3 device index for lookup */
> >> +
> >> +  /* input: L3 device index for lookup
> >> +   * output: nexthop device index from FIB lookup
> >> +   */
> >> +  __u32   ifindex;
> >>
> >>union {
> >>/* inputs to lookup */
> >>
> >>
> >> From a program's perspective:
> >>
> >> rc < 0  -- program is passing incorrect data
> >> rc == 0 -- packet can be forwarded
> >> rc > 0  -- packet can not be forwarded.
> >>
> >> BPF programs are not required to track the LKUP_RET values any more than
> >> a function returning multiple negative values - the caller just checks
> >> rc < 0 means failure. If the program cares it can look at specific
> >> values of rc to see the specific value.
> >>
> >> The same applies with the LKUP_RET values - they are there to provide
> >> insight into why the packet is not forwarded directly if the program

Re: [PATCH net] ipvlan: call dev_change_flags when reset ipvlan mode

2018-06-19 Thread Cong Wang
On Mon, Jun 18, 2018 at 7:04 AM, Hangbin Liu  wrote:
> @@ -94,10 +95,13 @@ static int ipvlan_set_port_mode(struct ipvl_port *port, 
> u16 nval)
> mdev->l3mdev_ops = NULL;
> }
> list_for_each_entry(ipvlan, >ipvlans, pnode) {
> +   flags = ipvlan->dev->flags;
> if (nval == IPVLAN_MODE_L3 || nval == IPVLAN_MODE_L3S)
> -   ipvlan->dev->flags |= IFF_NOARP;
> +   dev_change_flags(ipvlan->dev,
> +flags | IFF_NOARP);
> else
> -   ipvlan->dev->flags &= ~IFF_NOARP;
> +   dev_change_flags(ipvlan->dev,
> +flags & ~IFF_NOARP);

You need to check the return value of dev_change_flags().


Re: iproute2 won't compile without AF_VSOCK

2018-06-19 Thread Steve Wise



On 6/19/2018 3:29 PM, David Ahern wrote:
> On 6/19/18 2:27 PM, David Ahern wrote:
>> On 6/19/18 9:47 AM, Stephen Hemminger wrote:
>>> On Tue, 19 Jun 2018 10:17:45 -0500
>>> Steve Wise  wrote:
>>>
 Hey David,

 I'm trying to compile the latest iproute2 on an RHEL-7.3 distro, and it
 fails to compile because AF_VSOCK is not defined.  Should this
 functionality be a configure option to disable it on older distros?


 Thanks,

 Steve.

 

 misc
     CC   ss.o
 ss.c:301:27: error: ‘AF_VSOCK’ undeclared here (not in a function)
    .families = FAMILY_MASK(AF_VSOCK),
    ^
 ss.c:252:46: note: in definition of macro ‘FAMILY_MASK’
  #define FAMILY_MASK(family) ((uint64_t)1 << (family))
   ^
 ss.c:334:2: error: array index in initializer not of integer type
   [AF_VSOCK] = {
   ^
 ss.c:334:2: error: (near initialization for ‘default_afs’)
 make[1]: *** [ss.o] Error 1
 make: *** [all] Error 2

>>> Probably should just add an #ifdef to takeout that if not present
>>>
>> Most userspace tools have a compat header for cases like this.
>>
>> #ifndef AF_VSOCK
>> #define AF_VSOCK 40
>> #endif
>>
> Add the above to include//utils.h; AF_MPLS is already there.

I'll send out a patch.

Thanks,

Steve.


Re: [bug] cxgb4: vrf stopped working with cxgb4 card

2018-06-19 Thread AMG Zollner Robert

On 19.06.2018 16:24, Ganesh Goudar wrote:

On Monday, June 06/11/18, 2018 at 14:47:55 +0530, Ganesh Goudar wrote:

On Saturday, June 06/09/18, 2018 at 18:47:55 -0600, David Ahern wrote:

Ganesh:

On 6/4/18 9:03 AM, AMG Zollner Robert wrote:

I have noticed that vrf is not working with kernel v4.15.0 but was
working with v4.13.0 when using cxgb4 Chelsio driver (T520-cr)

Setup:
Two metal servers with a T520-cr card each, directly connected without a
switch in between.

    SVR1  only ipfwd SVR2     with vrf
.. .--.
|                        | |     |
|    192.168.8.1 [  ens2f4]--|-|--[ens1f4] 192.168.8.2   |
|    192.168.9.1 [ens2f4d1]--|-|-- 192.168.9.2 VRF=10   |
`' `--'

When vrf is not working there are no error messages (dmesg or iproute
commands), tcpdump on the interface (SVR2.ens1f4d1) enslaved in vrf 10
shows packets(arp req/reply) coming in and going out, but outgoing
packets(arp reply) do not reach the other server SVR1.ens2f4d1


Bisect:
Found this commit to be the problem after doing a git bisect between
v4.13..v4.15:

commit ba581f77df23c8ee70b372966e69cf10bc5453d8
Author: Ganesh Goudar 
Date:   Sat Sep 23 16:07:28 2017 +0530

     cxgb4: do DCB state reset in couple of places

     reset the driver's DCB state in couple of places
     where it was missing.


Are you working on a fix for this or should a revert of the above patch
be sent?

Will look into it and fix/revert it soon, Thanks for responding to Robert.






A bisect step was considered good when:
- successful ping from SVR1 to SVR2.ens1f4d1 vrf interface
- successful ping from SVR2 global to SVR2 vrf interface trough SVR1(l3
forwarding) (this check was redundant,both tests fail or pass simultaneous)

The problem is still present on recent kernels also, checked v4.16.0 and
v4.17.rc7

Disabling DCB for the card support fixes the problem ( Compiling kernel
with "CONFIG_CHELSIO_T4_DCB=n")



This is my first time reporting a bug to the linux kernel and hope I
have included the right amount of information. Please let me know if I
have missed something.



Thank you,
Zollner Robert



Logs:

VRF configured using folowing commands:

#!/bin/sh

CHDEV=ens1f4
VRF=vrf-recv

sysctl -w net.ipv4.tcp_l3mdev_accept=1
sysctl -w net.ipv4.udp_l3mdev_accept=1
sysctl -w net.ipv4.conf.all.accept_local=1

ifconfig ${CHDEV}   192.168.8.2/24
ifconfig ${CHDEV}d1 192.168.9.2/24

ip link add ${VRF} type vrf table 10
ip link set dev ${VRF} up

ip rule add pref 32765 table local
ip rule del pref 0

ip route add table 10 unreachable default metric 4278198272

ip link set dev ${CHDEV}d1 master ${VRF}

ip route add table 10 default via 192.168.9.1
ip route add 192.168.9.0/24 via 192.168.8.1







-netdev, Please feel free to add if needed.

Hi Robert,

My knowledge of VRF is very limited, I am trying to bring
up VRF setup, I just wanted to check if you are doing anything
related DCB and also please let me know how did you setup SRV1.

Thanks



Hello Ganesh,

SRV1 is just forwarding(l3) between the two physical ports of the 
T520-CR card.


ifconfig ens1f4   192.168.8.1/24
ifconfig ens1f4d1 192.168.9.1/24
sysctl -w net.ipv4.ip_forward=1

- No VRF is configured on this box
- DCB is also not used


SVR2 is using VRF and is configured with the script inlined in the first 
email.






Re: iproute2 won't compile without AF_VSOCK

2018-06-19 Thread David Ahern
On 6/19/18 2:27 PM, David Ahern wrote:
> On 6/19/18 9:47 AM, Stephen Hemminger wrote:
>> On Tue, 19 Jun 2018 10:17:45 -0500
>> Steve Wise  wrote:
>>
>>> Hey David,
>>>
>>> I'm trying to compile the latest iproute2 on an RHEL-7.3 distro, and it
>>> fails to compile because AF_VSOCK is not defined.  Should this
>>> functionality be a configure option to disable it on older distros?
>>>
>>>
>>> Thanks,
>>>
>>> Steve.
>>>
>>> 
>>>
>>> misc
>>>     CC   ss.o
>>> ss.c:301:27: error: ‘AF_VSOCK’ undeclared here (not in a function)
>>>    .families = FAMILY_MASK(AF_VSOCK),
>>>    ^
>>> ss.c:252:46: note: in definition of macro ‘FAMILY_MASK’
>>>  #define FAMILY_MASK(family) ((uint64_t)1 << (family))
>>>   ^
>>> ss.c:334:2: error: array index in initializer not of integer type
>>>   [AF_VSOCK] = {
>>>   ^
>>> ss.c:334:2: error: (near initialization for ‘default_afs’)
>>> make[1]: *** [ss.o] Error 1
>>> make: *** [all] Error 2
>>>
>>
>> Probably should just add an #ifdef to takeout that if not present
>>
> 
> Most userspace tools have a compat header for cases like this.
> 
> #ifndef AF_VSOCK
> #define AF_VSOCK  40
> #endif
> 

Add the above to include//utils.h; AF_MPLS is already there.


Re: iproute2 won't compile without AF_VSOCK

2018-06-19 Thread David Ahern
On 6/19/18 9:47 AM, Stephen Hemminger wrote:
> On Tue, 19 Jun 2018 10:17:45 -0500
> Steve Wise  wrote:
> 
>> Hey David,
>>
>> I'm trying to compile the latest iproute2 on an RHEL-7.3 distro, and it
>> fails to compile because AF_VSOCK is not defined.  Should this
>> functionality be a configure option to disable it on older distros?
>>
>>
>> Thanks,
>>
>> Steve.
>>
>> 
>>
>> misc
>>     CC   ss.o
>> ss.c:301:27: error: ‘AF_VSOCK’ undeclared here (not in a function)
>>    .families = FAMILY_MASK(AF_VSOCK),
>>    ^
>> ss.c:252:46: note: in definition of macro ‘FAMILY_MASK’
>>  #define FAMILY_MASK(family) ((uint64_t)1 << (family))
>>   ^
>> ss.c:334:2: error: array index in initializer not of integer type
>>   [AF_VSOCK] = {
>>   ^
>> ss.c:334:2: error: (near initialization for ‘default_afs’)
>> make[1]: *** [ss.o] Error 1
>> make: *** [all] Error 2
>>
> 
> Probably should just add an #ifdef to takeout that if not present
> 

Most userspace tools have a compat header for cases like this.

#ifndef AF_VSOCK
#define AF_VSOCK40
#endif


Re: [PATCH bpf-net] bpf: Change bpf_fib_lookup to return lookup status

2018-06-19 Thread David Ahern
On 6/19/18 10:36 AM, Martin KaFai Lau wrote:
> On Tue, Jun 19, 2018 at 09:34:28AM -0600, David Ahern wrote:
>> On 6/19/18 9:25 AM, Martin KaFai Lau wrote:
>>> On Mon, Jun 18, 2018 at 03:35:25PM -0600, David Ahern wrote:
 On 6/18/18 2:55 PM, Martin KaFai Lau wrote:
>>  /* rc > 0 case */
>>  switch(rc) {
>>  case BPF_FIB_LKUP_RET_BLACKHOLE:
>>  case BPF_FIB_LKUP_RET_UNREACHABLE:
>>  case BPF_FIB_LKUP_RET_PROHIBIT:
>>  return XDP_DROP;
>>  }
>>
>> For the others it becomes a question of do we share why the stack needs
>> to be involved? Maybe the program wants to collect stats to show traffic
>> patterns that can be improved (BPF_FIB_LKUP_RET_FRAG_NEEDED) or support
>> in the kernel needs to be improved (BPF_FIB_LKUP_RET_UNSUPP_LWT) or an
>> interface is misconfigured (BPF_FIB_LKUP_RET_FWD_DISABLED).
> Thanks for the explanation.
>
> Agree on the bpf able to collect stats will be useful.
>
> I am wondering, if a new BPF_FIB_LKUP_RET_XYZ is added later,
> how may the old xdp_prog work/not-work?  As of now, the return value
> is straight forward, FWD, PASS (to stack) or DROP (error).
> With this change, the xdp_prog needs to match/switch() the
> BPF_FIB_LKUP_RET_* to at least PASS and DROP.

 IMO, programs should only call XDP_DROP for known reasons - like the 3
 above. Anything else punt to the stack.

 If a new RET_XYZ comes along:
 1. the new XYZ is a new ACL response where the packet is to be dropped.
 If the program does not understand XYZ and punts to the stack
 (recommendation), then a second lookup is done during normal packet
 processing and the stack drops it.

 2. the new XYZ is a new path in the kernel that is unsupported with
 respect to XDP forwarding, nothing new for the program to do.

 Either way I would expect stats on BPF_FIB_LKUP_RET_* to give a hint to
 the program writer.

 Worst case of punting packets to the stack for any rc != 0 means the
 stack is doing 2 lookups - 1 in XDP based on its lookup parameters and 1
 in normal stack processing - to handle the packet.
>>> Instead of having the xdp_prog to follow the meaning of what RET_SYZ is,
>>> should the bpf_*_fib_lookup() return value be kept as is such that
>>> the xdp_prog is clear what to do.  The reason can be returned in
>>> the 'struct bpf_fib_lookup'.  The number of reasons can be extended.
>>> If the xdp_prog does not understand a reason, it still will not
>>> affect its decision because the return value is clear.
>>> I think the situation here is similar to regular syscall which usually
>>> uses -1 to clearly states error and errno to spells out the reason.
>>>
>>
>> I did consider returning the status in struct bpf_fib_lookup. However,
>> it is 64 bytes and can not be extended without a big performance
>> penalty, so the only option there is to make an existing entry a union
>> the most logical of which is the ifindex. It seemed odd to me to have
>> the result by hidden in the struct as a union on ifindex and returning
>> the egress index from the function:
>>
>> @@ -2625,7 +2636,11 @@ struct bpf_fib_lookup {
>>
>> /* total length of packet from network header - used for MTU
>> check */
>> __u16   tot_len;
>> -   __u32   ifindex;  /* L3 device index for lookup */
>> +
>> +   union {
>> +   __u32   ifindex;  /* input: L3 device index for lookup */
>> +   __u32   result;   /* output: one of BPF_FIB_LKUP_RET_* */
>> +   };
>>
>>
>> It seemed more natural to have ifindex stay ifindex and only change
>> value on return:
>>
>> @@ -2625,7 +2639,11 @@ struct bpf_fib_lookup {
>>
>>  /* total length of packet from network header - used for MTU check */
>>  __u16   tot_len;
>> -__u32   ifindex;  /* L3 device index for lookup */
>> +
>> +/* input: L3 device index for lookup
>> + * output: nexthop device index from FIB lookup
>> + */
>> +__u32   ifindex;
>>
>>  union {
>>  /* inputs to lookup */
>>
>>
>> From a program's perspective:
>>
>> rc < 0  -- program is passing incorrect data
>> rc == 0 -- packet can be forwarded
>> rc > 0  -- packet can not be forwarded.
>>
>> BPF programs are not required to track the LKUP_RET values any more than
>> a function returning multiple negative values - the caller just checks
>> rc < 0 means failure. If the program cares it can look at specific
>> values of rc to see the specific value.
>>
>> The same applies with the LKUP_RET values - they are there to provide
>> insight into why the packet is not forwarded directly if the program
>> cares to know why.
> hmm...ic. My concern is, the prog can interpret rc > 0 (in this patch) to be
> drop vs pass (although we can advise them in bpf.h to always pass if it does
> not understand a rc but it is not a strong contract),  it may catch people
> a surprise if a xdp_prog suddenly drops everything 

Re: [PATCH net] net/sched: act_ife: fix recursive lock and idr leak

2018-06-19 Thread Cong Wang
On Tue, Jun 19, 2018 at 6:39 AM, Davide Caratti  wrote:
> a recursive lock warning [1] can be observed with the following script,
>
>  # $TC actions add action ife encode allow prio pass index 42
>  IFE type 0xED3E
>  # $TC actions replace action ife encode allow tcindex pass index 42
>
> in case the kernel was unable to run the last command (e.g. because of
> the impossibility to load 'act_meta_skbtcindex'). For a similar reason,
> the kernel can leak idr in the error path of tcf_ife_init(), because
> tcf_idr_release() is not called after successful idr reservation:
>
>  # $TC actions add action ife encode allow tcindex index 47
>  IFE type 0xED3E
>  RTNETLINK answers: No such file or directory
>  We have an error talking to the kernel
>  # $TC actions add action ife encode allow tcindex index 47
>  IFE type 0xED3E
>  RTNETLINK answers: No space left on device
>  We have an error talking to the kernel
>  # $TC actions add action ife encode use mark 7 type 0xfefe pass index 47
>  IFE type 0xFEFE
>  RTNETLINK answers: No space left on device
>  We have an error talking to the kernel
>
> Since tcfa_lock is already taken when the action is being edited, a call
> to tcf_idr_release() wrongly makes tcf_idr_cleanup() take the same lock
> again. On the other hand, tcf_idr_release() needs to be called in the
> error path of tcf_ife_init(), to undo the last tcf_idr_create() invocation.
> Fix both problems in tcf_ife_init().
> Since the cleanup() routine can now be called when ife->params is NULL,
> also add a NULL pointer check to avoid calling kfree_rcu(NULL, rcu).

Acked-by: Cong Wang 


Re: [PATCH net] net/sched: act_ife: preserve the action control in case of error

2018-06-19 Thread Cong Wang
On Tue, Jun 19, 2018 at 6:45 AM, Davide Caratti  wrote:
> in the following script
>
>  # tc actions add action ife encode allow prio pass index 42
>  # tc actions replace action ife encode allow tcindex drop index 42
>
> the action control should remain equal to 'pass', if the kernel failed
> to replace the TC action. Pospone the assignment of the action control,
> to ensure it is not overwritten in the error path of tcf_ife_init().
>
> Fixes: ef6980b6becb ("introduce IFE action")
> Signed-off-by: Davide Caratti 

Acked-by: Cong Wang 


Re: [PATCH] ucc_geth: Add BQL support

2018-06-19 Thread Joakim Tjernlund
On Tue, 2018-06-19 at 11:37 -0700, Dave Taht wrote:
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open attachments unless you recognize the sender and know the 
> content is safe.
> 
> 
> very happy to see this. is there a specific chip or devboard this runs on?

This driver is for MPC83xx family SOCs(possibly others as well) on our custom 
boards, used in 
our telecom product.

You are actually the reason I impl. this :)

 Jocke

> 
> On Tue, Jun 19, 2018 at 11:24 AM, Li Yang  wrote:
> > On Tue, Jun 19, 2018 at 11:30 AM, Joakim Tjernlund
> >  wrote:
> > > Signed-off-by: Joakim Tjernlund 
> > 
> > Acked-by: Li Yang 
> > 
> > > ---
> > >  drivers/net/ethernet/freescale/ucc_geth.c | 7 ++-
> > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/net/ethernet/freescale/ucc_geth.c 
> > > b/drivers/net/ethernet/freescale/ucc_geth.c
> > > index f77ba9fa257b..6c99a9af6647 100644
> > > --- a/drivers/net/ethernet/freescale/ucc_geth.c
> > > +++ b/drivers/net/ethernet/freescale/ucc_geth.c
> > > @@ -3096,6 +3096,7 @@ static int ucc_geth_start_xmit(struct sk_buff *skb, 
> > > struct net_device *dev)
> > > 
> > > ugeth_vdbg("%s: IN", __func__);
> > > 
> > > +   netdev_sent_queue(dev, skb->len);
> > > spin_lock_irqsave(>lock, flags);
> > > 
> > > dev->stats.tx_bytes += skb->len;
> > > @@ -3242,6 +3243,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 
> > > txQ)
> > > struct ucc_geth_private *ugeth = netdev_priv(dev);
> > > u8 __iomem *bd; /* BD pointer */
> > > u32 bd_status;
> > > +   int howmany = 0;
> > > +   unsigned int bytes_sent = 0;
> > > 
> > > bd = ugeth->confBd[txQ];
> > > bd_status = in_be32((u32 __iomem *)bd);
> > > @@ -3257,7 +3260,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 
> > > txQ)
> > > skb = ugeth->tx_skbuff[txQ][ugeth->skb_dirtytx[txQ]];
> > > if (!skb)
> > > break;
> > > -
> > > +   howmany++;
> > > +   bytes_sent += skb->len;
> > > dev->stats.tx_packets++;
> > > 
> > > dev_consume_skb_any(skb);
> > > @@ -3279,6 +3283,7 @@ static int ucc_geth_tx(struct net_device *dev, u8 
> > > txQ)
> > > bd_status = in_be32((u32 __iomem *)bd);
> > > }
> > > ugeth->confBd[txQ] = bd;
> > > +   netdev_completed_queue(dev, howmany, bytes_sent);
> > > return 0;
> > >  }
> > > 
> > > --
> > > 2.13.6
> > > 
> 
> 
> 
> --
> 
> Dave Täht
> CEO, TekLibre, LLC
> http://www.teklibre.com
> Tel: 1-669-226-2619


Re: [PATCH] ucc_geth: Add BQL support

2018-06-19 Thread Dave Taht
very happy to see this. is there a specific chip or devboard this runs on?

On Tue, Jun 19, 2018 at 11:24 AM, Li Yang  wrote:
> On Tue, Jun 19, 2018 at 11:30 AM, Joakim Tjernlund
>  wrote:
>> Signed-off-by: Joakim Tjernlund 
>
> Acked-by: Li Yang 
>
>> ---
>>  drivers/net/ethernet/freescale/ucc_geth.c | 7 ++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/freescale/ucc_geth.c 
>> b/drivers/net/ethernet/freescale/ucc_geth.c
>> index f77ba9fa257b..6c99a9af6647 100644
>> --- a/drivers/net/ethernet/freescale/ucc_geth.c
>> +++ b/drivers/net/ethernet/freescale/ucc_geth.c
>> @@ -3096,6 +3096,7 @@ static int ucc_geth_start_xmit(struct sk_buff *skb, 
>> struct net_device *dev)
>>
>> ugeth_vdbg("%s: IN", __func__);
>>
>> +   netdev_sent_queue(dev, skb->len);
>> spin_lock_irqsave(>lock, flags);
>>
>> dev->stats.tx_bytes += skb->len;
>> @@ -3242,6 +3243,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
>> struct ucc_geth_private *ugeth = netdev_priv(dev);
>> u8 __iomem *bd; /* BD pointer */
>> u32 bd_status;
>> +   int howmany = 0;
>> +   unsigned int bytes_sent = 0;
>>
>> bd = ugeth->confBd[txQ];
>> bd_status = in_be32((u32 __iomem *)bd);
>> @@ -3257,7 +3260,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
>> skb = ugeth->tx_skbuff[txQ][ugeth->skb_dirtytx[txQ]];
>> if (!skb)
>> break;
>> -
>> +   howmany++;
>> +   bytes_sent += skb->len;
>> dev->stats.tx_packets++;
>>
>> dev_consume_skb_any(skb);
>> @@ -3279,6 +3283,7 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
>> bd_status = in_be32((u32 __iomem *)bd);
>> }
>> ugeth->confBd[txQ] = bd;
>> +   netdev_completed_queue(dev, howmany, bytes_sent);
>> return 0;
>>  }
>>
>> --
>> 2.13.6
>>



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619


Re: [PATCH] ucc_geth: Add BQL support

2018-06-19 Thread Li Yang
On Tue, Jun 19, 2018 at 11:30 AM, Joakim Tjernlund
 wrote:
> Signed-off-by: Joakim Tjernlund 

Acked-by: Li Yang 

> ---
>  drivers/net/ethernet/freescale/ucc_geth.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/freescale/ucc_geth.c 
> b/drivers/net/ethernet/freescale/ucc_geth.c
> index f77ba9fa257b..6c99a9af6647 100644
> --- a/drivers/net/ethernet/freescale/ucc_geth.c
> +++ b/drivers/net/ethernet/freescale/ucc_geth.c
> @@ -3096,6 +3096,7 @@ static int ucc_geth_start_xmit(struct sk_buff *skb, 
> struct net_device *dev)
>
> ugeth_vdbg("%s: IN", __func__);
>
> +   netdev_sent_queue(dev, skb->len);
> spin_lock_irqsave(>lock, flags);
>
> dev->stats.tx_bytes += skb->len;
> @@ -3242,6 +3243,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
> struct ucc_geth_private *ugeth = netdev_priv(dev);
> u8 __iomem *bd; /* BD pointer */
> u32 bd_status;
> +   int howmany = 0;
> +   unsigned int bytes_sent = 0;
>
> bd = ugeth->confBd[txQ];
> bd_status = in_be32((u32 __iomem *)bd);
> @@ -3257,7 +3260,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
> skb = ugeth->tx_skbuff[txQ][ugeth->skb_dirtytx[txQ]];
> if (!skb)
> break;
> -
> +   howmany++;
> +   bytes_sent += skb->len;
> dev->stats.tx_packets++;
>
> dev_consume_skb_any(skb);
> @@ -3279,6 +3283,7 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
> bd_status = in_be32((u32 __iomem *)bd);
> }
> ugeth->confBd[txQ] = bd;
> +   netdev_completed_queue(dev, howmany, bytes_sent);
> return 0;
>  }
>
> --
> 2.13.6
>


Re: [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition

2018-06-19 Thread Saeed Mahameed
On Thu, 2018-06-14 at 16:49 -0700, Eric Dumazet wrote:
> 
> On 06/14/2018 02:04 PM, Saeed Mahameed wrote:
> 
> > I was looking at the code without my fix :)
> > 
> > with my fix:
> > release = frags->page_offset + frag_info->frag_stride > PAGE_SIZE;
> > 
> > for XDP: frag_info->frag_stride is PAGE_SIZE, so release will
> > always be
> > true regardless of PAGE_SIZE.
> > 
> > So i guess i didn't quite understand your PowerPC concern.. can you
> > elaborate ?
> > 
> 
> So your maths with PAGE_SIZE=65536 and MTU 9000
> 
> frag_stride is about 9344
> 
> So if the last chunk of the page has 9100 bytes, we wont be able to
> use it, while really we should be able to use it.
> 
> 

this is only true for XDP setup, for non XDP max stride_size can only
be around ~3k and only for mtu > ~6k

For XDP setup you suggested:
-   priv->frag_info[0].frag_size = eff_mtu;
+   priv->frag_info[0].frag_size = PAGE_SIZE;

currently the condition is:

release = frags->page_offset + frag_info->frag_size > PAGE_SIZE;

so my solution and yours have the same problem you described above.

the problem is not with the initial values or with stride/farg size
math, it just that in XDP we shouldn't reuse at ALL. I agree with you
that we need to optimize and maybe for PAGE_SIZE > 8k we need to allow
XDP setup to reuses. but for now there is a data corruption to handle.

[RFC v2 PATCH 2/4] ebpf: Add sg_filter_run and sg helper

2018-06-19 Thread Tushar Dave
When sg_filter_run() is invoked it runs the attached eBPF
SOCKET_SG_FILTER program which deals with struct scatterlist.

In addition, this patch also adds bpf_sg_next helper function that
allows users to retrieve the next sg element from sg list.

Signed-off-by: Tushar Dave 
Acked-by: Sowmini Varadhan 
---
 include/linux/filter.h|  2 +
 include/uapi/linux/bpf.h  | 10 -
 net/core/filter.c | 72 +++
 tools/include/uapi/linux/bpf.h| 10 -
 tools/testing/selftests/bpf/bpf_helpers.h |  3 ++
 5 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 71618b1..d176402 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1072,4 +1072,6 @@ struct bpf_sock_ops_kern {
 */
 };
 
+int sg_filter_run(struct sock *sk, struct scatterlist *sg);
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ef0a7b6..036432b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2076,6 +2076,13 @@ struct bpf_stack_build_id {
  * Return
  * A 64-bit integer containing the current cgroup id based
  * on the cgroup within which the current task is running.
+ *
+ * int bpf_sg_next(struct bpf_scatterlist *sg)
+ * Description
+ * This helper allows user to retrieve next sg element from
+ * sg list.
+ * Return
+ * Returns 0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2158,7 +2165,8 @@ struct bpf_stack_build_id {
FN(rc_repeat),  \
FN(rc_keydown), \
FN(skb_cgroup_id),  \
-   FN(get_current_cgroup_id),
+   FN(get_current_cgroup_id),  \
+   FN(sg_next),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 8f67942..702ff5b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -121,6 +121,53 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff 
*skb, unsigned int cap)
 }
 EXPORT_SYMBOL(sk_filter_trim_cap);
 
+int sg_filter_run(struct sock *sk, struct scatterlist *sg)
+{
+   struct sk_filter *filter;
+   int err;
+
+   rcu_read_lock();
+   filter = rcu_dereference(sk->sk_filter);
+   if (filter) {
+   struct bpf_scatterlist bpfsg;
+   int num_sg;
+
+   if (!sg) {
+   err = -EINVAL;
+   goto out;
+   }
+
+   num_sg = sg_nents(sg);
+   if (num_sg <= 0) {
+   err = -EINVAL;
+   goto out;
+   }
+
+   /* We store a reference  to the sg list so it can later used by
+* eBPF helpers to retrieve the next sg element.
+*/
+   bpfsg.num_sg = num_sg;
+   bpfsg.cur_sg = 0;
+   bpfsg.sg = sg;
+
+   /* For the first sg element, we store the pkt access pointers
+* into start and end so eBPF program can have pkt access using
+* data and data_end. The pkt access for subsequent element of
+* sg list is possible when eBPF program invokes bpf_sg_next
+* which takes care of setting start and end to the correct sg
+* element.
+*/
+   bpfsg.start = sg_virt(sg);
+   bpfsg.end = bpfsg.start + sg->length;
+   BPF_PROG_RUN(filter->prog, );
+   }
+out:
+   rcu_read_unlock();
+
+   return err;
+}
+EXPORT_SYMBOL(sg_filter_run);
+
 BPF_CALL_1(bpf_skb_get_pay_offset, struct sk_buff *, skb)
 {
return skb_get_poff(skb);
@@ -3753,6 +3800,29 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const 
void *src_buff,
.arg1_type  = ARG_PTR_TO_CTX,
 };
 
+BPF_CALL_1(bpf_sg_next, struct bpf_scatterlist *, bpfsg)
+{
+   struct scatterlist *sg = bpfsg->sg;
+   int cur_sg = bpfsg->cur_sg;
+
+   cur_sg++;
+   if (cur_sg >= bpfsg->num_sg)
+   return -ENODATA;
+
+   bpfsg->cur_sg = cur_sg;
+   bpfsg->start = sg_virt([cur_sg]);
+   bpfsg->end = bpfsg->start + sg[cur_sg].length;
+
+   return 0;
+}
+
+static const struct bpf_func_proto bpf_sg_next_proto = {
+   .func   = bpf_sg_next,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+};
+
 BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
   int, level, int, optname, char *, optval, int, optlen)
 {
@@ -4720,6 +4790,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 socksg_filter_func_proto(enum bpf_func_id 

[RFC v2 PATCH 4/4] rds: invoke socket sg filter attached to rds socket

2018-06-19 Thread Tushar Dave
RDS module sits on top of TCP (rds_tcp) and IB (rds_rdma), so messages
arrive in form of skb (over TCP) and scatterlist (over IB/RDMA).
However, because socket filter only deal with skb (e.g. struct skb as
bpf context) we can only use socket filter for rds_tcp and not for
rds_rdma.

Considering one filtering solution for RDS, it seems that the common
denominator between sk_buff and scatterlist is scatterlist. Therefore,
this patch converts skb to sgvec and invoke sg_filter_run for
rds_tcp and simply invoke sg_filter_run for IB/rds_rdma.

Signed-off-by: Tushar Dave 
Reviewed-by: Sowmini Varadhan 
---
 net/rds/ib.c   |  1 +
 net/rds/ib.h   |  1 +
 net/rds/ib_recv.c  | 12 
 net/rds/rds.h  |  2 ++
 net/rds/recv.c | 16 
 net/rds/tcp.c  |  2 ++
 net/rds/tcp.h  |  2 ++
 net/rds/tcp_recv.c | 38 ++
 8 files changed, 74 insertions(+)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 02deee2..3027832 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -421,6 +421,7 @@ struct rds_transport rds_ib_transport = {
.conn_path_shutdown = rds_ib_conn_path_shutdown,
.inc_copy_to_user   = rds_ib_inc_copy_to_user,
.inc_free   = rds_ib_inc_free,
+   .inc_to_sg_get  = rds_ib_inc_to_sg_get,
.cm_initiate_connect= rds_ib_cm_initiate_connect,
.cm_handle_connect  = rds_ib_cm_handle_connect,
.cm_connect_complete= rds_ib_cm_connect_complete,
diff --git a/net/rds/ib.h b/net/rds/ib.h
index a6f4d7d..699b5b9b 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -375,6 +375,7 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn,
 void rds_ib_recv_free_caches(struct rds_ib_connection *ic);
 void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp);
 void rds_ib_inc_free(struct rds_incoming *inc);
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg);
 int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
 void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc,
 struct rds_ib_ack_state *state);
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index b4e421a..62be497 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -219,6 +219,18 @@ void rds_ib_inc_free(struct rds_incoming *inc)
rds_ib_recv_cache_put(>ii_cache_entry, >i_cache_incs);
 }
 
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg)
+{
+   struct rds_ib_incoming *ibinc;
+   struct rds_page_frag *frag;
+
+   ibinc = container_of(inc, struct rds_ib_incoming, ii_inc);
+   frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item);
+   *sg =  >f_sg;
+
+   return 0;
+}
+
 static void rds_ib_recv_clear_one(struct rds_ib_connection *ic,
  struct rds_ib_recv_work *recv)
 {
diff --git a/net/rds/rds.h b/net/rds/rds.h
index b04c333..f5ea833 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -528,6 +528,8 @@ struct rds_transport {
int (*recv_path)(struct rds_conn_path *cp);
int (*inc_copy_to_user)(struct rds_incoming *inc, struct iov_iter *to);
void (*inc_free)(struct rds_incoming *inc);
+   int (*inc_to_sg_get)(struct rds_incoming *inc, struct scatterlist **sg);
+   void (*inc_to_sg_put)(struct scatterlist **sg);
 
int (*cm_handle_connect)(struct rdma_cm_id *cm_id,
 struct rdma_cm_event *event);
diff --git a/net/rds/recv.c b/net/rds/recv.c
index dc67458..e0c5b4c 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -286,6 +286,7 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 
saddr, __be32 daddr,
struct sock *sk;
unsigned long flags;
struct rds_conn_path *cp;
+   struct sk_filter *filter;
 
inc->i_conn = conn;
inc->i_rx_jiffies = jiffies;
@@ -369,6 +370,21 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 
saddr, __be32 daddr,
/* We can be racing with rds_release() which marks the socket dead. */
sk = rds_rs_to_sk(rs);
 
+   rcu_read_lock();
+   filter = rcu_dereference(sk->sk_filter);
+   if (filter) {
+   if (conn->c_trans->inc_to_sg_get) {
+   struct scatterlist *sg;
+
+   if (conn->c_trans->inc_to_sg_get(inc, ) == 0) {
+   sg_filter_run(sk, sg);
+   if (conn->c_trans->inc_to_sg_put)
+   conn->c_trans->inc_to_sg_put();
+   }
+   }
+   }
+   rcu_read_unlock();
+
/* serialize with rds_release -> sock_orphan */
write_lock_irqsave(>rs_recv_lock, flags);
if (!sock_flag(sk, SOCK_DEAD)) {
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 351a284..b431854 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -376,6 +376,8 @@ struct rds_transport 

[RFC v2 PATCH 3/4] ebpf: Add sample ebpf program for SOCKET_SG_FILTER

2018-06-19 Thread Tushar Dave
Add a sample program that shows how socksg program is used and attached
to socket filter. The kernel sample program deals with struct
scatterlist that is passed as bpf context.

When run in server mode, the sample RDS program opens PF_RDS socket,
attaches eBPF program to RDS socket which then uses bpf_sg_next
helper along with bpf tail calls to retrieve packet data contained in
struct scatterlist form.

To ease testing, RDS client functionality is also added so that users
can generate RDS packet.

Server:
[root@lab71 bpf]# ./rds_filter -s 192.168.3.71 -t tcp
running server in a loop
transport tcp
server bound to address: 192.168.3.71 port 4000
server listening on 192.168.3.71

Client:
[root@lab70 bpf]# ./rds_filter -s 192.168.3.71 -c 192.168.3.70 -t tcp
transport tcp
client bound to address: 192.168.3.70 port 25278
client sending 8192 byte message  from 192.168.3.70 to 192.168.3.71 on
port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...

Server output:
192.168.3.71 received a packet from 192.168.3.71 of len 8192 cmsg len 0,
on port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...
server listening on 192.168.3.71

BPF program output:
[root@lab71]# cat /sys/kernel/debug/tracing/trace_pipe
  -0 [007] ..s.   525.994894: 0: Print first 6 bytes from sg 
element
  -0 [007] ..s.   525.994897: 0: First sg element:
  -0 [007] ..s.   525.994899: 0: 30 31 32
  -0 [007] ..s.   525.994900: 0: 33 34 35
  -0 [007] ..s.   525.994901: 0: next sg element:
  -0 [007] ..s.   525.994902: 0: a8 a9 aa
  -0 [007] ..s.   525.994903: 0: ab ac ad
  -0 [007] ..s.   525.994904: 0: next sg element:
  -0 [007] ..s.   525.994905: 0: 50 51 52
  -0 [007] ..s.   525.994905: 0: 53 54 55
  -0 [007] ..s.   525.994906: 0: next sg element:
  -0 [007] ..s.   525.994907: 0: f8 f9 fa
  -0 [007] ..s.   525.994907: 0: fb fc fd
  -0 [007] ..s.   525.994908: 0: next sg element:
  -0 [007] ..s.   525.994909: 0: a0 a1 a2
  -0 [007] ..s.   525.994909: 0: a3 a4 a5
  -0 [007] ..s.   525.994910: 0: next sg element:
  -0 [007] ..s.   525.994911: 0: 48 49 4a
  -0 [007] ..s.   525.994911: 0: 4b 4c 4d
  -0 [007] ..s.   525.994912: 0: no more sg element

Similary specifying '-t ib' will run this on IB link.

Signed-off-by: Tushar Dave 
Acked-by: Sowmini Varadhan 
---
 samples/bpf/Makefile  |   3 +
 samples/bpf/rds_filter_kern.c |  78 ++
 samples/bpf/rds_filter_user.c | 339 ++
 3 files changed, 420 insertions(+)
 create mode 100644 samples/bpf/rds_filter_kern.c
 create mode 100644 samples/bpf/rds_filter_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 1303af1..5de238b 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -52,6 +52,7 @@ hostprogs-y += xdp_adjust_tail
 hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
 hostprogs-y += task_fd_query
+hostprogs-y += rds_filter
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -107,6 +108,7 @@ xdp_adjust_tail-objs := xdp_adjust_tail_user.o
 xdpsock-objs := bpf_load.o xdpsock_user.o
 xdp_fwd-objs := bpf_load.o xdp_fwd_user.o
 task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
+rds_filter-objs := bpf_load.o rds_filter_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -163,6 +165,7 @@ always += xdp_adjust_tail_kern.o
 always += xdpsock_kern.o
 always += xdp_fwd_kern.o
 always += task_fd_query_kern.o
+always += rds_filter_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/rds_filter_kern.c b/samples/bpf/rds_filter_kern.c
new file mode 100644
index 000..8fe3d3c
--- /dev/null
+++ b/samples/bpf/rds_filter_kern.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+#define PROG(F) SEC("socksg/"__stringify(F)) int bpf_func_##F
+
+#define bpf_printk(fmt, ...)   \
+({ \
+   char fmt[] = fmt;   \
+   bpf_trace_printk(fmt, sizeof(fmt),  \
+   ##__VA_ARGS__); \
+})
+
+struct bpf_map_def SEC("maps") jmp_table = {
+   .type = BPF_MAP_TYPE_PROG_ARRAY,
+   .key_size = sizeof(u32),
+   .value_size = sizeof(u32),
+   .max_entries = 2,
+};
+
+#define SG1 1
+
+static inline void dump_sg(struct sg_filter_md *sg)
+{
+   void *data = (void *)(long) sg->data;
+   void *data_end = (void *)(long) sg->data_end;
+   unsigned char *d;
+
+   if (data + 8 > data_end)
+   return;
+
+   d = (unsigned char *)data;
+   bpf_printk("%x %x %x\n", d[0], d[1], d[2]);
+   bpf_printk("%x %x %x\n", 

[RFC v2 PATCH 0/4] eBPF and struct scatterlist

2018-06-19 Thread Tushar Dave
This follows up on https://patchwork.ozlabs.org/cover/927050/
where the review feedback was to use bpf_skb_load_bytes() to deal with
linear and non-linear skbs. While that feedback is valid and correct,
the motivation for this work is to allow eBPF based firewalling for
kernel modules that do not always get their packet as an sk_buff from
their downlink drivers. One such instance of this use-case is RDS, which
can be run both over IB (driver RDMA's a scatterlist to the RDS module)
or over TCP (TCP passes an sk_buff to the RDS module)

This RFC (call it v2) uses exiting socket filter infrastructure and
extend it with new eBPF program type that deals with struct scatterlist.
For RDS, the integrated approach treats the scatterlist as the common
denominator, and allows the application to write a filter for processing
a scatterlist.


Details:
Patch 1 adds new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which
uses the existing socket filter infrastructure for bpf program attach
and load. eBPF program of type BPF_PROG_TYPE_SOCKET_SG_FILTER receives
struct scatterlist as bpf context contrast to
BPF_PROG_TYPE_SOCKET_FILTER which deals with struct skb. This new eBPF
program type allow socket filter to run on packet data that is in form
form of struct scatterlist.

Patch 2 adds functionality to run BPF_PROG_TYPE_SOCKET_SG_FILTER socket
filter program. A bpf helpers bpf_sg_next() is also added so users can
retrieve sg elements from scatterlist.

Patch 3 adds socket filter eBPF sample program that uses patch 1 and
patch 2. The sample program opens an rds socket, attach ebpf program
(socksg i.e. BPF_PROG_TYPE_SOCKET_SG_FILTER) to rds socket and uses
bpf_sg_next helper to look into sg. For a test, current ebpf program
only prints first few bytes from each elements of sg list.

Finally, patch 4 allows rds_recv_incoming to invoke socket filter
program which deals with scatterlist.

Thanks.

-Tushar

Tushar Dave (4):
  eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
  ebpf: Add sg_filter_run and sg helper
  ebpf: Add sample ebpf program for SOCKET_SG_FILTER
  rds: invoke socket sg filter attached to rds socket

 include/linux/bpf_types.h |   1 +
 include/linux/filter.h|  10 +
 include/uapi/linux/bpf.h  |  17 +-
 kernel/bpf/syscall.c  |   1 +
 kernel/bpf/verifier.c |   1 +
 net/core/filter.c | 149 -
 net/rds/ib.c  |   1 +
 net/rds/ib.h  |   1 +
 net/rds/ib_recv.c |  12 ++
 net/rds/rds.h |   2 +
 net/rds/recv.c|  16 ++
 net/rds/tcp.c |   2 +
 net/rds/tcp.h |   2 +
 net/rds/tcp_recv.c|  38 
 samples/bpf/Makefile  |   3 +
 samples/bpf/bpf_load.c|  11 +-
 samples/bpf/rds_filter_kern.c |  78 +++
 samples/bpf/rds_filter_user.c | 339 ++
 tools/bpf/bpftool/prog.c  |   1 +
 tools/include/uapi/linux/bpf.h|  17 +-
 tools/lib/bpf/libbpf.c|   3 +
 tools/lib/bpf/libbpf.h|   2 +
 tools/testing/selftests/bpf/bpf_helpers.h |   3 +
 23 files changed, 703 insertions(+), 7 deletions(-)
 create mode 100644 samples/bpf/rds_filter_kern.c
 create mode 100644 samples/bpf/rds_filter_user.c

-- 
1.8.3.1



[RFC v2 PATCH 1/4] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER

2018-06-19 Thread Tushar Dave
Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the
existing socket filter infrastructure for bpf program attach and load.
SOCKET_SG_FILTER eBPF program receives struct scatterlist as bpf context
contrast to SOCKET_FILTER which deals with struct skb. This is useful
for kernel entities that don't have skb to represent packet data but
want to run eBPF socket filter on packet data that is in form of struct
scatterlist e.g. IB/RDMA

Signed-off-by: Tushar Dave 
Acked-by: Sowmini Varadhan 
---
 include/linux/bpf_types.h  |  1 +
 include/linux/filter.h |  8 +
 include/uapi/linux/bpf.h   |  7 
 kernel/bpf/syscall.c   |  1 +
 kernel/bpf/verifier.c  |  1 +
 net/core/filter.c  | 77 --
 samples/bpf/bpf_load.c | 11 --
 tools/bpf/bpftool/prog.c   |  1 +
 tools/include/uapi/linux/bpf.h |  7 
 tools/lib/bpf/libbpf.c |  3 ++
 tools/lib/bpf/libbpf.h |  2 ++
 11 files changed, 114 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index c5700c2..f8b4b56 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -16,6 +16,7 @@
 BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_SG_FILTER, socksg_filter)
 #endif
 #ifdef CONFIG_BPF_EVENTS
 BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 45fc0f5..71618b1 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -517,6 +517,14 @@ struct bpf_skb_data_end {
void *data_end;
 };
 
+struct bpf_scatterlist {
+   struct scatterlist *sg;
+   void *start;
+   void *end;
+   int cur_sg;
+   int num_sg;
+};
+
 struct sk_msg_buff {
void *data;
void *data_end;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 59b19b6..ef0a7b6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -144,6 +144,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
BPF_PROG_TYPE_LWT_SEG6LOCAL,
BPF_PROG_TYPE_LIRC_MODE2,
+   BPF_PROG_TYPE_SOCKET_SG_FILTER,
 };
 
 enum bpf_attach_type {
@@ -2358,6 +2359,12 @@ enum sk_action {
SK_PASS,
 };
 
+/* use accessible scatterlist */
+struct sg_filter_md {
+   void *data; /* sg_virt(sg) */
+   void *data_end; /* sg_virt(sg) + sg->length */
+};
+
 /* user accessible metadata for SK_MSG packet hook, new fields must
  * be added to the end of this structure
  */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0fa2062..74193a8 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1300,6 +1300,7 @@ static int bpf_prog_load(union bpf_attr *attr)
 
if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
type != BPF_PROG_TYPE_CGROUP_SKB &&
+   type != BPF_PROG_TYPE_SOCKET_SG_FILTER &&
!capable(CAP_SYS_ADMIN))
return -EPERM;
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d6403b5..a00d3eb 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1320,6 +1320,7 @@ static bool may_access_direct_pkt_data(struct 
bpf_verifier_env *env,
case BPF_PROG_TYPE_LWT_XMIT:
case BPF_PROG_TYPE_SK_SKB:
case BPF_PROG_TYPE_SK_MSG:
+   case BPF_PROG_TYPE_SOCKET_SG_FILTER:
if (meta)
return meta->pkt_access;
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 3d9ba7e..8f67942 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1130,7 +1130,8 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
 
 static void __bpf_prog_release(struct bpf_prog *prog)
 {
-   if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
+   if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
+   prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
bpf_prog_put(prog);
} else {
bpf_release_orig_filter(prog);
@@ -1551,10 +1552,16 @@ int sk_reuseport_attach_filter(struct sock_fprog 
*fprog, struct sock *sk)
 
 static struct bpf_prog *__get_bpf(u32 ufd, struct sock *sk)
 {
+   struct bpf_prog *prog;
+
if (sock_flag(sk, SOCK_FILTER_LOCKED))
return ERR_PTR(-EPERM);
 
-   return bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+   prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+   if (IS_ERR(prog))
+   prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_SG_FILTER);
+
+   return prog;
 }
 
 int sk_attach_bpf(u32 ufd, struct sock *sk)
@@ -4710,6 +4717,15 @@ bool bpf_helper_changes_pkt_data(void *func)
 }
 
 static const struct bpf_func_proto *
+socksg_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+   switch (func_id) {
+   default:
+   return bpf_base_func_proto(func_id);
+  

bpfilter compile failure on parisc

2018-06-19 Thread Meelis Roos
Tried enabling bpfilter on parisc, got this:

  HOSTCC  net/bpfilter/main.o
net/bpfilter/main.c:3:21: fatal error: sys/uio.h: No such file or directory
 #include 
 ^
compilation terminated.

-- 
Meelis Roos (mr...@linux.ee)


Re: [PATCH net 5/5] net sched actions: fix misleading text strings in pedit action

2018-06-19 Thread Roman Mashak
Stephen Hemminger  writes:

> On Tue, 19 Jun 2018 12:56:08 -0400

[...]

>> @@ -326,12 +326,12 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
>> tc_action *a,
>>  }
>>  
>>  if (offset % 4) {
>> -pr_info("tc filter pedit offset must be on 32 
>> bit boundaries\n");
>> +pr_info("tc action pedit offset must be on 32 
>> bit boundaries\n");
>>  goto bad;
>>  }
>>  
>>  if (!offset_valid(skb, hoffset + offset)) {
>> -pr_info("tc filter pedit offset %d out of 
>> bounds\n",
>> +pr_info("tc action pedit offset %d out of 
>> bounds\n",
>>  hoffset + offset);
>>  goto bad;
>
> Time to convert these to netlink extack reporting?

Yes, this is planned in next patches.


Re: [PATCH net 5/5] net sched actions: fix misleading text strings in pedit action

2018-06-19 Thread Stephen Hemminger
On Tue, 19 Jun 2018 12:56:08 -0400
Roman Mashak  wrote:

> Change "tc filter pedit .." to "tc actions pedit .." in error
> messages to clearly refer to pedit action.
> 
> Signed-off-by: Roman Mashak 
> ---
>  net/sched/act_pedit.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/net/sched/act_pedit.c b/net/sched/act_pedit.c
> index 3b775f54cee5..caa6927a992c 100644
> --- a/net/sched/act_pedit.c
> +++ b/net/sched/act_pedit.c
> @@ -305,7 +305,7 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
> tc_action *a,
>  
>   rc = pedit_skb_hdr_offset(skb, htype, );
>   if (rc) {
> - pr_info("tc filter pedit bad header type 
> specified (0x%x)\n",
> + pr_info("tc action pedit bad header type 
> specified (0x%x)\n",
>   htype);
>   goto bad;
>   }
> @@ -314,7 +314,7 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
> tc_action *a,
>   char *d, _d;
>  
>   if (!offset_valid(skb, hoffset + tkey->at)) {
> - pr_info("tc filter pedit 'at' offset %d 
> out of bounds\n",
> + pr_info("tc action pedit 'at' offset %d 
> out of bounds\n",
>   hoffset + tkey->at);
>   goto bad;
>   }
> @@ -326,12 +326,12 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
> tc_action *a,
>   }
>  
>   if (offset % 4) {
> - pr_info("tc filter pedit offset must be on 32 
> bit boundaries\n");
> + pr_info("tc action pedit offset must be on 32 
> bit boundaries\n");
>   goto bad;
>   }
>  
>   if (!offset_valid(skb, hoffset + offset)) {
> - pr_info("tc filter pedit offset %d out of 
> bounds\n",
> + pr_info("tc action pedit offset %d out of 
> bounds\n",
>   hoffset + offset);
>   goto bad;

Time to convert these to netlink extack reporting?


[PATCH net 0/5] net sched actions: code style cleanup and fixes

2018-06-19 Thread Roman Mashak
The patchset fixes a few code stylistic issues and typos, as well as one
detected by sparse semantic checker tool.

No functional changes introduced.

Patch 1 & 2 fix coding style bits caught by the checkpatch.pl script
Patch 3 fixes an issue with a shadowed variable
Patch 4 adds sizeof() operator instead of magic number for buffer length
Patch 5 fixes typos in diagnostics messages

Roman Mashak (5):
  net sched actions: fix coding style in pedit action
  net sched actions: fix coding style in pedit headers
  net sched actions: fix sparse warning
  net sched actions: use sizeof operator for buffer length
  net sched actions: fix misleading text strings in pedit action

 include/net/tc_act/tc_pedit.h|  1 +
 include/uapi/linux/tc_act/tc_pedit.h |  9 ++--
 net/sched/act_pedit.c| 41 +++-
 3 files changed, 30 insertions(+), 21 deletions(-)

-- 
2.7.4



[PATCH net 4/5] net sched actions: use sizeof operator for buffer length

2018-06-19 Thread Roman Mashak
Replace constant integer with sizeof() to clearly indicate
the destination buffer length in skb_header_pointer() calls.

Signed-off-by: Roman Mashak 
---
 net/sched/act_pedit.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_pedit.c b/net/sched/act_pedit.c
index 9c2d8a31a5c5..3b775f54cee5 100644
--- a/net/sched/act_pedit.c
+++ b/net/sched/act_pedit.c
@@ -319,7 +319,7 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
tc_action *a,
goto bad;
}
d = skb_header_pointer(skb, hoffset + tkey->at,
-  1, &_d);
+  sizeof(_d), &_d);
if (!d)
goto bad;
offset += (*d & tkey->offmask) >> tkey->shift;
@@ -337,7 +337,7 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
tc_action *a,
}
 
ptr = skb_header_pointer(skb, hoffset + offset,
-4, );
+sizeof(hdata), );
if (!ptr)
goto bad;
/* just do it, baby */
-- 
2.7.4



[PATCH net 3/5] net sched actions: fix sparse warning

2018-06-19 Thread Roman Mashak
The variable _data in include/asm-generic/sections.h defines sections,
this causes sparse warning in pedit:

net/sched/act_pedit.c:293:35: warning: symbol '_data' shadows an earlier one
./include/asm-generic/sections.h:36:13: originally declared here

Therefore rename the variable.

Signed-off-by: Roman Mashak 
---
 net/sched/act_pedit.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/sched/act_pedit.c b/net/sched/act_pedit.c
index e4b29ee79ba8..9c2d8a31a5c5 100644
--- a/net/sched/act_pedit.c
+++ b/net/sched/act_pedit.c
@@ -290,7 +290,7 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
tc_action *a,
enum pedit_cmd cmd = TCA_PEDIT_KEY_EX_CMD_SET;
 
for (i = p->tcfp_nkeys; i > 0; i--, tkey++) {
-   u32 *ptr, _data;
+   u32 *ptr, hdata;
int offset = tkey->off;
int hoffset;
u32 val;
@@ -337,7 +337,7 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
tc_action *a,
}
 
ptr = skb_header_pointer(skb, hoffset + offset,
-4, &_data);
+4, );
if (!ptr)
goto bad;
/* just do it, baby */
@@ -355,7 +355,7 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
tc_action *a,
}
 
*ptr = ((*ptr & tkey->mask) ^ val);
-   if (ptr == &_data)
+   if (ptr == )
skb_store_bits(skb, hoffset + offset, ptr, 4);
}
 
-- 
2.7.4



[PATCH net 2/5] net sched actions: fix coding style in pedit headers

2018-06-19 Thread Roman Mashak
Fix coding style issues in tc pedit headers detected by the
checkpatch script.

Signed-off-by: Roman Mashak 
---
 include/net/tc_act/tc_pedit.h| 1 +
 include/uapi/linux/tc_act/tc_pedit.h | 9 +++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/net/tc_act/tc_pedit.h b/include/net/tc_act/tc_pedit.h
index 227a6f1d02f4..fac3ad4a86de 100644
--- a/include/net/tc_act/tc_pedit.h
+++ b/include/net/tc_act/tc_pedit.h
@@ -17,6 +17,7 @@ struct tcf_pedit {
struct tc_pedit_key *tcfp_keys;
struct tcf_pedit_key_ex *tcfp_keys_ex;
 };
+
 #define to_pedit(a) ((struct tcf_pedit *)a)
 
 static inline bool is_tcf_pedit(const struct tc_action *a)
diff --git a/include/uapi/linux/tc_act/tc_pedit.h 
b/include/uapi/linux/tc_act/tc_pedit.h
index 162d1094c41c..24ec792dacc1 100644
--- a/include/uapi/linux/tc_act/tc_pedit.h
+++ b/include/uapi/linux/tc_act/tc_pedit.h
@@ -17,13 +17,15 @@ enum {
TCA_PEDIT_KEY_EX,
__TCA_PEDIT_MAX
 };
+
 #define TCA_PEDIT_MAX (__TCA_PEDIT_MAX - 1)
-   
 
+
 enum {
TCA_PEDIT_KEY_EX_HTYPE = 1,
TCA_PEDIT_KEY_EX_CMD = 2,
__TCA_PEDIT_KEY_EX_MAX
 };
+
 #define TCA_PEDIT_KEY_EX_MAX (__TCA_PEDIT_KEY_EX_MAX - 1)
 
  /* TCA_PEDIT_KEY_EX_HDR_TYPE_NETWROK is a special case for legacy users. It
@@ -38,6 +40,7 @@ enum pedit_header_type {
TCA_PEDIT_KEY_EX_HDR_TYPE_UDP = 5,
__PEDIT_HDR_TYPE_MAX,
 };
+
 #define TCA_PEDIT_HDR_TYPE_MAX (__PEDIT_HDR_TYPE_MAX - 1)
 
 enum pedit_cmd {
@@ -45,6 +48,7 @@ enum pedit_cmd {
TCA_PEDIT_KEY_EX_CMD_ADD = 1,
__PEDIT_CMD_MAX,
 };
+
 #define TCA_PEDIT_CMD_MAX (__PEDIT_CMD_MAX - 1)
 
 struct tc_pedit_key {
@@ -55,13 +59,14 @@ struct tc_pedit_key {
__u32   offmask;
__u32   shift;
 };
-   
 
+
 struct tc_pedit_sel {
tc_gen;
unsigned char   nkeys;
unsigned char   flags;
struct tc_pedit_key keys[0];
 };
+
 #define tc_pedit tc_pedit_sel
 
 #endif
-- 
2.7.4



[PATCH net 5/5] net sched actions: fix misleading text strings in pedit action

2018-06-19 Thread Roman Mashak
Change "tc filter pedit .." to "tc actions pedit .." in error
messages to clearly refer to pedit action.

Signed-off-by: Roman Mashak 
---
 net/sched/act_pedit.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/sched/act_pedit.c b/net/sched/act_pedit.c
index 3b775f54cee5..caa6927a992c 100644
--- a/net/sched/act_pedit.c
+++ b/net/sched/act_pedit.c
@@ -305,7 +305,7 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
tc_action *a,
 
rc = pedit_skb_hdr_offset(skb, htype, );
if (rc) {
-   pr_info("tc filter pedit bad header type 
specified (0x%x)\n",
+   pr_info("tc action pedit bad header type 
specified (0x%x)\n",
htype);
goto bad;
}
@@ -314,7 +314,7 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
tc_action *a,
char *d, _d;
 
if (!offset_valid(skb, hoffset + tkey->at)) {
-   pr_info("tc filter pedit 'at' offset %d 
out of bounds\n",
+   pr_info("tc action pedit 'at' offset %d 
out of bounds\n",
hoffset + tkey->at);
goto bad;
}
@@ -326,12 +326,12 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
tc_action *a,
}
 
if (offset % 4) {
-   pr_info("tc filter pedit offset must be on 32 
bit boundaries\n");
+   pr_info("tc action pedit offset must be on 32 
bit boundaries\n");
goto bad;
}
 
if (!offset_valid(skb, hoffset + offset)) {
-   pr_info("tc filter pedit offset %d out of 
bounds\n",
+   pr_info("tc action pedit offset %d out of 
bounds\n",
hoffset + offset);
goto bad;
}
@@ -349,7 +349,7 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
tc_action *a,
val = (*ptr + tkey->val) & ~tkey->mask;
break;
default:
-   pr_info("tc filter pedit bad command (%d)\n",
+   pr_info("tc action pedit bad command (%d)\n",
cmd);
goto bad;
}
-- 
2.7.4



[PATCH net 1/5] net sched actions: fix coding style in pedit action

2018-06-19 Thread Roman Mashak
Fix coding style issues in tc pedit action detected by the
checkpatch script.

Signed-off-by: Roman Mashak 
---
 net/sched/act_pedit.c | 29 -
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/net/sched/act_pedit.c b/net/sched/act_pedit.c
index 8a925c72db5f..e4b29ee79ba8 100644
--- a/net/sched/act_pedit.c
+++ b/net/sched/act_pedit.c
@@ -136,15 +136,15 @@ static int tcf_pedit_init(struct net *net, struct nlattr 
*nla,
 {
struct tc_action_net *tn = net_generic(net, pedit_net_id);
struct nlattr *tb[TCA_PEDIT_MAX + 1];
-   struct nlattr *pattr;
-   struct tc_pedit *parm;
-   int ret = 0, err;
-   struct tcf_pedit *p;
struct tc_pedit_key *keys = NULL;
struct tcf_pedit_key_ex *keys_ex;
+   struct tc_pedit *parm;
+   struct nlattr *pattr;
+   struct tcf_pedit *p;
+   int ret = 0, err;
int ksize;
 
-   if (nla == NULL)
+   if (!nla)
return -EINVAL;
 
err = nla_parse_nested(tb, TCA_PEDIT_MAX, nla, pedit_policy, NULL);
@@ -175,7 +175,7 @@ static int tcf_pedit_init(struct net *net, struct nlattr 
*nla,
return ret;
p = to_pedit(*a);
keys = kmalloc(ksize, GFP_KERNEL);
-   if (keys == NULL) {
+   if (!keys) {
tcf_idr_release(*a, bind);
kfree(keys_ex);
return -ENOMEM;
@@ -220,6 +220,7 @@ static void tcf_pedit_cleanup(struct tc_action *a)
 {
struct tcf_pedit *p = to_pedit(a);
struct tc_pedit_key *keys = p->tcfp_keys;
+
kfree(keys);
kfree(p->tcfp_keys_ex);
 }
@@ -284,7 +285,8 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
tc_action *a,
if (p->tcfp_nkeys > 0) {
struct tc_pedit_key *tkey = p->tcfp_keys;
struct tcf_pedit_key_ex *tkey_ex = p->tcfp_keys_ex;
-   enum pedit_header_type htype = 
TCA_PEDIT_KEY_EX_HDR_TYPE_NETWORK;
+   enum pedit_header_type htype =
+   TCA_PEDIT_KEY_EX_HDR_TYPE_NETWORK;
enum pedit_cmd cmd = TCA_PEDIT_KEY_EX_CMD_SET;
 
for (i = p->tcfp_nkeys; i > 0; i--, tkey++) {
@@ -316,16 +318,15 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
tc_action *a,
hoffset + tkey->at);
goto bad;
}
-   d = skb_header_pointer(skb, hoffset + tkey->at, 
1,
-  &_d);
+   d = skb_header_pointer(skb, hoffset + tkey->at,
+  1, &_d);
if (!d)
goto bad;
offset += (*d & tkey->offmask) >> tkey->shift;
}
 
if (offset % 4) {
-   pr_info("tc filter pedit"
-   " offset must be on 32 bit 
boundaries\n");
+   pr_info("tc filter pedit offset must be on 32 
bit boundaries\n");
goto bad;
}
 
@@ -335,7 +336,8 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
tc_action *a,
goto bad;
}
 
-   ptr = skb_header_pointer(skb, hoffset + offset, 4, 
&_data);
+   ptr = skb_header_pointer(skb, hoffset + offset,
+4, &_data);
if (!ptr)
goto bad;
/* just do it, baby */
@@ -358,8 +360,9 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
tc_action *a,
}
 
goto done;
-   } else
+   } else {
WARN(1, "pedit BUG: index %d\n", p->tcf_index);
+   }
 
 bad:
p->tcf_qstats.overlimits++;
-- 
2.7.4



[PATCH net v2] ip: limit use of gso_size to udp

2018-06-19 Thread Willem de Bruijn
From: Willem de Bruijn 

The ipcm(6)_cookie field gso_size is set only in the udp path. The ip
layer copies this to cork only if sk_type is SOCK_DGRAM. This check
proved too permissive. Ping and l2tp sockets have the same type.

Limit to sockets of type SOCK_DGRAM and protocol IPPROTO_UDP to
exclude ping sockets.

v1 -> v2
- remove irrelevant whitespace changes

Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT")
Reported-by: Maciej Żenczykowski 
Signed-off-by: Willem de Bruijn 

---

For net-next, I'll take a look whether ipcm(6)_cookie fields like
these can be initialized uniformly, and then this branch removed
completely.
---
 net/ipv4/ip_output.c  | 3 ++-
 net/ipv6/ip6_output.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index af5a830ff6ad..b3308e9d9762 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1145,7 +1145,8 @@ static int ip_setup_cork(struct sock *sk, struct 
inet_cork *cork,
cork->fragsize = ip_sk_use_pmtu(sk) ?
 dst_mtu(>dst) : rt->dst.dev->mtu;
 
-   cork->gso_size = sk->sk_type == SOCK_DGRAM ? ipc->gso_size : 0;
+   cork->gso_size = sk->sk_type == SOCK_DGRAM &&
+sk->sk_protocol == IPPROTO_UDP ? ipc->gso_size : 0;
cork->dst = >dst;
cork->length = 0;
cork->ttl = ipc->ttl;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 021e5aef6ba3..a14fb4fcdf18 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1219,7 +1219,8 @@ static int ip6_setup_cork(struct sock *sk, struct 
inet_cork_full *cork,
if (mtu < IPV6_MIN_MTU)
return -EINVAL;
cork->base.fragsize = mtu;
-   cork->base.gso_size = sk->sk_type == SOCK_DGRAM ? ipc6->gso_size : 0;
+   cork->base.gso_size = sk->sk_type == SOCK_DGRAM &&
+ sk->sk_protocol == IPPROTO_UDP ? ipc6->gso_size : 
0;
 
if (dst_allfrag(xfrm_dst_path(>dst)))
cork->base.flags |= IPCORK_ALLFRAG;
-- 
2.18.0.rc1.244.gcf134e6275-goog



Re: [PATCH rdma-next v2 00/20] Introduce mlx5 DEVX interface

2018-06-19 Thread Leon Romanovsky
On Tue, Jun 19, 2018 at 07:59:30AM +0300, Leon Romanovsky wrote:
> On Mon, Jun 18, 2018 at 04:05:04PM -0600, Jason Gunthorpe wrote:
> > On Sun, Jun 17, 2018 at 12:59:46PM +0300, Leon Romanovsky wrote:
> >
> > > Leon Romanovsky (2):
> > >   drm/i915: Move u64-to-ptr helpers to general header
> > >   kernel.h: Reuse u64_to_ptr macro to cast __user pointers
> >
> > I dropped these since they are not needed by this series when using a
> > union.
>
> No problem, it was my idea to reuse existing macro, before it was
> hard-coded implementation, but union makes it cleaner.
>
> >
> > > Matan Barak (5):
> > >   IB/uverbs: Export uverbs idr and fd types
> > >   IB/uverbs: Add PTR_IN attributes that are allocated/copied
> > > automatically
> >
> > Revised this one, as noted
>
> Thanks
>
> >
> > >   IB/uverbs: Add a macro to define a type with no kernel known size
> > >   IB/uverbs: Allow an empty namespace in ioctl() framework
> > >   IB/uverbs: Refactor uverbs_finalize_objects
> >
> > I put the above in a branch and can apply them if you ack my revisions..
> >
>
> Except the line "return (void *)attr;", which should be "return 
> ERR_CAST(attr);"
> everything looks reasonable. I didn't test it, but I'm not worried, we will 
> have
> enough time to fix if needed.
>
> > >   net/mlx5_core: Prevent warns in dmesg upon firmware commands
> > >   IB/core: Improve uverbs_cleanup_ucontext algorithm
> >
> > I dropped these two (they are linked), need comments addressed and
> > resent.
>
> They are linked only logically, the second patch will trigger warning
> which is suppressed by first patch. So actually mlx5-net branch will have
> only first patch "net/mlx5_core: Prevent warns in dmesg upon firmware 
> commands"
> and you will apply "IB/core: Improve uverbs_cleanup_ucontext algorithm" in
> your rdma-next.
>
> >
> > > Yishai Hadas (13):
> > >   net/mlx5: Expose DEVX ifc structures
> > >   IB/mlx5: Introduce DEVX
> > >   IB/core: Introduce DECLARE_UVERBS_GLOBAL_METHODS
> > >   IB: Expose ib_ucontext from a given ib_uverbs_file
> > >   IB/mlx5: Add support for DEVX general command
> > >   IB/mlx5: Add obj create and destroy functionality
> > >   IB/mlx5: Add DEVX support for modify and query commands
> > >   IB/mlx5: Add support for DEVX query UAR
> > >   IB/mlx5: Add DEVX support for memory registration
> > >   IB/mlx5: Add DEVX query EQN support
> > >   IB/mlx5: Expose DEVX tree
> >
> > I put these in a branch also and can apply them, but I need the first
> > two patches in the mlx5 core branch first please, thanks.
> >
> > Since this requires so many core patches I think I prefer to merge the
> > mlx core branch then apply rather merge a branch.
>
> So to summarize, I'm applying those three patches to mlx5-next:
>  * net/mlx5_core: Prevent warns in dmesg upon firmware commands
>  * net/mlx5: Expose DEVX ifc structures
>  * IB/mlx5: Introduce DEVX

Updated mlx5-next with two patches and squashed ifc and commands bits
from third commit into second one.

>
> And resend:
>  * IB/core: Improve uverbs_cleanup_ucontext algorithm
>

Resent.

> Thanks
>
> >
> > Jason




signature.asc
Description: PGP signature


[PATCH] ucc_geth: Add BQL support

2018-06-19 Thread Joakim Tjernlund
Signed-off-by: Joakim Tjernlund 
---
 drivers/net/ethernet/freescale/ucc_geth.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/ucc_geth.c 
b/drivers/net/ethernet/freescale/ucc_geth.c
index f77ba9fa257b..6c99a9af6647 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -3096,6 +3096,7 @@ static int ucc_geth_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
 
ugeth_vdbg("%s: IN", __func__);
 
+   netdev_sent_queue(dev, skb->len);
spin_lock_irqsave(>lock, flags);
 
dev->stats.tx_bytes += skb->len;
@@ -3242,6 +3243,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
struct ucc_geth_private *ugeth = netdev_priv(dev);
u8 __iomem *bd; /* BD pointer */
u32 bd_status;
+   int howmany = 0;
+   unsigned int bytes_sent = 0;
 
bd = ugeth->confBd[txQ];
bd_status = in_be32((u32 __iomem *)bd);
@@ -3257,7 +3260,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
skb = ugeth->tx_skbuff[txQ][ugeth->skb_dirtytx[txQ]];
if (!skb)
break;
-
+   howmany++;
+   bytes_sent += skb->len;
dev->stats.tx_packets++;
 
dev_consume_skb_any(skb);
@@ -3279,6 +3283,7 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
bd_status = in_be32((u32 __iomem *)bd);
}
ugeth->confBd[txQ] = bd;
+   netdev_completed_queue(dev, howmany, bytes_sent);
return 0;
 }
 
-- 
2.13.6



Re: [PATCH bpf-net] bpf: Change bpf_fib_lookup to return lookup status

2018-06-19 Thread Martin KaFai Lau
On Tue, Jun 19, 2018 at 09:34:28AM -0600, David Ahern wrote:
> On 6/19/18 9:25 AM, Martin KaFai Lau wrote:
> > On Mon, Jun 18, 2018 at 03:35:25PM -0600, David Ahern wrote:
> >> On 6/18/18 2:55 PM, Martin KaFai Lau wrote:
>   /* rc > 0 case */
>   switch(rc) {
>   case BPF_FIB_LKUP_RET_BLACKHOLE:
>   case BPF_FIB_LKUP_RET_UNREACHABLE:
>   case BPF_FIB_LKUP_RET_PROHIBIT:
>   return XDP_DROP;
>   }
> 
>  For the others it becomes a question of do we share why the stack needs
>  to be involved? Maybe the program wants to collect stats to show traffic
>  patterns that can be improved (BPF_FIB_LKUP_RET_FRAG_NEEDED) or support
>  in the kernel needs to be improved (BPF_FIB_LKUP_RET_UNSUPP_LWT) or an
>  interface is misconfigured (BPF_FIB_LKUP_RET_FWD_DISABLED).
> >>> Thanks for the explanation.
> >>>
> >>> Agree on the bpf able to collect stats will be useful.
> >>>
> >>> I am wondering, if a new BPF_FIB_LKUP_RET_XYZ is added later,
> >>> how may the old xdp_prog work/not-work?  As of now, the return value
> >>> is straight forward, FWD, PASS (to stack) or DROP (error).
> >>> With this change, the xdp_prog needs to match/switch() the
> >>> BPF_FIB_LKUP_RET_* to at least PASS and DROP.
> >>
> >> IMO, programs should only call XDP_DROP for known reasons - like the 3
> >> above. Anything else punt to the stack.
> >>
> >> If a new RET_XYZ comes along:
> >> 1. the new XYZ is a new ACL response where the packet is to be dropped.
> >> If the program does not understand XYZ and punts to the stack
> >> (recommendation), then a second lookup is done during normal packet
> >> processing and the stack drops it.
> >>
> >> 2. the new XYZ is a new path in the kernel that is unsupported with
> >> respect to XDP forwarding, nothing new for the program to do.
> >>
> >> Either way I would expect stats on BPF_FIB_LKUP_RET_* to give a hint to
> >> the program writer.
> >>
> >> Worst case of punting packets to the stack for any rc != 0 means the
> >> stack is doing 2 lookups - 1 in XDP based on its lookup parameters and 1
> >> in normal stack processing - to handle the packet.
> > Instead of having the xdp_prog to follow the meaning of what RET_SYZ is,
> > should the bpf_*_fib_lookup() return value be kept as is such that
> > the xdp_prog is clear what to do.  The reason can be returned in
> > the 'struct bpf_fib_lookup'.  The number of reasons can be extended.
> > If the xdp_prog does not understand a reason, it still will not
> > affect its decision because the return value is clear.
> > I think the situation here is similar to regular syscall which usually
> > uses -1 to clearly states error and errno to spells out the reason.
> > 
> 
> I did consider returning the status in struct bpf_fib_lookup. However,
> it is 64 bytes and can not be extended without a big performance
> penalty, so the only option there is to make an existing entry a union
> the most logical of which is the ifindex. It seemed odd to me to have
> the result by hidden in the struct as a union on ifindex and returning
> the egress index from the function:
> 
> @@ -2625,7 +2636,11 @@ struct bpf_fib_lookup {
> 
> /* total length of packet from network header - used for MTU
> check */
> __u16   tot_len;
> -   __u32   ifindex;  /* L3 device index for lookup */
> +
> +   union {
> +   __u32   ifindex;  /* input: L3 device index for lookup */
> +   __u32   result;   /* output: one of BPF_FIB_LKUP_RET_* */
> +   };
> 
> 
> It seemed more natural to have ifindex stay ifindex and only change
> value on return:
> 
> @@ -2625,7 +2639,11 @@ struct bpf_fib_lookup {
> 
>   /* total length of packet from network header - used for MTU check */
>   __u16   tot_len;
> - __u32   ifindex;  /* L3 device index for lookup */
> +
> + /* input: L3 device index for lookup
> +  * output: nexthop device index from FIB lookup
> +  */
> + __u32   ifindex;
> 
>   union {
>   /* inputs to lookup */
> 
> 
> From a program's perspective:
> 
> rc < 0  -- program is passing incorrect data
> rc == 0 -- packet can be forwarded
> rc > 0  -- packet can not be forwarded.
> 
> BPF programs are not required to track the LKUP_RET values any more than
> a function returning multiple negative values - the caller just checks
> rc < 0 means failure. If the program cares it can look at specific
> values of rc to see the specific value.
> 
> The same applies with the LKUP_RET values - they are there to provide
> insight into why the packet is not forwarded directly if the program
> cares to know why.
hmm...ic. My concern is, the prog can interpret rc > 0 (in this patch) to be
drop vs pass (although we can advise them in bpf.h to always pass if it does
not understand a rc but it is not a strong contract),  it may catch people
a surprise if a xdp_prog suddenly drops everything when running in a
newer kernel where the upper stack can actually handle it.

while 

Re: iproute2 won't compile without AF_VSOCK

2018-06-19 Thread Stephen Hemminger
On Tue, 19 Jun 2018 10:17:45 -0500
Steve Wise  wrote:

> Hey David,
> 
> I'm trying to compile the latest iproute2 on an RHEL-7.3 distro, and it
> fails to compile because AF_VSOCK is not defined.  Should this
> functionality be a configure option to disable it on older distros?
> 
> 
> Thanks,
> 
> Steve.
> 
> 
> 
> misc
>     CC   ss.o
> ss.c:301:27: error: ‘AF_VSOCK’ undeclared here (not in a function)
>    .families = FAMILY_MASK(AF_VSOCK),
>    ^
> ss.c:252:46: note: in definition of macro ‘FAMILY_MASK’
>  #define FAMILY_MASK(family) ((uint64_t)1 << (family))
>   ^
> ss.c:334:2: error: array index in initializer not of integer type
>   [AF_VSOCK] = {
>   ^
> ss.c:334:2: error: (near initialization for ‘default_afs’)
> make[1]: *** [ss.o] Error 1
> make: *** [all] Error 2
> 

Probably should just add an #ifdef to takeout that if not present


Re: [PATCH bpf-net] bpf: Change bpf_fib_lookup to return lookup status

2018-06-19 Thread David Ahern
On 6/19/18 9:25 AM, Martin KaFai Lau wrote:
> On Mon, Jun 18, 2018 at 03:35:25PM -0600, David Ahern wrote:
>> On 6/18/18 2:55 PM, Martin KaFai Lau wrote:
/* rc > 0 case */
switch(rc) {
case BPF_FIB_LKUP_RET_BLACKHOLE:
case BPF_FIB_LKUP_RET_UNREACHABLE:
case BPF_FIB_LKUP_RET_PROHIBIT:
return XDP_DROP;
}

 For the others it becomes a question of do we share why the stack needs
 to be involved? Maybe the program wants to collect stats to show traffic
 patterns that can be improved (BPF_FIB_LKUP_RET_FRAG_NEEDED) or support
 in the kernel needs to be improved (BPF_FIB_LKUP_RET_UNSUPP_LWT) or an
 interface is misconfigured (BPF_FIB_LKUP_RET_FWD_DISABLED).
>>> Thanks for the explanation.
>>>
>>> Agree on the bpf able to collect stats will be useful.
>>>
>>> I am wondering, if a new BPF_FIB_LKUP_RET_XYZ is added later,
>>> how may the old xdp_prog work/not-work?  As of now, the return value
>>> is straight forward, FWD, PASS (to stack) or DROP (error).
>>> With this change, the xdp_prog needs to match/switch() the
>>> BPF_FIB_LKUP_RET_* to at least PASS and DROP.
>>
>> IMO, programs should only call XDP_DROP for known reasons - like the 3
>> above. Anything else punt to the stack.
>>
>> If a new RET_XYZ comes along:
>> 1. the new XYZ is a new ACL response where the packet is to be dropped.
>> If the program does not understand XYZ and punts to the stack
>> (recommendation), then a second lookup is done during normal packet
>> processing and the stack drops it.
>>
>> 2. the new XYZ is a new path in the kernel that is unsupported with
>> respect to XDP forwarding, nothing new for the program to do.
>>
>> Either way I would expect stats on BPF_FIB_LKUP_RET_* to give a hint to
>> the program writer.
>>
>> Worst case of punting packets to the stack for any rc != 0 means the
>> stack is doing 2 lookups - 1 in XDP based on its lookup parameters and 1
>> in normal stack processing - to handle the packet.
> Instead of having the xdp_prog to follow the meaning of what RET_SYZ is,
> should the bpf_*_fib_lookup() return value be kept as is such that
> the xdp_prog is clear what to do.  The reason can be returned in
> the 'struct bpf_fib_lookup'.  The number of reasons can be extended.
> If the xdp_prog does not understand a reason, it still will not
> affect its decision because the return value is clear.
> I think the situation here is similar to regular syscall which usually
> uses -1 to clearly states error and errno to spells out the reason.
> 

I did consider returning the status in struct bpf_fib_lookup. However,
it is 64 bytes and can not be extended without a big performance
penalty, so the only option there is to make an existing entry a union
the most logical of which is the ifindex. It seemed odd to me to have
the result by hidden in the struct as a union on ifindex and returning
the egress index from the function:

@@ -2625,7 +2636,11 @@ struct bpf_fib_lookup {

/* total length of packet from network header - used for MTU
check */
__u16   tot_len;
-   __u32   ifindex;  /* L3 device index for lookup */
+
+   union {
+   __u32   ifindex;  /* input: L3 device index for lookup */
+   __u32   result;   /* output: one of BPF_FIB_LKUP_RET_* */
+   };


It seemed more natural to have ifindex stay ifindex and only change
value on return:

@@ -2625,7 +2639,11 @@ struct bpf_fib_lookup {

/* total length of packet from network header - used for MTU check */
__u16   tot_len;
-   __u32   ifindex;  /* L3 device index for lookup */
+
+   /* input: L3 device index for lookup
+* output: nexthop device index from FIB lookup
+*/
+   __u32   ifindex;

union {
/* inputs to lookup */


>From a program's perspective:

rc < 0  -- program is passing incorrect data
rc == 0 -- packet can be forwarded
rc > 0  -- packet can not be forwarded.

BPF programs are not required to track the LKUP_RET values any more than
a function returning multiple negative values - the caller just checks
rc < 0 means failure. If the program cares it can look at specific
values of rc to see the specific value.

The same applies with the LKUP_RET values - they are there to provide
insight into why the packet is not forwarded directly if the program
cares to know why.


Re: [PATCH bpf-net] bpf: Change bpf_fib_lookup to return lookup status

2018-06-19 Thread Martin KaFai Lau
On Mon, Jun 18, 2018 at 03:35:25PM -0600, David Ahern wrote:
> On 6/18/18 2:55 PM, Martin KaFai Lau wrote:
> >>/* rc > 0 case */
> >>switch(rc) {
> >>case BPF_FIB_LKUP_RET_BLACKHOLE:
> >>case BPF_FIB_LKUP_RET_UNREACHABLE:
> >>case BPF_FIB_LKUP_RET_PROHIBIT:
> >>return XDP_DROP;
> >>}
> >>
> >> For the others it becomes a question of do we share why the stack needs
> >> to be involved? Maybe the program wants to collect stats to show traffic
> >> patterns that can be improved (BPF_FIB_LKUP_RET_FRAG_NEEDED) or support
> >> in the kernel needs to be improved (BPF_FIB_LKUP_RET_UNSUPP_LWT) or an
> >> interface is misconfigured (BPF_FIB_LKUP_RET_FWD_DISABLED).
> > Thanks for the explanation.
> > 
> > Agree on the bpf able to collect stats will be useful.
> > 
> > I am wondering, if a new BPF_FIB_LKUP_RET_XYZ is added later,
> > how may the old xdp_prog work/not-work?  As of now, the return value
> > is straight forward, FWD, PASS (to stack) or DROP (error).
> > With this change, the xdp_prog needs to match/switch() the
> > BPF_FIB_LKUP_RET_* to at least PASS and DROP.
> 
> IMO, programs should only call XDP_DROP for known reasons - like the 3
> above. Anything else punt to the stack.
> 
> If a new RET_XYZ comes along:
> 1. the new XYZ is a new ACL response where the packet is to be dropped.
> If the program does not understand XYZ and punts to the stack
> (recommendation), then a second lookup is done during normal packet
> processing and the stack drops it.
> 
> 2. the new XYZ is a new path in the kernel that is unsupported with
> respect to XDP forwarding, nothing new for the program to do.
> 
> Either way I would expect stats on BPF_FIB_LKUP_RET_* to give a hint to
> the program writer.
> 
> Worst case of punting packets to the stack for any rc != 0 means the
> stack is doing 2 lookups - 1 in XDP based on its lookup parameters and 1
> in normal stack processing - to handle the packet.
Instead of having the xdp_prog to follow the meaning of what RET_SYZ is,
should the bpf_*_fib_lookup() return value be kept as is such that
the xdp_prog is clear what to do.  The reason can be returned in
the 'struct bpf_fib_lookup'.  The number of reasons can be extended.
If the xdp_prog does not understand a reason, it still will not
affect its decision because the return value is clear.
I think the situation here is similar to regular syscall which usually
uses -1 to clearly states error and errno to spells out the reason.

> 
> > 
> >>
> >> Arguably BPF_FIB_LKUP_RET_NO_NHDEV is not needed. See below.
> >>
>  @@ -2612,6 +2613,19 @@ struct bpf_raw_tracepoint_args {
>   #define BPF_FIB_LOOKUP_DIRECT  BIT(0)
>   #define BPF_FIB_LOOKUP_OUTPUT  BIT(1)
>   
>  +enum {
>  +BPF_FIB_LKUP_RET_SUCCESS,  /* lookup successful */
>  +BPF_FIB_LKUP_RET_BLACKHOLE,/* dest is blackholed */
>  +BPF_FIB_LKUP_RET_UNREACHABLE,  /* dest is unreachable */
>  +BPF_FIB_LKUP_RET_PROHIBIT, /* dest not allowed */
>  +BPF_FIB_LKUP_RET_NOT_FWDED,/* pkt is not forwardded */
> >>> BPF_FIB_LKUP_RET_NOT_FWDED is a catch all?
> >>>
> >>
> >> Destination is local. More precisely, the FIB lookup is not unicast so
> >> not forwarded. It could be RTN_LOCAL, RTN_BROADCAST, RTN_ANYCAST, or
> >> RTN_MULTICAST. The next ones -- blackhole, reachable, prohibit -- are
> >> called out.
> > I think it also includes the tbid not found case.
> 
> Another one of those "should never happen scenarios". The user does not
> specify the table; it is retrieved based on device association. Table
> defaults to the main table - which always exists - and any VRF
> enslavement of a device happens after the VRF device creates the table.
> 
> > 
> >>
>  @@ -4252,16 +4277,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, 
>  struct bpf_fib_lookup *params,
>   if (check_mtu) {
>   mtu = ipv6_stub->ip6_mtu_from_fib6(f6i, dst, src);
>   if (params->tot_len > mtu)
>  -return 0;
>  +return BPF_FIB_LKUP_RET_FRAG_NEEDED;
>   }
>   
>   if (f6i->fib6_nh.nh_lwtstate)
>  -return 0;
>  +return BPF_FIB_LKUP_RET_UNSUPP_LWT;
>   
>   if (f6i->fib6_flags & RTF_GATEWAY)
>   *dst = f6i->fib6_nh.nh_gw;
>   
>   dev = f6i->fib6_nh.nh_dev;
>  +if (unlikely(!dev))
>  +return BPF_FIB_LKUP_RET_NO_NHDEV;
> >>> Is this a bug fix?
> >>>
> >>
> >> Difference between IPv4 and IPv6. Making them consistent.
> >>
> >> It is a major BUG in the kernel to reach this point in either protocol
> >> to have a unicast route not tied to a device. IPv4 has checks; v6 does
> >> not. I figured this being new code, why not make bpf_ipv{4,6}_fib_lookup
> >> as close to the same as possible.
> > Make sense.  A comment in the commit log 

[PATCH] bpfilter: ignore binary files

2018-06-19 Thread Matteo Croce
net/bpfilter/bpfilter_umh is a binary file generated when bpfilter is
enabled, add it to .gitignore to avoid committing it.

Fixes: d2ba09c17a064 ("net: add skeleton of bpfilter kernel module")
Signed-off-by: Matteo Croce 
---
 net/bpfilter/.gitignore | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 net/bpfilter/.gitignore

diff --git a/net/bpfilter/.gitignore b/net/bpfilter/.gitignore
new file mode 100644
index ..e97084e3eea2
--- /dev/null
+++ b/net/bpfilter/.gitignore
@@ -0,0 +1 @@
+bpfilter_umh
-- 
2.17.1



Re: Link modes representation in phylib

2018-06-19 Thread Andrew Lunn
> What I propose is that we add 3 link_mode fields in phy_device, and keep
> the legacy fields for now. It would be up to the driver to fill the new
> "supported" field in config_init, kind of like what's done in the
> marvell10g driver.

Hi Maxime

You can do this conversion in the core. If features == 0, and some
bits are set in the features link_mode, do the conversion at probe
time. The same can be done for lp_advertising, when the call into the
drivers read_status() has completed.

> Would that be acceptable ?

It sounds reasonable. Lets see what the code looks like.

   Andrew


iproute2 won't compile without AF_VSOCK

2018-06-19 Thread Steve Wise
Hey David,

I'm trying to compile the latest iproute2 on an RHEL-7.3 distro, and it
fails to compile because AF_VSOCK is not defined.  Should this
functionality be a configure option to disable it on older distros?


Thanks,

Steve.



misc
    CC   ss.o
ss.c:301:27: error: ‘AF_VSOCK’ undeclared here (not in a function)
   .families = FAMILY_MASK(AF_VSOCK),
   ^
ss.c:252:46: note: in definition of macro ‘FAMILY_MASK’
 #define FAMILY_MASK(family) ((uint64_t)1 << (family))
  ^
ss.c:334:2: error: array index in initializer not of integer type
  [AF_VSOCK] = {
  ^
ss.c:334:2: error: (near initialization for ‘default_afs’)
make[1]: *** [ss.o] Error 1
make: *** [all] Error 2



[PATCH] bpfilter: fix build error

2018-06-19 Thread Matteo Croce
bpfilter Makefile assumes that the system locale is en_US, and the
parsing of objdump output fails.
Set LC_ALL=C and, while at it, rewrite the objdump parsing so it spawns
only 2 processes instead of 7.

Fixes: d2ba09c17a064 ("net: add skeleton of bpfilter kernel module")
Signed-off-by: Matteo Croce 
---
 net/bpfilter/Makefile | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index e0bbe7583e58..dd86b022eff0 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -21,8 +21,10 @@ endif
 # which bpfilter_kern.c passes further into umh blob loader at run-time
 quiet_cmd_copy_umh = GEN $@
   cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
-  $(OBJCOPY) -I binary -O `$(OBJDUMP) -f $<|grep format|cut -d' ' -f8` \
-  -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
+  $(OBJCOPY) -I binary \
+  `LC_ALL=C objdump -f net/bpfilter/bpfilter_umh \
+  |awk -F' |,' '/file format/{print "-O",$$NF} \
+  /^architecture:/{print "-B",$$2}'` \
   --rename-section .data=.init.rodata $< $@
 
 $(obj)/bpfilter_umh.o: $(obj)/bpfilter_umh
-- 
2.17.1



Re: [PATCH bpf-net] bpf: Change bpf_fib_lookup to return lookup status

2018-06-19 Thread David Ahern
On 6/18/18 2:55 PM, Martin KaFai Lau wrote:
>>
>> Arguably BPF_FIB_LKUP_RET_NO_NHDEV is not needed. See below.
>>

...

 @@ -4252,16 +4277,19 @@ static int bpf_ipv6_fib_lookup(struct net *net, 
 struct bpf_fib_lookup *params,
if (check_mtu) {
mtu = ipv6_stub->ip6_mtu_from_fib6(f6i, dst, src);
if (params->tot_len > mtu)
 -  return 0;
 +  return BPF_FIB_LKUP_RET_FRAG_NEEDED;
}
  
if (f6i->fib6_nh.nh_lwtstate)
 -  return 0;
 +  return BPF_FIB_LKUP_RET_UNSUPP_LWT;
  
if (f6i->fib6_flags & RTF_GATEWAY)
*dst = f6i->fib6_nh.nh_gw;
  
dev = f6i->fib6_nh.nh_dev;
 +  if (unlikely(!dev))
 +  return BPF_FIB_LKUP_RET_NO_NHDEV;
>>> Is this a bug fix?
>>>
>>
>> Difference between IPv4 and IPv6. Making them consistent.
>>
>> It is a major BUG in the kernel to reach this point in either protocol
>> to have a unicast route not tied to a device. IPv4 has checks; v6 does
>> not. I figured this being new code, why not make bpf_ipv{4,6}_fib_lookup
>> as close to the same as possible.
> Make sense.  A comment in the commit log will be useful if there is a
> re-spin.
> 

Upon further review, I will remove BPF_FIB_LKUP_RET_NO_NHDEV. The dev
check is not needed in either ipv4 or ipv6. For IPv4 after fib_lookup
calls both __mkroute_input and ip_route_output_key_hash_rcu expect dev
to be set for unicast as it should be.


Re: [PATCH bpf-net] bpf: Change bpf_fib_lookup to return lookup status

2018-06-19 Thread David Ahern
On 6/19/18 3:36 AM, Quentin Monnet wrote:
> Since you are about to respin (I think?), could you please also fix the
> formatting in your change to the doc? The "BPF_FIB_LKUP_RET_" is not
> emphasized (and will even cause an error message when producing the man
> page, because of the trailing underscore that gets interpreted in RST),
> and the three cases for the return value are not formatted properly for
> the conversion.
> 
> Something like the following would work:
> 
> ---
>  * Return
>  ** < 0 if any input argument is invalid.
>  **   0 on success (packet is forwarded and nexthop neighbor 
> exists).
>  ** > 0: one of **BPF_FIB_LKUP_RET_** codes on FIB lookup 
> response.
> ---
> 

Will do. thanks for the review.


Re: [PATCH rdma-next 0/3] Dump and fill MKEY

2018-06-19 Thread Leon Romanovsky
On Tue, Jun 19, 2018 at 08:21:06AM -0600, Jason Gunthorpe wrote:
> On Tue, Jun 19, 2018 at 08:47:21AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky 
> >
> > MLX5 IB HCA offers the memory key, dump_fill_mkey to increase
> > performance, when used in a send or receive operations.
> >
> > It is used to force local HCA operations to skip the PCI bus access,
> > while keeping track of the processed length in the ibv_sge handling.
> >
> > In this three patch series, we expose various bits in our HW
> > spec file (mlx5_ifc.h), move unneeded for mlx5_core FW command and
> > export such memory key to user space thought our mlx5-abi header file.
>
> Where is the user space for this?

It will be published in the near future.

Thanks

>
> Jason


Re: [PATCH rdma-next 0/3] Dump and fill MKEY

2018-06-19 Thread Jason Gunthorpe
On Tue, Jun 19, 2018 at 08:47:21AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> MLX5 IB HCA offers the memory key, dump_fill_mkey to increase
> performance, when used in a send or receive operations.
> 
> It is used to force local HCA operations to skip the PCI bus access,
> while keeping track of the processed length in the ibv_sge handling.
> 
> In this three patch series, we expose various bits in our HW
> spec file (mlx5_ifc.h), move unneeded for mlx5_core FW command and
> export such memory key to user space thought our mlx5-abi header file.

Where is the user space for this?

Jason


[PATCH net] net/sched: act_ife: preserve the action control in case of error

2018-06-19 Thread Davide Caratti
in the following script

 # tc actions add action ife encode allow prio pass index 42
 # tc actions replace action ife encode allow tcindex drop index 42

the action control should remain equal to 'pass', if the kernel failed
to replace the TC action. Pospone the assignment of the action control,
to ensure it is not overwritten in the error path of tcf_ife_init().

Fixes: ef6980b6becb ("introduce IFE action")
Signed-off-by: Davide Caratti 
---
 net/sched/act_ife.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 078d52212172..20d7d36b2fc9 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -517,8 +517,6 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla,
saddr = nla_data(tb[TCA_IFE_SMAC]);
}
 
-   ife->tcf_action = parm->action;
-
if (parm->flags & IFE_ENCODE) {
if (daddr)
ether_addr_copy(p->eth_dst, daddr);
@@ -575,6 +573,7 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla,
}
}
 
+   ife->tcf_action = parm->action;
if (exists)
spin_unlock_bh(>tcf_lock);
 
-- 
2.17.1



[PATCH net] net/sched: act_ife: fix recursive lock and idr leak

2018-06-19 Thread Davide Caratti
a recursive lock warning [1] can be observed with the following script,

 # $TC actions add action ife encode allow prio pass index 42
 IFE type 0xED3E
 # $TC actions replace action ife encode allow tcindex pass index 42

in case the kernel was unable to run the last command (e.g. because of
the impossibility to load 'act_meta_skbtcindex'). For a similar reason,
the kernel can leak idr in the error path of tcf_ife_init(), because
tcf_idr_release() is not called after successful idr reservation:

 # $TC actions add action ife encode allow tcindex index 47
 IFE type 0xED3E
 RTNETLINK answers: No such file or directory
 We have an error talking to the kernel
 # $TC actions add action ife encode allow tcindex index 47
 IFE type 0xED3E
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel
 # $TC actions add action ife encode use mark 7 type 0xfefe pass index 47
 IFE type 0xFEFE
 RTNETLINK answers: No space left on device
 We have an error talking to the kernel

Since tcfa_lock is already taken when the action is being edited, a call
to tcf_idr_release() wrongly makes tcf_idr_cleanup() take the same lock
again. On the other hand, tcf_idr_release() needs to be called in the
error path of tcf_ife_init(), to undo the last tcf_idr_create() invocation.
Fix both problems in tcf_ife_init().
Since the cleanup() routine can now be called when ife->params is NULL,
also add a NULL pointer check to avoid calling kfree_rcu(NULL, rcu).

 [1]
 
 WARNING: possible recursive locking detected
 4.17.0-rc4.kasan+ #417 Tainted: GE
 
 tc/3932 is trying to acquire lock:
 5097c9a6 (&(>tcfa_lock)->rlock){+...}, at: 
tcf_ife_cleanup+0x19/0x80 [act_ife]

 but task is already holding lock:
 5097c9a6 (&(>tcfa_lock)->rlock){+...}, at: 
tcf_ife_init+0xf6d/0x13c0 [act_ife]

 other info that might help us debug this:
  Possible unsafe locking scenario:

CPU0

   lock(&(>tcfa_lock)->rlock);
   lock(&(>tcfa_lock)->rlock);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 2 locks held by tc/3932:
  #0: 7ca8e990 (rtnl_mutex){+.+.}, at: tcf_ife_init+0xf61/0x13c0 
[act_ife]
  #1: 5097c9a6 (&(>tcfa_lock)->rlock){+...}, at: 
tcf_ife_init+0xf6d/0x13c0 [act_ife]

 stack backtrace:
 CPU: 3 PID: 3932 Comm: tc Tainted: GE 4.17.0-rc4.kasan+ #417
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 Call Trace:
  dump_stack+0x9a/0xeb
  __lock_acquire+0xf43/0x34a0
  ? debug_check_no_locks_freed+0x2b0/0x2b0
  ? debug_check_no_locks_freed+0x2b0/0x2b0
  ? debug_check_no_locks_freed+0x2b0/0x2b0
  ? __mutex_lock+0x62f/0x1240
  ? kvm_sched_clock_read+0x1a/0x30
  ? sched_clock+0x5/0x10
  ? sched_clock_cpu+0x18/0x170
  ? find_held_lock+0x39/0x1d0
  ? lock_acquire+0x10b/0x330
  lock_acquire+0x10b/0x330
  ? tcf_ife_cleanup+0x19/0x80 [act_ife]
  _raw_spin_lock_bh+0x38/0x70
  ? tcf_ife_cleanup+0x19/0x80 [act_ife]
  tcf_ife_cleanup+0x19/0x80 [act_ife]
  __tcf_idr_release+0xff/0x350
  tcf_ife_init+0xdde/0x13c0 [act_ife]
  ? ife_exit_net+0x290/0x290 [act_ife]
  ? __lock_is_held+0xb4/0x140
  tcf_action_init_1+0x67b/0xad0
  ? tcf_action_dump_old+0xa0/0xa0
  ? sched_clock+0x5/0x10
  ? sched_clock_cpu+0x18/0x170
  ? kvm_sched_clock_read+0x1a/0x30
  ? sched_clock+0x5/0x10
  ? sched_clock_cpu+0x18/0x170
  ? memset+0x1f/0x40
  tcf_action_init+0x30f/0x590
  ? tcf_action_init_1+0xad0/0xad0
  ? memset+0x1f/0x40
  tc_ctl_action+0x48e/0x5e0
  ? mutex_lock_io_nested+0x1160/0x1160
  ? tca_action_gd+0x990/0x990
  ? sched_clock+0x5/0x10
  ? find_held_lock+0x39/0x1d0
  rtnetlink_rcv_msg+0x4da/0x990
  ? validate_linkmsg+0x680/0x680
  ? sched_clock_cpu+0x18/0x170
  ? find_held_lock+0x39/0x1d0
  netlink_rcv_skb+0x127/0x350
  ? validate_linkmsg+0x680/0x680
  ? netlink_ack+0x970/0x970
  ? __kmalloc_node_track_caller+0x304/0x3a0
  netlink_unicast+0x40f/0x5d0
  ? netlink_attachskb+0x580/0x580
  ? _copy_from_iter_full+0x187/0x760
  ? import_iovec+0x90/0x390
  netlink_sendmsg+0x67f/0xb50
  ? netlink_unicast+0x5d0/0x5d0
  ? copy_msghdr_from_user+0x206/0x340
  ? netlink_unicast+0x5d0/0x5d0
  sock_sendmsg+0xb3/0xf0
  ___sys_sendmsg+0x60a/0x8b0
  ? copy_msghdr_from_user+0x340/0x340
  ? lock_downgrade+0x5e0/0x5e0
  ? tty_write_lock+0x18/0x50
  ? kvm_sched_clock_read+0x1a/0x30
  ? sched_clock+0x5/0x10
  ? sched_clock_cpu+0x18/0x170
  ? find_held_lock+0x39/0x1d0
  ? lock_downgrade+0x5e0/0x5e0
  ? lock_acquire+0x10b/0x330
  ? __audit_syscall_entry+0x316/0x690
  ? current_kernel_time64+0x6b/0xd0
  ? __fget_light+0x55/0x1f0
  ? __sys_sendmsg+0xd2/0x170
  __sys_sendmsg+0xd2/0x170
  ? __ia32_sys_shutdown+0x70/0x70
  ? syscall_trace_enter+0x57a/0xd60
  ? rcu_read_lock_sched_held+0xdc/0x110
  ? __bpf_trace_sys_enter+0x10/0x10
  ? do_syscall_64+0x22/0x480
  do_syscall_64+0xa5/0x480
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 RIP: 0033:0x7fd646988ba0
 RSP: 002b:7fffc9fab3c8 EFLAGS: 0246 ORIG_RAX: 

Re: [PATCH v4 net-next] net:sched: add action inheritdsfield to skbedit

2018-06-19 Thread Michel Machado

On 06/19/2018 08:01 AM, Jamal Hadi Salim wrote:

Per my previous comments, why do we need the
TCA_SKBEDIT_FLAGS TLV? Isnt SKBEDIT_F_INHERITDSFIELD
sufficient? i.e in tcf_skbedit_init() check for
d->flags_F_INHERITDSFIELD then set skb->priority
and flags|=SKBEDIT_F_INHERITDSFIELD


   Notice that, different from skbmod, there's no field parm->flags in 
skbedit. Skbedit infers the flags in d->flags from the presence of the 
parameters of each of its actions. But SKBEDIT_F_INHERITDSFIELD has no 
parameter and adding field parm->flags breaks backward compatibility 
with user space as pointed out by Marcelo Ricardo Leitner. Our solution 
was to add TCA_SKBEDIT_FLAGS, so SKBEDIT_F_INHERITDSFIELD and future 
flag-only actions can be added.


[ ]'s
Michel Machado


Re: [PATCH net] xfrm_user: prevent leaking 2 bytes of kernel memory

2018-06-19 Thread Steffen Klassert
On Mon, Jun 18, 2018 at 09:35:07PM -0700, Eric Dumazet wrote:
> struct xfrm_userpolicy_type has two holes, so we should not
> use C99 style initializer.
> 
> KMSAN report:
> 
> BUG: KMSAN: kernel-infoleak in copyout lib/iov_iter.c:140 [inline]
> BUG: KMSAN: kernel-infoleak in _copy_to_iter+0x1b14/0x2800 lib/iov_iter.c:571
> CPU: 1 PID: 4520 Comm: syz-executor841 Not tainted 4.17.0+ #5
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x185/0x1d0 lib/dump_stack.c:113
>  kmsan_report+0x188/0x2a0 mm/kmsan/kmsan.c:1117
>  kmsan_internal_check_memory+0x138/0x1f0 mm/kmsan/kmsan.c:1211
>  kmsan_copy_to_user+0x7a/0x160 mm/kmsan/kmsan.c:1253
>  copyout lib/iov_iter.c:140 [inline]
>  _copy_to_iter+0x1b14/0x2800 lib/iov_iter.c:571
>  copy_to_iter include/linux/uio.h:106 [inline]
>  skb_copy_datagram_iter+0x422/0xfa0 net/core/datagram.c:431
>  skb_copy_datagram_msg include/linux/skbuff.h:3268 [inline]
>  netlink_recvmsg+0x6f1/0x1900 net/netlink/af_netlink.c:1959
>  sock_recvmsg_nosec net/socket.c:802 [inline]
>  sock_recvmsg+0x1d6/0x230 net/socket.c:809
>  ___sys_recvmsg+0x3fe/0x810 net/socket.c:2279
>  __sys_recvmmsg+0x58e/0xe30 net/socket.c:2391
>  do_sys_recvmmsg+0x2a6/0x3e0 net/socket.c:2472
>  __do_sys_recvmmsg net/socket.c:2485 [inline]
>  __se_sys_recvmmsg net/socket.c:2481 [inline]
>  __x64_sys_recvmmsg+0x15d/0x1c0 net/socket.c:2481
>  do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x446ce9
> RSP: 002b:7fc307918db8 EFLAGS: 0293 ORIG_RAX: 012b
> RAX: ffda RBX: 006dbc24 RCX: 00446ce9
> RDX: 000a RSI: 20005040 RDI: 0003
> RBP: 006dbc20 R08: 20004e40 R09: 
> R10: 4000 R11: 0293 R12: 
> R13: 7ffc8d2df32f R14: 7fc3079199c0 R15: 0001
> 
> Uninit was stored to memory at:
>  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:279 [inline]
>  kmsan_save_stack mm/kmsan/kmsan.c:294 [inline]
>  kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:685
>  kmsan_memcpy_origins+0x11d/0x170 mm/kmsan/kmsan.c:527
>  __msan_memcpy+0x109/0x160 mm/kmsan/kmsan_instr.c:413
>  __nla_put lib/nlattr.c:569 [inline]
>  nla_put+0x276/0x340 lib/nlattr.c:627
>  copy_to_user_policy_type net/xfrm/xfrm_user.c:1678 [inline]
>  dump_one_policy+0xbe1/0x1090 net/xfrm/xfrm_user.c:1708
>  xfrm_policy_walk+0x45a/0xd00 net/xfrm/xfrm_policy.c:1013
>  xfrm_dump_policy+0x1c0/0x2a0 net/xfrm/xfrm_user.c:1749
>  netlink_dump+0x9b5/0x1550 net/netlink/af_netlink.c:2226
>  __netlink_dump_start+0x1131/0x1270 net/netlink/af_netlink.c:2323
>  netlink_dump_start include/linux/netlink.h:214 [inline]
>  xfrm_user_rcv_msg+0x8a3/0x9b0 net/xfrm/xfrm_user.c:2577
>  netlink_rcv_skb+0x37e/0x600 net/netlink/af_netlink.c:2448
>  xfrm_netlink_rcv+0xb2/0xf0 net/xfrm/xfrm_user.c:2598
>  netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
>  netlink_unicast+0x1680/0x1750 net/netlink/af_netlink.c:1336
>  netlink_sendmsg+0x104f/0x1350 net/netlink/af_netlink.c:1901
>  sock_sendmsg_nosec net/socket.c:629 [inline]
>  sock_sendmsg net/socket.c:639 [inline]
>  ___sys_sendmsg+0xec8/0x1320 net/socket.c:2117
>  __sys_sendmsg net/socket.c:2155 [inline]
>  __do_sys_sendmsg net/socket.c:2164 [inline]
>  __se_sys_sendmsg net/socket.c:2162 [inline]
>  __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
>  do_syscall_64+0x15b/0x230 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Local variable description: upt.i@dump_one_policy
> Variable was created at:
>  dump_one_policy+0x78/0x1090 net/xfrm/xfrm_user.c:1689
>  xfrm_policy_walk+0x45a/0xd00 net/xfrm/xfrm_policy.c:1013
> 
> Byte 130 of 137 is uninitialized
> Memory access starts at 88019550407f
> 
> Fixes: c0144beaeca42 ("[XFRM] netlink: Use nla_put()/NLA_PUT() variantes")
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot 
> Cc: Steffen Klassert 
> Cc: Herbert Xu 
> ---
>  net/xfrm/xfrm_user.c | 8 +---

It is a fix for xfrm, so I applied it
to the ipsec tree.

Thanks a lot Eric!


Re: [PATCH][v2] xfrm: replace NR_CPU with nr_cpu_ids

2018-06-19 Thread Steffen Klassert
On Tue, Jun 19, 2018 at 09:53:49AM +0200, Florian Westphal wrote:
> Li RongQing  wrote:
> > The default NR_CPUS can be very large, but actual possible nr_cpu_ids
> > usually is very small. For some x86 distribution, the NR_CPUS is 8192
> > and nr_cpu_ids is 4, so replace NR_CPU to save some memory
> 
> Steffen,
> 
> I will soon submit a patch to remove the percpu cache; removal
> improved performance for at least one user (and by quite a sizeable
> amount).
> 
> Would you consider such removal for ipsec or ipsec-next?

I think this removel would better fit to ipsec-next.

> If -next, consider applying this patch for ipsec.git.
> 
> Otherwise consider not applying this, as the code will go away soon.

This patch more an optimization than a fix, so I
considered to apply it to ipsec-next. If you plan
to remove it, I'll wait for that.


Re: [PATCH v4 net-next] net:sched: add action inheritdsfield to skbedit

2018-06-19 Thread Jamal Hadi Salim



Hi Qiaobin,

Per my previous comments, why do we need the
TCA_SKBEDIT_FLAGS TLV? Isnt SKBEDIT_F_INHERITDSFIELD
sufficient? i.e in tcf_skbedit_init() check for
d->flags_F_INHERITDSFIELD then set skb->priority
and flags|=SKBEDIT_F_INHERITDSFIELD

Side note:
Infact the whole flags setting in the init function
seems to be a redundant given all the TLVs have
d->flags also set by user space.
But thats a different patch.

cheers,
jamal

On 12/06/18 11:42 AM, Fu, Qiaobin wrote:

The new action inheritdsfield copies the field DS of
IPv4 and IPv6 packets into skb->priority. This enables
later classification of packets based on the DS field.

v4:
*Not allow setting flags other than the expected ones.

*Allow dumping the pure flags.

Original idea by Jamal Hadi Salim 

Signed-off-by: Qiaobin Fu 
Reviewed-by: Michel Machado 
---

Note that the motivation for this patch is found in the following discussion:
https://www.spinics.net/lists/netdev/msg501061.html
---
diff --git a/include/uapi/linux/tc_act/tc_skbedit.h 
b/include/uapi/linux/tc_act/tc_skbedit.h
index fbcfe27a4e6c..6de6071ebed6 100644
--- a/include/uapi/linux/tc_act/tc_skbedit.h
+++ b/include/uapi/linux/tc_act/tc_skbedit.h
@@ -30,6 +30,7 @@
  #define SKBEDIT_F_MARK0x4
  #define SKBEDIT_F_PTYPE   0x8
  #define SKBEDIT_F_MASK0x10
+#define SKBEDIT_F_INHERITDSFIELD   0x20
  
  struct tc_skbedit {

tc_gen;
@@ -45,6 +46,7 @@ enum {
TCA_SKBEDIT_PAD,
TCA_SKBEDIT_PTYPE,
TCA_SKBEDIT_MASK,
+   TCA_SKBEDIT_FLAGS,
__TCA_SKBEDIT_MAX
  };
  #define TCA_SKBEDIT_MAX (__TCA_SKBEDIT_MAX - 1)
diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index 6138d1d71900..9adbcfa3f5fe 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -23,6 +23,9 @@
  #include 
  #include 
  #include 
+#include 
+#include 
+#include 
  
  #include 

  #include 
@@ -41,6 +44,25 @@ static int tcf_skbedit(struct sk_buff *skb, const struct 
tc_action *a,
  
  	if (d->flags & SKBEDIT_F_PRIORITY)

skb->priority = d->priority;
+   if (d->flags & SKBEDIT_F_INHERITDSFIELD) {
+   int wlen = skb_network_offset(skb);
+
+   switch (tc_skb_protocol(skb)) {
+   case htons(ETH_P_IP):
+   wlen += sizeof(struct iphdr);
+   if (!pskb_may_pull(skb, wlen))
+   goto err;
+   skb->priority = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
+   break;
+
+   case htons(ETH_P_IPV6):
+   wlen += sizeof(struct ipv6hdr);
+   if (!pskb_may_pull(skb, wlen))
+   goto err;
+   skb->priority = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
+   break;
+   }
+   }
if (d->flags & SKBEDIT_F_QUEUE_MAPPING &&
skb->dev->real_num_tx_queues > d->queue_mapping)
skb_set_queue_mapping(skb, d->queue_mapping);
@@ -53,6 +75,10 @@ static int tcf_skbedit(struct sk_buff *skb, const struct 
tc_action *a,
  
  	spin_unlock(>tcf_lock);

return d->tcf_action;
+
+err:
+   spin_unlock(>tcf_lock);
+   return TC_ACT_SHOT;
  }
  
  static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = {

@@ -62,6 +88,7 @@ static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX 
+ 1] = {
[TCA_SKBEDIT_MARK]  = { .len = sizeof(u32) },
[TCA_SKBEDIT_PTYPE] = { .len = sizeof(u16) },
[TCA_SKBEDIT_MASK]  = { .len = sizeof(u32) },
+   [TCA_SKBEDIT_FLAGS] = { .len = sizeof(u64) },
  };
  
  static int tcf_skbedit_init(struct net *net, struct nlattr *nla,

@@ -73,6 +100,7 @@ static int tcf_skbedit_init(struct net *net, struct nlattr 
*nla,
struct tc_skbedit *parm;
struct tcf_skbedit *d;
u32 flags = 0, *priority = NULL, *mark = NULL, *mask = NULL;
+   u64 *pure_flags = NULL;
u16 *queue_mapping = NULL, *ptype = NULL;
bool exists = false;
int ret = 0, err;
@@ -114,6 +142,12 @@ static int tcf_skbedit_init(struct net *net, struct nlattr 
*nla,
mask = nla_data(tb[TCA_SKBEDIT_MASK]);
}
  
+	if (tb[TCA_SKBEDIT_FLAGS] != NULL) {

+   pure_flags = nla_data(tb[TCA_SKBEDIT_FLAGS]);
+   if (*pure_flags & SKBEDIT_F_INHERITDSFIELD)
+   flags |= SKBEDIT_F_INHERITDSFIELD;
+   }
+
parm = nla_data(tb[TCA_SKBEDIT_PARMS]);
  
  	exists = tcf_idr_check(tn, parm->index, a, bind);

@@ -178,6 +212,7 @@ static int tcf_skbedit_dump(struct sk_buff *skb, struct 
tc_action *a,
.action  = d->tcf_action,
};
struct tcf_t t;
+   u64 pure_flags = 0;
  
  	if (nla_put(skb, TCA_SKBEDIT_PARMS, sizeof(opt), ))

goto nla_put_failure;
@@ -196,6 +231,11 @@ static int 

Re: WARNING: CPU: 3 PID: 0 at net/sched/sch_hfsc.c:1388 hfsc_dequeue+0x319/0x350 [sch_hfsc]

2018-06-19 Thread Marco Berizzi
> Il 18 giugno 2018 alle 21.28 Cong Wang  ha scritto:
> Can you test the attached patch?
> 
> It almost certainly fixes the warning, but I am not sure if
> it papers out any other real problem. Please make sure
> hfsc still works as expected. :)

Hi Cong,

Thanks a lot for the patch. I'm building 4.17.2 + your patch.
Within 24 hours I will update you.


[PATCH net] ip: limit use of gso_size to udp

2018-06-19 Thread Willem de Bruijn
From: Willem de Bruijn 

The ipcm(6)_cookie field gso_size is set only in the udp path. The ip
layer copies this to cork only if sk_type is SOCK_DGRAM. This check
proved too permissive. Ping and l2tp sockets have the same type.

Limit to sockets of type SOCK_DGRAM and protocol IPPROTO_UDP to
exclude ping sockets.

Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT")
Reported-by: Maciej Żenczykowski 
Signed-off-by: Willem de Bruijn 

---

For net-next, I'll take a look whether ipcm(6)_cookie fields like
these can be initialized uniformly, and then this branch removed
completely.
---
 net/ipv4/ip_output.c  | 4 ++--
 net/ipv6/ip6_output.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index af5a830ff6ad..66656d9602d0 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1144,8 +1144,8 @@ static int ip_setup_cork(struct sock *sk, struct 
inet_cork *cork,
*rtp = NULL;
cork->fragsize = ip_sk_use_pmtu(sk) ?
 dst_mtu(>dst) : rt->dst.dev->mtu;
-
-   cork->gso_size = sk->sk_type == SOCK_DGRAM ? ipc->gso_size : 0;
+   cork->gso_size = sk->sk_type == SOCK_DGRAM &&
+sk->sk_protocol == IPPROTO_UDP ? ipc->gso_size : 0;
cork->dst = >dst;
cork->length = 0;
cork->ttl = ipc->ttl;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 021e5aef6ba3..bcb549ca8795 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1163,7 +1163,6 @@ static int ip6_setup_cork(struct sock *sk, struct 
inet_cork_full *cork,
struct ipv6_pinfo *np = inet6_sk(sk);
unsigned int mtu;
struct ipv6_txoptions *opt = ipc6->opt;
-
/*
 * setup for corking
 */
@@ -1219,7 +1218,8 @@ static int ip6_setup_cork(struct sock *sk, struct 
inet_cork_full *cork,
if (mtu < IPV6_MIN_MTU)
return -EINVAL;
cork->base.fragsize = mtu;
-   cork->base.gso_size = sk->sk_type == SOCK_DGRAM ? ipc6->gso_size : 0;
+   cork->base.gso_size = sk->sk_type == SOCK_DGRAM &&
+ sk->sk_protocol == IPPROTO_UDP ? ipc6->gso_size : 
0;
 
if (dst_allfrag(xfrm_dst_path(>dst)))
cork->base.flags |= IPCORK_ALLFRAG;
-- 
2.18.0.rc1.244.gcf134e6275-goog



Re: [PATCH bpf-net] bpf: Change bpf_fib_lookup to return lookup status

2018-06-19 Thread Quentin Monnet
Hi David,

2018-06-17 08:18 UTC-0700 ~ dsah...@kernel.org
> From: David Ahern 
> 
> For ACLs implemented using either FIB rules or FIB entries, the BPF
> program needs the FIB lookup status to be able to drop the packet.
> Since the bpf_fib_lookup API has not reached a released kernel yet,
> change the return code to contain an encoding of the FIB lookup
> result and return the nexthop device index in the params struct.
> 
> In addition, inform the BPF program of any post FIB lookup reason as
> to why the packet needs to go up the stack.
> 
> Update the sample program per the change in API.
> 
> Signed-off-by: David Ahern 
> ---
>  include/uapi/linux/bpf.h   | 28 ++
>  net/core/filter.c  | 74 
> --
>  samples/bpf/xdp_fwd_kern.c |  8 ++---
>  3 files changed, 78 insertions(+), 32 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 59b19b6a40d7..ceb80071c341 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1857,7 +1857,8 @@ union bpf_attr {
>   *   is resolved), the nexthop address is returned in ipv4_dst
>   *   or ipv6_dst based on family, smac is set to mac address of
>   *   egress device, dmac is set to nexthop mac address, rt_metric
> - *   is set to metric from route (IPv4/IPv6 only).
> + *   is set to metric from route (IPv4/IPv6 only), and ifindex
> + *   is set to the device index of the nexthop from the FIB lookup.
>   *
>   * *plen* argument is the size of the passed in struct.
>   * *flags* argument can be a combination of one or more of the
> @@ -1873,9 +1874,9 @@ union bpf_attr {
>   * *ctx* is either **struct xdp_md** for XDP programs or
>   * **struct sk_buff** tc cls_act programs.
>   * Return
> - * Egress device index on success, 0 if packet needs to continue
> - * up the stack for further processing or a negative error in 
> case
> - * of failure.
> + *   < 0 if any input argument is invalid
> + * 0 on success (packet is forwarded and nexthop neighbor exists)
> + *   > 0 one of BPF_FIB_LKUP_RET_ codes on FIB lookup response


Since you are about to respin (I think?), could you please also fix the
formatting in your change to the doc? The "BPF_FIB_LKUP_RET_" is not
emphasized (and will even cause an error message when producing the man
page, because of the trailing underscore that gets interpreted in RST),
and the three cases for the return value are not formatted properly for
the conversion.

Something like the following would work:

---
 * Return
 *  * < 0 if any input argument is invalid.
 *  *   0 on success (packet is forwarded and nexthop neighbor 
exists).
 *  * > 0: one of **BPF_FIB_LKUP_RET_** codes on FIB lookup 
response.
---

Thank you,
Quentin


Re: Link modes representation in phylib

2018-06-19 Thread Maxime Chevallier
Hello Andrew,

Thanks for your feedback !

>> I'm currently working on adding support for 2.5GBaseT on some Marvell
>> PHYs (the marvell10g family, including the 88X3310).
>> 
>> However, phylib doesn't quite support these modes yet. Its stores the
>> different supported and advertised modes in u32 fields, which can't
>> contain the relevant values for 2500BaseT mode (and all other modes that
>> come after the 31st one).  
>
>Hi Maxime
>
>Did you look at phylink? I think it already gets this right.  It could
>be, any MAC which needs to use > bit 31 should use phylink, not
>phylib.

Indeed, drivers that use phylink dont directly access these u32 fields.

>That narrows the problem down to just the PHY drivers. We might be
>able to mass convert those. Or maybe we can consider just doing some
>conversion work on PHYs which support > 1Gbps?

I think that we can consider converting only the concerned PHYs for the
moment.

What I propose is that we add 3 link_mode fields in phy_device, and keep
the legacy fields for now. It would be up to the driver to fill the new
"supported" field in config_init, kind of like what's done in the
marvell10g driver.

There already are phy_ethtool_ksettings_{g|s}et accessors, that are
used by phylink so that would easily integrate with the above solution
of only supporting phylink for these modes.

That would involve a bit of info duplication, but I think that would
allow for a smooth transition to a newer representation.

Would that be acceptable ?

Thanks,

Maxime



[PATCH] net: propagate dev_get_valid_name return code

2018-06-19 Thread Li RongQing
if dev_get_valid_name failed, propagate its return code

and remove the setting err to ENODEV, it will be set to
0 again before dev_change_net_namespace exits.

Signed-off-by: Li RongQing 
---
 net/core/dev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1844d9bc5714..1c7a3761ec3c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8661,7 +8661,8 @@ int dev_change_net_namespace(struct net_device *dev, 
struct net *net, const char
/* We get here if we can't use the current device name */
if (!pat)
goto out;
-   if (dev_get_valid_name(net, dev, pat) < 0)
+   err = dev_get_valid_name(net, dev, pat);
+   if (err < 0)
goto out;
}
 
@@ -8673,7 +8674,6 @@ int dev_change_net_namespace(struct net_device *dev, 
struct net *net, const char
dev_close(dev);
 
/* And unlink it from device chain */
-   err = -ENODEV;
unlist_netdevice(dev);
 
synchronize_net();
-- 
2.16.2



Re: array bounds warning in xfrm_output_resume

2018-06-19 Thread Florian Westphal
David Ahern  wrote:
> $ make O=kbuild/perf -j 24 -s
> In file included from /home/dsa/kernel-3.git/include/linux/kernel.h:10:0,
>  from /home/dsa/kernel-3.git/include/linux/list.h:9,
>  from /home/dsa/kernel-3.git/include/linux/module.h:9,
>  from /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c:13:
> /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c: In function
> ‘xfrm_output_resume’:
> /home/dsa/kernel-3.git/include/linux/compiler.h:252:20: warning: array
> subscript is above array bounds [-Warray-bounds]
>__read_once_size(&(x), __u.__c, sizeof(x));  \
> ^~~~
> /home/dsa/kernel-3.git/include/linux/compiler.h:258:22: note: in
> expansion of macro ‘__READ_ONCE’
>  #define READ_ONCE(x) __READ_ONCE(x, 1)
>   ^~~
> /home/dsa/kernel-3.git/include/linux/rcupdate.h:350:48: note: in
> expansion of macro ‘READ_ONCE’
>   typeof(*p) *p1 = (typeof(*p) *__force)READ_ONCE(p); \
> ^
> /home/dsa/kernel-3.git/include/linux/rcupdate.h:487:2: note: in
> expansion of macro ‘__rcu_dereference_check’
>   __rcu_dereference_check((p), (c) || rcu_read_lock_held(), __rcu)
>   ^~~
> /home/dsa/kernel-3.git/include/linux/rcupdate.h:545:28: note: in
> expansion of macro ‘rcu_dereference_check’
>  #define rcu_dereference(p) rcu_dereference_check(p, 0)
> ^
> /home/dsa/kernel-3.git/include/linux/netfilter.h:218:15: note: in
> expansion of macro ‘rcu_dereference’
>hook_head = rcu_dereference(net->nf.hooks_arp[hook]);

Hmpf. compiler can't know that this is only called for ipv4 and
ipv6 families.

> Line in question is the nf_hook in xfrm_output_resume.
> NF_INET_POST_ROUTING = 4 which is greater than NF_ARP_NUMHOOKS = 3
> 
> I believe ef57170bbfdd6 is the commit that introduced the warning

Yes.  I will see how to best fix this, probably needs an explicit
check on skb_dst(skb)->ops->family.


Re: [PATCH][v2] xfrm: replace NR_CPU with nr_cpu_ids

2018-06-19 Thread Florian Westphal
Li RongQing  wrote:
> The default NR_CPUS can be very large, but actual possible nr_cpu_ids
> usually is very small. For some x86 distribution, the NR_CPUS is 8192
> and nr_cpu_ids is 4, so replace NR_CPU to save some memory

Steffen,

I will soon submit a patch to remove the percpu cache; removal
improved performance for at least one user (and by quite a sizeable
amount).

Would you consider such removal for ipsec or ipsec-next?
If -next, consider applying this patch for ipsec.git.

Otherwise consider not applying this, as the code will go away soon.


Thanks,
Florian


Re: [PATCH][v2] xfrm: replace NR_CPU with nr_cpu_ids

2018-06-19 Thread Yunsheng Lin



On 2018/6/19 15:11, Li RongQing wrote:
> The default NR_CPUS can be very large, but actual possible nr_cpu_ids
> usually is very small. For some x86 distribution, the NR_CPUS is 8192
> and nr_cpu_ids is 4, so replace NR_CPU to save some memory
> 
> Signed-off-by: Li RongQing 
> Signed-off-by: Wang Li 
> ---
>  net/xfrm/xfrm_policy.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> index 40b54cc64243..f8188685c1e9 100644
> --- a/net/xfrm/xfrm_policy.c
> +++ b/net/xfrm/xfrm_policy.c
> @@ -2989,11 +2989,11 @@ void __init xfrm_init(void)
>  {
>   int i;
>  
> - xfrm_pcpu_work = kmalloc_array(NR_CPUS, sizeof(*xfrm_pcpu_work),
> + xfrm_pcpu_work = kmalloc_array(nr_cpu_ids, sizeof(*xfrm_pcpu_work),
>  GFP_KERNEL);

It seems that xfrm_pcpu_work is not used anymore, maybe it can be deleted to
save more memory.


>   BUG_ON(!xfrm_pcpu_work);
>  
> - for (i = 0; i < NR_CPUS; i++)
> + for (i = 0; i < nr_cpu_ids; i++)
>   INIT_WORK(_pcpu_work[i], xfrm_pcpu_work_fn);
>  
>   register_pernet_subsys(_net_ops);
> 



Re: [PATCH net 1/3] qed: Fix possible memory leak in Rx error path handling.

2018-06-19 Thread Yunsheng Lin



On 2018/6/19 14:42, Kalluru, Sudarsana wrote:
> 
> 
> -Original Message-
> From: Yunsheng Lin [mailto:linyunsh...@huawei.com] 
> Sent: 19 June 2018 11:32
> To: Kalluru, Sudarsana ; da...@davemloft.net
> Cc: netdev@vger.kernel.org; Elior, Ariel ; Kalderon, 
> Michal 
> Subject: Re: [PATCH net 1/3] qed: Fix possible memory leak in Rx error path 
> handling.
> 
> External Email
> 
> On 2018/6/19 12:58, Sudarsana Reddy Kalluru wrote:
>> Memory for packet buffers need to be freed in the error paths as there 
>> is no consumer (e.g., upper layer) for such packets and that memory 
>> will never get freed.
>> The issue was uncovered when port was attacked with flood of isatap 
>> packets, these are multicast packets hence were directed at all the PFs.
>> For foce PF, this meant they were routed to the ll2 module which in 
>> turn drops such packets.
>>
>> Fixes: 0a7fb11c ("qed: Add Light L2 support")
>> Signed-off-by: Sudarsana Reddy Kalluru 
>> Signed-off-by: Ariel Elior 
>> Signed-off-by: Michal Kalderon 
>> ---
>>  drivers/net/ethernet/qlogic/qed/qed_ll2.c | 11 +--
>>  1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/qlogic/qed/qed_ll2.c 
>> b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
>> index c97ebd6..012973d 100644
>> --- a/drivers/net/ethernet/qlogic/qed/qed_ll2.c
>> +++ b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
>> @@ -201,8 +201,9 @@ void qed_ll2b_complete_rx_packet(void *cxt, struct 
>> qed_ll2_comp_rx_data *data)
>>
>>   skb = build_skb(buffer->data, 0);
>>   if (!skb) {
>> - rc = -ENOMEM;
>> - goto out_post;
>> + DP_INFO(cdev, "Failed to build SKB\n");
>> + kfree(buffer->data);
>> + goto out_post1;
>>   }
>>
>>   data->u.placement_offset += NET_SKB_PAD; @@ -224,8 +225,14 @@ 
>> void qed_ll2b_complete_rx_packet(void *cxt, struct qed_ll2_comp_rx_data 
>> *data)
>>   cdev->ll2->cbs->rx_cb(cdev->ll2->cb_cookie, skb,
>> data->opaque_data_0,
>> data->opaque_data_1);
>> + } else {
>> + DP_VERBOSE(p_hwfn, (NETIF_MSG_RX_STATUS | NETIF_MSG_PKTDATA |
>> + QED_MSG_LL2 | QED_MSG_STORAGE),
>> +"Dropping the packet\n");
>> + kfree(buffer->data);
> 
> What about the memory used by skb itself?
> Does skb need to be freed by kfree_skb or something like that?
> 
> [Sudarsana] Thanks for reviewing the changes. qed_ll2_alloc_buffer() 
> allocates this memory. The allocated buffer (i.e., buffer->data) holds 
> complete memory for 'skb + data' as required by build_skb() implementation. 
> Hence freeing of (buffer->data) would suffice here. 

As I read through the code, the skb itself is allocated by kmem_cache_alloc,
see below:

build_skb -> __build_skb -> kmem_cache_alloc

Hope I am not missing something here.

> 
>>   }
>>
>> +out_post1:
>>   /* Update Buffer information and update FW producer */
>>   buffer->data = new_data;
>>   buffer->phys_addr = new_phys_addr;
>>
> 



[PATCH][v2] xfrm: replace NR_CPU with nr_cpu_ids

2018-06-19 Thread Li RongQing
The default NR_CPUS can be very large, but actual possible nr_cpu_ids
usually is very small. For some x86 distribution, the NR_CPUS is 8192
and nr_cpu_ids is 4, so replace NR_CPU to save some memory

Signed-off-by: Li RongQing 
Signed-off-by: Wang Li 
---
 net/xfrm/xfrm_policy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 40b54cc64243..f8188685c1e9 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2989,11 +2989,11 @@ void __init xfrm_init(void)
 {
int i;
 
-   xfrm_pcpu_work = kmalloc_array(NR_CPUS, sizeof(*xfrm_pcpu_work),
+   xfrm_pcpu_work = kmalloc_array(nr_cpu_ids, sizeof(*xfrm_pcpu_work),
   GFP_KERNEL);
BUG_ON(!xfrm_pcpu_work);
 
-   for (i = 0; i < NR_CPUS; i++)
+   for (i = 0; i < nr_cpu_ids; i++)
INIT_WORK(_pcpu_work[i], xfrm_pcpu_work_fn);
 
register_pernet_subsys(_net_ops);
-- 
2.16.2



RE: [PATCH net 1/3] qed: Fix possible memory leak in Rx error path handling.

2018-06-19 Thread Kalluru, Sudarsana


-Original Message-
From: Yunsheng Lin [mailto:linyunsh...@huawei.com] 
Sent: 19 June 2018 11:32
To: Kalluru, Sudarsana ; da...@davemloft.net
Cc: netdev@vger.kernel.org; Elior, Ariel ; Kalderon, 
Michal 
Subject: Re: [PATCH net 1/3] qed: Fix possible memory leak in Rx error path 
handling.

External Email

On 2018/6/19 12:58, Sudarsana Reddy Kalluru wrote:
> Memory for packet buffers need to be freed in the error paths as there 
> is no consumer (e.g., upper layer) for such packets and that memory 
> will never get freed.
> The issue was uncovered when port was attacked with flood of isatap 
> packets, these are multicast packets hence were directed at all the PFs.
> For foce PF, this meant they were routed to the ll2 module which in 
> turn drops such packets.
>
> Fixes: 0a7fb11c ("qed: Add Light L2 support")
> Signed-off-by: Sudarsana Reddy Kalluru 
> Signed-off-by: Ariel Elior 
> Signed-off-by: Michal Kalderon 
> ---
>  drivers/net/ethernet/qlogic/qed/qed_ll2.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_ll2.c 
> b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
> index c97ebd6..012973d 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_ll2.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
> @@ -201,8 +201,9 @@ void qed_ll2b_complete_rx_packet(void *cxt, struct 
> qed_ll2_comp_rx_data *data)
>
>   skb = build_skb(buffer->data, 0);
>   if (!skb) {
> - rc = -ENOMEM;
> - goto out_post;
> + DP_INFO(cdev, "Failed to build SKB\n");
> + kfree(buffer->data);
> + goto out_post1;
>   }
>
>   data->u.placement_offset += NET_SKB_PAD; @@ -224,8 +225,14 @@ 
> void qed_ll2b_complete_rx_packet(void *cxt, struct qed_ll2_comp_rx_data *data)
>   cdev->ll2->cbs->rx_cb(cdev->ll2->cb_cookie, skb,
> data->opaque_data_0,
> data->opaque_data_1);
> + } else {
> + DP_VERBOSE(p_hwfn, (NETIF_MSG_RX_STATUS | NETIF_MSG_PKTDATA |
> + QED_MSG_LL2 | QED_MSG_STORAGE),
> +"Dropping the packet\n");
> + kfree(buffer->data);

What about the memory used by skb itself?
Does skb need to be freed by kfree_skb or something like that?

[Sudarsana] Thanks for reviewing the changes. qed_ll2_alloc_buffer() allocates 
this memory. The allocated buffer (i.e., buffer->data) holds complete memory 
for 'skb + data' as required by build_skb() implementation. Hence freeing of 
(buffer->data) would suffice here. 

>   }
>
> +out_post1:
>   /* Update Buffer information and update FW producer */
>   buffer->data = new_data;
>   buffer->phys_addr = new_phys_addr;
>



Re: [PATCH net 1/3] qed: Fix possible memory leak in Rx error path handling.

2018-06-19 Thread Yunsheng Lin



On 2018/6/19 12:58, Sudarsana Reddy Kalluru wrote:
> Memory for packet buffers need to be freed in the error paths as there is
> no consumer (e.g., upper layer) for such packets and that memory will never
> get freed.
> The issue was uncovered when port was attacked with flood of isatap
> packets, these are multicast packets hence were directed at all the PFs.
> For foce PF, this meant they were routed to the ll2 module which in turn
> drops such packets.
> 
> Fixes: 0a7fb11c ("qed: Add Light L2 support")
> Signed-off-by: Sudarsana Reddy Kalluru 
> Signed-off-by: Ariel Elior 
> Signed-off-by: Michal Kalderon 
> ---
>  drivers/net/ethernet/qlogic/qed/qed_ll2.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_ll2.c 
> b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
> index c97ebd6..012973d 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_ll2.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
> @@ -201,8 +201,9 @@ void qed_ll2b_complete_rx_packet(void *cxt, struct 
> qed_ll2_comp_rx_data *data)
>  
>   skb = build_skb(buffer->data, 0);
>   if (!skb) {
> - rc = -ENOMEM;
> - goto out_post;
> + DP_INFO(cdev, "Failed to build SKB\n");
> + kfree(buffer->data);
> + goto out_post1;
>   }
>  
>   data->u.placement_offset += NET_SKB_PAD;
> @@ -224,8 +225,14 @@ void qed_ll2b_complete_rx_packet(void *cxt, struct 
> qed_ll2_comp_rx_data *data)
>   cdev->ll2->cbs->rx_cb(cdev->ll2->cb_cookie, skb,
> data->opaque_data_0,
> data->opaque_data_1);
> + } else {
> + DP_VERBOSE(p_hwfn, (NETIF_MSG_RX_STATUS | NETIF_MSG_PKTDATA |
> + QED_MSG_LL2 | QED_MSG_STORAGE),
> +"Dropping the packet\n");
> + kfree(buffer->data);

What about the memory used by skb itself?
Does skb need to be freed by kfree_skb or something like that?

>   }
>  
> +out_post1:
>   /* Update Buffer information and update FW producer */
>   buffer->data = new_data;
>   buffer->phys_addr = new_phys_addr;
>