Re: [PATCH net-next 0/2] tcp: fix high tail latencies in DCTCP

2018-06-30 Thread Lawrence Brakmo
On 6/30/18, 5:26 PM, "netdev-ow...@vger.kernel.org on behalf of Neal Cardwell" 
 wrote:

On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo  wrote:
>
> We have observed high tail latencies when using DCTCP for RPCs as
> compared to using Cubic. For example, in one setup there are 2 hosts
> sending to a 3rd one, with each sender having 3 flows (1 stream,
> 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
> table shows the 99% and 99.9% latencies for both Cubic and dctcp:
>
>Cubic 99%  Cubic 99.9%   dctcp 99%dctcp 99.9%
>   1MB RPCs2.6ms   5.5ms 43ms  208ms
>  10KB RPCs1.1ms   1.3ms 53ms  212ms
>
> Looking at tcpdump traces showed that there are two causes for the
> latency.
>
>   1) RTOs caused by the receiver sending a dup ACK and not ACKing
>  the last (and only) packet sent.
>   2) Delaying ACKs when the sender has a cwnd of 1, so everything
>  pauses for the duration of the delayed ACK.
>
> The first patch fixes the cause of the dup ACKs, not updating DCTCP
> state when an ACK that was initially delayed has been sent with a
> data packet.
>
> The second patch ensures that an ACK is sent immediately when a
> CWR marked packet arrives.
>
> With the patches the latencies for DCTCP now look like:
>
>dctcp 99%  dctcp 99.9%
>   1MB RPCs4.8ms   6.5ms
>  10KB RPCs143us   184us
>
> Note that while the 1MB RPCs tail latencies are higher than Cubic's,
> the 10KB latencies are much smaller than Cubic's. These patches fix
> issues on the receiver, but tcpdump traces indicate there is an
> opportunity to also fix an issue at the sender that adds about 3ms
> to the tail latencies.
>
> The following trace shows the issue that triggers an RTO (fixed by these patches):
>
>Host A sends the last packets of the request
>Host B receives them, and the last packet is marked with congestion (CE)
>Host B sends ACKs for packets not marked with congestion
>Host B sends data packet with reply and ACK for packet marked with
>   congestion (TCP flag ECE)
>Host A receives ACKs with no ECE flag
>Host A receives data packet with ACK for the last packet of request
>   and which has TCP ECE bit set
>Host A sends 1st data packet of the next request with TCP flag CWR
>Host B receives the packet (as seen in tcpdump at B), no CE flag
>Host B sends a dup ACK that also has the TCP ECE flag
>Host A RTO timer fires!
>Host A to send the next packet
>Host A receives an ACK for everything it has sent (i.e. Host B
>   did receive 1st packet of request)
>Host A sends more packets…
>
> [PATCH net-next 1/2] tcp: notify when a delayed ack is sent
> [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives

Thanks, Larry!

I suspect there is a broader problem with "DCTCP-style precise ECE
ACKs" that this patch series does not address.

AFAICT the DCTCP "historical" ACKs for ECE precision, generated in
dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0(), violate the
assumptions of the pre-existing delayed ACK state machine. They
violate those assumptions by rewinding tp->rcv_nxt backwards and then
calling tcp_send_ack(sk). But the existing code path to send an ACK
assumes that any ACK transmission always sends an ACK that accounts
for all received data, so that after sending an ACK we can cancel the
delayed ACK timer. So it seems with DCTCP we can have buggy sequences
where the DCTCP historical ACK causes us to forget that we need to
schedule a delayed ACK (which will force the sender to RTO, as shown
in the trace):

tcp_event_data_recv()
 inet_csk_schedule_ack()  // remember that we need an ACK
 tcp_ecn_check_ce()
  tcp_ca_event() // with CA_EVENT_ECN_IS_CE or CA_EVENT_ECN_NO_CE
dctcp_cwnd_event()
  dctcp_ce_state_0_to_1() or dctcp_ce_state_1_to_0()
   if (... && ca->delayed_ack_reserved)
  tp->rcv_nxt = ca->prior_rcv_nxt;
  tcp_send_ack(sk); // send an ACK, but for old data!
tcp_transmit_skb()
  tcp_event_ack_sent()
inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
// forget that we need a delayed ACK!

AFAICT the first patch in this series, to force an immediate ACK on
any packet with a CWR, papers over this issue of the forgotten delayed
ACK by forcing an immediate ack in __tcp_ack_snd_check(). This ensures
that at least for CWR packets the forgotten delayed ACK does not
matter. But for packets without CWR I believe the buggy "forgotten
delayed ACK" sequence above is still possib

Re: [PATCH net] tcp: prevent bogus FRTO undos with non-SACK flows

2018-06-30 Thread Neal Cardwell
On Fri, Jun 29, 2018 at 3:52 PM Ilpo Järvinen  wrote:
> > +.005 < . 1:1(0) ack 2001 win 257
>
> Why did the receiver send a cumulative ACK only for 2001?

Sorry, you are right Ilpo. Upon further reflection, the packetdrill
scenario I posted is not a realistic one, and I agree we should not
worry about it.

As I mentioned, I ran your patch through all our team's TCP
packetdrill tests, and it passes all of the tests. One of our tests
needed updating, because if there is a non-SACK connection with a
spurious RTO due to a delayed flight of ACKs then the FRTO undo now
happens one ACK later (when we get an ACK that doesn't cover a
retransmit). But that seems fine to me.

I also cooked the new packetdrill test below to explicitly cover this
case you are addressing (please let me know if you have an alternate
suggestion).

Tested-by: Neal Cardwell 
Acked-by: Neal Cardwell 

Thanks!
neal

---

0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 
   +0 > S. 0:0(0) ack 1 
 +.02 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

// Send 3 packets. First is really lost. And the dupacks
// for the data packets that arrived at the receiver are slow in arriving.
   +0 write(4, ..., 3000) = 3000
   +0 > P. 1:3001(3000) ack 1

// RTO and retransmit head. This fills a real loss.
 +.22 > . 1:1001(1000) ack 1

// Dupacks for packets 2 and 3 arrive.
+.02  < . 1:1(0) ack 1 win 257
   +0 < . 1:1(0) ack 1 win 257

// The cumulative ACK for all the data arrives. We do not undo, because
// this is a non-SACK connection, and retransmitted data was ACKed.
// It's good that there's no FRTO undo, since a packet was really lost.
// Because this is non-SACK, tcp_try_undo_recovery() holds CA_Loss
// until something beyond high_seq is ACKed.
+.005 < . 1:1(0) ack 3001 win 257
   +0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }%
   +0 %{ assert tcpi_snd_cwnd == 4, tcpi_snd_cwnd }%
   +0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }%
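
For reference, the non-SACK hold behavior mentioned in the comments above can be
sketched roughly like this (a simplified illustration of tcp_try_undo_recovery(),
not the exact kernel source):

/* Simplified sketch of the "hold CA_Loss until something beyond
 * high_seq is ACKed" behavior for non-SACK (Reno) connections; the
 * real tcp_try_undo_recovery() also handles undo bookkeeping and
 * MIB counters.
 */
static bool tcp_try_undo_recovery_sketch(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);

	if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) {
		/* Hold the old state until something *above* high_seq is
		 * ACKed; for Reno this prevents false fast retransmits
		 * (RFC 2582).
		 */
		return true;	/* stay in the current Loss/Recovery state */
	}
	tcp_set_ca_state(sk, TCP_CA_Open);
	return false;
}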


Re: [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives

2018-06-30 Thread Lawrence Brakmo
On 6/30/18, 11:23 AM, "Neal Cardwell"  wrote:

On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo  wrote:
>
> We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The
> problem is triggered when the last packet of a request arrives CE
> marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
> to 1 (because there are no packets in flight). When the 1st packet of
> the next request arrives, it is sometimes delayed, adding up to 40ms to
> the latency.
>
> This patch ensures that CWR marked data packets arriving will be acked
> immediately.
>
> Signed-off-by: Lawrence Brakmo 
> ---

Thanks, Larry. Ensuring that CWR-marked data packets arriving will be
acked immediately sounds like a good goal to me.

I am wondering if we can have a simpler fix.

The dctcp_ce_state_0_to_1() code is setting the TCP_ECN_DEMAND_CWR
bit in ecn_flags, which disables the code in __tcp_ecn_check_ce() that
would have otherwise used the tcp_enter_quickack_mode() mechanism to
force an ACK:

__tcp_ecn_check_ce():
...
case INET_ECN_CE:
  if (tcp_ca_needs_ecn(sk))
tcp_ca_event(sk, CA_EVENT_ECN_IS_CE);
   // -> dctcp_ce_state_0_to_1()
   // ->  tp->ecn_flags |= TCP_ECN_DEMAND_CWR;

  if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
/* Better not delay acks, sender can have a very low cwnd */
tcp_enter_quickack_mode(sk, 1);
tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
  }
  tp->ecn_flags |= TCP_ECN_SEEN;
  break;

So it seems like the bug here may be that dctcp_ce_state_0_to_1()  is
setting the TCP_ECN_DEMAND_CWR  bit in ecn_flags, when really it
should let its caller, __tcp_ecn_check_ce() set TCP_ECN_DEMAND_CWR, in
which case the code would already properly force an immediate ACK.

So, what if we use this fix instead (not compiled, not tested):

diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 5f5e5936760e..4fecd2824edb 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -152,8 +152,6 @@ static void dctcp_ce_state_0_to_1(struct sock *sk)

ca->prior_rcv_nxt = tp->rcv_nxt;
ca->ce_state = 1;
-
-   tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
 }

 static void dctcp_ce_state_1_to_0(struct sock *sk)

What do you think?

neal

I see two issues. One is that entering quickack mode, as you mentioned, does not
ensure that it will still be on when the CWR arrives. The second issue is that
the problem occurs right after the receiver sends a small reply, which results
in entering pingpong mode right before the sender starts the new request by
sending just one packet (i.e. it forces a delayed ack).

I compiled and tested your patch. Both the 99th and 99.9th percentile latencies are
around 40ms. Looking at the packet traces shows that some CWR marked packets
are not being acked immediately (delayed by 40ms).
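
For reference, the receiver-side idea in patch 2 (ack immediately when arriving
data carries CWR) can be sketched roughly as follows. This is an illustration
only, not the actual patch; the helper name and the reuse of
tcp_enter_quickack_mode() here are assumptions:

/* Illustrative sketch, not the patch under discussion: when an incoming
 * data segment carries CWR, the sender may be down to a cwnd of 1, so
 * request an immediate ACK instead of relying on the delayed-ACK timer.
 */
static void tcp_ecn_accept_cwr_sketch(struct sock *sk, const struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);

	if (tcp_hdr(skb)->cwr) {
		tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;

		/* Only data segments matter; a pure ACK needs no reply. */
		if (TCP_SKB_CB(skb)->seq != TCP_SKB_CB(skb)->end_seq)
			tcp_enter_quickack_mode(sk, 1);
	}
}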

Larry





Re: pull-request: bpf 2018-07-01

2018-06-30 Thread David Miller
From: Daniel Borkmann 
Date: Sun,  1 Jul 2018 01:56:37 +0200

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
 ...
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.


Re: [PATCH net-next 0/2] tcp: fix high tail latencies in DCTCP

2018-06-30 Thread Neal Cardwell
On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo  wrote:
>
> We have observed high tail latencies when using DCTCP for RPCs as
> compared to using Cubic. For example, in one setup there are 2 hosts
> sending to a 3rd one, with each sender having 3 flows (1 stream,
> 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
> table shows the 99% and 99.9% latencies for both Cubic and dctcp:
>
>Cubic 99%  Cubic 99.9%   dctcp 99%dctcp 99.9%
>   1MB RPCs2.6ms   5.5ms 43ms  208ms
>  10KB RPCs1.1ms   1.3ms 53ms  212ms
>
> Looking at tcpdump traces showed that there are two causes for the
> latency.
>
>   1) RTOs caused by the receiver sending a dup ACK and not ACKing
>  the last (and only) packet sent.
>   2) Delaying ACKs when the sender has a cwnd of 1, so everything
>  pauses for the duration of the delayed ACK.
>
> The first patch fixes the cause of the dup ACKs, not updating DCTCP
> state when an ACK that was initially delayed has been sent with a
> data packet.
>
> The second patch ensures that an ACK is sent immediately when a
> CWR marked packet arrives.
>
> With the patches the latencies for DCTCP now look like:
>
>dctcp 99%  dctcp 99.9%
>   1MB RPCs4.8ms   6.5ms
>  10KB RPCs143us   184us
>
> Note that while the 1MB RPCs tail latencies are higher than Cubic's,
> the 10KB latencies are much smaller than Cubic's. These patches fix
> issues on the receiver, but tcpdump traces indicate there is an
> opportunity to also fix an issue at the sender that adds about 3ms
> to the tail latencies.
>
> The following trace shows the issue that triggers an RTO (fixed by these patches):
>
>Host A sends the last packets of the request
>Host B receives them, and the last packet is marked with congestion (CE)
>Host B sends ACKs for packets not marked with congestion
>Host B sends data packet with reply and ACK for packet marked with
>   congestion (TCP flag ECE)
>Host A receives ACKs with no ECE flag
>Host A receives data packet with ACK for the last packet of request
>   and which has TCP ECE bit set
>Host A sends 1st data packet of the next request with TCP flag CWR
>Host B receives the packet (as seen in tcpdump at B), no CE flag
>Host B sends a dup ACK that also has the TCP ECE flag
>Host A RTO timer fires!
>Host A to send the next packet
>Host A receives an ACK for everything it has sent (i.e. Host B
>   did receive 1st packet of request)
>Host A sends more packets…
>
> [PATCH net-next 1/2] tcp: notify when a delayed ack is sent
> [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives

Thanks, Larry!

I suspect there is a broader problem with "DCTCP-style precise ECE
ACKs" that this patch series does not address.

AFAICT the DCTCP "historical" ACKs for ECE precision, generated in
dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0(), violate the
assumptions of the pre-existing delayed ACK state machine. They
violate those assumptions by rewinding tp->rcv_nxt backwards and then
calling tcp_send_ack(sk). But the existing code path to send an ACK
assumes that any ACK transmission always sends an ACK that accounts
for all received data, so that after sending an ACK we can cancel the
delayed ACK timer. So it seems with DCTCP we can have buggy sequences
where the DCTCP historical ACK causes us to forget that we need to
schedule a delayed ACK (which will force the sender to RTO, as shown
in the trace):

tcp_event_data_recv()
 inet_csk_schedule_ack()  // remember that we need an ACK
 tcp_ecn_check_ce()
  tcp_ca_event() // with CA_EVENT_ECN_IS_CE or CA_EVENT_ECN_NO_CE
dctcp_cwnd_event()
  dctcp_ce_state_0_to_1() or dctcp_ce_state_1_to_0()
   if (... && ca->delayed_ack_reserved)
  tp->rcv_nxt = ca->prior_rcv_nxt;
  tcp_send_ack(sk); // send an ACK, but for old data!
tcp_transmit_skb()
  tcp_event_ack_sent()
inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
// forget that we need a delayed ACK!
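
For reference, the DCTCP helper that performs this rewind looks roughly like the
following (a simplified sketch based on dctcp_ce_state_0_to_1() in
net/ipv4/tcp_dctcp.c; treat it as an illustration rather than the exact source):

static void dctcp_ce_state_0_to_1(struct sock *sk)
{
	struct dctcp *ca = inet_csk_ca(sk);
	struct tcp_sock *tp = tcp_sk(sk);

	/* A delayed ACK is outstanding for pre-CE data: emit an ACK for
	 * the old data first (the "historical" ACK), so that the CE-marked
	 * data gets its own ECE-carrying ACK.
	 */
	if (ca->delayed_ack_reserved) {
		u32 tmp_rcv_nxt = tp->rcv_nxt;

		/* Rewind rcv_nxt and ACK only the old data. */
		tp->rcv_nxt = ca->prior_rcv_nxt;
		tcp_send_ack(sk);	/* this path ends up clearing the delayed-ACK timer */

		/* Restore rcv_nxt for the newly received data. */
		tp->rcv_nxt = tmp_rcv_nxt;
	}

	ca->prior_rcv_nxt = tp->rcv_nxt;
	ca->ce_state = 1;

	tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
}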

AFAICT the first patch in this series, to force an immediate ACK on
any packet with a CWR, papers over this issue of the forgotten delayed
ACK by forcing an immediate ack in __tcp_ack_snd_check(). This ensures
that at least for CWR packets the forgotten delayed ACK does not
matter. But for packets without CWR I believe the buggy "forgotten
delayed ACK" sequence above is still possible, and could still plague
DCTCP with high tail latencies in some cases. I suspect that after
your CWR-forces-ACK patch it may not be jumping out in performance
test results because (a) if there is no CWR involved then usually the
cwnd is high enough that missing delayed ACK does not cause a stall,
and (b) if there is a CWR involved then your patch forces an immediate
ACK. But AFAICT that forgotten delayed ACK bug is still there.

What do folks think?

neal


Re: [bpf PATCH v5 0/4] BPF fixes for sockhash

2018-06-30 Thread Daniel Borkmann
On 06/30/2018 03:17 PM, John Fastabend wrote:
> This addresses two syzbot issues that led to identifying (by Eric and
> Wei) a class of bugs where we don't correctly check for IPv4/v6
> sockets and their associated state. The second issue was a locking
> omission in sockhash.
> 
> The first patch addresses IPv6 socks and fixes an error where
> sockhash would overwrite the prot pointer with the IPv4 prot. To fix
> this, build a similar solution to the TLS ULP. Although we continue to
> allow socks in all states, not just ESTABLISH, in this patch set,
> because as Martin points out there should be no issue with this
> on the sockmap ULP because we don't use the ctx in this code. Once
> multiple ULPs coexist we may need to revisit this. However we
> can do this in *next trees.
> 
> The other issue syzbot found was that the tcp_close() handler missed
> locking the hash bucket lock which could result in corrupting the
> sockhash bucket list if delete and close ran at the same time. 
> And also the smap_list_remove() routine was not working correctly
> at all. This was not caught in my testing because in general my
> tests (to date at least; let's add some more robust selftests in
> bpf-next) do things in the "expected" order, create map, add socks,
> delete socks, then tear down maps. The tests we have that do the
> ops out of this order were only working on single maps, not multi-
> maps, so we never saw the issue. Thanks syzbot. The fix is to
> restructure the tcp_close() lock handling. And fix the obvious
> bug in smap_list_remove().
> 
> Finally, during review I noticed the release handler was omitted
> from the upstream code (patch 4) due to an incorrect merge conflict
> fix when I ported the code to latest bpf-next before submitting.
> This would leave references to the map around if the user never
> closes the map.
> 
> v3: rework patches, dropping ESTABLISH check and adding rcu
> annotation along with the smap_list_remove fix
> 
> v4: missed one more case where maps was being accessed without
> the sk_callback_lock, spotted by Martin as well.
> 
> v5: changed to use a specific lock for maps and reduced callback
> lock so that it is only used to guard sk callbacks. I think
> this makes the logic a bit cleaner and avoids confusion
> over what each lock is doing.
> 
> Also big thanks to Martin for the thorough review; he caught at least
> one case where I missed a rcu_call().

Applied it to bpf, thanks John!


pull-request: bpf 2018-07-01

2018-06-30 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) A bpf_fib_lookup() helper fix to change the API before freeze to
   return an encoding of the FIB lookup result and return the nexthop
   device index in the params struct (instead of device index as return
   code that we had before), from David.

2) Various BPF JIT fixes to address syzkaller fallout, that is, do not
   reject progs when set_memory_*() fails since it could still be RO.
   Also arm32 JIT was not using bpf_jit_binary_lock_ro() API which was
   an issue, and a memory leak in s390 JIT found during review, from
   Daniel.

3) Multiple fixes for sockmap/hash to address most of the syzkaller
   triggered bugs. Usage with IPv6 was crashing, a GPF in bpf_tcp_close(),
   a missing sock_map_release() routine to hook up to callbacks, and a
   fix for an omitted bucket lock in sock_close(), from John.

4) Two bpftool fixes to remove duplicated error message on program load,
   and another one to close the libbpf object after program load. One
   additional fix for nfp driver's BPF offload to avoid stopping offload
   completely if replace of program failed, from Jakub.

5) Couple of BPF selftest fixes that bail out in some of the test
   scripts if the user does not have the right privileges, from Jeffrin.

6) Fixes in test_bpf for s390 when CONFIG_BPF_JIT_ALWAYS_ON is set
   where we need to set the flag that some of the test cases are expected
   to fail, from Kleber.

7) Fix to detangle BPF_LIRC_MODE2 dependency from CONFIG_CGROUP_BPF
   since it has no relation to it and lirc2 users often have configs
   without cgroups enabled and thus would not be able to use it, from Sean.

8) Fix a selftest failure in sockmap by removing a useless setrlimit()
   call that would set a too low limit where at the same time we are
   already including bpf_rlimit.h that does the job, from Yonghong.

9) Fix BPF selftest config with missing NET_SCHED, from Anders.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!



The following changes since commit 3739a21e0ef6ac06f46bd38e81daa95e8cb462bc:

  selftests: net: add tcp_inq to gitignore (2018-06-21 15:02:32 +0900)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to bf2b866a2fe2d74558fe4b7bdf63a4bc0afbdf70:

  Merge branch 'bpf-sockmap-fixes' (2018-07-01 01:21:33 +0200)


Alexei Starovoitov (1):
  Merge branch 'bpf-fixes'

Anders Roxell (1):
  selftests: bpf: add missing NET_SCHED to config

Daniel Borkmann (5):
  Merge branch 'bpf-bpftool-fixes'
  bpf, arm32: fix to use bpf_jit_binary_lock_ro api
  bpf, s390: fix potential memleak when later bpf_jit_prog fails
  bpf: undo prog rejection on read-only lock failure
  Merge branch 'bpf-sockmap-fixes'

David Ahern (1):
  bpf: Change bpf_fib_lookup to return lookup status

Jakub Kicinski (3):
  tools: bpftool: remove duplicated error message on prog load
  tools: bpftool: remember to close the libbpf object after prog load
  nfp: bpf: don't stop offload if replace failed

Jeffrin Jose T (3):
  selftests: bpf: notification about privilege required to run test_kmod.sh 
testing script
  selftests: bpf: notification about privilege required to run 
test_lirc_mode2.sh testing script
  selftests: bpf: notification about privilege required to run 
test_lwt_seg6local.sh testing script

John Fastabend (4):
  bpf: sockmap, fix crash when ipv6 sock is added
  bpf: sockmap, fix smap_list_map_remove when psock is in many maps
  bpf: sockhash fix omitted bucket lock in sock_close
  bpf: sockhash, add release routine

Kleber Sacilotto de Souza (1):
  test_bpf: flag tests that cannot be jited on s390

Sean Young (1):
  bpf: fix attach type BPF_LIRC_MODE2 dependency wrt CONFIG_CGROUP_BPF

Yonghong Song (1):
  tools/bpf: fix test_sockmap failure

 arch/arm/net/bpf_jit_32.c |   2 +-
 arch/s390/net/bpf_jit_comp.c  |   1 +
 drivers/media/rc/bpf-lirc.c   |  14 +-
 drivers/net/ethernet/netronome/nfp/bpf/main.c |   6 +-
 include/linux/bpf-cgroup.h|  26 +++
 include/linux/bpf.h   |   8 +
 include/linux/bpf_lirc.h  |   5 +-
 include/linux/filter.h|  56 +
 include/uapi/linux/bpf.h  |  28 ++-
 kernel/bpf/cgroup.c   |  54 +
 kernel/bpf/core.c |  30 +--
 kernel/bpf/sockmap.c  | 254 --
 kernel/bpf/syscall.c  |  99 ++---
 lib/test_bpf.c   

Re: [PATCH bpf] bpf: hash_map: decrement counter on error

2018-06-30 Thread Daniel Borkmann
On 06/29/2018 02:48 PM, Mauricio Vasquez B wrote:
> Decrement the number of elements in the map in case the allocation
> of a new node fails.
> 
> Signed-off-by: Mauricio Vasquez B 

Thanks for the fix, Mauricio!

Could you reply with a Fixes: tag in order to track the commit originally
introducing this bug?

Thanks,
Daniel
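
For context, the pattern being fixed is roughly the following; this is a sketch
whose names only loosely mirror kernel/bpf/hashtab.c, not the literal patch:

/* Sketch: the element counter is bumped before the allocation, so the
 * allocation-failure path has to undo that reservation.
 */
static struct htab_elem *alloc_elem_sketch(struct bpf_htab *htab)
{
	struct htab_elem *l_new;

	if (atomic_inc_return(&htab->count) > htab->map.max_entries) {
		atomic_dec(&htab->count);
		return ERR_PTR(-E2BIG);
	}

	l_new = kmalloc_node(htab->elem_size, GFP_ATOMIC | __GFP_NOWARN,
			     htab->map.numa_node);
	if (!l_new) {
		/* The fix: drop the count taken above when allocation fails. */
		atomic_dec(&htab->count);
		return ERR_PTR(-ENOMEM);
	}
	return l_new;
}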


Re: [PATCH bpf-next 0/8] tools: bpf: updates to bpftool and libbpf

2018-06-30 Thread Daniel Borkmann
On 06/28/2018 11:41 PM, Jakub Kicinski wrote:
> Hi!
> 
> Set of random updates to bpftool and libbpf.  I'm preparing for
> extending bpftool prog load, but there is a good number of
> improvements that can be made before bpf -> bpf-next merge
> helping to keep the later patch set to a manageable size as well.
> 
> First patch is a bpftool build speed improvement.  Next missing
> program types are added to libbpf program type detection by section
> name.  The ability to load programs from '.text' section is restored
> when ELF file doesn't contain any pseudo calls.
> 
> In bpftool I remove my Author comments as unnecessary sign of vanity.
> Last but not least missing option is added to bash completions and
> processing of options in bash completions is improved.

Applied to bpf-next, thanks Jakub!


[PATCH net-next] r8169: remove old PHY reset hack

2018-06-30 Thread Heiner Kallweit
This hack (affecting the non-PCIe models only) was introduced in 2004
to deal with link negotiation failures in 1GBit mode. Based on a
comment in the r8169 vendor driver I assume the issue affects RTL8169sb
in combination with particular 1GBit switch models.

Resetting the PHY every 10s and hoping that one fine day we will manage
to establish the link seems very hacky to me. I'd say:
If 1GBit doesn't work reliably in a user's environment then the user
should remove 1GBit from the advertised modes, e.g. by using
ethtool -s  advertise <10/100 modes>

If the issue affects one chip version only and that with most link
partners, then we could also think of removing 1GBit from the
advertised modes for this chip version in the driver.
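
For illustration, in a phylib-based driver that kind of restriction could be
expressed as below; this is a generic sketch (r8169 does not go through phylib
here) and the helper usage is only an example:

#include <linux/phy.h>

/* Generic sketch: drop the gigabit link modes from a PHY so only
 * 10/100 Mbit modes remain supported (and hence advertised).
 */
static void limit_phy_to_fast_ethernet(struct phy_device *phydev)
{
	phy_set_max_speed(phydev, SPEED_100);
}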

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 57 +---
 1 file changed, 1 insertion(+), 56 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index 72a7778b..b0a06902 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -80,7 +80,6 @@ static const int multicast_filter_limit = 32;
 #define R8169_RX_RING_BYTES(NUM_RX_DESC * sizeof(struct RxDesc))
 
 #define RTL8169_TX_TIMEOUT (6*HZ)
-#define RTL8169_PHY_TIMEOUT(10*HZ)
 
 /* write/read MMIO register */
 #define RTL_W8(tp, reg, val8)  writeb((val8), tp->mmio_addr + (reg))
@@ -703,7 +702,6 @@ enum rtl_flag {
RTL_FLAG_TASK_ENABLED,
RTL_FLAG_TASK_SLOW_PENDING,
RTL_FLAG_TASK_RESET_PENDING,
-   RTL_FLAG_TASK_PHY_PENDING,
RTL_FLAG_MAX
 };
 
@@ -731,7 +729,6 @@ struct rtl8169_private {
dma_addr_t RxPhyAddr;
void *Rx_databuff[NUM_RX_DESC]; /* Rx data buffers */
struct ring_info tx_skb[NUM_TX_DESC];   /* Tx data buffers */
-   struct timer_list timer;
u16 cp_cmd;
 
u16 event_slow;
@@ -1788,20 +1785,7 @@ static int rtl8169_set_speed_xmii(struct net_device *dev,
 static int rtl8169_set_speed(struct net_device *dev,
 u8 autoneg, u16 speed, u8 duplex, u32 advertising)
 {
-   struct rtl8169_private *tp = netdev_priv(dev);
-   int ret;
-
-   ret = rtl8169_set_speed_xmii(dev, autoneg, speed, duplex, advertising);
-   if (ret < 0)
-   goto out;
-
-   if (netif_running(dev) && (autoneg == AUTONEG_ENABLE) &&
-   (advertising & ADVERTISED_1000baseT_Full) &&
-   !pci_is_pcie(tp->pci_dev)) {
-   mod_timer(&tp->timer, jiffies + RTL8169_PHY_TIMEOUT);
-   }
-out:
-   return ret;
+   return rtl8169_set_speed_xmii(dev, autoneg, speed, duplex, advertising);
 }
 
 static netdev_features_t rtl8169_fix_features(struct net_device *dev,
@@ -1888,8 +1872,6 @@ static int rtl8169_set_link_ksettings(struct net_device 
*dev,
cmd->link_modes.advertising))
return -EINVAL;
 
-   del_timer_sync(&tp->timer);
-
rtl_lock_work(tp);
rc = rtl8169_set_speed(dev, cmd->base.autoneg, cmd->base.speed,
   cmd->base.duplex, advertising);
@@ -4293,44 +4275,12 @@ static void rtl_hw_phy_config(struct net_device *dev)
}
 }
 
-static void rtl_phy_work(struct rtl8169_private *tp)
-{
-   struct timer_list *timer = &tp->timer;
-   unsigned long timeout = RTL8169_PHY_TIMEOUT;
-
-   if (rtl8169_xmii_reset_pending(tp)) {
-   /*
-* A busy loop could burn quite a few cycles on nowadays CPU.
-* Let's delay the execution of the timer for a few ticks.
-*/
-   timeout = HZ/10;
-   goto out_mod_timer;
-   }
-
-   if (rtl8169_xmii_link_ok(tp))
-   return;
-
-   netif_dbg(tp, link, tp->dev, "PHY reset until link up\n");
-
-   rtl8169_xmii_reset_enable(tp);
-
-out_mod_timer:
-   mod_timer(timer, jiffies + timeout);
-}
-
 static void rtl_schedule_task(struct rtl8169_private *tp, enum rtl_flag flag)
 {
if (!test_and_set_bit(flag, tp->wk.flags))
schedule_work(&tp->wk.work);
 }
 
-static void rtl8169_phy_timer(struct timer_list *t)
-{
-   struct rtl8169_private *tp = from_timer(tp, t, timer);
-
-   rtl_schedule_task(tp, RTL_FLAG_TASK_PHY_PENDING);
-}
-
 DECLARE_RTL_COND(rtl_phy_reset_cond)
 {
return rtl8169_xmii_reset_pending(tp);
@@ -6909,7 +6859,6 @@ static void rtl_task(struct work_struct *work)
/* XXX - keep rtl_slow_event_work() as first element. */
{ RTL_FLAG_TASK_SLOW_PENDING,   rtl_slow_event_work },
{ RTL_FLAG_TASK_RESET_PENDING,  rtl_reset_work },
-   { RTL_FLAG_TASK_PHY_PENDING,rtl_phy_work }
};
struct rtl8169_private *tp =
container_of(work, struct rtl8169_private, wk.work);
@@ -6982,8 +6931,6 @@ static void rtl8169_down(struct net_device *dev)
 {
struct rtl8169_private *tp = netdev_priv(dev);
 
-   del_timer

Re: Anyone know if strongswan works with vrf?

2018-06-30 Thread David Ahern
On 6/29/18 4:10 PM, Ben Greear wrote:
> Hello,
> 
> We're trying to create lots of strongswan VPN tunnels on network devices
> bound to different VRFs.  We are using Fedora-24 on the client side,
> with a 4.16.15+ kernel
> and updated 'ip' package, etc.
> 
> So far, no luck getting it to work.
> 
> Any idea if this is supported or not?

Kernel side xfrm code does work; been a couple of years since I tried
it. As I recall strongswan needs an update.

Looking at the 'ip xfrm' based scripts, you need to add 'sel dev ${VRF}'
to the state. eg.,

VRF="sel dev blue"

ip xfrm state add src ${REMIP} dst ${MYIP} \
proto esp spi 0x02122b77 reqid 0 mode tunnel \
replay-window 4 replay-oseq 0x4 \
auth-trunc 'hmac(md5)' 0xd94fcfea65fddf21dc6e0d24a0253508 96 \
enc 'cbc(des3_ede)' 0xfc46c20f8048be9725930ff3fb07ac2a91f0347dffeacf62 \
${VRF}


Re: [net-next 01/12] net/mlx5e: Add UDP GSO support

2018-06-30 Thread Boris Pismenny



On 06/30/18 19:06, Boris Pismenny wrote:



On 06/30/18 01:19, Willem de Bruijn wrote:
On Fri, Jun 29, 2018 at 2:24 AM Saeed Mahameed  
wrote:

From: Boris Pismenny 

This patch enables UDP GSO support. We enable this by using two WQEs
the first is a UDP LSO WQE for all segments with equal length, and the
second is for the last segment in case it has different length.
Due to HW limitation, before sending, we must adjust the packet 
length fields.


We measure performance between two Intel(R) Xeon(R) CPU E5-2643 v2 
@3.50GHz

machines connected back-to-back with Connectx4-Lx (40Gbps) NICs.
We compare single stream UDP, UDP GSO and UDP GSO with offload.
Performance:
 | MSS (bytes)   | Throughput (Gbps) | CPU utilization (%)

UDP GSO offload | 1472  | 35.6  | 8%
UDP GSO | 1472  | 25.5  | 17%
UDP | 1472  | 10.2  | 17%
UDP GSO offload | 1024  | 35.6  | 8%
UDP GSO | 1024  | 19.2  | 17%
UDP | 1024  | 5.7   | 17%
UDP GSO offload | 512   | 33.8  | 16%
UDP GSO | 512   | 10.4  | 17%
UDP | 512   | 3.5   | 17%

Very nice results :)

Thanks. We expect to have 100Gbps results by netdev0x12.


+static void mlx5e_udp_gso_prepare_last_skb(struct sk_buff *skb,
+  struct sk_buff *nskb,
+  int remaining)
+{
+   int bytes_needed = remaining, remaining_headlen, remaining_page_offset;
+   int headlen = skb_transport_offset(skb) + sizeof(struct udphdr);
+   int payload_len = remaining + sizeof(struct udphdr);
+   int k = 0, i, j;
+
+   skb_copy_bits(skb, 0, nskb->data, headlen);
+   nskb->dev = skb->dev;
+   skb_reset_mac_header(nskb);
+   skb_set_network_header(nskb, skb_network_offset(skb));
+   skb_set_transport_header(nskb, skb_transport_offset(skb));
+   skb_set_tail_pointer(nskb, headlen);
+
+   /* How many frags do we need? */
+   for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
+   bytes_needed -= skb_frag_size(&skb_shinfo(skb)->frags[i]);
+   k++;
+   if (bytes_needed <= 0)
+   break;
+   }
+
+   /* Fill the first frag and split it if necessary */
+   j = skb_shinfo(skb)->nr_frags - k;
+   remaining_page_offset = -bytes_needed;
+   skb_fill_page_desc(nskb, 0,
+  skb_shinfo(skb)->frags[j].page.p,
+  skb_shinfo(skb)->frags[j].page_offset + remaining_page_offset,
+  skb_shinfo(skb)->frags[j].size - remaining_page_offset);
+
+   skb_frag_ref(skb, j);
+
+   /* Fill the rest of the frags */
+   for (i = 1; i < k; i++) {
+   j = skb_shinfo(skb)->nr_frags - k + i;
+
+   skb_fill_page_desc(nskb, i,
+  skb_shinfo(skb)->frags[j].page.p,
+  skb_shinfo(skb)->frags[j].page_offset,
+  skb_shinfo(skb)->frags[j].size);
+   skb_frag_ref(skb, j);
+   }
+   skb_shinfo(nskb)->nr_frags = k;
+
+   remaining_headlen = remaining - skb->data_len;
+
+   /* headlen contains remaining data? */
+   if (remaining_headlen > 0)
+   skb_copy_bits(skb, skb->len - remaining, nskb->data + headlen,
+ remaining_headlen);
+   nskb->len = remaining + headlen;
+   nskb->data_len = payload_len - sizeof(struct udphdr) +
+   max_t(int, 0, remaining_headlen);
+   nskb->protocol = skb->protocol;
+   if (nskb->protocol == htons(ETH_P_IP)) {
+   ip_hdr(nskb)->id = htons(ntohs(ip_hdr(nskb)->id) +
+skb_shinfo(skb)->gso_segs);
+   ip_hdr(nskb)->tot_len =
+   htons(payload_len + sizeof(struct iphdr));
+   } else {
+   ipv6_hdr(nskb)->payload_len = htons(payload_len);
+   }
+   udp_hdr(nskb)->len = htons(payload_len);
+   skb_shinfo(nskb)->gso_size = 0;
+   nskb->ip_summed = skb->ip_summed;
+   nskb->csum_start = skb->csum_start;
+   nskb->csum_offset = skb->csum_offset;
+   nskb->queue_mapping = skb->queue_mapping;
+}
+
+/* might send skbs and update wqe and pi */
+struct sk_buff *mlx5e_udp_gso_handle_tx_skb(struct net_device *netdev,
+   struct mlx5e_txqsq *sq,
+   struct sk_buff *skb,
+   struct mlx5e_tx_wqe **wqe,
+   u16 *pi)
+{
+   int payload_len = skb_shinfo(skb)->gso_size + sizeof(struct udphdr);
+   int headlen = skb_transport_offset(skb) + sizeof(struct udphdr);
+   int remaining = (skb->len - headlen) % skb_shinfo(skb)->gso_size;
+   struct sk_buff *nskb;
+
+   if (skb->protocol == htons(ETH_P_IP))
+   ip_hdr(skb)->tot_le

Re: [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives

2018-06-30 Thread Neal Cardwell
On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo  wrote:
>
> We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The
> problem is triggered when the last packet of a request arrives CE
> marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
> to 1 (because there are no packets in flight). When the 1st packet of
> the next request arrives, it is sometimes delayed, adding up to 40ms to
> the latency.
>
> This patch ensures that CWR marked data packets arriving will be acked
> immediately.
>
> Signed-off-by: Lawrence Brakmo 
> ---

Thanks, Larry. Ensuring that CWR-marked data packets arriving will be
acked immediately sounds like a good goal to me.

I am wondering if we can have a simpler fix.

The dctcp_ce_state_0_to_1() code is setting the TCP_ECN_DEMAND_CWR
bit in ecn_flags, which disables the code in __tcp_ecn_check_ce() that
would have otherwise used the tcp_enter_quickack_mode() mechanism to
force an ACK:

__tcp_ecn_check_ce():
...
case INET_ECN_CE:
  if (tcp_ca_needs_ecn(sk))
tcp_ca_event(sk, CA_EVENT_ECN_IS_CE);
   // -> dctcp_ce_state_0_to_1()
   // ->  tp->ecn_flags |= TCP_ECN_DEMAND_CWR;

  if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
/* Better not delay acks, sender can have a very low cwnd */
tcp_enter_quickack_mode(sk, 1);
tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
  }
  tp->ecn_flags |= TCP_ECN_SEEN;
  break;

So it seems like the bug here may be that dctcp_ce_state_0_to_1()  is
setting the TCP_ECN_DEMAND_CWR  bit in ecn_flags, when really it
should let its caller, __tcp_ecn_check_ce() set TCP_ECN_DEMAND_CWR, in
which case the code would already properly force an immediate ACK.

So, what if we use this fix instead (not compiled, not tested):

diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 5f5e5936760e..4fecd2824edb 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -152,8 +152,6 @@ static void dctcp_ce_state_0_to_1(struct sock *sk)

ca->prior_rcv_nxt = tp->rcv_nxt;
ca->ce_state = 1;
-
-   tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
 }

 static void dctcp_ce_state_1_to_0(struct sock *sk)

What do you think?

neal


[PATCH net-next 2/5] net: gemini: Improve connection prints

2018-06-30 Thread Linus Walleij
Instead of just specifying that a PHY is connected at some
speed, also specify which one. This is helpful with several
PHYs on the system.

Signed-off-by: Linus Walleij 
---
 drivers/net/ethernet/cortina/gemini.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/cortina/gemini.c 
b/drivers/net/ethernet/cortina/gemini.c
index 8fc31723f700..b49ed8964026 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -300,23 +300,26 @@ static void gmac_speed_set(struct net_device *netdev)
status.bits.speed = GMAC_SPEED_1000;
if (phydev->interface == PHY_INTERFACE_MODE_RGMII)
status.bits.mii_rmii = GMAC_PHY_RGMII_1000;
-   netdev_info(netdev, "connect to RGMII @ 1Gbit\n");
+   netdev_info(netdev, "connect %s to RGMII @ 1Gbit\n",
+   phydev_name(phydev));
break;
case 100:
status.bits.speed = GMAC_SPEED_100;
if (phydev->interface == PHY_INTERFACE_MODE_RGMII)
status.bits.mii_rmii = GMAC_PHY_RGMII_100_10;
-   netdev_info(netdev, "connect to RGMII @ 100 Mbit\n");
+   netdev_info(netdev, "connect %s to RGMII @ 100 Mbit\n",
+   phydev_name(phydev));
break;
case 10:
status.bits.speed = GMAC_SPEED_10;
if (phydev->interface == PHY_INTERFACE_MODE_RGMII)
status.bits.mii_rmii = GMAC_PHY_RGMII_100_10;
-   netdev_info(netdev, "connect to RGMII @ 10 Mbit\n");
+   netdev_info(netdev, "connect %s to RGMII @ 10 Mbit\n",
+   phydev_name(phydev));
break;
default:
-   netdev_warn(netdev, "Not supported PHY speed (%d)\n",
-   phydev->speed);
+   netdev_warn(netdev, "Unsupported PHY speed (%d) on %s\n",
+   phydev->speed, phydev_name(phydev));
}
 
if (phydev->duplex == DUPLEX_FULL) {
-- 
2.17.1



[PATCH net-next 4/5] net: gemini: Move main init to port

2018-06-30 Thread Linus Walleij
The initialization sequence for the ethernet, setting up
interrupt routing and such things, needs to be done after
both the ports are clocked and reset. Before this the
config will not "take". Move the initialization to the
port probe function and keep track of init status in
the state.

Signed-off-by: Linus Walleij 
---
 drivers/net/ethernet/cortina/gemini.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cortina/gemini.c 
b/drivers/net/ethernet/cortina/gemini.c
index 8d192fcd51c8..79324bbfd768 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -146,6 +146,7 @@ struct gemini_ethernet {
void __iomem *base;
struct gemini_ethernet_port *port0;
struct gemini_ethernet_port *port1;
+   bool initialized;
 
spinlock_t  irq_lock; /* Locks IRQ-related registers */
unsigned intfreeq_order;
@@ -2301,6 +2302,14 @@ static void gemini_port_remove(struct 
gemini_ethernet_port *port)
 
 static void gemini_ethernet_init(struct gemini_ethernet *geth)
 {
+   /* Only do this once both ports are online */
+   if (geth->initialized)
+   return;
+   if (geth->port0 && geth->port1)
+   geth->initialized = true;
+   else
+   return;
+
writel(0, geth->base + GLOBAL_INTERRUPT_ENABLE_0_REG);
writel(0, geth->base + GLOBAL_INTERRUPT_ENABLE_1_REG);
writel(0, geth->base + GLOBAL_INTERRUPT_ENABLE_2_REG);
@@ -2447,6 +2456,10 @@ static int gemini_ethernet_port_probe(struct 
platform_device *pdev)
geth->port0 = port;
else
geth->port1 = port;
+
+   /* This will just be done once both ports are up and reset */
+   gemini_ethernet_init(geth);
+
platform_set_drvdata(pdev, port);
 
/* Set up and register the netdev */
@@ -2564,7 +2577,6 @@ static int gemini_ethernet_probe(struct platform_device 
*pdev)
 
spin_lock_init(&geth->irq_lock);
spin_lock_init(&geth->freeq_lock);
-   gemini_ethernet_init(geth);
 
/* The children will use this */
platform_set_drvdata(pdev, geth);
@@ -2577,8 +2589,8 @@ static int gemini_ethernet_remove(struct platform_device 
*pdev)
 {
struct gemini_ethernet *geth = platform_get_drvdata(pdev);
 
-   gemini_ethernet_init(geth);
geth_cleanup_freeq(geth);
+   geth->initialized = false;
 
return 0;
 }
-- 
2.17.1



[PATCH net-next 3/5] net: gemini: Allow multiple ports to instantiate

2018-06-30 Thread Linus Walleij
The code was not tested with two ports actually in use at
the same time. (I blame this on lack of actual hardware using
that feature.) Now, after locating a system using both ports,
add the necessary fix to make both ports come up.

Signed-off-by: Linus Walleij 
---
 drivers/net/ethernet/cortina/gemini.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cortina/gemini.c 
b/drivers/net/ethernet/cortina/gemini.c
index b49ed8964026..8d192fcd51c8 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -1787,7 +1787,10 @@ static int gmac_open(struct net_device *netdev)
phy_start(netdev->phydev);
 
err = geth_resize_freeq(port);
-   if (err) {
+   /* It's fine if it's just busy, the other port has set up
+* the freeq in that case.
+*/
+   if (err && (err != -EBUSY)) {
netdev_err(netdev, "could not resize freeq\n");
goto err_stop_phy;
}
-- 
2.17.1



[PATCH net-next 5/5] net: gemini: Indicate that we can handle jumboframes

2018-06-30 Thread Linus Walleij
The hardware supposedly handles frames up to 10236 bytes and the
driver implements .ndo_change_mtu(), so accept 10236 minus the
ethernet header of a VLAN tagged frame on the netdevices.

Signed-off-by: Linus Walleij 
---
 drivers/net/ethernet/cortina/gemini.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/cortina/gemini.c 
b/drivers/net/ethernet/cortina/gemini.c
index 79324bbfd768..ae475393e4ac 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -2473,6 +2473,11 @@ static int gemini_ethernet_port_probe(struct 
platform_device *pdev)
 
netdev->hw_features = GMAC_OFFLOAD_FEATURES;
netdev->features |= GMAC_OFFLOAD_FEATURES | NETIF_F_GRO;
+   /* We can handle jumbo frames up to 10236 bytes so, let's accept
+* payloads of 10236 bytes minus VLAN and ethernet header
+*/
+   netdev->min_mtu = 256;
+   netdev->max_mtu = 10236 - VLAN_ETH_HLEN;
 
port->freeq_refill = 0;
netif_napi_add(netdev, &port->napi, gmac_napi_poll,
-- 
2.17.1



[PATCH net-next 1/5] net: gemini: Look up L3 maxlen from table

2018-06-30 Thread Linus Walleij
The code to calculate the hardware register enumerator
for the maximum L3 length isn't entirely simple to read.
Use the existing defines and rewrite the function into a
table look-up.

Signed-off-by: Linus Walleij 
---
 drivers/net/ethernet/cortina/gemini.c | 61 ---
 1 file changed, 46 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/cortina/gemini.c 
b/drivers/net/ethernet/cortina/gemini.c
index 6d7404f66f84..8fc31723f700 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -401,26 +401,57 @@ static int gmac_setup_phy(struct net_device *netdev)
return 0;
 }
 
-static int gmac_pick_rx_max_len(int max_l3_len)
-{
-   /* index = CONFIG_MAXLEN_XXX values */
-   static const int max_len[8] = {
-   1536, 1518, 1522, 1542,
-   9212, 10236, 1518, 1518
-   };
-   int i, n = 5;
+/* The maximum frame length is not logically enumerated in the
+ * hardware, so we do a table lookup to find the applicable max
+ * frame length.
+ */
+struct gmac_max_framelen {
+   unsigned int max_l3_len;
+   u8 val;
+};
 
-   max_l3_len += ETH_HLEN + VLAN_HLEN;
+static const struct gmac_max_framelen gmac_maxlens[] = {
+   {
+   .max_l3_len = 1518,
+   .val = CONFIG0_MAXLEN_1518,
+   },
+   {
+   .max_l3_len = 1522,
+   .val = CONFIG0_MAXLEN_1522,
+   },
+   {
+   .max_l3_len = 1536,
+   .val = CONFIG0_MAXLEN_1536,
+   },
+   {
+   .max_l3_len = 1542,
+   .val = CONFIG0_MAXLEN_1542,
+   },
+   {
+   .max_l3_len = 9212,
+   .val = CONFIG0_MAXLEN_9k,
+   },
+   {
+   .max_l3_len = 10236,
+   .val = CONFIG0_MAXLEN_10k,
+   },
+};
+
+static int gmac_pick_rx_max_len(unsigned int max_l3_len)
+{
+   const struct gmac_max_framelen *maxlen;
+   int maxtot;
+   int i;
 
-   if (max_l3_len > max_len[n])
-   return -1;
+   maxtot = max_l3_len + ETH_HLEN + VLAN_HLEN;
 
-   for (i = 0; i < 5; i++) {
-   if (max_len[i] >= max_l3_len && max_len[i] < max_len[n])
-   n = i;
+   for (i = 0; i < ARRAY_SIZE(gmac_maxlens); i++) {
+   maxlen = &gmac_maxlens[i];
+   if (maxtot <= maxlen->max_l3_len)
+   return maxlen->val;
}
 
-   return n;
+   return -1;
 }
 
 static int gmac_init(struct net_device *netdev)
-- 
2.17.1



Re: [net-next 01/12] net/mlx5e: Add UDP GSO support

2018-06-30 Thread Boris Pismenny




On 06/30/18 01:19, Willem de Bruijn wrote:

On Fri, Jun 29, 2018 at 2:24 AM Saeed Mahameed  wrote:

From: Boris Pismenny 

This patch enables UDP GSO support. We enable this by using two WQEs
the first is a UDP LSO WQE for all segments with equal length, and the
second is for the last segment in case it has different length.
Due to HW limitation, before sending, we must adjust the packet length fields.

We measure performance between two Intel(R) Xeon(R) CPU E5-2643 v2 @3.50GHz
machines connected back-to-back with Connectx4-Lx (40Gbps) NICs.
We compare single stream UDP, UDP GSO and UDP GSO with offload.
Performance:
 | MSS (bytes)   | Throughput (Gbps) | CPU utilization (%)
UDP GSO offload | 1472  | 35.6  | 8%
UDP GSO | 1472  | 25.5  | 17%
UDP | 1472  | 10.2  | 17%
UDP GSO offload | 1024  | 35.6  | 8%
UDP GSO | 1024  | 19.2  | 17%
UDP | 1024  | 5.7   | 17%
UDP GSO offload | 512   | 33.8  | 16%
UDP GSO | 512   | 10.4  | 17%
UDP | 512   | 3.5   | 17%

Very nice results :)

Thanks. We expect to have 100Gbps results by netdev0x12.


+static void mlx5e_udp_gso_prepare_last_skb(struct sk_buff *skb,
+  struct sk_buff *nskb,
+  int remaining)
+{
+   int bytes_needed = remaining, remaining_headlen, remaining_page_offset;
+   int headlen = skb_transport_offset(skb) + sizeof(struct udphdr);
+   int payload_len = remaining + sizeof(struct udphdr);
+   int k = 0, i, j;
+
+   skb_copy_bits(skb, 0, nskb->data, headlen);
+   nskb->dev = skb->dev;
+   skb_reset_mac_header(nskb);
+   skb_set_network_header(nskb, skb_network_offset(skb));
+   skb_set_transport_header(nskb, skb_transport_offset(skb));
+   skb_set_tail_pointer(nskb, headlen);
+
+   /* How many frags do we need? */
+   for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
+   bytes_needed -= skb_frag_size(&skb_shinfo(skb)->frags[i]);
+   k++;
+   if (bytes_needed <= 0)
+   break;
+   }
+
+   /* Fill the first frag and split it if necessary */
+   j = skb_shinfo(skb)->nr_frags - k;
+   remaining_page_offset = -bytes_needed;
+   skb_fill_page_desc(nskb, 0,
+  skb_shinfo(skb)->frags[j].page.p,
+  skb_shinfo(skb)->frags[j].page_offset + remaining_page_offset,
+  skb_shinfo(skb)->frags[j].size - remaining_page_offset);
+
+   skb_frag_ref(skb, j);
+
+   /* Fill the rest of the frags */
+   for (i = 1; i < k; i++) {
+   j = skb_shinfo(skb)->nr_frags - k + i;
+
+   skb_fill_page_desc(nskb, i,
+  skb_shinfo(skb)->frags[j].page.p,
+  skb_shinfo(skb)->frags[j].page_offset,
+  skb_shinfo(skb)->frags[j].size);
+   skb_frag_ref(skb, j);
+   }
+   skb_shinfo(nskb)->nr_frags = k;
+
+   remaining_headlen = remaining - skb->data_len;
+
+   /* headlen contains remaining data? */
+   if (remaining_headlen > 0)
+   skb_copy_bits(skb, skb->len - remaining, nskb->data + headlen,
+ remaining_headlen);
+   nskb->len = remaining + headlen;
+   nskb->data_len =  payload_len - sizeof(struct udphdr) +
+   max_t(int, 0, remaining_headlen);
+   nskb->protocol = skb->protocol;
+   if (nskb->protocol == htons(ETH_P_IP)) {
+   ip_hdr(nskb)->id = htons(ntohs(ip_hdr(nskb)->id) +
+skb_shinfo(skb)->gso_segs);
+   ip_hdr(nskb)->tot_len =
+   htons(payload_len + sizeof(struct iphdr));
+   } else {
+   ipv6_hdr(nskb)->payload_len = htons(payload_len);
+   }
+   udp_hdr(nskb)->len = htons(payload_len);
+   skb_shinfo(nskb)->gso_size = 0;
+   nskb->ip_summed = skb->ip_summed;
+   nskb->csum_start = skb->csum_start;
+   nskb->csum_offset = skb->csum_offset;
+   nskb->queue_mapping = skb->queue_mapping;
+}
+
+/* might send skbs and update wqe and pi */
+struct sk_buff *mlx5e_udp_gso_handle_tx_skb(struct net_device *netdev,
+   struct mlx5e_txqsq *sq,
+   struct sk_buff *skb,
+   struct mlx5e_tx_wqe **wqe,
+   u16 *pi)
+{
+   int payload_len = skb_shinfo(skb)->gso_size + sizeof(struct udphdr);
+   int headlen = skb_transport_offset(skb) + sizeof(struct udphdr);
+   int remaining = (skb->len - headlen) % skb_shinfo(s

[PATCH net] net: fix use-after-free in GRO with ESP

2018-06-30 Thread Sabrina Dubroca
Since the addition of GRO for ESP, gro_receive can consume the skb and
return -EINPROGRESS. In that case, the lower layer GRO handler cannot
touch the skb anymore.

Commit 5f114163f2f5 ("net: Add a skb_gro_flush_final helper.") converted
some of the gro_receive handlers that can lead to ESP's gro_receive so
that they wouldn't access the skb when -EINPROGRESS is returned, but
missed other spots, mainly in tunneling protocols.

This patch finishes the conversion to using skb_gro_flush_final(), and
adds a new helper, skb_gro_flush_final_remcsum(), used in VXLAN and
GUE.

Fixes: 5f114163f2f5 ("net: Add a skb_gro_flush_final helper.")
Signed-off-by: Sabrina Dubroca 
Reviewed-by: Stefano Brivio 
---
 drivers/net/geneve.c  |  2 +-
 drivers/net/vxlan.c   |  4 +---
 include/linux/netdevice.h | 20 
 net/8021q/vlan.c  |  2 +-
 net/ipv4/fou.c|  4 +---
 net/ipv4/gre_offload.c|  2 +-
 net/ipv4/udp_offload.c|  2 +-
 7 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 750eaa53bf0c..ada33c2d9ac2 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -476,7 +476,7 @@ static struct sk_buff **geneve_gro_receive(struct sock *sk,
 out_unlock:
rcu_read_unlock();
 out:
-   NAPI_GRO_CB(skb)->flush |= flush;
+   skb_gro_flush_final(skb, pp, flush);
 
return pp;
 }
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index aee0e60471f1..f6bb1d54d4bd 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -623,9 +623,7 @@ static struct sk_buff **vxlan_gro_receive(struct sock *sk,
flush = 0;
 
 out:
-   skb_gro_remcsum_cleanup(skb, &grc);
-   skb->remcsum_offload = 0;
-   NAPI_GRO_CB(skb)->flush |= flush;
+   skb_gro_flush_final_remcsum(skb, pp, flush, &grc);
 
return pp;
 }
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec9850c7936..3d0cc0b5cec2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2789,11 +2789,31 @@ static inline void skb_gro_flush_final(struct sk_buff 
*skb, struct sk_buff **pp,
if (PTR_ERR(pp) != -EINPROGRESS)
NAPI_GRO_CB(skb)->flush |= flush;
 }
+static inline void skb_gro_flush_final_remcsum(struct sk_buff *skb,
+  struct sk_buff **pp,
+  int flush,
+  struct gro_remcsum *grc)
+{
+   if (PTR_ERR(pp) != -EINPROGRESS) {
+   NAPI_GRO_CB(skb)->flush |= flush;
+   skb_gro_remcsum_cleanup(skb, grc);
+   skb->remcsum_offload = 0;
+   }
+}
 #else
 static inline void skb_gro_flush_final(struct sk_buff *skb, struct sk_buff 
**pp, int flush)
 {
NAPI_GRO_CB(skb)->flush |= flush;
 }
+static inline void skb_gro_flush_final_remcsum(struct sk_buff *skb,
+  struct sk_buff **pp,
+  int flush,
+  struct gro_remcsum *grc)
+{
+   NAPI_GRO_CB(skb)->flush |= flush;
+   skb_gro_remcsum_cleanup(skb, grc);
+   skb->remcsum_offload = 0;
+}
 #endif
 
 static inline int dev_hard_header(struct sk_buff *skb, struct net_device *dev,
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 73a65789271b..8ccee3d01822 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -693,7 +693,7 @@ static struct sk_buff **vlan_gro_receive(struct sk_buff 
**head,
 out_unlock:
rcu_read_unlock();
 out:
-   NAPI_GRO_CB(skb)->flush |= flush;
+   skb_gro_flush_final(skb, pp, flush);
 
return pp;
 }
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 1540db65241a..c9ec1603666b 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -448,9 +448,7 @@ static struct sk_buff **gue_gro_receive(struct sock *sk,
 out_unlock:
rcu_read_unlock();
 out:
-   NAPI_GRO_CB(skb)->flush |= flush;
-   skb_gro_remcsum_cleanup(skb, &grc);
-   skb->remcsum_offload = 0;
+   skb_gro_flush_final_remcsum(skb, pp, flush, &grc);
 
return pp;
 }
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index 1859c473b21a..6a7d980105f6 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -223,7 +223,7 @@ static struct sk_buff **gre_gro_receive(struct sk_buff 
**head,
 out_unlock:
rcu_read_unlock();
 out:
-   NAPI_GRO_CB(skb)->flush |= flush;
+   skb_gro_flush_final(skb, pp, flush);
 
return pp;
 }
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 92dc9e5a7ff3..69c54540d5b4 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -394,7 +394,7 @@ struct sk_buff **udp_gro_receive(struct sk_buff **head, 
struct sk_buff *skb,
 out_unlock:
rcu_read_unlock();
 out:
-   NAPI_GRO_CB(skb)->flush |= flush;
+   skb_gro_flush_final(skb, pp, flush);
return pp;
 }
 EXPORT_SYMBOL(udp_

[bpf PATCH 1/2] bpf: sockmap, error path can not release psock in multi-map case

2018-06-30 Thread John Fastabend
The current code, in the error path of sock_hash_ctx_update_elem,
checks if the sock has a psock in the user data and if so decrements
the reference count of the psock. However, if the error happens early
in the error path we may have never incremented the psock reference
count and if the psock exists because the sock is in another map then
we may inadvertently decrement the reference count.

Fix this by making the error path only call smap_release_sock if the
error happens after the increment.

Reported-by: syzbot+d464d2c20c717ef5a...@syzkaller.appspotmail.com
Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 4fc2cb1..63fb047 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1896,7 +1896,7 @@ static int __sock_map_ctx_update_elem(struct bpf_map *map,
e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN);
if (!e) {
err = -ENOMEM;
-   goto out_progs;
+   goto out_free;
}
}
 
@@ -2324,7 +2324,12 @@ static int sock_hash_ctx_update_elem(struct 
bpf_sock_ops_kern *skops,
if (err)
goto err;
 
-   /* bpf_map_update_elem() can be called in_irq() */
+   psock = smap_psock_sk(sock);
+   if (unlikely(!psock)) {
+   err = -EINVAL;
+   goto err;
+   }
+
raw_spin_lock_bh(&b->lock);
l_old = lookup_elem_raw(head, hash, key, key_size);
if (l_old && map_flags == BPF_NOEXIST) {
@@ -2342,12 +2347,6 @@ static int sock_hash_ctx_update_elem(struct 
bpf_sock_ops_kern *skops,
goto bucket_err;
}
 
-   psock = smap_psock_sk(sock);
-   if (unlikely(!psock)) {
-   err = -EINVAL;
-   goto bucket_err;
-   }
-
rcu_assign_pointer(e->hash_link, l_new);
rcu_assign_pointer(e->htab,
   container_of(map, struct bpf_htab, map));
@@ -2370,12 +2369,10 @@ static int sock_hash_ctx_update_elem(struct 
bpf_sock_ops_kern *skops,
raw_spin_unlock_bh(&b->lock);
return 0;
 bucket_err:
+   smap_release_sock(psock, sock);
raw_spin_unlock_bh(&b->lock);
 err:
kfree(e);
-   psock = smap_psock_sk(sock);
-   if (psock)
-   smap_release_sock(psock, sock);
return err;
 }
 



[bpf PATCH 0/2] sockmap, syzbot fix error path and RCU fix

2018-06-30 Thread John Fastabend
  This applies on top of "BPF fixes for sockhash"; I just didn't
  want to confuse that series yet again by re-ordering/adding
  these patches in it.

I missed fixing the error path in the sockhash code to align with
supporting socks in multiple maps. Simply checking if the psock is
present does not mean we can decrement the reference count because
it could be part of another map. Fix this by cleaning up the error
path so this situation does not happen.

As far as I know this should be the last fix to the fallout from
relaxing the single map restriction. Sorry about the multiple fixes
but these patches were all written before the initial submission then
converted and I missed this detail. But at least we caught these early
in the net cycle. Will continue reviewing/testing however to see if
we catch anything else.

Also we need one more series to check ESTABLISH state as Eric noted.
That will be sent out shortly just going over the patches once more.
The ESTABLISH/unhash fix is also needed in kTLS.

---

John Fastabend (2):
  bpf: sockmap, error path can not release psock in multi-map case
  bpf: sockmap, hash table is RCU so readers do not need locks


 0 files changed

--
Signature


[bpf PATCH 2/2] bpf: sockmap, hash table is RCU so readers do not need locks

2018-06-30 Thread John Fastabend
This removes locking from readers of the RCU hash table. It's not
necessary.

Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 63fb047..12ac10a 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -2451,10 +2451,8 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map 
*map, void *key)
b = __select_bucket(htab, hash);
head = &b->head;
 
-   raw_spin_lock_bh(&b->lock);
l = lookup_elem_raw(head, hash, key, key_size);
sk = l ? l->sk : NULL;
-   raw_spin_unlock_bh(&b->lock);
return sk;
 }
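
For background, the read-side pattern being relied on here is the usual RCU hash
lookup; a generic sketch (not sockmap code), under the assumption that callers,
like BPF programs, already run inside an RCU read-side critical section:

#include <linux/rculist.h>

struct example_elem {
	struct hlist_node node;
	u32 key;
};

/* Generic RCU read-side lookup sketch: no bucket spinlock is needed to
 * walk the chain, only an RCU read-side critical section.
 */
static struct example_elem *example_lookup_rcu(struct hlist_head *head, u32 key)
{
	struct example_elem *e;

	WARN_ON_ONCE(!rcu_read_lock_held());

	hlist_for_each_entry_rcu(e, head, node)
		if (e->key == key)
			return e;
	return NULL;
}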
 



[bpf PATCH v5 2/4] bpf: sockmap, fix smap_list_map_remove when psock is in many maps

2018-06-30 Thread John Fastabend
If a hashmap is free'd with open socks it removes the reference to
the hash entry from the psock. If that is the last reference to the
psock then it will also be free'd by the reference counting logic.
However the current logic that removes the hash reference from the
list of references is broken. In smap_list_remove() we first check
if the sockmap entry matches and then check if the hashmap entry
matches. But the sockmap entry will always match because it is NULL in
this case, which causes the first entry to be removed from the list.
If this is always the "right" entry (because the user adds/removes
entries in order) then everything is OK but otherwise a subsequent
bpf_tcp_close() may reference a free'd object.

To fix this, create two list handlers, one for sockmap and one for
sockhash.
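
For illustration, the pre-fix predicate (removed below) was:

    if (e->entry == entry || e->hash_link == hash_link) {
            list_del(&e->list);
            break;
    }

When called for a hash element, @entry is NULL, and every hash-linked
map entry also has e->entry == NULL, so the first such entry on the
list matches and is removed regardless of its hash_link.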

Reported-by: syzbot+0ce137753c78f7b6a...@syzkaller.appspotmail.com
Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Acked-by: Martin KaFai Lau 
Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |   34 ++
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index d7fd17a..5adff4b 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1602,17 +1602,27 @@ static struct bpf_map *sock_map_alloc(union bpf_attr 
*attr)
return ERR_PTR(err);
 }
 
-static void smap_list_remove(struct smap_psock *psock,
-struct sock **entry,
-struct htab_elem *hash_link)
+static void smap_list_map_remove(struct smap_psock *psock,
+struct sock **entry)
 {
struct smap_psock_map_entry *e, *tmp;
 
list_for_each_entry_safe(e, tmp, &psock->maps, list) {
-   if (e->entry == entry || e->hash_link == hash_link) {
+   if (e->entry == entry)
+   list_del(&e->list);
+   }
+}
+
+static void smap_list_hash_remove(struct smap_psock *psock,
+ struct htab_elem *hash_link)
+{
+   struct smap_psock_map_entry *e, *tmp;
+
+   list_for_each_entry_safe(e, tmp, &psock->maps, list) {
+   struct htab_elem *c = e->hash_link;
+
+   if (c == hash_link)
list_del(&e->list);
-   break;
-   }
}
 }
 
@@ -1647,7 +1657,7 @@ static void sock_map_free(struct bpf_map *map)
 * to be null and queued for garbage collection.
 */
if (likely(psock)) {
-   smap_list_remove(psock, &stab->sock_map[i], NULL);
+   smap_list_map_remove(psock, &stab->sock_map[i]);
smap_release_sock(psock, sock);
}
write_unlock_bh(&sock->sk_callback_lock);
@@ -1706,7 +1716,7 @@ static int sock_map_delete_elem(struct bpf_map *map, void 
*key)
 
if (psock->bpf_parse)
smap_stop_sock(psock, sock);
-   smap_list_remove(psock, &stab->sock_map[k], NULL);
+   smap_list_map_remove(psock, &stab->sock_map[k]);
smap_release_sock(psock, sock);
 out:
write_unlock_bh(&sock->sk_callback_lock);
@@ -1908,7 +1918,7 @@ static int sock_map_ctx_update_elem(struct 
bpf_sock_ops_kern *skops,
struct smap_psock *opsock = smap_psock_sk(osock);
 
write_lock_bh(&osock->sk_callback_lock);
-   smap_list_remove(opsock, &stab->sock_map[i], NULL);
+   smap_list_map_remove(opsock, &stab->sock_map[i]);
smap_release_sock(opsock, osock);
write_unlock_bh(&osock->sk_callback_lock);
}
@@ -2124,7 +2134,7 @@ static void sock_hash_free(struct bpf_map *map)
 * (psock) to be null and queued for garbage collection.
 */
if (likely(psock)) {
-   smap_list_remove(psock, NULL, l);
+   smap_list_hash_remove(psock, l);
smap_release_sock(psock, sock);
}
write_unlock_bh(&sock->sk_callback_lock);
@@ -2304,7 +2314,7 @@ static int sock_hash_ctx_update_elem(struct 
bpf_sock_ops_kern *skops,
psock = smap_psock_sk(l_old->sk);
 
hlist_del_rcu(&l_old->hash_node);
-   smap_list_remove(psock, NULL, l_old);
+   smap_list_hash_remove(psock, l_old);
smap_release_sock(psock, l_old->sk);
free_htab_elem(htab, l_old);
}
@@ -2372,7 +2382,7 @@ static int sock_hash_delete_elem(struct bpf_map *map, 
void *key)
 * to be null and queued for garbage collection.
 */
if (likely(psock)) {
-   smap_list_remove(psock, NULL, l);
+   smap_list_hash_remove(psock, l);
smap_release_sock(psock, sock);
}

[bpf PATCH v5 3/4] bpf: sockhash fix omitted bucket lock in sock_close

2018-06-30 Thread John Fastabend
First, the sk_callback_lock was being used to protect both the
sock callback hooks and the psock->maps list. This got overly
convoluted after the addition of sockhash (in sockmap it made
some sense because maps and callbacks were tightly coupled), so
let's split out a specific lock for maps and only use the callback
lock for its intended purpose. This fixes a couple of cases where
we missed using the maps lock when it was in fact needed. It also
makes the code easier to follow because we can now put the
locking closer to the actual code it is serializing.
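
As a rough sketch of the intended split (simplified; the names follow
the patch below), maps-list mutations now take the dedicated lock while
sk_callback_lock only guards the socket callbacks:

    /* maps list mutations */
    spin_lock_bh(&psock->maps_lock);
    list_add_tail(&e->list, &psock->maps);
    spin_unlock_bh(&psock->maps_lock);

    /* socket callback changes */
    write_lock_bh(&sock->sk_callback_lock);
    smap_stop_sock(psock, sock);
    write_unlock_bh(&sock->sk_callback_lock);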

Next, in sock_hash_delete_elem() the pattern was as follows,

  sock_hash_delete_elem()
 [...]
 spin_lock(bucket_lock)
 l = lookup_elem_raw()
 if (l)
hlist_del_rcu()
write_lock(sk_callback_lock)
  destroy psock ...
write_unlock(sk_callback_lock)
 spin_unlock(bucket_lock)

The ordering is necessary because we only know the {p}sock after
dereferencing the hash table, which we can't do unless we hold the
bucket lock. Once we have the bucket lock and the psock element, the
element is deleted from the hashmap to ensure any other path doing a
lookup will fail. Finally, the refcnt is decremented and, if it reaches
zero, the psock is destroyed.

In parallel with the above (or with free'ing the map), a tcp close
event may trigger tcp_close(), which at the moment omits the bucket
lock altogether (oops!). The flow looks like this,

  bpf_tcp_close()
 [...]
 write_lock(sk_callback_lock)
 for each psock->maps // list of maps this sock is part of
 hlist_del_rcu(ref_hash_node);
  destroy psock ...
 write_unlock(sk_callback_lock)

Obviously, and demonstrated by syzbot, this is broken because
we can have multiple threads deleting entries via hlist_del_rcu().

To fix this we might be tempted to wrap the hlist operation in a
bucket lock, but that would create a lock inversion problem. In
summary, to follow the locking rules the psock's maps list needs the
sk_callback_lock (after this patch, maps_lock), but we need the bucket
lock to do the hlist_del_rcu.

To resolve the lock inversion problem, pop the head of the maps list
repeatedly and remove the reference until no more are left. If a
delete happens in parallel from the BPF API that is OK as well, because
it will do a similar action: look up the element in the map/hash,
delete it from the map/hash, and dec the refcnt. We check for this case
before doing a destroy on the psock to ensure we don't have two
threads tearing down a psock. The new logic is as follows,

  bpf_tcp_close()
  e = psock_map_pop(psock->maps) // done with map lock
  bucket_lock() // lock hash list bucket
  l = lookup_elem_raw(head, hash, key, key_size);
  if (l) {
 //only get here if the element was not already removed
 hlist_del_rcu()
 ... destroy psock...
  }
  bucket_unlock()

And finally, for all the above to work, add the missing locking around
map operations as described above. Then add RCU annotations and use
rcu_dereference/rcu_assign_pointer to manage values relying on RCU so
that the object is not free'd from sock_hash_free() while it is being
referenced in bpf_tcp_close().
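
A condensed sketch of the reader/writer pattern this enables
(simplified; names as in the patch):

    /* writer: publish the element while holding the relevant lock */
    rcu_assign_pointer(e->hash_link, l_new);

    /* reader: dereference under RCU so the element cannot be free'd
     * out from under us while we use it
     */
    rcu_read_lock();
    l = rcu_dereference(e->hash_link);
    if (l) {
            /* ... use l ... */
    }
    rcu_read_unlock();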

Reported-by: syzbot+0ce137753c78f7b6a...@syzkaller.appspotmail.com
Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |  145 +-
 1 file changed, 96 insertions(+), 49 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 5adff4b..60b6e8b 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -72,6 +72,7 @@ struct bpf_htab {
u32 n_buckets;
u32 elem_size;
struct bpf_sock_progs progs;
+   struct rcu_head rcu;
 };
 
 struct htab_elem {
@@ -89,8 +90,8 @@ enum smap_psock_state {
 struct smap_psock_map_entry {
struct list_head list;
struct sock **entry;
-   struct htab_elem *hash_link;
-   struct bpf_htab *htab;
+   struct htab_elem __rcu *hash_link;
+   struct bpf_htab __rcu *htab;
 };
 
 struct smap_psock {
@@ -120,6 +121,7 @@ struct smap_psock {
struct bpf_prog *bpf_parse;
struct bpf_prog *bpf_verdict;
struct list_head maps;
+   spinlock_t maps_lock;
 
/* Back reference used when sock callback trigger sockmap operations */
struct sock *sock;
@@ -258,16 +260,54 @@ static void bpf_tcp_release(struct sock *sk)
rcu_read_unlock();
 }
 
+static struct htab_elem *lookup_elem_raw(struct hlist_head *head,
+u32 hash, void *key, u32 key_size)
+{
+   struct htab_elem *l;
+
+   hlist_for_each_entry_rcu(l, head, hash_node) {
+   if (l->hash == hash && !memcmp(&l->key, key, key_size))
+   return l;
+   }
+
+   return NULL;
+}
+
+static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash)
+{
+   return &htab->buckets[hash & (htab->n_buckets - 1)];
+}
+
+static inline struct hlist_head *se

[bpf PATCH v5 4/4] bpf: sockhash, add release routine

2018-06-30 Thread John Fastabend
Add map_release_uref pointer to hashmap ops. This was dropped when
original sockhash code was ported into bpf-next before initial
commit.

Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Acked-by: Martin KaFai Lau 
Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 60b6e8b..4fc2cb1 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -2478,6 +2478,7 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map 
*map, void *key)
.map_get_next_key = sock_hash_get_next_key,
.map_update_elem = sock_hash_update_elem,
.map_delete_elem = sock_hash_delete_elem,
+   .map_release_uref = sock_map_release,
 };
 
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,



[bpf PATCH v5 0/4] BPF fixes for sockhash

2018-06-30 Thread John Fastabend
This addresses two syzbot issues that led to identifying (thanks to
Eric and Wei) a class of bugs where we don't correctly check for
IPv4/v6 sockets and their associated state. The second issue was a
locking omission in sockhash.

The first patch addresses IPv6 socks, fixing an error where sockhash
would overwrite the prot pointer with the IPv4 prot. To fix this, build
a solution similar to the TLS ULP. We continue to allow socks in all
states, not just ESTABLISHED, in this patch set because, as Martin
points out, there should be no issue with this on the sockmap ULP
since we don't use the ctx in this code. Once multiple ULPs coexist we
may need to revisit this; however, we can do that in the *next trees.

The other issue syzbot found is that the tcp_close() handler missed
taking the hash bucket lock, which could result in corrupting the
sockhash bucket list if delete and close ran at the same time.
Also, the smap_list_remove() routine was not working correctly
at all. This was not caught in my testing because in general my
tests (to date at least; let's add some more robust selftests in
bpf-next) do things in the "expected" order: create map, add socks,
delete socks, then tear down maps. The tests we have that do the
ops out of this order were only working on single maps, not multi-
maps, so we never saw the issue. Thanks syzbot. The fix is to
restructure the tcp_close() lock handling and fix the obvious
bug in smap_list_remove().

Finally, during review I noticed the release handler was omitted
from the upstream code (patch 4) due to an incorrect merge conflict
fix when I ported the code to latest bpf-next before submitting.
This would leave references to the map around if the user never
closes the map.

v3: rework patches, dropping ESTABLISH check and adding rcu
annotation along with the smap_list_remove fix

v4: missed one more case where the maps list was being accessed
without the sk_callback_lock, spotted by Martin as well.

v5: changed to use a specific lock for maps and reduced the callback
lock so that it is only used to guard sk callbacks. I think
this makes the logic a bit cleaner and avoids confusion
over what each lock is doing.

Also, big thanks to Martin for the thorough review; he caught at least
one case where I missed a call_rcu().

---

John Fastabend (4):
  bpf: sockmap, fix crash when ipv6 sock is added
  bpf: sockmap, fix smap_list_map_remove when psock is in many maps
  bpf: sockhash fix omitted bucket lock in sock_close
  bpf: sockhash, add release routine


 kernel/bpf/sockmap.c |  236 +++---
 1 file changed, 166 insertions(+), 70 deletions(-)

--
Signature


[bpf PATCH v5 1/4] bpf: sockmap, fix crash when ipv6 sock is added

2018-06-30 Thread John Fastabend
This fixes a crash where we assign tcp_prot to IPv6 sockets instead
of tcpv6_prot.

Previously we overwrote the sk->prot field with tcp_prot even in the
AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
are used.

Tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
crashing case here.

Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
Reported-by: syzbot+5c063698bdbfac19f...@syzkaller.appspotmail.com
Acked-by: Martin KaFai Lau 
Signed-off-by: John Fastabend 
Signed-off-by: Wei Wang 
---
 kernel/bpf/sockmap.c |   58 +-
 1 file changed, 48 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 52a91d8..d7fd17a 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -140,6 +140,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr 
*msg, size_t len,
 static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags);
+static void bpf_tcp_close(struct sock *sk, long timeout);
 
 static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
 {
@@ -161,7 +162,42 @@ static bool bpf_tcp_stream_read(const struct sock *sk)
return !empty;
 }
 
-static struct proto tcp_bpf_proto;
+enum {
+   SOCKMAP_IPV4,
+   SOCKMAP_IPV6,
+   SOCKMAP_NUM_PROTS,
+};
+
+enum {
+   SOCKMAP_BASE,
+   SOCKMAP_TX,
+   SOCKMAP_NUM_CONFIGS,
+};
+
+static struct proto *saved_tcpv6_prot __read_mostly;
+static DEFINE_SPINLOCK(tcpv6_prot_lock);
+static struct proto bpf_tcp_prots[SOCKMAP_NUM_PROTS][SOCKMAP_NUM_CONFIGS];
+static void build_protos(struct proto prot[SOCKMAP_NUM_CONFIGS],
+struct proto *base)
+{
+   prot[SOCKMAP_BASE]  = *base;
+   prot[SOCKMAP_BASE].close= bpf_tcp_close;
+   prot[SOCKMAP_BASE].recvmsg  = bpf_tcp_recvmsg;
+   prot[SOCKMAP_BASE].stream_memory_read   = bpf_tcp_stream_read;
+
+   prot[SOCKMAP_TX]= prot[SOCKMAP_BASE];
+   prot[SOCKMAP_TX].sendmsg= bpf_tcp_sendmsg;
+   prot[SOCKMAP_TX].sendpage   = bpf_tcp_sendpage;
+}
+
+static void update_sk_prot(struct sock *sk, struct smap_psock *psock)
+{
+   int family = sk->sk_family == AF_INET6 ? SOCKMAP_IPV6 : SOCKMAP_IPV4;
+   int conf = psock->bpf_tx_msg ? SOCKMAP_TX : SOCKMAP_BASE;
+
+   sk->sk_prot = &bpf_tcp_prots[family][conf];
+}
+
 static int bpf_tcp_init(struct sock *sk)
 {
struct smap_psock *psock;
@@ -181,14 +217,17 @@ static int bpf_tcp_init(struct sock *sk)
psock->save_close = sk->sk_prot->close;
psock->sk_proto = sk->sk_prot;
 
-   if (psock->bpf_tx_msg) {
-   tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
-   tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
-   tcp_bpf_proto.recvmsg = bpf_tcp_recvmsg;
-   tcp_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
+   /* Build IPv6 sockmap whenever the address of tcpv6_prot changes */
+   if (sk->sk_family == AF_INET6 &&
+   unlikely(sk->sk_prot != smp_load_acquire(&saved_tcpv6_prot))) {
+   spin_lock_bh(&tcpv6_prot_lock);
+   if (likely(sk->sk_prot != saved_tcpv6_prot)) {
+   build_protos(bpf_tcp_prots[SOCKMAP_IPV6], sk->sk_prot);
+   smp_store_release(&saved_tcpv6_prot, sk->sk_prot);
+   }
+   spin_unlock_bh(&tcpv6_prot_lock);
}
-
-   sk->sk_prot = &tcp_bpf_proto;
+   update_sk_prot(sk, psock);
rcu_read_unlock();
return 0;
 }
@@ -,8 +1150,7 @@ static void bpf_tcp_msg_add(struct smap_psock *psock,
 
 static int bpf_tcp_ulp_register(void)
 {
-   tcp_bpf_proto = tcp_prot;
-   tcp_bpf_proto.close = bpf_tcp_close;
+   build_protos(bpf_tcp_prots[SOCKMAP_IPV4], &tcp_prot);
/* Once BPF TX ULP is registered it is never unregistered. It
 * will be in the ULP list for the lifetime of the system. Doing
 * duplicate registers is not a problem.



Re: [PATCH net] net: fib_rules: bring back rule_exists to match rule during add

2018-06-30 Thread David Miller
From: Roopa Prabhu 
Date: Fri, 29 Jun 2018 14:32:15 -0700

> From: Roopa Prabhu 
> 
> After commit f9d4b0c1e969 ("fib_rules: move common handling of newrule
> delrule msgs into fib_nl2rule"), rule_exists got replaced by rule_find
> for existing rule lookup in both the add and del paths. While this
> is good for the delete path, it solves a few problems but opens up
> a few invalid key matches in the add path.
> 
> $ip -4 rule add table main tos 10 fwmark 1
> $ip -4 rule add table main tos 10
> RTNETLINK answers: File exists
> 
> The problem here is that rule_find does not check whether the key masks
> in the new and old rule are the same, and hence it ends up matching a
> more specific rule. Rule key masks cannot be easily compared today
> without an elaborate if-else block. It's best to introduce key masks for
> easier and more accurate rule comparison in the future. Until then, due
> to fear of regressions, this patch re-introduces the older, looser
> rule_exists during add. It also fixes both rule_exists and rule_find to
> cover missing attributes.
> 
> Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs 
> into fib_nl2rule")
> Signed-off-by: Roopa Prabhu 

Applied, thanks for resolving all of these issues.


Re: [PATCH net-next 00/13] mlxsw: Add resource scale tests

2018-06-30 Thread David Miller
From: Petr Machata 
Date: Sat, 30 Jun 2018 02:44:23 +0200

> There are a number of tests that check features of the Linux networking
> stack. By running them on suitable interfaces, one can exercise the
> mlxsw offloading code. However none of these tests attempts to push
> mlxsw to the limits supported by the ASIC.
> 
> As an additional wrinkle, the "limits supported by the ASIC" themselves
> may not be a set of fixed numbers, but rather depend on a profile that
> determines how the ASIC resources are allocated for different purposes.
> 
> This patchset introduces several tests that verify capability of mlxsw
> to offload amounts of routes, flower rules, and mirroring sessions that
> match predicted ASIC capacity, at different configuration profiles.
> Additionally they verify that amounts exceeding the predicted capacity
> can *not* be offloaded.
 ...

Good stuff, series applied, thanks!


Re: [PATCH net-next 4/5] net/sched: flower: Dump the ethertype encapsulated in vlan

2018-06-30 Thread kbuild test robot
Hi Jianbo,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Jianbo-Liu/net-flow_dissector-Save-vlan-ethertype-from-headers/20180630-180158
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   net/sched/cls_flower.c:544:15: sparse: cast to restricted __be32
   net/sched/cls_flower.c:544:15: sparse: cast to restricted __be32
   net/sched/cls_flower.c:544:15: sparse: cast to restricted __be32
   net/sched/cls_flower.c:544:15: sparse: cast to restricted __be32
   net/sched/cls_flower.c:544:15: sparse: cast to restricted __be32
   net/sched/cls_flower.c:544:15: sparse: cast to restricted __be32
   net/sched/cls_flower.c:545:16: sparse: cast to restricted __be32
   net/sched/cls_flower.c:545:16: sparse: cast to restricted __be32
   net/sched/cls_flower.c:545:16: sparse: cast to restricted __be32
   net/sched/cls_flower.c:545:16: sparse: cast to restricted __be32
   net/sched/cls_flower.c:545:16: sparse: cast to restricted __be32
   net/sched/cls_flower.c:545:16: sparse: cast to restricted __be32
   include/linux/slab.h:631:13: sparse: undefined identifier 
'__builtin_mul_overflow'
   include/linux/slab.h:631:13: sparse: not a function 
>> net/sched/cls_flower.c:1317:70: sparse: incorrect type in argument 3 
>> (different base types) @@expected unsigned short [unsigned] [usertype] 
>> value @@got  short [unsigned] [usertype] value @@
   net/sched/cls_flower.c:1317:70:expected unsigned short [unsigned] 
[usertype] value
   net/sched/cls_flower.c:1317:70:got restricted __be16 [usertype] n_proto
   include/linux/slab.h:631:13: sparse: call with no type!
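
The warning is about passing key->basic.n_proto, a __be16, to
nla_put_u16(), which is annotated for a host-order u16. One possible
way to keep sparse happy (a sketch only, not necessarily what the
author intends) is the be16 variant of the helper:

    if (mask->vlan.vlan_tpid &&
        nla_put_be16(skb, TCA_FLOWER_KEY_VLAN_ETH_TYPE,
                     key->basic.n_proto))
            goto nla_put_failure;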

vim +1317 net/sched/cls_flower.c

  1264  
  1265  static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh,
  1266 struct sk_buff *skb, struct tcmsg *t)
  1267  {
  1268  struct cls_fl_filter *f = fh;
  1269  struct nlattr *nest;
  1270  struct fl_flow_key *key, *mask;
  1271  
  1272  if (!f)
  1273  return skb->len;
  1274  
  1275  t->tcm_handle = f->handle;
  1276  
  1277  nest = nla_nest_start(skb, TCA_OPTIONS);
  1278  if (!nest)
  1279  goto nla_put_failure;
  1280  
  1281  if (f->res.classid &&
  1282  nla_put_u32(skb, TCA_FLOWER_CLASSID, f->res.classid))
  1283  goto nla_put_failure;
  1284  
  1285  key = &f->key;
  1286  mask = &f->mask->key;
  1287  
  1288  if (mask->indev_ifindex) {
  1289  struct net_device *dev;
  1290  
  1291  dev = __dev_get_by_index(net, key->indev_ifindex);
  1292  if (dev && nla_put_string(skb, TCA_FLOWER_INDEV, 
dev->name))
  1293  goto nla_put_failure;
  1294  }
  1295  
  1296  if (!tc_skip_hw(f->flags))
  1297  fl_hw_update_stats(tp, f);
  1298  
  1299  if (fl_dump_key_val(skb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST,
  1300  mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK,
  1301  sizeof(key->eth.dst)) ||
  1302  fl_dump_key_val(skb, key->eth.src, TCA_FLOWER_KEY_ETH_SRC,
  1303  mask->eth.src, TCA_FLOWER_KEY_ETH_SRC_MASK,
  1304  sizeof(key->eth.src)) ||
  1305  fl_dump_key_val(skb, &key->basic.n_proto, 
TCA_FLOWER_KEY_ETH_TYPE,
  1306  &mask->basic.n_proto, TCA_FLOWER_UNSPEC,
  1307  sizeof(key->basic.n_proto)))
  1308  goto nla_put_failure;
  1309  
  1310  if (fl_dump_key_mpls(skb, &key->mpls, &mask->mpls))
  1311  goto nla_put_failure;
  1312  
  1313  if (fl_dump_key_vlan(skb, &key->vlan, &mask->vlan))
  1314  goto nla_put_failure;
  1315  
  1316  if (mask->vlan.vlan_tpid &&
> 1317  nla_put_u16(skb, TCA_FLOWER_KEY_VLAN_ETH_TYPE, 
> key->basic.n_proto))
  1318  goto nla_put_failure;
  1319  
  1320  if ((key->basic.n_proto == htons(ETH_P_IP) ||
  1321   key->basic.n_proto == htons(ETH_P_IPV6)) &&
  1322  (fl_dump_key_val(skb, &key->basic.ip_proto, 
TCA_FLOWER_KEY_IP_PROTO,
  1323  &mask->basic.ip_proto, TCA_FLOWER_UNSPEC,
  1324  sizeof(key->basic.ip_proto)) ||
  1325  fl_dump_key_ip(skb, &key->ip, &mask->ip)))
  1326  goto nla_put_failure;
  1327  
  1328  if (key->cont

Re: [PATCH net-next 0/9] nfp: flower updates and netconsole

2018-06-30 Thread David Miller
From: Jakub Kicinski 
Date: Fri, 29 Jun 2018 17:04:33 -0700

> This set contains assorted updates to driver base and flower.
> First patch is a follow up to a fix to calculating counters which
> went into net.  For ethtool counters we should also make sure
> they are visible even after ring reconfiguration.  Next patch
> is a safety measure in case we are running on a machine with a
> broken BIOS: we should fail the probe when the risk of incorrect
> operation is too high.  The next two patches add netpoll support
> and make use of napi_consume_skb().  Last we populate bus info
> on all representors.
> 
> Pieter adds support for offload of the checksum action in flower.
> 
> John follows up on another fix he did in net: we used to set TTL
> values on tunnels to the stack's default, and now John does a full
> route lookup to get more precise information; he populates the
> ToS field as well.  Last but not least, he follows up on Jiri's
> request to enable LAG offload in case the team driver is used
> and the hash function is unknown.

Series applied, thanks.


Re: [Patch net] net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN

2018-06-30 Thread David Miller
From: Cong Wang 
Date: Fri, 29 Jun 2018 13:42:48 -0700

> As noticed by Eric, we need to switch to the helper
> dev_change_tx_queue_len() for SIOCSIFTXQLEN call path too,
> otheriwse still miss dev_qdisc_change_tx_queue_len().
> 
> Fixes: 6a643ddb5624 ("net: introduce helper dev_change_tx_queue_len()")
> Reported-by: Eric Dumazet 
> Signed-off-by: Cong Wang 

Applied.


Re: [PATCH net-next] strparser: Call skb_unclone conditionally

2018-06-30 Thread David Miller
From: Vakul Garg 
Date: Sat, 30 Jun 2018 00:45:55 +0530

> Calling skb_unclone() is expensive as it triggers a memcpy operation.
> Instead of calling skb_unclone() unconditionally, call it only when skb
> has a shared frag_list. This improves tls rx throughput significantly.
> 
> Signed-off-by: Vakul Garg 
> Suggested-by: Boris Pismenny 

Applied.


Re: [PATCH net 0/5] s390/qeth: fixes 2018-06-29

2018-06-30 Thread David Miller
From: Julian Wiedmann 
Date: Fri, 29 Jun 2018 19:45:49 +0200

> please apply a few qeth fixes for -net and your 4.17 stable queue.
> 
> Patches 1-3 fix several issues wrt MAC address management that were
> introduced during the 4.17 cycle.
> Patch 4 tackles a long-standing issue with busy multi-connection workloads
> on devices in af_iucv mode.
> Patch 5 makes sure to re-enable all active HW offloads, after a card was
> previously set offline and thus lost its HW context.

Series applied and queued up for -stable.


Re: [PATCH net] alx: take rtnl before calling __alx_open from resume

2018-06-30 Thread David Miller
From: Sabrina Dubroca 
Date: Fri, 29 Jun 2018 17:51:26 +0200

> The __alx_open function can be called from ndo_open, which is called
> under RTNL, or from alx_resume, which isn't. Since commit d768319cd427,
> we're calling the netif_set_real_num_{tx,rx}_queues functions, which
> need to be called under RTNL.
> 
> This is similar to commit 0c2cc02e571a ("igb: Move the calls to set the
> Tx and Rx queues into igb_open").
> 
> Fixes: d768319cd427 ("alx: enable multiple tx queues")
> Signed-off-by: Sabrina Dubroca 

Applied and queued up for -stable.


Re: [PATCH net] sfc: correctly initialise filter rwsem for farch

2018-06-30 Thread David Miller
From: Bert Kenward 
Date: Fri, 29 Jun 2018 16:29:28 +0100

> Fixes: fc7a6c287ff3 ("sfc: use a semaphore to lock farch filters too")
> Suggested-by: Joseph Korty 
> Signed-off-by: Bert Kenward 

Applied and queued up for -stable.


Re: [PATCH net-next v2 1/1] tc-testing: initial version of tunnel_key unit tests

2018-06-30 Thread David Miller
From: Keara Leibovitz 
Date: Fri, 29 Jun 2018 10:47:31 -0400

> Create unittests for the tc tunnel_key action.
> 
> v2:
> For the tests expecting failures, added non-zero exit codes in the
> teardowns. This prevents those tests from failing if the act_tunnel_key
> module is unloaded.
> 
> Signed-off-by: Keara Leibovitz 

Applied.


Re: [PATCH] r8169: remove TBI 1000BaseX support

2018-06-30 Thread David Miller
From: Heiner Kallweit 
Date: Fri, 29 Jun 2018 08:07:04 +0200

> The very first version of RTL8169 from 2002 (and only this one) has
> support for a TBI 1000BaseX fiber interface. The TBI support in the
> driver makes switching to phylib tricky, so best would be to get
> rid of it. I found no report from anybody using a device with RTL8169
> and fiber interface, also the vendor driver doesn't support this mode
> (any longer).
> So remove TBI support and bail out with a message if a card with
> activated TBI is detected. If there really should be any user of it
> out there, we could add a stripped-down version of the driver
> supporting chip version 01 and TBI only (and maybe move it to
> staging).
> 
> Signed-off-by: Heiner Kallweit 

Ok, I guess this is reasonable.

Applied, but it will be a mini-disaster if someone is actually making
use of this fiber support for whatever reason.


Re: [PATCH net] net/ipv6: Fix updates to prefix route

2018-06-30 Thread David Miller
From: dsah...@kernel.org
Date: Thu, 28 Jun 2018 13:36:55 -0700

> From: David Ahern 
> 
> Sowmini reported that a recent commit broke prefix routes for linklocal
> addresses. The newly added modify_prefix_route is attempting to add a
> new prefix route when the ifp priority does not match the route metric
> however the check needs to account for the default priority. In addition,
> the route add fails because the route already exists, and then the delete
> removes the one that exists. Flip the order to do the delete first.
> 
> Fixes: 8308f3ff1753 ("net/ipv6: Add support for specifying metric of 
> connected routes")
> Reported-by: Sowmini Varadhan 
> Tested-by: Sowmini Varadhan 
> Signed-off-by: David Ahern 

Applied, thanks David.


Re: [PATCH net-next] net: phy: realtek: add support for RTL8211

2018-06-30 Thread David Miller
From: Heiner Kallweit 
Date: Thu, 28 Jun 2018 20:46:45 +0200

> In preparation of adding phylib support to the r8169 driver we need
> PHY drivers for all chip-internal PHY types. Fortunately almost all
> of them are either supported by the Realtek PHY driver already or work
> with the genphy driver.
> Still missing is support for the PHY of RTL8169s, it requires a quirk
> to properly support 100Mbit-fixed mode. The quirk was copied from
> r8169 driver which copied it from the vendor driver.
> Based on the PHYID the internal PHY seems to be a RTL8211.
> 
> Signed-off-by: Heiner Kallweit 

Applied.


Re: [PATCH net-next 00/10] pnetid and SMC-D support

2018-06-30 Thread David Miller
From: Ursula Braun 
Date: Thu, 28 Jun 2018 19:05:03 +0200

> SMC requires a configured pnet table to map Ethernet interfaces to
> RoCE adapter ports. For s390 there exists hardware support to group
> such devices. The first three patches cover the s390 pnetid support,
> enabling SMC-R usage on s390 without configuring an extra pnet table.
> 
> SMC currently requires RoCE adapters, and uses RDMA-techniques
> implemented with IB-verbs. But s390 offers another method for
> intra-CEC Shared Memory communication. The following seven patches
> implement a solution to run SMC traffic based on intra-CEC DMA,
> called SMC-D.

Series applied.


Re: [PATCH net-next] r8169: use standard debug output functions

2018-06-30 Thread David Miller
From: Heiner Kallweit 
Date: Thu, 28 Jun 2018 20:36:15 +0200

> I see no need to define a private debug output symbol, let's use the
> standard debug output functions instead. In this context also remove
> the deprecated PFX define.
> 
> The one assertion is wrong IMO anyway, this code path is used also
> by chip version 01.
> 
> Signed-off-by: Heiner Kallweit 

Applied.


Re: [PATCH net-next 0/4] Fixes for running mirror-to-gretap tests on veth

2018-06-30 Thread David Miller
From: Petr Machata 
Date: Thu, 28 Jun 2018 18:56:15 +0200

> The forwarding selftests infrastructure makes it possible to run the
> individual tests on a purely software netdevices. Names of interfaces to
> run the test with can be passed as command line arguments to a test.
> lib.sh then creates veth pairs backing the interfaces if none exist in
> the system.
> 
> However, the tests need to recognize that they might be run on a soft
> device. Many mirror-to-gretap tests are buggy in this regard. This patch
> set aims to fix the problems in running mirror-to-gretap tests on veth
> devices.
> 
> In patch #1, a service function is split out of setup_wait().
> In patch #2, installing a trap is made optional.
> In patch #3, tc filters in several tests are tweaked to work with veth.
> In patch #4, the logic for waiting for neighbor is fixed for veth.

Series applied, thanks.


Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw

2018-06-30 Thread Jiri Pirko
Sat, Jun 30, 2018 at 12:18:02AM CEST, xiyou.wangc...@gmail.com wrote:
>On Fri, Jun 29, 2018 at 10:06 AM Samudrala, Sridhar
> wrote:
>>
>> So instead of introducing 'chaintemplate' object in the kernel, can't we add 
>> 'chain'
>> object in the kernel that takes the 'template' as an attribute?
>
>This is exactly what I mean above. Making the chain a standalone object
>in kernel would benefit:
>
>1. Align with 'tc chain' in iproute2, add/del an object is natural
>
>2. Template is an attribute of this object when creating it:
># tc chain add template 
># tc chain add ... # non-template chain

Okay. So that would allow creating either a chain or a "chain with
template". Once that is done, there would be no means to manipulate the
template; one could only remove the chain.

What about refcounting? I think it would make sense for this implicit
chain addition to take one reference. That means that if the last
filter is later removed, the chain would stay there until the user
removes it by hand.

Okay. Sounds good to me. Will do.

Thanks!


>
>3. Easier for sharing by qdiscs:
># tc chain add X block Y ...
># tc filter add ... chain X block Y ...
># tc qdisc add dev eth0 block Y ...
>
>The current 'ingress_block 22 ingress' syntax is ugly.


[PATCH iproute2/net-next] tc: flower: Add support for QinQ

2018-06-30 Thread Jianbo Liu
To support matching on both outer and inner vlan headers, we add new
cvlan_id/cvlan_prio/cvlan_ethtype options for the inner vlan header.

Example:
# tc filter add dev eth0 protocol 802.1ad parent : \
flower vlan_id 1000 vlan_ethtype 802.1q \
cvlan_id 100 cvlan_ethtype ipv4 \
action vlan pop \
action vlan pop \
action mirred egress redirect dev eth1

# tc filter show dev eth0 ingress
filter protocol 802.1ad pref 1 flower chain 0
filter protocol 802.1ad pref 1 flower chain 0 handle 0x1
  vlan_id 1000
  vlan_ethtype 802.1Q
  cvlan_id 100
  cvlan_ethtype ip
  eth_type ipv4
  in_hw

Signed-off-by: Jianbo Liu 
Acked-by: Jiri Pirko 
---
 include/uapi/linux/pkt_cls.h |   4 ++
 man/man8/tc-flower.8 |  23 ++
 tc/f_flower.c| 103 ++-
 3 files changed, 118 insertions(+), 12 deletions(-)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index be05e66..e7f7c52 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -468,6 +468,10 @@ enum {
TCA_FLOWER_KEY_IP_TTL,  /* u8 */
TCA_FLOWER_KEY_IP_TTL_MASK, /* u8 */
 
+   TCA_FLOWER_KEY_CVLAN_ID,/* be16 */
+   TCA_FLOWER_KEY_CVLAN_PRIO,  /* u8   */
+   TCA_FLOWER_KEY_CVLAN_ETH_TYPE,  /* be16 */
+
__TCA_FLOWER_MAX,
 };
 
diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
index a561443..c684652 100644
--- a/man/man8/tc-flower.8
+++ b/man/man8/tc-flower.8
@@ -32,6 +32,12 @@ flower \- flow based traffic control filter
 .IR PRIORITY " | "
 .BR vlan_ethtype " { " ipv4 " | " ipv6 " | "
 .IR ETH_TYPE " } | "
+.B cvlan_id
+.IR VID " | "
+.B cvlan_prio
+.IR PRIORITY " | "
+.BR cvlan_ethtype " { " ipv4 " | " ipv6 " | "
+.IR ETH_TYPE " } | "
 .B mpls_label
 .IR LABEL " | "
 .B mpls_tc
@@ -132,6 +138,23 @@ Match on layer three protocol.
 .I VLAN_ETH_TYPE
 may be either
 .BR ipv4 ", " ipv6
+or an unsigned 16bit value in hexadecimal format. To match on QinQ packet, it 
must be 802.1Q or 802.1AD.
+.TP
+.BI cvlan_id " VID"
+Match on QinQ inner vlan tag id.
+.I VID
+is an unsigned 12bit value in decimal format.
+.TP
+.BI cvlan_prio " PRIORITY"
+Match on QinQ inner vlan tag priority.
+.I PRIORITY
+is an unsigned 3bit value in decimal format.
+.TP
+.BI cvlan_ethtype " VLAN_ETH_TYPE"
+Match on QinQ layer three protocol.
+.I VLAN_ETH_TYPE
+may be either
+.BR ipv4 ", " ipv6
 or an unsigned 16bit value in hexadecimal format.
 .TP
 .BI mpls_label " LABEL"
diff --git a/tc/f_flower.c b/tc/f_flower.c
index ba8eb66..71f2bc1 100644
--- a/tc/f_flower.c
+++ b/tc/f_flower.c
@@ -50,6 +50,9 @@ static void explain(void)
"   vlan_id VID |\n"
"   vlan_prio PRIORITY |\n"
"   vlan_ethtype [ ipv4 | ipv6 | ETH-TYPE ] 
|\n"
+   "   cvlan_id VID |\n"
+   "   cvlan_prio PRIORITY |\n"
+   "   cvlan_ethtype [ ipv4 | ipv6 | ETH-TYPE 
] |\n"
"   dst_mac MASKED-LLADDR |\n"
"   src_mac MASKED-LLADDR |\n"
"   ip_proto [tcp | udp | sctp | icmp | 
icmpv6 | IP-PROTO ] |\n"
@@ -128,15 +131,21 @@ err:
return err;
 }
 
+static bool eth_type_vlan(__be16 ethertype)
+{
+   return ethertype == htons(ETH_P_8021Q) ||
+  ethertype == htons(ETH_P_8021AD);
+}
+
 static int flower_parse_vlan_eth_type(char *str, __be16 eth_type, int type,
  __be16 *p_vlan_eth_type,
  struct nlmsghdr *n)
 {
__be16 vlan_eth_type;
 
-   if (eth_type != htons(ETH_P_8021Q)) {
-   fprintf(stderr,
-   "Can't set \"vlan_ethtype\" if ethertype isn't 
802.1Q\n");
+   if (!eth_type_vlan(eth_type)) {
+   fprintf(stderr, "Can't set \"%s\" if ethertype isn't 802.1Q or 
802.1AD\n",
+   type == TCA_FLOWER_KEY_VLAN_ETH_TYPE ? "vlan_ethtype" : 
"cvlan_ethtype");
return -1;
}
 
@@ -586,6 +595,7 @@ static int flower_parse_opt(struct filter_util *qu, char 
*handle,
struct rtattr *tail;
__be16 eth_type = TC_H_MIN(t->tcm_info);
__be16 vlan_ethtype = 0;
+   __be16 cvlan_ethtype = 0;
__u8 ip_proto = 0xff;
__u32 flags = 0;
__u32 mtf = 0;
@@ -661,9 +671,8 @@ static int flower_parse_opt(struct filter_util *qu, char 
*handle,
__u16 vid;
 
NEXT_ARG();
-   if (eth_type != htons(ETH_P_8021Q)) {
-   fprintf(stderr,
-   "Can't set \"vlan_id\" if ethertype 
isn't 802.1Q\n");
+   if (!eth_type_vlan(eth_type)) {
+   fprintf(stderr, "Can't set \"vlan_id\" if 
ethertype is

[PATCH net-next 4/5] net/sched: flower: Dump the ethertype encapsulated in vlan

2018-06-30 Thread Jianbo Liu
Currently the encapsulated ethertype is not dumped, as it's the same as
the TCA_FLOWER_KEY_ETH_TYPE key value. But since that makes the dump
inconsistent with the input, we add dumping it as
TCA_FLOWER_KEY_VLAN_ETH_TYPE.

Signed-off-by: Jianbo Liu 
Acked-by: Jiri Pirko 
---
 net/sched/cls_flower.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index e6774af..30f3e18 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1255,6 +1255,10 @@ static int fl_dump(struct net *net, struct tcf_proto 
*tp, void *fh,
if (fl_dump_key_vlan(skb, &key->vlan, &mask->vlan))
goto nla_put_failure;
 
+   if (mask->vlan.vlan_tpid &&
+   nla_put_u16(skb, TCA_FLOWER_KEY_VLAN_ETH_TYPE, key->basic.n_proto))
+   goto nla_put_failure;
+
if ((key->basic.n_proto == htons(ETH_P_IP) ||
 key->basic.n_proto == htons(ETH_P_IPV6)) &&
(fl_dump_key_val(skb, &key->basic.ip_proto, TCA_FLOWER_KEY_IP_PROTO,
-- 
2.9.5



[PATCH net-next 2/5] net/sched: flower: Add support for matching on vlan ethertype

2018-06-30 Thread Jianbo Liu
As the flow dissector now stores the vlan ethertype, tc flower can match
on it. This is in preparation for supporting QinQ.

Signed-off-by: Jianbo Liu 
Acked-by: Jiri Pirko 
---
 net/sched/cls_flower.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 4e74508..e6774af 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -483,6 +483,7 @@ static int fl_set_key_mpls(struct nlattr **tb,
 }
 
 static void fl_set_key_vlan(struct nlattr **tb,
+   __be16 ethertype,
struct flow_dissector_key_vlan *key_val,
struct flow_dissector_key_vlan *key_mask)
 {
@@ -499,6 +500,8 @@ static void fl_set_key_vlan(struct nlattr **tb,
VLAN_PRIORITY_MASK;
key_mask->vlan_priority = VLAN_PRIORITY_MASK;
}
+   key_val->vlan_tpid = ethertype;
+   key_mask->vlan_tpid = cpu_to_be16(~0);
 }
 
 static void fl_set_key_flag(u32 flower_key, u32 flower_mask,
@@ -575,8 +578,8 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
if (tb[TCA_FLOWER_KEY_ETH_TYPE]) {
ethertype = nla_get_be16(tb[TCA_FLOWER_KEY_ETH_TYPE]);
 
-   if (ethertype == htons(ETH_P_8021Q)) {
-   fl_set_key_vlan(tb, &key->vlan, &mask->vlan);
+   if (eth_type_vlan(ethertype)) {
+   fl_set_key_vlan(tb, ethertype, &key->vlan, &mask->vlan);
fl_set_key_val(tb, &key->basic.n_proto,
   TCA_FLOWER_KEY_VLAN_ETH_TYPE,
   &mask->basic.n_proto, TCA_FLOWER_UNSPEC,
-- 
2.9.5



[PATCH net-next 5/5] net/sched: flower: Add support for matching on QinQ vlan headers

2018-06-30 Thread Jianbo Liu
As support dissecting of QinQ inner and outer vlan headers, user can
add rules to match on QinQ vlan headers.

Signed-off-by: Jianbo Liu 
Acked-by: Jiri Pirko 
---
 include/uapi/linux/pkt_cls.h |  4 +++
 net/sched/cls_flower.c   | 65 ++--
 2 files changed, 55 insertions(+), 14 deletions(-)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 84e4c1d..c4262d9 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -469,6 +469,10 @@ enum {
TCA_FLOWER_KEY_IP_TTL,  /* u8 */
TCA_FLOWER_KEY_IP_TTL_MASK, /* u8 */
 
+   TCA_FLOWER_KEY_CVLAN_ID,/* be16 */
+   TCA_FLOWER_KEY_CVLAN_PRIO,  /* u8   */
+   TCA_FLOWER_KEY_CVLAN_ETH_TYPE,  /* be16 */
+
__TCA_FLOWER_MAX,
 };
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 30f3e18..e3ecffd 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -35,6 +35,7 @@ struct fl_flow_key {
struct flow_dissector_key_basic basic;
struct flow_dissector_key_eth_addrs eth;
struct flow_dissector_key_vlan vlan;
+   struct flow_dissector_key_vlan cvlan;
union {
struct flow_dissector_key_ipv4_addrs ipv4;
struct flow_dissector_key_ipv6_addrs ipv6;
@@ -432,6 +433,9 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 
1] = {
[TCA_FLOWER_KEY_IP_TOS_MASK]= { .type = NLA_U8 },
[TCA_FLOWER_KEY_IP_TTL] = { .type = NLA_U8 },
[TCA_FLOWER_KEY_IP_TTL_MASK]= { .type = NLA_U8 },
+   [TCA_FLOWER_KEY_CVLAN_ID]   = { .type = NLA_U16 },
+   [TCA_FLOWER_KEY_CVLAN_PRIO] = { .type = NLA_U8 },
+   [TCA_FLOWER_KEY_CVLAN_ETH_TYPE] = { .type = NLA_U16 },
 };
 
 static void fl_set_key_val(struct nlattr **tb,
@@ -484,19 +488,20 @@ static int fl_set_key_mpls(struct nlattr **tb,
 
 static void fl_set_key_vlan(struct nlattr **tb,
__be16 ethertype,
+   int vlan_id_key, int vlan_prio_key,
struct flow_dissector_key_vlan *key_val,
struct flow_dissector_key_vlan *key_mask)
 {
 #define VLAN_PRIORITY_MASK 0x7
 
-   if (tb[TCA_FLOWER_KEY_VLAN_ID]) {
+   if (tb[vlan_id_key]) {
key_val->vlan_id =
-   nla_get_u16(tb[TCA_FLOWER_KEY_VLAN_ID]) & VLAN_VID_MASK;
+   nla_get_u16(tb[vlan_id_key]) & VLAN_VID_MASK;
key_mask->vlan_id = VLAN_VID_MASK;
}
-   if (tb[TCA_FLOWER_KEY_VLAN_PRIO]) {
+   if (tb[vlan_prio_key]) {
key_val->vlan_priority =
-   nla_get_u8(tb[TCA_FLOWER_KEY_VLAN_PRIO]) &
+   nla_get_u8(tb[vlan_prio_key]) &
VLAN_PRIORITY_MASK;
key_mask->vlan_priority = VLAN_PRIORITY_MASK;
}
@@ -579,11 +584,25 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
ethertype = nla_get_be16(tb[TCA_FLOWER_KEY_ETH_TYPE]);
 
if (eth_type_vlan(ethertype)) {
-   fl_set_key_vlan(tb, ethertype, &key->vlan, &mask->vlan);
-   fl_set_key_val(tb, &key->basic.n_proto,
-  TCA_FLOWER_KEY_VLAN_ETH_TYPE,
-  &mask->basic.n_proto, TCA_FLOWER_UNSPEC,
-  sizeof(key->basic.n_proto));
+   fl_set_key_vlan(tb, ethertype, TCA_FLOWER_KEY_VLAN_ID,
+   TCA_FLOWER_KEY_VLAN_PRIO, &key->vlan,
+   &mask->vlan);
+
+   ethertype = 
nla_get_be16(tb[TCA_FLOWER_KEY_VLAN_ETH_TYPE]);
+   if (eth_type_vlan(ethertype)) {
+   fl_set_key_vlan(tb, ethertype,
+   TCA_FLOWER_KEY_CVLAN_ID,
+   TCA_FLOWER_KEY_CVLAN_PRIO,
+   &key->cvlan, &mask->cvlan);
+   fl_set_key_val(tb, &key->basic.n_proto,
+  TCA_FLOWER_KEY_CVLAN_ETH_TYPE,
+  &mask->basic.n_proto,
+  TCA_FLOWER_UNSPEC,
+  sizeof(key->basic.n_proto));
+   } else {
+   key->basic.n_proto = ethertype;
+   mask->basic.n_proto = cpu_to_be16(~0);
+   }
} else {
key->basic.n_proto = ethertype;
mask->basic.n_proto = cpu_to_be16(~0);
@@ -809,6 +828,8 @@ static void fl_init_dissector(struct fl_flow_mask *mask)
FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 FLO

[PATCH net-next 3/5] net/flow_dissector: Add support for QinQ dissection

2018-06-30 Thread Jianbo Liu
Dissect QinQ packets to get both outer and inner vlan information,
then store it in the extended flow keys.

Signed-off-by: Jianbo Liu 
Acked-by: Jiri Pirko 
---
 include/net/flow_dissector.h |  2 ++
 net/core/flow_dissector.c| 32 +---
 2 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 8f89968..c644067 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -206,6 +206,7 @@ enum flow_dissector_key_id {
FLOW_DISSECTOR_KEY_MPLS, /* struct flow_dissector_key_mpls */
FLOW_DISSECTOR_KEY_TCP, /* struct flow_dissector_key_tcp */
FLOW_DISSECTOR_KEY_IP, /* struct flow_dissector_key_ip */
+   FLOW_DISSECTOR_KEY_CVLAN, /* struct flow_dissector_key_flow_vlan */
 
FLOW_DISSECTOR_KEY_MAX,
 };
@@ -237,6 +238,7 @@ struct flow_keys {
struct flow_dissector_key_basic basic;
struct flow_dissector_key_tags tags;
struct flow_dissector_key_vlan vlan;
+   struct flow_dissector_key_vlan cvlan;
struct flow_dissector_key_keyid keyid;
struct flow_dissector_key_ports ports;
struct flow_dissector_key_addrs addrs;
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index e068ccf..09360ae 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -589,7 +589,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
struct flow_dissector_key_tags *key_tags;
struct flow_dissector_key_vlan *key_vlan;
enum flow_dissect_ret fdret;
-   bool skip_vlan = false;
+   enum flow_dissector_key_id dissector_vlan = FLOW_DISSECTOR_KEY_MAX;
int num_hdrs = 0;
u8 ip_proto = 0;
bool ret;
@@ -748,15 +748,14 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
}
case htons(ETH_P_8021AD):
case htons(ETH_P_8021Q): {
-   const struct vlan_hdr *vlan;
+   const struct vlan_hdr *vlan = NULL;
struct vlan_hdr _vlan;
-   bool vlan_tag_present = skb && skb_vlan_tag_present(skb);
__be16 saved_vlan_tpid = proto;
 
-   if (vlan_tag_present)
+   if (dissector_vlan == FLOW_DISSECTOR_KEY_MAX &&
+   skb && skb_vlan_tag_present(skb)) {
proto = skb->protocol;
-
-   if (!vlan_tag_present || eth_type_vlan(skb->protocol)) {
+   } else {
vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan),
data, hlen, &_vlan);
if (!vlan) {
@@ -766,20 +765,23 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 
proto = vlan->h_vlan_encapsulated_proto;
nhoff += sizeof(*vlan);
-   if (skip_vlan) {
-   fdret = FLOW_DISSECT_RET_PROTO_AGAIN;
-   break;
-   }
}
 
-   skip_vlan = true;
-   if (dissector_uses_key(flow_dissector,
-  FLOW_DISSECTOR_KEY_VLAN)) {
+   if (dissector_vlan == FLOW_DISSECTOR_KEY_MAX) {
+   dissector_vlan = FLOW_DISSECTOR_KEY_VLAN;
+   } else if (dissector_vlan == FLOW_DISSECTOR_KEY_VLAN) {
+   dissector_vlan = FLOW_DISSECTOR_KEY_CVLAN;
+   } else {
+   fdret = FLOW_DISSECT_RET_PROTO_AGAIN;
+   break;
+   }
+
+   if (dissector_uses_key(flow_dissector, dissector_vlan)) {
key_vlan = skb_flow_dissector_target(flow_dissector,
-
FLOW_DISSECTOR_KEY_VLAN,
+dissector_vlan,
 target_container);
 
-   if (vlan_tag_present) {
+   if (!vlan) {
key_vlan->vlan_id = skb_vlan_tag_get_id(skb);
key_vlan->vlan_priority =
(skb_vlan_tag_get_prio(skb) >> 
VLAN_PRIO_SHIFT);
-- 
2.9.5



[PATCH net-next 0/5] Introduce matching on double vlan/QinQ headers for TC flower

2018-06-30 Thread Jianbo Liu
Currently TC flower supports only one vlan tag; it doesn't match on both outer
and inner vlan headers for QinQ. To support this, we teach the flow dissector to
extract both outer and inner vlan headers, and then TC flower matches on that
information.

We also plan to extend the TC command to support this feature. We add new
cvlan_id/cvlan_prio/cvlan_ethtype keywords for the inner vlan header. The
existing vlan_id/vlan_prio/vlan_ethtype keywords are for the outer vlan header,
and vlan_ethtype must be 802.1q or 802.1ad.

The examples for command and output are as the following.
# tc filter add dev ens1f1 parent : protocol 802.1ad pref 33 \
flower vlan_id 1000 vlan_ethtype 802.1q \
cvlan_id 100 cvlan_ethtype ipv4 \
action vlan pop \
action vlan pop \
action mirred egress redirect dev ens1f1_0

# tc filter show dev ens1f1 ingress
filter protocol 802.1ad pref 33 flower chain 0
filter protocol 802.1ad pref 33 flower chain 0 handle 0x1
  vlan_id 1000
  vlan_ethtype 802.1Q
  cvlan_id 100
  cvlan_ethtype ip
  eth_type ipv4
  in_hw
...

Jianbo Liu (5):
  net/flow_dissector: Save vlan ethertype from headers
  net/sched: flower: Add support for matching on vlan ethertype
  net/flow_dissector: Add support for QinQ dissection
  net/sched: flower: Dump the ethertype encapsulated in vlan
  net/sched: flower: Add support for matching on QinQ vlan headers

 include/net/flow_dissector.h |  4 ++-
 include/uapi/linux/pkt_cls.h |  4 +++
 net/core/flow_dissector.c| 34 +++--
 net/sched/cls_flower.c   | 70 
 4 files changed, 83 insertions(+), 29 deletions(-)

-- 
2.9.5



[PATCH net-next 1/5] net/flow_dissector: Save vlan ethertype from headers

2018-06-30 Thread Jianbo Liu
Change the vlan dissector key to save the vlan tpid, to support both the
802.1Q and 802.1AD ethertypes.

Signed-off-by: Jianbo Liu 
Acked-by: Jiri Pirko 
---
 include/net/flow_dissector.h | 2 +-
 net/core/flow_dissector.c| 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index adc24df5..8f89968 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -47,7 +47,7 @@ struct flow_dissector_key_tags {
 struct flow_dissector_key_vlan {
u16 vlan_id:12,
vlan_priority:3;
-   u16 padding;
+   __be16  vlan_tpid;
 };
 
 struct flow_dissector_key_mpls {
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 4fc1e84..e068ccf 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -751,6 +751,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
const struct vlan_hdr *vlan;
struct vlan_hdr _vlan;
bool vlan_tag_present = skb && skb_vlan_tag_present(skb);
+   __be16 saved_vlan_tpid = proto;
 
if (vlan_tag_present)
proto = skb->protocol;
@@ -789,6 +790,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
(ntohs(vlan->h_vlan_TCI) &
 VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT;
}
+   key_vlan->vlan_tpid = saved_vlan_tpid;
}
 
fdret = FLOW_DISSECT_RET_PROTO_AGAIN;
-- 
2.9.5



Re: [PATCH net 1/1] bnx2x: Fix receiving tx-timeout in error or recovery state.

2018-06-30 Thread David Miller
From: Sudarsana Reddy Kalluru 
Date: Thu, 28 Jun 2018 04:52:15 -0700

> The driver performs an internal reload when it receives a tx-timeout event
> from the OS. The internal reload might fail in some scenarios, e.g., fatal
> HW issues. In such cases the OS still sees the link, which would result in
> undesirable behavior such as re-generation of tx-timeouts.
> The patch addresses this issue by indicating the link-down to OS when
> tx-timeout is detected, and keeping the link in down state till the
> internal reload is successful.
> 
> Please consider applying it to 'net' branch.
> 
> Signed-off-by: Sudarsana Reddy Kalluru 
> Signed-off-by: Ariel Elior 

Applied.


Re: [PATCH net] cnic: tidy up a size calculation

2018-06-30 Thread David Miller
From: Dan Carpenter 
Date: Thu, 28 Jun 2018 12:31:25 +0300

> Static checkers complain that id_tbl->table points to longs and 4 bytes
> is smaller than sizeof(long).  But since the other side is dividing by
> 32 instead of sizeof(long), the current code works fine.
> 
> Anyway, it's more conventional to use the BITS_TO_LONGS() macro when
> we're allocating a bitmap.
> 
> Signed-off-by: Dan Carpenter 

Applied.


Re: [PATCH net] atm: iphase: fix a 64 bit bug

2018-06-30 Thread David Miller
From: Dan Carpenter 
Date: Thu, 28 Jun 2018 12:24:42 +0300

> The code assumes that there are 4 bytes in a pointer and so it doesn't
> allocate enough memory.
> 
> Signed-off-by: Dan Carpenter 

Applied.


Re: [PATCH net] tcp: fix Fast Open key endianness

2018-06-30 Thread David Miller
From: Yuchung Cheng 
Date: Wed, 27 Jun 2018 16:04:48 -0700

> Fast Open key could be stored in different endian based on the CPU.
> Previously hosts in different endianness in a server farm using
> the same key config (sysctl value) would produce different cookies.
> This patch fixes it by always storing it as little endian to keep
> same API for LE hosts.
> 
> Reported-by: Daniele Iamartino 
> Signed-off-by: Yuchung Cheng 
> Signed-off-by: Eric Dumazet 
> Signed-off-by: Neal Cardwell 

Applied and queued up for -stable.


Re: [PATCH net-next] net: stmmac: Add support for CBS QDISC

2018-06-30 Thread David Miller
From: Jose Abreu 
Date: Wed, 27 Jun 2018 13:43:20 +0100

> This adds support for CBS reconfiguration using the TC application.
> 
> A new callback was added to TC ops struct and another one to DMA ops to
> reconfigure the channel mode.
> 
> Tested in GMAC5.10.
> 
> Signed-off-by: Jose Abreu 

Applied.


[net-next PATCH v6 7/7] Documentation: Add explanation for XPS using Rx-queue(s) map

2018-06-30 Thread Amritha Nambiar
Signed-off-by: Amritha Nambiar 
---
 Documentation/ABI/testing/sysfs-class-net-queues |   11 
 Documentation/networking/scaling.txt |   61 ++
 2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-class-net-queues 
b/Documentation/ABI/testing/sysfs-class-net-queues
index 0c0df91..978b763 100644
--- a/Documentation/ABI/testing/sysfs-class-net-queues
+++ b/Documentation/ABI/testing/sysfs-class-net-queues
@@ -42,6 +42,17 @@ Description:
network device transmit queue. Possible vaules depend on the
number of available CPU(s) in the system.
 
+What:  /sys/class//queues/tx-/xps_rxqs
+Date:  June 2018
+KernelVersion: 4.18.0
+Contact:   netdev@vger.kernel.org
+Description:
+   Mask of the receive queue(s) currently enabled to participate
+   into the Transmit Packet Steering packet processing flow for 
this
+   network device transmit queue. Possible values depend on the
+   number of available receive queue(s) in the network device.
+   Default is disabled.
+
 What:  /sys/class//queues/tx-/byte_queue_limits/hold_time
 Date:  November 2011
 KernelVersion: 3.3
diff --git a/Documentation/networking/scaling.txt 
b/Documentation/networking/scaling.txt
index f55639d..b7056a8 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -366,8 +366,13 @@ XPS: Transmit Packet Steering
 
 Transmit Packet Steering is a mechanism for intelligently selecting
 which transmit queue to use when transmitting a packet on a multi-queue
-device. To accomplish this, a mapping from CPU to hardware queue(s) is
-recorded. The goal of this mapping is usually to assign queues
+device. This can be accomplished by recording two kinds of maps, either
+a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
+to hardware transmit queue(s).
+
+1. XPS using CPUs map
+
+The goal of this mapping is usually to assign queues
 exclusively to a subset of CPUs, where the transmit completions for
 these queues are processed on a CPU within this set. This choice
 provides two benefits. First, contention on the device queue lock is
@@ -377,15 +382,40 @@ transmit queue). Secondly, cache miss rate on transmit 
completion is
 reduced, in particular for data cache lines that hold the sk_buff
 structures.
 
-XPS is configured per transmit queue by setting a bitmap of CPUs that
-may use that queue to transmit. The reverse mapping, from CPUs to
-transmit queues, is computed and maintained for each network device.
-When transmitting the first packet in a flow, the function
-get_xps_queue() is called to select a queue. This function uses the ID
-of the running CPU as a key into the CPU-to-queue lookup table. If the
+2. XPS using receive queues map
+
+This mapping is used to pick transmit queue based on the receive
+queue(s) map configuration set by the administrator. A set of receive
+queues can be mapped to a set of transmit queues (many:many), although
+the common use case is a 1:1 mapping. This will enable sending packets
+on the same queue associations for transmit and receive. This is useful for
+busy polling multi-threaded workloads where there are challenges in
+associating a given CPU to a given application thread. The application
+threads are not pinned to CPUs and each thread handles packets
+received on a single queue. The receive queue number is cached in the
+socket for the connection. In this model, sending the packets on the same
+transmit queue corresponding to the associated receive queue has benefits
+in keeping the CPU overhead low. Transmit completion work is locked into
+the same queue-association that a given application is polling on. This
+avoids the overhead of triggering an interrupt on another CPU. When the
+application cleans up the packets during the busy poll, transmit completion
+may be processed along with it in the same thread context and so result in
+reduced latency.
+
+XPS is configured per transmit queue by setting a bitmap of
+CPUs/receive-queues that may use that queue to transmit. The reverse
+mapping, from CPUs to transmit queues or from receive-queues to transmit
+queues, is computed and maintained for each network device. When
+transmitting the first packet in a flow, the function get_xps_queue() is
+called to select a queue. This function uses the ID of the receive queue
+for the socket connection for a match in the receive queue-to-transmit queue
+lookup table. Alternatively, this function can also use the ID of the
+running CPU as a key into the CPU-to-queue lookup table. If the
 ID matches a single queue, that is used for transmission. If multiple
 queues match, one is selected by using the flow hash to compute an index
-into the set.
+into the set. When selecting the transmit queue based on receive queue(s)
+map, the transmit device is not validated against the rece

[net-next PATCH v6 5/7] net: Enable Tx queue selection based on Rx queues

2018-06-30 Thread Amritha Nambiar
This patch adds support to pick Tx queue based on the Rx queue(s) map
configuration set by the admin through the sysfs attribute
for each Tx queue. If the user configuration for receive queue(s) map
does not apply, then the Tx queue selection falls back to CPU(s) map
based selection and finally to hashing.
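
In outline, the selection order described above is (condensed from the
patch below; the RCU dereferences and range checks are elided):

    int queue_index = -1;
    int tci = sk_rx_queue_get(skb->sk);

    /* 1) rx-queue -> tx-queue map, keyed by the socket's rx queue */
    if (rxqs_map && tci >= 0)
            queue_index = __get_xps_queue_idx(dev, skb, rxqs_map, tci);

    /* 2) fall back to the CPU map, keyed by the sending CPU */
    if (queue_index < 0 && cpus_map)
            queue_index = __get_xps_queue_idx(dev, skb, cpus_map,
                                              skb->sender_cpu - 1);

    /* 3) a negative result makes the caller fall back to flow hashing */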

Signed-off-by: Amritha Nambiar 
---
 include/net/sock.h |   10 
 net/core/dev.c |   62 ++--
 2 files changed, 55 insertions(+), 17 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 2b097cc..2ed99bf 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1730,6 +1730,16 @@ static inline void sk_rx_queue_clear(struct sock *sk)
 #endif
 }
 
+#ifdef CONFIG_XPS
+static inline int sk_rx_queue_get(const struct sock *sk)
+{
+   if (sk && sk->sk_rx_queue_mapping != NO_QUEUE_MAPPING)
+   return sk->sk_rx_queue_mapping;
+
+   return -1;
+}
+#endif
+
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
sk_tx_queue_clear(sk);
diff --git a/net/core/dev.c b/net/core/dev.c
index 43b5575..08d58e0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3459,35 +3459,63 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 }
 #endif /* CONFIG_NET_EGRESS */
 
-static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+#ifdef CONFIG_XPS
+static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
+  struct xps_dev_maps *dev_maps, unsigned int tci)
+{
+   struct xps_map *map;
+   int queue_index = -1;
+
+   if (dev->num_tc) {
+   tci *= dev->num_tc;
+   tci += netdev_get_prio_tc_map(dev, skb->priority);
+   }
+
+   map = rcu_dereference(dev_maps->attr_map[tci]);
+   if (map) {
+   if (map->len == 1)
+   queue_index = map->queues[0];
+   else
+   queue_index = map->queues[reciprocal_scale(
+   skb_get_hash(skb), map->len)];
+   if (unlikely(queue_index >= dev->real_num_tx_queues))
+   queue_index = -1;
+   }
+   return queue_index;
+}
+#endif
+
+static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS
struct xps_dev_maps *dev_maps;
-   struct xps_map *map;
+   struct sock *sk = skb->sk;
int queue_index = -1;
 
if (!static_key_false(&xps_needed))
return -1;
 
rcu_read_lock();
-   dev_maps = rcu_dereference(dev->xps_cpus_map);
+   if (!static_key_false(&xps_rxqs_needed))
+   goto get_cpus_map;
+
+   dev_maps = rcu_dereference(dev->xps_rxqs_map);
if (dev_maps) {
-   unsigned int tci = skb->sender_cpu - 1;
+   int tci = sk_rx_queue_get(sk);
 
-   if (dev->num_tc) {
-   tci *= dev->num_tc;
-   tci += netdev_get_prio_tc_map(dev, skb->priority);
-   }
+   if (tci >= 0 && tci < dev->num_rx_queues)
+   queue_index = __get_xps_queue_idx(dev, skb, dev_maps,
+ tci);
+   }
 
-   map = rcu_dereference(dev_maps->attr_map[tci]);
-   if (map) {
-   if (map->len == 1)
-   queue_index = map->queues[0];
-   else
-   queue_index = map->queues[reciprocal_scale(skb_get_hash(skb),
-  map->len)];
-   if (unlikely(queue_index >= dev->real_num_tx_queues))
-   queue_index = -1;
+get_cpus_map:
+   if (queue_index < 0) {
+   dev_maps = rcu_dereference(dev->xps_cpus_map);
+   if (dev_maps) {
+   unsigned int tci = skb->sender_cpu - 1;
+
+   queue_index = __get_xps_queue_idx(dev, skb, dev_maps,
+ tci);
}
}
rcu_read_unlock();



[net-next PATCH v6 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue

2018-06-30 Thread Amritha Nambiar
Extend transmit queue sysfs attribute to configure Rx queue(s) map
per Tx queue. By default no receive queues are configured for the
Tx queue.

- /sys/class/net/eth0/queues/tx-*/xps_rxqs
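
A minimal C sketch of driving this attribute, assuming an eth0 interface
and tx queue 3; the file takes the same hex bitmap format as xps_cpus and
writing it needs CAP_NET_ADMIN.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* write a hex rx-queue bitmask to one tx queue's xps_rxqs attribute */
static int set_xps_rxqs(const char *ifname, int txq, const char *hexmask)
{
        char path[128];
        ssize_t n;
        int fd;

        snprintf(path, sizeof(path),
                 "/sys/class/net/%s/queues/tx-%d/xps_rxqs", ifname, txq);
        fd = open(path, O_WRONLY);
        if (fd < 0) {
                perror(path);
                return -1;
        }
        n = write(fd, hexmask, strlen(hexmask));
        close(fd);
        return n < 0 ? -1 : 0;
}

int main(void)
{
        /* example 1:1 mapping: tx-3 serves traffic received on rx queue 3 */
        return set_xps_rxqs("eth0", 3, "8") ? 1 : 0;
}

From a shell, echo 8 > /sys/class/net/eth0/queues/tx-3/xps_rxqs does the
same thing.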

Signed-off-by: Amritha Nambiar 
---
 net/core/net-sysfs.c |   83 ++
 1 file changed, 83 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index b39987c..f25ac5f 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1283,6 +1283,88 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
 
 static struct netdev_queue_attribute xps_cpus_attribute __ro_after_init
= __ATTR_RW(xps_cpus);
+
+static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
+{
+   struct net_device *dev = queue->dev;
+   struct xps_dev_maps *dev_maps;
+   unsigned long *mask, index;
+   int j, len, num_tc = 1, tc = 0;
+
+   index = get_netdev_queue_index(queue);
+
+   if (dev->num_tc) {
+   num_tc = dev->num_tc;
+   tc = netdev_txq_to_tc(dev, index);
+   if (tc < 0)
+   return -EINVAL;
+   }
+   mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+  GFP_KERNEL);
+   if (!mask)
+   return -ENOMEM;
+
+   rcu_read_lock();
+   dev_maps = rcu_dereference(dev->xps_rxqs_map);
+   if (!dev_maps)
+   goto out_no_maps;
+
+   for (j = -1; j = netif_attrmask_next(j, NULL, dev->num_rx_queues),
+j < dev->num_rx_queues;) {
+   int i, tci = j * num_tc + tc;
+   struct xps_map *map;
+
+   map = rcu_dereference(dev_maps->attr_map[tci]);
+   if (!map)
+   continue;
+
+   for (i = map->len; i--;) {
+   if (map->queues[i] == index) {
+   set_bit(j, mask);
+   break;
+   }
+   }
+   }
+out_no_maps:
+   rcu_read_unlock();
+
+   len = bitmap_print_to_pagebuf(false, buf, mask, dev->num_rx_queues);
+   kfree(mask);
+
+   return len < PAGE_SIZE ? len : -EINVAL;
+}
+
+static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
+ size_t len)
+{
+   struct net_device *dev = queue->dev;
+   struct net *net = dev_net(dev);
+   unsigned long *mask, index;
+   int err;
+
+   if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+   return -EPERM;
+
+   mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+  GFP_KERNEL);
+   if (!mask)
+   return -ENOMEM;
+
+   index = get_netdev_queue_index(queue);
+
+   err = bitmap_parse(buf, len, mask, dev->num_rx_queues);
+   if (err) {
+   kfree(mask);
+   return err;
+   }
+
+   err = __netif_set_xps_queue(dev, mask, index, true);
+   kfree(mask);
+   return err ? : len;
+}
+
+static struct netdev_queue_attribute xps_rxqs_attribute __ro_after_init
+   = __ATTR_RW(xps_rxqs);
 #endif /* CONFIG_XPS */
 
 static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
@@ -1290,6 +1372,7 @@ static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
&queue_traffic_class.attr,
 #ifdef CONFIG_XPS
&xps_cpus_attribute.attr,
+   &xps_rxqs_attribute.attr,
&queue_tx_maxrate.attr,
 #endif
NULL



[net-next PATCH v6 0/7] Symmetric queue selection using XPS for Rx queues

2018-06-30 Thread Amritha Nambiar
This patch series implements support for Tx queue selection based on
Rx queue(s) map. This is done by configuring Rx queue(s) map per Tx-queue
using sysfs attribute. If the user configuration for Rx queues does
not apply, then the Tx queue selection falls back to XPS using CPUs and
finally to hashing.

XPS is refactored to support Tx queue selection based on either the
CPUs map or the Rx-queues map. The config option CONFIG_XPS needs to be
enabled. By default no receive queues are configured for the Tx queue.

- /sys/class/net/<dev>/queues/tx-*/xps_rxqs

A set of receive queues can be mapped to a set of transmit queues (many:many),
although the common use case is a 1:1 mapping. This will enable sending
packets on the same Tx-Rx queue association as this is useful for busy polling
multi-threaded workloads where it is not possible to pin the threads to
a CPU. This is a rework of Sridhar's patch for symmetric queueing via
socket option:
https://www.spinics.net/lists/netdev/msg453106.html

Testing Hints:
Kernel:  Linux 4.17.0-rc7+
Interface:
driver: ixgbe
version: 5.1.0-k
firmware-version: 0x00015e0b

Configuration:
ethtool -L $iface combined 16
ethtool -C $iface rx-usecs 1000
sysctl net.core.busy_poll=1000
ATR disabled:
ethtool -K $iface ntuple on

Workload:
Modified memcached that changes the thread selection policy to be based
on the incoming rx-queue of a connection using SO_INCOMING_NAPI_ID socket
option. The default is round-robin.

Default: No rxqs_map configured
Symmetric queues: Enable rxqs_map for all queues 1:1 mapped to Tx queue

System:
Architecture:  x86_64
CPU(s):72
Model name:Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

16 threads  400K requests/sec
=====================================================
                        Default         Symmetric queues
-----------------------------------------------------
RTT min/avg/max         4/51/2215       2/30/5163
(usec)

intr/sec                26655           18606

contextswitch/sec       5145            4044

insn per cycle          0.43            0.72

cache-misses            6.919           4.310
(% of all cache refs)

L1-dcache-load-misses   4.49            3.29
(% of all L1-dcache hits)

LLC-load-misses         13.26           8.96
(% of all LL-cache hits)

-----------------------------------------------------

32 threads  400K requests/sec
=====================================================
                        Default         Symmetric queues
-----------------------------------------------------
RTT min/avg/max         10/112/5562     9/46/4637
(usec)

intr/sec                30456           27666

contextswitch/sec       7552            5133

insn per cycle          0.41            0.49

cache-misses            9.357           2.769
(% of all cache refs)

L1-dcache-load-misses   4.09            3.98
(% of all L1-dcache hits)

LLC-load-misses         12.96           3.96
(% of all LL-cache hits)

-----------------------------------------------------

16 threads  800K requests/sec
=====================================================
                        Default         Symmetric queues
-----------------------------------------------------
RTT min/avg/max         5/151/4989      9/69/2611
(usec)

intr/sec                35686           22907

contextswitch/sec       25522           12281

insn per cycle          0.67            0.74

cache-misses            8.652           6.38
(% of all cache refs)

L1-dcache-load-misses   3.19            2.86
(% of all L1-dcache hits)

LLC-load-misses         16.53           11.99
(% of all LL-cache hits)

-----------------------------------------------------

32 threads  800K requests/sec
=====================================================
                        Default         Symmetric queues
-----------------------------------------------------
RTT min/avg/max         6/163/6152      8/88/4209
(usec)

intr/sec                47079           26548

contextswitch/sec       42190           39168

insn per cycle          0.45            0.54

cache-misses            8.798

[net-next PATCH v6 1/7] net: Refactor XPS for CPUs and Rx queues

2018-06-30 Thread Amritha Nambiar
Refactor XPS code to support Tx queue selection based on
CPU(s) map or Rx queue(s) map.

Signed-off-by: Amritha Nambiar 
---
 include/linux/cpumask.h   |   11 ++
 include/linux/netdevice.h |   98 -
 net/core/dev.c|  211 ++---
 net/core/net-sysfs.c  |4 -
 4 files changed, 244 insertions(+), 80 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index bf53d89..57f20a0 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -115,12 +115,17 @@ extern struct cpumask __cpu_active_mask;
 #define cpu_active(cpu)((cpu) == 0)
 #endif
 
-/* verify cpu argument to cpumask_* operators */
-static inline unsigned int cpumask_check(unsigned int cpu)
+static inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits)
 {
 #ifdef CONFIG_DEBUG_PER_CPU_MAPS
-   WARN_ON_ONCE(cpu >= nr_cpumask_bits);
+   WARN_ON_ONCE(cpu >= bits);
 #endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+}
+
+/* verify cpu argument to cpumask_* operators */
+static inline unsigned int cpumask_check(unsigned int cpu)
+{
+   cpu_max_bits_warn(cpu, nr_cpumask_bits);
return cpu;
 }
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c6b377a..8bf8d61 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -731,10 +731,15 @@ struct xps_map {
  */
 struct xps_dev_maps {
struct rcu_head rcu;
-   struct xps_map __rcu *cpu_map[0];
+   struct xps_map __rcu *attr_map[0]; /* Either CPUs map or RXQs map */
 };
-#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
+
+#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
(nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
+
+#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
+   (_rxqs * (_tcs) * sizeof(struct xps_map *)))
+
 #endif /* CONFIG_XPS */
 
 #define TC_MAX_QUEUE   16
@@ -1910,7 +1915,8 @@ struct net_device {
int watchdog_timeo;
 
 #ifdef CONFIG_XPS
-   struct xps_dev_maps __rcu *xps_maps;
+   struct xps_dev_maps __rcu *xps_cpus_map;
+   struct xps_dev_maps __rcu *xps_rxqs_map;
 #endif
 #ifdef CONFIG_NET_CLS_ACT
struct mini_Qdisc __rcu *miniq_egress;
@@ -3259,6 +3265,92 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
 #ifdef CONFIG_XPS
 int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
u16 index);
+int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
+ u16 index, bool is_rxqs_map);
+
+/**
+ * netif_attr_test_mask - Test a CPU or Rx queue set in a mask
+ * @j: CPU/Rx queue index
+ * @mask: bitmask of all cpus/rx queues
+ * @nr_bits: number of bits in the bitmask
+ *
+ * Test if a CPU or Rx queue index is set in a mask of all CPU/Rx queues.
+ */
+static inline bool netif_attr_test_mask(unsigned long j,
+   const unsigned long *mask,
+   unsigned int nr_bits)
+{
+   cpu_max_bits_warn(j, nr_bits);
+   return test_bit(j, mask);
+}
+
+/**
+ * netif_attr_test_online - Test for online CPU/Rx queue
+ * @j: CPU/Rx queue index
+ * @online_mask: bitmask for CPUs/Rx queues that are online
+ * @nr_bits: number of bits in the bitmask
+ *
+ * Returns true if a CPU/Rx queue is online.
+ */
+static inline bool netif_attr_test_online(unsigned long j,
+ const unsigned long *online_mask,
+ unsigned int nr_bits)
+{
+   cpu_max_bits_warn(j, nr_bits);
+
+   if (online_mask)
+   return test_bit(j, online_mask);
+
+   return (j < nr_bits);
+}
+
+/**
+ * netif_attrmask_next - get the next CPU/Rx queue in a cpu/Rx queues mask
+ * @n: CPU/Rx queue index
+ * @srcp: the cpumask/Rx queue mask pointer
+ * @nr_bits: number of bits in the bitmask
+ *
+ * Returns >= nr_bits if no further CPUs/Rx queues set.
+ */
+static inline unsigned int netif_attrmask_next(int n, const unsigned long *srcp,
+  unsigned int nr_bits)
+{
+   /* -1 is a legal arg here. */
+   if (n != -1)
+   cpu_max_bits_warn(n, nr_bits);
+
+   if (srcp)
+   return find_next_bit(srcp, nr_bits, n + 1);
+
+   return n + 1;
+}
+
+/**
+ * netif_attrmask_next_and - get the next CPU/Rx queue in *src1p & *src2p
+ * @n: CPU/Rx queue index
+ * @src1p: the first CPUs/Rx queues mask pointer
+ * @src2p: the second CPUs/Rx queues mask pointer
+ * @nr_bits: number of bits in the bitmask
+ *
+ * Returns >= nr_bits if no further CPUs/Rx queues set in both.
+ */
+static inline int netif_attrmask_next_and(int n, const unsigned long *src1p,
+ const unsigned long *src2p,
+
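
The for (j = -1; j = netif_attrmask_next(...), j < nr_bits;) shape used
with these helpers can look odd on first read. A self-contained userspace
model of the same walk, using a single-word bitmap and an invented
stand-in for find_next_bit():

#include <stdio.h>

/* stand-in for find_next_bit() over a single unsigned long word */
static unsigned int next_set_bit(unsigned long word, unsigned int nr_bits,
                                 unsigned int start)
{
        unsigned int i;

        for (i = start; i < nr_bits; i++)
                if (word & (1UL << i))
                        return i;
        return nr_bits;
}

/* same contract as netif_attrmask_next(): NULL mask means "every index" */
static unsigned int attrmask_next(int n, const unsigned long *srcp,
                                  unsigned int nr_bits)
{
        if (srcp)
                return next_set_bit(*srcp, nr_bits, n + 1);
        return n + 1;
}

int main(void)
{
        unsigned long rxqs = 0x2d;      /* rx queues 0, 2, 3 and 5 set */
        unsigned int nr = 8;
        int j;

        /* the loop shape used in the patch: start at -1, advance, then test */
        for (j = -1; j = attrmask_next(j, &rxqs, nr), j < (int)nr;)
                printf("visit rx queue %d\n", j);
        return 0;
}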

[net-next PATCH v6 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short

2018-06-30 Thread Amritha Nambiar
Change 'skc_tx_queue_mapping' field in sock_common structure from
'int' to 'unsigned short' type with ~0 indicating unset and
other positive queue values being set. This will accommodate adding
a new 'unsigned short' field in sock_common in the next patch for
rx_queue_mapping.
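
A tiny userspace model of the sentinel scheme this describes, with local
names rather than the kernel ones: ~0 (USHRT_MAX) means unset, so the
setter has to reject any value at or above it.

#include <limits.h>
#include <stdio.h>

#define NO_QUEUE_MAPPING USHRT_MAX      /* 16-bit "unset" sentinel */

static unsigned short tx_queue_mapping = NO_QUEUE_MAPPING;

static void tx_queue_set(int tx_queue)
{
        /* USHRT_MAX itself is reserved, and negatives are rejected too */
        if ((unsigned int)tx_queue >= USHRT_MAX)
                return;
        tx_queue_mapping = (unsigned short)tx_queue;
}

static int tx_queue_get(void)
{
        return tx_queue_mapping != NO_QUEUE_MAPPING ? tx_queue_mapping : -1;
}

int main(void)
{
        printf("unset:    %d\n", tx_queue_get());       /* -1 */
        tx_queue_set(5);
        printf("set to 5: %d\n", tx_queue_get());       /* 5 */
        tx_queue_set(USHRT_MAX);                        /* ignored: sentinel */
        printf("still:    %d\n", tx_queue_get());       /* 5 */
        return 0;
}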

Signed-off-by: Amritha Nambiar 
---
 include/net/sock.h |   14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b3b7541..37b09c8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -214,7 +214,7 @@ struct sock_common {
struct hlist_node   skc_node;
struct hlist_nulls_node skc_nulls_node;
};
-   int skc_tx_queue_mapping;
+   unsigned short  skc_tx_queue_mapping;
union {
int skc_incoming_cpu;
u32 skc_rcv_wnd;
@@ -1681,17 +1681,25 @@ static inline int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 
 static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
 {
+   /* sk_tx_queue_mapping accept only upto a 16-bit value */
+   if (WARN_ON_ONCE((unsigned short)tx_queue >= USHRT_MAX))
+   return;
sk->sk_tx_queue_mapping = tx_queue;
 }
 
+#define NO_QUEUE_MAPPING   USHRT_MAX
+
 static inline void sk_tx_queue_clear(struct sock *sk)
 {
-   sk->sk_tx_queue_mapping = -1;
+   sk->sk_tx_queue_mapping = NO_QUEUE_MAPPING;
 }
 
 static inline int sk_tx_queue_get(const struct sock *sk)
 {
-   return sk ? sk->sk_tx_queue_mapping : -1;
+   if (sk && sk->sk_tx_queue_mapping != NO_QUEUE_MAPPING)
+   return sk->sk_tx_queue_mapping;
+
+   return -1;
 }
 
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)



[net-next PATCH v6 4/7] net: Record receive queue number for a connection

2018-06-30 Thread Amritha Nambiar
This patch adds a new field to sock_common 'skc_rx_queue_mapping'
which holds the receive queue number for the connection. The Rx queue
is marked in tcp_finish_connect() to allow a client app to do
SO_INCOMING_NAPI_ID after a connect() call to get the right queue
association for a socket. Rx queue is also marked in tcp_conn_request()
to allow syn-ack to go on the right tx-queue associated with
the queue on which syn is received.
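
For context, the client-side usage would look roughly like the sketch
below. The server address and port are invented; SO_INCOMING_NAPI_ID
(56 on asm-generic architectures) returns the NAPI id the connection's
packets arrive on, which the application can then map to its own queue
or thread bookkeeping.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56
#endif

int main(void)
{
        struct sockaddr_in dst = {
                .sin_family = AF_INET,
                .sin_port = htons(11211),       /* e.g. memcached */
        };
        unsigned int napi_id = 0;
        socklen_t len = sizeof(napi_id);
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
                return 1;
        inet_pton(AF_INET, "192.0.2.10", &dst.sin_addr);        /* example server */
        if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
                perror("connect");
                return 1;
        }
        if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len) == 0)
                printf("connection serviced by NAPI id %u\n", napi_id);
        close(fd);
        return 0;
}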

Signed-off-by: Amritha Nambiar 
Signed-off-by: Sridhar Samudrala 
---
 include/net/busy_poll.h |1 +
 include/net/sock.h  |   28 
 net/core/sock.c |2 ++
 net/ipv4/tcp_input.c|3 +++
 4 files changed, 34 insertions(+)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index c518743..9e36fda6 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -151,6 +151,7 @@ static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb)
 #ifdef CONFIG_NET_RX_BUSY_POLL
sk->sk_napi_id = skb->napi_id;
 #endif
+   sk_rx_queue_set(sk, skb);
 }
 
 /* variant used for unconnected sockets */
diff --git a/include/net/sock.h b/include/net/sock.h
index 37b09c8..2b097cc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -139,6 +139,7 @@ typedef __u64 __bitwise __addrpair;
  * @skc_node: main hash linkage for various protocol lookup tables
  * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  * @skc_tx_queue_mapping: tx queue number for this connection
+ * @skc_rx_queue_mapping: rx queue number for this connection
  * @skc_flags: place holder for sk_flags
  * %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
  * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
@@ -215,6 +216,9 @@ struct sock_common {
struct hlist_nulls_node skc_nulls_node;
};
unsigned short  skc_tx_queue_mapping;
+#ifdef CONFIG_XPS
+   unsigned short  skc_rx_queue_mapping;
+#endif
union {
int skc_incoming_cpu;
u32 skc_rcv_wnd;
@@ -326,6 +330,9 @@ struct sock {
 #define sk_nulls_node  __sk_common.skc_nulls_node
 #define sk_refcnt  __sk_common.skc_refcnt
 #define sk_tx_queue_mapping__sk_common.skc_tx_queue_mapping
+#ifdef CONFIG_XPS
+#define sk_rx_queue_mapping__sk_common.skc_rx_queue_mapping
+#endif
 
 #define sk_dontcopy_begin  __sk_common.skc_dontcopy_begin
 #define sk_dontcopy_end__sk_common.skc_dontcopy_end
@@ -1702,6 +1709,27 @@ static inline int sk_tx_queue_get(const struct sock *sk)
return -1;
 }
 
+static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS
+   if (skb_rx_queue_recorded(skb)) {
+   u16 rx_queue = skb_get_rx_queue(skb);
+
+   if (WARN_ON_ONCE(rx_queue == NO_QUEUE_MAPPING))
+   return;
+
+   sk->sk_rx_queue_mapping = rx_queue;
+   }
+#endif
+}
+
+static inline void sk_rx_queue_clear(struct sock *sk)
+{
+#ifdef CONFIG_XPS
+   sk->sk_rx_queue_mapping = NO_QUEUE_MAPPING;
+#endif
+}
+
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
sk_tx_queue_clear(sk);
diff --git a/net/core/sock.c b/net/core/sock.c
index bcc4182..dac6d78 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2818,6 +2818,8 @@ void sock_init_data(struct socket *sock, struct sock *sk)
sk->sk_pacing_rate = ~0U;
sk->sk_pacing_shift = 10;
sk->sk_incoming_cpu = -1;
+
+   sk_rx_queue_clear(sk);
/*
 * Before updating sk_refcnt, we must commit prior changes to memory
 * (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9c5b341..b3b5aef 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -78,6 +78,7 @@
 #include 
 #include 
 #include 
+#include <net/busy_poll.h>
 
 int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 
@@ -5588,6 +5589,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
if (skb) {
icsk->icsk_af_ops->sk_rx_dst_set(sk, skb);
security_inet_conn_established(sk, skb);
+   sk_mark_napi_id(sk, skb);
}
 
tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
@@ -6416,6 +6418,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
tcp_rsk(req)->snt_isn = isn;
tcp_rsk(req)->txhash = net_tx_rndhash();
tcp_openreq_init_rwin(req, sk, dst);
+   sk_rx_queue_set(req_to_sk(req), skb);
if (!want_cookie) {
tcp_reqsk_record_syn(sk, req, skb);
fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc, dst);



[net-next PATCH v6 2/7] net: Use static_key for XPS maps

2018-06-30 Thread Amritha Nambiar
Use static_key for XPS maps to reduce the cost of extra map checks,
similar to how it is used for RPS and RFS. This includes static_key
'xps_needed' for XPS and another for 'xps_rxqs_needed' for XPS using
Rx queues map.

Signed-off-by: Amritha Nambiar 
---
 net/core/dev.c |   31 +--
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 7105955..43b5575 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2081,6 +2081,10 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
 EXPORT_SYMBOL(netdev_txq_to_tc);
 
 #ifdef CONFIG_XPS
+struct static_key xps_needed __read_mostly;
+EXPORT_SYMBOL(xps_needed);
+struct static_key xps_rxqs_needed __read_mostly;
+EXPORT_SYMBOL(xps_rxqs_needed);
 static DEFINE_MUTEX(xps_map_mutex);
 #define xmap_dereference(P)\
rcu_dereference_protected((P), lockdep_is_held(&xps_map_mutex))
@@ -2168,14 +2172,18 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
struct xps_dev_maps *dev_maps;
unsigned int nr_ids;
 
-   mutex_lock(&xps_map_mutex);
+   if (!static_key_false(&xps_needed))
+   return;
 
-   dev_maps = xmap_dereference(dev->xps_rxqs_map);
-   if (dev_maps) {
-   nr_ids = dev->num_rx_queues;
-   clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset,
-  count, true);
+   mutex_lock(&xps_map_mutex);
 
+   if (static_key_false(&xps_rxqs_needed)) {
+   dev_maps = xmap_dereference(dev->xps_rxqs_map);
+   if (dev_maps) {
+   nr_ids = dev->num_rx_queues;
+   clean_xps_maps(dev, possible_mask, dev_maps, nr_ids,
+  offset, count, true);
+   }
}
 
dev_maps = xmap_dereference(dev->xps_cpus_map);
@@ -2189,6 +2197,10 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
   false);
 
 out_no_maps:
+   if (static_key_enabled(&xps_rxqs_needed))
+   static_key_slow_dec(&xps_rxqs_needed);
+
+   static_key_slow_dec(&xps_needed);
mutex_unlock(&xps_map_mutex);
 }
 
@@ -2297,6 +2309,10 @@ int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
if (!new_dev_maps)
goto out_no_new_maps;
 
+   static_key_slow_inc(&xps_needed);
+   if (is_rxqs_map)
+   static_key_slow_inc(&xps_rxqs_needed);
+
for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
 j < nr_ids;) {
/* copy maps belonging to foreign traffic classes */
@@ -3450,6 +3466,9 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
struct xps_map *map;
int queue_index = -1;
 
+   if (!static_key_false(&xps_needed))
+   return -1;
+
rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_cpus_map);
if (dev_maps) {



Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr

2018-06-30 Thread David Miller
From: Ka-Cheong Poon 
Date: Wed, 27 Jun 2018 03:23:27 -0700

> This patch changes the internal representation of an IP address to use
> struct in6_addr.  IPv4 address is stored as an IPv4 mapped address.
> All the functions which take an IP address as argument are also
> changed to use struct in6_addr.  But RDS socket layer is not modified
> such that it still does not accept IPv6 address from an application.
> And RDS layer does not accept nor initiate IPv6 connections.
> 
> v2: Fixed sparse warnings.
> 
> Signed-off-by: Ka-Cheong Poon 

I really don't like this.

An ipv4 mapped ipv6 address is not the same as an ipv4 address.

You are effectively preventing the use of ipv6 connections
using ipv4 mapped addresses.

Also, assuming the sockaddr type based upon size is wrong.
You have to check the family field, then you can decide
to interpret the rest of the sockaddr in one way or another
and also validate its length.
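
The validation order being asked for, shown as a small standalone sketch
rather than RDS code: branch on the family first, and only then check
that the caller passed enough bytes for that family's sockaddr.

#include <netinet/in.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/socket.h>

static bool bind_addr_ok(const struct sockaddr *sa, socklen_t len)
{
        if (len < sizeof(sa->sa_family))
                return false;

        switch (sa->sa_family) {
        case AF_INET:
                return len >= sizeof(struct sockaddr_in);
        case AF_INET6:
                /* v4-mapped addresses still arrive as AF_INET6 sockaddrs */
                return len >= sizeof(struct sockaddr_in6);
        default:
                return false;
        }
}

int main(void)
{
        struct sockaddr_in6 sin6 = { .sin6_family = AF_INET6 };

        printf("full sockaddr_in6: %d\n",
               bind_addr_ok((struct sockaddr *)&sin6, sizeof(sin6)));
        printf("truncated length:  %d\n",
               bind_addr_ok((struct sockaddr *)&sin6,
                            sizeof(struct sockaddr_in)));
        return 0;
}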


Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags

2018-06-30 Thread Daniel Borkmann
On 06/29/2018 07:01 PM, Jakub Kicinski wrote:
> On Fri, 29 Jun 2018 09:04:15 +0200, Daniel Borkmann wrote:
>> On 06/28/2018 06:54 PM, Jakub Kicinski wrote:
>>> On Thu, 28 Jun 2018 09:42:06 +0200, Jiri Benc wrote:  
 On Wed, 27 Jun 2018 11:49:49 +0200, Daniel Borkmann wrote:  
> Looks good to me, and yes in BPF case a mask like TUNNEL_OPTIONS_PRESENT 
> is
> right approach since this is opaque info and solely defined by the BPF 
> prog
> that is using the generic helper.

 Wouldn't it make sense to introduce some safeguards here (in a backward
 compatible way, of course)? It's easy to mistakenly set data for a
 different tunnel type in a BPF program and then be surprised by the
 result. It might help users if such usage was detected by the kernel,
 one way or another.  
>>>
>>> Well, that's how it works today ;)  
>>
>> Well, it was designed like that on purpose, to be i) agnostic of the 
>> underlying
>> device, ii) to not clutter BPF API with tens of different APIs effectively 
>> doing
>> the same thing, and at the same time to avoid adding protocol specifics. 
>> E.g. at
>> least core bits of bpf_skb_{set,get}_tunnel_key() will work whether I use 
>> vxlan
>> or geneve underneath (we are actually using it this way) and I could use 
>> things
>> like tun_id to encode custom meta data from BPF for either of them depending 
>> on flavor
>> picked by orchestration system. For the tunnel options in 
>> bpf_skb_{set,get}_tunnel_opt()
>> it's similar although here there needs to be awareness of the underlying dev 
>> depending
>> on whether you encode data into e.g. gbp or tlvs, etc. However, downside 
>> right now I
>> can see with a patch like below is that:
>>
>> i) People might still just keep using 'TUNNEL_OPTIONS_PRESENT path' since 
>> available
>> and backwards compatible with current/older kernels, ii) we cut bits away 
>> from
>> size over time for each new tunnel proto added in future that would support 
>> tunnel
>> options, iii) that extension is one-sided (at least below) and same would be 
>> needed
>> in getter part, and iv) there needs to be a way for the case when folks add 
>> new
>> tunnel options where we don't need to worry that we forget updating 
>> BPF_F_TUN_*
>> each time otherwise this will easily slip through and again people will just 
>> rely
>> on using TUNNEL_OPTIONS_PRESENT catchall. Given latter and in particular 
>> point i)
>> I wouldn't think it's worth the pain, the APIs were added to BPF in v4.6 so 
>> this
>> would buy them 2 more years wrt kernel compatibility with same functionality 
>> level.
>> And point v), I just noticed the patch is actually buggy: size is 
>> ARG_CONST_SIZE and
>> verifier will attempt to check the value whether the buffer passed in 
>> argument 2 is
>> valid or not, so using flags here in upper bits would let verification fail, 
>> you'd
>> really have to make a new helper just for this.
> 
> Ah, indeed.  I'd rather avoid a new helper, if we reuse an old one
> people can always write a program like:
> 
>   err = helper(all_flags);
>   if (err == -EINVAL)
>   err = helper(fallback_flags);
> 
> With a new helper the program will not even load on old kernels :(
> 
> Could we add the flags new to bpf_skb_set_tunnel_key(), maybe?  It is a
> bit ugly because only options care about the flags and in theory
> bpf_skb_set_tunnel_key() doesn't have to be called before
> bpf_skb_set_tunnel_opt() ...

Right, though it's also ugly splitting things this way over two helpers
when this is _specifically_ only for one of them (unless you do need it for
hw offload where you need to signal for tunnel key _and_ options in BPF
itself but that's not from what I see is the case here). For sake of
'safeguard'-only I bet users won't do such fallback for reasons above but
again use currently TUNNEL_OPTIONS_PRESENT route, as is since it's simpler
and available since long time already. Thus I would prefer to leave it as
is in such case since ship has sailed long ago on this, and usage of this
extra check is not really obvious either, so I don't think it's worth it
and 'value-add' is marginal.

Thanks,
Daniel
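
For readers following along, the existing "generic helper" usage under
discussion looks roughly like the tc/BPF sketch below. It is only a
sketch: the section name and includes assume libbpf conventions, the id
and address are examples, and the option payload is a placeholder since
its layout is whatever the underlying vxlan or geneve device defines.

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("classifier")
int set_tunnel(struct __sk_buff *skb)
{
        struct bpf_tunnel_key key = {};
        __u32 opt = 0x800ff;            /* opaque, device-defined option bytes */

        key.tunnel_id = 42;
        key.remote_ipv4 = 0x0a000002;   /* 10.0.0.2, example remote endpoint */
        key.tunnel_ttl = 64;

        if (bpf_skb_set_tunnel_key(skb, &key, sizeof(key), BPF_F_ZERO_CSUM_TX))
                return TC_ACT_SHOT;

        /* the "TUNNEL_OPTIONS_PRESENT path": the kernel only records that
         * options exist; their meaning is up to the device and this program */
        if (bpf_skb_set_tunnel_opt(skb, &opt, sizeof(opt)))
                return TC_ACT_SHOT;

        return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";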


Re: [PATCH net-next 3/5] sctp: add spp_ipv6_flowlabel and spp_dscp for sctp_paddrparams

2018-06-30 Thread 吉藤英明
Hi,

2018-06-28 15:40 GMT+09:00 Xin Long :
> On Tue, Jun 26, 2018 at 8:02 PM, 吉藤英明
>  wrote:
>> 2018-06-26 13:33 GMT+09:00 Xin Long :
>>> On Tue, Jun 26, 2018 at 12:31 AM, Marcelo Ricardo Leitner
>>>  wrote:
 Hi,

 On Tue, Jun 26, 2018 at 01:12:00AM +0900, 吉藤英明 wrote:
> Hi,
>
> 2018-06-25 22:03 GMT+09:00 Marcelo Ricardo Leitner 
> :
> > On Mon, Jun 25, 2018 at 07:28:47AM -0400, Neil Horman wrote:
> >> On Mon, Jun 25, 2018 at 04:31:26PM +0900, David Miller wrote:
> >> > From: Xin Long 
> >> > Date: Mon, 25 Jun 2018 10:14:35 +0800
> >> >
> >> > >  struct sctp_paddrparams {
> >> > > @@ -773,6 +775,8 @@ struct sctp_paddrparams {
> >> > >   __u32   spp_pathmtu;
> >> > >   __u32   spp_sackdelay;
> >> > >   __u32   spp_flags;
> >> > > + __u32   spp_ipv6_flowlabel;
> >> > > + __u8    spp_dscp;
> >> > >  } __attribute__((packed, aligned(4)));
> >> >
> >> > I don't think you can change the size of this structure like this.
> >> >
> >> > This check in sctp_setsockopt_peer_addr_params():
> >> >
> >> > if (optlen != sizeof(struct sctp_paddrparams))
> >> > return -EINVAL;
> >> >
> >> > is going to trigger in old kernels when executing programs
> >> > built against the new struct definition.
> >
> > That will happen, yes, but do we really care about being future-proof
> > here? I mean: if we also update such check(s) to support dealing with
> > smaller-than-supported structs, newer kernels will be able to run
> > programs built against the old struct, and the new one; while building
> > using newer headers and running on older kernel may fool the
> > application in other ways too (like enabling support for something
> > that is available on newer kernel and that is not present in the older
> > one).
>
> We should not break existing apps.
> We still accept apps of pre-2.4 era without sin6_scope_id
> (e.g., net/ipv6/af_inet6.c:inet6_bind()).

 Yes. That's what I tried to say. That is supporting an old app built
 with old kernel headers and running on a newer kernel, and not the
 other way around (an app built with fresh headers and running on an
 old kernel).
>>> To make it, I will update the check like:
>>>
>>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>>> index 1df5d07..c949d8c 100644
>>> --- a/net/sctp/socket.c
>>> +++ b/net/sctp/socket.c
>>> @@ -2715,13 +2715,18 @@ static int
>>> sctp_setsockopt_peer_addr_params(struct sock *sk,
>>> struct sctp_sock*sp = sctp_sk(sk);
>>> int error;
>>> int hb_change, pmtud_change, sackdelay_change;
>>> +   int plen = sizeof(params);
>>> +   int old_plen = plen - sizeof(u32) * 2;
>>
>> if (optlen < offsetof(struct sctp_paddrparams, spp_ipv6_flowlabel))
>> maybe?
> Hi, yoshfuji,
> offsetof() is better. thank you.
>
>>
>>>
>>> -   if (optlen != sizeof(struct sctp_paddrparams))
>>> +   if (optlen != plen && optlen != old_plen)
>>> return -EINVAL;
>>>
>>> if (copy_from_user(&params, optval, optlen))
>>> return -EFAULT;
>>>
>>> +   if (optlen == old_plen)
>>> +   params.spp_flags &= ~(SPP_DSCP | SPP_IPV6_FLOWLABEL);
>>
>> I think we should return -EINVAL if size is not new one.
> Sorry, if we returned  -EINVAL when size is the old one,
> how can we guarantee an old app built with old kernel
> headers and running on a newer kernel works well?
> or you meant?
> if ((params.spp_flags & (SPP_DSCP | SPP_IPV6_FLOWLABEL)) &&
> optlen != plen)
> return EINVAL;

Yes, I meant this (it should be -EINVAL though).


>
>>
>> --yoshfuji
>>
>>> +
>>> /* Validate flags and value parameters. */
>>> hb_change= params.spp_flags & SPP_HB;
>>> pmtud_change = params.spp_flags & SPP_PMTUD;
>>> @@ -5591,10 +5596,13 @@ static int
>>> sctp_getsockopt_peer_addr_params(struct sock *sk, int len,
>>> struct sctp_transport   *trans = NULL;
>>> struct sctp_association *asoc = NULL;
>>> struct sctp_sock*sp = sctp_sk(sk);
>>> +   int plen = sizeof(params);
>>> +   int old_plen = plen - sizeof(u32) * 2;
>>>
>>> -   if (len < sizeof(struct sctp_paddrparams))
>>> +   if (len < old_plen)
>>> return -EINVAL;
>>> -   len = sizeof(struct sctp_paddrparams);
>>> +
>>> +   len = len >= plen ? plen : old_plen;
>>> if (copy_from_user(&params, optval, len))
>>> return -EFAULT;
>>>
>>> does it look ok to you?
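
Putting the two suggestions in this thread together, accept the
pre-extension size via offsetof() and reject the new flag bits when only
the old size was passed, the check would look roughly like the standalone
model below; the struct is a trimmed stand-in and the flag values are
placeholders, not the sctp uapi.

#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SPP_IPV6_FLOWLABEL 0x100        /* placeholder flag values */
#define SPP_DSCP           0x200

/* trimmed stand-in for the proposed sctp_paddrparams layout */
struct paddrparams {
        uint32_t spp_flags;
        uint32_t spp_ipv6_flowlabel;    /* appended by this series */
        uint8_t  spp_dscp;              /* appended by this series */
};

static int check_paddrparams_len(const struct paddrparams *p, size_t optlen)
{
        const size_t old_len = offsetof(struct paddrparams, spp_ipv6_flowlabel);

        if (optlen != sizeof(*p) && optlen != old_len)
                return -EINVAL;         /* neither the old nor the new size */

        /* an old-sized caller cannot ask for the new per-path attributes */
        if (optlen == old_len &&
            (p->spp_flags & (SPP_DSCP | SPP_IPV6_FLOWLABEL)))
                return -EINVAL;

        return 0;
}

int main(void)
{
        struct paddrparams p = { .spp_flags = SPP_DSCP };

        printf("old size + new flag -> %d\n",
               check_paddrparams_len(&p, offsetof(struct paddrparams,
                                                  spp_ipv6_flowlabel)));
        printf("new size + new flag -> %d\n",
               check_paddrparams_len(&p, sizeof(p)));
        return 0;
}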


Re: [net-next PATCH v5 4/7] net: Record receive queue number for a connection

2018-06-30 Thread Nambiar, Amritha
On 6/29/2018 6:06 AM, David Miller wrote:
> From: Amritha Nambiar 
> Date: Wed, 27 Jun 2018 15:31:34 -0700
> 
>> @@ -1702,6 +1709,13 @@ static inline int sk_tx_queue_get(const struct sock 
>> *sk)
>>  return -1;
>>  }
>>  
>> +static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff 
>> *skb)
>> +{
>> +#ifdef CONFIG_XPS
>> +sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
>> +#endif
>> +}
> 
> Maybe add a >= NO_QUEUE_MAPPING check like you did for
> skc_tx_queue_mapping.
> 
Will fix.


Re: [net-next PATCH v5 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short

2018-06-30 Thread Nambiar, Amritha
On 6/29/2018 6:05 AM, David Miller wrote:
> From: Amritha Nambiar 
> Date: Wed, 27 Jun 2018 15:31:28 -0700
> 
>> @@ -1681,17 +1681,25 @@ static inline int sk_receive_skb(struct sock *sk, 
>> struct sk_buff *skb,
>>  
>>  static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
>>  {
>> +/* sk_tx_queue_mapping accept only upto a 16-bit value */
>> +if (WARN_ON_ONCE((unsigned short)tx_queue > USHRT_MAX))
>> +return;
>>  sk->sk_tx_queue_mapping = tx_queue;
>>  }
>>  
>> +#define NO_QUEUE_MAPPINGUSHRT_MAX
> 
> I think you need to check ">= USHRT_MAX" since USHRT_MAX is how you
> indicate no queue.
> 
Agree, I'll fix this.


Re: [net-next PATCH v5 1/7] net: Refactor XPS for CPUs and Rx queues

2018-06-30 Thread Nambiar, Amritha
On 6/29/2018 5:59 AM, David Miller wrote:
> From: Amritha Nambiar 
> Date: Wed, 27 Jun 2018 15:31:18 -0700
> 
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index c6b377a..3790ac9 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>  ...
>> +static inline bool attr_test_online(unsigned long j,
>> +const unsigned long *online_mask,
>> +unsigned int nr_bits)
> 
> This is a networking header file, so for names you are putting into
> the global namespace please give them some kind of prefix that
> indicates it is about networking.
> 
Sure, will prefix these with 'netif_'.